ian@0 1 ============================
ian@0 2 LINUX KERNEL MEMORY BARRIERS
ian@0 3 ============================
ian@0 4
ian@0 5 By: David Howells <dhowells@redhat.com>
ian@0 6
ian@0 7 Contents:
ian@0 8
ian@0 9 (*) Abstract memory access model.
ian@0 10
ian@0 11 - Device operations.
ian@0 12 - Guarantees.
ian@0 13
ian@0 14 (*) What are memory barriers?
ian@0 15
ian@0 16 - Varieties of memory barrier.
ian@0 17 - What may not be assumed about memory barriers?
ian@0 18 - Data dependency barriers.
ian@0 19 - Control dependencies.
ian@0 20 - SMP barrier pairing.
ian@0 21 - Examples of memory barrier sequences.
ian@0 22 - Read memory barriers vs load speculation.
ian@0 23
ian@0 24 (*) Explicit kernel barriers.
ian@0 25
ian@0 26 - Compiler barrier.
ian@0 27 - The CPU memory barriers.
ian@0 28 - MMIO write barrier.
ian@0 29
ian@0 30 (*) Implicit kernel memory barriers.
ian@0 31
ian@0 32 - Locking functions.
ian@0 33 - Interrupt disabling functions.
ian@0 34 - Miscellaneous functions.
ian@0 35
ian@0 36 (*) Inter-CPU locking barrier effects.
ian@0 37
ian@0 38 - Locks vs memory accesses.
ian@0 39 - Locks vs I/O accesses.
ian@0 40
ian@0 41 (*) Where are memory barriers needed?
ian@0 42
ian@0 43 - Interprocessor interaction.
ian@0 44 - Atomic operations.
ian@0 45 - Accessing devices.
ian@0 46 - Interrupts.
ian@0 47
ian@0 48 (*) Kernel I/O barrier effects.
ian@0 49
ian@0 50 (*) Assumed minimum execution ordering model.
ian@0 51
ian@0 52 (*) The effects of the cpu cache.
ian@0 53
ian@0 54 - Cache coherency.
ian@0 55 - Cache coherency vs DMA.
ian@0 56 - Cache coherency vs MMIO.
ian@0 57
ian@0 58 (*) The things CPUs get up to.
ian@0 59
ian@0 60 - And then there's the Alpha.
ian@0 61
ian@0 62 (*) References.
ian@0 63
ian@0 64
ian@0 65 ============================
ian@0 66 ABSTRACT MEMORY ACCESS MODEL
ian@0 67 ============================
ian@0 68
ian@0 69 Consider the following abstract model of the system:
ian@0 70
ian@0 71 : :
ian@0 72 : :
ian@0 73 : :
ian@0 74 +-------+ : +--------+ : +-------+
ian@0 75 | | : | | : | |
ian@0 76 | | : | | : | |
ian@0 77 | CPU 1 |<----->| Memory |<----->| CPU 2 |
ian@0 78 | | : | | : | |
ian@0 79 | | : | | : | |
ian@0 80 +-------+ : +--------+ : +-------+
ian@0 81 ^ : ^ : ^
ian@0 82 | : | : |
ian@0 83 | : | : |
ian@0 84 | : v : |
ian@0 85 | : +--------+ : |
ian@0 86 | : | | : |
ian@0 87 | : | | : |
ian@0 88 +---------->| Device |<----------+
ian@0 89 : | | :
ian@0 90 : | | :
ian@0 91 : +--------+ :
ian@0 92 : :
ian@0 93
ian@0 94 Each CPU executes a program that generates memory access operations. In the
ian@0 95 abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
ian@0 96 perform the memory operations in any order it likes, provided program causality
ian@0 97 appears to be maintained. Similarly, the compiler may also arrange the
ian@0 98 instructions it emits in any order it likes, provided it doesn't affect the
ian@0 99 apparent operation of the program.
ian@0 100
ian@0 101 So in the above diagram, the effects of the memory operations performed by a
ian@0 102 CPU are perceived by the rest of the system as the operations cross the
ian@0 103 interface between the CPU and rest of the system (the dotted lines).
ian@0 104
ian@0 105
ian@0 106 For example, consider the following sequence of events:
ian@0 107
ian@0 108 CPU 1 CPU 2
ian@0 109 =============== ===============
ian@0 110 { A == 1; B == 2 }
ian@0 111 A = 3; x = A;
ian@0 112 B = 4; y = B;
ian@0 113
ian@0 114 The set of accesses as seen by the memory system in the middle can be arranged
ian@0 115 in 24 different combinations:
ian@0 116
ian@0 117 STORE A=3, STORE B=4, x=LOAD A->3, y=LOAD B->4
ian@0 118 STORE A=3, STORE B=4, y=LOAD B->4, x=LOAD A->3
ian@0 119 STORE A=3, x=LOAD A->3, STORE B=4, y=LOAD B->4
ian@0 120 STORE A=3, x=LOAD A->3, y=LOAD B->2, STORE B=4
ian@0 121 STORE A=3, y=LOAD B->2, STORE B=4, x=LOAD A->3
ian@0 122 STORE A=3, y=LOAD B->2, x=LOAD A->3, STORE B=4
ian@0 123 STORE B=4, STORE A=3, x=LOAD A->3, y=LOAD B->4
ian@0 124 STORE B=4, ...
ian@0 125 ...
ian@0 126
ian@0 127 and can thus result in four different combinations of values:
ian@0 128
ian@0 129 x == 1, y == 2
ian@0 130 x == 1, y == 4
ian@0 131 x == 3, y == 2
ian@0 132 x == 3, y == 4
ian@0 133
ian@0 134
ian@0 135 Furthermore, the stores committed by a CPU to the memory system may not be
ian@0 136 perceived by the loads made by another CPU in the same order as the stores were
ian@0 137 committed.
ian@0 138
ian@0 139
ian@0 140 As a further example, consider this sequence of events:
ian@0 141
ian@0 142 CPU 1 CPU 2
ian@0 143 =============== ===============
ian@0 144 { A == 1, B == 2, C == 3, P == &A, Q == &C }
ian@0 145 B = 4; Q = P;
ian@0 146 P = &B D = *Q;
ian@0 147
ian@0 148 There is an obvious data dependency here, as the value loaded into D depends on
ian@0 149 the address retrieved from P by CPU 2. At the end of the sequence, any of the
ian@0 150 following results are possible:
ian@0 151
ian@0 152 (Q == &A) and (D == 1)
ian@0 153 (Q == &B) and (D == 2)
ian@0 154 (Q == &B) and (D == 4)
ian@0 155
ian@0 156 Note that CPU 2 will never try and load C into D because the CPU will load P
ian@0 157 into Q before issuing the load of *Q.
ian@0 158
ian@0 159
ian@0 160 DEVICE OPERATIONS
ian@0 161 -----------------
ian@0 162
ian@0 163 Some devices present their control interfaces as collections of memory
ian@0 164 locations, but the order in which the control registers are accessed is very
ian@0 165 important. For instance, imagine an ethernet card with a set of internal
ian@0 166 registers that are accessed through an address port register (A) and a data
ian@0 167 port register (D). To read internal register 5, the following code might then
ian@0 168 be used:
ian@0 169
ian@0 170 *A = 5;
ian@0 171 x = *D;
ian@0 172
ian@0 173 but this might show up as either of the following two sequences:
ian@0 174
ian@0 175 STORE *A = 5, x = LOAD *D
ian@0 176 x = LOAD *D, STORE *A = 5
ian@0 177
ian@0 178 the second of which will almost certainly result in a malfunction, since it sets
ian@0 179 the address _after_ attempting to read the register.
ian@0 180
ian@0 181
ian@0 182 GUARANTEES
ian@0 183 ----------
ian@0 184
ian@0 185 There are some minimal guarantees that may be expected of a CPU:
ian@0 186
ian@0 187 (*) On any given CPU, dependent memory accesses will be issued in order, with
ian@0 188 respect to itself. This means that for:
ian@0 189
ian@0 190 Q = P; D = *Q;
ian@0 191
ian@0 192 the CPU will issue the following memory operations:
ian@0 193
ian@0 194 Q = LOAD P, D = LOAD *Q
ian@0 195
ian@0 196 and always in that order.
ian@0 197
ian@0 198 (*) Overlapping loads and stores within a particular CPU will appear to be
ian@0 199 ordered within that CPU. This means that for:
ian@0 200
ian@0 201 a = *X; *X = b;
ian@0 202
ian@0 203 the CPU will only issue the following sequence of memory operations:
ian@0 204
ian@0 205 a = LOAD *X, STORE *X = b
ian@0 206
ian@0 207 And for:
ian@0 208
ian@0 209 *X = c; d = *X;
ian@0 210
ian@0 211 the CPU will only issue:
ian@0 212
ian@0 213 STORE *X = c, d = LOAD *X
ian@0 214
ian@0 215 (Loads and stores overlap if they are targeted at overlapping pieces of
ian@0 216 memory).
ian@0 217
ian@0 218 And there are a number of things that _must_ or _must_not_ be assumed:
ian@0 219
ian@0 220 (*) It _must_not_ be assumed that independent loads and stores will be issued
ian@0 221 in the order given. This means that for:
ian@0 222
ian@0 223 X = *A; Y = *B; *D = Z;
ian@0 224
ian@0 225 we may get any of the following sequences:
ian@0 226
ian@0 227 X = LOAD *A, Y = LOAD *B, STORE *D = Z
ian@0 228 X = LOAD *A, STORE *D = Z, Y = LOAD *B
ian@0 229 Y = LOAD *B, X = LOAD *A, STORE *D = Z
ian@0 230 Y = LOAD *B, STORE *D = Z, X = LOAD *A
ian@0 231 STORE *D = Z, X = LOAD *A, Y = LOAD *B
ian@0 232 STORE *D = Z, Y = LOAD *B, X = LOAD *A
ian@0 233
ian@0 234 (*) It _must_ be assumed that overlapping memory accesses may be merged or
ian@0 235 discarded. This means that for:
ian@0 236
ian@0 237 X = *A; Y = *(A + 4);
ian@0 238
ian@0 239 we may get any one of the following sequences:
ian@0 240
ian@0 241 X = LOAD *A; Y = LOAD *(A + 4);
ian@0 242 Y = LOAD *(A + 4); X = LOAD *A;
ian@0 243 {X, Y} = LOAD {*A, *(A + 4) };
ian@0 244
ian@0 245 And for:
ian@0 246
ian@0 247 *A = X; Y = *A;
ian@0 248
ian@0 249 we may get either of:
ian@0 250
ian@0 251 STORE *A = X; Y = LOAD *A;
ian@0 252 STORE *A = Y = X;
ian@0 253
ian@0 254
ian@0 255 =========================
ian@0 256 WHAT ARE MEMORY BARRIERS?
ian@0 257 =========================
ian@0 258
ian@0 259 As can be seen above, independent memory operations are effectively performed
ian@0 260 in random order, but this can be a problem for CPU-CPU interaction and for I/O.
ian@0 261 What is required is some way of intervening to instruct the compiler and the
ian@0 262 CPU to restrict the order.
ian@0 263
ian@0 264 Memory barriers are such interventions. They impose a perceived partial
ian@0 265 ordering over the memory operations on either side of the barrier.
ian@0 266
ian@0 267 Such enforcement is important because the CPUs and other devices in a system
ian@0 268 can use a variety of tricks to improve performance - including reordering,
ian@0 269 deferral and combination of memory operations; speculative loads; speculative
ian@0 270 branch prediction and various types of caching. Memory barriers are used to
ian@0 271 override or suppress these tricks, allowing the code to sanely control the
ian@0 272 interaction of multiple CPUs and/or devices.
ian@0 273
ian@0 274
ian@0 275 VARIETIES OF MEMORY BARRIER
ian@0 276 ---------------------------
ian@0 277
ian@0 278 Memory barriers come in four basic varieties:
ian@0 279
ian@0 280 (1) Write (or store) memory barriers.
ian@0 281
ian@0 282 A write memory barrier gives a guarantee that all the STORE operations
ian@0 283 specified before the barrier will appear to happen before all the STORE
ian@0 284 operations specified after the barrier with respect to the other
ian@0 285 components of the system.
ian@0 286
ian@0 287 A write barrier is a partial ordering on stores only; it is not required
ian@0 288 to have any effect on loads.
ian@0 289
ian@0 290 A CPU can be viewed as committing a sequence of store operations to the
ian@0 291 memory system as time progresses. All stores before a write barrier will
ian@0 292 occur in the sequence _before_ all the stores after the write barrier.
ian@0 293
ian@0 294 [!] Note that write barriers should normally be paired with read or data
ian@0 295 dependency barriers; see the "SMP barrier pairing" subsection.
ian@0 296
ian@0 297
ian@0 298 (2) Data dependency barriers.
ian@0 299
ian@0 300 A data dependency barrier is a weaker form of read barrier. In the case
ian@0 301 where two loads are performed such that the second depends on the result
ian@0 302 of the first (eg: the first load retrieves the address to which the second
ian@0 303 load will be directed), a data dependency barrier would be required to
ian@0 304 make sure that the target of the second load is updated before the address
ian@0 305 obtained by the first load is accessed.
ian@0 306
ian@0 307 A data dependency barrier is a partial ordering on interdependent loads
ian@0 308 only; it is not required to have any effect on stores, independent loads
ian@0 309 or overlapping loads.
ian@0 310
ian@0 311 As mentioned in (1), the other CPUs in the system can be viewed as
ian@0 312 committing sequences of stores to the memory system that the CPU being
ian@0 313 considered can then perceive. A data dependency barrier issued by the CPU
ian@0 314 under consideration guarantees that for any load preceding it, if that
ian@0 315 load touches one of a sequence of stores from another CPU, then by the
ian@0 316 time the barrier completes, the effects of all the stores prior to that
ian@0 317 touched by the load will be perceptible to any loads issued after the data
ian@0 318 dependency barrier.
ian@0 319
ian@0 320 See the "Examples of memory barrier sequences" subsection for diagrams
ian@0 321 showing the ordering constraints.
ian@0 322
ian@0 323 [!] Note that the first load really has to have a _data_ dependency and
ian@0 324 not a control dependency. If the address for the second load is dependent
ian@0 325 on the first load, but the dependency is through a conditional rather than
ian@0 326 actually loading the address itself, then it's a _control_ dependency and
ian@0 327 a full read barrier or better is required. See the "Control dependencies"
ian@0 328 subsection for more information.
ian@0 329
ian@0 330 [!] Note that data dependency barriers should normally be paired with
ian@0 331 write barriers; see the "SMP barrier pairing" subsection.
ian@0 332
ian@0 333
ian@0 334 (3) Read (or load) memory barriers.
ian@0 335
ian@0 336 A read barrier is a data dependency barrier plus a guarantee that all the
ian@0 337 LOAD operations specified before the barrier will appear to happen before
ian@0 338 all the LOAD operations specified after the barrier with respect to the
ian@0 339 other components of the system.
ian@0 340
ian@0 341 A read barrier is a partial ordering on loads only; it is not required to
ian@0 342 have any effect on stores.
ian@0 343
ian@0 344 Read memory barriers imply data dependency barriers, and so can substitute
ian@0 345 for them.
ian@0 346
ian@0 347 [!] Note that read barriers should normally be paired with write barriers;
ian@0 348 see the "SMP barrier pairing" subsection.
ian@0 349
ian@0 350
ian@0 351 (4) General memory barriers.
ian@0 352
ian@0 353 A general memory barrier gives a guarantee that all the LOAD and STORE
ian@0 354 operations specified before the barrier will appear to happen before all
ian@0 355 the LOAD and STORE operations specified after the barrier with respect to
ian@0 356 the other components of the system.
ian@0 357
ian@0 358 A general memory barrier is a partial ordering over both loads and stores.
ian@0 359
ian@0 360 General memory barriers imply both read and write memory barriers, and so
ian@0 361 can substitute for either.
ian@0 362
ian@0 363
ian@0 364 And a couple of implicit varieties:
ian@0 365
ian@0 366 (5) LOCK operations.
ian@0 367
ian@0 368 This acts as a one-way permeable barrier. It guarantees that all memory
ian@0 369 operations after the LOCK operation will appear to happen after the LOCK
ian@0 370 operation with respect to the other components of the system.
ian@0 371
ian@0 372 Memory operations that occur before a LOCK operation may appear to happen
ian@0 373 after it completes.
ian@0 374
ian@0 375 A LOCK operation should almost always be paired with an UNLOCK operation.
ian@0 376
ian@0 377
ian@0 378 (6) UNLOCK operations.
ian@0 379
ian@0 380 This also acts as a one-way permeable barrier. It guarantees that all
ian@0 381 memory operations before the UNLOCK operation will appear to happen before
ian@0 382 the UNLOCK operation with respect to the other components of the system.
ian@0 383
ian@0 384 Memory operations that occur after an UNLOCK operation may appear to
ian@0 385 happen before it completes.
ian@0 386
ian@0 387 LOCK and UNLOCK operations are guaranteed to appear with respect to each
ian@0 388 other strictly in the order specified.
ian@0 389
ian@0 390 The use of LOCK and UNLOCK operations generally precludes the need for
ian@0 391 other sorts of memory barrier (but note the exceptions mentioned in the
ian@0 392 subsection "MMIO write barrier").
ian@0 393
ian@0 394
ian@0 395 Memory barriers are only required where there's a possibility of interaction
ian@0 396 between two CPUs or between a CPU and a device. If it can be guaranteed that
ian@0 397 there won't be any such interaction in any particular piece of code, then
ian@0 398 memory barriers are unnecessary in that piece of code.
ian@0 399
ian@0 400
ian@0 401 Note that these are the _minimum_ guarantees. Different architectures may give
ian@0 402 more substantial guarantees, but they may _not_ be relied upon outside of arch
ian@0 403 specific code.
ian@0 404
ian@0 405
ian@0 406 WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
ian@0 407 ----------------------------------------------
ian@0 408
ian@0 409 There are certain things that the Linux kernel memory barriers do not guarantee:
ian@0 410
ian@0 411 (*) There is no guarantee that any of the memory accesses specified before a
ian@0 412 memory barrier will be _complete_ by the completion of a memory barrier
ian@0 413 instruction; the barrier can be considered to draw a line in that CPU's
ian@0 414 access queue that accesses of the appropriate type may not cross.
ian@0 415
ian@0 416 (*) There is no guarantee that issuing a memory barrier on one CPU will have
ian@0 417 any direct effect on another CPU or any other hardware in the system. The
ian@0 418 indirect effect will be the order in which the second CPU sees the effects
ian@0 419 of the first CPU's accesses occur, but see the next point:
ian@0 420
ian@0 421 (*) There is no guarantee that a CPU will see the correct order of effects
ian@0 422 from a second CPU's accesses, even _if_ the second CPU uses a memory
ian@0 423 barrier, unless the first CPU _also_ uses a matching memory barrier (see
ian@0 424 the subsection on "SMP Barrier Pairing").
ian@0 425
ian@0 426 (*) There is no guarantee that some intervening piece of off-the-CPU
ian@0 427 hardware[*] will not reorder the memory accesses. CPU cache coherency
ian@0 428 mechanisms should propagate the indirect effects of a memory barrier
ian@0 429 between CPUs, but might not do so in order.
ian@0 430
ian@0 431 [*] For information on bus mastering DMA and coherency please read:
ian@0 432
ian@0 433 Documentation/pci.txt
ian@0 434 Documentation/DMA-mapping.txt
ian@0 435 Documentation/DMA-API.txt
ian@0 436
ian@0 437
ian@0 438 DATA DEPENDENCY BARRIERS
ian@0 439 ------------------------
ian@0 440
ian@0 441 The usage requirements of data dependency barriers are a little subtle, and
ian@0 442 it's not always obvious that they're needed. To illustrate, consider the
ian@0 443 following sequence of events:
ian@0 444
ian@0 445 CPU 1 CPU 2
ian@0 446 =============== ===============
ian@0 447 { A == 1, B == 2, C == 3, P == &A, Q == &C }
ian@0 448 B = 4;
ian@0 449 <write barrier>
ian@0 450 P = &B
ian@0 451 Q = P;
ian@0 452 D = *Q;
ian@0 453
ian@0 454 There's a clear data dependency here, and it would seem that by the end of the
ian@0 455 sequence, Q must be either &A or &B, and that:
ian@0 456
ian@0 457 (Q == &A) implies (D == 1)
ian@0 458 (Q == &B) implies (D == 4)
ian@0 459
ian@0 460 But! CPU 2's perception of P may be updated _before_ its perception of B, thus
ian@0 461 leading to the following situation:
ian@0 462
ian@0 463 (Q == &B) and (D == 2) ????
ian@0 464
ian@0 465 Whilst this may seem like a failure of coherency or causality maintenance, it
ian@0 466 isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
ian@0 467 Alpha).
ian@0 468
ian@0 469 To deal with this, a data dependency barrier or better must be inserted
ian@0 470 between the address load and the data load:
ian@0 471
ian@0 472 CPU 1 CPU 2
ian@0 473 =============== ===============
ian@0 474 { A == 1, B == 2, C == 3, P == &A, Q == &C }
ian@0 475 B = 4;
ian@0 476 <write barrier>
ian@0 477 P = &B
ian@0 478 Q = P;
ian@0 479 <data dependency barrier>
ian@0 480 D = *Q;
ian@0 481
ian@0 482 This enforces the occurrence of one of the two implications, and prevents the
ian@0 483 third possibility from arising.
ian@0 484
ian@0 485 [!] Note that this extremely counterintuitive situation arises most easily on
ian@0 486 machines with split caches, so that, for example, one cache bank processes
ian@0 487 even-numbered cache lines and the other bank processes odd-numbered cache
ian@0 488 lines. The pointer P might be stored in an odd-numbered cache line, and the
ian@0 489 variable B might be stored in an even-numbered cache line. Then, if the
ian@0 490 even-numbered bank of the reading CPU's cache is extremely busy while the
ian@0 491 odd-numbered bank is idle, one can see the new value of the pointer P (&B),
ian@0 492 but the old value of the variable B (2).
ian@0 493
ian@0 494
ian@0 495 Another example of where data dependency barriers might be required is where a
ian@0 496 number is read from memory and then used to calculate the index for an array
ian@0 497 access:
ian@0 498
ian@0 499 CPU 1 CPU 2
ian@0 500 =============== ===============
ian@0 501 { M[0] == 1, M[1] == 2, M[3] == 3, P == 0, Q == 3 }
ian@0 502 M[1] = 4;
ian@0 503 <write barrier>
ian@0 504 P = 1
ian@0 505 Q = P;
ian@0 506 <data dependency barrier>
ian@0 507 D = M[Q];
ian@0 508
ian@0 509
ian@0 510 The data dependency barrier is very important to the RCU system, for example.
ian@0 511 See rcu_dereference() in include/linux/rcupdate.h. This permits the current
ian@0 512 target of an RCU'd pointer to be replaced with a new modified target, without
ian@0 513 the replacement target appearing to be incompletely initialised.
ian@0 514
ian@0 515 See also the subsection on "Cache Coherency" for a more thorough example.
ian@0 516
ian@0 517
ian@0 518 CONTROL DEPENDENCIES
ian@0 519 --------------------
ian@0 520
ian@0 521 A control dependency requires a full read memory barrier, not simply a data
ian@0 522 dependency barrier to make it work correctly. Consider the following bit of
ian@0 523 code:
ian@0 524
ian@0 525 q = &a;
ian@0 526 if (p)
ian@0 527 q = &b;
ian@0 528 <data dependency barrier>
ian@0 529 x = *q;
ian@0 530
ian@0 531 This will not have the desired effect because there is no actual data
ian@0 532 dependency, but rather a control dependency that the CPU may short-circuit by
ian@0 533 attempting to predict the outcome in advance. In such a case what's actually
ian@0 534 required is:
ian@0 535
ian@0 536 q = &a;
ian@0 537 if (p)
ian@0 538 q = &b;
ian@0 539 <read barrier>
ian@0 540 x = *q;
ian@0 541
ian@0 542
ian@0 543 SMP BARRIER PAIRING
ian@0 544 -------------------
ian@0 545
ian@0 546 When dealing with CPU-CPU interactions, certain types of memory barrier should
ian@0 547 always be paired. A lack of appropriate pairing is almost certainly an error.
ian@0 548
ian@0 549 A write barrier should always be paired with a data dependency barrier or read
ian@0 550 barrier, though a general barrier would also be viable. Similarly a read
ian@0 551 barrier or a data dependency barrier should always be paired with at least a
ian@0 552 write barrier, though, again, a general barrier is viable:
ian@0 553
ian@0 554 CPU 1 CPU 2
ian@0 555 =============== ===============
ian@0 556 a = 1;
ian@0 557 <write barrier>
ian@0 558 b = 2; x = b;
ian@0 559 <read barrier>
ian@0 560 y = a;
ian@0 561
ian@0 562 Or:
ian@0 563
ian@0 564 CPU 1 CPU 2
ian@0 565 =============== ===============================
ian@0 566 a = 1;
ian@0 567 <write barrier>
ian@0 568 b = &a; x = b;
ian@0 569 <data dependency barrier>
ian@0 570 y = *x;
ian@0 571
ian@0 572 Basically, the read barrier always has to be there, even though it can be of
ian@0 573 the "weaker" type.
ian@0 574
ian@0 575 [!] Note that the stores before the write barrier would normally be expected to
ian@0 576 match the loads after the read barrier or data dependency barrier, and vice
ian@0 577 versa:
ian@0 578
ian@0 579 CPU 1 CPU 2
ian@0 580 =============== ===============
ian@0 581 a = 1; }---- --->{ v = c
ian@0 582 b = 2; } \ / { w = d
ian@0 583 <write barrier> \ <read barrier>
ian@0 584 c = 3; } / \ { x = a;
ian@0 585 d = 4; }---- --->{ y = b;
ian@0 586
ian@0 587
ian@0 588 EXAMPLES OF MEMORY BARRIER SEQUENCES
ian@0 589 ------------------------------------
ian@0 590
ian@0 591 Firstly, write barriers act as partial orderings on store operations.
ian@0 592 Consider the following sequence of events:
ian@0 593
ian@0 594 CPU 1
ian@0 595 =======================
ian@0 596 STORE A = 1
ian@0 597 STORE B = 2
ian@0 598 STORE C = 3
ian@0 599 <write barrier>
ian@0 600 STORE D = 4
ian@0 601 STORE E = 5
ian@0 602
ian@0 603 This sequence of events is committed to the memory coherence system in an order
ian@0 604 that the rest of the system might perceive as the unordered set of { STORE A,
ian@0 605 STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
ian@0 606 }:
ian@0 607
ian@0 608 +-------+ : :
ian@0 609 | | +------+
ian@0 610 | |------>| C=3 | } /\
ian@0 611 | | : +------+ }----- \ -----> Events perceptible
ian@0 612 | | : | A=1 | } \/ to rest of system
ian@0 613 | | : +------+ }
ian@0 614 | CPU 1 | : | B=2 | }
ian@0 615 | | +------+ }
ian@0 616 | | wwwwwwwwwwwwwwww } <--- At this point the write barrier
ian@0 617 | | +------+ } requires all stores prior to the
ian@0 618 | | : | E=5 | } barrier to be committed before
ian@0 619 | | : +------+ } further stores may take place.
ian@0 620 | |------>| D=4 | }
ian@0 621 | | +------+
ian@0 622 +-------+ : :
ian@0 623 |
ian@0 624 | Sequence in which stores are committed to the
ian@0 625 | memory system by CPU 1
ian@0 626 V
ian@0 627
ian@0 628
ian@0 629 Secondly, data dependency barriers act as partial orderings on data-dependent
ian@0 630 loads. Consider the following sequence of events:
ian@0 631
ian@0 632 CPU 1 CPU 2
ian@0 633 ======================= =======================
ian@0 634 { B = 7; X = 9; Y = 8; C = &Y }
ian@0 635 STORE A = 1
ian@0 636 STORE B = 2
ian@0 637 <write barrier>
ian@0 638 STORE C = &B LOAD X
ian@0 639 STORE D = 4 LOAD C (gets &B)
ian@0 640 LOAD *C (reads B)
ian@0 641
ian@0 642 Without intervention, CPU 2 may perceive the events on CPU 1 in some
ian@0 643 effectively random order, despite the write barrier issued by CPU 1:
ian@0 644
ian@0 645 +-------+ : : : :
ian@0 646 | | +------+ +-------+ | Sequence of update
ian@0 647 | |------>| B=2 |----- --->| Y->8 | | of perception on
ian@0 648 | | : +------+ \ +-------+ | CPU 2
ian@0 649 | CPU 1 | : | A=1 | \ --->| C->&Y | V
ian@0 650 | | +------+ | +-------+
ian@0 651 | | wwwwwwwwwwwwwwww | : :
ian@0 652 | | +------+ | : :
ian@0 653 | | : | C=&B |--- | : : +-------+
ian@0 654 | | : +------+ \ | +-------+ | |
ian@0 655 | |------>| D=4 | ----------->| C->&B |------>| |
ian@0 656 | | +------+ | +-------+ | |
ian@0 657 +-------+ : : | : : | |
ian@0 658 | : : | |
ian@0 659 | : : | CPU 2 |
ian@0 660 | +-------+ | |
ian@0 661 Apparently incorrect ---> | | B->7 |------>| |
ian@0 662 perception of B (!) | +-------+ | |
ian@0 663 | : : | |
ian@0 664 | +-------+ | |
ian@0 665 The load of X holds ---> \ | X->9 |------>| |
ian@0 666 up the maintenance \ +-------+ | |
ian@0 667 of coherence of B ----->| B->2 | +-------+
ian@0 668 +-------+
ian@0 669 : :
ian@0 670
ian@0 671
ian@0 672 In the above example, CPU 2 perceives that B is 7, despite the load of *C
ian@0 673 (which would be B) coming after the LOAD of C.
ian@0 674
ian@0 675 If, however, a data dependency barrier were to be placed between the load of C
ian@0 676 and the load of *C (ie: B) on CPU 2:
ian@0 677
ian@0 678 CPU 1 CPU 2
ian@0 679 ======================= =======================
ian@0 680 { B = 7; X = 9; Y = 8; C = &Y }
ian@0 681 STORE A = 1
ian@0 682 STORE B = 2
ian@0 683 <write barrier>
ian@0 684 STORE C = &B LOAD X
ian@0 685 STORE D = 4 LOAD C (gets &B)
ian@0 686 <data dependency barrier>
ian@0 687 LOAD *C (reads B)
ian@0 688
ian@0 689 then the following will occur:
ian@0 690
ian@0 691 +-------+ : : : :
ian@0 692 | | +------+ +-------+
ian@0 693 | |------>| B=2 |----- --->| Y->8 |
ian@0 694 | | : +------+ \ +-------+
ian@0 695 | CPU 1 | : | A=1 | \ --->| C->&Y |
ian@0 696 | | +------+ | +-------+
ian@0 697 | | wwwwwwwwwwwwwwww | : :
ian@0 698 | | +------+ | : :
ian@0 699 | | : | C=&B |--- | : : +-------+
ian@0 700 | | : +------+ \ | +-------+ | |
ian@0 701 | |------>| D=4 | ----------->| C->&B |------>| |
ian@0 702 | | +------+ | +-------+ | |
ian@0 703 +-------+ : : | : : | |
ian@0 704 | : : | |
ian@0 705 | : : | CPU 2 |
ian@0 706 | +-------+ | |
ian@0 707 | | X->9 |------>| |
ian@0 708 | +-------+ | |
ian@0 709 Makes sure all effects ---> \ ddddddddddddddddd | |
ian@0 710 prior to the store of C \ +-------+ | |
ian@0 711 are perceptible to ----->| B->2 |------>| |
ian@0 712 subsequent loads +-------+ | |
ian@0 713 : : +-------+
ian@0 714
ian@0 715
ian@0 716 And thirdly, a read barrier acts as a partial order on loads. Consider the
ian@0 717 following sequence of events:
ian@0 718
ian@0 719 CPU 1 CPU 2
ian@0 720 ======================= =======================
ian@0 721 { A = 0, B = 9 }
ian@0 722 STORE A=1
ian@0 723 <write barrier>
ian@0 724 STORE B=2
ian@0 725 LOAD B
ian@0 726 LOAD A
ian@0 727
ian@0 728 Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
ian@0 729 some effectively random order, despite the write barrier issued by CPU 1:
ian@0 730
ian@0 731 +-------+ : : : :
ian@0 732 | | +------+ +-------+
ian@0 733 | |------>| A=1 |------ --->| A->0 |
ian@0 734 | | +------+ \ +-------+
ian@0 735 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
ian@0 736 | | +------+ | +-------+
ian@0 737 | |------>| B=2 |--- | : :
ian@0 738 | | +------+ \ | : : +-------+
ian@0 739 +-------+ : : \ | +-------+ | |
ian@0 740 ---------->| B->2 |------>| |
ian@0 741 | +-------+ | CPU 2 |
ian@0 742 | | A->0 |------>| |
ian@0 743 | +-------+ | |
ian@0 744 | : : +-------+
ian@0 745 \ : :
ian@0 746 \ +-------+
ian@0 747 ---->| A->1 |
ian@0 748 +-------+
ian@0 749 : :
ian@0 750
ian@0 751
ian@0 752 If, however, a read barrier were to be placed between the load of B and the
ian@0 753 load of A on CPU 2:
ian@0 754
ian@0 755 CPU 1 CPU 2
ian@0 756 ======================= =======================
ian@0 757 { A = 0, B = 9 }
ian@0 758 STORE A=1
ian@0 759 <write barrier>
ian@0 760 STORE B=2
ian@0 761 LOAD B
ian@0 762 <read barrier>
ian@0 763 LOAD A
ian@0 764
ian@0 765 then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
ian@0 766 2:
ian@0 767
ian@0 768 +-------+ : : : :
ian@0 769 | | +------+ +-------+
ian@0 770 | |------>| A=1 |------ --->| A->0 |
ian@0 771 | | +------+ \ +-------+
ian@0 772 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
ian@0 773 | | +------+ | +-------+
ian@0 774 | |------>| B=2 |--- | : :
ian@0 775 | | +------+ \ | : : +-------+
ian@0 776 +-------+ : : \ | +-------+ | |
ian@0 777 ---------->| B->2 |------>| |
ian@0 778 | +-------+ | CPU 2 |
ian@0 779 | : : | |
ian@0 780 | : : | |
ian@0 781 At this point the read ----> \ rrrrrrrrrrrrrrrrr | |
ian@0 782 barrier causes all effects \ +-------+ | |
ian@0 783 prior to the storage of B ---->| A->1 |------>| |
ian@0 784 to be perceptible to CPU 2 +-------+ | |
ian@0 785 : : +-------+
ian@0 786
ian@0 787
ian@0 788 To illustrate this more completely, consider what could happen if the code
ian@0 789 contained a load of A either side of the read barrier:
ian@0 790
ian@0 791 CPU 1 CPU 2
ian@0 792 ======================= =======================
ian@0 793 { A = 0, B = 9 }
ian@0 794 STORE A=1
ian@0 795 <write barrier>
ian@0 796 STORE B=2
ian@0 797 LOAD B
ian@0 798 LOAD A [first load of A]
ian@0 799 <read barrier>
ian@0 800 LOAD A [second load of A]
ian@0 801
ian@0 802 Even though the two loads of A both occur after the load of B, they may both
ian@0 803 come up with different values:
ian@0 804
ian@0 805 +-------+ : : : :
ian@0 806 | | +------+ +-------+
ian@0 807 | |------>| A=1 |------ --->| A->0 |
ian@0 808 | | +------+ \ +-------+
ian@0 809 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
ian@0 810 | | +------+ | +-------+
ian@0 811 | |------>| B=2 |--- | : :
ian@0 812 | | +------+ \ | : : +-------+
ian@0 813 +-------+ : : \ | +-------+ | |
ian@0 814 ---------->| B->2 |------>| |
ian@0 815 | +-------+ | CPU 2 |
ian@0 816 | : : | |
ian@0 817 | : : | |
ian@0 818 | +-------+ | |
ian@0 819 | | A->0 |------>| 1st |
ian@0 820 | +-------+ | |
ian@0 821 At this point the read ----> \ rrrrrrrrrrrrrrrrr | |
ian@0 822 barrier causes all effects \ +-------+ | |
ian@0 823 prior to the storage of B ---->| A->1 |------>| 2nd |
ian@0 824 to be perceptible to CPU 2 +-------+ | |
ian@0 825 : : +-------+
ian@0 826
ian@0 827
ian@0 828 But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
ian@0 829 before the read barrier completes anyway:
ian@0 830
ian@0 831 +-------+ : : : :
ian@0 832 | | +------+ +-------+
ian@0 833 | |------>| A=1 |------ --->| A->0 |
ian@0 834 | | +------+ \ +-------+
ian@0 835 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
ian@0 836 | | +------+ | +-------+
ian@0 837 | |------>| B=2 |--- | : :
ian@0 838 | | +------+ \ | : : +-------+
ian@0 839 +-------+ : : \ | +-------+ | |
ian@0 840 ---------->| B->2 |------>| |
ian@0 841 | +-------+ | CPU 2 |
ian@0 842 | : : | |
ian@0 843 \ : : | |
ian@0 844 \ +-------+ | |
ian@0 845 ---->| A->1 |------>| 1st |
ian@0 846 +-------+ | |
ian@0 847 rrrrrrrrrrrrrrrrr | |
ian@0 848 +-------+ | |
ian@0 849 | A->1 |------>| 2nd |
ian@0 850 +-------+ | |
ian@0 851 : : +-------+
ian@0 852
ian@0 853
ian@0 854 The guarantee is that the second load will always come up with A == 1 if the
ian@0 855 load of B came up with B == 2. No such guarantee exists for the first load of
ian@0 856 A; that may come up with either A == 0 or A == 1.
ian@0 857
ian@0 858
ian@0 859 READ MEMORY BARRIERS VS LOAD SPECULATION
ian@0 860 ----------------------------------------
ian@0 861
ian@0 862 Many CPUs speculate with loads: that is, they see that they will need to load an
ian@0 863 item from memory, and they find a time when they're not using the bus for any
ian@0 864 other loads, and so do the load in advance - even though they haven't actually
ian@0 865 got to that point in the instruction execution flow yet. This permits the
ian@0 866 actual load instruction to potentially complete immediately because the CPU
ian@0 867 already has the value to hand.
ian@0 868
ian@0 869 It may turn out that the CPU didn't actually need the value - perhaps because a
ian@0 870 branch circumvented the load - in which case it can discard the value or just
ian@0 871 cache it for later use.
ian@0 872
ian@0 873 Consider:
ian@0 874
ian@0 875 CPU 1 CPU 2
ian@0 876 ======================= =======================
ian@0 877 LOAD B
ian@0 878 DIVIDE } Divide instructions generally
ian@0 879 DIVIDE } take a long time to perform
ian@0 880 LOAD A
ian@0 881
ian@0 882 Which might appear as this:
ian@0 883
ian@0 884 : : +-------+
ian@0 885 +-------+ | |
ian@0 886 --->| B->2 |------>| |
ian@0 887 +-------+ | CPU 2 |
ian@0 888 : :DIVIDE | |
ian@0 889 +-------+ | |
ian@0 890 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
ian@0 891 division speculates on the +-------+ ~ | |
ian@0 892 LOAD of A : : ~ | |
ian@0 893 : :DIVIDE | |
ian@0 894 : : ~ | |
ian@0 895 Once the divisions are complete --> : : ~-->| |
ian@0 896 the CPU can then perform the : : | |
ian@0 897 LOAD with immediate effect : : +-------+
ian@0 898
ian@0 899
ian@0 900 Placing a read barrier or a data dependency barrier just before the second
ian@0 901 load:
ian@0 902
ian@0 903 CPU 1 CPU 2
ian@0 904 ======================= =======================
ian@0 905 LOAD B
ian@0 906 DIVIDE
ian@0 907 DIVIDE
ian@0 908 <read barrier>
ian@0 909 LOAD A
ian@0 910
ian@0 911 will force any value speculatively obtained to be reconsidered to an extent
ian@0 912 dependent on the type of barrier used. If there was no change made to the
ian@0 913 speculated memory location, then the speculated value will just be used:
ian@0 914
ian@0 915 : : +-------+
ian@0 916 +-------+ | |
ian@0 917 --->| B->2 |------>| |
ian@0 918 +-------+ | CPU 2 |
ian@0 919 : :DIVIDE | |
ian@0 920 +-------+ | |
ian@0 921 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
ian@0 922 division speculates on the +-------+ ~ | |
ian@0 923 LOAD of A : : ~ | |
ian@0 924 : :DIVIDE | |
ian@0 925 : : ~ | |
ian@0 926 : : ~ | |
ian@0 927 rrrrrrrrrrrrrrrr~ | |
ian@0 928 : : ~ | |
ian@0 929 : : ~-->| |
ian@0 930 : : | |
ian@0 931 : : +-------+
ian@0 932
ian@0 933
ian@0 934 but if there was an update or an invalidation from another CPU pending, then
ian@0 935 the speculation will be cancelled and the value reloaded:
ian@0 936
ian@0 937 : : +-------+
ian@0 938 +-------+ | |
ian@0 939 --->| B->2 |------>| |
ian@0 940 +-------+ | CPU 2 |
ian@0 941 : :DIVIDE | |
ian@0 942 +-------+ | |
ian@0 943 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
ian@0 944 division speculates on the +-------+ ~ | |
ian@0 945 LOAD of A : : ~ | |
ian@0 946 : :DIVIDE | |
ian@0 947 : : ~ | |
ian@0 948 : : ~ | |
ian@0 949 rrrrrrrrrrrrrrrrr | |
ian@0 950 +-------+ | |
ian@0 951 The speculation is discarded ---> --->| A->1 |------>| |
ian@0 952 and an updated value is +-------+ | |
ian@0 953 retrieved : : +-------+
ian@0 954
ian@0 955
ian@0 956 ========================
ian@0 957 EXPLICIT KERNEL BARRIERS
ian@0 958 ========================
ian@0 959
ian@0 960 The Linux kernel has a variety of different barriers that act at different
ian@0 961 levels:
ian@0 962
ian@0 963 (*) Compiler barrier.
ian@0 964
ian@0 965 (*) CPU memory barriers.
ian@0 966
ian@0 967 (*) MMIO write barrier.
ian@0 968
ian@0 969
ian@0 970 COMPILER BARRIER
ian@0 971 ----------------
ian@0 972
ian@0 973 The Linux kernel has an explicit compiler barrier function that prevents the
ian@0 974 compiler from moving the memory accesses either side of it to the other side:
ian@0 975
ian@0 976 barrier();
ian@0 977
ian@0 978 This is a general barrier - lesser varieties of compiler barrier do not exist.
ian@0 979
ian@0 980 The compiler barrier has no direct effect on the CPU, which may then reorder
ian@0 981 things however it wishes.
ian@0 982
ian@0 983
ian@0 984 CPU MEMORY BARRIERS
ian@0 985 -------------------
ian@0 986
ian@0 987 The Linux kernel has eight basic CPU memory barriers:
ian@0 988
ian@0 989 TYPE MANDATORY SMP CONDITIONAL
ian@0 990 =============== ======================= ===========================
ian@0 991 GENERAL mb() smp_mb()
ian@0 992 WRITE wmb() smp_wmb()
ian@0 993 READ rmb() smp_rmb()
ian@0 994 DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends()
ian@0 995
ian@0 996
ian@0 997 All CPU memory barriers unconditionally imply compiler barriers.
ian@0 998
ian@0 999 SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
ian@0 1000 systems because it is assumed that a CPU will appear to be self-consistent,
ian@0 1001 and will order overlapping accesses correctly with respect to itself.
ian@0 1002
ian@0 1003 [!] Note that SMP memory barriers _must_ be used to control the ordering of
ian@0 1004 references to shared memory on SMP systems, though the use of locking instead
ian@0 1005 is sufficient.
ian@0 1006
ian@0 1007 Mandatory barriers should not be used to control SMP effects, since mandatory
ian@0 1008 barriers unnecessarily impose overhead on UP systems. They may, however, be
ian@0 1009 used to control MMIO effects on accesses through relaxed memory I/O windows.
ian@0 1010 These are required even on non-SMP systems as they affect the order in which
ian@0 1011 memory operations appear to a device by prohibiting both the compiler and the
ian@0 1012 CPU from reordering them.
ian@0 1013
ian@0 1014
ian@0 1015 There are some more advanced barrier functions:
ian@0 1016
ian@0 1017 (*) set_mb(var, value)
ian@0 1018
ian@0 1019      This assigns the value to the variable and then inserts a memory
ian@0 1020      barrier after it.  It isn't guaranteed to insert anything more than a
ian@0 1021      compiler barrier in a UP compilation.
ian@0 1022
ian@0 1023
ian@0 1024 (*) smp_mb__before_atomic_dec();
ian@0 1025 (*) smp_mb__after_atomic_dec();
ian@0 1026 (*) smp_mb__before_atomic_inc();
ian@0 1027 (*) smp_mb__after_atomic_inc();
ian@0 1028
ian@0 1029 These are for use with atomic add, subtract, increment and decrement
ian@0 1030 functions that don't return a value, especially when used for reference
ian@0 1031 counting. These functions do not imply memory barriers.
ian@0 1032
ian@0 1033 As an example, consider a piece of code that marks an object as being dead
ian@0 1034 and then decrements the object's reference count:
ian@0 1035
ian@0 1036 obj->dead = 1;
ian@0 1037 smp_mb__before_atomic_dec();
ian@0 1038 atomic_dec(&obj->ref_count);
ian@0 1039
ian@0 1040 This makes sure that the death mark on the object is perceived to be set
ian@0 1041 *before* the reference counter is decremented.
ian@0 1042
ian@0 1043 See Documentation/atomic_ops.txt for more information. See the "Atomic
ian@0 1044 operations" subsection for information on where to use these.
ian@0 1045
ian@0 1046
ian@0 1047 (*) smp_mb__before_clear_bit(void);
ian@0 1048 (*) smp_mb__after_clear_bit(void);
ian@0 1049
ian@0 1050 These are for use similar to the atomic inc/dec barriers. These are
ian@0 1051 typically used for bitwise unlocking operations, so care must be taken as
ian@0 1052 there are no implicit memory barriers here either.
ian@0 1053
ian@0 1054 Consider implementing an unlock operation of some nature by clearing a
ian@0 1055 locking bit. The clear_bit() would then need to be barriered like this:
ian@0 1056
ian@0 1057 smp_mb__before_clear_bit();
ian@0 1058 clear_bit( ... );
ian@0 1059
ian@0 1060 This prevents memory operations before the clear leaking to after it. See
ian@0 1061 the subsection on "Locking Functions" with reference to UNLOCK operation
ian@0 1062 implications.
ian@0 1063
ian@0 1064 See Documentation/atomic_ops.txt for more information. See the "Atomic
ian@0 1065 operations" subsection for information on where to use these.
ian@0 1066
ian@0 1067
ian@0 1068 MMIO WRITE BARRIER
ian@0 1069 ------------------
ian@0 1070
ian@0 1071 The Linux kernel also has a special barrier for use with memory-mapped I/O
ian@0 1072 writes:
ian@0 1073
ian@0 1074 mmiowb();
ian@0 1075
ian@0 1076 This is a variation on the mandatory write barrier that causes writes to weakly
ian@0 1077 ordered I/O regions to be partially ordered. Its effects may go beyond the
ian@0 1078 CPU->Hardware interface and actually affect the hardware at some level.
ian@0 1079
ian@0 1080 See the subsection "Locks vs I/O accesses" for more information.
ian@0 1081
ian@0 1082
ian@0 1083 ===============================
ian@0 1084 IMPLICIT KERNEL MEMORY BARRIERS
ian@0 1085 ===============================
ian@0 1086
ian@0 1087 Some of the other functions in the Linux kernel imply memory barriers, amongst
ian@0 1088 which are locking and scheduling functions.
ian@0 1089
ian@0 1090 This specification is a _minimum_ guarantee; any particular architecture may
ian@0 1091 provide more substantial guarantees, but these may not be relied upon outside
ian@0 1092 of arch specific code.
ian@0 1093
ian@0 1094
ian@0 1095 LOCKING FUNCTIONS
ian@0 1096 -----------------
ian@0 1097
ian@0 1098 The Linux kernel has a number of locking constructs:
ian@0 1099
ian@0 1100 (*) spin locks
ian@0 1101 (*) R/W spin locks
ian@0 1102 (*) mutexes
ian@0 1103 (*) semaphores
ian@0 1104 (*) R/W semaphores
ian@0 1105 (*) RCU
ian@0 1106
ian@0 1107 In all cases there are variants on "LOCK" operations and "UNLOCK" operations
ian@0 1108 for each construct. These operations all imply certain barriers:
ian@0 1109
ian@0 1110 (1) LOCK operation implication:
ian@0 1111
ian@0 1112 Memory operations issued after the LOCK will be completed after the LOCK
ian@0 1113 operation has completed.
ian@0 1114
ian@0 1115 Memory operations issued before the LOCK may be completed after the LOCK
ian@0 1116 operation has completed.
ian@0 1117
ian@0 1118 (2) UNLOCK operation implication:
ian@0 1119
ian@0 1120 Memory operations issued before the UNLOCK will be completed before the
ian@0 1121 UNLOCK operation has completed.
ian@0 1122
ian@0 1123 Memory operations issued after the UNLOCK may be completed before the
ian@0 1124 UNLOCK operation has completed.
ian@0 1125
ian@0 1126 (3) LOCK vs LOCK implication:
ian@0 1127
ian@0 1128 All LOCK operations issued before another LOCK operation will be completed
ian@0 1129 before that LOCK operation.
ian@0 1130
ian@0 1131 (4) LOCK vs UNLOCK implication:
ian@0 1132
ian@0 1133 All LOCK operations issued before an UNLOCK operation will be completed
ian@0 1134 before the UNLOCK operation.
ian@0 1135
ian@0 1136 All UNLOCK operations issued before a LOCK operation will be completed
ian@0 1137 before the LOCK operation.
ian@0 1138
ian@0 1139 (5) Failed conditional LOCK implication:
ian@0 1140
ian@0 1141 Certain variants of the LOCK operation may fail, either due to being
ian@0 1142 unable to get the lock immediately, or due to receiving an unblocked
ian@0 1143 signal whilst asleep waiting for the lock to become available. Failed
ian@0 1144 locks do not imply any sort of barrier.
ian@0 1145
ian@0 1146 Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
ian@0 1147 equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
ian@0 1148
ian@0 1149 [!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
ian@0 1150 barriers is that the effects of instructions outside of a critical section may
ian@0 1151 seep into the inside of the critical section.
ian@0 1152
ian@0 1153 A LOCK followed by an UNLOCK may not be assumed to be a full memory barrier
ian@0 1154 because it is possible for an access preceding the LOCK to happen after the
ian@0 1155 LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
ian@0 1156 two accesses can themselves then cross:
ian@0 1157
ian@0 1158 *A = a;
ian@0 1159 LOCK
ian@0 1160 UNLOCK
ian@0 1161 *B = b;
ian@0 1162
ian@0 1163 may occur as:
ian@0 1164
ian@0 1165 LOCK, STORE *B, STORE *A, UNLOCK
ian@0 1166
ian@0 1167 Locks and semaphores may not provide any guarantee of ordering on UP compiled
ian@0 1168 systems, and so cannot be counted on in such a situation to actually achieve
ian@0 1169 anything at all - especially with respect to I/O accesses - unless combined
ian@0 1170 with interrupt disabling operations.
ian@0 1171
ian@0 1172 See also the section on "Inter-CPU locking barrier effects".
ian@0 1173
ian@0 1174
ian@0 1175 As an example, consider the following:
ian@0 1176
ian@0 1177 *A = a;
ian@0 1178 *B = b;
ian@0 1179 LOCK
ian@0 1180 *C = c;
ian@0 1181 *D = d;
ian@0 1182 UNLOCK
ian@0 1183 *E = e;
ian@0 1184 *F = f;
ian@0 1185
ian@0 1186 The following sequence of events is acceptable:
ian@0 1187
ian@0 1188 LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
ian@0 1189
ian@0 1190 [+] Note that {*F,*A} indicates a combined access.
ian@0 1191
ian@0 1192 But none of the following are:
ian@0 1193
ian@0 1194 {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E
ian@0 1195 *A, *B, *C, LOCK, *D, UNLOCK, *E, *F
ian@0 1196 *A, *B, LOCK, *C, UNLOCK, *D, *E, *F
ian@0 1197 *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E
ian@0 1198
ian@0 1199
ian@0 1200
ian@0 1201 INTERRUPT DISABLING FUNCTIONS
ian@0 1202 -----------------------------
ian@0 1203
ian@0 1204 Functions that disable interrupts (LOCK equivalent) and enable interrupts
ian@0 1205 (UNLOCK equivalent) will act as compiler barriers only. So if memory or I/O
ian@0 1206 barriers are required in such a situation, they must be provided from some
ian@0 1207 other means.
ian@0 1208
ian@0 1209
ian@0 1210 MISCELLANEOUS FUNCTIONS
ian@0 1211 -----------------------
ian@0 1212
ian@0 1213 Other functions that imply barriers:
ian@0 1214
ian@0 1215 (*) schedule() and similar imply full memory barriers.
ian@0 1216
ian@0 1217
ian@0 1218 =================================
ian@0 1219 INTER-CPU LOCKING BARRIER EFFECTS
ian@0 1220 =================================
ian@0 1221
ian@0 1222 On SMP systems locking primitives give a more substantial form of barrier: one
ian@0 1223 that does affect memory access ordering on other CPUs, within the context of
ian@0 1224 conflict on any particular lock.
ian@0 1225
ian@0 1226
ian@0 1227 LOCKS VS MEMORY ACCESSES
ian@0 1228 ------------------------
ian@0 1229
ian@0 1230 Consider the following: the system has a pair of spinlocks (M) and (Q), and
ian@0 1231 three CPUs; then should the following sequence of events occur:
ian@0 1232
ian@0 1233 CPU 1 CPU 2
ian@0 1234 =============================== ===============================
ian@0 1235 *A = a; *E = e;
ian@0 1236 LOCK M LOCK Q
ian@0 1237 *B = b; *F = f;
ian@0 1238 *C = c; *G = g;
ian@0 1239 UNLOCK M UNLOCK Q
ian@0 1240 *D = d; *H = h;
ian@0 1241
ian@0 1242 Then there is no guarantee as to what order CPU #3 will see the accesses to *A
ian@0 1243 through *H occur in, other than the constraints imposed by the separate locks
ian@0 1244 on the separate CPUs. It might, for example, see:
ian@0 1245
ian@0 1246 *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
ian@0 1247
ian@0 1248 But it won't see any of:
ian@0 1249
ian@0 1250 *B, *C or *D preceding LOCK M
ian@0 1251 *A, *B or *C following UNLOCK M
ian@0 1252 *F, *G or *H preceding LOCK Q
ian@0 1253 *E, *F or *G following UNLOCK Q
ian@0 1254
ian@0 1255
ian@0 1256 However, if the following occurs:
ian@0 1257
ian@0 1258 CPU 1 CPU 2
ian@0 1259 =============================== ===============================
ian@0 1260 *A = a;
ian@0 1261 LOCK M [1]
ian@0 1262 *B = b;
ian@0 1263 *C = c;
ian@0 1264 UNLOCK M [1]
ian@0 1265 *D = d; *E = e;
ian@0 1266 LOCK M [2]
ian@0 1267 *F = f;
ian@0 1268 *G = g;
ian@0 1269 UNLOCK M [2]
ian@0 1270 *H = h;
ian@0 1271
ian@0 1272 CPU #3 might see:
ian@0 1273
ian@0 1274 *E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
ian@0 1275 LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
ian@0 1276
ian@0 1277 But assuming CPU #1 gets the lock first, it won't see any of:
ian@0 1278
ian@0 1279 *B, *C, *D, *F, *G or *H preceding LOCK M [1]
ian@0 1280 *A, *B or *C following UNLOCK M [1]
ian@0 1281 *F, *G or *H preceding LOCK M [2]
ian@0 1282 *A, *B, *C, *E, *F or *G following UNLOCK M [2]
ian@0 1283
ian@0 1284
ian@0 1285 LOCKS VS I/O ACCESSES
ian@0 1286 ---------------------
ian@0 1287
ian@0 1288 Under certain circumstances (especially involving NUMA), I/O accesses within
ian@0 1289 two spinlocked sections on two different CPUs may be seen as interleaved by the
ian@0 1290 PCI bridge, because the PCI bridge does not necessarily participate in the
ian@0 1291 cache-coherence protocol, and is therefore incapable of issuing the required
ian@0 1292 read memory barriers.
ian@0 1293
ian@0 1294 For example:
ian@0 1295
ian@0 1296 CPU 1 CPU 2
ian@0 1297 =============================== ===============================
ian@0 1298 	spin_lock(Q);
ian@0 1299 	writel(0, ADDR);
ian@0 1300 writel(1, DATA);
ian@0 1301 spin_unlock(Q);
ian@0 1302 spin_lock(Q);
ian@0 1303 writel(4, ADDR);
ian@0 1304 writel(5, DATA);
ian@0 1305 spin_unlock(Q);
ian@0 1306
ian@0 1307 may be seen by the PCI bridge as follows:
ian@0 1308
ian@0 1309 STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5
ian@0 1310
ian@0 1311 which would probably cause the hardware to malfunction.
ian@0 1312
ian@0 1313
ian@0 1314 What is necessary here is to intervene with an mmiowb() before dropping the
ian@0 1315 spinlock, for example:
ian@0 1316
ian@0 1317 CPU 1 CPU 2
ian@0 1318 =============================== ===============================
ian@0 1319 	spin_lock(Q);
ian@0 1320 	writel(0, ADDR);
ian@0 1321 writel(1, DATA);
ian@0 1322 mmiowb();
ian@0 1323 spin_unlock(Q);
ian@0 1324 spin_lock(Q);
ian@0 1325 writel(4, ADDR);
ian@0 1326 writel(5, DATA);
ian@0 1327 mmiowb();
ian@0 1328 spin_unlock(Q);
ian@0 1329
ian@0 1330 this will ensure that the two stores issued on CPU #1 appear at the PCI bridge
ian@0 1331 before either of the stores issued on CPU #2.
ian@0 1332
ian@0 1333
ian@0 1334 Furthermore, following a store by a load from the same device obviates the need
ian@0 1335 for an mmiowb(), because the load forces the store to complete before the load
ian@0 1336 is performed:
ian@0 1337
ian@0 1338 CPU 1 CPU 2
ian@0 1339 =============================== ===============================
ian@0 1340 	spin_lock(Q);
ian@0 1341 	writel(0, ADDR);
ian@0 1342 a = readl(DATA);
ian@0 1343 spin_unlock(Q);
ian@0 1344 spin_lock(Q);
ian@0 1345 writel(4, ADDR);
ian@0 1346 b = readl(DATA);
ian@0 1347 spin_unlock(Q);
ian@0 1348
ian@0 1349
ian@0 1350 See Documentation/DocBook/deviceiobook.tmpl for more information.
ian@0 1351
ian@0 1352
ian@0 1353 =================================
ian@0 1354 WHERE ARE MEMORY BARRIERS NEEDED?
ian@0 1355 =================================
ian@0 1356
ian@0 1357 Under normal operation, memory operation reordering is generally not going to
ian@0 1358 be a problem as a single-threaded linear piece of code will still appear to
ian@0 1359 work correctly, even if it's in an SMP kernel.  There are, however, four
ian@0 1360 circumstances in which reordering definitely _could_ be a problem:
ian@0 1361
ian@0 1362 (*) Interprocessor interaction.
ian@0 1363
ian@0 1364 (*) Atomic operations.
ian@0 1365
ian@0 1366 (*) Accessing devices (I/O).
ian@0 1367
ian@0 1368 (*) Interrupts.
ian@0 1369
ian@0 1370
ian@0 1371 INTERPROCESSOR INTERACTION
ian@0 1372 --------------------------
ian@0 1373
ian@0 1374 When there's a system with more than one processor, more than one CPU in the
ian@0 1375 system may be working on the same data set at the same time. This can cause
ian@0 1376 synchronisation problems, and the usual way of dealing with them is to use
ian@0 1377 locks. Locks, however, are quite expensive, and so it may be preferable to
ian@0 1378 operate without the use of a lock if at all possible. In such a case
ian@0 1379 operations that affect both CPUs may have to be carefully ordered to prevent
ian@0 1380 a malfunction.
ian@0 1381
ian@0 1382 Consider, for example, the R/W semaphore slow path. Here a waiting process is
ian@0 1383 queued on the semaphore, by virtue of it having a piece of its stack linked to
ian@0 1384 the semaphore's list of waiting processes:
ian@0 1385
ian@0 1386 struct rw_semaphore {
ian@0 1387 ...
ian@0 1388 spinlock_t lock;
ian@0 1389 struct list_head waiters;
ian@0 1390 };
ian@0 1391
ian@0 1392 struct rwsem_waiter {
ian@0 1393 struct list_head list;
ian@0 1394 struct task_struct *task;
ian@0 1395 };
ian@0 1396
ian@0 1397 To wake up a particular waiter, the up_read() or up_write() functions have to:
ian@0 1398
ian@0 1399  (1) read the next pointer from this waiter's record to know where the
ian@0 1400      next waiter record is;
ian@0 1401 
ian@0 1402  (2) read the pointer to the waiter's task structure;
ian@0 1403 
ian@0 1404  (3) clear the task pointer to tell the waiter it has been given the semaphore;
ian@0 1405 
ian@0 1406  (4) call wake_up_process() on the task; and
ian@0 1407 
ian@0 1408  (5) release the reference held on the waiter's task struct.
ian@0 1409
ian@0 1410 In other words, it has to perform this sequence of events:
ian@0 1411
ian@0 1412 LOAD waiter->list.next;
ian@0 1413 LOAD waiter->task;
ian@0 1414 STORE waiter->task;
ian@0 1415 CALL wakeup
ian@0 1416 RELEASE task
ian@0 1417
ian@0 1418 and if any of these steps occur out of order, then the whole thing may
ian@0 1419 malfunction.
ian@0 1420
ian@0 1421 Once it has queued itself and dropped the semaphore lock, the waiter does not
ian@0 1422 get the lock again; it instead just waits for its task pointer to be cleared
ian@0 1423 before proceeding. Since the record is on the waiter's stack, this means that
ian@0 1424 if the task pointer is cleared _before_ the next pointer in the list is read,
ian@0 1425 another CPU might start processing the waiter and might clobber the waiter's
ian@0 1426 stack before the up*() function has a chance to read the next pointer.
ian@0 1427
ian@0 1428 Consider then what might happen to the above sequence of events:
ian@0 1429
ian@0 1430 CPU 1 CPU 2
ian@0 1431 =============================== ===============================
ian@0 1432 down_xxx()
ian@0 1433 Queue waiter
ian@0 1434 Sleep
ian@0 1435 up_yyy()
ian@0 1436 LOAD waiter->task;
ian@0 1437 STORE waiter->task;
ian@0 1438 Woken up by other event
ian@0 1439 <preempt>
ian@0 1440 Resume processing
ian@0 1441 down_xxx() returns
ian@0 1442 call foo()
ian@0 1443 foo() clobbers *waiter
ian@0 1444 </preempt>
ian@0 1445 LOAD waiter->list.next;
ian@0 1446 --- OOPS ---
ian@0 1447
ian@0 1448 This could be dealt with using the semaphore lock, but then the down_xxx()
ian@0 1449 function has to needlessly get the spinlock again after being woken up.
ian@0 1450
ian@0 1451 The way to deal with this is to insert a general SMP memory barrier:
ian@0 1452
ian@0 1453 LOAD waiter->list.next;
ian@0 1454 LOAD waiter->task;
ian@0 1455 smp_mb();
ian@0 1456 STORE waiter->task;
ian@0 1457 CALL wakeup
ian@0 1458 RELEASE task
ian@0 1459
ian@0 1460 In this case, the barrier makes a guarantee that all memory accesses before the
ian@0 1461 barrier will appear to happen before all the memory accesses after the barrier
ian@0 1462 with respect to the other CPUs on the system. It does _not_ guarantee that all
ian@0 1463 the memory accesses before the barrier will be complete by the time the barrier
ian@0 1464 instruction itself is complete.
ian@0 1465
ian@0 1466 On a UP system - where this wouldn't be a problem - the smp_mb() is just a
ian@0 1467 compiler barrier, thus making sure the compiler emits the instructions in the
ian@0 1468 right order without actually intervening in the CPU. Since there's only one
ian@0 1469 CPU, that CPU's dependency ordering logic will take care of everything else.
ian@0 1470
ian@0 1471
ian@0 1472 ATOMIC OPERATIONS
ian@0 1473 -----------------
ian@0 1474
ian@0 1475 Whilst they are technically interprocessor interaction considerations, atomic
ian@0 1476 operations are noted specially as some of them imply full memory barriers and
ian@0 1477 some don't, but they're very heavily relied on as a group throughout the
ian@0 1478 kernel.
ian@0 1479
ian@0 1480 Any atomic operation that modifies some state in memory and returns information
ian@0 1481 about the state (old or new) implies an SMP-conditional general memory barrier
ian@0 1482 (smp_mb()) on each side of the actual operation. These include:
ian@0 1483
ian@0 1484 xchg();
ian@0 1485 cmpxchg();
ian@0 1486 atomic_cmpxchg();
ian@0 1487 atomic_inc_return();
ian@0 1488 atomic_dec_return();
ian@0 1489 atomic_add_return();
ian@0 1490 atomic_sub_return();
ian@0 1491 atomic_inc_and_test();
ian@0 1492 atomic_dec_and_test();
ian@0 1493 atomic_sub_and_test();
ian@0 1494 atomic_add_negative();
ian@0 1495 atomic_add_unless();
ian@0 1496 test_and_set_bit();
ian@0 1497 test_and_clear_bit();
ian@0 1498 test_and_change_bit();
ian@0 1499
ian@0 1500 These are used for such things as implementing LOCK-class and UNLOCK-class
ian@0 1501 operations and adjusting reference counters towards object destruction, and as
ian@0 1502 such the implicit memory barrier effects are necessary.
ian@0 1503
ian@0 1504
ian@0 1505 The following operations are potential problems as they do _not_ imply memory
ian@0 1506 barriers, but might be used for implementing such things as UNLOCK-class
ian@0 1507 operations:
ian@0 1508
ian@0 1509 atomic_set();
ian@0 1510 set_bit();
ian@0 1511 clear_bit();
ian@0 1512 change_bit();
ian@0 1513
ian@0 1514 With these the appropriate explicit memory barrier should be used if necessary
ian@0 1515 (smp_mb__before_clear_bit() for instance).
ian@0 1516
ian@0 1517
ian@0 1518 The following also do _not_ imply memory barriers, and so may require explicit
ian@0 1519 memory barriers under some circumstances (smp_mb__before_atomic_dec() for
ian@0 1520 instance):
ian@0 1521
ian@0 1522 atomic_add();
ian@0 1523 atomic_sub();
ian@0 1524 atomic_inc();
ian@0 1525 atomic_dec();
ian@0 1526
ian@0 1527 If they're used for statistics generation, then they probably don't need memory
ian@0 1528 barriers, unless there's a coupling between statistical data.
ian@0 1529
ian@0 1530 If they're used for reference counting on an object to control its lifetime,
ian@0 1531 they probably don't need memory barriers because either the reference count
ian@0 1532 will be adjusted inside a locked section, or the caller will already hold
ian@0 1533 sufficient references to make the lock, and thus a memory barrier, unnecessary.
ian@0 1534
ian@0 1535 If they're used for constructing a lock of some description, then they probably
ian@0 1536 do need memory barriers as a lock primitive generally has to do things in a
ian@0 1537 specific order.
ian@0 1538
ian@0 1539
ian@0 1540 Basically, each usage case has to be carefully considered as to whether memory
ian@0 1541 barriers are needed or not.
ian@0 1542
ian@0 1543 [!] Note that special memory barrier primitives are available for these
ian@0 1544 situations because on some CPUs the atomic instructions used imply full memory
ian@0 1545 barriers, and so barrier instructions are superfluous in conjunction with them,
ian@0 1546 and in such cases the special barrier primitives will be no-ops.
ian@0 1547
ian@0 1548 See Documentation/atomic_ops.txt for more information.
ian@0 1549
ian@0 1550
ian@0 1551 ACCESSING DEVICES
ian@0 1552 -----------------
ian@0 1553
ian@0 1554 Many devices can be memory mapped, and so appear to the CPU as if they're just
ian@0 1555 a set of memory locations. To control such a device, the driver usually has to
ian@0 1556 make the right memory accesses in exactly the right order.
ian@0 1557
ian@0 1558 However, having a clever CPU or a clever compiler creates a potential problem
ian@0 1559 in that the carefully sequenced accesses in the driver code won't reach the
ian@0 1560 device in the requisite order if the CPU or the compiler thinks it is more
ian@0 1561 efficient to reorder, combine or merge accesses - something that would cause
ian@0 1562 the device to malfunction.
ian@0 1563
ian@0 1564 Inside of the Linux kernel, I/O should be done through the appropriate accessor
ian@0 1565 routines - such as inb() or writel() - which know how to make such accesses
ian@0 1566 appropriately sequential. Whilst this, for the most part, renders the explicit
ian@0 1567 use of memory barriers unnecessary, there are a couple of situations where they
ian@0 1568 might be needed:
ian@0 1569
ian@0 1570 (1) On some systems, I/O stores are not strongly ordered across all CPUs, and
ian@0 1571 so for _all_ general drivers locks should be used and mmiowb() must be
ian@0 1572 issued prior to unlocking the critical section.
ian@0 1573
ian@0 1574 (2) If the accessor functions are used to refer to an I/O memory window with
ian@0 1575 relaxed memory access properties, then _mandatory_ memory barriers are
ian@0 1576 required to enforce ordering.
ian@0 1577
ian@0 1578 See Documentation/DocBook/deviceiobook.tmpl for more information.
ian@0 1579
ian@0 1580
ian@0 1581 INTERRUPTS
ian@0 1582 ----------
ian@0 1583
ian@0 1584 A driver may be interrupted by its own interrupt service routine, and thus the
ian@0 1585 two parts of the driver may interfere with each other's attempts to control or
ian@0 1586 access the device.
ian@0 1587
ian@0 1588 This may be alleviated - at least in part - by disabling local interrupts (a
ian@0 1589 form of locking), such that the critical operations are all contained within
ian@0 1590 the interrupt-disabled section in the driver. Whilst the driver's interrupt
ian@0 1591 routine is executing, the driver's core may not run on the same CPU, and its
ian@0 1592 interrupt is not permitted to happen again until the current interrupt has been
ian@0 1593 handled, thus the interrupt handler does not need to lock against that.
ian@0 1594
ian@0 1595 However, consider a driver that was talking to an ethernet card that sports an
ian@0 1596 address register and a data register. If that driver's core talks to the card
ian@0 1597 under interrupt-disablement and then the driver's interrupt handler is invoked:
ian@0 1598
ian@0 1599 	LOCAL IRQ DISABLE
ian@0 1600 	writew(3, ADDR);
ian@0 1601 	writew(y, DATA);
ian@0 1602 	LOCAL IRQ ENABLE
ian@0 1603 	<interrupt>
ian@0 1604 	writew(4, ADDR);
ian@0 1605 	q = readw(DATA);
ian@0 1606 	</interrupt>
ian@0 1607
ian@0 1608 The store to the data register might happen after the second store to the
ian@0 1609 address register if ordering rules are sufficiently relaxed:
ian@0 1610
ian@0 1611 STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA
ian@0 1612
ian@0 1613
ian@0 1614 If ordering rules are relaxed, it must be assumed that accesses done inside an
ian@0 1615 interrupt disabled section may leak outside of it and may interleave with
ian@0 1616 accesses performed in an interrupt - and vice versa - unless implicit or
ian@0 1617 explicit barriers are used.
ian@0 1618
ian@0 1619 Normally this won't be a problem because the I/O accesses done inside such
ian@0 1620 sections will include synchronous load operations on strictly ordered I/O
ian@0 1621 registers that form implicit I/O barriers. If this isn't sufficient then an
ian@0 1622 mmiowb() may need to be used explicitly.
ian@0 1623
ian@0 1624
ian@0 1625 A similar situation may occur between an interrupt routine and two routines
ian@0 1626 running on separate CPUs that communicate with each other. If such a case is
ian@0 1627 likely, then interrupt-disabling locks should be used to guarantee ordering.
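
The effect of such interrupt-disabling locks can be sketched in user space
with an ordinary mutex standing in for the kernel's spinlock; this is only an
analogy, and the register names and helper functions here are hypothetical:

```c
#include <pthread.h>

/* Hypothetical shadow "registers" standing in for the card's address and
 * data registers; a real driver would use the MMIO accessors instead. */
static int addr_reg, data_reg;
static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;

/* The driver core's ADDR/DATA pair is done under the lock... */
static void core_write(int addr, int data)
{
        pthread_mutex_lock(&reg_lock);
        addr_reg = addr;
        data_reg = data;
        pthread_mutex_unlock(&reg_lock);
}

/* ...and so is the interrupt-side pair, so the two sequences can never
 * interleave with each other. */
static int irq_read(int addr)
{
        int val;

        pthread_mutex_lock(&reg_lock);
        addr_reg = addr;
        val = data_reg;
        pthread_mutex_unlock(&reg_lock);
        return val;
}
```

Because both paths take the same lock, the address/data sequences stay whole,
which is the ordering property the interrupt-disabled section was providing.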


==========================
KERNEL I/O BARRIER EFFECTS
==========================

When accessing I/O memory, drivers should use the appropriate accessor
functions:

 (*) inX(), outX():

     These are intended to talk to I/O space rather than memory space, but
     that's primarily a CPU-specific concept. The i386 and x86_64 processors do
     indeed have special I/O space access cycles and instructions, but many
     CPUs don't have such a concept.

     The PCI bus, amongst others, defines an I/O space concept which, on such
     CPUs as the i386 and x86_64, readily maps to the CPU's concept of I/O
     space. However, it may also be mapped as a virtual I/O space in the CPU's
     memory map, particularly on those CPUs that don't support alternate I/O
     spaces.

     Accesses to this space may be fully synchronous (as on the i386), but
     intermediary bridges (such as the PCI host bridge) may not fully honour
     that.

     They are guaranteed to be fully ordered with respect to each other.

     They are not guaranteed to be fully ordered with respect to other types of
     memory and I/O operation.

 (*) readX(), writeX():

     Whether these are guaranteed to be fully ordered and uncombined with
     respect to each other on the issuing CPU depends on the characteristics
     defined for the memory window through which they're accessing. On later
     i386 architecture machines, for example, this is controlled by way of the
     MTRR registers.

     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
     provided they're not accessing a prefetchable device.

     However, intermediary hardware (such as a PCI bridge) may indulge in
     deferral if it so wishes; to flush a store, a load from the same location
     is preferred[*], but a load from the same device or from configuration
     space should suffice for PCI.

     [*] NOTE! attempting to load from the same location as was written to may
         cause a malfunction - consider the 16550 Rx/Tx serial registers for
         example.

     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
     force stores to be ordered.

     Please refer to the PCI specification for more information on interactions
     between PCI transactions.

 (*) readX_relaxed()

     These are similar to readX(), but are not guaranteed to be ordered in any
     way. Be aware that there is no I/O read barrier available.

 (*) ioreadX(), iowriteX()

     These will perform as appropriate for the type of access they're actually
     doing, be it inX()/outX() or readX()/writeX().


========================================
ASSUMED MINIMUM EXECUTION ORDERING MODEL
========================================

It has to be assumed that the conceptual CPU is weakly-ordered but that it will
maintain the appearance of program causality with respect to itself. Some CPUs
(such as i386 or x86_64) are more constrained than others (such as powerpc or
frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside
of arch-specific code.

This means that it must be considered that the CPU will execute its instruction
stream in any order it feels like - or even in parallel - provided that if an
instruction in the stream depends on an earlier instruction, then that
earlier instruction must be sufficiently complete[*] before the later
instruction may proceed; in other words: provided that the appearance of
causality is maintained.

 [*] Some instructions have more than one effect - such as changing the
     condition codes, changing registers or changing memory - and different
     instructions may depend on different effects.

A CPU may also discard any instruction sequence that winds up having no
ultimate effect. For example, if two adjacent instructions both load an
immediate value into the same register, the first may be discarded.


Similarly, it has to be assumed that the compiler might reorder the instruction
stream in any way it sees fit, again provided the appearance of causality is
maintained.

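The kernel constrains the compiler with its barrier() macro; a minimal
user-space sketch of the same idea follows, assuming a GCC/Clang-style
compiler (the publish() helper is made up for illustration):

```c
#include <assert.h>

/* A compiler barrier in the style of the kernel's barrier() macro: the
 * empty asm with a "memory" clobber stops the compiler from moving
 * memory accesses across it.  It emits no instructions and does NOT
 * constrain the CPU - on SMP a real memory barrier is still needed. */
#define barrier() __asm__ __volatile__("" : : : "memory")

static int payload, flag;

static void publish(int value)
{
        payload = value;
        barrier();      /* keep the payload store before the flag store */
        flag = 1;
}
```

Without the barrier the compiler would be free to emit the flag store first,
since as far as single-threaded causality goes the order doesn't matter.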

============================
THE EFFECTS OF THE CPU CACHE
============================

The way cached memory operations are perceived across the system is affected to
a certain extent by the caches that lie between CPUs and memory, and by the
memory coherence system that maintains the consistency of state in the system.

As far as the way a CPU interacts with another part of the system through the
caches goes, the memory system has to include the CPU's caches, and memory
barriers for the most part act at the interface between the CPU and its cache
(memory barriers logically act on the dotted line in the following diagram):

            <--- CPU --->         :      <----------- Memory ----------->
                                  :
        +--------+    +--------+  :   +--------+    +-----------+
        |        |    |        |  :   |        |    |           |    +--------+
        |  CPU   |    | Memory |  :   | CPU    |    |           |    |        |
        |  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
        |        |    | Queue  |  :   |        |    |           |--->| Memory |
        |        |    |        |  :   |        |    |           |    |        |
        +--------+    +--------+  :   +--------+    |           |    |        |
                                  :                 | Cache     |    +--------+
                                  :                 | Coherency |
                                  :                 | Mechanism |    +--------+
        +--------+    +--------+  :   +--------+    |           |    |        |
        |        |    |        |  :   |        |    |           |    |        |
        |  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
        |  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
        |        |    | Queue  |  :   |        |    |           |    |        |
        +--------+    +--------+  :   +--------+    +-----------+    +--------+
                                  :
                                  :

Although any particular load or store may not actually appear outside of the
CPU that issued it since it may have been satisfied within the CPU's own cache,
it will still appear as if the full memory access had taken place as far as the
other CPUs are concerned since the cache coherency mechanisms will migrate the
cacheline over to the accessing CPU and propagate the effects upon conflict.

The CPU core may execute instructions in any order it deems fit, provided the
expected program causality appears to be maintained. Some of the instructions
generate load and store operations which then go into the queue of memory
accesses to be performed. The core may place these in the queue in any order
it wishes, and continue execution until it is forced to wait for an instruction
to complete.

What memory barriers are concerned with is controlling the order in which
accesses cross from the CPU side of things to the memory side of things, and
the order in which the effects are perceived to happen by the other observers
in the system.

[!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
their own loads and stores as if they had happened in program order.

[!] MMIO or other device accesses may bypass the cache system. This depends on
the properties of the memory window through which devices are accessed and/or
the use of any special device communication instructions the CPU may have.


CACHE COHERENCY
---------------

Life isn't quite as simple as it may appear above, however: for while the
caches are expected to be coherent, there's no guarantee that that coherency
will be ordered. This means that whilst changes made on one CPU will
eventually become visible on all CPUs, there's no guarantee that they will
become apparent in the same order on those other CPUs.


Consider dealing with a system that has a pair of CPUs (1 & 2), each of which
has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D):

                    :
                    :                          +--------+
                    :      +---------+         |        |
        +--------+  :  +-->| Cache A |<------->|        |
        |        |  :  |   +---------+         |        |
        |  CPU 1 |<----+                       |        |
        |        |  :  |   +---------+         |        |
        +--------+  :  +-->| Cache B |<------->|        |
                    :      +---------+         |        |
                    :                          | Memory |
                    :      +---------+         | System |
        +--------+  :  +-->| Cache C |<------->|        |
        |        |  :  |   +---------+         |        |
        |  CPU 2 |<----+                       |        |
        |        |  :  |   +---------+         |        |
        +--------+  :  +-->| Cache D |<------->|        |
                    :      +---------+         |        |
                    :                          +--------+
                    :

Imagine the system has the following properties:

 (*) an odd-numbered cache line may be in cache A, cache C or it may still be
     resident in memory;

 (*) an even-numbered cache line may be in cache B, cache D or it may still be
     resident in memory;

 (*) whilst the CPU core is interrogating one cache, the other cache may be
     making use of the bus to access the rest of the system - perhaps to
     displace a dirty cacheline or to do a speculative load;

 (*) each cache has a queue of operations that need to be applied to that cache
     to maintain coherency with the rest of the system;

 (*) the coherency queue is not flushed by normal loads to lines already
     present in the cache, even though the contents of the queue may
     potentially affect those loads.

Imagine, then, that two writes are made on the first CPU, with a write barrier
between them to guarantee that they will appear to reach that CPU's caches in
the requisite order:

        CPU 1           CPU 2           COMMENT
        =============== =============== =======================================
                                        u == 0, v == 1 and p == &u, q == &u
        v = 2;
        smp_wmb();                      Make sure change to v is visible before
                                         change to p
        <A:modify v=2>                  v is now in cache A exclusively
        p = &v;
        <B:modify p=&v>                 p is now in cache B exclusively

The write memory barrier forces the other CPUs in the system to perceive that
the local CPU's caches have apparently been updated in the correct order. But
now imagine that the second CPU wants to read those values:

        CPU 1           CPU 2           COMMENT
        =============== =============== =======================================
        ...
                        q = p;
                        x = *q;

The above pair of reads may then fail to happen in the expected order, as the
cacheline holding p may get updated in one of the second CPU's caches whilst
the update to the cacheline holding v is delayed in the other of the second
CPU's caches by some other cache event:

        CPU 1           CPU 2           COMMENT
        =============== =============== =======================================
                                        u == 0, v == 1 and p == &u, q == &u
        v = 2;
        smp_wmb();
        <A:modify v=2>  <C:busy>
                        <C:queue v=2>
        p = &v;         q = p;
                        <D:request p>
        <B:modify p=&v> <D:commit p=&v>
                        <D:read p>
                        x = *q;
                        <C:read *q>     Reads from v before v updated in cache
                        <C:unbusy>
                        <C:commit v=2>

Basically, whilst both cachelines will be updated on CPU 2 eventually, there's
no guarantee that, without intervention, the order of update will be the same
as that committed on CPU 1.


To intervene, we need to interpolate a data dependency barrier or a read
barrier between the loads. This will force the cache to commit its coherency
queue before processing any further requests:

        CPU 1           CPU 2           COMMENT
        =============== =============== =======================================
                                        u == 0, v == 1 and p == &u, q == &u
        v = 2;
        smp_wmb();
        <A:modify v=2>  <C:busy>
                        <C:queue v=2>
        p = &v;         q = p;
                        <D:request p>
        <B:modify p=&v> <D:commit p=&v>
                        <D:read p>
                        smp_read_barrier_depends()
                        <C:unbusy>
                        <C:commit v=2>
                        x = *q;
                        <C:read *q>     Reads from v after v updated in cache


This sort of problem can be encountered on DEC Alpha processors as they have a
split cache that improves performance by making better use of the data bus.
Whilst most CPUs do imply a data dependency barrier on the read when a memory
access depends on a read, not all do, so it may not be relied on.

Other CPUs may also have split caches, but must coordinate between the various
cachelets for normal memory accesses. The semantics of the Alpha remove the
need for coordination in the absence of memory barriers.

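This pointer-publication pattern can be mimicked in user space with C11
atomics. Release/acquire ordering is stronger than the smp_wmb() plus data
dependency barrier pairing above, but the shape is the same; run_demo() is a
made-up harness, not a kernel interface:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>
#include <stddef.h>

static int v = 1;
static _Atomic(int *) p;        /* starts out NULL */

static void *writer(void *arg)
{
        (void)arg;
        v = 2;                  /* plain store */
        /* Release: orders the store to v before the store to p, much as
         * smp_wmb() before the pointer assignment does in the example. */
        atomic_store_explicit(&p, &v, memory_order_release);
        return NULL;
}

static void *reader(void *arg)
{
        int *q;

        (void)arg;
        /* Acquire: a thread that sees p == &v is also guaranteed to see
         * v == 2, the property the data dependency barrier restores. */
        while ((q = atomic_load_explicit(&p, memory_order_acquire)) == NULL)
                ;
        assert(*q == 2);
        return NULL;
}

static int run_demo(void)
{
        pthread_t w, r;

        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return v;
}
```

With memory_order_relaxed in the reader the assertion could fail on a
sufficiently weak machine; the acquire load is what rules that out.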

CACHE COHERENCY VS DMA
----------------------

Not all systems maintain cache coherency with respect to devices doing DMA. In
such cases, a device attempting DMA may obtain stale data from RAM because
dirty cache lines may be resident in the caches of various CPUs, and may not
have been written back to RAM yet. To deal with this, the appropriate part of
the kernel must flush the overlapping bits of cache on each CPU (and maybe
invalidate them as well).

In addition, the data DMA'd to RAM by a device may be overwritten by dirty
cache lines being written back to RAM from a CPU's cache after the device has
installed its own data, or cache lines simply present in a CPU's cache may
obscure the fact that RAM has been updated, until such time as the cacheline
is discarded from the CPU's cache and reloaded. To deal with this, the
appropriate part of the kernel must invalidate the overlapping bits of the
cache on each CPU.

See Documentation/cachetlb.txt for more information on cache management.


CACHE COHERENCY VS MMIO
-----------------------

Memory mapped I/O usually takes place through memory locations that are part of
a window in the CPU's memory space that has different properties assigned than
the usual RAM-directed window.

Amongst these properties is usually the fact that such accesses bypass the
caching entirely and go directly to the device buses. This means MMIO accesses
may, in effect, overtake accesses to cached memory that were emitted earlier.
A memory barrier isn't sufficient in such a case, but rather the cache must be
flushed between the cached memory write and the MMIO access if the two are in
any way dependent.


=========================
THE THINGS CPUS GET UP TO
=========================

A programmer might take it for granted that the CPU will perform memory
operations in exactly the order specified, so that if a CPU is, for example,
given the following piece of code to execute:

        a = *A;
        *B = b;
        c = *C;
        d = *D;
        *E = e;

they would then expect that the CPU will complete the memory operation for each
instruction before moving on to the next one, leading to a definite sequence of
operations as seen by external observers in the system:

        LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E.


Reality is, of course, much messier. With many CPUs and compilers, the above
assumption doesn't hold because:

 (*) loads are more likely to need to be completed immediately to permit
     execution progress, whereas stores can often be deferred without a
     problem;

 (*) loads may be done speculatively, and the result discarded should it prove
     to have been unnecessary;

 (*) loads may be done speculatively, leading to the result having been
     fetched at the wrong time in the expected sequence of events;

 (*) the order of the memory accesses may be rearranged to promote better use
     of the CPU buses and caches;

 (*) loads and stores may be combined to improve performance when talking to
     memory or I/O hardware that can do batched accesses of adjacent locations,
     thus cutting down on transaction setup costs (memory and PCI devices may
     both be able to do this); and

 (*) the CPU's data cache may affect the ordering, and whilst cache-coherency
     mechanisms may alleviate this - once the store has actually hit the cache
     - there's no guarantee that the coherency management will be propagated in
     order to other CPUs.

So what another CPU, say, might actually observe from the above piece of code
is:

        LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B

        (Where "LOAD {*C,*D}" is a combined load)


However, it is guaranteed that a CPU will be self-consistent: it will see its
_own_ accesses appear to be correctly ordered, without the need for a memory
barrier. For instance with the following code:

        U = *A;
        *A = V;
        *A = W;
        X = *A;
        *A = Y;
        Z = *A;

and assuming no intervention by an external influence, it can be assumed that
the final result will appear to be:

        U == the original value of *A
        X == W
        Z == Y
        *A == Y

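That guarantee is easy to check in ordinary C; here plain variables stand in
for *A and the values chosen for V, W and Y are arbitrary:

```c
#include <assert.h>

static int check_self_consistency(void)
{
        int A = 10;                     /* the original value of *A */
        const int V = 11, W = 12, Y = 13;
        int U, X, Z;

        U = A;          /* U = *A */
        A = V;          /* *A = V */
        A = W;          /* *A = W */
        X = A;          /* X = *A */
        A = Y;          /* *A = Y */
        Z = A;          /* Z = *A */

        /* The self-consistency guarantee described above. */
        assert(U == 10 && X == W && Z == Y && A == Y);
        return A;
}
```

Another CPU watching memory might see fewer accesses than this, but this CPU's
own view always satisfies the assertions.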
The code above may cause the CPU to generate the full sequence of memory
accesses:

        U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A

in that order, but, without intervention, the sequence may have almost any
combination of elements combined or discarded, provided the program's view of
the world remains consistent.

The compiler may also combine, discard or defer elements of the sequence before
the CPU even sees them.

For instance:

        *A = V;
        *A = W;

may be reduced to:

        *A = W;

since, without a write barrier, it can be assumed that the effect of the
storage of V to *A is lost. Similarly:

        *A = Y;
        Z = *A;

may, without a memory barrier, be reduced to:

        *A = Y;
        Z = Y;

and the LOAD operation never appears outside of the CPU.


AND THEN THERE'S THE ALPHA
--------------------------

The DEC Alpha CPU is one of the most relaxed CPUs there is. Not only that,
some versions of the Alpha CPU have a split data cache, permitting them to have
two semantically related cache lines updated at separate times. This is where
the data dependency barrier really becomes necessary as this synchronises both
caches with the memory coherence system, thus making it seem like pointer
changes vs new data occur in the right order.

The Alpha defines the Linux kernel's memory barrier model.

See the subsection on "Cache Coherency" above.


==========
REFERENCES
==========

Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek,
Digital Press)
        Chapter 5.2: Physical Address Space Characteristics
        Chapter 5.4: Caches and Write Buffers
        Chapter 5.5: Data Sharing
        Chapter 5.6: Read/Write Ordering

AMD64 Architecture Programmer's Manual Volume 2: System Programming
        Chapter 7.1: Memory-Access Ordering
        Chapter 7.4: Buffering and Combining Memory Writes

IA-32 Intel Architecture Software Developer's Manual, Volume 3:
System Programming Guide
        Chapter 7.1: Locked Atomic Operations
        Chapter 7.2: Memory Ordering
        Chapter 7.4: Serializing Instructions

The SPARC Architecture Manual, Version 9
        Chapter 8: Memory Models
        Appendix D: Formal Specification of the Memory Models
        Appendix J: Programming with the Memory Models

UltraSPARC Programmer Reference Manual
        Chapter 5: Memory Accesses and Cacheability
        Chapter 15: Sparc-V9 Memory Models

UltraSPARC III Cu User's Manual
        Chapter 9: Memory Models

UltraSPARC IIIi Processor User's Manual
        Chapter 8: Memory Models

UltraSPARC Architecture 2005
        Chapter 9: Memory
        Appendix D: Formal Specifications of the Memory Models

UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
        Chapter 8: Memory Models
        Appendix F: Caches and Cache Coherency

Solaris Internals, Core Kernel Architecture, p63-68:
        Chapter 3.3: Hardware Considerations for Locks and
                     Synchronization

Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
for Kernel Programmers:
        Chapter 13: Other Memory Models

Intel Itanium Architecture Software Developer's Manual: Volume 1:
        Section 2.6: Speculation
        Section 4.4: Memory Access