Hardware Memory Consistency Model


Page 1

Hardware Memory Consistency Model

Page 2

Memory Model

• Defines the set of values that a read in a program is allowed to return.

• Sequential Consistency is the most intuitive memory model, but most hardware and compilers do not implement it, for performance reasons.

• Language memory model: defines what optimizations / instruction rewriting / reordering a compiler can do (for a particular architecture).

• Hardware memory model: defines what optimizations / instruction reordering an architecture implementation is allowed to do (that are visible to the programmer).

Page 3

Example 1: Store Buffering

DOI:10.1145/1785414.1785443

JULY 2010 | VOL. 53 | NO. 7 | COMMUNICATIONS OF THE ACM 89

x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors
By Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen

Abstract
Exploiting the multiprocessors that have recently become ubiquitous requires high-performance and reliable concurrent systems code, for concurrent data structures, operating system kernels, synchronization libraries, compilers, and so on. However, concurrent programming, which is always challenging, is made much more so by two problems. First, real multiprocessors typically do not provide the sequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have relaxed memory models, varying in subtle ways between processor families, in which different hardware threads may have only loosely consistent views of a shared memory. Second, the public vendor architectures, supposedly specifying what programmers can rely on, are often in ambiguous informal prose (a particularly poor medium for loose specifications), leading to widespread confusion.

In this paper we focus on x86 processors. We review several recent Intel and AMD specifications, showing that all contain serious ambiguities, some are arguably too weak to program above, and some are simply unsound with respect to actual hardware. We present a new x86-TSO programmer’s model that, to the best of our knowledge, suffers from none of these problems. It is mathematically precise (rigorously defined in HOL4) but can be presented as an intuitive abstract machine which should be widely accessible to working programmers. We illustrate how this can be used to reason about the correctness of a Linux spinlock implementation and describe a general theory of data-race freedom for x86-TSO. This should put x86 multiprocessor system building on a more solid foundation; it should also provide a basis for future work on verification of such systems.

1. INTRODUCTION
Multiprocessor machines, with many processors acting on a shared memory, have been developed since the 1960s; they are now ubiquitous. Meanwhile, the difficulty of programming concurrent systems has motivated extensive research on programming language design, semantics, and verification, from semaphores and monitors to program logics, software model checking, and so forth. This work has almost always assumed that concurrent threads share a single sequentially consistent memory,21 with their reads and writes interleaved in some order. In fact, however, real multiprocessors use sophisticated techniques to achieve high performance: store buffers, hierarchies of local cache, speculative execution, etc. These optimizations are not observable by sequential code, but in multithreaded programs different threads may see subtly different views of memory; such machines exhibit relaxed, or weak, memory models.6, 7, 17, 19

For a simple example, consider the following assembly language program (SB) for modern Intel or AMD x86 multiprocessors: given two distinct memory locations x and y (initially holding 0), if two processors respectively write 1 to x and y and then read from y and x (into register EAX on processor 0 and EBX on processor 1), it is possible for both to read 0 in the same execution. It is easy to check that this result cannot arise from any interleaving of the reads and writes of the two processors; modern x86 multiprocessors do not have a sequentially consistent semantics.

Microarchitecturally, one can view this particular example as a visible consequence of store buffering: if each processor effectively has a FIFO buffer of pending memory writes (to avoid the need to block while a write completes), then the reads from y and x could occur before the writes have propagated from the buffers to main memory.

Other families of multiprocessors, dating back at least to the IBM 370, and including ARM, Itanium, POWER, and SPARC, also exhibit relaxed-memory behavior. Moreover, there are major and subtle differences between different processor families (arising from their different internal design choices): in the details of exactly what non-sequentially-consistent executions they permit, and of what memory barrier and synchronization instructions they provide to let the programmer regain control.

For any of these processors, relaxed-memory behavior exacerbates the difficulties of writing concurrent software, as systems programmers cannot reason, at the level of abstraction of memory reads and writes, in terms of an intuitive concept of global time.

This paper is based on work that first appeared in the Proceedings of the 36th SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 2009, and in the Proceedings of the 22nd International Conference on Theorem Proving in Higher-Order Logics (TPHOLs), 2009.

SB
Proc 0          Proc 1
MOV [x]←1       MOV [y]←1
MOV EAX←[y]     MOV EBX←[x]

Allowed Final State: Proc 0:EAX=0 ∧ Proc 1:EBX=0

Allowed?

Yes for most existing architectures (AFAIK).
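As a sanity check, the claim "no interleaving can yield 0/0" can be brute-forced. The sketch below (our illustration, not from the paper; the helper names are ours) enumerates every sequentially consistent interleaving of the two processors' operations and collects the final register values, confirming that EAX=0 and EBX=0 together is unreachable under SC, so observing it on hardware proves the machine is not sequentially consistent.

```python
# Proc 0: MOV [x]<-1; MOV EAX<-[y].  Proc 1: MOV [y]<-1; MOV EBX<-[x].
P0 = [("W", "x", 1), ("R", "y", "EAX")]
P1 = [("W", "y", 1), ("R", "x", "EBX")]

def interleavings(a, b):
    # All merges of the two programs that preserve each program's order.
    if not a: yield list(b); return
    if not b: yield list(a); return
    for rest in interleavings(a[1:], b): yield [a[0]] + rest
    for rest in interleavings(a, b[1:]): yield [b[0]] + rest

def run(schedule):
    mem, regs = {"x": 0, "y": 0}, {}
    for op in schedule:
        if op[0] == "W": mem[op[1]] = op[2]   # write goes straight to memory
        else:            regs[op[2]] = mem[op[1]]  # read comes from memory
    return regs

results = {tuple(sorted(run(s).items())) for s in interleavings(P0, P1)}
print((("EAX", 0), ("EBX", 0)) in results)  # False: SC forbids the 0/0 outcome
```

Only three outcomes (0/1, 1/0, 1/1) are reachable under sequential consistency; the 0/0 outcome requires the store buffering described on the next slide.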

Page 4

Hardware Reordering

∙  The processor can issue stores faster than the network can handle them ⇒ store buffer.

∙  Since a load may stall the processor until it is satisfied, loads take priority, bypassing the store buffer.

∙  If a load address matches an address in the store buffer, the store buffer returns the result.

∙  Thus, a load can bypass a store to a different address.

[Figure: Load Bypass — the processor issues stores into a store buffer sitting between it and the network/memory system; loads can bypass buffered stores to other addresses.]
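The four points above can be sketched as a toy store buffer with load forwarding (a hypothetical model of the mechanism, not vendor code): stores are queued in FIFO order, a load returns the newest buffered value for its own address, and a load to any other address bypasses the queue and reads memory directly.

```python
class StoreBuffer:
    """Toy FIFO store buffer with load forwarding (illustrative sketch)."""
    def __init__(self, memory):
        self.memory = memory    # shared dict: address -> value
        self.pending = []       # FIFO of (address, value) pairs

    def store(self, addr, value):
        self.pending.append((addr, value))  # buffered, not yet globally visible

    def load(self, addr):
        # A load first checks the store buffer; the newest matching write wins.
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        # Otherwise it bypasses buffered stores to *different* addresses.
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Propagate the oldest buffered store to memory.
        if self.pending:
            a, v = self.pending.pop(0)
            self.memory[a] = v

mem = {"x": 0, "y": 0}
sb = StoreBuffer(mem)
sb.store("x", 1)
print(sb.load("x"))  # 1: forwarded from the store buffer
print(sb.load("y"))  # 0: read from memory, bypassing the store to x
print(mem["x"])      # 0: the store has not yet reached memory
```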

Page 5

Example 2: Independent Reads of Independent Writes (IRIW)

Allowed?

Depends on the architecture.


and following some testing, IRIW is not observable in practice, even without MFENCEs. It appears that some JVM implementations depend on this fact, and would not be correct if one assumed only the IWP/AMD3.14/x86-CC architecture.15

Second, more seriously, x86-CC and IWP are unsound with respect to current processors. The following example, n6, due to Paul Loewenstein [personal communication, Nov. 2008] shows a behavior that is observable (e.g., on an Intel Core 2 Duo) but that is disallowed by x86-CC and by any interpretation we can make of IWP principles P1, 2, 4 and 6.27, A.5

To see why this could be allowed by multiprocessors with FIFO store buffers, suppose that first the Proc 1 write of [y]=2 is buffered, then Proc 0 buffers its write of [x]=1, reads [x]=1 from its own store buffer, and reads [y]=0 from main memory, then Proc 1 buffers its [x]=2 write and flushes its buffered [y]=2 and [x]=2 writes to memory, then finally Proc 0 flushes its [x]=1 write to memory.
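The narrated scenario can be replayed step by step. The sketch below (our illustration, using plain Python lists as the two FIFO store buffers) follows exactly the sequence described and lands in the n6 final state:

```python
# Replaying the buffered-write scenario narrated above (a sketch, not the
# paper's formal model): two FIFO store buffers over one shared memory.
mem = {"x": 0, "y": 0}
buf0, buf1 = [], []            # FIFO store buffers for Proc 0 and Proc 1

def read(buf, addr):
    # Newest buffered write to addr, else shared memory.
    for a, v in reversed(buf):
        if a == addr:
            return v
    return mem[addr]

buf1.append(("y", 2))          # Proc 1 buffers its write of [y]=2
buf0.append(("x", 1))          # Proc 0 buffers its write of [x]=1
EAX = read(buf0, "x")          # Proc 0 reads [x]=1 from its own store buffer
EBX = read(buf0, "y")          # Proc 0 reads [y]=0 from main memory
buf1.append(("x", 2))          # Proc 1 buffers its write of [x]=2
while buf1:                    # Proc 1 flushes [y]=2 and then [x]=2
    a, v = buf1.pop(0); mem[a] = v
while buf0:                    # finally Proc 0 flushes [x]=1
    a, v = buf0.pop(0); mem[a] = v

print(EAX, EBX, mem["x"])      # 1 0 1 -- the n6 allowed final state
```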

The AMD3.14 manual is not expressed in terms of a clearly identified set of principles, and the main text (vol. 2, §7.2) leaves the ordering of stores to a single location unconstrained, though elsewhere the manual describes a microarchitecture with store buffers and cache protocols that strongly implies that memory is coherent. In the absence of an analogue of the IWP P6, the reasoning prohibiting n6 does not carry over.

2.3. Intel SDM rev. 29–34 (Nov. 2008–Mar. 2010)
The most recent substantial change to the Intel memory-model specification, at the time of writing, was in revision 29 of the Intel SDM (revisions 29–34 are essentially identical except for the LFENCE text). This is in a similar informal-prose style to previous versions, again supported by litmus tests, but is significantly different to IWP/x86-CC/AMD3.14. First, the IRIW final state above is forbidden,5, Example 8–7, vol. 3A and the previous coherence condition: “P6. In a multiprocessor system, stores to the same location have a total order” has been replaced by: “Any two stores are seen in a consistent order by processors other than those performing the stores” (we label this P9).

Second, the memory barrier instructions are now included. It is stated that reads and writes cannot pass MFENCE instructions, together with more refined properties for SFENCE and LFENCE.

Third, same-processor writes are now explicitly ordered: “Writes by a single processor are observed in the same order by all processors” (P10) (we regarded this as implicit in the IWP “P2. Stores are not reordered with other stores”).

This revision appears to deal with the unsoundness, admitting the n6 behavior above, but, unfortunately, it is still problematic. The first issue is, again, how to interpret “causality”

unsupported by any examples. It is hard to see precisely what this prose means, especially without additional knowledge or assumptions about the microarchitecture of particular implementations. The uncertainty about x86 behavior that at least some systems programmers had about earlier IA-32 processors can be gauged from an extensive discussion about the correctness of a proposed optimization to a Linux spinlock implementation.1 The discussion is largely in microarchitectural terms, not just in terms of the specified architecture, and seems to have been resolved only with input from Intel staff. We return to this optimization in Section 4, where we can explain why it is sound with respect to x86-TSO.

2.2. IWP/AMD3.14/x86-CC
In August 2007, an Intel White Paper4 (IWP) gave a somewhat more precise model, with 8 informal-prose principles P1–P8 supported by 10 examples (known as litmus tests). This was incorporated, essentially unchanged, into later revisions of the Intel SDM (including rev. 26–28), and AMD gave similar, though not identical, prose and tests in rev. 3.14 of their manual3, vol. 2, §7.2 (AMD3.14). These are essentially causal-consistency models,9 and they allow different processors to see writes to independent locations in different orders, as in the IRIW litmus test11 below.a AMD3.14 allows this explicitly, while IWP allows it implicitly, as IRIW is not ruled out by the stated principles. Microarchitecturally, IRIW can arise

from store buffers that are shared between some but not all processors. However, both require that, in some sense, causality is respected, as in the IWP principle “P5. In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility).”

We used these informal specifications as the basis for a formal model, x86-CC,31 for which a key issue was giving a reasonable interpretation to this “causality,” which is not defined in IWP or AMD3.14. Apart from that, the informal specifications were reasonably unambiguous—but they turned out to have two serious flaws.

First, they are arguably rather weak for programmers. In particular, they admit the IRIW behavior above but, under reasonable assumptions on the strongest x86 memory barrier, MFENCE, adding MFENCEs would not suffice to recover sequential consistency (instead, one would have to make liberal use of x86 LOCK’d instructions).31, §2.12 Here, the specifications seem to be much looser than the behavior of implemented processors: to the best of our knowledge,

a We use Intel assembly syntax throughout except that we use an arrow ← to indicate the direction of data flow, so MOV [x]←1 is a write of 1 to address x and MOV EAX←[x] is a read from address x into register EAX. Initial states are all 0 unless otherwise specified.

IRIW
Proc 0        Proc 1        Proc 2         Proc 3
MOV [x]←1     MOV [y]←1     MOV EAX←[x]    MOV ECX←[y]
                            MOV EBX←[y]    MOV EDX←[x]

Forbidden Final State: Proc 2:EAX=1 ∧ Proc 2:EBX=0 ∧ Proc 3:ECX=1 ∧ Proc 3:EDX=0

n6
Proc 0         Proc 1
MOV [x]←1      MOV [y]←2
MOV EAX←[x]    MOV [x]←2
MOV EBX←[y]

Allowed Final State: Proc 0:EAX=1 ∧ Proc 0:EBX=0 ∧ [x]=1

Page 6

Lecture Today

We are going to look at two different flavors of hardware memory model:

• x86-TSO: implemented by major x86 processors.

• POWER relaxed memory model: implemented by POWER and Arm processors.

Page 7

Relevant x86 Instructions

• loads and stores: e.g., mov [x] <- 2    // store 2 into memory at address x

• fences: e.g., SFENCE, LFENCE, MFENCE

• locked instructions: e.g., lock; inc [EAX], xchg

Page 8

The Abstract Machine Model for x86-TSO


Figure 1. x86-TSO block diagram: hardware threads, each with its own FIFO write buffer, sharing a single memory protected by a global lock.

The behavior of the storage subsystem is described in more detail below, but the main points are:

• The store buffers are FIFO and a reading thread must read its most recent buffered write, if there is one, to that address; otherwise reads are satisfied from shared memory.

• An MFENCE instruction flushes the store buffer of that thread.

• To execute a LOCK’d instruction, a thread must first obtain the global lock. At the end of the instruction, it flushes its store buffer and relinquishes the lock. While the lock is held by one thread, no other thread can read.

• A buffered write from a thread can propagate to the shared memory at any time except when some other thread holds the lock.

More precisely, the possible interactions between the threads and the storage subsystem are described by the following events:

• Wp[a]=u, for a write of value u to address a by thread p
• Rp[a]=u, for a read of u from a by thread p
• Fp, for an MFENCE memory barrier by thread p
• Lp, at the start of a LOCK’d instruction by thread p
• Up, at the end of a LOCK’d instruction by thread p
• τp, for an internal action of the storage subsystem, propagating a write from p’s store buffer to the shared memory

For example, suppose a particular hardware thread p has come to the instruction INC [56] (which adds 1 to the value at address 56), and p’s store buffer contains a single write to 56, of value 0. In one execution we might see read and write events, Rp[56]=0 and Wp[56]=1, followed by two τp events as the two writes propagate to shared memory. Another execution might start with the write of 0 propagating to shared memory, where it could be overwritten by another thread. Executions of LOCK;INC [56] would be similar but bracketed by Lp and Up events.

The behavior of the storage subsystem is specified by the following rules, where we define a hardware thread to be blocked if the storage subsystem lock is taken by another hardware thread, i.e., while another hardware thread is executing a LOCK’d instruction.

1. Rp[a]=u: p can read u from memory at address a if p is not blocked, there are no writes to a in p’s store buffer, and the memory does contain u at a.

2. Rp[a]=u: p can read u from its store buffer for address a if p is not blocked and has u as the newest write to a in its buffer.

3. Wp[a]=u: p can write u to its store buffer for address a at any time.

4. τp: if p is not blocked, it can silently dequeue the oldest write from its store buffer and place the value in memory at the given address, without coordinating with any hardware thread.

5. Fp: if p’s store buffer is empty, it can execute an MFENCE (note that if a hardware thread encounters an MFENCE instruction when its store buffer is not empty, it can take one or more τp steps to empty the buffer and proceed, and similarly in 7 below).

6. Lp: if the lock is not held, it can begin a LOCK’d instruction.

7. Up: if p holds the lock, and its store buffer is empty, it can end a LOCK’d instruction.

Technically, the formal versions of these rules27 define a labeled transition system (with the events as labels) for the storage subsystem, and we define the behavior of the whole system as a parallel composition of that and transition systems for each thread, synchronizing on the non-τ labels as in CCS.25

Additionally, we tentatively impose a progress condition, that each memory write is eventually propagated from the relevant store buffer to the shared memory. This is not stated in the documentation and is hard to test. We are assured that it holds at least for AMD processors.

For write-back cacheable memory, and the fragment of the instruction set that we consider, we treat LFENCE and SFENCE semantically as no-ops. This follows the Intel and AMD documentation, both of which imply that these fences do not order store/load pairs, which are the only reorderings allowed in x86-TSO. Note, though, that elsewhere it is stated that the Intel SFENCE flushes the store buffer.5, vol.3A, §11.10

3.2. Litmus tests
For our introductory SB example from Section 1, x86-TSO permits the given behavior for the same reasons as set forth there. For each of the examples in Section 2 (IRIW, n6, and n5/n4b), x86-TSO permits the given final state if and only if it is observable in our testing of actual processors, i.e., for IRIW it is forbidden (in contrast to IWP and AMD3.14), for n6 it is allowed (in contrast to IWP), and for n5/n4b it is forbidden (in contrast to the Intel SDM rev. 29–34). For all the other relevant tests from the current Intel and AMD manuals the stated behavior agrees with x86-TSO. We now go through Examples 8–1 to 8–10 from rev. 34 of the Intel SDM, and the three other tests from AMD3.15, and explain the x86-TSO behavior in each case.

• Instructions are "committed" in order.
• Loads are served from the store buffer if the address matches; otherwise they are served from memory.
• A load is committed once its value is fixed.
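Putting these rules together, the SB litmus test can be exhaustively explored under a TSO-style abstract machine (a simplified sketch of the model above, ignoring the lock since SB has no LOCK'd instructions): at each step a processor either executes its next instruction (stores go into its FIFO buffer; loads forward from the buffer or read memory) or takes a τ step draining its oldest buffered write. Unlike under sequential consistency, the 0/0 outcome is reachable.

```python
# Proc 0: MOV [x]<-1; MOV EAX<-[y].  Proc 1: MOV [y]<-1; MOV EBX<-[x].
PROGS = (
    (("W", "x", 1), ("R", "y", "EAX")),
    (("W", "y", 1), ("R", "x", "EBX")),
)

def explore():
    # state = (program counters, store buffers, memory, registers), all tuples
    init = ((0, 0), ((), ()), (("x", 0), ("y", 0)), ())
    seen, stack, finals = {init}, [init], set()
    while stack:
        pcs, bufs, mem, regs = stack.pop()
        succs = []
        for p in (0, 1):
            if pcs[p] < len(PROGS[p]):           # execute p's next instruction
                op = PROGS[p][pcs[p]]
                npcs = tuple(pc + 1 if i == p else pc for i, pc in enumerate(pcs))
                if op[0] == "W":                 # store: enqueue in p's FIFO buffer
                    nbufs = tuple(b + ((op[1], op[2]),) if i == p else b
                                  for i, b in enumerate(bufs))
                    succs.append((npcs, nbufs, mem, regs))
                else:                            # load: forward from buffer, else memory
                    hits = [v for a, v in bufs[p] if a == op[1]]
                    val = hits[-1] if hits else dict(mem)[op[1]]
                    succs.append((npcs, bufs, mem, regs + ((op[2], val),)))
            if bufs[p]:                          # tau step: drain oldest buffered write
                a, v = bufs[p][0]
                nbufs = tuple(b[1:] if i == p else b for i, b in enumerate(bufs))
                nmem = tuple((k, v if k == a else old) for k, old in mem)
                succs.append((pcs, nbufs, nmem, regs))
        if not succs:                            # both done and buffers drained
            finals.add(tuple(sorted(regs)))
        for s in succs:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return finals

outcomes = explore()
print((("EAX", 0), ("EBX", 0)) in outcomes)  # True: x86-TSO permits the 0/0 outcome
```

All four register outcomes (0/0, 0/1, 1/0, 1/1) are reachable once the store buffers are in play.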

Page 9

The Abstract Machine Model for x86-TSO


• A store is committed once it is written to the store buffer.
• The store buffers are FIFO.
• An MFENCE flushes the store buffer of that thread.

Page 10

The Abstract Machine Model for x86-TSO


• To execute a LOCK’d instruction, a thread must first obtain the global lock. At the end of the instruction, it flushes its store buffer and relinquishes the lock.

• While the lock is held by one thread, all other threads are blocked.
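Under these rules, a LOCK'd read-modify-write such as LOCK;INC is atomic: no other thread can act between the Lp and Up events, and the buffered result is flushed before the lock is released. A minimal sketch of that discipline (our illustration, using a Python threading.Lock to play the role of the global lock):

```python
import threading

mem = {"counter": 0}
global_lock = threading.Lock()  # stands in for the machine's global lock

def locked_inc(addr, buf):
    with global_lock:            # Lp: begin the LOCK'd instruction
        val = mem[addr]          # read (store buffer assumed empty here)
        buf.append((addr, val + 1))  # buffered write of the incremented value
        while buf:               # flush the store buffer before releasing
            a, v = buf.pop(0)
            mem[a] = v
        # Up: end of the LOCK'd instruction; lock released on exit

threads = [threading.Thread(target=locked_inc, args=("counter", []))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(mem["counter"])  # 4: no increment is lost
```

Without the lock, two threads could both read the old value before either flush reached memory and one increment would be lost; the Lp/Up bracketing rules out exactly that interleaving.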

Page 11

The Abstract Machine Model for x86-TSO

JULY 2010 | VOL. 53 | NO. 7 | COMMUNICATIONS OF THE ACM 93

executing a LOCK’d instruction.

1. Rp[a]=u: p can read u from memory at address a if p is not blocked, there are no writes to a in p’s store buffer, and the memory does contain u at a.

2. Rp[a]=u: p can read u from its store buffer for address a if p is not blocked and has u as the newest write to a in its buffer.

3. Wp[a]=u: p can write u to its store buffer for address a at any time.

4. tp: if p is not blocked, it can silently dequeue the oldest write from its store buffer and place the value in mem-ory at the given address, without coordinating with any hardware thread.

5. Fp: if p’s store buffer is empty, it can execute an MFENCE (note that if a hardware thread encounters an MFENCE instruction when its store buffer is not empty, it can take one or more tp steps to empty the buffer and proceed, and similarly in 7 below).

6. Lp: if the lock is not held, it can begin a LOCK’d instruction.

7. Up: if p holds the lock, and its store buffer is empty, it can end a LOCK’d instruction.

Technically, the formal versions of these rules27 define a labeled transition system (with the events as labels) for the storage subsystem, and we define the behavior of the whole system as a parallel composition of that and transition sys-tems for each thread, synchronizing on the non-t labels as in CCS.25

Additionally, we tentatively impose a progress condition, that each memory write is eventually propagated from the relevant store buffer to the shared memory. This is not stated in the documentation and is hard to test. We are assured that it holds at least for AMD processors.

For write-back cacheable memory, and the fragment of the instruction set that we consider, we treat LFENCE and SFENCE semantically as no-ops. This follows the Intel and AMD documentation, both of which imply that these fences do not order store/load pairs, which are the only reorderings allowed in x86-TSO. Note, though, that elsewhere it is stated that the Intel SFENCE flushes the store buffer [5, vol. 3A, §11.10].

3.2. Litmus tests
For our introductory SB example from Section 1, x86-TSO permits the given behavior for the same reasons as set forth there. For each of the examples in Section 2 (IRIW, n6, and n5/n4b), x86-TSO permits the given final state if and only if it is observable in our testing of actual processors, i.e., for IRIW it is forbidden (in contrast to IWP and AMD3.14), for n6 it is allowed (in contrast to IWP), and for n5/n4b it is forbidden (in contrast to the Intel SDM rev. 29–34). For all the other relevant tests from the current Intel and AMD manuals the stated behavior agrees with x86-TSO. We now go through Examples 8–1 to 8–10 from rev. 34 of the Intel SDM, and the three other tests from AMD3.15, and explain the x86-TSO behavior in each case.


The behavior of the storage subsystem is described in more detail below, but the main points are:

• The store buffers are FIFO and a reading thread must read its most recent buffered write, if there is one, to that address; otherwise reads are satisfied from shared memory.

• An MFENCE instruction flushes the store buffer of that thread.

• To execute a LOCK'd instruction, a thread must first obtain the global lock. At the end of the instruction, it flushes its store buffer and relinquishes the lock. While the lock is held by one thread, no other thread can read.

• A buffered write from a thread can propagate to the shared memory at any time except when some other thread holds the lock.

More precisely, the possible interactions between the threads and the storage subsystem are described by the following events:

• Wp[a]=u, for a write of value u to address a by thread p
• Rp[a]=u, for a read of u from a by thread p
• Fp, for an MFENCE memory barrier by thread p
• Lp, at the start of a LOCK'd instruction by thread p
• Up, at the end of a LOCK'd instruction by thread p
• τp, for an internal action of the storage subsystem, propagating a write from p's store buffer to the shared memory

For example, suppose a particular hardware thread p has come to the instruction INC [56] (which adds 1 to the value at address 56), and p's store buffer contains a single write to 56, of value 0. In one execution we might see read and write events, Rp[56]=0 and Wp[56]=1, followed by two τp events as the two writes propagate to shared memory. Another execution might start with the write of 0 propagating to shared memory, where it could be overwritten by another thread. Executions of LOCK;INC [56] would be similar but bracketed by Lp and Up events.
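These rules are small enough to animate directly. The sketch below is a toy encoding of rules 1 to 4 (the lock and MFENCE are omitted), replaying the two INC [56] executions just described. Class and method names are mine, not the paper's formal HOL4 definitions.

```python
# Minimal sketch of the x86-TSO storage subsystem: per-thread FIFO store
# buffers in front of a single shared memory (rules 1-4; the global lock
# and MFENCE are omitted). Illustrative names, not the paper's model.
class TSOStorage:
    def __init__(self, nthreads):
        self.mem = {}                              # shared memory, default 0
        self.buf = [[] for _ in range(nthreads)]   # one FIFO buffer per thread

    def write(self, p, a, u):                      # Wp[a]=u: buffer the store
        self.buf[p].append((a, u))

    def read(self, p, a):                          # Rp[a]=u: newest buffered
        for addr, u in reversed(self.buf[p]):      # write to a wins, else memory
            if addr == a:
                return u
        return self.mem.get(a, 0)

    def dequeue(self, p):                          # tau_p: oldest write -> memory
        a, u = self.buf[p].pop(0)
        self.mem[a] = u

# Execution 1: p's buffer already holds a write of 0 to address 56.
m = TSOStorage(2)
m.write(0, 56, 0)
r = m.read(0, 56)            # Rp[56]=0, forwarded from p's own buffer
m.write(0, 56, r + 1)        # Wp[56]=1: INC buffers the incremented value
m.dequeue(0); m.dequeue(0)   # two tau_p events propagate both writes
assert m.mem[56] == 1

# Execution 2: the buffered 0 reaches memory first and another thread
# overwrites it before p's INC reads.
m2 = TSOStorage(2)
m2.write(0, 56, 0)
m2.dequeue(0)                # the 0 propagates to shared memory
m2.write(1, 56, 7)           # another thread stores 7 ...
m2.dequeue(1)                # ... and its write reaches memory
r2 = m2.read(0, 56)          # p now reads 7
m2.write(0, 56, r2 + 1)
m2.dequeue(0)
assert m2.mem[56] == 8
```

Both runs are deterministic replays of event sequences the abstract machine allows; the nondeterminism of the real machine is in which sequence occurs.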


• When a thread is blocked, it cannot read (even from its own store buffer), and its buffered writes cannot propagate to the shared memory.

Page 12

The Abstract Machine Model for x86-TSO


• The shared memory is access-atomic (reads and writes do not interleave at the bit level) and multiple-copy atomic (once a store reaches memory, it is visible to all threads).

Page 13

x86-TSO from a Thread's Perspective

1. Loads are not reordered with loads.
2. Stores are not reordered with stores.
3. Stores are not reordered with prior loads.
4. A load may be reordered with a prior store to a different location, but not with a prior store to the same location.
5. Loads and stores are not reordered with locked instructions.
6. Stores to the same location are coherent.
7. Stores are transitively visible.
8. Locked instructions respect a global total order.
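Rule 4's same-location exception is a consequence of store forwarding: a load consults the thread's own buffer before shared memory, so it can never observe a stale value for a location to which it has a buffered store, even while loads from other locations slip ahead. A minimal sketch (illustrative names, not real hardware):

```python
# Store-to-load forwarding: a load checks the local FIFO store buffer
# before shared memory, so it cannot be reordered past an older store
# to the SAME location (rule 4), while a load from a DIFFERENT location
# can still complete before the store reaches memory.
class Thread:
    def __init__(self, mem):
        self.mem, self.buf = mem, []       # shared memory dict, private buffer

    def store(self, a, u):
        self.buf.append((a, u))            # goes into the buffer, not memory

    def load(self, a):
        for addr, u in reversed(self.buf): # forward the newest buffered write
            if addr == a:
                return u
        return self.mem.get(a, 0)

mem = {}
t = Thread(mem)
t.store("x", 1)        # MOV [x]<-1 (still only in the buffer)
same = t.load("x")     # MOV EAX<-[x]: must see 1 (Example 8-4 behavior)
other = t.load("y")    # a load from y completes while [x]=1 is still
                       # buffered; this is the SB-style reordering
assert same == 1 and mem.get("x", 0) == 0 and other == 0
```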

Page 14

Loads May Be Reordered with Older Stores

DOI:10.1145/1785414.1785443

JULY 2010 | VOL. 53 | NO. 7 | COMMUNICATIONS OF THE ACM 89

x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors
By Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen

Abstract
Exploiting the multiprocessors that have recently become ubiquitous requires high-performance and reliable concurrent systems code, for concurrent data structures, operating system kernels, synchronization libraries, compilers, and so on. However, concurrent programming, which is always challenging, is made much more so by two problems. First, real multiprocessors typically do not provide the sequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have relaxed memory models, varying in subtle ways between processor families, in which different hardware threads may have only loosely consistent views of a shared memory. Second, the public vendor architectures, supposedly specifying what programmers can rely on, are often in ambiguous informal prose (a particularly poor medium for loose specifications), leading to widespread confusion.

In this paper we focus on x86 processors. We review several recent Intel and AMD specifications, showing that all contain serious ambiguities, some are arguably too weak to program above, and some are simply unsound with respect to actual hardware. We present a new x86-TSO programmer's model that, to the best of our knowledge, suffers from none of these problems. It is mathematically precise (rigorously defined in HOL4) but can be presented as an intuitive abstract machine which should be widely accessible to working programmers. We illustrate how this can be used to reason about the correctness of a Linux spinlock implementation and describe a general theory of data-race freedom for x86-TSO. This should put x86 multiprocessor system building on a more solid foundation; it should also provide a basis for future work on verification of such systems.

1. INTRODUCTION
Multiprocessor machines, with many processors acting on a shared memory, have been developed since the 1960s; they are now ubiquitous. Meanwhile, the difficulty of programming concurrent systems has motivated extensive research on programming language design, semantics, and verification, from semaphores and monitors to program logics, software model checking, and so forth. This work has almost always assumed that concurrent threads share a single sequentially consistent memory [21], with their reads and writes interleaved in some order. In fact, however, real multiprocessors use sophisticated techniques to achieve high performance: store buffers, hierarchies of local cache, speculative execution,

etc. These optimizations are not observable by sequential code, but in multithreaded programs different threads may see subtly different views of memory; such machines exhibit relaxed, or weak, memory models.6, 7, 17, 19

For a simple example, consider the following assembly language program (SB) for modern Intel or AMD x86 multiprocessors: given two distinct memory locations x and y (initially holding 0), if two processors respectively write 1 to x and y and then read from y and x (into register EAX on processor 0 and EBX on processor 1), it is possible for both to read 0 in the same execution. It is easy to check that this result cannot arise from any interleaving of the reads and writes of the two processors; modern x86 multiprocessors do not have a sequentially consistent semantics.

Microarchitecturally, one can view this particular example as a visible consequence of store buffering: if each processor effectively has a FIFO buffer of pending memory writes (to avoid the need to block while a write completes), then the reads from y and x could occur before the writes have propagated from the buffers to main memory.

Other families of multiprocessors, dating back at least to the IBM 370, and including ARM, Itanium, POWER, and SPARC, also exhibit relaxed-memory behavior. Moreover, there are major and subtle differences between different processor families (arising from their different internal design choices): in the details of exactly what non-sequentially-consistent executions they permit, and of what memory barrier and synchronization instructions they provide to let the programmer regain control.

For any of these processors, relaxed-memory behavior exacerbates the difficulties of writing concurrent software, as systems programmers cannot reason, at the level of abstraction of memory reads and writes, in terms of an intuitive concept of global time.

This paper is based on work that first appeared in the Proceedings of the 36th SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 2009, and in the Proceedings of the 22nd International Conference on Theorem Proving in Higher-Order Logics (TPHOLs), 2009.

SB
Proc 0              Proc 1
MOV [x]←1           MOV [y]←1
MOV EAX←[y]         MOV EBX←[x]
Allowed Final State: Proc 0:EAX=0 ∧ Proc 1:EBX=0
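The SB behavior can also be checked mechanically: for this program the store-buffer machine is finite, so one can enumerate every interleaving of instruction and buffer-flush events. The sketch below is my own encoding (illustrative names, not the paper's HOL4 definitions); setting sc=True makes stores bypass the buffers, which recovers sequential consistency for comparison.

```python
def outcomes(progs, sc=False):
    # progs[p]: list of ('W', addr, val) or ('R', addr, reg) for thread p.
    # Enumerates every interleaving of instructions and buffer-flush events
    # of a toy two-thread store-buffer machine; with sc=True, stores bypass
    # the buffers, modeling sequential consistency instead.
    results, seen = set(), set()

    def explore(pc, bufs, mem, regs):
        state = (pc, bufs, tuple(sorted(mem.items())),
                 tuple(sorted(regs.items())))
        if state in seen:
            return
        seen.add(state)
        if pc[0] == len(progs[0]) and pc[1] == len(progs[1]) and not any(bufs):
            results.add((regs.get('EAX', 0), regs.get('EBX', 0)))
            return
        for p in (0, 1):
            if bufs[p]:                         # tau_p: drain the oldest write
                (a, u), rest = bufs[p][0], bufs[p][1:]
                nb = (rest, bufs[1]) if p == 0 else (bufs[0], rest)
                explore(pc, nb, {**mem, a: u}, regs)
            if pc[p] < len(progs[p]):
                op = progs[p][pc[p]]
                npc = (pc[0] + (p == 0), pc[1] + (p == 1))
                if op[0] == 'W':
                    _, a, u = op
                    if sc:                      # SC: store hits memory at once
                        explore(npc, bufs, {**mem, a: u}, regs)
                    else:                       # TSO: store enters FIFO buffer
                        nb = ((bufs[0] + ((a, u),), bufs[1]) if p == 0
                              else (bufs[0], bufs[1] + ((a, u),)))
                        explore(npc, nb, mem, regs)
                else:                           # load: own buffer, else memory
                    _, a, r = op
                    v = next((u for x, u in reversed(bufs[p]) if x == a),
                             mem.get(a, 0))
                    explore(npc, bufs, mem, {**regs, r: v})

    explore((0, 0), ((), ()), {}, {})
    return results

SB = ([('W', 'x', 1), ('R', 'y', 'EAX')],
      [('W', 'y', 1), ('R', 'x', 'EBX')])
tso_sb = outcomes(SB)
sc_sb = outcomes(SB, sc=True)
assert (0, 0) in tso_sb     # both reads can see 0 with store buffers
assert (0, 0) not in sc_sb  # but not under sequential consistency
```

With buffers, EAX=EBX=0 appears among the reachable outcomes; under the sequentially consistent variant it does not, matching the discussion above.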

Page 15

Loads Are Not Reordered with Older Stores to the Same Location

94 COMMUNICATIONS OF THE ACM | JULY 2010 | VOL. 53 | NO. 7

research highlights

Example 8–7. Stores Are Seen in a Consistent Order by Other Processors. This test rules out the IRIW behavior as described in Section 2.2. x86-TSO forbids the given final state because the Proc 2 constraints imply that x was written to shared memory before y, whereas the Proc 3 constraints imply that y was written to shared memory before x.

Example 8–8. Locked Instructions Have a Total Order. This is the same as the IRIW Example 8–7 but with LOCK'd instructions for the writes; x86-TSO forbids the final state for the same reason as above.

Example 8–9. Loads Are Not Reordered with Locks.

This test indicates that locking both writes in Example 8–3 would forbid the nonsequentially consistent result. x86-TSO forbids the final state because LOCK'd instructions flush the local store buffer. If only one write were LOCK'd (say the write to x), the Example 8–3 final state would be allowed as follows: on Proc 1, buffer the write to y and execute the read of x, then on Proc 0 write to x in shared memory then read from y.

Example 8–10. Stores Are Not Reordered with Locks. This is implied by Example 8–1, as we treat the memory writes of LOCK'd instructions as stores.

Test amd5.

For x86-TSO, this test has the same force as Example 8–8, but using MFENCE instructions to flush the buffers instead of LOCK'd instructions. The tenth AMD test is similar. None of the Intel litmus tests include fence instructions.

In x86-TSO adding MFENCE between every instruction would clearly suffice to regain sequential consistency (though obviously in practice one would insert fewer barriers), in contrast to IWP/x86-CC/AMD3.14.

3.3. Empirical testing
To build confidence that we have a sound model of the behavior of actual x86 processors we have tested the

For completeness we repeat the Intel SDM short descriptions of these tests, e.g. "stores are not reordered with other stores," but note that "not reordered with" is not defined there and is open to misinterpretation [27, §3.2].

Example 8–1. Stores Are Not Reordered with Other Stores.

This test implies that the writes by Proc 0 are seen in order by Proc 1’s reads, which also execute in order. x86-TSO forbids the final state because Proc 0’s store buffer is FIFO, and Proc 0 communicates with Proc 1 only through shared memory.

Example 8–2. Stores Are Not Reordered with Older Loads.

x86-TSO forbids the final state because reads are never delayed.

Example 8–3. Loads May Be Reordered with Older Stores. This test is just the SB example from Section 1, which x86-TSO permits. The third AMD test (amd3) is similar but with additional writes inserted in the middle of each thread, of 2 to x and y respectively.

Example 8–4. Loads Are Not Reordered with Older Stores to the Same Location.

x86-TSO requires the specified result because reads must check the local store buffer.

Example 8–5. Intra-Processor Forwarding Is Allowed. This test is similar to Example 8–3.

Example 8–6. Stores Are Transitively Visible.

x86-TSO forbids the given final state because otherwise the Proc 2 constraints imply that y was written to shared memory before x. Hence the write to x must be in Proc 0’s store buffer (or the instruction has not executed), when the write to y is initiated. Note that this test contains the only mention of “transitive visibility” in the Intel SDM, leaving its meaning unclear.

Example 8–1:
Proc 0              Proc 1
MOV [x]←1           MOV EAX←[y]
MOV [y]←1           MOV EBX←[x]
Forbidden Final State: Proc 1:EAX=1 ∧ Proc 1:EBX=0

Example 8–2:
Proc 0              Proc 1
MOV EAX←[x]         MOV EBX←[y]
MOV [y]←1           MOV [x]←1
Forbidden Final State: Proc 0:EAX=1 ∧ Proc 1:EBX=1

Example 8–4:
Proc 0
MOV [x]←1
MOV EAX←[x]
Required Final State: Proc 0:EAX=1

Example 8–6:
Proc 0              Proc 1              Proc 2
MOV [x]←1           MOV EAX←[x]         MOV EBX←[y]
                    MOV [y]←1           MOV ECX←[x]
Forbidden Final State: Proc 1:EAX=1 ∧ Proc 2:EBX=1 ∧ Proc 2:ECX=0

Example 8–9:
Proc 0              Proc 1
XCHG [x]←EAX        XCHG [y]←ECX
MOV EBX←[y]         MOV EDX←[x]
Initial state: Proc 0:EAX=1 ∧ Proc 1:ECX=1 (elsewhere 0)
Forbidden Final State: Proc 0:EBX=0 ∧ Proc 1:EDX=0

Example 8–10:
Proc 0              Proc 1
XCHG [x]←EAX        MOV EBX←[y]
MOV [y]←1           MOV ECX←[x]
Initial state: Proc 0:EAX=1 (elsewhere 0)
Forbidden Final State: Proc 1:EBX=1 ∧ Proc 1:ECX=0

Test amd5:
Proc 0              Proc 1
MOV [x]←1           MOV [y]←1
MFENCE              MFENCE
MOV EAX←[y]         MOV EBX←[x]
Forbidden Final State: Proc 0:EAX=0 ∧ Proc 1:EBX=0
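Test amd5 (the MFENCE variant of SB) can be checked exhaustively by adding a fence event to a toy store-buffer machine: following rule 5, an MFENCE may fire only once the issuing thread's buffer is empty. This is an illustrative sketch, not the paper's formal model.

```python
def outcomes_with_fences(progs):
    # progs[p]: list of ('W', addr, val), ('R', addr, reg) or ('F',) ops.
    # Toy x86-TSO machine: per-thread FIFO buffers, one shared memory.
    # An 'F' (MFENCE) may only execute once the thread's buffer is empty.
    results, seen = set(), set()

    def explore(pc, bufs, mem, regs):
        state = (pc, bufs, tuple(sorted(mem.items())),
                 tuple(sorted(regs.items())))
        if state in seen:
            return
        seen.add(state)
        if pc[0] == len(progs[0]) and pc[1] == len(progs[1]) and not any(bufs):
            results.add((regs.get('EAX', 0), regs.get('EBX', 0)))
            return
        for p in (0, 1):
            if bufs[p]:                         # tau_p: drain the oldest write
                (a, u), rest = bufs[p][0], bufs[p][1:]
                nb = (rest, bufs[1]) if p == 0 else (bufs[0], rest)
                explore(pc, nb, {**mem, a: u}, regs)
            if pc[p] < len(progs[p]):
                op = progs[p][pc[p]]
                npc = (pc[0] + (p == 0), pc[1] + (p == 1))
                if op[0] == 'W':                # store enters the FIFO buffer
                    _, a, u = op
                    nb = ((bufs[0] + ((a, u),), bufs[1]) if p == 0
                          else (bufs[0], bufs[1] + ((a, u),)))
                    explore(npc, nb, mem, regs)
                elif op[0] == 'R':              # load: forward from own buffer
                    _, a, r = op
                    v = next((u for x, u in reversed(bufs[p]) if x == a),
                             mem.get(a, 0))
                    explore(npc, bufs, mem, {**regs, r: v})
                elif not bufs[p]:               # 'F': blocked until drained
                    explore(npc, bufs, mem, regs)

    explore((0, 0), ((), ()), {}, {})
    return results

AMD5 = ([('W', 'x', 1), ('F',), ('R', 'y', 'EAX')],
        [('W', 'y', 1), ('F',), ('R', 'x', 'EBX')])
fenced = outcomes_with_fences(AMD5)
assert (0, 0) not in fenced   # the SB outcome disappears with both fences
assert (1, 1) in fenced
```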

Page 16

Stores Are Transitively Visible

Example 8–6. Stores Are Transitively Visible.

x86-TSO forbids the given final state because otherwise the Proc 2 constraints imply that y was written to shared memory before x. Hence the write to x must be in Proc 0's store buffer (or the instruction has not executed) when the write to y is initiated. Note that this test contains the only mention of "transitive visibility" in the Intel SDM, leaving its meaning unclear.

Proc 0              Proc 1              Proc 2
MOV [x]←1           MOV EAX←[x]         MOV EBX←[y]
                    MOV [y]←1           MOV ECX←[x]
Forbidden Final State: Proc 1:EAX=1 ∧ Proc 2:EBX=1 ∧ Proc 2:ECX=0

Page 17

LOCK'd Instructions Have a Total Order

Vol. 3A 8-11

MULTIPLE-PROCESSOR MANAGEMENT

8.2.3.7 Stores Are Seen in a Consistent Order by Other Processors

As noted in Section 8.2.3.5, the memory-ordering model allows stores by two processors to be seen in different orders by those two processors. However, any two stores must appear to execute in the same order to all processors other than those performing the stores. This is illustrated by the following example:

By the principles discussed in Section 8.2.3.2,

• processor 2’s first and second load cannot be reordered,

• processor 3’s first and second load cannot be reordered.

• If r1 = 1 and r2 = 0, processor 0’s store appears to precede processor 1’s store with respect to processor 2.

• Similarly, r3 = 1 and r4 = 0 imply that processor 1's store appears to precede processor 0's store with respect to processor 3.

Because the memory-ordering model ensures that any two stores appear to execute in the same order to all processors (other than those performing the stores), this set of return values is not allowed.
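The argument above can be checked exhaustively on a store-buffer model: with one FIFO buffer per processor draining into a single shared memory, no interleaving produces r1 = 1, r2 = 0, r3 = 1, r4 = 0. The encoding below is my own sketch of such a check (illustrative names):

```python
def enumerate_outcomes(progs):
    # progs[p]: list of ('W', addr, val) or ('R', addr, reg) for processor p.
    # Explores every interleaving of instructions and buffer-flush events of
    # a toy machine with one FIFO store buffer per processor.
    n, results, seen = len(progs), set(), set()

    def explore(pc, bufs, mem, regs):
        state = (pc, bufs, tuple(sorted(mem.items())),
                 tuple(sorted(regs.items())))
        if state in seen:
            return
        seen.add(state)
        if all(pc[p] == len(progs[p]) for p in range(n)) and not any(bufs):
            results.add(tuple(regs.get(r, 0) for r in ('r1', 'r2', 'r3', 'r4')))
            return
        for p in range(n):
            if bufs[p]:                          # flush p's oldest buffered write
                a, u = bufs[p][0]
                explore(pc, bufs[:p] + (bufs[p][1:],) + bufs[p+1:],
                        {**mem, a: u}, regs)
            if pc[p] < len(progs[p]):
                op = progs[p][pc[p]]
                npc = pc[:p] + (pc[p] + 1,) + pc[p+1:]
                if op[0] == 'W':                 # store enters p's FIFO buffer
                    _, a, u = op
                    explore(npc, bufs[:p] + (bufs[p] + ((a, u),),) + bufs[p+1:],
                            mem, regs)
                else:                            # load: own buffer, else memory
                    _, a, r = op
                    v = next((u for x, u in reversed(bufs[p]) if x == a),
                             mem.get(a, 0))
                    explore(npc, bufs, mem, {**regs, r: v})

    explore((0,) * n, ((),) * n, {}, {})
    return results

EX87 = ([('W', 'x', 1)],                          # processor 0
        [('W', 'y', 1)],                          # processor 1
        [('R', 'x', 'r1'), ('R', 'y', 'r2')],     # processor 2
        [('R', 'y', 'r3'), ('R', 'x', 'r4')])     # processor 3
res = enumerate_outcomes(EX87)
assert (1, 0, 1, 0) not in res   # the Example 8-7 outcome is unreachable
assert (1, 1, 1, 1) in res
```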

8.2.3.8 Locked Instructions Have a Total Order

The memory-ordering model ensures that all processors agree on a single execution order of all locked instructions, including those that are larger than 8 bytes or are not naturally aligned. This is illustrated by the following example:

Processor 2 and processor 3 must agree on the order of the two executions of XCHG. Without loss of generality, suppose that processor 0’s XCHG occurs first.

• If r5 = 1, processor 1’s XCHG into y occurs before processor 3’s load from y.

• Because the Intel-64 memory-ordering model prevents loads from being reordered (see Section 8.2.3.2), processor 3’s loads occur in order and, therefore, processor 1’s XCHG occurs before processor 3’s load from x.

• Since processor 0’s XCHG into x occurs before processor 1’s XCHG (by assumption), it occurs before processor 3’s load from x. Thus, r6 = 1.

A similar argument (referring instead to processor 2’s loads) applies if processor 1’s XCHG occurs before processor 0’s XCHG.

8.2.3.9 Loads and Stores Are Not Reordered with Locked Instructions
The memory-ordering model prevents loads and stores from being reordered with locked instructions that execute earlier or later. The examples in this section illustrate only cases in which a locked instruction is executed before a load or a store.

Example 8-7. Stores Are Seen in a Consistent Order by Other Processors
Processor 0     Processor 1     Processor 2     Processor 3
mov [_x], 1     mov [_y], 1     mov r1, [_x]    mov r3, [_y]
                                mov r2, [_y]    mov r4, [_x]
Initially x = y = 0
r1 = 1, r2 = 0, r3 = 1, r4 = 0 is not allowed

Example 8-8. Locked Instructions Have a Total Order
Processor 0     Processor 1     Processor 2     Processor 3
xchg [_x], r1   xchg [_y], r2   mov r3, [_x]    mov r5, [_y]
                                mov r4, [_y]    mov r6, [_x]
Initially r1 = r2 = 1, x = y = 0
r3 = 1, r4 = 0, r5 = 1, r6 = 0 is not allowed
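On the x86-TSO abstract machine these guarantees fall out of the LOCK'd-instruction rules: an XCHG must take the global lock, drain the issuing thread's store buffer, and then perform its read and write directly against shared memory. A minimal single-thread sketch (illustrative names; lock contention is elided since only one thread runs):

```python
# Sketch of a LOCK'd XCHG on the abstract machine: the instruction drains
# the thread's store buffer, then swaps register and memory atomically.
# Illustrative model only; the global lock itself is elided (one thread).
class TSOThread:
    def __init__(self, mem):
        self.mem, self.buf = mem, []

    def store(self, a, u):
        self.buf.append((a, u))              # ordinary MOV: buffered

    def flush(self):
        while self.buf:                      # tau events until buffer empty
            a, u = self.buf.pop(0)
            self.mem[a] = u

    def xchg(self, a, regval):               # LOCK'd instruction
        self.flush()                         # older stores reach memory first,
        old = self.mem.get(a, 0)             # then an atomic read-modify-write
        self.mem[a] = regval
        return old

mem = {}
t = TSOThread(mem)
t.store("y", 1)            # MOV [y]<-1, sits in the store buffer
old_x = t.xchg("x", 1)     # XCHG [x]<-EAX (EAX=1) drains the buffer first
# By the time the locked swap is globally visible, [y]=1 already is too,
# so stores cannot be reordered past the locked instruction.
assert mem["y"] == 1 and mem["x"] == 1 and old_x == 0
```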

Page 18

Is  this  Allowed?  

Allowed? No. Stores are seen in a consistent order by other processors.

JULY 2010 | VOL. 53 | NO. 7 | COMMUNICATIONS OF THE ACM 91

and following some testing, IRIW is not observable in practice, even without MFENCEs. It appears that some JVM implementations depend on this fact, and would not be correct if one assumed only the IWP/AMD3.14/x86-CC architecture.15

Second, more seriously, x86-CC and IWP are unsound with respect to current processors. The following example, n6, due to Paul Loewenstein [personal communication, Nov. 2008] shows a behavior that is observable (e.g., on an Intel Core 2 Duo) but that is disallowed by x86-CC and by any interpretation we can make of IWP principles P1, 2, 4 and 6 [27, A.5].

To see why this could be allowed by multiprocessors with FIFO store buffers, suppose that first the Proc 1 write of [y]=2 is buffered, then Proc 0 buffers its write of [x]=1, reads [x]=1 from its own store buffer, and reads [y]=0 from main memory, then Proc 1 buffers its [x]=2 write and flushes its buffered [y]=2 and [x]=2 writes to memory, then finally Proc 0 flushes its [x]=1 write to memory.

The AMD3.14 manual is not expressed in terms of a clearly identified set of principles, and the main text (vol. 2, §7.2) leaves the ordering of stores to a single location unconstrained, though elsewhere the manual describes a microarchitecture with store buffers and cache protocols that strongly implies that memory is coherent. In the absence of an analogue of the IWP P6, the reasoning prohibiting n6 does not carry over.

2.3. Intel SDM rev. 29–34 (Nov. 2008–Mar. 2010)
The most recent substantial change to the Intel memory-model specification, at the time of writing, was in revision 29 of the Intel SDM (revisions 29–34 are essentially identical except for the LFENCE text). This is in a similar informal-prose style to previous versions, again supported by litmus tests, but is significantly different to IWP/x86-CC/AMD3.14. First, the IRIW final state above is forbidden [5, Example 8–7, vol. 3A], and the previous coherence condition: "P6. In a multiprocessor system, stores to the same location have a total order" has been replaced by: "Any two stores are seen in a consistent order by processors other than those performing the stores" (we label this P9).

Second, the memory barrier instructions are now included. It is stated that reads and writes cannot pass MFENCE instructions, together with more refined properties for SFENCE and LFENCE.

Third, same-processor writes are now explicitly ordered: “Writes by a single processor are observed in the same order by all processors” (P10) (we regarded this as implicit in the IWP “P2. Stores are not reordered with other stores”).

This revision appears to deal with the unsoundness, admitting the n6 behavior above, but, unfortunately, it is still problematic. The first issue is, again, how to interpret "causality"

unsupported by any examples. It is hard to see precisely what this prose means, especially without additional knowledge or assumptions about the microarchitecture of particular implementations. The uncertainty about x86 behavior that at least some systems programmers had about earlier IA-32 processors can be gauged from an extensive discussion about the correctness of a proposed optimization to a Linux spinlock implementation.1 The discussion is largely in microarchitectural terms, not just in terms of the specified architecture, and seems to have been resolved only with input from Intel staff. We return to this optimization in Section 4, where we can explain why it is sound with respect to x86-TSO.

2.2. IWP/AMD3.14/x86-CC
In August 2007, an Intel White Paper [4] (IWP) gave a somewhat more precise model, with 8 informal-prose principles P1–P8 supported by 10 examples (known as litmus tests). This was incorporated, essentially unchanged, into later revisions of the Intel SDM (including rev. 26–28), and AMD gave similar, though not identical, prose and tests in rev. 3.14 of their manual [3, vol. 2, §7.2] (AMD3.14). These are essentially causal-consistency models [9], and they allow different processors to see writes to independent locations in different orders, as in the IRIW litmus test [11] below.a AMD3.14 allows this explicitly, while IWP allows it implicitly, as IRIW is not ruled out by the stated principles. Microarchitecturally, IRIW can arise

from store buffers that are shared between some but not all processors. However, both require that, in some sense, causality is respected, as in the IWP principle "P5. In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility)."

We used these informal specifications as the basis for a formal model, x86-CC,31 for which a key issue was giving a reasonable interpretation to this “causality,” which is not defined in IWP or AMD3.14. Apart from that, the informal specifications were reasonably unambiguous—but they turned out to have two serious flaws.

First, they are arguably rather weak for programmers. In particular, they admit the IRIW behavior above but, under reasonable assumptions on the strongest x86 memory barrier, MFENCE, adding MFENCEs would not suffice to recover sequential consistency (instead, one would have to make liberal use of x86 LOCK’d instructions).31, §2.12 Here, the specifications seem to be much looser than the behavior of implemented processors: to the best of our knowledge,

a We use Intel assembly syntax throughout except that we use an arrow ← to indicate the direction of data flow, so MOV [x]←1 is a write of 1 to address x and MOV EAX←[x] is a read from address x into register EAX. Initial states are all 0 unless otherwise specified.

IRIW
Proc 0         Proc 1         Proc 2          Proc 3
MOV [x]←1      MOV [y]←1      MOV EAX←[x]     MOV EBX←[y]
                              MOV ECX←[y]     MOV EDX←[x]

Forbidden Final State: Proc 2:EAX=1 ∧ Proc 2:EBX=0 ∧ Proc 3:ECX=1 ∧ Proc 3:EDX=0

n6
Proc 0         Proc 1
MOV [x]←1      MOV [y]←2
MOV EAX←[x]    MOV [x]←2
MOV EBX←[y]

Allowed Final State: Proc 0:EAX=1 ∧ Proc 0:EBX=0 ∧ [x]=1

Page 19

Is this Allowed?


and following some testing, IRIW is not observable in practice, even without MFENCEs. It appears that some JVM implementations depend on this fact, and would not be correct if one assumed only the IWP/AMD3.14/x86-CC architecture.15

Second, more seriously, x86-CC and IWP are unsound with respect to current processors. The following example, n6, due to Paul Loewenstein [personal communication, Nov. 2008] shows a behavior that is observable (e.g., on an Intel Core 2 Duo) but that is disallowed by x86-CC and by any interpretation we can make of IWP principles P1, 2, 4 and 6.27, A.5

To see why this could be allowed by multiprocessors with FIFO store buffers, suppose that first the Proc 1 write of [y]=2 is buffered, then Proc 0 buffers its write of [x]=1, reads [x]=1 from its own store buffer, and reads [y]=0 from main memory, then Proc 1 buffers its [x]=2 write and flushes its buffered [y]=2 and [x]=2 writes to memory, then finally Proc 0 flushes its [x]=1 write to memory.

The AMD3.14 manual is not expressed in terms of a clearly identified set of principles, and the main text (vol. 2, §7.2) leaves the ordering of stores to a single location unconstrained, though elsewhere the manual describes a microarchitecture with store buffers and cache protocols that strongly implies that memory is coherent. In the absence of an analogue of the IWP P6, the reasoning prohibiting n6 does not carry over.

2.3. Intel SDM rev. 29–34 (Nov. 2008–Mar. 2010)

The most recent substantial change to the Intel memory-model specification, at the time of writing, was in revision 29 of the Intel SDM (revisions 29–34 are essentially identical except for the LFENCE text). This is in a similar informal-prose style to previous versions, again supported by litmus tests, but is significantly different to IWP/x86-CC/AMD3.14. First, the IRIW final state above is forbidden,5, Example 8–7, vol. 3A and the previous coherence condition: "P6. In a multiprocessor system, stores to the same location have a total order" has been replaced by: "Any two stores are seen in a consistent order by processors other than those performing the stores" (we label this P9).

Second, the memory barrier instructions are now included. It is stated that reads and writes cannot pass MFENCE instructions, together with more refined properties for SFENCE and LFENCE.


Allowed?

Yes! The read of EAX can be served via the store buffer (SB), and MOV [x]←1 gets flushed out of the SB after MOV [x]←2 gets flushed.
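That store-buffer explanation can be replayed concretely. The Python sketch below is my own illustration of the mechanism (the Core class and event sequence are not from the paper): two cores with FIFO store buffers in front of a shared memory, stepped through the exact event order described for n6.

```python
class Core:
    """A core with a FIFO store buffer in front of shared memory (toy model)."""
    def __init__(self, mem):
        self.mem = mem        # shared memory, a dict
        self.sb = []          # FIFO store buffer of (addr, value) pairs

    def store(self, addr, val):
        self.sb.append((addr, val))       # stores enter the buffer first

    def load(self, addr):
        # forward from the newest buffered store to this address, if any
        for a, v in reversed(self.sb):
            if a == addr:
                return v
        return self.mem[addr]             # otherwise read shared memory

    def flush_one(self):
        a, v = self.sb.pop(0)             # oldest store drains to memory
        self.mem[a] = v

mem = {"x": 0, "y": 0}
p0, p1 = Core(mem), Core(mem)

p1.store("y", 2)                  # Proc 1 buffers [y]=2
p0.store("x", 1)                  # Proc 0 buffers [x]=1
eax = p0.load("x")                # 1, forwarded from Proc 0's own buffer
ebx = p0.load("y")                # 0, from memory ([y]=2 is still buffered)
p1.store("x", 2)                  # Proc 1 buffers [x]=2
p1.flush_one(); p1.flush_one()    # [y]=2 then [x]=2 reach memory
p0.flush_one()                    # [x]=1 drains last, overwriting [x]=2

print(eax, ebx, mem["x"])         # 1 0 1 — the allowed n6 final state
```

Every step is legal for a FIFO store buffer, so the final state EAX=1 ∧ EBX=0 ∧ [x]=1 is reachable, exactly as x86-TSO (but not x86-CC) predicts.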

Page 20

Power Memory Model

•  Instructions are committed out of order.

[Figure: abstract-machine storage subsystem — threads Thread 1–5, each with its own copy of memory (Memory 1–5); reads (R) and writes (W) flow between the threads and their memories.]

We speak of the collection of all the memories and their interconnect (i.e., everything except the threads) as the storage subsystem.

For the thread-local out-of-order (and speculative) execution, in general we can think of each thread, at any point in time, as having a tree of the committed and in-flight instruction instances. Newly fetched instructions become in-flight, and later, subject to appropriate preconditions, can be committed. For example, below we show a set of instruction instances {i1, . . . , i13} with the program-order-successor relation among them. Three of those ({i1, i3, i4}, boxed) have been committed; the remainder are in-flight.

[Figure: tree of instruction instances i1–i13 related by program-order-successor; i1, i3, and i4 (boxed) are committed, the rest in-flight; the branches i5 and i9 each have two fetched successor paths.]

Instruction instances i5 and i9 are branches for which the thread has fetched multiple possible successors; here just two, but a branch with a computed address might in principle fetch many possible successors. A typical implementation might well explore at most one speculative path at a time. Note that the committed instances are not necessarily contiguous: here i3 and i4 have been committed even though i2 has not, which can only happen if they are sufficiently independent. When a branch is committed then any un-taken alternative paths are discarded, and instructions that follow (in program order) an uncommitted branch cannot be committed until that branch is, so the tree must be linear before any committed (boxed) instructions.

For a read instruction, as soon as an address for the read is known, the read might be satisfied, binding its value to one received from the local memory (or in some cases forwarded from earlier in the thread). That value could immediately be used by later instructions in the thread that depend on it, but it and they are subject to being restarted or (if this is a speculative path) aborted until the read is committed.

For a write instruction, the key points are when the address and value become determined. After that (subject to other conditions) the write can be committed, sent to the local memory; this is not subject to restart or abort. After that, the write might propagate to other threads, becoming readable by them.


Committed instructions vs. a speculative branch: the tree must be linear before committed (boxed) instructions.

Page 21

Power Memory Model

•  The memory satisfies access atomicity but not multiple-copy atomicity.


Once a store makes it to a thread's memory, it may not propagate to all other threads' memory at once.
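A minimal sketch of that point (my own toy model, not the tutorial's formal one): give each thread its own copy of memory and propagate a write to one copy at a time, so two observers can disagree about the same location.

```python
class Storage:
    """Each thread has its own copy of memory; writes propagate copy by copy."""
    def __init__(self, n):
        self.copies = [{"x": 0} for _ in range(n)]

    def write(self, tid, addr, val):
        self.copies[tid][addr] = val              # writer sees it immediately

    def propagate(self, src, dst, addr):
        self.copies[dst][addr] = self.copies[src][addr]

    def read(self, tid, addr):
        return self.copies[tid][addr]

s = Storage(3)
s.write(0, "x", 1)       # Thread 0's write reaches its own memory...
s.propagate(0, 1, "x")   # ...then Thread 1's copy, but not yet Thread 2's
print(s.read(1, "x"), s.read(2, "x"))   # 1 0 — the two observers disagree
```

This is exactly the behavior a multiple-copy-atomic machine (such as x86-TSO's single shared memory) rules out.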

Page 22

Is this Allowed?

words, the desired behaviour is that if the read of y saw 1 then the read of x must not have seen 0. Or, equivalently, the desired behaviour is that final outcomes in which r1=1 and r2=0 should be forbidden.

MP Pseudocode
Thread 0    Thread 1
x=1         r1=y
y=1         r2=x
Initial state: x=0 ∧ y=0
Forbidden?: 1:r1=1 ∧ 1:r2=0

A litmus test such as this comprises a small multithreaded program, with a defined initial state and with a constraint on their final state that picks out the potential executions of interest. Given that, for any architecture we can ask whether such an execution is allowed or forbidden; we can also run the test (in a test harness [AMSS11a]) on particular processor implementations to see whether it is observed or not observed.

Throughout this document we use the term "thread" to refer to hardware threads on SMT machines and processors on non-SMT machines. Assuming a correctly implemented scheduler (with appropriate barriers at context switches) it should be sound to think of software threads in the same way.

3.1.2 Observed Behaviour

In a sequentially consistent model, that final outcome of r1=1 ∧ r2=0 is indeed forbidden, as there is no interleaving of the reads and writes (in which each read reads the value of the most recent write to the same address) which permits it. To check this, one can just enumerate the six possible interleavings that respect the program order of each thread:

Interleaving                Final register state
x=1; y=1; r1=y; r2=x        r1=1 ∧ r2=1
x=1; r1=y; y=1; r2=x        r1=0 ∧ r2=1
x=1; r1=y; r2=x; y=1        r1=0 ∧ r2=1
r1=y; r2=x; x=1; y=1        r1=0 ∧ r2=0
r1=y; x=1; r2=x; y=1        r1=0 ∧ r2=1
r1=y; x=1; y=1; r2=x        r1=0 ∧ r2=1
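The six interleavings can also be checked mechanically. This small Python sketch (my own, not part of the tutorial) enumerates every sequentially consistent interleaving of the MP test and confirms that r1=1 ∧ r2=0 never arises.

```python
from itertools import permutations

def run(schedule):
    """Execute MP under one interleaving; each read sees the latest write."""
    mem = {"x": 0, "y": 0}
    regs = {}
    prog = {0: [("W", "x", 1), ("W", "y", 1)],        # Thread 0: x=1; y=1
            1: [("R", "y", "r1"), ("R", "x", "r2")]}  # Thread 1: r1=y; r2=x
    pc = {0: 0, 1: 0}
    for tid in schedule:
        kind, addr, arg = prog[tid][pc[tid]]
        pc[tid] += 1
        if kind == "W":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return regs["r1"], regs["r2"]

# the six distinct orderings of two operations per thread
outcomes = {run(s) for s in set(permutations([0, 0, 1, 1]))}
print(sorted(outcomes))          # [(0, 0), (0, 1), (1, 1)] — no (1, 0)
assert (1, 0) not in outcomes
```

The set of outcomes matches the table above exactly, and the forbidden (1, 0) result is absent.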

On x86-TSO or SPARC TSO that final outcome of r1=1 ∧ r2=0 is also forbidden, as the two writes flow through a FIFO buffer into the shared memory before becoming visible to the reading thread. But on ARM and POWER, this final outcome is allowed in the architecture, and it is commonly observable on current processor implementations. Thread 1 can see the flag y set to 1, and program-order-subsequently see the data x still 0. The table below gives some sample experimental data, running this test on various processor implementations using a test harness produced by our litmus tool [AMSS11a]. Each entry gives a ratio m/n, where m is the number of times that the final outcome of r1=1 ∧ r2=0 was observed in n trials.

                POWER                               ARM
Kind            PowerG5    Power6    Power7      Tegra2    Tegra3    APQ8060   A5X
MP    Allow     10M/4.9G   6.5M/29G  1.7G/167G   40M/3.8G  138k/16M  61k/552M  437k/185M

Here we just show the frequency of the outcome identified by the final state constraint, but many other outcomes (all the sequentially consistent outcomes listed above) are also allowed and observable.

Care is needed in interpreting such results, of course: the specific numbers can be highly dependent on the test harness; such testing, of highly nondeterministic systems, is not guaranteed to produce all the executions that an implementation might produce; and the architectures are intentionally looser in some respects than current implementations, so (as we will see later) some behaviour may be architecturally allowed even though it is never observable in current processors. Moreover, there might be differences between our architectural models, the vendor's architectural intent, and the vendor's architecture manuals (the ARM ARM [ARM08a] and POWER ISA [Pow09]). And of course, while our models are based in part on extensive discussion with ARM and IBM architects and designers, we do not speak for either vendor. All that said, we have reasonable confidence in our models, and we have found our testing process to be reasonably discriminating. Wherever we mark a test execution as allowed or forbidden, we believe that that does match the architectural intention, and, unless otherwise stated, everything marked as allowed is observable on some implementation of one or other architecture, and (modulo processor errata, which we do not discuss here) everything marked as forbidden has not been observable. We give some summary test data to illustrate this in each section.


Yes!  


Page 23

Broken Concurrent Programming Idiom

Barriers are similar in that they get committed at a thread and sent to the local part of the storage subsystem, before perhaps propagating to other threads. The constraints on how writes and barriers can propagate are intertwined, as we shall see.

Aside: other notions of atomicity

We introduced multiple-copy atomicity above, but some caution is needed, as there are many different senses of "atomic" in use. Two other important notions of atomicity are as follows.

A memory read or write by an instruction is access-atomic (or single-copy atomic, in the terminology of Collier [Col92]—though note that single-copy atomic is not the opposite of multiple-copy atomic) if it gives rise to a single access to the memory. Typically an architecture will specify that certain sizes of reads and writes, subject to some alignment constraints (such as 1, 2, 4, 8, and 16-byte accesses with those alignments), are access-atomic, while other sizes and non-aligned accesses may be split into several distinct subaccesses. For example, two writes of the same size to the same address are access-atomic iff the result is guaranteed to be either one or the other value, not a combination of their bits. Of course, in a machine which is not multiple-copy atomic, even if a write instruction is access-atomic, the write may become visible to different threads at different times (and if a write is not access-atomic, the individual subaccesses may become visible to different threads at different times, perhaps in different orders).
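As a concrete illustration of that "combination of their bits" (my own sketch, using a hypothetical 16-bit location stored as two bytes): if each 2-byte store is split into byte subaccesses, two racing stores can leave a value that is neither of the ones written.

```python
# two racing 16-bit stores to the same location, each split into two
# byte-sized subaccesses (i.e., the stores are NOT access-atomic)
mem = {0: 0x00, 1: 0x00}   # byte 0 = low byte, byte 1 = high byte

# store A writes 0x00FF, store B writes 0xFF00; one unlucky interleaving:
mem[0] = 0xFF              # A: low byte
mem[0] = 0x00              # B: low byte (overwrites A's)
mem[1] = 0xFF              # B: high byte
mem[1] = 0x00              # A: high byte (overwrites B's)

result = mem[0] | (mem[1] << 8)
print(hex(result))         # a mix of bits: neither 0x00ff nor 0xff00
```

With access-atomic stores the only possible final values would be 0x00FF or 0xFF00.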

An instruction that involves more than one memory access, such as an increment that does a read and a write to the same location, or a load-multiple that reads several words, is instruction-atomic if its accesses are indivisible in time, with no other intervening access by other threads to their locations. For example, increment is instruction-atomic iff two concurrent increments to the same location that is initially 0 are guaranteed to result in the location containing 2, not 1. On x86 INC is not instruction-atomic whereas LOCK;INC is. On POWER an lmw load-multiple instruction is not instruction-atomic.
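The increment example is the classic lost-update interleaving, sketched below (my own illustration; the two "threads" are written out as one explicit interleaving rather than real concurrency):

```python
x = 0
# two concurrent non-instruction-atomic increments, each a read then a write:
r_a = x        # "thread" A reads 0
r_b = x        # "thread" B also reads 0, before A has written back
x = r_a + 1    # A writes 1
x = r_b + 1    # B also writes 1 — one increment is lost
print(x)       # 1, not the 2 an instruction-atomic increment would guarantee
```

An instruction-atomic increment (e.g. x86 LOCK;INC) excludes the intervening access, so both increments take effect.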

Yet another usage is the C11 and C++11 atomic types and operations. These have various properties, including analogues of access- and instruction-atomicity, that we will not discuss here; see [BA08, BOS+11, Bec11, ISO11] for details.

3 Introducing Litmus Tests, and Simple Message Passing (MP)

3.1 Message Passing Attempts without Barriers or Dependencies

3.1.1 The Message Passing (MP) Example

A simple example illustrating some ways in which ARM and POWER are relaxed is the classic message passing (MP) example below, with two threads (Thread 0 and Thread 1) and two shared variables (x and y). This is a simple low-level concurrency programming idiom, in which one thread (Thread 0) writes some data x, and then sets a flag y to indicate that the data is ready to be read, while another thread (Thread 1) busy-waits reading the flag y until it sees it set, and then reads the data x into a local variable or processor register r2. The desired behaviour is that after the reading thread has seen the flag set, its subsequent read of the data x will see the value from the writing thread, not the initial state (or some other previous value). In pseudocode:

MP-loop Pseudocode
Thread 0              Thread 1
x=1  // write data    while (y==0) {}  // busy-wait for flag
y=1  // write flag    r2=x             // read data
Initial state: x=0 ∧ y=0
Forbidden?: Thread 1 register r2 = 0

The test specifies the initial state of registers and memory (x=0 and y=0; henceforth we assume these are zero if not given explicitly) and a constraint on the final state, e.g. that Thread 1's register r2 is 0. Here x (or [x] in assembly tests) is the value of memory location x; later we write 1:r2 for the value of register r2 on hardware thread 1. If one reached that final state, with r2=0, then the Thread 1 read of x would have to have read x=0 from the initial state despite the Thread 1 while loop having successfully exited on reading from the Thread 0 write of y=1, program-order-after its write of x=1.

We can simplify the example without really affecting what is going on by looking at just a single test of the flag: instead of looking at all executions of the MP-loop busy-waiting loop, we can restrict our attention to just the executions of the MP program below in which the Thread 1 read of y sees the value 1 written by Thread 0 (we are effectively considering just the executions of MP-loop in which the while loop test succeeds the first time). In other



Page 24

Enforcing Order with dmb / sync Barrier

•  Read / Read: the barrier will ensure that they are satisfied and committed in program order.

•  Read / Write: the barrier will ensure that the read is committed before the write can be committed (and thus propagated and become visible to others).

•  Write / Write: the barrier will ensure that the first write is committed and has propagated to all other threads before the second write is committed.

•  Write / Read: the barrier will ensure that the write is committed and has propagated to all other threads before the read is satisfied.
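The Write/Write case can be sketched in a toy non-multiple-copy-atomic model (my own construction, loosely following the tutorial's abstract machine, not vendor semantics): give each thread its own copy of memory and let the barrier force the first write into every copy before the second write commits.

```python
class Storage:
    """Toy storage: one copy of memory per thread (not multiple-copy atomic)."""
    def __init__(self, n):
        self.copies = [{"x": 0, "y": 0} for _ in range(n)]

    def write(self, tid, addr, val):
        self.copies[tid][addr] = val          # visible to the writer at once

    def barrier(self, tid, addr):
        for c in self.copies:                 # dmb/sync: propagate everywhere
            c[addr] = self.copies[tid][addr]

    def read(self, tid, addr):
        return self.copies[tid][addr]

s = Storage(3)
s.write(0, "x", 1)       # first write: the data
s.barrier(0, "x")        # barrier: x=1 now in every thread's copy...
s.write(0, "y", 1)       # ...before the second write (the flag) commits

# hence no thread can observe y==1 together with x==0
for t in range(3):
    assert not (s.read(t, "y") == 1 and s.read(t, "x") == 0)
print("ok")
```

This only steps through one snapshot, not all interleavings, but it shows why the writer-side barrier repairs the MP idiom: the flag can never outrun the data.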

Page 25

Enforcing Order with lwsync Barrier (Power Architecture Only)

•  Read / Read: the barrier will ensure that they are satisfied and committed in program order.

•  Read / Write: the barrier will ensure that the read is committed before the write can be committed (and thus propagated and become visible to others).

•  Write / Write: the barrier will ensure that for any particular thread, the first write propagates to that thread before the second.

•  Write / Read: the barrier will ensure that the write is committed before the read is satisfied, but the read can be satisfied before the write is propagated to any other thread.

The main distinction between sync and lwsync is how a write before the barrier is handled relative to the second instruction.

Page 26

Enforcing (Execution) Order using Dependencies

•  Address Dependency: an address dependency from a read to a PO-later read/write where the value read by the first is used to compute the address of the second instruction.

•  Control Dependency: a control dependency from a read to a PO-later write (only) where the value read by the first is used to compute the condition of a conditional branch that comes in PO before the second.

•  Data Dependency: a data dependency from a read to a PO-later write where the value read by the first is used to compute the value of the second write.
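The three shapes can be illustrated in ordinary sequential code; Python of course has no weak memory model, so this sketch (entirely my own, with a made-up memory layout) only shows the dataflow that each dependency kind names, not any hardware ordering.

```python
# a toy "memory": location y holds the name of location x (an address)
mem = {"y": "x", "x": 7, "z": 0}

r1 = mem["y"]            # first read: r1 = "x"
r2 = mem[r1]             # ADDRESS dependency: r1 computes the second address
if r1 == "x":            # CONTROL dependency: r1 feeds a branch condition...
    pass                 # ...that comes in program order before the write below
mem["z"] = len(r1)       # DATA dependency: the stored value is computed from r1
print(r2, mem["z"])
```

On ARM and POWER it is exactly these read-to-later-instruction dataflow paths, traced at the ISA level through registers, that the hardware is obliged to respect.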

Page 27

Address Dependencies

We show a case where the weakness of lwsync really matters in test SB+lwsyncs, in Section 6. ARM does not have an analogue of lwsync.

3.4 Observed Behaviour

Below we show experimental data for these tests: for MP+dmbs and MP+syncs on ARM and POWER, and for MP+lwsyncs just on POWER.

                       POWER                               ARM
Kind                   PowerG5    Power6    Power7      Tegra2    Tegra3    APQ8060   A5X
MP             Allow   10M/4.9G   6.5M/29G  1.7G/167G   40M/3.8G  138k/16M  61k/552M  437k/185M
MP+dmbs/syncs  Forbid  0/6.9G     0/40G     0/252G      0/24G     0/39G     0/26G     0/2.2G
MP+lwsyncs     Forbid  0/6.9G     0/40G     0/220G      —         —         —         —

Here the allowed result for MP is observable on all platforms, while the forbidden results for the variants with barriers are not observable on any platform.

4 Enforcing Order with Dependencies

In fact, on the read side of the message-passing example, the sync, lwsync, and DMB memory barriers used above are stronger than necessary: one can enforce enough ordering to prohibit the undesired outcome just by relying on various kinds of dependency in the code. In this section we explain what those are and what their force is. In later sections we use dependencies in examples that illustrate some other relaxed-memory properties of the machines. For POWER, in all the examples of this section one could replace the sync on the writing thread with lwsync without affecting the results.

4.1 Address Dependencies

The simplest kind of dependency is an address dependency. There is an address dependency from a read instruction to a program-order-later read or write instruction when the value read by the first is used to compute the address used for the second. In the variation of MP below, instead of writing a flag value of 1, the writer Thread 0 writes the address of location x, and the reader Thread 1 uses that address for its second read. That dependency is enough to keep the two reads satisfied in program order on Thread 1: the second read cannot get started until its address is (perhaps speculatively) known, so the second read cannot be satisfied until the first read is satisfied (in other words, the ARM and POWER architectures do not allow value speculation of addresses). Combining that with the dmb/sync on Thread 0 (which keeps the write to x and the write to y in order as far as any other thread is concerned) is enough to prevent Thread 1 reading 0 from x if it has read &x from y.

MP+dmb/sync+addr' Pseudocode
Thread 0        Thread 1
x=1             r1=y
dmb/sync        r2=*r1
y=&x
Initial state: x=0 ∧ y=0
Forbidden: 1:r1=&x ∧ 1:r2=0

Test MP+dmb/sync+addr': Forbidden

  Thread 0: a: W[x]=1   --dmb/sync-->  b: W[y]=&x
  Thread 1: c: R[y]=&x  --addr-->      d: R[x]=0
  rf: b -> c; d reads x=0 from the initial state

Note that there is a slight mismatch here between the C-like syntax of our pseudocode, in which x is a C variable and &x its address, and the notation of our assembly examples, in which x is a location.

4.1.1 Compound Data

To see that this message-passing-with-dependency idiom can still work correctly if the data (the value stored at x) were multi-word, note that all the writes to the parts of the data would precede the dmb/sync on Thread 0, while all the reads of the parts of the data should each be address-dependent on the value read from y on Thread 1, by some offset calculation from that value.


Page 28

Artificial Address Dependencies

4.1.2 C11 Consume

This preserved address dependency is what is made available in the C/C++11 memory models by their read-consume atomic operations: if the read of y were tagged as a read-consume at the C/C++ level, then compilers are required to respect the dependency to the second Thread 1 read (or to implement something stronger, such as an lwsync/sync/dmb barrier between them).

4.1.3 Address Dependencies from Reads to Writes

An address dependency from a read to a write has a similar effect to one from a read to a read, preventing the write getting started until the read has been satisfied and a value for the read is known; we illustrate this with the test S in Section 4.4 below.

4.1.4 Artificial Dependencies

Above we said that "there is an address dependency from a read instruction to a program-order-later read or write instruction when the value read by the first is used to compute the address used for the second", and that computation may be via any data-flow path through registers and arithmetic or logical operations (though not via memory) — even if the value of the address used cannot be affected by the value read. In the MP+dmb/sync+addr variant below, the value read is exclusive-or'd with itself and then added to the (constant) address of x to calculate the address to be used for the second Thread 1 read. The result of the exclusive-or will always be zero, and so the address used for the second read will always be equal to that of x, but the two reads are still kept in order. Adding such an artificial dependency (sometimes these are known, perhaps confusingly, as false or fake dependencies) can be a useful programming idiom, to enforce some ordering from a read to a later read (or write) at low run-time cost.

MP+dmb/sync+addr Pseudocode
Thread 0        Thread 1
x=1             r1=y
dmb/sync        r3=(r1 xor r1)
y=1             r2=*(&x + r3)
Initial state: x=0 ∧ y=0
Forbidden: 1:r1=1 ∧ 1:r2=0

MP+dmb+addr ARM
Thread 0            Thread 1
MOV R0,#1           LDR R0,[R4]
STR R0,[R2]         EOR R1,R0,R0
DMB                 LDR R2,[R1,R3]
MOV R1,#1
STR R1,[R3]
Initial state: 0:R2=x ∧ 0:R3=y ∧ 1:R3=x ∧ 1:R4=y
Forbidden: 1:R0=1 ∧ 1:R2=0

MP+sync+addr POWER
Thread 0            Thread 1
li r1,1             lwz r1,0(r2)
stw r1,0(r2)        xor r3,r1,r1
sync                lwzx r4,r3,r5
li r3,1
stw r3,0(r4)
Initial state: 0:r2=x ∧ 0:r4=y ∧ 1:r2=y ∧ 1:r5=x
Forbidden: 1:r1=1 ∧ 1:r4=0

As artificial address dependencies behave just like natural ones, we draw them in the same way, with addr edges, abstracting from the details of exactly what address computation is done:

Test MP+dmb/sync+addr: Forbidden

  Thread 0: a: W[x]=1   --dmb/sync-->  b: W[y]=1
  Thread 1: c: R[y]=1   --addr-->      d: R[x]=0
  rf: b -> c; d reads x=0 from the initial state

4.2 Control Dependencies

A rather different kind of dependency is a control dependency, from a read to a program-order-later read or write instruction, where the value read by the first read is used to compute the condition of a conditional branch that is program-order-before the second read or write.

A control dependency from a read to a read has little force, as we see in the MP+dmb+ctrl and MP+sync+ctrl examples below: ARM and POWER processors can speculatively execute past the conditional branch (perhaps following one path based on a branch prediction, or in principle following both paths simultaneously), and so satisfy the second read before satisfying the first read. The first read, the branch, and the second read might then all be committed (with those values) in program order.


Page 29

Control Dependencies

MP+dmb/sync+ctrl                           Pseudocode
  Thread 0          Thread 1
  x=1               r1=y
  dmb/sync          if (r1 == r1) {}
  y=1               r2=x
  Initial state: x=0 ∧ y=0
  Allowed: 1:r1=1 ∧ 1:r2=0

MP+dmb+ctrl                                ARM
  Thread 0          Thread 1
  MOV R0,#1         LDR R0,[R3]
  STR R0,[R2]       CMP R0,R0
  DMB               BNE LC00
  MOV R1,#1         LC00:
  STR R1,[R3]       LDR R1,[R2]
  Initial state: 0:R2=x ∧ 0:R3=y ∧ 1:R2=x ∧ 1:R3=y
  Allowed: 1:R0=1 ∧ 1:R1=0

MP+sync+ctrl                               POWER
  Thread 0          Thread 1
  li r1,1           lwz r1,0(r2)
  stw r1,0(r2)      cmpw r1,r1
  sync              beq LC00
  li r3,1           LC00:
  stw r3,0(r4)      lwz r3,0(r4)
  Initial state: 0:r2=x ∧ 0:r4=y ∧ 1:r2=y ∧ 1:r4=x
  Allowed: 1:r1=1 ∧ 1:r3=0

Test MP+dmb/sync+ctrl: Allowed

  Thread 0              Thread 1
  a: W[x]=1             c: R[y]=1
    |dmb/sync             |ctrl
  b: W[y]=1             d: R[x]=0
  rf: b -> c; d reads the initial state of x (rf)

For compactness, we show examples in which the branch is just to the next instruction, an instance of which will be executed whether or not the branch is taken. This makes no difference: the behaviour would be just the same if the branch were to a different location and the second read were only executed in one case.

The value of the branch condition in the examples is also unaffected by the value read, as it is just based on an equality comparison of a register with itself. Just as for the artificial address dependencies described above, this also makes no difference: the existence of the control dependency simply relies on the fact that the value read is used in the computation of the condition, not on whether the value of the condition would be changed if a different value were read. There are therefore many ways of writing a pseudocode example that are essentially equivalent to that above, e.g. by putting the r2=x inside the conditional, or, for an example without a race on x, putting the read of y in a loop such as do {r1=y;} while (r1 == 0).

4.3 Control-isb/isync Dependencies

To give a read-to-read control dependency some force, one can add an ISB (ARM) or isync (POWER) instruction between the conditional branch and the second read, as in the examples below. This prevents the second read from being satisfied until the conditional branch is committed, which cannot happen until the value of the first read is fixed (i.e., until that read is satisfied and committed); the two reads are thus kept in order and the specified outcome of the test is now forbidden.

MP+dmb/sync+ctrlisb/ctrlisync              Pseudocode
  Thread 0          Thread 1
  x=1               r1=y
  dmb/sync          if (r1 == r1) {}
                    isb/isync
  y=1               r2=x
  Initial state: x=0 ∧ y=0
  Forbidden: 1:r1=1 ∧ 1:r2=0

MP+dmb+ctrlisb                             ARM
  Thread 0          Thread 1
  MOV R0,#1         LDR R0,[R3]
  STR R0,[R2]       CMP R0,R0
  DMB               BNE LC00
  MOV R1,#1         LC00:
  STR R1,[R3]       ISB
                    LDR R1,[R2]
  Initial state: 0:R2=x ∧ 0:R3=y ∧ 1:R2=x ∧ 1:R3=y
  Forbidden: 1:R0=1 ∧ 1:R1=0

MP+sync+ctrlisync                          POWER
  Thread 0          Thread 1
  li r1,1           lwz r1,0(r2)
  stw r1,0(r2)      cmpw r1,r1
  sync              beq LC00
  li r3,1           LC00:
  stw r3,0(r4)      isync
                    lwz r3,0(r4)
  Initial state: 0:r2=x ∧ 0:r4=y ∧ 1:r2=y ∧ 1:r4=x
  Forbidden: 1:r1=1 ∧ 1:r3=0

Test MP+dmb/sync+ctrlisb/ctrlisync: Forbidden

  Thread 0              Thread 1
  a: W[x]=1             c: R[y]=1
    |dmb/sync             |ctrlisb/isync
  b: W[y]=1             d: R[x]=0
  rf: b -> c; d reads the initial state of x (rf)

4.4 Control Dependencies from a Read to a Write

In contrast to control dependencies (without an isb/isync) from a read to a read, a control dependency from a read to a write does have some force: the write cannot be seen by any other thread until the branch is committed, and hence until the value of the first read is fixed.

Control Dependencies for R/W

To illustrate this we use a variation of the MP test family, known (for historical reasons) as S, in which the second read on Thread 1 (of x) is replaced by a write of x. Instead of asking whether the Thread 1 read of x is guaranteed to see the value written by Thread 0, we now ask whether the Thread 1 write of x is guaranteed to be coherence-after the Thread 0 write of x (i.e., whether a third thread, that reads x twice, is guaranteed not to see those two writes in the opposite order; we return to coherence in Section 8). Without a control dependency on Thread 1, that is not guaranteed; the execution below is allowed.

S+dmb/sync+po                              Pseudocode
  Thread 0          Thread 1
  x=2               r1=y
  dmb/sync          x=1
  y=1
  Initial state: x=0 ∧ y=0
  Allowed: 1:r1=1 ∧ x=2

S+dmb+po                                   ARM
  Thread 0          Thread 1
  MOV R0,#2         LDR R0,[R3]
  STR R0,[R2]       MOV R1,#1
  DMB               STR R1,[R2]
  MOV R1,#1
  STR R1,[R3]
  Initial state: 0:R2=x ∧ 0:R3=y ∧ 1:R2=x ∧ 1:R3=y
  Allowed: [x]=2 ∧ 1:R0=1

S+sync+po                                  POWER
  Thread 0          Thread 1
  li r1,2           lwz r1,0(r2)
  stw r1,0(r2)      li r3,1
  sync              stw r3,0(r4)
  li r3,1
  stw r3,0(r4)
  Initial state: 0:r2=x ∧ 0:r4=y ∧ 1:r2=y ∧ 1:r4=x
  Allowed: [x]=2 ∧ 1:r1=1

We draw such coherence conditions with a co edge, between two writes to the same address:

Test S+dmb/sync+po: Allowed

  Thread 0              Thread 1
  a: W[x]=2             c: R[y]=1
    |dmb/sync             |po
  b: W[y]=1             d: W[x]=1
  rf: b -> c; co: d -> a

If we add a read-to-write control dependency on Thread 1, that final outcome becomes forbidden:

S+dmb/sync+ctrl                            Pseudocode
  Thread 0          Thread 1
  x=2               r1=y
  dmb/sync          if (r1==r1) {}
  y=1               x=1
  Initial state: x=0 ∧ y=0
  Forbidden: 1:r1=1 ∧ x=2

S+dmb+ctrl                                 ARM
  Thread 0          Thread 1
  MOV R0,#2         LDR R0,[R3]
  STR R0,[R2]       CMP R0,R0
  DMB               BNE LC00
  MOV R1,#1         LC00:
  STR R1,[R3]       MOV R1,#1
                    STR R1,[R2]
  Initial state: 0:R2=x ∧ 0:R3=y ∧ 1:R2=x ∧ 1:R3=y
  Forbidden: [x]=2 ∧ 1:R0=1

S+sync+ctrl                                POWER
  Thread 0          Thread 1
  li r1,2           lwz r1,0(r2)
  stw r1,0(r2)      cmpw r1,r1
  sync              beq LC00
  li r3,1           LC00:
  stw r3,0(r4)      li r3,1
                    stw r3,0(r4)
  Initial state: 0:r2=x ∧ 0:r4=y ∧ 1:r2=y ∧ 1:r4=x
  Forbidden: [x]=2 ∧ 1:r1=1

Test S+dmb/sync+ctrl: Forbidden

  Thread 0              Thread 1
  a: W[x]=2             c: R[y]=1
    |dmb/sync             |ctrl
  b: W[y]=1             d: W[x]=1
  rf: b -> c; co: d -> a

4.5 Data Dependencies from a Read to a Write

Our final kind of dependency is a data dependency, from a read to a program-order-later write where the value read is used to compute the value written. These have a broadly similar effect to address, control, or control-isb/isync dependencies from reads to writes: they prevent the write being committed (and hence being propagated and becoming visible to other threads) until the value of the read is fixed when the read is committed. Accordingly, the S+dmb/sync+data variant below of the above S+dmb/sync+ctrl test is also forbidden.


Data Dependencies for R/W

S+dmb/sync+data                            Pseudocode
  Thread 0          Thread 1
  x=2               r1=y
  dmb/sync          r3=(r1 xor r1)
  y=1               x=1+r3
  Initial state: x=0 ∧ y=0
  Forbidden: 1:r1=1 ∧ x=2

S+dmb+data                                 ARM
  Thread 0          Thread 1
  MOV R0,#2         LDR R0,[R3]
  STR R0,[R2]       EOR R1,R0,R0
  DMB               ADD R1,R1,#1
  MOV R1,#1         STR R1,[R2]
  STR R1,[R3]
  Initial state: 0:R2=x ∧ 0:R3=y ∧ 1:R2=x ∧ 1:R3=y
  Forbidden: [x]=2 ∧ 1:R0=1

S+sync+data                                POWER
  Thread 0          Thread 1
  li r1,2           lwz r1,0(r2)
  stw r1,0(r2)      xor r3,r1,r1
  sync              addi r3,r3,1
  li r3,1           stw r3,0(r4)
  stw r3,0(r4)
  Initial state: 0:r2=x ∧ 0:r4=y ∧ 1:r2=y ∧ 1:r4=x
  Forbidden: [x]=2 ∧ 1:r1=1

Test S+dmb/sync+data: Forbidden

  Thread 0              Thread 1
  a: W[x]=2             c: R[y]=1
    |dmb/sync             |data
  b: W[y]=1             d: W[x]=1
  rf: b -> c; co: d -> a

4.6 Summary of Dependencies

To summarise, we have:

RR and RW address: an address dependency from a read to a program-order-later read or write where the value read by the first is used to compute the address of the second;

RR and RW control: a control dependency from a read to a program-order-later read or write where the value read by the first is used to compute the condition of a conditional branch that is program-order-before the second;

RR and RW control-isb/isync: a control-isb or control-isync dependency from a read to a program-order-later read or write where the value read by the first is used to compute the condition of a conditional branch that is program-order-before an isb/isync instruction before the second; and

RW data: a data dependency from a read to a program-order-later write where the value read by the first is used to compute the value written by the second.

There are no dependencies from writes (to either reads or writes). In each case, the use of the value read can be via any dataflow chain of register-to-register operations, and it does not matter whether it is artificial (or fake/false) or not: there is still a dependency even if the value read cannot affect the actual value used as address, data, or condition.

From one read to another, an address or control-isb/isync dependency will prevent the second read being satisfied before the first is, while a plain control dependency will not.

From a read to a write, an address, control (and so also a control-isb/isync) or data dependency will prevent the write being visible to any other thread before the value of the read is fixed.

We return in Section 10 to some more subtle properties of dependencies. As we shall see, dependencies are strictly weaker than the DMB, sync, and lwsync barriers: replacing a dependency by one of those barriers will never permit more behaviour (and so should always be a safe program transformation), whereas the converse is not true. Dependencies only have thread-local effects, whereas DMB, sync, and lwsync have stronger 'cumulative' properties that we introduce in the next section.

4.7 Observed Behaviour

Below we summarise the results of hardware experiments on dependencies.


References

TSO:
[1] Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors. Communications of the ACM, Vol. 53, No. 7, July 2010.
[2] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3 (3A, 3B & 3C): System Programming Guide, Chapter 8. 2015.

Power:
[3] Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams. Understanding POWER Multiprocessors. In PLDI 2011.
[4] Luc Maranget, Susmit Sarkar, and Peter Sewell. A Tutorial Introduction to the ARM and POWER Relaxed Memory Models. Technical report. http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf