Computer Architecture II
Today: Consistency models
• Program order
• Difference between coherency and consistency
• Sequential consistency
• Relaxing sequential consistency
Program order (an example)
• Order in which instructions appear in source code
– May be changed by a compiler
– We will assume the order the programmer sees (what you see in the example below, not what the assembly code would look like)
• Sequential program order
– P1: 1a->1b
– P2: 2a->2b
• Parallel program order: an arbitrary interleaving of the sequential orders of P1 and P2, for example
– 1a->1b->2a->2b
– 1a->2a->1b->2b
– 2a->1a->1b->2b
– 2a->2b->1a->1b
P1 P2
(1a) A = 1; (2a) print B;
(1b) B = 2; (2b) print A;
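The interleavings of P1 and P2 can be enumerated mechanically. The sketch below is only an illustration (the helper name `interleavings` is hypothetical): it merges the two sequential orders in every way that keeps 1a before 1b and 2a before 2b. For this example it yields six orders, of which the list above shows some.

```python
from itertools import combinations

def interleavings(p1, p2):
    """Return every merge of p1 and p2 that keeps each list's internal order."""
    total = len(p1) + len(p2)
    result = []
    # Choose which slots of the merged sequence are taken by p1's operations;
    # the remaining slots are filled with p2's operations, in order.
    for slots in combinations(range(total), len(p1)):
        it1, it2 = iter(p1), iter(p2)
        result.append([next(it1) if i in slots else next(it2) for i in range(total)])
    return result

orders = interleavings(["1a", "1b"], ["2a", "2b"])
for order in orders:
    print("->".join(order))
```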
Program order
• Which printings of the program are intuitively possible?
• A compiler, or out-of-order execution on a superscalar processor, may reorder 1a and 1b of P1 as long as this does not affect the result of the program on P1
– This would produce non-intuitive results
• Now assume that the compiler/superscalar processor does not reorder
– P1 will “see” the results of the writes A=1 and B=2 in program order
– But
• when will P2 see the results of the writes A=1 and B=2?
• when will P2 see the result of the write A=1?
– We say either that a processor P2 “sees” the result of a write by P1, or that the write operation of P1 completes with respect to P2
– Coherence => Writes to one location become visible to all in the same order
– But here we have 2 locations!
P1 P2
(1a) A = 1; (2a) print B;
(1b) B = 2; (2b) print A;
Initially A=0, B=0
Setup for Memory Consistency
• Coherence => Writes to one location become visible to all in the same order
• Nothing is said about
– when a write becomes visible to another processor
• use event synchronization to ensure that
– the order in which consecutive writes to different locations are seen by other processors
P1                              P2
/*Assume initial value of A is 0*/
A = 1;
Barrier ----------------------- Barrier
                                print A;
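The barrier example can be tried out with ordinary threads. This is a minimal sketch, assuming Python threads as stand-ins for the two processors and `threading.Barrier` as the event synchronization. The function name `run_barrier_example` is hypothetical, and the Python runtime is not a model of hardware memory ordering.

```python
import threading

def run_barrier_example():
    shared = {"A": 0}                 # initial value of A is 0
    barrier = threading.Barrier(2)    # both "processors" must arrive
    seen = []

    def p1():
        shared["A"] = 1               # A = 1, issued before the barrier
        barrier.wait()

    def p2():
        barrier.wait()                # returns only after P1 has also arrived
        seen.append(shared["A"])      # "print A"

    threads = [threading.Thread(target=p2), threading.Thread(target=p1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen[0]

print(run_barrier_example())
```

Because P2 reads A only after the barrier, and P1 writes A before it, the read returns the new value no matter which thread starts first.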
Second Example
• Intuition not guaranteed by coherence
– Coherence refers to one location: a read returns the last value written to A or to flag
– It says nothing about the order in which the modifications of A and flag are seen by P2
• Intuitively we expect memory to
– respect the order between accesses to different locations issued by a given process (1.b seen after 1.a)
• Conclusion: Coherence is not enough!
– it pertains only to a single location
P1 P2
/*Assume initial value of A and flag is 0*/
1.a A = 1; 2.a while (flag == 0); /*spin idly*/
1.b flag = 1; 2.b print A;
Back to Second Example
– What’s the intuition? If 2a prints 2, will 2b print 1?
– We need an ordering model for clear semantics
• across different locations as well
• so programmers can reason about what results are possible
– This is the memory consistency model
P1 P2
/*Assume initial values of A and B are 0*/
(1a) A = 1; (2a) print B;
(1b) B = 2; (2b) print A;
Memory Consistency Model
• Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to one another
– What orders are preserved?
– Given a load, what are the possible values it can return?
• Without it, we can’t tell much about the execution of a shared address space (SAS) program
• Implications for both programmer and system designer
– Programmer uses it to reason about correctness and possible results
– System designer can use it to constrain how much accesses can be reordered by compiler or hardware
• Contract between programmer and system
Sequential Consistency
• Total order achieved by interleaving accesses from different processes
– Maintains program order, and memory operations, from all processes, appear to [issue, execute, complete] atomically w.r.t. others
– as if there were no caches, and a single memory
• “A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport, 1979]
[Figure: processors P1, P2, ..., Pn issue memory references in program order to a single memory through a switch; the “switch” is randomly set after each memory reference]
SC Example
• What matters is the order in which operations appear to execute, not the chronological order of events
• Possible outcomes for (A,B): (0,0), (1,0), (1,2)
• What about (0,2)?
– program order => 1a->1b and 2a->2b
– A = 0 implies 2b->1a, which implies 2a->1b
– B = 2 implies 1b->2a, which leads to a contradiction
• What about 1b->1a->2b->2a?
– appears just like 1a->1b->2a->2b => fine!
– execution order 1b->2a->2b->1a is not fine, would produce (0,2)
P1 P2
/*Assume initial values of A and B are 0*/
(1a) A = 1; (2a) print B;
(1b) B = 2; (2b) print A;
Outcome in question: A=0, B=2
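The set of SC outcomes can be checked by brute force. The sketch below, a hypothetical helper rather than part of the lecture, runs every total order of the four operations that respects both program orders against a single memory and collects the (A,B) pairs that P2 prints.

```python
from itertools import permutations

P1 = [("1a", "write", "A", 1), ("1b", "write", "B", 2)]
P2 = [("2a", "read", "B", None), ("2b", "read", "A", None)]

def sc_outcomes(p1, p2):
    """Set of (A, B) values printed by P2 over all SC interleavings."""
    outcomes = set()
    for order in permutations(p1 + p2):
        # Keep only total orders that respect each process's program order.
        if [o for o in order if o in p1] != p1 or [o for o in order if o in p2] != p2:
            continue
        memory = {"A": 0, "B": 0}     # initial values of A and B are 0
        printed = {}
        for _, kind, loc, value in order:
            if kind == "write":
                memory[loc] = value   # writes take effect immediately (single memory)
            else:
                printed[loc] = memory[loc]
        outcomes.add((printed["A"], printed["B"]))
    return outcomes

print(sorted(sc_outcomes(P1, P2)))
```

The result contains (0,0), (1,0) and (1,2) but not (0,2), matching the argument above.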
Back to the first example
P1 P2
(1a) A = 1; (2a) print B;
(1b) B = 2; (2b) print A;
• Sequential program order
– P1: 1a->1b
– P2: 2a->2b
• Parallel program order: an arbitrary interleaving of sequential orders of P1 and P2 (the intuitive orders)
– 1a->1b->2a->2b
– 1a->2a->1b->2b
– 1a->2a->2b->1b
– 2a->1a->1b->2b
– 2a->1a->2b->1b
– 2a->2b->1a->1b
• But 1a->1b->2b->2a is also acceptable for SC!
Implementing SC
• Two kinds of requirements
– Program order
• memory operations issued by a process must appear to execute (become visible to others and itself) in program order
– Atomicity
• in the overall hypothetical total order, one memory operation should appear to complete with respect to all processes before the next one is issued
• guarantees that total order is consistent across processes
Summary of Sequential Consistency
• Maintain order between shared accesses in each thread
– reads or writes wait for previous reads or writes to complete
[Figure: all four program orderings are enforced: READ->READ, WRITE->WRITE, WRITE->READ, READ->WRITE]
Do we really need SC?
• SC has strong requirements
• SC may prevent compiler optimizations (code reorganization) and architectural optimizations (out-of-order execution in superscalar processors)
• Many programs execute correctly even without “strong” ordering
• explicit synch operations order key accesses
initial: A, B=0
P1 P2
A := 1;
B := 3.1415
barrier -------------------barrier
... = A;
... = B;
Does SC eliminate synchronization?
• No, still needed
– Critical sections (e.g. insert element into a doubly-linked list)
– Barriers (e.g. enforce order on a variable access)
– Events (e.g. wait for a condition to become true)
• SC only ensures interleaving semantics of individual memory operations
Is SC hardware enough?
• No, the compiler can violate ordering constraints
– Register allocation to eliminate memory accesses
– Common subexpression elimination
– Instruction reordering
– Software pipelining
• Unfortunately, programming languages and compilers are largely oblivious to memory consistency models
P1      P2              P1      P2
B=0     A=0             r1=0    r2=0
A=1     B=1             A=1     B=1
u=B     v=A             u=r1    v=r2
                        B=r1    A=r2
Left: (u,v)=(0,0) disallowed under SC. Right: (u,v)=(0,0) may occur here
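The effect of register allocation can be reproduced with a small enumeration. This sketch makes one explicit assumption that the slide does not state: the first row (B=0, A=0, r1=0, r2=0) is treated as initial values established before either process runs, so only the remaining statements are interleaved. The helper names `merges` and `outcomes` are hypothetical.

```python
from itertools import combinations

def merges(p1, p2):
    """All interleavings of p1 and p2 that keep each process's program order."""
    n = len(p1) + len(p2)
    for slots in combinations(range(n), len(p1)):
        a, b = iter(p1), iter(p2)
        yield [next(a) if i in slots else next(b) for i in range(n)]

def outcomes(p1, p2):
    """Set of (u, v) results over all interleavings; each op is (dst, src)."""
    results = set()
    for order in merges(p1, p2):
        # Assumption: these are initial values set before either process starts.
        env = {"A": 0, "B": 0, "r1": 0, "r2": 0}
        for dst, src in order:
            # A string source names a variable or register; anything else is a constant.
            env[dst] = env[src] if isinstance(src, str) else src
        results.add((env["u"], env["v"]))
    return results

original = outcomes([("A", 1), ("u", "B")], [("B", 1), ("v", "A")])
allocated = outcomes([("A", 1), ("u", "r1"), ("B", "r1")],
                     [("B", 1), ("v", "r2"), ("A", "r2")])
print(sorted(original), sorted(allocated))
```

Under this assumption the original code never produces (0,0) in any interleaving, while the register-allocated code produces (0,0) in every one, even on sequentially consistent hardware.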
What orderings are essential?
• Stores to A and B must complete before unlock
• Loads to A and B must be performed after lock
• Conclusion: may relax the sequential consistency semantics
initial: A, B=0
P1 P2
A := 1;
B := 3.1415
unlock(L) lock(L)
... = A;
... = B;
Hardware Centric Models
• Processor Consistency (Goodman 89)
• Total Store Ordering (Sindhu 90)
• Partial Store Ordering (Sindhu 90)
• Causal Memory (Hutto 90)
• Weak Ordering (Dubois 86)
[Figure: ordering diagrams over READ and WRITE operations (READ->READ, WRITE->WRITE, WRITE->READ, READ->WRITE)]
Relaxing write-to-read (PC, TSO)
• Why?
– Hardware may hide latency of write
• write-miss in write buffer, later reads hit, maybe even bypass write
• Writes stay ordered: write to flag not visible until write to A visible
• PC: non-atomic write (write does not complete wrt all other processors)
• Ex: Sequent Balance, Encore Multimax, VAX 8800, SparcCenter, SGI Challenge, Pentium-Pro
initial: A, flag, y == 0
P1 P2
(a) A = 1; (c) while (flag ==0) {}
(b) flag = 1; (d) y = A;
Comparing with SC
• Different results
– a, b: same for SC, TSO, PC
– c: PC allows A=0 (no write atomicity: A=1 may complete wrt P2 but not wrt P3)
– d: TSO and PC allow A=B=0 (reads execute before writes)
• Mechanism for ensuring SC semantics: MEMBAR (Sun SPARC V9)
– A subsequent read waits until all writes complete
[Figure: four example programs (a)-(d), each with “Initially A,B=0”]
Comparing with SC
• Mechanism for ensuring SC semantics: MEMBAR (Sun SPARC V9)
– A subsequent read waits until all writes complete
P1              P2
/* initially A, B = 0 */
A = 1;          B = 1;
membar;         membar;
print B;        print A;
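The role of the barrier can be illustrated with a toy store-buffer machine. This is a simplified sketch under stated assumptions, not a description of SPARC hardware: each processor has one FIFO store buffer that drains to memory at arbitrary times, a load first checks its own buffer, and `membar` can retire only when the buffer is empty. The function `tso_outcomes` and the operation encoding are hypothetical.

```python
def tso_outcomes(programs):
    """Explore every schedule of a toy TSO machine; return the set of printed tuples.

    Each program is a list of ops: ("st", loc, val), ("ld", loc) or ("membar",).
    Entry i of a result tuple is the value loaded (printed) by processor i.
    """
    n = len(programs)
    results = set()

    def step(pcs, bufs, mem, outs):
        if all(pcs[i] == len(programs[i]) for i in range(n)) and not any(bufs):
            results.add(outs)
            return
        for i in range(n):
            # Choice 1: drain the oldest buffered store of processor i to memory.
            if bufs[i]:
                loc, val = bufs[i][0]
                new_bufs = bufs[:i] + (bufs[i][1:],) + bufs[i + 1:]
                step(pcs, new_bufs, {**mem, loc: val}, outs)
            # Choice 2: execute the next instruction of processor i.
            if pcs[i] < len(programs[i]):
                op = programs[i][pcs[i]]
                new_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
                if op[0] == "st":
                    # Stores go into the buffer, not directly to memory.
                    new_bufs = bufs[:i] + (bufs[i] + ((op[1], op[2]),),) + bufs[i + 1:]
                    step(new_pcs, new_bufs, mem, outs)
                elif op[0] == "ld":
                    # A load sees the newest matching store in its own buffer, else memory.
                    own = [v for (l, v) in bufs[i] if l == op[1]]
                    value = own[-1] if own else mem[op[1]]
                    step(new_pcs, bufs, mem, outs[:i] + (value,) + outs[i + 1:])
                elif op[0] == "membar" and not bufs[i]:
                    # The barrier retires only once the store buffer is empty.
                    step(new_pcs, bufs, mem, outs)

    step((0,) * n, ((),) * n, {"A": 0, "B": 0}, (None,) * n)
    return results

plain = [[("st", "A", 1), ("ld", "B")], [("st", "B", 1), ("ld", "A")]]
fenced = [[("st", "A", 1), ("membar",), ("ld", "B")],
          [("st", "B", 1), ("membar",), ("ld", "A")]]
print(sorted(tso_outcomes(plain)))
print(sorted(tso_outcomes(fenced)))
```

Each tuple is (value P1 prints for B, value P2 prints for A). Without the barriers the model admits (0,0), because both reads can bypass the buffered writes. With the barriers (0,0) disappears and only the SC outcomes remain.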
Relaxing write-to-read and write-to-write (PSO)
• Why?
– Bypass multiple write cache misses
– Overlap several write operations => good performance
• But even example (a) breaks
– Use MEMBAR: a subsequent write waits until all previous writes have completed
[Figure: four example programs (a)-(d), each with “Initially A,B=0”]
Relaxing all orders
• Retain control and data dependences within each thread
• Why?
– allow multiple overlapping read operations
• May be bypassed by writes
• Hide read latency (for read misses)
• Two important models
– Weak ordering
– Release Consistency
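The relaxations discussed so far can be collected in a small lookup table. This is a summary sketch only; the names `RELAXED` and `preserved` are hypothetical. It covers ordinary accesses to different locations and ignores write atomicity (which separates PC from TSO) and the orderings imposed by synchronization operations in WO and RC.

```python
# Program orders (first op, second op) that each model allows to be reordered.
# "R" is a read, "W" is a write; accesses are to different locations.
ALL_PAIRS = {("R", "R"), ("R", "W"), ("W", "R"), ("W", "W")}

RELAXED = {
    "SC":  set(),                      # every order is kept
    "TSO": {("W", "R")},               # a later read may bypass an earlier write
    "PC":  {("W", "R")},
    "PSO": {("W", "R"), ("W", "W")},   # writes may also overlap each other
    "WO":  set(ALL_PAIRS),             # only synchronization operations order accesses
    "RC":  set(ALL_PAIRS),
}

def preserved(model, first, second):
    """True if the model keeps program order between the two access kinds."""
    return (first, second) not in RELAXED[model]

for model in RELAXED:
    kept = sorted(pair for pair in ALL_PAIRS if preserved(model, *pair))
    print(model, kept)
```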
Weak ordering
• synchronization operations wait for all previous memory operations to complete
• memory operations between two synchronization operations may complete in arbitrary order
[Figure legend: synchronization operation]
Release consistency
• Differentiate between synchronization operations
– acquire: read operation to gain access to a set of operations or variables
– release: write operation to grant access to other processors
– acquire must complete wrt all processors before following accesses
• Lock(TaskQ) before newTask->next = Head; …, UnLock(TaskQ)
– release must wait until the accesses before it complete
• UnLock(TaskQ) waits for Lock(TaskQ), …, Head=newTask->next;
[Figure legend: acquire, release]
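The task-queue example shows where the acquire and the release sit in code. The sketch below uses a Python `threading.Lock` as a stand-in for Lock(TaskQ)/UnLock(TaskQ). The classes `Task` and `TaskQueue` and the function `fill` are hypothetical, and the statement `Head = newTask` is an assumed reading of the steps the slide elides with "…".

```python
import threading

class Task:
    def __init__(self, name):
        self.name = name
        self.next = None

class TaskQueue:
    def __init__(self):
        self.head = None
        self.lock = threading.Lock()   # plays the role of Lock(TaskQ)/UnLock(TaskQ)

    def push(self, task):
        self.lock.acquire()            # acquire: completes before the accesses below
        task.next = self.head          # newTask->next = Head
        self.head = task               # Head = newTask (assumed step)
        self.lock.release()            # release: performed after the accesses above

    def __len__(self):
        count, node = 0, self.head
        while node is not None:
            count, node = count + 1, node.next
        return count

def fill(queue, workers=4, per_worker=100):
    """Push tasks from several threads and return the final queue length."""
    def work(worker_id):
        for i in range(per_worker):
            queue.push(Task(f"{worker_id}-{i}"))
    threads = [threading.Thread(target=work, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(queue)

print(fill(TaskQueue()))
```

Only the accesses between the acquire and the release need to be ordered with respect to them; everything a worker does outside `push` is free to overlap with other workers.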
Release consistency
• Intuition:
– The programmer inserts acquire/release operations for code that shares variables
– acquire has to complete before the following instructions
• Because the other processes must know a critical section is entered
• Acquire and code before acquire can be reordered
– The code before the release has to complete
• Because the critical section modifications must become visible to the others
• Release and code after release can be reordered
[Figure legend: acquire, release]
Preserved Orderings
• A block contains the instructions of one processor that may be reordered
• Intuitive results and performance if data races are eliminated through synchronization
[Figure: Weak Ordering: blocks 1, 2, 3 of read/write operations, separated by Synch operations. Release Consistency: blocks 1, 2, 3 of read/write operations, with an Acquire between blocks 1 and 2 and a Release between blocks 2 and 3]