CS 162
Memory Consistency Models
Memory operations are reordered to improve performance
Hardware (e.g., store buffer, reorder buffer)
Compiler (e.g., code motion, caching value in register)
Behave the same as long as dependences are respected
Reordering in Uniprocessors
a1: St x; a2: Ld y  ≡  a2: Ld y; a1: St x
Reordering in Multiprocessors
Counter-intuitive program behavior
Initially x=y=0
P1              P2
a1: x = 1;      b1: Ry = y;
a2: y = 1;      b2: Rx = x;
Possible outcomes
(Rx=1, Ry=1): a1, a2, b1, b2
(Rx=1, Ry=0): a1, b1, b2, a2
(Rx=0, Ry=0): b1, b2, a1, a2
(Rx=0, Ry=1): b2, a1, a2, b1 (possible only after reordering)
Intuitively, y=1 implies x=1
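The outcomes listed above can be checked by brute force. The following is a minimal sketch, not part of the original slides: it runs the four operations in every global order and separates the results reachable when each processor's program order is kept from those that need a reordering. The helper names `run` and `respects_program_order` are made up for this illustration.

```python
from itertools import permutations

OPS = ["a1", "a2", "b1", "b2"]  # a1: x = 1; a2: y = 1; b1: Ry = y; b2: Rx = x

def run(order):
    """Execute the four operations in the given global order; return (Rx, Ry)."""
    mem = {"x": 0, "y": 0}
    reg = {}
    for op in order:
        if op == "a1":
            mem["x"] = 1
        elif op == "a2":
            mem["y"] = 1
        elif op == "b1":
            reg["Ry"] = mem["y"]
        elif op == "b2":
            reg["Rx"] = mem["x"]
    return (reg["Rx"], reg["Ry"])

def respects_program_order(order):
    """True if a1 precedes a2 and b1 precedes b2 in the global order."""
    return (order.index("a1") < order.index("a2")
            and order.index("b1") < order.index("b2"))

all_orders = list(permutations(OPS))
in_order = {run(o) for o in all_orders if respects_program_order(o)}
any_order = {run(o) for o in all_orders}

print(sorted(in_order))              # [(0, 0), (1, 0), (1, 1)]
print(sorted(any_order - in_order))  # [(0, 1)]
```

Only (Rx=0, Ry=1) requires that a1/a2 or b1/b2 execute out of program order.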
Reordering in Multiprocessors
Counter-intuitive program behavior
Initially p=NULL, flag = false
P1                  P2
p = new A(…)        if (flag)
flag = true;            a = p->var;
flag is supposed to be set after p is allocated
Reordering in Multiprocessors
Counter-intuitive program behavior
Lock-free algorithms, e.g., Dekker, Peterson
Dekker Algorithm (mutual exclusion)
Initially flag1 = flag2 = 0
P1                      P2
flag1 = 1;              flag2 = 1;
if (flag2 == 0)         if (flag1 == 0)
    critical section        critical section
In P1, flag1 = 1 is a store (St flag1) and flag2 == 0 is a load (Ld flag2)
After reordering the load before the store, both loads can return 0, so both processors enter the critical section
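A small model makes the failure concrete. The sketch below is an illustration under assumptions, not the slides' own code: it explores every schedule of the two-line entry protocol, once with stores going straight to memory and once with a one-entry store buffer per processor, which is one hardware source of St → Ld reordering. The names `dekker_outcomes` and `put` are hypothetical.

```python
def put(t, i, v):
    """Return a copy of tuple t with position i set to v."""
    return t[:i] + (v,) + t[i + 1:]

def dekker_outcomes(store_buffer):
    """Return all (P1 in critical section, P2 in critical section) results."""
    results = set()

    def explore(flags, pending, pc, entered):
        moved = False
        for i in (0, 1):
            if pc[i] == 0:            # St: flag_i = 1
                moved = True
                if store_buffer:      # the store waits in P_i's buffer
                    explore(flags, put(pending, i, True), put(pc, i, 1), entered)
                else:                 # the store reaches memory at once
                    explore(put(flags, i, 1), pending, put(pc, i, 1), entered)
            elif pc[i] == 1:          # Ld: read the other flag from memory
                moved = True
                explore(flags, pending, put(pc, i, 2),
                        put(entered, i, flags[1 - i] == 0))
            if pending[i]:            # the buffered store drains to memory
                moved = True
                explore(put(flags, i, 1), put(pending, i, False), pc, entered)
        if not moved:                 # both processors done, buffers empty
            results.add(entered)

    explore((0, 0), (False, False), (0, 0), (False, False))
    return results

print((True, True) in dekker_outcomes(store_buffer=False))  # False
print((True, True) in dekker_outcomes(store_buffer=True))   # True
```

With the buffer, each load can complete while the processor's own store is still waiting, so neither processor sees the other's flag.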
Memory Consistency Models
Specify the ordering of loads and stores to different memory locations
Ld → Ld, Ld → St, St → Ld, St → St
Contract between hardware, compiler, and programmer
hardware and compiler will not violate the ordering specified
the programmer will not assume a stricter order than that of the model
Memory Consistency Models
Model                     Allowed Reordering    Commercial Architecture
Sequential Consistency    None                  does not exist
Total Store Ordering      St → Ld               x86, SPARC
Relaxed Memory Order      All                   ARM, PowerPC
Performance: low (Sequential Consistency) to high (Relaxed Memory Order)
Programmability: high (Sequential Consistency) to low (Relaxed Memory Order)
Stronger models
Stronger constraints
Fewer memory reorderings
Easier to reason about
Lower performance
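The table above can be read as a lookup from a model to the pairs of operations it lets swap. A minimal sketch, assuming the two operations target different addresses and carry no dependence; the names `ALLOWED` and `may_reorder` are illustrative only.

```python
# (earlier op, later op) pairs that each model allows to be reordered
ALLOWED = {
    "SC": set(),
    "TSO": {("St", "Ld")},
    "RMO": {("Ld", "Ld"), ("Ld", "St"), ("St", "Ld"), ("St", "St")},
}

def may_reorder(model, first, second):
    """True if `second` may complete before `first` under `model`."""
    return (first, second) in ALLOWED[model]

print(may_reorder("TSO", "St", "Ld"))  # True
print(may_reorder("TSO", "St", "St"))  # False
```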
Cache Coherence vs. Memory Model
Cache coherence ensures a consistent view of memory
Guarantees that the update to memory by one processor will be seen by other processors eventually
But, how consistent?
NO guarantees on when an update should be seen
NO guarantees on what order of updates should be seen
Cache Coherence vs. Memory Model
Initially A = B = 0
P1          P2                    P3
A = 1;      while (A != 1) ;      while (B != 1) ;
            B = 1;                tmp = A;
tmp = 1? or tmp = 0?
Sequential Consistency (SC)
Definition [Lamport]
(1) the result of any execution is the same as if the operations of all processors were executed in some sequential order;
(2) the operations of each individual processor appear in this sequence in the order specified by its program.
P1, P2, P3, …, Pn share a single MEMORY
Behave as the repetition:
(1) Pick a processor by any method (e.g., randomly)
(2) The processor completes a load/store operation
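The two-step repetition can be simulated directly. The sketch below is an illustration rather than anything from the slides: each processor is a Python generator that performs exactly one load or store per step, and a seeded random scheduler picks who goes next. It runs the three-processor program from the cache coherence slide (A = 1; wait for A, then B = 1; wait for B, then tmp = A). The names `run_sc`, `p1`, `p2`, `p3` are hypothetical.

```python
import random

def p1(mem):
    mem["A"] = 1                 # St A
    yield

def p2(mem):
    while True:
        a = mem["A"]             # Ld A
        yield
        if a == 1:
            break
    mem["B"] = 1                 # St B
    yield

def p3(mem, out):
    while True:
        b = mem["B"]             # Ld B
        yield
        if b == 1:
            break
    out["tmp"] = mem["A"]        # Ld A
    yield

def run_sc(seed):
    """Run the program on the SC abstract machine and return tmp."""
    rng = random.Random(seed)
    mem = {"A": 0, "B": 0}
    out = {}
    procs = [p1(mem), p2(mem), p3(mem, out)]
    while procs:
        p = rng.choice(procs)    # (1) pick a processor by any method
        try:
            next(p)              # (2) it completes one load/store
        except StopIteration:
            procs.remove(p)      # this processor has finished
    return out["tmp"]

print({run_sc(seed) for seed in range(100)})  # {1}
```

On this machine tmp is always 1, because P3 reads A only after seeing B = 1, which P2 wrote only after seeing A = 1. Weaker models need not give that guarantee.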
SC Example
P1              P2
a1: x = 1;      b1: Ry = y;
a2: y = 1;      b2: Rx = x;
(Rx=0, Ry=0) ≡ the sequential order b1, b2, a1, a2
[Figure: executions shown next to equivalent sequential orders that keep each processor's program order]
Sequential Consistency (SC)
Simple and intuitive
consistent with programmers’ intuition
easy to reason about program behavior
However, the simplicity comes at the cost of performance
prevents aggressive compiler optimizations (e.g., load reordering, store reordering, caching value in register)
constrains hardware utilization (e.g., store buffer)
SC Violation
P1              P2
a1: x = 1       b1: R1 = y
a2: y = 1       b2: R2 = x
program order: a1 → a2, b1 → b2
conflict relation: a2 and b1 (on y), b2 and a1 (on x)
- A cycle formed by program orders and conflict orders [Shasha and Snir, 1988], e.g., (a2, b1, b2, a1, a2)
- Executing in the order (a2, b1, b2, a1) will produce R1=1, R2=0, which is not an SC outcome
Insert fences to break the cycle
- a2 cannot be executed before a1
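The cycle test can be sketched as a graph search. This is a minimal illustration, not the algorithm from [Shasha and Snir, 1988]: nodes are the four operations, edges are the program orders plus the conflict orders observed in one execution, and a depth-first search reports whether they close a cycle. The function name `has_cycle` is made up here.

```python
def has_cycle(edges):
    """Detect a directed cycle with a depth-first search."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    state = {}  # node -> "visiting" or "done"

    def dfs(u):
        state[u] = "visiting"
        for v in graph[u]:
            if state.get(v) == "visiting":   # back edge closes a cycle
                return True
            if v not in state and dfs(v):
                return True
        state[u] = "done"
        return False

    return any(n not in state and dfs(n) for n in graph)

program_order = [("a1", "a2"), ("b1", "b2")]

# Execution (a2, b1, b2, a1): b1 reads y after a2 wrote it,
# and b2 reads x before a1 writes it.
violating = program_order + [("a2", "b1"), ("b2", "a1")]

# Execution (a1, a2, b1, b2): both reads follow both writes.
sequential = program_order + [("a2", "b1"), ("a1", "b2")]

print(has_cycle(violating))   # True
print(has_cycle(sequential))  # False
```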
Fence Instructions
Order memory operations before and after the fence
P1
p = new A(…)
FENCE
flag = true;
Inevitable when building concurrent implementations (e.g., mutual exclusion, queues) [Attiya et al., POPL’11]
Expensive: Cilk-5’s THE protocol spends 50% of its time executing a memory fence [Frigo et al., PLDI’98]
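The ordering effect of a fence can be seen in a toy store-buffer model. The sketch below rests on assumptions of this illustration, not on the slides: P1 runs St x; Ld y and P2 runs St y; Ld x, each store first waits in a one-entry buffer, and a fence between the store and the load is modeled as "the load may not run until the processor's own buffer is empty". The names `litmus_outcomes` and `put` are hypothetical.

```python
def put(t, i, v):
    """Return a copy of tuple t with position i set to v."""
    return t[:i] + (v,) + t[i + 1:]

def litmus_outcomes(fenced):
    """Return all (value P1 loads from y, value P2 loads from x) results."""
    results = set()

    def explore(mem, pending, pc, loaded):
        # mem[0] is x (stored by P1), mem[1] is y (stored by P2)
        moved = False
        for i in (0, 1):
            if pc[i] == 0:                    # St: goes into P_i's buffer
                moved = True
                explore(mem, put(pending, i, True), put(pc, i, 1), loaded)
            elif pc[i] == 1 and not (fenced and pending[i]):
                moved = True                  # Ld of the other location
                explore(mem, pending, put(pc, i, 2),
                        put(loaded, i, mem[1 - i]))
            if pending[i]:                    # buffered store drains
                moved = True
                explore(put(mem, i, 1), put(pending, i, False), pc, loaded)
        if not moved:
            results.add(loaded)

    explore((0, 0), (False, False), (0, 0), (None, None))
    return results

print((0, 0) in litmus_outcomes(fenced=False))  # True
print((0, 0) in litmus_outcomes(fenced=True))   # False
```

The fenced version pays for this guarantee by making every load wait for the buffer to drain, whether or not another processor is touching the same locations.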
Conservativeness of Fences
Inserted statically and conservatively
P1              P2
a1: St x        b1: St y
Fence1          Fence2
a2: Ld y        b2: Ld x
At time T, a1 and a2 have completed; b1 and b2 only execute after time T.
No cycle is formed at runtime
Conservativeness of Fences
Inserted statically and conservatively
P1                      P2
if (cond) a1: St x      b1: St y
Fence1                  Fence2
a2: Ld y                b2: Ld x
a1 is in a conditional branch
No cycle is formed at runtime if the branch is not taken
P1              P2
a1: St *p       b1: St x
Fence1          Fence2
a2: Ld x        b2: Ld *q
p and q may point to the same memory location
No cycle is formed at runtime if p and q point to different locations
Processor-centric Fence
Traditional fence
Processor-centric - unaware of memory accesses in other processors
However, the purpose of fences is to prevent memory accesses from being reordered and observed by other processors (i.e., a cycle formed at runtime)
Address-aware Fences
Consider memory locations accessed around fences at runtime
Fences only take effect when a cycle is about to form
Detect and Avoid Cycles
[Figure: Proc 1 runs a1, Fence1, a2 and Proc 2 runs b1, Fence2, b2; labels A1, A2, B1, B2 mark the accesses around the fences, c1 is a conflict edge, c2? is the edge that would close the cycle]
How to detect c2 efficiently?
Collecting watchlist for each fence
Completing memory operation checks the watchlist
- bypass, if its address is not in the watchlist
- stall, otherwise
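The bypass/stall rule amounts to a set-membership test. A minimal sketch, assuming the watchlist is already available as a set of addresses (the slides do not say how it is collected); the name `check_watchlist` and the addresses are made up for illustration.

```python
def check_watchlist(address, watchlist):
    """Decide what a completing memory operation does at a pending fence."""
    if address in watchlist:
        return "stall"    # completing now could close a cycle
    return "bypass"       # no cycle possible through this address

watchlist = {0x1000, 0x1008}  # hypothetical addresses watched by the fence

print(check_watchlist(0x2000, watchlist))  # bypass
print(check_watchlist(0x1008, watchlist))  # stall
```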
Performance: Execution Time
Traditional fence (T) vs. Address-aware fence (A)
Fence overhead becomes negligible
Further Reading
L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690–691, 1979.
S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66–76, 1996.
D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst., 10(2):282–312, 1988.
D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture, 2011.
C. Lin, V. Nagarajan, and R. Gupta. Address-aware fences. ICS ’13, pages 313–324, 2013.