CS 162 Memory Consistency Models

Memory operations are reordered to improve performance

- Hardware (e.g., store buffer, reorder buffer)
- Compiler (e.g., code motion, caching value in register)

Reordering in Uniprocessors

a1: St x; a2: Ld y   ≡   a2: Ld y; a1: St x

The program behaves the same as long as dependences are respected

Reordering in Multiprocessors

Initially x = y = 0

P1              P2
a1: x = 1;      b1: Ry = y;
a2: y = 1;      b2: Rx = x;

Possible outcomes
- a1, a2, b1, b2 gives (Rx=1, Ry=1)
- a1, b1, b2, a2 gives (Rx=1, Ry=0)
- b1, b2, a1, a2 gives (Rx=0, Ry=0)
- (Rx=0, Ry=1) is counter-intuitive program behavior: intuitively, y=1 implies x=1
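The possible outcomes of the x/y example can be checked mechanically. The sketch below is an illustration only: the helper names `interleavings` and `run` are made up here, and memory is modeled as a plain dictionary. It enumerates every interleaving that keeps each processor's program order, then repeats the experiment with P1's two stores swapped.

```python
from itertools import combinations

# Ops are ("st", variable, value) or ("ld", variable, register).
P1 = [("st", "x", 1), ("st", "y", 1)]          # a1, a2
P2 = [("ld", "y", "Ry"), ("ld", "x", "Rx")]    # b1, b2

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, var, arg in schedule:
        if op == "st":
            mem[var] = arg
        else:
            regs[arg] = mem[var]
    return regs["Rx"], regs["Ry"]

in_order = {run(s) for s in interleavings(P1, P2)}
reordered = {run(s) for s in interleavings(P1[::-1], P2)}  # a2 before a1

print(sorted(in_order))       # [(0, 0), (1, 0), (1, 1)]
print((0, 1) in reordered)    # True
```

With program order kept, (Rx=0, Ry=1) never appears; it shows up only once the two stores of P1 are allowed to swap.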

Reordering in Multiprocessors

Initially p = NULL, flag = false

P1                  P2
p = new A(…)        if (flag)
flag = true;            a = p->var;

flag is supposed to be set after p is allocated

If the two stores in P1 are reordered, P2 can see flag == true while p is still NULL: counter-intuitive program behavior

Reordering in Multiprocessors

Lock-free algorithms, e.g., Dekker, Peterson

Dekker Algorithm (mutual exclusion)

Initially flag1 = flag2 = 0

P1                      P2
flag1 = 1;              flag2 = 1;
if (flag2 == 0)         if (flag1 == 0)
    critical section        critical section

In P1, "flag1 = 1" is a store (St flag1) and "flag2 == 0" is a load (Ld flag2).

After reordering the store and the load, both loads can return 0, so both processors enter the critical section: counter-intuitive program behavior
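To see how a store buffer produces this result, the following sketch explores every schedule of the Dekker fragment under a simplified store-buffer model. This is an assumption made for illustration, not a description of any specific processor: each processor has a FIFO buffer whose entries drain to memory at arbitrary times, and a load checks its own buffer before memory. The function name `outcomes` and the register names `r1` and `r2` are hypothetical.

```python
def outcomes(programs, buffered):
    """Return every reachable combination of final register values.

    programs: one list of ("st", var, value) / ("ld", var, reg) ops per processor.
    buffered: if True, stores wait in a per-processor FIFO store buffer and
              reach memory later; if False, stores hit memory at once.
    """
    results, seen = set(), set()
    n = len(programs)

    def explore(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        if all(pcs[p] == len(programs[p]) for p in range(n)):
            results.add(regs)  # later drains cannot change registers
            return
        for p in range(n):
            if bufs[p]:  # drain the oldest buffered store of processor p
                (var, val), rest = bufs[p][0], bufs[p][1:]
                new_mem = tuple(sorted({**dict(mem), var: val}.items()))
                explore(pcs, bufs[:p] + (rest,) + bufs[p + 1:], new_mem, regs)
            if pcs[p] < len(programs[p]):  # execute processor p's next op
                op, var, arg = programs[p][pcs[p]]
                new_pcs = pcs[:p] + (pcs[p] + 1,) + pcs[p + 1:]
                if op == "st" and buffered:
                    new_buf = bufs[p] + ((var, arg),)
                    explore(new_pcs, bufs[:p] + (new_buf,) + bufs[p + 1:], mem, regs)
                elif op == "st":
                    new_mem = tuple(sorted({**dict(mem), var: arg}.items()))
                    explore(new_pcs, bufs, new_mem, regs)
                else:
                    # a load sees its own newest buffered store first, else memory
                    own = [v for x, v in bufs[p] if x == var]
                    val = own[-1] if own else dict(mem)[var]
                    new_regs = tuple(sorted({**dict(regs), arg: val}.items()))
                    explore(new_pcs, bufs, mem, new_regs)

    init_mem = tuple(sorted({v: 0 for prog in programs for _, v, _ in prog}.items()))
    explore((0,) * n, ((),) * n, init_mem, ())
    return results

P1 = [("st", "flag1", 1), ("ld", "flag2", "r1")]
P2 = [("st", "flag2", 1), ("ld", "flag1", "r2")]
BOTH_ENTER = (("r1", 0), ("r2", 0))  # both loads read 0

print(BOTH_ENTER in outcomes([P1, P2], buffered=False))  # False
print(BOTH_ENTER in outcomes([P1, P2], buffered=True))   # True
```

Without buffering, at least one load always sees the other processor's store. With buffering, both stores can still sit in their buffers when the loads execute, which is the St → Ld reordering.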

Memory Consistency Models

Specify the ordering of loads and stores to different memory locations

Ld → Ld, Ld → St, St → Ld, St → St

Contract between hardware, compiler, and programmer

hardware and compiler will not violate the ordering specified

the programmer will not assume a stricter order than that of the model

Memory Consistency Models

Model                     Allowed Reordering    Commercial Architecture
Sequential Consistency    None                  none exist
Total Store Ordering      St → Ld               x86, SPARC
Relaxed Memory Order      All                   ARM, PowerPC

From top to bottom of the table, performance goes from low to high and programmability goes from high to low.

Stronger models
- Stronger constraints
- Fewer memory reorderings
- Easier to reason about
- Lower performance

Cache Coherence vs. Memory Model

Cache coherence ensures a consistent view of memory

Guarantees that the update to memory by one processor will be seen by other processors eventually

But how consistent?
- NO guarantees on when an update should be seen
- NO guarantees on what order of updates should be seen

Cache Coherence vs. Memory Model

Initially A = B = 0

P1          P2                    P3
A = 1;      while (A != 1) ;      while (B != 1) ;
            B = 1;                tmp = A;

tmp = 1? or tmp = 0?

Sequential Consistency (SC)

Definition [Lamport]
(1) the result of any execution is the same as if the operations of all processors were executed in some sequential order;
(2) the operations of each individual processor appear in this sequence in the order specified by its program.

[Figure: processors P1, P2, P3, ..., Pn connected to a single MEMORY]

Behaves as the repetition of:
(1) pick a processor by any method (e.g., randomly)
(2) the processor completes a load/store operation
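The "pick a processor, complete one operation" view of SC can be written down directly. The sketch below is a toy model under stated assumptions: the op encoding, the `spin` op, and the name `sc_run` are invented for illustration. It runs the three-processor A/B/tmp example from the cache coherence slide on that abstract machine with many different random schedules.

```python
import random

# Ops: ("st", var, value), ("ld", var, register), or
# ("spin", var, value) for "while (var != value) ;".
P1 = [("st", "A", 1)]
P2 = [("spin", "A", 1), ("st", "B", 1)]
P3 = [("spin", "B", 1), ("ld", "A", "tmp")]

def sc_run(programs, rng):
    """Run the programs on the SC abstract machine with a single memory."""
    mem = {"A": 0, "B": 0}
    regs = {}
    pcs = [0] * len(programs)
    while any(pc < len(prog) for pc, prog in zip(pcs, programs)):
        # (1) pick a processor that still has work, here at random
        ready = [i for i, prog in enumerate(programs) if pcs[i] < len(prog)]
        p = rng.choice(ready)
        op, var, arg = programs[p][pcs[p]]
        # (2) that processor completes one load/store against memory
        if op == "st":
            mem[var] = arg
        elif op == "ld":
            regs[arg] = mem[var]
        elif mem[var] != arg:
            continue  # spin: the load saw the old value, so stay in the loop
        pcs[p] += 1
    return regs

results = {sc_run([P1, P2, P3], random.Random(seed))["tmp"] for seed in range(100)}
print(results)  # {1}
```

On this machine tmp is always 1: P3 reads A only after it has seen B = 1, and B = 1 is written only after P2 has seen A = 1. A model weaker than SC does not have to preserve that chain.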

SC Example

P1              P2
a1: x = 1;      b1: Ry = y;
a2: y = 1;      b2: Rx = x;

[Figure: interleavings of a1, a2, b1, b2. Orders that keep each processor's program order, e.g., b1, b2, a1, a2 giving (Rx=0, Ry=0), are SC executions. The order a2, b1, b2, a1 puts a2 before a1 and is not.]

Sequential Consistency (SC)

Simple and intuitive
- consistent with programmers' intuition
- easy to reason about program behavior

However, the simplicity comes at the cost of performance
- prevents aggressive compiler optimizations (e.g., load reordering, store reordering, caching value in register)
- constrains hardware utilization (e.g., store buffer)

SC Violation

P1              P2
a1: x = 1       b1: R1 = y
a2: y = 1       b2: R2 = x

[Figure: program order edges a1 → a2 and b1 → b2; conflict relation edges between a2 and b1 (on y) and between b2 and a1 (on x)]

- An SC violation is a cycle formed by program orders and conflict orders [Shasha and Snir, 1988], e.g., (a2, b1, b2, a1, a2)
- Executing in the order (a2, b1, b2, a1) will produce R1=1, R2=0, which is not an SC outcome
- Insert fences to break the cycle: a2 cannot be executed before a1
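The cycle criterion can be checked with an ordinary graph search. The sketch below is a minimal illustration, not the analysis from Shasha and Snir: the names `has_cycle` and `violates_sc` are made up, and the conflict pairs are hard-coded for this four-operation example. It orients each conflict pair by the order in which the two operations actually executed, then looks for a cycle together with the program-order edges.

```python
def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as (src, dst) pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2
    color = dict.fromkeys(graph, WHITE)

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)

PROGRAM_ORDER = [("a1", "a2"), ("b1", "b2")]
# Conflicting pairs: same location, at least one write (y for a2/b1, x for a1/b2).
CONFLICTS = [("a2", "b1"), ("a1", "b2")]

def violates_sc(execution):
    """True if the executed order forms a program-order/conflict-order cycle."""
    pos = {op: i for i, op in enumerate(execution)}
    conflict_order = [(u, v) if pos[u] < pos[v] else (v, u) for u, v in CONFLICTS]
    return has_cycle(PROGRAM_ORDER + conflict_order)

print(violates_sc(["a2", "b1", "b2", "a1"]))  # True
print(violates_sc(["a1", "a2", "b1", "b2"]))  # False
```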

Fence Instructions

Order memory operations before and after the fence

P1
p = new A(…)
FENCE
flag = true;

Inevitable: needed when building concurrent implementations (e.g., mutual exclusion, queues) [Attiya et al., POPL'11]

Expensive: Cilk-5's THE protocol spends 50% of its time executing a memory fence [Frigo et al., PLDI'98]

P1              P2
a1: St x        b1: St y
Fence1          Fence2
a2: Ld y        b2: Ld x
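As a rough picture of what a fence does, the sketch below models each processor with a FIFO store buffer and treats a fence as "drain my buffer before continuing". This is a deliberate simplification and an assumption of this example, not the behavior of any particular instruction set; the class name `Processor` and the function `run` are hypothetical. It replays one fixed schedule of the St x / Ld y and St y / Ld x program, with and without the fences.

```python
class Processor:
    """Toy processor: stores go to a private FIFO buffer before reaching memory."""

    def __init__(self, memory):
        self.memory = memory   # shared dictionary standing in for main memory
        self.buffer = []       # pending (variable, value) stores, oldest first

    def store(self, var, value):
        self.buffer.append((var, value))

    def load(self, var):
        # A load sees this processor's own newest pending store, else memory.
        for pending_var, value in reversed(self.buffer):
            if pending_var == var:
                return value
        return self.memory[var]

    def fence(self):
        # Make every earlier store visible before any later operation runs.
        for var, value in self.buffer:
            self.memory[var] = value
        self.buffer.clear()

def run(use_fence):
    memory = {"x": 0, "y": 0}
    p1, p2 = Processor(memory), Processor(memory)
    p1.store("x", 1)       # a1: St x
    p2.store("y", 1)       # b1: St y
    if use_fence:
        p1.fence()         # Fence1
        p2.fence()         # Fence2
    r1 = p1.load("y")      # a2: Ld y
    r2 = p2.load("x")      # b2: Ld x
    return r1, r2

print(run(use_fence=False))  # (0, 0)
print(run(use_fence=True))   # (1, 1)
```

Without the fences both loads run while the stores are still buffered, so both read 0. With the fences each store reaches memory before the following load.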

Conservativeness of Fences

Inserted statically and conservatively

P1                      P2
if (cond) a1: St x      b1: St y
Fence1                  Fence2
a2: Ld y                b2: Ld x

No cycle is formed at runtime in these cases:
- At time T, a1 and a2 have completed; b1 and b2 only execute after time T.
- a1 is in a conditional branch that is not taken.

Conservativeness of Fences

Inserted statically and conservatively

P1              P2
a1: St *p       b1: St x
Fence1          Fence2
a2: Ld x        b2: Ld *q

p and q may point to the same memory location, so the fences must be inserted

If p and q point to different locations at runtime, no cycle is formed

Processor-centric Fence

Traditional fence
- Processor-centric: unaware of memory accesses in other processors

However, the purpose of fences is to
- prevent memory accesses from being reordered and observed by other processors (i.e., a cycle formed at runtime)

Address-aware Fences

Consider memory locations accessed around fences at runtime

Fences only take effect when there is a cycle about to happen

Detect and Avoid Cycles

[Figure: Proc 1 executes a1, Fence1, a2 and Proc 2 executes b1, Fence2, b2, with labels A1, A2, B1, B2, edges c1 and c2 between the two processors, and a watchlist]

How to detect c2 efficiently?

Collect a watchlist for each fence

A completing memory operation checks the watchlist
- bypass, if its address is not in the watchlist
- stall, otherwise
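The bypass-or-stall rule can be pictured with a few lines of code. This is only a software sketch under loud assumptions: the class `AddressAwareFence`, its `check` method, and the example addresses are invented for illustration, and the watchlist is modeled as a plain set. How the hardware actually collects the watchlist is not covered here.

```python
from dataclasses import dataclass, field

@dataclass
class AddressAwareFence:
    """Hypothetical model of a fence that carries a watchlist of addresses."""
    watchlist: set = field(default_factory=set)

    def check(self, address):
        """Decide what a completing memory operation after the fence does."""
        return "stall" if address in self.watchlist else "bypass"

fence = AddressAwareFence(watchlist={0x1000, 0x2000})  # made-up addresses

print(fence.check(0x3000))  # bypass
print(fence.check(0x1000))  # stall
```

An operation whose address is not being watched proceeds as if there were no fence, which is where the savings over a traditional fence would come from.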

Performance: Execution Time

Traditional fence (T) vs. Address-aware fence (A)

Fence overhead becomes negligible

Further Reading

L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690–691, 1979.

S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66–76, 1996.

D. Shasha and M. Snir. Efficient and correct execution of parallel programs that share memory. ACM Trans. Program. Lang. Syst., 10(2):282–312, 1988.

D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures on Computer Architecture, 2011.

C. Lin, V. Nagarajan, and R. Gupta. Address-aware fences. ICS '13, pages 313–324, 2013.
