Memory Models: A Case for Rethinking Parallel Languages and Hardware†

Sarita V. Adve

University of Illinois

sadve@illinois.edu

Acks: Mark Hill, Kourosh Gharachorloo, Jeremy Manson, Bill Pugh, Hans Boehm, Doug Lea, Herb Sutter, Vikram Adve, Rob Bocchino, Marc Snir, Byn Choi, Rakesh Komuravelli, Hyojin Sung

† Also a paper by S. V. Adve & H.-J. Boehm, To appear in CACM

Memory Consistency Models

Parallelism for the masses!

Shared-memory most common

Memory model = Legal values for reads

20 Years of Memory Models

• Memory model is at the heart of concurrency semantics
  – 20 year journey from confusion to convergence at last!
  – Hard lessons learned
  – Implications for future
• Current way to specify concurrency semantics is too hard
  – Fundamentally broken
• Must rethink parallel languages and hardware
  – E.g., Illinois Deterministic Parallel Java, DeNovo architecture

What is a Memory Model?

• Memory model defines what values a read can return

Initially A=B=C=Flag=0

Thread 1          Thread 2
A = 26            while (Flag != 1) {;}
B = 90            r1 = B
…                 r2 = A
Flag = 1          …

[Slide annotation: read values 90; 26, 0]
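The flag handoff above can be sketched in C++ as follows. This is a minimal illustration, not code from the talk: it assumes `Flag` is declared as a `std::atomic<int>` (the slide does not say how `Flag` is declared), and the function name `run_flag_handoff` is hypothetical. With `Flag` atomic and default sequentially consistent ordering, the program has no data race, so Thread 2 must read 90 and 26. If `Flag` were a plain `int`, the program would contain a data race and C++ would give no guarantee about the reads.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Returns the values (r1, r2) that Thread 2 reads from B and A.
std::pair<int, int> run_flag_handoff() {
    int A = 0, B = 0;          // ordinary data variables
    std::atomic<int> Flag{0};  // assumption: Flag is marked as synchronization
    int r1 = 0, r2 = 0;

    std::thread t1([&] {
        A = 26;
        B = 90;
        Flag.store(1);         // default ordering is memory_order_seq_cst
    });
    std::thread t2([&] {
        while (Flag.load() != 1) { /* spin until Thread 1 sets the flag */ }
        r1 = B;
        r2 = A;
    });
    t1.join();
    t2.join();
    return {r1, r2};
}
```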

Memory Model is Key to Concurrency Semantics

• Interface between program and transformers of program
  – Defines what values a read can return

[Figure: C++ program, compiler, assembly, dynamic optimizer, hardware]

• Weakest system component exposed to the programmer
  – Language level model has implications for hardware
• Interface must last beyond trends

Desirable Properties of a Memory Model

• 3 Ps
  – Programmability
  – Performance
  – Portability
• Challenge: hard to satisfy all 3 Ps
  – Late 1980’s - 90’s: Largely driven by hardware
    • Lots of models, little consensus
  – 2000 onwards: Largely driven by languages/compilers
    • Consensus model for Java, C++ (C, others ongoing)
    • Had to deal with mismatches in hardware models

Path to convergence has lessons for future

Programmability – SC [Lamport79]

• Programmability: Sequential consistency (SC) most intuitive
  – Operations of a single thread in program order
  – All operations in a total order or atomic
• But Performance?
  – Recent (complex) hardware techniques boost performance with SC
  – But compiler transformations still inhibited
• But Portability?
  – Almost all h/w, compilers violate SC today

SC not practical, but…

Next Best Thing – SC Almost Always

• Parallel programming too hard even with SC
  – Programmers (want to) write well structured code
  – Explicit synchronization, no data races

    Thread 1        Thread 2
    Lock(L)         Lock(L)
    Read Data1      Read Data2
    Write Data2     Write Data1
    …               …
    Unlock(L)       Unlock(L)

  – SC for such programs much easier: can reorder data accesses

⇒ Data-race-free model [AdveHill90]
  – SC for data-race-free programs
  – No guarantees for programs with data races
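As a rough illustration of the lock-protected pattern above, the sketch below uses `std::mutex` for the lock `L`. The function name `run_locked_updates` and the "+ 1" computation are hypothetical, added only so the outcome can be observed; the slide does not say what is computed. Because every access to `Data1` and `Data2` happens while holding `L`, the program is data-race-free, and the only possible results are the two that correspond to one critical section running entirely before the other.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>

// Returns the final (Data1, Data2) after both critical sections have run.
std::pair<int, int> run_locked_updates() {
    int Data1 = 0, Data2 = 0;  // shared data, accessed only while holding L
    std::mutex L;

    std::thread t1([&] {
        std::lock_guard<std::mutex> guard(L);  // Lock(L) ... Unlock(L)
        int r = Data1;                         // Read Data1
        Data2 = r + 1;                         // Write Data2 (illustrative value)
    });
    std::thread t2([&] {
        std::lock_guard<std::mutex> guard(L);
        int r = Data2;                         // Read Data2
        Data1 = r + 1;                         // Write Data1 (illustrative value)
    });
    t1.join();
    t2.join();
    return {Data1, Data2};
}
```

Inside each critical section the compiler and hardware remain free to reorder the data accesses, since no other thread can observe the difference.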

Definition of a Data Race

• Distinguish between data and non-data (synchronization) accesses
• Only need to define for SC executions ⇒ total order
• Two memory accesses form a race if
  – From different threads, to same location, at least one is a write
  – Occur one after another

    Thread 1            Thread 2
    Write, A, 26
    Write, B, 90
                        Read, Flag, 0
    Write, Flag, 1
                        Read, Flag, 1
                        Read, B, 90
                        Read, A, 26

• A race with a data access is a data race
• Data-race-free program = No data race in any SC execution
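The definition above can be checked mechanically on a single SC execution. The sketch below is an illustration under stated assumptions: the `Access` record, `find_data_races`, and `slide_trace` are hypothetical names, and the trace encodes the execution shown on the slide. It scans the total order for adjacent conflicting accesses from different threads and reports those involving at least one data access. Note that it examines one execution only, whereas a program is data-race-free only if no SC execution has a data race.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Access {
    int thread;            // issuing thread
    bool is_write;         // true for a write, false for a read
    std::string location;  // variable accessed
    bool is_sync;          // true for synchronization, false for data
};

// Returns index pairs (i, i+1) of adjacent accesses in the total order that
// form a race and involve at least one data access.
std::vector<std::pair<std::size_t, std::size_t>>
find_data_races(const std::vector<Access>& order) {
    std::vector<std::pair<std::size_t, std::size_t>> races;
    for (std::size_t i = 0; i + 1 < order.size(); ++i) {
        const Access& a = order[i];
        const Access& b = order[i + 1];
        bool race = a.thread != b.thread && a.location == b.location &&
                    (a.is_write || b.is_write);
        if (race && (!a.is_sync || !b.is_sync)) races.emplace_back(i, i + 1);
    }
    return races;
}

// The execution from the slide; flag_is_sync says whether Flag is marked
// as a synchronization variable.
std::vector<Access> slide_trace(bool flag_is_sync) {
    return {
        {1, true, "A", false},
        {1, true, "B", false},
        {2, false, "Flag", flag_is_sync},
        {1, true, "Flag", flag_is_sync},
        {2, false, "Flag", flag_is_sync},
        {2, false, "B", false},
        {2, false, "A", false},
    };
}
```

With `Flag` marked as synchronization, the races on `Flag` are not data races; with `Flag` treated as ordinary data, both adjacent `Flag` pairs are reported.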

Data-Race-Free Model

Data-race-free model = SC for data-race-free programs
  – Does not preclude races for wait-free constructs, etc.
    • Requires races be explicitly identified as synchronization
    • E.g., use volatile variables in Java, atomics in C++
  – Dekker’s algorithm

    Initially Flag1 = Flag2 = 0
    volatile Flag1, Flag2

    Thread1                 Thread2
    Flag1 = 1               Flag2 = 1
    if Flag2 == 0           if Flag1 == 0
      //critical section      //critical section

SC prohibits both loads returning 0
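A minimal C++ sketch of this two-flag test, assuming the flags are `std::atomic<int>` with the default sequentially consistent ordering (the C++ counterpart of the Java `volatile` declaration on the slide). The function name `count_both_zero` and the round count are hypothetical. Each round records what the two loads returned instead of entering a critical section; under the C++ memory model the count of rounds in which both loads return 0 must be zero.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the two-flag test `rounds` times and returns how many rounds ended
// with both loads returning 0.
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> Flag1{0}, Flag2{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            Flag1.store(1);      // seq_cst store
            r1 = Flag2.load();   // seq_cst load
        });
        std::thread t2([&] {
            Flag2.store(1);
            r2 = Flag1.load();
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```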

Data-Race-Free Approach

• Programmer’s model: SC for data-race-free programs
• Programmability
  – Simplicity of SC, for data-race-free programs
• Performance
  – Specifies minimal constraints (for SC-centric view)
• Portability
  – Language must provide way to identify races
  – Hardware must provide way to preserve ordering on races
  – Compiler must translate correctly

1990's in Practice (The Memory Models Mess)

• Hardware
  – Implementation/performance-centric view
  – Different vendors had different models – most non-SC
    • Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray, …
  – Various ordering guarantees + fences to impose other orders
  – Many ambiguities - due to complexity, by design(?), …
• High-level languages
  – Most shared-memory programming with Pthreads, OpenMP
    • Incomplete, ambiguous model specs
    • Memory model property of language, not library [Boehm05]
  – Java – commercially successful language with threads
    • Chapter 17 of Java language spec on memory model
    • But hard to interpret, badly broken

[Figure: loads (LD), stores (ST), and a Fence]

2000 – 2004: Java Memory Model

• ~ 2000: Bill Pugh publicized fatal flaws in Java model
• Lobbied Sun to form expert group to revise Java model
• Open process via mailing list
  – Diverse participants
  – Took 5 years of intense, spirited debates
  – Many competing models
  – Final consensus model approved in 2005 for Java 5.0

[MansonPughAdve POPL 2005]

Java Memory Model Highlights

• Quick agreement that SC for data-race-free was required
• Missing piece: Semantics for programs with data races
  – Java cannot have undefined semantics for ANY program
  – Must ensure safety/security guarantees
  – Limit damage from data races in untrusted code
• Goal: Satisfy security/safety, w/ maximum system flexibility
  – Problem: “safety/security, limited damage” w/ threads very vague

Java Memory Model Highlights

Initially X=Y=0

Thread 1      Thread 2
r1 = X        r2 = Y
Y = r1        X = r2

Is r1=r2=42 allowed?

Data races produce causality loop!

Definition of a causality loop was surprisingly hard

Common compiler optimizations seem to violate “causality”

Java Memory Model Highlights

• Final model based on consensus, but complex
  – Programmers can (must) use “SC for data-race-free”
  – But system designers must deal with complexity
  – Correctness tools, racy programs, debuggers, …??
  – Recent discovery of bugs [SevcikAspinall08]

2005 – : C++, Microsoft Prism, Multicore

• ~ 2005: Hans Boehm initiated C++ concurrency model
  – Prior status: no threads in C++, most concurrency w/ Pthreads
• Microsoft concurrently started its own internal effort
• C++ easier than Java because it is unsafe
  – Data-race-free is plausible model
• BUT multicore ⇒ new h/w optimizations, more scrutiny
  – Mismatched h/w, programming views became painfully obvious
  – Debate that SC for data-race-free inefficient w/ hardware models

Hardware Implications of Data-Race-Free

• Synchronization (volatiles/atomics) must appear SC
  – Each thread’s synch must appear in program order

    synch Flag1, Flag2

    Thread 1              Thread 2
    Flag1 = 1             Flag2 = 1
    Fence                 Fence
    if Flag2 == 0         if Flag1 == 0
      critical section      critical section

    SC ⇒ both reads cannot return 0

  – Requires efficient fences between synch stores/loads
  – All synchs must appear in a total order (atomic)
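The fence placement above can be approximated at the C++ level with relaxed atomic accesses separated by `std::atomic_thread_fence(std::memory_order_seq_cst)`. This is a sketch under that assumption, not the hardware-level code the slide has in mind; the function name `count_both_zero_with_fences` is hypothetical. The sequentially consistent fence between each thread's store and load is what rules out both loads returning 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the fenced two-flag test `rounds` times and returns how many rounds
// ended with both loads returning 0.
int count_both_zero_with_fences(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> Flag1{0}, Flag2{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            Flag1.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);  // Fence
            r1 = Flag2.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            Flag2.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);  // Fence
            r2 = Flag1.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

How expensive the fence is depends on the target hardware, which is the cost the slide refers to.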

Implications of Atomic Synch Writes

Independent reads, independent writes (IRIW): Initially X=Y=0

T1        T2        T3         T4
X = 1     Y = 1     … = Y      … = X
                    fence      fence
                    … = X      … = Y

[Slide annotation: read values 0, 1, 1, 0]

SC ⇒ no thread sees new value until old copies invalidated
  – Shared caches w/ hyperthreading/multicore make this harder
  – Programmers don’t usually use IRIW
  – Why pay cost for SC in h/w if not useful to s/w?
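The IRIW test can be written with C++ atomics as sketched below, assuming `X` and `Y` are `std::atomic<int>` accessed with the default sequentially consistent ordering (so no explicit fences are needed between the reads). The function name `count_iriw_violations` is hypothetical. The outcome counted is the one in which the two readers disagree on the order of the two writes: T3 sees `Y == 1` then `X == 0`, while T4 sees `X == 1` then `Y == 0`. Sequential consistency for atomics forbids it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs IRIW `rounds` times and returns how many rounds showed the readers
// observing the two writes in opposite orders.
int count_iriw_violations(int rounds) {
    int violations = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> X{0}, Y{0};
        int r1 = 0, r2 = 0, r3 = 0, r4 = 0;
        std::thread t1([&] { X.store(1); });
        std::thread t2([&] { Y.store(1); });
        std::thread t3([&] { r1 = Y.load(); r2 = X.load(); });
        std::thread t4([&] { r3 = X.load(); r4 = Y.load(); });
        t1.join();
        t2.join();
        t3.join();
        t4.join();
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) ++violations;
    }
    return violations;
}
```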

C++ Challenges

2006: Pressure to change Java/C++ to remove SC baseline

To accommodate some hardware vendors

• But what is alternative?
  – Must allow some hardware optimizations
  – But must be teachable to undergrads
• Showed such an alternative (probably) does not exist

C++ Compromise

• Default C++ model is data-race-free
• AMD, Intel, … on board
• But
  – Some systems need expensive fence for SC
  – Some programmers really want more flexibility
    • C++ specifies low-level model only for experts
    • Complicates spec, but only for experts
    • We are not advertising this part

– [BoehmAdve PLDI 2008]
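For a sense of what the low-level, experts-only part looks like, the sketch below uses the explicit `std::memory_order` arguments that C++ provides on atomic operations. It is an illustrative example rather than anything taken from the talk: the names `run_release_acquire`, `payload`, and `ready` are hypothetical. A release store paired with an acquire load is enough to hand `payload` from one thread to another without requesting full sequential consistency on the flag.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the payload value observed by the consumer thread.
int run_release_acquire() {
    int payload = 0;                 // ordinary data
    std::atomic<bool> ready{false};  // synchronization variable
    int seen = -1;

    std::thread producer([&] {
        payload = 26;
        ready.store(true, std::memory_order_release);  // publish payload
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        seen = payload;  // ordered after the producer's write to payload
    });
    producer.join();
    consumer.join();
    return seen;
}
```

Dropping the ordering arguments gives the default sequentially consistent behavior, which is the model most programmers are meant to use.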

Summary of Current Status

• Convergence to “SC for data-race-free” as baseline
• For programs with data races
  – Minimal but complex semantics for safe languages
  – No semantics for unsafe languages

Lessons Learned

• SC for data-race-free minimal baseline
• Specifying semantics for programs with data races is HARD
  – But “no semantics for data races” also has problems
    • Not an option for safe languages; debugging; correctness checking tools
• Hardware-software mismatch for some code
  – “Simple” optimizations have unintended consequences

⇒ State-of-the-art is fundamentally broken

Banish shared-memory?

• We need
  – Higher-level disciplined models that enforce discipline
  – Hardware co-designed with high-level models

Banish wild shared-memory!

Deterministic Parallel Java [V. Adve et al.]

DeNovo hardware

Research Agenda for Languages

• Disciplined shared-memory models
  – Simple
  – Enforceable
  – Expressive
  – Performance

Key: What discipline? How to enforce it?

Data-Race-Free

• A near-term discipline: Data-race-free
• Enforcement
  – Ideally, language prohibits by design
  – Else, runtime catches as exception
• But data-race-free still not sufficiently high level

Deterministic-by-Default Parallel Programming

• Even data-race-free parallel programs are too hard
  – Multiple interleavings due to unordered synchronization (or races)
  – Makes reasoning and testing hard
• But many algorithms are deterministic
  – Fixed input gives fixed output
  – Standard model for sequential programs
  – Also holds for many transformative parallel programs
    • Parallelism not part of problem specification, only for performance

Why write such an algorithm in non-deterministic style, then struggle to understand and control its behavior?

Deterministic-by-Default Model

• Parallel programs should be deterministic-by-default
  – Sequential semantics (easier than SC!)
• If non-determinism is needed
  – should be explicitly requested
  – should be isolated from deterministic parts
• Enforcement:
  – Ideally, language prohibits by design
  – Else, runtime

State-of-the-art

• Many deterministic languages today
  – Functional, pure data parallel, some domain-specific, …
  – Much recent work on runtime, library-based approaches
• Our work: Language approach for modern O-O methods
  – Deterministic Parallel Java (DPJ) [V. Adve et al.]

Deterministic Parallel Java (DPJ)

• Object-oriented type and effect system
  – Use “named” regions to partition the heap
  – Annotate methods with effect summaries: regions read or written
  – If program type-checks, guaranteed deterministic
    * Simple, modular compiler checking
    * No run-time checks today, may add in future
  – Side benefit: regions, effects are valuable documentation
• Extended sequential subset of Java (DPC++ ongoing)
  – Initial evaluation for expressivity, performance [Oopsla09]
  – Integrating disciplined non-determinism
  – Encapsulating frameworks and unchecked code
  – Semi-automatic tool for effect annotations [ASE09]

Research Agenda for Hardware

• Current hardware not matched even to current model
• Near term: ISA changes, speculation
• Long term: Co-design hardware with new software models

Illinois DeNovo Project

• Design hardware to exploit disciplined parallelism
  – Simpler hardware
  – Scalable performance
  – Power/energy efficiency
• Working with DPJ as example disciplined model
  – Exploit data-race-freedom, region/effect information
    * Simpler coherence
    * Efficient communication: point to point, bulk, …
    * Efficient data layout: region vs. cache line centric memory
    * New hardware/software interface

Cache Coherence

• Commonly accepted definition (software-oblivious)
  – All writes to the same location appear in the same order
  – Source of much complexity
  – Coherence protocols to scale to 1000 cores?
• What do we really need (software-aware)?
  – Get the right data to the right task at the right time
  – Disciplined models make it easier to determine what is “right”
    (Assume only for-each loops)
• Read must return value of
  – Last write in its own task or
  – Last write in previous for-each loop
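The for-each discipline above can be illustrated in software with a two-phase computation. This is a sketch under stated assumptions, not DPJ or DeNovo code: `for_each_parallel` is a hypothetical helper that runs one thread per task and joins them all before returning, and `two_phase` is a made-up workload. In each loop, task `i` writes only its own element, and every read targets either that task's own write or a value written in the previous loop, so the result matches a sequential run.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: runs body(i) for i in [0, n), one thread per task,
// and joins every task before returning (the end of the for-each loop).
template <typename Body>
void for_each_parallel(std::size_t n, Body body) {
    std::vector<std::thread> tasks;
    for (std::size_t i = 0; i < n; ++i) tasks.emplace_back([=] { body(i); });
    for (auto& t : tasks) t.join();
}

std::vector<int> two_phase(std::size_t n) {
    std::vector<int> a(n, 0), b(n, 0);
    // Loop 1: task i writes only a[i].
    for_each_parallel(n, [&a](std::size_t i) {
        a[i] = static_cast<int>(i) * 2;
    });
    // Loop 2: task i reads a[] (last written in the previous loop) and
    // writes only b[i].
    for_each_parallel(n, [&a, &b, n](std::size_t i) {
        b[i] = a[i] + a[(i + 1) % n];
    });
    return b;
}
```

Because the writes of different tasks never overlap within a loop, no ordering between concurrent writes to the same location is ever needed, which is the observation the slide builds on.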

Today's Coherence Protocols

• Snooping
  – Broadcast, ordered networks
• Directory – avoid broadcast through level of indirection
  – Complexity: Races in protocol
    • Race-free software ⇒ race-free coherence protocol
  – Performance: Level of indirection
    • But definition of coherence no longer requires serialization
  – Overhead: Sharer list
    • Region-effects enable self-invalidations

+ No false sharing, flexible communication granularity, region based data layout

Simpler, more efficient DeNovo protocol

Conclusions

• Current way to specify concurrency semantics fundamentally broken
  – Best we can do is SC for data-race-free
    * But cannot hide from programs with data races
  – Mismatched hardware-software
    * Simple optimizations give unintended consequences
• Need
  – High-level disciplined models that enforce discipline
  – Hardware co-designed with high-level model
    DPJ – deterministic-by-default parallel programming
    DeNovo – hardware for disciplined parallel programming
• Previous memory models convergence from similar process
  – But this time, let’s co-design s/w, h/w