38
Multiprocessor Architectures for Speculative Multithreading Josep Torrellas, University of Illinois The Bulk Multicore Architecture for Programmability Josep Torrellas Department of Computer Science University of Illinois at Urbana- Champaign http://iacoma.cs.uiuc.edu

The Bulk Multicore Architecture for Programmability

  • Upload
    odell

  • View
    67

  • Download
    0

Embed Size (px)

DESCRIPTION

The Bulk Multicore Architecture for Programmability. Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu. Acknowledgments. Key contributors: Luis Ceze Calin Cascaval James Tuck Pablo Montesinos Wonsun Ahn Milos Prvulovic - PowerPoint PPT Presentation

Citation preview

Page 1: The Bulk Multicore Architecture  for Programmability

Multiprocessor Architectures for Speculative Multithreading Josep Torrellas, University of Illinois

The Bulk Multicore Architecture for Programmability

Josep Torrellas

Department of Computer ScienceUniversity of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu

Page 2: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

2

Acknowledgments

Key contributors: • Luis Ceze• Calin Cascaval• James Tuck• Pablo Montesinos• Wonsun Ahn• Milos Prvulovic• Pin Zhou• YY Zhou• Jose Martinez

Page 3: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

3

Challenges for Multicore Designers

• 100 cores/chip coming & there is little parallel SW

– System architecture should support programmable environment

• User-friendly concurrency and consistency models

• Always-on production-run debugging

Page 4: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

4

Challenges for Multicore Designers (II)

• Decreasing transistor size will make designing cores hard

– Design rules too hard to satisfy manually just shrink the core

– Cores will become commodity

• Big cores, small cores, specialized cores…

– Innovation will be in cache hierarchy & network

Page 5: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

5

Challenges for Multicore Designers (III)

• We will be adding accelerators:

– Accelerators need to have the same (simple) interface to the cache coherent fabric as processors

– Need simple memory consistency models

Page 6: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

6

A Vision of Year 2015-2018 Multicore

• 128+ cores per chip

• Simple shared-memory programming model(s):– Support for shared memory (perhaps in groups of procs)– Enforce interleaving restrictions imposed by the language &

concurrency model– Sequential memory consistency model for simplicity

• Sophisticated always-on debugging environment– Deterministic replay of parallel programs with no log– Data race detection at production-run speed– Pervasive program monitoring

Page 7: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

7

Proposal: The Bulk Multicore

• Idea: Eliminate the commit of individual instructions at a time

• Mechanism:

– Default is processors commit chunks of instructions at a time (e.g. 2,000 dynamic instr)

– Chunks execute atomically and in isolation (using buffering and undo)

– Memory effects of chunks summarized in HW signatures

• Advantages over current:

– Higher programmability

– Higher performance

– Simpler hardware

The Bulk

Multicore

Page 8: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

8

Rest of the Talk

• The Bulk Multicore• How it improves programmability• What’s next?

Page 9: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

9

Hardware Mechanism: Signatures

• Hardware accumulates the addresses read/written in signatures

• Read and Write signatures

• Summarize the footprint of a Chunk of code

[ISCA06]

Page 10: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

10

Signature Operations In Hardware

Inexpensive

Operations on

Groups of

Addresses

Page 11: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

11

Executing Chunks Atomically & In Isolation: Simple!

commit W0

W0 = sig(B,C)R0 = sig(X,Y)

W1 = sig(T)R1 = sig(B,C)

(W0 ∩ R1 ) (W0 ∩ W1)

ld Xst Bst Cld Y

Chunk

Thread 0 Thread 1

ld Bst Tld C

Page 12: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

12

Chunk Operation + Signatures: Bulk

st D

st A

P1

ld C

st C ld Dst X

st A

P2

st C

• Execute each chunk atomically and in isolation

• (Distributed) arbiter ensures a total order of chunk commits

[ISCA07]

• Supports Sequential Consistency [Lamport79]:

– Low hardware complexity:

– High performance:

P1 P2 P3 PN...

Mem

Logical picture

st Ast C

ld Dst X

st Ald C

st Dst C

Need not snoop ld buffer for consistency

Instructions are fully reordered by HW

(loads and stores make it in any order to the sig)

Page 13: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

13

Summary: Benefits of Bulk Multicore

• Gains in HW simplicity, performance, and programmability

• Hardware simplicity:

– Memory consistency support moved away from core

– Toward commodity cores

– Easy to plug-in accelerators

• High performance:

– HW reorders accesses heavily (intra- and inter-chunk)

Page 14: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

14

Benefits of Bulk Multicore (II)

• High programmability:

– Invisible to the programming model/language

– Supports Sequential Consistency (SC)

* Software correctness tools assume SC

– Enables novel always-on debugging techniques

* Only keep per-chunk state, not per-load/store state

* Deterministic replay of parallel programs with no log

* Data race detection at production-run speed

Page 15: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

15

Benefits of Bulk Multicore (III)

• Extension: Signatures visible to SW through ISA

– Enables pervasive monitoring

– Enables novel compiler opts

Many novel programming/compiler/tool opportunities

Page 16: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

16

Rest of the Talk

• The Bulk Multicore• How it improves programmability• What’s next?

Page 17: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

17

Supports Sequential Consistency (SC)

• Correctness tools assume SC:

– Verification tools that prove software correctness

• Under SC, semantics for data races are clear:

– Easy specifications for safe languages

• Much easier to debug parallel codes (and design debuggers)

• Works with “hand-crafted” synchronization

Page 18: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

18

Deterministic Replay of MP Execution

• During Execution: HW records into a log the order of dependences between threads

• The log has captured the “interleaving” of threads

• During Replay: Re-run the program

– Enforcing the dependence orders in the log

Page 19: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

19

Conventional Schemes

P1

Wa

Wb

P2

Ra

Wb

n2 m1

m2

n1 P2’s Log

P1 n1 m1

P1 n2 m2

• Potentially large logs

Page 20: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

20

Bulk: Log Necessary is Minuscule [ISCA08]

P1WaWb

P2

Rc

RbWc

chunk1

chunk2

Wa

Combined Log = NIL

If we fix the chunk commit interleaving:

Combined Log of all Procs:

P1

P2

Pi

• During Execution:

– Commit the instructions in chunks, not individually

Page 21: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

21

Data Race Detection at Production-Run Speed

Unlock L

Unlock L

Lock LLock L • If we detect communication between…

– Ordered chunks: not a data race

– Unordered chunks: data race

[ISCA03]

Page 22: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

22

Different Synchronization Ops

Unlock L

Unlock L

Lock LLock L

Lock

Set F

Wait F

Flag

Barrier

Barrier

Barrier

Page 23: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

23

Benefits of Bulk Multicore (III)

• Extension: Signatures visible to SW through ISA

– Enables pervasive monitoring [ISCA04]

Support numerous watchpoints for free

– Enables novel compiler opts [ASPLOS08]

Function memoization

Loop-invariant code motion

Page 24: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

24

Pervasive Monitoring:Attaching a Monitor Function to Address

instr

instrinstr

*p = ...

instr

instr

instr

instr

Watch(addr, usr_monitor)

usr_monitor(Addr){

…..

}

• Watch memory location

• Trigger monitoring function when it is accessed

Rest of MonitoringFunctionProgram

Main

Thread

Program

*p=

Page 25: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

25

Enabling Novel Compiler Optimizations

New instruction: Begin/End collecting addresses into sig

Page 26: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

26

Enabling Novel Compiler Optimizations

New instruction: Begin/End collecting addresses into sig

Page 27: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

27

Instruction: Begin/End Disambiguation Against Sig

Page 28: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

28

Instruction: Begin/End Disambiguation Against Sig

Page 29: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

29

Instruction: Begin/End Remote Disambiguation

Page 30: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

30

Instruction: Begin/End Remote Disambiguation

Page 31: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

31

Example Opt: Function Memoization

• Goal: skip the execution of functions

Page 32: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

32

• Goal: skip the execution of functions whose outputs are known

Example Opt: Function Memoization

Page 33: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

33

Example Opt: Loop-Invariant Code Motion

Page 34: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

34

Example Opt: Loop-Invariant Code Motion

Page 35: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

35

Rest of the Talk

• The Bulk Multicore• How it improves programmability• What’s next?

Page 36: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

36

What is Going On?

Bulk MulticoreHardware

Architecture

Compiler (Matt Frank et al)

Libraries and run time systems

Language support(Marc Snir, Vikram Adve et al)

Debugging Tools(Sam King, Darko Marinov et al)

FPGA Prototype (at Intel)

Page 37: The Bulk Multicore Architecture  for Programmability

Josep TorrellasThe BULK Multicore Architecture

37

Summary: The Bulk Multicore for Year 2015-2018

• 128+ cores/chip, shared-memory (perhaps in groups)

• Simple HW with commodity cores– Memory consistency checks moved away from the core

• High performance shared-memory programming model– Execution in programmer-transparent chunks – Signatures for disambiguation, cache coherence, and compiler opts – High-performance sequential consistency with simple HW

• High programmability: Sophisticated always-on debugging support– Deterministic replay of parallel programs with no log (DeLorean)– Data race detection for production runs (ReEnact)– Pervasive program monitoring (iWatcher)

Page 38: The Bulk Multicore Architecture  for Programmability

Multiprocessor Architectures for Speculative Multithreading Josep Torrellas, University of Illinois

The Bulk Multicore Architecture for Programmability

Josep Torrellas

Department of Computer ScienceUniversity of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu