59
Coherence, Consistency, and Déjà vu: Memory Hierarchies in the Era of Specialization Sarita Adve University of Illinois at Urbana-Champaign [email protected] w/ Johnathan Alsop, Rakesh Komuravelli, Matt Sinclair, Hyojin Sung and numerous colleagues and students over > 25 years This work was supported in part by the NSF and by C-FAR, one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.

[email protected] w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Coherence, Consistency, and Déjà vu:

Memory Hierarchies in the Era of Specialization

Sarita AdveUniversity of Illinois at Urbana-Champaign

[email protected]

w/ Johnathan Alsop, Rakesh Komuravelli, Matt Sinclair, Hyojin Sung

and numerous colleagues and students over > 25 years

This work was supported in part by the NSF and by C-FAR, one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.

Page 2: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

BUT impact software, hardware, hardware-software interface

This talk: Memory hierarchy for heterogeneous parallel systems

Global address space, coherence, consistency

But first …

Silver Bullets for End of Moore’s Law?

Parallelism Specialization

Page 3: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

3

My Story (1988 2016)

Page 4: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory consistency model?

– Simplest model: sequential consistency (SC) [Lamport79]

• Memory operations execute one at a time in program order

• Simple, but inefficient

– Implementation/performance-centric view

• Order in which memory operations execute

• Different vendors w/ different models (orderings)

– Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray, …

• Many ambiguities due to complexity, by design(?), …

Memory model = What value can a read return?

HW/SW Interface: affects performance, programmability, portability

4

My Story (1988 2016)

LD

LD

LD

ST

ST

ST

ST

LD

Fence

Page 5: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory model?

– What value can a read return?

• 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

• Distinguish data vs. synchronization (race)

• Data (non-race) can be optimized

• High performance for DRF programs

5

My Story (1988 2016)

Page 6: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory model?

– What value can a read return?

• 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

• 2000-05: Java memory model [POPL05]

– DRF model

6

My Story (1988 2016)

Page 7: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory model?

– What value can a read return?

• 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

• 2000-05: Java memory model [POPL05]

– DRF model BUT racy programs need semantics

No out-of-thin-air values

7

My Story (1988 2016)

Page 8: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory model?

– What value can a read return?

• 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

• 2000-05: Java memory model [POPL05]

– DRF model BUT racy programs need semantics

No out-of-thin-air values DRF + big mess

8

My Story (1988 2016)

Page 9: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory model?

– What value can a read return?

• 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

• 2000-05: Java memory model [POPL05]

– DRF model + big mess

• 2005-08: C++ memory model [PLDI 2008]

– DRF model

9

My Story (1988 2016)

Page 10: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory model?

– What value can a read return?

• 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

• 2000-05: Java memory model [POPL05]

– DRF model + big mess

• 2005-08: C++ memory model [PLDI 2008]

– DRF model BUT need high performance; mismatched hardware

Relaxed atomics

10

My Story (1988 2016)

Page 11: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory model?

– What value can a read return?

• 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

• 2000-05: Java memory model [POPL05]

– DRF model + big mess

• 2005-08: C++ memory model [PLDI 2008]

– DRF model BUT need high performance; mismatched hardware

Relaxed atomics DRF + big mess

11

My Story (1988 2016)

Page 12: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory model?

– What value can a read return?

• 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

• 2000-08: Java, C++, … memory model [POPL05, PLDI08]

– DRF model + big mess (but after 20 years, convergence at last)

12

My Story (1988 2016)

Page 13: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory model?

– What value can a read return?

• 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

• 2000-08: Java, C++, … memory model [POPL05, PLDI08, CACM10]

– DRF model + big mess (but after 20 years, convergence at last)

13

My Story (1988 2016)

Page 14: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory model?

– What value can a read return?

• 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

• 2000-08: Java, C++, … memory model [POPL05, PLDI08, CACM10]

– DRF model + big mess (but after 20 years, convergence at last)

• 2008-14: Software-centric view for coherence: DeNovo protocol

– More performance-, energy-, and complexity-efficient than MESI

[PACT12, ASPLOS14, ASPLOS15]

• Began with DPJ’s disciplined parallelism [OOPSLA09, POPL11]

• Identified fundamental, minimal coherence mechanisms

• Loosened s/w constraints, but still minimal, efficient hardware14

My Story (1988 2016)

Page 15: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory model?

– What value can a read return?

• 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

• 2000-08: Java, C++, … memory model [POPL05, PLDI08, CACM10]

– DRF model + big mess (but after 20 years, convergence at last)

• 2008-14: Software-centric view for coherence: DeNovo protocol

– More performance-, energy-, and complexity-efficient than MESI

[PACT12, ASPLOS14, ASPLOS15]

• 2014-: Déjà vu: Heterogeneous systems [ISCA15, Micro15]

15

My Story (1988 2016)

Page 16: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Traditional Heterogeneous SoC Memory Hierarchies

• Loosely coupled memory hierarchies

– Local memories don’t communicate with each other

– Unnecessary data movement

16

Main Memory

Interconnect

Modem

GPS

DSP DSP

GPU

A/V HW Accels.

DSPMulti-

media

CPU

L1 Cache

L2 Cache

CPU

L1 Cache

Vect.Vect.

A tightly coupled memory hierarchy is needed

Page 17: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Tightly Coupled SoC Memory Hierarchies

• Tightly coupled memory hierarchies: unified address space

– Entering mainstream, especially CPU-GPU

– Accelerator can access CPU’s data using same address

17

L2 $Bank

Interconnection n/w

Accelerator

L2 $Bank

CPU

CacheCache

Inefficient coherence and consistency

Specialized private memories still used for efficiency

Page 18: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Memory Hierarchies for Heterogeneous SoC

18

• Efficient coherence (DeNovo), simple consistency (DRF)[MICRO ’15, Top Picks ’16 Honorable Mention]

• Better semantics for relaxed atomics and evaluation[in review]

• Integrate specialized memories in global address space[ISCA ’15, Top Picks ’16 Honorable Mention]

EfficiencyProgrammability

Page 19: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Memory Hierarchies for Heterogeneous SoC

19

• Efficient coherence (DeNovo), simple consistency (DRF)[MICRO ’15, Top Picks ’16 Honorable Mention]

• Better semantics for relaxed atomics and evaluation[in review]

• Integrate specialized memories in global address space[ISCA ’15, Top Picks ’16 Honorable Mention]

Focus: CPU-GPU systems with caches and scratchpads

EfficiencyProgrammability

Page 20: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

CPU Coherence: MSI

• Single writer, multiple reader

– On write miss, get ownership + invalidate all sharers

– On read miss, add to sharer list

Directory to store sharer list

Many transient states

Excessive traffic, indirection20

L2 Cache,Directory

Interconnection n/w

CPU

L2 Cache,Directory

CPU

L1 CacheL1 Cache

Complex + inefficient

Page 21: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

GPU Coherence with DRF

• With data-race-free (DRF) memory model

– No data races; synchs must be explicitly distinguished

– At all synch points

• Flush all dirty data: Unnecessary writethroughs

• Invalidate all data: Can’t reuse data across synch points

– Synchronization accesses must go to last level cache (LLC)

Simple, but inefficient at synchronization 21

L2 CacheBank

Interconnection n/w

GPU

L2 CacheBank

CPU

CacheCacheValidDirtyValid

Flush dirty

data

Invalidate

all data

Page 22: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

GPU Coherence with DRF

• With data-race-free (DRF) memory model

– No data races; synchs must be explicitly distinguished

– At all synch points

• Flush all dirty data: Unnecessary writethroughs

• Invalidate all data: Can’t reuse data across synch points

– Synchronization accesses must go to last level cache (LLC)

22

Page 23: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• With data-race-free (DRF) memory model

– No data races; synchs must be explicitly distinguished

– At all synch points

• Flush all dirty data: Unnecessary writethroughs

• Invalidate all data: Can’t reuse data across synch points

– Synchronization accesses must go to last level cache (LLC)

– No overhead for locally scoped synchs

• But higher programming complexity

GPU Coherence with HRF

23

heterogeneous HRF[ASPLOS ’14]

global

and their scopes

Global

heterogeneous

Page 24: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Modern GPU Coherence & Consistency

Consistency Coherence

De Facto

Recent

24

Heterogeneous-race-free (HRF)

Scoped synchronization

Complex

No overhead for local synchs

Efficient for local synch

Data-race-free (DRF)

Simple

High overhead on synchs

Inefficient

DeNovo+DRF: Efficient AND simpler memory model

Do GPU models (HRF) need to be more complex than CPU models (DRF)?

NO! Not if coherence is done right!

Page 25: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• Read hit: Don’t return stale data

• Read miss: Find one up-to-date copy

A Classification of Coherence Protocols

Invalidator

Writer Reader

Track

up-to-

date

copy

Ownership

Writethrough

MESI

GPU

DeNovo

25

• Reader-initiated invalidations

– No invalidation or ack traffic, directories, transient states

• Obtaining ownership for written data

– Reuse owned data across synchs (not flushed at synch points)

Page 26: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

DeNovo Coherence with DRF

• With data-race-free (DRF) memory model

– No data races; synchs must be explicitly distinguished

– At all synch points

• Flush all dirty data

• Invalidate all data

– Synchronization accesses must go to last level cache (LLC)

• 3% state overhead vs. GPU coherence + HRF26

L2 CacheBank

Interconnection n/w

GPU

L2 CacheBank

CPU

CacheCacheObtain

ownership

Invalidate

non-owned dataDirtyValidOwn

Can reuse

owned dataObtain ownership for dirty data

can be at L1all non-owned data

Page 27: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

DeNovo Configurations Studied

27

• DeNovo+DRF:

– Invalidate all non-owned data at synch points

• DeNovo+RO+DRF:

– Avoids invalidating read-only data at synch points

• DeNovo+HRF:

– Reuse valid data if synch is locally scoped

Page 28: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Coherence & Consistency Summary

28

Coherence + ConsistencyReuse Data

Owned Valid

Do Synchs

at L1

X X X

local local local

X

local

(GD)

(GH)

(DD)

(DH)

(DD+RO) read-only

GPU + DRF

GPU + HRF

DeNovo-RO + DRF

DeNovo + DRF

DeNovo + HRF

Page 29: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Evaluation Methodology

29

• 1 CPU core + 15 GPU compute units (CU)

– Each node has private L1, scratchpad, tile of shared L2

• Simulation Environment

– GEMS, Simics, Garnet, GPGPU-Sim, GPUWattch, McPAT

• Workloads

– 10 apps from Rodinia, Parboil: no fine-grained synch

• DeNovo and GPU coherence perform comparably

– UC-Davis microbenchmarks + UTS from HRF paper:

• Mutex, semaphore, barrier, work sharing

• Shows potential for future apps

• Created two versions of each: global, local/hybrid scoped synch

Page 30: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

FAM SLM SPM SPMBO AVG

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10

DeNovo has 28% lower execution time than GPU with global synch

30

Global Synch – Execution Time

Page 31: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10

Series5 Series4 Series3 Series2 Series1

Global Synch – Energy

DeNovo has 51% lower energy than GPU with global synch

31

FAM SLM SPM SPMBO AVG

Page 32: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Local Synch – Execution Time

FAM SLM SPM SPMBO SS SSBO TBEX UTS AVG

32

TB

GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]

GD GH DD DD+RO DH

Page 33: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Local Synch – Execution Time

FAM SLM SPM SPMBO SS SSBO TBEX UTS AVG

33

TB

GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]

DeNovo+DRF comparable to GPU+HRF, but simpler consistency model

GD GH DD DD+RO DH

Page 34: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Local Synch – Execution Time

FAM SLM SPM SPMBO SS SSBO TBEX UTS AVG

34

TB

GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]

DeNovo+DRF comparable to GPU+HRF, but simpler consistency model

DeNovo-RO+DRF reduces gap by not invalidating read-only data

GD GH DD DD+RO DH

Page 35: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Local Synch – Execution Time

FAM SLM SPM SPMBO SS SSBO TBEX UTS AVG

35

TB

GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]

DeNovo+DRF comparable to GPU+HRF, but simpler consistency model

DeNovo-RO+DRF reduces gap by not invalidating read-only data

DeNovo+HRF is best, if consistency complexity acceptable

GD GH DD DD+RO DH

Page 36: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Series5 Series4 Series3 Series2 Series1

Local Synch – Energy

Energy trends similar to execution time36

FAM SLM SPM SPMBO SS SSBO TBEX UTS AVGTB

Page 37: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Memory Hierarchies for Heterogeneous SoC

37

• Efficient coherence (DeNovo), simple consistency (DRF)[MICRO ’15, Top Picks ’16 Honorable Mention]

• Better semantics for relaxed atomics and evaluation[in review]

• Integrate specialized memories in global address space[ISCA ’15, Top Picks ’16 Honorable Mention]

EfficiencyProgrammability

Page 38: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

38

Background: Atomic Ordering Constraints

ConsistencyData

Data

DRF0

Synch

Data

Synch

Synch

Retains SC

semantics?

Data - Acq,

Rel - DataX

Allowed Reorderings

• Racy synchronizations implemented with atomics

• DRF0: simple and efficient

– All atomics assumed to order data accesses

– Atomics can’t be reordered with atomics

– Ensures SC semantics

Page 39: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

39

Background: Atomic Ordering Constraints

ConsistencyData

Data

DRF0

DRF1

Synch

Data

Synch

Synch

Retains SC

semantics?

Data - Acq,

Rel - DataX

DRF0 +

unpairedX

Allowed Reorderings

• DRF1 uses more SW info to identify unpaired atomics

– Unpaired atomics do not order any data accesses

– Unpaired avoid invalidations/flushes for heterogeneous systems

– Ensures SC semantics

Page 40: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

40

Background: Atomic Ordering Constraints

ConsistencyData

Data

DRF0

DRF1

DRF1 + relaxed

atomics

Synch

Data

Synch

Synch

Retains SC

semantics?

Data - Acq,

Rel - DataX

DRF0 +

unpairedX

DRF1 +

relaxedXrelaxed

Allowed Reorderings

• Use more SW info to identify relaxed atomics

– Reordered, overlapped with all other memory accesses

– BUT violates SC: very hard to formalize and reason about

Page 41: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• CPUs: clear directions to programmers to avoid

– Only expert programmers of performance critical code use

– C, C++, Java: still no acceptable semantics!

• Heterogeneous Systems

– OpenCL, HSA, HRF: adopted similar consistency to DRF0

– But generally use simple, SW-based coherence

– Cost of staying away from relaxed atomics too high!

– DeNovo helps, but relaxed atomics still beneficial

41

Relaxed Atomics

Can we utilize additional SW info to give SC even with relaxed atomics?

Page 42: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

42

How Are Relaxed Atomics Used?

• Examined how relaxed atomics are used

– Collected examples from developers

– Categorized which relaxations are beneficial

– Extended DRF0/1 to allow varying levels of atomic relaxation

• Contributions:

1. DRF2: preserves SC-centric semantics

2. Evaluated benefit of using relaxed atomics

• Usually small benefit (≤ 6% better perf for microbenchmarks)

• Gains sometimes significant (up to 60% better perf for PR)

Page 43: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• How existing apps use relaxed atomics:

1. Unpaired

2. Commutative

3. Non-Ordering

4. Quantum

43

Relaxed Atomic Use Cases

Page 44: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Accel

1. Threads concurrently update counters

– Read part of a data array, updated its counter

44

Commutative – Event Counter

L1Cache

L1Cache

L1Cache

L1 Cache

L2 Cache Counters…0 0 0 0 0 0

1 1 1 1 1 1

Page 45: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Accel

1. Threads concurrently update counters

– Read part of a data array, updated its counter

– Increments race, so have to use atomics

45

Commutative – Event Counter (Cont.)

L1Cache

L1Cache

L1Cache

L1 Cache

L2 Cache Counters…2 1 3 2 1 1

1 1 1 1 1 1

Page 46: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Accel

1. Threads concurrently update counters

– Read part of a data array, updated its counter

– Increments race, so have to use atomics

2. Once all threads done, one thread reads all counters

46

Commutative – Event Counter (Cont.)

L1Cache

L1Cache

L1Cache

L1 Cache

L2 Cache Counters…

7 1 9 1 5 3

Read all

counters

Page 47: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• DRF0 and DRF1 ensure SC semantics

– DRF0 overly restrictive: increments do not order data

– DRF1: little benefit because no reuse in data

• Relaxed atomics:

– Reorder, overlap atomics from same thread

– Commutative increments – result is same regardless of order

• DRF2

– Distinguish commutative: intermediate values not observable

– Define commutative races

– Program is DRF2 if DRF1 and no commutative races

– DRF2 systems give efficiency and SC to DRF2 programs47

Commutative – Event Counter (Cont.)

Page 48: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Evaluation Methodology

48

• Similar simulation environment

– Extended to compare DRF0, DRF1, and DRF2

– Do not compare to HRF because few apps use scopes

• Workloads

– Microbenchmarks

• Traditional use cases for relaxed atomics

• Stress memory system (high contention)

– All benchmarks from major suites with > 2% global atomics

• UTS, PageRank (PR), Betweeness Centrality (BC)

• Show 4 representative graphs for PR and BC

Page 49: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

49

Relaxed Atomic Microbenchmarks – Execution Time

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42

HG_NO Flags SCHHG RC AVG

102 102105 103

GD0 = GPU coherence with DRF0

GD1 = GPU coherence with DRF1

GD2 = GPU coherence with DRF2

DD0 = DeNovo coherence with DRF0

DD1 = DeNovo coherence with DRF1

DD2 = DeNovo coherence with DRF2

Page 50: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

DRF1 and DRF2 do not significantly affect performance (≤ 6% on average)

DeNovo exploits synch reuse, outperforms GPU (DRF2: 11% avg)

50

Relaxed Atomic Microbenchmarks – Execution Time

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42

HG_NO Flags SCHHG RC AVG

102 102105 103

Page 51: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42

Series5 Series4 Series3 Series2 Series1

51

Relaxed Atomic Microbenchmarks – Energy

HG_NO Flags SCHHG RC AVG

103 101103

Energy trends similar to execution time

Page 52: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

52

Relaxed Atomic Apps – Execution Time

PR-2 PR-3 PR-4PR-1UTS BC-1 BC-2 BC-3 BC-4 AVG

Weakening consistency model helps a lot for PageRank (up to 60% for GPU)

DRF1 avoids costly synchronization overhead (23% average improvement)

DRF2 overlaps atomics (up to 21% better than DRF1)

106103103

102

Page 53: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Series5 Series4 Series3 Series2 Series1

Energy somewhat similar to execution time trends

DRF2: DeNovo’s data and synch reuse reduces energy (23% avg vs GPU)

53

Relaxed Atomic Apps – Energy

PR-2 PR-3 PR-4PR-1UTS BC-1 BC-2 BC-3 BC-4 AVG

105104104

104 104

Page 54: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Memory Hierarchies for Heterogeneous SoC

54

• Efficient coherence (DeNovo), simple consistency (DRF)[MICRO ’15, Top Picks ’16 Honorable Mention]

• Better semantics for relaxed atomics and evaluation[in review]

• Integrate specialized memories in global address space[ISCA ’15, Top Picks ’16 Honorable Mention]

EfficiencyProgrammability

Page 55: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Specialized Memories for Efficiency

55

• Heterogeneous SoCs use specialized memories for energy

• E.g., scratchpads, FIFOs, stream buffers, …

Scratchpad

Directly addressed: no tags/TLB/conflicts X

Compact storage: no holes in cache lines X

Cache

Page 56: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Specialized Memories for Energy

56

• Heterogeneous SoCs use specialized memories for energy

• E.g., scratchpads, FIFOs, stream buffers, …

Can specialized memories be globally addressable, coherent?

Can we have our scratchpad and cache it too?

Scratchpad

Directly addressed: no tags/TLB/conflicts X

Compact storage: no holes in cache lines X

Global address space: implicit data movement X

Coherent: reuse, lazy writebacks X

Cache

Page 57: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

Can We Have Our Scratchpad and Cache it Too?

• Make specialized memories globally addressable, coherent

– Efficient address mapping

– Efficient coherence protocol

• Focus: CPU-GPU systems with scratchpads and caches

– Up to 31% less execution time, 51% less energy

57

Stash

Scratchpad Cache

+ Directly addressable

+ Compact storage

+ Global address space

+ Coherent

Page 58: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• Tension between programmability and efficiency

– Coherence: performs poorly for emerging apps

– Consistency: complicated, relaxed atomics worsen

– Specialized memories: not visible in global address space

• Insight: adjust coherence and consistency complexity

– Efficient coherence [MICRO ‘15, TP ‘16 HM]

– DRF consistency model [MICRO ‘15, TP ‘16 HM, in submission]

– Specialized mems in global addr space [ISCA ’15, TP ’16 HM]

– Future: optimize DeNovo; integrate more specialized mems;

interface 58

Conclusion

EfficiencyProgrammability

Page 59: sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli ...rsim.cs.illinois.edu/Talks/17-hipeac.pdf•1988 to 1989: What is a memory model? –What value can a read return? • 1990s:

• 1988 to 1989: What is a memory model?

– What value can a read return?

• 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

• 2000-08: Java, C++, … memory model [POPL05, PLDI08, CACM10]

– DRF model + big mess (but after 20 years, convergence at last)

• 2008-14: Software-centric view for coherence: DeNovo protocol

– More performance-, energy-, and complexity-efficient than MESI

[PACT12, ASPLOS14, ASPLOS15]

• 2014-16: Déjà vu: Heterogeneous systems [ISCA15, Micro15]

– Coherence, consistency, global addressability

59

My Story (1988 2016)