IBM T.J. Watson Research Center RACES’12 Oct 21, 2012 © 2012 IBM Corporation Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models Harold “Trey” Cain IBM T.J. Watson Research Center Prof. Mikko H. Lipasti University of Wisconsin

Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models


IBM T.J. Watson Research Center

RACES’12 Oct 21, 2012 © 2012 IBM Corporation

Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models

Harold “Trey” Cain, IBM T.J. Watson Research Center

Prof. Mikko H. Lipasti, University of Wisconsin


Gotta go back in time!

Part of Ph.D. Dissertation

– Never submitted for publication, until now.

– Looked particularly relevant when I saw the RACES CFP.

Journey back in time to the year 2004, when…

– … Mark Zuckerberg launched Facebook

– … Janet Jackson suffered a “wardrobe malfunction” during the Superbowl halftime show

– … an incumbent president was being challenged by a Massachusetts politician

88mph here we come!


Edge Chasing Delayed Consistency: Pushing the Limits of Weak Ordering

From the RACES website:

– “an approach towards scalability that reduces synchronization requirements drastically, possibly to the point of discarding them altogether.”

A hardware developer’s perspective:

– Constraints of legacy code

• What if we want to apply this principle, but have no control over the applications that are running on a system?

– Can one build a coherence protocol that avoids synchronizing cores as much as possible?

• For example, by allowing each core to use stale versions of cache lines for as long as possible

• While maintaining architectural correctness; i.e., we will not break existing code

• If we do that, what will happen?


Cache-Coherent Shared-memory multiprocessors

Are ubiquitous

Coherence misses are a major source of performance loss for shared memory applications

[Figure: shared-memory multiprocessors, “10 years ago” versus “Today”]


16MB L3 Cache Misses per 1000 inst


Edge-Chasing Delayed Consistency (ECDC)

A new hardware implementation of POWER weak ordering

– Not a new consistency model

Allows a cache line to be non-speculatively read after being invalidated.

Based on necessary conditions

– Processor must fetch new data only if causally dependent on it.


Constraint graph

Introduced for SC by Landin et al., ISCA-18

A directed graph represents a multithreaded execution

– Nodes represent dynamic instances of instructions

– Edges represent their transitive orders (program order, RAW, WAW, WAR).

If the constraint graph is acyclic, then the execution is correct
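
The acyclicity test can be sketched in a few lines of Python. This is only an illustration, not the mechanism used in the talk: the node names (`p1.stA` and so on) and the two edge lists are hypothetical, and edges are given directly as (source, destination) pairs.

```python
from collections import defaultdict

def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        nodes.update((src, dst))

    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the DFS stack, finished
    color = {n: WHITE for n in nodes}

    def visit(node):
        color[node] = GRAY
        for succ in graph[node]:
            if color[succ] == GRAY:            # back edge closes a cycle
                return True
            if color[succ] == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in sorted(nodes))

# Hypothetical execution: P1 runs ST A; LD B and P2 runs ST B; LD A.
program_order = [("p1.stA", "p1.ldB"), ("p2.stB", "p2.ldA")]

# Both loads return the old values (two WAR edges): cyclic under SC.
both_stale = program_order + [("p1.ldB", "p2.stB"), ("p2.ldA", "p1.stA")]

# P2's load observes P1's store instead (one RAW edge): acyclic.
one_fresh = program_order + [("p1.ldB", "p2.stB"), ("p1.stA", "p2.ldA")]

print(has_cycle(both_stale), has_cycle(one_fresh))
```

A depth-first search is used here purely for brevity; the hardware described later avoids building the graph at all.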


Constraint graph example - WO

[Figure: constraint graph for a two-processor execution (Proc 1, Proc 2) containing ST A, LD A, ST B, LD B, and two memory barriers (MB). Edge labels: LD->MB order, ST->MB order, MB->LD order, MB->ST order, read-after-write dependence order, write-after-read dependence order. Numbered labels 1 to 5.]

Observation: An aggressive coherence protocol can ignore coherence messages unless doing so will create a cycle in the constraint graph


Edge-chasing delayed consistency

Based on edge-chasing algorithms used by distributed database systems for deadlock detection

[Figure: processes P1, P2, P3, P4 in a wait-for graph (WFG); “Wham-O!” marks the detected cycle]

A cycle in the WFG is detected when a locally created probe is received
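
A minimal sketch of the edge-chasing idea follows. It simulates probe forwarding centrally, whereas a real distributed system would pass the probe in messages between sites; the function name `probe_returns` and the `wait_for` dictionary layout are assumptions made for this example.

```python
def probe_returns(wait_for, initiator):
    """Forward a probe along wait-for edges.

    wait_for maps each process to the processes it is waiting on.
    Returns True if the probe created by `initiator` comes back to it,
    which means `initiator` is part of a cycle in the wait-for graph.
    """
    frontier = list(wait_for.get(initiator, ()))
    seen = set()
    while frontier:
        proc = frontier.pop()
        if proc == initiator:          # locally created probe received
            return True
        if proc in seen:               # already forwarded from this process
            continue
        seen.add(proc)
        frontier.extend(wait_for.get(proc, ()))
    return False

# Hypothetical wait-for graphs over the four processes in the figure.
deadlocked = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"], "P4": ["P1"]}
no_deadlock = {"P1": ["P2"], "P2": ["P3"], "P3": ["P4"]}

print(probe_returns(deadlocked, "P1"), probe_returns(no_deadlock, "P1"))
```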


ECDC - Basic idea

Observation: Cycles in constraint graph can be detected using a similar mechanism

Protocol:

– Upon write miss, create a “probe”

– Upon receipt of invalidation, add probe to cache line

• Continue to read stale block until the probe is re-observed on another message

– Pass probe to other processors at communication
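
The protocol steps above can be illustrated with a toy model of a single cache. This is a sketch under strong simplifying assumptions, not the ECDC implementation: the class `ToyCache`, its method names, and the string probe identifiers are all hypothetical, and probe sets are plain Python sets.

```python
class ToyCache:
    """Toy model of one processor's cache under the ECDC idea."""

    def __init__(self):
        self.data = {}         # address -> cached value
        self.supplanter = {}   # address -> probe attached by an invalidation

    def fill(self, addr, value):
        # A fresh copy replaces any stale one and clears its probe.
        self.data[addr] = value
        self.supplanter.pop(addr, None)

    def on_invalidate(self, addr, probe):
        # Keep the stale copy, but remember which write supplanted it.
        if addr in self.data:
            self.supplanter[addr] = probe

    def on_message(self, probes):
        # Any incoming message may carry probes. Re-observing a supplanter
        # probe means further stale reads could close a cycle, so the
        # stale line is dropped.
        for addr, probe in list(self.supplanter.items()):
            if probe in probes:
                del self.data[addr]
                del self.supplanter[addr]

    def read(self, addr):
        # Returns (hit, value); a stale line still hits.
        if addr in self.data:
            return True, self.data[addr]
        return False, None

cache = ToyCache()
cache.fill("A", 1)
cache.on_invalidate("A", "probe-7")      # remote write miss to A
before = cache.read("A")                 # stale copy is still readable
cache.on_message({"probe-3"})            # unrelated probe: nothing changes
cache.on_message({"probe-7"})            # supplanter probe re-observed
after = cache.read("A")                  # now a miss
print(before, after)
```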


Example – necessary miss (SC)

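[Figure: two-processor SC execution (Proc 1, Proc 2) with operations LD A, ST B, LD B, ST A, LD A, connected by RAW and WAR edges. Annotations: “Line A is in proc 1’s cache, valid bit = 1”; “Line A is in proc 1’s cache, valid bit = 0”; “Supplanter ProbeA = RAW”.]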


Detecting critical writes

Some write values shouldn’t be delayed (e.g. lock releases, barriers, etc.)

Two heuristics

– Atomic primitives – any cache block that has been touched by a store-conditional should not be delayed

– Polling detection – If consecutive cache accesses have same PC and address, discard stale line
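
Both heuristics can be sketched as a small filter that decides whether a stale line may still be used. The class name `CriticalWriteFilter`, its methods, and the 64-byte block size default are assumptions for illustration; the slide does not specify how the hardware tracks this state.

```python
class CriticalWriteFilter:
    """Sketch of the two critical-write heuristics (illustrative only)."""

    def __init__(self, line_bytes=64):
        self.line_bytes = line_bytes
        self.sc_blocks = set()     # blocks touched by a store-conditional
        self.last_access = None    # (pc, address) of the previous access

    def note_store_conditional(self, addr):
        self.sc_blocks.add(addr // self.line_bytes)

    def may_use_stale(self, pc, addr):
        # Polling detection: same PC and address as the previous access.
        polling = (pc, addr) == self.last_access
        self.last_access = (pc, addr)
        # Atomic primitives: never delay blocks used by store-conditional.
        atomic = addr // self.line_bytes in self.sc_blocks
        return not polling and not atomic

f = CriticalWriteFilter()
first = f.may_use_stale(0x400, 0x1000)    # ordinary load
second = f.may_use_stale(0x400, 0x1000)   # same PC and address: polling
f.note_store_conditional(0x2008)
third = f.may_use_stale(0x408, 0x2010)    # same block as the store-conditional
print(first, second, third)
```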


Performance Evaluation

PHARMsim – Cycle-mode full-system simulator

– Based on SimpleMP, including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator

– Out-of-order single-threaded core

– 32k DM L1 icache (1), 32k DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64-byte cache lines

– Memory (400 cycle/100 ns best-case latency, 10 GB/s BW based on 5 GHz clock)

– Stride-based prefetcher modeled after Power4

Lock-free list insertion microbenchmark

Full applications

– SPLASH2: fft, fmm, ocean, radix, raytrace

– Commercial: DB2/TPC-B, DB2/TPC-H, SPECjbb2000, SPECweb99


Why delayed consistency?

False sharing/Silent sharing

Convergent/Data-race tolerant algorithms

– Genetic algorithms

– Parallel equation solvers

– Sparse matrix factorization

Lock-free parallel linked data structures


Lock-free Algorithms

For example list insertion:

– New node’s next pointer set to cur

– CAS operation atomically updates prev’s next pointer to new

Increasingly common

[Figure: list nodes prev and cur, with node new being inserted between them]
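
The two insertion steps can be sketched as follows. Python has no hardware compare-and-swap, so `cas_next` emulates one with a per-node lock; that emulation, the sentinel head node, and the names `Node` and `insert_sorted` are assumptions of this example. It covers insertion only, not deletion or memory reclamation.

```python
import threading

class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node
        self._lock = threading.Lock()   # used only to emulate an atomic CAS

    def cas_next(self, expected, new):
        """Emulated compare-and-swap on self.next."""
        with self._lock:
            if self.next is expected:
                self.next = new
                return True
            return False

def insert_sorted(head, key):
    """Insert key into a sorted list that starts with a sentinel head node."""
    while True:
        prev, cur = head, head.next
        while cur is not None and cur.key < key:
            prev, cur = cur, cur.next
        new = Node(key, next_node=cur)     # new node's next pointer set to cur
        if prev.cas_next(cur, new):        # CAS updates prev's next pointer
            return new
        # CAS failed: another thread changed prev.next, so search again.

def keys_of(head):
    out, node = [], head.next
    while node is not None:
        out.append(node.key)
        node = node.next
    return out

head = Node(None)

def worker(start):
    for k in range(start, 200, 4):
        insert_sorted(head, k)

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(keys_of(head)[:5])
```

Searching threads never write, which is why stale copies of list nodes are attractive here: a reader that sees an older `next` pointer still traverses a valid list.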


Prior work (Delayed consistency)

Invalidate-based receiver-delayed protocols, sender-delayed protocols (Dubois et al., SC ’91)

Lazy release consistency (Keleher et al., ISCA ’92)

Update-based receiver-delayed, sender-delayed protocols (Afek et al., TOPLAS ’93)

Tear-off blocks in DSI (Lebeck and Wood, ISCA ’95)

Write cache for reducing bandwidth in update coherence protocol (Dahlgren and Stenstrom, JPDC ’95)


Lock-free list microbenchmark

[Figure: cycles/search (y-axis, 0 to 18000) versus % updates (x-axis, 0 to 100) for the series base-1000, ecdc-1000, base-100, ecdc-100, base-10, ecdc-10]

Based on hazard-pointer lock-free list maintenance algorithm [Michael, PODC ’02]

15 threads randomly updating or searching linked list, 1 thread performing searches


Intolerable miss reduction

Left to right: a) baseline, b) ECDC base, c) ECDC merged read/write sets, d) ECDC scalar probe set


ECDC Performance (Infinite resources)


Conclusions

Of nine applications studied, performance improvement for two

– Mostly due to reduction in false sharing misses

Other applications:

– Not enough coherence misses, or

– The avoidance of those misses does not improve performance

We believe these results generalize to lock-based programs

Other programming models may have potential

– As shown, lock-free data structures

• Should also apply to transactional programming model

– But beware, “Premature Optimization is the Root of All Evil” – Donald Knuth

– Best to identify apps with a communication bottleneck before attacking


Questions?


Backup slides


Base machine model

PHARMsim: Based on SimpleMP, including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator

Out-of-order execution core: 15-stage, 8-wide pipeline; 256-entry reorder buffer, 128-entry load/store queue; 32-entry issue queue

Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4, 4); 4 L1 Dcache load ports in OoO window; 1 L1 Dcache load/store port at commit

Front-end: Combined bimodal (16k entry)/gshare (16k entry) branch predictor with 16k-entry selection table, 64-entry RAS, 8k-entry 4-way BTB

Memory system (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1); 256K 8-way L2 (7), 8MB 8-way L3 (15), 64-byte cache lines; memory (400 cycle/100 ns best-case latency, 10 GB/s BW based on 5 GHz clock); stride-based prefetcher modeled after Power4


Causality (Lamport)

An instruction i is causally dependent upon instruction j if there is a directed path from j to i

Two operations are concurrent if neither causally depends upon the other

Coherence misses are a significant source of performance degradation for many applications

If two operations are concurrent, why is their performance penalized?

[Figure: timeline of processors P1, P2, P3 (time axis) with operations st A, st C, ld A, st B, ld C, ld B, ld A]
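
The two definitions above amount to a reachability test on the constraint graph. The sketch below assumes the graph is supplied as (source, destination) edge pairs; the operation names and the example edges are hypothetical and are not taken from the figure.

```python
from collections import defaultdict

def reachable(edges, src, dst):
    """True if there is a directed path of at least one edge from src to dst."""
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        for succ in graph[node]:
            if succ == dst:
                return True
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return False

def causally_depends(edges, i, j):
    """Instruction i causally depends on j if there is a path from j to i."""
    return reachable(edges, j, i)

def concurrent(edges, i, j):
    return not causally_depends(edges, i, j) and not causally_depends(edges, j, i)

# Hypothetical edges: P2's load reads P1's store, then P2 stores to B.
edges = [("p1.stA", "p2.ldA"),   # read-after-write
         ("p2.ldA", "p2.stB")]   # single-thread order

print(causally_depends(edges, "p2.stB", "p1.stA"))   # depends on the store to A
print(concurrent(edges, "p3.stC", "p1.stA"))         # unrelated operations
```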


Prior work: formal memory model representations

Local, WRT, global “performance” of memory ops (Dubois et al., ISCA-13)

Acyclic graph representation (Landin et al., ISCA-18)

Modeling memory operation as a series of sub-operations (Collier, RAPA)

Acyclic graph + sub-operations (Adve, thesis)

Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)


Anatomy of a cycle

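[Figure: two-processor execution (Proc 1, Proc 2) with operations ST A, LD A, ST B, LD B. Edge labels: program order (one per processor), WAR, RAW. Annotations: “Incoming invalidate”, “Cache miss”.]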


Other prior work

Speculative stale value usage

– LVP with Stale Values (Lepak, Ph.D. Thesis ‘03)

– Coherence Decoupling (Huh et al., ASPLOS ’04)

Delayed RFO response to improve synchronization throughput (Rajwar et al., HPCA ’00)


Constraint graph extensions

Constraint graph definition differs for other consistency models

Processor consistency

– Remove program order edges from stores to subsequent loads

– Remaining single-thread orders: edges from

• Loads to subsequent loads

• Stores to subsequent stores

• Loads to subsequent stores


Constraint graph extensions

Constraint graph definition differs for other consistency models

Weak ordering

– Remove program order edges

– Add single-thread ordering edges between

• memory barrier and preceding/following instructions

• same address reads/writes

• dependent instructions
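
The single-thread ordering rules for sequential consistency, processor consistency, and weak ordering can be sketched as one function that emits ordering edges for a thread. The representation is an assumption of this example: a thread is a list of `(kind, address)` tuples with kind `"LD"`, `"ST"`, or `"MB"`, and register dependences are passed explicitly as index pairs in `deps`.

```python
def order_edges(thread, model, deps=()):
    """Return single-thread ordering edges as (i, j) index pairs with i < j."""
    edges = set()
    for i in range(len(thread)):
        for j in range(i + 1, len(thread)):
            (ki, ai), (kj, aj) = thread[i], thread[j]
            if model == "SC":
                keep = True                              # full program order
            elif model == "PC":
                keep = not (ki == "ST" and kj == "LD")   # drop store-to-load
            elif model == "WO":
                keep = ("MB" in (ki, kj)                 # barrier orders
                        or (ai is not None and ai == aj) # same address
                        or (i, j) in deps)               # dependent instructions
            else:
                raise ValueError(f"unknown model: {model}")
            if keep:
                edges.add((i, j))
    return edges

plain = [("ST", "A"), ("LD", "B")]
fenced = [("ST", "A"), ("MB", None), ("LD", "B")]

print(order_edges(plain, "SC"))    # store-to-load edge kept
print(order_edges(plain, "PC"))    # store-to-load edge removed
print(order_edges(fenced, "WO"))   # only edges through the barrier
```

Because every ordered pair is emitted, transitive edges are included as well, which matches the earlier description of edges as transitive orders.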


PC Example – Dekker’s Alg.

[Figure: Proc 1 executes ST A then LD B; Proc 2 executes ST B then LD A. Edge labels: program order (one per processor), write-after-read dependence order. Numbered labels 1 to 4.]

Lack of store-to-load order results in acyclic graph


Constraint graph example - SC

[Figure: two-processor execution (Proc 1, Proc 2) with operations ST A, LD A, ST B, LD B. Edge labels: program order (one per processor), write-after-read dependence order, read-after-write dependence order. Numbered labels 1 to 4.]

Cycle indicates that execution is incorrect


Constraint graph example - PC

[Figure: two-processor execution (Proc 1, Proc 2) with operations ST A, LD B, ST B, LD A. Edge labels: program order (one per processor), write-after-read dependence order, read-after-write dependence order. Numbered labels 1 to 4.]


ECDC Conceptual Description

Identify causal dependences (upstream probe sets)

– 1 upstream set per processor

– 2 upstream sets per cache block (read set, write set)

Communicating dependences

– Probe sets passed on response messages

– Probes attached to incoming invalidation messages

– Extra ProbePropagation messages sent at memory barriers

Identifying usable stale blocks

– Extra stable state in cache (ST)

– Supplanter probe


ECDC Operation

[Table: probe sets Ф proc upstream, Ф(read|write) A, and Ф(read|write) B, shown initially and after each of the operations 1. ld A, 2. st A, 3. ld B, 4. st B, 5. ld C]


Finite ECDC Performance

When restricting PPB/STAB resources (220 KB per processor)

– 16k probe lifetime counter

– 128 entry STAB per processor

– 32 Entry PPB per processor/directory controller (256 PPB virtual namespace)

TPC-H/SPECweb99 performance is within the margin of error of the infinite-resource results


Non-atomicity of writes

Absent from model

Effect on optimizations

– Forces unnecessary orders to exist

– Correct, but another example of over-conservatism

Hopefully, infrequent performance divot

Processor p1

st r1, [A]

Processor p2

ld r1, [A]
st r2, [r1]

Processor p3

ld r1, [B]
membar
ld r2, [A]


ECDC Base machine model

PHARMsim: Based on SimpleMP, including Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator

Out-of-order execution core: 15-stage, 8-wide pipeline; 256-entry reorder buffer, 128-entry load/store queue; 32-entry issue queue

Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4, 4); 4 L1 Dcache load ports in OoO window; 1 L1 Dcache load/store port at commit

Front-end: Combined bimodal (16k entry)/gshare (16k entry) branch predictor with 16k-entry selection table, 64-entry RAS, 8k-entry 4-way BTB

Cache hierarchy (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1); 256K 8-way L2 (7), 16 MB 8-way L3 (15), 128-byte cache lines; stride-based prefetcher modeled after Power4

Memory system (latency): 2-D static DOR routed torus interconnect, 60 cycles per link+route (40 GB/s bandwidth per link, 5 GHz clock); memory (400 cycle best-case latency, 10 GB/s bandwidth)


Mapping ECDC to HW

STAB – Maintains supplanting probe for each stale cache block

PPB – Maintains approximation of upstream sets

In caches – 2 extra bits for stale state and synch heuristic

[Figure: node block diagram with processor (P), I$, D$, L2 $, STAB, PPB, castout PPB, NIC, memory controller (MemCtr), directory (Dir), and DRAM]


Probe representation

Each probe represented by n-bit timer

Stale block may be used until supplanting probe timer expires

Probe set in p-processor system represented by p timers
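
One possible reading of this representation is a countdown timer per probe, sketched below. The countdown behavior, the class `ProbeTimer`, and the helper names are assumptions; the slide states only that probes are n-bit timers and that a stale block is usable until its supplanting probe's timer expires. The 14-bit width in the usage lines is likewise just an example value.

```python
class ProbeTimer:
    """Countdown timer standing in for one probe (illustrative only)."""

    def __init__(self, n_bits, lifetime):
        self.max_value = (1 << n_bits) - 1   # largest value an n-bit timer holds
        if not 0 <= lifetime <= self.max_value:
            raise ValueError("lifetime does not fit in the timer")
        self.remaining = lifetime

    def tick(self, cycles=1):
        self.remaining = max(0, self.remaining - cycles)

    @property
    def expired(self):
        return self.remaining == 0


def stale_block_usable(supplanting_timer):
    # A stale block may be read until its supplanting probe's timer expires.
    return not supplanting_timer.expired


def make_probe_set(p, n_bits):
    # One timer per processor; all start expired (no outstanding probe).
    return [ProbeTimer(n_bits, 0) for _ in range(p)]


timer = ProbeTimer(n_bits=14, lifetime=3)
usable_now = stale_block_usable(timer)
timer.tick(2)
usable_later = stale_block_usable(timer)
timer.tick(5)
usable_after_expiry = stale_block_usable(timer)
print(usable_now, usable_later, usable_after_expiry, len(make_probe_set(4, 14)))
```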


STAB Detail

[Figure: STAB detail, showing STAB entries with address and timer fields next to the cache, and per-processor counters (p1, p2, p3) for incoming invalidates]


PPB Detail

[Figure: PPB detail, showing an address hash, a timer index table, shift register/probe timers, an incoming upstream set, and an expired upstream set]


Memory consistency review

Memory consistency model

– Specifies the programming interface to a shared memory

– i.e. the allowable interleaving of instructions

Models discussed here:

– Sequential Consistency

– Processor Consistency

• No store-to-load program order

– Weak Ordering

• Order wrt memory barriers

• Same-address order

• Dependence order


Example – necessary miss (SC)

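[Figure: two-processor SC execution (Proc 1, Proc 2) with operations LD A, ST B, LD B, ST A, LD A. Edge labels: RAW, WAR, and three program order (PO) edges. Annotations: “Block A is in proc 1’s cache, valid bit = 1”; “Block A is in proc 1’s cache, valid bit = 0”.]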


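Example – avoidable miss (SC)

[Figure: two-processor SC execution (Proc 1, Proc 2) with operations LD A, ST B, LD B, ST A, LD A. Edge labels: RAW, WAR, and three program order (PO) edges. Annotations: “Block A is in proc 1’s cache, valid bit = 1”; “Block A is in proc 1’s cache, valid bit = 0”.]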


Typical ReadX transaction

When sending invalidation, create probe, add to PPB

At receipt of invalidation (2b, 2c) add probe to STAB

When sending invalidate acknowledgment, add probe set to the response

When receiving invalidate acknowledgment, add incoming probe set to the PPB

[Figure: ReadX transaction among nodes R, H, S1, and S2, with messages 1. ReadX; 2(a) Sharers/Data; 2(b) Inval; 2(c) Inval; 3(a) Inval Ack; 3(b) Inval Ack]
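
The four probe-handling rules for this transaction can be sketched with plain sets standing in for the PPB and STAB. This is a simplification under stated assumptions: the class `EcdcNode`, its methods, and the message dictionaries are hypothetical, and the writer is modeled as sending the invalidation directly to a sharer, ignoring the separate role of node H in the figure.

```python
import itertools

_probe_ids = itertools.count(1)   # source of unique probe identifiers

class EcdcNode:
    def __init__(self, name):
        self.name = name
        self.ppb = set()    # approximation of this node's upstream probe set
        self.stab = {}      # stale block address -> supplanting probe

    def send_invalidation(self, addr):
        # When sending an invalidation, create a probe and add it to the PPB.
        probe = (self.name, next(_probe_ids))
        self.ppb.add(probe)
        return {"type": "Inval", "addr": addr, "probe": probe}

    def receive_invalidation(self, msg):
        # At receipt of an invalidation, add the probe to the STAB, then
        # attach this node's probe set to the acknowledgment.
        self.stab[msg["addr"]] = msg["probe"]
        return {"type": "InvalAck", "addr": msg["addr"], "probes": set(self.ppb)}

    def receive_ack(self, ack):
        # When receiving the acknowledgment, merge its probe set into the PPB.
        self.ppb |= ack["probes"]

writer = EcdcNode("R")
sharer = EcdcNode("S1")
sharer.ppb.add(("S1", 0))            # a probe the sharer already depends on

inval = writer.send_invalidation(0x80)
ack = sharer.receive_invalidation(inval)
writer.receive_ack(ack)
print(sharer.stab, sorted(writer.ppb))
```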


Invalidation to read distance

[Figure: % of load coh. misses (y-axis, 0% to 100%) versus invalidation-to-read distance in cycles (x-axis, log scale, 1 to 1E+09) for fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]


Invalidation to read distance (synch)

[Figure: % of load coh. misses (y-axis, 0% to 100%) versus invalidation-to-read distance in cycles (x-axis, log scale, 1 to 1E+09) for fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]


Invalidation to read distance (data)

[Figure: % of load coh. misses (y-axis, 0% to 100%) versus invalidation-to-read distance in cycles (x-axis, log scale, 1 to 1E+09) for fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]


STAB entry death cdf

[Figure: fraction of STAB entries deallocated (y-axis, 0 to 1) versus cycles (x-axis, log scale, 1 to 1000000) for fft, fmm, ocean, radix, raytrace, SPECjbb2000, SPECweb99, TPC-B, TPC-H]


STAB Entry Lifetime


ECDC performance (16k probe lifetime)


ECDC Perf (128 entry STAB, 32 entry PPB, 256 entry namespace)


ProbePropagation messages


ECDC Storage Overhead

[Figure: storage in KB (y-axis, 0 to 350) versus processor count (x-axis, 4p to 1024p)]


What about limit study?

Indicated a larger number of avoidable coherence misses

Reasons:

– Did not account for non-speculative nature of protocol (oracle ECDC could be better)

– Inaccurate measurement of critical writes

• Many loads perform polling to lines that have never been touched by a load-linked or store-conditional

– Used isolated stale data detection mechanism


What about speculative load squashes?

In a few applications, they occur frequently (SPECjbb2000, TPC-H)

Implemented/evaluated read-set-tracking w/ squash on miss

Could eliminate a large fraction of squashes

– Unfortunately, little performance improvement

– Presumably, many squashes caused by contended spin locks


ECDC and other consistency models

Stricter model => more ProbePropagation messages

Potential for release consistency

In SC/PC/TSO, ECDC benefits will probably be dominated by extra ProbePropagation messages


Cause of STAB entry deallocation


Publications

[ISCA ’04] Memory ordering: A Value-based approach.

– Selected for IEEE Micro Top Picks ’04

[PACT ’03] Constraint Graph Analysis of Multithreaded Programs.

– Selected for Best of PACT JILP Issue

[PACT ’03] Redeeming IPC as a Performance Metric for Multithreaded Programs.

[CAECW ’02] Precise and Accurate Processor Simulation

[SPAA Revue ’02] Verifying Sequential Consistency Using Vector Clocks.

[Micro ’01] Correctly Implementing Value Prediction in Microprocessors that Support Multithreading or Multiprocessing.

[WBT ’01] A Dynamic Binary Translation Approach to Architectural Simulation

[HPCA ’01] An Architectural Characterization of Java TPC-W.

[Euro-Par ’00] A Callgraph-Based Search Strategy for Automated Performance Diagnosis.

– Selected as distinguished paper

[CAECW ’00] Characterizing a Java Implementation of TPC-W