The Locality-Aware Adaptive Cache Coherence Protocol
George Kurian¹, Omer Khan², Srini Devadas¹
¹ Massachusetts Institute of Technology   ² University of Connecticut, Storrs
Cache Hierarchy Organization: Directory-Based Coherence
• Private caches: 1 or 2 levels
• Shared cache: last-level, with integrated directory
• Concurrent reads lead to replication in private caches
• Directory maintains coherence for replicated lines
[Figure: write-miss flow: (1) write misses in the private cache, (2) request sent to the shared cache + directory, (3)(4) sharers invalidated before the write word proceeds]
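The read-replication and write-invalidation behavior on this slide can be sketched as a few lines of bookkeeping. This is our illustrative model, not the authors' code; the core and line names are made up.

```python
# Minimal sketch of directory-based coherence bookkeeping: reads replicate
# a line into private caches, and a write invalidates all other sharers
# before granting the writer exclusive ownership.

class Directory:
    def __init__(self):
        self.sharers = {}  # line -> set of cores holding a private copy

    def read(self, core, line):
        # concurrent reads lead to replication in private caches
        self.sharers.setdefault(line, set()).add(core)

    def write(self, core, line):
        # invalidate every other sharer; writer becomes the sole owner
        invalidated = self.sharers.get(line, set()) - {core}
        self.sharers[line] = {core}
        return invalidated

d = Directory()
d.read("A", "X")
d.read("C", "X")                  # line X replicated at A and C
print(sorted(d.write("B", "X")))  # -> ['A', 'C'] (sharers invalidated)
```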
Private Caching: Advantages & Drawbacks
☺ Exploits spatio-temporal locality
☺ Efficient low-latency local access to private + shared data (cache line replication)
☹ Inefficiently handles data with LOW spatio-temporal locality:
– Working set > private cache size: inefficient cache utilization (cache thrashing), unnecessary fetch of entire cache line; shared data replication increases the working set
– Shared data with frequent writes: wasteful invalidations, synchronous writebacks, cache line ping-ponging
Result: increased on-chip communication and time spent waiting for expensive events
On-Chip Communication Problem
"Wires relative to gates are getting worse every generation." (Shekhar Borkar, Intel)
"Bit movement is much more expensive than computation." (Bill Dally, Stanford)
Must architect efficient coherence protocols.
Locality of Benchmarks: Evaluating Reuse before Evictions
• Utilization: # private L1 cache accesses before the cache line is evicted
• 40% of evicted lines have a utilization < 4
[Chart: utilization distribution of evicted lines; 80% / 20% callouts]
Locality of Benchmarks: Evaluating Reuse before Invalidations
• Utilization: # private L1 cache accesses before the cache line is invalidated (intervening write)
[Chart: utilization distribution of invalidated lines; 80% / 10% callouts]
Remote-Word Access (RA)
• NUCA-based protocol [Fensch et al. HPCA'08], [Hoffmann et al. HiPEAC'10]
• Assign each memory address to a unique "home" core: the cache line is present only in the shared cache at the "home" core (single location)
• For access to a non-locally cached word, request the "remote" shared cache on the "home" core to perform the read/write access
[Figure: (1) write-word request sent to the home core, (2) access performed at the home core's shared cache]
Remote-Word Access: Advantages & Drawbacks
☺ Energy efficient for low-locality data: a word access (~200 bits) is cheaper than a cache line fetch (~640 bits)
☺ NO data replication: efficient private cache utilization
☺ NO invalidations / synchronous writebacks
☹ Round-trip network request for each remote-WORD access
☹ Expensive for high-locality data
☹ Data placement dictates distance & frequency of remote accesses
Locality-Aware Cache Coherence
• Combine advantages of private caching and remote access
• Privately cache high-locality lines: optimize hit latency and energy
• Remotely cache low-locality lines: prevent data replication & costly data movement
• Private Caching Threshold (PCT):
– Utilization >= PCT: mark as private
– Utilization < PCT: mark as remote
Locality-Aware Cache Coherence: Invalidations vs Utilization
• Private Caching Threshold (PCT) = 4
[Chart: invalidations breakdown (%) across utilization bins 1, 2–3, 4–5, 6–7, >=8; bins below PCT are classified remote, bins at or above PCT private]
Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion
Baseline System
• Compute pipeline
• Private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache with integrated directory
• L2 cache managed by Reactive-NUCA [Hardavellas – ISCA'09]
• ACKwise limited-directory protocol [Kurian – PACT'10]
[Figure: tiled core with compute pipeline, L1 I-cache, L1 D-cache, L2 shared cache slice, directory, and router; M = memory controllers]
Locality-Aware Coherence: Important Features
• Intelligent allocation of cache lines
– In the private L1 cache
– Allocation decision made per-core at cache line granularity
• Efficient locality tracking hardware
– Decoupled from traditional coherence tracking structures
• Low protocol complexity
– NO additional networks for deadlock avoidance
Implementation Details: Private Cache Line Tag
• Private Utilization bits track cache line usage in the L1 cache
• Communicated back to the directory on eviction or invalidation
• Storage overhead is only 0.4%
Tag layout: | State | LRU | Tag | Private Utilization |
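The utilization field in the tag is only a few bits wide, so it must saturate rather than wrap. A small sketch of such a counter follows; the 5-bit width is our assumption (the slide states only the 0.4% overhead), chosen to be consistent with the storage numbers later in the deck.

```python
# Sketch of a saturating per-line utilization counter as kept in the L1
# tag. UTIL_BITS = 5 is an assumed width, not taken from the slides.

UTIL_BITS = 5
UTIL_MAX = (1 << UTIL_BITS) - 1   # counter saturates at 31

def bump(util: int) -> int:
    """Increment the private-utilization counter, saturating at 2^5 - 1."""
    return min(util + 1, UTIL_MAX)

u = 0
for _ in range(40):   # 40 L1 hits to the same line
    u = bump(u)
print(u)              # -> 31 (saturated, does not wrap to 0)
```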
Implementation Details: Directory Entry
• P/Rᵢ: Private/Remote mode of Coreᵢ
• Remote-Utilizationᵢ: line usage by Coreᵢ at the shared L2 cache
• Complete locality classifier: tracks mode / remote utilization for ALL cores
• Storage overhead reduced later
Entry layout: | State | Tag | ACKwise Pointers 1…p | P/R₁ … P/Rₙ | Remote Utilization₁ … Remote Utilizationₙ |
Mode Transitions Summary
• Classification based on previous behavior
• State machine (per core, per cache line):
– Initial → Private
– Private → Remote when Private Utilization < PCT (reported on eviction/invalidation)
– Remote → Private when Remote Utilization >= PCT
– Private stays Private when Private Utilization >= PCT; Remote stays Remote when Remote Utilization < PCT
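The transition rule above can be stated as a one-line reclassification function, evaluated whenever a utilization count becomes available (on eviction/invalidation in private mode, or after PCT remote accesses in remote mode). A hedged sketch, with names of our choosing:

```python
# Model of the per-(core, line) mode state machine: regardless of the
# current mode, the next mode depends only on the utilization observed
# since the last classification event. Lines start in private mode.

PCT = 2  # threshold used in the deck's walk-through example

def next_mode(mode: str, utilization: int) -> str:
    # mode is "private" or "remote"; both transitions and self-loops
    # collapse to the same threshold comparison
    return "private" if utilization >= PCT else "remote"

assert next_mode("private", 1) == "remote"   # thrashed line demoted
assert next_mode("remote", 2) == "private"   # reused line promoted
assert next_mode("private", 3) == "private"  # stays private
```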
Walk-Through Example (PCT = 2)
Setup: four cores A–D; the shared L2 cache slice + directory homing line X lives at Core D. All cores start out in private mode; X is uncached.
1. Core A reads X. A is in private mode, so the directory replies with the full cache line; directory state: Shared, sharer A.
2. Core C reads X: the line is replicated in C's private cache. A second read by C raises C's private utilization to 2.
3. Core B writes X. The directory invalidates A and C. A's Inv-Reply reports utilization 1 (< PCT), so A is reclassified as remote; C's Inv-Reply reports utilization 2 (>= PCT), so C stays private. The directory then sends the cache line to B; directory state: Modified.
4. Core A reads X again, now in remote mode. The directory obtains the data from B (WB / WB-Reply; state: Shared, L2 copy dirty), performs the read locally, and returns only the requested word to A. A's remote utilization becomes 1.
5. Core B writes X: an Upgrade-Reply makes B's copy Modified again. The intervening write resets A's remote utilization to 0.
6. Core A issues further remote reads, each satisfied with a word reply from the home L2 (the line returns to Shared at the directory). When A's remote utilization reaches 2 (= PCT), A is promoted back to private mode: the directory replies with the full cache line, and A caches X privately again (final state: Shared, with copies at A and B).
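The classification bookkeeping in the walk-through can be replayed in a few lines. This models only the mode/utilization tracking, not the full coherence protocol; the function names are ours.

```python
# Compact replay of the walk-through (PCT = 2): core A is demoted to
# remote mode after being invalidated with utilization 1, then promoted
# back once its remote utilization reaches PCT (after a counter reset
# caused by an intervening write).

PCT = 2
mode = {"A": "private", "B": "private", "C": "private"}
util = {"A": 0, "B": 0, "C": 0}   # per-core utilization for line X

def access(core):
    util[core] += 1
    if mode[core] == "remote" and util[core] >= PCT:
        mode[core] = "private"    # promotion: full line fetched

def invalidate(core):
    # on invalidation, reclassify using the reported utilization
    mode[core] = "private" if util[core] >= PCT else "remote"
    util[core] = 0

access("A")                        # A reads X (utilization 1)
access("C"); access("C")           # C reads X twice (utilization 2)
invalidate("A"); invalidate("C")   # B writes X -> invalidations
assert mode["A"] == "remote"       # 1 < PCT
assert mode["C"] == "private"      # 2 >= PCT
access("A")                        # remote read #1
util["A"] = 0                      # intervening write by B resets counter
access("A"); access("A")           # two more remote reads
assert mode["A"] == "private"      # promoted back at PCT accesses
```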
Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion
Complete Locality Classifier: High Directory Storage
• The complete locality classifier tracks locality information for ALL cores
Entry layout: | State | Tag | ACKwise Pointers 1…p | P/R₁ … P/Rₙ | Remote Utilization₁ … Remote Utilizationₙ |
Bit overhead per core (256 KB L2), Complete classifier: 192 KB (60%)
Limited Locality Classifier: Reduces Directory Storage
• Utilization and mode tracked for only k sharers
• Modes of other sharers obtained by taking a majority vote
Entry layout: | State | Tag | ACKwise Pointers 1…p | Core ID₁, P/R₁, Remote Utilization₁ | … | Core IDₖ, P/Rₖ, Remote Utilizationₖ |
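The majority-vote fallback can be sketched directly. This is our illustrative version; the deck does not specify the tie-breaking rule, so ties here simply fall to whichever mode was counted first.

```python
# Sketch of the Limited-k classifier's fallback: only k sharers get
# explicit (core ID, mode, utilization) slots; a core without a slot is
# classified by a majority vote over the tracked sharers' modes.

from collections import Counter

def vote_mode(tracked_modes: list) -> str:
    """Mode for an untracked core, by majority over tracked sharers."""
    if not tracked_modes:
        return "private"  # default starting mode per the state machine
    return Counter(tracked_modes).most_common(1)[0][0]

assert vote_mode(["remote", "remote", "private"]) == "remote"
assert vote_mode([]) == "private"
```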
Limited-3 Locality Classifier
• Utilization and mode tracked for 3 sharers
• Bit overhead per core (256 KB L2): Complete 192 KB (60%) vs Limited-3 18 KB (5.7%)
• Limited-3 vs Complete: completion time 3% lower, energy 1.5% lower
• Achieves the performance and energy of the Complete locality classifier
• Completion time and energy are slightly lower because remote-mode classification is learned faster with Limited-3
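The storage numbers above can be checked with a back-of-the-envelope calculation. The bit widths below (1 mode bit + 5 utilization bits per tracked core, 6-bit core IDs for 64 cores) are our reconstruction, chosen because they reproduce the 192 KB and 18 KB figures exactly; the deck itself gives only the totals.

```python
# Back-of-the-envelope check of the classifier storage overheads,
# assuming 64 cores, 64-byte cache lines, and a 256 KB per-core L2 slice
# with one directory entry per L2 line.

cores, line_bytes, l2_kb = 64, 64, 256
entries = l2_kb * 1024 // line_bytes        # 4096 directory entries

complete_bits = entries * cores * (1 + 5)   # mode + utilization per core
limited3_bits = entries * 3 * (6 + 1 + 5)   # core ID + mode + utilization

print(complete_bits // 8 // 1024)  # -> 192 (KB), i.e. 60% of 320 KB
print(limited3_bits // 8 // 1024)  # -> 18 (KB), i.e. ~5.7%
```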
Private ↔ Remote Transition: Results in Private Cache Thrashing
[State machine: Initial → Private; Private → Remote when Private Utilization < PCT; Remote → Private when Remote Utilization >= PCT]
• A core reverts back to private mode after #PCT accesses to the cache line at the shared L2 cache
• The promoted line evicts other lines in the private L1 cache
• Results in low spatio-temporal locality for all the lines involved
• It is difficult to measure the private-cache locality of a line while it resides in the shared L2 cache
Ideal Classifier: NO Private Cache Thrashing
• An ideal classifier maintains part of the working set in the private cache
• Other lines are placed in remote mode at the shared cache
Remote Access Threshold: Reduces Private Cache Thrashing
[State machine: Initial → Private; Private → Remote when Private Utilization < PCT; Remote → Private when Remote Utilization >= RAT; Remote stays Remote when Remote Utilization < RAT]
• If a core is classified as a remote sharer due to capacity, increase the cost of promotion to private mode
• If a core is classified as a private sharer, reset the cost back to its starting value
• The Remote Access Threshold (RAT) is varied based on PCT & application behavior [details in paper]
• Reduces private cache thrashing to a negligible level
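The escalate-on-capacity-demotion / reset-on-private idea can be sketched as a small state holder. The threshold schedule below is illustrative; the actual RAT values and number of levels are in the paper, not this deck.

```python
# Sketch of adaptive Remote Access Threshold (RAT) bookkeeping: each
# capacity-caused demotion to remote mode raises the promotion threshold,
# making it harder to thrash back into the private cache; a private
# classification resets it.

RAT_LEVELS = [4, 8, 16]   # illustrative escalation schedule (assumed)

class RemoteSharer:
    def __init__(self):
        self.level = 0

    def rat(self) -> int:
        return RAT_LEVELS[self.level]

    def on_capacity_demotion(self):
        # classified remote due to capacity: raise the cost of promotion
        self.level = min(self.level + 1, len(RAT_LEVELS) - 1)

    def on_private_classification(self):
        self.level = 0  # reset the cost back to its starting value

s = RemoteSharer()
s.on_capacity_demotion()
s.on_capacity_demotion()
print(s.rat())   # -> 16
s.on_private_classification()
print(s.rat())   # -> 4
```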
Outline
• Motivation for Locality-Aware Coherence
• Implementation Details
• Optimizations
• Evaluation
• Conclusion
Reducing Capacity Misses: Private L1 Cache Miss Rate vs PCT (Blackscholes)
• Miss rate decreases as PCT increases (better utilization)
• Multiple capacity misses (expensive) are replaced with single word accesses (cheap)
• The cache miss rate increases again at the largest PCT values (one capacity miss turns into multiple word misses)
[Chart: cache miss rate breakdown (%) vs PCT = 1…8, y-axis 0–3%; components: Cold, Capacity, Upgrade, Sharing, Word]
Energy vs PCT (Blackscholes)
• Reducing L1 cache misses (capacity misses turned into word accesses) reduces network traffic and L2 accesses
• Accessing a word (~200 bits) is cheaper than fetching the entire cache line (~640 bits)
[Chart: normalized energy vs PCT = 1…8; components: Network Link, Network Router, Directory, L2 Cache, L1-D Cache, L1-I Cache]
Completion Time vs PCT (Blackscholes)
• Lower L1 cache miss rate and miss penalty
• Less time spent waiting on L1 cache misses
[Chart: normalized completion time vs PCT = 1…8; components: Synchronization, L2Cache-OffChip, L2Cache-Sharers, L2Cache-Waiting, L1Cache-L2Cache, Compute]
Reducing Sharing Misses: Private L1 Cache Miss Rate vs PCT (Streamcluster)
• Sharing misses (expensive) are turned into word misses (cheap) as PCT increases
[Chart: cache miss rate breakdown (%) vs PCT = 1…8, y-axis 0–8%; components: Cold, Capacity, Upgrade, Sharing, Word]
Energy vs PCT (Streamcluster)
• Reduces invalidations, asynchronous write-backs, and cache-line ping-ponging
[Chart: normalized energy vs PCT = 1…8; components: Network Link, Network Router, Directory, L2 Cache, L1-D Cache, L1-I Cache]
Completion Time vs PCT (Streamcluster)
• Less time spent waiting for invalidations and by loads waiting for previous stores
• Critical-section time reduction leads to synchronization time reduction
[Chart: normalized completion time vs PCT = 1…8; components: Synchronization, L2Cache-OffChip, L2Cache-Sharers, L2Cache-Waiting, L1Cache-L2Cache, Compute]
Variation with PCT: Results Summary
• Evaluated 18 benchmarks from the SPLASH-2, PARSEC, Parallel-MI-Bench, and UHPC suites + 3 hand-written benchmarks
• PCT of 4 obtains a 25% reduction in energy and a 15% reduction in completion time
• Evaluations done using the Graphite simulator for 64 cores, McPAT/CACTI cache energy models, and DSENT network energy models at 11 nm
Conclusion
• Three potential advantages of the locality-aware adaptive cache coherence protocol:
– Better private cache utilization
– Reduced on-chip communication (invalidations, asynchronous write-backs, and cache-line transfers)
– Reduced memory access latency and energy
• Efficient locality tracking hardware
– Decoupled from traditional coherence tracking structures
– The Limited-3 locality classifier has a low overhead of 18 KB per core (with a 256 KB per-core L2 cache)
• Simple to implement
– NO additional networks for deadlock avoidance