The Locality-Aware Adaptive Cache Coherence Protocol
George Kurian¹, Omer Khan², Srini Devadas¹
¹ Massachusetts Institute of Technology   ² University of Connecticut, Storrs
Cache Hierarchy Organization: Directory-Based Coherence
• Private caches: 1 or 2 levels
• Shared cache: last-level, with integrated directory
• Concurrent reads lead to replication in private caches
• Directory maintains coherence for replicated lines
[Figure: write-miss flow: (1) write misses in the private cache, (2) request sent to the shared cache + directory, (3)(4) sharers invalidated before the write word proceeds]
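The read-replication and write-invalidation behavior on this slide can be sketched as a few lines of bookkeeping. This is our illustrative model, not the authors' code; the core and line names are made up.

```python
# Minimal sketch of directory-based coherence bookkeeping: reads replicate
# a line into private caches, and a write invalidates all other sharers
# before granting the writer exclusive ownership.

class Directory:
    def __init__(self):
        self.sharers = {}  # line -> set of cores holding a private copy

    def read(self, core, line):
        # concurrent reads lead to replication in private caches
        self.sharers.setdefault(line, set()).add(core)

    def write(self, core, line):
        # invalidate every other sharer; writer becomes the sole owner
        invalidated = self.sharers.get(line, set()) - {core}
        self.sharers[line] = {core}
        return invalidated

d = Directory()
d.read("A", "X")
d.read("C", "X")                  # line X replicated at A and C
print(sorted(d.write("B", "X")))  # -> ['A', 'C'] (sharers invalidated)
```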
Private Caching: Advantages & Drawbacks
☺ Exploits spatio-temporal locality
☺ Efficient low-latency local access to private + shared data (cache line replication)
☹ Inefficiently handles data with LOW spatio-temporal locality:
– Working set > private cache size: inefficient cache utilization (cache thrashing), unnecessary fetch of entire cache line; shared data replication increases the working set
– Shared data with frequent writes: wasteful invalidations, synchronous writebacks, cache line ping-ponging
Result: increased on-chip communication and time spent waiting for expensive events
On-Chip Communication Problem
"Wires relative to gates are getting worse every generation." (Shekhar Borkar, Intel)
"Bit movement is much more expensive than computation." (Bill Dally, Stanford)
Must architect efficient coherence protocols.
Locality of Benchmarks: Evaluating Reuse before Evictions
• Utilization: # private L1 cache accesses before the cache line is evicted
• 40% of evicted lines have a utilization < 4
[Chart: utilization distribution of evicted lines; 80% / 20% callouts]
Locality of Benchmarks: Evaluating Reuse before Invalidations
• Utilization: # private L1 cache accesses before the cache line is invalidated (intervening write)
[Chart: utilization distribution of invalidated lines; 80% / 10% callouts]
Remote-Word Access (RA)
• NUCA-based protocol [Fensch et al. HPCA'08], [Hoffmann et al. HiPEAC'10]
• Assign each memory address to a unique "home" core: the cache line is present only in the shared cache at the "home" core (single location)
• For access to a non-locally cached word, request the "remote" shared cache on the "home" core to perform the read/write access
[Figure: (1) write-word request sent to the home core, (2) access performed at the home core's shared cache]
Remote-Word Access: Advantages & Drawbacks
☺ Energy efficient for low-locality data: a word access (~200 bits) is cheaper than a cache line fetch (~640 bits)
☺ NO data replication: efficient private cache utilization
☺ NO invalidations / synchronous writebacks
☹ Round-trip network request for each remote-WORD access
☹ Expensive for high-locality data
☹ Data placement dictates distance & frequency of remote accesses
Locality-Aware Cache Coherence
• Combine advantages of private caching and remote access
• Privately cache high-locality lines: optimize hit latency and energy
• Remotely cache low-locality lines: prevent data replication & costly data movement
• Private Caching Threshold (PCT):
– Utilization >= PCT: mark as private
– Utilization < PCT: mark as remote
Locality-Aware Cache Coherence: Invalidations vs Utilization
• Private Caching Threshold (PCT) = 4
[Chart: invalidations breakdown (%) across utilization bins 1, 2–3, 4–5, 6–7, >=8; bins below PCT are classified remote, bins at or above PCT private]
Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion
Baseline System
• Compute pipeline
• Private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache with integrated directory
• L2 cache managed by Reactive-NUCA [Hardavellas – ISCA'09]
• ACKwise limited-directory protocol [Kurian – PACT'10]
[Figure: tiled core with compute pipeline, L1 I-cache, L1 D-cache, L2 shared cache slice, directory, and router; M = memory controllers]
Locality-Aware Coherence: Important Features
• Intelligent allocation of cache lines
– In the private L1 cache
– Allocation decision made per-core at cache line granularity
• Efficient locality tracking hardware
– Decoupled from traditional coherence tracking structures
• Low protocol complexity
– NO additional networks for deadlock avoidance
Implementation Details: Private Cache Line Tag
• Private Utilization bits track cache line usage in the L1 cache
• Communicated back to the directory on eviction or invalidation
• Storage overhead is only 0.4%
Tag layout: | State | LRU | Tag | Private Utilization |
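The utilization field in the tag is only a few bits wide, so it must saturate rather than wrap. A small sketch of such a counter follows; the 5-bit width is our assumption (the slide states only the 0.4% overhead), chosen to be consistent with the storage numbers later in the deck.

```python
# Sketch of a saturating per-line utilization counter as kept in the L1
# tag. UTIL_BITS = 5 is an assumed width, not taken from the slides.

UTIL_BITS = 5
UTIL_MAX = (1 << UTIL_BITS) - 1   # counter saturates at 31

def bump(util: int) -> int:
    """Increment the private-utilization counter, saturating at 2^5 - 1."""
    return min(util + 1, UTIL_MAX)

u = 0
for _ in range(40):   # 40 L1 hits to the same line
    u = bump(u)
print(u)              # -> 31 (saturated, does not wrap to 0)
```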
Implementation Details: Directory Entry
• P/Rᵢ: Private/Remote mode of Coreᵢ
• Remote-Utilizationᵢ: line usage by Coreᵢ at the shared L2 cache
• Complete locality classifier: tracks mode / remote utilization for ALL cores
• Storage overhead reduced later
Entry layout: | State | Tag | ACKwise Pointers 1…p | P/R₁ … P/Rₙ | Remote Utilization₁ … Remote Utilizationₙ |
Mode Transitions Summary
• Classification based on previous behavior
• State machine (per core, per cache line):
– Initial → Private
– Private → Remote when Private Utilization < PCT (reported on eviction/invalidation)
– Remote → Private when Remote Utilization >= PCT
– Private stays Private when Private Utilization >= PCT; Remote stays Remote when Remote Utilization < PCT
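The transition rule above can be stated as a one-line reclassification function, evaluated whenever a utilization count becomes available (on eviction/invalidation in private mode, or after PCT remote accesses in remote mode). A hedged sketch, with names of our choosing:

```python
# Model of the per-(core, line) mode state machine: regardless of the
# current mode, the next mode depends only on the utilization observed
# since the last classification event. Lines start in private mode.

PCT = 2  # threshold used in the deck's walk-through example

def next_mode(mode: str, utilization: int) -> str:
    # mode is "private" or "remote"; both transitions and self-loops
    # collapse to the same threshold comparison
    return "private" if utilization >= PCT else "remote"

assert next_mode("private", 1) == "remote"   # thrashed line demoted
assert next_mode("remote", 2) == "private"   # reused line promoted
assert next_mode("private", 3) == "private"  # stays private
```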
Walk-Through Example (PCT = 2)
Setup: four cores A–D; the shared L2 cache slice + directory homing line X lives at Core D. All cores start out in private mode; X is uncached.
1. Core A reads X. A is in private mode, so the directory replies with the full cache line; directory state: Shared, sharer A.
2. Core C reads X: the line is replicated in C's private cache. A second read by C raises C's private utilization to 2.
3. Core B writes X. The directory invalidates A and C. A's Inv-Reply reports utilization 1 (< PCT), so A is reclassified as remote; C's Inv-Reply reports utilization 2 (>= PCT), so C stays private. The directory then sends the cache line to B; directory state: Modified.
4. Core A reads X again, now in remote mode. The directory obtains the data from B (WB / WB-Reply; state: Shared, L2 copy dirty), performs the read locally, and returns only the requested word to A. A's remote utilization becomes 1.
5. Core B writes X: an Upgrade-Reply makes B's copy Modified again. The intervening write resets A's remote utilization to 0.
6. Core A issues further remote reads, each satisfied with a word reply from the home L2 (the line returns to Shared at the directory). When A's remote utilization reaches 2 (= PCT), A is promoted back to private mode: the directory replies with the full cache line, and A caches X privately again (final state: Shared, with copies at A and B).
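The classification bookkeeping in the walk-through can be replayed in a few lines. This models only the mode/utilization tracking, not the full coherence protocol; the function names are ours.

```python
# Compact replay of the walk-through (PCT = 2): core A is demoted to
# remote mode after being invalidated with utilization 1, then promoted
# back once its remote utilization reaches PCT (after a counter reset
# caused by an intervening write).

PCT = 2
mode = {"A": "private", "B": "private", "C": "private"}
util = {"A": 0, "B": 0, "C": 0}   # per-core utilization for line X

def access(core):
    util[core] += 1
    if mode[core] == "remote" and util[core] >= PCT:
        mode[core] = "private"    # promotion: full line fetched

def invalidate(core):
    # on invalidation, reclassify using the reported utilization
    mode[core] = "private" if util[core] >= PCT else "remote"
    util[core] = 0

access("A")                        # A reads X (utilization 1)
access("C"); access("C")           # C reads X twice (utilization 2)
invalidate("A"); invalidate("C")   # B writes X -> invalidations
assert mode["A"] == "remote"       # 1 < PCT
assert mode["C"] == "private"      # 2 >= PCT
access("A")                        # remote read #1
util["A"] = 0                      # intervening write by B resets counter
access("A"); access("A")           # two more remote reads
assert mode["A"] == "private"      # promoted back at PCT accesses
```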
Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion
Complete Locality Classifier: High Directory Storage
• The complete locality classifier tracks locality information for ALL cores
Entry layout: | State | Tag | ACKwise Pointers 1…p | P/R₁ … P/Rₙ | Remote Utilization₁ … Remote Utilizationₙ |
Bit overhead per core (256 KB L2), Complete classifier: 192 KB (60%)
Limited Locality Classifier: Reduces Directory Storage
• Utilization and mode tracked for only k sharers
• Modes of other sharers obtained by taking a majority vote
Entry layout: | State | Tag | ACKwise Pointers 1…p | Core ID₁, P/R₁, Remote Utilization₁ | … | Core IDₖ, P/Rₖ, Remote Utilizationₖ |
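The majority-vote fallback can be sketched directly. This is our illustrative version; the deck does not specify the tie-breaking rule, so ties here simply fall to whichever mode was counted first.

```python
# Sketch of the Limited-k classifier's fallback: only k sharers get
# explicit (core ID, mode, utilization) slots; a core without a slot is
# classified by a majority vote over the tracked sharers' modes.

from collections import Counter

def vote_mode(tracked_modes: list) -> str:
    """Mode for an untracked core, by majority over tracked sharers."""
    if not tracked_modes:
        return "private"  # default starting mode per the state machine
    return Counter(tracked_modes).most_common(1)[0][0]

assert vote_mode(["remote", "remote", "private"]) == "remote"
assert vote_mode([]) == "private"
```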
Limited-3 Locality Classifier
• Utilization and mode tracked for 3 sharers
• Bit overhead per core (256 KB L2): Complete 192 KB (60%) vs Limited-3 18 KB (5.7%)
• Limited-3 vs Complete: completion time 3% lower, energy 1.5% lower
• Achieves the performance and energy of the Complete locality classifier
• Completion time and energy are slightly lower because remote-mode classification is learned faster with Limited-3
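The storage numbers above can be checked with a back-of-the-envelope calculation. The bit widths below (1 mode bit + 5 utilization bits per tracked core, 6-bit core IDs for 64 cores) are our reconstruction, chosen because they reproduce the 192 KB and 18 KB figures exactly; the deck itself gives only the totals.

```python
# Back-of-the-envelope check of the classifier storage overheads,
# assuming 64 cores, 64-byte cache lines, and a 256 KB per-core L2 slice
# with one directory entry per L2 line.

cores, line_bytes, l2_kb = 64, 64, 256
entries = l2_kb * 1024 // line_bytes        # 4096 directory entries

complete_bits = entries * cores * (1 + 5)   # mode + utilization per core
limited3_bits = entries * 3 * (6 + 1 + 5)   # core ID + mode + utilization

print(complete_bits // 8 // 1024)  # -> 192 (KB), i.e. 60% of 320 KB
print(limited3_bits // 8 // 1024)  # -> 18 (KB), i.e. ~5.7%
```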
Private ↔ Remote Transition: Results in Private Cache Thrashing
[State machine: Initial → Private; Private → Remote when Private Utilization < PCT; Remote → Private when Remote Utilization >= PCT]
• A core reverts back to private mode after #PCT accesses to the cache line at the shared L2 cache
• The promoted line evicts other lines in the private L1 cache
• Results in low spatio-temporal locality for all the lines involved
• It is difficult to measure the private-cache locality of a line while it resides in the shared L2 cache
Ideal Classifier: NO Private Cache Thrashing
• An ideal classifier maintains part of the working set in the private cache
• Other lines are placed in remote mode at the shared cache
Remote Access Threshold: Reduces Private Cache Thrashing
[State machine: Initial → Private; Private → Remote when Private Utilization < PCT; Remote → Private when Remote Utilization >= RAT; Remote stays Remote when Remote Utilization < RAT]
• If a core is classified as a remote sharer due to capacity, increase the cost of promotion to private mode
• If a core is classified as a private sharer, reset the cost back to its starting value
• The Remote Access Threshold (RAT) is varied based on PCT & application behavior [details in paper]
• Reduces private cache thrashing to a negligible level
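The escalate-on-capacity-demotion / reset-on-private idea can be sketched as a small state holder. The threshold schedule below is illustrative; the actual RAT values and number of levels are in the paper, not this deck.

```python
# Sketch of adaptive Remote Access Threshold (RAT) bookkeeping: each
# capacity-caused demotion to remote mode raises the promotion threshold,
# making it harder to thrash back into the private cache; a private
# classification resets it.

RAT_LEVELS = [4, 8, 16]   # illustrative escalation schedule (assumed)

class RemoteSharer:
    def __init__(self):
        self.level = 0

    def rat(self) -> int:
        return RAT_LEVELS[self.level]

    def on_capacity_demotion(self):
        # classified remote due to capacity: raise the cost of promotion
        self.level = min(self.level + 1, len(RAT_LEVELS) - 1)

    def on_private_classification(self):
        self.level = 0  # reset the cost back to its starting value

s = RemoteSharer()
s.on_capacity_demotion()
s.on_capacity_demotion()
print(s.rat())   # -> 16
s.on_private_classification()
print(s.rat())   # -> 4
```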
Outline
• Motivation for Locality-Aware Coherence
• Implementation Details
• Optimizations
• Evaluation
• Conclusion
Reducing Capacity Misses: Private L1 Cache Miss Rate vs PCT (Blackscholes)
• Miss rate decreases as PCT increases (better utilization)
• Multiple capacity misses (expensive) are replaced with single word accesses (cheap)
• The cache miss rate increases again at the largest PCT values (one capacity miss turns into multiple word misses)
[Chart: cache miss rate breakdown (%) vs PCT = 1…8, y-axis 0–3%; components: Cold, Capacity, Upgrade, Sharing, Word]
Energy vs PCT (Blackscholes)
• Reducing L1 cache misses (capacity misses turned into word accesses) reduces network traffic and L2 accesses
• Accessing a word (~200 bits) is cheaper than fetching the entire cache line (~640 bits)
[Chart: normalized energy vs PCT = 1…8; components: Network Link, Network Router, Directory, L2 Cache, L1-D Cache, L1-I Cache]
Completion Time vs PCT (Blackscholes)
• Lower L1 cache miss rate and miss penalty
• Less time spent waiting on L1 cache misses
[Chart: normalized completion time vs PCT = 1…8; components: Synchronization, L2Cache-OffChip, L2Cache-Sharers, L2Cache-Waiting, L1Cache-L2Cache, Compute]
Reducing Sharing Misses: Private L1 Cache Miss Rate vs PCT (Streamcluster)
• Sharing misses (expensive) are turned into word misses (cheap) as PCT increases
[Chart: cache miss rate breakdown (%) vs PCT = 1…8, y-axis 0–8%; components: Cold, Capacity, Upgrade, Sharing, Word]
Energy vs PCT (Streamcluster)
• Reduces invalidations, asynchronous write-backs, and cache-line ping-ponging
[Chart: normalized energy vs PCT = 1…8; components: Network Link, Network Router, Directory, L2 Cache, L1-D Cache, L1-I Cache]
Completion Time vs PCT (Streamcluster)
• Less time spent waiting for invalidations and by loads waiting for previous stores
• Critical-section time reduction leads to synchronization time reduction
[Chart: normalized completion time vs PCT = 1…8; components: Synchronization, L2Cache-OffChip, L2Cache-Sharers, L2Cache-Waiting, L1Cache-L2Cache, Compute]
Variation with PCT: Results Summary
• Evaluated 18 benchmarks from the SPLASH-2, PARSEC, Parallel-MI-Bench, and UHPC suites + 3 hand-written benchmarks
• PCT of 4 obtains a 25% reduction in energy and a 15% reduction in completion time
• Evaluations done using the Graphite simulator for 64 cores, McPAT/CACTI cache energy models, and DSENT network energy models at 11 nm
Conclusion
• Three potential advantages of the locality-aware adaptive cache coherence protocol:
– Better private cache utilization
– Reduced on-chip communication (invalidations, asynchronous write-backs, and cache-line transfers)
– Reduced memory access latency and energy
• Efficient locality tracking hardware
– Decoupled from traditional coherence tracking structures
– The Limited-3 locality classifier has a low overhead of 18 KB per core (with a 256 KB per-core L2 cache)
• Simple to implement
– NO additional networks for deadlock avoidance