Teaching Old Caches New Tricks: Predictor Virtualization
Andreas Moshovos, Univ. of Toronto
Ioana Burcea's thesis work; some parts joint with Stephen Somogyi (CMU) and Babak Falsafi (EPFL)
2 Prediction: The Way Forward
CPU predictors: Prefetching, Branch Target and Direction, Cache Replacement, Cache Hit
• Application footprints grow
• Predictors need to scale to remain effective
• Ideally, fast, accurate predictions
• Can't have this with conventional technology
Prediction has proven useful, in many forms; which to choose?
3 The Problem with Conventional Predictors
Predictor Virtualization: approximate large, accurate, fast predictors
Predictor design trades off hardware cost, accuracy, and latency
• What we have: small, fast, not-so-accurate
• What we want: small, fast, accurate
4 Why Now?
Extra resources: CMPs with large caches
CPU CPU CPU CPU
I$ D$ I$ D$ I$ D$ I$ D$
L2 Cache (10-100MB)
Physical Memory
5 Predictor Virtualization (PV)
Use the on-chip cache to store metadata; reduce the cost of dedicated predictors
CPU CPU CPU CPU
I$ D$ I$ D$ I$ D$ I$ D$
L2 Cache
Physical Memory
6 Predictor Virtualization (PV)
Use the on-chip cache to store metadata; implement otherwise impractical predictors
CPU CPU CPU CPU
I$ D$ I$ D$ I$ D$ I$ D$
L2 Cache
Physical Memory
7 Research Overview
• PV breaks the conventional predictor design trade-offs
  – Lowers cost of adoption
  – Facilitates implementation of otherwise impractical predictors
• Freeloads on existing resources
  – Adaptive demand
• Key design challenge:
  – How to compensate for the longer latency to metadata
• PV in action
  – Virtualized "Spatial Memory Streaming"
  – Virtualized Branch Target Buffers
8 Talk Roadmap
• PV Architecture
• PV in Action
  – Virtualizing "Spatial Memory Streaming"
  – Virtualizing Branch Target Buffers
• Conclusions
9 PV Architecture
CPU
I$ D$
L2 Cache
Physical Memory
Optimization Engine
Predictor Table
request → prediction
Virtualize
10 PV Architecture
CPU
I$ D$
L2 Cache
Physical Memory
Optimization Engine
PV Cache
request → prediction
PV Proxy
PV Table
The PV Proxy requires access to L2; it sits on the back side of L1, so it is not as performance critical
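The proxy described above can be sketched in a few lines: a tiny dedicated PV cache backed by a larger virtual table that stands in for metadata resident in the L2/memory hierarchy. This is a minimal sketch under assumed sizes and an LRU policy; the class and parameter names are illustrative, not from the talk.

```python
from collections import OrderedDict

class PVProxy:
    """Sketch of a PV proxy: a small, fast PV cache backed by a large
    virtual table that models metadata living in the L2/memory hierarchy."""

    def __init__(self, pv_cache_entries=64):
        self.pv_cache = OrderedDict()   # small dedicated storage (LRU order)
        self.capacity = pv_cache_entries
        self.virtual_table = {}         # metadata resident in L2/memory
        self.hits = self.misses = 0

    def lookup(self, index):
        if index in self.pv_cache:              # common case: PV cache hit
            self.hits += 1
            self.pv_cache.move_to_end(index)    # refresh LRU position
            return self.pv_cache[index]
        self.misses += 1                        # infrequent: fetch from L2
        entry = self.virtual_table.get(index)
        if entry is not None:
            self._install(index, entry)
        return entry

    def update(self, index, entry):
        self.virtual_table[index] = entry       # write-through to virtual table
        self._install(index, entry)

    def _install(self, index, entry):
        self.pv_cache[index] = entry
        self.pv_cache.move_to_end(index)
        if len(self.pv_cache) > self.capacity:  # evict LRU entry; it still
            self.pv_cache.popitem(last=False)   # survives in the virtual table
```

Evicted entries are not lost: a later lookup pays the longer L2 latency once and re-installs the entry in the PV cache, which is exactly the latency trade the talk addresses next.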
11 PV Challenge: Prediction Latency
CPU
I$ D$
L2 Cache
Physical Memory
Optimization Engine
PV Cache
request → prediction
PV Proxy
PV Table
Common case: PV cache hit
Infrequent: L2 access, latency 12-18 cycles
Rare: memory access, latency ~400 cycles
Key: how to pack metadata into L2 cache blocks to amortize costs
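The amortization argument can be made concrete with a little arithmetic. The line size and per-entry size below are assumptions chosen for illustration; only the 12-18 cycle L2 latency comes from the slide.

```python
# Packing predictor metadata into L2 cache blocks (illustrative sizes).
L2_LINE_BYTES = 64     # assumed L2 cache line size
ENTRY_BYTES = 2        # assumed size of one predictor metadata entry

# One L2 fetch brings in a whole line of entries, not just one.
entries_per_line = L2_LINE_BYTES // ENTRY_BYTES    # 32 entries per fetch

l2_latency = 15        # cycles, mid-range of the 12-18 quoted on the slide

# If spatial reuse lets the co-fetched entries serve later predictions,
# the one long-latency access is amortized over many predictions:
amortized_cycles_per_entry = l2_latency / entries_per_line
```

Under these assumptions, a single 15-cycle L2 access costs well under one cycle per useful metadata entry, provided most entries in the fetched line are actually reused, which is the reuse property the next slide examines.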
12 To Virtualize or Not To Virtualize
• Predictors redesigned with PV in mind
• Overcoming the latency challenge
  – Metadata reuse
    • Intrinsic: one entry used for multiple predictions
    • Temporal: one entry reused in the near future
    • Spatial: one miss overcome by several subsequent hits
  – Metadata access pattern predictability
    • Predictor metadata prefetching
• Looks similar to designing caches, BUT:
  – Does not have to be correct all the time
  – Time limit on usefulness
13 PV in Action
• Data prefetching
  – Virtualize "Spatial Memory Streaming" [ISCA06]
    • Within 1% of the original performance
    • Hardware cost from 60KB down to < 1KB
• Branch prediction
  – Virtualize branch target buffers
    • Increase the perceived BTB capacity
    • Up to 12.75% IPC improvement with 8% hardware overhead
14 Spatial Memory Streaming [ISCA06]
Memory
Spatial patterns: 1100001010001…, 1101100000001…
Pattern History Table
[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. "Spatial Memory Streaming"
15 Spatial Memory Streaming (SMS)
Detector (~1KB): observes the data access stream, produces spatial patterns
Predictor (~60KB): on a trigger access, supplies the pattern used to issue prefetches
Virtualize the ~60KB predictor
16 Virtualizing SMS
Virtual table: 1K sets × 11 ways of (tag, pattern) entries, plus unused space
PV cache: 8 sets × 11 ways
One L2 cache line holds one full set (all 11 ways)
Region-level prefetching is naturally tolerant of longer prediction latencies; simply pack predictor entries spatially
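The spatial packing above can be sketched as follows: each virtual-table set occupies one L2-line-sized unit, so a single long-latency fetch delivers all 11 candidate (tag, pattern) entries for a set. The set counts follow the slide (1K sets × 11 ways); the indexing hash and entry layout are assumptions.

```python
# Sketch: pack one SMS pattern-table set into one L2-line-sized unit,
# so a single L2 fetch brings in every candidate entry for that set.
NUM_SETS = 1024   # 1K sets, as on the slide
WAYS = 11         # 11 ways per set, as on the slide

def set_index(region_addr):
    """Index the virtual table by region address (simple modulo hash, assumed)."""
    return region_addr % NUM_SETS

class VirtualSMSTable:
    def __init__(self):
        # Each "line" models one full set: up to WAYS (tag, pattern) entries.
        self.lines = [[] for _ in range(NUM_SETS)]

    def install(self, region_addr, pattern):
        ways = self.lines[set_index(region_addr)]
        ways.insert(0, (region_addr, pattern))   # MRU insertion
        del ways[WAYS:]                          # keep at most 11 ways

    def fetch_set(self, region_addr):
        """One (long-latency) L2 access returns the whole set; the proxy
        then searches all 11 ways locally at no extra L2 cost."""
        return self.lines[set_index(region_addr)]

    def predict(self, region_addr):
        for tag, pattern in self.fetch_set(region_addr):
            if tag == region_addr:
                return pattern
        return None
```

Because SMS predicts at region granularity, one fetched pattern covers many subsequent accesses, which is why this scheme tolerates the longer latency without any metadata prefetching.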
17 Experimental Methodology
• SimFlex: full-system, cycle-accurate simulator
• Baseline processor configuration
  – 4-core CMP, OoO
  – L1D/L1I 64KB 4-way set-associative
  – UL2 8MB 16-way set-associative
• Commercial workloads
  – Web servers: Apache and Zeus
  – TPC-C: DB2 and Oracle
  – TPC-H: several queries
  – Developed by the Impetus group at CMU (Anastasia Ailamaki and Babak Falsafi, PIs)
18 SMS Performance Potential
[Figure: percentage of L1 read misses covered, uncovered, and overpredicted for Apache, Oracle, and Query 17, as the pattern table shrinks from infinite through 1K-16-way, 1K-11-way, 512-11-way, 256-11-way, 128-11-way, 64-11-way, 32-11-way, and 16-11-way down to 8 sets × 11 ways]
Conventional predictor degrades with limited storage
19 Virtualized SMS
Hardware cost: original prefetcher ~60KB; virtualized prefetcher < 1KB
[Figure: speedup (higher is better)]
20 Impact of Virtualization on L2 Requests
[Figure: percentage increase in L2 requests for PV-8 and PV-16 on Apache, Oracle, and Query 17; y-axis 0-45%]
21 Impact of Virtualization on Off-Chip Bandwidth
[Figure: off-chip bandwidth increase, split into L2 misses and L2 writebacks, for PV-8 and PV-16 on Apache, Oracle, and Query 17; y-axis 0-5%]
22 PV in Action
• Data prefetching
  – Virtualize "Spatial Memory Streaming" [ISCA06]
    • Same performance
    • Hardware cost from 60KB down to < 1KB
• Branch prediction
  – Virtualize branch target buffers
    • Increase the perceived BTB capacity
    • Up to 12.75% IPC improvement with 8% hardware overhead
23 The Need for Larger BTBs
[Figure: branch MPKI vs. BTB entries (lower is better)]
Commercial applications benefit from large BTBs
24 Virtualizing BTBs: Phantom-BTB
L2 Cache
PC → BTB (small and fast) + Virtual Table (large and slow)
• Latency challenge: branch prediction is not tolerant of longer prediction latencies
• Solution: predictor metadata prefetching
  – Virtual table decoupled from the BTB
  – Virtual table entry: a temporal group
25 Facilitating Metadata Prefetching
• Intuition: programs mostly follow similar paths
Detection path → subsequent path
26 Temporal Groups
Past misses are a good indicator of future misses; the dedicated predictor acts as a filter
27 Fetch Trigger
A preceding miss triggers the temporal group fetch; the trigger is not precise, it covers a region around the miss
28-29 Temporal Group Prefetching
30 Phantom-BTB Architecture
L2 Cache
Temporal Group Generator
PC → BTB
Prefetch Engine
• Temporal Group Generator: generates and installs temporal groups in the L2 cache
• Prefetch Engine: prefetches temporal groups
31 Temporal Group Generation
Branch stream → PC → BTB (miss / hit)
Temporal Group Generator → L2 Cache
Prefetch Engine
BTB misses generate temporal groups; BTB hits do not generate any PBTB activity
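The miss-driven generation path can be sketched as follows: consecutive BTB misses are collected into fixed-size temporal groups keyed by the miss that opened the group. The 6-entry group size matches the configuration quoted later in the talk; the class shape and install policy are assumptions.

```python
# Sketch of temporal group generation (miss-driven, per the slide).
GROUP_SIZE = 6   # 6-entry temporal groups, as in the talk's configuration

class TemporalGroupGenerator:
    def __init__(self):
        self.virtual_table = {}   # trigger PC -> temporal group (models L2-resident table)
        self.trigger_pc = None    # PC of the miss that opened the current group
        self.current = []         # group under construction

    def on_btb_miss(self, pc, target):
        if self.trigger_pc is None:
            self.trigger_pc = pc               # first miss opens a new group
        self.current.append((pc, target))
        if len(self.current) == GROUP_SIZE:    # group full: install in the L2
            self.virtual_table[self.trigger_pc] = list(self.current)
            self.trigger_pc = None
            self.current = []

    # BTB hits generate no PBTB activity, so there is deliberately no hit hook.
```

Keying the group by its trigger miss is what lets a later recurrence of that same miss prefetch the whole group, per the fetch-trigger slide above.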
32 Branch Metadata Prefetching
Branch stream → PC → BTB (miss / hit), with a parallel lookup in the prefetch buffer
Temporal Group Generator
Virtual Table (in the L2 cache) → Prefetch Buffer
BTB misses trigger metadata prefetches
Prefetch buffer hits supply the predictions the BTB misses
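The lookup path above can be sketched in a few lines: the BTB and prefetch buffer are probed for each branch (in parallel in hardware, modeled sequentially here), and a BTB miss additionally triggers a temporal-group prefetch from the virtual table. The 64-entry prefetch buffer matches the talk's configuration; everything else is an illustrative assumption.

```python
# Sketch of the PBTB lookup path: parallel BTB + prefetch buffer probe,
# with BTB misses triggering temporal-group prefetches.
class PhantomBTB:
    def __init__(self, virtual_table, prefetch_buffer_entries=64):
        self.btb = {}                         # dedicated BTB, abstracted as a dict
        self.virtual_table = virtual_table    # trigger PC -> temporal group
        self.prefetch_buffer = {}             # small buffer of prefetched entries
        self.pb_capacity = prefetch_buffer_entries

    def predict(self, pc):
        target = self.btb.get(pc)             # parallel probe, modeled sequentially
        if target is None:
            target = self.prefetch_buffer.get(pc)   # prefetch buffer hit?
            self._prefetch(pc)                # a miss also triggers a prefetch
        return target

    def _prefetch(self, trigger_pc):
        group = self.virtual_table.get(trigger_pc)
        if group is None:
            return
        for pc, target in group:              # fill the prefetch buffer (FIFO evict)
            if len(self.prefetch_buffer) >= self.pb_capacity:
                self.prefetch_buffer.pop(next(iter(self.prefetch_buffer)))
            self.prefetch_buffer[pc] = target
```

The first miss on a trigger PC still goes unpredicted, but it pulls in the whole temporal group, so the misses that usually follow it hit in the prefetch buffer instead.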
33 Phantom-BTB Advantages
• "Pay-as-you-go" approach
  – Practical design
  – Increases the perceived BTB capacity
  – Dynamic allocation of resources
• Branch metadata allocated on demand
  – On-the-fly adaptation to application demands
  – Branch metadata generation and retrieval performed on BTB misses, i.e., only if the application sees misses
  – Metadata survives in the L2 as long as there is sufficient capacity and demand
34 Experimental Methodology
• Flexus cycle-accurate, full-system simulator
• Uniprocessor, OoO
  – 1K-entry conventional BTB
  – 64KB 2-way ICache/DCache
  – 4MB 16-way L2 cache
• Phantom-BTB
  – 64-entry prefetch buffer
  – 6-entry temporal group
  – 4K-entry virtual table
• Commercial workloads
35 PBTB vs. Conventional BTBs
[Figure: speedup (higher is better)]
Performance within 1% of a 4K-entry BTB with 3.6x less storage
36 Phantom-BTB with Larger Dedicated BTBs
[Figure: speedup (higher is better)]
PBTB remains effective with larger dedicated BTBs
37 Increase in L2 MPKI
[Figure: L2 MPKI (lower is better)]
Marginal increase in L2 misses
38 Increase in L2 Accesses
[Figure: L2 accesses per KI (lower is better)]
• PBTB follows application demand for BTB capacity
39 Summary
• Predictor metadata stored in the memory hierarchy
  – Benefits
    • Reduces dedicated predictor resources
    • Emulates large predictor tables for increased predictor accuracy
  – Why now?
    • Large on-chip caches / CMPs / need for large predictors
  – Predictor virtualization advantages
    • Predictor adaptation
    • Metadata sharing
• Moving forward
  – Virtualize other predictors
  – Expose the predictor interface to the software level