ASR: Adaptive Selective Replication for CMP Caches
Brad Beckmann†, Mike Marty, and David Wood
Multifacet Project, University of Wisconsin-Madison
12/13/06
† currently at Microsoft
Introduction: Shared Cache
[Figure: 8-CPU CMP (CPU 0-7), each CPU with L1 I and L1 D caches, all sharing eight L2 banks; block A is shown in one bank.]
• Maximize cache capacity
• Slow access latency: 40+ cycles
Introduction: Private Caches
[Figure: 8-CPU CMP (CPU 0-7), each CPU with L1 I and L1 D caches and its own private L2; block A is shown in several private L2 caches.]
• Fast access latency
• Lower effective capacity
• Desire both fast access & high capacity
Beckmann, Marty, & Wood ASR: Adaptive Selective Replication for CMP Caches 4
Introduction
• Previous hybrid proposals
  – Victim Replication, CMP-NuRapid, Cooperative Caching
  – Achieve fast access and high capacity
    • Under certain workloads & system configurations
    • Utilize static rules
  – Non-adaptive
• Adaptive Selective Replication: ASR
  – Dynamically monitor workload behavior
  – Adapt the L2 cache to workload demand
  – Up to 12% improvement vs. previous proposals
Outline
• Introduction
• Understanding L2 Replication
  – Benefit
  – Cost
  – Key Observation
  – Solution
• ASR: Adaptive Selective Replication
• Evaluation
Understanding L2 Replication
• Three L2 block sharing types
  1. Single requestor
     – All requests by a single processor
  2. Shared read-only
     – Read-only requests by multiple processors
  3. Shared read-write
     – Read and write requests by multiple processors
• Profile L2 blocks during their on-chip lifetime
  – 8-processor CMP
  – 16 MB shared L2 cache
  – 64-byte block size
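The three sharing types above can be expressed as a small classifier over the requests a block receives during its on-chip lifetime. This is a minimal sketch, not the profiling tool used for the talk; the `(cpu_id, is_write)` request format and the names `Sharing` and `classify_block` are assumptions made for illustration.

```python
from enum import Enum


class Sharing(Enum):
    SINGLE_REQUESTOR = "single requestor"
    SHARED_READ_ONLY = "shared read-only"
    SHARED_READ_WRITE = "shared read-write"


def classify_block(requests):
    """Classify one L2 block from its (cpu_id, is_write) requests.

    The request tuple format is a hypothetical one chosen for this sketch.
    """
    cpus = {cpu for cpu, _ in requests}
    if len(cpus) <= 1:
        # All requests came from a single processor.
        return Sharing.SINGLE_REQUESTOR
    if any(is_write for _, is_write in requests):
        # Several processors, and at least one of them wrote the block.
        return Sharing.SHARED_READ_WRITE
    # Several processors, reads only.
    return Sharing.SHARED_READ_ONLY
```

For example, `classify_block([(0, False), (3, False)])` yields `Sharing.SHARED_READ_ONLY`, the category that the replication policies later in the talk target.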
Understanding L2 Replication
[Figure: L2 block sharing profile. Workloads: Apache, Jbb, Oltp, Zeus. Categories: Shared Read-only, Shared Read-write, Single Requestor. Annotations: High Locality, Mid Locality, Low Locality.]
Understanding L2 Replication: Benefit
[Figure: L2 hit cycles vs. replication capacity.]
Understanding L2 Replication: Cost
[Figure: L2 miss cycles vs. replication capacity.]
Understanding L2 Replication: Key Observation
[Figure: L2 hit cycles vs. replication capacity; a curve is labeled "Total Cycle Curve".]
• Top 3% of Shared Read-only blocks satisfy 70% of Shared Read-only requests
• Replicate frequently requested blocks first
Understanding L2 Replication: Solution
[Figure: total cycles vs. replication capacity, with the optimal point marked.]
• Optimal: property of workload/cache interaction
• Not fixed: must adapt
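To see why the optimal point is not fixed, it helps to add a falling hit-cycle curve to a rising miss-cycle curve and look for the minimum. The sketch below does that for two cost curves; every number in it is made up for illustration and none comes from the talk's measurements, and `optimal_level` is a hypothetical helper name.

```python
def optimal_level(hit_cycles, miss_cycles):
    """Return the index with the lowest total (hit + miss) cycles."""
    totals = [h + m for h, m in zip(hit_cycles, miss_cycles)]
    return min(range(len(totals)), key=totals.__getitem__)


# Hypothetical curves, one entry per replication capacity step.
hit = [100, 70, 55, 48, 45, 44]                 # benefit: hit cycles fall
miss_small_cache = [20, 30, 50, 80, 120, 170]   # cost rises quickly
miss_large_cache = [10, 12, 15, 20, 27, 36]     # cost rises slowly

best_small = optimal_level(hit, miss_small_cache)
best_large = optimal_level(hit, miss_large_cache)
```

With these illustrative curves the minimum sits at a low replication capacity when misses are expensive and moves to a higher one when they are cheap, which is the behavior a static policy cannot follow.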
Outline
• Wires and CMP caches
• Understanding L2 Replication
• ASR: Adaptive Selective Replication
  – SPR: Selective Probabilistic Replication
  – Monitoring and adapting to workload behavior
• Evaluation
SPR: Selective Probabilistic Replication
• Mechanism for Selective Replication
  – Relax L2 inclusion property
    • L2 evictions do not force L1 evictions
    • Non-exclusive cache hierarchy
  – Ring Writebacks
    • L1 Writebacks passed clockwise between private L2 caches
    • Merge with other existing L2 copies
• Probabilistically choose between
  – Local writeback: allow replication
  – Ring writeback: disallow replication
• Replicates frequently requested blocks
SPR: Selective Probabilistic Replication
[Figure: 8-CPU CMP (CPU 0-7), each CPU with L1 I and L1 D caches and a private L2.]
SPR: Selective Probabilistic Replication
[Figure: replication capacity vs. replication level (0 to 5), with the current level marked.]

Replication Level      0    1     2     3    4    5
Prob. of Replication   0    1/64  1/16  1/4  1/2  1
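The probabilistic choice between a local and a ring writeback can be sketched with the level-to-probability table on this slide. This is an illustrative sketch only: the function name, the string return values, and the use of a software random number generator are assumptions, not details of the hardware design.

```python
import random

# Replication probability for levels 0..5, taken from the table on this slide.
REPLICATION_PROB = [0.0, 1 / 64, 1 / 16, 1 / 4, 1 / 2, 1.0]


def choose_writeback(level, rng=random):
    """Pick the writeback type for an L1 writeback at the given level.

    Returns "local" (write back to the local L2, allowing a replica) or
    "ring" (pass the block around the ring, disallowing a replica).
    """
    # rng.random() is in [0, 1), so level 0 never replicates
    # and level 5 always does.
    return "local" if rng.random() < REPLICATION_PROB[level] else "ring"
```

Because every writeback of a block is another chance to replicate it, blocks that are requested often are more likely to end up replicated, even at a low level.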
Monitoring and Adapting to Workload Behavior
1. Decrease in Replication Benefit
   – Bit marks replicas of the current, but not lower level
2. Increase in Replication Benefit
   – Store 8-bit partial tags of next higher level replications
[Figure: Replication Benefit Curve, L2 hit cycles vs. replication capacity, marking the lower, current, and higher levels.]
Monitoring and Adapting to Workload Behavior
3. Decrease in Replication Cost
   – Stores 16-bit partial tags of recently evicted blocks
4. Increase in Replication Cost
   – Way and Set counters track soon-to-be-evicted blocks
[Figure: Replication Cost Curve, L2 miss cycles vs. replication capacity, marking the lower, current, and higher levels.]
Outline
• Wires and CMP caches
• Understanding L2 Replication
• ASR: Adaptive Selective Replication
• Evaluation
Methodology
• Full system simulation
  – Simics
  – Wisconsin's GEMS Timing Simulator
    • Out-of-order processor
    • Memory system
• Workloads
  – Commercial
    • apache, jbb, oltp, zeus
  – Scientific (see paper)
    • SpecOMP: apsi & art
    • Splash: barnes & ocean
System Parameters
[ 8 core CMP, 45 nm technology ]

Memory System
  L1 I & D caches                     64 KB, 4-way, 3 cycles
  Unified L2 cache                    16 MB, 16-way
  L1 / L2 prefetching                 Unit & non-unit strided prefetcher (similar to Power4)
  Memory latency                      500 cycles
  Memory bandwidth                    50 GB/s
  Memory size                         4 GB of DRAM
  Outstanding memory requests / CPU   16

Dynamically Scheduled Processor
  Clock frequency                     5.0 GHz
  Reorder buffer / scheduler          128 / 64 entries
  Pipeline width                      4-wide fetch & issue
  Pipeline stages                     30
  Direct branch predictor             3.5 KB YAGS
  Return address stack                64 entries
  Indirect branch predictor           256 entries (cascaded)
Replication Benefit, Cost, & Effectiveness Curves
[Figure: Benefit and Cost panels.]
Replication Benefit, Cost, & Effectiveness Curves
[Figure: Effectiveness panel.]
Comparison of Replication Policies
• SPR: multiple possible policies
• Evaluated 4 shared read-only replication policies
  1. VR: Victim Replication
     – Previously proposed [Zhang ISCA 05]
     – Disallow replicas from evicting shared owner blocks
  2. NR: CMP-NuRapid
     – Previously proposed [Chishti ISCA 05]
     – Replicate upon the second request
  3. CC: Cooperative Caching
     – Previously proposed [Chang ISCA 06]
     – Replace replicas first
     – Spill singlets to remote caches
     – Tunable parameter: 100%, 70%, 30%, 0%
  4. ASR: Adaptive Selective Replication
     – Our proposal
     – Monitor and adjust to workload demand
• Callout on the previously proposed policies: lack dynamic adaptation
ASR: Performance
[Figure: performance comparison.]
Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR
Conclusions
• CMP Cache Replication
  – No replication conserves capacity
  – Replicating everything reduces on-chip latency
  – Previous hybrid proposals
    • Work well for certain criteria
    • Non-adaptive
• Adaptive Selective Replication
  – Probabilistic policy favors frequently requested blocks
  – Dynamically monitor replication benefit & cost
  – Replicate when benefit > cost
  – Improves performance up to 12% vs. previous schemes

Backup Slides
ASR: Memory Cycles
[Figure: memory cycles comparison.]
Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR
L2 Cache Requests Breakdown
L2 Cache Requests Breakdown: User & OS
Shared Read-write Requests Breakdown
Shared Read-write Block Breakdown
ASR: Decrease-in-replication Benefit
[Figure: L2 hit cycles vs. replication capacity, marking the lower and current levels.]
ASR: Decrease-in-replication Benefit
• Goal
  – Determine replication benefit decrease of the next lower level
• Mechanism
  – Current Replica Bit
    • Per L2 cache block
    • Set for replications of the current level
    • Not set for replications of lower level
  – Current replica hits would be remote hits with next lower level
• Overhead
  – 1 bit x 256 K L2 blocks = 32 KB
ASR: Increase-in-replication Benefit
[Figure: L2 hit cycles vs. replication capacity, marking the current and higher levels.]
ASR: Increase-in-replication Benefit
• Goal
  – Determine replication benefit increase of the next higher level
• Mechanism
  – Next Level Hit Buffers (NLHBs)
    • 8-bit partial tag buffer
    • Store replicas of the next higher level
  – NLHB hits would be local L2 hits with next higher level
• Overhead
  – 8 bits x 16 K entries x 8 processors = 128 KB
ASR: Decrease-in-replication Cost
[Figure: L2 miss cycles vs. replication capacity, marking the lower and current levels.]
ASR: Decrease-in-replication Cost
• Goal
  – Determine replication cost decrease of the next lower level
• Mechanism
  – Victim Tag Buffers (VTBs)
    • 16-bit partial tags
    • Store recently evicted blocks of current replication level
  – VTB hits would be on-chip hits with next lower level
• Overhead
  – 16 bits x 1 K entries x 8 processors = 16 KB
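A Victim Tag Buffer can be pictured as a small buffer of partial tags that is filled on evictions and probed on misses. The sketch below is one possible software model under stated assumptions: FIFO replacement, a partial tag taken from the low 16 bits of the block address, and the class and method names are all hypothetical and not specified by the slides.

```python
from collections import deque


class VictimTagBuffer:
    """Buffer of 16-bit partial tags of recently evicted blocks.

    Assumptions for this sketch: FIFO replacement and a partial tag formed
    from the low bits of the block address.
    """

    def __init__(self, entries=1024, tag_bits=16):
        self.mask = (1 << tag_bits) - 1
        self.tags = deque(maxlen=entries)  # oldest entry drops out first
        self.hits = 0

    def record_eviction(self, block_addr):
        """Remember the partial tag of a block evicted from the L2."""
        self.tags.append(block_addr & self.mask)

    def check_miss(self, block_addr):
        """On an L2 miss, count it if it matches a recently evicted tag."""
        hit = (block_addr & self.mask) in self.tags
        self.hits += hit
        return hit
```

Partial tags keep the buffer small at the price of occasional aliasing: two addresses that share the same low 16 bits are indistinguishable, so the hit count is an estimate of the cost decrease rather than an exact figure.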
ASR: Increase-in-replication Cost
[Figure: L2 miss cycles vs. replication capacity, marking the current and higher levels.]
ASR: Increase-in-replication Cost
• Goal
  – Determine replication cost increase of the next higher level
• Mechanism
  – Way and Set counters [Suh et al. HPCA 2002]
    • Identify soon-to-be-evicted blocks
    • 16-way pseudo LRU
    • 256 set groups
  – On-chip hits that would be off-chip with next higher level
• Overhead
  – 255-bit pseudo LRU tree x 8 processors = 255 B
Overall storage overhead: 212 KB or 1.2% of total storage
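The per-structure overheads quoted on the four monitoring slides can be checked with a few lines of arithmetic. The sketch below only re-derives the itemized figures; the variable names are illustrative.

```python
BITS_PER_KB = 1024 * 8

current_replica_bits = 1 * 256 * 1024   # 1 bit x 256 K L2 blocks
nlhb_bits = 8 * 16 * 1024 * 8           # 8 bits x 16 K entries x 8 processors
vtb_bits = 16 * 1 * 1024 * 8            # 16 bits x 1 K entries x 8 processors
lru_bits = 255 * 8                      # 255-bit pseudo LRU tree x 8 processors

itemized_kb = (
    current_replica_bits + nlhb_bits + vtb_bits + lru_bits
) / BITS_PER_KB
```

Each line reproduces its slide (32 KB, 128 KB, 16 KB, and 255 B), and together they come to roughly 176 KB. The 212 KB total quoted above is larger than this sum, and the slides do not itemize the remainder.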
ASR: Triggering a Cost-Benefit Analysis
• Goal
  – Dynamically adapt to workload behavior
  – Avoid unnecessary replication level changes
• Mechanism
  – Evaluation trigger
    • Local replications or NLHB allocations exceed 1K
  – Replication change
    • Four consecutive evaluations in the same direction
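The trigger and the four-in-a-row rule can be sketched as a small controller. This is an illustrative model, not the hardware design: the class name, the `evaluate` callback (returning -1, 0, or +1), resetting both counters after each evaluation, and clamping the level to 0..5 are assumptions made for the sketch.

```python
class ReplicationController:
    EVENT_THRESHOLD = 1024   # evaluate once either counter exceeds 1K
    REQUIRED_STREAK = 4      # consecutive evaluations in the same direction
    MAX_LEVEL = 5

    def __init__(self, level=0):
        self.level = level
        self.counts = {"replication": 0, "nlhb": 0}
        self.streak_dir = 0  # -1 decrease, +1 increase, 0 no streak
        self.streak_len = 0

    def record(self, kind, evaluate):
        """Count one local replication or NLHB allocation.

        `evaluate` is a hypothetical callback that runs the cost-benefit
        analysis and returns -1, 0, or +1.
        """
        self.counts[kind] += 1
        if max(self.counts.values()) <= self.EVENT_THRESHOLD:
            return
        self.counts = dict.fromkeys(self.counts, 0)
        direction = evaluate()
        if direction != 0 and direction == self.streak_dir:
            self.streak_len += 1
        else:
            self.streak_dir = direction
            self.streak_len = 1 if direction != 0 else 0
        if self.streak_len >= self.REQUIRED_STREAK:
            # Four agreeing evaluations: move one level, then start over.
            self.level = min(self.MAX_LEVEL, max(0, self.level + direction))
            self.streak_dir, self.streak_len = 0, 0
```

Requiring a streak filters out short-lived swings: a single evaluation that points the other way resets the count, so the level only moves when the workload's demand has shifted for a sustained period.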
ASR: Adaptive Algorithm

Rows: decrease in replication benefit vs. increase in replication cost
Columns: decrease in replication cost vs. increase in replication benefit

                                      Cost decrease > benefit increase     Cost decrease < benefit increase
Benefit decrease > cost increase      Go in direction with greater value   Increase replication
Benefit decrease < cost increase      Decrease replication                 Do nothing
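The decision matrix on this slide can be written as a small function, reading the rows as "decrease in replication benefit vs. increase in replication cost" and the columns as "decrease in replication cost vs. increase in replication benefit". This is a sketch under assumptions: the slide does not say how "greater value" is measured, so comparing the two margins is a guess, as is treating ties as "do nothing"; the function and parameter names are hypothetical.

```python
def decide(benefit_dec, benefit_inc, cost_dec, cost_inc):
    """Return +1 (increase replication), -1 (decrease), or 0 (do nothing)."""
    favor_increase = benefit_dec > cost_inc   # row condition of the matrix
    favor_decrease = cost_dec > benefit_inc   # column condition of the matrix
    if favor_increase and favor_decrease:
        # "Go in direction with greater value": ASSUMED here to mean
        # comparing the two margins; the slide does not define it.
        up = benefit_dec - cost_inc
        down = cost_dec - benefit_inc
        return 1 if up > down else -1 if down > up else 0
    if favor_increase:
        return 1
    if favor_decrease:
        return -1
    return 0
```

The four inputs correspond to the four monitors described earlier: the current replica bit, the NLHBs, the VTBs, and the way and set counters.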
ASR: Adapting to Workload Behavior
[Figure: Oltp, all CPUs.]
ASR: Adapting to Workload Behavior
[Figure: Apache, all CPUs.]
ASR: Adapting to Workload Behavior
[Figure: Apache, CPU 0.]
ASR: Adapting to Workload Behavior
[Figure: Apache, CPUs 1-7.]
Replication Capacity
[Figure: replication capacity.]
Replication Capacity
[Figure: replication capacity. Configuration: 4 MB, 150-cycle memory latency, in-order processors.]
Replication Benefit, Cost, & Effectiveness Curves
[Figure: Benefit and Cost panels. Configuration: 4 MB, 150-cycle memory latency, in-order processors.]
Replication Benefit, Cost, & Effectiveness Curves
[Figure: Effectiveness panel. Configuration: 4 MB, 150-cycle memory latency, in-order processors.]
Replication Benefit, Cost, & Effectiveness Curves
[Figure: Benefit and Cost panels. Configuration: 16 MB, 500-cycle memory latency, in-order processors.]
Replication Benefit, Cost, & Effectiveness Curves
[Figure: Effectiveness panel. Configuration: 16 MB, 500-cycle memory latency, in-order processors.]
Replication Analytic Model
• Utilize workload characterization data
• Goal: intuition, not accuracy
• Optimal point of replication
  – Sensitive to cache size
  – Sensitive to memory latency
Replication Model: Selective Replication
[Figure: analytic model of selective replication.]
ASR: Memory Cycles
[Figure. Configuration: 4 MB, 150-cycle memory latency, in-order processors.]
Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR
ASR: Performance
[Figure. Configuration: 4 MB, 150-cycle memory latency, in-order processors.]
Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR
ASR: Memory Cycles
[Figure. Configuration: 16 MB, 250-cycle memory latency, out-of-order processors.]
Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR
ASR: Performance
[Figure. Configuration: 16 MB, 250-cycle memory latency, out-of-order processors.]
Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR
ASR: Memory Cycles
[Figure. Configuration: 16 MB, 500-cycle memory latency, out-of-order processors.]
Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR
ASR: Performance
[Figure. Configuration: 16 MB, 500-cycle memory latency, out-of-order processors.]
Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR
Token Coherence
• Proposed for SMPs [Martin 03], CMPs [Marty 05]
• Provides a simple correctness substrate
  – One token to read
  – All tokens to write
• Advantages
  – Permits a broadcast protocol on unordered network without acknowledgement messages
  – Supports multiple allocation policies
• Disadvantages
  – All blocks must be written back (cannot destroy tokens)
  – Token counts at memory
  – Persistent request can be a performance bottleneck
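The two correctness rules, one token to read and all tokens to write, together with the rule that tokens cannot be destroyed, can be modeled in a few lines. This is a toy sketch of the counting rules only, not of the protocol's messages or persistent requests; the class name, the `"memory"` holder, and the method names are hypothetical.

```python
class TokenBlock:
    """Token counts for one block across caches and memory."""

    def __init__(self, holders, total_tokens):
        self.total = total_tokens
        self.tokens = dict.fromkeys(holders, 0)
        self.tokens["memory"] = total_tokens  # memory starts with all tokens

    def can_read(self, holder):
        # One token is enough to read.
        return self.tokens[holder] >= 1

    def can_write(self, holder):
        # Writing requires every token for the block.
        return self.tokens[holder] == self.total

    def transfer(self, src, dst, count):
        """Move tokens between holders; tokens are never created or destroyed."""
        if not 0 < count <= self.tokens[src]:
            raise ValueError("cannot transfer tokens that src does not hold")
        self.tokens[src] -= count
        self.tokens[dst] += count
        assert sum(self.tokens.values()) == self.total
```

Because a writer must gather every token, no reader can hold one at the same time, which is why the scheme works on an unordered network. The same conservation rule explains the first disadvantage listed above: an evicted block's tokens have to go somewhere, so every block must be written back.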