ASR: Adaptive Selective Replication for CMP Caches

Brad Beckmann (currently at Microsoft), Mike Marty, and David Wood. Multifacet Project, University of Wisconsin-Madison. 12/13/06.


Page 1: ASR: Adaptive Selective Replication for CMP Caches

ASR: Adaptive Selective Replication for CMP Caches

Brad Beckmann†, Mike Marty, and David Wood

Multifacet Project, University of Wisconsin-Madison

12/13/06

† currently at Microsoft

Page 2: ASR: Adaptive Selective Replication for CMP Caches

Introduction: Shared Cache

[Figure: eight CPUs (CPU 0 to CPU 7), each with an L1 I$ and L1 D$, sharing eight L2 banks; block A is held in a single bank]

• Maximize cache capacity
• Slow access latency: 40+ cycles

Page 3: ASR: Adaptive Selective Replication for CMP Caches

Introduction: Private Caches

[Figure: eight CPUs (CPU 0 to CPU 7), each with an L1 I$, L1 D$, and a private L2; block A appears in several private L2 caches]

• Fast access latency
• Lower effective capacity
• Desire both fast access & high capacity

Page 4: ASR: Adaptive Selective Replication for CMP Caches

Beckmann, Marty, & Wood ASR: Adaptive Selective Replication for CMP Caches 4

Introduction

• Previous hybrid proposals
  – Victim Replication, CMP-NuRapid, Cooperative Caching
  – Achieve fast access and high capacity
    • Under certain workloads & system configurations
    • Utilize static rules
      – Non-adaptive

• Adaptive Selective Replication: ASR
  – Dynamically monitor workload behavior
  – Adapt the L2 cache to workload demand
  – Up to 12% improvement vs. previous proposals

Page 5: ASR: Adaptive Selective Replication for CMP Caches

Outline

• Introduction

• Understanding L2 Replication
  • Benefit
  • Cost
  • Key Observation
  • Solution

• ASR: Adaptive Selective Replication

• Evaluation

Page 6: ASR: Adaptive Selective Replication for CMP Caches

Understanding L2 Replication

• Three L2 block sharing types
  1. Single requestor
     – All requests by a single processor
  2. Shared read-only
     – Read-only requests by multiple processors
  3. Shared read-write
     – Read and write requests by multiple processors

• Profile L2 blocks during their on-chip lifetime
  – 8-processor CMP
  – 16 MB shared L2 cache
  – 64-byte block size
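The three sharing types can be illustrated with a small classifier over a request trace. This is only a sketch: the `(cpu_id, block_addr, is_write)` trace format and the function name `classify_blocks` are assumptions made for illustration, not part of the authors' profiling infrastructure.

```python
from collections import defaultdict

def classify_blocks(trace):
    """Classify each block by the requests observed during its lifetime.

    trace: iterable of (cpu_id, block_addr, is_write) tuples.
    This trace format is hypothetical, chosen only for the example.
    """
    requestors = defaultdict(set)   # block -> set of CPUs that requested it
    written = defaultdict(bool)     # block -> True if any request was a write
    for cpu, block, is_write in trace:
        requestors[block].add(cpu)
        written[block] = written[block] or is_write

    types = {}
    for block, cpus in requestors.items():
        if len(cpus) == 1:
            types[block] = "single requestor"
        elif written[block]:
            types[block] = "shared read-write"
        else:
            types[block] = "shared read-only"
    return types

trace = [
    (0, 0x100, False), (0, 0x100, True),    # one CPU only
    (1, 0x200, False), (2, 0x200, False),   # several CPUs, reads only
    (3, 0x300, False), (4, 0x300, True),    # several CPUs, with a write
]
types = classify_blocks(trace)
```

Note that a block written by its only requestor still counts as "single requestor", matching the slide's definition (all requests by a single processor).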

Page 7: ASR: Adaptive Selective Replication for CMP Caches

Understanding L2 Replication

[Figure: per-workload profile for Apache, Jbb, Oltp, and Zeus; sharing types: Shared Read-only, Shared Read-write, Single Requestor; regions labeled High Locality, Mid Locality, Low Locality]

Page 8: ASR: Adaptive Selective Replication for CMP Caches

Understanding L2 Replication: Benefit

[Figure: L2 hit cycles (y-axis) vs. replication capacity (x-axis)]

Page 9: ASR: Adaptive Selective Replication for CMP Caches

Understanding L2 Replication: Cost

[Figure: L2 miss cycles (y-axis) vs. replication capacity (x-axis)]

Page 10: ASR: Adaptive Selective Replication for CMP Caches

Understanding L2 Replication: Key Observation

[Figure: L2 hit cycles (y-axis) vs. replication capacity (x-axis)]

• Top 3% of shared read-only blocks satisfy 70% of shared read-only requests
• Replicate frequently requested blocks first
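A statistic like "the top 3% of blocks satisfy 70% of requests" can be computed from per-block request counts as sketched below. The function name and the synthetic counts are assumptions for illustration; the numbers are constructed to reproduce a 70% figure and are not the authors' data.

```python
def coverage_of_top_blocks(request_counts, top_fraction):
    """Fraction of all requests that go to the most-requested blocks.

    request_counts: one request count per block.
    top_fraction: share of blocks to treat as "top" (e.g. 0.03 for 3%).
    """
    counts = sorted(request_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    k = max(1, round(len(counts) * top_fraction))
    return sum(counts[:k]) / total

# Synthetic example: 3 hot blocks and 97 cold blocks (made-up numbers).
counts = [679] * 3 + [9] * 97
coverage = coverage_of_top_blocks(counts, 0.03)   # about 0.70
```

A skew this strong is what makes selective replication attractive: replicating a small number of hot blocks captures most of the latency benefit while consuming little capacity.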

Page 11: ASR: Adaptive Selective Replication for CMP Caches

Understanding L2 Replication: Solution

[Figure: total cycle curve; total cycles (y-axis) vs. replication capacity (x-axis), with the optimal point marked]

• Optimal point is a property of the workload-cache interaction
• Not fixed: must adapt
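The idea of an optimal replication point can be sketched as minimizing the sum of the two curves from the preceding slides: L2 hit cycles (which fall as replication grows) and L2 miss cycles (which rise). All numbers below are invented for illustration, and the function name is hypothetical.

```python
def optimal_replication_point(hit_cycles, miss_cycles):
    """Return (index of the minimum total, list of totals).

    hit_cycles[i] and miss_cycles[i] are the cycles spent on L2 hits and
    L2 misses at the i-th replication capacity setting.
    """
    total = [h + m for h, m in zip(hit_cycles, miss_cycles)]
    best = min(range(len(total)), key=total.__getitem__)
    return best, total

# Illustrative (made-up) curves for two workloads with the same benefit
# curve but different cost curves.
hit = [100, 80, 65, 55, 50, 48]
miss_a = [20, 22, 27, 40, 60, 90]   # misses grow quickly with replication
miss_b = [20, 21, 22, 24, 27, 31]   # misses grow slowly with replication

best_a, total_a = optimal_replication_point(hit, miss_a)
best_b, total_b = optimal_replication_point(hit, miss_b)
```

With these inputs the minimum falls at a different setting for each workload, which is the slide's point: no single static setting suits every workload.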

Page 12: ASR: Adaptive Selective Replication for CMP Caches

Outline

• Wires and CMP caches

• Understanding L2 Replication

• ASR: Adaptive Selective Replication
  – SPR: Selective Probabilistic Replication
  – Monitoring and adapting to workload behavior

• Evaluation

Page 13: ASR: Adaptive Selective Replication for CMP Caches

SPR: Selective Probabilistic Replication

• Mechanism for Selective Replication
  – Relax L2 inclusion property
    • L2 evictions do not force L1 evictions
    • Non-exclusive cache hierarchy
  – Ring Writebacks
    • L1 writebacks passed clockwise between private L2 caches
    • Merge with other existing L2 copies

• Probabilistically choose between
  – Local writeback: allow replication
  – Ring writeback: disallow replication

• Replicates frequently requested blocks
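The probabilistic choice between a local writeback and a ring writeback can be sketched as a single random draw per L1 writeback. This is a minimal illustration, not the authors' hardware mechanism; the function name and the use of Python's `random` module are assumptions.

```python
import random

def choose_writeback(replication_probability, rng=random):
    """Pick the writeback type for one L1 writeback.

    Returns "local" (write back into the local L2, allowing a replica)
    with the given probability, otherwise "ring" (pass the block along
    the ring, disallowing a replica).
    """
    if rng.random() < replication_probability:
        return "local"
    return "ring"

# Example: out of 10,000 writebacks at probability 1/4, roughly a
# quarter should be local.
rng = random.Random(0)
local_count = sum(choose_writeback(0.25, rng) == "local" for _ in range(10_000))
```

Because the draw is repeated on every writeback, a block that is requested (and therefore written back) often gets many chances to become a replica, which is how the policy favors frequently requested blocks.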

Page 14: ASR: Adaptive Selective Replication for CMP Caches

SPR: Selective Probabilistic Replication

[Figure: eight CPUs (CPU 0 to CPU 7), each with an L1 I$, L1 D$, and a private L2]

Page 15: ASR: Adaptive Selective Replication for CMP Caches

SPR: Selective Probabilistic Replication

[Figure: replication capacity (y-axis) vs. replication level 0 to 5 (x-axis), with the current level marked]

Replication Level       0    1     2     3    4    5
Prob. of Replication    0    1/64  1/16  1/4  1/2  1
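To see why these probabilities favor frequently requested blocks, one can compute the chance that a block is replicated at least once after a number of writebacks. The sketch below assumes each writeback is an independent draw, which is a simplifying assumption made here, not a claim from the slides; the names are hypothetical.

```python
from fractions import Fraction

# Probability of replication per level, taken from the table above.
REPLICATION_PROBABILITY = [
    Fraction(0), Fraction(1, 64), Fraction(1, 16),
    Fraction(1, 4), Fraction(1, 2), Fraction(1),
]

def prob_replicated(level, writebacks):
    """Probability of at least one local writeback in `writebacks` tries,
    assuming independent draws (an assumption of this sketch)."""
    p = REPLICATION_PROBABILITY[level]
    return 1 - (1 - p) ** writebacks

once = prob_replicated(3, 1)      # Fraction(1, 4)
often = prob_replicated(3, 8)     # 1 - (3/4)**8, roughly 0.90
```

At level 3, a block written back once has a 25% chance of being replicated, while a block written back eight times has about a 90% chance under this model.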

Page 16: ASR: Adaptive Selective Replication for CMP Caches

Monitoring and Adapting to Workload Behavior

1. Decrease in Replication Benefit
   – Bit marks replicas of the current, but not lower, level
2. Increase in Replication Benefit
   – Store 8-bit partial tags of next higher level replications

[Figure: replication benefit curve; L2 hit cycles (y-axis) vs. replication capacity (x-axis), with the lower, current, and higher levels marked]

Page 17: ASR: Adaptive Selective Replication for CMP Caches

Monitoring and Adapting to Workload Behavior

3. Decrease in Replication Cost
   – Stores 16-bit partial tags of recently evicted blocks
4. Increase in Replication Cost
   – Way and Set counters track soon-to-be-evicted blocks

[Figure: replication cost curve; L2 miss cycles (y-axis) vs. replication capacity (x-axis), with the lower, current, and higher levels marked]

Page 18: ASR: Adaptive Selective Replication for CMP Caches

Outline

• Wires and CMP caches

• Understanding L2 Replication

• ASR: Adaptive Selective Replication

• Evaluation

Page 19: ASR: Adaptive Selective Replication for CMP Caches

Methodology

• Full system simulation
  – Simics
  – Wisconsin’s GEMS Timing Simulator
    • Out-of-order processor
    • Memory system

• Workloads
  – Commercial
    • apache, jbb, oltp, zeus
  – Scientific (see paper)
    • SpecOMP: apsi & art
    • Splash: barnes & ocean

Page 20: ASR: Adaptive Selective Replication for CMP Caches

System Parameters

[ 8-core CMP, 45 nm technology ]

Memory System
  L1 I & D caches                      64 KB, 4-way, 3 cycles
  Unified L2 cache                     16 MB, 16-way
  L1 / L2 prefetching                  Unit & non-unit strided prefetcher (similar to Power4)
  Memory latency                       500 cycles
  Memory bandwidth                     50 GB/s
  Memory size                          4 GB of DRAM
  Outstanding memory requests / CPU    16

Dynamically Scheduled Processor
  Clock frequency                      5.0 GHz
  Reorder buffer / scheduler           128 / 64 entries
  Pipeline width                       4-wide fetch & issue
  Pipeline stages                      30
  Direct branch predictor              3.5 KB YAGS
  Return address stack                 64 entries
  Indirect branch predictor            256 entries (cascaded)

Page 21: ASR: Adaptive Selective Replication for CMP Caches

Replication Benefit, Cost, & Effectiveness Curves

[Figure: Benefit and Cost curves]

Page 22: ASR: Adaptive Selective Replication for CMP Caches

Replication Benefit, Cost, & Effectiveness Curves

[Figure: Effectiveness curves]

Page 23: ASR: Adaptive Selective Replication for CMP Caches

Comparison of Replication Policies

• SPR allows multiple possible policies
• Evaluated 4 shared read-only replication policies
  1. VR: Victim Replication
     – Previously proposed [Zhang ISCA 05]
     – Disallow replicas to evict shared owner blocks
  2. NR: CMP-NuRapid
     – Previously proposed [Chishti ISCA 05]
     – Replicate upon the second request
  3. CC: Cooperative Caching
     – Previously proposed [Chang ISCA 06]
     – Replace replicas first
     – Spill singlets to remote caches
     – Tunable parameter: 100%, 70%, 30%, 0%
  4. ASR: Adaptive Selective Replication
     – Our proposal
     – Monitor and adjust to workload demand

• The previously proposed policies (VR, NR, CC) lack dynamic adaptation

Page 24: ASR: Adaptive Selective Replication for CMP Caches

ASR: Performance

[Figure: performance results. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR]

Page 25: ASR: Adaptive Selective Replication for CMP Caches

Conclusions

• CMP Cache Replication
  – No replication conserves capacity
  – Replicating everything reduces on-chip latency
  – Previous hybrid proposals
    • Work well for certain criteria
    • Non-adaptive

• Adaptive Selective Replication
  – Probabilistic policy favors frequently requested blocks
  – Dynamically monitor replication benefit & cost
  – Replicate when benefit > cost
  – Improves performance up to 12% vs. previous schemes

Page 26: ASR: Adaptive Selective Replication for CMP Caches

Backup Slides

Page 27: ASR: Adaptive Selective Replication for CMP Caches

ASR: Memory Cycles

[Figure: memory cycles. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR]

Page 28: ASR: Adaptive Selective Replication for CMP Caches

L2 Cache Requests Breakdown

Page 29: ASR: Adaptive Selective Replication for CMP Caches

L2 Cache Requests Breakdown: User & OS

Page 30: ASR: Adaptive Selective Replication for CMP Caches

Shared Read-write Requests Breakdown

Page 31: ASR: Adaptive Selective Replication for CMP Caches

Shared Read-write Block Breakdown

Page 32: ASR: Adaptive Selective Replication for CMP Caches

ASR: Decrease-in-replication Benefit

[Figure: L2 hit cycles (y-axis) vs. replication capacity (x-axis), with the lower and current levels marked]

Page 33: ASR: Adaptive Selective Replication for CMP Caches

ASR: Decrease-in-replication Benefit

• Goal
  – Determine replication benefit decrease of the next lower level

• Mechanism
  – Current Replica Bit
    • Per L2 cache block
    • Set for replications of the current level
    • Not set for replications of lower level
  – Current replica hits would be remote hits with next lower level

• Overhead
  – 1 bit x 256 K L2 blocks = 32 KB

Page 34: ASR: Adaptive Selective Replication for CMP Caches

ASR: Increase-in-replication Benefit

[Figure: L2 hit cycles (y-axis) vs. replication capacity (x-axis), with the current and higher levels marked]

Page 35: ASR: Adaptive Selective Replication for CMP Caches

ASR: Increase-in-replication Benefit

• Goal
  – Determine replication benefit increase of the next higher level

• Mechanism
  – Next Level Hit Buffers (NLHBs)
    • 8-bit partial tag buffer
    • Store replicas of the next higher level
  – NLHB hits would be local L2 hits with next higher level

• Overhead
  – 8 bits x 16 K entries x 8 processors = 128 KB

Page 36: ASR: Adaptive Selective Replication for CMP Caches

ASR: Decrease-in-replication Cost

[Figure: L2 miss cycles (y-axis) vs. replication capacity (x-axis), with the lower and current levels marked]

Page 37: ASR: Adaptive Selective Replication for CMP Caches

ASR: Decrease-in-replication Cost

• Goal
  – Determine replication cost decrease of the next lower level

• Mechanism
  – Victim Tag Buffers (VTBs)
    • 16-bit partial tags
    • Store recently evicted blocks of current replication level
  – VTB hits would be on-chip hits with next lower level

• Overhead
  – 16 bits x 1 K entries x 8 processors = 16 KB
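A partial-tag buffer of recently evicted blocks can be sketched as a small bounded structure keyed by the low-order tag bits. The FIFO replacement, the use of the low 16 address bits as the partial tag, and the class and method names are all assumptions of this sketch; the slides specify only the tag width and the entry count.

```python
from collections import OrderedDict

class VictimTagBuffer:
    """Bounded buffer of partial tags for recently evicted blocks.

    Replacement order (FIFO) and tag extraction (low-order bits) are
    assumptions made for illustration.
    """

    def __init__(self, entries=1024, tag_bits=16):
        self.entries = entries
        self.mask = (1 << tag_bits) - 1
        self.tags = OrderedDict()          # partial tag -> True, oldest first

    def _partial_tag(self, block_addr):
        return block_addr & self.mask

    def record_eviction(self, block_addr):
        tag = self._partial_tag(block_addr)
        self.tags.pop(tag, None)           # re-inserting refreshes the entry
        self.tags[tag] = True
        if len(self.tags) > self.entries:
            self.tags.popitem(last=False)  # drop the oldest entry

    def would_have_hit(self, block_addr):
        """True if a miss to this block matches a recently evicted tag."""
        return self._partial_tag(block_addr) in self.tags

vtb = VictimTagBuffer(entries=2)
vtb.record_eviction(0x0001)
vtb.record_eviction(0x0002)
hit_before = vtb.would_have_hit(0x0001)    # still buffered
vtb.record_eviction(0x0003)                # pushes out 0x0001
hit_after = vtb.would_have_hit(0x0001)
aliased = vtb.would_have_hit(0x10002)      # same low 16 bits as 0x0002
```

Partial tags keep the buffer small at the price of occasional aliasing: two different blocks with the same low-order bits are indistinguishable, so a counted hit is an estimate rather than an exact measurement.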

Page 38: ASR: Adaptive Selective Replication for CMP Caches

ASR: Increase-in-replication Cost

[Figure: L2 miss cycles (y-axis) vs. replication capacity (x-axis), with the current and higher levels marked]

Page 39: ASR: Adaptive Selective Replication for CMP Caches

ASR: Increase-in-replication Cost

• Goal
  – Determine replication cost increase of the next higher level

• Mechanism
  – Way and Set counters [Suh et al. HPCA 2002]
    • Identify soon-to-be-evicted blocks
    • 16-way pseudo LRU
    • 256 set groups
  – On-chip hits that would be off-chip with next higher level

• Overhead
  – 255-bit pseudo LRU tree x 8 processors = 255 B

Overall storage overhead: 212 KB or 1.2% of total storage
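The per-mechanism overheads quoted on the backup slides can be checked with a few lines of arithmetic. The variable names are illustrative; the inputs are the figures stated on the slides (16 MB L2, 64-byte blocks, 8 processors).

```python
BITS_PER_KB = 8 * 1024

l2_blocks = 16 * 1024 * 1024 // 64            # 16 MB / 64 B = 256 K blocks

overhead_bits = {
    "current replica bit": 1 * l2_blocks,     # 1 bit per L2 block
    "NLHBs": 8 * 16 * 1024 * 8,               # 8 bits x 16 K entries x 8 CPUs
    "VTBs": 16 * 1 * 1024 * 8,                # 16 bits x 1 K entries x 8 CPUs
    "pseudo-LRU trees": 255 * 8,              # 255 bits x 8 CPUs
}
overhead_kb = {name: bits / BITS_PER_KB for name, bits in overhead_bits.items()}
itemized_total_kb = sum(overhead_kb.values())
```

Each line reproduces the slide's figure (32 KB, 128 KB, 16 KB, and 255 B). The four itemized structures sum to roughly 176 KB, which is less than the 212 KB overall figure; the slides do not itemize the remainder.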

Page 40: ASR: Adaptive Selective Replication for CMP Caches

ASR: Triggering a Cost-Benefit Analysis

• Goal
  – Dynamically adapt to workload behavior
  – Avoid unnecessary replication level changes

• Mechanism
  – Evaluation trigger
    • Local replications or NLHB allocations exceed 1K
  – Replication change
    • Four consecutive evaluations in the same direction
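The trigger and the four-in-a-row rule can be sketched as a small controller. Several details here are assumptions made for illustration and are not stated on the slide: a single combined event counter, resetting the streak after a level change or a "no change" evaluation, and the class and method names.

```python
EVALUATION_THRESHOLD = 1024   # evaluate once events exceed 1K
REQUIRED_STREAK = 4           # consecutive evaluations in the same direction

class ReplicationLevelController:
    def __init__(self, level=0, max_level=5):
        self.level = level
        self.max_level = max_level
        self.events = 0
        self.streak_direction = 0
        self.streak_length = 0

    def note_event(self):
        """Count one local replication or NLHB allocation.

        Returns True when the count exceeds the threshold, meaning a
        cost-benefit evaluation should run now. (A single combined
        counter is an assumption of this sketch.)
        """
        self.events += 1
        if self.events > EVALUATION_THRESHOLD:
            self.events = 0
            return True
        return False

    def evaluate(self, direction):
        """Record one evaluation outcome: +1 (more replication),
        -1 (less replication), or 0 (no change). Returns the level."""
        if direction == 0:
            self.streak_direction, self.streak_length = 0, 0
            return self.level
        if direction == self.streak_direction:
            self.streak_length += 1
        else:
            self.streak_direction, self.streak_length = direction, 1
        if self.streak_length == REQUIRED_STREAK:
            self.level = min(self.max_level, max(0, self.level + direction))
            self.streak_direction, self.streak_length = 0, 0
        return self.level

ctrl = ReplicationLevelController(level=2)
levels = [ctrl.evaluate(+1) for _ in range(4)]   # level rises on the 4th call
```

Requiring a streak acts as hysteresis: a single noisy evaluation cannot move the replication level, which serves the stated goal of avoiding unnecessary level changes.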

Page 41: ASR: Adaptive Selective Replication for CMP Caches

ASR: Adaptive Algorithm

[Decision table, read row by row]

• Decrease in Replication Benefit > Increase in Replication Cost
  – Decrease in Replication Cost > Increase in Replication Benefit: Go in direction with greater value
  – Decrease in Replication Cost < Increase in Replication Benefit: Increase Replication

• Decrease in Replication Benefit < Increase in Replication Cost
  – Decrease in Replication Cost > Increase in Replication Benefit: Decrease Replication
  – Decrease in Replication Cost < Increase in Replication Benefit: Do Nothing

Page 42: ASR: Adaptive Selective Replication for CMP Caches

ASR: Adapting to Workload Behavior

[Figure: Oltp, all CPUs]

Page 43: ASR: Adaptive Selective Replication for CMP Caches

ASR: Adapting to Workload Behavior

[Figure: Apache, all CPUs]

Page 44: ASR: Adaptive Selective Replication for CMP Caches

ASR: Adapting to Workload Behavior

[Figure: Apache, CPU 0]

Page 45: ASR: Adaptive Selective Replication for CMP Caches

ASR: Adapting to Workload Behavior

[Figure: Apache, CPUs 1-7]

Page 46: ASR: Adaptive Selective Replication for CMP Caches

Replication Capacity

Page 47: ASR: Adaptive Selective Replication for CMP Caches

Replication Capacity

[Configuration: 4 MB, 150 memory latency, in-order processors]

Page 48: ASR: Adaptive Selective Replication for CMP Caches

Replication Benefit, Cost, & Effectiveness Curves

[Figure: Benefit and Cost curves. Configuration: 4 MB, 150 memory latency, in-order processors]

Page 49: ASR: Adaptive Selective Replication for CMP Caches

Replication Benefit, Cost, & Effectiveness Curves

[Figure: Effectiveness curves. Configuration: 4 MB, 150 memory latency, in-order processors]

Page 50: ASR: Adaptive Selective Replication for CMP Caches

Replication Benefit, Cost, & Effectiveness Curves

[Figure: Benefit and Cost curves. Configuration: 16 MB, 500 memory latency, in-order processors]

Page 51: ASR: Adaptive Selective Replication for CMP Caches

Replication Benefit, Cost, & Effectiveness Curves

[Figure: Effectiveness curves. Configuration: 16 MB, 500 memory latency, in-order processors]

Page 52: ASR: Adaptive Selective Replication for CMP Caches

Replication Analytic Model

• Utilize workload characterization data

• Goal: intuition, not accuracy

• Optimal point of replication
  – Sensitive to cache size
  – Sensitive to memory latency

Page 53: ASR: Adaptive Selective Replication for CMP Caches

Replication Model: Selective Replication

Page 54: ASR: Adaptive Selective Replication for CMP Caches

ASR: Memory Cycles

[Figure: memory cycles. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR. Configuration: 4 MB, 150 memory latency, in-order processors]

Page 55: ASR: Adaptive Selective Replication for CMP Caches

ASR: Performance

[Figure: performance results. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR. Configuration: 4 MB, 150 memory latency, in-order processors]

Page 56: ASR: Adaptive Selective Replication for CMP Caches

ASR: Memory Cycles

[Figure: memory cycles. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR. Configuration: 16 MB, 250 memory latency, out-of-order processors]

Page 57: ASR: Adaptive Selective Replication for CMP Caches

ASR: Performance

[Figure: performance results. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR. Configuration: 16 MB, 250 memory latency, out-of-order processors]

Page 58: ASR: Adaptive Selective Replication for CMP Caches

ASR: Memory Cycles

[Figure: memory cycles. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR. Configuration: 16 MB, 500 memory latency, out-of-order processors]

Page 59: ASR: Adaptive Selective Replication for CMP Caches

ASR: Performance

[Figure: performance results. Legend: S: CMP-Shared, P: CMP-Private, V: SPR-VR, N: SPR-NR, C: SPR-CC, A: SPR-ASR. Configuration: 16 MB, 500 memory latency, out-of-order processors]

Page 60: ASR: Adaptive Selective Replication for CMP Caches

Token Coherence

• Proposed for SMPs [Martin 03], CMPs [Marty 05]
• Provides a simple correctness substrate
  – One token to read
  – All tokens to write

• Advantages
  – Permits a broadcast protocol on unordered network without acknowledgement messages
  – Supports multiple allocation policies

• Disadvantages
  – All blocks must be written back (cannot destroy tokens)
  – Token counts at memory
  – Persistent request can be a performance bottleneck
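The "one token to read, all tokens to write" rule can be sketched for a single block as below. This illustrates only the token-counting rule, not the full protocol (no messages, persistent requests, or data movement); the class and function names are hypothetical.

```python
class TokenHolder:
    """One participant (a cache or memory) holding tokens for one block."""

    def __init__(self, total_tokens, held=0):
        self.total_tokens = total_tokens
        self.held = held

    def can_read(self):
        return self.held >= 1                   # one token to read

    def can_write(self):
        return self.held == self.total_tokens   # all tokens to write

def transfer(src, dst, count):
    """Move tokens between holders; tokens are never created or destroyed."""
    if count < 0 or count > src.held:
        raise ValueError("cannot transfer more tokens than are held")
    src.held -= count
    dst.held += count

TOTAL = 8
memory = TokenHolder(TOTAL, held=TOTAL)   # memory starts with every token
cache = TokenHolder(TOTAL)

transfer(memory, cache, 1)
reader_ok = cache.can_read() and not cache.can_write()
transfer(memory, cache, 7)
writer_ok = cache.can_write() and not memory.can_read()
```

Since tokens are conserved, a holder with all of them knows no other copy can be read, which is what makes the rule safe. It also explains the listed disadvantage: an evicted block's tokens must be written back rather than silently dropped.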