
Memory Ordering: A Value-based Approach


Page 1: Memory Ordering: A Value-based Approach

Trey Cain and Mikko Lipasti

University of Wisconsin-Madison

Memory Ordering: A Value-based Approach

Page 2: Memory Ordering: A Value-based Approach


Value-based replay

High ILP => large instruction windows
  Larger physical register file
  Larger scheduler
  Larger load/store queues
  Result in increased access latency
Value-based replay
  If load queue scalability is a problem…who needs one!
  Instead, re-execute load instructions a 2nd time in program order
  Filter replays: heuristics reduce extra cache bandwidth to 3.5% on average

Page 3: Memory Ordering: A Value-based Approach


Outline

Conventional load queue functionality/microarchitecture
Value-based memory ordering
Replay-reduction heuristics
Performance evaluation

Page 4: Memory Ordering: A Value-based Approach


Enforcing RAW dependences

Program order (exe order in parens):
  1. (1) store A
  2. (3) store ?
  3. (2) load A

Load queue contains load addresses
One search per store address calculation
If match, the load is squashed
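As a software analogy of that search (an illustrative data structure and names, not the actual hardware):

```python
# Sketch of the conventional RAW check: when a store's address resolves,
# search the load queue for younger loads to the same address that have
# already executed, and squash them.

from dataclasses import dataclass

@dataclass
class LoadEntry:
    age: int          # program-order age (smaller = older)
    address: int      # resolved load address
    executed: bool    # has the load already obtained its value?

class LoadQueue:
    def __init__(self):
        self.entries: list[LoadEntry] = []

    def on_store_address(self, store_age: int, store_addr: int) -> list[LoadEntry]:
        """One associative search per store address calculation."""
        # Younger loads (larger age) to the same address that already executed
        # read stale data and must be squashed.
        return [e for e in self.entries
                if e.executed and e.address == store_addr and e.age > store_age]
```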

Page 5: Memory Ordering: A Value-based Approach


Enforcing memory consistency

Example (program order, exe order in parens):
  Processor p1: 1. (3) load A   2. (1) load A
  Processor p2: 1. (2) store A
  (RAW edge from p2's store to the load that executed after it; WAR edge from the load that executed before it to p2's store)

Two approaches
  Snooping: search per incoming invalidate
  Insulated: search per load address calculation
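A conservative software analogy of the snooping variant, reusing the entry fields from the sketch above (real designs can filter more precisely; this only shows where the associative search happens):

```python
def loads_to_squash_on_invalidate(load_queue_entries, inv_addr: int, line_bytes: int = 64):
    """Conservative sketch: one associative search per incoming invalidate."""
    line = inv_addr // line_bytes
    # Any not-yet-committed load to the invalidated line that has already
    # read its value may now be inconsistently ordered, so squash it.
    return [e for e in load_queue_entries
            if e.executed and e.address // line_bytes == line]
```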

Page 6: Memory Ordering: A Value-based Approach


Load queue implementation

[Block diagram: an address CAM paired with a load meta-data RAM. Inputs: load address, store address, external (snoop) address, load age, store age, external request. Outputs: squash determination, queue management.]

# of write ports = load address calc width
# of read ports = load + store address calc width (+ 1)
Current generation designs (32-48 entries, 2 write ports, 2 (3) read ports)

Page 7: Memory Ordering: A Value-based Approach


Load queue scaling

Larger instruction window => larger load queue
  Increases access latency
  Increases energy consumption
Wider issue width => more read/write ports
  Also increases latency and energy

Page 8: Memory Ordering: A Value-based Approach


Related work: MICRO 2003

Park et al., Purdue
  Extra structure dedicated to enforcing memory consistency
  Increase capacity through segmentation
Sethumadhavan et al., UT-Austin
  Add set of filters summarizing contents of load queue

Page 9: Memory Ordering: A Value-based Approach


Keep it simple…

Throw more hardware at the problem?
  Need to design/implement/verify
  Execution core is already complicated
Load queue checks for rare errors
  Why not move error checking away from exe?

Page 10: Memory Ordering: A Value-based Approach


Value-based ordering

Replay: access the cache a second time, cheaply!
  Almost always a cache hit
  Reuse address calculation and translation
  Share cache port used by stores in commit stage
Compare: compare the new value to the original value
  Squash if the values differ
DIVA à la carte [Austin, Micro 99]

[Pipeline diagram: IF1 IF2 D R Q S EX C ... WB, with added REP (replay) and CMP (compare) stages near commit]

Page 11: Memory Ordering: A Value-based Approach


Rules of replay

1. All prior stores must have written data to the cache
   No store-to-load forwarding
2. Loads must replay in program order
   If a cache miss occurs, all subsequent loads must be replayed
3. If a load is squashed, it should not be replayed a second time
   Ensures forward progress
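Putting the replay/compare mechanism and these rules together, a minimal commit-stage sketch (the load and dcache interfaces are illustrative assumptions, not the paper's hardware):

```python
def commit_load(load, dcache) -> str:
    """Value-based replay + compare for the load at the head of the ROB."""
    # Rules 1 and 2 are captured by where this runs: at commit, after all
    # prior stores have written the cache, and strictly in program order.
    if load.already_squashed_once:
        return "commit"                       # rule 3: guarantees forward progress
    replay_value = dcache.read(load.address)  # reuses the original address/translation
    if replay_value == load.original_value:
        return "commit"                       # values match, ordering was fine
    load.already_squashed_once = True
    return "squash"                           # refetch starting at this load
```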

Page 12: Memory Ordering: A Value-based Approach


Replay reduction

Replay costs
  Consumes cache bandwidth (and power)
  Increases reorder buffer occupancy
Can we avoid these penalties?
  Infer correctness of certain operations
Four replay filters

Page 13: Memory Ordering: A Value-based Approach


No-Reorder filter

Avoid replay if load isn't reordered wrt other memory operations
Can we do better?

Page 14: Memory Ordering: A Value-based Approach


Enforcing single-thread RAW dependences

No-Unresolved Store Address filter
  Load instruction i is replayed if there are prior stores with unresolved addresses when i issues
  Works for intra-processor RAW dependences
  Doesn't enforce memory consistency

Page 15: Memory Ordering: A Value-based Approach


Enforcing MP consistency

No-Recent-Miss filter
  Avoid replay if there have been no cache line fills (to any address) while load in instruction window
No-Recent-Snoop filter
  Avoid replay if there have been no external invalidates (to any address) while load in instruction window

Page 16: Memory Ordering: A Value-based Approach


Constraint graph

Defined for sequential consistency by Landin et al., ISCA-18
Directed graph represents a multithreaded execution
  Nodes represent dynamic instruction instances
  Edges represent their transitive orders (program order, RAW, WAW, WAR)
If the constraint graph is acyclic, then the execution is correct
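A tiny sketch of that acyclicity check (purely illustrative; the dictionary-of-successors encoding is an assumption, not something from the talk):

```python
# The execution is correct iff the directed graph of program-order /
# RAW / WAW / WAR edges has no cycle. Iterative DFS with three colors;
# graph is {node: [successor, ...]}.

def is_correct_execution(graph: dict) -> bool:
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    for start in graph:
        if color[start] != WHITE:
            continue
        stack = [(start, iter(graph[start]))]
        color[start] = GRAY
        while stack:
            node, it = stack[-1]
            for succ in it:
                if color.get(succ, WHITE) == GRAY:
                    return False          # back edge => cycle => incorrect
                if color.get(succ, WHITE) == WHITE:
                    color[succ] = GRAY
                    stack.append((succ, iter(graph.get(succ, []))))
                    break
            else:
                color[node] = BLACK
                stack.pop()
    return True
```

Feeding the four memory operations and edges of the example on the next slide into such a check would flag the cycle.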

Page 17: Memory Ordering: A Value-based Approach


Constraint graph example - SC

[Figure: two-processor example. Proc 1: ST A and LD B; Proc 2: LD A and ST B. Program-order edges on each processor plus a WAR edge and a RAW edge between the processors form a cycle. A cycle indicates that the execution is incorrect.]

Page 18: Memory Ordering: A Value-based Approach


Anatomy of a cycle

[Figure: the same two-processor example (ST A / LD B on Proc 1, LD A / ST B on Proc 2, with program order, WAR, and RAW edges), annotated with the events a processor can observe: an incoming invalidate and a cache miss.]

Page 19: Memory Ordering: A Value-based Approach


Enforcing MP consistency

No-Recent-Miss filter
  Avoid replay if there have been no cache line fills (to any address) while load in instruction window
No-Recent-Snoop filter
  Avoid replay if there have been no external invalidates (to any address) while load in instruction window

Page 20: Memory Ordering: A Value-based Approach


Filter Summary

Four configurations, ranging from conservative to aggressive:
  Replay all committed loads
  No-Reorder filter
  No-Unresolved Store / No-Recent-Snoop filter
  No-Unresolved Store / No-Recent-Miss filter
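A rough sketch of how these heuristics could gate the commit-time replay decision (a combined illustration with assumed per-load bookkeeping flags, not the exact configurations evaluated in the paper):

```python
# Illustrative replay-filter logic: a load can skip the commit-time replay
# only if both its single-thread RAW check and the multiprocessor
# consistency check can be inferred correct. Flag names are assumptions.

def needs_replay(load, use_recent_snoop_filter: bool = True) -> bool:
    # No-Reorder: never reordered w.r.t. other memory ops => replay unnecessary.
    if not load.was_reordered:
        return False
    # No-Unresolved-Store-Address: replay if any prior store address was
    # still unknown when the load issued (covers intra-thread RAW).
    raw_risk = load.issued_with_unresolved_prior_store
    # Consistency side: a violation needs an external invalidate (snoop) or
    # a cache line fill while the load sat in the instruction window.
    if use_recent_snoop_filter:
        consistency_risk = load.saw_external_invalidate_in_window
    else:
        consistency_risk = load.saw_cache_fill_in_window
    return raw_risk or consistency_risk
```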

Page 21: Memory Ordering: A Value-based Approach


Outline

Conventional load queue functionality/microarchitecture
Value-based memory ordering
Replay-reduction heuristics
Performance evaluation

Page 22: Memory Ordering: A Value-based Approach


Base machine model: PHARMsim
  Based on SimpleMP, including a Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator

Out-of-order execution core: 5 GHz, 15-stage, 8-wide pipeline; 256-entry reorder buffer, 128-entry load/store queue; 32-entry issue queue

Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4/4); 4 L1 dcache load ports in OoO window; 1 L1 dcache load/store port at commit

Front-end: combined bimodal (16k entry) / gshare (16k entry) branch predictor with 16k entry selection table, 64 entry RAS, 8k entry 4-way BTB

Memory system (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1); 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte cache lines; memory (400 cycle / 100 ns best-case latency, 10 GB/s BW); stride-based prefetcher modeled after Power4

Page 23: Memory Ordering: A Value-based Approach


% L1 dcache bandwidth increase

[Chart: % L1 dcache bandwidth increase for (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter, across SPECint2000, SPECfp2000, commercial, and multiprocessor workloads]

On average, 3.4% bandwidth overhead using the no-recent-snoop filter

Page 24: Memory Ordering: A Value-based Approach


Value-based replay performance (relative to constrained load queue)

[Chart: speedups across SPECint2000, SPECfp2000, commercial, and multiprocessor workloads]

Value-based replay is 8% faster on average than the baseline using a 16-entry load queue

Page 25: Memory Ordering: A Value-based Approach


Value-based replay Pros/Cons

+ Eliminates associative lookup hardware
  Load queue becomes a simple FIFO
  Negligible IPC or L1D bandwidth impact
+ Can be used to fix value prediction
  Enforces dependence order consistency constraint [Martin et al., Micro 2001]
- Requires additional pipeline stages
- Requires additional cache datapath for loads

Page 26: Memory Ordering: A Value-based Approach


The End

Questions?

Page 27: Memory Ordering: A Value-based Approach


Backups

Page 28: Memory Ordering: A Value-based Approach


Does value locality help?

Not much…
Value locality does avoid memory ordering violations
  59% of single-thread violations avoided
  95% of consistency violations avoided
But these violations rarely occur
  ~1 single-thread violation per 100 million instructions
  4 consistency violations per 10,000 instructions

Page 29: Memory Ordering: A Value-based Approach


What About Power?

Simple power model:
  Energy = #replays × (E_per cache access + E_per word comparison) + replay overhead − (E_per ldq search × #ldq searches)

Empirically: 0.02 replayed loads per committed instruction
If load queue CAM energy per instruction > 0.02 × the energy expenditure of a cache access and comparison: the value-based implementation saves power!
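A back-of-the-envelope version of that break-even test, with placeholder energy numbers (only the 0.02 replays-per-instruction figure and the structure of the comparison come from the slide):

```python
# Break-even check for the simple power model above. All energy values are
# hypothetical placeholders in nJ; only the comparison's structure is
# taken from the slide (0.02 replayed loads per committed instruction).

REPLAYS_PER_INSN = 0.02

def value_based_saves_power(e_ldq_search_per_insn: float,
                            e_cache_access: float,
                            e_word_compare: float) -> bool:
    replay_energy_per_insn = REPLAYS_PER_INSN * (e_cache_access + e_word_compare)
    return e_ldq_search_per_insn > replay_energy_per_insn

# Example with made-up numbers: 0.4 nJ/insn of CAM search energy vs.
# 1.0 nJ cache access + 0.05 nJ compare per replayed load.
print(value_based_saves_power(0.4, 1.0, 0.05))   # True: 0.4 > 0.02 * 1.05
```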

Page 30: Memory Ordering: A Value-based Approach


Caveat: Memory Dependence Prediction

Some predictors train using the conflicting store (e.g. the store-set predictor)
Replay mechanism is unable to pinpoint the conflicting store
Fair comparison:
  Baseline machine: store-set predictor w/ 4k entry SSIT and 128 entry LFST
  Experimental machine: simple 21264-style dependence predictor w/ 4k entry history table
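For reference, a 21264-style predictor is little more than a PC-indexed table of "wait" bits; a minimal sketch, with the table size taken from the slide and everything else an illustrative assumption:

```python
# Minimal sketch of a 21264-style memory dependence predictor: a table of
# "wait" bits indexed by load PC. A load that caused an ordering squash is
# marked; marked loads wait for all prior store addresses before issuing.

TABLE_SIZE = 4096          # 4k-entry history table, as in the slide

class DependencePredictor:
    def __init__(self):
        self.wait_bit = [False] * TABLE_SIZE

    def _index(self, load_pc: int) -> int:
        return (load_pc >> 2) % TABLE_SIZE

    def should_wait(self, load_pc: int) -> bool:
        return self.wait_bit[self._index(load_pc)]

    def train_on_squash(self, load_pc: int) -> None:
        # The replay mechanism only knows which *load* was wrong, not which
        # store conflicted, so PC-indexed training like this still works.
        self.wait_bit[self._index(load_pc)] = True

    def periodic_clear(self) -> None:
        # 21264-style predictors periodically clear the table so stale
        # wait bits don't throttle loads forever.
        self.wait_bit = [False] * TABLE_SIZE
```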

Page 31: Memory Ordering: A Value-based Approach


Load queue search energy

[Chart: load queue access energy (nJ), from 0 to about 3.5 nJ, vs. number of entries (16, 32, 64, 128, 256, 512) for rd2wr2, rd4wr4, and rd6wr6 port configurations]

Based on 0.09 micron process technology using Cacti v. 3.2

Page 32: Memory Ordering: A Value-based Approach


Load queue search latency

[Chart: load queue access latency (ns), from 0 to about 1.4 ns, vs. number of entries (16, 32, 64, 128, 256, 512) for rd2wr2, rd4wr4, and rd6wr6 port configurations]

Based on 0.09 micron process technology using Cacti v. 3.2

Page 33: Memory Ordering: A Value-based Approach


Benchmarks

MP (16-way)
  Commercial workloads (SPECweb, TPC-H)
  SPLASH2 scientific application (ocean)
  Error bars signify 95% statistical confidence
UP
  3 from SPECfp2000, selected due to high reorder buffer utilization: apsi, art, wupwise
  3 commercial: SPECjbb2000, TPC-B, TPC-H
  A few from SPECint2000

Page 34: Memory Ordering: A Value-based Approach


Life cycle of a load

[Animation: loads and stores with unresolved addresses (LD ?, ST ?) sit in the OoO execution window; load addresses are entered into the load queue as they resolve; a store whose address resolves to A matches an already-executed LD A in the load queue ("Blam!"), squashing it]

Page 35: Memory Ordering: A Value-based Approach


Performance relative to unconstrained load queue

Good news: Replay w/ no-recent-snoop filter only 1% slower on average

Page 36: Memory Ordering: A Value-based Approach


Reorder-Buffer Utilization

Page 37: Memory Ordering: A Value-based Approach


Why focus on load queue?

Load queue has different constraints than the store queue
  More loads than stores (30% vs 14% of dynamic instructions)
  Load queue searched more frequently (consuming more power)
  Store-forwarding logic is performance critical
Many non-scalable structures in OoO processor
  Scheduler
  Physical register file
  Register map

Page 38: Memory Ordering: A Value-based Approach


Prior work: formal memory model representations

Local, WRT, global "performance" of memory ops (Dubois et al., ISCA-13)
Acyclic graph representation (Landin et al., ISCA-18)
Modeling memory operations as a series of sub-operations (Collier, RAPA)
Acyclic graph + sub-operations (Adve, thesis)
Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)