
Memory Ordering: A Value-based Approach


Page 1: Memory Ordering: A Value-based Approach

Trey Cain and Mikko Lipasti

University of Wisconsin-Madison

Memory Ordering: A Value-based Approach

Page 2: Memory Ordering: A Value-based Approach


Value-based replay

High ILP => large instruction windows
  Larger physical register file
  Larger scheduler
  Larger load/store queues
  Result in increased access latency
Value-based replay
  If load queue scalability is a problem…who needs one!
  Instead, re-execute load instructions a 2nd time in program order
  Filter replays: heuristics reduce extra cache bandwidth to 3.5% on average

Page 3: Memory Ordering: A Value-based Approach


Outline

Conventional load queue functionality/microarchitecture
Value-based memory ordering
Replay-reduction heuristics
Performance evaluation

Page 4: Memory Ordering: A Value-based Approach


Enforcing RAW dependences

Program order (exe order in parens):
  1. (1) store A
  2. (3) store ?
  3. (2) load A

Load queue contains load addresses
One search per store address calculation
If match, the load is squashed
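As a software analogy of that search (an illustrative data structure and names, not the actual hardware):

```python
# Sketch of the conventional RAW check: when a store's address resolves,
# search the load queue for younger loads to the same address that have
# already executed, and squash them.

from dataclasses import dataclass

@dataclass
class LoadEntry:
    age: int          # program-order age (smaller = older)
    address: int      # resolved load address
    executed: bool    # has the load already obtained its value?

class LoadQueue:
    def __init__(self):
        self.entries: list[LoadEntry] = []

    def on_store_address(self, store_age: int, store_addr: int) -> list[LoadEntry]:
        """One associative search per store address calculation."""
        # Younger loads (larger age) to the same address that already executed
        # read stale data and must be squashed.
        return [e for e in self.entries
                if e.executed and e.address == store_addr and e.age > store_age]
```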

Page 5: Memory Ordering: A Value-based Approach


Enforcing memory consistency

Example (program order, exe order in parens):
  Processor p1: 1. (3) load A   2. (1) load A
  Processor p2: 1. (2) store A
  (RAW edge from p2's store to the load that executed after it; WAR edge from the load that executed before it to p2's store)

Two approaches
  Snooping: search per incoming invalidate
  Insulated: search per load address calculation
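A conservative software analogy of the snooping variant, reusing the entry fields from the sketch above (real designs can filter more precisely; this only shows where the associative search happens):

```python
def loads_to_squash_on_invalidate(load_queue_entries, inv_addr: int, line_bytes: int = 64):
    """Conservative sketch: one associative search per incoming invalidate."""
    line = inv_addr // line_bytes
    # Any not-yet-committed load to the invalidated line that has already
    # read its value may now be inconsistently ordered, so squash it.
    return [e for e in load_queue_entries
            if e.executed and e.address // line_bytes == line]
```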

Page 6: Memory Ordering: A Value-based Approach


Load queue implementation

[Block diagram: an address CAM paired with a load meta-data RAM. Inputs: load address, store address, external (snoop) address, load age, store age, external request. Outputs: squash determination, queue management.]

# of write ports = load address calc width
# of read ports = load + store address calc width (+ 1)
Current generation designs (32-48 entries, 2 write ports, 2 (3) read ports)

Page 7: Memory Ordering: A Value-based Approach


Load queue scaling

Larger instruction window => larger load queue
  Increases access latency
  Increases energy consumption
Wider issue width => more read/write ports
  Also increases latency and energy

Page 8: Memory Ordering: A Value-based Approach


Related work: MICRO 2003

Park et al., Purdue
  Extra structure dedicated to enforcing memory consistency
  Increase capacity through segmentation
Sethumadhavan et al., UT-Austin
  Add set of filters summarizing contents of load queue

Page 9: Memory Ordering: A Value-based Approach


Keep it simple…

Throw more hardware at the problem?
  Need to design/implement/verify
  Execution core is already complicated
Load queue checks for rare errors
  Why not move error checking away from exe?

Page 10: Memory Ordering: A Value-based Approach


Value-based ordering

Replay: access the cache a second time, cheaply!
  Almost always a cache hit
  Reuse address calculation and translation
  Share cache port used by stores in commit stage
Compare: compare the new value to the original value
  Squash if the values differ
DIVA à la carte [Austin, Micro 99]

[Pipeline diagram: IF1 IF2 D R Q S EX C ... WB, with added REP (replay) and CMP (compare) stages near commit]

Page 11: Memory Ordering: A Value-based Approach


Rules of replay

1. All prior stores must have written data to the cache
   No store-to-load forwarding
2. Loads must replay in program order
   If a cache miss occurs, all subsequent loads must be replayed
3. If a load is squashed, it should not be replayed a second time
   Ensures forward progress
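Putting the replay/compare mechanism and these rules together, a minimal commit-stage sketch (the load and dcache interfaces are illustrative assumptions, not the paper's hardware):

```python
def commit_load(load, dcache) -> str:
    """Value-based replay + compare for the load at the head of the ROB."""
    # Rules 1 and 2 are captured by where this runs: at commit, after all
    # prior stores have written the cache, and strictly in program order.
    if load.already_squashed_once:
        return "commit"                       # rule 3: guarantees forward progress
    replay_value = dcache.read(load.address)  # reuses the original address/translation
    if replay_value == load.original_value:
        return "commit"                       # values match, ordering was fine
    load.already_squashed_once = True
    return "squash"                           # refetch starting at this load
```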

Page 12: Memory Ordering: A Value-based Approach


Replay reduction

Replay costs
  Consumes cache bandwidth (and power)
  Increases reorder buffer occupancy
Can we avoid these penalties?
  Infer correctness of certain operations
Four replay filters

Page 13: Memory Ordering: A Value-based Approach


No-Reorder filter

Avoid replay if load isn't reordered wrt other memory operations
Can we do better?

Page 14: Memory Ordering: A Value-based Approach


Enforcing single-thread RAW dependences

No-Unresolved Store Address filter
  Load instruction i is replayed if there are prior stores with unresolved addresses when i issues
  Works for intra-processor RAW dependences
  Doesn't enforce memory consistency

Page 15: Memory Ordering: A Value-based Approach


Enforcing MP consistency

No-Recent-Miss filter
  Avoid replay if there have been no cache line fills (to any address) while load in instruction window
No-Recent-Snoop filter
  Avoid replay if there have been no external invalidates (to any address) while load in instruction window

Page 16: Memory Ordering: A Value-based Approach


Constraint graph

Defined for sequential consistency by Landin et al., ISCA-18
Directed graph represents a multithreaded execution
  Nodes represent dynamic instruction instances
  Edges represent their transitive orders (program order, RAW, WAW, WAR)
If the constraint graph is acyclic, then the execution is correct
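A tiny sketch of that acyclicity check (purely illustrative; the dictionary-of-successors encoding is an assumption, not something from the talk):

```python
# The execution is correct iff the directed graph of program-order /
# RAW / WAW / WAR edges has no cycle. Iterative DFS with three colors;
# graph is {node: [successor, ...]}.

def is_correct_execution(graph: dict) -> bool:
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    for start in graph:
        if color[start] != WHITE:
            continue
        stack = [(start, iter(graph[start]))]
        color[start] = GRAY
        while stack:
            node, it = stack[-1]
            for succ in it:
                if color.get(succ, WHITE) == GRAY:
                    return False          # back edge => cycle => incorrect
                if color.get(succ, WHITE) == WHITE:
                    color[succ] = GRAY
                    stack.append((succ, iter(graph.get(succ, []))))
                    break
            else:
                color[node] = BLACK
                stack.pop()
    return True
```

Feeding the four memory operations and edges of the example on the next slide into such a check would flag the cycle.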

Page 17: Memory Ordering: A Value-based Approach


Constraint graph example - SC

[Figure: two-processor example. Proc 1: ST A and LD B; Proc 2: LD A and ST B. Program-order edges on each processor plus a WAR edge and a RAW edge between the processors form a cycle. A cycle indicates that the execution is incorrect.]

Page 18: Memory Ordering: A Value-based Approach


Anatomy of a cycle

[Figure: the same two-processor example (ST A / LD B on Proc 1, LD A / ST B on Proc 2, with program order, WAR, and RAW edges), annotated with the events a processor can observe: an incoming invalidate and a cache miss.]

Page 19: Memory Ordering: A Value-based Approach


Enforcing MP consistency

No-Recent-Miss filter
  Avoid replay if there have been no cache line fills (to any address) while load in instruction window
No-Recent-Snoop filter
  Avoid replay if there have been no external invalidates (to any address) while load in instruction window

Page 20: Memory Ordering: A Value-based Approach


Filter Summary

Four configurations, ranging from conservative to aggressive:
  Replay all committed loads
  No-Reorder filter
  No-Unresolved Store / No-Recent-Snoop filter
  No-Unresolved Store / No-Recent-Miss filter
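A rough sketch of how these heuristics could gate the commit-time replay decision (a combined illustration with assumed per-load bookkeeping flags, not the exact configurations evaluated in the paper):

```python
# Illustrative replay-filter logic: a load can skip the commit-time replay
# only if both its single-thread RAW check and the multiprocessor
# consistency check can be inferred correct. Flag names are assumptions.

def needs_replay(load, use_recent_snoop_filter: bool = True) -> bool:
    # No-Reorder: never reordered w.r.t. other memory ops => replay unnecessary.
    if not load.was_reordered:
        return False
    # No-Unresolved-Store-Address: replay if any prior store address was
    # still unknown when the load issued (covers intra-thread RAW).
    raw_risk = load.issued_with_unresolved_prior_store
    # Consistency side: a violation needs an external invalidate (snoop) or
    # a cache line fill while the load sat in the instruction window.
    if use_recent_snoop_filter:
        consistency_risk = load.saw_external_invalidate_in_window
    else:
        consistency_risk = load.saw_cache_fill_in_window
    return raw_risk or consistency_risk
```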

Page 21: Memory Ordering: A Value-based Approach


Outline

Conventional load queue functionality/microarchitecture
Value-based memory ordering
Replay-reduction heuristics
Performance evaluation

Page 22: Memory Ordering: A Value-based Approach


Base machine model: PHARMsim
  Based on SimpleMP, including a Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator

Out-of-order execution core: 5 GHz, 15-stage, 8-wide pipeline; 256-entry reorder buffer, 128-entry load/store queue; 32-entry issue queue

Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4/4); 4 L1 dcache load ports in OoO window; 1 L1 dcache load/store port at commit

Front-end: combined bimodal (16k entry) / gshare (16k entry) branch predictor with 16k entry selection table, 64 entry RAS, 8k entry 4-way BTB

Memory system (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1); 256K 8-way L2 (7), 8MB 8-way L3 (15), 64 byte cache lines; memory (400 cycle / 100 ns best-case latency, 10 GB/s BW); stride-based prefetcher modeled after Power4

Page 23: Memory Ordering: A Value-based Approach


% L1 dcache bandwidth increase

[Chart: % L1 dcache bandwidth increase for (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter, across SPECint2000, SPECfp2000, commercial, and multiprocessor workloads]

On average, 3.4% bandwidth overhead using the no-recent-snoop filter

Page 24: Memory Ordering: A Value-based Approach


Value-based replay performance (relative to constrained load queue)

[Chart: speedups across SPECint2000, SPECfp2000, commercial, and multiprocessor workloads]

Value-based replay is 8% faster on average than the baseline using a 16-entry load queue

Page 25: Memory Ordering: A Value-based Approach


Value-based replay Pros/Cons

+ Eliminates associative lookup hardware
  Load queue becomes a simple FIFO
  Negligible IPC or L1D bandwidth impact
+ Can be used to fix value prediction
  Enforces dependence order consistency constraint [Martin et al., Micro 2001]
- Requires additional pipeline stages
- Requires additional cache datapath for loads

Page 26: Memory Ordering: A Value-based Approach


The End

Questions?

Page 27: Memory Ordering: A Value-based Approach


Backups

Page 28: Memory Ordering: A Value-based Approach


Does value locality help?

Not much…
Value locality does avoid memory ordering violations
  59% of single-thread violations avoided
  95% of consistency violations avoided
But these violations rarely occur
  ~1 single-thread violation per 100 million instructions
  4 consistency violations per 10,000 instructions

Page 29: Memory Ordering: A Value-based Approach


What About Power?

Simple power model:
  Energy = #replays × (E_per cache access + E_per word comparison) + replay overhead − (E_per ldq search × #ldq searches)

Empirically: 0.02 replayed loads per committed instruction
If load queue CAM energy per instruction > 0.02 × the energy expenditure of a cache access and comparison: the value-based implementation saves power!
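A back-of-the-envelope version of that break-even test, with placeholder energy numbers (only the 0.02 replays-per-instruction figure and the structure of the comparison come from the slide):

```python
# Break-even check for the simple power model above. All energy values are
# hypothetical placeholders in nJ; only the comparison's structure is
# taken from the slide (0.02 replayed loads per committed instruction).

REPLAYS_PER_INSN = 0.02

def value_based_saves_power(e_ldq_search_per_insn: float,
                            e_cache_access: float,
                            e_word_compare: float) -> bool:
    replay_energy_per_insn = REPLAYS_PER_INSN * (e_cache_access + e_word_compare)
    return e_ldq_search_per_insn > replay_energy_per_insn

# Example with made-up numbers: 0.4 nJ/insn of CAM search energy vs.
# 1.0 nJ cache access + 0.05 nJ compare per replayed load.
print(value_based_saves_power(0.4, 1.0, 0.05))   # True: 0.4 > 0.02 * 1.05
```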

Page 30: Memory Ordering: A Value-based Approach


Caveat: Memory Dependence Prediction

Some predictors train using the conflicting store (e.g. the store-set predictor)
Replay mechanism is unable to pinpoint the conflicting store
Fair comparison:
  Baseline machine: store-set predictor w/ 4k entry SSIT and 128 entry LFST
  Experimental machine: simple 21264-style dependence predictor w/ 4k entry history table
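For reference, a 21264-style predictor is little more than a PC-indexed table of "wait" bits; a minimal sketch, with the table size taken from the slide and everything else an illustrative assumption:

```python
# Minimal sketch of a 21264-style memory dependence predictor: a table of
# "wait" bits indexed by load PC. A load that caused an ordering squash is
# marked; marked loads wait for all prior store addresses before issuing.

TABLE_SIZE = 4096          # 4k-entry history table, as in the slide

class DependencePredictor:
    def __init__(self):
        self.wait_bit = [False] * TABLE_SIZE

    def _index(self, load_pc: int) -> int:
        return (load_pc >> 2) % TABLE_SIZE

    def should_wait(self, load_pc: int) -> bool:
        return self.wait_bit[self._index(load_pc)]

    def train_on_squash(self, load_pc: int) -> None:
        # The replay mechanism only knows which *load* was wrong, not which
        # store conflicted, so PC-indexed training like this still works.
        self.wait_bit[self._index(load_pc)] = True

    def periodic_clear(self) -> None:
        # 21264-style predictors periodically clear the table so stale
        # wait bits don't throttle loads forever.
        self.wait_bit = [False] * TABLE_SIZE
```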

Page 31: Memory Ordering: A Value-based Approach


Load queue search energy

[Chart: load queue access energy (nJ), from 0 to about 3.5 nJ, vs. number of entries (16, 32, 64, 128, 256, 512) for rd2wr2, rd4wr4, and rd6wr6 port configurations]

Based on 0.09 micron process technology using Cacti v. 3.2

Page 32: Memory Ordering: A Value-based Approach


Load queue search latency

[Chart: load queue access latency (ns), from 0 to about 1.4 ns, vs. number of entries (16, 32, 64, 128, 256, 512) for rd2wr2, rd4wr4, and rd6wr6 port configurations]

Based on 0.09 micron process technology using Cacti v. 3.2

Page 33: Memory Ordering: A Value-based Approach


Benchmarks

MP (16-way)
  Commercial workloads (SPECweb, TPC-H)
  SPLASH2 scientific application (ocean)
  Error bars signify 95% statistical confidence
UP
  3 from SPECfp2000, selected due to high reorder buffer utilization: apsi, art, wupwise
  3 commercial: SPECjbb2000, TPC-B, TPC-H
  A few from SPECint2000

Page 34: Memory Ordering: A Value-based Approach


Life cycle of a load

[Animation: loads and stores with unresolved addresses (LD ?, ST ?) sit in the OoO execution window; load addresses are entered into the load queue as they resolve; a store whose address resolves to A matches an already-executed LD A in the load queue ("Blam!"), squashing it]

Page 35: Memory Ordering: A Value-based Approach


Performance relative to unconstrained load queue

Good news: Replay w/ no-recent-snoop filter only 1% slower on average

Page 36: Memory Ordering: A Value-based Approach


Reorder-Buffer Utilization

Page 37: Memory Ordering: A Value-based Approach


Why focus on load queue?

Load queue has different constraints than the store queue
  More loads than stores (30% vs 14% of dynamic instructions)
  Load queue searched more frequently (consuming more power)
  Store-forwarding logic is performance critical
Many non-scalable structures in OoO processor
  Scheduler
  Physical register file
  Register map

Page 38: Memory Ordering: A Value-based Approach


Prior work: formal memory model representations

Local, WRT, global "performance" of memory ops (Dubois et al., ISCA-13)
Acyclic graph representation (Landin et al., ISCA-18)
Modeling memory operations as a series of sub-operations (Collier, RAPA)
Acyclic graph + sub-operations (Adve, thesis)
Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)