A Mechanistic Model for Superscalar Processors

J. E. Smith, University of Wisconsin-Madison
Lieven Eeckhout and Stijn Eyerman, Ghent University
Tejas Karkhanis, AMD

Superscalar Modeling © J. E. Smith, 2006

Interval Analysis

Superscalar execution can be divided into intervals separated by miss events:
• Branch mispredictions
• I-cache misses
• Long D-cache misses
• TLB misses, etc.

Interval analysis provides more insight than simulation:
• You can see both the forest and the trees
• It supplements simulation; it is not a replacement

[Figure: IPC over time, divided into intervals 0-3 by branch mispredicts, an I-cache miss, and a long D-cache miss]

Outline

Development of interval analysis
• Modeling ILP
• Modeling miss events

Balanced superscalar processors
• Performance components
• Optimal pipeline configurations

Performance counter architecture
• Accurate CPI stacks

Superscalar Processors

[Figure: superscalar block diagram - branch predictor, I-cache, fetch buffer, decode pipeline, issue buffer, reorder buffer (window, W entries), physical register file(s), execution units, load/store queues, MSHRs, L1 data cache (# ports), L2 cache, and main memory; annotated with the model parameters: pipeline depth, buffer sizes (# entries), miss and mispredict rates, instruction-delivery algorithm, number and type of units, unit latencies, and main memory latency]

Superscalar Processors

Ifetch
• Adequate fetch resources to sustain decode/dispatch width D
• Fetch width F > D, plus a fetch buffer to smooth the flow

Decode
• Assume decode pipe and dispatch bandwidth D

Window
• Window of size W holds the in-flight instructions
• Equivalent to the ROB
• The issue buffer holds a subset of the window (as an optimization)
• Assume a unified issue buffer, but the model can support partitioned buffers

Issue
• Issue width may be more or less than the dispatch and commit widths

Retire
• Retire width R, typically equal to the dispatch width

Superscalar Processor Performance

Maximum IPC under ideal conditions
• No cache misses or branch mispredictions

Miss events disrupt the smooth flow
• In a balanced design, performance is all about the transients

[Figure: IPC over time, with dips at branch mispredicts, an I-cache miss, and a long D-cache miss]

Modeling ILP

Relationship between maximum window size W and achieved issue width i
• Determined by the program dependence structure
• Has a long history…

Riseman and Foster (1972)

Basic relationship between window size and IPC
• Classic study
• Approximately quadratic relationship under ideal conditions

Wall (1991)

Limits of ILP
• Another classic study
• Approximately quadratic relationship under “perfect” conditions

Michaud, Seznec, and Jourdan

More recent study
• Key result: an approximately quadratic relationship

Our Experiment

Ideal caches and predictor; efficient I-fetch keeps the window full. Graph the issue rate i as a function of window size W
• Approximately quadratic relationship

Modeling the i-W Characteristic

Clearly a function of the program dependence structure. Simple, single-level dependence models don't work very well
• Need to consider dependence chains

Slide a window over the dynamic instruction stream and compute the average critical path k(W)

For unit latency, i = W/k(W)

[Figure: a window sliding over the dynamic instruction stream]

Average Critical Path

For our benchmarks, 1.3 ≤ β ≤ 1.9
• Quadratic when β = 2

Average critical path (α a program-dependent constant):
  k(W) = α·W^(1 - 1/β)

Unit latency, average IPC:
  i = W/k(W) = (1/α)·W^(1/β)

Average latency l, average IPC:
  i = (1/(α·l))·W^(1/β), equivalently W = (α·l·i)^β
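The power-law i-W relationship above can be sketched numerically. This is an illustrative sketch, not the authors' code: the constant α and exponent β are program-dependent, and the values below are made up.

```python
# Sketch of the power-law ILP model from the slides (illustrative values).
# Average critical path: k(W) = alpha * W**(1 - 1/beta), so with average
# instruction latency l the achieved issue rate is i = W / (l * k(W)).

def critical_path(W, alpha=1.0, beta=1.6):
    """Average critical path length over a window of W instructions."""
    return alpha * W ** (1.0 - 1.0 / beta)

def issue_rate(W, alpha=1.0, beta=1.6, latency=1.0):
    """Achieved issue width i = (1/(alpha*l)) * W**(1/beta)."""
    return W / (latency * critical_path(W, alpha, beta))

for W in (16, 32, 64, 128, 256):
    print(f"W={W:4d}  i={issue_rate(W):.2f}")
```

With β = 2 this reduces to i = √W/α, the classic quadratic relationship noted in the earlier studies.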

Generic Interval

All intervals follow the same basic profile

[Figure: instructions per cycle vs. time - ramp-up as instructions enter the window, a transient due to the miss event (its duration depends on the type of miss), then ramp-down as the window drains]

I-Cache Miss Interval

Total time = n/D + c_iL1

n = number of instructions in the interval
D = decode/dispatch width
c_iL1 = miss delay in cycles

Predicts that the performance loss is independent of pipeline length

[Figure: I-cache miss timeline - dispatch for n/D cycles, the window drains during the miss delay, then the pipeline re-fills]
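As a quick sanity check on the formula, a minimal sketch (variable names follow the slide; the numbers are illustrative):

```python
# Isolated I-cache miss interval: total cycles = n/D + c_iL1.
# The penalty relative to a miss-free interval (n/D cycles) is exactly
# c_iL1, independent of front-end pipeline length.

def icache_interval_cycles(n, D, c_iL1):
    """n = instructions in interval, D = dispatch width, c_iL1 = miss delay."""
    return n / D + c_iL1

n, D, c_iL1 = 96, 4, 8
penalty = icache_interval_cycles(n, D, c_iL1) - n / D
print(penalty)  # 8.0
```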

Independence from Pipe Length

16 KB I-cache; ideal D-cache and predictor. Two different front-end pipeline lengths (4 and 8 stages); I-cache miss delay of 8 cycles. The penalty is independent of pipe length and similar across benchmarks.

[Bar chart: I-cache miss penalty in cycles for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, and vpr, with 4 vs. 8 front-end stages]

Branch Misprediction Interval

Total time = n/D + c_dr(D) + c_fe

n = number of instructions in the interval
D = decode/dispatch width
c_dr(D) = window-drain cycles; a function of width (and ILP)
c_fe = front-end pipeline length

[Figure: branch misprediction timeline - dispatch for n/D cycles, window drain time plus branch latency, then a pipeline-length re-fill]
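The same formula as a sketch (illustrative numbers):

```python
# Branch misprediction interval: total cycles = n/D + c_dr(D) + c_fe.
# c_dr is the window-drain / branch-resolution time (a function of width
# and ILP); c_fe is the front-end pipeline length. The per-miss penalty
# c_dr + c_fe can therefore exceed the pipeline length alone.

def branch_interval_cycles(n, D, c_dr, c_fe):
    return n / D + c_dr + c_fe

def branch_penalty(c_dr, c_fe):
    """Cycles lost relative to a miss-free interval."""
    return c_dr + c_fe

print(branch_penalty(c_dr=12, c_fe=5))  # 17
```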

Branch Resolution Time

Assumes the mispredicted branch is one of the last instructions to issue

[Stacked bar chart: for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, and vpr, the percentage breakdown across categories 0, 1, 2, 3, 4, 5, and >5]

Branch Misprediction Penalty

The branch penalty depends on interval length
• The penalty can be 2+ times the pipeline length
• The penalty is smaller for short intervals and larger for long intervals

See the ISPASS ’06 paper for more details

Long D-Cache Miss Interval

[Figure: long D-cache miss timeline - the load enters the window and issues; instructions enter the window at rate D and issue ramps up to steady state; during the miss latency the ROB fills and the issue window empties of issuable instructions; data returns from memory after the load resolution time, ROB fill time, and miss latency; total dispatch time is n/D]

Long D-Cache Miss Interval

For an isolated miss, total time = n/D - W/D + c_Lr(D) + c_L2

n = number of instructions in the interval
D = decode/dispatch width
W = window (ROB) size
c_Lr(D) = load resolution time; a function of width
c_L2 = L2 miss delay

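The isolated-miss formula can be sketched as follows (illustrative numbers):

```python
# Isolated long D-cache miss: total cycles = n/D - W/D + c_Lr(D) + c_L2.
# The -W/D term credits the time spent filling the ROB, which overlaps
# with the memory latency c_L2; c_Lr(D) is the load resolution time.

def long_dmiss_interval_cycles(n, D, W, c_Lr, c_L2):
    return n / D - W / D + c_Lr + c_L2

# With a 128-entry ROB, part of a 200-cycle miss is hidden behind
# useful dispatch (illustrative numbers):
print(long_dmiss_interval_cycles(n=400, D=4, W=128, c_Lr=32, c_L2=200))  # 300.0
```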

Miss Event Overlaps

Branch misprediction and I-cache miss effects “serialize”
• i.e., their penalties add linearly

Long D-cache misses may overlap with I-cache and branch-prediction misses (and with each other)
• Overlap with other long D-cache misses is the more important case
• Overlaps with branch mispredictions and I-cache misses are insignificant

[Diagram: overlap among branch mispredicts, I-cache misses, and long D-cache misses]

Overlapping Long D-Cache Misses

The spacing s/D reflects the amount of overlap; the total penalty is independent of s/D

[Figure: two overlapping misses - the 1st load enters the window and issues, the 2nd load issues s/D cycles later, the ROB fills, and both miss latencies overlap until load 1's data returns from memory]

Experimental Results

For each long miss, collect statistics on other misses within a “ROB distance”
• This is a trace statistic
• Assume W/D = c_Lr

[Bar chart: long D-cache miss penalty in cycles for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, and vpr - simulation vs. the analytical model]

Overall Performance

Sum over all intervals:

I-cache miss interval: n/D + c_iL1
Branch mispredict: n/D + c_dr + c_fe
Long D-cache miss (non-overlapping): n/D - W/D + c_Lr + c_L2

Collect the n/D terms: N_total/D

Account for “ceiling inefficiency”: ((D-1)/2D)·(m_iL1 + m_br + m_L2)

Overall Performance

Total cycles = N_total/D + ((D-1)/2D)·(m_iL1 + m_br + m_L2)
             + m_iL1 · c_iL1
             + m_br · (c_dr + c_fe)
             + m_L2 · (-W/D + c_Lr + c_L2)

TLB misses are handled similarly to L2 misses
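The total-cycles equation can be assembled into a single function. This is a sketch: the miss counts m_* and delays c_* are inputs the model user supplies, and the example numbers are invented.

```python
# Overall cycle count from the interval model:
#   N/D + ((D-1)/2D)*(m_iL1 + m_br + m_L2)          base + ceiling term
#   + m_iL1*c_iL1 + m_br*(c_dr + c_fe)              front-end miss events
#   + m_L2*(-W/D + c_Lr + c_L2)                     long D-cache misses

def total_cycles(N, D, W, m_iL1, m_br, m_L2,
                 c_iL1, c_dr, c_fe, c_Lr, c_L2):
    ceiling = (D - 1) / (2 * D) * (m_iL1 + m_br + m_L2)
    return (N / D + ceiling
            + m_iL1 * c_iL1
            + m_br * (c_dr + c_fe)
            + m_L2 * (-W / D + c_Lr + c_L2))

# With no miss events the model collapses to the ideal N/D cycles:
print(total_cycles(N=1000, D=4, W=128, m_iL1=0, m_br=0, m_L2=0,
                   c_iL1=8, c_dr=12, c_fe=5, c_Lr=32, c_L2=200))  # 250.0
```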

Accuracy

Decode width D = 4: average error 4.2%; max 8.6%
D = 2: average error 1.8%
D = 6: average error 5.6%
D = 8: average error 5.6%

Decode Efficiency

Compare with simulation
• D = 4

mcf is dominated by intervals of length 5 and 13
• Less efficient than the model would predict

This is an inherent inefficiency due to intervals
• It correlates strongly with the interval lengths

Convert From Cycles to Time

Important if pipeline depth is to be modeled
• Latch overheads become important

Start with a baseline 5-stage front end
• p_b = number of pipeline stages in the baseline

Allow an arbitrary number of stages
• p = number of pipeline stages
• Increase all latencies in proportion to the relative depth: multiply cycle counts by p/p_b

Convert total cycles to total time
• t_p = total pipeline logic latency; t_o = latch overhead
• Cycle time = t_p/p + t_o
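The two conversions above can be sketched directly (t_p = 55 and t_o = 1 are illustrative, chosen to echo the later t_p/t_o = 55 slide):

```python
# Converting cycles to time: cycle time = t_p/p + t_o, where t_p is the
# total pipeline logic latency, p the number of stages, and t_o the
# per-stage latch overhead. Miss delays expressed in cycles scale with
# depth: multiply by p/p_b (p_b = baseline stage count).

def cycle_time(t_p, p, t_o):
    return t_p / p + t_o

def scale_miss_delay(c_base, p, p_b):
    return c_base * p / p_b

# Deepening the pipeline shortens the cycle but inflates miss delays:
for p in (5, 10, 20):
    print(p, cycle_time(55.0, p, 1.0), scale_miss_delay(8, p, 5))
```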

Convert to Absolute Time

Total time = [N_total/D + ((D-1)/2D)·(m_iL1 + m_br + m_L2)] · (t_p/p + t_o)
           + m_iL1 · c_iL1 · (p/p_b) · (t_p/p + t_o)
           + m_br · (c_dr(p,D) + c_fe) · (p/p_b) · (t_p/p + t_o)
           + m_L2 · (-W/(Dp) + c_Lr(p,D) + c_L2) · (p/p_b) · (t_p/p + t_o)

TPI = Total time / N_total

Now, consider some of the terms in isolation

Base TPI + One Linear Miss Event

[Two plots: component TPI and total TPI vs. pipeline stages (5 to 35) for widths 2, 4, 6, and 8, with the miss-event contribution marked; both are computed from the total-time formula above]

Pipelining of Miss Events

[Plot: TPI vs. pipeline stages (5 to 35) for a fully pipelined unit with 1, 2, 3, and 4 miss events]

Not all paths are fully pipelined
• e.g., cache misses may not be fully pipelined
• A pipeline factor (0 ≤ f ≤ 1) can be added to a term
• Example, for an I-cache miss:
    m_iL1 · c_iL1 · (p/p_b) · (t_p/p + f_iL1·t_o)

[Plot: TPI vs. pipeline stages as the pipeline factor changes - fully pipelined, 0.5, 0.25, and non-pipelined]

Fetch Inefficiency

Inherent fetch inefficiency
• Due to the presence of misses
• As opposed to structural inefficiency
• More important for wider pipelines

[N_total/D + ((D-1)/2D)·(m_iL1 + m_br + m_L2)] · (t_p/p + t_o)

[Plot: TPI vs. pipeline stages for widths 2, 4, 6, and 8, with and without the inherent-inefficiency and overhead terms]

Miss Events Dependent on ROB Size

Miss events are dependent on ROB size
• And therefore dependent on depth/width for balanced designs

Branch mispredicts go up due to late update of the predictor. L2 miss behavior may be better or worse depending on overlaps
• A deeper pipeline means a longer miss penalty
• A longer ROB means more MLP

Balanced Superscalar Processor Design

Definition: at the i-W balance point
• Under ideal conditions, the achieved issue width i equals the designed issue width I, but decreasing W diminishes the achieved issue width
• For practical issue widths, there is enough ILP that balance can be achieved (see earlier work)
• Balance does not imply overall width/depth optimality

Provide adequate numbers of other resources
• Issue buffer, load/store buffers, rename registers, functional units, etc.
• Reducing resources below the adequate level reduces performance

Balanced Superscalar Processor Design

Choose the width/depth, then optimize the other elements based on the width/depth

[Diagram: issue width and pipeline depth have an inverse relationship (at the optimal point, wider issue implies a shallower pipeline); both drive ROB size through the β (~quadratic) relationship; I-fetch resources and commit width scale with the achieved width; the numbers of rename registers, load/store buffer sizes, functional units, and issue buffer size all scale linearly with ROB size]

Optimize Pipeline Depth

Start with a baseline 5-stage front end
• p_b = number of pipeline stages in the baseline

Evaluate 1x, 2x, 3x, 4x, and 5x depths
• Increase all latencies in proportion to depth: multiply by p/p_b

Convert total cycles to total time
• Cycle time = t_p/p + t_o
• p = number of stages; t_p = total pipeline logic latency; t_o = latch overhead

Total time = [N_total/D + ((D-1)/2D)·(m_iL1 + m_br + m_L2)] · (t_p/p + t_o)
           + m_iL1 · c_iL1 · (p/p_b) · (t_p/p + t_o)
           + m_br · (c_dr(p,D) + c_fe) · (p/p_b) · (t_p/p + t_o)
           + m_L2 · (-W/(Dp) + c_Lr(p,D) + c_L2) · (p/p_b) · (t_p/p + t_o)

TPI = Total time / N_total

Pipeline Depth Results

Use t_p/t_o = 55, as in Hartstein and Puzak
• Also illustrates the accuracy of the model
• Consider four typical benchmarks
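The depth optimization can be sketched as a toy sweep in the spirit of the model. Only t_p/t_o = 55 comes from the slide; the miss count and penalty numbers below are invented for illustration, so the exact optimum is illustrative, not a result.

```python
# Toy pipeline-depth sweep in the spirit of the interval model.
# t_p/t_o = 55 follows Hartstein and Puzak (per the slide); every other
# number here is invented for illustration.

T_P, T_O, P_B = 55.0, 1.0, 5   # total logic latency, latch overhead, baseline depth
N, D = 1_000_000, 4            # instructions and decode width
M_MISS = 50_000                # aggregate miss events (illustrative)
C_PEN_B = 15                   # baseline penalty cycles per miss event

def tpi(p):
    """Time per instruction at depth p: cycles(p) * cycle_time(p) / N."""
    cycles = N / D + M_MISS * C_PEN_B * p / P_B   # penalties scale with depth
    return cycles * (T_P / p + T_O) / N

best = min(range(5, 36), key=tpi)
print("optimal depth:", best)  # 2x the 5-stage baseline for these numbers
```

Deepening trades a shorter cycle against miss penalties that grow with p, which is what produces an interior optimum.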

Pipeline Depth Results

On average, 2x the baseline pipeline depth is optimal; consistent with H&P

Optimize Pipeline Width

In general, wider means higher performance (up to 8-wide). The optimal depth becomes shallower as width grows. There are diminishing returns with wider pipelines:
• 4 vs. 2: 13.3%; 6 vs. 4: 7.1%; 8 vs. 6: 2.9%

Short Interval Effects

With short intervals, the peak issue rate may never be reached. Example: assume 1 mispredict every 96 instructions
• e.g., the SPEC benchmark crafty with a 4K gshare predictor
• The maximum issue rate is never reached for D = 6 or 8

Yet there is still a benefit from wider pipelines

[Plot: IPC vs. cycle (0 to 60) within the interval, for D = 2, 4, 6, and 8]

The Benefit Does Not Come From Issue Width

The benefit comes from wider decode/dispatch width
• Get to the next I-cache miss sooner
• Resolve branch mispredicts sooner
• The benefit comes from faster ramp-up
• D = 8 ramps up faster than D = 6
• D = 8, I = 6 gives the same performance as D = 8, I = 8

[Plot: IPC vs. cycle (0 to 60), for D = 2, 4, 6, and 8]

Potential High-Performance Processor

Widen fetch, decode, and retire
• Keep issue relatively narrow

Lengthen the ROB
• And the related structures

[Figure: the superscalar block diagram again, with widened fetch/decode/retire paths and a lengthened reorder buffer]

Issue Buffer Sizing

Similar to ROB sizing, but use the average path rather than the average critical path (see Tejas Karkhanis' thesis)

[Scatter plot: issue buffer size vs. reorder buffer size, with linear fit y = 0.3115x]

Processor      ROB Size   Issue Buffer   Ratio
Intel Core         96           32        0.3
Power4            100           36        0.4
MIPS R10K          64           20        0.3
Pentium Pro        40           20        0.5
Alpha 21264        80           20        0.25
Opteron            72           24        0.3
AMD K5             16            4        0.25
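The fitted line y = 0.3115x gives a quick sizing rule. A sketch; real designs (as the table shows) round to implementation-friendly sizes:

```python
import math

# Rule of thumb from the scatter plot's linear fit:
# issue buffer size ~= 0.3115 * ROB size, rounded up.
def issue_buffer_size(rob_size, slope=0.3115):
    return math.ceil(slope * rob_size)

for rob in (40, 64, 96, 128):
    print(rob, issue_buffer_size(rob))
```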

Function Unit Demand Variation

[Plot: integer-ALU demand per cycle over roughly 100M instructions of gcc - actual demand vs. the mean, mean + 1 stdev, and mean + 2 stdev]

Function Unit Resources

Demand is proportional to the instruction mix, and depends on the program and its phases
• Collect phase-based data

The number of units must be an integer. Number of functional units of type k:

  F_k = ⌈(μ(D_k) + 2·σ(D_k)) · L_k⌉

• D_k = demand for unit k
• L_k = issue latency for unit k
• G_k = fraction of instructions using unit k

Use a similar approach for other hardware resources
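The sizing rule can be sketched as follows. This is an illustrative reading of the slide's formula (mean demand plus two standard deviations, scaled by the unit's issue latency); the example numbers are made up:

```python
import math

# Sizing functional units of type k: provision for the mean per-cycle
# demand plus two standard deviations, scaled by the unit's issue
# latency L_k, and round up to an integer number of units.

def num_units(mean_demand, stdev_demand, issue_latency):
    """F_k = ceil((mu + 2*sigma) * L_k)."""
    return math.ceil((mean_demand + 2 * stdev_demand) * issue_latency)

# e.g. an integer ALU with mean demand 1.0 op/cycle, stdev 0.5,
# single-cycle issue latency:
print(num_units(1.0, 0.5, 1))  # 2
```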

Comparison With H&P

H&P:

Total time = N_total/α · (t_p/p + t_o) + γ·N_H · (t_o·p + t_p)

Empirical: fit to detailed simulation data to determine α and γ; requires re-simulation if the caches, predictor, pipeline factor, etc. change

Interval model:

Total time = N_total/D · (t_p/p + t_o) + ((D-1)/2D)·(m_iL1 + m_br + m_L2) · (t_p/p + t_o)
           + m_iL1 · c_iL1 · (1/p_b) · (t_o·p_iL1 + t_p)
           + m_br · (c_dr(p,D)/p + c_fe) · (1/p_b) · (t_o·p + t_p)
           + m_L2 · (-W/(Dp) + c_Lr(p,D)/p + c_L2) · (1/p_b) · (t_o·p_L2 + t_p)

Mechanistic: bottom-up; no need to perform detailed simulation. Not all hazard terms are linear in p, and not all hazard terms are independent of D

Application: Performance Architecture

Construct performance counters based on the interval model: a total cycle counter plus one counter per miss-event type

Front-end miss events
• Front-end Miss Event Table (FMT)

Back-end miss events
• Begin counting when a full ROB stalls
• Increment the appropriate counter depending on the instruction at the ROB head: D-TLB miss, L2 D-cache miss, L1 D-cache miss, or long functional unit (divide)

Performance Architecture: FMT

One entry per outstanding branch

Tracks pre-window instructions
• Between fetch and the dispatch tail

Tracks in-flight instructions
• Between the ROB tail and the ROB head

Table increments
• On an I1, I2, or I-TLB miss, increment the counter pointed to by fetch
• The branch penalty counters between head and tail increment every cycle

Counter updates
• When a correctly predicted branch retires, update the I1, I2, and I-TLB counters
• When a mispredicted branch retires, update the branch mispredict counter (and continue counting until the first instruction is dispatched)

Simplified FMT

A single shared entry for I1, I2, and I-TLB misses. Instructions in the ROB are marked with an I-cache-miss or I-TLB-miss tag

When a marked instruction retires
• The shared entry is copied to the counters
• The ROB tag bits are cleared

When a mispredicted branch retires
• Add to the branch mispredict counter
• Clear the shared entries

Evaluation

Compare:
• Simulation: add miss events one at a time and measure the difference
• Simulation-rev: same as above, but with the miss events in reverse order
• Naïve: count miss events and multiply by a fixed penalty
• Naïve non-spec: similar to the above, but wrong-path events are not counted
• Power5: the IBM Power5 method
• FMT
• sFMT

Evaluation

[Figure: evaluation results]

Comparison

FMT and sFMT are the most accurate
• Naïve is the worst

FMT and sFMT are similar
• The simplified version is adequate

Power5 underestimates front-end miss events

Interval Model Development

• Michaud, Seznec, and Jourdan: issue transient
• Tejas Karkhanis' gap model: all transients
• Taha and Wills: interval (macro-block) model
• Hartstein and Puzak: optimal pipelines

Conclusions

Intervals yield a divide-and-conquer approach. The model supports intuition (and adds confidence to it). It's all about the transients
• The only things that count are cache misses and branch mispredictions

Applications: automated design, performance monitoring, very fast simulation, optimizing-compiler analysis, etc.

Analysis of pipeline limits
• Reinforces conventional wisdom
• We are close to the practical limits for depth and width

Extends to energy modeling (Tejas Karkhanis' PhD thesis)