A Mechanistic Model for Superscalar Processors

J. E. Smith, University of Wisconsin-Madison
Lieven Eeckhout, Stijn Eyerman, Ghent University
Tejas Karkhanis, AMD


Superscalar Modeling © J. E. Smith, 2006 2

Interval Analysis

Superscalar execution can be divided into intervals separated by miss events
• Branch mispredictions
• I-cache misses
• Long D-cache misses
• TLB misses, etc.

Provides more insight than simulation
• You can see the forest and the trees
• Supplements simulation; it is not a replacement

[Figure: IPC versus time, showing intervals 0 through 3 separated by branch mispredicts, an I-cache miss, and a long D-cache miss]

Outline

Development of Interval Analysis
• Modeling ILP
• Modeling miss events

Balanced Superscalar Processors
• Performance components
• Optimal pipeline configurations

Performance Counter Architecture
• Accurate CPI stacks

Superscalar Processors

[Figure: superscalar processor block diagram with I-cache, branch predictor, fetch buffer, decode pipeline, issue buffer, execution units, reorder buffer (window, W entries), physical register file(s), load and store queues, L1 data cache, MSHRs, L2 cache, and main memory. Widths are labeled F (fetch), D (decode/dispatch), I (issue), and R (retire). Annotated parameters include buffer entry counts, miss and mispredict rates, instruction delivery algorithm, number, type, and latency of units, pipeline depth, cache ports, and main memory latency.]

Superscalar Processors

Ifetch
• Adequate fetch resources to sustain decode/dispatch width D
• F > D plus fetch buffer to smooth flow

Decode
• Assume decode pipe and dispatch bandwidth D

Window
• Window, size W, holds in-flight instructions
• Equivalent to ROB
• Issue buffer holds subset of window (as an optimization)
• Assume unified issue buffer, but model can support partitioned buffers

Issue
• Width may be more or less than dispatch and commit widths

Retire
• Retire width R typically equal to dispatch width

Superscalar Processor Performance

Maximum IPC under ideal conditions
• No cache misses or branch mispredictions

Miss events disrupt smooth flow
• In a balanced design, performance is all about the transients

[Figure: IPC versus time, with dips at branch mispredicts, an I-cache miss, and a long D-cache miss]

Modeling ILP

Relationship between maximum window size W and achieved issue width i

Program dependence structure
Has a long history…

Riseman and Foster (1972)

Basic relationship between window size and IPC
• Classic study
• Approx. quadratic relationship under ideal conditions

[Figure: plot relating window size W and issue rate i]

Wall (1991)

Limits of ILP
• Another classic study
• Approx. quadratic relationship under “perfect” conditions

Michaud, Seznec, Jourdan

More recent study
Key result (Michaud, Seznec, Jourdan):
• Approx. quadratic relationship

Our Experiment

Ideal caches, predictor
Efficient I-fetch keeps window full
Graph issue rate i as a function of window size W
• Approx. quadratic relationship

Modeling IW Characteristic

Clearly a function of program dependence structure
Simple, single-level dependence models don’t work very well
• Need to consider dependence chains

Slide window over dynamic stream and compute average critical path k(W)

For unit latency, i = W/k(W)
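The sliding-window measurement can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tool: the trace format (`deps[i]` lists the indices of earlier instructions that instruction `i` depends on) and all function names are hypothetical, and every instruction is given unit latency.

```python
from typing import Sequence


def critical_path_length(deps: Sequence[Sequence[int]], start: int, size: int) -> int:
    """Longest dependence chain, in instructions, inside [start, start + size)."""
    depth = [1] * size
    for j in range(size):
        for producer in deps[start + j]:
            # Producers that lie before the window do not constrain issue order.
            if producer >= start:
                depth[j] = max(depth[j], depth[producer - start] + 1)
    return max(depth)


def average_critical_path(deps: Sequence[Sequence[int]], window: int) -> float:
    """k(W): mean critical path over every window position (needs len(deps) >= window)."""
    starts = range(len(deps) - window + 1)
    total = sum(critical_path_length(deps, s, window) for s in starts)
    return total / len(starts)


def unit_latency_issue_rate(deps: Sequence[Sequence[int]], window: int) -> float:
    """i = W / k(W) for unit-latency instructions."""
    return window / average_critical_path(deps, window)
```

A fully serial chain gives i = 1 regardless of W, while fully independent instructions give i = W; real traces fall in between.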

[Figure: window sliding over the dynamic instruction stream]

Average Critical Path

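$k(W) = \frac{1}{\alpha} W^{1/\beta}$

For our benchmarks, 1.3 ≤ β ≤ 1.9
• Quadratic when β = 2

Unit latency, avg. IPC:
$i = \alpha W^{1 - 1/\beta}$

Avg. latency l, avg. IPC:
$i = \frac{\alpha}{l} W^{1 - 1/\beta}$, equivalently $W = (l \cdot i / \alpha)^{\beta/(\beta - 1)}$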
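As a quick numerical illustration, the sketch below evaluates the IW characteristic in both directions, assuming the power-law form k(W) = W^(1/β)/α so that i = (α/l)·W^(1−1/β). The function names and the example values of α and β are hypothetical; only the range 1.3 ≤ β ≤ 1.9 comes from the slide.

```python
def issue_rate(window: float, alpha: float, beta: float, latency: float = 1.0) -> float:
    """Achieved issue rate i for window size W and average latency l."""
    return (alpha / latency) * window ** (1.0 - 1.0 / beta)


def window_for_issue_rate(rate: float, alpha: float, beta: float, latency: float = 1.0) -> float:
    """Inverse relation: window size W needed to sustain issue rate i."""
    return (latency * rate / alpha) ** (beta / (beta - 1.0))


if __name__ == "__main__":
    # With beta = 2 the relation is exactly quadratic: doubling i needs 4x the window.
    print(window_for_issue_rate(4.0, alpha=1.0, beta=2.0))
    print(window_for_issue_rate(8.0, alpha=1.0, beta=2.0))
```

For β below 2 the exponent β/(β − 1) exceeds 2, so the window grows faster than quadratically with the target issue rate.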

Generic Interval

All intervals follow same basic profile

[Figure: instructions per cycle versus time (in cycles) for a generic interval: ramp-up as instructions enter the window, ramp-down as the window drains, and a transient due to the miss event whose duration depends on the type of miss event]

I Cache Miss Interval

total time = n/D + ciL1

n = no. instructions in interval
D = decode/dispatch width
ciL1 = miss delay cycles

Predicts performance loss is independent of pipe length

[Figure: I-cache miss interval timeline with segments labeled time = n/D, window drains, miss delay, and re-fill pipeline]

Independence from Pipe Length

16 K I-cache; ideal D-cache and predictor
Two different pipeline lengths (4 and 8 cycles)
I-cache miss delay 8 cycles
Penalty independent of pipe length
Similar across benchmarks

[Figure: bar chart of cycles (0.0 to 8.0) for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, and vpr, comparing 4 and 8 front-end stages]

Branch Misprediction Interval

Total time = n/D + cdr(D) + cfe

n = no. instructions in interval
D = decode/dispatch width
cdr(D) = drain cycles; function of width (and ILP)
cfe = front-end pipeline length

[Figure: branch misprediction interval timeline with segments labeled time = n/D, window drain time (time = branch latency), and time = pipeline length]
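The two front-end interval formulas can be compared side by side with a small sketch. The function names and the numbers in the example are hypothetical; the formulas are the ones stated on the I-cache miss and branch misprediction slides.

```python
def icache_miss_interval_cycles(n: int, D: int, c_il1: float) -> float:
    """total time = n/D + ciL1 (no dependence on front-end pipeline length)."""
    return n / D + c_il1


def branch_mispredict_interval_cycles(n: int, D: int, c_dr: float, c_fe: float) -> float:
    """total time = n/D + cdr(D) + cfe (drain time plus front-end refill)."""
    return n / D + c_dr + c_fe


if __name__ == "__main__":
    # Hypothetical interval: 96 instructions on a 4-wide machine.
    print(icache_miss_interval_cycles(96, 4, c_il1=8))                # 24 + 8
    print(branch_mispredict_interval_cycles(96, 4, c_dr=6, c_fe=5))   # 24 + 6 + 5
```

Only the branch term contains cfe, which is why deepening the front end raises the misprediction penalty but leaves the I-cache miss penalty unchanged in this model.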

Branch Resolution Time

Assumes mispredicted branch is one of the last instructions to issue

[Figure: stacked bar chart of percentage (0% to 100%) per benchmark (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr), with legend categories 0, 1, 2, 3, 4, 5, and >5]

Branch Misprediction Penalty

Branch penalty is dependent on interval length

The penalty can be 2+ times pipeline length

Penalty is less for short intervals; more for long intervals

See ISPASS ’06 paper for more details

Long D-cache Miss Interval

[Figure: long D-cache miss interval timeline. Labels: steady state (time = n/D), load enters window, load resolution time, load issues, ROB fill time, ROB fills, issue window empty of issuable insns, miss latency, data returns from memory, instructions enter window, issue ramps up to steady state]

Long D-cache Miss Interval

For isolated miss, total time = n/D - W/D + cLr(D) + cL2

n = no. instructions in interval
D = decode/dispatch width
W = window (ROB) size
cLr(D) = load resolution time; function of width
cL2 = L2 miss delay

[Figure: the long D-cache miss interval timeline, repeated]

Miss Event Overlaps

Branch misprediction and I-cache miss effects “serialize”
• i.e., penalties add linearly

Long D-cache misses may overlap with I-cache misses and branch mispredictions (and with each other)
• Overlap with other long D-cache misses is more important
• Overlaps with branch mispredictions and I-cache misses are insignificant

[Figure: timeline with rows for branch mispredicts, I-cache misses, and long D-cache misses]

Overlapping Long D-cache Misses

s/D reflects amount of overlap
Total penalty is independent of s/D

[Figure: timeline for two overlapping long D-cache misses. Labels: time = n/D, 1st load enters window, 1st load issues, 2nd load issues, ROB fills, issue window empty of issuable insns, miss latency, load 1 data returns from memory, and two spans of length s/D]

Experimental Results

For each long miss, collect stats on other misses within a “ROB distance”
• This is a trace statistic
• Assume W/D = cLr
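One way to picture this trace statistic is the sketch below. It is an interpretation, not the authors' code: it assumes that a long miss overlaps with an earlier one when it lies fewer than W instructions after the miss that leads its group, and that each group then costs one isolated-miss penalty. With W/D = cLr, the isolated penalty -W/D + cLr + cL2 reduces to cL2. All names are hypothetical.

```python
from typing import Iterable


def count_non_overlapped_misses(miss_positions: Iterable[int], rob_size: int) -> int:
    """Count groups of long misses; misses within rob_size of a group's leader overlap."""
    groups = 0
    leader = None
    for pos in sorted(miss_positions):
        if leader is None or pos - leader >= rob_size:
            groups += 1      # this miss cannot hide under the previous leader
            leader = pos
    return groups


def long_miss_penalty_cycles(miss_positions: Iterable[int], rob_size: int, c_l2: float) -> float:
    """Total long-miss penalty, assuming W/D = cLr so each group costs cL2 cycles."""
    return count_non_overlapped_misses(miss_positions, rob_size) * c_l2
```

For example, misses at dynamic instruction positions 10, 50, 200, 260, and 500 with a 128-entry ROB form three groups, so the modeled penalty is three times the L2 miss delay rather than five.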

[Figure: bar chart of cycles (0.0 to 200.0) for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, and vpr, comparing simulation with the analytical model]

Overall Performance

Sum over all intervals

I-cache miss interval: n/D + ciL1
Branch mispredict: n/D + cdr + cfe
Long D-cache miss (non-overlapping): n/D - W/D + cLr + cL2

Collect the n/D terms: Ntotal/D

Account for “ceiling inefficiency”: ((D-1)/2D)*(miL1 + mbr + mL2)
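The (D-1)/2D factor can be checked numerically. The sketch below assumes that interval lengths are uniformly distributed modulo D (an assumption introduced here, not stated on the slide) and averages the cycles lost because an interval of n instructions needs ceil(n/D) dispatch cycles rather than n/D. The function name is hypothetical.

```python
from fractions import Fraction


def mean_ceiling_loss(D: int) -> Fraction:
    """Average of ceil(n/D) - n/D over one full period of residues n = 1..D."""
    total = Fraction(0)
    for n in range(1, D + 1):
        ceil_cycles = -(-n // D)              # integer ceiling of n/D
        total += Fraction(ceil_cycles * D - n, D)
    return total / D


if __name__ == "__main__":
    for width in (2, 4, 6, 8):
        print(width, mean_ceiling_loss(width), Fraction(width - 1, 2 * width))
```

Under that assumption each interval loses (D-1)/2D cycles on average, and there is one interval per miss event, which gives the correction term multiplied by the total miss count.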

Overall Performance

Total Cycles = Ntotal/D + ((D-1)/2D)*(miL1 + mbr + mL2)
 + miL1 * ciL1
 + mbr * (cdr + cfe)
 + mL2 * (-W/D + cLr + cL2)

TLB misses similar to L2 misses
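The total-cycles equation translates directly into code. The sketch below is a minimal rendering of that sum; the class names, field names, and the example numbers are hypothetical, and mL2 is taken to be the count of non-overlapping long misses.

```python
from dataclasses import dataclass


@dataclass
class MissProfile:
    n_total: int   # dynamic instruction count, Ntotal
    m_il1: int     # I-cache misses
    m_br: int      # branch mispredictions
    m_l2: int      # non-overlapping long D-cache misses


@dataclass
class Machine:
    D: int         # decode/dispatch width
    W: int         # window (ROB) size
    c_il1: float   # I-cache miss delay
    c_dr: float    # window drain cycles
    c_fe: float    # front-end pipeline length
    c_lr: float    # load resolution time
    c_l2: float    # L2 miss delay


def total_cycles(p: MissProfile, m: Machine) -> float:
    base = p.n_total / m.D
    ceiling = ((m.D - 1) / (2 * m.D)) * (p.m_il1 + p.m_br + p.m_l2)
    icache = p.m_il1 * m.c_il1
    branch = p.m_br * (m.c_dr + m.c_fe)
    long_miss = p.m_l2 * (-m.W / m.D + m.c_lr + m.c_l2)
    return base + ceiling + icache + branch + long_miss


if __name__ == "__main__":
    profile = MissProfile(n_total=1_000_000, m_il1=1_000, m_br=5_000, m_l2=500)
    machine = Machine(D=4, W=128, c_il1=8, c_dr=6, c_fe=5, c_lr=32, c_l2=200)
    cycles = total_cycles(profile, machine)
    print(cycles, cycles / profile.n_total)  # total cycles and CPI
```

Keeping each term as a separate local variable also yields a CPI stack for free: dividing each term by Ntotal gives the contribution of that miss type.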

Accuracy

Decode width D = 4: average error 4.2%; max 8.6%

D = 2, error = 1.8%
D = 6, error = 5.6%
D = 8, error = 5.6%

Decode Efficiency

Compare with simulation
• D = 4

mcf dominated by intervals of length 5 and 13
• Less efficient than model would predict

This is an inherent inefficiency due to intervals
• Strongly correlates w/ interval lengths

Convert From Cycles to Time

Important if pipeline depth is to be modeled
• Latch overheads become important

Start with baseline 5-stage front-end
• pb = # pipeline stages in baseline

Allow for arbitrary number of stages
• p = # pipeline stages
• Increase all latencies in proportion to relative depth: multiply cycles by p/pb

Convert total cycles to total time
• tp = total pipeline latency; to = latch overhead
• cycle time = tp / p + to
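A short sketch of the conversion, assuming a miss penalty that scales fully with depth. The function names and the numeric values are hypothetical; the ratio tp/to = 55 mirrors the value used later in the deck, and time is measured in units of the latch overhead.

```python
def cycle_time(p: float, t_p: float, t_o: float) -> float:
    """Cycle time for a p-stage pipeline: logic delay per stage plus latch overhead."""
    return t_p / p + t_o


def scaled_miss_time(misses: float, c_base: float, p: float, p_b: float,
                     t_p: float, t_o: float) -> float:
    """Time for a miss term whose cycle count scales with relative depth p/pb."""
    return misses * c_base * (p / p_b) * cycle_time(p, t_p, t_o)


if __name__ == "__main__":
    for depth in (5, 10, 20):
        print(depth, cycle_time(depth, 55.0, 1.0),
              scaled_miss_time(1, 8, depth, 5, 55.0, 1.0))
```

Expanding the product gives misses * c_base * (tp + to * p) / pb, so a fully pipelined miss term grows linearly with p even though the cycle time itself shrinks.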

Convert to Absolute Time

Total Time = [Ntotal/D + ((D-1)/2D)*(miL1 + mbr + mL2)] * (tp/p + to)
 + miL1 * ciL1 * (p/pb) * (tp/p + to)
 + mbr * (cdr(p,D) + cfe) * (p/pb) * (tp/p + to)
 + mL2 * (-W/Dp + cLr(p,D) + cL2) * (p/pb) * (tp/p + to)

TPI = Total Time/Ntotal

Now, consider some of the terms in isolation

Base TPI + One Linear Miss Event

[Figure: “Component TPI”: TPI (0 to 1.2) versus pipeline stages (5 to 35) for widths 2, 4, 6, and 8 and for one miss event]

Total Time = [Ntotal/D + ((D-1)/2D)*(miL1 + mbr + mL2)] * (tp/p + to)
 + miL1 * ciL1 * (p/pb) * (tp/p + to)
 + mbr * (cdr(p,D) + cfe) * (p/pb) * (tp/p + to)
 + mL2 * (-W/Dp + cLr(p,D) + cL2) * (p/pb) * (tp/p + to)

TPI = Total Time/Ntotal

[Figure: “Total TPI”: TPI (0.6 to 1.4) versus pipeline stages (5 to 35) for widths 2, 4, 6, and 8 and for one miss event]

Pipelining of Miss Events

[Figure: “Fully Pipelined Unit”: TPI (0 to 5) versus pipeline stages (5 to 35) for miss event, miss event x 2, miss event x 3, and miss event x 4]

Not all paths are fully pipelined
• e.g., cache misses may not be fully pipelined
• A pipeline factor (0 ≤ f ≤ 1) can be added to a term
• Example: I-cache miss

miL1 * ciL1 * (p/pb) * (tp/p + fiL1 * to)

[Figure: “Changing Pipeline Factor”: TPI (0.6 to 1.2) versus pipeline stages (5 to 35) for pipelined, pipelined .5, pipelined .25, and nonpipelined]

Fetch Inefficiency

Inherent fetch inefficiency
• Due to presence of misses
• As opposed to structural inefficiency
• More important for wider pipelines

[Ntotal/D + ((D-1)/2D)*(miL1 + mbr + mL2)] * (tp/p + to)

[Figure: “Effect of Inefficiency”: TPI (0 to 0.6) versus pipeline stages (axis ticks 1 to 7) for widths 2, 4, 6, and 8, with and without inefficiency (legend: w2+ovhd, w4+inherent, w6+inherent, w8+inherent)]

Miss Events Dependent on ROB Size

Miss events are dependent on ROB size
• And therefore dependent on depth/width for balanced designs

Branch mispredicts go up due to late update of predictor

L2 miss behavior may be better or worse depending on overlaps
• Deeper pipeline → longer miss penalty
• Longer ROB → more MLP

Balanced Superscalar Processor Design

Definition: at the iW balance point:
• Under ideal conditions, achieved issue width i = I, but decreasing W means achieved issue width diminishes
• For practical issue widths, there is enough ILP that balance can be achieved (see earlier work)
• Balance does not imply overall width/depth optimality

Provide adequate numbers of other resources
• Issue buffer, load/store buffers, rename regs., functional units, etc.
• Reducing resources below adequate level causes reduced performance

Balanced Superscalar Processor Design

Choose width/depth
Optimize other elements based on width/depth

[Figure: design dependence diagram. Nodes: issue width (achieved width), I-fetch resources, commit width, ROB size, # rename registers, load/store buffer sizes, numbers of functional units, issue buffer size, and pipeline depth. Edges are labeled as linear relationships, beta (~ quadratic) relationships, and an inverse relationship: at the optimal point, wider issue implies shallower pipeline.]

Optimize Pipeline Depth

Start with baseline 5-stage front-end
• pb = # pipeline stages in baseline

Evaluate 1x, 2x, 3x, 4x, 5x depths
• Increase all latencies in proportion to depth
• Multiply by p/pb

Convert total cycles to total time
• cycle time = tp / p + to
• p = # stages; tp = total pipeline latency; to = latch overhead

Total Time = [Ntotal/D + ((D-1)/2D)*(miL1 + mbr + mL2)] * (tp/p + to)
 + miL1 * ciL1 * (p/pb) * (tp/p + to)
 + mbr * (cdr(p,D) + cfe) * (p/pb) * (tp/p + to)
 + mL2 * (-W/Dp + cLr(p,D) + cL2) * (p/pb) * (tp/p + to)

TPI = Total Time/Ntotal

Pipeline Depth Results

Use tp/to = 55 as in Hartstein and Puzak
• Also illustrates accuracy of model
• Consider four typical benchmarks:

Pipeline Depth Results

On average, 2X baseline pipeline depth is optimal
Consistent w/ H&P

Optimize Pipeline Width

In general, wider means higher performance (to 8-wide)
Optimal depth becomes shallower as width grows
Diminishing returns w/ wider pipelines
• 4 vs. 2: 13.3%; 6 vs. 4: 7.1%; 8 vs. 6: 2.9%

Short Interval Effects

With short intervals, may never reach peak issue rate
Example: assume 1 mispredict every 96 instructions
• E.g. SPEC benchmark crafty with 4K gshare
• Max issue rate never reached for D = 6, 8

Yet, there is a benefit from wider pipelines

[Figure: IPC (0 to 7) versus cycle (0 to 60) for D = 2, 4, 6, and 8]

Benefit Does Not Come From Issue Width

Benefit comes from wider decode/dispatch width
• Get to next I-cache miss sooner
• Resolve branch mispredicts sooner
• Benefit comes from faster ramp-up
• D = 8 faster than D = 6
• D = 8, I = 6 gives same performance as D = 8, I = 8

[Figure: IPC (0 to 7) versus cycle (0 to 60) for D = 2, 4, 6, and 8]

Potential High Perf Processor

Widen fetch, decode, retire
• Keep relatively narrow issue

Lengthen ROB
• And related structures

[Figure: superscalar processor block diagram, as on the earlier “Superscalar Processors” slide]

Issue Buffer Sizing

[Figure: issue buffer size (0 to 250) versus reorder buffer size (0 to 800), with linear fit y = 0.3115x]

Similar to ROB sizing
Use average path rather than average critical path
(See Tejas thesis)

Processor     ROB Size  Issue Buffer  Ratio
Intel Core    96        32            .3
Power4        100       36            .4
MIPS R10K     64        20            .3
Pentium Pro   40        20            .5
Alpha 21264   80        20            .25
Opteron       72        24            .3
AMD K5        16        4             .25

Function Unit Demand Variation

[Figure: IALU demand (0 to 1) versus instructions (millions, 2 to 92), showing mean, mean + 1 stdev, mean + 2 stdev, and actual demand]

Example: gcc

Function Unit Resources

Demand proportional to instruction mix
Dependent on program and phases
• Collect phase-based data

Must be an integer

Number of functional units of type k:
Fk = I (Dk) + (2 (Dk)) Lk
• Lk = issue latency for unit k
• Gk = fraction using unit k

Use similar approach for other hardware resources

Comparison With H&P

H&P:

Total Time = Ntotal/α * (tp/p + to)
 + γ NH * (to p + tp)

Empirical: fit to detailed simulation data to determine α and γ
• Requires re-simulation if caches/predictor/pipeline factor, etc. change

Interval Model:

Total Time = Ntotal/D * (tp/p + to) + ((D-1)/2D)*(miL1 + mbr + mL2) * (tp/p + to)
 + miL1 * ciL1 * 1/pb * (to piL1 + tp)
 + mbr * (cdr(p,D)/p + cfe) * 1/pb * (to p + tp)
 + mL2 * (-W/Dp + cLr(p,D)/p + cL2) * 1/pb * (to pL2 + tp)

Mechanistic: bottom-up, no need to perform detailed simulation
• Not all hazard terms are linear in p
• Not all hazard terms are independent of D
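To see why a model of this shape yields a finite optimal depth, the H&P form as written above can be minimized over p. The short derivation below treats α, γ, Ntotal, and NH as constants independent of p, which is an assumption of this sketch; in the interval model some hazard terms are not linear in p, so the optimum there has to be found numerically.

```latex
\begin{align*}
T(p) &= \frac{N_{total}}{\alpha}\left(\frac{t_p}{p} + t_o\right)
        + \gamma N_H \left(t_o\, p + t_p\right) \\
\frac{dT}{dp} &= -\frac{N_{total}\, t_p}{\alpha\, p^{2}} + \gamma N_H\, t_o = 0 \\
p_{opt} &= \sqrt{\frac{N_{total}\, t_p}{\alpha\, \gamma\, N_H\, t_o}}
\end{align*}
```

The second derivative, 2 Ntotal tp / (α p³), is positive for p > 0, so this stationary point is a minimum. The optimum shrinks as the hazard count NH or the latch overhead to grows.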

Application: Performance Architecture

Construct performance counters based on interval model
Total cycle counter + one counter per miss event type

Front-end miss events
• Front-end Miss Event Table (FMT)

Back-end miss events
• Begin counting when full ROB stalls
• Increment appropriate counter depending on inst. at ROB head:
  D-TLB miss,
  L2 D-cache miss,
  L1 D-cache miss,
  Long functional unit (divide)
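The back-end accounting rule can be sketched as a per-cycle update. This is a software illustration of the rule as stated on the slide, not a description of actual hardware; the enum, the counter names, and the `account_cycle` function are all hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class HeadStall(Enum):
    """Reason the instruction at the ROB head has not yet completed."""
    DTLB_MISS = "d_tlb_miss"
    L2_DCACHE_MISS = "l2_dcache_miss"
    L1_DCACHE_MISS = "l1_dcache_miss"
    LONG_FU = "long_functional_unit"


def account_cycle(counters: Counter, rob_full_and_stalled: bool,
                  head_cause: Optional[HeadStall]) -> None:
    """Advance one cycle: always count it, and blame the ROB head when a full ROB stalls."""
    counters["total_cycles"] += 1
    if rob_full_and_stalled and head_cause is not None:
        counters[head_cause.value] += 1
```

Dividing each counter by the retired instruction count would give the corresponding back-end component of a CPI stack.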

Performance Architecture: FMT

One entry per outstanding branch

Tracks pre-window instructions
• Between fetch and dispatch tail

Tracks in-flight instructions
• Between ROB tail and ROB head

Table increments
• For I1 or I2 miss or I-TLB miss, increment counter pointed to by fetch
• Branch penalty counters between head and tail increment every cycle

Counter updates
• When correctly predicted branch retires, update I1, I2, I-TLB counters
• When mispredicted branch retires, update branch mispredict counter (and continue counting until first instruction is dispatched)

Simplified FMT

Shared I1, I2, ITLB entry

Instructions in ROB marked w/ I-cache miss or I-TLB miss

When a miss instruction retires,
• Shared entry is copied to counters
• ROB tag bits are cleared

When a mispredicted branch retires,
• Add to branch mispredict counter
• Clear shared entries

Evaluation

Compare:
• Simulation: add miss events one at a time and measure difference
• Simulation-rev: same as above, but reverse order of miss events
• naïve: count miss events, multiply by fixed penalty
• naïve non-spec: similar to above, but wrong-path events not counted
• Power5: IBM Power5 method
• FMT
• sFMT

Comparison

FMT and sFMT are most accurate
• naïve is worst

FMT and sFMT similar
• Simplified version is adequate

Power5 underestimates frontend miss events

Interval Model Development

Michaud, Seznec, Jourdan: issue transient
Tejas gap model: all transients
Taha and Wills: interval (macro block) model
Hartstein and Puzak: optimal pipelines

Conclusions

Intervals yield a divide-and-conquer approach
Supports intuition (adds confidence to intuition)
It’s all about transients
• The only things that count are cache misses and branch mispredictions

Application to automated design, performance monitoring, very fast simulation, optimizing compiler analysis, etc.

Analysis of pipeline limits
• Reinforces conventional wisdom
• We are close to the practical limits for depth and width

Extends to energy modeling (Tejas PhD)