A Mechanistic Model for Superscalar Processors

J. E. Smith, University of Wisconsin-Madison
Lieven Eeckhout and Stijn Eyerman, Ghent University
Tejas Karkhanis, AMD

Superscalar Modeling © J. E. Smith, 2006

Interval Analysis

Superscalar execution can be divided into intervals separated by miss events:
• Branch mispredictions
• I-cache misses
• Long D-cache misses
• TLB misses, etc.

Interval analysis provides more insight than simulation:
• You can see both the forest and the trees
• It supplements simulation; it is not a replacement

[Figure: IPC over time, divided into intervals 0-3 by branch mispredicts, an I-cache miss, and a long D-cache miss]

Outline

Development of interval analysis
• Modeling ILP
• Modeling miss events

Balanced superscalar processors
• Performance components
• Optimal pipeline configurations

Performance counter architecture
• Accurate CPI stacks

Superscalar Processors

[Figure: superscalar block diagram - branch predictor, I-cache, fetch buffer, decode pipeline, issue buffer, reorder buffer (window, W entries), physical register file(s), execution units, load/store queues, MSHRs, L1 data cache (# ports), L2 cache, and main memory; annotated with the model parameters: pipeline depth, buffer sizes (# entries), miss and mispredict rates, instruction-delivery algorithm, number and type of units, unit latencies, and main memory latency]

Superscalar Processors

Ifetch
• Adequate fetch resources to sustain decode/dispatch width D
• Fetch width F > D, plus a fetch buffer to smooth the flow

Decode
• Assume decode pipe and dispatch bandwidth D

Window
• Window of size W holds the in-flight instructions
• Equivalent to the ROB
• The issue buffer holds a subset of the window (as an optimization)
• Assume a unified issue buffer, but the model can support partitioned buffers

Issue
• Issue width may be more or less than the dispatch and commit widths

Retire
• Retire width R, typically equal to the dispatch width

Superscalar Processor Performance

Maximum IPC under ideal conditions
• No cache misses or branch mispredictions

Miss events disrupt the smooth flow
• In a balanced design, performance is all about the transients

[Figure: IPC over time, with dips at branch mispredicts, an I-cache miss, and a long D-cache miss]

Modeling ILP

Relationship between maximum window size W and achieved issue width i
• Determined by the program dependence structure
• Has a long history…

Riseman and Foster (1972)

Basic relationship between window size and IPC
• Classic study
• Approximately quadratic relationship under ideal conditions

Wall (1991)

Limits of ILP
• Another classic study
• Approximately quadratic relationship under “perfect” conditions

Michaud, Seznec, and Jourdan

More recent study
• Key result: an approximately quadratic relationship

Our Experiment

Ideal caches and predictor; efficient I-fetch keeps the window full. Graph the issue rate i as a function of window size W
• Approximately quadratic relationship

Modeling the i-W Characteristic

Clearly a function of the program dependence structure. Simple, single-level dependence models don't work very well
• Need to consider dependence chains

Slide a window over the dynamic instruction stream and compute the average critical path k(W)

For unit latency, i = W/k(W)

[Figure: a window sliding over the dynamic instruction stream]

Average Critical Path

For our benchmarks, 1.3 ≤ β ≤ 1.9
• Quadratic when β = 2

Average critical path (α a program-dependent constant):
  k(W) = α·W^(1 - 1/β)

Unit latency, average IPC:
  i = W/k(W) = (1/α)·W^(1/β)

Average latency l, average IPC:
  i = (1/(α·l))·W^(1/β), equivalently W = (α·l·i)^β
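The power-law i-W relationship above can be sketched numerically. This is an illustrative sketch, not the authors' code: the constant α and exponent β are program-dependent, and the values below are made up.

```python
# Sketch of the power-law ILP model from the slides (illustrative values).
# Average critical path: k(W) = alpha * W**(1 - 1/beta), so with average
# instruction latency l the achieved issue rate is i = W / (l * k(W)).

def critical_path(W, alpha=1.0, beta=1.6):
    """Average critical path length over a window of W instructions."""
    return alpha * W ** (1.0 - 1.0 / beta)

def issue_rate(W, alpha=1.0, beta=1.6, latency=1.0):
    """Achieved issue width i = (1/(alpha*l)) * W**(1/beta)."""
    return W / (latency * critical_path(W, alpha, beta))

for W in (16, 32, 64, 128, 256):
    print(f"W={W:4d}  i={issue_rate(W):.2f}")
```

With β = 2 this reduces to i = √W/α, the classic quadratic relationship noted in the earlier studies.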

Generic Interval

All intervals follow the same basic profile

[Figure: instructions per cycle vs. time - ramp-up as instructions enter the window, a transient due to the miss event (its duration depends on the type of miss), then ramp-down as the window drains]

I-Cache Miss Interval

Total time = n/D + c_iL1

n = number of instructions in the interval
D = decode/dispatch width
c_iL1 = miss delay in cycles

Predicts that the performance loss is independent of pipeline length

[Figure: I-cache miss timeline - dispatch for n/D cycles, the window drains during the miss delay, then the pipeline re-fills]
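As a quick sanity check on the formula, a minimal sketch (variable names follow the slide; the numbers are illustrative):

```python
# Isolated I-cache miss interval: total cycles = n/D + c_iL1.
# The penalty relative to a miss-free interval (n/D cycles) is exactly
# c_iL1, independent of front-end pipeline length.

def icache_interval_cycles(n, D, c_iL1):
    """n = instructions in interval, D = dispatch width, c_iL1 = miss delay."""
    return n / D + c_iL1

n, D, c_iL1 = 96, 4, 8
penalty = icache_interval_cycles(n, D, c_iL1) - n / D
print(penalty)  # 8.0
```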

Independence from Pipe Length

16 KB I-cache; ideal D-cache and predictor. Two different front-end pipeline lengths (4 and 8 stages); I-cache miss delay of 8 cycles. The penalty is independent of pipe length and similar across benchmarks.

[Bar chart: I-cache miss penalty in cycles for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, and vpr, with 4 vs. 8 front-end stages]

Branch Misprediction Interval

Total time = n/D + c_dr(D) + c_fe

n = number of instructions in the interval
D = decode/dispatch width
c_dr(D) = window-drain cycles; a function of width (and ILP)
c_fe = front-end pipeline length

[Figure: branch misprediction timeline - dispatch for n/D cycles, window drain time plus branch latency, then a pipeline-length re-fill]
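The same formula as a sketch (illustrative numbers):

```python
# Branch misprediction interval: total cycles = n/D + c_dr(D) + c_fe.
# c_dr is the window-drain / branch-resolution time (a function of width
# and ILP); c_fe is the front-end pipeline length. The per-miss penalty
# c_dr + c_fe can therefore exceed the pipeline length alone.

def branch_interval_cycles(n, D, c_dr, c_fe):
    return n / D + c_dr + c_fe

def branch_penalty(c_dr, c_fe):
    """Cycles lost relative to a miss-free interval."""
    return c_dr + c_fe

print(branch_penalty(c_dr=12, c_fe=5))  # 17
```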

Branch Resolution Time

Assumes the mispredicted branch is one of the last instructions to issue

[Stacked bar chart: for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, and vpr, the percentage breakdown across categories 0, 1, 2, 3, 4, 5, and >5]

Branch Misprediction Penalty

The branch penalty depends on interval length
• The penalty can be 2+ times the pipeline length
• The penalty is smaller for short intervals and larger for long intervals

See the ISPASS ’06 paper for more details

Long D-Cache Miss Interval

[Figure: long D-cache miss timeline - the load enters the window and issues; instructions enter the window at rate D and issue ramps up to steady state; during the miss latency the ROB fills and the issue window empties of issuable instructions; data returns from memory after the load resolution time, ROB fill time, and miss latency; total dispatch time is n/D]

Long D-Cache Miss Interval

For an isolated miss, total time = n/D - W/D + c_Lr(D) + c_L2

n = number of instructions in the interval
D = decode/dispatch width
W = window (ROB) size
c_Lr(D) = load resolution time; a function of width
c_L2 = L2 miss delay

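The isolated-miss formula can be sketched as follows (illustrative numbers):

```python
# Isolated long D-cache miss: total cycles = n/D - W/D + c_Lr(D) + c_L2.
# The -W/D term credits the time spent filling the ROB, which overlaps
# with the memory latency c_L2; c_Lr(D) is the load resolution time.

def long_dmiss_interval_cycles(n, D, W, c_Lr, c_L2):
    return n / D - W / D + c_Lr + c_L2

# With a 128-entry ROB, part of a 200-cycle miss is hidden behind
# useful dispatch (illustrative numbers):
print(long_dmiss_interval_cycles(n=400, D=4, W=128, c_Lr=32, c_L2=200))  # 300.0
```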

Miss Event Overlaps

Branch misprediction and I-cache miss effects “serialize”
• i.e., their penalties add linearly

Long D-cache misses may overlap with I-cache and branch-prediction misses (and with each other)
• Overlap with other long D-cache misses is the more important case
• Overlaps with branch mispredictions and I-cache misses are insignificant

[Diagram: overlap among branch mispredicts, I-cache misses, and long D-cache misses]

Overlapping Long D-Cache Misses

The spacing s/D reflects the amount of overlap; the total penalty is independent of s/D

[Figure: two overlapping misses - the 1st load enters the window and issues, the 2nd load issues s/D cycles later, the ROB fills, and both miss latencies overlap until load 1's data returns from memory]

Experimental Results

For each long miss, collect statistics on other misses within a “ROB distance”
• This is a trace statistic
• Assume W/D = c_Lr

[Bar chart: long D-cache miss penalty in cycles for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, and vpr - simulation vs. the analytical model]

Overall Performance

Sum over all intervals:

I-cache miss interval: n/D + c_iL1
Branch mispredict: n/D + c_dr + c_fe
Long D-cache miss (non-overlapping): n/D - W/D + c_Lr + c_L2

Collect the n/D terms: N_total/D

Account for “ceiling inefficiency”: ((D-1)/2D)·(m_iL1 + m_br + m_L2)

Overall Performance

Total cycles = N_total/D + ((D-1)/2D)·(m_iL1 + m_br + m_L2)
             + m_iL1 · c_iL1
             + m_br · (c_dr + c_fe)
             + m_L2 · (-W/D + c_Lr + c_L2)

TLB misses are handled similarly to L2 misses
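The total-cycles equation can be assembled into a single function. This is a sketch: the miss counts m_* and delays c_* are inputs the model user supplies, and the example numbers are invented.

```python
# Overall cycle count from the interval model:
#   N/D + ((D-1)/2D)*(m_iL1 + m_br + m_L2)          base + ceiling term
#   + m_iL1*c_iL1 + m_br*(c_dr + c_fe)              front-end miss events
#   + m_L2*(-W/D + c_Lr + c_L2)                     long D-cache misses

def total_cycles(N, D, W, m_iL1, m_br, m_L2,
                 c_iL1, c_dr, c_fe, c_Lr, c_L2):
    ceiling = (D - 1) / (2 * D) * (m_iL1 + m_br + m_L2)
    return (N / D + ceiling
            + m_iL1 * c_iL1
            + m_br * (c_dr + c_fe)
            + m_L2 * (-W / D + c_Lr + c_L2))

# With no miss events the model collapses to the ideal N/D cycles:
print(total_cycles(N=1000, D=4, W=128, m_iL1=0, m_br=0, m_L2=0,
                   c_iL1=8, c_dr=12, c_fe=5, c_Lr=32, c_L2=200))  # 250.0
```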

Accuracy

Decode width D = 4: average error 4.2%; max 8.6%
D = 2: average error 1.8%
D = 6: average error 5.6%
D = 8: average error 5.6%

Decode Efficiency

Compare with simulation
• D = 4

mcf is dominated by intervals of length 5 and 13
• Less efficient than the model would predict

This is an inherent inefficiency due to intervals
• It correlates strongly with the interval lengths

Convert From Cycles to Time

Important if pipeline depth is to be modeled
• Latch overheads become important

Start with a baseline 5-stage front end
• p_b = number of pipeline stages in the baseline

Allow an arbitrary number of stages
• p = number of pipeline stages
• Increase all latencies in proportion to the relative depth: multiply cycle counts by p/p_b

Convert total cycles to total time
• t_p = total pipeline logic latency; t_o = latch overhead
• Cycle time = t_p/p + t_o
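The two conversions above can be sketched directly (t_p = 55 and t_o = 1 are illustrative, chosen to echo the later t_p/t_o = 55 slide):

```python
# Converting cycles to time: cycle time = t_p/p + t_o, where t_p is the
# total pipeline logic latency, p the number of stages, and t_o the
# per-stage latch overhead. Miss delays expressed in cycles scale with
# depth: multiply by p/p_b (p_b = baseline stage count).

def cycle_time(t_p, p, t_o):
    return t_p / p + t_o

def scale_miss_delay(c_base, p, p_b):
    return c_base * p / p_b

# Deepening the pipeline shortens the cycle but inflates miss delays:
for p in (5, 10, 20):
    print(p, cycle_time(55.0, p, 1.0), scale_miss_delay(8, p, 5))
```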

Convert to Absolute Time

Total time = [N_total/D + ((D-1)/2D)·(m_iL1 + m_br + m_L2)] · (t_p/p + t_o)
           + m_iL1 · c_iL1 · (p/p_b) · (t_p/p + t_o)
           + m_br · (c_dr(p,D) + c_fe) · (p/p_b) · (t_p/p + t_o)
           + m_L2 · (-W/(Dp) + c_Lr(p,D) + c_L2) · (p/p_b) · (t_p/p + t_o)

TPI = Total time / N_total

Now, consider some of the terms in isolation

Base TPI + One Linear Miss Event

[Two plots: component TPI and total TPI vs. pipeline stages (5 to 35) for widths 2, 4, 6, and 8, with the miss-event contribution marked; both are computed from the total-time formula above]

Pipelining of Miss Events

[Plot: TPI vs. pipeline stages (5 to 35) for a fully pipelined unit with 1, 2, 3, and 4 miss events]

Not all paths are fully pipelined
• e.g., cache misses may not be fully pipelined
• A pipeline factor (0 ≤ f ≤ 1) can be added to a term
• Example, for an I-cache miss:
    m_iL1 · c_iL1 · (p/p_b) · (t_p/p + f_iL1·t_o)

[Plot: TPI vs. pipeline stages as the pipeline factor changes - fully pipelined, 0.5, 0.25, and non-pipelined]

Fetch Inefficiency

Inherent fetch inefficiency
• Due to the presence of misses
• As opposed to structural inefficiency
• More important for wider pipelines

[N_total/D + ((D-1)/2D)·(m_iL1 + m_br + m_L2)] · (t_p/p + t_o)

[Plot: TPI vs. pipeline stages for widths 2, 4, 6, and 8, with and without the inherent-inefficiency and overhead terms]

Miss Events Dependent on ROB Size

Miss events are dependent on ROB size
• And therefore dependent on depth/width for balanced designs

Branch mispredicts go up due to late update of the predictor. L2 miss behavior may be better or worse depending on overlaps
• A deeper pipeline means a longer miss penalty
• A longer ROB means more MLP

Balanced Superscalar Processor Design

Definition: at the i-W balance point
• Under ideal conditions, the achieved issue width i equals the designed issue width I, but decreasing W diminishes the achieved issue width
• For practical issue widths, there is enough ILP that balance can be achieved (see earlier work)
• Balance does not imply overall width/depth optimality

Provide adequate numbers of other resources
• Issue buffer, load/store buffers, rename registers, functional units, etc.
• Reducing resources below the adequate level reduces performance

Balanced Superscalar Processor Design

Choose the width/depth, then optimize the other elements based on the width/depth

[Diagram: issue width and pipeline depth have an inverse relationship (at the optimal point, wider issue implies a shallower pipeline); both drive ROB size through the β (~quadratic) relationship; I-fetch resources and commit width scale with the achieved width; the numbers of rename registers, load/store buffer sizes, functional units, and issue buffer size all scale linearly with ROB size]

Optimize Pipeline Depth

Start with a baseline 5-stage front end
• p_b = number of pipeline stages in the baseline

Evaluate 1x, 2x, 3x, 4x, and 5x depths
• Increase all latencies in proportion to depth: multiply by p/p_b

Convert total cycles to total time
• Cycle time = t_p/p + t_o
• p = number of stages; t_p = total pipeline logic latency; t_o = latch overhead

Total time = [N_total/D + ((D-1)/2D)·(m_iL1 + m_br + m_L2)] · (t_p/p + t_o)
           + m_iL1 · c_iL1 · (p/p_b) · (t_p/p + t_o)
           + m_br · (c_dr(p,D) + c_fe) · (p/p_b) · (t_p/p + t_o)
           + m_L2 · (-W/(Dp) + c_Lr(p,D) + c_L2) · (p/p_b) · (t_p/p + t_o)

TPI = Total time / N_total

Pipeline Depth Results

Use t_p/t_o = 55, as in Hartstein and Puzak
• Also illustrates the accuracy of the model
• Consider four typical benchmarks
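The depth optimization can be sketched as a toy sweep in the spirit of the model. Only t_p/t_o = 55 comes from the slide; the miss count and penalty numbers below are invented for illustration, so the exact optimum is illustrative, not a result.

```python
# Toy pipeline-depth sweep in the spirit of the interval model.
# t_p/t_o = 55 follows Hartstein and Puzak (per the slide); every other
# number here is invented for illustration.

T_P, T_O, P_B = 55.0, 1.0, 5   # total logic latency, latch overhead, baseline depth
N, D = 1_000_000, 4            # instructions and decode width
M_MISS = 50_000                # aggregate miss events (illustrative)
C_PEN_B = 15                   # baseline penalty cycles per miss event

def tpi(p):
    """Time per instruction at depth p: cycles(p) * cycle_time(p) / N."""
    cycles = N / D + M_MISS * C_PEN_B * p / P_B   # penalties scale with depth
    return cycles * (T_P / p + T_O) / N

best = min(range(5, 36), key=tpi)
print("optimal depth:", best)  # 2x the 5-stage baseline for these numbers
```

Deepening trades a shorter cycle against miss penalties that grow with p, which is what produces an interior optimum.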

Pipeline Depth Results

On average, 2x the baseline pipeline depth is optimal; consistent with H&P

Optimize Pipeline Width

In general, wider means higher performance (up to 8-wide). The optimal depth becomes shallower as width grows. There are diminishing returns with wider pipelines:
• 4 vs. 2: 13.3%; 6 vs. 4: 7.1%; 8 vs. 6: 2.9%

Short Interval Effects

With short intervals, the peak issue rate may never be reached. Example: assume 1 mispredict every 96 instructions
• e.g., the SPEC benchmark crafty with a 4K gshare predictor
• The maximum issue rate is never reached for D = 6 or 8

Yet there is still a benefit from wider pipelines

[Plot: IPC vs. cycle (0 to 60) within the interval, for D = 2, 4, 6, and 8]

The Benefit Does Not Come From Issue Width

The benefit comes from wider decode/dispatch width
• Get to the next I-cache miss sooner
• Resolve branch mispredicts sooner
• The benefit comes from faster ramp-up
• D = 8 ramps up faster than D = 6
• D = 8, I = 6 gives the same performance as D = 8, I = 8

[Plot: IPC vs. cycle (0 to 60), for D = 2, 4, 6, and 8]

Potential High-Performance Processor

Widen fetch, decode, and retire
• Keep issue relatively narrow

Lengthen the ROB
• And the related structures

[Figure: the superscalar block diagram again, with widened fetch/decode/retire paths and a lengthened reorder buffer]

Issue Buffer Sizing

Similar to ROB sizing, but use the average path rather than the average critical path (see Tejas Karkhanis' thesis)

[Scatter plot: issue buffer size vs. reorder buffer size, with linear fit y = 0.3115x]

Processor      ROB Size   Issue Buffer   Ratio
Intel Core         96           32        0.3
Power4            100           36        0.4
MIPS R10K          64           20        0.3
Pentium Pro        40           20        0.5
Alpha 21264        80           20        0.25
Opteron            72           24        0.3
AMD K5             16            4        0.25
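The fitted line y = 0.3115x gives a quick sizing rule. A sketch; real designs (as the table shows) round to implementation-friendly sizes:

```python
import math

# Rule of thumb from the scatter plot's linear fit:
# issue buffer size ~= 0.3115 * ROB size, rounded up.
def issue_buffer_size(rob_size, slope=0.3115):
    return math.ceil(slope * rob_size)

for rob in (40, 64, 96, 128):
    print(rob, issue_buffer_size(rob))
```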

Function Unit Demand Variation

[Plot: integer-ALU demand per cycle over roughly 100M instructions of gcc - actual demand vs. the mean, mean + 1 stdev, and mean + 2 stdev]

Function Unit Resources

Demand is proportional to the instruction mix, and depends on the program and its phases
• Collect phase-based data

The number of units must be an integer. Number of functional units of type k:

  F_k = ⌈(μ(D_k) + 2·σ(D_k)) · L_k⌉

• D_k = demand for unit k
• L_k = issue latency for unit k
• G_k = fraction of instructions using unit k

Use a similar approach for other hardware resources
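The sizing rule can be sketched as follows. This is an illustrative reading of the slide's formula (mean demand plus two standard deviations, scaled by the unit's issue latency); the example numbers are made up:

```python
import math

# Sizing functional units of type k: provision for the mean per-cycle
# demand plus two standard deviations, scaled by the unit's issue
# latency L_k, and round up to an integer number of units.

def num_units(mean_demand, stdev_demand, issue_latency):
    """F_k = ceil((mu + 2*sigma) * L_k)."""
    return math.ceil((mean_demand + 2 * stdev_demand) * issue_latency)

# e.g. an integer ALU with mean demand 1.0 op/cycle, stdev 0.5,
# single-cycle issue latency:
print(num_units(1.0, 0.5, 1))  # 2
```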

Comparison With H&P

H&P:

Total time = N_total/α · (t_p/p + t_o) + γ·N_H · (t_o·p + t_p)

Empirical: fit to detailed simulation data to determine α and γ; requires re-simulation if the caches, predictor, pipeline factor, etc. change

Interval model:

Total time = N_total/D · (t_p/p + t_o) + ((D-1)/2D)·(m_iL1 + m_br + m_L2) · (t_p/p + t_o)
           + m_iL1 · c_iL1 · (1/p_b) · (t_o·p_iL1 + t_p)
           + m_br · (c_dr(p,D)/p + c_fe) · (1/p_b) · (t_o·p + t_p)
           + m_L2 · (-W/(Dp) + c_Lr(p,D)/p + c_L2) · (1/p_b) · (t_o·p_L2 + t_p)

Mechanistic: bottom-up; no need to perform detailed simulation. Not all hazard terms are linear in p, and not all hazard terms are independent of D

Application: Performance Architecture

Construct performance counters based on the interval model: a total cycle counter plus one counter per miss-event type

Front-end miss events
• Front-end Miss Event Table (FMT)

Back-end miss events
• Begin counting when a full ROB stalls
• Increment the appropriate counter depending on the instruction at the ROB head: D-TLB miss, L2 D-cache miss, L1 D-cache miss, or long functional unit (divide)

Performance Architecture: FMT

One entry per outstanding branch

Tracks pre-window instructions
• Between fetch and the dispatch tail

Tracks in-flight instructions
• Between the ROB tail and the ROB head

Table increments
• On an I1, I2, or I-TLB miss, increment the counter pointed to by fetch
• The branch penalty counters between head and tail increment every cycle

Counter updates
• When a correctly predicted branch retires, update the I1, I2, and I-TLB counters
• When a mispredicted branch retires, update the branch mispredict counter (and continue counting until the first instruction is dispatched)

Simplified FMT

A single shared entry for I1, I2, and I-TLB misses. Instructions in the ROB are marked with an I-cache-miss or I-TLB-miss tag

When a marked instruction retires
• The shared entry is copied to the counters
• The ROB tag bits are cleared

When a mispredicted branch retires
• Add to the branch mispredict counter
• Clear the shared entries

Evaluation

Compare:
• Simulation: add miss events one at a time and measure the difference
• Simulation-rev: same as above, but with the miss events in reverse order
• Naïve: count miss events and multiply by a fixed penalty
• Naïve non-spec: similar to the above, but wrong-path events are not counted
• Power5: the IBM Power5 method
• FMT
• sFMT

Evaluation

[Figure: evaluation results]

Comparison

FMT and sFMT are the most accurate
• Naïve is the worst

FMT and sFMT are similar
• The simplified version is adequate

Power5 underestimates front-end miss events

Interval Model Development

• Michaud, Seznec, and Jourdan: issue transient
• Tejas Karkhanis' gap model: all transients
• Taha and Wills: interval (macro-block) model
• Hartstein and Puzak: optimal pipelines

Conclusions

Intervals yield a divide-and-conquer approach. The model supports intuition (and adds confidence to it). It's all about the transients
• The only things that count are cache misses and branch mispredictions

Applications: automated design, performance monitoring, very fast simulation, optimizing-compiler analysis, etc.

Analysis of pipeline limits
• Reinforces conventional wisdom
• We are close to the practical limits for depth and width

Extends to energy modeling (Tejas Karkhanis' PhD thesis)