Advanced Microarchitecture Lecture 2: Pipelining and Superscalar Review


Page 1: Advanced  Microarchitecture

Advanced Microarchitecture
Lecture 2: Pipelining and Superscalar Review

Page 2: Advanced  Microarchitecture

Pipelined Design
• Motivation: increase throughput with little increase in cost (hardware, power, complexity, etc.)
• Bandwidth or Throughput = Performance
  – BW = number of tasks / unit time
  – For a system that operates on one task at a time: BW = 1 / latency
• Pipelining can increase BW if there are many repetitions of the same operation/task
• Latency per task remains the same or increases

Page 3: Advanced  Microarchitecture

Pipelining Illustrated
• Unpipelined: one block of combinational logic with N gate delays; BW ≈ 1/N
• 2-stage pipeline: two blocks of combinational logic, N/2 gate delays each; BW ≈ 2/N
• 3-stage pipeline: three blocks of combinational logic, N/3 gate delays each; BW ≈ 3/N

Page 4: Advanced  Microarchitecture

Performance Model
• Starting from an unpipelined version with propagation delay T and BW = 1/T, split the logic into k stages of delay T/k, each followed by a latch of delay S:
  Perf_pipe = BW_pipe = 1 / (T/k + S)
  where S = latch delay and k = number of stages
(Figure: unpipelined logic of delay T with a single latch S vs. a k-stage pipeline with k blocks of delay T/k, each followed by a latch S.)
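As a concrete illustration, a minimal Python sketch of this model; the numbers are simply the T and S values used later in the k_opt example, and the loop shows how the speedup saturates as k grows:

    # Pipeline performance model from the slide: BW_pipe = 1 / (T/k + S)
    def pipelined_bw(T, S, k):
        """Throughput of a k-stage pipeline: logic delay T split into k stages,
        each stage adding one latch of delay S."""
        return 1.0 / (T / k + S)

    # Illustrative numbers only (arbitrary units): T = 400, S = 22, as used later
    T, S = 400.0, 22.0
    base_bw = 1.0 / T                      # unpipelined: BW = 1/T
    for k in (1, 2, 4, 8, 16, 32):
        bw = pipelined_bw(T, S, k)
        print(f"k={k:2d}  BW={bw:.5f}  speedup={bw / base_bw:.2f}")
    # Speedup grows with k but saturates near T/S, because the latch delay S
    # eventually dominates the cycle time.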

Page 5: Advanced  Microarchitecture

Hardware Cost Model
• Starting from an unpipelined version with hardware cost G, a k-stage pipeline costs:
  Cost_pipe = G + kL
  where L = latch cost (including control) and k = number of stages
(Figure: unpipelined logic of cost G with a single latch L vs. a k-stage pipeline with k blocks of cost G/k, each followed by a latch L.)

Page 6: Advanced  Microarchitecture

Cost/Performance Tradeoff
Cost/Performance:
  C/P = (Lk + G) / [1 / (T/k + S)] = (Lk + G)(T/k + S) = LT + GS + LSk + GT/k
Optimal cost/performance: find the k that minimizes C/P:
  d(C/P)/dk = d/dk [ (Lk + G)(T/k + S) ] = 0 + 0 + LS - GT/k^2 = 0
  k_opt = sqrt(GT / (LS))
(Figure: C/P as a function of pipeline depth k, with a minimum at k_opt.)

Page 7: Advanced  Microarchitecture

"Optimal" Pipeline Depth: k_opt
(Figure: Cost/Performance ratio C/P (x10^4) versus pipeline depth k, for k from 0 to 50; each curve has a minimum at its k_opt.)
• G=175, L=41, T=400, S=22
• G=175, L=21, T=400, S=11
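A small Python sketch that evaluates C/P and k_opt for the two parameter sets above (my own illustration of the formulas from the previous slide):

    import math

    def cost_perf(k, G, L, T, S):
        """Cost/performance ratio from the previous slide: (Lk + G) * (T/k + S)."""
        return (L * k + G) * (T / k + S)

    def k_opt(G, L, T, S):
        """Optimal depth from setting d(C/P)/dk = 0:  k_opt = sqrt(G*T / (L*S))."""
        return math.sqrt(G * T / (L * S))

    for (G, L, T, S) in [(175, 41, 400, 22), (175, 21, 400, 11)]:
        k = k_opt(G, L, T, S)
        print(f"G={G} L={L} T={T} S={S}:  k_opt={k:.1f}  "
              f"C/P at k_opt={cost_perf(k, G, L, T, S):.0f}")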

Page 8: Advanced  Microarchitecture

Cost?
• "Hardware Cost"
  – Transistor/gate count
    • Should include the additional logic needed to control the pipeline
  – Area (related to gate count)
  – Power!
    • More gates → more switching
    • More gates → more leakage
• Many metrics to optimize
• Very difficult to determine what really is "optimal"

Page 9: Advanced  Microarchitecture

Pipelining Idealism
• Uniform Suboperations
  – The operation to be pipelined can be evenly partitioned into uniform-latency suboperations
• Repetition of Identical Operations
  – The same operations are performed repeatedly on a large number of different inputs
• Repetition of Independent Operations
  – All repetitions of the same operation are mutually independent, i.e., no data dependences and no resource conflicts
Good examples: automobile assembly line, floating-point multiplier, instruction pipeline (?)

Page 10: Advanced  Microarchitecture

Instruction Pipeline Design
• Uniform suboperations … NOT!
  – Balance pipeline stages
    • Stage quantization to yield balanced stages
    • Minimize internal fragmentation (some waiting stages)
• Identical operations … NOT!
  – Unify instruction types
    • Coalesce instruction types into one "multi-function" pipe
    • Minimize external fragmentation (some idling stages)
• Independent operations … NOT!
  – Resolve data and resource hazards
    • Inter-instruction dependency detection and resolution
    • Minimize performance loss

Page 11: Advanced  Microarchitecture

The Generic Instruction Cycle
• The "computation" to be pipelined:
  1. Instruction Fetch (IF)
  2. Instruction Decode (ID)
  3. Operand(s) Fetch (OF)
  4. Instruction Execution (EX)
  5. Operand Store (OS), a.k.a. writeback (WB)
  6. Update Program Counter (PC)

Page 12: Advanced  Microarchitecture

The Generic Instruction Pipeline
Based on the obvious subcomputations:
  Instruction Fetch   → IF
  Instruction Decode  → ID
  Operand Fetch       → OF/RF
  Instruction Execute → EX
  Operand Store       → OS/WB

Page 13: Advanced  Microarchitecture

Balancing Pipeline Stages
Subcomputation latencies: T_IF = 6 units, T_ID = 2 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units
• Without pipelining:
  T_cyc = T_IF + T_ID + T_OF + T_EX + T_OS = 31
• Pipelined:
  T_cyc = max{T_IF, T_ID, T_OF, T_EX, T_OS} = 9
• Speedup = 31 / 9 ≈ 3.4
Can we do better in terms of either performance or efficiency?

Page 14: Advanced  Microarchitecture

Balancing Pipeline Stages
• Two methods for stage quantization
  – Merge multiple subcomputations into one stage
  – Subdivide a subcomputation into multiple smaller stages
• Recent/current trends
  – Deeper pipelines (more and more stages)
    • Only up to a certain point: then the cost function takes over
  – Multiple different pipelines/subpipelines
  – Pipelining of memory accesses (tricky)

Page 15: Advanced  Microarchitecture

Granularity of Pipeline Stages
• Coarser-grained machine cycle: 4 machine cycles per instruction
  – IF and ID merged: T_IF&ID = 8 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units
• Finer-grained machine cycle: 11 machine cycles per instruction
  – T_cyc = 3 units; T_IF/T_ID/T_OF/T_EX/T_OS = 6/2/9/5/9 units, subdivided into 2/1/3/2/3 stages of one 3-unit cycle each
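To make the comparison concrete, a small Python sketch using the stage latencies above and an assumed latch overhead of 1 unit (the overhead value is illustrative, not from the slide):

    # Subcomputation latencies from the slide (in "units"): IF=6, ID=2, OF=9, EX=5, OS=9
    stage_latency = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}
    total = sum(stage_latency.values())          # 31 units unpipelined

    designs = {
        "unpipelined":            [total],       # one 31-unit "stage"
        "coarse (4 cycles/inst)": [8, 9, 5, 9],  # IF and ID merged
        "fine (11 cycles/inst)":  [3] * 11,      # every subcomputation split into 3-unit stages
    }

    LATCH = 1  # assumed latch overhead per stage, illustrative only
    for name, stages in designs.items():
        cycle = max(stages) + LATCH
        print(f"{name:24s} cycle time = {cycle:2d}  "
              f"peak speedup vs. unpipelined = {(total + LATCH) / cycle:.2f}")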

Page 16: Advanced  Microarchitecture

Hardware Requirements
• Logic needed for each pipeline stage
• Register file ports needed to support all (relevant) stages
• Memory accessing ports needed to support all (relevant) stages
(Figure: the coarse-grained and fine-grained pipeline diagrams from the previous slide.)

Page 17: Advanced  Microarchitecture

Pipeline Examples
• MIPS R2000/R3000 (5 stages): IF, RD, ALU, MEM, WB
  – Mapping to the generic phases: IF → IF; RD → ID/OF; ALU, MEM → EX; WB → OS
• AMDAHL 470V/7 (12 stages): PC GEN, Cache Read, Cache Read, Decode, Read REG, Add GEN, Cache Read, Cache Read, EX 1, EX 2, Check Result, Write Result
  – The same generic phases (IF, ID, OF, EX, OS) are each spread over multiple machine cycles

Page 18: Advanced  Microarchitecture

Instruction Dependencies
• Data Dependence
  – True dependence (RAW): an instruction must wait for all required input operands
  – Anti-dependence (WAR): a later write must not clobber a still-pending earlier read
  – Output dependence (WAW): an earlier write must not clobber an already-finished later write
• Control Dependence (a.k.a. procedural dependence)
  – Conditional branches cause uncertainty in instruction sequencing
  – Instructions following a conditional branch depend on the execution of the branch instruction
  – Instructions following a computed branch depend on the execution of the branch instruction

Page 19: Advanced  Microarchitecture

Example: Quick Sort on MIPS

# for (; (j < high) && (array[j] < array[low]); ++j);
# $10 = j; $9 = high; $6 = array; $8 = low

      bge  $10, $9, $36
      mul  $15, $10, 4
      addu $24, $6, $15
      lw   $25, 0($24)
      mul  $13, $8, 4
      addu $14, $6, $13
      lw   $15, 0($14)
      bge  $25, $15, $36
$35:  addu $10, $10, 1
      . . .
$36:  addu $11, $11, -1
      . . .

Page 20: Advanced  Microarchitecture

Hardware Dependency Analysis
• The processor must handle
  – Register data dependencies: RAW, WAW, WAR
  – Memory data dependencies: RAW, WAW, WAR
  – Control dependencies

Page 21: Advanced  Microarchitecture

Terminology
• Pipeline Hazards
  – Potential violations of program dependencies
  – Must ensure program dependencies are not violated
• Hazard Resolution
  – Static method: performed at compile time in software
  – Dynamic method: performed at runtime using hardware (stall, flush, or forward)
• Pipeline Interlock
  – Hardware mechanism for dynamic hazard resolution
  – Must detect and enforce dependencies at runtime

Page 22: Advanced  Microarchitecture

Pipeline: Steady State
           t0   t1   t2   t3   t4   t5   ...
Inst_j     IF   ID   RD   ALU  MEM  WB
Inst_j+1        IF   ID   RD   ALU  MEM  WB
Inst_j+2             IF   ID   RD   ALU  MEM  WB
Inst_j+3                  IF   ID   RD   ALU  MEM  WB
Inst_j+4                       IF   ID   RD   ALU  MEM  WB

Page 23: Advanced  Microarchitecture

Pipeline: Data Hazard
           t0   t1   t2   t3   t4   t5   ...
Inst_j     IF   ID   RD   ALU  MEM  WB
Inst_j+1        IF   ID   RD   ALU  MEM  WB
Inst_j+2             IF   ID   RD   ALU  MEM  WB
Inst_j+3                  IF   ID   RD   ALU  MEM  WB
Inst_j+4                       IF   ID   RD   ALU  MEM  WB
(Same flow as the steady state; a dependent instruction reads its operands in RD before the earlier instruction that produces them has reached WB.)

Page 24: Advanced  Microarchitecture

Pipeline: Stall on Data Hazard
           t0   t1   t2   t3   t4   t5   ...
Inst_j     IF   ID   RD   ALU  MEM  WB
Inst_j+1        IF   ID   RD   ALU  MEM  WB
Inst_j+2             IF   ID   --- stalled in RD ---   ALU  MEM  WB
Inst_j+3                  IF   --- stalled in ID ---   RD   ALU  MEM  WB
Inst_j+4                       --- stalled in IF ---   ID   RD   ALU  MEM  ...

Page 25: Advanced  Microarchitecture

Different View
        t0    t1    t2    t3    t4    t5    t6    t7    t8    t9    t10
IF      Ij    Ij+1  Ij+2  Ij+3  Ij+4  Ij+4  Ij+4  Ij+4  ...
ID            Ij    Ij+1  Ij+2  Ij+3  Ij+3  Ij+3  Ij+3  Ij+4  ...
RD                  Ij    Ij+1  Ij+2  Ij+2  Ij+2  Ij+2  Ij+3  Ij+4
ALU                       Ij    Ij+1  nop   nop   nop   Ij+2  Ij+3  Ij+4
MEM                             Ij    Ij+1  nop   nop   nop   Ij+2  Ij+3
WB                                    Ij    Ij+1  nop   nop   nop   Ij+2
(Ij+2 is held in RD, with Ij+3 and Ij+4 stalled behind it, until the hazard clears; meanwhile the ALU, MEM, and WB stages execute nops, i.e., bubbles.)

Page 26: Advanced  Microarchitecture

Pipeline: Forwarding Paths
(Figure: the steady-state pipeline diagram with forwarding arrows from the outputs of the ALU and MEM stages back to the ALU inputs of younger instructions.)
• Many possible forwarding paths
• A MEM → ALU dependence (e.g., a load feeding the next instruction) still requires stalling even with forwarding paths

Page 27: Advanced  Microarchitecture

ALU Forwarding Paths
(Figure: the register file read in ID supplies src1 and src2 to the ALU; the destination register numbers of the instructions currently in the ALU and MEM stages are compared (==) against src1/src2, and on a match the corresponding result is forwarded in place of the register-file value.)
• A deeper pipeline may require additional forwarding paths
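A minimal Python sketch of the comparator logic in this figure; the tuple layout and names are assumptions made for the illustration, not a particular ISA's definition:

    def select_operand(src_reg, regfile_value, ex_stage, mem_stage):
        """Pick the value for one ALU source operand.

        ex_stage / mem_stage are (dest_reg, result, writes_reg) tuples for the
        instructions currently in the ALU and MEM stages; the youngest producer wins.
        """
        dest_ex, val_ex, wr_ex = ex_stage
        dest_mem, val_mem, wr_mem = mem_stage
        if wr_ex and dest_ex == src_reg:      # forward from the ALU output (== comparator)
            return val_ex
        if wr_mem and dest_mem == src_reg:    # forward from the MEM output (== comparator)
            return val_mem
        return regfile_value                  # no match: use the register-file read

    # Example: the instruction in the ALU stage is producing r5 = 42
    print(select_operand(5, 0, (5, 42, True), (7, 99, True)))   # -> 42 (forwarded)
    print(select_operand(3, 11, (5, 42, True), (7, 99, True)))  # -> 11 (from regfile)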

Page 28: Advanced  Microarchitecture

Pipeline: Control Hazard
           t0   t1   t2   t3   t4   t5   ...
Inst_i     IF   ID   RD   ALU  MEM  WB
Inst_i+1        IF   ID   RD   ALU  MEM  WB
Inst_i+2             IF   ID   RD   ALU  MEM  WB
Inst_i+3                  IF   ID   RD   ALU  MEM  WB
Inst_i+4                       IF   ID   RD   ALU  MEM  WB
(If Inst_i is a branch, the following instructions are fetched before its outcome is known.)

Page 29: Advanced  Microarchitecture

Pipeline: Stall on Control Hazard
           t0   t1   t2   t3   t4   t5   ...
Inst_i     IF   ID   RD   ALU  MEM  WB
Inst_i+1        --- stalled in IF ---   IF   ID   RD   ALU  MEM  WB
Inst_i+2                                     IF   ID   RD   ALU  MEM
Inst_i+3                                          IF   ID   RD   ALU
Inst_i+4                                               IF   ID   RD
(Fetch is stalled until the branch in Inst_i resolves; the following instructions then proceed one cycle apart.)

Page 30: Advanced  Microarchitecture

Pipeline: Prediction for Control Hazards
(Figure: Inst_i+2 through Inst_i+4 are fetched down the predicted path; when the branch resolves and the prediction is wrong, their remaining stages are turned into nops, i.e., the speculative state is cleared, fetch is resteered, and new Inst_i+2 through Inst_i+4 enter the pipeline.)

Page 31: Advanced  Microarchitecture

Going Beyond Scalar
• A simple (scalar) pipeline is limited to CPI ≥ 1.0
• "Superscalar" can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)
  – Superscalar means executing more than one scalar instruction in parallel (e.g., add + xor + mul)
  – Contrast with vector execution, which also performs multiple operations in parallel, but they must all be the same operation (e.g., four parallel additions)

Page 32: Advanced  Microarchitecture

Architectures for Instruction Parallelism
• Scalar pipeline (baseline)
  – Instruction/overlap parallelism = D
  – Operation latency = 1
  – Peak IPC = 1
(Figure: successive instructions plotted against time in cycles; with a D-deep pipeline, D different instructions are overlapped.)

Page 33: Advanced  Microarchitecture

Superscalar Machine
• Superscalar (pipelined) execution
  – Instruction parallelism = D x N
  – Operation latency = 1
  – Peak IPC = N per cycle
(Figure: successive instructions plotted against time in cycles; with an N-wide, D-deep pipeline, D x N different instructions are overlapped.)

Page 34: Advanced  Microarchitecture

Ex. Original Pentium
• Pipeline stages: Prefetch, Decode1, Decode2, Execute, Writeback
  – Prefetch: 4× 32-byte buffers
  – Decode1: decodes up to 2 instructions
  – Decode2: reads operands, address computation
• Two asymmetric pipes (u-pipe and v-pipe), each with its own Decode2, Execute, and Writeback
  – u-pipe only: shift, rotate, some FP
  – v-pipe only: jmp, jcc, call, fxch
  – Both pipes: mov, lea, simple ALU, push/pop, test/cmp

Page 35: Advanced  Microarchitecture

Pentium Hazards, Stalls
• "Pairing rules" (when can/can't two instructions execute at the same time?)
  – read/flow dependence:
      mov eax, 8
      mov [ebp], eax
  – output dependence:
      mov eax, 8
      mov eax, [ebp]
  – partial register stalls:
      mov al, 1
      mov ah, 0
  – functional unit rules
    • some instructions can never be paired: MUL, DIV, PUSHA, MOVS, some FP
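A toy Python sketch of a pairing check based only on the rules listed above; the real Pentium pairing rules have many more cases, and partial-register stalls are not modeled here:

    NEVER_PAIRED = {"mul", "div", "pusha", "movs"}   # from the slide; "some FP" omitted

    def can_pair(i1, i2):
        """i1, i2 = (opcode, dests, srcs), where dests/srcs are sets of register names."""
        op1, d1, s1 = i1
        op2, d2, s2 = i2
        if op1 in NEVER_PAIRED or op2 in NEVER_PAIRED:
            return False                   # functional-unit rule
        if d1 & s2:
            return False                   # read/flow (RAW) dependence
        if d1 & d2:
            return False                   # output (WAW) dependence
        return True

    # mov eax, 8 ; mov [ebp], eax  -> cannot pair (eax is produced, then used)
    print(can_pair(("mov", {"eax"}, set()), ("mov", set(), {"eax", "ebp"})))  # False
    # mov eax, 8 ; mov ebx, [ebp]  -> can pair
    print(can_pair(("mov", {"eax"}, set()), ("mov", {"ebx"}, {"ebp"})))       # True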

Page 36: Advanced  Microarchitecture

Limitations of In-Order Pipelines
• The CPI of in-order pipelines degrades very sharply if the machine parallelism is increased beyond a certain point
  – i.e., when N approaches the average distance between dependent instructions
  – Forwarding is no longer effective → must stall more often
  – The pipeline may never be full due to the frequency of dependency stalls

Page 37: Advanced  Microarchitecture

N Instruction Limit
• Ex. superscalar degree N = 4
  – Any dependency among the instructions issued together causes a stall; a dependent instruction must be at least N = 4 instructions away to avoid it
• On average, the parent-child separation is only about 5 instructions (Franklin and Sohi '92)
  – An average of 5 means there are many cases where the separation is < 4; each of these limits parallelism
• Pentium: superscalar degree N = 2 is reasonable; going much further encounters rapidly diminishing returns

Page 38: Advanced  Microarchitecture

In Search of Parallelism
• "Trivial" parallelism is limited
  – What is trivial parallelism?
    • In-order: sequential instructions that do not have dependencies
    • In all previous examples, every instruction executed either at the same time as or after earlier instructions
  – The previous slides show that superscalar execution quickly hits a ceiling
• So what is "non-trivial" parallelism? …

Page 39: Advanced  Microarchitecture

What is Parallelism?
• Work
  – T1: time to complete the computation on a sequential system
• Critical Path
  – T∞: time to complete the same computation on an infinitely-parallel system
• Average Parallelism
  – Pavg = T1 / T∞
• For a p-wide system
  – Tp ≥ max{T1/p, T∞}
  – If Pavg >> p, then Tp ≈ T1/p
Example: x = a + b; y = b * 2; z = (x - y) * (x + y)
(Figure: the dataflow graph of this code: the + and *2 nodes producing x and y feed the - and + nodes, which feed the final *.)
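A small Python sketch that computes T1, T∞, and Pavg for the example above, assuming every operation takes one time unit:

    import functools

    # Dataflow graph of: x = a + b; y = b * 2; z = (x - y) * (x + y)
    # Each node maps to the operation nodes it depends on (inputs a, b are free).
    deps = {
        "x":    [],            # a + b
        "y":    [],            # b * 2
        "diff": ["x", "y"],    # x - y
        "sum":  ["x", "y"],    # x + y
        "z":    ["diff", "sum"],
    }

    @functools.lru_cache(maxsize=None)
    def depth(node):
        """Length of the longest dependence chain ending at `node` (unit latency)."""
        return 1 + max((depth(d) for d in deps[node]), default=0)

    T1 = len(deps)                       # total work: 5 operations
    Tinf = max(depth(n) for n in deps)   # critical path: 3 operations
    print(f"T1 = {T1}, Tinf = {Tinf}, Pavg = {T1 / Tinf:.2f}")   # Pavg ~ 1.67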

Page 40: Advanced  Microarchitecture

ILP: Instruction-Level Parallelism
• ILP is a measure of the inter-dependencies between instructions
• Average ILP = number of instructions / length of the longest dependence path
  – code1: ILP = 1 (must execute serially); T1 = 3, T∞ = 3
  – code2: ILP = 3 (can execute at the same time); T1 = 3, T∞ = 1

code1:  r1 ← r2 + 1        code2:  r1 ← r2 + 1
        r3 ← r1 / 17               r3 ← r9 / 17
        r4 ← r0 - r3               r4 ← r0 - r10

Page 41: Advanced  Microarchitecture

ILP != IPC
• Instruction-level parallelism usually assumes infinite resources, perfect fetch, and unit latency for all instructions
• ILP is more a property of the program's dataflow
• IPC is the "real," observed metric of exactly how many instructions are executed per machine cycle, and it includes all of the limitations of a real machine
• The ILP of a program is an upper bound on the attainable IPC

Page 42: Advanced  Microarchitecture

Scope of ILP Analysis
  r1  ← r2  + 1
  r3  ← r1  / 17
  r4  ← r0  - r3        ILP = 1

  r11 ← r12 + 1
  r13 ← r19 / 17
  r14 ← r0  - r20       ILP = 3

Taken together, the two blocks contain 6 instructions with a longest dependence path of 3, so ILP = 2: the scope over which ILP is measured matters.

Page 43: Advanced  Microarchitecture

DFG Analysis
A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 - 1
K: R3 → ST 0[R1]
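A small Python sketch of the dependence analysis for this sequence; stores are modeled as reading both their data and address registers, and redefinitions in between are not filtered out, so the lists over-approximate the true dependences:

    # (label, dest_reg_or_None, [source_regs]) for the code above
    instrs = [
        ("A", "R1", ["R2", "R3"]),
        ("B", "R4", ["R5", "R6"]),
        ("C", "R1", ["R1", "R4"]),
        ("D", "R7", ["R1"]),
        ("E", None, ["R7"]),          # branch: reads R7, writes no register
        ("F", "R4", ["R7"]),
        ("G", "R1", ["R1"]),
        ("H", None, ["R4", "R1"]),    # store R4 to 0[R1]
        ("J", "R1", ["R1"]),
        ("K", None, ["R3", "R1"]),    # store R3 to 0[R1]
    ]

    raw, war, waw = [], [], []
    for i, (n1, d1, s1) in enumerate(instrs):
        for n2, d2, s2 in instrs[i + 1:]:
            if d1 and d1 in s2: raw.append((n1, n2, d1))   # read-after-write
            if d2 and d2 in s1: war.append((n1, n2, d2))   # write-after-read
            if d1 and d1 == d2: waw.append((n1, n2, d1))   # write-after-write
    print("RAW:", raw)
    print("WAR:", war)
    print("WAW:", waw)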

Page 44: Advanced  Microarchitecture

In-Order Issue, Out-of-Order Completion
• Issue = send an instruction to execution
• The issue stage needs to check:
  1. Structural dependence
  2. RAW hazard
  3. WAW hazard
  4. WAR hazard
(Figure: an in-order instruction stream issues, in order, into functional units INT, Fadd1-Fadd2, Fmul1-Fmul3, and Ld/St; because the units have different latencies, completion is out of order.)

Page 45: Advanced  Microarchitecture

Example
A: R1 = R2 + R3        Cycle 1: A, B
B: R4 = R5 + R6        Cycle 2: C
C: R1 = R1 * R4        Cycle 3: D
D: R7 = LD 0[R1]       Cycle 4: (no issue)
E: BEQZ R7, +32        Cycle 5: (no issue)
F: R4 = R7 - 3         Cycle 6: E, F
G: R1 = R1 + 1         Cycle 7: G, H
H: R4 → ST 0[R1]       Cycle 8: J, K
J: R1 = R1 - 1
K: R3 → ST 0[R1]       IPC = 10/8 = 1.25
(Figure: dataflow graph of A-K.)

Page 46: Advanced  Microarchitecture

Example (2)
A: R1 = R2 + R3        Cycle 1: A, B
B: R4 = R5 + R6        Cycle 2: C
C: R1 = R1 * R4        Cycle 3: D
D: R9 = LD 0[R1]       Cycles 4-5: E, F, G
E: BEQZ R7, +32        Cycle 6: H, J
F: R4 = R7 - 3         Cycle 7: K
G: R1 = R1 + 1
H: R4 → ST 0[R9]       IPC = 10/7 = 1.43
J: R1 = R9 - 1
K: R3 → ST 0[R1]
(Using different registers removes some of the dependences, so the same 10 instructions complete in 7 cycles instead of 8.)

Page 47: Advanced  Microarchitecture

Track with Simple Scoreboarding
• Scoreboard: a bit array with 1 bit per GPR
  – If the bit is not set: the register has valid data
  – If the bit is set: the register has stale data, i.e., some outstanding instruction is going to change it
• Issue in order: RD ← Fn(RS, RT)
  – If SB[RS] or SB[RT] is set → RAW, stall
  – If SB[RD] is set → WAW, stall
  – Else, dispatch to the FU for Fn and set SB[RD]
• Complete out of order
  – Update GPR[RD], clear SB[RD]
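A minimal Python sketch of this scoreboard policy (an illustration of the rules above, not the actual hardware):

    class Scoreboard:
        """1 bit per GPR: set = stale (an outstanding instruction will write it)."""
        def __init__(self, num_regs=32):
            self.pending = [False] * num_regs

        def try_issue(self, rd, rs, rt):
            """Issue RD <- Fn(RS, RT) in order; return False to stall."""
            if self.pending[rs] or self.pending[rt]:
                return False              # RAW: a source is still being produced
            if self.pending[rd]:
                return False              # WAW: an older write to RD is outstanding
            self.pending[rd] = True       # dispatch to the FU and mark RD busy
            return True

        def complete(self, rd):
            """Out-of-order completion: write GPR[rd] and clear its busy bit."""
            self.pending[rd] = False

    sb = Scoreboard()
    print(sb.try_issue(1, 2, 3))   # True: issues, R1 now pending
    print(sb.try_issue(4, 1, 5))   # False: RAW on R1, must stall
    sb.complete(1)
    print(sb.try_issue(4, 1, 5))   # True: R1 is valid again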

Page 48: Advanced  Microarchitecture

Out-of-Order Issue
• Need an extra stage/buffers for dependency resolution
(Figure: the same functional units (INT, Fadd1-Fadd2, Fmul1-Fmul3, Ld/St), now fronted by dependency-resolution (DR) buffers; the in-order instruction stream is dispatched into the buffers, instructions begin execution out of program order, and completion is out of order.)

Page 49: Advanced  Microarchitecture

OOO Scoreboarding
• Similar to in-order scoreboarding
  – Need new tables to track the status of individual instructions and functional units
  – Still enforce dependencies
    • Stall dispatch on WAW
    • Stall issue on RAW
    • Stall completion on WAR
• Limitations of scoreboarding?
  – Hints:
    • No structural hazards
    • Can always write a RAW-free code sequence:
      Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …
    • Think about the x86 ISA with only 8 registers
  – The finite number of registers in any ISA forces register names to be reused at some point → WAR and WAW stalls

Page 50: Advanced  Microarchitecture

Lessons thus Far
• More out-of-orderness → more ILP exposed, but more hazards
• Stalling is a generic technique to ensure correct sequencing
• A RAW stall is a fundamental requirement (?)
• Compiler analysis and scheduling can help (not covered in this course)

Page 51: Advanced  Microarchitecture

Ex. Tomasulo's Algorithm [IBM 360/91, 1967]
(Figure: the IBM 360/91 floating-point unit. Instructions arrive over the storage bus from the instruction unit into the Floating Operand Stack (FLOS) and are decoded; operands come from the Floating Point Buffers (FLB) and the Floating Point Registers (FLR, with busy bits and tags), and results destined for memory go to the Store Data Buffers (SDB). Reservation stations in front of the Adder and the Multiply/Divide unit hold sink/source operand fields with tags; results are broadcast on the Common Data Bus (CDB) and picked up by matching tags.)

Page 52: Advanced  Microarchitecture

FYI: Historical Note
• Tomasulo's algorithm (1967) was not the first
• Also at IBM, Lynn Conway proposed multi-issue dynamic instruction scheduling (OOO) in Feb 1966
  – The ideas got buried due to internal politics, changing project goals, etc.
  – But it is still the first proposal (as far as I know)

Page 53: Advanced  Microarchitecture

Modern Enhancements to Tomasulo's Algorithm

                    Tomasulo                  Modern
Machine width       Peak IPC = 1              Peak IPC = 6+
Structural deps     2 FP FUs, single CDB      6-10+ FUs, many forwarding buses
Anti-deps (WAR)     Operand copying           Renamed registers
Output-deps (WAW)   RS tags                   Renamed registers
True deps (RAW)     Tag-based forwarding      Tag-based forwarding
Exceptions          Imprecise                 Precise (requires a ROB)