20
logolund Lecture 4: EIT090 Computer Architecture Anders Ardö EIT – Electrical and Information Technology, Lund University Sept 23, 2009 A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 1 / 78 logolund Outline 1 Reiteration 2 MIPS R4000 3 Instruction Level Parallelism 4 Branch Prediction 5 Dependencies 6 Instruction Scheduling 7 Scoreboard A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 2 / 78 logolund Previous lecture Introduction to pipeline basics General principles of pipelining (review from EIT070 Computer Organization) Techniques to avoid pipeline stalls due to hazards What makes pipelining hard to implement? Support of multi-cycle instructions in a pipeline A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 3 / 78 logolund A Pipelined MIPS Datapath A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 4 / 78

A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Lecture 4: EIT090 Computer Architecture

Anders Ardö

EIT – Electrical and Information Technology, Lund University

Sept 23, 2009

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 1 / 78

logolund

Outline

1 Reiteration

2 MIPS R4000

3 Instruction Level Parallelism

4 Branch Prediction

5 Dependencies

6 Instruction Scheduling

7 Scoreboard

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 2 / 78

logolund

Previous lecture

Introduction to pipeline basics

General principles of pipelining(review from EIT070 Computer Organization)Techniques to avoid pipeline stalls due to hazardsWhat makes pipelining hard to implement?Support of multi-cycle instructions in a pipeline

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 3 / 78

logolund

A Pipelined MIPS Datapath

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 4 / 78

Page 2: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Pipelining Lessons

Pipelining doesn’t help latency of a single instruction, it helpsthroughput of the entire workloadPipeline rate is limited by the slowest pipeline stageMultiple tasks are operating simultaneouslyPotential speedup = Number of pipe stagesUnbalanced lengths of pipe stages reduces speedupTime to fill pipeline and time to drain reduces speedup

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 5 / 78

logolund

Summary

Pipelining:Speeds up throughput, not latencySpeedup ≤ #stages

Speedup =CPIunpipelined

Ideal CPI + # stall-cycles/instr∗

TCunpipelinedTCpipelined

≈ #stages1 + # stall-cycles/instruction

Hazards limit performance, generate stalls:Structural: need more HWData (RAW,WAR,WAW): need forwarding and compiler schedulingControl: delayed branch, branch prediction

Complications:Precise exceptions may be difficult to implementThe instruction set can be more or less suited for pipelining

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 6 / 78

logolund

Lecture 4 agenda

Appendix A.6-A.7, and Chapters 2.1-2.6, 2.9 in "ComputerArchitecture"

1 Reiteration

2 MIPS R4000

3 Instruction Level Parallelism

4 Branch Prediction

5 Dependencies

6 Instruction Scheduling

7 Scoreboard

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 7 / 78

logolund

Lecture 4

A case study of a deep pipeline: the MIPS R4000Introduction to Instruction Level Parallelism: ILPBranch predictionInstruction scheduling

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 8 / 78

Page 3: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Outline

1 Reiteration

2 MIPS R4000

3 Instruction Level Parallelism

4 Branch Prediction

5 Dependencies

6 Instruction Scheduling

7 Scoreboard

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 9 / 78

logolund

The MIPS R4000

R4000 - MIPS64:First (1992) true 64-bit architecture (addresses and data)Clock frequency (1997): 100 MHz-250 MHzDeep 8 stage pipeline (superpipelined)

Implications because of the deeper pipeline:load latency: 2 cyclesbranch latency: 3 cycles (incl one delay slot)=⇒ High demands on the compilerBypassing (forwarding) from more stages

Performance equation: CPI ∗ Tc must be lower for the longerpipeline to make it worthwhile

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 10 / 78

logolund

High Clock Frequency Implications

200 MHz clock frequency corresponds to 5 ns cycle time in eachpipe stage.While an ALU can be clocked in this frequency, the memoryaccesses become main bottlenecks.

The MIPS R4000 has 8-16 Kbytes I- and D-cachesExtra pipeline stages comes from decomposing memory access

Operations at a cache access:Address translationCache look-upHit/miss detection

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 11 / 78

logolund

The MIPS R4000 Pipeline

Instruction fetch is done in the IF and IS stages speculativelyindependent of a cache hit or missRF performs the same as ID in the classic 5-stage pipeline butalso check for instruction cache hit/missThe MEM stage is divided into three stages (DF, DS, and TC).Cache misses are detected in TC (Tag check)WB is the same as in classic 5-stage pipeline

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 12 / 78

Page 4: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Branch Penalties

One branch delay slotPredict-Not taken scheme squashes the next two sequentiallyfetched instructions if the branch is taken

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 13 / 78

logolund

Branch Penalties

branch takenInstruction Clock number

1 2 3 4 5 6 7 8 9Branch instr IF IS RF EX DF DS TC WBDelay slot IF IS RF EX DF DS TC WBStall s s s s s s sStall s s s s s sBranch target IF IS RF EX DF

branch not takenInstruction Clock number

1 2 3 4 5 6 7 8 9Branch instr IF IS RF EX DF DS TC WBDelay slot IF IS RF EX DF DS TC WBBranch instr +1 IF IS RF EX DF DS TCBranch instr +2 IF IS RF EX DF DS

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 14 / 78

logolund

Load Penalties

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 15 / 78

logolund

Load Penalties

Instruction Clock number1 2 3 4 5 6 7 8 9

LD R1,... IF IS RF EX DF DS TC WBDADD R2,R1,R4 IF IS RF stall stall EX DF DSDSUB R3,R1,R5 IF IS stall stall RF EX DF

Data is available after the DS stage (second stage in ’Datamemory’)=⇒ 2 load delay slotsFour bypasses instead of two because of the two extra pipestages to read from memory.

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 16 / 78

Page 5: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

R4000 Performance

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 17 / 78

logolund

R4000 Performance

CPI penaltiesAverage Load Branch FP data FP struct

CPI hazard hazard2.00 0.10 0.36 0.46 0.09

(SPEC92 CPI measurements)

The penalty for control hazards is very high for integer programs(up to 0.61)The penalty for FP data hazards is also highThe higher clock frequency compensates for the higher CPI

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 18 / 78

logolund

Outline

1 Reiteration

2 MIPS R4000

3 Instruction Level Parallelism

4 Branch Prediction

5 Dependencies

6 Instruction Scheduling

7 Scoreboard

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 19 / 78

logolund

Instruction Level Parallelism - ILP

ILP: Overlap execution of unrelated instructions: PipeliningTwo main approaches:

DYNAMIC =⇒ hardware detects parallelismSTATIC =⇒ software detects parallelism

Often a mix between both.

Pipeline CPI = Ideal CPI + Structural stalls+ Data hazard stalls + Control stalls

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 20 / 78

Page 6: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Instruction Level Parallelism - ILP

Example:DIVD F0,F2,F4ADDD F10,F1,F2SUBD F8,F8,F14

These instructions are independent and could be executed inparallelgcc has 17% control transfer instructions:

5 sequential instructions / branchWe must look beyond a single basic block to get more ILP

Loop level parallelism one opportunity, SW & HW

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 21 / 78

logolund

Loop-level Parallelism

for (i=1; i≤1000; i=i+1)x[i] = x[i] + 10;

There is very little available parallelism within an iterationHowever, there is parallelism between loop iterations; eachiteration is independent of the other

Potential speedup = 1000!!!

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 22 / 78

logolund

MIPS-code for the Loop

loop: LD F0, 0(R1) ; F0 = array elementADDD F4, F0, F2 ; Add scalar constantSD F4, 0(R1) ; Save resultDADDUI R1, R1, #-8 ; decrement array ptr.BNE R1,R2, loop ; reiterate if R1 != R2

Instruction producing Instruction using Latencyresult resultFP ALU op Another FP ALU op 3FP ALU op Store double 2Load double FP ALU op 1Load double Store double 0Integer op Integer op 0

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 23 / 78

logolund

Loop Showing Stalls

1 loop: LD F0, 0(R1) ; F0 = array element2 stall3 ADDD F4, F0, F2 ; Add scalar constant4 stall5 stall6 SD F4, 0(R1) ; Save result7 DADDUI R1, R1, #-8 ; decrement array ptr.8 stall9 BNE R1,R2 loop ; reiterate if R1 != R210 nop

How can we get rid of the stalls?

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 24 / 78

Page 7: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Restructured Loop Minimizing Stalls

1 loop: LD F0, 0(R1) ; F0 = array element2 DADDUI R1, R1, #-8 ; decrement array ptr.3 ADDD F4, F0, F2 ; Add scalar constant4 stall5 BNE R1, R2, loop ; reiterate if R1 != R26 SD F4, 8(R1) ; Save result

Swap DADDUI and SD by changing offset in SD6 clock cycles per iterationSophisticated compiler analysis requiredCan we do better?

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 25 / 78

logolund

Unroll Loop Four Times

1 loop: LD F0, 0(R1)2 ADDD F4, F0, F23 SD F4, 0(R1) ; drop DADDUI & BNE4 LD F6, -8(R1)5 ADDD F8, F6, F26 SD F8, -8(R1) ; drop DADDUI & BNE7 LD F10, -16(R1)8 ADDD F12, F10, F29 SD F12, -16(R1) ; drop DADDUI & BNE10 LD F14, -24(R1)11 ADDD F16, F14, F212 SD F16, -24(R1)13 DADDUI R1, R1, #-32 ; alter to 4*814 BNE R1, R2, loop15 NOP

15 + 4*(1+2) + 1 = 28 clock cycles, or 7 per iterationAssumes R1 is a multiple of 4

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 26 / 78

logolund

Scheduled Unrolled Loop

1 loop: LD F0, 0(R1)2 LD F6, -8(R1)3 LD F10, -16(R1)4 LD F14, -24(R1)5 ADDD F4, F0, F26 ADDD F8, F6, F27 ADDD F12, F10, F28 ADDD F16, F14, F29 SD F4, 0(R1)10 SD F8, -8(R1)11 DADDUI R1, R1, #-3212 SD F12, 16(R1) ; 16-32 = -1613 BNE R1, R2, loop14 SD F16, 8(R1) ; 8-32 = -24

14 clock cycles, or 3.5 per iterationWhen is it safe to move instructions?

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 27 / 78

logolund

Why loop unrolling works

Longer sequences of straight code without branches (longer basicblocks) allows for easier compiler static reschedulingLonger basic blocks also facilitates dynamic rescheduling such asScoreboard and Tomasulo’s algorithm

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 28 / 78

Page 8: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Outline

1 Reiteration

2 MIPS R4000

3 Instruction Level Parallelism

4 Branch Prediction

5 Dependencies

6 Instruction Scheduling

7 Scoreboard

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 29 / 78

logolund

Dynamic Branch Prediction

Branches limit performance because:Branch penaltiesLimit to available Instruction Level Parallelism

Solution: Dynamic branch prediction to predict the outcome ofconditional branches.Benefits:

Reduce the time to when the branch condition is knownReduce the time to calculate the branch target address

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 30 / 78

logolund

Branch History Table

Simplest branch predictionThe branch-prediction buffer is indexed by bits frombranch-instruction PC valuesThe bit corresponding to a branch indicates whether the branch ispredicted as taken or notWhen prediction is wrong: invert bit

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 31 / 78

logolund

A two-bit prediction scheme

Requires prediction to miss twice in order to change prediction=⇒ better performance1%-18% miss prediction frequency for SPEC89Integer programs have higher miss frequency than FP programs

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 32 / 78

Page 9: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Branch Target Buffer - BTB

Predicts branch target address in the IF stageCan be combined with 2-bit branch prediction

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 33 / 78

logolund

Branch Target Buffer (BTB) Algorithm

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 34 / 78

logolund

Branch prediction penalties

for a Branch Target Buffer scheme

Instruction Prediction Actual Penaltyin buffer branch cycles

Yes Taken Taken 0Yes Taken Not taken 2No Taken 2No Not taken 0

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 35 / 78

logolund

Outline

1 Reiteration

2 MIPS R4000

3 Instruction Level Parallelism

4 Branch Prediction

5 Dependencies

6 Instruction Scheduling

7 Scoreboard

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 36 / 78

Page 10: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Dependencies

Two instructions must be independent in order to execute inparallelThere are three general types of dependencies that limitparallelism:

Data dependenciesName dependenciesControl dependencies

Dependencies are properties of the programWhether a dependency leads to a hazard or not is a property ofthe pipeline implementation

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 37 / 78

logolund

Data Dependencies

An instruction j is data dependent on instruction i if:

Instruction i produces a result used by instr. j, orInstruction j is data dependent on instruction k and instr. k is datadependent on instr. i

Example:LD F0,0(R1)ADDD F4,F0,F2SD 0(R1),F4

Easy to detect for registers

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 38 / 78

logolund

Name Dependencies

Two instructions use same name (register or memory address) but donot exchange data

Anti-dependence (WAR if hazard in HW)ADDD F2,F0,F2 ; Must execute before LDLD F0,0(R1)

Output dependence (WAW if hazard in HW)ADDD F0,F2,F2 ; Must execute before LDLD F0,0(R1)

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 39 / 78

logolund

Where Are the Name Dependencies?

1 loop: LD F0, 0(R1)2 ADDD F4, F0, F23 SD F4, 0(R1) ; drop DADDUI & BNE4 LD F0, -8(R1)5 ADDD F4, F0, F26 SD F4, -8(R1) ; drop DADDUI & BNE7 LD F0, -16(R1)8 ADDD F4, F0, F29 SD F4, -16(R1) ; drop DADDUI & BNE10 LD F0, -24(R1)11 ADDD F4, F0, F212 SD F4, -24(R1)13 DADDUI R1, R1, #-32 ; alter to 4*814 BNE R1, R2 loop15 NOP

How can we remove them?

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 40 / 78

Page 11: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Control Dependencies

Determines order between an instruction and a branch instruction

Example:if Test1 then { S1 }if Test2 then { S2 }

S1 is control dependent on Test1S2 is control dependent on Test2; but not on Test1

We cannot move an instruction that is dependent on a branchbefore the branch instructionWe cannot move an instruction not control dependent on a branchafter the branch instruction

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 41 / 78

logolund

Outline

1 Reiteration

2 MIPS R4000

3 Instruction Level Parallelism

4 Branch Prediction

5 Dependencies

6 Instruction Scheduling

7 Scoreboard

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 42 / 78

logolund

Instruction scheduling

Scheduling is the process that determines when to start a particularinstruction, when to read its operands, and when to write its result,Target of scheduling: rearrange instructions to reduce stalls whendata or control dependencies are present

Static (compiler at compile-time) scheduling:Pro: May look into future; no HW support neededCon: Cannot detect all dependencies, esp. across branches

Dynamic (hardware at run-time) scheduling:Pro: Can potentially detect all dependenciesCon: Cannot look into the future; HW support needed whichcomplicates the pipeline hardware

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 43 / 78

logolund

Dynamic instruction scheduling

Key idea: allow instructions behind stall to proceed

DIVD F0,F2,F4 ; takes long timeADDD F10,F0,F8 ; stalls waiting for F0MULTD F12,F8,F14 ; Let this instr. bypass the ADDD

MULTD is not data dependent on anything in the pipelineEnables out-of-order execution =⇒ out-of-order completionID stage checks for structural and data dependencies

Scoreboard dates to CDC 6600 in 1963Tomasulo in IBM 360/91 in 1967

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 44 / 78

Page 12: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Dynamic instruction scheduling

Why in hardware at run time?

It works when we cannot know real dependence at compile timeIt makes the compiler simplerCode for one implementation runs well on another

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 45 / 78

logolund

Outline

1 Reiteration

2 MIPS R4000

3 Instruction Level Parallelism

4 Branch Prediction

5 Dependencies

6 Instruction Scheduling

7 Scoreboard

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 46 / 78

logolund

Scoreboard pipeline

Goal of scoreboarding is to maintain an execution rate of oneinstruction per clock cycle by executing an instruction as early aspossible.Instructions execute out-of-order when there are sufficientresources and no data dependencies.A scoreboard is a hardware unit that keeps track of

the instructions that are in the process of being executed,the functional units that are doing the executing,and the registers that will hold the results of those units.

A scoreboard centrally performs all hazard detection andresolution and thus controls the instruction progression from onestep to the next.

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 47 / 78

logolund

Scoreboard pipeline

Issue: Decode and check for structural hazardsRead operands: wait until no data hazards, then read operandsAll data hazards are handled by the scoreboard

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 48 / 78

Page 13: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Scoreboard complications

Out-of-order execution =⇒WAR & WAW hazards

Solutions for WAR:Stall instruction in the Write stage until all previously issuedinstructions (with a WAR hazard) have read their operands

For WAW, must detect hazard: stall in Issue until other instructioncompletesRAW hazards handled in Read OperandsScoreboard keeps track of dependencies and state of operations

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 49 / 78

logolund

Scoreboard functionality

Issue: An instruction is issued if:The needed functional unit is free (there is no structural hazard)No functional unit has a destination operand equal to thedestination of the instruction (resolves WAW hazards)

Read: Wait until no data hazards, then read operandsPerformed in parallel for all functional unitsResolves RAW hazards dynamically

EX: normal executionNotify the scoreboard when ready

Write: The instruction can update destination if:All instructions issued earlier have read their operands (resolvesWAR hazards)

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 50 / 78

logolund

Data structures in the scoreboard

1. Instruction status – keeps track of which of the 4 steps theinstruction is in2. Functional unit status – Indicates the state of the functionalunit (FU). 9 fields for each FU:

Busy : Indicates whether the unit is busy or notOp: Operation to perform in the unit (e.g. add or sub)Fi : Destination register nameFj, Fk : Source register namesQj, Qk : Name of functional unit producing regs Fj, FkRj, Rk : Flags indicating when Fj and Fk are ready

3. Register result status – Indicates which functional unit willwrite each register, if any.

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 51 / 78

logolund

Scoreboard Example

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 52 / 78

Page 14: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Detailed Scoreboard Pipeline Control

See figure A.54 in “Computer Architecture”

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 53 / 78

logolund

Scoreboard Example, CP 1

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 54 / 78

logolund

Scoreboard Example, CP 2

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 55 / 78

logolund

Scoreboard Example, CP 3

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 56 / 78

Page 15: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Scoreboard Example, CP 4

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 57 / 78

logolund

Scoreboard Example, CP 5

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 58 / 78

logolund

Scoreboard Example, CP 6

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 59 / 78

logolund

Scoreboard Example, CP 7

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 60 / 78

Page 16: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Scoreboard Example, CP 8a

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 61 / 78

logolund

Scoreboard Example, CP 8

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 62 / 78

logolund

Scoreboard Example, CP 9

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 63 / 78

logolund

Scoreboard Example, CP 11

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 64 / 78

Page 17: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Scoreboard Example, CP 12

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 65 / 78

logolund

Scoreboard Example, CP 13

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 66 / 78

logolund

Scoreboard Example, CP 14

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 67 / 78

logolund

Scoreboard Example, CP 16

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 68 / 78

Page 18: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Scoreboard Example, CP 17

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 69 / 78

logolund

Scoreboard Example, CP 19

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 70 / 78

logolund

Scoreboard Example, CP 20

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 71 / 78

logolund

Scoreboard Example, CP 21

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 72 / 78

Page 19: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Scoreboard Example, CP 22

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 73 / 78

logolund

Scoreboard Example, CP 61

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 74 / 78

logolund

Scoreboard Example, CP 62

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 75 / 78

logolund

Factors that limit performance

The scoreboard technique is limited by:The amount of parallelism available in codeThe number of scoreboard entries (window size)

A large window can look ahead across more instructionsThe number and types of functional units

Contention for functional units leads to structural hazardsThe presence of anti-dependencies and output dependencies

These lead to WAR and WAW hazards that are handled by stallingthe pipeline in the Scoreboard

Number of datapaths to registers

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 76 / 78

Page 20: A Pipelined MIPS Datapath - eit.lth.se · The MIPS R4000 Pipeline Instruction fetch is done in the IF and IS stages speculatively independent of a cache hit or miss RF performs the

logolund

Scoreboard Summary

Main advantage:managing multiple FUsout-of-order execution of multi-cycle operationsmaintaining all data dependencies (RAW, WAW, WAR)

Scoreboard limitations:single issue scheme, however: scheme is extendable tomultiple-issuein-order issueanti-dependencies and output dependencies may lead to WARand WAW stalls,no forwarding hardware all results go through the registers

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 77 / 78

logolund

Summary

ILP:Rescheduling and loop unrolling are important to take advantage ofpotential Instruction Level Parallelism

Dynamic instruction schedulingAn alternative to compile-time schedulingDoes not need recompilation to increase performanceUsed in most new processor implementations

Dynamic Branch Predictionreduce branch penalties by early prediction of conditional branchoutcomes

A. Ardö, EIT Lecture 4: EIT090 Computer Architecture Sept 23, 2009 78 / 78