1
CHAPTER 6
2
Pipelining
• Improve performance by increasing instruction throughput
Instruction class | Instruction memory | Register read | ALU | Data memory | Register write | Total (in ps)
Load word         | 200                | 100           | 200 | 200         | 100            | 800
Store word        | 200                | 100           | 200 | 200         |                | 700
R-format          | 200                | 100           | 200 |             | 100            | 600
Branch            | 200                | 100           | 200 |             |                | 500
Ideal speedup is number of stages in the pipeline. Do we achieve this?
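The speedup question can be checked numerically. A minimal Python sketch (not from the slides; stage names and times are taken from the table above):

```python
# Stage latencies in picoseconds, from the load-word row of the table.
stage_ps = {"IF": 200, "Reg read": 100, "ALU": 200, "MEM": 200, "Reg write": 100}

single_cycle_ps = sum(stage_ps.values())     # 800 ps per instruction
pipelined_cycle_ps = max(stage_ps.values())  # clock limited by the slowest stage: 200 ps

def total_time(n, cycle_ps, stages=1):
    """Time for n instructions: a k-stage pipeline needs n + k - 1 cycles
    (fill plus drain); single-cycle is the k = 1 case."""
    return (n + stages - 1) * cycle_ps

n = 1_000_000
speedup = total_time(n, single_cycle_ps) / total_time(n, pipelined_cycle_ps, stages=5)
print(round(speedup, 2))  # 4.0, not the ideal 5: the stages are unbalanced
```

The answer to the slide's question: no. Because the stages are unbalanced, the clock is set by the 200 ps stages, so speedup approaches 800/200 = 4 rather than the stage count of 5.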
[Figure: program execution order for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Top: single-cycle datapath, one instruction every 8 ns. Bottom: pipelined datapath with 2 ns stages (instruction fetch, Reg, ALU, data access, Reg); once the pipeline fills, one instruction completes every 2 ns.]
3
Pipelining
• What makes it easy?
  – all instructions are the same length
  – just a few instruction formats
  – memory operands appear only in loads and stores
• What makes it hard?
  – structural hazards: suppose we had only one memory
  – control hazards: need to worry about branch instructions
  – data hazards: an instruction depends on a previous instruction
• We’ll build a simple pipeline and look at these issues
• We’ll talk about modern processors and what really makes it hard:
  – exception handling
  – trying to improve performance with out-of-order execution, etc.
4
Hazards
A = B + E
C = B + F

Original order (two load-use stalls):
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  lw  $t4, 8($t0)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)

Reordered (no stalls):
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  lw  $t4, 8($t0)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)
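Why the reordering helps can be checked mechanically. A Python sketch (the tuple encoding of instructions is an assumption for illustration) that counts load-use pairs, each of which costs one stall cycle even with forwarding:

```python
# Each instruction is (op, dest, sources); stores have no register destination.
def load_use_hazards(program):
    """Return indices of instructions that read a register loaded by the
    immediately preceding lw."""
    hazards = []
    for i in range(1, len(program)):
        prev_op, prev_dest, _ = program[i - 1]
        _, _, srcs = program[i]
        if prev_op == "lw" and prev_dest in srcs:
            hazards.append(i)
    return hazards

original = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),   # reads $t2 right after its load
    ("sw",  None,  ["$t3", "$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),   # reads $t4 right after its load
    ("sw",  None,  ["$t5", "$t0"]),
]
# Hoist the third lw above the first add, as in the reordered sequence.
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]

print(load_use_hazards(original), load_use_hazards(reordered))  # [2, 5] []
```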
5
Basic Idea
What do we need to add to actually split the datapath into stages?
6
Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
Pipelined datapath
7
Five Stages (lw)
Memory and registers
  Left half: write
  Right half: read
8
Five Stages (lw)
9
Five Stages (lw)
10
What is wrong with this datapath?
11
• Can help with answering questions like:
  – How many cycles does it take to execute this code?
  – What is the ALU doing during cycle 4?
  – Use this representation to help understand datapaths
Graphically representing pipelines
12
Pipeline operation
• In a pipeline, one operation begins in every cycle
• Also, one operation completes in each cycle
• Each instruction takes 5 clock cycles
  – k cycles in general, where k is the pipeline depth
• When a stage is not used, no control needs to be applied
• In one clock cycle, several instructions are active
• Different stages are executing different instructions
• How to generate control signals for them is an issue
13
Pipeline control
• We have 5 stages. What needs to be controlled in each stage?
  – Instruction Fetch and PC Increment
  – Instruction Decode / Register Fetch
  – Execution
  – Memory Stage
  – Write Back
• How would control be handled in an automobile plant?
  – A fancy control center telling everyone what to do?
  – Should we use a finite state machine?
14
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. Control signals (RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, RegWrite, MemtoReg, PCSrc) are shown at the stages that consume them.]
Pipeline control
15
The control lines are grouped by the stage that uses them: execution/address calculation (RegDst, ALUOp1, ALUOp0, ALUSrc), memory access (Branch, MemRead, MemWrite), and write-back (RegWrite, MemtoReg).

Instruction | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Branch | MemRead | MemWrite | RegWrite | MemtoReg
R-format    | 1      | 1      | 0      | 0      | 0      | 0       | 0        | 1        | 0
lw          | 0      | 0      | 0      | 1      | 0      | 1       | 0        | 1        | 1
sw          | X      | 0      | 0      | 1      | 0      | 0       | 1        | 0        | X
beq         | X      | 0      | 1      | 0      | 1      | 0       | 0        | 0        | X
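The table can also be expressed as a lookup structure. A Python sketch (signal names follow the table; the dict layout is illustrative, with 'X' don't-cares kept as None):

```python
# Control values per instruction class; each group travels down the pipeline
# with the instruction, and each pipeline register drops the signals its
# stage has already consumed.
CONTROL = {
    "R-format": dict(RegDst=1, ALUOp1=1, ALUOp0=0, ALUSrc=0,
                     Branch=0, MemRead=0, MemWrite=0, RegWrite=1, MemtoReg=0),
    "lw":       dict(RegDst=0, ALUOp1=0, ALUOp0=0, ALUSrc=1,
                     Branch=0, MemRead=1, MemWrite=0, RegWrite=1, MemtoReg=1),
    "sw":       dict(RegDst=None, ALUOp1=0, ALUOp0=0, ALUSrc=1,
                     Branch=0, MemRead=0, MemWrite=1, RegWrite=0, MemtoReg=None),
    "beq":      dict(RegDst=None, ALUOp1=0, ALUOp0=1, ALUSrc=0,
                     Branch=1, MemRead=0, MemWrite=0, RegWrite=0, MemtoReg=None),
}

print(CONTROL["lw"]["MemRead"], CONTROL["sw"]["RegWrite"])  # 1 0
```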
Pipeline control
16
[Figure: pipelined datapath with control. The control unit in the ID stage generates the EX, MEM, and WB signal groups, which travel with the instruction through the ID/EX, EX/MEM, and MEM/WB pipeline registers.]
Datapath with control
17
• Problem with starting the next instruction before the first is finished
  – dependencies that “go backward in time” are data hazards
[Figure: multi-cycle pipeline diagram for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) over clock cycles CC 1 to CC 9. Register $2 holds 10 until it is written in CC 5 (10/–20), then –20; the and and or need $2 before CC 5, so their dependences point backward in time.]
Dependencies
18
• Use temporary results, don’t wait for them to be written
  – register file forwarding to handle read/write to the same register
  – ALU forwarding
Forwarding
[Figure: the same sequence (sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2)) with forwarding. The value –20 for $2 appears in EX/MEM at CC 4 and in MEM/WB at CC 5, and is forwarded from those pipeline registers to the ALU inputs of the dependent instructions.]
19
[Figure: pipelined datapath with a forwarding unit. EX/MEM.RegisterRd and MEM/WB.RegisterRd are compared against the Rs and Rt fields carried in ID/EX to steer the ALU-input muxes, shown for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2).]
20
• Load word can still cause a hazard:
  – an instruction tries to read a register right after a load instruction
    that writes to the same register
• Thus, we need a hazard detection unit to “stall” the instruction that
  follows the load
Can't always forward
[Figure: pipeline diagram for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7. The and needs $2 one cycle before the load’s memory access completes, so forwarding alone cannot resolve this hazard.]
21
Forwarding

• Forward from the EX/MEM pipeline register (to ALU input A):
  if (EX/MEM.RegWrite
      and (EX/MEM.RegisterRd != 0)
      and (EX/MEM.RegisterRd == ID/EX.RegisterRs))
• Forward from the MEM/WB pipeline register (to ALU input A):
  if (MEM/WB.RegWrite
      and (MEM/WB.RegisterRd != 0)
      and no EX/MEM forward applies to the same operand
      and (MEM/WB.RegisterRd == ID/EX.RegisterRs))
• The same tests with ID/EX.RegisterRt select the forward for ALU input B
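As a sanity check, the conditions can be written as a small function. A Python sketch (names are illustrative; register 0 is hard-wired to zero in MIPS, hence the != 0 test):

```python
def forward_select(src_reg, exmem_regwrite, exmem_rd, memwb_regwrite, memwb_rd):
    """Mux select for one ALU input: checking EX/MEM first gives the most
    recent result priority over MEM/WB; otherwise use the register file."""
    if exmem_regwrite and exmem_rd != 0 and exmem_rd == src_reg:
        return "EX/MEM"
    if memwb_regwrite and memwb_rd != 0 and memwb_rd == src_reg:
        return "MEM/WB"
    return "ID/EX"

# sub $2, $1, $3 followed by and $12, $2, $5: $2 sits in EX/MEM when the
# and reaches EX, so its first ALU operand is forwarded from EX/MEM.
print(forward_select(2, True, 2, False, 0))   # EX/MEM
print(forward_select(5, True, 2, False, 0))   # ID/EX
```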
22
[Figure: the same sequence with a one-cycle bubble inserted after lw $2, 20($1). The and is held in ID for an extra cycle (stretching execution to CC 10), after which forwarding resolves the remaining dependences.]
Stalling
• Hardware detection and no-op insertion is called stalling
• Stall the pipeline by keeping the instruction in the same stage
23
Example
24
25
Stall logic
• Stall condition (load-use hazard):
  if (ID/EX.MemRead                      // instruction in EX is a load
      and ((ID/EX.RegisterRt == IF/ID.RegisterRs)
        or (ID/EX.RegisterRt == IF/ID.RegisterRt)))
• Insert a no-op (no-operation)
  – by deasserting all control signals
• Stall the following instruction
  – by not writing the program counter (PCWrite)
  – by not writing the IF/ID registers
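The hazard-detection behavior can be sketched as a predicate plus the control actions it drives (Python; signal names are illustrative):

```python
def detect_load_use(idex_memread, idex_rt, ifid_rs, ifid_rt):
    """True when the instruction in EX is a load whose destination is read
    by the instruction currently in ID."""
    return idex_memread and idex_rt in (ifid_rs, ifid_rt)

def hazard_controls(stall):
    """On a stall: freeze the PC and IF/ID, and squash the ID/EX control
    signals to create a bubble."""
    return {"PCWrite": not stall, "IF/IDWrite": not stall, "BubbleInIDEX": stall}

# lw $2, 20($1) in EX while and $4, $2, $5 is in ID: stall one cycle.
stall = detect_load_use(True, 2, 2, 5)
print(stall, hazard_controls(stall))
```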
26
Pipeline with hazard detection
27
Summary
28
Forwarding Case Summary
29
Multi-cycle
30
Multi-cycle
31
Multi-cycle Pipeline
32
[Figure: pipelined datapath with control, repeated as context for branch hazards.]
Branch Hazards
33
• When we decide to branch, other instructions are in the pipeline!
• We are predicting “branch not taken”
  – need to add hardware for flushing instructions if we are wrong
[Figure: pipeline diagram for 40 beq $1, $3, 7 followed by 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; branch target 72 lw $4, 50($7). Three instructions after the branch are already in the pipeline when the branch outcome is known.]
Branch hazards
34
Solution to control hazards
• Branch prediction
  – we are predicting “branch not taken”
  – need to add hardware for flushing instructions if we are wrong
• Reduce branch penalty
  – by advancing the branch decision to the ID stage
  – compare the data read from the two registers in the ID stage
  – comparison for equality is a simpler design! (Why?)
  – still need to flush the instruction in the IF stage
• Make the hazard into a feature!
  – delayed branch slot: always execute the instruction following the branch
35
Branch detection in ID stage
36
Dynamic branch prediction
• 1-bit prediction
  – index a prediction table with the lower bits of the instruction address
  – one bit records whether the branch was last taken or not taken
  – disadvantage: poor performance in loops (mispredicts twice per loop)
• 2-bit prediction
  – use two bits instead of one
  – the condition must be satisfied twice before the prediction changes
• More sophisticated schemes count the number of times the branch is taken
[Figure: state diagram of the 2-bit branch predictor.]
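A 2-bit predictor is easy to model. A Python sketch (table size and starting state are assumptions):

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per table entry, indexed by the low bits
    of the branch address. States 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # start weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

p = TwoBitPredictor()
p.update(0x40, True)
p.update(0x40, True)
print(p.predict(0x40))   # True: two taken outcomes saturate toward "taken"
p.update(0x40, False)
print(p.predict(0x40))   # True: one not-taken outcome does not flip a 2-bit counter
```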
37
Correlating Branches
• Hypothesis: recent branches are correlated; that is, the behavior of
  recently executed branches affects the prediction of the current branch
• Idea: record the m most recently executed branches as taken or not taken,
  and use that pattern to select the proper branch history table
• In general, an (m, n) predictor records the last m branches to select
  among 2^m history tables, each with n-bit counters
  – the old 2-bit BHT is then a (0,2) predictor

  if (aa == 2) aa = 0;
  if (bb == 2) bb = 0;
  if (aa != bb) do_something();

• In this example, if the first two branches are both taken, then
  aa == bb == 0 and the third branch is surely not taken; that is exactly
  the correlation a global history can capture
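A minimal (2,2)-style correlating predictor can be modeled the same way. A Python sketch (sizes and starting state are assumptions): on an alternating taken/not-taken branch it becomes perfect once its counters warm up, where a history-free predictor keeps mispredicting.

```python
class CorrelatingPredictor:
    """(m, n) = (2, 2): the outcomes of the last 2 branches (global history)
    select one of 4 two-bit counters within each branch-address entry."""
    def __init__(self, index_bits=10, m=2):
        self.mask = (1 << index_bits) - 1
        self.hist_mask = (1 << m) - 1
        self.history = 0
        self.table = [[1] * (1 << m) for _ in range(1 << index_bits)]

    def predict(self, pc):
        return self.table[pc & self.mask][self.history] >= 2

    def update(self, pc, taken):
        ctr = self.table[pc & self.mask]
        h = self.history
        ctr[h] = min(3, ctr[h] + 1) if taken else max(0, ctr[h] - 1)
        self.history = ((h << 1) | int(taken)) & self.hist_mask

cp = CorrelatingPredictor()
pattern = [True, False] * 8        # a branch that strictly alternates
correct = 0
for taken in pattern:
    correct += (cp.predict(0x40) == taken)
    cp.update(0x40, taken)
print(correct)  # 14 of 16: only two early mispredictions while warming up
```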
38
[Figure: a correlating predictor. The branch address and a 2-bit global branch history together index the 2-bit per-branch predictors to produce a prediction.]
Correlating Branches
• (2,2) predictor
  – the behavior of the two most recent branches selects among four
    predictions for the next branch, and only the selected prediction
    is updated
[Figure: branch address plus 2-bit global branch history indexing the 2-bit per-branch predictors.]
39
Accuracy of Different Schemes
[Figure: frequency of mispredictions (0% to 18%) on SPEC89 benchmarks (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for three schemes: a 4,096-entry BHT with 2 bits per entry, an unlimited-entry BHT with 2 bits per entry, and a 1,024-entry (2,2) BHT. Labeled rates include 0%, 1%, 4%, 5%, 6%, and 11%.]
40
Branch Prediction
• Sophisticated Techniques:
  – a “branch target buffer” to help us look up the destination
  – correlating predictors that base the prediction on global behavior and
    recently executed branches (e.g., the prediction for a specific branch
    instruction is based on what happened in previous branches)
  – tournament predictors that use different types of prediction strategies
    and keep track of which one is performing best
  – a “branch delay slot” which the compiler tries to fill with a useful
    instruction (make the one-cycle delay part of the ISA)
• Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!
• Modern processors predict correctly 95% of the time!
41
Branch Target Buffer
• Branch Target Buffer (BTB): the address of the branch indexes the buffer
  to get both the prediction AND the branch target address (if taken)
  – note: must check that the entry matches this branch, since we can’t use
    the wrong branch address
• Return-instruction addresses are predicted with a stack
[Figure: a BTB entry supplies the predicted PC and the prediction, taken or not taken.]
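A BTB can be sketched as a small map from branch address to predicted target (Python; a real BTB is a fixed-size, tagged hardware table, so the dict here is only illustrative):

```python
class BTB:
    def __init__(self):
        self.entries = {}          # pc -> (predicted_target, predict_taken)

    def lookup(self, pc):
        """Predicted next PC in the IF stage. On a miss, or when the entry
        predicts not taken, just fall through to pc + 4."""
        target, taken = self.entries.get(pc, (None, False))
        return target if taken else pc + 4

    def update(self, pc, target, taken):
        self.entries[pc] = (target, taken)

btb = BTB()
btb.update(0x40, 0x60, True)       # a beq at 0x40 was taken to 0x60
print(hex(btb.lookup(0x40)), hex(btb.lookup(0x44)))  # 0x60 0x48
```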
42
Scheduling in delayed branching
43
Other issues in pipelines
• Exceptions
  – errors in the ALU for arithmetic instructions
  – memory non-availability
• Exceptions lead to a jump in the program
• However, the current PC value must be saved so that the program can
  return to it for recoverable errors
• Multiple exceptions can occur in a pipeline
• Preciseness of the exception location is important in some cases
• I/O exceptions are handled in the same manner
44
Exceptions
45
Improving Performance
• Try and avoid stalls! E.g., reorder these instructions:

  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
• Dynamic Pipeline Scheduling
  – hardware chooses which instructions to execute next
  – will execute instructions out of order (e.g., doesn’t wait for a
    dependency to be resolved, but rather keeps going!)
  – speculates on branches and keeps the pipeline full
    (may need to roll back if the prediction is incorrect)
• Trying to exploit instruction-level parallelism
46
Advanced Pipelining
• Increase the depth of the pipeline
• Start more than one instruction each cycle (multiple issue)
• Loop unrolling to expose more ILP (better scheduling)
• “Superscalar” processors
  – DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
• All modern processors are superscalar and issue multiple instructions,
  usually with some limitations (e.g., different “pipes”)
• VLIW: very long instruction word, static multiple issue
  (relies more on compiler technology)
• This class has given you the background you need to learn more!
47
Superscalar architecture --Two instructions executed in parallel
48
Dynamically scheduled pipeline
49
Motorola G4e
50
Intel Pentium 4
51
IBM PowerPC 970
52
Important facts to remember
• Pipelined processors divide execution into multiple steps
• However, pipeline hazards reduce performance
  – structural, data, and control hazards
• Data forwarding helps resolve data hazards
  – but not all hazards can be resolved
  – some data hazards require bubble (no-op) insertion
• Effects of control hazards are reduced by branch prediction
  – predict always taken, delayed slots, branch prediction table
• Structural hazards are resolved by duplicating resources
• Time to execute n instructions depends on
  – number of pipeline stages (k)
  – number of control hazards and the penalty of each
  – number of data hazards and the penalty of each
  – Cycles = n + k - 1 + (load hazard penalties) + (branch penalties)
• Load hazard penalty is 0 or 1 cycle
  – depending on when the data is used, with forwarding
• Branch penalty is 3, 2, 1, or 0 cycles depending on the scheme
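The cycle-count formula above, as a Python sketch (all hazard counts and penalties are parameters):

```python
def pipeline_cycles(n, k, load_hazards=0, load_penalty=1,
                    branches=0, branch_penalty=1):
    """Ideal n + k - 1 cycles (fill plus drain), plus stall cycles for
    load-use hazards and for branches that pay a penalty."""
    return n + k - 1 + load_hazards * load_penalty + branches * branch_penalty

# 100 instructions on a 5-stage pipeline, with 10 load-use stalls and
# 20 branches that each cost 1 cycle:
print(pipeline_cycles(100, 5, load_hazards=10, branches=20))  # 134
```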
53
Design and performance issues with pipelining
• Pipelined processors are not EASY to design
• Technology affects the implementation
• Instruction set design affects the performance
  – e.g., beq, bne
• More stages do not necessarily lead to higher performance!
54
Chapter 6 Summary
• Pipelining does not improve latency, but does improve throughput
[Figure: two slower-to-faster orderings of processor organizations, one by instructions per clock (IPC = 1/CPI, from 1 to several) and one by latency in instructions: multicycle (Section 5.5), single-cycle (Section 5.4), pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), and multiple issue with deep pipeline (Section 6.10).]