Lecture 2: Review of Pipelines

DAP Spr.‘98 ©UCB 1

Prof. David A. Patterson

Modified by M. Aboulhamid

Pipelining: It's Natural!

• Laundry Example

• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold

• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes

Sequential Laundry

• Sequential laundry takes 6 hours for 4 loads

• If they learned pipelining, how long would laundry take?

[Figure: sequential laundry timeline from 6 PM to midnight. Loads A, B, C, D run one after another in task order, each taking 30 + 40 + 20 minutes.]

Pipelined Laundry: Start work ASAP

• Pipelined laundry takes 3.5 hours for 4 loads
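The 6-hour and 3.5-hour figures can be checked with a small simulation. This is only a sketch: the function names are hypothetical, and it assumes each machine handles one load at a time and that a finished load can wait freely for the next machine.

```python
# Stage times in minutes, taken from the laundry example.
STAGES = [30, 40, 20]  # washer, dryer, "folder"


def sequential_minutes(loads, stages=STAGES):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads, stages=STAGES):
    """Loads overlap; each stage serves one load at a time, in order."""
    free_at = [0] * len(stages)  # time at which each stage becomes free
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for i, duration in enumerate(stages):
            start = max(ready, free_at[i])  # wait for the load and the machine
            free_at[i] = start + duration
            ready = free_at[i]
    return free_at[-1]


print(sequential_minutes(4) / 60)  # hours for 4 loads, one after another
print(pipelined_minutes(4) / 60)   # hours for 4 loads, overlapped
```

Under these assumptions the dryer is the bottleneck: after the first wash, a new load leaves the dryer every 40 minutes.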

[Figure: pipelined laundry timeline starting at 6 PM. Loads A, B, C, D overlap in task order; the timeline is divided into intervals of 30, 40, 40, 40, 40, and 20 minutes.]

Pipelining Lessons

• Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload

• Pipeline rate is limited by the slowest pipeline stage

• Multiple tasks operate simultaneously

• Potential speedup = Number of pipe stages

• Unbalanced lengths of pipe stages reduce speedup

• Time to “fill” the pipeline and time to “drain” it reduce speedup
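The last three lessons can be illustrated numerically. The sketch below assumes identical tasks flowing in order through the stages, in which case the pipelined time is one full pass plus one slowest-stage interval per additional task; `pipeline_speedup` is a hypothetical helper, not something from the lecture.

```python
def pipeline_speedup(tasks, stages):
    """Speedup of pipelined over sequential execution for identical tasks."""
    sequential = tasks * sum(stages)
    # Fill and drain cost one full pass; after that, one task completes
    # per slowest-stage interval.
    pipelined = sum(stages) + (tasks - 1) * max(stages)
    return sequential / pipelined


for n in (4, 100, 10_000):
    unbalanced = pipeline_speedup(n, [30, 40, 20])  # laundry stage times
    balanced = pipeline_speedup(n, [30, 30, 30])    # same total, balanced
    print(n, round(unbalanced, 2), round(balanced, 2))
```

With few tasks, fill and drain time dominate. With many tasks, the balanced pipeline approaches a speedup of 3 (the number of stages), while the unbalanced one is capped near 90/40 = 2.25 by its slowest stage.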


Computer Pipelines

• Execute billions of instructions, so throughput is what matters

• DLX desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores

+ Not visible to the programmer


5 Steps of DLX Datapath

• Instruction fetch

• Instruction decode/register fetch

• Execute/address calculation

• Memory access

• Write back

FIGURE 3.1 The implementation of the DLX datapath allows every instruction to be executed in four or five clock cycles.

Steps 1 & 2

• IF - instruction fetch step

– IR <-- Mem[PC]: fetch the next instruction from memory

– NPC <-- PC + 4: compute the new PC

• ID - instruction decode and register fetch step

– A <-- Regs[IR 6..10]

– B <-- Regs[IR 11..15]

– Register fetch is done in parallel with opcode decode

– Possible since register specifiers are encoded in fixed fields

– We may fetch register contents that we don’t use, but that is OK since the operands will be ready if the opcode is of the type that does use them

– Also calculate the sign-extended immediate in case that’s the value that the opcode needs
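The ID step can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's implementation: bit 0 is taken to be the most significant bit of the 32-bit instruction, the immediate is assumed to occupy bits 16..31, and the helper names are hypothetical.

```python
def bits(word, first, last):
    """Extract bits first..last of a 32-bit word, counting bit 0 as the MSB."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement number."""
    return value - 0x10000 if value & 0x8000 else value


def decode(ir, regs):
    """ID step: read both register operands and the immediate unconditionally."""
    a = regs[bits(ir, 6, 10)]               # A <-- Regs[IR 6..10]
    b = regs[bits(ir, 11, 15)]              # B <-- Regs[IR 11..15]
    imm = sign_extend16(bits(ir, 16, 31))   # assumed immediate field
    return a, b, imm


# Example: an arbitrary opcode, register specifiers 3 and 7, immediate -4.
regs = list(range(100, 132))  # made-up register contents
ir = (0x23 << 26) | (3 << 21) | (7 << 16) | 0xFFFC
print(decode(ir, regs))
```

Because the fields sit at fixed positions, `decode` does not need to know the opcode before reading the registers, which is the point the slide makes.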


Pipelined DLX Datapath

Figure 3.4, page 137

FIGURE 3.4 The datapath is pipelined by adding a set of registers, one between each pair of pipe stages.

Visualizing Pipelining

Figure 3.3, Page 133

[Figure: five instructions in program execution order, each passing through IM, Reg, ALU, DM, Reg, plotted against time in clock cycles CC 1 to CC 9.]

FIGURE 3.3 The pipeline can be thought of as a series of datapaths shifted in time.

It's Not That Easy for Computers

• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle

– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)

– Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)

– Control hazards: Pipelining of branches & other instructions that change the PC

– Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline

One Memory Port/Structural Hazards

Figure 3.6, Page 142

[Figure: Load followed by Instructions 1 to 4 in instruction order, each passing through Mem, Reg, ALU, Mem, Reg, plotted against time in clock cycles CC 1 to CC 8.]

FIGURE 3.6 A machine with only one memory port will generate a conflict whenever a memory reference occurs.

One Memory Port/Structural Hazards

Figure 3.7, Page 143

[Figure: Load, Instruction 1, Instruction 2, a stall (a row of five bubbles), then Instruction 3, plotted against time in clock cycles CC 1 to CC 8.]

FIGURE 3.7 The structural hazard causes pipeline bubbles to be inserted.

Speed Up Equation for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

Example: Dual-port vs. Single-port

• Machine A: Dual ported memory

• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate

• Ideal CPI = 1 for both

• Loads are 40% of instructions executed

SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe)

= Pipeline Depth

SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05))

= (Pipeline Depth/1.4) x 1.05

= 0.75 x Pipeline Depth

SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33

• Machine A is 1.33 times faster
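The arithmetic of this example can be reproduced directly from the speedup equation with ideal CPI = 1. In the sketch below, `speedup` is a hypothetical helper and the pipeline depth of 5 is an assumption; the depth cancels in the final ratio.

```python
def speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup with ideal CPI = 1.

    clock_ratio is (unpipelined clock cycle) / (pipelined clock cycle).
    """
    return depth / (1 + stall_cpi) * clock_ratio


DEPTH = 5  # assumed; any positive value gives the same ratio

# Machine A: dual-ported memory, no structural stalls.
speedup_a = speedup(DEPTH, stall_cpi=0.0)
# Machine B: 40% loads, each costing one stall cycle, but a 1.05x faster clock.
speedup_b = speedup(DEPTH, stall_cpi=0.4 * 1, clock_ratio=1.05)

print(speedup_b / DEPTH)      # fraction of the ideal speedup reached by B
print(speedup_a / speedup_b)  # how much faster A is than B
```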

Data Hazard on R1

Figure 3.9, page 147

Instruction order:

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

Pipeline stages: IF, ID/RF, EX, MEM, WB

FIGURE 3.9 The use of the result of the ADD instruction in the next three instructions causes a hazard, since the register is not written until after those instructions read it.

[Figure: ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11 in program execution order, plotted against time in clock cycles CC 1 to CC 6, with forwarding paths.]

FIGURE 3.10 A set of instructions that depend on the ADD result use forwarding paths to avoid the data hazard.

Three Generic Data Hazards

InstrI followed by InstrJ

• Read After Write (RAW)

InstrJ tries to read operand before InstrI writes it

[Figure: ADD R1, R2, R3; LW R4, 0(R1); SW 12(R1), R4 in program execution order, plotted against time in clock cycles CC 1 to CC 6.]

FIGURE 3.11 Stores require an operand during MEM, and forwarding of that operand is shown here.

Three Generic Data Hazards

InstrI followed by InstrJ

• Write After Read (WAR)

InstrJ tries to write operand before InstrI reads it

– Gets wrong operand

• Can’t happen in DLX 5 stage pipeline because:

– All instructions take 5 stages, and

– Reads are always in stage 2, and

– Writes are always in stage 5

Three Generic Data Hazards

InstrI followed by InstrJ

• Write After Write (WAW)

InstrJ tries to write operand before InstrI writes it

– Leaves wrong result (InstrI's, not InstrJ's)

• Can’t happen in DLX 5 stage pipeline because:

– All instructions take 5 stages, and

– Writes are always in stage 5

• Will see WAR and WAW in later more complicated pipes
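The three definitions can be expressed as a small classifier over register names. This is a sketch with hypothetical names (`Instr`, `dependences`); it reports which orderings must be preserved between InstrI and a later InstrJ. Whether a dependence turns into an actual hazard depends on the pipeline, and in the 5-stage DLX only RAW does.

```python
from collections import namedtuple

# dest is the register written (or None); srcs are the registers read.
Instr = namedtuple("Instr", ["text", "dest", "srcs"])


def dependences(i, j):
    """Name the dependences between InstrI and a later InstrJ."""
    found = []
    if i.dest is not None and i.dest in j.srcs:
        found.append("RAW")  # J reads what I writes
    if j.dest is not None and j.dest in i.srcs:
        found.append("WAR")  # J writes what I reads
    if i.dest is not None and i.dest == j.dest:
        found.append("WAW")  # J writes what I writes
    return found


add = Instr("add r1,r2,r3", "r1", ("r2", "r3"))
sub = Instr("sub r4,r1,r3", "r4", ("r1", "r3"))
print(dependences(add, sub))
```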

Data Hazard Even with Forwarding

Figure 3.12, Page 153

Instruction order:

lw r1, 0(r2)

sub r4,r1,r6

and r6,r1,r7

or r8,r1,r9

FIGURE 3.12 The load instruction can bypass its results to the AND and OR instructions, but not to the SUB, since that would mean forwarding the result in "negative time."

Data Hazard Even with Forwarding

Figure 3.13, Page 154

Instruction order:

lw r1, 0(r2)

sub r4,r1,r6

and r6,r1,r7

or r8,r1,r9

FIGURE 3.13 The load interlock causes a stall to be inserted at clock cycle 4, delaying the SUB instruction and those that follow by one cycle.

A = B + C

lw rb,b: IF ID EX MEM WB

lw rc,c: IF ID EX MEM WB

add ra,rb,rc: IF ID stall EX MEM WB

sw a,ra: IF stall ID EX MEM WB

Software Scheduling to Avoid Load Hazards

Try producing fast code for

a = b + c;

d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:

LW Rb,b

LW Rc,c

ADD Ra,Rb,Rc

SW a,Ra

LW Re,e

LW Rf,f

SUB Rd,Re,Rf

SW d,Rd

Fast code:

LW Rb,b

LW Rc,c

LW Re,e

ADD Ra,Rb,Rc

LW Rf,f

SW a,Ra

SUB Rd,Re,Rf

SW d,Rd
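The benefit of the reordering can be checked by counting load-use stalls in both sequences. The sketch assumes a forwarding pipeline in which the only data stall is a one-cycle interlock when an instruction reads a register loaded by the instruction immediately before it; the tuple encoding and `load_use_stalls` are hypothetical.

```python
def load_use_stalls(program):
    """Count one-cycle load interlocks in a list of (op, dest, srcs) tuples."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        # A load's value arrives too late for the very next instruction.
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]
print(load_use_stalls(slow), load_use_stalls(fast))
```

In the slow code both arithmetic instructions directly follow the load of one of their operands; in the fast code an independent instruction sits between each load and its first use.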

A = B + C; D = E + F

lw rb,b: IF ID EX MEM WB

lw rc,c: IF ID EX MEM WB

lw re,e: IF ID EX MEM WB

lw rf,f: IF ID EX MEM WB

add ra,rb,rc: IF ID EX MEM WB

sw a,ra: IF ID EX MEM

HW Change for Forwarding

Figure 3.20, Page 161

FIGURE 3.20 Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs.

Control Hazard on Branches: Three Stage Stall

Branch Stall Impact

• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!

• Two part solution:

– Determine branch taken or not sooner, AND

– Compute taken branch address earlier

• DLX branch tests if register = 0 or not 0

• DLX Solution:

– Move Zero test to ID/RF stage

– Adder to calculate new PC in ID/RF stage

– 1 clock cycle penalty for branch versus 3
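The CPI figure above follows from adding the average stall cycles per instruction to the base CPI. A minimal sketch, with a hypothetical function name, comparing the 3-cycle and 1-cycle penalties:

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, penalty_cycles):
    """Base CPI plus the average branch stall cycles per instruction."""
    return base_cpi + branch_fraction * penalty_cycles


# 30% branches, as on the slide.
print(cpi_with_branch_stalls(1, 0.30, 3))  # branch resolved late: 3-cycle stall
print(cpi_with_branch_stalls(1, 0.30, 1))  # zero test and adder moved to ID/RF
```

Under the same assumptions, cutting the penalty from 3 cycles to 1 lowers the CPI from about 1.9 to about 1.3.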

Pipelined DLX Datapath

Figure 3.22, page 163

Stages: Instruction Fetch; Instr. Decode/Reg. Fetch; Execute/Addr. Calc.; Memory Access; Write Back

This is the correct 1 cycle latency implementation!

FIGURE 3.22 The stall from branch hazards can be reduced by moving the zero test and branch target calculation into the ID phase of the pipeline.

[Bar chart: for each benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor), the percentage of instructions executed (axis 0% to 25%) that are forward conditional branches, backward conditional branches, and unconditional branches.]

FIGURE 3.24 The frequency of instructions (branches, jumps, calls, and returns) that may change the PC.

[Bar chart: for each benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor), the fraction of all conditional branches (axis 0% to 80%) that are forward taken and backward taken.]

FIGURE 3.25 Together the forward and backward taken branches account for an average of 67% of all conditional branches.

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken

– Execute successor instructions in sequence

– “Squash” instructions in pipeline if branch actually taken

– Advantage of late pipeline state update

– 47% DLX branches not taken on average

– PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken

– 53% DLX branches taken on average

– But haven’t calculated branch target address in DLX

» DLX still incurs 1 cycle branch penalty

» Other machines: branch target known before outcome

Four Branch Hazard Alternatives

#4: Delayed Branch

– Define branch to take place AFTER a following instruction

branch instruction

sequential successor 1

sequential successor 2

........

sequential successor n

branch target if taken

(Sequential successors 1 through n form the branch delay of length n.)

– 1 slot delay allows proper decision and branch target address in 5 stage pipeline

– DLX uses this

Delayed Branch

• Where to get instructions to fill branch delay slot?

– Before branch instruction

– From the target address: only valuable when branch taken

– From fall through: only valuable when branch not taken

– Cancelling branches allow more slots to be filled

• Compiler effectiveness for single branch delay slot:

– Fills about 60% of branch delay slots

– About 80% of instructions executed in branch delay slots useful in computation

– About 50% (60% x 80%) of slots usefully filled

• Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Evaluating Branch Alternatives

Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall

Stall pipeline       3                1.42   3.5                      1.0

Predict taken        1                1.14   4.4                      1.26

Predict not taken    1                1.09   4.5                      1.29

Delayed branch       0.5              1.07   4.6                      1.31

Conditional & Unconditional = 14%, 65% change PC

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
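The CPI column can be recomputed from the formula. The sketch below rests on assumptions not stated on the slide: a pipeline depth of 5, and that predict-not-taken pays its 1-cycle penalty only on the 65% of branches that change the PC. The names are hypothetical.

```python
BRANCH_FREQ = 0.14      # conditional and unconditional branches
CHANGE_PC = 0.65        # fraction of those branches that change the PC
DEPTH = 5               # assumed pipeline depth

# Effective penalty in cycles per branch for each scheme.
effective_penalty = {
    "Stall pipeline": 3,
    "Predict taken": 1,
    "Predict not taken": 1 * CHANGE_PC,  # assumed: penalty only when PC changes
    "Delayed branch": 0.5,
}


def branch_cpi(penalty):
    """CPI with ideal CPI = 1 plus average branch stall cycles."""
    return 1 + BRANCH_FREQ * penalty


def branch_speedup(penalty):
    """Pipeline depth / (1 + branch frequency x branch penalty)."""
    return DEPTH / branch_cpi(penalty)


stall_cpi = branch_cpi(effective_penalty["Stall pipeline"])
for scheme, penalty in effective_penalty.items():
    cpi = branch_cpi(penalty)
    print(f"{scheme:18s} CPI={cpi:.2f} "
          f"v. unpipelined={branch_speedup(penalty):.2f} "
          f"v. stall={stall_cpi / cpi:.2f}")
```

Under these assumptions the CPI values match the table. The computed speedups come out close to the table's but not always identical in the last digit, which suggests the slide rounded intermediate values differently.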


Pipelining Introduction Summary

• Just overlap tasks, and easy if tasks are independent

• Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle Unpipelined}}{\text{Clock Cycle Pipelined}}$$

• Hazards limit performance on computers:

– Structural: need more HW resources

– Data (RAW, WAR, WAW): need forwarding, compiler scheduling

– Control: delayed branch, prediction