DAP Spr.‘98 ©UCB 1
Lecture 2: Review of Pipelines
Prof. David A. Patterson
Modified by M. Aboulhamid
Pipelining: It’s Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Figure: timeline from 6 PM to midnight, task order vs. time. Loads A, B, C, and D run back to back, each taking 30 + 40 + 20 minutes.]
Pipelined Laundry: Start Work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
[Figure: timeline from 6 PM to midnight, task order vs. time. Loads A, B, C, and D overlap; the timeline is divided into segments of 30, 40, 40, 40, 40, and 20 minutes.]
Pipelining Lessons
• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = Number pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” pipeline and time to “drain” it reduce speedup
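The laundry numbers can be checked with a small timing model. This is only a sketch: the function names are made up for illustration, and it assumes each stage (washer, dryer, "folder") handles one load at a time and a load moves on as soon as the next stage is free.

```python
def sequential_time(stage_minutes, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_time(stage_minutes, loads):
    """A load enters a stage as soon as it has left the previous stage
    and the stage itself is free."""
    stage_free_at = [0] * len(stage_minutes)  # when each stage next becomes free
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            ready = start + minutes
            stage_free_at[s] = ready
        finish = ready
    return finish


stages = [30, 40, 20]  # washer, dryer, "folder" (minutes)
print(sequential_time(stages, 4) / 60)  # 6.0 hours
print(pipelined_time(stages, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 minutes: the 40-minute dryer sets the rate, and the first wash and last fold are the fill and drain time.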
Computer Pipelines
• Execute billions of instructions, so throughput is what matters
• DLX desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores
+ Not visible to the programmer
5 Steps of DLX Datapath
[Figure: DLX datapath in five steps: instruction fetch; instruction decode/register fetch; execute/address calculation; memory access; write back.]
FIGURE 3.1 The implementation of the DLX datapath allows every instruction to be executed in four or five clock cycles.
Steps 1 & 2
• IF - instruction fetch step
– IR <-- Mem[PC]: fetch the next instruction from memory
– NPC <-- PC + 4: compute the new PC
• ID - instruction decode and register fetch step
– A <-- Regs[IR 6..10]
– B <-- Regs[IR 11..15]
– Register fetch is done in parallel with opcode decode; this is possible since register specifiers are encoded in fixed fields
– We may fetch register contents that we don’t use, but that is OK since the operands will be ready if the opcode is of the type that does use them
– Also calculate the sign-extended immediate in case that’s the value that the opcode needs
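The fixed-field idea can be sketched in a few lines. This is an illustration only: it assumes the IR 6..10 notation numbers bit 0 as the most significant bit of a 32-bit instruction word, and the helper names and the example word are made up.

```python
def bits(word, first, last, width=32):
    """Extract bits first..last of a word, numbering bit 0 as the MOST
    significant bit (an assumption about the IR 6..10 notation)."""
    length = last - first + 1
    shift = width - 1 - last
    return (word >> shift) & ((1 << length) - 1)


def sign_extend(value, from_bits=16):
    """Interpret a from_bits-wide field as a two's-complement number."""
    sign = 1 << (from_bits - 1)
    return (value ^ sign) - sign


def decode(ir):
    """ID step sketch: all fields are pulled out without looking at the opcode."""
    return {
        "opcode": bits(ir, 0, 5),
        "rs1": bits(ir, 6, 10),    # selects A <-- Regs[IR 6..10]
        "rs2": bits(ir, 11, 15),   # selects B <-- Regs[IR 11..15]
        "imm": sign_extend(bits(ir, 16, 31)),
    }


# Hypothetical instruction word: opcode 35, register fields 2 and 1, immediate -4.
word = (35 << 26) | (2 << 21) | (1 << 16) | 0xFFFC
print(decode(word))  # {'opcode': 35, 'rs1': 2, 'rs2': 1, 'imm': -4}
```

Because the register fields sit at the same positions in every format, the register file can be read before the opcode is known.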
Pipelined DLX Datapath (Figure 3.4, page 137)
[Figure: pipelined DLX datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]
FIGURE 3.4 The datapath is pipelined by adding a set of registers, one between each pair of pipe stages.
Visualizing Pipelining (Figure 3.3, Page 133)
[Figure: five instructions in program execution order, each using IM, Reg, ALU, DM, Reg in successive clock cycles, over CC 1 to CC 9.]
FIGURE 3.3 The pipeline can be thought of as a series of datapaths shifted in time.
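The cycle count in the Figure 3.3 diagram (five instructions finishing in CC 9) follows from a simple formula. A minimal sketch, assuming an ideal pipeline with no stalls; the function names are illustrative:

```python
def pipelined_cycles(instructions, stages=5):
    """Cycles for an ideal pipeline: fill the pipe once, then one
    instruction completes per cycle."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


def unpipelined_cycles(instructions, stages=5):
    """Cycles if each instruction runs all stages before the next starts."""
    return stages * instructions


print(pipelined_cycles(5))    # 9
print(unpipelined_cycles(5))  # 25
```

For long instruction streams the ratio of the two counts approaches the number of stages, which is the "potential speedup" from the laundry lessons.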
It’s Not That Easy for Computers
• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
– Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
– Control hazards: Pipelining of branches & other instructions that change the PC
– Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline
One Memory Port/Structural Hazards (Figure 3.6, Page 142)
[Figure: instruction order vs. time (clock cycles CC 1 to CC 8) for Load, Instr 1, Instr 2, Instr 3, Instr 4, each using Mem, Reg, ALU, Mem, Reg.]
FIGURE 3.6 A machine with only one memory port will generate a conflict whenever a memory reference occurs.
One Memory Port/Structural Hazards (Figure 3.7, Page 143)
[Figure: instruction order vs. time (clock cycles CC 1 to CC 8) for Load, Instr 1, Instr 2, Stall, Instr 3; the stall appears as a row of bubbles.]
FIGURE 3.7 The structural hazard causes pipeline bubbles to be inserted.
Speed Up Equation for Pipelining
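\[ \text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction} \]
\[ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}} \]
With Ideal CPI = 1:
\[ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}} \]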
Example: Dual-port vs. Single-port
• Machine A: Dual ported memory
• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe) = Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe/1.05)) = (Pipeline Depth/1.4) x 1.05 = 0.75 x Pipeline Depth
SpeedUpA/SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
• Machine A is 1.33 times faster
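The comparison can be reproduced numerically. This is a sketch under the slide's assumptions (ideal CPI of 1, one stall cycle per load on Machine B); the function name and the depth value of 5 are illustrative, and the depth cancels in the ratio.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup over the unpipelined machine when ideal CPI is 1.
    clock_ratio = unpipelined clock cycle time / pipelined clock cycle time."""
    return depth / (1.0 + stall_cpi) * clock_ratio


depth = 5  # assumed value; it cancels when the two speedups are divided

speedup_a = pipeline_speedup(depth, stall_cpi=0.0)                        # dual ported
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)  # single ported

print(round(speedup_b / depth, 2))      # 0.75
print(round(speedup_a / speedup_b, 2))  # 1.33
```

The 5% faster clock does not make up for stalling on 40% of instructions.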
Data Hazard on R1 (Figure 3.9, page 147)
[Figure: instruction order vs. time (clock cycles CC 1 to CC 6); each instruction passes through IF, ID/RF, EX, MEM, WB.]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
FIGURE 3.9 The use of the result of the ADD instruction in the next three instructions causes a hazard, since the register is not written until after those instructions read it.
[Figure: the same ADD, SUB, AND, OR, XOR sequence over clock cycles CC 1 to CC 6, with forwarding paths from ADD to the later instructions.]
FIGURE 3.10 A set of instructions that depend on the ADD result use forwarding paths to avoid the data hazard.
Three Generic Data Hazards
InstrI followed by InstrJ
• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
[Figure: ADD R1, R2, R3; LW R4, 0(R1); SW 12(R1), R4 over clock cycles CC 1 to CC 6, with forwarding paths.]
FIGURE 3.11 Stores require an operand during MEM, and forwarding of that operand is shown here.
Three Generic Data Hazards
InstrI followed by InstrJ
• Write After Read (WAR): InstrJ tries to write operand before InstrI reads it
– Gets wrong operand
• Can’t happen in DLX 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Three Generic Data Hazards
InstrI followed by InstrJ
• Write After Write (WAW): InstrJ tries to write operand before InstrI writes it
– Leaves wrong result ( InstrI not InstrJ )
• Can’t happen in DLX 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in later more complicated pipes
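The three cases can be told apart by comparing register names. The sketch below is illustrative: the tuple representation and function name are assumptions, and it only reports which dependences exist between InstrI and InstrJ. Whether a dependence becomes a real hazard depends on the pipeline; in the 5 stage DLX pipe only RAW can.

```python
def classify_dependences(instr_i, instr_j):
    """Return the dependence types between an earlier instruction I and a
    later instruction J. Each instruction is (dest_register, [source_registers]);
    dest_register is None for instructions that write no register."""
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    found = []
    if dest_i is not None and dest_i in srcs_j:
        found.append("RAW")  # J reads what I writes
    if dest_j is not None and dest_j in srcs_i:
        found.append("WAR")  # J writes what I reads
    if dest_i is not None and dest_i == dest_j:
        found.append("WAW")  # both write the same register
    return found


add = ("r1", ["r2", "r3"])  # add r1,r2,r3
sub = ("r4", ["r1", "r3"])  # sub r4,r1,r3
print(classify_dependences(add, sub))  # ['RAW']
```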
Data Hazard Even with Forwarding (Figure 3.12, Page 153)
[Figure: instruction order vs. time (clock cycles CC 1 to CC 5).]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
FIGURE 3.12 The load instruction can bypass its results to the AND and OR instructions, but not to the SUB, since that would mean forwarding the result in "negative time."
Data Hazard Even with Forwarding (Figure 3.13, Page 154)
[Figure: instruction order vs. time (clock cycles CC 1 to CC 6), with a bubble inserted in SUB, AND, and OR.]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
FIGURE 3.13 The load interlock causes a stall to be inserted at clock cycle 4, delaying the SUB instruction and those that follow by one cycle.
A = B + C
Clock cycle      1   2   3   4    5      6    7    8    9
lw rb, b         IF  ID  EX  MEM  WB
lw rc, c             IF  ID  EX   MEM    WB
add ra, rb, rc           IF  ID   stall  EX   MEM  WB
sw a, ra                     IF   stall  ID   EX   MEM  WB
Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
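The difference between the two sequences can be counted mechanically. A rough sketch, not a full pipeline model: it assumes full forwarding, so the only stall is an ALU instruction that uses a register loaded by the instruction immediately before it, and it assumes store data can be forwarded in time (as in Figure 3.11). The helper names are made up.

```python
def parse(line):
    """Split 'OP dest,src1,src2' style text into (op, operands)."""
    op, rest = line.split(None, 1)
    return op.upper(), [x.strip() for x in rest.split(",")]


def load_use_stalls(program):
    """Count instructions that need a register loaded by the instruction
    immediately before them (one stall cycle each)."""
    stalls = 0
    prev_loaded = None
    for line in program:
        op, operands = parse(line)
        if op == "LW":
            sources, loaded = [], operands[0]   # addresses here are symbolic
        elif op == "SW":
            sources, loaded = [], None          # store data assumed forwardable
        else:
            sources, loaded = operands[1:], None  # ALU op: dest,src1,src2
        if prev_loaded is not None and prev_loaded in sources:
            stalls += 1
        prev_loaded = loaded
    return stalls


slow = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
fast = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]

print(load_use_stalls(slow))  # 2
print(load_use_stalls(fast))  # 0
```

Under these assumptions the scheduled code saves two cycles by placing an independent instruction between each load and its first use.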
A = B + C; D = E + F
Clock cycle      1   2   3   4    5    6    7    8    9
lw rb, b         IF  ID  EX  MEM  WB
lw rc, c             IF  ID  EX   MEM  WB
lw re, e                 IF  ID   EX   MEM  WB
lw rf, f                     IF   ID   EX   MEM  WB
add ra, rb, rc                    IF   ID   EX   MEM  WB
sw a, ra                               IF   ID   EX   MEM
HW Change for Forwarding (Figure 3.20, Page 161)
[Figure: ALU with input multiplexers fed from the ID/EX, EX/MEM, and MEM/WB pipeline registers.]
FIGURE 3.20 Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs.
Control Hazard on Branches: Three Stage Stall
Branch Stall Impact
• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
• Two part solution:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
• DLX branch tests if register = 0 or not 0
• DLX Solution:
– Move Zero test to ID/RF stage
– Adder to calculate new PC in ID/RF stage
– 1 clock cycle penalty for branch versus 3
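The CPI figures on this slide come from adding the average stall cycles per instruction to the base CPI. A minimal sketch with an illustrative function name, using the slide's 30% branch frequency:

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Effective CPI when every branch stalls the pipeline for stall_cycles."""
    return base_cpi + branch_fraction * stall_cycles


print(round(cpi_with_branch_stalls(1.0, 0.30, 3), 2))  # 1.9 with a 3 cycle stall
print(round(cpi_with_branch_stalls(1.0, 0.30, 1), 2))  # 1.3 with a 1 cycle penalty
```

Cutting the branch penalty from 3 cycles to 1 brings the CPI from 1.9 down to 1.3 under the same assumptions.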
Pipelined DLX Datapath (Figure 3.22, page 163)
[Figure: datapath stages: Instruction Fetch; Instr. Decode/Reg. Fetch; Execute/Addr. Calc.; Memory Access; Write Back.]
This is the correct 1 cycle latency implementation!
DAP Spr.‘98 ©UCB 38
DataALU
Signextend
16 32
memory
PC
Instruction memory
ADD
ADD
IF/ID
4
ID/EX
EX/MEM MEM/WB
IR6..10
MEM/WB.IR
IR11..15
Reg
iste
rs
Zero?
M u x
M u x
M u x
IR
FIGURE 3.22 The stall from branch hazards can be reduced by moving the zero test and branch target calculation
into the ID phase of the pipeline.
Page 20
DAP Spr.‘98 ©UCB 39
[Figure: bar chart of percentage of instructions executed (0% to 25%) for forward conditional branches, backward conditional branches, and unconditional branches, per benchmark: compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor.]
FIGURE 3.24 The frequency of instructions (branches, jumps, calls, and returns) that may change the PC.
[Figure: bar chart of the fraction of all conditional branches (0% to 80%) that are forward taken and backward taken, per benchmark: compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor.]
FIGURE 3.25 Together the forward and backward taken branches account for an average of 67% of all conditional branches.
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% DLX branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% DLX branches taken on average
– But haven’t calculated branch target address in DLX
» DLX still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome
Four Branch Hazard Alternatives
#4: Delayed Branch
– Define branch to take place AFTER a following instruction
    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n    (successors 1 to n: branch delay of length n)
    branch target if taken
– 1 slot delay allows proper decision and branch target address in 5 stage pipeline
– DLX uses this
Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Cancelling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: with 7-8 stage pipelines and multiple instructions issued per clock (superscalar), a single delay slot no longer hides the branch delay
Evaluating Branch Alternatives
Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31
Conditional and unconditional branches = 14% of instructions; 65% of them change the PC
\[ \text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}} \]
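The CPI column can be recomputed from the branch frequency and penalty. This sketch makes two assumptions not stated in the table: a pipeline depth of 5 (the 5 stage DLX pipeline), and that predict-not-taken pays its penalty only on the 65% of branches that change the PC. The names are illustrative. Speedups computed this way can differ from the table in the last digit because of rounding.

```python
BRANCH_FREQ = 0.14  # conditional + unconditional branches
CHANGE_PC = 0.65    # fraction of branches that change the PC
DEPTH = 5           # assumed pipeline depth

# Average branch stall cycles per instruction for each scheme.
stall_cpi = {
    "Stall pipeline": BRANCH_FREQ * 3,
    "Predict taken": BRANCH_FREQ * 1,
    "Predict not taken": BRANCH_FREQ * CHANGE_PC * 1,  # assumption, see lead-in
    "Delayed branch": BRANCH_FREQ * 0.5,
}


def branch_cpi(scheme):
    """CPI with an ideal CPI of 1 plus branch stalls."""
    return 1.0 + stall_cpi[scheme]


def speedup_vs_unpipelined(scheme):
    """Pipeline depth / (1 + branch frequency x branch penalty)."""
    return DEPTH / branch_cpi(scheme)


for name in stall_cpi:
    print(name, round(branch_cpi(name), 2))
```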
Pipelining Introduction Summary
• Just overlap tasks, and easy if tasks are independent
• Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:
\[ \text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}} \]
• Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW, WAR, WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction