1
CHAPTER 6
2
Pipelining
• Improve performance by increasing instruction throughput
Instruction class | Instruction memory | Register read | ALU | Data memory | Register write | Total (in ps)
Load word         | 200                | 100           | 200 | 200         | 100            | 800
Store word        | 200                | 100           | 200 | 200         |                | 700
R-format          | 200                | 100           | 200 |             | 100            | 600
Branch            | 200                | 100           | 200 |             |                | 500
Ideal speedup is number of stages in the pipeline. Do we achieve this?
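The speedup question can be checked numerically. A minimal Python sketch (not from the slides; stage names and times are taken from the table above):

```python
# Stage latencies in picoseconds, from the load-word row of the table.
stage_ps = {"IF": 200, "Reg read": 100, "ALU": 200, "MEM": 200, "Reg write": 100}

single_cycle_ps = sum(stage_ps.values())     # 800 ps per instruction
pipelined_cycle_ps = max(stage_ps.values())  # clock limited by the slowest stage: 200 ps

def total_time(n, cycle_ps, stages=1):
    """Time for n instructions: a k-stage pipeline needs n + k - 1 cycles
    (fill plus drain); single-cycle is the k = 1 case."""
    return (n + stages - 1) * cycle_ps

n = 1_000_000
speedup = total_time(n, single_cycle_ps) / total_time(n, pipelined_cycle_ps, stages=5)
print(round(speedup, 2))  # 4.0, not the ideal 5: the stages are unbalanced
```

The answer to the slide's question: no. Because the stages are unbalanced, the clock is set by the 200 ps stages, so speedup approaches 800/200 = 4 rather than the stage count of 5.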
[Figure: program execution order for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Top: single-cycle datapath, one instruction every 8 ns. Bottom: pipelined datapath with 2 ns stages (instruction fetch, Reg, ALU, data access, Reg); once the pipeline fills, one instruction completes every 2 ns.]
3
Pipelining
• What makes it easy?
  – all instructions are the same length
  – just a few instruction formats
  – memory operands appear only in loads and stores
• What makes it hard?
  – structural hazards: suppose we had only one memory
  – control hazards: need to worry about branch instructions
  – data hazards: an instruction depends on a previous instruction
• We’ll build a simple pipeline and look at these issues
• We’ll talk about modern processors and what really makes it hard:
  – exception handling
  – trying to improve performance with out-of-order execution, etc.
4
Hazards
A = B + E
C = B + F

Original order (two load-use stalls):
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  lw  $t4, 8($t0)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)

Reordered (no stalls):
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  lw  $t4, 8($t0)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)
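Why the reordering helps can be checked mechanically. A Python sketch (the tuple encoding of instructions is an assumption for illustration) that counts load-use pairs, each of which costs one stall cycle even with forwarding:

```python
# Each instruction is (op, dest, sources); stores have no register destination.
def load_use_hazards(program):
    """Return indices of instructions that read a register loaded by the
    immediately preceding lw."""
    hazards = []
    for i in range(1, len(program)):
        prev_op, prev_dest, _ = program[i - 1]
        _, _, srcs = program[i]
        if prev_op == "lw" and prev_dest in srcs:
            hazards.append(i)
    return hazards

original = [
    ("lw",  "$t1", ["$t0"]),
    ("lw",  "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),   # reads $t2 right after its load
    ("sw",  None,  ["$t3", "$t0"]),
    ("lw",  "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),   # reads $t4 right after its load
    ("sw",  None,  ["$t5", "$t0"]),
]
# Hoist the third lw above the first add, as in the reordered sequence.
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]

print(load_use_hazards(original), load_use_hazards(reordered))  # [2, 5] []
```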
5
Basic Idea
What do we need to add to actually split the datapath into stages?
6
Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
Pipelined datapath
7
Five Stages (lw)
Memory and registers
  Left half: write
  Right half: read
8
Five Stages (lw)
9
Five Stages (lw)
10
What is wrong with this datapath?
11
• Can help with answering questions like:
  – How many cycles does it take to execute this code?
  – What is the ALU doing during cycle 4?
  – Use this representation to help understand datapaths
Graphically representing pipelines
12
Pipeline operation
• In a pipeline, one operation begins in every cycle
• Also, one operation completes in each cycle
• Each instruction takes 5 clock cycles
  – k cycles in general, where k is the pipeline depth
• When a stage is not used, no control needs to be applied
• In one clock cycle, several instructions are active
• Different stages are executing different instructions
• How to generate control signals for them is an issue
13
Pipeline control
• We have 5 stages. What needs to be controlled in each stage?
  – Instruction Fetch and PC Increment
  – Instruction Decode / Register Fetch
  – Execution
  – Memory Stage
  – Write Back
• How would control be handled in an automobile plant?
  – A fancy control center telling everyone what to do?
  – Should we use a finite state machine?
14
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. Control signals (RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, RegWrite, MemtoReg, PCSrc) are shown at the stages that consume them.]
Pipeline control
15
The control lines are grouped by the stage that uses them: execution/address calculation (RegDst, ALUOp1, ALUOp0, ALUSrc), memory access (Branch, MemRead, MemWrite), and write-back (RegWrite, MemtoReg).

Instruction | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Branch | MemRead | MemWrite | RegWrite | MemtoReg
R-format    | 1      | 1      | 0      | 0      | 0      | 0       | 0        | 1        | 0
lw          | 0      | 0      | 0      | 1      | 0      | 1       | 0        | 1        | 1
sw          | X      | 0      | 0      | 1      | 0      | 0       | 1        | 0        | X
beq         | X      | 0      | 1      | 0      | 1      | 0       | 0        | 0        | X
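The table can also be expressed as a lookup structure. A Python sketch (signal names follow the table; the dict layout is illustrative, with 'X' don't-cares kept as None):

```python
# Control values per instruction class; each group travels down the pipeline
# with the instruction, and each pipeline register drops the signals its
# stage has already consumed.
CONTROL = {
    "R-format": dict(RegDst=1, ALUOp1=1, ALUOp0=0, ALUSrc=0,
                     Branch=0, MemRead=0, MemWrite=0, RegWrite=1, MemtoReg=0),
    "lw":       dict(RegDst=0, ALUOp1=0, ALUOp0=0, ALUSrc=1,
                     Branch=0, MemRead=1, MemWrite=0, RegWrite=1, MemtoReg=1),
    "sw":       dict(RegDst=None, ALUOp1=0, ALUOp0=0, ALUSrc=1,
                     Branch=0, MemRead=0, MemWrite=1, RegWrite=0, MemtoReg=None),
    "beq":      dict(RegDst=None, ALUOp1=0, ALUOp0=1, ALUSrc=0,
                     Branch=1, MemRead=0, MemWrite=0, RegWrite=0, MemtoReg=None),
}

print(CONTROL["lw"]["MemRead"], CONTROL["sw"]["RegWrite"])  # 1 0
```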
Pipeline control
16
[Figure: pipelined datapath with control. The control unit in the ID stage generates the EX, MEM, and WB signal groups, which travel with the instruction through the ID/EX, EX/MEM, and MEM/WB pipeline registers.]
Datapath with control
17
• Problem with starting the next instruction before the first is finished
  – dependencies that “go backward in time” are data hazards
[Figure: multi-cycle pipeline diagram for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) over clock cycles CC 1 to CC 9. Register $2 holds 10 until it is written in CC 5 (10/–20), then –20; the and and or need $2 before CC 5, so their dependences point backward in time.]
Dependencies
18
• Use temporary results, don’t wait for them to be written
  – register file forwarding to handle read/write to the same register
  – ALU forwarding
Forwarding
[Figure: the same sequence (sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2)) with forwarding. The value –20 for $2 appears in EX/MEM at CC 4 and in MEM/WB at CC 5, and is forwarded from those pipeline registers to the ALU inputs of the dependent instructions.]
19
[Figure: pipelined datapath with a forwarding unit. EX/MEM.RegisterRd and MEM/WB.RegisterRd are compared against the Rs and Rt fields carried in ID/EX to steer the ALU-input muxes, shown for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2).]
20
• Load word can still cause a hazard:
  – an instruction tries to read a register right after a load instruction
    that writes to the same register
• Thus, we need a hazard detection unit to “stall” the instruction that
  follows the load
Can't always forward
[Figure: pipeline diagram for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7. The and needs $2 one cycle before the load’s memory access completes, so forwarding alone cannot resolve this hazard.]
21
Forwarding

• Forward from the EX/MEM pipeline register (to ALU input A):
  if (EX/MEM.RegWrite
      and (EX/MEM.RegisterRd != 0)
      and (EX/MEM.RegisterRd == ID/EX.RegisterRs))
• Forward from the MEM/WB pipeline register (to ALU input A):
  if (MEM/WB.RegWrite
      and (MEM/WB.RegisterRd != 0)
      and no EX/MEM forward applies to the same operand
      and (MEM/WB.RegisterRd == ID/EX.RegisterRs))
• The same tests with ID/EX.RegisterRt select the forward for ALU input B
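As a sanity check, the conditions can be written as a small function. A Python sketch (names are illustrative; register 0 is hard-wired to zero in MIPS, hence the != 0 test):

```python
def forward_select(src_reg, exmem_regwrite, exmem_rd, memwb_regwrite, memwb_rd):
    """Mux select for one ALU input: checking EX/MEM first gives the most
    recent result priority over MEM/WB; otherwise use the register file."""
    if exmem_regwrite and exmem_rd != 0 and exmem_rd == src_reg:
        return "EX/MEM"
    if memwb_regwrite and memwb_rd != 0 and memwb_rd == src_reg:
        return "MEM/WB"
    return "ID/EX"

# sub $2, $1, $3 followed by and $12, $2, $5: $2 sits in EX/MEM when the
# and reaches EX, so its first ALU operand is forwarded from EX/MEM.
print(forward_select(2, True, 2, False, 0))   # EX/MEM
print(forward_select(5, True, 2, False, 0))   # ID/EX
```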
22
[Figure: the same sequence with a one-cycle bubble inserted after lw $2, 20($1). The and is held in ID for an extra cycle (stretching execution to CC 10), after which forwarding resolves the remaining dependences.]
Stalling
• Hardware detection and no-op insertion is called stalling
• Stall the pipeline by keeping the instruction in the same stage
23
Example
24
25
Stall logic
• Stall condition (load-use hazard):
  if (ID/EX.MemRead                      // instruction in EX is a load
      and ((ID/EX.RegisterRt == IF/ID.RegisterRs)
        or (ID/EX.RegisterRt == IF/ID.RegisterRt)))
• Insert a no-op (no-operation)
  – by deasserting all control signals
• Stall the following instruction
  – by not writing the program counter (PCWrite)
  – by not writing the IF/ID registers
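The hazard-detection behavior can be sketched as a predicate plus the control actions it drives (Python; signal names are illustrative):

```python
def detect_load_use(idex_memread, idex_rt, ifid_rs, ifid_rt):
    """True when the instruction in EX is a load whose destination is read
    by the instruction currently in ID."""
    return idex_memread and idex_rt in (ifid_rs, ifid_rt)

def hazard_controls(stall):
    """On a stall: freeze the PC and IF/ID, and squash the ID/EX control
    signals to create a bubble."""
    return {"PCWrite": not stall, "IF/IDWrite": not stall, "BubbleInIDEX": stall}

# lw $2, 20($1) in EX while and $4, $2, $5 is in ID: stall one cycle.
stall = detect_load_use(True, 2, 2, 5)
print(stall, hazard_controls(stall))
```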
26
Pipeline with hazard detection
27
Summary
28
Forwarding Case Summary
29
Multi-cycle
30
Multi-cycle
31
Multi-cycle Pipeline
32
[Figure: pipelined datapath with control, repeated as context for branch hazards.]
Branch Hazards
33
• When we decide to branch, other instructions are in the pipeline!
• We are predicting “branch not taken”
  – need to add hardware for flushing instructions if we are wrong
[Figure: pipeline diagram for 40 beq $1, $3, 7 followed by 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; branch target 72 lw $4, 50($7). Three instructions after the branch are already in the pipeline when the branch outcome is known.]
Branch hazards
34
Solution to control hazards
• Branch prediction
  – we are predicting “branch not taken”
  – need to add hardware for flushing instructions if we are wrong
• Reduce branch penalty
  – by advancing the branch decision to the ID stage
  – compare the data read from the two registers in the ID stage
  – comparison for equality is a simpler design! (Why?)
  – still need to flush the instruction in the IF stage
• Make the hazard into a feature!
  – delayed branch slot: always execute the instruction following the branch
35
Branch detection in ID stage
36
Dynamic branch prediction
• 1-bit prediction
  – index a prediction table with the lower bits of the instruction address
  – one bit records whether the branch was last taken or not taken
  – disadvantage: poor performance in loops (mispredicts twice per loop)
• 2-bit prediction
  – use two bits instead of one
  – the condition must be satisfied twice before the prediction changes
• More sophisticated schemes count the number of times the branch is taken
[Figure: state diagram of the 2-bit branch predictor.]
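A 2-bit predictor is easy to model. A Python sketch (table size and starting state are assumptions):

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per table entry, indexed by the low bits
    of the branch address. States 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # start weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

p = TwoBitPredictor()
p.update(0x40, True)
p.update(0x40, True)
print(p.predict(0x40))   # True: two taken outcomes saturate toward "taken"
p.update(0x40, False)
print(p.predict(0x40))   # True: one not-taken outcome does not flip a 2-bit counter
```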
37
Correlating Branches
• Hypothesis: recent branches are correlated; that is, the behavior of
  recently executed branches affects the prediction of the current branch
• Idea: record the m most recently executed branches as taken or not taken,
  and use that pattern to select the proper branch history table
• In general, an (m, n) predictor records the last m branches to select
  among 2^m history tables, each with n-bit counters
  – the old 2-bit BHT is then a (0,2) predictor

  if (aa == 2) aa = 0;
  if (bb == 2) bb = 0;
  if (aa != bb) do_something();

• In this example, if the first two branches are both taken, then
  aa == bb == 0 and the third branch is surely not taken; that is exactly
  the correlation a global history can capture
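A minimal (2,2)-style correlating predictor can be modeled the same way. A Python sketch (sizes and starting state are assumptions): on an alternating taken/not-taken branch it becomes perfect once its counters warm up, where a history-free predictor keeps mispredicting.

```python
class CorrelatingPredictor:
    """(m, n) = (2, 2): the outcomes of the last 2 branches (global history)
    select one of 4 two-bit counters within each branch-address entry."""
    def __init__(self, index_bits=10, m=2):
        self.mask = (1 << index_bits) - 1
        self.hist_mask = (1 << m) - 1
        self.history = 0
        self.table = [[1] * (1 << m) for _ in range(1 << index_bits)]

    def predict(self, pc):
        return self.table[pc & self.mask][self.history] >= 2

    def update(self, pc, taken):
        ctr = self.table[pc & self.mask]
        h = self.history
        ctr[h] = min(3, ctr[h] + 1) if taken else max(0, ctr[h] - 1)
        self.history = ((h << 1) | int(taken)) & self.hist_mask

cp = CorrelatingPredictor()
pattern = [True, False] * 8        # a branch that strictly alternates
correct = 0
for taken in pattern:
    correct += (cp.predict(0x40) == taken)
    cp.update(0x40, taken)
print(correct)  # 14 of 16: only two early mispredictions while warming up
```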
38
[Figure: a correlating predictor. The branch address and a 2-bit global branch history together index the 2-bit per-branch predictors to produce a prediction.]
Correlating Branches
• (2,2) predictor
  – the behavior of the two most recent branches selects among four
    predictions for the next branch, and only the selected prediction
    is updated
[Figure: branch address plus 2-bit global branch history indexing the 2-bit per-branch predictors.]
39
Accuracy of Different Schemes
[Figure: frequency of mispredictions (0% to 18%) on SPEC89 benchmarks (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for three schemes: a 4,096-entry BHT with 2 bits per entry, an unlimited-entry BHT with 2 bits per entry, and a 1,024-entry (2,2) BHT. Labeled rates include 0%, 1%, 4%, 5%, 6%, and 11%.]
40
Branch Prediction
• Sophisticated Techniques:
  – a “branch target buffer” to help us look up the destination
  – correlating predictors that base the prediction on global behavior and
    recently executed branches (e.g., the prediction for a specific branch
    instruction is based on what happened in previous branches)
  – tournament predictors that use different types of prediction strategies
    and keep track of which one is performing best
  – a “branch delay slot” which the compiler tries to fill with a useful
    instruction (make the one-cycle delay part of the ISA)
• Branch prediction is especially important because it enables other more advanced pipelining techniques to be effective!
• Modern processors predict correctly 95% of the time!
41
Branch Target Buffer
• Branch Target Buffer (BTB): the address of the branch indexes the buffer
  to get both the prediction AND the branch target address (if taken)
  – note: must check that the entry matches this branch, since we can’t use
    the wrong branch address
• Return-instruction addresses are predicted with a stack
[Figure: a BTB entry supplies the predicted PC and the prediction, taken or not taken.]
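A BTB can be sketched as a small map from branch address to predicted target (Python; a real BTB is a fixed-size, tagged hardware table, so the dict here is only illustrative):

```python
class BTB:
    def __init__(self):
        self.entries = {}          # pc -> (predicted_target, predict_taken)

    def lookup(self, pc):
        """Predicted next PC in the IF stage. On a miss, or when the entry
        predicts not taken, just fall through to pc + 4."""
        target, taken = self.entries.get(pc, (None, False))
        return target if taken else pc + 4

    def update(self, pc, target, taken):
        self.entries[pc] = (target, taken)

btb = BTB()
btb.update(0x40, 0x60, True)       # a beq at 0x40 was taken to 0x60
print(hex(btb.lookup(0x40)), hex(btb.lookup(0x44)))  # 0x60 0x48
```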
42
Scheduling in delayed branching
43
Other issues in pipelines
• Exceptions
  – errors in the ALU for arithmetic instructions
  – memory non-availability
• Exceptions lead to a jump in the program
• However, the current PC value must be saved so that the program can
  return to it for recoverable errors
• Multiple exceptions can occur in a pipeline
• Preciseness of the exception location is important in some cases
• I/O exceptions are handled in the same manner
44
Exceptions
45
Improving Performance
• Try and avoid stalls! E.g., reorder these instructions:

  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
• Dynamic Pipeline Scheduling
  – hardware chooses which instructions to execute next
  – will execute instructions out of order (e.g., doesn’t wait for a
    dependency to be resolved, but rather keeps going!)
  – speculates on branches and keeps the pipeline full
    (may need to roll back if the prediction is incorrect)
• Trying to exploit instruction-level parallelism
46
Advanced Pipelining
• Increase the depth of the pipeline
• Start more than one instruction each cycle (multiple issue)
• Loop unrolling to expose more ILP (better scheduling)
• “Superscalar” processors
  – DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
• All modern processors are superscalar and issue multiple instructions,
  usually with some limitations (e.g., different “pipes”)
• VLIW: very long instruction word, static multiple issue
  (relies more on compiler technology)
• This class has given you the background you need to learn more!
47
Superscalar architecture --Two instructions executed in parallel
48
Dynamically scheduled pipeline
49
Motorola G4e
50
Intel Pentium 4
51
IBM PowerPC 970
52
Important facts to remember
• Pipelined processors divide execution into multiple steps
• However, pipeline hazards reduce performance
  – structural, data, and control hazards
• Data forwarding helps resolve data hazards
  – but not all hazards can be resolved
  – some data hazards require bubble (no-op) insertion
• Effects of control hazards are reduced by branch prediction
  – predict always taken, delayed slots, branch prediction table
• Structural hazards are resolved by duplicating resources
• Time to execute n instructions depends on
  – number of pipeline stages (k)
  – number of control hazards and the penalty of each
  – number of data hazards and the penalty of each
  – Cycles = n + k - 1 + (load hazard penalties) + (branch penalties)
• Load hazard penalty is 0 or 1 cycle
  – depending on when the data is used, with forwarding
• Branch penalty is 3, 2, 1, or 0 cycles depending on the scheme
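The cycle-count formula above, as a Python sketch (all hazard counts and penalties are parameters):

```python
def pipeline_cycles(n, k, load_hazards=0, load_penalty=1,
                    branches=0, branch_penalty=1):
    """Ideal n + k - 1 cycles (fill plus drain), plus stall cycles for
    load-use hazards and for branches that pay a penalty."""
    return n + k - 1 + load_hazards * load_penalty + branches * branch_penalty

# 100 instructions on a 5-stage pipeline, with 10 load-use stalls and
# 20 branches that each cost 1 cycle:
print(pipeline_cycles(100, 5, load_hazards=10, branches=20))  # 134
```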
53
Design and performance issues with pipelining
• Pipelined processors are not EASY to design
• Technology affects the implementation
• Instruction set design affects the performance
  – e.g., beq, bne
• More stages do not necessarily lead to higher performance!
54
Chapter 6 Summary
• Pipelining does not improve latency, but does improve throughput
[Figure: two slower-to-faster orderings of processor organizations, one by instructions per clock (IPC = 1/CPI, from 1 to several) and one by latency in instructions: multicycle (Section 5.5), single-cycle (Section 5.4), pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), and multiple issue with deep pipeline (Section 6.10).]