Upload
others
View
12
Download
0
Embed Size (px)
Citation preview
Lecture 3Pipelining: Basic Concepts
Prof. Tao LiAdvanced Computer Architecture
EEL 5764
Computer Pipelines• Computers execute billions of instructions,
so instruction throughout is what matters• Divide instruction execution up into several
pipeline stages. For exampleIF ID EX MEM WB
• Simultaneously have different instructions in different pipeline stages
• The length of the longest pipeline stage determines the cycle time
• MIPS pipeline features: – all instructions same length– registers located in same place in instruction format– memory operands only in loads or stores
MIPS Instruction Formats
Op31 26 01516202125
rs1 rd immediate
Op31 26 025
Op31 26 01516202125
rs1 rs2
offset added to PC
rd
Register-Register (R-type) ADD R1, R2, R3561011
Register-Immediate (I-type) SUB R1, R2, #3
Jump / Call (J-type) JUMP end
func
(jump, jump and link, trap and return from exception)
Multiple-Cycle MIPS: Cycles 1 and 2
• Most MIPS instruction can be implemented in 5 clock cycles
• The first two clock cycles are the same for every instruction.
1. Instruction fetch cycle (IF)load instructionupdate program counter
2. Instruction decode / register fetch cycle (ID)fetch source registerssign-extend immediate field
5 Steps of MIPS Datapath
MemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
LMD
ALU
MU
X
Mem
ory
RegFile
MU
XM
UX
Data
Mem
ory
MU
X
SignExtend
4
Adder Zero?
Next SEQ PC
PC
Next PC
WB Data
Inst
RD
RS1
RS2
Imm
Multiple-Cycle MIPS: Cycle 3
• The third cycle is known as the Execution/ effective address cycle (EX)
• The actions performed in this cycle depend on the type of operations.
– Loads and Stores» calculate effective address
– ALU operations» perform ALU operation
– Branch » compute branch target» determine if the branch is taken
5 Steps of MIPS Datapath
MemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
LMD
ALU
MU
X
Mem
ory
RegFile
MU
XM
UX
Data
Mem
ory
MU
X
SignExtend
4
Adder Zero?
Next SEQ PC
PC
Next PC
WB Data
Inst
RD
RS1
RS2
Imm
Multiple-Cycle MIPS: Cycle 4
• The fourth cycle is known as the Memory access / branch completion cycle (MEM)
• The only MIPS instructions active in this cycle are loads, stores, and branches
– Loads» load memory onto processor
– Stores» store data into memory
– Branch» go to branch target or next instruction
– ALU Operations» do nothing
5 Steps of MIPS Datapath
MemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
LMD
ALU
MU
X
Mem
ory
RegFile
MU
XM
UX
Data
Mem
ory
MU
X
SignExtend
4
Adder Zero?
Next SEQ PC
PC
Next PC
WB Data
Inst
RD
RS1
RS2
Imm
Multiple-Cycle MIPS: Cycle 5
• The fifth cycle is known as the Write-back cycle (WB)
• During this cycles, results are written to the register file
– Loads» write value from memory into register file
– ALU Operations» write ALU result into register file
– Stores and Branches» do nothing
5 Steps of MIPS Datapath
MemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
LMD
ALU
MU
X
Mem
ory
RegFile
MU
XM
UX
Data
Mem
ory
MU
X
SignExtend
4
Adder Zero?
Next SEQ PC
PC
Next PC
WB Data
Inst
RD
RS1
RS2
Imm
CPI for the Multiple-Cycle MIPS
• The multiple-cycle MIPS requires 4 cycles for branches and stores and 5 cycles for the other operations.
• Assuming 20% of the instructions are branches or stores, this gives a CPI of
0.8*5 + 0.2*4 = 4.80
• We could improve the CPI by allowing ALU operations to complete in 4 cycles.
• Assuming 40% of the instructions are ALU operations, this would reduce the CPI to
0.4*5 + 0.6*4 = 4.40
Pipelining MIPS• To reduce the CPI, MIPS can be implemented
using a five stage pipeline.
• In this example, it takes 9 cycles execute 5 instructions for a CPI of 1.8.
Inst. CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9
1 IF ID EX M EM W B
2 IF ID EX M EM W B
3 IF ID EX M EM W B
4 IF ID EX M EM W B
5 IF ID EX M EM W B
5 Steps of MIPS DatapathMemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
RegFile
MU
XM
UX
Data
Mem
ory
MU
X
SignExtend
Zero?
IF/ID
ID/EX
MEM
/WB
EX/M
EM4
Adder
Next SEQ PC Next SEQ PC
RD RD RD WB
Dat
a
Next PC
PC
RS1
RS2
Imm
MU
X
Visualizing Pipelining
Instr.
Order
Time (clock cycles)
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5
Pipeline Speedup Example• Assume the multiple cycle MIPS has a 10 ns clock
cycle, loads take 5 clock cycles and account for 40% of the instructions, and all other instructions take 4 clock cycles.
• If pipelining the machine add 1-ns to the clock cycle, how much speedup in instruction execution rate do we get from pipelining.
MC Ave Instr. Time = Clock cycle x Average CPI= 10 ns x (0.6 x 4 + 0.4 x 5)= 44 ns
PL Ave Instr. Time = 10 + 1 = 11 nsSpeedup = 44 / 11 = 4
• This ignores time needed to fill & empty the pipeline and delays due to hazards.
Pipelining Summary• Pipelining overlaps the execution of multiple
instructions. • With an ideal pipeline, the CPI is one, and the
speedup is equal to the number of stages in the pipeline.
• However, several factors prevent us from achieving the ideal speedup, including
– Not being able to divide the pipeline evenly– The time needed to empty and flush the pipeline– Overhead needed for pipelining – Structural, data, and control hazards
Its Not That Easy for Computers• Limits to pipelining: Hazards prevent next
instruction from executing during its designated clock cycle
– Structural hazards: Hardware cannot support this combination of instructions - two instructions need the same resource.
– Data hazards: Instruction depends on result of prior instruction still in the pipeline
– Control hazards: Pipelining of branches & other instructions that change the PC
• Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles”in the pipeline
• To do this, hardware or software must detect that a hazard has occurred.
Speed Up Equations for Pipelining
pipelined
dunpipeline
TimeCycle TimeCycle
CPI stall Pipeline CPI Ideal
depth Pipeline CPI Ideal Speedup ×+×
=
pipelined
dunpipeline
TimeCycle TimeCycle
CPI stall Pipeline 1
depth Pipeline Speedup ×+
=
Instper cycles Stall Average CPI Ideal CPIpipelined +=
For simple RISC pipeline, CPI = 1:
Structural Hazards
• Structural hazards occur when two or more instructions need the same resource.
• Common methods for eliminating structural hazards are:
– Duplicate resources – Pipeline the resource– Reorder the instructions
• It may be too expensive too eliminate a structural hazard, in which case the pipeline should stall.
• When the pipeline stalls, no instructions are issued until the hazard has been resolved.
• What are some examples of structural hazards?
One Memory Port Structural HazardsTime (clock cycles)
Instr.
Order
Load
Instr 1
Instr 2
Instr 3
Instr 4
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5
Reg ALU DMemIfetch Reg
One Memory Port Structural Hazards
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Stall
Instr 3
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5
Reg ALU DMemIfetch Reg
Bubble Bubble Bubble BubbleBubble
Example: One or Two Memory Ports?
• Machine A: Dual ported memory (“Harvard Architecture”)• Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate• Ideal CPI = 1 for both• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe)= Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockpipe / 1.05)= (Pipeline Depth/1.4) x 1.05= 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
• Machine A is 1.33 times faster
• Read After Write (RAW)InstrJ tries to read operand before InstrI writes it
• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards
I: add r1,r2,r3J: sub r4,r1,r3
Instr.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
RAW Hazards on R1Time (clock cycles)
IF ID/RF EX MEM WB
• Write After Read (WAR)InstrJ writes operand before InstrI reads it
• Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:– All instructions take 5 stages, and– Reads are always in stage 2, and – Writes are always in stage 5
• WAR hazards can happen if instructions execute out of order
I: sub r4,r1,r3 J: add r1,r2,r3K: mul r6,r1,r7
Three Generic Data Hazards
Instr.
Order
add r4,r1,r3
sub r1,r2,r3 Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
No WAR Hazards on R1Time (clock cycles)
IF ID/RF EX MEM WBread
write
Three Generic Data Hazards• Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
• Called an “output dependence” by compiler writersThis also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because: – All instructions take 5 stages, and – Writes are always in stage 5
• Will see WAR and WAW in later more complicated pipes
I: sub r1,r4,r3 J: add r1,r2,r3K: mul r6,r1,r7
Instr.
Order
add r1,r4,r3
sub r1,r2,r3 Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
No WAR Hazards on R1Time (clock cycles)
IF ID/RF EX MEM WBread
write
Data Forwarding• With data forwarding (also called bypassing or
short-circuiting), data is transferred back to earlier pipeline stages before it is written into the register file.
– Instr i: add r1,r2,r3 (result ready after EX stage)----------------------
– Instr j: sub r4,r1,r5 (result needed in EX stage)
• This either eliminates or reduces the penalty of RAW hazards.
• To support data forwarding, additional hardware is required.
– Multiplexors to allow data to be transferred back– Control logic for the multiplexors
Time (clock cycles)
Forwarding to Avoid RAW Hazard
Inst
r.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
HW Change for Forwarding
MEM
/WR
ID/EX
EX/M
EM
DataMemory
ALU
mux
mux
Registers
NextPC
Immediate
mux
Time (clock cycles)
Instr.
Order
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Data Hazard Even with Forwarding
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Data Hazard Even with Forwarding
Time (clock cycles)
or r8,r1,r9
Instr.
Order
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
Reg ALU DMemIfetch Reg
RegIfetch ALU DMem RegBubble
Ifetch ALU DMem RegBubble Reg
Ifetch ALU DMemBubble Reg
Try producing fast code fora = b + c;d = e – f;
assuming a, b, c, d ,e, and f in memory. Slow code:
LW Rb,bLW Rc,cADD Ra,Rb,RcSW a,RaLW Re,e LW Rf,fSUB Rd,Re,RfSW d,Rd
Software Scheduling to Avoid Load Hazards
Fast code:LW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,RaSUB Rd,Re,RfSW d,Rd
Compiler Avoiding Load Stalls
% loads stalling pipeline
0% 20% 40% 60% 80%
tex
spice
gcc
25%
14%
31%
65%
42%
54%
scheduled unscheduled
Compilers reduce the number of load stalls, but do not completely eliminate them.
MIPS Control Hazards• Control hazards, which occur due to instructions
changing the PC, can result in a large performance loss.
• A branch is either– Taken: PC <= PC + 4 + Imm– Not Taken: PC <= PC + 4
• The simplest solution is to stall the pipeline as soon as a branch instruction is detected.
– Detect the branch in the ID stage– Don’t know if the branch is taken until the EX stage– If the branch is taken, we need to repeat the IF and ID stages– New PC is not changed until the end of the MEM stage, after
determining if the branch is taken and the new PC value
5 Steps of MIPS DatapathMemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
RegFile
MU
XM
UX
Data
Mem
ory
MU
X
SignExtend
Zero?
IF/ID
ID/EX
MEM
/WB
EX/M
EM4
Adder
Next SEQ PC Next SEQ PC
RD RD RD WB
Dat
a
Next PC
PC
RS1
RS2
Imm
MU
X
Control Hazard on BranchesThree Stage Stall
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Control Hazard on Branches• With our original MIPS model, branches have
a delay of 3 cycles• The delay for not-taken branches can be
reduced to two cycles, since it is not necessary to fetch the instruction again.
Inst. CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9
Branch instr.
IF ID EX MEM WB
Branch successor
IF Stall Stall ID EX MEM WB
Branch success+1
IF ID EX MEM WB
Branch Stall Impact• If CPI = 1, 30% branch,
Stall 3 cycles => new CPI = 1.9!• Two part solution:
– Determine branch taken or not sooner, AND– Compute taken branch address earlier
• Branch tests if register = 0 or ≠ 0• Solution:
– Move Zero test to ID/RF stage– Adder to calculate new PC in ID/RF stage– 1 clock cycle penalty for branch versus 3
Branch Behavior in Programs
• Based on SPEC benchmarks on MIPS– Branches occur with a frequency of 14% to 16% in integer
programs and 3% to 12% in floating point programs.– About 75% of the branches are forward branches– 60% of forward branches are taken– 80% of backward branches are taken– 67% of all branches are taken
• Why are branches (especially backward branches) more likely to be taken than not taken?
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear#2: Predict Branch Not Taken
– Execute successor instructions in sequence– “Squash” instructions in pipeline if branch actually taken– Advantage of late pipeline state update– 33% MIPS branches not taken on average– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken– 67% MIPS branches taken on average– But haven’t calculated branch target address in MIPS
» MIPS still incurs 1 cycle branch penalty» Other machines: branch target known before outcome
Four Branch Hazard Alternatives
#4: Define branch to take place AFTER n following instruction
branch instructionsequential successor1sequential successor2........sequential successorn
branch target if taken
– In 5 stage pipeline, 1 slot delay allows proper decision and branch target address to be calculated (n=1)
– MIPS uses this approach, with a single branch delay slot– Superscalar machines with deep pipelines may require
additional delay slots to avoid branch penalties
Branch delay of length n
Delayed Branch• Where to get instructions to fill branch delay slot?
– Before branch instruction: always valuable if found» Branch cannot depend on rescheduled instruction (RI)
– From the target address: only valuable when branch taken» Must be O.K. to execute RI if branch is not taken.
– From fall through: only valuable when branch not taken» Must be O.K. to execute RI if branch is taken
Before Target Fall ThroughADD R1, R2, R3 …BEQZ R2, target BEQZ R2, target BEQZ R2,target… … …… … ADD R1, R2, R3
target: … ADD R1, R2, R3 …
Filling delay slots• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots– About 80% of instructions executed in branch delay slots useful in
computation– About 50% (60% x 80%) of slots usefully filled
• Canceling branches or nullifying branches– Include a prediction of if the branch is taken or not take– If the prediction is correct, the instruction in the delay slot is executed– If the prediction is incorrect, the instruction in the delay slot is squashed– Allow more slots to be filled from the target address or fall through
Evaluating Branch Alternatives
Scheduling Branch CPI speedup v. speedup v.scheme penalty unpipelined stall
Slow stall pipeline 3 1.42 3.5 1.0Fast stall pipeline 1 1.14 4.4 1.26Predict taken 1 1.14 4.4 1.26Predict not taken 0.7 1.10 4.5 1.29Delayed branch 0.5 1.07 4.7 1.34
Assume branch frequency is 14%
Pipeline speedup = Pipeline depth1 +Branch frequency ×Branch penalty
Compiler “Static” Prediction ofTaken/Untaken Branches
• Two strategies examined– Backward branch predict taken, forward branch not taken– Profile-based prediction: record branch behavior, predict branch
based on prior run
Inst
ruct
ions
per
mis
pred
icte
d br
anch
1
10
100
1000
10000
100000al
vinn
com
pres
s
dodu
c
espr
esso gc
c
hydr
o2d
mdl
jsp2 or
a
swm
256
tom
catv
Profile-based Direction-based
Pipelining Complications
• Exceptions: Events other than branches or jumps that change the normal flow of instruction execution.
• 5 instructions executing in 5 stage pipeline– How to stop the pipeline?– How to restart the pipeline?– Who caused the interrupt?
Stage Problem interrupts occurringIF Page fault on instruction fetch; misaligned memory
access; memory-protection violationID Undefined or illegal opcodeEX Arithmetic interruptMEM Page fault on data fetch; misaligned memory
access; memory-protection violation
Pipelining Complications
• Simultaneous exceptions in more than one pipeline stage, e.g.,
– Load with data page fault in MEM stage– Add with instruction page fault in IF stage– Add fault will happen BEFORE load fault, even if LOAD
instruction is first
• Solution #1– Interrupt status vector per instruction– Defer check until last stage, kill state update if exception
• Solution #2– Interrupt ASAP– Restart everything that is incomplete
Another advantage for state update late in pipeline!
Pipelining Complications• MIPS pipeline only writes results at the end of the
instruction’s execution. Not all processors do this.
• Address modes: Auto-increment causes register change during instruction execution
– Interrupts? Need to restore register state– Adds WAR and WAW hazards since writes no longer last stage
• Memory-Memory Move Instructions– Must be able to handle multiple page faults– VAX and x86 store values temporarily in registers
• Condition Codes– Need to detect the last instruction to change condition codes
Pipelining Complications• Floating Point instructions
– Longer execution times than for integers– May pipeline FP execution unit so they can initiate new
instructions without waiting full latency– Adds WAR and WAW hazards since pipelines are no longer
same length (e.g., square root 30 times slower than add)FP Instruction Latency Initiation Rate (MIPS R4000)Add, Subtract 4 3Multiply 8 4Divide 36 35Square root 112 111Negate 2 1Absolute value 2 1FP compare 3 2
Cycles before use result
Cycles before issue instr of same type
Summary of Pipelining Hazards
• Hazards limit performance– Structural: need more HW resources– Data: need forwarding, compiler scheduling– Control: early evaluation & PC, delayed branch, prediction
• Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency
• Interrupts, Instruction Set, FP make pipelining harder• Compilers reduce cost of data and control hazards
– Code rescheduling– Load delay slots– Branch delay slots– Static branch prediction