8/8/2019 Part4 Branch
http://slidepdf.com/reader/full/part4-branch 1/11
Copyright UCB & Morgan Kaufmann. ECE568/Koren, Part 4. Adapted from UCB and other sources.
Israel Koren
Fall 2010
UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Advanced Computer Architecture
ECE 568
Part 4
Control Hazards
Control Hazard on Branches
12: beq r1,r3,24
16: and r2,r3,r5
20: or r6,r1,r7
24: add r8,r1,r9
36: xor r10,r1,r11
[Figure: pipeline diagram. Each instruction above passes through the stages Ifetch, Reg, ALU, DMem, Reg; the last row is the branch target.]
MIPS pipeline (Fig.A.19)
Control Hazard on Branches: Two Stage Stall
12: beq r1,r3,24
16: and r2,r3,r5
20: or r6,r1,r7
36: xor r10,r1,r11
[Figure: pipeline diagram. Each instruction above passes through the stages Ifetch, Reg, ALU, DMem, Reg; the last row is the branch target.]
Freeze or Flush (higher performance but must undo side effects)
Branch Stall Impact - MIPS R4000
♦Deeper pipeline -> worse penalty
[Figure: pipeline timing diagram over clock cycles cc1 to cc10 (time axis). Rows: BEQZ, Instruction 1, Instruction 2, Instruction 3, Target. Each row passes through Instr. Memory, Reg, ALU, Data Memory, Reg.]
Branch Stall Impact - MIPS R4000
♦Assume CPI = 1.0 ignoring branches & data hazards
♦ Assume the solution is to stall for 3 cycles on every branch
♦ If 20% of instructions are branches, each stalling 3 cycles:
    Op       Freq   Cycles   CPI(i)
    Other    80%    1        0.8
    Branch   20%    4        0.8
  new CPI = 1.6, or 60% slower
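The CPI table above is a frequency-weighted sum, which can be sketched in a few lines. This is only an illustration of the slide's arithmetic; the function name `effective_cpi` and the dictionary layout are assumptions made for the example.

```python
def effective_cpi(mix):
    """Weighted CPI: sum of frequency * cycles over instruction classes."""
    return sum(freq * cycles for freq, cycles in mix.values())

# Instruction mix from the slide: (frequency, cycles per instruction).
mix = {
    "other": (0.80, 1),        # base CPI of 1.0
    "branch": (0.20, 1 + 3),   # 1 base cycle + 3 stall cycles
}

cpi = effective_cpi(mix)
slowdown = cpi / 1.0 - 1.0     # relative to the ideal CPI of 1.0
print(f"CPI = {cpi:.2f}, slowdown = {slowdown:.0%}")
```

Changing the stall count in the `branch` entry shows how quickly the penalty grows with pipeline depth.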
Conditional and unconditional Branch
♦ Unconditional branches incur lower penalty
  • Assume 2 cycles for an unconditional branch vs. 3 for a conditional branch
♦ Avg. stall cycles = Σ Branch_frequency × Branch_penalty
    Branch_type      Freq.   Penalty   Freq × Penalty
    Unconditional    0.04    2         0.08
    Conditional      0.16    3         0.48
    Total                              0.56
♦ Pipeline_speedup = Pipeline_depth / (1 + Σ Branch_frequency × Branch_penalty)
                   = 8 / 1.56 = 5.13 (ignoring all other stalls)
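As a check on the speedup formula, the following sketch recomputes the average stall cycles and the resulting pipeline speedup from the slide's numbers. The function names are hypothetical, and the pipeline depth of 8 is taken from the slide's own calculation.

```python
def avg_branch_stall(branches):
    """Sum of Branch_frequency * Branch_penalty over all branch types."""
    return sum(freq * penalty for freq, penalty in branches)

def pipeline_speedup(depth, stall_cycles_per_instr):
    """Pipeline_depth / (1 + average stall cycles per instruction)."""
    return depth / (1.0 + stall_cycles_per_instr)

# (frequency, penalty in cycles) for each branch type on the slide.
branches = [
    (0.04, 2),  # unconditional
    (0.16, 3),  # conditional
]

stall = avg_branch_stall(branches)
speedup = pipeline_speedup(8, stall)
print(f"avg stall = {stall:.2f} cycles, speedup = {speedup:.2f}")
```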
Dealing with Branch in MIPS
♦ Branch penalty is significant
♦ Two part solution:
  • Determine branch taken or not sooner, AND
  • Compute taken branch address earlier
♦ MIPS branch tests if register = 0 or ≠ 0
♦ MIPS Solution:
  • Move Zero test to ID/RF stage
  • Adder to calculate new PC in ID/RF stage
  • 1 clock cycle penalty for branch versus 2
Original MIPS 5-stage pipeline
[Figure: datapath of the original MIPS 5-stage pipeline. Stages: Instruction Fetch, Instr. Decode/Reg. Fetch, Execute/Addr. Calc, Memory Access, Write Back. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Labeled components: PC adder (+4), Next SEQ PC, Next PC, register file (RS1, RS2, RD), Imm with Sign Extend, second adder, Zero? test, ALU, data memory, WB Data, and MUXes.]
Modified MIPS Pipeline
[Figure: datapath of the modified MIPS pipeline. Stages: Instruction Fetch, Instr. Decode/Reg. Fetch, Execute/Addr. Calc, Memory Access, Write Back. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Labeled components: PC adder (+4), Next SEQ PC, Next PC, register file (RS1, RS2, RD), Imm with Sign Extend, second adder, Zero? test, ALU, data memory, WB Data, and MUXes.]
Four Branch Hazard Alternatives – 1 & 2
#1: Stall until branch direction is clear
#2: Static Prediction:
(a) Predict Branch Not Taken
  • Execute successor instructions in sequence
  • “Squash” instructions in pipeline if branch actually taken
  • 47% of MIPS branches not taken on average
(b) Predict Branch Taken
  • 53% of MIPS branches taken on average
  • But the branch target address has not been calculated yet
    » MIPS still incurs 1 cycle branch penalty
    » Other machines: branch target known before outcome
Static Prediction - Not Taken
♦ Predict: guess Not Taken, then back up if wrong
♦ Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right about 50% of time)
  • Need to “Squash” following instruction (LOAD) if wrong
[Figure: pipeline diagram, instruction order (ADD, BEQ, LOAD) vs. time (clock cycles 1 to 7). Each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]
Predict Not Taken
♦ CPI for branch: (1 * .47 + 2 * .53) = 1.53
♦ Total CPI might be: 1.53 * .2 + 1 * .8 = 1.106 (20% branch)
♦ Compare to Always Stall (Freeze)
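The predict-not-taken numbers can be reproduced, and compared against always stalling, with the short sketch below. The function names are hypothetical, and the always-stall case assumes the same 1-cycle branch penalty that the modified MIPS pipeline has in this lecture.

```python
def predict_not_taken_cpi(branch_freq, taken_frac, penalty=1):
    """Return (CPI of a branch, total CPI) when predicting not taken.

    A not-taken branch costs 1 cycle; a taken branch costs 1 + penalty.
    All non-branch instructions are assumed to have CPI = 1.
    """
    branch_cpi = 1 * (1 - taken_frac) + (1 + penalty) * taken_frac
    total_cpi = branch_cpi * branch_freq + 1 * (1 - branch_freq)
    return branch_cpi, total_cpi

def always_stall_cpi(branch_freq, penalty=1):
    """Every branch pays the penalty (assumed 1 cycle here)."""
    return 1 + branch_freq * penalty

branch_cpi, total_cpi = predict_not_taken_cpi(branch_freq=0.20, taken_frac=0.53)
stall_cpi = always_stall_cpi(branch_freq=0.20)
print(f"branch CPI = {branch_cpi:.2f}")
print(f"predict not taken: {total_cpi:.3f}, always stall: {stall_cpi:.3f}")
```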
MIPS R4000: Predict Taken - 2 (vs. 3) stall Cycles
[Figure: pipeline timing diagram over clock cycles cc1 to cc10 (time axis). Rows: BEQZ, Instruction 1, Instruction 2, Instruction 3, Target. Each row passes through Instr. Memory, Reg, ALU, Data Memory, Reg.]
CPI Penalty - MIPS R4000
CPI_Stall = 1.56, CPI_P/T = 1.46, CPI_P/UT = 1.38
Speedup_(P/UT vs. Stall) = 1.56/1.38 = 1.13
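One way to arrive at these three CPI values is sketched below. The branch mix (4% unconditional, 6% conditional not taken, 10% conditional taken) is the one quoted later in this lecture. The per-scheme penalty table is an assumption: it is not given on this slide, and was chosen because it is consistent with the 2-cycle and 3-cycle penalties used earlier and reproduces the three CPI figures.

```python
# Branch mix quoted on the lecture's MIPS evaluation slide.
FREQ = {"uncond": 0.04, "cond_not_taken": 0.06, "cond_taken": 0.10}

# ASSUMED stall cycles per branch class for each scheme (not on the slide).
PENALTY = {
    "stall":             {"uncond": 2, "cond_not_taken": 3, "cond_taken": 3},
    "predict_taken":     {"uncond": 2, "cond_not_taken": 3, "cond_taken": 2},
    "predict_not_taken": {"uncond": 2, "cond_not_taken": 0, "cond_taken": 3},
}

def scheme_cpi(scheme, base_cpi=1.0):
    """Base CPI plus the frequency-weighted branch stall cycles."""
    return base_cpi + sum(FREQ[k] * PENALTY[scheme][k] for k in FREQ)

for name in PENALTY:
    print(f"{name:18s} CPI = {scheme_cpi(name):.2f}")

speedup = scheme_cpi("stall") / scheme_cpi("predict_not_taken")
print(f"speedup (P/UT vs. stall) = {speedup:.2f}")
```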
Four Branch Hazard Alternatives – 3 & 4
#3: Delayed Branch
  • Define branch to take place AFTER the following instruction(s)
        branch instruction
        sequential successor 1    \
        sequential successor 2     |  Branch delay of length n
        ........                   |
        sequential successor n    /
        branch target if taken
  • 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  • MIPS uses this
#4: Dynamic Branch Prediction
MIPS - One Delayed Branch Slot
♦ Impact: 0 clock cycles per branch instruction if a useful instruction can be found to put in the “slot” (≈ 50% of time)
[Figure: pipeline diagram, instruction order (ADD, BEQ, Misc, LOAD) vs. time (clock cycles 1 to 7). Each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]
Scheduling the branch delay slot
From before:
    DADD R1,R2,R3
    If R2=0 then
        [Delay Slot]
      ⇓
    If R2=0 then
        DADD R1,R2,R3

From target:
    DSUB R4,R5,R6
    ::::
    DADD R1,R2,R3
    If R1=0 then
        [Delay Slot]
      ⇓
    DSUB R4,R5,R6
    ::::
    DADD R1,R2,R3
    If R1=0 then
        DSUB R4,R5,R6

From fall-through:
    DADD R1,R2,R3
    If R1=0 then
        [Delay Slot]
    OR R7,R8,R9
    DSUB R4,R5,R6
      ⇓
    DADD R1,R2,R3
    If R1=0 then
        OR R7,R8,R9
    DSUB R4,R5,R6
Delayed Branch
♦ Where to get instructions to fill branch delay slot?
  • Before branch instruction
  • From the target address: only valuable when branch taken
  • From fall through: only valuable when branch not taken
  • Canceling branches allow more slots to be filled
♦ Compiler effectiveness for single branch delay slot:
  • Fills about 60% of branch delay slots
  • About 80% of instructions executed in branch delay slots are useful in computation
  • About 48% (60% x 80%) of slots usefully filled
♦ CPI = 1 + Prob{Branch} * Prob{Un-useful_fill} = 1 + .2 * .52 = 1.104
♦ Delayed Branch downside: with 8-10 stage pipelines and multiple instructions issued per clock (superscalar), one delay slot is no longer enough
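The delay-slot arithmetic above can be written out as a small sketch. The function name and parameter names are hypothetical; the model assumes, as the slide does, that every slot not usefully filled costs one cycle.

```python
def delay_slot_cpi(branch_freq, fill_rate, useful_rate, base_cpi=1.0):
    """Return (fraction of slots usefully filled, resulting CPI).

    fill_rate:   fraction of delay slots the compiler fills
    useful_rate: fraction of filled slots that do useful work
    """
    useful_fill = fill_rate * useful_rate
    cpi = base_cpi + branch_freq * (1.0 - useful_fill)
    return useful_fill, cpi

useful_fill, cpi = delay_slot_cpi(branch_freq=0.20, fill_rate=0.60, useful_rate=0.80)
print(f"usefully filled = {useful_fill:.0%}, CPI = {cpi:.3f}")
```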
Evaluating Branch Alternatives for MIPS
    Scheduling scheme    Branch penalty   CPI     Speedup v. stall
    Stall pipeline       1                1.2     1.0
    Predict taken        1                1.2     1.0
    Predict not taken    0/1              1.14    1.05
    Delayed branch       0.52             1.104   1.087
UnCond: 4%, NotTaken_Cond: 6%, Taken_Cond: 10%
Slot filled usefully: 48%
CPI{Delayed_Branch} = 1 + .2 * .52 = 1.104
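The comparison table can be regenerated from the branch mix with the sketch below. The per-class penalties for "predict not taken" (0 cycles for a not-taken conditional branch, 1 cycle otherwise) are an interpretation inferred from the table's 0/1 entry and its CPI of 1.14, not something the slide states explicitly. All names are hypothetical.

```python
# Branch mix from the slide.
FREQ = {"uncond": 0.04, "cond_not_taken": 0.06, "cond_taken": 0.10}
BRANCH_FREQ = sum(FREQ.values())  # 20% of instructions are branches

def cpi_from_penalties(penalty):
    """CPI = 1 + sum of frequency * penalty over branch classes."""
    return 1.0 + sum(FREQ[k] * penalty[k] for k in FREQ)

ONE_CYCLE = {"uncond": 1, "cond_not_taken": 1, "cond_taken": 1}

schemes = {
    "Stall pipeline": cpi_from_penalties(ONE_CYCLE),
    "Predict taken": cpi_from_penalties(ONE_CYCLE),
    # Inferred: only correctly predicted (not-taken) branches are free.
    "Predict not taken": cpi_from_penalties(
        {"uncond": 1, "cond_not_taken": 0, "cond_taken": 1}
    ),
    # Average penalty of 0.52 cycles: 48% of slots usefully filled.
    "Delayed branch": 1.0 + BRANCH_FREQ * 0.52,
}

baseline = schemes["Stall pipeline"]
for name, cpi in schemes.items():
    print(f"{name:18s} CPI = {cpi:.3f}  speedup = {baseline / cpi:.3f}")
```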