Upload
iris-banks
View
215
Download
0
Embed Size (px)
Citation preview
Pipelining: Basic and
Intermediate Concepts
COE 403Computer Architecture
Prof. Muhamed Mudawar
Computer Engineering Department
King Fahd University of Petroleum and Minerals
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 2
Presentation Outline
Pipelining Basics
MIPS 5-Stage Pipeline Microarchitecture
Structural Hazards
Data Hazards and Forwarding
Load Delay and Pipeline Stall
Control Hazards and Branch Prediction
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 3
What is Pipelining? Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units
Pipelining is to overlap the execution The k stages work in parallel on k different tasks
Tasks enter/leave pipeline at the rate of one task per time unit
1 2 k…
1 2 k…
1 2 k…
1 2 k…
1 2 k…
1 2 k…
Serial ExecutionOne completion every k time units
Pipelined ExecutionOne completion every 1 time unit
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 4
Synchronous Pipeline Uses clocked registers between stages
Upon arrival of a clock edge … All registers hold the results of previous stages simultaneously
The pipeline stages are combinational logic circuits
It is desirable to have balanced stages Approximately equal delay in all stages
Clock period is determined by the maximum stage delay
S1 S2 Sk
Re
gis
ter
Re
gis
ter
Re
gis
ter
Re
gis
ter
Input
Clock
Output
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 5
Let ti = time delay in stage Si
Clock cycle t = max(ti) is the maximum stage delay
Clock frequency f = 1/t = 1/max(ti)
A pipeline can process n tasks in k + n – 1 cycles
k cycles are needed to complete the first task
n – 1 cycles are needed to complete the remaining n – 1 tasks
Ideal speedup of a k-stage pipeline over serial execution
Pipeline Performance
k + n – 1Pipelined execution in cycles
Serial execution in cycles== Sk → k for large n
nkSk
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 6
Simple 5-Stage Processor Pipeline
Five stages, one cycle per stage
1. IF: Instruction Fetch from instruction memory Select address: next instruction, jump target, branch target
2. ID: Instruction Decode Determine control signals & read registers from the register file
3. EX: Execute operation Load and Store: Calculate effective memory address
Branch: Calculate address and outcome (Taken or Not Taken)
4. MEM: Memory access for load and store only
5. WB: Write Back result to register
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 7
Visualizing the Pipeline Multiple instruction execution over multiple clock cycles
Instructions are listed in program order from top to bottom Figure shows the use of resources at each stage and each cycle No interference between different instructions in adjacent stages
Time (in cycles)
Pro
gra
m O
rder
add r9, r8, r7
CC2
Reg
IM
DM
Reg
sub r5, r2, r3
CC4
ALU
IM
sw r2, 10(r3)
DM
Reg
CC5
Reg
ALU
IM
DM
Reg
CC6
Reg
ALU DM
CC7
Reg
ALU
CC8
Reg
DM
lw r6, 8(r5) IM
CC1
Reg
ori r4, r3, 7
ALU
CC3
IM
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 8
Time Diagram shows: Which instruction occupying what stage at each clock cycle
Instruction flow is pipelined over the 5 stages
Timing the Instruction Flow
IF
WB
–
EX
ID
WB
–
EX
WB
MEM –
ID
IF
EX
ID
IF
TimeCC1 CC4 CC5 CC6 CC7 CC8 CC9CC2 CC3
MEM
EX
ID
IF
WB
MEM
EX
ID
IF
lw r7, 8(r3)
lw r6, 8(r5)
ori r4, r3, 7
sub r5, r2, r3
sw r2, 10(r3)Inst
ruct
ion O
rder
Up to five instructions can be in the pipeline during the same cycle
Instruction Level Parallelism (ILP)
ALU instructions skip the MEM stage.
Store instructions skip the WB stage
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 9
Example of Pipeline Performance Consider a 5-stage instruction execution pipeline …
Instruction fetch = ALU = Data memory access = 350 ps
Register read = Register write = 250 ps
Compare single-cycle, multi-cycle, versus pipelined Assume: 20% load, 10% store, 40% ALU, and 30% branch
Solution:
Instruction Fetch Reg Read ALU Memory Reg Wr Time
Load 350 ps 250 ps 350 ps 350 ps 250 ps 1550 ps
Store 350 ps 250 ps 350 ps 350 ps 1300 ps
ALU 350 ps 250 ps 350 ps 250 ps 1200 ps
Branch 350 ps 250 ps 350 ps 950 ps
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 10
Single-Cycle, Multi-Cycle, PipelinedTclock = 350+250+350+350+250 = 1550 ps
CPI = 1, but long clock cycle
Tclock = 350 ps
Average CPI = 5×0.2 + 4×0.1 + 4×0.4 + 3×0.3 = 3.9
Multi-Cycle Execution:
350ps 350ps 350psLoad = 5 cycles
Reg
350ps
ALU Reg
IF Reg
350ps 350ps 350ps
ALU
350ps 350ps 350ps
IF Reg
350ps
ALU Reg
IF350ps
MEM
ALU = 4 cycles
Branch = 3 cycles
Each instruction = 1550 ps
IF Reg ALU RegMEM
Each instruction = 1550 ps
IF Reg ALU RegMEM
Each instruction = 1550 ps
IF Reg ALU RegMEM
Single-Cycle Execution:
350ps
Pipelined Execution:
350ps 350ps 350ps
IF Reg
350ps
ALU Reg
350ps
MEM
IF Reg ALU RegMEM
IF Reg ALU RegMEM
350ps
Tclock = 350 ps = max(350, 250)
One instruction completes each cycle
Average CPI = 1
Ignore time to fill pipeline
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 11
Single-Cycle, Multi-Cycle, Pipelined Single-Cycle CPI = 1, but long clock cycle = 1550 ps
Time of each instruction = 1550 ps
Multi-Cycle Clock = 350 ps (faster clock than single-cycle) But average CPI = 3.9 (worse than single-cycle)
Average time per instruction = 350 ps × 3.9 = 1365 ps
Multi-cycle is faster than single-cycle by: 1550/1365 = 1.14x
Pipeline Clock = 350 ps (same as multi-cycle) But average CPI = 1 (one instruction completes per cycle)
Average time per instruction = 350 ps × 1 = 350 ps
Pipeline is faster than single-cycle by: 1550/350 = 4.43x
Pipeline is also faster than multi-cycle by: 1365/350 = 3.9x
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 12
Pipeline Performance Summary Pipelining doesn’t improve latency of a single instruction
However, it improves throughput of entire workload Instructions are initiated and completed at a higher rate
In a k-stage pipeline, k instructions operate in parallel Overlapped execution using multiple hardware resources
Potential speedup = number of pipeline stages k
Unbalanced lengths of pipeline stages reduces speedup
Pipeline rate is limited by slowest pipeline stage
Unbalanced lengths of pipeline stages reduces speedup
Also, time to fill and drain pipeline reduces speedup
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 13
Next . . .
Pipelining Basics
MIPS 5-Stage Pipeline Microarchitecture
Structural Hazards
Data Hazards and Forwarding
Load Delay and Pipeline Stall
Control Hazards and Branch Prediction
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 14
Review of MIPS Instruction Formats All instructions are 32-bit wide Three instruction formats: R-type, I-type, and J-type
Op6: 6-bit opcode of the instruction Rs5, Rt5, Rd5: 5-bit source and destination register numbers sa5: 5-bit shift amount used by shift instructions funct6: 6-bit function field for R-type instructions immediate16: 16-bit immediate value or address offset immediate26: 26-bit target address of the jump instruction
Op6 Rs5 Rt5 Rd5 funct6sa5
Op6 Rs5 Rt5 immediate16
Op6 immediate26
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 15
Implementing Subset of Instructions
Only a subset of the MIPS instructions is considered
ALU R-type instructions: dadd, dsub, and, or, xor, slt, sltu, . . .
ALU I-type instructions: daddi, andi, ori, xori, slti, sltiu, . . .
Load and Store (I-type): ld, sd, . . .
Conditional Branch (I-type): beq, bne, . . .
Jump (J-type): j and jal
Jump Register (R-type): jr and jalr
This subset does not include all the instructions
Multiply, Divide, and Floating-Point instructions are not included
But sufficient to illustrate design of datapath and control
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 16
Details of the MIPS SubsetInstruction Meaning FormatR-type ALU DADD, DSUB, … op6=0 rs5 rt5 rd5 0 func6
dadd rd, rs, rt Rd (Rs) + (Rt) op6=0 rs5 rt5 rd5 0 0x2Cdsub rd, rs, rt Rd (Rs) – (Rt) op6=0 rs5 rt5 rd5 0 0x2EI-type ALU DADDI, ANDI, … op6 rs5 rt5 Immediate16daddi rt, rs, imm
Rt (Rs) + imm 0x18 rs5 rt5 Immediate16
andi rt, rs, imm
Rt (Rs) & imm 0x0C rs5 rt5 Immediate16
Load / Store LD, SD, LW, SW, … op6 rs5 rt5 Displacement16ld rt, imm(rs)
RtMem[(Rs)+imm] 0x37 rs5 rt5 Displacement16
sd rt, imm(rs)
RtMem[(Rs)+imm] 0x3F rs5 rt5 Displacement16
Cond Branch BEQ, BNE, … op6 rs5 rt5 PC-Relative Offsetbeq rs, rt, target
Branch if (Rs)==(Rt) 0x04 rs5 rt5 PC-Relative Offset
bne rs, rt, target
Branch if (Rs)!=(Rt) 0x05 rs5 rt5 PC-Relative Offset
j target Jump 0x02 26-bit Instruction indexjal target Jump and Link 0x03 26-bit Instruction indexjr rs Jump Register op6=0 rs5 0 0 0 0x08jalr rd, rs Jump and Link Reg op6=0 rs5 0 rd5 0 0x09
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 17
MIPS Single-Cycle Datapath
R31
Op
Branch Target Address
ID = Instruction Decode
& Register Read
flagsZ, N, C, V
PCSrc
ALUOp
RegWrite
ALUAddress
Instruction
InstructionCache
Rs
Rd
Ext
Rt
Jump Target = Imm26
ALU result
clk
PC
00
DataCache
Address
Data_in
Data_out
RegisterFile
RA
RB
BusA
BusBRW BusW
+1
MemRead
MemWrite
SelectResult
EX = Execute
Calculate AddressIF = Instruction Fetch MEM = Memory Access
(Load and Store)
WB
= W
rite
Ba
ck
3
2
1
0
+
1
0
Imm16
Jump Register Address
Next PC Address
0
2
1
2
0
1
ALUSrc
RegDst
Return Address
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 18
Op RegDst
RegWrite
ALUSrc
ALUOp
MemRead
MemWrite
SelectResult PCSrc
ALU-R Rd = 1 1 BusB = 0 func 0 0 ALU = 1 Next PC = 0
ALU-I Rt = 0 1 Imm = 1 op 0 0 ALU = 1 Next PC = 0
Load Rt = 0 1 Imm = 1 ADD 1 0 MEM = 0 Next PC = 0
Store X 0 Imm = 1 ADD 0 1 X Next PC = 0
J X 0 X X 0 0 X Jump = 1
JR X 0 X X 0 0 X Jump Reg = 2
JAL R31 = 2 1 X X 0 0 RA = 2 Jump = 1
JALR Rd = 1 1 X X 0 0 RA = 2 Jump Reg = 2
BEQ X 0 BusB = 0 SUB 0 0 X BR = 3 if (Z) else 0
BNE X 0 BusB = 0 SUB 0 0 X BR = 3 if (Z) else 0
Control Signal Values
X is a don’t care to minimize control logic
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 19
Controlling the Single-Cycle Datapath
Main Control Input: 6-bit opcode field
Output: Main control signals
ALU Control Input: 6-bit opcode field
6-bit function (R-type)
Output: ALUOp signal
PC Control Input: 6-bit opcode field
6-bit function (JR, JALR)
ALU Flags (for branch)
Output: PCSrc signal
PC
Control
PCSrc
Op
Func
Flags
Flags
ALU
ControlOp
ALUOpFunc
32Address
Instruction
InstructionCache
PC
3
2
1
0NextPC
Jump
JReg
Branch
ALU
Reg
Dst
Reg
Wri
te
AL
US
rc
Mem
Rea
d
Mem
Wri
te
Sel
ectR
esu
lt
MainControl
Op
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 20
How to Pipeline the Datapath?Answer: Introduce pipeline register at end of each stage
R31
Op
ID = Instruction Decode
& Register Read
flagsZ, N, C, V
PCSrc
ALUOp
RegWrite
ALUAddress
Instruction
InstructionCache
Rs
Rd
Ext
Rt
ALU result
clk
PC
00
DataCache
Address
Data_in
Data_out
RegisterFile
RA
RB
BusA
BusBRW BusW
+1
MemRead
MemWrite
SelectResult
EX = Execute
Calculate Address
IF = Instruction Fetch MEM = Memory Access
(Load and Store)
WB
= W
rite
Ba
ck
3
2
1
0
+
1
0
Imm16
AB
Imm
NP
C2
IRN
PC
0
2
1
Re
sult2
0
1
NP
C3
YD
ALUSrc
RegDst
Return Address
Branch Target Address
Jump Target = Imm26
Jump Register Address
Next PC Address
Wrong destination register
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 21
MIPS 5-Stage Pipeline
Destination register
should be pipelined
from ID to WB stageR31
Op
ID = Instruction Decode
& Register Read
flagsZ, N, C, V
PCSrc
ALUOp
RegWrite
ALUAddress
Instruction
InstructionCache
Rs
Rd
Ext
Rt
ALU result
PC
00
DataCache
Address
Data_in
Data_out
RegisterFile
RA
RB
BusA
BusB
RW BusW
+1
MemWrite
SelectResult
EX = Execute
Calculate Address
IF = Instruction Fetch MEM = Memory Access
(Load and Store)
WB
= W
rite
Ba
ck
3
2
1
0
+
1
0A
Imm16
IR BIm
mN
PC
2
NP
C3
NP
C
Y
0
2
1
Re
sult2
0
1
DALUSrc
RegDst
Return Address
Rd
2
Rd
3
Rd
4
MemRead
Branch Target Address
Jump Target = Imm26
Jump Register Address
Next PC Address
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 22
Pipelined Control
Pipeline the control signals as the instruction moves Extend the pipeline registers to include the control signals
Each stage uses some of the control signals Instruction Decode (ID) Stage
Generate all control signals
Select destination register: RegDst control signal
PC control uses: J (Jump) control signal for PCSrc
Execution Stage ALUSrc, and ALUOp
PC control uses: JR, Beq, Bne, and ALU flags for PCSrc
Memory Stage MemRead, MemWrite, and SelectResult
Write Back Stage RegWrite is used in this stage
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 23
Jump Register Address
R31
Branch Target Address
flags
ALUAddress
Instruction
InstructionCache
Rs
Rd
Ext
Rt
Jump Address
ALU result
PC
00
DataCache
Address
Data_in
Data_out
RegisterFile
RA
RB
BusA
BusB
RW BusW
+1 WB
= W
rite
Ba
ck
3
2
1
0
+
1
0
A
Imm16IR B
Imm
NP
C2
NP
C3
NP
C
Next PC Address
Y
0
2
1
Re
sult2
0
1
D
Return Address
Rd
2
Rd
3
Rd
4
3
2
1
ID = Instruction Decode EX = ExecuteMEM = Memory Access
Pipelined Datapath + Control
Pass controlSignals alongpipeline just
like dataControl
Signals
RegDst
Main & ALUControl
Op,func J
EX
RegWrite
RegWrite
WB
SelectResult
MemRead
MemWrite
ALUOp
ALUSrc
JRBEQBNE
ME
M
PC Control
J JR
flags
BEQ BNE
PCSrc
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 24
Events on Every Pipe StageStage Register Transfer
IFIR I-Cache[PC]; NPC PC+1 (Lower 2 bits are implicitly zeros)Branch = BEQ&Z | BNE&~Z (True if branch is taken, more branches can be added)PC NPC2+Imm if (Branch); else A if (JR); else PCHI ǁ IR.Imm26 if (J); else PC+1
IDA Reg[IR.Rs]; B Reg[IR.Rt]; Imm extend(IR.Imm16); NPC2 NPCGenerate Control signals; EX control signals sent to EX, MEM, and WB stagesRd2 Rt if (I-ALU or Load); Rd if (R-ALU); R31 if (JAL); else X (don’t care)
EX
ALUOp = func (R-ALU); Op if (I-ALU); ADD if (Load/Store); SUB if (Branch); else NOPALUSrc = B if (R-ALU or Branch); Imm if (I-ALU or Load/Store); else X (don’t care)Y ALUOp(A, B or Imm); D B; Rd3 Rd2; NPC3 NPC2MEM control signals sent to MEM and WB stages
MEMResult D-Cache[Y] if (MemRead); Y if (R-ALU or I-ALU); NPC3 if (JAL or JALR); else XD-Cache[Y] D if (MemWrite)Rd4 Rd3; WB control RegWrite
WB Reg[Rd4] Result if (RegWrite)
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 25
Next . . .
Pipelining Basics
MIPS 5-Stage Pipeline Microarchitecture
Structural Hazards
Data Hazards and Forwarding
Load Delay and Pipeline Stall
Control Hazards and Branch Prediction
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 26
Hazard: Situation that would cause incorrect execution If next instruction were launched during its designated clock cycle
1. Structural hazard Caused by hardware resource contention
Using same resource by two instructions during same clock cycle
2. Data hazard An instruction may depend on the result of a prior instruction still
in the pipeline, that did not write back its result into the register file
3. Control hazards Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Hazards complicate pipeline control and limit performance
Pipeline Hazards
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 27
Structural Hazard Definition
Attempt to use the same hardware resource by two different
instructions during the same clock cycle
Example Writing back ALU result in stage 4 Conflict with writing load data in stage 5
WB
WB
EX
ID
WB
EX MEM
IF ID
IF
TimeCC1 CC4 CC5 CC6 CC7 CC8 CC9CC2 CC3
EX
ID
IF
MEM
EX
ID
IF
lw r6, 8(r5)
ori r4, r13, 7
sub r5, r2, r13
sw r2, 10(r3)Inst
ruct
ions
Structural HazardTwo instructions are attempting to write
the register file during same cycle
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 28
Resolving Structural Hazard Is it serious? Yes! cannot be ignored
Solution 1: Delay access to resource Delay Write Back to Stage 5
Solution 2: Add more hardware Two write ports for register file (costly)
Does not improve performance
One fetch one completion per cycle
WB
EX
ID EX MEM
IF ID
IF
TimeCC1 CC4 CC5 CC6 CC7 CC8 CC9CC2 CC3
EX
ID
IF
MEM
EX
ID
IF
lw r6, 8(r5)
ori r4, r13, 7
sub r5, r2, r13
sw r2, 10(r3)Inst
ruct
ions
WB
WB
-
-
ResolvingStructural Hazard
Delay accessto register file
Write Back occursonly in Stage 5
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 29
Example 2 of Structural Hazard
One Cache Memory for both Instructions & Data
Instruction fetch requires cache access each clock cycle
Load/store also requires cache access to read/write data
Cannot fetch instruction and load data if one address port
Time (in cycles)
Pro
gra
m O
rder
add r9, r8, r7
CC2
Reg
Mem
Mem
Reg
sub r5, r2, r3
CC4
ALU
Mem
Mem
Reg
CC5
Reg
ALU Mem
CC6
Reg
ALU Mem
CC7
Reg
CC8
Reg
lw r6, 8(r5) Mem
CC1
Reg
ori r4, r3, 7
ALU
CC3
Mem
Structural
Hazard
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 30
Stalling the Pipeline Delay Instruction Fetch Stall pipeline (inject bubble)
Reduces performance: Stall pipeline for each load and store!
Better Solution: Use Separate Instruction & Data Caches Addressed independently: No structural hazard and No stalls
Time (in cycles)
Pro
gra
m O
rder
add r9, r8, r7
CC2
Reg
Mem
Mem
Reg
CC4
ALU Mem
CC5
Reg
ALU Mem
CC6
Reg
CC7
Reg
CC8
lw r6, 8(r5) Mem
CC1
Reg
ori r4, r3, 7
ALU
CC3
Mem
Stall Pipeline
Inject Bubble
sub r5, r2, r3 Mem Reg ALU Mem Reg
bubble bubble bubble bubble bubble
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 31
Resolving Structural Hazards
Serious Hazard: structural hazard cannot be ignored Can be eliminated with careful design of the pipeline
Solution 1: Delay Access to Hardware Resource Such as having all write backs to register file in the last stage, or
Stall the pipeline until resource is available
Solution 2: Add more Hardware Resources Add more hardware to eliminate the structural hazard
Such as having two cache memories for instructions & data
I-Cache and D-Cache can be addressed in parallel (same cycle)
Known as Harvard Architecture
Better than having two address ports for same cache memory
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 32
Performance Example Processor A: I-Cache + D-Cache (Harvard Architecture)
Processor B: single-ported cache for instructions & data
B has a clock rate 1.05X faster than clock rate of A
Loads + Stores = 40% of instructions executed
Ideal pipelined CPI = 1 (if no stall cycles)
Which processor is faster and by what factor?
Solution:
CPIA = 1, CPIB = 1+0.4 (due to structural hazards)
1.331.05
1
1
0.41
rate Clock
rate Clock
CPI
CPI Speedup
B
A
A
BA/B
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 33
Next . . .
Pipelining Basics
MIPS 5-Stage Pipeline Microarchitecture
Structural Hazards
Data Hazards and Forwarding
Load Delay and Pipeline Stall
Control Hazards and Branch Prediction
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 34
Occurs when one instruction depends on the result of a
previous instruction still in the pipeline Previous instruction did not write back its result to register file
Next instruction reads data before it is written
Data Dependence between instructions Given two instructions I and J, where I comes before J
Instruction J reads an operand written by I
I: add r8, r6, r10; I writes r8
J: sub r7, r8, r14 ; J reads r8
Read After Write: RAW Hazard Hazard occurs when J reads register r8 before I writes it
Data Hazard
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 35
DMReg
IM
Reg
ALU
IM
DM
Reg
Reg
ALU
IM
DM
Reg
Reg
ALU DM
Reg
ALU
Reg
DM
IM
Reg
ALU
IM
Time (cycles)
Pro
gra
m E
xecu
tion
Ord
er
value of r8
sub r8, r9, r13
CC110
CC2
addr4, r8, r15
10
CC3
or r16, r13, r8
10
CC4
andr17, r14, r8
10
CC620
CC720
CC820
CC5
sw r18, 16(r8)
10
Example of a RAW Data Hazard
Result of sub is needed by add, or, and, & sw instructions Instructions add & or will read old value of r8 from reg file During CC5, r8 is written at end of cycle, old value is read
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 36
RegReg
Solution 1: Stalling the Pipeline
Three stall cycles during CC3 thru CC5 (wasting 3 cycles)
Stall cycles delay execution of add & fetching of or instruction
The add instruction cannot read r8 until beginning of CC6
The add instruction remains in the Instruction register until CC6
The PC register is not modified until beginning of CC6
DM
Reg
RegReg
Time (in cycles)
Inst
ruct
ion O
rder value of r8
CC110
CC210
CC310
CC410
CC620
CC720
CC820
CC510
addr4, r8, r15 IM
or r16, r13, r8 IM ALU
ALU Reg
sub r8, r9, r13 IM Reg ALU DM Reg
CC920
stall stall DMstall
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 37
DM
Reg
Reg
Reg
Reg
Reg
Time (cycles)
Pro
gra
m E
xecu
tion
Ord
er
value of r8
sub r8, r9, r13 IM
CC110
CC2
addr4, r8, r15 IM
10
CC3
or r16, r13, r8
ALU
IM
10
CC4
andr17, r16, r8
ALU
IM
10
CC6
Reg
DM
ALU
20
CC7
Reg
DM
ALU
20
CC8
Reg
DM
20
CC5
sw r18, 16(r8)
Reg
DM
ALU
IM
10
Solution 2: Forwarding ALU Result The ALU result is forwarded (fed back) to the ALU input
No bubbles are inserted into the pipeline and no cycles are wasted
ALU result is forwarded from ALU, MEM, and WB stages
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 38
Implementing Forwarding
0123
0123
Yclk
Rs
Inst
ruct
ion
ALU result
DataCache
Address
Data_in
Data_out
Rd4
ALU1
0
Rd3
Rd2
AB
Re
sult
D
Imm
Re
gis
ter
Fil
e
RB
BusA
BusB
RW BusW
RARt
Two multiplexers added at the inputs of A & B registers Data from ALU stage, MEM stage, and WB stage is fed back
Two signals: ForwardA and ForwardB control forwarding
ForwardA
ForwardB
R31
Rd0
2
1
2
0
1
Imm16Ext
Return Address
NP
C3
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 39
Forwarding Control Signals
Signal Explanation
ForwardA = 0 First ALU operand comes from register file = Value of (Rs)
ForwardA = 1 Forward result of previous instruction to A (from ALU stage)
ForwardA = 2 Forward result of 2nd previous instruction to A (from MEM stage)
ForwardA = 3 Forward result of 3rd previous instruction to A (from WB stage)
ForwardB = 0 Second ALU operand comes from register file = Value of (Rt)
ForwardB = 1 Forward result of previous instruction to B (from ALU stage)
ForwardB = 2 Forward result of 2nd previous instruction to B (from MEM stage)
ForwardB = 3 Forward result of 3rd previous instruction to B (from WB stage)
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 40
Forwarding Example
0123
0123
YRs
Inst
ruct
ion
ALU result
DataCache
Address
Data_in
Data_out
Rd4
ALU1
0
Rd3
Rd2
AB
Re
sult
D
Imm
Re
gis
ter
Fil
e
RB
BusA
BusB
RW BusW
RARt
R31
Rd0
2
1
2
0
1
Imm16Ext
Return Address
NP
C3
Instruction sequence:lw r4, 4(r8)ori r7, r9, 2sub r3, r4, r7
When sub instruction is fetched
ori will be in the ALU stage
lw will be in the MEM stage
lw r4,4(r8)ori r7,r9,2sub r3,r4,r7
ForwardA = 2 from MEM stage
ForwardA=2
ForwardB=1
ForwardB = 1 from ALU stage
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 41
RAW Hazard Detection Current instruction being decoded is in Decode stage
Previous instruction is in the Execute stage
Second previous instruction is in the Memory stage
Third previous instruction in the Write Back stage
If ((Rs != 0) and (Rs == Rd2) and (EX.RegWrite)) ForwardA 1Else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWrite)) ForwardA 2Else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWrite)) ForwardA 3Else ForwardA 0
If ((Rt != 0) and (Rt == Rd2) and (EX.RegWrite)) ForwardB 1Else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWrite)) ForwardB 2Else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWrite)) ForwardB 3Else ForwardB 0
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 42
Hazard Detect and Forward Logic
0123
0123
YRs
Inst
ruct
ion
ALU result
DataCache
Address
Data_in
Data_out
Rd4
ALU1
0
Rd3
Rd2
AB
Re
sult
D
Imm
Re
gis
ter
Fil
e
RB
BusA
BusB
RW BusW
RARt
R31
Rd0
2
1
2
0
1
Imm16Ext
Return Address
NP
C3
ALUOp
RegDst
Main& ALUControl M
EME
X
WB
RegWrite
Op,func
ForwardB ForwardA
RegWriteRegWrite
Hazard Detect
and Forward
Rs
Rt
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 43
Next . . .
Pipelining Basics
MIPS 5-Stage Pipeline Microarchitecture
Structural Hazards
Data Hazards and Forwarding
Load Delay and Pipeline Stall
Control Hazards and Branch Prediction
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 44
Reg
Reg
Reg
Time (cycles)
Pro
gra
m O
rder
CC2
daddr14, r12, r15
Reg
IM
CC3
or r16, r13, r12
ALU
IM
CC6
Reg
DM
ALU
CC7
Reg
Reg
DM
CC8
Reg
ld r12, 24(r10) IM
CC1 CC4
and r17, r12, r13
DM
ALU
IM
CC5
DM
ALU
Unfortunately, not all data hazards can be forwarded Load has a delay that cannot be eliminated by forwarding
In the example shown below … The LD instruction does not read data until end of CC4
Cannot forward data to DADD at end of CC3 - NOT possible
However, load can forward data to
2nd next and later instructions
Load Delay
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 45
Detecting RAW Hazard after Load Detecting a RAW hazard after a Load instruction:
The load instruction will be in the EX stage
Instruction that depends on the load data is in the decode stage
Condition for stalling the pipeline
if ((EX.MemRead == 1) // Detect Load in EX stage
and (ForwardA==1 or ForwardB==1)) Stall // RAW Hazard
Insert a bubble into the EX stage after a load instruction
Bubble is a no-op that wastes one clock cycle
Delays the dependent instruction after load by once cycle
Because of RAW hazard
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 46
Regor r16, r13, r12
IM DM RegALU
RegALU DMReg
daddr14, r12, r15 IM
Regld r12, 24(r10) IM
stall
ALU
bubble bubble bubble
DM Reg
Stall the Pipeline for one Cycle DADD instruction depends on LD stall at CC3
Allow Load instruction in ALU stage to proceed
Freeze PC and Instruction registers (NO instruction is fetched)
Introduce a bubble into the ALU stage (bubble is a NO-OP)
Load can forward data to next instruction after delaying it Time (cycles)
Pro
gra
m O
rder
CC2 CC3 CC6 CC7 CC8CC1 CC4 CC5
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 47
ld r12, 8(r11) MEM WBEXIDStallIF
ld r11, (r15) MEM WBEXIDIF
Stall Cycles Stall cycles are shown on a timing diagram
Hazard is detected in the Decode stage
Stall indicates that instruction is delayed (bubble inserted)
Instruction fetching is also delayed after a stall
Example:
dadd r3, r12, r13 MEM WBEXIDStallIF
dsub r4, r11, r3 MEM WBEXIDIF
TimeCC1 CC4 CC5 CC6 CC7 CC8 CC9CC2 CC3 CC10
Data forwarding is shown using green arrows
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 48
Hazard Detect, Forward, and Stall
ForwardB ForwardA
RegWriteRegWrite
MemRead
Hazard Detect
Forward, & Stall
Rs
Rt
0123
0123
YRs
Inst
ruct
ion
ALU result
DataCache
Address
Data_in
Data_out
Rd4
ALU1
0
Rd3
Rd2
AB
Re
sult
D
Imm
Re
gis
ter
Fil
eRB
BusA
BusB
RW BusW
RARt
R31
Rd0
2
1
2
0
1
Imm16Ext
Return Address
NP
C3
PC
RegDst
Main & ALU Control
ME
MEX
WB
RegWrite
Op,func
Control Signals
Bubble = 0 1
0
Stall
Dis
able
PC
Dis
able
IR
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 49
Compiler Scheduling to Avoid Stalls
Compilers reorder code in a way to avoid load stalls
Consider the translation of the following statements:
A = B + C; D = E – F; // A thru F are in Memory
Original code: two stall cycles
ld r10, 8(r16) ; &B = 8+(r16)ld r11, 16(r16) ; &C = 16+(r16)dadd r12, r10, r11 ; stall cyclesd r12, 0(r16) ; &A = (r16)ld r13, 32(r16) ; &E = 32+(r16) ld r14, 40(r16) ; &F = 40+(r16)dsub r15, r13, r14 ; stall cyclesd r15, 24(r16) ; &D = 24+(r16)
Faster code: No Stalls
ld r10, 8(r16)ld r11, 16(r16)ld r13, 32(r16)ld r14, 40(r16)dadd r12, r10, r11sd r12, 0(r16)dsub r15, r13, r14sd r15, 24(r16)
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 50
Instruction J should write its result after it is read by I
I: dsub r14, r11, r13 ; r11 is read
J: dadd r11, r12, r13 ; r11 is written
Called Anti-Dependence: Re-use of register r11
NOT a data hazard in the 5-stage pipeline because:
Reads are always in stage 2
Writes are always in stage 5, and
Instructions are processed in order
Anti-dependence can be eliminated by renaming
Use a different destination register for dadd (eg, r15)
Name Dependence: Write After Read
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 51
Name Dependence: Write After Write
Same destination register is written by two instructions
I: dsub r11, r14, r13 ; r11 is written
J: dadd r11, r12, r13 ; r11 is written again
Called Output Dependence: Re-use of register r11
Not a data hazard in the 5-stage pipeline because:
All writes are ordered and always take place in stage 5
However, can be a hazard in more complex pipelines
If Instruction J completes and writes r11 before instruction I
Output dependence can be eliminated by renaming r11
Read After Read is NOT a name dependence
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 52
Next . . .
Pipelining Basics
MIPS 5-Stage Pipeline Microarchitecture
Structural Hazards
Data Hazards and Forwarding
Load Delay and Pipeline Stall
Control Hazards and Branch Prediction
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 53
What is needed to calculate next PC?
For Unconditional Jumps Opcode (J or JAL), PC and 26-bit address (immediate)
For Jump Register Opcode + function (JR or JALR) and Register[Rs] value
For Conditional Branches Opcode, branch outcome (taken or not), PC and 16-bit offset
For Other Instructions Opcode and PC value
Opcode is decoded in ID stage Jump delay = 1 cycle
Branch outcome is computed in EX stage Branch delay = 2 clock cycles
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 54
2-Cycle Branch Delay Control logic detects a Branch instruction in the 2nd Stage
ALU computes the Branch outcome in the 3rd Stage
Next1 and Next2 instructions will be fetched anyway
Convert Next1 and Next2 into bubbles if branch is taken
beq r8,r9,L1 IF
cc1
Next1
cc2
Reg
IF
Next2
cc4 cc5 cc6 cc7
IF Reg DMALU
BubbleBubble Bubble
BubbleBubble BubbleBubble
L1: target instruction
cc3
BranchTargetAddr
ALU
Reg
IF
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 55
Predict Branch NOT Taken Branches can be predicted to be NOT taken
If branch outcome is NOT taken then
Next1 and Next2 instructions can be executed
Do not convert Next1 & Next2 into bubbles
No wasted clock cycles
beq r8,r9,L1 IF
cc1
Next1
cc2
Reg
IF
Next2
cc3
NOT TakenALU
Reg
IF Reg
cc4 cc5 cc6 cc7
ALU DM
ALU DM
Reg
Reg
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 56
Pipelined Jump and Branch
Jump kills next instructionTaken branch kills next two Main & ALU
Control
Op, func
Control Signals
ME
M
Control Signals
EX1
0
Bubble = 0
PCSrc
Kill
Kill2J JR, BEQ, BNE
flags
PC Control
R31
Address
Instruction
InstructionCache
Rs
Rd
Ext
Rt
Jump Address = Imm26
PC
00
RegisterFile
RA
RB
BusA
BusB
RW BusW
+10
1
2
3
+
1
0
A
Imm16
IR BIm
mN
PC
2
NP
C3
NP
C
Next PC Address
Y
0
21
D
Rd
2
Rd
3
0123
0123
ALU1
0
Branch Target Address
Jump Register Address
Forward & Stall
Stall
Dis
able
PC
Dis
able
IR
Bubble = 0
Rs
RtRd2, Rd3, Rd4
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 57
Jump and Branch Impact on CPI Base CPI = 1 without counting jump and branch stalls
Unconditional Jump = 5%, Conditional branch = 20%
and 90% of conditional branches are taken
1 stall cycle per jump, 2 stall cycles per taken branch
What is the effect of jump and branch stalls on the CPI?
Solution:
Jump adds 1 stall cycle for 5% of instructions = 1 × 0.05
Branch adds 2 stall cycles for 20% × 90% of instructions
= 2 × 0.2 × 0.9 = 0.36
New CPI = 1 + 0.05 + 0.36 = 1.41
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 58
Branch Hazard Alternatives Predict Branch Not Taken (previously discussed)
Successor instruction is already fetched
Do NOT Flush instruction after branch if branch is NOT taken
Flush only instructions appearing after Jump or taken branch
Delayed Branch Define branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (only one slot)
Dynamic Branch Prediction Loop branches are taken most of time
How to predict the branch behavior at runtime?
Must reduce the branch delay to zero, but how?
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 59
Define branch to take place after the next instruction
MIPS defines one delay slot Reduces branch penalty
Compiler fills the branch delay slot By selecting an independent instruction
from before the branch
Must be okay to execute instruction in the
delay slot whether branch is taken or not
If no instruction is found Compiler fills delay slot with a NO-OP
Delayed Branch
label:
. . .
dadd r8,r9,r10
beq r5,r6,label
Branch Delay Slot
label:
. . .
beq r5,r6,label
dadd r8,r9,r10
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 60
New meaning for branch instruction Branching takes place after next instruction (Not immediately!)
Impacts software and compiler Compiler is responsible to fill the branch delay slot
However, modern processors and deeply pipelined Branch penalty is multiple cycles in deeper pipelines
Multiple delay slots are difficult to fill with useful instructions
MIPS used delayed branching in earlier pipelines However, delayed branching lost popularity in recent processors
Dynamic branch prediction has replaced delayed branching
Drawback of Delayed Branching
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 61
Branch Target Buffer (IF Stage) The branch target buffer is implemented as a small cache
Stores the target addresses of recent branches and jumps
We must also have prediction bits To predict whether branches are taken or not taken
The prediction bits are determined by the hardware at runtime
mux
PC
Branch Target & Prediction Buffer
Addresses of
Recent Branches
Target
Addresses
low-order bits used as index
Predict
BitsInc
=predict_taken
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 62
Branch Target Buffer – cont’d Each Branch Target Buffer (BTB) entry stores:
Address of a recent jump or branch instruction
Target address of jump or branch
Prediction bits for a conditional branch (Taken or Not Taken)
To predict jump/branch target address and branch outcome
before instruction is decoded and branch outcome is computed
Use the lower bits of the PC to index the BTB
Check if the PC matches an entry in the BTB (jump or branch)
If there is a match and the branch is predicted to be Taken then
Update the PC using the target address stored in the BTB
The BTB entries are updated by the hardware at runtime
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 63
Dynamic Branch Prediction
Prediction of branches at runtime using prediction bits
Prediction bits are associated with each entry in the BTB
Prediction bits reflect the recent history of a branch instruction
Typically few prediction bits (1 or 2) are used per entry
We don’t know if the prediction is correct or not
If correct prediction then
Continue normal execution – no wasted cycles
If incorrect prediction (or misprediction) then
Kill the instructions that were incorrectly fetched – wasted cycles
Update prediction bits and target address for future use
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 64
CorrectPrediction
No stallcycles
YesNo
Dynamic Branch Prediction – Cont’d
Use PC to address Instruction Cache and Branch Target Buffer
BTBPredict taken?
Increment PC PC = target address
Enter branch & target address,and set prediction in BTB entry.
Kill fetched instructions.Restart PC at target address
Mispredicted branchKill fetched instructionsUpdate prediction bits
Restart PC after branch
NormalExecution
YesNo
Takenbranch?
No Yes
IFID
EX
Takenbranch?
Jump?
No
Enter jump & target address in BTBKill fetched instruction.
Restart PC at jump target address
Yes
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 65
Prediction is just a hint that is assumed to be correct
If incorrect then fetched instructions are killed
1-bit prediction scheme is simplest to implement 1 bit per branch instruction (associated with BTB entry)
Record last outcome of a branch instruction (Taken/Not taken)
Use last outcome to predict future behavior of a branch
1-bit Prediction Scheme
Predict Not Taken
Taken
Predict Taken
NotTaken
Not Taken
Taken
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 66
1-Bit Predictor: ShortcomingInner loop branch mispredicted twice!
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner loop next time around
outer: … …inner: … … bne …, …, inner … bne …, …, outer
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 67
1-bit prediction scheme has a performance shortcoming
2-bit prediction scheme works better and is often used 4 states: strong and weak predict taken / predict not taken
Implemented as a saturating counter Counter is incremented to max=3 when branch outcome is taken
Counter is decremented to min=0 when branch is not taken
2-bit Prediction Scheme
Not Taken
Taken
Not Taken
TakenStrongPredict
Not Taken
Taken
WeakPredict Taken
Not Taken
WeakPredict
Not TakenNot Taken
TakenStrongPredict Taken
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 68
Evaluating Branch Alternatives
Assume: Jump = 3%, Branch-Not-Taken = 5%, Branch-Taken = 15%
Assume a branch target buffer with hit rate = 90% for jump & branch
Prediction accuracy for jump = 100%, for conditional branch = 80%
What is the impact on the CPI? (Ideal CPI = 1 if no control hazards)
Branch Scheme Jump Branch Not Taken Branch TakenPredict not taken Penalty = 2 cycles Penalty = 0 cycles Penalty = 3 cycles
Delayed branch Penalty = 1 cycle Penalty = 0 cycles Penalty = 2 cycles
BTB Prediction Penalty = 2 cycles Penalty = 3 cycles Penalty = 3 cycles
Branch Scheme Jump = 3% Branch NT = 5% Branch Taken = 15% CPI
Predict not taken 0.03 × 2 0 0.15 × 3 = 0.45 1+0.51
Delayed branch 0.03 × 1 0 0.15 × 2 = 0.30 1+0.33
BTB Prediction 0.03×0.1×2 0.05×0.9×0.2×3 0.15×(0.1+0.9×0.2)×3 1+0.16
Pipelining: Basic and Intermediate Concepts COE 403 – Computer Architecture - KFUPMMuhamed Mudawar – slide 69
In Summary Three types of pipeline hazards
Structural hazards: conflict using a resource during same cycle
Data hazards: due to data dependencies between instructions
Control hazards: due to branch and jump instructions
Hazards limit the performance and complicate the design Structural hazards: eliminated by careful design or more
hardware
Data hazards can be eliminated by forwarding
However, load delay cannot be eliminated and stalls the pipeline
Delayed branching reduces branch delay compiler support
BTB with branch prediction can reduce branch delay to zero
Branch misprediction should kill the wrongly fetched instructions