CS4100: Computer Architecture
Pipelining
Department of Computer Science, National Tsing Hua University, Academic Year 100, Second Semester
Computer ArchitecturePipelining-2
Outline
An overview of pipelining
A pipelined datapath
Pipelined control
Data hazards and forwarding
Data hazards and stalls
Branch hazards
Exceptions
Superscalar and dynamic pipelining
Computer ArchitecturePipelining-3
Pipelining Is Natural!
Laundry example:
Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
Computer ArchitecturePipelining-4
Sequential Laundry
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would it take?
[Figure: timeline from 6 PM to midnight, task order versus time; loads A, B, C, D run back to back, each taking 30 + 40 + 20 minutes]
Computer ArchitecturePipelining-5
Pipelined Laundry: Start ASAP
Pipelined laundry takes 3.5 hours for 4 loads
[Figure: timeline starting at 6 PM, task order versus time; loads A, B, C, D overlap, with successive intervals of 30, 40, 40, 40, 40, 20 minutes]
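The two totals can be checked with a short calculation. The sketch below is illustrative only: the function names are made up here, and it assumes a load may wait between machines for as long as needed.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each load enters a stage as soon as both the load and the machine are ready."""
    free_at = [0] * len(stage_minutes)  # time at which each machine is next free
    done = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for stage, minutes in enumerate(stage_minutes):
            start = max(ready, free_at[stage])
            ready = start + minutes
            free_at[stage] = ready
        done = ready
    return done


laundry = [30, 40, 20]  # washer, dryer, "folder" minutes from the slides
print(sequential_minutes(laundry, 4) / 60)  # 6.0 hours
print(pipelined_minutes(laundry, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: after the first wash, the 40-minute dryer sets the pace.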
Computer ArchitecturePipelining-6
Pipelining Lessons
Doesn’t help latency of a single task, but helps throughput of the entire workload
Pipeline rate is limited by the slowest stage
Multiple tasks work at the same time using different resources
Potential speedup = number of pipe stages
Unbalanced stage lengths and the time to “fill” and “drain” the pipeline reduce speedup
Stall for dependences
[Figure: pipelined laundry timeline, 6 PM to 9 PM, task order versus time; loads A, B, C, D with intervals of 30, 40, 40, 40, 40, 20 minutes]
Computer ArchitecturePipelining-7
Single cycle vs. Pipeline
[Figure: clock timing diagrams. Single Cycle Implementation: Load in Cycle 1, Store in Cycle 2, with part of the cycle marked as waste. Pipeline Implementation: Load, Store, and R-type each pass through Ifetch, Reg, Exec, Mem, Wr in overlapping cycles, shown over Cycle 1 to Cycle 10]
Computer ArchitecturePipelining-8
Pipeline Performance
Single-cycle (Tc = 800 ps)
Pipelined (Tc = 200 ps)
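A rough way to compare the two clock periods is to count total time for n instructions. The sketch below assumes a 5-stage pipeline with no stalls; the function names are illustrative and not taken from the slides.

```python
def single_cycle_time_ps(n_instructions, tc_ps=800):
    """One long clock cycle per instruction."""
    return n_instructions * tc_ps


def pipelined_time_ps(n_instructions, tc_ps=200, stages=5):
    """The first instruction needs `stages` cycles; each later one completes one cycle after."""
    return (stages + n_instructions - 1) * tc_ps


for n in (3, 1_000_000):
    single = single_cycle_time_ps(n)
    piped = pipelined_time_ps(n)
    print(n, single, piped, round(single / piped, 2))
```

For three instructions this gives 2400 ps against 1400 ps. For long instruction streams the speedup approaches 800/200 = 4 rather than 5, because the stages are not perfectly balanced.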
Computer ArchitecturePipelining-9
Why Pipeline? Because the Resources Are There!
[Figure: instruction order (Inst 0 to Inst 4) versus time in clock cycles; each instruction passes through Im, Reg, ALU, Dm, Reg, reusing the resources of the single-cycle datapath in overlapped fashion]
Computer ArchitecturePipelining-10
Outline
An overview of pipelining
A pipelined datapath
Pipelined control
Data hazards and forwarding
Data hazards and stalls
Branch hazards
Exceptions
Superscalar and dynamic pipelining
Computer ArchitecturePipelining-11
Designing a Pipelined Processor
Examine the datapath and control diagram
Starting with single-cycle datapath
Single-cycle control?
Partition datapath into stages:
IF (instruction fetch), ID (instruction decode and register file read), EX (execution or address calculation), MEM (data memory access), WB (write back)
Associate resources with stages
Ensure that flows do not conflict, or figure out how to resolve
Assert control in appropriate stage
Computer ArchitecturePipelining-12
Multi-Execution Steps
Actions per step for R-type instructions, memory-reference instructions, branches, and jumps:
Instruction fetch (all instructions):
IR = Memory[PC]; PC = PC + 4
Instruction decode/register fetch (all instructions):
A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2)
Execution, address computation, branch/jump completion:
R-type: ALUOut = A op B
Memory reference: ALUOut = A + sign-extend(IR[15-0])
Branch: if (A == B) then PC = ALUOut
Jump: PC = PC[31-28] || (IR[25-0] << 2)
Memory access or R-type completion:
R-type: Reg[IR[15-11]] = ALUOut
Load: MDR = Memory[ALUOut]
Store: Memory[ALUOut] = B
Memory read completion:
Load: Reg[IR[20-16]] = MDR
But, use single-cycle datapath ...
Computer ArchitecturePipelining-13
Split Single-cycle Datapath
What to add to split the datapath into stages?
[Figure: single-cycle datapath divided into five stages, with the feedback path marked]
IF: Instruction fetch
ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access
WB: Write back
Computer ArchitecturePipelining-14
Add Pipeline Registers
[Figure: datapath with pipeline registers (latches) IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages]
Use registers between stages to carry data and control
Computer ArchitecturePipelining-15
Consider load
IF: Instruction Fetch
Fetch the instruction from the Instruction Memory
ID: Instruction Decode
Registers fetch and instruction decode
EX: Calculate the memory address
MEM: Read the data from the Data Memory
WB: Write the data back to the register file
[Figure: Load occupies Ifetch, Reg/Dec, Exec, Mem, Wr in Cycle 1 to Cycle 5]
Computer ArchitecturePipelining-16
Pipelining load
[Figure: clock cycles 1 to 7; 1st lw, 2nd lw, and 3rd lw each pass through Ifetch, Reg/Dec, Exec, Mem, Wr, starting one cycle apart]
5 functional units in the pipeline datapath are:
Instruction Memory for the Ifetch stage
Register File’s read ports (busA and busB) for the Reg/Dec stage
ALU for the Exec stage
Data Memory for the MEM stage
Register File’s write port (busW) for the WB stage
Computer ArchitecturePipelining-17
IF Stage of load
IR = mem[PC]; PC = PC + 4
[Figure: pipelined datapath with lw highlighted; panels: Instruction fetch, Instruction decode; IR and PC+4 are latched into IF/ID]
Computer ArchitecturePipelining-18
ID Stage of load
A = Reg[IR[25-21]]; B = Reg[IR[20-16]];
[Figure: pipelined datapath with lw highlighted; panels: Instruction fetch, Instruction decode]
Computer ArchitecturePipelining-19
EX Stage of load
ALUout = A + sign-ext(IR[15-0])
[Figure: pipelined datapath with lw highlighted in the Execution stage]
Computer ArchitecturePipelining-20
MEM Stage of load
MDR = mem[ALUout]
[Figure: pipelined datapath with lw highlighted; panels: Memory, Write back (Patterson Figure 06.15)]
Computer ArchitecturePipelining-21
WB Stage of load
Reg[IR[20-16]] = MDR
[Figure: pipelined datapath with lw highlighted; panels: Memory, Write back (Patterson Figure 06.15)]
Who will supply this address?
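The five register-transfer steps of lw can be traced in a few lines. This is a minimal sketch, not the deck's own design: the dictionary names stand in for the pipeline registers, memories are modeled as dictionaries keyed by byte address, and the destination field IR[20-16] is assumed to be copied from one pipeline register to the next so that it is still available in WB.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run_lw(pc, regs, imem, dmem):
    """Trace one lw through IF, ID, EX, MEM, WB and return the next PC."""
    # IF: IR = mem[PC]; PC = PC + 4
    if_id = {"ir": imem[pc], "pc4": pc + 4}
    # ID: A = Reg[IR[25-21]]; also keep the immediate and the destination field
    ir = if_id["ir"]
    id_ex = {
        "a": regs[(ir >> 21) & 0x1F],
        "imm": sign_extend16(ir),
        "rt": (ir >> 16) & 0x1F,  # carried along (assumption of this sketch)
    }
    # EX: ALUout = A + sign-ext(IR[15-0])
    ex_mem = {"alu_out": id_ex["a"] + id_ex["imm"], "rt": id_ex["rt"]}
    # MEM: MDR = mem[ALUout]
    mem_wb = {"mdr": dmem[ex_mem["alu_out"]], "rt": ex_mem["rt"]}
    # WB: Reg[IR[20-16]] = MDR, using the register number that travelled with the data
    regs[mem_wb["rt"]] = mem_wb["mdr"]
    return if_id["pc4"]


# lw $10, 20($1): opcode 35, rs = 1, rt = 10, immediate = 20
LW = (35 << 26) | (1 << 21) | (10 << 16) | 20
regs = [0] * 32
regs[1] = 100
next_pc = run_lw(0, regs, {0: LW}, {120: 42})
print(regs[10], next_pc)  # 42 4
```

By the time lw reaches WB, IF/ID already holds a later instruction, which is why the write register number has to travel with the load rather than be read from IF/ID.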
Computer ArchitecturePipelining-22
The Four Stages of R-type
[Figure: R-type occupies Ifetch, Reg/Dec, Exec, Wr in Cycle 1 to Cycle 4]
IF: fetch the instruction from the Instruction Memory
ID: registers fetch and instruction decode
EX: ALU operates on the two register operands
WB: write ALU output back to the register file
Computer ArchitecturePipelining-23
Pipelining R-type and load
[Figure: clock cycles 1 to 9; the sequence R-type, R-type, Load, R-type, R-type is issued one cycle apart; R-type uses Ifetch, Reg/Dec, Exec, Wr and Load uses Ifetch, Reg/Dec, Exec, Mem, Wr]
Oops! We have a problem!
We have a structural hazard:
Two instructions try to write to the register file at the same time!
Only one write port
Computer ArchitecturePipelining-24
Important Observation
[Figure: Load uses stages 1 to 5 (Ifetch, Reg/Dec, Exec, Mem, Wr); R-type uses stages 1 to 4 (Ifetch, Reg/Dec, Exec, Wr)]
Each functional unit can only be used once per instruction
Each functional unit must be used at the same stage for all instructions:
Load uses Register File’s write port during its 5th stage
R-type uses Register File’s write port during its 4th stage
Several ways to solve: forwarding, adding pipeline bubble, making instructions same length
Computer ArchitecturePipelining-25
Solution: Delay R-type’s Write
Delay R-type’s register write by one cycle:
R-type also uses Reg File’s write port at Stage 5
MEM is a NOP stage: nothing is being done.
R-type also has 5 stages
[Figure: clock cycles 1 to 9; R-type, R-type, Load, R-type, R-type each pass through Ifetch, Reg/Dec, Exec, Mem, Wr (stages 1 to 5), one cycle apart]
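The write-port conflict and its fix can be checked mechanically. The sketch below uses made-up names and assumes one instruction is fetched per cycle with no stalls, so instruction i (counting from 0) is in stage s during cycle i + s.

```python
from collections import defaultdict


def write_port_conflicts(program, write_stage):
    """Return {cycle: [instruction indexes]} for cycles with more than one register write."""
    writers = defaultdict(list)
    for i, kind in enumerate(program):
        writers[i + write_stage[kind]].append(i)
    return {cycle: who for cycle, who in writers.items() if len(who) > 1}


program = ["R-type", "R-type", "Load", "R-type", "R-type"]

# R-type writes in its 4th stage, Load in its 5th: two writes land in cycle 7
print(write_port_conflicts(program, {"R-type": 4, "Load": 5}))  # {7: [2, 3]}

# With R-type's write delayed to stage 5, every instruction writes in a different cycle
print(write_port_conflicts(program, {"R-type": 5, "Load": 5}))  # {}
```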
Computer ArchitecturePipelining-26
The Four Stages of store
[Figure: Store occupies Ifetch, Reg/Dec, Exec, Mem in Cycle 1 to Cycle 4, followed by Wr]
IF: fetch the instruction from the Instruction Memory
ID: registers fetch and instruction decode
EX: calculate the memory address
MEM: write the data into the Data Memory
Add an extra stage:
WB: NOP
Computer ArchitecturePipelining-27
The Three Stages of beq
[Figure: Beq occupies Ifetch, Reg/Dec, Exec, followed by Mem and Wr; cycles 1 to 4 are labeled]
IF: fetch the instruction from the Instruction Memory
ID: registers fetch and instruction decode
EX:
compares the two register operands
selects correct branch target address
latches into PC
Add two extra stages:
MEM: NOP
WB: NOP
Computer ArchitecturePipelining-28
Pipelined Datapath
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]
Computer ArchitecturePipelining-29
Graphically Representing Pipelines
Can help with answering questions like:
How many cycles to execute this code?
What is the ALU doing during cycle 4?
Help understand datapaths
[Figure: multi-clock-cycle diagram, time in clock cycles CC 1 to CC 6 versus program execution order; lw $10, 20($1) followed by sub $11, $2, $3, each passing through IM, Reg, ALU, DM, Reg]
Computer ArchitecturePipelining-30
Example 1: Cycle 1
[Figure: pipelined datapath. Clock 1: lw $10, 20($1) in instruction fetch. Clock 2: sub $11, $2, $3 in instruction fetch, lw $10, 20($1) in instruction decode]
Computer ArchitecturePipelining-31
Example 1: Cycle 2
[Figure: pipelined datapath. Clock 1: lw $10, 20($1) in instruction fetch. Clock 2: sub $11, $2, $3 in instruction fetch, lw $10, 20($1) in instruction decode]
Computer ArchitecturePipelining-32
Example 1: Cycle 3
[Figure: pipelined datapath. Clock 3: sub $11, $2, $3 in instruction decode, lw $10, 20($1) in execution. Clock 4: sub $11, $2, $3 in execution, lw $10, 20($1) in memory]
Computer ArchitecturePipelining-33
Example 1: Cycle 4
[Figure: pipelined datapath. Clock 3: sub $11, $2, $3 in instruction decode, lw $10, 20($1) in execution. Clock 4: sub $11, $2, $3 in execution, lw $10, 20($1) in memory]
Computer ArchitecturePipelining-34
Example 1: Cycle 5
[Figure: pipelined datapath. Clock 5: sub $11, $2, $3 in memory, lw $10, 20($1) in write back. Clock 6: sub $11, $2, $3 in write back]
Computer ArchitecturePipelining-35
Example 1: Cycle 6
[Figure: pipelined datapath. Clock 5: sub $11, $2, $3 in memory, lw $10, 20($1) in write back. Clock 6: sub $11, $2, $3 in write back]
Computer ArchitecturePipelining-36
Outline
An overview of pipelining
A pipelined datapath
Pipelined control
Data hazards and forwarding
Data hazards and stalls
Branch hazards
Exceptions
Superscalar and dynamic pipelining
Computer ArchitecturePipelining-37
Pipeline Control: Control Signals
[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with the control signals labeled: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg]
Computer ArchitecturePipelining-38
Group Signals According to Stages
Can use control signals of single-cycle CPU
Execution/address calculation stage control lines: RegDst, ALUOp1, ALUOp0, ALUSrc
Memory access stage control lines: Branch, MemRead, MemWrite
Write-back stage control lines: RegWrite, MemtoReg
Instruction | RegDst ALUOp1 ALUOp0 ALUSrc | Branch MemRead MemWrite | RegWrite MemtoReg
R-format | 1 1 0 0 | 0 0 0 | 1 0
lw | 0 0 0 1 | 0 1 0 | 1 1
sw | X 0 0 1 | 0 0 1 | 0 X
beq | X 0 1 0 | 1 0 0 | 0 X
(Fig. 4.22)
Computer ArchitecturePipelining-39
Data Stationary Control
Pass control signals along just like the data
Main control generates control signals during ID
[Figure: Control decodes the instruction in ID into EX, M, and WB signal groups; ID/EX carries EX, M, WB; EX/MEM carries M, WB; MEM/WB carries WB (Fig. 4.50)]
Computer ArchitecturePipelining-40
Data Stationary Control (cont.)
Signals for EX (ExtOp, ALUSrc, ...) are used 1 cycle later
Signals for MEM (MemWr, Branch) are used 2 cycles later
Signals for WB (MemtoReg, RegWr) are used 3 cycles later
[Figure: Main Control in ID generates ExtOp, ALUOp, RegDst, ALUSrc, Branch, MemWr, MemtoReg, and RegWr; the ID/EX, EX/MEM, and MEM/WB registers carry the signals still needed, so ExtOp, ALUOp, RegDst, ALUSrc are consumed in EX, Branch and MemWr in MEM, and MemtoReg and RegWr in WB]
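Data-stationary control can be mimicked by shifting signal groups from one pipeline register to the next on every clock. The sketch below is a toy model, not the deck's hardware: the signal values follow the textbook control table for R-format, lw, sw, and beq (None stands for a don't-care), while the `clock` function, the `NOP` bubble, and the dictionary layout are assumptions of this example.

```python
# Control values per instruction class, grouped by the stage that uses them.
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw": {"EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
           "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
           "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw": {"EX": {"RegDst": None, "ALUOp": 0b00, "ALUSrc": 1},
           "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
           "WB": {"RegWrite": 0, "MemtoReg": None}},
    "beq": {"EX": {"RegDst": None, "ALUOp": 0b01, "ALUSrc": 0},
            "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
            "WB": {"RegWrite": 0, "MemtoReg": None}},
}
NOP = {"EX": {}, "M": {}, "WB": {}}  # hypothetical bubble: nothing asserted
EMPTY = {"ID/EX": NOP, "EX/MEM": {"M": {}, "WB": {}}, "MEM/WB": {"WB": {}}}


def clock(pipe, in_id):
    """Advance one clock edge; `in_id` is the instruction class now being decoded."""
    ctrl = CONTROL.get(in_id, NOP)
    return {
        "ID/EX": dict(ctrl),                               # all three groups leave ID
        "EX/MEM": {"M": pipe["ID/EX"]["M"], "WB": pipe["ID/EX"]["WB"]},
        "MEM/WB": {"WB": pipe["EX/MEM"]["WB"]},
    }


pipe = EMPTY
for instr in ["lw", None, None]:
    pipe = clock(pipe, instr)
print(pipe["MEM/WB"]["WB"])  # lw's write-back signals, three clocks after decode
```

Each group is dropped once its stage has used it, so the pipeline registers get narrower from left to right.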
Computer ArchitecturePipelining-41
WB Stage of load
Reg[IR[20-16]] = MDR
[Figure: pipelined datapath with lw highlighted; panels: Memory, Write back (Patterson Figure 06.15)]
Who will supply this address?
Computer ArchitecturePipelining-42
Datapath with Control
[Figure: pipelined datapath with control; the Control unit in ID produces EX, M, and WB signal groups that travel through ID/EX, EX/MEM, and MEM/WB; labeled signals: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg]
Computer ArchitecturePipelining-43
Let’s Try it Out
lw $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9
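Before stepping through the cycle-by-cycle diagrams, it can help to tabulate which instruction sits in which stage. The sketch below assumes one instruction is fetched per cycle with no stalls, which holds for this sequence because no instruction reads a register written by an earlier one; the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
PROGRAM = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "and $12, $4, $5",
    "or $13, $6, $7",
    "add $14, $8, $9",
]


def occupancy(program, cycle):
    """Map each stage to the instruction in it during the given clock cycle (1-based)."""
    result = {}
    for s, stage in enumerate(STAGES):
        i = cycle - 1 - s  # instruction i is fetched in cycle i + 1
        result[stage] = program[i] if 0 <= i < len(program) else None
    return result


def total_cycles(program):
    """Cycles until the last instruction leaves WB."""
    return len(STAGES) + len(program) - 1


for cycle in range(1, total_cycles(PROGRAM) + 1):
    print(cycle, occupancy(PROGRAM, cycle))
```

Clock 5 is the only cycle in which all five instructions are in flight; the last write-back happens in clock 9.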
Computer ArchitecturePipelining-44
Example 2: Cycle 1
[Figure: pipelined datapath with control. Clock 1: IF: lw $10, 20($1); ID: before<1>; EX: before<2>; MEM: before<3>; WB: before<4>. Clock 2: IF: sub $11, $2, $3; ID: lw $10, 20($1); EX: before<1>; MEM: before<2>; WB: before<3>]
Computer ArchitecturePipelining-45
Example 2: Cycle 2
[Figure: pipelined datapath with control. Clock 1: IF: lw $10, 20($1); ID: before<1>; EX: before<2>; MEM: before<3>; WB: before<4>. Clock 2: IF: sub $11, $2, $3; ID: lw $10, 20($1); EX: before<1>; MEM: before<2>; WB: before<3>]
Computer ArchitecturePipelining-46
Example 2: Cycle 3
[Figure: pipelined datapath with control. Clock 3: IF: and $12, $4, $5; ID: sub $11, $2, $3; EX: lw $10, . . .; MEM: before<1>; WB: before<2>. Clock 4: IF: or $13, $6, $7; ID: and $12, $4, $5; EX: sub $11, . . .; MEM: lw $10, . . .; WB: before<1>]
Computer ArchitecturePipelining-47
Example 2: Cycle 4
[Figure: pipelined datapath with control. Clock 3: IF: and $12, $4, $5; ID: sub $11, $2, $3; EX: lw $10, . . .; MEM: before<1>; WB: before<2>. Clock 4: IF: or $13, $6, $7; ID: and $12, $4, $5; EX: sub $11, . . .; MEM: lw $10, . . .; WB: before<1>]
Computer ArchitecturePipelining-48
Example 2: Cycle 5
[Figure: pipelined datapath with control. Clock 5: IF: add $14, $8, $9; ID: or $13, $6, $7; EX: and $12, . . .; MEM: sub $11, . . .; WB: lw $10, . . .. Clock 6: IF: after<1>; ID: add $14, $8, $9; EX: or $13, . . .; MEM: and $12, . . .; WB: sub $11, . . .]
Computer ArchitecturePipelining-49
Example 2: Cycle 6
[Figure: pipelined datapath with control. Clock 5: IF: add $14, $8, $9; ID: or $13, $6, $7; EX: and $12, . . .; MEM: sub $11, . . .; WB: lw $10, . . .. Clock 6: IF: after<1>; ID: add $14, $8, $9; EX: or $13, . . .; MEM: and $12, . . .; WB: sub $11, . . .]
Computer ArchitecturePipelining-50
Example 2: Cycle 7
[Figure (Fig. 6.34): pipelined datapath with control. Clock 7: IF: after<2>; ID: after<1>; EX: add $14, . . .; MEM: or $13, . . .; WB: and $12, . . .. Clock 8: IF: after<3>; ID: after<2>; EX: after<1>; MEM: add $14, . . .; WB: or $13, . . .]
Computer ArchitecturePipelining-51
Example 2: Cycle 8
[Figure (Fig. 6.34): pipelined datapath with control. Clock 7: IF: after<2>; ID: after<1>; EX: add $14, . . .; MEM: or $13, . . .; WB: and $12, . . .. Clock 8: IF: after<3>; ID: after<2>; EX: after<1>; MEM: add $14, . . .; WB: or $13, . . .]
Computer ArchitecturePipelining-52
Example 2: Cycle 9
[Figure: pipelined datapath with control. Clock 9: IF: after<4>; ID: after<3>; EX: after<2>; MEM: after<1>; WB: add $14, . . .]
Computer ArchitecturePipelining-53
Summary of Pipeline Basics
Pipelining is a fundamental concept
Multiple steps using distinct resources
Utilize capabilities of datapath by pipelined instruction processing
Start next instruction while working on the current one
Limited by length of longest stage (plus fill/flush)
Need to detect and resolve hazards
What makes it easy in MIPS?
All instructions are of the same length
Just a few instruction formats
Memory operands only in loads and stores
What makes pipelining hard? Hazards
Computer ArchitecturePipelining-54
Outline
An overview of pipelining
A pipelined datapath
Pipelined control
Data hazards and forwarding (R-type and R-type)
Data hazards and stalls (Load and R-type)
Branch hazards
Exceptions
Superscalar and dynamic pipelining
Computer ArchitecturePipelining-55
Pipeline Hazards
Structural hazards: attempt to use the same resource in two different ways at the same time
Ex.: combined washer/dryer, or folder busy doing something else (watching TV)
Data hazards: attempt to use item before ready
Instruction depends on result of prior instruction still in the pipeline
Control hazards: attempt to make decision before condition is evaluated
Ex.: wash football uniforms and need to see result of previous load to get proper detergent level
Branch instructions
Can always resolve hazards by waiting
Pipeline control must detect the hazard
Take action (or delay action) to resolve hazards
Computer ArchitecturePipelining-56
Structural Hazard: Single Memory
[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) versus time; each instruction passes through Mem, Reg, ALU, Mem, Reg, all sharing a single memory]
Computer ArchitecturePipelining-57
Pipeline Hazards Illustrated
[Figure: timeline; one instruction in IF ID EX MEM WB and a later instruction starting IF ID ..., with the structural hazard marked]
Computer ArchitecturePipelining-58
Structural Hazard: Single Memory
[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) versus time; each instruction passes through Mem, Reg, ALU, Mem, Reg]
Use two memories: data memory and instruction memory
Feedback Path
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, MEM/WB registers, highlighting the feedback path]
Data Hazards
Program execution order (in instructions):
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
Value of register $2, by clock cycle:
  CC 1 to CC 4: 10
  CC 5: 10/-20
  CC 6 to CC 9: -20
[Figure: multi-cycle pipeline diagram (IM, Reg, DM, Reg), CC 1 to CC 9]
Types of Data Hazards
Three types (inst. i1 followed by inst. i2):
RAW (read after write): i2 tries to read operand before i1 writes it
WAR (write after read): i2 tries to write operand before i1 reads it
  Gets wrong operand, e.g., autoincrement addr.
  Can't happen in MIPS 5-stage pipeline because:
    All instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5
WAW (write after write): i2 tries to write operand before i1 writes it
  Leaves wrong result (i1's, not i2's); occurs only in pipelines that write in more than one stage
  Can't happen in MIPS 5-stage pipeline because:
    All instructions take 5 stages, and writes are always in stage 5
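The three definitions can be checked mechanically. The sketch below is illustrative only: the `Instr` record and the `dependences` function are hypothetical names, not part of the slides. It reports register dependences between an earlier instruction i1 and a later instruction i2; whether a dependence becomes a real hazard depends on the pipeline.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination and a set of sources."""
    dest: Optional[str]
    srcs: frozenset


def dependences(i1, i2):
    """Return the dependence types from earlier i1 to later i2."""
    kinds = set()
    if i1.dest is not None and i1.dest in i2.srcs:
        kinds.add("RAW")  # i2 reads what i1 writes
    if i2.dest is not None and i2.dest in i1.srcs:
        kinds.add("WAR")  # i2 writes what i1 reads
    if i1.dest is not None and i1.dest == i2.dest:
        kinds.add("WAW")  # both write the same register
    return kinds


# sub $2, $1, $3 followed by and $12, $2, $5
sub = Instr("$2", frozenset({"$1", "$3"}))
and_ = Instr("$12", frozenset({"$2", "$5"}))
print(dependences(sub, and_))
```

For the sub/and pair only a RAW dependence on $2 is reported, which is the case the following slides resolve with NOPs, forwarding, or stalls.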
Pipeline Hazards Illustrated
[Figure: overlapped IF ID EX MEM WB timelines illustrating RAW (read after write), WAW (write after write), and WAR (write after read) data hazards]
Handling Data Hazards
Use simple, fixed designs
  Eliminate WAR by always fetching operands early (ID) in pipeline
  Eliminate WAW by doing all write backs in order (last stage, static)
  These features have a lot to do with ISA design
Internal forwarding in register file:
  Write in first half of clock and read in second half
  Read delivers what is written; resolves hazard between sub and add
Detect and resolve remaining ones
  Compiler inserts NOP (software solution)
  Forward (hardware solution)
  Stall (hardware solution)
Software Solution
Have compiler guarantee no hazards
Where do we insert the NOPs?
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
Problem: this really slows us down!
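One way to answer the question is to let a small program place the NOPs. The sketch below is a hypothetical helper, not the compiler pass the slides have in mind. It assumes no forwarding between stages and a register file that writes in the first half of the clock and reads in the second half, so a consumer must sit at least three instruction slots after its producer (`min_distance=3` is that assumption).

```python
def insert_nops(program, min_distance=3):
    """program: list of (text, dest, srcs). Returns instruction texts with nops added."""
    out = []
    last_write = {}  # register -> position in `out` of its most recent writer
    for text, dest, srcs in program:
        need = 0
        for reg in srcs:
            if reg in last_write:
                gap = len(out) - last_write[reg]
                need = max(need, min_distance - gap)
        out.extend(["nop"] * need)  # pad until the producer has written back
        out.append(text)
        if dest is not None:
            last_write[dest] = len(out) - 1
    return out


program = [
    ("sub $2, $1, $3", "$2", ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2", "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None, ["$15", "$2"]),
]
scheduled = insert_nops(program)
for line in scheduled:
    print(line)
```

Under these assumptions the helper places two nops between sub and and, and none elsewhere, because the later instructions are already far enough from sub.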
Data Hazards
[Figure: multi-cycle pipeline diagram, CC 1 to CC 9, for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2); register $2 is 10 through CC 4, 10/-20 in CC 5, and -20 from CC 6]
Insert two nops
Data Hazards: Forwarding
[Figure: multi-cycle pipeline diagram, CC 1 to CC 9, for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2); register $2 is 10 through CC 4, 10/-20 in CC 5, and -20 from CC 6]
Datapath with Forwarding
Control: Detecting Data Hazards
Hazard conditions:
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
Two optimizations:
  Don't forward if instruction does not write register => check if RegWrite is asserted
  Don't forward if destination register is $0 => check if RegisterRd = 0
Detecting Data Hazards (cont.)
Hazard conditions using control signals:
  At EX stage: EX/MEM.RegWrite and (EX/MEM.RegRd ≠ 0) and (EX/MEM.RegRd = ID/EX.RegRs)
  At MEM stage: MEM/WB.RegWrite and (MEM/WB.RegRd ≠ 0) and (MEM/WB.RegRd = ID/EX.RegRs)
  (replace ID/EX.RegRs with ID/EX.RegRt for the other two conditions)
Resolving Hazards: Forwarding
Use temporary results, e.g., those in pipeline registers; don't wait for them to be written
Program execution order (in instructions):
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
Values by clock cycle (X = don't care):
                  CC1  CC2  CC3  CC4  CC5     CC6  CC7  CC8  CC9
  Register $2     10   10   10   10   10/-20  -20  -20  -20  -20
  EX/MEM          X    X    X    -20  X       X    X    X    X
  MEM/WB          X    X    X    X    -20     X    X    X    X
[Figure: multi-cycle pipeline diagram (IM, Reg, DM, Reg), CC 1 to CC 9]
Datapath with Forwarding
Forwarding Logic
Forwarding: input to ALU from any pipeline register
  Add multiplexors to ALU input
  Control forwarding in EX => carry Rs in ID/EX
Control signals for forwarding:
  If both WB and MEM forward, e.g., add $1,$1,$2; add $1,$1,$3; add $1,$1,$4 => let MEM forward
  EX hazard:
    if (EX/MEM.RegWrite and (EX/MEM.RegRd ≠ 0)
        and (EX/MEM.RegRd = ID/EX.RegRs)) ForwardA = 10
  MEM hazard:
    if (MEM/WB.RegWrite and (MEM/WB.RegRd ≠ 0)
        and (EX/MEM.RegRd ≠ ID/EX.RegRs)
        and (MEM/WB.RegRd = ID/EX.RegRs)) ForwardA = 01
  (For ForwardB, replace ID/EX.RegRs with ID/EX.RegRt)
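The forwarding conditions can be sketched as a small function. This is a software model for illustration, with hypothetical parameter names; it is not the hardware description used in the course. Testing the EX hazard first gives the most recent result priority, which is the intent of the extra EX/MEM term in the MEM-hazard condition.

```python
def forward_select(src_reg, ex_mem_reg_write, ex_mem_rd, mem_wb_reg_write, mem_wb_rd):
    """Return the 2-bit mux select (ForwardA or ForwardB) for one ALU source register."""
    # EX hazard: the instruction one ahead produced the value (held in EX/MEM).
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return 0b10
    # MEM hazard: the instruction two ahead produced the value (held in MEM/WB).
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return 0b01
    # No hazard: use the value read from the register file.
    return 0b00


# add $1,$1,$2; add $1,$1,$3; add $1,$1,$4: the third add takes $1 from EX/MEM.
print(format(forward_select(1, True, 1, True, 1), "02b"))
```

Call it once with ID/EX.RegRs to obtain ForwardA and once with ID/EX.RegRt to obtain ForwardB.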
Example 3: Cycles 3 to 6
Fig. 6.41 (cycles 3 and 4), Fig. 6.42 (cycles 5 and 6)
Instruction sequence: sub $2, $1, $3; and $4, $2, $5; or $4, $4, $2; add $9, $4, $2
[Figures: datapath with forwarding unit; the instruction in each stage per clock is listed below]
  Clock  IF              ID              EX              MEM          WB
  3      or $4, $4, $2   and $4, $2, $5  sub $2, $1, $3  before<1>    before<2>
  4      add $9, $4, $2  or $4, $4, $2   and $4, $2, $5  sub $2, ...  before<1>
  5      after<1>        add $9, $4, $2  or $4, $4, $2   and $4, ...  sub $2, ...
  6      after<2>        after<1>        add $9, $4, $2  or $4, ...   and $4, ...
Outline
  An overview of pipelining
  A pipelined datapath
  Pipelined control
  Data hazards and forwarding (R-type and R-type)
  Data hazards and stalls (load and R-type)
  Branch hazards
  Exceptions
  Superscalar and dynamic pipelining
Can't Always Forward
lw can still cause a hazard if it is followed by an instruction that reads the loaded register
Program execution order (in instructions):
  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
[Figure: multi-cycle pipeline diagram (IM, Reg, DM, Reg), CC 1 to CC 9]
Use stalling or compiler to resolve
Stalling
Stall pipeline by keeping instructions in the same stage and inserting an NOP instead
Program execution order (in instructions):
  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
[Figure: multi-cycle pipeline diagram, CC 1 to CC 10, with a bubble inserted]
Datapath with Stalling Unit
Forwarding controls ALU inputs; hazard detection controls PC, IF/ID, control signals
[Figure: pipelined datapath with hazard detection unit and forwarding unit. Labeled signals: ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, PCWrite, IF/IDWrite, EX/MEM.RegisterRd, MEM/WB.RegisterRd]
Control: Handling Stalls
Hazard detection unit in ID inserts a stall between a load instruction and its use:
  if (ID/EX.MemRead and
      ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
       (ID/EX.RegisterRt = IF/ID.RegisterRt)))
    stall the pipeline for one cycle
  (ID/EX.MemRead = 1 indicates a load instruction)
How to stall?
  Stall instructions in IF and ID: do not change PC and IF/ID => the stages re-execute the instructions
  What to move into EX: insert an NOP by changing EX, MEM, WB control fields of ID/EX pipeline register to 0
    As control signals propagate, all control signals to EX, MEM, WB are deasserted and no registers or memories are written
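The stall condition and its effect can be modeled in a few lines. This is a sketch under stated assumptions: the function names and the `ZeroControl` key are hypothetical labels for the mux select that zeroes the control fields, and register numbers are plain integers.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID reads the register a load in EX will write."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


def hazard_controls(stall):
    """Control outputs of the hazard detection unit for one cycle."""
    return {
        "PCWrite": not stall,      # hold PC so the same instruction is fetched again
        "IF/IDWrite": not stall,   # hold IF/ID so the same instruction is decoded again
        "ZeroControl": stall,      # send zeros into ID/EX control fields (the bubble)
    }


# lw $2, 20($1) is in EX (rt = 2); and $4, $2, $5 is in ID (rs = 2, rt = 5).
stall = load_use_stall(True, 2, 2, 5)
print(stall, hazard_controls(stall))
```

After the one-cycle bubble, the loaded value is in MEM/WB and the forwarding unit can supply it to the dependent instruction.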
Example 4: Cycles 2 to 7
Fig. 6.49 (cycles 6 and 7)
Instruction sequence: lw $2, 20($1); and $4, $2, $5; or $4, $4, $2; add $9, $4, $2
[Figures: datapath with hazard detection unit and forwarding unit; the instruction in each stage per clock is listed below]
  Clock  IF              ID              EX              MEM          WB
  2      and $4, $2, $5  lw $2, 20($1)   before<1>       before<2>    before<3>
  3      or $4, $4, $2   and $4, $2, $5  lw $2, 20($1)   before<1>    before<2>
  4      or $4, $4, $2   and $4, $2, $5  bubble          lw $2, ...   before<1>
  5      add $9, $4, $2  or $4, $4, $2   and $4, $2, $5  bubble       lw $2, ...
  6      after<1>        add $9, $4, $2  or $4, $4, $2   and $4, ...  bubble
  7      after<2>        after<1>        add $9, $4, $2  or $4, ...   and $4, ...
Outline
  An overview of pipelining
  A pipelined datapath
  Pipelined control
  Data hazards and forwarding
  Data hazards and stalls
  Branch hazards
  Exceptions
  Superscalar and dynamic pipelining
Feedback Path
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM, MEM/WB registers, highlighting the feedback path]
Pipeline Datapath with Control Signals
[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) annotated with the control signals RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, MemtoReg, RegWrite, and PCSrc]
Branch Hazards
When we decide to branch, other instructions are already in the pipeline!
Program execution order (in instructions):
  40 beq $1, $3, 7
  44 and $12, $2, $5
  48 or  $13, $6, $2
  52 add $14, $2, $2
  72 lw  $4, 50($7)
[Figure: multi-cycle pipeline diagram (IM, Reg, DM, Reg), CC 1 to CC 9]
Handling Branch Hazard
Predict branch always not taken
  Need to add hardware for flushing instructions if wrong
  Branch decision made at MEM => need to flush instructions in IF/ID, ID/EX by changing control values to 0
Reduce delay of taken branch by moving branch execution earlier in the pipeline
  Move up branch address calculation to ID
  Check branch equality at ID (using XOR) by comparing the two registers read during ID
  Branch decision made at ID => one instruction to flush
  Add a control signal, IF.Flush, to zero instruction field of IF/ID => making the instruction an NOP
Dynamic branch prediction
Compiler rescheduling, delayed branch
Pipeline with Flushing
[Figure: pipelined datapath with hazard detection unit, forwarding unit, IF.Flush signal into IF/ID, and register comparator (=) and branch adder moved to ID]
Example 5: Cycles 3 and 4
Instruction sequence: sub $10, $4, $8; beq $1, $3, 7 (at address 40); and $12, $2, $5 (at 44); branch target lw $4, 50($7) (at 72)
[Figures: datapath with flushing; the instruction in each stage per clock is listed below]
  Clock  IF               ID             EX               MEM           WB
  3      and $12, $2, $5  beq $1, $3, 7  sub $10, $4, $8  before<1>     before<2>
  4      lw $4, 50($7)    bubble (nop)   beq $1, $3, 7    sub $10, ...  before<1>
Dynamic Branch Prediction
In deeper and superscalar pipelines, branch penalty is more significant
Use dynamic prediction (e.g., loop)
  Branch prediction buffer (aka branch history table)
  Indexed by recent branch instruction addresses
  Stores outcome (taken/not taken)
To execute a branch
  Check table, expect the same outcome
  Start fetching from fall-through or target
  If wrong, flush pipeline and flip prediction
1-Bit Predictor: Shortcoming
Inner loop branches mispredicted twice!
  outer: …
         …
  inner: …
         …
         beq …, …, inner
         …
         beq …, …, outer
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner loop next time around
2-Bit Predictor
Only change prediction on two successive mispredictions
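The difference between the two predictors shows up in a short simulation. The sketch below uses a saturating counter, one common 2-bit scheme; the exact state machine on the slide's diagram may differ, and the function name, the initial state, and the 4-iteration inner loop are assumptions made for illustration.

```python
def count_mispredictions(outcomes, bits):
    """Simulate an n-bit saturating-counter predictor for a single branch."""
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)   # predict taken when state >= threshold
    state = max_state             # assumption: start in the strongest taken state
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= threshold
        if predicted_taken != taken:
            mispredictions += 1
        # Move the counter toward the actual outcome, saturating at both ends.
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return mispredictions


# Inner-loop branch: taken 3 times, then not taken, repeated for 3 outer iterations.
pattern = [True, True, True, False] * 3
print(count_mispredictions(pattern, bits=1), count_mispredictions(pattern, bits=2))
```

With one bit the branch is mispredicted on the loop exit and again on re-entry; with two bits only the exit is mispredicted, because a single not-taken outcome does not flip the prediction.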
Calculating the Branch Target
Even with predictor, still need to calculate the target address
  1-cycle penalty for a taken branch
Branch target buffer
  Cache of target addresses
  Indexed by PC when instruction fetched
    If hit and instruction is branch predicted taken, can fetch target immediately
Delayed Branch
Predict-not-taken + branch decision at ID
  => the following instruction is always executed
  => branches take effect 1 cycle later
0 clock cycle penalty per branch instruction if we can find an instruction to put in the slot (50% of the time)
[Figure: pipeline diagram (Mem, Reg, ALU, Mem, Reg) over time for add, beq, misc, lw]
Outline
  An overview of pipelining
  A pipelined datapath
  Pipelined control
  Data hazards and forwarding
  Data hazards and stalls
  Branch hazards
  Exceptions
  Superscalar and dynamic pipelining
Exceptions and Interrupts
§4.9 Exceptions
"Unexpected" events requiring change in flow of control
  Different ISAs use the terms differently
Exception
  Arises within the CPU
  e.g., undefined opcode, overflow, syscall, …
Interrupt
  From an external I/O controller
Dealing with them without sacrificing performance is hard
Handling Exceptions
In MIPS, exceptions managed by a System Control Coprocessor (CP0)
Save PC of offending (or interrupted) instruction
  In MIPS: Exception Program Counter (EPC)
Save indication of the problem
  In MIPS: Cause register
  We'll assume 1-bit
    0 for undefined opcode, 1 for overflow
Jump to handler at 8000 0180
An Alternate Mechanism
Vectored Interrupts
  Handler address determined by the cause
Example:
  Undefined opcode: C000 0000
  Overflow: C000 0020
  …: C000 0040
Instructions either
  Deal with the interrupt, or
  Jump to real handler
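The example addresses follow a base-plus-stride pattern, which the sketch below makes explicit. The constant names, the function name, and the numbering of causes (0 for undefined opcode, 1 for overflow) are assumptions chosen to match the example; real hardware defines its own encoding.

```python
VECTOR_BASE = 0xC000_0000   # first handler slot in the example
VECTOR_STRIDE = 0x20        # 32 bytes per slot, i.e. room for 8 MIPS instructions


def handler_address(cause_code):
    """Address of the handler slot for a given (assumed) cause number."""
    return VECTOR_BASE + cause_code * VECTOR_STRIDE


for code, name in enumerate(["undefined opcode", "overflow"]):
    print(f"{name}: {handler_address(code):08X}")
```

Because each slot holds only eight instructions, a slot either handles the event directly or jumps to the real handler.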
Handler Actions
Read cause, and transfer to relevant handler
Determine action required
If restartable
  Take corrective action
  Use EPC to return to program
Otherwise
  Terminate program
  Report error using EPC, cause, …
Exceptions in a Pipeline
Another form of control hazard
Consider overflow on add in EX stage
  add $1, $2, $1
  Prevent $1 from being clobbered
  Complete previous instructions
  Flush add and subsequent instructions
  Set Cause and EPC register values
  Transfer control to handler
Similar to mispredicted branch
  Use much of the same hardware
Pipeline with Exceptions
Exception Properties
Restartable exceptions
  Pipeline can flush the instruction
  Handler executes, then returns to the instruction
    Refetched and executed from scratch
PC saved in EPC register
  Identifies causing instruction
  Actually PC + 4 is saved
    Handler must adjust
Exception Example
Exception on add in
  40 sub $11, $2, $4
  44 and $12, $2, $5
  48 or  $13, $2, $6
  4C add $1, $2, $1
  50 slt $15, $6, $7
  54 lw  $16, 50($7)
  …
Handler
  80000180 sw $25, 1000($0)
  80000184 sw $26, 1004($0)
  …
Exception Example
[Figures: two slides illustrating the exception example; diagrams not reproduced]
Multiple Exceptions
Pipelining overlaps multiple instructions
  Could have multiple exceptions at once
Simple approach: deal with exception from earliest instruction
  Flush subsequent instructions
  "Precise" exceptions
In complex pipelines
  Multiple instructions issued per cycle
  Out-of-order completion
  Maintaining precise exceptions is difficult!
Imprecise Exceptions
Just stop pipeline and save state
  Including exception cause(s)
Let the handler work out
  Which instruction(s) had exceptions
  Which to complete or flush
    May require "manual" completion
Simplifies hardware, but more complex handler software
Not feasible for complex multiple-issue out-of-order pipelines
Outline
  An overview of pipelining
  A pipelined datapath
  Pipelined control
  Data hazards and forwarding
  Data hazards and stalls
  Branch hazards
  Exceptions
  Superscalar and dynamic pipelining
Instruction-Level Parallelism (ILP)
§4.10 Parallelism and Advanced Instruction Level Parallelism
Pipelining: executing multiple instructions in parallel
To increase ILP
  Deeper pipeline
    Less work per stage => shorter clock cycle
  Multiple issue
    Replicate pipeline stages => multiple pipelines
    Start multiple instructions per clock cycle
    CPI < 1, so use Instructions Per Cycle (IPC)
    E.g., 4GHz 4-way multiple-issue
      16 BIPS, peak CPI = 0.25, peak IPC = 4
    But dependencies reduce this in practice
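The peak figures in the 4GHz example follow directly from the clock rate and issue width. A minimal sketch of that arithmetic, with a hypothetical function name, is:

```python
def peak_rates(clock_hz, issue_width):
    """Peak IPC, peak CPI, and peak instructions per second for an ideal pipeline."""
    peak_ipc = issue_width
    peak_cpi = 1 / issue_width
    peak_ips = clock_hz * issue_width
    return peak_ipc, peak_cpi, peak_ips


ipc, cpi, ips = peak_rates(4e9, 4)
print(ipc, cpi, ips / 1e9, "BIPS")
```

These are upper bounds only; dependencies and hazards keep sustained IPC below the issue width.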
Multiple Issue
Static multiple issue
  Compiler groups instructions to be issued together
  Packages them into "issue slots"
  Compiler detects and avoids hazards
Dynamic multiple issue
  CPU examines instruction stream and chooses instructions to issue each cycle
  Compiler can help by reordering instructions
  CPU resolves hazards using advanced techniques at runtime
Speculation
"Guess" what to do with an instruction
  Start operation as soon as possible
  Check whether guess was right
    If so, complete the operation
    If not, roll back and do the right thing
Common to static and dynamic multiple issue
Examples
  Speculate on branch outcome
    Roll back if path taken is different
  Speculate on load
    Roll back if location is updated
Compiler/Hardware Speculation
Compiler can reorder instructions
  e.g., move load before branch
  Can include "fix-up" instructions to recover from incorrect guess
Hardware can look ahead for instructions to execute
  Buffer results until it determines they are actually needed
  Flush buffers on incorrect speculation
Speculation and Exceptions
What if exception occurs on a speculatively executed instruction?
  e.g., speculative load before null-pointer check
Static speculation
  Can add ISA support for deferring exceptions
Dynamic speculation
  Can buffer exceptions until instruction completion (which may not occur)
Static Multiple Issue
Compiler groups instructions into "issue packets"
  Group of instructions that can be issued on a single cycle
  Determined by pipeline resources required
Think of an issue packet as a very long instruction
  Specifies multiple concurrent operations
  Very Long Instruction Word (VLIW)
Scheduling Static Multiple Issue
Compiler must remove some/all hazards
  Reorder instructions into issue packets
  No dependencies within a packet
  Possibly some dependencies between packets
    Varies between ISAs; compiler must know!
  Pad with nop if necessary
MIPS with Static Dual Issue
Two-issue packets
  One ALU/branch instruction
  One load/store instruction
  64-bit aligned
    ALU/branch, then load/store
    Pad an unused instruction with nop

  Address  Instruction type  Pipeline stages
  n        ALU/branch        IF  ID  EX  MEM WB
  n + 4    Load/store        IF  ID  EX  MEM WB
  n + 8    ALU/branch            IF  ID  EX  MEM WB
  n + 12   Load/store            IF  ID  EX  MEM WB
  n + 16   ALU/branch                IF  ID  EX  MEM WB
  n + 20   Load/store                IF  ID  EX  MEM WB
MIPS with Static Dual Issue
Hazards in the Dual-Issue MIPS
More instructions executing in parallel
EX data hazard
  Forwarding avoided stalls with single-issue
  Now can't use ALU result in load/store in same packet
    add $t0, $s0, $s1
    lw  $s2, 0($t0)
  Split into two packets, effectively a stall
Load-use hazard
  Still one cycle use latency, but now two instructions
More aggressive scheduling required
Scheduling Example
Schedule this for dual-issue MIPS
  Loop: lw   $t0, 0($s1)       # $t0 = array element
        addu $t0, $t0, $s2     # add scalar in $s2
        sw   $t0, 0($s1)       # store result
        addi $s1, $s1, -4      # decrement pointer
        bne  $s1, $zero, Loop  # branch if $s1 != 0

        ALU/branch             Load/store       cycle
  Loop: nop                    lw $t0, 0($s1)   1
        addi $s1, $s1, -4      nop              2
        addu $t0, $t0, $s2     nop              3
        bne  $s1, $zero, Loop  sw $t0, 4($s1)   4

IPC = 5/4 = 1.25 (cf. peak IPC = 2)
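The IPC figure is simply the number of useful instructions divided by the number of issue cycles. A minimal sketch, in which the list-of-pairs representation of the schedule and the `ipc` function are illustrative choices with `None` standing for a nop slot:

```python
# Each packet is (ALU/branch slot, load/store slot); None marks a nop.
schedule = [
    (None, "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4", None),
    ("addu $t0, $t0, $s2", None),
    ("bne $s1, $zero, Loop", "sw $t0, 4($s1)"),
]


def ipc(packets):
    """Useful (non-nop) instructions issued per cycle, one packet per cycle."""
    useful = sum(1 for packet in packets for slot in packet if slot is not None)
    return useful / len(packets)


print(ipc(schedule))
```

The same function applies to any dual-issue schedule written in this form, such as an unrolled version of the loop.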
Loop Unrolling
Replicate loop body to expose more parallelism
  Reduces loop-control overhead
Use different registers per replication
  Called "register renaming"
  Avoid loop-carried "anti-dependencies"
    Store followed by a load of the same register
    Aka "name dependence"
      Reuse of a register name
Loop Unrolling Example
        ALU/branch             Load/store        cycle
  Loop: addi $s1, $s1, -16     lw $t0, 0($s1)    1
        nop                    lw $t1, 12($s1)   2
        addu $t0, $t0, $s2     lw $t2, 8($s1)    3
        addu $t1, $t1, $s2     lw $t3, 4($s1)    4
        addu $t2, $t2, $s2     sw $t0, 16($s1)   5
        addu $t3, $t3, $s2     sw $t1, 12($s1)   6
        nop                    sw $t2, 8($s1)    7
        bne  $s1, $zero, Loop  sw $t3, 4($s1)    8

IPC = 14/8 = 1.75
  Closer to 2, but at cost of registers and code size
Dynamic Multiple Issue
"Superscalar" processors
CPU decides whether to issue 0, 1, 2, … each cycle
  Avoiding structural and data hazards
Avoids the need for compiler scheduling
  Though it may still help
  Code semantics ensured by the CPU
Dynamic Pipeline Scheduling
Allow the CPU to execute instructions out of order to avoid stalls
  But commit result to registers in order
Example
  lw   $t0, 20($s2)
  addu $t1, $t0, $t2
  sub  $s4, $s4, $t3
  slti $t5, $s4, 20
  Can start sub while addu is waiting for lw
Dynamically Scheduled CPU
[Figure: dynamically scheduled pipeline; the callouts read:]
  Preserves dependencies
  Hold pending operands
  Results also sent to any waiting reservation stations
  Reorder buffer for register writes
  Can supply operands for issued instructions
Register Renaming
Reservation stations and reorder buffer effectively provide register renaming
On instruction issue to reservation station
  If operand is available in register file or reorder buffer
    Copied to reservation station
    No longer required in the register; can be overwritten
  If operand is not yet available
    It will be provided to the reservation station by a function unit
    Register update may not be required
Speculation
Predict branch and continue issuing
  Don't commit until branch outcome determined
Load speculation
  Avoid load and cache miss delay
    Predict the effective address
    Predict loaded value
    Load before completing outstanding stores
    Bypass stored values to load unit
  Don't commit load until speculation cleared
Why Do Dynamic Scheduling?
Why not just let the compiler schedule code?
Not all stalls are predictable
  e.g., cache misses
Can't always schedule around branches
  Branch outcome is dynamically determined
Different implementations of an ISA have different latencies and hazards
Does Multiple Issue Work?
Yes, but not as much as we'd like
Programs have real dependencies that limit ILP
Some dependencies are hard to eliminate
  e.g., pointer aliasing
Some parallelism is hard to expose
  Limited window size during instruction issue
Memory delays and limited bandwidth
  Hard to keep pipelines full
Speculation can help if done well
Power Efficiency
Complexity of dynamic scheduling and speculations requires power
Multiple simpler cores may be better

  Microprocessor   Year  Clock Rate  Pipeline Stages  Issue width  Out-of-order/Speculation  Cores  Power
  i486             1989  25MHz       5                1            No                        1      5W
  Pentium          1993  66MHz       5                2            No                        1      10W
  Pentium Pro      1997  200MHz      10               3            Yes                       1      29W
  P4 Willamette    2001  2000MHz     22               3            Yes                       1      75W
  P4 Prescott      2004  3600MHz     31               3            Yes                       1      103W
  Core             2006  2930MHz     14               4            Yes                       2      75W
  UltraSparc III   2003  1950MHz     14               4            No                        1      90W
  UltraSparc T1    2005  1200MHz     6                1            No                        8      70W
Fallacies
§4.13 Fallacies and Pitfalls
Pipelining is easy (!)
  The basic idea is easy
  The devil is in the details
    e.g., detecting data hazards
Pipelining is independent of technology
  So why haven't we always done pipelining?
  More transistors make more advanced techniques feasible
  Pipeline-related ISA design needs to take account of technology trends
    e.g., predicated instructions
Pitfalls
Poor ISA design can make pipelining harder
  e.g., complex instruction sets (VAX, IA-32)
    Significant overhead to make pipelining work
    IA-32 micro-op approach
  e.g., complex addressing modes
    Register update side effects, memory indirection
  e.g., delayed branches
    Advanced pipelines have long delay slots
Concluding Remarks
§4.14 Concluding Remarks
ISA influences design of datapath and control
Datapath and control influence design of ISA
Pipelining improves instruction throughput using parallelism
  More instructions completed per second
  Latency for each instruction not reduced
Hazards: structural, data, control
Multiple issue and dynamic scheduling (ILP)
  Dependencies limit achievable parallelism
  Complexity leads to the power wall