Upload
ashlee-elfreda-carr
View
221
Download
0
Embed Size (px)
Citation preview
Chap 6.1
Computer Architecture
Chapter 6
Enhancing Performance with Pipelining
Chap 6.2
Pipelining Overview
Pipelined Datapath
Pipelined Control
Hazards• Structural Hazards
• Data Hazards
• Branch Hazards
Dynamic Scheduling
Examples of Pipelining
Summary
Contents
Chap 6.3
The Five Classic Components of a Computer
Back to the datapath again to try to speed it up some more• Single cycle datapath – great CPI but terrible cycle time (long critical path)
• Multiple cycle datapath – good cycle time but poor CPI
• Pipelining – get the best of both!
Control
Datapath
Memory
Processor
Input
Output
The Big Picture: Where are We Now?
Chap 6.4
Sequential laundry takes 8 hours for 4 loads
If they learned pipelining, how long would laundry take?
30Task
Order
B
C
D
ATime
30 30 3030 30 3030 30 30 3030 30 30 3030
6 PM 7 8 9 10 11 12 1 2 AM
Sequential Laundry
Chap 6.5
Pipelined laundry takes 3.5 hours for 4 loads!
Task
Order
12 2 AM6 PM 7 8 9 10 11 1
Time
B
C
D
A
303030 3030 30 30
Pipelined Laundry: Start work ASAP
Chap 6.6
Ifetch: Instruction Fetch• Fetch the instruction from the Instruction Memory
Reg/Dec: Register Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
Ifetch Reg/Dec Exec Mem WrLoad
One way to break up the datapath operations
Chap 6.7
Instr.
Order
Time (clock cycles)
Inst 0
Inst 1
Inst 2
Inst 4
Inst 3A
LUIm Reg Dm Reg
AL
UIm Reg Dm Reg
AL
UIm Reg Dm Reg
AL
UIm Reg Dm Reg
AL
UIm Reg Dm Reg
Apply pipelining
Chap 6.8
IFetch Reg Exec Mem WB
IFetch Reg Exec Mem WB
IFetch Reg Exec Mem WB
IFetch Reg Exec Mem WB
IFetch Reg Exec Mem WB
IFetch Reg Exec Mem WBProgram Flow
Time
Conventional Pipelined Execution Representation
Chap 6.9
Improve performance by increasing instruction throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?
Instructionfetch
Reg ALUData
accessReg
8 nsInstruction
fetchReg ALU
Dataaccess
Reg
8 nsInstruction
fetch
8 ns
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
...
Programexecutionorder(in instructions)
Instructionfetch
Reg ALUData
accessReg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 nsInstruction
fetchReg ALU
Dataaccess
Reg
2 nsInstruction
fetchReg ALU
Dataaccess
Reg
2 ns 2 ns 2 ns 2 ns 2 ns
Programexecutionorder(in instructions)
Pipelining
Chap 6.10
Clk
Cycle 1
Multiple Cycle Implementation:
Ifetch Reg Exec Mem Wr
Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
Load Ifetch Reg Exec Mem Wr
Ifetch Reg Exec Mem
Load Store
Pipeline Implementation:
Ifetch Reg Exec Mem WrStore
Clk
Single Cycle Implementation:
Load Store Waste
Ifetch
R-type
Ifetch Reg Exec Mem WrR-type
Cycle 1 Cycle 2
Single Cycle, Multiple Cycle, vs. Pipeline
Chap 6.11
Suppose we execute 100 instructions
Single Cycle Machine• 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle Machine• 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
Ideal pipelined machine• 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
How much improvement can pipelining give us ?
Chap 6.12
What makes it easy• all instructions are the same length• just a few instruction formats• memory operands appear only in loads and stores
What makes it hard?• structural hazards: suppose we had only one memory• control hazards: need to worry about branch instructions• data hazards: an instruction depends on a previous instruction
We’ll build a simple pipeline and look at these issues
We’ll talk about modern processors and what really makes it hard:• exception handling• trying to improve performance with out-of-order execution, etc.
Pipelining
Chap 6.13
What do we need to add to actually split the datapath into stages?
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Instruction
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
ReaddataAddress
Datamemory
1
ALUresult
Mux
ALUZero
IF: Instruction fetch ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access WB: Write back
Basic Idea
Chap 6.14
Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Datamemory
Address
Pipelined Datapath
Chap 6.15
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0
Address
Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
Datamemory
1
ALUresult
Mux
ALUZero
ID/EX
Corrected Datapath
Chap 6.16
Can help with answering questions like:• how many cycles does it take to execute this code?
• what is the ALU doing during cycle 4?
• use this representation to help understand datapaths
IM Reg DM Reg
IM Reg DM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
lw $10, 20($1)
Programexecutionorder(in instructions)
sub $11, $2, $3
ALU
ALU
Graphically Representing Pipelines
Chap 6.17
PC
Instructionmemory
Address
Inst
ruct
ion
Instruction[20– 16]
MemtoReg
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0Registers
Writeregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1Write
data
Read
data Mux
1
ALUcontrol
RegWrite
MemRead
Instruction[15– 11]
6
IF/ID ID/EX EX/MEM MEM/WB
MemWrite
Address
Datamemory
PCSrc
Zero
AddAdd
result
Shiftleft 2
ALUresult
ALU
Zero
Add
0
1
Mux
0
1
Mux
Pipeline Control
Chap 6.18
We have 5 stages. What needs to be controlled in each stage?• Instruction Fetch and PC Increment
• Instruction Decode / Register Fetch
• Execution
• Memory Stage
• Write Back
How would control be handled in an automobile plant?• a fancy control center telling everyone what to do?
• should we use a finite state machine?
Pipeline control
Chap 6.19
Pass control signals along just like the dataExecution/Address Calculation
stage control linesMemory access stage
control lines
Write-back stage control
lines
InstructionReg Dst
ALU Op1
ALU Op0
ALU Src Branch
Mem Read
Mem Write
Reg write
Mem to Reg
R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X
Control
EX
M
WB
M
WB
WB
IF/ID ID/EX EX/MEM MEM/WB
Instruction
Pipeline Control
Chap 6.20
PC
Instructionmemory
Inst
ruct
ion
Add
Instruction[20– 16]
Me
mto
Re
g
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0
Mux
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1
ALUresult
Zero
Writedata
Readdata
Mux
1
ALUcontrol
Shiftleft 2
Re
gWrit
e
MemRead
Control
ALU
Instruction[15– 11]
6
EX
M
WB
M
WB
WBIF/ID
PCSrc
ID/EX
EX/MEM
MEM/WB
Mux
0
1
Me
mW
rite
AddressData
memory
Address
Datapath with Control
Chap 6.21
Yes: Pipeline Hazards• structural hazards: attempt to use the same resource two different ways
at the same time
- E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV)
• data hazards: attempt to use item before it is ready
- E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer
- instruction depends on result of prior instruction still in the pipeline
• control hazards: attempt to make a decision before condition is evaluated
- E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
- branch instructions
Can always resolve hazards by waiting• pipeline control must detect the hazard
• take action (or delay action) to resolve hazards
Can pipelining get us into trouble?
Chap 6.22
Mem
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Instr 3
Instr 4A
LUMem Reg Mem Reg
AL
UMem Reg Mem Reg
AL
UMem Reg Mem RegA
LUReg Mem Reg
AL
UMem Reg Mem Reg
Detection is easy in this case! (right half highlight means read, left half write)
Single Memory is a Structural Hazard
Chap 6.23
Problem with starting next instruction before first is finished• dependencies that go backward in time are data hazards
IM Reg
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecutionorder(in instructions)
and $12, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
10 10 10 10 10/– 20 – 20 – 20 – 20 – 20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2:
DM Reg
Reg
Reg
Reg
DM
Dependencies
Chap 6.24
Have compiler guarantee no hazards
Where do we insert the nops??
sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)
Problem: this really slows us down!
Software Solution
Chap 6.25
Use temporary results, don’t wait for them to be written• register file forwarding to handle read/write to same register
• ALU forwarding
what if this $2 was $13?
IM Reg
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecution order(in instructions)
and $12, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
10 10 10 10 10/– 20 – 20 – 20 – 20 – 20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2 :
DM Reg
Reg
Reg
Reg
X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :
DM
Forwarding
Chap 6.26
PCInstruction
memory
Registers
Mux
Mux
Control
ALU
EX
M
WB
M
WB
WB
ID/EX
EX/MEM
MEM/WB
Datamemory
Mux
Forwardingunit
IF/ID
Inst
ruct
ion
Mux
RdEX/MEM.RegisterRd
MEM/WB.RegisterRd
Rt
Rt
Rs
IF/ID.RegisterRd
IF/ID.RegisterRt
IF/ID.RegisterRt
IF/ID.RegisterRs
Forwarding
Chap 6.27
Load word can still cause a hazard:• an instruction tries to read a register following a load instruction that writes to the same
register.
•
Thus, we need a hazard detection unit to stall the load instruction
Reg
IM
Reg
Reg
IM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
lw $2, 20($1)
Programexecutionorder(in instructions)
and $4, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
DM Reg
Reg
Reg
DM
Can't always forward
Chap 6.28
We can stall the pipeline by keeping an instruction in the same stage
lw $2, 20($1)
Programexecutionorder(in instructions)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
Reg
IM
Reg
Reg
IM DM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)
IM Reg DM RegIM
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9 CC 10
DM Reg
RegReg
Reg
bubble
Stalling
Chap 6.29
Stall by letting an instruction that won’t write anything go forward
PCInstruction
memory
Registers
Mux
Mux
Mux
Control
ALU
EX
M
WB
M
WB
WB
ID/EX
EX/MEM
MEM/WB
Datamemory
Mux
Hazarddetection
unit
Forwardingunit
0
Mux
IF/ID
Inst
ruct
ion
ID/EX.MemRead
IF/ID
Writ
e
PC
Writ
e
ID/EX.RegisterRt
IF/ID.RegisterRd
IF/ID.RegisterRt
IF/ID.RegisterRt
IF/ID.RegisterRs
RtRs
Rd
Rt EX/MEM.RegisterRd
MEM/WB.RegisterRd
Hazard Detection Unit
Chap 6.30
When we decide to branch, other instructions are in the pipeline!
We are predicting branch not taken• need to add hardware for flushing instructions if we are wrong
Reg
Reg
CC 1
Time (in clock cycles)
40 beq $1, $3, 7
Programexecutionorder(in instructions)
IM Reg
IM DM
IM DM
IM DM
DM
DM Reg
Reg Reg
Reg
Reg
RegIM
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
Reg
Control (Branch) Hazards
Chap 6.31
PCInstruction
memory
4
Registers
Mux
Mux
Mux
ALU
EX
M
WB
M
WB
WB
ID/EX
0
EX/MEM
MEM/WB
Datamemory
Mux
Hazarddetection
unit
Forwardingunit
IF.Flush
IF/ID
Signextend
Control
Mux
=
Shiftleft 2
Mux
Flushing Instructions
Chap 6.32
Stall: wait until decision is clear• Its possible to move up decision to 2nd stage by adding hardware to
check registers as being read
Impact: 2 clock cycles per branch instruction => slow
Instr.
Order
Time (clock cycles)
Add
Beq
Load
AL
UMem Reg Mem Reg
AL
UMem Reg Mem Reg
AL
UReg Mem RegMem
Example: Control Hazard Solutions
Chap 6.33
Predict: guess one direction then back up if wrong• Predict not taken
Impact: 1 clock cycles per branch instruction if right, 2 if wrong (right 50% of time)
More dynamic scheme: history of 1 branch ( 90%)
Instr.
Order
Time (clock cycles)
Add
Beq
Load
AL
UMem Reg Mem Reg
AL
UMem Reg Mem Reg
Mem
AL
UReg Mem Reg
Example: Control Hazard Solutions
Chap 6.34
Redefine branch behavior (takes place after next instruction) “delayed branch”
Impact: 1 clock cycles per branch instruction if can find instruction to put in “slot” ( 50% of time)
Instr.
Order
Time (clock cycles)
Add
Beq
Misc
AL
UMem Reg Mem Reg
AL
UMem Reg Mem Reg
MemA
LUReg Mem Reg
Load Mem
AL
UReg Mem Reg
Example: Control Hazard Solutions
Chap 6.35
The hardware performs the scheduling?• hardware tries to find instructions to execute
• out of order execution is possible
• speculative execution and dynamic branch prediction
All modern processors are very complicated• DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
• PowerPC and Pentium: branch history table
• Compiler technology important
Dynamic Scheduling
Chap 6.36
Pipelining is a fundamental concept• multiple steps using distinct resources
Utilize capabilities of the Datapath by pipelined instruction processing• start next instruction while working on the current one
• limited by length of longest stage (plus fill/flush)
• detect and resolve hazards
Summary