L18 – Pipeline Issues 1Comp 411
CPU Pipelining Issues
Finishing up 4.5-4.8
This pipe stuff makesmy head hurt!
What have you been beating
your head against?
L18 – Pipeline Issues 2Comp 411
Pipelining
Improve performance by increasing instruction throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?
Instructionfetch
Reg ALUData
accessReg
8 nsInstruction
fetchReg ALU
Dataaccess
Reg
8 nsInstruction
fetch
8 ns
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
...
Programexecutionorder(in instructions)
Instructionfetch
Reg ALUData
accessReg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 nsInstruction
fetchReg ALU
Dataaccess
Reg
2 nsInstruction
fetchReg ALU
Dataaccess
Reg
2 ns 2 ns 2 ns 2 ns 2 ns
Programexecutionorder(in instructions)
L18 – Pipeline Issues 3Comp 411
Pipelining
What makes it easyall instructions are the same lengthjust a few instruction formatsmemory operands appear only in loads and stores
What makes it hard?structural hazards: suppose we had only one memorycontrol hazards: need to worry about branch instructionsdata hazards: an instruction depends on a previous
instruction
Individual Instructions still take the same number of cycles
But we’ve improved the through-put by increasing the number of simultaneously executing instructions
L18 – Pipeline Issues 4Comp 411
Structural Hazards
InstFetch
RegRead
ALU DataAccess
Reg Write
InstFetch
RegRead
ALU DataAccess
Reg Write
InstFetch
RegRead
ALU DataAccess
Reg Write
InstFetch
RegRead
ALU DataAccess
Reg Write
L18 – Pipeline Issues 5Comp 411
Problem with starting next instruction before first is finisheddependencies that “go backward in time” are
data hazards
Data Hazards
IM Reg
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecutionorder(in instructions)
and $12, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
10 10 10 10 10/– 20 – 20 – 20 – 20 – 20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2:
DM Reg
Reg
Reg
Reg
DM
L18 – Pipeline Issues 6Comp 411
Have compiler guarantee no hazardsWhere do we insert the “nops” ?
sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)
Problem: this really slows us down!
Software Solution
L18 – Pipeline Issues 7Comp 411
Use temporary results, don’t wait for them to be written register file forwarding to handle read/write to same register ALU forwarding
Forwarding
IM Reg
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecution order(in instructions)
and $12, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
10 10 10 10 10/– 20 – 20 – 20 – 20 – 20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2 :
DM Reg
Reg
Reg
Reg
X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :
DM
L18 – Pipeline Issues 8Comp 411
Load word can still cause a hazard:an instruction tries to read a register following a load
instruction that writes to the same register.
Thus, we need a hazard detection unit to “stall” the instruction
Can't always forward
Reg
IM
Reg
Reg
IM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
lw $2, 20($1)
Programexecutionorder(in instructions)
and $4, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
DM Reg
Reg
Reg
DM
L18 – Pipeline Issues 9Comp 411
Stalling
We can stall the pipeline by keeping an instruction in the same stage
lw $2, 20($1)
Programexecutionorder(in instructions)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
Reg
IM
Reg
Reg
IM DM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)
IM Reg DM RegIM
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9 CC 10
DM Reg
RegReg
Reg
bubble
L18 – Pipeline Issues 10Comp 411
When we decide to branch, other instructions are in the pipeline!
We are predicting “branch not taken”need to add hardware for flushing instructions if
we are wrong
Branch Hazards
Reg
Reg
CC 1
Time (in clock cycles)
40 beq $1, $3, 7
Programexecutionorder(in instructions)
IM Reg
IM DM
IM DM
IM DM
DM
DM Reg
Reg Reg
Reg
Reg
RegIM
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
Reg
L18 – Pipeline Issues 11Comp 411
Improving Performance
Try to avoid stalls! E.g., reorder these instructions:
lw $t0, 0($t1)lw $t2, 4($t1)sw $t2, 0($t1)sw $t0, 4($t1)
Add a “branch delay slot”the next instruction after a branch is always
executedrely on compiler to “fill” the slot with something
useful
Superscalar: start more than one instruction in the same cycle
L18 – Pipeline Issues 12Comp 411
Dynamic Scheduling
The hardware performs the “scheduling” hardware tries to find instructions to execute
out of order execution is possible
speculative execution and dynamic branch prediction
All modern processors are very complicatedPentium 4: 20 stage pipeline, 6 simultaneous
instructions
PowerPC and Pentium: branch history table
Compiler technology important
L18 – Pipeline Issues 13Comp 411
Pipeline Summary (I)
• Started with unpipelined implementation– direct execute, 3-5 cycles/instruction– it had a long cycle time: mem + regs + alu + mem
+ wb
• We ended up with a 5-stage pipelined implementation– increase throughput (3x???)– delayed branch decision (1 cycle)
Choose to execute instruction after branch– delayed register writeback (3 cycles)
Add bypass paths (6 x 2 = 12) to forward correct value
– memory data available only in WB stageIntroduce NOPs at IRALU, to stall IF and RF
stages until LD result was ready
L18 – Pipeline Issues 14Comp 411
Pipeline Summary (II)
Fallacy #1: Pipelining is easySmart people get it wrong all of the time!
Fallacy #2: Pipelining is independent of ISAMany ISA decisions impact how easy/costly it is to implement pipelining (i.e. branch semantics, addressing modes).
Fallacy #3: Increasing Pipeline stages improves performanceDiminishing returns. Increasing complexity.
L18 – Pipeline Issues 15Comp 411
RISC = Simplicity???
Generalization of registers and
operand coding
Complex instructions, addressing modes
Addressingfeatures, eg
index registers
RISCs
Primitive Machines with
direct implementati
ons
VLIWs,Super-Scalars ?
Pipelines, Bypasses,
Annulment, …, ...
“The P.T. Barnum World’s Tallest Dwarf Competition”
World’s Most Complex RISC?
L18 – Pipeline Issues 16Comp 411
Chapter 5 Preview
Memory Hierarchy
L18 – Pipeline Issues 17Comp 411
Memory Hierarchy
Memory devices come in several different flavorsSRAM – Static Ram
fast (0.5 to 10ns)expensive (>10 times DRAM)small capacity (few megabytes typically)
DRAM – Dynamic RAMMuch slower than SRAM (50ns – 100ns)Access time varies with addressAffordable ($22 / gigabyte) (2GB DDR3,
Mar 2010)4 Gig typical
DISKSlow! (10ms access time)Cheap! (< $0.10 / GB) (down from
$193,000/GB in 1980)Big! (1Tbyte is no problem)
L18 – Pipeline Issues 18Comp 411
Users want large and fast memories!
Memory Hierarchy
L18 – Pipeline Issues 19Comp 411
Locality
A principle that makes having a memory hierarchy a good idea
If an item is referenced,
temporal locality: it will tend to be referenced again soonspatial locality: nearby items will tend to be referenced soon.
Why does code have locality?