1998 Morgan Kaufmann Publishers
Pipelining
• Improve performance by increasing instruction throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?
[Figure: execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Top: nonpipelined, each instruction (Instruction fetch, Reg, ALU, Data access, Reg) takes 8 ns, time axis 2 to 18 ns. Bottom: pipelined, a new instruction starts every 2 ns, time axis 2 to 14 ns. Vertical axis: program execution order (in instructions).]
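The timing in the figure can be checked with a short calculation. This is a minimal sketch, assuming the figure's numbers (8 ns per nonpipelined instruction, five stages, a 2 ns pipeline step) and no stalls; the function names are illustrative only.

```python
def nonpipelined_time_ns(n_instructions, instruction_time_ns=8):
    # Each instruction runs to completion before the next one starts.
    return n_instructions * instruction_time_ns


def pipelined_time_ns(n_instructions, n_stages=5, stage_time_ns=2):
    # The first instruction needs n_stages steps; each later one adds one step.
    return (n_stages + n_instructions - 1) * stage_time_ns


def speedup(n_instructions):
    return nonpipelined_time_ns(n_instructions) / pipelined_time_ns(n_instructions)


print(nonpipelined_time_ns(3))  # 24 ns for the three loads
print(pipelined_time_ns(3))     # 14 ns for the three loads
print(round(speedup(1_000_000), 3))
```

Under these assumptions the speedup for a long instruction stream approaches 8/2 = 4 rather than the stage count of 5, because the 2 ns step is set by the slowest stage while the nonpipelined instruction takes only 8 ns.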
Pipelining
• What makes it easy
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
• What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
• We will build a simple pipeline and look at these issues
• We will talk about modern processors and what really makes it hard:
– exception handling
– trying to improve performance with out-of-order execution, etc.
Basic Idea
• What do we need to add to actually split the datapath into stages?
[Figure: single-cycle datapath (PC, instruction memory, registers, sign extend, ALU, data memory, adders, and muxes) divided into five stages.]
IF: Instruction fetch
ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access
WB: Write back
Pipelined Datapath
Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]
Corrected Datapath
[Figure: corrected pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]
Graphically Representing Pipelines
• Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
[Figure: multiple-clock-cycle pipeline diagram, CC 1 to CC 6, time in clock cycles. Each instruction passes through IM, Reg, ALU, DM, Reg. Program execution order (in instructions):
lw $10, 20($1)
sub $11, $2, $3]
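The two questions on the slide can be answered mechanically from the diagram. The following is a small sketch, assuming an ideal five-stage pipeline with no stalls and one instruction issued per cycle; the helper names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

PROGRAM = ["lw $10, 20($1)", "sub $11, $2, $3"]


def total_cycles(n_instructions):
    # Cycles until the last instruction leaves WB (no stalls assumed).
    return len(STAGES) + n_instructions - 1


def stage_in_cycle(instr_index, cycle):
    # instr_index is 0-based program order; cycle is 1-based (CC 1, CC 2, ...).
    position = cycle - 1 - instr_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None


def instruction_in_stage(stage, cycle, program):
    # Which instruction, if any, occupies the given stage in the given cycle.
    for index, text in enumerate(program):
        if stage_in_cycle(index, cycle) == stage:
            return text
    return None


print(total_cycles(len(PROGRAM)))                 # 6
print(instruction_in_stage("EX", 4, PROGRAM))     # sub $11, $2, $3
```

In cycle 4 the ALU (EX stage) is working on the sub instruction, while lw is in its memory access stage.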
Pipeline Control
[Figure: pipelined datapath with control signals (PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, MemtoReg) and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB.]
Annotations on the figure:
• Can these operations be completed one stage earlier? Think about the critical path!
• Why does IM have no control signal at all? Why does RF have only write control?
• Can the ALU result be written back in MEM? Cost of an RF write port!
Pipeline design considerations
• Simplification of control mechanism
– Active in every clock cycle (always enabled)
• Ex:
– Instruction memory has no control signal.
– Pipeline registers
• Minimization of power consumption
– Explicit control for infrequent operations
• Ex: both read and write controls for data memory
• Cost consideration
– Ex: alternatives for the ALU result write: in MEM or WB
Pipeline control
• We have 5 stages. What needs to be controlled in each stage?
– Instruction Fetch and PC Increment
– Instruction Decode / Register Fetch
– Execution
– Memory Stage
– Write Back
• How would control be handled in an automobile plant?
• a fancy control center telling everyone what to do?
• should we use a finite state machine?
– Centralized
– Distributed
Pipeline Control
• Pass control signals along just like the data
– Generate all control signals at the decode stage (similar to the single-cycle implementation)
– Pass the generated control signals along the pipeline and consume the related control signals at the corresponding stage (similar to the multicycle implementation)

             Execution/Address Calculation      Memory access              Write-back
             stage control lines                stage control lines        stage control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc     Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0          0       0        0         1         0
lw           0       0       0       1          0       1        0         1         1
sw           X       0       0       1          0       0        1         0         X
beq          X       0       1       0          1       0        0         0         X

[Figure: the Control unit decodes the instruction from IF/ID and writes the EX, M, and WB control groups into ID/EX; M and WB are passed on to EX/MEM; WB is passed on to MEM/WB.]
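The idea of generating all control signals at decode and consuming them stage by stage can be sketched as plain data passing. This is an illustrative model only: the signal values come from the table above, `None` stands for a don't-care (X), and the function names are hypothetical.

```python
# Control values from the table; None stands for a don't-care (X).
CONTROL = {
    "R-format": {
        "EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 0},
    },
    "lw": {
        "EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 1},
    },
    "sw": {
        "EX": {"RegDst": None, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
        "WB": {"RegWrite": 0, "MemtoReg": None},
    },
    "beq": {
        "EX": {"RegDst": None, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
        "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 0, "MemtoReg": None},
    },
}


def decode(kind):
    # ID stage: generate every control group for this instruction class.
    return {group: dict(signals) for group, signals in CONTROL[kind].items()}


def leave_stage(bundle, consumed):
    # Drop the group used by the stage just completed; pass the rest along.
    return {group: s for group, s in bundle.items() if group != consumed}


id_ex = decode("lw")                 # contents of ID/EX: EX, M, WB
ex_mem = leave_stage(id_ex, "EX")    # contents of EX/MEM: M, WB
mem_wb = leave_stage(ex_mem, "M")    # contents of MEM/WB: WB
print(mem_wb)
```

Each pipeline register carries only the groups that later stages still need, which is why the control field shrinks from EX/M/WB to M/WB to WB.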
Datapath with Control
[Figure: pipelined datapath with the Control unit added; the EX, M, and WB control groups travel through the ID/EX, EX/MEM, and MEM/WB pipeline registers and drive RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, PCSrc, RegWrite, and MemtoReg.]
Dependencies
• Problem with starting next instruction before first is finished
– dependencies that point backward in time are data hazards
[Figure: pipeline diagram, CC 1 to CC 9, time in clock cycles. Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9.]
Software Solution
• Have compiler guarantee no hazards
• Where do we insert the nops?
    sub $2, $1, $3
    and $12, $2, $5
    or $13, $6, $2
    add $14, $2, $2
    sw $15, 100($2)
• Insert NOP…
    sub $2, $1, $3
    NOP
    NOP
    NOP
    and $12, $2, $5
    or $13, $6, $2
    add $14, $2, $2
    sw $15, 100($2)
• Problem: this really slows us down!
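The NOP placement can be computed from register dependences. The sketch below is one possible approach, not a description of any real compiler: it assumes no forwarding and a register file whose written value cannot be read in the same cycle, so a consumer must sit at least four slots after its producer (three NOPs, as on the slide). The tuple encoding and the `min_distance` parameter are hypothetical; with a write-then-read register file `min_distance=3` would apply.

```python
# Each instruction: (text, destination register or None, source registers).
PROGRAM = [
    ("sub $2, $1, $3", "$2", ("$1", "$3")),
    ("and $12, $2, $5", "$12", ("$2", "$5")),
    ("or $13, $6, $2", "$13", ("$6", "$2")),
    ("add $14, $2, $2", "$14", ("$2", "$2")),
    ("sw $15, 100($2)", None, ("$15", "$2")),
]

NOP = ("nop", None, ())


def insert_nops(program, min_distance=4):
    scheduled = []
    for text, dest, sources in program:
        needed = 0
        # Look back over the slots that are still too close to be safe.
        for back, (_, produced, _) in enumerate(reversed(scheduled), start=1):
            if back >= min_distance:
                break
            if produced is not None and produced in sources:
                needed = max(needed, min_distance - back)
        scheduled.extend([NOP] * needed)
        scheduled.append((text, dest, sources))
    return scheduled


for text, _, _ in insert_nops(PROGRAM):
    print(text)
```

Only the and instruction needs padding here; by the time or, add, and sw issue, sub is already far enough ahead.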
Forwarding
• Use temporary results, don’t wait for them to be written
• ALU forwarding
• Register file forwarding (latch-based register file) to handle a read and a write to the same register in the same cycle (read what you just wrote)
[Figure: pipeline diagram, CC 1 to CC 9, time in clock cycles, with forwarding paths. Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9.
Value of EX/MEM: -20 in CC 4, X otherwise.
Value of MEM/WB: -20 in CC 5, X otherwise.
Annotations: "what if this $2 was $13?"; "Transparent latch".]
The control values for the forwarding multiplexors
Mux control     Source    Explanation
ForwardA = 00   ID/EX     The first ALU operand comes from the register file
ForwardA = 10   EX/MEM    The first ALU operand is forwarded from the prior ALU result
ForwardA = 01   MEM/WB    The first ALU operand is forwarded from data memory or an earlier ALU result
ForwardB = 00   ID/EX     The second ALU operand comes from the register file
ForwardB = 10   EX/MEM    The second ALU operand is forwarded from the prior ALU result
ForwardB = 01   MEM/WB    The second ALU operand is forwarded from data memory or an earlier ALU result
If (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
If (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
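The mux selection can be written as a small function. This is a sketch under assumptions: the MEM/WB condition mirrors the one on the slide, while the EX/MEM condition (select 10) does not appear in the extracted text and is filled in here with the usual form, so treat it as an assumption. The `PipeReg` type and `forward_select` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PipeReg:
    reg_write: bool = False  # RegWrite control bit carried by the register
    rd: int = 0              # destination register number


def forward_select(source_reg, ex_mem, mem_wb):
    """Return the 2-bit mux select for one ALU operand (Rs or Rt in ID/EX)."""
    # Assumed EX hazard condition: the nearest producer wins.
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == source_reg:
        return 0b10
    # MEM hazard condition as given on the slide.
    if (mem_wb.reg_write and mem_wb.rd != 0
            and ex_mem.rd != source_reg and mem_wb.rd == source_reg):
        return 0b01
    return 0b00


# sub $2,... is in MEM (EX/MEM.rd = 2) while and $12, $2, $5 is in EX.
print(forward_select(2, PipeReg(True, 2), PipeReg(False, 0)))   # 2, i.e. 10
# sub is in WB and "and" is in MEM while or $13, $6, $2 is in EX.
print(forward_select(2, PipeReg(True, 12), PipeReg(True, 2)))   # 1, i.e. 01
```

The same function serves ForwardA and ForwardB; only the source register passed in (ID/EX.RegisterRs or ID/EX.RegisterRt) differs.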
• Simultaneously match the $4 operands in stage MEM and stage WB.
• Forward from the nearest stage: MEM (based on the sequential programming model)
• Write from the WB stage to the register file (RF).
[Figure: datapath with two paths labeled "Forward" and two labeled "Reg. Write".]
Can't always forward
• Load word can still cause a hazard:
– an instruction tries to read a register following a load instruction that writes to the same register.
• Thus, we need a hazard detection unit to stall the instruction that follows the load
[Figure: pipeline diagram, CC 1 to CC 9, time in clock cycles. Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
Annotation: latch-based RF: read what you just wrote (write, then read).]
Stalling
• We can stall the pipeline by keeping an instruction in the same stage
[Figure: pipeline diagram, CC 1 to CC 10, time in clock cycles, with a bubble inserted after the load. Program execution order (in instructions):
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7]
Hazard Detection Unit
• Stall by letting an instruction that won’t write anything go forward
[Figure: pipelined datapath with a hazard detection unit and a forwarding unit. Signals at the hazard detection unit: ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, PCWrite, IF/IDWrite. Signals at the forwarding unit: Rs, Rt, Rd, EX/MEM.RegisterRd, MEM/WB.RegisterRd.]
On a stall: insert a NOP; the load continues to the next stage.
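The signals listed in the figure suggest the load-use check below. This is a minimal sketch assuming the usual condition (a load in EX whose destination Rt matches either source register of the instruction in ID); the function names and the `insert_nop` key are hypothetical, while PCWrite and IF/IDWrite are the signal names from the figure.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    # Stall when the instruction in EX is a load and the instruction in ID
    # reads the register that the load will write.
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))


def hazard_outputs(stall):
    # On a stall, hold the PC and IF/ID, and send a NOP down the pipeline.
    return {"PCWrite": not stall, "IF/IDWrite": not stall, "insert_nop": stall}


# lw $2, 20($1) is in EX (Rt = 2); and $4, $2, $5 is in ID (Rs = 2, Rt = 5).
stall = must_stall(True, 2, 2, 5)
print(stall, hazard_outputs(stall))
```

Holding PC and IF/ID keeps the dependent instruction in place for one cycle, while the load itself moves on to MEM.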
Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
• We are predicting branch not taken
– need to add hardware for flushing instructions if we are wrong
[Figure: pipeline diagram, CC 1 to CC 9, time in clock cycles. Program execution order (in instructions):
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)]
Flushing Instructions
Optimized data path for branch performance: Branch delay: 3 => 1
[Figure: original data path: pipelined datapath with control, pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, and PCSrc.]
[Figure: optimized data path with an IF.Flush signal, an equality comparator (=), a hazard detection unit, and a forwarding unit.]
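The addresses in the branch hazard example can be reproduced with two small helpers. This is a sketch assuming the MIPS convention that the branch offset counts words relative to PC + 4, and that every instruction fetched during the branch delay is flushed when the branch is taken; the function names are illustrative.

```python
def branch_target(branch_pc, offset_words):
    # Target = (PC + 4) + (sign-extended offset shifted left by 2).
    return branch_pc + 4 + (offset_words << 2)


def flushed_addresses(branch_pc, branch_delay):
    # Instructions fetched after the branch before its outcome is known.
    return [branch_pc + 4 * k for k in range(1, branch_delay + 1)]


print(branch_target(40, 7))        # 72
print(flushed_addresses(40, 3))    # [44, 48, 52]
print(flushed_addresses(40, 1))    # [44]
```

With a branch delay of 3 the instructions at 44, 48, and 52 are discarded on a taken branch; reducing the delay to 1 leaves only the instruction at 44 to flush.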
Improving Performance
• Try to avoid stalls! E.g., reorder these instructions:
    Before                After reordering
    lw $t0, 0($t1)        lw $t0, 0($t1)
    lw $t2, 4($t1)        lw $t2, 4($t1)
    sw $t2, 0($t1)        sw $t0, 4($t1)
    sw $t0, 4($t1)        sw $t2, 0($t1)
• Add a branch delay slot (delayed branch)
– the next instruction after a branch is always executed
– rely on compiler to fill the slot with something useful
    Before                After filling the delay slot
    add $2, $3, $4        beq $9, $10, 400
    beq $9, $10, 400      add $2, $3, $4 ; always executed
    sub $11, $12, $13     sub $11, $12, $13
    :                     :
• Superscalar: start more than one instruction in the same cycle
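The benefit of the reordering can be checked by counting load-use pairs. This is a rough sketch assuming a pipeline with forwarding in which an instruction that immediately follows a load and reads the loaded register costs one stall cycle (a store's data register is treated as a source); the tuple encoding is hypothetical.

```python
# Each instruction: (opcode, destination register or None, source registers).
BEFORE = [
    ("lw", "$t0", ("$t1",)),
    ("lw", "$t2", ("$t1",)),
    ("sw", None, ("$t2", "$t1")),
    ("sw", None, ("$t0", "$t1")),
]

AFTER = [
    ("lw", "$t0", ("$t1",)),
    ("lw", "$t2", ("$t1",)),
    ("sw", None, ("$t0", "$t1")),
    ("sw", None, ("$t2", "$t1")),
]


def load_use_stalls(program):
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, dest, _ = previous
        # A load followed directly by a reader of its destination stalls once.
        if opcode == "lw" and dest in current[2]:
            stalls += 1
    return stalls


print(load_use_stalls(BEFORE))  # 1
print(load_use_stalls(AFTER))   # 0
```

Swapping the two stores puts an independent instruction between the load of $t2 and its first use, which removes the stall in this model.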
Final data/control path for exception handling
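1. Flush the instructions
2. Save the PC (PC + 4) and the Cause
3. Set the new PC
4. Turn the overflowed instruction (in EX) into a NOP
[Figure: final datapath and control for exception handling, with steps 1 to 4 marked on the diagram.]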
Superscalar Execution
Instruction type              Pipe stages
ALU or branch instruction     IF ID EX MEM WB
Load or store instruction     IF ID EX MEM WB
ALU or branch instruction        IF ID EX MEM WB
Load or store instruction        IF ID EX MEM WB
ALU or branch instruction           IF ID EX MEM WB
Load or store instruction           IF ID EX MEM WB
ALU or branch instruction              IF ID EX MEM WB
Load or store instruction              IF ID EX MEM WB
Simple Superscalar Code Scheduling
Loop: lw   $t0, 0($s1)        # $t0 = array element ($s1 is i)
      addu $t0, $t0, $s2      # add scalar in $s2 (B)
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch if $s1 != 0

      ALU or branch instruction    Data transfer instruction    Clock cycle
Loop:                              lw $t0, 0($s1)               1
      addi $s1, $s1, -4                                         2
      addu $t0, $t0, $s2                                        3
      bne $s1, $zero, Loop         sw $t0, 4($s1)               4

do {
    *I = *I + B;
    I = I - 4;
} while (I != 0);
Loop Unrolling for Superscalar Pipelines
      ALU or branch instruction    Data transfer instruction    Clock cycle
Loop: addi $s1, $s1, -16           lw $t0, 0($s1)               1
                                   lw $t1, 12($s1)              2
      addu $t0, $t0, $s2           lw $t2, 8($s1)               3
      addu $t1, $t1, $s2           lw $t3, 4($s1)               4
      addu $t2, $t2, $s2           sw $t0, 16($s1)              5
      addu $t3, $t3, $s2           sw $t1, 12($s1)              6
                                   sw $t2, 8($s1)               7
      bne $s1, $zero, Loop         sw $t3, 4($s1)               8

do {
    I = I - 16;
    *(I+16) = *(I+16) + B;
    *(I+12) = *(I+12) + B;
    *(I+8) = *(I+8) + B;
    *(I+4) = *(I+4) + B;
} while (I != 0);
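The gain from unrolling shows up when the two schedules are compared by instructions per clock. The sketch below encodes each schedule as a list of (ALU/branch slot, data transfer slot) pairs taken from the two scheduling tables, with `None` for an empty slot; the encoding and function names are illustrative.

```python
# One (ALU or branch, data transfer) pair per clock cycle; None = empty slot.
SIMPLE = [
    (None, "lw"),
    ("addi", None),
    ("addu", None),
    ("bne", "sw"),
]

UNROLLED = [
    ("addi", "lw"),
    (None, "lw"),
    ("addu", "lw"),
    ("addu", "lw"),
    ("addu", "sw"),
    ("addu", "sw"),
    (None, "sw"),
    ("bne", "sw"),
]


def instruction_count(schedule):
    return sum(slot is not None for pair in schedule for slot in pair)


def ipc(schedule):
    # Instructions issued per clock cycle over one loop iteration.
    return instruction_count(schedule) / len(schedule)


def cycles_per_element(schedule, elements_per_iteration):
    return len(schedule) / elements_per_iteration


print(ipc(SIMPLE), cycles_per_element(SIMPLE, 1))      # 1.25 4.0
print(ipc(UNROLLED), cycles_per_element(UNROLLED, 4))  # 1.75 2.0
```

The simple schedule issues 5 instructions in 4 cycles, while the unrolled one issues 14 in 8 cycles and handles four array elements, halving the cycles spent per element.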
Loop Unrolling
• Superscalar has the architecture to perform parallel calculation
• For C source code:
– for(i=100; i!=0; i--) { A[i]=A[i]+1; }
– for(i=100; i!=0; i=i-4) { A[i]=A[i]+1; A[i-1]=A[i-1]+1; A[i-2]=A[i-2]+1; A[i-3]=A[i-3]+1; }
On a uniprocessor, the two loops are functionally the same.
On a superscalar processor, the larger number of operations per iteration provides a richer opportunity for parallel execution.
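That the two loops compute the same result can be checked directly. The following is a Python transcription of the two C loops, offered as a sketch; the array contents, the iteration counters, and the function names are assumptions added for the check.

```python
def rolled(values):
    a = list(values)
    iterations = 0
    i = 100
    while i != 0:
        a[i] = a[i] + 1
        i -= 1
        iterations += 1
    return a, iterations


def unrolled(values):
    a = list(values)
    iterations = 0
    i = 100
    while i != 0:
        # Four independent updates per iteration, one loop test per four elements.
        a[i] = a[i] + 1
        a[i - 1] = a[i - 1] + 1
        a[i - 2] = a[i - 2] + 1
        a[i - 3] = a[i - 3] + 1
        i -= 4
        iterations += 1
    return a, iterations


data = list(range(101))  # indices 0..100; element 0 is never touched
result_rolled, n_rolled = rolled(data)
result_unrolled, n_unrolled = unrolled(data)
print(result_rolled == result_unrolled, n_rolled, n_unrolled)
```

The unrolled version executes a quarter as many loop tests, and its four updates per iteration do not depend on each other, which is what gives a multiple-issue machine something to overlap.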
Dynamic Scheduling: dispatch add
[Figure: reorder buffer (ROB), register file (RF) with "Tag #" and "Modified?" fields, reservation station, two functional units, and result bus, for the instruction stream add R3,R1,R2; subi R4,R3,2. ROB entry #1 is allocated to (add R3, R1, R2), and the reservation station holds "#1 R1 R2".]
Dynamic Scheduling: dispatch subi
[Figure: same structures after dispatching subi. The ROB holds #1 (add R3, R1, R2) and #2 (subi R4,R3,2); the RF row for R4 shows tag #2 with Modified? = 1; the reservation station holds "#1 R1 R2" and "#2 #1 2".]
Dynamic Scheduling: execute add
[Figure: same structures while add executes. Result bus: #1 (107) = R1 + R2; ROB entry #1 receives the value 107; the reservation station still holds "#2 #1 2".]
Dynamic Scheduling: execute subi
[Figure: same structures while subi executes. Result bus: #2 (105) = #1 - 2; the ROB holds the values 107 for #1 and 105 for #2; the reservation station is empty.]
Dynamic Scheduling: write back add
[Figure: same structures when add writes back. The value 107 replaces tag #1 for R3; the ROB still holds 105 for #2, and the RF row for R4 still shows tag #2 with Modified? = 1.]
Dynamic Scheduling: write back subi
[Figure: same structures when subi writes back. The RF row for R4 changes from tag #2 with Modified? = 1 to the value 105 with Modified? = 0; ROB entry #2 (subi R4,R3,2) is retired.]
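The dispatch, execute, and write-back sequence in these figures can be mimicked with a toy model. This is a heavily simplified sketch, not the mechanism of any particular processor: it supports only add and subi, commits in program order, and assumes the hypothetical initial values R1 = 100 and R2 = 7 so that the results match the 107 and 105 shown on the slides. All class and method names are illustrative.

```python
class Machine:
    def __init__(self, registers):
        self.rf = dict(registers)  # committed register values
        self.pending = {}          # register -> tag of its newest pending writer
        self.rob = {}              # tag -> {"dest": register, "value": result or None}
        self.station = []          # reservation station entries
        self.next_tag = 1

    def _read(self, source):
        if isinstance(source, int):            # immediate operand
            return source
        tag = self.pending.get(source)
        if tag is None:
            return self.rf[source]             # value is already in the RF
        value = self.rob[tag]["value"]
        return f"#{tag}" if value is None else value

    def dispatch(self, op, dest, a, b):
        tag = self.next_tag
        self.next_tag += 1
        operands = [self._read(a), self._read(b)]
        self.rob[tag] = {"dest": dest, "value": None}
        self.pending[dest] = tag
        self.station.append({"tag": tag, "op": op, "operands": operands})
        return tag

    def execute(self):
        # Run the first entry whose operands are all real values.
        for entry in self.station:
            if all(isinstance(x, int) for x in entry["operands"]):
                a, b = entry["operands"]
                result = a + b if entry["op"] == "add" else a - b
                self.station.remove(entry)
                self.rob[entry["tag"]]["value"] = result
                name = f"#{entry['tag']}"
                # Result bus: waiting entries capture the broadcast value.
                for waiting in self.station:
                    waiting["operands"] = [
                        result if x == name else x for x in waiting["operands"]
                    ]
                return entry["tag"], result
        return None

    def write_back(self):
        # Commit the oldest ROB entry, in program order, if it has a result.
        if not self.rob:
            return None
        tag = min(self.rob)
        entry = self.rob[tag]
        if entry["value"] is None:
            return None
        del self.rob[tag]
        self.rf[entry["dest"]] = entry["value"]
        if self.pending.get(entry["dest"]) == tag:
            del self.pending[entry["dest"]]
        return tag


def run_slide_example():
    m = Machine({"R1": 100, "R2": 7, "R3": 0, "R4": 0})  # assumed values
    m.dispatch("add", "R3", "R1", "R2")
    m.dispatch("subi", "R4", "R3", 2)
    waiting = list(m.station[1]["operands"])  # subi waits on tag #1
    first = m.execute()
    second = m.execute()
    m.write_back()
    m.write_back()
    return waiting, first, second, m.rf


print(run_slide_example())
```

The subi entry is dispatched with the tag "#1" in place of R3 and only becomes ready once add broadcasts its result, which is the dependence tracking the figures illustrate.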
Dynamic Scheduling
• The hardware performs the scheduling
– hardware tries to find instructions to execute
– out-of-order execution is possible
– speculative execution and dynamic branch prediction
• All modern processors are very complicated
– DEC Alpha 21264: 9 stage pipeline, 6 instruction issue
– PowerPC and Pentium: branch history table
– Compiler technology important
• This class has given you the background you need to learn more
• Video: An Overview of Intel Pentium Processor
(available from University Video Communications)
Figure 6.52: The performance consequences of single-cycle, multiple-cycle and pipelined
[Figure: design points placed on two axes, one running from Slower to Faster and one labeled Instructions per clock (IPC = 1/CPI). Points: Multicycle (Section 5.5), Single-cycle (Section 5.4), Pipelined, Deeply pipelined, Multiple-issue pipelined (Section 6.9), Multiple issue with deep pipeline (Section 6.10).]
Figure 6.53: Basic relationship between the datapaths in Figure 6.52
[Figure: design points placed against an axis labeled "Use latency in instructions", running from 1 to Several. Points: Multicycle (Section 5.5), Single-cycle (Section 5.4), Pipelined, Deeply pipelined, Multiple-issue pipelined (Section 6.9), Multiple issue with deep pipeline (Section 6.10).]