Advanced Microarchitecture
Lecture 2: Pipelining and Superscalar Review
Pipelined Design
• Motivation: Increase throughput with little increase in cost (hardware, power, complexity, etc.)
• Bandwidth or Throughput = Performance
  • BW = num. tasks / unit time
  • For a system that operates on one task at a time: BW = 1 / latency
• Pipelining can increase BW if there are many repetitions of the same operation/task
• Latency per task remains the same or increases
Pipelining Illustrated
[Figure: a single block of combinatorial logic of N gate delays (BW ≈ 1/n); the same logic split into two stages of N/2 gate delays each (BW ≈ 2/n); and into three stages of N/3 gate delays each (BW ≈ 3/n).]
Performance Model
• Starting from an unpipelined version with propagation delay T and BW = 1/T

      Perf_pipe = BW_pipe = 1 / (T/k + S)

  where k = number of stages and S = latch delay

[Figure: unpipelined logic of delay T followed by a latch of delay S, vs. a k-stage pipeline of k blocks of delay T/k, each followed by a latch of delay S.]
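The performance model above can be sketched numerically; this is a minimal illustration of the formula, using the T and S values from the plot a few slides later.

```python
# Sketch of the pipelined-bandwidth model: an unpipelined block of
# propagation delay T is split into k stages, each followed by a latch
# of delay S. BW_pipe = 1 / (T/k + S).

def bw_pipe(k, T, S):
    """Throughput of a k-stage pipeline: 1 / (T/k + S)."""
    return 1.0 / (T / k + S)

T, S = 400.0, 22.0           # delay units (from the later C/P plot)
print(1.0 / T)               # unpipelined BW = 1/T
for k in (1, 2, 4, 8):
    print(k, bw_pipe(k, T, S))
# BW grows with k but saturates below 1/S: the latch delay S is paid
# in every stage no matter how finely T is cut.
```

This is why latency per task stays the same or increases: each task now also pays k latch delays.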
Hardware Cost Model
• Starting from an unpipelined version with hardware cost G

      Cost_pipe = G + kL

  where k = number of stages and L = latch cost (incl. control)

[Figure: unpipelined logic of cost G with one latch of cost L, vs. a k-stage pipeline of k blocks of cost G/k, each with its own latch of cost L.]
Cost/Performance Tradeoff

Cost/Performance: C/P = (Lk + G) / [1 / (T/k + S)] = (Lk + G)(T/k + S)
                      = LT + GS + LSk + GT/k

Optimal Cost/Performance: find the k that minimizes C/P:

      d/dk [(Lk + G)(T/k + S)] = 0 + 0 + LS - GT/k^2 = 0

      k_opt = sqrt(GT / (LS))
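The derivation above can be checked numerically; this sketch uses the first parameter set from the plot on the next slide (G=175, L=41, T=400, S=22).

```python
import math

# C/P(k) = (Lk + G)(T/k + S) = LT + GS + LSk + GT/k,
# minimized where d(C/P)/dk = LS - GT/k^2 = 0, i.e. k_opt = sqrt(GT/(LS)).

def cost_perf(k, G, L, T, S):
    return (L * k + G) * (T / k + S)

G, L, T, S = 175.0, 41.0, 400.0, 22.0
k_opt = math.sqrt(G * T / (L * S))   # analytic optimum, ~8.8 stages

# Brute-force check: the best integer depth should bracket k_opt
best_k = min(range(1, 51), key=lambda k: cost_perf(k, G, L, T, S))
print(k_opt, best_k)
```

Cheaper/faster latches (smaller L and S) push k_opt toward deeper pipelines, which is what the two curves on the next slide show.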
"Optimal" Pipeline Depth: k_opt

[Figure: Cost/Performance ratio C/P (×10^4) vs. pipeline depth k from 0 to 50, for two designs: G=175, L=41, T=400, S=22 and G=175, L=21, T=400, S=11. Each curve falls to a minimum at its k_opt and rises again; the design with cheaper, faster latches has a deeper k_opt.]
Cost?
• "Hardware Cost"
  – Transistor/Gate Count
    • Should include the additional logic to control the pipeline
  – Area (related to gate count)
  – Power!
    • More gates → more switching
    • More gates → more leakage
• Many metrics to optimize
• Very difficult to determine what really is "optimal"
Pipelining Idealism
• Uniform Suboperations
  – The operation to be pipelined can be evenly partitioned into uniform-latency suboperations
• Repetition of Identical Operations
  – The same operations are to be performed repeatedly on a large number of different inputs
• Repetition of Independent Operations
  – All the repetitions of the same operation are mutually independent, i.e., no data dependences and no resource conflicts

Good examples: automobile assembly line, floating-point multiplier, instruction pipeline (?)
Instruction Pipeline Design
• Uniform Suboperations … NOT!
  – Balance pipeline stages
    • Stage quantization to yield balanced stages
    • Minimize internal fragmentation (some waiting stages)
• Identical operations … NOT!
  – Unifying instruction types
    • Coalescing instruction types into one "multi-function" pipe
    • Minimize external fragmentation (some idling stages)
• Independent operations … NOT!
  – Resolve data and resource hazards
    • Inter-instruction dependency detection and resolution
    • Minimize performance loss
The Generic Instruction Cycle
• The "computation" to be pipelined:
  1. Instruction Fetch (IF)
  2. Instruction Decode (ID)
  3. Operand(s) Fetch (OF)
  4. Instruction Execution (EX)
  5. Operand Store (OS)
     • a.k.a. writeback (WB)
  6. Update Program Counter (PC)
The Generic Instruction Pipeline

Based on the obvious subcomputations:
  Instruction Fetch   → IF
  Instruction Decode  → ID
  Operand Fetch       → OF/RF
  Instruction Execute → EX
  Operand Store       → OS/WB
Balancing Pipeline Stages

  T_IF = 6 units
  T_ID = 2 units
  T_OF = 9 units
  T_EX = 5 units
  T_OS = 9 units

• Without pipelining: Tcyc = T_IF + T_ID + T_OF + T_EX + T_OS = 31
• Pipelined: Tcyc = max{T_IF, T_ID, T_OF, T_EX, T_OS} = 9
• Speedup = 31 / 9 ≈ 3.4

Can we do better in terms of either performance or efficiency?

[Figure: the five stages IF, ID, OF/RF, EX, OS/WB drawn with widths proportional to their delays.]
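The arithmetic behind these numbers, as a quick sketch:

```python
# Stage delays from the slide, in "units"
stage_delay = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

t_unpipelined = sum(stage_delay.values())  # 31: one long combinational path
t_pipelined = max(stage_delay.values())    # 9: clock limited by slowest stage
speedup = t_unpipelined / t_pipelined      # 31/9 ~ 3.4, well short of ideal 5x

print(t_unpipelined, t_pipelined, round(speedup, 2))
```

The shortfall versus the ideal 5x is internal fragmentation: the 2-unit ID stage idles for 7 units of every 9-unit cycle.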
Balancing Pipeline Stages
• Two methods for stage quantization
  – Merging multiple subcomputations into one
  – Subdividing a subcomputation into multiple smaller ones
• Recent/Current trends
  – Deeper pipelines (more and more stages)
    • To a certain point: then the cost function takes over
  – Multiple different pipelines/subpipelines
  – Pipelining of memory accesses (tricky)
Granularity of Pipeline Stages

  T_IF, T_ID, T_OF, T_EX, T_OS = 6/2/9/5/9 units

• Coarser-grained machine cycle: 4 machine cycles / instruction
  – T_IF&ID = 8 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units

• Finer-grained machine cycle: 11 machine cycles / instruction
  – Tcyc = 3 units

[Figure: coarse-grained pipeline IF&ID | OF | EX | OS vs. fine-grained pipeline IF IF | ID | OF OF OF | EX EX | OS OS OS, each fine sub-stage taking 3 units.]
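Comparing the two quantizations above in a short sketch (latch overheads are ignored here for simplicity):

```python
# Coarse: merge IF+ID (8 units); cycle = max stage = 9 units, 4 cycles/inst.
# Fine: cut every stage into 3-unit chunks; cycle = 3 units, 11 cycles/inst.

coarse_cycle, coarse_stages = 9, 4
fine_cycle, fine_stages = 3, 11

# Per-instruction latency through the pipe
coarse_latency = coarse_cycle * coarse_stages  # 36 units
fine_latency = fine_cycle * fine_stages        # 33 units

# Steady-state throughput: one instruction completes per machine cycle
print(coarse_latency, fine_latency)
print(1 / coarse_cycle, 1 / fine_cycle)  # fine pipe finishes 3x as often
```

The finer pipe wins on throughput (one instruction per 3 units vs. per 9), which is the point of the slide; real designs must also add latch overhead per cut, per the earlier cost model.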
16
Hardware Requirements• Logic needed for
each pipeline stage
• Register file ports needed to support all (relevant) stages
• Memory accessing ports needed to support all (relevant) stages
Lecture 2: Pipelining and Superscalar Review
IFID
OF
OS
EX
IFIFID
OFOFOFEXEX
OSOSOS
17
Pipeline Examples

MIPS R2000/R3000 (5 stages):
  IF | RD | ALU | MEM | WB
  (generic phases: IF = IF; RD = ID, OF; ALU = EX; MEM, WB = OS)

AMDAHL 470V/7 (12 stages):
  PC GEN | Cache Read | Cache Read | Decode | Read REG | Add GEN | Cache Read | Cache Read | EX 1 | EX 2 | Check Result | Write Result
  (the generic IF, ID, OF, EX, OS phases each span one or more of these stages)
Instruction Dependencies
• Data Dependence
  – True Dependence (RAW)
    • Instruction must wait for all required input operands
  – Anti-Dependence (WAR)
    • Later write must not clobber a still-pending earlier read
  – Output Dependence (WAW)
    • Earlier write must not clobber an already-finished later write
• Control Dependence (a.k.a. Procedural Dependence)
  – Conditional branches cause uncertainty in instruction sequencing
  – Instructions following a conditional branch depend on the execution of the branch instruction
  – Instructions following a computed branch depend on the execution of the branch instruction
Example: Quick Sort on MIPS

# for (; (j<high) && (array[j]<array[low]); ++j);
# $10 = j; $9 = high; $6 = array; $8 = low

      bge   $10, $9, $36
      mul   $15, $10, 4
      addu  $24, $6, $15
      lw    $25, 0($24)
      mul   $13, $8, 4
      addu  $14, $6, $13
      lw    $15, 0($14)
      bge   $25, $15, $36
$35:  addu  $10, $10, 1
      . . .
$36:  addu  $11, $11, -1
      . . .
Hardware Dependency Analysis
• Processor must handle
  – Register Data Dependencies
    • RAW, WAW, WAR
  – Memory Data Dependencies
    • RAW, WAW, WAR
  – Control Dependencies
Terminology
• Pipeline Hazards:
  – Potential violations of program dependencies
  – Must ensure program dependencies are not violated
• Hazard Resolution: Stall, Flush, or Forward
  – Static method: performed at compile time in software
  – Dynamic method: performed at runtime using hardware
• Pipeline Interlock:
  – Hardware mechanism for dynamic hazard resolution
  – Must detect and enforce dependencies at runtime
Pipeline: Steady State

            t0   t1   t2   t3   t4   t5
Inst j      IF   ID   RD   ALU  MEM  WB
Inst j+1         IF   ID   RD   ALU  MEM  WB
Inst j+2              IF   ID   RD   ALU  MEM  WB
Inst j+3                   IF   ID   RD   ALU  ...
Inst j+4                        IF   ID   RD   ...
Pipeline: Data Hazard

            t0   t1   t2   t3   t4   t5
Inst j      IF   ID   RD   ALU  MEM  WB
Inst j+1         IF   ID   RD   ALU  MEM  WB
Inst j+2              IF   ID   RD   ALU  MEM  WB
Inst j+3                   IF   ID   RD   ALU  ...
Inst j+4                        IF   ID   RD   ...

[Same overlap as the steady state: a register written late by an older instruction (MEM/WB) is read early (RD) by a younger one, so the younger instruction would read a stale value.]
Pipeline: Stall on Data Hazard

            t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
Inst j      IF   ID   RD   ALU  MEM  WB
Inst j+1         IF   ID   RD   ALU  MEM  WB
Inst j+2              IF   ID   RD  (stalled in RD)  ALU  MEM  WB
Inst j+3                   IF   ID  (stalled in ID)  RD   ALU  MEM
Inst j+4                        IF  (stalled in IF)  ID   RD   ALU
Different View

        t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
IF      Ij   Ij+1 Ij+2 Ij+3 Ij+4 --- stall ---  Ij+4
ID           Ij   Ij+1 Ij+2 Ij+3 --- stall ---  Ij+3 Ij+4
RD                Ij   Ij+1 Ij+2 --- stall ---  Ij+2 Ij+3 Ij+4
ALU                    Ij   Ij+1 nop  nop  nop  Ij+2 Ij+3 Ij+4
MEM                         Ij   Ij+1 nop  nop  nop  Ij+2 Ij+3
WB                               Ij   Ij+1 nop  nop  nop  Ij+2
Pipeline: Forwarding Paths

[Figure: the overlapped pipeline diagram with forwarding arrows from the ALU and MEM outputs of older instructions back to the ALU inputs of younger ones; many possible paths.]

MEM → ALU forwarding still requires stalling even with forwarding paths: a load's result is not available until the end of MEM, too late for an immediately dependent ALU operation.
ALU Forwarding Paths

[Figure: datapath with IF, ID, Register File (src1, src2), ALU, and MEM; comparators (==) match each in-flight destination register (dest) against src1/src2 and steer the forwarded value into the ALU inputs.]

Deeper pipelines may require additional forwarding paths.
Pipeline: Control Hazard

            t0   t1   t2   t3   t4   t5
Inst i      IF   ID   RD   ALU  MEM  WB
Inst i+1         IF   ID   RD   ALU  MEM  WB
Inst i+2              IF   ID   RD   ALU  MEM  WB
Inst i+3                   IF   ID   RD   ALU  ...
Inst i+4                        IF   ID   RD   ...

[If Inst i is a branch, the fetch addresses of Inst i+1 … Inst i+4 are not certain until the branch executes.]
Pipeline: Stall on Control Hazard

            t0   t1   t2   t3   t4   t5
Inst i      IF   ID   RD   ALU  MEM  WB
Inst i+1    (stalled in IF) IF   ID   RD   ALU ...
Inst i+2                         IF   ID   RD  ...
Inst i+3                              IF   ID  ...
Inst i+4                                   IF  ...

[Younger instructions are held in IF until the branch resolves, then proceed back-to-back.]
Pipeline: Prediction for Control Hazards

[Figure: Inst i … Inst i+4 are fetched and executed speculatively. When the branch resolves against the prediction, the speculative state of Inst i+2 … Inst i+4 is cleared (their stages become nops), fetch is resteered, and new Inst i+2 … Inst i+4 enter the pipeline.]
Going Beyond Scalar
• A simple pipeline is limited to CPI ≥ 1.0
• "Superscalar" can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0)
  – Superscalar means executing more than one scalar instruction in parallel (e.g., add + xor + mul)
  – Contrast to Vector, which effectively executes multiple operations in parallel, but they must all be the same operation (e.g., four parallel additions)
Architectures for Instruction Parallelism
• Scalar pipeline (baseline)
  – Instruction/overlap parallelism = D
  – Operation Latency = 1
  – Peak IPC = 1

[Figure: successive instructions vs. time in cycles (1–12) for a D-deep scalar pipeline; D different instructions are overlapped at any time.]
Superscalar Machine
• Superscalar (pipelined) Execution
  – Instruction parallelism = D × N
  – Operation Latency = 1
  – Peak IPC = N per cycle

[Figure: successive instructions vs. time in cycles (1–12); N instructions issued per cycle, so D × N different instructions are overlapped at any time.]
Ex. Original Pentium

[Figure: Pentium pipeline — Prefetch (4× 32-byte buffers) → Decode1 (decodes up to 2 insts) → Decode2 (read operands, address computation) → Execute → Writeback, with Decode2/Execute/Writeback duplicated into two asymmetric pipes.]

Asymmetric pipes:
  u-pipe only: shift, rotate, some FP
  v-pipe only: jmp, jcc, call, fxch
  Both: mov, lea, simple ALU, push/pop, test/cmp
Pentium Hazards, Stalls
• "Pairing Rules" (when can/can't two insts exec at the same time?)
  – read/flow dependence:
        mov eax, 8
        mov [ebp], eax
  – output dependence:
        mov eax, 8
        mov eax, [ebp]
  – partial register stalls:
        mov al, 1
        mov ah, 0
  – function unit rules
    • some instructions can never be paired: MUL, DIV, PUSHA, MOVS, some FP
Limitations of In-Order Pipelines
• CPI of in-order pipelines degrades very sharply if the machine parallelism is increased beyond a certain point
  – i.e., when N approaches the average distance between dependent instructions
  – Forwarding is no longer effective → must stall more often
  – The pipeline may never be full due to the frequency of dependency stalls
N Instruction Limit

Ex. superscalar degree N = 4: any dependency among the instructions issued together will cause a stall; a dependent instruction must be at least N = 4 instructions away.

On average, the parent-child separation is only about 5 instructions! (Franklin and Sohi '92)

An average of 5 means there are many cases where the separation is < 4; each of these limits parallelism.

Pentium: superscalar degree N = 2 is reasonable… going much further encounters rapidly diminishing returns.
In Search of Parallelism
• "Trivial" Parallelism is limited
  – What is trivial parallelism?
    • In-order: sequential instructions do not have dependencies
    • In all previous examples, all instructions executed either at the same time as or after earlier instructions
  – Previous slides show that superscalar execution quickly hits a ceiling
• So what is "non-trivial" parallelism? …
What is Parallelism?
• Work
  – T1: time to complete a computation on a sequential system
• Critical Path
  – T∞: time to complete the same computation on an infinitely-parallel system
• Average Parallelism
  – Pavg = T1 / T∞
• For a p-wide system:
  – Tp ≥ max{T1/p, T∞}
  – If Pavg >> p, then Tp ≈ T1/p

Example dataflow graph:
  x = a + b;  y = b * 2
  z = (x - y) * (x + y)
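The work and critical-path numbers for this small dataflow graph can be worked out directly:

```python
# x = a + b; y = b * 2; z = (x - y) * (x + y)
T1 = 5       # work: 5 operations (+, *2, -, +, final *) done sequentially
T_inf = 3    # critical path: (a+b) -> (x-y or x+y) -> final multiply
P_avg = T1 / T_inf   # average parallelism = 5/3

def T_p(p):
    """Lower bound on execution time for a p-wide machine."""
    return max(T1 / p, T_inf)

print(P_avg, T_p(1), T_p(2), T_p(100))
# Beyond p = 2 the critical path dominates: extra width buys nothing here.
```

Note that `y = b * 2` runs concurrently with `x = a + b`, so only the 3-deep chain through x limits the infinitely-parallel machine.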
ILP: Instruction-Level Parallelism
• ILP is a measure of the amount of inter-dependencies between instructions
• Average ILP = num instructions / longest path

code1: ILP = 1 (must execute serially); T1 = 3, T∞ = 3
  r1 ← r2 + 1
  r3 ← r1 / 17
  r4 ← r0 - r3

code2: ILP = 3 (can execute at the same time); T1 = 3, T∞ = 1
  r1 ← r2 + 1
  r3 ← r9 / 17
  r4 ← r0 - r10
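The "instructions / longest path" definition can be computed mechanically; this sketch models each instruction as a destination plus its source registers.

```python
# code1 and code2 from the slide: (dest, [source registers])
code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

def ilp(code):
    """Average ILP = num instructions / longest dependence chain."""
    depth = {}    # register -> length of the chain that produced it
    longest = 0
    for dest, srcs in code:
        d = 1 + max((depth.get(s, 0) for s in srcs), default=0)
        depth[dest] = d
        longest = max(longest, d)
    return len(code) / longest

print(ilp(code1))  # serial chain of 3 -> ILP = 1
print(ilp(code2))  # fully independent -> ILP = 3
```

Registers never written (r0, r2, r9, r10) sit at depth 0, so only true (RAW) chains lengthen the path, matching the dataflow view of ILP.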
ILP != IPC
• Instruction-level parallelism usually assumes infinite resources, perfect fetch, and unit latency for all instructions
• ILP is more a property of the program's dataflow
• IPC is the "real" observed metric of exactly how many instructions are executed per machine cycle, which includes all of the limitations of a real machine
• The ILP of a program is an upper bound on the attainable IPC
Scope of ILP Analysis

  r1 ← r2 + 1    (ILP = 1)        r11 ← r12 + 1    (ILP = 3)
  r3 ← r1 / 17                    r13 ← r19 / 17
  r4 ← r0 - r3                    r14 ← r0 - r20

Analyzed together, the two sequences have ILP = 2 (6 instructions / longest path of 3).
DFG Analysis

A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 - 1
K: R3 → ST 0[R1]
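The register hazards in the A–K sequence can be enumerated mechanically. This sketch ignores memory dependences through the stores and models the stores and the branch as instructions with no destination register; it reports every later conflicting pair, including transitively covered ones.

```python
# (name, destination register or None, source registers)
prog = [
    ("A", "R1", ["R2", "R3"]),
    ("B", "R4", ["R5", "R6"]),
    ("C", "R1", ["R1", "R4"]),
    ("D", "R7", ["R1"]),
    ("E", None, ["R7"]),          # BEQZ reads R7
    ("F", "R4", ["R7"]),
    ("G", "R1", ["R1"]),
    ("H", None, ["R4", "R1"]),    # store reads R4 and R1
    ("J", "R1", ["R1"]),
    ("K", None, ["R3", "R1"]),    # store reads R3 and R1
]

def hazards(prog):
    found = set()
    for i, (ni, di, si) in enumerate(prog):
        for nj, dj, sj in prog[i + 1:]:
            if di and di in sj:
                found.add(("RAW", ni, nj))   # later read of di
            if dj and dj in si:
                found.add(("WAR", ni, nj))   # later write over a read
            if di and di == dj:
                found.add(("WAW", ni, nj))   # later write over a write
    return found

deps = hazards(prog)
print(sorted(d for d in deps if d[0] == "WAW"))
```

The heavy reuse of R1 and R4 produces most of the WAR/WAW pairs; renaming (as in the Example (2) slide) removes them without touching the RAW chains.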
In-Order Issue, Out-of-Order Completion

Issue = send an instruction to execution

Issue stage needs to check:
  1. Structural Dependence
  2. RAW Hazard
  3. WAW Hazard
  4. WAR Hazard

[Figure: an in-order instruction stream feeds the functional units INT, Fadd1–Fadd2, Fmul1–Fmul3, Ld/St; execution begins in order, completion is out of order.]
Example

A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 - 1
K: R3 → ST 0[R1]

Cycle 1: A B
Cycle 2: C
Cycle 3: D
Cycle 4: (stall)
Cycle 5: (stall)
Cycle 6: E F
Cycle 7: G H
Cycle 8: J K

IPC = 10/8 = 1.25

[Figure: dataflow graph over A–K showing the dependence chains that force this schedule.]
Example (2)

A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R9 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R9]
J: R1 = R9 - 1
K: R3 → ST 0[R1]

Cycle 1: A B
Cycle 2: C
Cycle 3: D
Cycle 4: (stall)
Cycle 5: E F G
Cycle 6: H J
Cycle 7: K

IPC = 10/7 = 1.43

[Figure: dataflow graph over A–K; renaming R7 → R9 shortens the dependence chains.]
Track with Simple Scoreboarding
• Scoreboard: a bit-array, 1 bit for each GPR
  – If the bit is not set: the register has valid data
  – If the bit is set: the register has stale data,
    i.e., some outstanding instruction is going to change it
• Issue in order: RD ← Fn (RS, RT)
  – If SB[RS] or SB[RT] is set → RAW, stall
  – If SB[RD] is set → WAW, stall
  – Else, dispatch to FU (Fn) and set SB[RD]
• Complete out-of-order
  – Update GPR[RD], clear SB[RD]
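The 1-bit-per-register scoreboard described above fits in a few lines; this is a minimal sketch, with the register count and the 3-operand issue interface chosen for illustration.

```python
class Scoreboard:
    """1 bit per GPR: set means a pending write will change the register."""

    def __init__(self, num_gprs=32):
        self.stale = [False] * num_gprs

    def try_issue(self, rd, rs, rt):
        """Issue RD <- Fn(RS, RT) if hazard-free; return False to stall."""
        if self.stale[rs] or self.stale[rt]:
            return False           # RAW: a source is still being produced
        if self.stale[rd]:
            return False           # WAW: an older write to RD is in flight
        self.stale[rd] = True      # claim the destination
        return True

    def complete(self, rd):
        """Out-of-order completion: write GPR[rd], clear its busy bit."""
        self.stale[rd] = False

sb = Scoreboard()
assert sb.try_issue(1, 2, 3)       # r1 <- f(r2, r3): issues
assert not sb.try_issue(4, 1, 5)   # RAW on r1: stall
assert not sb.try_issue(1, 6, 7)   # WAW on r1: stall
sb.complete(1)
assert sb.try_issue(4, 1, 5)       # now issues
```

Because issue is in order, WAR needs no check here: a later write cannot issue before an earlier reader has already read its operands at issue time.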
Out-of-Order Issue

[Figure: an in-order instruction stream feeds a dependency-resolution (DR) stage with buffers in front of the functional units INT, Fadd1–Fadd2, Fmul1–Fmul3, Ld/St; execution proceeds out of program order, completion is out of order.]

Need an extra stage/buffers for Dependency Resolution
OOO Scoreboarding
• Similar to in-order scoreboarding
  – Need new tables to track the status of individual instructions and functional units
  – Still enforce dependencies
    • Stall dispatch on WAW
    • Stall issue on RAW
    • Stall completion on WAR
• Limitations of Scoreboarding?
• Hints
  – No structural hazards
  – Can always write a RAW-free code sequence
      Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …
  – Think about the x86 ISA with only 8 registers

The finite number of registers in any ISA will force you to reuse register names at some point → WAR, WAW stalls
Lessons thus Far
• More out-of-orderness → more ILP exposed
  – But more hazards
• Stalling is a generic technique to ensure sequencing
• RAW stall is a fundamental requirement (?)
• Compiler analysis and scheduling can help (not covered in this course)
Ex. Tomasulo's Algorithm [IBM 360/91, 1967]

[Figure: the IBM 360/91 floating-point unit. Instructions arrive from the storage bus / instruction unit into the Floating Operand Stack (FLOS); a decoder dispatches them to reservation stations (each entry holding Sink Tag | Tag | Source | Ctrl fields) in front of the Adder and the Multiply/Divide unit. Floating Point Buffers (FLB 1–6), Floating Point Registers (FLR 0, 2, 4, 8, with busy bits and tags), and Store Data Buffers (SDB 1–3) connect over the FLB bus, FLR bus, and the Common Data Bus (CDB), which broadcasts each result tagged with its producing reservation station.]
FYI: Historical Note
• Tomasulo's algorithm (1967) was not the first
• Also at IBM, Lynn Conway proposed multi-issue dynamic instruction scheduling (OOO) in Feb 1966
  – Ideas got buried due to internal politics, changing project goals, etc.
  – But it's still the first (as far as I know)
Modern Enhancements to Tomasulo's Algorithm

                    Tomasulo                Modern
Machine Width       Peak IPC = 1            Peak IPC = 6+
Structural Deps     2 FP FUs                6–10+ FUs
                    Single CDB              Many forwarding buses
Anti-Deps           Operand copying         Renamed registers
Output-Deps         RS Tag                  Renamed registers
True Deps           Tag-based forwarding    Tag-based forwarding
Exceptions          Imprecise               Precise (requires ROB)