Upload
gusty
View
47
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Out-of-Order Execution Structures. Based on: Complexity-Effective Superscalar Processors S. Palacharla, N. Jouppi and J. E. Smith, ISCA 97. MIPS R10000-Like Design . Fetch: Read instructions from I-Cache Predict Branches Pass on to Decode phase. Fetch Phase. Decode: Parse instruction - PowerPoint PPT Presentation
Citation preview
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Out-of-Order Execution Structures
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
MIPS R10000-Like Design
• Based on:– Complexity-Effective Superscalar Processors– S. Palacharla, N. Jouppi and J. E. Smith, ISCA 97
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Fetch Phase
• Fetch:– Read instructions from I-Cache– Predict Branches– Pass on to Decode phase
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Decode Phase
• Decode:– Parse instruction– Shuffle opcode parts to appropriate ports for
rename
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Renaming Phase
• Rename:– Map Architectural registers to Physical– Eliminate False Dependences– Passes renamed instructions to scheduler
• Called Dispatch
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Scheduling Phase
• Wakeup:– Instructions check whether they become ready– From Writeback: physical register names
• Select:– Amongst the ready select those to execute– Structural hazards
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Register File Read Phase
• Read source operands
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Bypass and Execute Phase
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Data Cache Access Phase
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Writeback Phase
• Write result to register file• Broadcast tag in order to wakeup waiting
instructions– Notice that the tag broadcast should happen TWO
cycles in advance of the result production
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Reservation Station Model
• Used by Pentium Pro, PowerPC 604• Re-order buffer holds values• Renaming points to re-order buffer entries
– Tomasulo-like
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Physical Register File vs. Reservation Station• Physical Register File
– Values reside in the register file– At writeback instructions broadcast the
register name• Reservation Stations:
– Values reside:– In the register file upon commit
• Non-speculative– In reservation stations prior to commit
• Speculative
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Quantifying Complexity• Critical Path Delay as a function of
architectural parameters– Instruction Window size (WinSize)– Issue Width (IW)
• Full-custom Implementations– Study the critical path– Delay model– Extrapolate how it will scale with “future”
technologies
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Renaming• Inputs:
– IW instructions– Up to 2 x Input register names– Up to 1 x Output register name
• Outputs:– 2 x input physical registers– 1 x new output physical register– 1 x previous physical register name for
checkpointing – Updated rename table
• Superscalar Issue complicates things a bit
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Renaming One Instruction
s1s2
d
RAT
p0
p31
s1s2
old
d
new reg from free listWrite port
Read port
Read port
Read port
1
1
2
1
For mispeculation recovery
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Renaming Two Instructions
RAT
s1 s2 d new d s1 s2 d new d
?
??
ps1 ps2 Old d new d ps1 ps2 Old dnew d
Cross BundleDependency Check Logic
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Renaming More Instructions• Dependency Checking logic for
instruction i must match against all preceding destinations
• If there are multiple matches it must enforce priority:– Pick the one closest to this instruction
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
RAT: SRAM Implementation
decoder SRAM cellbitlines
Sense amp
Arch reg
Phys reg
#ARCH REGS
lg(#PHYS REGS)
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
SRAM RAT cell
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
RAT: CAM Implementation
encoder
CAM cellArch reg
Phys reg#PHYS REGS
lg(#ARCH REGS)
Active bit
• One CAM per physical register• Active bit indicates the current map• New version by setting active bit
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
CAM Cell
Match
Wordline
Bitline
Bitline_B
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
SRAM vs. CAM• SRAM:
– Arch reg rows– Lg(phy reg) cols– SRAM read/write
• CAM:– Phy reg rows– Lg(arch reg) cols– CAM match– Update:
• Reset previous valid bit• Set current valid bit
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Scheduler: Part #1 - Wakeup
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Tree of Arbiters
REQ Signals
GRANT Signals
Anyreq raised if any req is active, Grant
Issued if arbiter enabled
Root enabled if
FU available
Scheduler: Part #2 - Select
For a Single FU
Location based select policy
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Select for more than one FUs• Handling Multiple FUs of Same Type:
– Stack Select logic blocks in series - hierarchy
– Mask the Request granted to previous unit
• NOT Feasible for More than 2 FUs• Alternative:
– statically partition issue window among FUs – MIPS R10000, HP PA 8000
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Datapath and BypassCommonly Used
Layout:
1 Bit-Slice
Turn on Tri-State A
to pass result of
FU1 to left operand of
FU0
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Complexity Analysis• Critical path delay as a function of:
– Issue Width – Window Size
• Register Renaming Table
• Wakeup and Select
• Bypass paths
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Methodology• A representative CMOS design is selected
from published alternatives
• Implemented the circuits for 3 technologies:– 0.8micron, 0.35micron and 0.18 micron
• Optimize for speed
• Wire parasitics in delay model– Rmetal, Cmetal
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Methodology• Feature size scaling: 1 / S• Voltage scaling: 1 / U
• Logic Delay = (CLx V) / I• Capac. Load: CL= 1 1 / S• Supply Voltage: V = 1 1 / U• Average charge/discharge current: I = 1
1 / U
• So, Logic Delay = (1 / S x 1 / U ) / (1 / U) = 1 / S
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Wire Delay• L: wire length• Intrinsic RC delay
• Rmetal: resistance per unit length
• Cmetal: capacitance per unit length
• 0.5: 1st order approximation of distributed RC model – uniformly distributed R & C
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Wire Delay Scaling• Metal Thickness doesn’t scale much
– Width ~ 1/S– Rmetal ~ S
• Fringe Capacitance dominates in smaller feature sizes – edges to parallel wires and the substrate
• Parallel plate – scales with 1 / S– Cmetal ~ S
• Length scales with 1/S• Overall Scale factor: S x S x (1/S)2 = 1
• Wire delay remains constant
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Register Renaming Table
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Dependency Checking Logic• Accessed in Parallel with Map Table• Every Logical Reg compared against
logical dest regs of current rename group• For IW=2,4,8, delay less than map table
r1
r4
r4
r4
r4
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Renaming Delay • SRAM scheme• Delay Components:
– Time to decode the arch reg index– Time to drive wordline– Time to pull down bit line– Time for SenseAmp to detect pull-down– MUX time ignored as control from dep.
Check logic comes in advance
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Renaming Circuit
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Decoder Delay
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Decoder Delay• Predecoding for speed• Length of predecode lines:
– Cellheight: Height of single cell excluding wordlines
– Wordline spacing• NVREG: # of virtual reg-s• x3: 3-operand instr-s
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Decoder Delay
• Tnand fall delay of NAND• Tnor rise delay of NOR
• Rnandpd NAND pull-down channel resistance + Predecode line metal resistance
• Ceq diff-n Cap. of NAND + gate Cap. of NOR + interconnect Cap.
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Decoder Delay• Substitute• Predecode line length, Req and Ceq we
get:
• c2: intrinsic RC delay of predecode line• c2 very small • Decoder delay ~linearly dependent on
IW
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Rename Delay• Wordline
• c2: intrinsic RC delay of wordline• c2 very small • Wordline delay ~linearly dependent on
IW
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Rename Delay• Bitline:
• c2 very small • Bitline delay ~linearly dependent on IW
• SenseAmp delay ~linearly dependent on IW
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Rename Logic Delay Scaling
• Feature size - [increase in bitline&wordline delay with increasing IW]
• 0.8um: IW 2 8 Bitline delay + 37%• 0.18um: IW 28 Bitline delay + 53%
• Total delay increases linearly with IW
• Each Component shows linear increase with IW
• Bitline Delay > Wordline Delay
• Bitline length ~ # of Logical reg-s
• Wordline length ~ width of physical reg designator
IW impact on delay worsenswith decreasing featuresize
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Wakeup Delay• Critical Path: Mismatch Pull ready signal low• Delay Components:
– Tag drivers drive tag lines - vertical– Mismatched bit: pull down stack pull matchline low –
horizontal– Final OR gate or all the matchlines of an operand
tag
• Ttagdrive ~ Driver Pullup R & Tagline length & Tagline Load C
• Quadratic component significant for IW>2 & 0.18um
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Wakeup Delay• Quadratic component Small for both
cases• Both delays ~linearly dependent on IW
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Wakeup Delay: IW and Window Size• 0.18um Process• Quadratic
dependence• Issue width has
greater effect increase all 3 delay components
• As IW & WinSize + together delay actually changes like: THIS
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Wakeup Delay: Window Size
• 8 way & 0.18 Process• Tag drive delay increases rapidly with WinSize +• Match OR delay constant
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Wakeup Delay: Feature size
• 8 way & 64 entry window• Tag drive and Tag match delays do not scale as well as MatchOR
delay • Match OR logic delay• Others also have wire delays
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Selection Logic and Bypass Delay• Selection
– Logarithmically dependent on WinSize
• Bypass: Delay dependent on (IW)2