Complexity-Effective Superscalar Processors
S. Palacharla, N. P. Jouppi, and J. E. Smith
Presented by: Jason Zebchuk
Brainiac vs. Speed Demon
• Brainiac:
• Maximize # of instructions per cycle
• More complex, slower clock
• Speed Demon:
• Maximize clock frequency
• Faster clock, fewer instructions per cycle
Complexity-Effective
• Issue Width vs. Clock Cycle
• Goal is to balance both
• Best design is Complexity-Effective
• Allow complex issue scheme AND fast clock cycle
What is Complexity?
• Delay of Critical Path through a piece of logic.
Creating a Complexity-Effective Architecture
• Analyze complexity of each pipeline stage
• Select component with most complexity
• Propose less-complex alternative that achieves similar performance
Agenda
• OoO Superscalar Pipeline Overview
• Register Renaming
• Wakeup & Select
• Bypass
• Complexity-Effective Optimizations
• Dependence-based Scheduler
• Clustering
Pipeline Overview
[Pipeline diagram: Fetch, Decode, Rename, Issue Window (Wakeup, Select), Register File, Bypass, Data Cache]
• Fetch:
• Read Instructions from I-Cache
• Predict Branches
• Decode:
• Parse instruction
• Shuffle opcode parts to appropriate ports for rename
• Rename:
• Map architectural registers to physical
• Eliminate false dependences
• Dispatch renamed instructions to scheduler
• Wakeup:
• Instructions check whether they become ready
• Compare register names from Writeback stage
• Select:
• Choose from amongst ready instructions
• Register File Read:
• Read source operands
• Bypass and Execute:
• Execute instructions in functional units
• Bypass results from outputs to inputs
• Data Cache Access:
• Load & Store to data cache
• Writeback:
• Write result to register file
• Broadcast tag to wake up waiting instructions
• Broadcast happens two cycles before results are produced
Alternate Pipeline
[Pipeline diagram: Fetch, Decode, Rename, Reservation Stations (Wakeup, Select), Reorder Buffer, Register File, Bypass, Data Cache]
• Used by Pentium Pro, PowerPC
• Re-order buffer (ROB) holds values
• Tomasulo-like renaming (point to ROB entries)
Sources of Complexity
• Complexity scales with issue width and window size.
What about...
• Fetch, Decode, Register File, Cache?
• Previously studied
• Register File, Caches
• Easy to scale?
• Fetch, Decode, Functional Units
Register Renaming
• Eliminate WAR and WAW hazards
Architectural (Logical) Registers:
    Ld r1, [r2]
    Add r3, r1, r2
    Sub r1, r2, r4
    Ld r2, [r1]
Physical Registers:
    Ld p5, [p7]
    Add p3, p5, p7
    Sub p6, p7, p4
    Ld p9, [p6]
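The renaming step above can be sketched in a few lines of Python. This is only an illustration, not the hardware described in the paper: the function `rename`, the tuple encoding of instructions, and the initial table and free-list contents are all assumptions, chosen so that the output reproduces the renamed sequence on the slide.

```python
def rename(instructions, rat, free_list):
    """Rename (op, dest, src...) tuples using a register alias table (RAT)."""
    renamed = []
    for op, dest, *srcs in instructions:
        # Sources read the current mapping, so true (RAW) dependences survive.
        phys_srcs = [rat[s] for s in srcs]
        # Every destination gets a fresh physical register, which removes
        # WAR and WAW hazards on the architectural register name.
        phys_dest = free_list.pop(0)
        rat[dest] = phys_dest
        renamed.append((op, phys_dest, *phys_srcs))
    return renamed


# Assumed initial mapping and free-list order (not given on the slide).
rat = {"r1": "p1", "r2": "p7", "r3": "p2", "r4": "p4"}
free_list = ["p5", "p3", "p6", "p9"]
program = [
    ("Ld", "r1", "r2"),
    ("Add", "r3", "r1", "r2"),
    ("Sub", "r1", "r2", "r4"),
    ("Ld", "r2", "r1"),
]
renamed = rename(program, rat, free_list)
for inst in renamed:
    print(inst)
```

The two writes to r1 land in different physical registers (p5 and p6), so the Sub no longer has to wait for the Add to read the old value.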
Register Rename Logic
[Block diagram: logical source and destination registers feed the Register Alias Table (SRAM array) and the Dependence Check Logic; a MUX selects the physical source and destination registers. Annotation: scales with issue width.]
Register Alias Table
[Circuit diagram: logical register name in, physical register name out; decoder, wordlines, bitlines, sense amps]
• $T_{rename} = T_{decode} + T_{wordline} + T_{bitline} + T_{senseamp}$
Rename Delay
Scales Linearly with Issue Width
Instruction Wakeup
• Broadcast newly available operands
• Update which operands are ready for each instruction
• Mark instruction ready when both left and right operands are ready
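A behavioral sketch of the three steps above, assuming each window entry holds one tag per operand and that `None` marks an operand that was already available at dispatch. The class `WindowEntry` and the function `wakeup` are hypothetical names for illustration; real wakeup logic is a CAM-style comparator array, not a loop.

```python
class WindowEntry:
    def __init__(self, name, left_tag, right_tag):
        self.name = name
        self.tags = [left_tag, right_tag]
        # An operand with no pending tag is ready from the start.
        self.ready = [left_tag is None, right_tag is None]

    def is_ready(self):
        # Ready only when both left and right operands are ready.
        return all(self.ready)


def wakeup(window, result_tags):
    """Broadcast result tags and return the names of ready instructions."""
    for entry in window:
        for i, tag in enumerate(entry.tags):
            if tag in result_tags:  # tag match against the broadcast
                entry.ready[i] = True
    return [entry.name for entry in window if entry.is_ready()]


window = [
    WindowEntry("add", "p5", "p7"),
    WindowEntry("sub", "p7", None),
    WindowEntry("ld", "p6", None),
]
first = wakeup(window, {"p7"})   # only "sub" has all operands
second = wakeup(window, {"p5"})  # "add" now becomes ready too
```

Every entry compares its tags against every broadcast tag, which is why the cost of this step grows with both window size and issue width.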
Wakeup Logic
• Delay = $T_{tagdrive} + T_{tagmatch} + T_{matchOR}$
[Circuit diagram: tag drive, tag match, match OR]
Wakeup Delay
Delay increases quadratically with window size
Wakeup Delay Breakdown
Wire delay dominates at smaller feature sizes!
Select
• Consider all instructions that are ready
• Select one ready instruction for each functional unit
• Uses some selection policy
• e.g., Oldest First
• Little effect on performance
Selection Logic
• Delay = $c_0 + c_1 \times \log_4(\mathrm{WINSIZE})$
[Circuit diagram: request signals in, grant signals out]
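One way to read the log4 term is as the depth of a tree built from four-input arbiter cells; that reading is an assumption here, as are the function names and the constants `c0` and `c1`, which the slide does not give. The sketch rounds the depth up to whole levels, whereas the formula on the slide uses the logarithm directly.

```python
def arbiter_levels(window_size, fan_in=4):
    """Number of tree levels needed to cover window_size requests."""
    levels = 0
    capacity = 1
    while capacity < window_size:
        capacity *= fan_in  # each extra level multiplies the reach by fan_in
        levels += 1
    return levels


def selection_delay(window_size, c0, c1):
    """Delay model: fixed cost c0 plus c1 per level of the tree."""
    return c0 + c1 * arbiter_levels(window_size)


for size in (16, 32, 64, 128):
    print(size, arbiter_levels(size))
```

Under this reading, doubling the window from 32 to 64 entries adds no level, while going from 64 to 128 adds one, so selection delay grows slowly with window size.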
Selection Delay
Only Logic delay, wire delay ignored
Data Bypass Network
• Forward results from completing instructions to dependent instructions
• # of paths depends on pipeline depth and issue width
• # of paths = $2 \times IW^2 \times S$
• IW = issue width, S = number of pipe stages after the result is produced
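The path count can be evaluated directly; the helper name `bypass_paths` is hypothetical and the stage counts below are example inputs, not figures from the paper.

```python
def bypass_paths(issue_width, stages_after_result):
    """Number of bypass paths: 2 * IW^2 * S."""
    return 2 * issue_width ** 2 * stages_after_result


four_wide = bypass_paths(4, 1)
eight_wide = bypass_paths(8, 1)
print(four_wide, eight_wide)
```

Doubling the issue width quadruples the number of paths, while adding a pipe stage only scales it linearly.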
Data Bypass Logic
• FUs broadcast results (potentially multiple subsequent results)
• Regfile reads current operand values
• MUX selects correct source
Bypass Delay
• $T_{bypass} = 0.5 \times R_{metal} \times C_{metal} \times L^2$
• L = length of result wires
• Delay independent of feature size
• Dependent on specific layout

Issue Width    Wire Length (λ)    Delay (ps)
4              20500              184.9
8              49000              1056.4
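As a quick check of the quadratic dependence on wire length, the 8-wide delay can be predicted from the 4-wide row of the table. The helper `scale_bypass_delay` is a hypothetical name, and the sketch assumes the resistance and capacitance per unit length are the same for both layouts.

```python
def scale_bypass_delay(delay_ps, old_length, new_length):
    """Scale a wire delay assuming it is proportional to length squared."""
    return delay_ps * (new_length / old_length) ** 2


# Start from the 4-wide row (20500 lambda, 184.9 ps) and stretch to 49000 lambda.
predicted = scale_bypass_delay(184.9, 20500, 49000)
print(round(predicted, 1))
```

The prediction agrees with the 1056.4 ps listed for the 8-wide design, so the two table rows are consistent with the L² model.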
Putting it All Together

Feature Size    Issue Width    Window Size    Rename Delay (ps)    Wakeup+Select Delay (ps)    Bypass Delay (ps)
0.8μm           4              32             1577.9               2903.7                      184.9
0.8μm           8              64             1710.5               3369.4                      1056.4
0.35μm          4              32             627.2                1248.4                      184.9
0.35μm          8              64             726.6                1484.8                      1056.4
0.18μm          4              32             351.0                578.0                       184.9
0.18μm          8              64             427.9                724.0                       1056.4
Key Sources of Complexity
• Wakeup+Select Logic:
• Limiting stage for 5 out of 6 designs
• Considers all instructions simultaneously
• Bypass Paths
• Wire dominated delay
• Lots of long wires
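The "5 out of 6" figure can be checked against the delay table from the Putting it All Together slide by taking the slowest stage of each design. The dictionary layout and the name `limiting_stage` are assumptions for illustration; the numbers are copied from the table.

```python
# (feature size, issue width) -> stage delays in ps, copied from the table.
delays_ps = {
    ("0.8um", 4): {"rename": 1577.9, "wakeup+select": 2903.7, "bypass": 184.9},
    ("0.8um", 8): {"rename": 1710.5, "wakeup+select": 3369.4, "bypass": 1056.4},
    ("0.35um", 4): {"rename": 627.2, "wakeup+select": 1248.4, "bypass": 184.9},
    ("0.35um", 8): {"rename": 726.6, "wakeup+select": 1484.8, "bypass": 1056.4},
    ("0.18um", 4): {"rename": 351.0, "wakeup+select": 578.0, "bypass": 184.9},
    ("0.18um", 8): {"rename": 427.9, "wakeup+select": 724.0, "bypass": 1056.4},
}


def limiting_stage(stage_delays):
    """Return the stage with the longest delay, i.e. the critical one."""
    return max(stage_delays, key=stage_delays.get)


limiting = {design: limiting_stage(d) for design, d in delays_ps.items()}
wakeup_limited = sum(1 for s in limiting.values() if s == "wakeup+select")
print(wakeup_limited, limiting[("0.18um", 8)])
```

The one exception is the 8-wide design at 0.18μm, where the bypass delay does not shrink with feature size and overtakes Wakeup+Select.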
Dependence-based Microarchitecture
[Pipeline diagram: Fetch, Decode, Rename, Steer, FIFOs (Wakeup, Select), Register File, Bypass, Data Cache]
• Replace Issue Window with a few small FIFO queues
• Only schedule from head of FIFO
• Adds new Steering stage
Instruction Steering Heuristic
• 3 possible cases for instruction I:
1. All operands ready & in register file
  • I steered to empty FIFO
2. I requires operand produced by I_source in FIFO F_a
  1. If no instruction behind I_source, I steered to F_a
  2. Otherwise, steered to empty FIFO
3. Two operands produced by I_left and I_right
  • Apply rule 2 to I_left
  • If steered to empty FIFO, apply rule 2 to I_right
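The three cases can be sketched as a single function. This is a simplified model and not the paper's implementation: FIFOs are plain Python lists that never drain or fill up, the function name `steer` is hypothetical, and returning `None` when no FIFO qualifies (so that the caller would stall) is an assumption the slide does not state. The trace below uses instructions 0 to 5 of the steering example, with producers read off that listing and four FIFOs assumed.

```python
def steer(fifos, producers):
    """Pick a FIFO index for an instruction.

    fifos: list of lists of instruction ids, with the tail at the end.
    producers: ids of the instructions producing the left and right
        operands that are still waiting in a FIFO (empty if every
        operand is already in the register file).
    """
    # Cases 2 and 3: try the left producer first, then the right one.
    for producer in producers:
        for index, fifo in enumerate(fifos):
            # Only the tail of a FIFO has no instruction behind it.
            if fifo and fifo[-1] == producer:
                return index
    # Case 1, and the fallback for cases 2 and 3: take an empty FIFO.
    for index, fifo in enumerate(fifos):
        if not fifo:
            return index
    return None  # assumption: nothing suitable, the caller would stall


fifos = [[] for _ in range(4)]
# (instruction id, producers still in flight)
trace = [(0, []), (1, []), (2, [0, 1]), (3, []), (4, [0]), (5, [4])]
for inst, producers in trace:
    fifos[steer(fifos, producers)].append(inst)
print(fifos)
```

Instruction 4 depends on instruction 0, but instruction 2 already sits behind 0, so it starts a new chain in an empty FIFO; instruction 5 then follows 4 into the same FIFO.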
Steering Example
 0: addu r18, r0, r2
 1: addiu r2, r0, -1
 2: beq r18, r2, L2
 3: lw r4, -32768(r28)
 4: sllv r2, r18, r20
 5: xor r16, r2, r19
 6: lw r3, -32676(r28)
 7: sll r2, r16, 0x2
 8: addu r2, r2, r23
 9: lw r2, 0(r2)
10: sllv r4, r18, r4
11: addu r17, r4, r19
12: addiu r3, r3, 1
13: sw r3, -32676(r28)
14: beq r2, r17, L3
[Diagram: FIFO contents as instructions 0 to 14 are steered]
Performance Comparison
• Cycle count of the dependence-based design is within 8% of the issue-window design for all benchmarks
But . . .
• $T_{Execution} = \dfrac{\text{\# Instructions} \times \text{Clock Period}}{\text{Instructions per Cycle}}$
• Dependence-based microarchitecture is less complex
• Allows for faster clock
• Overall, dependence-based has better performance
Improving Bypass Paths
• One-cycle bypass within Cluster
• Two-cycle bypass between Clusters
Performance of Clustering
• IPC up to 12% lower
• Clock is 25% faster
• Overall, 10% - 22% faster
• 16% faster on average
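These figures follow from execution time = # instructions × clock period / IPC. For a fixed instruction count, the speedup is the IPC ratio times the clock-frequency ratio. The sketch below assumes "25% faster" means a 1.25× clock frequency and combines it with the worst-case 12% IPC loss; the helper name `speedup` is hypothetical.

```python
def speedup(ipc_ratio, clock_ratio):
    """Old time / new time when the instruction count is unchanged."""
    return ipc_ratio * clock_ratio


# Worst case on the slide: IPC 12% lower, clock frequency 25% higher.
worst_case = speedup(1 - 0.12, 1.25)
print(round(worst_case, 2))
```

The result is about 1.10, which matches the low end of the 10% - 22% range quoted above.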
Other Alternatives
1. 1 Window + 1 Cluster
2. 1 Window + 2 Clusters
3. 2 Windows + 2 Clusters
  i. Issue Window + Steering heuristic
  ii. Issue Window + Random steering
  iii. FIFOs + Steering heuristic
Performance of Clustering
Steering has largest impact on performance
Conclusions
• Wakeup+Select and Bypass paths likely to be limiting pipeline stages
• Atomicity makes these stages critical
• Consider more complexity-effective designs for these stages
• Sacrifice some IPC for higher clock frequency