View
3
Download
0
Category
Preview:
Citation preview
EECS 470Lecture 5 Slide 1
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
EECS 470Lecture 5Scoreboard SchedulingFall 2019Prof. Ronald Dreslinskihttp://www.eecs.umich.edu/courses/eecs470
Many thanks to Prof. Martin and Roth of University of Pennsylvania for most of these slides. Portions developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin.
EECS 470Lecture 5 Slide 2
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Readings
For Today:
• H & P Chapter C.5-C.7, 3.1-3.3, 3.10
For Monday:
• H & P Chapter 3.4-3.6• D. Sima “Design Space of Register Renaming Techniques”
• Paper is linked from the Readings page
Lecture 4 Slide 3EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Load Delay Slot (MIPS R2000)
F D E M WF D E M W
F D E M
t0 t1 t2 t3 t4 t5
Wj:k:
h: Rk ¬ --……
i: Rk ¬ MEM[ - ]j: -- ¬ Rkk: -- ¬ Rk
Which (Rk) do we really mean?
- The effect of a “delayed” Load is not visible to the instructions in its delay slots.
i:
Lecture 4 Slide 4EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Control Hazards
beq 1 1 10sub 3 4 5
F D E M WF D E M W
t0 t1 t2 t3 t4 t5beqsub squash
Lecture 4 Slide 5EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Handling Control Hazards
Avoidance (static)r No branches?r Convert branches to predication
m Control dependence becomes data dependence
Detect and Stall (dynamic)r Stop fetch until branch resolves
Speculate and squash (dynamic)r Keep going past branch, throw away instructions if wrong
Lecture 4 Slide 6EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Avoidance: if-conversion
if (a == b) {x++;y = n / d;
}
sub t1 ¬ a, bjnz t1, PC+2add x ¬ x, #1div y ¬ n, d
sub t1 ¬ a, badd(t1) x ¬ x, #1div(t1) y ¬ n, d
sub t1 ¬ a, badd t2 ¬ x, #1div t3 ¬ n, dcmov(t1) x ¬ t2cmov(t1) y ¬ t3
Lecture 4 Slide 7EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Handling Control Hazards:Detect & Stall
Detectionr In decode, check if opcode is branch or jump
Stallr Hold next instruction in Fetchr Pass noop to Decode
Lecture 4 Slide 8EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Problems with Detect & Stall
CPI increases on every branch
Are these stalls necessary? Not always!r Branch is only taken half the timer Assume branch is NOT taken
m Keep fetching, treat branch as noopm If wrong, make sure bad instructions don’t complete
Lecture 4 Slide 9EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Handling Control Hazards:Speculate & Squash
Speculater Assume branch is not taken
Squashr Overwrite opcodes in Fetch, Decode, Execute with noopr Pass target to Fetch
10
© Wenisch et al. 2016
PC REGfile
MUXA
LU
MUX
1
Datamemory
++
MUX
IF/ID
ID/EX
EX/Mem
Mem/WB
signext
Control
equal
MUX
beqsubaddnand
add
sub
beq
beq
Instmem
noop
noop
noop
Lecture 4 Slide 11EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Problems with Speculate & Squash
Always assumes branch is not takenCan we do better? Yes.
r Predict branch direction and target!r Why possible? Program behavior repeats.
More on branch prediction to come...
Lecture 4 Slide 12EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Branch Delay Slot (MIPS, SPARC)
F D E M WF
F D E M
t0 t1 t2 t3 t4 t5
Wnext:
target:
i: beq 1, 2, tgtj: add 3, 4, 5 What can we put here?
branch:
F D E M WF D E M W
F D E M Wdelay:
target:
branch:
Squash
- Instruction in delay slot executes even on taken branch
Lecture 4 Slide 13EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Pipeline Hazard ChecklistMemory Data Dependences
r Output Dependence (WAW)r Anti Dependence (WAR)r True Data Dependence (RAW)
Register Data Dependencesr Output Dependence (WAW)r Anti Dependence (WAR)r True Data Dependence (RAW)
Control Dependences
EECS 470Lecture 5 Slide 14
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Instruction Level Parallelism
EECS 470Lecture 5 Slide 15
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Limitations of Scalar PipelinesUpper Bound on Scalar Pipeline Throughput
Limited by IPC=1
Inefficient Unification Into Single PipelineLong latency for each instruction
Performance Lost Due to Rigid In-order PipelineUnnecessary stalls
EECS 470Lecture 5 Slide 16
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, VijaykumarArchitectures for
Instruction-Level Parallelism
EECS 470Lecture 5 Slide 17
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Superscalar Machine
EECS 470Lecture 5 Slide 18
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
What is the real problem?CPI of in-order pipelines degrades very sharply if the machine
parallelism is increased beyond a certain point, i.e., when NxMapproaches average distance between dependent instructions
Forwarding is no longer effectivePipeline may never be full due to frequent dependency stalls!
EECS 470Lecture 5 Slide 19
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
ILP: Instruction-Level Parallelism
ILP is a measure of the amount of inter-dependencies between instructions
Average ILP = no. instruction / no. cyc requiredcode1: ILP = 1
i.e. must execute serially
code2: ILP = 3i.e. can execute at the same time
code1: r1 ¬ r2 + 1r3 ¬ r1 / 17r4 ¬ r0 - r3
code2: r1 ¬ r2 + 1r3 ¬ r9 / 17r4 ¬ r0 - r10
EECS 470Lecture 5 Slide 20
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Purported Limits on ILP
Weiss and Smith [1984] 1.58Sohi and Vajapeyam [1987] 1.81Tjaden and Flynn [1970] 1.86Tjaden and Flynn [1973] 1.96Uht [1986] 2.00Smith et al. [1989] 2.00Jouppi and Wall [1988] 2.40Johnson [1991] 2.50Acosta et al. [1986] 2.79Wedig [1982] 3.00Butler et al. [1991] 5.8Melvin and Patt [1991] 6Wall [1991] 7Kuck et al. [1972] 8Riseman and Foster [1972] 51Nicolau and Fisher [1984] 90
EECS 470Lecture 5 Slide 21
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
The Problem With In-Order Pipelines
What’s happening in cycle 4?• mulf stalls due to RAW hazard
• OK, this is a fundamental problem• subf stalls due to pipeline hazard
• Why? subf can’t proceed into D because mulf is there• That is the only reason, and it isn’t a fundamental one
Why can’t subf go into D in cycle 4 and E+ in cycle 5?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16addf f0,f1,f2 F D E+ E+ E+ Wmulf f2,f3,f2 F D d* d* E* E* E* E* E* Wsubf f0,f1,f4 F p* p* D E+ E+ E+ W
Lecture 4 Slide 22EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, VijaykumarNecessary Conditions for
Data Hazards
i:rk¬_
j:rk¬_ Reg Write
Reg Write i:_¬rk
j:rk¬_ Reg Write
Reg Read i:rk¬_
j:_¬rk Reg Read
Reg Write
stage X
stage Y
dist(i,j) £ dist(X,Y) Þ ??dist(i,j) > dist(X,Y) Þ ??
WAW Hazard WAR Hazard RAW Hazard
dist(i,j) £ dist(X,Y) Þ Hazard!!dist(i,j) > dist(X,Y) Þ Safe
Haz
ard
Dis
tanc
e
EECS 470Lecture 5 Slide 23
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
regfile
D$I$BP
insn buffer
SD
add p2,p3,p4sub p2,p4,p5mul p2,p5,p6div p4,4,p7
Ready TableP2 P3 P4 P5 P6 P7Yes YesYes Yes Yes YesYes Yes Yes Yes YesYes Yes Yes Yes Yes Yes
div p4,4,p7mul p2,p5,p6sub p2,p4,p5add p2,p3,p4
and
Dynamic Scheduling: The Big Picture
• Instructions fetch/decoded/renamed into Instruction Buffer• Also called “instruction window” or “instruction scheduler”
• Instructions (conceptually) check ready bits every cycle• Execute when ready
EECS 470Lecture 5 Slide 24
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Register Renaming• Anti (WAR ) and output (WAW ) dependencies are false
• The dependence is on name/location rather than data• Given infinite registers, WAR/WAW can always be eliminated• Renaming removes these dependencies, but leaves RAW intact
• Example• Names: r1,r2,r3• Locations: p1,p2,p3,p4,p5,p6,p7• Original mapping: r1®p1, r2®p2, r3®p3, p4–p7 are “free”
MapTable FreeList Orig. insns Renamed insnsr1 r2 r3p1 p2 p3 p4,p5,p6,p7 add r2,r3,r1 add p2,p3,p4p4 p2 p3 p5,p6,p7 sub r2,r1,r3 sub p2,p4,p5p4 p2 p5 p6,p7 mul r2,r3,r3 mul p2,p5,p6p4 p2 p6 p7 div r1,4,r1 div p4,4,p7
EECS 470Lecture 5 Slide 25
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Dynamic Scheduling – OoO Execution• Dynamic scheduling
• Totally in the hardware• Also called “out-of-order execution” (OoO)
• Fetch many instructions into instruction window• Use branch prediction to speculate past (multiple) branches• Flush pipeline on branch misprediction
• Rename to avoid false dependencies (WAW and WAR)• Execute instructions as soon as possible
• Register dependencies are known• Handling memory dependencies more tricky (much more later)
• Commit instructions in order• Anything strange happens before commit, just flush the pipeline
• Current machines: 100+ instruction scheduling window
EECS 470Lecture 5 Slide 26
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Motivation for Dynamic Scheduling• Dynamic scheduling (out-of-order execution)
• Execute insns in non-sequential (non-VonNeumann) order…+ Reduce RAW stalls+ Increase pipeline and functional unit (FU) utilization
• Original motivation was to increase FP unit utilization+ Expose more opportunities for parallel issue (ILP)
• Not in-order ® can be in parallel• …but make it appear like sequential execution
• Important– But difficult• Next few lectures
EECS 470Lecture 5 Slide 27
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Going Forward: What’s Next• We’ll build this up in steps over the next few weeks
• “Scoreboarding” - first OoO, no register renaming• “Tomasulo’s algorithm” - adds register renaming• Handling precise state and speculation
• P6-style execution (Intel Pentium Pro)• R10k-style execution (MIPS R10k)
• Handling memory dependencies
• Let’s get started!
EECS 470Lecture 5 Slide 28
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
New Pipeline Terminology
• In-order pipeline• Often written as F,D,X,W (multi-cycle X includes M)• Example pipeline: 1-cycle int (including mem), 3-cycle pipelined FP
regfile
D$I$BP
EECS 470Lecture 5 Slide 29
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
New Pipeline DiagramInsn D X Wldf X(r1),f1 c1 c2 c3mulf f0,f1,f2 c3 c4+ c7stf f2,Z(r1) c7 c8 c9addi r1,4,r1 c8 c9 c10ldf X(r1),f1 c10 c11 c12mulf f0,f1,f2 c12 c13+ c16stf f2,Z(r1) c16 c17 c18
• Alternative pipeline diagram• Down: insns• Across: pipeline stages• In boxes: cycles• Basically: stages « cycles• Convenient for out-of-order
EECS 470Lecture 5 Slide 30
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
The Problem With In-Order Pipelines
• In-order pipeline• Structural hazard: 1 insn register (latch) per stage
• 1 insn per stage per cycle (unless pipeline is replicated)• Younger insn can’t “pass” older insn without “clobbering” it
• Out-of-order pipeline• Implement “passing” functionality by removing structural hazard
regfile
D$I$BP
EECS 470Lecture 5 Slide 31
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Instruction Buffer
• Trick: insn buffer (many names for this buffer)• Basically: a bunch of flops for holding insns
• Split D into two pieces• Accumulate decoded insns in buffer in-order• Buffer sends insns down rest of pipeline out-of-order
regfile
D$I$BP
insn buffer
D2D1
EECS 470Lecture 5 Slide 32
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Dispatch and Issue
• Dispatch (D): first part of decode• Allocate slot in insn buffer
– New kind of structural hazard (insn buffer is full)• In order: stall back-propagates to younger insns
• Issue (S): second part of decode• Send insns from insn buffer to execution units+ Out-of-order: wait doesn’t back-propagate to younger insns
regfile
D$I$
BP
insn buffer
SD
EECS 470Lecture 5 Slide 33
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Dispatch and Issue with Floating-Point
regfile
D$I$BP
insn buffer
SD
F-regfile
E/
E+
E+
E* E* E*
EECS 470Lecture 5 Slide 34
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Dynamic Scheduling Algorithms
• Look at two register scheduling algorithms• Register scheduler: scheduler based on register dependences • Scoreboard
• No register renaming ® limited scheduling flexibility• Tomasulo
• Register renaming ® more flexibility, better performance
• Big simplification in this lecture: memory scheduling• Pretend register algorithm magically knows memory dependences• A little more realism in a few lectures
EECS 470Lecture 5 Slide 35
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scheduling Algorithm I: Scoreboard• Scoreboard
• Centralized control scheme: insn status explicitly tracked• Insn buffer: Functional Unit Status Table (FUST)
• First implementation: CDC 6600 [1964]• 16 separate non-pipelined functional units (7 int, 4 FP, 5 mem) • No bypassing
• Our example: “Simple Scoreboard”• 5 FU: 1 ALU, 1 load, 1 store, 2 FP (3-cycle, pipelined)
EECS 470Lecture 5 Slide 36
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard Data Structures• FU Status Table
• FU, busy, op, R, R1, R2: destination/source register names• T: destination register tag (FU producing the value)• T1,T2: source register tags (FU producing the values)
• Register Status Table• T: tag (FU that will write this register)
• Tags interpreted as ready-bits• Tag == 0 ® Value is ready in register file• Tag != 0 ® Value is not ready, will be supplied by T
• Insn status table• S,X bits for all active insns
EECS 470Lecture 5 Slide 37
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Simple Scoreboard Data Structures
• Insn fields and status bits• Tags• Values
FU Status
R1 R2
XS Insn value
FU
T
T2T1Top========
Reg Status
Fetchedinsns
Regfile
R
T
========CAMs
EECS 470Lecture 5 Slide 38
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard Pipeline• New pipeline structure: F, D, S, X, W
• F (fetch)• Same as it ever was
• D (dispatch)• Structural or WAW hazard ? stall : allocate scoreboard entry
• S (issue)• RAW hazard ? wait : read registers, go to execute
• X (execute)• Execute operation, notify scoreboard when done
• W (writeback)• WAR hazard ? wait : write register, free scoreboard entry• W and RAW-dependent S in same cycle• W and structural-dependent D in same cycle
EECS 470Lecture 5 Slide 39
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard Dispatch (D)
• Stall for WAW or structural (Scoreboard, FU) hazards• Allocate scoreboard entry• Copy Reg Status for input registers• Set Reg Status for output register
FU Status
R1 R2
XS Insn value
FU
T
T2T1Top========
Reg Status
Fetchedinsns
Regfile
R
T
========
EECS 470Lecture 5 Slide 40
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard Issue (S)
• Wait for RAW register hazards• Read registers
FU Status
R1 R2
XS Insn value
FU
T
T2T1Top========
Reg Status
Fetchedinsns
Regfile
R
T
========
EECS 470Lecture 5 Slide 41
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Issue Policy and Issue Logic• Issue
• If multiple insns ready, which one to choose? Issue policy• Oldest first? Safe• Longest latency first? May yield better performance
• Select logic: implements issue policy• W®1 priority encoder• W: window size (number of scoreboard entries)
EECS 470Lecture 5 Slide 42
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard Execute (X)
• Execute insn
FU Status
R1 R2
XS Insn value
FU
T
T2T1Top========
Reg Status
Fetchedinsns
Regfile
R
T
========
EECS 470Lecture 5 Slide 43
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard Writeback (W)
• Wait for WAR hazard• Write value into regfile, clear Reg Status entry• Compare tag to waiting insns input tags, match ? clear input tag• Free scoreboard entry
FU Status
R1 R2
XS Insn value
FU
T
T2T1Top========
Reg Status
Fetchedinsns
Regfile
R
T
========
EECS 470Lecture 5 Slide 44
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard Data StructuresInsn StatusInsn D S X Wldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1) addi r1,4,r1ldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1)
Reg StatusReg Tf0f1f2r1
FU StatusFU busy op R R1 R2 T1 T2ALU noLD noST noFP1 noFP2 no
EECS 470Lecture 5 Slide 45
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard: Cycle 1Insn StatusInsn D S X Wldf X(r1),f1 c1mulf f0,f1,f2stf f2,Z(r1) addi r1,4,r1ldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1)
Reg StatusReg Tf0f1 LDf2r1
FU StatusFU busy op R R1 R2 T1 T2ALU noLD yes ldf f1 - r1 - -ST noFP1 noFP2 no
allocate
EECS 470Lecture 5 Slide 46
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard: Cycle 2Insn StatusInsn D S X Wldf X(r1),f1 c1 c2mulf f0,f1,f2 c2stf f2,Z(r1) addi r1,4,r1ldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1)
Reg StatusReg Tf0f1 LDf2 FP1r1
FU StatusFU busy op R R1 R2 T1 T2ALU noLD yes ldf f1 - r1 - -ST noFP1 yes mulf f2 f0 f1 - LDFP2 no
allocate
EECS 470Lecture 5 Slide 47
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard: Cycle 3Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3mulf f0,f1,f2 c2stf f2,Z(r1) c3addi r1,4,r1ldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1)
Reg StatusReg Tf0f1 LDf2 FP1r1
Functional unit statusFU busy op R R1 R2 T1 T2ALU noLD yes ldf f1 - r1 - -ST yes stf - f2 r1 FP1 -FP1 yes mulf f2 f0 f1 - LDFP2 no
allocate
EECS 470Lecture 5 Slide 48
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard: Cycle 4Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4stf f2,Z(r1) c3addi r1,4,r1 c4ldf X(r1),f1mulf f0,f1,f2stf f2,Z(r1)
Reg StatusReg Tf0f1 LDf2 FP1r1 ALU
FU StatusFU busy op R R1 R2 T1 T2ALU yes addi r1 r1 - - -LD noST yes stf - f2 r1 FP1 -FP1 yes mulf f2 f0 f1 - LDFP2 no
allocatefree
f0 (LD) is ready ® issue mulf
f1 written ® clear
EECS 470Lecture 5 Slide 49
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard: Cycle 5Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5stf f2,Z(r1) c3addi r1,4,r1 c4 c5ldf X(r1),f1 c5mulf f0,f1,f2stf f2,Z(r1)
Reg StatusReg Tf0f1 LDf2 FP1r1 ALU
FU StatusFU busy op R R1 R2 T1 T2ALU yes addi r1 r1 - - -LD yes ldf f1 - r1 - ALUST yes stf - f2 r1 FP1 -FP1 yes mulf f2 f0 f1 - -FP2 no
allocate
EECS 470Lecture 5 Slide 50
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard: Cycle 6Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5+stf f2,Z(r1) c3addi r1,4,r1 c4 c5 c6ldf X(r1),f1 c5mulf f0,f1,f2stf f2,Z(r1)
Reg StatusReg Tf0f1 LDf2 FP1r1 ALU
FU StatusFU busy op R R1 R2 T1 T2ALU yes addi r1 r1 - - -LD yes ldf f1 - r1 - ALUST yes stf - f2 r1 FP1 -FP1 yes mulf f2 f0 f1 - -FP2 no
D stall: WAW hazard w/ mulf (f2) How to tell? RegStatus[f2] non-empty
EECS 470Lecture 5 Slide 51
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard: Cycle 7Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5+stf f2,Z(r1) c3addi r1,4,r1 c4 c5 c6ldf X(r1),f1 c5mulf f0,f1,f2stf f2,Z(r1)
Reg StatusReg Tf0f1 LDf2 FP1r1 ALU
FU StatusFU busy op R R1 R2 T1 T2ALU yes addi r1 r1 - - -LD yes ldf f1 - r1 - ALUST yes stf - f2 r1 FP1 -FP1 yes mulf f2 f0 f1 - -FP2 no
W wait: WAR hazard w/ stf (r1)How to tell? Untagged r1 in FuStatusRequires CAM
EECS 470Lecture 5 Slide 52
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard: Cycle 8Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5+ c8stf f2,Z(r1) c3 c8addi r1,4,r1 c4 c5 c6ldf X(r1),f1 c5mulf f0,f1,f2 c8stf f2,Z(r1)
Reg StatusReg Tf0f1 LDf2 FP1 FP2r1 ALU
FU StatusFU busy op R R1 R2 T1 T2ALU yes addi r1 r1 - - -LD yes ldf f1 - r1 - ALUST yes stf - f2 r1 FP1 -FP1 noFP2 yes mulf f2 f0 f1 - LD allocate
freef1 (FP1) is ready ® issue stf
first mulf done (FP1)
W wait
EECS 470Lecture 5 Slide 53
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard: Cycle 9Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5+ c8stf f2,Z(r1) c3 c8 c9addi r1,4,r1 c4 c5 c6 c9ldf X(r1),f1 c5 c9mulf f0,f1,f2 c8stf f2,Z(r1)
Reg StatusReg Tf0f1 LDf2 FP2r1 ALU
FU StatusFU busy op R R1 R2 T1 T2ALU noLD yes ldf f1 - r1 - ALUST yes stf - f2 r1 - -FP1 noFP2 yes mulf f2 f0 f1 - LD
D stall: structural hazard FuStatus[ST]
r1 written ® clear
freer1 (ALU) is ready ® issue ldf
EECS 470Lecture 5 Slide 54
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard: Cycle 10Insn StatusInsn D S X Wldf X(r1),f1 c1 c2 c3 c4mulf f0,f1,f2 c2 c4 c5+ c8stf f2,Z(r1) c3 c8 c9 c10addi r1,4,r1 c4 c5 c6 c9ldf X(r1),f1 c5 c9 c10mulf f0,f1,f2 c8stf f2,Z(r1) c10
Reg StatusReg Tf0f1 LDf2 FP2r1
FU StatusFU busy op R R1 R2 T1 T2ALU noLD yes ldf f1 - r1 - -ST yes stf - f2 r1 FP2 -FP1 noFP2 yes mulf f2 f0 f1 - LD
W & structural-dependent D in same cycle
free, then allocate
EECS 470Lecture 5 Slide 55
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
In-Order vs. Scoreboard
• Big speedup? – Only 1 cycle advantage for scoreboard
• Why? addi WAR hazard• Scoreboard issued addi earlier (c8® c5)• But WAR hazard delayed W until c9• Delayed issue of second iteration
In-Order ScoreboardInsn D X W D S X Wldf X(r1),f1 c1 c2 c3 c1 c2 c3 c4mulf f0,f1,f2 c3 c4+ c7 c2 c4 c5+ c8stf f2,Z(r1) c7 c8 c9 c3 c8 c9 c10addi r1,4,r1 c8 c9 c10 c4 c5 c6 c9ldf X(r1),f1 c10 c11 c12 c5 c9 c10 c11mulf f0,f1,f2 c12 c13+ c16 c8 c11 c12+ c15stf f2,Z(r1) c16 c17 c18 c10 c15 c16 c17
EECS 470Lecture 5 Slide 56
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
In-Order vs. Scoreboard II:Cache Miss
• Assume• 5 cycle cache miss on first ldf• Ignore FUST structural hazards– Little relative advantage
• addi WAR hazard (c7® c13) stalls second iteration
In-Order ScoreboardInsn D X W D S X Wldf X(r1),f1 c1 c2+ c7 c1 c2 c3+ c8mulf f0,f1,f2 c7 c8+ c11 c2 c8 c9+ c12stf f2,Z(r1) c11 c12 c13 c3 c12 c13 c14addi r1,4,r1 c12 c13 c14 c4 c5 c6 c13ldf X(r1),f1 c14 c15 c16 c5 c13 c14 c15mulf f0,f1,f2 c16 c17+ c20 c6 c15 c16+ c19stf f2,Z(r1) c20 c21 c22 c7 c19 c20 c21
EECS 470Lecture 5 Slide 57
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar
Scoreboard Redux• The good
+ Cheap hardware
• InsnStatus + FuStatus + RegStatus ~ 1 FP unit in area
+ Pretty good performance
• 1.7X for FORTRAN (scientific array) programs
• The less good
– No bypassing
• Is this a fundamental problem?
– Limited scheduling scope
• Structural/WAW hazards delay dispatch
– Slow issue of truly-dependent (RAW) insns
• WAR hazards delay writeback
• Fix with hardware register renaming
Recommended