Lecture 16: Instruction Level Parallelism -- Dynamic Scheduling (OOO) via Tomasulo’s Approach CSE 564 Computer Architecture Summer 2017 Department of Computer Science and Engineering Yonghong Yan [email protected] www.secs.oakland.edu/~yan


Lecture 16: Instruction Level Parallelism -- Dynamic Scheduling (OOO) via

Tomasulo’s Approach

CSE 564 Computer Architecture Summer 2017

Department of Computer Science and Engineering

Yonghong Yan [email protected]

www.secs.oakland.edu/~yan


Topics for Instruction Level Parallelism
§ ILP Introduction, Compiler Techniques and Branch Prediction
  – 3.1, 3.2, 3.3
§ Dynamic Scheduling (OOO)
  – 3.4, 3.5 and C.5, C.6 and C.7 (FP pipeline and scoreboard)
§ Hardware Speculation and Static Superscalar/VLIW
  – 3.6, 3.7
§ Dynamic Scheduling, Multiple Issue and Speculation
  – 3.8, 3.9
§ ILP Limitations and SMT
  – 3.10, 3.11, 3.12


Acknowledgement and Copyright
§ Slides adapted from
  – UC Berkeley course “Computer Science 252: Graduate Computer Architecture” of David E. Culler, Copyright (C) 2005 UCB
  – UC Berkeley course “Computer Science 252: Graduate Computer Architecture”, Spring 2012, of John Kubiatowicz, Copyright (C) 2012 UCB
  – “Computer Science 152: Computer Architecture and Engineering”, Spring 2016, by Dr. George Michelogiannakis from UC Berkeley

§  https://passlab.github.io/CSE564/copyrightack.html


Complex Pipelining: Motivation
§ Why would we want more than our in-order pipeline?

[Figure: in-order pipeline (PC, Inst. Cache, Decode, E, M, Data Cache, W) connected through a memory controller to main memory (DRAM) using physical addresses]


Complex Pipelining: Motivation
Pipelining becomes complex when we want high performance in the presence of:
§ Long latency or partially pipelined floating-point units
  – Not all instructions are floating point or integer
§ Memory systems with variable access time
  – For example, cache misses
§ Multiple arithmetic and memory units


Floating Point Representation
§ IEEE standard 754

Value = $(-1)^{s} \times 1.\text{mantissa} \times 2^{\text{exp} - 127}$

Exponent = 0 has special meaning
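To make the formula concrete, here is a minimal sketch that splits a single-precision value into its sign, exponent and mantissa fields and rebuilds it. The function names are hypothetical, and the reconstruction only covers normalized numbers (exponent field neither 0 nor 255), since the special cases are outside the formula above.

```python
import struct

def decode_float32(x: float):
    """Return (sign, exp, mantissa) bit fields of x stored as IEEE 754 single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31              # 1 bit
    exp = (bits >> 23) & 0xFF      # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF     # 23 bits, implicit leading 1 not stored
    return sign, exp, mantissa

def value_from_fields(sign: int, exp: int, mantissa: int) -> float:
    """Apply Value = (-1)^s * 1.mantissa * 2^(exp-127); normalized numbers only."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exp - 127)

fields = decode_float32(-6.5)      # -6.5 = -1.625 * 2^2
rebuilt = value_from_fields(*fields)
```

For -6.5 the exponent field is 129 (that is, 2 + 127) and the mantissa field holds the fraction 0.625.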


Floating-Point Unit (FPU)
§ Much more hardware than an integer unit
  – A simple FPU takes 150,000 gates. Verification is complex. Some exceptions are specific to floating point.
  – Integer FU on the order of thousands of gates
§ Common to have several FPU’s
  – Some integer, some floating point
§ Common to have different types of FPU’s: Fadd, Fmul, Fdiv, …
§ An FPU may be pipelined, partially pipelined or not pipelined
§ To operate several FPU’s concurrently, the FP register file needs to have more read and write ports


Unpipelined FP EXE Stage
§ FP takes loops to compute
§ Much longer clock period

Single-cycle FPU is a bad idea


Latency and Interval
§ Latency
  – The number of intervening cycles between an instruction that produces a result and an instruction that uses the result.
  – Usually the number of stages after EX in which an instruction produces a result
    » Integer ALU: 0, Load: 1
§ Initiation or repeat interval
  – The number of cycles that must elapse between issuing two operations of a given type → structural hazards
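As a rough illustration of the two definitions, the sketch below turns latency and initiation interval into stall counts. The integer ALU and load rows come from the slide; the FP add, multiply and divide rows are assumed example values in the style of the textbook's FP pipeline, and both function names are hypothetical.

```python
# (latency, initiation interval) per functional unit.
# int_alu and load follow the slide; the FP rows are assumed example values.
UNITS = {
    "int_alu": (0, 1),
    "load": (1, 1),
    "fp_add": (3, 1),
    "fp_mul": (6, 1),
    "fp_div": (24, 25),
}

def raw_stalls(producer: str, independent_between: int = 0) -> int:
    """Stall cycles seen by a consumer that follows the producer after
    `independent_between` unrelated instructions."""
    latency, _ = UNITS[producer]
    return max(0, latency - independent_between)

def structural_stalls(unit: str, cycles_since_last_issue: int = 1) -> int:
    """Stall cycles before a second operation can issue to the same unit."""
    _, interval = UNITS[unit]
    return max(0, interval - cycles_since_last_issue)
```

With these numbers a consumer issued right after a load stalls one cycle, and a second divide issued right after a first one waits for the unpipelined divider.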


Pipelined FP EXE
§ Increased stall for RAW hazards


Breaking Our Assumption of Integer Pipeline

§ The divide unit is not fully pipelined
  – structural hazards can occur
    » need to be detected and stall incurred.
§ The instructions have varying running times
  – the number of register writes required in a cycle can be > 1
§ Instructions no longer reach WB in order
  – Write after write (WAW) hazards are possible
    » Note that write after read (WAR) hazards are not possible, since the register reads always occur in ID.
§ Instructions can complete in a different order than they were issued (out-of-order completion)
  – causing problems with exceptions
§ Longer latency of operations
  – stalls for RAW hazards will be more frequent.


Hazards and Forwarding for Longer-Latency Pipeline

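[Figure: hazards and forwarding in the longer-latency pipeline; content not recoverable from the extraction]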


Stalls of FP Operations
§ SPEC89 FP
§ Latency average
§ FP add, subtract, or convert
  – 1.7 cycles, or 56% of the latency (3 cycles).
§ Multiplies and divides
  – 2.8 and 14.2, respectively, or 46% and 59% of the corresponding latency.
§ Structural hazards for divides are rare
  – since the divide frequency is low.


Stalls per FP Operation
§ The total number of stalls per instruction
  – ranges from 0.65 for su2cor to 1.21 for doduc, with an average of 0.87.
  – FP result stalls dominate in all cases, with an average of 0.71 stalls per instruction, or 82% of the stalled cycles.


Problems Arising From Writes
§ If we issue one instruction per cycle, how can we avoid structural hazards at the writeback stage and out-of-order writeback issues?
§ WAW Hazards


Complex In-Order Pipeline

§ Delay writeback so all operations have same latency to W stage
  – Write ports never oversubscribed (one inst. in & one inst. out every cycle)
  – Stall pipeline on long latency operations, e.g., divides, cache misses
  – Handle exceptions in-order at commit point

[Figure: complex in-order pipeline with PC, Inst. Mem, Decode, GPRs, FPRs, stages X1 to X3, Data Mem, FAdd, FMul, an unpipelined FDiv divider, W, and the commit point]

How to prevent increased writeback latency from slowing down single-cycle integer operations? Bypassing


Floating-Point ISA
§ Interaction between floating-point datapath and integer datapath is determined by ISA
§ RISC-V ISA
  – separate register files for FP and Integer instructions
    » the only interaction is via a set of move/convert instructions (some ISA’s don’t even permit this)
  – separate load/store for FPR’s and GPR’s (general purpose registers) but both use GPR’s for address calculation
  – FP compares write integer registers, then use integer branch


Realistic Memory Systems
Common approaches to improving memory performance:
§ Caches - single cycle except in case of a miss => stall
§ Banked memory - multiple memory accesses => bank conflicts
§ Split-phase memory operations (separate memory request from response), many in flight => out-of-order responses

Latency of access to the main memory is usually much greater than one cycle and often unpredictable.
Solving this problem is a central issue in computer architecture.


Multiple-Cycles MEM Stage
§ MIPS R4000
§ IF: First half of instruction fetch; PC selection actually happens here, together with initiation of instruction cache access.
§ IS: Second half of instruction fetch, complete instruction cache access.
§ RF: Instruction decode and register fetch, hazard checking, and instruction cache hit detection.
§ EX: Execution, which includes effective address calculation, ALU operation, and branch-target computation and condition evaluation.
§ DF: Data fetch, first half of data cache access.
§ DS: Second half of data fetch, completion of data cache access.
§ TC: Tag check, to determine whether the data cache access hit.
§ WB: Write-back for loads and register-register operations.


2-Cycle Load Delay

[Figure: pipeline diagram not recoverable from the extraction]


3-Cycle Branch Delay when Taken


Dynamic Scheduling
§ Data Hazards
§ Control Hazards


Types of Data Hazards
Consider executing a sequence of instructions of the type rk <= ri op rj

Data-dependence      r3 <= r1 op r2
                     r5 <= r3 op r4      Read-after-Write (RAW) hazard

Anti-dependence      r3 <= r1 op r2
                     r1 <= r4 op r5      Write-after-Read (WAR) hazard

Output-dependence    r3 <= r1 op r2
                     r3 <= r6 op r7      Write-after-Write (WAW) hazard


Register vs. Memory Dependence
Data hazards due to register operands can be determined at the decode stage, but data hazards due to memory operands can be determined only after computing the effective address.

Store: M[r1 + disp1] <= r2
Load:  r3 <= M[r4 + disp2]

Does (r1 + disp1) = (r4 + disp2)?
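The question on this slide can only be answered with run-time register values. A minimal sketch, assuming a hypothetical helper `same_address` and made-up register contents, shows the same store/load pair conflicting in one run and not in another:

```python
def same_address(regs, store, load):
    """Compare effective addresses of a store and a later load.
    `store` and `load` are (base_register, displacement) pairs."""
    base_s, disp_s = store
    base_l, disp_l = load
    return regs[base_s] + disp_s == regs[base_l] + disp_l

store = ("r1", 0)   # M[r1 + 0] <= r2
load = ("r4", 8)    # r3 <= M[r4 + 8]

# Illustrative register values only; nothing in the instruction encoding
# tells the decoder which case applies.
conflict = same_address({"r1": 100, "r4": 92}, store, load)      # 100 vs 100
no_conflict = same_address({"r1": 100, "r4": 200}, store, load)  # 100 vs 208
```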


Data Hazards: An Example

I1  FDIV.D  f6, f6, f4
I2  FLD     f2, 45(x3)
I3  FMUL.D  f0, f2, f4
I4  FDIV.D  f8, f6, f2
I5  FSUB.D  f10, f0, f6
I6  FADD.D  f6, f8, f2

[Figure: arrows between the instructions mark the RAW, WAR and WAW hazards]
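The hazards in this six-instruction example can be enumerated mechanically. The sketch below is one way to do it; `Instr` and `find_hazards` are hypothetical names, and the pairwise check ignores intervening writes to the same register, which is sufficient for this sequence but not in general.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    name: str
    dest: str
    srcs: Tuple[str, ...]

PROGRAM = [
    Instr("I1", "f6", ("f6", "f4")),   # FDIV.D f6, f6, f4
    Instr("I2", "f2", ("x3",)),        # FLD    f2, 45(x3)
    Instr("I3", "f0", ("f2", "f4")),   # FMUL.D f0, f2, f4
    Instr("I4", "f8", ("f6", "f2")),   # FDIV.D f8, f6, f2
    Instr("I5", "f10", ("f0", "f6")),  # FSUB.D f10, f0, f6
    Instr("I6", "f6", ("f8", "f2")),   # FADD.D f6, f8, f2
]

def find_hazards(program):
    """Return (kind, earlier, later, register) for every dependent pair."""
    hazards = []
    for j, later in enumerate(program):
        for earlier in program[:j]:
            if earlier.dest in later.srcs:      # later reads what earlier writes
                hazards.append(("RAW", earlier.name, later.name, earlier.dest))
            if later.dest in earlier.srcs:      # later overwrites what earlier reads
                hazards.append(("WAR", earlier.name, later.name, later.dest))
            if later.dest == earlier.dest:      # both write the same register
                hazards.append(("WAW", earlier.name, later.name, later.dest))
    return hazards

hazards = find_hazards(PROGRAM)
```

All WAR and WAW hazards in this sequence involve I6 writing f6.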


Instruction Scheduling

[Figure: dependence graph over the instructions I1 to I6]

I1  FDIV.D  f6, f6, f4
I2  FLD     f2, 45(x3)
I3  FMUL.D  f0, f2, f4
I4  FDIV.D  f8, f6, f2
I5  FSUB.D  f10, f0, f6
I6  FADD.D  f6, f8, f2

Valid orderings:
in-order        I1 I2 I3 I4 I5 I6
out-of-order    I2 I1 I3 I4 I5 I6
out-of-order    I1 I2 I3 I5 I4 I6
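An ordering is valid when every dependent pair keeps its original relative order. A small sketch, with the dependence edges written out by hand from the six instructions above and a hypothetical `is_valid_order` helper:

```python
# (earlier, later) pairs that must stay in program order.
DEPENDENCES = {
    ("I1", "I4"), ("I1", "I5"),                 # RAW on f6
    ("I2", "I3"), ("I2", "I4"), ("I2", "I6"),   # RAW on f2
    ("I3", "I5"),                               # RAW on f0
    ("I4", "I6"),                               # RAW on f8, WAR on f6
    ("I5", "I6"),                               # WAR on f6
    ("I1", "I6"),                               # WAW (and WAR) on f6
}

def is_valid_order(order):
    """True if no dependent pair is reordered."""
    position = {name: i for i, name in enumerate(order)}
    return all(position[a] < position[b] for a, b in DEPENDENCES)

in_order = is_valid_order(["I1", "I2", "I3", "I4", "I5", "I6"])
swap_12 = is_valid_order(["I2", "I1", "I3", "I4", "I5", "I6"])
swap_45 = is_valid_order(["I1", "I2", "I3", "I5", "I4", "I6"])
i6_early = is_valid_order(["I1", "I2", "I3", "I6", "I4", "I5"])
```

I1/I2 and I4/I5 can be swapped because neither pair shares a register; moving I6 ahead of I4 breaks the RAW dependence on f8.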


Out-of-order Completion In-order Issue

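Instruction                 Latency
I1  FDIV.D  f6, f6, f4      4
I2  FLD     f2, 45(x3)      1
I3  FMUL.D  f0, f2, f4      3
I4  FDIV.D  f8, f6, f2      4
I5  FSUB.D  f10, f0, f6     1
I6  FADD.D  f6, f8, f2      1

in-order comp        1 2 1 2 3 4 3 5 4 6 5 6
out-of-order comp    1 2 2 3 1 4 3 5 5 4 6 6

Each instruction number appears twice: the first occurrence is its issue and the second (underlined on the slide) is its completion.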


Dynamic Scheduling
§ Rearrange order of instructions to reduce stalls while maintaining data flow
  – Minimize RAW hazards
  – Minimize WAW and WAR hazards via register renaming
  – Covers hazards through both registers and memory
§ Advantages:
  – Compiler doesn’t need to have knowledge of microarchitecture
  – Handles cases where dependencies are unknown at compile time
§ Disadvantages:
  – Substantial increase in hardware complexity
  – Complicates exceptions


Dynamic Scheduling
§ Dynamic scheduling implies:
  – Out-of-order execution
  – Out-of-order completion
§ Creates more possibility for WAR and WAW hazards
§ Scoreboard: C.6
  – CDC 6600 in 1963
§ Tomasulo’s Approach
  – Tracks when operands are available
  – Introduces register renaming in hardware
    » Minimizes WAW and WAR hazards


Register Renaming
§ Example:

DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8

Anti-dependence on F8 (ADD.D reads F8, then SUB.D writes it)
Output dependence on F6 (ADD.D and MUL.D both write it)


Register Renaming
§ Example, with the destinations of ADD.D and SUB.D renamed to the temporaries S and T:

DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T

§ Now only RAW hazards remain, which can be strictly ordered

Original sequence, for comparison:

DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8
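Renaming can be sketched as a single pass that gives every destination a fresh name and rewrites later sources through a map. This is an illustration only: it renames every destination (to hypothetical temporaries T0, T1, ...) rather than only the two registers picked on the slide, and `rename` is a made-up helper.

```python
# (opcode, destination or None, source registers)
PROGRAM = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D", None, ["F6", "R1"]),      # store has no register destination
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]

def rename(program):
    """Give each destination a fresh name; sources follow the latest mapping."""
    mapping = {}    # architectural register -> current temporary
    counter = 0
    out = []
    for op, dest, srcs in program:
        new_srcs = [mapping.get(s, s) for s in srcs]   # read sources first
        new_dest = None
        if dest is not None:
            new_dest = f"T{counter}"
            counter += 1
            mapping[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out

renamed = rename(PROGRAM)
```

After the pass, ADD.D still reads the old F8 while SUB.D writes a different name, and ADD.D and MUL.D write different names, so the WAR and WAW hazards disappear and only the RAW chains remain.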


Tomasulo Algorithm
§ For IBM 360/91, about 3 years after CDC 6600 (1966)
§ Goal: High performance without special compilers
§ Differences between IBM 360 & CDC 6600 ISA
  – IBM has only 2 register specifiers/instr vs. 3 in CDC 6600
  – IBM has 4 FP registers vs. 8 in CDC 6600
  – IBM has memory-register ops
§ Why study? Led to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …


Organizations of Tomasulo’s Algorithm
§ Load/Store buffer
§ Reservation station
§ Common data bus


Tomasulo Algorithm vs. Scoreboard
§ Control & buffers distributed with Function Units (FU) vs. centralized in scoreboard;
  – FU buffers called “reservation stations”; have pending operands
§ Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming;
  – avoids WAR, WAW hazards
  – More reservation stations than registers, so can do optimizations compilers can’t
§ Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
§ Loads and stores treated as FUs with RSs as well
§ Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue


Register Renaming
§ Register renaming by reservation stations (RS)
  – Each entry contains:
    » The instruction
    » Buffered operand values (when available)
    » Reservation station number of instruction providing the operand values
  – RS fetches and buffers an operand as soon as it becomes available (not necessarily involving register file)
  – Pending instructions designate the RS to which they will send their output
    » Result values broadcast on the common data bus (CDB)
  – Only the last output updates the register file
  – As instructions are issued, the register specifiers are renamed with the reservation station
  – May be more reservation stations than registers


Reservation Station Components
Op: Operation to perform in the unit (e.g., + or –)
Vj, Vk: Values of the source operands
  – Store buffers have a V field, the result to be stored
Qj, Qk: Reservation stations producing the source registers (value to be written)
  – Note: No ready flags as in Scoreboard; Qj, Qk = 0 => ready in Vj or Vk
  – Store buffers only have Qi for the RS producing the result
Busy: Indicates reservation station or FU is busy
Qi: Register result status. Indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
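The fields above map naturally onto a small record. The sketch below is an illustration, not the 360/91 design: `ReservationStation` and `issue` are hypothetical names, operand values are kept as symbolic strings such as "R(F4)", and `None` plays the role of the blank/0 entries. It fills a station at issue time from the register result status.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[str] = None   # value of first source, if ready
    vk: Optional[str] = None   # value of second source, if ready
    qj: Optional[str] = None   # station producing first source, if pending
    qk: Optional[str] = None   # station producing second source, if pending

def issue(rs: ReservationStation, op: str, dest: str, src_j: str, src_k: str,
          reg_status: Dict[str, str]) -> ReservationStation:
    """Fill a free station; reg_status maps a register to its pending producer (Qi)."""
    assert not rs.busy, "structural hazard: station is occupied"
    rs.busy, rs.op = True, op
    # Read both sources before updating the destination's status.
    rs.qj = reg_status.get(src_j)
    rs.vj = None if rs.qj else f"R({src_j})"
    rs.qk = reg_status.get(src_k)
    rs.vk = None if rs.qk else f"R({src_k})"
    reg_status[dest] = rs.name   # rename: dest now refers to this station
    return rs

# MULTD F0, F2, F4 while F2 and F6 are still being loaded.
reg_status = {"F2": "Load2", "F6": "Load1"}
mult1 = issue(ReservationStation("Mult1"), "MULTD", "F0", "F2", "F4", reg_status)
```

The station ends up holding a pointer (Load2) for the pending operand and a value for the ready one, and F0 is marked as produced by Mult1.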


Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting units; mark reservation station available

§ Normal data bus: data + destination (“go to” bus)
§ Common data bus: data + source (“come from” bus)
  – 64 bits of data + 4 bits of Functional Unit source address
  – Write if matches expected Functional Unit (produces result)
  – Does the broadcast
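The “come from” behaviour of the common data bus can be sketched as a broadcast of (source tag, value): every waiting station and every register whose status names that tag captures the value. The dictionaries and the `broadcast` helper are hypothetical, values are symbolic strings, and freeing the producing station or load buffer is left out.

```python
def broadcast(tag, value, stations, reg_status, reg_file):
    """Write-result stage: put (tag, value) on the CDB."""
    for rs in stations.values():
        if rs.get("Qj") == tag:            # first operand was waiting on this tag
            rs["Vj"], rs["Qj"] = value, None
        if rs.get("Qk") == tag:            # second operand was waiting on this tag
            rs["Vk"], rs["Qk"] = value, None
    # Only registers still naming this tag as their producer are written.
    for reg in [r for r, producer in reg_status.items() if producer == tag]:
        reg_file[reg] = value
        del reg_status[reg]

stations = {
    "Add1": {"Op": "SUBD", "Vj": "M(A1)", "Vk": None, "Qj": None, "Qk": "Load2"},
    "Mult1": {"Op": "MULTD", "Vj": None, "Vk": "R(F4)", "Qj": "Load2", "Qk": None},
}
reg_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
reg_file = {"F6": "M(A1)"}

broadcast("Load2", "M(A2)", stations, reg_status, reg_file)
```

After the broadcast both stations have all operands and can begin execution, and F2 holds the loaded value.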


Tomasulo Organization for the Example

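[Figure: Tomasulo organization. The FP Op Queue and FP Registers feed the reservation stations (Add1 to Add3 for the FP adders, Mult1 and Mult2 for the FP multipliers); load buffers Load1 to Load6 receive data from memory and store buffers send data to memory; results are broadcast on the Common Data Bus (CDB)]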


Tomasulo Example

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2
LD    F2  45+ R3
MULTD F0  F2  F4
SUBD  F8  F6  F2
DIVD  F10 F0  F6
ADDD  F6  F8  F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
0      FU


Tomasulo Example Cycle 1

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1
LD    F2  45+ R3
MULTD F0  F2  F4
SUBD  F8  F6  F2
DIVD  F10 F0  F6
ADDD  F6  F8  F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
1      FU                     Load1


Tomasulo Example Cycle 2

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1
LD    F2  45+ R3    2
MULTD F0  F2  F4
SUBD  F8  F6  F2
DIVD  F10 F0  F6
ADDD  F6  F8  F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
2      FU          Load2      Load1

Allow multiple outstanding loads


Tomasulo Example Cycle 3

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3
LD    F2  45+ R3    2
MULTD F0  F2  F4    3
SUBD  F8  F6  F2
DIVD  F10 F0  F6
ADDD  F6  F8  F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F4)  Load2
      Mult2  No

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
3      FU   Mult1  Load2      Load1

•  Note: register names are removed (“renamed”) in Reservation Stations; MULT issued vs. scoreboard

•  Load1 completing; what is waiting for Load1?


Tomasulo Example Cycle 4

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4
MULTD F0  F2  F4    3
SUBD  F8  F6  F2    4
DIVD  F10 F0  F6
ADDD  F6  F8  F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  Yes   45+R3
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   Yes   SUBD   M(A1)                Load2
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F4)  Load2
      Mult2  No

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
4      FU   Mult1  Load2      M(A1)    Add1

•  Load2 completing; what is waiting for Load2?

Waiting for data from memory by the instruction originally in Load1


Tomasulo Example Cycle 5

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4          5
MULTD F0  F2  F4    3
SUBD  F8  F6  F2    4
DIVD  F10 F0  F6    5
ADDD  F6  F8  F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
2     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   No
      Add3   No
10    Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
5      FU   Mult1  M(A2)      M(A1)    Add1   Mult2

Waiting for data from memory by the instruction originally in Load2


Tomasulo Example Cycle 6

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4          5
MULTD F0  F2  F4    3
SUBD  F8  F6  F2    4
DIVD  F10 F0  F6    5
ADDD  F6  F8  F2    6

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
1     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   Yes   ADDD          M(A2)  Add1
      Add3   No
9     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
6      FU   Mult1  M(A2)      Add2     Add1   Mult2

•  Issue ADDD here vs. scoreboard?


Tomasulo Example Cycle 7

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4          5
MULTD F0  F2  F4    3
SUBD  F8  F6  F2    4      7
DIVD  F10 F0  F6    5
ADDD  F6  F8  F2    6

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
0     Add1   Yes   SUBD   M(A1)  M(A2)
      Add2   Yes   ADDD          M(A2)  Add1
      Add3   No
8     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
7      FU   Mult1  M(A2)      Add2     Add1   Mult2

•  Add1 completing; what is waiting for it?


Tomasulo Example Cycle 8

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4          5
MULTD F0  F2  F4    3
SUBD  F8  F6  F2    4      7          8
DIVD  F10 F0  F6    5
ADDD  F6  F8  F2    6

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
2     Add2   Yes   ADDD   (M-M)  M(A2)
      Add3   No
7     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
8      FU   Mult1  M(A2)      Add2     (M-M)  Mult2


Tomasulo Example Cycle 9

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4          5
MULTD F0  F2  F4    3
SUBD  F8  F6  F2    4      7          8
DIVD  F10 F0  F6    5
ADDD  F6  F8  F2    6

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
1     Add2   Yes   ADDD   (M-M)  M(A2)
      Add3   No
6     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
9      FU   Mult1  M(A2)      Add2     (M-M)  Mult2


Tomasulo Example Cycle 10

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4          5
MULTD F0  F2  F4    3
SUBD  F8  F6  F2    4      7          8
DIVD  F10 F0  F6    5
ADDD  F6  F8  F2    6      10

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
0     Add2   Yes   ADDD   (M-M)  M(A2)
      Add3   No
5     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
10     FU   Mult1  M(A2)      Add2     (M-M)  Mult2

•  Add2 completing; what is waiting for it?


Tomasulo Example Cycle 11

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4          5
MULTD F0  F2  F4    3
SUBD  F8  F6  F2    4      7          8
DIVD  F10 F0  F6    5
ADDD  F6  F8  F2    6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
4     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
11     FU   Mult1  M(A2)      (M-M+M)  (M-M)  Mult2

•  Write result of ADDD here vs. scoreboard? •  All quick instructions complete in this cycle!


Tomasulo Example Cycle 12

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4          5
MULTD F0  F2  F4    3
SUBD  F8  F6  F2    4      7          8
DIVD  F10 F0  F6    5
ADDD  F6  F8  F2    6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
3     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
12     FU   Mult1  M(A2)      (M-M+M)  (M-M)  Mult2


Tomasulo Example Cycle 13

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4          5
MULTD F0  F2  F4    3
SUBD  F8  F6  F2    4      7          8
DIVD  F10 F0  F6    5
ADDD  F6  F8  F2    6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
2     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
13     FU   Mult1  M(A2)      (M-M+M)  (M-M)  Mult2


Tomasulo Example Cycle 14

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4          5
MULTD F0  F2  F4    3
SUBD  F8  F6  F2    4      7          8
DIVD  F10 F0  F6    5
ADDD  F6  F8  F2    6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
1     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
14     FU   Mult1  M(A2)      (M-M+M)  (M-M)  Mult2


Tomasulo Example Cycle 15

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4          5
MULTD F0  F2  F4    3      15
SUBD  F8  F6  F2    4      7          8
DIVD  F10 F0  F6    5
ADDD  F6  F8  F2    6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
0     Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
15     FU   Mult1  M(A2)      (M-M+M)  (M-M)  Mult2


Tomasulo Example Cycle 16

Instruction status:
Instruction         Issue  Exec Comp  Write Result
LD    F6  34+ R2    1      3          4
LD    F2  45+ R3    2      4          5
MULTD F0  F2  F4    3      15         16
SUBD  F8  F6  F2    4      7          8
DIVD  F10 F0  F6    5
ADDD  F6  F8  F2    6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
40    Mult2  Yes   DIVD   M*F4   M(A1)

Register result status:
Clock       F0     F2     F4  F6       F8     F10    F12  ...  F30
16     FU   M*F4   M(A2)      (M-M+M)  (M-M)  Mult2


Faster than light computation (skip a couple of cycles)


Tomasulo Example Cycle 55

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15         16
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
1     Mult2  Yes   DIVD   M*F4   M(A1)

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
55     FU  M*F4   M(A2)      (M-M+M)  (M-M)  Mult2


Tomasulo Example Cycle 56

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15         16
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5      56
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   DIVD   M*F4   M(A1)

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
56     FU  M*F4   M(A2)      (M-M+M)  (M-M)  Mult2

•  Mult2 is completing; what is waiting for it?


Tomasulo Example Cycle 57

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15         16
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5      56         57
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  Yes   DIVD   M*F4   M(A1)

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
57     FU  M*F4   M(A2)      (M-M+M)  (M-M)  Result

•  Once again: In-order issue, out-of-order execution and completion.


Compare to Scoreboard Cycle 62

Instruction status:
                       Scoreboard                                 Tomasulo
Instruction  j    k    Issue  Read Oper  Exec Comp  Write Result  Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      2          3          4             1      3          4
LD     F2    45+  R3   5      6          7          8             2      4          5
MULTD  F0    F2   F4   6      9          19         20            3      15         16
SUBD   F8    F6   F2   7      9          11         12            4      7          8
DIVD   F10   F0   F6   8      21         61         62            5      56         57
ADDD   F6    F8   F2   13     14         16         22            6      10         11

•  Why take longer on scoreboard/6600?
   •  Structural Hazards
   •  Lack of forwarding


Tomasulo v. Scoreboard (IBM 360/91 v. CDC 6600)

Tomasulo (IBM 360/91)              Scoreboard (CDC 6600)
Pipelined Functional Units         Multiple Functional Units
(6 load, 3 store, 3 +, 2 x/÷)      (1 load/store, 1 +, 2 x, 1 ÷)
Window size: ≤ 14 instructions     ≤ 5 instructions
No issue on structural hazard      Same
WAR: renaming avoids               Stall completion
WAW: renaming avoids               Stall issue
Broadcast results from FU          Write/read registers
Control: reservation stations      Central scoreboard


SUMMARY


Not Every Stage Takes only one Cycle
§  FP EXE Stage
   –  Multi-cycle Add/Mul
   –  Nonpipelined for DIV
§  MEM Stage


Issues of Multi-Cycle in Some Stages
§  The divide unit is not fully pipelined
   –  structural hazards can occur
      »  they need to be detected, and the pipeline stalled.
§  The instructions have varying running times
   –  the number of register writes required in a cycle can be > 1
§  Instructions no longer reach WB in order
   –  Write after write (WAW) hazards are possible
      »  Note that write after read (WAR) hazards are not possible, since the register reads always occur in ID.
§  Instructions can complete in a different order than they were issued (out-of-order completion)
   –  causing problems with exceptions
§  Longer latency of operations
   –  stalls for RAW hazards will be more frequent.
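The write-port conflict and the WAW hazard listed above can be illustrated with a small model. This is only a sketch under stated assumptions: the latencies in LATENCY are hypothetical (the slides give no exact numbers), one instruction issues per cycle with no stalls, and each instruction writes back in the cycle after its last EX cycle.

```python
# Hypothetical EX latencies in cycles; these values are assumptions.
LATENCY = {"ADD.D": 4, "MUL.D": 7, "DIV.D": 24, "L.D": 1}


def writeback_cycles(program):
    """Return the WB cycle of each instruction.

    Assumed model: instruction i (0-based) enters EX in cycle i + 1,
    no stalls are modelled, and WB follows the last EX cycle.
    """
    return [i + 1 + LATENCY[op] for i, (op, _dest) in enumerate(program)]


def find_problems(program):
    """Find same-cycle register writes and out-of-order writes to one register."""
    wb = writeback_cycles(program)
    same_cycle = []  # two register writes needed in one cycle (structural hazard)
    waw = []         # later instruction writes the same register first
    for i in range(len(program)):
        for j in range(i + 1, len(program)):
            if wb[i] == wb[j]:
                same_cycle.append((i, j))
            if program[i][1] == program[j][1] and wb[j] < wb[i]:
                waw.append((i, j))
    return wb, same_cycle, waw


# (opcode, destination register), in program order
program = [("MUL.D", "F0"), ("ADD.D", "F2"), ("L.D", "F0"), ("ADD.D", "F4")]
wb, same_cycle, waw = find_problems(program)
print(wb)          # WB cycle per instruction
print(same_cycle)  # pairs that need the write port in the same cycle
print(waw)         # pairs whose writes to one register arrive out of order
```

In this toy program the load to F0 would write back before the earlier multiply to F0, so without detection the older value would overwrite the newer one.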


Hazards and Forwarding for Longer-Latency Pipeline


Problems Arising From Writes
§  If we issue one instruction per cycle, how can we avoid structural hazards at the writeback stage and out-of-order writeback issues?
§  WAW Hazards


2-Cycle Load Delay


3-Cycle Branch Delay when Taken


Instruction Scheduling

I1  FDIV.D   f6, f6, f4
I2  FLD      f2, 45(x3)
I3  FMULT.D  f0, f2, f4
I4  FDIV.D   f8, f6, f2
I5  FSUB.D   f10, f0, f6
I6  FADD.D   f6, f8, f2

[Figure: dependence graph over I1 to I6]

Valid orderings:
in-order      I1 I2 I3 I4 I5 I6
out-of-order  I2 I1 I3 I4 I5 I6
out-of-order  I1 I2 I3 I5 I4 I6
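One way to see why these orderings are valid is to derive the ordering constraints from the register names and check each candidate order against them. The sketch below is illustrative only: the function names are made up here, and the RAW test is conservative (it ignores intervening writes), which is sufficient for this six-instruction example.

```python
# (name, destination, sources) for I1..I6 as listed on the slide.
PROGRAM = [
    ("I1", "f6",  ("f6", "f4")),
    ("I2", "f2",  ("x3",)),
    ("I3", "f0",  ("f2", "f4")),
    ("I4", "f8",  ("f6", "f2")),
    ("I5", "f10", ("f0", "f6")),
    ("I6", "f6",  ("f8", "f2")),
]


def dependences(program):
    """Return the pairs (earlier, later) that must keep program order."""
    deps = set()
    for i, (ni, di, si) in enumerate(program):
        for nj, dj, sj in program[i + 1:]:
            raw = di in sj   # later reads what earlier writes
            war = dj in si   # later overwrites what earlier reads
            waw = di == dj   # both write the same register
            if raw or war or waw:
                deps.add((ni, nj))
    return deps


def is_valid(order, deps):
    """An order is valid if every dependent pair keeps its relative order."""
    pos = {name: k for k, name in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in deps)


DEPS = dependences(PROGRAM)
for order in ("I1 I2 I3 I4 I5 I6", "I2 I1 I3 I4 I5 I6", "I1 I2 I3 I5 I4 I6"):
    print(order, is_valid(order.split(), DEPS))
```

Swapping I2 and I3, for example, would be rejected because I3 reads the f2 that I2 loads.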


Register Renaming
§  Example:
      DIV.D  F0,F2,F4
      ADD.D  S,F0,F8
      S.D    S,0(R1)
      SUB.D  T,F10,F14
      MUL.D  F6,F10,T
§  Now only RAW hazards remain, which can be strictly ordered
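The claim that only RAW hazards remain can be checked mechanically. The sketch below classifies hazards between instruction pairs by register name. The ORIGINAL sequence is an assumption: it is the un-renamed form of this example as it appears in the Hennessy and Patterson textbook (F6 and F8 in place of S and T), which the slide itself does not show. The RAW test is conservative and ignores intervening writes.

```python
def hazards(program):
    """Classify hazards between every ordered pair of instructions.

    Each instruction is (text, dest_or_None, sources).
    Returns a set of (kind, earlier_index, later_index).
    """
    found = set()
    for i, (_, di, si) in enumerate(program):
        for j in range(i + 1, len(program)):
            _, dj, sj = program[j]
            if di is not None and di in sj:
                found.add(("RAW", i, j))
            if dj is not None and dj in si:
                found.add(("WAR", i, j))
            if di is not None and di == dj:
                found.add(("WAW", i, j))
    return found


# Assumed un-renamed sequence (not shown on the slide).
ORIGINAL = [
    ("DIV.D F0,F2,F4",   "F0", ("F2", "F4")),
    ("ADD.D F6,F0,F8",   "F6", ("F0", "F8")),
    ("S.D   F6,0(R1)",   None, ("F6", "R1")),
    ("SUB.D F8,F10,F14", "F8", ("F10", "F14")),
    ("MUL.D F6,F10,F8",  "F6", ("F10", "F8")),
]

# Renamed sequence from the slide, with temporaries S and T.
RENAMED = [
    ("DIV.D F0,F2,F4",  "F0", ("F2", "F4")),
    ("ADD.D S,F0,F8",   "S",  ("F0", "F8")),
    ("S.D   S,0(R1)",   None, ("S", "R1")),
    ("SUB.D T,F10,F14", "T",  ("F10", "F14")),
    ("MUL.D F6,F10,T",  "F6", ("F10", "T")),
]

print(sorted(hazards(ORIGINAL)))
print(sorted(hazards(RENAMED)))
```

Under these assumptions the original sequence shows WAR hazards on F8 and F6 and a WAW hazard on F6, while the renamed sequence reports RAW hazards only.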


How important is renaming? Consider execution without it

    Instruction          Latency
1   LD     F2, 34(R2)    1
2   LD     F4, 45(R3)    long
3   MULTD  F6, F4, F2    3
4   SUBD   F8, F2, F2    1
5   DIVD   F4, F2, F8    4
6   ADDD   F10, F6, F4   1

[Figure: dependence graph of instructions 1 to 6]

In-order:     1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order: 1 (2,1) 4 4 . . . . 2 3 . . 3 5 . . . 5 6 6

Out-of-order execution did not allow any significant improvement!


Instruction-level Parallelism via Renaming

    Instruction           Latency
1   LD     F2, 34(R2)     1
2   LD     F4, 45(R3)     long
3   MULTD  F6, F4, F2     3
4   SUBD   F8, F2, F2     1
5   DIVD   F4’, F2, F8    4
6   ADDD   F10, F6, F4’   1

[Figure: dependence graph of instructions 1 to 6; an X marks the dependence removed by renaming]

In-order:     1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order: 1 (2,1) 4 4 5 . . . 2 (3,5) 3 6 6

Any antidependence can be eliminated by renaming.
(renaming ⇒ additional storage)
Can be done either in Software or Hardware


Hardware Solution
§  Dynamic Scheduling
   –  Out-of-order execution and completion
§  Data Hazard via Register Renaming
   –  Dynamic RAW hazard detection and scheduling in data-flow fashion
   –  Register renaming for WAR and WAW hazards (name conflicts)
§  Implementations
   –  Scoreboard (CDC 6600, 1963)
      »  Centralized control; no register renaming, so WAR and WAW hazards cause stalls
   –  Tomasulo’s Approach (IBM 360/91, 1966)
      »  Distributed control and renaming via reservation stations, load/store buffers and common data bus (data + source)


Organizations of Tomasulo’s Algorithm
§  Load/Store buffer
§  Reservation station
§  Common data bus


Three Stages of Tomasulo Algorithm
1.  Issue: get instruction from FP Op Queue
    If a reservation station is free (no structural hazard), control issues the instruction and sends operands (renames registers).
2.  Execution: operate on operands (EX)
    When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3.  Write result: finish execution (WB)
    Write on the Common Data Bus to all awaiting units; mark the reservation station available.

§  Normal data bus: data + destination (“go to” bus)
§  Common data bus: data + source (“come from” bus)
   –  64 bits of data + 4 bits of Functional Unit source address
   –  Write if matches expected Functional Unit (produces result)
   –  Does the broadcast
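The issue and write-result stages, and the "come from" nature of the common data bus, can be sketched in a few lines. This is a simplified illustration, not the IBM 360/91 design: the class and method names are made up here, timing and the execute stage are not modelled, and the register values are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Station:
    name: str
    busy: bool = False
    op: str = ""
    vj: object = None  # operand values, once known
    vk: object = None
    qj: str = ""       # tags of the stations that will produce the operands
    qk: str = ""

    def ready(self):
        """Stage 2 may start once both operands hold values."""
        return self.busy and not self.qj and not self.qk


class Machine:
    def __init__(self, station_names, regs):
        self.stations = {n: Station(n) for n in station_names}
        self.regs = dict(regs)  # register name -> value
        self.reg_tag = {}       # register name -> tag of pending producer

    def _read(self, reg):
        tag = self.reg_tag.get(reg, "")
        return (None, tag) if tag else (self.regs[reg], "")

    def issue(self, op, dest, src1, src2):
        """Stage 1: claim a free station and copy values or tags (renaming)."""
        free = next((s for s in self.stations.values() if not s.busy), None)
        if free is None:
            return None  # structural hazard: issue stalls
        free.busy, free.op = True, op
        free.vj, free.qj = self._read(src1)
        free.vk, free.qk = self._read(src2)
        self.reg_tag[dest] = free.name  # dest now "comes from" this station
        return free.name

    def write_result(self, tag, value):
        """Stage 3: broadcast (value, source tag) to every waiting unit."""
        for s in self.stations.values():
            if s.qj == tag:
                s.vj, s.qj = value, ""
            if s.qk == tag:
                s.vk, s.qk = value, ""
        for reg, t in list(self.reg_tag.items()):
            if t == tag:
                self.regs[reg] = value
                del self.reg_tag[reg]
        self.stations[tag] = Station(tag)  # mark the station available


m = Machine(["Mult1", "Mult2"], {"F2": 3.0, "F4": 2.0, "F6": 12.0})
t1 = m.issue("MULTD", "F0", "F2", "F4")  # both operands available
t2 = m.issue("DIVD", "F10", "F0", "F6")  # F0 pending: Qj holds the tag Mult1
m.write_result(t1, 3.0 * 2.0)            # MULTD result goes out on the CDB
print(m.stations[t2])
```

Consumers match on the producer's tag rather than on a destination register name, which is why the DIVD station picks up the multiply result without ever reading F0 from the register file.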


Tomasulo Example Cycle 3

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3
LD     F2    45+  R3   2
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD         R(F4)  Load2
      Mult2  No

Register result status:
Clock      F0     F2     F4  F6       F8     F10    F12  ...  F30
3      FU  Mult1  Load2      Load1

•  Note: register names are removed (“renamed”) in Reservation Stations


Register Renaming Summary
§  Purpose of Renaming: removing “Anti-dependencies”
   –  Get rid of WAR and WAW hazards, since these are not “real” dependencies
§  Implicit Renaming: i.e. Tomasulo
   –  Registers changed into values or response tags
   –  We call this “implicit” because space in register file may or may not be used by results!
§  Explicit Renaming: more physical registers than needed by ISA.
   –  Rename table: tracks current association between architectural registers and physical registers
   –  Uses a translation table to perform compiler-like transformation on the fly
§  With Explicit Renaming:
   –  All registers concentrated in single register file
   –  Can utilize bypass network that looks more like 5-stage pipeline
   –  Introduces a register-allocation problem
      »  Need to handle branch misprediction and precise exceptions differently, but ultimately makes things simpler
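Explicit renaming with a rename table can be sketched as follows. This is a minimal illustration under assumptions: the RenameTable class, the physical register names P0, P1, ... and the pool size of 8 are hypothetical, and returning physical registers to the free list at commit (the register-allocation problem mentioned above) is left out.

```python
class RenameTable:
    def __init__(self, arch_regs, num_physical):
        # Initially architectural register i lives in physical register Pi.
        self.map = {r: f"P{i}" for i, r in enumerate(arch_regs)}
        self.free = [f"P{i}" for i in range(len(arch_regs), num_physical)]

    def rename(self, dest, sources):
        """Translate sources through the table, then give dest a fresh register."""
        phys_sources = [self.map[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        phys_dest = self.free.pop(0)
        self.map[dest] = phys_dest
        return phys_dest, phys_sources


# Instructions 3, 5 and 6 of the earlier renaming example.
table = RenameTable(["F2", "F4", "F6", "F8", "F10"], num_physical=8)
mul = table.rename("F6", ["F4", "F2"])   # MULTD F6, F4, F2
div = table.rename("F4", ["F2", "F8"])   # DIVD  F4, F2, F8
add = table.rename("F10", ["F6", "F4"])  # ADDD  F10, F6, F4
print(mul, div, add)
```

The MULTD keeps reading the old physical copy of F4 while the DIVD writes a new one, so the WAR hazard between them disappears, and the ADDD picks up the new mapping automatically.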