© Wen-mei Hwu and S. J. Patel, 2005 ECE 412, University of Illinois Lecture 10-11 Instruction Execution: Dynamic Scheduling

Outline

• General concepts
  – dataflow
  – dynamic scheduling with Tomasulo’s Algorithm

• The P6 Execution Microarchitecture

• Dynamic Scheduling Issues

The Execution Problem

[Figure: Instruction Supply, Execution Mechanism, Data Supply]

We are able to deliver instructions at high bandwidth, and we have techniques for high-bandwidth, low-latency data supply. But nothing matters if we cannot consume everything at high bandwidth in the execution mechanism. We need to execute instructions in parallel.

Fundamental problem: taking things in the order prescribed by the programmer will cause instruction dependencies to limit parallel execution of instructions.

Dynamic Scheduling

• Reservation Station

• Renaming

• Retirement/Recovery

• Memory Disambiguation

Tomasulo’s Algorithm

Dataflow Concepts

Source Code

x = (a * b) - (c + d);
y = (r + s) / (t + v);

Machine Code

1. MUL Ra, Rb -> Rm
2. ADD Rc, Rd -> Rn
3. SUB Rm, Rn -> Rx
4. ADD Rr, Rs -> Rm
5. ADD Rt, Rv -> Rn
6. DIV Rm, Rn -> Ry

Dataflow Graph: 1, 2 -> 3; 4, 5 -> 6
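
The dataflow graph for this example can be derived mechanically by tracking the most recent writer of each register. The sketch below is only an illustration: the tuple encoding of the instructions and the names `PROGRAM` and `dataflow_edges` are assumptions made here, not part of the lecture.

```python
# Hypothetical encoding of the six instructions: (opcode, sources, destination).
PROGRAM = [
    ("MUL", ("Ra", "Rb"), "Rm"),  # 1
    ("ADD", ("Rc", "Rd"), "Rn"),  # 2
    ("SUB", ("Rm", "Rn"), "Rx"),  # 3
    ("ADD", ("Rr", "Rs"), "Rm"),  # 4
    ("ADD", ("Rt", "Rv"), "Rn"),  # 5
    ("DIV", ("Rm", "Rn"), "Ry"),  # 6
]


def dataflow_edges(program):
    """Return (producer, consumer) pairs, numbered from 1, for true dependences."""
    last_writer = {}  # register name -> number of the latest instruction writing it
    edges = []
    for num, (_op, srcs, dest) in enumerate(program, start=1):
        for reg in srcs:
            if reg in last_writer:
                # The consumer reads the value produced by the latest writer.
                edges.append((last_writer[reg], num))
        last_writer[dest] = num
    return edges


if __name__ == "__main__":
    print(dataflow_edges(PROGRAM))
```

Running it prints `[(1, 3), (2, 3), (4, 6), (5, 6)]`: two independent trees, so instructions 4 and 5 need not wait for 1 through 3 once the reuse of Rm and Rn is dealt with.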

Data Dependences

• Data flow dependence
  – consumer-producer relationship
  – register bypass and interlocks

• Data output and antidependences
  – reuse of registers at compile time
  – register renaming

Interlocking

• Allow instruction to execute only when data and resources ready
  – simple interlocking based on bypass logic for short pipelines
  – scoreboarding for deep pipelines
  – Tomasulo’s Algorithm for out-of-order instruction dispatch

Tomasulo’s Algorithm

• Invented for the IBM 360/91 FPU
• First published in 1967 (IBM Journal)
• Not used for general CPU design until the 1990s
  – only once the branch prediction and exception recovery problems were solved

Tomasulo’s Algorithm

• Register renaming
  – tags for values

• Out-of-order execution
  – reservation stations

• Data forwarding
  – common data bus

Tomasulo’s Algorithm

• Instruction decode
  – fetch register file for value and tag
  – tag is handle for data currently being generated
  – determine RS to hold the decoded operations

Reservation Station

• Hardware mechanism that enables instructions to execute out-of-order and as early as their source operands are ready.

• An instruction waits in the RS until the tags for its source operands have been broadcast by their producers.

Tomasulo’s Algorithm

• Instruction Issue
  – insert operation and operands into the assigned reservation station entry
  – mark destination register as not ready

Tomasulo’s Algorithm

• Operation dispatch
  – identify operations ready for execution
  – determine highest priority operation for each port/function unit

Tomasulo’s Algorithm

• Data forwarding
  – result value and tag distributed to RS entries for associative search
  – result value and tag delivered to destination register for potential update
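
The decode, issue, dispatch, and forwarding steps can be put together in a small software model. This is a minimal sketch under simplifying assumptions that are not in the lecture: one shared pool of reservation stations, the RS tag doubling as the destination tag, oldest-first priority, and single-step execution. The class name `Tomasulo` and its methods are hypothetical.

```python
import operator

OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


class Tomasulo:
    def __init__(self, regs):
        # Each register holds a value plus a tag; tag None means the value is ready.
        self.regs = {r: {"value": v, "tag": None} for r, v in regs.items()}
        self.rs = {}       # tag -> reservation station entry
        self.next_tag = 0  # assumption: tags are handed out in issue order

    def issue(self, op, src1, src2, dest):
        """Decode + issue: read value or tag per source, then claim an RS entry."""
        operands = []
        for src in (src1, src2):
            reg = self.regs[src]
            # Copy the value if ready, otherwise the tag of its producer.
            operands.append({"value": reg["value"], "tag": reg["tag"]})
        tag = self.next_tag
        self.next_tag += 1
        self.rs[tag] = {"op": op, "operands": operands}
        # Mark the destination as not ready: it now waits on this tag.
        self.regs[dest]["tag"] = tag
        return tag

    def ready(self):
        """Tags of RS entries whose operands are all available, oldest first."""
        return sorted(t for t, e in self.rs.items()
                      if all(o["tag"] is None for o in e["operands"]))

    def dispatch(self):
        """Execute the oldest ready operation and broadcast its result."""
        ready = self.ready()
        if not ready:
            return None
        tag = ready[0]
        entry = self.rs.pop(tag)
        a, b = (o["value"] for o in entry["operands"])
        self.broadcast(tag, OPS[entry["op"]](a, b))
        return tag

    def broadcast(self, tag, value):
        """Common data bus: associative tag match in RS entries and registers."""
        for entry in self.rs.values():
            for o in entry["operands"]:
                if o["tag"] == tag:
                    o["value"], o["tag"] = value, None
        for reg in self.regs.values():
            # No update if a younger instruction has since claimed the register.
            if reg["tag"] == tag:
                reg["value"], reg["tag"] = value, None
```

For example, after issuing `MUL Ra, Rb -> Rm`, `ADD Rc, Rd -> Rn`, and `SUB Rm, Rn -> Rx` in that order, `ready()` lists only the first two tags; the SUB sits in its reservation station until both producer tags have been broadcast.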

Renaming

• Objective: want to eliminate WAR and WAW (false dependencies)

• Renaming happens in program order

• Renaming requires a table to map between architectural registers and physical registers
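
A mapping table of this kind can be sketched in a few lines. The example assumes an unlimited supply of physical registers named `P0`, `P1`, ... (real hardware recycles them from a free list), and the function name `rename` is hypothetical.

```python
from itertools import count


def rename(program, arch_regs):
    """Rename (op, srcs, dest) tuples in program order; return the renamed list."""
    fresh = count()  # assumption: endless supply of physical registers
    # Architectural -> physical map; every architectural register starts mapped.
    table = {r: f"P{next(fresh)}" for r in arch_regs}
    renamed = []
    for op, srcs, dest in program:
        psrcs = tuple(table[s] for s in srcs)  # sources read the current mapping
        table[dest] = f"P{next(fresh)}"        # every write gets a new register
        renamed.append((op, psrcs, table[dest]))
    return renamed


ARCH_REGS = ["Ra", "Rb", "Rc", "Rd", "Rr", "Rs", "Rt", "Rv", "Rm", "Rn", "Rx", "Ry"]
PROGRAM = [
    ("MUL", ("Ra", "Rb"), "Rm"),
    ("ADD", ("Rc", "Rd"), "Rn"),
    ("SUB", ("Rm", "Rn"), "Rx"),
    ("ADD", ("Rr", "Rs"), "Rm"),
    ("ADD", ("Rt", "Rv"), "Rn"),
    ("DIV", ("Rm", "Rn"), "Ry"),
]

if __name__ == "__main__":
    for inst in rename(PROGRAM, ARCH_REGS):
        print(inst)
```

After renaming, the two writes to Rm land in different physical registers (`P12` and `P15` with this numbering), so the WAW and WAR conflicts on Rm and Rn disappear while the SUB and DIV still read the values of their true producers.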

Retirement

• What happens if we inadvertently execute an instruction that should not have been executed (e.g., after a branch misprediction) or execute an instruction incorrectly (e.g., it raises an exception)?

• Need to flush all bad instructions and make it look as if they never executed.

• And then start executing from the correct point.

Retirement using Reorder Buffer

[Figure: Reorder Buffer holding insts in program order, with a head pointer and a tail pointer; values arrive from the Data Bus]

An instruction that reaches the head and executes without exception can be safely retired.

• Flushing inflight instructions is easy: clear out RS and ROB

• Recovering RAT state is hard. That’s where the ROB comes in.
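
In-order retirement from the head, with a flush on exception, can be modeled as follows. This is a sketch only: the field names and the `ReorderBuffer` class are assumptions, and a real ROB is a fixed-size circular buffer rather than a Python `deque`.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        # Left end = head (oldest instruction), right end = tail (youngest).
        self.entries = deque()

    def allocate(self, dest):
        """Add an instruction at the tail, in program order."""
        entry = {"dest": dest, "value": None, "done": False, "exception": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value, exception=False):
        """Record a result from the data bus; completion may be out of order."""
        entry.update(value=value, done=True, exception=exception)

    def retire(self, arch_regs):
        """Retire finished instructions from the head; return True on a flush."""
        while self.entries and self.entries[0]["done"]:
            head = self.entries[0]
            if head["exception"]:
                # Flush the faulting instruction and everything younger.
                self.entries.clear()
                return True
            arch_regs[head["dest"]] = head["value"]  # commit architectural state
            self.entries.popleft()
        return False
```

Because values are committed only at the head, an instruction that completes early stays invisible to the architectural state until every older instruction has retired.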

Putting it all together

[Figure: Register Alias Table, Reservation Stations, function units (FU), and Reorder Buffer]

Memory Disambiguation

1. MUL Ra, Rb -> Rm
2. ADD Rc, Rd -> Rn
3. ST Rm -> 0(Rn)
4. LD 0(Rs) -> Rm
5. ADD Rt, Rv -> Rn
6. DIV Rm, Rn -> Ry

Dataflow Graph: 1, 2 -> 3; 4, 5 -> 6; 3 -> 4 ??? (depends on whether Rn == Rs)

Conceptual Memory Order Buffer

[Figure: buffer of loads/stores in program order; entry fields: L/S, Addr (with V bit), Value (with V bit)]

• Stores write into buffer and pass to memory only after they reach the head and are retired.

• What about loads?
  – Could go in order (highly conservative)
  – Could wait until all previous unknown store addresses are known (not so conservative)
  – Could go as soon as address is known (optimistic)
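
The three load policies differ only in the condition a load must satisfy before it goes. The sketch below makes that explicit; the dictionary layout for older memory operations, the policy names, and both function names are assumptions made for illustration.

```python
def load_may_issue(load_addr, older, policy):
    """Decide whether a load may go, given the older loads/stores in the buffer.

    `older` is a list of dicts such as {"kind": "ST", "addr": None, "done": False},
    in program order. `addr` is None while the address is still unknown.
    """
    if load_addr is None:
        return False  # the load's own address has not been computed yet
    if policy == "in_order":
        # Highly conservative: every older memory operation must have completed.
        return all(op["done"] for op in older)
    if policy == "wait_for_store_addrs":
        # Not so conservative: every older store address must be known.
        return all(op["addr"] is not None for op in older if op["kind"] == "ST")
    if policy == "optimistic":
        # Go as soon as the load address is known; check for conflicts later.
        return True
    raise ValueError(f"unknown policy: {policy}")


def violated(load_addr, resolved_store_addr):
    """Optimistic case: an older store that resolves to the load's address
    means the load went too early and must be redone."""
    return load_addr == resolved_store_addr
```

With the `wait_for_store_addrs` policy, a load whose address matches a known older store still needs that store's data, for example forwarded from the buffer; that step is left out of this sketch.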

The P6 Execution Microarchitecture

[making dynamic scheduling work at wide issue]

[Figure: pipeline blocks Fetch/Decode, Renaming, Scheduling/Execution, Memory, Retirement; the front end is in-order, scheduling/execution is out-of-order, retirement is in-order]

The P6 Register Alias Table

[Figure: RAT entry fields: RRF Valid, ROB Entry Number. Inputs: Srcs for μop0, μop1, μop2; Dests for μops; ROB Allocator; from retire (Dest, ROB entry #s). Output: physical sources.]

• If the producer has already retired, the value is in the Retirement Register File (RRF Valid is 1)

• If the producer has not retired, then the value will have to be provided by the Reorder Buffer at the ROB Entry Number indicated in the RAT (RRF Valid is 0)

ReOrder Buffer (ROB) Psrc Read and Pdest Write

[Figure: ROB entry fields: V, Value, Dest, Status. Inputs: PSrcs for μop0, μop1, μop2; PDests for μops from allocator; execution results from function units. Output: values for Psrcs.]

Retirement Register File Psrc Read

[Figure: RRF entry field: Value. Inputs: PSrcs for μop0, μop1, μop2; values from ReOrder Buffer retirement. Output: values for Psrcs.]

Issue

[Figure: RAT, RRF, ROB, Reservation Station]

Pipeline stages:

1. Rename (RAT access)
2. Register Read (also ROB allocate)
3. Issue (RS allocate)

P6 Reservation Station

[Figure: RS entry fields: Entry Valid, Opcode, Psrc0 tag, Psrc0 data, Psrc0 V, Psrc1 tag, Psrc1 data, Psrc1 V, ROB Entry #]

Up to three μops per cycle are added to the Reservation Station.

Execution

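[Figure: Reservation Station dispatching through Port0, Port1, Port2, Port3, and Port4 to Integer Unit 0, Integer Unit 1, the floating point unit, load address generation, and store address generation. Also shown: Memory Order Buffer, Data Cache, and results going to the Reorder Buffer.]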

Memory Order Buffer

[Figure: MOB entry fields: L/S, Store ID, Address (ST Addr / LD Addr), ST Data]

• Allocation happens in order, at issue.

• Store data is buffered in MOB until retirement of that store.

• STIDs correspond to the entry of the previous store.

• P6 Rule: STs must go in-order wrt other STs. LDs can go out-of-order wrt other LDs and STs.

• LDs go as soon as address is ready. Clean up at retirement.

Retirement

[Figure: Reorder Buffer entry fields: V, Value, Dest, Status; Head Pointer]

• If Status indicates all is OK, then the value is written, or committed, to the RRF. Also, the (Dest, ROB entry number) pair is sent to the RAT to potentially set the RRF Valid bit.

• If Status indicates something went wrong, then a recovery action is started.

• Up to 3 μops can be retired per cycle.

Recovery

• ROB – flush all insts.

• RS – flush all insts.

• RRF – do nothing.

• RAT – Make all entries indicate RRF valid.

• Send new PC to Fetch Mechanism
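
The RAT side of lookup, retirement, and recovery can be sketched together. The class and method names are hypothetical, and one detail is an interpretation rather than something the slides state: here retirement sets RRF Valid only when the RAT entry still names the retiring ROB entry, which is one plausible reading of "potentially set".

```python
class RegisterAliasTable:
    def __init__(self, arch_regs):
        # Initially every architectural register lives in the RRF.
        self.entries = {r: {"rrf_valid": True, "rob": None} for r in arch_regs}

    def lookup(self, src):
        """Return the physical source: ('RRF', reg) or ('ROB', entry number)."""
        e = self.entries[src]
        return ("RRF", src) if e["rrf_valid"] else ("ROB", e["rob"])

    def allocate(self, dest, rob_entry):
        """A new producer of dest was given rob_entry by the ROB allocator."""
        self.entries[dest] = {"rrf_valid": False, "rob": rob_entry}

    def retire(self, dest, rob_entry):
        """Assumed rule: set RRF Valid only if no younger producer remapped dest."""
        e = self.entries[dest]
        if not e["rrf_valid"] and e["rob"] == rob_entry:
            e["rrf_valid"] = True

    def recover(self):
        """Recovery: make all entries indicate RRF valid."""
        for e in self.entries.values():
            e["rrf_valid"] = True
```

Recovery is cheap in this scheme because the RRF holds only committed state: once the ROB and RS are flushed, pointing every RAT entry back at the RRF restores a consistent mapping.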

Reservation Station Alternative Designs

• Value-capture reservation stations vs. tag-only reservation stations
  – Pentium 4 adjusts tags rather than moving values when retiring an instruction
  – Need to keep entries in the ROB longer, until they are no longer needed to keep the retired value visible to subsequent instructions

Other thoughts

• How many cycles for branch misprediction?

• Read Sohi and Smith for more general concepts

• Read about the MIPS 10K for details on an alternative implementation

Data Dependencies

• Read After Write (RAW): Flow
  1. MUL Ra, Rb -> Rm
  3. SUB Rm, Rn -> Rx

• Write After Write (WAW): Output
  1. MUL Ra, Rb -> Rm
  4. ADD Rr, Rs -> Rm

• Write After Read (WAR): Anti
  3. SUB Rm, Rn -> Rx
  4. ADD Rr, Rs -> Rm
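
The three kinds of dependence can be told apart by comparing register names between an earlier and a later instruction. The function below is a pairwise sketch only: it ignores any intervening writes between the two instructions, and the name `classify` and the tuple encoding are assumptions.

```python
def classify(earlier, later):
    """Return the dependences of `later` on `earlier`; each is (op, srcs, dest)."""
    _, e_srcs, e_dest = earlier
    _, l_srcs, l_dest = later
    kinds = set()
    if e_dest in l_srcs:
        kinds.add("RAW")  # flow: later reads what earlier writes
    if e_dest == l_dest:
        kinds.add("WAW")  # output: both write the same register
    if l_dest in e_srcs:
        kinds.add("WAR")  # anti: later overwrites what earlier reads
    return kinds


I1 = ("MUL", ("Ra", "Rb"), "Rm")
I3 = ("SUB", ("Rm", "Rn"), "Rx")
I4 = ("ADD", ("Rr", "Rs"), "Rm")

if __name__ == "__main__":
    print(classify(I1, I3), classify(I1, I4), classify(I3, I4))
```

For the pairs above this yields RAW for (1, 3), WAW for (1, 4), and WAR for (3, 4). Only the RAW case is a true dependence; the other two come from reusing Rm and are removed by renaming.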