
CSCE 614 Fall 2009 1

Hardware-Based Speculation

• As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing burden.

=> Speculating on the outcome of branches and executing the program as if the guesses were correct.

• Hardware Speculation


3 Key Ideas of Hardware Speculation

• Dynamic Branch Prediction
– Choose which instruction to execute.

• Speculation
– Allow the execution of instructions before the control dependences are resolved (with the ability to undo the effect of an incorrectly speculated sequence).

• Dynamic Scheduling
– Deal with the scheduling of different combinations of basic blocks.


Examples

• PowerPC 603/604/G3/G4

• MIPS R10000/12000

• Intel Pentium II/III/4

• Alpha 21264

• AMD K5/K6/Athlon


Hardware Speculation

• Extended Tomasulo’s algorithm

• Additional step (instruction commit) required

• Allow instructions to execute out of order but force them to commit in order.

• Any irrevocable action (updating state or taking an exception) is prevented until an instruction commits.


Reorder Buffer (ROB)

• Holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits.

• Source of operands for instructions

• With speculation, the register file (or memory) is not updated until the instruction commits.


ROB Fields

• Instruction type: indicates whether the instruction is a branch, a store, or a register operation (ALU or Load).

• Destination: supplies the register number (for loads and ALU operations) or the memory address (for stores).

• Value: holds the value of the instruction result until the instruction commits.

• Ready: indicates that the instruction has completed execution, and the value is ready.
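
The four fields above can be pictured as a small record. This is only a sketch: the names `InstrType` and `ROBEntry` and the Python types are assumptions made for illustration, not part of the slides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class InstrType(Enum):
    BRANCH = "branch"
    STORE = "store"
    REGISTER = "register"  # ALU operation or load


@dataclass
class ROBEntry:
    instr_type: InstrType
    # Register name for loads and ALU operations, memory address for stores.
    destination: Optional[Union[str, int]] = None
    value: Optional[float] = None  # held here until the instruction commits
    ready: bool = False            # True once execution has completed


# Hypothetical usage: an ALU result for F0 arrives and is marked ready.
entry = ROBEntry(InstrType.REGISTER, destination="F0")
entry.value, entry.ready = 3.5, True
```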


[Figure: the four stages Issue, Execute, Write result (to ROB), and Commit (write to RF, MEM), shown with the Reservation Stations and the Reorder Buffer (ROB)]



Basic Structure of MIPS FP Unit

The ROB completely replaces the store buffer.

The renaming function of the reservation stations is replaced by the ROB.


4 Steps of Execution

1. Issue (also called “dispatch”)

- Get an instruction from the instruction queue.

- Issue the instruction if there is an empty reservation station and an empty slot in ROB.

- If either all reservation stations are full or the ROB is full, then instruction issue is stalled.
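
A minimal sketch of the issue rule, assuming the free reservation stations are tracked as a simple count and the ROB as a bounded list. `try_issue` and its parameters are hypothetical names.

```python
from collections import deque


def try_issue(instr_queue, free_stations, rob, rob_capacity):
    """Issue the oldest queued instruction, or return None on a stall."""
    if not instr_queue:
        return None
    # Stall if all reservation stations are busy or the ROB is full.
    if free_stations == 0 or len(rob) >= rob_capacity:
        return None
    instr = instr_queue.popleft()
    rob.append(instr)  # allocate the next ROB slot in program order
    return instr


queue = deque(["L.D", "MUL.D"])
rob = []
first = try_issue(queue, free_stations=1, rob=rob, rob_capacity=1)
second = try_issue(queue, free_stations=1, rob=rob, rob_capacity=1)  # ROB full
```

Here `first` is the L.D, and `second` is `None` because the one-entry ROB is already occupied, so the MUL.D stays in the queue.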


2. Execute

- If one or more of the operands is not yet available, monitor the CDB while waiting for the register to be computed.

- RAW hazards are also checked.

- When both operands are available at a reservation station, execute the operation.

- Loads require two steps (effective address calculation, then memory access).

- Stores need one step (effective address calculation).


3. Write Result

- When the result is available, write it on the CDB and from the CDB into the ROB, as well as to any reservation stations waiting for this result.

- For stores, if the value to be stored is available, it is written into the Value field of the ROB entry for the store.
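
The broadcast in this step can be sketched as follows. The dictionaries and the field names `qj`, `qk` (ROB tag still awaited) and `vj`, `vk` (operand value) follow the usual Tomasulo notation but are assumptions here, as is the function name `write_result`.

```python
def write_result(tag, value, rob, stations):
    """Put a result on the CDB: update the ROB entry and any waiting stations."""
    rob[tag]["value"] = value
    rob[tag]["ready"] = True
    for rs in stations:
        for q, v in (("qj", "vj"), ("qk", "vk")):
            if rs[q] == tag:      # this station was waiting on the result
                rs[v] = value     # capture the operand
                rs[q] = None      # operand no longer pending


rob = {3: {"value": None, "ready": False}}
stations = [{"qj": 3, "qk": None, "vj": None, "vk": 2.0}]
write_result(3, 1.5, rob, stations)
```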


4. Commit (also called “completion” or “graduation”)

- Normal commit: When an instruction reaches the head of the ROB and its result is present in the buffer, the processor updates the register with the result and removes the instruction from the ROB.

- Store commit: Similar except that memory is updated.

- Branch with an incorrect prediction: The speculation is wrong. The ROB is flushed and execution is restarted at the correct successor of the branch.
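
The three commit cases can be sketched with the ROB as a queue whose head is the oldest instruction. The entry layout (`type`, `dest`, `value`, `ready`, `mispredicted`) and the function name `commit_one` are assumptions for illustration only.

```python
from collections import deque


def commit_one(rob, regs, memory):
    """Try to commit the instruction at the head of the ROB."""
    if not rob or not rob[0]["ready"]:
        return "stall"  # in-order commit: nothing may bypass the head
    head = rob.popleft()
    if head["type"] == "branch":
        if head["mispredicted"]:
            rob.clear()  # discard every speculated instruction after the branch
            return "flush"
        return "commit"
    if head["type"] == "store":
        memory[head["dest"]] = head["value"]  # memory is updated only now
    else:
        regs[head["dest"]] = head["value"]    # register file is updated only now
    return "commit"
```

Because registers and memory are written only in this function, flushing the ROB is enough to undo a wrong speculation in this simplified model.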


Example

L.D F6, 34(R2)
L.D F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Show the state when the MUL.D is ready to commit.


Example (p.109)

Loop: L.D F0, 0(R1)
      MUL.D F4, F0, F2
      S.D F4, 0(R1)
      DADDIU R1, R1, #-8
      BNE R1, R2, Loop

Assume that we have issued all the instructions in the loop twice. Assume that L.D and MUL.D from the first iteration have committed and all other instructions have completed execution. Show the contents of the ROB and the FP registers.



• Because neither the register values nor any memory values are actually written until an instruction commits, the processor can easily undo its speculative actions when a branch is found to be mispredicted.

• Exceptions are handled by not recognizing an exception until the instruction that raised it is ready to commit.



• Figure 2.17 (p.113)


Multiple-Issue Processors

• Allow multiple instructions to issue in a clock cycle.

• Ideal CPI < 1

• 3 flavors
– Statically Scheduled Superscalar
– Dynamically Scheduled Superscalar
– VLIW (Very Long Instruction Word)


Superscalar Processors

• Issue varying numbers of instructions per clock
– statically scheduled
  • using compiler techniques
  • in-order execution
– dynamically scheduled
  • Tomasulo’s algorithm
  • out-of-order execution


VLIW Processors

• Issue a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among instructions explicitly indicated by the instruction (EPIC: Explicitly Parallel Instruction Computers).

• Statically scheduled by the compiler.


Name                      | Issue structure | Hazard detection   | Scheduling               | Distinguishing characteristic            | Examples
Superscalar (static)      | dynamic         | hardware           | static                   | in-order execution                       | MIPS and ARM (embedded)
Superscalar (dynamic)     | dynamic         | hardware           | dynamic                  | some out-of-order execution              | None
Superscalar (speculative) | dynamic         | hardware           | dynamic with speculation | out-of-order execution with speculation  | Pentium 4, MIPS R12K, Alpha 21264, IBM Power5
VLIW/LIW                  | static          | primarily software | static                   | all hazards determined by compiler       | TI C6x (embedded)
EPIC                      | mostly static   | mostly software    | mostly static            | all hazards determined by compiler       | Itanium


Multiple Instruction Issue with Dynamic Scheduling

• Two-issue dynamically scheduled processor
– It can issue any pair of instructions if there are reservation stations of the right type available.
– Extended Tomasulo’s algorithm

Note that Tomasulo’s algorithm (and Hardware Speculation) is used for both integer operations and FP operations.


• Two approaches to implement
– Issue one instruction in half a clock cycle, so that two instructions can be processed in one clock cycle.
– Build the logic necessary to handle two instructions at once, including any possible dependences between the instructions.

• Modern superscalar processors that issue 4 or more instructions per clock often include both approaches.


How to Handle Branches?

• Dynamically scheduled processors
– Only allow instructions to be fetched and issued (but not actually executed) until the branch has completed.
– IBM 360/91

• Processors with hardware speculation
– Can actually execute instructions based on branch prediction.


• Note that we consider loads and stores, including those to FP registers, as integer operations.

• Assume that FP adds take 3 execution cycles.

• Latency: [figure: Execute, Write CDB]


• The throughput improvement versus a single-issue pipeline is small.
– There is only one FP operation per iteration.
– There is only one Integer ALU for both integer ALU operations and effective address calculations.

• Larger improvements would be possible if the processor could execute more integer operations per cycle.


Multiple Issue with Speculation

• We process multiple instructions per clock, assigning reservation stations and reorder buffer entries to the instructions.

• To maintain throughput of greater than one instruction per cycle, a speculative processor must be able to handle multiple instruction commits per clock cycle.
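
A rough sketch of why commit width matters, assuming the ROB is reduced to a list of ready flags in program order. `commit_cycle` and `width` are hypothetical names.

```python
def commit_cycle(rob_ready_flags, width):
    """Return how many instructions can commit in one clock cycle."""
    committed = 0
    # Commit from the head, in order, until the width limit or a non-ready entry.
    while (committed < width
           and committed < len(rob_ready_flags)
           and rob_ready_flags[committed]):
        committed += 1
    return committed
```

With `width=1` at most one instruction leaves the ROB per cycle, so sustained throughput cannot exceed one instruction per cycle no matter how many instructions are issued.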


Example (p.119)

Loop: LD R2, 0(R1)
      DADDIU R2, R2, #1
      SD R2, 0(R1)
      DADDIU R1, R1, #8
      BNE R2, R3, Loop

Consider the execution of the loop on a two-issue processor, once without speculation (dynamic scheduling/Tomasulo’s algorithm) and once with speculation. Assume that there are separate integer functional units for effective address calculation, for ALU operations, and for branch condition evaluation. Assume that there are 2 CDBs. Assume that up to two instructions of any type can commit per clock for a processor with speculation. Show the execution timing of the first three iterations of the loop.


High-Performance Instruction Delivery

• For multiple-issue (delivering 4 to 8 instructions per clock cycle) processors
– Branch-target buffers
– Integrated instruction fetch unit
– Return address prediction


Branch-Target Buffers

• To reduce the branch penalty for the classic 5-stage pipeline, we want to know what address to fetch by the end of IF.

• Branch-target buffer: a branch-prediction cache that stores the predicted address for the next instruction after a branch.

• We access the buffer during the IF stage using the instruction address. (We don’t know what the instruction is.)



Branch-Target Cache

Optional. May be used for extra prediction state bits.



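• We only need to store the predicted-taken branches in the branch-target buffer.
– Why? A branch predicted not taken simply fetches the next sequential instruction, exactly as a non-branch instruction would.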

• No branch delay if a branch-prediction entry is found and is correct.
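
A minimal sketch of a branch-target buffer lookup during IF. It assumes an unbounded dictionary in place of a fixed-size cache, 4-byte instructions, and a simple policy that drops an entry when its branch turns out not taken. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self):
        self._targets = {}  # branch PC -> predicted target PC

    def predict_next_pc(self, pc, instr_size=4):
        # Hit: fetch from the predicted target. Miss: fetch sequentially.
        return self._targets.get(pc, pc + instr_size)

    def update(self, pc, taken, target):
        if taken:
            self._targets[pc] = target   # keep only predicted-taken branches
        else:
            self._targets.pop(pc, None)  # assumed policy: drop on not taken


btb = BranchTargetBuffer()
miss_pc = btb.predict_next_pc(0x100)   # no entry yet, so sequential fetch
btb.update(0x100, taken=True, target=0x80)
hit_pc = btb.predict_next_pc(0x100)    # entry found, so fetch the target
```

The lookup uses only the instruction address, which is why it can be done before the instruction is decoded.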


Return Address Predictors

• Predicting indirect jumps (destination address varies at run time)
– Procedure returns, procedure calls, case, select, etc.
– SPEC89: 85% of indirect jumps are procedure returns.

• A small buffer of return addresses operating as a stack
– Caches the most recent return addresses
– Push a return address on the stack at a call
– Pop one off at a return
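
The push and pop behavior can be sketched as below. The default depth of 8 and the choice to discard the oldest entry on overflow are assumptions, as are the class and method names.

```python
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self._stack = []

    def on_call(self, return_address):
        if len(self._stack) == self.depth:
            self._stack.pop(0)  # buffer is small: discard the oldest entry
        self._stack.append(return_address)

    def on_return(self):
        # Predicted return target, or None if the buffer is empty.
        return self._stack.pop() if self._stack else None


ras = ReturnAddressStack(depth=2)
for addr in (0x10, 0x20, 0x30):  # three nested calls overflow a depth of 2
    ras.on_call(addr)
predictions = [ras.on_return(), ras.on_return(), ras.on_return()]
```

In this run the two innermost returns are predicted (0x30, then 0x20) and the outermost gets no prediction, which illustrates that only the most recent return addresses are cached.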


Integrated Instruction Fetch Units

• A separate autonomous unit that feeds instructions to the rest of the pipeline for multiple-issue processors.

• Have several functions
– Integrated branch prediction
– Instruction prefetch
– Instruction memory access and buffering