CSCE 614 Fall 2009 1
Hardware-Based Speculation
• As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing burden.
=> Speculating on the outcome of branches and executing the program as if the guesses were correct.
• Hardware Speculation
3 Key Ideas of Hardware Speculation
• Dynamic Branch Prediction
– Choose which instruction to execute.
• Speculation
– Allow the execution of instructions before the control dependences are resolved (with the ability to undo the effect of an incorrectly speculated sequence).
• Dynamic Scheduling
– Deal with the scheduling of different combinations of basic blocks.
Examples
• PowerPC 603/604/G3/G4
• MIPS R10000/12000
• Intel Pentium II/III/4
• Alpha 21264
• AMD K5/K6/Athlon
Hardware Speculation
• Extended Tomasulo’s algorithm
• Additional step (instruction commit) required
• Allow instructions to execute out of order, but force them to commit in order.
• Any irrevocable action (updating state or taking an exception) is prevented until an instruction commits.
Reorder Buffer (ROB)
• Holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits.
• Source of operands for instructions
• With speculation, the register file (or memory) is not updated until the instruction commits.
ROB Fields
• Instruction type: indicates whether the instruction is a branch, a store, or a register operation (ALU or load).
• Destination: supplies the register number (for loads and ALU operations) or the memory address (for stores).
• Value: holds the value of the instruction result until the instruction commits.
• Ready: indicates that the instruction has completed execution, and the value is ready.
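The four fields above map naturally onto a small record type. The following is a minimal sketch, not a description of any real processor; the names `InstrType` and `ROBEntry` and the Python field types are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class InstrType(Enum):
    """Instruction type field (hypothetical encoding)."""
    BRANCH = "branch"
    STORE = "store"
    REGISTER = "register"  # ALU operation or load


@dataclass
class ROBEntry:
    """One reorder buffer entry with the four fields listed on the slide."""
    instr_type: InstrType
    # Register name (loads, ALU ops) or memory address (stores).
    destination: Optional[Union[str, int]] = None
    # Result held here until the instruction commits.
    value: Optional[float] = None
    # True once execution has completed and `value` is valid.
    ready: bool = False


# An ALU instruction writing F0 is allocated an entry at issue...
entry = ROBEntry(InstrType.REGISTER, destination="F0")
# ...and the entry is filled in when the result is written.
entry.value, entry.ready = 3.5, True
```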
Hardware Speculation
[Figure: pipeline steps Issue, Execute, Write result (to ROB), Commit (write to RF, MEM), shown with the Reservation Stations and the Reorder Buffer (ROB)]
Basic Structure of MIPS FP Unit
• The ROB completely replaces the store buffer.
• The renaming function of the reservation stations is replaced by the ROB.
4 Steps of Execution
1. Issue (also called “dispatch”)
- Get an instruction from the instruction queue.
- Issue the instruction if there is an empty reservation station and an empty slot in ROB.
- If either all reservation stations are full or the ROB is full, then instruction issue is stalled.
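The issue condition can be written as a one-line predicate. This is a simplified sketch: the function name `can_issue` and its parameters are hypothetical, and it treats reservation stations as a single pool, whereas a real design checks for a free station of the type the instruction needs.

```python
def can_issue(free_reservation_stations: int, rob_size: int, rob_occupancy: int) -> bool:
    """Return True if an instruction may issue this cycle.

    Issue needs BOTH a free reservation station and a free ROB slot;
    if either resource is exhausted, issue stalls.
    """
    has_free_station = free_reservation_stations > 0
    has_free_rob_slot = rob_occupancy < rob_size
    return has_free_station and has_free_rob_slot
```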
2. Execute
- If one or more of the operands is not yet available, monitor the CDB while waiting for the register to be computed.
- This step checks for RAW hazards.
- When both operands are available at a reservation station, execute the operation.
- Loads require two steps (effective address calculation, then the memory access that reads the source operand).
- Stores need one step (effective address calculation).
3. Write Result
- When the result is available, write it on the CDB and from the CDB into the ROB, as well as to any reservation stations waiting for this result.
- For stores, if the value to be stored is available, it is written into the Value field of the ROB entry for the store.
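The Write Result step can be pictured as a broadcast keyed by ROB entry number. The sketch below is illustrative only: `Operand`, `broadcast`, and the dictionary standing in for the ROB Value fields are assumed names, and a single CDB with one result per call is assumed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Operand:
    """One source operand held in a reservation station (hypothetical model)."""
    value: Optional[float] = None
    waiting_on: Optional[int] = None  # ROB entry number that will produce the value


def broadcast(rob_tag: int, result: float,
              rob_values: Dict[int, float], operands: List[Operand]) -> None:
    """Model one CDB broadcast: store the result in the ROB, then wake up waiters."""
    rob_values[rob_tag] = result
    for op in operands:
        if op.waiting_on == rob_tag:
            op.value = result      # capture the value from the CDB
            op.waiting_on = None   # operand is now available


rob_values: Dict[int, float] = {}
a = Operand(waiting_on=3)  # waits for ROB entry 3
b = Operand(waiting_on=5)  # waits for a different entry
broadcast(3, 7.5, rob_values, [a, b])
```

After the call, operand `a` holds the value and `b` is still waiting, which is why an instruction with two pending operands may need two separate broadcasts before it can execute.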
4. Commit (also called “completion” or “graduation”)
- Normal commit: When an instruction reaches the head of the ROB and its result is present in the buffer, the processor updates the register with the result and removes the instruction from the ROB.
- Store commit: Similar, except that memory is updated.
- Branch with an incorrect prediction: The speculation is wrong. The ROB is flushed and execution is restarted at the correct successor of the branch.
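The three commit cases can be sketched as a loop over the head of the ROB. This is a toy model under stated assumptions: entries are plain dictionaries with hypothetical keys (`type`, `dest`, `value`, `ready`, `mispredicted`, `target`), and the loop retires every ready entry it reaches rather than modelling a per-cycle commit limit.

```python
from collections import deque


def commit(rob, registers, memory):
    """Retire ready entries from the head of the ROB, in program order.

    Returns the correct branch target if a mispredicted branch was
    reached (the ROB is flushed), otherwise None.
    """
    while rob and rob[0]["ready"]:
        head = rob.popleft()
        if head["type"] == "branch":
            if head["mispredicted"]:
                rob.clear()            # discard all speculative work
                return head["target"]  # restart at the correct successor
        elif head["type"] == "store":
            memory[head["dest"]] = head["value"]     # store commit
        else:
            registers[head["dest"]] = head["value"]  # normal commit
    return None


rob = deque([
    {"type": "register", "dest": "F0", "value": 2.0, "ready": True},
    {"type": "store", "dest": 100, "value": 2.0, "ready": True},
    {"type": "register", "dest": "F4", "value": None, "ready": False},
    {"type": "register", "dest": "F6", "value": 9.0, "ready": True},
])
regs, mem = {}, {}
redirect = commit(rob, regs, mem)
```

In this run the F6 entry stays in the ROB even though its result is ready, because the unfinished F4 entry ahead of it blocks in-order commit.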
Example
L.D F6, 34(R2)
L.D F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2
When the MUL.D is ready to commit.
Example (p.109)
Loop: L.D F0, 0(R1)
      MUL.D F4, F0, F2
      S.D F4, 0(R1)
      DADDIU R1, R1, #-8
      BNE R1, R2, Loop
Assume that we have issued all the instructions in the loop twice.
Assume that L.D and MUL.D from the first iteration have committed and all other instructions have completed execution.
Show the contents of the ROB and the FP registers.
Hardware Speculation
• Because neither the register values nor any memory values are actually written until an instruction commits, the processor can easily undo its speculative actions when a branch is found to be mispredicted.
• Exceptions are handled by not recognizing an exception until the instruction that raised it is ready to commit.
Hardware Speculation
• Figure 2.17 (p.113)
Multiple-Issue Processors
• Allow multiple instructions to issue in a clock cycle.
• Ideal CPI < 1
• 3 flavors
– Statically Scheduled Superscalar
– Dynamically Scheduled Superscalar
– VLIW (Very Long Instruction Word)
Superscalar Processors
• Issue varying numbers of instructions per clock
– statically scheduled
  • using compiler techniques
  • in-order execution
– dynamically scheduled
  • Tomasulo’s algorithm
  • out-of-order execution
VLIW Processors
• Issue a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among instructions explicitly indicated by the instruction (EPIC: Explicitly Parallel Instruction Computers).
• Statically scheduled by the compiler.
Name                      | Issue structure | Hazard detection | Scheduling             | Distinguishing characteristic         | Examples
Superscalar (static)      | dynamic         | h/w              | static                 | in-order execution                    | MIPS and ARM (embedded)
Superscalar (dynamic)     | dynamic         | h/w              | dynamic                | some out-of-order execution           | None
Superscalar (speculative) | dynamic         | h/w              | dynamic w/ speculation | out-of-order execution w/ speculation | Pentium 4, MIPS R12K, Alpha 21264, IBM Power5
VLIW/LIW                  | static          | primarily s/w    | static                 | all hazards determined by compiler    | TI C6x (embedded)
EPIC                      | mostly static   | mostly s/w       | mostly static          | all hazards determined by compiler    | Itanium
Multiple Instruction Issue with Dynamic Scheduling
• Two-issue dynamically scheduled processor
– It can issue any pair of instructions if there are reservation stations of the right type available.
– Extended Tomasulo’s algorithm
Note that Tomasulo’s algorithm (and hardware speculation) is used for both integer operations and FP operations.
• Two approaches to implementing multiple issue
– Issue one instruction in half a clock cycle, so that two instructions can be processed in one clock cycle.
– Build the logic necessary to handle two instructions at once, including any possible dependences between the instructions.
• Modern superscalar processors that issue 4 or more instructions per clock often include both approaches.
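The second approach requires the issue logic to detect a dependence inside the pair being issued, so that the second instruction waits for the first one's result instead of reading a stale register. A minimal sketch of that check follows; the function name and the `(destination, sources)` tuple encoding are assumptions for illustration.

```python
from typing import Optional, Tuple

# An instruction is modelled as (destination register or None, source registers).
Instr = Tuple[Optional[str], Tuple[str, ...]]


def pair_has_raw_dependence(first: Instr, second: Instr) -> bool:
    """Return True if `second` reads the register that `first` writes."""
    dest, _ = first
    _, sources = second
    return dest is not None and dest in sources


# LD R2, 0(R1) followed by DADDIU R2, R2, #1: the second reads R2.
ld = ("R2", ("R1",))
daddiu = ("R2", ("R2",))
# DADDIU R1, R1, #8 followed by BNE R2, R3, Loop: no dependence in the pair.
incr = ("R1", ("R1",))
bne = (None, ("R2", "R3"))
```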
How to Handle Branches?
• Dynamically scheduled processors
– Only allow instructions to be fetched and issued (but not actually executed) until the branch has completed.
– IBM 360/91
• Processors with hardware speculation
– Can actually execute instructions based on branch prediction.
• Note that we consider loads and stores, including those to FP registers, as integer operations.
• Assume that FP adds take 3 execution cycles.
• Latency: Execute to Write CDB
• The throughput improvement versus a single-issue pipeline is small.
– There is only one FP operation per iteration.
– There is only one integer ALU for both integer ALU operations and effective address calculations.
• Larger improvements would be possible if the processor could execute more integer operations per cycle.
Multiple Issue with Speculation
• We process multiple instructions per clock, assigning reservation stations and reorder buffer entries to the instructions.
• To maintain throughput of greater than one instruction per cycle, a speculative processor must be able to handle multiple instruction commits per clock cycle.
Example (p.119)
Loop: LD R2, 0(R1)
      DADDIU R2, R2, #1
      SD R2, 0(R1)
      DADDIU R1, R1, #8
      BNE R2, R3, Loop
Consider the execution of the loop on a two-issue processor, once without speculation (dynamic scheduling/Tomasulo’s algorithm) and once with speculation.
Assume that there are separate integer functional units for effective address calculation, for ALU operations, and for branch condition evaluation.
Assume that there are 2 CDBs.
Assume that up to two instructions of any type can commit per clock for a processor with speculation.
Show the execution timing of the first three iterations of the loop.
High-Performance Instruction Delivery
• For multiple-issue processors (delivering 4 to 8 instructions per clock cycle)
– Branch-target buffers
– Integrated instruction fetch unit
– Return address prediction
Branch-Target Buffers
• To reduce the branch penalty for the classic 5-stage pipeline, we want to know what address to fetch by the end of IF.
• Branch-target buffer: a branch-prediction cache that stores the predicted address for the next instruction after a branch.
• We access the buffer during the IF stage using the instruction address. (We don’t know what the instruction is.)
Branch-Target Buffers
[Figure: branch-target cache. Annotation: Optional. May be used for extra prediction state bits.]
Branch-Target Buffers
• We only need to store the predicted-taken branches in the branch-target buffer.
– Why?
• No branch delay if a branch-prediction entry is found and is correct.
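A branch-target buffer lookup can be sketched as a table indexed by the fetch address. The model below is deliberately simplified and its names (`BranchTargetBuffer`, `predict_next_pc`, `update`) are hypothetical: a Python dictionary stands in for what would really be a small tagged cache of limited size, and 4-byte instructions are assumed. A miss simply predicts the sequential address, which is the same prediction an untaken branch would get, so only taken branches are worth an entry.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the address of a predicted-taken branch to its target."""

    def __init__(self):
        self._targets = {}  # branch PC -> predicted target PC

    def predict_next_pc(self, pc: int, instr_size: int = 4) -> int:
        """Called during IF with only the instruction address.

        Hit: fetch from the stored target. Miss: fetch sequentially.
        """
        return self._targets.get(pc, pc + instr_size)

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Called once the branch outcome is known."""
        if taken:
            self._targets[pc] = target    # remember taken branches
        else:
            self._targets.pop(pc, None)   # drop entries that stop being taken


btb = BranchTargetBuffer()
first_guess = btb.predict_next_pc(0x100)   # miss: falls through to 0x104
btb.update(0x100, taken=True, target=0x80)
second_guess = btb.predict_next_pc(0x100)  # hit: predicted target 0x80
```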
Return Address Predictors
• Predicting indirect jumps (destination address varies at run time)
– Procedure returns, procedure calls, case, select, etc.
– SPEC89: 85% of indirect jumps are procedure returns.
• A small buffer of return addresses operating as a stack
– Caches the most recent return addresses
– Push a return address on the stack at a call
– Pop one off at a return
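The push-at-call, pop-at-return behaviour can be sketched as a bounded stack. This is an illustrative model, not a specific processor's design: the class name, the default capacity of 8, and the policy of discarding the oldest address on overflow are all assumptions.

```python
from typing import List, Optional


class ReturnAddressStack:
    """Toy return address predictor: a small stack of recent return addresses."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self._stack: List[int] = []

    def on_call(self, return_address: int) -> None:
        """Push the return address when a call is fetched."""
        if len(self._stack) == self.capacity:
            self._stack.pop(0)  # buffer is full: forget the oldest address
        self._stack.append(return_address)

    def predict_return(self) -> Optional[int]:
        """Pop the predicted target at a return; None if the stack is empty."""
        return self._stack.pop() if self._stack else None


ras = ReturnAddressStack(capacity=2)
ras.on_call(0x10)
ras.on_call(0x20)
ras.on_call(0x30)  # overflows the 2-entry stack, so 0x10 is lost
predictions = [ras.predict_return() for _ in range(3)]
```

With a 2-entry stack and three nested calls, the two innermost returns are predicted correctly and the outermost one has no prediction, which is why call depth relative to buffer size determines accuracy.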
Integrated Instruction Fetch Units
• A separate autonomous unit that feeds instructions to the rest of the pipeline for multiple-issue processors.
• Have several functions
– Integrated branch prediction
– Instruction prefetch
– Instruction memory access and buffering