
Dynamic Scheduling
• Pipelines rely on instruction flow as scheduled by the compiler (whether optimized or not)
– We could continue to fetch instructions as scheduled by the compiler, but we could use hardware to anticipate problems and dynamically schedule the instructions to the execution units (here referred to as functional units)
– This was the approach taken in the CDC 6600 in the late 1960s, using a new piece of hardware called the scoreboard
– The ID unit has an additional duty
• issuing the instruction to the proper functional unit when the circumstances permit it
– The scoreboard keeps track of the waiting and executing instructions, monitoring data dependencies and structural hazards
• this leads to out-of-order completion of instructions, but as long as data dependencies are maintained, this is only a problem when we have exceptions (we'll worry about that later)

Example to Motivate the Scoreboard
• Consider the following code:
– L.D F0, 0(R3)
– MUL.D F2, F0, F10
– ADD.D F4, F2, F12
– DIV.D F6, F8, F14
• In the MIPS FP pipeline, we have a data hazard after the L.D resulting in a stall, and a data hazard after the MUL.D resulting in 6 stalls
– The ADD.D cannot move into A1 until F2 has been forwarded out of MUL.D's M7 stage, thus the ADD.D stalls for 6 cycles in ID and the DIV.D stalls for 6 cycles in IF
– There is no need for the DIV.D to stall since it is not data dependent on previous instructions, but there is a structural hazard in that ID cannot admit the DIV.D while the ADD.D is waiting there
– The scoreboard gets around such problems

The New MIPS Pipeline
• The IF stage is the same
• The ID stage is now broken into two stages: decode/issue and read operands
– Decode/issue decodes the instruction, tracks data dependencies, and updates the scoreboard
– It then passes the instruction on to the proper functional unit for execution
– Read operands is moved to the functional units, which obtain register values directly from the register file or from functional units that forward their results
– Functional units write directly to the register file, thus performing the WB step
• We might have additional FUs, such as another FP adder or FP multiplier, as needed

Issuing Logic
• Given an instruction, examine its
– Source registers – do either of its values need to be forwarded from other FUs?
– Destination register – does the result get forwarded to another FU?
• here, we have to be aware of WAW and WAR hazards
– Instruction type – is a functional unit available?
• If a functional unit is available and no other active instruction has the same destination register, then issue the instruction
– If another instruction has this destination register, then it creates a potential WAW hazard
– If either no functional unit is available or a WAW hazard exists, we postpone issuing this instruction, which stalls the pipeline
• but notice that we no longer stall the pipeline for situations that should not require it, like the prior example

WAW and WAR Hazards
• In the 5-stage pipeline, the only type of data hazard was the RAW hazard (read after write)
– LD R1, 0(R2)
– DADDI R3, R1, #1
• we are supposed to read R1 after we write R1, but because of where LD accesses the data cache, if we didn't prevent it, we would read R1 in DADDI's ID stage before the LD could write to R1
• The WAW hazard is a write after write – the writes occur in the wrong order
– This is only possible if we permit out-of-order completion, but that is necessary when we introduce floating-point, multiply and divide operations to the 5-stage pipeline
• The WAR hazard is a write after read, but performed in the wrong order because the read has been postponed
– This can happen with the scoreboard
– We'll visit a WAR example in a little bit, after we explore how the scoreboard works

Scoreboard Structure
• The scoreboard is a data structure which maintains three types of information
– Instruction status (which stage each issued instruction is in)
– Functional unit status (for each functional unit: its status, where its destination and source registers come from, and boolean flags indicating whether the source values are available yet)
– Register result status (for each register, which functional unit, if any, will write to it)

How the Scoreboard Works
• IF stage – same as in the pipeline, just fetch the next instruction
• Issue stage – take the instruction from the IF stage
– If the functional unit is not busy and the destination register does not have a pending result (WAW hazard), issue the instruction:
• Busy ← 'yes'
• Op ← operation
• Fi ← destination register, Fj/Fk ← source registers
• Qj/Qk ← functional unit producing each source register (if any)
• Rj/Rk ← 'yes'/'no' depending on whether the source value is available yet
• Result register status for Fi ← functional unit name
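The issue-stage bookkeeping above can be sketched in software. This is a hedged sketch, not the CDC 6600's actual logic: the function name `try_issue` and the dict-based tables are invented for illustration.

```python
def try_issue(instr, fu_status, reg_result):
    """Issue instr to a free functional unit if there is no WAW hazard.

    instr: dict with 'op', 'fu' (unit type), 'fi' (destination),
           'fj'/'fk' (source registers).
    fu_status: unit name -> scoreboard row for that unit, or None if idle.
    reg_result: register -> unit that will write it (register result status).
    Returns the unit issued to, or None when the instruction must stall.
    """
    # Structural hazard check: we need an idle unit of the right type.
    unit = next((u for u, row in fu_status.items()
                 if row is None and u.startswith(instr['fu'])), None)
    if unit is None:
        return None
    # WAW hazard check: an active instruction already targets Fi.
    if reg_result.get(instr['fi']) is not None:
        return None
    # Fill in the unit's row: Busy, Op, Fi, Fj/Fk, Qj/Qk, Rj/Rk.
    fu_status[unit] = {
        'busy': True, 'op': instr['op'], 'fi': instr['fi'],
        'fj': instr['fj'], 'fk': instr['fk'],
        'qj': reg_result.get(instr['fj']),          # producer of Fj, if any
        'qk': reg_result.get(instr['fk']),          # producer of Fk, if any
        'rj': reg_result.get(instr['fj']) is None,  # Fj value available now?
        'rk': reg_result.get(instr['fk']) is None,  # Fk value available now?
    }
    reg_result[instr['fi']] = unit                  # register result status
    return unit
```

A MUL.D issued behind an L.D of its source register gets Qj pointing at the load's unit and Rj set to 'no' – exactly the fields listed above.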

Continued
• Read operands
– Occurs at the functional unit when Rj and Rk are both 'yes'
• Rj/Rk ← 'no', Qj/Qk ← 0, and the Fj/Fk values are read
• Execution takes place in the functional unit
– Execution complete is set to true when the functional unit completes execution
• Write result
– This step takes place after execution is complete and after any potential WAR hazard goes away (because the pending read takes place elsewhere)
– This stage takes a cycle by itself, so it happens the cycle after execution completes
• Result register ← result, Busy ← 'no', and the functional unit becomes available
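The WAR check in the write-result stage can be expressed as a predicate: a unit may write its destination only when no other active unit still needs to read the old value, i.e. has a source register equal to Fi with its ready flag still 'yes'. A sketch, with invented names and dict-based rows:

```python
def can_write_result(unit, fu_status):
    """Return True when `unit` may enter write result without a WAR hazard.

    fu_status: unit name -> row dict with 'fi', 'fj', 'fk', 'rj', 'rk'
    (or None for an idle unit). A pending read is a source register that
    matches our destination Fi with its ready flag still 'yes', meaning
    the old register value has not been read yet.
    """
    fi = fu_status[unit]['fi']
    for other, row in fu_status.items():
        if other == unit or row is None:
            continue
        if (row['fj'] == fi and row['rj']) or (row['fk'] == fi and row['rk']):
            return False   # stall: overwriting Fi would break the pending read
    return True
```

This is the cycle-19 situation in the example that follows: the ADD.D cannot write F6 while the DIV.D still holds F6 as an unread source operand.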

Example
• Given the code under the instruction list with the following assumptions:
– Execution time in cycles: Add = 2, Multiply = 10, Divide = 40
– Register write and register read of the same register occur in separate cycles
– Only 1 FU can read operands at a time
– FUs: 1 integer unit, 1 FP divider, 2 FP multipliers, 1 FP adder

Continued
• During cycles 2-4, the second L.D cannot issue because it also requires the same integer functional unit, so the scoreboard just tracks the one L.D as it goes through the read operands, execution and write result stages

Continued
• Example continues:
• The second L.D is issued
• MUL.D is issued while L.D reads operands
• SUB.D is issued while L.D executes the load, but MUL.D cannot yet read operand F2 (F0 is read)
• At this point, L.D's execution has completed, MUL.D is waiting to read F2, SUB.D is waiting to read F2, and DIV.D is issued

Continued
• In cycle 9:
– L.D writes its result; MUL.D must wait 1 cycle to read the register because the same register cannot be read while it is being written
• In cycle 10:
– L.D leaves the scoreboard; MUL.D reads its registers; SUB.D and DIV.D continue to wait since they cannot read operands while MUL.D does; ADD.D is not issued because of a structural hazard with SUB.D (only one FP adder is available)
• In cycle 11:
– SUB.D reads its operands while MUL.D begins execution; DIV.D cannot yet read operands and ADD.D is still stalled
• In cycle 12:
– SUB.D begins execution while MUL.D continues execution; DIV.D cannot read operands since F0 has not been produced by MUL.D
• In cycle 13:
– SUB.D finishes execution while MUL.D continues
• In cycle 14:
– SUB.D writes its result; MUL.D continues; DIV.D still cannot read

Continued
• In cycle 15:
– SUB.D leaves the scoreboard and now ADD.D is issued; MUL.D continues to execute while DIV.D waits
• In cycle 16:
– ADD.D reads operands; MUL.D continues to execute while DIV.D waits
• In cycle 17:
– ADD.D executes; MUL.D finishes executing; DIV.D waits
• In cycle 18:
– ADD.D executes; MUL.D writes its result; DIV.D waits (it cannot read F0 this cycle)
• In cycle 19:
– MUL.D leaves the scoreboard; ADD.D cannot yet write its result because of a WAR conflict with DIV.D (if the ADD.D writes its result now, the DIV.D could read the wrong value for F6); DIV.D reads its operands
• In cycle 20:
– DIV.D begins execution (this will take 40 total cycles); ADD.D writes its result to F6
• In cycles 21-61:
– ADD.D leaves the scoreboard in cycle 21; DIV.D continues execution until cycle 59, writes its result in 60, and leaves at 61

Benefits and Costs of the Scoreboard
• Benefits – its dynamic nature minimizes stalls from structural hazards due to data dependencies
• Costs – a little extra logic and data storage for the scoreboard itself are needed; the primary cost is in the added buses that permit forwarding from FUs to FUs and registers
– About 4 times as many buses as in an ordinary pipeline
• Restrictions
– There is only so much parallelism available in the instruction stream; other techniques might be employed to improve this
– Limitations based on the size of the scoreboard and the number of FUs
– WAW and WAR hazards still stall instructions in their FUs
• We can try to remove the WAR and WAW hazards using a variation of the scoreboard

Re-examining Data Dependencies
• We will refer to three types of dependencies
– Data dependencies – as before, a datum is needed by a later instruction but is not available from an earlier instruction in time
• these can arise either because
– instruction i produces a result that is used by j, but i has not produced it by the time j needs it
– instruction j is data dependent on instruction k, and k is data dependent on instruction i
• these are RAW hazards, as we saw with the 5-stage pipeline
– Name dependencies – two instructions reference the same named entity but there is no data dependency
• these are WAW and WAR hazards
• There are also control dependencies, which we will consider later

Continued
• If two instructions are data dependent, they must be executed in their proper order
– The data dependency is a property of the program itself (the logic), and the situation arises because of the overlapped nature of pipelined execution of the instructions
• note that data dependencies are not limited to registers but can also flow through memory; consider the following:
– L.D F0, 0(R1)
– ADD.D F6, F0, F4
– S.D F2, 0(R1)
• although it is unlikely that any architecture would have the store executing before the load, we must preserve the dependency because the store will wipe out the value that will be used in the ADD.D

Name Dependencies
• There are two types of name dependencies between instructions i and j (i precedes j)
– Antidependence – instruction j writes to a name that i reads, but j finishes writing before i gets to read it
– Output dependence – instructions i and j write to the same named item without an intervening read, but the writes occur in the wrong order
• Name dependencies may not be obvious if you focus on one iteration of a loop
– We will consider a means of enhancing ILP (instruction-level parallelism) by unrolling a loop across iterations
– In such a case, we can find additional name dependencies because the same registers are referenced in each loop iteration; with 2 iterations' worth of loop body, the registers used in the first iteration are also referenced in the second

Register Renaming
• Recall that the compiler assigns registers to be used by the processor to compute values
– Registers are often selected using a graph-coloring algorithm that ensures unique data are placed in separate registers as much as possible, to retain as much data in the CPU as possible
– However, because of the pipeline and out-of-order execution hardware, the compiler's assignment of data to registers may cause conflicts
• data dependencies cannot be helped; we have to live with them
• name dependencies arise because of limitations on registers
– An optimizing compiler may be able to rename registers to reduce the number of name dependencies that could potentially arise
• Dynamic register renaming is a superior approach
– This can be handled using Tomasulo's algorithm, where potential name dependencies are defeated through register renaming as the instructions are issued to their functional units

Control Dependencies
• A control dependence arises due to conditional branches
– Code along one branch should only be completed if the condition dictates that the branch should be taken
– This means that you should not execute the if clause if the condition is false, nor bypass the else clause if the condition is true, and the loop body should be skipped if the loop condition is false, etc.
• However, with speculation, we may take or ignore branches improperly, so we have to dynamically determine whether we speculated incorrectly and resolve it
– Control dependencies can lead to data and name dependencies that would not normally exist
– Mis-speculation can also lead to exceptions (see the next slide)

Control Dependence Problems

DADD R2, R3, R4          DADD R1, R2, R3
BEQZ R2, L1              BEQZ R4, L2
LW   R1, 0(R2)           DSUB R1, R5, R6
L1:  …                   L2:  OR R7, R1, R8

• The code above on the left contains a data dependence between the DADD and BEQZ, but also between the DADD and LW
– Imagine that we speculate that the branch is not taken; this causes us to begin execution of the LW
– Further imagine that we do not compute the condition for numerous stages, so that the LW completes first
• If we have mis-speculated, then R2 == 0 and we will have attempted to load from memory location 0(R2) = 0, which is part of the OS and will most likely cause a memory violation interrupt
• Thus, our mis-speculation directly caused an exception
• In the code above on the right, the value of R1 changes depending on whether we took the branch or not, so a mis-speculation leads to a data hazard

Loop Unrolling
• Compiler scheduling allows us to remove a number of stalling situations
– However, in concise code there may be too few instructions to move around
– Consider the code below
• there are stalls after L.D (1), ADD.D (2), DSUBI (1), plus the branch delay slot
– The best we can do by scheduling the code is shown below it, but we cannot remove all stalls
• One strategy is to have the compiler "unroll" the for loop some number of iterations to provide additional instructions to schedule

Original loop:
Loop: L.D   F0, 0(R1)
      ADD.D F4, F0, F2
      S.D   F4, 0(R1)
      DSUBI R1, R1, #8
      BNE   R1, R2, Loop

Scheduled loop:
Loop: L.D   F0, 0(R1)
      DSUBI R1, R1, #8
      ADD.D F4, F0, F2
      stall
      BNE   R1, R2, Loop
      S.D   F4, 8(R1)

The 2 stalls after ADD.D assume no collision in the MEM stage

Continued
• We can unroll the loop 4 times
– We keep 4 iterations' worth of L.D, ADD.D and S.D, but only 1 DSUBI and 1 BNE
– This requires adjusting the displacements for our L.D and S.D operations and the immediate datum for the DSUBI
– Below on the left is the unrolled code; on the right it is scheduled – no stalls!

Unrolled:                      Scheduled:
Loop: L.D   F0, 0(R1)          Loop: L.D   F0, 0(R1)
      ADD.D F4, F0, F2               L.D   F6, -8(R1)
      S.D   F4, 0(R1)                L.D   F10, -16(R1)
      L.D   F0, -8(R1)               L.D   F14, -24(R1)
      ADD.D F4, F0, F2               ADD.D F4, F0, F2
      S.D   F4, -8(R1)               ADD.D F8, F6, F2
      L.D   F0, -16(R1)              ADD.D F12, F10, F2
      ADD.D F4, F0, F2               ADD.D F16, F14, F2
      S.D   F4, -16(R1)              S.D   F4, 0(R1)
      L.D   F0, -24(R1)              S.D   F8, -8(R1)
      ADD.D F4, F0, F2               DSUBI R1, R1, #32
      S.D   F4, -24(R1)              S.D   F12, 16(R1)
      DSUBI R1, R1, #32              BNE   R1, R2, Loop
      BNE   R1, R2, Loop             S.D   F16, 8(R1)
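The displacement and immediate bookkeeping the compiler performs here can be sketched as a small code generator. The helper name `unroll` and the register-numbering scheme (F0/F4, F8/F12, …) are illustrative assumptions, not the slide's exact register assignment:

```python
def unroll(factor, step=8):
    """Emit `factor` copies of the loop body with adjusted displacements,
    a single DSUBI with the combined decrement, and the closing BNE."""
    body = []
    for i in range(factor):
        disp = -i * step                        # 0, -8, -16, -24, ...
        fsrc, fdst = f"F{8*i}", f"F{8*i + 4}"   # fresh registers per copy
        body.append(f"L.D   {fsrc}, {disp}(R1)")
        body.append(f"ADD.D {fdst}, {fsrc}, F2")
        body.append(f"S.D   {fdst}, {disp}(R1)")
    body.append(f"DSUBI R1, R1, #{factor * step}")  # one decrement for all copies
    body.append("BNE   R1, R2, Loop")
    return body
```

Calling `unroll(4)` yields 14 instructions ending in `DSUBI R1, R1, #32`, matching the unrolled code above modulo the register naming.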

Complications of Loop Unrolling
• Obviously we need far more registers
– This is sometimes referred to as register pressure
• If the loop iterates n times instead of 100, how do we know we can unroll it 4 times?
– We can still unroll it 4 times and use a separate loop to perform the final 0-3 iterations as needed
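The cleanup-loop idea in the last bullet is just integer division: run ⌊n/4⌋ trips of the unrolled body, then n mod 4 leftover iterations. A sketch (the helper name `strip_mine` is invented; compilers call this transformation strip mining):

```python
def strip_mine(n, factor=4):
    """Split n loop iterations into unrolled trips plus a cleanup count."""
    return n // factor, n % factor

trips, leftover = strip_mine(103)
# 25 trips of the 4-times-unrolled body, then 3 cleanup iterations:
# 25 * 4 + 3 == 103, so no iteration is lost or duplicated.
```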

• The compiler must adjust all memory displacements appropriately, as well as the immediate datum; although this may be mathematically challenging for us, it's a simple piece of code for the compiler
– Our latencies were made up to illustrate this example; imagine if we still had a 7-cycle multiply – unrolling a similar loop that performed MUL.D instead of ADD.D would be more complex (you will do this in the homework)
– Additionally, we must have a pipelined FP unit or else this would not be worth doing – two consecutive ADD.Ds would cause the second to stall
• what would happen if we were to do DIV.D instead of ADD.D?

Dynamic Scheduling
• We enhance the scoreboard concept with register renaming
– This is known as the Tomasulo architecture
• Our new pipeline distributes the 5 stages of MIPS differently
– IF – as before, instructions are fetched and placed into an instruction queue
– ID – separated into issue and read-operand stages, as with the scoreboard
– EX – distributed, pipelined functional units (e.g., FP adder, FP multiplier, integer), each with its own register sets (plural) to store multiple waiting instructions
– MEM – a separate memory unit with a load buffer and a store buffer; instructions line up here to perform loads and stores
– WB – values are forwarded directly from functional unit to functional unit, store buffer and registers via a common data bus

MIPS FP Using Tomasulo’s Algorithm

New Stages
• Issue
– Take the next instruction from the instruction queue
– Decode, check for structural hazards; if a reservation station is available, issue the instruction to that reservation station
• Read operands
– Operands are forwarded from the load unit, the register file, or another functional unit to the reservation station, so operands are read as they become available
• this means that we no longer wait to read both operands together as we did in the scoreboard
• a reservation station stores the operation, the locations of the operand sources, and the destination location
– Instructions wait at reservation stations and can execute out of order, based on the order in which operands are received
– Register renaming is applied dynamically in the reservation stations to avoid WAR and WAW hazards
• see the example on the next slide
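A toy model of the reservation-station fields and operand forwarding just described; the class and function names here are invented for this sketch, and broadcast over the common data bus is reduced to a single function call:

```python
class ReservationStation:
    """Holds an operation, available operand values (Vj/Vk), and the names
    of the stations that will produce any missing operands (Qj/Qk)."""

    def __init__(self, name):
        self.name, self.busy = name, False

    def issue(self, op, vj=None, vk=None, qj=None, qk=None):
        self.busy, self.op = True, op
        self.vj, self.vk, self.qj, self.qk = vj, vk, qj, qk

    def ready(self):
        # Execution may begin once no operand is still owed by a producer.
        return self.busy and self.qj is None and self.qk is None


def cdb_broadcast(stations, producer, value):
    """When `producer` completes, every waiting station snoops the CDB and
    captures the value, clearing the corresponding Qj/Qk tag."""
    for rs in stations:
        if rs.busy and rs.qj == producer:
            rs.vj, rs.qj = value, None
        if rs.busy and rs.qk == producer:
            rs.vk, rs.qk = value, None
```

A station issued with `qj='Load1'` is not ready until `cdb_broadcast(..., 'Load1', value)` delivers the operand, at which point the instruction can execute out of order with respect to earlier, still-waiting instructions.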

Register Renaming Example
• In the code below
– F6 causes a RAW hazard between ADD.D and S.D
• nothing can be done about this
– F6 causes a WAW hazard from ADD.D to MUL.D
– F8 causes a WAR hazard from ADD.D to SUB.D
• We rename F6 to S between the ADD.D and S.D
– This does nothing to prevent the RAW hazard, but it prevents the WAW hazard
• We rename F8 to T between the SUB.D and MUL.D
– This removes the WAR hazard with the ADD.D

Original:
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D   F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8

Renamed:
DIV.D F0, F2, F4
ADD.D S, F0, F8
S.D   S, 0(R1)
SUB.D T, F10, F14
MUL.D F6, F10, T

As long as S continues to be used for F6 and T continues to be used for F8, there is no problem with this renaming

Stages Continued
• Execution
– operands are "forwarded" to the reservation station
• once operands are available (we must wait out any RAW hazards), execution can begin in the functional unit
• if multiple instructions' operands become available at the same time, they can all begin execution simultaneously as long as they are at different functional units
• if two instructions at the same functional unit become ready at the same time, the instruction to begin is selected arbitrarily (except for loads/stores, which are handled in order)
– Any instruction that is control dependent is not allowed to execute until the branch condition has been confirmed
• this may slow down processing, but it ensures that mis-speculation cannot cause an exception
• Write result
– Once a value has been computed, it is forwarded via the CDB to waiting reservation stations, the store buffer, and the register file
• however, if multiple functional units complete at the same time, forwarding has to be done in sequential order

How Register Renaming Works
• We need to implement register renaming in hardware
• A register renamer (hardware) is added to the issue stage
– It maintains a list of temporary registers, called T registers
– It maintains a rename table
• When the renamer sees a destination register that causes a WAW or WAR conflict
– It substitutes the destination register with the next available T register and adds the old and new registers to the rename table
– Then, for every new instruction's source register that matches an entry in the table, it replaces that register with the T register
– The architecture might have as many as 64 T registers, which are rotated (when a register is no longer in use, it becomes available for reuse)
– This requires examining the table each cycle
• the table is in reality a simple hardware-stored vector that can accommodate an access per clock cycle
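The rename table can be sketched as a mapping from architectural registers to T registers. For simplicity this sketch renames every destination (as Tomasulo-style renaming does), whereas the renamer described above substitutes only on an actual WAW/WAR conflict; the class name and details are invented:

```python
class RegisterRenamer:
    """Issue-stage renamer with a rotating pool of T registers."""

    def __init__(self, num_t=64, first=9):
        self.next_t = first     # the example on the next slide starts at T9
        self.num_t = num_t
        self.table = {}         # architectural register -> current T register

    def rename(self, dest, srcs):
        # Sources first: read from whatever T register holds the newest value.
        new_srcs = [self.table.get(s, s) for s in srcs]
        if dest is None:        # e.g. a store writes no register
            return None, new_srcs
        # Every destination gets a fresh T register, removing WAW/WAR hazards.
        t = f"T{self.next_t}"
        self.next_t = (self.next_t + 1) % self.num_t   # rotate, wrap at T63
        self.table[dest] = t
        return t, new_srcs
```

Run over the loop body of the following example, this reproduces its renaming: the first iteration consumes T9-T14.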

Example
• Consider the code below
– Assume that the loop will be unrolled in a Tomasulo-type architecture
– How will register renaming operate, assuming that the next available T register is T9?

Loop: L.D   F4, 0(Rx)
      MUL.D F2, F0, F4
      DIV.D F8, F4, F2
      L.D   F4, 0(Ry)
      ADD.D F6, F0, F4
      SUB.D F8, F8, F6
      S.D   F8, 0(Ry)

The first iteration becomes:

Loop: L.D   T9, 0(Rx)
      MUL.D T10, F0, T9
      DIV.D T11, T9, T10
      L.D   T12, 0(Ry)
      ADD.D T13, F0, T12
      SUB.D T14, T11, T13
      S.D   T14, 0(Ry)

Each successive iteration of the loop will continue to use additional T registers: the next iteration will use T15-T20, followed by T21-T26, etc. If we reach T63, we cycle around to T0.

Example

L.D   F6, 34(R2)
L.D   F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F2, F6
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Instruction        Issue  Execute  Write Result
L.D F6, 34(R2)       *
L.D F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F2, F6
DIV.D F10, F0, F6
ADD.D F6, F8, F2

Name   Busy  Op    Vj  Vk  Qj  Qk  A
Load1  yes   Load  R2              34+Regs[R2]
Load2  no
Add1   no
Add2   no
Add3   no
Mult1  no
Mult2  no

Field  F0  F2  F4  F6     F8  F10
Qi                 Load1

They have changed the execution times of all FP operations to be 4 cycles, so that FP-to-FP delays are 3 cycles and FP-to-store delays are 2 cycles

Continued

Name   Busy  Op     Vj                 Vk                 Qj  Qk  A
Load1  yes   L.D    R2                                            34+Regs[R2]
Load2  yes   L.D    R3                                            45+Regs[R3]
Add1   yes   SUB.D  Mem[45+Regs[R3]]   Mem[34+Regs[R2]]
Add2   no
Add3   no
Mult1  yes   MUL.D  Regs[F4]           Mem[45+Regs[R3]]
Mult2  no

Field  F0     F2     F4  F6     F8    F10
Qi     Mult1  Load2      Load1  Add1

Register renaming is handled through forwarding from the functional unit(s) responsible for delivering the operands; this is denoted through the Qj/Qk fields in the reservation stations and through the register status table

Dynamic Loop Unrolling
• The Tomasulo architecture improves over the scoreboard because instructions can complete without having to wait on WAW and WAR hazards
• However, the real power behind the approach is the dynamic issuing of instructions
– This allows instructions to be sent to reservation stations as soon as they reach the front of the instruction queue, unless a functional unit's reservation stations are all filled
• Rather than using a compiler to unroll and schedule a loop, the hardware can execute many iterations of a loop without waiting for one iteration to complete
– Recall that in the pipeline, one iteration would stall the pipeline waiting for completion because of the lengthy computation times for FP operations
– The only disadvantage of the dynamic approach is that we are not removing branch operations as we did with the compiler; however, with proper branch speculation, we can minimize the branches' impact on performance

Continued
• With dynamic loop unrolling, we have to worry about an excessive number of WAR and WAW hazards arising from the use of the same registers across loop iterations
– Previously, the compiler took care of register renaming for us by altering the registers used in each iteration
– Now, we have to rely on the register renamer and the T registers
– How many iterations can we unroll? In part, this is restricted by the number of T registers available
• in our previous example, we used 6 T registers per iteration and had 64 T registers, so at most we could accommodate 10 unrolled iterations at any one time
• this seems more than reasonable, because there shouldn't be that many iterations that need to be unrolled to cover the FP latencies, but recall that our loop performed DIV.D
• if we have a divider that takes more than 10 cycles to execute, we could easily run out of registers

Example

Loop: L.D   F0, 0(R1)
      MUL.D F4, F0, F2       RAW hazard with L.D
      S.D   F4, 0(R1)        RAW hazard with MUL.D, which will take several cycles
      DSUBI R1, R1, #8
      BNE   R1, R2, Loop     RAW hazard with DSUBI; branch hazard

Here is a loop similar to the one we unrolled by compiler.

The MUL.D's latency causes stalls that cannot be removed through instruction scheduling because there are not enough instructions available.

Loop unrolling and scheduling resolved all stalls; however, loop unrolling is not without its own problems: compiler complexity, the need for many more registers, and trouble if the number of loop iterations is not divisible by the number of times we unrolled the loop (4 in the example).

Tomasulo Solution

Instruction        Issue  Execute  Write Result
L.D   F0, 0(R1)      *
MUL.D F4, F0, F2
S.D   F4, 0(R1)
L.D   F0, 0(R1)
MUL.D F4, F0, F2
S.D   F4, 0(R1)

Name    Busy  Op    Vj  Vk  Qj  Qk  A
Load1   yes   Load  R1              Regs[R1] + 0
Load2   no
Add1    no
Add2    no
Add3    no
Mult1   no
Mult2   no
Store1  no
Store2  no

Field  F0     F2  F4  F6  F8  F10
Qi     Load1

Continued

Instruction        Issue  Execute  Write Result
L.D   F0, 0(R1)      *      *          *
MUL.D F4, F0, F2     *
S.D   F4, 0(R1)      *
L.D   F0, 0(R1)      *
MUL.D F4, F0, F2
S.D   F4, 0(R1)

Name    Busy  Op     Vj        Vk        Qj     Qk  A
Load1   yes   Load                                  Regs[R1] + 0
Load2   yes   Load   R1                             Regs[R1] - 8
Add1    no
Add2    no
Add3    no
Mult1   yes   MUL              Regs[F2]  Load1
Mult2   no
Store1  yes   Store  Regs[R1]            Mult1
Store2  no

Field  F0     F2  F4     F6  F8  F10
Qi     Load2      Mult1

Continued

Instruction        Issue  Execute  Write Result
L.D   F0, 0(R1)      *      *          *
MUL.D F4, F0, F2     *      *
S.D   F4, 0(R1)      *
L.D   F0, 0(R1)      *      *
MUL.D F4, F0, F2     *
S.D   F4, 0(R1)      *

Name    Busy  Op     Vj          Vk        Qj     Qk  A
Load1   no
Load2   yes   Load   R1                               Regs[R1] - 8
Add1    no
Add2    no
Add3    no
Mult1   yes   MUL                Regs[F2]
Mult2   yes   MUL                Regs[F2]  Load2
Store1  yes   Store  Regs[R1]              Mult1
Store2  yes   Store  Regs[R1]-8            Mult2

Field  F0     F2  F4     F6  F8  F10
Qi     Load2      Mult2

Continued

Instruction        Issue  Execute  Write Result
L.D   F0, 0(R1)      *      *          *
MUL.D F4, F0, F2     *      *          *
S.D   F4, 0(R1)      *      *
L.D   F0, 0(R1)      *      *          *
MUL.D F4, F0, F2     *      (executing)
S.D   F4, 0(R1)      *

Name    Busy  Op     Vj          Vk        Qj     Qk  A
Load1   no
Load2   no
Add1    no
Add2    no
Add3    no
Mult1   no
Mult2   yes   MUL                Regs[F2]  Load2
Store1  no
Store2  yes   Store  Regs[R1]-8            Mult2

Field  F0     F2  F4     F6  F8  F10
Qi     Load2      Mult2

Summary on Tomasulo
• Notice in the previous example that we unrolled the loop two times and the hardware was able to execute the two iterations with only RAW hazards as delays
– However, given the limitation on FUs, we were not able to unroll the loop further
• with fully pipelined functional units, or multiple functional units, and numerous reservation stations, we could continue to execute iterations of the loop without delay
– Additionally, if a memory access causes a stall because of a cache miss, the hardware is able to continue issuing instructions out of order, thus greatly improving performance over shutting down the pipeline because of the cache miss
– Also, compiler optimizations are no longer essential to obtain performance increases

Continued
• The bottleneck for Tomasulo is primarily the single CDB
– We can improve on this by having two, one for FP and one for INT values
• We also see impacts because of
– The limited number of FUs
– The limited number of reservation stations
– The restriction to a single data access (load/store) per cycle
– The added cost of the additional logic to look for potential WAR/WAW hazards and perform register renaming
• However, the dynamic issuing allowed by the Tomasulo approach will encourage the use of speculation to keep the hardware busy
• This will also lead to an easier way to perform superscalar pipelining (we will cover this idea next week)