Chapter 3: Instruction Level Parallelism (ILP)


Page 1: Chapter 3: Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP)

ILP: the simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further, into more aggressive techniques to achieve parallel execution of the instructions in the instruction stream.

1 / 26

Page 2: Chapter 3: Instruction Level Parallelism (ILP)

Exploiting ILP

Two basic approaches:

• rely on hardware to discover and exploit parallelism dynamically, and

• rely on software to restructure programs to statically facilitate parallelism.

These techniques are complementary; they can be, and are, used in concert to improve performance.


Page 3: Chapter 3: Instruction Level Parallelism (ILP)

Program Challenges and Opportunities for ILP

Basic block: a straight-line code sequence with a single entry point and a single exit point.

Remember, the average branch frequency is 15%–25%; thus there are only 3–6 instructions on average between a pair of branches.

Loop-level parallelism: the opportunity to use multiple iterations of a loop to expose additional parallelism. Example techniques to exploit loop-level parallelism: vectorization, data-parallel execution, loop unrolling.
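A minimal sketch of why loop iterations can expose parallelism (the function names here are illustrative, not from the slides): when each iteration touches a distinct element, no iteration depends on another, so the iterations could run in any order or in parallel.

```python
# Each iteration of this loop touches a distinct x[i], so the loop carries
# no dependence between iterations.
def scale_add(x, s):
    for i in range(len(x)):
        x[i] = x[i] + s          # iteration i is independent of iteration j
    return x

# Running the iterations in reverse order gives the same result, which would
# not hold if the loop carried a dependence from one iteration to the next.
def scale_add_reversed(x, s):
    for i in reversed(range(len(x))):
        x[i] = x[i] + s
    return x
```

A vectorizer or unroller exploits exactly this independence.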


Page 4: Chapter 3: Instruction Level Parallelism (ILP)

Dependencies and Hazards

3 types of dependencies:

• data dependencies (or true data dependencies),

• name dependencies, and

• control dependencies.

Dependencies are artifacts of programs; hazards are artifacts of pipeline organization. Not all dependencies become hazards: whether a dependency turns into a hazard depends on the architecture of the pipeline.


Page 5: Chapter 3: Instruction Level Parallelism (ILP)

Data Dependencies

An instruction j is data dependent on instruction i if:

• instruction i produces a result that may be used by instruction j; or

• instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
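The transitive chain in the definition can be sketched as a small checker (a simplified model, assuming each instruction is a `(dest, srcs)` pair over register names and ignoring intervening overwrites):

```python
# Minimal sketch: j is data dependent on i if j reads a value produced by i,
# either directly or through a chain of intervening instructions.
def data_dependent(instrs, i, j):
    """instrs: list of (dest_reg, [src_regs]); i precedes j in program order."""
    deps = {i}                       # instructions on the chain starting at i
    for k in range(i + 1, j + 1):
        _dest_k, srcs_k = instrs[k]
        # k joins the chain if it reads a result produced inside the chain
        if any(instrs[d][0] in srcs_k for d in deps):
            deps.add(k)
    return j in deps and j != i
```

For example, with `r1 -> r2 -> r3` the last instruction is data dependent on the first through the middle one, even though they share no register directly.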


Page 6: Chapter 3: Instruction Level Parallelism (ILP)

Name Dependencies (not true dependencies)

Occur when two instructions use the same register or memory location, but there is no flow of data between the instructions associated with that name.

There are two types of name dependencies between an instruction i that precedes instruction j in the execution order:

• antidependence: instruction j writes a register or memory location that instruction i reads.

• output dependence: instruction i and instruction j write the same register or memory location.

Because these are not true dependencies, instructions i and j could potentially be executed in parallel if these dependencies are somehow removed (by using distinct registers or memory locations).
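Removing name dependencies by "using distinct registers" is exactly register renaming. A minimal sketch (illustrative names; every write simply gets a fresh physical register, as a renaming stage would do):

```python
# Every write is given a fresh physical register, so antidependences (WAR) and
# output dependences (WAW) on architectural names disappear; only true (RAW)
# dependences remain, carried through the mapping table.
def rename(instrs):
    """instrs: list of (dest, [srcs]) over architectural register names."""
    mapping = {}                  # architectural name -> current physical reg
    next_phys = 0
    renamed = []
    for dest, srcs in instrs:
        new_srcs = [mapping.get(s, s) for s in srcs]  # read current mapping
        phys = f"p{next_phys}"    # fresh destination kills WAW/WAR on dest
        next_phys += 1
        mapping[dest] = phys
        renamed.append((phys, new_srcs))
    return renamed
```

After renaming, the instruction that used to overwrite r1 writes a different physical register, so the two writers can execute in parallel.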


Page 7: Chapter 3: Instruction Level Parallelism (ILP)

Data Hazards

A hazard exists whenever there is a name or data dependence between two instructions and they are close enough that their overlapped execution would violate the program's order of dependency.

Possible data hazards:

• RAW (read after write)

• WAW (write after write)

• WAR (write after read)

RAR (read after read) is not a hazard.
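The classification above can be sketched as a small checker (register names are illustrative; each instruction is modeled as a `(dest, srcs)` pair):

```python
# Classify the hazards that earlier instruction i and later instruction j
# could create if their execution overlaps in the pipeline.
def hazards(i, j):
    """i, j: (dest, [srcs]) with i preceding j in program order."""
    dest_i, srcs_i = i
    dest_j, srcs_j = j
    found = set()
    if dest_i in srcs_j:
        found.add("RAW")   # j reads what i writes (true data dependence)
    if dest_j in srcs_i:
        found.add("WAR")   # j writes what i reads (antidependence)
    if dest_i == dest_j:
        found.add("WAW")   # both write the same name (output dependence)
    return found           # shared reads alone (RAR) are not a hazard
```

Note that two instructions that only read the same register produce an empty set, matching the observation that RAR is not a hazard.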


Page 8: Chapter 3: Instruction Level Parallelism (ILP)

Control Dependencies

A control dependency ties an instruction to the sequential flow of execution and preserves the branch (or any flow-altering) behavior of the program.

In general, two constraints are imposed by control dependencies:

• An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.

• An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch.
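A small illustrative example of the first constraint (hypothetical function, not from the slides): the division below is control dependent on the branch, and hoisting it above the guard would change the program's exception behavior.

```python
# The division is control dependent on the branch: it only executes when the
# guard holds. Moving it above the `if` would raise a ZeroDivisionError that
# the original program never raises.
def guarded_div(a, b):
    if b != 0:            # branch
        return a // b     # control dependent: must not move above the branch
    return 0
```

This is exactly the "exception behavior" property discussed on the next slide.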


Page 9: Chapter 3: Instruction Level Parallelism (ILP)

Relaxed Adherence to Dependencies

Strictly enforcing dependency relations is not entirely necessary if we can preserve the correctness of the program. Two properties critical to program correctness (and normally preserved by maintaining both data and control dependency) are:

• exception behavior: reordering instructions must not alter the visible exception behavior of the program (transparent exceptions such as page faults are OK to affect).

• data flow: data sources must be preserved to feed the instructions consuming that data.

Liveness: data that is still needed is called live; data that is no longer used is called dead.


Page 10: Chapter 3: Instruction Level Parallelism (ILP)

Compiler Techniques for Exposing ILP

• loop unrolling

• instruction scheduling
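A sketch of both techniques at once (illustrative function, unroll factor of 4 chosen arbitrarily): unrolling exposes several independent operations per iteration, and using separate accumulators is the scheduling move that breaks the serial RAW chain through a single sum.

```python
# Loop unrolled by 4 with separate accumulators. The four additions in the
# body are mutually independent, so a scheduler can interleave them to hide
# latency; the tail loop handles element counts not divisible by 4.
def sum_unrolled(x):
    n = len(x)
    s0 = s1 = s2 = s3 = 0          # separate accumulators break the RAW chain
    i = 0
    while i + 4 <= n:
        s0 += x[i]
        s1 += x[i + 1]
        s2 += x[i + 2]
        s3 += x[i + 3]
        i += 4
    total = s0 + s1 + s2 + s3
    for j in range(i, n):          # cleanup loop for the leftover elements
        total += x[j]
    return total
```

A compiler applies the same transformation at the instruction level, where the payoff is fewer branches and more independent operations in flight.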


Page 11: Chapter 3: Instruction Level Parallelism (ILP)

Advanced Branch Prediction

• Correlating branch predictors (or two-level predictors): make use of the outcomes of the most recent branches to make a prediction.

• Tournament predictors: run multiple predictors and run a tournament between them; use the most successful.
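These schemes build on the basic local 2-bit saturating counter, which can be sketched in a few lines (the initial state chosen here is arbitrary):

```python
# Local 2-bit saturating-counter predictor: states 0-1 predict not-taken,
# states 2-3 predict taken. Two consecutive mispredictions are needed to flip
# the prediction, so a single anomalous outcome (e.g., a loop exit) is tolerated.
class TwoBitPredictor:
    def __init__(self):
        self.state = 2                       # start weakly taken (arbitrary)

    def predict(self):
        return self.state >= 2               # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

On a loop branch taken nine times and then not taken once, this predictor mispredicts only the final exit, which is why even small 2-bit tables perform well on loop-heavy code.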


Page 12: Chapter 3: Instruction Level Parallelism (ILP)

Measured misprediction rates: SPEC89

[Figure: conditional branch misprediction rate (0%–8%) versus total predictor size (0–512), comparing local 2-bit predictors, correlating predictors, and tournament predictors on SPEC89.]


Page 13: Chapter 3: Instruction Level Parallelism (ILP)

End section 3.1-3.3

Lecture break; continue section 3.4-3.9 next class


Page 14: Chapter 3: Instruction Level Parallelism (ILP)

Dynamic Scheduling

Designing the hardware so that it can dynamically rearrange instruction execution to reduce stalls while maintaining data flow and exception behavior.

Two techniques:

• Scoreboarding (discussed in Appendix C): centralized scoreboard

• Tomasulo's Algorithm (discussed in Chapter 3): distributed reservation stations


Page 15: Chapter 3: Instruction Level Parallelism (ILP)

Dynamic Scheduling

Instructions are issued to the pipeline in order but executed and completed out of order.

Out-of-order execution leads to the possibility of out-of-order completion.

Out-of-order execution introduces the possibility of WAR and WAW hazards, which do not occur in statically scheduled pipelines.

Out-of-order completion creates major complications in exception handling.


Page 16: Chapter 3: Instruction Level Parallelism (ILP)

Key Components of Tomasulo’s Algorithm

Register renaming: to minimize WAW and WAR hazards (caused by name dependencies).

Reservation stations: buffer operands for instructions waiting to execute. Operands are fetched as soon as they are available (from the reservation station/instruction serving the operand). The register specifiers of issued instructions are renamed to the reservation station (which achieves register renaming).

Thus, hazard detection and execution control are distributed (the targeted functional unit and reservation station determine when an instruction executes).

Common data bus (result bus): carries results past the reservation stations (where they are captured) and back to the register file. Sometimes multiple buses are used.
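The interplay of these components can be sketched as a heavily simplified simulator (illustrative assumptions: unlimited reservation stations, one-step execution, only add/mul). Issue renames each source to either a value or the tag of its producing station; the common data bus broadcast wakes up waiting stations.

```python
# Highly simplified Tomasulo sketch: issue in program order, execute any
# station whose operands are values, broadcast results on a common data bus.
def tomasulo(instrs, regs):
    """instrs: list of (op, dest, src1, src2); regs: dict register -> value."""
    stations = []      # each entry: [op, dest, operand1, operand2]
    reg_status = {}    # register -> tag of the station that will write it

    def operand(r):
        # A source is either a ready value or the tag of its producer.
        return ("tag", reg_status[r]) if r in reg_status else ("val", regs[r])

    for op, dest, s1, s2 in instrs:                # issue in program order
        stations.append([op, dest, operand(s1), operand(s2)])
        reg_status[dest] = len(stations) - 1       # rename dest to this tag

    done = [False] * len(stations)
    while not all(done):
        for tag, (op, dest, o1, o2) in enumerate(stations):
            if done[tag] or o1[0] != "val" or o2[0] != "val":
                continue                           # still waiting on a tag
            result = o1[1] + o2[1] if op == "add" else o1[1] * o2[1]
            done[tag] = True
            for st in stations:                    # CDB broadcast: capture
                for k in (2, 3):
                    if st[k] == ("tag", tag):
                        st[k] = ("val", result)
            if reg_status.get(dest) == tag:        # only the newest writer
                regs[dest] = result                # updates the register file
    return regs
```

Note how the `reg_status` check on write-back suppresses an older writer of the same register, which is how renaming eliminates the WAW hazard.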


Page 17: Chapter 3: Instruction Level Parallelism (ILP)

Hardware Speculation

Execute (completely) the instructions that are predicted to occur after a branch without knowing the branch outcome; that is, speculate that the instructions are to be executed.

Instruction commit: the point at which the results (operand writes/exceptions) of a speculated instruction are made permanent.

The key to speculation is to allow instructions to execute out of order but force them to commit in order. This is generally achieved by a reorder buffer (ROB), which holds completed instructions and retires them in order. Thus the ROB holds/buffers register/memory operands, which are also used by functional units as source operands.
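The complete-out-of-order, commit-in-order discipline can be sketched in a few lines (an abstract model, not a full ROB with operand storage):

```python
# Reorder-buffer sketch: instructions may *complete* in any order, but results
# are only *committed* (made architecturally visible) from the head of the
# buffer, in program order.
def commit_in_order(n, completed_order):
    """n instructions in program order 0..n-1; completed_order lists their
    indices in the order they finish executing. Returns the commit order."""
    done = [False] * n
    head = 0                      # oldest uncommitted instruction
    commits = []
    for idx in completed_order:
        done[idx] = True          # completion may be out of order
        while head < n and done[head]:
            commits.append(head)  # commit only from the head, in order
            head += 1
    return commits
```

Whatever the completion order, the commit order always equals program order, which is what makes precise exceptions and speculation recovery possible.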


Page 18: Chapter 3: Instruction Level Parallelism (ILP)

Multiple-Issue Processors

Attempting to reduce the CPI below one.

1. Statically scheduled superscalar processors

2. VLIW (very long instruction word) processors

3. Dynamically scheduled superscalar processors

Superscalar: scheduling multiple instructions for execution at the same time.


Page 19: Chapter 3: Instruction Level Parallelism (ILP)

Advanced Instruction Delivery & Speculation

• Branch-target buffer (or branch-target cache): predicts the destination PC address for a branch instruction; can also store the target instruction (instead of, or in addition to, the destination address). Branch folding: replace the branch instruction with the destination instruction.

• Return address prediction.

• Integrated instruction fetch units: autonomously executing units that fetch instructions.

• Comments on speculation: how much? through multiple branches? energy costs?

• Value prediction: not yet feasible; not sure why it's covered.
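The branch-target buffer in the first bullet is, at its core, a small map from fetch PC to predicted target. A minimal sketch (illustrative class; 4-byte instructions assumed for the fall-through address):

```python
# Minimal branch-target-buffer model: a map from fetch PC to predicted target.
# A miss means the instruction is not predicted to be a branch.
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}          # fetch PC -> predicted target PC

    def next_pc(self, pc):
        # Hit: fetch from the predicted target; miss: fall through (PC + 4).
        return self.entries.get(pc, pc + 4)

    def update(self, pc, target):
        self.entries[pc] = target  # install/refresh on a taken branch
```

Because the lookup uses only the PC of the instruction being fetched, the predicted next PC is available before the instruction is even decoded, which is the whole point of the structure.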


Page 20: Chapter 3: Instruction Level Parallelism (ILP)

Branch-Target Buffer

[Figure: branch-target buffer operation. The PC of the instruction to fetch is looked up among the entries in the branch-target buffer. No match: the instruction is not predicted to be a branch; proceed normally. Match: the instruction is a branch, and the predicted PC should be used as the next PC, along with the prediction of taken or untaken.]


Page 21: Chapter 3: Instruction Level Parallelism (ILP)

End section 3.4-3.9

Lecture break; continue section 3.10-3.16 next class


Page 22: Chapter 3: Instruction Level Parallelism (ILP)

The Limits of ILP

How much ILP is available? Is there a limit?

Consider the ideal processor:

1. Infinite number of registers for register renaming.

2. Perfect branch prediction

3. Perfect jump prediction

4. Perfect memory address analysis

5. All memory accesses occur in one cycle

This effectively removes all control dependencies and all but true data dependencies. That is, all instructions can be scheduled as early as their data dependencies allow.


Page 23: Chapter 3: Instruction Level Parallelism (ILP)

Unlimited Issue/Single Cycle Execution

[Figure: instruction issues per cycle for the SPEC benchmarks on the ideal processor: gcc 55, espresso 63, li 18, fpppp 75, doduc 119, tomcatv 150.]

The Power7 currently issues up to six instructions per clock.


Page 24: Chapter 3: Instruction Level Parallelism (ILP)

Limited Issue Window Size/Single Cycle Execution

• Up to 64 instructions issued/cycle

• Tournament predictor; best available in 2011 (not a primary bottleneck)

• Perfect memory address analysis

• 64 int/64 FP extra registers available for renaming

• Single-cycle execution

[Figure: instruction issues per cycle for the SPEC benchmarks (gcc, espresso, li, fpppp, doduc, tomcatv) as the issue window size varies (infinite, 256, 128, 64, 32); e.g., tomcatv falls from 56 issues per cycle with an infinite window to 14 with a window of 32.]


Page 25: Chapter 3: Instruction Level Parallelism (ILP)

Hardware/Software Speculation

Professor rambles on about this topic for a few moments.


Page 26: Chapter 3: Instruction Level Parallelism (ILP)

Multithreading

• Fine-grained multithreading: switch threads at each clock cycle. Examples: Sun Niagara processor (8 threads), Nvidia GPUs.

• Coarse-grained multithreading: switch threads at major stalls such as L2/L3 misses. Examples: no commercial ventures.

• Simultaneous multithreading: process/dispatch multiple threads simultaneously to the common functional units. Examples: Intel (hyperthreading, two threads), IBM Power7 (four threads).
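The fine-grained policy in the first bullet can be sketched as a round-robin issue loop (an abstract model with hypothetical thread workloads, ignoring pipeline detail):

```python
# Fine-grained multithreading sketch: the core switches threads every cycle,
# round-robin among ready threads, so one thread's stall does not idle the
# pipeline as long as another thread still has work.
def fine_grained_schedule(threads, cycles):
    """threads: remaining-instruction count per thread.
    Returns the thread index issued each cycle (None if all are idle)."""
    trace = []
    t = 0                                     # round-robin pointer
    for _ in range(cycles):
        issued = None
        for _ in range(len(threads)):         # look for the next ready thread
            idx = t % len(threads)
            t += 1                            # advance even past idle threads
            if threads[idx] > 0:
                threads[idx] -= 1
                issued = idx
                break
        trace.append(issued)
    return trace
```

With two ready threads the issue trace alternates 0, 1, 0, 1; when one thread runs dry, the other simply gets every cycle.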
