CPSC614 Lec 6.1

Exploiting Instruction-Level Parallelism with Software Approach #1

E. J. Kim

CPSC614 Lec 6.2

• To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles at least equal to the pipeline latency of that source instruction.

• Goal: keep the pipeline full.

CPSC614 Lec 6.3

Latencies

Inst. producing result   Inst. using result   Latency in cycles
FP ALU op                Another FP ALU op    3
FP ALU op                Store double         2
Load double              FP ALU op            1
Load double              Store double         0

Branch delay: 1. Integer ALU op → branch: 1. Integer load: 1. Integer ALU op → integer ALU op: 1.

CPSC614 Lec 6.4

Example

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, #-8
      BNE    R1, R2, Loop

(R1 initially holds the address of the highest array element, F2 holds the scalar s, and R2 is precomputed so that the BNE falls through once the last element has been processed.)

CPSC614 Lec 6.5

Without any Scheduling

                            Clock cycle issued
Loop: L.D    F0, 0(R1)       1
      stall                  2
      ADD.D  F4, F0, F2      3
      stall                  4
      stall                  5
      S.D    F4, 0(R1)       6
      DADDIU R1, R1, #-8     7
      stall                  8
      BNE    R1, R2, Loop    9
      stall                 10

10 clock cycles per element.

CPSC614 Lec 6.6

With Scheduling

                            Clock cycle issued
Loop: L.D    F0, 0(R1)       1
      DADDIU R1, R1, #-8     2
      ADD.D  F4, F0, F2      3
      stall                  4
      BNE    R1, R2, Loop    5
      S.D    F4, 8(R1)       6

Moving the S.D below the DADDIU and BNE is not trivial: because DADDIU has already decremented R1 by 8, the store's offset must change from 0(R1) to 8(R1); the S.D then executes in the delayed-branch slot. The loop now takes 6 clock cycles per element.

CPSC614 Lec 6.7

• The actual work of operating on the array element takes only 3 of these 6 clock cycles (load, add, store).

• The remaining 3 cycles are loop overhead (DADDIU, BNE) and a stall.

• To eliminate these 3 cycles, we need to get more operations within the loop relative to the number of overhead instructions.

CPSC614 Lec 6.8

Reducing Loop Overhead

• Loop Unrolling
  – A simple scheme for increasing the number of instructions relative to the branch and overhead instructions.
  – Simply replicates the loop body multiple times, adjusting the loop termination code.
  – Improves scheduling: it allows instructions from different iterations to be scheduled together (see the C sketch below).
  – Uses different registers for each iteration.
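A minimal C sketch of 4× unrolling (hypothetical function name; assumes the trip count of 1000 is divisible by 4, so no cleanup code is needed):

void add_scalar_unrolled(double *x, double s)
{
    int i;
    /* Body replicated four times: one counter update and one branch
       now serve four array elements instead of one. */
    for (i = 1000; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}

As in the slides' example, the array is indexed from 1 to 1000, so x must point to storage with at least 1001 elements.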

CPSC614 Lec 6.9

Unrolled Loop (No Scheduling)

                             Clock cycle issued
Loop: L.D    F0, 0(R1)        1
      stall                   2
      ADD.D  F4, F0, F2       3
      stall                   4
      stall                   5
      S.D    F4, 0(R1)        6
      L.D    F6, -8(R1)       7
      stall                   8
      ADD.D  F8, F6, F2       9
      stall                  10
      stall                  11
      S.D    F8, -8(R1)      12
      L.D    F10, -16(R1)    13
      stall                  14
      ADD.D  F12, F10, F2    15
      stall                  16
      stall                  17
      S.D    F12, -16(R1)    18
      L.D    F14, -24(R1)    19
      stall                  20
      ADD.D  F16, F14, F2    21
      stall                  22
      stall                  23
      S.D    F16, -24(R1)    24
      DADDIU R1, R1, #-32    25
      stall                  26
      BNE    R1, R2, Loop    27
      stall                  28

28 clock cycles for 4 elements: 7 cycles per element.

CPSC614 Lec 6.10

Loop Unrolling

• Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer.

• Unrolling improves the performance of the loop by eliminating overhead instructions.

CPSC614 Lec 6.11

Loop Unrolling (Scheduling)

                             Clock cycle issued
Loop: L.D    F0, 0(R1)        1
      L.D    F6, -8(R1)       2
      L.D    F10, -16(R1)     3
      L.D    F14, -24(R1)     4
      ADD.D  F4, F0, F2       5
      ADD.D  F8, F6, F2       6
      ADD.D  F12, F10, F2     7
      ADD.D  F16, F14, F2     8
      S.D    F4, 0(R1)        9
      S.D    F8, -8(R1)      10
      DADDIU R1, R1, #-32    11
      S.D    F12, 16(R1)     12
      BNE    R1, R2, Loop    13
      S.D    F16, 8(R1)      14

14 clock cycles for 4 elements: 3.5 cycles per element, with no stalls. (The last two stores use offsets 16 and 8 because they follow the DADDIU.)

CPSC614 Lec 6.12

Summary

• Goal: To know when and how the ordering among instructions may be changed.

• This process must be performed in a methodical fashion either by a compiler or by hardware.

CPSC614 Lec 6.13

• To obtain the final unrolled code, we had to:
  – Determine that it was legal to move the S.D after the DADDIU and BNE, and find the amount by which to adjust the S.D offset.
  – Determine that unrolling the loop would be useful by finding that the loop iterations are independent, except for the loop maintenance code.
  – Use different registers to avoid unnecessary constraints.
  – Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.

CPSC614 Lec 6.14

– Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.

– Schedule the code, preserving any dependences needed to yield the same result as the original code.

CPSC614 Lec 6.15

Loop Unrolling I (No Delayed Branch)

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F0, -8(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -8(R1)
      L.D    F0, -16(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -16(R1)
      L.D    F0, -24(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -24(R1)
      DADDIU R1, R1, #-32
      BNE    R1, R2, Loop

Reusing F0 and F4 in every copy of the body creates name dependences between the copies; within each copy, the L.D → ADD.D → S.D chain is a true dependence.

CPSC614 Lec 6.16

Loop Unrolling II (Register Renaming)

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDIU R1, R1, #-32
      BNE    R1, R2, Loop

After renaming, only the true dependences within each L.D → ADD.D → S.D chain remain.

CPSC614 Lec 6.17

• With the renaming, the copies of each loop body become independent and can be overlapped or executed in parallel.
  – Potential shortfall in registers.

• Register pressure
  – Arises because scheduling code to increase ILP causes the number of live values to increase; it may not be possible to allocate all the live values to registers.
  – The combination of unrolling and aggressive scheduling can cause this problem.

CPSC614 Lec 6.18

• Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.

CPSC614 Lec 6.19

Unrolling with Two-Issue

      Integer instruction       FP instruction        Clock cycle issued
Loop: L.D    F0, 0(R1)                                 1
      L.D    F6, -8(R1)                                2
      L.D    F10, -16(R1)      ADD.D F4, F0, F2        3
      L.D    F14, -24(R1)      ADD.D F8, F6, F2        4
      L.D    F18, -32(R1)      ADD.D F12, F10, F2      5
      S.D    F4, 0(R1)         ADD.D F16, F14, F2      6
      S.D    F8, -8(R1)        ADD.D F20, F18, F2      7
      S.D    F12, -16(R1)                              8
      DADDIU R1, R1, #-40                              9
      S.D    F16, 16(R1)                              10
      BNE    R1, R2, Loop                             11
      S.D    F20, 8(R1)                               12

Unrolled 5 times, the loop runs in 12 clock cycles: 2.4 cycles per element.

CPSC614 Lec 6.20

Static Branch Prediction

• Static branch predictors are sometimes used in processors where the expectation is that branch behavior is highly predictable at compile time.

CPSC614 Lec 6.21

Static Branch Prediction

• Predict a branch taken
  – Simplest scheme.
  – Average misprediction rate for SPEC: 34% (ranging from 9% to 59%).

• Predict on the basis of branch direction
  – Backward-going branches: predict taken.
  – Forward-going branches: predict not taken.
  – Unlikely to generate an overall misprediction rate of less than 30% to 40% (a sketch of the rule follows).
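A minimal sketch of the direction-based rule in C (illustrative function; a real predictor works on the decoded branch, but the rule itself needs only the two addresses):

/* Backward branches (target at or before the branch, i.e. loop
   branches) are predicted taken; forward branches, not taken. */
int predict_taken(unsigned long branch_pc, unsigned long target_pc)
{
    return target_pc <= branch_pc;
}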

CPSC614 Lec 6.22

Static Branch Prediction

• Predict branches on the basis of profile information collected from earlier runs.
  – An individual branch is often highly biased toward taken or untaken (bimodally distributed).
  – Changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction.

CPSC614 Lec 6.23

VLIW

• Very Long Instruction Word:
  – Relies on compiler technology to minimize the potential data hazard stalls.
  – Formats the instructions in a potential issue packet so that the hardware need not check explicitly for dependences.
  – Uses wide instructions with multiple operations per instruction (64, 128 bits, or more).
  – Example: the Intel IA-64 architecture.

CPSC614 Lec 6.24

Basic VLIW Approach

• VLIWs use multiple, independent functional units.

• A VLIW packages the multiple operations into one very long instruction.

• The hardware that a superscalar needs to decide on multiple issue is unnecessary.

• Uses loop unrolling, scheduling, … (a sketch of a VLIW packet layout follows).
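One way to picture a VLIW issue packet in C (a sketch only; the five-slot mix of two memory, two FP, and one integer/branch operation follows the Figure 4.5 example cited on the next slide, and the field layout here is hypothetical):

/* One operation slot; an all-zero slot encodes a no-op. */
typedef struct {
    unsigned char opcode;
    unsigned char rd, rs, rt;   /* destination and source registers */
} Op;

/* One very long instruction: a fixed bundle of operations, one per
   functional unit, that the compiler has already checked for
   independence; the hardware issues all slots together. */
typedef struct {
    Op mem[2];      /* two memory operations    */
    Op fp[2];       /* two FP operations        */
    Op int_branch;  /* one integer or branch op */
} VLIWInstruction;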

CPSC614 Lec 6.25

• Local scheduling: scheduling the code within a single basic block.

• Global scheduling: scheduling code across branches.
  – Much more complex.

• Trace scheduling: Section 4.5.

• Figure 4.5: VLIW instructions.

CPSC614 Lec 6.26

Problems

• Increase in code size.

• Wasted functional units.
  – In the previous example, only about 60% of the functional units were used.

CPSC614 Lec 6.27

Detecting and Enhancing Loop-level Parallelism

• Loop-level parallelism: identified at the source level.

• ILP: identified in the machine-level code after compilation.

for (i = 1000; i > 0; i--)
    x[i] = x[i] + s;

CPSC614 Lec 6.28

Advanced Compiler Support for Exposing and Exploiting ILP

for (i = 1; i <= 100; i++) {
    A[i + 1] = A[i] + C[i];      /* S1 */
    B[i + 1] = B[i] + A[i + 1];  /* S2 */
}

CPSC614 Lec 6.29

Loop-Carried Dependence

• Data accesses in later iterations are dependent on data values produced in earlier iterations.

for (i = 1; i <= 100; i++) {
    A[i + 1] = A[i] + C[i];      /* S1 */
    B[i + 1] = B[i] + A[i + 1];  /* S2 */
}

Loop-carried dependences: S1 uses A[i], computed by S1 in the previous iteration, and S2 uses B[i], computed by S2 in the previous iteration.

These dependences force successive iterations of this loop to execute in series.

CPSC614 Lec 6.30

Does a loop-carried dependence mean there is no parallelism???

• Consider:

for (i = 0; i < 8; i = i + 1) {
    A = A + C[i]; /* S1 */
}

Could compute:

"Cycle 1": temp0 = C[0] + C[1];
           temp1 = C[2] + C[3];
           temp2 = C[4] + C[5];
           temp3 = C[6] + C[7];

"Cycle 2": temp4 = temp0 + temp1;
           temp5 = temp2 + temp3;

"Cycle 3": A = A + temp4 + temp5;

• Relies on the associative nature of "+".

CPSC614 Lec 6.31

for (i = 1; i <= 100; i++) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i + 1] = C[i] + D[i];  /* S2 */
}

Loop-carried dependence: S1 uses B[i], which S2 computed in the previous iteration.

Despite this loop-carried dependence, this loop can be made parallel: neither statement depends on itself, so the dependence is not circular and the loop can be rewritten as shown on the next slide.

CPSC614 Lec 6.32

A[1] = A[1] + B[1];

for (i = 1; i <= 99; i++) {
    B[i + 1] = C[i] + D[i];
    A[i + 1] = A[i + 1] + B[i + 1];
}

B[101] = C[100] + D[100];

CPSC614 Lec 6.33

Recurrence

• A recurrence occurs when a variable is defined based on the value of that variable in an earlier iteration, often the one immediately preceding.

• Detecting a recurrence can be important:
  – Some architectures (especially vector computers) have special support for executing recurrences.
  – Some recurrences can be the source of a reasonable amount of parallelism.

CPSC614 Lec 6.34

for (i = 2; i <= 100; i = i + 1)
    Y[i] = Y[i - 1] + Y[i];

Dependence distance: 1

for (i = 6; i <= 100; i = i + 1)
    Y[i] = Y[i - 5] + Y[i];

Dependence distance: 5

The larger the distance, the more potential parallelism can be obtained by unrolling the loop, as the sketch below shows.
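For example, unrolling the distance-5 loop by 5 yields a body whose five statements are mutually independent (a sketch; the trip count, 95, happens to be divisible by 5):

/* Each statement's right-hand side reads Y[i-5..i-1], which this body
   never writes, plus its own element, so the five statements can
   execute in parallel. */
for (i = 6; i <= 100; i = i + 5) {
    Y[i]     = Y[i - 5] + Y[i];
    Y[i + 1] = Y[i - 4] + Y[i + 1];
    Y[i + 2] = Y[i - 3] + Y[i + 2];
    Y[i + 3] = Y[i - 2] + Y[i + 3];
    Y[i + 4] = Y[i - 1] + Y[i + 4];
}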

CPSC614 Lec 6.35

Finding Dependences

• Determining whether a dependence actually exists is NP-complete.

• Dependence analysis
  – The basic tool for detecting loop-level parallelism.
  – Applies only under a limited set of circumstances.
  – Techniques: the greatest common divisor (GCD) test, points-to analysis, interprocedural analysis, … (a sketch of the GCD test follows).
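A minimal C sketch of the GCD test (illustrative function names): for a write to X[a*i + b] and a read from X[c*i + d] in the same loop, a dependence is possible only if gcd(a, c) divides (d - b).

/* Greatest common divisor; assumes non-negative coefficients. */
static int gcd(int m, int n)
{
    while (n != 0) {
        int t = m % n;
        m = n;
        n = t;
    }
    return m;
}

/* Returns 1 if a dependence between X[a*i + b] (written) and
   X[c*i + d] (read) cannot be ruled out; 0 if the test proves
   the two accesses never touch the same element. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(a, c);
    if (g == 0)                 /* both coefficients zero */
        return b == d;
    return (d - b) % g == 0;
}

The test is conservative: a return value of 1 does not prove a dependence exists; it only fails to rule one out.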

CPSC614 Lec 6.36

Eliminating Dependent Computation

• Algebraic Simplifications of Expressions

• Copy propagation
  – Eliminates operations that copy values.

      DADDIU R1, R2, #4
      DADDIU R1, R1, #4

  becomes

      DADDIU R1, R2, #8

CPSC614 Lec 6.37

Eliminating Dependent Computation

• Tree Height Reduction
  – Reduces the height of the tree structure representing a computation, making the first two ADDs independent.

      ADD R1, R2, R3
      ADD R4, R1, R6
      ADD R8, R4, R7

  becomes

      ADD R1, R2, R3
      ADD R4, R6, R7
      ADD R8, R1, R4

CPSC614 Lec 6.38

Eliminating Dependent Computation

• Recurrences

sum = sum + x1 + x2 + x3 + x4 + x5;

becomes

sum = ((sum + x1) + (x2 + x3)) + (x4 + x5);

The three inner additions can proceed in parallel, shortening the dependence chain from five additions to three.

CPSC614 Lec 6.39

Software Pipelining

• Technique for reorganizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop.

• By choosing instructions from different iterations, dependent computations are separated from one another by an entire loop body.

CPSC614 Lec 6.40

Software Pipelining

• Counterpart to what Tomasulo’s algorithm does in hardware

• Software pipelining symbolically unrolls the loop and then selects instructions from each iteration.

• Start-up code before the loop and finish-up code after the loop are required.

CPSC614 Lec 6.41

Software Pipelining

CPSC614 Lec 6.42

Software Pipelining - Example

• Show a software-pipelined version of the following loop. Omit the start-up and finish-up code.

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, #-8
      BNE    R1, R2, Loop
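One possible software-pipelined kernel, sketched in C rather than MIPS (hypothetical variable names; the start-up code that primes t_load and t_add, and the finish-up code for the last two elements, are omitted as the problem requests). Each trip through the body stores the result for element i+2, computes the sum for element i+1, and loads element i, so dependent operations sit a full loop body apart:

/* Software-pipelined form of: for (i = 1000; i > 0; i--) x[i] += s; */
double t_load, t_add;   /* carry values across iterations, like F0 and F4 */
/* ... start-up code priming t_load and t_add omitted ... */
for (i = 998; i >= 1; i = i - 1) {
    x[i + 2] = t_add;       /* store: result for element i + 2 */
    t_add    = t_load + s;  /* add:   sum for element i + 1    */
    t_load   = x[i];        /* load:  fetch element i          */
}
/* ... finish-up code for elements 2 and 1 omitted ... */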

CPSC614 Lec 6.43

Software Pipelining

• Software pipelining consumes less code space.

• Loop unrolling reduces the overhead of the loop (branch, counter update code).

• Software pipelining reduces the time when the loop is not running at peak speed to once per loop at the beginning and end.


CPSC614 Lec 6.45

HW Support for More Parallelism at Compile Time

Conditional Instructions

• Predicated instructions.

• An extension of the instruction set.

• Conditional instruction: an instruction that refers to a condition, which is evaluated as part of the instruction's execution.
  – Condition true: executed normally.
  – Condition false: no-op.
  – Example: conditional move.

CPSC614 Lec 6.46

Example

if (A == 0) { S = T; }

With a branch (R1 = A, R2 = S, R3 = T):

        BNEZ  R1, L
        ADDU  R2, R3, R0
L:

With a conditional move, which copies only if the third operand equals zero:

        CMOVZ R2, R3, R1

CPSC614 Lec 6.47

• Conditional moves are used to change a control dependence into a data dependence.

• Handling multiple branches per cycle is complex, so conditional moves provide a way of reducing branch pressure.

• A conditional move can often eliminate a branch that is hard to predict, increasing the potential gain.
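In C terms, the same rewrite replaces control flow with a data computation (a sketch; whether the compiler emits an actual conditional-move instruction is target- and optimizer-dependent):

/* Branchy form: the assignment to S is control dependent on the test. */
if (A == 0)
    S = T;

/* Branch-free form: S is data dependent on the test, matching what
   CMOVZ computes. */
S = (A == 0) ? T : S;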