CPSC614 Lec 6.1

Exploiting Instruction-Level Parallelism with Software Approach #1

E. J. Kim

CPSC614 Lec 6.2

• To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles at least equal to the pipeline latency of that source instruction.

• Goal: keep the pipeline full.

CPSC614 Lec 6.3

Latencies

Inst. producing result   Inst. using result   Latency in cycles
FP ALU op                Another FP ALU op    3
FP ALU op                Store double         2
Load double              FP ALU op            1
Load double              Store double         0

Branch delay: 1. Integer ALU op → branch: 1. Integer load: 1. Integer ALU op → integer ALU op: 1.

CPSC614 Lec 6.4

Example

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, #-8
      BNE    R1, R2, Loop

(R1 initially holds the address of the highest array element, F2 holds the scalar s, and R2 is precomputed so that the BNE falls through once the last element has been processed.)

CPSC614 Lec 6.5

Without any Scheduling

                            Clock cycle issued
Loop: L.D    F0, 0(R1)       1
      stall                  2
      ADD.D  F4, F0, F2      3
      stall                  4
      stall                  5
      S.D    F4, 0(R1)       6
      DADDIU R1, R1, #-8     7
      stall                  8
      BNE    R1, R2, Loop    9
      stall                 10

10 clock cycles per element.

CPSC614 Lec 6.6

With Scheduling

                            Clock cycle issued
Loop: L.D    F0, 0(R1)       1
      DADDIU R1, R1, #-8     2
      ADD.D  F4, F0, F2      3
      stall                  4
      BNE    R1, R2, Loop    5
      S.D    F4, 8(R1)       6

Moving the S.D below the DADDIU and BNE is not trivial: because DADDIU has already decremented R1 by 8, the store's offset must change from 0(R1) to 8(R1); the S.D then executes in the delayed-branch slot. The loop now takes 6 clock cycles per element.

CPSC614 Lec 6.7

• The actual work of operating on the array element takes only 3 of these 6 clock cycles (load, add, store).

• The remaining 3 cycles are loop overhead (DADDIU, BNE) and a stall.

• To eliminate these 3 cycles, we need to get more operations within the loop relative to the number of overhead instructions.

CPSC614 Lec 6.8

Reducing Loop Overhead

• Loop Unrolling
  – A simple scheme for increasing the number of instructions relative to the branch and overhead instructions.
  – Simply replicates the loop body multiple times, adjusting the loop termination code.
  – Improves scheduling: it allows instructions from different iterations to be scheduled together (see the C sketch below).
  – Uses different registers for each iteration.
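A minimal C sketch of 4× unrolling (hypothetical function name; assumes the trip count of 1000 is divisible by 4, so no cleanup code is needed):

void add_scalar_unrolled(double *x, double s)
{
    int i;
    /* Body replicated four times: one counter update and one branch
       now serve four array elements instead of one. */
    for (i = 1000; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}

As in the slides' example, the array is indexed from 1 to 1000, so x must point to storage with at least 1001 elements.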

CPSC614 Lec 6.9

Unrolled Loop (No Scheduling)

                             Clock cycle issued
Loop: L.D    F0, 0(R1)        1
      stall                   2
      ADD.D  F4, F0, F2       3
      stall                   4
      stall                   5
      S.D    F4, 0(R1)        6
      L.D    F6, -8(R1)       7
      stall                   8
      ADD.D  F8, F6, F2       9
      stall                  10
      stall                  11
      S.D    F8, -8(R1)      12
      L.D    F10, -16(R1)    13
      stall                  14
      ADD.D  F12, F10, F2    15
      stall                  16
      stall                  17
      S.D    F12, -16(R1)    18
      L.D    F14, -24(R1)    19
      stall                  20
      ADD.D  F16, F14, F2    21
      stall                  22
      stall                  23
      S.D    F16, -24(R1)    24
      DADDIU R1, R1, #-32    25
      stall                  26
      BNE    R1, R2, Loop    27
      stall                  28

28 clock cycles for 4 elements: 7 cycles per element.

CPSC614 Lec 6.10

Loop Unrolling

• Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer.

• Unrolling improves the performance of the loop by eliminating overhead instructions.

CPSC614 Lec 6.11

Loop Unrolling (Scheduling)

                             Clock cycle issued
Loop: L.D    F0, 0(R1)        1
      L.D    F6, -8(R1)       2
      L.D    F10, -16(R1)     3
      L.D    F14, -24(R1)     4
      ADD.D  F4, F0, F2       5
      ADD.D  F8, F6, F2       6
      ADD.D  F12, F10, F2     7
      ADD.D  F16, F14, F2     8
      S.D    F4, 0(R1)        9
      S.D    F8, -8(R1)      10
      DADDIU R1, R1, #-32    11
      S.D    F12, 16(R1)     12
      BNE    R1, R2, Loop    13
      S.D    F16, 8(R1)      14

14 clock cycles for 4 elements: 3.5 cycles per element, with no stalls. (The last two stores use offsets 16 and 8 because they follow the DADDIU.)

CPSC614 Lec 6.12

Summary

• Goal: To know when and how the ordering among instructions may be changed.

• This process must be performed in a methodical fashion either by a compiler or by hardware.

CPSC614 Lec 6.13

• To obtain the final unrolled code, we had to:
  – Determine that it was legal to move the S.D after the DADDIU and BNE, and find the amount by which to adjust the S.D offset.
  – Determine that unrolling the loop would be useful by finding that the loop iterations are independent, except for the loop maintenance code.
  – Use different registers to avoid unnecessary constraints.
  – Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.

CPSC614 Lec 6.14

– Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.

– Schedule the code, preserving any dependences needed to yield the same result as the original code.

CPSC614 Lec 6.15

Loop Unrolling I (No Delayed Branch)

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F0, -8(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -8(R1)
      L.D    F0, -16(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -16(R1)
      L.D    F0, -24(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -24(R1)
      DADDIU R1, R1, #-32
      BNE    R1, R2, Loop

Reusing F0 and F4 in every copy of the body creates name dependences between the copies; within each copy, the L.D → ADD.D → S.D chain is a true dependence.

CPSC614 Lec 6.16

Loop Unrolling II (Register Renaming)

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDIU R1, R1, #-32
      BNE    R1, R2, Loop

After renaming, only the true dependences within each L.D → ADD.D → S.D chain remain.

CPSC614 Lec 6.17

• With the renaming, the copies of each loop body become independent and can be overlapped or executed in parallel.
  – Potential shortfall in registers.

• Register pressure
  – Arises because scheduling code to increase ILP causes the number of live values to increase; it may not be possible to allocate all the live values to registers.
  – The combination of unrolling and aggressive scheduling can cause this problem.

CPSC614 Lec 6.18

• Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.

CPSC614 Lec 6.19

Unrolling with Two-Issue

      Integer instruction       FP instruction        Clock cycle issued
Loop: L.D    F0, 0(R1)                                 1
      L.D    F6, -8(R1)                                2
      L.D    F10, -16(R1)      ADD.D F4, F0, F2        3
      L.D    F14, -24(R1)      ADD.D F8, F6, F2        4
      L.D    F18, -32(R1)      ADD.D F12, F10, F2      5
      S.D    F4, 0(R1)         ADD.D F16, F14, F2      6
      S.D    F8, -8(R1)        ADD.D F20, F18, F2      7
      S.D    F12, -16(R1)                              8
      DADDIU R1, R1, #-40                              9
      S.D    F16, 16(R1)                              10
      BNE    R1, R2, Loop                             11
      S.D    F20, 8(R1)                               12

Unrolled 5 times, the loop runs in 12 clock cycles: 2.4 cycles per element.

CPSC614 Lec 6.20

Static Branch Prediction

• Static branch predictors are sometimes used in processors where the expectation is that branch behavior is highly predictable at compile time.

CPSC614 Lec 6.21

Static Branch Prediction

• Predict a branch taken
  – Simplest scheme.
  – Average misprediction rate for SPEC: 34% (ranging from 9% to 59%).

• Predict on the basis of branch direction
  – Backward-going branches: predict taken.
  – Forward-going branches: predict not taken.
  – Unlikely to generate an overall misprediction rate of less than 30% to 40% (a sketch of the rule follows).
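A minimal sketch of the direction-based rule in C (illustrative function; a real predictor works on the decoded branch, but the rule itself needs only the two addresses):

/* Backward branches (target at or before the branch, i.e. loop
   branches) are predicted taken; forward branches, not taken. */
int predict_taken(unsigned long branch_pc, unsigned long target_pc)
{
    return target_pc <= branch_pc;
}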

CPSC614 Lec 6.22

Static Branch Prediction

• Predict branches on the basis of profile information collected from earlier runs.
  – An individual branch is often highly biased toward taken or untaken (bimodally distributed).
  – Changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction.

CPSC614 Lec 6.23

VLIW

• Very Long Instruction Word:
  – Relies on compiler technology to minimize the potential data hazard stalls.
  – Formats the instructions in a potential issue packet so that the hardware need not check explicitly for dependences.
  – Uses wide instructions with multiple operations per instruction (64, 128 bits, or more).
  – Example: the Intel IA-64 architecture.

CPSC614 Lec 6.24

Basic VLIW Approach

• VLIWs use multiple, independent functional units.

• A VLIW packages the multiple operations into one very long instruction.

• The hardware that a superscalar needs to decide on multiple issue is unnecessary.

• Uses loop unrolling, scheduling, … (a sketch of a VLIW packet layout follows).
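One way to picture a VLIW issue packet in C (a sketch only; the five-slot mix of two memory, two FP, and one integer/branch operation follows the Figure 4.5 example cited on the next slide, and the field layout here is hypothetical):

/* One operation slot; an all-zero slot encodes a no-op. */
typedef struct {
    unsigned char opcode;
    unsigned char rd, rs, rt;   /* destination and source registers */
} Op;

/* One very long instruction: a fixed bundle of operations, one per
   functional unit, that the compiler has already checked for
   independence; the hardware issues all slots together. */
typedef struct {
    Op mem[2];      /* two memory operations    */
    Op fp[2];       /* two FP operations        */
    Op int_branch;  /* one integer or branch op */
} VLIWInstruction;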

CPSC614 Lec 6.25

• Local scheduling: scheduling the code within a single basic block.

• Global scheduling: scheduling code across branches.
  – Much more complex.

• Trace scheduling: Section 4.5.

• Figure 4.5: VLIW instructions.

CPSC614 Lec 6.26

Problems

• Increase in code size.

• Wasted functional units.
  – In the previous example, only about 60% of the functional units were used.

CPSC614 Lec 6.27

Detecting and Enhancing Loop-level Parallelism

• Loop-level parallelism: identified at the source level.

• ILP: identified in the machine-level code after compilation.

for (i = 1000; i > 0; i--)
    x[i] = x[i] + s;

CPSC614 Lec 6.28

Advanced Compiler Support for Exposing and Exploiting ILP

for (i = 1; i <= 100; i++) {
    A[i + 1] = A[i] + C[i];      /* S1 */
    B[i + 1] = B[i] + A[i + 1];  /* S2 */
}

CPSC614 Lec 6.29

Loop-Carried Dependence

• Data accesses in later iterations are dependent on data values produced in earlier iterations.

for (i = 1; i <= 100; i++) {
    A[i + 1] = A[i] + C[i];      /* S1 */
    B[i + 1] = B[i] + A[i + 1];  /* S2 */
}

Loop-carried dependences: S1 uses A[i], computed by S1 in the previous iteration, and S2 uses B[i], computed by S2 in the previous iteration.

These dependences force successive iterations of this loop to execute in series.

CPSC614 Lec 6.30

Does a loop-carried dependence mean there is no parallelism???

• Consider:

for (i = 0; i < 8; i = i + 1) {
    A = A + C[i]; /* S1 */
}

Could compute:

"Cycle 1": temp0 = C[0] + C[1];
           temp1 = C[2] + C[3];
           temp2 = C[4] + C[5];
           temp3 = C[6] + C[7];

"Cycle 2": temp4 = temp0 + temp1;
           temp5 = temp2 + temp3;

"Cycle 3": A = A + temp4 + temp5;

• Relies on the associative nature of "+".

CPSC614 Lec 6.31

for (i = 1; i <= 100; i++) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i + 1] = C[i] + D[i];  /* S2 */
}

Loop-carried dependence: S1 uses B[i], which S2 computed in the previous iteration.

Despite this loop-carried dependence, this loop can be made parallel: neither statement depends on itself, so the dependence is not circular and the loop can be rewritten as shown on the next slide.

CPSC614 Lec 6.32

A[1] = A[1] + B[1];

for (i = 1; i <= 99; i++) {
    B[i + 1] = C[i] + D[i];
    A[i + 1] = A[i + 1] + B[i + 1];
}

B[101] = C[100] + D[100];

CPSC614 Lec 6.33

Recurrence

• A recurrence occurs when a variable is defined based on the value of that variable in an earlier iteration, often the one immediately preceding.

• Detecting a recurrence can be important:
  – Some architectures (especially vector computers) have special support for executing recurrences.
  – Some recurrences can be the source of a reasonable amount of parallelism.

CPSC614 Lec 6.34

for (i = 2; i <= 100; i = i + 1)
    Y[i] = Y[i - 1] + Y[i];

Dependence distance: 1

for (i = 6; i <= 100; i = i + 1)
    Y[i] = Y[i - 5] + Y[i];

Dependence distance: 5

The larger the distance, the more potential parallelism can be obtained by unrolling the loop, as the sketch below shows.
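For example, unrolling the distance-5 loop by 5 yields a body whose five statements are mutually independent (a sketch; the trip count, 95, happens to be divisible by 5):

/* Each statement's right-hand side reads Y[i-5..i-1], which this body
   never writes, plus its own element, so the five statements can
   execute in parallel. */
for (i = 6; i <= 100; i = i + 5) {
    Y[i]     = Y[i - 5] + Y[i];
    Y[i + 1] = Y[i - 4] + Y[i + 1];
    Y[i + 2] = Y[i - 3] + Y[i + 2];
    Y[i + 3] = Y[i - 2] + Y[i + 3];
    Y[i + 4] = Y[i - 1] + Y[i + 4];
}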

CPSC614 Lec 6.35

Finding Dependences

• Determining whether a dependence actually exists is NP-complete.

• Dependence analysis
  – The basic tool for detecting loop-level parallelism.
  – Applies only under a limited set of circumstances.
  – Techniques: the greatest common divisor (GCD) test, points-to analysis, interprocedural analysis, … (a sketch of the GCD test follows).
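A minimal C sketch of the GCD test (illustrative function names): for a write to X[a*i + b] and a read from X[c*i + d] in the same loop, a dependence is possible only if gcd(a, c) divides (d - b).

/* Greatest common divisor; assumes non-negative coefficients. */
static int gcd(int m, int n)
{
    while (n != 0) {
        int t = m % n;
        m = n;
        n = t;
    }
    return m;
}

/* Returns 1 if a dependence between X[a*i + b] (written) and
   X[c*i + d] (read) cannot be ruled out; 0 if the test proves
   the two accesses never touch the same element. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(a, c);
    if (g == 0)                 /* both coefficients zero */
        return b == d;
    return (d - b) % g == 0;
}

The test is conservative: a return value of 1 does not prove a dependence exists; it only fails to rule one out.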

CPSC614 Lec 6.36

Eliminating Dependent Computation

• Algebraic Simplifications of Expressions

• Copy propagation
  – Eliminates operations that copy values.

      DADDIU R1, R2, #4
      DADDIU R1, R1, #4

  becomes

      DADDIU R1, R2, #8

CPSC614 Lec 6.37

Eliminating Dependent Computation

• Tree Height Reduction
  – Reduces the height of the tree structure representing a computation, making the first two ADDs independent.

      ADD R1, R2, R3
      ADD R4, R1, R6
      ADD R8, R4, R7

  becomes

      ADD R1, R2, R3
      ADD R4, R6, R7
      ADD R8, R1, R4

CPSC614 Lec 6.38

Eliminating Dependent Computation

• Recurrences

sum = sum + x1 + x2 + x3 + x4 + x5;

becomes

sum = ((sum + x1) + (x2 + x3)) + (x4 + x5);

The three inner additions can proceed in parallel, shortening the dependence chain from five additions to three.

CPSC614 Lec 6.39

Software Pipelining

• Technique for reorganizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop.

• By choosing instructions from different iterations, dependent computations are separated from one another by an entire loop body.

CPSC614 Lec 6.40

Software Pipelining

• Counterpart to what Tomasulo’s algorithm does in hardware

• Software pipelining symbolically unrolls the loop and then selects instructions from each iteration.

• Start-up code before the loop and finish-up code after the loop are required.

CPSC614 Lec 6.41

Software Pipelining

CPSC614 Lec 6.42

Software Pipelining - Example

• Show a software-pipelined version of the following loop. Omit the start-up and finish-up code.

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, #-8
      BNE    R1, R2, Loop
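One possible software-pipelined kernel, sketched in C rather than MIPS (hypothetical variable names; the start-up code that primes t_load and t_add, and the finish-up code for the last two elements, are omitted as the problem requests). Each trip through the body stores the result for element i+2, computes the sum for element i+1, and loads element i, so dependent operations sit a full loop body apart:

/* Software-pipelined form of: for (i = 1000; i > 0; i--) x[i] += s; */
double t_load, t_add;   /* carry values across iterations, like F0 and F4 */
/* ... start-up code priming t_load and t_add omitted ... */
for (i = 998; i >= 1; i = i - 1) {
    x[i + 2] = t_add;       /* store: result for element i + 2 */
    t_add    = t_load + s;  /* add:   sum for element i + 1    */
    t_load   = x[i];        /* load:  fetch element i          */
}
/* ... finish-up code for elements 2 and 1 omitted ... */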

CPSC614 Lec 6.43

Software Pipelining

• Software pipelining consumes less code space.

• Loop unrolling reduces the overhead of the loop (branch, counter update code).

• Software pipelining reduces the time when the loop is not running at peak speed to once per loop at the beginning and end.


CPSC614 Lec 6.45

HW Support for More Parallelism at Compile Time

Conditional Instructions

• Predicated instructions.

• An extension of the instruction set.

• Conditional instruction: an instruction that refers to a condition, which is evaluated as part of the instruction's execution.
  – Condition true: executed normally.
  – Condition false: no-op.
  – Example: conditional move.

CPSC614 Lec 6.46

Example

if (A == 0) { S = T; }

With a branch (R1 = A, R2 = S, R3 = T):

        BNEZ  R1, L
        ADDU  R2, R3, R0
L:

With a conditional move, which copies only if the third operand equals zero:

        CMOVZ R2, R3, R1

CPSC614 Lec 6.47

• Conditional moves are used to change a control dependence into a data dependence.

• Handling multiple branches per cycle is complex, so conditional moves provide a way of reducing branch pressure.

• A conditional move can often eliminate a branch that is hard to predict, increasing the potential gain.
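In C terms, the same rewrite replaces control flow with a data computation (a sketch; whether the compiler emits an actual conditional-move instruction is target- and optimizer-dependent):

/* Branchy form: the assignment to S is control dependent on the test. */
if (A == 0)
    S = T;

/* Branch-free form: S is data dependent on the test, matching what
   CMOVZ computes. */
S = (A == 0) ? T : S;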