\course\ELEG652-03Fall\Topic3-652 1 Exploitation of Instruction-Level Parallelism (ILP)


Page 1:

Exploitation of Instruction-Level Parallelism (ILP)

Page 2:

Reading List

• Slides: Topic4x

• Henn&Patt: Chapter 4

• Other assigned readings from homework and classes

Page 3:

Design Space for Processors

[Figure: processor design space, plotting cycles per instruction (0.05 to 20) against clock rate (5 to 1000 MHz). Regions shown: scalar CISC, scalar RISC, superpipelined, multithreaded/superscalar, vector supercomputer, and VLIW, with the "most likely future processor space" in the low-CPI, high-clock-rate corner. Annotation: "Enough parallelism?" [TheobaldGaoHen 1992, 1993, 1994]]

Page 4:

Pipelining - A Review

Hazards

• Structural: resource conflicts, when the hardware cannot support all possible combinations of instructions in overlapped execution.

• Data: an instruction depends on the result of a previous instruction.

• Control: caused by branches and other instructions that change the PC.

• A hazard causes a "stall" - and in a pipeline a stall is serious, since it holds up multiple instructions.

Page 5:

RISC Concepts: Revisited

• What makes it a success?
  - pipelining
  - caches

• What prevents CPI = 1?
  - hazards and their resolution
  - definition: the dependence graph

Page 6:

Structural hazards arise from:
- non-pipelined FUs
- a single port on the register file
- a single port on memory

Data hazards: for some data hazards (e.g. ALU/ALU ops) the solution is forwarding (bypassing); for others, a pipeline interlock plus a pipeline stall is needed (bypassing cannot deliver the value in time).

LD  R1, A
ADD R4, R1, R7   ; this may need a "stall" or bubble

Page 7:

Example of Structural Hazard

Instruction        Clock cycle number
                   1    2    3    4     5    6    7    8    9
Load instruction   IF   ID   EX   MEM   WB
Instruction i+1         IF   ID   EX    MEM  WB
Instruction i+2              IF   ID    EX   MEM  WB
Instruction i+3                   stall IF   ID   EX   MEM  WB
Instruction i+4                         IF   ID   EX   MEM

Page 8:

Data Hazard

Instruction       1    2    3                      4    5                       6
ADD instruction   IF   ID   EX                     MEM  WB (data written here)
SUB instruction        IF   ID (data read here)    EX   MEM                     WB

The ADD instruction writes a register that is a source operand of the SUB instruction, but the ADD does not finish writing the data into the register file until three clock cycles after the SUB begins reading it!

(1) The data hazard may cause SUB to read the wrong value.
(2) This is dangerous: the result may be non-deterministic.
(3) Remedy: forwarding (bypassing).

Page 9:

ADD R1, R2, R3     IF ID EX MEM WB
SUB R4, R1, R5        IF ID EX MEM WB
AND R6, R1, R7           IF ID EX MEM WB
OR  R8, R1, R9              IF ID EX MEM WB
XOR R10, R1, R11               IF ID EX MEM WB

A set of instructions in the pipeline that need results forwarded.

Page 10:

A <- B + C
...
E <- A + D

Flow dependency

(R/W conflict)

Page 11:

A <- B + C
...
A <- B - C

Output dependency

(W/W conflict)

Reordering these leaves A in the wrong (stale) state.

Page 12:

A <- A + B
...
A <- C + D

Anti-dependency

(W/R conflict)
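The three dependence kinds above can be checked mechanically from each statement's read and write sets. A minimal Python sketch (the tuple representation and the function name are invented here for illustration):

```python
# Hypothetical sketch: classify the dependences between two statements,
# each given as (set_of_written_locations, set_of_read_locations).
def classify(first, second):
    """Return the dependence types from `first` to `second` (program order)."""
    w1, r1 = first
    w2, r2 = second
    deps = []
    if w1 & r2:
        deps.append("flow (RAW)")    # second reads what first wrote
    if w1 & w2:
        deps.append("output (WAW)")  # both write the same location
    if r1 & w2:
        deps.append("anti (WAR)")    # second overwrites what first read
    return deps

# A <- B + C ; E <- A + D  ==> flow dependence through A
print(classify(({"A"}, {"B", "C"}), ({"E"}, {"A", "D"})))  # → ['flow (RAW)']
```

Note that the anti-dependency example on this slide (A <- A + B; A <- C + D) also carries an output dependence, since both statements write A.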

Page 13:

Not all data hazards can be eliminated by bypassing:

LW  R1, 32(R6)
ADD R4, R1, R7
SUB R5, R1, R8
AND R6, R1, R7

Page 14:

• Load latency cannot be eliminated by forwarding alone.

• It is often handled by a "pipeline interlock", which detects the hazard and stalls the pipeline; the delay cycle is called a stall or "bubble".

Any instruction   IF ID EX MEM WB
LW  R1, 32(R6)       IF ID EX    MEM WB
ADD R4, R1, R7          IF ID    stall EX MEM WB
SUB R5, R1, R8             IF    stall ID EX MEM WB
AND R6, R1, R7                   stall IF ID EX MEM WB
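The interlock above can be modeled with a toy issue-time calculator. This is only a sketch under stated assumptions: a single-issue DLX-style pipeline with full forwarding, where a load's result becomes available to a dependent instruction one cycle later than an ALU result (hence exactly one bubble for a load-use pair). The instruction encoding and function name are made up:

```python
# Toy model: compute the cycle in which each instruction enters EX,
# assuming single issue, full forwarding, and a 1-cycle load-use delay.
def issue_cycles(instrs):
    """instrs: list of (op, dest, sources). Returns per-instruction EX cycle."""
    ready = {}          # register -> earliest cycle its value can feed EX
    cycle = 1
    out = []
    for op, dest, srcs in instrs:
        # stall until all source registers are forwardable
        start = max([cycle] + [ready.get(r, 0) for r in srcs])
        out.append(start)
        # a load's result is ready one slot later than an ALU result
        ready[dest] = start + 2 if op == "LW" else start + 1
        cycle = start + 1
    return out

prog = [("LW", "R1", []), ("ADD", "R4", ["R1", "R7"]),
        ("SUB", "R5", ["R1", "R8"]), ("AND", "R6", ["R1", "R7"])]
print(issue_cycles(prog))  # → [1, 3, 4, 5]
```

The ADD enters EX in cycle 3 instead of cycle 2 (one bubble), and SUB and AND then proceed without further stalls, matching the table above.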

Page 15:

"Issue" - passing the ID stage.

DLX issues an instruction only when there is no hazard.

Detecting the interlock early in the pipeline has the advantage that the machine never needs to suspend an instruction and undo its state changes.

Page 16:

Exploiting Instruction-Level Parallelism

static scheduling:
• simple scheduling
• loop unrolling
• loop unrolling + scheduling
• software pipelining

dynamic scheduling:
• out-of-order execution
• dataflow computers

Page 17:

Constraint Graph

• directed edges: data dependences

• undirected edges: resource constraints

• An edge (u,v) (directed or undirected) of length e represents an interlock between nodes u and v; they must be separated by e time units.

[Figure: constraint graph over statements S1-S6, with edges labeled by operation latencies (the values 1, 2, 3, 4, and 6 appear in the figure).]

Page 18:

Code Scheduling for a Single Pipeline (the CSSP problem)

Input: a constraint graph G = (V, E).

Output: a sequence of the operations in G, v1, v2, ..., vn, with a number of no-ops no greater than k, such that:

1. if the no-ops are deleted, the result is a topological sort of G;

2. any two nodes u, v in the sequence are separated by a distance >= d(u,v).
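A greedy scheduler that satisfies both conditions can be sketched in a few lines. This is only an illustrative sketch, not an optimal CSSP solver: it handles directed edges only, assumes the graph is acyclic, and uses invented names throughout.

```python
# Hedged sketch of a greedy list scheduler for the CSSP formulation above.
# edges: dict mapping (u, v) -> required separation d(u, v). Assumes G acyclic.
def schedule(nodes, edges):
    """Return a slot list over `nodes`, inserting 'NOP' where nothing is ready."""
    preds = {v: [] for v in nodes}
    for (u, v), d in edges.items():
        preds[v].append((u, d))
    placed = {}           # node -> slot index where it was scheduled
    slots = []
    remaining = list(nodes)
    while remaining:
        t = len(slots)    # current slot
        # pick the first node whose predecessors are placed far enough back
        for n in remaining:
            if all(u in placed and t - placed[u] >= d for u, d in preds[n]):
                placed[n] = t
                slots.append(n)
                remaining.remove(n)
                break
        else:
            slots.append("NOP")   # no node is ready: insert a no-op
    return slots

# S1 -> S2 with required separation 2 (invented example): one NOP is needed
print(schedule(["S1", "S2"], {("S1", "S2"): 2}))  # → ['S1', 'NOP', 'S2']
```

Deleting the NOPs from the output yields a topological sort of the directed edges, and every pair (u,v) is separated by at least d(u,v) slots, as the problem statement requires.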

Page 19:

Advanced Pipelining

• instruction reordering/scheduling within the loop body

• loop unrolling: the code is not compact

• superscalar: compact code + multiple issue of different classes of instructions

• VLIW

Page 20:

An Example: Y[i] = X[i] + a

Loop: LD   F0, 0(R1)    ; load the vector element
      ADDD F4, F0, F2   ; add the scalar in F2
      SD   0(R1), F4    ; store the vector element
      SUB  R1, R1, #8   ; decrement the pointer by 8 bytes (per DW)
      BNEZ R1, LOOP     ; branch when it's not zero

Page 21:

Instruction producing result   Destination instruction   Latency in clock cycles
FP ALU op                      Another FP ALU op         3
FP ALU op                      Store double              2
Load double                    FP ALU op                 1
Load double                    Store double              0

Latencies of FP operations used in this section. The first column shows the originating instruction type. The second column is the type of the consuming instruction. The last column is the number of intervening clock cycles needed to avoid a stall. These numbers are similar to the average latencies we would see on an FP unit, like the one we described for DLX in the last chapter. The major change versus the DLX FP pipeline was to reduce the latency of FP multiply; this helps keep our examples from becoming unwieldy. The latency of a floating-point load to a store is zero, since the result of the load can be bypassed without stalling the store. We will continue to assume an integer load latency of 1 and an integer ALU operation latency of 0.
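Encoded as a lookup table, these latencies give the stall count of a dependent chain directly. A small sketch (the class names and function are mine, not from the text):

```python
# Producer/consumer latencies from the table above: each value is the number
# of intervening clock cycles needed to avoid a stall.
LATENCY = {("fp_alu", "fp_alu"): 3, ("fp_alu", "store"): 2,
           ("load", "fp_alu"): 1, ("load", "store"): 0}

def stall_cycles(dep_chain):
    """dep_chain: list of (producer_class, consumer_class) dependences
    between adjacent instructions; returns the total stall cycles."""
    return sum(LATENCY.get(pair, 0) for pair in dep_chain)

# In the loop body below, LD feeds ADDD (1 stall) and ADDD feeds SD (2 stalls)
print(stall_cycles([("load", "fp_alu"), ("fp_alu", "store")]))  # → 3
```

Those 3 stalls, added to the 5 instructions of the loop body and the branch-delay stall, account for the 9 cycles per iteration shown on the next slide.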

Page 22:

Without any scheduling, the loop executes as follows:

                          Clock cycle issued
Loop: LD   F0, 0(R1)        1
      stall                 2
      ADDD F4, F0, F2       3
      stall                 4
      stall                 5
      SD   0(R1), F4        6
      SUB  R1, R1, #8       7
      BNEZ R1, LOOP         8
      stall                 9

This requires 9 clock cycles per iteration.

Page 23:

We can schedule the loop to obtain:

Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      SUB  R1, R1, #8
      BNEZ R1, LOOP     ; delayed branch
      SD   8(R1), F4    ; offset changed because SD was interchanged with SUB

Average: 6 cycles per element.

Page 24:

Loop unrolling:

Here is the result after dropping the unnecessary SUB and BNEZ operations duplicated during unrolling.

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4      ; drop SUB & BNEZ
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8     ; drop SUB & BNEZ
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12   ; drop SUB & BNEZ
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUB  R1, R1, #32
      BNEZ R1, LOOP

Average: 6.8 cycles per element.

Page 25:

Unrolling + Scheduling

Show the unrolled loop from the previous example after it has been scheduled on DLX.

Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SD   -16(R1), F12
      SUB  R1, R1, #32    ; branch dependence
      BNEZ R1, LOOP
      SD   8(R1), F16     ; 8 - 32 = -24

The execution time of the unrolled loop has dropped to a total of 14 clock cycles, or 3.5 clock cycles per element, compared with 6.8 per element before scheduling.
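The cycles-per-element figures quoted across these slides are simple ratios; a quick check (6.75 is quoted as ~6.8 in the text):

```python
# Arithmetic behind the cycles-per-element figures for each loop version.
def per_element(total_cycles, elements):
    return total_cycles / elements

print(per_element(9, 1))    # original loop, unscheduled  → 9.0
print(per_element(6, 1))    # original loop, scheduled    → 6.0
print(per_element(27, 4))   # unrolled by 4, unscheduled  → 6.75 (~6.8)
print(per_element(14, 4))   # unrolled by 4 + scheduled   → 3.5
```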

Page 26:

Simple unrolling:

[Figure: dataflow graph of the unrolled loop. Four LD -> add-a (via F2) -> SD chains operate at R1 offsets 0, -8, -16, and -24, issuing in cycles 1-3, 4-6, 7-9, and 10-12 respectively.]

We have eliminated three branches and three decrements of R1. The addresses on the loads and stores have been compensated for. Without scheduling, every operation is followed by a dependent operation, and thus will cause a stall. This loop will run in 27 clock cycles - each LD takes 2 clock cycles, each ADDD 3, the branch 2, and all other instructions 1 - or 6.8 clock cycles for each of the four elements.

y[i] = X[i] + a

27 cycles / 4 elem. = 6.8 cycles/elem.

Page 27:

Unrolling + Scheduling

[Figure: dataflow graph of the unrolled and scheduled loop. The same four LD -> add-a (via F2) -> SD chains now issue interleaved: the loads in cycles 1-4, the adds in cycles 5-8, and the stores in the remaining cycles through 12.]

14 cycles / 4 elem. = 3.5 cycles/elem.