\course\ELEG652-03Fall\Topic3-652 1 Exploitation of Instruction-Level Parallelism (ILP)


Page 1:

Exploitation of Instruction-Level Parallelism (ILP)

Page 2:

Reading List

• Slides: Topic4x

• Henn&Patt: Chapter 4

• Other assigned readings from homework and classes

Page 3:

Design Space for Processors

[Figure: processor design space, plotting cycles per instruction (0.05 to 20) against clock rate (5 to 1000 MHz). Regions shown: scalar CISC, scalar RISC, superpipelined, multithreaded/superscalar, vector supercomputer, and VLIW, with the "most likely future processor space" in the low-CPI, high-clock-rate corner. Annotation: "Enough parallelism?" [TheobaldGaoHen 1992, 1993, 1994]]

Page 4:

Pipelining - A Review

Hazards

• Structural: resource conflicts, when the hardware cannot support all possible combinations of instructions in overlapped execution.

• Data: an instruction depends on the result of a previous instruction.

• Control: caused by branches and other instructions that change the PC.

• A hazard causes a "stall" - and in a pipeline a stall is serious, since it holds up multiple instructions.

Page 5:

RISC Concepts: Revisited

• What makes it a success?
  - pipelining
  - caches

• What prevents CPI = 1?
  - hazards and their resolution
  - definition: the dependence graph

Page 6:

Structural hazards arise from:
- non-pipelined FUs
- a single port on the register file
- a single port on memory

Data hazards: for some data hazards (e.g. ALU/ALU ops) the solution is forwarding (bypassing); for others, a pipeline interlock plus a pipeline stall is needed (bypassing cannot deliver the value in time).

LD  R1, A
ADD R4, R1, R7   ; this may need a "stall" or bubble

Page 7:

Example of Structural Hazard

Instruction        Clock cycle number
                   1    2    3    4     5    6    7    8    9
Load instruction   IF   ID   EX   MEM   WB
Instruction i+1         IF   ID   EX    MEM  WB
Instruction i+2              IF   ID    EX   MEM  WB
Instruction i+3                   stall IF   ID   EX   MEM  WB
Instruction i+4                         IF   ID   EX   MEM

Page 8:

Data Hazard

Instruction       1    2    3                      4    5                       6
ADD instruction   IF   ID   EX                     MEM  WB (data written here)
SUB instruction        IF   ID (data read here)    EX   MEM                     WB

The ADD instruction writes a register that is a source operand of the SUB instruction, but the ADD does not finish writing the data into the register file until three clock cycles after the SUB begins reading it!

(1) The data hazard may cause SUB to read the wrong value.
(2) This is dangerous: the result may be non-deterministic.
(3) Remedy: forwarding (bypassing).

Page 9:

ADD R1, R2, R3     IF ID EX MEM WB
SUB R4, R1, R5        IF ID EX MEM WB
AND R6, R1, R7           IF ID EX MEM WB
OR  R8, R1, R9              IF ID EX MEM WB
XOR R10, R1, R11               IF ID EX MEM WB

A set of instructions in the pipeline that need results forwarded.

Page 10:

A <- B + C
...
E <- A + D

Flow dependency

(R/W conflict)

Page 11:

A <- B + C
...
A <- B - C

Output dependency

(W/W conflict)

Reordering these leaves A in the wrong (stale) state.

Page 12:

A <- A + B
...
A <- C + D

Anti-dependency

(W/R conflict)
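The three dependence kinds above can be checked mechanically from each statement's read and write sets. A minimal Python sketch (the tuple representation and the function name are invented here for illustration):

```python
# Hypothetical sketch: classify the dependences between two statements,
# each given as (set_of_written_locations, set_of_read_locations).
def classify(first, second):
    """Return the dependence types from `first` to `second` (program order)."""
    w1, r1 = first
    w2, r2 = second
    deps = []
    if w1 & r2:
        deps.append("flow (RAW)")    # second reads what first wrote
    if w1 & w2:
        deps.append("output (WAW)")  # both write the same location
    if r1 & w2:
        deps.append("anti (WAR)")    # second overwrites what first read
    return deps

# A <- B + C ; E <- A + D  ==> flow dependence through A
print(classify(({"A"}, {"B", "C"}), ({"E"}, {"A", "D"})))  # → ['flow (RAW)']
```

Note that the anti-dependency example on this slide (A <- A + B; A <- C + D) also carries an output dependence, since both statements write A.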

Page 13:

Not all data hazards can be eliminated by bypassing:

LW  R1, 32(R6)
ADD R4, R1, R7
SUB R5, R1, R8
AND R6, R1, R7

Page 14:

• Load latency cannot be eliminated by forwarding alone.

• It is often handled by a "pipeline interlock", which detects the hazard and stalls the pipeline; the delay cycle is called a stall or "bubble".

Any instruction   IF ID EX MEM WB
LW  R1, 32(R6)       IF ID EX    MEM WB
ADD R4, R1, R7          IF ID    stall EX MEM WB
SUB R5, R1, R8             IF    stall ID EX MEM WB
AND R6, R1, R7                   stall IF ID EX MEM WB
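The interlock above can be modeled with a toy issue-time calculator. This is only a sketch under stated assumptions: a single-issue DLX-style pipeline with full forwarding, where a load's result becomes available to a dependent instruction one cycle later than an ALU result (hence exactly one bubble for a load-use pair). The instruction encoding and function name are made up:

```python
# Toy model: compute the cycle in which each instruction enters EX,
# assuming single issue, full forwarding, and a 1-cycle load-use delay.
def issue_cycles(instrs):
    """instrs: list of (op, dest, sources). Returns per-instruction EX cycle."""
    ready = {}          # register -> earliest cycle its value can feed EX
    cycle = 1
    out = []
    for op, dest, srcs in instrs:
        # stall until all source registers are forwardable
        start = max([cycle] + [ready.get(r, 0) for r in srcs])
        out.append(start)
        # a load's result is ready one slot later than an ALU result
        ready[dest] = start + 2 if op == "LW" else start + 1
        cycle = start + 1
    return out

prog = [("LW", "R1", []), ("ADD", "R4", ["R1", "R7"]),
        ("SUB", "R5", ["R1", "R8"]), ("AND", "R6", ["R1", "R7"])]
print(issue_cycles(prog))  # → [1, 3, 4, 5]
```

The ADD enters EX in cycle 3 instead of cycle 2 (one bubble), and SUB and AND then proceed without further stalls, matching the table above.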

Page 15:

"Issue" - passing the ID stage.

DLX issues an instruction only when there is no hazard.

Detecting the interlock early in the pipeline has the advantage that the machine never needs to suspend an instruction and undo its state changes.

Page 16:

Exploiting Instruction-Level Parallelism

static scheduling:
• simple scheduling
• loop unrolling
• loop unrolling + scheduling
• software pipelining

dynamic scheduling:
• out-of-order execution
• dataflow computers

Page 17:

Constraint Graph

• directed edges: data dependences

• undirected edges: resource constraints

• An edge (u,v) (directed or undirected) of length e represents an interlock between nodes u and v; they must be separated by e time units.

[Figure: constraint graph over statements S1-S6, with edges labeled by operation latencies (the values 1, 2, 3, 4, and 6 appear in the figure).]

Page 18:

Code Scheduling for a Single Pipeline (the CSSP problem)

Input: a constraint graph G = (V, E).

Output: a sequence of the operations in G, v1, v2, ..., vn, with a number of no-ops no greater than k, such that:

1. if the no-ops are deleted, the result is a topological sort of G;

2. any two nodes u, v in the sequence are separated by a distance >= d(u,v).
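A greedy scheduler that satisfies both conditions can be sketched in a few lines. This is only an illustrative sketch, not an optimal CSSP solver: it handles directed edges only, assumes the graph is acyclic, and uses invented names throughout.

```python
# Hedged sketch of a greedy list scheduler for the CSSP formulation above.
# edges: dict mapping (u, v) -> required separation d(u, v). Assumes G acyclic.
def schedule(nodes, edges):
    """Return a slot list over `nodes`, inserting 'NOP' where nothing is ready."""
    preds = {v: [] for v in nodes}
    for (u, v), d in edges.items():
        preds[v].append((u, d))
    placed = {}           # node -> slot index where it was scheduled
    slots = []
    remaining = list(nodes)
    while remaining:
        t = len(slots)    # current slot
        # pick the first node whose predecessors are placed far enough back
        for n in remaining:
            if all(u in placed and t - placed[u] >= d for u, d in preds[n]):
                placed[n] = t
                slots.append(n)
                remaining.remove(n)
                break
        else:
            slots.append("NOP")   # no node is ready: insert a no-op
    return slots

# S1 -> S2 with required separation 2 (invented example): one NOP is needed
print(schedule(["S1", "S2"], {("S1", "S2"): 2}))  # → ['S1', 'NOP', 'S2']
```

Deleting the NOPs from the output yields a topological sort of the directed edges, and every pair (u,v) is separated by at least d(u,v) slots, as the problem statement requires.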

Page 19:

Advanced Pipelining

• instruction reordering/scheduling within the loop body

• loop unrolling: the code is not compact

• superscalar: compact code + multiple issue of different classes of instructions

• VLIW

Page 20:

An Example: Y[i] = X[i] + a

Loop: LD   F0, 0(R1)    ; load the vector element
      ADDD F4, F0, F2   ; add the scalar in F2
      SD   0(R1), F4    ; store the vector element
      SUB  R1, R1, #8   ; decrement the pointer by 8 bytes (per DW)
      BNEZ R1, LOOP     ; branch when it's not zero

Page 21:

Instruction producing result   Destination instruction   Latency in clock cycles
FP ALU op                      Another FP ALU op         3
FP ALU op                      Store double              2
Load double                    FP ALU op                 1
Load double                    Store double              0

Latencies of FP operations used in this section. The first column shows the originating instruction type. The second column is the type of the consuming instruction. The last column is the number of intervening clock cycles needed to avoid a stall. These numbers are similar to the average latencies we would see on an FP unit, like the one we described for DLX in the last chapter. The major change versus the DLX FP pipeline was to reduce the latency of FP multiply; this helps keep our examples from becoming unwieldy. The latency of a floating-point load to a store is zero, since the result of the load can be bypassed without stalling the store. We will continue to assume an integer load latency of 1 and an integer ALU operation latency of 0.
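Encoded as a lookup table, these latencies give the stall count of a dependent chain directly. A small sketch (the class names and function are mine, not from the text):

```python
# Producer/consumer latencies from the table above: each value is the number
# of intervening clock cycles needed to avoid a stall.
LATENCY = {("fp_alu", "fp_alu"): 3, ("fp_alu", "store"): 2,
           ("load", "fp_alu"): 1, ("load", "store"): 0}

def stall_cycles(dep_chain):
    """dep_chain: list of (producer_class, consumer_class) dependences
    between adjacent instructions; returns the total stall cycles."""
    return sum(LATENCY.get(pair, 0) for pair in dep_chain)

# In the loop body below, LD feeds ADDD (1 stall) and ADDD feeds SD (2 stalls)
print(stall_cycles([("load", "fp_alu"), ("fp_alu", "store")]))  # → 3
```

Those 3 stalls, added to the 5 instructions of the loop body and the branch-delay stall, account for the 9 cycles per iteration shown on the next slide.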

Page 22:

Without any scheduling, the loop executes as follows:

                          Clock cycle issued
Loop: LD   F0, 0(R1)        1
      stall                 2
      ADDD F4, F0, F2       3
      stall                 4
      stall                 5
      SD   0(R1), F4        6
      SUB  R1, R1, #8       7
      BNEZ R1, LOOP         8
      stall                 9

This requires 9 clock cycles per iteration.

Page 23:

We can schedule the loop to obtain:

Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      SUB  R1, R1, #8
      BNEZ R1, LOOP     ; delayed branch
      SD   8(R1), F4    ; offset changed because SD was interchanged with SUB

Average: 6 cycles per element.

Page 24:

Loop unrolling:

Here is the result after dropping the unnecessary SUB and BNEZ operations duplicated during unrolling.

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4      ; drop SUB & BNEZ
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8     ; drop SUB & BNEZ
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12   ; drop SUB & BNEZ
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUB  R1, R1, #32
      BNEZ R1, LOOP

Average: 6.8 cycles per element.

Page 25:

Unrolling + Scheduling

Show the unrolled loop from the previous example after it has been scheduled on DLX.

Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SD   -16(R1), F12
      SUB  R1, R1, #32    ; branch dependence
      BNEZ R1, LOOP
      SD   8(R1), F16     ; 8 - 32 = -24

The execution time of the unrolled loop has dropped to a total of 14 clock cycles, or 3.5 clock cycles per element, compared with 6.8 per element before scheduling.
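The cycles-per-element figures quoted across these slides are simple ratios; a quick check (6.75 is quoted as ~6.8 in the text):

```python
# Arithmetic behind the cycles-per-element figures for each loop version.
def per_element(total_cycles, elements):
    return total_cycles / elements

print(per_element(9, 1))    # original loop, unscheduled  → 9.0
print(per_element(6, 1))    # original loop, scheduled    → 6.0
print(per_element(27, 4))   # unrolled by 4, unscheduled  → 6.75 (~6.8)
print(per_element(14, 4))   # unrolled by 4 + scheduled   → 3.5
```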

Page 26:

Simple unrolling:

[Figure: dataflow graph of the unrolled loop. Four LD -> add-a (via F2) -> SD chains operate at R1 offsets 0, -8, -16, and -24, issuing in cycles 1-3, 4-6, 7-9, and 10-12 respectively.]

We have eliminated three branches and three decrements of R1. The addresses on the loads and stores have been compensated for. Without scheduling, every operation is followed by a dependent operation, and thus will cause a stall. This loop will run in 27 clock cycles - each LD takes 2 clock cycles, each ADDD 3, the branch 2, and all other instructions 1 - or 6.8 clock cycles for each of the four elements.

y[i] = X[i] + a

27 cycles / 4 elem. = 6.8 cycles/elem.

Page 27:

Unrolling + Scheduling

[Figure: dataflow graph of the unrolled and scheduled loop. The same four LD -> add-a (via F2) -> SD chains now issue interleaved: the loads in cycles 1-4, the adds in cycles 5-8, and the stores in the remaining cycles through 12.]

14 cycles / 4 elem. = 3.5 cycles/elem.