Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures

Page 1: Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures

Lecture 1: Introduction

Instruction Level Parallelism & Processor Architectures

Page 2: Instruction Level Parallelism (ILP)

Simultaneous execution of multiple instructions.

    do {
        Swap = 0;
        for (I = 0; I < Last; I++) {
            if (Tab[I] > Tab[I+1]) {    /* adjacent pair out of order: swap it */
                Temp = Tab[I];
                Tab[I] = Tab[I+1];
                Tab[I+1] = Temp;
                Swap = 1;
            }
        }
    } while (Swap);

Page 3: Barriers to detecting ILP

Control dependences

• Arise due to conditional branches

Data dependences

• Register dependences

• Memory dependences

Page 4: Branches

    j = 0;
    *q = false;
    while ((*q == false) && (j != 8)) {
        j = j + 1;
        *q = false;
        if ((b[j] == true) && (a[i+j] == true) && (c[i-j + 7] == true)) {
            x[i] = j;
            b[j] = false;
            a[i+j] = false;
            c[i-j + 7] = false;
            if ( ….

[Figure: the branches generated by this code: while ((*q ..., if (b[j]), if (a[i+j]), if (c[i-j+7]), followed by x[i] = j; ...]

Page 5: Frequent Branches

Frequent branches: a sequence of branch instructions in the dynamic instruction stream in which consecutive branches are separated by at most one non-branch instruction.

[Figure: bar chart of Dynamic Branches [%] (y-axis 0 to 70) for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, and INT.]
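One way to read the definition of frequent branches is as a count over a dynamic trace. The C sketch below assumes a hypothetical trace encoding (1 for a branch, 0 for any other instruction) and treats a branch as frequent when a neighbouring branch is at most one non-branch instruction away. The function name `frequent_branch_percent` and this interpretation are assumptions for illustration, not the measurement setup behind the chart.

```c
#include <assert.h>
#include <stddef.h>

/* Percentage of dynamic branches that have a neighbouring branch separated
   from them by at most one non-branch instruction (assumed interpretation). */
double frequent_branch_percent(const int *is_branch, size_t n)
{
    size_t total = 0;       /* all dynamic branches seen */
    size_t frequent = 0;    /* branches that belong to a "frequent" sequence */
    size_t gap = 0;         /* non-branch instructions since the previous branch */
    int have_prev = 0;      /* has any branch been seen yet? */
    int prev_counted = 0;   /* was the previous branch already counted? */

    assert(is_branch != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!is_branch[i]) {
            gap++;
            continue;
        }
        total++;
        if (have_prev && gap <= 1) {
            if (!prev_counted)
                frequent++;     /* count the earlier branch of the pair once */
            frequent++;         /* count the current branch */
            prev_counted = 1;
        } else {
            prev_counted = 0;
        }
        have_prev = 1;
        gap = 0;
    }
    return total ? 100.0 * (double)frequent / (double)total : 0.0;
}
```

For the trace B N B N N B B N N N B (B = branch, N = non-branch), four of the five branches qualify, so the function returns 80.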

Page 6: Branch Prediction Accuracy of gshare

[Figure: bar chart of Prediction Accuracy [%] (y-axis 0 to 100) for go, m88ksim, gcc, xlisp, perl, vortex.]
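gshare indexes a table of 2-bit saturating counters with the XOR of the branch address and a global branch history register. The C sketch below illustrates that scheme. The table size (4096 counters), the 12-bit history, the `pc >> 2` alignment shift, and all identifiers are illustrative assumptions, not the configuration measured in the chart.

```c
#include <assert.h>
#include <stdint.h>

#define GSHARE_BITS 12                      /* assumed: 4096-entry table */
#define GSHARE_SIZE (1u << GSHARE_BITS)

typedef struct {
    uint8_t  counter[GSHARE_SIZE];  /* 2-bit saturating counters, values 0..3 */
    uint32_t history;               /* global history, newest outcome in bit 0 */
} gshare_t;

void gshare_init(gshare_t *g)
{
    assert(g != 0);
    for (uint32_t i = 0; i < GSHARE_SIZE; i++)
        g->counter[i] = 1;          /* start weakly not-taken */
    g->history = 0;
}

/* Table index: branch address XOR global history. */
static uint32_t gshare_index(const gshare_t *g, uint32_t pc)
{
    return ((pc >> 2) ^ g->history) & (GSHARE_SIZE - 1);
}

/* Returns 1 to predict taken, 0 to predict not taken. */
int gshare_predict(const gshare_t *g, uint32_t pc)
{
    return g->counter[gshare_index(g, pc)] >= 2;
}

/* Train the counter with the real outcome, then shift it into the history. */
void gshare_update(gshare_t *g, uint32_t pc, int taken)
{
    uint32_t i = gshare_index(g, pc);
    if (taken) {
        if (g->counter[i] < 3) g->counter[i]++;
    } else {
        if (g->counter[i] > 0) g->counter[i]--;
    }
    g->history = ((g->history << 1) | (taken ? 1u : 0u)) & (GSHARE_SIZE - 1);
}
```

Because the history is part of the index, a branch that strictly alternates between taken and not taken ends up using two different counters, one per outcome, and both predictions become correct after warm-up.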

Page 7: Memory Dependences

Reordering of memory instructions, loads and stores, is not always possible.

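Example 1, original order:

    Store R1, addr
    Load  R2, addr'
    Add   R1, R2

Example 1, reordered (valid only if addr != addr'):

    Load  R2, addr'
    Store R1, addr
    Add   R1, R2

Example 2, original order:

    Store R5, addr
    Store R2, addr'
    Load  R1, addr'
    Add   R1, R3

Example 2, reordered (valid only if addr != addr'):

    Store R2, addr'
    Load  R1, addr'
    Store R5, addr
    Add   R1, R3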
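The condition addr != addr' can be illustrated with two C functions that perform the same store, load, and add in different orders. This is a minimal sketch with hypothetical function names; it mirrors the three-instruction example above (store, load, add) and is not taken from the lecture.

```c
#include <assert.h>

/* Original order: Store R1, addr; Load R2, addr'; Add R1, R2 */
int store_then_load(int *addr, int *addr2, int r1)
{
    assert(addr != 0 && addr2 != 0);
    *addr = r1;             /* Store R1, addr  */
    int r2 = *addr2;        /* Load  R2, addr' */
    return r1 + r2;         /* Add   R1, R2    */
}

/* Reordered: the load is hoisted above the store. */
int load_then_store(int *addr, int *addr2, int r1)
{
    assert(addr != 0 && addr2 != 0);
    int r2 = *addr2;        /* Load  R2, addr' */
    *addr = r1;             /* Store R1, addr  */
    return r1 + r2;         /* Add   R1, R2    */
}
```

When the two pointers refer to different locations, both functions return the same value. When they alias, the hoisted load reads the old contents of memory instead of the value just stored, so the results differ. That is why hardware or the compiler must disambiguate the addresses before reordering.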

Page 8: Memory Disambiguation

[Figure: instructions per cycle (y-axis 0 to 20) versus issue width (8, 16, 32) for Perfect and Simple memory disambiguation.]

Page 9: Value-based Store-set disambiguator

[Figure: IPC (y-axis 0 to 20) versus issue width (8, 16, 32) for Perfect and Value-based disambiguation.]

Page 10: Register Dependences

• True data dependences

• False data dependences

Before renaming:

    Add  R2, R3
    Load R2, .
    Add  R1, R2
    Load R1, ..
    Sub  R1, R2
    Load R1, .
    Load R3, .

After renaming (R1 becomes R4 in Load R4, .. and Sub R4, R2):

    Add  R2, R3
    Load R2, .
    Add  R1, R2
    Load R4, ..
    Sub  R4, R2
    Load R1, .
    Load R3, .
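Renaming removes false dependences by giving every register write a fresh destination while sources keep reading the most recent mapping. The C sketch below assumes eight architectural registers, an unbounded pool of physical registers, and two-operand instructions encoded as (dst, src1, src2). The types and names are hypothetical and chosen only for illustration.

```c
#include <assert.h>

#define NUM_ARCH_REGS 8     /* assumed architectural registers R0..R7 */
#define NO_REG (-1)         /* marks an unused operand */

typedef struct {
    int dst;                /* destination register or NO_REG */
    int src1, src2;         /* source registers or NO_REG */
} instr_t;

/* Renames n instructions in program order. Physical registers
   0..NUM_ARCH_REGS-1 hold the initial architectural values; each write is
   given a new physical register. Returns the number of physical registers used. */
int rename_registers(const instr_t *in, instr_t *out, int n)
{
    int map[NUM_ARCH_REGS];             /* architectural -> physical */
    int next_phys = NUM_ARCH_REGS;

    for (int r = 0; r < NUM_ARCH_REGS; r++)
        map[r] = r;

    for (int i = 0; i < n; i++) {
        assert(in[i].dst < NUM_ARCH_REGS);
        assert(in[i].src1 < NUM_ARCH_REGS && in[i].src2 < NUM_ARCH_REGS);

        /* Sources read the current mapping, which preserves true dependences. */
        out[i].src1 = (in[i].src1 == NO_REG) ? NO_REG : map[in[i].src1];
        out[i].src2 = (in[i].src2 == NO_REG) ? NO_REG : map[in[i].src2];

        /* A fresh destination removes WAW and WAR dependences. */
        if (in[i].dst == NO_REG) {
            out[i].dst = NO_REG;
        } else {
            out[i].dst = next_phys;
            map[in[i].dst] = next_phys;
            next_phys++;
        }
    }
    return next_phys;
}
```

For the sequence Load R1; Add R1, R2; Load R1; Sub R1, R2, the second Load receives a different physical register from the first, so the Load/Sub pair no longer has to wait for the Load/Add pair, while each arithmetic instruction still reads the value produced by its own Load.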

Page 11: Window Size vs ILP (issue width = 16)

[Figure: instructions per cycle (y-axis 3 to 9) versus window size (8, 16, 32, 64, 128, 256, 512, 1024).]

Page 12: Parallelism Study - ILP in Spec95

[Figure: instructions per cycle (y-axis 0 to 30) versus window size (8, 32, 128, 512, 2048, 8192) for 8-issue, 16-issue, 32-issue, and 64-issue configurations.]
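The interaction between window size and issue width can be explored with a toy scheduler. The C sketch below is not the simulator behind these charts. It assumes unit latency, at most one producer per instruction, perfect branch prediction, and a window of consecutive instructions that starts at the oldest unexecuted one. The function name `schedule_cycles` and the `dep` encoding are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* dep[i] is the index of the earlier instruction that produces the input of
   instruction i, or -1 if it has none. Returns the number of cycles needed to
   execute all n instructions, or -1 if memory allocation fails. */
int schedule_cycles(const int *dep, int n, int window, int width)
{
    assert(n >= 0 && window >= 1 && width >= 1);
    if (n == 0)
        return 0;

    int *done = calloc((size_t)n, sizeof *done);   /* cycle executed, 0 = not yet */
    if (done == NULL)
        return -1;

    int head = 0;       /* oldest instruction that has not executed */
    int cycle = 0;

    while (head < n) {
        cycle++;
        int issued = 0;
        int limit = (head + window < n) ? head + window : n;

        /* Issue ready instructions inside the window, oldest first. */
        for (int i = head; i < limit && issued < width; i++) {
            if (done[i])
                continue;
            int p = dep[i];
            assert(p < i);
            if (p < 0 || (done[p] != 0 && done[p] < cycle)) {
                done[i] = cycle;
                issued++;
            }
        }
        /* Slide the window past everything that has executed. */
        while (head < n && done[head])
            head++;
    }
    free(done);
    return cycle;
}
```

With two independent four-instruction dependence chains placed one after the other, a window of 8 finishes in 4 cycles (IPC 2), a window of 4 needs 5 cycles, and a window of 1 needs 8 cycles (IPC 1): the second chain cannot start until the window reaches it.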

Page 13: Conclusions

• There is an ample amount of parallelism available to scale the issue width.

• Very large instruction windows must be implemented.

• A highly accurate memory disambiguation mechanism is required.

• Highly accurate branch prediction must be performed.

• False register dependences should be avoided.

Page 14: Processors

• Pipelined

• Advanced Pipelining

• Superscalars

• Very Long Instruction Word (VLIW)

• Multiprocessors/Multicores

Page 15: Pipelined Processors

In-order, overlapped execution of instructions. E.g., a 5-stage pipeline: instruction fetch (F), decode and register operand fetch (D), execute (E), memory operand fetch (M), and write-back of results (WB).

    F  D  E  M  WB
       F  D  E  M  WB
          F  D  E  M  WB

MIPS R4000 has an 8 stage pipeline.
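The benefit of overlapped execution can be put in numbers. Assuming an ideal k-stage pipeline with no stalls (no hazards, one instruction entering per cycle), the first instruction completes after k cycles and each later one completes one cycle after its predecessor. The C helpers below are an illustrative sketch; their names are not from the lecture.

```c
#include <assert.h>

/* Cycles for n instructions on an ideal k-stage pipeline without stalls. */
long pipelined_cycles(long n, long k)
{
    assert(n >= 0 && k >= 1);
    return (n == 0) ? 0 : k + (n - 1);
}

/* Cycles for n instructions when each one runs all k stages before the next starts. */
long unpipelined_cycles(long n, long k)
{
    assert(n >= 0 && k >= 1);
    return n * k;
}
```

For the three instructions in the diagram and k = 5, this gives 7 cycles instead of 15. For 1000 instructions it gives 1004 instead of 5000, so the speedup approaches the number of stages as n grows.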

Page 16: Causes of Pipeline Delays

Data dependences

• RAW hazards: handled by register bypass and by code reordering by the compiler.

Register hazards

• WAW hazards: instructions may reach the WB stage out-of-order.

• No WAR hazards.

Branch delays

• Compiler fills branch delay slots vs. hardware performs branch prediction.

Structural hazards

• Due to nonpipelined units.

• Register writes when multiple instructions reach the WB stage at the same time (issue vs. retire rate).

Page 17: Advanced Pipelining

In-order issue but Out-of-order execution

DIVD F0, F2, F4

ADDD F10, F0, F8

SUBD F8, F8, F14

Execute SUBD before ADDD

Dynamic scheduling: scoreboarding, Tomasulo's algorithm
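The dependences among the three instructions above can be checked mechanically. The C sketch below classifies the hazards a later instruction has with respect to an earlier one, assuming the three-operand form `OP dst, src1, src2` with registers encoded as plain integers (F0 as 0, F2 as 2, and so on). The type and function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    int dst, src1, src2;    /* register numbers: OP dst, src1, src2 */
} fp_instr_t;

enum { HAZ_NONE = 0, HAZ_RAW = 1, HAZ_WAR = 2, HAZ_WAW = 4 };

/* Bitmask of hazards that `later` has with respect to `earlier`. */
int classify_hazards(fp_instr_t earlier, fp_instr_t later)
{
    int h = HAZ_NONE;

    /* RAW: later reads a register that earlier writes. */
    if (later.src1 == earlier.dst || later.src2 == earlier.dst)
        h |= HAZ_RAW;

    /* WAR: later writes a register that earlier reads. */
    if (later.dst == earlier.src1 || later.dst == earlier.src2)
        h |= HAZ_WAR;

    /* WAW: both write the same register. */
    if (later.dst == earlier.dst)
        h |= HAZ_WAW;

    return h;
}
```

Applied to the example, ADDD has a RAW dependence on DIVD through F0, so it must wait for the divide. SUBD has no dependence on DIVD, which is why it can execute first. SUBD does have a WAR dependence on ADDD through F8, so the hardware must make sure ADDD obtains the old value of F8, for example by reading it before SUBD writes or by renaming.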

Page 18: Superscalar Processors

• Multiple instructions can be issued in each cycle.

• Speculative Execution is incorporated (commit or discard results).

AMD-K7 is a 9-issue superscalar.

    F  D  E  M  WB
    F  D  E  M  WB
    F  D  E  M  WB
    F  D  E  M  WB
    F  D  E  M  WB
    F  D  E  M  WB

PowerPC is a 4-issue superscalar.

Page 19: VLIW

• Each long instruction contains multiple operations that are executed in parallel.

• Compiler performs speculation and recovery.

    F  D  E E E E  WB    (the E units execute in parallel)
    F  D  E E E E  WB

Multiflow 500 can issue up to 28 operations in each instruction (instructions can be up to 1024 bits).

Itanium: 128-bit instruction bundle holding 3 operations (41 bits each) and a 5-bit template.
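The Itanium bundle arithmetic works out as 3 x 41 + 5 = 128 bits. The C sketch below packs and unpacks such a bundle, assuming the IA-64 layout with the template in bits 0 to 4 and the three operation slots in bits 5 to 45, 46 to 86, and 87 to 127. The struct and function names are hypothetical, and the bundle is held in two 64-bit halves because standard C has no 128-bit integer type.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t lo;    /* bundle bits 0..63 */
    uint64_t hi;    /* bundle bits 64..127 */
} bundle_t;

#define SLOT_BITS 41
#define SLOT_MASK ((UINT64_C(1) << SLOT_BITS) - 1)
#define TMPL_MASK UINT64_C(0x1F)

/* Build a bundle from a 5-bit template and three 41-bit operations. */
bundle_t bundle_pack(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2)
{
    bundle_t b;
    s0 &= SLOT_MASK;
    s1 &= SLOT_MASK;
    s2 &= SLOT_MASK;
    b.lo = ((uint64_t)tmpl & TMPL_MASK) | (s0 << 5) | (s1 << 46);
    b.hi = (s1 >> 18) | (s2 << 23);     /* slot 1 straddles the two halves */
    return b;
}

unsigned bundle_template(bundle_t b)
{
    return (unsigned)(b.lo & TMPL_MASK);
}

/* Extract operation slot 0, 1, or 2. */
uint64_t bundle_slot(bundle_t b, int slot)
{
    switch (slot) {
    case 0:  return (b.lo >> 5) & SLOT_MASK;
    case 1:  return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
    case 2:  return (b.hi >> 23) & SLOT_MASK;
    default: assert(0 && "slot must be 0, 1, or 2"); return 0;
    }
}
```

The template field tells the hardware which execution unit types the three slots need and where groups of independent operations end, which is how the compiler communicates its schedule to the processor.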

Page 20: Control Dependences - Instruction Window

Superscalar

Hardware branch prediction guides fetching of instructions to fill up the processor's instruction window.

Instructions are issued from the window as they become ready, that is, out-of-order execution is possible.

VLIW

Programs are first profiled.

The compiler uses the profiles to trace out likely paths. A trace is a software instruction window.

Instruction reordering is performed by the compiler within the trace.

Page 21: Data Dependences - Exploiting ILP

Superscalar

Memory dependences: HW load-store disambiguation techniques are used to enable out-of-order execution.

False register dependences: Avoided using register renaming.

True data dependences: Must be honored. Value prediction allows out-of-order execution of dependent instructions.

VLIW

Memory dependences: Detected by the compiler using dependence analysis or address profiling.

False data dependences: Avoided by the compiler through renaming (memory) and register allocation.

True data dependences: Are strictly followed. Reordering is possible with HW support.