POLITECNICO DI MILANO
Parallelism in wonderland: are you ready to see how deep the rabbit hole goes?
ILP: VLIW Architectures
Marco D. Santambrogio: [email protected]
Simone Campanoni: [email protected]


Page 1: ILP:  VLIW Architectures

POLITECNICO DI MILANO

Parallelism in wonderland: are you ready to see how deep the rabbit hole goes?

ILP: VLIW Architectures

Marco D. Santambrogio: [email protected]

Simone Campanoni: [email protected]

Page 2: ILP:  VLIW Architectures


Outline

Introduction to ILP (a short reminder)

VLIW architecture

Page 3: ILP:  VLIW Architectures

Beyond CPI = 1

Initial goal: achieve CPI = 1. Can we improve beyond this?

Two approaches: superscalar and VLIW.

Superscalar: a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo), e.g. IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000. The successful approach (to date) for general-purpose computing.


Page 4: ILP:  VLIW Architectures

Beyond CPI = 1

(Very) Long Instruction Word, (V)LIW: a fixed number of instructions (4-16) scheduled by the compiler and packed into wide instruction templates. Currently more successful in DSP and multimedia applications. Joint HP/Intel agreement in 1999/2000: Intel Architecture-64 (Merced/IA-64), 64-bit addressing, in the style of an "Explicitly Parallel Instruction Computer (EPIC)".

But first a little context ….


Page 5: ILP:  VLIW Architectures


Definition of ILP

ILP = Potential overlap of execution among unrelated instructions

Overlapping is possible if there are:
No structural hazards
No RAW, WAR, or WAW stalls
No control stalls
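The three kinds of data stalls can be made concrete with a small sketch. This is illustrative C, not from the slides; the three-register instruction encoding is a made-up simplification:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical instruction encoding: one destination register written,
   up to two source registers read (-1 marks an unused field). */
typedef struct { int dst; int src1; int src2; } Instr;

static int reads(const Instr *i, int reg) {
    return reg >= 0 && (i->src1 == reg || i->src2 == reg);
}

/* Classify the data hazard when `second` issues after `first`. */
const char *hazard(const Instr *first, const Instr *second) {
    if (reads(second, first->dst))  return "RAW"; /* true dependence   */
    if (second->dst == first->dst)  return "WAW"; /* output dependence */
    if (reads(first, second->dst))  return "WAR"; /* anti-dependence   */
    return "none";
}
```

For example, with `first` = r1 = r2 + r3, a second instruction that reads r1 is a RAW hazard, one that rewrites r1 is WAW, and one that rewrites r2 is WAR; any other pair of instructions is free to overlap.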

Page 6: ILP:  VLIW Architectures


Instruction Level Parallelism

Two strategies to support ILP:

Dynamic scheduling: depends on the hardware to locate the parallelism

Static scheduling: relies on software to identify the potential parallelism

Hardware-intensive approaches dominate the desktop and server markets.

Page 7: ILP:  VLIW Architectures


Dynamic Scheduling

The hardware reorders the instruction execution to reduce pipeline stalls while maintaining data flow and exception behavior. Main advantages:

It enables handling some cases where dependences are unknown at compile time
It simplifies the compiler
It allows compiled code to run efficiently on a different pipeline

These advantages are gained at the cost of a significant increase in hardware complexity and power consumption.

Page 8: ILP:  VLIW Architectures


Static Scheduling

Compilers can use sophisticated algorithms for code scheduling to exploit ILP (Instruction Level Parallelism).

The amount of parallelism available within a basic block – a straight-line code sequence with no branches in except to the entry and no branches out except at the exit – is quite small.

Data dependence can further limit the amount of ILP we can exploit within a basic block to much less than the average basic block size.

To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks (i.e. across branches).

Page 9: ILP:  VLIW Architectures


Detection and Resolution of Dependences: Static Scheduling

Static detection and resolution of dependences (static scheduling) is accomplished by the compiler: dependences are avoided by reordering the code. The output of the compiler is code reordered into dependency-free code.

Typical example: VLIW (Very Long Instruction Word) processors, which expect dependency-free code.

Page 10: ILP:  VLIW Architectures

Very Long Instruction Word Architectures

Processor can initiate multiple operations per cycle

Specified completely by the compiler (unlike superscalar)

Low hardware complexity (no scheduling hardware, reduced support of variable latency instructions)

No instruction reordering performed by the hardware

Explicit parallelism

Single control flow

Page 11: ILP:  VLIW Architectures


VLIW: Very Long Instruction Word

Multiple operations are packed into one instruction
Each operation slot is for a fixed function
Constant operation latencies are specified
The architecture requires a guarantee of:

Parallelism within an instruction => no cross-operation RAW check
No data use before data ready => no data interlocks

Example machine: two integer units (single-cycle latency), two load/store units (three-cycle latency), two floating-point units (four-cycle latency). Instruction format:

| Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 |
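The guarantee of parallelism within an instruction means the hardware never checks for a RAW dependence between operations of the same bundle. That property can be sketched as a check the compiler must pass; a minimal illustration, assuming a hypothetical 6-slot bundle encoding:

```c
#include <assert.h>

/* Hypothetical VLIW bundle of 6 operation slots; each op has a
   destination and up to two source registers, -1 marking an unused
   field (a NOP slot reads and writes nothing). */
typedef struct { int dst; int src1; int src2; } Op;
enum { SLOTS = 6 };

/* Returns 1 if no operation in the bundle reads a register written by
   another operation of the same bundle -- the compiler's guarantee
   that lets the hardware skip cross-operation RAW checks. */
int bundle_is_parallel(const Op b[SLOTS]) {
    for (int i = 0; i < SLOTS; i++) {
        if (b[i].dst < 0) continue;          /* slot writes nothing */
        for (int j = 0; j < SLOTS; j++) {
            if (j == i) continue;
            if (b[j].src1 == b[i].dst || b[j].src2 == b[i].dst)
                return 0;                    /* intra-bundle RAW */
        }
    }
    return 1;
}
```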

Page 12: ILP:  VLIW Architectures


VLIW Compiler Responsibilities

The compiler:

Schedules to maximize parallel execution, exploiting ILP and loop-level parallelism (LLP). It must map the operations onto the machine's functional units, and this mapping must account for time constraints and dependences among the tasks.

Guarantees intra-instruction parallelism.

Schedules to avoid data hazards (no interlocks), typically separating dependent operations with explicit NOPs.

The goal is to minimize the total execution time for the program.
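A minimal sketch of the dependence-driven part of this scheduling, under two loud simplifications (single-cycle latencies, and ignoring functional-unit slot assignment, both of which a real VLIW compiler must also model): walk the operations in program order and start a new bundle whenever an operation reads a register written earlier in the current bundle.

```c
#include <assert.h>

typedef struct { int dst; int src1; int src2; } Op;

/* Greedy bundle packing: returns the number of bundles (cycles)
   needed so that no op reads a register written within its own
   bundle. Assumes register numbers are < 64 and -1 marks an
   unused field. Illustrative only, not a production scheduler. */
int pack_bundles(const Op *ops, int n) {
    int bundles = n > 0 ? 1 : 0;
    int written[64] = {0};          /* regs written in current bundle */
    for (int i = 0; i < n; i++) {
        int raw = (ops[i].src1 >= 0 && written[ops[i].src1]) ||
                  (ops[i].src2 >= 0 && written[ops[i].src2]);
        if (raw) {                  /* close bundle, start a new one */
            bundles++;
            for (int r = 0; r < 64; r++) written[r] = 0;
        }
        if (ops[i].dst >= 0) written[ops[i].dst] = 1;
    }
    return bundles;
}
```

Three mutually independent ops pack into one bundle; if the third op reads a result produced by the first, it is pushed into a second bundle.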

Page 13: ILP:  VLIW Architectures

A VLIW Machine Configuration


Page 14: ILP:  VLIW Architectures

VLIW: Pros and Cons

Pros:
Simple hardware
Easy to extend the number of FUs
Good compilers can effectively detect parallelism

Cons:
Huge number of registers needed to keep all the FUs active, storing operands and results
Large data transport capacity needed between FUs and register files, and between register files and memory
High bandwidth needed between the i-cache and the fetch unit
Large code size


Page 15: ILP:  VLIW Architectures


Loop Execution

for (i=0; i<N; i++)
    B[i] = A[i] + C;

Compiled code:

loop: ld   f1, 0(r1)
      add  r1, 8
      fadd f2, f0, f1
      sd   f2, 0(r2)
      add  r2, 8
      bne  r1, r3, loop

VLIW schedule (slots: Int1, Int2, M1, M2, FP+, FPx; loads take 3 cycles, fadd takes 4):

cycle  Int1    Int2  M1     M2   FP+      FPx
1                    ld f1
2      add r1
3
4                                fadd f2
5
6
7
8      add r2  bne   sd f2

How many FP ops/cycle? 1 fadd / 8 cycles = 0.125

Page 16: ILP:  VLIW Architectures


Loop Unrolling

for (i=0; i<N; i++)

B[i] = A[i] + C;

for (i=0; i<N; i+=4)

{

B[i] = A[i] + C;

B[i+1] = A[i+1] + C;

B[i+2] = A[i+2] + C;

B[i+3] = A[i+3] + C;

}

Unroll inner loop to perform 4 iterations at once

Need to handle values of N that are not a multiple of the unrolling factor with a final cleanup loop
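In C, the unrolled loop plus its cleanup loop might look like this (an illustrative sketch, not the compiler's actual output):

```c
#include <assert.h>

/* Unroll-by-4 version of B[i] = A[i] + C, with a cleanup loop for the
   up-to-3 leftover iterations when n is not a multiple of 4. */
void add_const_unrolled(const double *a, double *b, double c, int n) {
    int i = 0;
    for (; i + 3 < n; i += 4) {   /* main body: 4 independent adds */
        b[i]     = a[i]     + c;
        b[i + 1] = a[i + 1] + c;
        b[i + 2] = a[i + 2] + c;
        b[i + 3] = a[i + 3] + c;
    }
    for (; i < n; i++)            /* cleanup: at most 3 iterations */
        b[i] = a[i] + c;
}
```

The four statements in the main body are mutually independent, which is exactly the intra-instruction parallelism the VLIW compiler needs to fill its operation slots.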

Page 17: ILP:  VLIW Architectures


Scheduling Loop-Unrolled Code

loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      sd   f8, 24(r2)
      add  r2, 32
      bne  r1, r3, loop

VLIW schedule, unrolled 4 ways (slots: Int1, Int2, M1, M2, FP+, FPx):

cycle  Int1    Int2  M1     M2     FP+      FPx
1                    ld f1  ld f2
2      add r1        ld f3  ld f4
3
4                                  fadd f5
5                                  fadd f6
6                                  fadd f7
7                                  fadd f8
8                    sd f5
9                    sd f6
10                   sd f7
11     add r2  bne   sd f8

How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36

Page 18: ILP:  VLIW Architectures


Software Pipelining

loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      add  r2, 32
      sd   f8, -8(r2)
      bne  r1, r3, loop

VLIW schedule (slots: Int1, Int2, M1, M2, FP+, FPx). Unroll 4 ways first, then overlap successive unrolled iterations: while the stores (sd f5..f8) of iteration i issue, the adds (fadd f5..f8) of iteration i+1 and the loads (ld f1..f4) of iteration i+2 are already in flight.

prolog:        ld f1..f4 (iter i)
               ld f1..f4 (iter i+1)   fadd f5..f8 (iter i)
loop: iterate  ld f1..f4 (iter i+2)   fadd f5..f8 (iter i+1)   sd f5..f8 (iter i)   add r1, add r2, bne
epilog:        fadd f5..f8 and sd f5..f8 for the final iterations

How many FLOPS/cycle? 4 fadds / 4 cycles = 1
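The prolog / steady-state / epilog structure can be sketched in C (illustrative only; a real VLIW compiler pipelines machine operations across bundle slots, not C statements):

```c
#include <assert.h>

/* Software-pipelined shape of B[i] = A[i] + C, two stages deep:
   while the result of iteration i-1 is being stored, the add for
   iteration i has already been issued. Assumes n >= 1. */
void add_const_pipelined(const double *a, double *b, double c, int n) {
    double pending = a[0] + c;        /* prolog: start iteration 0 */
    for (int i = 1; i < n; i++) {
        double next = a[i] + c;       /* steady state: compute i ... */
        b[i - 1] = pending;           /* ... while storing i-1       */
        pending = next;
    }
    b[n - 1] = pending;               /* epilog: drain the last result */
}
```

The loop body always has a load/add and a store from different iterations in flight at once, which is what lets the VLIW schedule above keep issuing one group of fadds every iteration of the steady-state loop.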

Page 19: ILP:  VLIW Architectures


Software Pipelining vs. Loop Unrolling

(Figure: performance over time for one loop, comparing the loop-unrolled version, which repeats its startup and wind-down overhead on every pass, with the software-pipelined version.)

Software pipelining pays startup (prolog) and wind-down (epilog) costs only once per loop, not once per iteration.

Page 20: ILP:  VLIW Architectures

What follows
