VLIW Machines

Sima, Fountain and Kacsuk, Chapter 6

CSE3304

David Abramson, 2000 Material from Sima, Fountain and Kacsuk, Addison Wesley 1997

VLIW Machines

Single stream of instructions
– one program counter and one control unit

Very long instruction format
– enough control bits to directly and independently control the action of every functional unit in every cycle


VLIW Machines ...

Large numbers of data paths and functional units
– control is planned at compile time
– some VLIW machines have no arbiters, queues or other hardware synchronisation mechanisms


Common traits between VLIW and Superscalar?

[Figure: block diagram with the labels Instructions, Register File and four execution units (EU)]

Common traits between VLIW and Superscalar? ...

[Figure: block diagram with separate FX and FP register files, FX and FP instructions, and FXEU and FPEU execution units]


Differences between VLIW and Superscalar? ...

[Figure: superscalar organisation with Cache Memory, Fetch Unit, Decode/Issue Unit, Register File and four EUs, labelled "Multiple Instructions"]


Issuing 1 instruction per cycle

[Figure: Cache Memory, Fetch Unit, Decode/Issue Unit, Register File and four EUs]



Issuing 2 instructions per cycle

[Figure: Cache Memory, Fetch Unit, Decode/Issue Unit, Register File and four EUs]



Issuing 4 instructions per cycle

[Figure: Cache Memory, Fetch Unit, Decode/Issue Unit, Register File and four EUs]

Issue unit decides at run time how many instructions to issue!


In a VLIW machine the choice is static

[Figure: Cache Memory, Fetch Unit, Register File and four EUs, all driven by a Single Long Instruction]


Super Scalar Data Dependence

Consider

R1 + R2 -> R3
R4 - R5 -> R6
load R7, Fred
R7 * R1 -> R2

Super Scalar single issue

R1 + R2 -> R3
R4 - R5 -> R6
load R7, Fred
WAIT
R7 * R1 -> R2

Super Scalar double issue

R1 + R2 -> R3      R4 - R5 -> R6
load R7, Fred      (Dynamic Decision not to co-issue)
WAIT
R7 * R1 -> R2


VLIW Data Dependence

Consider

R1 + R2 -> R3
R4 - R5 -> R6
load R7, Fred
R7 * R1 -> R2

VLIW double issue

R1 + R2 -> R3      R4 - R5 -> R6
load R7, Fred      NOP      (Static Decision not to co-issue)
WAIT
R7 * R1 -> R2
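The static decision above can be sketched as a tiny compile-time packer. This is only an illustration, not the TRACE compiler's algorithm: it assumes the lost arrows point at the destination register, that a load takes 2 cycles and everything else 1, and that operations are packed strictly in program order. The names `PROGRAM`, `LATENCY` and `schedule` are made up for this sketch.

```python
from typing import Dict, List, Set, Tuple

# (text, registers read, register written); destinations are an assumption.
Op = Tuple[str, Set[str], str]

PROGRAM: List[Op] = [
    ("R1 + R2 -> R3", {"R1", "R2"}, "R3"),
    ("R4 - R5 -> R6", {"R4", "R5"}, "R6"),
    ("load R7, Fred", set(), "R7"),
    ("R7 * R1 -> R2", {"R7", "R1"}, "R2"),
]

# Assumed latencies in cycles; any operation not listed takes 1 cycle.
LATENCY: Dict[str, int] = {"load R7, Fred": 2}


def schedule(program: List[Op], width: int) -> List[List[str]]:
    """Pack operations, in program order, into long words of `width` slots."""
    ready_at: Dict[str, int] = {}  # register -> first cycle its new value is usable
    pending = list(program)
    words: List[List[str]] = []
    cycle = 0
    while pending:
        word: List[str] = []
        while pending and len(word) < width:
            text, reads, dest = pending[0]
            # Hold the operation back if a source or its destination still
            # has a write in flight (read-after-write or write-after-write).
            if any(ready_at.get(reg, 0) > cycle for reg in reads | {dest}):
                break
            ready_at[dest] = cycle + LATENCY.get(text, 1)
            word.append(text)
            pending.pop(0)
        # Unused slots are filled with explicit NOPs at compile time.
        words.append(word + ["NOP"] * (width - len(word)))
        cycle += 1
    return words


for long_word in schedule(PROGRAM, width=2):
    print(" | ".join(long_word))
```

With a width of 2 the sketch produces four long words: the add and subtract together, the load with a NOP, an all-NOP word standing in for the WAIT, and finally the multiply.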


Multiflow TRACE VLIW

Multiflow TRACE uses very sophisticated compilation techniques to detect low level parallelism.

The idea is that low level operations which can be executed at the same time are located and "packed" together into one instruction word.

When this instruction is executed, all of the operations will fire.


Trace Machine Structure

[Figure: Memory, four ALUs and a Register File]

ALUs are controlled by one instruction stream. The TRACE machine splits this register file into integer and floating point files to meet the required bandwidth.


TRACE Performance

                            Multiflow TRACE        VAX      Cray
                            7/200      14/200      8700     XMP

Implementation Technology   CMOS (both models)     ECL      ECL
Gate Speed (nsecs)          3.5        3.5         1.5      1.3
Issue Rate (nsecs)          130        130         45       8
Linpack MFLOPS              6.0        10.0        0.97     24.0
Whetstones                  12605      14400       3953     25700
Livermore Loop MFLOPS       2.3        3.4         0.9      12.3
ANSYS Benchmark M3 (secs)   3757       2200        n/a      556

Note, this data is supplied by the manufacturer. There are many factors which affect performance, such as compiler options, vector length, etc.


TRACE Compiler

The TRACE compiler must not only generate code for the VLIW machine, but also
– schedule the hardware resources statically at compile time.
– This is not always possible!

The compiler must schedule instructions so that as many ALUs are used as possible.


Memory scheduling

The compiler must also schedule memory references so that there is no memory bank contention.

This guarantees the arrival time of memory operands.

[Figure: Interleaved Memory, with the consecutive addresses 1000 to 1007 spread across the memory banks]


Compiler optimizations

Some optimizations are standard, and apply to many machines, but others are required for VLIW machines.

TRACE can use previous program execution traces to determine the most common branch directions.

[Figure: Source, Compiler, Executable and Execute stages, with the execution Trace fed back to the Compiler]


Compiler optimizations

A branch which is taken the wrong way in a VLIW machine can be even more serious than in a scalar pipelined machine because there are many instructions packed together.


Bad Branches on scalar machine

[Figure: pipeline diagram marking where the BTA is evaluated and where the condition becomes known. INCORRECT guess: 2 instructions lost]


Bad Branches on VLIW

[Figure: pipeline diagram marking where the BTA is evaluated and where the condition becomes known. INCORRECT guess: 4 instructions lost]


Compiler Optimisations ...

Each subroutine is considered one at a time.

Classic optimisations such as
– loop invariant motion,
– common sub-expression elimination
are performed.

The compiler then builds a flow graph for the program so that data dependencies can be observed.


Conditional Branch Estimation

Compiler performs static branch estimation for loops

The sense of the IF statement is unknown at compile time.

However, with trace data it is possible to guess very accurately.

It may even be possible to guess by using clever dataflow analysis and deduction
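A rough sketch of how trace data might be turned into branch guesses is given below. The trace format (pairs of a branch name and whether it was taken) and the names `estimate_branches` and `static_loop_guess` are assumptions made for illustration; the slides do not describe the format the TRACE tools actually use.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def estimate_branches(trace: Iterable[Tuple[str, bool]]) -> Dict[str, bool]:
    """Guess each branch's usual direction from a previous execution trace.

    `trace` holds (branch_id, was_taken) pairs; the result maps each branch
    to True when it was taken in at least half of its executions.
    """
    taken: Counter = Counter()
    total: Counter = Counter()
    for branch_id, was_taken in trace:
        total[branch_id] += 1
        taken[branch_id] += int(was_taken)
    return {branch: 2 * taken[branch] >= total[branch] for branch in total}


def static_loop_guess(is_loop_back_edge: bool) -> bool:
    """Without trace data, assume a loop's backward branch is taken."""
    return is_loop_back_edge


# Hypothetical trace: a loop branch taken 9 times out of 10,
# and an IF branch taken once in 4 executions.
example_trace = (
    [("loop_10", True)] * 9 + [("loop_10", False)]
    + [("if_20", False)] * 3 + [("if_20", True)]
)
print(estimate_branches(example_trace))
```

The loop branch comes out as "usually taken" and the IF branch as "usually not taken", which is the information the compiler needs when choosing which path to pack together.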


Loop Unrolling

One optimisation that is required for high performance on a VLIW machine is static loop unravelling.

The bounds of a loop are reduced by some factor, and a number of loop iterations are statically unravelled.


Loop Unrolling - An example

Original loop:

      DO 10 I = 1,10
      A(I) = 0
   10 CONTINUE

Unrolled by a factor of 2:

      DO 10 I = 1,5
      A(I*2) = 0
      A(I*2-1) = 0
   10 CONTINUE

Fully unrolled:

      A(1) = 0
      A(2) = 0
      A(3) = 0
      A(4) = 0
      A(5) = 0
      A(6) = 0
      A(7) = 0
      A(8) = 0
      A(9) = 0
      A(10) = 0
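The transformation can be checked with a short Python sketch that mirrors the Fortran above. The function names are invented for this example, and the general `zero_unrolled` version, with a clean-up loop for trip counts that are not a multiple of the factor, goes slightly beyond what the slide shows.

```python
from typing import List


def zero_rolled(n: int) -> List[int]:
    """Original loop: DO I = 1,n ; A(I) = 0."""
    a = [1] * (n + 1)  # index 0 is unused, to mirror 1-based Fortran arrays
    for i in range(1, n + 1):
        a[i] = 0
    return a


def zero_unrolled_by_2(n: int) -> List[int]:
    """Loop bounds halved, two iterations written out per trip."""
    assert n % 2 == 0, "this form needs an even trip count"
    a = [1] * (n + 1)
    for i in range(1, n // 2 + 1):
        a[i * 2] = 0
        a[i * 2 - 1] = 0
    return a


def zero_unrolled(n: int, factor: int) -> List[int]:
    """Unroll by any factor; leftover iterations run in a clean-up loop."""
    a = [1] * (n + 1)
    trips = n // factor
    for i in range(trips):
        base = i * factor
        for k in range(1, factor + 1):  # stands for `factor` statements written out
            a[base + k] = 0
    for i in range(trips * factor + 1, n + 1):
        a[i] = 0
    return a


print(zero_rolled(10) == zero_unrolled_by_2(10) == zero_unrolled(10, 3))
```

All three versions leave the array in the same state; only the number of loop trips, and so the number of branches, changes.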


Loop Unrolling - A harder example

Original loop:

      DO 10 I = 6,25
      A(I) = 0
      B(I) = A(I-4) + A(I-5)
   10 CONTINUE

Unrolled by a factor of 5:

      DO 10 I = 0,3
      A(I*5 + 6) = 0
      B(I*5 + 6) = A(I*5 + 2) + A(I*5 + 1)
      A(I*5 + 7) = 0
      B(I*5 + 7) = A(I*5 + 3) + A(I*5 + 2)
      A(I*5 + 8) = 0
      B(I*5 + 8) = A(I*5 + 4) + A(I*5 + 3)
      A(I*5 + 9) = 0
      B(I*5 + 9) = A(I*5 + 5) + A(I*5 + 4)
      A(I*5 + 10) = 0
   10 B(I*5 + 10) = A(I*5 + 6) + A(I*5 + 5)

Now, if all of these statements can be executed concurrently, then the loop only needs to be performed 4 times.

Loop Unrolling Loop Dependence


Loop Unravelling

Once the loop has been unrolled, it is possible to build the dependence graph for the loop body.

This shows how to pack the instructions into the Very Long Instruction Word.

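In our example almost all of the statements could be executed together if there were sufficient resources; the exception is the final statement, which reads A(I*5 + 6), an element written by the first statement of the same loop body.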
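A minimal sketch of the dependence check for the unrolled body of the harder example follows. It only looks for read-after-write pairs on array elements inside one unrolled body, and the helper names `unrolled_body` and `raw_dependences` are made up for this illustration.

```python
from typing import List, Tuple

Element = Tuple[str, int]                      # e.g. ("A", 6) stands for A(6)
Stmt = Tuple[str, Element, List[Element]]      # (text, element written, elements read)


def unrolled_body(i: int) -> List[Stmt]:
    """The ten statements of the 5-way unrolled body for loop index i (0..3)."""
    stmts: List[Stmt] = []
    for k in range(6, 11):
        e = i * 5 + k
        stmts.append((f"A({e}) = 0", ("A", e), []))
        stmts.append(
            (f"B({e}) = A({e - 4}) + A({e - 5})", ("B", e), [("A", e - 4), ("A", e - 5)])
        )
    return stmts


def raw_dependences(stmts: List[Stmt]) -> List[Tuple[str, str]]:
    """Pairs (earlier, later) where `later` reads what `earlier` wrote."""
    pairs = []
    for j, (later, _, reads) in enumerate(stmts):
        for earlier, written, _ in stmts[:j]:
            if written in reads:
                pairs.append((earlier, later))
    return pairs


for i in range(4):
    print(i, raw_dependences(unrolled_body(i)))
```

Under these assumptions each body contains exactly one such pair: the first statement writes A(I*5 + 6) and the last statement reads it. Those two would need to sit in different long instruction words (or the compiler would need to forward the constant), while the other statements have no dependence inside the body.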


Conditional Instructions

The Trace machine uses compare-predict operations rather than test operators.

The results can be written to general registers, which can avoid some of the branches in complex IF chains.

CEQ R1,R2, BB(R2)      Write BB with 1 if R1 == R2, else write BB with 0

BRANCH (R3),LABEL      The branch_test field selects R3


Conditional Branches

Branches cause problems when instructions are packed into wide instructions.

Consider the following sequence:

IF A < B GOTO 10
IF C < D GOTO 20

These two statements are independent, and can be packed together.


Conditional Branches ...

But, what happens if both indicate a branch? Which one should be taken?

The TRACE machine uses a statically encoded priority scheme, so that the first one has priority over the second.

IF A < B GOTO 10 (takes priority)      IF C < D GOTO 20
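The priority rule can be pictured with a few lines of Python. This is a behavioural sketch only: `resolve_branches` and `long_instruction` are hypothetical names, and the hardware encoding of the priorities is not modelled.

```python
from typing import List, Optional, Tuple


def resolve_branches(tests: List[Tuple[bool, int]]) -> Optional[int]:
    """Pick the target of the first true test, in statically encoded priority order.

    `tests` lists (condition_is_true, target_label) for every branch packed
    into one long instruction; None means no branch is taken.
    """
    for condition_is_true, target in tests:
        if condition_is_true:
            return target
    return None


def long_instruction(a: int, b: int, c: int, d: int) -> Optional[int]:
    """IF A < B GOTO 10 packed with IF C < D GOTO 20; the first has priority."""
    return resolve_branches([(a < b, 10), (c < d, 20)])


print(long_instruction(1, 2, 1, 2))  # both conditions hold
print(long_instruction(2, 1, 1, 2))  # only the second holds
print(long_instruction(2, 1, 2, 1))  # neither holds
```

When both conditions hold the result is label 10, which matches the behaviour of the original sequential code, where the second IF would never have been reached.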


Compensation Code

One problem with statically unrolled loops, and with packing instructions into one word alongside a conditional branch, is that certain instructions may sometimes be executed even though a branch has been taken.
– We have looked at some conventional solutions in pipelined machines.

The TRACE machine inserts code to undo mistakes when they occur.


Compensation Code ...

Consider:

IF A < B GOTO 10
D = D + 1

If these are packed together, then D will always be incremented.

If A is usually >= B then this will usually be correct.


Compensation Code ...

But if A < B, the increment will still be done and will be wrong.

The compiler could insert the following code at 10

10 D = D - 1
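The effect of the compensation code can be checked with a small sketch that compares the packed version against the original sequential meaning. The function names are invented here, and the sketch models only the value of D, not timing.

```python
from typing import Tuple


def reference(a: int, b: int, d: int) -> Tuple[int, bool]:
    """Sequential meaning: the increment happens only when the branch is not taken."""
    if a < b:
        return d, True  # GOTO 10, D untouched
    return d + 1, False


def packed_with_compensation(a: int, b: int, d: int) -> Tuple[int, bool]:
    """Branch test and increment issued together in one long instruction."""
    branch_taken = a < b
    d = d + 1  # always executed, because it was packed with the branch
    if branch_taken:
        d = d - 1  # compensation code inserted at label 10
    return d, branch_taken


for a, b in [(1, 2), (2, 1), (3, 3)]:
    print(a, b, reference(a, b, 0) == packed_with_compensation(a, b, 0))
```

In the common case (A >= B) the compensation code is never reached, so the cost of undoing the mistake is paid only on the rarely taken path.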


TRACE Structure

On the TRACE machine each functional unit is split into an integer ALU and a floating point ALU.

Each FU requires 256 bits of instruction.

A TRACE machine can have up to 4 Functional Units.

A fully configured TRACE machine will require 4 x 256 = 1024 bits of instruction per cycle.


TRACE Structure ...

Each Integer ALU contains 2 ALU/multipliers, an address translation TLB and a PC, as well as the integer registers.

Each floating point unit contains a floating point adder, a floating point multiplier, store and load registers.


TRACE Structure

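[Figure: TRACE functional unit. Integer side: I Registers (64 x 32), ALU0 and ALU1 with IMUL, TLB producing the Physical Address, PC and PC Adder. Floating point side: F Registers (32 x 64), FMUL/ALUM, FADD/ALUA and Store Registers (32 x 32). The units are connected by the ILoad Buses, FLoad Buses and Store Buses]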


Instruction Format

[Figure: field layout of the eight 32-bit instruction words (bits 0 to 31)]

Word 0   ALU 0 Early Beat             opcode, dest, dest_bank, branch_test, src_1, src_2, Imm
Word 1   Immediate Constant (early)
Word 2   ALU 1 Early Beat             opcode, dest, dest_bank, branch_test, src_1, src_2, Imm
Word 3   FA/ALUA control fields       opcode, 64, dest, src_1, src_2, Dest_bank
Word 4   ALU 0 Late Beat              opcode, dest, dest_bank, src_1, src_2, Imm
Word 5   Immediate Constant (late)
Word 6   ALU 1 Late Beat              opcode, dest, dest_bank, src_1, src_2, Imm
Word 7   FM/ALUM control fields       opcode, 64, dest, src_1, src_2, Dest_bank


Instruction Encoding

In a highly parallel program each instruction will be packed with useful operations.

In a program which does not have sufficient concurrency there will be many no-ops in the fields of the instructions.

Also, even a highly parallel program may have regions which are low in concurrency.


Instruction Encoding ...

To combat the wasted space, a special memory format for the instructions is used.

Instructions with no-op fields are expanded on the fly when they are loaded into the instruction cache.

[Figure: an Encoded Instruction in memory and the Expanded Instruction it becomes]
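One way such a format could work is sketched below. The layout used here, a mask word with one bit per field followed by only the non-no-op fields, is an assumption made for illustration, as are the names `encode`, `expand`, `NOP` and `FIELDS`; the slides do not give the actual encoding.

```python
from typing import List

NOP = 0      # assumed bit pattern for a no-op field
FIELDS = 8   # assumed number of 32-bit fields per long instruction


def encode(fields: List[int]) -> List[int]:
    """Store a mask word followed by only the fields that are not no-ops."""
    mask = 0
    kept = []
    for position, field in enumerate(fields):
        if field != NOP:
            mask |= 1 << position
            kept.append(field)
    return [mask] + kept


def expand(encoded: List[int], width: int = FIELDS) -> List[int]:
    """Rebuild the full-width instruction, as would happen on an i-cache fill."""
    mask = encoded[0]
    kept = iter(encoded[1:])
    return [next(kept) if (mask >> position) & 1 else NOP for position in range(width)]


instruction = [0x11, NOP, NOP, 0x44, NOP, NOP, NOP, NOP]
stored = encode(instruction)
print(len(instruction), "fields stored as", len(stored), "words")
print(expand(stored) == instruction)
```

In this example an instruction with only two useful fields occupies three words in memory instead of eight, and the expansion restores the original layout exactly.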


Memory Subsystem

The TRACE machine uses an interleaved memory subsystem to achieve high throughput.

It does not rely on large caches and cache hit rates, but instead pipelines memory references (there is an instruction cache).


Memory Subsystem ...

There are multiple buses between the ALUs and the memory units: the F and I load buses and the F store buses.

The load buses are bi-directional, and the store buses are uni-directional.

[Figure: the TRACE functional unit diagram again, with the Load Buses and Store Buses labelled]


Memory Pipeline

Memory is accessed using an 8-stage pipeline. This is visible to the compiler.

0 The program says LD R1, R2, R3. R1 and R2 are added to form a virtual address. R2 may be replaced by a 6, 17 or 32 bit immediate constant.

1 The virtual address is looked up in the TLB

2 The physical address is sent over the buses to the memory controller.

3 The desired RAM bank starts cycling

4 RAM access continues

5 Data is returned from the memory controller

6 Data is sent over the buses

7 Data is written into the register file, and CPU can use data in R3.
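Because the pipeline is visible to the compiler, the beat at which a loaded value becomes usable can be computed statically. The sketch below assumes one stage per beat and reads stage 7 as the first beat in which R3 can be used; the stage names are paraphrased from the list above, and the function names are invented.

```python
from typing import List, Tuple

STAGES = [
    "form virtual address (R1 + R2)",
    "TLB lookup",
    "send physical address to memory controller",
    "RAM bank starts cycling",
    "RAM access continues",
    "data returned from memory controller",
    "data sent over the buses",
    "write register file (R3 usable)",
]


def load_timeline(issue_beat: int) -> List[Tuple[int, str]]:
    """Beat at which each pipeline stage of one load takes place."""
    return [(issue_beat + n, name) for n, name in enumerate(STAGES)]


def result_ready(issue_beat: int) -> int:
    """First beat in which the loaded register can be used (assumed: stage 7)."""
    return issue_beat + len(STAGES) - 1


def beats_for_loads(count: int, pipelined: bool = True) -> int:
    """Total beats for `count` loads issued back to back."""
    if count == 0:
        return 0
    if pipelined:
        return len(STAGES) + count - 1
    return len(STAGES) * count


print(result_ready(0))
print(beats_for_loads(4), "beats pipelined versus", beats_for_loads(4, pipelined=False))
```

A compiler using this model would place any operation that reads R3 at least 7 beats after the load, filling the gap with independent work.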


Memory Pipeline ...

[Figure: two overlapped memory references, each passing through the stages VA = R1 + R2, TLB Lookup, Adrs Mem, Memory Cycle, Memory Cycle, Bus Busy, Data Bus and Data R3]

Must ensure that modules are different at compile time.

Must ensure that buses are different at compile time.


Memory system

In a fully configured TRACE machine 4 memory references may be started in each beat, to 4 independently generated addresses.

The following rules must be followed:
– At most one reference may be initiated on any one controller
– No two references should be initiated which require the same bus to return the data
– No two references should be initiated to the same RAM bank within 4 beats of each other
– The available number of register file write ports should not be exceeded
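The four rules can be expressed as a small checker over a proposed schedule. This is a sketch under several assumptions that the slides do not settle: references started in the same beat return their data in the same beat (so the bus and write-port rules are checked per beat), a RAM bank is identified by its controller and bank number, and "within 4 beats" means a gap of fewer than 4 beats. `Ref` and `violations` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Ref:
    beat: int        # beat in which the reference is initiated
    controller: int  # memory controller that services it
    bank: int        # RAM bank within that controller
    bus: int         # bus that returns the data


def violations(refs: List[Ref], write_ports: int) -> List[str]:
    """Return the names of the rules broken by a proposed set of references."""
    problems = []
    for i, a in enumerate(refs):
        for b in refs[i + 1:]:
            same_beat = a.beat == b.beat
            if same_beat and a.controller == b.controller:
                problems.append("controller")
            if same_beat and a.bus == b.bus:
                problems.append("bus")
            same_bank = (a.controller, a.bank) == (b.controller, b.bank)
            if same_bank and abs(a.beat - b.beat) < 4:
                problems.append("bank")
    # Loads started in one beat are assumed to write the register file together.
    for count in Counter(r.beat for r in refs).values():
        if count > write_ports:
            problems.append("write ports")
    return problems


ok = [Ref(0, 0, 0, 0), Ref(0, 1, 0, 1), Ref(4, 0, 0, 0)]
too_soon = [Ref(0, 0, 0, 0), Ref(2, 0, 0, 1)]
print(violations(ok, write_ports=2))
print(violations(too_soon, write_ports=2))
```

The first schedule passes every check; the second revisits the same bank after only 2 beats, so the compiler would have to move one of the two references.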


The Disambiguator

A special module of the compiler, the disambiguator, determines whether memory references can be started in the same beat.

It must determine whether
– address1 mod #modules = address2 mod #modules

The answers may be yes, no and maybe.
– If the answer is no, then they are packed into the one instruction.
– If the answer is yes or maybe, they are separated.


The Disambiguator

Consider the following cases:
– accessing a single variable (compiler controlled address)
– accessing parts of an array
  • A(I) and A(I+1)
  • A(I) and A(I+J)

[Figure: Interleaved Memory, with A(I), A(I+1), A(I+2), A(I+3) and A(I+4) in successive banks]
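The yes, no and maybe answers for the array cases can be sketched as follows. The sketch assumes one array element per memory word, so that the difference between two subscripts equals the difference between their addresses, and it represents an unknown difference (such as J) as None. The names `same_module` and `can_pack` are invented for this illustration and are not the disambiguator's real interface.

```python
from typing import Optional


def same_module(offset_difference: Optional[int], modules: int) -> str:
    """Do two references fall in the same memory module?

    `offset_difference` is address2 - address1 when the compiler knows it,
    or None when it depends on a run-time value.
    """
    if offset_difference is None:
        return "maybe"
    return "yes" if offset_difference % modules == 0 else "no"


def can_pack(offset_difference: Optional[int], modules: int) -> bool:
    """References share one instruction only when the answer is a definite no."""
    return same_module(offset_difference, modules) == "no"


MODULES = 8  # hypothetical number of interleaved modules

print("A(I), A(I+1):", same_module(1, MODULES))
print("A(I), A(I+8):", same_module(8, MODULES))
print("A(I), A(I+J):", same_module(None, MODULES))
```

A(I) and A(I+1) always land in different modules, so they may be packed together; A(I) and A(I+J) cannot be resolved at compile time and are therefore kept apart.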