VLIW Machines
Sima, Fountain and Kacsuk, Chapter 6
CSE3304
David Abramson, 2000 Material from Sima, Fountain and Kacsuk, Addison Wesley 1997
VLIW Machines
Single stream of instructions
– one program counter and one control unit
Very long instruction format
– enough control bits to directly and independently control the action of every functional unit in every cycle
VLIW Machines ...
Large numbers of data paths and functional units
– control is planned at compile time
– some VLIW machines have no arbiters, queues or other hardware synchronisation mechanisms
Common traits between VLIW and Superscalar?
[Figure: Instructions and a Register File feeding four execution units (EU); labelled "Performance"]
Common traits between VLIW and Superscalar? ...
[Figure: FX Instructions and an FX Register File feed three fixed-point execution units (FXEU); FP Instructions and an FP Register File feed three floating-point execution units (FPEU)]
Differences between VLIW and Superscalar? ...
[Figure: superscalar organisation. Cache Memory feeds a Fetch Unit and a Decode/Issue Unit, which sends multiple instructions to four execution units (EU) sharing a Register File]
Issuing 1 instruction per cycle
[Figure: Cache Memory, Fetch Unit, Decode/Issue Unit, Register File and four EUs, with one instruction issued in the cycle]
Issuing 2 instructions per cycle
[Figure: Cache Memory, Fetch Unit, Decode/Issue Unit, Register File and four EUs, with two instructions issued in the cycle]
Issuing 4 instructions per cycle
[Figure: Cache Memory, Fetch Unit, Decode/Issue Unit, Register File and four EUs, with four instructions issued in the cycle]
The issue unit decides at run time how many instructions to issue!
In a VLIW machine the choice is static
[Figure: VLIW organisation. Cache Memory feeds a Fetch Unit, which sends a single long instruction to four execution units (EU) sharing a Register File]
Super Scalar Data Dependence
Consider
  R1 + R2 -> R3
  R4 - R5 -> R6
  load R7, Fred
  R7 * R1 -> R2
Super Scalar single issue
  R1 + R2 -> R3
  R4 - R5 -> R6
  load R7, Fred
  WAIT
  R7 * R1 -> R2
Super Scalar double issue
  R1 + R2 -> R3    R4 - R5 -> R6
  load R7, Fred    (dynamic decision not to co-issue)
  WAIT
  R7 * R1 -> R2
VLIW Data Dependence
Consider
  R1 + R2 -> R3
  R4 - R5 -> R6
  load R7, Fred
  R7 * R1 -> R2
VLIW double issue
  R1 + R2 -> R3    R4 - R5 -> R6
  load R7, Fred    NOP              (static decision not to co-issue)
  WAIT
  R7 * R1 -> R2
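The static decision above can be sketched as a small compile-time packer. This is only an illustration, not the TRACE compiler's algorithm: the names `Op`, `pack` and `OPS` are hypothetical, operations are packed greedily in program order, and the load is assumed to take 2 cycles before R7 can be read.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    text: str
    dest: str
    srcs: tuple
    latency: int = 1  # assumed cycles before the result may be read

def pack(ops, width=2):
    """Greedily pack ops, in program order, into words of `width` slots."""
    ready_at = {}  # register -> first cycle in which its new value may be read
    words, cycle, i = [], 0, 0
    while i < len(ops):
        word, written = [], set()
        while i < len(ops) and len(word) < width:
            op = ops[i]
            # Source still being produced by an earlier word: cannot issue yet.
            if any(ready_at.get(r, 0) > cycle for r in op.srcs):
                break
            # Source or destination written in this same word: do not co-issue.
            if op.dest in written or any(r in written for r in op.srcs):
                break
            word.append(op)
            written.add(op.dest)
            i += 1
        for op in word:
            ready_at[op.dest] = cycle + op.latency
        # Unused slots are filled with explicit NOPs, as in a VLIW word.
        words.append([op.text for op in word] + ["NOP"] * (width - len(word)))
        cycle += 1
    return words

OPS = [
    Op("R1 + R2 -> R3", "R3", ("R1", "R2")),
    Op("R4 - R5 -> R6", "R6", ("R4", "R5")),
    Op("load R7, Fred", "R7", (), latency=2),
    Op("R7 * R1 -> R2", "R2", ("R7", "R1")),
]

for word in pack(OPS):
    print(" | ".join(word))
```

Under these assumptions the packer produces four words: the add and subtract together, the load with a NOP, an all-NOP word standing in for the WAIT, and finally the multiply.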
Multiflow TRACE VLIW
Multiflow TRACE uses very sophisticated compilation techniques to detect low level parallelism.
The idea is that low level operations which can be executed at the same time are located and "packed" together into one instruction word.
When this instruction is executed, all of the operations will fire.
Trace Machine Structure
[Figure: Memory connected to four ALUs that share a Register File]
The ALUs are controlled by one instruction stream.
The TRACE machine splits this register file into integer and floating point files to meet the required bandwidth.
TRACE Performance
                            Multiflow TRACE         VAX         Cray
                            7/200       14/200      8700        XMP
Implementation technology   CMOS        CMOS        ECL         ECL
Gate speed                  3.5 nsecs   3.5 nsecs   1.5 nsecs   1.3 nsecs
Issue rate                  130 nsecs   130 nsecs   45 nsecs    8 nsecs
Linpack MFLOPS              6.0         10.0        0.97        24.0
Whetstones                  12605       14400       3953        25700
Livermore Loop MFLOPS       2.3         3.4         0.9         12.3
ANSYS Benchmark M3 (secs)   3757        2200        n/a         556
Note: this data is supplied by the manufacturer. There are many factors which affect performance, such as compiler options, vector length, etc.
TRACE Compiler
The TRACE compiler must not only generate code for the VLIW machine, it must also
– schedule the hardware resources statically at compile time.
– This is not always possible!
The compiler must schedule instructions so that as many ALUs are used as possible.
Memory scheduling
The compiler must also schedule memory references so that there is no memory bank contention.
This guarantees arrival time of memory operands
[Figure: interleaved memory, with consecutive addresses 1000 to 1007 spread across the memory banks]
Compiler optimizations
Some optimizations are standard, and apply to many machines, but others are required for VLIW machines.
TRACE can use previous program execution traces to determine the most common branch directions.
[Figure: Source is passed through the Compiler to give an Executable; executing it produces a Trace, which is fed back to the Compiler]
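A minimal sketch of how a recorded execution trace could be turned into per-branch predictions. The trace format (a list of `(branch name, taken)` pairs) and the function name `likely_directions` are assumptions made for illustration; the slides do not describe the format TRACE actually used.

```python
from collections import Counter

def likely_directions(trace):
    """Return, for each branch, True if it was taken at least half the time."""
    seen = Counter()
    taken = Counter()
    for branch, was_taken in trace:
        seen[branch] += 1
        taken[branch] += int(was_taken)
    return {branch: 2 * taken[branch] >= seen[branch] for branch in seen}

# Hypothetical trace: a loop-closing branch taken 9 times out of 10,
# and an error check that is never taken.
trace = [("loop_back", True)] * 9 + [("loop_back", False)] + [("error_check", False)] * 3
print(likely_directions(trace))
```

The compiler can then lay out and pack code along the predicted direction of each branch.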
Compiler optimizations
A branch that goes the wrong way on a VLIW machine can be even more costly than on a scalar pipelined machine, because many operations are packed together in each instruction.
Bad Branches on scalar machine
[Figure: pipeline timing diagram with labels "Evaluate BTA" and "Condition Known"; an INCORRECT guess loses 2 instructions]
Bad Branches on VLIW
[Figure: pipeline timing diagram with labels "Evaluate BTA" and "Condition Known"; an INCORRECT guess loses 4 instructions]
Compiler Optimisations ...
Each subroutine is considered one at a time.
Classic optimisations such as:
– loop invariant motion,
– common sub-expression elimination
are performed.
The compiler then builds a flow graph for the program so that data dependencies can be observed.
Conditional Branch Estimation
Compiler performs static branch estimation for loops
The sense of the IF statement is unknown at compile time.
However, with trace data it is possible to guess very accurately.
It may even be possible to guess by using clever dataflow analysis and deduction
Loop Unrolling
One optimisation that is required for high performance on a VLIW machine is static loop unravelling.
The bounds of a loop are reduced by some factor, and a number of loop iterations are statically unravelled.
Loop Unrolling - An example
Original loop:
      DO 10 I = 1,10
        A(I) = 0
   10 CONTINUE
Unrolled by a factor of 2:
      DO 10 I = 1,5
        A(I*2) = 0
        A(I*2-1) = 0
   10 CONTINUE
Fully unrolled:
      A(1) = 0
      A(2) = 0
      A(3) = 0
      A(4) = 0
      A(5) = 0
      A(6) = 0
      A(7) = 0
      A(8) = 0
      A(9) = 0
      A(10) = 0
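As a quick check that the factor-2 version does the same work as the original loop, here is a small Python sketch. The function names are hypothetical, index 0 of the list is left unused to mirror Fortran's 1-based A(1..10), and the trip count is assumed to be even.

```python
def original(n=10):
    a = [None] * (n + 1)          # a[0] is unused; a[1..n] mirrors A(1..n)
    for i in range(1, n + 1):
        a[i] = 0
    return a

def unrolled_by_2(n=10):
    a = [None] * (n + 1)
    for i in range(1, n // 2 + 1):  # half as many iterations
        a[i * 2] = 0                # even elements
        a[i * 2 - 1] = 0            # odd elements
    return a

print(original() == unrolled_by_2())
```

The two assignments in the unrolled body touch different elements, so they are candidates for packing into one instruction word.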
Loop Unrolling - A harder example
DO 10 I = 6,25
A(I) = 0
B(I) = A(I-4) + A(I-5)
10 CONTINUE
Loop Unrolling
      DO 10 I = 0,3
        A(I*5 + 6) = 0
        B(I*5 + 6) = A(I*5 + 2) + A(I*5 + 1)
        A(I*5 + 7) = 0
        B(I*5 + 7) = A(I*5 + 3) + A(I*5 + 2)
        A(I*5 + 8) = 0
        B(I*5 + 8) = A(I*5 + 4) + A(I*5 + 3)
        A(I*5 + 9) = 0
        B(I*5 + 9) = A(I*5 + 5) + A(I*5 + 4)
        A(I*5 + 10) = 0
   10   B(I*5 + 10) = A(I*5 + 6) + A(I*5 + 5)
[Figure annotation: Loop Dependence]
Now, if all of these statements can be executed concurrently, then the loop only needs to be performed 4 times.
Loop Unravelling
Once the loop has been unrolled, it is possible to build the dependence graph for the loop body.
This shows how to pack the instructions into the Very Long Instruction Word.
In our example nearly all of the statements could be executed together if there were sufficient resources; only the last statement reads a value, A(I*5 + 6), that is written earlier in the same unrolled body.
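A minimal sketch of building the dependence information for the unrolled body of the harder example. It is an illustration only: the helper names are hypothetical, array elements are identified by their constant offset from I*5, and only flow dependences (a read of an element written by an earlier statement in the same body) are reported.

```python
def unrolled_body(factor=5, first=6):
    """Statements as (text, written element, read elements).

    An element is (array name, offset), with the common I*factor term left out.
    """
    body = []
    for k in range(first, first + factor):
        body.append((f"A(I*{factor}+{k}) = 0", ("A", k), ()))
        body.append((
            f"B(I*{factor}+{k}) = A(I*{factor}+{k - 4}) + A(I*{factor}+{k - 5})",
            ("B", k),
            (("A", k - 4), ("A", k - 5)),
        ))
    return body

def flow_dependences(body):
    """Pairs (i, j) where statement j reads what earlier statement i writes."""
    deps = []
    for j, (_, _, reads) in enumerate(body):
        for i in range(j):
            if body[i][1] in reads:
                deps.append((i, j))
    return deps

body = unrolled_body()
for i, j in flow_dependences(body):
    print(body[i][0], "->", body[j][0])
```

For the factor-5 body the sketch reports a single pair: the write to A(I*5+6) in the first statement and its read in the statement labelled 10. Calling `unrolled_body(factor=4)` yields no pairs, because the reads then reach back only to elements written in earlier iterations.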
Conditional Instructions
The Trace machine uses compare-predict operations rather than test operators.
The results can be written to general registers, which can avoid some of the branches in complex IF chains.
CEQ R1,R2, BB(R2) Write BB with 1 if R1 == R2 else write BB with 0
BRANCH (R3),LABEL The branch_test field selects R3
Conditional Branches
Branches cause problems when instructions are packed into wide instructions.
Consider the following sequence:
  IF A < B GOTO 10
  IF C < D GOTO 20
These two statements are independent, and can be packed together.
Conditional Branches ...
But, what happens if both indicate a branch? Which one should be taken?
The TRACE machine uses a statically encoded priority scheme, so that the first one has priority over the second.
  IF A < B GOTO 10    IF C < D GOTO 20
  (the first branch takes priority)
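The priority rule can be sketched as follows. This is a software illustration of the selection logic only; the function name `next_target` and the `"fall through"` marker are hypothetical.

```python
def next_target(branches, fall_through="fall through"):
    """branches: (condition, target) pairs listed in priority order."""
    for condition, target in branches:
        if condition:
            return target       # the first true branch wins
    return fall_through

A, B, C, D = 1, 2, 3, 4         # both conditions are true here
print(next_target([(A < B, 10), (C < D, 20)]))
```

Because the order is fixed when the instruction word is encoded, the hardware never has to arbitrate between the two branches at run time.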
Compensation Code
One problem with statically unrolled loops, and with packing instructions into one word alongside a conditional branch, is that certain instructions may sometimes be executed even though a branch has been taken.
– We have looked at some conventional solutions in pipelined machines.
The TRACE machine inserts code to undo mistakes when they occur.
Compensation Code ...
Consider:
  IF A < B GOTO 10
  D = D + 1
If these are packed together, then D will always be incremented.
If A is usually >= B then this will usually be correct.
Compensation Code ...
But if A < B, the increment will still be done and will be wrong.
The compiler could insert the following code at 10:
  10 D = D - 1
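The effect of the compensation code can be checked with a short Python sketch. Both functions are hypothetical models of the two code shapes: one executes the statements in order, the other always performs the increment (as the packed instruction would) and undoes it at label 10.

```python
def sequential(a, b, d):
    """Original order: branch first, increment only on the fall-through path."""
    if a < b:
        return d, "label 10"
    return d + 1, "fall through"

def packed_with_compensation(a, b, d):
    """Branch and increment share one instruction word."""
    branch_taken = a < b
    d = d + 1                 # always executed, whether or not the branch is taken
    if branch_taken:
        d = d - 1             # compensation code inserted at label 10
        return d, "label 10"
    return d, "fall through"

print(sequential(1, 2, 5), packed_with_compensation(1, 2, 5))
```

The compensation only costs time on the path the compiler judged unlikely, which is why accurate branch estimation matters.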
TRACE Structure
On the TRACE machine each functional unit is split into an integer ALU and a floating point ALU.
Each FU requires 256 bits of instruction.
A TRACE machine can have up to 4 Functional Units.
A fully configured TRACE machine will require 1024 bits of instruction per cycle.
TRACE Structure ...
Each Integer ALU contains 2 ALU/multipliers, an address translation TLB and a PC, as well as the integer registers.
Each floating point unit contains a floating point adder, a floating point multiplier, store and load registers.
TRACE Structure
[Figure: one functional unit. Integer side: I Registers (64 x 32), ALU0 and ALU1 each with an IMUL, TLB producing the Physical Address, PC and PC Adder, ILoad Buses. Floating point side: F Registers (32 x 64), FMUL/ALUM, FADD/ALUA, Store Registers (32 x 32), FLoad Buses and Store Buses]
Instruction Format
Word 0  ALU 0 Early Beat             opcode, dest, dest_bank, branch_test, src_1, src_2, Imm
Word 1  Immediate Constant (early)   bits 0 to 31
Word 2  ALU 1 Early Beat             opcode, dest, dest_bank, branch_test, src_1, src_2, Imm
Word 3  FA/ALUA control fields       opcode, 64, dest, src_1, src_2, Dest_bank
Word 4  ALU 0 Late Beat              opcode, dest, dest_bank, src_1, src_2, Imm
Word 5  Immediate Constant (late)    bits 0 to 31
Word 6  ALU 1 Late Beat              opcode, dest, dest_bank, src_1, src_2, Imm
Word 7  FM/ALUM control fields       opcode, 64, dest, src_1, src_2, Dest_bank
Bit positions marked on the ALU words: 0 1 6 7 11 12 13 15 16 18 19 24 25 31
Bit positions marked on the FA/ALUA and FM/ALUM words: 0 1 3 4 5 6 10 11 15 16 17 22 23 24 25 31
Instruction Encoding
In a highly parallel program each instruction will be packed with useful operations.
In a program which does not have sufficient concurrency there will be many no-ops in the fields of the instructions.
Also, even a highly parallel program may have regions which are low in concurrency.
Instruction Encoding ...
To combat the wasted space, a special memory format for the instructions is used.
Instructions with no-op fields are expanded on the fly when they are loaded into the instruction cache.
[Figure: an Encoded Instruction and the corresponding Expanded Instruction]
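One way such an encoding can work is sketched below. The scheme shown (a bit mask saying which fields are present, followed by only the non-no-op fields) is an assumption made for illustration; the slides do not give the actual TRACE memory format, and the names `encode`, `expand` and `NOP` are hypothetical.

```python
NOP = 0  # assumed encoding of an empty field

def encode(fields):
    """Return (mask, kept): bit i of mask is set when field i is not a no-op."""
    mask = 0
    kept = []
    for i, field in enumerate(fields):
        if field != NOP:
            mask |= 1 << i
            kept.append(field)
    return mask, kept

def expand(mask, kept, width):
    """Rebuild the full-width instruction, re-inserting the no-ops."""
    remaining = iter(kept)
    return [next(remaining) if (mask >> i) & 1 else NOP for i in range(width)]

instruction = [0x11, NOP, NOP, 0x22, NOP, NOP, NOP, 0x33]
mask, kept = encode(instruction)
print(bin(mask), kept, expand(mask, kept, len(instruction)) == instruction)
```

In this sketch an eight-field instruction with three useful fields is stored as one mask plus three fields, and the expansion restores the original layout exactly.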
Memory Subsystem
The TRACE machine uses an interleaved memory subsystem to achieve high throughput.
It does not rely on large data caches and cache hit rates, but instead pipelines memory references (there is, however, an instruction cache).
Memory Subsystem ...
There are multiple buses between the ALUs and the memory units: the F and I load buses and the F store buses.
The load buses are bi-directional, and the store buses are uni-directional.
[Figure: the functional unit diagram (I Registers, ALU0, ALU1, TLB, PC, F Registers, FMUL/ALUM, FADD/ALUA, Store Registers), with the Load Buses and Store Buses labelled]
Memory Pipeline
Memory is accessed using an 8-stage pipeline. This is visible to the compiler.
0 The program says LD R1, R2, R3. R1 and R2 are added to form a virtual address. R2 may be replaced by a 6, 17 or 32 bit immediate constant.
1 The virtual address is looked up in the TLB
2 The physical address is sent over the buses to the memory controller.
3 The desired RAM bank starts cycling
4 RAM access continues
5 Data is returned from the memory controller
6 Data is sent over the buses
7 Data is written into the register file, and CPU can use data in R3.
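Because the pipeline is visible to the compiler, the compiler can work out when a loaded value becomes usable. The sketch below assumes each of the eight stages takes exactly one beat, which the slides imply but do not state; the names `STAGES`, `schedule` and `result_beat` are hypothetical.

```python
STAGES = [
    "form virtual address (R1 + R2)",
    "TLB lookup",
    "physical address sent to memory controller",
    "RAM bank starts cycling",
    "RAM access continues",
    "data returned from memory controller",
    "data sent over the buses",
    "data written to register file",
]

def schedule(issue_beat):
    """Map each beat to the stage occupied by a load issued at issue_beat."""
    return {issue_beat + n: stage for n, stage in enumerate(STAGES)}

def result_beat(issue_beat):
    """Beat in which the data reaches the register file (stage 7)."""
    return issue_beat + len(STAGES) - 1

# Two overlapped loads issued one beat apart.
for beat in range(9):
    print(beat, schedule(0).get(beat, "-"), "|", schedule(1).get(beat, "-"))
```

A new load can enter the pipeline every beat, so the latency of one load can be hidden behind others, provided the operations that use the data are scheduled late enough.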
Memory Pipeline ...
[Figure: two overlapped loads, each passing through the stages VA = R1 + R2, TLB Lookup, Adrs to Mem, Memory Cycle, Memory Cycle, Bus Busy, Data Bus, Data to R3]
The compiler must ensure at compile time that the two references use different memory modules.
The compiler must ensure at compile time that the two references use different buses.
Memory system
In a fully configured TRACE machine 4 memory references may be started in each beat, to 4 independently generated addresses.
The following rules must be followed:
– At most one reference may be initiated on any one controller
– No two references should be initiated which require the same bus to return the data
– No two references should be initiated to the same RAM bank within 4 beats of each other
– The available number of register file write ports should not be exceeded
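The rules above can be expressed as a simple checker over a proposed schedule. This is a sketch under stated assumptions: the `Ref` record and `violations` function are hypothetical, "within 4 beats" is read as fewer than 4 beats apart, the write-port limit is taken as a parameter (4 here is a placeholder, not a TRACE figure), and the write-port rule is simplified to a count of references started per beat.

```python
from collections import namedtuple
from itertools import combinations

Ref = namedtuple("Ref", "beat controller bus bank")

def violations(refs, write_ports=4, bank_busy_beats=4):
    """Return a list of rule violations for a set of memory references."""
    problems = []
    for a, b in combinations(refs, 2):
        if a.beat == b.beat and a.controller == b.controller:
            problems.append(("controller", a, b))
        if a.beat == b.beat and a.bus == b.bus:
            problems.append(("bus", a, b))
        if a.bank == b.bank and abs(a.beat - b.beat) < bank_busy_beats:
            problems.append(("bank", a, b))
    per_beat = {}
    for ref in refs:
        per_beat[ref.beat] = per_beat.get(ref.beat, 0) + 1
    for beat, count in sorted(per_beat.items()):
        if count > write_ports:
            problems.append(("write ports", beat, count))
    return problems

good = [Ref(0, 0, 0, 0), Ref(0, 1, 1, 1)]
bad = [Ref(0, 0, 0, 0), Ref(2, 1, 1, 0)]   # same bank, only 2 beats apart
print(violations(good), violations(bad))
```

A scheduler would run a check like this before committing two references to the same or nearby instruction words.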
The Disambiguator
A special module of the compiler, the disambiguator, determines whether memory references can be started in the same beat.
It must determine whether
– address1 mod #modules = address2 mod #modules
The answers may be yes, no and maybe.
– If the answer is no, then they are packed into the one instruction.
– If the answer is yes or maybe, they are separated.
The Disambiguator
Consider the following cases:
– accessing a single variable (compiler controlled address)
– accessing parts of an array
  • A(I) and A(I+1)
  • A(I) and A(I+J)
[Figure: interleaved memory holding A(I), A(I+1), A(I+2), A(I+3) and A(I+4) in successive banks]
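A minimal sketch of the three-valued answer for two references into the same array. It assumes one array element per interleaved word, so that elements whose indices differ by a multiple of the number of modules share a module; the function names and the choice of 8 modules are illustrative assumptions.

```python
def same_module(index_difference, n_modules):
    """Answer "yes", "no" or "maybe" for two references A(I+c1) and A(I+c2).

    index_difference is c2 - c1, or None when it is not known at compile time.
    """
    if index_difference is None:
        return "maybe"
    return "yes" if index_difference % n_modules == 0 else "no"

def can_pack_together(index_difference, n_modules):
    """Only a definite "no" allows both references in one instruction."""
    return same_module(index_difference, n_modules) == "no"

print(same_module(1, 8))      # A(I) and A(I+1): difference known to be 1
print(same_module(None, 8))   # A(I) and A(I+J): J unknown at compile time
```

A(I) and A(I+1) always fall in different modules, so they can be packed together; A(I) and A(I+J) must be separated unless the compiler can deduce something about J.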