8/11/2019 Vector Fpu
http://slidepdf.com/reader/full/vector-fpu 1/50
Floating Point Vector Processing on an FPGA

Prof. Miriam Leeser
Department of Electrical and Computer Engineering
Northeastern University, Boston, MA
mel@coe.neu.edu

Based on MS thesis by Jainik Kathiara, Jan 2011, and FCCM 2011 paper
Outline
• Introduction to Vector Processing
• Vector-scalar ISA
• FPVC Architecture
• Vectorized Linear Algebra Kernels
• Results
• Future Directions
• Rich set of reconfigurable elements
• Embedded processor: PowerPC
• Floating point instruction extensions to PowerPC:
  – Emulated in software
  – Hardware coprocessor
• Floating point pipeline
  – NU VFLOAT library
• FPGA Floating Point Unit serializes operations:
  – PowerPC fetches and executes instructions, loads and stores data
• Vector Processor: potential to operate on lots of data at the same time
  – Multiple data elements stored in a vector
• Eliminates loops
• FPVC does its own instruction fetch and execute
• Vector instructions are dense:
  – Reduced program code size
  – Reduced dynamic instruction bandwidth
  – Reduced data hazards
• Parallel execution, parallel data
Example: DAXPY

for (i = 0; i < n; i++)
    Y[i] = a * X[i] + Y[i];

• BLAS library routine SAXPY / DAXPY
• Scalar a times vector X plus vector Y
• In a vector ISA such operations are written very compactly: operate on the entire vectors X and Y
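As a concrete reference, the loop above can be sketched in plain Python (an illustration only; the function name `daxpy` is ours, not FPVC code):

```python
def daxpy(a, x, y):
    """BLAS (S/D)AXPY: return a * x[i] + y[i] for every element i."""
    return [a * xi + yi for xi, yi in zip(x, y)]

# scale vector X by scalar a and add vector Y, element by element
result = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

A vector ISA replaces this whole loop with a handful of vector instructions.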
DAXPY: Scalar MIPS Code

        L.D    F0,a        ;load scalar a
        DADDIU R4,Rx,#512  ;last address to load
Loop:   L.D    F2,0(Rx)    ;load X(i)
        MUL.D  F2,F2,F0    ;a × X(i)
        L.D    F4,0(Ry)    ;load Y(i)
        ADD.D  F4,F4,F2    ;a × X(i) + Y(i)
        S.D    0(Ry),F4    ;store into Y(i)
        DADDIU Rx,Rx,#8    ;increment index to X
        DADDIU Ry,Ry,#8    ;increment index to Y
        DSUBU  R20,R4,Rx   ;compute bound
        BNEZ   R20,Loop    ;check if done
DAXPY: Vector MIPS Code

        L.S     F0,a       ;load scalar a
        LV      V1,Rx      ;load vector X
        MULVS.S V2,V1,F0   ;vector-scalar multiply
        LV      V3,Ry      ;load vector Y
        ADDV.S  V4,V2,V3   ;add
        SV      Ry,V4      ;store the result

• Assumes vector length matches the length of the vector registers, etc.
• Vector registers hold many operands at once
  – 64, 128, 256 elements typical
• Vector instructions operate on many operands at once:
  – LV, SV
  – VADD, VMULT
  – This reduces code size and dynamic instruction count
• What about processing?
  – Use one functional unit (e.g. MULT) and pipeline it
  – Have multiple functional units operating at once: vector lanes
  – Do both: parallelism and pipelining
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

[Figure: six-stage multiply pipeline operating on vectors V1, V2, V3]
Slide credit: Krste Asanovic, UC Berkeley, 1998.
Vector Instruction Execution

[Figure: executing C = A + B. Left: execution using one pipelined functional unit — elements A[0]+B[0], A[1]+B[1], … enter one per cycle. Right: execution using four pipelined functional units (four lanes) — four elements enter per cycle, e.g. A[24]+B[24] through A[27]+B[27] together.]
Vector Lanes

[Figure: four-lane vector unit. Each lane contains a functional unit and a register-file partition holding a subset of the elements — lane 0: elements 0, 4, 8, …; lane 1: elements 1, 5, 9, …; lane 2: elements 2, 6, 10, …; lane 3: elements 3, 7, 11, … All lanes connect to the memory subsystem.]
Strip Mining

m = n; i = 0;
while (m > MVL) {
    for (j = 0; j < MVL; j++)
        Y[i*MVL+j] = a * X[i*MVL+j] + Y[i*MVL+j];
    m = m - MVL; i++;
}
for (j = 0; j < m; j++)
    Y[i*MVL+j] = a * X[i*MVL+j] + Y[i*MVL+j];

• Maximum vector length (MVL)
• Vector Length Register (VLR)
• Strip mining
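Strip mining can be simulated in Python; here `MVL` is an assumed maximum vector length, and each slice stands for one vector operation whose length would be set through the VLR:

```python
MVL = 64  # assumed maximum vector length of the machine

def daxpy_strip_mined(a, x, y):
    """Process x and y in strips of at most MVL elements each."""
    n = len(x)
    out = []
    for start in range(0, n, MVL):
        end = min(start + MVL, n)  # final strip covers the n mod MVL remainder
        # one slice models a single vector operation with VLR = end - start
        out.extend(a * xi + yi for xi, yi in zip(x[start:end], y[start:end]))
    return out
```

The result is identical to the unstripped loop; only the iteration structure changes.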
Vector Strip Mining

Solution: break loops into pieces that fit into vector registers ("strip mining"):

for (i = 0; i < N; i++)
    C[i] = A[i] + B[i];

      ANDI   R1, N, 63     # N mod 64
      MTC1   VLR, R1       # Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3     # Multiply by 8
      DADDU  RA, RA, R2    # Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1      # Subtract elements
      LI     R1, 64
      MTC1   VLR, R1       # Reset full length
      BGTZ   N, loop       # Any more to do?

[Figure: arrays A, B, C split into an initial remainder strip followed by full 64-element strips]
Related Work
• VIRAM and T0 designed by Kozyrakis [1] and Asanovic [2] respectively
• VIRAM and T0 are implemented as ASICs
• Yiannacouras [3] and Yu [4] have designed FPGA-based soft vector processors inspired by VIRAM and T0
  – These designs implement integer arithmetic, not floating point
Floating Point Vector Coprocessor (FPVC)
• Differs from earlier work on floating point coprocessors:
  – Fetches its own instructions
  – Loop control is local to the FPVC
  – Includes divide and square root in floating point
Vector Chaining and Hybrid Vector/SIMD Architecture
• Vector chaining is pipeline forwarding applied to a vector
• Requires one read port and one write port per functional unit
• Hybrid vector/SIMD computation performs in SIMD fashion across lanes and over time as in a traditional vector processor
• AMD GPU architecture implements a hybrid vector/SIMD architecture
Vector Scalar Instruction Set Architecture
• 32-bit instruction set
• Supports 32 vector registers
• All instructions can be classified into five categories:
  – Memory access instructions
  – Integer arithmetic instructions
  – Program flow control instructions
  – Floating point arithmetic instructions
  – Special instructions
Vector Register File
• Two types of organization:
  – Register partitioned
  – Element partitioned
• Vector register file is parameterized by:
  – Number of vector lanes (L)
  – Short vector size (SV)
Vector Lane, Short Vector, Vector Register, Scalar Register

[Figure: register file organization showing scalar registers, vector lanes (L), and short vectors (SV)]
Memory Instruction Format

op[5:0] rd[4:0] r1[4:0] r2[4:0] imd[10:0]

• Supported memory access patterns are:
  – Unit stride
  – Non-unit stride
  – Permutation access
  – Look-up table access
  – Rake access
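The first three access patterns can be illustrated with simple Python gathers over a flat memory array (function names are ours, for illustration only):

```python
def load_unit_stride(mem, base, n):
    """Unit stride: n consecutive elements starting at base."""
    return [mem[base + i] for i in range(n)]

def load_strided(mem, base, stride, n):
    """Non-unit stride: every stride-th element, e.g. a column of a
    row-major matrix whose row length equals the stride."""
    return [mem[base + i * stride] for i in range(n)]

def load_indexed(mem, indices):
    """Look-up table access: gather from arbitrary positions."""
    return [mem[i] for i in indices]

mem = list(range(100))
column = load_strided(mem, base=2, stride=10, n=4)  # elements 2, 12, 22, 32
```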
Arithmetic Instruction Formats

Arithmetic instruction with register operands:
op[5:0] rd[4:0] r1[4:0] r2[4:0] exop[10:0]

Arithmetic instruction with 16-bit immediate value:
op[5:0] rd[4:0] r1[4:0] imd[15:0]

• Includes both integer and floating point instructions
• Masked instruction execution is also included
Scalar Instructions
• The same instruction format is used
• Only the first element of the first short vector of each vector register is used
• The result is replicated to all lanes and stored in the first short vector
Masked Execution: Expand and Compress

[Figure: given mask vector 1,0,1,1,1,0,0,1 over A[0]…A[7], compress packs the elements whose mask bit is 1 (A[0], A[2], A[3], A[4], A[7]) into contiguous positions; expand performs the inverse, scattering a packed vector back to the positions where the mask is 1.]
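Compress and expand under a mask can be sketched in plain Python (an illustration of the figure's behavior, not the FPVC implementation):

```python
def compress(mask, vec):
    """Pack the elements of vec whose mask bit is 1 into contiguous slots."""
    return [v for m, v in zip(mask, vec) if m]

def expand(mask, packed):
    """Inverse of compress: scatter packed values back to the mask-1 slots."""
    it = iter(packed)
    return [next(it) if m else None for m in mask]

mask = [1, 0, 1, 1, 1, 0, 0, 1]
vec = ["A0", "A1", "A2", "A3", "A4", "A5", "A6", "A7"]
packed = compress(mask, vec)
```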
FPVC Microarchitecture
• Autonomous from the main processor
• Supports vector scalar ISA
• In-order issue, out-of-order completion
  – Arbiter handles completion
• Unified vector scalar register file
• Uses NU VFLOAT library for floating point units
Memory Architecture
• Supports modified Harvard style memory architecture
• Separate instruction and data memory in local on-chip RAM
• Unified main memory (in other on-chip RAM)
• Local on-chip RAM reduces traffic on the system bus
• Program and data size are limited by local on-chip RAM size
  – Vector code is more compact than scalar code!
Bus Interface
• FPVC is connected to the system bus through a PLB interface, but is not limited to any bus protocol
• Two ports are provided for connection in an embedded system:
  – Slave port – for communication with the main processor
  – Master port – for main memory accesses
• Interface can be configured for different data-width memory accesses
Experimental Setup
• Design implemented on Xilinx ML510 board
• 32-bit PLB based system bus
• Embedded system runs at 100 MHz
• PowerPC program code is compiled with gcc using -O2 optimization
  – PowerPC FPU only used for comparison
• FPVC program code is written in machine code and unoptimized
• Program and data are stored in BRAM (main memory)
• Main metric for performance: cycle count
PowerPC_main() {
    1. Start PowerPC timer();
    2. Write kernel parameters to FPVC's local data RAM;
    3. FPVC instruction load;
    4. Wait until FPVC completes execution;
    5. Stop PowerPC timer();
}

FPVC_main() {
    wait for instruction load;
    load data;
    compute kernel();
    store result;
    HALT FPVC;
}
Linear Algebra Kernels
• DOT Product
• Matrix-Vector Product
• Matrix-Matrix Multiplication
• QR Decomposition
• Cholesky Decomposition
DOT Product
• Performs O(N) floating point operations
• Can be formulated as:

DOT_product_kernel() {
    load vector u from local data RAM;
    load vector v from local data RAM;
    mul_vector = multiply u and v;
    accumulate = reduction(mul_vector);
}
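In Python, the kernel's multiply-then-reduce structure looks like this (an illustrative sketch, not the FPVC code):

```python
def dot_product_kernel(u, v):
    """Elementwise multiply, then reduce the products to a scalar."""
    mul_vector = [a * b for a, b in zip(u, v)]  # vector multiply
    accumulate = 0.0
    for p in mul_vector:                        # reduction
        accumulate += p
    return accumulate
```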
DOT Product Performance for Short Vector Scaling

[Chart: DOT product with lanes L = 2. Performance improvement over PowerPC vs. number of vector elements (8 to 512) for short vector sizes SV = 8, 16, 32; y-axis spans roughly 0.4 to 1.8.]
DOT Product Performance for Lane Scaling

[Chart: DOT product with short vector size SV = 32. Performance improvement over PowerPC vs. number of vector elements (8 to 512) for L = 1, 2, 4, 8; y-axis spans roughly 0.4 to 2.4.]
Matrix-Vector Product
• BLAS level 2 routine
• Performs O(N²) floating point operations
• Product can be formulated as:

MV_product_kernel() {
    loop (i = 0 to i = N-1)
        y_i = DOT_product_kernel(A_i, x);
        store result y_i to local memory;
    end loop;
}
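The same structure in Python — one dot product per row of A (illustrative sketch only):

```python
def dot_product_kernel(u, v):
    """Elementwise multiply and reduce."""
    return sum(a * b for a, b in zip(u, v))

def mv_product_kernel(A, x):
    """y_i = dot(A_i, x) for each row A_i of A: O(N^2) multiply-adds."""
    return [dot_product_kernel(row, x) for row in A]
```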
Matrix-Vector Product Performance for Lane Scaling

[Chart: performance improvement over PowerPC vs. square matrix size (4 to 16) for L = 1, 2, 4, 8; y-axis spans roughly 0.4 to 1.6.]
Matrix-Matrix Multiplication
• BLAS level 3 routine
• Performs O(N³) floating point operations
• Product can be formulated as:

MM_product_kernel() {
    loop (i = 0 to i = N-1)
        C_i = MV_product_kernel(A, B_i);
        store result C_i to local memory;
    end loop;
}
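A Python sketch of the kernel: each column B_i of B yields one column C_i of C via a matrix-vector product (illustrative only):

```python
def mv_product_kernel(A, x):
    """Matrix-vector product: one dot product per row of A."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def mm_product_kernel(A, B):
    """C = A * B, built one column at a time from matrix-vector products."""
    ncols = len(B[0])
    # column j of C is A times column j of B
    cols = [mv_product_kernel(A, [row[j] for row in B]) for j in range(ncols)]
    # reassemble the columns into a row-major result
    return [[cols[j][i] for j in range(ncols)] for i in range(len(A))]
```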
Matrix-Matrix Multiplication Performance for Lane Scaling

[Chart: MM product with short vector size SV = 32. Performance improvement over PowerPC vs. square matrix size (4 to 16) for Lane = 1, 2, 4, 8; y-axis spans roughly 0.7 to 1.7.]
QR Decomposition
• This kernel uses Givens rotations to decompose a matrix into an orthogonal matrix (Q) and an upper triangular matrix (R) such that A = QR.
• An N x N matrix A is zeroed out one element at a time using a 2 x 2 rotation matrix Q_i,j:

QR_Decomp_kernel() {
    loop (i = 0 up to i = M-1)
        loop (j = N-1 down to j > i)
            x = A[j-1][i];
            y = A[j][i];
            A[j-1:j][0:N-1] = MM_product_kernel(Q_i,j, A[j-1:j][0:N-1]);
        end loop;
    end loop;
}

• Performs O(N³) floating point operations.
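A minimal Python sketch of Givens-rotation triangularization (the rotation is applied directly to the row pair rather than through an MM_product_kernel call; assumes a square matrix):

```python
import math

def givens_qr_r(A):
    """Apply 2x2 Givens rotations to zero A below the diagonal,
    returning the upper-triangular factor R of A = QR."""
    R = [row[:] for row in A]          # work on a copy
    n = len(R)
    for i in range(n):                 # column being zeroed
        for j in range(n - 1, i, -1):  # bottom-up, as in the pseudocode
            x, y = R[j - 1][i], R[j][i]
            if y == 0.0:
                continue
            r = math.hypot(x, y)
            c, s = x / r, y / r        # the 2x2 rotation [c s; -s c]
            for k in range(n):         # rotate rows j-1 and j
                a, b = R[j - 1][k], R[j][k]
                R[j - 1][k] = c * a + s * b
                R[j][k] = -s * a + c * b
    return R
```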
QR Decomposition Performance for Lane Scaling
Cholesky Decomposition
• This kernel decomposes a symmetric positive-definite matrix A into the product of a lower triangular matrix L and its transpose, such that A = LL^T.
• Each element of L can be defined as:
  L_jj = sqrt(A_jj - sum_{k<j} L_jk^2)
  L_ij = (A_ij - sum_{k<j} L_ik * L_jk) / L_jj,  for i > j

Cholesky_Decomp_kernel() {
    loop (i = 0 up to i = N-1)
        pivot_value = sqrt(A_i,i);
        divide i-th column vector from i to N by pivot_value;
        loop (j = i+1 up to N)
            accumulate row vector from 0 to i;
            subtract accumulated value from A_j,i+1;
        end loop;
    end loop;
}
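A column-at-a-time Cholesky in Python, following the pivot/divide/update structure of the kernel (an illustrative sketch; assumes A is symmetric positive definite):

```python
import math

def cholesky_kernel(A):
    """Return lower-triangular L with A = L * L^T."""
    n = len(A)
    M = [row[:] for row in A]          # work on a copy of A
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        pivot_value = math.sqrt(M[i][i])       # pivot_value = sqrt(A_ii)
        for j in range(i, n):
            L[j][i] = M[j][i] / pivot_value    # divide column i by the pivot
        for j in range(i + 1, n):              # update the trailing submatrix
            for k in range(i + 1, j + 1):
                M[j][k] -= L[j][i] * L[k][i]
    return L
```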
Cholesky Decomposition Performance for Lane Scaling
Conclusions
• Designed and implemented a unified vector scalar floating point architecture
• Supports floating point operations:
  – add, subtract, multiply, divide, square root
• Initiated designing a linear algebra library for FPGA-based computation
• FPVC is a self-contained processor
  – Easier to implement
• FPVC is autonomous from the embedded processor
  – Good choice for implementing scientific apps that use the rest of the FPGA for other functions at the same time
Future Directions
• Double Precision Floating Point Support
• Architectural Improvements
  – Memory Caching
• Improved Tools
  – Vector Compiler Tool Flow
• More Applications
  – Demonstrate concurrent use of FPVC
References
[1] C. Kozyrakis and D. Patterson, "Overcoming the Limitations of Conventional Vector Processors," in Proceedings of the 30th International Symposium on Computer Architecture, San Diego, California, June 2003, pp. 399–409.
[2] K. Asanovic, J. Beck, B. Irissou, B. Kingsbury, and N. Morgan, "The T0 Vector Microprocessor," Hot Chips, vol. 7, pp. 187–196, 1995.
[3] P. Yiannacouras, J. G. Steffan, and J. Rose, "VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors," in International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Atlanta, GA, October 2008.
[4] J. Yu, G. Lemieux, and C. Eagleston, "Vector Processing as a Soft-core CPU Accelerator," in ACM International Symposium on FPGA, 2008.
Miriam Leeser
mel@coe.neu.edu
http://www.coe.neu.edu/Research/rcl/index.php

More details can be found in:
• Jainik Kathiara's MS thesis (under the publications link)
• "An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs" by Jainik Kathiara and Miriam Leeser