8/11/2019 Vector Fpu
http://slidepdf.com/reader/full/vector-fpu 1/50
Floating Point Vector Processing on an FPGA

Prof. Miriam Leeser
Department of Electrical and Computer Engineering
Northeastern University, Boston, MA
mel@coe.neu.edu

Based on MS thesis by Jainik Kathiara, Jan 2011, and FCCM 2011 paper
Outline
• Introduction to Vector Processing
• Vector-scalar ISA
• FPVC Architecture
• Vectorized Linear Algebra Kernels
• Results
• Future Directions
• Rich set of reconfigurable elements
• Embedded processor: PowerPC
• Floating point instruction extensions to PowerPC:
  – Emulated in software
  – Hardware coprocessor
• Floating point pipeline
  – NU VFLOAT library
• FPGA Floating Point Unit serializes operations:
  – PowerPC fetches and executes instructions, loads and stores data
• Vector Processor: potential to operate on lots of data at the same time
  – Multiple data elements stored in a vector
• Eliminates loops
• FPVC does its own instruction fetch and execute
• Vector instructions are dense:
  – Reduced program code size
  – Reduced dynamic instruction bandwidth
  – Reduced data hazards
• Parallel execution, parallel data
Example: DAXPY

for (i = 0; i < n; i++)
    Y[i] = a * X[i] + Y[i];

• BLAS library routine SAXPY / DAXPY
• Scalar a times vector X plus vector Y
• In a vector ISA such operations are written very compactly: operate on the entire vectors X and Y
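As a concrete reference, the loop above can be sketched in plain Python (an illustration only; the function name `daxpy` is ours, not FPVC code):

```python
def daxpy(a, x, y):
    """BLAS (S/D)AXPY: return a * x[i] + y[i] for every element i."""
    return [a * xi + yi for xi, yi in zip(x, y)]

# scale vector X by scalar a and add vector Y, element by element
result = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

A vector ISA replaces this whole loop with a handful of vector instructions.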
DAXPY: Scalar MIPS Code

        L.D    F0,a        ;load scalar a
        DADDIU R4,Rx,#512  ;last address to load
Loop:   L.D    F2,0(Rx)    ;load X(i)
        MUL.D  F2,F2,F0    ;a × X(i)
        L.D    F4,0(Ry)    ;load Y(i)
        ADD.D  F4,F4,F2    ;a × X(i) + Y(i)
        S.D    0(Ry),F4    ;store into Y(i)
        DADDIU Rx,Rx,#8    ;increment index to X
        DADDIU Ry,Ry,#8    ;increment index to Y
        DSUBU  R20,R4,Rx   ;compute bound
        BNEZ   R20,Loop    ;check if done
DAXPY: Vector MIPS Code

        L.S     F0,a       ;load scalar a
        LV      V1,Rx      ;load vector X
        MULVS.S V2,V1,F0   ;vector-scalar multiply
        LV      V3,Ry      ;load vector Y
        ADDV.S  V4,V2,V3   ;add
        SV      Ry,V4      ;store the result

• Assumes vector length matches the length of the vector registers, etc.
• Vector registers hold many operands at once
  – 64, 128, 256 elements typical
• Vector instructions operate on many operands at once:
  – LV, SV
  – VADD, VMULT
  – This reduces code size and dynamic instruction count
• What about processing?
  – Use one functional unit (e.g. MULT) and pipeline it
  – Have multiple functional units operating at once: vector lanes
  – Do both: parallelism and pipelining
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

[Figure: six-stage multiply pipeline operating on vectors V1, V2, V3]
Slide credit: Krste Asanovic, UC Berkeley, 1998.
Vector Instruction Execution

[Figure: executing C = A + B. Left: execution using one pipelined functional unit — elements A[0]+B[0], A[1]+B[1], … enter one per cycle. Right: execution using four pipelined functional units (four lanes) — four elements enter per cycle, e.g. A[24]+B[24] through A[27]+B[27] together.]
Vector Lanes

[Figure: four-lane vector unit. Each lane contains a functional unit and a register-file partition holding a subset of the elements — lane 0: elements 0, 4, 8, …; lane 1: elements 1, 5, 9, …; lane 2: elements 2, 6, 10, …; lane 3: elements 3, 7, 11, … All lanes connect to the memory subsystem.]
Strip Mining

m = n; i = 0;
while (m > MVL) {
    for (j = 0; j < MVL; j++)
        Y[i*MVL+j] = a * X[i*MVL+j] + Y[i*MVL+j];
    m = m - MVL; i++;
}
for (j = 0; j < m; j++)
    Y[i*MVL+j] = a * X[i*MVL+j] + Y[i*MVL+j];

• Maximum vector length (MVL)
• Vector Length Register (VLR)
• Strip mining
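Strip mining can be simulated in Python; here `MVL` is an assumed maximum vector length, and each slice stands for one vector operation whose length would be set through the VLR:

```python
MVL = 64  # assumed maximum vector length of the machine

def daxpy_strip_mined(a, x, y):
    """Process x and y in strips of at most MVL elements each."""
    n = len(x)
    out = []
    for start in range(0, n, MVL):
        end = min(start + MVL, n)  # final strip covers the n mod MVL remainder
        # one slice models a single vector operation with VLR = end - start
        out.extend(a * xi + yi for xi, yi in zip(x[start:end], y[start:end]))
    return out
```

The result is identical to the unstripped loop; only the iteration structure changes.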
Vector Strip Mining

Solution: break loops into pieces that fit into vector registers ("strip mining"):

for (i = 0; i < N; i++)
    C[i] = A[i] + B[i];

      ANDI   R1, N, 63     # N mod 64
      MTC1   VLR, R1       # Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3     # Multiply by 8
      DADDU  RA, RA, R2    # Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1      # Subtract elements
      LI     R1, 64
      MTC1   VLR, R1       # Reset full length
      BGTZ   N, loop       # Any more to do?

[Figure: arrays A, B, C split into an initial remainder strip followed by full 64-element strips]
Related Work
• VIRAM and T0 designed by Kozyrakis [1] and Asanovic [2] respectively
• VIRAM and T0 are implemented as ASICs
• Yiannacouras [3] and Yu [4] have designed FPGA-based soft vector processors inspired by VIRAM and T0
  – These designs implement integer arithmetic, not floating point
Floating Point Vector Coprocessor (FPVC)
• Differs from earlier work on floating point coprocessors:
  – Fetches its own instructions
  – Loop control is local to the FPVC
  – Includes divide and square root in floating point
Vector Chaining and Hybrid Vector/SIMD Architecture
• Vector chaining is pipeline forwarding applied to a vector
• Requires one read port and one write port per functional unit
• Hybrid vector/SIMD computation performs in SIMD fashion across lanes and over time as in a traditional vector processor
• AMD GPU architecture implements a hybrid vector/SIMD architecture
Vector Scalar Instruction Set Architecture
• 32-bit instruction set
• Supports 32 vector registers
• All instructions can be classified into five categories:
  – Memory access instructions
  – Integer arithmetic instructions
  – Program flow control instructions
  – Floating point arithmetic instructions
  – Special instructions
Vector Register File
• Two types of organization:
  – Register partitioned
  – Element partitioned
• Vector register file is parameterized by:
  – Number of vector lanes (L)
  – Short vector size (SV)
Vector Lane, Short Vector, Vector Register, Scalar Register

[Figure: register file organization showing scalar registers, vector lanes (L), and short vectors (SV)]
Memory Instruction Format

op[5:0] rd[4:0] r1[4:0] r2[4:0] imd[10:0]

• Supported memory access patterns are:
  – Unit stride
  – Non-unit stride
  – Permutation access
  – Look-up table access
  – Rake access
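The first three access patterns can be illustrated with simple Python gathers over a flat memory array (function names are ours, for illustration only):

```python
def load_unit_stride(mem, base, n):
    """Unit stride: n consecutive elements starting at base."""
    return [mem[base + i] for i in range(n)]

def load_strided(mem, base, stride, n):
    """Non-unit stride: every stride-th element, e.g. a column of a
    row-major matrix whose row length equals the stride."""
    return [mem[base + i * stride] for i in range(n)]

def load_indexed(mem, indices):
    """Look-up table access: gather from arbitrary positions."""
    return [mem[i] for i in indices]

mem = list(range(100))
column = load_strided(mem, base=2, stride=10, n=4)  # elements 2, 12, 22, 32
```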
Arithmetic Instruction Formats

Arithmetic instruction with register operands:
op[5:0] rd[4:0] r1[4:0] r2[4:0] exop[10:0]

Arithmetic instruction with 16-bit immediate value:
op[5:0] rd[4:0] r1[4:0] imd[15:0]

• Includes both integer and floating point instructions
• Masked instruction execution is also included
Scalar Instructions
• The same instruction format is used
• Only the first element of the first short vector of each vector register is used
• The result is replicated to all lanes and stored in the first short vector
Masked Execution: Expand and Compress

[Figure: given mask vector 1,0,1,1,1,0,0,1 over A[0]…A[7], compress packs the elements whose mask bit is 1 (A[0], A[2], A[3], A[4], A[7]) into contiguous positions; expand performs the inverse, scattering a packed vector back to the positions where the mask is 1.]
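Compress and expand under a mask can be sketched in plain Python (an illustration of the figure's behavior, not the FPVC implementation):

```python
def compress(mask, vec):
    """Pack the elements of vec whose mask bit is 1 into contiguous slots."""
    return [v for m, v in zip(mask, vec) if m]

def expand(mask, packed):
    """Inverse of compress: scatter packed values back to the mask-1 slots."""
    it = iter(packed)
    return [next(it) if m else None for m in mask]

mask = [1, 0, 1, 1, 1, 0, 0, 1]
vec = ["A0", "A1", "A2", "A3", "A4", "A5", "A6", "A7"]
packed = compress(mask, vec)
```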
FPVC Microarchitecture
• Autonomous from the main processor
• Supports vector scalar ISA
• In-order issue, out-of-order completion
  – Arbiter handles completion
• Unified vector scalar register file
• Uses NU VFLOAT library for floating point units
Memory Architecture
• Supports modified Harvard style memory architecture
• Separate instruction and data memory in local on-chip RAM
• Unified main memory (in other on-chip RAM)
• Local on-chip RAM reduces traffic on the system bus
• Program and data size are limited by local on-chip RAM size
  – Vector code is more compact than scalar code!
Bus Interface
• FPVC is connected to the system bus through a PLB interface, but is not limited to any bus protocol
• Two ports are provided for connection in an embedded system:
  – Slave port – for communication with the main processor
  – Master port – for main memory accesses
• Interface can be configured for different data-width memory accesses
Experimental Setup
• Design implemented on Xilinx ML510 board
• 32-bit PLB based system bus
• Embedded system runs at 100 MHz
• PowerPC program code is compiled with gcc using -O2 optimization
  – PowerPC FPU only used for comparison
• FPVC program code is written in machine code and unoptimized
• Program and data are stored in BRAM (main memory)
• Main metric for performance: cycle count
PowerPC_main() {
    1. Start PowerPC timer();
    2. Write kernel parameters to FPVC's local data RAM;
    3. FPVC instruction load;
    4. Wait until FPVC completes execution;
    5. Stop PowerPC timer();
}

FPVC_main() {
    wait for instruction load;
    load data;
    compute kernel();
    store result;
    HALT FPVC;
}
Linear Algebra Kernels
• DOT Product
• Matrix-Vector Product
• Matrix-Matrix Multiplication
• QR Decomposition
• Cholesky Decomposition
DOT Product
• Performs O(N) floating point operations
• Can be formulated as:

DOT_product_kernel() {
    load vector u from local data RAM;
    load vector v from local data RAM;
    mul_vector = multiply u and v;
    accumulate = reduction(mul_vector);
}
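In Python, the kernel's multiply-then-reduce structure looks like this (an illustrative sketch, not the FPVC code):

```python
def dot_product_kernel(u, v):
    """Elementwise multiply, then reduce the products to a scalar."""
    mul_vector = [a * b for a, b in zip(u, v)]  # vector multiply
    accumulate = 0.0
    for p in mul_vector:                        # reduction
        accumulate += p
    return accumulate
```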
DOT Product Performance for Short Vector Scaling

[Chart: DOT product with lanes L = 2. Performance improvement over PowerPC vs. number of vector elements (8 to 512) for short vector sizes SV = 8, 16, 32; y-axis spans roughly 0.4 to 1.8.]
DOT Product Performance for Lane Scaling

[Chart: DOT product with short vector size SV = 32. Performance improvement over PowerPC vs. number of vector elements (8 to 512) for L = 1, 2, 4, 8; y-axis spans roughly 0.4 to 2.4.]
Matrix-Vector Product
• BLAS level 2 routine
• Performs O(N²) floating point operations
• Product can be formulated as:

MV_product_kernel() {
    loop (i = 0 to i = N-1)
        y_i = DOT_product_kernel(A_i, x);
        store result y_i to local memory;
    end loop;
}
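The same structure in Python — one dot product per row of A (illustrative sketch only):

```python
def dot_product_kernel(u, v):
    """Elementwise multiply and reduce."""
    return sum(a * b for a, b in zip(u, v))

def mv_product_kernel(A, x):
    """y_i = dot(A_i, x) for each row A_i of A: O(N^2) multiply-adds."""
    return [dot_product_kernel(row, x) for row in A]
```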
Matrix-Vector Product Performance for Lane Scaling

[Chart: performance improvement over PowerPC vs. square matrix size (4 to 16) for L = 1, 2, 4, 8; y-axis spans roughly 0.4 to 1.6.]
Matrix-Matrix Multiplication
• BLAS level 3 routine
• Performs O(N³) floating point operations
• Product can be formulated as:

MM_product_kernel() {
    loop (i = 0 to i = N-1)
        C_i = MV_product_kernel(A, B_i);
        store result C_i to local memory;
    end loop;
}
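A Python sketch of the kernel: each column B_i of B yields one column C_i of C via a matrix-vector product (illustrative only):

```python
def mv_product_kernel(A, x):
    """Matrix-vector product: one dot product per row of A."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def mm_product_kernel(A, B):
    """C = A * B, built one column at a time from matrix-vector products."""
    ncols = len(B[0])
    # column j of C is A times column j of B
    cols = [mv_product_kernel(A, [row[j] for row in B]) for j in range(ncols)]
    # reassemble the columns into a row-major result
    return [[cols[j][i] for j in range(ncols)] for i in range(len(A))]
```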
Matrix-Matrix Multiplication Performance for Lane Scaling

[Chart: MM product with short vector size SV = 32. Performance improvement over PowerPC vs. square matrix size (4 to 16) for Lane = 1, 2, 4, 8; y-axis spans roughly 0.7 to 1.7.]
QR Decomposition
• This kernel uses Givens rotations to decompose a matrix into an orthogonal matrix (Q) and an upper triangular matrix (R) such that A = QR.
• An N x N matrix A is zeroed out one element at a time using a 2 x 2 rotation matrix Q_i,j:

QR_Decomp_kernel() {
    loop (i = 0 up to i = M-1)
        loop (j = N-1 down to j > i)
            x = A[j-1][i];
            y = A[j][i];
            A[j-1:j][0:N-1] = MM_product_kernel(Q_i,j, A[j-1:j][0:N-1]);
        end loop;
    end loop;
}

• Performs O(N³) floating point operations.
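A minimal Python sketch of Givens-rotation triangularization (the rotation is applied directly to the row pair rather than through an MM_product_kernel call; assumes a square matrix):

```python
import math

def givens_qr_r(A):
    """Apply 2x2 Givens rotations to zero A below the diagonal,
    returning the upper-triangular factor R of A = QR."""
    R = [row[:] for row in A]          # work on a copy
    n = len(R)
    for i in range(n):                 # column being zeroed
        for j in range(n - 1, i, -1):  # bottom-up, as in the pseudocode
            x, y = R[j - 1][i], R[j][i]
            if y == 0.0:
                continue
            r = math.hypot(x, y)
            c, s = x / r, y / r        # the 2x2 rotation [c s; -s c]
            for k in range(n):         # rotate rows j-1 and j
                a, b = R[j - 1][k], R[j][k]
                R[j - 1][k] = c * a + s * b
                R[j][k] = -s * a + c * b
    return R
```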
QR Decomposition Performance for Lane Scaling
Cholesky Decomposition
• This kernel decomposes a symmetric positive-definite matrix A into the product of a lower triangular matrix L and its transpose, such that A = LL^T.
• Each element of L can be defined as:
  L_jj = sqrt(A_jj - sum_{k<j} L_jk^2)
  L_ij = (A_ij - sum_{k<j} L_ik * L_jk) / L_jj,  for i > j

Cholesky_Decomp_kernel() {
    loop (i = 0 up to i = N-1)
        pivot_value = sqrt(A_i,i);
        divide i-th column vector from i to N by pivot_value;
        loop (j = i+1 up to N)
            accumulate row vector from 0 to i;
            subtract accumulated value from A_j,i+1;
        end loop;
    end loop;
}
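A column-at-a-time Cholesky in Python, following the pivot/divide/update structure of the kernel (an illustrative sketch; assumes A is symmetric positive definite):

```python
import math

def cholesky_kernel(A):
    """Return lower-triangular L with A = L * L^T."""
    n = len(A)
    M = [row[:] for row in A]          # work on a copy of A
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        pivot_value = math.sqrt(M[i][i])       # pivot_value = sqrt(A_ii)
        for j in range(i, n):
            L[j][i] = M[j][i] / pivot_value    # divide column i by the pivot
        for j in range(i + 1, n):              # update the trailing submatrix
            for k in range(i + 1, j + 1):
                M[j][k] -= L[j][i] * L[k][i]
    return L
```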
Cholesky Decomposition Performance for Lane Scaling
Conclusions
• Designed and implemented a unified vector scalar floating point architecture
• Supports floating point operations:
  – add, subtract, multiply, divide, square root
• Initiated designing a linear algebra library for FPGA-based computation
• FPVC is a self-contained processor
  – Easier to implement
• FPVC is autonomous from the embedded processor
  – Good choice for implementing scientific apps that use the rest of the FPGA for other functions at the same time
Future Directions
• Double Precision Floating Point Support
• Architectural Improvements
  – Memory Caching
• Improved Tools
  – Vector Compiler Tool Flow
• More Applications
  – Demonstrate concurrent use of FPVC
References
[1] C. Kozyrakis and D. Patterson, "Overcoming the Limitations of Conventional Vector Processors," in Proceedings of the 30th International Symposium on Computer Architecture, San Diego, California, June 2003, pp. 399–409.
[2] K. Asanovic, J. Beck, B. Irissou, B. Kingsbury, and N. Morgan, "The T0 Vector Microprocessor," Hot Chips, vol. 7, pp. 187–196, 1995.
[3] P. Yiannacouras, J. G. Steffan, and J. Rose, "VESPA: Portable, Scalable, and Flexible FPGA-Based Vector Processors," in International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Atlanta, GA, October 2008.
[4] J. Yu, G. Lemieux, and C. Eagleston, "Vector Processing as a Soft-core CPU Accelerator," in ACM International Symposium on FPGA, 2008.
Miriam Leeser
mel@coe.neu.edu
http://www.coe.neu.edu/Research/rcl/index.php

More details can be found in:
• Jainik Kathiara's MS thesis (under the publications link)
• "An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs" by Jainik Kathiara and Miriam Leeser