FPGA based Acceleration of Linear Algebra Computations

IEEE Globecom-2006, NXG-02: Broadband Access©Copyright 2005-2006All Rights Reserved

1

FPGA based Acceleration of Linear Algebra Computations.

B.Y. Vinay KumarSiddharth JoshiSumedh Attarde

Prof. Sachin PatkarProf. H. Narayanan


2

Outline

Double Precision Dense Matrix-Matrix Multiplication. Motivation Related Work Algorithm Design Results Conclusions

Double Precision Sparse Matrix-Vector Multiplication. Introduction Prasanna DeLorimier David Gregg et. al. What can we do ?


3

FPGA based Double Precision Dense Matrix-Matrix Multiplication.


4

Motivation

FPGAs have been making inroads for HiPC. Accelerating BLAS-3 achieved by accelerating matrix

multiplications. Modern FPGAs provide an abundance of resources – We

must capitalise upon these.


5

Related Work{1/2}

The two main works ~ Dou and Prasanna. Both based on linear arrays, both use memory switching – both sustain their peak.

Dou : Optimised for a large VirtexII pro device (Xillinx).Created his own MAC (Not fully compliant).Sub-block dimensions must be powers of 2.Optimised for Low IO bandwidth.


6

Related Work{2/2}

Prasanna:

Scaling results in speed degradation of about 35% (2 PEs to 20 PEs).

2.1 GFLOPs on a CRAY XD1 with VirtexII Pros (XC2VP50).

For design only (XC2VP125) they report 15% clock degradation on 2 to 24 PEs.

» They state they have not made any platform specific optimisations, for the implemented design.


7

Algorithm

1. Broadcast ‘A’, keep a unique ‘B’ per PE2. Multiply, and put in pipeline of multiplier.3. Output is fed to directly to Adder+Ram

(accumulator)4. When the updated C is ready, take them out.


8

Design-1


9

Design-II


10

FPGA Synthesis/PAR data{1/2}

PE DSP48Es FIFO B RAM Slice Reg Slice LUT

1 16 1 2 2511 1374

4 64 4 8 10377 5451

8 128 8 16 20865 10886

16 256 16 32 41841 21750

20(SX240) 320 20 40 52329 27176

40 (SX240)

640 40 80 103335 53914

Table: Clock Speed in MHz for the overall design for different number of PE.

Device/PE 1 4 8 16 19 20 40

SX95T-3 377 374 373 373 372 201 -

SX240T-2 374 373 344 - - 372 371.7

Table: Resource Utilisation for SX95T and SX240T (post PAR)


11

FPGA Synthesis/PAR data{2/2}

Table: Resource Utilisation for Virtex II ProXC2VP100 (post PAR)

15 PE 20 PE

MULT18x18 240(54%) 304(68%)

RAMB16s 90 (20%) 114(26%)

Slices 30218 (68%) 37023(83%)

Speed 133.94 MHz 133.79 MHz


12

Conclusions

We propose a variation of the rank one update algorithm for matrix multiplication.

We introduce a scalable processing element for this algorithm, targeted a Virtex-5 SX240T FPGA

The two designs clearly show the difference of local storage on IO bandwidth.

The designs achieved a design speed of 373 MHz, 40 PEs and a sustained performance of 29.8 GFLOPS for a single FPGA. We also provide 5.3 GFLOPS on a XC2VP100.


13

FPGA based Double Precision Sparse Matrix-Vector Multiplication.


14

Introduction

There are three main papers we will be looking at Viktor Prasanna: Hybrid method use HLL+S/W+HDL Michael DeLorimier: Maximum performance but unrealisticDavid Gregg et. al.: Most realistic assumptions wrt DRAM


15

Prasanna

Use of prexisting IP cores – specifically for iterative solver (CG)

4 input reduction ckt does dot product results in partial sums as op.

Adder loop with Array does summation of dotproduct – created using

HLL

Reduction ckt at the end uses B-Tree to create the final value

IP s are available

DRAM looked at – but not realistically

Order of Matrices is small

DRAM is bottleneck

With their IP's they have a good architecture -however change the IP

and modify datapath – eg. Dou MAC


16

DeLorimier

Use BRAMs for everything.

Use for iterative Solver – specifically CG

MAC requires interleaving

They do load balancing in their partitioner which requires – a

communication stage, very matrix/partitioner dependent.

Communication is the bottleneck

Performance:750 MFLOPS / processor

16 Virtex II 6000s

Each has 5 PE + 1 CE


17

David Gregg et. al. (SPAR)

They only report the use of the SPAR architecture for FPGAs

They use very pessimistic DRAM access times. Emphasis on

cache-miss removal

Not using their Block RAMs well – maybe something

interesting can be done here

128 MFLOPS for 3 parallel SPAR units but remove cache miss

and we get a peak of 570 MFLOPS


18

What can we do ?

Both use CSR – Not required why not modify representation

Two approaches : We can try both simultaneously

Prasanna – split across dot products (same row many PE)

Delorimier – split accross rows (many rows – one PE)

Use data from SPAR – viable approach – both do zero

multiplies – we get away with one zero multiply/coloumn

Minimise communication or overlap it. - we can do interleaving

for this – while one stage computes the previous one

communicates.


19

Questions ?


20

THANK YOU

Thank You

Documents

FPGA based Acceleration of Linear Algebra Computations