FPGA based Acceleration of Linear Algebra Computations

B.Y. Vinay Kumar, Siddharth Joshi, Sumedh Attarde, Prof. Sachin Patkar, Prof. H. Narayanan

IEEE Globecom-2006, NXG-02: Broadband Access. ©Copyright 2005-2006 All Rights Reserved

Outline

Double Precision Dense Matrix-Matrix Multiplication: Motivation, Related Work, Algorithm, Design, Results, Conclusions

Double Precision Sparse Matrix-Vector Multiplication: Introduction, Prasanna, DeLorimier, David Gregg et al., What can we do?

FPGA based Double Precision Dense Matrix-Matrix Multiplication.

Motivation

FPGAs have been making inroads into high-performance computing (HiPC).

Accelerating BLAS-3 is achieved by accelerating matrix multiplication.

Modern FPGAs provide an abundance of resources; we must capitalise upon these.

Related Work (1/2)

The two main works are by Dou and by Prasanna. Both are based on linear arrays, both use memory switching, and both sustain their peak performance.

Dou:

Optimised for a large Virtex-II Pro device (Xilinx).

Created his own MAC (not fully compliant).

Sub-block dimensions must be powers of 2.

Optimised for low I/O bandwidth.

Related Work (2/2)

Prasanna:

Scaling results in a speed degradation of about 35% (from 2 PEs to 20 PEs).

2.1 GFLOPS on a Cray XD1 with Virtex-II Pro devices (XC2VP50).

For the design only (XC2VP125), they report a 15% clock degradation going from 2 to 24 PEs.

They state that they have not made any platform-specific optimisations for the implemented design.

Algorithm

1. Broadcast ‘A’; keep a unique ‘B’ per PE.

2. Multiply, feeding the operands into the multiplier pipeline.

3. The output is fed directly to the Adder+RAM (accumulator).

4. When the updated C values are ready, take them out.
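The four steps above can be sketched in software as a sum of rank-one updates. This is only an illustrative model, not the hardware design: the function name `matmul_rank_one`, the choice that PE `p` owns column `p` of B and of C, and the list-of-lists matrix layout are all assumptions made for the example.

```python
def matmul_rank_one(A, B, num_pe):
    """Compute C = A * B as a sum of rank-one updates, one per index k.

    Assumed mapping (not from the slides): PE p holds column p of B and
    accumulates column p of C, so num_pe equals the column count of B.
    """
    n, m = len(A), len(A[0])              # A is n x m, B is m x num_pe
    assert len(B) == m and len(B[0]) == num_pe
    # One accumulator RAM per PE, holding that PE's column of C.
    acc = [[0.0] * n for _ in range(num_pe)]
    for k in range(m):                    # one rank-one update per k
        for i in range(n):
            a = A[i][k]                   # step 1: broadcast A[i][k] to every PE
            for p in range(num_pe):       # the PEs work in parallel in hardware
                prod = a * B[k][p]        # step 2: multiply with the PE's own B value
                acc[p][i] += prod         # step 3: adder + RAM accumulation
    # Step 4: read the finished C values out of the accumulators.
    return [[acc[p][i] for p in range(num_pe)] for i in range(n)]


if __name__ == "__main__":
    print(matmul_rank_one([[1, 2], [3, 4]], [[5, 6], [7, 8]], num_pe=2))
```

In this model every PE sees the same stream of A values, so the input bandwidth for A does not grow with the number of PEs; only the locally stored B values differ.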

Design-I

Design-II

FPGA Synthesis/PAR data (1/2)

Table: Resource Utilisation for SX95T and SX240T (post PAR)

PE          DSP48Es  FIFO  BRAM  Slice Reg  Slice LUT
1           16       1     2     2511       1374
4           64       4     8     10377      5451
8           128      8     16    20865      10886
16          256      16    32    41841      21750
20 (SX240)  320      20    40    52329      27176
40 (SX240)  640      40    80    103335     53914

Table: Clock Speed in MHz for the overall design for different numbers of PEs.

Device/PE  1    4    8    16   19   20   40
SX95T-3    377  374  373  373  372  201  -
SX240T-2   374  373  344  -    -    372  371.7

FPGA Synthesis/PAR data (2/2)

Table: Resource Utilisation for Virtex-II Pro XC2VP100 (post PAR)

           15 PE         20 PE
MULT18x18  240 (54%)     304 (68%)
RAMB16s    90 (20%)      114 (26%)
Slices     30218 (68%)   37023 (83%)
Speed      133.94 MHz    133.79 MHz

Conclusions

We propose a variation of the rank-one update algorithm for matrix multiplication.

We introduce a scalable processing element for this algorithm, targeted at a Virtex-5 SX240T FPGA.

The two designs clearly show the effect of local storage on I/O bandwidth.

With 40 PEs, the design achieved a clock speed of 373 MHz and a sustained performance of 29.8 GFLOPS on a single FPGA. We also achieve 5.3 GFLOPS on an XC2VP100.
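The quoted GFLOPS figures are consistent with a simple estimate, assuming each PE completes one multiply and one add per clock cycle (2 FLOPs per cycle). That per-PE rate is an assumption for this check, not something the slides state, and the helper name `peak_gflops` is made up for the example.

```python
def peak_gflops(num_pe, clock_mhz, flops_per_pe_per_cycle=2):
    """Estimate throughput in GFLOPS.

    Assumes (not stated in the slides) that every PE retires one multiply
    and one add per cycle, i.e. 2 floating-point operations per cycle.
    """
    return num_pe * flops_per_pe_per_cycle * clock_mhz / 1000.0


if __name__ == "__main__":
    # Virtex-5 SX240T: 40 PEs at the 373 MHz quoted in the conclusions.
    print(round(peak_gflops(40, 373), 2))
    # Virtex-II Pro XC2VP100: 20 PEs at 133.79 MHz from the PAR table.
    print(round(peak_gflops(20, 133.79), 2))
```

Under that assumption the two configurations come out at about 29.84 and 5.35 GFLOPS, which matches the 29.8 and 5.3 GFLOPS reported above.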

FPGA based Double Precision Sparse Matrix-Vector Multiplication.

Introduction

There are three main papers we will be looking at:

Viktor Prasanna: hybrid method using HLL + S/W + HDL.

Michael DeLorimier: maximum performance, but unrealistic.

David Gregg et al.: most realistic assumptions with respect to DRAM.

Prasanna

Uses pre-existing IP cores, specifically for an iterative solver (CG).

A 4-input reduction circuit computes the dot product and produces partial sums as output.

An adder loop with an array sums the dot-product partial sums; it was created using an HLL.

The reduction circuit at the end uses a B-Tree to create the final value.

IPs are available.

DRAM is considered, but not realistically.

The order of the matrices is small.

DRAM is the bottleneck.

With their IPs they have a good architecture; however, one could change the IP and modify the datapath, e.g. with Dou's MAC.
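One possible reading of the reduction pipeline described on this slide is sketched below. It is a software model under assumptions, not Prasanna's circuit: the names `tree_reduce` and `dot_product`, the four accumulator slots, and the pairwise (binary) tree used for the final step are all choices made for illustration.

```python
def tree_reduce(values):
    """Sum values pairwise, level by level, as an adder tree would."""
    level = list(values)
    if not level:
        return 0.0
    while len(level) > 1:
        if len(level) % 2:                # pad odd levels with a zero operand
            level.append(0.0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def dot_product(x, y, lanes=4, slots=4):
    """Dot product in three stages (an assumed reading of the slide).

    1. A `lanes`-input reduction turns each group of products into one partial sum.
    2. An adder loop accumulates partial sums into an array of `slots` entries.
    3. A final tree reduction collapses the array into a single value.
    """
    partial = [0.0] * slots
    for n, start in enumerate(range(0, len(x), lanes)):
        products = [a * b for a, b in zip(x[start:start + lanes],
                                          y[start:start + lanes])]
        partial[n % slots] += tree_reduce(products)   # adder loop with array
    return tree_reduce(partial)                       # final reduction


if __name__ == "__main__":
    print(dot_product([1, 2, 3, 4, 5, 6, 7, 8, 9], [1] * 9))
```

Spreading the partial sums over several slots mirrors how a pipelined adder can keep multiple independent accumulations in flight; the slot count here is arbitrary.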

DeLorimier

Uses BRAMs for everything.

Used for an iterative solver, specifically CG.

The MAC requires interleaving.

Load balancing is done in their partitioner, which requires a communication stage and is very matrix/partitioner dependent.

Communication is the bottleneck.

Performance: 750 MFLOPS per processor.

16 Virtex-II 6000s.

Each has 5 PEs + 1 CE.

David Gregg et al. (SPAR)

They only report the use of the SPAR architecture for FPGAs.

They use very pessimistic DRAM access times, with an emphasis on cache-miss removal.

They do not use their Block RAMs well; maybe something interesting can be done here.

128 MFLOPS for 3 parallel SPAR units; with cache misses removed, the peak is 570 MFLOPS.

What can we do?

Both use CSR. This is not required, so why not modify the representation?

Two approaches; we can try both simultaneously:

Prasanna: split across dot products (same row, many PEs).

DeLorimier: split across rows (many rows, one PE).

Use data from SPAR as a viable approach: both do zero multiplies, and we can get away with one zero multiply per column.

Minimise communication or overlap it. We can use interleaving for this: while one stage computes, the previous one communicates.
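The two partitioning approaches named on this slide can be contrasted on a CSR matrix with a short sketch. The function names, the round-robin assignment of work to PEs, and the sequential loops standing in for parallel hardware are all assumptions for illustration; neither function is taken from the cited papers.

```python
def to_csr(dense):
    """Convert a dense list-of-lists matrix to CSR (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))       # marks where the next row starts
    return values, col_idx, row_ptr


def spmv_split_rows(csr, x, num_pe):
    """Split across rows: each PE handles whole rows (assumed round-robin)."""
    values, col_idx, row_ptr = csr
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for pe in range(num_pe):              # PEs would run concurrently
        for i in range(pe, n_rows, num_pe):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                y[i] += values[k] * x[col_idx[k]]
    return y


def spmv_split_dot(csr, x, num_pe):
    """Split across dot products: all PEs share each row's nonzeros."""
    values, col_idx, row_ptr = csr
    y = []
    for i in range(len(row_ptr) - 1):
        ks = range(row_ptr[i], row_ptr[i + 1])
        # Each PE takes every num_pe-th nonzero of the row and forms a partial sum.
        partial = [sum(values[k] * x[col_idx[k]] for k in ks[pe::num_pe])
                   for pe in range(num_pe)]
        y.append(sum(partial))            # the partial sums still need a reduction
    return y


if __name__ == "__main__":
    csr = to_csr([[1, 0, 2], [0, 0, 3], [4, 5, 0]])
    print(spmv_split_rows(csr, [1, 2, 3], num_pe=2))
    print(spmv_split_dot(csr, [1, 2, 3], num_pe=2))
```

The row split needs no reduction across PEs but can be unbalanced when row lengths vary, while the dot-product split balances work within a row at the cost of a reduction step per row.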

Questions?

THANK YOU
