
Optimal Algorithm Selection of Parallel Sparse Matrix-Vector

Multiplication Is Important

Makoto Kudoh*1, Hisayasu Kuroda*1,

Takahiro Katagiri*2, Yasumasa Kanada*1

*1 The University of Tokyo

*2 PRESTO, Japan Science and Technology Corporation

Introduction

Sparse matrix-vector multiplication (SpMxV)

y = Ax,  A ∈ R^(n×n), x ∈ R^n  (A is a sparse matrix, x is a dense vector)

Basic computational kernel used in scientific computations

e.g. iterative solvers for linear systems, eigenvalue problems

Large-scale SpMxV problems

Parallel Sparse Matrix-Vector Multiplication

Calculation of Parallel Sparse Matrix-Vector Multiplication

Two-phase computation: data communication and local computation

[Figure: the matrix A and result vector y are distributed to PE0–PE3 in row blocks, each local block is stored in compressed sparse row format (rowptr, colind, value arrays), and the multiplication proceeds in two phases: vector data communication followed by local computation.]
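For concreteness, a minimal sketch (not the authors' routine) of the local computation phase on one PE's row block in CSR format; rowptr, colind, and value follow the figure, while nrows_local and y_local are assumed names:

    /* Local phase of y = A*x on one PE.  The PE's row block is stored in
       CSR (rowptr, colind, value); x holds the full source vector after
       the communication phase; y_local receives this PE's rows of y. */
    void spmv_csr_local(int nrows_local,
                        const int *rowptr, const int *colind,
                        const double *value,
                        const double *x, double *y_local)
    {
        int i, j;
        for (i = 0; i < nrows_local; i++) {
            double sum = 0.0;
            for (j = rowptr[i]; j < rowptr[i + 1]; j++)
                sum += value[j] * x[colind[j]];   /* indirect access to x */
            y_local[i] = sum;
        }
    }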

Optimization of Parallel SpMxV

Many optimization algorithms for SpMxV have been proposed

BUT the effect depends highly on the non-zero structure of the matrix and the machine's architecture

Optimal algorithm selection is important

Poor performance compared with the dense matrix case:
• increased memory references to matrix data, caused by indirect access
• irregular memory access pattern to the vector x

Related Works

Library approach: PSPARSLIB, PETSc, ILIB, etc.
• fixed optimization algorithm
• works on parallel systems

Compiler approach: SPARSITY, sparse compiler, etc.
• generates optimized code for the matrix and machine
• does not work on parallel systems

The purpose of our work

Our program includes several algorithms for local computation and data communication, measures the performance of each algorithm exhaustively, and selects the best one for the given matrix and machine

Algorithm selection time is not a concern

Compare: the performance of the best algorithm for each matrix and machine vs. the performance of one fixed algorithm used for all matrices and machines

Optimization algorithms of our program

Algorithms implemented in our routine

Local computation
• Register Blocking
• Diagonal Blocking
• Unrolling

Data communication
• Allgather Communication
• Range-Limited Communication
• Minimum Data Size Communication

Register Blocking (Local Computation 1/3)

Extract small dense blocks and make a blocked matrix

Original matrix = blocked matrix + remaining matrix

• Reduce the number of load instructions
• Increase temporal locality to the source vector

Abbreviate size m×n register blocking as Rmxn: R1x2, R1x3, R1x4, R2x1, R2x2, R2x3, R2x4, R3x1, R3x2, R3x3, R3x4, R4x1, R4x2, R4x3, R4x4
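As an illustration, a minimal sketch of what an R2x2 kernel could look like, assuming the extracted blocks are kept in a block-CSR structure with assumed names brow_ptr, bcol_ind, and bval (each 2x2 block stored contiguously, row-major); the remaining matrix would be handled by the plain CSR kernel shown earlier:

    /* R2x2: y += A_blocked * x, with A_blocked stored as block CSR of
       2x2 dense blocks.  nbrows is the number of block rows on this PE. */
    void spmv_r2x2(int nbrows,
                   const int *brow_ptr, const int *bcol_ind,
                   const double *bval,
                   const double *x, double *y)
    {
        int bi, j;
        for (bi = 0; bi < nbrows; bi++) {
            double y0 = 0.0, y1 = 0.0;           /* result rows kept in registers */
            for (j = brow_ptr[bi]; j < brow_ptr[bi + 1]; j++) {
                const double *b = &bval[4 * j];  /* one 2x2 block */
                double x0 = x[2 * bcol_ind[j]];  /* source loaded once per block */
                double x1 = x[2 * bcol_ind[j] + 1];
                y0 += b[0] * x0 + b[1] * x1;
                y1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * bi]     += y0;
            y[2 * bi + 1] += y1;
        }
    }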

Diagonal Blocking (Local Computation 2/3)

For matrices with a dense non-zero structure around the diagonal

Block the diagonal part and treat it as a dense band matrix

Original matrix = blocked matrix + remaining matrix

• Reduce the number of load instructions
• Optimize register and cache access

Abbreviate size n diagonal blocking as Dn: D3, D5, D7, D9, D11, D13, D15, D17, D19
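A minimal sketch of a Dn-style kernel, under the assumption that the band of width n around the diagonal is stored densely, row by row, in an array band (band[i*n + k] holding the entry of A at row i, column i - n/2 + k); entries outside the band stay in the remaining matrix:

    /* Dn: multiply the dense diagonal band of width n (n odd) by x and
       accumulate into y.  Out-of-range columns at the top and bottom
       boundaries are skipped with an index check. */
    void spmv_diag_band(int nrows, int n, const double *band,
                        const double *x, double *y)
    {
        int i, k, half = n / 2;
        for (i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (k = 0; k < n; k++) {
                int col = i - half + k;
                if (col >= 0 && col < nrows)
                    sum += band[i * n + k] * x[col];
            }
            y[i] += sum;
        }
    }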

Unrolling (Local Computation 3/3)

Just unroll the inner loop

Abbreviate unrolling level n to Un

• Reduce the loop overhead
• Exploit instruction-level parallelism

U1, U2, U3, U4, U5, U6, U7, U8
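A minimal sketch of a U4 kernel: the CSR inner loop is unrolled by four with independent partial sums, plus a clean-up loop for the leftover non-zeros of each row:

    /* U4: CSR inner loop unrolled by a factor of four. */
    void spmv_csr_u4(int nrows, const int *rowptr, const int *colind,
                     const double *value, const double *x, double *y)
    {
        int i, j, end;
        for (i = 0; i < nrows; i++) {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            end = rowptr[i + 1];
            for (j = rowptr[i]; j + 3 < end; j += 4) {   /* unrolled body */
                s0 += value[j]     * x[colind[j]];
                s1 += value[j + 1] * x[colind[j + 1]];
                s2 += value[j + 2] * x[colind[j + 2]];
                s3 += value[j + 3] * x[colind[j + 3]];
            }
            for (; j < end; j++)                         /* clean-up loop */
                s0 += value[j] * x[colind[j]];
            y[i] = (s0 + s1) + (s2 + s3);
        }
    }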

Allgather Communication (Data Communication 1/3)

Each processor sends its local vector data to all other processors

Easy to implement (with MPI_Allgather)


The communication data size is very large
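A minimal sketch of the allgather scheme; MPI_Allgatherv is shown instead of MPI_Allgather so that unequal row blocks are allowed, and counts/displs are assumed to describe the block lengths and offsets of every PE:

    #include <mpi.h>

    /* Allgather communication: every PE contributes its local block of x
       (x_local, n_local entries) and receives the complete vector x_full. */
    void comm_allgather(double *x_local, int n_local,
                        double *x_full, int *counts, int *displs)
    {
        MPI_Allgatherv(x_local, n_local, MPI_DOUBLE,
                       x_full, counts, displs, MPI_DOUBLE,
                       MPI_COMM_WORLD);
    }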

Range-Limited Communication (Data Communication 2/3)

Send only the minimum contiguous required block; do not communicate between processor pairs that need no data from each other

Small CPU time overhead, since no data rearrangement is necessary

The communication data size is not minimal for most matrices

[Figure: PE0 sends one contiguous block of its vector directly to PE1.]
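A minimal sketch of a SendRecv-style range-limited exchange, assuming a setup phase has recorded, for every other PE p, the contiguous slice of x to send (send_lo[p], send_cnt[p]) and to receive (recv_lo[p], recv_cnt[p]); lo values are assumed to stay valid (e.g. 0) when the count is 0. Because each slice is contiguous, it is sent directly out of x with no packing:

    #include <mpi.h>

    /* Range-limited communication (SendRecv variant): exchange one
       contiguous slice of x with each PE that actually needs data. */
    void comm_range_limited(double *x, int nprocs, int myrank,
                            int *send_lo, int *send_cnt,
                            int *recv_lo, int *recv_cnt)
    {
        MPI_Status st;
        int p;
        for (p = 0; p < nprocs; p++) {
            if (p == myrank || (send_cnt[p] == 0 && recv_cnt[p] == 0))
                continue;                       /* skip unnecessary pairs */
            MPI_Sendrecv(&x[send_lo[p]], send_cnt[p], MPI_DOUBLE, p, 0,
                         &x[recv_lo[p]], recv_cnt[p], MPI_DOUBLE, p, 0,
                         MPI_COMM_WORLD, &st);
        }
    }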

Minimum Data Size Communication (Data Communication 3/3)

Communicate only the required elements; 'pack' and 'unpack' operations are needed before and after the communication

The communication data size is minimal, but the 'pack' and 'unpack' operations add a small CPU time overhead

[Figure: PE0 packs the required vector elements into a send buffer, sends the buffer to PE1, and PE1 unpacks it into its copy of the vector.]
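A minimal sketch of the minimum data size scheme in the Irecv-Isend ordering; send_idx, send_cnt, recv_idx, recv_cnt, send_buf, and recv_buf are assumed setup data (index lists and preallocated buffers), not names from the paper:

    #include <mpi.h>
    #include <stdlib.h>

    /* Minimum data size communication: pack only the elements of x that a
       remote PE needs, send, receive, and unpack into the right slots. */
    void comm_min_size(double *x, int nprocs, int myrank,
                       int **send_idx, int *send_cnt, double **send_buf,
                       int **recv_idx, int *recv_cnt, double **recv_buf)
    {
        MPI_Request *req = malloc(2 * nprocs * sizeof(MPI_Request));
        MPI_Status  *st  = malloc(2 * nprocs * sizeof(MPI_Status));
        int nreq = 0, p, k;

        for (p = 0; p < nprocs; p++)            /* post all receives first */
            if (p != myrank && recv_cnt[p] > 0)
                MPI_Irecv(recv_buf[p], recv_cnt[p], MPI_DOUBLE, p, 0,
                          MPI_COMM_WORLD, &req[nreq++]);

        for (p = 0; p < nprocs; p++)
            if (p != myrank && send_cnt[p] > 0) {
                for (k = 0; k < send_cnt[p]; k++)
                    send_buf[p][k] = x[send_idx[p][k]];   /* pack */
                MPI_Isend(send_buf[p], send_cnt[p], MPI_DOUBLE, p, 0,
                          MPI_COMM_WORLD, &req[nreq++]);
            }

        MPI_Waitall(nreq, req, st);

        for (p = 0; p < nprocs; p++)            /* unpack into x */
            for (k = 0; k < recv_cnt[p]; k++)
                x[recv_idx[p][k]] = recv_buf[p][k];

        free(req);
        free(st);
    }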

Implementation of Communication

Use the MPI library

3 implementations for 1-to-1 communication: Send-Recv, Isend-Irecv, Irecv-Isend

This gives 3 implementations each for range-limited and minimum data size communication, plus Allgather:
• Allgather
• SendRecv-range, IsendIrecv-range, IrecvIsend-range
• SendRecv-min, IsendIrecv-min, IrecvIsend-min
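For illustration, a sketch of the three 1-to-1 variants for a single exchange with one neighbouring PE; the same data moves in all three, only the call ordering differs:

    #include <mpi.h>

    void exchange_sendrecv(double *sbuf, int scount,
                           double *rbuf, int rcount, int peer)
    {
        MPI_Status st;
        MPI_Sendrecv(sbuf, scount, MPI_DOUBLE, peer, 0,
                     rbuf, rcount, MPI_DOUBLE, peer, 0,
                     MPI_COMM_WORLD, &st);
    }

    void exchange_isend_irecv(double *sbuf, int scount,
                              double *rbuf, int rcount, int peer)
    {
        MPI_Request req[2];
        MPI_Status  st[2];
        MPI_Isend(sbuf, scount, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(rbuf, rcount, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, st);
    }

    void exchange_irecv_isend(double *sbuf, int scount,
                              double *rbuf, int rcount, int peer)
    {
        MPI_Request req[2];
        MPI_Status  st[2];
        MPI_Irecv(rbuf, rcount, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sbuf, scount, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, st);
    }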

Methodology of Selecting Optimal Algorithm

The times of local computation and data communication are measured separately; however, the combination of the individually fastest algorithms is not necessarily the fastest in total

1. Measure the time of each data communication algorithm and select the best one

2. Combine each local computation algorithm with the best data communication algorithm, measure the time, and select the best combination

Selection is done at runtime, since the characteristics of the matrix cannot be detected until runtime
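A minimal sketch of how the two-step runtime selection could be organized; comm_algs[] and local_algs[] are hypothetical function-pointer tables standing in for the implemented algorithms, and the per-PE times are reduced with MPI_MAX so that every PE picks the same winner:

    #include <mpi.h>

    #define N_COMM  7    /* Allgather + 3 range-limited + 3 minimum-size variants */
    #define N_LOCAL 32   /* 15 register blocking + 9 diagonal blocking + 8 unrolling */
    #define REPS    10

    typedef void (*comm_fn)(void);    /* hypothetical: one communication phase */
    typedef void (*local_fn)(void);   /* hypothetical: one local SpMxV phase    */
    extern comm_fn  comm_algs[N_COMM];
    extern local_fn local_algs[N_LOCAL];

    static double time_candidate(comm_fn c, local_fn l)
    {
        double t0, t, tmax;
        int r;
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (r = 0; r < REPS; r++) { c(); if (l) l(); }
        t = (MPI_Wtime() - t0) / REPS;
        MPI_Allreduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        return tmax;
    }

    void select_algorithms(int *best_comm, int *best_local)
    {
        double best;
        int c, l;
        /* step 1: best data communication algorithm on its own */
        best = 1e30;
        for (c = 0; c < N_COMM; c++) {
            double t = time_candidate(comm_algs[c], 0);
            if (t < best) { best = t; *best_comm = c; }
        }
        /* step 2: best local computation combined with that communication */
        best = 1e30;
        for (l = 0; l < N_LOCAL; l++) {
            double t = time_candidate(comm_algs[*best_comm], local_algs[l]);
            if (t < best) { best = t; *best_local = l; }
        }
    }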

Numerical Experiment

• Default fixed algorithms
• Experimental environment and test matrices
• Results

Default Fixed Algorithms

No.  Local computation  Data communication
1    U1                 Allgather
2    R2x2               Allgather
3    U1                 IrecvIsend-min
4    R2x2               IrecvIsend-min

Local computation: U1 and R2x2
Data communication: Allgather and IrecvIsend-min

Experimental Environment

Name                     Processor              # of PEs  Network             Compiler                      Option
PC-Cluster               Pentium III 800 MHz    8         100Base-T Ethernet  GCC 2.95.2                    -O3
SUN Enterprise 3500      UltraSPARC II 336 MHz  8         SMP                 WorkShop Compilers 5.0        -xO5
COMPAQ AlphaServer GS80  Alpha 21264 731 MHz    8         SMP                 Compaq C 6.3-027              -fast
SGI 2100                 MIPS R12000 350 MHz    8         DSM                 MIPSpro C 7.30                -64 -O3
HITACHI HA8000-ex880     Intel Itanium 800 MHz  8         SMP                 Intel Itanium Compiler 5.0.1  -O3

Language: C
Communication library: MPI (MPICH 1.2.1)

Test Matrices

From Tim Davis' matrix collection

No.  Name      Explanation                                Dimension  Non-zeros
1    3dtube    3-D pressure tube                          45,330     3,213,618
2    cfd1      Symmetric pressure matrix                  70,656     1,828,364
3    crystk03  FEM crystal vibration                      24,696     1,751,178
4    venkat01  Unstructured 2D Euler solver               62,424     1,717,792
5    bcsstk35  Automobile seat frame and body attachment  30,237     1,450,163
6    cfd2      Symmetric pressure matrix                  123,440    3,087,898
7    ct20stif  Stiffness matrix                           52,329     2,698,463
8    nasasrb   Shuttle rocket booster                     54,870     2,677,324
9    raefsky3  Fluid structure interaction turbulence     21,200     1,488,768
10   pwtk      Pressurized wind tunnel                    217,918    11,634,424
11   gearbox   Aircraft flap actuator                     153,746    9,080,404

[Figure: non-zero patterns of cfd1, ct20stif, and gearbox.]

Result of Matrix No.2

[Bar charts: communication time and local computation time in msec for def1–def4 and opt on PentiumIII-Ethernet, Alpha-SMP, MIPS-DSM, and Itanium-SMP; each panel is annotated with the selected local algorithms (register blocking, diagonal blocking, and unrolling variants) and the selected communication algorithm (IrecvIsend-min on PentiumIII-Ethernet; range-limited variants IrecvIsend-range and IsendIrecv-range on the other machines).]

Result of Matrix No.7

[Bar charts: communication time and local computation time in msec for def1–def4 and opt on PentiumIII-Ethernet, Alpha-SMP, MIPS-DSM, and Itanium-SMP; the selected local algorithms are mostly register blocking variants (e.g., R3x3) with some unrolling and diagonal blocking, and the selected communication algorithms are minimum data size variants (IsendIrecv-min, SendRecv-min, IrecvIsend-min).]

Result of Matrix No.11

[Bar charts: communication time and local computation time in msec for def1–def4 and opt on PentiumIII-Ethernet, Alpha-SMP, MIPS-DSM, and Itanium-SMP; the selected local algorithm is R3x3 in most cases, and the selected communication algorithms are minimum data size variants (IsendIrecv-min, SendRecv-min).]

Summary of Experiment

                     def1  def2  def3  def4
PC-cluster           8.16  7.90  1.32  1.05
Sun Enterprise 3500  2.82  3.07  1.35  1.58
COMPAQ               3.56  3.10  1.59  1.44
SGI                  3.73  3.33  1.61  1.36
Hitachi              2.51  1.81  2.03  1.39

Summary of speed-up

• The best algorithm depends highly on the characteristics of the matrix and the machine
• A speed-up of at least 1.05 was obtained compared with the fixed default algorithms

Conclusion and Future Work

Compared the performance of the best algorithm with that of typical fixed algorithms

Obtained meaningful speed-ups by selecting the best algorithm

Selecting the optimal algorithm according to the characteristics of the matrix and machine is important

Future work: create a light-overhead method of selecting the algorithm; currently, selection takes time equal to hundreds of SpMxV executions