
Optimizing Sparse Matrix Vector Multiplication Using Diagonal Storage Matrix Format

Liang Yuan 1,2,3, Yunquan Zhang 1,2, Xiangzheng Sun 1,2,3, Ting Wang 1

[email protected], {yuanliang08, sunxiangzheng08, wangting}@iscas.ac.cn
1) Laboratory of Parallel Software and Computational Science, ISCAS, P.R. China
2) State Key Laboratory of Computer Science, CAS, P.R. China
3) Graduate University of Chinese Academy of Sciences, P.R. China

Abstract—Sparse matrix vector multiplication (SpMV) is used in many scientific computations. The main bottleneck of this algorithm is memory bandwidth, and many methods reduce memory bandwidth usage by compressing the index array. The matrices from finite difference modeling applications often have several dense diagonals and sparse diagonals. For these matrices, the index array can be eliminated by using the diagonal storage format (DIA) for the dense diagonals together with a mixed DIA & CSR algorithm. In this paper we propose two improved sparse matrix storage formats based on the DIA format and the corresponding SpMV algorithms. We present performance results on two platforms, which show that our method can reduce the memory usage for a wide range of sparse matrices and achieve speedups of up to 1.87.

Keywords- Diagonal Storage; Compressed Sparse Diagonal; Sparse Matrix; Sparse Computations; SMVM; SpMV; Memory Bandwidth; Matrix Splitting; CSR

I. INTRODUCTION

Sparse Matrix-Vector Multiplication (SpMV) is one of the most important computational kernels in many Partial Differential Equation (PDE) solvers and other scientific computations. To reduce memory bandwidth usage, most algorithms store only the nonzero elements of a sparse matrix along with index arrays that indicate the coordinates of each element.

The most commonly used data structure for sparse matrices is the CSR (compressed sparse row) format, which uses three arrays: val contains all the nonzero elements, ordered by row first and by column within each row; col contains the column index of each nonzero element; row_ptr contains the start index in the val array of each row. In this paper we consider only square matrices, i.e., the number of rows (m) equals the number of columns (n), but the data structures and algorithms designed in this paper are easily extended to the general case.
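For reference, here is a minimal C sketch of the CSR kernel just described (the arrays val, col and row_ptr follow the text; the function name and signature are our own, not from the paper):

/* Minimal sketch of the CSR SpMV kernel y = A*x. */
void spmv_csr(int n, const double *val, const int *col,
              const int *row_ptr, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = 0.0;
        /* The indirect access x[col[k]] and the loads of col itself
           are the memory bandwidth bottleneck discussed below. */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            yi += val[k] * x[col[k]];
        y[i] = yi;
    }
}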

As previous work identified, the main bottleneck of the CSR algorithm is memory bandwidth, due to the low computation/memory-access ratio induced by the indirect access to vector x and the direct access to the index array. Many studies have been conducted to optimize the performance of the CSR algorithm. The two PhD theses [2][8] are good surveys of SpMV algorithms. Im and Yelick propose register blocking (the BCSR storage format), cache blocking and reordering [2][3][4]. To mitigate the memory bandwidth pressure, Willcock and Lumsdaine propose DCSR and RPCSR to reduce the size of the index structure [9]. Kourtis et al. [5] store contiguous nonzeros with delta-encoded indices of the same storage-size category to reduce the index data. Agarwal et al. [1] design a similar method that exploits the diagonal pattern and shares the same disadvantage. For matrices with few diagonals, the DIA storage format [2][8] stores all the elements of each diagonal that has nonzero element(s) in a contiguous space of size n. Thus the index array col of the CSR format can be deleted. However, this format uses a fixed-size space of n for every diagonal, and the algorithm does not reuse the vectors x and y when m and n are larger than the cache size.

In this paper we improve the DIA format for matrices which have few dense diagonals. In our storage format, the index array of the CSR format can be elided, as in the DIA storage format [2][8], by storing all the elements of the diagonals including the zero elements. We first use spaces of various sizes to store each diagonal, reducing the total size of the val array. This slightly improved format and algorithm is called DDD-NAIVE (naive direct dense diagonal). Then we use the diagonals to split the matrix and vector y into row blocks, which lets the algorithm reuse elements of vector y in registers as the CSR algorithm does. Finally, contiguous equal values in each diagonal are compressed, and each row block may be further split into same-value sub row blocks. The final format and algorithm is called DDD-SPLIT.

The remainder of this paper is organized as follows: Section 2 describes the two direct dense diagonal (DDD) storage formats and algorithms. Section 3 describes the implementation of our algorithms. Section 4 gives the experimental results on two platforms. Section 5 concludes.

II. DIRECT DENSE DIAGONAL FORMAT

A. The DDD-NAIVE Format and Algorithm

Like the DIA format, the basic Direct Dense Diagonal (DDD) format numbers the 2n-1 diagonals of an n×n matrix as shown in Figure 1. The element (i, j) of a matrix belongs to diagonal D(i-j), and diagonal D0 is called the main diagonal. Our DDD format stores diagonal Di with n-|i| elements, including both nonzero and zero elements, whereas the DIA storage format needs space n for every diagonal. The right side of Figure 1 shows the ranges of vectors x and y needed to compute each diagonal. The storage methods differ between our naive (DDD-NAIVE) and improved (DDD-SPLIT) algorithms.

In the DDD-NAIVE algorithm all the diagonals that have nonzero element(s) are stored one by one from D(1-n) to D(n-1), and all the elements within each diagonal are stored in row order. The data structure of DDD-NAIVE has three arrays: val contains all the diagonals that have nonzero elements; diag_ptr contains the pointer to the first element in the val array for every diagonal; diag_num contains the number of every diagonal, which is used to compute the corresponding start positions in vectors x and y. The size of val is the sum of the sizes of all stored diagonals (ne, nonempty), and the size of diag_ptr and diag_num is the total number of stored diagonals (ndiag).
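A possible C declaration of this structure is sketched below (the field names follow the text; the struct itself is our illustration, not the authors' code):

/* Sketch of the DDD-NAIVE data structure. */
typedef struct {
    int     ndiag;    /* number of stored diagonals                    */
    long    ne;       /* total stored elements, including zero fills   */
    double *val;      /* all stored diagonals, concatenated row by row */
    long   *diag_ptr; /* start of each diagonal in val (ndiag entries) */
    int    *diag_num; /* diagonal number k of D_k (ndiag entries)      */
} ddd_naive_t;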

Figure 2 (a) is an example matrix which has three diagonals and ten nonzero elements. We fill two zero elements (called ne elements and set to 0.0) into diagonal D2, and store diagonal D-4 first in the val array, followed by diagonals D0 and D2, as shown in Figure 2 (e). Since the DDD format stores and computes all the elements of every diagonal, it only computes the start pointers into vectors x and y for every diagonal instead of storing a col entry for every nonzero as the CSR format does, which decreases both memory space and memory bandwidth usage. The DDD-NAIVE algorithm is presented in Figure 3.

B. Split Matrix by Diagonals

Obviously, the DDD-NAIVE algorithm cannot reuse x and y in cache when the dimension n is large. To reuse vector y in registers as the CSR algorithm does, we must know which diagonals are needed when computing each element of vector y. We split a matrix into row blocks through the lines above or below the points where diagonals intersect the leftmost or rightmost column of the matrix. These lines also split the vector y into several blocks; computing the elements within each block of y needs the same set of diagonals. Diagonal D0 does not split the matrix. Two diagonals may split the matrix at the same row; we denote by nSameSplit the number of diagonals which split the matrix at the same row as another. Thus we have the following equations for the number of row blocks (nrb): nrb = ndiag - nSameSplit, if the matrix has diagonal D0; nrb = ndiag - nSameSplit + 1, otherwise.
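In code, the row-block count follows directly from these equations (a trivial sketch; the names are ours):

/* Number of row blocks; has_d0 is nonzero when diagonal D0 is stored. */
int num_row_blocks(int ndiag, int nSameSplit, int has_d0)
{
    return has_d0 ? ndiag - nSameSplit : ndiag - nSameSplit + 1;
}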

C. Split Matrix by Compressing Same Values

Kourtis et al. [5] design a simple scheme for compressing the values of a sparse matrix. Instead of the val array of the CSR algorithm, their scheme uses two arrays: val_unique, a small array of the unique values, and val_ind, an index array. However, the main drawback is the additional memory bandwidth usage induced by the val_ind array of size nnz.

Since sparse matrices often have runs of equal contiguous values in each diagonal, we can split each row block into sub row blocks in which each diagonal has a single value. The format then stores only the values of the first row of each sub row block and adds a loop over the sub row blocks within every row block. We denote by nsrb the total number of sub row blocks.

D. The DDD-SPLIT Format and Algorithm

Besides the val array of the DDD-NAIVE data structure, the DDD-SPLIT algorithm requires five more arrays: x_ptr contains the pointers to the start indices of vector x for every diagonal; m_ptr contains the first row number of every row block, with m_ptr[nrb] = m; start and end contain the first and last diagonal indices into the diag_ptr array for every row block; compress_split contains the row numbers of all the splitting lines that split the matrix into sub row blocks. The size of the x_ptr, m_ptr, start and end arrays is nrb. Because the same diagonals are required within one row block, we can store the first row of every sub row block using the storage method of the val array in the CSR format. Then we need only a single pointer a_ptr into the val array rather than the diag_ptr array of the DDD-NAIVE algorithm. Figure 4 shows the DDD-SPLIT algorithm.
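Collected in one place, a possible declaration of these extra arrays looks as follows (a sketch under our naming assumptions, with the array roles as described above):

/* Sketch of the extra DDD-SPLIT arrays. */
typedef struct {
    int   nrb;            /* number of row blocks                         */
    long *x_ptr;          /* start indices into vector x (nrb entries)    */
    int  *m_ptr;          /* first row of each row block; m_ptr[nrb] = m  */
    int  *start, *end;    /* first/last diag_ptr index per row block      */
    int  *compress_split; /* rows of the same-value splitting lines       */
    long  a_ptr;          /* single running pointer into val              */
} ddd_split_t;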

The algorithm designed in [8] also splits the matrix as A = C + B, where matrices B and C are stored in the RSDIAG and register block formats respectively. In contrast, we first select the dense diagonals and suitable sparse diagonals, and then use those diagonals to split the matrix. Although RSDIAG stores only the nonzero elements, our splitting method is more suitable for matrices that have many sparse diagonals that cannot be used for splitting. For example, the matrix in Figure 2 (a) will be split into five row blocks rather than three after filling some zeros in our splitting method.

We use the same matrix as an example of the splitting process. Firstly, all the diagonals with nonzero elements are filled with zero elements, marked 0.0 in Figure 2 (b). Secondly, diagonals D-4 and D2 intersect the leftmost column at the fifth row and the rightmost column at the fourth row of the matrix. We split the matrix through the line above the fifth row and below the fourth row. Since the two splitting lines are the same, the matrix and vector y are split into two row blocks (the solid line) as shown in Figure 2 (c). Thirdly, the first row block is split into two sub row blocks (the dashed line) as shown in Figure 2 (d). In each sub row block the values in each diagonal are the same, and only the first row of each sub row block is stored. Finally, we store all the ne elements in the val array in order by row and by column within a row, as in the CSR format. The val array is shown in Figure 2 (f).

Diagonal     range of x        range of y
Di (i>0)     x0 ~ x(n-i-1)     yi ~ y(n-1)
D-i (i>0)    xi ~ x(n-1)       y0 ~ y(n-i-1)
D0           x0 ~ x(n-1)       y0 ~ y(n-1)

Figure 1. Direct Dense Diagonal Numbering Scheme



III. IMPLEMENTATION

A. Improvements to the DDD-SPLIT Algorithm

For our DDD storage format, the length of each row within one row block is the same, so we can apply unroll-and-jam [6] without any overhead of changing the DDD storage format. In our implementation, the unroll-and-jam factor is 2.
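The sketch below shows what unroll-and-jam by a factor of 2 looks like for one row block (our illustration; ad[d] points at the values of the d-th active diagonal within the block and xoff[d] is its column offset, both hypothetical names):

/* One row block, rows unrolled by 2 and the diagonal loop jammed. */
void row_block_uaj2(int rows, int ndb, const double **ad,
                    const int *xoff, const double *x, double *y)
{
    int i;
    for (i = 0; i + 1 < rows; i += 2) {
        double y0 = y[i], y1 = y[i + 1];      /* y kept in registers */
        for (int d = 0; d < ndb; d++) {
            y0 += ad[d][i]     * x[i + xoff[d]];
            y1 += ad[d][i + 1] * x[i + 1 + xoff[d]];
        }
        y[i] = y0;  y[i + 1] = y1;
    }
    if (i < rows) {                           /* odd remainder row */
        double y0 = y[i];
        for (int d = 0; d < ndb; d++)
            y0 += ad[d][i] * x[i + xoff[d]];
        y[i] = y0;
    }
}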

There is also an optimization opportunity in compressing the start and end arrays of the DDD-SPLIT algorithm. There are only three patterns of differences between adjacent elements j and j+1 of the two arrays: start[j+1] = start[j], end[j+1] = end[j] - 1; start[j+1] = start[j] + 1, end[j+1] = end[j]; start[j+1] = start[j] + 1, end[j+1] = end[j] - 1. Thus we can use two bits to encode the three patterns.
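A sketch of this two-bit encoding (the pattern codes and helper names are ours): with code 0 for (start same, end-1), code 1 for (start+1, end same) and code 2 for (start+1, end-1), four codes fit in one byte:

#include <stdint.h>

/* Read the 2-bit pattern code of step j from the packed array. */
static inline int get_pattern(const uint8_t *packed, int j)
{
    return (packed[j >> 2] >> ((j & 3) * 2)) & 3;
}

/* Recover start[j+1] and end[j+1] from start[j] and end[j] in place. */
static inline void apply_pattern(int code, int *start, int *end)
{
    switch (code) {
    case 0:             (*end)--; break;
    case 1: (*start)++;           break;
    case 2: (*start)++; (*end)--; break;
    }
}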

In most cases ndiag is larger than the number of registers, so the x_ptr pointer array cannot be reused in registers. Instead we can use one pointer xt, which stays in a register and points to the start index of vector x when computing each row block, together with a distance array dis which stores the distances between adjacent diagonals, namely dis[i] = diag_num[i+1] - diag_num[i].

B. Selection between DDD and CSR Formats

We use a ratio nz_ratio = (the number of nonzero elements of diagonal Di) / (the size of the diagonal, i.e. n - |i|) for each diagonal to decide which format (CSR or DDD) is used. We choose a threshold csr_threshold in [0,1] to compare with the nz_ratio of each diagonal. If nz_ratio < csr_threshold, the nonzero elements of the diagonal are stored in CSR format; otherwise they are stored in DDD format. Obviously, if csr_threshold = 0 then all nonzero elements are stored in DDD format, and if csr_threshold = 1 all nonzero elements are stored in CSR.
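The per-diagonal choice can be expressed as a small helper (a sketch; the function is our illustration, not the authors' code):

#include <stdlib.h>

enum fmt { FMT_CSR, FMT_DDD };

/* nz_k: nonzero count of diagonal D_k; n: matrix dimension. */
enum fmt choose_format(long nz_k, int n, int k, double csr_threshold)
{
    double nz_ratio = (double)nz_k / (double)(n - abs(k));
    return (nz_ratio < csr_threshold) ? FMT_CSR : FMT_DDD;
}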

We now choose the value of csr_threshold for our different platforms. We do not consider the reuse of vector x between adjacent diagonals; we use a single diagonal to test the threshold value and generate examples of the two formats with different ne and nz sizes on each platform. For the DDD format, how the nz nonzero elements are distributed in the diagonal does not impact performance, since we compute every ne element including the zeros. For the CSR format, however, the distribution of nonzero elements impacts performance because it affects the spatial locality of vector x. Due to the variety of matrices, we simply generate a CSR example in which the nonzero elements are distributed uniformly in a diagonal for each ne and nz size. Table 2 lists the best csr_threshold value for different ne sizes on the two platforms. Figure 5 shows our final DDD-CSR mixed algorithm.

(a) The example matrix:

1.1  0    2.2  0    0    0
0    1.1  0    2.2  0    0
0    0    3.3  0    0    0
0    0    0    3.3  0    0
4.4  0    0    0    5.5  0
0    4.4  0    0    0    5.5

(b) The matrix filled with ne elements (0.0) in diagonal D2. (c) The matrix split by diagonals. (d) The first row block split by compressing same values.

(e) The val array in the DDD-NAIVE algorithm (diag_ptr[0], diag_ptr[1] and diag_ptr[2] point at D-4, D0 and D2):

val = {4.4, 4.4, 1.1, 1.1, 3.3, 3.3, 5.5, 5.5, 2.2, 2.2, 0.0, 0.0}

(f) The val array in the DDD-SPLIT algorithm:

val = {1.1, 2.2, 3.3, 0.0, 4.4, 5.5}

Figure 2. An example of changing the CSR to DDD-NAIVE and DDD-SPLIT.

For each diagonal k:
    Set pointers at, xt and yt to the corresponding positions in val, vector x and vector y
    For i = 0 to (m - abs(diag_num[k])):
        yt[i] += xt[i] × at[i];

Figure 3. The DDD-NAIVE Algorithm
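A C sketch of this kernel, using the ddd_naive_t structure sketched in Section II.A and the x/y ranges of Figure 1 (our illustration, not the authors' code):

#include <stdlib.h>

void spmv_ddd_naive(const ddd_naive_t *A, int m, const double *x, double *y)
{
    for (int k = 0; k < A->ndiag; k++) {
        int d = A->diag_num[k];
        const double *at = A->val + A->diag_ptr[k];
        const double *xt = x + (d > 0 ? 0 : -d);  /* offsets per Figure 1 */
        double       *yt = y + (d > 0 ? d : 0);
        for (int i = 0; i < m - abs(d); i++)
            yt[i] += xt[i] * at[i];
    }
}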

For each row block:
    For each same-value sub row block:
        For each row:
            Use the values of the first row to compute the vector y

Figure 4. The DDD-SPLIT Algorithm
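A sketch of the innermost part of this loop nest for one same-value sub row block (our illustration; vals holds the stored first row, one value per active diagonal, and xoff[d] is the x offset of diagonal d, both hypothetical names):

/* Rows row_lo..row_hi-1 of one sub row block reuse the same vals[]. */
void ddd_split_sub_block(int row_lo, int row_hi, int ndb,
                         const double *vals, const int *xoff,
                         const double *x, double *y)
{
    for (int i = row_lo; i < row_hi; i++) {
        double yi = y[i];
        for (int d = 0; d < ndb; d++)
            yi += vals[d] * x[i + xoff[d]];  /* same values for every row */
        y[i] = yi;
    }
}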

Step 1. Traverse all nonzero elements to collect the information of all diagonals.
Step 2. Choose the DDD or CSR format for each diagonal.
Step 3. Split the matrix using the DDD-stored diagonals.
Step 4. Transform the original CSR into DDD and a new CSR.
Step 5. Use the DDD-SPLIT and CSR algorithms to compute the elements stored in DDD and CSR respectively.

Figure 5. The DDD-CSR Algorithm



TABLE II. CHOOSING CSR_THRESHOLD

ne     Intel platform   AMD platform
2M     0.656            0.679
4M     0.624            0.729
8M     0.625            0.729
16M    0.609            0.683
32M    0.625            0.706

TABLE I. MATRICES USED FOR EVALUATION OF THE DDD-CSR ALGORITHM

id   Mat name     n         #diag   #nz        #ne          nrb
1    A_1          800000    11      6354000    7199834      9
2    A_2          320000    11      2532800    2879894      9
3    A_3          1440000   11      11452800   12959794     9
4    atmosmodd    1270432   7       8814880    8848918      7
5    atmosmodm    1489752   7       10319760   10349458     7
6    denormal     89400     7       622812     624307       7
7    ecology2     999999    4       2997995    3997996      4
8    Lin          256000    4       1011200    1017519      4
9    kim2         456976    25      11330020   11404114     25
10   Baumann      112211    7       760631     783231       7
11   raefsky3     21200     839     1488768    16977032     ---
12   matrix_9     103430    8802    2121550    455849839    ---
13   crystk03     24696     50      887937     1217460      ---
14   nemeth22     9506      99      684169     936243       ---
15   Chebyshev4   68121     68910   5377761    2373637081   ---
16   af_5_k101    503625    449     9027150    225846620    ---
17   majorbasis   160000    22      1750416    2518798      ---
18   qa8fm        66127     332     863353     21745563     ---
19   windscreen   22692     50      752541     945363       ---
20   RS_b39_c30   60098     30155   1079986    460301694    ---
21   nemeth26     9506      117     760633     1105416      ---
22   s3dkt3m2     90449     328     1921955    29522952     ---
23   viscorocks   37762     175     1162244    6590092      ---

IV. EXPERIMENTAL RESULTS

A. Experimental Setup

Our experiments were conducted on two platforms: a 2.4GHz AMD Opteron 8378 with 64GB of RAM and a 3.00GHz Intel Xeon X5472 with 16GB of RAM. On the AMD platform, the OS was Linux 2.6.16.60-0.21-smp and the compiler was gcc 4.1.2. On the Intel Xeon platform, the OS was Linux 2.6.27-7-generic and the compiler was gcc 4.3.3, with the optimization flag -O3.

Table 1 lists our square matrices and their characteristics. The first three matrices come from our project, in which fluid dynamics codes are developed using the finite difference method. The other matrices are selected from [7] by inspecting the pictures of the matrices. The columns #nz and #ne give the number of nonzero elements and ne elements of the original matrix, #diag is the number of diagonals that have at least one nonzero element, and nrb is the number of row blocks. We use the DDD-CSR algorithm for matrices 11-23 in Table 1 since they have some sparse diagonals; because their best csr_ratio is not 0 and depends on the platform, we list their nrb for the two platforms separately in Table 3.

(Bar charts: left, 1-fr (original) for matrices 1-10; right, fr (original), DDD 1-fr (Intel) and DDD 1-fr (AMD) for matrices 11-23; y-axes from 0 to 1.)

Figure 6. Filling Ratios on the Two Platforms



(Bar charts of speedup versus matrix id: left, DDD-NAIVE and DDD-SPLIT for matrices 1-10; right, DDD-SPLIT for matrices 11-23; y-axes from 0 to 1.8.)

Figure 7. Speedups for the DDD-NAÏVE and DDD-SPLIT Methods on the Intel Platform

(Bar charts of speedup versus matrix id: left, DDD-NAIVE and DDD-SPLIT for matrices 1-10; right, DDD-SPLIT for matrices 11-23; y-axes from 0 to 1.8.)

Figure 8. Speedups for the DDD-NAÏVE and DDD-SPLIT Methods on the AMD Platform

(Bar chart of DDD-SPLIT speedup versus matrix id for matrices 4, 5, 6, 7 and 9; y-axis from 0 to 2.)

Figure 9. Speedups for the Split Same Value Method on the Intel Platform

TABLE III. THE BEST CSR_RATIO AND THE CORRESPONDING DDD FORMAT ON THE TWO PLATFORMS. THE LEFT HALF OF EACH ROW IS THE INTEL PLATFORM AND THE RIGHT HALF IS THE AMD PLATFORM.

     ------------- Intel platform -------------   -------------- AMD platform --------------
Mat  csr ratio  DDD diag  DDD nz   DDD ne   nrb   csr ratio  DDD diag  DDD nz   DDD ne   nrb
11   0.9        51        1032032  1056920  51    0.8        57        1137704  1181254  57
12   0.7        13        1113694  1129458  11    0.7        13        1113694  1129458  11
13   0.7        32        191374   438219   32    0.9        11        261160   268122   11
15   0.92       51        479841   483519   45    0.7        64        580230   606366   64
14   0.4        79        1703094  1746123  67    0.8        49        1677154  1700413  37
16   0.65       14        6321935  7044779  14    0.7        12        5620401  6039229  12
17   0.5        13        1075208  1120000  12    0.7        12        995208   1000000  11
18   0.6        5         306920   327598   5     0.6        5         306920   327598   5
19   0.7        32        588741   605724   32    0.7        32        588741   605724   32
20   0.9        15        507608   511415   15    0.9        15        507608   511415   15
21   0.92       51        479957   483524   51    0.6        75        659056   710175   75
22   0.9        14        1250437  1262023  14    0.9        14        1250437  1262023  14
23   0.7        5         186500   188804   5     0.7        5         186500   188804   5

TABLE IV. THE NUMBER OF SAME VALUES AND SPLIT LINES

Matrix id   same value   compression ratio   # split line
4           179368       0.02027             25745
5           156808       0.01515             22565
6           8377         0.01341             1193
7           37530        0.00938             9380
9           84344        0.00739             3355



B. Performance Results

Our algorithm beats the CSR algorithm for almost all the matrices we chose. For some well-suited matrices, such as atmosmodd and Lin, the best performance is achieved by storing all the nonzero elements in the DDD-SPLIT format, so there are no calls to the CSR algorithm for them. The first ten matrices in Table 1 belong to this kind: their DDD nz equals the total nz, and their nrb is the same on the two platforms. These matrices have a very low filling ratio, and we can expect good performance from the algorithm analysis in Section 2. We show 1-fr (where fr is the filling ratio, fr = (ne - nz) / nz) on the left side of Figure 6. All the results are expressed as speedups with respect to the CSR algorithm. The best speedup is up to 1.5 and only one matrix's speedup is below 1.2.

The other matrices have a very high original filling ratio and require storing the sparse diagonals in CSR format and calling the mixed DDD-CSR algorithm. The right side of Figure 6 shows the original fr (very high) together with the DDD 1-fr (also very high); the "1-fr" curves (Intel and AMD) indicate the average filling ratio of the diagonals that are stored in DDD-SPLIT format.

Figures 7 and 8 show the results on the Intel and AMD platforms, respectively. All the results are expressed as speedups relative to the CSR algorithm. The best performance is achieved when the DDD 1-fr is larger than 0.9 in our tests. The average speedup is 1.2, lower than for matrices 1-10. However, nemeth22 attains a speedup of 1.54, since its vector x can stay in cache thanks to the elimination of the index array. By comparing the left sides of Figures 6 and 7, we can also observe that a matrix with a higher 1-fr achieves a higher speedup.

Table 3 shows the best csr_ratio, the DDD storage format information and the nrb of matrices 11-23. The columns DDD ne and DDD nz give the number of ne and nonzero elements among the DDD-stored elements; the left half of each row is the Intel platform and the right half is the AMD platform. The best csr_ratio values are larger than the values expected in Section 3.2 due to the reuse distances of vector x. In the future we will build a more precise model to obtain a more accurate value for choosing the format of each diagonal.

Figure 9 shows the speedups for the split same value method on the Intel platform, and Table 4 shows the corresponding information for the matrices which have small compression ratio values. The split same value method provides an average speedup of 1.78 and a maximum of 1.87. The compression ratio of a matrix is computed as (the number of same values) / ne. Clearly these matrices compress very well, and the sizes of their val arrays become very small, which makes the memory bandwidth usage very low.

V. CONCLUSION AND FUTURE WORK

For matrices with many dense diagonals, we propose the DDD format and two algorithms. We use diagonals to split a matrix in order to reuse vector y in registers and apply three optimization methods. The experimental results show that our method can reduce the memory usage for a wide range of sparse matrices and achieve speedups of up to 1.87.

Future work includes the following. (1) We will compare our format with improved CSR algorithms (unroll-and-jam, CSR-VI [5] and so on). (2) All the algorithms designed in this paper are sequential; the corresponding parallel algorithms will be designed in the future.

ACKNOWLEDGMENT

This work was supported by the National 863 Plan of China (No. 2006AA01A125, No. 2009AA01A129 and No. 2009AA01A134), the HGJ project (No. 2009ZX01036-001-002), the Knowledge Innovation Program of the CAS (No. KGCX1-YW-13) and the Ministry of Finance (No. ZDYZ2008-2). The authors thank the reviewers.

REFERENCES

[1] R. C. Agarwal, F. G. Gustavson, and M. Zubair. A high performance algorithm using pre-processing for the sparse matrix-vector multiplication. In SC'92: Proceedings of the 1992 ACM/IEEE conference on Supercomputing, pages 32-41, Los Alamitos, CA, USA, 1992. IEEE Computer Society Press.

[2] E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.

[3] E.-J. Im and K. Yelick. Optimizing sparse matrix computations for register reuse in SPARSITY. Lecture Notes in Computer Science, 2073:127–136, 2001.

[4] E.-J. Im, K. A. Yelick, and R. Vuduc. SPARSITY: Framework for optimizing sparse matrix-vectormultiply. International Journal of High Performance Computing Applications, 18(1):135-158, February 2004.

[5] K. Kourtis, G. Goumas, and N. Koziris. Optimizing sparse matrix-vector multiplication using index and value compression. In CF '08: Proceedings of the 2008 Conference on Computing Frontiers, pages 87-96, 2008.

[6] J. Mellor-Crummey and J. Garvin. Optimizing sparse matrix-vector product computations using unroll and jam. International Journal of High Performance Computing Applications, 18(2):225, 2004.

[7] UFget and UFgui interfaces to the UF Sparse Matrix Collection. www.cise.ufl.edu/research/sparse/mat/UFget.html

[8] R. W. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California Berkeley, 2003.

[9] J. Willcock and A. Lumsdaine. Accelerating sparse matrix computations via data compression. In ICS’06: Proceedings of the 20th annual international conference on Supercomputing, pages 307–316, New York, NY, USA, 2006.

[10] Zhang Yunquan. DRAM(h): a parallel computation model for high performance numerical computing. Chinese Journal of Computers, 2003, 26(12): 1660-1670.

[11] Zhang Yunquan, Sun Jiachang, Tang Zhimin, et al. Memory complexity on numerical programs. Chinese Journal of Computers, 2000, 23(4): 363-373.

[12] M. M. Baskaran and R. Bordawekar. Optimizing sparse matrix-vector multiplication on GPUs. IBM Research Report, IBM Corporation, April 2009.

[13] N. Bell and M. Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec. 2008.

[14] N. Bell and M. Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Source Code and Matrices. http://www.nvidia.com/object/nvidia_research_pub_001.html.
