Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures
Ankit Jain, Vasily Volkov
CS252 Final Presentation
5/9/2007
SpM×V and its Applications
• Sparse Matrix Vector Multiply (SpM×V): y ← y + A·x
– x, y are dense vectors
• x: source vector
• y: destination vector
– A is a sparse matrix (<1% of entries are nonzero)
• Applications employing SpM×V in the inner loop
– Least Squares Problems
– Eigenvalue Problems
[Figure: sparse matrix A, source vector x, destination vector y]
Storing a Matrix in Memory
Compressed Sparse Row Data Structure and Algorithm
type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
  for l = ptr[i] to ptr[i+1] - 1 do
    y[i] ← y[i] + val[l] · x[ind[l]]
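For concreteness, here is a minimal C sketch of this CSR kernel (the function name spmv_csr, the double/int types, and the parameter m for the number of rows are choices made for this sketch, not taken from the slides):

#include <stddef.h>

/* CSR SpMV sketch: y <- y + A*x.
 * val[k]   : non-zero values
 * ind[k]   : column index of each non-zero
 * ptr[m+1] : start of each row in val/ind */
void spmv_csr(size_t m, const double *val, const int *ind,
              const int *ptr, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double yi = y[i];
        for (int l = ptr[i]; l < ptr[i + 1]; l++)
            yi += val[l] * x[ind[l]];   /* indirect, irregular access to x */
        y[i] = yi;
    }
}

Note the indirect access x[ind[l]]: it is this irregularity, plus the single flop per loaded value and index, that limits performance.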
What’s so hard about it?
• Reasons for poor performance of the naïve implementation
  – Poor locality (indirect and irregular memory accesses)
    • Limited by the speed of main memory
  – Poor instruction mix (low ratio of flops to memory operations)
  – Algorithm depends on the non-zero structure of the matrix
    • Dense matrices vs. sparse matrices
Register-Level Blocking (SPARSITY): 3x3 Example
BCSR with uniform, aligned grid
Fill-in zeros: trade-off extra ops for better efficiency
Blocked Compressed Sparse Row
• The inner loop performs a floating-point multiply-add for every non-zero in a block, rather than for a single non-zero per iteration (see the sketch below)
• Reduces the number of times elements of the source vector x must be re-loaded from memory
• Reduces the number of column indices that have to be stored and loaded (one per block instead of one per non-zero)
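As an illustration, here is a minimal C sketch of a 2×2 BCSR kernel (this is not the generated SPARSITY/OSKI code; the array names b_val, b_ind, b_ptr and the row-major layout within a block are assumptions of this sketch):

/* 2x2 register-blocked (BCSR) SpMV sketch: y <- y + A*x.
 * b_val : 4 values per block, row-major within the block
 * b_ind : starting column of each block (one index per block)
 * b_ptr : start of each block row in b_val/b_ind
 * mb    : number of block rows (the matrix has 2*mb rows)   */
void spmv_bcsr_2x2(int mb, const double *b_val, const int *b_ind,
                   const int *b_ptr, const double *x, double *y)
{
    for (int I = 0; I < mb; I++) {
        double y0 = y[2 * I], y1 = y[2 * I + 1];   /* kept in registers */
        for (int l = b_ptr[I]; l < b_ptr[I + 1]; l++) {
            const double *b = &b_val[4 * l];
            double x0 = x[b_ind[l]], x1 = x[b_ind[l] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I] = y0;
        y[2 * I + 1] = y1;
    }
}

Each inner-loop iteration performs four multiply-adds but loads only one column index, and x0/x1 are reused across both rows of the block.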
The Payoff: Speedups on Itanium 2
[Plot: Mflop/s of the reference implementation vs. the best register-blocked version (best block size: 4x2)]
Explicit Software Pipelining
ORIGINAL CODE:
type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
  for l = ptr[i] to ptr[i+1] - 1 do
    y[i] ← y[i] + val[l] · x[ind[l]]
SOFTWARE PIPELINED CODE
type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
  for l = ptr[i] to ptr[i+1] - 1 do
    y[i] ← y[i] + val_1 · x_1
    val_1 ← val[l + 1]
    x_1 ← x[ind_2]
    ind_2 ← ind[l + 2]
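A minimal C sketch of the same idea with a pipeline depth of one follows (this is not the exact Itanium code from the slides; the depth-1 scheme, the prologue/epilogue handling, and the variable names are choices made for this sketch):

/* Software-pipelined CSR inner loop: the operands of iteration l are
 * loaded during iteration l-1, hiding part of the load latency behind
 * the multiply-add. */
void spmv_csr_pipelined(int m, const double *val, const int *ind,
                        const int *ptr, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        int lo = ptr[i], hi = ptr[i + 1];
        if (lo == hi) continue;              /* skip empty rows */
        double yi = y[i];
        double v  = val[lo];                 /* prologue: first operands */
        double xv = x[ind[lo]];
        for (int l = lo; l < hi - 1; l++) {
            double v_next  = val[l + 1];     /* issue next loads early */
            double xv_next = x[ind[l + 1]];
            yi += v * xv;                    /* consume operands loaded earlier */
            v  = v_next;
            xv = xv_next;
        }
        yi += v * xv;                        /* epilogue: last non-zero */
        y[i] = yi;
    }
}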
Explicit Software Prefetching
ORIGINAL CODE:
type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
  for l = ptr[i] to ptr[i+1] - 1 do
    y[i] ← y[i] + val[l] · x[ind[l]]
SOFTWARE PREFETCHED CODE
type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
  for l = ptr[i] to ptr[i+1] - 1 do
    y[i] ← y[i] + val[l] · x[ind[l]]
    pref(NTA, pref_v_amt + &val[l])
    pref(NTA, pref_i_amt + &ind[l])
    pref(NONE, &x[ind[l + pref_x_amt]])
*NTA: prefetch with no temporal locality at any cache level
*NONE: prefetch with temporal locality at the highest cache level
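A C sketch of the same technique using GCC's __builtin_prefetch (the Itanium 2 code in the slides would use lfetch-style prefetch intrinsics instead; the prefetch distances PREF_V, PREF_I, PREF_X are illustrative placeholders that would normally be tuned):

/* CSR SpMV with explicit software prefetching.
 * Locality hint 0 ~ NTA (no reuse expected: val and ind are streamed);
 * locality hint 3 ~ keep in cache (x may be reused).
 * Prefetches slightly past the end of val/ind do not fault. */
#define PREF_V 16   /* elements ahead in val */
#define PREF_I 16   /* elements ahead in ind */
#define PREF_X 8    /* non-zeros ahead in x  */

void spmv_csr_prefetch(int m, const double *val, const int *ind,
                       const int *ptr, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double yi = y[i];
        int hi = ptr[i + 1];
        for (int l = ptr[i]; l < hi; l++) {
            yi += val[l] * x[ind[l]];
            __builtin_prefetch(&val[l + PREF_V], 0, 0);
            __builtin_prefetch(&ind[l + PREF_I], 0, 0);
            if (l + PREF_X < hi)             /* stay inside this row's indices */
                __builtin_prefetch(&x[ind[l + PREF_X]], 0, 3);
        }
        y[i] = yi;
    }
}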
Characteristics of Modern Architectures
• Highly set-associative caches
  – 4-way L1, 8-way L2, 12-way L3 on Itanium 2
• Multiple load/store units
• Multiple execution units
  – Six integer execution units on Itanium 2
  – Two floating-point multiply-add execution units on Itanium 2
Question: What if we broke the matrix into multiple streams of execution?
Parallel SpMV
• Run different rows in different threads
• Can we do that on data-parallel architectures (SIMD/VLIW, Itanium/GPU)?
  – What if rows have different lengths?
  – One row finishes while others are still running
    • Waiting threads keep processors idle
  – Can we avoid this idleness?
    • Standard solution: segmented scan
Segmented Scan
• Multiple segments (streams) of simultaneous execution
• A single loop with branches inside to check whether each segment has reached the end of its row
  – Reduces loop overhead
  – Good if the average number of non-zeros per row is small
• Changes the memory access pattern and can use caches more efficiently for some matrices
  – Future work: pass SpM×V through a cache simulator to observe cache behavior
Itanium 2 Results (1.3 GHz, Millennium Cluster)
Implementation                  Speedup over reference
Reference                       1
BCSR                            2.96
BCSR Prefetched                 4.25
BCSR Pipelined                  2.88
BCSR Pipelined & Prefetched     4.25
BSS                             3.22
BSS Prefetched                  6.09
BSS Pipelined                   2.78
BSS Pipelined & Prefetched      4.19
Conclusions & Future Work
• The optimizations studied pay off and should be incorporated into OSKI
• Develop parallel / multicore versions
  – Dual-core, dual-socket Opterons, etc.
Questions?
Extra Slides
Algorithm #2: Segmented Scan
1x1x2 Segmented Scan Code
type val : real[k]
type ind : int[k]
type ptr : int[m+1]
type RowStart : int[VectorLength]

r0 ← RowStart[0]
r1 ← RowStart[1]
nnz0 ← ptr[r0]
nnz1 ← ptr[r1]
EoR0 ← ptr[r0+1]
EoR1 ← ptr[r1+1]

while nnz0 < SegmentLength do
  y[r0] ← y[r0] + val[nnz0] · x[ind[nnz0]]
  y[r1] ← y[r1] + val[nnz1] · x[ind[nnz1]]
  if (nnz0 = EoR0)
    r0++
    EoR0 ← ptr[r0+1]
  if (nnz1 = EoR1)
    r1++
    EoR1 ← ptr[r1+1]
  nnz0 ← nnz0 + 1
  nnz1 ← nnz1 + 1
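A C sketch of this two-stream variant follows (names such as row_start and seg_len are placeholders; it assumes the non-zeros are split into two segments of seg_len non-zeros each, with each segment starting on a row boundary):

/* Two-stream segmented-scan SpMV sketch: two independent streams walk
 * their own segment of non-zeros in one interleaved loop, so a short row
 * in one stream does not leave the other stream idle. */
void spmv_segscan_2(const double *val, const int *ind, const int *ptr,
                    const int row_start[2], int seg_len,
                    const double *x, double *y)
{
    int r0 = row_start[0], r1 = row_start[1];
    int n0 = ptr[r0],      n1 = ptr[r1];      /* current non-zero per stream */
    int e0 = ptr[r0 + 1],  e1 = ptr[r1 + 1];  /* end of current row */

    for (int step = 0; step < seg_len; step++) {
        /* advance each stream to the row owning its next non-zero
           (the while-loops also skip empty rows) */
        while (n0 == e0) { r0++; e0 = ptr[r0 + 1]; }
        while (n1 == e1) { r1++; e1 = ptr[r1 + 1]; }
        y[r0] += val[n0] * x[ind[n0]];        /* one multiply-add per stream */
        y[r1] += val[n1] * x[ind[n1]];
        n0++;
        n1++;
    }
}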
Measuring Performance
• Measure Dense Performance(r, c)
  – Performance (Mflop/s) of a dense matrix stored in sparse r×c blocked format
• Estimate Fill Ratio(r, c) for all r, c
  – Fill Ratio(r, c) = (number of stored values) / (number of true non-zeros)
• Choose the r, c that maximize
  – Estimated Performance(r, c) = Dense Performance(r, c) / Fill Ratio(r, c)
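A small C sketch of this selection heuristic (the array names dense_mflops and fill_ratio and the bounds RMAX/CMAX are placeholders; dense_mflops would be measured offline per machine, fill_ratio estimated per matrix):

#define RMAX 8   /* largest row block size considered    */
#define CMAX 8   /* largest column block size considered */

/* Pick the (r, c) that maximizes Dense Performance(r,c) / Fill Ratio(r,c). */
void choose_block_size(const double dense_mflops[RMAX][CMAX],
                       const double fill_ratio[RMAX][CMAX],
                       int *best_r, int *best_c)
{
    double best = 0.0;
    *best_r = 1;
    *best_c = 1;
    for (int r = 0; r < RMAX; r++) {
        for (int c = 0; c < CMAX; c++) {
            double est = dense_mflops[r][c] / fill_ratio[r][c];
            if (est > best) {
                best = est;
                *best_r = r + 1;   /* block sizes run from 1 to RMAX/CMAX */
                *best_c = c + 1;
            }
        }
    }
}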
References
1. G. Blelloch, M. Heroux, and M. Zagha. Segmented operations for sparse matrix computation on vector multiprocessors. Technical Report CMU-CS-93-173, Carnegie Mellon University, 1993.
2. E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.
3. E.-J. Im, K. A. Yelick, and R. Vuduc. SPARSITY: Framework for optimizing sparse matrix-vector multiply. International Journal of High Performance Computing Applications, 18(1):135–158, February 2004.
4. R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick. Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply. Technical Report UCB/CSD-04-1335, University of California, Berkeley, Berkeley, CA, USA, June 2004.
5. Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report 90-20, NASA Ames Research Center, Moffett Field, CA, 1990.
6. A. Schwaighofer. A matlab interface to svm light to version 4.0. http://www.cis.tugraz.at/igi/aschwaig/software.html, 2004.
7. R. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, December 2003.
8. R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing. (to appear).