Transcript
Page 1

Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures

Ankit Jain, Vasily Volkov

CS252 Final Presentation

5/9/2007

ankit@eecs.berkeley.edu


Page 2

SpM×V and its Applications

• Sparse Matrix Vector Multiply (SpM×V): y ← y + A∙x

– x, y are dense vectors

• x: source vector

• y: destination vector

– A is a sparse matrix (<1% of entries are nonzero)

• Applications employing SpM×V in the inner loop

– Least Squares Problems

– Eigenvalue Problems

[Figure: y ← y + A∙x, with sparse matrix A, source vector x, and destination vector y]

Page 3

Storing a Matrix in Memory

Compressed Sparse Row Data Structure and Algorithm

type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
    for l = ptr[i] to ptr[i + 1] - 1 do
        y[i] ← y[i] + val[l] ∙ x[ind[l]]
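For concreteness, the CSR loop above maps directly to C as follows (a minimal sketch; the function name and the double/int types are our assumptions, not part of the slide):

    void spmv_csr(int m, const int *ptr, const int *ind,
                  const double *val, const double *x, double *y)
    {
        /* y <- y + A*x for A in CSR format: row i owns nonzeros
           ptr[i] .. ptr[i+1]-1, with their column indices in ind[] */
        for (int i = 0; i < m; i++)
            for (int l = ptr[i]; l < ptr[i + 1]; l++)
                y[i] += val[l] * x[ind[l]];
    }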

Page 4

What’s so hard about it?

• Reasons for the poor performance of the naïve implementation

– Poor locality (indirect and irregular memory accesses)
  • Limited by the speed of main memory

– Poor instruction mix (low ratio of flops to memory operations)

– Performance depends on the non-zero structure of the matrix
  • Dense matrices vs. sparse matrices

Page 5

Register-Level Blocking (SPARSITY): 3x3 Example

Page 6

Register-Level Blocking (SPARSITY): 3x3 Example

BCSR with uniform, aligned grid

Page 7

Register-Level Blocking (SPARSITY): 3x3 Example

Fill in zeros: trade off extra ops for better efficiency

Page 8

Blocked Compressed Sparse Row

• The inner loop performs a floating-point multiply-add on every non-zero in the block instead of just one non-zero

• Reduces the number of times the source vector x must be reloaded from memory

• Reduces the number of indices that have to be stored and loaded (see the 2×2 sketch below)
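A minimal sketch of a 2×2 BCSR kernel in C (the names and the row-major block layout are our assumptions; real SPARSITY/OSKI kernels are generated for many r×c sizes):

    void spmv_bcsr_2x2(int mb, const int *brow_ptr, const int *bind,
                       const double *bval, const double *x, double *y)
    {
        /* mb = number of block rows; block b spans columns
           bind[b] .. bind[b]+1 and stores its 2x2 entries
           row-major at bval[4*b] */
        for (int ib = 0; ib < mb; ib++) {
            double y0 = y[2 * ib], y1 = y[2 * ib + 1];  /* y stays in registers */
            for (int b = brow_ptr[ib]; b < brow_ptr[ib + 1]; b++) {
                const double *blk = &bval[4 * b];
                double x0 = x[bind[b]];                 /* x reused by both rows */
                double x1 = x[bind[b] + 1];
                y0 += blk[0] * x0 + blk[1] * x1;
                y1 += blk[2] * x0 + blk[3] * x1;
            }
            y[2 * ib] = y0;
            y[2 * ib + 1] = y1;
        }
    }

Note that one column index per block replaces four per-element indices, and each loaded x value feeds two rows of the block.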

Page 9

The Payoff: Speedups on Itanium 2

[Figure: Mflop/s of the reference implementation vs. the best blocked version (best: 4x2)]

Page 10

Explicit Software Pipelining

ORIGINAL CODE:

type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
    for l = ptr[i] to ptr[i + 1] - 1 do
        y[i] ← y[i] + val[l] ∙ x[ind[l]]

SOFTWARE PIPELINED CODE:

type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
    (prologue that initializes val_1, x_1, ind_2 omitted)
    for l = ptr[i] to ptr[i + 1] - 1 do
        y[i] ← y[i] + val_1 ∙ x_1
        val_1 ← val[l + 1]
        x_1 ← x[ind_2]
        ind_2 ← ind[l + 2]
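In C, the same idea with an explicit prologue and epilogue might look like the sketch below. This is our own rendering, not the authors' code, and it simplifies the slide's two-deep rotation to a one-deep lookahead so the block stays self-contained:

    void spmv_csr_pipelined(int m, const int *ptr, const int *ind,
                            const double *val, const double *x, double *y)
    {
        for (int i = 0; i < m; i++) {
            int l = ptr[i], end = ptr[i + 1];
            if (l == end) continue;            /* empty row */
            double yi = y[i];
            double v  = val[l];                /* prologue: first operands */
            double xv = x[ind[l]];
            for (; l < end - 1; l++) {
                double v_next = val[l + 1];    /* issue next loads early so   */
                double x_next = x[ind[l + 1]]; /* they overlap the current FMA */
                yi += v * xv;
                v  = v_next;
                xv = x_next;
            }
            yi += v * xv;                      /* epilogue: last element */
            y[i] = yi;
        }
    }

The point is to break the load-to-use dependence: the loads for iteration l+1 are in flight while the multiply-add for iteration l executes.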

Page 11

Explicit Software Prefetching

ORIGINAL CODE:

type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
    for l = ptr[i] to ptr[i + 1] - 1 do
        y[i] ← y[i] + val[l] ∙ x[ind[l]]

SOFTWARE PREFETCHED CODE:

type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
    for l = ptr[i] to ptr[i + 1] - 1 do
        y[i] ← y[i] + val[l] ∙ x[ind[l]]
        pref(NTA, &val[l + pref_v_amt])
        pref(NTA, &ind[l + pref_i_amt])
        pref(NONE, &x[ind[l + pref_x_amt]])

* NTA hints no temporal locality at any cache level

* NONE hints temporal locality at the highest cache level
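With the GCC/Clang __builtin_prefetch intrinsic the prefetched loop can be sketched as below. The prefetch distances are tunable assumptions, and val and ind are assumed to be padded by at least the lookahead distance so the prefetch addresses stay in bounds:

    void spmv_csr_prefetch(int m, const int *ptr, const int *ind,
                           const double *val, const double *x, double *y)
    {
        /* distances in elements; good values are machine-dependent */
        const int PF_V = 64, PF_I = 64, PF_X = 16;
        for (int i = 0; i < m; i++) {
            double yi = y[i];
            for (int l = ptr[i]; l < ptr[i + 1]; l++) {
                yi += val[l] * x[ind[l]];
                __builtin_prefetch(&val[l + PF_V], 0, 0);    /* NTA-style: locality 0 */
                __builtin_prefetch(&ind[l + PF_I], 0, 0);
                __builtin_prefetch(&x[ind[l + PF_X]], 0, 3); /* keep x lines cached */
            }
            y[i] = yi;
        }
    }

val and ind are streamed once, so evicting them early (locality 0) is harmless; x may be reused across rows, so it is prefetched with high locality.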

Page 12

Characteristics of Modern Architectures

• High set associativity in caches
  – 4-way L1, 8-way L2, 12-way L3 on Itanium 2

• Multiple load/store units

• Multiple execution units
  – Six integer execution units on Itanium 2
  – Two floating-point multiply-add execution units on Itanium 2

Question: What if we broke the matrix into multiple streams of execution?

Page 13

Parallel SpMV

• Run different rows in different threads

• Can we do that on data-parallel architectures (SIMD/VLIW: Itanium, GPUs)?
  – What if rows have different lengths?
  – One row finishes while others are still running
    • Waiting threads keep processors idle
  – Can we avoid this idleness?

• Standard solution: segmented scan

Page 14

Segmented Scan

• Multiple Segments (streams) of Simultaneous Execution

• A single loop with branches inside to check, for each segment, whether we have reached the end of a row
  – Reduces loop overhead
  – Good if the average number of non-zeros per row is small

• Changes the memory access pattern and can use caches more efficiently for some matrices
  – Future work: pass SpM×V through a cache simulator to observe cache behavior

Page 15

Itanium 2 Results (1.3 GHz, Millennium Cluster)

Speedup over the reference implementation:

Reference                     1.00
BCSR                          2.96
BCSR Prefetched               4.25
BCSR Pipelined                2.88
BCSR Pipelined & Prefetched   4.25
BSS                           3.22
BSS Prefetched                6.09
BSS Pipelined                 2.78
BSS Pipelined & Prefetched    4.19

Page 16

Conclusions & Future Work

• The optimizations studied pay off and should be incorporated into OSKI

• Develop parallel/multicore versions
  – Dual-core, dual-socket Opterons, etc.

Page 17

Questions?

Page 18

Extra Slides

Page 19

Algorithm #2: Segmented Scan

1x1x2 Segmented Scan Code

type val      : real[k]
type ind      : int[k]
type ptr      : int[m+1]
type RowStart : int[VectorLength]

r0 ← RowStart[0]          r1 ← RowStart[1]
nnz0 ← ptr[r0]            nnz1 ← ptr[r1]
EoR0 ← ptr[r0 + 1]        EoR1 ← ptr[r1 + 1]

while nnz0 < SegmentLength do
    if nnz0 = EoR0 then        (advance to the next row before the update)
        r0 ← r0 + 1
        EoR0 ← ptr[r0 + 1]
    if nnz1 = EoR1 then
        r1 ← r1 + 1
        EoR1 ← ptr[r1 + 1]
    y[r0] ← y[r0] + val[nnz0] ∙ x[ind[nnz0]]
    y[r1] ← y[r1] + val[nnz1] ∙ x[ind[nnz1]]
    nnz0 ← nnz0 + 1
    nnz1 ← nnz1 + 1
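The same two-stream loop rendered in C (a hypothetical sketch under our own naming: the matrix's non-zeros are split at a row boundary r_split, with 0 < r_split < m, and the inner while-loops also skip empty rows):

    void spmv_segscan2(int m, const int *ptr, const int *ind,
                       const double *val, const double *x, double *y,
                       int r_split)   /* first row of the second segment */
    {
        int r0 = 0,              r1 = r_split;
        int n0 = ptr[0],         n1 = ptr[r_split];
        int end0 = ptr[r_split], end1 = ptr[m];
        int eor0 = ptr[r0 + 1],  eor1 = ptr[r1 + 1];
        while (n0 < end0 || n1 < end1) {
            if (n0 < end0) {
                while (n0 == eor0) { r0++; eor0 = ptr[r0 + 1]; } /* row boundary */
                y[r0] += val[n0] * x[ind[n0]];
                n0++;
            }
            if (n1 < end1) {
                while (n1 == eor1) { r1++; eor1 = ptr[r1 + 1]; }
                y[r1] += val[n1] * x[ind[n1]];
                n1++;
            }
        }
    }

Both streams advance in lockstep through independent halves of the non-zero array, so a wide machine can keep two multiply-add chains busy regardless of row lengths.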

Page 20

Measuring Performance

• Measure Dense Performance(r,c)
  – Performance (Mflop/s) of a dense matrix stored in sparse r x c blocked format

• Estimate Fill Ratio(r,c) for all r,c
  – Fill Ratio(r,c) = (number of stored values) / (number of true non-zeros)

• Choose the r,c that maximizes
  Estimated Performance(r,c) = Dense Performance(r,c) / Fill Ratio(r,c)
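The selection step is then a simple argmax over the profiled grid. A sketch, where dense_mflops and fill_ratio are assumed precomputed and the search bound RMAX/CMAX is an illustrative choice (SPARSITY-style tuners search small block sizes, e.g. up to 8 or 12):

    #define RMAX 8
    #define CMAX 8

    /* pick the (r, c) maximizing dense_mflops[r][c] / fill_ratio[r][c] */
    void choose_block_size(const double dense_mflops[RMAX + 1][CMAX + 1],
                           const double fill_ratio[RMAX + 1][CMAX + 1],
                           int *best_r, int *best_c)
    {
        double best = 0.0;
        for (int r = 1; r <= RMAX; r++)
            for (int c = 1; c <= CMAX; c++) {
                double est = dense_mflops[r][c] / fill_ratio[r][c];
                if (est > best) { best = est; *best_r = r; *best_c = c; }
            }
    }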

Page 21

References

1. G. Blelloch, M. Heroux, and M. Zagha. Segmented operations for sparse matrix computation on vector multiprocessors. Technical Report CMU-CS-93-173, Carnegie Mellon University, 1993.

2. E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.

3. E.-J. Im, K. A. Yelick, and R. Vuduc. SPARSITY: Framework for optimizing sparse matrix-vector multiply. International Journal of High Performance Computing Applications, 18(1):135–158, February 2004.

4. R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick. Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply. Technical Report UCB/CSD-04-1335, University of California, Berkeley, Berkeley, CA, USA, June 2004.

5. Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report 90-20, NASA Ames Research Center, Moffett Field, CA, 1990.

6. A. Schwaighofer. A MATLAB interface to SVMlight (version 4.0). http://www.cis.tugraz.at/igi/aschwaig/software.html, 2004.

7. R. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, December 2003.

8. R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing. (to appear).