Sparse Matrix Vector Multiply Algorithms and Optimizations on Modern Architectures
Ankit Jain, Vasily Volkov
CS252 Final Presentation
5/9/2007
SpM×V and its Applications
• Sparse Matrix Vector Multiply (SpM×V): y ← y + A·x
– x, y are dense vectors
• x: source vector
• y: destination vector
– A is a sparse matrix (<1% of entries are nonzero)
• Applications employing SpM×V in the inner loop
– Least Squares Problems
– Eigenvalue Problems
[Figure: sparse matrix A, source vector x, destination vector y]
Storing a Matrix in Memory
Compressed Sparse Row Data Structure and Algorithm
type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
  for l = ptr[i] to ptr[i+1] - 1 do
    y[i] ← y[i] + val[l] · x[ind[l]]
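For concreteness, here is a minimal C sketch of this CSR kernel (the function name spmv_csr, the double/int types, and the parameter m for the number of rows are choices made for this sketch, not taken from the slides):

#include <stddef.h>

/* CSR SpMV sketch: y <- y + A*x.
 * val[k]   : non-zero values
 * ind[k]   : column index of each non-zero
 * ptr[m+1] : start of each row in val/ind */
void spmv_csr(size_t m, const double *val, const int *ind,
              const int *ptr, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double yi = y[i];
        for (int l = ptr[i]; l < ptr[i + 1]; l++)
            yi += val[l] * x[ind[l]];   /* indirect, irregular access to x */
        y[i] = yi;
    }
}

Note the indirect access x[ind[l]]: it is this irregularity, plus the single flop per loaded value and index, that limits performance.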
What’s so hard about it?
• Reasons for poor performance of the naïve implementation
  – Poor locality (indirect and irregular memory accesses)
    • Limited by the speed of main memory
  – Poor instruction mix (low ratio of flops to memory operations)
  – Algorithm depends on the non-zero structure of the matrix
    • Dense matrices vs. sparse matrices
Register-Level Blocking (SPARSITY): 3x3 Example
BCSR with uniform, aligned grid
Fill-in zeros: trade-off extra ops for better efficiency
Blocked Compressed Sparse Row
• The inner loop performs a floating-point multiply-add for every non-zero in a block, rather than for a single non-zero per iteration (see the sketch below)
• Reduces the number of times elements of the source vector x must be re-loaded from memory
• Reduces the number of column indices that have to be stored and loaded (one per block instead of one per non-zero)
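As an illustration, here is a minimal C sketch of a 2×2 BCSR kernel (this is not the generated SPARSITY/OSKI code; the array names b_val, b_ind, b_ptr and the row-major layout within a block are assumptions of this sketch):

/* 2x2 register-blocked (BCSR) SpMV sketch: y <- y + A*x.
 * b_val : 4 values per block, row-major within the block
 * b_ind : starting column of each block (one index per block)
 * b_ptr : start of each block row in b_val/b_ind
 * mb    : number of block rows (the matrix has 2*mb rows)   */
void spmv_bcsr_2x2(int mb, const double *b_val, const int *b_ind,
                   const int *b_ptr, const double *x, double *y)
{
    for (int I = 0; I < mb; I++) {
        double y0 = y[2 * I], y1 = y[2 * I + 1];   /* kept in registers */
        for (int l = b_ptr[I]; l < b_ptr[I + 1]; l++) {
            const double *b = &b_val[4 * l];
            double x0 = x[b_ind[l]], x1 = x[b_ind[l] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I] = y0;
        y[2 * I + 1] = y1;
    }
}

Each inner-loop iteration performs four multiply-adds but loads only one column index, and x0/x1 are reused across both rows of the block.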
The Payoff: Speedups on Itanium 2
[Plot: Mflop/s of the reference implementation vs. the best register-blocked version (best block size: 4x2)]
Explicit Software Pipelining
ORIGINAL CODE:
type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
  for l = ptr[i] to ptr[i+1] - 1 do
    y[i] ← y[i] + val[l] · x[ind[l]]
SOFTWARE PIPELINED CODE
type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
  for l = ptr[i] to ptr[i+1] - 1 do
    y[i] ← y[i] + val_1 · x_1
    val_1 ← val[l + 1]
    x_1 ← x[ind_2]
    ind_2 ← ind[l + 2]
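A minimal C sketch of the same idea with a pipeline depth of one follows (this is not the exact Itanium code from the slides; the depth-1 scheme, the prologue/epilogue handling, and the variable names are choices made for this sketch):

/* Software-pipelined CSR inner loop: the operands of iteration l are
 * loaded during iteration l-1, hiding part of the load latency behind
 * the multiply-add. */
void spmv_csr_pipelined(int m, const double *val, const int *ind,
                        const int *ptr, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        int lo = ptr[i], hi = ptr[i + 1];
        if (lo == hi) continue;              /* skip empty rows */
        double yi = y[i];
        double v  = val[lo];                 /* prologue: first operands */
        double xv = x[ind[lo]];
        for (int l = lo; l < hi - 1; l++) {
            double v_next  = val[l + 1];     /* issue next loads early */
            double xv_next = x[ind[l + 1]];
            yi += v * xv;                    /* consume operands loaded earlier */
            v  = v_next;
            xv = xv_next;
        }
        yi += v * xv;                        /* epilogue: last non-zero */
        y[i] = yi;
    }
}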
Explicit Software Prefetching
ORIGINAL CODE:
type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
  for l = ptr[i] to ptr[i+1] - 1 do
    y[i] ← y[i] + val[l] · x[ind[l]]
SOFTWARE PREFETCHED CODE
type val : real[k]
type ind : int[k]
type ptr : int[m+1]

foreach row i do
  for l = ptr[i] to ptr[i+1] - 1 do
    y[i] ← y[i] + val[l] · x[ind[l]]
    pref(NTA, pref_v_amt + &val[l])
    pref(NTA, pref_i_amt + &ind[l])
    pref(NONE, &x[ind[l + pref_x_amt]])
*NTA: prefetch with no temporal locality at any cache level
*NONE: prefetch with temporal locality at the highest cache level
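A C sketch of the same technique using GCC's __builtin_prefetch (the Itanium 2 code in the slides would use lfetch-style prefetch intrinsics instead; the prefetch distances PREF_V, PREF_I, PREF_X are illustrative placeholders that would normally be tuned):

/* CSR SpMV with explicit software prefetching.
 * Locality hint 0 ~ NTA (no reuse expected: val and ind are streamed);
 * locality hint 3 ~ keep in cache (x may be reused).
 * Prefetches slightly past the end of val/ind do not fault. */
#define PREF_V 16   /* elements ahead in val */
#define PREF_I 16   /* elements ahead in ind */
#define PREF_X 8    /* non-zeros ahead in x  */

void spmv_csr_prefetch(int m, const double *val, const int *ind,
                       const int *ptr, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double yi = y[i];
        int hi = ptr[i + 1];
        for (int l = ptr[i]; l < hi; l++) {
            yi += val[l] * x[ind[l]];
            __builtin_prefetch(&val[l + PREF_V], 0, 0);
            __builtin_prefetch(&ind[l + PREF_I], 0, 0);
            if (l + PREF_X < hi)             /* stay inside this row's indices */
                __builtin_prefetch(&x[ind[l + PREF_X]], 0, 3);
        }
        y[i] = yi;
    }
}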
Characteristics of Modern Architectures
• Highly set-associative caches
  – 4-way L1, 8-way L2, 12-way L3 on Itanium 2
• Multiple load/store units
• Multiple execution units
  – Six integer execution units on Itanium 2
  – Two floating-point multiply-add execution units on Itanium 2
Question: What if we broke the matrix into multiple streams of execution?
Parallel SpMV
• Run different rows in different threads
• Can we do that on data-parallel architectures (SIMD/VLIW, Itanium/GPU)?
  – What if rows have different lengths?
  – One row finishes while others are still running
    • Waiting threads keep processors idle
  – Can we avoid this idleness?
    • Standard solution: segmented scan
Segmented Scan
• Multiple segments (streams) of simultaneous execution
• A single loop with branches inside to check whether each segment has reached the end of its row
  – Reduces loop overhead
  – Good if the average number of non-zeros per row is small
• Changes the memory access pattern and can use caches more efficiently for some matrices
  – Future work: pass SpM×V through a cache simulator to observe cache behavior
Itanium 2 Results (1.3 GHz, Millennium Cluster)
Implementation                  Speedup over reference
Reference                       1
BCSR                            2.96
BCSR Prefetched                 4.25
BCSR Pipelined                  2.88
BCSR Pipelined & Prefetched     4.25
BSS                             3.22
BSS Prefetched                  6.09
BSS Pipelined                   2.78
BSS Pipelined & Prefetched      4.19
Conclusions & Future Work
• The optimizations studied pay off and should be incorporated into OSKI
• Develop parallel / multicore versions
  – Dual-core, dual-socket Opterons, etc.
Questions?
Extra Slides
Algorithm #2: Segmented Scan
1x1x2 Segmented Scan Code
type val : real[k]
type ind : int[k]
type ptr : int[m+1]
type RowStart : int[VectorLength]

r0 ← RowStart[0]
r1 ← RowStart[1]
nnz0 ← ptr[r0]
nnz1 ← ptr[r1]
EoR0 ← ptr[r0+1]
EoR1 ← ptr[r1+1]

while nnz0 < SegmentLength do
  y[r0] ← y[r0] + val[nnz0] · x[ind[nnz0]]
  y[r1] ← y[r1] + val[nnz1] · x[ind[nnz1]]
  if (nnz0 = EoR0)
    r0++
    EoR0 ← ptr[r0+1]
  if (nnz1 = EoR1)
    r1++
    EoR1 ← ptr[r1+1]
  nnz0 ← nnz0 + 1
  nnz1 ← nnz1 + 1
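A C sketch of this two-stream variant follows (names such as row_start and seg_len are placeholders; it assumes the non-zeros are split into two segments of seg_len non-zeros each, with each segment starting on a row boundary):

/* Two-stream segmented-scan SpMV sketch: two independent streams walk
 * their own segment of non-zeros in one interleaved loop, so a short row
 * in one stream does not leave the other stream idle. */
void spmv_segscan_2(const double *val, const int *ind, const int *ptr,
                    const int row_start[2], int seg_len,
                    const double *x, double *y)
{
    int r0 = row_start[0], r1 = row_start[1];
    int n0 = ptr[r0],      n1 = ptr[r1];      /* current non-zero per stream */
    int e0 = ptr[r0 + 1],  e1 = ptr[r1 + 1];  /* end of current row */

    for (int step = 0; step < seg_len; step++) {
        /* advance each stream to the row owning its next non-zero
           (the while-loops also skip empty rows) */
        while (n0 == e0) { r0++; e0 = ptr[r0 + 1]; }
        while (n1 == e1) { r1++; e1 = ptr[r1 + 1]; }
        y[r0] += val[n0] * x[ind[n0]];        /* one multiply-add per stream */
        y[r1] += val[n1] * x[ind[n1]];
        n0++;
        n1++;
    }
}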
Measuring Performance
• Measure Dense Performance(r, c)
  – Performance (Mflop/s) of a dense matrix stored in sparse r×c blocked format
• Estimate Fill Ratio(r, c) for all r, c
  – Fill Ratio(r, c) = (number of stored values) / (number of true non-zeros)
• Choose the r, c that maximize
  – Estimated Performance(r, c) = Dense Performance(r, c) / Fill Ratio(r, c)
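A small C sketch of this selection heuristic (the array names dense_mflops and fill_ratio and the bounds RMAX/CMAX are placeholders; dense_mflops would be measured offline per machine, fill_ratio estimated per matrix):

#define RMAX 8   /* largest row block size considered    */
#define CMAX 8   /* largest column block size considered */

/* Pick the (r, c) that maximizes Dense Performance(r,c) / Fill Ratio(r,c). */
void choose_block_size(const double dense_mflops[RMAX][CMAX],
                       const double fill_ratio[RMAX][CMAX],
                       int *best_r, int *best_c)
{
    double best = 0.0;
    *best_r = 1;
    *best_c = 1;
    for (int r = 0; r < RMAX; r++) {
        for (int c = 0; c < CMAX; c++) {
            double est = dense_mflops[r][c] / fill_ratio[r][c];
            if (est > best) {
                best = est;
                *best_r = r + 1;   /* block sizes run from 1 to RMAX/CMAX */
                *best_c = c + 1;
            }
        }
    }
}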
References
1. G. Blelloch, M. Heroux, and M. Zagha. Segmented operations for sparse matrix computation on vector multiprocessors. Technical Report CMU-CS-93-173, Carnegie Mellon University, 1993.
2. E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.
3. E.-J. Im, K. A. Yelick, and R. Vuduc. SPARSITY: Framework for optimizing sparse matrix-vector multiply. International Journal of High Performance Computing Applications, 18(1):135–158, February 2004.
4. R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick. Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply. Technical Report UCB/CSD-04-1335, University of California, Berkeley, Berkeley, CA, USA, June 2004.
5. Y. Saad. SPARSKIT: A basic tool kit for sparse matrix computations. Technical Report 90-20, NASA Ames Research Center, Moffett Field, CA, 1990.
6. A. Schwaighofer. A matlab interface to svm light to version 4.0. http://www.cis.tugraz.at/igi/aschwaig/software.html, 2004.
7. R. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, University of California, Berkeley, December 2003.
8. R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing. (to appear).