
Optimizing the Performance of Sparse Matrix-Vector Multiplication


Page 1: Optimizing the Performance of Sparse Matrix-Vector Multiplication

Eun-Jin Im, U.C. Berkeley
6/13/00

Page 2: Overview

- Motivation
- Optimization techniques: Register Blocking, Cache Blocking, Multiple Vectors
- Sparsity system
- Related Work
- Contribution
- Conclusion

Page 3: Motivation : Usage

Sparse matrix-vector multiplication: y = A x, where A is a sparse matrix and x and y are dense vectors.

Uses of this operation:
- Iterative solvers
- Explicit methods
- Eigenvalue and singular value problems

Applications in structural modeling, fluid dynamics, document retrieval (Latent Semantic Indexing), and many other simulation areas.

Page 4: Motivation : Performance (1)

Matrix-vector multiplication (BLAS2) is slower than matrix-matrix multiplication (BLAS3). For example, on a 167 MHz UltraSPARC I:
- Vendor-optimized matrix-vector multiplication: 57 Mflops
- Vendor-optimized matrix-matrix multiplication: 185 Mflops

The reason: a lower ratio of floating point operations to memory operations.

Page 5: Motivation : Performance (2)

Sparse matrix operations are slower than dense matrix operations. For example, on a 167 MHz UltraSPARC I:
- Dense matrix-vector multiplication: 38 Mflops (naïve implementation), 57 Mflops (vendor-optimized implementation)
- Sparse matrix-vector multiplication (naïve implementation): 5.7 - 25 Mflops

The reason: the indirect data structure leads to inefficient memory accesses.

Page 6: Motivation : Optimized Libraries

- Old approach: hand-optimized libraries, such as vendor-supplied BLAS and LAPACK
- New approach: automatic generation of libraries
  - PHiPAC (dense linear algebra)
  - ATLAS (dense linear algebra)
  - FFTW (fast Fourier transforms)
- Our approach: automatic generation of libraries for sparse matrices, which adds a new dimension to the problem: the nonzero structure of the matrix

Page 7: Sparse Matrix Formats

There are a large number of sparse matrix formats.
- Point-entry: Coordinate (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), Sparse Diagonal (DIA), ...
- Block-entry: Block Coordinate (BCO), Block Sparse Row (BSR), Block Sparse Column (BSC), Block Diagonal (BDI), Variable Block Compressed Sparse Row (VBR), ...

Page 8: Compressed Sparse Row Format

We internally use the CSR format because it is a relatively efficient format.
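As a concrete reference, here is a minimal C sketch of the multiply over CSR; the array and function names are illustrative, not Sparsity's actual interface:

```c
/* y = A*x for an m-row matrix A in CSR format.
 * row_start[i] .. row_start[i+1]-1 index the nonzeros of row i;
 * col_idx[k] is the column of values[k].
 * Names are illustrative, not Sparsity's actual interface. */
void spmv_csr(int m, const int *row_start, const int *col_idx,
              const double *values, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = row_start[i]; k < row_start[i + 1]; k++)
            sum += values[k] * x[col_idx[k]];  /* indirect access to x */
        y[i] = sum;
    }
}
```

The indirect access x[col_idx[k]] is exactly the inefficient memory access pattern cited on page 5.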

Page 9: Optimization Techniques

- Register Blocking
- Cache Blocking
- Multiple Vectors

Page 10: Register Blocking

Blocked Compressed Sparse Row (BCSR) format. Advantages of the format:
- Better temporal locality in registers
- The multiplication loop can be unrolled for better performance

Example: a 4 x 6 matrix whose nonzeros fall into four 2 x 2 blocks is stored as
- block row pointers: 0 2 4
- block column indices: 0 4 2 4
- block values: A00 A01 A10 A11 | A04 0 0 A15 | A22 0 A32 A33 | A25 0 A34 A35

(Each 2 x 2 block is stored densely, so explicit zeros appear where a block is not full.)
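The corresponding 2 x 2 BCSR multiply might look like the following sketch, assuming the array layout above; the inner block multiply is fully unrolled:

```c
/* y = A*x for A in 2x2 BCSR: b_row_start indexes block rows,
 * b_col_idx holds each block's starting column, and values stores
 * each 2x2 block's four entries contiguously in row-major order.
 * A sketch assuming both matrix dimensions are multiples of 2. */
void spmv_bcsr_2x2(int m_blocks, const int *b_row_start,
                   const int *b_col_idx, const double *values,
                   const double *x, double *y)
{
    for (int i = 0; i < m_blocks; i++) {
        double y0 = 0.0, y1 = 0.0;             /* held in registers */
        for (int k = b_row_start[i]; k < b_row_start[i + 1]; k++) {
            const double *b = values + 4 * k;  /* this 2x2 block    */
            const double *xp = x + b_col_idx[k];
            y0 += b[0] * xp[0] + b[1] * xp[1]; /* unrolled multiply */
            y1 += b[2] * xp[0] + b[3] * xp[1];
        }
        y[2 * i]     = y0;
        y[2 * i + 1] = y1;
    }
}
```

Compared with the CSR loop on page 8, the two destination values and the two source values stay in registers across each block, which is the temporal locality the slide refers to.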

Page 11: Register Blocking : Fill Overhead

We use a uniform block size, which adds fill overhead: explicit zeros are stored to complete each block.

fill overhead = (stored entries, including explicit zeros) / (original nonzeros); in the slide's example, 12/7 ≈ 1.71.

This increases both the space and the number of floating point operations.
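The fill overhead of a candidate block size can be computed by counting the r x c blocks that contain at least one nonzero, as in the sketch below. Sparsity estimates this quantity (page 19) rather than scanning every nonzero, so the exhaustive version here is only for clarity:

```c
#include <stdlib.h>

/* Fill overhead of storing a CSR matrix with uniform r x c blocks:
 *   (nonempty blocks * r * c) / nnz.
 * Exhaustive scan for clarity; Sparsity estimates this by sampling. */
double fill_overhead(int m, int n, const int *row_start,
                     const int *col_idx, int r, int c)
{
    int n_block_cols = (n + c - 1) / c;
    char *touched = calloc(n_block_cols, 1);
    long blocks = 0, nnz = row_start[m];

    for (int bi = 0; bi < (m + r - 1) / r; bi++) {    /* each block row */
        int lo = bi * r, hi = (bi + 1) * r < m ? (bi + 1) * r : m;
        for (int i = lo; i < hi; i++)
            for (int k = row_start[i]; k < row_start[i + 1]; k++)
                if (!touched[col_idx[k] / c]) {       /* new block */
                    touched[col_idx[k] / c] = 1;
                    blocks++;
                }
        for (int i = lo; i < hi; i++)                 /* reset marks */
            for (int k = row_start[i]; k < row_start[i + 1]; k++)
                touched[col_idx[k] / c] = 0;
    }
    free(touched);
    return (double)(blocks * r * c) / nnz;
}
```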

Page 12: Register Blocking : Dense Matrix Profile

[Chart: performance of a dense matrix stored in blocked sparse format on an UltraSPARC I; this profile is the input to the performance model.]

Page 13: Register Blocking : Selecting the Block Size

The hard part of the problem is picking a block size that:
- minimizes the fill overhead, and
- maximizes the raw performance.

Two approaches: exhaustive search, or using a model.

Page 14: Register Blocking : Performance Model

The performance model has two components:
- the multiplication performance of a dense matrix represented in sparse format, and
- the estimated fill overhead.

Predicted performance for block size r x c = (dense r x c blocked performance) / (fill overhead)
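Combining the two components, block-size selection reduces to the following sketch, where dense_mflops is the machine profile from page 12 and fill_overhead is the routine sketched under page 11; the 8 x 8 search space and all names are assumptions of this sketch:

```c
#define MAX_B 8  /* largest block dimension considered (an assumption) */

/* Pick the r x c maximizing
 *   predicted(r,c) = dense_mflops[r][c] / fill(r,c).
 * dense_mflops is the dense profile (page 12); fill_overhead is the
 * estimator sketched earlier. */
void choose_block_size(int m, int n, const int *row_start,
                       const int *col_idx,
                       double dense_mflops[MAX_B + 1][MAX_B + 1],
                       int *best_r, int *best_c)
{
    double best = 0.0;
    *best_r = *best_c = 1;
    for (int r = 1; r <= MAX_B; r++)
        for (int c = 1; c <= MAX_B; c++) {
            double predicted = dense_mflops[r][c]
                / fill_overhead(m, n, row_start, col_idx, r, c);
            if (predicted > best) {
                best = predicted;
                *best_r = r;
                *best_c = c;
            }
        }
}
```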

Page 15: Benchmark Matrices

- Matrix 1: dense matrix (1000 x 1000)
- Matrices 2-17: Finite Element Method matrices
- Matrices 18-39: matrices from structural engineering and device simulation
- Matrices 40-44: linear programming matrices
- Matrix 45: document retrieval matrix used for Latent Semantic Indexing
- Matrix 46: random matrix (10000 x 10000, 0.15% nonzero)

Page 16: Register Blocking : Performance

The optimization is most effective on the FEM matrices and the dense matrix (the lower-numbered matrices).

Page 17: Register Blocking : Performance

Speedup is generally best on the MIPS R10000, where the result is competitive with dense BLAS performance (DGEMV/DGEMM = 0.38).

Page 18: Register Blocking : Validation of the Performance Model

Comparison to the performance of exhaustive search (yellow bars, block sizes in the lower row) on a subset of the benchmark matrices. The exhaustive search does not produce a much better result.

Page 19: Register Blocking : Overhead

Pre-computation overhead:
- estimating the fill overhead (red bars)
- reorganizing the matrix (yellow bars)

The ratio gives the number of repetitions of the multiply needed for the optimization to be beneficial.

Page 20: Cache Blocking

Cache blocking exploits temporal locality in accesses to the source vector x.

[Figure: the matrix is partitioned into rectangular blocks; each block multiplies a slice of the source vector x into a slice of the destination vector y, and the figure shows how the blocks are laid out in memory.]
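A sketch of the cache-blocked multiply, assuming the matrix has been reorganized into a grid of rectangular blocks, each stored as a small CSR with block-local indices; the csr_block layout is an assumption of this sketch, not Sparsity's actual data structure:

```c
/* One CSR-stored cache block with indices local to the block. */
typedef struct {
    int rows;
    const int *row_start;
    const int *col_idx;
    const double *values;
} csr_block;

/* y += A*x over an nbr x nbc grid of cache blocks. Each block reads
 * only a block_cols-long slice of x, so that slice stays in cache.
 * The caller must zero y first, since block rows accumulate into it. */
void spmv_cache_blocked(int nbr, int nbc, int block_rows, int block_cols,
                        csr_block blocks[nbr][nbc],
                        const double *x, double *y)
{
    for (int bi = 0; bi < nbr; bi++)
        for (int bj = 0; bj < nbc; bj++) {
            const csr_block *b = &blocks[bi][bj];
            const double *xb = x + bj * block_cols;  /* cached slice */
            double *yb = y + bi * block_rows;
            for (int i = 0; i < b->rows; i++) {
                double sum = yb[i];
                for (int k = b->row_start[i]; k < b->row_start[i + 1]; k++)
                    sum += b->values[k] * xb[b->col_idx[k]];
                yb[i] = sum;
            }
        }
}
```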

Page 21: Cache Blocking : Performance

The MIPS speedup is generally better: it has a larger cache and a larger miss penalty (26/589 ns for the MIPS, 36/268 ns for the UltraSPARC). The exceptions are the document retrieval and random matrices.

Page 22: Cache Blocking : Performance on the Document Retrieval Matrix

Document retrieval matrix: 10K x 256K with 37M nonzeros; SVD is applied to it for LSI (Latent Semantic Indexing). The nonzero elements are spread across the matrix, with no dense clusters. Performance peaks at a 16K x 16K cache block, with a speedup of 3.1.

Page 23: Cache Blocking : When and How to Use It

- From the experiments, the matrices for which cache blocking is most effective are large and "random".
- We developed a measure of the "randomness" of a matrix.
- We search at a coarse grain to choose the cache block size.

Page 24: Combination of Register and Cache Blocking : UltraSPARC

The combination is rarely beneficial, and is often slower than either optimization alone.

Page 25: Combination of Register and Cache Blocking : MIPS

[Chart: combined register and cache blocking performance on the MIPS.]

Page 26: Multiple Vector Multiplication

Multiplying by several vectors at once gives a better chance for optimization: BLAS2 vs. BLAS3.

[Figure: repetition of the single-vector case vs. the multiple-vector case.]
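A sketch of the multiple-vector loop over CSR: each matrix element is loaded once and used against all k vectors, raising the ratio of floating point operations to memory operations toward BLAS3 behavior. The column-major layout of X and Y is an assumption of this sketch:

```c
/* Y(:,v) = A * X(:,v) for v = 0..k-1, with A in CSR and X, Y stored
 * column-major with leading dimensions ldx and ldy. Each values[p]
 * is loaded once and reused k times. */
void spmv_csr_multivec(int m, int k, const int *row_start,
                       const int *col_idx, const double *values,
                       const double *X, int ldx, double *Y, int ldy)
{
    for (int i = 0; i < m; i++) {
        for (int v = 0; v < k; v++)
            Y[v * ldy + i] = 0.0;
        for (int p = row_start[i]; p < row_start[i + 1]; p++) {
            double a = values[p];        /* loaded once...    */
            int j = col_idx[p];
            for (int v = 0; v < k; v++)  /* ...reused k times */
                Y[v * ldy + i] += a * X[v * ldx + j];
        }
    }
}
```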

Page 27: Multiple Vector Multiplication : Performance

[Charts: register blocking performance and cache blocking performance with multiple vectors.]

Page 28: Multiple Vector Multiplication : Register Blocking Performance

The speedup is larger than for single-vector register blocking. Even the matrices that did not speed up before improved (the middle group on the UltraSPARC).

Page 29: Multiple Vector Multiplication : Cache Blocking Performance

There is a noticeable speedup for the matrices that did not speed up before (UltraSPARC). The best block sizes are much smaller than those for single-vector cache blocking.

[Charts: UltraSPARC and MIPS.]

Page 30: Sparsity System : Purpose

- Guide the choice of optimization
- Automatically select optimization parameters such as block size and number of vectors

http://comix.cs.berkeley.edu/~ejim/sparsity

Page 31: Sparsity System : Organization

[Diagram: the Sparsity Machine Profiler produces a machine performance profile; the Sparsity Optimizer takes that profile, an example matrix, and the maximum number of vectors, and produces optimized code and drivers.]

Page 32: Summary : Speedup of Sparsity on UltraSPARC

On the UltraSPARC, speedup is up to 3x for a single vector and up to 4.7x for multiple vectors.

[Charts: single vector and multiple vector.]

Page 33: Summary : Speedup of Sparsity on MIPS

On the MIPS, speedup is up to 3x for a single vector and up to 6x for multiple vectors.

[Charts: single vector and multiple vector.]

Page 34: Summary : Overhead of Sparsity Optimization

The number of iterations needed to amortize the optimization is

    number of iterations = overhead time / time saved per iteration

The BLAS Technical Forum includes a parameter in the matrix creation routine to indicate how many times the operation will be performed.

Page 35: Related Work (1)

- Dense matrix optimization
  - Loop transformations by compilers: M. Wolf, etc.
  - Hand-optimized libraries: BLAS, LAPACK
- Automatic generation of libraries: PHiPAC, ATLAS, and FFTW
- Sparse matrix standardization and libraries: BLAS Technical Forum; NIST Sparse BLAS, MV++, SparseLib++, TNT
- Hand optimization of sparse matrix-vector multiplication: S. Toledo, Oliker et al.

Page 36: Related Work (2)

- Sparse matrix packages: SPARSKIT, PSPARSELIB, Aztec, BlockSolve95, Spark98
- Compiling sparse matrix code: sparse compiler (Bik), Bernoulli compiler (Kotlyar)
- On-demand code generation: NIST Sparse BLAS, sparse compiler

Page 37: Contribution

- A thorough investigation of memory hierarchy optimizations for sparse matrix-vector multiplication
- A performance study on benchmark matrices
- A performance model for choosing optimization parameters
- The Sparsity system for automatic tuning and code generation of sparse matrix-vector multiplication

Page 38: Conclusion

- Memory hierarchy optimizations for sparse matrix-vector multiplication:
  - Register blocking benefits matrices with dense local structure.
  - Cache blocking benefits large matrices with random structure.
  - Multiple-vector multiplication improves performance further by reusing matrix elements.
- The choice of optimization depends on both the matrix structure and the machine architecture.
- The automated system helps with this complicated and time-consuming process.