
Optimizing the Performance of Sparse Matrix-Vector Multiplication


Page 1: Optimizing the Performance of Sparse Matrix-Vector Multiplication

Eun-Jin Im, U.C. Berkeley
6/13/00

Page 2: Overview

- Motivation
- Optimization techniques: Register Blocking, Cache Blocking, Multiple Vectors
- Sparsity system
- Related Work
- Contribution
- Conclusion

Page 3: Motivation : Usage

Sparse matrix-vector multiplication: y = A x, where A is a sparse matrix and x and y are dense vectors.

Uses of this operation:
- Iterative solvers
- Explicit methods
- Eigenvalue and singular value problems

Applications in structural modeling, fluid dynamics, document retrieval (Latent Semantic Indexing), and many other simulation areas.

Page 4: Motivation : Performance (1)

Matrix-vector multiplication (BLAS2) is slower than matrix-matrix multiplication (BLAS3). For example, on a 167 MHz UltraSPARC I:
- Vendor-optimized matrix-vector multiplication: 57 Mflops
- Vendor-optimized matrix-matrix multiplication: 185 Mflops

The reason: a lower ratio of floating point operations to memory operations.

Page 5: Motivation : Performance (2)

Sparse matrix operations are slower than dense matrix operations. For example, on a 167 MHz UltraSPARC I:
- Dense matrix-vector multiplication: 38 Mflops (naïve implementation), 57 Mflops (vendor-optimized implementation)
- Sparse matrix-vector multiplication (naïve implementation): 5.7 - 25 Mflops

The reason: the indirect data structure leads to inefficient memory accesses.

Page 6: Motivation : Optimized Libraries

- Old approach: hand-optimized libraries, such as vendor-supplied BLAS and LAPACK
- New approach: automatic generation of libraries
  - PHiPAC (dense linear algebra)
  - ATLAS (dense linear algebra)
  - FFTW (fast Fourier transforms)
- Our approach: automatic generation of libraries for sparse matrices, which adds a new dimension to the problem: the nonzero structure of the matrix

Page 7: Sparse Matrix Formats

There are a large number of sparse matrix formats.
- Point-entry: Coordinate (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), Sparse Diagonal (DIA), ...
- Block-entry: Block Coordinate (BCO), Block Sparse Row (BSR), Block Sparse Column (BSC), Block Diagonal (BDI), Variable Block Compressed Sparse Row (VBR), ...

Page 8: Compressed Sparse Row Format

We internally use the CSR format because it is a relatively efficient format.
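As a concrete reference, here is a minimal C sketch of the multiply over CSR; the array and function names are illustrative, not Sparsity's actual interface:

```c
/* y = A*x for an m-row matrix A in CSR format.
 * row_start[i] .. row_start[i+1]-1 index the nonzeros of row i;
 * col_idx[k] is the column of values[k].
 * Names are illustrative, not Sparsity's actual interface. */
void spmv_csr(int m, const int *row_start, const int *col_idx,
              const double *values, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = row_start[i]; k < row_start[i + 1]; k++)
            sum += values[k] * x[col_idx[k]];  /* indirect access to x */
        y[i] = sum;
    }
}
```

The indirect access x[col_idx[k]] is exactly the inefficient memory access pattern cited on page 5.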

Page 9: Optimization Techniques

- Register Blocking
- Cache Blocking
- Multiple Vectors

Page 10: Register Blocking

Blocked Compressed Sparse Row (BCSR) format. Advantages of the format:
- Better temporal locality in registers
- The multiplication loop can be unrolled for better performance

Example: a 4 x 6 matrix whose nonzeros fall into four 2 x 2 blocks is stored as
- block row pointers: 0 2 4
- block column indices: 0 4 2 4
- block values: A00 A01 A10 A11 | A04 0 0 A15 | A22 0 A32 A33 | A25 0 A34 A35

(Each 2 x 2 block is stored densely, so explicit zeros appear where a block is not full.)
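The corresponding 2 x 2 BCSR multiply might look like the following sketch, assuming the array layout above; the inner block multiply is fully unrolled:

```c
/* y = A*x for A in 2x2 BCSR: b_row_start indexes block rows,
 * b_col_idx holds each block's starting column, and values stores
 * each 2x2 block's four entries contiguously in row-major order.
 * A sketch assuming both matrix dimensions are multiples of 2. */
void spmv_bcsr_2x2(int m_blocks, const int *b_row_start,
                   const int *b_col_idx, const double *values,
                   const double *x, double *y)
{
    for (int i = 0; i < m_blocks; i++) {
        double y0 = 0.0, y1 = 0.0;             /* held in registers */
        for (int k = b_row_start[i]; k < b_row_start[i + 1]; k++) {
            const double *b = values + 4 * k;  /* this 2x2 block    */
            const double *xp = x + b_col_idx[k];
            y0 += b[0] * xp[0] + b[1] * xp[1]; /* unrolled multiply */
            y1 += b[2] * xp[0] + b[3] * xp[1];
        }
        y[2 * i]     = y0;
        y[2 * i + 1] = y1;
    }
}
```

Compared with the CSR loop on page 8, the two destination values and the two source values stay in registers across each block, which is the temporal locality the slide refers to.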

Page 11: Register Blocking : Fill Overhead

We use a uniform block size, which adds fill overhead: explicit zeros are stored to complete each block.

fill overhead = (stored entries, including explicit zeros) / (original nonzeros); in the slide's example, 12/7 ≈ 1.71.

This increases both the space and the number of floating point operations.
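The fill overhead of a candidate block size can be computed by counting the r x c blocks that contain at least one nonzero, as in the sketch below. Sparsity estimates this quantity (page 19) rather than scanning every nonzero, so the exhaustive version here is only for clarity:

```c
#include <stdlib.h>

/* Fill overhead of storing a CSR matrix with uniform r x c blocks:
 *   (nonempty blocks * r * c) / nnz.
 * Exhaustive scan for clarity; Sparsity estimates this by sampling. */
double fill_overhead(int m, int n, const int *row_start,
                     const int *col_idx, int r, int c)
{
    int n_block_cols = (n + c - 1) / c;
    char *touched = calloc(n_block_cols, 1);
    long blocks = 0, nnz = row_start[m];

    for (int bi = 0; bi < (m + r - 1) / r; bi++) {    /* each block row */
        int lo = bi * r, hi = (bi + 1) * r < m ? (bi + 1) * r : m;
        for (int i = lo; i < hi; i++)
            for (int k = row_start[i]; k < row_start[i + 1]; k++)
                if (!touched[col_idx[k] / c]) {       /* new block */
                    touched[col_idx[k] / c] = 1;
                    blocks++;
                }
        for (int i = lo; i < hi; i++)                 /* reset marks */
            for (int k = row_start[i]; k < row_start[i + 1]; k++)
                touched[col_idx[k] / c] = 0;
    }
    free(touched);
    return (double)(blocks * r * c) / nnz;
}
```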

Page 12: Register Blocking : Dense Matrix Profile

[Chart: performance of a dense matrix stored in blocked sparse format on an UltraSPARC I; this profile is the input to the performance model.]

Page 13: Register Blocking : Selecting the Block Size

The hard part of the problem is picking a block size that:
- minimizes the fill overhead, and
- maximizes the raw performance.

Two approaches: exhaustive search, or using a model.

Page 14: Register Blocking : Performance Model

The performance model has two components:
- the multiplication performance of a dense matrix represented in sparse format, and
- the estimated fill overhead.

Predicted performance for block size r x c = (dense r x c blocked performance) / (fill overhead)
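Combining the two components, block-size selection reduces to the following sketch, where dense_mflops is the machine profile from page 12 and fill_overhead is the routine sketched under page 11; the 8 x 8 search space and all names are assumptions of this sketch:

```c
#define MAX_B 8  /* largest block dimension considered (an assumption) */

/* Pick the r x c maximizing
 *   predicted(r,c) = dense_mflops[r][c] / fill(r,c).
 * dense_mflops is the dense profile (page 12); fill_overhead is the
 * estimator sketched earlier. */
void choose_block_size(int m, int n, const int *row_start,
                       const int *col_idx,
                       double dense_mflops[MAX_B + 1][MAX_B + 1],
                       int *best_r, int *best_c)
{
    double best = 0.0;
    *best_r = *best_c = 1;
    for (int r = 1; r <= MAX_B; r++)
        for (int c = 1; c <= MAX_B; c++) {
            double predicted = dense_mflops[r][c]
                / fill_overhead(m, n, row_start, col_idx, r, c);
            if (predicted > best) {
                best = predicted;
                *best_r = r;
                *best_c = c;
            }
        }
}
```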

Page 15: Benchmark Matrices

- Matrix 1: dense matrix (1000 x 1000)
- Matrices 2-17: Finite Element Method matrices
- Matrices 18-39: matrices from structural engineering and device simulation
- Matrices 40-44: linear programming matrices
- Matrix 45: document retrieval matrix used for Latent Semantic Indexing
- Matrix 46: random matrix (10000 x 10000, 0.15% nonzero)

Page 16: Register Blocking : Performance

The optimization is most effective on the FEM matrices and the dense matrix (the lower-numbered matrices).

Page 17: Register Blocking : Performance

Speedup is generally best on the MIPS R10000, where the result is competitive with dense BLAS performance (DGEMV/DGEMM = 0.38).

Page 18: Register Blocking : Validation of the Performance Model

Comparison to the performance of exhaustive search (yellow bars, block sizes in the lower row) on a subset of the benchmark matrices. The exhaustive search does not produce a much better result.

Page 19: Register Blocking : Overhead

Pre-computation overhead:
- estimating the fill overhead (red bars)
- reorganizing the matrix (yellow bars)

The ratio gives the number of repetitions of the multiply needed for the optimization to be beneficial.

Page 20: Cache Blocking

Cache blocking exploits temporal locality in accesses to the source vector x.

[Figure: the matrix is partitioned into rectangular blocks; each block multiplies a slice of the source vector x into a slice of the destination vector y, and the figure shows how the blocks are laid out in memory.]
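A sketch of the cache-blocked multiply, assuming the matrix has been reorganized into a grid of rectangular blocks, each stored as a small CSR with block-local indices; the csr_block layout is an assumption of this sketch, not Sparsity's actual data structure:

```c
/* One CSR-stored cache block with indices local to the block. */
typedef struct {
    int rows;
    const int *row_start;
    const int *col_idx;
    const double *values;
} csr_block;

/* y += A*x over an nbr x nbc grid of cache blocks. Each block reads
 * only a block_cols-long slice of x, so that slice stays in cache.
 * The caller must zero y first, since block rows accumulate into it. */
void spmv_cache_blocked(int nbr, int nbc, int block_rows, int block_cols,
                        csr_block blocks[nbr][nbc],
                        const double *x, double *y)
{
    for (int bi = 0; bi < nbr; bi++)
        for (int bj = 0; bj < nbc; bj++) {
            const csr_block *b = &blocks[bi][bj];
            const double *xb = x + bj * block_cols;  /* cached slice */
            double *yb = y + bi * block_rows;
            for (int i = 0; i < b->rows; i++) {
                double sum = yb[i];
                for (int k = b->row_start[i]; k < b->row_start[i + 1]; k++)
                    sum += b->values[k] * xb[b->col_idx[k]];
                yb[i] = sum;
            }
        }
}
```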

Page 21: Cache Blocking : Performance

The MIPS speedup is generally better: it has a larger cache and a larger miss penalty (26/589 ns for the MIPS, 36/268 ns for the UltraSPARC). The exceptions are the document retrieval and random matrices.

Page 22: Cache Blocking : Performance on the Document Retrieval Matrix

Document retrieval matrix: 10K x 256K with 37M nonzeros; SVD is applied to it for LSI (Latent Semantic Indexing). The nonzero elements are spread across the matrix, with no dense clusters. Performance peaks at a 16K x 16K cache block, with a speedup of 3.1.

Page 23: Cache Blocking : When and How to Use It

- From the experiments, the matrices for which cache blocking is most effective are large and "random".
- We developed a measure of the "randomness" of a matrix.
- We search at a coarse grain to choose the cache block size.

Page 24: Combination of Register and Cache Blocking : UltraSPARC

The combination is rarely beneficial, and is often slower than either optimization alone.

Page 25: Combination of Register and Cache Blocking : MIPS

[Chart: combined register and cache blocking performance on the MIPS.]

Page 26: Multiple Vector Multiplication

Multiplying by several vectors at once gives a better chance for optimization: BLAS2 vs. BLAS3.

[Figure: repetition of the single-vector case vs. the multiple-vector case.]
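A sketch of the multiple-vector loop over CSR: each matrix element is loaded once and used against all k vectors, raising the ratio of floating point operations to memory operations toward BLAS3 behavior. The column-major layout of X and Y is an assumption of this sketch:

```c
/* Y(:,v) = A * X(:,v) for v = 0..k-1, with A in CSR and X, Y stored
 * column-major with leading dimensions ldx and ldy. Each values[p]
 * is loaded once and reused k times. */
void spmv_csr_multivec(int m, int k, const int *row_start,
                       const int *col_idx, const double *values,
                       const double *X, int ldx, double *Y, int ldy)
{
    for (int i = 0; i < m; i++) {
        for (int v = 0; v < k; v++)
            Y[v * ldy + i] = 0.0;
        for (int p = row_start[i]; p < row_start[i + 1]; p++) {
            double a = values[p];        /* loaded once...    */
            int j = col_idx[p];
            for (int v = 0; v < k; v++)  /* ...reused k times */
                Y[v * ldy + i] += a * X[v * ldx + j];
        }
    }
}
```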

Page 27: Multiple Vector Multiplication : Performance

[Charts: register blocking performance and cache blocking performance with multiple vectors.]

Page 28: Multiple Vector Multiplication : Register Blocking Performance

The speedup is larger than for single-vector register blocking. Even the matrices that did not speed up before improved (the middle group on the UltraSPARC).

Page 29: Multiple Vector Multiplication : Cache Blocking Performance

There is a noticeable speedup for the matrices that did not speed up before (UltraSPARC). The best block sizes are much smaller than those for single-vector cache blocking.

[Charts: UltraSPARC and MIPS.]

Page 30: Sparsity System : Purpose

- Guide the choice of optimization
- Automatically select optimization parameters such as block size and number of vectors

http://comix.cs.berkeley.edu/~ejim/sparsity

Page 31: Sparsity System : Organization

[Diagram: the Sparsity Machine Profiler produces a machine performance profile; the Sparsity Optimizer takes that profile, an example matrix, and the maximum number of vectors, and produces optimized code and drivers.]

Page 32: Summary : Speedup of Sparsity on UltraSPARC

On the UltraSPARC, speedup is up to 3x for a single vector and up to 4.7x for multiple vectors.

[Charts: single vector and multiple vector.]

Page 33: Summary : Speedup of Sparsity on MIPS

On the MIPS, speedup is up to 3x for a single vector and up to 6x for multiple vectors.

[Charts: single vector and multiple vector.]

Page 34: Summary : Overhead of Sparsity Optimization

The number of iterations needed to amortize the optimization is

    number of iterations = overhead time / time saved per iteration

The BLAS Technical Forum includes a parameter in the matrix creation routine to indicate how many times the operation will be performed.

Page 35: Related Work (1)

- Dense matrix optimization
  - Loop transformations by compilers: M. Wolf, etc.
  - Hand-optimized libraries: BLAS, LAPACK
- Automatic generation of libraries: PHiPAC, ATLAS, and FFTW
- Sparse matrix standardization and libraries: BLAS Technical Forum; NIST Sparse BLAS, MV++, SparseLib++, TNT
- Hand optimization of sparse matrix-vector multiplication: S. Toledo, Oliker et al.

Page 36: Related Work (2)

- Sparse matrix packages: SPARSKIT, PSPARSELIB, Aztec, BlockSolve95, Spark98
- Compiling sparse matrix code: sparse compiler (Bik), Bernoulli compiler (Kotlyar)
- On-demand code generation: NIST Sparse BLAS, sparse compiler

Page 37: Contribution

- A thorough investigation of memory hierarchy optimizations for sparse matrix-vector multiplication
- A performance study on benchmark matrices
- A performance model for choosing optimization parameters
- The Sparsity system for automatic tuning and code generation of sparse matrix-vector multiplication

Page 38: Conclusion

- Memory hierarchy optimizations for sparse matrix-vector multiplication:
  - Register blocking benefits matrices with dense local structure.
  - Cache blocking benefits large matrices with random structure.
  - Multiple-vector multiplication improves performance further by reusing matrix elements.
- The choice of optimization depends on both the matrix structure and the machine architecture.
- The automated system helps with this complicated and time-consuming process.