
CS 267 Sparse Matrices:

Sparse Matrix-Vector Multiply for Iterative Solvers

Kathy Yelick

www.cs.berkeley.edu/~yelick/cs267_sp07


High-end simulation in the physical sciences = 7 numerical methods:

1. Structured Grids (including locally structured grids, e.g. AMR)

2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo

Well-defined targets from algorithmic, software, and architecture standpoint

Phillip Colella’s “Seven dwarfs”

• Add 4 for embedded:
  8. Search/Sort
  9. Finite State Machine
  10. Filter
  11. Combinational logic

• Then covers all 41 EEMBC benchmarks

• Revise 1 for SPEC:
  • 7. Monte Carlo => Easily parallel (to add ray tracing)

• Then covers 26 SPEC benchmarks

Slide from “Defining Software Requirements for Scientific Computing”, Phillip Colella, 2004


ODEs and Sparse Matrices

• All these problems reduce to sparse matrix problems

• Explicit: sparse matrix-vector multiplication (SpMV)
• Implicit: solve a sparse linear system
  • direct solvers (Gaussian elimination)
  • iterative solvers (use sparse matrix-vector multiplication)

• Eigenvalue/vector algorithms may also be explicit or implicit.

• Conclusion: SpMV is key to many ODE problems

• Relatively simple algorithm to study in detail
• Two key problems: locality and load balance


SpMV in Compressed Sparse Row (CSR) Format

Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)·x(j)

for each row i
    for k = ptr[i] to ptr[i+1]-1 do
        y[i] = y[i] + val[k]*x[ind[k]]

[Figure: y = A·x, with the CSR representation of A (row pointers ptr, column indices ind, values val)]

CSR format is one of many possibilities
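For concreteness, here is a minimal C version of the kernel above (a sketch: the array names follow the pseudocode, and the function signature is illustrative, not from any particular library):

    #include <stddef.h>

    /* y = y + A*x for an n-row matrix A in CSR format:
     *   ptr[i] .. ptr[i+1]-1 index the nonzeros of row i,
     *   ind[k] is the column of the k-th nonzero, val[k] its value. */
    void spmv_csr(size_t n, const size_t *ptr, const size_t *ind,
                  const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            double yi = y[i];
            for (size_t k = ptr[i]; k < ptr[i + 1]; k++)
                yi += val[k] * x[ind[k]];   /* indirect access to x: poor locality */
            y[i] = yi;
        }
    }

Each nonzero contributes only two flops but also costs a value, an index, and an indirect load of x, which is why the kernel is memory bound.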


Motivation for Automatic Performance Tuning of SpMV

• Historical trends
  • Sparse matrix-vector multiply (SpMV): 10% of peak or less
• Performance depends on machine, kernel, matrix
  • Matrix known at run-time
  • Best data structure + implementation can be surprising

• Our approach: empirical performance modeling and algorithm search


SpMV Historical Trends: Fraction of Peak


Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• kernel: SpMV

• Source: NASA structural analysis problem


• 8x8 dense substructure


Taking advantage of block structure in SpMV

• Bottleneck is the time to get the matrix from memory
  • Only 2 flops for each nonzero in the matrix
• Don't store each nonzero with its own index; instead store each nonzero r-by-c block with one index
  • Storage drops by up to 2x if r·c >> 1 (all 32-bit quantities)
  • Time to fetch the matrix from memory decreases
• Change both the data structure and the algorithm
  • Need to pick r and c
  • Need to change the algorithm accordingly (see the sketch below)
• In the example, is r = c = 8 the best choice?
  • It minimizes storage, so it looks like a good idea…
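To illustrate how the blocked data structure changes the kernel, here is a sketch of SpMV for a fixed 2x2 block size in a BCSR-style format; the array names and layout (row-major 2x2 blocks) are assumptions for the example, not OSKI's internal format:

    #include <stddef.h>

    /* Sketch (not OSKI's code): y += A*x with A in 2x2 BCSR.  Block row I
     * owns blocks brow_ptr[I] .. brow_ptr[I+1]-1; bind[K] is the block
     * column; bval stores each 2x2 block contiguously in row-major order,
     * so one index serves four values and the block multiply is unrolled. */
    void spmv_bcsr_2x2(size_t nb_rows, const size_t *brow_ptr,
                       const size_t *bind, const double *bval,
                       const double *x, double *y)
    {
        for (size_t I = 0; I < nb_rows; I++) {
            double y0 = y[2*I], y1 = y[2*I + 1];
            for (size_t K = brow_ptr[I]; K < brow_ptr[I + 1]; K++) {
                const double *b  = &bval[4*K];
                const double *xp = &x[2*bind[K]];
                y0 += b[0]*xp[0] + b[1]*xp[1];
                y1 += b[2]*xp[0] + b[3]*xp[1];
            }
            y[2*I]     = y0;
            y[2*I + 1] = y1;
        }
    }

Generalizing to r-by-c means generating one such unrolled routine per block size, which is why a tuning system pre-generates many variants.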


Speedups on Itanium 2: The Need for Search

[Heat map of Mflop/s over all r x c register block sizes on Itanium 2: the reference (1x1 CSR) point and the best block size, 4x2, are marked]


Register Profile: Itanium 2

[Register-profile heat map: performance ranges from 190 Mflop/s to 1190 Mflop/s]


SpMV Performance (Matrix #2): Generation 2
Ultra 2i - 9%, Ultra 3 - 5%, Pentium III-M - 15%, Pentium III - 19%

[Four register-profile panels, one per platform; the best and reference rates shown are 63 vs. 35, 109 vs. 53, 96 vs. 42, and 120 vs. 58 Mflop/s]


Register Profiles: Sun and Intel x86
Ultra 2i - 11%, Ultra 3 - 5%, Pentium III-M - 15%, Pentium III - 21%

[Four register-profile heat maps, one per platform; the best and reference rates shown are 72 vs. 35, 90 vs. 50, 108 vs. 42, and 122 vs. 58 Mflop/s]


SpMV Performance (Matrix #2): Generation 1
Power3 - 13%, Power4 - 14%, Itanium 1 - 7%, Itanium 2 - 31%

[Four panels, one per platform; the best and reference rates shown are 195 vs. 100, 703 vs. 469, 225 vs. 103 Mflop/s, and 1.1 Gflop/s vs. 276 Mflop/s]


Register Profiles: IBM and Intel IA-64
Power3 - 17%, Power4 - 16%, Itanium 1 - 8%, Itanium 2 - 33%

[Four register-profile heat maps, one per platform; the best and reference rates shown are 252 vs. 122, 820 vs. 459, 247 vs. 107 Mflop/s, and 1.2 Gflop/s vs. 190 Mflop/s]


Another example of tuning challenges

• More complicated non-zero structure in general

• N = 16614
• NNZ = 1.1 M


Zoom in to top corner



3x3 blocks look natural, but…


• Example: 3x3 blocking
  • Logical grid of 3x3 cells

• But would lead to lots of “fill-in”


Extra Work Can Improve Efficiency!


• Example: 3x3 blocking
  • Logical grid of 3x3 cells
  • Fill in explicit zeros
  • Unroll the 3x3 block multiplies
  • "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  • The actual Mflop rate is 1.5² = 2.25x higher


Automatic Register Block Size Selection

• Selecting the r x c block size
• Off-line benchmark
  • Precompute Mflops(r,c) using dense A, for each r x c
  • Once per machine/architecture
• Run-time "search"
  • Sample A to estimate Fill(r,c) for each r x c
• Run-time heuristic model
  • Choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c) (see the sketch below)
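A sketch of the heuristic's final step, assuming the offline benchmark produced an 8x8 table of dense-blocked Mflop/s rates and the run-time sampler produced a matching table of estimated fill ratios; the names and the 8x8 search space are illustrative, not OSKI's API:

    /* Pick the (r,c) that minimizes estimated time ~ Fill(r,c) / Mflops(r,c). */
    void choose_block_size(const double mflops[8][8], /* offline dense profile */
                           const double fill[8][8],   /* run-time sampled fill */
                           int *r_best, int *c_best)
    {
        double best = 1e300;
        *r_best = 1; *c_best = 1;
        for (int r = 1; r <= 8; r++)
            for (int c = 1; c <= 8; c++) {
                double t = fill[r - 1][c - 1] / mflops[r - 1][c - 1];
                if (t < best) { best = t; *r_best = r; *c_best = c; }
            }
    }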


Accurate and Efficient Adaptive Fill Estimation

• Idea: sample the matrix
  • Fraction of the matrix to sample: s ∈ [0,1]
  • Cost ~ O(s · nnz)
  • Control the cost by controlling s
• Search at run-time: the constant matters!
• Control s automatically by computing statistical confidence intervals
  • Idea: monitor the variance
• Cost of tuning (a sampling sketch follows below)
  • Lower bound: converting the matrix costs 5 to 40 unblocked SpMVs
  • Heuristic: 1 to 11 SpMVs
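The sampling idea can be sketched as follows; this is an assumed interface (not OSKI's), and it estimates Fill(r,c) by scanning roughly a fraction s of the block rows rather than by computing confidence intervals:

    #include <stdlib.h>

    /* Fill = (entries stored after r x c blocking) / (true nonzeros), so
     * Fill >= 1.  The matrix is in CSR (ptr, ind); only every (1/s)-th block
     * row is examined. */
    double estimate_fill(size_t n, const size_t *ptr, const size_t *ind,
                         int r, int c, double s)
    {
        size_t n_brows = (n + r - 1) / r;
        size_t n_bcols = (n + c - 1) / c;
        size_t step = (s > 0.0 && s < 1.0) ? (size_t)(1.0 / s) : 1;
        size_t nnz_seen = 0, blocks_seen = 0;
        unsigned char *touched = calloc(n_bcols, 1);   /* marks block columns */

        for (size_t I = 0; I < n_brows; I += step) {   /* sampled block rows */
            size_t row_end = (I + 1) * r < n ? (I + 1) * r : n;
            for (size_t i = I * r; i < row_end; i++)
                for (size_t k = ptr[i]; k < ptr[i + 1]; k++) {
                    size_t J = ind[k] / c;
                    if (!touched[J]) { touched[J] = 1; blocks_seen++; }
                    nnz_seen++;
                }
            for (size_t i = I * r; i < row_end; i++)   /* reset the marks */
                for (size_t k = ptr[i]; k < ptr[i + 1]; k++)
                    touched[ind[k] / c] = 0;
        }
        free(touched);
        return nnz_seen ? (double)(blocks_seen * r * c) / nnz_seen : 1.0;
    }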


Accuracy of the Tuning Heuristics (1/4)

NOTE: "Fair" flops are used (ops on explicit zeros are not counted as "work").
See p. 375 of Vuduc's thesis for the matrices.


Accuracy of the Tuning Heuristics (2/4)


Accuracy of the Tuning Heuristics (3/4)


Accuracy of the Tuning Heuristics (4/4): DGEMV


Upper Bounds on Performance for blocked SpMV

• P = (flops) / (time)
  • Flops = 2 · nnz(A)
• Lower bound on time: two main assumptions
  • 1. Count memory ops only (streaming)
  • 2. Count only compulsory and capacity misses: ignore conflicts
• Account for line sizes
• Account for matrix size and nnz
• Charge the minimum access "latency" α_i at the Li cache and α_mem at memory
  • e.g., from the Saavedra-Barrera and PMaC MAPS benchmarks

Time ≥ Σ_{i=1..κ} α_i · Hits_i + α_mem · Hits_mem, where
Hits_1 = Loads − Misses_1, Hits_i = Misses_{i−1} − Misses_i (for i > 1), and Hits_mem = Misses_κ.
(A small numeric sketch of this bound follows below.)
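To make the model concrete, here is a tiny sketch that evaluates the bound for a machine with two cache levels plus memory; the α latencies below are placeholders, not measurements of any real machine:

    /* Latency-model lower bound on time (in cycles), two cache levels + memory. */
    double time_lower_bound(double loads, double misses_L1, double misses_L2)
    {
        const double alpha1 = 1.0, alpha2 = 6.0, alpha_mem = 100.0; /* assumed */
        double hits1 = loads - misses_L1;      /* Hits_1 = Loads - Misses_1   */
        double hits2 = misses_L1 - misses_L2;  /* Hits_2 = Misses_1 - Misses_2 */
        return alpha1 * hits1 + alpha2 * hits2 + alpha_mem * misses_L2;
    }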


Example: L2 Misses on Itanium 2

Misses measured using PAPI [Browne ’00]


Example: Bounds on Itanium 2




Summary of Other Performance Optimizations

• Optimizations for SpMV
  • Register blocking (RB): up to 4x over CSR
  • Variable block splitting: 2.1x over CSR, 1.8x over RB
  • Diagonals: 2x over CSR
  • Reordering to create dense structure + splitting: 2x over CSR
  • Symmetry: 2.8x over CSR, 2.6x over RB
  • Cache blocking: 2.8x over CSR
  • Multiple vectors (SpMM): 7x over CSR
  • And combinations…
• Sparse triangular solve
  • Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  • AA^T·x, A^TA·x: 4x over CSR, 1.8x over RB
  • A^k·x: 2x over CSR, 1.5x over RB


SPMV for Shared Memory and Multicore

• Data structure transformations
  • Thread blocking
  • Cache blocking
  • Register blocking
  • Format selection
  • Index size reduction
• Kernel optimizations
  • Prefetching
  • Loop structure


Thread Blocking

• Load balancing
  • Evenly divide the number of nonzeros
• Exploit NUMA memory systems on multi-socket SMPs
  • Must pin threads to cores AND pin data to sockets


Naïve Approach

• R x C processor grid
• Each processor covers the same number of rows and columns
• Potentially unbalanced


Load Balanced Approach

• R x C processor grid
• First, block into rows
  • same number of nonzeros in each of the R blocked rows (sketched below)
• Second, block within each blocked row
  • Not only should each block within a row have ~the same number of nonzeros,
  • but all blocks should have ~the same number of nonzeros
• Third, prune unneeded rows & columns
• Fourth, re-encode the column indices to be relative to each thread block
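A sketch of the first step only (splitting the rows into R blocked rows with roughly equal nonzero counts, using the CSR row pointer); the function and argument names are made up for the example, and the later steps are omitted:

    #include <stddef.h>

    /* Rows row_start[t] .. row_start[t+1]-1 go to blocked row t, 0 <= t < R. */
    void partition_rows_by_nnz(size_t n, const size_t *ptr, int R,
                               size_t *row_start /* length R+1 */)
    {
        size_t nnz = ptr[n];
        size_t row = 0;
        row_start[0] = 0;
        for (int t = 1; t < R; t++) {
            size_t target = (nnz * (size_t)t) / (size_t)R;  /* ideal boundary */
            while (row < n && ptr[row + 1] < target)
                row++;
            row_start[t] = row;
        }
        row_start[R] = n;
    }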


Memory Optimizations

• Cache blocking
  • Performed for each thread block
  • Chop into blocks so the entire source vector fits in cache
• Prefetching
  • Insert explicit prefetch operations to mask latency to memory (sketched below)
  • Tune the prefetch distance/time using search
• Register blocking
  • As in OSKI, but done separately per cache block
  • Simpler heuristic: choose the block size that minimizes total storage
• Index compression
  • Use 16-bit ints for indices in blocks less than 64K wide
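As an example of the prefetching idea, here is the plain CSR kernel with explicit prefetches of the value and index streams; __builtin_prefetch is the GCC/Clang intrinsic (address, read/write, locality), and the distance of 64 nonzeros is only a guess at the kind of value the search would pick:

    #include <stddef.h>

    #define PF_DIST 64   /* prefetch distance, in nonzeros; tuned by search */

    /* CSR SpMV with software prefetch; prefetches past the array end are
     * harmless since prefetch instructions do not fault. */
    void spmv_csr_prefetch(size_t n, const size_t *ptr, const size_t *ind,
                           const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            double yi = y[i];
            for (size_t k = ptr[i]; k < ptr[i + 1]; k++) {
                __builtin_prefetch(&val[k + PF_DIST], 0, 0); /* read, streaming */
                __builtin_prefetch(&ind[k + PF_DIST], 0, 0);
                yi += val[k] * x[ind[k]];
            }
            y[i] = yi;
        }
    }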


1 thread Performance (preliminary)

[Bar charts of Mflop/s for naive, register-blocked, and software-prefetch variants on two matrices (memplus.rua, raefsky3.rua) and two machines (dual-socket dual-core Opteron @ 2.2 GHz; quad-socket single-core Opteron @ 2.4 GHz). Per chart, the tuned rates against the naive baseline (in parentheses) are 493, 513, 269 (258); 467, 695, 439 (297); 476, 460, 351 (324); and 612, 1372, 623 (430) Mflop/s, giving single-thread speedups between 1.4x and 3.2x.]


2 thread Performance (preliminary)

[Bar charts for the same matrices and machines, now with thread and register blocking plus software prefetch on 2 threads: rates of 495, 1284, 755, and 1639 Mflop/s, annotated with speedups of 0.96x, 1.85x, 1.6x, and 1.2x.]


4 thread Performance (preliminary)

[Same setup on 4 threads: rates of 985, 1911, 1369, and 3148 Mflop/s, annotated with speedups of 2.0x, 2.75x, 3.0x, and 2.3x.]


Speedup for the best combination of NThreads, blocking, prefetching, …

[Best tuned rates vs. the naive single-thread baselines, across memplus.rua and raefsky3.rua on the dual-socket dual-core Opteron @ 2.2 GHz and the quad-socket single-core Opteron @ 2.4 GHz: 985 vs. 258 (3.8x), 1911 vs. 297 (6.4x), 1369 vs. 324 (4.2x), and 3148 vs. 430 (7.3x) Mflop/s.]


Distributed Memory SPMV

• y = A*x, where A is a sparse n x n matrix

• Questions
  • Which processors store y[i], x[i], and A[i,j]?
  • Which processors compute y[i] = sum over j of A[i,j]·x[j] = (row i of A) · x … a sparse dot product?
• Partitioning
  • Partition the index set {1,…,n} = N1 ∪ N2 ∪ … ∪ Np
  • For all i in Nk, processor k stores y[i], x[i], and row i of A
  • For all i in Nk, processor k computes y[i] = (row i of A) · x
    • "owner computes" rule: processor k computes the y[i]'s it owns

[Figure: x and y partitioned by block rows across processors P1 through P4]

May require communication
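A simplified sketch of the 1D owner-computes scheme in C with MPI; for brevity every process gathers the entire x with MPI_Allgatherv, whereas a real implementation (e.g. PETSc's MATMPIAIJ) communicates only the remote x entries it actually references. All names here are illustrative:

    #include <mpi.h>
    #include <stdlib.h>

    /* 1D block-row SpMV, y = A*x.  This process owns n_local rows of A
     * (local CSR with GLOBAL column indices) plus the matching slices of x
     * and y; counts[]/displs[] give the rows (= x entries) per process. */
    void spmv_1d(MPI_Comm comm, size_t n_global, size_t n_local,
                 const size_t *ptr, const size_t *ind, const double *val,
                 const double *x_local, double *y_local,
                 const int *counts, const int *displs)
    {
        double *x_full = malloc(n_global * sizeof(double));

        /* Everyone contributes its slice of x and receives the full vector. */
        MPI_Allgatherv(x_local, (int)n_local, MPI_DOUBLE,
                       x_full, counts, displs, MPI_DOUBLE, comm);

        for (size_t i = 0; i < n_local; i++) {     /* local sparse dot products */
            double yi = 0.0;
            for (size_t k = ptr[i]; k < ptr[i + 1]; k++)
                yi += val[k] * x_full[ind[k]];
            y_local[i] = yi;
        }
        free(x_full);
    }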


Two Layouts

• The partitions should be by nonzero counts, not by rows/columns
• 1D partition: most popular, but for algorithms (e.g. NAS CG) that do reductions on y, those reductions scale with log P
• 2D partition: reductions scale with log sqrt(P), but it needs roughly equal nonzeros per block for load balance

[Figure: the same matrix, x, and y distributed two ways over P1 through P4: a 1D block-row partition and a 2D block partition]


Summary

• Sparse matrix-vector multiply is critical to many applications

• Performance limited by memory systems (and perhaps network)

• Cache blocking, register blocking, prefetching are all important

• Autotuning can be used, but it needs to account for the matrix structure


Extra Slides

Including: How to use OSKI


Example: Sparse Triangular Factor

• Raefsky4 (structural problem) + SuperLU + colmmd

• N=19779, nnz=12.6 M

• Dense trailing triangle: dim = 2268, 20% of total nonzeros (can be as high as 90+%!)
• 1.8x speedup over CSR


Cache Optimizations for A·A^T·x

• Cache-level: interleave multiplication by A and A^T
  • Only fetch A from memory once
• Register-level: a_i^T to be an r x c block row, or a diagonal row

A·A^T·x = [a_1 … a_n] [a_1^T; … ; a_n^T] x = Σ_{i=1..n} a_i (a_i^T x)

where each a_i^T x is a dot product and the accumulation y += (a_i^T x)·a_i is an "axpy". (A sketch follows below.)
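A sketch of the cache-level interleaving, assuming A is stored by columns (CSC) so that each column a_i is fetched from memory once and reused, still in cache, for the axpy; plain CSC is shown, whereas the register-level variant would walk r x c blocks:

    #include <stddef.h>

    /* y = A * (A^T * x) = sum_i a_i * (a_i^T x), A in CSC (col_ptr / row_ind /
     * val), y zeroed by the caller before the call. */
    void aatx_interleaved(size_t ncols, const size_t *col_ptr,
                          const size_t *row_ind, const double *val,
                          const double *x, double *y)
    {
        for (size_t i = 0; i < ncols; i++) {
            double t = 0.0;
            for (size_t k = col_ptr[i]; k < col_ptr[i + 1]; k++)  /* dot product */
                t += val[k] * x[row_ind[k]];
            for (size_t k = col_ptr[i]; k < col_ptr[i + 1]; k++)  /* "axpy" */
                y[row_ind[k]] += t * val[k];
        }
    }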


Example: Combining Optimizations

• Register blocking, symmetry, multiple (k) vectors
• Three low-level tuning parameters: r, c, v

[Diagram: Y += A·X, with r x c register blocks in A and the k vectors of X processed v at a time]


Example: Combining Optimizations

• Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB]
• Symmetric, blocked, 1 vector
  • Up to 2.6x over nonsymmetric, blocked, 1 vector
• Symmetric, blocked, k vectors
  • Up to 2.1x over nonsymmetric, blocked, k vectors
  • Up to 7.3x over nonsymmetric, nonblocked, 1 vector
• Symmetric storage: up to 64.7% savings


Potential Impact on Applications: T3P

• Application: accelerator design [Ko]
• 80% of time spent in SpMV
• Relevant optimization techniques
  • Symmetric storage
  • Register blocking
• On a single Itanium 2 processor
  • 1.68x speedup
    • 532 Mflop/s, or 15% of the 3.6 Gflop/s peak
  • 4.4x speedup with multiple (8) vectors
    • 1380 Mflop/s, or 38% of peak


Potential Impact on Applications: Omega3P

• Application: accelerator cavity design [Ko]
• Relevant optimization techniques
  • Symmetric storage
  • Register blocking
  • Reordering
    • Reverse Cuthill-McKee ordering to reduce bandwidth
    • Traveling Salesman Problem-based ordering to create blocks

– Nodes = columns of A

– Weights(u, v) = no. of nz u, v have in common

– Tour = ordering of columns

– Choose maximum weight tour

– See [Pinar & Heath ’97]

• 2.1x speedup on Power 4, but SPMV not dominant


Source: Accelerator Cavity Design Problem (Ko via Husbands)


100x100 Submatrix Along Diagonal


Post-RCM Reordering


"Microscopic" Effect of RCM Reordering

Before: Green + Red; After: Green + Blue


“Microscopic” Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue


[Figure: results for the Omega3P matrix]


Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user’s matrix & machine

• BLAS-style functionality: SpMV (A·x and A^T·y), triangular solve (TrSV)
• Hides the complexity of run-time tuning
• Includes new, faster locality-aware kernels: A^TA·x, A^k·x
• Faster than standard implementations
  • Up to 4x faster matvec, 1.8x trisolve, 4x A^TA·x
• For "advanced" users & solver library writers
  • Available as a stand-alone library (OSKI 1.0.1b, 3/06)
  • Available as a PETSc extension (OSKI-PETSc .1d, 3/06)
  • bebop.cs.berkeley.edu/oski


How the OSKI Tunes (Overview)

[Diagram of the tuning flow. Library install-time (offline): 1. build for the target architecture; 2. benchmark the generated code variants, producing benchmark data. Application run-time: 1. heuristic models evaluate the user's matrix, the benchmark data, the workload from program monitoring, and history; 2. select a data structure and code, returning a matrix handle to the user for kernel calls. Extensibility: advanced users may write and dynamically add "code variants" and "heuristic models" to the system.]


How the OSKI Tunes (Overview)

• At library build/install-time
  • Pre-generate and compile code variants into dynamic libraries
  • Collect benchmark data
    • Measures and records the speed of possible sparse data structure and code variants on the target architecture
  • Installation process uses standard, portable GNU AutoTools
• At run-time
  • Library "tunes" using heuristic models
    • Models analyze the user's matrix & benchmark data to choose an optimized data structure and code
  • Non-trivial tuning cost: up to ~40 mat-vecs
    • Library limits the time it spends tuning based on the estimated workload, provided by the user or inferred by the library
    • User may reduce cost by saving tuning results for the application on future runs with the same or a similar matrix


Optimizations in the Initial OSKI Release

• Fully automatic heuristics for
  • Sparse matrix-vector multiply
    • Register-level blocking
    • Register-level blocking + symmetry + multiple vectors
    • Cache-level blocking
  • Sparse triangular solve with register-level blocking and the "switch-to-dense" optimization
  • Sparse A^TA·x with register-level blocking
• User may select other optimizations manually
  • Diagonal storage optimizations, reordering, splitting; tiled matrix powers kernel (A^k·x)
  • All available in dynamic libraries
  • Accessible via a high-level embedded script language
• "Plug-in" extensibility
  • Very advanced users may write their own heuristics, create new data structures/code variants, and dynamically add them to the system


How to Call OSKI: Basic Usage

• May gradually migrate existing apps
  • Step 1: "Wrap" existing data structures
  • Step 2: Make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …;                   /* Let x and y be two dense vectors */

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, α, x, β, y );


How to Call OSKI: Basic Usage

• May gradually migrate existing apps
  • Step 1: "Wrap" existing data structures
  • Step 2: Make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …;                   /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
                                            num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, α, x, β, y );


How to Call OSKI: Basic Usage

• May gradually migrate existing apps
  • Step 1: "Wrap" existing data structures
  • Step 2: Make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …;                   /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
                                            num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view); /* Step 2 */


How to Call OSKI: Tune with Explicit Hints

• User calls the "tune" routine
  • May provide explicit tuning hints (OPTIONAL)

oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */

/* Tell OSKI we will call SpMV 500 times (workload hint) */
oski_SetHintMatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view, 500);
/* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

oski_TuneMat(A_tunable); /* Ask OSKI to tune */

for( i = 0; i < 500; i++ )
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);


How the User Calls OSKI: Implicit Tuning

• Ask the library to infer the workload
  • Library profiles all kernel calls
  • May periodically re-tune

oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */

for( i = 0; i < 500; i++ ) {
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);
    oski_TuneMat(A_tunable); /* Ask OSKI to tune */
}


Quick-and-dirty Parallelism: OSKI-PETSc

• Extend PETSc’s distributed memory SpMV (MATMPIAIJ)

[Diagram: the matrix distributed over processes p0 through p3 by block rows]

• PETSc
  • Each process stores its diag (all-local) and off-diag submatrices
• OSKI-PETSc:
  • Add OSKI wrappers
  • Each submatrix is tuned independently


OSKI-PETSc Proof-of-Concept Results

• Matrix 1: Accelerator cavity design (R. Lee @ SLAC)
  • N ~ 1 M, ~40 M non-zeros
  • 2x2 dense block substructure
  • Symmetric
• Matrix 2: Linear programming (Italian Railways)
  • Short-and-fat: 4k x 1M, ~11M non-zeros
  • Highly unstructured
  • Big speedup from cache-blocking: no native PETSc format
• Evaluation machine: Xeon cluster
  • Peak: 4.8 Gflop/s per node


Accelerator Cavity Matrix


OSKI-PETSc Performance: Accel. Cavity


Linear Programming Matrix


OSKI-PETSc Performance: LP Matrix


Tuning Higher Level Algorithms

• So far we have tuned a single sparse matrix kernel
  • y = A^T·A·x, motivated by a higher-level algorithm (SVD)
• What can we do by extending tuning to a higher level?
• Consider Krylov subspace methods for Ax = b and Ax = λx
  • Conjugate Gradients (CG), GMRES, Lanczos, …
  • The inner loop does y = A·x, dot products, saxpys, scalar ops
  • The inner loop costs at least O(1) messages
  • k iterations cost at least O(k) messages
• Our goal: show how to do k iterations with O(1) messages
  • Possible payoff: make Krylov subspace methods much faster on machines with slow networks
  • Memory bandwidth improvements too (not discussed)
  • Obstacles: numerical stability, preconditioning, …


Parallel Sparse Matrix-vector multiplication

• y = A*x, where A is a sparse n x n matrix

• Questions
  • Which processors store y[i], x[i], and A[i,j]?
  • Which processors compute y[i] = sum over j of A[i,j]·x[j] = (row i of A) · x … a sparse dot product?
• Partitioning
  • Partition the index set {1,…,n} = N1 ∪ N2 ∪ … ∪ Np
  • For all i in Nk, processor k stores y[i], x[i], and row i of A
  • For all i in Nk, processor k computes y[i] = (row i of A) · x
    • "owner computes" rule: processor k computes the y[i]'s it owns

[Figure: x and y partitioned by block rows across processors P1 through P4]

May require communication


Matrix Reordering via Graph Partitioning

• "Ideal" matrix structure for parallelism: block diagonal
  • p (number of processors) blocks, which can all be computed locally
  • If there are no nonzeros outside these blocks, no communication is needed
• Can we reorder the rows/columns to get close to this?
  • Most nonzeros in the diagonal blocks, few outside

[Figure: y = A·x with a nearly block-diagonal A, the diagonal blocks assigned to processors P0 through P4]


Goals of Reordering

• Performance goals
  • Balance load (how is load measured?)
    • Approximately equal number of nonzeros (not necessarily rows)
  • Balance storage (how much does each processor store?)
    • Approximately equal number of nonzeros
  • Minimize communication (how much is communicated?)
    • Minimize nonzeros outside the diagonal blocks
    • A related optimization criterion is to move nonzeros near the diagonal
  • Improve register and cache re-use
    • Group nonzeros in small vertical blocks so source (x) elements loaded into cache or registers may be reused (temporal locality)
    • Group nonzeros in small horizontal blocks so nearby source (x) elements in the cache may be used (spatial locality)
• Other algorithms reorder for other reasons
  • Reduce the number of nonzeros in the matrix after Gaussian elimination
  • Improve numerical stability


Graph Partitioning and Sparse Matrices

[Figure: a 6x6 symmetric sparse matrix, with 1s marking the nonzeros, and the corresponding graph on nodes 1 through 6]

• Relationship between matrix and graph
  • Edges in the graph correspond to nonzeros in the matrix; here the matrix is symmetric (edges are unordered) and all weights are equal (1)
  • If divided over 3 processors, there are 14 nonzeros outside the diagonal blocks, which represent the 7 (bidirectional) edges


Graph Partitioning and Sparse Matrices

[Figure: the same 6x6 symmetric sparse matrix and its graph, now partitioned into three parts]

• Relationship between matrix and graph

• A "good" partition of the graph has
  • an equal (weighted) number of nodes in each part (load and storage balance)
  • a minimum number of edges crossing between parts (minimize communication)

• Reorder the rows/columns by putting all nodes in one partition together.
