
CS 267 Sparse Matrices:

Sparse Matrix-Vector Multiply for Iterative Solvers

Kathy Yelick

www.cs.berkeley.edu/~yelick/cs267_sp07


High-end simulation in the physical sciences = 7 numerical methods:

1. Structured Grids (including locally structured grids, e.g. AMR)

2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo

Well-defined targets from algorithmic, software, and architecture standpoint

Phillip Colella’s “Seven dwarfs”

• Add 4 for embedded:
  8. Search/Sort
  9. Finite State Machine
  10. Filter
  11. Combinational logic

• Then covers all 41 EEMBC benchmarks

• Revise 1 for SPEC:
  • 7. Monte Carlo => Easily parallel (to add ray tracing)

• Then covers 26 SPEC benchmarks

Slide from “Defining Software Requirements for Scientific Computing”, Phillip Colella, 2004


ODEs and Sparse Matrices

• All these problems reduce to sparse matrix problems

• Explicit: sparse matrix-vector multiplication (SpMV)
• Implicit: solve a sparse linear system
  • direct solvers (Gaussian elimination)
  • iterative solvers (use sparse matrix-vector multiplication)

• Eigenvalue/vector algorithms may also be explicit or implicit.

• Conclusion: SpMV is key to many ODE problems

• Relatively simple algorithm to study in detail
• Two key problems: locality and load balance


SpMV in Compressed Sparse Row (CSR) Format

Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)·x(j)

for each row i
    for k = ptr[i] to ptr[i+1]-1 do
        y[i] = y[i] + val[k]*x[ind[k]]

[Figure: y = A·x, with the CSR representation of A (row pointers ptr, column indices ind, values val)]

CSR format is one of many possibilities
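For concreteness, here is a minimal C version of the kernel above (a sketch: the array names follow the pseudocode, and the function signature is illustrative, not from any particular library):

    #include <stddef.h>

    /* y = y + A*x for an n-row matrix A in CSR format:
     *   ptr[i] .. ptr[i+1]-1 index the nonzeros of row i,
     *   ind[k] is the column of the k-th nonzero, val[k] its value. */
    void spmv_csr(size_t n, const size_t *ptr, const size_t *ind,
                  const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            double yi = y[i];
            for (size_t k = ptr[i]; k < ptr[i + 1]; k++)
                yi += val[k] * x[ind[k]];   /* indirect access to x: poor locality */
            y[i] = yi;
        }
    }

Each nonzero contributes only two flops but also costs a value, an index, and an indirect load of x, which is why the kernel is memory bound.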


Motivation for Automatic Performance Tuning of SpMV

• Historical trends
  • Sparse matrix-vector multiply (SpMV): 10% of peak or less
• Performance depends on machine, kernel, matrix
  • Matrix known at run-time
  • Best data structure + implementation can be surprising

• Our approach: empirical performance modeling and algorithm search


SpMV Historical Trends: Fraction of Peak


Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• kernel: SpMV

• Source: NASA structural analysis problem


• 8x8 dense substructure


Taking advantage of block structure in SpMV

• Bottleneck is the time to get the matrix from memory
  • Only 2 flops for each nonzero in the matrix
• Don't store each nonzero with its own index; instead store each nonzero r-by-c block with one index
  • Storage drops by up to 2x if r·c >> 1 (all 32-bit quantities)
  • Time to fetch the matrix from memory decreases
• Change both the data structure and the algorithm
  • Need to pick r and c
  • Need to change the algorithm accordingly (see the sketch below)
• In the example, is r = c = 8 the best choice?
  • It minimizes storage, so it looks like a good idea…
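To illustrate how the blocked data structure changes the kernel, here is a sketch of SpMV for a fixed 2x2 block size in a BCSR-style format; the array names and layout (row-major 2x2 blocks) are assumptions for the example, not OSKI's internal format:

    #include <stddef.h>

    /* Sketch (not OSKI's code): y += A*x with A in 2x2 BCSR.  Block row I
     * owns blocks brow_ptr[I] .. brow_ptr[I+1]-1; bind[K] is the block
     * column; bval stores each 2x2 block contiguously in row-major order,
     * so one index serves four values and the block multiply is unrolled. */
    void spmv_bcsr_2x2(size_t nb_rows, const size_t *brow_ptr,
                       const size_t *bind, const double *bval,
                       const double *x, double *y)
    {
        for (size_t I = 0; I < nb_rows; I++) {
            double y0 = y[2*I], y1 = y[2*I + 1];
            for (size_t K = brow_ptr[I]; K < brow_ptr[I + 1]; K++) {
                const double *b  = &bval[4*K];
                const double *xp = &x[2*bind[K]];
                y0 += b[0]*xp[0] + b[1]*xp[1];
                y1 += b[2]*xp[0] + b[3]*xp[1];
            }
            y[2*I]     = y0;
            y[2*I + 1] = y1;
        }
    }

Generalizing to r-by-c means generating one such unrolled routine per block size, which is why a tuning system pre-generates many variants.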


Speedups on Itanium 2: The Need for Search

[Heat map of Mflop/s over all r x c register block sizes on Itanium 2: the reference (1x1 CSR) point and the best block size, 4x2, are marked]


Register Profile: Itanium 2

[Register-profile heat map: performance ranges from 190 Mflop/s to 1190 Mflop/s]


SpMV Performance (Matrix #2): Generation 2
Ultra 2i - 9%, Ultra 3 - 5%, Pentium III-M - 15%, Pentium III - 19%

[Four register-profile panels, one per platform; the best and reference rates shown are 63 vs. 35, 109 vs. 53, 96 vs. 42, and 120 vs. 58 Mflop/s]


Register Profiles: Sun and Intel x86
Ultra 2i - 11%, Ultra 3 - 5%, Pentium III-M - 15%, Pentium III - 21%

[Four register-profile heat maps, one per platform; the best and reference rates shown are 72 vs. 35, 90 vs. 50, 108 vs. 42, and 122 vs. 58 Mflop/s]


SpMV Performance (Matrix #2): Generation 1
Power3 - 13%, Power4 - 14%, Itanium 1 - 7%, Itanium 2 - 31%

[Four panels, one per platform; the best and reference rates shown are 195 vs. 100, 703 vs. 469, 225 vs. 103 Mflop/s, and 1.1 Gflop/s vs. 276 Mflop/s]


Register Profiles: IBM and Intel IA-64
Power3 - 17%, Power4 - 16%, Itanium 1 - 8%, Itanium 2 - 33%

[Four register-profile heat maps, one per platform; the best and reference rates shown are 252 vs. 122, 820 vs. 459, 247 vs. 107 Mflop/s, and 1.2 Gflop/s vs. 190 Mflop/s]


Another example of tuning challenges

• More complicated non-zero structure in general

• N = 16614
• NNZ = 1.1 M


Zoom in to top corner



3x3 blocks look natural, but…


• Example: 3x3 blocking
  • Logical grid of 3x3 cells

• But would lead to lots of “fill-in”


Extra Work Can Improve Efficiency!


• Example: 3x3 blocking
  • Logical grid of 3x3 cells
  • Fill in explicit zeros
  • Unroll the 3x3 block multiplies
  • "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  • The actual Mflop rate is 1.5² = 2.25x higher


Automatic Register Block Size Selection

• Selecting the r x c block size
• Off-line benchmark
  • Precompute Mflops(r,c) using dense A, for each r x c
  • Once per machine/architecture
• Run-time "search"
  • Sample A to estimate Fill(r,c) for each r x c
• Run-time heuristic model
  • Choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c) (see the sketch below)
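A sketch of the heuristic's final step, assuming the offline benchmark produced an 8x8 table of dense-blocked Mflop/s rates and the run-time sampler produced a matching table of estimated fill ratios; the names and the 8x8 search space are illustrative, not OSKI's API:

    /* Pick the (r,c) that minimizes estimated time ~ Fill(r,c) / Mflops(r,c). */
    void choose_block_size(const double mflops[8][8], /* offline dense profile */
                           const double fill[8][8],   /* run-time sampled fill */
                           int *r_best, int *c_best)
    {
        double best = 1e300;
        *r_best = 1; *c_best = 1;
        for (int r = 1; r <= 8; r++)
            for (int c = 1; c <= 8; c++) {
                double t = fill[r - 1][c - 1] / mflops[r - 1][c - 1];
                if (t < best) { best = t; *r_best = r; *c_best = c; }
            }
    }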


Accurate and Efficient Adaptive Fill Estimation

• Idea: sample the matrix
  • Fraction of the matrix to sample: s ∈ [0,1]
  • Cost ~ O(s · nnz)
  • Control the cost by controlling s
• Search at run-time: the constant matters!
• Control s automatically by computing statistical confidence intervals
  • Idea: monitor the variance
• Cost of tuning (a sampling sketch follows below)
  • Lower bound: converting the matrix costs 5 to 40 unblocked SpMVs
  • Heuristic: 1 to 11 SpMVs
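The sampling idea can be sketched as follows; this is an assumed interface (not OSKI's), and it estimates Fill(r,c) by scanning roughly a fraction s of the block rows rather than by computing confidence intervals:

    #include <stdlib.h>

    /* Fill = (entries stored after r x c blocking) / (true nonzeros), so
     * Fill >= 1.  The matrix is in CSR (ptr, ind); only every (1/s)-th block
     * row is examined. */
    double estimate_fill(size_t n, const size_t *ptr, const size_t *ind,
                         int r, int c, double s)
    {
        size_t n_brows = (n + r - 1) / r;
        size_t n_bcols = (n + c - 1) / c;
        size_t step = (s > 0.0 && s < 1.0) ? (size_t)(1.0 / s) : 1;
        size_t nnz_seen = 0, blocks_seen = 0;
        unsigned char *touched = calloc(n_bcols, 1);   /* marks block columns */

        for (size_t I = 0; I < n_brows; I += step) {   /* sampled block rows */
            size_t row_end = (I + 1) * r < n ? (I + 1) * r : n;
            for (size_t i = I * r; i < row_end; i++)
                for (size_t k = ptr[i]; k < ptr[i + 1]; k++) {
                    size_t J = ind[k] / c;
                    if (!touched[J]) { touched[J] = 1; blocks_seen++; }
                    nnz_seen++;
                }
            for (size_t i = I * r; i < row_end; i++)   /* reset the marks */
                for (size_t k = ptr[i]; k < ptr[i + 1]; k++)
                    touched[ind[k] / c] = 0;
        }
        free(touched);
        return nnz_seen ? (double)(blocks_seen * r * c) / nnz_seen : 1.0;
    }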


Accuracy of the Tuning Heuristics (1/4)

NOTE: "Fair" flops are used (ops on explicit zeros are not counted as "work").
See p. 375 of Vuduc's thesis for the matrices.


Accuracy of the Tuning Heuristics (2/4)


Accuracy of the Tuning Heuristics (3/4)


Accuracy of the Tuning Heuristics (4/4): DGEMV


Upper Bounds on Performance for blocked SpMV

• P = (flops) / (time)
  • Flops = 2 · nnz(A)
• Lower bound on time: two main assumptions
  • 1. Count memory ops only (streaming)
  • 2. Count only compulsory and capacity misses: ignore conflicts
• Account for line sizes
• Account for matrix size and nnz
• Charge the minimum access "latency" α_i at the Li cache and α_mem at memory
  • e.g., from the Saavedra-Barrera and PMaC MAPS benchmarks

Time ≥ Σ_{i=1..κ} α_i · Hits_i + α_mem · Hits_mem, where
Hits_1 = Loads − Misses_1, Hits_i = Misses_{i−1} − Misses_i (for i > 1), and Hits_mem = Misses_κ.
(A small numeric sketch of this bound follows below.)
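To make the model concrete, here is a tiny sketch that evaluates the bound for a machine with two cache levels plus memory; the α latencies below are placeholders, not measurements of any real machine:

    /* Latency-model lower bound on time (in cycles), two cache levels + memory. */
    double time_lower_bound(double loads, double misses_L1, double misses_L2)
    {
        const double alpha1 = 1.0, alpha2 = 6.0, alpha_mem = 100.0; /* assumed */
        double hits1 = loads - misses_L1;      /* Hits_1 = Loads - Misses_1   */
        double hits2 = misses_L1 - misses_L2;  /* Hits_2 = Misses_1 - Misses_2 */
        return alpha1 * hits1 + alpha2 * hits2 + alpha_mem * misses_L2;
    }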


Example: L2 Misses on Itanium 2

Misses measured using PAPI [Browne ’00]


Example: Bounds on Itanium 2




Summary of Other Performance Optimizations

• Optimizations for SpMV
  • Register blocking (RB): up to 4x over CSR
  • Variable block splitting: 2.1x over CSR, 1.8x over RB
  • Diagonals: 2x over CSR
  • Reordering to create dense structure + splitting: 2x over CSR
  • Symmetry: 2.8x over CSR, 2.6x over RB
  • Cache blocking: 2.8x over CSR
  • Multiple vectors (SpMM): 7x over CSR
  • And combinations…
• Sparse triangular solve
  • Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  • AA^T·x, A^TA·x: 4x over CSR, 1.8x over RB
  • A^k·x: 2x over CSR, 1.5x over RB


SPMV for Shared Memory and Multicore

• Data structure transformations
  • Thread blocking
  • Cache blocking
  • Register blocking
  • Format selection
  • Index size reduction
• Kernel optimizations
  • Prefetching
  • Loop structure


Thread Blocking

• Load balancing
  • Evenly divide the number of nonzeros
• Exploit NUMA memory systems on multi-socket SMPs
  • Must pin threads to cores AND pin data to sockets


Naïve Approach

• R x C processor grid
• Each processor covers the same number of rows and columns
• Potentially unbalanced


Load Balanced Approach

• R x C processor grid
• First, block into rows
  • same number of nonzeros in each of the R blocked rows (sketched below)
• Second, block within each blocked row
  • Not only should each block within a row have ~the same number of nonzeros,
  • but all blocks should have ~the same number of nonzeros
• Third, prune unneeded rows & columns
• Fourth, re-encode the column indices to be relative to each thread block
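A sketch of the first step only (splitting the rows into R blocked rows with roughly equal nonzero counts, using the CSR row pointer); the function and argument names are made up for the example, and the later steps are omitted:

    #include <stddef.h>

    /* Rows row_start[t] .. row_start[t+1]-1 go to blocked row t, 0 <= t < R. */
    void partition_rows_by_nnz(size_t n, const size_t *ptr, int R,
                               size_t *row_start /* length R+1 */)
    {
        size_t nnz = ptr[n];
        size_t row = 0;
        row_start[0] = 0;
        for (int t = 1; t < R; t++) {
            size_t target = (nnz * (size_t)t) / (size_t)R;  /* ideal boundary */
            while (row < n && ptr[row + 1] < target)
                row++;
            row_start[t] = row;
        }
        row_start[R] = n;
    }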


Memory Optimizations

• Cache blocking
  • Performed for each thread block
  • Chop into blocks so the entire source vector fits in cache
• Prefetching
  • Insert explicit prefetch operations to mask latency to memory (sketched below)
  • Tune the prefetch distance/time using search
• Register blocking
  • As in OSKI, but done separately per cache block
  • Simpler heuristic: choose the block size that minimizes total storage
• Index compression
  • Use 16-bit ints for indices in blocks less than 64K wide
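As an example of the prefetching idea, here is the plain CSR kernel with explicit prefetches of the value and index streams; __builtin_prefetch is the GCC/Clang intrinsic (address, read/write, locality), and the distance of 64 nonzeros is only a guess at the kind of value the search would pick:

    #include <stddef.h>

    #define PF_DIST 64   /* prefetch distance, in nonzeros; tuned by search */

    /* CSR SpMV with software prefetch; prefetches past the array end are
     * harmless since prefetch instructions do not fault. */
    void spmv_csr_prefetch(size_t n, const size_t *ptr, const size_t *ind,
                           const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            double yi = y[i];
            for (size_t k = ptr[i]; k < ptr[i + 1]; k++) {
                __builtin_prefetch(&val[k + PF_DIST], 0, 0); /* read, streaming */
                __builtin_prefetch(&ind[k + PF_DIST], 0, 0);
                yi += val[k] * x[ind[k]];
            }
            y[i] = yi;
        }
    }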


1 thread Performance (preliminary)

[Bar charts of Mflop/s for naive, register-blocked, and software-prefetch variants on two matrices (memplus.rua, raefsky3.rua) and two machines (dual-socket dual-core Opteron @ 2.2 GHz; quad-socket single-core Opteron @ 2.4 GHz). Per chart, the tuned rates against the naive baseline (in parentheses) are 493, 513, 269 (258); 467, 695, 439 (297); 476, 460, 351 (324); and 612, 1372, 623 (430) Mflop/s, giving single-thread speedups between 1.4x and 3.2x.]


2 thread Performance (preliminary)

[Bar charts for the same matrices and machines, now with thread and register blocking plus software prefetch on 2 threads: rates of 495, 1284, 755, and 1639 Mflop/s, annotated with speedups of 0.96x, 1.85x, 1.6x, and 1.2x.]


4 thread Performance (preliminary)

[Same setup on 4 threads: rates of 985, 1911, 1369, and 3148 Mflop/s, annotated with speedups of 2.0x, 2.75x, 3.0x, and 2.3x.]


Speedup for the best combination of NThreads, blocking, prefetching, …

[Best tuned rates vs. the naive single-thread baselines, across memplus.rua and raefsky3.rua on the dual-socket dual-core Opteron @ 2.2 GHz and the quad-socket single-core Opteron @ 2.4 GHz: 985 vs. 258 (3.8x), 1911 vs. 297 (6.4x), 1369 vs. 324 (4.2x), and 3148 vs. 430 (7.3x) Mflop/s.]


Distributed Memory SPMV

• y = A*x, where A is a sparse n x n matrix

• Questions
  • Which processors store y[i], x[i], and A[i,j]?
  • Which processors compute y[i] = sum over j of A[i,j]·x[j] = (row i of A) · x … a sparse dot product?
• Partitioning
  • Partition the index set {1,…,n} = N1 ∪ N2 ∪ … ∪ Np
  • For all i in Nk, processor k stores y[i], x[i], and row i of A
  • For all i in Nk, processor k computes y[i] = (row i of A) · x
    • "owner computes" rule: processor k computes the y[i]'s it owns

[Figure: x and y partitioned by block rows across processors P1 through P4]

May require communication
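A simplified sketch of the 1D owner-computes scheme in C with MPI; for brevity every process gathers the entire x with MPI_Allgatherv, whereas a real implementation (e.g. PETSc's MATMPIAIJ) communicates only the remote x entries it actually references. All names here are illustrative:

    #include <mpi.h>
    #include <stdlib.h>

    /* 1D block-row SpMV, y = A*x.  This process owns n_local rows of A
     * (local CSR with GLOBAL column indices) plus the matching slices of x
     * and y; counts[]/displs[] give the rows (= x entries) per process. */
    void spmv_1d(MPI_Comm comm, size_t n_global, size_t n_local,
                 const size_t *ptr, const size_t *ind, const double *val,
                 const double *x_local, double *y_local,
                 const int *counts, const int *displs)
    {
        double *x_full = malloc(n_global * sizeof(double));

        /* Everyone contributes its slice of x and receives the full vector. */
        MPI_Allgatherv(x_local, (int)n_local, MPI_DOUBLE,
                       x_full, counts, displs, MPI_DOUBLE, comm);

        for (size_t i = 0; i < n_local; i++) {     /* local sparse dot products */
            double yi = 0.0;
            for (size_t k = ptr[i]; k < ptr[i + 1]; k++)
                yi += val[k] * x_full[ind[k]];
            y_local[i] = yi;
        }
        free(x_full);
    }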


Two Layouts

• The partitions should be by nonzero counts, not by rows/columns
• 1D partition: most popular, but for algorithms (e.g. NAS CG) that do reductions on y, those reductions scale with log P
• 2D partition: reductions scale with log sqrt(P), but it needs roughly equal nonzeros per block for load balance

[Figure: the same matrix, x, and y distributed two ways over P1 through P4: a 1D block-row partition and a 2D block partition]


Summary

• Sparse matrix-vector multiply is critical to many applications

• Performance limited by memory systems (and perhaps network)

• Cache blocking, register blocking, prefetching are all important

• Autotuning can be used, but it needs to account for the matrix structure


Extra Slides

Including: How to use OSKI


Example: Sparse Triangular Factor

• Raefsky4 (structural problem) + SuperLU + colmmd

• N=19779, nnz=12.6 M

• Dense trailing triangle: dim = 2268, 20% of total nonzeros (can be as high as 90+%!)
• 1.8x speedup over CSR


Cache Optimizations for A·A^T·x

• Cache-level: interleave multiplication by A and A^T
  • Only fetch A from memory once
• Register-level: a_i^T to be an r x c block row, or a diagonal row

A·A^T·x = [a_1 … a_n] [a_1^T; … ; a_n^T] x = Σ_{i=1..n} a_i (a_i^T x)

where each a_i^T x is a dot product and the accumulation y += (a_i^T x)·a_i is an "axpy". (A sketch follows below.)
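A sketch of the cache-level interleaving, assuming A is stored by columns (CSC) so that each column a_i is fetched from memory once and reused, still in cache, for the axpy; plain CSC is shown, whereas the register-level variant would walk r x c blocks:

    #include <stddef.h>

    /* y = A * (A^T * x) = sum_i a_i * (a_i^T x), A in CSC (col_ptr / row_ind /
     * val), y zeroed by the caller before the call. */
    void aatx_interleaved(size_t ncols, const size_t *col_ptr,
                          const size_t *row_ind, const double *val,
                          const double *x, double *y)
    {
        for (size_t i = 0; i < ncols; i++) {
            double t = 0.0;
            for (size_t k = col_ptr[i]; k < col_ptr[i + 1]; k++)  /* dot product */
                t += val[k] * x[row_ind[k]];
            for (size_t k = col_ptr[i]; k < col_ptr[i + 1]; k++)  /* "axpy" */
                y[row_ind[k]] += t * val[k];
        }
    }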


Example: Combining Optimizations

• Register blocking, symmetry, multiple (k) vectors
• Three low-level tuning parameters: r, c, v

[Diagram: Y += A·X, with r x c register blocks in A and the k vectors of X processed v at a time]


Example: Combining Optimizations

• Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB]
• Symmetric, blocked, 1 vector
  • Up to 2.6x over nonsymmetric, blocked, 1 vector
• Symmetric, blocked, k vectors
  • Up to 2.1x over nonsymmetric, blocked, k vectors
  • Up to 7.3x over nonsymmetric, nonblocked, 1 vector
• Symmetric storage: up to 64.7% savings


Potential Impact on Applications: T3P

• Application: accelerator design [Ko]
• 80% of time spent in SpMV
• Relevant optimization techniques
  • Symmetric storage
  • Register blocking
• On a single Itanium 2 processor
  • 1.68x speedup
    • 532 Mflop/s, or 15% of the 3.6 Gflop/s peak
  • 4.4x speedup with multiple (8) vectors
    • 1380 Mflop/s, or 38% of peak


Potential Impact on Applications: Omega3P

• Application: accelerator cavity design [Ko]
• Relevant optimization techniques
  • Symmetric storage
  • Register blocking
  • Reordering
    • Reverse Cuthill-McKee ordering to reduce bandwidth
    • Traveling Salesman Problem-based ordering to create blocks

– Nodes = columns of A

– Weights(u, v) = no. of nz u, v have in common

– Tour = ordering of columns

– Choose maximum weight tour

– See [Pinar & Heath ’97]

• 2.1x speedup on Power 4, but SPMV not dominant


Source: Accelerator Cavity Design Problem (Ko via Husbands)


100x100 Submatrix Along Diagonal


Post-RCM Reordering


"Microscopic" Effect of RCM Reordering

Before: Green + Red; After: Green + Blue


“Microscopic” Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue


[Figure: results for the Omega3P matrix]


Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user’s matrix & machine

• BLAS-style functionality: SpMV (A·x and A^T·y), triangular solve (TrSV)
• Hides the complexity of run-time tuning
• Includes new, faster locality-aware kernels: A^TA·x, A^k·x
• Faster than standard implementations
  • Up to 4x faster matvec, 1.8x trisolve, 4x A^TA·x
• For "advanced" users & solver library writers
  • Available as a stand-alone library (OSKI 1.0.1b, 3/06)
  • Available as a PETSc extension (OSKI-PETSc .1d, 3/06)
  • bebop.cs.berkeley.edu/oski


How the OSKI Tunes (Overview)

[Diagram of the tuning flow. Library install-time (offline): 1. build for the target architecture; 2. benchmark the generated code variants, producing benchmark data. Application run-time: 1. heuristic models evaluate the user's matrix, the benchmark data, the workload from program monitoring, and history; 2. select a data structure and code, returning a matrix handle to the user for kernel calls. Extensibility: advanced users may write and dynamically add "code variants" and "heuristic models" to the system.]


How the OSKI Tunes (Overview)

• At library build/install-time
  • Pre-generate and compile code variants into dynamic libraries
  • Collect benchmark data
    • Measures and records the speed of possible sparse data structure and code variants on the target architecture
  • Installation process uses standard, portable GNU AutoTools
• At run-time
  • Library "tunes" using heuristic models
    • Models analyze the user's matrix & benchmark data to choose an optimized data structure and code
  • Non-trivial tuning cost: up to ~40 mat-vecs
    • Library limits the time it spends tuning based on the estimated workload, provided by the user or inferred by the library
    • User may reduce cost by saving tuning results for the application on future runs with the same or a similar matrix


Optimizations in the Initial OSKI Release

• Fully automatic heuristics for
  • Sparse matrix-vector multiply
    • Register-level blocking
    • Register-level blocking + symmetry + multiple vectors
    • Cache-level blocking
  • Sparse triangular solve with register-level blocking and the "switch-to-dense" optimization
  • Sparse A^TA·x with register-level blocking
• User may select other optimizations manually
  • Diagonal storage optimizations, reordering, splitting; tiled matrix powers kernel (A^k·x)
  • All available in dynamic libraries
  • Accessible via a high-level embedded script language
• "Plug-in" extensibility
  • Very advanced users may write their own heuristics, create new data structures/code variants, and dynamically add them to the system


How to Call OSKI: Basic Usage

• May gradually migrate existing apps
  • Step 1: "Wrap" existing data structures
  • Step 2: Make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …;                   /* Let x and y be two dense vectors */

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, α, x, β, y );


How to Call OSKI: Basic Usage

• May gradually migrate existing apps
  • Step 1: "Wrap" existing data structures
  • Step 2: Make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …;                   /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
                                            num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    my_matmult( ptr, ind, val, α, x, β, y );


How to Call OSKI: Basic Usage

• May gradually migrate existing apps
  • Step 1: "Wrap" existing data structures
  • Step 2: Make BLAS-like kernel calls

int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …;                   /* Let x and y be two dense vectors */

/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
                                            num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view); /* Step 2 */


How to Call OSKI: Tune with Explicit Hints

• User calls the "tune" routine
  • May provide explicit tuning hints (OPTIONAL)

oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */

/* Tell OSKI we will call SpMV 500 times (workload hint) */
oski_SetHintMatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view, 500);
/* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

oski_TuneMat(A_tunable); /* Ask OSKI to tune */

for( i = 0; i < 500; i++ )
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);


How the User Calls OSKI: Implicit Tuning

• Ask the library to infer the workload
  • Library profiles all kernel calls
  • May periodically re-tune

oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */

for( i = 0; i < 500; i++ ) {
    oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);
    oski_TuneMat(A_tunable); /* Ask OSKI to tune */
}


Quick-and-dirty Parallelism: OSKI-PETSc

• Extend PETSc’s distributed memory SpMV (MATMPIAIJ)

[Diagram: the matrix distributed over processes p0 through p3 by block rows]

• PETSc
  • Each process stores its diag (all-local) and off-diag submatrices
• OSKI-PETSc:
  • Add OSKI wrappers
  • Each submatrix is tuned independently


OSKI-PETSc Proof-of-Concept Results

• Matrix 1: Accelerator cavity design (R. Lee @ SLAC)
  • N ~ 1 M, ~40 M non-zeros
  • 2x2 dense block substructure
  • Symmetric
• Matrix 2: Linear programming (Italian Railways)
  • Short-and-fat: 4k x 1M, ~11M non-zeros
  • Highly unstructured
  • Big speedup from cache-blocking: no native PETSc format
• Evaluation machine: Xeon cluster
  • Peak: 4.8 Gflop/s per node


Accelerator Cavity Matrix


OSKI-PETSc Performance: Accel. Cavity


Linear Programming Matrix


OSKI-PETSc Performance: LP Matrix


Tuning Higher Level Algorithms

• So far we have tuned a single sparse matrix kernel
  • y = A^T·A·x, motivated by a higher-level algorithm (SVD)
• What can we do by extending tuning to a higher level?
• Consider Krylov subspace methods for Ax = b and Ax = λx
  • Conjugate Gradients (CG), GMRES, Lanczos, …
  • The inner loop does y = A·x, dot products, saxpys, scalar ops
  • The inner loop costs at least O(1) messages
  • k iterations cost at least O(k) messages
• Our goal: show how to do k iterations with O(1) messages
  • Possible payoff: make Krylov subspace methods much faster on machines with slow networks
  • Memory bandwidth improvements too (not discussed)
  • Obstacles: numerical stability, preconditioning, …


Parallel Sparse Matrix-vector multiplication

• y = A*x, where A is a sparse n x n matrix

• Questions
  • Which processors store y[i], x[i], and A[i,j]?
  • Which processors compute y[i] = sum over j of A[i,j]·x[j] = (row i of A) · x … a sparse dot product?
• Partitioning
  • Partition the index set {1,…,n} = N1 ∪ N2 ∪ … ∪ Np
  • For all i in Nk, processor k stores y[i], x[i], and row i of A
  • For all i in Nk, processor k computes y[i] = (row i of A) · x
    • "owner computes" rule: processor k computes the y[i]'s it owns

[Figure: x and y partitioned by block rows across processors P1 through P4]

May require communication


Matrix Reordering via Graph Partitioning

• "Ideal" matrix structure for parallelism: block diagonal
  • p (number of processors) blocks, which can all be computed locally
  • If there are no nonzeros outside these blocks, no communication is needed
• Can we reorder the rows/columns to get close to this?
  • Most nonzeros in the diagonal blocks, few outside

[Figure: y = A·x with a nearly block-diagonal A, the diagonal blocks assigned to processors P0 through P4]


Goals of Reordering

• Performance goals
  • Balance load (how is load measured?)
    • Approximately equal number of nonzeros (not necessarily rows)
  • Balance storage (how much does each processor store?)
    • Approximately equal number of nonzeros
  • Minimize communication (how much is communicated?)
    • Minimize nonzeros outside the diagonal blocks
    • A related optimization criterion is to move nonzeros near the diagonal
  • Improve register and cache re-use
    • Group nonzeros in small vertical blocks so source (x) elements loaded into cache or registers may be reused (temporal locality)
    • Group nonzeros in small horizontal blocks so nearby source (x) elements in the cache may be used (spatial locality)
• Other algorithms reorder for other reasons
  • Reduce the number of nonzeros in the matrix after Gaussian elimination
  • Improve numerical stability


Graph Partitioning and Sparse Matrices

[Figure: a 6x6 symmetric sparse matrix, with 1s marking the nonzeros, and the corresponding graph on nodes 1 through 6]

• Relationship between matrix and graph
  • Edges in the graph correspond to nonzeros in the matrix; here the matrix is symmetric (edges are unordered) and all weights are equal (1)
  • If divided over 3 processors, there are 14 nonzeros outside the diagonal blocks, which represent the 7 (bidirectional) edges


Graph Partitioning and Sparse Matrices

[Figure: the same 6x6 symmetric sparse matrix and its graph, now partitioned into three parts]

• Relationship between matrix and graph

• A "good" partition of the graph has
  • an equal (weighted) number of nodes in each part (load and storage balance)
  • a minimum number of edges crossing between parts (minimize communication)

• Reorder the rows/columns by putting all nodes in one partition together.
