
CS 267 Sparse Matrices: Sparse Matrix-Vector Multiply for Iterative Solvers
Kathy Yelick
www.cs.berkeley.edu/~yelick/cs267_sp07


Phillip Colella's "Seven Dwarfs"
High-end simulation in the physical sciences = 7 numerical methods:
1. Structured Grids (including locally structured grids, e.g. AMR)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
These are well-defined targets from the algorithmic, software, and architecture standpoints.
Add 4 for embedded computing (8. Search/Sort, 9. Finite State Machine, 10. Filter, 11. Combinational logic); this then covers all 41 EEMBC benchmarks.
Revise 1 for SPEC (7. Monte Carlo => easily parallel, to add ray tracing); this then covers 26 SPEC benchmarks.
Slide from "Defining Software Requirements for Scientific Computing", Phillip Colella, 2004.


ODEs and Sparse Matrices
All of these problems reduce to sparse matrix problems:
- Explicit methods: sparse matrix-vector multiplication (SpMV).
- Implicit methods: solve a sparse linear system, either with direct solvers (Gaussian elimination) or with iterative solvers (which use sparse matrix-vector multiplication).
- Eigenvalue/eigenvector algorithms may also be explicit or implicit.
Conclusion: SpMV is key to many ODE problems. It is a relatively simple algorithm to study in detail, with two key problems: locality and load balance.


SpMV in Compressed Sparse Row (CSR) Format
Matrix-vector multiply kernel: y(i) <- y(i) + A(i,j) * x(j)

for each row i
  for k = ptr[i] to ptr[i+1] - 1 do
    y[i] = y[i] + val[k] * x[ind[k]]

CSR is only one of many possible representations of A (a C sketch of the kernel follows below).
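A minimal C sketch of this kernel, for reference: the array names ptr, ind, and val follow the slide, while the function name csr_spmv and the exact types are illustrative assumptions rather than part of the lecture.

#include <stddef.h>

/* y <- y + A*x, with A stored in CSR:
   ptr[i]..ptr[i+1]-1 index the nonzeros of row i,
   ind[k] is the column of the k-th stored nonzero,
   val[k] is its value. */
void csr_spmv(size_t n, const int *ptr, const int *ind,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double yi = y[i];                 /* keep the running sum in a register */
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            yi += val[k] * x[ind[k]];     /* 2 flops per stored nonzero */
        y[i] = yi;
    }
}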


Motivation for Automatic Performance Tuning of SpMV
Historical trend: sparse matrix-vector multiply (SpMV) runs at 10% of machine peak or less.
Performance depends on the machine, the kernel, and the matrix, and the matrix is known only at run-time.
The best data structure and implementation can be surprising.
Our approach: empirical performance modeling and algorithm search.


SpMV Historical Trends: Fraction of Peak


Example: The Difficulty of Tuning
n = 21,200; nnz = 1.5 M; kernel: SpMV

Source: NASA structural analysis problem


Example: The Difficulty of Tuning (continued)
n = 21,200; nnz = 1.5 M; kernel: SpMV

Source: NASA structural analysis problem; the matrix has an 8x8 dense substructure.


Taking Advantage of Block Structure in SpMV
The bottleneck is the time to get the matrix from memory: there are only 2 flops for each nonzero in the matrix.
Don't store each nonzero with its own index; instead, store each nonzero r-by-c block with a single index.
- Storage drops by up to 2x if r*c >> 1 and all quantities are 32-bit, so the time to fetch the matrix from memory decreases.
This changes both the data structure and the algorithm: we need to pick r and c, and change the kernel accordingly (see the sketch below).
In the example, is r = c = 8 the best choice? It minimizes storage, so it looks like a good idea.
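To make the r-by-c blocking concrete, here is a hedged C sketch of an r = c = 2 register-blocked (BCSR) kernel; the names bcsr_2x2_spmv, b_ptr, b_ind, and b_val are illustrative, not from the slides. Each stored block carries r*c values but only one column index.

/* y <- y + A*x for A in 2x2 BCSR: nb block rows (row dimension 2*nb),
   4 values per block stored row-major, and one column index per block
   (the column of the block's leftmost entries). */
void bcsr_2x2_spmv(int nb, const int *b_ptr, const int *b_ind,
                   const double *b_val, const double *x, double *y)
{
    for (int I = 0; I < nb; I++) {
        double y0 = y[2 * I], y1 = y[2 * I + 1];
        for (int k = b_ptr[I]; k < b_ptr[I + 1]; k++) {
            const double *b = &b_val[4 * k];          /* the 2x2 block */
            double x0 = x[b_ind[k]], x1 = x[b_ind[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;              /* fully unrolled block multiply */
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I] = y0;
        y[2 * I + 1] = y1;
    }
}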


Speedups on Itanium 2: The Need for Search (Mflop/s)


Register Profile: Itanium 2 (190 Mflop/s to 1190 Mflop/s)


SpMV Performance (Matrix #2): Generation 2
Ultra 2i - 9% of peak (63 / 35 Mflop/s)
Ultra 3 - 5% (109 / 53 Mflop/s)
Pentium III-M - 15% (96 / 42 Mflop/s)
Pentium III - 19% (120 / 58 Mflop/s)


Register Profiles: Sun and Intel x86
Ultra 2i - 11% of peak (72 / 35 Mflop/s)
Ultra 3 - 5% (90 / 50 Mflop/s)
Pentium III-M - 15% (108 / 42 Mflop/s)
Pentium III - 21% (122 / 58 Mflop/s)


SpMV Performance (Matrix #2): Generation 1
Power3 - 13% of peak (195 / 100 Mflop/s)
Power4 - 14% (703 / 469 Mflop/s)
Itanium 1 - 7% (225 / 103 Mflop/s)
Itanium 2 - 31% (1.1 Gflop/s / 276 Mflop/s)


Register Profiles: IBM and Intel IA-64
Power3 - 17% of peak (252 / 122 Mflop/s)
Power4 - 16% (820 / 459 Mflop/s)
Itanium 1 - 8% (247 / 107 Mflop/s)
Itanium 2 - 33% (1.2 Gflop/s / 190 Mflop/s)


Another Example of Tuning Challenges
More complicated nonzero structure in general

n = 16,614; nnz = 1.1 M


Zoom In to the Top Corner
More complicated nonzero structure in general

n = 16,614; nnz = 1.1 M


3x3 Blocks Look Natural, but...
More complicated nonzero structure in general

Example: 3x3 blocking on a logical grid of 3x3 cells

... but it would lead to lots of fill-in.


Extra Work Can Improve Efficiency!
More complicated nonzero structure in general

Example: 3x3 blocking
- logical grid of 3x3 cells
- fill in explicit zeros
- unroll the 3x3 block multiplies
- fill ratio = 1.5

On a Pentium III this gives a 1.5x speedup; since 1.5x as many flops are performed in less time, the actual Mflop rate is 1.5^2 = 2.25x higher.


Automatic Register Block Size Selection
Selecting the r x c block size:
- Off-line benchmark: precompute Mflops(r,c) using a dense matrix stored in sparse r x c format, once per machine/architecture.
- Run-time search: sample A to estimate Fill(r,c) for each r x c.
- Run-time heuristic model: choose r and c to minimize time ~ Fill(r,c) / Mflops(r,c). (A sketch of this selection follows below.)
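A minimal sketch of that heuristic in C, assuming the off-line benchmark results and the sampled fill estimates are already available in small tables; mflops, fill, and choose_block_size are illustrative names for this example.

#define R_MAX 8
#define C_MAX 8

/* mflops[r-1][c-1]: dense-in-sparse-format benchmark rate for r x c (off-line).
   fill[r-1][c-1]:   estimated ratio of stored entries to true nonzeros (run-time).
   Pick the (r, c) that minimizes estimated time ~ fill / mflops. */
void choose_block_size(const double mflops[R_MAX][C_MAX],
                       const double fill[R_MAX][C_MAX],
                       int *r_best, int *c_best)
{
    double best = 1e300;
    for (int r = 1; r <= R_MAX; r++) {
        for (int c = 1; c <= C_MAX; c++) {
            double t = fill[r - 1][c - 1] / mflops[r - 1][c - 1];
            if (t < best) {
                best = t;
                *r_best = r;
                *c_best = c;
            }
        }
    }
}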


Accurate and Efficient Adaptive Fill Estimation
Idea: sample the matrix (a sampling sketch follows below).
- Fraction of the matrix to sample: s in [0,1]; cost ~ O(s * nnz).
- Control the cost by controlling s; the search happens at run-time, so the constant matters!
- Control s automatically by computing statistical confidence intervals; idea: monitor the variance.
Cost of tuning:
- Lower bound: converting the matrix costs 5 to 40 unblocked SpMVs.
- The heuristic itself costs 1 to 11 SpMVs.
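As an illustration of the sampling idea (not the exact OSKI estimator), here is a hedged C sketch that estimates Fill(r,c) for one block size by scanning only a subset of the block rows of a CSR matrix; estimate_fill, the stride-based sampling, and the stamp array are assumptions made for this example.

#include <stdlib.h>

/* Estimate Fill(r,c) = (stored entries after r x c blocking) / nnz
   by examining only every `stride`-th block row of a CSR matrix. */
double estimate_fill(int n, const int *ptr, const int *ind,
                     int r, int c, int stride)
{
    int n_bcols = (n + c - 1) / c;
    int *last_seen = malloc(n_bcols * sizeof(int));    /* stamp per block column */
    for (int j = 0; j < n_bcols; j++) last_seen[j] = -1;

    long long nnz_sampled = 0, blocks = 0;
    for (int I = 0; I * r < n; I += stride) {           /* sampled block rows */
        int row_end = (I + 1) * r < n ? (I + 1) * r : n;
        for (int i = I * r; i < row_end; i++) {
            for (int k = ptr[i]; k < ptr[i + 1]; k++) {
                int J = ind[k] / c;                      /* block column of this nonzero */
                nnz_sampled++;
                if (last_seen[J] != I) {                 /* first nonzero seen in block (I, J) */
                    last_seen[J] = I;
                    blocks++;
                }
            }
        }
    }
    free(last_seen);
    if (nnz_sampled == 0) return 1.0;
    return (double)(blocks * r * c) / (double)nnz_sampled;
}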


Accuracy of the Tuning Heuristics (1/4)
NOTE: "fair" flops are used (operations on explicitly filled-in zeros are not counted as work).
See p. 375 of Vuduc's thesis for the list of matrices.


Accuracy of the Tuning Heuristics (2/4)


Accuracy of the Tuning Heuristics (3/4)


Accuracy of the Tuning Heuristics (4/4), with DGEMV shown for comparison


Upper Bounds on Performance for Blocked SpMV
P = (flops) / (time), where flops = 2 * nnz(A).

Lower bound on time, under two main assumptions:
1. Count memory operations only (streaming).
2. Count only compulsory and capacity misses; ignore conflict misses.
Account for cache line sizes, and for the matrix size and nnz.
Charge the minimum access latency alpha_i at the L_i cache and alpha_mem at main memory (e.g., as measured by the Saavedra-Barrera and PMaC MAPS benchmarks).
A simplified version of this bound is sketched below.
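The sketch below (spmv_upper_bound_mflops and its parameters are illustrative, and it folds all cache levels into a single latency) just counts the bytes that must be streamed for an r x c blocked matrix and the vectors, charges one latency per cache line, and converts that time into an upper bound on Mflop/s.

/* Simplified performance upper bound for r x c blocked SpMV.
   Assumptions (not from the slides): 8-byte values, 4-byte indices,
   one memory level with line size `line_bytes` and latency `alpha_ns`
   per line, and x and y each streamed exactly once. */
double spmv_upper_bound_mflops(long long nnz, long long n,
                               int r, int c, double fill,
                               int line_bytes, double alpha_ns)
{
    double stored = fill * (double)nnz;                /* values actually stored */
    double bytes  = 8.0 * stored                       /* block values */
                  + 4.0 * stored / (r * c)             /* one column index per block */
                  + 4.0 * ((double)n / r + 1.0)        /* block-row pointers */
                  + 8.0 * 2.0 * (double)n;             /* x and y, streamed once */
    double time_ns = (bytes / line_bytes) * alpha_ns;  /* one latency per cache line */
    double flops   = 2.0 * (double)nnz;                /* useful flops only */
    return flops / (time_ns * 1e-3);                   /* ns -> Mflop/s */
}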


Example: L2 Misses on Itanium 2
Misses measured using PAPI [Browne 00]


Example: Bounds on Itanium 2


Example: Bounds on Itanium 2


Example: Bounds on Itanium 2


Summary of Other Performance Optimizations
Optimizations for SpMV:
- Register blocking (RB): up to 4x over CSR
- Variable block splitting: 2.1x over CSR, 1.8x over RB
- Diagonals: 2x over CSR
- Reordering to create dense structure + splitting: 2x over CSR
- Symmetry: 2.8x over CSR, 2.6x over RB
- Cache blocking: 2.8x over CSR
- Multiple vectors (SpMM): 7x over CSR (see the sketch below)
- And combinations of the above
Sparse triangular solve:
- Hybrid sparse/dense data structure: 1.8x over CSR
Higher-level kernels:
- A*A^T*x and A^T*A*x: 4x over CSR, 1.8x over RB
- A^2*x: 2x over CSR, 1.5x over RB
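The "multiple vectors" (SpMM) entry wins by reusing each matrix element across several right-hand sides. A minimal CSR sketch follows; csr_spmm and the row-major layout of X and Y are assumptions made for this example.

/* Y <- Y + A*X, where X and Y are n x nv dense blocks stored row-major.
   Each val[k]/ind[k] pair is loaded once and reused for all nv vectors. */
void csr_spmm(int n, int nv, const int *ptr, const int *ind,
              const double *val, const double *X, double *Y)
{
    for (int i = 0; i < n; i++) {
        for (int k = ptr[i]; k < ptr[i + 1]; k++) {
            double a = val[k];
            const double *xrow = &X[(long)ind[k] * nv];
            double *yrow = &Y[(long)i * nv];
            for (int v = 0; v < nv; v++)
                yrow[v] += a * xrow[v];   /* amortize the matrix access over nv flops */
        }
    }
}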


SpMV for Shared Memory and Multicore
Data structure transformations:
- Thread blocking
- Cache blocking
- Register blocking
- Format selection
- Index size reduction
Kernel optimizations:
- Prefetching
- Loop structure


Thread Blocking
Load balancing: evenly divide the number of nonzeros among threads.
Exploit NUMA memory systems on multi-socket SMPs: threads must be pinned to cores AND data must be pinned to sockets (a pinning sketch follows below).
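On Linux, pinning a thread to a core can be done with the pthread affinity call shown below; the helper name pin_to_core and the first-touch note are the usual NUMA idiom assumed for this example, not something prescribed by the slides.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one core (Linux/glibc extension). */
static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* "Pinning data to sockets" is typically achieved by first-touch:
   after pinning itself, each thread initializes (touches) only the
   portion of val/ind/ptr/x/y it will later use, so those pages are
   allocated from its own socket's memory. */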


Naïve Approach
Use an R x C processor grid in which each processor covers the same number of rows and columns. This is potentially unbalanced, since the nonzeros are not spread evenly.


Load Balanced Approach
Use an R x C processor grid:
1. First, block into R rows with the same number of nonzeros in each blocked row.
2. Second, block within each blocked row: not only should each block within a row have about the same number of nonzeros, all blocks should have about the same number of nonzeros.
3. Third, prune unneeded rows and columns.
4. Fourth, re-encode the column indices to be relative to each thread block.
(A sketch of the first step appears below.)
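A minimal sketch of the first step in C, splitting the rows of a CSR matrix into nthreads groups with roughly equal nonzero counts; partition_rows_by_nnz is an illustrative name.

/* row_start[t]..row_start[t+1]-1 are the rows assigned to thread t.
   ptr is the CSR row-pointer array, so ptr[n] == nnz. */
void partition_rows_by_nnz(int n, const int *ptr, int nthreads, int *row_start)
{
    int nnz = ptr[n];
    int row = 0;
    row_start[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        long long target = (long long)nnz * t / nthreads;   /* ideal nonzero prefix */
        while (row < n && ptr[row] < target)
            row++;
        row_start[t] = row;
    }
    row_start[nthreads] = n;
}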


Memory Optimizations
Cache blocking:
- Performed for each thread block: chop it into blocks so that the accessed part of the source vector fits in cache.
Prefetching:
- Insert explicit prefetch operations to mask the latency to memory; tune the prefetch distance/time using search (see the sketch below).
Register blocking:
- As in OSKI, but done separately per cache block, with a simpler heuristic: choose the block size that minimizes total storage.
Index compression:
- Use 16-bit integers for column indices in blocks less than 64K wide.
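As an example of the prefetching item, here is a hedged variant of the CSR loop that issues software prefetches with the GCC/Clang __builtin_prefetch intrinsic; the distance PF_DIST is exactly the kind of parameter the slides suggest tuning by search, and a production kernel would guard the final iterations against reading past the arrays.

#define PF_DIST 64   /* nonzeros ahead; tuned by search in practice */

void csr_spmv_prefetch(int n, const int *ptr, const int *ind,
                       const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i + 1]; k++) {
            __builtin_prefetch(&val[k + PF_DIST], 0, 0);   /* matrix values */
            __builtin_prefetch(&ind[k + PF_DIST], 0, 0);   /* column indices */
            yi += val[k] * x[ind[k]];
        }
        y[i] = yi;
    }
}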


1 thread Performance (preliminary)


2 thread Performance (preliminary)


4 thread Performance (preliminary)


Speedup for the best combination of number of threads, blocking, and prefetching: 7.3x, 6.4x, 3.8x, 4.2x.


Distributed Memory SpMV
y = A*x, where A is a sparse n x n matrix.

Questions:
- Which processors store y[i], x[i], and A[i,j]?
- Which processors compute y[i] = sum over j = 1 to n of A[i,j] * x[j] = (row i of A) * x, a sparse dot product?
Partitioning:
- Partition the index set {1,...,n} = N1 ∪ N2 ∪ ... ∪ Np.
- For all i in Nk, processor k stores y[i], x[i], and row i of A.
- For all i in Nk, processor k computes y[i] = (row i of A) * x: the "owner computes" rule, i.e., processor k computes the y[i]'s it owns.
This may require communication, since row i of A can reference x[j]'s owned by other processors (a naive MPI sketch follows below).
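A deliberately naive MPI sketch of the 1D, owner-computes scheme: each rank owns a contiguous block of rows (plus the matching slice of x and y), gathers the full x with MPI_Allgatherv, and then runs the local CSR kernel. The function dist_spmv and its argument list are assumptions for this example; a real code would exchange only the x entries its rows actually touch.

#include <mpi.h>

/* This rank owns n_local consecutive rows of A (CSR with *global* column
   indices), plus x_local and y_local of length n_local. counts/displs
   describe every rank's x slice; x_full has room for the whole vector. */
void dist_spmv(int n_local,
               const int *ptr, const int *ind, const double *val,
               const double *x_local, double *y_local,
               const int *counts, const int *displs,
               double *x_full, MPI_Comm comm)
{
    /* 1. Gather the entire source vector on every rank (naive but simple). */
    MPI_Allgatherv(x_local, n_local, MPI_DOUBLE,
                   x_full, counts, displs, MPI_DOUBLE, comm);

    /* 2. Owner computes: each rank forms y[i] only for the rows it owns. */
    for (int i = 0; i < n_local; i++) {
        double yi = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            yi += val[k] * x_full[ind[k]];
        y_local[i] = yi;
    }
}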


Two Layouts
The partitions should be by nonzero counts, not by rows/columns.
1D partition: most popular, but for algorithms (e.g. NAS CG) that do reductions on y, these reductions scale with log P.
2D partition: reductions scale with log sqrt(P), b