# CS 267 Sparse Matrices: Sparse Matrix-Vector Multiply for Iterative Solvers


• CS 267 Sparse Matrices: Sparse Matrix-Vector Multiply for Iterative Solvers
Kathy Yelick

www.cs.berkeley.edu/~yelick/cs267_sp07

CS267 Lecture 10

• Phillip Colella's "Seven Dwarfs"
High-end simulation in the physical sciences = 7 numerical methods:
1. Structured Grids (including locally structured grids, e.g. AMR)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
These are well-defined targets from the algorithmic, software, and architecture standpoints.
Add 4 for embedded: 8. Search/Sort, 9. Finite State Machine, 10. Filter, 11. Combinational Logic; these then cover all 41 EEMBC benchmarks.
Revise 1 for SPEC: 7. Monte Carlo => easily parallel (add ray tracing); these then cover 26 SPEC benchmarks.
Slide from "Defining Software Requirements for Scientific Computing," Phillip Colella, 2004


• ODEs and Sparse Matrices
All these problems reduce to sparse matrix problems:
- Explicit: sparse matrix-vector multiplication (SpMV)
- Implicit: solve a sparse linear system
  - direct solvers (Gaussian elimination)
  - iterative solvers (use sparse matrix-vector multiplication)
- Eigenvalue/vector algorithms may also be explicit or implicit
Conclusion: SpMV is key to many ODE problems. It is a relatively simple algorithm to study in detail, with two key problems: locality and load balance.


• SpMV in Compressed Sparse Row (CSR) Format
Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j) * x(j)

for each row i
    for k = ptr[i] to ptr[i+1] - 1 do
        y[i] = y[i] + val[k] * x[ind[k]]

CSR format is only one of many possible representations of A.
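A runnable version of the CSR kernel above, in plain Python (the array names `ptr`, `ind`, `val` follow the pseudocode; the 2x2 example matrix is made up for illustration):

```python
def spmv_csr(ptr, ind, val, x):
    """y = A*x with A in CSR: ptr[i]..ptr[i+1]-1 indexes row i's nonzeros."""
    n = len(ptr) - 1
    y = [0.0] * n
    for i in range(n):                       # for each row i
        for k in range(ptr[i], ptr[i + 1]):  # nonzeros of row i
            y[i] += val[k] * x[ind[k]]       # y[i] += A[i,j] * x[j], j = ind[k]
    return y

# Example: A = [[1, 2], [0, 3]] stored in CSR
ptr = [0, 2, 3]
ind = [0, 1, 1]
val = [1.0, 2.0, 3.0]
print(spmv_csr(ptr, ind, val, [1.0, 1.0]))  # [3.0, 3.0]
```

Note the two flops (one multiply, one add) per stored nonzero, which is why memory traffic, not arithmetic, dominates.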


• Motivation for Automatic Performance Tuning of SpMV
- Historical trends: sparse matrix-vector multiply (SpMV) runs at 10% of peak or less
- Performance depends on machine, kernel, and matrix
- The matrix is known only at run-time
- The best data structure + implementation can be surprising
- Our approach: empirical performance modeling and algorithm search


• SpMV Historical Trends: Fraction of Peak


• Example: The Difficulty of Tuning
n = 21200, nnz = 1.5 M, kernel: SpMV

Source: NASA structural analysis problem


• Example: The Difficulty of Tuning
n = 21200, nnz = 1.5 M, kernel: SpMV

Source: NASA structural analysis problem; the matrix has an 8x8 dense substructure


• Taking Advantage of Block Structure in SpMV
- The bottleneck is the time to get the matrix from memory: only 2 flops for each nonzero in the matrix
- Don't store each nonzero with its own index; instead store each nonzero r-by-c block with one index
  - Storage drops by up to 2x if rc >> 1 (all 32-bit quantities), so the time to fetch the matrix from memory decreases
- Change both data structure and algorithm: need to pick r and c, and change the algorithm accordingly
- In the example, is r = c = 8 the best choice? It minimizes storage, so it looks like a good idea
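A minimal sketch of the r-by-c blocked (BCSR) kernel described above, in Python. The array names `b_ptr`, `b_ind`, `b_val` are illustrative; blocks are stored densely in row-major order with any fill-in zeros included, and only one column index is kept per block:

```python
def spmv_bcsr(b_ptr, b_ind, b_val, x, r, c):
    """y = A*x for A in r-by-c blocked CSR: one index per block,
    block entries stored densely (row-major) in b_val."""
    n_brows = len(b_ptr) - 1
    y = [0.0] * (n_brows * r)
    for I in range(n_brows):                     # block row I
        for k in range(b_ptr[I], b_ptr[I + 1]):  # blocks of block row I
            J = b_ind[k]                         # block column index
            blk = b_val[k * r * c:(k + 1) * r * c]
            for ii in range(r):                  # the r-by-c multiply is what a
                for jj in range(c):              # tuned kernel would unroll
                    y[I * r + ii] += blk[ii * c + jj] * x[J * c + jj]
    return y

# Example: A = [[1, 2], [3, 4]] as a single 2x2 block
print(spmv_bcsr([0, 1], [0], [1.0, 2.0, 3.0, 4.0], [1.0, 1.0], 2, 2))  # [3.0, 7.0]
```

The inner r-by-c loop has compile-time trip counts, which is what lets a generated kernel unroll it and keep block entries in registers.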


• Speedups on Itanium 2: The Need for Search (figure; color scale in Mflop/s)


• Register Profile: Itanium 2 (figure; performance ranges from 190 Mflop/s to 1190 Mflop/s)


• SpMV Performance (Matrix #2): Generation 2 (figure; % of peak, with best / reference Mflop/s per machine)
Ultra 2i - 9%: 63 / 35 Mflop/s; Ultra 3 - 5%: 109 / 53 Mflop/s; Pentium III-M - 15%: 96 / 42 Mflop/s; Pentium III - 19%: 120 / 58 Mflop/s


• Register Profiles: Sun and Intel x86 (figure; % of peak, with best / reference Mflop/s per machine)
Ultra 2i - 11%: 72 / 35 Mflop/s; Ultra 3 - 5%: 90 / 50 Mflop/s; Pentium III-M - 15%: 108 / 42 Mflop/s; Pentium III - 21%: 122 / 58 Mflop/s


• SpMV Performance (Matrix #2): Generation 1 (figure; % of peak, with best / reference rates per machine)
Power3 - 13%: 195 / 100 Mflop/s; Power4 - 14%: 703 / 469 Mflop/s; Itanium 1 - 7%: 225 / 103 Mflop/s; Itanium 2 - 31%: 1.1 Gflop/s / 276 Mflop/s


• Register Profiles: IBM and Intel IA-64 (figure; % of peak, with best / reference rates per machine)
Power3 - 17%: 252 / 122 Mflop/s; Power4 - 16%: 820 / 459 Mflop/s; Itanium 1 - 8%: 247 / 107 Mflop/s; Itanium 2 - 33%: 1.2 Gflop/s / 190 Mflop/s


• Another Example of Tuning Challenges
More complicated non-zero structure in general

N = 16614, NNZ = 1.1 M


• Zoom In to Top Corner
More complicated non-zero structure in general

N = 16614, NNZ = 1.1 M


• 3x3 Blocks Look Natural, But...
Example: 3x3 blocking imposes a logical grid of 3x3 cells on the matrix.

But this would lead to lots of fill-in.


• Extra Work Can Improve Efficiency!
Example: 3x3 blocking
- Logical grid of 3x3 cells
- Fill in explicit zeros
- Unroll 3x3 block multiplies
- Fill ratio = 1.5

On Pentium III: 1.5x speedup! The raw Mflop rate (which counts the explicit zeros as work) is 1.5^2 = 2.25x higher.
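A quick arithmetic check of the Pentium III claim above: counting the explicitly stored zeros as flops inflates the measured Mflop rate by the fill ratio on top of the true speedup.

```python
fill_ratio = 1.5   # stored entries (incl. explicit zeros) / true nonzeros
speedup    = 1.5   # time(CSR) / time(3x3 blocked), from the slide

# The raw rate counts padded zeros as work, so it rises by both factors:
raw_rate_increase = fill_ratio * speedup
print(raw_rate_increase)  # 2.25
```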


• Automatic Register Block Size Selection
Selecting the r x c block size:
- Off-line benchmark: precompute Mflops(r,c) using a dense matrix stored in sparse r x c format, once per machine/architecture
- Run-time search: sample A to estimate Fill(r,c) for each r x c
- Run-time heuristic model: choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c)
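The heuristic above can be sketched in a few lines of Python. The benchmark and fill numbers below are invented for illustration (larger blocks run faster on the dense benchmark but pad in more zeros), not measured:

```python
def choose_block_size(mflops, fill, sizes):
    """Pick the (r, c) minimizing estimated time ~ Fill(r,c) / Mflops(r,c).

    mflops: off-line benchmark of dense-in-sparse-blocked format, mflops[(r,c)]
    fill:   run-time sampled estimate of the fill ratio, fill[(r,c)]
    """
    return min(sizes, key=lambda rc: fill[rc] / mflops[rc])

# Hypothetical numbers for one machine and one matrix:
mflops = {(1, 1): 100.0, (2, 2): 250.0, (4, 4): 400.0}
fill   = {(1, 1): 1.0,   (2, 2): 1.3,   (4, 4): 2.6}
print(choose_block_size(mflops, fill, mflops.keys()))  # (2, 2)
```

Here 4x4 has the highest raw Mflop rate but loses because its fill ratio doubles the memory traffic; 2x2 wins the time estimate.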


• Accurate and Efficient Adaptive Fill Estimation
Idea: sample the matrix
- Fraction of matrix to sample: s in [0,1]; cost ~ O(s * nnz)
- Control cost by controlling s; since the search happens at run-time, the constant matters!
- Control s automatically by computing statistical confidence intervals; idea: monitor the variance
Cost of tuning
- Lower bound: converting the matrix costs 5 to 40 unblocked SpMVs
- Heuristic: 1 to 11 SpMVs
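One way to implement the sampling idea (a sketch under simple assumptions, not the tuned estimator): examine a random fraction s of the block rows and compare blocked storage, padding included, against the true nonzero count.

```python
import random

def estimate_fill(ptr, ind, r, c, s, seed=0):
    """Estimate Fill(r,c) = (blocked entries incl. padding) / (true nonzeros)
    by examining a random fraction s of the block rows; cost ~ O(s * nnz)."""
    random.seed(seed)
    nrows = len(ptr) - 1
    n_brows = (nrows + r - 1) // r
    sample = [I for I in range(n_brows) if random.random() < s]
    true_nnz = padded = 0
    for I in sample:
        blocks = set()
        for i in range(I * r, min((I + 1) * r, nrows)):
            for k in range(ptr[i], ptr[i + 1]):
                blocks.add(ind[k] // c)   # which block column is touched
                true_nnz += 1
        padded += len(blocks) * r * c     # each touched block stores r*c entries
    return padded / true_nnz if true_nnz else 1.0

# With s = 1.0 this is exact: A = [[1, 2], [0, 3]] in 2x2 blocks stores
# 4 entries for 3 true nonzeros, i.e. fill = 4/3.
print(estimate_fill([0, 2, 3], [0, 1, 1], 2, 2, 1.0))
```

The production version would additionally grow s until a confidence interval on the estimate is tight enough, as described above.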


• Accuracy of the Tuning Heuristics (1/4)
NOTE: "Fair" flops used (ops on explicit zeros not counted as work). See p. 375 of Vuduc's thesis for the matrices.


• Accuracy of the Tuning Heuristics (2/4)


• Accuracy of the Tuning Heuristics (3/4)


• Accuracy of the Tuning Heuristics (4/4): DGEMV


• Upper Bounds on Performance for Blocked SpMV
P = (flops) / (time), where flops = 2 * nnz(A)

Lower bound on time, under two main assumptions:
1. Count memory ops only (streaming)
2. Count only compulsory and capacity misses; ignore conflicts
The model accounts for line sizes and for the matrix size and nnz, and charges the minimum access latency a_i at the L_i cache and a_mem at memory, e.g., as measured by the Saavedra-Barrera and PMaC MAPS benchmarks.


• Example: L2 Misses on Itanium 2
Misses measured using PAPI [Browne 00]


• Example: Bounds on Itanium 2






• Summary of Other Performance Optimizations
Optimizations for SpMV:
- Register blocking (RB): up to 4x over CSR
- Variable block splitting: 2.1x over CSR, 1.8x over RB
- Diagonals: 2x over CSR
- Reordering to create dense structure + splitting: 2x over CSR
- Symmetry: 2.8x over CSR, 2.6x over RB
- Cache blocking: 2.8x over CSR
- Multiple vectors (SpMM): 7x over CSR
- And combinations of the above
Sparse triangular solve:
- Hybrid sparse/dense data structure: 1.8x over CSR
Higher-level kernels:
- A*A^T*x and A^T*A*x: 4x over CSR, 1.8x over RB
- A^2*x: 2x over CSR, 1.5x over RB


• SpMV for Shared Memory and Multicore
Data structure transformations:
- Thread blocking
- Cache blocking
- Register blocking
- Format selection
- Index size reduction
Kernel optimizations:
- Prefetching
- Loop structure


• Thread Blocking
Load balancing: evenly divide the number of nonzeros across threads.
Exploit NUMA memory systems on multi-socket SMPs: must pin threads to cores AND pin data to sockets.
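The load-balancing rule (equal nonzeros per thread, not equal rows) can be sketched as a partitioner over the CSR row-pointer array; the function name is illustrative:

```python
def partition_rows(ptr, nthreads):
    """Split rows into nthreads contiguous chunks with roughly equal
    nonzero counts; ptr is the CSR row-pointer array. Returns row
    boundaries: thread t gets rows bounds[t] .. bounds[t+1]-1."""
    nnz = ptr[-1]
    bounds = [0]
    for t in range(1, nthreads):
        target = t * nnz // nthreads
        # first row whose starting offset reaches the cumulative target
        row = next(i for i in range(bounds[-1], len(ptr)) if ptr[i] >= target)
        bounds.append(row)
    bounds.append(len(ptr) - 1)
    return bounds

# Row 0 holds 4 of the 8 nonzeros, so thread 0 gets just that row
# while thread 1 gets the four 1-nonzero rows:
print(partition_rows([0, 4, 5, 6, 7, 8], 2))  # [0, 1, 5]
```

An equal-rows split of the same matrix would give one thread 6 of the 8 nonzeros, which is exactly the imbalance this avoids.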


• Naïve Approach
R x C processor grid; each processor covers the same number of rows and columns.
Potentially unbalanced.


• Load Balanced Approach
R x C processor grid:
1. First, block into rows: the same number of nonzeros in each of the R blocked rows
2. Second, block within each blocked row: not only should each block within a row have ~the same number of nonzeros, but all blocks should have ~the same number of nonzeros
3. Third, prune unneeded rows & columns
4. Fourth, re-encode the column indices to be relative to each thread block


• Memory Optimizations
- Cache blocking: performed for each thread block; chop into blocks so the entire source vector fits in cache
- Prefetching: insert explicit prefetch operations to mask latency to memory; tune the prefetch distance/time using search
- Register blocking: as in OSKI, but done separately per cache block, with a simpler heuristic: choose the block size that minimizes total storage
- Index compression: use 16-bit ints for indices in blocks less than 64K wide
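The index-compression idea can be illustrated with Python's `array` module (a sketch; a real implementation would pack indices in C when building each cache block):

```python
import array

def compress_indices(ind, width):
    """If every column index in this cache block is below 64K
    (block width < 2**16), store indices as 16-bit ints instead of
    32-bit, halving index storage and memory traffic."""
    if width < 2 ** 16:
        return array.array('H', ind)   # unsigned 16-bit
    return array.array('I', ind)       # unsigned 32-bit (or wider)

small = compress_indices([0, 5, 100], width=1000)
print(small.itemsize, list(small))  # 2 [0, 5, 100]
```

Since SpMV is memory-bound and indices are a large share of CSR storage, halving the index width directly cuts the dominant cost.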


• Speedup for the best combination of NThreads, blocking, and prefetching (figure: speedups of 7.3x, 6.4x, 3.8x, and 4.2x)


• Distributed Memory SpMV
y = A*x, where A is a sparse n x n matrix

Questions:
- Which processors store y[i], x[i], and A[i,j]?
- Which processors compute y[i] = sum over j of A[i,j] * x[j] = (row i of A) * x, a sparse dot product?
Partitioning:
- Partition the index set {1,...,n} = N1 ∪ N2 ∪ ... ∪ Np
- For all i in Nk, processor k stores y[i], x[i], and row i of A
- For all i in Nk, processor k computes y[i] = (row i of A) * x ("owner computes" rule: processor k computes the y[i] it owns)
This may require communication, since row i of A can reference x[j] entries owned by other processors.
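The owner-computes rule can be sketched as a sequential simulation (the function name and the contiguous row partition `bounds` are illustrative; a real distributed code would fetch the remote x entries via messages instead of reading them directly):

```python
def owner_computes(ptr, ind, val, x, bounds):
    """Simulate y = A*x under the owner-computes rule: 'processor' k owns
    rows bounds[k] .. bounds[k+1]-1 of the CSR matrix, plus x[i] and y[i]
    for those i, and computes exactly the y[i] it owns. Any x[j] outside
    a processor's own range would require communication."""
    y = [0.0] * (len(ptr) - 1)
    for k in range(len(bounds) - 1):          # each processor k
        lo, hi = bounds[k], bounds[k + 1]
        remote = set()
        for i in range(lo, hi):               # owned rows
            for p in range(ptr[i], ptr[i + 1]):
                j = ind[p]
                if not (lo <= j < hi):
                    remote.add(j)             # x[j] lives on another processor
                y[i] += val[p] * x[j]         # sparse dot product: (row i) * x
        print(f"proc {k}: rows [{lo},{hi}), needs {len(remote)} remote x entries")
    return y

# A = [[1, 2], [0, 3]] split across two "processors", one row each:
print(owner_computes([0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [1.0, 1.0], [0, 1, 2]))
```

The `remote` sets are exactly the communication volume a message-passing implementation would have to move before (or while) computing.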


• Two Layouts
The partitions should be by nonzero counts, not by rows/columns.
- 1D partition: most popular, but for algorithms (e.g., NAS CG) that do reductions on y, those reductions scale with log P
- 2D partition: reductions scale with log sqrt(P)
