CS 267 Sparse Matrices: Sparse Matrix-Vector Multiply for Iterative Solvers

  • CS 267 Sparse Matrices: Sparse Matrix-Vector Multiply for Iterative Solvers
    Kathy Yelick
    www.cs.berkeley.edu/~yelick/cs267_sp07

  • Phillip Colella's "Seven Dwarfs"
    High-end simulation in the physical sciences = 7 numerical methods:
    1. Structured Grids (including locally structured grids, e.g. AMR)
    2. Unstructured Grids
    3. Fast Fourier Transform
    4. Dense Linear Algebra
    5. Sparse Linear Algebra
    6. Particles
    7. Monte Carlo
    Well-defined targets from the algorithmic, software, and architecture standpoint.
    Add 4 for embedded:
    8. Search/Sort
    9. Finite State Machine
    10. Filter
    11. Combinational logic
    Then covers all 41 EEMBC benchmarks.
    Revise 1 for SPEC:
    7. Monte Carlo => Easily parallel (to add ray tracing)
    Then covers 26 SPEC benchmarks.
    Slide from "Defining Software Requirements for Scientific Computing", Phillip Colella, 2004

  • ODEs and Sparse Matrices
    All these problems reduce to sparse matrix problems.
    Explicit methods: sparse matrix-vector multiplication (SpMV).
    Implicit methods: solve a sparse linear system, with
      - direct solvers (Gaussian elimination), or
      - iterative solvers (which use sparse matrix-vector multiplication).
    Eigenvalue/eigenvector algorithms may also be explicit or implicit.
    Conclusion: SpMV is key to many ODE problems.
    It is a relatively simple algorithm to study in detail, with two key problems: locality and load balance.
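    To make the explicit/implicit distinction concrete, consider the linear ODE system du/dt = A u with a sparse A (an illustrative example, not taken from the slides). One forward-Euler step needs only an SpMV, while one backward-Euler step needs a sparse solve:

    % Forward (explicit) Euler: one SpMV per time step.
    u_{k+1} = u_k + \Delta t \, A u_k
    % Backward (implicit) Euler: one sparse linear solve per time step.
    (I - \Delta t \, A)\, u_{k+1} = u_k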

  • SpMV in Compressed Sparse Row (CSR) Format
    Matrix-vector multiply kernel: y(i) = y(i) + A(i,j) * x(j)

    for each row i
      for k = ptr[i] to ptr[i+1]-1 do
        y[i] = y[i] + val[k] * x[ind[k]]

    CSR is only one of many possible representations of A.
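    A minimal, self-contained C version of the kernel above (the ptr/ind/val names follow the slide; the 3x3 test matrix is made up purely for illustration):

    #include <stdio.h>

    /* y = y + A*x with A in CSR: ptr[0..n], ind[0..nnz-1], val[0..nnz-1]. */
    static void spmv_csr(int n, const int *ptr, const int *ind,
                         const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)                  /* for each row i              */
            for (int k = ptr[i]; k < ptr[i+1]; k++)  /* nonzeros ptr[i]..ptr[i+1]-1 */
                y[i] += val[k] * x[ind[k]];
    }

    int main(void)
    {
        /* A = [[4,0,1],[0,3,0],[2,0,5]], x = ones(3)  =>  y = (5, 3, 7) */
        int    ptr[] = {0, 2, 3, 5};
        int    ind[] = {0, 2, 1, 0, 2};
        double val[] = {4, 1, 3, 2, 5};
        double x[]   = {1, 1, 1}, y[] = {0, 0, 0};
        spmv_csr(3, ptr, ind, val, x, y);
        printf("%g %g %g\n", y[0], y[1], y[2]);
        return 0;
    }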

  • Motivation for Automatic Performance Tuning of SpMV
    Historical trend: sparse matrix-vector multiply (SpMV) runs at 10% of peak or less.
    Performance depends on the machine, the kernel, and the matrix, and the matrix is known only at run-time.
    The best data structure + implementation can be surprising.
    Our approach: empirical performance modeling and algorithm search.

  • SpMV Historical Trends: Fraction of Peak

  • Example: The Difficulty of Tuning
    n = 21,200; nnz = 1.5 M; kernel: SpMV

    Source: NASA structural analysis problem

  • Example: The Difficulty of Tuning (continued)
    n = 21,200; nnz = 1.5 M; kernel: SpMV

    Source: NASA structural analysis problem; the matrix has an 8x8 dense substructure

  • Taking Advantage of Block Structure in SpMV
    The bottleneck is the time to get the matrix from memory: there are only 2 flops for each nonzero in the matrix.
    Don't store each nonzero with its own index; instead store each nonzero r-by-c block with one index.
    Storage drops by up to 2x if r*c >> 1 (all 32-bit quantities), so the time to fetch the matrix from memory decreases.
    This changes both the data structure and the algorithm:
      - need to pick r and c
      - need to change the algorithm accordingly (see the blocked-kernel sketch below)
    In the example, is r = c = 8 the best choice? It minimizes storage, so it looks like a good idea.
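    For illustration, a sketch of a register-blocked (BCSR) kernel with fixed 2x2 blocks; the names b_ptr, b_ind, and b_val (block-row pointers, one starting-column index per block, block values stored row-major) are mine, not from the slides:

    /* y += A*x with A stored as 2x2 blocks: one column index per block instead
     * of four, and the 2x2 block multiply is fully unrolled into registers.
     * b_ind[k] is the starting (block-aligned) column of block k. */
    static void spmv_bcsr_2x2(int n_block_rows, const int *b_ptr, const int *b_ind,
                              const double *b_val, const double *x, double *y)
    {
        for (int I = 0; I < n_block_rows; I++) {
            double y0 = 0.0, y1 = 0.0;                 /* accumulators for block row I */
            for (int k = b_ptr[I]; k < b_ptr[I+1]; k++) {
                const double *b = &b_val[4*k];         /* 2x2 block, row-major         */
                double x0 = x[b_ind[k]], x1 = x[b_ind[k] + 1];
                y0 += b[0]*x0 + b[1]*x1;
                y1 += b[2]*x0 + b[3]*x1;
            }
            y[2*I]     += y0;
            y[2*I + 1] += y1;
        }
    }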

  • Speedups on Itanium 2: The Need for Search
    (register-blocking performance plots, in Mflop/s)

  • Register Profile: Itanium 2
    (performance across r x c block sizes ranges from 190 Mflop/s to 1190 Mflop/s)

  • SpMV Performance (Matrix #2): Generation 2
    Ultra 2i: 9% of peak (best 63 Mflop/s, reference 35 Mflop/s)
    Ultra 3: 5% of peak (best 109 Mflop/s, reference 53 Mflop/s)
    Pentium III: 19% of peak (best 96 Mflop/s, reference 42 Mflop/s)
    Pentium III-M: 15% of peak (best 120 Mflop/s, reference 58 Mflop/s)

  • Register Profiles: Sun and Intel x86
    Ultra 2i: 11% of peak (best 72 Mflop/s, worst 35 Mflop/s over block sizes)
    Ultra 3: 5% of peak (best 90 Mflop/s, worst 50 Mflop/s)
    Pentium III: 21% of peak (best 108 Mflop/s, worst 42 Mflop/s)
    Pentium III-M: 15% of peak (best 122 Mflop/s, worst 58 Mflop/s)

  • SpMV Performance (Matrix #2): Generation 1
    Power3: 13% of peak (best 195 Mflop/s, reference 100 Mflop/s)
    Power4: 14% of peak (best 703 Mflop/s, reference 469 Mflop/s)
    Itanium 1: 7% of peak (best 225 Mflop/s, reference 103 Mflop/s)
    Itanium 2: 31% of peak (best 1.1 Gflop/s, reference 276 Mflop/s)

  • Register Profiles: IBM and Intel IA-64
    Power3: 17% of peak (best 252 Mflop/s, worst 122 Mflop/s over block sizes)
    Power4: 16% of peak (best 820 Mflop/s, worst 459 Mflop/s)
    Itanium 1: 8% of peak (best 247 Mflop/s, worst 107 Mflop/s)
    Itanium 2: 33% of peak (best 1.2 Gflop/s, worst 190 Mflop/s)

  • Another Example of Tuning Challenges
    More complicated nonzero structure in general

    n = 16,614; nnz = 1.1 M

  • Zoom In to the Top Corner
    More complicated nonzero structure in general

    n = 16,614; nnz = 1.1 M

  • 3x3 Blocks Look Natural, but...
    More complicated nonzero structure in general

    Example: 3x3 blocking on a logical grid of 3x3 cells

    But this would lead to lots of fill-in

  • Extra Work Can Improve Efficiency!
    More complicated nonzero structure in general

    Example: 3x3 blocking
      - Logical grid of 3x3 cells
      - Fill in explicit zeros
      - Unroll the 3x3 block multiplies
      - Fill ratio = 1.5

    On Pentium III: 1.5x speedup!
    The raw Mflop rate is 1.5^2 = 2.25x higher, because the kernel also executes 1.5x as many flops.
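    Spelling out that arithmetic with the slide's numbers: the effective rate only credits the 2*nnz useful flops, while the machine actually executes fill-ratio times more,

    P_{\text{effective}} = \frac{2\,\mathrm{nnz}}{t}, \qquad
    P_{\text{raw}} = \frac{2\,\mathrm{nnz}\cdot\mathrm{fill}}{t}

    so a 1.5x reduction in the time t together with fill = 1.5 raises the raw flop rate by 1.5 * 1.5 = 2.25x, while the useful-work rate improves by 1.5x.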

  • Automatic Register Block Size Selection
    Selecting the r x c block size:
      - Off-line benchmark: precompute Mflops(r,c) using a dense A stored in sparse blocked format, for each r x c (once per machine/architecture).
      - Run-time search: sample A to estimate Fill(r,c) for each r x c.
      - Run-time heuristic model: choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c), as sketched below.
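    A sketch of that heuristic in C. The mflops table is assumed to come from the off-line dense benchmark and estimate_fill from the run-time sampling on the next slide; the names are illustrative, not OSKI's actual API:

    #define RMAX 8
    #define CMAX 8

    extern double mflops[RMAX][CMAX];                         /* off-line benchmark results   */
    extern double estimate_fill(const void *A, int r, int c); /* run-time sampled fill ratio  */

    /* Choose the r x c block size minimizing estimated time ~ Fill(r,c)/Mflops(r,c). */
    static void choose_block_size(const void *A, int *r_best, int *c_best)
    {
        double best = 1e300;
        for (int r = 1; r <= RMAX; r++)
            for (int c = 1; c <= CMAX; c++) {
                double t = estimate_fill(A, r, c) / mflops[r-1][c-1];
                if (t < best) { best = t; *r_best = r; *c_best = c; }
            }
    }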

  • Accurate and Efficient Adaptive Fill Estimation
    Idea: sample the matrix.
      - Fraction of the matrix to sample: s in [0,1]; cost ~ O(s * nnz).
      - Control the cost by controlling s; since the search runs at run-time, the constant matters!
      - Control s automatically by computing statistical confidence intervals; idea: monitor the variance.
    Cost of tuning:
      - Lower bound: converting the matrix costs 5 to 40 unblocked SpMVs.
      - Heuristic: 1 to 11 SpMVs.
    (A sketch of the sampling idea follows.)
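    A sketch of the sampling idea for a CSR matrix: visit roughly a fraction s of the block rows, count how many r x c blocks their nonzeros would occupy, and compare with the true nonzero count. The confidence-interval control is omitted; this is illustrative, not the thesis code:

    #include <stdlib.h>
    #include <string.h>

    /* Estimate Fill(r,c) for an n x n CSR matrix by sampling ~a fraction s of
     * the block rows (no confidence-interval control in this sketch). */
    static double estimate_fill_csr(int n, const int *ptr, const int *ind,
                                    int r, int c, double s)
    {
        int  n_bcols = (n + c - 1) / c;
        char *seen   = calloc(n_bcols, 1);        /* block columns touched in this block row */
        long blocks = 0, nnz_sampled = 0;
        int  stride = s > 0 ? (int)(1.0 / s) : 1; /* visit every stride-th block row */

        for (int I = 0; I < n; I += r * stride) { /* rows I .. I+r-1 form one block row */
            memset(seen, 0, n_bcols);
            for (int i = I; i < I + r && i < n; i++)
                for (int k = ptr[i]; k < ptr[i+1]; k++) {
                    int J = ind[k] / c;
                    if (!seen[J]) { seen[J] = 1; blocks++; }
                    nnz_sampled++;
                }
        }
        free(seen);
        return nnz_sampled ? (double)(blocks * r * c) / nnz_sampled : 1.0;
    }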

  • Accuracy of the Tuning Heuristics (1/4)
    NOTE: "Fair" flops are used (operations on explicit zeros are not counted as work).
    See p. 375 of Vuduc's thesis for the list of matrices.

  • Accuracy of the Tuning Heuristics (2/4)

  • Accuracy of the Tuning Heuristics (3/4)

  • Accuracy of the Tuning Heuristics (4/4)
    (DGEMV performance shown for comparison)

  • Upper Bounds on Performance for Blocked SpMV
    P = (flops) / (time), with flops = 2 * nnz(A).

    Lower bound on time: two main assumptions
      1. Count memory operations only (streaming).
      2. Count only compulsory and capacity misses; ignore conflict misses.
    Account for cache line sizes, and for the matrix dimensions and nnz.
    Charge the minimum access latency alpha_i at the L_i cache and alpha_mem at memory,
    e.g., as measured by the Saavedra-Barrera and PMaC MAPS benchmarks.
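    In symbols, the model above amounts to a bound of roughly the following form (a paraphrase in my own notation: H_i is the number of hits at cache level i and M_mem the number of misses to memory, both derived from the blocked data-structure size and the line sizes):

    T \;\ge\; \sum_i \alpha_i H_i \;+\; \alpha_{\mathrm{mem}} M_{\mathrm{mem}},
    \qquad
    P_{\mathrm{upper}} \;=\; \frac{2\,\mathrm{nnz}(A)}{T_{\mathrm{lower}}}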

  • Example: L2 Misses on Itanium 2
    Misses measured using PAPI [Browne 00]

  • Example: Bounds on Itanium 2

  • Summary of Other Performance Optimizations
    Optimizations for SpMV:
      - Register blocking (RB): up to 4x over CSR
      - Variable block splitting: 2.1x over CSR, 1.8x over RB
      - Diagonals: 2x over CSR
      - Reordering to create dense structure + splitting: 2x over CSR
      - Symmetry: 2.8x over CSR, 2.6x over RB
      - Cache blocking: 2.8x over CSR
      - Multiple vectors (SpMM): 7x over CSR
      - ... and combinations
    Sparse triangular solve:
      - Hybrid sparse/dense data structure: 1.8x over CSR
    Higher-level kernels:
      - A*A^T*x and A^T*A*x: 4x over CSR, 1.8x over RB
      - A^2*x: 2x over CSR, 1.5x over RB

  • SpMV for Shared Memory and Multicore
    Data structure transformations:
      - Thread blocking
      - Cache blocking
      - Register blocking
      - Format selection
      - Index size reduction
    Kernel optimizations:
      - Prefetching
      - Loop structure

  • Thread Blocking
    Load balancing: evenly divide the number of nonzeros among threads.
    Exploit NUMA memory systems on multi-socket SMPs:
    must pin threads to cores AND pin data to sockets (see the first-touch sketch below).
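    One common way to get the "pin data to sockets" half in C with OpenMP (an illustrative sketch, not the actual benchmark code): pin the threads, e.g. with OMP_PROC_BIND=close and OMP_PLACES=cores, and let each thread be the first to touch the rows it will later own, so that under a first-touch NUMA policy those pages land in its socket's memory:

    #include <omp.h>
    #include <stdlib.h>

    /* First-touch initialization: use the same static schedule here as in the
     * later SpMV loop, so the thread that computes y[i] also owns its pages. */
    double *alloc_and_first_touch(int n)
    {
        double *y = malloc((size_t)n * sizeof *y);
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            y[i] = 0.0;
        return y;
    }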

  • Naive Approach
    R x C processor grid: each processor covers the same number of rows and columns.
    Potentially unbalanced.

  • Load Balanced Approach
    R x C processor grid:
      - First, block into rows: the same number of nonzeros in each of the R blocked rows (see the sketch below).
      - Second, block within each blocked row: not only should each block within a row have ~the same number of nonzeros, but all blocks should have ~the same number of nonzeros.
      - Third, prune unneeded rows & columns.
      - Fourth, re-encode the column indices to be relative to each thread block.
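    A sketch of the first step, using the CSR row pointer to cut n rows into R contiguous row blocks with roughly nnz/R nonzeros each (the later column blocking, pruning, and index re-encoding steps are not shown):

    /* row_start[t] .. row_start[t+1]-1 are the rows assigned to row block t. */
    static void partition_rows_by_nnz(int n, const int *ptr, int R, int *row_start)
    {
        long nnz = ptr[n];
        int  row = 0;
        row_start[0] = 0;
        for (int t = 1; t < R; t++) {
            long target = (nnz * (long)t) / R;     /* cumulative-nnz target for cut t   */
            while (row < n && ptr[row + 1] < target)
                row++;                             /* first row whose cumulative nnz    */
            row_start[t] = row;                    /* reaches the target starts block t */
        }
        row_start[R] = n;
    }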

  • Memory Optimizations
    Cache blocking: performed for each thread block; chop into blocks so that the accessed part of the source vector fits in cache.
    Prefetching: insert explicit prefetch operations to mask latency to memory; tune the prefetch distance/time using search.
    Register blocking: as in OSKI, but done separately per cache block, with a simpler heuristic: choose the block size that minimizes total storage.
    Index compression: use 16-bit integers for column indices in blocks less than 64K wide (see the sketch below).
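    A sketch of the index-compression point: within a cache block that spans fewer than 2^16 columns, column indices can be stored as 16-bit offsets from the block's first column, halving index traffic (struct and field names are illustrative):

    #include <stdint.h>

    /* One cache block of the matrix, covering columns [col0, col0 + width)
     * with width < 65536, so uint16_t column offsets suffice. */
    typedef struct {
        int       col0;      /* first column covered by this cache block       */
        int       nrows;     /* rows handled by this thread/cache block        */
        int      *ptr;       /* row pointer within the block (nrows+1 entries) */
        uint16_t *ind16;     /* column index relative to col0                  */
        double   *val;
    } cache_block_t;

    static void spmv_cache_block(const cache_block_t *B, const double *x, double *y)
    {
        const double *xb = x + B->col0;           /* window of the source vector */
        for (int i = 0; i < B->nrows; i++)
            for (int k = B->ptr[i]; k < B->ptr[i+1]; k++)
                y[i] += B->val[k] * xb[B->ind16[k]];
    }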

  • 1 thread Performance (preliminary)

  • 2 thread Performance (preliminary)

  • 4 thread Performance (preliminary)

  • Speedup for the Best Combination of NThreads, Blocking, and Prefetching
    (per-platform speedups: 7.3x, 6.4x, 3.8x, 4.2x)

  • Distributed Memory SpMV
    y = A*x, where A is a sparse n x n matrix

    Questions:
      - Which processors store y[i], x[i], and A[i,j]?
      - Which processors compute y[i] = sum over j of A[i,j] * x[j] = (row i of A) * x, a sparse dot product?
    Partitioning:
      - Partition the index set {1, ..., n} = N1 ∪ N2 ∪ ... ∪ Np.
      - For all i in Nk, processor k stores y[i], x[i], and row i of A.
      - For all i in Nk, processor k computes y[i] = (row i of A) * x
        ("owner computes" rule: processor k computes the y[i]'s it owns).
    May require communication, since row i of A can reference x[j]'s owned by other processors (see the sketch below).
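    A minimal sketch of the 1D row partition with the owner-computes rule, written with MPI. For simplicity it gathers the entire source vector on every process with MPI_Allgatherv; a real implementation would communicate only the x entries each process actually needs:

    #include <mpi.h>

    /* Each process owns n_local rows of A (CSR with *global* column indices)
     * plus the matching slices x_local and y_local.  counts/displs describe how
     * the global x is split across processes; x_full is a scratch buffer of
     * length n (the global dimension). */
    void spmv_1d(int n_local, const int *ptr, const int *ind, const double *val,
                 const double *x_local, double *y_local,
                 const int *counts, const int *displs, double *x_full, MPI_Comm comm)
    {
        /* Communication: assemble the full source vector on every process. */
        MPI_Allgatherv(x_local, n_local, MPI_DOUBLE,
                       x_full, (int *)counts, (int *)displs, MPI_DOUBLE, comm);

        /* Owner computes: each process computes the y[i] it owns, as a local
         * CSR SpMV against the gathered x. */
        for (int i = 0; i < n_local; i++) {
            double sum = 0.0;
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                sum += val[k] * x_full[ind[k]];
            y_local[i] = sum;
        }
    }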

  • Two Layouts
    The partitions should be by nonzero counts, not by rows/columns.
    1D partition: most popular, but for algorithms (such as NAS CG) that do reductions on y, those reductions scale with log P.
    2D partition: reductions scale with log sqrt(P), ...