
Page 1: Autotuning sparse matrix kernels

Autotuning sparse matrix kernels

Richard Vuduc Center for Applied Scientific Computing (CASC) Lawrence Livermore National Laboratory

February 28, 2007

Page 2: Autotuning sparse matrix kernels

Predictions (2003)

• Need for “autotuning” will increase over time: improve performance for a given app & machine using automated experiments
• Example: sparse matrix-vector multiply (SpMV), 1987 to present
  - Untuned: 10% of peak or less, and decreasing
  - Tuned: 2x speedup, increasing over time
• Tuning is getting harder (qualitative)
  - More complex machines & workloads
  - Parallelism

Page 3: Autotuning sparse matrix kernels

Trends in uniprocessor SpMV performance (Mflop/s), pre-2004

Page 4: Autotuning sparse matrix kernels

Trends in uniprocessor SpMV performance (Mflop/s), pre-2004

Page 5: Autotuning sparse matrix kernels

Trends in uniprocessor SpMV performance (fraction of peak)

Page 6: Autotuning sparse matrix kernels

Is tuning getting easier?

// y <-- y + A*x
for all A(i,j):
    y(i) += A(i,j) * x(j)

// Compressed sparse row (CSR)
for each row i:
    t = 0
    for k = ptr[i] to ptr[i+1]-1:
        t += A[k] * x[J[k]]
    y[i] = t

• Exploit 8x8 dense blocks

• Performance (Mflop/s) varies with the r x c block size
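To make the blocking idea concrete, here is a minimal sketch (not OSKI’s implementation) of SpMV for a 2x2 block compressed sparse row (BCSR) layout; the array names bptr, bind, and bval are illustrative:

/* y += A*x for A in 2x2 BCSR. Block row I covers rows 2I and 2I+1;
 * bptr[I]..bptr[I+1]-1 index its blocks, bind[k] is each block's
 * leftmost column, and bval stores each 2x2 block contiguously,
 * row-major. Illustrative sketch only. */
void bcsr22_spmv (int num_block_rows,
                  const int* bptr, const int* bind, const double* bval,
                  const double* x, double* y)
{
    for (int I = 0; I < num_block_rows; I++) {
        double y0 = y[2*I], y1 = y[2*I + 1];  /* kept in registers */
        for (int k = bptr[I]; k < bptr[I+1]; k++) {
            const double* b  = bval + 4*k;
            const double* xp = x + bind[k];
            y0 += b[0]*xp[0] + b[1]*xp[1];
            y1 += b[2]*xp[0] + b[3]*xp[1];
        }
        y[2*I]     = y0;
        y[2*I + 1] = y1;
    }
}

An r x c variant stores one column index per block instead of per non-zero and keeps r running sums in registers, but padding blocks with explicit zeros adds flops, which is why the best (r, c) must be found empirically, as the next slides show.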

Page 7: Autotuning sparse matrix kernels

Speedups on Itanium 2: The need for search

[Figure] Reference: 7.6% of peak; best: 31.1% of peak (Mflop/s).

Page 8: Autotuning sparse matrix kernels

Speedups on Itanium 2: The need for search

[Figure] Reference: 7.6% of peak; best: 31.1% of peak (Mflop/s), achieved at 4x2 blocking.

Page 9: Autotuning sparse matrix kernels

SpMV Performance—raefsky3

Page 10: Autotuning sparse matrix kernels

SpMV Performance—raefsky3

Page 11: Autotuning sparse matrix kernels

Better, worse, or about the same? Itanium 2, 900 MHz → 1.3 GHz

Page 12: Autotuning sparse matrix kernels

Better, worse, or about the same? Itanium 2, 900 MHz → 1.3 GHz

* Reference improves *

Page 13: Autotuning sparse matrix kernels

Better, worse, or about the same? Itanium 2, 900 MHz → 1.3 GHz

* Best possible worsens slightly *

Page 14: Autotuning sparse matrix kernels

Better, worse, or about the same? Pentium M → Core 2 Duo (1 core)

Page 15: Autotuning sparse matrix kernels

Better, worse, or about the same? Pentium M → Core 2 Duo (1 core)

* Reference & best improve; relative speedup improves (~1.4 to 1.6) *

Page 16: Autotuning sparse matrix kernels

Better, worse, or about the same? Pentium M → Core 2 Duo (1 core)

* Note: best fraction of peak decreased from 11% to 9.6% *

Page 17: Autotuning sparse matrix kernels

Better, worse, or about the same? Power4 → Power5

Page 18: Autotuning sparse matrix kernels

Better, worse, or about the same? Power4 → Power5

* Reference worsens! *

Page 19: Autotuning sparse matrix kernels

Better, worse, or about the same? Power4 → Power5

* Relative importance of tuning increases *

Page 20: Autotuning sparse matrix kernels

A framework for performance tuning (Source: SciDAC Performance Engineering Research Institute, PERI)

Page 21: Autotuning sparse matrix kernels

Outline

• Motivation
• OSKI: An autotuned sparse kernel library
• Application-specific optimization “in the wild”
• Toward end-to-end application autotuning
• Summary and future work


Page 23: Autotuning sparse matrix kernels

OSKI: Optimized Sparse Kernel Interface

• Autotuned kernels for the user’s matrix & machine
  - BLAS-style interface: mat-vec (SpMV), triangular solve (TrSV), …
  - Hides the complexity of run-time tuning
  - Includes fast locality-aware kernels: AᵀA*x, …
• Faster than standard implementations
  - Standard SpMV: < 10% of peak, vs. up to 31% with OSKI
  - Up to 4x faster SpMV, 1.8x TrSV, 4x AᵀA*x, …
• For “advanced” users & solver library writers
  - PETSc extension available (OSKI-PETSc)
  - Kokkos (for Trilinos) by Heroux
  - Adopted by ClearShape, Inc. for a shipping product (2x speedup)

Page 24: Autotuning sparse matrix kernels

Tunable matrix-specific optimization techniques

• Optimizations for SpMV
  - Register blocking (RB): up to 4x over CSR
  - Variable block splitting: 2.1x over CSR, 1.8x over RB
  - Diagonals: 2x over CSR
  - Reordering to create dense structure + splitting: 2x over CSR
  - Symmetry: 2.8x over CSR, 2.6x over RB
  - Cache blocking: 3x over CSR
  - Multiple vectors (SpMM): 7x over CSR
  - And combinations…
• Sparse triangular solve
  - Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  - AAᵀ*x, AᵀA*x: 4x over CSR, 1.8x over RB
  - Aᵏ*x: 2x over CSR, 1.5x over RB

Page 25: Autotuning sparse matrix kernels

Tuning for workloads

• Bi-conjugate gradients: an equal mix of A*x and Aᵀ*y
  - 3x1: A*x, Aᵀ*y = 1053, 343 Mflop/s → 517 Mflop/s combined
  - 3x3: A*x, Aᵀ*y = 806, 826 Mflop/s → 816 Mflop/s combined
• Higher-level fused (A*x, Aᵀ*y) kernel
  - 3x1: 757 Mflop/s
  - 3x3: 1400 Mflop/s
• Matrix powers (Aᵏ*x) with data structure transformations
  - A²*x: up to 2x faster
  - New latency-tolerant solvers? (Hoemmen’s thesis, on-going at UCB)
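A quick check of the combined rates above: for an equal flop mix of the two kernels, the effective rate is the harmonic mean of the individual rates, which reproduces both quoted numbers:

\[
r_{\mathrm{mix}} = \frac{2}{1/r_{Ax} + 1/r_{A^{\mathsf{T}}y}}, \qquad
\frac{2}{1/1053 + 1/343} \approx 517, \qquad
\frac{2}{1/806 + 1/826} \approx 816 \ \text{Mflop/s}.
\]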

Page 26: Autotuning sparse matrix kernels

How OSKI tunes (Overview)

[Figure] Two phases: library install-time (offline) and application run-time.

Page 27: Autotuning sparse matrix kernels

How OSKI tunes (Overview)

[Figure] Library install-time (offline): 1. build for the target architecture; 2. benchmark, producing benchmark data and generated code variants.

Page 28: Autotuning sparse matrix kernels

How OSKI tunes (Overview)

Benchmarkdata

1. Build forTargetArch.

2. Benchmark

Heuristicmodels

1. EvaluateModels

Generatedcode

variants

Library Install-Time (offline) Application Run-Time

Workloadfrom program

monitoring HistoryMatrix

Page 29: Autotuning sparse matrix kernels

How OSKI tunes (Overview)

Benchmarkdata

1. Build forTargetArch.

2. Benchmark

Heuristicmodels

1. EvaluateModels

Generatedcode

variants

2. SelectData Struct.

& Code

Library Install-Time (offline) Application Run-Time

To user:Matrix handlefor kernelcalls

Workloadfrom program

monitoring

Extensibility: Advanced users may write & dynamically add “Code variants” and “Heuristic models” to system.

HistoryMatrix

Page 30: Autotuning sparse matrix kernels

OSKI’s place in the tuning framework

Page 31: Autotuning sparse matrix kernels

Examples of OSKI’s early impact

• Early adopter: ClearShape, Inc.
  - Core product: lithography simulator
  - 2x speedup on the full simulation after using OSKI
• Proof-of-concept: SLAC T3P accelerator cavity design simulator
  - SpMV dominates execution time
  - Symmetry + 2x2 block structure → 2x speedups

Page 32: Autotuning sparse matrix kernels

OSKI-PETSc Performance: Accel. Cavity

Page 33: Autotuning sparse matrix kernels

Strengths and limitations of the library approach

• Strengths
  - Isolates optimization in the library for portable performance
  - Exploits domain-specific information aggressively
  - Handles run-time tuning naturally
• Limitations
  - “Generation Me”: what about my application and its abstractions?
  - Run-time tuning imposes run-time overheads
  - Limited context for optimization (without delayed evaluation)
  - Limited extensibility (fixed interfaces)

Page 34: Autotuning sparse matrix kernels

Outline

• Motivation
• OSKI: An autotuned sparse kernel library
• Application-specific optimization “in the wild”
• Toward end-to-end application autotuning
• Summary and future work

Page 35: Autotuning sparse matrix kernels

Tour of application-specific optimizations

• Five case studies
• Common characteristics
  - Complex code
  - Heavy use of abstraction
  - Use of generated code (e.g., SWIG C++/Python bindings)
  - Benefit from extensive code and data restructuring
  - Multiple bottlenecks

Page 36: Autotuning sparse matrix kernels

[1] Loop transformations for SMG2000

• SMG2000 implements semicoarsening multigrid on structured grids (ASC Purple benchmark)
• The residual computation has an SpMV bottleneck
• The loop below looks simple, but is non-trivial to extract

for (si = 0; si < NS; ++si)
  for (k = 0; k < NZ; ++k)
    for (j = 0; j < NY; ++j)
      for (i = 0; i < NX; ++i)
        r[i + j*JR + k*KR] -= A[i + j*JA + k*KA + SA[si]]
                            * x[i + j*JX + k*KX + Sx[si]];

Page 37: Autotuning sparse matrix kernels

[1] SMG2000 demo

Page 38: Autotuning sparse matrix kernels

[1] Before transformation

for (si = 0; si < NS; si++) {          /* Loop1 */
  for (kk = 0; kk < NZ; kk++) {        /* Loop2 */
    for (jj = 0; jj < NY; jj++) {      /* Loop3 */
      for (ii = 0; ii < NX; ii++) {    /* Loop4 */
        r[ii + jj*Jr + kk*Kr] -= A[ii + jj*JA + kk*KA + SA[si]]
                               * x[ii + jj*Jx + kk*Kx + Sx[si]];
      } /* Loop4 */
    } /* Loop3 */
  } /* Loop2 */
} /* Loop1 */

Page 39: Autotuning sparse matrix kernels

[1] After transformation, including interchange, unrolling, and prefetching

for (kk = 0; kk < NZ; kk++) {               /* Loop2 */
  for (jj = 0; jj < NY; jj++) {             /* Loop3 */
    for (si = 0; si < NS; si++) {           /* Loop1 */
      double* rp = r + kk*Kr + jj*Jr;
      const double* Ap = A + kk*KA + jj*JA + SA[si];
      const double* xp = x + kk*Kx + jj*Jx + Sx[si];
      for (ii = 0; ii <= NX-3; ii += 3) {   /* core Loop4 */
        _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
        _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
        rp[0] -= Ap[0] * xp[0];
        rp[1] -= Ap[1] * xp[1];
        rp[2] -= Ap[2] * xp[2];
        rp += 3; Ap += 3; xp += 3;
      } /* core Loop4 */
      for ( ; ii < NX; ii++) {              /* fringe Loop4 */
        rp[0] -= Ap[0] * xp[0];
        rp++; Ap++; xp++;
      } /* fringe Loop4 */
    } /* Loop1 */
  } /* Loop3 */
} /* Loop2 */

Page 40: Autotuning sparse matrix kernels

[1] Loop transformations for SMG2000

• 2x speedup on the kernel from specialization, loop interchange, unrolling, and prefetching
• But only 1.25x overall: multiple bottlenecks
• Lesson: need complex sequences of transformations
  - Use profiling to guide
  - Inspect run-time data for specialization
  - Transformations are automatable
• Research topic: automated specialization of hypre?

Page 41: Autotuning sparse matrix kernels

[2] Slicing and dicing 3P

• Accelerator design code from SLAC
• calcBasis() is very expensive
• Scaling problems as |Eigensystem| grows
• In principle, loop interchange or precomputation via slicing is possible

/* Post-processing phase */
foreach mode in Eigensystem
  foreach elem in Mesh
    b = calcBasis (elem)
    f = calcField (b, mode)

Page 42: Autotuning sparse matrix kernels

[2] Slicing and dicing 3P

• Accelerator design code
• calcBasis() is very expensive
• Scaling problems as |Eigensystem| grows
• In principle, loop interchange or precomputation via slicing is possible
• Challenges in practice
  - “Loop nest” ~ 500+ LOC; 150+ LOC to calcBasis()
  - calcBasis() sits in a 6-deep call chain, a 4-deep loop nest, and 2 conditionals
  - File I/O
  - Changes must be unobtrusive

/* Post-processing phase */
foreach mode in Eigensystem
  foreach elem in Mesh
    // { …
    b = calcBasis (elem)
    // }
    f = calcField (b, mode)
writeDataToFiles (…);

Page 43: Autotuning sparse matrix kernels

[2] 3P: Impact and lessons

• 4-5x speedup for the post-processing step; 1.5x overall
• Changes “checked in”
• Lesson: need clean source-level transformations
  - To automate, need robust program analysis and developer guidance
• Research: annotation framework for developers [w/ Quinlan, Schordan, Yi: POHLL’06]

Page 44: Autotuning sparse matrix kernels

[3] Structure splitting

• Convert an array of structs into a struct of arrays
• Improves spatial locality through increased stride-1 accesses
• Makes the code hardware-prefetch and vector/SIMD-unit “friendly”

struct Type {
    double p;
    double x, y, z;
    double E;
    int k;
} X[N], Y[N];

for (i = 0; i < N; i++)
    Y[i].E += sqrt (Y[X[i].k].p);

double Xp[N];
double Xx[N], Xy[N], Xz[N];
double XE[N];
int Xk[N];
// … same for Y …

for (i = 0; i < N; i++)
    YE[i] += sqrt (Yp[Xk[i]]);

Page 45: Autotuning sparse matrix kernels

[3] Structure splitting: Impact and challenges

• 2x speedup on a KULL benchmark (suggested by Brian Miller)
• Implementation challenges
  - Potentially affects the entire code
  - Can apply only locally, at a cost: extra storage, overhead of copying
  - Tedious to do by hand
• Lesson: extensive data restructuring may be necessary
• Research: when and how best to split?

Page 46: Autotuning sparse matrix kernels

[4] Finding a loop-fusion needle in a haystack

• Interprocedural loop-fusion finder [w/ B. White: Cornell U.]
• A known example had a 2x speedup on a benchmark (Miller)
• Built an “abstraction-aware” analyzer using ROSE
  - First pass: associate “loop signatures” with each function
  - Second pass: propagate signatures through call chains

for (Zone::iterator z = zones.begin (); z != zones.end (); ++z)
  for (Corner::iterator c = (*z).corners().begin (); …)
    for (int s = 0; s < c->sides().size(); s++)
      …

Page 47: Autotuning sparse matrix kernels

[4] Finding a loop-fusion needle in a haystack

• Found 6 examples of 3- and 4-deep nested loops
• “Analysis-only” tool: finds candidates, though it does not verify or transform them
• Lesson: “classical” optimizations remain relevant to abstraction use
• Research
  - Recognizing and optimizing abstractions [White’s thesis, on-going]
  - Extending traditional optimizations to abstraction use

Page 48: Autotuning sparse matrix kernels

[5] Aggregating messages (on-going)

• Idea: merge sends (suggested by Miller)
• Implementing a fully automated translator to find and transform
• Research: when and how best to aggregate?

Before:

DataType A;
// … operations on A …
A.allToAll();
// …
DataType B;
// … operations on B …
B.allToAll();

After:

DataType A;
// … operations on A …
// …
DataType B;
// … operations on B …

bulkAllToAll(A, B);

Page 49: Autotuning sparse matrix kernels

Summary of application-specific optimizations

• Like the library-based approach, exploit knowledge for big gains
  - Guidance from the developer
  - Use of run-time information
• Would benefit from automated transformation tools
  - Real code is hard to process
  - Changes may become part of software re-engineering
  - Need robust analysis and transformation infrastructure
  - A range of tools is possible: analysis and/or transformation
• No silver bullets or magic compilers

Page 50: Autotuning sparse matrix kernels

Outline

• Motivation
• OSKI: An autotuned sparse kernel library
• “Real world” optimization
• Toward end-to-end application autotuning
• Summary and future work

Page 51: Autotuning sparse matrix kernels

A framework for performance tuning (Source: SciDAC Performance Engineering Research Institute, PERI)

Page 52: Autotuning sparse matrix kernels

OSKI’s place in the tuning framework

Page 53: Autotuning sparse matrix kernels

An empirical tuning framework using ROSE

[Figure] Empirical tuning framework using ROSE, with profiling tools (gprof, HPCToolkit, Open SpeedShop), POET, and a search engine.

Page 54: Autotuning sparse matrix kernels

An end-to-end autotuning framework using ROSE

• Guiding philosophy
  - Leverage external stand-alone components
  - Provide open components and tools for the community
• User or “system” profiles to collect data and/or analyses
• In ROSE:
  - Mark up the AST with data/analysis to identify optimizable target(s)
  - Outline each target into a stand-alone, dynamically loadable library routine
  - Make a “benchmark” by inserting checkpoint library calls into the app
  - Generate a parameterized representation of each target
• An independent search engine performs the search

Page 55: Autotuning sparse matrix kernels

Interfaces to performance tools

• Mark up the AST with data/analysis to identify optimizable target(s)
  - gprof
  - HPCToolkit [Mellor-Crummey: Rice]
  - VizzAnalyzer / Vizz3D [Panas: LLNL]
  - In progress: Open SpeedShop [Schulz: LLNL]
• Needed: analysis to identify targets

Page 56: Autotuning sparse matrix kernels

Outlining

• Outline each target into a dynamically loadable library routine
  - Extends initial implementations by Liao [U. Houston] and Jula [TAMU]
• Handles many details of C & C++
  - Wraps up variables, inserts declarations, generates the call
  - Produces interfaces suitable for dynamic loading
  - Handles non-local control flow

void OUT_38725__ (double* r, int JR, int KR, const double* A, …)
{
    int si, j, k, i;
    for (si = 0; si < NS; si++)
        …
            r[i + j*JR + k*KR] -= A[i + …
}

Page 57: Autotuning sparse matrix kernels

Making a benchmark

• Make a “benchmark” by inserting checkpoint library calls
  - Measures application behavior “in context”
  - Uses ckpt (user-level) [Zander: U. Wisc.]
  - Inserts timing code (cycle counter)
  - May insert arbitrary code to distinguish calling contexts
• Reasonably fast in practice
  - Checkpoint read/write bandwidth: 500 MB/s on my Pentium M
  - For SMG2000: a problem consuming a ~500 MB footprint takes ~30 s to run
• Needed
  - Best procedure to get accurate and fair comparisons?
  - Do restarts resume in comparable states?
  - A more portable checkpoint library

Page 58: Autotuning sparse matrix kernels

Example of “benchmark” (pseudo)code

static int num_calls = 0;     // no. of invocations of outlined code
if (!num_calls) {
    ckpt ();                  // Checkpoint/resume
    OUT_38725__ = dlsym (…);  // Load an implementation
    startTimer ();
}

OUT_38725__ (…);              // outlined call-site

if (++num_calls == CALL_LIMIT) {  // Measured CALL_LIMIT calls
    stopTimer ();
    outputTime ();
    exit (0);
}

Page 59: Autotuning sparse matrix kernels

Generating parameterized representations

• Generate a parameterized representation of each target
  - POET: an embedded scripting language for expressing parameterized code variations [see POHLL’07]
  - The loop optimizer will generate POET for each target
• Hand-coded POET for SMG2000
  - Interchange
  - Machine-specific: unrolling, prefetching
  - Source-specific: register & restrict keywords, C pointer idiom
• New parameterization for loop fusion [Zhao, Kennedy: Rice; Yi: UTSA]

Page 60: Autotuning sparse matrix kernels

SMG2000 kernel POET instantiation

for (kk = 0; kk < NZ; kk++) {               /* L4 */
  for (jj = 0; jj < NY; jj++) {             /* L3 */
    for (si = 0; si < NS; si++) {           /* L1 */
      double* rp = r + kk*Kr + jj*Jr;
      const double* Ap = A + kk*KA + jj*JA + SA[si];
      const double* xp = x + kk*Kx + jj*Jx + Sx[si];
      for (ii = 0; ii <= NX-3; ii += 3) {   /* core L2 */
        _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
        _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
        rp[0] -= Ap[0] * xp[0];
        rp[1] -= Ap[1] * xp[1];
        rp[2] -= Ap[2] * xp[2];
        rp += 3; Ap += 3; xp += 3;
      } /* core L2 */
      for ( ; ii < NX; ii++) {              /* fringe L2 */
        rp[0] -= Ap[0] * xp[0];
        rp++; Ap++; xp++;
      } /* fringe L2 */
    } /* L1 */
  } /* L3 */
} /* L4 */

Page 61: Autotuning sparse matrix kernels

Search

• We are search-engine agnostic
• Many possible hybrid modeling/search techniques

Page 62: Autotuning sparse matrix kernels

Summary of autotuning compiler approach

• The end-to-end framework leverages existing work
  - ROSE provides a heavy-duty (robust) source-level infrastructure
  - Assembles stand-alone components
• Current and future work
  - Assembling a more complete end-to-end example
  - Interfaces between components?
  - Extending the basic ROSE infrastructure, particularly program analysis

Page 63: Autotuning sparse matrix kernels

Current and future research directions

• Autotuning
  - End-to-end autotuning compiler framework
  - Tuning for novel architectures (e.g., multicore)
  - Tools for generating domain-specific libraries
• Performance modeling
  - Kernel- and machine-specific analytical and statistical models
  - Hybrid symbolic/empirical modeling
  - Implications for applications and architectures?
• Tools for debugging massively parallel applications
  - JitterBug [w/ Schulz, Quinlan, de Supinski, Saebjoernsen]
  - Static/dynamic analyses for debugging MPI

Page 64: Autotuning sparse matrix kernels

End

Page 65: Autotuning sparse matrix kernels

What is ROSE?

• Research: develop techniques to optimize applications that rely heavily on high-level abstractions
  - Target scientific computing apps relevant to DOE/LLNL
  - Domain-specific analysis and optimization
  - Optimize use of object-oriented abstractions
  - Performance portability via empirical tuning
• Infrastructure: a tool for building source-to-source optimizers
  - Full compiler: basic program analysis, loop optimizer, OpenMP [UH]
  - Support for C and C++; Fortran 90 in progress
  - Targets a “non-compiler audience”
  - Open source


Page 67: Autotuning sparse matrix kernels

Bug hunting in MPI programs

• Motivation: MPI is a large, complex API
• Bug-pattern detectors
  - Check basic API usage
  - Adapt existing tools: MPI-CHECK; FindBugs; Farchi, et al. VC’05
• Tasks requiring deeper program analysis
  - Properly matched sends/receives, barriers, collectives
  - Buffer errors, e.g., overruns, or reads before a non-blocking op completes
  - Temporal usage properties
  - See the error survey by DeSouza, Kuhn, & de Supinski ’05
  - Extend existing analyses by Shires, et al., PDPTA’99; Strout, et al., ICPP’06

Page 68: Autotuning sparse matrix kernels

Compiler-based testing tools

• Instrumentation and dynamic analysis to measure coverage [IBM]
• Measurement-unit validation via Osprey [Jiang and Su, UC Davis]
• Numerical interval/bounds analysis [Sun]
• Interface to the MOPS model checker [Collingbourne, Imperial College]
• Interactive program visualization via VizzAnalyzer [Panas, LLNL]

Page 69: Autotuning sparse matrix kernels

Trends in uniprocessor SpMV performance (absolute Mflop/s)

Page 70: Autotuning sparse matrix kernels

Trends in uniprocessor SpMV performance (fraction of peak)

Page 71: Autotuning sparse matrix kernels

Motivation: The Difficulty of Tuning SpMV

// y <-- y + A*x

for all A(i,j):
    y(i) += A(i,j) * x(j)

Page 72: Autotuning sparse matrix kernels

Motivation: The Difficulty of Tuning SpMV

// y <-- y + A*x
for all A(i,j):
    y(i) += A(i,j) * x(j)

// Compressed sparse row (CSR)
for each row i:
    t = 0
    for k = ptr[i] to ptr[i+1]-1:
        t += A[k] * x[J[k]]
    y[i] = t

Page 73: Autotuning sparse matrix kernels

Motivation: The Difficulty of Tuning SpMV

// y <-- y + A*x
for all A(i,j):
    y(i) += A(i,j) * x(j)

// Compressed sparse row (CSR)
for each row i:
    t = 0
    for k = ptr[i] to ptr[i+1]-1:
        t += A[k] * x[J[k]]
    y[i] = t

• Exploit 8x8 dense blocks

Page 74: Autotuning sparse matrix kernels

Speedups on Itanium 2: The Need for Search

[Figure] Reference: 7.6% of peak; best: 31.1% of peak (Mflop/s).

Page 75: Autotuning sparse matrix kernels

Speedups on Itanium 2: The Need for Search

[Figure] Reference: 7.6% of peak; best: 31.1% of peak (Mflop/s), achieved at 4x2 blocking.

Page 76: Autotuning sparse matrix kernels

SpMV Performance—raefsky3

Page 77: Autotuning sparse matrix kernels

SpMV Performance—raefsky3

Page 78: Autotuning sparse matrix kernels

Better, worse, or about the same? Pentium 4, 1.5 GHz → Xeon, 3.2 GHz

Page 79: Autotuning sparse matrix kernels

Better, worse, or about the same? Pentium 4, 1.5 GHz → Xeon, 3.2 GHz

* Faster, but the relative improvement increases (20% → ~50%) *

Page 80: Autotuning sparse matrix kernels

Problem-Specific Performance Tuning

Page 81: Autotuning sparse matrix kernels

Problem-Specific Optimization Techniques

• Optimizations for SpMV
  - Register blocking (RB): up to 4x over CSR
  - Variable block splitting: 2.1x over CSR, 1.8x over RB
  - Diagonals: 2x over CSR
  - Reordering to create dense structure + splitting: 2x over CSR
  - Symmetry: 2.8x over CSR, 2.6x over RB
  - Cache blocking: 3x over CSR
  - Multiple vectors (SpMM): 7x over CSR
  - And combinations…
• Sparse triangular solve
  - Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  - AAᵀ*x, AᵀA*x: 4x over CSR, 1.8x over RB
  - Aᵏ*x: 2x over CSR, 1.5x over RB


Page 83: Autotuning sparse matrix kernels

BCSR Captures Regularly Aligned Blocks

• n = 21,216; nnz = 1.5 M
• Source: NASA structural analysis problem
• 8x8 dense substructure
• Reduces storage (one column index per block instead of per non-zero)

Page 84: Autotuning sparse matrix kernels

Problem: Forced Alignment

• BCSR(2x2): stored/true nz = 1.24

Page 85: Autotuning sparse matrix kernels

Problem: Forced Alignment

• BCSR(2x2): stored/true nz = 1.24
• BCSR(3x3): stored/true nz = 1.46

Page 86: Autotuning sparse matrix kernels

Problem: Forced Alignment Implies UBCSR

• BCSR(2x2): stored/true nz = 1.24
• BCSR(3x3): stored/true nz = 1.46
• BCSR(3x3) forces block origins at i mod 3 = j mod 3 = 0
• Unaligned BCSR (UBCSR) format: store explicit row indices

Page 87: Autotuning sparse matrix kernels

The Speedup Gap: BCSR vs. CSR

[Figure] Speedup (BCSR/CSR) by machine: a 1.1-1.5x gap remains.

Page 88: Autotuning sparse matrix kernels

Approach: Splitting + Relaxed Block Alignment

• Goal: close the speedup gap between FEM matrix classes
• Our approach: capture the actual structure more precisely
  - Split: A = A1 + A2 + … + As
  - Store each Ai in unaligned BCSR (UBCSR) format
  - Relax both row and column alignment (Buttari, et al. (2005) show improvements from relaxed column alignment)
• Results: 2.1x over no blocking, 1.8x over blocking
• When not faster than BCSR, may still reduce storage
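At multiply time the splitting is used term by term, since SpMV distributes over the sum. A minimal sketch, in which the kernel and type names (spmv_ubcsr, spmv_csr, ubcsr_t, csr_t) are illustrative rather than OSKI’s API:

/* y += (A1 + A2)*x for a two-way split: each term is multiplied by a
 * kernel suited to its format, accumulating into the same y. */
void split_spmv (const ubcsr_t* A1, const csr_t* A2,
                 const double* x, double* y)
{
    spmv_ubcsr (A1, x, y);  /* blocked part, unaligned BCSR: y += A1*x */
    spmv_csr   (A2, x, y);  /* unblocked remainder, CSR:     y += A2*x */
}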

Page 89: Autotuning sparse matrix kernels

Variable Block Row (VBR) Analysis

Partition by grouping consecutive rows/columns that have the same sparsity pattern

Page 90: Autotuning sparse matrix kernels

From VBR, Identify Multiple Natural Block Sizes

Page 91: Autotuning sparse matrix kernels

VBR with Fill

• Can also pad by matching rows/columns with nearly similar patterns
• Define VBR(θ) = VBR in which consecutive rows are grouped when their pattern “similarity” ≥ θ, with 0 ≤ θ ≤ 1
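A minimal sketch of the grouping rule, assuming a hypothetical similarity() measure of row-pattern overlap in [0, 1] and an emit_block_row() helper:

/* Scan rows in order; extend the current group while the next row's
 * pattern stays at least theta-similar, else start a new block row. */
void group_rows_vbr (int n, double theta)
{
    int group_start = 0;
    for (int i = 1; i < n; i++) {
        if (similarity (group_start, i) < theta) {
            emit_block_row (group_start, i);  /* rows [group_start, i) */
            group_start = i;
        }
    }
    emit_block_row (group_start, n);          /* final group */
}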

Page 92: Autotuning sparse matrix kernels

VBR with Fill

Fill of 1%

Page 93: Autotuning sparse matrix kernels

A Complex Tuning Problem

• Many parameters need “tuning”
  - Fill threshold θ: 0.5 ≤ θ ≤ 1
  - Number of splittings s: 2 ≤ s ≤ 4
  - Ordering of block sizes ri x ci, with the last split rs x cs = 1x1 (i.e., unblocked CSR)
• See the HPCC 2005 paper for proof-of-concept experiments based on a semi-exhaustive search
• A heuristic is in progress (building on Buttari, et al. (2005))

Page 94: Autotuning sparse matrix kernels

FEM 2 Matrices

Matrix       Application        Dimension  Non-zeros  Dominant blocks
10-ct20stif  Engine block       52k        2.7M       6x6 (39%), 3x3 (15%)
12-raefsky4  Buckling           20k        1.3M       3x3 (96%)
13-ex11      Fluid flow         16k        1.1M       1x1 (38%), 3x3 (23%)
15-Vavasis3  2D PDE             41k        1.7M       2x1 (81%), 2x2 (19%)
17-rim       Fluid flow         23k        1.0M       1x1 (75%), 3x1 (12%)
A-bmw7st_1   Car chassis        141k       7.3M       6x6 (82%)
B-cop20k_m   Accel. cavity      121k       4.8M       2x1 (26%), 1x2 (26%), 1x1 (26%), 2x2 (22%)
C-pwtk       Wind tunnel        218k       11.6M      6x6 (94%)
D-rma10      Charleston Harbor  47k        2.4M       2x2 (17%), 3x2 (15%), 2x3 (15%), 4x2 (9%), 2x4 (9%)
E-s3dkqm4    Cylindrical shell  90k        4.8M       6x6 (99%)

Page 95: Autotuning sparse matrix kernels

Power 4 Performance

Page 96: Autotuning sparse matrix kernels

Storage Savings

Page 97: Autotuning sparse matrix kernels

Traveling Salesman Problem-based Reordering

• Application: Stanford accelerator design problem (Omega3P)
• Reorder columns by approximately solving a traveling salesman problem (TSP) [Pinar & Heath ’97]
  - Nodes = columns of A
  - Weight(u, v) = number of non-zeros u and v have in common
  - Tour = ordering of columns
  - Choose a maximum-weight tour
• Also: symmetric storage, register blocking
• Manually selected optimizations
• Problem: high cost of computing an approximate TSP solution

Page 98: Autotuning sparse matrix kernels

100x100 Submatrix Along Diagonal

Page 99: Autotuning sparse matrix kernels

“Microscopic” Effect of Combined RCM+TSP Reordering

Before: green + red. After: green + blue.

Page 100: Autotuning sparse matrix kernels
Page 101: Autotuning sparse matrix kernels

Inter-Iteration Sparse Tiling (1/3)

[Figure] Dependence graph with nodes x1-x5, t1-t5, y1-y5 in three columns.

• Idea: Strout, et al., ICCS 2001
• Let A be 5x5, tridiagonal
• Consider y = A²x as t = A*x, then y = A*t
• Nodes: vector elements; edges: matrix elements aij
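For reference, the computation being tiled is the plain two-pass version below (CSR, illustrative); sparse tiling interleaves the two passes tile by tile so that entries of A and t are reused while still in cache:

/* Baseline y = A*(A*x) as two independent CSR SpMV sweeps. */
void a2x_two_pass (int n, const int* ptr, const int* ind,
                   const double* val, const double* x,
                   double* t, double* y)
{
    for (int i = 0; i < n; i++) {        /* pass 1: t = A*x */
        double s = 0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            s += val[k] * x[ind[k]];
        t[i] = s;
    }
    for (int i = 0; i < n; i++) {        /* pass 2: y = A*t */
        double s = 0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            s += val[k] * t[ind[k]];
        y[i] = s;
    }
}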

Page 102: Autotuning sparse matrix kernels

Inter-Iteration Sparse Tiling (2/3)

[Figure] Same dependence graph (x1-x5, t1-t5, y1-y5).

• Idea: Strout, et al., ICCS 2001
• Let A be 5x5, tridiagonal
• Consider y = A²x as t = A*x, then y = A*t
• Nodes: vector elements; edges: matrix elements aij
• Orange = everything needed to compute y1
  - Reuse a11, a12

Page 103: Autotuning sparse matrix kernels

Inter-Iteration Sparse Tiling (3/3)

• Idea: Strout, et al., ICCS 2001
• Let A be 5x5, tridiagonal
• Consider y = A²x as t = A*x, then y = A*t
• Nodes: vector elements; edges: matrix elements aij
• Orange = everything needed to compute y1
  - Reuse a11, a12
• Grey = everything needed to compute y2, y3
  - Reuse a23, a33, a43

[Figure] Same dependence graph (x1-x5, t1-t5, y1-y5).

Page 104: Autotuning sparse matrix kernels

Serial Sparse Tiling Performance (Itanium 2)

Page 105: Autotuning sparse matrix kernels

OSKI Software Architecture and API

Page 106: Autotuning sparse matrix kernels

Empirical Model Evaluation

• Tuning loop
  - Compute a “tuning time budget” based on the workload
  - While time remains and no tuning has been chosen: try a heuristic
• Heuristic for blocked SpMV: choose the r x c that minimizes

      predicted time(A, r, c) = estimated flops(A, r, c) / benchmarked Mflop/s(r, c)

• Tuning for workloads
  - Weighted sums of empirical models
  - Dynamic programming for alternatives
  - Example: combined y = AᵀA*x vs. separate (w = A*x, y = Aᵀ*w)
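A minimal sketch of that heuristic, with fill_ratio() (estimated by sampling the matrix) and mflops[][] (from the install-time benchmarks) as illustrative names, not OSKI’s internal API:

/* Pick the r x c block size minimizing estimated flops divided by the
 * benchmarked blocked-SpMV rate. */
void choose_block_size (long nnz, int* best_r, int* best_c)
{
    double best_time = 1e300;
    for (int r = 1; r <= 8; r++)
        for (int c = 1; c <= 8; c++) {
            /* Blocking pads blocks with explicit zeros, inflating the
             * flop count by the fill ratio (stored / true non-zeros). */
            double est_flops = 2.0 * (double) nnz * fill_ratio (r, c);
            double est_time  = est_flops / (mflops[r][c] * 1e6);
            if (est_time < best_time) {
                best_time = est_time;
                *best_r = r;
                *best_c = c;
            }
        }
}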

Page 107: Autotuning sparse matrix kernels

Cost of Tuning

• Non-trivial run-time cost: up to ~40 mat-vecs, dominated by conversion time (~80%)
• Design point: the user calls the “tune” routine explicitly
  - Exposes the cost
  - Tuning time is limited using the estimated workload, provided by the user or inferred by the library
• The user may save tuning results
  - To apply on future runs with a similar matrix
  - Stored in a “human-readable” format

Page 108: Autotuning sparse matrix kernels

Interface supports legacy app migration

int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …;                   /* Vectors */

/* Compute y = β·y + α·A·x, 500 times */
for (i = 0; i < 500; i++)
    my_matmult (ptr, ind, val, α, x, β, y);
r = ddot (x, y); /* Some dense BLAS op on vectors */

Page 109: Autotuning sparse matrix kernels

Interface supports legacy app migration

int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …;                   /* Vectors */

/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR (ptr, ind, val, num_rows,
    num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView (x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView (y, num_rows, UNIT_STRIDE);

/* Compute y = β·y + α·A·x, 500 times */
for (i = 0; i < 500; i++)
    my_matmult (ptr, ind, val, α, x, β, y);
r = ddot (x, y);

Page 110: Autotuning sparse matrix kernels

Interface supports legacy app migration

int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …;                   /* Vectors */

/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR (ptr, ind, val, num_rows,
    num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView (x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView (y, num_rows, UNIT_STRIDE);

/* Step 2: Call tune (with optional hints) */
oski_SetHintMatMult (A_tunable, …, 500);
oski_TuneMat (A_tunable);

/* Compute y = β·y + α·A·x, 500 times */
for (i = 0; i < 500; i++)
    my_matmult (ptr, ind, val, α, x, β, y);
r = ddot (x, y);

Page 111: Autotuning sparse matrix kernels

Interface supports legacy app migration

int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …;                   /* Vectors */

/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR (ptr, ind, val, num_rows,
    num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView (x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView (y, num_rows, UNIT_STRIDE);

/* Step 2: Call tune (with optional hints) */
oski_SetHintMatMult (A_tunable, …, 500);
oski_TuneMat (A_tunable);

/* Step 3: Compute y = β·y + α·A·x, 500 times */
for (i = 0; i < 500; i++)
    oski_MatMult (A_tunable, OP_NORMAL, α, x_view, β, y_view);
r = ddot (x, y);

Page 112: Autotuning sparse matrix kernels

Quick-and-dirty Parallelism: OSKI-PETSc

• Extend PETSc’s distributed-memory SpMV (MATMPIAIJ)
• [Figure] Matrix rows partitioned across processes p0-p3
• PETSc: each process stores diag (all-local) and off-diag submatrices
• OSKI-PETSc: add OSKI wrappers; each submatrix is tuned independently

Page 113: Autotuning sparse matrix kernels

OSKI-PETSc Proof-of-Concept Results

• Matrix 1: accelerator cavity design (R. Lee @ SLAC)
  - N ~ 1 M, ~40 M non-zeros
  - 2x2 dense block substructure; symmetric
• Matrix 2: linear programming (Italian Railways)
  - Short and fat: 4k x 1M, ~11M non-zeros
  - Highly unstructured
  - Big speedup from cache blocking: no native PETSc format
• Evaluation machine: Xeon cluster, peak 4.8 Gflop/s per node

Page 114: Autotuning sparse matrix kernels

Accelerator cavity matrix from SLAC’s T3P code

Page 115: Autotuning sparse matrix kernels

Additional Features: OSKI-Lua

• Embedded scripting language for selecting customized, complex transformations
• Mechanism to save/restore transformations

# In file “my_xform.txt”
# Compute A_fast = P*A*P^T using Pinar’s reordering algorithm
A_fast, P = reorder_TSP(InputMat);

# Split A_fast = A1 + A2, where A1 is in 2x2 block format and A2 in CSR
A1, A2 = A_fast.extract_blocks(2, 2);
return transpose(P)*(A1+A2)*P;

/* In “my_app.c” */
fp = fopen ("my_xform.txt", "rt");
fgets (buffer, BUFSIZE, fp);
oski_ApplyMatTransform (A_tunable, buffer);
oski_MatMult (A_tunable, …);

Page 116: Autotuning sparse matrix kernels

Current Work and Future Directions

Page 117: Autotuning sparse matrix kernels

Current and Future Work on OSKI

• OSKI 1.0.1 at bebop.cs.berkeley.edu/oski
• “Pre-alpha” version of OSKI-PETSc available; “beta” for Kokkos (Trilinos)
• Future work
  - Evaluation in full solves/apps (Bay Area lithography shop: 2x speedup in the full solve)
  - Code generators
  - Studying use of higher-level OSKI kernels
  - Port to additional architectures (e.g., vectors, SMPs)
  - Additional heuristics [Buttari, et al. (2005)]
• Many on-going BeBOP projects
  - SpMV benchmark for HPC Challenge [Gavari & Hoemmen]
  - Evaluation of Cell [Williams]
  - Higher-level kernels, solvers [Hoemmen, Nishtala]
  - Tuning collective communications [Nishtala]
  - Cache-oblivious stencils [Kamil]

Page 118: Autotuning sparse matrix kernels

ROSE: A Compiler-Based Approach to Tuning General Applications

• ROSE: a tool for building customized source-to-source tools (Quinlan, et al.)
  - Full support for C and C++; Fortran 90 in development
  - Targets users with little or no compiler background
• Focus on performance optimization for scientific computing
  - Domain-specific analysis and optimizations
  - Object-oriented abstraction recognition
  - Rich loop-transformation support
  - Annotation language support
• Additional infrastructure support for software assurance, testing, and debugging
• Toward an end-to-end empirical tuning compiler
  - Combines profiling, checkpointing, analysis, parameterized code generation, and search
• Joint work with Qing Yi (University of Texas at San Antonio)
• Sponsored by the U.S. Department of Energy

Page 119: Autotuning sparse matrix kernels

ROSE Architecture

[Figure] ROSE architecture: the front-end (EDG-based) parses application source and library interfaces into an AST; the mid-end applies tools (abstraction recognition, abstraction-aware analysis, abstraction elimination, extended traditional optimizations, source+AST transformations), guided by annotations and exchanging source and AST fragments; the back-end emits the transformed application source.