
Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply

Richard Vuduc, James Demmel, Katherine Yelick

Shoaib Kamil, Rajesh Nishtala, Benjamin Lee

Wednesday, November 20, 2002

Berkeley Benchmarking and OPtimization (BeBOP) Project

www.cs.berkeley.edu/~richie/bebop

Computer Science Division, U.C. Berkeley

Berkeley, California, USA


SC 2002: Session on Sparse Linear Algebra


Context: Performance Tuning in the Sparse Case

Application performance is dominated by a few computational kernels

Performance tuning today: vendor-tuned libraries (e.g., BLAS) or hand-tuning by the user

Automatic tuning (e.g., PHiPAC/ATLAS, FFTW/SPIRAL/UHFFT)

Tuning sparse linear algebra kernels is hard. Sparse code has:

high bandwidth requirements (extra storage), poor locality (indirect, irregular memory access), and a poor instruction mix (data structure manipulation); a reference CSR kernel illustrating these costs is sketched below

Sparse matrix-vector multiply (SpM×V) performance: less than 10% of machine peak

Performance depends on kernel, architecture, and matrix
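
To make these costs concrete, here is a minimal reference CSR SpM×V kernel (y ← y + A·x) in C. It is a generic sketch, not code from SPARSITY; the array names val, col_ind, and row_ptr follow common CSR conventions and are assumptions of this example.

```c
#include <stddef.h>

/* Reference CSR SpMxV: y <- y + A*x.
 * val[k]     : non-zero values, stored row by row
 * col_ind[k] : column index of val[k]           (one extra load per non-zero)
 * row_ptr[i] : start of row i in val/col_ind    (row_ptr[m] = nnz)
 *
 * Each non-zero costs 2 flops but loads both val[k] and col_ind[k], and the
 * access x[col_ind[k]] is indirect and irregular: the source of the high
 * bandwidth demand and poor locality noted above.
 */
void spmv_csr(size_t m, const double *val, const int *col_ind,
              const int *row_ptr, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double yi = y[i];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            yi += val[k] * x[col_ind[k]];
        y[i] = yi;
    }
}
```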


Example: Matrix olafu

[Spy Plot: 03-olafu.rua, 1015156 non-zeros]

N = 16146

nnz = 1.0M

Kernel = SpM×V

A natural choice: r×c blocked compressed sparse row (BCSR).

Experiment: measure performance of all r×c block sizes for r, c ∈ {1, 2, 3, 6}.


Example: Matrix olafu

[Spy Plot (zoom-in): 03-olafu.rua, 1188 non-zeros in a 60×60 window]

N = 16146

nnz = 1.0M

Kernel = SpM×V

A natural choice: r×c blocked compressed sparse row (BCSR).

Experiment: measure performance of all r×c block sizes for r, c ∈ {1, 2, 3, 6}.


Speedups on Itanium: The Need for Search

[Heatmap: Blocking Performance (Mflop/s), 03-olafu.rua on itanium-linux-ecc; color scale 93-208 Mflop/s. Each cell is labeled with its speedup over the unblocked (1×1) code:]

            c = 1   c = 2   c = 3   c = 6
    r = 1    1.00    0.86    0.99    0.93
    r = 2    1.21    1.27    0.97    0.78
    r = 3    1.44    0.99    1.07    0.64
    r = 6    1.20    0.81    0.75    0.95

(Peak machine speed: 3.2 Gflop/s)


Key Questions and Conclusions

How do we choose the best data structure automatically?

New heuristic for choosing optimal (or near-optimal) block sizes

What are the limits on the performance of blocked SpM×V? Derive performance upper bounds for blocking

Often within 20% of the upper bound, placing limits on improvement from more "low-level" tuning

Performance is memory-bound: reducing data structure size is critical

Where are the new opportunities (kernels, techniques) for achieving higher performance?

Identify cases in which blocking does and does not work

Identify new kernels and opportunities for reducing memory traffic


Related Work

Automatic tuning systems and code generation

PHiPAC [BACD97], ATLAS [WPD01], SPARSITY [Im00]

FFTW [FJ98], SPIRAL [PSVM01], UHFFT [MMJ00]

MPI collective ops (Vadhiyar, et al. [VFD01])

Sparse compilers (Bik [BW99], Bernoulli [Sto97])

Generic programming (Blitz++ [Vel98], MTL [SL98], GMCL [Neu98], . . . )

FLAME [GGHvdG01]

Sparse performance modeling and tuning

Temam and Jalby [TJ92]

Toledo [Tol97], White and Sadayappan [WS97], Pinar [PH99]

Navarro [NGLPJ96], Heras [HPDR99], Fraguela [FDZ99]

Gropp, et al. [GKKS99], Geus [GR99]

Sparse kernel interfaces

Sparse BLAS Standard [BCD+01]

NIST SparseBLAS [RP96], SPARSKIT [Saa94], PSBLAS [FC00]

PETSc, hypre, . . .



Approach to Automatic Tuning

For each kernel,

Identify and generate a space of implementations

Search to find the fastest (using models, experiments)

The SPARSITY system for SpM×V [Im & Yelick ’99]

Interface:
Input: your sparse matrix (in CSR)
Output: data structure + routine tuned to your matrix & machine

Implementation space: register-level blocking (r×c), cache blocking, multiple vectors, . . .

Search:
Off-line: benchmarking (once per architecture)
Run-time: estimate matrix properties (“search”) and predict the best data structure parameters


Register-Level Blocking (SPARSITY): 3x3 Example

[Spy plot, 3 x 3 Register Blocking Example: a 50×50 submatrix with 688 true non-zeros]


BCSR with uniform, aligned grid

[3 x 3 Register Blocking Example, after fill-in: (688 true non-zeros) + (383 explicit zeros) = 1071 stored nz]

Fill-in zeros: trade off extra flops for better efficiency

This example: 50% fill led to 1.5x speedup on Pentium III
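
For comparison with the CSR kernel above, here is a sketch of what a register-blocked (BCSR) SpM×V routine looks like for a fixed 2×2 block size; SPARSITY generates one fully unrolled routine per (r, c), and the 3×3 case of this example unrolls nine multiply-adds in the same way. The array layout and names (bval, bcol, brow) are illustrative assumptions, not the generated code itself.

```c
#include <stddef.h>

/* Register-blocked (BCSR) SpMxV for a fixed 2x2 block size: y <- y + A*x.
 * The matrix is stored as dense 2x2 blocks (row-major within each block),
 * including any explicitly filled-in zeros:
 *   bval[4*b .. 4*b+3] : the b-th 2x2 block
 *   bcol[b]            : column index of the block's leftmost column
 *   brow[I]            : start of block row I (brow[M] = number of blocks)
 * One column index is stored per block instead of per non-zero, and the block
 * multiply is fully unrolled so the two destination and two source vector
 * elements stay in registers.
 */
void spmv_bcsr_2x2(size_t M /* number of block rows */,
                   const double *bval, const int *bcol, const int *brow,
                   const double *x, double *y)
{
    for (size_t I = 0; I < M; I++) {
        double y0 = y[2 * I], y1 = y[2 * I + 1];
        for (int b = brow[I]; b < brow[I + 1]; b++) {
            const double *a  = &bval[4 * b];
            const double *xp = &x[bcol[b]];
            y0 += a[0] * xp[0] + a[1] * xp[1];
            y1 += a[2] * xp[0] + a[3] * xp[1];
        }
        y[2 * I]     = y0;
        y[2 * I + 1] = y1;
    }
}
```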


Search: Choosing the Block Size

Off-line benchmarking (once per architecture)

Measure Dense Performance(r,c): performance (Mflop/s) of a dense matrix stored in sparse r×c blocked format

At run-time, when the matrix is known:

Estimate Fill Ratio(r,c): Fill Ratio(r,c) = (number of stored values) / (number of true non-zeros)

Choose the r×c that maximizes

Estimated Performance(r,c) = Dense Performance(r,c) / Fill Ratio(r,c)

(Replaces the previous SPARSITY heuristic; a selection sketch follows below.)
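
A minimal sketch of this selection step, assuming the off-line dense profile and the run-time fill-ratio estimates are already available as (r, c)-indexed tables; the names dense_mflops and fill_ratio and the 12×12 search range are illustrative, not part of SPARSITY's interface.

```c
/* Pick the (r, c) that maximizes
 *   Estimated Performance(r,c) = Dense Performance(r,c) / Fill Ratio(r,c).
 * dense_mflops[r-1][c-1] : off-line profile (dense matrix in r x c BCSR)
 * fill_ratio[r-1][c-1]   : (stored values) / (true non-zeros), estimated
 *                          at run time from the user's matrix (always >= 1)
 */
#define RMAX 12
#define CMAX 12

static void choose_block_size(const double dense_mflops[RMAX][CMAX],
                              const double fill_ratio[RMAX][CMAX],
                              int *best_r, int *best_c)
{
    double best = 0.0;
    *best_r = *best_c = 1;
    for (int r = 1; r <= RMAX; r++) {
        for (int c = 1; c <= CMAX; c++) {
            double est = dense_mflops[r - 1][c - 1] / fill_ratio[r - 1][c - 1];
            if (est > best) {
                best = est;
                *best_r = r;
                *best_c = c;
            }
        }
    }
}
```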


Off-line Benchmarking [Intel Pentium III]

[Heatmap: Register Blocking Performance (Mflop/s) of a dense matrix (n=1000) in sparse r×c format, r, c = 1..12, on pentium3-linux-icc; color scale 43-107 Mflop/s.]

Top 10 codes labeled by speedup over unblocked code (labels range from 2.36 to 2.54); max speedup = 2.54.


[Four register-blocking heatmaps, Register Blocking Performance (Mflop/s) of a dense matrix in sparse r×c format, r, c = 1..12; each panel title gives its maximum speedup over the unblocked code:
333 MHz Sun Ultra 2i (2.03): 36-72 Mflop/s, Dense n=1000, ultra-solaris
500 MHz Intel Pentium III (2.54): 43-107 Mflop/s, Dense n=1000, pentium3-linux-icc
375 MHz IBM Power3 (1.22): 86-174 Mflop/s, Dense n=2000, power3-aix
800 MHz Intel Itanium (1.55): 109-256 Mflop/s, Dense n=1000, itanium-linux-ecc]


Performance Bounds for Register Blocking

How close are we to the speed limit of blocking?

Upper-bounds on performance: (flops) / (time)

Flops = 2 × (number of true non-zeros)

Lower bound on time: two key assumptions: consider only memory operations; count only compulsory misses (i.e., ignore conflicts)

Other features of the model: account for cache line size; the bound is a function of the matrix and of r×c

Model of execution time (see paper): charge the full latency α_i for hits at each cache level i, e.g., for two cache levels plus memory,

T = h_1·α_1 + h_2·α_2 + h_mem·α_mem
  = (loads)·α_1 + m_1·(α_2 − α_1) + m_2·(α_mem − α_2),

where h_i and m_i are the hits and misses at level i (a small numeric sketch follows below).
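
A small numeric sketch of this charging scheme for two cache levels plus memory; the function name and the latency constants here are placeholders, not the measured values used in the paper.

```c
/* Lower bound on time (hence upper bound on Mflop/s) for a matrix with nnz
 * true non-zeros, given modeled load and compulsory-miss counts.  Each access
 * is charged the full latency of the level that services it:
 *   T = loads*a1 + m1*(a2 - a1) + m2*(amem - a2)
 * which is equivalent to h1*a1 + h2*a2 + hmem*amem.
 */
static double upper_bound_mflops(double loads, double m1, double m2,
                                 double nnz, double cycles_per_sec)
{
    const double a1 = 2.0, a2 = 10.0, amem = 100.0; /* placeholder latencies (cycles) */
    double t_cycles = loads * a1 + m1 * (a2 - a1) + m2 * (amem - a2);
    double flops = 2.0 * nnz;                        /* 2 flops per true non-zero */
    return flops / (t_cycles / cycles_per_sec) / 1e6;
}
```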


Overview of Performance Results

Experimental setup

Four machines: Pentium III, Ultra 2i, Power3, Itanium

44 matrices: dense, finite element, mixed, linear programming

Measured misses and cycles using PAPI [BDG+00]

Reference: CSR (i.e., 1×1, or “unblocked”)

Main observations

SPARSITY vs. reference: up to 2.5x faster, especially on FEM

Block size selection: chooses within 10% of best

SPARSITY performance typically within 20% of upper-bound

SPARSITY least effective on Power3



Performance Results: Intel Pentium III

[Bar chart, Performance Summary [pentium3-linux-icc]: performance (Mflop/s, 0-140) for each matrix, grouped as D, FEM, FEM (var. blk.), Mixed, LP. Series: Analytic upper bound, Upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic), Reference.]

DGEMV (n=1000): 96 Mflop/s


Performance Results: Sun Ultra 2i

[Bar chart, Performance Summary [ultra-solaris]: performance (Mflop/s, 0-80) for each matrix, grouped as D, FEM, FEM (var. blk.), Mixed, LP. Series: Analytic upper bound, Upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic), Reference.]

DGEMV (n=1000): 58 Mflop/s


Performance Results: Intel Itanium

[Bar chart, Performance Summary [itanium-linux-ecc]: performance (Mflop/s, 0-340) for each matrix, grouped as D, FEM, FEM (var. blk.), Mixed, LP. Series: Analytic upper bound, Upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic), Reference.]

DGEMV (n=1000): 310 Mflop/s


Performance Results: IBM Power3

[Bar chart, Performance Summary [power3-aix]: performance (Mflop/s, 0-280) for each matrix, grouped as D, FEM, FEM (var. blk.), LP. Series: Analytic upper bound, Upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic), Reference.]

DGEMV (n=2000): 260 Mflop/s


Conclusions

Tuning can be difficult, even when matrix structure is known

Performance is a complicated function of architecture and matrix

New heuristic for choosing the block size selects the optimal implementation, or a near-optimal one (performance within 5–10%)

Limits of low-level tuning for blocking are near

Performance is often within 20% of the upper bound, particularly with FEM matrices

Unresolved: closing the gap on the Power3


BeBOP: Current and Future Work (1/2)

Further performance improvements: symmetry (1.5–2x speedups),

diagonals, block diagonals, and bands (1.2–2x),

splitting for variable block structure (1.3–1.7x),

reordering to create dense structure (1.7x),

cache blocking (1.5–4x)

multiple vectors (2–7x)

and combinations . . .

How to choose optimizations & tuning parameters?

Sparse triangular solve (ICS’02: POHLL Workshop paper)

hybrid sparse/dense structure (1.2–1.8x)

Higher-level kernels that permit reuse

A·Aᵀ·x, Aᵀ·A·x (1.5–3x)

A·x and Aᵀ·y simultaneously, Aᵏ·x, R·A·Rᵀ, . . . (future work)

(a one-pass Aᵀ·A·x sketch follows below)
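
As an illustration of the reuse such kernels expose, here is a sketch of y ← AᵀAx computed in one pass over the rows of a CSR matrix: each row is read once from memory and used both for the dot product t = a_i·x and for the update y += t·a_i. This is a generic sketch, not the tuned BeBOP routine; the array names follow the CSR kernel shown earlier.

```c
#include <stddef.h>

/* y <- A^T * (A * x) for an m x n CSR matrix, in a single pass over the rows
 * of A.  Row a_i is loaded once and reused while still in cache/registers:
 * first for the dot product t = a_i . x, then for the update y += t * a_i.
 * Computing A*x and then A^T*(A*x) as two separate SpMxVs would read A twice.
 * x and y have length n; y must be zeroed (or pre-loaded) by the caller.
 */
void spmv_ata(size_t m, const double *val, const int *col_ind,
              const int *row_ptr, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        int start = row_ptr[i], end = row_ptr[i + 1];
        double t = 0.0;
        for (int k = start; k < end; k++)      /* t = a_i . x  */
            t += val[k] * x[col_ind[k]];
        for (int k = start; k < end; k++)      /* y += t * a_i */
            y[col_ind[k]] += t * val[k];
    }
}
```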


BeBOP: Current and Future Work (2/2)

An automatically tuned sparse matrix library: code generation via sparse compilers (Bernoulli; Bik)

Plan to extend the new Sparse BLAS standard by one routine to support tuning

Architectural issues: improvements for Power3?

Latency vs. bandwidth (see paper)

Using models to explore architectural design space


EXTRA SLIDES


Example: No Big Surprises on Sun Ultra 2i

[Heatmap: Blocking Performance (Mflop/s), 03-olafu.rua on ultra-solaris; color scale 36-54 Mflop/s. Each cell is labeled with its speedup over the unblocked (1×1) code:]

            c = 1   c = 2   c = 3   c = 6
    r = 1    1.00    1.09    1.19    1.21
    r = 2    1.17    1.25    1.30    1.30
    r = 3    1.31    1.34    1.41    1.39
    r = 6    1.40    1.39    1.42    1.53


Where in Memory is the Time Spent?

[Stacked bar chart, Execution Time Model (Model Hits/Misses): fraction of cycles spent at each memory level (L1, L2, L3, Mem) for the exhaustive best implementation, averaged over matrices, on Ultra 2i (L1/L2), Pentium III (L1/L2), Power3 (L1/L2), and Itanium (L1/L2/L3).]


Cache Miss Bound Verification: Sun Ultra 2i (L1)

[Plot, L1 Misses [ultra-solaris]: number of misses (millions) for each matrix; series: Upper bound, PAPI, Lower Bound.]


Cache Miss Bound Verification: Sun Ultra 2i (L2)

[Plot, L2 Misses [ultra-solaris]: number of misses (millions) for each matrix; series: Upper bound, PAPI, Lower Bound.]


Cache Miss Bound Verification: Intel Pentium III (L1)

[Plot, L1 Misses [pentium3-linux-icc]: number of misses (millions) for each matrix; series: Upper bound, PAPI, Lower Bound.]


Cache Miss Bound Verification: Intel Pentium III (L2)

[Plot, L2 Misses [pentium3-linux-icc]: number of misses (millions) for each matrix; series: Upper bound, PAPI, Lower Bound.]


Cache Miss Bound Verification: IBM Power3 (L1)

[Plot, L1 Misses [power3-aix]: number of misses (millions) for each matrix; series: Upper bound, PAPI, Lower Bound.]


Cache Miss Bound Verification: IBM Power3 (L2)

[Plot, L2 Misses [power3-aix]: number of misses (millions) for each matrix; series: Upper bound, PAPI, Lower Bound.]


Cache Miss Bound Verification: Intel Itanium (L2)

[Plot, L2 Misses [itanium-linux-ecc]: number of misses (millions) for each matrix; series: Upper bound, PAPI, Lower Bound.]


Cache Miss Bound Verification: Intel Itanium (L3)

[Plot, L3 Misses [itanium-linux-ecc]: number of misses (millions) for each matrix; series: Upper bound, PAPI, Lower Bound.]


Latency vs. Bandwidth

[Bar chart, Sustainable Memory Bandwidth (STREAM): fraction of peak main memory bandwidth achieved on Ultra 2i, Pentium III, Power3, and Itanium, with the full-latency model overlaid, for the kernels: a[i] = b[i]; a[i] = α*b[i]; a[i] = b[i] + c[i]; a[i] = b[i] + α*c[i]; s += a[i]; s += a[i]*b[i]; s += a[i], k += ind[i]; s += a[i]*x[ind[i]] (x ∈ L_chip); s += a[i]*x[ind[i]] (x ∈ L_ext); DGEMV; SpM×V (dense, 1×1); SpM×V (dense, best).]


Related Work (2/2)

Compilers (analysis and models); run-time selection

CROPS (UCSD/Carter, Ferrante, et al.)

TUNE (Chatterjee, et al.)

Iterative compilation (O’Boyle, et al., 1998)

Broadway (Guyer and Lin, ’99)

Brewer (’95); ADAPT (Voss, 2000)

Interfaces: Sparse BLAS; PSBLAS; PETSc

Sparse triangular solve

SuperLU/MUMPS/SPOOLES/UMFPACK/PSPASES. . .

Approximation: Alvarado (’93); Raghavan (’98)

Scalability: Rothberg (’92;’95); Gupta (’95); Li, Coleman (’88)


What is the Cost of Search?

[Bar chart, Block Size Selection Overhead [itanium-linux-ecc]: cost per matrix, in equivalent numbers of reference SpM×V operations (0-45), split into heuristic and rebuild components; bars carry labels between .71 and .86.]


Where in Memory is the Time Spent? (PAPI Data)

[Stacked bar chart, Execution Time Model (PAPI Hits/Misses): fraction of cycles spent at each memory level (L1, L2, L3, Mem) for the exhaustive best implementation, averaged over matrices, on Ultra 2i (L1/L2), Pentium III (L1/L2), Power3 (L1/L2), and Itanium (L1/L2/L3).]


Fill: Some Surprises!

Sometimes faster to fill in many zeros

Matrix   Dense Speedup   Fill ratio   (Size BCSR)/(Size CSR)   (Perf BCSR)/(Perf CSR)   Platform
  11         1.09           1.70              1.26                     1.55             Itanium
  13         1.50           1.52              1.07                     2.30             Pentium 3
  17         1.04           1.59              1.23                     1.54             Itanium
  27         1.16           1.94              1.47                     1.54             Itanium
  27         1.10           1.53              1.25                     1.23             Ultra 2i
  29         1.00           1.98              1.44                     1.89             Pentium 3


References

[BACD97] J. Bilmes, K. Asanovic, C.W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997. ACM SIGARC. See http://www.icsi.berkeley.edu/~bilmes/phipac.

[BCD+01] S. Blackford, G. Corliss, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, C. Hu, W. Kahan, L. Kaufman, B. Kearfott, F. Krogh, X. Li, Z. Maany, A. Petitet, R. Pozo, K. Remington, W. Walster, C. Whaley, and J. Wolff von Gudenberg. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001. www.netlib.org/blast.

[BDG+00] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.

[BW99] Aart J. C. Bik and Harry A. G. Wijshoff. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576–1587, 1999.

[FC00] Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel linear algebra computation on sparse matrices. ACM Transactions on Mathematical Software, 26(4):527–550, December 2000.

[FDZ99] Basilio B. Fraguela, Ramon Doallo, and Emilio L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), March 1999.

[FJ98] Matteo Frigo and Stephen Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.

[GGHvdG01] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4), December 2001.


[GKKS99] William D. Gropp, D. K. Kaushik, David E. Keyes, and Barry F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241–248, 1999.

[GR99] Roman Geus and S. Rollin. Towards a fast parallel sparse matrix-vector multiplication. In E. H. D'Hollander, J. R. Joubert, F. J. Peters, and H. Sips, editors, Proceedings of the International Conference on Parallel Computing (ParCo), pages 308–315. Imperial College Press, 1999.

[HPDR99] Dora Blanco Heras, Vicente Blanco Perez, Jose Carlos Cabaleiro Dominguez, and Francisco F. Rivera. Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study. In HPCN Europe, pages 201–210, 1999.

[Im00] Eun-Jin Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.

[MMJ00] Dragan Mirkovic, Rishad Mahasoom, and Lennart Johnsson. An adaptive software library for fast Fourier transforms. In Proceedings of the International Conference on Supercomputing, pages 215–224, Santa Fe, NM, May 2000.

[Neu98] T. Neubert. Anwendung von generativen Programmiertechniken am Beispiel der Matrixalgebra. Master's thesis, Technische Universitat Chemnitz, 1998.

[NGLPJ96] J. J. Navarro, E. Garcia, J. L. Larriba-Pey, and T. Juan. Algorithms for sparse matrix computations on high-performance workstations. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 301–308, Philadelphia, PA, USA, May 1996.

[PH99] Ali Pinar and Michael Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of Supercomputing, 1999.

[PSVM01] Markus Puschel, Bryan Singer, Manuela Veloso, and Jose M. F. Moura. Fast automatic generation of DSP algorithms. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 97–106, San Francisco, CA, May 2001. Springer.


[RP96] K. Remington and R. Pozo. NIST Sparse BLAS: User's Guide. Technical report, NIST, 1996. gams.nist.gov/spblas.

[Saa94] Yousef Saad. SPARSKIT: A basic toolkit for sparse matrix computations, 1994. www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html.

[SL98] Jeremy G. Siek and Andrew Lumsdaine. A rational approach to portable high performance: the Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) library. In Proceedings of ECOOP, 1998.

[Sto97] Paul Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.

[TJ92] O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing '92, 1992.

[Tol97] Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.

[Vel98] Todd Veldhuizen. Arrays in Blitz++. In Proceedings of ISCOPE, volume 1505 of LNCS. Springer-Verlag, 1998.

[VFD01] Sathish S. Vadhiyar, Graham E. Fagg, and Jack J. Dongarra. Towards an accurate model for collective communications. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 41–50, San Francisco, CA, May 2001. Springer.

[WPD01] R. Clint Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1):3–25, 2001.

[WS97] James B. White and P. Sadayappan. On improving the performance of sparse matrix-vector multiplication. In Proceedings of the International Conference on High-Performance Computing, 1997.
