Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply

Richard Vuduc, James Demmel, Katherine Yelick
Shoaib Kamil, Rajesh Nishtala, Benjamin Lee

Wednesday, November 20, 2002

Berkeley Benchmarking and OPtimization (BeBOP) Project
www.cs.berkeley.edu/~richie/bebop
Computer Science Division, U.C. Berkeley
Berkeley, California, USA

SC 2002: Session on Sparse Linear Algebra

Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply – p.1/36
Context: Performance Tuning in the Sparse Case

Application performance is dominated by a few computational kernels.

Performance tuning today:
  Vendor-tuned libraries (e.g., BLAS), or the user hand-tunes
  Automatic tuning (e.g., PHiPAC/ATLAS, FFTW/SPIRAL/UHFFT)

Tuning sparse linear algebra kernels is hard. Sparse code has:
  high bandwidth requirements (extra storage)
  poor locality (indirect, irregular memory access)
  poor instruction mix (data structure manipulation)

Sparse matrix-vector multiply (SpM×V) performance: less than 10% of machine peak.
Performance depends on the kernel, the architecture, and the matrix.
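For reference, SpM×V over a CSR matrix looks roughly like the sketch below (illustrative names, not code from the talk); the indirect access `x[col_ind[k]]` is exactly the irregular memory access pattern blamed above.

```python
def spmv_csr(val, col_ind, row_ptr, x):
    """y = A*x for A stored in compressed sparse row (CSR) format."""
    n = len(row_ptr) - 1          # number of rows
    y = [0.0] * n
    for i in range(n):
        # non-zeros of row i live in val[row_ptr[i]:row_ptr[i+1]]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[k] * x[col_ind[k]]  # indirect, irregular access to x
    return y
```

Each non-zero costs one multiply-add but also a load of its column index, which is the "extra storage" and poor instruction mix referred to above.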
Example: Matrix olafu

[Spy plot of 03-olafu.rua: 16146 x 16146, 1015156 non-zeros]

N = 16146
nnz = 1.0M
Kernel = SpM×V

A natural choice: blocked compressed sparse row (BCSR).

Experiment: measure performance of all r x c block sizes.
Example: Matrix olafu

[Spy plot, zoom-in on a 60 x 60 submatrix of 03-olafu.rua: 1188 non-zeros]

N = 16146
nnz = 1.0M
Kernel = SpM×V

A natural choice: r x c blocked compressed sparse row (BCSR).

Experiment: measure performance of all r x c block sizes.
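A minimal sketch of the BCSR idea (an illustrative layout, not SPARSITY's actual code): non-zeros are stored as dense r x c blocks, so one column index is amortized over r*c values and the inner loops are fully unrollable.

```python
def spmv_bcsr(val, bcol, brow_ptr, x, r, c):
    """y = A*x with A in r x c blocked CSR (BCSR).

    val holds dense r*c blocks, row-major, one after another;
    bcol[k] is the starting column of block k;
    brow_ptr delimits the blocks of each block row."""
    nb_rows = len(brow_ptr) - 1
    y = [0.0] * (nb_rows * r)
    for ib in range(nb_rows):
        for k in range(brow_ptr[ib], brow_ptr[ib + 1]):
            j0 = bcol[k]
            blk = val[k * r * c:(k + 1) * r * c]
            for di in range(r):          # these two loops have fixed trip
                for dj in range(c):      # counts r and c, so a generator
                    y[ib * r + di] += blk[di * c + dj] * x[j0 + dj]  # can unroll them
    return y
```

In a tuned C version the di/dj loops are unrolled at code-generation time for each (r, c), which is what makes the block size a tuning parameter.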
Speedups on Itanium: The Need for Search

[Heat map: blocking performance (Mflop/s) on 03-olafu.rua, itanium-linux-ecc;
row block size r and column block size c each in {1, 2, 3, 6}; color scale 93-208 Mflop/s]

Speedup over the unblocked (1x1) code:

          c=1    c=2    c=3    c=6
  r=1    1.00   0.86   0.99   0.93
  r=2    1.21   1.27   0.97   0.78
  r=3    1.44   0.99   1.07   0.64
  r=6    1.20   0.81   0.75   0.95

(Peak machine speed: 3.2 Gflop/s)
Key Questions and Conclusions

How do we choose the best data structure automatically?
  New heuristic for choosing optimal (or near-optimal) block sizes

What are the limits on performance of blocked SpM×V?
  Derive performance upper bounds for blocking
  Often within 20% of the upper bound, placing limits on improvement from more "low-level" tuning
  Performance is memory-bound: reducing data structure size is critical

Where are the new opportunities (kernels, techniques) for achieving higher performance?
  Identify cases in which blocking does and does not work
  Identify new kernels and opportunities for reducing memory traffic
Related Work

Automatic tuning systems and code generation
  PHiPAC [BACD97], ATLAS [WPD01], SPARSITY [Im00]
  FFTW [FJ98], SPIRAL [PSVM01], UHFFT [MMJ00]
  MPI collective ops (Vadhiyar, et al. [VFD01])
  Sparse compilers (Bik [BW99], Bernoulli [Sto97])
  Generic programming (Blitz++ [Vel98], MTL [SL98], GMCL [Neu98], ...)
  FLAME [GGHvdG01]

Sparse performance modeling and tuning
  Temam and Jalby [TJ92]
  Toledo [Tol97], White and Sadayappan [WS97], Pinar [PH99]
  Navarro [NGLPJ96], Heras [HPDR99], Fraguela [FDZ99]
  Gropp, et al. [GKKS99], Geus [GR99]

Sparse kernel interfaces
  Sparse BLAS Standard [BCD+01]
  NIST SparseBLAS [RP96], SPARSKIT [Saa94], PSBLAS [FC00]
  PETSc, hypre, ...
Approach to Automatic Tuning

For each kernel,
  Identify and generate a space of implementations
  Search to find the fastest (using models, experiments)

The SPARSITY system for SpM×V [Im & Yelick '99]
  Interface
    Input: your sparse matrix (CSR)
    Output: data structure + routine tuned to your matrix & machine
  Implementation space
    register-level blocking (r x c), cache blocking, multiple vectors, ...
  Search
    Off-line: benchmarking (once per architecture)
    Run-time: estimate matrix properties ("search") and predict the best data
    structure parameters
Register-Level Blocking (SPARSITY): 3x3 Example

[Spy plot of a 50 x 50 submatrix under a 3 x 3 blocking grid:
(688 true non-zeros) + (383 explicit zeros) = 1071 nz]

BCSR with a uniform, aligned grid.

Fill-in zeros: trade extra flops for better efficiency.
This example: 50% fill led to a 1.5x speedup on a Pentium III.
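The fill ratio of a uniform, aligned r x c grid follows directly from which blocks the non-zeros touch; this helper (hypothetical, for illustration only) computes it exactly from the non-zero coordinates.

```python
def fill_ratio(nz_coords, r, c):
    """Fill ratio of uniform, aligned r x c blocking:
    (stored values, incl. explicit zeros) / (true non-zeros)."""
    # each touched block is stored in full as r*c values
    blocks = {(i // r, j // c) for (i, j) in nz_coords}
    return len(blocks) * r * c / len(nz_coords)
```

In practice SPARSITY does not scan every non-zero at run time; it estimates the fill ratio by sampling a fraction of the rows, which is what keeps the search cheap.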
Search: Choosing the Block Size

Off-line benchmarking (once per architecture)
  Measure Dense Performance(r,c): the performance (Mflop/s) of a dense matrix
  stored in sparse r x c blocked format

At run-time, when the matrix is known:
  Estimate Fill Ratio(r,c) for all r, c:
    Fill Ratio(r,c) = (number of stored values) / (number of true non-zeros)
  Choose the r x c that maximizes
    Estimated Performance(r,c) = Dense Performance(r,c) / Fill Ratio(r,c)

(Replaces the previous SPARSITY heuristic)
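Combining the two ingredients, the selection step reduces to one comparison per candidate block size. In this sketch, `dense_mflops` holds the off-line benchmark results and `fill` the run-time fill-ratio estimates, both keyed by (r, c); the dictionary names are assumptions of the example.

```python
def choose_block_size(dense_mflops, fill):
    """Return the (r, c) maximizing Dense Performance(r,c) / Fill Ratio(r,c)."""
    return max(dense_mflops, key=lambda rc: dense_mflops[rc] / fill[rc])
```

For example, a block size whose dense profile is 1.75x faster than 1x1 only pays off if its fill ratio stays below 1.75.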
Off-line Benchmarking [Intel Pentium III]

[Heat map: register blocking performance (Mflop/s) of a dense n=1000 matrix stored
in sparse r x c format, 1 <= r, c <= 12, on pentium3-linux-icc; range 43-107 Mflop/s.
Top 10 codes labeled by speedup over the unblocked code: 2.54, 2.49, 2.45, 2.44,
2.41, 2.39, 2.39, 2.38, 2.37, 2.36.]

Max speedup = 2.54.
[Off-line benchmarking, all platforms: four heat maps of register blocking
performance (Mflop/s) for a dense matrix in sparse r x c format, 1 <= r, c <= 12.
Each title gives the platform and its maximum speedup over the unblocked code:]

  333 MHz Sun Ultra 2i (2.03): 36-72 Mflop/s [ultra-solaris, n=1000]
  500 MHz Intel Pentium III (2.54): 43-107 Mflop/s [pentium3-linux-icc, n=1000]
  375 MHz IBM Power3 (1.22): 86-174 Mflop/s [power3-aix, n=2000]
  800 MHz Intel Itanium (1.55): 109-256 Mflop/s [itanium-linux-ecc, n=1000]
Performance Bounds for Register Blocking

How close are we to the speed limit of blocking?

Upper bound on performance: (flops) / (time)
  Flops = 2 x (number of true non-zeros)

Lower bound on time: two key assumptions
  Consider only memory operations
  Count only compulsory misses (i.e., ignore conflicts)

Other features of the model
  Account for cache line size
  Bound is a function of the matrix and the block size r x c

Model of execution time (see paper): charge the full latency a_i for hits at each
cache level i; e.g., with two cache levels,

  T = (L1 hits) a_1 + (L2 hits) a_2 + (L2 misses) a_mem
    = (loads) a_1 + (L1 misses)(a_2 - a_1) + (L2 misses)(a_mem - a_2)
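The model above is easy to evaluate numerically. The sketch below (illustrative names, assuming `lat = [a_1, ..., a_k, a_mem]` in cycles and one miss count per cache level) turns load and miss counts into a Mflop/s bound; feeding it the compulsory-miss lower bounds yields the performance upper bound.

```python
def model_cycles(loads, misses, lat):
    """Full-latency model: T = loads*a_1 + sum_i misses_i*(a_{i+1} - a_i)."""
    t = loads * lat[0]                   # every load hits at least L1
    for i, m in enumerate(misses):
        t += m * (lat[i + 1] - lat[i])   # each level-i miss pays the extra latency
    return t

def upper_bound_mflops(nnz_true, loads, misses, lat, clock_mhz):
    """Mflop/s = flops / time, with flops = 2 * (true non-zeros)."""
    return 2.0 * nnz_true * clock_mhz / model_cycles(loads, misses, lat)
```

Fewer modeled misses (e.g., from a smaller blocked data structure) shrink T and raise the bound, which is why reducing data structure size is the dominant effect.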
Overview of Performance Results

Experimental setup
  Four machines: Pentium III, Ultra 2i, Power3, Itanium
  44 matrices: dense, finite element, mixed, linear programming
  Measured misses and cycles using PAPI [BDG+00]
  Reference: CSR (i.e., 1x1 or "unblocked")

Main observations
  SPARSITY vs. reference: up to 2.5x faster, especially on FEM matrices
  Block size selection: chooses within 10% of best
  SPARSITY performance typically within 20% of the upper bound
  SPARSITY least effective on Power3
Performance Results: Intel Pentium III

[Scatter plot over 33 matrices (dense, FEM, FEM with variable blocks, mixed, LP) on
pentium3-linux-icc, 0-140 Mflop/s. Series: analytic upper bound, upper bound (PAPI),
Sparsity (exhaustive), Sparsity (heuristic), reference.]

DGEMV (n=1000): 96 Mflop/s
Performance Results: Sun Ultra 2i

[Scatter plot over 22 matrices on ultra-solaris, 0-80 Mflop/s. Series: analytic
upper bound, upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic),
reference.]

DGEMV (n=1000): 58 Mflop/s
Performance Results: Intel Itanium

[Scatter plot over 22 matrices on itanium-linux-ecc, 0-340 Mflop/s. Series: analytic
upper bound, upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic),
reference.]

DGEMV (n=1000): 310 Mflop/s
Performance Results: IBM Power3

[Scatter plot over 11 matrices on power3-aix, 0-280 Mflop/s. Series: analytic upper
bound, upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic), reference.]

DGEMV (n=2000): 260 Mflop/s
Conclusions

Tuning can be difficult, even when the matrix structure is known:
performance is a complicated function of architecture and matrix.

The new heuristic for choosing the block size selects the optimal
implementation, or a near-optimal one (performance within 5-10%).

The limits of low-level tuning for blocking are near: performance is
often within 20% of the upper bound, particularly for FEM matrices.

Unresolved: closing the gap on the Power3.
BeBOP: Current and Future Work (1/2)

Further performance improvements
  symmetry (1.5-2x speedups)
  diagonals, block diagonals, and bands (1.2-2x)
  splitting for variable block structure (1.3-1.7x)
  reordering to create dense structure (1.7x)
  cache blocking (1.5-4x)
  multiple vectors (2-7x)
  and combinations ...

How to choose optimizations & tuning parameters?

Sparse triangular solve (ICS'02 POHLL Workshop paper)
  hybrid sparse/dense structure (1.2-1.8x)

Higher-level kernels that permit reuse
  A*A^T*x, A^T*A*x (1.5-3x)
  A*x and A^T*y simultaneously, A^k*x, R*A*R^T, ... (future work)
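To illustrate the reuse these higher-level kernels permit, y = A^T(Ax) can stream each sparse row of A once, using it both for the dot product with x and for the update of y. This is a sketch of the idea, not BeBOP's implementation:

```python
def ata_times_x(rows, x, n):
    """y = A^T (A x) for an m x n sparse A, streaming A once.

    rows is a list of sparse rows, each a list of (col, val) pairs.
    Row a_i is read from memory once and reused twice:
    first for t = a_i . x, then for y += t * a_i."""
    y = [0.0] * n
    for row in rows:
        t = sum(v * x[j] for j, v in row)   # t = a_i . x
        for j, v in row:
            y[j] += t * v                   # y += t * a_i, same row, still cached
    return y
```

Compared with two separate SpM×V calls (z = Ax, then y = A^T z), each non-zero of A is loaded once instead of twice, halving the dominant memory traffic.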
BeBOP: Current and Future Work (2/2)

An automatically tuned sparse matrix library
  Code generation via sparse compilers (Bernoulli; Bik)
  Plan to extend the new Sparse BLAS standard by one routine to support tuning

Architectural issues
  Improvements for Power3?
  Latency vs. bandwidth (see paper)
  Using models to explore the architectural design space
EXTRA SLIDES
Example: No Big Surprises on Sun Ultra 2i

[Heat map: blocking performance (Mflop/s) on 03-olafu.rua, ultra-solaris;
r, c each in {1, 2, 3, 6}; color scale 36-54 Mflop/s]

Speedup over the unblocked (1x1) code:

          c=1    c=2    c=3    c=6
  r=1    1.00   1.09   1.19   1.21
  r=2    1.17   1.25   1.30   1.30
  r=3    1.31   1.34   1.41   1.39
  r=6    1.40   1.39   1.42   1.53
Where in Memory is the Time Spent?

[Stacked bar chart: fraction of cycles spent at L1, L2, L3, and memory under the
execution time model (modeled hits/misses), for the exhaustive best implementation,
averaged over matrices, on Ultra 2i (L1/L2), Pentium III (L1/L2), Power3 (L1/L2),
and Itanium (L1/L2/L3).]
Cache Miss Bound Verification

[Seven plots comparing measured (PAPI) miss counts per matrix, in millions, against
the model's upper and lower bounds:
  Sun Ultra 2i: L1 and L2 misses [ultra-solaris]
  Intel Pentium III: L1 and L2 misses [pentium3-linux-icc]
  IBM Power3: L1 and L2 misses [power3-aix]
  Intel Itanium: L2 and L3 misses [itanium-linux-ecc]]
Latency vs. Bandwidth

[Bar chart: fraction of peak sustainable main memory bandwidth (STREAM) achieved on
Ultra 2i, Pentium III, Power3, and Itanium, with the full-latency model overlaid,
for the kernels:
  a[i] = b[i]
  a[i] = alpha*b[i]
  a[i] = b[i] + c[i]
  a[i] = b[i] + alpha*c[i]
  s += a[i]
  s += a[i]*b[i]
  s += a[i]; k += ind[i]
  s += a[i]*x[ind[i]], x in on-chip cache
  s += a[i]*x[ind[i]], x in external cache
  DGEMV
  SpM×V (dense, 1x1)
  SpM×V (dense, best)]
Related Work (2/2)

Compilers (analysis and models); run-time selection
  CROPS (UCSD/Carter, Ferrante, et al.)
  TUNE (Chatterjee, et al.)
  Iterative compilation (O'Boyle, et al., 1998)
  Broadway (Guyer and Lin, '99)
  Brewer ('95); ADAPT (Voss, 2000)
  Interfaces: Sparse BLAS; PSBLAS; PETSc

Sparse triangular solve
  SuperLU/MUMPS/SPOOLES/UMFPACK/PSPASES ...
  Approximation: Alvarado ('93); Raghavan ('98)
  Scalability: Rothberg ('92, '95); Gupta ('95); Li, Coleman ('88)
What is the Cost of Search?

[Bar chart: block size selection overhead per matrix on itanium-linux-ecc, measured
in numbers of reference SpM×V operations (0-45). Two components: running the
heuristic (bars labeled .71 to .86 reference SpM×V's) and rebuilding the data
structure.]
Where in Memory is the Time Spent? (PAPI Data)

[Stacked bar chart: fraction of cycles spent at L1, L2, L3, and memory under the
execution time model using PAPI-measured hits/misses, for the exhaustive best
implementation, averaged over matrices, on Ultra 2i (L1/L2), Pentium III (L1/L2),
Power3 (L1/L2), and Itanium (L1/L2/L3).]
Fill: Some Surprises!

Sometimes it is faster to fill in many zeros:

  Matrix   Dense     Fill    (Size BCSR)/   (Perf BCSR)/   Platform
           Speedup   ratio   (Size CSR)     (Perf CSR)
  11       1.09      1.70    1.26           1.55           Itanium
  13       1.50      1.52    1.07           2.30           Pentium 3
  17       1.04      1.59    1.23           1.54           Itanium
  27       1.16      1.94    1.47           1.54           Itanium
  27       1.10      1.53    1.25           1.23           Ultra 2i
  29       1.00      1.98    1.44           1.89           Pentium 3
References

[BACD97] J. Bilmes, K. Asanovic, C.W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997. ACM SIGARC. See http://www.icsi.berkeley.edu/~bilmes/phipac.

[BCD+01] S. Blackford, G. Corliss, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, C. Hu, W. Kahan, L. Kaufman, B. Kearfott, F. Krogh, X. Li, Z. Maany, A. Petitet, R. Pozo, K. Remington, W. Walster, C. Whaley, and J. Wolff von Gudenberg. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001. www.netlib.org/blast.

[BDG+00] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.

[BW99] Aart J. C. Bik and Harry A. G. Wijshoff. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576-1587, 1999.

[FC00] Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel linear algebra computation on sparse matrices. ACM Transactions on Mathematical Software, 26(4):527-550, December 2000.

[FDZ99] Basilio B. Fraguela, Ramon Doallo, and Emilio L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), March 1999.

[FJ98] Matteo Frigo and Stephen Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.

[GGHvdG01] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4), December 2001.

[GKKS99] William D. Gropp, D. K. Kasushik, David E. Keyes, and Barry F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241-248, 1999.

[GR99] Roman Geus and S. Rollin. Towards a fast parallel sparse matrix-vector multiplication. In E. H. D'Hollander, J. R. Joubert, F. J. Peters, and H. Sips, editors, Proceedings of the International Conference on Parallel Computing (ParCo), pages 308-315. Imperial College Press, 1999.

[HPDR99] Dora Blanco Heras, Vicente Blanco Perez, Jose Carlos Cabaleiro Dominguez, and Francisco F. Rivera. Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study. In HPCN Europe, pages 201-210, 1999.

[Im00] Eun-Jin Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.

[MMJ00] Dragan Mirkovic, Rishad Mahasoom, and Lennart Johnsson. An adaptive software library for fast Fourier transforms. In Proceedings of the International Conference on Supercomputing, pages 215-224, Santa Fe, NM, May 2000.

[Neu98] T. Neubert. Anwendung von generativen Programmiertechniken am Beispiel der Matrixalgebra. Master's thesis, Technische Universitat Chemnitz, 1998.

[NGLPJ96] J. J. Navarro, E. Garcia, J. L. Larriba-Pey, and T. Juan. Algorithms for sparse matrix computations on high-performance workstations. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 301-308, Philadelphia, PA, USA, May 1996.

[PH99] Ali Pinar and Michael Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of Supercomputing, 1999.

[PSVM01] Markus Puschel, Bryan Singer, Manuela Veloso, and Jose M. F. Moura. Fast automatic generation of DSP algorithms. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 97-106, San Francisco, CA, May 2001. Springer.

[RP96] K. Remington and R. Pozo. NIST Sparse BLAS: User's Guide. Technical report, NIST, 1996. gams.nist.gov/spblas.

[Saa94] Yousef Saad. SPARSKIT: A basic toolkit for sparse matrix computations, 1994. www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html.

[SL98] Jeremy G. Siek and Andrew Lumsdaine. A rational approach to portable high performance: the Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) library. In Proceedings of ECOOP, 1998.

[Sto97] Paul Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.

[TJ92] O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing '92, 1992.

[Tol97] Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.

[Vel98] Todd Veldhuizen. Arrays in Blitz++. In Proceedings of ISCOPE, volume 1505 of LNCS. Springer-Verlag, 1998.

[VFD01] Sathish S. Vadhiyar, Graham E. Fagg, and Jack J. Dongarra. Towards an accurate model for collective communications. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 41-50, San Francisco, CA, May 2001. Springer.

[WPD01] R. Clint Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1):3-25, 2001.

[WS97] James B. White and P. Sadayappan. On improving the performance of sparse matrix-vector multiplication. In Proceedings of the International Conference on High-Performance Computing, 1997.