Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply

Richard Vuduc, James Demmel, Katherine Yelick
Shoaib Kamil, Rajesh Nishtala, Benjamin Lee

Wednesday, November 20, 2002

Berkeley Benchmarking and OPtimization (BeBOP) Project
www.cs.berkeley.edu/~richie/bebop
Computer Science Division, U.C. Berkeley
Berkeley, California, USA

SC 2002: Session on Sparse Linear Algebra

Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply – p.1/36
Context: Performance Tuning in the Sparse Case

Application performance is dominated by a few computational kernels.

Performance tuning today:
  Vendor-tuned libraries (e.g., BLAS), or the user hand-tunes
  Automatic tuning (e.g., PHiPAC/ATLAS, FFTW/SPIRAL/UHFFT)

Tuning sparse linear algebra kernels is hard. Sparse code has:
  high bandwidth requirements (extra storage)
  poor locality (indirect, irregular memory access)
  poor instruction mix (data structure manipulation)

Sparse matrix-vector multiply (SpM×V) performance: less than 10% of machine peak.
Performance depends on the kernel, the architecture, and the matrix.
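For reference, SpM×V over a CSR matrix looks roughly like the sketch below (illustrative names, not code from the talk); the indirect access `x[col_ind[k]]` is exactly the irregular memory access pattern blamed above.

```python
def spmv_csr(val, col_ind, row_ptr, x):
    """y = A*x for A stored in compressed sparse row (CSR) format."""
    n = len(row_ptr) - 1          # number of rows
    y = [0.0] * n
    for i in range(n):
        # non-zeros of row i live in val[row_ptr[i]:row_ptr[i+1]]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[k] * x[col_ind[k]]  # indirect, irregular access to x
    return y
```

Each non-zero costs one multiply-add but also a load of its column index, which is the "extra storage" and poor instruction mix referred to above.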
Example: Matrix olafu

[Spy plot of 03-olafu.rua: 16146 x 16146, 1015156 non-zeros]

N = 16146
nnz = 1.0M
Kernel = SpM×V

A natural choice: blocked compressed sparse row (BCSR).

Experiment: measure performance of all r x c block sizes.
Example: Matrix olafu

[Spy plot, zoom-in on a 60 x 60 submatrix of 03-olafu.rua: 1188 non-zeros]

N = 16146
nnz = 1.0M
Kernel = SpM×V

A natural choice: r x c blocked compressed sparse row (BCSR).

Experiment: measure performance of all r x c block sizes.
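A minimal sketch of the BCSR idea (an illustrative layout, not SPARSITY's actual code): non-zeros are stored as dense r x c blocks, so one column index is amortized over r*c values and the inner loops are fully unrollable.

```python
def spmv_bcsr(val, bcol, brow_ptr, x, r, c):
    """y = A*x with A in r x c blocked CSR (BCSR).

    val holds dense r*c blocks, row-major, one after another;
    bcol[k] is the starting column of block k;
    brow_ptr delimits the blocks of each block row."""
    nb_rows = len(brow_ptr) - 1
    y = [0.0] * (nb_rows * r)
    for ib in range(nb_rows):
        for k in range(brow_ptr[ib], brow_ptr[ib + 1]):
            j0 = bcol[k]
            blk = val[k * r * c:(k + 1) * r * c]
            for di in range(r):          # these two loops have fixed trip
                for dj in range(c):      # counts r and c, so a generator
                    y[ib * r + di] += blk[di * c + dj] * x[j0 + dj]  # can unroll them
    return y
```

In a tuned C version the di/dj loops are unrolled at code-generation time for each (r, c), which is what makes the block size a tuning parameter.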
Speedups on Itanium: The Need for Search

[Heat map: blocking performance (Mflop/s) on 03-olafu.rua, itanium-linux-ecc;
row block size r and column block size c each in {1, 2, 3, 6}; color scale 93-208 Mflop/s]

Speedup over the unblocked (1x1) code:

          c=1    c=2    c=3    c=6
  r=1    1.00   0.86   0.99   0.93
  r=2    1.21   1.27   0.97   0.78
  r=3    1.44   0.99   1.07   0.64
  r=6    1.20   0.81   0.75   0.95

(Peak machine speed: 3.2 Gflop/s)
Key Questions and Conclusions

How do we choose the best data structure automatically?
  New heuristic for choosing optimal (or near-optimal) block sizes

What are the limits on performance of blocked SpM×V?
  Derive performance upper bounds for blocking
  Often within 20% of the upper bound, placing limits on improvement from more "low-level" tuning
  Performance is memory-bound: reducing data structure size is critical

Where are the new opportunities (kernels, techniques) for achieving higher performance?
  Identify cases in which blocking does and does not work
  Identify new kernels and opportunities for reducing memory traffic
Related Work

Automatic tuning systems and code generation
  PHiPAC [BACD97], ATLAS [WPD01], SPARSITY [Im00]
  FFTW [FJ98], SPIRAL [PSVM01], UHFFT [MMJ00]
  MPI collective ops (Vadhiyar, et al. [VFD01])
  Sparse compilers (Bik [BW99], Bernoulli [Sto97])
  Generic programming (Blitz++ [Vel98], MTL [SL98], GMCL [Neu98], ...)
  FLAME [GGHvdG01]

Sparse performance modeling and tuning
  Temam and Jalby [TJ92]
  Toledo [Tol97], White and Sadayappan [WS97], Pinar [PH99]
  Navarro [NGLPJ96], Heras [HPDR99], Fraguela [FDZ99]
  Gropp, et al. [GKKS99], Geus [GR99]

Sparse kernel interfaces
  Sparse BLAS Standard [BCD+01]
  NIST SparseBLAS [RP96], SPARSKIT [Saa94], PSBLAS [FC00]
  PETSc, hypre, ...
Approach to Automatic Tuning

For each kernel,
  Identify and generate a space of implementations
  Search to find the fastest (using models, experiments)

The SPARSITY system for SpM×V [Im & Yelick '99]
  Interface
    Input: your sparse matrix (CSR)
    Output: data structure + routine tuned to your matrix & machine
  Implementation space
    register-level blocking (r x c), cache blocking, multiple vectors, ...
  Search
    Off-line: benchmarking (once per architecture)
    Run-time: estimate matrix properties ("search") and predict the best data
    structure parameters
Register-Level Blocking (SPARSITY): 3x3 Example

[Spy plot of a 50 x 50 submatrix under a 3 x 3 blocking grid:
(688 true non-zeros) + (383 explicit zeros) = 1071 nz]

BCSR with a uniform, aligned grid.

Fill-in zeros: trade extra flops for better efficiency.
This example: 50% fill led to a 1.5x speedup on a Pentium III.
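The fill ratio of a uniform, aligned r x c grid follows directly from which blocks the non-zeros touch; this helper (hypothetical, for illustration only) computes it exactly from the non-zero coordinates.

```python
def fill_ratio(nz_coords, r, c):
    """Fill ratio of uniform, aligned r x c blocking:
    (stored values, incl. explicit zeros) / (true non-zeros)."""
    # each touched block is stored in full as r*c values
    blocks = {(i // r, j // c) for (i, j) in nz_coords}
    return len(blocks) * r * c / len(nz_coords)
```

In practice SPARSITY does not scan every non-zero at run time; it estimates the fill ratio by sampling a fraction of the rows, which is what keeps the search cheap.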
Search: Choosing the Block Size

Off-line benchmarking (once per architecture)
  Measure Dense Performance(r,c): the performance (Mflop/s) of a dense matrix
  stored in sparse r x c blocked format

At run-time, when the matrix is known:
  Estimate Fill Ratio(r,c) for all r, c:
    Fill Ratio(r,c) = (number of stored values) / (number of true non-zeros)
  Choose the r x c that maximizes
    Estimated Performance(r,c) = Dense Performance(r,c) / Fill Ratio(r,c)

(Replaces the previous SPARSITY heuristic)
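Combining the two ingredients, the selection step reduces to one comparison per candidate block size. In this sketch, `dense_mflops` holds the off-line benchmark results and `fill` the run-time fill-ratio estimates, both keyed by (r, c); the dictionary names are assumptions of the example.

```python
def choose_block_size(dense_mflops, fill):
    """Return the (r, c) maximizing Dense Performance(r,c) / Fill Ratio(r,c)."""
    return max(dense_mflops, key=lambda rc: dense_mflops[rc] / fill[rc])
```

For example, a block size whose dense profile is 1.75x faster than 1x1 only pays off if its fill ratio stays below 1.75.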
Off-line Benchmarking [Intel Pentium III]

[Heat map: register blocking performance (Mflop/s) of a dense n=1000 matrix stored
in sparse r x c format, 1 <= r, c <= 12, on pentium3-linux-icc; range 43-107 Mflop/s.
Top 10 codes labeled by speedup over the unblocked code: 2.54, 2.49, 2.45, 2.44,
2.41, 2.39, 2.39, 2.38, 2.37, 2.36.]

Max speedup = 2.54.
[Off-line benchmarking, all platforms: four heat maps of register blocking
performance (Mflop/s) for a dense matrix in sparse r x c format, 1 <= r, c <= 12.
Each title gives the platform and its maximum speedup over the unblocked code:]

  333 MHz Sun Ultra 2i (2.03): 36-72 Mflop/s [ultra-solaris, n=1000]
  500 MHz Intel Pentium III (2.54): 43-107 Mflop/s [pentium3-linux-icc, n=1000]
  375 MHz IBM Power3 (1.22): 86-174 Mflop/s [power3-aix, n=2000]
  800 MHz Intel Itanium (1.55): 109-256 Mflop/s [itanium-linux-ecc, n=1000]
Performance Bounds for Register Blocking

How close are we to the speed limit of blocking?

Upper bound on performance: (flops) / (time)
  Flops = 2 x (number of true non-zeros)

Lower bound on time: two key assumptions
  Consider only memory operations
  Count only compulsory misses (i.e., ignore conflicts)

Other features of the model
  Account for cache line size
  Bound is a function of the matrix and the block size r x c

Model of execution time (see paper): charge the full latency a_i for hits at each
cache level i; e.g., with two cache levels,

  T = (L1 hits) a_1 + (L2 hits) a_2 + (L2 misses) a_mem
    = (loads) a_1 + (L1 misses)(a_2 - a_1) + (L2 misses)(a_mem - a_2)
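The model above is easy to evaluate numerically. The sketch below (illustrative names, assuming `lat = [a_1, ..., a_k, a_mem]` in cycles and one miss count per cache level) turns load and miss counts into a Mflop/s bound; feeding it the compulsory-miss lower bounds yields the performance upper bound.

```python
def model_cycles(loads, misses, lat):
    """Full-latency model: T = loads*a_1 + sum_i misses_i*(a_{i+1} - a_i)."""
    t = loads * lat[0]                   # every load hits at least L1
    for i, m in enumerate(misses):
        t += m * (lat[i + 1] - lat[i])   # each level-i miss pays the extra latency
    return t

def upper_bound_mflops(nnz_true, loads, misses, lat, clock_mhz):
    """Mflop/s = flops / time, with flops = 2 * (true non-zeros)."""
    return 2.0 * nnz_true * clock_mhz / model_cycles(loads, misses, lat)
```

Fewer modeled misses (e.g., from a smaller blocked data structure) shrink T and raise the bound, which is why reducing data structure size is the dominant effect.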
Overview of Performance Results

Experimental setup
  Four machines: Pentium III, Ultra 2i, Power3, Itanium
  44 matrices: dense, finite element, mixed, linear programming
  Measured misses and cycles using PAPI [BDG+00]
  Reference: CSR (i.e., 1x1 or "unblocked")

Main observations
  SPARSITY vs. reference: up to 2.5x faster, especially on FEM matrices
  Block size selection: chooses within 10% of best
  SPARSITY performance typically within 20% of the upper bound
  SPARSITY least effective on Power3
Performance Results: Intel Pentium III

[Scatter plot over 33 matrices (dense, FEM, FEM with variable blocks, mixed, LP) on
pentium3-linux-icc, 0-140 Mflop/s. Series: analytic upper bound, upper bound (PAPI),
Sparsity (exhaustive), Sparsity (heuristic), reference.]

DGEMV (n=1000): 96 Mflop/s
Performance Results: Sun Ultra 2i

[Scatter plot over 22 matrices on ultra-solaris, 0-80 Mflop/s. Series: analytic
upper bound, upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic),
reference.]

DGEMV (n=1000): 58 Mflop/s
Performance Results: Intel Itanium

[Scatter plot over 22 matrices on itanium-linux-ecc, 0-340 Mflop/s. Series: analytic
upper bound, upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic),
reference.]

DGEMV (n=1000): 310 Mflop/s
Performance Results: IBM Power3

[Scatter plot over 11 matrices on power3-aix, 0-280 Mflop/s. Series: analytic upper
bound, upper bound (PAPI), Sparsity (exhaustive), Sparsity (heuristic), reference.]

DGEMV (n=2000): 260 Mflop/s
Conclusions

Tuning can be difficult, even when the matrix structure is known:
performance is a complicated function of architecture and matrix.

The new heuristic for choosing the block size selects the optimal
implementation, or a near-optimal one (performance within 5-10%).

The limits of low-level tuning for blocking are near: performance is
often within 20% of the upper bound, particularly for FEM matrices.

Unresolved: closing the gap on the Power3.
BeBOP: Current and Future Work (1/2)

Further performance improvements
  symmetry (1.5-2x speedups)
  diagonals, block diagonals, and bands (1.2-2x)
  splitting for variable block structure (1.3-1.7x)
  reordering to create dense structure (1.7x)
  cache blocking (1.5-4x)
  multiple vectors (2-7x)
  and combinations ...

How to choose optimizations & tuning parameters?

Sparse triangular solve (ICS'02 POHLL Workshop paper)
  hybrid sparse/dense structure (1.2-1.8x)

Higher-level kernels that permit reuse
  A*A^T*x, A^T*A*x (1.5-3x)
  A*x and A^T*y simultaneously, A^k*x, R*A*R^T, ... (future work)
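To illustrate the reuse these higher-level kernels permit, y = A^T(Ax) can stream each sparse row of A once, using it both for the dot product with x and for the update of y. This is a sketch of the idea, not BeBOP's implementation:

```python
def ata_times_x(rows, x, n):
    """y = A^T (A x) for an m x n sparse A, streaming A once.

    rows is a list of sparse rows, each a list of (col, val) pairs.
    Row a_i is read from memory once and reused twice:
    first for t = a_i . x, then for y += t * a_i."""
    y = [0.0] * n
    for row in rows:
        t = sum(v * x[j] for j, v in row)   # t = a_i . x
        for j, v in row:
            y[j] += t * v                   # y += t * a_i, same row, still cached
    return y
```

Compared with two separate SpM×V calls (z = Ax, then y = A^T z), each non-zero of A is loaded once instead of twice, halving the dominant memory traffic.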
BeBOP: Current and Future Work (2/2)

An automatically tuned sparse matrix library
  Code generation via sparse compilers (Bernoulli; Bik)
  Plan to extend the new Sparse BLAS standard by one routine to support tuning

Architectural issues
  Improvements for Power3?
  Latency vs. bandwidth (see paper)
  Using models to explore the architectural design space
EXTRA SLIDES
Example: No Big Surprises on Sun Ultra 2i

[Heat map: blocking performance (Mflop/s) on 03-olafu.rua, ultra-solaris;
r, c each in {1, 2, 3, 6}; color scale 36-54 Mflop/s]

Speedup over the unblocked (1x1) code:

          c=1    c=2    c=3    c=6
  r=1    1.00   1.09   1.19   1.21
  r=2    1.17   1.25   1.30   1.30
  r=3    1.31   1.34   1.41   1.39
  r=6    1.40   1.39   1.42   1.53
Where in Memory is the Time Spent?

[Stacked bar chart: fraction of cycles spent at L1, L2, L3, and memory under the
execution time model (modeled hits/misses), for the exhaustive best implementation,
averaged over matrices, on Ultra 2i (L1/L2), Pentium III (L1/L2), Power3 (L1/L2),
and Itanium (L1/L2/L3).]
Cache Miss Bound Verification

[Seven plots comparing measured (PAPI) miss counts per matrix, in millions, against
the model's upper and lower bounds:
  Sun Ultra 2i: L1 and L2 misses [ultra-solaris]
  Intel Pentium III: L1 and L2 misses [pentium3-linux-icc]
  IBM Power3: L1 and L2 misses [power3-aix]
  Intel Itanium: L2 and L3 misses [itanium-linux-ecc]]
Latency vs. Bandwidth

[Bar chart: fraction of peak sustainable main memory bandwidth (STREAM) achieved on
Ultra 2i, Pentium III, Power3, and Itanium, with the full-latency model overlaid,
for the kernels:
  a[i] = b[i]
  a[i] = alpha*b[i]
  a[i] = b[i] + c[i]
  a[i] = b[i] + alpha*c[i]
  s += a[i]
  s += a[i]*b[i]
  s += a[i]; k += ind[i]
  s += a[i]*x[ind[i]], x in on-chip cache
  s += a[i]*x[ind[i]], x in external cache
  DGEMV
  SpM×V (dense, 1x1)
  SpM×V (dense, best)]
Related Work (2/2)

Compilers (analysis and models); run-time selection
  CROPS (UCSD/Carter, Ferrante, et al.)
  TUNE (Chatterjee, et al.)
  Iterative compilation (O'Boyle, et al., 1998)
  Broadway (Guyer and Lin, '99)
  Brewer ('95); ADAPT (Voss, 2000)
  Interfaces: Sparse BLAS; PSBLAS; PETSc

Sparse triangular solve
  SuperLU/MUMPS/SPOOLES/UMFPACK/PSPASES ...
  Approximation: Alvarado ('93); Raghavan ('98)
  Scalability: Rothberg ('92, '95); Gupta ('95); Li, Coleman ('88)
What is the Cost of Search?

[Bar chart: block size selection overhead per matrix on itanium-linux-ecc, measured
in numbers of reference SpM×V operations (0-45). Two components: running the
heuristic (bars labeled .71 to .86 reference SpM×V's) and rebuilding the data
structure.]
Where in Memory is the Time Spent? (PAPI Data)

[Stacked bar chart: fraction of cycles spent at L1, L2, L3, and memory under the
execution time model using PAPI-measured hits/misses, for the exhaustive best
implementation, averaged over matrices, on Ultra 2i (L1/L2), Pentium III (L1/L2),
Power3 (L1/L2), and Itanium (L1/L2/L3).]
Fill: Some Surprises!

Sometimes it is faster to fill in many zeros:

  Matrix   Dense     Fill    (Size BCSR)/   (Perf BCSR)/   Platform
           Speedup   ratio   (Size CSR)     (Perf CSR)
  11       1.09      1.70    1.26           1.55           Itanium
  13       1.50      1.52    1.07           2.30           Pentium 3
  17       1.04      1.59    1.23           1.54           Itanium
  27       1.16      1.94    1.47           1.54           Itanium
  27       1.10      1.53    1.25           1.23           Ultra 2i
  29       1.00      1.98    1.44           1.89           Pentium 3
References

[BACD97] J. Bilmes, K. Asanovic, C.W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997. ACM SIGARC. See http://www.icsi.berkeley.edu/~bilmes/phipac.

[BCD+01] S. Blackford, G. Corliss, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, M. Heroux, C. Hu, W. Kahan, L. Kaufman, B. Kearfott, F. Krogh, X. Li, Z. Maany, A. Petitet, R. Pozo, K. Remington, W. Walster, C. Whaley, and J. Wolff von Gudenberg. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001. www.netlib.org/blast.

[BDG+00] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.

[BW99] Aart J. C. Bik and Harry A. G. Wijshoff. Automatic nonzero structure analysis. SIAM Journal on Computing, 28(5):1576-1587, 1999.

[FC00] Salvatore Filippone and Michele Colajanni. PSBLAS: A library for parallel linear algebra computation on sparse matrices. ACM Transactions on Mathematical Software, 26(4):527-550, December 2000.

[FDZ99] Basilio B. Fraguela, Ramon Doallo, and Emilio L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), March 1999.

[FJ98] Matteo Frigo and Stephen Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.

[GGHvdG01] John A. Gunnels, Fred G. Gustavson, Greg M. Henry, and Robert A. van de Geijn. FLAME: Formal Linear Algebra Methods Environment. ACM Transactions on Mathematical Software, 27(4), December 2001.

[GKKS99] William D. Gropp, D. K. Kasushik, David E. Keyes, and Barry F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241-248, 1999.

[GR99] Roman Geus and S. Rollin. Towards a fast parallel sparse matrix-vector multiplication. In E. H. D'Hollander, J. R. Joubert, F. J. Peters, and H. Sips, editors, Proceedings of the International Conference on Parallel Computing (ParCo), pages 308-315. Imperial College Press, 1999.

[HPDR99] Dora Blanco Heras, Vicente Blanco Perez, Jose Carlos Cabaleiro Dominguez, and Francisco F. Rivera. Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study. In HPCN Europe, pages 201-210, 1999.

[Im00] Eun-Jin Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.

[MMJ00] Dragan Mirkovic, Rishad Mahasoom, and Lennart Johnsson. An adaptive software library for fast Fourier transforms. In Proceedings of the International Conference on Supercomputing, pages 215-224, Santa Fe, NM, May 2000.

[Neu98] T. Neubert. Anwendung von generativen Programmiertechniken am Beispiel der Matrixalgebra. Master's thesis, Technische Universitat Chemnitz, 1998.

[NGLPJ96] J. J. Navarro, E. Garcia, J. L. Larriba-Pey, and T. Juan. Algorithms for sparse matrix computations on high-performance workstations. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 301-308, Philadelphia, PA, USA, May 1996.

[PH99] Ali Pinar and Michael Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of Supercomputing, 1999.

[PSVM01] Markus Puschel, Bryan Singer, Manuela Veloso, and Jose M. F. Moura. Fast automatic generation of DSP algorithms. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 97-106, San Francisco, CA, May 2001. Springer.

[RP96] K. Remington and R. Pozo. NIST Sparse BLAS: User's Guide. Technical report, NIST, 1996. gams.nist.gov/spblas.

[Saa94] Yousef Saad. SPARSKIT: A basic toolkit for sparse matrix computations, 1994. www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html.

[SL98] Jeremy G. Siek and Andrew Lumsdaine. A rational approach to portable high performance: the Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) library. In Proceedings of ECOOP, 1998.

[Sto97] Paul Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes. PhD thesis, Cornell University, August 1997.

[TJ92] O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing '92, 1992.

[Tol97] Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.

[Vel98] Todd Veldhuizen. Arrays in Blitz++. In Proceedings of ISCOPE, volume 1505 of LNCS. Springer-Verlag, 1998.

[VFD01] Sathish S. Vadhiyar, Graham E. Fagg, and Jack J. Dongarra. Towards an accurate model for collective communications. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 41-50, San Francisco, CA, May 2001. Springer.

[WPD01] R. Clint Whaley, Antoine Petitet, and Jack Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1):3-25, 2001.

[WS97] James B. White and P. Sadayappan. On improving the performance of sparse matrix-vector multiplication. In Proceedings of the International Conference on High-Performance Computing, 1997.