Dense Linear Algebra
Mark Gates, [email protected]
http://www.icl.utk.edu/~mgates3/
1
CS 594 — Scientific Computing for Engineers — Spring 2017
Outline
• Legacy Software
  • BLAS
  • LINPACK
  • LAPACK
  • ScaLAPACK
• New Software
  • PLASMA using OpenMP
  • DPLASMA and PaRSEC
  • MAGMA for CUDA, OpenCL, or Xeon Phi
  • Case study: 2-stage SVD
2
Memory hierarchy
3
Examples for a Haswell system; adapted from an illustration by Ryan Leng.
• Registers (CPU): fastest, most expensive, smallest capacity (168 integer, 168 float)
• CPU cache (L1, L2, L3): fast, expensive, small capacity (32 KiB L1 per core, 256 KiB L2 per core, 1.5 MiB L3 per core)
• Random Access Memory (RAM): SDRAM, DDR, GDDR, HBM, etc.; modest speed, modest cost, modest capacity (64 GiB)
• Solid State Drive (SSD, Flash): burst buffers, non-volatile memory, files; slow, cheap, large capacity (500 GiB)
• Mechanical Hard Drive (HDD): virtual memory, files; slowest, cheapest, largest capacity (4 TiB)
Compute vs. memory speed
4
• Machine balance (# flops per memory access)
  • Flops “free,” memory expensive: good for dense, BLAS-3 operations (matrix multiply)
  • Flops & memory access balanced: good for sparse & vector operations
Data from Stream benchmark (McCalpin) and vendor information pages.
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
5
BLAS: Basic Linear Algebra Subroutines
6
• Level 1 BLAS — vector operations, e.g., y = α x + β y
  • O(n) data and flops (floating point operations)
  • Memory bound: O(1) flops per memory access
• Level 2 BLAS — matrix-vector operations, e.g., y = α A x + β y
  • O(n2) data and flops
  • Memory bound: O(1) flops per memory access
• Level 3 BLAS — matrix-matrix operations, e.g., C = α A B + β C
  • O(n2) data, O(n3) flops
  • Surface-to-volume effect
  • Compute bound: O(n) flops per memory access
BLAS: Basic Linear Algebra Subroutines
7
• Intel Sandy Bridge (2-socket E5-2670)
  • Peak 333 Gflop/s = 2.6 GHz × 16 cores × 8 double-precision flops/cycle †
  • Max memory bandwidth 51 GB/s ‡
† http://stackoverflow.com/questions/15655835/flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2
‡ http://ark.intel.com/products/64595/Intel-Xeon-Processor-E5-2670-20M-Cache-2_60-GHz-8_00-GTs-Intel-QPI
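As a rough illustration of machine balance, a sketch of the arithmetic using the peak and bandwidth quoted above (and assuming 8-byte doubles):

```latex
% Machine balance = peak flop rate / rate at which doubles arrive from memory
\text{balance} \approx \frac{333\ \text{Gflop/s}}{51\ \text{GB/s} \;/\; 8\ \text{bytes per double}}
\approx \frac{333}{6.4} \approx 52\ \text{flops per double loaded}
```

So a kernel needs roughly 50 flops per double loaded to be compute bound on this machine, which Level 3 BLAS can achieve but Level 1 and Level 2 BLAS cannot.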
Memory bandwidth
8
• Intel Ark
  • https://ark.intel.com/
  • Search for the model number (e.g., “E5-2697 v4” for Broadwell; see /proc/cpuinfo)
• Stream benchmark
  • https://www.cs.virginia.edu/stream/
  • Add -fopenmp or equivalent to the Makefile CFLAGS and FFLAGS
  • Adjust STREAM_ARRAY_SIZE depending on cache size; see instructions in code
prompt> less /proc/cpuinfo
...
model name  : Intel(R) Xeon(R) [email protected]
cache size  : 46080 KB
physical id : 0
siblings    : 18
...
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
...
BLAS naming scheme
9
Example: dgemm, C = AB + C
• One or two letter data type (precision)
  • i = integer (e.g., index)
  • s = single (float)
  • d = double
  • c = single-complex
  • z = double-complex
• Two letter matrix type (BLAS 2, 3)
  • ge = general nonsymmetric
  • sy = symmetric (A = AT)
  • he = complex Hermitian (A = AH)
  • tr = triangular (L or U)
  • Also banded and packed formats
• Two or more letter function, e.g.
  • mv = matrix-vector product
  • mm = matrix-matrix product
  • etc.
BLAS naming scheme
10
BLAS 1 examples
  sdot    result = xT y (single)
  ddot    result = xT y (double)
  cdotc   result = xH y (single-complex)
  cdotu   result = xT y (single-complex)
  zdotc   result = xH y (double-complex)
  zdotu   result = xT y (double-complex)
  _axpy   y = α x + y
  _scal   y = α y
  _copy   y = x
  _swap   x ↔ y
  _nrm2   result = ||x||2
  _asum   result = Σi |xi|
  i_amax  index = argmaxi |xi|
  _rot    apply Givens rotation: [x; y] = [c s; −s c] [x; y]
BLAS 2 examples
  _gemv   y = Ax + y, A general
  _symv   y = Ax + y, A symmetric
  _hemv   y = Ax + y, A Hermitian
  _ger    C = xyT + C, C general
  _syr    C = xxT + C, C symmetric
  _her    C = xxH + C, C Hermitian
  _trmv   x = Ax, A triangular
  _trsv   solve Ax = b, A triangular

BLAS 3 examples
  _gemm   C = AB + C, all general
  _symm   C = AB + C, A symmetric
  _hemm   C = AB + C, A Hermitian
  _syrk   C = AAT + C, C symmetric
  _herk   C = AAH + C, C Hermitian
  _trmm   X = AX, A triangular
  _trsm   solve AX = B, A triangular
11
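For concreteness, a minimal C sketch of calling dgemm through the CBLAS interface (assuming a CBLAS implementation such as OpenBLAS or MKL supplies cblas.h and is linked; matrices are column major here):

```c
#include <stdio.h>
#include <cblas.h>

int main(void) {
    // C = alpha*A*B + beta*C with square n x n matrices, column-major storage
    const int n = 3;
    double A[9] = {1,0,0, 0,1,0, 0,0,1};   // identity matrix
    double B[9] = {1,2,3, 4,5,6, 7,8,9};
    double C[9] = {0};

    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,    // alpha, A, lda
                     B, n,    //        B, ldb
                0.0, C, n);   // beta,  C, ldc

    // C should now equal B, since A is the identity
    for (int j = 0; j < n; ++j)
        printf("C(:,%d) = %g %g %g\n", j, C[0 + j*n], C[1 + j*n], C[2 + j*n]);
    return 0;
}
```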
Why is BLAS so important?
• BLAS are efficient, portable, parallel, and widely available
  • Vendors (Intel, AMD, IBM, Cray, NVIDIA, ...) provide highly optimized versions
  • Open source libraries available (ATLAS, OpenBLAS)
• Commonly used for high quality linear algebra and HPC software
• Performance of many applications depends on the underlying BLAS
12
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
13
LINPACK: LINear algebra PACKage
14
• Math software library
  • Solving dense & banded linear systems
  • Singular value decomposition (SVD)
• Written in the 1970s using Fortran 66
  • Aiming for software portability and efficiency
  • Level 1 BLAS
• Four primary contributors:
  • Jack Dongarra, Argonne
  • Jim Bunch, U. California, San Diego
  • Cleve Moler, New Mexico
  • Pete Stewart, U. of Maryland
• Superseded by LAPACK
Computing in 1974
15
• High performance computers
  • IBM 370/195
  • CDC 7600
  • Univac 1110
  • DEC PDP-10
  • Honeywell 6030
• Fortran 66
• EISPACK released in 1974
  • Eigenvalue and singular value problems
  • Translation of Algol into Fortran
  • Did not use BLAS
• Level 1 BLAS — vector operations (1979)
  • Aimed at vector supercomputer architectures
• LINPACK released in 1979
  • About the time of the Cray 1
LU Factorization in LINPACK
16
• Factor one column at a time
  • i_amax and _scal
• Update each column of the trailing matrix, one column at a time
  • _axpy
• Level 1 BLAS
• Bulk synchronous: a single main thread, with vectorized or multi-threaded parallel work inside the BLAS and a sync after each call (a “fork-and-join” model); see the sketch below
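A minimal C sketch of this column-by-column, Level 1 BLAS style of LU with partial pivoting; it mirrors the structure described above, not LINPACK's actual dgefa code:

```c
#include <math.h>

// Unblocked, right-looking LU with partial pivoting in the LINPACK style:
// each step searches for the pivot (like i_amax), scales one column
// (like _scal), and updates each remaining column with an axpy-like loop
// (like _axpy). Column-major storage; ipiv records the pivot rows.
void lu_unblocked(int n, double *A, int lda, int *ipiv)
{
    for (int k = 0; k < n; ++k) {
        // i_amax: find the row with the largest |A(i,k)| for i >= k
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (fabs(A[i + k*lda]) > fabs(A[p + k*lda]))
                p = i;
        ipiv[k] = p;

        // swap rows k and p
        for (int j = 0; j < n; ++j) {
            double t = A[k + j*lda];
            A[k + j*lda] = A[p + j*lda];
            A[p + j*lda] = t;
        }

        // _scal: A(k+1:n, k) /= A(k,k)
        for (int i = k + 1; i < n; ++i)
            A[i + k*lda] /= A[k + k*lda];

        // _axpy, one trailing column at a time:
        // A(k+1:n, j) -= A(k, j) * A(k+1:n, k)
        for (int j = k + 1; j < n; ++j)
            for (int i = k + 1; i < n; ++i)
                A[i + j*lda] -= A[k + j*lda] * A[i + k*lda];
    }
}
```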
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
17
LAPACK: Linear Algebra PACKage
18
• Reformulate linear algebra algorithms as blocked algorithms using BLAS-3
• Linear systems
  • Nearly 100% BLAS-3
• Singular value and symmetric eigenvalue problems
  • About 50% BLAS-3
• Nonsymmetric eigenvalue problems
  • About 80% BLAS-3
• 4 data types: single, double, single-complex, double-complex
LU factorization in LAPACK
19
• Factor a panel of nb columns
  • getf2, unblocked BLAS-2 code
• Level 3 BLAS update of the block row of U
  • trsm
• Level 3 BLAS update of the trailing matrix
  • gemm
• Aimed at machines with a cache hierarchy
• Bulk synchronous
A sketch of this blocked loop follows below.
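A sketch of the blocked right-looking loop described above, with the Level 3 updates done through CBLAS; pivoting and row swaps are omitted to keep the sketch short, and getf2_panel is an inline stand-in for LAPACK's unblocked panel code, not the real getf2:

```c
#include <cblas.h>

// Unblocked LU of an m x nb panel (no pivoting), standing in for getf2.
static void getf2_panel(int m, int nb, double *A, int lda)
{
    for (int k = 0; k < nb; ++k) {
        for (int i = k + 1; i < m; ++i)
            A[i + k*lda] /= A[k + k*lda];
        for (int j = k + 1; j < nb; ++j)
            for (int i = k + 1; i < m; ++i)
                A[i + j*lda] -= A[k + j*lda] * A[i + k*lda];
    }
}

// Blocked right-looking LU (no pivoting shown), column major, block size nb.
void lu_blocked(int n, int nb, double *A, int lda)
{
    for (int k = 0; k < n; k += nb) {
        int kb = (k + nb <= n) ? nb : n - k;

        // 1. Panel: factor A(k:n, k:k+kb) with unblocked BLAS-2 code
        getf2_panel(n - k, kb, &A[k + k*lda], lda);

        if (k + kb < n) {
            // 2. Block row of U: solve L11 * U12 = A12  (trsm)
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower,
                        CblasNoTrans, CblasUnit,
                        kb, n - k - kb,
                        1.0, &A[k + k*lda], lda,
                             &A[k + (k+kb)*lda], lda);

            // 3. Trailing matrix: A22 -= L21 * U12  (gemm)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n - k - kb, n - k - kb, kb,
                        -1.0, &A[(k+kb) + k*lda], lda,
                              &A[k + (k+kb)*lda], lda,
                         1.0, &A[(k+kb) + (k+kb)*lda], lda);
        }
    }
}
```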
Parallelism in LAPACK
20
• Most flops are in the gemm update
  • The 2/3 n3 term
  • Easily parallelized using multi-threaded BLAS
  • Done in any reasonable software
• Other operations are lower order
  • Potentially expensive if not parallelized
[Diagram: one step of blocked LU — getf2 panel factorization, laswp row swaps, trsm solve for the block row of U, gemm multiply to update the trailing matrix.]
LAPACK routine, solve AX = B
21
• dgesv( n, nrhs, A, lda, ipiv, B, ldb, info )
• Input: A is n × n with leading dimension lda; B is n × nrhs with leading dimension ldb; matrices are stored column-wise
• Output: L and U overwrite A (L has an implicit unit diagonal); the solution X overwrites B; ipiv holds the n pivot indices
• info (error code)
  = 0: no error
  < 0: invalid argument
  > 0: numerical error (e.g., singular)
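A small C example of calling this driver through the LAPACKE C interface (assuming lapacke.h and an LAPACK implementation are available), solving a 3 × 3 system with one right-hand side:

```c
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    // Solve A x = b, column-major storage.
    lapack_int n = 3, nrhs = 1, lda = 3, ldb = 3, ipiv[3];
    double A[9] = { 4, 1, 2,    // column 1
                    1, 5, 1,    // column 2
                    2, 1, 6 };  // column 3
    double B[3] = { 7, 7, 9 };  // right-hand side; overwritten by the solution

    lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs,
                                    A, lda, ipiv, B, ldb);
    if (info != 0) {
        printf("dgesv failed, info = %d\n", (int) info);
        return 1;
    }
    printf("x = %g %g %g\n", B[0], B[1], B[2]);   // solution overwrote B
    return 0;
}
```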
LAPACK naming
22
• Similar to BLAS naming
  • Data type + matrix type + function
• Additional matrix types:
  • po = real symmetric / complex Hermitian positive definite (SPD / HPD, all λ > 0)
  • or = orthogonal (AAT = I)
  • un = unitary (AAH = I)
  • Also banded, packed, etc. formats
LAPACK routines
23
• ~480 routines × 4 data types (s, d, c, z)
• Driver routines: solve an entire problem
  • sv  Solve linear system: Ax = b
    • gesv  general non-symmetric (LU)
    • posv  symmetric positive definite (Cholesky)
    • sysv  symmetric indefinite (LDLT)
    • Also packed, banded, tridiagonal storage
  • ls  Linear least squares: Ax ≅ b
    • gglse  linear equality-constrained least squares
    • ggglm  general Gauss-Markov linear model
  • ev  Eigenvalue decomposition: Ax = λx and Ax = λMx
    • syevd  symmetric and symmetric generalized
    • geev  non-symmetric and non-symmetric generalized
  • svd  Singular value decomposition: A = UΣVH
    • gesvd  standard and generalized
    • gesdd  D&C (faster)
http://www.icl.utk.edu/~mgates3/docs/
http://www.icl.utk.edu/~mgates3/docs/lapack.html
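As a usage sketch for one of these drivers, a small C example computing singular values with the divide & conquer SVD driver via LAPACKE (assuming LAPACKE is available; error handling kept minimal):

```c
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    // Singular values of a 4 x 3 matrix using the D&C driver gesdd.
    lapack_int m = 4, n = 3, lda = 4, ldu = 4, ldvt = 3;
    double A[12] = { 1, 2, 3, 4,      // column 1
                     2, 4, 6, 8,      // column 2
                     1, 0, 1, 0 };    // column 3
    double S[3], U[12], VT[9];        // 'S' = economy-size U and V^T

    lapack_int info = LAPACKE_dgesdd(LAPACK_COL_MAJOR, 'S', m, n,
                                     A, lda, S, U, ldu, VT, ldvt);
    if (info != 0) {
        printf("dgesdd failed, info = %d\n", (int) info);
        return 1;
    }
    printf("singular values: %g %g %g\n", S[0], S[1], S[2]);
    return 0;
}
```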
LAPACK routines
24
• Computational routines: one step of a problem
  • trf  triangular factorization (LU, Cholesky, LDLT)
  • trs  triangular solve
  • qrf  orthogonal QR factorization
  • mqr  multiply by Q
  • gqr  generate Q
  • etc.
• Auxiliary routines, begin with “la”
  • lan__  matrix norm (one, inf, Frobenius, max); see SVD for 2-norm
  • lascl  scale matrix
  • lacpy  copy matrix
  • laset  set matrix & diagonal to constants
  • etc.
http://www.icl.utk.edu/~mgates3/docs/
http://www.icl.utk.edu/~mgates3/docs/lapack.html
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
25
ScaLAPACK: Scalable Linear Algebra PACKage
26
• Distributed memory
  • Message passing
  • Clusters of SMPs
  • Supercomputers
• Dense linear algebra
• Modules
  • PBLAS: Parallel BLAS
  • BLACS: Basic Linear Algebra Communication Subprograms
PBLAS
27
• Similar to BLAS in functionality and naming
• Built on BLAS and BLACS
• Provides a global view of the matrix
  • LAPACK:    dge___( m, n, A(ia, ja), lda, ... )  — submatrix offsets are implicit in the pointer
  • ScaLAPACK: pdge___( m, n, A, ia, ja, descA, ... )  — pass submatrix offsets and a matrix descriptor
ScaLAPACK structure
28
[Diagram: ScaLAPACK software stack. ScaLAPACK sits on top of PBLAS and LAPACK; PBLAS sits on top of BLAS and BLACS; BLACS sits on top of MPI. The layers are labeled global vs. local addressing and platform independent vs. platform specific.]
ScaLAPACK routine, solve AX = B
29
• LAPACK:    dgesv( n, nrhs, A, lda, ipiv, B, ldb, info )
• ScaLAPACK: pdgesv( n, nrhs, A, ia, ja, descA, ipiv, B, ib, jb, descB, info )
• Arguments use the global matrix point of view: A is the global n × n matrix and B holds the n × nrhs right-hand sides, both stored in mb × nb blocks in the 2D block-cyclic distribution; descA and descB are their descriptors, and (ia, ja), (ib, jb) are global submatrix offsets
• Output: L and U overwrite A (L has an implicit unit diagonal); the solution X overwrites B; ipiv holds the pivot indices, distributed by block row
• info (error code)
  = 0: no error
  < 0: invalid argument
  > 0: numerical error (e.g., singular)
2D block-cyclic layout
30
m × n matrix on a p × q process grid
• The matrix is partitioned into mb × nb blocks
• Block rows are dealt out cyclically over the p process rows, and block columns cyclically over the q process columns
• Global matrix view: consecutive blocks belong to consecutive processes, wrapping around in both dimensions
• Local process point of view: each process owns every p-th block row and every q-th block column, stored together as one contiguous local matrix
[Figure: a grid of blocks labeled by global block row and column, shown in the global view and rearranged into the local view of each of the 2 × 3 processes, Process 1,1 through Process 2,3.]
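As an illustration of the mapping, a sketch using 0-based indices (ScaLAPACK's own indexing is 1-based and also allows a process-grid offset): the owning process and local block index of a global block can be computed as

```c
#include <stdio.h>

// For a 2D block-cyclic distribution over a p x q process grid:
// global block (bi, bj) (0-based) is owned by process (bi % p, bj % q)
// and is stored locally as block (bi / p, bj / q) on that process.
typedef struct { int prow, pcol, li, lj; } BlockOwner;

BlockOwner block_cyclic_owner(int bi, int bj, int p, int q)
{
    BlockOwner o = { bi % p, bj % q, bi / p, bj / q };
    return o;
}

int main(void) {
    int p = 2, q = 3;                                   // process grid as in the figure
    BlockOwner o = block_cyclic_owner(4, 6, p, q);      // global block row 4, column 6
    printf("block (4,6) -> process (%d,%d), local block (%d,%d)\n",
           o.prow, o.pcol, o.li, o.lj);
    return 0;
}
```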
Why 2D block cyclic?
39
• Why not a simple 2D distribution (one large contiguous block per process)?
  • As a factorization proceeds, the active trailing matrix shrinks toward the bottom-right corner, so processes that own only the top and left parts of the matrix run out of work
[Figure: the same m × n matrix distributed as one contiguous block per process on the 2 × 3 process grid, global and local views.]

Why 2D block cyclic?
40
• Better load balancing!
  • With the cyclic distribution, every process owns blocks spread over the whole matrix, so work stays balanced as the factorization proceeds
[Figure: the same matrix in the 2D block-cyclic distribution, global and local views.]
ScaLAPACK routines
41
• ~140–160 routines × 4 data types (s, d, c, z)
• Driver routines: solve an entire problem
  • sv  Solve linear system: Ax = b
    • gesv  general non-symmetric (LU)
    • posv  symmetric positive definite (Cholesky)
    • sysv  symmetric indefinite (LDLT)
    • Also packed, banded, tridiagonal storage
  • ls  Linear least squares: Ax ≅ b
    • gglse  linear equality-constrained least squares
    • ggglm  general Gauss-Markov linear model
  • ev  Eigenvalue decomposition: Ax = λx and Ax = λMx
    • syevd  symmetric and symmetric generalized
    • geev  non-symmetric (only gehrd Hessenberg reduction, no geev)
  • svd  Singular value decomposition: A = UΣVH
    • gesvd  standard and generalized (also not optimized for tall matrices)
    • gesdd  D&C (faster)
Parallelism in ScaLAPACK
42
• Similar to LAPACK: bulk-synchronous
• Most flops in the gemm update
  • The 2/3 n3 term
• Can use sequential BLAS: p × q = # cores = # MPI processes, num_threads = 1
• Or multi-threaded BLAS: p × q = # nodes = # MPI processes, num_threads = # cores/node
[Diagram: one step of blocked LU — getf2 panel, laswp row swaps, trsm solve, gemm multiply.]
Legacy software libraries
43
• LINPACK (70s): vector operations, Level 1 BLAS
• LAPACK (80s): block operations, Level 3 BLAS
• ScaLAPACK (90s): 2D block-cyclic distribution, built on PBLAS, BLACS, MPI
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
44
Emerging software solutions
45
• PLASMA
  • Tile layout & algorithms
  • Dynamic scheduling — OpenMP 4
• DPLASMA — PaRSEC
  • Distributed, with accelerators
  • Tile layout & algorithms
  • Dynamic scheduling — parameterized task graph
• MAGMA
  • Hybrid multicore + accelerator (GPU, Xeon Phi)
  • Block algorithms (LAPACK style)
  • Standard layout
  • Static scheduling
Timeline: 2007, 2009, 2011
Tile matrix layout
46
• Tiled layout
  • Each tile is contiguous (column major)
  • Enables dataflow scheduling
  • Cache and TLB efficient (reduces conflict misses and false sharing)
  • MPI messaging efficiency (zero-copy communication)
  • In-place, parallel layout translation
[Figure: LAPACK column-major layout (m × n, leading dimension lda) vs. the (D)PLASMA tile layout of nb × nb tiles.]
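A sketch of how an element (i, j) is located in such a layout, assuming square nb × nb tiles stored contiguously in column-major tile order (one common convention; PLASMA's actual matrix descriptor also handles partial edge tiles and other orderings):

```c
#include <stddef.h>

// Address of element (i, j) (0-based) of an m x n matrix in a tile layout:
// nb x nb tiles, tiles laid out in column-major order of tiles, and each
// tile itself stored contiguously in column major.
// Assumes m and n are multiples of nb for simplicity.
static inline double *tile_elem(double *A, int m, int n, int nb, int i, int j)
{
    int ti = i / nb, tj = j / nb;      // which tile
    int ii = i % nb, jj = j % nb;      // position within the tile
    int mt = m / nb;                   // number of tile rows
    double *tile = A + (size_t)(tj * mt + ti) * nb * nb;   // start of the tile
    return tile + jj * nb + ii;        // column major within the tile
}
```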
Tile algorithms: Cholesky
47
LAPACK algorithm (right looking), one step:
• Factor the diagonal block: chol( )
• One trsm solve for the entire block column below it (blocks 1, 2, 3 at once)
• One syrk update of the entire trailing matrix, subtracting [1 2 3][1 2 3]T
Tile algorithm, one step:
• Factor the diagonal tile: chol( )
• One trsm per tile below the diagonal: tiles 1, 2, 3
• One syrk per diagonal tile of the trailing matrix: subtract 1·1T, 2·2T, 3·3T
• One gemm per off-diagonal tile of the trailing matrix: subtract 2·1T, 3·1T, 3·2T
• Each operation touches only individual tiles, so many small tasks can run concurrently
Track dependencies — directed acyclic graph (DAG)
48
[Figure: the task DAG of the factorization shown as (1) a classical fork-join schedule with artificial synchronizations and (2) the same DAG reordered for 3 cores, without synchronizations.]
Execution trace
49
• LAPACK-style fork-join leaves cores idle
  • 24 cores; matrix is 8000 x 8000, tile size is 400 x 400; total time ≈ 823
[Trace: potrf, trsm, syrk, gemm tasks and idle time per core over time; progress serializes at the panels.]

Execution trace
50
• PLASMA squeezes out idle time
  • Same problem and machine; total time ≈ 702
[Trace: the same tasks dynamically scheduled, with far less idle time.]
Superscalar scheduling (QUARK multithreading)
• Serial code composed of side-effect-free tasks
• Dependency resolution: the runtime resolves data hazards (read after write, write after read, write after write)
• Similar approaches: StarPU from Inria, SMPSs from Barcelona, SuperGlue and DuctTEiP from Uppsala, Jade from Stanford (historical), OpenMP
Dataflow scheduling
51
• Exploit parallelism
• Load balance
• Remove artificial synchronization between steps
• Maximize data locality
• Parallel correctness inherited from the serial tile algorithm
  • The runtime automatically resolves data hazards (read after write, write after read, write after write)
Dynamic scheduling runtimes
52
• Jade  Stanford University
• SMPSs / OMPSs  Barcelona Supercomputing Center
• StarPU  INRIA Bordeaux
• QUARK  University of Tennessee
• SuperGlue & DuctTEiP  Uppsala University
• OpenMP 4
  • May 2008: OpenMP 3.0 adds #pragma omp task (in GCC 4.4, April 2009)
  • July 2013: OpenMP 4.0 adds #pragma omp task depend (in GCC 4.9, April 2014)
PLASMA architecture
53
Cholesky: pseudo-code
54

// Sequential Tile Cholesky
for k = 1 .. ntiles {
    potrf( Akk )
    for i = k+1 .. ntiles {
        trsm( Akk, Aik )
    }
    for i = k+1 .. ntiles {
        syrk( Aik, Aii )
        for j = i+1 .. ntiles {
            gemm( Ajk, Aik, Aij )
        }
    }
}

// PLASMA OpenMP Tile Cholesky
for k = 1 .. ntiles {
    omp task potrf( ... )
    for i = k+1 .. ntiles {
        omp task trsm( ... )
    }
    for i = k+1 .. ntiles {
        omp task syrk( ... )
        for j = i+1 .. ntiles {
            omp task gemm( ... )
        }
    }
}
Cholesky: OpenMP
55

#pragma omp parallel
#pragma omp master
{
    for (k = 0; k < nt; k++) {
        // factor diagonal tile
        #pragma omp task depend(inout:A(k,k)[0:nb*nb])
        info = LAPACKE_dpotrf_work(
            LAPACK_COL_MAJOR, lapack_const(PlasmaLower),
            nb, A(k,k), nb);

        // triangular solves for the tiles below the diagonal
        for (m = k+1; m < nt; m++) {
            #pragma omp task depend(in:A(k,k)[0:nb*nb]) \
                             depend(inout:A(m,k)[0:nb*nb])
            cblas_dtrsm(
                CblasColMajor, CblasRight, CblasLower,
                CblasTrans, CblasNonUnit,
                nb, nb,
                1.0, A(k,k), nb,
                     A(m,k), nb);
        }

        // trailing matrix update: syrk on diagonal tiles, gemm elsewhere
        for (m = k+1; m < nt; m++) {
            #pragma omp task depend(in:A(m,k)[0:nb*nb]) \
                             depend(inout:A(m,m)[0:nb*nb])
            cblas_dsyrk(
                CblasColMajor, CblasLower, CblasNoTrans,
                nb, nb,
                -1.0, A(m,k), nb,
                 1.0, A(m,m), nb);

            for (n = k+1; n < m; n++) {
                #pragma omp task depend(in:A(m,k)[0:nb*nb]) \
                                 depend(in:A(n,k)[0:nb*nb]) \
                                 depend(inout:A(m,n)[0:nb*nb])
                cblas_dgemm(
                    CblasColMajor, CblasNoTrans, CblasTrans,
                    nb, nb, nb,
                    -1.0, A(m,k), nb,
                          A(n,k), nb,
                     1.0, A(m,n), nb);
            }
        }
    }
}
Cholesky performance
56
[Plot: double precision Cholesky factorization, Gflop/s vs. matrix size up to 20000, on Intel Xeon E5-2650 v3 (Haswell), 2.3 GHz, 20 cores, comparing the OpenMP, MKL, and QUARK versions.]
Merging DAGs
57
• Cholesky-based matrix inversion has three sweeps: Cholesky A = LLT, invert L-1, multiply A-1 = L-T L-1
• 48 cores; matrix is 4000 x 4000, tile size is 200 x 200
• Scheduling the three sweeps as one merged DAG lets their tasks interleave, removing the synchronization between sweeps and shortening the total execution time
[Traces: the three sweeps executed one after another vs. merged into a single DAG.]
NOTE: Please do not explicitly invert matrices to compute X = A-1 B. Solve using gesv, posv, or sysv; it is faster and more accurate than inverting and multiplying.
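Following that note, a small C sketch using the Cholesky driver instead of forming an explicit inverse (assuming LAPACKE is available and A is symmetric positive definite):

```c
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    // Solve A X = B with A symmetric positive definite, instead of computing A^{-1} B.
    lapack_int n = 3, nrhs = 1, lda = 3, ldb = 3;
    double A[9] = { 4, 1, 1,
                    1, 3, 0,
                    1, 0, 2 };      // column-major SPD matrix (lower part referenced)
    double B[3] = { 6, 4, 3 };      // right-hand side; overwritten by the solution

    lapack_int info = LAPACKE_dposv(LAPACK_COL_MAJOR, 'L', n, nrhs,
                                    A, lda, B, ldb);
    if (info != 0) {
        printf("dposv failed, info = %d\n", (int) info);
        return 1;
    }
    printf("x = %g %g %g\n", B[0], B[1], B[2]);
    return 0;
}
```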
Cholesky inversion performance
58
[Plot: double precision Cholesky inversion, Gflop/s vs. matrix size up to 20000, on Intel Xeon E5-2650 v3 (Haswell), 2.3 GHz, 20 cores, comparing the OpenMP, MKL, and QUARK versions.]
LU performance
59
• Multi-threaded panel factorization
• Factorization reaches peak performance quickly
[Plot 1: LU panel factorization, Gflop/s vs. panel height (panel width = 256), with 2, 4, 8, and 16 threads, Intel Sandy Bridge (32 cores).]
[Plot 2: LU factorization, Gflop/s vs. matrix size up to 20000, comparing LAPACK, MKL, and PLASMA, Intel Sandy Bridge (32 cores).]
S. Donfack, J. Dongarra, M. Faverge, M. Gates, J. Kurzak, P. Luszczek, I. Yamazaki, “A survey of recent developments in parallel implementations of Gaussian elimination,” Concurrency and Computation: Practice and Experience, 27(5):1292–1309, 2015. DOI: 10.1002/cpe.3110
PLASMA QR
60
• Tile QR
  • Great for square matrices
  • Great for multicore
  • Pairwise reductions: the “domino”
• TSQR / CAQR
  • Tall-skinny QR; communication-avoiding QR
  • Tree of pairwise reductions
  • Great for tall-skinny matrices (least squares)
  • Great for distributed memory
  • But triangle-triangle (TT) pairs are less efficient than square-square (SS) or triangle-square (TS) pairs
TSQR performance
61
• Fixed number of columns; rows increase from square (left) to tall-skinny (right)
[Plot: performance (Gflop/s) vs. M with N = 4,480, comparing TSQR and ScaLAPACK (MKL); 60 nodes, 8 cores/node, Intel Nehalem Xeon E5520, 2.27 GHz; theoretical peak 4358.4 Gflop/s.]
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
62
Distributed dataflow execution
63
• Dataflow execution extends to distributed memory
• A dependency between tasks on the same node is a local dependency; a dependency between tasks on different nodes becomes communication
64
DPLASMA architecture
65
[Diagram: DPLASMA software stack.]
Parameterized task graph (PTG)
66
• Symbolic DAG representation
  • Problem size independent
  • Completely distributed
  • The DAG has O(n3) tasks; the PTG avoids explicitly creating it
• Runtime (PaRSEC)
  • Data-driven execution
  • Locality-aware scheduling
  • Communication overlap
From serial code to PTG
67
Serial code (tile QR):

    for k = 0 to N-1
        DGEQRT( inout A[k][k] )
        for n = k+1 to N
            DORMQR( in A[k][k], inout A[k][n] )
        for m = k+1 to N
            DTSQRT( inout A[k][k], inout A[m][k] )
            for n = k+1 to N
                DTSMQR( in A[m][k], inout A[k][n], inout A[m][n] )

PTG representation: for each task type (DGEQRT, DORMQR, DTSQRT, DTSMQR), the PTG lists symbolically, as a function of the indices (m, n, k), where each argument comes from (another task or the original matrix tile) and where its output flows to. For example, DGEQRT(k) reads A[k][k] (or the output of DTSMQR(k,k,k-1)) and sends its result to DORMQR(k, k+1..N, k), to DTSQRT(k+1, k, k), and back to A[k][k].
QR performance
68
• Fixed number of rows; columns increase from tall-skinny (left) to square (right)
[Plot: DPLASMA distributed (multicore) QR performance.]
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
69
Challenges of GPUs
• High levels of parallelism
• Expensive branches (if-then-else)
• Computation vs. communication gap is growing
  • NVIDIA Pascal P100 has 4670 Gflop/s and 732 GB/s memory bandwidth, but only 16 GB/s PCIe (80 GB/s NVLink)
• Use a hybrid approach
  • Small, non-parallelizable tasks on the CPU
  • Large, parallel tasks on the GPU
  • Overlap communication & computation
[Figure 1-2 from the NVIDIA CUDA C Programming Guide: compared with a CPU, the GPU devotes more transistors to data processing (ALUs) and fewer to caches and flow control; CPU and GPU each have their own DRAM, connected over PCIe.]
Figure from NVIDIA CUDA C Programming Guide
70
One-sided factorization
71
• LU, Cholesky, QR factorizations for solving linear systems
• Panel factorization: Level 2 BLAS, run on the CPU
• Trailing matrix update (A = QA): Level 3 BLAS, run on the GPU
• Lookahead: the columns of the next panel are updated first, so the CPU can factor the next panel while the GPU finishes updating the rest of the trailing matrix; this keeps the critical path (the chain of panels) moving
[Diagram: panel, lookahead, and trailing matrix regions, and the corresponding DAG with the critical path marked.]
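For using such a hybrid factorization in practice, a minimal sketch of calling MAGMA's CPU-interface LU driver (assuming a MAGMA installation; the header name and the exact getrf variant may differ between MAGMA releases):

```c
#include <stdio.h>
#include <stdlib.h>
#include <magma_v2.h>   // header name assumed; older releases use magma.h

int main(void) {
    magma_init();

    magma_int_t n = 1000, lda = n, info = 0;
    magma_int_t *ipiv = malloc(n * sizeof(magma_int_t));
    double *A;
    // Pinned host memory lets MAGMA overlap CPU <-> GPU transfers.
    magma_dmalloc_pinned(&A, (size_t) lda * n);

    // Fill A (column major) with a diagonally dominant test matrix.
    for (magma_int_t j = 0; j < n; ++j)
        for (magma_int_t i = 0; i < n; ++i)
            A[i + j*lda] = (i == j) ? n : 1.0;

    // Hybrid CPU + GPU LU factorization: panels on the CPU, updates on the GPU.
    magma_dgetrf(n, n, A, lda, ipiv, &info);
    printf("magma_dgetrf info = %lld\n", (long long) info);

    magma_free_pinned(A);
    free(ipiv);
    magma_finalize();
    return 0;
}
```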
72
[Performance plot: one-sided factorizations on one node of the Keeneland system, 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB).]
73
Two-sided factorization
74
• Hessenberg, tridiagonal factorizations (A = QTAQ) for eigenvalue problems
• In the panel, each column ai requires a matrix-vector product with the trailing matrix, yi = A vi: Level 2 BLAS, run on the GPU
• The remaining small panel work: Level 2 BLAS on the CPU
• Trailing matrix update: Level 3 BLAS on the GPU
[Plot: MAGMA Hessenberg reduction in double precision, Gflop/s vs. matrix size, on one node of the Keeneland system, 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB).]
75
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
76
Computing the SVD in 3 phases
77
1. Reduction to bidiagonal form: A = U1 B V1T
2. Bidiagonal SVD: B = U2 Σ V2T
3. Back transform singular vectors: U = U1 U2, V = V1 V2

Computing the SVD in 3 phases (two-stage reduction)
78
1. Reduction to bidiagonal form, in two stages:
   1a. Full to band: A = Ua Aband VaT
   1b. Band to bidiagonal: Aband = Ub B VbT
2. Bidiagonal SVD: B = U2 Σ V2T
3. Back transform singular vectors: U = Ua Ub U2, V = Va Vb V2
One stage reduction
79
• For i = 1 to n by nb
  • Eliminate cols i : i + nb and rows i : i + nb
  • Update the trailing matrix using block Householder transformations (BLAS 3 matrix multiplies)
• 4/3 n3 flops in BLAS 2 gemv
• 4/3 n3 flops in BLAS 3 gemm
• Performance limited to twice the gemv speed
Two stage reduction
80
• 1st stage: full to band
  • QR factorize a panel, update the trailing matrix
  • LQ factorize a panel, update the trailing matrix
• 8/3 n3 flops in BLAS 3 gemm
• More efficient with large bandwidth nb
Two stage reduction
81
• 2nd stage: band to bidiagonal
  • Cache friendly kernels
  • Pipeline multiple sweeps in parallel
• O(nb n2) flops in BLAS 2
• More efficient with small bandwidth nb
• Tune for the optimal nb
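Summarizing the trade-off from the last few slides as a back-of-the-envelope comparison (a sketch, with constants taken from the slides above):

```latex
% One-stage bidiagonal reduction: half the flops are memory-bound gemv
\text{one-stage: } \tfrac{4}{3}n^3 \text{ (BLAS 2 gemv)} + \tfrac{4}{3}n^3 \text{ (BLAS 3 gemm)}
  \;\Rightarrow\; \text{limited to about } 2\times \text{ the gemv speed}

% Two-stage reduction: almost all flops become BLAS 3
\text{two-stage: } \tfrac{8}{3}n^3 \text{ (BLAS 3, full}\to\text{band)}
  + O(n_b\, n^2) \text{ (BLAS 2, band}\to\text{bidiagonal)}
```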
Accelerated phases
82
[Plot: time for each SVD phase on an NVIDIA Pascal P100 vs. an Intel Haswell 2 x 10 core, 2.3 GHz system; speedups of roughly 8x and 16x; lower is better.]
Overall performance
83
[Plots: SVD performance, NVIDIA Pascal P100 vs. Intel Haswell 2 x 10 core 2.3 GHz; (a) singular values only (no vectors), using 8/3 n3 as the operation count in all cases; (b) with singular vectors, using 9n3 as the operation count in all cases; higher is better.]
[Profile of the SVD implementations showing each phase, for n = 20000: with QR iteration, U1 and V1 are first generated explicitly and then multiplied by U2 and V2 during QR iteration, whereas D&C explicitly generates U2 and V2 and subsequently multiplies them by the implicit U1 and V1 in the back transformation.]
Questions?
84