
Dense Linear Algebra

Mark Gates
[email protected]
http://www.icl.utk.edu/~mgates3/

1

CS 594 — Scientific Computing for Engineers — Spring 2017

Outline
• Legacy Software
  • BLAS
  • LINPACK
  • LAPACK
  • ScaLAPACK
• New Software
  • PLASMA using OpenMP
  • DPLASMA and PaRSEC
  • MAGMA for CUDA, OpenCL, or Xeon Phi
  • Case study: 2-stage SVD

2

Memory hierarchy

[Pyramid diagram, adapted from an illustration by Ryan Leng; Haswell examples in parentheses.]
• CPU registers — fastest, most expensive, smallest capacity (168 integer + 168 float registers)
• CPU cache: L1, L2, L3 — fast, expensive, small capacity (32 KiB L1 per core, 256 KiB L2 per core, 1.5 MiB L3 per core)
• Random Access Memory (RAM): SDRAM, DDR, GDDR, HBM, etc. — modest speed, modest cost, modest capacity (64 GiB)
• Solid State Drive (SSD, Flash): burst buffers, non-volatile memory, files — slow, cheap, large capacity (500 GiB)
• Mechanical Hard Drive (HDD): virtual memory, files — slowest, cheapest, largest capacity (4 TiB)

3

Compute vs. memory speed
• Machine balance (# flops per memory access)
• When flops are "free" and memory is expensive: good for dense, BLAS-3 operations (matrix multiply)
• When flops and memory accesses are balanced: good for sparse & vector operations

[Chart of machine balance; data from the Stream benchmark (McCalpin) and vendor information pages.]

4


BLAS: Basic Linear Algebra Subroutines
• Level 1 BLAS — vector operations, e.g., y = αx + βy
  • O(n) data and flops (floating point operations)
  • Memory bound: O(1) flops per memory access
• Level 2 BLAS — matrix-vector operations, e.g., y = αAx + βy
  • O(n²) data and flops
  • Memory bound: O(1) flops per memory access
• Level 3 BLAS — matrix-matrix operations, e.g., C = αAB + βC
  • O(n²) data, O(n³) flops
  • Surface-to-volume effect
  • Compute bound: O(n) flops per memory access
  • (See the example calls below.)

6
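For concreteness, a minimal sketch of one call at each BLAS level through the CBLAS interface, assuming a CBLAS implementation (e.g., OpenBLAS or MKL) provides cblas.h; the routine names follow the naming scheme on the next slides.

#include <cblas.h>

void blas_levels_demo(int n, double alpha, double beta,
                      double *A, double *B, double *C,
                      double *x, double *y)
{
    // Level 1: y = alpha*x + y          (O(n) flops, O(n) data)
    cblas_daxpy(n, alpha, x, 1, y, 1);

    // Level 2: y = alpha*A*x + beta*y   (O(n^2) flops, O(n^2) data)
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                alpha, A, n, x, 1, beta, y, 1);

    // Level 3: C = alpha*A*B + beta*C   (O(n^3) flops, O(n^2) data)
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                alpha, A, n, B, n, beta, C, n);
}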

BLAS: Basic Linear Algebra Subroutines
• Intel Sandy Bridge (2-socket E5-2670)
  • Peak 333 Gflop/s = 2.6 GHz × 16 cores × 8 double-precision flops/cycle †
  • Max memory bandwidth 51 GB/s ‡

7

† http://stackoverflow.com/questions/15655835/flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2
‡ http://ark.intel.com/products/64595/Intel-Xeon-Processor-E5-2670-20M-Cache-2_60-GHz-8_00-GTs-Intel-QPI
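As a rough back-of-the-envelope check with these numbers: 51 GB/s ÷ 8 bytes per double ≈ 6.4 × 10⁹ doubles/s, so the machine balance is about 333 Gflop/s ÷ 6.4 Gdoubles/s ≈ 52 flops per double moved from memory. A kernel therefore needs on the order of 50 flops per element of memory traffic to be compute bound rather than memory bound on this machine.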

Memory bandwidth
• Intel Ark
  • https://ark.intel.com/
  • Search for the model number (e.g., "E5-2697 v4" for Broadwell; see /proc/cpuinfo)
• Stream benchmark
  • https://www.cs.virginia.edu/stream/
  • Add -fopenmp or equivalent to the Makefile CFLAGS and FFLAGS
  • Adjust STREAM_ARRAY_SIZE depending on cache size; see instructions in the code

8

prompt> less /proc/cpuinfo
...
model name : Intel(R) Xeon(R) [email protected]
cache size : 46080 KB
physical id : 0
siblings : 18
...
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
...

BLAS naming scheme

Example: dgemm computes C = AB + C

• One or two letter data type (precision)
  • i = integer (e.g., index)
  • s = single (float)
  • d = double
  • c = single-complex
  • z = double-complex
• Two letter matrix type (BLAS 2, 3)
  • ge = general nonsymmetric
  • sy = symmetric (A = A^T)
  • he = Hermitian (complex, A = A^H)
  • tr = triangular (L or U)
  • Also banded and packed formats
• Two or more letter function, e.g.
  • mv = matrix-vector product
  • mm = matrix-matrix product
  • etc.

9

BLAS 1 examples
  sdot    result = x^T y  (single)
  ddot    result = x^T y  (double)
  cdotc   result = x^H y  (single-complex)
  cdotu   result = x^T y  (single-complex)
  zdotc   result = x^H y  (double-complex)
  zdotu   result = x^T y  (double-complex)
  _axpy   y = αx + y
  _scal   y = αy
  _copy   y = x
  _swap   x ↔ y
  _nrm2   result = ‖x‖₂
  _asum   result = Σᵢ |xᵢ|
  i_amax  index = argmaxᵢ |xᵢ|
  _rot    apply Givens rotation: [x; y] := [c, s; −s, c] [x; y]

10

BLAS 2 examples
  _gemv  y = Ax + y,  A general
  _symv  y = Ax + y,  A symmetric
  _hemv  y = Ax + y,  A Hermitian
  _ger   C = xy^T + C,  C general
  _syr   C = xx^T + C,  C symmetric
  _her   C = xx^H + C,  C Hermitian
  _trmv  x = Ax,  A triangular
  _trsv  solve Ax = b,  A triangular

BLAS 3 examples
  _gemm  C = AB + C,  all general
  _symm  C = AB + C,  A symmetric
  _hemm  C = AB + C,  A Hermitian
  _syrk  C = AA^T + C,  C symmetric
  _herk  C = AA^H + C,  C Hermitian
  _trmm  X = AX,  A triangular
  _trsm  solve AX = B,  A triangular

11

Why is BLAS so important?
• BLAS are efficient, portable, parallel, and widely available
  • Vendors (Intel, AMD, IBM, Cray, NVIDIA, ...) provide highly optimized versions
  • Open-source libraries available (ATLAS, OpenBLAS)

• Commonly used for high quality linear algebra and HPC software

• Performance of many applications depends on underlying BLAS

12


LINPACK — LINear algebra PACKage
• Math software library
  • Solving dense & banded linear systems
  • Singular value decomposition (SVD)
• Written in the 1970s using Fortran 66
  • Aiming for software portability and efficiency
  • Level 1 BLAS
  • Four primary contributors:
    • Jack Dongarra, Argonne
    • Jim Bunch, U. California, San Diego
    • Cleve Moler, New Mexico
    • Pete Stewart, U. of Maryland
• Superseded by LAPACK

14

Computing in 1974
• High performance computers
  • IBM 370/195
  • CDC 7600
  • Univac 1110
  • DEC PDP-10
  • Honeywell 6030
• Fortran 66
• EISPACK released in 1974
  • Eigenvalue and singular value problems
  • Translation of Algol into Fortran
  • Did not use BLAS
• Level 1 BLAS — vector operations (1979)
  • Aimed at vector supercomputer architectures
• LINPACK released in 1979
  • About the time of the Cray 1

15

LU Factorization in LINPACK
• Factor one column at a time
  • i_amax and _scal
• Update each column of the trailing matrix, one column at a time
  • _axpy
• Level 1 BLAS
• Bulk synchronous
  • Single main thread
  • Parallel work in BLAS
  • "Fork-and-join" model: the single main thread alternates with vectorized or multi-threaded BLAS calls, with a synchronization after each call
• (A code sketch of this column-oriented loop follows.)

16
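A hedged sketch of the LINPACK-style column-oriented LU described above, built only on Level 1 BLAS calls through CBLAS. It follows the structure of dgefa but is not the library code; pivot indices are kept 0-based here for brevity.

#include <cblas.h>

// A is n x n, column major, leading dimension lda. ipiv gets 0-based pivot rows.
int column_lu(int n, double *A, int lda, int *ipiv)
{
    for (int k = 0; k < n - 1; k++) {
        // i_amax: find the pivot in column k, then swap rows k and p.
        int p = k + (int) cblas_idamax(n - k, &A[k + k*lda], 1);
        ipiv[k] = p;
        if (A[p + k*lda] == 0.0)
            return k + 1;                       // exactly singular
        if (p != k)
            cblas_dswap(n, &A[k], lda, &A[p], lda);

        // _scal: scale the pivot column to form the multipliers of L.
        cblas_dscal(n - k - 1, 1.0 / A[k + k*lda], &A[k+1 + k*lda], 1);

        // _axpy: update each column of the trailing matrix, one column at a time.
        for (int j = k + 1; j < n; j++)
            cblas_daxpy(n - k - 1, -A[k + j*lda],
                        &A[k+1 + k*lda], 1, &A[k+1 + j*lda], 1);
    }
    if (n > 0)
        ipiv[n-1] = n - 1;
    return 0;
}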


LAPACK — Linear Algebra PACKage
• Reformulate linear algebra algorithms as blocked algorithms using BLAS-3
• Linear systems
  • Nearly 100% BLAS-3
• Singular value and symmetric eigenvalue problems
  • About 50% BLAS-3
• Nonsymmetric eigenvalue problems
  • About 80% BLAS-3
• 4 data types: single, double, single-complex, double-complex

18

LU factorization in LAPACK
• Factor a panel of nb columns
  • getf2, unblocked BLAS-2 code
• Level 3 BLAS update of the block row of U
  • trsm
• Level 3 BLAS update of the trailing matrix
  • gemm
  • Aimed at machines with a cache hierarchy
• Bulk synchronous
• (A sketch of this blocked loop follows.)

19
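The blocked, right-looking loop described above, sketched in C with LAPACKE and CBLAS. This is a hedged sketch of the structure of dgetrf, not the library's code; it assumes lapacke.h exposes dgetf2 and dlaswp (as Netlib's LAPACKE does) and ignores integer-overflow concerns for very large matrices.

#include <lapacke.h>
#include <cblas.h>

// Factor A = P*L*U in place. A is m x n, column major, leading dimension lda.
// ipiv receives min(m,n) global 1-based pivot rows. nb is the panel width.
lapack_int blocked_lu(lapack_int m, lapack_int n, double *A, lapack_int lda,
                      lapack_int *ipiv, lapack_int nb)
{
    lapack_int minmn = (m < n) ? m : n;
    for (lapack_int k = 0; k < minmn; k += nb) {
        lapack_int kb = (nb < minmn - k) ? nb : minmn - k;

        // Panel: unblocked BLAS-2 factorization of columns k .. k+kb-1 (getf2).
        lapack_int info = LAPACKE_dgetf2(LAPACK_COL_MAJOR, m - k, kb,
                                         &A[k + k*lda], lda, &ipiv[k]);
        if (info > 0)
            return info + k;                    // zero pivot encountered

        // getf2 returns pivots relative to the panel; make them global (1-based).
        for (lapack_int i = k; i < k + kb; i++)
            ipiv[i] += k;

        // Apply the panel's row swaps to the columns left and right of the panel.
        if (k > 0)
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, k, A, lda, k + 1, k + kb, ipiv, 1);
        if (k + kb < n)
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, n - k - kb, &A[(k + kb)*lda], lda,
                           k + 1, k + kb, ipiv, 1);

        if (k + kb < n) {
            // trsm: Level 3 BLAS update of the block row of U.
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower,
                        CblasNoTrans, CblasUnit, kb, n - k - kb,
                        1.0, &A[k + k*lda], lda, &A[k + (k + kb)*lda], lda);
            // gemm: Level 3 BLAS update of the trailing matrix.
            if (k + kb < m)
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                            m - k - kb, n - k - kb, kb,
                            -1.0, &A[(k + kb) + k*lda], lda,
                                  &A[k + (k + kb)*lda], lda,
                             1.0, &A[(k + kb) + (k + kb)*lda], lda);
        }
    }
    return 0;
}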

Parallelism in LAPACK
• Most flops are in the gemm update
  • The 2/3 n³ term
  • Easily parallelized using multi-threaded BLAS
  • Done in any reasonable software
• Other operations are lower order
  • Potentially expensive if not parallelized

[Diagram: one step of LU — getf2 panel factorization, laswp row swaps, trsm solve for the block row of U, gemm multiply to update the trailing matrix, producing L and U.]

20

LAPACK routine, solve AX = B
• dgesv( n, nrhs, A, lda, ipiv, B, ldb, info )
• Input: A (n × n, leading dimension lda), B (n × nrhs right-hand sides, leading dimension ldb); matrices are stored column-wise
• Output: L and U overwrite A (L has an implicit unit diagonal); the solution X overwrites B; ipiv holds the n pivot indices
• info (error code): = 0: no error; < 0: invalid argument; > 0: numerical error (e.g., singular)
• (Example call below.)

21
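A minimal sketch of calling dgesv through the LAPACKE C interface, assuming an LAPACKE implementation (Netlib LAPACKE, MKL, ...) is available.

#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    // Column-major storage: A is 3x3, B is one right-hand side.
    double A[9] = { 4, 2, 1,   2, 5, 3,   1, 3, 6 };   // columns of A
    double B[3] = { 1, 2, 3 };
    lapack_int ipiv[3];

    // On exit, L and U overwrite A and the solution X overwrites B.
    lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, 3, 1, A, 3, ipiv, B, 3);
    if (info != 0) {
        printf("dgesv failed, info = %d\n", (int) info);
        return 1;
    }
    printf("x = [ %g %g %g ]\n", B[0], B[1], B[2]);
    return 0;
}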

LAPACK naming
• Similar to BLAS naming
  • Data type + matrix type + function
• Additional matrix types:
  • po = real symmetric / complex Hermitian positive definite (SPD / HPD, all λ > 0)
  • or = orthogonal (AA^T = I)
  • un = unitary (AA^H = I)
  • Also banded, packed, etc. formats

22

LAPACK routines
• ~480 routines × 4 data types (s, d, c, z)
• Driver routines: solve an entire problem
  • sv — solve linear system Ax = b
    • gesv  general non-symmetric (LU)
    • posv  symmetric positive definite (Cholesky)
    • sysv  symmetric indefinite (LDL^T)
    • Also packed, banded, tridiagonal storage
  • ls — linear least squares: Ax ≅ b
    • gglse  linear equality-constrained least squares
    • ggglm  general Gauss-Markov linear model
  • ev — eigenvalue decomposition: Ax = λx and Ax = λMx
    • syevd  symmetric and symmetric generalized
    • geev   non-symmetric and non-symmetric generalized
  • svd — singular value decomposition: A = UΣV^H
    • gesvd  standard and generalized
    • gesdd  divide & conquer (faster)

23

http://www.icl.utk.edu/~mgates3/docs/
http://www.icl.utk.edu/~mgates3/docs/lapack.html

LAPACK routines
• Computational routines: one step of a problem
  • trf  triangular factorization (LU, Cholesky, LDL^T)
  • trs  triangular solve
  • qrf  orthogonal QR factorization
  • mqr  multiply by Q
  • gqr  generate Q
  • etc.
• Auxiliary routines, begin with "la"
  • lan__  matrix norm (one, inf, Frobenius, max); see SVD for 2-norm
  • lascl  scale matrix
  • lacpy  copy matrix
  • laset  set matrix & diagonal to constants
  • etc.
• (A least-squares sketch using qrf and mqr follows.)

24

http://www.icl.utk.edu/~mgates3/docs/
http://www.icl.utk.edu/~mgates3/docs/lapack.html
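To illustrate the computational routines working together, a hedged sketch of solving an overdetermined least-squares problem min ‖Ax − b‖ via QR: qrf (dgeqrf), then mqr (dormqr) to apply Q^T, then a triangular solve with R. It assumes LAPACKE and CBLAS; this is not the code of the gels driver.

#include <stdlib.h>
#include <lapacke.h>
#include <cblas.h>

// A is m x n (m >= n, full column rank), column major, leading dimension lda.
// b has length m; on return the first n entries of b contain x.
lapack_int qr_least_squares(lapack_int m, lapack_int n,
                            double *A, lapack_int lda, double *b)
{
    double *tau = malloc((size_t) n * sizeof(double));
    if (tau == NULL) return -1;

    // qrf: factor A = QR (R in the upper triangle, Q as Householder vectors + tau).
    lapack_int info = LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m, n, A, lda, tau);

    // mqr: apply Q^T to b, so b := Q^T b.
    if (info == 0)
        info = LAPACKE_dormqr(LAPACK_COL_MAJOR, 'L', 'T', m, 1, n,
                              A, lda, tau, b, m);

    // Triangular solve R x = (Q^T b)(1:n) with the BLAS.
    if (info == 0)
        cblas_dtrsv(CblasColMajor, CblasUpper, CblasNoTrans, CblasNonUnit,
                    n, A, lda, b, 1);

    free(tau);
    return info;
}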


ScaLAPACK — Scalable Linear Algebra PACKage
• Distributed memory
  • Message passing
  • Clusters of SMPs
  • Supercomputers
• Dense linear algebra
• Modules
  • PBLAS: Parallel BLAS
  • BLACS: Basic Linear Algebra Communication Subprograms

26

PBLAS
• Similar to BLAS in functionality and naming
• Built on BLAS and BLACS
• Provides a global view of the matrix
  • LAPACK:    dge___( m, n, A(ia, ja), lda, ... ) — submatrix offsets implicit in the pointer
  • ScaLAPACK: pdge___( m, n, A, ia, ja, descA, ... ) — pass submatrix offsets and a matrix descriptor

27

ScaLAPACK structure

[Software stack diagram: ScaLAPACK is built on the PBLAS and LAPACK; the PBLAS are built on the BLAS and BLACS; the BLACS are built on MPI. The upper layers use global addressing and are platform independent; the lower layers use local addressing and are platform specific.]

28

ScaLAPACK routine, solve AX = B
• LAPACK:    dgesv( n, nrhs, A, lda, ipiv, B, ldb, info )
• ScaLAPACK: pdgesv( n, nrhs, A, ia, ja, descA, ipiv, B, ib, jb, descB, info )
• Input: the n × n matrix A and the n × nrhs right-hand sides B, each distributed over the process grid in mb × nb blocks and described by descriptors descA and descB (global matrix point of view)
• Output: L and U overwrite A (L has an implicit unit diagonal); the solution X overwrites B; ipiv holds the pivot indices
• info (error code): = 0: no error; < 0: invalid argument; > 0: numerical error (e.g., singular)

29

2D block-cyclic layout
• m × n matrix distributed in blocks over a p × q process grid
• Block (i, j) of the global matrix is owned by process ( ((i−1) mod p) + 1, ((j−1) mod q) + 1 )
• Global matrix view: blocks cycle over the process grid in both dimensions
• Local process point of view: each process stores its blocks contiguously as one smaller local matrix

[Figure: a 9 × 8 grid of blocks, labeled by (block row, block column) from 11 to 98, distributed over a 2 × 3 process grid; shown both in the global matrix view and from the local point of view of each of the six processes.]

30–38

Why 2D block cyclic?
• Why not a simple 2D block distribution, where each process owns one large contiguous block of the matrix?
• In a factorization, the work concentrates on the shrinking trailing matrix, so with a plain block distribution the processes that own the already-factored top-left part sit idle

[Figure: the same m × n matrix on a 2 × 3 process grid with a simple (non-cyclic) 2D block distribution, in the global matrix view and the local process point of view.]

39

Why 2D block cyclic?
• Better load balancing! With the cyclic distribution, every process owns blocks spread over the whole matrix, so all processes keep roughly equal shares of the active trailing matrix throughout the factorization (see the mapping sketch below)

[Figure: block-cyclic distribution of the m × n matrix on the p × q process grid, global matrix view and local process point of view.]

40
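A minimal sketch of the 2D block-cyclic map used above: which process owns global block (bi, bj), and where that block lands in the owner's local storage. Indices are 0-based here; ScaLAPACK's own tools (e.g., numroc) use similar arithmetic with different conventions.

#include <stdio.h>

typedef struct { int prow, pcol, li, lj; } BlockHome;

BlockHome block_home(int bi, int bj, int p, int q)
{
    BlockHome h;
    h.prow = bi % p;   // process row that owns block row bi
    h.pcol = bj % q;   // process column that owns block column bj
    h.li   = bi / p;   // local block row inside that process
    h.lj   = bj / q;   // local block column inside that process
    return h;
}

int main(void)
{
    // 9 x 8 blocks on a 2 x 3 process grid, as in the figure.
    for (int bi = 0; bi < 9; bi++) {
        for (int bj = 0; bj < 8; bj++) {
            BlockHome h = block_home(bi, bj, 2, 3);
            printf("block (%d,%d) -> process (%d,%d), local (%d,%d)\n",
                   bi, bj, h.prow, h.pcol, h.li, h.lj);
        }
    }
    return 0;
}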

ScaLAPACK routines
• ~140–160 routines × 4 data types (s, d, c, z)
• Driver routines: solve an entire problem
  • sv — solve linear system Ax = b
    • gesv  general non-symmetric (LU)
    • posv  symmetric positive definite (Cholesky)
    • sysv  symmetric indefinite (LDL^T)
    • Also packed, banded, tridiagonal storage
  • ls — linear least squares: Ax ≅ b
    • gglse  linear equality-constrained least squares
    • ggglm  general Gauss-Markov linear model
  • ev — eigenvalue decomposition: Ax = λx and Ax = λMx
    • syevd  symmetric and symmetric generalized
    • geev   non-symmetric (only gehrd Hessenberg reduction; no geev)
  • svd — singular value decomposition: A = UΣV^H
    • gesvd  standard and generalized (also not optimized for tall matrices)
    • gesdd  divide & conquer (faster)

41

Parallelism in ScaLAPACK
• Similar to LAPACK
• Bulk synchronous
• Most flops are in the gemm update
  • The 2/3 n³ term
• Can use sequential BLAS: p × q = # cores = # MPI processes, num_threads = 1
• Or multi-threaded BLAS: p × q = # nodes = # MPI processes, num_threads = # cores/node

[Diagram: one step of LU — getf2 panel factorization, laswp row swaps, trsm solve for the block row of U, gemm multiply to update the trailing matrix, producing L and U.]

42

Legacy software libraries
• LINPACK (70s): vector operations, Level 1 BLAS
• LAPACK (80s): block operations, Level 3 BLAS
• ScaLAPACK (90s): 2D block-cyclic distribution, built on the PBLAS, BLACS, and MPI

43


Emerging software solutions
• PLASMA
  • Tile layout & algorithms
  • Dynamic scheduling — OpenMP 4
• DPLASMA — PaRSEC
  • Distributed, with accelerators
  • Tile layout & algorithms
  • Dynamic scheduling — parameterized task graph
• MAGMA
  • Hybrid multicore + accelerator (GPU, Xeon Phi)
  • Block algorithms (LAPACK style)
  • Standard layout
  • Static scheduling

[Timeline: 2007, 2009, 2011]

45

Tile matrix layout
• Tiled layout
  • Each nb × nb tile is contiguous (column major within the tile)
  • Enables dataflow scheduling
  • Cache and TLB efficient (reduces conflict misses and false sharing)
  • MPI messaging efficiency (zero-copy communication)
  • In-place, parallel layout translation
• (A conversion sketch follows.)

[Figure: LAPACK column-major layout (m × n, leading dimension lda) vs. the (D)PLASMA tile layout of nb × nb tiles.]

46
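A hedged, out-of-place sketch of translating an LAPACK column-major matrix into a tile layout (each nb × nb tile stored contiguously, tiles ordered by tile column). It assumes m and n are multiples of nb; PLASMA's own translation is in-place and parallel, which this sketch is not.

#include <string.h>

void colmajor_to_tiles(int m, int n, int nb,
                       const double *A, int lda,   // column-major input
                       double *T)                  // tile storage, size m*n
{
    int mt = m / nb, nt = n / nb;            // number of tile rows / columns
    for (int j = 0; j < nt; j++) {
        for (int i = 0; i < mt; i++) {
            // Tile (i, j) starts here in the contiguous tile storage.
            double *tile = T + ((size_t)(j * mt + i)) * nb * nb;
            // Copy the tile one column at a time out of the big matrix.
            for (int jj = 0; jj < nb; jj++) {
                const double *src = A + (size_t)(j*nb + jj) * lda + i*nb;
                memcpy(tile + (size_t) jj * nb, src, nb * sizeof(double));
            }
        }
    }
}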

Tile algorithms: Cholesky
• LAPACK algorithm (right looking): at each step, factor the diagonal block (chol), solve with the whole block column below it (one large trsm), and update the whole trailing matrix (one large syrk)
• Tile algorithm: the same step is split into tile-sized tasks — chol on the diagonal tile; one trsm per tile below it (tiles 1, 2, 3); one syrk per diagonal tile of the trailing matrix (−1·1^T, −2·2^T, −3·3^T); and one gemm per off-diagonal tile (−2·1^T, −3·1^T, −3·2^T)

47

Track dependencies — directed acyclic graph (DAG)

[Figure: the DAG of tile Cholesky tasks. The classical fork-join schedule inserts artificial synchronizations between steps; reordering the same DAG for 3 cores removes the synchronizations.]

48

Execution trace
• LAPACK-style fork-join leaves cores idle

[Trace: 24 cores, matrix is 8000 × 8000, tile size is 400 × 400; colors show potrf, trsm, syrk, gemm, and idle time. The panel steps serialize execution; the time axis runs from 0 to about 823.]

49

Execution trace
• PLASMA squeezes out the idle time

[Trace: same problem (24 cores, 8000 × 8000 matrix, 400 × 400 tiles); with dataflow scheduling the potrf, trsm, syrk, and gemm tasks interleave and the time axis runs only to about 702.]

50

Dataflow scheduling
• Exploit parallelism
• Load balance
• Remove artificial synchronization between steps
• Maximize data locality
• Parallel correctness inherited from the serial tile algorithm
  • Superscalar-style scheduling: serial code submits side-effect-free tasks, and the runtime resolves the dependencies
  • The runtime automatically resolves data hazards (read after write, write after read, write after write)
• Similar approaches: OpenMP and QUARK multithreading; StarPU from Inria; SMPSs from Barcelona; SuperGlue and DuctTEiP from Uppsala; Jade from Stanford (historical)

51

Dynamic scheduling runtimes
• Jade — Stanford University
• SMPSs / OMPSs — Barcelona Supercomputing Center
• StarPU — INRIA Bordeaux
• QUARK — University of Tennessee
• SuperGlue & DuctTEiP — Uppsala University
• OpenMP 4
• (A minimal task-depend example follows.)

OpenMP timeline:
  May 2008   OpenMP 3.0  (#pragma omp task)        — GCC 4.4, April 2009
  July 2013  OpenMP 4.0  (#pragma omp task depend) — GCC 4.9, April 2014

52
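A minimal sketch of OpenMP 4 task dependences: the depend clauses express the read-after-write, write-after-read, and write-after-write hazards that the runtime resolves, exactly as in the tile Cholesky code a few slides later.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double a = 0.0, b = 0.0;

    #pragma omp parallel
    #pragma omp master
    {
        #pragma omp task depend(out: a)                  // writes a
        a = 1.0;

        #pragma omp task depend(in: a) depend(out: b)    // reads a, writes b (RAW on a)
        b = 2.0 * a;

        #pragma omp task depend(inout: a)                // reads and writes a (WAR/WAW on a)
        a = a + 10.0;

        #pragma omp taskwait
        printf("a = %g, b = %g\n", a, b);                // a = 11, b = 2
    }
    return 0;
}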


PLASMA architecture

53

Cholesky: pseudo-code

// Sequential Tile Cholesky
for k = 1 .. ntiles {
    potrf( Akk )
    for i = k+1 .. ntiles {
        trsm( Akk, Aik )
    }
    for i = k+1 .. ntiles {
        syrk( Aik, Aii )
        for j = i+1 .. ntiles {
            gemm( Ajk, Aik, Aij )
        }
    }
}

// PLASMA OpenMP Tile Cholesky
for k = 1 .. ntiles {
    omp task potrf( ... )
    for i = k+1 .. ntiles {
        omp task trsm( ... )
    }
    for i = k+1 .. ntiles {
        omp task syrk( ... )
        for j = i+1 .. ntiles {
            omp task gemm( ... )
        }
    }
}

54

Cholesky: OpenMP

#pragma omp parallel
#pragma omp master
{
    for (k = 0; k < nt; k++) {
        #pragma omp task depend(inout:A(k,k)[0:nb*nb])
        info = LAPACKE_dpotrf_work(
            LAPACK_COL_MAJOR, lapack_const(PlasmaLower), nb, A(k,k), nb);

        for (m = k+1; m < nt; m++) {
            #pragma omp task depend(in:A(k,k)[0:nb*nb]) \
                             depend(inout:A(m,k)[0:nb*nb])
            cblas_dtrsm(
                CblasColMajor, CblasRight, CblasLower, CblasTrans, CblasNonUnit,
                nb, nb, 1.0, A(k,k), nb,
                             A(m,k), nb);
        }
        for (m = k+1; m < nt; m++) {
            #pragma omp task depend(in:A(m,k)[0:nb*nb]) \
                             depend(inout:A(m,m)[0:nb*nb])
            cblas_dsyrk(
                CblasColMajor, CblasLower, CblasNoTrans,
                nb, nb, -1.0, A(m,k), nb,
                         1.0, A(m,m), nb);

            for (n = k+1; n < m; n++) {
                #pragma omp task depend(in:A(m,k)[0:nb*nb]) \
                                 depend(in:A(n,k)[0:nb*nb]) \
                                 depend(inout:A(m,n)[0:nb*nb])
                cblas_dgemm(
                    CblasColMajor, CblasNoTrans, CblasTrans,
                    nb, nb, nb, -1.0, A(m,k), nb,
                                      A(n,k), nb,
                                 1.0, A(m,n), nb);
            }
        }
    }
}

55

Cholesky performance

[Chart: double precision Cholesky factorization, Intel Xeon E5-2650 v3 (Haswell), 2.3 GHz, 20 cores. Gflop/s (0–600) vs. matrix size (0–20000) for the PLASMA OpenMP version, MKL, and QUARK.]

56

Merging DAGs
• Cholesky-based matrix inversion is three sweeps: Cholesky A = LL^T, invert L^{-1}, multiply A^{-1} = L^{-T} L^{-1}
• Running the three sweeps back-to-back with a synchronization between them leaves idle time; merging their DAGs lets tasks from different sweeps overlap and shortens the trace

[Traces: 48 cores, matrix is 4000 × 4000, tile size is 200 × 200; separate Cholesky, invert, and multiply phases vs. the merged Cholesky matrix inverse.]

NOTE: Please do not explicitly invert matrices to compute X = A^{-1} B. Solve using gesv, posv, or sysv; faster and more accurate than inverting and multiplying.

57

Cholesky inversion performance

[Chart: double precision Cholesky inversion, Intel Xeon E5-2650 v3 (Haswell), 2.3 GHz, 20 cores. Gflop/s (0–600) vs. matrix size (0–20000) for the PLASMA OpenMP version, MKL, and QUARK.]

58

LU performance
• Multi-threaded panel
• Factorization reaches peak performance quickly

[Charts, Intel Sandy Bridge (32 cores): left, LU panel factorization — Gflop/s vs. panel height for 2, 4, 8, and 16 threads (panel width = 256); right, LU factorization — Gflop/s vs. matrix size for LAPACK, MKL, and PLASMA.]

S. Donfack, J. Dongarra, M. Faverge, M. Gates, J. Kurzak, P. Luszczek, I. Yamazaki, A survey of recent developments in parallel implementations of Gaussian elimination, Concurrency and Computation: Practice and Experience, 27(5):1292–1309, 2015. DOI: 10.1002/cpe.3110

59

PLASMA QR
• Tile QR
  • Great for square matrices
  • Great for multicore
  • Pairwise reductions ("domino")
• TSQR / CAQR
  • Tall-skinny QR; communication-avoiding QR
  • Tree of pairwise reductions
  • Great for tall-skinny matrices (least squares)
  • Great for distributed memory
  • But triangle-triangle (TT) pairs are less efficient than square-square (SS) or triangle-square (TS) pairs

60

TSQR performance
• Fixed number of columns; increase rows from square (left) to tall-skinny (right)

[Chart: performance (Gflop/s, 0–3000) vs. M, with N = 4,480, for TSQR and ScaLAPACK (MKL). 60 nodes, 8 cores/node, Intel Nehalem Xeon E5520, 2.27 GHz; theoretical peak 4358.4 Gflop/s.]

61


Distributed dataflow execution

[Figure: the tile DAG partitioned across distributed-memory nodes; edges within a node are local dependencies, edges between nodes become communication.]

63–64

DPLASMA architecture

[Figure: DPLASMA software stack.]

65

Parameterized task graph (PTG)
• Symbolic DAG representation
  • Problem size independent
  • Completely distributed
  • The DAG has O(n³) tasks; the PTG avoids explicitly creating it
• Runtime (PaRSEC)
  • Data-driven execution
  • Locality-aware scheduling
  • Communication overlap

66

From serial code to PTG

Serial code (tile QR):

for k = 0 to N-1
    DGEQRT( inout A[k][k] )
    for n = k+1 to N
        DORMQR( in A[k][k], inout A[k][n] )
    for m = k+1 to N
        DTSQRT( inout A[k][k], inout A[m][k] )
        for n = k+1 to N
            DTSMQR( in A[m][k], inout A[k][n], inout A[m][n] )

PTG representation: each task class (DGEQRT(k), DORMQR(k,n), DTSQRT(m,k), DTSMQR(m,n,k)) is written with symbolic dataflow rules stating, for each argument, which task produced it and which tasks consume it — e.g., the output of DGEQRT(k) flows to DORMQR(k, k+1..N) and to DTSQRT(k+1, k). The runtime evaluates these rules locally, so the DAG never has to be built in full.

67

QR performance
• Fixed number of rows; increase columns from tall-skinny (left) to square (right)

[Chart: DPLASMA distributed multicore QR performance.]

68


Challenges of GPUs
• High levels of parallelism
• Expensive branches (if-then-else)
• Computation vs. communication gap growing
  • NVIDIA Pascal P100 has 4670 Gflop/s, 732 GB/s memory bandwidth, 16 GB/s PCIe (80 GB/s NVLink)
• Use a hybrid approach
  • Small, non-parallelizable tasks on the CPU
  • Large, parallel tasks on the GPU
  • Overlap communication & computation

[Figure: "The GPU Devotes More Transistors to Data Processing" — CPU vs. GPU die area devoted to cache, control, and ALUs, with CPU and GPU DRAM connected over PCIe. From the NVIDIA CUDA C Programming Guide.]

70

One-sided factorization
• LU, Cholesky, QR factorizations for solving linear systems
• Hybrid schedule: the panel factorization (Level 2 BLAS) runs on the CPU; the trailing-matrix update A = QA (Level 3 BLAS) runs on the GPU
• Lookahead: the next panel is updated and sent to the CPU first, so the CPU factors it while the GPU updates the rest of the trailing matrix
• The panel factorizations form the critical path of the DAG

71

[Performance charts for the hybrid one-sided factorizations. Keeneland system, using one node: 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB) and 2 × 6 Intel cores (X5660 @ 2.8 GHz, 23 GB).]

72–73

Two-sided factorization
• Hessenberg, tridiagonal factorizations for eigenvalue problems: A = Q^T A Q
• The panel factorization needs a matrix-vector product y_i = A v_i with the trailing matrix for each column a_i: Level 2 BLAS, run on the GPU
• The remaining panel work is Level 2 BLAS on the CPU
• Trailing matrix update: Level 3 BLAS on the GPU

74

MAGMA Hessenberg in double precision

[Chart: Gflop/s vs. matrix size. Keeneland system, using one node: 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB) and 2 × 6 Intel cores (X5660 @ 2.8 GHz, 23 GB).]

75


Computing SVD in 3 Phases
1. Reduction to bidiagonal form:      A = U1 B V1^T
2. Bidiagonal SVD:                    B = U2 Σ V2^T
3. Back transform singular vectors:   U = U1 U2,  V = V1 V2

77

Computing SVD in 3 Phases (2-stage reduction)
1. Reduction to bidiagonal form, in 2 stages:
   1a. Full to band:        A = Ua Aband Va^T
   1b. Band to bidiagonal:  Aband = Ub B Vb^T
2. Bidiagonal SVD:          B = U2 Σ V2^T
3. Back transform singular vectors:  U = Ua Ub U2,  V = Va Vb V2

78

One stage reduction
• For i = 1 to n by nb
  • Eliminate columns i : i + nb and rows i : i + nb
  • Update the trailing matrix using a block Householder transformation (BLAS 3 matrix multiplies)
• 4/3 n³ flops in BLAS 2 gemv
• 4/3 n³ flops in BLAS 3 gemm
• Performance limited to twice the gemv speed

79


Two stage reduction

• 1st stage: full to band (sketched below)
  • QR factorize panel, update trailing matrix
  • LQ factorize panel, update trailing matrix
• 8/3 n³ flops in BLAS 3 gemm
• More efficient with large bandwidth nb

(Diagram: A → Aband)
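A compact numpy/scipy sketch of this first stage, under simplifying assumptions (square matrix, n divisible by nb, explicit Q factors instead of compact-WY Householder panels; real implementations such as PLASMA and MAGMA tile both the panels and the gemm updates):

import numpy as np
from scipy.linalg import qr

def full_to_band(A, nb):
    # Reduce square A to upper band form with bandwidth nb:
    # at each step, QR zeroes the column panel below the diagonal block,
    # then LQ (QR of the transpose) zeroes the row panel beyond the band.
    B = A.copy()
    n = B.shape[0]
    assert n % nb == 0
    for k in range(0, n, nb):
        Q, R = qr(B[k:, k:k+nb], mode='full')            # QR factorize panel
        B[k:, k:k+nb] = R
        if k + nb < n:
            B[k:, k+nb:] = Q.T @ B[k:, k+nb:]            # update trailing matrix (gemm)
            Q2, R2 = qr(B[k:k+nb, k+nb:].T, mode='full') # LQ factorize panel
            B[k:k+nb, k+nb:] = R2.T
            B[k+nb:, k+nb:] = B[k+nb:, k+nb:] @ Q2       # update trailing matrix (gemm)
    return B

rng = np.random.default_rng(0)
A = rng.standard_normal((12, 12))
nb = 3
Aband = full_to_band(A, nb)

assert np.allclose(np.tril(Aband, -1), 0)        # nothing below the diagonal
assert np.allclose(np.triu(Aband, nb + 1), 0)    # nothing beyond the band
assert np.allclose(np.linalg.svd(A, compute_uv=False),
                   np.linalg.svd(Aband, compute_uv=False))  # singular values preserved

The trailing-matrix updates are plain matrix multiplies, which is where the 8/3 n³ gemm flops go; the wider the panel nb, the closer those gemms run to peak.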



Two stage reduction

• 2nd stage: band to bidiagonal
  • Cache friendly kernels
  • Pipeline multiple sweeps in parallel
• O(nb n²) flops in BLAS 2
• More efficient with small bandwidth nb
• Tune optimal nb (see the toy cost model below)

(Diagram: Aband → B)
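Why nb needs tuning: stage 1 wants a large nb (wider gemms run closer to peak), while stage 2's O(nb n²) bulge-chasing work grows with nb. A toy cost model makes the trade-off visible; the rates, the gemm-efficiency curve, and the stage-2 constant below are illustrative assumptions, not measurements (in practice nb is chosen by autotuning):

n = 20000
r_gemm_peak = 500e9      # hypothetical peak gemm rate, flop/s
r_stage2 = 30e9          # hypothetical rate of the cache-friendly stage-2 kernels, flop/s

def gemm_rate(nb):
    # crude assumption: gemm efficiency grows with panel width and saturates
    return r_gemm_peak * nb / (nb + 64.0)

def total_time(nb):
    stage1 = (8.0 / 3.0) * n**3 / gemm_rate(nb)   # full -> band, BLAS 3
    stage2 = 6.0 * nb * n**2 / r_stage2           # band -> bidiagonal, O(nb n^2)
    return stage1 + stage2

for nb in (16, 32, 64, 128, 256, 512):
    print(f"nb = {nb:4d}   model time = {total_time(nb):7.1f} s")
# the model's minimum falls at a moderate nb -- neither extreme wins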



Accelerated phases

(Chart comparing the phases on NVIDIA Pascal P100 vs. Intel Haswell, 2x10 core 2.3 GHz; lower is better; 8x and 16x speedups annotated.)


Overall performance

Figure 10: SVD performance. Charts compare NVIDIA Pascal P100 and Intel Haswell (2x10 core, 2.3 GHz); higher is better.
(a) Singular values only (no vectors), using 8/3 n³ for the operation count in all cases.
(b) With singular vectors, using 9 n³ for the operation count in all cases.

Figure 11: Profile of SVD implementations showing each phase, for n = 20000. With QR iteration, it first explicitly generates U1 and V1, then multiplies them by U2 and V2 during QR iteration, whereas D&C explicitly generates U2 and V2, and subsequently multiplies them by the implicit U1 and V1 in the back transformation.
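The profiled variants differ in the bidiagonal SVD: QR iteration (LAPACK gesvd) versus divide and conquer (gesdd). For reference, SciPy exposes the same choice through its lapack_driver argument; D&C is typically much faster when vectors are required, at the cost of extra workspace:

import numpy as np
from scipy.linalg import svd

A = np.random.default_rng(0).standard_normal((2000, 2000))

U, s, Vt = svd(A, lapack_driver='gesdd')      # divide and conquer (SciPy's default)
U2, s2, Vt2 = svd(A, lapack_driver='gesvd')   # QR iteration

print(np.allclose(s, s2))                     # same singular values, different runtimes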


Questions?
