Dense Linear Algebra
Mark Gates, [email protected]
http://www.icl.utk.edu/~mgates3/
1
CS 594 — Scientific Computing for Engineers — Spring 2017
Outline
• Legacy Software
  • BLAS
  • LINPACK
  • LAPACK
  • ScaLAPACK
• New Software
  • PLASMA using OpenMP
  • DPLASMA and PaRSEC
  • MAGMA for CUDA, OpenCL, or Xeon Phi
  • Case study: 2-stage SVD
2
Memory hierarchy
3
Examples for a Haswell system; adapted from an illustration by Ryan Leng.
• Registers (CPU): fastest, most expensive, smallest capacity (168 integer, 168 float)
• CPU cache (L1, L2, L3): fast, expensive, small capacity (32 KiB L1 per core, 256 KiB L2 per core, 1.5 MiB L3 per core)
• Random Access Memory (RAM): SDRAM, DDR, GDDR, HBM, etc.; modest speed, modest cost, modest capacity (64 GiB)
• Solid State Drive (SSD, Flash): burst buffers, non-volatile memory, files; slow, cheap, large capacity (500 GiB)
• Mechanical Hard Drive (HDD): virtual memory, files; slowest, cheapest, largest capacity (4 TiB)
Compute vs. memory speed
4
• Machine balance (# flops per memory access)
  • Flops “free,” memory expensive: good for dense, BLAS-3 operations (matrix multiply)
  • Flops & memory access balanced: good for sparse & vector operations
Data from Stream benchmark (McCalpin) and vendor information pages.
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
5
BLAS: Basic Linear Algebra Subroutines
6
• Level 1 BLAS — vector operations, e.g., y = α x + β y
  • O(n) data and flops (floating point operations)
  • Memory bound: O(1) flops per memory access
• Level 2 BLAS — matrix-vector operations, e.g., y = α A x + β y
  • O(n2) data and flops
  • Memory bound: O(1) flops per memory access
• Level 3 BLAS — matrix-matrix operations, e.g., C = α A B + β C
  • O(n2) data, O(n3) flops
  • Surface-to-volume effect
  • Compute bound: O(n) flops per memory access
BLAS: Basic Linear Algebra Subroutines
7
• Intel Sandy Bridge (2-socket E5-2670)
  • Peak 333 Gflop/s = 2.6 GHz × 16 cores × 8 double-precision flops/cycle †
  • Max memory bandwidth 51 GB/s ‡
† http://stackoverflow.com/questions/15655835/flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2
‡ http://ark.intel.com/products/64595/Intel-Xeon-Processor-E5-2670-20M-Cache-2_60-GHz-8_00-GTs-Intel-QPI
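As a rough illustration of machine balance, a sketch of the arithmetic using the peak and bandwidth quoted above (and assuming 8-byte doubles):

```latex
% Machine balance = peak flop rate / rate at which doubles arrive from memory
\text{balance} \approx \frac{333\ \text{Gflop/s}}{51\ \text{GB/s} \;/\; 8\ \text{bytes per double}}
\approx \frac{333}{6.4} \approx 52\ \text{flops per double loaded}
```

So a kernel needs roughly 50 flops per double loaded to be compute bound on this machine, which Level 3 BLAS can achieve but Level 1 and Level 2 BLAS cannot.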
Memory bandwidth
8
• Intel Ark
  • https://ark.intel.com/
  • Search for the model number (e.g., “E5-2697 v4” for Broadwell; see /proc/cpuinfo)
• Stream benchmark
  • https://www.cs.virginia.edu/stream/
  • Add -fopenmp or equivalent to the Makefile CFLAGS and FFLAGS
  • Adjust STREAM_ARRAY_SIZE depending on cache size; see instructions in code
prompt> less /proc/cpuinfo
...
model name  : Intel(R) Xeon(R) [email protected]
cache size  : 46080 KB
physical id : 0
siblings    : 18
...
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
...
BLAS naming scheme
9
Example: dgemm, C = AB + C
• One or two letter data type (precision)
  • i = integer (e.g., index)
  • s = single (float)
  • d = double
  • c = single-complex
  • z = double-complex
• Two letter matrix type (BLAS 2, 3)
  • ge = general nonsymmetric
  • sy = symmetric (A = AT)
  • he = complex Hermitian (A = AH)
  • tr = triangular (L or U)
  • Also banded and packed formats
• Two or more letter function, e.g.
  • mv = matrix-vector product
  • mm = matrix-matrix product
  • etc.
BLAS naming scheme
10
BLAS 1 examples
  sdot    result = xT y (single)
  ddot    result = xT y (double)
  cdotc   result = xH y (single-complex)
  cdotu   result = xT y (single-complex)
  zdotc   result = xH y (double-complex)
  zdotu   result = xT y (double-complex)
  _axpy   y = α x + y
  _scal   y = α y
  _copy   y = x
  _swap   x ↔ y
  _nrm2   result = ||x||2
  _asum   result = Σi |xi|
  i_amax  index = argmaxi |xi|
  _rot    apply Givens rotation: [x; y] = [c s; −s c] [x; y]
BLAS 2 examples
  _gemv   y = Ax + y, A general
  _symv   y = Ax + y, A symmetric
  _hemv   y = Ax + y, A Hermitian
  _ger    C = xyT + C, C general
  _syr    C = xxT + C, C symmetric
  _her    C = xxH + C, C Hermitian
  _trmv   x = Ax, A triangular
  _trsv   solve Ax = b, A triangular

BLAS 3 examples
  _gemm   C = AB + C, all general
  _symm   C = AB + C, A symmetric
  _hemm   C = AB + C, A Hermitian
  _syrk   C = AAT + C, C symmetric
  _herk   C = AAH + C, C Hermitian
  _trmm   X = AX, A triangular
  _trsm   solve AX = B, A triangular
11
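For concreteness, a minimal C sketch of calling dgemm through the CBLAS interface (assuming a CBLAS implementation such as OpenBLAS or MKL supplies cblas.h and is linked; matrices are column major here):

```c
#include <stdio.h>
#include <cblas.h>

int main(void) {
    // C = alpha*A*B + beta*C with square n x n matrices, column-major storage
    const int n = 3;
    double A[9] = {1,0,0, 0,1,0, 0,0,1};   // identity matrix
    double B[9] = {1,2,3, 4,5,6, 7,8,9};
    double C[9] = {0};

    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,    // alpha, A, lda
                     B, n,    //        B, ldb
                0.0, C, n);   // beta,  C, ldc

    // C should now equal B, since A is the identity
    for (int j = 0; j < n; ++j)
        printf("C(:,%d) = %g %g %g\n", j, C[0 + j*n], C[1 + j*n], C[2 + j*n]);
    return 0;
}
```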
Why is BLAS so important?
• BLAS are efficient, portable, parallel, and widely available
  • Vendors (Intel, AMD, IBM, Cray, NVIDIA, ...) provide highly optimized versions
  • Open source libraries available (ATLAS, OpenBLAS)
• Commonly used for high quality linear algebra and HPC software
• Performance of many applications depends on the underlying BLAS
12
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
13
LINPACK: LINear algebra PACKage
14
• Math software library
  • Solving dense & banded linear systems
  • Singular value decomposition (SVD)
• Written in the 1970s using Fortran 66
  • Aiming for software portability and efficiency
  • Level 1 BLAS
• Four primary contributors:
  • Jack Dongarra, Argonne
  • Jim Bunch, U. California, San Diego
  • Cleve Moler, New Mexico
  • Pete Stewart, U. of Maryland
• Superseded by LAPACK
Computing in 1974
15
• High performance computers
  • IBM 370/195
  • CDC 7600
  • Univac 1110
  • DEC PDP-10
  • Honeywell 6030
• Fortran 66
• EISPACK released in 1974
  • Eigenvalue and singular value problems
  • Translation of Algol into Fortran
  • Did not use BLAS
• Level 1 BLAS — vector operations (1979)
  • Aimed at vector supercomputer architectures
• LINPACK released in 1979
  • About the time of the Cray 1
LU Factorization in LINPACK
16
• Factor one column at a time
  • i_amax and _scal
• Update each column of the trailing matrix, one column at a time
  • _axpy
• Level 1 BLAS
• Bulk synchronous: a single main thread, with vectorized or multi-threaded parallel work inside the BLAS and a sync after each call (a “fork-and-join” model); see the sketch below
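A minimal C sketch of this column-by-column, Level 1 BLAS style of LU with partial pivoting; it mirrors the structure described above, not LINPACK's actual dgefa code:

```c
#include <math.h>

// Unblocked, right-looking LU with partial pivoting in the LINPACK style:
// each step searches for the pivot (like i_amax), scales one column
// (like _scal), and updates each remaining column with an axpy-like loop
// (like _axpy). Column-major storage; ipiv records the pivot rows.
void lu_unblocked(int n, double *A, int lda, int *ipiv)
{
    for (int k = 0; k < n; ++k) {
        // i_amax: find the row with the largest |A(i,k)| for i >= k
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (fabs(A[i + k*lda]) > fabs(A[p + k*lda]))
                p = i;
        ipiv[k] = p;

        // swap rows k and p
        for (int j = 0; j < n; ++j) {
            double t = A[k + j*lda];
            A[k + j*lda] = A[p + j*lda];
            A[p + j*lda] = t;
        }

        // _scal: A(k+1:n, k) /= A(k,k)
        for (int i = k + 1; i < n; ++i)
            A[i + k*lda] /= A[k + k*lda];

        // _axpy, one trailing column at a time:
        // A(k+1:n, j) -= A(k, j) * A(k+1:n, k)
        for (int j = k + 1; j < n; ++j)
            for (int i = k + 1; i < n; ++i)
                A[i + j*lda] -= A[k + j*lda] * A[i + k*lda];
    }
}
```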
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
17
LAPACK: Linear Algebra PACKage
18
• Reformulate linear algebra algorithms as blocked algorithms using BLAS-3
• Linear systems
  • Nearly 100% BLAS-3
• Singular value and symmetric eigenvalue problems
  • About 50% BLAS-3
• Nonsymmetric eigenvalue problems
  • About 80% BLAS-3
• 4 data types: single, double, single-complex, double-complex
LU factorization in LAPACK
19
• Factor a panel of nb columns
  • getf2, unblocked BLAS-2 code
• Level 3 BLAS update of the block row of U
  • trsm
• Level 3 BLAS update of the trailing matrix
  • gemm
• Aimed at machines with a cache hierarchy
• Bulk synchronous
A sketch of this blocked loop follows below.
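A sketch of the blocked right-looking loop described above, with the Level 3 updates done through CBLAS; pivoting and row swaps are omitted to keep the sketch short, and getf2_panel is an inline stand-in for LAPACK's unblocked panel code, not the real getf2:

```c
#include <cblas.h>

// Unblocked LU of an m x nb panel (no pivoting), standing in for getf2.
static void getf2_panel(int m, int nb, double *A, int lda)
{
    for (int k = 0; k < nb; ++k) {
        for (int i = k + 1; i < m; ++i)
            A[i + k*lda] /= A[k + k*lda];
        for (int j = k + 1; j < nb; ++j)
            for (int i = k + 1; i < m; ++i)
                A[i + j*lda] -= A[k + j*lda] * A[i + k*lda];
    }
}

// Blocked right-looking LU (no pivoting shown), column major, block size nb.
void lu_blocked(int n, int nb, double *A, int lda)
{
    for (int k = 0; k < n; k += nb) {
        int kb = (k + nb <= n) ? nb : n - k;

        // 1. Panel: factor A(k:n, k:k+kb) with unblocked BLAS-2 code
        getf2_panel(n - k, kb, &A[k + k*lda], lda);

        if (k + kb < n) {
            // 2. Block row of U: solve L11 * U12 = A12  (trsm)
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower,
                        CblasNoTrans, CblasUnit,
                        kb, n - k - kb,
                        1.0, &A[k + k*lda], lda,
                             &A[k + (k+kb)*lda], lda);

            // 3. Trailing matrix: A22 -= L21 * U12  (gemm)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n - k - kb, n - k - kb, kb,
                        -1.0, &A[(k+kb) + k*lda], lda,
                              &A[k + (k+kb)*lda], lda,
                         1.0, &A[(k+kb) + (k+kb)*lda], lda);
        }
    }
}
```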
Parallelism in LAPACK
20
• Most flops are in the gemm update
  • The 2/3 n3 term
  • Easily parallelized using multi-threaded BLAS
  • Done in any reasonable software
• Other operations are lower order
  • Potentially expensive if not parallelized
[Diagram: one step of blocked LU — getf2 panel factorization, laswp row swaps, trsm solve for the block row of U, gemm multiply to update the trailing matrix.]
LAPACK routine, solve AX = B
21
• dgesv( n, nrhs, A, lda, ipiv, B, ldb, info )
• Input: A is n × n with leading dimension lda; B is n × nrhs with leading dimension ldb; matrices are stored column-wise
• Output: L and U overwrite A (L has an implicit unit diagonal); the solution X overwrites B; ipiv holds the n pivot indices
• info (error code)
  = 0: no error
  < 0: invalid argument
  > 0: numerical error (e.g., singular)
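A small C example of calling this driver through the LAPACKE C interface (assuming lapacke.h and an LAPACK implementation are available), solving a 3 × 3 system with one right-hand side:

```c
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    // Solve A x = b, column-major storage.
    lapack_int n = 3, nrhs = 1, lda = 3, ldb = 3, ipiv[3];
    double A[9] = { 4, 1, 2,    // column 1
                    1, 5, 1,    // column 2
                    2, 1, 6 };  // column 3
    double B[3] = { 7, 7, 9 };  // right-hand side; overwritten by the solution

    lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, n, nrhs,
                                    A, lda, ipiv, B, ldb);
    if (info != 0) {
        printf("dgesv failed, info = %d\n", (int) info);
        return 1;
    }
    printf("x = %g %g %g\n", B[0], B[1], B[2]);   // solution overwrote B
    return 0;
}
```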
LAPACK naming
22
• Similar to BLAS naming
  • Data type + matrix type + function
• Additional matrix types:
  • po = real symmetric / complex Hermitian positive definite (SPD / HPD, all λ > 0)
  • or = orthogonal (AAT = I)
  • un = unitary (AAH = I)
  • Also banded, packed, etc. formats
LAPACK routines
23
• ~480 routines × 4 data types (s, d, c, z)
• Driver routines: solve an entire problem
  • sv  Solve linear system: Ax = b
    • gesv  general non-symmetric (LU)
    • posv  symmetric positive definite (Cholesky)
    • sysv  symmetric indefinite (LDLT)
    • Also packed, banded, tridiagonal storage
  • ls  Linear least squares: Ax ≅ b
    • gglse  linear equality-constrained least squares
    • ggglm  general Gauss-Markov linear model
  • ev  Eigenvalue decomposition: Ax = λx and Ax = λMx
    • syevd  symmetric and symmetric generalized
    • geev  non-symmetric and non-symmetric generalized
  • svd  Singular value decomposition: A = UΣVH
    • gesvd  standard and generalized
    • gesdd  D&C (faster)
http://www.icl.utk.edu/~mgates3/docs/
http://www.icl.utk.edu/~mgates3/docs/lapack.html
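As a usage sketch for one of these drivers, a small C example computing singular values with the divide & conquer SVD driver via LAPACKE (assuming LAPACKE is available; error handling kept minimal):

```c
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    // Singular values of a 4 x 3 matrix using the D&C driver gesdd.
    lapack_int m = 4, n = 3, lda = 4, ldu = 4, ldvt = 3;
    double A[12] = { 1, 2, 3, 4,      // column 1
                     2, 4, 6, 8,      // column 2
                     1, 0, 1, 0 };    // column 3
    double S[3], U[12], VT[9];        // 'S' = economy-size U and V^T

    lapack_int info = LAPACKE_dgesdd(LAPACK_COL_MAJOR, 'S', m, n,
                                     A, lda, S, U, ldu, VT, ldvt);
    if (info != 0) {
        printf("dgesdd failed, info = %d\n", (int) info);
        return 1;
    }
    printf("singular values: %g %g %g\n", S[0], S[1], S[2]);
    return 0;
}
```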
LAPACK routines
24
• Computational routines: one step of a problem
  • trf  triangular factorization (LU, Cholesky, LDLT)
  • trs  triangular solve
  • qrf  orthogonal QR factorization
  • mqr  multiply by Q
  • gqr  generate Q
  • etc.
• Auxiliary routines, begin with “la”
  • lan__  matrix norm (one, inf, Frobenius, max); see SVD for 2-norm
  • lascl  scale matrix
  • lacpy  copy matrix
  • laset  set matrix & diagonal to constants
  • etc.
http://www.icl.utk.edu/~mgates3/docs/
http://www.icl.utk.edu/~mgates3/docs/lapack.html
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
25
ScaLAPACK: Scalable Linear Algebra PACKage
26
• Distributed memory
  • Message passing
  • Clusters of SMPs
  • Supercomputers
• Dense linear algebra
• Modules
  • PBLAS: Parallel BLAS
  • BLACS: Basic Linear Algebra Communication Subprograms
PBLAS
27
• Similar to BLAS in functionality and naming
• Built on BLAS and BLACS
• Provides a global view of the matrix
  • LAPACK:    dge___( m, n, A(ia, ja), lda, ... )  — submatrix offsets are implicit in the pointer
  • ScaLAPACK: pdge___( m, n, A, ia, ja, descA, ... )  — pass submatrix offsets and a matrix descriptor
ScaLAPACK structure
28
[Diagram: ScaLAPACK software stack. ScaLAPACK sits on top of PBLAS and LAPACK; PBLAS sits on top of BLAS and BLACS; BLACS sits on top of MPI. The layers are labeled global vs. local addressing and platform independent vs. platform specific.]
ScaLAPACK routine, solve AX = B
29
• LAPACK:    dgesv( n, nrhs, A, lda, ipiv, B, ldb, info )
• ScaLAPACK: pdgesv( n, nrhs, A, ia, ja, descA, ipiv, B, ib, jb, descB, info )
• Arguments use the global matrix point of view: A is the global n × n matrix and B holds the n × nrhs right-hand sides, both stored in mb × nb blocks in the 2D block-cyclic distribution; descA and descB are their descriptors, and (ia, ja), (ib, jb) are global submatrix offsets
• Output: L and U overwrite A (L has an implicit unit diagonal); the solution X overwrites B; ipiv holds the pivot indices, distributed by block row
• info (error code)
  = 0: no error
  < 0: invalid argument
  > 0: numerical error (e.g., singular)
2D block-cyclic layout
30
m × n matrix on a p × q process grid
• The matrix is partitioned into mb × nb blocks
• Block rows are dealt out cyclically over the p process rows, and block columns cyclically over the q process columns
• Global matrix view: consecutive blocks belong to consecutive processes, wrapping around in both dimensions
• Local process point of view: each process owns every p-th block row and every q-th block column, stored together as one contiguous local matrix
[Figure: a grid of blocks labeled by global block row and column, shown in the global view and rearranged into the local view of each of the 2 × 3 processes, Process 1,1 through Process 2,3.]
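As an illustration of the mapping, a sketch using 0-based indices (ScaLAPACK's own indexing is 1-based and also allows a process-grid offset): the owning process and local block index of a global block can be computed as

```c
#include <stdio.h>

// For a 2D block-cyclic distribution over a p x q process grid:
// global block (bi, bj) (0-based) is owned by process (bi % p, bj % q)
// and is stored locally as block (bi / p, bj / q) on that process.
typedef struct { int prow, pcol, li, lj; } BlockOwner;

BlockOwner block_cyclic_owner(int bi, int bj, int p, int q)
{
    BlockOwner o = { bi % p, bj % q, bi / p, bj / q };
    return o;
}

int main(void) {
    int p = 2, q = 3;                                   // process grid as in the figure
    BlockOwner o = block_cyclic_owner(4, 6, p, q);      // global block row 4, column 6
    printf("block (4,6) -> process (%d,%d), local block (%d,%d)\n",
           o.prow, o.pcol, o.li, o.lj);
    return 0;
}
```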
Why 2D block cyclic?
39
• Why not a simple 2D distribution (one large contiguous block per process)?
  • As a factorization proceeds, the active trailing matrix shrinks toward the bottom-right corner, so processes that own only the top and left parts of the matrix run out of work
[Figure: the same m × n matrix distributed as one contiguous block per process on the 2 × 3 process grid, global and local views.]

Why 2D block cyclic?
40
• Better load balancing!
  • With the cyclic distribution, every process owns blocks spread over the whole matrix, so work stays balanced as the factorization proceeds
[Figure: the same matrix in the 2D block-cyclic distribution, global and local views.]
ScaLAPACK routines
41
• ~140–160 routines × 4 data types (s, d, c, z)
• Driver routines: solve an entire problem
  • sv  Solve linear system: Ax = b
    • gesv  general non-symmetric (LU)
    • posv  symmetric positive definite (Cholesky)
    • sysv  symmetric indefinite (LDLT)
    • Also packed, banded, tridiagonal storage
  • ls  Linear least squares: Ax ≅ b
    • gglse  linear equality-constrained least squares
    • ggglm  general Gauss-Markov linear model
  • ev  Eigenvalue decomposition: Ax = λx and Ax = λMx
    • syevd  symmetric and symmetric generalized
    • geev  non-symmetric (only gehrd Hessenberg reduction, no geev)
  • svd  Singular value decomposition: A = UΣVH
    • gesvd  standard and generalized (also not optimized for tall matrices)
    • gesdd  D&C (faster)
Parallelism in ScaLAPACK
42
• Similar to LAPACK: bulk-synchronous
• Most flops in the gemm update
  • The 2/3 n3 term
• Can use sequential BLAS: p × q = # cores = # MPI processes, num_threads = 1
• Or multi-threaded BLAS: p × q = # nodes = # MPI processes, num_threads = # cores/node
[Diagram: one step of blocked LU — getf2 panel, laswp row swaps, trsm solve, gemm multiply.]
Legacy software libraries
43
• LINPACK (70s): vector operations, Level 1 BLAS
• LAPACK (80s): block operations, Level 3 BLAS
• ScaLAPACK (90s): 2D block-cyclic distribution, built on PBLAS, BLACS, MPI
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
44
Emerging software solutions
45
• PLASMA
  • Tile layout & algorithms
  • Dynamic scheduling — OpenMP 4
• DPLASMA — PaRSEC
  • Distributed, with accelerators
  • Tile layout & algorithms
  • Dynamic scheduling — parameterized task graph
• MAGMA
  • Hybrid multicore + accelerator (GPU, Xeon Phi)
  • Block algorithms (LAPACK style)
  • Standard layout
  • Static scheduling
Timeline: 2007, 2009, 2011
Tile matrix layout
46
• Tiled layout
  • Each tile is contiguous (column major)
  • Enables dataflow scheduling
  • Cache and TLB efficient (reduces conflict misses and false sharing)
  • MPI messaging efficiency (zero-copy communication)
  • In-place, parallel layout translation
[Figure: LAPACK column-major layout (m × n, leading dimension lda) vs. the (D)PLASMA tile layout of nb × nb tiles.]
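A sketch of how an element (i, j) is located in such a layout, assuming square nb × nb tiles stored contiguously in column-major tile order (one common convention; PLASMA's actual matrix descriptor also handles partial edge tiles and other orderings):

```c
#include <stddef.h>

// Address of element (i, j) (0-based) of an m x n matrix in a tile layout:
// nb x nb tiles, tiles laid out in column-major order of tiles, and each
// tile itself stored contiguously in column major.
// Assumes m and n are multiples of nb for simplicity.
static inline double *tile_elem(double *A, int m, int n, int nb, int i, int j)
{
    int ti = i / nb, tj = j / nb;      // which tile
    int ii = i % nb, jj = j % nb;      // position within the tile
    int mt = m / nb;                   // number of tile rows
    double *tile = A + (size_t)(tj * mt + ti) * nb * nb;   // start of the tile
    return tile + jj * nb + ii;        // column major within the tile
}
```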
Tile algorithms: Cholesky
47
LAPACK algorithm (right looking), one step:
• Factor the diagonal block: chol( )
• One trsm solve for the entire block column below it (blocks 1, 2, 3 at once)
• One syrk update of the entire trailing matrix, subtracting [1 2 3][1 2 3]T
Tile algorithm, one step:
• Factor the diagonal tile: chol( )
• One trsm per tile below the diagonal: tiles 1, 2, 3
• One syrk per diagonal tile of the trailing matrix: subtract 1·1T, 2·2T, 3·3T
• One gemm per off-diagonal tile of the trailing matrix: subtract 2·1T, 3·1T, 3·2T
• Each operation touches only individual tiles, so many small tasks can run concurrently
Track dependencies — directed acyclic graph (DAG)
48
[Figure: the task DAG of the factorization shown as (1) a classical fork-join schedule with artificial synchronizations and (2) the same DAG reordered for 3 cores, without synchronizations.]
Execution trace
49
• LAPACK-style fork-join leaves cores idle
  • 24 cores; matrix is 8000 x 8000, tile size is 400 x 400; total time ≈ 823
[Trace: potrf, trsm, syrk, gemm tasks and idle time per core over time; progress serializes at the panels.]

Execution trace
50
• PLASMA squeezes out idle time
  • Same problem and machine; total time ≈ 702
[Trace: the same tasks dynamically scheduled, with far less idle time.]
Superscalar scheduling (QUARK multithreading)
• Serial code composed of side-effect-free tasks
• Dependency resolution: the runtime resolves data hazards (read after write, write after read, write after write)
• Similar approaches: StarPU from Inria, SMPSs from Barcelona, SuperGlue and DuctTEiP from Uppsala, Jade from Stanford (historical), OpenMP
Dataflow scheduling
51
• Exploit parallelism
• Load balance
• Remove artificial synchronization between steps
• Maximize data locality
• Parallel correctness inherited from the serial tile algorithm
  • The runtime automatically resolves data hazards (read after write, write after read, write after write)
Dynamic scheduling runtimes
52
• Jade  Stanford University
• SMPSs / OMPSs  Barcelona Supercomputing Center
• StarPU  INRIA Bordeaux
• QUARK  University of Tennessee
• SuperGlue & DuctTEiP  Uppsala University
• OpenMP 4
  • May 2008: OpenMP 3.0 adds #pragma omp task (in GCC 4.4, April 2009)
  • July 2013: OpenMP 4.0 adds #pragma omp task depend (in GCC 4.9, April 2014)
PLASMA architecture
53
Cholesky: pseudo-code
54

// Sequential Tile Cholesky
for k = 1 .. ntiles {
    potrf( Akk )
    for i = k+1 .. ntiles {
        trsm( Akk, Aik )
    }
    for i = k+1 .. ntiles {
        syrk( Aik, Aii )
        for j = i+1 .. ntiles {
            gemm( Ajk, Aik, Aij )
        }
    }
}

// PLASMA OpenMP Tile Cholesky
for k = 1 .. ntiles {
    omp task potrf( ... )
    for i = k+1 .. ntiles {
        omp task trsm( ... )
    }
    for i = k+1 .. ntiles {
        omp task syrk( ... )
        for j = i+1 .. ntiles {
            omp task gemm( ... )
        }
    }
}
Cholesky: OpenMP
55

#pragma omp parallel
#pragma omp master
{
    for (k = 0; k < nt; k++) {
        // factor diagonal tile
        #pragma omp task depend(inout:A(k,k)[0:nb*nb])
        info = LAPACKE_dpotrf_work(
            LAPACK_COL_MAJOR, lapack_const(PlasmaLower),
            nb, A(k,k), nb);

        // triangular solves for the tiles below the diagonal
        for (m = k+1; m < nt; m++) {
            #pragma omp task depend(in:A(k,k)[0:nb*nb]) \
                             depend(inout:A(m,k)[0:nb*nb])
            cblas_dtrsm(
                CblasColMajor, CblasRight, CblasLower,
                CblasTrans, CblasNonUnit,
                nb, nb,
                1.0, A(k,k), nb,
                     A(m,k), nb);
        }

        // trailing matrix update: syrk on diagonal tiles, gemm elsewhere
        for (m = k+1; m < nt; m++) {
            #pragma omp task depend(in:A(m,k)[0:nb*nb]) \
                             depend(inout:A(m,m)[0:nb*nb])
            cblas_dsyrk(
                CblasColMajor, CblasLower, CblasNoTrans,
                nb, nb,
                -1.0, A(m,k), nb,
                 1.0, A(m,m), nb);

            for (n = k+1; n < m; n++) {
                #pragma omp task depend(in:A(m,k)[0:nb*nb]) \
                                 depend(in:A(n,k)[0:nb*nb]) \
                                 depend(inout:A(m,n)[0:nb*nb])
                cblas_dgemm(
                    CblasColMajor, CblasNoTrans, CblasTrans,
                    nb, nb, nb,
                    -1.0, A(m,k), nb,
                          A(n,k), nb,
                     1.0, A(m,n), nb);
            }
        }
    }
}
Cholesky performance
56
[Plot: double precision Cholesky factorization, Gflop/s vs. matrix size up to 20000, on Intel Xeon E5-2650 v3 (Haswell), 2.3 GHz, 20 cores, comparing the OpenMP, MKL, and QUARK versions.]
Merging DAGs
57
• Cholesky-based matrix inversion has three sweeps: Cholesky A = LLT, invert L-1, multiply A-1 = L-T L-1
• 48 cores; matrix is 4000 x 4000, tile size is 200 x 200
• Scheduling the three sweeps as one merged DAG lets their tasks interleave, removing the synchronization between sweeps and shortening the total execution time
[Traces: the three sweeps executed one after another vs. merged into a single DAG.]
NOTE: Please do not explicitly invert matrices to compute X = A-1 B. Solve using gesv, posv, or sysv; it is faster and more accurate than inverting and multiplying.
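Following that note, a small C sketch using the Cholesky driver instead of forming an explicit inverse (assuming LAPACKE is available and A is symmetric positive definite):

```c
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    // Solve A X = B with A symmetric positive definite, instead of computing A^{-1} B.
    lapack_int n = 3, nrhs = 1, lda = 3, ldb = 3;
    double A[9] = { 4, 1, 1,
                    1, 3, 0,
                    1, 0, 2 };      // column-major SPD matrix (lower part referenced)
    double B[3] = { 6, 4, 3 };      // right-hand side; overwritten by the solution

    lapack_int info = LAPACKE_dposv(LAPACK_COL_MAJOR, 'L', n, nrhs,
                                    A, lda, B, ldb);
    if (info != 0) {
        printf("dposv failed, info = %d\n", (int) info);
        return 1;
    }
    printf("x = %g %g %g\n", B[0], B[1], B[2]);
    return 0;
}
```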
Cholesky inversion performance
58
[Plot: double precision Cholesky inversion, Gflop/s vs. matrix size up to 20000, on Intel Xeon E5-2650 v3 (Haswell), 2.3 GHz, 20 cores, comparing the OpenMP, MKL, and QUARK versions.]
LU performance
59
• Multi-threaded panel factorization
• Factorization reaches peak performance quickly
[Plot 1: LU panel factorization, Gflop/s vs. panel height (panel width = 256), with 2, 4, 8, and 16 threads, Intel Sandy Bridge (32 cores).]
[Plot 2: LU factorization, Gflop/s vs. matrix size up to 20000, comparing LAPACK, MKL, and PLASMA, Intel Sandy Bridge (32 cores).]
S. Donfack, J. Dongarra, M. Faverge, M. Gates, J. Kurzak, P. Luszczek, I. Yamazaki, “A survey of recent developments in parallel implementations of Gaussian elimination,” Concurrency and Computation: Practice and Experience, 27(5):1292–1309, 2015. DOI: 10.1002/cpe.3110
PLASMA QR
60
• Tile QR
  • Great for square matrices
  • Great for multicore
  • Pairwise reductions: the “domino”
• TSQR / CAQR
  • Tall-skinny QR; communication-avoiding QR
  • Tree of pairwise reductions
  • Great for tall-skinny matrices (least squares)
  • Great for distributed memory
  • But triangle-triangle (TT) pairs are less efficient than square-square (SS) or triangle-square (TS) pairs
TSQR performance
61
• Fixed number of columns; rows increase from square (left) to tall-skinny (right)
[Plot: performance (Gflop/s) vs. M with N = 4,480, comparing TSQR and ScaLAPACK (MKL); 60 nodes, 8 cores/node, Intel Nehalem Xeon E5520, 2.27 GHz; theoretical peak 4358.4 Gflop/s.]
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
62
Distributed dataflow execution
63
• Dataflow execution extends to distributed memory
• A dependency between tasks on the same node is a local dependency; a dependency between tasks on different nodes becomes communication
64
DPLASMA architecture
65
[Diagram: DPLASMA software stack.]
Parameterized task graph (PTG)
66
• Symbolic DAG representation
  • Problem size independent
  • Completely distributed
  • The DAG has O(n3) tasks; the PTG avoids explicitly creating it
• Runtime (PaRSEC)
  • Data-driven execution
  • Locality-aware scheduling
  • Communication overlap
From serial code to PTG
67
Serial code (tile QR):

    for k = 0 to N-1
        DGEQRT( inout A[k][k] )
        for n = k+1 to N
            DORMQR( in A[k][k], inout A[k][n] )
        for m = k+1 to N
            DTSQRT( inout A[k][k], inout A[m][k] )
            for n = k+1 to N
                DTSMQR( in A[m][k], inout A[k][n], inout A[m][n] )

PTG representation: for each task type (DGEQRT, DORMQR, DTSQRT, DTSMQR), the PTG lists symbolically, as a function of the indices (m, n, k), where each argument comes from (another task or the original matrix tile) and where its output flows to. For example, DGEQRT(k) reads A[k][k] (or the output of DTSMQR(k,k,k-1)) and sends its result to DORMQR(k, k+1..N, k), to DTSQRT(k+1, k, k), and back to A[k][k].
QR performance
68
• Fixed number of rows; columns increase from tall-skinny (left) to square (right)
[Plot: DPLASMA distributed (multicore) QR performance.]
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
69
Challenges of GPUs
• High levels of parallelism
• Expensive branches (if-then-else)
• Computation vs. communication gap is growing
  • NVIDIA Pascal P100 has 4670 Gflop/s and 732 GB/s memory bandwidth, but only 16 GB/s PCIe (80 GB/s NVLink)
• Use a hybrid approach
  • Small, non-parallelizable tasks on the CPU
  • Large, parallel tasks on the GPU
  • Overlap communication & computation
[Figure 1-2 from the NVIDIA CUDA C Programming Guide: compared with a CPU, the GPU devotes more transistors to data processing (ALUs) and fewer to caches and flow control; CPU and GPU each have their own DRAM, connected over PCIe.]
Figure from NVIDIA CUDA C Programming Guide
70
One-sided factorization
71
• LU, Cholesky, QR factorizations for solving linear systems
• Panel factorization: Level 2 BLAS, run on the CPU
• Trailing matrix update (A = QA): Level 3 BLAS, run on the GPU
• Lookahead: the columns of the next panel are updated first, so the CPU can factor the next panel while the GPU finishes updating the rest of the trailing matrix; this keeps the critical path (the chain of panels) moving
[Diagram: panel, lookahead, and trailing matrix regions, and the corresponding DAG with the critical path marked.]
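For using such a hybrid factorization in practice, a minimal sketch of calling MAGMA's CPU-interface LU driver (assuming a MAGMA installation; the header name and the exact getrf variant may differ between MAGMA releases):

```c
#include <stdio.h>
#include <stdlib.h>
#include <magma_v2.h>   // header name assumed; older releases use magma.h

int main(void) {
    magma_init();

    magma_int_t n = 1000, lda = n, info = 0;
    magma_int_t *ipiv = malloc(n * sizeof(magma_int_t));
    double *A;
    // Pinned host memory lets MAGMA overlap CPU <-> GPU transfers.
    magma_dmalloc_pinned(&A, (size_t) lda * n);

    // Fill A (column major) with a diagonally dominant test matrix.
    for (magma_int_t j = 0; j < n; ++j)
        for (magma_int_t i = 0; i < n; ++i)
            A[i + j*lda] = (i == j) ? n : 1.0;

    // Hybrid CPU + GPU LU factorization: panels on the CPU, updates on the GPU.
    magma_dgetrf(n, n, A, lda, ipiv, &info);
    printf("magma_dgetrf info = %lld\n", (long long) info);

    magma_free_pinned(A);
    free(ipiv);
    magma_finalize();
    return 0;
}
```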
72
[Performance plot: one-sided factorizations on one node of the Keeneland system, 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB).]
73
Two-sided factorization
74
• Hessenberg, tridiagonal factorizations (A = QTAQ) for eigenvalue problems
• In the panel, each column ai requires a matrix-vector product with the trailing matrix, yi = A vi: Level 2 BLAS, run on the GPU
• The remaining small panel work: Level 2 BLAS on the CPU
• Trailing matrix update: Level 3 BLAS on the GPU
[Plot: MAGMA Hessenberg reduction in double precision, Gflop/s vs. matrix size, on one node of the Keeneland system, 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB).]
75
Outline
• Legacy Software: BLAS, LINPACK, LAPACK, ScaLAPACK
• New Software: PLASMA using OpenMP; DPLASMA and PaRSEC; MAGMA for CUDA, OpenCL, or Xeon Phi; Case study: 2-stage SVD
76
Computing the SVD in 3 phases
77
1. Reduction to bidiagonal form: A = U1 B V1T
2. Bidiagonal SVD: B = U2 Σ V2T
3. Back transform singular vectors: U = U1 U2, V = V1 V2

Computing the SVD in 3 phases (two-stage reduction)
78
1. Reduction to bidiagonal form, in two stages:
   1a. Full to band: A = Ua Aband VaT
   1b. Band to bidiagonal: Aband = Ub B VbT
2. Bidiagonal SVD: B = U2 Σ V2T
3. Back transform singular vectors: U = Ua Ub U2, V = Va Vb V2
One stage reduction
79
• For i = 1 to n by nb
  • Eliminate cols i : i + nb and rows i : i + nb
  • Update the trailing matrix using block Householder transformations (BLAS 3 matrix multiplies)
• 4/3 n3 flops in BLAS 2 gemv
• 4/3 n3 flops in BLAS 3 gemm
• Performance limited to twice the gemv speed
Two stage reduction
80
• 1st stage: full to band
  • QR factorize a panel, update the trailing matrix
  • LQ factorize a panel, update the trailing matrix
• 8/3 n3 flops in BLAS 3 gemm
• More efficient with large bandwidth nb
Two stage reduction
81
• 2nd stage: band to bidiagonal
  • Cache friendly kernels
  • Pipeline multiple sweeps in parallel
• O(nb n2) flops in BLAS 2
• More efficient with small bandwidth nb
• Tune for the optimal nb
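Summarizing the trade-off from the last few slides as a back-of-the-envelope comparison (a sketch, with constants taken from the slides above):

```latex
% One-stage bidiagonal reduction: half the flops are memory-bound gemv
\text{one-stage: } \tfrac{4}{3}n^3 \text{ (BLAS 2 gemv)} + \tfrac{4}{3}n^3 \text{ (BLAS 3 gemm)}
  \;\Rightarrow\; \text{limited to about } 2\times \text{ the gemv speed}

% Two-stage reduction: almost all flops become BLAS 3
\text{two-stage: } \tfrac{8}{3}n^3 \text{ (BLAS 3, full}\to\text{band)}
  + O(n_b\, n^2) \text{ (BLAS 2, band}\to\text{bidiagonal)}
```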
Accelerated phases
82
[Plot: time for each SVD phase on an NVIDIA Pascal P100 vs. an Intel Haswell 2 x 10 core, 2.3 GHz system; speedups of roughly 8x and 16x; lower is better.]
Overall performance
83
[Plots: SVD performance, NVIDIA Pascal P100 vs. Intel Haswell 2 x 10 core 2.3 GHz; (a) singular values only (no vectors), using 8/3 n3 as the operation count in all cases; (b) with singular vectors, using 9n3 as the operation count in all cases; higher is better.]
[Profile of the SVD implementations showing each phase, for n = 20000: with QR iteration, U1 and V1 are first generated explicitly and then multiplied by U2 and V2 during QR iteration, whereas D&C explicitly generates U2 and V2 and subsequently multiplies them by the implicit U1 and V1 in the back transformation.]
Questions?
84