Communication Avoiding Algorithms for Dense Linear Algebra: LU, QR, and Cholesky decompositions and Sparse Cholesky decompositions Jim Demmel and Oded Schwartz CS294, Lecture #5 Fall, 2011 Communication-Avoiding Algorithms www.cs.berkeley.edu/~odedsc/CS294 Based on: Bootcamp 2011, CS267, Summer-school 2010, Laura Grigori’s, many papers

Page 1

Communication Avoiding Algorithms
for
Dense Linear Algebra: LU, QR, and Cholesky decompositions
and
Sparse Cholesky decompositions

Jim Demmel and Oded Schwartz

CS294, Lecture #5, Fall 2011: Communication-Avoiding Algorithms
www.cs.berkeley.edu/~odedsc/CS294

Based on: Bootcamp 2011, CS267, Summer School 2010, Laura Grigori’s slides, many papers

Page 2

Outline for today

• Recall communication lower bounds
• What a “matching upper bound”, i.e. a CA algorithm, might have to do
• Summary of what is known so far
• Case studies of CA dense algorithms
  • Last time: Case Study I: matrix multiplication
  • Case Study II: LU, QR
  • Case Study III: Cholesky decomposition
• Case studies of CA sparse algorithms
  • Sparse Cholesky decomposition

Page 3

Communication lower bounds

• Applies to algorithms satisfying the technical conditions discussed last time (“smells like 3 nested loops”)
• For each processor:
  • M = size of fast memory of that processor
  • G = #operations (multiplies) performed by that processor

  #words_moved to/from fast memory = Ω( max( G / M^(1/2), #inputs + #outputs ) )

  #messages to/from fast memory ≥ #words_moved / max_message_size = Ω( G / M^(3/2) )
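These bounds are easy to evaluate for a concrete case. A minimal sketch (not from the slides; the sizes are hypothetical), using classical matrix multiplication on one processor, where G = n^3 multiplies and #inputs + #outputs = 3n^2 matrix entries:

```python
import math

def comm_lower_bounds(G, M, io=0):
    """Evaluate the lower bounds above for one processor (constant factors
    dropped): words = max(G / M^(1/2), io), messages = G / M^(3/2)."""
    words = max(G / math.sqrt(M), io)
    messages = G / M ** 1.5
    return words, messages

# Classical matmul: G = n^3 multiplies, io = 3*n^2 words (hypothetical sizes).
n, M = 4096, 2 ** 20
words, msgs = comm_lower_bounds(G=n ** 3, M=M, io=3 * n ** 2)
```

Here the G/M^(1/2) term dominates the #inputs + #outputs term, which is the usual regime for dense matrix multiplication.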

Page 4

Summary of dense sequential algorithms attaining communication lower bounds

• Algorithms shown minimizing #messages use a (recursive) block layout
  • Not possible with columnwise or rowwise layouts
• Many references (see reports); only some shown, plus ours
• Cache-oblivious algorithms are underlined, green entries are ours, “?” is unknown/future work

Algorithm      Two levels of memory                     Multiple levels of memory
               (#words moved and #messages)             (#words moved and #messages)
BLAS-3         Usual blocked or recursive algorithms    Usual blocked algorithms (nested),
                                                        or recursive [Gustavson 97]
Cholesky       LAPACK (with b = M^(1/2)),               (same)
               [Gustavson 97], [Ahmed, Pingali 00],
               [BDHS09]
LU with        LAPACK (sometimes), [Toledo 97],         [Toledo 97], [GDX 08]?
pivoting       [GDX 08] (not partial pivoting)
QR (rank-      LAPACK (sometimes),                      [Elmroth, Gustavson 98],
revealing)     [Elmroth, Gustavson 98], [DGHL08],       [DGHL08]?, [Frens, Wise 03],
               [Frens, Wise 03] (but 3x flops?)         [DGHL08]?
Eig, SVD       Not LAPACK; [BDD11] (randomized,         [BDD11, BDK11?]
               but more flops), [BDK11]

Page 5-7

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding lower     Factor exceeding lower
                                bound for #words_moved     bound for #messages
Matrix multiply  [Cannon, 69]   1                          1
Cholesky         ScaLAPACK      log P                      log P
LU               ScaLAPACK      log P                      n log P / P^(1/2)
QR               ScaLAPACK      log P                      n log P / P^(1/2)
Sym Eig, SVD     ScaLAPACK      log P                      n / P^(1/2)
Nonsym Eig       ScaLAPACK      P^(1/2) log P              n log P

Page 8

Summary of dense parallel algorithms attaining communication lower bounds

• Assume n x n matrices on P processors (better algorithms)
• Minimum memory per processor: M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding lower     Factor exceeding lower
                                bound for #words_moved     bound for #messages
Matrix multiply  [Cannon, 69]   1                          1
Cholesky         ScaLAPACK      log P                      log P
LU               [GDX10]        log P                      log P
QR               [DGHL08]       log P                      log^3 P
Sym Eig, SVD     [BDD11]        log P                      log^3 P
Nonsym Eig       [BDD11]        log P                      log^3 P

Page 9

Can we do even better?

• Assume n x n matrices on P processors
• Why just one copy of the data, i.e. M = O(n^2 / P) per processor?
• Increase M to reduce the lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) )

Algorithm        Reference      Factor exceeding lower     Factor exceeding lower
                                bound for #words_moved     bound for #messages
Matrix multiply  [Cannon, 69]   1                          1
Cholesky         ScaLAPACK      log P                      log P
LU               [GDX10]        log P                      log P
QR               [DGHL08]       log P                      log^3 P
Sym Eig, SVD     [BDD11]        log P                      log^3 P
Nonsym Eig       [BDD11]        log P                      log^3 P

Page 10

Recall Sequential One-sided Factorizations (LU, QR)

• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n^3)

• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n^3 / M^(1/3))

• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n^3 / M^(1/2))

• None of these approaches minimizes #messages or addresses parallelism
• Need another idea
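The recursive approach can be sketched in a few lines. This is an illustrative NumPy version (not the slides’ code) and it omits pivoting, so it is only safe on matrices for which unpivoted LU is stable, e.g. diagonally dominant ones:

```python
import numpy as np

def recursive_lu(A):
    """Recursive LU in the shape of the 'recursive approach' above: factor
    the left half of the columns, update the right half, factor the right
    half.  Illustrative sketch only: no pivoting."""
    A = A.copy()
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]                      # single-column base case
        return A
    m = n // 2
    A[:, :m] = recursive_lu(A[:, :m])            # factor(left half of A)
    L11 = np.tril(A[:m, :m], -1) + np.eye(m)
    A[:m, m:] = np.linalg.solve(L11, A[:m, m:])  # update right half: U12
    A[m:, m:] -= A[m:, :m] @ A[:m, m:]           # trailing (Schur) update
    A[m:, m:] = recursive_lu(A[m:, m:])          # factor(right half of A)
    return A
```

The result stores L (unit diagonal implicit) strictly below the diagonal and U on and above it, as LAPACK does.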

Page 11

Minimizing latency in sequential case requires new data structures

• To minimize latency, need to load/store a whole rectangular subblock of the matrix with one “message”
  • Incompatible with conventional columnwise (rowwise) storage
  • Ex: rows (columns) not in contiguous memory locations
• Blocked storage: store as a matrix of b x b blocks, each block stored contiguously
  • OK for one level of memory hierarchy; what if more?
• Recursive blocked storage: store each block using subblocks
  • Also known as “space-filling curves”, “Morton ordering”
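A Morton (Z-order) index is computed by interleaving the bits of the block coordinates; a minimal illustrative sketch (not from the slides):

```python
def morton_index(i, j, bits=16):
    """Interleave the bits of block coordinates (i, j) into a Morton
    (Z-order) index, so that nearby blocks stay nearby in memory at every
    level of the hierarchy."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)   # bits of i go to odd positions
        z |= ((j >> b) & 1) << (2 * b)       # bits of j go to even positions
    return z
```

Storing block (i, j) at offset morton_index(i, j) * b * b gives the recursive blocked (“Z-Morton”) layout: each 2 x 2 group of blocks is contiguous, each 4 x 4 group of those is contiguous, and so on.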

Page 12

Different Data Layouts for Parallel GE

1) 1D Column Blocked Layout: bad load balance, P0 idle after the first n/4 steps
2) 1D Column Cyclic Layout: load balanced, but can’t easily use BLAS2 or BLAS3
3) 1D Column Block Cyclic Layout: can trade load balance against BLAS2/3 performance by choosing b, but factorization of a block column is a bottleneck
4) Block Skewed Layout: complicated addressing; may not want full parallelism in each column and row
5) 2D Row and Column Blocked Layout: bad load balance, P0 idle after the first n/2 steps
6) 2D Row and Column Block Cyclic Layout: the winner!

[Figure: processor assignments for the six layouts]

Page 13

Distributed GE with a 2D Block Cyclic Layout


Page 14

[Figure: trailing matrix update step. Matrix multiply of green = green - blue * pink]

Page 15

Where is the communication bottleneck?

• In both the sequential and parallel case: panel factorization
• Can we retain partial pivoting? No
• “Proof” for the parallel case:
  • Each choice of pivot requires a reduction (to find the maximum entry) per column
  • #messages = Ω(n) or Ω(n log P), versus the goal of O(P^(1/2))
• Solution for QR first, then LU

Page 16

TSQR: QR of a Tall, Skinny matrix

W = [ W0 ; W1 ; W2 ; W3 ] = [ Q00·R00 ; Q10·R10 ; Q20·R20 ; Q30·R30 ]
  = diag(Q00, Q10, Q20, Q30) · [ R00 ; R10 ; R20 ; R30 ]

[ R00 ; R10 ] = Q01·R01 and [ R20 ; R30 ] = Q11·R11, so
[ R00 ; R10 ; R20 ; R30 ] = diag(Q01, Q11) · [ R01 ; R11 ]

[ R01 ; R11 ] = Q02·R02

Output: Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02
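The reduction above is straightforward to sketch with NumPy, keeping only the R factors (the Q tree stays implicit, as on the slide). An illustrative version assuming the number of row blocks is a power of two:

```python
import numpy as np

def tsqr(W, blocks=4):
    """Tall-skinny QR by a binary reduction tree: factor each row block
    locally, then repeatedly stack pairs of small R factors and re-factor,
    until one R remains.  Illustrative sketch; `blocks` is assumed to be
    a power of two.  Returns only the final R."""
    Rs = [np.linalg.qr(Wi, mode="r") for Wi in np.array_split(W, blocks)]
    while len(Rs) > 1:                           # one tree level per pass
        Rs = [np.linalg.qr(np.vstack(pair), mode="r")
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]
```

Each local QR touches only an (m/blocks) x b block, and each combine touches a 2b x b stack of R factors, which is what makes the panel factorization latency-friendly.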

Page 17

TSQR: An Architecture-Dependent Algorithm

Parallel (binary tree): W = [ W0 ; W1 ; W2 ; W3 ]; local QRs give R00, R10, R20, R30, pairwise combines give R01, R11, and the final combine gives R02.

Sequential (flat tree): W = [ W0 ; W1 ; W2 ; W3 ]; fold the blocks in one at a time: W0 → R00, (R00, W1) → R01, (R01, W2) → R02, (R02, W3) → R03.

Dual core: a hybrid of the two trees.

Can choose reduction tree dynamically
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?

Page 18

Back to LU: apply the TSQR idea to TSLU, using a reduction tree to do “Tournament Pivoting”

W (n x b) = [ W1 ; W2 ; W3 ; W4 ]

W1 = P1·L1·U1: choose b pivot rows of W1, call them W1’
W2 = P2·L2·U2: choose b pivot rows of W2, call them W2’
W3 = P3·L3·U3: choose b pivot rows of W3, call them W3’
W4 = P4·L4·U4: choose b pivot rows of W4, call them W4’

[ W1’ ; W2’ ] = P12·L12·U12: choose b pivot rows, call them W12’
[ W3’ ; W4’ ] = P34·L34·U34: choose b pivot rows, call them W34’

[ W12’ ; W34’ ] = P1234·L1234·U1234: choose b pivot rows

Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting)
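Tournament pivoting can be sketched directly from this description: each block proposes b candidate pivot rows via partial-pivoted elimination, and winners are merged pairwise up the tree. An illustrative NumPy version (helper names are ours, not from the slides; assumes the number of blocks is a power of two and every block has at least b rows):

```python
import numpy as np

def gepp_pivot_rows(A, b):
    """Run b steps of Gaussian elimination with partial pivoting on a copy
    of A and return the local indices of the b pivot rows chosen."""
    A = A.copy().astype(float)
    rows = np.arange(A.shape[0])
    for k in range(b):
        p = k + np.argmax(np.abs(A[k:, k]))      # pivot: largest entry in column k
        A[[k, p]] = A[[p, k]]
        rows[[k, p]] = rows[[p, k]]
        A[k + 1:, k:] -= np.outer(A[k + 1:, k] / A[k, k], A[k, k:])
    return rows[:b]

def tournament_pivot_rows(W, b, blocks=4):
    """Tournament pivoting for TSLU: each block proposes b pivot rows, then
    candidates are merged pairwise up the tree until b global rows remain."""
    groups, offset = [], 0
    for Wi in np.array_split(W, blocks):
        groups.append(offset + gepp_pivot_rows(Wi, b))
        offset += Wi.shape[0]
    while len(groups) > 1:                       # one tree level per pass
        nxt = []
        for g1, g2 in zip(groups[0::2], groups[1::2]):
            cand = np.concatenate([g1, g2])      # 2b candidates (global indices)
            nxt.append(cand[gepp_pivot_rows(W[cand], b)])
        groups = nxt
    return groups[0]
```

One tree pass per panel replaces the per-column pivot reductions of GEPP, which is where the latency saving comes from.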

Page 19

Minimizing Communication in TSLU

Parallel (binary tree): W = [ W1 ; W2 ; W3 ; W4 ]; LU of each Wi locally, then combine the chosen rows pairwise up the tree.

Sequential (flat tree): W = [ W1 ; W2 ; W3 ; W4 ]; LU of W1, then fold in W2, W3, W4 one block at a time.

Dual core: a hybrid of the two trees.

Can choose the reduction tree dynamically, as before
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?

Page 20

Making TSLU Numerically Stable

• Stability goal: make ||A − P·L·U|| very small: O(machine_epsilon · ||A||)
• Details matter:
  • going up the tree, we could do LU either on the original rows of W (tournament pivoting) or on the computed rows of U
  • only tournament pivoting is stable
• Thm: the new scheme is as stable as partial pivoting (PP) in the following sense: it gets the same results as PP applied to a different input matrix whose entries are blocks taken from the input A
• CALU: Communication-Avoiding LU for general A
  • use TSLU for the panel factorizations
  • apply the updates to the rest of the matrix
  • cost: redundant panel factorizations
    • extra flops for one reduction step: n/b panels, O(n·b^2) flops per panel
    • sequential: this is n^2·b; with b = M^(1/2), the extra work is n^2·M^(1/2) flops (< n^3, as M < n^2)
    • parallel 2D: log P reduction steps, n/b panels, n·b^2/P^(1/2) flops per panel; with b = n/(log^2(P)·P^(1/2)), the extra work is (n/b)·(n·b^2)·log P / P^(1/2) = n^2·b·log P / P^(1/2) = n^3/(P·log P) < n^3/P
• Benefit:
  • stable in practice, but not the same pivot choice as GEPP
  • one reduction operation per panel: reduces latency to the minimum

Page 21

Sketch of TSLU Stability Proof

• Consider A = [ A11 A12 ; A21 A22 ; A31 A32 ], with TSLU of the first block column done on [ A11 ; A21 ; A31 ] (assume wlog that A21 already contains the desired pivot rows)
• The resulting LU decomposition (first n/2 steps) is

  [ Π11 Π12 0 ; Π21 Π22 0 ; 0 0 I ] · [ A11 A12 ; A21 A22 ; A31 A32 ]
      = [ A’11 A’12 ; A’21 A’22 ; A31 A32 ]
      = [ L’11 0 0 ; L’21 I 0 ; L’31 0 I ] · [ U’11 U’12 ; 0 As22 ; 0 As32 ]

• Claim: As32 can be obtained by GEPP applied to a larger matrix G formed from the blocks A’11, A’12, A21, −A31, A32 of A
• GEPP applied to G pivots no rows, and

  L31·L21^(-1)·A21·U’11^(-1)·U’12 + As32
      = L31·U21·U’11^(-1)·U’12 + As32    (since A21 = L21·U21)
      = A31·U’11^(-1)·U’12 + As32        (since A31 = L31·U21)
      = L’31·U’12 + As32                 (since L’31 = A31·U’11^(-1))
      = A32

  so the As32 produced by GEPP on G is the same as the one above.

Page 22

Stability of LU using TSLU: CALU (1/2)

• Worst-case analysis
• Pivot growth factor from one panel factorization with b columns, TSLU:
  • Proven: 2^(bH), where H = 1 + height of the reduction tree
  • Attained: 2^b, same as GEPP
• Pivot growth factor on an m x n matrix, TSLU:
  • Proven: 2^(nH−1)
  • Attained: 2^(n−1), same as GEPP
• |L| can be as large as 2^((b−2)H−b+1), but CALU is still stable
• There are examples where GEPP is exponentially less stable than CALU, and vice versa, so neither is always better than the other
• Discovered sparse matrices with O(2^n) pivot growth

Page 23

Stability of LU using TSLU: CALU (2/2)

• Empirical testing
• Both random matrices and “special ones”
• Both binary tree (BCALU) and flat tree (FCALU)
• 3 metrics: ||PA − LU|| / ||A||, normwise and componentwise backward errors
• See [Demmel, Grigori, Xiang, 2010] for details

Page 24

Performance vs ScaLAPACK

• TSLU
  • IBM Power 5: up to 4.37x faster (16 procs, 1M x 150)
  • Cray XT4: up to 5.52x faster (8 procs, 1M x 150)
• CALU
  • IBM Power 5: up to 2.29x faster (64 procs, 1000 x 1000)
  • Cray XT4: up to 1.81x faster (64 procs, 1000 x 1000)
• See INRIA Tech Report 6523 (2008); paper at SC08

Page 25

Exascale predicted speedups for Gaussian Elimination: CA-LU vs ScaLAPACK-LU

[Figure: speedup contour plot; axes log2(P) and log2(n^2/P) = log2(memory_per_proc)]

Page 26

Case Study III: Cholesky Decomposition

Page 27

LU and Cholesky decompositions

LU decomposition of A: A = P·L·U, where
  U is upper triangular, L is lower triangular, and P is a permutation matrix

Cholesky decomposition of A: if A is a real symmetric positive definite matrix, then there exists a real lower triangular L such that A = L·L^T (L is unique if we restrict its diagonal elements to be positive)

Efficient parallel and sequential algorithms?

Page 28

Application of the generalized form to Cholesky decomposition

(1) Generalized form: ∀(i,j) ∈ S,
    C(i,j) = f_ij( g_(i,j,k1)(A(i,k1), B(k1,j)), g_(i,j,k2)(A(i,k2), B(k2,j)), …, other arguments ),  k1, k2, … ∈ S_ij

For Cholesky this form is instantiated by:
    L(i,i) = ( A(i,i) − Σ_(k<i) L(i,k)^2 )^(1/2)
    L(i,j) = ( A(i,j) − Σ_(k<j) L(i,k)·L(j,k) ) / L(j,j),  for i > j
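These two formulas translate directly into the textbook column-by-column Cholesky, the baseline that the communication-avoiding variants below improve on (illustrative sketch):

```python
import numpy as np

def cholesky_unblocked(A):
    """Textbook column-by-column Cholesky (A = L·L^T), implementing the two
    update formulas above.  Assumes A is symmetric positive definite."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        # diagonal formula: L(j,j) = sqrt(A(j,j) - sum_k L(j,k)^2)
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        # below-diagonal formula: L(i,j) = (A(i,j) - sum_k L(i,k) L(j,k)) / L(j,j)
        L[j + 1:, j] = (A[j + 1:, j] - L[j + 1:, :j] @ L[j, :j]) / L[j, j]
    return L
```

Each column j reads all previously computed columns k < j, which is exactly the 3-nested-loop access pattern to which the lower bound on the next slide applies.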

Page 29

By the Geometric Embedding theorem [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]:

The communication cost (bandwidth) of any classic algorithm for Cholesky decomposition (for dense matrices) is
  Ω( n^3 / M^(1/2) )  sequentially, and  Ω( n^3 / (P·M^(1/2)) )  in parallel.

Page 30

Some Sequential Classical Cholesky Algorithms

Algorithm                        Data structure      Bandwidth                    Latency
Lower bound                      (any)               Ω(n^3/M^(1/2))               Ω(n^3/M^(3/2))
Naive                            column-major        Θ(n^3)                       Θ(n^3/M)
LAPACK (with right block size!)  column-major        O(n^3/M^(1/2))               O(n^3/M)
                                 contiguous blocks   O(n^3/M^(1/2))               O(n^3/M^(3/2))
Rectangular Recursive            column-major        O(n^3/M^(1/2) + n^2 log n)   Θ(n^3/M)
[Toledo 97], cache-oblivious     contiguous blocks   O(n^3/M^(1/2) + n^2 log n)   Θ(n^2)
Square Recursive                 column-major        O(n^3/M^(1/2))               O(n^3/M)
[Ahmed, Pingali 00], cache-obl.  contiguous blocks   O(n^3/M^(1/2))               O(n^3/M^(3/2))

Page 31

Some Parallel Classical Cholesky Algorithms

                                         Bandwidth                       Latency
Lower bound, general                     Ω(n^3/(P·M^(1/2)))              Ω(n^3/(P·M^(3/2)))
Lower bound, 2D layout (M = O(n^2/P))    Ω(n^2/P^(1/2))                  Ω(P^(1/2))
ScaLAPACK, general                       O((n^2/P^(1/2) + n·b) log P)    O((n/b) log P)
ScaLAPACK, choosing b = n/P^(1/2)        O((n^2/P^(1/2)) log P)          O(P^(1/2) log P)

Page 32

Upper bounds – Supporting Data Structures

Top (column-major): Full, Old Packed, Rectangular Full Packed.
Bottom (block-contiguous): Blocked, Recursive Format, Recursive Full Packed
[Frigo, Leiserson, Prokop, Ramachandran 99; Ahmed, Pingali 00; Elmroth, Gustavson, Jonsson, Kagstrom 04].

Page 33

Upper bounds – Supporting Data Structures

Rules of thumb:
• Don’t mix column/row major with blocked: decide which one fits your algorithm.
• Converting between the two once in your algorithm is OK for dense instances, e.g., changing the DS at the beginning of your algorithm.

Page 34

The Naïve Algorithms

Naïve left-looking and right-looking variants compute, column by column:

  L(i,i) = ( A(i,i) − Σ_(k<i) L(i,k)^2 )^(1/2)
  L(i,j) = ( A(i,j) − Σ_(k<j) L(i,k)·L(j,k) ) / L(j,j),  for i > j

Bandwidth: Θ(n^3)   Latency: Θ(n^3/M)
Assuming a column-major DS.

[Figure: access patterns of the left-looking and right-looking variants]

Page 35

Rectangular Recursive [Toledo 97]

A (m x n) is split as [ A11 A12 ; A21 A22 ; A31 A32 ], with A11 of size n/2 x n/2; the computed factor has the form [ L11 0 ; L21 L22 ; L31 L32 ].

1. Recurse on the left rectangle [ A11 ; A21 ; A31 ], producing [ L11 ; L21 ; L31 ]
2. Set L12 := 0; update [ A22 ; A32 ] := [ A22 ; A32 ] − [ L21 ; L31 ]·L21^T
3. Recurse on the right rectangle [ A22 ; A32 ]

Base case (a single column): L := A / (A11)^(1/2)

Page 36

Rectangular Recursive [Toledo 97]

Toledo’s algorithm guarantees stability for LU.

BW(m, n) = 2m if n = 1; otherwise
BW(m, n) = BW(m, n/2) + BW_mm(m − n/2, n/2, n/2) + BW(m − n/2, n/2)

Bandwidth: O(n^3/M^(1/2) + n^2 log n)
Optimal bandwidth (for almost all M)

Page 37

Rectangular Recursive [Toledo 97]

Is it latency optimal?

If we use a column-major DS, then L = Θ(n^3/M), a factor of M^(1/2) away from the optimum Θ(n^3/M^(3/2)). This is due to the matrix multiply.

Open problem: can we prove a latency lower bound for all classic matrix-multiply algorithms using a column-major DS?

Note: we don’t need such a lower bound here, as the matrix-multiply algorithm is known.

Page 38

Rectangular Recursive [Toledo 97]

Is it latency optimal?

If we use a contiguous-blocks DS, then L = Θ(n^2), due to the forward column updates. The optimum is Θ(n^3/M^(3/2)).

Since n^3/M^(3/2) ≤ n^2 for n ≤ M^(3/2), the achieved Θ(n^2) exceeds the optimum in that range, and the algorithm is not latency optimal there.

Page 39

Square Recursive algorithm [Ahmed, Pingali 00], First analyzed in [Ballard, Demmel, Holtz, S. 09]

A (n x n) is split as [ A11 A12 ; A21 A22 ], with n/2 x n/2 blocks.

1. Recurse on A11, producing L11; set A12 := 0
2. L21 := RTRSM(A21, L11^T)
3. A22 := A22 − L21·L21^T
4. Recurse on the updated A22

RTRSM: Recursive Triangular Solver
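The recursion is compact enough to sketch directly. An illustrative NumPy version (a generic triangular solve stands in for RTRSM and a plain matrix multiply for the recursive one, so this shows the control structure, not the communication pattern):

```python
import numpy as np

def recursive_cholesky(A):
    """Square recursive Cholesky in the style sketched above: factor A11,
    solve a triangular system for L21, form the Schur complement of A22,
    and recurse.  Assumes A is symmetric positive definite."""
    n = A.shape[0]
    if n == 1:
        return np.sqrt(A)                                    # 1 x 1 base case
    m = n // 2
    L = np.zeros_like(A, dtype=float)
    L[:m, :m] = recursive_cholesky(A[:m, :m])                # recurse on A11
    L[m:, :m] = np.linalg.solve(L[:m, :m], A[m:, :m].T).T    # L21 = A21 L11^(-T)
    S = A[m:, m:] - L[m:, :m] @ L[m:, :m].T                  # Schur complement
    L[m:, m:] = recursive_cholesky(S)                        # recurse on updated A22
    return L
```

With a block-recursive layout and recursive multiply/RTRSM underneath, this same control structure attains the bandwidth and latency bounds quoted on the next slides.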

Page 40

Square Recursive algorithm [Ahmed, Pingali 00], First analyzed in [Ballard, Demmel, Holtz, S. 09]

BW(n) = 3n^2 if 3n^2 ≤ M; otherwise
BW(n) = 2·BW(n/2) + O(n^3/M^(1/2) + n^2)

Page 41

Square Recursive algorithm [Ahmed, Pingali 00], First analyzed in [Ballard, Demmel, Holtz, S. 09]

Uses:
• recursive matrix multiplication [Frigo, Leiserson, Prokop, Ramachandran 99]
• recursive RTRSM

Bandwidth: O(n^3/M^(1/2))   Latency: O(n^3/M^(3/2))   (assuming a block-recursive DS)

• Optimal bandwidth
• Optimal latency
• Cache oblivious
• Not stable for LU

Page 42

Parallel Algorithm

[Figure: 2D block-cyclic layout on a Pr x Pc processor grid]

1. Find Cholesky locally on a diagonal block. Broadcast downwards.
2. Parallel triangular solve. Broadcast to the right.
3. Parallel symmetric rank-b update
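A sequential mirror of the three steps makes the structure concrete. An illustrative sketch (block operations done with NumPy calls rather than distributed kernels):

```python
import numpy as np

def blocked_cholesky(A, b):
    """Right-looking blocked Cholesky, mirroring the three steps above:
    (1) factor the diagonal block, (2) triangular solve for the panel below
    it, (3) symmetric rank-b update of the trailing matrix."""
    A = A.copy().astype(float)
    n = A.shape[0]
    L = np.zeros_like(A)
    for k in range(0, n, b):
        e = min(k + b, n)
        L[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])                # step 1
        if e < n:
            L[e:, k:e] = np.linalg.solve(L[k:e, k:e], A[e:, k:e].T).T  # step 2
            A[e:, e:] -= L[e:, k:e] @ L[e:, k:e].T                     # step 3
    return L
```

In the parallel algorithm, step 1 runs on one processor and is broadcast down its column, step 2 is solved by the processors owning the panel and broadcast to the right, and step 3 is done locally by every processor on its own blocks.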

Page 43

Parallel Algorithm

                                         Bandwidth                       Latency
Lower bound, general                     Ω(n^3/(P·M^(1/2)))              Ω(n^3/(P·M^(3/2)))
Lower bound, 2D layout (M = O(n^2/P))    Ω(n^2/P^(1/2))                  Ω(P^(1/2))
ScaLAPACK, general                       O((n^2/P^(1/2) + n·b) log P)    O((n/b) log P)
ScaLAPACK, choosing b = n/P^(1/2)        O((n^2/P^(1/2)) log P)          O(P^(1/2) log P)

The n·b term is low order, as b ≤ n/P^(1/2).
Optimal (up to a log P factor) only for 2D layouts. The new 2.5D algorithm extends to any size M.

Page 44

Summary of algorithms for dense Cholesky decomposition

Tight lower bounds for Cholesky decomposition.
Sequential:
  Bandwidth: Θ(n^3/M^(1/2))   Latency: Θ(n^3/M^(3/2))
Parallel:
  Bandwidth and latency lower bounds tight up to a Θ(log P) factor.

Page 45

Sparse Algorithms: Cholesky Decomposition

Based on [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet ‘00]

Page 46

Lower bounds for sparse direct methods

• The geometric embedding bound for sparse 3-nested-loops-like algorithms may be loose:
    BW = Ω(#flops / M^(1/2)),  L = Ω(#flops / M^(3/2))
  It may be smaller than the NNZ (number of nonzeros) in the input/output.
• For example, computing the product of two (block) diagonal matrices may involve no communication in parallel.

Page 47

Lower bounds for sparse matrix multiplication

• Compute C = A·B, with A, B, C of size n x n, each having dn nonzeros uniformly distributed; #flops is at most d^2·n
• Sequential lower bounds: BW = Ω(dn), L = Ω(dn/M)
• Two types of algorithms, depending on how their communication pattern is designed:
  • taking the nonzero structure of the matrices into account; example: [Buluc, Gilbert], BW = L = Ω(d^2·n)
  • oblivious to the nonzero structure; example: sparse Cannon, and [Buluc, Gilbert] for the parallel case, BW = Ω(dn/P^(1/2)), #messages = Ω(P^(1/2))
• [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet]

Page 48

Lower bounds for sparse Cholesky factorization

• Consider A of size k^s x k^s resulting from a finite difference operator on a regular grid of dimension s ≥ 2 with k^s nodes.
• Its Cholesky factor L contains a dense lower triangular matrix of size k^(s−1) x k^(s−1), hence
    BW = Ω(W / M^(1/2)),  L = Ω(W / M^(3/2)),
  where W = k^(3(s−1))/2 for a sequential algorithm, and W = k^(3(s−1))/(2P) for a parallel algorithm.
• This result applies more generally to a matrix A whose graph G = (V, E) has the following property for some s:
  • every set of vertices W ⊂ V with n/3 ≤ |W| ≤ 2n/3 is adjacent to at least s vertices in V − W
  • then the Cholesky factor of A contains a dense s x s submatrix

Page 49

Optimal algorithms for sparse Cholesky factorization

• PSPASES with an optimal layout can attain the lower bound in parallel for 2D/3D regular grids:
  • uses nested dissection to reorder the matrix
  • distributes the matrix using the subtree-to-subcube algorithm
  • the factorization of every dense multifrontal matrix is performed using an optimal parallel dense Cholesky factorization
• The sequential multifrontal algorithm attains the lower bound:
  • the factorization of every dense multifrontal matrix is performed using an optimal dense Cholesky factorization

Page 50

Open Problems / Work in Progress (1/2)

• Rank-Revealing CAQR (RRQR)
  • A·P = Q·R, P a permutation chosen to put “most independent” columns first
  • Does LAPACK geqp3 move O(n^3) words, or fewer?
  • Better: Tournament Pivoting (TP) on blocks of columns
• CALU with more reliable pivoting (Gu, Grigori, Khabou et al.)
  • Replace tournament pivoting of the panel (TSLU) with rank-revealing QR of the panel
• LU with complete pivoting (GECP)
  • A = Pr·L·U·Pc^T, where Pr and Pc are permutations putting “most independent” rows and columns first
  • Idea: TP on column blocks to choose columns, then TSLU on those columns to choose rows
• LU of A = A^T, with symmetric pivoting
  • A = P·L·L^T·P^T, P a permutation, L lower triangular
  • Bunch-Kaufman, rook pivoting: O(n^3) words moved?
  • CA approach: choose a block with one step of CA-GECP, permute rows and columns to the top left, pick a symmetric subset with local Bunch-Kaufman
• A = P·L·B·L^T·P^T, with symmetric pivoting (Toledo et al.)
  • L triangular, P a permutation, B banded
  • B tridiagonal: due to Aasen
  • CA approach: B block tridiagonal; stability still in question

Page 51

Open Problems / Work in Progress (2/2)

• Heterogeneity (Ballard, Gearhart)
  • How to assign work to attain the lower bounds when flop rates / bandwidths / latencies vary?
• Using extra memory in the distributed-memory case (Solomonik)
  • Using all available memory to attain improved lower bounds
  • Just matmul and CALU so far
• Using extra memory in the shared-memory case
• Sparse matrices …
• Combining sparse and dense algorithms for better communication performance, e.g.:
  • the above [Grigori, David, Peleg, Peyronnet ‘09]
  • our retuned version of [Yuster, Zwick ‘04]
• Eliminate the log P factor?
• Prove latency lower bounds for specific DS use, e.g., can we show that any classic matrix multiply algorithm that uses a column-major DS has to be a factor of M^(1/2) away from optimality?