Lecture 6: Linear Algebra Algorithms
Jack Dongarra, U of Tennessee
Slides are adapted from Jim Demmel, UCB's Lecture on Linear Algebra Algorithms
Homework #3 - Grading Rules

Part 1: (2 points)
  (1) code produces the correct result: 2/2
  (2) code runs but does not give an accurate result: 1.5/2
  (3) code does not run, or runs but produces nothing: 1/2
  Note: If you only submit one program which prints "processor contribution" and "total integral" correctly, I will give 1/2 for this part and 2/2 for part 2, so your total score for parts 1 and 2 will be 3/4.

Part 2: (2 points)
  (1) code produces the correct result: 2/2
  (2) code runs but does not give an accurate result: 1.5/2
  (3) code does not run, or runs but produces nothing: 1/2

Part 3: (4 points)
  (1) gives t_p(1), t_p(p), S(p) and E(p) correctly: 4/4
  (2) gives t_p(1), t_p(p) and S(p) correctly: 3/4
  (3) gives t_p(1) and t_p(p) correctly: 2/4
  (4) gives other: 1/4

Part 4: (2 points)
  (1) contains a sentence similar to "if the work is evenly divided and the summation can be performed in a tree fashion, the algorithm is scalable", or gives a detailed, correct discussion of (a) t_p(p), (b) S(p) and (c) E(p): 2/2
  (2) gives a correct discussion of any of (a) t_p(p), (b) S(p) and (c) E(p): 1/2
  (3) does not make sense: 0/2
Parallel Performance Metrics

• Absolute: Elapsed (wall-clock) Time = T(n)
• Speedup = S(n) = T(1) / T(n), where T(1) is the time for the best serial implementation
  => Performance improvement due to parallelism
• Parallel Efficiency = E(n) = T(1) / (n T(n))
• Ideal Speedup = SI(n) = n
  - Theoretical limit; rarely obtainable
  - Ignores all of real life
• These definitions apply to a fixed-problem experiment.
Amdahl's Law

Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

tN = (fp/N + fs) t1      Effect of multiple processors on run time
S  = 1/(fs + fp/N)       Effect of multiple processors on speedup

where:
  fs = serial fraction of code
  fp = parallel fraction of code = 1 - fs
  N  = number of processors
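As a quick numerical check of the speedup formula, here is a minimal Fortran sketch (mine, not from the slides) that tabulates S for a serial fraction fs = 0.01; the speedup visibly saturates near 1/fs = 100:

! Sketch: evaluate Amdahl's Law S = 1/(fs + fp/N) for fs = 0.01
program amdahl
  implicit none
  real :: fs, fp, s
  integer :: i, nproc
  fs = 0.01
  fp = 1.0 - fs
  do i = 0, 4
     nproc = 10**i                       ! N = 1, 10, 100, 1000, 10000
     s = 1.0 / (fs + fp/real(nproc))
     print '(a,i6,a,f8.2)', 'N =', nproc, '   S =', s
  end do
end program amdahl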
Illustration of Amdahl's Law

[Figure: speedup vs. number of processors for several values of the parallel fraction fp]

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
Amdahl's Law Vs. Reality

Amdahl's Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.

[Figure: speedup vs. number of processors (0-250) for fp = 0.99, comparing the Amdahl's Law prediction with reality]
More on Amdahl's Law

° Amdahl's Law can be generalized to any two processes with different speeds

° Ex.: apply it to f_processor and f_memory:
  • The growing processor-memory performance gap will undermine our efforts at achieving the maximum possible speedup!
Gustafson's Law

° Thus, Amdahl's Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large.

tN = (fp/N + fs) t1      Effect of multiple processors on run time
S  = 1/(fs + fp/N)       Effect of multiple processors on speedup

where:
  fs = serial fraction of code
  fp = parallel fraction of code = 1 - fs
  N  = number of processors

° There is a way around this: increase the problem size
  • bigger problems mean bigger grids or more particles: bigger arrays
  • the number of serial operations generally remains constant; the number of parallel operations increases: the parallel fraction increases
Fixed-Problem Size Scaling

• a.k.a. fixed-load, strong scaling, problem-constrained, constant-problem-size (CPS), variable subgrid
• Amdahl Limit: SA(n) = T(1) / T(n) = 1 / ( f/n + (1 - f) )

• This bounds the speedup based only on the fraction of the code that cannot use parallelism (1 - f); it ignores all other factors

• SA --> 1/(1 - f) as n --> ∞
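For example, with f = 0.99 the bound is SA --> 1/(1 - 0.99) = 100: even with unlimited processors, the 1% of the code that stays serial caps the speedup at 100.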
Fixed-Problem Size Scaling (Cont'd)

• Efficiency: E(n) = T(1) / [ T(n) * n ]
• Memory requirements per processor decrease with n
• Surface-to-volume ratio increases with n
• Superlinear speedup possible from cache effects
• Motivation: what is the largest # of procs I can use effectively, and what is the fastest time in which I can solve a given problem?
• Problems:
  - Sequential runs often not possible (large problems)
  - Speedup (and efficiency) is misleading if processors are slow
Fixed-Problem Size Scaling: Examples

S. Goedecker and A. Hoisie, Achieving High Performance in Numerical Computations on RISC Workstations and Parallel Systems, International Conference on Computational Physics PC'97, Santa Cruz, August 25-28, 1997.
Scaled Speedup Experiments

• a.k.a. fixed subgrid-size, weak scaling, Gustafson scaling
• Motivation: want to use a larger machine to solve a larger global problem in the same amount of time
• Memory and surface-to-volume effects remain constant.
Top500 Data

Rank  Manufacturer  Computer                        Installation Site                        Rmax [TF/s]  Rpeak [TF/s]  Max/Peak [%]  # Proc
  1   IBM           ASCI White, SP Power3           Lawrence Livermore National Laboratory      7.23         12.0          60.25        8192
  2   Compaq        AlphaServer SC ES45 1 GHz       Pittsburgh Supercomputing Center            4.06          6.05         67.11        3024
  3   IBM           SP Power3 375 MHz               NERSC/LBNL                                  3.05          5.00         61.00        3328
  4   Intel         ASCI Red                        Sandia National Laboratory                  2.38          3.207        74.21        9632
  5   IBM           ASCI Blue Pacific SST, SP 604E  Lawrence Livermore National Laboratory      2.14          3.868        55.33        5808
  6   Compaq        AlphaServer SC ES45 1 GHz       Los Alamos National Laboratory              2.10          3.04         69.08        1536
  7   Hitachi       SR8000/MPP                      University of Tokyo                         1.71          2.69         63.57        1152
  8   SGI           ASCI Blue Mountain              Los Alamos National Laboratory              1.61          2.52         63.89        6144
  9   IBM           SP Power3 375 MHz               Naval Oceanographic Office (NAVOCEANO)      1.42          1.97         72.08        1336
 10   IBM           SP Power3 375 MHz               Deutscher Wetterdienst                      1.29          1.83         70.49        1280
Example of a Scaled Speedup Experiment

Processors  NChains  Time   Natoms  Time per Atom  Time per PE per Atom  Efficiency
     1          32   38.4     2368    1.62E-02         1.62E-02            1.000
     2          64   38.4     4736    8.11E-03         1.62E-02            1.000
     4         128   38.5     9472    4.06E-03         1.63E-02            0.997
     8         256   38.6    18944    2.04E-03         1.63E-02            0.995
    16         512   38.7    37888    1.02E-03         1.63E-02            0.992
    32         940   35.7    69560    5.13E-04         1.64E-02            0.987
    64        1700   32.7   125800    2.60E-04         1.66E-02            0.975
   128        2800   27.4   207200    1.32E-04         1.69E-02            0.958
   256        4100   20.75  303400    6.84E-05         1.75E-02            0.926
   512        5300   14.49  392200    3.69E-05         1.89E-02            0.857

[Figure: efficiency vs. number of processors for TBON on ASCI Red]
Parallel Performance Metrics: Speedup

Speedup is only one characteristic of a program - it is not synonymous with performance. In this comparison of two machines the code achieves comparable speedups, but one of the machines is faster.

Absolute performance: [Figure: MFLOPS vs. number of processors (0-48) for T3E and O2K, each with its ideal scaling line]

Relative performance: [Figure: speedup vs. number of processors (0-60) for T3E and O2K against the ideal line]
Improving Ratio of Floating Point Operations to Memory Accesses
subroutine mult(n1,nd1,n2,nd2,y,a,x)
implicit real*8 (a-h,o-z)
dimension a(nd1,nd2),y(nd2),x(nd1)
do 20, i=1,n1
t=0.d0
do 10, j=1,n2
t=t+a(j,i)*x(j)
10 continue
y(i)=t
20 continue
return
end
Inner loop: 2 FLOPS, 2 LOADS.
Unroll the loops!
Improving Ratio of Floating Point Operations to Memory Accesses
c works correctly when n1, n2 are multiples of 4
      dimension a(nd1,nd2), y(nd2), x(nd1)
      do i=1,n1-3,4
         t1=0.d0
         t2=0.d0
         t3=0.d0
         t4=0.d0
         do j=1,n2-3,4
            t1=t1+a(j+0,i+0)*x(j+0)+a(j+1,i+0)*x(j+1)+
     1            a(j+2,i+0)*x(j+2)+a(j+3,i+0)*x(j+3)
            t2=t2+a(j+0,i+1)*x(j+0)+a(j+1,i+1)*x(j+1)+
     1            a(j+2,i+1)*x(j+2)+a(j+3,i+1)*x(j+3)
            t3=t3+a(j+0,i+2)*x(j+0)+a(j+1,i+2)*x(j+1)+
     1            a(j+2,i+2)*x(j+2)+a(j+3,i+2)*x(j+3)
            t4=t4+a(j+0,i+3)*x(j+0)+a(j+1,i+3)*x(j+1)+
     1            a(j+2,i+3)*x(j+2)+a(j+3,i+3)*x(j+3)
         enddo
         y(i+0)=t1
         y(i+1)=t2
         y(i+2)=t3
         y(i+3)=t4
      enddo

Inner loop: 32 FLOPS, 20 LOADS.
Summary of Single-Processor Optimization Techniques (I)
° Spatial and temporal data locality
° Loop unrolling
° Blocking
° Software pipelining
° Optimization of data structures
° Special functions, library subroutines
Summary of Optimization Techniques (II)

° Achieving high performance requires code restructuring. Minimization of memory traffic is the single most important goal.
° Compilers are getting better: good at software pipelining. But they are not there yet: can do loop transformations only in simple cases, usually fail to produce optimal blocking, heuristics for unrolling may not match your code well, etc.
° The optimization process is machine-specific and requires detailed architectural knowledge.
Optimizing Matrix Addition for Caches
° Dimension A(n,n), B(n,n), C(n,n)
° A, B, C stored by column (as in Fortran)
° Algorithm 1: for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j)
° Algorithm 2: for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j)
° What is “memory access pattern” for Algs 1 and 2?
° Which is faster?
° What if A, B, C stored by row (as in C)?
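A minimal sketch of the two algorithms (with illustrative timing added; the program is mine, not from the slide). Because Fortran stores A, B, C by column, Algorithm 2's inner loop walks down a column with stride 1, while Algorithm 1's inner loop jumps across columns with stride n, so Algorithm 2 normally wins:

program addloops
  implicit none
  integer, parameter :: n = 256
  real :: a(n,n), b(n,n), c(n,n)
  real :: t0, t1, t2
  integer :: i, j
  call random_number(b)
  call random_number(c)
  call cpu_time(t0)
  do i = 1, n            ! Algorithm 1: inner loop strides across rows (stride n)
     do j = 1, n
        a(i,j) = b(i,j) + c(i,j)
     end do
  end do
  call cpu_time(t1)
  do j = 1, n            ! Algorithm 2: inner loop walks down columns (stride 1)
     do i = 1, n
        a(i,j) = b(i,j) + c(i,j)
     end do
  end do
  call cpu_time(t2)
  print *, 'Algorithm 1:', t1-t0, '   Algorithm 2:', t2-t1
end program addloops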
Using a Simpler Model of Memory to Optimize

° Assume just 2 levels in the hierarchy, fast and slow
° All data initially in slow memory
  • m  = number of memory elements (words) moved between fast and slow memory
  • tm = time per slow memory operation
  • f  = number of arithmetic operations
  • tf = time per arithmetic operation, tf < tm
  • q  = f/m = average number of flops per slow element access
° Minimum possible Time = f*tf, when all data is in fast memory
° Actual Time = f*tf + m*tm = f*tf*(1 + (tm/tf)*(1/q))
° Larger q means Time closer to minimum f*tf
  • Want large q
Simple example using memory model

° To see the effect of changing q, consider this simple computation:

s = 0
for i = 1, n
  s = s + h(X[i])

° Assume the machine runs at 1 Mflop/s on data in fast memory (tf = 1)
° Assume moving data costs tm = 10
° Assume h takes q flops
° Assume array X is in slow memory
° So m = n and f = q*n
° Time = read X + compute = 10*n + q*n
° Mflop/s = f/Time = q/(10 + q)
° As q increases, this approaches the "peak" speed of 1 Mflop/s; e.g., q = 10 already gives 0.5 Mflop/s
Warm up: Matrix-vector multiplication y = y + A*x
for i = 1:n
for j = 1:n
y(i) = y(i) + A(i,j)*x(j)
[diagram: y(i) = y(i) + A(i,:) * x(:)]
Warm up: Matrix-vector multiplication y = y + A*x
{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
{read row i of A into fast memory}
for j = 1:n
y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}
° m = number of slow memory refs = 3*n + n^2
° f = number of arithmetic operations = 2*n^2
° q = f/m ≈ 2
° Matrix-vector multiplication is limited by slow memory speed
Matrix Multiply C=C+A*B
for i = 1 to n
for j = 1 to n
for k = 1 to n
C(i,j) = C(i,j) + A(i,k) * B(k,j)
[diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
Matrix Multiply C=C+A*B(unblocked, or untiled)
for i = 1 to n
{read row i of A into fast memory}
for j = 1 to n
{read C(i,j) into fast memory}
{read column j of B into fast memory}
for k = 1 to n
C(i,j) = C(i,j) + A(i,k) * B(k,j)
{write C(i,j) back to slow memory}
[diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
Matrix Multiply (unblocked, or untiled)
Number of slow memory references on unblocked matrix multiply:

m = n^3          read each column of B n times
  + n^2          read each row of A once
  + 2*n^2        read and write each element of C once
  = n^3 + 3*n^2

So q = f/m = (2*n^3)/(n^3 + 3*n^2)
           ≈ 2 for large n, no improvement over matrix-vector multiply

[diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
q=ops/slow mem ref
Matrix Multiply (blocked, or tiled)
Consider A,B,C to be N by N matrices of b by b subblocks where b=n/N is called the blocksize
for i = 1 to N
for j = 1 to N
{read block C(i,j) into fast memory}
for k = 1 to N
{read block A(i,k) into fast memory}
{read block B(k,j) into fast memory}
C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks}
{write block C(i,j) back to slow memory}
[diagram: C(i,j) = C(i,j) + A(i,k) * B(k,j)]
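A runnable Fortran sketch of the tiled loop nest above; the block size b is a tuning assumption (three b-by-b blocks should fit in fast memory), and b is taken to divide n for simplicity:

program blockedmm
  implicit none
  integer, parameter :: n = 256, b = 32       ! assume b divides n
  real :: a(n,n), bm(n,n), c(n,n)
  integer :: i, j, k, i0, j0, k0
  call random_number(a)
  call random_number(bm)
  c = 0.0
  do i0 = 1, n, b
     do j0 = 1, n, b
        do k0 = 1, n, b
           ! multiply block A(i0,k0) by block B(k0,j0) into block C(i0,j0)
           do j = j0, j0+b-1
              do k = k0, k0+b-1
                 do i = i0, i0+b-1
                    c(i,j) = c(i,j) + a(i,k)*bm(k,j)
                 end do
              end do
           end do
        end do
     end do
  end do
  print *, 'c(n,n) =', c(n,n)
end program blockedmm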
Matrix Multiply (blocked or tiled)
Why is this algorithm correct?
Number of slow memory references on blocked matrix multiply:

m = N*n^2        read each block of B N^3 times (N^3 * n/N * n/N)
  + N*n^2        read each block of A N^3 times
  + 2*n^2        read and write each block of C once
  = (2*N + 2)*n^2

So q = f/m = 2*n^3 / ((2*N + 2)*n^2)
           ≈ n/N = b for large n

So we can improve performance by increasing the blocksize b.
Can be much faster than matrix-vector multiply (q = 2).

Limit: all three blocks from A, B, C must fit in fast memory (cache), so we
cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ≈ b <= sqrt(M/3)

Theorem (Hong, Kung, 1981): Any reorganization of this algorithm
(that uses only associativity) is limited to q = O(sqrt(M))

q = ops/slow mem ref
Model

° As much as possible will be overlapped
° Dot Product:

ACC = 0
do i = 1, n
  ACC = ACC + x(i)*y(i)
end do

° Experiments done on an IBM RS6000/530
  • 25 MHz
  • FMA takes 2 cycles to complete but can be pipelined
    - => 50 Mflop/s peak
  • one cycle per load from cache

[diagram: pipelined FMA units]
DOT Operation - Data in Cache

      DO 10 I = 1, n
         T = T + X(I)*Y(I)
10    CONTINUE

° Theoretically, 2 loads for X(I) and Y(I), one FMA operation, no re-use of data
° Pseudo-assembler:

      LOAD fp0,T
label:
      LOAD fp1,X(I)
      LOAD fp2,Y(I)
      FMA fp0,fp0,fp1,fp2
      BRANCH label:

With one load per cycle, the two loads limit us to one FMA result per 2 cycles = 25 Mflop/s.
Matrix-Vector Product
° DOT versionDO 20 I = 1, M
DO 10 J = 1, N
Y(I) = Y(I) + A(I,J)*X(J)
10 CONTINUE
20 CONTINUE
° From Cache = 22.7 Mflops
° From Memory = 12.4 Mflops
Loop Unrolling
DO 20 I = 1, M, 2
T1 = Y(I )
T2 = Y(I+1)
DO 10 J = 1, N
T1 = T1 + A(I,J )*X(J)
T2 = T2 + A(I+1,J)*X(J)
10 CONTINUE
Y(I ) = T1
Y(I+1) = T2
20 CONTINUE
° 3 loads, 4 flops per iteration
° Speed of y = y + A^T*x, N = 48:

Unroll depth         1     2     3     4     ∞
Theoretical speed   25   33.3  37.5   40    50
Measured (cache)   22.7  30.5  34.3  36.5
From memory        12.4  12.7  12.7  12.6
° unroll 1: 2 loads : 2 ops per 2 cycles
° unroll 2: 3 loads : 4 ops per 3 cycles
° unroll 3: 4 loads : 6 ops per 4 cycles
° …
° unroll n: n+1 loads : 2n ops per n+1 cycles
° problem: only so many registers
Matrix Multiply
° DOT version - 25 Mflops in cache

DO 30 J = 1, M
DO 20 I = 1, M
DO 10 K = 1, L
C(I,J) = C(I,J) + A(I,K)*B(K,J)
10 CONTINUE
20 CONTINUE
30 CONTINUE
How to Get Near Peak
DO 30 J = 1, M, 2
DO 20 I = 1, M, 2
T11 = C(I, J )
T12 = C(I, J+1)
T21 = C(I+1,J )
T22 = C(I+1,J+1)
DO 10 K = 1, L
T11 = T11 + A(I, K) *B(K,J )
T12 = T12 + A(I, K) *B(K,J+1)
T21 = T21 + A(I+1,K)*B(K,J )
T22 = T22 + A(I+1,K)*B(K,J+1)
10 CONTINUE
C(I, J ) = T11
C(I, J+1) = T12
C(I+1,J ) = T21
C(I+1,J+1) = T22
20 CONTINUE
30 CONTINUE
° Inner loop: • 4 loads, 8 operations, optimal.
° In practice we have measured 48.1 out of a peak of 50 Mflop/s when in cache
BLAS -- Introduction

The BLAS (Basic Linear Algebra Subroutines) offer:

° Clarity: code is shorter and easier to read,
° Modularity: gives programmer larger building blocks,
° Performance: manufacturers will provide tuned machine-specific BLAS,
° Program portability: machine dependencies are confined to the BLAS
Memory Hierarchy

Registers
L1 Cache
L2 Cache
Local Memory
Remote Memory
Secondary Memory

° Key to high performance is effective use of the memory hierarchy
° True on all architectures
Level 1, 2 and 3 BLAS
° Level 1 BLAS Vector-Vector operations
° Level 2 BLAS Matrix-Vector operations
° Level 3 BLAS Matrix-Matrix operations
More on BLAS (Basic Linear Algebra Subroutines)
° Industry standard interface (evolving)
° Vendors, others supply optimized implementations
° History
  • BLAS1 (1970s):
    - vector operations: dot product, saxpy (y = a*x + y), etc.
    - m = 2*n, f = 2*n, q ~ 1 or less
  • BLAS2 (mid 1980s):
    - matrix-vector operations: matrix-vector multiply, etc.
    - m = n^2, f = 2*n^2, q ~ 2, less overhead
    - somewhat faster than BLAS1
  • BLAS3 (late 1980s):
    - matrix-matrix operations: matrix-matrix multiply, etc.
    - m >= 4*n^2, f = O(n^3), so q can possibly be as large as n; BLAS3 is potentially much faster than BLAS2
° Good algorithms use BLAS3 when possible (LAPACK)
° www.netlib.org/blas, www.netlib.org/lapack
Why Higher Level BLAS?
° Can only do arithmetic on data at the top of the hierarchy
° Higher level BLAS lets us do this
Level     Operation     Memory Refs   Flops    Flops/Memory Refs
Level 1   y = y + a*x   3n            2n       2/3
Level 2   y = y + A*x   n^2           2n^2     2
Level 3   C = C + A*B   4n^2          2n^3     n/2

[diagram: memory hierarchy - Registers, L1 Cache, L2 Cache, Local Memory, Remote Memory, Secondary Memory]
BLAS for Performance
IBM RS/6000-590 (66 MHz, 264 Mflop/s peak)

[Figure: Mflop/s vs. order of vectors/matrices (10-500) for Level 1, Level 2, and Level 3 BLAS]
BLAS for Performance
Alpha EV5/6 500 MHz (1 Gflop/s peak)

[Figure: Mflop/s vs. order of vectors/matrices (10-500) for Level 1, Level 2, and Level 3 BLAS]

BLAS 3 (n-by-n matrix-matrix multiply) vs. BLAS 2 (n-by-n matrix-vector multiply) vs. BLAS 1 (saxpy of n vectors)
Fast linear algebra kernels: BLAS
• Simple linear algebra kernels such as matrix-matrix multiply
• More complicated algorithms can be built from these basic kernels
• The interfaces of these kernels have been standardized as the Basic Linear Algebra Subroutines (BLAS)
• Early agreement on a standard interface (~1980)
• Led to portable libraries for vector and shared memory parallel machines
• On distributed memory, there is a less standard interface called the PBLAS
Level 1 BLAS
• Operate on vectors or pairs of vectors
  - perform O(n) operations
  - return either a vector or a scalar
• saxpy: y(i) = a*x(i) + y(i), for i = 1 to n
  - s stands for single precision; daxpy is for double precision, caxpy for complex, and zaxpy for double complex
• sscal: y = a*x, for scalar a and vectors x, y
• sdot computes s = Σ_{i=1}^{n} x(i)*y(i)
Level 2 BLAS
• Operate on a matrix and a vector
  - return a matrix or a vector
  - O(n^2) operations
• sgemv: matrix-vector multiply: y = y + A*x
  - where A is m-by-n, x is n-by-1 and y is m-by-1
• sger: rank-one update: A = A + y*x^T, i.e. A(i,j) = A(i,j) + y(i)*x(j)
  - where A is m-by-n, y is m-by-1, x is n-by-1
• strsv: triangular solve
  - solves y = T*x for x, where T is triangular
Level 3 BLAS
• Operate on pairs or triples of matrices, returning a matrix
  - complexity is O(n^3)
• sgemm: matrix-matrix multiplication: C = C + A*B
  - where C is m-by-n, A is m-by-k, and B is k-by-n
• strsm: multiple triangular solve: solves Y = T*X for X
  - where T is a triangular matrix and X is a rectangular matrix
Optimizing in practice
° Tiling for registers
  • loop unrolling, use of named "register" variables
° Tiling for multiple levels of cache
° Exploiting fine-grained parallelism within the processor
• superscalar execution
• pipelining
° Complicated compiler interactions
° Hard to do by hand (but you’ll try)
° Automatic optimization an active research area• PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac
• www.cs.berkeley.edu/~iyer/asci_slides.ps
• ATLAS: www.netlib.org/atlas/index.html
BLAS -- References
° BLAS software and documentation can be obtained via:
• WWW: http://www.netlib.org/blas,
• (anonymous) ftp ftp.netlib.org: cd blas; get index
• email [email protected] with the message: send index from blas
° Comments and questions can be addressed to: [email protected]
BLAS Papers
° C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, 5:308--325, 1979.
° J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, An Extended Set of Fortran Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 14(1):1--32, 1988.
° J. Dongarra, J. Du Croz, I. Duff, S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 16(1):1--17, 1990.
Performance of BLAS
• BLAS are specially optimized by the vendor
  - Sun BLAS uses features in the UltraSPARC
• Big payoff for algorithms that can be expressed in terms of BLAS3 instead of BLAS2 or BLAS1
• The top speed is that of the BLAS3
• Algorithms like Gaussian elimination are organized so that they use BLAS3
How To Get Performance From Commodity Processors?
° Today’s processors can achieve high-performance, but this requires extensive machine-specific hand tuning.
° Routines have a large design space with many parameters
  • blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules
  • Complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors
° A few months ago there was no tuned BLAS for the Pentium under Linux.
° Need for quick/dynamic deployment of optimized routines.
° ATLAS: Automatically Tuned Linear Algebra Software
  • PHiPAC from Berkeley is a similar effort
[diagram: blocked matrix multiply C = A*B, with C M-by-N, A M-by-K, B K-by-N, processed in NB-sized blocks]
Adaptive Approach for Level 3
° Do a parameter study of the operation on the target machine, done once.
° Only generated code is on-chip multiply
° BLAS operation written in terms of generated on-chip multiply
° All transpose cases coerced through data copy to one case of on-chip multiply
  • Only one case generated per platform
Code Generation Strategy
° Code is iteratively generated & timed until optimal case is found. We try:
• Differing NBs
• Breaking false dependencies
• M, N and K loop unrolling
° On-chip multiply optimizes for:
• TLB access
• L1 cache reuse
• FP unit usage
• Memory fetch
• Register reuse
• Loop overhead minimization
° Takes a couple of hours to run.
500x500 Double Precision Matrix-Matrix Multiply Across Multiple Architectures

[Figure: Mflop/s for vendor matrix multiply vs. ATLAS matrix multiply on DEC Alpha 21164a-433, HP PA8000 180 MHz, HP 9000/735/125, IBM Power2-135, IBM PowerPC604e-332, Pentium MMX-150, Pentium Pro-200, Pentium II-266, SGI R4600, SGI R5000, SGI R8000ip21, SGI R10000ip27, Sun Microsparc II Model 70, Sun Darwin-270, Sun Ultra2 Model 2200]
500x500 Double Precision LU Factorization Performance Across Multiple Architectures

[Figure: MFLOPS for LU with vendor BLAS vs. LU with ATLAS & GEMM-based BLAS on DEC LX 21164a-533, DEC Alpha 21164a-433, HP PA8000, IBM Power2-135, IBM PowerPC604e-332, Pentium Pro-200, Pentium II-266, SGI R5000, SGI R10000ip27, Sun Darwin-270, Sun Ultra2 Model 2200]
500x500 gemm-based BLAS on SGI R10000ip28

[Figure: MFLOPS for DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM with vendor BLAS, ATLAS/SSBLAS, and reference BLAS]
500x500 gemm-based BLAS on UltraSparc 2200

[Figure: MFLOPS for the Level 3 BLAS routines DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM with vendor BLAS, ATLAS/GEMM-based BLAS, and reference BLAS]
Recursive Approach for Other Level 3 BLAS
° Recur down to L1 cache block size
° Need kernel at bottom of recursion
• Use gemm-based kernel for portability
Recursive TRMM
[diagram: recursive blocking pattern for triangular matrix multiply]
500x500 Level 2 BLAS DGEMV

[Figure: MFLOPS for vendor NoTrans, ATLAS NoTrans, and F77 NoTrans DGEMV on AMD Athlon-600, DEC ev56-533, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-550, SGI R10000ip28-200, SGI R12000ip30-270, Sun UltraSparc2-200]
Multi-Threaded DGEMM, Intel PIII 550 MHz

[Figure: Mflop/s vs. matrix size (100-1000) for Intel BLAS 1 proc, ATLAS 1 proc, Intel BLAS 2 proc, ATLAS 2 proc]
ATLAS
° Keep a repository of kernels for specific machines.
° Develop a means of dynamically downloading code
° Extend work to allow sparse matrix operations
° Extend work to include arbitrary code segments
° See: http://www.netlib.org/atlas/
Algorithms and Architecture
° The key to performance is to understand the algorithm and architecture interaction.
° A significant improvement in performance can be obtained by matching the algorithm to the architecture, or vice versa.
Algorithm Issues
° Use of memory hierarchy
° Algorithm pre-fetching
° Loop unrolling
° Simulating higher precision arithmetic
Blocking

° TLB blocking - minimize TLB misses
° Cache Blocking - minimize cache misses
° Register Blocking - minimize load/stores
° The general idea of blocking is to get the information to a high-speed storage and use it multiple times so as to amortize the cost of moving the data
° Cache blocking reduces traffic between memory and cache
° Register blocking reduces traffic between cache and CPU
Loop Unrolling
° Reduces data dependency delay
° Exploits multiple functional units and quad load/stores effectively.
° Minimizes load/stores
° Reduces loop overheads
° Gives more flexibility to compiler in scheduling
° Facilitates algorithm pre-fetching.
° What about vector computing?
What's Wrong With Speedup T1/Tp?
° Can lead to very false conclusions.
° Speedup in isolation without taking into account the speed of the processor is unrealistic and pointless.
° Speedup over what?
° T1/Tp
  • There is usually no doubt about Tp
• Often considerable dispute over the meaning of T1
- Serial code? Same algorithm?
Speedup
° Can be used to:• Study, in isolation, the scaling of one algorithm on one computer.
• As a dimensionless variable in the theory of scaling.
° Should not be used to compare:• Different algorithms on the same computer
• The same algorithm on different computers.
• Different interconnection structures.
Strassen's Algorithm for Matrix Multiply

Usual Matrix Multiply:

[ C11 C12 ]   [ A11 A12 ] [ B11 B12 ]
[ C21 C22 ] = [ A21 A22 ] [ B21 B22 ]

C11 = A11*B11 + A12*B21
C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21
C22 = A21*B12 + A22*B22
Strassen's Algorithm

P1 = (A11 + A22)(B11 + B22)
P2 = (A21 + A22) B11
P3 = A11 (B12 - B22)
P4 = A22 (B21 - B11)
P5 = (A11 + A12) B22
P6 = (A21 - A11)(B11 + B12)
P7 = (A12 - A22)(B21 + B22)

C11 = P1 + P4 - P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 + P3 - P2 + P6
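As a sanity check of these formulas, the sketch below applies them with 1-by-1 (scalar) blocks and compares against the ordinary product; since only +, - and * are used, the same identities hold when the Aij and Bij are matrix blocks:

program strassen2x2
  implicit none
  real :: a(2,2), b(2,2), c(2,2)
  real :: p1, p2, p3, p4, p5, p6, p7
  call random_number(a)
  call random_number(b)
  p1 = (a(1,1)+a(2,2)) * (b(1,1)+b(2,2))
  p2 = (a(2,1)+a(2,2)) * b(1,1)
  p3 = a(1,1) * (b(1,2)-b(2,2))
  p4 = a(2,2) * (b(2,1)-b(1,1))
  p5 = (a(1,1)+a(1,2)) * b(2,2)
  p6 = (a(2,1)-a(1,1)) * (b(1,1)+b(1,2))
  p7 = (a(1,2)-a(2,2)) * (b(2,1)+b(2,2))
  c(1,1) = p1 + p4 - p5 + p7
  c(1,2) = p3 + p5
  c(2,1) = p2 + p4
  c(2,2) = p1 + p3 - p2 + p6
  print *, 'max error:', maxval(abs(c - matmul(a,b)))   ! should be ~0
end program strassen2x2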
Strassen’s Algorithm
° The count of arithmetic operations (for one level of recursion on 2x2 blocks) is:

           Mults   Adds   Complexity
Regular      8       4    O(n^3)
Strassen     7      18    O(n^log2 7) ≈ O(n^2.81)

One matrix multiply is replaced by 14 matrix additions.
Strassen’s Algorithm
° In reality the use of Strassen’s Algorithm is limited by
• Additional memory required for storing the P matrices.
• More memory accesses are needed.
Outline
° Motivation for Dense Linear Algebra
  • Ax = b: Computational Electromagnetics
  • Ax = λx: Quantum Chemistry
° Review Gaussian Elimination (GE) for solving Ax=b
° Optimizing GE for caches on sequential machines
  • using matrix-matrix multiplication (BLAS)
° LAPACK library overview and performance
° Data layouts on parallel machines
° Parallel matrix-matrix multiplication
° Parallel Gaussian Elimination
° ScaLAPACK library overview
° Eigenvalue problem
Parallelism in Sparse Matrix-vector multiplication
° y = A*x, where A is sparse and n x n
° Questions
  • which processors store
    - y[i], x[i], and A[i,j]
  • which processors compute
    - y[i] = sum (from 1 to n) A[i,j] * x[j]
           = (row i of A) . x … a sparse dot product
° Partitioning
  • Partition index set {1,…,n} = N1 ∪ N2 ∪ … ∪ Np
  • For all i in Nk, processor k stores y[i], x[i], and row i of A
  • For all i in Nk, processor k computes y[i] = (row i of A) . x
    - "owner computes" rule: processor k computes the y[i]s it owns
° Goals of partitioning (a kernel sketch follows this list)
  • balance load (how is load measured?)
  • balance storage (how much does each processor store?)
  • minimize communication (how much is communicated?)
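A serial Fortran sketch of the kernel each processor runs on the rows it owns, assuming compressed sparse row (CSR) storage; the array names rowptr/colind/val are illustrative, not from the slide:

! y = A*x with A in CSR format: the nonzeros of row i are
! val(rowptr(i) : rowptr(i+1)-1), in columns colind(...)
subroutine csr_matvec(n, rowptr, colind, val, x, y)
  implicit none
  integer, intent(in) :: n, rowptr(n+1), colind(*)
  real, intent(in)    :: val(*), x(n)
  real, intent(out)   :: y(n)
  integer :: i, k
  do i = 1, n                               ! "owner computes": locally owned rows
     y(i) = 0.0
     do k = rowptr(i), rowptr(i+1)-1
        y(i) = y(i) + val(k)*x(colind(k))   ! sparse dot product with row i
     end do
  end do
end subroutine csr_matvec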
Graph Partitioning and Sparse Matrices
[Figure: a symmetric 6-by-6 sparse matrix and the corresponding graph on vertices 1-6; an edge (i,j) corresponds to a nonzero A(i,j)]
° Relationship between matrix and graph
° A "good" partition of the graph has
  • equal (weighted) number of nodes in each part (load and storage balance)
  • minimum number of edges crossing between parts (minimize communication)
° Can reorder the rows/columns of the matrix by putting all the nodes in one partition together
More on Matrix Reordering via Graph Partitioning
° "Ideal" matrix structure for parallelism: (nearly) block diagonal
  • p (number of processors) blocks
  • few non-zeros outside these blocks, since these require communication

[Figure: block-diagonal matrix-vector multiply with blocks assigned to processors P0-P4]
What about implicit methods and eigenproblems?
° Direct methods (Gaussian elimination)
  • Called LU decomposition, because we factor A = L*U
• Future lectures will consider both dense and sparse cases
• More complicated than sparse-matrix vector multiplication
° Iterative solvers
  • Will discuss several of these in future lectures
    - Jacobi, Successive Overrelaxation (SOR), Conjugate Gradients (CG), Multigrid, ...
  • Most have sparse-matrix-vector multiplication in their kernel
° Eigenproblems
  • Future lectures will discuss dense and sparse cases
  • Also depend on sparse-matrix-vector multiplication, direct methods
° Graph partitioning
Partial Differential Equations
PDEs
Continuous Variables, Continuous Parameters
Examples of such systems include
° Heat flow: Temperature(position, time)
° Diffusion: Concentration(position, time)
° Electrostatic or Gravitational Potential: Potential(position)
° Fluid flow: Velocity,Pressure,Density(position,time)
° Quantum mechanics: Wave-function(position,time)
° Elasticity: Stress,Strain(position,time)
Example: Deriving the Heat Equation
Consider a simple problem:
° A bar of uniform material on [0,1], insulated except at the ends
° Let u(x,t) be the temperature at position x at time t
° Heat travels from x-h to x+h at a rate proportional to:

  d u(x,t)         (u(x-h,t) - u(x,t))/h - (u(x,t) - u(x+h,t))/h
  --------  =  C * ----------------------------------------------
     dt                                  h

° As h -> 0, we get the heat equation:

  d u(x,t)         d^2 u(x,t)
  --------  =  C * ----------
     dt               dx^2
Explicit Solution of the Heat Equation
° For simplicity, assume C=1
° Discretize both time and position
° Use finite differences with u[j,i] as the heat at
  • time t = i*dt (i = 0,1,2,…) and position x = j*h (j = 0,1,…,N = 1/h)
  • initial conditions on u[j,0]
  • boundary conditions on u[0,i] and u[N,i]
° At each timestep i = 0,1,2,...:

For j = 0 to N
  u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
where z = dt/h^2

° This corresponds to
  • matrix-vector multiply (what is the matrix?)
  • nearest neighbors on the grid

[Figure: space-time grid of values u[j,i] for timesteps t = 0..5]
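A minimal Fortran sketch of the explicit update (boundary values held fixed; dt is chosen so that z stays below the stability bound discussed two slides ahead):

program heat_explicit
  implicit none
  integer, parameter :: nx = 100, nsteps = 1000
  real, parameter :: h = 1.0/nx, dt = 0.4*h*h, z = dt/(h*h)   ! z = 0.4 < 0.5
  real :: u(0:nx), unew(0:nx)
  integer :: i, j
  u = 0.0
  u(0) = 1.0                                  ! initial and boundary conditions
  do i = 1, nsteps
     do j = 1, nx-1
        unew(j) = z*u(j-1) + (1.0-2.0*z)*u(j) + z*u(j+1)
     end do
     u(1:nx-1) = unew(1:nx-1)                 ! boundary values stay fixed
  end do
  print *, 'u at midpoint:', u(nx/2)
end program heat_explicit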
Parallelism in Explicit Method for PDEs
° Partitioning the space (x) into p large contiguous chunks
  • good load balance (assuming a large number of points relative to p)
  • minimized communication (only p chunks)
° Generalizes to
  • multiple dimensions
  • arbitrary graphs (= sparse matrices)
° Problem with explicit approach
  • numerical instability
  • solution blows up eventually if z = dt/h^2 > 0.5
  • need to make the timesteps very small when h is small: dt < 0.5*h^2
Instability in solving the heat equation explicitly
Implicit Solution
° As with many (stiff) ODEs, need an implicit method
° This turns into solving the following equation:

(I + (z/2)*T) * u[:,i+1] = (I - (z/2)*T) * u[:,i]

° Here I is the identity matrix and T is the tridiagonal matrix

    [  2 -1          ]
    [ -1  2 -1       ]
T = [    -1  2 -1    ]
    [       -1  2 -1 ]
    [          -1  2 ]

Graph and "stencil": a chain of nodes, each with stencil (-1  2  -1)

° I.e., essentially solving Poisson's equation in 1D
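Each timestep therefore requires a tridiagonal solve, which costs only O(n) using the classical Thomas algorithm; a sketch follows (no pivoting, which is safe here because I + (z/2)*T is diagonally dominant):

! Solve a tridiagonal system with subdiagonal dl, diagonal d,
! superdiagonal du; overwrites rhs with the solution.
subroutine thomas(n, dl, d, du, rhs)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: dl(n), d(n), du(n), rhs(n)
  integer :: i
  real :: m
  do i = 2, n                          ! forward elimination
     m = dl(i)/d(i-1)
     d(i) = d(i) - m*du(i-1)
     rhs(i) = rhs(i) - m*rhs(i-1)
  end do
  rhs(n) = rhs(n)/d(n)                 ! back substitution
  do i = n-1, 1, -1
     rhs(i) = (rhs(i) - du(i)*rhs(i+1))/d(i)
  end do
end subroutine thomas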
2D Implicit Method
° Similar to the 1D case, but the matrix T is now (shown for a 3x3 grid of unknowns):

    [  4 -1  0 -1  0  0  0  0  0 ]
    [ -1  4 -1  0 -1  0  0  0  0 ]
    [  0 -1  4  0  0 -1  0  0  0 ]
    [ -1  0  0  4 -1  0 -1  0  0 ]
T = [  0 -1  0 -1  4 -1  0 -1  0 ]
    [  0  0 -1  0 -1  4  0  0 -1 ]
    [  0  0  0 -1  0  0  4 -1  0 ]
    [  0  0  0  0 -1  0 -1  4 -1 ]
    [  0  0  0  0  0 -1  0 -1  4 ]

Graph and "stencil": 2D grid graph; the 5-point stencil has 4 at the center and -1 at each of the four neighbors

° Multiplying by this matrix (as in the explicit case) is simply nearest neighbor computation on a 2D grid
° To solve this system, there are several techniques
Algorithms for 2D Poisson Equation with N unknowns
Algorithm        Serial         PRAM            Memory     #Procs

Dense LU         N^3            N               N^2        N^2
Band LU          N^2            N               N^(3/2)    N
Jacobi           N^2            N               N          N
Explicit Inv.    N^2            log N           N^2        N^2
Conj. Grad.      N^(3/2)        N^(1/2) log N   N          N
RB SOR           N^(3/2)        N^(1/2)         N          N
Sparse LU        N^(3/2)        N^(1/2)         N log N    N
FFT              N log N        log N           N          N
Multigrid        N              log^2 N         N          N
Lower bound      N              log N           N

PRAM is an idealized parallel model with zero-cost communication
(see next slide for explanation)
Short explanations of algorithms on previous slide

° Sorted in two orders (roughly):
• from slowest to fastest on sequential machines
• from most general (works on any matrix) to most specialized (works on matrices “like” T)
° Dense LU: Gaussian elimination; works on any N-by-N matrix
° Band LU: exploit fact that T is nonzero only on sqrt(N) diagonals nearest main diagonal, so faster
° Jacobi: essentially does matrix-vector multiply by T in inner loop of iterative algorithm
° Explicit Inverse: assume we want to solve many systems with T, so we can precompute and store inv(T) “for free”, and just multiply by it
• It’s still expensive!
° Conjugate Gradients: uses matrix-vector multiplication, like Jacobi, but exploits mathematical properties of T that Jacobi does not
° Red-Black SOR (Successive Overrelaxation): Variation of Jacobi that exploits yet different mathematical properties of T
• Used in Multigrid
° Sparse LU: Gaussian elimination exploiting particular zero structure of T
° FFT (Fast Fourier Transform): works only on matrices very like T
° Multigrid: also works on matrices like T, that come from elliptic PDEs
° Lower Bound: serial (time to print answer); parallel (time to combine N inputs)
Composite mesh from a mechanical structure
Converting the mesh to a matrix
Effects of Ordering Rows and Columns on Gaussian Elimination
Irregular mesh: NASA Airfoil in 2D (direct solution)
Irregular mesh: Tapered Tube (multigrid)
Adaptive Mesh Refinement (AMR)
° Adaptive mesh around an explosion
° John Bell and Phil Colella at LBL (see class web page for URL)
° Goal of Titanium is to make these algorithms easier to implement in parallel
Computational Electromagnetics
•Developed during 1980s, driven by defense applications
•Determine the RCS (radar cross section) of airplane
•Reduce signature of plane (stealth technology)
•Other applications are antenna design, medical equipment
• Two fundamental numerical approaches:
  - MOM: method of moments (frequency domain), and
  - finite differences (time domain)
Computational Electromagnetics
image: NW Univ. Comp. Electromagnetics Laboratory http://nueml.ece.nwu.edu/
- Discretize surface into triangular facets using standard modeling tools
- Amplitude of currents on surface are unknowns
- Integral equation is discretized into a set of linear equations
Computational Electromagnetics (MOM)
After discretization the integral equation has the form
A x = b
where
A is the (dense) impedance matrix,
x is the unknown vector of amplitudes, and
b is the excitation vector.
(see Cwik, Patterson, and Scott, Electromagnetic Scattering on the Intel Touchstone Delta, IEEE Supercomputing ‘92, pp 538 - 542)
Computational Electromagnetics (MOM)

The main steps in the solution process are:

Fill:        computing the matrix elements of A
Factor:      factoring the dense matrix A
Solve:       solving for one or more excitations b
Field Calc:  computing the fields scattered from the object
Analysis of MOM for Parallel Implementation
Task Work Parallelism Parallel Speed
Fill O(n**2) embarrassing low
Factor O(n**3) moderately diff. very high
Solve O(n**2) moderately diff. high
Field Calc. O(n) embarrassing high
Results for Parallel Implementation on Delta
Task         Time (hours)
Fill            9.20
Factor          8.25
Solve           2.17
Field Calc.     0.12

The problem solved was for a matrix of size 48,672. (The world record in 1991.)
Current Records for Solving Dense Systems
Year    System Size   Machine           # Procs   Gflops (Peak)
1950's  O(100)
1995    128,600       Intel Paragon      6768       281   (338)
1996    215,000       Intel ASCI Red     7264      1068  (1453)
1998    148,000       Cray T3E           1488      1127  (1786)
1998    235,000       Intel ASCI Red     9152      1338  (1830)
1999    374,000       SGI ASCI Blue      5040      1608  (2520)
1999    362,880       Intel ASCI Red     9632      2379  (3207)
2000    430,000       IBM ASCI White     8192      4928 (12000)

source: Alan Edelman http://www-math.mit.edu/~edelman/records.html
LINPACK Benchmark: http://www.netlib.org/performance/html/PDSreports.html
Computational Chemistry
° Seek energy levels of a molecule, crystal, etc.
  • Solve Schroedinger's Equation for energy levels = eigenvalues
  • Discretize to get Ax = λBx, solve for eigenvalues λ and eigenvectors x
  • A and B large, symmetric or Hermitian matrices (B positive definite)
  • May want some or all eigenvalues/eigenvectors
° MP-Quest (Sandia NL)
  • Si and sapphire crystals of up to 3072 atoms
  • Local Density Approximation to Schroedinger Equation
  • A and B up to n = 40000, Hermitian
  • Need all eigenvalues and eigenvectors
  • Need to iterate up to 20 times (for self-consistency)
° Implemented on Intel ASCI Red
  • 9200 Pentium Pro 200 processors (4600 duals, a CLUMP)
  • Overall application ran at 605 Gflop/s (out of 1800 Gflop/s peak)
  • Eigensolver ran at 684 Gflop/s
  • www.cs.berkeley.edu/~stanley/gbell/index.html
  • Runner-up for Gordon Bell Prize at Supercomputing 98
Review of Gaussian Elimination (GE) for solving Ax=b
° Add multiples of each row to later rows to make A upper triangular
° Solve resulting triangular system Ux = c by substitution
for i = 1 to n-1          … for each column i, zero it out below the diagonal
                            by adding multiples of row i to later rows
  for j = i+1 to n        … for each row j below row i
    for k = i to n        … add a multiple of row i to row j
      A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
Refine GE Algorithm (1)
° Initial Version
° Remove computation of constant A(j,i)/A(i,i) from inner loop
for i = 1 to n-1          … for each column i, zero it out below the diagonal
                            by adding multiples of row i to later rows
  for j = i+1 to n        … for each row j below row i
    for k = i to n        … add a multiple of row i to row j
      A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

for i = 1 to n-1
  for j = i+1 to n
    m = A(j,i)/A(i,i)
    for k = i to n
      A(j,k) = A(j,k) - m * A(i,k)
Refine GE Algorithm (2)
° Last version:

for i = 1 to n-1
  for j = i+1 to n
    m = A(j,i)/A(i,i)
    for k = i to n
      A(j,k) = A(j,k) - m * A(i,k)

° Don't compute what we already know: zeros below diagonal in column i

for i = 1 to n-1
  for j = i+1 to n
    m = A(j,i)/A(i,i)
    for k = i+1 to n
      A(j,k) = A(j,k) - m * A(i,k)
Refine GE Algorithm (3)
° Last version:

for i = 1 to n-1
  for j = i+1 to n
    m = A(j,i)/A(i,i)
    for k = i+1 to n
      A(j,k) = A(j,k) - m * A(i,k)

° Store multipliers m below diagonal in zeroed entries for later use

for i = 1 to n-1
  for j = i+1 to n
    A(j,i) = A(j,i)/A(i,i)
    for k = i+1 to n
      A(j,k) = A(j,k) - A(j,i) * A(i,k)
Refine GE Algorithm (4)
° Last version:

for i = 1 to n-1
  for j = i+1 to n
    A(j,i) = A(j,i)/A(i,i)
    for k = i+1 to n
      A(j,k) = A(j,k) - A(j,i) * A(i,k)

° Express using matrix operations (BLAS)

for i = 1 to n-1
  A(i+1:n,i) = A(i+1:n,i) / A(i,i)
  A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
What GE really computes
° Call the strictly lower triangular matrix of multipliers M, and let L = I+M
° Call the upper triangle of the final matrix U
° Lemma (LU Factorization): If the above algorithm terminates (does not divide by zero) then A = L*U
° Solving A*x = b using GE
  • Factorize A = L*U using GE (cost = 2/3 n^3 flops)
  • Solve L*y = b for y, using substitution (cost = n^2 flops)
  • Solve U*x = y for x, using substitution (cost = n^2 flops)
° Thus A*x = (L*U)*x = L*(U*x) = L*y = b as desired

for i = 1 to n-1
  A(i+1:n,i) = A(i+1:n,i) / A(i,i)
  A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
Problems with basic GE algorithm
° What if some A(i,i) is zero? Or very small?
  • Result may not exist, or be "unstable", so need to pivot
° Current computation all BLAS 1 or BLAS 2, but we know that BLAS 3 (matrix multiply) is fastest (Lecture 2)
for i = 1 to n-1
  A(i+1:n,i) = A(i+1:n,i) / A(i,i)                … BLAS 1 (scale a vector)
  A(i+1:n,i+1:n) = A(i+1:n,i+1:n)                 … BLAS 2 (rank-1 update)
                 - A(i+1:n,i) * A(i,i+1:n)

[Figure: IBM RS/6000 Power 3 (200 MHz, 800 Mflop/s peak): Mflop/s vs. order of vectors/matrices for BLAS 1, BLAS 2, BLAS 3, and peak]
Pivoting in Gaussian Elimination
° A = [ 0 1 ]   fails completely, even though A is "easy"
      [ 1 0 ]

° Illustrate problems in 3-decimal-digit arithmetic:

  A = [ 1e-4 1 ]  and  b = [ 1 ],  correct answer to 3 places is x = [ 1 ]
      [ 1    1 ]           [ 2 ]                                     [ 1 ]

° Result of LU decomposition is

  L = [ 1          0 ]  =  [ 1   0 ]      … no roundoff error yet
      [ fl(1/1e-4) 1 ]     [ 1e4 1 ]

  U = [ 1e-4 1           ]  =  [ 1e-4  1   ]   … error in 4th decimal place
      [ 0    fl(1-1e4*1) ]     [ 0    -1e4 ]

  Check: L*U = [ 1e-4 1 ]   … (2,2) entry entirely wrong
               [ 1    0 ]

° The algorithm "forgets" the (2,2) entry: it gets the same L and U for all |A(2,2)| < 5
° Numerical instability
° Computed solution x totally inaccurate
° Cure: Pivot (swap rows of A) so entries of L and U are bounded
Gaussian Elimination with Partial Pivoting (GEPP)

° Partial pivoting: swap rows so that each multiplier |L(j,i)| = |A(j,i)/A(i,i)| <= 1

for i = 1 to n-1
  find and record k where |A(k,i)| = max{i <= j <= n} |A(j,i)|
      … i.e. largest entry in rest of column i
  if |A(k,i)| = 0
      exit with a warning that A is singular, or nearly so
  elseif k != i
      swap rows i and k of A
  end if
  A(i+1:n,i) = A(i+1:n,i) / A(i,i)        … each quotient lies in [-1,1]
  A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)

° Lemma: This algorithm computes A = P*L*U, where P is a permutation matrix
° Since each entry |L(j,i)| <= 1, this algorithm is considered numerically stable
° For details see the LAPACK code at www.netlib.org/lapack/single/sgetf2.f
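A Fortran 90 sketch of the same algorithm, recording the swaps in an ipiv array (names and error handling are illustrative, not LAPACK's):

subroutine gepp(n, a, ipiv)
  ! LU factorization with partial pivoting: on exit A holds the
  ! multipliers below the diagonal and U on and above it.
  implicit none
  integer, intent(in)  :: n
  real, intent(inout)  :: a(n,n)
  integer, intent(out) :: ipiv(n-1)
  integer :: i, k
  real :: tmp(n)
  do i = 1, n-1
     k = (i-1) + maxloc(abs(a(i:n,i)), dim=1)    ! largest entry in rest of column i
     ipiv(i) = k
     if (a(k,i) == 0.0) stop 'A is singular, or nearly so'
     if (k /= i) then                            ! swap rows i and k of A
        tmp = a(i,:);  a(i,:) = a(k,:);  a(k,:) = tmp
     end if
     a(i+1:n,i) = a(i+1:n,i) / a(i,i)            ! each quotient lies in [-1,1]
     a(i+1:n,i+1:n) = a(i+1:n,i+1:n) - matmul(a(i+1:n,i:i), a(i:i,i+1:n))
  end do
end subroutine gepp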
Converting BLAS2 to BLAS3 in GEPP
° Blocking
  • Used to optimize matrix multiplication
  • Harder here because of data dependencies in GEPP
° Delayed updates
  • Save updates to the "trailing matrix" from several consecutive BLAS2 updates
  • Apply many saved updates simultaneously in one BLAS3 operation
° Same idea works for much of dense linear algebra
  • Open questions remain
° Need to choose a block size b
  • Algorithm will save and apply b updates
  • b must be small enough so that an active submatrix consisting of b columns of A fits in cache
  • b must be large enough to make BLAS3 fast
Blocked GEPP (www.netlib.org/lapack/single/sgetrf.f)
for ib = 1 to n-1 step b            … process matrix b columns at a time
  end = ib + b-1                    … point to end of block of b columns
  apply BLAS2 version of GEPP to get A(ib:n, ib:end) = P' * L' * U'
  … let LL denote the strict lower triangular part of A(ib:end, ib:end) + I
  A(ib:end, end+1:n) = LL^(-1) * A(ib:end, end+1:n)   … update next b rows of U
  A(end+1:n, end+1:n) = A(end+1:n, end+1:n)
                      - A(end+1:n, ib:end) * A(ib:end, end+1:n)
  … apply delayed updates with a single matrix-multiply with inner dimension b

(For a correctness proof, see the on-line notes.)
Efficiency of Blocked GEPP
Overview of LAPACK
° Standard library for dense/banded linear algebra
  • Linear systems: A*x = b
  • Least squares problems: min_x || A*x - b ||_2
  • Eigenvalue problems: Ax = λx, Ax = λBx
  • Singular value decomposition (SVD): A = UΣV^T
° Algorithms reorganized to use BLAS3 as much as possible
° Basis of math libraries on many computers
° Many algorithmic innovations remain
  • Projects available
Performance of LAPACK (n=1000)
Performance of LAPACK (n=100)
Parallelizing Gaussian Elimination
° Parallelization steps
  • Decomposition: identify enough parallel work, but not too much
• Assignment: load balance work among threads
• Orchestrate: communication and synchronization
• Mapping: which processors execute which threads
° Decomposition
  • In the BLAS 2 algorithm nearly each flop in the inner loop can be done in parallel, so with n^2 processors, need 3n parallel steps
  • This is too fine-grained; prefer calls to local matmuls instead

for i = 1 to n-1
  A(i+1:n,i) = A(i+1:n,i) / A(i,i)                … BLAS 1 (scale a vector)
  A(i+1:n,i+1:n) = A(i+1:n,i+1:n)                 … BLAS 2 (rank-1 update)
                 - A(i+1:n,i) * A(i,i+1:n)
Assignment of parallel work in GE
° Think of assigning submatrices to threads, where each thread is responsible for updating the submatrix it owns
  • "owner computes" rule is natural because of locality
° What should submatrices look like to achieve load balance?
Different Data Layouts for Parallel GE (on 4 procs)

[Figure: candidate layouts]
• Column blocked layout: bad load balance - P0 is idle after the first n/4 steps
• Column cyclic layout: load balanced, but can't easily use BLAS2 or BLAS3
• Column block cyclic layout: can trade load balance against BLAS2/3 performance by choosing b, but factorization of a block column is a bottleneck
• Block skewed layout: complicated addressing
• Row and column block cyclic layout: the winner!
Blocked Partitioned Algorithms
° LU Factorization
° Cholesky factorization
° Symmetric indefinite factorization
° Matrix inversion
° QR, QL, RQ, LQ factorizations
° Form Q or Q^T C
° Orthogonal reduction to:• (upper) Hessenberg form
• symmetric tridiagonal form
• bidiagonal form
° Block QR iteration for nonsymmetric eigenvalue problems
Memory Hierarchy and LAPACK
° ijk-implementations: the orderings of the loop nest

for _ = 1:n
  for _ = 1:n
    for _ = 1:n
      a(i,j) = a(i,j) + b(i,k)*c(k,j)
    end
  end
end

° The loop order affects the order in which data is referenced; some orderings are better at keeping data in the higher levels of the memory hierarchy.
° Applies to matrix multiply and reductions to condensed form
  • May do slightly more flops
  • Up to 3 times faster
Derivation of Blocked Algorithms: Cholesky Factorization A = U^T U

Partitioning A = U^T U around the jth column:

    [ A11    a_j   A13   ]     [ U11^T  0     0     ] [ U11  u_j   U13   ]
    [ a_j^T  a_jj  α_j^T ]  =  [ u_j^T  u_jj  0     ] [ 0    u_jj  μ_j^T ]
    [ A13^T  α_j   A33   ]     [ U13^T  μ_j   U33^T ] [ 0    0     U33   ]

Equating coefficients of the jth column, we obtain:

  a_j  = U11^T u_j
  a_jj = u_j^T u_j + u_jj^2

Hence, if U11 has already been computed, we can compute u_j and u_jj from the equations:

  U11^T u_j = a_j
  u_jj^2    = a_jj - u_j^T u_j
LINPACK Implementation
° Here is the body of the LINPACK routine SPOFA which implements the method:
DO 30 J = 1, N
INFO = J
S = 0.0E0
JM1 = J - 1
IF( JM1.LT.1 ) GO TO 20
DO 10 K = 1, JM1
T = A( K, J ) - SDOT( K-1, A( 1, K ), 1,A( 1, J ), 1 )
T = T / A( K, K )
A( K, J ) = T
S = S + T*T
10 CONTINUE
20 CONTINUE
S = A( J, J ) - S
C ...EXIT
IF( S.LE.0.0E0 ) GO TO 40
A( J, J ) = SQRT( S )
30 CONTINUE
LAPACK Implementation
      DO 10 J = 1, N
         CALL STRSV( 'Upper', 'Transpose', 'Non-Unit', J-1, A, LDA,
     $               A( 1, J ), 1 )
         S = A( J, J ) - SDOT( J-1, A( 1, J ), 1, A( 1, J ), 1 )
         IF( S.LE.ZERO ) GO TO 20
         A( J, J ) = SQRT( S )
   10 CONTINUE

° This change by itself is sufficient to significantly improve the performance on a number of machines.
° From 238 to 312 Mflop/s for a matrix of order 500 on a Pentium 4 at 1.7 GHz.
° However, peak is 1,700 Mflop/s, which suggests that further work is needed.
Derivation of Blocked Algorithms

Partitioning A = U^T U into blocks of columns:

    [ A11    A12    A13  ]     [ U11^T  0      0     ] [ U11  U12  U13 ]
    [ A12^T  A22    A23  ]  =  [ U12^T  U22^T  0     ] [ 0    U22  U23 ]
    [ A13^T  A23^T  A33  ]     [ U13^T  U23^T  U33^T ] [ 0    0    U33 ]

Equating coefficients of the second block of columns, we obtain:

  A12 = U11^T U12
  A22 = U12^T U12 + U22^T U22

Hence, if U11 has already been computed, we can compute U12 as the solution of

  U11^T U12 = A12

by a call to the Level 3 BLAS routine STRSM; then

  U22^T U22 = A22 - U12^T U12
LAPACK Blocked Algorithms

      DO 10 J = 1, N, NB
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit', J-1, JB,
     $               ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ),
     $               LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE

On the Pentium 4, the Level 3 BLAS squeezes a lot more out of one processor:

Rate of Execution, Intel Pentium 4 1.7 GHz, N = 500
  Linpack variant (L1 BLAS)     238 Mflop/s
  Level 2 BLAS variant          312 Mflop/s
  Level 3 BLAS variant         1262 Mflop/s
LAPACK Contents
° Combines algorithms from LINPACK and EISPACK into a single package. User interface similar to LINPACK.
° Built on the Level 1, 2 and 3 BLAS for high performance (manufacturers optimize the BLAS)
° LAPACK does not provide routines for structured problems or general sparse matrices (i.e. sparse storage formats such as compressed-row, -column, -diagonal, skyline, ...).
LAPACK Ongoing Work
° Add functionality
  • updating/downdating, divide-and-conquer least squares, bidiagonal bisection, bidiagonal inverse iteration, band SVD, Jacobi methods, ...
° Move to new generation of high performance machines
  • IBM SPs, CRAY T3E, SGI Origin, clusters of workstations
° New challenges
  • New languages: FORTRAN 90, HP FORTRAN, ...
  • Many flavors of message passing (CMMD, MPL, NX, ...); need a standard (PVM, MPI): BLACS
  • Highly varying ratio of computational speed to communication speed
  • Many ways to lay out data
  • Fastest parallel algorithm sometimes less stable numerically
History of Block Partitioned Algorithms
° Early algorithms involved use of small main memory using tapes as secondary storage.
° Recent work centers on use of vector registers, level 1 and 2 cache, main memory, and “out of core” memory.