Lecture 6: Linear Algebra Algorithms
Jack Dongarra, U of Tennessee
Slides are adapted from Jim Demmel, UCB's Lecture on Linear Algebra Algorithms
Homework #3 - Grading Rules

Part 1: (2 points)
  (1) code produces the correct result: 2/2
  (2) code runs but does not give an accurate result: 1.5/2
  (3) code does not run, or runs but produces nothing: 1/2
  Note: If you only submit one program which prints "processor contribution" and "total integral" correctly, I will give 1/2 for this part and 2/2 for part 2, so your total score for parts 1 and 2 will be 3/4.

Part 2: (2 points)
  (1) code produces the correct result: 2/2
  (2) code runs but does not give an accurate result: 1.5/2
  (3) code does not run, or runs but produces nothing: 1/2

Part 3: (4 points)
  (1) gives t_p(1), t_p(p), S(p) and E(p) correctly: 4/4
  (2) gives t_p(1), t_p(p) and S(p) correctly: 3/4
  (3) gives t_p(1) and t_p(p) correctly: 2/4
  (4) gives other: 1/4

Part 4: (2 points)
  (1) contains a sentence similar to "if the work is evenly divided and the summation can be performed in a tree fashion, the algorithm is scalable", or gives a detailed, correct discussion of (a) t_p(p), (b) S(p) and (c) E(p): 2/2
  (2) gives a correct discussion of any of (a) t_p(p), (b) S(p) and (c) E(p): 1/2
  (3) does not make sense: 0/2
Parallel Performance Metrics

• Absolute: Elapsed (wall-clock) Time = T(n)
• Speedup = S(n) = T(1) / T(n), where T(1) is the time for the best serial implementation
  => Performance improvement due to parallelism
• Parallel Efficiency = E(n) = T(1) / (n T(n))
• Ideal Speedup = SI(n) = n
  - Theoretical limit; rarely obtainable
  - Ignores all of real life
• These definitions apply to a fixed-problem experiment.
Amdahl's Law

Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl's Law are given below:

tN = (fp/N + fs) t1      Effect of multiple processors on run time
S  = 1/(fs + fp/N)       Effect of multiple processors on speedup

where:
  fs = serial fraction of code
  fp = parallel fraction of code = 1 - fs
  N  = number of processors
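As a quick numerical check of the speedup formula, here is a minimal Fortran sketch (mine, not from the slides) that tabulates S for a serial fraction fs = 0.01; the speedup visibly saturates near 1/fs = 100:

! Sketch: evaluate Amdahl's Law S = 1/(fs + fp/N) for fs = 0.01
program amdahl
  implicit none
  real :: fs, fp, s
  integer :: i, nproc
  fs = 0.01
  fp = 1.0 - fs
  do i = 0, 4
     nproc = 10**i                       ! N = 1, 10, 100, 1000, 10000
     s = 1.0 / (fs + fp/real(nproc))
     print '(a,i6,a,f8.2)', 'N =', nproc, '   S =', s
  end do
end program amdahl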
Illustration of Amdahl's Law

[Figure: speedup vs. number of processors for several values of the parallel fraction fp]

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors.
Amdahl's Law Vs. Reality

Amdahl's Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.

[Figure: speedup vs. number of processors (0-250) for fp = 0.99, comparing the Amdahl's Law prediction with reality]
More on Amdahl's Law

° Amdahl's Law can be generalized to any two processes with different speeds

° Ex.: apply it to f_processor and f_memory:
  • The growing processor-memory performance gap will undermine our efforts at achieving the maximum possible speedup!
Gustafson's Law

° Thus, Amdahl's Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large.

tN = (fp/N + fs) t1      Effect of multiple processors on run time
S  = 1/(fs + fp/N)       Effect of multiple processors on speedup

where:
  fs = serial fraction of code
  fp = parallel fraction of code = 1 - fs
  N  = number of processors

° There is a way around this: increase the problem size
  • bigger problems mean bigger grids or more particles: bigger arrays
  • the number of serial operations generally remains constant; the number of parallel operations increases: the parallel fraction increases
Fixed-Problem Size Scaling

• a.k.a. fixed-load, strong scaling, problem-constrained, constant-problem-size (CPS), variable subgrid
• Amdahl Limit: SA(n) = T(1) / T(n) = 1 / ( f/n + (1 - f) )

• This bounds the speedup based only on the fraction of the code that cannot use parallelism (1 - f); it ignores all other factors

• SA --> 1/(1 - f) as n --> ∞
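For example, with f = 0.99 the bound is SA --> 1/(1 - 0.99) = 100: even with unlimited processors, the 1% of the code that stays serial caps the speedup at 100.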
Fixed-Problem Size Scaling (Cont'd)

• Efficiency: E(n) = T(1) / [ T(n) * n ]
• Memory requirements per processor decrease with n
• Surface-to-volume ratio increases with n
• Superlinear speedup possible from cache effects
• Motivation: what is the largest # of procs I can use effectively, and what is the fastest time in which I can solve a given problem?
• Problems:
  - Sequential runs often not possible (large problems)
  - Speedup (and efficiency) is misleading if processors are slow
Fixed-Problem Size Scaling: Examples

S. Goedecker and A. Hoisie, Achieving High Performance in Numerical Computations on RISC Workstations and Parallel Systems, International Conference on Computational Physics PC'97, Santa Cruz, August 25-28, 1997.
Scaled Speedup Experiments

• a.k.a. fixed subgrid-size, weak scaling, Gustafson scaling
• Motivation: want to use a larger machine to solve a larger global problem in the same amount of time
• Memory and surface-to-volume effects remain constant.
Top500 Data

Rank  Manufacturer  Computer                        Installation Site                        Rmax [TF/s]  Rpeak [TF/s]  Max/Peak [%]  # Proc
  1   IBM           ASCI White, SP Power3           Lawrence Livermore National Laboratory      7.23         12.0          60.25        8192
  2   Compaq        AlphaServer SC ES45 1 GHz       Pittsburgh Supercomputing Center            4.06          6.05         67.11        3024
  3   IBM           SP Power3 375 MHz               NERSC/LBNL                                  3.05          5.00         61.00        3328
  4   Intel         ASCI Red                        Sandia National Laboratory                  2.38          3.207        74.21        9632
  5   IBM           ASCI Blue Pacific SST, SP 604E  Lawrence Livermore National Laboratory      2.14          3.868        55.33        5808
  6   Compaq        AlphaServer SC ES45 1 GHz       Los Alamos National Laboratory              2.10          3.04         69.08        1536
  7   Hitachi       SR8000/MPP                      University of Tokyo                         1.71          2.69         63.57        1152
  8   SGI           ASCI Blue Mountain              Los Alamos National Laboratory              1.61          2.52         63.89        6144
  9   IBM           SP Power3 375 MHz               Naval Oceanographic Office (NAVOCEANO)      1.42          1.97         72.08        1336
 10   IBM           SP Power3 375 MHz               Deutscher Wetterdienst                      1.29          1.83         70.49        1280
Example of a Scaled Speedup Experiment

Processors  NChains  Time   Natoms  Time per Atom  Time per PE per Atom  Efficiency
     1          32   38.4     2368    1.62E-02         1.62E-02            1.000
     2          64   38.4     4736    8.11E-03         1.62E-02            1.000
     4         128   38.5     9472    4.06E-03         1.63E-02            0.997
     8         256   38.6    18944    2.04E-03         1.63E-02            0.995
    16         512   38.7    37888    1.02E-03         1.63E-02            0.992
    32         940   35.7    69560    5.13E-04         1.64E-02            0.987
    64        1700   32.7   125800    2.60E-04         1.66E-02            0.975
   128        2800   27.4   207200    1.32E-04         1.69E-02            0.958
   256        4100   20.75  303400    6.84E-05         1.75E-02            0.926
   512        5300   14.49  392200    3.69E-05         1.89E-02            0.857

[Figure: efficiency vs. number of processors for TBON on ASCI Red]
Parallel Performance Metrics: Speedup

Speedup is only one characteristic of a program - it is not synonymous with performance. In this comparison of two machines the code achieves comparable speedups, but one of the machines is faster.

Absolute performance: [Figure: MFLOPS vs. number of processors (0-48) for T3E and O2K, each with its ideal scaling line]

Relative performance: [Figure: speedup vs. number of processors (0-60) for T3E and O2K against the ideal line]
Improving Ratio of Floating Point Operations to Memory Accesses
subroutine mult(n1,nd1,n2,nd2,y,a,x)
implicit real*8 (a-h,o-z)
dimension a(nd1,nd2),y(nd2),x(nd1)
do 20, i=1,n1
t=0.d0
do 10, j=1,n2
t=t+a(j,i)*x(j)
10 continue
y(i)=t
20 continue
return
end
Inner loop: 2 FLOPS, 2 LOADS.
Unroll the loops!
Improving Ratio of Floating Point Operations to Memory Accesses
c works correctly when n1, n2 are multiples of 4
      dimension a(nd1,nd2), y(nd2), x(nd1)
      do i=1,n1-3,4
         t1=0.d0
         t2=0.d0
         t3=0.d0
         t4=0.d0
         do j=1,n2-3,4
            t1=t1+a(j+0,i+0)*x(j+0)+a(j+1,i+0)*x(j+1)+
     1            a(j+2,i+0)*x(j+2)+a(j+3,i+0)*x(j+3)
            t2=t2+a(j+0,i+1)*x(j+0)+a(j+1,i+1)*x(j+1)+
     1            a(j+2,i+1)*x(j+2)+a(j+3,i+1)*x(j+3)
            t3=t3+a(j+0,i+2)*x(j+0)+a(j+1,i+2)*x(j+1)+
     1            a(j+2,i+2)*x(j+2)+a(j+3,i+2)*x(j+3)
            t4=t4+a(j+0,i+3)*x(j+0)+a(j+1,i+3)*x(j+1)+
     1            a(j+2,i+3)*x(j+2)+a(j+3,i+3)*x(j+3)
         enddo
         y(i+0)=t1
         y(i+1)=t2
         y(i+2)=t3
         y(i+3)=t4
      enddo

Inner loop: 32 FLOPS, 20 LOADS.
Summary of Single-Processor Optimization Techniques (I)
° Spatial and temporal data locality
° Loop unrolling
° Blocking
° Software pipelining
° Optimization of data structures
° Special functions, library subroutines
Summary of Optimization Techniques (II)

° Achieving high performance requires code restructuring. Minimization of memory traffic is the single most important goal.
° Compilers are getting better: good at software pipelining. But they are not there yet: can do loop transformations only in simple cases, usually fail to produce optimal blocking, heuristics for unrolling may not match your code well, etc.
° The optimization process is machine-specific and requires detailed architectural knowledge.
Optimizing Matrix Addition for Caches
° Dimension A(n,n), B(n,n), C(n,n)
° A, B, C stored by column (as in Fortran)
° Algorithm 1: for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j)
° Algorithm 2: for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j)
° What is “memory access pattern” for Algs 1 and 2?
° Which is faster?
° What if A, B, C stored by row (as in C)?
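A minimal sketch of the two algorithms (with illustrative timing added; the program is mine, not from the slide). Because Fortran stores A, B, C by column, Algorithm 2's inner loop walks down a column with stride 1, while Algorithm 1's inner loop jumps across columns with stride n, so Algorithm 2 normally wins:

program addloops
  implicit none
  integer, parameter :: n = 256
  real :: a(n,n), b(n,n), c(n,n)
  real :: t0, t1, t2
  integer :: i, j
  call random_number(b)
  call random_number(c)
  call cpu_time(t0)
  do i = 1, n            ! Algorithm 1: inner loop strides across rows (stride n)
     do j = 1, n
        a(i,j) = b(i,j) + c(i,j)
     end do
  end do
  call cpu_time(t1)
  do j = 1, n            ! Algorithm 2: inner loop walks down columns (stride 1)
     do i = 1, n
        a(i,j) = b(i,j) + c(i,j)
     end do
  end do
  call cpu_time(t2)
  print *, 'Algorithm 1:', t1-t0, '   Algorithm 2:', t2-t1
end program addloops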
Using a Simpler Model of Memory to Optimize

° Assume just 2 levels in the hierarchy, fast and slow
° All data initially in slow memory
  • m  = number of memory elements (words) moved between fast and slow memory
  • tm = time per slow memory operation
  • f  = number of arithmetic operations
  • tf = time per arithmetic operation, tf < tm
  • q  = f/m = average number of flops per slow element access
° Minimum possible Time = f*tf, when all data is in fast memory
° Actual Time = f*tf + m*tm = f*tf*(1 + (tm/tf)*(1/q))
° Larger q means Time closer to minimum f*tf
  • Want large q
Simple example using memory model

° To see the effect of changing q, consider this simple computation:

s = 0
for i = 1, n
  s = s + h(X[i])

° Assume the machine runs at 1 Mflop/s on data in fast memory (tf = 1)
° Assume moving data costs tm = 10
° Assume h takes q flops
° Assume array X is in slow memory
° So m = n and f = q*n
° Time = read X + compute = 10*n + q*n
° Mflop/s = f/Time = q/(10 + q)
° As q increases, this approaches the "peak" speed of 1 Mflop/s; e.g., q = 10 already gives 0.5 Mflop/s
Warm up: Matrix-vector multiplication y = y + A*x
for i = 1:n
for j = 1:n
y(i) = y(i) + A(i,j)*x(j)
[diagram: y(i) = y(i) + A(i,:) * x(:)]
Warm up: Matrix-vector multiplication y = y + A*x
{read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
{read row i of A into fast memory}
for j = 1:n
y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}
° m = number of slow memory refs = 3*n + n^2
° f = number of arithmetic operations = 2*n^2
° q = f/m ≈ 2
° Matrix-vector multiplication is limited by slow memory speed
Matrix Multiply C=C+A*B
for i = 1 to n
for j = 1 to n
for k = 1 to n
C(i,j) = C(i,j) + A(i,k) * B(k,j)
[diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
Matrix Multiply C=C+A*B(unblocked, or untiled)
for i = 1 to n
{read row i of A into fast memory}
for j = 1 to n
{read C(i,j) into fast memory}
{read column j of B into fast memory}
for k = 1 to n
C(i,j) = C(i,j) + A(i,k) * B(k,j)
{write C(i,j) back to slow memory}
[diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
Matrix Multiply (unblocked, or untiled)
Number of slow memory references on unblocked matrix multiply:

m = n^3          read each column of B n times
  + n^2          read each row of A once
  + 2*n^2        read and write each element of C once
  = n^3 + 3*n^2

So q = f/m = (2*n^3)/(n^3 + 3*n^2)
           ≈ 2 for large n, no improvement over matrix-vector multiply

[diagram: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
q=ops/slow mem ref
Matrix Multiply (blocked, or tiled)
Consider A,B,C to be N by N matrices of b by b subblocks where b=n/N is called the blocksize
for i = 1 to N
for j = 1 to N
{read block C(i,j) into fast memory}
for k = 1 to N
{read block A(i,k) into fast memory}
{read block B(k,j) into fast memory}
C(i,j) = C(i,j) + A(i,k) * B(k,j) {do a matrix multiply on blocks}
{write block C(i,j) back to slow memory}
[diagram: C(i,j) = C(i,j) + A(i,k) * B(k,j)]
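A runnable Fortran sketch of the tiled loop nest above; the block size b is a tuning assumption (three b-by-b blocks should fit in fast memory), and b is taken to divide n for simplicity:

program blockedmm
  implicit none
  integer, parameter :: n = 256, b = 32       ! assume b divides n
  real :: a(n,n), bm(n,n), c(n,n)
  integer :: i, j, k, i0, j0, k0
  call random_number(a)
  call random_number(bm)
  c = 0.0
  do i0 = 1, n, b
     do j0 = 1, n, b
        do k0 = 1, n, b
           ! multiply block A(i0,k0) by block B(k0,j0) into block C(i0,j0)
           do j = j0, j0+b-1
              do k = k0, k0+b-1
                 do i = i0, i0+b-1
                    c(i,j) = c(i,j) + a(i,k)*bm(k,j)
                 end do
              end do
           end do
        end do
     end do
  end do
  print *, 'c(n,n) =', c(n,n)
end program blockedmm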
Matrix Multiply (blocked or tiled)
Why is this algorithm correct?
Number of slow memory references on blocked matrix multiply:

m = N*n^2        read each block of B N^3 times (N^3 * n/N * n/N)
  + N*n^2        read each block of A N^3 times
  + 2*n^2        read and write each block of C once
  = (2*N + 2)*n^2

So q = f/m = 2*n^3 / ((2*N + 2)*n^2)
           ≈ n/N = b for large n

So we can improve performance by increasing the blocksize b.
Can be much faster than matrix-vector multiply (q = 2).

Limit: all three blocks from A, B, C must fit in fast memory (cache), so we
cannot make these blocks arbitrarily large: 3*b^2 <= M, so q ≈ b <= sqrt(M/3)

Theorem (Hong, Kung, 1981): Any reorganization of this algorithm
(that uses only associativity) is limited to q = O(sqrt(M))

q = ops/slow mem ref
Model

° As much as possible will be overlapped
° Dot Product:

ACC = 0
do i = 1, n
  ACC = ACC + x(i)*y(i)
end do

° Experiments done on an IBM RS6000/530
  • 25 MHz
  • FMA takes 2 cycles to complete but can be pipelined
    - => 50 Mflop/s peak
  • one cycle per load from cache

[diagram: pipelined FMA units]
DOT Operation - Data in Cache

      DO 10 I = 1, n
         T = T + X(I)*Y(I)
10    CONTINUE

° Theoretically, 2 loads for X(I) and Y(I), one FMA operation, no re-use of data
° Pseudo-assembler:

      LOAD fp0,T
label:
      LOAD fp1,X(I)
      LOAD fp2,Y(I)
      FMA fp0,fp0,fp1,fp2
      BRANCH label:

With one load per cycle, the two loads limit us to one FMA result per 2 cycles = 25 Mflop/s.
Matrix-Vector Product
° DOT versionDO 20 I = 1, M
DO 10 J = 1, N
Y(I) = Y(I) + A(I,J)*X(J)
10 CONTINUE
20 CONTINUE
° From Cache = 22.7 Mflops
° From Memory = 12.4 Mflops
Loop Unrolling
DO 20 I = 1, M, 2
T1 = Y(I )
T2 = Y(I+1)
DO 10 J = 1, N
T1 = T1 + A(I,J )*X(J)
T2 = T2 + A(I+1,J)*X(J)
10 CONTINUE
Y(I ) = T1
Y(I+1) = T2
20 CONTINUE
° 3 loads, 4 flops per iteration
° Speed of y = y + A^T*x, N = 48:

Unroll depth         1     2     3     4     ∞
Theoretical speed   25   33.3  37.5   40    50
Measured (cache)   22.7  30.5  34.3  36.5
From memory        12.4  12.7  12.7  12.6
° unroll 1: 2 loads : 2 ops per 2 cycles
° unroll 2: 3 loads : 4 ops per 3 cycles
° unroll 3: 4 loads : 6 ops per 4 cycles
° …
° unroll n: n+1 loads : 2n ops per n+1 cycles
° problem: only so many registers
Matrix Multiply
° DOT version - 25 Mflops in cache

DO 30 J = 1, M
DO 20 I = 1, M
DO 10 K = 1, L
C(I,J) = C(I,J) + A(I,K)*B(K,J)
10 CONTINUE
20 CONTINUE
30 CONTINUE
How to Get Near Peak
DO 30 J = 1, M, 2
DO 20 I = 1, M, 2
T11 = C(I, J )
T12 = C(I, J+1)
T21 = C(I+1,J )
T22 = C(I+1,J+1)
DO 10 K = 1, L
T11 = T11 + A(I, K) *B(K,J )
T12 = T12 + A(I, K) *B(K,J+1)
T21 = T21 + A(I+1,K)*B(K,J )
T22 = T22 + A(I+1,K)*B(K,J+1)
10 CONTINUE
C(I, J ) = T11
C(I, J+1) = T12
C(I+1,J ) = T21
C(I+1,J+1) = T22
20 CONTINUE
30 CONTINUE
° Inner loop: • 4 loads, 8 operations, optimal.
° In practice we have measured 48.1 out of a peak of 50 Mflop/s when in cache
BLAS -- Introduction

The BLAS (Basic Linear Algebra Subroutines) offer:

° Clarity: code is shorter and easier to read,
° Modularity: gives programmer larger building blocks,
° Performance: manufacturers will provide tuned machine-specific BLAS,
° Program portability: machine dependencies are confined to the BLAS
Memory Hierarchy

Registers
L1 Cache
L2 Cache
Local Memory
Remote Memory
Secondary Memory

° Key to high performance is effective use of the memory hierarchy
° True on all architectures
Level 1, 2 and 3 BLAS
° Level 1 BLAS Vector-Vector operations
° Level 2 BLAS Matrix-Vector operations
° Level 3 BLAS Matrix-Matrix operations
More on BLAS (Basic Linear Algebra Subroutines)
° Industry standard interface (evolving)
° Vendors, others supply optimized implementations
° History
  • BLAS1 (1970s):
    - vector operations: dot product, saxpy (y = a*x + y), etc.
    - m = 2*n, f = 2*n, q ~ 1 or less
  • BLAS2 (mid 1980s):
    - matrix-vector operations: matrix-vector multiply, etc.
    - m = n^2, f = 2*n^2, q ~ 2, less overhead
    - somewhat faster than BLAS1
  • BLAS3 (late 1980s):
    - matrix-matrix operations: matrix-matrix multiply, etc.
    - m >= 4*n^2, f = O(n^3), so q can possibly be as large as n; BLAS3 is potentially much faster than BLAS2
° Good algorithms use BLAS3 when possible (LAPACK)
° www.netlib.org/blas, www.netlib.org/lapack
Why Higher Level BLAS?
° Can only do arithmetic on data at the top of the hierarchy
° Higher level BLAS lets us do this
Level     Operation     Memory Refs   Flops    Flops/Memory Refs
Level 1   y = y + a*x   3n            2n       2/3
Level 2   y = y + A*x   n^2           2n^2     2
Level 3   C = C + A*B   4n^2          2n^3     n/2

[diagram: memory hierarchy - Registers, L1 Cache, L2 Cache, Local Memory, Remote Memory, Secondary Memory]
BLAS for Performance
IBM RS/6000-590 (66 MHz, 264 Mflop/s peak)

[Figure: Mflop/s vs. order of vectors/matrices (10-500) for Level 1, Level 2, and Level 3 BLAS]
BLAS for Performance
Alpha EV5/6 500 MHz (1 Gflop/s peak)

[Figure: Mflop/s vs. order of vectors/matrices (10-500) for Level 1, Level 2, and Level 3 BLAS]

BLAS 3 (n-by-n matrix-matrix multiply) vs. BLAS 2 (n-by-n matrix-vector multiply) vs. BLAS 1 (saxpy of n vectors)
Fast linear algebra kernels: BLAS
• Simple linear algebra kernels such as matrix-matrix multiply
• More complicated algorithms can be built from these basic kernels
• The interfaces of these kernels have been standardized as the Basic Linear Algebra Subroutines (BLAS)
• Early agreement on a standard interface (~1980)
• Led to portable libraries for vector and shared memory parallel machines
• On distributed memory, there is a less standard interface called the PBLAS
Level 1 BLAS
• Operate on vectors or pairs of vectors
  - perform O(n) operations
  - return either a vector or a scalar
• saxpy: y(i) = a*x(i) + y(i), for i = 1 to n
  - s stands for single precision; daxpy is for double precision, caxpy for complex, and zaxpy for double complex
• sscal: y = a*x, for scalar a and vectors x, y
• sdot computes s = Σ_{i=1}^{n} x(i)*y(i)
Level 2 BLAS
• Operate on a matrix and a vector
  - return a matrix or a vector
  - O(n^2) operations
• sgemv: matrix-vector multiply: y = y + A*x
  - where A is m-by-n, x is n-by-1 and y is m-by-1
• sger: rank-one update: A = A + y*x^T, i.e. A(i,j) = A(i,j) + y(i)*x(j)
  - where A is m-by-n, y is m-by-1, x is n-by-1
• strsv: triangular solve
  - solves y = T*x for x, where T is triangular
Level 3 BLAS
• Operate on pairs or triples of matrices, returning a matrix
  - complexity is O(n^3)
• sgemm: matrix-matrix multiplication: C = C + A*B
  - where C is m-by-n, A is m-by-k, and B is k-by-n
• strsm: multiple triangular solve: solves Y = T*X for X
  - where T is a triangular matrix and X is a rectangular matrix
Optimizing in practice
° Tiling for registers
  • loop unrolling, use of named "register" variables
° Tiling for multiple levels of cache
° Exploiting fine-grained parallelism within the processor
• superscalar execution
• pipelining
° Complicated compiler interactions
° Hard to do by hand (but you’ll try)
° Automatic optimization an active research area• PHIPAC: www.icsi.berkeley.edu/~bilmes/phipac
• www.cs.berkeley.edu/~iyer/asci_slides.ps
• ATLAS: www.netlib.org/atlas/index.html
BLAS -- References
° BLAS software and documentation can be obtained via:
• WWW: http://www.netlib.org/blas,
• (anonymous) ftp ftp.netlib.org: cd blas; get index
• email [email protected] with the message: send index from blas
° Comments and questions can be addressed to: [email protected]
BLAS Papers
° C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, 5:308--325, 1979.
° J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, An Extended Set of Fortran Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 14(1):1--32, 1988.
° J. Dongarra, J. Du Croz, I. Duff, S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 16(1):1--17, 1990.
Performance of BLAS
• BLAS are specially optimized by the vendor
  - Sun BLAS uses features in the UltraSPARC
• Big payoff for algorithms that can be expressed in terms of BLAS3 instead of BLAS2 or BLAS1
• The top speed is that of the BLAS3
• Algorithms like Gaussian elimination are organized so that they use BLAS3
How To Get Performance From Commodity Processors?
° Today’s processors can achieve high-performance, but this requires extensive machine-specific hand tuning.
° Routines have a large design space with many parameters
  • blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules
  • Complicated interactions with the increasingly sophisticated microarchitectures of new microprocessors
° A few months ago there was no tuned BLAS for the Pentium under Linux.
° Need for quick/dynamic deployment of optimized routines.
° ATLAS: Automatically Tuned Linear Algebra Software
  • PHiPAC from Berkeley is a similar effort
[diagram: blocked matrix multiply C = A*B, with C M-by-N, A M-by-K, B K-by-N, processed in NB-sized blocks]
Adaptive Approach for Level 3
° Do a parameter study of the operation on the target machine, done once.
° Only generated code is on-chip multiply
° BLAS operation written in terms of generated on-chip multiply
° All transpose cases coerced through data copy to one case of on-chip multiply
  • Only one case generated per platform
Code Generation Strategy
° Code is iteratively generated & timed until optimal case is found. We try:
• Differing NBs
• Breaking false dependencies
• M, N and K loop unrolling
° On-chip multiply optimizes for:
• TLB access
• L1 cache reuse
• FP unit usage
• Memory fetch
• Register reuse
• Loop overhead minimization
° Takes a couple of hours to run.
500x500 Double Precision Matrix-Matrix Multiply Across Multiple Architectures

[Figure: Mflop/s for vendor matrix multiply vs. ATLAS matrix multiply on DEC Alpha 21164a-433, HP PA8000 180 MHz, HP 9000/735/125, IBM Power2-135, IBM PowerPC604e-332, Pentium MMX-150, Pentium Pro-200, Pentium II-266, SGI R4600, SGI R5000, SGI R8000ip21, SGI R10000ip27, Sun Microsparc II Model 70, Sun Darwin-270, Sun Ultra2 Model 2200]
500x500 Double Precision LU Factorization Performance Across Multiple Architectures

[Figure: MFLOPS for LU with vendor BLAS vs. LU with ATLAS & GEMM-based BLAS on DEC LX 21164a-533, DEC Alpha 21164a-433, HP PA8000, IBM Power2-135, IBM PowerPC604e-332, Pentium Pro-200, Pentium II-266, SGI R5000, SGI R10000ip27, Sun Darwin-270, Sun Ultra2 Model 2200]
500x500 gemm-based BLAS on SGI R10000ip28

[Figure: MFLOPS for DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM with vendor BLAS, ATLAS/SSBLAS, and reference BLAS]
500x500 gemm-based BLAS on UltraSparc 2200

[Figure: MFLOPS for the Level 3 BLAS routines DGEMM, DSYMM, DSYR2K, DSYRK, DTRMM, DTRSM with vendor BLAS, ATLAS/GEMM-based BLAS, and reference BLAS]
Recursive Approach for Other Level 3 BLAS
° Recur down to L1 cache block size
° Need kernel at bottom of recursion
• Use gemm-based kernel for portability
Recursive TRMM
[diagram: recursive blocking pattern for triangular matrix multiply]
500x500 Level 2 BLAS DGEMV

[Figure: MFLOPS for vendor NoTrans, ATLAS NoTrans, and F77 NoTrans DGEMV on AMD Athlon-600, DEC ev56-533, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-550, SGI R10000ip28-200, SGI R12000ip30-270, Sun UltraSparc2-200]
Multi-Threaded DGEMM, Intel PIII 550 MHz

[Figure: Mflop/s vs. matrix size (100-1000) for Intel BLAS 1 proc, ATLAS 1 proc, Intel BLAS 2 proc, ATLAS 2 proc]
ATLAS
° Keep a repository of kernels for specific machines.
° Develop a means of dynamically downloading code
° Extend work to allow sparse matrix operations
° Extend work to include arbitrary code segments
° See: http://www.netlib.org/atlas/
Algorithms and Architecture
° The key to performance is to understand the algorithm and architecture interaction.
° A significant improvement in performance can be obtained by matching the algorithm to the architecture, or vice versa.
Algorithm Issues
° Use of memory hierarchy
° Algorithm pre-fetching
° Loop unrolling
° Simulating higher precision arithmetic
Blocking

° TLB blocking - minimize TLB misses
° Cache Blocking - minimize cache misses
° Register Blocking - minimize load/stores
° The general idea of blocking is to get the information to a high-speed storage and use it multiple times so as to amortize the cost of moving the data
° Cache blocking reduces traffic between memory and cache
° Register blocking reduces traffic between cache and CPU
Loop Unrolling
° Reduces data dependency delay
° Exploits multiple functional units and quad load/stores effectively.
° Minimizes load/stores
° Reduces loop overheads
° Gives more flexibility to compiler in scheduling
° Facilitates algorithm pre-fetching.
° What about vector computing?
What's Wrong With Speedup T1/Tp?
° Can lead to very false conclusions.
° Speedup in isolation without taking into account the speed of the processor is unrealistic and pointless.
° Speedup over what?
° T1/Tp
  • There is usually no doubt about Tp
• Often considerable dispute over the meaning of T1
- Serial code? Same algorithm?
Speedup
° Can be used to:• Study, in isolation, the scaling of one algorithm on one computer.
• As a dimensionless variable in the theory of scaling.
° Should not be used to compare:• Different algorithms on the same computer
• The same algorithm on different computers.
• Different interconnection structures.
Strassen's Algorithm for Matrix Multiply

Usual Matrix Multiply:

[ C11 C12 ]   [ A11 A12 ] [ B11 B12 ]
[ C21 C22 ] = [ A21 A22 ] [ B21 B22 ]

C11 = A11*B11 + A12*B21
C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21
C22 = A21*B12 + A22*B22
Strassen's Algorithm

P1 = (A11 + A22)(B11 + B22)
P2 = (A21 + A22) B11
P3 = A11 (B12 - B22)
P4 = A22 (B21 - B11)
P5 = (A11 + A12) B22
P6 = (A21 - A11)(B11 + B12)
P7 = (A12 - A22)(B21 + B22)

C11 = P1 + P4 - P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 + P3 - P2 + P6
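As a sanity check of these formulas, the sketch below applies them with 1-by-1 (scalar) blocks and compares against the ordinary product; since only +, - and * are used, the same identities hold when the Aij and Bij are matrix blocks:

program strassen2x2
  implicit none
  real :: a(2,2), b(2,2), c(2,2)
  real :: p1, p2, p3, p4, p5, p6, p7
  call random_number(a)
  call random_number(b)
  p1 = (a(1,1)+a(2,2)) * (b(1,1)+b(2,2))
  p2 = (a(2,1)+a(2,2)) * b(1,1)
  p3 = a(1,1) * (b(1,2)-b(2,2))
  p4 = a(2,2) * (b(2,1)-b(1,1))
  p5 = (a(1,1)+a(1,2)) * b(2,2)
  p6 = (a(2,1)-a(1,1)) * (b(1,1)+b(1,2))
  p7 = (a(1,2)-a(2,2)) * (b(2,1)+b(2,2))
  c(1,1) = p1 + p4 - p5 + p7
  c(1,2) = p3 + p5
  c(2,1) = p2 + p4
  c(2,2) = p1 + p3 - p2 + p6
  print *, 'max error:', maxval(abs(c - matmul(a,b)))   ! should be ~0
end program strassen2x2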
Strassen’s Algorithm
° The count of arithmetic operations (for one level of recursion on 2x2 blocks) is:

           Mults   Adds   Complexity
Regular      8       4    O(n^3)
Strassen     7      18    O(n^log2 7) ≈ O(n^2.81)

One matrix multiply is replaced by 14 matrix additions.
Strassen’s Algorithm
° In reality the use of Strassen’s Algorithm is limited by
• Additional memory required for storing the P matrices.
• More memory accesses are needed.
Outline
° Motivation for Dense Linear Algebra
  • Ax = b: Computational Electromagnetics
  • Ax = λx: Quantum Chemistry
° Review Gaussian Elimination (GE) for solving Ax=b
° Optimizing GE for caches on sequential machines
  • using matrix-matrix multiplication (BLAS)
° LAPACK library overview and performance
° Data layouts on parallel machines
° Parallel matrix-matrix multiplication
° Parallel Gaussian Elimination
° ScaLAPACK library overview
° Eigenvalue problem
Parallelism in Sparse Matrix-vector multiplication
° y = A*x, where A is sparse and n x n
° Questions
  • which processors store
    - y[i], x[i], and A[i,j]
  • which processors compute
    - y[i] = sum (from 1 to n) A[i,j] * x[j]
           = (row i of A) . x … a sparse dot product
° Partitioning
  • Partition index set {1,…,n} = N1 ∪ N2 ∪ … ∪ Np
  • For all i in Nk, processor k stores y[i], x[i], and row i of A
  • For all i in Nk, processor k computes y[i] = (row i of A) . x
    - "owner computes" rule: processor k computes the y[i]s it owns
° Goals of partitioning (a kernel sketch follows this list)
  • balance load (how is load measured?)
  • balance storage (how much does each processor store?)
  • minimize communication (how much is communicated?)
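A serial Fortran sketch of the kernel each processor runs on the rows it owns, assuming compressed sparse row (CSR) storage; the array names rowptr/colind/val are illustrative, not from the slide:

! y = A*x with A in CSR format: the nonzeros of row i are
! val(rowptr(i) : rowptr(i+1)-1), in columns colind(...)
subroutine csr_matvec(n, rowptr, colind, val, x, y)
  implicit none
  integer, intent(in) :: n, rowptr(n+1), colind(*)
  real, intent(in)    :: val(*), x(n)
  real, intent(out)   :: y(n)
  integer :: i, k
  do i = 1, n                               ! "owner computes": locally owned rows
     y(i) = 0.0
     do k = rowptr(i), rowptr(i+1)-1
        y(i) = y(i) + val(k)*x(colind(k))   ! sparse dot product with row i
     end do
  end do
end subroutine csr_matvec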
Graph Partitioning and Sparse Matrices
[Figure: a symmetric 6-by-6 sparse matrix and the corresponding graph on vertices 1-6; an edge (i,j) corresponds to a nonzero A(i,j)]
° Relationship between matrix and graph
° A "good" partition of the graph has
  • equal (weighted) number of nodes in each part (load and storage balance)
  • minimum number of edges crossing between parts (minimize communication)
° Can reorder the rows/columns of the matrix by putting all the nodes in one partition together
More on Matrix Reordering via Graph Partitioning
° "Ideal" matrix structure for parallelism: (nearly) block diagonal
  • p (number of processors) blocks
  • few non-zeros outside these blocks, since these require communication

[Figure: block-diagonal matrix-vector multiply with blocks assigned to processors P0-P4]
What about implicit methods and eigenproblems?
° Direct methods (Gaussian elimination)
  • Called LU decomposition, because we factor A = L*U
• Future lectures will consider both dense and sparse cases
• More complicated than sparse-matrix vector multiplication
° Iterative solvers
  • Will discuss several of these in future lectures
    - Jacobi, Successive Overrelaxation (SOR), Conjugate Gradients (CG), Multigrid, ...
  • Most have sparse-matrix-vector multiplication in their kernel
° Eigenproblems
  • Future lectures will discuss dense and sparse cases
  • Also depend on sparse-matrix-vector multiplication, direct methods
° Graph partitioning
Partial Differential Equations
PDEs
Continuous Variables, Continuous Parameters
Examples of such systems include
° Heat flow: Temperature(position, time)
° Diffusion: Concentration(position, time)
° Electrostatic or Gravitational Potential: Potential(position)
° Fluid flow: Velocity,Pressure,Density(position,time)
° Quantum mechanics: Wave-function(position,time)
° Elasticity: Stress,Strain(position,time)
Example: Deriving the Heat Equation
Consider a simple problem:
° A bar of uniform material on [0,1], insulated except at the ends
° Let u(x,t) be the temperature at position x at time t
° Heat travels from x-h to x+h at a rate proportional to:

  d u(x,t)         (u(x-h,t) - u(x,t))/h - (u(x,t) - u(x+h,t))/h
  --------  =  C * ----------------------------------------------
     dt                                  h

° As h -> 0, we get the heat equation:

  d u(x,t)         d^2 u(x,t)
  --------  =  C * ----------
     dt               dx^2
Explicit Solution of the Heat Equation
° For simplicity, assume C=1
° Discretize both time and position
° Use finite differences with u[j,i] as the heat at
  • time t = i*dt (i = 0,1,2,…) and position x = j*h (j = 0,1,…,N = 1/h)
  • initial conditions on u[j,0]
  • boundary conditions on u[0,i] and u[N,i]
° At each timestep i = 0,1,2,...:

For j = 0 to N
  u[j,i+1] = z*u[j-1,i] + (1-2*z)*u[j,i] + z*u[j+1,i]
where z = dt/h^2

° This corresponds to
  • matrix-vector multiply (what is the matrix?)
  • nearest neighbors on the grid

[Figure: space-time grid of values u[j,i] for timesteps t = 0..5]
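A minimal Fortran sketch of the explicit update (boundary values held fixed; dt is chosen so that z stays below the stability bound discussed two slides ahead):

program heat_explicit
  implicit none
  integer, parameter :: nx = 100, nsteps = 1000
  real, parameter :: h = 1.0/nx, dt = 0.4*h*h, z = dt/(h*h)   ! z = 0.4 < 0.5
  real :: u(0:nx), unew(0:nx)
  integer :: i, j
  u = 0.0
  u(0) = 1.0                                  ! initial and boundary conditions
  do i = 1, nsteps
     do j = 1, nx-1
        unew(j) = z*u(j-1) + (1.0-2.0*z)*u(j) + z*u(j+1)
     end do
     u(1:nx-1) = unew(1:nx-1)                 ! boundary values stay fixed
  end do
  print *, 'u at midpoint:', u(nx/2)
end program heat_explicit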
Parallelism in Explicit Method for PDEs
° Partitioning the space (x) into p large contiguous chunks
  • good load balance (assuming a large number of points relative to p)
  • minimized communication (only p chunks)
° Generalizes to
  • multiple dimensions
  • arbitrary graphs (= sparse matrices)
° Problem with explicit approach
  • numerical instability
  • solution blows up eventually if z = dt/h^2 > 0.5
  • need to make the timesteps very small when h is small: dt < 0.5*h^2
Instability in solving the heat equation explicitly
Implicit Solution
° As with many (stiff) ODEs, need an implicit method
° This turns into solving the following equation:

(I + (z/2)*T) * u[:,i+1] = (I - (z/2)*T) * u[:,i]

° Here I is the identity matrix and T is the tridiagonal matrix

    [  2 -1          ]
    [ -1  2 -1       ]
T = [    -1  2 -1    ]
    [       -1  2 -1 ]
    [          -1  2 ]

Graph and "stencil": a chain of nodes, each with stencil (-1  2  -1)

° I.e., essentially solving Poisson's equation in 1D
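Each timestep therefore requires a tridiagonal solve, which costs only O(n) using the classical Thomas algorithm; a sketch follows (no pivoting, which is safe here because I + (z/2)*T is diagonally dominant):

! Solve a tridiagonal system with subdiagonal dl, diagonal d,
! superdiagonal du; overwrites rhs with the solution.
subroutine thomas(n, dl, d, du, rhs)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: dl(n), d(n), du(n), rhs(n)
  integer :: i
  real :: m
  do i = 2, n                          ! forward elimination
     m = dl(i)/d(i-1)
     d(i) = d(i) - m*du(i-1)
     rhs(i) = rhs(i) - m*rhs(i-1)
  end do
  rhs(n) = rhs(n)/d(n)                 ! back substitution
  do i = n-1, 1, -1
     rhs(i) = (rhs(i) - du(i)*rhs(i+1))/d(i)
  end do
end subroutine thomas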
2D Implicit Method
° Similar to the 1D case, but the matrix T is now (shown for a 3x3 grid of unknowns):

    [  4 -1  0 -1  0  0  0  0  0 ]
    [ -1  4 -1  0 -1  0  0  0  0 ]
    [  0 -1  4  0  0 -1  0  0  0 ]
    [ -1  0  0  4 -1  0 -1  0  0 ]
T = [  0 -1  0 -1  4 -1  0 -1  0 ]
    [  0  0 -1  0 -1  4  0  0 -1 ]
    [  0  0  0 -1  0  0  4 -1  0 ]
    [  0  0  0  0 -1  0 -1  4 -1 ]
    [  0  0  0  0  0 -1  0 -1  4 ]

Graph and "stencil": 2D grid graph; the 5-point stencil has 4 at the center and -1 at each of the four neighbors

° Multiplying by this matrix (as in the explicit case) is simply nearest neighbor computation on a 2D grid
° To solve this system, there are several techniques
Algorithms for 2D Poisson Equation with N unknowns
Algorithm        Serial         PRAM            Memory     #Procs

Dense LU         N^3            N               N^2        N^2
Band LU          N^2            N               N^(3/2)    N
Jacobi           N^2            N               N          N
Explicit Inv.    N^2            log N           N^2        N^2
Conj. Grad.      N^(3/2)        N^(1/2) log N   N          N
RB SOR           N^(3/2)        N^(1/2)         N          N
Sparse LU        N^(3/2)        N^(1/2)         N log N    N
FFT              N log N        log N           N          N
Multigrid        N              log^2 N         N          N
Lower bound      N              log N           N

PRAM is an idealized parallel model with zero-cost communication
(see next slide for explanation)
Short explanations of algorithms on previous slide

° Sorted in two orders (roughly):
• from slowest to fastest on sequential machines
• from most general (works on any matrix) to most specialized (works on matrices “like” T)
° Dense LU: Gaussian elimination; works on any N-by-N matrix
° Band LU: exploit fact that T is nonzero only on sqrt(N) diagonals nearest main diagonal, so faster
° Jacobi: essentially does matrix-vector multiply by T in inner loop of iterative algorithm
° Explicit Inverse: assume we want to solve many systems with T, so we can precompute and store inv(T) “for free”, and just multiply by it
• It’s still expensive!
° Conjugate Gradients: uses matrix-vector multiplication, like Jacobi, but exploits mathematical properties of T that Jacobi does not
° Red-Black SOR (Successive Overrelaxation): Variation of Jacobi that exploits yet different mathematical properties of T
• Used in Multigrid
° Sparse LU: Gaussian elimination exploiting particular zero structure of T
° FFT (Fast Fourier Transform): works only on matrices very like T
° Multigrid: also works on matrices like T, that come from elliptic PDEs
° Lower Bound: serial (time to print answer); parallel (time to combine N inputs)
Composite mesh from a mechanical structure
Converting the mesh to a matrix
Effects of Ordering Rows and Columns on Gaussian Elimination
Irregular mesh: NASA Airfoil in 2D (direct solution)
Irregular mesh: Tapered Tube (multigrid)
Adaptive Mesh Refinement (AMR)
° Adaptive mesh around an explosion
° John Bell and Phil Colella at LBL (see class web page for URL)
° Goal of Titanium is to make these algorithms easier to implement in parallel
Computational Electromagnetics
•Developed during 1980s, driven by defense applications
•Determine the RCS (radar cross section) of airplane
•Reduce signature of plane (stealth technology)
•Other applications are antenna design, medical equipment
• Two fundamental numerical approaches:
  - MOM: method of moments (frequency domain), and
  - finite differences (time domain)
Computational Electromagnetics
image: NW Univ. Comp. Electromagnetics Laboratory http://nueml.ece.nwu.edu/
- Discretize surface into triangular facets using standard modeling tools
- Amplitude of currents on surface are unknowns
- Integral equation is discretized into a set of linear equations
Computational Electromagnetics (MOM)
After discretization the integral equation has the form
A x = b
where
A is the (dense) impedance matrix,
x is the unknown vector of amplitudes, and
b is the excitation vector.
(see Cwik, Patterson, and Scott, Electromagnetic Scattering on the Intel Touchstone Delta, IEEE Supercomputing ‘92, pp 538 - 542)
Computational Electromagnetics (MOM)

The main steps in the solution process are:

Fill:        computing the matrix elements of A
Factor:      factoring the dense matrix A
Solve:       solving for one or more excitations b
Field Calc:  computing the fields scattered from the object
Analysis of MOM for Parallel Implementation
Task Work Parallelism Parallel Speed
Fill O(n**2) embarrassing low
Factor O(n**3) moderately diff. very high
Solve O(n**2) moderately diff. high
Field Calc. O(n) embarrassing high
Results for Parallel Implementation on Delta
Task         Time (hours)
Fill            9.20
Factor          8.25
Solve           2.17
Field Calc.     0.12

The problem solved was for a matrix of size 48,672. (The world record in 1991.)
Current Records for Solving Dense Systems
Year    System Size   Machine           # Procs   Gflops (Peak)
1950's  O(100)
1995    128,600       Intel Paragon      6768       281   (338)
1996    215,000       Intel ASCI Red     7264      1068  (1453)
1998    148,000       Cray T3E           1488      1127  (1786)
1998    235,000       Intel ASCI Red     9152      1338  (1830)
1999    374,000       SGI ASCI Blue      5040      1608  (2520)
1999    362,880       Intel ASCI Red     9632      2379  (3207)
2000    430,000       IBM ASCI White     8192      4928 (12000)

source: Alan Edelman http://www-math.mit.edu/~edelman/records.html
LINPACK Benchmark: http://www.netlib.org/performance/html/PDSreports.html
Computational Chemistry
° Seek energy levels of a molecule, crystal, etc.
  • Solve Schroedinger's Equation for energy levels = eigenvalues
  • Discretize to get Ax = λBx, solve for eigenvalues λ and eigenvectors x
  • A and B large, symmetric or Hermitian matrices (B positive definite)
  • May want some or all eigenvalues/eigenvectors
° MP-Quest (Sandia NL)
  • Si and sapphire crystals of up to 3072 atoms
  • Local Density Approximation to Schroedinger Equation
  • A and B up to n = 40000, Hermitian
  • Need all eigenvalues and eigenvectors
  • Need to iterate up to 20 times (for self-consistency)
° Implemented on Intel ASCI Red
  • 9200 Pentium Pro 200 processors (4600 duals, a CLUMP)
  • Overall application ran at 605 Gflop/s (out of 1800 Gflop/s peak)
  • Eigensolver ran at 684 Gflop/s
  • www.cs.berkeley.edu/~stanley/gbell/index.html
  • Runner-up for Gordon Bell Prize at Supercomputing 98
Review of Gaussian Elimination (GE) for solving Ax=b
° Add multiples of each row to later rows to make A upper triangular
° Solve resulting triangular system Ux = c by substitution
for i = 1 to n-1          … for each column i, zero it out below the diagonal
                            by adding multiples of row i to later rows
  for j = i+1 to n        … for each row j below row i
    for k = i to n        … add a multiple of row i to row j
      A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
Refine GE Algorithm (1)
° Initial Version
° Remove computation of constant A(j,i)/A(i,i) from inner loop
for i = 1 to n-1          … for each column i, zero it out below the diagonal
                            by adding multiples of row i to later rows
  for j = i+1 to n        … for each row j below row i
    for k = i to n        … add a multiple of row i to row j
      A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

for i = 1 to n-1
  for j = i+1 to n
    m = A(j,i)/A(i,i)
    for k = i to n
      A(j,k) = A(j,k) - m * A(i,k)
Refine GE Algorithm (2)
° Last version:

for i = 1 to n-1
  for j = i+1 to n
    m = A(j,i)/A(i,i)
    for k = i to n
      A(j,k) = A(j,k) - m * A(i,k)

° Don't compute what we already know: zeros below diagonal in column i

for i = 1 to n-1
  for j = i+1 to n
    m = A(j,i)/A(i,i)
    for k = i+1 to n
      A(j,k) = A(j,k) - m * A(i,k)
Refine GE Algorithm (3)
° Last version:

for i = 1 to n-1
  for j = i+1 to n
    m = A(j,i)/A(i,i)
    for k = i+1 to n
      A(j,k) = A(j,k) - m * A(i,k)

° Store multipliers m below diagonal in zeroed entries for later use

for i = 1 to n-1
  for j = i+1 to n
    A(j,i) = A(j,i)/A(i,i)
    for k = i+1 to n
      A(j,k) = A(j,k) - A(j,i) * A(i,k)
Refine GE Algorithm (4)
° Last version:

for i = 1 to n-1
  for j = i+1 to n
    A(j,i) = A(j,i)/A(i,i)
    for k = i+1 to n
      A(j,k) = A(j,k) - A(j,i) * A(i,k)

° Express using matrix operations (BLAS)

for i = 1 to n-1
  A(i+1:n,i) = A(i+1:n,i) / A(i,i)
  A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
What GE really computes
° Call the strictly lower triangular matrix of multipliers M, and let L = I+M
° Call the upper triangle of the final matrix U
° Lemma (LU Factorization): If the above algorithm terminates (does not divide by zero) then A = L*U
° Solving A*x = b using GE
  • Factorize A = L*U using GE (cost = 2/3 n^3 flops)
  • Solve L*y = b for y, using substitution (cost = n^2 flops)
  • Solve U*x = y for x, using substitution (cost = n^2 flops)
° Thus A*x = (L*U)*x = L*(U*x) = L*y = b as desired

for i = 1 to n-1
  A(i+1:n,i) = A(i+1:n,i) / A(i,i)
  A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)
Problems with basic GE algorithm
° What if some A(i,i) is zero? Or very small?
  • Result may not exist, or be "unstable", so need to pivot
° Current computation all BLAS 1 or BLAS 2, but we know that BLAS 3 (matrix multiply) is fastest (Lecture 2)
for i = 1 to n-1
  A(i+1:n,i) = A(i+1:n,i) / A(i,i)                … BLAS 1 (scale a vector)
  A(i+1:n,i+1:n) = A(i+1:n,i+1:n)                 … BLAS 2 (rank-1 update)
                 - A(i+1:n,i) * A(i,i+1:n)

[Figure: IBM RS/6000 Power 3 (200 MHz, 800 Mflop/s peak): Mflop/s vs. order of vectors/matrices for BLAS 1, BLAS 2, BLAS 3, and peak]
Pivoting in Gaussian Elimination
° A = [ 0 1 ]   fails completely, even though A is "easy"
      [ 1 0 ]

° Illustrate problems in 3-decimal-digit arithmetic:

  A = [ 1e-4 1 ]  and  b = [ 1 ],  correct answer to 3 places is x = [ 1 ]
      [ 1    1 ]           [ 2 ]                                     [ 1 ]

° Result of LU decomposition is

  L = [ 1          0 ]  =  [ 1   0 ]      … no roundoff error yet
      [ fl(1/1e-4) 1 ]     [ 1e4 1 ]

  U = [ 1e-4 1           ]  =  [ 1e-4  1   ]   … error in 4th decimal place
      [ 0    fl(1-1e4*1) ]     [ 0    -1e4 ]

  Check: L*U = [ 1e-4 1 ]   … (2,2) entry entirely wrong
               [ 1    0 ]

° The algorithm "forgets" the (2,2) entry: it gets the same L and U for all |A(2,2)| < 5
° Numerical instability
° Computed solution x totally inaccurate
° Cure: Pivot (swap rows of A) so entries of L and U are bounded
Gaussian Elimination with Partial Pivoting (GEPP)

° Partial pivoting: swap rows so that each multiplier |L(j,i)| = |A(j,i)/A(i,i)| <= 1

for i = 1 to n-1
  find and record k where |A(k,i)| = max{i <= j <= n} |A(j,i)|
      … i.e. largest entry in rest of column i
  if |A(k,i)| = 0
      exit with a warning that A is singular, or nearly so
  elseif k != i
      swap rows i and k of A
  end if
  A(i+1:n,i) = A(i+1:n,i) / A(i,i)        … each quotient lies in [-1,1]
  A(i+1:n,i+1:n) = A(i+1:n,i+1:n) - A(i+1:n,i) * A(i,i+1:n)

° Lemma: This algorithm computes A = P*L*U, where P is a permutation matrix
° Since each entry |L(j,i)| <= 1, this algorithm is considered numerically stable
° For details see the LAPACK code at www.netlib.org/lapack/single/sgetf2.f
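A Fortran 90 sketch of the same algorithm, recording the swaps in an ipiv array (names and error handling are illustrative, not LAPACK's):

subroutine gepp(n, a, ipiv)
  ! LU factorization with partial pivoting: on exit A holds the
  ! multipliers below the diagonal and U on and above it.
  implicit none
  integer, intent(in)  :: n
  real, intent(inout)  :: a(n,n)
  integer, intent(out) :: ipiv(n-1)
  integer :: i, k
  real :: tmp(n)
  do i = 1, n-1
     k = (i-1) + maxloc(abs(a(i:n,i)), dim=1)    ! largest entry in rest of column i
     ipiv(i) = k
     if (a(k,i) == 0.0) stop 'A is singular, or nearly so'
     if (k /= i) then                            ! swap rows i and k of A
        tmp = a(i,:);  a(i,:) = a(k,:);  a(k,:) = tmp
     end if
     a(i+1:n,i) = a(i+1:n,i) / a(i,i)            ! each quotient lies in [-1,1]
     a(i+1:n,i+1:n) = a(i+1:n,i+1:n) - matmul(a(i+1:n,i:i), a(i:i,i+1:n))
  end do
end subroutine gepp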
Converting BLAS2 to BLAS3 in GEPP
° Blocking
  • Used to optimize matrix multiplication
  • Harder here because of data dependencies in GEPP
° Delayed updates
  • Save updates to the "trailing matrix" from several consecutive BLAS2 updates
  • Apply many saved updates simultaneously in one BLAS3 operation
° Same idea works for much of dense linear algebra
  • Open questions remain
° Need to choose a block size b
  • Algorithm will save and apply b updates
  • b must be small enough so that an active submatrix consisting of b columns of A fits in cache
  • b must be large enough to make BLAS3 fast
Blocked GEPP (www.netlib.org/lapack/single/sgetrf.f)
for ib = 1 to n-1 step b            … process matrix b columns at a time
  end = ib + b-1                    … point to end of block of b columns
  apply BLAS2 version of GEPP to get A(ib:n, ib:end) = P' * L' * U'
  … let LL denote the strict lower triangular part of A(ib:end, ib:end) + I
  A(ib:end, end+1:n) = LL^(-1) * A(ib:end, end+1:n)   … update next b rows of U
  A(end+1:n, end+1:n) = A(end+1:n, end+1:n)
                      - A(end+1:n, ib:end) * A(ib:end, end+1:n)
  … apply delayed updates with a single matrix-multiply with inner dimension b

(For a correctness proof, see the on-line notes.)
Efficiency of Blocked GEPP
Overview of LAPACK
° Standard library for dense/banded linear algebra
  • Linear systems: A*x = b
  • Least squares problems: min_x || A*x - b ||_2
  • Eigenvalue problems: Ax = λx, Ax = λBx
  • Singular value decomposition (SVD): A = UΣV^T
° Algorithms reorganized to use BLAS3 as much as possible
° Basis of math libraries on many computers
° Many algorithmic innovations remain
  • Projects available
Performance of LAPACK (n=1000)
Performance of LAPACK (n=100)
Parallelizing Gaussian Elimination
° Parallelization steps
  • Decomposition: identify enough parallel work, but not too much
• Assignment: load balance work among threads
• Orchestrate: communication and synchronization
• Mapping: which processors execute which threads
° Decomposition
  • In the BLAS 2 algorithm nearly each flop in the inner loop can be done in parallel, so with n^2 processors, need 3n parallel steps
  • This is too fine-grained; prefer calls to local matmuls instead

for i = 1 to n-1
  A(i+1:n,i) = A(i+1:n,i) / A(i,i)                … BLAS 1 (scale a vector)
  A(i+1:n,i+1:n) = A(i+1:n,i+1:n)                 … BLAS 2 (rank-1 update)
                 - A(i+1:n,i) * A(i,i+1:n)
Assignment of parallel work in GE
° Think of assigning submatrices to threads, where each thread is responsible for updating the submatrix it owns
  • "owner computes" rule is natural because of locality
° What should submatrices look like to achieve load balance?
Different Data Layouts for Parallel GE (on 4 procs)

[Figure: candidate layouts]
• Column blocked layout: bad load balance - P0 is idle after the first n/4 steps
• Column cyclic layout: load balanced, but can't easily use BLAS2 or BLAS3
• Column block cyclic layout: can trade load balance against BLAS2/3 performance by choosing b, but factorization of a block column is a bottleneck
• Block skewed layout: complicated addressing
• Row and column block cyclic layout: the winner!
Blocked Partitioned Algorithms
° LU Factorization
° Cholesky factorization
° Symmetric indefinite factorization
° Matrix inversion
° QR, QL, RQ, LQ factorizations
° Form Q or Q^T C
° Orthogonal reduction to:• (upper) Hessenberg form
• symmetric tridiagonal form
• bidiagonal form
° Block QR iteration for nonsymmetric eigenvalue problems
Memory Hierarchy and LAPACK
° ijk-implementations: the orderings of the loop nest

for _ = 1:n
  for _ = 1:n
    for _ = 1:n
      a(i,j) = a(i,j) + b(i,k)*c(k,j)
    end
  end
end

° The loop order affects the order in which data is referenced; some orderings are better at keeping data in the higher levels of the memory hierarchy.
° Applies to matrix multiply and reductions to condensed form
  • May do slightly more flops
  • Up to 3 times faster
Derivation of Blocked Algorithms: Cholesky Factorization A = U^T U

Partitioning A = U^T U around the jth column:

    [ A11    a_j   A13   ]     [ U11^T  0     0     ] [ U11  u_j   U13   ]
    [ a_j^T  a_jj  α_j^T ]  =  [ u_j^T  u_jj  0     ] [ 0    u_jj  μ_j^T ]
    [ A13^T  α_j   A33   ]     [ U13^T  μ_j   U33^T ] [ 0    0     U33   ]

Equating coefficients of the jth column, we obtain:

  a_j  = U11^T u_j
  a_jj = u_j^T u_j + u_jj^2

Hence, if U11 has already been computed, we can compute u_j and u_jj from the equations:

  U11^T u_j = a_j
  u_jj^2    = a_jj - u_j^T u_j
LINPACK Implementation
° Here is the body of the LINPACK routine SPOFA which implements the method:
DO 30 J = 1, N
INFO = J
S = 0.0E0
JM1 = J - 1
IF( JM1.LT.1 ) GO TO 20
DO 10 K = 1, JM1
T = A( K, J ) - SDOT( K-1, A( 1, K ), 1,A( 1, J ), 1 )
T = T / A( K, K )
A( K, J ) = T
S = S + T*T
10 CONTINUE
20 CONTINUE
S = A( J, J ) - S
C ...EXIT
IF( S.LE.0.0E0 ) GO TO 40
A( J, J ) = SQRT( S )
30 CONTINUE
LAPACK Implementation
      DO 10 J = 1, N
         CALL STRSV( 'Upper', 'Transpose', 'Non-Unit', J-1, A, LDA,
     $               A( 1, J ), 1 )
         S = A( J, J ) - SDOT( J-1, A( 1, J ), 1, A( 1, J ), 1 )
         IF( S.LE.ZERO ) GO TO 20
         A( J, J ) = SQRT( S )
   10 CONTINUE

° This change by itself is sufficient to significantly improve the performance on a number of machines.
° From 238 to 312 Mflop/s for a matrix of order 500 on a Pentium 4 at 1.7 GHz.
° However, peak is 1,700 Mflop/s, which suggests that further work is needed.
Derivation of Blocked Algorithms

Partitioning A = U^T U into blocks of columns:

    [ A11    A12    A13  ]     [ U11^T  0      0     ] [ U11  U12  U13 ]
    [ A12^T  A22    A23  ]  =  [ U12^T  U22^T  0     ] [ 0    U22  U23 ]
    [ A13^T  A23^T  A33  ]     [ U13^T  U23^T  U33^T ] [ 0    0    U33 ]

Equating coefficients of the second block of columns, we obtain:

  A12 = U11^T U12
  A22 = U12^T U12 + U22^T U22

Hence, if U11 has already been computed, we can compute U12 as the solution of

  U11^T U12 = A12

by a call to the Level 3 BLAS routine STRSM; then

  U22^T U22 = A22 - U12^T U12
LAPACK Blocked Algorithms

      DO 10 J = 1, N, NB
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit', J-1, JB,
     $               ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ),
     $               LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE

On the Pentium 4, the Level 3 BLAS squeezes a lot more out of one processor:

Rate of Execution, Intel Pentium 4 1.7 GHz, N = 500
  Linpack variant (L1 BLAS)     238 Mflop/s
  Level 2 BLAS variant          312 Mflop/s
  Level 3 BLAS variant         1262 Mflop/s
LAPACK Contents
° Combines algorithms from LINPACK and EISPACK into a single package. User interface similar to LINPACK.
° Built on the Level 1, 2 and 3 BLAS for high performance (manufacturers optimize the BLAS)
° LAPACK does not provide routines for structured problems or general sparse matrices (i.e. sparse storage formats such as compressed-row, -column, -diagonal, skyline, ...).
LAPACK Ongoing Work
° Add functionality
  • updating/downdating, divide-and-conquer least squares, bidiagonal bisection, bidiagonal inverse iteration, band SVD, Jacobi methods, ...
° Move to new generation of high performance machines
  • IBM SPs, CRAY T3E, SGI Origin, clusters of workstations
° New challenges
  • New languages: FORTRAN 90, HP FORTRAN, ...
  • Many flavors of message passing (CMMD, MPL, NX, ...); need a standard (PVM, MPI): BLACS
  • Highly varying ratio of computational speed to communication speed
  • Many ways to lay out data
  • Fastest parallel algorithm sometimes less stable numerically
History of Block Partitioned Algorithms
° Early algorithms involved use of small main memory using tapes as secondary storage.
° Recent work centers on use of vector registers, level 1 and 2 cache, main memory, and “out of core” memory.