
Iterative Methods and Parallel Algorithms

S. D. Margenov (margenov@parallel.bas.bg, http://parallel.bas.bg/~margenov/)

I. D. Lirkov (ivan@parallel.bas.bg, http://parallel.bas.bg/~ivan/)

Institute for Parallel Processing, Bulgarian Academy of Sciences, Sofia, Bulgaria


CONTENTS

1. Introduction

2. Parallel inner product

3. Sparse matrix vector multiplication

4. Jacobi method

5. Conjugate gradient method

6. Preconditioned conjugate gradient method

7. Circulant Block Factorization

8. MIC(0) preconditioning

9. Parallel PCG tests

Parallel Performance

To establish the theoretical performance, a simple model for non-overlapped arithmetic and communication times is assumed:

The execution of M a.o. (arithmetic operations) on one processor takes time

T_a = M t_a,

where t_a is the average unit time to perform one a.o. on one processor.


Parallel Performance

The communication time to transfer M data elements from one processor to another is approximated by

T_com = ℓ (t_s + M t_c),

where t_s is the start-up time, t_c is the time necessary for each of the M elements to be sent, and ℓ is the graph distance between the processors.


Parallel Speedup and Efficiency

The standard expressions for parallel speedup S(N, p) and parallel efficiency E(N, p) are used:

S(N, p) = T(N, 1) / T(N, p),    E(N, p) = S(N, p) / p.

Here T(N, p) stands for the parallel time to solve the problem on p processors, and N is the discrete size of the problem.
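As a quick worked example, with timings taken from the CBF test tables later in these slides: for T(N, 1) = 3.718 and T(N, 4) = 0.990,

S(N, 4) = 3.718 / 0.990 ≈ 3.75,    E(N, 4) = 3.75 / 4 ≈ 0.94.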


Parallel Speedup and Efficiency

The following theoretical estimates hold:

0 < S(N, p) ≤ p,    0 < E(N, p) ≤ 1.


Iterative Methods

Iterative methods are techniques for solving systems of linear equations

Ax = b

that generate a sequence of approximations to the solution vector x of the form

x^0, x^1, ..., x^k, ...


Iterative Methods

The process is said to be convergent if the magnitude of the vector

g^k = b − A x^k

becomes reasonably small. The vector g^k represents the error in the approximation of x and is referred to as the residual after k iterations.


The relative stopping criterion is determined by the norm of the k-th residual:

‖g^k‖_2 / ‖g^0‖_2 < ε,

where ε > 0 is assumed small, and ‖g^i‖_2 = (g^{i T} g^i)^{1/2}.


In this general setting, each iteration step includes:

an inner product

a matrix-vector multiplication with A


Parallel Inner Product I

The parallel implementation of the inner product is the only step of the considered iterative algorithms which requires global communications.

[Figure: communications in the one-to-all like parallel inner product.]

Parallel Inner Product I

T^IP_com = (t_s + t_c) log p    (d-cube)

T^IP_com = 2 (t_s + t_c) ⌈√p / 2⌉    (2D mesh)


Parallel Inner Product II

[Figure: communications in the all-to-all like parallel inner product.]

T^IP_com = (t_s + t_c) log p    (d-cube)

T^IP_com = 2 (t_s + t_c) (√p − 1)    (2D mesh)
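The all-to-all variant maps directly onto MPI_Allreduce. A minimal C++/MPI sketch (illustrative names; each processor holds a contiguous strip of the global vectors):

#include <mpi.h>
#include <vector>

// Inner product of two distributed vectors under the block (strip)
// partitioning used throughout these slides.  The local partial sum is
// combined by an all-to-all type reduction -- the only global
// communication required per iteration of the considered methods.
double parallel_inner_product(const std::vector<double>& x,
                              const std::vector<double>& y,
                              MPI_Comm comm)
{
    double local = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        local += x[i] * y[i];

    double global = 0.0;
    // On a hypercube this reduction completes in log p steps,
    // matching the T^IP_com estimate above.
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}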


Sparse Matrices

Example A1

With a red-black type numbering of the 3×3 mesh (corners 1–4, center 5, edge midpoints 6–9; cf. the mesh figure below, up to a symmetric permutation of same-colored nodes):

A =
⎛  4   0   0   0   0  −1  −1   0   0 ⎞
⎜  0   4   0   0   0  −1   0  −1   0 ⎟
⎜  0   0   4   0   0   0  −1   0  −1 ⎟
⎜  0   0   0   4   0   0   0  −1  −1 ⎟
⎜  0   0   0   0   4  −1  −1  −1  −1 ⎟
⎜ −1  −1   0   0  −1   4   0   0   0 ⎟
⎜ −1   0  −1   0  −1   0   4   0   0 ⎟
⎜  0  −1   0  −1  −1   0   0   4   0 ⎟
⎝  0   0  −1  −1  −1   0   0   0   4 ⎠

i.e., A has a 2×2 block form with diagonal blocks 4I_5 and 4I_4.


Example A2

With the column-wise numbering 1–9 of the 3×3 mesh:

A =
⎛  4  −1   0  −1   0   0   0   0   0 ⎞
⎜ −1   4  −1   0  −1   0   0   0   0 ⎟
⎜  0  −1   4   0   0  −1   0   0   0 ⎟
⎜ −1   0   0   4  −1   0  −1   0   0 ⎟
⎜  0  −1   0  −1   4  −1   0  −1   0 ⎟
⎜  0   0  −1   0  −1   4   0   0  −1 ⎟
⎜  0   0   0  −1   0   0   4  −1   0 ⎟
⎜  0   0   0   0  −1   0  −1   4  −1 ⎟
⎝  0   0   0   0   0  −1   0  −1   4 ⎠


FDM/FEM Sparse Matrices I

Consider the model problem

−u_xx − u_yy = f in Ω = [0, 1]²

with Dirichlet boundary conditions on Γ = ∂Ω. Let us assume that FDM or FEM is used to solve the problem numerically on a uniform mesh ω_h with mesh-size h = 1/(n + 1).

[Figure: the 3×3 interior mesh with the two node numberings: (A1) red-black type ordering, (A2) column-wise ordering 1–9.]


FDM/FEM Sparse Matrices

If a column-wise numbering of the nodes (unknowns) is used, then

A = blocktridiag(A_{i,i−1}, A_{i,i}, A_{i,i+1}) =

⎛ A_{1,1}  A_{1,2}                                      ⎞
⎜ A_{2,1}  A_{2,2}  A_{2,3}                             ⎟
⎜          A_{3,2}  A_{3,3}  A_{3,4}                    ⎟
⎜                  ...      ...      ...                ⎟
⎜          A_{n−1,n−2}  A_{n−1,n−1}  A_{n−1,n}          ⎟
⎝                       A_{n,n−1}    A_{n,n}            ⎠

A_{i,i} = tridiag(−1, 4, −1),    A_{i,i−1} = A_{i,i+1} = −I,

A ∈ ℝ^{N×N},    A_{i,j} ∈ ℝ^{n×n},    N = n².
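In practice, the product with such a stencil matrix is applied matrix-free. A small serial C++ sketch for the model problem (illustrative; the column-wise numbering maps unknown (i, j) to index j·n + i):

#include <vector>

// y = A*x for the n x n model problem (5-point stencil, homogeneous
// Dirichlet boundary conditions), column-wise numbering, N = n*n.
// Roughly 9N arithmetic operations, matching T(N,1) ~ 9N ta below.
void apply_laplacian(int n, const std::vector<double>& x,
                     std::vector<double>& y)
{
    auto id = [n](int i, int j) { return j * n + i; };
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            double v = 4.0 * x[id(i, j)];
            if (i > 0)     v -= x[id(i - 1, j)];
            if (i < n - 1) v -= x[id(i + 1, j)];
            if (j > 0)     v -= x[id(i, j - 1)];
            if (j < n - 1) v -= x[id(i, j + 1)];
            y[id(i, j)] = v;
        }
}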


Matrix-Vector Multiplication I

[Figure: matrix-vector multiplication with a block-stripped partitioning of the block-tridiagonal sparse matrix over processors P0, P1, P2.]

T(N, 1) ≈ 9N t_a,    T_com = 2 (t_s + n t_c),

T(N, p) ≈ (9N / p) t_a + 2 (t_s + √N t_c).
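A sketch of the communication step under this partitioning: each processor owns a strip of block rows and exchanges only the n interface values with its two neighbours, one start-up per neighbour, matching T_com = 2(t_s + n t_c). Function and variable names are illustrative:

#include <mpi.h>
#include <vector>

// Exchange interface values with the two neighbouring processors in the
// block-stripped partitioning.  halo_lo/halo_hi (each of length n)
// receive the boundary entries of the lower/upper neighbour; the local
// stencil apply then uses them for the coupling blocks -I.
void exchange_interfaces(int rank, int p, int n,
                         const std::vector<double>& x_local,
                         std::vector<double>& halo_lo,
                         std::vector<double>& halo_hi)
{
    MPI_Status st;
    if (rank > 0)        // send my first block column down, receive theirs
        MPI_Sendrecv(x_local.data(), n, MPI_DOUBLE, rank - 1, 0,
                     halo_lo.data(), n, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, &st);
    if (rank < p - 1)    // send my last block column up, receive theirs
        MPI_Sendrecv(x_local.data() + x_local.size() - n, n, MPI_DOUBLE,
                     rank + 1, 0,
                     halo_hi.data(), n, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, &st);
}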


Jacobi Iterative Method

The i-th equation of the system Ax = b can be written in the form

x_i = (1 / A_{i,i}) ( b_i − Σ_{j≠i} A_{i,j} x_j ).

The iteration step in the Jacobi method is

x_i^{k+1} = (1 / A_{i,i}) ( b_i − Σ_{j≠i} A_{i,j} x_j^k ).


Jacobi Iterative Method

or equivalently

x_i^{k+1} = g_i^k / A_{i,i} + x_i^k.

The method always converges in the class of diagonally dominant matrices.


Jacobi Algorithm

procedure JACOBI(A, b, x, ε)
begin
    k := 0;
    Select initial solution vector x^0;
    g^0 := b − A x^0;
    while ‖g^k‖_2 > ε ‖g^0‖_2 do
    begin
        k := k + 1;
        for i := 1 to N do
            x_i^k := g_i^{k−1} / A_{i,i} + x_i^{k−1};
        g^k := b − A x^k;
    endwhile;
    x := x^k;
end JACOBI

The complexity of one iteration is as follows:

N^Jac_it(A^{−1}b) ≈ N(Ad) + N(IP) + 3N,

which for the model problem reads

N^Jac_it(A^{−1}b) ≈ 14N.

The related times are simply derived using the related matrix-vector communication estimate:

T^it(N, 1) ≈ 14N t_a,

T^it_com = 2 (t_s + n t_c) + T^IP_com,

T^it(N, p) ≈ (14N / p) t_a + 2 (t_s + √N t_c).
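A serial sketch of the update in line with the listing above (a minimal illustration, not the authors' code; for the model problem the diagonal is constant, A_{i,i} = 4):

#include <vector>

// One Jacobi update x^k := g^{k-1}/A_ii + x^{k-1}; the residual
// g = b - A x is recomputed after each sweep (cf. procedure JACOBI).
void jacobi_sweep(const std::vector<double>& g, std::vector<double>& x)
{
    const double diag = 4.0;   // A_ii of the 5-point stencil
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] += g[i] / diag;
}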


Conjugate Gradient Algorithm

procedure CG(A, b, x, ε)
begin
    k := 0;
    Select initial solution vector x^0;
    g^0 := A x^0 − b;   d^0 := −g^0;
    while ‖g^k‖_2 > ε ‖g^0‖_2 do
    begin
        τ_k := (g^{kT} g^k) / (d^{kT} A d^k);
        x^{k+1} := x^k + τ_k d^k;
        g^{k+1} := g^k + τ_k A d^k;
        β_k := (g^{k+1 T} g^{k+1}) / (g^{kT} g^k);
        d^{k+1} := −g^{k+1} + β_k d^k;
        k := k + 1;
    endwhile;
    x := x^k;
end CG

The computational complexity of one CG iteration is as follows:

N^CG_it(A^{−1}b) ≈ N(Ad) + 2N(IP) + 3N(LT),

N^CG_it(A^{−1}b) ≈ 19N.

The related times are derived in a similar way as for the Jacobi method:

T^it(N, 1) ≈ 19N t_a,

T^it_com = 2 (t_s + n t_c) + 2 T^IP_com,

T^it(N, p) ≈ (19N / p) t_a + 2 (t_s + √N t_c).
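Putting the pieces together, a compact serial CG sketch for the model problem, using the matrix-free apply_laplacian from earlier (illustrative; the parallel version replaces dot with the MPI inner product and adds the interface exchange before each matrix apply):

#include <vector>

void apply_laplacian(int n, const std::vector<double>& x,
                     std::vector<double>& y);   // as sketched earlier

static double dot(const std::vector<double>& a, const std::vector<double>& b)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Plain CG for A x = b following the listing above; g = A x - b is the
// residual, d the search direction.
void cg(int n, const std::vector<double>& b, std::vector<double>& x,
        double eps)
{
    const std::size_t N = x.size();
    std::vector<double> g(N), d(N), Ad(N);
    apply_laplacian(n, x, g);                       // g^0 = A x^0 - b
    for (std::size_t i = 0; i < N; ++i) { g[i] -= b[i]; d[i] = -g[i]; }
    double gg = dot(g, g);
    const double gg0 = gg;
    while (gg > eps * eps * gg0) {                  // ||g^k|| > eps ||g^0||
        apply_laplacian(n, d, Ad);
        const double tau = gg / dot(d, Ad);         // tau_k
        for (std::size_t i = 0; i < N; ++i) {
            x[i] += tau * d[i];                     // x^{k+1}
            g[i] += tau * Ad[i];                    // g^{k+1}
        }
        const double gg_new = dot(g, g);
        const double beta = gg_new / gg;            // beta_k
        for (std::size_t i = 0; i < N; ++i)
            d[i] = -g[i] + beta * d[i];             // d^{k+1}
        gg = gg_new;
    }
}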


Convergence Rate of CG Method

Theorem.

p(ε) ≤ (1/2) √κ(A) ln(2/ε) + 1,

where p(ε) stands for the smallest number k such that

‖x^k − x‖_A ≤ ε ‖x^0 − x‖_A    ∀ x^0 ∈ ℝ^N.

In a very general setting of FEM/FDM sparse matrices, the spectral condition number behaves as κ(A) = O(N).


Convergence Rate of CG Method

Therefore N^CG(A^{−1}b) = O(N^{3/2}). It is important to note that the same complexity holds for the best known direct method, namely N^ND(A^{−1}b) = O(N^{3/2}), where ND stands for the Nested Dissection method.


Numerical Tests

The numbers of iterations for the model problem of the Gauss-Seidel (G-S), Steepest Descent (SD), Conjugate Gradient (CG) and Preconditioned CG (PCG) methods are presented in the table. The implemented PCG algorithm is the subject of the next section.

n^{G-S}_it = O(N),    n^{SD}_it = O(N),

n^{CG}_it = O(N^{1/2}),    n^{PCG}_it = O(N^{1/4})


Numerical Tests

 n      G-S      SD     CG    PCG
 4       82     185     26     11
 8      309     698     45     15
16     1151    2592     91     19
32     4242    9541    177     27
64    15529   34818    351     38

Conclusion. The efforts should be directed to the development of scalable parallel algorithms for sufficiently fast iterative solution methods.


Preconditioned CG Method

The idea of the PCG method is to substitute the original linear system with a new one which is better conditioned:

Ax = b   ⟹   C^{−1/2} A C^{−1/2} y = C^{−1/2} b,    y = C^{1/2} x.

The PCG strategy is to construct a preconditioner C such that:

κ(C^{−1}A) ≪ κ(A),

N(C^{−1}v) ≪ N(A^{−1}v).

The preconditioner is called optimal if κ(C^{−1}A) = O(1) and N(C^{−1}v) = O(N).


PCG Algorithm

procedure PCG(A, b, x, ε)
begin
    k := 0;
    Select initial solution vector x^0;
    g^0 := A x^0 − b;   h^0 := C^{−1} g^0;   d^0 := −h^0;
    while ‖g^k‖_{C^{−1}} > ε ‖g^0‖_{C^{−1}} do
    begin
        τ_k := (g^{kT} h^k) / (d^{kT} A d^k);
        x^{k+1} := x^k + τ_k d^k;
        g^{k+1} := g^k + τ_k A d^k;
        h^{k+1} := C^{−1} g^{k+1};
        β_k := (g^{k+1 T} h^{k+1}) / (g^{kT} h^k);
        d^{k+1} := −h^{k+1} + β_k d^k;
        k := k + 1;
    endwhile;
    x := x^k;
end PCG

Following the structure of our analysis, we estimate the computational complexity of one PCG iteration in the form:

N^PCG_it(A^{−1}b) ≈ N(C^{−1}g) + N(Ad) + 2N(IP) + 3N(LT),

N^PCG_it(A^{−1}b) ≈ N(C^{−1}g) + 19N.

Then the related per-iteration PCG times are as follows:

T^it(N, 1) ≈ T^{(C^{−1}g)}(N, 1) + 19N t_a,

T^it_com = 2 (t_s + n t_c) + T^{(C^{−1}g)}_com + 2 T^IP_com.
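Relative to plain CG, the only structural change is the extra application h^{k+1} = C^{−1} g^{k+1} in each iteration. A small sketch of how a preconditioner might be plugged in (illustrative; any of the techniques listed below just supplies this callback):

#include <functional>
#include <vector>

// The preconditioner enters PCG only through its action on the
// residual: h = C^{-1} g.
using Precond =
    std::function<void(const std::vector<double>& g, std::vector<double>& h)>;

// Example: the identity preconditioner C = I reduces PCG to plain CG.
Precond identity = [](const std::vector<double>& g, std::vector<double>& h) {
    h = g;
};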


Convergence Rate of PCG Method

Theorem.

p(ε) ≤ (1/2) √κ(C^{−1}A) ln(2/ε) + 1,

where p(ε) stands for the smallest number k such that

‖x^k − x‖_A ≤ ε ‖x^0 − x‖_A    ∀ x^0 ∈ ℝ^N.


Convergence Rate of PCG Method

Some parallel preconditioning techniques:
• Incomplete Factorization
• Circulant Block Factorization
• Domain Decomposition
• Patched Local Refinement
• Multigrid/Multilevel
• Approximate Inverse


Circulant Block Factorization

A circulant matrix C has the form C_{k,j} = c_{(j−k) mod m}:

C =
⎛ c_0      c_1      c_2     ...    c_{m−1} ⎞
⎜ c_{m−1}  c_0      c_1     ...    c_{m−2} ⎟
⎜   ...      ...      ...            ...   ⎟
⎝ c_1      c_2      ...    c_{m−1}  c_0    ⎠

C = (c_0, c_1, ..., c_{m−1}) = F Λ F*,

N(C^{−1}v) = O(m log m).
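Since F is the discrete Fourier matrix, C^{−1}v can be applied with two FFTs and a diagonal solve. A serial sketch assuming the FFTW library (fftw3); the sign/normalization conventions follow fftw_plan_dft_1d, and everything else is illustrative:

#include <complex>
#include <vector>
#include <fftw3.h>

// Solve C w = u for the m x m circulant matrix C_{k,j} = c_{(j-k) mod m}
// via C = F Lambda F*: O(m log m) operations.
std::vector<std::complex<double>>
circulant_solve(const std::vector<std::complex<double>>& c,
                std::vector<std::complex<double>> u)
{
    const int m = static_cast<int>(c.size());
    std::vector<std::complex<double>> lambda(c), w(m);
    auto* lam = reinterpret_cast<fftw_complex*>(lambda.data());
    auto* uu  = reinterpret_cast<fftw_complex*>(u.data());
    auto* ww  = reinterpret_cast<fftw_complex*>(w.data());

    // Eigenvalues lambda_t = sum_l c_l exp(+2 pi i l t / m): an
    // (unnormalized) backward DFT of the defining vector c.
    fftw_plan pl = fftw_plan_dft_1d(m, lam, lam, FFTW_BACKWARD, FFTW_ESTIMATE);
    fftw_execute(pl);

    // Transform the right-hand side, divide by the eigenvalues ...
    fftw_plan pf = fftw_plan_dft_1d(m, uu, uu, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(pf);
    for (int t = 0; t < m; ++t) u[t] /= lambda[t];

    // ... and transform back, restoring FFTW's missing 1/m factor.
    fftw_plan pb = fftw_plan_dft_1d(m, uu, ww, FFTW_BACKWARD, FFTW_ESTIMATE);
    fftw_execute(pb);
    for (auto& z : w) z /= static_cast<double>(m);

    fftw_destroy_plan(pl); fftw_destroy_plan(pf); fftw_destroy_plan(pb);
    return w;
}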


2D model problem

−(a(x, y) u_x)_x − (b(x, y) u_y)_y = f(x, y),    ∀(x, y) ∈ Ω,

u(x, y) = 0,    ∀(x, y) ∈ Γ = ∂Ω,

0 < c_min ≤ a(x, y), b(x, y) ≤ c_max,

A = tridiag(−A_{i,i−1}, A_{i,i}, −A_{i,i+1}),    i = 1, 2, ..., n,

C = tridiag(−C_{i,i−1}, C_{i,i}, −C_{i,i+1}),    i = 1, 2, ..., n,

where C_{i,j} = Circulant(A_{i,j}) is some given circulant approximation of the corresponding block A_{i,j}.


Factorization

C = D − L − U,

C = (X − L)(I − X^{−1} U),

X = D − L X^{−1} U,

X_1 = C_{1,1},

X_i = C_{i,i} − C_{i,i−1} X_{i−1}^{−1} C_{i−1,i},    i = 2, ..., n,

C_{i,j} = F Λ_{i,j} F*,

X_i = F D_i F*.


Factorization

D_1^{−1} = Λ_{1,1},

D_i^{−1} = Λ_{i,i} − Λ_{i,i−1} D_{i−1} Λ_{i−1,i}.

Let us denote Λ = tridiag(Λ_{i,i−1}, Λ_{i,i}, Λ_{i,i+1}). Then the following relation holds:

C w = u   ⟺   (I ⊗ F) Λ (I ⊗ F*) w = u,

ū = (I ⊗ F*) u,

Λ w̄ = ū,

w = (I ⊗ F) w̄.

Factorization

v_1 = D_1 ū_1,

v_i = D_i (ū_i − Λ_{i,i−1} v_{i−1}),    i = 2, 3, ..., n,

w̄_n = v_n,

w̄_i = v_i − D_i Λ_{i,i+1} w̄_{i+1},    i = n−1, n−2, ..., 1.
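In the Fourier basis all blocks are diagonal, so the two sweeps decouple into m independent scalar recurrences per block row. A sketch (illustrative; D[i], Lsub[i] = Λ_{i,i−1} and Usup[i] = Λ_{i,i+1} are stored as diagonal m-vectors, with D[i] the factor applied in the solve):

#include <complex>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Forward/backward block sweep in the Fourier basis; u holds the
// transformed right-hand side (one m-vector per block row) and is
// overwritten with the solution w.
void cbf_sweeps(const std::vector<cvec>& D, const std::vector<cvec>& Lsub,
                const std::vector<cvec>& Usup, std::vector<cvec>& u)
{
    const std::size_t n = D.size(), m = D[0].size();
    // v_1 = D_1 u_1;  v_i = D_i (u_i - Lambda_{i,i-1} v_{i-1})
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t t = 0; t < m; ++t) {
            if (i > 0) u[i][t] -= Lsub[i][t] * u[i - 1][t];
            u[i][t] *= D[i][t];
        }
    // w_n = v_n;  w_i = v_i - D_i Lambda_{i,i+1} w_{i+1}
    for (std::size_t i = n - 1; i-- > 0; )
        for (std::size_t t = 0; t < m; ++t)
            u[i][t] -= D[i][t] * Usup[i][t] * u[i + 1][t];
}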


Parallel algorithm

[Figure: distribution of the vector entries over the processors P0, P1, P2, P3.]

The CBF preconditioning can be split into three stages. If we use the column-wise mapping for the first and third stages, there is no need for communication, because we perform block-FFTs on blocks which are stored on one processor. For the second stage we have to reorder the vector entries using a row-wise mapping.


Parallel CBF tests

SUN Ultra-Enterprise Symmetric Multiprocessor, 168 MHz and 250 MHz configurations:

                168 MHz               250 MHz
  n   p    T(p)   S_p   E_p      T(p)   S_p   E_p
128   1   0.086                 0.081
      2   0.047  1.84  0.92     0.047  1.71  0.86
      4   0.028  3.04  0.76     0.029  2.77  0.69
      8   0.021  4.13  0.52     0.096  0.84  0.10

256   1   0.389                 0.392
      2   0.207  1.88  0.94     0.208  1.88  0.94
      4   0.109  3.56  0.89     0.127  3.09  0.77
      8   0.065  6.02  0.75     0.138  2.83  0.35

384   1   1.460                 1.498
      2   0.759  1.92  0.96     0.783  1.91  0.96
      3   0.523  2.79  0.93     0.533  2.81  0.94
      4   0.394  3.71  0.93     0.473  3.17  0.79
      6   0.269  5.43  0.90     0.780  1.92  0.32
      8   0.338  4.32  0.54     1.122  1.33  0.17

420   1   3.718                 2.651
      2   1.922  1.93  0.97     1.378  1.92  0.96
      3   1.313  2.83  0.94     0.937  2.83  0.94
      4   0.990  3.75  0.94     0.714  3.71  0.93
      5   0.817  4.55  0.91     1.005  2.64  0.53
      6   0.679  5.48  0.91     1.233  2.15  0.36
      7   0.595  6.25  0.89     1.314  2.02  0.29

MIC(0) Algorithm

Let us rewrite the real matrix A in the form A = D − L − L^T. Then the modified incomplete Cholesky factorization is defined as follows:

C_MIC(0)(A) = (X − L) X^{−1} (X − L)^T,

where X = diag(x_1, ..., x_N) provides the equal rowsums condition.


MIC(0) Algorithm

Theorem. Let us assume that

(a) L ≥ 0,
(b) A e ≥ 0,
(c) A e + L^T e > 0,    e = (1, ..., 1)^T ∈ ℝ^N.

Then the relation

x_i = a_ii − Σ_{k=1}^{i−1} (a_ik / x_k) Σ_{j=k+1}^{N} a_kj > 0

gives a stable MIC(0) factorization of A.
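A direct transcription of the recurrence for the diagonal X (dense indexing purely for clarity; a real implementation would run over the sparse pattern only):

#include <vector>

// Compute X = diag(x_1, ..., x_N) of the MIC(0) factorization
// C = (X - L) X^{-1} (X - L)^T from
//   x_i = a_ii - sum_{k<i} (a_ik / x_k) * sum_{j>k} a_kj.
// A is stored row-major, N x N.
std::vector<double> mic0_diagonal(const std::vector<double>& A, int N)
{
    std::vector<double> x(N);
    std::vector<double> rowtail(N, 0.0);   // rowtail[k] = sum_{j>k} a_kj
    for (int k = 0; k < N; ++k)
        for (int j = k + 1; j < N; ++j)
            rowtail[k] += A[k * N + j];

    for (int i = 0; i < N; ++i) {
        double xi = A[i * N + i];
        for (int k = 0; k < i; ++k)
            xi -= A[i * N + k] / x[k] * rowtail[k];
        x[i] = xi;   // x_i > 0 is guaranteed under (a)-(c) above
    }
    return x;
}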


Remark. All presented numerical tests are performed using the perturbed MIC(0) algorithm, where the incomplete factorization is applied to the matrix Ā = A + D̄. The diagonal perturbation D̄ = D̄(ξ) = diag(d̄_1, ..., d̄_N) is defined as follows:

d̄_i = ξ a_ii         if a_ii ≥ 2 w_i,
d̄_i = ξ^{1/2} a_ii   if a_ii < 2 w_i,

where w_i = −Σ_{j>i} a_ij.

Here 0 < ξ < 1 is a constant of the same order as the minimal eigenvalue of A. The computations for the considered model problems are done with ξ = h².


MIC(0) Complexity

The MIC(0) computational complexity of one PCG iteration is as follows:

N^PCG_it(A^{−1}b) ≈ N(C^{−1}g) + 19N,    N(C^{−1}g) ≈ 11N,

N^PCG_it(A^{−1}b) ≈ 30N.

MIC(0) is a cheap preconditioning algorithm: the cost of N(C^{−1}g) is almost the same as that of N(Ad).


MIC(0) Complexity

MIC(0) is a robust preconditioner with respect to local singularities of the problem, with κ(C^{−1}A) = O(N^{1/2}) and hence N^PCG(A^{−1}b) = O(N^{5/4}).

MIC(0) is an inherently sequential algorithm.


FDM/FEM Sparse Matrices II

The model problem is considered again: −u_xx − u_yy = f in Ω = [0, 1]² with Dirichlet boundary conditions on Γ = ∂Ω.

[Figure: the regular mesh (ReM) and the alternative skewed mesh (SkM).]

Since a five-point stencil is used in both cases, the accuracy of the regular mesh (ReM) and the alternative skewed mesh (SkM) FDM/FEM approximations is one and the same.


Block-Structure of the Matrices

[Figure: block structure of the stiffness matrices for SkM and ReM.]

The bottleneck of the parallel implementation of the MIC(0) algorithm is the solution of systems with the triangular matrices (X − L) and (X − L)^T.


Block-Structure of the Matrices

The key point of our consideration is that, in the case of the skewed mesh, the stiffness matrix has a block structure whose diagonal blocks are themselves diagonal, so the triangular solves reduce to a block recurrence built from diagonal solves and sparse matrix-vector products, which parallelizes well.


Parallel MIC(0) algorithm

[Figure: MIC(0) PCG algorithm with a block-stripped partitioning over processors P0, P1, P2: N = n² + (n − 1)².]

T^MIC(0)_it(N, 1) ≈ 38N t_a,    T_com ≈ (4 t_s + 6 t_c) n + 2 T^IP_com,

T^MIC(0)_it(N, p) ≈ (38N / p) t_a + (2 t_s + 3 t_c) √(2N).


Parallel MIC(0) Tests

The presented tests are performed on a Beowulf-like cluster of four dual-processor Power Macintosh computers with 512 MB RAM each and G4 processors at 450 MHz.

The parallel MIC(0) algorithm is implemented in C++ using the Message Passing Interface (MPI).

Yellow Dog Linux with LAM MPI is used.

The size of the problem and the number of processors are varied to examine the parallel scalability of the code.


Parallel Speedup

n         32    64   128   256   512  1024  1500
S(n,2)  1.21  1.68  1.96  1.85  1.92  2.03  2.02
S(n,3)  0.24  0.46  0.97  1.72  2.45  2.97  2.86
S(n,4)  0.22  0.46  1.11  1.97  2.88  3.76  3.95
S(n,5)  0.20  0.40  0.96  1.99  3.25  4.48  4.86
S(n,6)  0.18  0.39  1.03  1.99  3.55  5.23  5.73
S(n,7)  0.19  0.38  0.95  1.78  3.63  6.02  6.31
S(n,8)  0.19  0.39  1.00  2.28  3.97  6.37  6.76


Parallel Efficiency

n         32    64   128   256   512  1024  1500
E(n,2)  0.60  0.84  0.98  0.93  0.96  1.02  1.01
E(n,3)  0.08  0.15  0.32  0.57  0.81  0.99  0.95
E(n,4)  0.06  0.12  0.27  0.49  0.72  0.94  0.98
E(n,5)  0.04  0.08  0.19  0.40  0.65  0.90  0.97
E(n,6)  0.03  0.07  0.17  0.33  0.59  0.87  0.96
E(n,7)  0.03  0.05  0.14  0.25  0.51  0.86  0.90
E(n,8)  0.02  0.05  0.13  0.29  0.50  0.80  0.84

