
Iterative Methods and Parallel Algorithms

S. D. Margenov (margenov@parallel.bas.bg, http://parallel.bas.bg/~margenov/)

I. D. Lirkov (ivan@parallel.bas.bg, http://parallel.bas.bg/~ivan/)

Institute for Parallel Processing, Bulgarian Academy of Sciences, Sofia, Bulgaria


CONTENTS

1. Introduction

2. Parallel inner product

3. Sparse matrix vector multiplication

4. Jacobi method

5. Conjugate gradient method

6. Preconditioned conjugate gradient method

7. Circulant Block Factorization

8. MIC(0) preconditioning

9. Parallel PCG tests

Parallel Performance

To establish the theoretical performance, a simple model for non-overlapped arithmetic and communication times is assumed:

The execution of M a.o. (arithmetic operations) on one processor takes time

T_a = M t_a,

where t_a is the average unit time to perform one a.o. on one processor.


Parallel Performance

The communication time to transfer M data elements from one processor to another is approximated by

T_com = ℓ (t_s + M t_c),

where t_s is the start-up time, t_c is the time necessary for each of the M elements to be sent, and ℓ is the graph distance between the processors.


Parallel Speedup and Efficiency

The standard expressions for parallel speedup S(N, p) and parallel efficiency E(N, p) are used:

S(N, p) = T(N, 1) / T(N, p),    E(N, p) = S(N, p) / p.

Here T(N, p) stands for the parallel time to solve the problem on p processors, and N is the discrete size of the problem.
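As a quick worked example, with timings taken from the CBF test tables later in these slides: for T(N, 1) = 3.718 and T(N, 4) = 0.990,

S(N, 4) = 3.718 / 0.990 ≈ 3.75,    E(N, 4) = 3.75 / 4 ≈ 0.94.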


Parallel Speedup and Efficiency

The following theoretical estimates hold:

0 < S(N, p) ≤ p,    0 < E(N, p) ≤ 1.


Iterative Methods

Iterative methods are techniques for solving systems of linear equations

Ax = b

that generate a sequence of approximations to the solution vector x of the form

x^0, x^1, ..., x^k, ...


Iterative Methods

The process is said to be convergent if the magnitude of the vector

g^k = b − A x^k

becomes reasonably small. The vector g^k represents the error in the approximation of x and is referred to as the residual after k iterations.


The relative stopping criterion is determined by the norm of the k-th residual:

‖g^k‖_2 / ‖g^0‖_2 < ε,

where ε > 0 is assumed small, and ‖g^i‖_2 = (g^{i T} g^i)^{1/2}.


In this general setting, each iteration step includes:

an inner product

a matrix-vector multiplication with A


Parallel Inner Product I

The parallel implementation of the inner product is the only step of the considered iterative algorithms which requires global communications.

[Figure: communications in the one-to-all like parallel inner product.]

Parallel Inner Product I

T^IP_com = (t_s + t_c) log p    (d-cube)

T^IP_com = 2 (t_s + t_c) ⌈√p / 2⌉    (2D mesh)


Parallel Inner Product II

[Figure: communications in the all-to-all like parallel inner product.]

T^IP_com = (t_s + t_c) log p    (d-cube)

T^IP_com = 2 (t_s + t_c) (√p − 1)    (2D mesh)
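The all-to-all variant maps directly onto MPI_Allreduce. A minimal C++/MPI sketch (illustrative names; each processor holds a contiguous strip of the global vectors):

#include <mpi.h>
#include <vector>

// Inner product of two distributed vectors under the block (strip)
// partitioning used throughout these slides.  The local partial sum is
// combined by an all-to-all type reduction -- the only global
// communication required per iteration of the considered methods.
double parallel_inner_product(const std::vector<double>& x,
                              const std::vector<double>& y,
                              MPI_Comm comm)
{
    double local = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        local += x[i] * y[i];

    double global = 0.0;
    // On a hypercube this reduction completes in log p steps,
    // matching the T^IP_com estimate above.
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}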


Sparse Matrices

Example A1

With a red-black type numbering of the 3×3 mesh (corners 1–4, center 5, edge midpoints 6–9; cf. the mesh figure below, up to a symmetric permutation of same-colored nodes):

A =
⎛  4   0   0   0   0  −1  −1   0   0 ⎞
⎜  0   4   0   0   0  −1   0  −1   0 ⎟
⎜  0   0   4   0   0   0  −1   0  −1 ⎟
⎜  0   0   0   4   0   0   0  −1  −1 ⎟
⎜  0   0   0   0   4  −1  −1  −1  −1 ⎟
⎜ −1  −1   0   0  −1   4   0   0   0 ⎟
⎜ −1   0  −1   0  −1   0   4   0   0 ⎟
⎜  0  −1   0  −1  −1   0   0   4   0 ⎟
⎝  0   0  −1  −1  −1   0   0   0   4 ⎠

i.e., A has a 2×2 block form with diagonal blocks 4I_5 and 4I_4.


Example A2

With the column-wise numbering 1–9 of the 3×3 mesh:

A =
⎛  4  −1   0  −1   0   0   0   0   0 ⎞
⎜ −1   4  −1   0  −1   0   0   0   0 ⎟
⎜  0  −1   4   0   0  −1   0   0   0 ⎟
⎜ −1   0   0   4  −1   0  −1   0   0 ⎟
⎜  0  −1   0  −1   4  −1   0  −1   0 ⎟
⎜  0   0  −1   0  −1   4   0   0  −1 ⎟
⎜  0   0   0  −1   0   0   4  −1   0 ⎟
⎜  0   0   0   0  −1   0  −1   4  −1 ⎟
⎝  0   0   0   0   0  −1   0  −1   4 ⎠


FDM/FEM Sparse Matrices I

Consider the model problem

−u_xx − u_yy = f in Ω = [0, 1]²

with Dirichlet boundary conditions on Γ = ∂Ω. Let us assume that FDM or FEM is used to solve the problem numerically on a uniform mesh ω_h with mesh-size h = 1/(n + 1).

[Figure: the 3×3 interior mesh with the two node numberings: (A1) red-black type ordering, (A2) column-wise ordering 1–9.]


FDM/FEM Sparse Matrices

If a column-wise numbering of the nodes (unknowns) is used, then

A = blocktridiag(A_{i,i−1}, A_{i,i}, A_{i,i+1}) =

⎛ A_{1,1}  A_{1,2}                                      ⎞
⎜ A_{2,1}  A_{2,2}  A_{2,3}                             ⎟
⎜          A_{3,2}  A_{3,3}  A_{3,4}                    ⎟
⎜                  ...      ...      ...                ⎟
⎜          A_{n−1,n−2}  A_{n−1,n−1}  A_{n−1,n}          ⎟
⎝                       A_{n,n−1}    A_{n,n}            ⎠

A_{i,i} = tridiag(−1, 4, −1),    A_{i,i−1} = A_{i,i+1} = −I,

A ∈ ℝ^{N×N},    A_{i,j} ∈ ℝ^{n×n},    N = n².
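In practice, the product with such a stencil matrix is applied matrix-free. A small serial C++ sketch for the model problem (illustrative; the column-wise numbering maps unknown (i, j) to index j·n + i):

#include <vector>

// y = A*x for the n x n model problem (5-point stencil, homogeneous
// Dirichlet boundary conditions), column-wise numbering, N = n*n.
// Roughly 9N arithmetic operations, matching T(N,1) ~ 9N ta below.
void apply_laplacian(int n, const std::vector<double>& x,
                     std::vector<double>& y)
{
    auto id = [n](int i, int j) { return j * n + i; };
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            double v = 4.0 * x[id(i, j)];
            if (i > 0)     v -= x[id(i - 1, j)];
            if (i < n - 1) v -= x[id(i + 1, j)];
            if (j > 0)     v -= x[id(i, j - 1)];
            if (j < n - 1) v -= x[id(i, j + 1)];
            y[id(i, j)] = v;
        }
}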


Matrix-Vector Multiplication I

[Figure: matrix-vector multiplication with a block-stripped partitioning of the block-tridiagonal sparse matrix over processors P0, P1, P2.]

T(N, 1) ≈ 9N t_a,    T_com = 2 (t_s + n t_c),

T(N, p) ≈ (9N / p) t_a + 2 (t_s + √N t_c).
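A sketch of the communication step under this partitioning: each processor owns a strip of block rows and exchanges only the n interface values with its two neighbours, one start-up per neighbour, matching T_com = 2(t_s + n t_c). Function and variable names are illustrative:

#include <mpi.h>
#include <vector>

// Exchange interface values with the two neighbouring processors in the
// block-stripped partitioning.  halo_lo/halo_hi (each of length n)
// receive the boundary entries of the lower/upper neighbour; the local
// stencil apply then uses them for the coupling blocks -I.
void exchange_interfaces(int rank, int p, int n,
                         const std::vector<double>& x_local,
                         std::vector<double>& halo_lo,
                         std::vector<double>& halo_hi)
{
    MPI_Status st;
    if (rank > 0)        // send my first block column down, receive theirs
        MPI_Sendrecv(x_local.data(), n, MPI_DOUBLE, rank - 1, 0,
                     halo_lo.data(), n, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, &st);
    if (rank < p - 1)    // send my last block column up, receive theirs
        MPI_Sendrecv(x_local.data() + x_local.size() - n, n, MPI_DOUBLE,
                     rank + 1, 0,
                     halo_hi.data(), n, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, &st);
}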


Jacobi Iterative Method

The i-th equation of the system Ax = b can be written in the form

x_i = (1 / A_{i,i}) ( b_i − Σ_{j≠i} A_{i,j} x_j ).

The iteration step in the Jacobi method is

x_i^{k+1} = (1 / A_{i,i}) ( b_i − Σ_{j≠i} A_{i,j} x_j^k ).


Jacobi Iterative Method

or equivalently

x_i^{k+1} = g_i^k / A_{i,i} + x_i^k.

The method always converges in the class of diagonally dominant matrices.


Jacobi Algorithm

procedure JACOBI(A, b, x, ε)
begin
    k := 0;
    Select initial solution vector x^0;
    g^0 := b − A x^0;
    while ‖g^k‖_2 > ε ‖g^0‖_2 do
    begin
        k := k + 1;
        for i := 1 to N do
            x_i^k := g_i^{k−1} / A_{i,i} + x_i^{k−1};
        g^k := b − A x^k;
    endwhile;
    x := x^k;
end JACOBI

The complexity of one iteration is as follows:

N^Jac_it(A^{−1}b) ≈ N(Ad) + N(IP) + 3N,

which for the model problem reads

N^Jac_it(A^{−1}b) ≈ 14N.

The related times are simply derived using the related matrix-vector communication estimate:

T^it(N, 1) ≈ 14N t_a,

T^it_com = 2 (t_s + n t_c) + T^IP_com,

T^it(N, p) ≈ (14N / p) t_a + 2 (t_s + √N t_c).
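A serial sketch of the update in line with the listing above (a minimal illustration, not the authors' code; for the model problem the diagonal is constant, A_{i,i} = 4):

#include <vector>

// One Jacobi update x^k := g^{k-1}/A_ii + x^{k-1}; the residual
// g = b - A x is recomputed after each sweep (cf. procedure JACOBI).
void jacobi_sweep(const std::vector<double>& g, std::vector<double>& x)
{
    const double diag = 4.0;   // A_ii of the 5-point stencil
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] += g[i] / diag;
}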


Conjugate Gradient Algorithm

procedure CG(A, b, x, ε)
begin
    k := 0;
    Select initial solution vector x^0;
    g^0 := A x^0 − b;   d^0 := −g^0;
    while ‖g^k‖_2 > ε ‖g^0‖_2 do
    begin
        τ_k := (g^{kT} g^k) / (d^{kT} A d^k);
        x^{k+1} := x^k + τ_k d^k;
        g^{k+1} := g^k + τ_k A d^k;
        β_k := (g^{k+1 T} g^{k+1}) / (g^{kT} g^k);
        d^{k+1} := −g^{k+1} + β_k d^k;
        k := k + 1;
    endwhile;
    x := x^k;
end CG

The computational complexity of one CG iteration is as follows:

N^CG_it(A^{−1}b) ≈ N(Ad) + 2N(IP) + 3N(LT),

N^CG_it(A^{−1}b) ≈ 19N.

The related times are derived in a similar way as for the Jacobi method:

T^it(N, 1) ≈ 19N t_a,

T^it_com = 2 (t_s + n t_c) + 2 T^IP_com,

T^it(N, p) ≈ (19N / p) t_a + 2 (t_s + √N t_c).
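Putting the pieces together, a compact serial CG sketch for the model problem, using the matrix-free apply_laplacian from earlier (illustrative; the parallel version replaces dot with the MPI inner product and adds the interface exchange before each matrix apply):

#include <vector>

void apply_laplacian(int n, const std::vector<double>& x,
                     std::vector<double>& y);   // as sketched earlier

static double dot(const std::vector<double>& a, const std::vector<double>& b)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Plain CG for A x = b following the listing above; g = A x - b is the
// residual, d the search direction.
void cg(int n, const std::vector<double>& b, std::vector<double>& x,
        double eps)
{
    const std::size_t N = x.size();
    std::vector<double> g(N), d(N), Ad(N);
    apply_laplacian(n, x, g);                       // g^0 = A x^0 - b
    for (std::size_t i = 0; i < N; ++i) { g[i] -= b[i]; d[i] = -g[i]; }
    double gg = dot(g, g);
    const double gg0 = gg;
    while (gg > eps * eps * gg0) {                  // ||g^k|| > eps ||g^0||
        apply_laplacian(n, d, Ad);
        const double tau = gg / dot(d, Ad);         // tau_k
        for (std::size_t i = 0; i < N; ++i) {
            x[i] += tau * d[i];                     // x^{k+1}
            g[i] += tau * Ad[i];                    // g^{k+1}
        }
        const double gg_new = dot(g, g);
        const double beta = gg_new / gg;            // beta_k
        for (std::size_t i = 0; i < N; ++i)
            d[i] = -g[i] + beta * d[i];             // d^{k+1}
        gg = gg_new;
    }
}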


Convergence Rate of CG Method

Theorem.

p(ε) ≤ (1/2) √κ(A) ln(2/ε) + 1,

where p(ε) stands for the smallest number k such that

‖x^k − x‖_A ≤ ε ‖x^0 − x‖_A    ∀ x^0 ∈ ℝ^N.

In a very general setting of FEM/FDM sparse matrices, the spectral condition number behaves as κ(A) = O(N).


Convergence Rate of CG Method

Therefore N^CG(A^{−1}b) = O(N^{3/2}). It is important to note that the same complexity holds for the best known direct method, namely N^ND(A^{−1}b) = O(N^{3/2}), where ND stands for the Nested Dissection method.


Numerical Tests

The numbers of iterations for the model problem of the Gauss-Seidel (G-S), Steepest Descent (SD), Conjugate Gradient (CG) and Preconditioned CG (PCG) methods are presented in the table. The implemented PCG algorithm is the subject of the next section.

n^{G-S}_it = O(N),    n^{SD}_it = O(N),

n^{CG}_it = O(N^{1/2}),    n^{PCG}_it = O(N^{1/4})


Numerical Tests

 n      G-S      SD     CG    PCG
 4       82     185     26     11
 8      309     698     45     15
16     1151    2592     91     19
32     4242    9541    177     27
64    15529   34818    351     38

Conclusion. The efforts should be directed to the development of scalable parallel algorithms for sufficiently fast iterative solution methods.


Preconditioned CG Method

The idea of the PCG method is to substitute the original linear system with a new one which is better conditioned:

Ax = b   ⟹   C^{−1/2} A C^{−1/2} y = C^{−1/2} b,    y = C^{1/2} x.

The PCG strategy is to construct a preconditioner C such that:

κ(C^{−1}A) ≪ κ(A),

N(C^{−1}v) ≪ N(A^{−1}v).

The preconditioner is called optimal if κ(C^{−1}A) = O(1) and N(C^{−1}v) = O(N).


PCG Algorithm

procedure PCG(A, b, x, ε)
begin
    k := 0;
    Select initial solution vector x^0;
    g^0 := A x^0 − b;   h^0 := C^{−1} g^0;   d^0 := −h^0;
    while ‖g^k‖_{C^{−1}} > ε ‖g^0‖_{C^{−1}} do
    begin
        τ_k := (g^{kT} h^k) / (d^{kT} A d^k);
        x^{k+1} := x^k + τ_k d^k;
        g^{k+1} := g^k + τ_k A d^k;
        h^{k+1} := C^{−1} g^{k+1};
        β_k := (g^{k+1 T} h^{k+1}) / (g^{kT} h^k);
        d^{k+1} := −h^{k+1} + β_k d^k;
        k := k + 1;
    endwhile;
    x := x^k;
end PCG

Following the structure of our analysis, we estimate the computational complexity of one PCG iteration in the form:

N^PCG_it(A^{−1}b) ≈ N(C^{−1}g) + N(Ad) + 2N(IP) + 3N(LT),

N^PCG_it(A^{−1}b) ≈ N(C^{−1}g) + 19N.

Then the related per-iteration PCG times are as follows:

T^it(N, 1) ≈ T^{(C^{−1}g)}(N, 1) + 19N t_a,

T^it_com = 2 (t_s + n t_c) + T^{(C^{−1}g)}_com + 2 T^IP_com.
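Relative to plain CG, the only structural change is the extra application h^{k+1} = C^{−1} g^{k+1} in each iteration. A small sketch of how a preconditioner might be plugged in (illustrative; any of the techniques listed below just supplies this callback):

#include <functional>
#include <vector>

// The preconditioner enters PCG only through its action on the
// residual: h = C^{-1} g.
using Precond =
    std::function<void(const std::vector<double>& g, std::vector<double>& h)>;

// Example: the identity preconditioner C = I reduces PCG to plain CG.
Precond identity = [](const std::vector<double>& g, std::vector<double>& h) {
    h = g;
};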


Convergence Rate of PCG Method

Theorem.

p(ε) ≤ (1/2) √κ(C^{−1}A) ln(2/ε) + 1,

where p(ε) stands for the smallest number k such that

‖x^k − x‖_A ≤ ε ‖x^0 − x‖_A    ∀ x^0 ∈ ℝ^N.


Convergence Rate of PCG Method

Some parallel preconditioning techniques:
• Incomplete Factorization
• Circulant Block Factorization
• Domain Decomposition
• Patched Local Refinement
• Multigrid/Multilevel
• Approximate Inverse


Circulant Block Factorization

A circulant matrix C has the form C_{k,j} = c_{(j−k) mod m}:

C =
⎛ c_0      c_1      c_2     ...    c_{m−1} ⎞
⎜ c_{m−1}  c_0      c_1     ...    c_{m−2} ⎟
⎜   ...      ...      ...            ...   ⎟
⎝ c_1      c_2      ...    c_{m−1}  c_0    ⎠

C = (c_0, c_1, ..., c_{m−1}) = F Λ F*,

N(C^{−1}v) = O(m log m).
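Since F is the discrete Fourier matrix, C^{−1}v can be applied with two FFTs and a diagonal solve. A serial sketch assuming the FFTW library (fftw3); the sign/normalization conventions follow fftw_plan_dft_1d, and everything else is illustrative:

#include <complex>
#include <vector>
#include <fftw3.h>

// Solve C w = u for the m x m circulant matrix C_{k,j} = c_{(j-k) mod m}
// via C = F Lambda F*: O(m log m) operations.
std::vector<std::complex<double>>
circulant_solve(const std::vector<std::complex<double>>& c,
                std::vector<std::complex<double>> u)
{
    const int m = static_cast<int>(c.size());
    std::vector<std::complex<double>> lambda(c), w(m);
    auto* lam = reinterpret_cast<fftw_complex*>(lambda.data());
    auto* uu  = reinterpret_cast<fftw_complex*>(u.data());
    auto* ww  = reinterpret_cast<fftw_complex*>(w.data());

    // Eigenvalues lambda_t = sum_l c_l exp(+2 pi i l t / m): an
    // (unnormalized) backward DFT of the defining vector c.
    fftw_plan pl = fftw_plan_dft_1d(m, lam, lam, FFTW_BACKWARD, FFTW_ESTIMATE);
    fftw_execute(pl);

    // Transform the right-hand side, divide by the eigenvalues ...
    fftw_plan pf = fftw_plan_dft_1d(m, uu, uu, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(pf);
    for (int t = 0; t < m; ++t) u[t] /= lambda[t];

    // ... and transform back, restoring FFTW's missing 1/m factor.
    fftw_plan pb = fftw_plan_dft_1d(m, uu, ww, FFTW_BACKWARD, FFTW_ESTIMATE);
    fftw_execute(pb);
    for (auto& z : w) z /= static_cast<double>(m);

    fftw_destroy_plan(pl); fftw_destroy_plan(pf); fftw_destroy_plan(pb);
    return w;
}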


2D model problem

−(a(x, y) u_x)_x − (b(x, y) u_y)_y = f(x, y),    ∀(x, y) ∈ Ω,

u(x, y) = 0,    ∀(x, y) ∈ Γ = ∂Ω,

0 < c_min ≤ a(x, y), b(x, y) ≤ c_max,

A = tridiag(−A_{i,i−1}, A_{i,i}, −A_{i,i+1}),    i = 1, 2, ..., n,

C = tridiag(−C_{i,i−1}, C_{i,i}, −C_{i,i+1}),    i = 1, 2, ..., n,

where C_{i,j} = Circulant(A_{i,j}) is some given circulant approximation of the corresponding block A_{i,j}.


Factorization

C = D − L − U,

C = (X − L)(I − X^{−1} U),

X = D − L X^{−1} U,

X_1 = C_{1,1},

X_i = C_{i,i} − C_{i,i−1} X_{i−1}^{−1} C_{i−1,i},    i = 2, ..., n,

C_{i,j} = F Λ_{i,j} F*,

X_i = F D_i F*.


Factorization

D_1^{−1} = Λ_{1,1},

D_i^{−1} = Λ_{i,i} − Λ_{i,i−1} D_{i−1} Λ_{i−1,i}.

Let us denote Λ = tridiag(Λ_{i,i−1}, Λ_{i,i}, Λ_{i,i+1}). Then the following relation holds:

C w = u   ⟺   (I ⊗ F) Λ (I ⊗ F*) w = u,

ū = (I ⊗ F*) u,

Λ w̄ = ū,

w = (I ⊗ F) w̄.

Factorization

v_1 = D_1 ū_1,

v_i = D_i (ū_i − Λ_{i,i−1} v_{i−1}),    i = 2, 3, ..., n,

w̄_n = v_n,

w̄_i = v_i − D_i Λ_{i,i+1} w̄_{i+1},    i = n−1, n−2, ..., 1.
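In the Fourier basis all blocks are diagonal, so the two sweeps decouple into m independent scalar recurrences per block row. A sketch (illustrative; D[i], Lsub[i] = Λ_{i,i−1} and Usup[i] = Λ_{i,i+1} are stored as diagonal m-vectors, with D[i] the factor applied in the solve):

#include <complex>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Forward/backward block sweep in the Fourier basis; u holds the
// transformed right-hand side (one m-vector per block row) and is
// overwritten with the solution w.
void cbf_sweeps(const std::vector<cvec>& D, const std::vector<cvec>& Lsub,
                const std::vector<cvec>& Usup, std::vector<cvec>& u)
{
    const std::size_t n = D.size(), m = D[0].size();
    // v_1 = D_1 u_1;  v_i = D_i (u_i - Lambda_{i,i-1} v_{i-1})
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t t = 0; t < m; ++t) {
            if (i > 0) u[i][t] -= Lsub[i][t] * u[i - 1][t];
            u[i][t] *= D[i][t];
        }
    // w_n = v_n;  w_i = v_i - D_i Lambda_{i,i+1} w_{i+1}
    for (std::size_t i = n - 1; i-- > 0; )
        for (std::size_t t = 0; t < m; ++t)
            u[i][t] -= D[i][t] * Usup[i][t] * u[i + 1][t];
}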


Parallel algorithm

[Figure: distribution of the vector entries over the processors P0, P1, P2, P3.]

The CBF preconditioning can be split into three stages. If we use the column-wise mapping for the first and third stages, there is no need for communication, because we perform block-FFTs on blocks which are stored on one processor. For the second stage we have to reorder the vector entries using a row-wise mapping.


Parallel CBF tests

SUN Ultra-Enterprise Symmetric Multiprocessor, 168 MHz and 250 MHz configurations:

                168 MHz               250 MHz
  n   p    T(p)   S_p   E_p      T(p)   S_p   E_p
128   1   0.086                 0.081
      2   0.047  1.84  0.92     0.047  1.71  0.86
      4   0.028  3.04  0.76     0.029  2.77  0.69
      8   0.021  4.13  0.52     0.096  0.84  0.10

256   1   0.389                 0.392
      2   0.207  1.88  0.94     0.208  1.88  0.94
      4   0.109  3.56  0.89     0.127  3.09  0.77
      8   0.065  6.02  0.75     0.138  2.83  0.35

384   1   1.460                 1.498
      2   0.759  1.92  0.96     0.783  1.91  0.96
      3   0.523  2.79  0.93     0.533  2.81  0.94
      4   0.394  3.71  0.93     0.473  3.17  0.79
      6   0.269  5.43  0.90     0.780  1.92  0.32
      8   0.338  4.32  0.54     1.122  1.33  0.17

420   1   3.718                 2.651
      2   1.922  1.93  0.97     1.378  1.92  0.96
      3   1.313  2.83  0.94     0.937  2.83  0.94
      4   0.990  3.75  0.94     0.714  3.71  0.93
      5   0.817  4.55  0.91     1.005  2.64  0.53
      6   0.679  5.48  0.91     1.233  2.15  0.36
      7   0.595  6.25  0.89     1.314  2.02  0.29

MIC(0) Algorithm

Let us rewrite the real matrix A in the form A = D − L − L^T. Then the modified incomplete Cholesky factorization is defined as follows:

C_MIC(0)(A) = (X − L) X^{−1} (X − L)^T,

where X = diag(x_1, ..., x_N) provides the equal rowsums condition.


MIC(0) Algorithm

Theorem. Let us assume that

(a) L ≥ 0,
(b) A e ≥ 0,
(c) A e + L^T e > 0,    e = (1, ..., 1)^T ∈ ℝ^N.

Then the relation

x_i = a_ii − Σ_{k=1}^{i−1} (a_ik / x_k) Σ_{j=k+1}^{N} a_kj > 0

gives a stable MIC(0) factorization of A.
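A direct transcription of the recurrence for the diagonal X (dense indexing purely for clarity; a real implementation would run over the sparse pattern only):

#include <vector>

// Compute X = diag(x_1, ..., x_N) of the MIC(0) factorization
// C = (X - L) X^{-1} (X - L)^T from
//   x_i = a_ii - sum_{k<i} (a_ik / x_k) * sum_{j>k} a_kj.
// A is stored row-major, N x N.
std::vector<double> mic0_diagonal(const std::vector<double>& A, int N)
{
    std::vector<double> x(N);
    std::vector<double> rowtail(N, 0.0);   // rowtail[k] = sum_{j>k} a_kj
    for (int k = 0; k < N; ++k)
        for (int j = k + 1; j < N; ++j)
            rowtail[k] += A[k * N + j];

    for (int i = 0; i < N; ++i) {
        double xi = A[i * N + i];
        for (int k = 0; k < i; ++k)
            xi -= A[i * N + k] / x[k] * rowtail[k];
        x[i] = xi;   // x_i > 0 is guaranteed under (a)-(c) above
    }
    return x;
}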


Remark. All presented numerical tests are performed using the perturbed MIC(0) algorithm, where the incomplete factorization is applied to the matrix Ā = A + D̄. The diagonal perturbation D̄ = D̄(ξ) = diag(d̄_1, ..., d̄_N) is defined as follows:

d̄_i = ξ a_ii         if a_ii ≥ 2 w_i,
d̄_i = ξ^{1/2} a_ii   if a_ii < 2 w_i,

where w_i = −Σ_{j>i} a_ij.

Here 0 < ξ < 1 is a constant of the same order as the minimal eigenvalue of A. The computations for the considered model problems are done with ξ = h².


MIC(0) Complexity

The MIC(0) computational complexity of one PCG iteration is as follows:

N^PCG_it(A^{−1}b) ≈ N(C^{−1}g) + 19N,    N(C^{−1}g) ≈ 11N,

N^PCG_it(A^{−1}b) ≈ 30N.

MIC(0) is a cheap preconditioning algorithm: the cost of N(C^{−1}g) is almost the same as that of N(Ad).


MIC(0) Complexity

MIC(0) is a robust preconditioner with respect to local singularities of the problem, with κ(C^{−1}A) = O(N^{1/2}) and hence N^PCG(A^{−1}b) = O(N^{5/4}).

MIC(0) is an inherently sequential algorithm.


FDM/FEM Sparse Matrices II

The model problem is considered again: −u_xx − u_yy = f in Ω = [0, 1]² with Dirichlet boundary conditions on Γ = ∂Ω.

[Figure: the regular mesh (ReM) and the alternative skewed mesh (SkM).]

Since a five-point stencil is used in both cases, the accuracy of the regular mesh (ReM) and the alternative skewed mesh (SkM) FDM/FEM approximations is one and the same.


Block-Structure of the Matrices

[Figure: block structure of the stiffness matrices for SkM and ReM.]

The bottleneck of the parallel implementation of the MIC(0) algorithm is the solution of systems with the triangular matrices (X − L) and (X − L)^T.


Block-Structure of the Matrices

The key point of our consideration is that, in the case of the skewed mesh, the stiffness matrix has a block structure whose diagonal blocks are themselves diagonal, so the triangular solves reduce to a block recurrence built from diagonal solves and sparse matrix-vector products, which parallelizes well.


Parallel MIC(0) algorithm

[Figure: MIC(0) PCG algorithm with a block-stripped partitioning over processors P0, P1, P2: N = n² + (n − 1)².]

T^MIC(0)_it(N, 1) ≈ 38N t_a,    T_com ≈ (4 t_s + 6 t_c) n + 2 T^IP_com,

T^MIC(0)_it(N, p) ≈ (38N / p) t_a + (2 t_s + 3 t_c) √(2N).


Parallel MIC(0) Tests

The presented tests are performed on a Beowulf-like cluster of four dual-processor Power Macintosh computers with 512 MB RAM each and G4 processors at 450 MHz.

The parallel MIC(0) algorithm is implemented in C++ using the Message Passing Interface (MPI).

Yellow Dog Linux with LAM MPI is used.

The size of the problem and the number of processors are varied to examine the parallel scalability of the code.


Parallel Speedup

n         32    64   128   256   512  1024  1500
S(n,2)  1.21  1.68  1.96  1.85  1.92  2.03  2.02
S(n,3)  0.24  0.46  0.97  1.72  2.45  2.97  2.86
S(n,4)  0.22  0.46  1.11  1.97  2.88  3.76  3.95
S(n,5)  0.20  0.40  0.96  1.99  3.25  4.48  4.86
S(n,6)  0.18  0.39  1.03  1.99  3.55  5.23  5.73
S(n,7)  0.19  0.38  0.95  1.78  3.63  6.02  6.31
S(n,8)  0.19  0.39  1.00  2.28  3.97  6.37  6.76


Parallel Efficiency

n         32    64   128   256   512  1024  1500
E(n,2)  0.60  0.84  0.98  0.93  0.96  1.02  1.01
E(n,3)  0.08  0.15  0.32  0.57  0.81  0.99  0.95
E(n,4)  0.06  0.12  0.27  0.49  0.72  0.94  0.98
E(n,5)  0.04  0.08  0.19  0.40  0.65  0.90  0.97
E(n,6)  0.03  0.07  0.17  0.33  0.59  0.87  0.96
E(n,7)  0.03  0.05  0.14  0.25  0.51  0.86  0.90
E(n,8)  0.02  0.05  0.13  0.29  0.50  0.80  0.84

