Scientific Computing II: Conjugate Gradient Methods
Michael Bader, Technical University of Munich, Summer 2016



  • Scientific Computing II

    Conjugate Gradient Methods

    Michael Bader, Technical University of Munich

    Summer 2016

  • Families of Iterative Solvers

    • relaxation methods:
      • Jacobi relaxation, Gauss-Seidel relaxation, ...
      • over-relaxation methods

    • Krylov methods:
      • Steepest Descent, Conjugate Gradient, ...
      • GMRES, ...

    • multilevel/multigrid methods, domain decomposition, ...


  • Remember: The Residual Equation

    • for Ax = b, we defined the residual as:

      r^(i) = b − Ax^(i)

    • and the error: e^(i) := x − x^(i)

    • leads to the residual equation:

      Ae^(i) = r^(i)

    • relaxation methods: solve a modified (easier) SLE:

      B ê^(i) = r^(i)   where B ∼ A

    • multigrid methods: coarse-grid correction on the residual equation:

      A_H e_H^(i) = r_H^(i)   and   x^(i+1) := x^(i) + I_H^h e_H^(i)


  • Part I

    Quadratic Forms and Steepest Descent

    Quadratic Forms
    Direction of Steepest Descent
    Steepest Descent


  • Quadratic Forms

    A quadratic form is a scalar, quadratic function of a vector of the form:

      f(x) = 1/2 x^T A x − b^T x + c,   where A = A^T

    (figure: 3D surface plot of f(x, y) over the (x, y)-plane for the example matrix)


  • Quadratic Forms (2)

    The gradient of a quadratic form is defined as

      f'(x) = ( ∂f/∂x_1 (x), ..., ∂f/∂x_n (x) )^T

    • apply this to f(x) = 1/2 x^T A x − b^T x + c, then
      • f'(x) = Ax − b
      • f'(x) = 0  ⇔  Ax − b = 0  ⇔  Ax = b

    ⇒ Ax = b is equivalent to a minimisation problem
    ⇒ proper minimum only if A is positive definite


  • Direction of Steepest Descent

    • gradient f'(x): direction of “steepest ascent”
    • f'(x) = Ax − b = −r (with residual r = b − Ax)
    • residual r: direction of “steepest descent”

    (figure: contour plot of the quadratic form; the residual points in the direction of steepest descent)


  • Solving SLE via Minimum Search

    • basic idea to find the minimum:
      move into the direction of steepest descent
    • most simple scheme:  x^(i+1) = x^(i) + α r^(i)
    • α constant ⇒ Richardson iteration
      (usually considered as a relaxation method)
    • better choice of α: move to the lowest point in that direction
      ⇒ Steepest Descent


  • Steepest Descent – find an optimal α

    • task: line search along the line x^(1) = x^(0) + α r^(0)
    • choose α such that f(x^(1)) is minimal:

      ∂/∂α f(x^(1)) = 0

    • use the chain rule:

      ∂/∂α f(x^(1)) = f'(x^(1))^T ∂/∂α x^(1) = f'(x^(1))^T r^(0)

    • remember f'(x^(1)) = −r^(1), thus we require:

      −(r^(1))^T r^(0) = 0

    hence, f'(x^(1)) = −r^(1) should be orthogonal to r^(0)


  • Steepest Descent – find α (2)

      (r^(1))^T r^(0) = (b − A x^(1))^T r^(0) = 0
      (b − A(x^(0) + α r^(0)))^T r^(0) = 0
      (b − A x^(0))^T r^(0) − α (A r^(0))^T r^(0) = 0
      (r^(0))^T r^(0) − α (r^(0))^T A r^(0) = 0

    Solve for α:

      α = (r^(0))^T r^(0) / ((r^(0))^T A r^(0))


  • Steepest Descent – Algorithm

    1. r^(i) = b − A x^(i)
    2. α_i = (r^(i))^T r^(i) / ((r^(i))^T A r^(i))
    3. x^(i+1) = x^(i) + α_i r^(i)

    (a runnable sketch of these steps follows below)

    Observations:

    (figure: zig-zag path of the Steepest Descent iterates on the contour plot)

    • slow convergence (similar to Jacobi relaxation)
    • ||e^(i)||_A ≤ ((κ−1)/(κ+1))^i ||e^(0)||_A
    • for positive definite A: κ = λ_max/λ_min (largest/smallest eigenvalue of A)
    • many steps in the same direction
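    A minimal Python/NumPy sketch of the three steps above (not from the slides; the function name and the tolerances are chosen freely):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10000):
    """Steepest Descent: steps 1-3 of the algorithm above, repeated until the residual is small."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        r = b - A @ x                    # 1. residual = direction of steepest descent
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        alpha = rr / (r @ A @ r)         # 2. optimal step length along r
        x = x + alpha * r                # 3. update the iterate
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])   # Example 1 on the next slide
b = np.array([2.0, -8.0])
print(steepest_descent(A, b, np.array([-3.0, -3.0])))
```

    With the matrix of Example 2 the same routine needs far more iterations, which is exactly the behaviour discussed on the following slides.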


  • Steepest Descent – Example 1

    Consider the example

      (3  2; 2  6) x = (2; −8),   with starting solution x^(0) = (−3; −3)


  • Steepest Descent – Example 2

    Consider the example

      (3  2; 2  100) x = (2; −8),   with starting solution x^(0) = (−10; −2)

    Multiple steps in the same search direction!
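    Not on the slides: the error bound from the algorithm slide makes the difference between the two examples concrete; a quick check of the condition numbers (computed with NumPy):

```python
import numpy as np

# condition numbers and contraction factors (kappa-1)/(kappa+1) of the two examples
for A in (np.array([[3.0, 2.0], [2.0, 6.0]]), np.array([[3.0, 2.0], [2.0, 100.0]])):
    lam = np.linalg.eigvalsh(A)          # eigenvalues of the symmetric matrix
    kappa = lam.max() / lam.min()
    print(kappa, (kappa - 1) / (kappa + 1))
```

    Example 1 has κ = 3.5 and a contraction factor of about 0.56 per step, while Example 2 has κ ≈ 33.8 and a factor of about 0.94, so the error bound shrinks by only a few percent per step.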


  • Part II

    Conjugate Gradients

    Conjugate Directions
    A-Orthogonality
    Conjugate Gradients
    A Miracle Occurs ...
    CG Algorithm


  • Conjugate Directions

    • observation:
      Steepest Descent takes repeated steps in the same direction
    • obvious idea:
      try to do only one step in each direction
    • possible approach:
      choose orthogonal search directions d^(0) ⊥ d^(1) ⊥ d^(2) ⊥ ...
    • notice:
      errors orthogonal to previous directions:

      e^(1) ⊥ d^(0),   e^(2) ⊥ d^(1) ⊥ d^(0),   ...


  • Conjugate Directions (2)

    • compute α from

      (d^(0))^T e^(1) = (d^(0))^T (e^(0) − α d^(0)) = 0

      which requires propagation of the error e^(1) = x − x^(1):

      x^(1)     = x^(0) + α d^(0)
      x − x^(1) = x − x^(0) − α d^(0)
      e^(1)     = e^(0) − α d^(0)

    • formula for α:

      α = (d^(0))^T e^(0) / ((d^(0))^T d^(0))

    • but: we don't know e^(0)


  • A-Orthogonality

    • make the search directions A-orthogonal:

      (d^(i))^T A d^(j) = 0   for i ≠ j

    • again: errors A-orthogonal to previous directions:

      (e^(i+1))^T A d^(i) = 0

    • equivalent to minimisation in search direction d^(i):

      ∂/∂α f(x^(i+1)) = (f'(x^(i+1)))^T ∂/∂α x^(i+1) = 0
      ⇔ −(r^(i+1))^T d^(i) = 0
      ⇔ −(d^(i))^T A e^(i+1) = 0


  • A-Conjugate Directions

    • remember the formula for conjugate directions:

      α = (d^(0))^T e^(0) / ((d^(0))^T d^(0))

    • same computation, but with A-orthogonality (for the i-th iteration):

      α_i = (d^(i))^T A e^(i) / ((d^(i))^T A d^(i)) = (d^(i))^T r^(i) / ((d^(i))^T A d^(i))

    • these α_i can be computed!
    • still to do: find A-orthogonal search directions


  • A-Conjugate Directions (2)

    classical approach to find orthogonal directions → conjugate Gram-Schmidt process:

    • from linearly independent vectors u^(0), u^(1), ..., u^(i−1)
    • construct A-orthogonal directions d^(0), d^(1), ..., d^(i−1):

      d^(i) = u^(i) + Σ_{k=0}^{i−1} β_ik d^(k),   β_ik = −(u^(i))^T A d^(k) / ((d^(k))^T A d^(k))

    • needs to keep all old search vectors in memory
    • O(n^3) computational complexity ⇒ infeasible
      (a small sketch of the process is given below)
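    For illustration only (not from the slides): a direct NumPy implementation of the conjugate Gram-Schmidt process, which makes the memory and work requirements visible; the function name is chosen here.

```python
import numpy as np

def conjugate_gram_schmidt(A, U):
    """A-orthogonalise the columns of U (assumed linearly independent).

    All previous directions d^(0), ..., d^(i-1) must be kept and re-used,
    which leads to the O(n^3) work / full memory footprint mentioned above.
    """
    n, m = U.shape
    D = np.zeros((n, m))
    for i in range(m):
        d = U[:, i].astype(float).copy()
        for k in range(i):
            beta_ik = -(U[:, i] @ A @ D[:, k]) / (D[:, k] @ A @ D[:, k])
            d += beta_ik * D[:, k]
        D[:, i] = d
    return D

A = np.array([[3.0, 2.0], [2.0, 100.0]])
D = conjugate_gram_schmidt(A, np.eye(2))
print(D[:, 0] @ A @ D[:, 1])   # approx. 0: the two directions are A-orthogonal
```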


  • Conjugate Gradients

    • use the residuals (i.e., u^(i) := r^(i)) to construct conjugate directions:

      d^(i) = r^(i) + Σ_{k=0}^{i−1} β_ik d^(k)

    • the new direction d^(i) should be A-orthogonal to all d^(j):

      0 = (d^(i))^T A d^(j) = (r^(i))^T A d^(j) + Σ_{k=0}^{i−1} β_ik (d^(k))^T A d^(j)

    • all directions d^(k) (for k = 0, ..., i − 1) are already A-orthogonal (and j < i), hence:

      0 = (r^(i))^T A d^(j) + β_ij (d^(j))^T A d^(j)   ⇒   β_ij = −(r^(i))^T A d^(j) / ((d^(j))^T A d^(j))


  • Conjugate Gradients – Status

    1. conjugate directions and computation of α_i:

      α_i = (d^(i))^T r^(i) / ((d^(i))^T A d^(i)),   x^(i+1) = x^(i) + α_i d^(i)

    2. use residuals to compute the search directions:

      d^(i) = r^(i) + Σ_{k=0}^{i−1} β_ik d^(k),   β_ik = −(r^(i))^T A d^(k) / ((d^(k))^T A d^(k))

    → still too expensive, as we need to store all vectors d^(k)


  • A Miracle Occurs – Part 1

    Two small contributions:

    1. propagation of the error e^(i) = x − x^(i):

      x^(i+1)     = x^(i) + α_i d^(i)
      x − x^(i+1) = x − x^(i) − α_i d^(i)
      e^(i+1)     = e^(i) − α_i d^(i)

      (we have used this once already)

    2. propagation of the residuals:

      r^(i+1) = Ae^(i+1) = A(e^(i) − α_i d^(i))   ⇒   r^(i+1) = r^(i) − α_i A d^(i)


  • A Miracle Occurs – Part 2

    Orthogonality of the residuals:
    • search directions are A-orthogonal
    • only one step in each direction
    • hence: the error is A-orthogonal to all previous search directions:

      (d^(i))^T A e^(j) = 0   for i < j

    • residuals are orthogonal to all previous search directions:

      (d^(i))^T r^(j) = 0   for i < j

    • search directions are built from the residuals:

      span{ d^(0), ..., d^(i−1) } = span{ r^(0), ..., r^(i−1) }

    • hence: the residuals are orthogonal:

      (r^(i))^T r^(j) = 0   for i < j


  • A Miracle Occurs – Part 3

    • combine orthogonality and the recurrence for the residuals:

      (r^(i))^T r^(j+1) = (r^(i))^T r^(j) − α_j (r^(i))^T A d^(j)
      ⇒ α_j (r^(i))^T A d^(j) = (r^(i))^T r^(j) − (r^(i))^T r^(j+1)

    • with (r^(i))^T r^(j) = 0 for i ≠ j:

      (r^(i))^T A d^(j) =    (1/α_i) (r^(i))^T r^(i)        if i = j
                           −(1/α_{i−1}) (r^(i))^T r^(i)     if i = j + 1
                             0                              otherwise


  • A Miracle Occurs – Part 4

    • computation of β_ik (for k = 0, ..., i − 1):

      β_ik = −(r^(i))^T A d^(k) / ((d^(k))^T A d^(k))
           = (r^(i))^T r^(i) / (α_{i−1} (d^(i−1))^T A d^(i−1))   if i = k + 1
           = 0                                                    if i > k + 1

    • thus: search directions

      d^(i) = r^(i) + Σ_{k=0}^{i−1} β_ik d^(k) = r^(i) + β_{i,i−1} d^(i−1)

      with  β_i := β_{i,i−1} = (r^(i))^T r^(i) / (α_{i−1} (d^(i−1))^T A d^(i−1))

    ⇒ reduces to a simple iterative scheme for β_i


  • A Miracle Occurs – Part 5

    • build the search directions:

      d^(i+1) = r^(i+1) + β_{i+1} d^(i),   β_{i+1} = (r^(i+1))^T r^(i+1) / (α_i (d^(i))^T A d^(i))

    • remember: α_i = (d^(i))^T r^(i) / ((d^(i))^T A d^(i))
    • thus: α_i (d^(i))^T A d^(i) = (d^(i))^T r^(i)

      ⇒ β_{i+1} = (r^(i+1))^T r^(i+1) / ((d^(i))^T r^(i)) = (r^(i+1))^T r^(i+1) / ((r^(i))^T r^(i))

    • last step:

      (d^(i))^T r^(i) = (r^(i) + β_i d^(i−1))^T r^(i) = (r^(i))^T r^(i) + β_i (d^(i−1))^T r^(i) = (r^(i))^T r^(i)

      (the residual r^(i) is orthogonal to the previous search direction d^(i−1))


  • Conjugate Gradients – Algorithm

    Start with d^(0) = r^(0) = b − Ax^(0).

    While ||r^(i)|| > ε, iterate over:

    1. α_i = (r^(i))^T r^(i) / ((d^(i))^T A d^(i))
    2. x^(i+1) = x^(i) + α_i d^(i)
    3. r^(i+1) = r^(i) − α_i A d^(i)
    4. β_{i+1} = (r^(i+1))^T r^(i+1) / ((r^(i))^T r^(i))
    5. d^(i+1) = r^(i+1) + β_{i+1} d^(i)
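    A minimal NumPy sketch of the five steps above (not from the slides; the function name and the default stopping criterion are chosen freely):

```python
import numpy as np

def conjugate_gradients(A, b, x0, eps=1e-10, max_iter=None):
    """Plain CG, following steps 1-5 of the algorithm above."""
    x = x0.astype(float).copy()
    r = b - A @ x                          # r^(0) = b - A x^(0)
    d = r.copy()                           # d^(0) = r^(0)
    rr = r @ r
    for _ in range(max_iter or len(b)):
        if np.sqrt(rr) <= eps:
            break
        Ad = A @ d
        alpha = rr / (d @ Ad)              # 1. step length
        x = x + alpha * d                  # 2. update the iterate
        r = r - alpha * Ad                 # 3. update the residual
        rr_new = r @ r
        beta = rr_new / rr                 # 4. beta_{i+1}
        d = r + beta * d                   # 5. new search direction
        rr = rr_new
    return x

A = np.array([[3.0, 2.0], [2.0, 100.0]])   # example from the next slide
b = np.array([2.0, -8.0])
print(conjugate_gradients(A, b, np.array([-10.0, -2.0])))
```

    For this 2x2 system the loop body is executed only twice, matching the "n = 2 steps" observation on the next slide (up to round-off).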


  • Conjugate Gradients – Example

    Consider the example

      (3  2; 2  100) x = (2; −8),   with starting solution x^(0) = (−10; −2)

    Convergence to the solution after n = 2 steps!


  • Part III

    Preconditioning

    CG Convergence
    Preconditioning
    CG with “Change-of-Basis” Preconditioning
    Hierarchical Transforms
    CG with Matrix Preconditioner
    Preconditioners – Examples
    ILU and Incomplete Cholesky


  • Conjugate Gradients – Convergence

    Convergence Analysis:
    • uses the Krylov subspace:

      span{ r^(0), A r^(0), A^2 r^(0), ..., A^{i−1} r^(0) }

    • hence the name “Krylov subspace method”

    Convergence Results:
    • in principle a direct method (n steps)
      (however: orthogonality is lost due to round-off errors → the exact solution is not found)
    • in practice an iterative scheme with

      ||e^(i)||_A ≤ 2 ((√κ − 1)/(√κ + 1))^i ||e^(0)||_A,   κ = λ_max/λ_min


  • Preconditioning

    • convergence depends on the matrix A
    • idea: modify the linear system

      Ax = b   →   M^{-1} A x = M^{-1} b

      then: convergence depends on the matrix M^{-1} A
    • optimal preconditioner M^{-1} = A^{-1}:

      A^{-1} A x = A^{-1} b   ⇔   x = A^{-1} b

    • in practice:
      – avoid explicit computation of M^{-1} A
      – find an M similar to A, compute the effect of M^{-1}
        (i.e., approximate solution of the SLE)
      – or: find an M^{-1} similar to A^{-1}
    • also possible: solve AM y = b for y, and then x = My


  • CG and Preconditioning

    • just replace A by M^{-1} A in the algorithm??
    • problem: M^{-1} A is not necessarily symmetric
      (even if M and A both are)
    • we will try an alternative first: symmetric preconditioning

      Ax = b   →   L^T A L x̂ = L^T b,   x = L x̂

    • remember: for a Finite Element discretisation, this corresponds to a change of basis functions!
    • can we implement CG without having to set up L^T A L?
    • requires some re-computations in the CG algorithm
      (see the following slides)


  • “Change-of-Basis” Preconditioning

    • preconditioned system of equations:

      Ax = b   →   (L^T A L) x̂ = L^T b,   x = L x̂,   with  Â := L^T A L,  b̂ := L^T b

    • computation of the residual:

      r̂ = b̂ − Â x̂ = L^T b − L^T A L x̂ = L^T (b − Ax) = L^T r

    • computation of α (for the preconditioned system):

      α_i := (r̂^(i))^T r̂^(i) / ((d̂^(i))^T Â d̂^(i)) = (r̂^(i))^T r̂^(i) / ((d̂^(i))^T L^T A L d̂^(i)) = (r̂^(i))^T r̂^(i) / ((d̃^(i))^T A d̃^(i))

      where we defined d̃^(i) := L d̂^(i)

    • update of the solution:

      x̂^(i+1) = x̂^(i) + α_i d̂^(i)
      ⇒ x^(i+1) = L x̂^(i+1) = L x̂^(i) + α_i L d̂^(i) = x^(i) + α_i d̃^(i)


  • “Change-of-Basis” Preconditioning (2)

    • update of the residuals r̂:

      r̂^(i+1) = r̂^(i) − α_i Â d̂^(i) = r̂^(i) − α_i L^T A L d̂^(i) = r̂^(i) − α_i L^T A d̃^(i)

    • computation of β_{i+1}:

      β_{i+1} = (r̂^(i+1))^T r̂^(i+1) / ((r̂^(i))^T r̂^(i))

    • update of the search directions:

      d̂^(i+1) = r̂^(i+1) + β_{i+1} d̂^(i)
      ⇒ d̃^(i+1) = L d̂^(i+1) = L r̂^(i+1) + β_{i+1} L d̂^(i) = L r̂^(i+1) + β_{i+1} d̃^(i)


  • CG with “Change-of-Basis” Preconditioning

    Start with r̂^(0) = L^T (b − A x^(0)) and d̃^(0) = L r̂^(0).

    While ||r̂^(i)|| > ε, iterate over:

    1. α_i = (r̂^(i))^T r̂^(i) / ((d̃^(i))^T A d̃^(i))
    2. x^(i+1) = x^(i) + α_i d̃^(i)
    3. r̂^(i+1) = r̂^(i) − α_i L^T A d̃^(i)
    4. β_{i+1} = (r̂^(i+1))^T r̂^(i+1) / ((r̂^(i))^T r̂^(i))
    5. d̃^(i+1) = L r̂^(i+1) + β_{i+1} d̃^(i)
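    A sketch of this variant in NumPy (not from the slides); for readability L is passed as an explicit matrix here, although in practice only the actions of L and L^T are implemented:

```python
import numpy as np

def cg_change_of_basis(A, b, L, x0, eps=1e-10, max_iter=None):
    """CG on (L^T A L) x_hat = L^T b without forming L^T A L explicitly."""
    x = x0.astype(float).copy()
    r_hat = L.T @ (b - A @ x)                  # r_hat^(0) = L^T (b - A x^(0))
    d_tilde = L @ r_hat                        # d_tilde^(0) = L r_hat^(0)
    rr = r_hat @ r_hat
    for _ in range(max_iter or len(b)):
        if np.sqrt(rr) <= eps:
            break
        Ad = A @ d_tilde
        alpha = rr / (d_tilde @ Ad)            # 1.
        x = x + alpha * d_tilde                # 2. iterate stays in the original basis
        r_hat = r_hat - alpha * (L.T @ Ad)     # 3.
        rr_new = r_hat @ r_hat
        beta = rr_new / rr                     # 4.
        d_tilde = L @ r_hat + beta * d_tilde   # 5.
        rr = rr_new
    return x

# with L = I this reduces to plain CG:
A = np.array([[3.0, 2.0], [2.0, 100.0]]); b = np.array([2.0, -8.0])
print(cg_change_of_basis(A, b, np.eye(2), np.zeros(2)))
```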


  • Hierarchical Basis Preconditioning

    Some specifics for the CG implementation:
    • L transforms a coefficient vector from the hierarchical basis to the nodal basis,
      for example x = L x̂ or d̃ = L d̂
    • L^T transforms the vector of basis functions from the nodal basis to the
      hierarchical basis (cmp. FEM), thus r̂ = L^T r
    • the effect of L and L^T can be computed in O(N) operations

    HB-CG for the Poisson problem:
    • in 1D: convergence after about log N iterations!
      (in this case, L^T A L is a diagonal matrix with log N different eigenvalues)
    • in 2D and 3D: very fast convergence!
    • further improved by additional diagonal preconditioning
    • so-called hierarchical generating systems (change to a multigrid basis)
      achieve multigrid-like performance


  • CG with Hierarchical Generating Systems

    Recall: the system of linear equations A_GS v_GS = b_GS is given as

      [ A_h            A_h P^h_2h        A_h P^h_4h   ] [ v_h  ]   [ b_h        ]
      [ R^2h_h A_h     A_2h              A_2h P^2h_4h ] [ v_2h ] = [ R^2h_h b_h ]
      [ R^4h_h A_h     R^4h_2h A_2h      A_4h         ] [ v_4h ]   [ R^4h_h b_h ]

    Preconditioning for CG?

    • the system A_GS v_GS = b_GS is singular
      → a subspace of solutions v_GS (minima of the quadratic form!)
    • however: any of the solutions will do!
    • convergence result:

      ||e^(i)||_A ≤ 2 ((√κ̂_GS − 1)/(√κ̂_GS + 1))^i ||e^(0)||_A,   κ̂_GS = λ̂_max/λ̂_min

      where κ̂_GS is the ratio of the largest vs. the smallest non-zero eigenvalue
    • for the Poisson equation: κ̂_GS is independent of h → multigrid convergence


  • Hierarchical Basis Transformation: towards a level-wise approach

    Consider the “semi-hierarchical” transform:

    (figure: nodal basis functions Φ_1, ..., Φ_7 at the nodes x_1, ..., x_7, transformed into the semi-hierarchical basis functions Ψ_i)

    The matrices for the change of basis are then (H^(2)_3 to transform to the hierarchical basis):

      H^(1)_3 = [ 1    0    0    0    0    0    0   ]
                [ 1/2  1    1/2  0    0    0    0   ]
                [ 0    0    1    0    0    0    0   ]
                [ 0    0    1/2  1    1/2  0    0   ]
                [ 0    0    0    0    1    0    0   ]
                [ 0    0    0    0    1/2  1    1/2 ]
                [ 0    0    0    0    0    0    1   ]

      H^(2)_3 = [ 1    0    0    0    0    0    0   ]
                [ 0    1    0    0    0    0    0   ]
                [ 0    0    1    0    0    0    0   ]
                [ 0    1/2  0    1    0    1/2  0   ]
                [ 0    0    0    0    1    0    0   ]
                [ 0    0    0    0    0    1    0   ]
                [ 0    0    0    0    0    0    1   ]


  • Hierarchical Basis Transformation

    Level-wise hierarchical transform:
    • hierarchical basis transformation: ψ_{n,i}(x) = Σ_j H_{i,j} φ_{n,j}(x)
    • written as a matrix-vector product: ψ_n = H_n φ_n (vectors of basis functions)
    • H_n φ_n can be performed as a sequence of level-wise transforms:

      H_n φ_n = H^(n−1)_n H^(n−2)_n ... H^(2)_n H^(1)_n φ_n

    • consider Hierarchical-Basis preconditioning for a FEM discretisation:
      L^T := H_n, and we need to compute r̂^(0) = L^T r^(0) and L^T A d̃^(i)
    • each level-wise transform H^(k)_n has a simple loop implementation:

      For k from 1 to n−1 do
        For j from 2^k to 2^n step 2^k
          r_j := 1/2 r_{j−2^{k−1}} + r_j + 1/2 r_{j+2^{k−1}}
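    A Python sketch of this loop (not from the slides; 0-based array storage, so entry r[j-1] holds the coefficient with 1-based index j):

```python
import numpy as np

def hierarchize(r, n):
    """Apply L^T = H_n = H^(n-1) ... H^(1) level by level to a vector of length 2^n - 1."""
    r = r.astype(float).copy()
    for k in range(1, n):                     # levels k = 1, ..., n-1
        step, half = 2 ** k, 2 ** (k - 1)
        for j in range(step, 2 ** n, step):   # j = 2^k, 2*2^k, ... (< 2^n)
            r[j - 1] += 0.5 * r[j - half - 1] + 0.5 * r[j + half - 1]
    return r

print(hierarchize(np.arange(1.0, 8.0), 3))    # n = 3: vector with 2^3 - 1 = 7 entries
```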


  • Hierarchical Coordinate Transformation

    • the transform b = H^T_n a turns “hierarchical” coefficients a into “nodal” coefficients b:

      Σ_j b_j φ_{n,j}(x) = Σ_i a_i ψ_{n,i}(x) ≈ f(x)

    • H_n = H^(n−1)_n H^(n−2)_n ... H^(2)_n H^(1)_n has a level-wise representation, thus:

      H^T_n = (H^(1)_n)^T (H^(2)_n)^T ... (H^(n−2)_n)^T (H^(n−1)_n)^T

    • consider Hierarchical-Basis preconditioning for a FEM discretisation:
      L := H^T_n for the computation of d̃^(0) = L r̂^(0) and d̃^(i+1) = L r̂^(i+1) + β_{i+1} d̃^(i)
    • again: use a loop-based implementation for (H^(k)_n)^T a:

      For k from n−1 downto 1
        For i from 2^{k−1} to 2^n step 2^k
          d_i := 1/2 d_{i−2^{k−1}} + d_i + 1/2 d_{i+2^{k−1}}     (with d_0 = d_{2^n} = 0)
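    The corresponding Python sketch for the transposed transform (again not from the slides; the vector is padded with the zero boundary values d_0 = d_{2^n} = 0):

```python
import numpy as np

def dehierarchize(a, n):
    """Apply L = H_n^T = (H^(1))^T ... (H^(n-1))^T level by level; a has length 2^n - 1."""
    d = np.zeros(2 ** n + 1)                   # padded: d[0] = d[2^n] = 0
    d[1:2 ** n] = a
    for k in range(n - 1, 0, -1):              # levels k = n-1, ..., 1
        step, half = 2 ** k, 2 ** (k - 1)
        for i in range(half, 2 ** n, step):    # i = 2^(k-1), 2^(k-1) + 2^k, ...
            d[i] += 0.5 * d[i - half] + 0.5 * d[i + half]
    return d[1:2 ** n]

print(dehierarchize(np.arange(1.0, 8.0), 3))
```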


  • CG and Preconditioning (revisited)

    • preconditioning: replace A by M^{-1} A
    • problem: M^{-1} A is not necessarily symmetric
    • compare symmetric preconditioning:

      Ax = b   →   L^T A L x̂ = L^T b,   x = L x̂

    • workaround: find E^T E = M (Cholesky factorisation), then

      Ax = b   →   E^{-T} A E^{-1} x̂ = E^{-T} b,   x̂ = E x

    • what if E cannot be computed (efficiently)?
      (neither M nor M^{-1} might be known explicitly!)
    • E, E^{-T}, E^{-1} can be eliminated from the algorithm
      (again requires some re-computations):

      set d̂ = E d and use r̂ = E^{-T} r,  x̂ = E x,  E^{-1} E^{-T} = M^{-1}


  • CG with Preconditioner

    Start with r^(0) = b − Ax^(0) and d^(0) = M^{-1} r^(0); then iterate over:

    1. α_i = (r^(i))^T M^{-1} r^(i) / ((d^(i))^T A d^(i))
    2. x^(i+1) = x^(i) + α_i d^(i)
    3. r^(i+1) = r^(i) − α_i A d^(i)
    4. β_{i+1} = (r^(i+1))^T M^{-1} r^(i+1) / ((r^(i))^T M^{-1} r^(i))
    5. d^(i+1) = M^{-1} r^(i+1) + β_{i+1} d^(i)

    (for a detailed derivation, see Shewchuk)
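    A NumPy sketch of this preconditioned CG (not from the slides); apply_Minv stands for any routine that applies M^{-1} to a vector, e.g. an approximate solver, and is an assumed interface rather than part of the lecture material:

```python
import numpy as np

def preconditioned_cg(A, b, x0, apply_Minv, eps=1e-10, max_iter=None):
    """PCG following steps 1-5 above; apply_Minv(r) must return M^{-1} r
    (the action of a symmetric positive definite approximation of A^{-1})."""
    x = x0.astype(float).copy()
    r = b - A @ x                            # r^(0)
    z = apply_Minv(r)                        # M^{-1} r^(0)
    d = z.copy()                             # d^(0)
    rz = r @ z
    for _ in range(max_iter or len(b)):
        if np.linalg.norm(r) <= eps:
            break
        Ad = A @ d
        alpha = rz / (d @ Ad)                # 1.
        x = x + alpha * d                    # 2.
        r = r - alpha * Ad                   # 3.
        z = apply_Minv(r)
        rz_new = r @ z
        beta = rz_new / rz                   # 4.
        d = z + beta * d                     # 5.
        rz = rz_new
    return x
```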


  • Implementation

    Preconditioning steps: M^{-1} r^(i), M^{-1} r^(i+1)

    • M^{-1} known? Then multiply: M^{-1} r^(i)
    • M known? Then solve My = r^(i) to obtain y = M^{-1} r^(i)
    • neither M nor M^{-1} known explicitly:
      • an algorithm to solve My = r^(i) is sufficient!
      • an algorithm to compute the action of M^{-1} on a vector is sufficient!
      ⇒ any approximate solver for Ae = r^(i) is sufficient
        (if it is equivalent to applying a symmetric and positive definite matrix)


  • Preconditioners for CG – Examples

    • find M ≈ A and compute the effect of M^{-1}:
      • Jacobi preconditioning: M := D_A
      • (symmetric) Gauss-Seidel preconditioning: M := L_A, or
        M = (D_A + L'_A) D_A^{-1} (D_A + (L'_A)^T), etc.
    • just compute the effect of M^{-1}:
      • any approximate solver might do → incl. multigrid methods
      • incomplete LU decomposition (ILU);
        should be symmetric → incomplete Cholesky factorisation
      • use a multigrid method as preconditioner(?)
        → worthwhile (only) in situations where multigrid does not work (well) as a stand-alone solver
    • find an M^{-1} similar to A^{-1}:
      • “sparse approximate inverse” (SPAI)
      • tries to minimise ||I − MA||_F, where M is a matrix with a (given) sparse non-zero pattern
    (a short Jacobi example is given below)
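    For illustration only: with Jacobi preconditioning, M = D_A, the action of M^{-1} is a componentwise division by the diagonal of A; this plugs directly into the preconditioned_cg sketch given after the "CG with Preconditioner" slide.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 100.0]])
b = np.array([2.0, -8.0])

jacobi_apply_Minv = lambda r: r / np.diag(A)   # M := D_A, so M^{-1} r = r / diag(A)
x = preconditioned_cg(A, b, np.zeros(2), jacobi_apply_Minv)
print(x)
```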


  • Preconditioners – ILU and Incomplete Cholesky

    Recall LU decomposition and Cholesky factorization:
    • LU decomposition: given A, find lower/upper triangular matrices L and U such that A = LU
    • Cholesky factorization: given A = A^T, find a lower triangular matrix L such that A = LL^T
      → symmetric preconditioning!
    • variants with an explicit diagonal matrix D:

      A = L D^{-1} U   or   A = L D^{-1} L^T,

      where L = D + L' and U = D + U' with strictly lower/upper triangular L', U'
    • but: for sparse A, L and U may be non-sparse

    Idea: disregard all fill-in during the factorization


  • Cholesky Factorization

      [ L_11              ]   [ D_11^{-1}                      ]   [ L_11^T  L_21^T  L_31^T ]   [ A_11  A_21^T  A_31^T ]
      [ L_21  L_22        ] · [           D_22^{-1}            ] · [         L_22^T  L_32^T ] = [ A_21  A_22    A_32^T ]
      [ L_31  L_32  L_33  ]   [                      D_33^{-1} ]   [                 L_33^T ]   [ A_31  A_32    A_33   ]

    Derive the factorization algorithm:

    • assume that A_11 = L_11 D_11^{-1} L_11^T is already factorized
    • let L_21 be a 1 × k submatrix, i.e., to compute the next row of L:

      L_21 D_11^{-1} L_11^T = A_21

      D_11^{-1} L_11^T is an upper triangular matrix → solve a triangular system for L_21
    • by convention L_22 = D_22, which is computed from:

      L_21 D_11^{-1} L_21^T + L_22 D_22^{-1} L_22^T = A_22   ⇒   L_22 = D_22 := A_22 − L_21 D_11^{-1} L_21^T


  • Incomplete Cholesky Factorization

    Algorithm (A → L D^{-1} L^T):
    • initialize D := 0, L := 0
    • for i = 1, ..., n:
      1. for k = 1, ..., i − 1:
         if (i, k) ∈ S then set L_ik := A_ik − Σ_j ...

  • Literature/References

    Conjugate Gradients:
    • Shewchuk: An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.
    • Hackbusch: Iterative Solution of Large Sparse Systems of Equations, Springer, 1993.
    • M. Griebel: Multilevelmethoden als Iterationsverfahren über Erzeugendensystemen, Teubner Skripten zur Numerik, 1994.
      M. Griebel: Multilevel algorithms considered as iterative methods on semidefinite systems, SIAM J. Sci. Comput. 15(3), 1994.


    Quadratic Forms and Steepest Descent
      Quadratic Forms
      Direction of Steepest Descent
      Steepest Descent

    Conjugate Gradients
      Conjugate Directions
      A-Orthogonality
      Conjugate Gradients
      A Miracle Occurs ...
      CG Algorithm

    Preconditioning
      CG Convergence
      Preconditioning
      CG with “Change-of-Basis” Preconditioning
      Hierarchical Transforms
      CG with Matrix Preconditioner
      Preconditioners – Examples
      ILU and Incomplete Cholesky