
Page 1: Iterative Methods in Linear Algebra (part 1)

Iterative Methods in Linear Algebra (part 1)

Stan Tomov

Innovative Computing Laboratory
Computer Science Department

The University of Tennessee

Wednesday March 31, 2010

CS 594, 03-31-2010

Page 2: Iterative Methods in Linear Algebra (part 1)


Page 3: Iterative Methods in Linear Algebra (part 1)

Outline

Part I: Discussion; motivation for iterative methods

Part II: Stationary Iterative Methods

Part III: Nonstationary Iterative Methods


Page 4: Iterative Methods in Linear Algebra (part 1)

Part I

Discussion


Page 5: Iterative Methods in Linear Algebra (part 1)

Discussion

Part 1:


Page 6: Iterative Methods in Linear Algebra (part 1)

Discussion

Part 2:


Page 7: Iterative Methods in Linear Algebra (part 1)

Discussion

Original matrix:
>> load matrix.output
>> S = spconvert(matrix);
>> spy(S)

Reordered matrix:
>> load A.data
>> S = spconvert(A);
>> spy(S)


Page 8: Iterative Methods in Linear Algebra (part 1)

Discussion

Original sub-matrix:
>> load matrix.output
>> S = spconvert(matrix);
>> spy(S(1:8660,1:8660))

Reordered sub-matrix:
>> load A.data
>> S = spconvert(A);
>> spy(S(8602:8700,8602:8700))


Page 9: Iterative Methods in Linear Algebra (part 1)

Discussion

Results on torc0 (Pentium III, 933 MHz, 16 KB L1 and 256 KB L2 cache):

  Original matrix:  42.4568 Mflop/s
  Reordered matrix: 45.3954 Mflop/s
  BCRS (block compressed row storage): 72.17374 Mflop/s

Results on woodstock (Intel Xeon 5160 Woodcrest, 3.0 GHz, 32 KB L1, 4 MB L2):

  Original matrix:  386.8 Mflop/s
  Reordered matrix: 387.6 Mflop/s
  BCRS:             894.2 Mflop/s


Page 10: Iterative Methods in Linear Algebra (part 1)

Homework 9


Page 11: Iterative Methods in Linear Algebra (part 1)

Questions?

Homework #9

PART I

PETSc: many iterative solvers. We will see how projection is used in iterative methods.

PART II

Hybrid CPU-GPU computations


Page 12: Iterative Methods in Linear Algebra (part 1)

Hybrid CPU-GPU computations


Page 13: Iterative Methods in Linear Algebra (part 1)

Iterative Methods


Page 14: Iterative Methods in Linear Algebra (part 1)

Motivation

So far we have discussed and seen:

Many engineering and physics simulations lead to sparse matrices, e.g. PDE-based simulations and their discretizations based on
  FEM/FVM
  finite differences
  Petrov-Galerkin type conditions, etc.

How to optimize performance on sparse matrix computations

Some software for solving sparse matrix problems (e.g. PETSc)

The question is: can we solve sparse matrix problems faster than with direct sparse methods?

In certain cases, yes: using iterative methods


Page 15: Iterative Methods in Linear Algebra (part 1)

Sparse direct methods


These are based on Gaussian elimination (LU)

Performed in sparse format

Are there limitations of the approach?

Yes, they have fill-ins, which lead to
  more memory requirements
  more flops being performed

Fill-ins can become prohibitively high


Page 16: Iterative Methods in Linear Algebra (part 1)

Sparse direct methods

Consider LU for the matrix below

A nonzero is represented by a *

The 1st step of the LU factorization will introduce fill-ins, marked by an F


Page 17: Iterative Methods in Linear Algebra (part 1)

Sparse direct methods

Fill-ins can be reduced by reordering

Remember: we talked about this in a slightly different context (for speed and parallelization)

Consider the following reordering

These were extreme cases, but still, the problem exists


Page 18: Iterative Methods in Linear Algebra (part 1)

Sparse methods

Sparse direct vs dense methods

Dense methods take O(n^2) storage and O(n^3) flops, but run close to peak performance

Sparse direct methods can take O(#nonz) storage and O(#nonz) flops, but both can grow due to fill-ins, and performance is poor

With #nonz << n^2 and a 'proper' ordering it pays off to do sparse direct

Software (table from Julien Langou): http://www.netlib.org/utk/people/JackDongarra/la-sw.html

Package   Support   Type           Language    Mode
                    Real Complex   f77 c c++   Seq Dist   SPD Gen

DSCPACK   yes       X X X M X
HSL       yes       X X X X X X
MFACT     yes       X X X X
MUMPS     yes       X X X X X M X X
PSPASES   yes       X X X M X
SPARSE    ?         X X X X X X
SPOOLES   ?         X X X X M X
SuperLU   yes       X X X X X M X
TAUCS     yes       X X X X X X
UMFPACK   yes       X X X X X
Y12M      ?         X X X X X


Page 19: Iterative Methods in Linear Algebra (part 1)

Sparse methods

What about Iterative Methods? Think for example

xi+1 = xi + P(b − Axi )

Pluses:

Storage is only O(#nonz) (for the matrix and a few working vectors)

Can take only a few iterations to converge (e.g. << n)

For P = A−1 it takes one iteration (check!)

In general, easy to parallelize


Page 20: Iterative Methods in Linear Algebra (part 1)

Sparse methods

What about Iterative Methods? Think for example

xi+1 = xi + P(b − Axi )

Minuses:

Performance can be bad (as we saw in Lecture 11 and today's discussion)

Convergence can be slow or even stagnate, but it can be improved with preconditioning

Think of P as a preconditioner, an operator/matrix P ≈ A−1

Optimal preconditioners (e.g. multigrid can be optimal) lead to convergence in O(1) iterations
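As an illustration, a minimal MATLAB/Octave sketch of the iteration xi+1 = xi + P(b − Axi); the tridiagonal test matrix, the Jacobi-style choice of P, and the tolerance are assumptions for the example:

% Sketch of x_{i+1} = x_i + P*(b - A*x_i) on an assumed 1-D Laplacian test problem
n = 100; e = ones(n, 1);
A = spdiags([-e 2*e -e], -1:1, n, n);   % SPD model matrix (assumption)
b = ones(n, 1);
d = full(diag(A));
P = spdiags(1 ./ d, 0, n, n);           % P ~ A^{-1}; here the Jacobi choice D^{-1}
x = zeros(n, 1);
for i = 1:500
    r = b - A * x;                      % residual
    if norm(r) / norm(b) < 1e-8, break, end
    x = x + P * r;                      % preconditioned Richardson update
end
% With P = inv(full(A)) the loop would exit after one update (cf. the previous slide).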


Page 21: Iterative Methods in Linear Algebra (part 1)

Part II

Stationary Iterative Methods


Page 22: Iterative Methods in Linear Algebra (part 1)

Stationary Iterative Methods

Can be expressed in the form

xi+1 = Bxi + c

where B and c do not depend on i

Older, simpler, and easy to implement, but usually not as effective as nonstationary methods

Examples: Richardson, Jacobi, Gauss-Seidel, SOR, etc. (next)


Page 23: Iterative Methods in Linear Algebra (part 1)

Richardson Method

Richardson iteration

xi+1 = xi + (b − Axi ) = (I − A)xi + b (1)

i.e. B from the definition above is B = I − A

Denote ei = x − xi and rewrite (1)

x − xi+1 = x − xi − (Ax − Axi)

ei+1 = ei − Aei = (I − A)ei

||ei+1|| = ||(I − A)ei|| ≤ ||I − A|| ||ei|| ≤ ||I − A||^2 ||ei−1|| ≤ · · · ≤ ||I − A||^{i+1} ||e0||

i.e. for convergence (ei → 0) we need

||I − A|| < 1

for some norm || · ||, i.e. when ρ(I − A) < 1.


Page 24: Iterative Methods in Linear Algebra (part 1)

Jacobi Method


xi+1 = xi + D−1(b − Axi ) = (I − D−1A)xi + D−1b (2)

where D is the diagonal of A (assumed nonzero); B = I − D−1A in the notation above

Denote ei = x − xi and rewrite (2)

x − xi+1 = x − xi − D−1(Ax − Axi)

ei+1 = ei − D−1Aei = (I − D−1A)ei

||ei+1|| = ||(I − D−1A)ei|| ≤ ||I − D−1A|| ||ei|| ≤ ||I − D−1A||^2 ||ei−1|| ≤ · · · ≤ ||I − D−1A||^{i+1} ||e0||

i.e. for convergence (ei → 0) we need

||I − D−1A|| < 1

for some norm || · ||, i.e. when ρ(I − D−1A) < 1
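A minimal MATLAB sketch of iteration (2); the test matrix (with nonzero diagonal), tolerance, and iteration limit below are illustrative assumptions:

% Jacobi iteration x_{i+1} = x_i + D^{-1}(b - A*x_i)
n = 100; e = ones(n, 1);
A = spdiags([-e 2*e -e], -1:1, n, n);   % assumed SPD model matrix
b = ones(n, 1);
d = full(diag(A));                      % the diagonal D, stored as a vector
x = zeros(n, 1);
for i = 1:1000
    r = b - A * x;
    if norm(r) / norm(b) < 1e-8, break, end
    x = x + r ./ d;                     % equivalent to x + D^{-1}*r
end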


Page 25: Iterative Methods in Linear Algebra (part 1)

Gauss-Seidel Method


xi+1 = (D − L)−1(Uxi + b) (3)

where −L is the lower and −U the upper triangular part of A, andD is the diagonal.

Equivalently:

x1^(i+1) = (b1 − a12 x2^(i) − a13 x3^(i) − a14 x4^(i) − . . . − a1n xn^(i)) / a11
x2^(i+1) = (b2 − a21 x1^(i+1) − a23 x3^(i) − a24 x4^(i) − . . . − a2n xn^(i)) / a22
x3^(i+1) = (b3 − a31 x1^(i+1) − a32 x2^(i+1) − a34 x4^(i) − . . . − a3n xn^(i)) / a33
...
xn^(i+1) = (bn − an1 x1^(i+1) − an2 x2^(i+1) − an3 x3^(i+1) − . . . − a(n,n−1) x(n−1)^(i+1)) / ann
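The componentwise formulas translate directly into a MATLAB sketch (the test data and sweep count are assumptions for the example):

% One Gauss-Seidel sweep per outer iteration, following the formulas above
n = 100; e = ones(n, 1);
A = full(spdiags([-e 2*e -e], -1:1, n, n));  % assumed test matrix
b = ones(n, 1);
x = zeros(n, 1);
for it = 1:100
    for k = 1:n
        % x(1:k-1) already hold the new iterate; x(k+1:n) still hold the old one
        x(k) = (b(k) - A(k, 1:k-1) * x(1:k-1) - A(k, k+1:n) * x(k+1:n)) / A(k, k);
    end
end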


Page 26: Iterative Methods in Linear Algebra (part 1)

A comparison (from Julien Langou's slides)

Gauss-Seidel method

A = [ 10   1   2   0   1
       0  12   1   3   1
       1   2   9   1   0
       0   3   1  10   0
       1   2   0   0  15 ]        b = [ 23  44  36  49  80 ]'

[Plot: ‖b − Ax(i)‖2/‖b‖2 as a function of i, the number of iterations]

x(0)     x(1)     x(2)     x(3)     x(4)     . . .
0.0000   2.3000   0.8783   0.9790   0.9978   1.0000
0.0000   3.6667   2.1548   2.0107   2.0005   2.0000
0.0000   2.9296   3.0339   3.0055   3.0006   3.0000
0.0000   3.5070   3.9502   3.9962   3.9998   4.0000
0.0000   4.6911   4.9875   5.0000   5.0001   5.0000
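The iterates in this table can be reproduced with a short MATLAB script (printing the first five iterates is an illustrative choice):

A = [10 1 2 0 1; 0 12 1 3 1; 1 2 9 1 0; 0 3 1 10 0; 1 2 0 0 15];
b = [23; 44; 36; 49; 80];
x = zeros(5, 1);                        % x(0)
for it = 1:5                            % produces x(1) .. x(5)
    for k = 1:5
        x(k) = (b(k) - A(k, 1:k-1) * x(1:k-1) - A(k, k+1:5) * x(k+1:5)) / A(k, k);
    end
    % each printed line is one column of the table, transposed
    fprintf('%8.4f %8.4f %8.4f %8.4f %8.4f\n', x);
end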


Page 27: Iterative Methods in Linear Algebra (part 1)

A comparison (from Julien Langou's slides)

Jacobi method:

x1^(i+1) = (b1 − a12 x2^(i) − a13 x3^(i) − a14 x4^(i) − . . . − a1n xn^(i)) / a11
x2^(i+1) = (b2 − a21 x1^(i) − a23 x3^(i) − a24 x4^(i) − . . . − a2n xn^(i)) / a22
x3^(i+1) = (b3 − a31 x1^(i) − a32 x2^(i) − a34 x4^(i) − . . . − a3n xn^(i)) / a33
...
xn^(i+1) = (bn − an1 x1^(i) − an2 x2^(i) − an3 x3^(i) − . . . − a(n,n−1) x(n−1)^(i)) / ann

Gauss-Seidel method:

x1^(i+1) = (b1 − a12 x2^(i) − a13 x3^(i) − a14 x4^(i) − . . . − a1n xn^(i)) / a11
x2^(i+1) = (b2 − a21 x1^(i+1) − a23 x3^(i) − a24 x4^(i) − . . . − a2n xn^(i)) / a22
x3^(i+1) = (b3 − a31 x1^(i+1) − a32 x2^(i+1) − a34 x4^(i) − . . . − a3n xn^(i)) / a33
...
xn^(i+1) = (bn − an1 x1^(i+1) − an2 x2^(i+1) − an3 x3^(i+1) − . . . − a(n,n−1) x(n−1)^(i+1)) / ann

Gauss-Seidel is the method with better numerical properties (fewer iterations to convergence). Which method is more efficient to implement on a sequential or a parallel computer? Why?

In Gauss-Seidel, computing the (k+1)-st component of x^(i+1) requires the already-updated k-th component of x^(i+1). Parallelization is impossible.
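In matrix form (with A = D − L − U as on the previous slides), the two iterations can be compared side by side; a sketch on the 5x5 example above (the iteration count is illustrative):

A = [10 1 2 0 1; 0 12 1 3 1; 1 2 9 1 0; 0 3 1 10 0; 1 2 0 0 15];
b = [23; 44; 36; 49; 80];
D = diag(diag(A)); L = -tril(A, -1); U = -triu(A, 1);
Bj = D \ (L + U);  cj = D \ b;          % Jacobi:       x <- Bj*x + cj
Bg = (D - L) \ U;  cg = (D - L) \ b;    % Gauss-Seidel: x <- Bg*x + cg
xj = zeros(5, 1); xg = zeros(5, 1);
for i = 1:10
    xj = Bj * xj + cj;
    xg = Bg * xg + cg;
    fprintf('%2d   Jacobi %.2e   Gauss-Seidel %.2e\n', i, ...
            norm(b - A * xj) / norm(b), norm(b - A * xg) / norm(b));
end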


Page 28: Iterative Methods in Linear Algebra (part 1)

Convergence

Convergence can be very slow. Consider a modified Richardson method:

xi+1 = xi + τ(b − Axi ) (4)

Convergence is linear; similarly to Richardson we get

||ei+1|| ≤ ||I − τA||^{i+1} ||e0||

but it can be very slow (if ||I − τA|| is close to 1). For example, let

A be symmetric and positive definite (SPD), with extreme eigenvalues λ1 and λN

the matrix norm in ||I − τA|| be the one induced by || · ||2

Then the best choice for τ, namely τ = 2/(λ1 + λN), gives

||I − τA|| = (k(A) − 1) / (k(A) + 1)

where k(A) = λN/λ1 is the condition number of A.
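A quick numerical check of this formula (MATLAB sketch; the tridiagonal test matrix is an assumption):

n = 50; e = ones(n, 1);
A = full(spdiags([-e 2*e -e], -1:1, n, n)); % SPD test matrix (assumption)
lam = eig(A);
l1 = min(lam); lN = max(lam);
tau = 2 / (l1 + lN);                    % best tau for SPD A in the 2-norm
kappa = lN / l1;                        % condition number k(A)
(kappa - 1) / (kappa + 1)               % predicted contraction factor
norm(eye(n) - tau * A, 2)               % measured ||I - tau*A||_2: same value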


Page 29: Iterative Methods in Linear Algebra (part 1)

Convergence

Note:

The rate of convergence depends on the condition number k(A)

Even for the best τ, the rate of convergence

(k(A) − 1) / (k(A) + 1)

is slow (i.e. close to 1) for large k(A)

We will see that the convergence of nonstationary methods also depends on k(A) but is better; e.g. compare with CG's rate

(√k(A) − 1) / (√k(A) + 1)
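The difference is easy to see numerically (a sketch; the sample condition numbers are arbitrary):

kappa = [10 100 1000 10000];
(kappa - 1) ./ (kappa + 1)              % stationary rate: 0.82, 0.98, 0.998, ...
(sqrt(kappa) - 1) ./ (sqrt(kappa) + 1)  % CG-type rate:    0.52, 0.82, 0.94,  ...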


Page 30: Iterative Methods in Linear Algebra (part 1)

Part III

Nonstationary Iterative Methods


Page 31: Iterative Methods in Linear Algebra (part 1)

Nonstationary Iterative Methods

The methods involve information that changes at every iteration

e.g. inner-products with residuals or other vectors, etc.

The methods are

newer, harder to understand, but more effective

in general based on the idea of orthogonal vectors and subspace projections

examples: Krylov iterative methods

CG/PCG, GMRES, CGNE, QMR, BiCG, etc.


Page 32: Iterative Methods in Linear Algebra (part 1)

Krylov Iterative Methods


Page 33: Iterative Methods in Linear Algebra (part 1)

Krylov Iterative Methods

Krylov subspaces: these are the spaces

Ki(A, r) = span{r, Ar, A^2 r, . . . , A^{i−1} r}

Krylov iterative methods find an approximation xi to x, where

Ax = b,

as a minimization on Ki(A, r). We have seen how this can be done, for example, by projection, i.e. by the Petrov-Galerkin conditions:

(Axi, φ) = (b, φ) for all φ ∈ Ki(A, r)
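As a MATLAB sketch of this projection (with x0 = 0, so r = b; the test matrix, the subspace dimension, and the use of orth for a numerically stable basis are assumptions):

n = 100; e = ones(n, 1);
A = spdiags([-e 2*e -e], -1:1, n, n);   % assumed SPD test matrix
b = ones(n, 1);
m = 10;                                 % dimension of the Krylov subspace
K = zeros(n, m); K(:, 1) = b;           % r = b since x0 = 0
for j = 2:m
    K(:, j) = A * K(:, j-1);            % K_m = span{r, Ar, ..., A^{m-1} r}
end
V = orth(K);                            % orthonormal basis for K_m
y = (V' * A * V) \ (V' * b);            % Galerkin: (A*x - b) orthogonal to K_m
x = V * y;
norm(b - A * x) / norm(b)               % residual of the projected approximation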


Page 34: Iterative Methods in Linear Algebra (part 1)

Krylov Iterative Methods

In general we

expand the Krylov subspace by a matrix-vector product, and

do a minimization/projection in it.

Various methods result by specific choices of expansion andminimization/projection.


Page 35: Iterative Methods in Linear Algebra (part 1)

Krylov Iterative Methods

An example: The Conjugate Gradient Method (CG)

The method is for SPD matrices

There is a way, at iteration i, to construct a new 'search direction' pi such that

span{p0, p1, . . . , pi} ≡ Ki+1(A, r0) and (Api, pj) = 0 for i ≠ j.

Note: A is SPD ⇒ (Api, pj) ≡ (pi, pj)A, i.e. (·, ·)A can be used as an inner product, and p0, . . . , pi is an (·, ·)A-orthogonal basis for Ki+1(A, r0)

⇒ we can easily find xi+1 ≈ x as

xi+1 = x0 + α0 p0 + · · · + αi pi   s.t.   (Axi+1, pj) = (b, pj) for j = 0, . . . , i

Namely, because of the (·, ·)A-orthogonality of p0, . . . , pi, at iteration i + 1 we only have to find αi:

(Axi+1, pi) = (A(xi + αi pi), pi) = (b, pi)   ⇒   αi = (ri, pi) / (Api, pi)

Note: xi above can actually be replaced by any x0 + v, v ∈ Ki(A, r0) (Why?)
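A minimal MATLAB sketch of CG following this derivation, with αi = (ri, pi)/(Api, pi) as above; the test problem and tolerance are assumptions, and the βi update keeping the search directions A-orthogonal is the standard one:

n = 100; e = ones(n, 1);
A = spdiags([-e 2*e -e], -1:1, n, n);   % SPD test matrix (assumption)
b = ones(n, 1);
x = zeros(n, 1);
r = b - A * x;  p = r;                  % first search direction p0 = r0
for i = 1:n
    Ap = A * p;
    alpha = (r' * p) / (p' * Ap);       % alpha_i = (r_i, p_i) / (A p_i, p_i)
    x = x + alpha * p;
    rnew = r - alpha * Ap;
    if norm(rnew) / norm(b) < 1e-10, break, end
    beta = (rnew' * rnew) / (r' * r);   % keeps the new p A-orthogonal to the old ones
    p = rnew + beta * p;
    r = rnew;
end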


Page 36: Iterative Methods in Linear Algebra (part 1)

Learning Goals

A brief introduction to iterative methods

Motivation for iterative methods and links to previous lectures, namely
  PDE discretization and sparse matrices
  optimized implementations

Stationary iterative methods

Nonstationary iterative methods and links to building blocks that we have already covered
  projection/minimization
  orthogonalization

Krylov methods; an example with CG; to see more examples ... (next lecture)
