Applied Mathematics 205
Unit II: Numerical Linear Algebra
Lecturer: Dr. David Knezevic
Unit II: Numerical Linear Algebra
Chapter II.3: QR Factorization, SVD
QR Factorization
QR Factorization
A square matrix Q ∈ R^{n×n} is called orthogonal if its columns and rows are orthonormal vectors

Equivalently, Q^T Q = QQ^T = I
Orthogonal matrices preserve the Euclidean norm of a vector, i.e.
‖Qv‖2^2 = v^T Q^T Qv = v^T v = ‖v‖2^2

Hence, geometrically, we picture orthogonal matrices as reflection or rotation operators

Orthogonal matrices are very important in scientific computing, since norm-preservation implies cond(Q) = 1
QR Factorization
A matrix A ∈ R^{m×n}, m ≥ n, can be factorized into
A = QR
where
• Q ∈ R^{m×m} is orthogonal
• R ≡ [R̂; 0] ∈ R^{m×n}
• R̂ ∈ R^{n×n} is upper-triangular
As we indicated earlier, QR is very good for solving overdetermined linear least-squares problems, Ax ≈ b¹

¹QR can also be used to solve a square system Ax = b, but it requires ∼ 2× as many operations as Gaussian elimination, hence it is not the standard choice
QR Factorization
To see why, consider the 2-norm of the least-squares residual:
‖r(x)‖2^2 = ‖b − Ax‖2^2 = ‖b − Q[R̂; 0]x‖2^2
          = ‖Q^T(b − Q[R̂; 0]x)‖2^2
          = ‖Q^T b − [R̂; 0]x‖2^2

(We used the fact that ‖Q^T z‖2 = ‖z‖2 in the second line)
QR Factorization
Then, let Q^T b = [c1; c2] where c1 ∈ R^n, c2 ∈ R^{m−n}, so that

‖r(x)‖2^2 = ‖c1 − R̂x‖2^2 + ‖c2‖2^2

Question: Based on this expression, how do we minimize ‖r(x)‖2?
QR Factorization
Answer: We don’t have any control over the second term, ‖c2‖2^2, since it doesn’t contain x

Hence we minimize ‖r(x)‖2^2 by making the first term zero

That is, we solve the n × n triangular system R̂x = c1 — this is what Matlab does when we use “backslash” for least squares

Also, this tells us that min_{x∈R^n} ‖r(x)‖2 = ‖c2‖2
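As an illustrative sketch (a random test problem, not from the slides), this is easy to try in Matlab:

>> A = rand(100,5); b = rand(100,1); % random overdetermined system
>> [Q,R] = qr(A,0);                  % reduced QR factorization
>> x = R \ (Q'*b);                   % solve Rx = c1 = Q'*b
>> norm(x - A\b)                     % matches Matlab's backslash solution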
QR Factorization
Recall that solving linear least-squares via the normal equations requires solving a system with the matrix A^T A

But using the normal equations directly is problematic since²

cond(A^T A) = cond(A)^2

The QR approach avoids this condition-number-squaring effect and is much more numerically stable!

²This can be shown via the SVD
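A quick Matlab check with a nearly rank-deficient test matrix (an illustrative sketch, not from the slides) shows the squaring effect:

>> A = [1 1; 1e-6 0; 0 1e-6];  % nearly rank-deficient test matrix
>> cond(A)^2                   % roughly 2e12
>> cond(A'*A)                  % roughly the same: the squaring effect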
QR Factorization
How do we compute the QR Factorization?
There are three main methods:

• Gram-Schmidt Orthogonalization
• Householder Triangularization
• Givens Rotations
We will cover Gram-Schmidt and Householder in class
QR Factorization
Suppose A ∈ R^{m×n}, m ≥ n

One way to picture the QR factorization is to construct a sequence of orthonormal vectors q1, q2, . . . such that
span{q1, q2, . . . , qj} = span{a(:,1), a(:,2), . . . , a(:,j)}, j = 1, . . . , n
We seek coefficients rij such that
a(:,1) = r11q1,
a(:,2) = r12q1 + r22q2,
...
a(:,n) = r1nq1 + r2nq2 + · · ·+ rnnqn.
This can be done via the Gram-Schmidt process, as we’ll discuss shortly
QR Factorization
In matrix form we have:

  [ a(:,1) a(:,2) · · · a(:,n) ] = [ q1 q2 · · · qn ] ⎡ r11 r12 · · · r1n ⎤
                                                      ⎢     r22 · · · r2n ⎥
                                                      ⎢          ⋱    ⋮  ⎥
                                                      ⎣               rnn ⎦

This gives A = Q̂R̂ for Q̂ ∈ R^{m×n}, R̂ ∈ R^{n×n}
This is called the reduced QR factorization of A, which is slightly different from the definition we gave earlier

Note that for m > n, Q̂^T Q̂ = I, but Q̂Q̂^T ≠ I (the latter is why the full QR is sometimes nice)
Full vs Reduced QR Factorization
The full QR factorization (defined earlier)
A = QR
is obtained by appending m − n arbitrary orthonormal columns to Q̂ to make it an m × m orthogonal matrix
We also need to append rows of zeros to R̂ to “silence” the last m − n columns of Q, to obtain R = [R̂; 0]
Full vs Reduced QR Factorization
(Figure: Full QR vs reduced QR)
Full vs Reduced QR Factorization
Exercise: Show that the linear least-squares solution is given by R̂x = Q̂^T b by plugging A = Q̂R̂ into the Normal Equations

This is equivalent to the least-squares result we showed earlier using the full QR factorization, since c1 = Q̂^T b
Full vs Reduced QR Factorization

In Matlab, qr gives the full QR factorization by default
>> A = rand(5,3);
>> [Q,R] = qr(A)
Q =
0.4927 -0.4806 -0.1780 -0.6413 0.2886
0.5478 -0.3583 0.5777 0.4690 -0.1336
0.0768 0.4754 0.6343 -0.5674 -0.2091
0.5523 0.3391 -0.4808 0.0385 -0.5893
0.3824 0.5473 -0.0311 0.2129 0.7127
R =
1.6536 1.1405 1.2569
0 0.9661 0.6341
0 0 0.8816
0 0 0
0 0 0
Full vs Reduced QR Factorization
In Matlab, qr(A,0) gives the reduced QR factorization
>> [Q,R] = qr(A,0)
Q =
0.4927 -0.4806 -0.1780
0.5478 -0.3583 0.5777
0.0768 0.4754 0.6343
0.5523 0.3391 -0.4808
0.3824 0.5473 -0.0311
R =
1.6536 1.1405 1.2569
0 0.9661 0.6341
0 0 0.8816
Gram-Schmidt Orthogonalization
Question: How do we use the Gram-Schmidt process to compute the qi, i = 1, . . . , n?

Answer: In step j, find a unit vector qj ∈ span{a(:,1), a(:,2), . . . , a(:,j)} orthogonal to span{q1, q2, . . . , qj−1}

To do this, we set vj ≡ a(:,j) − (q1^T a(:,j))q1 − · · · − (q_{j−1}^T a(:,j))q_{j−1}, and then qj ≡ vj/‖vj‖2 satisfies our requirements

This gives the matrix Q̂, and next we can determine the values of rij such that A = Q̂R̂
Gram-Schmidt Orthogonalization

We write the set of equations for the qi as

q1 = a(:,1)/r11,
q2 = (a(:,2) − r12 q1)/r22,
...
qn = (a(:,n) − ∑_{i=1}^{n−1} r_{in} qi)/rnn.

Then from the definition of qj, we see that

rij = qi^T a(:,j),  i ≠ j

|rjj| = ‖a(:,j) − ∑_{i=1}^{j−1} rij qi‖2

The sign of rjj is not determined uniquely, e.g. we could choose rjj > 0 for each j
Classical Gram-Schmidt Process
The Gram-Schmidt algorithm we have described is provided in the pseudocode below

1: for j = 1 : n do
2:   vj = a(:,j)
3:   for i = 1 : j − 1 do
4:     rij = qi^T a(:,j)
5:     vj = vj − rij qi
6:   end for
7:   rjj = ‖vj‖2
8:   qj = vj/rjj
9: end for
This is referred to as the classical Gram-Schmidt (CGS) method
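As an illustrative sketch (not from the slides), a direct Matlab transcription of this pseudocode:

function [Q,R] = cgs(A)
% Classical Gram-Schmidt: reduced QR of an m-by-n matrix A, m >= n
[m,n] = size(A);
Q = zeros(m,n); R = zeros(n,n);
for j = 1:n
    v = A(:,j);
    for i = 1:j-1
        R(i,j) = Q(:,i)'*A(:,j);  % coefficient against the original column
        v = v - R(i,j)*Q(:,i);
    end
    R(j,j) = norm(v);
    Q(:,j) = v/R(j,j);
end
end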
Gram-Schmidt Orthogonalization
The only way the Gram-Schmidt process can “fail” is if |rjj| = ‖vj‖2 = 0 for some j

This can only happen if a(:,j) = ∑_{i=1}^{j−1} rij qi for some j, i.e. if a(:,j) ∈ span{q1, q2, . . . , qj−1} = span{a(:,1), a(:,2), . . . , a(:,j−1)}

This means that the columns of A are linearly dependent

Therefore, Gram-Schmidt fails =⇒ cols. of A linearly dependent
Gram-Schmidt Orthogonalization
Equivalently, by contrapositive: cols. of A linearly independent =⇒ Gram-Schmidt succeeds

Theorem: Every A ∈ R^{m×n} (m ≥ n) of full rank has a unique reduced QR factorization A = Q̂R̂ with rii > 0

The only non-uniqueness in the Gram-Schmidt process was in the sign of rii, hence Q̂R̂ is unique if rii > 0
Gram-Schmidt Orthogonalization
Theorem: Every A ∈ R^{m×n} (m ≥ n) has a full QR factorization.

Case 1: A has full rank

• We compute the reduced QR factorization from above
• To make Q square we pad Q̂ with m − n arbitrary orthonormal columns
• We also pad R̂ with m − n rows of zeros to get R

Case 2: A doesn’t have full rank

• At some point in computing the reduced QR factorization, we encounter ‖vj‖2 = 0
• At this point we pick an arbitrary qj orthogonal to span{q1, q2, . . . , qj−1} and then proceed as in Case 1
Modified Gram-Schmidt Process
The classical Gram-Schmidt process is numerically unstable! (It is sensitive to rounding error: the orthogonality of the qj degrades)

The algorithm can be reformulated to give the modified Gram-Schmidt process, which is numerically more robust

Key idea: when each new qj is computed, orthogonalize each remaining column of A against it
Modified Gram-Schmidt Process
Modified Gram-Schmidt (MGS):
1: for i = 1 : n do
2:   vi = a(:,i)
3: end for
4: for i = 1 : n do
5:   rii = ‖vi‖2
6:   qi = vi/rii
7:   for j = i + 1 : n do
8:     rij = qi^T vj
9:     vj = vj − rij qi
10:   end for
11: end for
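Again as an illustrative sketch (not from the slides), a Matlab version of MGS:

function [Q,R] = mgs(A)
% Modified Gram-Schmidt: reduced QR of an m-by-n matrix A, m >= n
[m,n] = size(A);
Q = zeros(m,n); R = zeros(n,n);
V = A;                            % working copies of the columns
for i = 1:n
    R(i,i) = norm(V(:,i));
    Q(:,i) = V(:,i)/R(i,i);
    for j = i+1:n
        R(i,j) = Q(:,i)'*V(:,j);  % coefficient against the updated vj
        V(:,j) = V(:,j) - R(i,j)*Q(:,i);
    end
end
end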
Modified Gram-Schmidt Process
MGS is just a rearrangement of CGS: in MGS the outer loop is over rows, whereas in CGS the outer loop is over columns
In exact arithmetic, MGS and CGS are identical
MGS is better numerically since when we compute rij = qi^T vj, components of q1, . . . , qi−1 have already been removed from vj

Whereas in CGS we compute rij = qi^T a(:,j), which involves some extra “contamination” from components of q1, . . . , qi−1 in a(:,j)³

³The qi’s are not exactly orthogonal in finite precision arithmetic
Operation Count
Work in MGS is dominated by lines 8 and 9, the innermost loop:
rij = qi^T vj
vj = vj − rij qi

The first line requires m multiplications and m − 1 additions; the second line requires m multiplications and m subtractions
Hence ∼ 4m operations per single inner iteration
Hence the total number of operations is asymptotic to

∑_{i=1}^{n} ∑_{j=i+1}^{n} 4m ∼ 4m ∑_{i=1}^{n} i ∼ 2mn^2
Householder Triangularization
The other main method for computing QR factorizations is Householder⁴ triangularization

The Householder algorithm is more numerically stable and more efficient than Gram-Schmidt

But Gram-Schmidt allows us to build up an orthogonal basis for successive spaces spanned by columns of A
span{a(:,1)}, span{a(:,1), a(:,2)}, . . .
which can be important in some cases
⁴Alston Householder, 1904–1993, American numerical analyst
Householder Triangularization
Householder idea: Apply a succession of orthogonal matrices Qk ∈ R^{m×m} to A to compute an upper triangular matrix R
Qn · · ·Q2Q1A = R
Hence we obtain the full QR factorization A = QR, where Q ≡ Q1^T Q2^T · · · Qn^T
Householder Triangularization
In 1958, Householder proposed a way to choose Qk to introduce zeros below the diagonal in column k while preserving the previous zeros
⎡× × ×⎤      ⎡× × ×⎤      ⎡× × ×⎤      ⎡× × ×⎤
⎢× × ×⎥  Q1  ⎢0 × ×⎥  Q2  ⎢0 × ×⎥  Q3  ⎢0 × ×⎥
⎢× × ×⎥  −→  ⎢0 × ×⎥  −→  ⎢0 0 ×⎥  −→  ⎢0 0 ×⎥
⎢× × ×⎥      ⎢0 × ×⎥      ⎢0 0 ×⎥      ⎢0 0 0⎥
⎣× × ×⎦      ⎣0 × ×⎦      ⎣0 0 ×⎦      ⎣0 0 0⎦
   A           Q1A          Q2Q1A       Q3Q2Q1A
This is achieved by Householder reflectors
Householder Reflectors
We choose
Qk = [ I 0 ]
     [ 0 F ]

where

• I ∈ R^{(k−1)×(k−1)}
• F ∈ R^{(m−k+1)×(m−k+1)} is a Householder reflector
The I block ensures the first k − 1 rows are unchanged
F is an orthogonal matrix that operates on the bottom m − k + 1 rows
(Note that F orthogonal =⇒ Qk orthogonal)
Householder Reflectors
Let x ∈ R^{m−k+1} denote entries k, . . . , m of the kth column

We have two requirements for F:

1. F is orthogonal, so we must have ‖Fx‖2 = ‖x‖2
2. Only the first entry of Fx should be non-zero
Hence we must have

x = ⎡×⎤      Fx = ⎡‖x‖2⎤
    ⎢×⎥  −→       ⎢ 0  ⎥  = ‖x‖2 e1
    ⎢⋮⎥           ⎢ ⋮  ⎥
    ⎣×⎦           ⎣ 0  ⎦
Question: How can we achieve this?
Householder Reflectors

We can see geometrically that this can be achieved via reflection across a subspace H
Here H is the subspace orthogonal to v ≡ ‖x‖e1 − x, and the key point is that H “bisects” v
Householder Reflectors
We see that this bisection property holds because x and Fx both live on the hypersphere centered at the origin with radius ‖x‖2
Householder Reflectors
Next, we need to determine the matrix F which maps x to ‖x‖2e1
F is closely related to the orthogonal projection of x onto H, since that projection takes us “half way” from x to ‖x‖2e1

Hence we first consider orthogonal projection onto H, and subsequently derive F
Recall that the orthogonal projection of a onto b is given by ((a·b)/‖b‖^2) b, or equivalently (bb^T/(b^T b)) a

Hence we can see that bb^T/(b^T b) is a rank 1 projection matrix
Householder Reflectors
Let P_H ≡ I − vv^T/(v^T v)

Then P_H x gives “x minus the orthogonal projection of x onto v”

We can show that P_H x is in fact the orthogonal projection of x onto H, since it satisfies:

• P_H x ∈ H: v^T P_H x = v^T x − v^T (vv^T/(v^T v)) x = v^T x − (v^T v/(v^T v)) v^T x = 0
• The projection error x − P_H x is orthogonal to H: x − P_H x = x − x + (vv^T/(v^T v)) x = (v^T x/(v^T v)) v, which is parallel to v
Householder Reflectors
But recall that F should reflect across H rather than project onto H

Hence we obtain F by going “twice as far” in the direction of v compared to P_H, i.e. F = I − 2vv^T/(v^T v)

Exercise: Show that F is an orthogonal matrix, i.e. that F^T F = I
Householder Reflectors

But in fact, we can see that there are two Householder reflectors that we can choose from⁵

(Figure: the two possible reflections of x, across H⁺ or H⁻)

⁵This picture is “not to scale”; H⁻ should bisect −‖x‖e1 − x
Householder Reflectors
In practice, it’s important to choose the “better of the two reflectors”

If x and ‖x‖2e1 (or x and −‖x‖2e1) are close, we could obtain loss of precision due to cancellation (cf. Unit 0) when computing v

To ensure x and its reflection are well separated we should choose the reflection to be −sign(x1)‖x‖2e1 (more details on next slide)

Therefore, we want v to be v = −sign(x1)‖x‖2e1 − x

But note that the sign of v does not affect F, hence we scale v by −1 to get an equivalent formula without “minus signs”:
v = sign(x1)‖x‖2e1 + x
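As a concrete illustration (a sketch with an arbitrary test vector, not from the slides), we can build v and apply the reflector in Matlab:

>> x = [3; 1; 2];
>> v = sign(x(1))*norm(x)*[1; 0; 0] + x;
>> v = v/norm(v);            % normalize, so F = I - 2*v*v'
>> Fx = x - 2*v*(v'*x)       % approx [-sign(x(1))*norm(x); 0; 0]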
Householder Reflectors

Let’s compare the two options for v in the potentially problematic case when x ≈ ±‖x‖2e1, i.e. when |x1| ≈ ‖x‖2:

vbad ≡ −sign(x1)‖x‖2e1 + x
vgood ≡ sign(x1)‖x‖2e1 + x

‖vbad‖2^2 = ‖−sign(x1)‖x‖2e1 + x‖2^2
          = (−sign(x1)‖x‖2 + x1)^2 + ‖x_{2:(m−k+1)}‖2^2
          = (−sign(x1)‖x‖2 + sign(x1)|x1|)^2 + ‖x_{2:(m−k+1)}‖2^2
          ≈ 0

‖vgood‖2^2 = ‖sign(x1)‖x‖2e1 + x‖2^2
           = (sign(x1)‖x‖2 + x1)^2 + ‖x_{2:(m−k+1)}‖2^2
           = (sign(x1)‖x‖2 + sign(x1)|x1|)^2 + ‖x_{2:(m−k+1)}‖2^2
           ≈ (2 sign(x1)‖x‖2)^2
Householder Reflectors
Recall that v is computed from two vectors of magnitude ‖x‖2
The argument above shows that with vbad we can get ‖v‖2 ≪ ‖x‖2 =⇒ “loss of precision due to cancellation” is possible

In contrast, with vgood we always have ‖vgood‖2 ≥ ‖x‖2, which rules out loss of precision due to cancellation
Householder Triangularization
We can now write out the Householder algorithm:
1: for k = 1 : n do
2:   x = a(k:m,k)
3:   vk = sign(x1)‖x‖2 e1 + x
4:   vk = vk/‖vk‖2
5:   a(k:m,k:n) = a(k:m,k:n) − 2vk(vk^T a(k:m,k:n))
6: end for
This replaces A with R and stores v1, . . . , vn
Note that we don’t divide by vk^T vk in line 5 (as in the formula for F) since we normalize vk in line 4
The Householder algorithm requires ∼ 2mn^2 − (2/3)n^3 operations⁶

⁶Compared to ∼ 2mn^2 for Gram-Schmidt
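A Matlab transcription of the algorithm (an illustrative sketch, not from the slides) might be:

function [V,R] = house_qr(A)
% Householder triangularization: reduces A to upper triangular R
% and returns the reflector vectors v1,...,vn as columns of V
[m,n] = size(A);
V = zeros(m,n);
for k = 1:n
    x = A(k:m,k);
    s = sign(x(1)); if s == 0, s = 1; end   % Matlab's sign(0) is 0
    v = s*norm(x)*eye(length(x),1) + x;
    v = v/norm(v);
    A(k:m,k:n) = A(k:m,k:n) - 2*v*(v'*A(k:m,k:n));
    V(k:m,k) = v;
end
R = triu(A(1:n,1:n));   % clean up rounding below the diagonal
end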
Householder Triangularization
Note that we don’t explicitly form Q
We can use the vectors v1, . . . , vn to compute Q in a post-processing step

Recall that

Qk = [ I 0 ]
     [ 0 F ]

and Q ≡ (Qn · · · Q2Q1)^T = Q1^T Q2^T · · · Qn^T

Also, the Householder reflectors are symmetric (refer to the definition of F), so Q = Q1^T Q2^T · · · Qn^T = Q1Q2 · · · Qn
Householder Triangularization
Hence, we can evaluate Qx = Q1Q2 · · · Qn x using the vk:

1: for k = n : −1 : 1 do
2:   x(k:m) = x(k:m) − 2vk(vk^T x(k:m))
3: end for
Question: How can we use this to form the matrix Q?
Householder Triangularization
Answer: Compute Q via Qei, i = 1, . . . , m

Similarly, compute Q̂ (the reduced QR factor) via Qei, i = 1, . . . , n

However, it is often not necessary to form Q or Q̂ explicitly, e.g. to solve Ax ≈ b we only need the product Q^T b

Note that the product Q^T b = Qn · · · Q2Q1 b can be evaluated as:

1: for k = 1 : n do
2:   b(k:m) = b(k:m) − 2vk(vk^T b(k:m))
3: end for
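Continuing the earlier sketch (using the hypothetical house_qr helper from above; not from the slides), least squares via Householder QR could look like:

[V,R] = house_qr(A);   % A is m-by-n with m >= n, b is m-by-1
[m,n] = size(A);
for k = 1:n            % apply Q^T to b in place
    b(k:m) = b(k:m) - 2*V(k:m,k)*(V(k:m,k)'*b(k:m));
end
x = R \ b(1:n);        % solve the triangular system Rx = c1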
Singular Value Decomposition (SVD)
Singular Value Decomposition
The Singular Value Decomposition (SVD) is a very useful matrix factorization
Motivation for SVD: the image of the unit sphere, S, under any m × n matrix is a hyperellipse

A hyperellipse is obtained by stretching the unit sphere in R^m by factors σ1, . . . , σm in orthogonal directions u1, . . . , um
Singular Value Decomposition
For A ∈ R^{2×2}, we have:

(Figure: the unit circle S is mapped by A to an ellipse AS)
Singular Value Decomposition
Based on this picture, we make some definitions:
• Singular values: σ1, σ2, . . . , σn ≥ 0 (we typically assume σ1 ≥ σ2 ≥ · · ·)
• Left singular vectors: {u1, u2, . . . , un}, unit vectors in the directions of the principal semiaxes of AS
• Right singular vectors: {v1, v2, . . . , vn}, preimages of the ui so that Avi = σiui, i = 1, . . . , n

(The names “left” and “right” come from the formula for the SVD below)
Singular Value Decomposition
The key equation above is that
Avi = σiui , i = 1, . . . , n
Writing this out in matrix form we get

  A [ v1 v2 · · · vn ] = [ u1 u2 · · · un ] ⎡ σ1          ⎤
                                            ⎢    σ2       ⎥
                                            ⎢       ⋱    ⎥
                                            ⎣          σn ⎦

Or more compactly:

AV = ÛΣ̂
Singular Value Decomposition
Here
• Σ̂ ∈ R^{n×n} is diagonal with non-negative, real entries
• Û ∈ R^{m×n} with orthonormal columns
• V ∈ R^{n×n} with orthonormal columns

Therefore V is an orthogonal matrix (V^T V = VV^T = I), so that we have the reduced SVD for A ∈ R^{m×n}:

A = ÛΣ̂V^T
Singular Value Decomposition
Just as with QR, we can pad the columns of Û with m − n arbitrary orthonormal vectors to obtain U ∈ R^{m×m}

We then need to “silence” these arbitrary columns by adding rows of zeros to Σ̂ to obtain Σ

This gives the full SVD for A ∈ R^{m×n}:

A = UΣV^T
Full vs Reduced SVD
(Figure: Full SVD vs reduced SVD)
Singular Value Decomposition
Theorem: Every matrix A ∈ R^{m×n} has a full singular value decomposition. Furthermore:

• The σj are uniquely determined
• If A is square and the σj are distinct, the {uj} and {vj} are uniquely determined up to sign
Singular Value Decomposition
This theorem justifies the statement that the image of the unit sphere under any m × n matrix is a hyperellipse

Consider A = UΣV^T (full SVD) applied to the unit sphere, S, in R^n:

1. The orthogonal map V^T preserves S
2. Σ stretches S into a hyperellipse aligned with the canonical axes ej
3. U rotates or reflects the hyperellipse without changing its shape
SVD in Matlab
Matlab’s svd function computes the full SVD of a matrix
>> A = rand(4,2);
>> [U,S,V] = svd(A)
U =
0.4775 0.5908 0.6085 -0.2293
0.4463 0.1122 -0.1281 0.8785
0.3541 -0.7954 0.4919 -0.0066
0.6689 -0.0756 -0.6094 -0.4190
S =
1.7335 0
0 0.8555
0 0
0 0
V =
0.6404 -0.7681
0.7681 0.6404
SVD in Matlab
As with QR, svd(A,0) returns the reduced SVD
>> [U,S,V] = svd(A,0)
U =
0.4775 0.5908
0.4463 0.1122
0.3541 -0.7954
0.6689 -0.0756
S =
1.7335 0
0 0.8555
V =
0.6404 -0.7681
0.7681 0.6404
>> norm(A - U*S*V’)
ans =
4.2123e-16
Matrix Properties via the SVD
• The rank of A is r, the number of nonzero singular values⁷

Proof: In the full SVD A = UΣV^T, U and V^T have full rank, hence it follows from linear algebra that rank(A) = rank(Σ)
• image(A) = span{u1, . . . , ur} and null(A) = span{vr+1, . . . , vn}
Proof: This follows from A = UΣV^T and

image(Σ) = span{e1, . . . , er} ⊆ R^m
null(Σ) = span{er+1, . . . , en} ⊆ R^n

⁷This also gives us a good way to define rank in finite precision: the number of singular values larger than some (small) tolerance
Matrix Properties via the SVD
• ‖A‖2 = σ1
Proof: Recall that ‖A‖2 ≡ max_{‖v‖2=1} ‖Av‖2. Geometrically, we see that ‖Av‖2 is maximized if v = v1 and Av = σ1u1.

• The singular values of A are the square roots of the eigenvalues of A^T A or AA^T

Proof: (Analogous for AA^T)

A^T A = (UΣV^T)^T (UΣV^T) = VΣ^T U^T UΣV^T = V(Σ^T Σ)V^T,

hence (A^T A)V = V(Σ^T Σ), or (A^T A)v(:,j) = σj^2 v(:,j)
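This relationship is easy to check numerically (an illustrative sketch with a random test matrix, not from the slides):

>> A = rand(5,3);
>> svd(A)'                            % singular values of A
>> sqrt(sort(eig(A'*A),'descend'))'   % square roots of eigenvalues of A'*A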
Matrix Properties via the SVD
The pseudoinverse, A+, can be defined more generally in terms of the SVD

Define the pseudoinverse of a scalar σ to be 1/σ if σ ≠ 0 and zero otherwise

Define the pseudoinverse of a (possibly rectangular) diagonal matrix by transposing the matrix and taking the pseudoinverse of each entry

The pseudoinverse of A ∈ R^{m×n} is then defined as

A+ = VΣ+U^T

A+ exists for any matrix A, and it captures our definitions of pseudoinverse from Unit I
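In Matlab this construction can be sketched as follows (assuming A has full rank, so there are no zero singular values; pinv implements the general idea):

>> A = rand(5,3);
>> [U,S,V] = svd(A,0);               % reduced SVD
>> Aplus = V*diag(1./diag(S))*U';    % pseudoinverse via the SVD
>> norm(Aplus - pinv(A))             % agrees with Matlab's pinv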
Matrix Properties via the SVD
We generalize the condition number to rectangular matrices via the definition κ(A) = ‖A‖‖A+‖
We can use the SVD to compute the 2-norm condition number:
• ‖A‖2 = σmax
• The largest singular value of A+ is 1/σmin, so that ‖A+‖2 = 1/σmin
Hence κ(A) = σmax/σmin
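A quick Matlab check (illustrative sketch):

>> A = rand(5,3);
>> s = svd(A);
>> [cond(A), s(1)/s(end)]    % both give sigma_max/sigma_min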
Matrix Properties via the SVD
These results indicate the importance of the SVD, both theoretically and as a computational tool

Algorithms for calculating the SVD are an important topic in Numerical Linear Algebra, but outside the scope of this course

Computing the SVD requires ∼ 4mn^2 − (4/3)n^3 operations
For more details on algorithms, see Trefethen & Bau, or Golub &van Loan
Low-Rank Approximation via the SVD
One of the most useful properties of the SVD is that it allows us to obtain an optimal low-rank approximation to A

See Lecture: We can recast the SVD as

A = ∑_{j=1}^{r} σj uj vj^T

This follows from writing Σ as a sum of r matrices Σj, where Σj ≡ diag(0, . . . , 0, σj, 0, . . . , 0)

Each uj vj^T is a rank one matrix: each column is a scaled version of uj
Low-Rank Approximation via the SVD
Theorem: For any 0 ≤ ν ≤ r, let Aν ≡ ∑_{j=1}^{ν} σj uj vj^T; then

‖A − Aν‖2 = min_{B ∈ R^{m×n}, rank(B) ≤ ν} ‖A − B‖2 = σ_{ν+1}
That is:
• Aν gives us the closest rank ν matrix to A, measured in the 2-norm
• The error in Aν is given by the first omitted singular value
Low-Rank Approximation via the SVD
A similar result holds in the Frobenius norm:
‖A − Aν‖_F = min_{B ∈ R^{m×n}, rank(B) ≤ ν} ‖A − B‖_F = √(σ_{ν+1}^2 + · · · + σ_r^2)
Low-Rank Approximation via the SVD
These theorems indicate that the SVD is an effective way to compress data encapsulated by a matrix!

If the singular values of A decay rapidly, we can approximate A with a few rank one matrices (we only need to store σj, uj, vj for j = 1, . . . , ν)
Matlab example: Image compression via the SVD
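A minimal sketch of such a demo (the image file name is a placeholder, and a grayscale image is assumed):

X = double(imread('image.png'));   % grayscale image as a matrix
[U,S,V] = svd(X,'econ');
nu = 20;                           % number of rank one terms to keep
Xnu = U(:,1:nu)*S(1:nu,1:nu)*V(:,1:nu)';
imshow(uint8(Xnu))                 % rank-nu approximation of the image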