
NUMERICAL METHODS

Robert D. Skeel

COPYRIGHT © 2013 Robert D. Skeel. All Rights Reserved.


Chapter 1

Introduction to MATLAB


Reference: Chapter 2 of Moler.

Matlab is an interactive scripting language sold by The MathWorks that facilitates matrix operations, graphics, and building of user interfaces. The default data type is a 2-dimensional array of doubles. Online documentation is available from The MathWorks.

Octave is free software, mostly compatible with Matlab, that provides many of its functions for numerical computations and that supports plotting using gnuplot.

Another free Matlab clone is SciLab.

1.1 The Golden Ratio

aspect ratio The aspect ratio of a rectangle is the ratio of its length to its width, with the length taken to be the larger of its two dimensions.

functions In mathematics, a function is a set of ordered pairs such that each first member is unique. For example, {(x, 2x^2 − 1) : x ∈ R} is a function. If the domain of definition is understood, one might write instead x ↦ 2x^2 − 1. Note that in both cases x is a dummy variable: if y had been used instead of x, the function would be unchanged. A further shortcut is to declare x to be a variable and to omit the x ↦. Without such a declaration, x might just as well be a constant, denoting a real number. Note that variables and constants are not mathematical objects but part of the language of mathematics. Operations can be performed on functions. The most basic operation is evaluation, e.g.,

(x ↦ 2x^2 − 1)(0.5) = −0.5.
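The evaluation above can be mirrored in code. A minimal sketch (in Python rather than the notes' MATLAB; the names f and g are mine):

```python
# The function x ↦ 2x^2 − 1 written as an anonymous function.
f = lambda x: 2 * x**2 - 1
assert f(0.5) == -0.5          # (x ↦ 2x^2 − 1)(0.5) = −0.5

# The dummy variable's name is irrelevant: g denotes the same function.
g = lambda y: 2 * y**2 - 1
assert g(0.5) == f(0.5)
```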

definition of error There are two ways to define the error. The instructor prefers

(absolute) error = approximate value − exact value.

This is consistent with ordinary usage: an approximation results from adding error to the exact value, and the exact value is obtained by removing error from the approximation. Moler does the opposite. Either is correct, but you must be consistent.


decimal places vs. digits The number of decimal places is counted from the decimal point; the number of digits is counted from the first nonzero digit.

MATLAB notes.

  • To run MATLAB from the shell command line, enter

    $ matlab -nodisplay

    But first make sure the binary is in the shell's search path.

  • Note the use of ./ rather than / in the definition of f. The latter will attempt matrix right division if presented with operands that are arrays.

• Note that goldrect.m is a script and goldfraction.m is a function definition.

  • To save a figure, you can, for example,

    > print -dpng goldrect.png

• To obtain the dimensions of an array use size.

  • If a single index is used for a matrix, it is indexed as though it were stored column by column in a 1-dimensional array.

  • In MATLAB, there are two distinct concepts: a function and a function handle. A MATLAB function is defined in a function M-file. If f is a function, it can be used only in the form f(⟨list of arguments⟩) or f; the latter is equivalent to f(). A function handle is created from an anonymous function or from an at-sign tacked onto a function name, e.g., @f. It is a true object: it can be assigned to a variable and passed as an argument.

  • There is an anomaly in MATLAB (but not Octave) parsing. The entry

    >> f = @(x) x^2 - 2; g = (f); g(1)

    works, but

    >> f = @(x) x^2 - 2; (f)(1)

    is flagged as a syntax error!

Octave notes.

• Octave has no builtin Symbolic Toolbox.

• The Octave equivalent of ezplot(f, 0, 4) is ezplot(f, [0 4]). The latter also works in MATLAB.

  • Additional anomalies for Octave 3.6.3 running on Mac OS 10.7: the 'o' marker appears as a triangle, the separating line in the golden rectangle is solid rather than dashed, and close() does not close the figure window.

Emacs note. If you use emacs, append

(setq auto-mode-alist
      (cons '("\\.m$" . octave-mode)
            auto-mode-alist))

to your ~/.emacs file.

1.2 Fibonacci Numbers

Note the concept of a closed-form solution on page 12. In mathematics, the log function is assumed to be a natural logarithm unless stated otherwise.

Note. It is conventional to define Fibonacci numbers using F1 = F2 = 1.
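The recurrence and its closed-form solution can be compared directly. A Python sketch (the helper names fib and fib_closed are mine), using the convention F1 = F2 = 1:

```python
import math

def fib(n):
    """Fibonacci numbers with the conventional F1 = F2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Closed-form (Binet) solution, with phi the golden ratio:
# F_n = (phi^n - psi^n) / sqrt(5).
phi = (1 + math.sqrt(5)) / 2
psi = (1 - math.sqrt(5)) / 2

def fib_closed(n):
    # Rounding absorbs the small floating-point error for moderate n.
    return round((phi**n - psi**n) / math.sqrt(5))

assert [fib(n) for n in range(1, 8)] == [1, 1, 2, 3, 5, 8, 13]
assert all(fib(n) == fib_closed(n) for n in range(1, 40))
```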


1.3 Fractal Fern

Recall the definition of a matrix product in which a table of inner products is computed using each row of the first matrix “dotted with” each column of the second matrix.

MATLAB note. The MATLAB backslash operator performs matrix left division.

Octave note. Anomaly for Octave 3.6.3 running on Mac OS 10.7: uicontrol does not work.

1.4 Magic Squares

Recall the identity matrix I. Recall that a matrix A is singular if det A = 0 and nonsingular (aka regular) otherwise. A nonsingular matrix A has an inverse A^-1 such that

A^-1 A = A A^-1 = I.

A sequence of k vectors x1, x2, ..., xk is said to be linearly independent if c1x1 + c2x2 + ··· + ckxk = 0 only when c1 = c2 = ··· = ck = 0. The row rank of a matrix is the maximum number of linearly independent rows. Similarly for the column rank. Row rank equals column rank.

Other concepts: span, null space, reduced row echelon form. Norms and singular values will be introduced later.

Octave notes.

• cameratoolbar is not implemented.

• Anomaly for Octave 3.6.3 running on Mac OS 10.7: image does not work.

1.5 Cryptography

Omit details of the encryption technique. Note the char and reshape functions.

1.6 The 3n + 1 Sequence

When plotting, use a log scale, e.g., the function semilogy, when appropriate. If an axis represents a quantity that is intrinsically positive, there is a good chance that a log scale is appropriate.


Chapter 2

Floating-Point Computation


Reference: Section 1.7 of Moler.

This is a topic for which there are numerous widespread misconceptions. The actual situation is surprisingly simple and satisfying, thanks to the efforts of Turing Award winner William Kahan. There are three main ideas here:

• The computer can represent a finite set of values known as machine numbers.

• Rounding maps a value to the nearest machine number.

• Floating-point arithmetic rounds the exact result of an operation.

2.1 Floating-Point Numbers

Reference: Section 2.2 of Skeel & Keiper.

For binary floating-point arithmetic, the machine numbers are a finite subset of those values whose fraction part has a denominator which is a power of two. Hence, for example, 0.1 = 1/(2 · 5) cannot be a machine number, whereas 0.125 = 1/2^3 normally is. Two sets of machine numbers are specified by the IEEE Standard for Binary Floating-Point Arithmetic:

• Single precision with precision 24 and exponent range [−126 : 127] inclusive.

• Double precision with precision 53 and exponent range [−1022 : 1023] inclusive.

Specifically, a double-precision machine number x can be expressed

x = ±(b0 + b1/2 + ··· + b52/2^52) × 2^e,   bi ∈ {0, 1},   −1022 ≤ e ≤ 1023.

This can be expressed as

x = ±(b0 · 2^52 + ··· + b51 · 2 + b52) × 2^(e−52)
  = ⟨modest integer⟩ × 2^⟨small integer⟩     (2.1)
  = m × 2^p                                  (2.2)

where |m| < 2^53 = 9 007 199 254 740 992 and −1074 ≤ p ≤ 971.
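The m × 2^p form can be inspected directly. A Python sketch using the standard-library float.as_integer_ratio and math.frexp (Python doubles are IEEE double precision on all common platforms):

```python
import math

# A machine number's fraction part has a power-of-two denominator.
num, den = (0.1).as_integer_ratio()
assert den & (den - 1) == 0                  # denominator is a power of two...
assert (num, den) != (1, 10)                 # ...so 0.1 is not exactly 1/10
assert (0.125).as_integer_ratio() == (1, 8)  # 0.125 = 1/2^3 is exact

# math.frexp gives x = f * 2**e with 0.5 <= |f| < 1; rescaling by a power
# of two is exact, so the decomposition reproduces x exactly.
f, e = math.frexp(0.1)
assert math.ldexp(f, e) == 0.1
```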


You can count on floating-point numbers.

Additionally, there are representations for ±∞, NaN (not-a-number), and ±0. The sign of zero is normallyundetected.

denormalized numbers The representation of double-precision floating-point numbers is normalized so that either b0 = 1 or e = −1022. A number for which e = −1022 and b0 = 0 is said to be denormalized. Since its leading bit or bits (binary digits) are zero, it has fewer than 53 significant bits.

The exact value of a double-precision machine number can be obtained using the following function:

actual = @(x) sprintf('%.767g', x);

Up to 767 digits are needed to represent the value of a machine number. To distinguish between two different machine numbers, it is, however, enough to see the first 17 digits, which can be effected by using

repr = @(x) sprintf('\n%.17g', x)

In Octave, one can set output_precision(17).

ternary arithmetic This is a more common name for base 3 arithmetic.

The C math library has a function frexp that extracts fraction and exponent from a floating-point number. MATLAB has a similar function. The definitions of fraction and exponent vary. Here is a function that delivers values conforming to our definition:

function [f, n] = myfrexp(x)
% f * 2^n = x where either 1 <= |f| < 2 or n == -1022
[f, n] = log2(x);
f = 2*f; n = n - 1;
if n < -1022
    f = f*pow2(1022 + n); n = -1022;
end
if f == 0, n = -1022; end

The C math library has another function, nextafter. Here is a MATLAB implementation:

function x1 = nextafter(x, y)
% the machine number just after x if y > x
% the machine number just before x if y < x
[f, n] = myfrexp(x);
eps = pow2(n-52);
if y < x, eps = -eps; end
if f == 1 & y < x, eps = eps*0.5; end
if f == -1 & y > x, eps = eps*0.5; end
x1 = x + eps;
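The same spacing behavior can be checked against Python's standard library, which gained math.nextafter in version 3.9. Note how the spacing halves just below 1, corresponding to the f == 1 case above:

```python
import math

# Requires Python 3.9+ for math.nextafter.
up = math.nextafter(1.0, 2.0)     # machine number just after 1
down = math.nextafter(1.0, 0.0)   # machine number just before 1

assert up - 1.0 == 2**-52         # spacing above 1 is 2^-52
assert 1.0 - down == 2**-53       # spacing below 1 is half as large
```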

Review questions

1. In Equation (2.1), what is the range for ⟨modest integer⟩ and for ⟨small integer⟩?

2. For a binary floating-point number system, describe in words the distribution of machine numbers along the real line. How are they spaced?


2.2 Rounding

Reference: Section 3.3 of Skeel & Keiper.

For any real value x, we define its rounded value fl(x) to be the machine number closest to x. In case of a tie, choose the machine number whose last bit (53rd in double precision) is zero. For example,

1, 1 + 2^-52, 1 + 2^-51

are consecutive machine numbers. The value 1 + 2^-53 is midway between the first two and would be rounded down to 1 because the 53rd significant bit of 1 is zero. The value 1 + 3·2^-53 is midway between the last two and would be rounded up to 1 + 2^-51 because its 53rd significant bit is zero. If x is closer to ±2^1024 than it is to any machine number, define its rounded value to be ±∞, respectively (in the case of double precision).
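Both tie-breaking cases can be observed directly in double precision, since Python arithmetic rounds to nearest with ties to even:

```python
# Round to nearest, ties to the machine number with last bit zero.
assert 1.0 + 2**-52 != 1.0                  # 1 + 2^-52 is the next machine number
assert 1.0 + 2**-53 == 1.0                  # midway: rounds down to 1
assert 1.0 + 3 * 2**-53 == 1.0 + 2**-51     # midway: rounds up to 1 + 2^-51
```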

exponent overflow Exponent overflow is said to occur if x is finite but fl(x) is infinite.

exponent underflow There is no standard definition in that computer hardware designers have some leeway in deciding when to raise the underflow flag. It would be good to define underflow as a loss of accuracy due to the lower limit on the exponent range; that is, underflow occurs if the result would have been different with no lower limit on the exponent range. For example, 2^-1023 + 2^-1075 is representable with 53 bits with an exponent of −1023 but must be rounded to 2^-1023 to be represented with an exponent of −1022.

Directed rounding The IEEE standard also specifies directed rounding modes: round toward +∞, e.g., fl↑(1 + 2^-24) = 1 + 2^-23, and round toward −∞, e.g., fl↓(1 + 2^-24) = 1.

relative error

relative error = (absolute) error / exact value.

It is frequently stated in percent and sometimes in ppm or ppb.

Rounding error The relative rounding error δ = (fl(x) − x)/x can be shown to be less than 2^-24 in magnitude, assuming no underflow (nor overflow). This fact is more usefully expressed

fl(x) = x(1 + δ),   |δ| < 2^-24,

with 2^-24 known as the unit roundoff error. If underflow occurs, the relative error could be much larger.

Review questions

1. One reason why rounded arithmetic operations are generally preferable to chopped ones is that in the average case and in the worst case a rounding error is only half as great in magnitude as a chopping error. There is a second, more important reason. What is it?

2. In a p-digit binary floating-point system with rounding to nearest, state a good upper bound on the relative roundoff error (i.e., the unit roundoff error).


2.3 Basic Operations

Reference: Section 2.4 of Skeel & Keiper.

Let ◦ be one of the arithmetic operations +, −, ×, or /. The corresponding floating-point operation ◦̂ is defined by

a ◦̂ b = fl(a ◦ b), for machine numbers a, b.

It is that simple. Similarly for the square root and decimal-to-binary conversion. Note that floating-point operations are defined only for machine numbers, contrary to what is implied by many textbooks.

When cancellation occurs, the result is exact:

Proposition 2.1 If x and y are machine numbers with 1/2 ≤ x/y ≤ 2, floating-point subtraction x −̂ y is exact.

Underflow, as we have defined it, cannot occur for addition/subtraction and can occur for multiplication/division only if the result is less than 2^-1022 in magnitude (and even then it might not).

For elementary functions it is impractical to maintain such high standards. It is enough that they roundto the nearest or next-to-nearest machine number.

Also, the definition implies that for multiple operations, there is a rounding after each operation, e.g.,

a ×̂ b +̂ c = fl(ab) +̂ c = fl(fl(ab) + c).
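Both facts, exact subtraction of nearby machine numbers (Proposition 2.1) and rounding after every operation, can be checked exactly using Python's Fraction, which carries the exact rational value of each double:

```python
from fractions import Fraction

# Proposition 2.1 (Sterbenz): 1/2 <= x/y <= 2 makes x - y exact.
x, y = 1.75, 1.0
assert Fraction(x) - Fraction(y) == Fraction(x - y)   # no rounding occurred

# Multiple operations round after each step: a*b + c = fl(fl(ab) + c),
# which in general differs from the exactly computed a*b + c.
a, b, c = 0.1, 0.1, 1.0
exact = Fraction(a) * Fraction(b) + Fraction(c)
assert Fraction(a * b + c) != exact                   # rounding error present
```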

interval arithmetic Rigorous computation with floating-point arithmetic is possible using interval arithmetic, in which values are represented by floating-point lower and upper bounds. This approach makes use of directed rounding.

Review questions

1. What, if anything, can we say about the accuracy of floating-point subtraction of two nearly equal machine numbers? Hint: What can we say about the exact result of subtracting two nearly equal machine numbers?

2. Why do we round outward for an interval operation?

2.4 Numerical Instability

Reference: Section 2.5 of Skeel & Keiper.

Numerical stability is a property of an algorithm. An algorithm implements a mapping from a set of input values to a set of output values by decomposing the mapping into builtin operations. As a very simple example, the mapping x ↦ 1/x − 1 has two simple floating-point implementations:

Algorithm 1: 1 /̂ x −̂ 1

Algorithm 2: (1 −̂ x) /̂ x

Because of roundoff error, the two algorithms may not be equally accurate. Although roundoff error is the concern here, numerical stability is a concept that applies to any kind of computational error: roundoff error, truncation error, discretization error, and sampling error.

For purposes of illustration, suppose that we are doing arithmetic where the machine numbers are 4-digit decimal numbers (times some power of ten) and that x = 0.99. Algorithm 1 produces the result

1 /̂ 0.99 −̂ 1 = fl(1.0101···) −̂ 1 = 1.010 −̂ 1 = fl(1.010 − 1) = 0.01000;


whereas, Algorithm 2 produces

(1 −̂ 0.99) /̂ 0.99 = 0.01 /̂ 0.99 = fl(0.010101···) = 0.01010,

which is the exact answer rounded to 4 digits. The problem with Algorithm 1 is that the small division error is greatly amplified by the subtraction operation. Note that the subtraction itself is exact.

An algorithm is numerically unstable if it produces an unacceptable amplification of the computational error. Numerical instability results from applying a sensitive operation or function to operands or arguments that are contaminated by computational error. Cancellation is a prime example. When this occurs, the operation is performed exactly, without error; it is the effect of this exact operation on existing error that is the concern.

If possible, cancellation should be avoided, e.g.,

e^x − 1 → expm1(x).

Otherwise, do the cancellation with uncontaminated operands, e.g.,

x^2 − y^2 → (x − y)(x + y).
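The expm1 advice is easy to see numerically. A Python sketch (math.expm1 is Python's analogue of the function mentioned above); for a tiny x, the naive e^x − 1 loses most of its digits to cancellation while expm1 does not:

```python
import math

x = 1e-12
ref = x + x * x / 2             # e^x - 1 accurate to well beyond double precision
naive = math.exp(x) - 1.0       # exp rounds near 1, then subtraction amplifies
good = math.expm1(x)            # computes e^x - 1 without cancellation

assert abs(good - ref) / ref < 1e-15    # nearly full accuracy
assert abs(naive - ref) / ref > 1e-8    # many digits lost
```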

The example e^x − 1 just given illustrates the need for an extensive collection of elementary functions. The underlying principle generalizes:

A small set of primitives is inappropriate for numerical tasks because of numerical stability considerations.

Review questions

1. As a rule, is it better to design a numerical algorithm so that cancellation occurs at the beginning or at the end of the computation? Or does it not matter?

2. Explain precisely why cancellation is a problem.


Chapter 3

Systems of Linear Equations


Reference: Chapter 3 of Moler.

3.1 Matrices

Reference: Section 4.1 of Skeel & Keiper.

Many calculations, numerical and otherwise, can be expressed as matrix operations. Recall the definition of a matrix product in which a table of inner products is computed using each row of the first matrix “dotted with” each column of the second matrix. Recall the identity matrix I. Recall that a matrix A is singular if det A = 0 and nonsingular (aka regular) otherwise. A nonsingular matrix A has an inverse A^-1 such that

A^-1 A = A A^-1 = I.

3.1.1 Relevance to linear systems

Review of the concepts of linear independence and rank. A sequence of k vectors x1, x2, ..., xk is said to be linearly independent if c1x1 + c2x2 + ··· + ckxk = 0 only when c1 = c2 = ··· = ck = 0. A linear subspace S of R^n is a subset closed under vector addition and multiplication of a vector by a scalar. The dimension of a linear subspace is the maximum number of linearly independent vectors in it. A sequence of k vectors x1, x2, ..., xk is a basis for a linear subspace if it is linearly independent and spans the subspace. The row rank of a matrix is the dimension of the subspace spanned by its rows and its column rank is the dimension of the subspace spanned by its columns. Row rank equals column rank.

A system of linear equations to be solved arises in many applications. It can be expressed in matrix form as Ax = b. If det A ≠ 0, A is said to be nonsingular and there is a unique solution x = A^-1 b; otherwise there is either no solution or infinitely many solutions. Whether a solution exists depends on whether b can be expressed as a linear combination of the columns of A.

3.1.2 An application of the matrix product

Suppose we have constructed a 3-D model of a physical object, e.g., a wireframe model, and we wish to rotate the object about some axis. Assume the state of the model is completely specified in terms of points r1, r2, ..., rN, and assume the axis of rotation goes through the origin. Let the axis be specified by a unit vector a and let the counterclockwise angle of rotation be θ.


For a very small angle of rotation ε, it can be shown that

rotated rk ≈ rk + ε a × rk,

which can be written in matrix form as

rotated rk ≈ (I + ε [a]×) rk,

where

[a]× (by definition) =

    [  0   −az   ay ]
    [  az   0   −ax ]
    [ −ay   ax   0  ].

(The notation skew(a) is also used for this matrix.)

For a finite angle θ, we can then use

rotated rk = lim n→∞ (I + (θ/n) [a]×)^n rk = exp(θ [a]×) rk,

where the exponential is defined by its power series. From this, we can get Rodrigues’ rotation formula

exp(θ [a]×) = (cos θ) I + (sin θ) [a]× + (1 − cos θ) a a^T.
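Applying Rodrigues' formula to a vector r gives (cos θ) r + (sin θ)(a × r) + (1 − cos θ)(a·r) a. A minimal Python sketch (the helper name rotate is mine, not from the text), checked on a quarter turn about the z-axis, which should send e1 to e2:

```python
import math

def rotate(r, a, theta):
    """Rodrigues' formula applied to r; a must be a unit vector."""
    c, s = math.cos(theta), math.sin(theta)
    cross = (a[1]*r[2] - a[2]*r[1],      # a × r
             a[2]*r[0] - a[0]*r[2],
             a[0]*r[1] - a[1]*r[0])
    dot = a[0]*r[0] + a[1]*r[1] + a[2]*r[2]   # a · r
    return tuple(c*r[i] + s*cross[i] + (1 - c)*dot*a[i] for i in range(3))

# Quarter turn about the z-axis sends e1 to e2 (up to roundoff).
r = rotate((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2)
assert all(abs(r[i] - e) < 1e-15 for i, e in enumerate((0.0, 1.0, 0.0)))
```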

Review questions

1. The (i, j)th element of AB is the dot product of which two vectors?

2. For a sequence of k vectors x1, x2, . . . , xk, define what it means for them to be linearly independent.

3. Define a linear subspace S of Rn.

4. Define the dimension of a linear subspace.

5. Define what it means for a sequence of k vectors x1, x2, . . . , xk to be a basis for a linear subspace.

6. Define the row rank of a matrix; the column rank.

3.2 Gaussian Elimination

References: Sections 2.1, 2.2, most of 2.3, and the end of 2.4 of Moler; Section 4.3 of Skeel & Keiper.

With very few exceptions, there is no need to compute a matrix inverse. Just as 5/3 is computed directly by division instead of as 3^-1 · 5, so one should “left divide” by a matrix, as in A\b. This is performed most efficiently by means of Gaussian elimination, for which the total cost of computing A^-1 b is only (2/3)n^3 + O(n^2) flops as opposed to 2n^3 + O(n^2) flops for the use of matrix inversion or n^3 + O(n^2) flops for the elegant Gauss-Jordan algorithm. A flop is one floating-point operation, and O(n^2) represents some unspecified function that is bounded in magnitude by some unspecified number times n^2.

Back substitution. Consider solving Ux = y. When written in detail this is

u_ii x_i + u_{i,i+1} x_{i+1} + ··· + u_{in} x_n = y_i, for i = 1, 2, ..., n.

These can be easily solved by back substitution

x_n = y_n / u_nn,
x_i = (y_i − u_{i,i+1} x_{i+1} − ··· − u_{in} x_n) / u_ii, for i = n−1, n−2, ..., 1.

Matlab code follows:


x = zeros(n,1);
for i = n:-1:1
    yi = y(i);
    for j = i+1:n
        yi = yi - u(i,j)*x(j);
    end
    x(i) = yi/u(i,i);
end

Improved efficiency is possible through vectorization. In particular, replace the body of the for i loop by

j = i+1:n;
x(i) = (y(i) - u(i,j)*x(j))/u(i,i);
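The same back substitution, as a self-contained Python sketch (plain lists rather than MATLAB arrays; the name back_substitute is mine):

```python
def back_substitute(u, y):
    """Solve u x = y for upper triangular u, working from the last row up."""
    n = len(u)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = y[i]
        for j in range(i + 1, n):
            s -= u[i][j] * x[j]       # subtract the already-known terms
        x[i] = s / u[i][i]
    return x

u = [[2.0, 1.0, 1.0],
     [0.0, 3.0, 2.0],
     [0.0, 0.0, 4.0]]
x = back_substitute(u, [5.0, 7.0, 8.0])
assert x == [1.0, 1.0, 2.0]           # exact: all quantities are representable
```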

Review question

1. Give in complete detail the algorithm for back substitution applied to a system Ux = y where U is upper triangular.

2. Fill in the blanks in the following algorithm for back substitution. The function has parameters a and b and should return the solution vector x as its value. The strictly lower triangular elements of a are to be ignored by the algorithm.

function x = backSubstitute(a, b)
n = length(a);
x = b;
for i = n:-1:1
    for j = ____________:n
        x(i) = x(i) - ___________________________;
    end
    x(i) = x(i)/________________________;
end
end

3. Fill in the blanks in the following algorithm for Gaussian elimination without pivoting.

function x = ge2(a, b)
n = length(a);
for k = 1:n-1
    for i = _______________:n
        m = a(i, k)/a(k, k);
        jvals = k+1:n;
        a(i, jvals) = ___________________________________________;
        b(i) = b(i) - m * b(k);
    end
end
x = backSubstitute(a, b);
end

3.3 Partial Pivoting

References: Sections 2.6 and 2.3 of Moler; section 4.4.1 of Skeel & Keiper.


Gaussian elimination is, in general, numerically unstable. To fix this, one needs to incorporate row interchanges: partial pivoting. At the beginning of each k loop, the elements in rows k through n of column k of the partial LU decomposition are compared to find the element a(im,k) that is largest in magnitude. Then rows k and im are interchanged, including multipliers.

3.4 LU Factorization

References: Sections 2.4, 2.5, and 2.7 of Moler; Sections 4.5 and 4.7 of Skeel & Keiper.

An LU factorization has the form A = LU where U is upper triangular, meaning that u_ij = 0 for j < i, and L is unit lower triangular, meaning that l_ii = 1 and l_ij = 0 for j > i. For example, a 3 by 3 matrix would have a factorization

    [ 1   0   0 ] [ u11 u12 u13 ]
    [ l21 1   0 ] [ 0   u22 u23 ]
    [ l31 l32 1 ] [ 0   0   u33 ].

Clearly, the number of unknowns equals the number of scalar equations in A = LU, and if these are ordered appropriately, they can be solved explicitly. The matrix A need not be square. An LU factorization does not always exist, e.g., if

A = [ 0 1 ]
    [ 1 0 ].

A special case in which existence is assured is when A is square and strictly diagonally dominant, meaning that

|aii| > |ai1| + ··· + |ai,i−1| + |ai,i+1| + ··· + |ain|  for i = 1, 2, ..., n.

Later we generalize the LU factorization so that it exists for any nonsingular matrix.

To apply a factorization to the calculation of A^-1 b, express this as

A^-1 b = (LU)^-1 b = U^-1 L^-1 b = U\(L\b)

where inversion of the triangular factors is to be avoided by instead performing matrix left division. The operation b ↦ U\(L\b) is known as a backsolve.

3.4.1 Division by a triangular matrix

There are two operations to be performed: y = L\b and x = U\y.

Consider x = U\y. This is equivalent to solving Ux = y, which can be done using back substitution. An alternative algorithm known as back elimination references the elements of the matrix U column by column instead of row by row.

Likewise, the calculation of y = L\b can be performed using either forward substitution or forward elimination.

Each division by a triangular matrix costs n^2 + O(n) flops, so the total cost of a backsolve is only 2n^2 + O(n) flops.

If needed, the inverse A^-1 can be calculated by observing that

A^-1 = A^-1 I = A^-1 [e1 e2 ··· en] = [A^-1 e1  A^-1 e2  ···  A^-1 en]

where ei is a column vector with 1 in the ith position and zeros elsewhere. By exploiting the sparsity of the vectors ei, the cost of the n backsolves can be kept to (4/3)n^3 + O(n^2) instead of 2n^3 + O(n^2).

Textbook note. There is no apparent reason why (1:k-1)’ is used instead of 1:k-1 on line 8 of page 57.


3.4.2 Calculating an LU factorization

An LU factorization can be calculated by the forward elimination stage of Gaussian elimination:

    [ a11 a12 a13 ]        [ a11 a12  a13  ]        [ a11 a12  a13   ]
    [ a21 a22 a23 ]  −→    [ 0   a'22 a'23 ]  −→    [ 0   a'22 a'23  ]
    [ a31 a32 a33 ]        [ 0   a'32 a'33 ]        [ 0   0    a''33 ]

where the first step uses multipliers m21 and m31 and the second step uses m32.

It can be shown that the resulting reduced matrix is the second factor U in an LU factorization of the original matrix A, and the first factor is given by

    L = [ 1   0   0 ]
        [ m21 1   0 ]
        [ m31 m32 1 ].

In practice, the multipliers overwrite the eliminated elements of A and the nonzero elements of U overwrite the other elements of A.

Theorem 3.1 The multipliers and the reduced matrix computed by Gaussian elimination give an LU factorization of the original matrix.

Forward elimination stage of Gaussian elimination.

for k = 1:n-1
    for i = k+1:n
        m = a(i,k)/a(k,k);
        for j = k+1:n
            a(i,j) = a(i,j) - m*a(k,j);
        end
        a(i, k) = m;
    end
end

Note the beautiful symmetry in the innermost loop if we substitute in the value of m:

a(i,j) = a(i,j) - a(i,k)/a(k,k)*a(k,j);

As stated earlier, the cost of this algorithm is (2/3)n^3 + O(n^2) flops. The two innermost loops can be vectorized using an outer product. Also, the nesting of the three loops can be changed. The above loop nesting is known as kij; the other possibilities are kji, ikj, jki, ijk, and jik.

For partial pivoting the algorithm is

p = (1:n)'
for k = 1:n-1
    determine pivot index im;
    interchange rows k and im in A and p;
    determine multipliers;
    store multipliers where eliminated elements would go;
    subtract multiples of row k from remaining rows;
end

The interchange of rows in A involves all columns, both multipliers and elements of the reduced matrix. Following is detailed code modeled after lutx on page 60 of the textbook:

function [L, U, p] = mylu(a)  % mylu.m
[n, n] = size(a);
p = (1:n)';
for k = 1:n-1
    % Find the max value and its position for this column
    [m, im] = max(abs(a(k:n,k)));
    im = im + k - 1;
    % swap rows
    a([k im], :) = a([im k], :);
    p([k im]) = p([im k]);
    i = k+1:n;
    % Create multipliers and use them to update remaining rows
    a(i,k) = a(i,k)/a(k,k);
    j = k+1:n;
    a(i,j) = a(i,j) - a(i,k)*a(k,j);
end
L = tril(a, -1) + eye(n, n);
U = triu(a);

Theorem 3.2 Gaussian elimination with partial pivoting computes a factorization PA = LU where P is a permutation matrix.

The textbook only sketches the proof. Probably the best way to prove the theorem is to identify a loop invariant. However, we will omit the proof.

Following is a backslash algorithm, simplified from that on page 62 of the textbook:

function x = mybslash(A, b)  % mybslash.m
% solves A*x = b
[L, U, p] = mylu(A);
y = forward(L, b(p));
x = backsubs(U, y);

function y = forward(L, b)
% solves L*y = b where L is unit lower triangular
[n, n] = size(L);
for k = 1:n-1
    i = k+1:n;
    b(i) = b(i) - L(i,k)*b(k);
end
y = b;

function x = backsubs(U, y)
% solves U*x = y where U is upper triangular
[n, n] = size(U);
x = zeros(n, 1);
for i = n:-1:1
    j = i+1:n;
    x(i) = (y(i) - U(i,j)*x(j))/U(i,i);
end

Textbook note. Their implementation of forward uses forward substitution not forward elimination.
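The whole pipeline (pivot, eliminate, back-substitute) can be cross-checked in a compact Python sketch. This is my own condensed variant, not the textbook's code: the function name solve is mine, and it applies the row interchanges and multipliers to b during elimination instead of keeping a separate permutation vector p:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting, then back substitution."""
    n = len(A)
    a = [row[:] for row in A]     # work on copies
    b = b[:]
    for k in range(n - 1):
        # pivot: row with largest |a[i][k]| among i = k..n-1
        im = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[im] = a[im], a[k]
        b[k], b[im] = b[im], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i]
        for j in range(i + 1, n):
            s -= a[i][j] * x[j]
        x[i] = s / a[i][i]
    return x

A = [[0.0, 1.0], [1.0, 0.0]]      # has no LU factorization without pivoting
assert solve(A, [2.0, 3.0]) == [3.0, 2.0]
```

The test matrix is exactly the earlier counterexample for which no LU factorization exists, so it exercises the row interchange.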

3.4.3 Symmetric matrices

A matrix A is said to be symmetric if A^T = A where T denotes the transpose. The symmetry is with respect to the main diagonal.

A symmetric matrix A is positive definite if x^T A x > 0 whenever x ≠ 0. For a symmetric matrix, Gaussian elimination without pivoting generates positive pivots (diagonal elements of the U matrix) if and only if the matrix is positive definite.


Gaussian elimination without pivoting is stable for a symmetric positive definite matrix. A closely related algorithm is Cholesky factorization

A = GG^T

where G is lower triangular with positive diagonal elements. The cost of a Cholesky factorization is (1/3)n^3 + O(n^2) flops, which is half the cost of an LU factorization.

Review questions

1. Give a high level explanation of how an LU factorization can be employed to solve a system Ax = b ofn equations.

2. Give in complete detail the algorithm for forward elimination applied to a system Ly = b where L isunit lower triangular.

3. Give in complete detail the algorithm for Gaussian elimination for computing an LU factorization LUof a matrix A.

4. Contrast the number of operations required for an LU factorization with those needed to do a backsolve.

5. What are the two parts of a backsolve algorithm?

6. In the kij version of Gaussian elimination with partial pivoting what happens at the beginning of eachk loop? Be specific about which array elements are involved.

7. In what sense does Gaussian elimination with partial pivoting compute a factorization? State youranswer algebraically.

8. Define what it means for a symmetric matrix A to be positive definite.

9. If a matrix A is symmetric, how might we test in practice whether or not it is positive definite?

10. What is a Cholesky factorization of a symmetric positive definite matrix?

3.5 Norms and Sensitivity

References: First part of Section 2.9 of Moler; Section 4.2 of Skeel & Keiper.

3.5.1 How to measure errors

It can be important to choose units for (or to scale) the unknowns so that equal numerical changes have equal importance. As an example, for the circuit example of Figure 1.4 of the textbook some of the unknowns are potentials in volts and others are currents in amperes, and this choice of units is unsatisfactory for the purpose of calculating norms because one ampere has much greater significance than one volt.

definition of error For a collection of values x = [x_1 x_2 · · · x_n], the error is a collection of values x̃ − x. To measure the magnitude of the error, we might use ‖x̃ − x‖∞, where the infinity-norm of a vector v is defined by ‖v‖∞ = max_{1≤i≤n} |v_i|. Hence, relative error might be defined by

‖x̃ − x‖∞ / ‖x‖∞.

For reasons not given here, the ∞-norm of an m by n matrix A has the formula

‖A‖∞ = max_{1≤i≤m} Σ_{j=1}^{n} |a_ij|.


3.5.2 Residual vs. error

We define the relative residual to be ‖r‖/‖b‖ where r = b − Ax̃. Generally, this is more meaningful and of greater interest than the quantity in the textbook (which has been contrived to make Gaussian elimination with partial pivoting look good). As an example, consider the fitting of a linear combination of basis functions to pairs of (x, y) values. Then r represents errors in the fit and b the y values. Neither ‖A‖ nor ‖x̃‖ is meaningful.

3.5.3 Sensitivity for Ax = b

Consider a system of linear equations expressed in matrix form as Ax = b and with solution

x = A^{-1}b, assuming det A ≠ 0.

Errors ∆b added to b produce errors ∆x added to x:

x + ∆x = A^{-1}(b + ∆b).

It can be shown that relative errors are related by

‖∆x‖/‖x‖ ≤ cond(A) ‖∆b‖/‖b‖

where cond(A) = ‖A^{-1}‖‖A‖ is a quantity known as the condition number of the matrix A. It gives the smallest possible upper bound on the amplification of relative errors.
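The bound can be observed numerically. Here is a small Python sketch (illustrative only; the nearly singular 2 by 2 matrix and the perturbation are made up, and the ∞-norm is used throughout):

```python
def inf_norm_vec(v):
    return max(abs(vi) for vi in v)

def inf_norm_mat(A):
    return max(sum(abs(a) for a in row) for row in A)

def solve2(A, b):
    # Cramer's rule for a 2x2 system
    det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    return [(b[0]*A[1][1] - b[1]*A[0][1]) / det,
            (A[0][0]*b[1] - A[1][0]*b[0]) / det]

A = [[1.0, 1.0], [1.0, 1.0001]]          # nearly singular, so cond(A) is large
det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
Ainv = [[ A[1][1]/det, -A[0][1]/det],
        [-A[1][0]/det,  A[0][0]/det]]
cond = inf_norm_mat(A) * inf_norm_mat(Ainv)   # cond(A) = ||A^-1|| ||A||

b  = [2.0, 2.0001]
db = [1e-8, -1e-8]                        # perturbation of the right-hand side
x  = solve2(A, b)
xp = solve2(A, [b[0] + db[0], b[1] + db[1]])
dx = [xp[0] - x[0], xp[1] - x[1]]

# ratio of relative error in x to relative error in b
amplification = (inf_norm_vec(dx)/inf_norm_vec(x)) / (inf_norm_vec(db)/inf_norm_vec(b))
# for this perturbation the amplification is close to cond(A)
```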

Review questions

1. Define the l_p-norm of a vector.

2. Given a norm for any dimension of vector, define the induced norm for any dimension of matrix.

3. Define the condition number of a nonsingular square matrix. Give a formula for the condition number.

4. Suppose x = A^{-1}b and that ∆x is the error in x caused by introducing an error ∆b in b. Using the condition number cond(A), write down a relation that relates ∆x to ∆b.

3.6 Roundoff Error Estimates

References: Section 2.8 and second part of Section 2.9 of Moler; Section 4.4.2 of Skeel & Keiper.
Theoretical and empirical evidence suggests that

2^{-53} cond(A)

is a ballpark estimate for the relative error for double precision computation. The relative residual ‖r‖/‖b‖ can be this large, but it is far more common for it to be closer to the unit roundoff error 2^{-53}.

The textbook is rambling, imprecise and, in places, a bit misleading: The guarantee that residuals are small is based on a contrived definition of relative residual, and, even then, the worst case is proportional to 2^n. Following is an example illustrating the possibility of a relative residual proportional to cond(A):

% large_residual.m
Q = [1 1; 1 -1];
A = Q * diag([1 pow2(-26)]) * Q;
b = [1; -1];
x = A\b;
r = vpa(b) - vpa(A)*vpa(x); % high accuracy calculation
relative_residual = norm(double(r))/norm(b)
macheps_condA = cond(A)*pow2(-53)

which prints relative_residual = 5.2684e-09 and macheps_condA = 7.4506e-09.

3.7 Sparse Matrices and Banded Matrices

References: Section 2.10 of Moler; Section 4.6 of Skeel & Keiper.
A matrix is said to be banded if all of the nonzero elements lie in a band close to the main diagonal—close enough that efficiency is improved by storing only that part of the matrix, ranging from the lowest subdiagonal to the uppermost superdiagonal, that defines the band of nonzeros. A matrix is said to be sparse if the number of nonzero elements is few—so few that efficiency is improved by storing only the nonzeros together with their indices. If they are stored row by row, only column indices need to be recorded.

The bandwidth of a matrix A = (a_ij) is the sum p + 1 + q, where the lower bandwidth p and the upper bandwidth q are the smallest nonnegative integers such that the indices of all nonzero a_ij satisfy −p ≤ j − i ≤ q.
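The row-by-row storage scheme just mentioned is the idea behind the compressed sparse row (CSR) format. The following Python sketch (illustrative; the notes themselves use Matlab, and the 3 by 3 matrix is made up) stores only the nonzeros together with their column indices:

```python
def to_csr(A):
    """Compress a dense matrix: keep only nonzeros, row by row.
    Only column indices are recorded; row_ptr[i] marks where row i starts."""
    vals, cols, row_ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                vals.append(a)
                cols.append(j)
        row_ptr.append(len(vals))
    return vals, cols, row_ptr

def csr_matvec(vals, cols, row_ptr, x):
    """y = A x using only the stored nonzeros."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[cols[k]]
        y.append(s)
    return y

A = [[4.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [0.0, 0.0, 2.0]]
vals, cols, row_ptr = to_csr(A)   # 4 stored nonzeros instead of 9 entries
y = csr_matvec(vals, cols, row_ptr, [1.0, 1.0, 1.0])
```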

Textbook note. Their definition of bandwidth is nonstandard.

3.8 PageRank and Markov Chains

References: Section 2.11 of Moler.
A thought experiment. Let i_1, i_2, . . . , i_N be the pages visited by the random (Web) surfer, and let N_1, N_2, . . . , N_n be the corresponding histogram values. From these, we can define probabilities

x_i = lim_{N→∞} N_i / N.

It can be shown that x defined in this way satisfies

Ax = x and e^T x = 1.

A vector of probabilities x satisfying Ax = x is called a stationary state of the Markov chain.
A matrix A having nonnegative elements and satisfying e^T A = e^T is called a left stochastic matrix.
Because each element of A is, in fact, positive, the Perron-Frobenius theorem implies the existence of a unique x satisfying Ax = x and e^T x = 1.
Comment on textbook, p. 76. The reason we can take γ = 1 is because e^T x = 1 is simply a normalization of the length of x. Here we are temporarily choosing an alternative normalization z^T x = 1 to make the RHS vector a known quantity.

Omit the paragraph (on inverse iteration) beginning on the bottom of p. 76 of the textbook.
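The stationary state can be approximated by repeatedly applying A to a probability vector and renormalizing. A Python sketch (illustrative; the 3 by 3 left stochastic matrix below is made up, not taken from the notes):

```python
# Made-up left stochastic matrix: each column sums to 1, all entries positive.
A = [[0.5, 0.2, 0.3],
     [0.3, 0.6, 0.3],
     [0.2, 0.2, 0.4]]

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

x = [1/3, 1/3, 1/3]            # any probability vector works as a start
for _ in range(200):
    x = matvec(A, x)
    s = sum(x)                 # renormalize so that e^T x = 1
    x = [xi / s for xi in x]

# x now approximates the unique stationary state: Ax ≈ x and sum(x) == 1
residual = max(abs(yi - xi) for yi, xi in zip(matvec(A, x), x))
```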


Chapter 4

Interpolation

Just as real numbers must be approximated on the computer by a finite set of machine numbers, so must functions also be approximated. The most common approximations use piecewise polynomials, discussed in Sections 4.2–4.4. As preparation, polynomials are introduced in the first section. In addition to their simplicity, polynomials are special in the following sense: if p(x) is a polynomial of degree n, so is p(ax + b), assuming a ≠ 0, i.e., the set of polynomials of degree n has no intrinsic scale or origin.

Applications can be grouped as follows:

1. interpolating tabulated data,

2. interpolating a formulated function, and

3. representing an undetermined function.

Two distinct operations are involved: (i) construction of the interpolating function for given data, and (ii) evaluation of the interpolating function for a given value or set of values. The interpolating function is called the interpolant. In a higher-level programming language like Matlab, the interpolant can be constructed as a function handle, e.g.,

function q = quadratic(a, b, c) % quadratic.m
q = @(x) qeval(x, a, b, c);

function y = qeval(x, a, b, c)
y = a.*x.^2 + b.*x + c;

and evaluation can be performed by calling the function handle. In this example, qeval is a subfunction and visible only inside the file quadratic.m.

4.1 Polynomial Interpolation

Reference: Section 3.1 of Moler; Sections 5.2 and 5.3 of Skeel & Keiper.
If a polynomial approximation p(x) of degree ≤ n is constructed to match exactly the values of a given function f(x) at the nodes,

p(x) = f(x), x = x_0, x_1, . . . , x_n,

this is called interpolation. There will be error at values of x other than the nodes.


4.1.1 Existence and uniqueness

This is equivalent to the Vandermonde matrix being nonsingular.
Proposition. Given pairs (x_0, y_0), (x_1, y_1), . . . , (x_n, y_n), there exists a unique polynomial p(x) of degree ≤ n such that p(x_i) = y_i, i = 0, 1, . . . , n, if and only if the nodes x_i are distinct. ("Distinct" is mathematics jargon meaning "all different.")
If the nodes are not distinct, the data is either redundant or inconsistent.
Proof of uniqueness. Let p(x) and q(x) be polynomials of degree n or less that interpolate the same data at the same n + 1 distinct points. Then q(x) − p(x) is a polynomial of degree n or less that either has n + 1 distinct roots or is identically zero. By the fundamental theorem of algebra the first of these is not possible, so q(x) ≡ p(x).¹

4.1.2 Lagrange form

The basic concept is one of superposition of simpler interpolants.
To illustrate the Lagrange form, consider the example

x_i           1    2    4
y_i = p(x_i)  1   1/2  1/4

The idea is first to construct nodal basis functions l_0(x), l_1(x), l_2(x):

l_i(x_j) = { 1, j = i,
           { 0, j ≠ i.

The conditions on each polynomial specify all its roots together with its leading coefficient:

l_0(x) = (x − 2)(x − 4) / ((1 − 2)(1 − 4)),  l_1(x) = (x − 1)(x − 4) / ((2 − 1)(2 − 4)),  l_2(x) = (x − 1)(x − 2) / ((4 − 1)(4 − 2)).

The desired interpolant is a linear combination of the nodal basis functions:

p(x) = 1 · l_0(x) + (1/2) · l_1(x) + (1/4) · l_2(x).

In general, for n + 1 pairs of values,

p(x) = Σ_{i=0}^{n} y_i l_i(x),  where  l_i(x) = Π_{j=0, j≠i}^{n} (x − x_j)/(x_i − x_j).
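The general formula translates directly into code. A minimal Python sketch (illustrative; it uses the example data x = 1, 2, 4 and y = 1, 1/2, 1/4 from this section):

```python
def lagrange_eval(xs, ys, x):
    """Evaluate p(x) = sum_i y_i * l_i(x) directly from the Lagrange form."""
    n = len(xs)
    p = 0.0
    for i in range(n):
        li = 1.0
        for j in range(n):
            if j != i:
                li *= (x - xs[j]) / (xs[i] - xs[j])   # nodal basis function l_i(x)
        p += ys[i] * li
    return p

xs, ys = [1.0, 2.0, 4.0], [1.0, 0.5, 0.25]
at_nodes = [lagrange_eval(xs, ys, x) for x in xs]   # reproduces the data
p3 = lagrange_eval(xs, ys, 3.0)                     # interpolates in between
```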

4.1.3 Polynomial representation

cf. Exercise 3.8 of Moler
A polynomial

c_n x^n + · · · + c_1 x + c_0

of degree n (degree ≤ n to be really precise) might be represented on the computer in terms of floating-point values fl(c_0), fl(c_1), . . . , fl(c_n). However, the errors in the coefficients can result in a highly inaccurate approximation of the values of the polynomial even though these errors are (relatively) very small.

A generally more accurate representation is the Lagrange form, where a polynomial of degree at most n is represented as the set of pairs (x_0, y_0), (x_1, y_1), . . . , (x_n, y_n), each y_i the value of the polynomial at x_i.

¹cf. Exercise 3.7 of Moler


The Lagrange form uniquely specifies a polynomial as long as the nodes (or abscissas) x_i are required to be distinct. But they need not be ordered. Note that this representation requires 2n + 2 values rather than n + 1 to be stored. However, redundancy in programming is often a virtue, either for reasons of efficiency or accuracy—in this case it is accuracy. Also, the Lagrange form generalizes easily to multiple dimensions. The most basic operation to be performed on a polynomial p(x) is to evaluate it at a given value x. If evaluation is expected at several values x, most of the computation can be done in a single preprocessing step requiring O(n^2) operations, with each evaluation requiring only O(n) operations.
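The O(n^2) preprocessing plus O(n)-per-point evaluation is achieved by the barycentric form of Lagrange interpolation. A Python sketch (the barycentric weights and formula are standard, but this particular code is only an illustration, again using the data x = 1, 2, 4, y = 1, 1/2, 1/4):

```python
def barycentric_weights(xs):
    """O(n^2) preprocessing: w_i = 1 / prod_{j != i} (x_i - x_j)."""
    n = len(xs)
    w = [1.0] * n
    for i in range(n):
        for j in range(n):
            if j != i:
                w[i] /= (xs[i] - xs[j])
    return w

def barycentric_eval(xs, ys, w, x):
    """O(n) evaluation of the interpolant using precomputed weights."""
    num = den = 0.0
    for xi, yi, wi in zip(xs, ys, w):
        if x == xi:              # x coincides with a node
            return yi
        t = wi / (x - xi)
        num += t * yi
        den += t
    return num / den

xs, ys = [1.0, 2.0, 4.0], [1.0, 0.5, 0.25]
w = barycentric_weights(xs)      # done once
p3 = barycentric_eval(xs, ys, w, 3.0)   # cheap per evaluation point
```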

4.1.4 Runge phenomenon

cf. Exercise 3.9 of Moler
For high degree interpolation at equally spaced nodes, the error is quite large near the nodes at the two ends—Runge's phenomenon. This is illustrated by Runge's function, 1/(1 + 25x^2), −1 ≤ x ≤ 1. However, (i) we do get convergence in the center of the interval and (ii) we would get global convergence if we choose the points to be Chebyshev nodes, which for the interval [−1, 1] are given by

x_k = − cos((k − 1/2)π/n),  k = 1, 2, . . . , n.

Review questions

1. Consider the problem of determining a polynomial p(x) of degree ≤ n such that p(x_i) = y_i, i = 1, 2, . . . , m, where the data x_i, y_i are given. Under what conditions on n and the data can we be certain that p(x) exists and is unique?

2. Give the Lagrange form for the polynomial p(x) of lowest degree such that p(x_i) = y_i, i = 0, 1, . . . , m, where the data x_i, y_i are given.

3. Why is interpolation by a high-degree polynomial often unsatisfactory?

4.2 Piecewise Linear Interpolation

Reference: Section 3.2 of Moler; Section 5.5 of Skeel & Keiper.
It is important to recognize that

a piecewise linear polynomial is a function but typically not a polynomial.

More specifically, a piecewise linear polynomial is a function defined on some interval for which there exists a partition of the interval into subintervals such that on each subinterval the function equals some linear polynomial. Generally, a piecewise polynomial coincides with a different linear polynomial on each subinterval. The boundaries between these subintervals are known as knots or breakpoints. If neighboring pieces share the same value at each knot, the piecewise polynomial will be continuous, in which case the function is a C^0 piecewise linear polynomial.

Review question

1. If a piecewise polynomial s(x) equals a linear polynomial L_i(x) for x_{i−1} < x < x_i and L_{i+1}(x) for x_i < x < x_{i+1}, what equation(s) must be satisfied for s(x) to be C^0 for x_{i−1} < x < x_{i+1}?


4.3 Piecewise Cubic Hermite Interpolation

Reference: Sections 3.3 and 3.4 of Moler; Section 5.5 of Skeel & Keiper.
As before, it is important to recognize that

a piecewise cubic polynomial is a function but typically not a polynomial.

Generally, a piecewise cubic polynomial coincides with a different cubic polynomial on each subinterval. If neighboring pieces share the same value at each knot, the piecewise cubic polynomial will be continuous.

The goal is to create a piecewise cubic polynomial s(x) which interpolates the data

(x_0, y_0), (x_1, y_1), . . . , (x_n, y_n) (4.1)

where x_0 < x_1 < · · · < x_n. As an intermediate step it is necessary to calculate numerical values d_j that will define the first derivative of s(x) at partition points.

Textbook note. The last statement of paragraph 1 of Section 3.3 is incorrect: d_k is generally undefined because the derivative of a piecewise linear polynomial generally does not exist at break points.

Review question

1. If a piecewise polynomial s(x) equals a polynomial q_i(x) for x_{i−1} < x < x_i and q_{i+1}(x) for x_i < x < x_{i+1}, what equations must be satisfied for s(x) to be C^k for x_{i−1} < x < x_{i+1}?

4.4 Cubic Spline Interpolation

Reference: Section 3.5 of Moler.

4.5 Curve Fitting

cf. Exercises 3.4 and 3.5 of Moler
Up to now, fitting has been to data describing a function.
A curve like the ellipse described by (x/a)^2 + (y/b)^2 = 1 is a more general type of object. In particular, one cannot solve for y as a function of x.
It is computationally more convenient to represent a curve parametrically, e.g., x = X(s) = a cos s, y = Y(s) = b sin s, 0 ≤ s ≤ 2π. However, the parameterization is not unique, so the curve that results from interpolating x = X(s), y = Y(s) depends on the parameterization.


Chapter 5

Nonlinear Equations

5.1 Bisection Method

Reference: Section 4.1 of Moler; Sections 3.1 and 3.2 of Skeel & Keiper.
The algorithm for the bisection method. How to calculate the number of iterations needed.
Textbook note.

• The statement at the end of Section 4.1 that "it will always take 52 steps" is untrue. In an extreme case, if the root is very close to zero, the result could be an infinite loop.
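As a concrete sketch (in Python rather than the Matlab used by the notes), here is the bisection algorithm together with the standard iteration count: halving a bracket of width b − a until it is below a tolerance tol takes about log2((b − a)/tol) steps.

```python
import math

def bisect(f, a, b, tol=1e-10):
    """Bisection: f(a) and f(b) must have opposite signs."""
    fa = f(a)
    assert fa * f(b) < 0, "root must be bracketed"
    # number of halvings needed so that (b - a) / 2^n <= tol
    n = math.ceil(math.log2((b - a) / tol))
    for _ in range(n):
        m = a + (b - a) / 2           # midpoint
        if fa * f(m) <= 0:
            b = m                     # sign change in [a, m]
        else:
            a, fa = m, f(m)           # sign change in [m, b]
    return a + (b - a) / 2

root = bisect(lambda x: x * x - 2.0, 1.0, 2.0)   # approximates sqrt(2)
```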

5.2 Newton’s Method

Reference: Sections 4.2 and 4.3 of Moler; Section 3.3 of Skeel & Keiper.
The best derivation of the Newton-Raphson method is based on 1st degree Taylor interpolation. Memorize its formula! It is locally convergent under mild assumptions. Its convergence rate is quadratic for a simple root and linear for a multiple root.

Definition of convergence rate, e.g., linear, superlinear, quadratic, cubic.
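The formula to memorize is x_{k+1} = x_k − f(x_k)/f′(x_k). A Python sketch (illustrative; the test function is made up):

```python
def newton(f, fprime, x, tol=1e-12, maxit=50):
    """Newton-Raphson iteration: x <- x - f(x)/f'(x)."""
    for _ in range(maxit):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) <= tol * abs(x):
            break
    return x

# simple root of f(x) = x^2 - 2, so convergence is quadratic
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```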

Review question

1. Consider a converging iteration and let d_k be the digits of accuracy in the kth iterate.

(a) What is the interpretation in terms of digits of accuracy of convergence of order p > 1?

(b) What is the interpretation in terms of digits of accuracy of linear convergence?

5.3 Secant Method

Reference: Section 4.4 of Moler; Section 3.5 of Skeel & Keiper.
The easiest way to remember the formula for the secant method is to replace the derivative in the Newton-Raphson method by a divided difference. It is locally convergent under mild assumptions. However, its convergence rate is merely superlinear for a simple root, (√5 + 1)/2 to be precise. Yet compared to Newton's method, each iteration requires less computation.


5.4 Inverse Interpolation

Reference: Sections 4.5 and 4.9 of Moler; Section 5.7 of Skeel & Keiper.

5.5 Univariate Optimization

Reference: Section 4.10 of Moler.
The problem

max_{0≤x≤2} |(8/3)x(x − 1)(x − 2) − sin πx|

is equivalent to that of minimizing

−|(8/3)x(x − 1)(x − 2) − sin πx|, 0 ≤ x ≤ 2.

Generic problem in standard form:

minimize f(x) subject to a ≤ x ≤ b.

We say that q is a local minimum of a function f(x) defined on an interval [a, b] if there exists ε > 0 such that

f(q) ≤ f(x) for all x such that |x − q| < ε and a ≤ x ≤ b.

This definition permits a local minimum to occur on the boundary of the domain on which the function is defined. In this chapter we consider algorithms for finding a local minimum, which may or may not be the global minimum.

Let f be a real valued function on the interval [a, b]. We need a way of "bracketing" a minimum in an interval, analogous to the way we used a sign change for bracketing solutions to nonlinear equations in one dimension. We can know that we have a local minimum in the interval [α, β] if we have three values a ≤ α < γ < β ≤ b such that

(i) f(α) ≥ f(γ) or α = a, and

(ii) f(β) ≥ f(γ) or β = b.

A method based on bisection, e.g.,

f(x)  8         4         0                   9
x     |---------+---------|-------------------|

f(x)  4         0         5         9
x     |---------|---------+---------|

f(x)  4    2    0         5
x     |----+----|---------|

f(x)  2    0    3    5
x     |----|----+----|

f(x)  2    0    3
x     |----|----|

23

Page 25: NUMERICAL METHODS - Purdue UniversityReference: Chapter 2 of Moler. Matlabis an interactive scripting language sold by The MathWorks that facilitates matrix operations, graphics, and

could end up doing two evaluations of f(x) for every halving of the search interval. Put differently, in the worst case the interval shrinks by an average factor √(1/2) = 0.707106 · · · for every evaluation of f(x). (The geometric average is being used here.)

There is a faster method known as golden section search, e.g.,

x |--------------|---------+--------------|

x |--------+-----|---------|

x |-----|---+-----|

x |---|-----|

which shrinks the search interval by a factor r = 0.618033 · · · for every evaluation of f(x), where r is the reciprocal of the golden ratio φ = (1/2)(1 + √5). The golden ratio satisfies 1 + φ = φ^2, so

r + r^2 = 1.

Suppose we begin with a search interval of length 1. This is divided into three intervals as follows:

|<-------r^2------->|<---r^3--->|<-------r^2------->|

Then either the right end piece is discarded and the left end piece is further divided as

|<---r^3--->|<-r^4->|<---r^3--->|

or the left end piece is discarded and the right end piece is further divided as

|<---r^3--->|<-r^4->|<---r^3--->|
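The subdivision above leads directly to the following Python sketch of golden section search (illustrative; the function being minimized is made up). The key property is that one of the two interior points is reused at the next step, so only one new evaluation of f is needed per shrink:

```python
import math

def golden_section(f, a, b, tol=1e-8):
    """Shrink [a, b] by the factor r = 1/phi per function evaluation."""
    r = (math.sqrt(5.0) - 1.0) / 2.0       # 0.618..., satisfies r + r^2 = 1
    x1 = b - r * (b - a)                   # interior points at distance r^2 * (b - a)
    x2 = a + r * (b - a)                   # from each end
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 <= f2:                       # minimum bracketed in [a, x2]
            b, x2, f2 = x2, x1, f1         # old x1 becomes the new x2
            x1 = b - r * (b - a)
            f1 = f(x1)
        else:                              # minimum bracketed in [x1, b]
            a, x1, f1 = x1, x2, f2         # old x2 becomes the new x1
            x2 = a + r * (b - a)
            f2 = f(x2)
    return (a + b) / 2

xmin = golden_section(lambda x: (x - 1.0) ** 2, 0.0, 3.0)   # minimum at x = 1
```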

Successive parabolic interpolation chooses the three most recent points to define its parabola.
Textbook notes.

• The last sentence of paragraph 6 of Sec. 4.10 should say ". . . eps, the spacing between successive IEEE double precision numbers, . . . ". The machine epsilon eps is twice the size of the roundoff error.

• The next sentence should simply say “After the first step, there is enough history . . . ”.


Chapter 6

Least-Squares Approximation

6.1 The Linear Least-Squares Problem

Reference: Sections 5.1 (first 2 paragraphs), 5.2, 5.3 of Moler; Sections 6.1, 6.2 of Skeel & Keiper.
We are considering an "overdetermined" system

Ax* ≈ b,

which means that the number of equations m exceeds the number of unknowns n. Applications include fitting a function of a known form to given values. In reality, such equations are actually an underdetermined system

Ax* + r* = b

where r* are m unknown measurement errors in the values b. Typically the uncertainty is resolved by seeking the most likely value of r* (for which the equations have a solution). If one assumes that the errors r* are independent normally distributed random variables of equal variance and zero mean, then the most likely¹ value r of r* is the one that minimizes r^T r subject to the condition that Ax + r = b holds for some x. Eliminate r and this reduces to finding the vector x that minimizes

(b − Ax)^T(b − Ax) = ‖b − Ax‖^2.

If the random variables r* have unequal variances, scale the equations so that the variances are equal.
Many, but not all, least-squares problems arise from fitting to a sum of basis functions, e.g., Exercise 5.12 of the textbook does not.

Review question

1. What is the difference between interpolation and (least-squares) approximation? Give one case where the approximation is more useful than interpolation.

6.2 Householder Reflections

Reference: Section 5.4 of Moler.
Textbook notes.

¹as defined by the maximum likelihood method of statistical inference


• Second last paragraph: it is not obvious that multiplication of a vector x by an orthogonal matrix H preserves length, but it is easy to prove.

• The notation ek denotes a column vector whose kth entry is 1 and all others are 0.

• Last paragraph: this construction is applicable to the QR factorization only for k = 1. The Householder reflection for k > 1 leaves the first k − 1 elements of x unchanged, which is obtained with

u = x^(k) + sign(x_k)‖x^(k)‖e_k

where x^(k) is x with the first k − 1 elements set to zero.
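The construction can be checked numerically. A Python sketch (illustrative; 0-based indexing is used, so k below counts the entries left unchanged, matching "first k − 1 elements" in 1-based notation):

```python
import math

def householder_apply(u, v):
    """Apply H = I - 2 u u^T / (u^T u) to the vector v."""
    uu = sum(ui * ui for ui in u)
    uv = sum(ui * vi for ui, vi in zip(u, v))
    return [vi - 2.0 * uv / uu * ui for vi, ui in zip(v, u)]

def householder_u(x, k):
    """u = x^(k) + sign(x_k) * ||x^(k)|| * e_k, where x^(k) is x
    with its first k entries (0-based) set to zero."""
    xk = [0.0] * k + x[k:]
    nrm = math.sqrt(sum(xi * xi for xi in xk))
    sign = 1.0 if x[k] >= 0 else -1.0
    u = xk[:]
    u[k] += sign * nrm
    return u

x = [3.0, 1.0, 2.0, 2.0]
u = householder_u(x, 1)
Hx = householder_apply(u, x)
# Hx leaves x[0] unchanged and annihilates the entries below position 1
```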

6.3 The QR Factorization

Reference: Section 5.5 of Moler; Section 6.2 of Skeel & Keiper.
The least squares problem is to find x which minimizes

‖b − Ax‖_2

where A is an m by n matrix with n ≤ m and the 2-norm of a vector v is defined as ‖v‖_2 = (v^T v)^{1/2}. A solution always exists; it is unique if A is of full rank.
Sometimes only Ax is wanted; this is always unique.
A geometric interpretation of the above.
If A is of full rank, the least squares solution can be shown to satisfy the normal equations

A^T Ax = A^T b,

whence

x = A†b where A† = (A^T A)^{-1} A^T.

The matrix A† is known as the Moore-Penrose pseudoinverse.
The most straightforward algorithm is to first form the normal equations

A^T Ax = A^T b

and then solve them. This algorithm is numerically unstable because the second subproblem of solving the normal equations is much more ill-conditioned than is the original least squares problem. More specifically, errors made in forming the normal equations will be amplified by a huge factor due to the process of solving the normal equations (apart from any errors introduced by the solution process). This factor is much larger than what can be attributed to the sensitivity of the original least squares problem.

It should be noted that the use of the normal equations is numerically unstable and that other algorithms, such as the QR factorization, should be used for more difficult problems. This is what the Matlab backslash operator does for an overdetermined system.

Textbook notes.

• Paragraph 2: The statement "We have projected y onto the space spanned by the columns of X." is at best misplaced: this projection of y is given by X(X^T X)^{-1} X^T y = Xβ.

• Paragraph 2: append to "if the basis functions are independent" the qualification "on the set of specified values of the independent variables".

• Paragraph 3: Omit the statement about the condition number. (The condition number of a nonsquare matrix has not been defined.)

• Paragraph 8: Omit the 2nd sentence. (It should say "The jth column of X is a linear combination of the first j columns of Q." However, this is not helpful.) The Householder reflection H_j is designed to eliminate elements below the jth diagonal element of the matrix H_{j−1} · · · H_1 X without introducing nonzeros below the diagonal in the previous columns.


Review questions

1. Using matrix and norm notation, state the linear least squares problem.

2. Give necessary and sufficient conditions for the existence of a solution to the linear least squares problem Ax ≈_2 b.

3. Give necessary and sufficient conditions for the uniqueness of a solution to the linear least squares problem.

4. Give a formula for the solution of the linear least squares problem Ax ≈_2 b assuming A is of full rank. What is the name of the equations from which this formula comes?

5. Why is solving the normal equations numerically unstable? More specifically, which errors are most likely to dominate?


Chapter 7

Quadrature

7.1 Basic Quadrature Rules

Reference: Section 6.2 of Moler; Sections 7.2–7.4 of Skeel & Keiper.
We consider here methods that approximate integrals by a finite sampling of values of the integrand.

7.1.1 Three simple rules

Below is a schematic representation of these rules (abscissas and weights on a single subinterval):

T (trapezoid): abscissas at the two endpoints, weights 1/2, 1/2
M (midpoint):  abscissa at the midpoint, weight 1
S (Simpson):   abscissas at the endpoints and the midpoint, weights 1/6, 2/3, 1/6

7.1.2 Composite quadrature rules

7.1.3 Accuracy and exactness

A quadrature rule has degree of exactness q if it is exact for polynomials of degree ≤ q. The trapezoid and midpoint rules have degree of exactness 1, and the Simpson rule has degree of exactness 3.

The order of accuracy is the first and foremost measure of a method's power to approximate. For quadrature, some care is needed in defining the order. If we look at a simple rule, then as h → 0 so does ∫_a^{a+h} f(x) dx. Hence, an approximation of zero would in the limit give the correct result! The correct approach is to consider a situation in which the true value that we are approximating is independent of the parameter h. One should instead consider the error in

∫_a^b f(x) dx ≈ composite rule with uniform spacing h = (b − a)/N.

To define order of accuracy in a precise manner, we use the concept of asymptotic equality. If we have two sequences {a_n}, {b_n}, we say b_n ∼ a_n if b_n/a_n → 1 as n → ∞.

Let us consider first the error in the composite trapezoid rule. If T_N(f) denotes the N-fold (composite) trapezoid approximation to ∫_a^b f, it can be shown that

T_N(f) − ∫_a^b f ∼ (1/12)(b − a)^2 ∫_a^b f″ · N^{-2}.


The quadratic dependence on 1/N (or equivalently h) means that the trapezoid rule is second-order accurate. For the composite midpoint rule,

M_N(f) − ∫_a^b f ∼ −(1/24)(b − a)^2 ∫_a^b f″ · N^{-2}.

The composite Simpson rule requires 2 function evaluations per subinterval, so consider instead the N/2-fold formula S_{N/2}(f) (assuming N is even). The error satisfies

S_{N/2}(f) − ∫_a^b f ∼ −(1/180)(b − a)^4 ∫_a^b f^{(iv)} · N^{-4},

so the Simpson rule is fourth-order accurate.
A (conventional) method of order p has degree of exactness p − 1.
The order of accuracy is most clearly exhibited by doing a log-log plot of accuracy vs. N, where accuracy is defined to be the reciprocal of the error magnitude. For the Simpson rule this results in a plot with the behavior

log(1/|error|) ∼ 4 log N.
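The order of accuracy can also be observed numerically: for a second-order rule, doubling N should reduce the error by a factor of about 2^2 = 4. A Python sketch for the composite trapezoid rule (the test integrand x^3 on [0, 1] is made up for illustration):

```python
def trapezoid(f, a, b, N):
    """N-fold composite trapezoid rule with uniform spacing h = (b - a)/N."""
    h = (b - a) / N
    s = 0.5 * (f(a) + f(b))
    for i in range(1, N):
        s += f(a + i * h)
    return h * s

f, exact = lambda x: x ** 3, 0.25        # integral of x^3 over [0, 1]
err4 = abs(trapezoid(f, 0.0, 1.0, 4) - exact)
err8 = abs(trapezoid(f, 0.0, 1.0, 8) - exact)
ratio = err4 / err8                      # close to 4 for a second-order rule
```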

Textbook notes.

• Second paragraph of Sec. 6.2 should say "h^{p+1}". The relative error is proportional to h^p.

• The last paragraph of Sec. 6.2 should say “Boole’s rule” not “Weddle’s rule”.

7.1.4 Numerical error estimates

An approximation is of no value without some estimate of its error.

7.1.5 Systematic derivation by undetermined coefficients

For convenience we often consider an integral ∫_{−1}^{1} f(x) dx on a standardized interval such as [−1, 1]. A conventional formula has the form

∫_{−1}^{1} f(x) dx ≈ 2 Σ_{i=1}^{n} w_i f(x_i), (7.1)

where the abscissas x_i and weights w_i are independent of f.
Extension of the rule (7.1) to an arbitrary interval is by means of a linear change of variables. If the target interval has center c and halfwidth h, we write

∫_{c−h}^{c+h} f(x) dx = ∫_{−h}^{h} f(c + x) dx = h ∫_{−1}^{1} f(c + hx) dx ≈ 2h Σ_{i=1}^{n} w_i f(c + x_i h). (7.2)

Note that after a linear change of variables a polynomial of degree ≤ q remains a polynomial of degree ≤ q. Hence, the exactness property holds for the extension (7.2) of the quadrature rule to the general interval [c − h, c + h].

Proposition 7.1 A quadrature rule (7.2) is exact for polynomials of degree ≤ q if and only if the formula is exact for each of the integrands 1, x, . . . , x^q.


Applying this to ∫_{−1}^{1} f ≈ 2(w_1 f(−3) + w_2 f(−1) + w_3 f(1)), the problem reduces to making

∫_{−1}^{1} f(x) dx = 2(w_1 f(−3) + w_2 f(−1) + w_3 f(1))

hold exactly for f(x) = 1, x, x^2. We would like it to work for higher powers of x also, but we have only three coefficients to choose. For f(x) = 1 we have

∫_{−1}^{1} f(x) dx = 2, f(−3) = 1, f(−1) = 1, and f(1) = 1,

and the formula above is exact only if

1 = w_1 · 1 + w_2 · 1 + w_3 · 1.

For f(x) = x we have

∫_{−1}^{1} f(x) dx = 0, f(−3) = −3, f(−1) = −1, and f(1) = 1

and the formula is exact only if

0 = w_1 · (−3) + w_2 · (−1) + w_3 · 1.

Similarly requiring the formula to be exact for f(x) = x^2 yields the equation

1/3 = w_1 · 9 + w_2 · 1 + w_3 · 1.
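The three exactness conditions form a 3 by 3 linear system for the weights. Solving it with exact rational arithmetic (a Python sketch; the resulting weights are implied by, but not stated in, the notes):

```python
from fractions import Fraction as F

# rows: exactness for f = 1, x, x^2; columns: w1, w2, w3
M = [[F(1),  F(1),  F(1)],
     [F(-3), F(-1), F(1)],
     [F(9),  F(1),  F(1)]]
rhs = [F(1), F(0), F(1, 3)]

# Gaussian elimination followed by a backsolve, all in exact arithmetic
for k in range(3):
    for i in range(k + 1, 3):
        m = M[i][k] / M[k][k]
        for j in range(k, 3):
            M[i][j] -= m * M[k][j]
        rhs[i] -= m * rhs[k]
w = [F(0)] * 3
for i in reversed(range(3)):
    s = rhs[i] - sum(M[i][j] * w[j] for j in range(i + 1, 3))
    w[i] = s / M[i][i]
# w == [-1/12, 2/3, 5/12]
```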

7.2 Adaptive Quadrature

Reference: Sections 6.1 and 6.3 of Moler; Section 7.5 of Skeel & Keiper.
Adaptive quadrature adapts breakpoints to the integrand. Adaptive quadrature routines are relatively efficient in handling discontinuities and singularities of the integrand. A singularity is a point where the integrand does not possess a power series expansion (having a positive radius of convergence).

7.3 Integrating Discrete Data

Reference: Section 6.6 of Moler.
The idea is that we approximate the integrand by the polynomial that interpolates it at the chosen nodes. Hence the weights are merely the exact integrals of the Lagrange fundamental polynomials.


Chapter 8

Eigenvalues and Singular Values

8.1 Eigenvalue Decompositions

Reference: Sections 10.1 (paragraphs 1–7), 10.2 (paragraph 1), 10.5 (paragraph 1) of Moler.

8.1.1 Eigenvalues

The characteristic polynomial p(λ) of a square matrix A is defined by

p(λ) = det(λI − A),

so that p(λ) is monic. The complex conjugate transpose of a vector x is given by

x^H = x̄^T

where the bar denotes the complex conjugate. Note that ‖x‖_2 = (x^H x)^{1/2}. A left eigenvector y satisfies

y^H A = λ y^H.

Review question

1. What is the characteristic polynomial of a triangular matrix?

8.1.2 Partitioned Matrices

This notion makes it possible to express many ideas without introducing the clutter of subscripts. An example of partitioning a matrix into blocks is

M = [0 0 −1 −1; 0 0 1 1; −1 1 5 0; −1 1 0 7] = [0 A^T; A R],

where each block is 2 by 2 and

A = [−1 1; −1 1],  R = [5 0; 0 7].

31

Page 33: NUMERICAL METHODS - Purdue UniversityReference: Chapter 2 of Moler. Matlabis an interactive scripting language sold by The MathWorks that facilitates matrix operations, graphics, and

In matrix operations, blocks can be treated as scalars except that multiplication is noncommutative. The product of a p by q block matrix (A_kl), with block row heights l_1, . . . , l_p and block column widths m_1, . . . , m_q, and a q by r block matrix (B_lm), with block row heights m_1, . . . , m_q and block column widths n_1, . . . , n_r, can be conveniently expressed because the partitioning is conformable. The (1,1)-block of the product is

A_11 B_11 + A_12 B_21 + · · · + A_1q B_q1.

Here are some examples:

Ax =: [c_1 c_2 · · · c_n] [x_1; x_2; . . . ; x_n] = x_1 c_1 + x_2 c_2 + · · · + x_n c_n,

Ax =: [r_1^T; r_2^T; . . . ; r_m^T] x = [r_1^T x; r_2^T x; . . . ; r_m^T x]

(the two preceding examples are computational alternatives),

AB =: A[b1 b2 · · · bp

]=[Ab1 Ab2 · · · Abp

],

AB =:

a11 a12 a13a21 a22 a23a31 a32 a33

rT1rT2rT3

=

a11rT1 + a12r

T2 + a13r

T3

a21rT1 + a22r

T2 + a23r

T3

a31rT1 + a32r

T2 + a33r

T3

.The last example states that the rows of AB are linear combinations of rows of B. Therefore, matrixpremultiplication ⇔ row operations.
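The two ways of forming Ax can be illustrated with a short NumPy check (a sketch of ours, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
x = rng.standard_normal(4)

# Column-oriented: Ax is a linear combination of the columns of A
col_form = sum(x[j] * A[:, j] for j in range(4))
# Row-oriented: the i-th entry of Ax is the inner product r_i^T x
row_form = np.array([A[i, :] @ x for i in range(3)])

print(np.allclose(col_form, A @ x), np.allclose(row_form, A @ x))
```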

8.1.3 Spectral decomposition

8.1.4 Symmetric matrices

8.1.5 The determinant and trace

Recall that det(AB) = det(A) det(B). From this, it can be shown that the determinant of a square matrix is the product of its eigenvalues.

The trace tr(A) of a square matrix A is defined to be the sum of its diagonal elements. It can be shown that tr(AB) = tr(BA), and from this, that the trace of a square matrix is the sum of its eigenvalues.
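Both facts are easy to confirm numerically. A NumPy sketch (the matrix is an arbitrary example of ours):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
lam = np.linalg.eigvals(A)   # eigenvalues of A (here 5 and 2)

print(np.prod(lam), np.linalg.det(A))   # product of eigenvalues = determinant
print(np.sum(lam), np.trace(A))         # sum of eigenvalues = trace
```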

8.2 Computation of Eigenvalues

Reference: Sections 10.4, 10.6 of Moler.

• An algorithm based on finding the roots of the characteristic polynomial is numerically unstable: the errors in representing its coefficients can be quite significant.

• In Matlab, [X, lambda] = eig(A) returns a matrix X whose columns are eigenvectors and a diagonal matrix lambda of the corresponding eigenvalues.

• If there is a unique eigenvalue of largest magnitude, the power method can be used: x_{k+1} = A x_k, where x_0 is arbitrary, e.g., x_0 = e. To avoid overflow/underflow, normalize after each iteration. The eigenvalue can be estimated as x_k^T A x_k / (x_k^T x_k).
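The power method just described can be sketched as follows in NumPy (the course's examples are in Matlab; the function name, iteration count, and test matrix are our choices):

```python
import numpy as np

def power_method(A, num_iter=200):
    """Power method: x_{k+1} = A x_k, normalized each step to avoid
    overflow/underflow; the eigenvalue is estimated by the Rayleigh
    quotient x^T A x / x^T x. Assumes a unique dominant eigenvalue."""
    x = np.ones(A.shape[0])          # x_0 = e
    for _ in range(num_iter):
        x = A @ x
        x /= np.linalg.norm(x)       # normalize after each iteration
    return x @ A @ x / (x @ x), x

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, x = power_method(A)
print(lam)   # dominant eigenvalue, (5 + sqrt(5))/2 for this A
```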

8.3 Applications of Eigenvalues

Reference: Section 10.12 of Moler.

• rotation matrix

• PageRank

• circle generator

8.4 Singular Value Decompositions

Reference: Sections 10.1 (paragraphs 1–4, 8–10), 10.2 (paragraph 1) of Moler.

• The requirement that singular values be ordered so that

σ1 ≥ σ2 ≥ · · · ≥ σn

seems to be missing from the textbook.

• The 2-norm of a matrix equals its largest singular value: ‖A‖₂ = σ1.
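Both points can be checked in a few lines of NumPy (a sketch of ours on a random matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))

sigma = np.linalg.svd(A, compute_uv=False)         # singular values
print(np.all(sigma[:-1] >= sigma[1:]))             # ordered sigma1 >= sigma2 >= ...
print(np.isclose(np.linalg.norm(A, 2), sigma[0]))  # ||A||_2 = sigma1
```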

8.5 Computation of Singular Values

Reference: Section 10.2 of Moler.

• In Matlab, [U, S, V] = svd(A) returns the factors of the singular value decomposition A = U S V^T.

8.6 Applications of Singular Values

Reference: Section 10.11 of Moler.

• least-squares

• best approximation in the 2-norm by a matrix of rank r

• principal component analysis
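For the second item, the Eckart–Young theorem says the best rank-r approximation in the 2-norm is obtained by keeping the r largest singular values, and the 2-norm error is then σ_{r+1}. A NumPy sketch (the function name is ours):

```python
import numpy as np

def best_rank_r(A, r):
    """Best rank-r approximation of A in the 2-norm (Eckart-Young):
    truncate the SVD to its r largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] * s[:r] @ Vt[:r, :]

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
A2 = best_rank_r(A, 2)

s = np.linalg.svd(A, compute_uv=False)
print(np.linalg.norm(A - A2, 2), s[2])   # the 2-norm error equals sigma_3
```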
