
Azmy S. Ackleh, University of Louisiana at Lafayette

R. Baker Kearfott, University of Louisiana at Lafayette

Applied Numerical Methods: Techniques, Software, and Applications from Biology, Engineering, Physics, Statistics, and Operations Research

CRC PRESS

Boca Raton London New York Washington, D.C.


Dedication

To my wife Howayda, my children Aseal and Wedad, and my parents Sima’an and Wedad. –A.S.A.

To my wife Ruth, my daughter Frances, and my mother Edith. –R.B.K.


Preface

The purpose of this book is to introduce people to underlying techniques, methods, and software in common use today throughout scientific computing, to enable additional study of these tools, or to use these tools in scientific research, mathematical modeling, or engineering analysis and design. The book has elements of our graduate-level textbook, but we have omitted some of the less commonly used topics and theory, and we have given many more illustrative examples of the basic concepts. We have omitted many proofs, most of which can be found in our graduate-level text or as exercises in that text, but we have supplied additional perspective relating the theory to applications and implementations on the computer. We have also integrated the use of matlab more closely into the text itself, and we have supplied examples of the use of the methods in mathematical biology, engineering, physics, statistics, and operations research, as well as many simple, illustrative examples.

We have used notes for this book to teach a one-semester course attended by a mixture of undergraduate students in computer science, physics, mathematics, and statistics, and in various subfields of engineering, as well as some beginning mathematics and engineering graduate students.


Contents

List of Figures

List of Tables

1 Mathematical Review and Computer Arithmetic
    1.1 Mathematical Review
        1.1.1 Intermediate Value Theorem, Mean Value Theorems, and Taylor’s Theorem
        1.1.2 Big “O” Notation
        1.1.3 Convergence Rates
    1.2 Computer Arithmetic
        1.2.1 Floating Point Arithmetic and Rounding Error
        1.2.2 Practicalities and the IEEE Floating Point Standard
    1.3 Interval Computations
        1.3.1 Interval Arithmetic
        1.3.2 Application of Interval Arithmetic: Examples
    1.4 Programming Environments
    1.5 Applications
    1.6 Exercises

2 Numerical Solution of Nonlinear Equations of One Variable
    2.1 Bisection Method
    2.2 The Fixed Point Method
    2.3 Newton’s Method (Newton-Raphson Method)
    2.4 The Univariate Interval Newton Method
    2.5 The Secant Method
    2.6 Software
    2.7 Applications
    2.8 Exercises

3 Linear Systems of Equations
    3.1 Matrices, Vectors, and Basic Properties
    3.2 Gaussian Elimination
        3.2.1 The Gaussian Elimination Algorithm
        3.2.2 The LU Decomposition
        3.2.3 Determinants and Inverses
        3.2.4 Pivoting in Gaussian Elimination
        3.2.5 Systems with a Special Structure
    3.3 Roundoff Error and Conditioning
        3.3.1 Norms
        3.3.2 Condition Numbers
        3.3.3 Roundoff Error in Gaussian Elimination
        3.3.4 Interval Bounds
    3.4 Orthogonal Decomposition (QR Decomposition)
        3.4.1 Properties of Orthogonal Matrices
        3.4.2 Least Squares and the QR Decomposition
    3.5 Iterative Methods for Solving Linear Systems
        3.5.1 The Jacobi Method
        3.5.2 The Gauss–Seidel Method
        3.5.3 Successive Overrelaxation
        3.5.4 Convergence of Iterative Methods
        3.5.5 The Interval Gauss–Seidel Method
    3.6 The Singular Value Decomposition
    3.7 Applications
    3.8 Exercises

4 Approximating Functions and Data
    4.1 Introduction
    4.2 Taylor Polynomial Approximations
    4.3 Polynomial Interpolation
        4.3.1 The Vandermonde System
        4.3.2 The Lagrange Form
        4.3.3 The Newton Form
        4.3.4 An Error Formula for the Interpolating Polynomial
        4.3.5 Optimal Points of Interpolation: Chebyshev Points
    4.4 Piecewise Polynomial Interpolation
        4.4.1 Piecewise Linear Interpolation
        4.4.2 Cubic Spline Interpolation
    4.5 Approximation Other Than by Interpolation
        4.5.1 Least Squares Approximation
        4.5.2 Minimax Approximation
        4.5.3 Sum of Absolute Values Approximation
        4.5.4 Weighted Fits
    4.6 Approximation Other Than by Polynomials
    4.7 Interval (Rigorous) Bounds on the Errors
    4.8 Applications
    4.9 Exercises

5 Eigenvalue-Eigenvector Computation
    5.1 Facts About Eigenvalues and Eigenvectors
    5.2 The Power Method
    5.3 Other Methods for Eigenvalues and Eigenvectors
        5.3.1 The Inverse Power Method
        5.3.2 The QR Method
        5.3.3 Jacobi Diagonalization (Jacobi Method)
    5.4 Applications
    5.5 Exercises

6 Numerical Differentiation and Integration
    6.1 Numerical Differentiation
        6.1.1 Derivation of Formulas
        6.1.2 Error Analysis
    6.2 Automatic (Computational) Differentiation
        6.2.1 The Forward Mode
        6.2.2 The Reverse Mode
        6.2.3 Implementation of Automatic Differentiation
    6.3 Numerical Integration
        6.3.1 Introduction
        6.3.2 Newton-Cotes Formulas
        6.3.3 Gaussian Quadrature
        6.3.4 More General Integrals
        6.3.5 Error Terms
        6.3.6 Changes of Variables
        6.3.7 Composite Quadrature
        6.3.8 Adaptive Quadrature
        6.3.9 Multiple Integrals, Singular Integrals, and Infinite Intervals
        6.3.10 Interval Bounds
    6.4 Applications
    6.5 Exercises

7 Initial Value Problems for Ordinary Differential Equations
    7.1 Introduction
    7.2 Euler’s Method
    7.3 Higher-Order and Systems of Differential Equations
    7.4 Higher-Order Taylor Series Methods
    7.5 Runge–Kutta Methods
    7.6 Stability
    7.7 Adaptive Step Controls
    7.8 Multistep, Implicit, and Predictor-Corrector Methods
    7.9 Stiff Systems
        7.9.1 Stiff Systems and Linear Systems
        7.9.2 Stability of Stiff Systems
        7.9.3 Methods for Stiff Systems
    7.10 Application to Parameter Estimation in Differential Equations
    7.11 Application for Solving an SIRS Epidemic Model
    7.12 Exercises

8 Numerical Solution of Systems of Nonlinear Equations
    8.1 Introduction
    8.2 Newton’s Method
    8.3 Multidimensional Fixed Point Iteration
    8.4 Multivariate Interval Newton Methods
    8.5 Quasi-Newton (Multivariate Secant) Methods
    8.6 Nonlinear Least Squares
    8.7 Methods for Finding All Solutions
        8.7.1 Homotopy Methods
        8.7.2 Branch and Bound Methods
    8.8 Software
    8.9 Applications
    8.10 Exercises

References

Index


List of Figures

1.1 Illustration of the Intermediate Value Theorem.
1.2 An example floating point system: β = 10, t = 1, and m = 1.
2.1 Example for the Intermediate Value Theorem applied to roots of a function.
2.2 Graph of eˣ + x for Example 2.1.
2.3 Example of when the method of bisection cannot be applied.
2.4 Example of monotonic convergence of fixed point iteration.
2.5 Illustration of two iterations of Newton’s method.
2.6 Examples of divergence of Newton’s method. On the left, the sequence diverges; on the right, the sequence oscillates.
2.7 Geometric interpretation of the secant method.
4.1 An example of a piecewise linear function.
4.2 Graphs of the “hat” functions ϕᵢ(x).
4.3 B-spline basis functions.
5.1 Illustration of Gerschgorin discs for Example 5.2.
6.1 Illustration of the total error (roundoff plus truncation) bound in forward difference quotient approximation to f′.
7.1 Actual solutions to the stiff ODE system of Example 7.12.


List of Tables

1.1 Parameters for IEEE arithmetic
1.2 Machine constants for IEEE arithmetic
2.1 Convergence of the interval Newton method with f(x) = x² − 2
3.1 Condition numbers of some Hilbert matrices
3.2 Iterates of the Jacobi and Gauss–Seidel methods, for Example 3.32
4.1 Error factors K and M in polynomial approximations f(x) = p(x) + K(x)M(f; x)
6.1 Weights and sample points: Gauss–Legendre quadrature
6.2 Error terms seen so far
6.3 Some quadrature formula error terms


Chapter 1

Mathematical Review and Computer Arithmetic

1.1 Mathematical Review

The tools of scientific, engineering, and operations research computing are firmly based in the calculus. In particular, formulating and solving mathematical models in these areas involves approximation of quantities, such as integrals, derivatives, solutions to differential equations, and solutions to systems of equations, first seen in a calculus course. Indeed, techniques from such a course are the basis of much of scientific computation. We review these techniques here, with particular emphasis on how we will use them.

In addition to basic calculus techniques, scientific computing involves approximation of the real number system by numbers with a fixed number of digits in their representation. Except for certain research-oriented systems, computer number systems today for this purpose are floating point systems, and almost all such floating point systems in use today adhere to the IEEE 754-2008 floating point standard. We describe floating point numbers and the floating point standard in this chapter, paying particular attention to consequences and pitfalls of their use.

Third, programming and software tools are used in scientific computing. Considering how commonly it is used, its ease of programming and debugging, its documentation, and the packages accessible from it, we have elected to use matlab throughout this book. We introduce the basics of matlab in this chapter.

1.1.1 Intermediate Value Theorem, Mean Value Theorems, and Taylor’s Theorem

Throughout, Cⁿ[a, b] will denote the set of real-valued functions f defined on the interval [a, b] such that f and its derivatives, up to and including its n-th derivative f⁽ⁿ⁾, are continuous on [a, b].

THEOREM 1.1

(Intermediate value theorem) If f ∈ C[a, b] and k is any number between m = min_{a≤x≤b} f(x) and M = max_{a≤x≤b} f(x), then there exists a number c in [a, b] for which f(c) = k (Figure 1.1).

FIGURE 1.1: Illustration of the Intermediate Value Theorem. (The figure shows y = f(x) over [a, b], with m, k, and M marked on the y-axis and the point c ∈ [a, b] where f(c) = k.)

Example 1.1

Consider f(x) = eˣ − x − 2. Using a computational device (such as a calculator) on which we trust the approximation of eˣ to be accurate, we compute f(0) = −1 and f(2) ≈ 3.3891. We know f is continuous, since it is a sum of continuous functions. Since 0 is between f(0) and f(2), the Intermediate Value Theorem tells us there is a point c ∈ [0, 2] such that f(c) = 0. At such a c, eᶜ = c + 2.

THEOREM 1.2

(Mean value theorem for integrals) Let f be continuous and w be Riemann integrable¹ on [a, b], and suppose that w(x) ≥ 0 for x ∈ [a, b]. Then there exists a point c in [a, b] such that

∫ₐᵇ w(x) f(x) dx = f(c) ∫ₐᵇ w(x) dx.

Example 1.2

Suppose we want bounds on

∫₀¹ x² e^(−x²) dx.

¹This means that the limit of the Riemann sums exists. For example, w may be continuous, or w may have a finite number of breaks.


With w(x) = x² and f(x) = e^(−x²), the Mean Value Theorem for integrals tells us that

∫₀¹ x² e^(−x²) dx = e^(−c²) ∫₀¹ x² dx

for some c ∈ [0, 1], so

1/(3e) ≤ ∫₀¹ x² e^(−x²) dx = e^(−c²) ∫₀¹ x² dx ≤ 1/3.

The following is extremely important in scientific and engineering computing.

THEOREM 1.3

(Taylor’s theorem) Suppose that f ∈ Cⁿ⁺¹[a, b]. Let x₀ ∈ [a, b]. Then for any x ∈ [a, b],

f(x) = Pₙ(x) + Rₙ(x), where

Pₙ(x) = f(x₀) + f′(x₀)(x − x₀) + ··· + f⁽ⁿ⁾(x₀)(x − x₀)ⁿ/n!
      = Σ_{k=0}ⁿ (1/k!) f⁽ᵏ⁾(x₀)(x − x₀)ᵏ, and

Rₙ(x) = (1/n!) ∫_{x₀}^x f⁽ⁿ⁺¹⁾(t)(x − t)ⁿ dt   (integral form of the remainder).

Furthermore, there is a ξ = ξ(x) between x₀ and x with

Rₙ(x) = f⁽ⁿ⁺¹⁾(ξ(x))(x − x₀)ⁿ⁺¹/(n + 1)!   (Lagrange form of the remainder).

PROOF Recall the integration by parts formula ∫ u dv = uv − ∫ v du. Thus,

f(x) − f(x₀) = ∫_{x₀}^x f′(t) dt   (let u = f′(t), v = t − x, dv = dt)

= f′(x₀)(x − x₀) + ∫_{x₀}^x (x − t) f″(t) dt   (let u = f″(t), dv = (x − t) dt)

= f′(x₀)(x − x₀) − [((x − t)²/2) f″(t)]_{x₀}^x + ∫_{x₀}^x ((x − t)²/2) f‴(t) dt

= f′(x₀)(x − x₀) + ((x − x₀)²/2) f″(x₀) + ∫_{x₀}^x ((x − t)²/2) f‴(t) dt.

Continuing this procedure,

f(x) = f(x₀) + f′(x₀)(x − x₀) + ((x − x₀)²/2) f″(x₀) + ··· + ((x − x₀)ⁿ/n!) f⁽ⁿ⁾(x₀) + ∫_{x₀}^x ((x − t)ⁿ/n!) f⁽ⁿ⁺¹⁾(t) dt
= Pₙ(x) + Rₙ(x).

Now consider Rₙ(x) = ∫_{x₀}^x ((x − t)ⁿ/n!) f⁽ⁿ⁺¹⁾(t) dt, and assume that x₀ < x (the same argument works if x₀ > x). Then, by Theorem 1.2,

Rₙ(x) = f⁽ⁿ⁺¹⁾(ξ(x)) ∫_{x₀}^x ((x − t)ⁿ/n!) dt = f⁽ⁿ⁺¹⁾(ξ(x)) (x − x₀)ⁿ⁺¹/(n + 1)!,

where ξ is between x₀ and x, and thus ξ = ξ(x).

Example 1.3

Approximate sin(x) by a polynomial p(x) such that |sin(x) − p(x)| ≤ 10⁻¹⁶ for −0.1 ≤ x ≤ 0.1.

For Example 1.3, Taylor polynomials about x₀ = 0 are appropriate, since that is the center of the interval about which we wish to approximate. We observe that the terms of even degree in such a polynomial are absent, so, for n even, Taylor’s theorem gives

n   Pₙ                        Rₙ
2   x                         −(x³/3!) cos(c₂)
4   x − x³/3!                 (x⁵/5!) cos(c₄)
6   x − x³/3! + x⁵/5!         −(x⁷/7!) cos(c₆)
⋮   ⋮                         ⋮
n   —                         (−1)^(n/2) (xⁿ⁺¹/(n + 1)!) cos(cₙ)

Observing that |cos(cₙ)| ≤ 1, we see that

|Rₙ(x)| ≤ |x|ⁿ⁺¹/(n + 1)!.

We may thus form the following table.

n    bound on error Rₙ
2    1.67 × 10⁻⁴
4    8.33 × 10⁻⁸
6    1.98 × 10⁻¹¹
8    2.76 × 10⁻¹⁵
10   2.51 × 10⁻¹⁹

Thus, a polynomial with the required accuracy for x ∈ [−0.1, 0.1] is

p(x) = x − x³/3! + x⁵/5! − x⁷/7! + x⁹/9!.
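As a quick numerical check, the following matlab sketch (ours, not from the text) compares p(x) with the built-in sin over [−0.1, 0.1]:

x = linspace(-0.1, 0.1, 1001);
p = x - x.^3/factorial(3) + x.^5/factorial(5) ...
      - x.^7/factorial(7) + x.^9/factorial(9);
max(abs(sin(x) - p))    % maximum observed discrepancy

The value printed reflects IEEE double precision roundoff (on the order of 10⁻¹⁷) rather than the truncation bound 2.51 × 10⁻¹⁹, which lies below the precision carried.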

An important special case of Taylor’s theorem is obtained with n = 0 (that is, directly from the Fundamental Theorem of Calculus).

THEOREM 1.4

(Mean value theorem) Suppose f ∈ C¹[a, b], x ∈ [a, b], and y ∈ [a, b] (and, without loss of generality, x ≤ y). Then there is a c ∈ [x, y] ⊆ [a, b] such that

f(y) − f(x) = f′(c)(y − x).

Example 1.4

Suppose f(1) = 1 and |f′(x)| ≤ 2 for x ∈ [1, 2]. What are an upper bound and a lower bound on f(2)?


The mean value theorem tells us that

f(2) = f(1) + f′(c)(2 − 1) = f(1) + f′(c)

for some c ∈ (1, 2). Furthermore, the fact |f′(x)| ≤ 2 is equivalent to −2 ≤ f′(x) ≤ 2. Combining these facts gives

1 − 2 = −1 ≤ f(2) ≤ 1 + 2 = 3.

1.1.2 Big “O” Notation

We study “rates of growth” and “rates of decrease” of errors. For example, if we approximate eʰ by a first-degree Taylor polynomial about x = 0, we get

eʰ − (1 + h) = (1/2) h² e^ξ,

where ξ is some unknown quantity between 0 and h. Although we don’t know exactly what e^ξ is, we know that it is nearly constant (in this case, approximately 1) for h near 0, so the error eʰ − (1 + h) is roughly proportional to h² for h small. This approximate proportionality is often more important to know than the slowly varying factor e^ξ. The big “O” and little “o” notations are used to describe and keep track of this approximate proportionality.

DEFINITION 1.1 Let E(h) be an expression that depends on a small quantity h. We say that E(h) = O(hᵏ) if there are an ǫ and a C such that

|E(h)| ≤ C|h|ᵏ

for all |h| ≤ ǫ.

The “O” denotes “order.” For example, if f(h) = O(h²), we say that “f exhibits order 2 convergence to 0 as h tends to 0.”

Example 1.5

Let E(h) = eʰ − h − 1. Then E(h) = O(h²).

PROOF By Taylor’s theorem,

eʰ = e⁰ + e⁰(h − 0) + (h²/2) e^ξ

for some ξ between 0 and h. Thus,

E(h) = eʰ − 1 − h = (h²/2) e^ξ ≤ h² (e/2), and E(h) ≥ 0,

for |h| ≤ 1; that is, ǫ = 1 and C = e/2 work.

Example 1.6

Show that |(f(x + h) − f(x))/h − f′(x)| = O(h) for x, x + h ∈ [a, b], assuming that f has two continuous derivatives at each point of [a, b].

PROOF By Taylor’s theorem with integral remainder,

|(f(x + h) − f(x))/h − f′(x)|
  = |(f(x) + f′(x)h + ∫ₓ^(x+h) (x + h − t) f″(t) dt − f(x))/h − f′(x)|
  = |(1/h) ∫ₓ^(x+h) (x + h − t) f″(t) dt|
  ≤ max_{a≤t≤b} |f″(t)| (h/2) = ch.

1.1.3 Convergence Rates

DEFINITION 1.2 Let {xₖ} be a sequence with limit x*. If there are constants C and α and an integer N such that |xₖ₊₁ − x*| ≤ C|xₖ − x*|^α for k ≥ N, we say that the rate of convergence is of order at least α. If α = 1 (with C < 1), the rate is said to be linear. If α = 2, the rate is said to be quadratic.

Example 1.7

A sequence sometimes learned in elementary classes for computing the square root of a number a is

xₖ₊₁ = xₖ/2 + a/(2xₖ).


We have

xₖ₊₁ − √a = xₖ/2 + a/(2xₖ) − √a
  = xₖ − (xₖ² − a)/(2xₖ) − √a
  = (xₖ − √a) − (xₖ − √a)(xₖ + √a)/(2xₖ)
  = (xₖ − √a)(1 − (xₖ + √a)/(2xₖ))
  = (xₖ − √a)(xₖ − √a)/(2xₖ)
  = (1/(2xₖ))(xₖ − √a)²
  ≈ (1/(2√a))(xₖ − √a)²

for xₖ near √a, thus showing that the convergence rate is quadratic.

Quadratic convergence is very fast. We can think of quadratic convergence, with C ≈ 1, as doubling the number of significant figures on each iteration. (In contrast, linear convergence with C = 0.1 adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with a = 2, starting with x₀ = 2, we obtain the following table.

k   xₖ                  xₖ − √2           (xₖ − √2)/(xₖ₋₁ − √2)²
0   2                   0.5858 × 10⁰      —
1   1.5                 0.8579 × 10⁻¹     0.2500
2   1.416666666666667   0.2453 × 10⁻²     0.3333
3   1.414215686274510   0.2123 × 10⁻⁶     0.3529
4   1.414213562374690   0.1594 × 10⁻¹³    0.3535
5   1.414213562373095   0.2204 × 10⁻¹⁷    —

This table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor C for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of C should be about 1/(2√2) ≈ 0.353553390593274. (We explain how we computed the table at the end of this chapter.)
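A minimal matlab sketch of the iteration (variable names are ours; the text’s own table-generation code is described at the end of the chapter):

a = 2; x = 2; err_old = x - sqrt(a);
for k = 1:5
    x = x/2 + a/(2*x);               % x_{k+1} = x_k/2 + a/(2 x_k)
    err = x - sqrt(a);
    fprintf('%d  %.15f  %10.4e  %.4f\n', k, x, err, err/err_old^2);
    err_old = err;
end

The last column approaches 1/(2√2) ≈ 0.3536, as derived above.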


Example 1.8

As an example of linear convergence, consider the iteration

xₖ₊₁ = xₖ − xₖ²/3.5 + 2/3.5,

which converges to √2. We obtain the following table.

k    xₖ                  xₖ − √2           (xₖ − √2)/(xₖ₋₁ − √2)
0    2                   0.5858 × 10⁰      —
1    1.428571428571429   0.1436 × 10⁻¹     0.2451 × 10⁻¹
2    1.416909620991254   0.2696 × 10⁻²     0.1878
3    1.414728799831946   0.5152 × 10⁻³     0.1911
4    1.414312349239392   0.9879 × 10⁻⁴     0.1917
5    1.414232514607664   0.1895 × 10⁻⁴     0.1918
6    1.414217198786659   0.3636 × 10⁻⁵     0.1919
7    1.414214260116949   0.6955 × 10⁻⁶     0.1919
8    1.414213696254626   0.1339 × 10⁻⁶     0.1919
⋮    ⋮                   ⋮                 ⋮
19   1.414213562373097   0.1554 × 10⁻¹⁴    —

Here, the constant C in the linear convergence, to four significant digits, appears to be 0.1919 ≈ 1/5. That is, the error is reduced by approximately a factor of 5 on each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.

1.2 Computer Arithmetic

In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.


Example 1.9

By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6), if f ∈ C²[a, b], then

f′(x) = (f(x + h) − f(x))/h − (1/h) ∫ₓ^(x+h) f″(t)(x + h − t) dt,

and

|(1/h) ∫ₓ^(x+h) f″(t)(x + h − t) dt| ≤ ch.

Thus, f′(x) ≈ (f(x + h) − f(x))/h, and the error is O(h). We will call this the method error or truncation error, as opposed to roundoff error due to using machine approximations.

Now consider f(x) = ln x, and approximate f′(3) ≈ (ln(3 + h) − ln 3)/h for h small, using a calculator having 11 digits. The following results were obtained.

h        (ln(3 + h) − ln 3)/h    Error = 1/3 − (ln(3 + h) − ln 3)/h = O(h)
10⁻¹     0.3278982               5.44 × 10⁻³
10⁻²     0.332779                5.54 × 10⁻⁴
10⁻³     0.3332778               5.55 × 10⁻⁵
10⁻⁴     0.333328                5.33 × 10⁻⁶
10⁻⁵     0.333330                3.33 × 10⁻⁶
10⁻⁶     0.333300                3.33 × 10⁻⁵
10⁻⁷     0.333                   3.33 × 10⁻⁴
10⁻⁸     0.33                    3.33 × 10⁻³
10⁻⁹     0.3                     3.33 × 10⁻²
10⁻¹⁰    0.0                     3.33 × 10⁻¹

One sees that, in the first four steps, the error decreases by a factor of 10 as h is decreased by a factor of 10 (that is, the method error dominates). However, starting with h = 10⁻⁵, the error increases as h decreases (the error due to a finite number of digits, i.e., roundoff error, dominates).

There are two possible ways to reduce rounding error:

1. The method error can be reduced by using a more accurate method. This allows a larger h to be used, thus avoiding roundoff error. Consider

f′(x) = (f(x + h) − f(x − h))/(2h) + {error}, where {error} is O(h²).

h       (ln(3 + h) − ln(3 − h))/(2h)    error
0.1     0.3334568                       1.24 × 10⁻⁴
0.01    0.3333345                       1.23 × 10⁻⁶
0.001   0.3333333                       1.91 × 10⁻⁸


The error decreases by a factor of 100 as h is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.
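Both difference quotients above are easy to experiment with in IEEE double precision (a sketch of ours; with roughly 16 digits instead of the 11 above, the crossover where roundoff overtakes method error moves to smaller h):

for h = 10.^(-(1:10))
    fwd = (log(3+h) - log(3))/h;         % O(h) method error
    ctr = (log(3+h) - log(3-h))/(2*h);   % O(h^2) method error
    fprintf('h = %.0e   fwd error = %.2e   ctr error = %.2e\n', ...
        h, abs(1/3 - fwd), abs(1/3 - ctr));
end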

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.

1.2.1 Floating Point Arithmetic and Rounding Error

Let β = {a positive integer} be the base of the computer system. (Usually, β = 2 (binary) or β = 16 (hexadecimal).) Suppose a number x has the exact base representation

x = (±0.α₁α₂α₃···αₜαₜ₊₁···) βᵐ = ±q βᵐ,

where q is the mantissa, β is the base, m is the exponent, 1 ≤ α₁ ≤ β − 1, and 0 ≤ αᵢ ≤ β − 1 for i > 1.

On a computer, we are restricted to a finite set of floating point numbers F = F(β, t, L, U) of the form x* = (±0.a₁a₂···aₜ) βᵐ, where 1 ≤ a₁ ≤ β − 1, 0 ≤ aᵢ ≤ β − 1 for 2 ≤ i ≤ t, L ≤ m ≤ U, and t is the number of digits. (In most floating point systems, L is about −64 to −1000 and U is about 64 to 1000.)

Example 1.10

(binary) β = 2:

x* = (0.1011)₂ × 2³ = (1 × 1/2 + 0 × 1/4 + 1 × 1/8 + 1 × 1/16) × 8 = 11/2 = 5.5 (decimal).

REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider x = 10.1 = (1010.0001 1001 1001 1001 ...)₂, with the pattern “1001” repeating. If L = −127, U = 127, t = 24, and β = 2, then x ≈ x* = (0.1010 0001 1001 1001 1001 1001)₂ × 2⁴.

Question: Given a real number x, how do we define a floating point number fl(x) in F such that fl(x) is close to x? On modern machines, one of the following four ways is used to approximate a real number x by a machine-representable number fl(x).

round down: fl(x) = x↓, the nearest machine-representable number to the real number x that is less than or equal to x.

round up: fl(x) = x↑, the nearest machine-representable number to the real number x that is greater than or equal to x.

round to nearest: fl(x) is the nearest machine-representable number to the real number x.

round to zero, or “chopping”: fl(x) is the nearest machine-representable number to the real number x that is closer to 0 than x. The term “chopping” arises because we simply “chop” the expansion of the real number, that is, we ignore the digits in the expansion of x beyond the t-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11

β = 10, t = 5, x = 0.12345666··· × 10⁷. Then

fl(x) = 0.12345 × 10⁷ (chopping),
fl(x) = 0.12346 × 10⁷ (round to nearest).

(In this case, round down corresponds to chopping and round up corresponds to round to nearest.)

See Figure 1.2 for an example with β = 10 and t = 1. In that figure, the exhibited floating point numbers are (0.1) × 10¹, (0.2) × 10¹, ..., (0.9) × 10¹, and 0.1 × 10².

FIGURE 1.2: An example floating point system: β = 10, t = 1, and m = 1. (The figure marks the successive floating point numbers, spaced βᵐ⁻ᵗ = 10⁰ = 1 apart, from βᵐ⁻¹ = 1 up to βᵐ = 10¹.)

Example 1.12

Let a = 0.410, b = 0.000135, and c = 0.000431. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does a + (b + c) = (a + b) + c when using this arithmetic?


Following the “round to nearest” definition of fl, we emulate the operations a machine would do, as follows:

a ← 0.410 × 10⁰, b ← 0.135 × 10⁻³, c ← 0.431 × 10⁻³,

and

fl(b + c) = fl(0.135 × 10⁻³ + 0.431 × 10⁻³) = fl(0.566 × 10⁻³) = 0.566 × 10⁻³,

so

fl(a + 0.566 × 10⁻³) = fl(0.410 × 10⁰ + 0.566 × 10⁻³)
  = fl(0.410 × 10⁰ + 0.000566 × 10⁰)
  = fl(0.410566 × 10⁰) = 0.411 × 10⁰.

On the other hand,

fl(a + b) = fl(0.410 × 10⁰ + 0.135 × 10⁻³)
  = fl(0.410000 × 10⁰ + 0.000135 × 10⁰)
  = fl(0.410135 × 10⁰) = 0.410 × 10⁰,

so

fl(0.410 × 10⁰ + c) = fl(0.410 × 10⁰ + 0.431 × 10⁻³)
  = fl(0.410 × 10⁰ + 0.000431 × 10⁰)
  = fl(0.410431 × 10⁰) = 0.410 × 10⁰ ≠ 0.411 × 10⁰.

Thus, the associative law does not hold for floating point arithmetic with “round to nearest.” Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.
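The same failure of associativity is easy to see in IEEE double precision, e.g., with this one-line matlab check of ours:

(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   % returns logical 0 (false)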

The following error bound is useful in some analyses.

THEOREM 1.5

|x − fl(x)| ≤ (p/2) |x| β¹⁻ᵗ,

where p = 1 for rounding and p = 2 for chopping.


DEFINITION 1.3 δ = (p/2) β¹⁻ᵗ is called the unit roundoff error.

Let ǫ = (fl(x) − x)/x. Then fl(x) = (1 + ǫ)x, where |ǫ| ≤ δ. With this, we have the following.

THEOREM 1.6

Let ⊙ denote the operation +, −, ×, or ÷, and let x and y be machine numbers. Then

fl(x ⊙ y) = (x ⊙ y)(1 + ǫ), where |ǫ| ≤ δ = (p/2) β¹⁻ᵗ.

Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict tolerances when the exact result is too large.

Example 1.13

Suppose β = 10 and t = 3 (3-digit decimal arithmetic), and suppose we wish to compute 10⁴π with a computed value x such that |10⁴π − x| < 10⁻². The closest floating point number in our system to 10⁴π is x = 0.314 × 10⁵ = 31400. However, |10⁴π − x| = 15.926.... Hence, it is impossible to find a number x in the system with |10⁴π − x| < 10⁻².

The error |10⁴π − x| in this example is called the absolute error in approximating 10⁴π. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error:

DEFINITION 1.4 Let x* be an approximation to x. Then |x − x*| is called the absolute error, and |(x − x*)/x| is called the relative error.

For example,

|(x − fl(x))/x| ≤ δ = (p/2) β¹⁻ᵗ   (the unit roundoff error).

1.2.1.1 Expression Evaluation and Condition Numbers

We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.


Example 1.14

β = 10, t = 4, p = 1. (Thus, δ = (1/2) × 10⁻³ = 0.0005.) Let x = 0.5795 × 10⁵ and y = 0.6399 × 10⁵. Then

fl(x + y) = 0.1219 × 10⁶ = (x + y)(1 + ǫ₁), ǫ₁ ≈ −3.28 × 10⁻⁴, |ǫ₁| < δ, and
fl(xy) = 0.3708 × 10¹⁰ = (xy)(1 + ǫ₂), ǫ₂ ≈ −5.95 × 10⁻⁵, |ǫ₂| < δ.

(Note: x + y = 0.12194 × 10⁶, xy = 0.37082205 × 10¹⁰.)

Example 1.15

Suppose β = 10 and t = 4 (4-digit arithmetic), and suppose x₁ = 10000 and x₂ = x₃ = ··· = x₁₀₀₁ = 1. Then

fl(x₁ + x₂) = 10000,
fl(x₁ + x₂ + x₃) = 10000,
  ⋮
fl(Σ_{i=1}^{1001} xᵢ) = 10000

when we sum forward from x₁. But going backwards,

fl(x₁₀₀₁ + x₁₀₀₀) = 2,
fl(x₁₀₀₁ + x₁₀₀₀ + x₉₉₉) = 3,
  ⋮
fl(Σ_{i=1001}^{1} xᵢ) = 11000,

which is the correct sum.

This example illustrates the point that large relative errors can occur when a large number of small numbers is added to a large number, or when a very large number of small, almost equal numbers is added together. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
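The grouping scheme just described is often called pairwise summation. A minimal recursive matlab sketch of it (the function name is ours; save as pairwise_sum.m):

function s = pairwise_sum(x)
% Sum the vector x by adding adjacent halves recursively, so that
% nearly equal partial sums are always the quantities being added.
if numel(x) == 1
    s = x(1);
else
    m = floor(numel(x)/2);
    s = pairwise_sum(x(1:m)) + pairwise_sum(x(m+1:end));
end

In IEEE double precision the improvement only becomes visible for much longer sums, but in the 4-digit system above, pairwise grouping of x₂, ..., x₁₀₀₁ gives the exactly representable partial sum 1000, and then 10000 + 1000 = 11000, the correct total.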


Example 1.16

x₁ = 15.314768, x₂ = 15.314899, β = 10, t = 6 (6-digit decimal accuracy). Then x₂ − x₁ ≈ fl(x₂) − fl(x₁) = 15.3149 − 15.3148 = 0.0001. Thus,

|(x₂ − x₁) − (fl(x₂) − fl(x₁))| / |x₂ − x₁| = (0.000131 − 0.0001)/0.000131 ≈ 0.237,

a relative error of 23.7%.

This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 1.17

Consider finding the roots of ax² + bx + c = 0, where b² is large compared with |4ac|. The most common formula for the roots is

x₁,₂ = (−b ± √(b² − 4ac))/(2a).

Consider x² + 100x + 1 = 0 with β = 10, t = 4, p = 2 (4-digit chopped arithmetic). Then

x₁ = (−100 + √9996)/2,   x₂ = (−100 − √9996)/2,

but √9996 ≈ 99.97 (4-digit chopped arithmetic). Thus,

x₁ ≈ (−100 + 99.97)/2 = −0.015,   x₂ ≈ (−100 − 99.97)/2 = −99.98,

whereas the true roots are x₁ = −0.010001... and x₂ = −99.989999..., so the relative errors in x₁ and x₂ are about 50% and 0.01%, respectively.

Let’s change the algorithm. Assume b ≥ 0 (we can always arrange b ≥ 0). Then

x₁ = (−b + √(b² − 4ac))/(2a) · ((−b − √(b² − 4ac))/(−b − √(b² − 4ac)))
   = 4ac/(2a(−b − √(b² − 4ac)))
   = −2c/(b + √(b² − 4ac)),

and

x₂ = (−b − √(b² − 4ac))/(2a)   (the same as before).


Then, for the above values,

x₁ = −2(1)/(100 + √9996) ≈ −2/(100 + 99.97) = −0.0100.

Now, the relative error in x₁ is also about 0.01%.
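A matlab sketch of the rewritten computation (ours; it assumes b ≥ 0 and real roots, i.e., b² ≥ 4ac):

a = 1; b = 100; c = 1;
d  = sqrt(b^2 - 4*a*c);
x1 = -2*c/(b + d);        % rationalized form: no cancellation
x2 = (-b - d)/(2*a);      % unchanged: this subtraction is benign

In IEEE double precision the effect is smaller than in 4-digit arithmetic, but the same cancellation in the naive formula appears for b large enough, e.g., b = 10⁸ with a = c = 1.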

Let us now consider error in function evaluation. Consider a single-valued function f(x), and let x* = fl(x) be the floating point approximation of x. Therefore the machine evaluates f(x*) = f(fl(x)), which is an approximate value of f(x) at x = x*. The perturbation in f(x) for small perturbations in x can then be computed via Taylor’s formula. This is illustrated in the next theorem.

THEOREM 1.7

The relative error in function evaluation is

|(f(x) − f(x*))/f(x)| ≈ |x f′(x)/f(x)| · |(x − x*)/x|.

PROOF The linear Taylor approximation of f(x*) about f(x) for small values of |x − x*| is given by f(x*) ≈ f(x) + f′(x)(x* − x). Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function f(x) is

κ_f(x) := |x f′(x)/f(x)|.

The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of x. In other words, κ_f(x) is a measure of the degree of sensitivity of the function at x.

Example 1.18

Let f(x) = √x. The condition number of f(x) about x is

κ_f(x) = |x · (1/(2√x))/√x| = 1/2.

This suggests that f(x) is well conditioned.


Example 1.19

Let f(x) = √(x − 2). The condition number of f(x) about x is

κ_f(x) = |x/(2(x − 2))|.

This is not defined at x* = 2. Hence the function f(x) is numerically unstable and ill conditioned for values of x close to 2.
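A small matlab experiment of ours makes this concrete: near x = 2, a relative perturbation of 10⁻¹⁰ in x is amplified by roughly κ_f(x):

f  = @(x) sqrt(x - 2);
x  = 2 + 1e-8;                      % near the ill-conditioned point
dx = x*1e-10;                       % relative perturbation of 1e-10
rel_change = abs(f(x + dx) - f(x))/abs(f(x));
kappa = abs(x/(2*(x - 2)));         % about 1e8 here
[rel_change, kappa*1e-10]           % both about 1e-2

For f(x) = √x at the same x, κ_f = 1/2, and the same perturbation changes f by only about 5 × 10⁻¹¹ in relative terms.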

REMARK 1.2 If x = f(x) = 0, then the condition number is simply |f′(x)|. If x = 0 and f(x) ≠ 0 (or f(x) = 0 and x ≠ 0), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes |f′(x)/f(x)|.

REMARK 1.3 Generally, if a numerical approximation z* to a quantity z is computed, the relative error is related to the number of significant digits that are correct. For example, if z = 0.0000123453 and z* = 0.00001234543, we say that z* is correct to 5 significant digits. Expressing z as 0.123453 × 10⁻⁴ and z* as 0.123454 × 10⁻⁴, we see that if we round z* to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. Significant digits is the more logical way to talk about accuracy in a floating point computation where we are interested in relative error, rather than “number of digits after the decimal point,” which can have a different meaning. (Here, one might say that z* is correct to 9 digits after the decimal point.)

1.2.2 Practicalities and the IEEE Floating Point Standard

Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation fl(x) for real numbers x. For example, IBM mainframes generally used hexadecimal arithmetic (β = 16), with 8 hexadecimal digits total (for the base, sign, and exponent) in “single precision” numbers and 16 hexadecimal digits total in “double precision” numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base β = 2 and 36 binary digits (or “bits”) total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955–1965, the “Setun,” even used base-3 (β = 3, or “ternary”) numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some “supercomputers” (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers.

Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, β = 10) arithmetic, say, with L = −999 and U = 999, and t = 14 digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to “positive,” “negative,” and “neutral” in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, “on” or “off”), so the base β is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits.²

Older floating point implementations did not always fit exactly into the model we have previously described. For example, if x was a number in the system, then −x may not have been a number in the system, or 1/x may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as “IEEE 754,” or simply “the IEEE standard.”³ Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc. chips, implement most of the IEEE standard.

In this standard, β = 2, 32 bits total are used in a single precision number (an “IEEE single”), and 64 bits total are used for a double precision number (an “IEEE double”). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and t = 23 bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the exponent is between 0 and (11111111)₂ = 255, and 127 is subtracted from this, to get an exponent between −127 and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and “unnormalized” numbers), so the exponent in single precision represents magnitudes between 2⁻¹²⁶ ≈ 10⁻³⁸ and 2¹²⁷ ≈ 10³⁸. The mantissa for single precision numbers represents numbers between 2⁰ = 1 and Σ_{i=0}^{23} 2⁻ⁱ = 2(1 − 2⁻²⁴) ≈ 2. Similarly, the exponent for double precision numbers is, effectively, between 2⁻¹⁰²² ≈ 10⁻³⁰⁸ and 2¹⁰²³ ≈ 10³⁰⁸, while the mantissa for double precision numbers represents numbers between 2⁰ = 1 and Σ_{i=0}^{52} 2⁻ⁱ ≈ 2.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.

²An exception is in some systems for business calculations, where base 10 is implemented.
³An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.


TABLE 1.1: Parameters for IEEE arithmetic

precision   β   L      U      t
single      2   −126   127    24
double      2   −1022  1023   53

In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining fl(x), that is, four “rounding modes”: “round down,” “round up,” “round to nearest,” and “round to zero,” as specified above. The elementary operations must be such that fl(x ⊙ y) is implemented for all four rounding modes, for ⊙ ∈ {+, −, ×, /, √·}.

The default mode (if the rounding mode is not explicitly set) is normally “round to nearest,” to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to “round down” and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to “round up,” then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than L, U, and t are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

ǫ_m: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode “round to nearest.”

These so-called “machine constants” appear in Table 1.2 for the IEEE single and IEEE double precision number systems.

For IEEE arithmetic, 1/TINY < HUGE, but 1/HUGE < TINY. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while, in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs.


TABLE 1.2: Machine constants for IEEE arithmetic

Precision   HUGE                     TINY                      ǫ_m
single      2¹²⁷ ≈ 3.40 · 10³⁸       2⁻¹²⁶ ≈ 1.18 · 10⁻³⁸      2⁻²⁴ + 2⁻⁴⁵ ≈ 5.96 · 10⁻⁸
double      2¹⁰²³ ≈ 1.79 · 10³⁰⁸     2⁻¹⁰²² ≈ 2.23 · 10⁻³⁰⁸    2⁻⁵³ + 2⁻¹⁰⁵ ≈ 1.11 · 10⁻¹⁶

Many systems prior to the IEEE standard replaced an underflow by 0 but stopped when an overflow occurred.

The IEEE standard specifies representations for special numbers ∞, −∞, +0, −0, and NaN, where the latter represents “not a number.” The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as √−1, 1/0, −1/0, etc. are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to ∞, whereas the result of √−1 is set to NaN, and computation continues. The standard also specifies “gradual underflow,” that is, setting the result to a “denormalized” number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN × any number = NaN and ∞ × any positive normalized number = ∞, allow such “nonstop” arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain ∞ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if ∞ or NaN occurs, but implement IEEE nonstop arithmetic with an option.

Example 1.20

IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)

epsm = 2.2204e-016

>> TINY = realmin

TINY = 2.2251e-308

>> HUGE = realmax

HUGE = 1.7977e+308

>> 1/TINY

ans = 4.4942e+307

>> 1/HUGE


ans = 5.5627e-309

>> HUGE^2

ans = Inf

>> TINY^2

ans = 0

>> new_val = 1+epsm

new_val = 1.0000

>> new_val - 1

ans = 2.2204e-016

>> too_small = epsm/2

too_small = 1.1102e-016

>> not_new = 1+too_small

not_new = 1

>> not_new - 1

ans = 0

>>

Example 1.21

(Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with β = 10, t = 2, and one digit in the exponent, so that the positive numbers in the system range from 0.10 × 10⁻⁹ to 0.99 × 10⁹, and suppose we wish to compute N = √(x₁² + x₂²), where x₁ = x₂ = 10⁶. Then both x₁ and x₂ are exactly represented in the system, and the nearest floating point number in the system to N is 0.14 × 10⁷, well within range. However, x₁² = 10¹², larger than the maximum floating point number in the system. In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol “Infinity.” The result of adding “Infinity” to “Infinity” then taking the square root would be “Infinity,” so N would be assigned “Infinity.” Similarly, if x₁ = x₂ = 10⁻⁶, then x₁² = 10⁻¹², smaller than the smallest representable machine number, causing an “underflow.” On older systems, the result is usually set to 0. On IEEE systems, if “gradual underflow” is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in N, whereas the closest floating point number in the system is 0.14 × 10⁻⁵, well within range. To avoid this type of catastrophic underflow and overflow in the computation of N, we may use the following scheme:

1. s ← max{|x₁|, |x₂|}.
2. η₁ ← x₁/s; η₂ ← x₂/s.
3. N ← s √(η₁² + η₂²).
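In matlab (IEEE double), the same scheme reads as follows (a sketch of ours; matlab’s built-in hypot performs a similarly protected computation):

x1 = 1e200; x2 = 1e200;        % x1^2 alone would overflow to Inf
s  = max(abs(x1), abs(x2));    % step 1
n1 = x1/s; n2 = x2/s;          % step 2: scaled so |n1|, |n2| <= 1
N  = s*sqrt(n1^2 + n2^2)       % step 3: about 1.4142e200, no overflow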


1.2.2.1 Input and Output

For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to ±d₁.d₂d₃...d_m e±δ₁δ₂δ₃ or ±d₁.d₂d₃...d_m E±δ₁δ₂δ₃, where the “e” or “E” denotes the “exponent” of 10. For example, -1.00e+003 denotes −1 × 10³ = −1000. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters of the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion is necessarily inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion (0.0 0011 0011 0011 ...)₂, which needs to be rounded into the binary floating point system. The IEEE 754 standard specifies that the result of a decimal to binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary to decimal conversion be the nearest number in the specified format (which depends on the number m of decimal digits requested to be printed).

Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or “firmware” independently of the software system, the decimal to binary and binary to decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number.

This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in “debugging,” or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an “octal,” or base-8 format, in which each digit (between 0 and 7) is interpreted as a group of three binary digits, or in hexadecimal format (where the digits are 0–9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.
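In matlab, for instance, one can inspect the underlying bits of a double directly with num2hex (a small illustration of ours):

num2hex(0.1)   % ans = 3fb999999999999a
% Each hexadecimal digit is a group of four bits; the final 'a' shows
% the repeating pattern ...1001 1001... rounded to the 53-bit mantissa.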

1.2.2.2 Standard Functions

To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a “long” 80-bit register (with “guard digits”) be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.

1.3 Interval Computations

Interval computations are useful for two main purposes:

• to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

• to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.


1.3.1 Interval Arithmetic

In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval representing the range of all possible results, as the first argument ranges over all points in the first interval and the second argument ranges over all points in the second interval. To state this symbolically, let x = [x̲, x̄] and y = [y̲, ȳ], and define the four elementary operations by

x ⊙ y = {x ⊙ y | x ∈ x and y ∈ y} for ⊙ ∈ {+, −, ×, ÷}.   (1.1)

Interval arithmetic’s usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:

x + y = [x̲ + y̲, x̄ + ȳ],
x − y = [x̲ − ȳ, x̄ − y̲],
x × y = [min{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}, max{x̲y̲, x̲ȳ, x̄y̲, x̄ȳ}],
1/x = [1/x̄, 1/x̲] if x̲ > 0 or x̄ < 0,
x ÷ y = x × (1/y).   (1.2)

The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if

f(x) = (x + 1)(x − 1),   (1.3)

then

f([−2, 2]) = ([−2, 2] + 1)([−2, 2] − 1) = [−1, 3] × [−3, 1] = [−9, 3],

which contains the exact range [−1, 3].
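A bare-bones matlab sketch of (1.2), ignoring the outward rounding that a rigorous implementation would add (the helper names iadd and imul are ours):

iadd = @(x, y) [x(1) + y(1), x(2) + y(2)];
imul = @(x, y) [min([x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)]), ...
                max([x(1)*y(1), x(1)*y(2), x(2)*y(1), x(2)*y(2)])];
x = [-2, 2];
imul(iadd(x, [1, 1]), iadd(x, [-1, -1]))   % returns [-9, 3], as above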

REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,

[1, 2]/[−3, 4] = [−∞, −1/3] ∪ [1/4, ∞] = R* \ (−1/3, 1/4),

where R* is the extended real number system,⁴ consisting of the real numbers with the two additional numbers −∞ and ∞. This extended interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

⁴Also known as the two-point compactification of the real numbers.
⁵There are small differences in current definitions of extended interval arithmetic. For example, in some systems, −∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R \ (−1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.
⁶Kahan was also a major contributor to the IEEE 754 standard.

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables the use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take x + y = [x̲ + y̲, x̄ + ȳ]. If x̲ + y̲ is computed with downward rounding, and x̄ + ȳ is computed with upward rounding, then the resulting interval z = [z̲, z̄] that is represented in the machine must contain the exact range of x + y for x ∈ x and y ∈ y. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.

Interval arithmetic is only subdistributive. That is, if x, y, and z are intervals, then

x(y + z) ⊆ xy + xz, but x(y + z) ≠ xy + xz in general. (1.4)

As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, suppose that, instead of writing (x + 1)(x − 1) for f(x) as in (1.3), we write

f(x) = x² − 1, (1.5)

and suppose we provide a routine that computes an enclosure for the range of x² that is the exact range to within roundoff error. Such a routine could be as follows:

ALGORITHM 1.1

(Computing an interval whose end points are machine numbers and which encloses the range of x².)

⁵ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, −∞ and ∞ are not considered numbers, but just descriptive symbols. In those systems, [1, 2]/[−3, 4] = (−∞, −1/3] ∪ [1/4, ∞) = R\(−1/3, 1/4). See [31] for a theoretical analysis of extended arithmetic.
⁶ who also was a major contributor to the IEEE 754 standard


INPUT: x = [x̲, x̄].
OUTPUT: a machine-representable interval that contains the range of x² over x.

IF x̲ ≥ 0 THEN

RETURN [x̲², x̄²], where x̲² is computed with downward rounding and x̄² is computed with upward rounding.

ELSE IF x̄ ≤ 0 THEN

RETURN [x̄², x̲²], where x̄² is computed with downward rounding and x̲² is computed with upward rounding.

ELSE

1. Compute x̲² and x̄² with both downward and upward rounding; that is, compute machine-representable numbers (x̲²)_l and (x̲²)_u such that x̲² ∈ [(x̲²)_l, (x̲²)_u], and machine-representable numbers (x̄²)_l and (x̄²)_u such that x̄² ∈ [(x̄²)_l, (x̄²)_u].

2. RETURN [0, max{(x̲²)_u, (x̄²)_u}].

END IF

END ALGORITHM 1.1.
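For readers who wish to experiment, the following is a minimal matlab sketch of Algorithm 1.1 (not the authors' production code). It assumes a routine setround that switches the processor's rounding mode, such as the one supplied with the intlab toolbox mentioned later: setround(-1) selects downward rounding, setround(1) upward rounding, and setround(0) round-to-nearest.

function z = intsqr(x)
% INTSQR -- enclosure [z(1), z(2)] of the range of t^2 for t in [x(1), x(2)],
% following Algorithm 1.1.  Assumes a rounding-mode routine setround,
% such as the one supplied with INTLAB.
xlo = x(1); xhi = x(2);
if xlo >= 0                       % interval entirely nonnegative
    setround(-1); zlo = xlo*xlo;
    setround( 1); zhi = xhi*xhi;
elseif xhi <= 0                   % interval entirely nonpositive
    setround(-1); zlo = xhi*xhi;
    setround( 1); zhi = xlo*xlo;
else                              % interval contains zero
    zlo = 0;
    setround( 1); zhi = max(xlo*xlo, xhi*xhi);
end
setround(0);                      % restore round-to-nearest
z = [zlo, zhi];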

With Algorithm 1.1 and rewriting f(x) from (1.3) as in (1.5), we obtain

f([−2, 2]) = [−2, 2]² − 1 = [0, 4] − 1 = [−1, 3],

which, in this case, is equal to the exact range of f over [−2, 2].

In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as

THEOREM 1.8

(Fundamental theorem of interval arithmetic.) Suppose f(x1, x2, . . . , xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as xⁿ, sin(x), exp(x), and log(x), etc., as long as the interval values of these functions contain their ranges over the input intervals. Then

1. The interval value f(x1, . . . , xn) contains the range of f over the interval vector (or box) (x1, . . . , xn).


2. If the single functions (the elementary operations and functions xⁿ, etc.) have interval values that represent their exact ranges, and if each variable xi, 1 ≤ i ≤ n, occurs only once in the expression for f, then the values of f obtained by interval arithmetic represent the exact ranges of f over the input intervals.

If the expression for f contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function f([−2, 2]) according to (1.3), the first factor, [−1, 3], is the exact range of x + 1 for x ∈ [−2, 2], while the second factor, [−3, 1], is the exact range of x − 1 for x ∈ [−2, 2]. Thus, [−9, 3] is the exact range of f(x1, x2) = (x1 + 1)(x2 − 1) for x1 and x2 independent, x1 ∈ [−2, 2], x2 ∈ [−2, 2].

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for f(x1, . . . , xn) which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation f(x1, . . . , xn) gives [a, b] as a result interval, but the exact range {f(x1, . . . , xn) | xi ∈ xi, 1 ≤ i ≤ n} is [c, d] ⊆ [a, b]. We define the excess width E(f; x1, . . . , xn) in the interval evaluation f(x1, . . . , xn) by E(f; x1, . . . , xn) = (c − a) + (b − d).

For example, the excess width in evaluating f(x) represented as (x + 1)(x − 1) over x = [−2, 2] is (−1 − (−9)) + (3 − 3) = 8. In general, we have

THEOREM 1.9

Suppose f(x1, x2, . . . , xn) is an algebraic expression in the variables x1 through xn (or a computer program with inputs x1 through xn), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure f(x1, . . . , xn) of the range of f for xi ∈ xi, 1 ≤ i ≤ n. Then, if E(f; x1, . . . , xn) is as in Definition 1.7, we have

E(f; x1, . . . , xn) = O( max_{1≤i≤n} w(xi) ),

where w(x) denotes the width of the interval x.


That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller.

Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See Exercise ?? below.)
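The effect described in Theorem 1.9 is easy to observe numerically. The following small sketch, assuming the intlab toolbox recommended at the end of this section, compares the dependent form (x + 1)(x − 1) with the single use expression x² − 1 on shrinking intervals centered at 0:

for w = [1, 0.1, 0.01]
    x = midrad(0, w/2);              % interval of width w centered at 0
    dep = (x + 1)*(x - 1);           % x occurs twice: interval dependency
    sue = sqr(x) - 1;                % single use expression: exact range
    excess = diam(dep) - diam(sue)   % observed excess width
end

The printed excess widths shrink roughly in proportion to w, as Theorem 1.9 predicts.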

1.3.2 Application of Interval Arithmetic: Examples

We give one such example here.

Example 1.22

Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of e, and prove that these two digits are correct.

Solution: The fifth degree Taylor polynomial representation for e is

e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + (1/6!) e^ξ,

for some ξ ∈ [0, 1]. If we assume we know e < 3 and we assume we know eˣ is an increasing function of x, then the error term is bounded by

|(1/6!) e^ξ| ≤ 3/6! < 0.005,

so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace e^ξ with [1, 3]. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000]→ [2.000, 2.000]

[1.000, 1.000]/[2.000, 2.000]→ [0.5000, 0.5000]

[2.000, 2.000] + [0.5000, 0.5000]→ [2.500, 2.500]

[1.000, 1.000]/[6.000, 6.000]→ [0.1666, 0.1667]

[2.500, 2.500] + [0.1666, 0.1667]→ [2.666, 2.667]

[1.000, 1.000]/[24.00, 24.00]→ [0.04166, 0.04167]

[2.666, 2.667] + [0.04166, 0.04167]→ [2.707, 2.709]

[1.000, 1.000]/[120.0, 120.0]→ [0.008333, 0.008334]

[2.707, 2.709] + [0.008333, 0.008334]→ [2.715, 2.718]

[1.000, 1.000]/[720.0, 720.0]→ [0.001388, 0.001389]

[.001388, .001389]× [1, 3]→ [0.001388, 0.004167]

[2.715, 2.718] + [0.001388, 0.004167]→ [2.716, 2.723]

Since we used outward rounding in these computations, this constitutes a mathematical proof that e ∈ [2.716, 2.723].

Note:


1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic; see the sketch following these notes. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of eˣ more accurately and with fewer operations.
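As an illustration of note 1, the following minimal intlab sketch (assuming intlab is installed and started) carries out the computation of Example 1.22 in double precision rather than 4-digit arithmetic:

s = intval(1) + 1;                               % 1 + 1
for k = 2:5
    s = s + 1/intval(factorial(k));              % terms 1/2! through 1/5!
end
s = s + (1/intval(factorial(6)))*infsup(1, 3);   % remainder term (1/6!) e^xi
s                                                % displays a rigorous enclosure of e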

Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term “interval computations” will lead to some of these.

A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanation of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.

1.4 Programming Environments

Modern scientific computing (with floating point numbers) is usually done with high-level “imperative” (as opposed to “functional”) programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950's. There is a large body of high-quality publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at

http://www.netlib.org/

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The “C” language, developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980's, was the first widely available language⁷ to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. Nonetheless, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as “insight,” available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab® system has become increasingly popular over the last two decades or so. matlab (or MATrix LABoratory) began in the early 1980's as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors, but has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this one because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous toolboxes, supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive: matlab translates each line of a program to machine language each time it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Possible undesirable consequences are that it is not free, and that there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including “Octave” and “Scilab,” have produced free products that partially support the matlab programming language. The most widely distributed of these, “Octave,” is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.

⁷ with others, including Fortran, to follow


Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica® and Maple®, while a free such system under development is “SAGE.” These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the “Getting started” demos, which in newer versions are presented as videos. There are also many books devoted to use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23

The matlab script we used to produce the table following Example 1.7 (on page 8) is:

a = 2;
x = 2;              % initial guess for sqrt(a)
xold = x;
err_old = 1;
for k=0:10
    k
    x
    err = x - sqrt(2);
    err
    ratio = err/err_old^2    % roughly constant under quadratic convergence
    err_old = err;
    x = x/2 + 1/x;           % Newton iteration for x^2 - 2 = 0
end

Example 1.24

The matlab script we used to produce the table in Example 1.8 (on page 9) is

format long
a = 2;
x = 2;              % initial guess for sqrt(a)
xold = x;
err_old = 1;
for k=0:25
    k
    x
    err = x - sqrt(2);
    err
    ratio = err/err_old      % roughly constant under linear convergence
    err_old = err;
    x = x - x^2/3.5 + 2/3.5; % a linearly convergent fixed point iteration
end

An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with “m” files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.

1.5 Applications

The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding, predicting, and managing or controlling the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; minimizing costs or maximizing delivery of products or services within companies and governments; etc. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.

Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity, and variations in the acceleration of gravity due to the fact that the earth is not exactly round and the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.

Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the model to enable practical solution. Presentation of the result is also important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to predict climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored) that is outside the scope of our study of numerical methods. Such error can only be measured indirectly through the interpretation and validation steps. In the model solution process (the “analysis” step), errors are also introduced due to roundoff error and the approximation process; we have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameters.

Throughout this book, we introduce applications from many areas.

Example 1.25

The formula for the net capacitance z when two capacitors of values x and y are connected in series is

z = xy/(x + y).

Suppose the measured values of x and y are x = 1 and y = 2, respectively. Estimate the range of possible values of z, given that the true values of x and y are known to be within ±10% of the measured values.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that x ∈ [0.9, 1.1] and y ∈ [1.8, 2.2]. We will plug these intervals into the expression for z, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for z so x and y only occur once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for z by xy, we obtain

z = 1/(1/x + 1/y).

We use the intlab toolbox⁸ for matlab to evaluate z. We have the following dialog in matlab's command window.

>> intvalinit('DisplayInfsup')

===> Default display of intervals by infimum/supremum

>> x = intval('[0.9,1.1]')

intval x =

[ 0.8999, 1.1001]

>> y = intval('[1.8,2.2]')

intval y =

[ 1.7999, 2.2001]

>> z = 1/(1/x + 1/y)

intval z =

[ 0.5999, 0.7334]

>> format long

>> z

intval z =

[ 0.59999999999999, 0.73333333333334]

>>

⁸ If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/

Thus, the capacitance must lie between 0.5999 and 0.7334.

Note that x and y are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.

1.6 Exercises

1. Write down a polynomial p(x) such that |S(x) − p(x)| ≤ 10⁻¹⁰ for −0.2 ≤ x ≤ 0.2, where

S(x) = sin(x)/x if x ≠ 0, and S(0) = 1.

Note: sinc(x) = S(πx) = sin(πx)/(πx) is the “sinc” function (well known in signal processing, etc.).

(a) Show that your polynomial p satisfies the condition |sinc(x) − p(x)| ≤ 10⁻¹⁰ for x ∈ [−0.2, 0.2].

Hint: You can obtain polynomial approximations with error terms for sinc(x) by writing down Taylor polynomials and corresponding error terms for sin(x), then dividing these by x. This can be easier than trying to differentiate sinc(x). For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and sinc(x) on the same graph,

(i) over the interval [−0.2, 0.2],

(ii) over the interval [−3, 3],

(iii) over the interval [−10, 10].

2. Suppose f has a continuous third derivative. Show that

| (f(x + h) − f(x − h))/(2h) − f′(x) | = O(h²).

3. Suppose f has a continuous fourth derivative. Show that

| (f(x + h) − 2f(x) + f(x − h))/h² − f″(x) | = O(h²).


4. Let a = 0.41, b = 0.36, and c = 0.7. Assuming a 2-digit decimal computer arithmetic with rounding, show that (a − b)/c ≠ a/c − b/c when using this arithmetic.

5. Write down a formula relating the unit roundoff δ of Definition 1.3 (page 14) and the machine epsilon ǫm defined on page 20.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1;
x = 1 + eps;
while (x ~= 1)
    eps = eps/2;
    x = 1 + eps;
end
eps = eps + (2*eps)^2
y = 1 + eps;
y - 1

7. Suppose, for illustration, we have a system with base β = 10, t = 3 decimal digits in the mantissa, and L = −9, U = 9 for the exponent. For example, 0.123 × 10⁴, that is, 1230, is a machine number in this system. Suppose also that “round to nearest” is used in this system.

(a) What is HUGE for this system?

(b) What is TINY for this system?

(c) What is the machine epsilon ǫm for this system?

(d) Let f(x) = sin(x) + 1.

i. Write down fl(f(0)) and fl(f(0.0008)) in normalized format for this toy system.

ii. Compute fl(fl(f(0.0008)) − fl(f(0))). On the other hand, what is the nearest machine number to the exact value of f(0.0008) − f(0)?

iii. Compute fl(fl(f(0.0008)) − fl(f(0)))/fl(0.0008). Compare this to the nearest machine number to the exact value of (f(0.0008) − f(0))/0.0008 and to f′(0).

8. Let f(x) = (ln(x + 1) − ln(x))/2.

(a) Use four-digit decimal arithmetic with rounding to evaluate f(100,000).


(b) Use the Mean Value Theorem to approximate f(x) in a form that avoids the loss of significant digits. Use this form to evaluate f(x) for x = 100,000 once again.

(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of f(x) = e^(√(x²−1)), x > 1, and discuss any possible ill-conditioning.

10. Let f(x) = (sin(x))² + x/2. Use interval arithmetic to prove that there are no solutions to f(x) = 0 for x ∈ [−1, −0.8].


Chapter 2

Numerical Solution of Nonlinear Equations of One Variable

In this chapter, we study methods for finding approximate solutions to the equation f(x) = 0, where f is a real-valued function of a real variable. Some classical examples include the equation x − tan x = 0, which occurs in the diffraction of light, and Kepler's equation x − b sin x = 0, used for calculating planetary orbits. Other examples include transcendental equations such as f(x) = eˣ + x = 0 and algebraic equations such as x⁷ + 4x⁵ − 7x² + 6x + 3 = 0.

2.1 Bisection Method

The bisection method is simple, reliable, and can almost always be applied, but is generally not as fast as other methods. Note that, if y = f(x), then f(x) = 0 corresponds to a point where the curve y = f(x) crosses the x-axis. The bisection method is based on the following direct consequence of the Intermediate Value Theorem.

THEOREM 2.1

Suppose that f ∈ C[a, b] and f(a)f(b) < 0. Then there is a z ∈ [a, b] such that f(z) = 0. (See Figure 2.1.)

The method of bisection is simple to implement, as illustrated in the following algorithm.

FIGURE 2.1: Example for the Intermediate Value Theorem applied to roots of a function.

ALGORITHM 2.1

(The bisection algorithm)
INPUT: An error tolerance ǫ.
OUTPUT: Either a point x that is within ǫ of a solution z, or “failure to find a sign change.”

1. Find a and b such that f(a)f(b) < 0. (By Theorem 2.1, there is a z ∈ [a, b] such that f(z) = 0.) (Return with “failure to find a sign change” if such an interval cannot be found.)

2. Let a0 = a, b0 = b, k = 0.

3. Let xk = (ak + bk)/2.

4. IF f(xk)f(ak) > 0 THEN

(a) ak+1 ← xk,

(b) bk+1 ← bk.

ELSE

(a) bk+1 ← xk,

(b) ak+1 ← ak.

END IF

5. IF (bk − ak)/2 < ǫ THEN

Stop, since xk is within ǫ of z. (See the explanation below.)

ELSE

(a) k ← k + 1.

(b) Return to step 3.

END IF

END ALGORITHM 2.1.

Basically, in the method of bisection, the interval [ak, bk] contains z and bk − ak = (bk−1 − ak−1)/2. The interval containing z is reduced by a factor of 2 at each iteration.

Note: In practice, when programming bisection, we usually do not store the numbers ak and bk for all k as the iteration progresses. Instead, we usually store just two numbers a and b, replacing these by new values, as indicated in Step 4 of our bisection algorithm (Algorithm 2.1).

FIGURE 2.2: Graph of eˣ + x for Example 2.1.

Example 2.1

f(x) = eˣ + x, f(0) = 1, f(−1) = −0.632. Thus, −1 < z < 0. (There is a unique zero, because f′(x) = eˣ + 1 > 0 for all x.) Setting a0 = −1 and b0 = 0, we obtain the following table of values.

k    ak        bk         xk
0    −1        0          −1/2
1    −1        −1/2       −3/4
2    −3/4      −1/2       −0.625
3    −0.625    −0.500     −0.5625
4    −0.625    −0.5625    −0.59375

Thus z ∈ (−0.625,−0.5625); see Figure 2.2.

The method always works for f continuous, as long as a and b can be found such that f(a)f(b) < 0 (and as long as we assume roundoff error does not cause us to incorrectly evaluate the sign of f(x)). However, consider y = f(x) with f(x) ≥ 0 for every x, but f(z) = 0. There are no a and b such that f(a)f(b) < 0. Thus, the method is not applicable to all problems in its present form. (See Figure 2.3 for an example of a root that cannot be found by bisection.)

FIGURE 2.3: Example of when the method of bisection cannot be applied.

Is there a way that we can know how many iterations to do for the method of bisection without actually performing the test in Step 5 of Algorithm 2.1? Simply examining how the widths of the intervals decrease leads us to the following fact.

THEOREM 2.2

Suppose that f ∈ C[a, b] and f(a)f(b) < 0. Then

|xk − z| ≤ (b − a)/2^(k+1).

Thus, in the algorithm, if (bk − ak)/2 = (b − a)/2^(k+1) < ǫ, then |z − xk| < ǫ.

Example 2.2

How many iterations are required to reduce the error to less than 10⁻⁶ if a = 0 and b = 1?

Solution: We need (1/2^(k+1))(1 − 0) < 10⁻⁶. Thus, 2^(k+1) > 10⁶, or k = 19.

This example illustrates the preferred way of stopping the method of bisection. Namely, if the method of bisection is programmed, it is preferable to compute an integer N such that

N > log((b − a)/ǫ)/log(2) − 1,

and test k > N, rather than testing the length of the interval directly as in Step 5 of Algorithm 2.1. One reason is that integer comparisons (comparing k to N, or doing it implicitly in a programming language loop, such as the matlab loop for k=1:N) are more efficient than floating point comparisons. Another reason is that, if ǫ were chosen too small (such as smaller than the distance between machine numbers near the solution z), the comparison in Step 5 of Algorithm 2.1 would never hold in practice, and the algorithm would never stop.


The following is an example of programming Algorithm 2.1 in matlab.

function [root,success] = bisect_method (a, b, eps, f)

%

% [root, success] = bisect_method (a, b, eps, func) returns the

% result of the method of bisection, with starting interval [a, b],

% tolerance eps, and with function defined by y = f(x). For example,

% suppose an m-file xsqm2.m is available in Matlab’s working

% directory, with the following contents:

% function [y] = xsqm2(x)

% y = x^2-2;

% return

% Then, issuing

% [root,success] = bisect_method (1, 2, 1e-10, 'xsqm2')

% from Matlab’s command window will cause an approximation to

% the square root of 2 that, in the absence of excessive roundoff

% error, has absolute error of at most 10^{-10}

% to be returned in the variable root, and success to be set to

% ’true’.

%

% success is set to 'false' if f(a) and f(b) have the same

% sign. success is also set to ’false’ if the tolerance cannot be met.

% In either case, a message is printed, and the midpoint of the present

% interval is returned in the variable root.

error=b-a;

fa=feval(f,a);

fb=feval(f,b);

success = true;

% First, handle incorrect arguments --

if (fa*fb > 0)

disp('Error: f(a)*f(b)>0');

success = false;

root = a + (b-a)/2;

return

end

if (eps <=0)

disp('Error: eps is less than or equal to 0')

success = false;

root = a + (b-a)/2;

return

end

if (b < a)

disp('Error: b < a')

success = false;

root = (a+b)/2;

return

end


% Set N to be the smallest integer such that N iterations of bisection

% suffices to meet the tolerance --

N = ceil( log((b-a)/eps)/log(2) - 1 )

% This is where we actually do Algorithm 2.1 --

disp(' -----------------------------');
disp(' Error Estimate ');
disp(' -----------------------------');

for i=1:N

x= a + (b-a)/2;

fx=feval(f,x);

if(fx*fa > 0)

a=x;

else

b=x;

end

error=b-a;

disp(sprintf(' %12.4e %12.4e', error, x));

end

% Finally, check to see if the tolerance was actually met. (With

% additional analysis of the minimum possible relative error

% (according to the distance between floating point

% numbers), unreasonable values of epsilon can be determined

% before the loop on i, avoiding unnecessary work.)

error = (b-a)/2;

root = a + (b-a)/2;

if (error > eps)

disp('Error: epsilon is too small for tolerance to be met');

success = false;

return

end

This program includes practical considerations beyond the raw mathematical operations in the algorithm. Observe the following.

1. The comments at the beginning of the program state precisely how the function is used. In fact, within the matlab system, if the file bisect_method.m contains this program within the working directory or within matlab's search path, and one issues the command help bisect_method from the matlab command window, all of these comments (those lines starting with “%”) prior to the first non-comment line are printed to the command window.

2. There are statements to catch errors in the input arguments.

When developing such computer programs, it is wise to use a uniform style in the comments, indentation of “if” blocks and “for” loops, etc. To a large extent, matlab's editor does indentation automatically, and automatically highlights comments and syntax elements such as “if” and “for” in different colors. It is also a good idea to identify the author and date programmed, as well as the package (if any) to which the program belongs. This is done for bisect_method.m in the version posted on the web page for the book, but is not reproduced here, for brevity.

The above implementation, stored in a matlab “m-file,” is an example of a matlab function, that is, an m-file that begins with a function statement. In such a file, quantities that are to be returned must appear in the bracketed list on the left of the “=,” while quantities that are input must appear in the parenthesized list on the right of the statement. In a function m-file, the only quantities from the calling environment that are available while the operations within the m-file are being done are those passed in the parenthesized input list, and the only quantities available to the command environment (or other function) from which the function is called are the ones in the bracketed list on the left. For example, consider the following dialog in the matlab command window.

>> eps = 1e-16

eps =

1.0000e-016

>> [root,success] = bisect_method(1,2,1e-2,'xsqm2')

N =

6

-----------------------------

Error Estimate

-----------------------------

5.0000e-001 1.5000e+000

2.5000e-001 1.2500e+000

1.2500e-001 1.3750e+000

6.2500e-002 1.4375e+000

3.1250e-002 1.4063e+000

1.5625e-002 1.4219e+000

root =

1.4141

success =

1

>> N

??? Undefined function or variable ’N’.

>>eps

eps =

1.0000e-016

>>

Observe that N is not available within the environment calling bisect_method, and eps is not available within bisect_method. This contrasts with matlab m-files that do not begin with a function statement. These files are termed matlab scripts. For example, the script run_bisect_method.m might contain the following lines.


clear

a = 1

b = 2

eps = 1e-1

[root,success] = bisect_method(a, b, eps, 'xsqm2')

The clear command removes all quantities from the environment. Observe now the following dialog in the matlab command window.

>> clear

>> a

??? Undefined function or variable ’a’.

>> b

??? Undefined function or variable ’b’.

>> run_bisect_method

a =

1

b =

2

eps =

0.1000

N =

3

-----------------------------

Error Estimate

-----------------------------

5.0000e-001 1.5000e+000

2.5000e-001 1.2500e+000

1.2500e-001 1.3750e+000

root =

1.4375

success =

1

>> a

a =

1

>> b

b =

2

>> eps

eps =

0.1000

>>

The reader is invited to use matlab's help system to explore the other aspects of the function bisect_method.

We end our present discussion of matlab programs with a note on the use of the symbol “=.” In statements entered into the command line and in m-files, = means “store the computed contents to the right of the = into the variable represented by the symbol on the left.” In quantities printed by the matlab system, = means “the value stored in the memory locations represented by the printed symbol is approximately equal to the printed quantity.” Note that this is significantly different from the meaning that a mathematician attaches to the symbol. For example, the approximations might not be close enough for our purposes to the intended value, due to roundoff error or other errors, or even due to error in conversion from the internal binary form to the printed decimal form.

2.2 The Fixed Point Method

The so-called “fixed point method” is a very general way of viewing computational processes involving equations, systems of equations, and equilibria. We introduce it here, and will see it again when we study systems of linear and nonlinear equations. It is also seen in more advanced studies of systems of differential equations.

DEFINITION 2.1 z ∈ G is a fixed point of g if g(z) = z.

REMARK 2.1 If f(x) = g(x)− x, then a fixed point of g is a zero of f .

The fixed-point iteration method is defined by the following: For x0 ∈ G,

xk+1 = g(xk) for k = 0, 1, 2, . . . .

Example 2.3

Suppose

g(x) = (1/2)(x + 1).

Then, starting with x0 = 0, fixed point iteration becomes

xk+1 = (1/2)(xk + 1),

and the first few iterates are x0 = 0, x1 = 1/2, x2 = 3/4, x3 = 7/8, x4 = 15/16, · · · . We see that this iteration converges to z = 1.

Example 2.4

If f is as in Example 2.1 on page 41, a corresponding g is g(x) = −eˣ. We can study fixed point iteration with this g with the following matlab dialog.


>> x = -0.5

x =

-0.5000

>> x = -exp(x)

x =

-0.6065

>> x = -exp(x)

x =

-0.5452

>> x = -exp(x)

x =

-0.5797

>> x = -exp(x)

x =

-0.5601

>> x = -exp(x)

x =

-0.5712

>> x = -exp(x)

x =

-0.5649

>>

(Here, we can recall the expression x = -exp(x) by simply pressing the up-arrow button on the keyboard.) We observe a convergence in which the approximation appears to alternate about the limit, but the convergence does not appear to be quadratic.

An important question is: when does {xk}, k = 0, 1, 2, . . . , converge to z, a fixed point of g? Fixed-point iteration does not always converge. Consider g(x) = x², whose fixed points are x = 0 and x = 1. If x0 = 2, then xk+1 = xk², so x1 = 4, x2 = 16, x3 = 256, · · · .

Although it is tempting to pose problems as fixed point iteration, the fixed point iterates do not always converge. We talk about convergence of fixed point iteration in terms of Lipschitz constants.

DEFINITION 2.2 g satisfies a Lipschitz condition on G if there is a Lipschitz constant L ≥ 0 such that

|g(x) − g(y)| ≤ L|x − y| for all x, y ∈ G. (2.1)

If g satisfies (2.1) with 0 ≤ L < 1, g is said to be a contraction on the set G. For differentiable functions, a common way of thinking about Lipschitz constants is in terms of the derivative of g. For instance, it is not hard to show (using the mean value theorem) that, if g′ is continuous and |g′(x)| ≤ L for every x, then g satisfies a Lipschitz condition with Lipschitz constant L.


Basically, if L < 1 (or if |g′| < 1), then fixed point iteration converges. In fact, in such instances,

|xk+1 − z| = |g(xk) − g(z)| ≤ L|xk − z|,

so fixed point iteration is linearly convergent with convergence factor C = L. (Later, we state conditions under which the convergence is faster than linear.)

This is embodied in the following theorem.

THEOREM 2.3

(Contraction Mapping Theorem in one variable) Suppose that g maps G into itself (i.e., if x ∈ G then g(x) ∈ G) and g satisfies a Lipschitz condition with 0 ≤ L < 1 (i.e., g is a contraction on G). Then there is a unique z ∈ G such that z = g(z), and the sequence determined by x0 ∈ G, xk+1 = g(xk), k = 0, 1, 2, · · · converges to z, with error estimates

|xk − z| ≤ (L^k/(1 − L)) |x1 − x0|,   k = 1, 2, · · ·   (2.2)

|xk − z| ≤ (L/(1 − L)) |xk − xk−1|,   k = 1, 2, · · ·   (2.3)
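As a programming aside, the a posteriori estimate (2.3) suggests a natural stopping criterion. The following is a minimal matlab sketch (a hypothetical helper, not from the book's web page) of fixed point iteration with this test, where g is a function handle and L is a known Lipschitz constant with L < 1:

function [x, errbound] = fixed_point_method(g, x0, L, tol, maxit)
% FIXED_POINT_METHOD -- iterate x <- g(x) from x0, stopping when the
% error bound (2.3), (L/(1-L))*|x_k - x_{k-1}|, falls below tol.
x = x0;
for k = 1:maxit
    xnew = g(x);
    errbound = (L/(1 - L))*abs(xnew - x);   % bound (2.3) on |xnew - z|
    x = xnew;
    if errbound < tol
        return
    end
end

For instance, with g(x) = 4 + sin(2x)/3 and L = 2/3 (see Example 2.7 below), fixed_point_method(@(x) 4 + sin(2*x)/3, 4, 2/3, 1e-6, 100) returns an approximation to the fixed point z ≈ 4.2615.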

Example 2.5

Suppose

g(x) = −x³/6 + x⁵/120,

and suppose we wish to find a Lipschitz constant for g over the interval [−1/2, 1/2].

We will proceed by an interval evaluation of g′ over [−1/2, 1/2]. Since g′(x) = −x²/2 + x⁴/24, we have

g′([−1/2, 1/2]) ∈ −(1/2)[−1/2, 1/2]² + (1/24)[−1/2, 1/2]⁴
= −(1/2)[0, 1/4] + (1/24)[0, 1/16] = [−1/8, 0] + [0, 1/384]
⊆ [−0.125, 0] + [0, 0.002605] ⊆ [−0.125, 0.00261].

Thus, since |g′(x)| ≤ max_{y∈[−0.125, 0.00261]} |y| = 0.125, g satisfies a Lipschitz condition with Lipschitz constant 0.125.

If g is a contraction for all real numbers x, then the hypotheses of the contraction mapping theorem are automatically satisfied, and fixed point iteration converges for any x0. (That is, the domain G can be taken to be the set of all real numbers.) On the other hand, if G must be restricted (such as if g is not a contraction everywhere or if g is not defined everywhere), then, to be assured that fixed point iteration converges, we need to know that g maps G into itself. Two possibilities are given in the following two theorems.

THEOREM 2.4

Let ρ > 0 and G = [c − ρ, c + ρ]. Suppose that g is a contraction on G with Lipschitz constant L, 0 ≤ L < 1, and

|g(c) − c| ≤ (1 − L)ρ.

Then g maps G into itself.

THEOREM 2.5

Assume that z is a solution of x = g(x), g′(x) is continuous in an interval about z, and |g′(z)| < 1. Then g is a contraction in a sufficiently small interval about z, and g maps this interval into itself. Thus, provided x0 is picked sufficiently close to z, the iterates will converge.

Example 2.6

Let

g(x) = x/2 + 1/x.

Can we show that the fixed point iteration xk+1 = g(xk) converges for any starting point x0 ∈ [1, 2]? We will use Theorem 2.4 and Theorem 2.3 to show convergence. In particular, g′(x) = 1/2 − 1/x². Evaluating g′(x) over [1, 2] with interval arithmetic, we obtain

g′([1, 2]) ∈ 1/2 − 1/[1, 2]² = [1/2, 1/2] − 1/[1, 4] = [1/2, 1/2] − [1/4, 1] = [1/2, 1/2] + [−1, −1/4] = [−1/2, 1/4].

Thus, since g′(x) ∈ g′([1, 2]) ⊆ [−1/2, 1/4] for every x ∈ [1, 2],

|g′(x)| ≤ max_{y∈[−1/2, 1/4]} |y| = 1/2

for every x ∈ [1, 2]. Thus, g is a contraction on [1, 2]. Furthermore, letting ρ = 1/2 and c = 3/2, |g(3/2) − 3/2| = 1/12 ≤ 1/4 = (1 − L)ρ. Thus, by Theorem 2.4, g maps [1, 2] into [1, 2]. Therefore, we can conclude from Theorem 2.3 that the fixed point iteration converges for any starting point x0 ∈ [1, 2] to the unique fixed point z = g(z).


Of course, it may be relatively easy to verify that |g′| < 1, after which we may actually try fixed point iteration to see if it stays in the domain and converges. In fact, in Theorem 2.4, we essentially do one iteration of fixed point iteration and compare the change to the size of the region.

Example 2.7

Let g(x) = 4 + (1/3) sin 2x and xk+1 = 4 + (1/3) sin 2xk. Observing that

|g′(x)| = |(2/3) cos 2x| ≤ 2/3

for all x shows that g is a contraction on all of R, so we can take G = R. Then g: G → G and g is a contraction on R. Thus, for any x0 ∈ R, the iterations xk+1 = g(xk) will converge to z, where z = 4 + (1/3) sin 2z. For x0 = 4, the following values are obtained.

k     xk
0     4
1     4.3298
2     4.2309
...   ...
14    4.2615
15    4.2615

It is not hard to show that, if −1 < −L ≤ g′(x) ≤ 0 and the fixed point iterates stay within G, then fixed point iteration converges, with the iterates xk alternately less than and greater than the fixed point z = g(z). On the other hand, if 0 ≤ g′(x) ≤ L < 1 and the fixed point iterates stay within the domain G, then the fixed point iterates xk converge monotonically to z. This latter situation is illustrated in Figure 2.4.

There are conditions under which the convergence of fixed point iteration is faster than linear. Recall, if lim_{k→∞} xk = z and |xk+1 − z| ≤ c|xk − z|^α, we say {xk} converges to z with rate of convergence α. (We specify that c < 1 for α = 1.)

THEOREM 2.6

Assume that the iterations xk+1 = g(xk) converge to a fixed point z. Furthermore, assume that q is the first positive integer for which g⁽q⁾(z) ≠ 0, and, if q = 1, that |g′(z)| < 1. Then the sequence {xk} converges to z with order q. (It is assumed that g ∈ C^q(G), where G contains z.)


FIGURE 2.4: Example of monotonic convergence of fixed point iteration.

Example 2.8

Let

g(x) = (x² + 6)/5

and G = [1, 2.3]. Since g′(x) = 2x/5, the range of g′ over G is (2/5)[1, 2.3] > 0, so g is monotonically increasing. Furthermore, g(1) = 7/5 and g(2.3) = 2.258, so the exact range of g over [1, 2.3] is the interval [1.4, 2.258] ⊂ [1, 2.3]; that is, g maps G into G. Also,

|g′(x)| = |2x/5| ≤ 0.92 < 1

for x ∈ G. (Indeed, in this case, an interval evaluation gives 2[1, 2.3]/5 = [0.4, 0.92], the exact range of g′, since x occurs only once in the expression for g′.) Theorem 2.3 then implies that there is a unique fixed point z ∈ G. It is easy to see that the fixed point is z = 2. In addition, since g′(z) = 4/5 ≠ 0, there is a linear rate of convergence. Inspecting the values in the following table, notice that the convergence is not fast.

k    xk
0    2.2
1    2.168
2    2.140
3    2.116
4    2.095

Example 2.9

Let

g(x) = x/2 + 2/x = (x² + 4)/(2x)

be as in Example 2.6. It can be shown that if 0 < x0 < 2, then x1 > 2. Also, xk > xk+1 > 2 when xk > 2. Thus, {xk} is a monotonically decreasing sequence bounded below by 2 and hence is convergent. Thus, for any x0 ∈ (0, ∞), the sequence xk+1 = g(xk) converges to z = 2.

Now consider the convergence rate. We have that

g′(x) = 1/2 − 2/x²,

so g′(2) = 0, and

g″(x) = 4/x³,

so g″(2) ≠ 0. By Theorem 2.6, the convergence is quadratic, and as indicated in the following table, the convergence is rapid.

k    xk
0    2.2
1    2.00909
2    2.00002
3    2.00000000

Example 2.10

Let

g(x) = (3/8)x⁴ − 4.

There is a unique fixed point z = 2. However, g′(x) = (3/2)x³, so g′(2) = 12, and we cannot conclude linear convergence. Indeed, the fixed point iterations converge only if x0 = 2. If x0 > 2, then x1 > x0 > 2, x2 > x1 > x0 > 2, · · · . Similarly, if x0 < 2, it can be verified that, for some k, xk < 0, after which xk+1 > 2, and we are in the same situation as if x0 > 2. That is, fixed point iterations diverge unless x0 = 2.

Example 2.11

Consider again g from Example 2.8. Starting with x0 = 2.2, how many iterations would be required to obtain the fixed point z with |xk − z| < 10⁻¹⁶? Can this number of iterations be computed before actually doing the iterations?

We can use the bound

|xk − z| ≤ (L^k/(1 − L)) |x1 − x0|

from the Contraction Mapping Theorem (on page 49). The mean value theorem gives

xk+1 − z = g′(ck)(xk − z),

but the smallest bound we know on |g′(ck)| (and hence the smallest L in the formula) is L = 0.92. We also compute x1 = ((2.2)² + 6)/5 = 2.168, so

|x1 − x0| = 0.032.

Therefore,

|xk − z| ≤ (0.92^k/(1 − 0.92)) · 0.032 = 0.4 · (0.92)^k.

Solving

0.4 · (0.92)^k < 10⁻¹⁶

for k gives

k > log(2.5 × 10⁻¹⁶)/log(0.92) ≈ 430.9.

Thus, 431 iterations would be required to achieve, roughly, IEEE double precision accuracy.
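The count itself is a one-line computation; as a quick (hypothetical) matlab check:

k = ceil(log(2.5e-16)/log(0.92))   % smallest k with 0.4*(0.92)^k < 1e-16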

2.3 Newton’s Method (Newton-Raphson Method)

We now return to the problem: given f(x), find z such that f(z) = 0. Newton's iteration for finding approximate solutions to this problem has the form

xk+1 = xk − f(xk)/f′(xk) for k = 0, 1, 2, · · · . (2.4)

REMARK 2.2 Newton's method is a special fixed-point method with g(x) = x − f(x)/f′(x).

Figure 2.5 illustrates the geometric interpretation of Newton's method. To find xk+1, the tangent line to the curve at the point (xk, f(xk)) is followed to the x-axis. The tangent line is y − f(xk) = f′(xk)(x − xk). Thus, at y = 0, x = xk − f(xk)/f′(xk) = xk+1.

Newton's method is quadratically convergent, and is therefore fast when compared to a typical linearly convergent fixed point method. However, Newton's method may diverge if x0 is not sufficiently close to a root z at which f(z) = 0. To see this, study Figure 2.6.
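In matlab, the iteration (2.4) is only a few lines. The following is a minimal sketch (a hypothetical helper, distinct from the Newton function mentioned in Section 2.6), with f and fp function handles for f and f′:

function [x, converged] = newton_method(f, fp, x0, tol, maxit)
% NEWTON_METHOD -- iterate (2.4) from x0 until |f(x)| < tol or
% maxit iterations have been taken.
x = x0;
converged = false;
for k = 1:maxit
    fx = f(x);
    if abs(fx) < tol
        converged = true;
        return
    end
    x = x - fx/fp(x);        % the Newton step (2.4)
end

For instance, newton_method(@(x) x + exp(x), @(x) 1 + exp(x), -1.0, 1e-12, 50) converges rapidly to the root −0.567143 used in Example 2.12 below.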

Another conceptually useful way of deriving Newton's method is using Taylor's formula. We have

0 = f(z) = f(xk) + f′(xk)(z − xk) + ((z − xk)²/2) f″(ξk),

where ξk is between z and xk. Thus, assuming that (z − xk)² is small,

z ≈ xk − f(xk)/f′(xk).

Hence, when xk+1 = xk − f(xk)/f′(xk), we would expect xk+1 to be closer to z than xk.

FIGURE 2.5: Illustration of two iterations of Newton's method.

FIGURE 2.6: Examples of divergence of Newton's method. On the left, the sequence diverges; on the right, the sequence oscillates.

The quadratic convergence rate of Newton's method can be inferred from Theorem 2.6 by analyzing Newton's method as a fixed point iteration. Consider

xk+1 = xk − f(xk)/f′(xk) = g(xk).

Observe that g(z) = z,

g′(z) = 1 − f′(z)/f′(z) + f(z)f″(z)/(f′(z))² = 0,

and, usually, g″(z) ≠ 0. Thus, the quadratic convergence follows from Theorem 2.6.


Example 2.12

Let f(x) = x + eˣ. Compare bisection, simple fixed-point iteration, and Newton's method.

• Newton's method: xk+1 = xk − f(xk)/f′(xk) = xk − (xk + e^{xk})/(1 + e^{xk}) = (xk − 1)e^{xk}/(1 + e^{xk}).

• Fixed-Point (one form): xk+1 = −e^{xk} = g(xk).

k    xk (Bisection, a = −1, b = 0)    xk (Fixed-Point)    xk (Newton's)
0    −0.5                             −1.0                −1.0
1    −0.75                            −0.367879           −0.537883
2    −0.625                           −0.692201           −0.566987
3    −0.5625                          −0.500474           −0.567143
4    −0.59375                         −0.606244           −0.567143
5    −0.578125                        −0.545396           −0.567143
10   −0.566895                        −0.568429           −0.567143
20   −0.567143                        −0.567148           −0.567143

2.4 The Univariate Interval Newton Method

A simple application of the ideas behind Newton's method and the Mean Value Theorem leads to a mathematically rigorous computation of the zeros of a function f. In particular, suppose x = [x̲, x̄] is an interval, and suppose that there is a z ∈ x with f(z) = 0. Let x̌ be any point of x (such as the midpoint of x). Then the Mean Value Theorem (page 5) gives

0 = f(x̌) + f′(ξ)(z − x̌). (2.5)

Solving (2.5) for z, then applying the fundamental theorem of interval arithmetic (page 27), gives

z = x̌ − f(x̌)/f′(ξ) ∈ x̌ − f(x̌)/f′(x) = N(f; x, x̌). (2.6)

We thus have the following.

THEOREM 2.7

Any solution z ∈ x of f(x) = 0 must also be in N(f; x, x̌).


We call N(f; x, x̌) the univariate interval Newton operator. The interval Newton operator forms the basis of a fixed-point type of iteration of the form

xk+1 ← N(f; xk, x̌k) for k = 1, 2, . . . .

The interval Newton method is similar in many ways to the traditional Newton–Raphson method of Section 2.3 (page 54), but uses floating point arithmetic (with upward and downward roundings) to provide rigorous upper and lower bounds on exact solutions. We now discuss existence and uniqueness properties of the interval Newton method. In addition to providing bounds on any solutions within a given region, the interval Newton method has the following property.

THEOREM 2.8

Suppose f ∈ C(x) = C([x̲, x̄]), x̌ ∈ x, and N(f; x, x̌) ⊆ x. Then there is an x∗ ∈ x such that f(x∗) = 0. Furthermore, this x∗ is unique.

A formal algorithm for the interval Newton method is as follows.

ALGORITHM 2.2

(The univariate interval Newton method)
INPUT: x = [x̲, x̄], f: x ⊂ R → R, a maximum number of iterations N, and a stopping tolerance ǫ.
OUTPUT: Either

1. “solution does not exist within the original x”, or

2. a new interval x∗ such that any x∗ ∈ x with f(x∗) = 0 has x∗ ∈ x∗, and one of:

(a) “existence and uniqueness verified” and “tolerance met.”

(b) “existence and uniqueness not verified,”

(c) “solution does not exist,” or

(d) “existence and uniqueness verified” but “tolerance not met.”

1. k ← 1.

2. “existence and uniqueness verified”← “false.”

3. “solution does not exist”← “false.”

4. DO WHILE k <= N .

(a) x̌ ← (x̲ + x̄)/2.

(b) IF x̌ ∉ x THEN RETURN.

(c) x̃ ← N(f; x, x̌).

(d) IF x̃ ⊆ x (that is, if both end points of x̃ lie within x) THEN “existence and uniqueness verified” ← “true.”

(e) IF x̃ ∩ x = ∅ (that is, if x̃ lies entirely to the left or entirely to the right of x) THEN

i. “solution does not exist” ← “true.”

ii. RETURN.

(f) IF w(x̃) < ǫ THEN

i. x∗ ← x̃.

ii. “tolerance met” ← “true.”

iii. RETURN.

END IF

(g) x ← x̃ ∩ x.

(h) k ← k + 1.

END DO

5. “tolerance met”← “false.”

6. RETURN.

END ALGORITHM 2.2.

Notes:

1. The interval Newton method generally becomes stationary. (That is, the end points of x can be proven not to change, under certain assumptions on the machine arithmetic.) However, it is good general programming practice to enforce an upper limit on the total number of iterations of any iterative process, to avoid problems arising from slow convergence, etc.

2. In Step 4a of Algorithm 2.2, the midpoint is computed approximately, and it occasionally occurs (when the interval is very narrow) that the machine approximation lies outside the interval. Thus, we need to check for this possibility.

3. Although f is evaluated at a point in the expression

x̌ − f(x̌)/f′(x)

for N(f; x, x̌), the machine must evaluate f with interval arithmetic to take account of rounding error. (That is, we start the computations with the degenerate interval [x̌, x̌].) Otherwise, the results are not mathematically rigorous.
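For concreteness, the following is a minimal intlab-based sketch of one evaluation of N(f; x, x̌) (a simplified, hypothetical stand-in for the function i_newton_step_no_fp.m mentioned in Section 2.6); the point evaluation is done through the thin interval [x̌, x̌], and f′ over x is obtained by automatic differentiation:

function xN = int_newton_step(f, x)
% INT_NEWTON_STEP -- one interval Newton operator evaluation (2.6),
% for an INTLAB interval x and a function handle f (written so that
% it accepts intval and gradient arguments).
xcheck = intval(mid(x));   % midpoint, enclosed as a degenerate interval
fc = f(xcheck);            % rigorous enclosure of f at the midpoint
g = f(gradientinit(x));    % evaluate f with automatic differentiation
xN = xcheck - fc/g.dx;     % N(f; x, xcheck); g.dx encloses f'(x)

Intersecting the result with x, as in Step 4g of Algorithm 2.2, gives the next iterate; with f = @(x) x^2 - 2 and x = infsup(1, 2), one step reproduces (up to outward rounding) the second row of Table 2.1.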


Similar to the traditional Newton–Raphson method, the interval Newton method exhibits quadratic convergence. (This is well known.) An example of a specific theorem along these lines is

THEOREM 2.9

(Quadratic convergence of the interval Newton method) Suppose f: x → R, suppose f ∈ C(x) and f′ ∈ C(x), and suppose there is an x∗ ∈ x such that f(x∗) = 0. Suppose further that f′ is a first order or higher order interval extension of f′ in the sense of Theorem 1.9 (on page 28). Then, for the initial width w(x) sufficiently small,

w(N(f; x, x̌)) = O(w(x)²).

We will not give a proof of Theorem 2.9 here, although Theorem 2.9 is a special case of Theorem 6.3, page 222, in [20]. We will illustrate this quadratic convergence with

Example 2.13

(Taken from [22].) Apply the interval Newton method x ← N(f; x, x̌), x̌ ← fl((x̲ + x̄)/2), to

f(x) = x² − 2,

starting with x = [1, 2] and x̌ = 1.5. The results for Example 2.13 appear in Table 2.1. Here,

δk = w(xk) / max{ max_{y∈xk} |y|, 1 }

is a scaled version of the width w(xk), and ρk = max_{y∈f(xk)} |y|. The displayed decimal intervals have been rounded out from the corresponding binary intervals.

TABLE 2.1: Convergence of the interval Newton method with f(x) = x² − 2.

k   xk                                      δk            ρk
0   [1.00000000000000, 2.00000000000000]    5.00 × 10⁻¹   2.00 × 10⁰
1   [1.37499999999999, 1.43750000000001]    4.35 × 10⁻²   1.09 × 10⁻¹
2   [1.41406249999999, 1.41441761363637]    2.51 × 10⁻⁴   5.77 × 10⁻⁴
3   [1.41421355929452, 1.41421356594718]    4.70 × 10⁻⁹   1.01 × 10⁻⁸
4   [1.41421356237309, 1.41421356237310]    4.71 × 10⁻¹⁶  1.33 × 10⁻¹⁵
5   [1.41421356237309, 1.41421356237310]    4.71 × 10⁻¹⁶  1.33 × 10⁻¹⁵
6   [1.41421356237309, 1.41421356237310]    4.71 × 10⁻¹⁶  1.33 × 10⁻¹⁵


2.5 The Secant Method

Under certain circumstances, f may have a continuous derivative, but it may not be possible to explicitly compute it. This is less true now than in the past, because techniques of automatic differentiation (or “computational differentiation”), such as we explain in Section 6.2, page 215, have been developed, have become more widely available, and are used in practice. However, there are still various situations involving “black box” functions f. For “black box” functions, f is evaluated by some external procedure (such as a software system provided by someone other than its user), in which one supplies the input x, and the output f(x) is returned, but the user (or the designer of the method for finding points x∗ with f(x∗) = 0) does not have access to the internal workings, so that f′ cannot be easily computed. In such cases, methods that converge more rapidly than the method of bisection, but that do not require evaluation of f′, are useful.

Example 2.14

Suppose we wish to find a zero of
$$f(x) = e^{-ax} - g\left(\cos x + \frac{a}{x}\sin x + \ln x\right),$$
where
$$g(x) = \frac{1}{1+x^2} + 3x^2 + 5x + h(5 + e^x + \cos x), \qquad h(x) = \frac{e^{2x^2}}{1 + x + x^2},$$
and a is a constant.

Problems as complicated as this are not uncommon. Prior to widespread use of automatic differentiation, applying Newton’s method to this problem was quite difficult, because it would have been difficult and time-consuming to calculate f′(xk) at each iteration. Automatic differentiation is now an option for many problems of this type. However, in certain situations, such as applying the shooting method to the solution of boundary-value problems (see the discussion in Chapter 10), f′ cannot be directly computed, and the secant method is useful. In this section, we will assume that f′ cannot be computed, and we will treat f as a “black-box” function.

In the secant method, f′(xk) is approximated by
$$f'(x_k) \approx \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}.$$


The secant method thus has the form
$$x_{k+1} = x_k - f(x_k)\left[\frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})}\right]. \qquad (2.7)$$

If f(xk) and f(xk+1) have opposite signs, then, as with the bisection method, there must be an x∗ between xk and xk+1 for which f(x∗) = 0.

For the secant method, we need starting values x0 and x1. However, only one evaluation of the function f is required at each iteration, since f(xk−1) is known from the previous iteration.

Geometrically (see Figure 2.7), to obtain xk+1, the secant line to the curve through (xk−1, f(xk−1)) and (xk, f(xk)) is followed to the x-axis.


FIGURE 2.7: Geometric interpretation of the secant method.

Interestingly, the convergence rate of the secant method is faster than linear but slower than quadratic.

THEOREM 2.10

(Convergence of the secant method) Let G be a subset of $\mathbb{R}$ containing a zero z of f(x). Assume $f \in C^2(G)$ and there exists an $M \geq 0$ such that
$$M = \frac{\max_{x\in G}|f''(x)|}{2\min_{x\in G}|f'(x)|}.$$
Let x0 and x1 be two initial guesses to z, and let
$$K_\epsilon(z) = (z - \epsilon,\ z + \epsilon) \subseteq G,$$
where $\epsilon = \delta/M$ and $\delta < 1$. Let $x_0, x_1 \in K_\epsilon(z)$. Then the iterates $x_2, x_3, x_4, \cdots$ remain in $K_\epsilon(z)$ and converge to z with error
$$|x_k - z| \leq \frac{1}{M}\,\delta^{\left((1+\sqrt{5})/2\right)^k}.$$

Note that $(1+\sqrt{5})/2 \approx 1.618$, giving a fractional order of convergence between 1 and 2. For Newton’s method, $|x_k - z| \leq q^{2^k}$ with $q < 1$.
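To make the iteration concrete, the following is a minimal matlab sketch of (2.7), treating f as a black-box function handle; the names secant, tol, and maxitr are ours for illustration, not from a standard library.

function x = secant(f, x0, x1, tol, maxitr)
% Secant iteration (2.7): f is a function handle, x0 and x1 are the two
% starting values, and iteration stops when successive iterates differ
% by less than tol or maxitr iterations have been done.
fx0 = f(x0);
fx1 = f(x1);
for k = 1:maxitr
   x = x1 - fx1*(x1 - x0)/(fx1 - fx0);  % secant update (2.7)
   if abs(x - x1) < tol
      return;
   end
   x0 = x1; fx0 = fx1;   % only one new evaluation of f per iteration
   x1 = x;  fx1 = f(x1);
end

For example, secant(@(x) x^2-2, 1, 2, 1e-12, 50) returns an approximation to √2.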

2.6 Software

The matlab function bisect_method we presented, as well as a matlab function for Newton’s method, are available from the web page for the graduate version of this book, namely at

http://interval.louisiana.edu/Classical-and-Modern-NA/

A step of the interval Newton method is implemented with the matlab function i_newton_step_no_fp.m, explained in [26], and available from the web page for the book, at http://www.siam.org/books/ot110. This function uses intlab (see page 35) for the interval arithmetic and for automatically computing derivatives of f.

Additional techniques for root-finding, such as finding complex roots and finding all roots of polynomials, appear in the graduate version of this book [1]. One common computation is finding all of the roots of a polynomial equation p(x) = 0. The matlab function roots accepts an array containing the coefficients of the polynomial, and returns the roots of that polynomial. For example, we might have the following matlab dialog.

>> c = [1,1,1]

c =

1 1 1

>> r = roots(c)

r =

-0.5000 + 0.8660i

-0.5000 - 0.8660i

>> c = [1 5 4]

c =

1 5 4

>> r = roots(c)

r =

-4

-1

>>

The first computation computes approximations to the roots of the polynomial $p(x) = x^2 + x + 1$, namely, approximations to $-1/2 \pm (\sqrt{3}/2)i$, while the second computation computes the roots of the polynomial $p(x) = x^2 + 5x + 4$, namely x = −4 and x = −1.

matlab also contains a function fzero for finding zeros of more general equations f(x) = 0. A matlab dialog with its use is

>> x = fzero(’exp(x)+x’,-0.5)

x =

-0.5671

>>

(Compare this with Example 2.1.) Various examples, as well as explanations of the underlying algorithms, are available within the matlab help system for roots and fzero.

NETLIB (at http://www.netlib.org/) contains various software packages in Fortran, C, etc. for computing roots of polynomial equations and other equations.

Software for finding verified bounds on all solutions is also available. See [26] for an introduction to some of the techniques. intlab has the function verifypoly for finding certified bounds on the roots of polynomials. There is also a general function verifynlss in intlab for finding certified bounds on solutions of nonlinear systems of equations.

Generally, finding a root of an equation is a computation done as part of an overall modeling or simulation process. It is usually advantageous to use polished programs within the chosen system (matlab, a programming language such as Fortran or C++, etc.). However, in developing specialized packages, if the function f has special properties, one can take advantage of these. It may also be efficient in certain cases to program directly the simple methods described in this chapter, if one is certain of their convergence in the context of their use.

2.7 Applications

The problem of finding x such that f(x) = 0 arises frequently when trying to solve for equilibrium solutions (constant solutions) of differential equation models in many fields, including biology, engineering, and physics. Here, we focus on a model from population biology. To this end, consider the general population model given by
$$\frac{dx}{dt} = f(x)x = \big(b(x) - d(x)\big)x. \qquad (2.8)$$

Here, x(t) is the population density at time t. The function b(x) is the density-dependent birth rate and d(x) is the density-dependent death rate. Thus, f(x) is the density-dependent growth rate of the population. An important problem in population biology is analyzing the solution behavior of such dynamical models. A first step in such analysis is often finding the equilibrium solutions. Clearly, these solutions satisfy dx/dt = 0, which holds either when x = 0 (the trivial solution, usually referred to as the extinction equilibrium of this population model) or when f(x) = 0, i.e., at values x which make the growth rate equal zero.

To focus on a concrete example, assume the birth rate is of Ricker type, b(x) = e⁻ˣ, and the mortality rate is a linear function given by d(x) = 2x. This implies that the growth rate is given by f(x) = e⁻ˣ − 2x. To find the unique positive equilibrium, we need to solve the equation e⁻ˣ − 2x = 0. We use Newton’s method, implemented by the following matlab function,

function [x_star,success] = newton (x0, f, f_prime, eps, maxitr)
%
% [x_star,success] = newton(x0,f,f_prime,eps,maxitr)
% does iterations of Newton's method for a single variable,
% using x0 as initial guess, f (a character string giving
% an m-file name) as function, and f_prime (also a character
% string giving an m-file name) as the derivative of f.
% For example, suppose an m-file xsqm2.m is available in Matlab's working
% directory, with the following contents:
%    function [y] = xsqm2(x)
%    y = x^2-2;
%    return
% and an m-file xsqm2_prime.m is also available, with the following
% contents:
%    function [y] = xsqm2_prime(x)
%    y = 2*x;
%    return
% Then, issuing
%    [x_star,success] = newton(1.5, 'xsqm2', 'xsqm2_prime', 1e-10, 20)
% from Matlab's command window will cause an approximation to the square
% root of 2 to be stored in x_star.
% Iteration stops successfully if |f(x)| < eps, and iteration
% stops unsuccessfully if maxitr iterations have been done
% without stopping successfully or if a zero derivative
% is encountered.
% On return:
%    success = 1 if iteration stopped successfully, and
%    success = 0 if iteration stopped unsuccessfully.
%    x_star is set to the approximate solution to f(x) = 0
%    if iteration stopped successfully, and x_star
%    is set to x0 otherwise.
success = 0;
x = x0;
for i=1:maxitr
   fval = feval(f,x);
   if abs(fval) < eps
      success = 1;
      disp(sprintf(' %10.0f %15.9f %15.9f ', i, x, fval));
      x_star = x;
      return;
   end
   fpval = feval(f_prime,x);
   if fpval == 0        % zero derivative: stop unsuccessfully
      x_star = x0;
      return;
   end
   disp(sprintf(' %10.0f %15.9f %15.9f ', i, x, fval));
   x = x - fval / fpval;   % Newton update
end
x_star = x0;

and the following matlab dialog

>> y=inline(’exp(-x)-2*x’)

>> yp=inline(’-exp(-x)-2’)

>> [x_star,success]=newton(0,y,yp,1e-10,40)

we obtain the following table of iterations for the solution:

   1   0.000000000
   2   0.333333333
   3   0.351689332
   4   0.351733711
   5   0.351733711

Thus, x ≈ 0.351733711 is the unique positive equilibrium of this model.

2.8 Exercises

1. Consider the method of bisection applied to f(x) = arctan(x), with initial interval x = [−4.9, 5.1].

(a) Are the hypotheses under which the method of bisection converges valid? If so, then how many iterations would it take to obtain the solution to within an absolute error of 10⁻²?

(b) Apply Algorithm 2.1 with pencil and paper, until k = 5, arranging your computations carefully so you gain some intuition into the process.

2. Let f and x be as in Problem 1.


(a) Modify bisect_method so it prints ak, bk, f(ak), f(bk), and f(xk) for each step, so you can see what is happening. Hint: Deleting the semicolon from the end of a matlab statement causes the value assigned on the left of the statement to be printed, while a statement consisting only of a variable name causes that variable’s value to be printed. If you want more neatly printed quantities, study the matlab functions disp and sprintf.

(b) Try to solve f(x) = 0 with ε = 10⁻², ε = 10⁻⁴, ε = 10⁻⁸, ε = 10⁻¹⁶, ε = 10⁻³², ε = 10⁻⁶⁴, and ε = 10⁻¹²⁸.

i. For each ε, compute the k at which the algorithm should stop.

ii. What behavior do you actually observe in the algorithm? Can you explain this behavior?

3. Repeat Problem 2, but with f(x) = x² − 2 and initial interval x = [1, 2].

4. Use the program for the bisection method in Problem 2 to find an approximation to $1000^{1/4}$ which is correct to within 10⁻⁵.

5. Consider g(x) = x− arctan(x).

(a) Perform 10 iterations of the fixed point method xk+1 = g(xk), starting with x = 5, x = −5, x = 1, x = −1, and x = 0.1.

(b) What do you observe for the different starting points? What is |g′| at each starting point, and how might this relate to the behavior you observe?

6. It is desired to find the positive real root of the equation x³ + x² − 1 = 0.

(a) Find an interval $\mathbf{x} = [\underline{x}, \overline{x}]$ and a suitable fixed point iteration function g(x) to accomplish this. Verify all conditions of the contraction mapping theorem.

(b) Find the minimum number of iterations n needed so that the n-th approximation to the root has absolute error at most 10⁻⁴. Also, use the fixed-point iteration method (with the g you determined in part (a)) to determine this positive real root accurate to within 10⁻⁴.

7. Find an approximation to $1000^{1/4}$ correct to within 10⁻⁵ using the fixed point iteration method.

8. Consider f(x) = arctan(x). This function has a unique zero z = 0.

(a) Use a digital computer with double precision arithmetic to do iterations of Newton’s method, starting with x0 = 0.5, 1.0, 1.3, 1.4, 1.35, 1.375, 1.3875, 1.39375, 1.390625, 1.3921875. Iterate until one of the following occurs:


• |f(x)| ≤ 10⁻¹⁰,

• an operation exception occurs, or

• 20 iterations are completed.

(i) Describe the behavior you observe.

(ii) Explain the behavior you observe in terms of the graph of f .

(iii) Evidently, there is a point p such that, if x0 > p, then Newton’s method diverges, and if x0 < p, then Newton’s method converges.

(α) What would happen if x0 = p exactly? Illustrate what would happen on a graph of f.

(β) Do you think we could choose x0 = p exactly in practice?

9. Let f(x) = x² − a.

(a) Write down and simplify the Newton’s method iteration equation for f(x) = 0.

(b) For a = 2, form a table of 15 iterations of Newton’s method, starting with x0 = 2, x0 = 4, x0 = 8, x0 = 16, x0 = 32, and x0 = 64.

(c) Explain your results in terms of the shape of the graph of f and in terms of the convergence theory in this section.

(d) Compare your analysis here to the analysis in Example 2.9 on page 52.

10. Hint: The free intlab toolbox, mentioned on page 35, is recommended for this problem.

(a) Let f be as in Problem 8 of this set. Experiment with the interval Newton method for this problem, and with various intervals that contain zero. Try some intervals of the form [−a, a] (with x = 0) and other intervals of the form [−a, b], a > 0, b > 0, and a ≠ b. Explain what you have found.

(b) Use the interval Newton method to prove that there exists a unique solution to f(x) = 0 for x ∈ [−1, 0], where f(x) = x + eˣ.

(c) Iterate the interval Newton method to find as narrow bounds as possible on the solution proven to exist in part 10b.

11. Repeat Exercise 8a, page 66, but with the secant method instead of Newton’s method. (Use pairs of starting points {0.5, 1.0}, {1.0, 1.3}, etc.)

12. Do three steps of Newton’s method, using complex arithmetic, for the function f(z) = z² + 1, with starting guess z0 = 0.2 + 0.7i. Although you may use a computer program, you should show intermediate results, including zk, f(zk), and f′(zk). (Note: Newton’s method with complex arithmetic can be viewed as a multivariate Newton method in two variables; see Exercise 5 on page 324, in Section 8.2.)


Chapter 3

Linear Systems of Equations

The solution of linear systems of equations is an extremely important process in scientific computing. Linear systems of equations directly serve as mathematical models in many situations, while solution of linear systems of equations is an important intermediary computation in the analysis of other models, such as nonlinear systems of differential equations.

Example 3.1

Find x1, x2, and x3 such that
$$\begin{aligned} x_1 + 2x_2 + 3x_3 &= -1,\\ 4x_1 + 5x_2 + 6x_3 &= 0,\\ 7x_1 + 8x_2 + 10x_3 &= 1. \end{aligned}$$

This chapter deals with the analysis and approximate solution of such systems of equations with floating point arithmetic. We will study two direct methods, Gaussian elimination (the LU decomposition) and the QR decomposition, as well as iterative methods, such as the Gauss–Seidel method, for solving such systems. (Computations in direct methods finish with a finite number of operations, while iterative methods involve a limiting process, as fixed point iteration does.) We will also study the singular value decomposition, a powerful technique for obtaining information about linear systems of equations, the mathematical models that give rise to such linear systems, and the effects of errors in the data on the solution of such systems.

The process of dealing with linear systems of equations comprises the subject of numerical linear algebra. Before studying the actual solution of linear systems, we introduce (or review) commonly used underlying notation and facts.


3.1 Matrices, Vectors, and Basic Properties

The coefficients of x1, x2, and x3 in Example 3.1 can be written as an array of numbers
$$A = \begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 10 \end{pmatrix},$$
which we call a matrix. The horizontal lines of numbers are the rows, while the vertical lines are the columns. In the example, we say “A is a 3 by 3 matrix,” meaning that it has 3 rows and 3 columns. (If a matrix B had two rows and 5 columns, for example, we would say “B is a 2 by 5 matrix.”)

In numerical linear algebra, the variables x1, x2, and x3 as in Example 3.1 are typically represented in a matrix
$$x = \begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix}$$
with 3 rows and 1 column, as is the set of right members of the equations:
$$b = \begin{pmatrix} -1\\ 0\\ 1 \end{pmatrix};$$
x and b are called column vectors.

Often, the system of linear equations will have real coefficients, but the coefficients will sometimes be complex. If the system has n variables in the column vector x, and the variables are assumed to be real, we say that $x \in \mathbb{R}^n$. If B is an m by n matrix whose entries are real numbers, we say $B \in \mathbb{R}^{m\times n}$. If the vector x has n complex components, we say $x \in \mathbb{C}^n$, and if an m by n matrix B has complex entries, we say $B \in \mathbb{C}^{m\times n}$.

Systems such as in Example 3.1 can be written using matrices and vectors, with the concept of matrix multiplication. We use upper case letters to denote matrices and lower case letters without subscripts to denote vectors (which we consider to be column vectors); we denote the element in the i-th row, j-th column of a matrix A by $a_{ij}$, and we sometimes denote the entire matrix A by $(a_{ij})$. The numbers that comprise the elements of matrices and vectors are called scalars.

DEFINITION 3.1 If $A = (a_{ij})$, then $A^T = (a_{ji})$ and $A^H = (\overline{a_{ji}})$ denote the transpose and conjugate transpose of A, respectively.

Example 3.2

On most computers in use today, the basic quantity in matlab is a matrix whose entries are double precision floating point numbers according to the IEEE 754 standard. Matrices are marked with square brackets “[” and “]”, with commas or spaces separating the entries in a row and semicolons or the end of a line separating the rows. The transpose of a matrix is obtained by typing a single quotation mark (or apostrophe) after the matrix. Consider the following matlab dialog.

>> A = [1 2 3;4 5 6;7 8 10]

A =

1 2 3

4 5 6

7 8 10

>> A’

ans =

1 4 7

2 5 8

3 6 10

>>

DEFINITION 3.2 If A is an m × n matrix and B is an n × p matrix, then C = AB, where
$$c_{ij} = \sum_{k=1}^{n} a_{ik}b_{kj}$$
for $i = 1, \cdots, m$, $j = 1, \cdots, p$. Thus, C is an m × p matrix.
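To make the definition concrete, the following is a minimal matlab sketch of Definition 3.2, computing C entry by entry (the name matmul is ours for illustration; in practice one simply writes A*B):

function C = matmul(A, B)
% Form the product C = A*B directly from Definition 3.2.
[m, n] = size(A);
[n2, p] = size(B);
if n ~= n2, error('inner matrix dimensions must agree'); end
C = zeros(m, p);
for i = 1:m
   for j = 1:p
      for k = 1:n
         C(i,j) = C(i,j) + A(i,k)*B(k,j);  % c_ij = sum_k a_ik*b_kj
      end
   end
end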

Example 3.3

Continuing the matlab dialog from Example 3.2, we have

>> B = [-1 0 1

2 3 4

3 2 1]

B =

-1 0 1

2 3 4

3 2 1

>> C = A*B

C =

12 12 12

24 27 30


39 44 49

>>

(If the reader or student is not already comfortable with matrix multiplication, we suggest confirming the above calculation by doing it with paper and pencil.)

With matrix multiplication, we can write the linear system in Example 3.1 at the beginning of this chapter as
$$Ax = b, \quad A = \begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 10 \end{pmatrix}, \quad x = \begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix}, \quad b = \begin{pmatrix} -1\\ 0\\ 1 \end{pmatrix}.$$

Matrix multiplication can be easily described in terms of the dot product:

DEFINITION 3.3 Suppose we have two real vectors
$$v = \begin{pmatrix} v_1\\ v_2\\ \vdots\\ v_n \end{pmatrix} \quad\text{and}\quad w = \begin{pmatrix} w_1\\ w_2\\ \vdots\\ w_n \end{pmatrix}.$$
Then the dot product $v \circ w$, also written $(v, w)$, of v and w is the matrix product
$$v^T w = \sum_{i=1}^{n} v_i w_i.$$
If the vectors have complex components, the dot product is defined to be
$$v^H w = \sum_{i=1}^{n} \overline{v_i}\, w_i,$$
where $\overline{v_i}$ is the complex conjugate of $v_i$.

Dot products can also be defined more generally and abstractly, and are useful throughout pure and applied mathematics. However, our interest here is the fact that many computations in scientific computing can be written in terms of dot products, and most modern computers have special circuitry and software to do dot products efficiently.

Example 3.4

matlab represents the transpose of a vector V as V'. Also, if A is an n by n matrix in matlab, the i-th row of A is accessed as A(i,:), while the j-th column of A is accessed as A(:,j). Continuing Example 3.3, we have the following matlab dialog, illustrating writing the product matrix C in terms of dot products.

>> C = [ A(1,:)*B(:,1), A(1,:)*B(:,2), A(1,:)*B(:,3)

A(2,:)*B(:,1), A(2,:)*B(:,2), A(2,:)*B(:,3)

A(3,:)*B(:,1), A(3,:)*B(:,2), A(3,:)*B(:,3)]

C =

12 12 12

24 27 30

39 44 49

>>

Matrix inverses are also useful in describing linear systems of equations:

DEFINITION 3.4 Suppose A is an n by n matrix. (That is, suppose A is square.) Then, A⁻¹ is the inverse of A if A⁻¹A = AA⁻¹ = I, where I is the n by n identity matrix, consisting of 1’s on the diagonal and 0’s in all off-diagonal elements. If A has an inverse, then A is said to be nonsingular or invertible.

Example 3.5

Continuing the matlab dialog from the previous examples, we have

>> Ainv = inv(A)

Ainv =

-0.66667 -1.33333 1.00000

-0.66667 3.66667 -2.00000

1.00000 -2.00000 1.00000

>> Ainv*A

ans =

1.00000 0.00000 -0.00000

0.00000 1.00000 -0.00000

-0.00000 0.00000 1.00000

>> A*Ainv

ans =

1.0000e+00 4.4409e-16 -4.4409e-16

1.1102e-16 1.0000e+00 -1.1102e-15

3.3307e-16 2.2204e-15 1.0000e+00

>> eye(3)


ans =

1 0 0

0 1 0

0 0 1

>> eye(3)-A*inv(A)

ans =

1.1102e-16 -4.4409e-16 4.4409e-16

-1.1102e-16 -8.8818e-16 1.1102e-15

-3.3307e-16 -2.2204e-15 2.3315e-15

>> eye(3)*B-B

ans =

0 0 0

0 0 0

0 0 0

>>

Above, observe that I · B = B. Also observe that the computed value of I − AA⁻¹ is not exactly the matrix consisting entirely of zeros, but is a matrix whose entries are small multiples of the machine epsilon for IEEE double precision floating point arithmetic.

Some matrices do not have inverses.

DEFINITION 3.5 A matrix that does not have an inverse is called a singular matrix. A matrix that does have an inverse is said to be nonsingular.

Singular matrices are analogous to the number zero when we are dealing with a single equation in a single unknown. In particular, if we have the system of equations Ax = b and A is nonsingular, it follows that x = A⁻¹b (since A⁻¹(Ax) = (A⁻¹A)x = Ix = x and A⁻¹(Ax) = A⁻¹b), just as, if ax = b and a ≠ 0, then x = (1/a)b.

Example 3.6

The matrix
$$\begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9 \end{pmatrix}$$
is singular. However, if we use matlab to try to find an inverse, we obtain:

>> A = [1 2 3;4 5 6;7 8 9]


A =

1 2 3

4 5 6

7 8 9

>> inv(A)

ans =

1.0e+016 *

-0.4504 0.9007 -0.4504

0.9007 -1.8014 0.9007

-0.4504 0.9007 -0.4504

>>

Observe that the matrix matlab gives for the inverse has large elements (on the order of the reciprocal of the machine epsilon $\epsilon_m \approx 1.11 \times 10^{-16}$ times the elements of A). This is due to roundoff error. It can be viewed as analogous to trying to form 1/a when a = 0 but, due to roundoff error (such as some cancellation error), a is instead a small number on the order of the machine epsilon $\epsilon_m$; then 1/a is on the order of $1/\epsilon_m$.

The following two definitions and theorem clarify which matrices are singular and clarify the relationship between singular matrices and solution of linear systems of equations involving those matrices.

DEFINITION 3.6 Let $\{v^{(i)}\}_{i=1}^{m}$ be m vectors. Then $\{v^{(i)}\}_{i=1}^{m}$ is said to be linearly independent provided that $\sum_{i=1}^{m} \beta_i v^{(i)} = 0$ implies $\beta_i = 0$ for $i = 1, 2, \cdots, m$.

Example 3.7

Let
$$a_1 = \begin{pmatrix} 1\\ 2\\ 3 \end{pmatrix}, \quad a_2 = \begin{pmatrix} 4\\ 5\\ 6 \end{pmatrix}, \quad\text{and}\quad a_3 = \begin{pmatrix} 7\\ 8\\ 9 \end{pmatrix}$$
be the rows of the matrix A from Example 3.6 (expressed as column vectors). Then
$$a_1 - 2a_2 + a_3 = 0,$$
so a1, a2, and a3 are linearly dependent. In particular, the third row of A is two times the second row minus the first row.

DEFINITION 3.7 The rank of a matrix A, rank(A), is the maximum number of linearly independent rows it possesses. It can be shown that this is the same as the maximum number of linearly independent columns. If A is an m by n matrix and rank(A) = min{m, n}, then A is said to be of full rank. For example, if m < n and the rows of A are linearly independent, then A is of full rank.


The following theorem deals with rank, nonsingularity, and solutions to systems of equations.

THEOREM 3.1

Let A be an n × n matrix ($A \in L(\mathbb{C}^n)$). Then the following are equivalent:

1. A is nonsingular.

2. det(A) ≠ 0, where det(A) is the determinant¹ of the matrix A.

3. The linear system Ax = 0 has only the solution x = 0.

4. For any $b \in \mathbb{C}^n$, the linear system Ax = b has a unique solution.

5. The columns (and rows) of A are linearly independent. (That is, A is of full rank, i.e., rank(A) = n.)

When the matrices for a system of equations have special properties, we can often use these properties to take shortcuts in the computation to solve corresponding systems of equations, or to know that roundoff error will not accumulate when solving such systems. Symmetry and positive definiteness are important properties for these purposes.

DEFINITION 3.8 If AT = A, then A is said to be symmetric. IfAH = A, then A is said to be Hermitian.

Example 3.8

If $A = \begin{pmatrix} 1 & 2-i\\ 2+i & 3 \end{pmatrix}$, then $A^H = \begin{pmatrix} 1 & 2-i\\ 2+i & 3 \end{pmatrix} = A$, so A is Hermitian.

DEFINITION 3.9 If A is an n by n matrix with real entries, if $A^T = A$ and $x^T A x > 0$ for any $x \in \mathbb{R}^n$ except x = 0, then A is said to be symmetric positive definite. If A is an n by n matrix with complex entries, if $A^H = A$ and $x^H A x > 0$ for $x \in \mathbb{C}^n$, $x \neq 0$, then A is said to be Hermitian positive definite. Similarly, if $x^T A x \geq 0$ (for a real matrix A) or $x^H A x \geq 0$ (for a complex matrix A) for every $x \neq 0$, we say that A is symmetric positive semi-definite or Hermitian positive semi-definite, respectively.

¹We will not give a formal definition of determinant here, but we will use determinants’ properties. Determinants are generally defined well in a good linear algebra course. We explain a good way of computing determinants in Section 3.2.3 on page 86. When computing the determinant of small matrices symbolically, expansion by minors is often used.


Example 3.9

If $A = \begin{pmatrix} 4 & 1\\ 1 & 3 \end{pmatrix}$, then $A^T = A$, so A is symmetric. Also,
$$x^T A x = 4x_1^2 + 2x_1x_2 + 3x_2^2 = 3x_1^2 + (x_1 + x_2)^2 + 2x_2^2 > 0 \quad\text{for } x \neq 0.$$
Thus, A is symmetric positive definite.

Prior to studying actual methods for analyzing systems of linear equations, we introduce the following concepts.

DEFINITION 3.10 If $v = (v_1, \ldots, v_n)^T$ is a vector and λ is a number, we define scalar multiplication w = λv by $w_i = \lambda v_i$, that is, we multiply each component of v by λ. We say that we have scaled v by λ. We can similarly scale a matrix.

Example 3.10

Observe the following matlab dialog.

>> v = [1;-1;2]

v =

1

-1

2

>> lambda = 3

lambda =

3

>> lambda*v

ans =

3

-3

6

>>

DEFINITION 3.11 If A is an n × n matrix, a scalar λ and a nonzero x are an eigenvalue and eigenvector of A if Ax = λx.

DEFINITION 3.12
$$\rho(A) = \max_{1\leq i\leq n} |\lambda_i|,$$
where $\{\lambda_i\}_{i=1}^{n}$ is the set of eigenvalues of A, is called the spectral radius of A.

Example 3.11

The matlab function eig computes eigenvalues and eigenvectors. Consider the following matlab dialog.


>> A = [1,2,3

4 5 6

7 8 10]

A =

1 2 3

4 5 6

7 8 10

>> [V,Lambda] = eig(A)

V =

-0.2235 -0.8658 0.2783

-0.5039 0.0857 -0.8318

-0.8343 0.4929 0.4802

Lambda =

16.7075 0 0

0 -0.9057 0

0 0 0.1982

>> A*V(:,1) - Lambda(1,1)*V(:,1)

ans =

1.0e-014 *

0.0444

-0.1776

0.1776

>> A*V(:,2) - Lambda(2,2)*V(:,2)

ans =

1.0e-014 *

0.0777

0.1985

0.0944

>> A*V(:,3) - Lambda(3,3)*V(:,3)

ans =

1.0e-014 *

0.0666

-0.0444

0.2109

>>

Note that the eigenvectors of the matrix A are stored in the columns of V, while corresponding eigenvalues are stored in the diagonal entries of the diagonal matrix Lambda. In this case, the spectral radius is ρ(A) ≈ 16.7075.

Although we won’t study computation of eigenvalues and eigenvectors until Chapter 5, we refer to the concept in this chapter.

With these facts and concepts, we can now study the actual solution of systems of equations on computers.


3.2 Gaussian Elimination

We can think of Gaussian elimination as a process of repeatedly adding a multiple of one equation to another equation to transform the system of equations into one that is easy to solve. We first focus on these elementary row operations.

DEFINITION 3.13 Consider a linear system of equations Ax = b, where A is n × n and $b, x \in \mathbb{R}^n$. Elementary row operations on a system of linear equations are of the following three types:

1. interchanging two equations,

2. multiplying an equation by a nonzero number,

3. adding to one equation a scalar multiple of another equation.

THEOREM 3.2

If system Bx = d is obtained from system Ax = b by a finite sequence of elementary operations, then the two systems have the same solutions.

(A proof of Theorem 3.2 can be found in elementary texts on linear algebra and can be done, for example, with Theorem 3.1 and using elementary properties of determinants.)

The idea underlying Gaussian elimination is simple:

1. Subtract multiples of the first equation from the second through the n-th equations to eliminate x1 from the second through n-th equations.

2. Then, subtract multiples of the new second equation from the third through n-th equations to eliminate x2 from these. After this step, the third through n-th equations contain neither x1 nor x2.

3. Continue this process until the resulting n-th equation contains only xn, the resulting (n−1)-st equation contains only xn and xn−1, etc.

4. Solve the resulting n-th equation for xn.

5. Plug the value for xn into the resulting (n − 1)-st equation, and solve that equation for xn−1.

6. Continue this back-substitution process until we have solved for x1 in the first equation.


Example 3.12

We will apply this process to the system in Example 3.1. In illustrating the process, we can write the original system and transformed systems as an augmented matrix, with a number’s position in the matrix telling us to which variable (or right-hand side) and which equation it belongs. The original system is thus written as
$$\left(\begin{array}{ccc|c} 1 & 2 & 3 & -1\\ 4 & 5 & 6 & 0\\ 7 & 8 & 10 & 1 \end{array}\right).$$

We will use ∼ to denote that two systems of equations are equivalent, and we will indicate below this symbol which multiples are subtracted: For example, $R_3 \leftarrow R_3 - 2R_2$ would mean that we replace the third row (i.e., the third equation) by the third equation minus two times the second equation. The Gaussian elimination process then proceeds as follows.

$$\left(\begin{array}{ccc|c} 1 & 2 & 3 & -1\\ 4 & 5 & 6 & 0\\ 7 & 8 & 10 & 1 \end{array}\right)
\underset{\substack{R_2 \leftarrow R_2 - 4R_1\\ R_3 \leftarrow R_3 - 7R_1}}{\sim}
\left(\begin{array}{ccc|c} 1 & 2 & 3 & -1\\ 0 & -3 & -6 & 4\\ 0 & -6 & -11 & 8 \end{array}\right)
\underset{R_3 \leftarrow R_3 - 2R_2}{\sim}
\left(\begin{array}{ccc|c} 1 & 2 & 3 & -1\\ 0 & -3 & -6 & 4\\ 0 & 0 & 1 & 0 \end{array}\right).$$

The transformed third equation now reads “x3 = 0,” while the transformed second equation reads “−3x2 − 6x3 = 4.” Plugging x3 = 0 into the transformed second equation thus gives
$$x_2 = (4 + 6x_3)/(-3) = 4/(-3) = -\frac{4}{3}.$$
Similarly, plugging x3 = 0 and x2 = −4/3 into the transformed first equation gives
$$x_1 = -1 - 2x_2 - 3x_3 = -1 - 2(-4/3) = 5/3.$$

The solution vector is thus
$$\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix} = \begin{pmatrix} 5/3\\ -4/3\\ 0 \end{pmatrix}.$$

We check by computing the residual:
$$Ax - b = \begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 10 \end{pmatrix}\begin{pmatrix} 5/3\\ -4/3\\ 0 \end{pmatrix} - \begin{pmatrix} -1\\ 0\\ 1 \end{pmatrix} = \begin{pmatrix} -1\\ 0\\ 1 \end{pmatrix} - \begin{pmatrix} -1\\ 0\\ 1 \end{pmatrix} = \begin{pmatrix} 0\\ 0\\ 0 \end{pmatrix}.$$


Example 3.13

If we had used floating point arithmetic in Example 3.12, 5/3 and −4/3 would not have been exactly representable, and the residual would not have been exactly zero. In fact, a variant² of Gaussian elimination with back-substitution is programmed in matlab and accessible with the backslash (\) operator:

>> A = [1 2 3

4 5 6

7 8 10]

A =

1 2 3

4 5 6

7 8 10

>> b = [-1;0;1]

b =

-1

0

1

>> x = A\b

x =

1.6667

-1.3333

-0.0000

>> A*x-b

ans =

1.0e-015 *

0.2220

0.8882

0.8882

>>

3.2.1 The Gaussian Elimination Algorithm

Following the pattern in the examples we have presented, we can write down the process in general. The system will be written
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1,\\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2,\\ &\ \ \vdots\\ a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n. \end{aligned}$$

²Using partial pivoting, which we will see later.


Now, the transformed matrix
$$\begin{pmatrix} 1 & 2 & 3\\ 0 & -3 & -6\\ 0 & 0 & 1 \end{pmatrix}$$
from Example 3.12 (page 80) is termed an upper triangular matrix, since it has zeros in all entries below the diagonal. The goal of Gaussian elimination is to reduce A to an upper triangular matrix through a sequence of elementary row operations as in Definition 3.13. We will call the transformed matrix before working on the r-th column $A^{(r)}$, with associated right-hand-side vector $b^{(r)}$, and we begin with $A^{(1)} = A = (a^{(1)}_{ij})$ and $b = b^{(1)} = (b^{(1)}_1, b^{(1)}_2, \cdots, b^{(1)}_n)^T$, with $A^{(1)}x = b^{(1)}$. The process can then be described as follows.

Step 1: Assume that $a^{(1)}_{11} \neq 0$. (Otherwise, the nonsingularity of A guarantees that the rows of A can be interchanged in such a way that the new $a^{(1)}_{11}$ is nonzero.) Let
$$m_{i1} = \frac{a^{(1)}_{i1}}{a^{(1)}_{11}}, \quad 2 \leq i \leq n.$$

Now multiply the first equation of $A^{(1)}x = b^{(1)}$ by $m_{i1}$ and subtract the result from the i-th equation. Repeat this for each i, $2 \leq i \leq n$. As a result, we obtain $A^{(2)}x = b^{(2)}$, where
$$A^{(2)} = \begin{pmatrix} a^{(1)}_{11} & a^{(1)}_{12} & \cdots & a^{(1)}_{1n}\\ 0 & a^{(2)}_{22} & \cdots & a^{(2)}_{2n}\\ \vdots & \vdots & & \vdots\\ 0 & a^{(2)}_{n2} & \cdots & a^{(2)}_{nn} \end{pmatrix} \quad\text{and}\quad b^{(2)} = \begin{pmatrix} b^{(1)}_1\\ b^{(2)}_2\\ \vdots\\ b^{(2)}_n \end{pmatrix}.$$

Step 2: We consider the (n−1) × (n−1) submatrix $\tilde{A}^{(2)}$ of $A^{(2)}$ defined by $\tilde{A}^{(2)} = (a^{(2)}_{ij})$, $2 \leq i, j \leq n$. We eliminate the first column of $\tilde{A}^{(2)}$ in a manner identical to the procedure for $A^{(1)}$. The result is the system $A^{(3)}x = b^{(3)}$, where $A^{(3)}$ has the form
$$A^{(3)} = \begin{pmatrix} a^{(1)}_{11} & a^{(1)}_{12} & \cdots & \cdots & a^{(1)}_{1n}\\ 0 & a^{(2)}_{22} & \cdots & \cdots & a^{(2)}_{2n}\\ \vdots & 0 & a^{(3)}_{33} & \cdots & a^{(3)}_{3n}\\ \vdots & \vdots & \vdots & & \vdots\\ 0 & 0 & a^{(3)}_{n3} & \cdots & a^{(3)}_{nn} \end{pmatrix}.$$


Steps 3 to n−1: The process continues as above, where at the k-th stage we have $A^{(k)}x = b^{(k)}$, $1 \leq k \leq n-1$, where
$$A^{(k)} = \begin{pmatrix} a^{(1)}_{11} & \cdots & & & & a^{(1)}_{1n}\\ 0 & a^{(2)}_{22} & & & & \vdots\\ \vdots & 0 & \ddots & & & \\ & & & a^{(k)}_{kk} & \cdots & a^{(k)}_{kn}\\ & & & \vdots & & \vdots\\ 0 & \cdots & 0 & a^{(k)}_{nk} & \cdots & a^{(k)}_{nn} \end{pmatrix} \quad\text{and}\quad b^{(k)} = \begin{pmatrix} b^{(1)}_1\\ b^{(2)}_2\\ \vdots\\ b^{(k-1)}_{k-1}\\ b^{(k)}_k\\ \vdots\\ b^{(k)}_n \end{pmatrix}. \qquad (3.1)$$
For every i, $k+1 \leq i \leq n$, the k-th equation is multiplied by
$$m_{ik} = a^{(k)}_{ik}/a^{(k)}_{kk}$$
and subtracted from the i-th equation. (We assume, if necessary, a row is interchanged so that $a^{(k)}_{kk} \neq 0$.) After step k = n−1, the resulting system is $A^{(n)}x = b^{(n)}$, where $A^{(n)}$ is upper triangular.

On a computer, this algorithm can be programmed as:

ALGORITHM 3.1

(Gaussian elimination, forward phase)

INPUT: the n by n matrix A and the n-vector $b \in \mathbb{R}^n$.
OUTPUT: $A^{(n)}$ and $b^{(n)}$.

FOR k = 1, 2, · · · , n − 1
  FOR i = k + 1, · · · , n
    (a) $m_{ik} \leftarrow a_{ik}/a_{kk}$.
    (b) FOR j = k, k + 1, · · · , n
          $a_{ij} \leftarrow a_{ij} - m_{ik}a_{kj}$.
        END FOR
    (c) $b_i \leftarrow b_i - m_{ik}b_k$.
  END FOR
END FOR
END ALGORITHM 3.1.


Note: In Step (b) of Algorithm 3.1, we need only do the loop for j = k + 1 to n, since we know that the resulting $a_{ik}$ will equal 0.

Back solving can be programmed as:

ALGORITHM 3.2

(Gaussian elimination, back solving phase)

INPUT: $A^{(n)}$ and $b^{(n)}$ from Algorithm 3.1.
OUTPUT: $x \in \mathbb{R}^n$ as a solution to Ax = b.

1. $x_n \leftarrow b_n/a_{nn}$.

2. FOR k = n − 1, n − 2, · · · , 1
     $x_k \leftarrow \left(b_k - \sum_{j=k+1}^{n} a_{kj}x_j\right)\big/a_{kk}$.
   END FOR

END ALGORITHM 3.2.
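The two phases can be sketched together in matlab as follows; this is a minimal illustration of Algorithms 3.1 and 3.2 (no pivoting, so the pivots akk are assumed nonzero), written for clarity rather than efficiency:

function x = gauss_elim(A, b)
% Forward phase (Algorithm 3.1): reduce A to upper triangular form.
n = length(b);
for k = 1:n-1
   for i = k+1:n
      m = A(i,k)/A(k,k);
      A(i,k:n) = A(i,k:n) - m*A(k,k:n);
      b(i) = b(i) - m*b(k);
   end
end
% Back solving phase (Algorithm 3.2).
x = zeros(n,1);
x(n) = b(n)/A(n,n);
for k = n-1:-1:1
   x(k) = (b(k) - A(k,k+1:n)*x(k+1:n))/A(k,k);
end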

Note: To solve Ax = b using Gaussian elimination requires $\frac{1}{3}n^3 + O(n^2)$ multiplications and divisions. (See Exercise 4 on page 142.)

3.2.2 The LU decomposition

We now explain how the Gaussian elimination process we have just presented can be viewed as finding a lower triangular matrix L (i.e., a matrix with zeros above the diagonal) and an upper triangular matrix U such that A = LU. Assume first that no row interchanges are performed in Gaussian elimination. Let
$$M^{(1)} = \begin{pmatrix} 1 & 0 & \cdots & 0\\ -m_{21} & & & \\ -m_{31} & & I_{n-1} & \\ \vdots & & & \\ -m_{n1} & & & \end{pmatrix},$$
where $I_{n-1}$ is the (n−1) × (n−1) identity matrix, and where $m_{i1}$, $2 \leq i \leq n$, are defined in Gaussian elimination. Then $A^{(2)} = M^{(1)}A^{(1)}$ and $b^{(2)} = M^{(1)}b^{(1)}$.


At the r-th stage of the Gaussian elimination process,
$$M^{(r)} = \begin{pmatrix} I_{r-1} & 0 & 0 & \cdots & 0\\ 0 & 1 & 0 & \cdots & 0\\ 0 & -m_{r+1,r} & & & \\ \vdots & \vdots & & I_{n-r} & \\ 0 & -m_{n,r} & & & \end{pmatrix}, \qquad (3.2)$$
where the 1 on the diagonal is in the r-th row. Also,
$$(M^{(r)})^{-1} = \begin{pmatrix} I_{r-1} & 0 & 0 & \cdots & 0\\ 0 & 1 & 0 & \cdots & 0\\ 0 & m_{r+1,r} & & & \\ \vdots & \vdots & & I_{n-r} & \\ 0 & m_{n,r} & & & \end{pmatrix}, \qquad (3.3)$$

where $m_{ir}$, $r+1 \leq i \leq n$, are given in the Gaussian elimination process, and $A^{(r+1)} = M^{(r)}A^{(r)}$ and $b^{(r+1)} = M^{(r)}b^{(r)}$. (Note: We are assuming here that $a^{(r)}_{rr} \neq 0$ and no row interchanges are required.) Collecting the above results, we obtain $A^{(n)}x = b^{(n)}$, where
$$A^{(n)} = M^{(n-1)}M^{(n-2)}\cdots M^{(1)}A^{(1)} \quad\text{and}\quad b^{(n)} = M^{(n-1)}M^{(n-2)}\cdots M^{(1)}b^{(1)}.$$
Recalling that $A^{(n)}$ is upper triangular and setting $A^{(n)} = U$, we have
$$A = \big(M^{(n-1)}M^{(n-2)}\cdots M^{(1)}\big)^{-1}U. \qquad (3.4)$$

Example 3.14

Following the Gaussian elimination process from Example 3.12, we have
$$M^{(1)} = \begin{pmatrix} 1 & 0 & 0\\ -4 & 1 & 0\\ -7 & 0 & 1 \end{pmatrix}, \quad M^{(2)} = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & -2 & 1 \end{pmatrix},$$
$$(M^{(1)})^{-1} = \begin{pmatrix} 1 & 0 & 0\\ 4 & 1 & 0\\ 7 & 0 & 1 \end{pmatrix}, \quad (M^{(2)})^{-1} = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 2 & 1 \end{pmatrix},$$
and A = LU, with
$$L = (M^{(1)})^{-1}(M^{(2)})^{-1} = \begin{pmatrix} 1 & 0 & 0\\ 4 & 1 & 0\\ 7 & 2 & 1 \end{pmatrix}, \quad U = \begin{pmatrix} 1 & 2 & 3\\ 0 & -3 & -6\\ 0 & 0 & 1 \end{pmatrix}.$$
Applying the Gaussian elimination process to b can be viewed as solving Ly = b for y, then solving Ux = y for x. Solving Ly = b involves forming $M^{(1)}b$, then forming $M^{(2)}(M^{(1)}b)$, while solving Ux = y is simply the back-substitution process.


Note: The product of two lower triangular matrices is lower triangular, and the inverse of a nonsingular lower triangular matrix is lower triangular. Thus, $L = (M^{(1)})^{-1}(M^{(2)})^{-1}\cdots(M^{(n-1)})^{-1}$ is lower triangular. Hence, A = LU, i.e., A is expressed as a product of lower and upper triangular matrices. The result is called the LU decomposition (also known as the LU factorization, triangular factorization, or triangular decomposition) of A. The final matrices L and U are given by:
$$L = \begin{pmatrix} 1 & 0 & \cdots & & 0\\ m_{21} & 1 & 0 & \cdots & 0\\ m_{31} & m_{32} & 1 & \cdots & 0\\ \vdots & \vdots & & \ddots & \\ m_{n1} & m_{n2} & \cdots & m_{n,n-1} & 1 \end{pmatrix} \quad\text{and}\quad U = \begin{pmatrix} a^{(1)}_{11} & a^{(1)}_{12} & \cdots & a^{(1)}_{1n}\\ 0 & a^{(2)}_{22} & \cdots & a^{(2)}_{2n}\\ \vdots & & \ddots & \vdots\\ 0 & \cdots & 0 & a^{(n)}_{nn} \end{pmatrix}. \qquad (3.5)$$
This decomposition can be so formed when no row interchanges are required. Thus, the original problem Ax = b is transformed into LUx = b.

Note: Since computing the LU decomposition of A is done by Gaussian elimination, it requires $O(n^3)$ operations. However, if L and U are already available, computing y with Ly = b and then computing x with Ux = y requires only $O(n^2)$ operations.

Note: In some software, the multiplying factors, that is, the nonzero off-diagonal elements of L, are stored in the locations of corresponding entries of A that are made equal to zero, thus obviating the need for extra storage. Effectively, such software returns the elements of L and U in the same array that was used to store A.
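As a quick check, one can compare with matlab’s built-in lu function. Note that lu uses partial pivoting (discussed in Section 3.2.4), so its factors need not match the no-interchange factors of Example 3.14, but the relation P*A = L*U still holds:

>> A = [1 2 3; 4 5 6; 7 8 10];
>> [L, U, P] = lu(A);
>> norm(P*A - L*U)   % zero, up to roundoff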

3.2.3 Determinants and Inverses

Usually, the solution x of a system of equations Ax = b is desired, and the determinant det(A) is not of interest, even though one method of computing x, Cramer’s rule, involves first computing determinants. (In fact, computing x with Gaussian elimination with back substitution is more efficient than using Cramer’s rule, and is definitely more practical for large n.) However, occasionally the determinant of A is desired for other reasons. An efficient way of computing the determinant of a matrix is with Gaussian elimination. If A = LU, then
$$\det(A) = \det(L)\det(U) = \prod_{j=1}^{n} a^{(j)}_{jj}.$$

(Using expansion by minors to compute the determinant requires O(n!) multiplications.)

Similarly, even though we could in principle compute A⁻¹ and then compute x = A⁻¹b, computing A⁻¹ is less efficient than applying Gaussian elimination with back-substitution. However, if we need A⁻¹ for some other reason, we can compute it relatively efficiently by solving the n systems $Ax^{(j)} = e^{(j)}$, where $e^{(j)}_i = \delta_{ij}$, with $\delta_{ij}$ the Kronecker delta function defined by
$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j,\\ 0 & \text{if } i \neq j. \end{cases}$$
If A = LU, we perform n pairs of forward and backward solves, to obtain $A^{-1} = (x^{(1)}, x^{(2)}, \cdots, x^{(n)})$.
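In matlab, this can be sketched as follows, assuming the built-in lu function (P*A = L*U with partial pivoting); det(P) = ±1 accounts for the row interchanges:

A = [1 2 3; 4 5 6; 7 8 10];
[L, U, P] = lu(A);
detA = det(P) * prod(diag(U));  % determinant from the LU factors
n = size(A, 1);
Ainv = zeros(n);
I = eye(n);
for j = 1:n
   y = L \ (P * I(:, j));       % forward solve L*y = P*e_j
   Ainv(:, j) = U \ y;          % back solve U*x = y gives column j
end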

Example 3.15

In Example 3.14, for
$$A = \begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 10 \end{pmatrix},$$
we used Gaussian elimination to obtain A = LU, with
$$L = \begin{pmatrix} 1 & 0 & 0\\ 4 & 1 & 0\\ 7 & 2 & 1 \end{pmatrix} \quad\text{and}\quad U = \begin{pmatrix} 1 & 2 & 3\\ 0 & -3 & -6\\ 0 & 0 & 1 \end{pmatrix}.$$
Thus,
$$\det(A) = u_{11}u_{22}u_{33} = (1)(-3)(1) = -3.$$
We now compute A⁻¹: Using L and U to solve
$$Ax^{(1)} = \begin{pmatrix} 1\\ 0\\ 0 \end{pmatrix} \text{ gives } x^{(1)} = \begin{pmatrix} -2/3\\ -2/3\\ 1 \end{pmatrix},$$
solving
$$Ax^{(2)} = \begin{pmatrix} 0\\ 1\\ 0 \end{pmatrix} \text{ gives } x^{(2)} = \begin{pmatrix} -4/3\\ 11/3\\ -2 \end{pmatrix},$$
and solving
$$Ax^{(3)} = \begin{pmatrix} 0\\ 0\\ 1 \end{pmatrix} \text{ gives } x^{(3)} = \begin{pmatrix} 1\\ -2\\ 1 \end{pmatrix}.$$
Thus,
$$A^{-1} = \begin{pmatrix} -2/3 & -4/3 & 1\\ -2/3 & 11/3 & -2\\ 1 & -2 & 1 \end{pmatrix}, \qquad AA^{-1} = I = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}.$$


3.2.4 Pivoting in Gaussian Elimination

In our explanation and examples of Gaussian elimination so far, we have assumed that “no row interchanges are required.” In particular, we must have $a_{kk} \neq 0$ in each step of Algorithm 3.1. Otherwise, we may need to do a “row interchange,” that is, we may need to rearrange the order of the transformed equations. We have two questions:

1. When can Gaussian elimination be performed without row interchanges?

2. If row interchanges are employed, can Gaussian elimination always be performed?

THEOREM 3.3

(Existence of an LU factorization) Assume that the n × n matrix A is nonsingular. Then A = LU if and only if all the leading principal submatrices of A are nonsingular.³ Moreover, the LU decomposition is unique if we require that the diagonal elements of L all equal 1.

REMARK 3.1 Two important types of matrices that have nonsingular leading principal submatrices are symmetric positive definite matrices and strictly diagonally dominant matrices, i.e., those with
$$|a_{ii}| > \sum_{\substack{j=1\\ j\neq i}}^{n} |a_{ij}|, \quad\text{for } i = 1, 2, \cdots, n.$$

We now consider our second question, “If row interchanges are employed, can Gaussian elimination be performed for any nonsingular A?” Switching the rows of a matrix A can be done by multiplying A on the left by a permutation matrix:

DEFINITION 3.14 A permutation matrix P is a matrix whose columns consist of the n different vectors $e_j$, $1 \leq j \leq n$, in any order.

³The leading principal submatrices of A have the form
$$\begin{pmatrix} a_{11} & \cdots & a_{1k}\\ \vdots & \ddots & \vdots\\ a_{k1} & \cdots & a_{kk} \end{pmatrix} \quad\text{for } k = 1, 2, \cdots, n.$$


Example 3.16

$$P = (e_1, e_3, e_4, e_2) = \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 \end{pmatrix}$$
is a permutation matrix such that the first row of PA is the first row of A, the second row of PA is the fourth row of A, the third row of PA is the second row of A, and the fourth row of PA is the third row of A. Note that the permutation of the columns of the identity matrix in P corresponds to the permutation of the rows of A. For example,
$$\begin{pmatrix} 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0 \end{pmatrix}\begin{pmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 10 \end{pmatrix} = \begin{pmatrix} 4 & 5 & 6\\ 7 & 8 & 10\\ 1 & 2 & 3 \end{pmatrix}.$$
Thus, by proper choice of P, any two or more rows can be interchanged.

Note: detP = ±1, since P is obtained from I by row interchanges.

Now, Gaussian elimination with row interchanges can be performed by the following matrix operations:⁴
$$A^{(n)} = M^{(n-1)}P^{(n-1)}M^{(n-2)}P^{(n-2)}\cdots M^{(2)}P^{(2)}M^{(1)}P^{(1)}A,$$
$$b^{(n)} = M^{(n-1)}P^{(n-1)}\cdots M^{(2)}P^{(2)}M^{(1)}P^{(1)}b.$$
It follows that $U = \tilde{L}A^{(1)}$, where $\tilde{L}$ is no longer lower triangular. However, if we perform all the row interchanges first, at once, then
$$M^{(n-1)}\cdots M^{(1)}PAx = M^{(n-1)}M^{(n-2)}\cdots M^{(1)}Pb,$$
or
$$\tilde{L}PAx = \tilde{L}Pb,$$
so
$$\tilde{L}PA = U.$$
Thus,
$$PA = \tilde{L}^{-1}U = LU.$$

We can state these facts as follows.

⁴When implementing Gaussian elimination, we usually don’t actually multiply full n by n matrices together, since this is not efficient. However, viewing the process as matrix multiplications has advantages when we analyze it.


THEOREM 3.4

If A is a nonsingular n × n matrix, then there is a permutation matrix P such that PA = LU, where L is lower triangular and U is upper triangular. (Note: det(PA) = ± det(A) = det(L) det(U).)

We now examine the actual operations we do to complete the Gaussian elimination process with row interchanges (known as pivoting).

Example 3.17

Consider the system

0.0001x1 + x2 = 1

x1 + x2 = 2.

The exact solution of this system is x1 ≈ 1.00010 and x2 ≈ 0.99990. Let us solve the system using Gaussian elimination without row interchanges. We will assume calculations are performed using three-digit rounding decimal arithmetic. We obtain

$$m_{21} \leftarrow \frac{a^{(1)}_{21}}{a^{(1)}_{11}} \approx 0.1 \times 10^5,$$
$$a^{(2)}_{22} \leftarrow a^{(1)}_{22} - m_{21}a^{(1)}_{12} \approx 0.1 \times 10^1 - 0.1 \times 10^5 \approx -0.100 \times 10^5.$$
Also, $b^{(2)} \approx (0.1 \times 10^1,\ -0.1 \times 10^5)^T$, so the computed (approximate) upper triangular system is
$$\begin{aligned} 0.1 \times 10^{-3}\,x_1 + 0.1 \times 10^1\, x_2 &= 0.1 \times 10^1,\\ -0.1 \times 10^5\, x_2 &= -0.1 \times 10^5, \end{aligned}$$
whose solutions are x2 = 1 and x1 = 0. If, instead, we first interchange the equations so that $a^{(1)}_{11} = 1$, we find that x1 = x2 = 1, correct to the accuracy used.

Example 3.17 illustrates that small values of $a^{(r)}_{rr}$ in the r-th stage lead to large values of the $m_{ir}$’s and may result in a loss of accuracy. Therefore, we want the pivots $a^{(r)}_{rr}$ to be large.

Two common pivoting strategies are:

Partial pivoting: In partial pivoting, the elements $a^{(r)}_{ir}$, $r \leq i \leq n$, in the r-th column of $A^{(r)}$ are searched to find the element of largest absolute value, and row interchanges are made to place that element in the pivot position.

Full pivoting: In full pivoting, the pivot element is selected as the element $a^{(r)}_{ij}$, $r \leq i, j \leq n$, of maximum absolute value among all elements of the trailing submatrix of $A^{(r)}$. This strategy requires row and column interchanges.

In theory, full pivoting is required in general to assure that the process does not result in excessive roundoff error. However, partial pivoting is adequate in most cases. For some classes of matrices, no pivoting strategy is required for a stable elimination procedure. For example, no pivoting is required for a real symmetric positive definite matrix or for a strictly diagonally dominant matrix [41].

We now present a formal algorithm for Gaussian elimination with partial pivoting. In reading this algorithm, recall that
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1,\\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2,\\ &\ \ \vdots\\ a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n. \end{aligned}$$

ALGORITHM 3.3

(Solution of a linear system of equations with Gaussian elimination with partial pivoting and back-substitution)

INPUT: The n by n matrix A and right-hand-side vector b.
OUTPUT: An approximate solution⁵ x to Ax = b.

FOR k = 1, 2, · · · , n − 1

  1. Find ℓ such that $|a_{\ell k}| = \max_{k\leq j\leq n}|a_{jk}|$ (k ≤ ℓ ≤ n).

  2. Interchange row k with row ℓ:
     $c_j \leftarrow a_{kj}$, $a_{kj} \leftarrow a_{\ell j}$, $a_{\ell j} \leftarrow c_j$ for j = 1, 2, . . . , n, and
     $d \leftarrow b_k$, $b_k \leftarrow b_\ell$, $b_\ell \leftarrow d$.

  3. FOR i = k + 1, · · · , n
       (a) $m_{ik} \leftarrow a_{ik}/a_{kk}$.
       (b) FOR j = k, k + 1, · · · , n
             $a_{ij} \leftarrow a_{ij} - m_{ik}a_{kj}$.
           END FOR
       (c) $b_i \leftarrow b_i - m_{ik}b_k$.
     END FOR

END FOR

4. Back-substitution:

   (a) $x_n \leftarrow b_n/a_{nn}$, and
   (b) $x_k \leftarrow \left(b_k - \sum_{j=k+1}^{n} a_{kj}x_j\right)\big/a_{kk}$, for k = n − 1, n − 2, · · · , 1.

END ALGORITHM 3.3.

⁵Approximate because of roundoff error.

REMARK 3.2 In Algorithm 3.3, the computations are arranged “serially,” that is, they are arranged so each individual addition and multiplication is done separately. However, it is efficient on modern machines, which have “pipelined” operations and usually also have more than one processor, to think of the operations as being done on vectors. Furthermore, we don’t necessarily need to interchange entire rows, but can just keep track of a set of indices indicating which rows are interchanged; for large systems, this saves a significant number of storage and retrieval operations. For views of the Gaussian elimination process in terms of vector operations, see [16]. For an example of software that takes account of the way machines are built, see [5].

REMARK 3.3 If U is the upper triangular matrix resulting from Gaussian elimination with partial pivoting, we have
$$\det(A) = (-1)^K \det(U) = (-1)^K a^{(1)}_{11} a^{(2)}_{22} \cdots a^{(n)}_{nn},$$
where K is the number of row interchanges made.

3.2.5 Systems with a Special Structure

We now consider some special but commonly encountered kinds of matrices.

3.2.5.1 Symmetric, Positive Definite Matrices

We first characterize positive definite matrices.

THEOREM 3.5

Let A be a real symmetric n × n matrix. Then A is positive definite if and only if there exists an invertible lower triangular matrix L such that $A = LL^T$. Furthermore, we can choose the diagonal elements of L, $\ell_{ii}$, $1 \leq i \leq n$, to be positive numbers.


The decomposition with positive $\ell_{ii}$ is called the Cholesky factorization of A. It can be shown that this decomposition is unique. L can be computed using a variant of Gaussian elimination. Set $\ell_{11} = \sqrt{a_{11}}$ and $\ell_{j1} = a_{j1}/\sqrt{a_{11}}$ for $2 \leq j \leq n$. (Note that $x^T A x > 0$ and the choice $x = e_j$ imply that $a_{jj} > 0$.) Then, for $i = 1, 2, 3, \cdots, n$, set
$$\ell_{ii} = \left[a_{ii} - \sum_{k=1}^{i-1} (\ell_{ik})^2\right]^{\frac{1}{2}},$$
$$\ell_{ji} = \frac{1}{\ell_{ii}}\left[a_{ji} - \sum_{k=1}^{i-1} \ell_{ik}\ell_{jk}\right] \quad\text{for } i+1 \leq j \leq n.$$

If A is real symmetric and L can be computed in this way, then A is positive definite. (This is an efficient way to show positive definiteness.) To solve Ax = b where A is real symmetric positive definite, L can be formed in this way, and the pair Ly = b and $L^T x = y$ can be solved for x, analogously to the way we use the LU decomposition to solve a system.

Note: The multiplication and division count for the Cholesky decomposition is
$$n^3/6 + O(n^2).$$
Thus, for large n, about 1/2 the multiplications and divisions are required, compared to standard Gaussian elimination.
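The recurrence above can be sketched in matlab as follows; this is a minimal illustration (no pivoting, A assumed real symmetric), not a replacement for the built-in chol:

function L = cholesky(A)
% Compute lower triangular L with positive diagonal so that A = L*L'.
n = size(A,1);
L = zeros(n);
for i = 1:n
   s = A(i,i) - sum(L(i,1:i-1).^2);
   if s <= 0, error('A is not positive definite'); end
   L(i,i) = sqrt(s);                 % diagonal entry l_ii
   for j = i+1:n
      L(j,i) = (A(j,i) - L(j,1:i-1)*L(i,1:i-1)')/L(i,i);
   end
end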

Example 3.18

Consider solving approximately
$$x''(t) = -\sin(\pi t), \quad x(0) = x(1) = 0.$$
One technique of approximately solving this equation is to replace x′′ in the differential equation by
$$x''(t) \approx \frac{x(t+h) - 2x(t) + x(t-h)}{h^2}. \qquad (3.6)$$

If we subdivide the interval [0, 1] into four subintervals, then the end points of these subintervals are t0 = 0, t1 = 1/4, t2 = 1/2, t3 = 3/4, and t4 = 1. If we require the approximate differential equation with x′′ replaced using (3.6) to be exact at t1, t2, and t3 and take h = 1/4 to be the length of a subinterval, we obtain:

$$\text{at } t_1 = \tfrac{1}{4}: \quad \frac{x_2 - 2x_1 + x_0}{\frac{1}{16}} = -\sin(\pi/4),$$
$$\text{at } t_2 = \tfrac{1}{2}: \quad \frac{x_3 - 2x_2 + x_1}{\frac{1}{16}} = -\sin(\pi/2),$$
$$\text{at } t_3 = \tfrac{3}{4}: \quad \frac{x_4 - 2x_3 + x_2}{\frac{1}{16}} = -\sin(3\pi/4),$$


with $t_k = k/4$, $k = 0, 1, 2, 3, 4$. If we plug in x0 = 0 and x4 = 0, multiply both sides of each of these three equations by −h² = −1/16, and write the equations in matrix form, we obtain
$$\begin{pmatrix} 2 & -1 & 0\\ -1 & 2 & -1\\ 0 & -1 & 2 \end{pmatrix}\begin{pmatrix} x_1\\ x_2\\ x_3 \end{pmatrix} = \frac{1}{16}\begin{pmatrix} \sin(\pi/4)\\ \sin(\pi/2)\\ \sin(3\pi/4) \end{pmatrix}.$$

The matrix for this system is symmetric. There is a matlab function chol that performs a Cholesky factorization. We use it as follows:

>> A = [2 -1 0

-1 2 -1

0 -1 2]

A =

2 -1 0

-1 2 -1

0 -1 2

>> b = (1/16)*[sin(pi/4); sin(pi/2); sin(3*pi/4)]

b =

0.0442

0.0625

0.0442

>> L = chol(A)’

L =

1.4142 0 0

-0.7071 1.2247 0

0 -0.8165 1.1547

>> L*L’-A

ans =

1.0e-015 *

0.4441 0 0

0 -0.4441 0

0 0 0

>> y = L\b

y =

0.0312

0.0691

0.0871

>> x = L’\y

x =

0.0754

0.1067

0.0754

>> A\b

ans =

0.0754


0.1067

0.0754

>>

3.2.5.2 Tridiagonal Matrices

A tridiagonal matrix is a matrix of the form
$$A = \begin{pmatrix} a_1 & c_1 & 0 & \cdots & & 0\\ b_2 & a_2 & c_2 & 0 & \cdots & 0\\ 0 & b_3 & a_3 & c_3 & \cdots & 0\\ \vdots & & \ddots & \ddots & \ddots & \vdots\\ 0 & \cdots & 0 & b_{n-1} & a_{n-1} & c_{n-1}\\ 0 & \cdots & & 0 & b_n & a_n \end{pmatrix}.$$

For example, the matrix from Example 3.18 is tridiagonal. In many cases important in applications, A can be decomposed into a product of two bidiagonal matrices, that is,
$$A = LU = \begin{pmatrix} \alpha_1 & 0 & \cdots & 0\\ b_2 & \alpha_2 & & \vdots\\ \vdots & \ddots & \ddots & \\ 0 & \cdots & b_n & \alpha_n \end{pmatrix}\begin{pmatrix} 1 & \gamma_1 & \cdots & 0\\ & \ddots & \ddots & \vdots\\ & & & \gamma_{n-1}\\ 0 & \cdots & 0 & 1 \end{pmatrix}. \qquad (3.7)$$

In such cases, multiplying the matrices on the right of (3.7) together and equating the resulting matrix entries with corresponding entries of A gives the following variant of Gaussian elimination:
$$\begin{aligned} \alpha_1 &= a_1, \quad \gamma_1 = c_1/\alpha_1,\\ \alpha_i &= a_i - b_i\gamma_{i-1}, \quad \gamma_i = c_i/\alpha_i \quad\text{for } i = 2, \cdots, n-1,\\ \alpha_n &= a_n - b_n\gamma_{n-1}. \end{aligned} \qquad (3.8)$$

Thus, if $\alpha_i \neq 0$, $1 \leq i \leq n$, we can compute the decomposition (3.7). Furthermore, we can compute the solution to $Ax = f = (f_1, f_2, \cdots, f_n)^T$ by successively solving Ly = f and Ux = y, i.e.,
$$\begin{aligned} y_1 &= f_1/\alpha_1,\\ y_i &= (f_i - b_iy_{i-1})/\alpha_i \quad\text{for } i = 2, 3, \cdots, n,\\ x_n &= y_n,\\ x_j &= y_j - \gamma_jx_{j+1} \quad\text{for } j = n-1, n-2, \cdots, 1. \end{aligned} \qquad (3.9)$$
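A minimal matlab sketch of (3.8) and (3.9) follows, with the three diagonals stored in vectors a (diagonal), b (subdiagonal; b(1) unused), and c (superdiagonal); the name trisolve is ours for illustration:

function x = trisolve(a, b, c, f)
% Solve a tridiagonal system using the factorization (3.8) and the
% forward and back substitutions (3.9).
n = length(a);
alpha = zeros(n,1); gamma = zeros(n-1,1);
alpha(1) = a(1); gamma(1) = c(1)/alpha(1);
for i = 2:n-1
   alpha(i) = a(i) - b(i)*gamma(i-1);
   gamma(i) = c(i)/alpha(i);
end
alpha(n) = a(n) - b(n)*gamma(n-1);
y = zeros(n,1); x = zeros(n,1);
y(1) = f(1)/alpha(1);
for i = 2:n
   y(i) = (f(i) - b(i)*y(i-1))/alpha(i);   % solve L*y = f
end
x(n) = y(n);
for j = n-1:-1:1
   x(j) = y(j) - gamma(j)*x(j+1);          % solve U*x = y
end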

Sufficient conditions to guarantee the decomposition (3.7) are as follows.

THEOREM 3.6

Suppose the elements $a_i$, $b_i$, and $c_i$ of A satisfy $|a_1| > |c_1| > 0$, $|a_i| \geq |b_i| + |c_i|$ and $b_ic_i \neq 0$ for $2 \leq i \leq n-1$, and suppose $|a_n| > |b_n| > 0$. Then A is invertible and the $\alpha_i$’s are nonzero. (Consequently, the factorization (3.7) is possible.)

Note: It can be verified that solution of a linear system having a tridiagonal coefficient matrix using (3.8) and (3.9) requires (5n − 4) multiplications and divisions and 3(n − 1) additions and subtractions. (Recall that we need $n^3/3 + O(n^2)$ multiplications and divisions for Gaussian elimination.) Storage requirements are also drastically reduced, to 3n locations versus n² for a full matrix.

Example 3.19

The matrix from Example 3.18 is tridiagonal and satisfies the conditions in Theorem 3.6. This holds true if we form the linear system of equations in the same way as in Example 3.18, regardless of how small we make h and how large the resulting system is. Thus, we may solve such systems with the forward substitution and back substitution algorithms represented by (3.8) and (3.9). If we want less truncation error in the approximation to the differential equation, we need to solve a larger system (with h smaller). It is more practical to do so with (3.8) and (3.9) than with the general Gaussian elimination algorithm, since the amount of work the computer has to do is proportional to n, rather than n³.

3.2.5.3 Block Tridiagonal Matrices

We now consider briefly block tridiagonal matrices, that is, matrices of the form
$$A = \begin{pmatrix} A_1 & C_1 & 0 & \cdots & 0 & 0\\ B_2 & A_2 & C_2 & 0 & \cdots & 0\\ 0 & B_3 & A_3 & C_3 & 0 & 0\\ \vdots & & \ddots & \ddots & \ddots & \vdots\\ 0 & 0 & 0 & B_{n-1} & A_{n-1} & C_{n-1}\\ 0 & 0 & 0 & 0 & B_n & A_n \end{pmatrix},$$
where $A_i$, $B_i$, and $C_i$ are m × m matrices. Analogous to the tridiagonal case, we construct a factorization of the form
$$A = \begin{pmatrix} \tilde{A}_1 & 0 & \cdots & 0\\ B_2 & \tilde{A}_2 & & \vdots\\ & \ddots & \ddots & \\ 0 & \cdots & B_n & \tilde{A}_n \end{pmatrix}\begin{pmatrix} I & E_1 & \cdots & 0\\ 0 & I & \ddots & \vdots\\ & & \ddots & E_{n-1}\\ 0 & \cdots & 0 & I \end{pmatrix}.$$


Provided the $\tilde{A}_i$, $1 \leq i \leq n$, are nonsingular, we can compute:
$$\begin{aligned} \tilde{A}_1 &= A_1,\\ E_1 &= \tilde{A}_1^{-1}C_1,\\ \tilde{A}_i &= A_i - B_iE_{i-1} \quad\text{for } 2 \leq i \leq n,\\ E_i &= \tilde{A}_i^{-1}C_i \quad\text{for } 2 \leq i \leq n-1. \end{aligned}$$
For efficiency, the $\tilde{A}_i^{-1}$ are generally not computed; instead, the columns of $E_i$ are computed by factoring $\tilde{A}_i$ and solving a pair of triangular systems. That is, $\tilde{A}_iE_i = C_i$ with $\tilde{A}_i = L_iU_i$ becomes $L_iU_iE_i = C_i$.

Note: The number of operations for computing a block factorization of a block tridiagonal system is proportional to $nm^3$. This is significantly less than the number of operations, proportional to $n^3$, for completing the general Gaussian elimination algorithm, for m small relative to n. In such cases, tremendous savings are achieved by taking advantage of the zero elements.

Now consider
$$Ax = b, \quad x = \begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix}, \quad b = \begin{pmatrix} b_1\\ b_2\\ \vdots\\ b_n \end{pmatrix},$$
where $x_i, b_i \in \mathbb{R}^m$. Then, with the factorization A = LU, Ax = b can be solved as follows: Ly = b, Ux = y, with
$$\begin{aligned} \tilde{A}_1y_1 &= b_1,\\ \tilde{A}_iy_i &= b_i - B_iy_{i-1} \quad\text{for } i = 2, \cdots, n,\\ x_n &= y_n,\\ x_j &= y_j - E_jx_{j+1} \quad\text{for } j = n-1, \cdots, 1. \end{aligned}$$

Block tridiagonal systems arise in various applications, such as in equilibrium models for diffusion processes in two and three variables, a simple prototype of which is the equation
$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = -f(x, y),$$

when we approximate the partial derivatives in a manner similar to how we approximated x′′ in Example 3.18. In that case, not only is the overall system block tridiagonal but, depending on how we order the equations and variables, the individual matrices $A_i$, $B_i$, and $C_i$ are tridiagonal, or contain mostly zeros. Taking advantage of these facts is absolutely necessary to be able to achieve the desired accuracy in the approximation to the solutions of certain models.

3.2.5.4 Banded Matrices

A generalization of a tridiagonal matrix arising in many applications is a banded matrix. Such matrices have non-zero elements only on the diagonal and p entries above and below the diagonal. For example, p = 1 for a tridiagonal matrix. The number p is called the semi-bandwidth of the matrix.

Example 3.20

$$\begin{pmatrix} 3 & -1 & 1 & 0 & 0\\ -1 & 3 & -1 & 1.1 & 0\\ 0.9 & -1 & 3 & -1 & 1.1\\ 0 & 1.1 & -1 & 3 & -1\\ 0 & 0 & 0.9 & -1 & 3 \end{pmatrix}$$
is a banded matrix with semi-bandwidth equal to 2.

Provided Gaussian elimination without pivoting is applicable, banded matrices may be stored and solved analogously to tridiagonal matrices. In particular, we may store the matrix in 2p + 1 vectors, and we may use an algorithm similar to (3.8) and (3.9), based on the general Gaussian elimination algorithm (Algorithm 3.1 on page 83), but with the loop on i having an upper bound equal to min{k + p, n}, rather than n, and with the $a_{i,j}$ replaced by appropriate references to the n by 2p + 1 matrix in which the non-zero entries are stored.

It is advantageous to handle a matrix as a banded matrix when its dimensionn is large relative to p.

3.2.5.5 General Sparse Matrices

Numerous applications, such as models of communications and transportation networks, give rise to matrices most of whose elements are zero, but that do not have an easily usable structure such as a block or banded structure. Matrices most of whose elements are zero are called sparse matrices. Matrices that are not sparse are called dense or full. Special, more sophisticated variants of Gaussian elimination, as well as iterative methods, which we treat later in Section 3.5, may be used for sparse matrices.

Several different schemes are used to store sparse matrices. One such scheme is to store two integer vectors r and c and one floating point vector v, such that the number of entries in r, c, and v is the total number of non-zero elements in the matrix; $r_i$ gives the row index of the i-th non-zero element, $c_i$ gives the corresponding column index, and $v_i$ gives the value.


Example 3.21

$$\begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
-3 & 0 & 0 & 0 & 1 \\
-2 & -1 & 0 & 1.1 & 0 \\
0 & 0 & 0 & 5 & -1 \\
7 & -8 & 0 & 0 & 0
\end{pmatrix}$$

may be stored with the vectors

$$r = \begin{pmatrix} 2\\3\\5\\3\\5\\1\\3\\4\\2\\4 \end{pmatrix}, \quad
c = \begin{pmatrix} 1\\1\\1\\2\\2\\3\\4\\4\\5\\5 \end{pmatrix}, \quad\text{and}\quad
v = \begin{pmatrix} -3\\-2\\7\\-1\\-8\\1\\1.1\\5\\1\\-1 \end{pmatrix}.$$

Note that there are 25 entries in this matrix, but only 10 nonzero entries.

There is a question concerning whether or not a particular matrix should be considered to be sparse, rather than treated as dense. In particular, if the matrix has some elements that are zero, but many are not, it may be more efficient to treat the matrix as dense. This is because there is extra overhead in the algorithms used to solve systems with matrices that are stored as sparse, and the elimination process can cause fill-in, the introduction of non-zeros into elements of the transformed matrix that were zero in the original matrix. Whether a matrix should be considered to be sparse or not depends on the application, the type of computer used to solve the system, etc. Sparse systems that have a banded or block structure are more efficiently treated with special algorithms for banded or block systems than with algorithms for general sparse matrices.

There is extensive support for sparse matrices in matlab. This is detailed in matlab's help system. One method of describing a sparse matrix in matlab is as we have done in Example 3.21.
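For instance, a minimal sketch using matlab's built-in sparse function, with the triplet vectors of Example 3.21:

>> r = [2 3 5 3 5 1 3 4 2 4];
>> c = [1 1 1 2 2 3 4 4 5 5];
>> v = [-3 -2 7 -1 -8 1 1.1 5 1 -1];
>> A = sparse(r, c, v, 5, 5);   % 5 by 5 sparse matrix from triplet form
>> full(A)                      % expand to dense form to inspect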


3.3 Roundoff Error and Conditioning

On page 17, we defined the condition number of a function in terms of the ratio of the relative error in the function value to the relative error in its argument. Also, in Example 1.17 on page 16, we saw that one way of computing a quantity can lead to a large relative error in the result, while another way leads to an accurate result; that is, one algorithm can be numerically unstable while another is stable.

Similar concepts hold for solutions to systems of linear equations. For example, Example 3.17 on page 90 illustrated that Gaussian elimination without partial pivoting can be numerically unstable for a system of equations where Gaussian elimination with partial pivoting is stable. We also have a concept of the condition number of a matrix, which relates the relative change in components of $x$ to changes in the elements of the matrix $A$ and right-hand-side vector $b$ for the system. To understand the most commonly used type of condition number of a matrix, we introduce norms.

3.3.1 Norms

We use norms to describe errors in vectors and convergence of sequences of vectors.

DEFINITION 3.15  A function that assigns a non-negative real number $\|u\|$ to a vector $u$ is called a norm, provided it has the following properties.

1. $\|u\| \ge 0$.

2. $\|u\| = 0$ if and only if $u = 0$.

3. $\|\lambda u\| = |\lambda|\,\|u\|$ for $\lambda \in \mathbb{R}$ (or $\lambda \in \mathbb{C}$ if $u$ is a complex vector).

4. $\|u + v\| \le \|u\| + \|v\|$ (triangle inequality).

Consider $V = \mathbb{C}^n$, the vector space of $n$-tuples of complex numbers. Note that $x \in \mathbb{C}^n$ has the form $x = (x_1, x_2, \cdots, x_n)^T$. Also,

$$x + y = (x_1 + y_1, x_2 + y_2, \cdots, x_n + y_n)^T$$

and

$$\lambda x = (\lambda x_1, \lambda x_2, \cdots, \lambda x_n)^T.$$

Important norms on $\mathbb{C}^n$ are:

(a) $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$: the $\ell_\infty$ or max norm (for $z = a + ib$, $|z| = \sqrt{a^2 + b^2} = \sqrt{z\bar{z}}$),

(b) $\|x\|_1 = \sum_{i=1}^n |x_i|$: the $\ell_1$ norm,

(c) $\|x\|_2 = \left(\sum_{i=1}^n |x_i|^2\right)^{1/2}$: the $\ell_2$ norm (Euclidean norm),

(d) scaled versions of the above norms, where we define $\|v\|_a = \|(a_1v_1, a_2v_2, \cdots, a_nv_n)^T\|$, where $a = (a_1, a_2, \cdots, a_n)^T$ with $a_i > 0$ for $1 \le i \le n$.

A useful property relating the Euclidean norm and the dot product is:

THEOREM 3.7

(the Cauchy–Schwarz inequality)

$$|v \circ w| = |v^T w| \le \|v\|_2\,\|w\|_2.$$

We now introduce a concept and notation for describing errors in computations involving vectors.

DEFINITION 3.16  The distance from $u$ to $v$ is defined as $\|u - v\|$.

The following concept and associated theorem are worth keeping in mind, since they hint that, in many cases, which norm we choose to describe the error in a vector is not so important from the point of view of the size of the error.

DEFINITION 3.17  Two norms $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ are called equivalent if there exist positive constants $c_1$ and $c_2$ such that

$$c_1\|x\|_\alpha \le \|x\|_\beta \le c_2\|x\|_\alpha.$$

Hence, also,

$$\frac{1}{c_2}\|x\|_\beta \le \|x\|_\alpha \le \frac{1}{c_1}\|x\|_\beta.$$

THEOREM 3.8

Any two norms on Cn are equivalent.


The following are the constants associated with the 1-, 2-, and $\infty$-norms:

$$\begin{aligned}
\text{(a)}\quad & \|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty, \\
\text{(b)}\quad & \tfrac{1}{\sqrt{n}}\|x\|_1 \le \|x\|_2 \le \|x\|_1, \\
\text{(c)}\quad & \tfrac{1}{n}\|x\|_1 \le \|x\|_\infty \le \|x\|_1.
\end{aligned} \tag{3.10}$$

The above relations are sharp in the sense that vectors can be found for which the inequalities are actually equations. Thus, in a sense, the 1-, 2-, and $\infty$-norms of vectors become "less equivalent" the larger the dimension of the vector space.

Example 3.22

The matlab function norm computes norms of vectors. Consider the following dialog.

>> x = [1;1;1;1;1]

x =

1

1

1

1

1

>> norm(x,1)

ans =

5

>> norm(x,2)

ans =

2.2361

>> norm(x,inf)

ans =

1

>> n=1000;

>> for i=1:n;x(i)=1;end;

>> norm(x,1)

ans =

1000

>> norm(x,2)

ans =

31.6228

>> norm(x,inf)

ans =

1

>>

This illustrates that, for a vector all of whose entries are equal to 1, the second inequality in (3.10)(a) is an equation, the first inequality in (b) is an equation, and the first inequality in (c) is an equation.

To discuss the condition number of a matrix, we use the concept of the norm of a matrix. In the following, $A$ and $B$ are arbitrary square matrices and $\lambda$ is a complex number.

DEFINITION 3.18  A matrix norm is a real-valued function of $A$, denoted by $\|\cdot\|$, satisfying:

1. ‖A‖ ≥ 0.

2. ‖A‖ = 0 if and only if A = 0.

3. ‖λA‖ = |λ| ‖A‖.

4. ‖A + B‖ ≤ ‖A‖+ ‖B‖.

5. ‖AB‖ ≤ ‖A‖ ‖B‖.

REMARK 3.4  In contrast to vector norms, we have an additional fifth property, referred to as a submultiplicative property, dealing with the norm of the product of two matrices.

Example 3.23

The quantity

$$\|A\|_E = \left(\sum_{i,j=1}^n |a_{ij}|^2\right)^{\frac12}$$

is called the Frobenius norm. Since the Frobenius norm is the Euclidean norm of the matrix when the matrix is viewed as a single vector formed by concatenating its columns (or rows), the Frobenius norm is a norm. It is also possible to prove that the Frobenius norm is a matrix norm.
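In matlab, the Frobenius norm is available as norm(A,'fro'). As a minimal check (with an arbitrary 2 by 2 matrix chosen for illustration), the following two quantities agree:

>> A = [1 2; 3 4];
>> norm(A,'fro')            % Frobenius norm
>> sqrt(sum(abs(A(:)).^2))  % the same value, computed from the definition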

To relate norms of matrices to errors in the solution of linear systems, we relate vector norms to matrix norms:

DEFINITION 3.19  A matrix norm $\|A\|$ and a vector norm $\|x\|$ are called compatible if for all vectors $x$ and matrices $A$ we have $\|Ax\| \le \|A\|\,\|x\|$.

REMARK 3.5  A consequence of the Cauchy–Schwarz inequality is that $\|Ax\|_2 \le \|A\|_E\|x\|_2$, i.e., the Euclidean norm $\|\cdot\|_E$ for matrices is compatible with the $\ell_2$-norm $\|\cdot\|_2$ for vectors.


In fact, every vector norm has associated with it a sharply defined compatible matrix norm:

DEFINITION 3.20  Given a vector norm $\|\cdot\|$, we define a natural or induced matrix norm associated with it as

$$\|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|}. \tag{3.11}$$

It is straightforward to show that an induced matrix norm satisfies the five properties required of a matrix norm. Also, from the definition of induced norm, an induced matrix norm is compatible with the given vector norm, that is,

$$\|A\|\,\|x\| \ge \|Ax\| \quad\text{for all } x \in \mathbb{C}^n. \tag{3.12}$$

REMARK 3.6  Definition 3.20 is equivalent to

$$\|A\| = \sup_{\|y\| = 1} \|Ay\|,$$

since

$$\|A\| = \sup_{x \ne 0} \frac{\|Ax\|}{\|x\|} = \sup_{x \ne 0} \left\| A\frac{x}{\|x\|} \right\| = \sup_{\|y\| = 1} \|Ay\|$$

(letting $y = x/\|x\|$).

We now present explicit expressions for $\|A\|_\infty$, $\|A\|_1$, and $\|A\|_2$.

THEOREM 3.9

(Formulas for common induced matrix norms)

(a) $\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^n |a_{ij}|$ = {maximum absolute row sum}.

(b) $\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^n |a_{ij}|$ = {maximum absolute column sum}.

(c) $\|A\|_2 = \sqrt{\rho(A^HA)}$, where $\rho(M)$ is the spectral radius of the matrix $M$, that is, the maximum absolute value of an eigenvalue of $M$.

(We will study eigenvalues and eigenvectors in Chapter 5. The spectral radius plays a fundamental role in a more advanced study of matrix norms. In particular, $\rho(A) \le \|A\|$ for any square matrix $A$ and any matrix norm, and, for any square matrix $A$ and any $\epsilon > 0$, there is a matrix norm $\|\cdot\|$ such that $\|A\| \le \rho(A) + \epsilon$.)

Page 118: Undergraduate Text

Linear Systems of Equations 105

Note that ‖A‖2 is not equal to the Frobenius norm.
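The formulas in Theorem 3.9 are easy to check numerically; a minimal matlab sketch (the matrix is the one from Example 3.13) is:

>> A = [1 2 3; 4 5 6; 7 8 10];
>> max(sum(abs(A),2))    % maximum absolute row sum; equals norm(A,inf)
>> norm(A,inf)
>> max(sum(abs(A),1))    % maximum absolute column sum; equals norm(A,1)
>> norm(A,1)
>> sqrt(max(eig(A'*A)))  % square root of rho(A'A); equals norm(A,2)
>> norm(A,2)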

Example 3.24

The norm function in matlab gives the induced matrix norm when its argument is a matrix. With the matrix A as in Example 3.13 (on page 81), consider the following matlab dialog (edited for brevity):

>> A

A =

1 2 3

4 5 6

7 8 10

>> x’

ans = 1 1 1

>> norm(A,1)

ans = 19

>> norm(x,1)

ans = 3

>> norm(A*x,1)

ans = 46

>> norm(A,1)*norm(x,1)

ans = 57

>> norm(A,2)

ans = 17.4125

>> norm(x,2)

ans = 1.7321

>> norm(A*x,2)

ans = 29.7658

>> norm(A,2)*norm(x,2)

ans = 30.1593

>> norm(A,inf)

ans = 25

>> norm(x,inf)

ans = 1

>> norm(A*x,inf)

ans = 25

>> norm(A,inf)*norm(x,inf)

ans = 25

>>

We are now prepared to discuss condition numbers of matrices.

3.3.2 Condition Numbers

We begin with the following:


DEFINITION 3.21  If the solution $x$ of $Ax = b$ changes drastically when $A$ or $b$ is perturbed slightly, then the system $Ax = b$ is called ill-conditioned.

Because rounding errors are unavoidable with floating point arithmetic, much accuracy can be lost during Gaussian elimination for ill-conditioned systems. In fact, the final solution may be considerably different from the exact solution.

Example 3.25

An ill-conditioned system is

$$Ax = \begin{pmatrix} 1 & 0.99 \\ 0.99 & 0.98 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1.99 \\ 1.97 \end{pmatrix}, \quad\text{whose exact solution is } x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$

However,

$$Ax = \begin{pmatrix} 1.989903 \\ 1.970106 \end{pmatrix} \quad\text{has solution}\quad x = \begin{pmatrix} 3 \\ -1.0203 \end{pmatrix}.$$

Thus, a change of

$$\delta b = \begin{pmatrix} -0.000097 \\ 0.000106 \end{pmatrix} \quad\text{produces a change}\quad \delta x = \begin{pmatrix} 2.0000 \\ -2.0203 \end{pmatrix}.$$

We first study the phenomenon of ill-conditioning, then study roundoff error in Gaussian elimination. We begin with

THEOREM 3.10

Let $\|\cdot\|_\beta$ be an induced matrix norm. Let $x$ be the solution of $Ax = b$ with $A$ an $n \times n$ invertible complex matrix. Let $x + \delta x$ be the solution of

$$(A + \delta A)(x + \delta x) = b + \delta b. \tag{3.13}$$

Assume that

$$\|\delta A\|_\beta\|A^{-1}\|_\beta < 1. \tag{3.14}$$

Then

$$\frac{\|\delta x\|_\beta}{\|x\|_\beta} \le \kappa_\beta(A)\left(1 - \|\delta A\|_\beta\|A^{-1}\|_\beta\right)^{-1}\left(\frac{\|\delta b\|_\beta}{\|b\|_\beta} + \frac{\|\delta A\|_\beta}{\|A\|_\beta}\right), \tag{3.15}$$

where

$$\kappa_\beta(A) = \|A\|_\beta\|A^{-1}\|_\beta$$

is defined to be the condition number of the matrix $A$ with respect to the norm $\|\cdot\|_\beta$. There exist perturbations $\delta A$ and $\delta b$ for which (3.15) holds with equality. That is, inequality (3.15) is sharp.


(We supply a proof of Theorem 3.10 in [1].)

The condition number satisfies $\kappa_\beta(A) \ge 1$ for any induced matrix norm and any matrix $A$, since

$$1 = \|I\|_\beta = \|A^{-1}A\|_\beta \le \|A^{-1}\|_\beta\|A\|_\beta = \kappa_\beta(A).$$

Example 3.26

Consider the system of equations from Example 3.25, and the following matlab dialog.

>> A = [1 0.99

0.99 0.98]

A =

1.0000 0.9900

0.9900 0.9800

>> norm(A,1)*norm(inv(A),1)

ans =

3.9601e+004

>> b = [1.99;1.97]

b =

1.9900

1.9700

>> x = A\b

x =

1.0000

1.0000

>> btilde = [1.989903;1.980106]

btilde =

1.9899

1.9801

>> xtilde = A\btilde

xtilde =

102.0000

-101.0203

>> norm(x-xtilde,1)/norm(x,1)

ans =

101.5102

>> sol_error = norm(x-xtilde,1)/norm(x,1)

sol_error =

101.5102

>> data_error = norm(b-btilde,1)/norm(b,1)

data_error =

0.0026

>> data_error * cond(A,1)

ans =

102.0326

>> cond(A,1)

ans =


3.9601e+004

>> cond(A,2)

ans =

3.9206e+004

>> cond(A,inf)

ans =

3.9601e+004

>>

This illustrates the definition of the condition number, as well as the fact that the relative error in the norm of the solution can be estimated by the relative error in the norms of the matrix and the right-hand-side vector multiplied by the condition number of the matrix. Also, in this two-dimensional case, the condition numbers in the 1-, 2-, and ∞-norms do not differ by much. The actual errors in the solutions A\b and A\btilde are small relative to the displayed digits, in this case.

If $\delta A = 0$, we have

$$\frac{\|\delta x\|_\beta}{\|x\|_\beta} \le \kappa_\beta(A)\,\frac{\|\delta b\|_\beta}{\|b\|_\beta},$$

and if $\delta b = 0$, then

$$\frac{\|\delta x\|_\beta}{\|x\|_\beta} \le \frac{\kappa_\beta(A)}{1 - \|A^{-1}\|_\beta\|\delta A\|_\beta}\,\frac{\|\delta A\|_\beta}{\|A\|_\beta}.$$

Note: In solving systems using Gaussian elimination with partial pivoting, we can use the condition number as a rule of thumb for estimating the number of correct digits in the solution. For example, if double precision arithmetic is used, errors in storing the matrix into internal binary format and in each step of the Gaussian elimination process are on the order of $10^{-16}$. If the condition number is $10^4$, then we might expect $16 - 4 = 12$ digits to be correct in the solution. In many cases, this is close. (For more foolproof bounds on the error, interval arithmetic techniques can sometimes be used.)

Note: For a unitary matrix $U$, i.e., $U^HU = I$, we have $\kappa_2(U) = 1$. Such a matrix is called perfectly conditioned, since $\kappa_\beta(A) \ge 1$ for any $\beta$ and $A$.

A classic example of an ill-conditioned matrix is the Hilbert matrix of order $n$:

$$H_n = \begin{pmatrix}
1 & \frac12 & \frac13 & \cdots & \frac1n \\
\frac12 & \frac13 & \frac14 & \cdots & \frac1{n+1} \\
\vdots & & & & \vdots \\
\frac1n & \frac1{n+1} & \frac1{n+2} & \cdots & \frac1{2n-1}
\end{pmatrix}.$$

Hilbert matrices and matrices that are approximately Hilbert matrices occur in approximation of data and functions. Condition numbers for some Hilbert matrices appear in Table 3.1. The reader may verify entries in this table, using the following matlab dialog as an example.

TABLE 3.1:  Condition numbers of some Hilbert matrices

  n          3        5        6        8        16         32         64
  κ2(Hn)   5×10^2   5×10^5   15×10^6  15×10^9  2.0×10^22  4.8×10^46  3.5×10^95

>> hilb(3)

ans =

1.0000 0.5000 0.3333

0.5000 0.3333 0.2500

0.3333 0.2500 0.2000

>> cond(hilb(3))

ans =

524.0568

>> cond(hilb(3),2)

ans =

524.0568

REMARK 3.7  Consider $Ax = b$. Ill-conditioning combined with rounding errors can have a disastrous effect in Gaussian elimination. Sometimes, the conditioning can be improved ($\kappa$ decreased) by scaling the equations. A common scaling strategy is to row equilibrate the matrix $A$ by choosing a diagonal matrix $D$ such that premultiplying $A$ by $D$ causes

$$\max_{1 \le j \le n} |a_{ij}| = 1 \quad\text{for } i = 1, 2, \cdots, n.$$

Thus, $DAx = Db$ becomes the scaled system, with the maximum element in each row of $DA$ equal to unity. (This procedure is generally recommended before Gaussian elimination with partial pivoting is employed [19]. However, there is no guarantee that equilibration with partial pivoting will not suffer greatly from effects of roundoff error.)

Example 3.27

The condition number does not tell the entire story in Gaussian elimination. In particular, if we multiply an entire equation by a non-zero number, this changes the condition number of the matrix, but does not have an effect on Gaussian elimination. Consider the following matlab dialog.

>> A = [1 1

-1 1]

A =

1 1

-1 1

>> cond(A)

ans =


1.0000

>> A(1,:) = 1e16*A(1,:)

A =

1.0e+016 *

1.0000 1.0000

-0.0000 0.0000

>> cond(A)

ans =

1.0000e+016

>>

However, the strange scaling in the first row of the matrix will not cause serious roundoff error when Gaussian elimination proceeds with floating point arithmetic, if the right-hand sides are scaled accordingly.

3.3.3 Roundoff Error in Gaussian Elimination

Consider the solution of $Ax = b$. On a computer, elements of $A$ and $b$ are represented by floating point numbers. Solving this linear system on a computer only produces an approximate solution $\hat{x}$.

There are two kinds of rounding error analysis. In backward error analysis, one shows that the computed solution $\hat{x}$ is the exact solution of a perturbed system of the form $(A + F)\hat{x} = b$. (See, for example, [30] or [42].) Then we have

$$A\hat{x} - Ax = -F\hat{x},$$

that is,

$$\hat{x} - x = -A^{-1}F\hat{x},$$

from which we obtain

$$\frac{\|x - \hat{x}\|_\infty}{\|\hat{x}\|_\infty} \le \|A^{-1}\|_\infty\|F\|_\infty = \kappa_\infty(A)\frac{\|F\|_\infty}{\|A\|_\infty}. \tag{3.16}$$

Thus, assuming that we have estimates for $\kappa_\infty(A)$ and $\|F\|_\infty$, we can use (3.16) to estimate the error $\|x - \hat{x}\|_\infty$.

In forward error analysis, one keeps track of roundoff error at each step of the elimination procedure. Then, $x - \hat{x}$ is estimated in some norm in terms of, for example, $A$, $\kappa(A)$, and $\theta = \frac{p}{2}\beta^{1-t}$ [37, 38].

The analyses are lengthy and are not given here. The results, however, are useful to understand. Basically, it is shown that

$$\frac{\|F\|_\infty}{\|A\|_\infty} \le c_n\,g\,\theta, \tag{3.17}$$

where

$c_n$ is a constant that depends on the size of the $n \times n$ matrix $A$,

$g$ is a growth factor, $g = \dfrac{\max_{i,j,k} |a_{ij}^{(k)}|}{\max_{i,j} |a_{ij}|}$, and

$\theta$ is the unit roundoff error, $\theta = \dfrac{p}{2}\beta^{1-t}$.

Note: Using backward error analysis, $c_n = 1.01n^3 + 5(n+1)^2$, and using forward error analysis, $c_n = \frac16(n^3 + 15n^2 + 2n - 12)$.

Note: The growth factor $g$ depends on the pivoting strategy: $g \le 2^{n-1}$ for partial pivoting,6 while $g \le n^{1/2}(2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)})^{1/2}$ for full pivoting. (Wilkinson conjectured that the latter bound can be improved to $g \le n$.) For example, for $n = 100$, $g \le 2^{99} \approx 10^{30}$ for partial pivoting and $g \le 3300$ for full pivoting.

6It cannot be improved, since $g = 2^{n-1}$ for certain matrices.

Note: Thus, by (3.16) and (3.17), the relative error $\|x - \hat{x}\|_\infty/\|\hat{x}\|_\infty$ depends directly on $\kappa_\infty(A)$, $\theta$, $n^3$, and the pivoting strategy.

REMARK 3.8  The factor of $2^{n-1}$ discouraged numerical analysts in the 1950s from using Gaussian elimination, and spurred study of iterative methods for solving linear systems. However, it was found that, for most matrices, the growth factor is much less, and Gaussian elimination with partial pivoting is usually practical.

3.3.4 Interval Bounds

In many instances, it is practical to obtain rigorous bounds on the solution $x$ to a linear system $Ax = b$. The algorithm is a modification of the general Gaussian elimination algorithm (Algorithm 3.1) and back substitution (Algorithm 3.2), as follows.

ALGORITHM 3.4

(Interval bounds for the solution to a linear system)

INPUT: the $n$ by $n$ matrix $A$ and the $n$-vector $b \in \mathbb{R}^n$.

OUTPUT: an interval vector $\mathbf{x}$ such that the exact solution to $Ax = b$ must be within the bounds $\mathbf{x}$.

1. Use Algorithm 3.1 and Algorithm 3.2 (that is, Gaussian elimination with back substitution, or any other technique) and floating point arithmetic to compute an approximation $Y$ to $A^{-1}$.

2. Use interval arithmetic, with directed rounding, to compute interval enclosures to $YA$ and $Yb$. That is,


(a) A← Y A (computed with interval arithmetic),

(b) b← Y b (computed with interval arithmetic).

3. FOR $k = 1, 2, \cdots, n-1$ (forward phase using interval arithmetic)

   FOR $i = k+1, \cdots, n$

   (a) $m_{ik} \leftarrow a_{ik}/a_{kk}$.

   (b) $a_{ik} \leftarrow [0, 0]$.

   (c) FOR $j = k+1, \cdots, n$

       $a_{ij} \leftarrow a_{ij} - m_{ik}a_{kj}$.

       END FOR

   (d) $b_i \leftarrow b_i - m_{ik}b_k$.

   END FOR

   END FOR

4. $x_n \leftarrow b_n/a_{nn}$.

5. FOR $k = n-1, n-2, \cdots, 1$ (back substitution)

   $x_k \leftarrow \left(b_k - \sum_{j=k+1}^{n} a_{kj}x_j\right)/a_{kk}$.

   END FOR

END ALGORITHM 3.4.

Note: We can explicitly set $a_{ik}$ to zero without loss of mathematical rigor, even though, using interval arithmetic, $a_{ik} - m_{ik}a_{kk}$ may not be exactly $[0, 0]$. In fact, this operation does not even need to be done, since we need not reference $a_{ik}$ in the back substitution process.

Note: Obtaining the rigorous bounds $\mathbf{x}$ in Algorithm 3.4 is more costly than computing an approximate solution with floating point arithmetic using Gaussian elimination with back substitution, because an approximate inverse $Y$ must explicitly be computed to precondition the system. However, both computations take $O(n^3)$ operations for general systems.

The following theorem clarifies why we may use Algorithm 3.4 to obtain mathematically rigorous bounds.

THEOREM 3.11

Define the solution set to $\mathbf{A}x = \mathbf{b}$ to be

$$\Sigma(\mathbf{A}, \mathbf{b}) = \left\{ x \mid Ax = b \text{ for some } A \in \mathbf{A} \text{ and } b \in \mathbf{b} \right\}.$$

If $Ax^* = b$, then $x^* \in \Sigma(\mathbf{A}, \mathbf{b})$. Furthermore, if $\mathbf{x}$ is the output of Algorithm 3.4, then $\Sigma(\mathbf{A}, \mathbf{b}) \subseteq \mathbf{x}$.

For facts enabling a proof of Theorem 3.11, see [29] or other references on interval analysis.

Example 3.28

$$\begin{pmatrix}
3.3330 & 15920. & -10.333 \\
2.2220 & 16.710 & 9.612 \\
1.5611 & 5.1791 & 1.6852
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} 15913. \\ 28.544 \\ 8.4254 \end{pmatrix}$$

For this problem, $\kappa_\infty(A) \approx 16000$ and the exact solution is $x = [1, 1, 1]^T$. We will use matlab (providing IEEE double precision floating point arithmetic) to compute $Y$, and we will use the intlab interval arithmetic toolbox for matlab (based on IEEE double precision). Rounded to 14 decimal digits (as matlab displays it), we obtain

$$Y \approx \begin{pmatrix}
-0.00012055643706 & -0.14988499865822 & 0.85417095741675 \\
0.00006278655296 & 0.00012125786211 & -0.00030664438576 \\
-0.00008128244868 & 0.13847464088044 & -0.19692507695527
\end{pmatrix}.$$

Using outward rounding in both the computation and the decimal display, we obtain

$$\mathbf{A} \subseteq \begin{pmatrix}
[1.00000000000000, 1.00000000000000] & [-0.00000000000012, -0.00000000000011] & [-0.00000000000001, -0.00000000000000] \\
[0.00000000000000, 0.00000000000001] & [1.00000000000000, 1.00000000000001] & [-0.00000000000001, -0.00000000000000] \\
[0.00000000000000, 0.00000000000001] & [0.00000000000013, 0.00000000000014] & [0.99999999999999, 1.00000000000001]
\end{pmatrix}$$

and

$$\mathbf{b} \subseteq \begin{pmatrix}
[0.99999999999988, 0.99999999999989] \\
[1.00000000000000, 1.00000000000001] \\
[1.00000000000013, 1.00000000000014]
\end{pmatrix}.$$

Completing the remainder of Algorithm 3.4 then gives

$$x^* \in \mathbf{x} \subseteq \begin{pmatrix}
[0.99999999999999, 1.00000000000001] \\
[0.99999999999999, 1.00000000000001] \\
[0.99999999999999, 1.00000000000001]
\end{pmatrix}.$$

The actual matlab dialog is as follows:

>> format long
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum (e.g. [ 3.14 , 3.15 ])
>> x = interval_Gaussian_elimination(A,b)
x =
   1.000000000000000
   1.000000000000000
   1.000000000000001
>> IA = [intval(3.3330) intval(15920.) intval(-10.333)
intval(2.2220) intval(16.710) intval(9.612)
intval(1.5611) intval(5.1791) intval(1.6852)]
intval IA =
  1.0e+004 *
Columns 1 through 2
[   0.00033330000000,   0.00033330000001] [   1.59200000000000,   1.59200000000000]
[   0.00022219999999,   0.00022220000000] [   0.00167100000000,   0.00167100000001]
[   0.00015610999999,   0.00015611000000] [   0.00051791000000,   0.00051791000001]
Column 3
[  -0.00103330000001,  -0.00103330000000]
[   0.00096120000000,   0.00096120000001]
[   0.00016851999999,   0.00016852000001]
>> Ib = [intval(15913.);intval(28.544);intval(8.4254)]
intval Ib =
  1.0e+004 *
[   1.59130000000000,   1.59130000000000]
[   0.00285440000000,   0.00285440000001]
[   0.00084253999999,   0.00084254000000]
>> YA = Y*IA
intval YA =
Columns 1 through 2
[   0.99999999999999,   1.00000000000001] [  -0.00000000000100,  -0.00000000000099]
[  -0.00000000000001,   0.00000000000001] [   1.00000000000000,   1.00000000000001]
[  -0.00000000000001,   0.00000000000001] [   0.00000000000013,   0.00000000000014]
Column 3
[   0.00000000000000,   0.00000000000001]
[  -0.00000000000001,  -0.00000000000000]
[   0.99999999999999,   1.00000000000001]
>> Yb = Y*Ib
intval Yb =
[   0.99999999999900,   0.99999999999901]
[   1.00000000000000,   1.00000000000001]
[   1.00000000000013,   1.00000000000014]
>> x = interval_Gaussian_elimination(A,b)
x =
   1.000000000000000
   1.000000000000000
   1.000000000000001

Here, we need to use the intlab function intval to convert the decimal strings representing the matrix and right-hand-side vector elements to small intervals containing the actual decimal values. This is because, even though the original system did not have interval entries, the elements cannot all be represented exactly as binary floating point numbers, so we must enclose the exact values in floating point intervals to be certain that the bounds we compute contain the actual solution. This is not necessary in computing the floating point preconditioning matrix Y, since Y need not be an exact inverse. The function interval_Gaussian_elimination, not a part of intlab, is as follows:

function [x] = interval_Gaussian_elimination(A, b)

% [x] = interval_Gaussian_elimination(A, b)

% returns the result of Algorithm 3.4 in the book.

% The matrix A and vector b should be intervals,

% although they may be point intervals (i.e. of width zero).


n = length(b);

Y = inv(mid(A));

Atilde = Y*A;

btilde = Y*b;

error_occurred = 0;

for k=1:n

for i=k+1:n

m_ik = Atilde(i,k)/Atilde(k,k);

for j=k+1:n

Atilde(i,j) = Atilde(i,j) - m_ik*Atilde(k,j);

end

btilde(i) = btilde(i) - m_ik*btilde(k);

end

end

x(n) = btilde(n)/Atilde(n,n);

for k=n-1:-1:1

x(k) = btilde(k);

for j=k+1:n

x(k) = x(k) - Atilde(k,j)*x(j);

end

x(k) = x(k)/Atilde(k,k);

end

x = x’;

Note: There are various ways of using interval arithmetic to obtain rigorous bounds on the solution set to linear systems of equations. Some of these are related mathematically to the interval Newton method introduced in §2.4 on page 56, while others are related to the iterative techniques we discuss later in this section. The effectiveness and practicality of a particular such technique depend on the condition of the system, and on whether the entries in the matrix A and right-hand-side vector b are points to start, or whether there are larger uncertainties in them (that is, whether or not these coefficients are wide or narrow intervals). A good theoretical reference is [29], and some additional practical detail is given in our monograph [20].

We now consider another method for computing the solution of a linear system Ax = b. This method is particularly appropriate for various statistical computations, such as least squares fits, when there are more equations than unknowns.


3.4 Orthogonal Decomposition (QR Decomposition)

This method for computing the solution of $Ax = b$ is based on orthogonal decomposition, also known as the QR decomposition or QR factorization. In addition to solving linear systems, the QR factorization is also useful in least squares problems and eigenvalue computations.

We will use the following concept heavily in this section, as well as when we study the singular value decomposition.

DEFINITION 3.22  Two vectors $u$ and $v$ are called orthogonal provided the dot product $u \circ v = 0$. A set of vectors $\{v^{(i)}\}$ is said to be orthonormal provided $v^{(i)} \circ v^{(j)} = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta function

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases}$$

A matrix $Q$ whose columns are orthonormal vectors is called an orthogonal matrix.

In QR decompositions, we compute an orthogonal matrix $Q$ and an upper triangular matrix8 $R$ such that $A = QR$. Advantages of the QR decomposition include the fact that systems involving an upper triangular matrix $R$ can be solved by back substitution, the fact that $Q$ is perfectly conditioned (with condition number in the 2-norm equal to 1), and the fact that the solution to $Qy = b$ is $y = Q^Tb$.

There are several ways of computing QR decompositions. These are detailed, for example, in our graduate-level text [1]. Here, we focus on the properties of the decomposition and its use.

Note: The QR decomposition is not unique. Hence, different software may come up with different QR decompositions for the same matrix.

3.4.1 Properties of Orthogonal Matrices

The following properties, easily provable, make the QR decomposition a numerically stable way of dealing with systems of equations.

THEOREM 3.12

Suppose Q is an orthogonal matrix. Then Q has the following properties.

8also known as a “right triangular” matrix. This is the reason for the notation “R”.


1. $Q^TQ = I$, that is, $Q^T = Q^{-1}$. Thus, solving the system $Qy = b$ can be done with a matrix multiplication.9

2. $\|Q\|_2 = \|Q^T\|_2 = 1$.

3. Hence, $\kappa_2(Q) = 1$, where $\kappa_2(Q)$ is the condition number of $Q$ in the 2-norm. That is, $Q$ is perfectly conditioned with respect to the 2-norm (and working with systems of equations involving $Q$ will not lead to excessive roundoff error accumulation).

4. $\|Qx\|_2 = \|x\|_2$ for every $x \in \mathbb{R}^n$. Hence $\|QA\|_2 = \|A\|_2$ for every $n$ by $n$ matrix $A$.

3.4.2 Least Squares and the QR Decomposition

Overdetermined linear systems (with more equations than unknowns) occur frequently in data fitting, in mathematical modeling, and in statistics. For example, we may have data of the form $\{(t_i, y_i)\}_{i=1}^m$, and we wish to model the dependence of $y$ on $t$ by a linear combination of $n$ basis functions $\{\varphi_j\}_{j=1}^n$, that is,

$$y \approx f(t) = \sum_{i=1}^n x_i\varphi_i(t), \tag{3.18}$$

where $m > n$. Setting $f(t_i) = y_i$, $1 \le i \le m$, gives the overdetermined linear system

$$\begin{pmatrix}
\varphi_1(t_1) & \varphi_2(t_1) & \cdots & \varphi_n(t_1) \\
\varphi_1(t_2) & \varphi_2(t_2) & \cdots & \varphi_n(t_2) \\
\vdots & & & \vdots \\
\varphi_1(t_m) & \varphi_2(t_m) & \cdots & \varphi_n(t_m)
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} =
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}, \tag{3.19}$$

that is,

$$Ax = b, \quad\text{where } A \in L(\mathbb{R}^n, \mathbb{R}^m),\ a_{ij} = \varphi_j(t_i),\ \text{and } b_i = y_i. \tag{3.20}$$

Perhaps the most common way of fitting data is with least squares, in which we find $x^*$ such that

$$\frac12\|Ax^* - b\|_2^2 = \min_{x \in \mathbb{R}^n} \varphi(x), \quad\text{where } \varphi(x) = \frac12\|Ax - b\|_2^2. \tag{3.21}$$

(Note that $x^*$ minimizes the 2-norm of the residual vector $r(x) = Ax - b$, since the function $g(u) = u^2$ is increasing.)

9With the usual way of multiplying matrices, this is $n^2$ multiplications, more than with back-substitution, but still $O(n^2)$. Furthermore, it can be done with $n$ dot products, something that is efficient on many machines.


The naive way of finding $x^*$ is to set the gradient $\nabla\varphi(x) = 0$ and simplify. Doing so gives the normal equations:

$$A^TAx = A^Tb. \tag{3.22}$$

(See Exercise 11 on page 143.) However, the normal equations tend to be very ill-conditioned. For example, if $m = n$, $\kappa_2(A^TA) = \kappa_2(A)^2$. Fortunately, the least squares solution $x^*$ may be computed with a QR decomposition. In particular,

$$\|Ax - b\|_2 = \|QRx - b\|_2 = \|Q^T(QRx - b)\|_2 = \|Rx - Q^Tb\|_2.$$

(Above, we used $\|Ux\|_2 = \|x\|_2$ when $U$ is orthogonal.) However,

$$\|Rx - Q^Tb\|_2^2 = \sum_{i=1}^n \left(\left(\sum_{j=i}^n r_{ij}x_j\right) - (Q^Tb)_i\right)^2 + \sum_{i=n+1}^m (Q^Tb)_i^2. \tag{3.23}$$

Observe now:

1. All m terms in the sum in (3.23) are nonnegative.

2. The first n terms can be made exactly zero.

3. The last m− n terms are constant.

Therefore,

$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2 = \sum_{i=n+1}^m (Q^Tb)_i^2,$$

and the minimizer $x^*$ can be computed by backsolving the square triangular system consisting of the first $n$ rows of $Rx = Q^Tb$.

We summarize these computations in the following algorithm.

ALGORITHM 3.5

(Least squares fits with a QR decomposition)

INPUT: the $m$ by $n$ matrix $A$, $m \ge n$, and $b \in \mathbb{R}^m$.

OUTPUT: the least squares fit $x \in \mathbb{R}^n$ such that $\|Ax - b\|_2$ is minimized, as well as the residual norm $\|Ax - b\|_2$.

1. Compute $Q$ and $R$ such that $Q$ is an $m$ by $m$ orthogonal matrix, $R$ is an $m$ by $n$ upper triangular (or "right triangular") matrix, and $A = QR$.

2. Form $y = Q^Tb$.

3. Solve the $n$ by $n$ upper triangular system $R_{1:n,1:n}x = y_{1:n}$ using Algorithm 3.2 (the back-substitution algorithm). Here, $R_{1:n,1:n}$ corresponds to $A^{(n)}$ and $y_{1:n}$ corresponds to $b^{(n)}$.

4. Set the residual norm $\|Ax - b\|_2$ to $\sqrt{\sum_{i=n+1}^m y_i^2} = \|y_{n+1:m}\|_2$.

END ALGORITHM 3.5.

Example 3.29

Consider fitting the data

  t   y
  0   1
  1   4
  2   5
  3   8

in the least squares sense with a polynomial of the form

$$p_2(t) = x_0\varphi_0(t) + x_1\varphi_1(t) + x_2\varphi_2(t),$$

where $\varphi_0(t) \equiv 1$, $\varphi_1(t) \equiv t$, and $\varphi_2(t) \equiv t^2$. The overdetermined system (3.19) becomes

$$\begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \end{pmatrix} =
\begin{pmatrix} 1 \\ 4 \\ 5 \\ 8 \end{pmatrix}.$$

We use matlab to perform a QR decomposition and find the least squares solution:

We use matlab to perform a QR decomposition and find the least squaressolution:

>> format short>> clear x

>> A = [1 0 01 1 1

1 2 41 3 9]A =

1 0 01 1 1

1 2 41 3 9

>> b = [1;4;5;8]

b =1

45

8>> [Q,R] = qr(A)Q =

-0.5000 0.6708 0.5000 0.2236-0.5000 0.2236 -0.5000 -0.6708

-0.5000 -0.2236 -0.5000 0.6708-0.5000 -0.6708 0.5000 -0.2236

R =-2.0000 -3.0000 -7.0000

Page 133: Undergraduate Text

120 Applied Numerical Methods

0 -2.2361 -6.7082

0 0 2.00000 0 0

>> Qtb = Q’*b;>> x(3) = Qtb(3)/R(3,3);>> x(2)=(Qtb(2) - R(2,3)*x(3))/R(2,3);

>> x(1) = (Qtb(1) - R(1,2)*x(2) - R(1,3)*x(3))/R(1,1)x =

3.4000 0.7333 0.0000>> x=x’;

>> resid = A*x - bresid =

2.4000

0.1333-0.1333

-2.4000>> tt = linspace(0,3);>> yy = x(1) + x(2)*tt + x(3)*tt.^2;

>> axis([-0.1,3.1,0.9,8.1])>> hold

Current plot held>> plot(A(:,2),b,’LineStyle’,’none’,’Marker’,’*’,’MarkerEdgeColor’,’red’,’Markersize’,15)

>> plot(tt,yy)>> y = Q’*by =

-9.0000-4.9193

0.0000-0.8944

>> x = R(1:n,1:n)\y(1:n)

x =1.2000

2.20000.0000

>> resid_norm = norm(y(n+1:m),2)resid_norm =

0.8944

>> norm(A*x-b,2)ans =

0.8944>>>>

This dialog results in the following plot, illustrating the data points as stars and the quadratic fit (which in this case happens to be linear) as a blue curve.

[Figure: the four data points, plotted as stars, together with the fitted curve over $0 \le t \le 3$.]

Note that the fit does not pass particularly close to the second and third data points. (The portion of the dialog following the plot commands illustrates alternative views of the computation of $x$ and the residual norm.)

Although working with the QR decomposition is a stable process, care should be taken when computing Q and R. We discuss actually computing Q and R in [1].

We now turn to iterative techniques for linear systems of equations.

3.5 Iterative Methods for Solving Linear Systems

Here, we study iterative solution of linear systems

$$Ax = b, \quad\text{i.e.}\quad \sum_{k=1}^n a_{jk}x_k = b_j, \quad j = 1, 2, \ldots, n. \tag{3.24}$$

Example 3.30

Consider Example 3.18 (on page 93), where we replaced a second derivative in a differential equation by a difference approximation, to obtain the system

$$\begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \frac{1}{16}
\begin{pmatrix} \sin(\pi/4) \\ \sin(\pi/2) \\ \sin(3\pi/4) \end{pmatrix}.$$

In other words, the equations are

$$\begin{aligned}
2x_1 - x_2 &= \tfrac{1}{16}\sin(\tfrac{\pi}{4}), \\
-x_1 + 2x_2 - x_3 &= \tfrac{1}{16}\sin(\tfrac{\pi}{2}), \\
-x_2 + 2x_3 &= \tfrac{1}{16}\sin(\tfrac{3\pi}{4}).
\end{aligned}$$

Solving the first equation for $x_1$, the second equation for $x_2$, and the third equation for $x_3$, we obtain

$$\begin{aligned}
x_1 &= \tfrac12\left[\tfrac{1}{16}\sin(\tfrac{\pi}{4}) + x_2\right], \\
x_2 &= \tfrac12\left[\tfrac{1}{16}\sin(\tfrac{\pi}{2}) + x_1 + x_3\right], \\
x_3 &= \tfrac12\left[\tfrac{1}{16}\sin(\tfrac{3\pi}{4}) + x_2\right],
\end{aligned}$$

which can be written in matrix form as

$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} 0 & \frac12 & 0 \\ \frac12 & 0 & \frac12 \\ 0 & \frac12 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} + \frac{1}{32}
\begin{pmatrix} \sin(\frac{\pi}{4}) \\ \sin(\frac{\pi}{2}) \\ \sin(\frac{3\pi}{4}) \end{pmatrix},$$

that is,

$$x = Gx + c, \tag{3.25}$$

with

$$x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \quad
G = \begin{pmatrix} 0 & \frac12 & 0 \\ \frac12 & 0 & \frac12 \\ 0 & \frac12 & 0 \end{pmatrix}, \quad\text{and}\quad
c = \frac{1}{32}\begin{pmatrix} \sin(\frac{\pi}{4}) \\ \sin(\frac{\pi}{2}) \\ \sin(\frac{3\pi}{4}) \end{pmatrix}.$$

Equation (3.25) can form the basis of an iterative method:

$$x^{(k+1)} = Gx^{(k)} + c. \tag{3.26}$$

Starting with $x^{(0)} = (0, 0, 0)^T$, we obtain the following in matlab:

>> x = [0,0,0]'
x =
     0
     0
     0
>> G = [0 1/2 0
1/2 0 1/2
0 1/2 0]
G =
         0    0.5000         0
    0.5000         0    0.5000
         0    0.5000         0
>> c = (1/32)*[sin(pi/4); sin(pi/2); sin(3*pi/4)]
c =
    0.0221
    0.0313
    0.0221
>> x = G*x + c
x =
    0.0221
    0.0313
    0.0221
>> x = G*x + c
x =
    0.0377
    0.0533
    0.0377
>> x = G*x + c
x =
    0.0488
    0.0690
    0.0488
>> x = G*x + c
x =
    0.0566
    0.0800
    0.0566
>> x = G*x + c
x =
    0.0621
    0.0878
    0.0621
>> x = G*x + c
x =
    0.0660
    0.0934
    0.0660
>> x = G*x + c
x =
    0.0688
    0.0973
    0.0688
>> x = G*x + c
x =
    0.0707
    0.1000
    0.0707
>> x = G*x + c
x =
    0.0721
    0.1020
    0.0721
>>

Comparing with the solution in Example 3.18, we see that the components of x tend to the components of the solution to Ax = b as we iterate (3.26). This is an example of an iterative method (namely, the Jacobi method) for solving the system of equations Ax = b.

Good references for iterative solution of linear systems are [23, 30, 39, 44].

Why might we wish to solve (3.24) iteratively? Suppose that $n = 10{,}000$ or more, which is not unreasonable for many problems. Then $A$ has at least $10^8$ elements, making it difficult to store or to solve (3.24) directly using, for example, Gaussian elimination.

To discuss iterative techniques involving vectors and matrices, we use:

DEFINITION 3.23  A sequence of vectors $\{x^{(k)}\}_{k=1}^\infty$ is said to converge to a vector $x \in \mathbb{C}^n$ if and only if $\|x^{(k)} - x\| \to 0$ as $k \to \infty$ for some norm $\|\cdot\|$.

Definition 3.23 implies that a sequence of vectors $\{x^{(k)}\} \subset \mathbb{R}^n$ (or $\subset \mathbb{C}^n$) converges to $x$ if and only if $x_i^{(k)} \to x_i$ as $k \to \infty$ for all $i$.

Note: Iterates defined by (3.26) can be viewed as fixed point iterates that, under certain conditions, converge to the fixed point.

DEFINITION 3.24  The iterative method defined by (3.26) is called convergent if, for all initial values $x^{(0)}$, we have $x^{(k)} \to A^{-1}b$ as $k \to \infty$.

We now take a closer look at the Jacobi method, as well as the related Gauss–Seidel method and SOR method.

3.5.1 The Jacobi Method

We can think of the Jacobi method illustrated in the above example in matrix form as follows. Let $L$ be the strictly lower triangular part of the matrix $A$, $U$ the strictly upper triangular part, and $D$ the diagonal part.


Example 3.31

In Example 3.30,

$$L = \begin{pmatrix} 0 & 0 & 0 \\ -1 & 0 & 0 \\ 0 & -1 & 0 \end{pmatrix}, \quad
U = \begin{pmatrix} 0 & -1 & 0 \\ 0 & 0 & -1 \\ 0 & 0 & 0 \end{pmatrix}, \quad\text{and}\quad
D = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}.$$

Then the Jacobi method may be written in matrix form with

$$G = -D^{-1}(L + U) \equiv J. \tag{3.27}$$

$J$ is called the iteration matrix for the Jacobi method. The iterative method becomes

$$x^{(k+1)} = -D^{-1}(L + U)x^{(k)} + D^{-1}b, \quad k = 0, 1, 2, \ldots \tag{3.28}$$

Generally, one uses the following equations to solve for $x^{(k+1)}$:

$$\begin{aligned}
& x_i^{(0)} \text{ is given}, \\
& x_i^{(k+1)} = \frac{1}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)}\right),
\end{aligned} \tag{3.29}$$

for $k \ge 0$ and $1 \le i \le n$ (where a sum is absent if its lower limit on $j$ is larger than its upper limit). Equations (3.29) are easily programmed.
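For instance, here is a minimal matlab sketch of (3.29). It performs a fixed number of sweeps (no convergence test), assumes each $a_{ii}$ is nonzero, and is meant only to illustrate the indexing, not as production software.

function x = jacobi_sweeps(A, b, x, nsweeps)
% Perform nsweeps sweeps of the Jacobi iteration (3.29), starting from x.
n = length(b);
for k = 1:nsweeps
    xold = x;   % Jacobi uses only values from the previous sweep
    for i = 1:n
        s = b(i) - A(i,1:i-1)*xold(1:i-1) - A(i,i+1:n)*xold(i+1:n);
        x(i) = s / A(i,i);
    end
end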

3.5.2 The Gauss–Seidel Method

We now discuss the Gauss–Seidel method, or successive relaxation method. If, in the Jacobi method, we use the new values of $x_j$ as they become available, then

$$\begin{aligned}
& x_i^{(0)} \text{ is given}, \\
& x_i^{(k+1)} = \frac{1}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)}\right),
\end{aligned} \tag{3.30}$$

for $k \ge 0$ and $1 \le i \le n$. (We continue to assume that $a_{ii} \ne 0$ for $i = 1, 2, \ldots, n$.) The iterative method (3.30) is called the Gauss–Seidel method, and can be written in matrix form with

$$G = -(L + D)^{-1}U \equiv \mathcal{G},$$

so

$$x^{(k+1)} = -(L + D)^{-1}Ux^{(k)} + (L + D)^{-1}b \quad\text{for } k \ge 0. \tag{3.31}$$


Note: To compute $x_i^{(k+1)}$, the Gauss–Seidel method only requires storage of

$$(x_1^{(k+1)}, x_2^{(k+1)}, \ldots, x_{i-1}^{(k+1)}, x_i^{(k)}, x_{i+1}^{(k)}, \ldots, x_n^{(k)})^T.$$

The Jacobi method requires storage of $x^{(k)}$ as well as $x^{(k+1)}$. Also, the Gauss–Seidel method generally converges faster. This gives an advantage to the Gauss–Seidel method. However, on some machines, separate rows of the Jacobi iteration equation may be processed simultaneously in parallel, while the Gauss–Seidel method requires that the coordinates be processed sequentially (with the equations in some specified order).

Example 3.32

$$\begin{pmatrix} 2 & 1 \\ -1 & 3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 3 \\ 2 \end{pmatrix}, \quad\text{that is,}\quad
\begin{aligned} 2x_1 + x_2 &= 3, \\ -x_1 + 3x_2 &= 2. \end{aligned}$$

(The exact solution is $x_1 = x_2 = 1$.) The Jacobi and Gauss–Seidel methods have the forms

Jacobi:

$$\begin{aligned}
x_1^{(k+1)} &= \frac32 - \frac12 x_2^{(k)}, \\
x_2^{(k+1)} &= \frac23 + \frac13 x_1^{(k)},
\end{aligned}$$

Gauss–Seidel:

$$\begin{aligned}
x_1^{(k+1)} &= \frac32 - \frac12 x_2^{(k)}, \\
x_2^{(k+1)} &= \frac23 + \frac13 x_1^{(k+1)}.
\end{aligned}$$

The results in Table 3.2 are obtained with $x^{(0)} = (0, 0)^T$. Observe that the Gauss–Seidel method converges roughly twice as fast as the Jacobi method. This behavior is provable.

TABLE 3.2:  Iterates of the Jacobi and Gauss–Seidel methods, for Example 3.32

  k    x1 (Jacobi)   x2 (Jacobi)   x1 (G–S)   x2 (G–S)
  0    0             0             0          0
  1    1.5           0.667         1.5        1.167
  2    1.167         1.167         0.917      0.972
  3    0.917         1.056         1.014      1.005
  4    0.972         0.972         0.998      0.999
  5    1.014         0.991         1.000      1.000
  6    1.005         1.005
  7    0.998         1.002
  8    0.999         0.999
  9    1.000         1.000

3.5.3 Successive Overrelaxation

We now describe Successive OverRelaxation (SOR). In the SOR method, one computes $x_i^{(k+1)}$ to be a weighted mean of $x_i^{(k)}$ and the Gauss–Seidel iterate for that element. Specifically, for $\sigma \ne 0$ a real parameter, the SOR method is given by

$$\begin{aligned}
& x_i^{(0)} \text{ is given}, \\
& x_i^{(k+1)} = (1-\sigma)x_i^{(k)} + \frac{\sigma}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)}\right),
\end{aligned} \tag{3.32}$$

for $1 \le i \le n$ and for $k \ge 0$. The parameter $\sigma$ is called a relaxation factor. If $\sigma < 1$, we call $\sigma$ an underrelaxation factor, and if $\sigma > 1$, we call $\sigma$ an overrelaxation factor. Note that if $\sigma = 1$, the Gauss–Seidel method is obtained.

Note: For certain classes of matrices and certain $\sigma$ between 1 and 2, the SOR method converges faster than the Gauss–Seidel method.

We can write (3.32) in the matrix form

$$\left(L + \frac{1}{\sigma}D\right)x^{(k+1)} = -\left\{U + \left(1 - \frac{1}{\sigma}\right)D\right\}x^{(k)} + b \tag{3.33}$$

for $k = 0, 1, 2, \ldots$, with $x^{(0)}$ given. Thus,

$$G = (\sigma L + D)^{-1}\left[(1-\sigma)D - \sigma U\right] \equiv S_\sigma,$$

and

$$x^{(k+1)} = S_\sigma x^{(k)} + \left(L + \frac{1}{\sigma}D\right)^{-1}b. \tag{3.34}$$

The matrix $S_\sigma$ is called the SOR matrix. Note that $\sigma = 1$ gives $\mathcal{G}$, the Gauss–Seidel matrix.

A classic reference on iterative methods, and the SOR method in particular, is [44].
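A minimal matlab sketch of (3.32), in the same style as the Jacobi and Gauss–Seidel sketches above (sigma = 1 reproduces the Gauss–Seidel sweep):

function x = sor_sweeps(A, b, x, sigma, nsweeps)
% Perform nsweeps sweeps of the SOR iteration (3.32) with relaxation
% factor sigma, starting from x.
n = length(b);
for k = 1:nsweeps
    for i = 1:n
        s = b(i) - A(i,1:i-1)*x(1:i-1) - A(i,i+1:n)*x(i+1:n);
        x(i) = (1 - sigma)*x(i) + sigma*(s / A(i,i));
    end
end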

3.5.4 Convergence of Iterative Methods

The general iteration equation (3.26) (on page 122) gives

$$x^{(k+1)} = Gx^{(k)} + c \quad\text{and}\quad x^{(k)} = Gx^{(k-1)} + c.$$


Subtracting these equations and using properties of vector addition and matrix-vector multiplication gives

$$x^{(k+1)} - x^{(k)} = G(x^{(k)} - x^{(k-1)}). \tag{3.35}$$

Furthermore, similar rearrangements give

$$(I - G)(x^{(k)} - x) = x^{(k)} - x^{(k+1)} \tag{3.36}$$

because

$$x = Gx + c \quad\text{and}\quad x^{(k+1)} = Gx^{(k)} + c.$$

Combining (3.35) and (3.36) gives

$$x^{(k)} - x = -(I - G)^{-1}G(x^{(k)} - x^{(k-1)}) = -(I - G)^{-1}G^2(x^{(k-1)} - x^{(k-2)}) = \cdots,$$

and taking norms gives

$$\begin{aligned}
\|x^{(k)} - x\| &\le \|(I - G)^{-1}\|\,\|G\|\,\|x^{(k)} - x^{(k-1)}\| \\
&= \|(I - G)^{-1}\|\,\|G\|\,\|G(x^{(k-1)} - x^{(k-2)})\| \\
&\le \|(I - G)^{-1}\|\,\|G\|\,\|G\|\,\|x^{(k-1)} - x^{(k-2)}\| \\
&\ \ \vdots \\
&\le \|(I - G)^{-1}\|\,\|G\|^k\,\|x^{(1)} - x^{(0)}\|.
\end{aligned}$$

It is not hard to show that, for any induced matrix norm, if $\|G\| < 1$, then

$$\|(I - G)^{-1}\| \le \frac{1}{1 - \|G\|}.$$

Therefore,

$$\|x^{(k)} - x\| \le \frac{\|G\|^k}{1 - \|G\|}\,\|x^{(1)} - x^{(0)}\|. \tag{3.37}$$

The practical importance of this error estimate is that we can expect linear convergence of our iterative method when $\|G\| < 1$.

Example 3.33

We revisit Example 3.30, with the following matlab dialog:

>> x = [0,0,0]'
x =
     0
     0
     0
>> G = [0 1/2 0
1/2 0 1/2
0 1/2 0]
G =
         0    0.5000         0
    0.5000         0    0.5000
         0    0.5000         0
>> c = (1/32)*[sin(pi/4); sin(pi/2); sin(3*pi/4)]
c =
    0.0221
    0.0313
    0.0221
>> exact_solution = (eye(3)-G)\c
exact_solution =
    0.0754
    0.1067
    0.0754
>> normG = norm(G)
normG =
    0.7071
>> for i=1:5
old_norm = norm(x-exact_solution);
x = G*x + c;
new_norm = norm(x-exact_solution);
ratio = new_norm/old_norm
end
ratio =
    0.7071
ratio =
    0.7071
ratio =
    0.7071
ratio =
    0.7071
ratio =
    0.7071
>> x
x =
    0.0660
    0.0934
    0.0660
>>

We thus see linear convergence with the Jacobi method, with convergence factor $\|G\| \approx 0.7071$, just as we discussed in Section 1.1.3 (page 7) and in our study of the fixed point method for solving a single nonlinear equation (Section 2.2, starting on page 47).

Example 3.34

We examine the norm of the iteration matrix for the Gauss–Seidel method for Example 3.30:

>> L = [0 0 0
-1 0 0
0 -1 0]
L =
     0     0     0
    -1     0     0
     0    -1     0
>> U = [0 -1 0
0 0 -1
0 0 0]
U =
     0    -1     0
     0     0    -1
     0     0     0
>> D = [2 0 0
0 2 0
0 0 2]
D =
     2     0     0
     0     2     0
     0     0     2
>> GS = -inv(L+D)*U
GS =
         0    0.5000         0
         0    0.2500    0.5000
         0    0.1250    0.2500
>> norm(GS)
ans =
    0.6905

We see that this norm is less than the norm of the iteration matrix for the Jacobi method, so we may expect the Gauss–Seidel method to converge somewhat faster.

The error estimates hold if $\|\cdot\|$ is any norm. Furthermore, it is possible to prove the following.

THEOREM 3.13

Suppose

$$\rho(G) < 1,$$

where $\rho(G)$ is the spectral radius of $G$, that is,

$$\rho(G) = \max\{|\lambda| : \lambda \text{ is an eigenvalue of } G\}.$$

Then the iterative method

$$x^{(k+1)} = Gx^{(k)} + c$$

converges.

In particular, the Jacobi method and Gauss–Seidel method for matrices of the form in Example 3.30 all converge, although $\|G\|$ becomes nearer to 1 (and hence the convergence is slower) the finer we subdivide the interval $[0, 1]$ (and hence the larger $n$ becomes). There is some theory relating the spectral radii of various iteration matrices, and matrices arising from discretizations such as in Example 3.30 have been analyzed extensively.

One criterion that is easy to check is diagonal dominance, as defined in Remark 3.1 on page 88:

THEOREM 3.14

Suppose

$$|a_{ii}| \ge \sum_{\substack{j=1 \\ j \ne i}}^n |a_{ij}| \quad\text{for } i = 1, 2, \cdots, n,$$

and suppose that the inequality is strict for at least one $i$. Then the Jacobi method and Gauss–Seidel method for $Ax = b$ converge.

We present a more detailed analysis in [1].


3.5.5 The Interval Gauss–Seidel Method

The interval Gauss–Seidel method is an alternative method10 for using floating point arithmetic to obtain mathematically rigorous lower and upper bounds on the solution to a system of linear equations. The interval Gauss–Seidel method has several advantages, especially when there are uncertainties in the right-hand-side vector $b$ that are represented in the form of relatively wide intervals $[\underline{b}_i, \overline{b}_i]$, and when there are also uncertainties $[\underline{a}_{ij}, \overline{a}_{ij}]$ in the coefficients of the matrix $A$. That is, we assume that the matrix is $\mathbf{A} \in \mathbb{IR}^{n \times n}$ and $\mathbf{b} \in \mathbb{IR}^n$, and we wish to find an interval vector (or "box") $\mathbf{x}$ that bounds

$$\Sigma(\mathbf{A}, \mathbf{b}) = \{x \mid Ax = b \text{ for some } A \in \mathbf{A} \text{ and some } b \in \mathbf{b}\}, \tag{3.38}$$

where $\mathbb{IR}^{n \times n}$ denotes the set of all $n$ by $n$ matrices whose entries are intervals, $\mathbb{IR}^n$ denotes the set of all $n$-vectors whose entries are intervals, and $A \in \mathbf{A}$ means that each element of the point matrix $A$ is contained in the corresponding element of the interval matrix $\mathbf{A}$ (and similarly for $b \in \mathbf{b}$).

The interval Gauss–Seidel method is similar to the point Gauss–Seidel method as defined in (3.30) on page 124, except that, for general systems, we almost always precondition. In particular, let $\tilde{\mathbf{A}} = Y\mathbf{A}$ and $\tilde{\mathbf{b}} = Y\mathbf{b}$, where $Y$ is a preconditioning matrix. We then have the preconditioned system

$$Y\mathbf{A}x = Y\mathbf{b}, \quad\text{i.e.}\quad \tilde{\mathbf{A}}x = \tilde{\mathbf{b}}. \tag{3.39}$$

We have

THEOREM 3.15

(The solution set for the preconditioned system contains the solution set for the original system.)

$$\Sigma(\mathbf{A}, \mathbf{b}) \subseteq \Sigma(Y\mathbf{A}, Y\mathbf{b}) = \Sigma(\tilde{\mathbf{A}}, \tilde{\mathbf{b}}).$$

This theorem is a fairly straightforward consequence of the subdistributivity (Equation (1.4) on page 26) of interval arithmetic. For a proof of this and other facts concerning interval linear systems, see, for example, [29].

Analogously to the noninterval version of Gauss–Seidel iteration (3.30), the interval Gauss–Seidel method is given as

$$\mathbf{x}_i^{(k+1)} \leftarrow \frac{1}{\mathbf{a}_{ii}}\left(\mathbf{b}_i - \sum_{j=1}^{i-1} \mathbf{a}_{ij}\mathbf{x}_j^{(k+1)} - \sum_{j=i+1}^{n} \mathbf{a}_{ij}\mathbf{x}_j^{(k)}\right) \tag{3.40}$$

for $i = 1, 2, \ldots, n$, where a sum is interpreted to be absent if its lower index is greater than its upper index, and with $\mathbf{x}_i^{(0)}$ given for $i = 1, 2, \ldots, n$.

10to the interval version of Gaussian elimination of Section 3.3.4 on page 111
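The following is a rough sketch of one preconditioned sweep of (3.40) using intlab's intval type. It is an illustration under stated assumptions, not official intlab code: A and b are assumed to be intval arrays, x an intval vector of current bounds, and the new bound is intersected with the old one (a step commonly added in practice), written out here via intlab's inf, sup, and infsup functions.

Y = inv(mid(A));       % inverse midpoint preconditioner (floating point)
At = Y*A;  bt = Y*b;   % interval enclosures of the preconditioned system
n = length(bt);
for i = 1:n
    q = bt(i) - At(i,1:i-1)*x(1:i-1) - At(i,i+1:n)*x(i+1:n);
    q = q / At(i,i);
    % Intersect the new enclosure with the previous one.
    x(i) = infsup(max(inf(q),inf(x(i))), min(sup(q),sup(x(i))));
end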


REMARK 3.9  As with the interval version of Gaussian elimination (Algorithm 3.4 on page 111), a common preconditioner $Y$ for the interval Gauss–Seidel method is the inverse midpoint matrix $Y = (m(\mathbf{A}))^{-1}$, where $m(\mathbf{A})$ is the matrix whose elements are the midpoints of the corresponding elements of the interval matrix $\mathbf{A}$. However, when the elements of $\mathbf{A}$ have particularly large widths, specially designed preconditioners11 may be more appropriate.

REMARK 3.10  Point iterative methods are often preconditioned, too. However, computing an inverse of a point matrix $A$ leads to $YA \approx I$, where $I$ is the identity matrix, so the system will already have been solved (except for, possibly, iterative refinement). Moreover, such point iterative methods are usually employed for very large systems of equations, with matrices having "0" for many elements. Although the elements that are 0 need not be stored, the inverse generally does not have 0's in any of its elements [13], so it may be impractical to even store the inverse, let alone compute it.12 Thus, special approximations are used for these preconditioners.13 Preconditioners for the point Gauss–Seidel method, conjugate gradient method (explained in our graduate text [1]), etc., are often viewed as operators that increase the separation between the largest eigenvalue of $A$ and the remaining eigenvalues of $A$, rather than as approximate inverses.

The following theorem tells us that the interval Gauss–Seidel method can be used to prove existence and uniqueness of a solution of a system of linear equations.

THEOREM 3.16

Suppose (3.40) is used, starting with the initial interval vector $\mathbf{x}^{(0)}$, and obtaining the interval vector $\mathbf{x}^{(k)}$ after a number of iterations. Then, if $\mathbf{x}^{(k)} \subseteq \mathbf{x}^{(0)}$, for each $A \in \mathbf{A}$ and each $b \in \mathbf{b}$, there is an $x \in \mathbf{x}^{(k)}$ such that $Ax = b$.

The proof of Theorem 3.16 can be found in many places, such as in [20] or [29].

Example 3.35

Consider $\mathbf{A}x = \mathbf{b}$, where

$$\mathbf{A} = \begin{pmatrix} [0.99, 1.01] & [1.99, 2.01] \\ [2.99, 3.01] & [3.99, 4.01] \end{pmatrix}, \quad
\mathbf{b} = \begin{pmatrix} [-1.01, -0.99] \\ [0.99, 1.01] \end{pmatrix}, \quad
\mathbf{x}^{(0)} = \begin{pmatrix} [-10, 10] \\ [-10, 10] \end{pmatrix}.$$

11See [20, Chapter 3].

12Of course, the inverse could be computed one row at a time, but this may still be impractical for large systems.

13Much work has appeared in the research literature on such preconditioners.


Then,14

$$m(\mathbf{A}) = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad
Y \approx m(\mathbf{A})^{-1} = \begin{pmatrix} -2.0 & 1.0 \\ 1.5 & -0.5 \end{pmatrix},$$

$$\tilde{\mathbf{A}} = Y\mathbf{A} \subseteq \begin{pmatrix} [0.97, 1.03] & [-0.03, 0.03] \\ [-0.02, 0.02] & [0.98, 1.02] \end{pmatrix}, \quad
\tilde{\mathbf{b}} = Y\mathbf{b} \subseteq \begin{pmatrix} [2.97, 3.03] \\ [-2.02, -1.98] \end{pmatrix}.$$

We then have

$$\mathbf{x}_1^{(1)} \leftarrow \frac{1}{[0.97, 1.03]}\bigl([2.97, 3.03] - [-0.03, 0.03]\cdot[-10, 10]\bigr) \subseteq [2.5922, 3.4330],$$

$$\mathbf{x}_2^{(1)} \leftarrow \frac{1}{[0.98, 1.02]}\bigl([-2.02, -1.98] - [-0.02, 0.02]\cdot[2.5922, 3.4330]\bigr) \subseteq [-2.1313, -1.8738].$$

If we continue this process, we eventually obtain

$$\mathbf{x}^{(4)} = ([2.8215, 3.1895],\ [-2.1264, -1.8786])^T,$$

which, to four significant figures, is the same as $\mathbf{x}^{(3)}$. Thus, we have found mathematically rigorous bounds on the set of all solutions to $Ax = b$ such that $A \in \mathbf{A}$ and $b \in \mathbf{b}$.

In Example 3.35, uncertainties of $\pm 0.01$ are present in each element of the matrix and right-hand-side vector. Although the bounds produced with the preconditioned interval Gauss–Seidel method are not guaranteed to be the tightest possible with these uncertainties, they will be closer to the tightest possible when the uncertainties are smaller.

Convergence of the interval Gauss–Seidel method is related closely to convergence of the point Gauss–Seidel method, through the concept of diagonal dominance. We give a hint of this convergence theory here.

DEFINITION 3.25  If $\mathbf{a} = [\underline{a}, \overline{a}]$ is an interval, then the magnitude of $\mathbf{a}$ is defined to be

$$\mathrm{mag}(\mathbf{a}) = \max\{|\underline{a}|, |\overline{a}|\}.$$

Similarly, the mignitude of $\mathbf{a}$ is defined to be

$$\mathrm{mig}(\mathbf{a}) = \min_{a \in \mathbf{a}} |a|.$$

14These computations were done with the aid of intlab, a matlab toolbox available free of charge for non-commercial use.


Given the matrix $\mathbf{A}$, form the matrix $H = (h_{ij})$ such that

$$h_{ij} = \begin{cases} \mathrm{mag}(\mathbf{a}_{ij}) & \text{if } i \ne j, \\ \mathrm{mig}(\mathbf{a}_{ij}) & \text{if } i = j. \end{cases}$$

Then, basically, the interval Gauss–Seidel method will be convergent if $H$ is diagonally dominant.

For a careful review of convergence theory for the interval Gauss–Seidel method and other interval methods for linear systems, see [29]. Also, see [32].

3.6 The Singular Value Decomposition

The singular value decomposition, which we will abbreviate "SVD," is not always the most efficient way of analyzing a linear system, but it is extremely flexible, and it is sometimes used in signal processing (smoothing), sensitivity analysis, statistical analysis, etc., especially if a large amount of information about the numerical properties of the system is desired. The major libraries for programmers (e.g. Lapack) and software systems (e.g. matlab, Mathematica) have facilities for computing the SVD. The SVD is often used in the same context as a QR factorization, but the component matrices in an SVD are computed with an iterative technique related to techniques for computing eigenvalues and eigenvectors (in Chapter 5 of this book).

The following theorem defines the SVD.

THEOREM 3.17

Let $A$ be an $m$ by $n$ real matrix, but otherwise arbitrary. Then there are orthogonal matrices $U$ and $V$ and an $m$ by $n$ matrix $\Sigma = [\Sigma_{ij}]$ with $\Sigma_{ij} = 0$ for $i \ne j$ and $\Sigma_{ii} = \sigma_i \ge 0$ for $1 \le i \le p = \min\{m, n\}$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p$, such that

$$A = U\Sigma V^T.$$

For a proof and further explanation, see G. W. Stewart, Introduction to Matrix Computations [35] or G. H. Golub15 and C. F. van Loan, Matrix Computations [16].

Note: The SVD for a particular matrix is not necessarily unique.

Note: The SVD is defined similarly for complex matrices A ∈ L(Cn, Cm).

15Gene Golub, a famous numerical analyst, a professor of Computer Science and, for many years, department chairman, at Stanford University, invented the efficient algorithm used today for computing the singular value decomposition.


REMARK 3.11  A simple algorithm to find the singular value decomposition is: (1) find the nonzero eigenvalues $\lambda_i$, $i = 1, 2, \ldots, r$, of $A^TA$; (2) find orthonormal eigenvectors of $A^TA$ and arrange them in the $n \times n$ matrix $V$; (3) form the $m \times n$ matrix $\Sigma$ with diagonal entries $\sigma_i = \sqrt{\lambda_i}$; (4) let $u_i = \sigma_i^{-1}Av_i$, $i = 1, 2, \ldots, r$, and compute $u_i$, $i = r+1, r+2, \ldots, m$, using Gram–Schmidt orthogonalization. However, a well-known efficient method for computing the SVD is the Golub–Reinsch algorithm [36], which employs Householder bidiagonalization and a variant of the QR method.

Example 3.36

Let $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix}$. Then

$$U \approx \begin{pmatrix} -0.2298 & 0.8835 & 0.4082 \\ -0.5247 & 0.2408 & -0.8165 \\ -0.8196 & -0.4019 & 0.4082 \end{pmatrix}, \quad
\Sigma \approx \begin{pmatrix} 9.5255 & 0 \\ 0 & 0.5143 \\ 0 & 0 \end{pmatrix}, \quad\text{and}\quad
V \approx \begin{pmatrix} -0.6196 & -0.7849 \\ -0.7849 & 0.6196 \end{pmatrix}$$

is a singular value decomposition of $A$. This approximate singular value decomposition was obtained with the following matlab dialog.

>> A = [1 2;3 4;5 6]

A =

1 2

3 4

5 6

>> [U,Sigma,V] = svd(A)

U =

-0.2298 0.8835 0.4082

-0.5247 0.2408 -0.8165

-0.8196 -0.4019 0.4082

Sigma =

9.5255 0

0 0.5143

0 0

V =

-0.6196 -0.7849

-0.7849 0.6196

>> U*Sigma*V’

ans =

1.0000 2.0000

3.0000 4.0000

5.0000 6.0000

>>


Note: If $A = U\Sigma V^T$ represents a singular value decomposition of $A$, then $V\Sigma^TU^T$ represents a singular value decomposition of $A^T$.

DEFINITION 3.26  The vectors $V(:, i)$, $1 \le i \le p$, are called the right singular vectors of $A$, while the corresponding $U(:, i)$ are called the left singular vectors of $A$ corresponding to the singular values $\sigma_i$.

The singular values are like eigenvalues, and the singular vectors are like eigenvectors. In fact, we have

THEOREM 3.18

Let the $n$ by $n$ matrix $A$ be symmetric and positive definite. Let $\{\lambda_i\}_{i=1}^n$ be the eigenvalues of $A$, ordered so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, and let $v_i$ be the eigenvector corresponding to $\lambda_i$. Furthermore, choose the $v_i$ so that $\{v_i\}_{i=1}^n$ is an orthonormal set, and form $V = [v_1, \cdots, v_n]$ and $\Lambda = \mathrm{diag}(\lambda_1, \cdots, \lambda_n)$. Then $A = V\Lambda V^T$ represents a singular value decomposition of $A$.

This theorem follows directly from the definition of the SVD. We also have

THEOREM 3.19

Let the $n$ by $n$ matrix $A$ be invertible, and let $A = U\Sigma V^T$ represent a singular value decomposition of $A$. Then the 2-norm condition number of $A$ is $\kappa_2(A) = \sigma_1/\sigma_n$.

Thus, the condition number of a matrix is obtainable directly from the SVD, but the SVD gives us more useful information about the sensitivity of solutions than just that single number, as we'll see shortly.
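For example, Theorem 3.19 can be checked numerically in matlab with the ill-conditioned matrix of Example 3.25:

>> A = [1 0.99; 0.99 0.98];
>> s = svd(A);          % singular values, in decreasing order
>> s(1)/s(end)          % sigma_1 / sigma_n
>> cond(A,2)            % should agree with the ratio above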

The singular value decomposition is related directly to the Moore–Penrose pseudo-inverse. In fact, the pseudo-inverse can be defined directly in terms of the singular value decomposition.

DEFINITION 3.27  Let $A \in L(\mathbb{R}^n, \mathbb{R}^m)$, let $A = U\Sigma V^T$ represent a singular value decomposition of $A$, and assume $r \le p$ is such that $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ and $\sigma_{r+1} = \sigma_{r+2} = \cdots = \sigma_p = 0$. Then the Moore–Penrose pseudo-inverse of $A$ is defined to be

$$A^+ = V\Sigma^+U^T,$$

where $\Sigma^+ = \left(\Sigma^+_{ij}\right) \in L(\mathbb{R}^m, \mathbb{R}^n)$ is such that

$$\begin{cases} \Sigma^+_{ij} = 0 & \text{if } i \ne j \text{ or } i > r, \\ \Sigma^+_{ii} = 1/\sigma_i & \text{if } 1 \le i \le r. \end{cases}$$


Part of the power of the singular value decomposition comes from the following.

THEOREM 3.20

Suppose $A \in L(\mathbb{R}^n, \mathbb{R}^m)$ and we wish to find approximate solutions to $Ax = b$, where $b \in \mathbb{R}^m$. Then:

• If $Ax = b$ is inconsistent, then $x = A^+b$ represents the least squares solution of minimum 2-norm.

• If $Ax = b$ is consistent (but possibly underdetermined), then $x = A^+b$ represents the solution of minimum 2-norm.

• In general, $x = A^+b$ represents the least squares solution to $Ax = b$ of minimum norm.

The proof of Theorem 3.20 is left as an exercise (on page 144).


REMARK 3.12  If $m < n$, one would expect the system to be underdetermined but of full rank. In that case, $A^+b$ gives the solution $x$ such that $\|x\|_2$ is minimum; however, if $Ax = b$ were also inconsistent, then there would be many least squares solutions, and $A^+b$ would be the least squares solution of minimum norm. Similarly, if $m > n$, one would expect there to be a single least squares solution; however, if the rank of $A$ is $r < p = n$, then there would be many such least squares solutions, and $A^+b$ would be the least squares solution of minimum norm.

Example 3.37

Consider $Ax = b$, where $A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$ and $b = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}$. Then

$$U \approx \begin{pmatrix} -0.2148 & 0.8872 & 0.4082 \\ -0.5206 & 0.2496 & -0.8165 \\ -0.8263 & -0.3879 & 0.4082 \end{pmatrix}, \quad
\Sigma \approx \begin{pmatrix} 16.8481 & 0 & 0 \\ 0 & 1.0684 & 0 \\ 0 & 0 & 0.0000 \end{pmatrix},$$

$$V \approx \begin{pmatrix} -0.4797 & -0.7767 & -0.4082 \\ -0.5724 & -0.0757 & 0.8165 \\ -0.6651 & 0.6253 & -0.4082 \end{pmatrix}, \quad\text{and}\quad
\Sigma^+ \approx \begin{pmatrix} 0.0594 & 0 & 0 \\ 0 & 0.9360 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$

Since $\sigma_3 = 0$, we note that the system is not of full rank, so it could be either inconsistent or underdetermined. We compute $x \approx [0.9444, 0.1111, -0.7222]^T$, and we obtain16 $\|Ax - b\|_2 \approx 2.5 \times 10^{-15}$. Thus, $Ax = b$, although apparently underdetermined, is apparently consistent, and $x$ represents that solution of $Ax = b$ which has minimum 2-norm.

16The computations in this example were done using matlab, and were thus done in IEEE double precision. The digits displayed here are the results from that computation, rounded to four significant decimal digits with matlab's intrinsic display routines.

As with other methods for computing solutions, we usually do not form the pseudo-inverse $A^+$ explicitly to compute $A^+b$; instead, we use the following.

ALGORITHM 3.6

(Computing A^+ b)

INPUT:

(a) the m by n matrix A ∈ L(R^n, R^m),

(b) the right-hand-side vector b ∈ R^m,

(c) a tolerance ε such that a singular value σi is considered to be equal to 0 if σi/σ1 < ε.

OUTPUT: an approximation x to A^+ b.

1. Compute the SVD of A, that is, compute approximations to U ∈ L(R^m), Σ ∈ L(R^n, R^m), and V ∈ L(R^n) such that A = UΣV^T.

2. p ← min{m, n}.

3. r ← p.

4. FOR i = 1 to p

   IF σi/σ1 > ε THEN

      σ^+_i ← 1/σi.

   ELSE

      i. r ← i − 1.

      ii. EXIT FOR

   END IF

   END FOR

5. Compute w = (w1, · · · , wr)^T ∈ R^r, w ← U(:, 1 : r)^T b, where U(:, 1 : r) ∈ R^{m×r} is the matrix whose columns are the first r columns of U.

6. FOR i = 1 to r: wi ← σ^+_i wi.

7. x ← Σ_{i=1}^{r} wi V(:, i).

END ALGORITHM 3.6.
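The following matlab function is a minimal sketch of Algorithm 3.6 (the function name pinv_solve and the calling convention are ours, purely for illustration); matlab's built-in pinv(A)*b performs an equivalent computation.

function x = pinv_solve(A, b, epsilon)
% PINV_SOLVE  Approximate A^+ b as in Algorithm 3.6, treating any
% singular value with sigma_i/sigma_1 <= epsilon as zero.
[U, S, V] = svd(A);            % Step 1: A = U*S*V'
p = min(size(A));              % Step 2
sigma = diag(S);
r = p;                         % Step 3
for i = 1:p                    % Step 4: determine the numerical rank r
    if sigma(i)/sigma(1) <= epsilon
        r = i - 1;
        break
    end
end
w = U(:, 1:r)' * b;            % Step 5
w = w ./ sigma(1:r);           % Step 6: w_i <- w_i / sigma_i
x = V(:, 1:r) * w;             % Step 7
end

For instance, pinv_solve([1 2 3; 4 5 6; 7 8 9], [-1; 0; 1], 1e-10) reproduces the minimum-norm solution x ≈ (0.9444, 0.1111, −0.7222)^T of Example 3.37.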

REMARK 3.13 Ill-conditioning (i.e., sensitivity to roundoff error) in the computations in Algorithm 3.6 occurs when small singular values σi are used. For example, suppose σi/σ1 ≈ 10^−6, and there is an error δU(:, i) in the vector b, that is, b̃ = b + δU(:, i) (that is, we perturb b by δ in the direction of U(:, i)). Then, instead of A^+ b, we compute

A^+(b + δU(:, i)) = A^+ b + A^+ δU(:, i) = A^+ b + δ (1/σi) V(:, i).   (3.41)

Thus, the norm of the error δU(:, i) is magnified by 1/σi. Now, if, in addition, b happened to be in the direction of U(:, 1), that is, b = δ1 U(:, 1), then ‖A^+ b‖2 = ‖δ1 (1/σ1) V(:, 1)‖2 = (1/σ1)‖b‖2. Thus, the relative error, in this case, would be magnified by σ1/σi.

In view of Remark 3.13, we are led to consider modifying the problem slightly to reduce the sensitivity to roundoff error. For example, suppose that we are data fitting, with m data points (ti, yi) (as in Section 3.4 on page 117), and A is the matrix as in Equation (3.19), where m ≫ n. Then we assume there is some error in the right-hand-side vector b. However, since {U(:, i)} forms an orthonormal basis for R^m,

b = Σ_{i=1}^{m} βi U(:, i)   for some coefficients {βi}_{i=1}^{m}.

Therefore, U^T b = (β1, . . . , βm)^T, and we see that x will be more sensitive to changes in the components of b in the directions U(:, i) with larger indices i. If we know that typical errors in the data are on the order of ε, then, intuitively, it makes sense not to use components of b in which the magnification of errors will be larger than that. That is, it makes sense in such cases to choose the tolerance in Algorithm 3.6 equal to that ε.

Use of ε ≠ 0 in Algorithm 3.6 can be viewed as replacing the smallest singular values of the matrix A by 0. In the case that A ∈ L(R^n) is square and only σn is replaced by zero, this amounts to replacing an ill-conditioned matrix A by a matrix that is exactly singular. One (of many possible) theorems dealing with this replacement process is

THEOREM 3.21

Suppose A is an n by n matrix, and suppose we replace σn ≠ 0 in the singular value decomposition of A by 0, then form Ã = U Σ̃ V^T, where A = UΣV^T represents the singular value decomposition of A and Σ̃ = diag(σ1, · · · , σn−1, 0). Then

‖A − Ã‖2 = min over B ∈ L(R^n), rank(B) < n, of ‖A − B‖2.

Suppose now that Ã has been obtained from A by replacing the smallest singular values of A by 0, so the nonzero singular values of Ã are σ1 ≥ σ2 ≥ · · · ≥ σr > 0, and define x = Ã^+ b. Then perturbations of size ‖∆b‖ in b result in perturbations of size at most (σ1/σr)‖∆b‖ in x. This prompts us to define a generalization of condition number as follows.

DEFINITION 3.28 Let A be an m by n matrix with m and n arbitrary, and assume the nonzero singular values of A are σ1 ≥ σ2 ≥ · · · ≥ σr > 0. Then the generalized condition number of A is σ1/σr.

Example 3.38

Consider

A = [ 1 2 3
      4 5 6
      7 8 10 ],

whose singular value decomposition is approximately

U ≈ [ 0.2093   0.9644   0.1617
      0.5038   0.0353  −0.8631
      0.8380  −0.2621   0.4785 ],

Σ ≈ [ 17.4125   0        0
       0        0.8752   0
       0        0        0.1969 ],   and

V ≈ [ 0.4647  −0.8333   0.2995
      0.5538   0.0095  −0.8326
      0.6910   0.5528   0.4659 ].

Suppose we want to solve the system Ax = b, where b = [1, −1, 1]^T, but that, due to noise in the data, we do not wish to deal with any system of equations with condition number equal to 25 or greater. How can we describe the set of solutions, based on the best information we can obtain from the noisy data?

We first observe that κ2(A) = σ1/σn ≈ 88.4483. However, σ1/σ2 ≈ 19.8963 < 25. We may thus form a new matrix Ã = U Σ̃ V^T, where Σ̃ is obtained from Σ by replacing σ3 by 0. This is equivalent to projecting A onto the set of singular matrices according to Theorem 3.21. We then use Algorithm 3.6 (applied to Ã) to determine x as x = Ã^+ b. We obtain x ≈ (−0.6205, 0.0245, 0.4428)^T. Thus, to within the accuracy of 1/25 = 4%, we can only determine that the solution lies along the line

[ −0.6205
   0.0245
   0.4428 ] + y3 V(:, 3),   y3 ∈ R.

This technique is a common type of analysis in data fitting. The parameter y3 (or multiple parameters, in the case of higher-order rank deficiency) needs to be chosen through other information available with the application.
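The computation in Example 3.38 can be sketched in a few lines of matlab (the variable names are ours):

% Truncate the SVD so that the generalized condition number stays
% below kappa_max = 25, then compute the minimum-norm solution.
A = [1 2 3; 4 5 6; 7 8 10];
b = [1; -1; 1];
kappa_max = 25;
[U, S, V] = svd(A);
sigma = diag(S);
r = sum(sigma/sigma(1) > 1/kappa_max);   % number of singular values kept
x = V(:, 1:r) * ((U(:, 1:r)' * b) ./ sigma(1:r))
% x is approximately (-0.6205, 0.0245, 0.4428)^T; the solutions consistent
% with the assumed noise level lie along x + y3*V(:,3).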

3.7 Applications

Consider the following difference equation model [3], which describes the dynamics of a population divided into three stages:

J(t + 1) = (1 − γ1) s1 J(t) + b B(t)
N(t + 1) = γ1 s1 J(t) + (1 − γ2) s2 N(t)
B(t + 1) = γ2 s2 N(t) + s3 B(t)
(3.42)

The variables J(t), N(t), and B(t) represent the number of juveniles, non-breeders, and breeders, respectively, at time t. The parameter b > 0 is the birth rate, while γ1, γ2 ∈ (0, 1) represent the fraction (in one time unit) of juveniles that become non-breeders and non-breeders that become breeders, respectively. Parameters s1, s2, s3 ∈ (0, 1) are the survival rates of juveniles, non-breeders, and breeders, respectively.

To analyze the model numerically, we let b = 0.6, γ1 = 0.8, γ2 = 0.7, s1 = 0.7, s2 = 0.8, s3 = 0.9. Also notice the model can be written as

[ J(t + 1)       [ 0.14  0     0.6      [ J(t)
  N(t + 1)   =     0.56  0.24  0     ·    N(t)
  B(t + 1) ]       0     0.56  0.9 ]      B(t) ]

or in matrix form

X(t + 1) = A X(t),

where X(t) = (J(t), N(t), B(t))^T and

A = [ 0.14  0     0.6
      0.56  0.24  0
      0     0.56  0.9 ].

Suppose we know all the eigenvectors vi, i = 1, 2, 3, and their associated eigenvalues λi, i = 1, 2, 3, of the matrix A. From linear algebra, any initial vector X(0) can be expressed as a linear combination of the eigenvectors:

X(0) = c1 v1 + c2 v2 + c3 v3,


and then

X(1) = A X(0) = A(c1 v1 + c2 v2 + c3 v3)
     = c1 A v1 + c2 A v2 + c3 A v3
     = c1 λ1 v1 + c2 λ2 v2 + c3 λ3 v3.

Applying the same techniques, we get

X(2) = A X(1) = A(c1 λ1 v1 + c2 λ2 v2 + c3 λ3 v3)
     = c1 λ1^2 v1 + c2 λ2^2 v2 + c3 λ3^2 v3.

Continuing the above will lead to the general solution of the population dynamical model (3.42):

X(t) = Σ_{i=1}^{3} ci λi^t vi.

Now, to compute the eigenvalues and eigenvectors of A, we could simply type the following in the matlab command window:

>> A=[0.14 0 0.6; 0.56 0.24 0; 0 0.56 0.9]

A =

0.1400 0 0.6000

0.5600 0.2400 0

0 0.5600 0.9000

>> [v,lambda]=eig(A)

v =

-0.1989 + 0.5421i -0.1989 - 0.5421i 0.4959

0.6989 0.6989 0.3160

-0.3728 - 0.1977i -0.3728 + 0.1977i 0.8089

lambda =

0.0806 + 0.4344i 0 0

0 0.0806 - 0.4344i 0

0 0 1.1188

From the result, we see that the spectral radius of A is λ3 = 1.1188 and its corresponding eigenvector is v3 = (0.4959, 0.3160, 0.8089)^T. Hence, X(t) = A^t X(0) ≈ c3 (1.1188)^t v3. This shows the population size will increase geometrically as time increases.
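One can also observe the geometric growth directly by iterating the model; the following short matlab sketch (with an arbitrarily chosen initial population, our own illustration) does so:

A = [0.14 0 0.6; 0.56 0.24 0; 0 0.56 0.9];
X = [100; 100; 100];          % arbitrary initial numbers of J, N, and B
for t = 1:50
    X = A*X;                  % one time step: X(t+1) = A X(t)
end
growth = norm(A*X)/norm(X)    % approaches the spectral radius 1.1188
direction = X/norm(X)         % approaches the eigenvector (0.4959, 0.3160, 0.8089)^T

This is just the power method in disguise: after many steps, the term c3 λ3^t v3 dominates the general solution.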

3.8 Exercises

1. Let

   A = (  5  −2
         −4   7 ).

   Find ‖A‖1, ‖A‖∞, ‖A‖2, and ρ(A). Verify that ρ(A) ≤ ‖A‖1, ρ(A) ≤ ‖A‖∞, and ρ(A) ≤ ‖A‖2.

2. Show that back solving for Gaussian elimination (that is, show that completion of Algorithm 3.2) requires (n^2 + n)/2 multiplications and divisions and (n^2 − n)/2 additions and subtractions.

3. Consider Example 3.14 (on page 85).

   (a) Fill in the details of the computations. In particular, by multiplying the matrices together, show that M1^−1 and M2^−1 are as stated, that A = LU, and that L = M1^−1 M2^−1.

   (b) Solve Ax = b as mentioned in the example, by first solving Ly = b, then solving Ux = y. (You may use matlab, but print the entire dialog.)

4. Show that performing the forward phase of Gaussian elimination for Ax = b (that is, completing Algorithm 3.1) requires (1/3)n^3 + O(n^2) multiplications and divisions.

5. Show that the inverse of a nonsingular lower triangular matrix is lower triangular.

6. Explain why A = LU, where L and U are as in Equation (3.5) on page 86.

7. Verify the details in Example 3.15 by actually computing the solutions to the three linear systems, and by multiplying A and A^−1. (If you use matlab, print the details.)

8. Program the tridiagonal version of Gaussian elimination represented by equations (3.8) and (3.9) on page 95. Use your program to approximately solve

   u′′ = −1,   u(0) = u(1) = 0,

   using the technique from Example 3.18 (on page 93), with h = 1/4, 1/8, 1/64, and 1/4096. Compare with the exact solution u(x) = (1/2)x(1 − x).

9. Store the matrices from Problem 8 in matlab's sparse matrix format, and solve the systems from Problem 8 in matlab, using the sparse matrix format. Compare with the results you obtained from your tridiagonal system solver.

10. Let

    A = [ 1 2 3
          4 5 6
          7 8 10 ]   and   b = [ −1
                                  0
                                  1 ].

    (a) Compute κ∞(A) approximately.

    (b) Use floating point arithmetic with β = 10 and t = 3 (3-digit decimal arithmetic), rounding-to-nearest, and Algorithms 3.1 and 3.2 to find an approximation to the solution x to Ax = b.

    (c) Execute Algorithm 3.4 by hand, using t = 3, β = 10, and outwardly rounded interval arithmetic (and rounding-to-nearest for computing Y ).

    (d) Find the exact solution to Ax = b by hand.

    (e) Compare the results you have obtained.

11. Derive the normal equations (3.22) from (3.21).

12. Let

    A = [ 2  1  1
          4  4  1
          6 −5  8 ].

    (a) Find the LU factorization of A, such that L is lower triangular and U is unit upper triangular.

    (b) Perform forward solving then back solving to find a solution x for the system of equations Ax = b = [4 7 15]^T.

13. Find the Cholesky factorization of

    A = [  1 −1  2
          −1  5  4
           2  4 29 ].

    Also explain why A is positive definite.

14. Let

    A = ( 0.1α  0.1α
          1.0   1.5  ).

    Determine α such that κ∞(A), the condition number in the induced ∞-norm, is minimized.

15. Let A be the n × n lower triangular matrix with elements

    aij = 1 if i = j,   aij = −1 if i = j + 1,   and aij = 0 otherwise.

    Determine the condition number of A using the matrix norm ‖ · ‖∞.

16. Consider the matrix system Au = b given by

    [ 1/2   0    0    0       [ u1       [ 1
      1/4   1/2  0    0         u2   =     0
      1/8   1/4  1/2  0     ·   u3         0
      1/16  1/8  1/4  1/2 ]     u4 ]       1 ].

    (a) Determine A^−1 by hand.

    (b) Determine the infinity-norm condition number of the matrix A.

    (c) Let ũ be the solution when the right-hand side vector b is perturbed to b̃ = (1.01, 0, 0, 0.99)^T. Estimate ‖u − ũ‖∞ without computing ũ.

17. Complete the computations, to check that x(4) is as given in Example 3.35 on page 131. (You may use intlab. Also see the code gauss_seidel_step.m available from http://www.siam.org/books/ot110.)

18. Repeat Example 3.28, but with the interval Gauss–Seidel method, instead of interval Gaussian elimination, starting with x(0)i = [−10, 10], 1 ≤ i ≤ 3. Compare the results.

19. Let A be the n × n tridiagonal matrix with

    aij = 4 if i = j,   aij = −1 if i = j + 1 or i = j − 1,   and aij = 0 otherwise.

    Prove that the Gauss–Seidel and Jacobi methods converge for this matrix.

20. Consider the linear system

    ( 3 2     ( x1     (  7
      2 4 ) ·   x2 ) =   10 ).

    Using the starting vector x(0) = (0, 0)^T, carry out two iterations of the Gauss–Seidel method to solve the system.

21. Prove Theorem 3.20 on page 136. (Hint: You may need to consider various cases. In any case, you'll probably want to use the properties of orthogonal matrices, as in the proof of Theorem 3.19.)

22. Given U, Σ, and V as given in Example 3.37 (on page 136), compute A^+ b by using Algorithm 3.6. How does the x that you obtain compare with the x reported in Example 3.37?

23. Find the singular value decomposition of the matrix

    A = [ 1 2
          1 1
          1 3 ].

Chapter 4

Approximating Functions and Data

4.1 Introduction

A fundamental task in scientific computing is to approximate a function or data set by a simpler function. For example, to evaluate functions such as sin, cos, exp, etc., developers of a programming language or even designers of computer chip circuitry reduce computing the function value to a combination of additions, subtractions, multiplications, divisions, comparisons, and table look-up. We have seen approximation of a general function by a polynomial in Example 1.3 on page 4, where we approximated sin(x) to a specified accuracy over a small interval by a Taylor polynomial.

Approximation of data sets by functions that are easy to evaluate occurs throughout computer science (such as in computer graphics), statistics, engineering, and the sciences. We have seen an example of this (approximating a data set in the least squares sense by a polynomial of degree 2) in Example 3.29 on page 119.

In this chapter, we study several techniques for approximating functions and data sets by polynomials, piecewise polynomials (functions defined by different polynomials over different subintervals), and trigonometric functions.

4.2 Taylor Polynomial Approximations

Recall from Chapter 1 (Taylor's Theorem, on page 3) that if f ∈ C^n[a, b] and f^(n+1)(x) exists on [a, b], then for x0 ∈ [a, b] there exists a ξ(x) between x0 and x such that f(x) = Pn(x) + Rn(x), where

Pn(x) = Σ_{k=0}^{n} ( f^(k)(x0) / k! ) (x − x0)^k

and

Rn(x) = ∫_{x0}^{x} ( (x − t)^n / n! ) f^(n+1)(t) dt = f^(n+1)(ξ(x)) (x − x0)^{n+1} / (n + 1)!.

Pn(x) is the Taylor polynomial of f(x) about x = x0, and Rn(x) is the remainder term. Taylor polynomials provide good approximations near x = x0. However, away from x = x0, Taylor polynomials can be poor approximations. In addition, Taylor series require smooth functions. Nonetheless, automatic differentiation techniques, as explained in Section 6.2 on page 215, can be used to obtain high-order derivatives for complicated but smooth functions.
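For instance (a small illustration of our own, not from the text), the degree-5 Taylor polynomial of sin(x) about x0 = 0 is P5(x) = x − x^3/3! + x^5/5!, which can be evaluated in matlab in nested (Horner) form:

x = 0.1;
x2 = x*x;
P5 = x*(1 + x2*(-1/6 + x2/120));   % x - x^3/6 + x^5/120
abs(P5 - sin(x))                   % about 2e-11, the size of the remainder
                                   % term x^7/7! near x = 0.1

At x = 2, by contrast, the same polynomial is off by about 0.02, illustrating how the quality of a Taylor approximation degrades away from x0.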

4.3 Polynomial Interpolation

Given n + 1 data points

{(xi, yi)}_{i=0}^{n},

polynomial interpolation is the process of finding a polynomial pn(x) of degree n or less such that pn(xi) = yi for i = 0, 1, . . . , n. We describe here several ways of finding and representing the interpolating polynomial.

4.3.1 The Vandermonde System

Example 4.1

Consider the data set from Example 3.29, namely,

i  xi  yi
0  0   1
1  1   4
2  2   5
3  3   8

However, instead of computing a least squares fit as in Example 3.29, we will pass a polynomial through each of the data points exactly. The polynomial must obey p(xi) = yi, i = 0, 1, 2, 3. Since there are four equations, we expect that we should have four unknowns for the system of equations and unknowns to be well-determined. If we write the polynomial in power form, that is,

pn(x) = a0 + a1 x + · · · + an x^n,   (4.1)

we see that n = 3 for there to be four unknowns aj. Explicitly, the four equations are thus

p3(x0) = y0 :  a0 = 1,
p3(x1) = y1 :  a0 + a1 + a2 + a3 = 4,
p3(x2) = y2 :  a0 + 2a1 + 4a2 + 8a3 = 5,
p3(x3) = y3 :  a0 + 3a1 + 9a2 + 27a3 = 8,

or, in matrix form:

[ 1 0 0  0      [ a0       [ 1
  1 1 1  1        a1         4
  1 2 4  8    ·   a2    =    5
  1 3 9 27 ]      a3 ]       8 ].

With a matlab dialog as in Example 3.29, we have:

>> A = [1 0 0 0
1 1 1 1
1 2 4 8
1 3 9 27]
A =
     1     0     0     0
     1     1     1     1
     1     2     4     8
     1     3     9    27
>> b = [1;4;5;8]
b =
     1
     4
     5
     8
>> a = A\b
a =
    1.0000
    5.3333
   -3.0000
    0.6667
>> tt = linspace(0,3);
>> yy = a(1) + a(2)*tt + a(3)*tt.^2 + a(4)*tt.^3;
>> axis([-0.1,3.1,0.9,8.1])
>> hold
>> plot(A(:,2),b,'LineStyle','none','Marker','*','MarkerEdgeColor','red','Markersize',15)
>> plot(tt,yy)

This dialog results in the following plot of the data points and interpolating polynomial.

[Plot: the four data points (asterisks) and the interpolating cubic on 0 ≤ x ≤ 3, with 1 ≤ y ≤ 8.]

The system of equations in Example 4.1 is called the Vandermonde system, and the matrix is called a Vandermonde matrix. The general form for a Vandermonde system is

Aa = [ 1  x0  x0^2  · · ·  x0^n       [ a0       [ y0
       1  x1  x1^2  · · ·  x1^n         a1         y1
       ⋮                    ⋮      ·    ⋮     =    ⋮        (4.2)
       1  xn  xn^2  · · ·  xn^n ]       an ]       yn ].

It can be shown that, if the points {xi}_{i=0}^{n} are distinct, the corresponding Vandermonde matrix is nonsingular, and it follows that the coefficients of the interpolating polynomial are unique:

THEOREM 4.1

For any n + 1 distinct real numbers x0, x1, . . . , xn and for arbitrary real numbers y0, y1, . . . , yn, there exists a unique interpolating polynomial of degree at most n such that p(xj) = yj, j = 0, 1, . . . , n.

Although the interpolating polynomial is in general unique, the power form (4.1) may not be the easiest form with which to work, nor the most numerically stable to evaluate in a particular application. We now study some alternative forms.

4.3.2 The Lagrange Form

We first define a useful set of polynomials of degree n, denoted by ℓ0, ℓ1, . . . , ℓn, for points x0, x1, . . . , xn ∈ R as

ℓk(x) = ∏_{i=0, i≠k}^{n} (x − xi)/(xk − xi),   k = 0, 1, . . . , n.   (4.3)

Notice that

(i) ℓk(x) is of degree n for each k = 0, 1, . . . , n.

(ii) ℓk(xj) = 0 if j ≠ k and ℓk(xj) = 1 if j = k, that is, ℓk(xj) = δkj.

Now let

p(x) = Σ_{k=0}^{n} yk ℓk(x).

Then

p(xj) = Σ_{k=0}^{n} yk ℓk(xj) = Σ_{k=0}^{n} yk δjk = yj   for j = 0, 1, . . . , n.

Thus,

p(x) = Σ_{k=0}^{n} yk ℓk(x)

is a polynomial of degree at most n that passes through the points (xj, yj), j = 0, 1, 2, . . . , n. This is called the Lagrange form of the interpolating polynomial, and the set of functions {ℓk}_{k=0}^{n} is called the Lagrange basis for the space of polynomials of degree n associated with the set of points {xi}_{i=0}^{n}.

Summarizing, we obtain the Lagrange form of the (unique) interpolating polynomial:

p(x) = Σ_{k=0}^{n} yk ℓk(x)   with   ℓk(x) = ∏_{j=0, j≠k}^{n} (x − xj)/(xk − xj).   (4.4)

An important feature of the Lagrange basis is that it is collocating. The salient property of a collocating basis is that the matrix of the system of equations to be solved for the coefficients is the identity matrix. That is, the matrix in the system of equations

{p(xi) = yi}_{i=0}^{n},

to be solved for the ck in the representation

p(x) = Σ_{k=0}^{n} ck ℓk(x),

namely

[ ℓ0(x0)  ℓ1(x0)  · · ·  ℓn(x0)      [ c0       [ y0
  ℓ0(x1)  ℓ1(x1)  · · ·  ℓn(x1)        c1         y1
    ⋮        ⋮              ⋮      ·    ⋮     =    ⋮        (4.5)
  ℓ0(xn)  ℓ1(xn)  · · ·  ℓn(xn) ]      cn ]       yn ],

is the identity matrix. (Contrast this to the Vandermonde matrix, where we use x^k instead of ℓk(x). The Vandermonde matrix becomes ill-conditioned for moderately sized n, while the identity matrix is perfectly conditioned; indeed, we need do no work to solve (4.5).)
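A direct (if not the most efficient) way to evaluate the Lagrange form (4.4) is sketched in the following matlab function (the function name is ours):

function p = lagrange_eval(xdata, ydata, x)
% LAGRANGE_EVAL  Evaluate the interpolating polynomial in Lagrange
% form (4.4) at the points in x, for the data (xdata(k), ydata(k)).
n1 = length(xdata);              % n1 = n + 1 points
p = zeros(size(x));
for k = 1:n1
    ell = ones(size(x));         % build ell_k(x) as a product
    for j = [1:k-1, k+1:n1]
        ell = ell .* (x - xdata(j)) / (xdata(k) - xdata(j));
    end
    p = p + ydata(k) * ell;
end
end

For the running data set, lagrange_eval([0 1 2 3], [1 4 5 8], 1.5) returns 4.5, the same value as the power-form polynomial computed in Example 4.1.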

Example 4.2

For the data set as in Example 4.1, that is, for

i  xi  yi
0  0   1
1  1   4
2  2   5
3  3   8

we have

ℓ0(x) = (x − 1)(x − 2)(x − 3) / ((0 − 1)(0 − 2)(0 − 3)) = −(1/6)(x − 1)(x − 2)(x − 3),

ℓ1(x) = x(x − 2)(x − 3) / ((1)(1 − 2)(1 − 3)) = (1/2) x(x − 2)(x − 3),

ℓ2(x) = x(x − 1)(x − 3) / ((2)(2 − 1)(2 − 3)) = −(1/2) x(x − 1)(x − 3),

ℓ3(x) = x(x − 1)(x − 2) / ((3)(3 − 1)(3 − 2)) = (1/6) x(x − 1)(x − 2),

and the interpolating polynomial in Lagrange form is

p3(x) = 1 · [ −(1/6)(x − 1)(x − 2)(x − 3) ] + 4 · [ (1/2) x(x − 2)(x − 3) ]
      + 5 · [ −(1/2) x(x − 1)(x − 3) ] + 8 · [ (1/6) x(x − 1)(x − 2) ].

4.3.3 The Newton Form

Although the Lagrange polynomials form a collocating basis, useful in theory for symbolically deriving formulas such as those for numerical integration, the Lagrange representation (4.4) is generally not used to numerically evaluate interpolating polynomials. This is because

(1) it requires many operations to evaluate p(x) for many different values of x, and

(2) all ℓk's change if another point (xn+1, yn+1) is added.

These problems are alleviated in the Newton form of the interpolating polynomial. To describe this form we use the following.

DEFINITION 4.1 y[xj, xj+1] = (yj+1 − yj)/(xj+1 − xj) is the first divided difference.

DEFINITION 4.2

y[xj, xj+1, . . . , xj+k] = ( y[xj+1, . . . , xj+k] − y[xj, . . . , xj+k−1] ) / (xj+k − xj)

is the k-th order divided difference (and is defined iteratively).

Consider first the linear interpolant through (x0, y0) and (x1, y1):

p1(x) = y0 + (x − x0) y[x0, x1],

since p1(x) is of degree 1, p1(x0) = y0, and

p1(x1) = y0 + (x1 − x0) (y1 − y0)/(x1 − x0) = y1.

Consider now the quadratic interpolant through (x0, y0), (x1, y1), and (x2, y2). We have

p2(x) = p1(x) + (x − x0)(x − x1) y[x0, x1, x2],

since p2(x) is of degree 2, p2(x0) = y0, p2(x1) = y1, and

p2(x2) = p1(x2) + (x2 − x0)(x2 − x1) ( y[x1, x2] − y[x0, x1] ) / (x2 − x0)
       = y0 + (x2 − x0) y[x0, x1] + (x2 − x1) y[x1, x2] − (x2 − x1) y[x0, x1]
       = y0 + (x1 − x0) y[x0, x1] + (y2 − y1)
       = y0 + (y1 − y0) + (y2 − y1) = y2.

Continuing this process, one obtains

pn(x) = pn−1(x) + (x − x0)(x − x1) · · · (x − xn−1) y[x0, x1, . . . , xn]
      = y0 + (x − x0) y[x0, x1] + (x − x0)(x − x1) y[x0, x1, x2] + · · ·
        + { ∏_{i=0}^{n−1} (x − xi) } y[x0, . . . , xn].

This is called Newton's divided-difference formula for the interpolating polynomial through the points {(xj, yj)}_{j=0}^{n}. This is a computationally efficient form, because the divided differences can be rapidly calculated using the following tabular arrangement, which is easily implemented on a computer.

j  xj  yj  y[xj, xj+1]                       y[xj, xj+1, xj+2]                                  y[xj, . . . , xj+3]
0  x0  y0
1  x1  y1  (y1 − y0)/(x1 − x0) = y[x0, x1]
2  x2  y2  (y2 − y1)/(x2 − x1) = y[x1, x2]   (y[x1, x2] − y[x0, x1])/(x2 − x0) = y[x0, x1, x2]
3  x3  y3  (y3 − y2)/(x3 − x2) = y[x2, x3]   (y[x2, x3] − y[x1, x2])/(x3 − x1) = y[x1, x2, x3]   (y[x1, x2, x3] − y[x0, x1, x2])/(x3 − x0)
4  x4  y4  (y4 − y3)/(x4 − x3) = y[x3, x4]   (y[x3, x4] − y[x2, x3])/(x4 − x2) = y[x2, x3, x4]   (y[x2, x3, x4] − y[x1, x2, x3])/(x4 − x1)

Example 4.3

Consider y(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−t^2/2} dt (the standard normal distribution function), with the divided-difference table

j  xj   yj      y[xj, xj+1]                      y[xj, xj+1, xj+2]                y[xj, . . . , xj+3]
0  1.4  0.9192
1  1.6  0.9452  (0.9452 − 0.9192)/0.2 = 0.130
2  1.8  0.9641  0.0945                           (0.0945 − 0.130)/0.4 = −0.08875
3  2.0  0.9772  0.0655                           −0.0725                          (−0.0725 + 0.08875)/0.6 = 0.02708

Thus,

p1(x) = 0.9192 + (x − 1.4)(0.130)

is the line through (1.4, 0.9192) and (1.6, 0.9452). Hence, y(1.65) ≈ p1(1.65) ≈ 0.9517. Also,

p2(x) = 0.9192 + (x − 1.4)(0.130) + (x − 1.4)(x − 1.6)(−0.08875)

is a quadratic polynomial through (x0, y0), (x1, y1), and (x2, y2). Hence, y(1.65) ≈ p2(1.65) ≈ 0.9506. Finally,

p3(x) = p2(x) + (x − 1.4)(x − 1.6)(x − 1.8)(0.027083)

is the cubic polynomial through all four points, and y(1.65) ≈ p3(1.65) ≈ 0.9505, which is accurate to four digits.

If the points are equally spaced, i.e., xj+1 − xj = h for all j, Newton's divided difference formula can be simplified. (See, e.g., [7].) In particular, the scaled differences k! h^k y[xj, xj+1, . . . , xj+k] satisfy the recursion

k! h^k y[xj, xj+1, . . . , xj+k] = (k − 1)! h^{k−1} y[xj+1, . . . , xj+k] − (k − 1)! h^{k−1} y[xj, . . . , xj+k−1],   (4.6)

so each column of scaled differences is obtained by simply subtracting adjacent entries of the previous column. The resulting formula is called Newton's forward difference formula. If the points x0, . . . , xn are reordered to xn, xn−1, . . . , x0, then there is an analogous formula, called Newton's backward difference formula.

Example 4.4

In Example 4.1, the points are equally spaced. Since h = 1 here, the computations become simply

k! y[xj, xj+1, . . . , xj+k] = (k − 1)! y[xj+1, . . . , xj+k] − (k − 1)! y[xj, . . . , xj+k−1],

and the Newton forward difference table becomes

j  xj  yj  y[xj, xj+1]  2 y[xj, xj+1, xj+2]  6 y[xj, xj+1, xj+2, xj+3]
0  0   1   3            −2                   4
1  1   4   1             2                   —
2  2   5   3             —                   —
3  3   8   —             —                   —

and the Newton form for the interpolating polynomial is

p3(x) = 1 · N0(x) + 3 N1(x) + (1/2!) · (−2) N2(x) + (1/3!) · 4 N3(x)
      = 1 + 3x − x(x − 1) + (2/3) x(x − 1)(x − 2),

where N0(x) ≡ 1, N1(x) ≡ x, N2(x) ≡ x(x − 1), and N3(x) ≡ x(x − 1)(x − 2). An alternative viewpoint is that taken in Example 4.1, where we explicitly form a system of equations:

p3(0) = 1 :  d0 N0(0) + d1 N1(0) + d2 N2(0) + d3 N3(0) = 1,
p3(1) = 4 :  d0 N0(1) + d1 N1(1) + d2 N2(1) + d3 N3(1) = 4,
p3(2) = 5 :  d0 N0(2) + d1 N1(2) + d2 N2(2) + d3 N3(2) = 5,
p3(3) = 8 :  d0 N0(3) + d1 N1(3) + d2 N2(3) + d3 N3(3) = 8,

where di = y[x0, . . . , xi]. In matrix form, this system is

[ 1 0 0 0      [ d0       [ 1
  1 1 0 0        d1         4
  1 2 2 0    ·   d2    =    5
  1 3 6 6 ]      d3 ]       8 ],

with solution equal to

[ d0       [ 1          [ y[0]
  d1         3            y[0, 1]
  d2    =   −1       =    y[0, 1, 2]
  d3 ]       2/3 ]        y[0, 1, 2, 3] ].

This example illustrates that the coefficient matrix for the Newton interpolating polynomial is lower triangular. In fact, the forward-substitution process for solving this lower triangular system results in the same computations as computing the divided differences.

4.3.4 An Error Formula for the Interpolating Polynomial

We now consider the error in approximating a given function f(x) by an interpolating polynomial p(x) that passes through the n + 1 points (xj, f(xj)), j = 0, 1, 2, . . . , n.

THEOREM 4.2

If x0, x1, . . . , xn are n + 1 distinct points in [a, b] and f has n + 1 continuous derivatives on the interval [a, b], then for each x ∈ [a, b], there exists a number ξ = ξ(x) ∈ (a, b) such that the interpolating polynomial pn to f through these points obeys

f(x) = pn(x) + f^(n+1)(ξ(x)) ∏_{j=0}^{n} (x − xj) / (n + 1)!.   (4.7)

A proof of this theorem can be found in our graduate-level text [1] and other references.

Formula (4.7) may be used analogously to the representation of a function as a Taylor polynomial with remainder term (Taylor's theorem, on page 3).

Example 4.5

We will approximate sin(x) on [−0.1, 0.1], as in Example 1.3 (on page 4), except we will approximate by an interpolating polynomial with equally spaced points. As in Example 1.3, we will find a degree of polynomial (i.e., a number of equally spaced points) that will suffice to ensure that the error of approximation is at most 10^−16. If we use n + 1 such points, [−0.1, 0.1] will be divided into n subintervals, and each subinterval will have length h = 0.2/n. Furthermore, we have

|f(x) − pn(x)| = |f^(n+1)(ξ(x))| ∏_{j=0}^{n} |x − xj| / (n + 1)! ≤ (1/(n + 1)!) ∏_{j=0}^{n} |x − xj|,

since f^(n+1) is either a sine or a cosine. To bound the factor ∏_{j=0}^{n} |x − xj|, observe that, if x ∈ [−0.1, 0.1], x is in some interval [xj, xj+1] of length h, so the largest |x − xj| can be is h. In adjacent intervals, the largest |x − xj| can be is 2h, etc. Observing this in the context of the product, one sees that, if we bound each factor in the product in this way, the largest the product of the bounds can be is

∏_{j=0}^{n} |x − xj| ≤ h(h)(2h)(3h) · · · (nh) = n! h^{n+1} = n! (0.2/n)^{n+1}.

Thus,

|f(x) − pn(x)| ≤ (1/(n + 1)!) n! (0.2/n)^{n+1} = (1/(n + 1)) (0.2/n)^{n+1}.

We compute this bound for various n with the following matlab dialog (compressed for brevity).

>> for n=1:15
n, (1/(n+1))*(0.2/n)^(n+1)
end
n = 1, ans = 0.0200
n = 2, ans = 3.3333e-004
n = 3, ans = 4.9383e-006
n = 4, ans = 6.2500e-008
n = 5, ans = 6.8267e-010
n = 6, ans = 6.5321e-012
n = 7, ans = 5.5509e-014
n = 8, ans = 4.2386e-016
n = 9, ans = 2.9368e-018
...

We see that it is sufficient to take 9 subintervals, corresponding to 10 equally spaced points, for the error to be at most 10^−16. (It is possible that a smaller n would work, since the bounds we substituted for the actual values may be overestimates. Also, since floating point arithmetic was used, this is not a mathematically rigorous proof; the expression could be evaluated using interval arithmetic to make the result mathematically rigorous.)

One would expect that, if we are approximating a function f by an interpolating polynomial, the graph of the interpolating polynomial will get closer to the graph of the function as we take more and more points, and as h gets smaller. However, this is not always the case.

Example 4.6

Consider Runge's function:

f(x) = 1/(1 + x^2).

We will compute and graph the interpolating polynomials (with a graph of Runge's function itself) using 5, 9, and 17 equally spaced points in the interval [−5, 5]. We use our matlab functions Lagrange_interp_poly_coeffs.m and Lagrange_interp_poly_val.m, which we have posted on the web page http://interval.louisiana.edu/Classical-and-Modern-NA/:

xpts = linspace(-5,5,200);

z1 = 1./(1+xpts.^2);

[a] = Lagrange_interp_poly_coeffs(4,’runge’,-5,5);

z2 = Lagrange_interp_poly_val(a,xpts);

[a] = Lagrange_interp_poly_coeffs(8,’runge’,-5,5);

z3 = Lagrange_interp_poly_val(a,xpts);

[a] = Lagrange_interp_poly_coeffs(16,’runge’,-5,5);

z4 = Lagrange_interp_poly_val(a,xpts);

plot(xpts,z1,xpts,z2,xpts,z3,xpts,z4);


The result is as follows.

[Plot: Runge's function together with the interpolating polynomials on 5, 9, and 17 equally spaced points; the vertical axis runs from −16 to 2, the large negative excursions coming from the high-degree interpolants near the ends of [−5, 5].]

We see that, the higher the degree, the worse the approximation near the ends of the interval. A clue to what is happening is the observation that the (n+1)-st derivative f^(n+1)(0) increases like (n + 1)!, and that the maximum of |∏_{j=0}^{n} (x − xj)| also increases as n increases.

A similar phenomenon as in Example 4.6 will occur if we try to pass a high-degree interpolating polynomial through a large number of data points when there are small errors (such as measurement errors) in the values. The mathematical effect of the errors is the same as if the supposed underlying function had very large higher-order derivatives.

In the next section, we examine a way of choosing the points xi to reduce the error term in (4.7).

4.3.5 Optimal Points of Interpolation: Chebyshev Points

Reviewing the error estimate (4.7), we have

max_{a≤x≤b} |f(x) − p(x)| ≤ (1/(n + 1)!) · max_{x∈[a,b]} |f^(n+1)(x)| · max_{x∈[a,b]} |∏_{j=0}^{n} (x − xj)|.   (4.8)

In Example 4.6, we saw that the last factor does not tend to 0 as we increase the number of points, if the points are equally spaced. In that example, we saw that |f(x) − pn(x)| was largest near the ends of the interval.


THEOREM 4.3

(Chebyshev points) max_{x∈[−1,1]} |∏_{i=0}^{n} (x − xi)| is minimized on [−1, 1] when

xi = cos( ((2i + 1)/(n + 1)) · (π/2) ),   0 ≤ i ≤ n,

and the minimum value is 2^−n. Furthermore, if we approximate f(y), y ∈ [a, b] ≠ [−1, 1], and we take

yi = ((b − a)/2) xi + ((b + a)/2),

then

max_{y∈[a,b]} |∏_{i=0}^{n} (y − yi)| = 2^{−2n−1} (b − a)^{n+1}.

Theorem 4.3 is based on the Chebyshev equi-oscillation property. The points xi are the roots of the Chebyshev polynomials

Tn(x) = cos(n arccos(x)).   (4.9)

We present a detailed explanation and proof of this theorem in our graduate text [1].

Example 4.7

We will recompute the interpolating polynomials of degree 4, 8, and 16, as in Example 4.6, except we use Chebyshev points instead of equally spaced points. We modify Lagrange_interp_poly_coeffs.m to use the Chebyshev points. To do so, one may simply replace the lines

for i =1:np1;

x(i) = a + (i-1)*h;

end

by

for i =1:np1;
t = cos((2*i-1)/np1 *pi/2);
x(i) = 0.5 *((b-a)*t + (b+a));
end

We obtain the following graph.

[Plot: Runge's function and the interpolating polynomials of degree 4, 8, and 16 with Chebyshev points; the vertical axis runs from −0.2 to 1.2.]

With these better points of interpolation, there now appears to be convergence towards the actual function as we increase n. (In fact, this can be formally shown to be so.)

REMARK 4.1 We have seen through Example 4.6 (approximating Runge's function) that the approximation to a function f does not necessarily get better as we take more and more equally spaced points. However, with Example 4.7, we saw that we could make the interpolating polynomial approximate Runge's function as closely as we want by taking sufficiently many Chebyshev points. However, it can be shown that some other functions cannot be approximated well by a high degree interpolating polynomial with Chebyshev points. In fact, it can be shown that, no matter how the interpolating points are distributed, there exists a continuous function f for which max_{x∈[a,b]} |pn(x) − f(x)| → ∞ as n → ∞.

This observation motivates us to consider other methods for approximating functions using polynomials, in particular piecewise polynomial interpolation. The piecewise polynomial concept, analogous to composite integration, involves dividing the interval of approximation into subintervals and using a different polynomial over each subinterval. We force the resulting piecewise polynomial to have a particular degree of smoothness by requiring derivatives to match between separate subintervals.


4.4 Piecewise Polynomial Interpolation

Piecewise polynomials are commonly used approximations. They are easy to work with, they can provide good approximations, and they are widely used in computer graphics. In addition, piecewise polynomials are employed, for example, in finite element methods. Good references for piecewise polynomial approximation are [24] and [34]. Here, we study two commonly used piecewise interpolants: linear splines (piecewise linear interpolants) and cubic splines. A unified treatment, based on elementary functional analysis, appears in our graduate text [1].

4.4.1 Piecewise Linear Interpolation

As with polynomial interpolation, we start with a set of points {xi}_{i=0}^{n} that subdivides the interval [a, b]:

a = x0 < x1 < · · · < xn−1 < xn = b,

and we draw lines between the points (xi, f(xi)). This is the graph of the piecewise linear interpolant to f (or, if we have a finite data set {(xi, yi)}, the piecewise linear interpolant to the data). More formally, we have:

DEFINITION 4.3 The piecewise linear interpolant to the data

{(xi, yi)}

is the function ϕ(x) such that

1. ϕ is linear on each [xi, xi+1], and

2. ϕ(xi) = yi.

Graphically, ϕ may look as in Figure 4.1.


FIGURE 4.1: An example of a piecewise linear function.


Example 4.8

The piecewise linear interpolant to the data set as in Example 4.1, that is, to the data set

i  xi  yi
0  0   1
1  1   4
2  2   5
3  3   8

is

ϕ(x) = 1 + 3(x − 0)  for 0 ≤ x ≤ 1,
       4 + (x − 1)   for 1 ≤ x ≤ 2,
       5 + 3(x − 2)  for 2 ≤ x ≤ 3.

matlab has the function interp1 to do piecewise linear and other interpolants. We may thus use the following dialog:

>> x = [0,1,2,3]
x =
     0     1     2     3
>> y = [1,4,5,8]
y =
     1     4     5     8
>> xi = linspace(0,3);
>> yi = interp1(x,y,xi,'linear');
>> axis([-0.1,3.1,0.9,8.1])
>> hold
Current plot held
>> plot(x,y,'LineStyle','none','Marker','*','MarkerEdgeColor','red','Markersize',15)
>> plot(xi,yi)
>>

This dialog produces the following plot.

[Plot: the data points and the piecewise linear interpolant on 0 ≤ x ≤ 3, with 1 ≤ y ≤ 8.]

Analogously to the Lagrange functions for polynomial interpolation, there is a commonly used collocating basis to represent piecewise linear interpolants.


This basis consists of the hat functions ϕi(x), 0 ≤ i ≤ n, defined as follows.

ϕ0(x) = (x1 − x)/(x1 − x0)   for x0 ≤ x ≤ x1,   and 0 otherwise;

ϕn(x) = (x − xn−1)/(xn − xn−1)   for xn−1 ≤ x ≤ xn,   and 0 otherwise;

and, for 1 ≤ i ≤ n − 1,

ϕi(x) = (x − xi−1)/(xi − xi−1)   for xi−1 ≤ x ≤ xi,
        (xi+1 − x)/(xi+1 − xi)   for xi ≤ x ≤ xi+1,
        0                        otherwise.

These hat functions are depicted graphically in Figure 4.2.


FIGURE 4.2: Graphs of the “hat” functions ϕi(x).

Example 4.9

The hat functions for the abscissas x0 = 0, x1 = 1, x2 = 2, x3 = 3 are

ϕ0(x) = 1 − x  for 0 ≤ x ≤ 1,   0 for 1 ≤ x ≤ 3,

ϕ1(x) = x      for 0 ≤ x ≤ 1,   2 − x for 1 ≤ x ≤ 2,   0 for 2 ≤ x ≤ 3,

ϕ2(x) = 0      for 0 ≤ x ≤ 1,   x − 1 for 1 ≤ x ≤ 2,   3 − x for 2 ≤ x ≤ 3,

ϕ3(x) = 0      for 0 ≤ x ≤ 2,   x − 2 for 2 ≤ x ≤ 3.

In terms of these hat functions, the piecewise linear interpolant to the data

i  xi  yi
0  0   1
1  1   4
2  2   5
3  3   8

is

ϕ(x) = 1 ϕ0(x) + 4 ϕ1(x) + 5 ϕ2(x) + 8 ϕ3(x).

Here is an estimate for the error when using the piecewise linear interpolant to approximate a function:

THEOREM 4.4

Let

a = x0 < x1 < · · · < xn−1 < xn = b,

let h = max(xi+1 − xi) denote the maximum length of a subinterval, suppose f has two continuous derivatives on the interval [a, b], and let ϕ(x) = In(f)(x) denote the piecewise linear interpolant to f over this point set. Then

max_{x∈[a,b]} |f(x) − In(f)(x)| ≤ (1/8) h^2 max_{x∈[a,b]} |f′′(x)|.

PROOF We use the error term (4.7) (on page 154) for polynomial interpolation. In particular, on each subinterval [xi, xi+1], f is interpolated by a degree-1 polynomial, so (4.7) gives

f(x) = In(f)(x) + f′′(ξ(x)) (x − xi)(x − xi+1) / 2.   (4.10)

However, the quadratic

g(x) = (x − xi)(x − xi+1)

has a vertex at x = (xi + xi+1)/2, g(xi) = g(xi+1) = 0, and

min g(x) = −(xi+1 − xi)^2 / 4 ≥ −h^2/4,   so   |g(x)| ≤ h^2/4 on [xi, xi+1].   (4.11)

Combining (4.10) and (4.11) gives

|f(x) − In(f)(x)| ≤ (|f′′(ξ(x))| / 2) · (h^2 / 4),

from which the error bound follows.

Example 4.10

Consider f(x) = ln x on the interval [2, 4]. We want to find h that will guarantee that the piecewise linear interpolant of f(x) on [2, 4] has an error of at most 10^−4. We will assume that |xi+1 − xi| = h for 0 ≤ i ≤ n − 1. Then

‖f − In f‖∞ ≤ (h^2/8) ‖f′′‖∞ = (h^2/8) max_{2≤x≤4} |1/x^2| = h^2/32 ≤ 10^−4.

Thus h^2 ≤ 32 × 10^−4, h ≤ 0.056, giving n = (4 − 2)/h ≥ 36.

Although hat functions and piecewise linear functions are frequently used in practice, it is desirable in some applications, such as computer graphics, for the interpolant to be smoother (say, to have one, two, or even more continuous derivatives) at the mesh points xi. Special piecewise cubic polynomials, which we consider next, are commonly used for this purpose.

4.4.2 Cubic Spline Interpolation

DEFINITION 4.4 Suppose we have a point set ∆ = {xi}_{i=0}^{n} that subdivides the interval [a, b]:

a = x0 < x1 < · · · < xn−1 < xn = b.

Then ϕ is said to be a cubic spline with respect to ∆ provided

1. ϕ(x) has two continuous derivatives at every x ∈ [a, b], and

2. ϕ(x) is a cubic polynomial on each subinterval [xi, xi+1], 0 ≤ i ≤ n − 1.

Such cubic splines are mathematical analogs of the old-fashioned draftsman's spline. The draftsman's spline was a flexible piece of long, thin wood used to draw curves. It had weights that could be set at points (xi, yi) on the paper, and the resulting curve satisfied, to a high degree of approximation, a differential equation whose solution is a cubic spline.

Just as we can represent interpolating polynomials in terms of Lagrange functions and piecewise linear polynomials in terms of hat functions, we can represent a cubic spline s(x) as a linear combination of special "B-splines," which we now define. For convenience, we assume here a uniform mesh, i.e.,


xj+1 − xj = h for all j, and let

         0                                                      for x > xj+2,
         (1/(6h^3)) (xj+2 − x)^3                                for xj+1 ≤ x ≤ xj+2,
sj(x) =  1/6 + (1/(2h))(xj+1 − x) + (1/(2h^2))(xj+1 − x)^2
             − (1/(2h^3))(xj+1 − x)^3                           for xj ≤ x ≤ xj+1,     (4.12)
         2/3 − (1/h^2)(x − xj)^2 − (1/(2h^3))(x − xj)^3         for xj−1 ≤ x ≤ xj,
         (1/(6h^3)) (x − xj−2)^3                                for xj−2 ≤ x ≤ xj−1,
         0                                                      for x < xj−2.
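A matlab sketch evaluating sj from (4.12) is the following (the function name is ours; xj is the center and h the mesh size):

function s = bspline_eval(x, xj, h)
% BSPLINE_EVAL  Evaluate the cubic B-spline (4.12) centered at xj.
s = zeros(size(x));
u = x - xj;                            % signed distance from the center
in = u >= h & u <= 2*h;                % piece on [x_{j+1}, x_{j+2}]
s(in) = (2*h - u(in)).^3 / (6*h^3);
in = u >= 0 & u < h;                   % piece on [x_j, x_{j+1}]
v = h - u(in);                         % v = x_{j+1} - x
s(in) = 1/6 + v/(2*h) + v.^2/(2*h^2) - v.^3/(2*h^3);
in = u >= -h & u < 0;                  % piece on [x_{j-1}, x_j]
s(in) = 2/3 - u(in).^2/h^2 - u(in).^3/(2*h^3);
in = u >= -2*h & u < -h;               % piece on [x_{j-2}, x_{j-1}]
s(in) = (u(in) + 2*h).^3 / (6*h^3);
end

For instance, bspline_eval(0, 0, 1) returns 2/3, and bspline_eval([-1 1], 0, 1) returns [1/6 1/6], as (4.12) requires.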

DEFINITION 4.5 The function sj(x) defined by (4.12) is called a B-spline centered at x = xj with respect to the partition ∆ with a uniform mesh.

We now introduce two extra points x−1 = x0 − h and xn+1 = xn + h, and also consider the B-splines s−1(x) and sn+1(x) centered at x−1 and xn+1. The sj's are depicted graphically in Figure 4.3.


FIGURE 4.3: B-spline basis functions.

It is straightforward to show that sj′(x) and sj′′(x) are continuous, so each sj(x) is indeed a cubic spline. It follows, since linear combinations of continuous functions are continuous, that any linear combination of the sj, as in Formula (4.13) in the following theorem, is a cubic spline.


THEOREM 4.5

Let s be any cubic spline with respect to a point set ∆ = {xi}_{i=0}^{n}. Then there is a unique set of coefficients {cj}_{j=−1}^{n+1} such that

s(x) = Σ_{j=−1}^{n+1} cj sj(x).   (4.13)

For a complete treatment and proof of this theorem, see our graduate text [1].

Cubic splines can be used in a variety of ways to approximate functions (e.g., interpolation, least squares fits). Here, we will limit the discussion to interpolation at the points in ∆.

A consequence of Theorem 4.5 is that there are n + 3 unknown coefficients determining a spline with respect to ∆, whereas, if we require s(xj) = yj, 0 ≤ j ≤ n, we only have n + 1 conditions, so we have two "free" conditions. We now consider this in the context of interpolation, where there are two commonly used ways of specifying the two extra conditions: clamped boundary conditions and "natural" conditions.

DEFINITION 4.6 The clamped boundary spline interpolant Φc ∈ S∆ of a function f ∈ C^1[a, b] satisfies

(c)  Φc(xi) = f(xi), i = 0, 1, . . . , n,
     Φc′(x0) = f′(x0),
     Φc′(xn) = f′(xn).

DEFINITION 4.7 The natural spline interpolant Φn ∈ S∆ of a function f ∈ C[a, b] satisfies

(n)  Φn(xi) = f(xi), i = 0, 1, . . . , n,
     Φn′′(x0) = 0,
     Φn′′(xn) = 0.

We now set up the system of equations for computing the coefficients of the clamped and natural spline interpolants when we express these in terms of B-splines. Let

Φc(x) = Σ_{j=−1}^{n+1} cj sj(x).


The requirements (c) then lead to the system

Σ_{j=−1}^{n+1} sj(xi) cj = f(xi),   i = 0, 1, . . . , n,
c−1 s−1′(x0) + c1 s1′(x0) = f′(x0),
cn−1 sn−1′(xn) + cn+1 sn+1′(xn) = f′(xn),
(4.14)

since s0′(x0) = sn′(xn) = 0. The above system can be written in matrix form as

[ 4 2                  [ c0         [ 6f(x0) + 2hf′(x0)
  1 4 1                  c1           6f(x1)
     . . .          ·    ⋮      =     ⋮                        (4.15)
      1 4 1              cn−1         6f(xn−1)
        2 4 ]            cn ]         6f(xn) − 2hf′(xn) ],

where

c−1 = c1 − 2hf′(x0),   and
cn+1 = cn−1 + 2hf′(xn).

The system (4.15) has a unique solution {cj}_{j=−1}^{n+1} because the matrix of the system is strictly diagonally dominant (and hence nonsingular).

Now consider

Φn(x) = Σ_{j=−1}^{n+1} dj sj(x).

Conditions (n) lead to the system

[ 6 0  · · ·  0      [ d0          [ f(x0)
  1 4 1               d1             f(x1)
     . . .        ·   ⋮       = 6    ⋮               (4.16)
      1 4 1           dn−1           f(xn−1)
  0  · · ·  0 6 ]     dn ]           f(xn) ],

where

d−1 = −d1 + 2d0,
dn+1 = −dn−1 + 2dn.

(You will derive this system in Exercise 8 at the end of this chapter.)

Observe that the matrix for the system is not the identity matrix, as with Lagrange functions for polynomial interpolation or hat functions for piecewise linear interpolation. However, it is tridiagonal, so, with a tridiagonal system


solver, the coefficients may be found in O(n) time, rather than O(n^3) time for a general system of equations.
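The O(n) solve can be carried out with the classical tridiagonal (Thomas) algorithm; here is a minimal matlab sketch (the function name is ours, and no pivoting is done, which is safe here because the matrix is strictly diagonally dominant):

function x = tridiag_solve(a, d, c, b)
% TRIDIAG_SOLVE  Solve a tridiagonal system in O(n) operations.
% a: subdiagonal (length n-1), d: diagonal (length n),
% c: superdiagonal (length n-1), b: right-hand side (length n).
n = length(d);
for i = 2:n                        % forward elimination
    m = a(i-1) / d(i-1);
    d(i) = d(i) - m * c(i-1);
    b(i) = b(i) - m * b(i-1);
end
x = zeros(n, 1);                   % back substitution
x(n) = b(n) / d(n);
for i = n-1:-1:1
    x(i) = (b(i) - c(i) * x(i+1)) / d(i);
end
end

Applied to the 4 × 4 system of Example 4.11 below, tridiag_solve([1 1 0], [6 4 4 6], [0 1 1], 6*[1;4;5;8]) returns the same coefficients (1.0000, 4.6667, 4.3333, 8.0000)^T.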

Example 4.11

We will use matlab to compute a cubic spline interpolant to the data set as in Example 4.1, that is, to the data set

i  xi  yi
0  0   1
1  1   4
2  2   5
3  3   8

We have n = 3, h = 1. The system of equations for the clamped cubic spline requires f′(x0) and f′(xn), which we don't have for this point data, so we will just find the coefficients for the natural cubic spline. The system of equations (4.16) for this example is

[ 6 0 0 0      [ d0          [ 1
  1 4 1 0        d1            4
  0 1 4 1    ·   d2    = 6     5
  0 0 0 6 ]      d3 ]          8 ].

Solving this in matlab:

>> A = [6 0 0 0; 1 4 1 0; 0 1 4 1; 0 0 0 6]
A =
     6     0     0     0
     1     4     1     0
     0     1     4     1
     0     0     0     6
>> b = 6*[1;4;5;8]
b =
     6
    24
    30
    48
>> d = A\b
d =
    1.0000
    4.6667
    4.3333
    8.0000
>> d_minus_1 = -d(2) + 2*d(1)
d_minus_1 =
   -2.6667
>> d_4 = -d(3) + 2*d(4)
d_4 =
   11.6667
>>


Thus, the cubic spline interpolant is given approximately as

s(x) ≈ −2.6667 s−1(x) + 1.0000 s0(x) + 4.6667 s1(x)
       + 4.3333 s2(x) + 8.0000 s3(x) + 11.6667 s4(x)
     ≈ −(8/3) s−1(x) + s0(x) + (14/3) s1(x) + (13/3) s2(x) + 8 s3(x) + (35/3) s4(x),

where the sj(x) are given by (4.12). (You will write down the sj explicitly for this example in Exercise 10 at the end of this chapter.)

In fact, the matlab function interp1 will compute values of a cubic spline interpolant. We proceed analogously to Example 4.8:

>> x = [0,1,2,3]
x =
     0     1     2     3
>> y = [1,4,5,8]
y =
     1     4     5     8
>> xi = linspace(0,3);
>> yi = interp1(x,y,xi,'spline');
>> axis([-0.1,3.1,0.9,8.1])
>> hold
Current plot held
>> plot(x,y,'LineStyle','none','Marker','*','MarkerEdgeColor','red','Markersize',15)
>> plot(xi,yi)
>>

This dialog produces the following plot.

[Plot: the data points and the cubic spline interpolant on 0 ≤ x ≤ 3, with 1 ≤ y ≤ 8.]

In this case, the plot appears to be virtually identical to the cubic interpolating polynomial from Example 4.1 (on page 146).

We have the following error estimate.

THEOREM 4.6

Let f have four continuous derivatives on the interval [a, b], and let Φc(x) be the clamped boundary cubic spline interpolant (uniform mesh). Then

max_{x∈[a,b]} |f(x) − Φc(x)| ≤ (5/384) h^4 max_{x∈[a,b]} |d^4 f/dx^4 (x)|.

Furthermore, Φc also approximates the first, second, and third derivatives of f well. See [34] for a proof. A similar result holds for natural boundary cubic spline interpolants; see [9]. Also, similar results hold for a nonuniform mesh.

Example 4.12

Consider f(x) = ln x. We wish to determine how small h should be to ensure that the cubic spline interpolant Φc(x) of f(x) on the interval [2, 4] has error less than 10^−4. We have

‖f − Φc‖∞ ≤ (5/384) h^4 ‖D^4 f‖∞ = (5/384) h^4 max_{2≤x≤4} |6/x^4| = (5/384)(6/16) h^4 ≤ 10^−4.

Thus, h^4 ≤ (1/30)(384)(16) × 10^−4, h ≤ 0.38, and n ≥ 2/0.38 ≈ 5.3, so n = 6 suffices. (Recall that we required n ≥ 36 to achieve the same error with piecewise linear interpolants.)

Example 4.13

We return to Runge's function. We saw in Example 4.6 that the approximations by interpolating polynomials with equally spaced points got worse as we took more and more points. We saw in Example 4.7 that, if we took Chebyshev points, the approximations got better as we took more points, but the graphs of the interpolating polynomials still "wiggled," without approximating the derivatives well. We'll now try cubic spline interpolation with equally spaced points, using matlab's interp1 routine:

xpts = linspace(-5,5,200);
z1 = 1./(1+xpts.^2);
x = linspace(-5,5,5);
y = 1./(1+x.^2);
z2 = interp1(x,y,xpts,'spline');
x = linspace(-5,5,9);
y = 1./(1+x.^2);
z3 = interp1(x,y,xpts,'spline');
x = linspace(-5,5,17);
y = 1./(1+x.^2);
z4 = interp1(x,y,xpts,'spline');
plot(xpts,z1,xpts,z2,xpts,z3,xpts,z4);

This dialog produces the following plot.

[Plot: Runge's function and the cubic spline interpolants with 5, 9, and 17 equally spaced points on [−5, 5]; the vertical axis runs from −0.4 to 1.]

We see that the spline with 5 points (corresponding to a degree 4 interpolating polynomial) is somewhat similar to the degree 4 interpolating polynomial with Chebyshev points, except it does not undershoot as much. In contrast, the spline with 9 points tracks the actual function very closely, while the corresponding interpolating polynomial with Chebyshev points still has significant overshoot and undershoot. In this graph, we cannot see the spline with 17 equally spaced points, since its graph is indistinguishable from the graph of the function, while the interpolating polynomial of degree 16 with Chebyshev points still has discernible wiggles in it.

REMARK 4.2 Satisfaction of Φc′(x0) = f′(x0), Φc′(xn) = f′(xn) may be difficult to achieve if f(x) is not explicitly known. Approximations of order h^4 can then be used. Examples of such approximations are:

f′(x0) = (1/(12h)) [ −25f(x0) + 48f(x0 + h) − 36f(x0 + 2h) + 16f(x0 + 3h) − 3f(x0 + 4h) ] + ⟨error⟩,
where ⟨error⟩ = (h^4/5) f^(5)(ξ), x0 ≤ ξ ≤ x0 + 4h,

f′(xn) = (1/(12h)) [ 25f(xn) − 48f(xn − h) + 36f(xn − 2h) − 16f(xn − 3h) + 3f(xn − 4h) ] + ⟨error⟩,
where ⟨error⟩ = (h^4/5) f^(5)(ξ), xn − 4h ≤ ξ ≤ xn.


REMARK 4.3 It can be shown that if u is any function on [a, b] with two continuous derivatives such that u interpolates f in the manner

u(xi) = f(xi), 0 ≤ i ≤ n,   u′(x0) = f′(x0),   u′(xn) = f′(xn),

then

∫_a^b (Φc′′(x))^2 dx ≤ ∫_a^b (u′′(x))^2 dx.

That is, among all clamped C^2-interpolants of f, the clamped spline interpolant is the smoothest in the sense of minimizing ∫_a^b (u′′(x))^2 dx. Such smoothness properties are useful, e.g., in computer graphics, where we want the rendered image to look smooth. They are also important in automated machining, where the manufactured part should have a smooth surface, and where smooth motions of the manufacturing robot lead to less wear and tear.

In the next section, we consider approximation by polynomials in such a way that the graph does not necessarily go through the data exactly. This type of approximation is appropriate, for example, when there is much data and the data contains small errors, or when we need to approximate an underlying function with a low-degree polynomial.

4.5 Approximation Other Than by Interpolation

So far in this chapter, we have looked at approximation of functions by Taylor polynomials and by polynomials that pass through specified data points exactly, that is, by interpolating polynomials. In fact, a Taylor polynomial of degree n centered at x0 for a function can be thought of as a limit of an interpolating polynomial with n + 1 equally spaced points for that function over an interval [x0 − ε, x0 + ε] as we let ε tend to 0. Here, we mention alternatives.

4.5.1 Least Squares Approximation

We have already seen an alternative to polynomial interpolation in Chapter 3: In Example 3.29 (on page 119), we fit the data set

i  ti  yi
0  0   1
1  1   4
2  2   5
3  3   8

with a polynomial p2 of degree 2 in such a way that

(p2(0) − 1)^2 + (p2(1) − 4)^2 + (p2(2) − 5)^2 + (p2(3) − 8)^2

was minimized. The result was a polynomial of degree 2 that approximated the data set, but did not fit it exactly. Such approximations are appropriate when we already suspect the form of the underlying function (for example, if we have reason to believe that the function is indeed a polynomial of degree 2), and if there are errors in the data. This approximation is least squares approximation, defined by Equations (3.18) (on page 117) and (3.21), which we repeat here in a somewhat different form: To fit data {(ti, yi)}_{i=1}^{m}, we assume a function of the form

y ≈ f(t) = Σ_{j=0}^{n} aj ϕj(t),   (4.17)

where we find the coefficients aj, 0 ≤ j ≤ n, by solving the minimization problem

min_{a0, . . . , an} Σ_{i=1}^{m} (yi − f(ti))^2.   (4.18)

In Example 3.29, n = 2, m = 4, and ϕj(t) = t^j, j = 0, 1, 2. We saw in Section 3.4.2 that, if f is of the form (4.17), the minimization problem (4.18) can be solved with a QR-decomposition. In some models, the aj occur nonlinearly in the expression for f, in which case techniques we introduce in Chapter 8 may be used.

4.5.2 Minimax Approximation

In minimax approximation, also known as ℓ∞-approximation, instead of minimizing the function in (4.18), we do the following minimization:

min_{a0, . . . , an} max_{1≤i≤m} |yi − f(ti)|.   (4.19)

(In other words, we minimize the maximum deviation from the data.) We discuss this problem for various cases in our graduate text [1]. Using Lemarechal's technique, the problem can be posed as the following constrained optimization problem:

min_{a0, . . . , an} v
subject to  v ≥ yi − f(ti),     1 ≤ i ≤ m,
            v ≥ −(yi − f(ti)),  1 ≤ i ≤ m.
(4.20)

Example 4.14

If we fit a quadratic to the data from Example 3.29, that is, if we fit the data

i  ti  yi
0  0   1
1  1   4
2  2   5
3  3   8

we have f(ti) = a0 + a1 ti + a2 ti^2, and the optimization problem (4.20) becomes

min_{a0, a1, a2} v

subject to  v ≥ 1 − ( a0 ),
            v ≥ 4 − ( a0 + a1 + a2 ),
            v ≥ 5 − ( a0 + 2a1 + 4a2 ),
            v ≥ 8 − ( a0 + 3a1 + 9a2 ),
            v ≥ −1 + a0,
            v ≥ −4 + a0 + a1 + a2,
            v ≥ −5 + a0 + 2a1 + 4a2,
            v ≥ −8 + a0 + 3a1 + 9a2.

Identifying v with a3, we recognize this optimization problem as the linear programming problem: min v subject to

[ −1  0  0 −1                   [ −1
  −1 −1 −1 −1      [ a0           −4
  −1 −2 −4 −1        a1           −5
  −1 −3 −9 −1    ·   a2    ≤      −8
   1  0  0 −1        v ]           1
   1  1  1 −1                      4
   1  2  4 −1                      5
   1  3  9 −1 ]                    8 ].

If we have the matlab optimization toolbox, we may use linprog to solve this problem, as follows:

>> M = [-1 0 0 -1
-1 -1 -1 -1
-1 -2 -4 -1
-1 -3 -9 -1
1 0 0 -1
1 1 1 -1
1 2 4 -1
1 3 9 -1]
M =
    -1     0     0    -1
    -1    -1    -1    -1
    -1    -2    -4    -1
    -1    -3    -9    -1
     1     0     0    -1
     1     1     1    -1
     1     2     4    -1
     1     3     9    -1
>> b = [-1;-4;-5;-8;1;4;5;8]
b =
    -1
    -4
    -5
    -8
     1
     4
     5
     8
>> f = [0;0;0;1]
f =
     0
     0
     0
     1
>> a = linprog(f,M,b)
Optimization terminated.
a =
    1.5000
    2.0000
    0.0000
    0.5000
>> tt = linspace(0,3);
>> yy = a(1) + a(2)*tt + a(3)*tt.^2;
>> ti = [0 1 2 3]
ti =
     0     1     2     3
>> yi = [1 4 5 8]
yi =
     1     4     5     8
>> axis([-0.1,3.1,0.9,8.1])
>> hold
Current plot held
>> plot(ti,yi,'LineStyle','none','Marker','*','MarkerEdgeColor','red','Markersize',15)
>> plot(tt,yy)

This dialog results in the following plot:

[Plot: the data points and the minimax fit on 0 ≤ t ≤ 3, with 1 ≤ y ≤ 8.]

Just as in the least squares fit, the minimax "quadratic" is also a line for this particular example. However, the minimax line seems to fit the data better than the least squares fit, for this particular case.

We observe in Example 4.14 that the deviations of the fit from the actual data alternate in sign but all have the same absolute value. This is a general property of minimax approximations that is desirable if the data do not have errors and if we want the maximum error in the approximation to be small. However, minimax fits are sensitive to large errors in single data points (what statisticians call outliers). The next type of fit is not so sensitive to this kind of data error.

4.5.3 Sum of Absolute Values Approximation

In this type of approximation, also known as ℓ1-approximation, instead of minimizing the function in (4.18), we do the following minimization:

min_{a0, . . . , an} Σ_{i=1}^{m} |yi − f(ti)|.   (4.21)

As in minimax optimization, we may use Lemarechal's technique to pose the problem as the following constrained optimization problem:

min_{a0, . . . , an} Σ_{i=1}^{m} vi
subject to  vi ≥ yi − f(ti),     1 ≤ i ≤ m,
            vi ≥ −(yi − f(ti)),  1 ≤ i ≤ m.
(4.22)

Example 4.15

We will fit a quadratic to the data from Example 3.29, just as we did in the minimax example (Example 4.14, starting on page 173). The optimization problem (4.22) becomes

min_{a0, a1, a2} v1 + v2 + v3 + v4

subject to  v1 ≥ 1 − ( a0 ),
            v2 ≥ 4 − ( a0 + a1 + a2 ),
            v3 ≥ 5 − ( a0 + 2a1 + 4a2 ),
            v4 ≥ 8 − ( a0 + 3a1 + 9a2 ),
            v1 ≥ −1 + a0,
            v2 ≥ −4 + a0 + a1 + a2,
            v3 ≥ −5 + a0 + 2a1 + 4a2,
            v4 ≥ −8 + a0 + 3a1 + 9a2.

Identifying v1, v2, v3, and v4 with a3, a4, a5, and a6, we recognize this optimization problem as the linear programming problem: min v1 + v2 + v3 + v4 subject to

[ −1  0  0 −1  0  0  0      [ a0        [ −1
  −1 −1 −1  0 −1  0  0        a1          −4
  −1 −2 −4  0  0 −1  0        a2          −5
  −1 −3 −9  0  0  0 −1    ·   v1    ≤     −8
   1  0  0 −1  0  0  0        v2           1
   1  1  1  0 −1  0  0        v3           4
   1  2  4  0  0 −1  0        v4 ]         5
   1  3  9  0  0  0 −1 ]                   8 ].

If we have the matlab optimization toolbox, we may solve this problem using linprog, analogously to what we did in Example 4.14. We obtain the following fit:

a =
    1.0000
    2.3523
   -0.0063
    0.0000
    0.6540
    0.6793
    0.0000

which has the following graph:

[Plot: the data points and the ℓ1 fit on 0 ≤ t ≤ 3, with 1 ≤ y ≤ 8.]

Least absolute value (that is, ℓ1) fits are a type of fit statisticians call robust. This means that, if we add a large amount of error to just one point of many, it will not affect the approximating function much (or at least not as much as it would if we were doing, say, an ℓ∞ fit).

4.5.4 Weighted Fits

There are endless variations on least-squares, ℓ∞, and ℓ1 fits. In particular applications or models, we may have a large number of data points, but we may judge some data points to be more important than others. We express this importance with weights

{wi}_{i=1}^{m},   wi > 0,   1 ≤ i ≤ m,

associated with each data point. The corresponding weighted least squares problem would then become

min Σ_{i=1}^{m} wi (yi − f(ti))^2,

while the weighted minimax problem would become

min { max_{1≤i≤m} wi |yi − f(ti)| },

and the weighted ℓ1 problem would become

min Σ_{i=1}^{m} wi |yi − f(ti)|.
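As an aside (our own illustration, not from the text): a weighted least squares problem can be reduced to an ordinary one by scaling the i-th row of the overdetermined system by √wi, since Σ wi (yi − f(ti))^2 = ‖diag(√w)(y − Aa)‖2^2; matlab's backslash (a QR-based solve) then applies directly:

t = [0; 1; 2; 3];  y = [1; 4; 5; 8];
w = [5; 1; 1; 5];                 % the weights used in Example 4.16 below
A = [ones(4,1), t, t.^2];         % basis functions 1, t, t^2
s = sqrt(w);
a = (diag(s)*A) \ (s.*y)          % ordinary least squares on scaled rows
% a is approximately (1.0435, 2.3043, 0)^T, agreeing with the
% normal-equations computation in Example 4.16.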

Example 4.16

Let us continue with the data

i  ti  yi
0  0   1
1  1   4
2  2   5
3  3   8

Suppose we have decided to do a least squares fit, but we have either determined that the data for i = 1 and i = 2 was scaled five times larger than that for i = 0 and i = 3, or we have determined that the data corresponding to i = 0 and i = 3 is five times as important. If we desire to fit a quadratic


polynomial f(t) = a0 + a1 t + a2 t^2, we would formulate the corresponding weighted least squares problem as

min_{a0, a1, a2}  5(1 − a0)^2 + (4 − a0 − a1 − a2)^2 + (5 − a0 − 2a1 − 4a2)^2 + 5(8 − a0 − 3a1 − 9a2)^2.

Unfortunately, an off-the-shelf QR-decomposition will not work directly. (However, we may design special QR-decomposition routines for weighted least squares problems; these routines would be based on a weighted definition of orthogonality.) An easy option, provided ill-conditioning is not judged to be a problem, is to form the normal equations directly. For this particular example, the normal equations are

12 18 5018 50 14450 144 422

a0

a1

a2

=

54134384

.

Solving this system with matlab gives³:

>> AtA = [12 18 50; 18 50 144; 50 144 422]
AtA =
    12    18    50
    18    50   144
    50   144   422
>> Atb = [54;134;384]
Atb =
    54
   134
   384
>> a = AtA\Atb
a =
    1.0435
    2.3043
         0
>> yy = a(1) + a(2)*tt + a(3)*tt.^2;
>> axis([-0.1,3.1,0.9,8.1])
>> hold
Current plot held
>> plot(ti,yi,'LineStyle','none','Marker','*','MarkerEdgeColor','red','Markersize',15)
>> plot(tt,yy)
>>

The corresponding plot is:

[Figure: the weighted least squares quadratic fit plotted with the data points; t ranges from 0 to 3 and y from 1 to 8.]

²However, we may design special QR-decomposition routines for weighted least squares problems. These routines would be based on a weighted definition of orthogonality.
³Within the environment of our previous examples.


We see that the fit approximates the highly weighted points much more closely than the unweighted fit (which you see on page 120).
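As an alternative to forming the normal equations, one may reduce a weighted least squares problem to an ordinary one by scaling each row of the overdetermined system by the square root of its weight. The following is a minimal sketch (not from the text) for the data and weights of Example 4.16:

% Hypothetical matlab sketch: weighted least squares via row scaling.
ti = [0 1 2 3]';  yi = [1 4 5 8]';
w  = [5 1 1 5]';                    % the weights of Example 4.16
A  = [ones(4,1) ti ti.^2];          % Vandermonde-type matrix for a quadratic
Ws = sqrt(w);                       % square roots of the weights
a  = (repmat(Ws,1,3).*A) \ (Ws.*yi) % ordinary least squares on scaled rows

Here the backslash operator solves the scaled overdetermined system in the least squares sense via an orthogonal factorization, so AᵀA is never formed explicitly. (We believe lscov, used later in this chapter, also accepts a vector of weights as a third argument.)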

4.6 Approximation Other Than by Polynomials

So far, we have examined approximation of data and functions by functions of the form

\[
\phi(x) = \sum_{j=0}^{n} a_j \phi_j(x), \tag{4.23}
\]

where we have chosen ϕj(x) = xʲ for polynomial interpolation by forming the Vandermonde system, as well as for ℓ1 and minimax fits. We also chose, as alternatives, ϕj to be the Lagrange function corresponding to the sample point xj, for the Lagrange form of the interpolating polynomial, and we chose \(\phi_j(x) = \prod_{i=0}^{j-1}(x - x_i)\) for the Newton form of the interpolating polynomial. For splines, we chose ϕj to be the j-th B-spline associated with the point set \(\{x_j\}_{j=0}^{n}\).

Often, non-polynomial ϕj are used, and sometimes, even more general forms than (4.23) are used. For example, rational approximation, that is, approximation by functions of the form

\[
\phi(x) = \sum_{j=0}^{n_1} a_j x^j \Bigm/ \sum_{j=0}^{n_2} b_j x^j,
\]

can be effective in various contexts. Some facts about rational approximation appear in our graduate text [1] and elsewhere.

Another very common type of approximation is of the form (4.23), where we choose ϕj(x) to be cos(jx) or sin(jx) (or e^{ijx}, where i denotes the imaginary unit here). Approximation by such trigonometric polynomials is ubiquitous throughout signal processing and elsewhere, and also leads to the branch of mathematics termed Fourier analysis. In fact, a special associated algorithm, the Fast Fourier Transform, or FFT, is the basis of digital transmission of audio and video signals.
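For equally spaced samples over a full period, the FFT delivers the coefficients of such a trigonometric fit very quickly; the following is a minimal sketch (not from the text) recovering known coefficients with matlab's fft:

% Hypothetical matlab sketch: trigonometric coefficients via the FFT.
t = 2*pi*(0:7)/8;                 % 8 equally spaced points on [0, 2*pi)
y = 1 + 2*cos(t) + 3*sin(2*t);    % a signal with known coefficients
c = fft(y)/8;                     % scaled discrete Fourier coefficients
[real(2*c(2)), -imag(2*c(3))]     % recovers the coefficients 2 and 3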

We may also approximate by sums of exponentials. The approximation may be linear, that is, of the form (4.23), or nonlinear.


Example 4.17

Let us approximate the data

i   ti   yi
0   0    1
1   1    4
2   2    5
3   3    8

in the least squares sense

1. by ϕ(t) = a0 + a1e^t, and

2. by ϕ(t) = a0e^{a1t}.

In the first case, the approximation is of the form (4.23), with ϕ0(t) = 1 and ϕ1(t) = e^t, and we may use the same techniques as with polynomial least squares. The overdetermined system as in (3.19) (on page 117) is

\[
\begin{pmatrix} 1 & 1\\ 1 & e\\ 1 & e^2\\ 1 & e^3 \end{pmatrix}
\begin{pmatrix} a_0\\ a_1 \end{pmatrix}
=
\begin{pmatrix} 1\\ 4\\ 5\\ 8 \end{pmatrix},
\]

and, using a computation similar to that in Example 3.29 (on page 119), we obtain the following result and plot:

a =
    2.0842
    0.3098

[Figure: the fit a0 + a1e^t plotted with the data points; t ranges from 0 to 3 and y from 1 to 9.]

We see that this particular form does not seem to fit the data well.

For the nonlinear exponential form ϕ(t) = a0e^{a1t}, we need to minimize the function

\[
f(a_0, a_1) = (1 - a_0)^2 + (4 - a_0e^{a_1})^2 + (5 - a_0e^{2a_1})^2 + (8 - a_0e^{3a_1})^2.
\]

Since this function is nonlinear, we cannot use just a single linear computation such as a QR-decomposition. However, we may use techniques from nonlinear optimization, such as setting the gradient equal to zero and solving the resulting nonlinear system using techniques from Chapter 8. In fact, however, special techniques have been developed for solving nonlinear least squares problems, such as those embodied in the routine lsqcurvefit from matlab's optimization toolbox. To use this routine, we need to program ϕ(t), which we do in the following matlab "m" file:

function [y] = exponential_fit(a,t)
y = a(1) * exp(a(2)*t);

Assuming we are continuing the dialog from the previous examples, we use exponential_fit.m in the following matlab dialog to compute and plot the fit:

>> x0=rand(2,1)
x0 =
    0.3529
    0.8132
>> a = lsqcurvefit('exponential_fit',x0,ti,yi)
Optimization terminated: relative function value
 changing by less than OPTIONS.TolFun.
a =
    1.9311
    0.4781
>> yy = exponential_fit(a,tt);
>> hold
Current plot held
>> plot(ti,yi,'LineStyle','none','Marker','*','MarkerEdgeColor','red','Markersize',15)
>> plot(tt,yy)
>>

[Figure: the nonlinear fit a0e^{a1t} plotted with the data points; t ranges from 0 to 3 and y from 1 to 9.]

Caution: Routines such as lsqcurvefit use heuristics, and may not return with mathematically correct fits; sometimes the fits that they return are not near the actual best fits. One way of gathering evidence that the fit is correct is to try different starting points. Also, the routine lsqcurvefit has various options, including an option to supply partial derivatives of the fitting function; trying different options may either give one more confidence in the fit or provide evidence that the fit is not good. Yet another possibility is to use global optimization software, and, in particular, software with automatic verification, such as interval-arithmetic-based software, such as we describe in [1, Section 9.6.3] or in [26]. In fact, we used our GlobSol package [21] to verify that the fit we have displayed is correct. To see if lsqcurvefit was trustworthy in this case, we also tried lsqcurvefit with various starting points, and found that each time it gave the same fit⁴ that we have displayed.
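The following is a minimal sketch (not from the text) of such a multiple-start experiment, assuming exponential_fit.m and the data vectors ti and yi from above:

% Hypothetical matlab sketch: restart lsqcurvefit from random points.
fbest = Inf;
for k = 1:10
   x0 = randn(2,1);                % a random starting point
   [a, resnorm] = lsqcurvefit('exponential_fit', x0, ti, yi);
   if resnorm < fbest, fbest = resnorm; abest = a; end
end
abest, fbest   % agreement across runs lends (but does not prove) confidence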

In general, if we are conjecturing an underlying model for the data, we choose a form that we think corresponds to the underlying processes that produced the data. For example, sometimes the coefficients aj are coefficients in a differential equation.

4.7 Interval (Rigorous) Bounds on the Errors

Whether we are considering Taylor polynomial approximation, interpolation, least squares, or minimax approximation, the error term in the approximation can be put in the general form

\[
f(x) = p(x) + K(x)M(f; x) \quad \text{for } x \in [a, b], \tag{4.24}
\]

where K becomes small as we increase the number of points, and M depends on a derivative of f. We list K and M for various approximations⁵ in Table 4.1. In such cases, p and K can be evaluated explicitly, while M(f; x) can be estimated using interval arithmetic. We illustrated how to do this for f(x) = e^x, using a degree-5 Taylor polynomial, in Example 1.22 on page 29. We elaborate here: In addition to bounding particular values of the function, a maximum error of approximation and rigorous bounds valid for all of [a, b] can be inferred. In particular, the polynomial part p(x) is evaluated at a point (but using outwardly rounded interval arithmetic to maintain mathematical rigor), and the error part is evaluated with interval arithmetic.

Example 4.18

Consider approximating sin(x), x ∈ [−0.1, 0.1] by a degree-5

1. Taylor polynomial about zero,

2. interpolating polynomial at the points xk = −.1 + .04k, 0 ≤ k ≤ 5.

For the Taylor polynomial, we observe that the fifth degree Taylor polynomial is the same as the sixth degree Taylor polynomial, and we have

\[
\sin(x) \in x - \frac{1}{6}x^3 + \frac{1}{120}x^5 - \frac{1}{5040}x^7\sin(\xi) \quad \text{for some } \xi \in [-0.1, 0.1]. \tag{4.25}
\]

⁴Approximately.
⁵The error of approximation of smooth functions by Chebyshev polynomials can be much less than for nonsmooth (merely C⁰) functions, as is indicated in Remark ?? combined with Theorem ??; however, bounds on the error may be more complicated to find in this case.


TABLE 4.1: Error factors K and M in polynomial approximations f(x) = p(x) + K(x)M(f; x).

Type of approximation               | K                                      | M(f)
degree n Taylor polynomial          | (x − x0)^{n+1}/(n + 1)!                | f^{(n+1)}(ξ(x)), ξ ∈ [a, b] unknown
polynomial interpolation at n + 1   | (1/(n + 1)!) ∏_{i=0}^{n} (x − x_i)     | f^{(n+1)}(ξ(x)), ξ ∈ [a, b] unknown
points                              |                                        |

|f(x) − p(x)| ≤ K(x)M(f, x) (bounds on the error only⁶):

piecewise linear interpolation      | h²/8                                   | max_{x∈[a,b]} |f″(x)|
interpolation with clamped cubic    | (5/384)h⁴                              | max_{x∈[a,b]} |f⁗(x)|
splines                             |                                        |

⁶The actual equation (4.24) can be given, but it is more complicated, involving conditional branches.

We can replace sin(ξ) by an appropriate interval to get a pointwise estimate; for example,

\[
\sin(0.05) \in .05 - \frac{.05^3}{6} + \frac{.05^5}{120} - \frac{.05^7}{5040}[0, 0.05]
\subseteq [0.049979169270821, 0.04997916927084],
\]

where the above bounds are mathematically rigorous. Here, K was evaluated at the point x, but sin(ξ) was replaced by sin([0, 0.05]). Similarly,

\[
\sin(-0.01) \in (-.01) - \frac{(-.01)^3}{6} + \frac{(-.01)^5}{120} - \frac{(-.01)^7}{5040}[-0.01, 0]
\subseteq [-0.00999983333417, -0.00999983333416].
\]

Thus, since we know sin(x) is monotonic for x ∈ [−0.01, 0.05], [−0.00999983333417, 0.04997916927084] represents a fairly sharp bound on the range {sin(x) | x ∈ [−0.01, 0.05]}. Alternately, it may be more convenient in some contexts to evaluate K and M over the entire interval, although this leads to a less sharp result. Using that technique, we would have

\[
\begin{aligned}
\sin(0.05) &\in .05 - \frac{.05^3}{6} + \frac{.05^5}{120} + \frac{[-0.1, 0.1]^7}{5040}[-0.1, 0.1]\\
&\subseteq .05 - \frac{.05^3}{6} + \frac{.05^5}{120} - [-0.19841269841270 \times 10^{-11}, 0.19841269841270 \times 10^{-11}]\\
&\subseteq [0.04997916926884, 0.04997916927282],
\end{aligned}
\]


and

\[
\begin{aligned}
\sin(-0.01) &\in (-.01) - \frac{(-.01)^3}{6} + \frac{(-.01)^5}{120} - \frac{[-0.1, 0.1]^7}{5040}[-0.1, 0.1]\\
&\subseteq (-.01) - \frac{(-.01)^3}{6} + \frac{(-.01)^5}{120} - [-0.19841269841270 \times 10^{-11}, 0.19841269841270 \times 10^{-11}]\\
&\subseteq [-0.00999983333616, -0.00999983333218],
\end{aligned}
\]

thus obtaining (somewhat less sharp) bounds

\[
[-0.00999983333616, 0.04997916927282]
\]

on the range {sin(x) | x ∈ [−0.01, 0.05]}.

In general, substituting intervals into the polynomial approximation itself does not give sharp bounds on the range. For example,

\[
\begin{aligned}
\sin([-0.01, 0.05]) &\in [-.01, .05] - \frac{[-.01, .05]^3}{6} + \frac{[-.01, .05]^5}{120}\\
&\quad - [-0.19841269841270 \times 10^{-11}, 0.19841269841270 \times 10^{-11}]\\
&\subseteq [-0.01002083333616, 0.05000016927282].
\end{aligned}
\]

Nonetheless, in some contexts in which there is no alternative, this technique gives usable bounds.
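The pointwise bound above can be reproduced on the computer with an interval arithmetic package; the following is a minimal sketch, under the assumption that the INTLAB toolbox is installed and started (the function names are INTLAB's, but the dialog is ours, not the text's):

% Hypothetical matlab/INTLAB sketch: rigorous Taylor bound for sin(0.05).
x  = intval('0.05');          % a rigorous enclosure of the decimal 0.05
xi = infsup(0, 0.05);         % enclosure of the unknown xi in (4.25)
p  = x - x^3/6 + x^5/120;     % polynomial part, outwardly rounded
s  = p - (x^7/5040)*sin(xi)   % interval enclosing the remainder-corrected value

Directed (outward) rounding in each interval operation is what makes such a displayed enclosure mathematically rigorous.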

Computing bounds based on the interpolating polynomial is similar to computing bounds based on the Taylor polynomial, and is left as Exercise 4.

4.8 Applications

To better understand the population dynamics of American green tree frogs (Hyla cinerea), scientists used a capture-mark-recapture method to follow a population at an urban study site in Lafayette, LA, during their breeding seasons. The following data are the weekly frog population estimates from week 2 (June 24, 2004) of the 2004 dataset [2].

Week       2   3   4   5   6   7   8   9  10   11  12  14  15  16  17  18
Estimate 143 415 140 177 150 125 133 151 123 1429 487 523 228 416 341 523

Now, suppose we are looking for a least squares fit to the data, which means we want to find a function of the form

\[
y \approx f(t) = \sum_{j=0}^{n} c_j \phi_j(t),
\]

where the coefficients cj, 0 ≤ j ≤ n, can be found by solving the minimization problem

\[
\min_{\{c_j\}_{j=0}^{n}} \sum_{i=1}^{m} \bigl(y_i - f(t_i)\bigr)^2.
\]

Here, (ti, yi) are the given data, so m = 16. We have decided to use the hat functions (on page 161) ϕj(t), j = 1, 2, ..., 13 for t ∈ [1, 19] as the basis functions; that is, we will use hat functions centered at the points 1, 2.5, 4, ..., 17.5, 19. We saw in Section 3.4.2 that this minimization problem can be solved with a QR-decomposition. The following matlab "m" files show how to find all the coefficients cj, j = 1, 2, ..., 13, and plot both the actual data and the least squares fit in the same xy plane.

function p = hat_function_value (j, xi, n, h, z)
p=0;
if (j==1 & xi(1) <= z & z <= xi(2))
   p=(xi(2)-z)/h;
elseif (j==n+1 & xi(n) <= z & z <= xi(n+1))
   p=(z-xi(n))/h;
elseif (2 <= j & j <= n & xi(j-1) <= z & z <= xi(j))
   p=(z-xi(j-1))/h;
elseif (2 <= j & j <= n & xi(j) <= z & z <= xi(j+1))
   p=(xi(j+1)-z)/h;
end
return

The above function is used in the following matlab script.

clc,clear,close all
t = [2 3 4 5 6 7 8 9 10 11 12 14 15 16 17 18];
y = [143 415 140 177 150 125 133 151 123 1429 487 523 228 416 341 523];
h=1.5;
xi=1:h:19;
m = max(size(t));        % m data items
n = length(xi)-1;        % a basis of n+1 functions
A=zeros(m,n+1);
for i=1:m                % i means ith observation
   for j=1:n+1           % j means jth basis function
      A(i,j)= hat_function_value (j,xi,n,h,t(i));  % call the hat function
   end
end
[Q,R]=qr(A);             % perform the QR decomposition of matrix A
e=Q'*y';                 % solve the square triangular system,
f=e(1:(n+1));            % using the first n+1 rows of R x = Q'b
r=inv(R(1:(n+1),1:n+1));
c=r*f                    % c is the coefficient vector of the hat function basis
for i=1:length(t)
   xx(i)=sum(c'.*(A(i,1:(n+1))));
end
axis([0,20,100,1450])    % plot both the actual data and the least squares
hold                     % approximation
plot(t,y,'*',t,xx,'-')
legend('Actual Data','Least Squares Fit')

Running the above matlab script gives the following cj in the command window:

>> c
c =
   1.0e+003 *
   -0.6726
    0.5508
    0.1434
    0.1783
    0.1242
    0.1255
    0.2258
    1.5561
   -0.7023
    0.6612
    0.3106
    0.3562
    0.8566

The script also gives the following plot:

[Figure: the actual data (asterisks) and the hat-function least squares fit; t ranges from 0 to 20 and the estimates from 100 to 1450.]

Notice that the population estimate of week 11 is much larger than the other week estimates. This suggests that the week 11 data is an outlier, that is, it is somehow exceptional or in error. In the next fitting experiment, we remove (t10, y10) = (11, 1429) from our dataset, and adjust the matlab codes accordingly. In particular, we replace the second and third line of the script by

t = [2 3 4 5 6 7 8 9 10 12 14 15 16 17 18];
y = [143 415 140 177 150 125 133 151 123 487 523 228 416 341 523];

and we replace the "axis" command (after examination of the data and experimentation) by

axis([0,20,100,700])

The resulting plot (seen below) shows that a closer approximation can be obtained by omitting this outlier.


[Figure: the actual data (asterisks) and the least squares fit with the outlier removed; t ranges from 0 to 20 and the estimates from 100 to 700.]

The corresponding set of coefficients is

c =
   1.0e+003 *
   -0.6728
    0.5509
    0.1432
    0.1797
    0.1192
    0.1518
    0.1256
    0.0800
    1.3010
    0.1340
    0.4160
    0.3035
    0.9620

Note: Here, we have explicitly used the QR decomposition for illustration. Actually, matlab has a function lscov that will compute the least squares solution to a system Ax = b, probably more efficiently than the explicit matlab statements we have just exhibited. A corresponding matlab script using lscov is as follows.

clc,clear,close all
t = [2 3 4 5 6 7 8 9 10 11 12 14 15 16 17 18];
y = [143 415 140 177 150 125 133 151 123 1429 487 523 228 416 341 523];
h=1.5;
xi=1:h:19;
m = max(size(t));        % m data items
n = length(xi)-1;        % a basis of n+1 functions
A=zeros(m,n+1);
for i=1:m                % i means ith observation
   for j=1:n+1           % j means jth basis function
      A(i,j)= hat_function_value (j,xi,n,h,t(i));  % call the hat function
   end
end
c = lscov(A,y')          % Compute the least squares solution
for i=1:length(t)
   xx(i)=sum(c'.*(A(i,1:(n+1))));
end
axis([0,20,100,1450])    % plot both the actual data and the least squares
hold                     % approximation


plot(t,y,'*',t,xx,'-')
legend('Actual Data','Least Squares Fit')

4.9 Exercises

1. For f(t) = sin(t),

(a) compute the coefficients of the degree 3 Taylor polynomial approximation at t = 0;

(b) compute the coefficients of the degree 3 polynomial that interpolates f at t = −1, t = −1/3, t = 1/3, and t = 1.

(c) Rewrite each of the degree-3 polynomials in 1a and 1b in terms of the basis ϕ0 ≡ 1, ϕ1 ≡ t, ϕ2 ≡ t², and ϕ3 ≡ t³, then compare coefficients.

(d) Estimate the maximum error

\[
\max_{t \in [-1,1]} |f(t) - p(t)|
\]

for each of the approximations in 1a and 1b.

2. Repeat the computations for Example 4.18 on page 182, except use the interpolating polynomial at the six Chebyshev points, rather than at six equally spaced points.

3. Fill in the details of the computations in Example 4.4 (starting on page 152). In particular, solve the lower triangular system, and also arrange the computations so you see that, by solving it, you do the same computations as computing the divided differences. Also, rearrange the terms of the polynomial to rewrite it in power form, and show that you get the same coefficients in power form as you do by solving the Vandermonde system.

4. Complete the computations in Example 4.18 on page 182. That is, approximate sin(x), x ∈ [−0.1, 0.1] by an interpolating polynomial of degree 5, and use interval arithmetic to obtain an interval bounding the exact value of sin(0.05). Do it

(a) with equally spaced points, and

(b) with Chebyshev points.

5. Redo the matlab computations in Example 4.5 (on page 154), but use interval arithmetic (say, with intlab). Doing so, you will obtain mathematically rigorous bounds on the truncation error, from the upper bounds of the intervals produced.


6. Actually do Example 4.7 (on page 157).

7. (A significant programming project) Consider the piecewise linear interpolant in Example 4.9 (on page 161).

(a) Write a matlab function [y] = hat(t,ti,i) that accepts a set of abscissas ti (an array with m points), the specified index i, and the specified point of evaluation t, and returns the value of the hat function centered at ti(i).

(b) Use matlab and your hat function routine from part (a) of this problem to evaluate the piecewise linear interpolant from Example 4.9 and produce a graph as in Example 4.9.

(c) Observe that your graph is the same graph as the one that appears in the text for the example.

8. Show that conditions (n) in Definition 4.7 of the natural spline interpolant (on page 165) lead to the system (4.16) (on page 166).

9. Write down the details showing that (4.14) (on page 166) can be written in matrix form as (4.15) (on page 166).

10. (Involves significant algebraic manipulation) In Example 4.11,

(a) Write down s−1, s0, s1, s2, and s3 explicitly in terms of branching.

(b) Simplify the expansion of s(x) in Example 4.11 to write s(x) as a set of cubic polynomials in power form. (You will have a different cubic polynomial for each of the three subintervals [0, 1], [1, 2], and [2, 3].)

(c) Write a matlab routine that evaluates your polynomial. (You will need to use branching statements such as if — elseif — else — end.)

(d) Use your matlab routine to plot s(x).

(e) Compare your graph with that given in the text to make sure it is the same.

11. Let s1(x) = 1 + c(x + 1)³, −1 ≤ x ≤ 0, where c is a real number. Determine s2(x) on 0 ≤ x ≤ 1 so that

\[
s(x) = \begin{cases} s_1(x), & -1 \le x \le 0,\\ s_2(x), & 0 \le x \le 1, \end{cases}
\]

is a natural cubic spline, i.e., s″(−1) = s″(1) = 0 on [−1, 1] with nodal points at −1, 0, 1. How must c be chosen if one wants s(1) = −1?

12. Complete the computations in Example 4.15 (on page 175), analogously to what was done in Example 4.14 (on page 173), to verify that we obtain the graph that is displayed in the text. Show all your computations.

13. Fill in the details of the computations in Example 4.17 (on page 180). Exhibit all of your work.

Chapter 5

Eigenvalue-Eigenvector Computation

In Chapter 3, we introduced the concept of an eigenvalue-eigenvector pair of an n by n matrix A, that is, a scalar λ and a vector v such that

\[
Av = \lambda v,
\]

and we referred to eigenvalues and eigenvectors when talking about convergence of iterative methods. In fact, eigenvalues and eigenvectors are fundamental in various applications. In particular, physical systems exhibit characteristic modes of vibration (that is, resonance frequencies) that are described by eigenvalues and eigenvectors. For example, structures such as buildings and bridges can be modeled by differential equations, and the resonant, or characteristic¹, frequencies of models of such structures are routinely computed in earthquake-prone zones such as California, prior to construction.

We discuss some basic methods for computing eigenvalues and eigenvectors in this chapter. First, we introduce necessary facts.

5.1 Facts About Eigenvalues and Eigenvectors

THEOREM 5.1

λ is an eigenvalue of A if and only if the determinant det(A − λI) = 0.

The determinant defines the characteristic polynomial

\[
\det(A - \lambda I) = \lambda^n + \alpha_{n-1}\lambda^{n-1} + \alpha_{n-2}\lambda^{n-2} + \cdots + \alpha_1\lambda + \alpha_0.
\]

Thus, the fundamental theorem of algebra tells us that A has exactly n eigenvalues, the roots of the above polynomial, in the complex plane, counting multiplicities. The set of eigenvalues of A is called the spectrum of A. Recall the following.

¹The prefix "eigen" is a German-language prefix, meaning, roughly, "characteristic."


(a) The spectral radius of A is defined by

\[
\rho(A) = \max_{\lambda \text{ an eigenvalue of } A} |\lambda|.
\]

(b) \(\|A\|_2 = \sqrt{\rho(A^H A)}\). If A^H = A (that is, if A is Hermitian), then ‖A‖₂ = ρ(A).

Also, it can be shown that |λ| ≤ ‖A‖ for any induced matrix norm ‖ · ‖ and any eigenvalue λ.
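The following is a minimal sketch (not from the text) checking these facts numerically in matlab:

% Hypothetical matlab sketch: |lambda| <= ||A|| for induced norms.
A = randn(4);                            % a random 4 by 4 matrix
rho = max(abs(eig(A)))                   % the spectral radius rho(A)
[norm(A,1), norm(A,2), norm(A,inf)]      % each norm is >= rho(A)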

Example 5.1

Let

\[
A = \begin{pmatrix} 1 & 2\\ -1 & 4 \end{pmatrix}.
\]

Then the characteristic polynomial of A is

\[
|A - \lambda I| = \begin{vmatrix} 1-\lambda & 2\\ -1 & 4-\lambda \end{vmatrix}
= (1-\lambda)(4-\lambda) - (2)(-1) = \lambda^2 - 5\lambda + 6,
\]

so the eigenvalues of A are λ2 = 2 and λ1 = 3. In this case, we can compute the eigenvectors v = (v1, v2)ᵀ corresponding to λ2 = 2 as follows:

\[
(A - 2I)v = \begin{pmatrix} 1-2 & 2\\ -1 & 4-2 \end{pmatrix}\begin{pmatrix} v_1\\ v_2 \end{pmatrix}
= \begin{pmatrix} -1 & 2\\ -1 & 2 \end{pmatrix}\begin{pmatrix} v_1\\ v_2 \end{pmatrix} = 0
\]

for any vector v with v1 = 2v2. Thus there is a space of eigenvectors of the form

\[
v^{(1)} = t\begin{pmatrix} 2\\ 1 \end{pmatrix}, \quad t \text{ a number},
\]

corresponding to λ2 = 2. Similarly, the space of eigenvectors corresponding to λ1 = 3 is

\[
v^{(2)} = t\begin{pmatrix} 1\\ 1 \end{pmatrix}, \quad t \text{ a number}.
\]

For this example, the matrix of eigenvectors (normalized somehow, say, so the second component of each is equal to 1),

\[
P = (v^{(1)}, v^{(2)}) = \begin{pmatrix} 2 & 1\\ 1 & 1 \end{pmatrix},
\]

is non-singular. We now use matlab to compute induced norms of A, to illustrate that they are greater than or equal to max{λ1, λ2} = 3. In this computation, we also illustrate the relationship between A and the matrix P of eigenvectors of A.


>> A = [1 2
-1 4]
A =
     1     2
    -1     4
>> norm(A,1)
ans = 6
>> norm(A,2)
ans = 4.4966
>> max(sqrt(eig(A'*A)))
ans = 4.4966
>> norm(A,inf)
ans = 5
>> P = [2 1
1 1]
P =
     2     1
     1     1
>> inv(P)*A*P
ans =
     2     0
     0     3
>>

Example 5.1 is special in several ways:

1. In general, the eigenvalues and eigenvectors are complex, even if the matrix A is real.

2. In general, an n by n matrix A does not have n linearly independent eigenvectors, although certain classes of matrices, such as symmetric ones, do.

3. In general, we do not explicitly form the characteristic equation to compute eigenvalues and eigenvectors, but we use iterative methods like the basic ones we explain later in this chapter.

DEFINITION 5.1 A square matrix A is called defective if it has an eigenvalue of multiplicity k having fewer than k linearly independent eigenvectors.

For example, if

\[
A = \begin{pmatrix} 1 & 1\\ 0 & 1 \end{pmatrix}, \quad \text{then } \lambda_1 = \lambda_2 = 1, \text{ but } x = t\begin{pmatrix} 1\\ 0 \end{pmatrix}
\]

is the only eigenvector, so A is defective.

THEOREM 5.2

Let A and P be n × n matrices, with P nonsingular. Then λ is an eigenvalue of A with eigenvector x if and only if λ is an eigenvalue of P⁻¹AP with eigenvector P⁻¹x. (P⁻¹AP is called a similarity transformation of A, and A and P⁻¹AP are called similar.)

THEOREM 5.3

Let \(\{x_i\}_{i=1}^{n}\) be eigenvectors of A corresponding to distinct eigenvalues \(\{\lambda_i\}_{i=1}^{n}\). Then the vectors \(\{x_i\}_{i=1}^{n}\) are linearly independent.

If A has n different eigenvalues, then the n eigenvectors are linearly independent and thus form a basis for Cⁿ. (This means that the matrix P formed from these eigenvectors is non-singular.) Note that n different eigenvalues is sufficient but not necessary for \(\{x_i\}_{i=1}^{n}\) to form a basis. Consider A = I with eigenvectors \(\{e_i\}_{i=1}^{n}\).

We now consider some results for the special case when the matrix A is Hermitian. Recall that, if A^H = A, then A is called Hermitian. (A real symmetric matrix is a special kind of Hermitian matrix.)

THEOREM 5.4

Let A be Hermitian (or real symmetric). The eigenvalues of A are real, and there is an orthonormal system of eigenvectors w1, w2, ..., wn of A with Awj = λjwj and \((w_j, w_k) = w_k^H w_j = \delta_{jk}\).

The orthonormal system is linearly independent and spans Cⁿ, and thus forms a basis for Cⁿ. Thus, any vector x ∈ Cⁿ can be expressed as

\[
x = \sum_{j=1}^{n} a_j w_j, \quad \text{where } a_j = (x, w_j) \text{ and } \|x\|_2^2 = \sum_{j=1}^{n} |a_j|^2.
\]

The following fact can be used to obtain initial guesses for eigenvalues, to be used in iterative methods for finding eigenvalues and eigenvectors.

THEOREM 5.5

(Gerschgorin's Circle Theorem) Let A be any n × n complex matrix. Then every eigenvalue of A lies in the union of the discs

\[
\bigcup_{j=1}^{n} K_{\rho_j}(a_{jj}), \quad \text{where } K_{\rho_j}(a_{jj}) = \{z \in \mathbb{C} : |z - a_{jj}| \le \rho_j\} \text{ for } j = 1, 2, \ldots, n,
\]

and where the centers ajj are diagonal elements of A and the radii ρj can be taken as:

\[
\rho_j = \sum_{\substack{k=1\\ k\ne j}}^{n} |a_{jk}|, \quad j = 1, 2, \ldots, n \tag{5.1}
\]

(absolute sum of the elements of each row excluding the diagonal elements),

\[
\rho_j = \sum_{\substack{k=1\\ k\ne j}}^{n} |a_{kj}|, \quad j = 1, 2, \ldots, n \tag{5.2}
\]

(absolute column sums excluding diagonal elements), or

\[
\rho_j = \rho = \Bigl(\sum_{\substack{j,k=1\\ j\ne k}}^{n} |a_{jk}|^2\Bigr)^{1/2} \tag{5.3}
\]

for j = 1, 2, ..., n.

Example 5.2

\[
A = \begin{pmatrix} 2 & 1 & \tfrac{1}{2}\\ -1 & -3i & 1\\ 3 & -2 & -6 \end{pmatrix}.
\]

Using absolute row sums,

a11 = 2,   ρ1 = 3/2,
a22 = −3i, ρ2 = 2,
a33 = −6,  ρ3 = 5.

The eigenvalues are in the union of these discs. For example, ρ(A) ≤ 11. Also, A is nonsingular, since no eigenvalue λ can equal zero. (See Figure 5.1.)

[Figure: the three Gerschgorin discs in the complex plane, centered at 2, −3i, and −6 with radii 3/2, 2, and 5.]

FIGURE 5.1: Illustration of Gerschgorin discs for Example 5.2.
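The disc centers and row-sum radii (5.1) are easy to compute; the following is a minimal sketch (not from the text) for the matrix of Example 5.2:

% Hypothetical matlab sketch: Gerschgorin centers and row-sum radii.
A = [2 1 0.5; -1 -3i 1; 3 -2 -6];
centers = diag(A);                       % disc centers a_jj
radii = sum(abs(A),2) - abs(centers);    % off-diagonal absolute row sums
[centers radii]                          % radii 1.5, 2, and 5, as above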


Symmetric matrices (or, more generally, Hermitian matrices) are special from the point of view of eigenvalues and eigenvectors. We have a special version of Gerschgorin's theorem for such matrices:

THEOREM 5.6

(A Gerschgorin Circle Theorem for Hermitian matrices) If A is Hermitian or real symmetric, then a one-to-one correspondence can be set up² between each disc Kρ(ajj) and each λj from the spectrum λ1, λ2, ..., λn of A, where

\[
\rho = \max_j \sum_{\substack{k=1\\ k\ne j}}^{n} |a_{jk}| \quad \text{or} \quad \rho = \Bigl(\sum_{\substack{j,k=1\\ j\ne k}}^{n} |a_{jk}|^2\Bigr)^{1/2}.
\]

(Recall that Kρ(ajj) = {z ∈ ℂ : |z − ajj| ≤ ρ}.)

We now turn to some basic methods for computing eigenvalues and eigenvectors.

5.2 The Power Method

In this section, we describe a simple iterative method for computing the eigenvector corresponding to the largest (in modulus) eigenvalue of a matrix A. We assume first that A is nondefective, i.e., A has a complete set of eigenvectors, and A has a unique simple³ dominant eigenvalue. We will discuss more general cases later.

Specifically, suppose that the n × n matrix A has a complete set of eigenvectors corresponding to eigenvalues \(\{\lambda_j\}_{j=1}^{n}\), and the eigenvalues satisfy

\[
|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n|. \tag{5.4}
\]

Since the eigenvectors {xj} are linearly independent, they form a basis for Cⁿ. That is, any vector q⁽⁰⁾ ∈ Cⁿ can be written

\[
q^{(0)} = \sum_{j=1}^{n} c_j x_j, \tag{5.5}
\]

²This does not necessarily mean that there is only one eigenvalue in each disk; consider the matrix \(A = \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix}\).
³A simple eigenvalue is an eigenvalue corresponding to a root of multiplicity 1 of the characteristic equation.


for some coefficients cj. Starting with initial guess q⁽⁰⁾, we define the sequence \(\{q^{(\nu)}\}_{\nu\ge 1}\) by

\[
q^{(\nu+1)} = \frac{1}{\sigma_{\nu+1}} A q^{(\nu)}, \quad \nu = 0, 1, 2, \ldots \tag{5.6}
\]

where the sequence \(\{\sigma_\nu\}_{\nu\ge 1}\) consists of scale factors chosen to avoid overflow and underflow errors. From (5.5) and (5.6), we have

\[
q^{(\nu)} = \Bigl[\prod_{i=1}^{\nu} \sigma_i^{-1}\Bigr] \sum_{j=1}^{n} \lambda_j^{\nu} c_j x_j
= \lambda_1^{\nu} \Bigl[\prod_{i=1}^{\nu} \sigma_i^{-1}\Bigr]
\Bigl\{ c_1 x_1 + \sum_{j=2}^{n} \Bigl(\frac{\lambda_j}{\lambda_1}\Bigr)^{\nu} c_j x_j \Bigr\}. \tag{5.7}
\]

Since by (5.4), |λj/λ1| < 1 for j ≥ 2, we have \(\lim_{\nu\to\infty} (\lambda_j/\lambda_1)^{\nu} = 0\) for j ≥ 2, and if c1 ≠ 0,

\[
\lim_{\nu\to\infty} q^{(\nu)} = \lim_{\nu\to\infty} \Bigl[\lambda_1^{\nu} \prod_{i=1}^{\nu} \frac{1}{\sigma_i}\Bigr] c_1 x_1. \tag{5.8}
\]

The scale factors σi are usually chosen so that ‖q⁽ᵛ⁾‖∞ = 1 or ‖q⁽ᵛ⁾‖₂ = 1 for ν = 1, 2, 3, ..., i.e., the vector q⁽ᵛ⁾ is normalized to have unit norm; thus σν+1 = ‖Aq⁽ᵛ⁾‖∞ or ‖Aq⁽ᵛ⁾‖₂, since q⁽ᵛ⁺¹⁾ = Aq⁽ᵛ⁾/σν+1. With either normalization, the limit in (5.8) exists; in fact,

\[
\lim_{\nu\to\infty} q^{(\nu)} = \frac{x_1}{\|x_1\|}, \tag{5.9}
\]

i.e., the sequence q⁽ᵛ⁾ converges, if c1 ≠ 0, to an eigenvector of unit length corresponding to the dominant eigenvalue of A.

If q⁽⁰⁾ is chosen randomly, the probability that c1 ≠ 0 is close to one, but not one. However, even if the exact q⁽⁰⁾ happens to have been chosen with c1 = 0, rounding errors on the computer may still result in a component in the direction x1.

Example 5.3

We illustrate the method with the matrix A from Example 5.1 and the following matlab dialog. We use \(\sigma_{\nu+1} = \max_{1\le i\le n} |q_i^{(\nu)}|\):

>> A = [1 2
-1 4]
A =
     1     2
    -1     4
>> q = rand(2,1)
q =
    0.4057
    0.9355
>> q = q/max(abs(q))
q =
    0.4337
    1.0000


>> q = A*q
q =
    2.4337
    3.5663
>> q = q/max(abs(q))
q =
    0.6824
    1.0000
.
.
.
q =
    0.9784
    1.0000
>> q = A*q
q =
    2.9784
    3.0216
>> q = q/max(abs(q))
q =
    0.9857
    1.0000
>> q = A*q
q =
    2.9857
    3.0143
>> q = q/max(abs(q))
q =
    0.9905
    1.0000
>> q = A*q
q =
    2.9905
    3.0095
>> max(abs(q))
ans =
    3.0095
>>

One sees linear convergence to the eigenvalue λ1 = 3 and corresponding eigenvector v = (1, 1)ᵀ.

Consider again Eq. (5.7). Since by assumption λ2 is the eigenvalue of second largest absolute magnitude, we see that, for ν sufficiently large,

\[
\frac{q^{(\nu)} - \Bigl(\lambda_1^{\nu} \prod_{i=1}^{\nu} \sigma_i^{-1}\Bigr) c_1 x_1}{(\lambda_2/\lambda_1)^{\nu}} \to k \quad \text{as } \nu \to \infty,
\]

where k is a constant vector. Hence,

\[
q^{(\nu)} - \frac{x_1}{\|x_1\|} = O\Bigl(\Bigl|\frac{\lambda_2}{\lambda_1}\Bigr|^{\nu}\Bigr). \tag{5.10}
\]

That is, the rate of convergence of the sequence q⁽ᵛ⁾ to the exact eigenvector is governed by the ratio |λ2/λ1|. In practice, this ratio may be too close to 1, yielding a slow convergence rate. For instance, if |λ2/λ1| = 0.95, then |λ2/λ1|ᵛ ≤ 0.1 only for ν ≥ 45, that is, it takes over 44 iterations to reduce the error in (5.10) by a factor of 10.


Example 5.4

In Example 5.3, the σν are approximations to the eigenvalues. According to our analysis, the convergence rate should be linear, with convergence factor λ2/λ1 = 2/3. We test this with the following matlab dialog, where we begin the power method iteration with the last iterate q⁽ᵛ⁾ we computed in Example 5.3.

>> q = q/max(abs(q))
q =
    0.9937
    1.0000
>> current = norm(q-[1;1])
current =
    0.0063
>> old = current;
>> q = A*q;
>> q = q/max(abs(q))
q =
    0.9958
    1.0000
>> current = norm(q-[1;1])
current = 0.0042
>> current/old
ans = 0.6653
>> old = current;
>> q = A*q;
>> q = q/max(abs(q))
q =
    0.9972
    1.0000
>> current = norm(q-[1;1])
current = 0.0028
>> current/old
ans = 0.6657
>> old = current;
>> q = A*q;
>> q = q/max(abs(q))
q =
    0.9981
    1.0000
>> current = norm(q-[1;1])
current = 0.0019
>> current/old
ans = 0.6660
>>

On the other hand, the method is very simple, requiring n² multiplications to compute Aq⁽ᵛ⁾ at each iteration. Also, if A is sparse, the work is reduced, and only the nonzero elements of A need to be stored.

REMARK 5.1 The matlab computations in the previous examples may be put into the following matlab function:

function [lambda,v,n_iter,success] ...
   = power_method (A, start_vector, max_iter, tol)
%
% [lambda,v,n_iter,success] = power_method (A, start_vector, max_iter, tol)
% returns an approximation to the dominant eigenvalue lambda
% and corresponding eigenvector v, starting with the column vector
% start_vector. The iteration stops when either max_iter is reached or the
% infinity norm between successive estimates for lambda becomes less than
% tol.
%
% On return, success is set to '1' ("true") if successive estimates for
% lambda have become less than tol, and is set to '0' ("false") otherwise.

q = start_vector;
if (norm(q,inf)==0)
   disp('Error in power_method.m: start_vector is the zero vector');
   n_iter=0;
   success=0;
   lambda = 0;
   v = start_vector;
   return
end
q = q/max(abs(q));
success=1;
for i=1:max_iter
   old_q = q;
   q = A*q;
   nu = max(abs(q));
   q = q/nu;
   diff = norm(q-old_q,inf);
   if (diff < tol)
      lambda = nu;
      v = q;
      n_iter = i;
      return
   end
end
disp(sprintf('Tolerance tol = %12.4e was not met in power_method',tol));
disp(sprintf('within max_iter = %7.0f iterations.', max_iter));
disp(sprintf('Current difference diff: %12.4e',diff))
success=0;
lambda = nu;
v = q;
n_iter = max_iter;

For example, this function (stored, say, in the user's current matlab directory as power_method.m) may be used as follows:

>> A = [1 2
-1 4]
A =
     1     2
    -1     4
>> format long
>> [lambda, v, n_iter, success] = power_method(A, rand(2,1), 100, 1e-15)
lambda =
   3.000000000000003
v =
  -0.999999999999998
  -1.000000000000000
n_iter =
    88
success =
     1
>>

If the real matrix A has complex eigenvalues, the dominant eigenvalue is necessarily not unique. Furthermore, we would need to begin with a non-real starting vector to have a chance of converging to any eigenvalue.

If the dominant eigenvalue is unique but not simple, the power method will still converge. Suppose that λ1 has multiplicity r and has r linearly independent eigenvectors. Then,

\[
q^{(0)} = \sum_{j=1}^{r} c_j x_j + \sum_{j=r+1}^{n} c_j x_j.
\]

The sequence q⁽ᵛ⁾ will converge to the direction

\[
\sum_{j=1}^{r} c_j x_j.
\]

However, if the dominant eigenvalue is not unique, e.g.,

\[
A = \begin{pmatrix} 0 & 0 & 1\\ 1 & 0 & 0\\ 0 & 1 & 0 \end{pmatrix}, \quad
\lambda_1 = 1, \quad \lambda_2 = -\frac{1}{2} + \frac{\sqrt{3}}{2}i, \quad \lambda_3 = -\frac{1}{2} - \frac{\sqrt{3}}{2}i
\]

(the three cube roots of unity, all of modulus 1), then the power method will fail to converge. This severely limits the applicability of the power method. Once a dominant eigenvalue and eigenvector have been found, a deflation technique may be applied to define a smaller matrix whose eigenvalues are the remaining eigenvalues of A. The power method is then applied to the smaller matrix. If all eigenvalues of A are simple with different magnitudes, this procedure can be used to find all the eigenvalues.

Example 5.5

Let

\[
A = \begin{pmatrix} 0 & 1\\ -1 & 0 \end{pmatrix}.
\]

The eigenvalues of A are i and −i, where i is the imaginary unit, and corresponding eigenvectors are (−i, 1)ᵀ and (i, 1)ᵀ. In the following matlab dialog, we illustrate that the simple power method does not converge, and that the matlab function eig can compute the eigenvalues and eigenvectors of A.

>> A = [0 1
-1 0]
A =
     0     1
    -1     0
>> start_vector = rand(2,1) + i*rand(2,1)
start_vector =
   0.5252 + 0.6721i
   0.2026 + 0.8381i
>> [lambda, v, n_iter, success] = power_method(A, start_vector, 1000, 1e-5)
Tolerance tol =   1.0000e-005 was not met in power_method
within max_iter =    1000 iterations.

Current difference diff:   1.9443e+000
success =
     0
lambda =
     1
v =
   0.6090 + 0.7795i
   0.2350 + 0.9720i
n_iter =
        1000
success =
     0
>> [P,Lambda] = eig(A)
P =
   0.7071            0.7071
        0 + 0.7071i       0 - 0.7071i
Lambda =
        0 + 1.0000i       0
        0                 0 - 1.0000i
>> inv(P)*A*P
ans =
        0 + 1.0000i       0
        0                 0 - 1.0000i
>>

The power method may sometimes be used with deflation to compute all eigenvalues and eigenvectors of a matrix. See our text [1] for an explanation of deflation, and for more details and additional convergence analysis of the power method.

5.3 Other Methods for Eigenvalues and Eigenvectors

The simple power method is usually not used alone in modern software for eigenvalues and eigenvectors. Instead, sophisticated implementations of various methods are combined. Such methods include the inverse power method with deflation, the QR method with origin shifts, the Jacobi method, etc. The resulting software, such as that underneath the matlab functions eig and eigs or in the publicly available (free, open-source) LAPACK package [5], does not fail often. For more details of some of these methods, see our graduate-level text [1]. We outline several of them here.

5.3.1 The Inverse Power Method

The inverse power method has a faster rate of convergence than the power method, and can be used to compute any eigenvalue, not just the dominant one. Let A have eigenvalues λ1, λ2, ..., λn corresponding to linearly independent eigenvectors x1, x2, ..., xn. (Here, the eigenvalues are not necessarily ordered.) Then, the matrix (A − λI)⁻¹ has eigenvalues (λ − λ1)⁻¹, (λ − λ2)⁻¹, ..., (λ − λn)⁻¹, corresponding to eigenvectors x1, x2, ..., xn. It can be shown (see [1] and the references therein) that, if

\[
|(\lambda - \lambda_2)^{-1}| \ge |(\lambda - \lambda_j)^{-1}|, \quad 3 \le j \le n,
\]

then

\[
q^{(\nu)} = \frac{1}{(\lambda - \lambda_1)^{\nu}} \Bigl[ c_1 x_1 + O\Bigl(\Bigl|\frac{\lambda_1 - \lambda}{\lambda_2 - \lambda}\Bigr|^{\nu}\Bigr) \Bigr]. \tag{5.11}
\]

Thus, the iterates q⁽ᵛ⁾ converge in direction to x1. The error estimate for the inverse power method analogous to (5.10) is

\[
q^{(\nu)} - \frac{x_1}{\|x_1\|} = O\Bigl(\Bigl|\frac{\lambda - \lambda_1}{\lambda - \lambda_2}\Bigr|^{\nu}\Bigr), \tag{5.12}
\]

where q⁽ᵛ⁾ is normalized so that ‖q⁽ᵛ⁾‖ = 1.

It can also be shown that, if q is approximately an eigenvector of A, the Rayleigh quotient

\[
\frac{q^T A q}{q^T q}
\]

is an approximation to the eigenvalue corresponding to q. Thus, in the inverse power method, we can adjust λ on each iteration by setting it to the Rayleigh quotient.

Example 5.6

Let

\[
A = \begin{pmatrix} 1 & 2\\ -1 & 4 \end{pmatrix}
\]

be as in Example 5.1. The eigenvalues of A are λ1 = 3 and λ2 = 2, so the eigenvalues of (A − λI)⁻¹ are 1/(2 − λ) and 1/(3 − λ). Suppose we have already found λ1 = 3 using the power method. Then, if we choose an initial λ less than λ2, we will have |λ − λ2| < |λ − λ1|, and the inverse power method will converge to v⁽²⁾ and λ2. We use the following matlab function:

function [lambda_1, v_1, success, n_iter] = inverse_power_method...
   (lambda, q0, A, tol, maxitr)
%
% [lambda_1, v_1, success] = inverse_power_method(lambda, q0, A, tol, maxitr)
% computes an eigenvalue and eigenvector of the matrix A, according to the
% inverse power method as described in Section 5.3 of the text.
% On entry:
%   lambda is the shift for inv(A - lambda I)
%   q0 is the initial guess for the eigenvector.

n = size(A,1);
a=inv(A-lambda*eye(n));
q_nu = q0;
alam=2*norm(q_nu,inf);   % (Initialize the approximate eigenvalue)
check=1;
success = false;
for k=1:maxitr
   alam2=alam;
   q_nu=a*q_nu;
   q_nu=q_nu/norm(q_nu,inf);
   alam=(q_nu'*a*q_nu)/(q_nu'*q_nu);  % (Update the approx. eigenvalue)
   check=abs(alam-alam2);  % (stop if successive eigenvalue approximations
   if (check < tol)        %  are close)
      success = true;
      break
   end
end
n_iter = k;
lambda_1=lambda+(1/alam);  % (the eigenvalue of the original matrix)
v_1 = q_nu;
disp(sprintf(' %9.0f %15.4f %15.5f ',k,lambda,lambda_1));

With inverse_power_method.m, we have the following dialog:

>> A = [1 2
-1 4]
A =
     1     2
    -1     4
>> lambda = -1;
>> q0 = rand(2,1)
q0 =
    0.4057
    0.9355
>> format long
>> [lambda_1, v_1, success, n_iter] =...
inverse_power_method(lambda, q0, A, 1e-16, 1000)
       118         -1.0000         2.00000
lambda_1 =
   1.999999999999997
v_1 =
  -1.000000000000000
  -0.499999999999999
success =
     1
n_iter =
   118
>>

5.3.2 The QR Method

The QR method is an iterative method for reducing a matrix to triangular form using orthogonal similarity transformations. The eigenvalues of the triangular matrix then appear on the diagonal, and, by Theorem 5.2, must also be the eigenvalues of the original matrix. The eigenvectors may be found by transforming back the eigenvectors of the final triangular matrix.

Computing eigenvalues and eigenvectors by the QR method involves two steps:

1. reducing the matrix to an almost triangular form (called the Hessenberg form), and

2. iteration by QR decompositions.


We apply origin shifts in the QR method similarly to how we applied such shifts (i.e., replacing A by A − λI) in the inverse power method.

We give a detailed explanation of the QR method in [1].
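The following is a minimal sketch (not the text's implementation, whose details appear in [1]) of the basic unshifted iteration, for a matrix whose eigenvalues have distinct moduli:

% Hypothetical matlab sketch: unshifted QR iteration on a Hessenberg form.
A = [1 2; -1 4];
H = hess(A);            % reduce A to Hessenberg form (a similar matrix)
for k = 1:50
   [Q,R] = qr(H);       % orthogonal-triangular decomposition
   H = R*Q;             % R*Q = Q'*H*Q, so eigenvalues are preserved
end
diag(H)                 % approximates the eigenvalues 3 and 2

Practical implementations add origin shifts to accelerate convergence and deflate converged eigenvalues, as noted above.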

5.3.3 Jacobi Diagonalization (Jacobi Method)

The Jacobi method for computing eigenvalues of a symmetric matrix is one of the oldest numerical methods for the eigenvalue problem. It was replaced by the QR algorithm as the method of choice in the 1960's. However, it is making a comeback due to its adaptability to parallel computers [19, 40]. We give a brief description of the Jacobi method in this section. Let A⁽¹⁾ = A be an n × n symmetric matrix. The procedure consists of

\[
A^{(k+1)} = N_k^H A^{(k)} N_k, \tag{5.13}
\]

where the Nk are unitary matrices that eliminate the off-diagonal element of largest modulus. It can be shown that if \(a^{(k)}_{pq}\) is the off-diagonal element of largest modulus, one transformation increases the sum of the squares of the diagonal elements by \(2(a^{(k)}_{pq})^2\) and at the same time decreases the sum of the squares of the off-diagonal elements by the same amount. Thus, A⁽ᵏ⁾ tends to a diagonal matrix as k → ∞. Since A⁽¹⁾ is symmetric, A⁽²⁾ = N₁ᴴA⁽¹⁾N₁ is symmetric, so A⁽³⁾, A⁽⁴⁾, ... are symmetric. Also, since A⁽ᵏ⁺¹⁾ is similar to A⁽ᵏ⁾, A⁽ᵏ⁺¹⁾ has the same eigenvalues as A⁽ᵏ⁾, and hence has the same eigenvalues as A.

We now consider how to find Nk such that the largest off-diagonal element of A⁽ᵏ⁾ is eliminated. Let

\[
A^{(k)} = \begin{pmatrix} a^{(k)}_{11} & \ldots & a^{(k)}_{1n}\\ & \ldots & \\ a^{(k)}_{n1} & \ldots & a^{(k)}_{nn} \end{pmatrix},
\]

and suppose that \(|a^{(k)}_{pq}| \ge |a^{(k)}_{ij}|\) for 1 ≤ i, j ≤ n. Let Nk be the identity matrix, modified in rows and columns p and q so that

\[
(N_k)_{pp} = \cos(\alpha_k), \quad (N_k)_{pq} = -\sin(\alpha_k), \quad (N_k)_{qp} = \sin(\alpha_k), \quad (N_k)_{qq} = \cos(\alpha_k).
\]

Nk is a Givens transformation (also called a plane rotator or Jacobi rotator). When A⁽ᵏ⁺¹⁾ is constructed, only rows p and q and columns p and q of A⁽ᵏ⁺¹⁾ are different from those of A⁽ᵏ⁾. The choice for αk is such that \(a^{(k+1)}_{pq} = 0\). That is, since

\[
a^{(k+1)}_{pq} = \bigl(-a^{(k)}_{pp} + a^{(k)}_{qq}\bigr)\cos\alpha_k \sin\alpha_k + a^{(k)}_{pq}\bigl(\cos^2\alpha_k - \sin^2\alpha_k\bigr) = 0,
\]

cos αk and sin αk are chosen so that

\[
\cos^2\alpha_k = \frac{1}{2} + \frac{a^{(k)}_{pp} - a^{(k)}_{qq}}{2r}, \qquad
\sin^2\alpha_k = \frac{1}{2} - \frac{a^{(k)}_{pp} - a^{(k)}_{qq}}{2r}, \qquad
\sin\alpha_k \cos\alpha_k = \frac{a^{(k)}_{pq}}{r},
\]

where \(r^2 = \bigl(a^{(k)}_{pp} - a^{(k)}_{qq}\bigr)^2 + 4\bigl(a^{(k)}_{pq}\bigr)^2\).

In summary, the Jacobi computational algorithm consists of the following steps, where the third step provides stability with respect to rounding errors:

(1) At step k, find \(a^{(k)}_{pq}\) such that p ≠ q and \(|a^{(k)}_{pq}| \ge |a^{(k)}_{ij}|\) for 1 ≤ i, j ≤ n, i ≠ j.

(2) Set \(r = \sqrt{(a^{(k)}_{pp} - a^{(k)}_{qq})^2 + 4(a^{(k)}_{pq})^2}\) and \(t = 0.5 + (a^{(k)}_{pp} - a^{(k)}_{qq})/(2r)\).

(3) Set chk = \(a^{(k)}_{pp} - a^{(k)}_{qq}\).

IF chk ≥ 0 THEN
   set \(c = \sqrt{t}\) and \(s = a^{(k)}_{pq}/(rc)\),
ELSE
   set \(s = \sqrt{1-t}\) and \(c = a^{(k)}_{pq}/(rs)\).
END IF

(4) Set

\[
N_{i,j} = \begin{cases}
1 & \text{if } i = j,\ i \ne p,\ i \ne q,\\
c & \text{if } i = p,\ j = p,\\
-s & \text{if } i = p,\ j = q,\\
s & \text{if } i = q,\ j = p,\\
c & \text{if } i = q,\ j = q,\\
0 & \text{otherwise.}
\end{cases}
\]

(5) Set A⁽ᵏ⁺¹⁾ = NᵀA⁽ᵏ⁾N.

(6) Go to Step (1) until \(|a^{(k)}_{pq}| < \varepsilon\).
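The following is a minimal sketch (not from the text) of one step of this algorithm as a matlab "m" file, for a real symmetric matrix:

function A = jacobi_step(A)
% One Jacobi rotation, following steps (1)-(5) above (a sketch, not the
% text's code). A is assumed real symmetric.
n = size(A,1);
B = abs(A - diag(diag(A)));
[mx, idx] = max(B(:));               % (1) largest off-diagonal element
[p, q] = ind2sub([n n], idx);
r = sqrt((A(p,p)-A(q,q))^2 + 4*A(p,q)^2);
t = 0.5 + (A(p,p)-A(q,q))/(2*r);     % (2)
if A(p,p)-A(q,q) >= 0                % (3) choose c and s stably
   c = sqrt(t); s = A(p,q)/(r*c);
else
   s = sqrt(1-t); c = A(p,q)/(r*s);
end
N = eye(n);                          % (4) the Givens rotation
N(p,p) = c; N(p,q) = -s; N(q,p) = s; N(q,q) = c;
A = N'*A*N;                          % (5) similarity transformation

Repeating jacobi_step until the largest off-diagonal element is below a tolerance, as in step (6), drives A toward a diagonal matrix of eigenvalues; applied to the matrix of Example 5.7 below, the first step produces the rotation N1 displayed there.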

Example 5.7

\[
A = A^{(1)} = \begin{pmatrix} 5 & 1 & 0\\ 1 & 5 & 2\\ 0 & 2 & 5 \end{pmatrix}, \qquad
N_1 = \begin{pmatrix} 1 & 0 & 0\\ 0 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ 0 & -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix}.
\]

Then,

\[
A^{(2)} = N_1^H A^{(1)} N_1 = \begin{pmatrix} 5 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}}\\ \frac{1}{\sqrt{2}} & 3 & 0\\ \frac{1}{\sqrt{2}} & 0 & 7 \end{pmatrix}.
\]

Notice that the sum of the squares of the diagonal elements of A⁽²⁾ is 8 more than the sum of the squares of the diagonal elements of A⁽¹⁾, and the sum of the squares of the off-diagonal elements of A⁽²⁾ is 8 less than the sum of the squares of the off-diagonal elements of A⁽¹⁾.

5.4 Applications

In our example in Section 3.7 (page 140), we used the dominant eigenvalue of the matrix

\[
A = \begin{pmatrix} 0.14 & 0 & 0.6\\ 0.56 & 0.24 & 0\\ 0 & 0.56 & 0.9 \end{pmatrix}
\]

to estimate the long run behavior of the system defined in (3.42). To compute the eigenvalues of A, we used the eig command in matlab as follows:


>> A=[0.14 0 0.6; 0.56 0.24 0; 0 0.56 0.9]
A =
    0.1400         0    0.6000
    0.5600    0.2400         0
         0    0.5600    0.9000
>> [v,lambda]=eig(A)
v =
  -0.1989 + 0.5421i  -0.1989 - 0.5421i   0.4959
   0.6989             0.6989             0.3160
  -0.3728 - 0.1977i  -0.3728 + 0.1977i   0.8089
lambda =
   0.0806 + 0.4344i        0                  0
        0             0.0806 - 0.4344i        0
        0                  0             1.1188

We thus find the dominant eigenvalue to be 1.1188. Now, let us compute the dominant eigenvalue with an iterative method we have learned in this chapter. Since what we need in this example is the dominant eigenvalue, we can first try the simple iterative method, the power method. Since A is nonnegative, primitive, and irreducible, A has a positive, simple, and strictly dominant eigenvalue (see [4]). Thus, we know from Section 5.2 that the power method will converge. Using the function power_method.m given in Remark 5.1, we have the following dialog:

>> A
A =
    0.1400         0    0.6000
    0.5600    0.2400         0
         0    0.5600    0.9000
>> format long
>> [lambda,v,n_iter,success]=power_method(A, rand(3,1),100,1e-10)
lambda =
   1.11876441203147
v =
   0.61301779326244
   0.39065073581277
   1.00000000000000
n_iter =
    25
success =
     1

The result shows that the dominant eigenvalue of A is approximately⁴ 1.11876441203147, which agrees with our previous result from Section 3.7.

5.5 Exercises

1. If

\[
A = \begin{pmatrix} 2 & -1 & 0\\ -1 & 2 & -1\\ 0 & -1 & 2 \end{pmatrix},
\]

then

⁴Based on the tolerance, we can only expect the first 10 digits of this to be correct.


(a) Use the Gerschgorin theorem to bound the eigenvalues of A.

(b) Compute the eigenvalues and eigenvectors of A directly from the definition, and compare with the results you obtained by using Gerschgorin's circle theorem.

2. Consider

\[
A = \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix}.
\]

(a) Compute the eigenvalues of A directly from the definition.

(b) Apply the Gerschgorin circle theorem to the matrix A.

(c) Why is A not a counterexample to the Gerschgorin theorem for Hermitian matrices (Theorem 5.6)?

3. Let

\[
A = \begin{pmatrix} 0 & \tfrac{1}{4} & \tfrac{1}{5}\\ -\tfrac{1}{4} & 0 & \tfrac{1}{3}\\ -\tfrac{1}{5} & -\tfrac{1}{3} & 0 \end{pmatrix}.
\]

Show that the spectral radius ρ(A) < 1.

4. Let A be a strictly diagonally dominant matrix. Can zero be in the spectrum of A?

5. Apply several iterations of the power method to the matrix from Problem 1 on page 208, using several different starting vectors. Compare the results to the results you obtained from Problem 1 on page 208.

6. Let A be a real symmetric n × n matrix with dominant eigenvalues λ1 = 1 and λ2 = −1. What would happen if we applied the power method to A?

7. Apply several iterations of the inverse power method to the matrix from Problem 1 on page 208, using several different starting vectors, and using the centers of the Gerschgorin circles as estimates for λ. Compare the results to the results you obtained from Problem 5 on page 209.

8. Let A be a (2n+1) × (2n+1) symmetric matrix with elements aij = (1.5)ⁱ if i = j and aij = (0.5)^{i+j−1} if i ≠ j. Let the eigenvalues of A be λi, i = 1, 2, ..., 2n + 1, ordered such that λ1 ≤ λ2 ≤ ... ≤ λ2n ≤ λ2n+1. We wish to compute the eigenvector xn+1 associated with the middle eigenvalue λn+1 using the inverse power method \(q_r = (\lambda I - A)^{-1} q_{r-1}\) for r = 1, 2, .... Considering Gerschgorin's Theorem for symmetric matrices, choose a value for λ that would ensure rapid convergence. Explain how you chose this value.

9. Apply one or more iterations of the Jacobi method to compute the eigenvalues and eigenvectors of the matrix in Problem 2.


10. Apply one or more iterations of the inverse power method to compute an eigenvalue of the matrix

\[
A = \begin{pmatrix} 0 & 1\\ -1 & 0 \end{pmatrix}.
\]

Hint: Since the eigenvalues are not real, you must choose a complex initial guess for the eigenvalue. One possibility might be to choose this randomly.


Chapter 6

Numerical Differentiation and Integration

In this chapter, we study the fundamental problem of approximating integrals and derivatives.

6.1 Numerical Differentiation

There are two common ways to develop approximations to derivatives, using Taylor's formula or Lagrange interpolation. We derive the formulas with Taylor's formula here, while we also consider derivations using Lagrange interpolation in [1].

6.1.1 Derivation of Formulas

Consider applying Taylor's formula for approximating derivatives. Suppose that f has two continuous derivatives, and we wish to approximate f′ at some point x0. By Taylor's formula,

\[
f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(\xi(x))
\]

for some ξ between x and x0. Thus, letting x = x0 + h,

\[
f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{h}{2} f''(\xi).
\]

Hence,

\[
f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} + O(h) \quad \text{(forward-difference formula)}. \tag{6.1}
\]

To obtain a better approximation, suppose that f has three continuous derivatives, and consider

\[
\begin{aligned}
f(x_0 + h) &= f(x_0) + f'(x_0)h + f''(x_0)\frac{h^2}{2} + f'''(\xi_1)\frac{h^3}{6},\\
f(x_0 - h) &= f(x_0) - f'(x_0)h + f''(x_0)\frac{h^2}{2} - f'''(\xi_2)\frac{h^3}{6}.
\end{aligned} \tag{6.2}
\]


Subtracting the above two expressions and dividing by 2h gives

\[
f'(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} + O(h^2) \quad \text{(central-difference formula)}. \tag{6.3}
\]

Similarly, we can go out one more term in (6.2) (assuming f has four continuous derivatives). Adding the two resulting expressions and dividing by h² then gives

\[
f''(x_0) = \frac{1}{h^2}\bigl[f(x_0 - h) - 2f(x_0) + f(x_0 + h)\bigr] - \frac{h^2}{24}\bigl[f^{(4)}(\xi_1) + f^{(4)}(\xi_2)\bigr].
\]

Hence, using the Intermediate Value Theorem,

\[
f''(x_0) = \frac{f(x_0 + h) - 2f(x_0) + f(x_0 - h)}{h^2} - \frac{h^2}{12} f^{(4)}(\xi). \tag{6.4}
\]

Example 6.1

Let f(x) = x ln x. Estimate f″(2) using (6.4) with h = 0.1. Doing so, we obtain

\[
f''(2) \approx \frac{f(2.1) - 2f(2) + f(1.9)}{(0.1)^2} = 0.50021.
\]

(Notice that f″(2) = 1/2, so the approximation is accurate.)
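The following is a minimal sketch (not from the text) checking this computation in matlab:

% Hypothetical matlab sketch: central second difference for f(x) = x*log(x).
f = @(x) x.*log(x);
h = 0.1;
d2 = (f(2+h) - 2*f(2) + f(2-h))/h^2   % prints approximately 0.50021
% The exact value is f''(2) = 1/2, since f''(x) = 1/x.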

6.1.2 Error Analysis

One difficulty with numerical differentiation is that rounding error can be large if h is too small. In the computer, f(x0 + h) is replaced by the computed value f̂(x0 + h) = f(x0 + h) + e(x0 + h), and f(x0) by f̂(x0) = f(x0) + e(x0), where e(x0 + h) and e(x0) are roundoff errors that depend on the number of digits used by the computer.

Consider the forward-difference formula

\[
f'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} - \frac{h}{2} f''(\xi(x)).
\]

We will assume that |e(x)| ≤ ε|f(x)| for some relative error ε, that |f(x)| ≤ M0 for some constant M0, and that |f″(x)| ≤ M2 for some constant M2, for all values of x near x0 that are being considered. Then, these assumed bounds and repeated application of the triangle inequality give

\[
\Bigl| f'(x_0) - \frac{\hat f(x_0 + h) - \hat f(x_0)}{h} \Bigr|
\le \Bigl| \frac{e(x_0 + h) - e(x_0)}{h} \Bigr| + \frac{h}{2} M_2
\le \frac{2\varepsilon M_0}{h} + \frac{h M_2}{2} = E(h), \tag{6.5}
\]

where ε is any number such that |e(x)| ≤ ε|f(x)| for all x under consideration. That is, the error is bounded by a curve such as that in Figure 6.1.

[Figure: the terms 2εM0/h and hM2/2 plotted against h; their sum has a minimum at an intermediate value of h.]

FIGURE 6.1: Illustration of the total error (roundoff plus truncation) bound in forward difference quotient approximation to f′.

Thus, if the value of h is too small, the error can be large.

If we use calculus to minimize the expression on the right of (6.5) with respect to h, we obtain

\[
h_{\mathrm{opt}} = \frac{2\sqrt{\varepsilon M_0}}{\sqrt{M_2}},
\]

with a minimal bound on the error of

\[
E(h_{\mathrm{opt}}) = 2\sqrt{M_0 M_2}\,\sqrt{\varepsilon}.
\]

Although the right member of (6.5) is merely a bound, we see that h_opt gives a good estimate for the optimal step in the divided difference, and E(h_opt) gives a good estimate for the minimum achievable error. In particular, the minimum achievable error is O(√ε) and the optimal h is also O(√ε), both in the estimates and in the numerical experiments in Example 6.2.

Example 6.2

We will use matlab to observe the behavior of the total error when we approximate ln′(3) by evaluating (ln(3 + h) − ln(3))/h using IEEE double precision arithmetic. We use the following matlab functions:

function difference_table(f,fprime,x)
%
% Issuing the command
%    difference_table('f','fprime',x)
% causes a table of difference quotients to be formed using the
% difference_quotient function.
for i=1:30
   h=4^(-i);
   value = difference_quotient(f,x,h);
   error = value - feval(fprime,x);
   fprintf('%3d %12.2e %20.16f %12.2e\n', i, h, value, error);
end
clear value error

function [diff] = difference_quotient(f,x,h)
%
% difference_quotient(f,x,h) returns the forward difference quotient of
% f at x with stepsize h, as in formula (6.1).
fxph = feval(f,x+h); fx = feval(f,x); diff = (fxph-fx)/h;

function [y] = logprime(x)
y = 1/x;

(These functions are available with more detailed in-line documentation at http://interval.louisiana.edu/Classical-and-Modern-NA/#Chapter_6.) With these functions, we have the following matlab dialog.

>> difference_table('log','logprime',3)
  1   2.50e-001   0.3201708306941464  -1.32e-002
  2   6.25e-002   0.3299085952437721  -3.42e-003
  3   1.56e-002   0.3324682801346626  -8.65e-004
  4   3.91e-003   0.3331165076408524  -2.17e-004
  5   9.77e-004   0.3332790916319937  -5.42e-005
  6   2.44e-004   0.3333197707015643  -1.36e-005
  7   6.10e-005   0.3333299425394216  -3.39e-006
  8   1.53e-005   0.3333324856357649  -8.48e-007
  9   3.81e-006   0.3333331214380451  -2.12e-007
 10   9.54e-007   0.3333332804031670  -5.29e-008
 11   2.38e-007   0.3333333209156990  -1.24e-008
 12   5.96e-008   0.3333333320915699  -1.24e-009
 13   1.49e-008   0.3333333432674408   9.93e-009
 14   3.73e-009   0.3333333730697632   3.97e-008
 15   9.31e-010   0.3333334922790527   1.59e-007
 16   2.33e-010   0.3333339691162109   6.36e-007
 17   5.82e-011   0.3333358764648438   2.54e-006
 18   1.46e-011   0.3333435058593750   1.02e-005
 19   3.64e-012   0.3333740234375000   4.07e-005
 20   9.09e-013   0.3334960937500000   1.63e-004
 21   2.27e-013   0.3339843750000000   6.51e-004
 22   5.68e-014   0.3359375000000000   2.60e-003
 23   1.42e-014   0.3437500000000000   1.04e-002
 24   3.55e-015   0.3750000000000000   4.17e-002
 25   8.88e-016   0.5000000000000000   1.67e-001
 26   2.22e-016   0.0000000000000000  -3.33e-001
 27   5.55e-017   0.0000000000000000  -3.33e-001
 28   1.39e-017   0.0000000000000000  -3.33e-001
 29   3.47e-018   0.0000000000000000  -3.33e-001
 30   8.67e-019   0.0000000000000000  -3.33e-001
>>

We see the minimum error occurs with h ≈ 5.96 × 10⁻⁸, and the minimum absolute error is about 1.24 × 10⁻⁹.

To analyze this example, notice that f″(x) = −1/x², so M2 = max |f″(ξ)| ≈ 1/9, and M0 ≈ ln(3) ≈ 1. Suppose that the error is (2ε/h)M0 + (h/2)M2, so

\[
e(h) = \frac{2\varepsilon}{h} M_0 + \frac{h}{2} M_2.
\]

The minimum error occurs at e′(h) = 0, which gives \(h_{\mathrm{opt}} \approx \sqrt{36\varepsilon}\). In matlab, if we assume ln is evaluated with maximal accuracy, ε is the IEEE double precision machine epsilon, namely ε ≈ 2.22 × 10⁻¹⁶. Thus, hopt ≈ 8.9 × 10⁻⁸, close to what we observed. The minimum error is predicted to be about

\[
2\sqrt{1 \cdot \tfrac{1}{9} \cdot 2.22 \times 10^{-16}} \approx 10^{-8},
\]

somewhat larger but within a factor of 10 of what was observed.

With higher-order formulas, we can obtain a smaller total error bound, at the expense of additional complication. In particular, if the roundoff error is O(1/h) and the truncation error is O(hⁿ), then the optimal h is O(ε^{1/(n+1)}) and the minimum achievable error bound is O(ε^{n/(n+1)}).

6.2 Automatic (Computational) Differentiation

Numerical differentiation has been used extensively in the past, e.g., for computing the derivative f′ for use in Newton's method.¹ Another example of the use of such derivative formulas is in the construction of methods for the solution of boundary value problems in differential equations, as was illustrated in Example 3.18, on page 93, while we were studying Cholesky factorizations. However, as we have just seen (in §6.1.2 above), roundoff error limits the accuracy of finite-difference approximations to derivatives. Moreover, it may be difficult in practice to determine a step size h for which near-optimal accuracy can be attained. This can cause significant problems, for example, in multivariate floating point Newton methods.

For complicated functions, algebraic computation of the derivatives by hand is also impractical. One possible alternative² is to compute the derivatives with symbolic manipulation systems such as Mathematica, Maple, or Reduce. These systems have facilities for output of the derivatives as statements in common compiled programming languages. However, such systems are often not able to adequately simplify the expressions for the derivatives, resulting in expressions for derivatives that can be many times as long as the expressions for the function itself. This "expression swell" not only can result in inefficient evaluation, but also can cause roundoff error to be a problem, even though there is no truncation error.

A third alternative is automatic differentiation, also called "computational differentiation." In this scheme, there is no truncation (method) error and the expression for the function is not symbolically manipulated, yet the user need only supply the expression for the function itself. The technique, increasingly used during the two decades prior to composition of this book, is based upon defining an arithmetic on composite objects, the components of which represent function and derivative values. The rules of this arithmetic are based on the elementary rules of differentiation learned in calculus, in particular, on the chain rule.

¹Or the multidimensional analog, as described in §8.1 on page 291.
²But not for Example 3.18.


6.2.1 The Forward Mode

In the "forward mode" of automatic differentiation, the derivative or derivatives are computed at the same time as the function. For example, if the function and the first k derivatives are desired, then the arithmetic will operate on objects of the form

\[
u_\nabla = \langle u, u', u'', \cdots, u^{(k)} \rangle. \tag{6.6}
\]

Addition of such objects comes from the calculus rule "the derivative of a sum is the sum of the derivatives," that is,

\[
u_\nabla + v_\nabla = \langle u + v,\; u' + v',\; u'' + v'',\; \cdots,\; u^{(k)} + v^{(k)} \rangle. \tag{6.7}
\]

In other words, the j-th component of u∇ + v∇ is the j-th component of u∇ plus the j-th component of v∇, for 1 ≤ j ≤ k. Subtraction is defined similarly, while products u∇v∇ are defined such that the first component of u∇v∇ is the first component of u∇ times the first component of v∇, etc., as follows:

\[
u_\nabla v_\nabla = \Bigl\langle uv,\; u'v + uv',\; u''v + 2u'v' + uv'',\; \cdots,\; \sum_{j=0}^{k} \binom{k}{j} u^{(k-j)} v^{(j)} \Bigr\rangle. \tag{6.8}
\]

Rules for applying functions such as "exp," "sin," and "cos" to such objects are similarly defined. For example,

\[
\sin(u_\nabla) = \langle \sin(u),\; u'\cos(u),\; -\sin(u)(u')^2 + \cos(u)u'',\; \cdots \rangle. \tag{6.9}
\]

The differentiation object corresponding to a particular value a of the independent variable x is of the form

\[
x_\nabla = \langle a, 1, 0, \cdots, 0 \rangle.
\]

Example 6.3

Suppose the context requires us to have values of the function, of the firstderivative, and of the second derivative for the function

f(x) = x sin(x)− 1,

where we want function and derivative values at x = π/4. What steps wouldthe computer do to complete the automatic differentiation?

The computer would first resolve f into a sequence of operations (some-times called a code list , tape, or decomposition into elementary operations).If we associate the independent variable x with the variable v1 and the i-th

Page 230: Undergraduate Text

Numerical Differentiation and Integration 217

intermediate result with vi+1, a sequence of operations for f can be3

v∇2 ← sin(v∇1)v∇3 ← v∇1v∇2

v∇4 ← v∇3 − 1(6.10)

We now illustrate with 4-digit decimal arithmetic, with rounding to nearest.We first set

v∇1 ← 〈π/4, 1, 0〉 ≈ 〈0.7854, 1, 0〉.

Second, we use (6.9) to obtain

v∇2 ← sin(〈0.7854, 1, 0〉)i.e. 〈sin(0.7854), 1× cos(0.7854),− sin(0.7854)× (12) + cos(0.7854)× 0〉≈ 〈0.7071, 0.7071,−0.7071〉.

Third, we use (6.8) to obtain

v∇3 ← 〈0.7854, 1, 0〉〈0.7071, 0.7071,−0.7071〉i.e. 〈0.7854× 0.7071, 1× 0.7071 + 0.7854× 0.7071,

0× 0.7071 + 2× 1× 0.7071 + 0.7854× (−0.7071)〉≈ 〈0.5554, 1.263, 0.8589〉

Finally, the second derivative object corresponding to the constant 1 is 〈1, 0, 0〉,so we apply formula (6.7) to obtain

v∇4 ← 〈0.5554, 1.263, 0.8589〉− 〈1, 0, 0〉≈ 〈−0.4446, 1.263, .08589〉.

Comparing, we have

f(π/4) = (π/4 sin(π/4)− 1 ≈ −0.4446,

f ′(x) = x cos(x) + sin(x) so f ′(π/4) ≈ 1.262,

f ′′(x) = −x sin(x) + 2 cos(x) so f ′′(π/4) ≈ 0.8589,

where the above values were computed to 16 digits, then rounded to fourdigits. This illustrates the validity of automatic differentiation.4

3We say “a sequence of operations for f can be,” rather than “the sequence of operationsfor f is,” because, in general, decompositions for a particular expression are not unique.4The discrepancy between the values 1.263 and 1.262 for f ′(π/4) is due to the fact thatrounding to four digits was done after each operation in the automatic differentiation. If theexpression for f ′ were first symbolically derived, then evaluated with four digit rounding(rather than exactly, then rounding), then a similar error would occur.

Page 231: Undergraduate Text

218 Applied Numerical Methods

6.2.2 The Reverse Mode

The reverse mode of automatic differentiation, when used to compute thegradient of a function f of n variables, can be more efficient than the forwardmode. In particular, when the forward mode (or for that matter, when finitedifferences or when symbolic derivatives) is used, the number of operationsrequired to compute the gradient is proportional to n times the number ofoperations to compute the function. In contrast, when the reverse mode isused, it can be proven that the number of operations required to compute thethe gradient∇F (which has n components) is bounded by 5 times the numberof operations required to evaluate the f itself, regardless of n. (However, aquantity of numbers proportional to the number of operations required toevaluate f needs to be stored when the reverse mode is used.)

So, how does the reverse mode work? We can think of the reverse modeas forming a system of equations relating the derivatives of the intermediatevariables in the computation through the chain rule, then solving the systemof equations for the derivative of the independent variable. Suppose we havea code list such as (6.10), giving the sequence of instructions for evaluating afunction f . For example, one such operation could be

vp = vq + vr,

where vp is the value to be computed, while vq and vr have previously beencomputed. Then, computing f ′ is equivalent to computing v′M , where vM

corresponds to the value of f . (That is, vM is the dependent variable, generallythe result of the last operation in the computation of the expression for f .)We form a sparse linear system with an equation for each operation in thecode list, whose variables are v′k, 1 ≤ k ≤ M . For example, the equationcorresponding to an addition vp = vq + vr would be

v′q + v′r − v′p = 0,

while the equation corresponding to a product vp = vqvr would be

vrv′q + vqv

′r − v′p = 0,

where the values of the intermediate quantities vq and vr have been previouslycomputed and stored from an evaluation of f . Likewise, if the operation werevp = sin(vq), then the equation would be

cos(vq)v′q − vp = 0,

while if the operation were addition of a constant, vp = vq + c, then theequation would be

v′q − v′p = 0.

If there is a single independent variable and the derivative is with respectto this variable, then the first equation would be

v′1 = 1.

Page 232: Undergraduate Text

Numerical Differentiation and Integration 219

We illustrate with the f for Example 6.3. If the code list is as in (6.10),then the system of equations will be

1 0 0 0cos(v1) −1 0 0

v2 v1 −1 00 0 1 −1

v′1v′2v′3v′4

=

1000

. (6.11)

If v1 = x = π/4 as in Example 6.3, then this system, filled using four-digitarithmetic, is

1 0 0 00.7071 −1 0 00.7071 0.7854 −1 0

0 0 1 −1

v′1v′2v′3v′4

=

1000

. (6.12)

The reverse mode consists simply of solving this system with forward substi-tution. This system has solution

v′1v′2v′3v′4

1.00000.70711.26251.2625

.

Thus f ′(π/4) = v′4 ≈ 1.2625, which corresponds to what we obtained withthe forward mode.

Example 6.4

Suppose f(x1, x2) = x21 − x2

2. Compute

∇f(x1, x2) =

(

∂f

∂x1,

∂f

∂x2

)T

at (x1, x2) = (1, 2) using the reverse mode.Solution: A code list for this function can be

v1 = x1

v2 = x2

v3 = v21

v4 = v22

v5 = v3 − v4

The reverse mode system of equations for computing ∂f/∂xi is thus

1 0 0 0 00 1 0 0 0

2v1 0 −1 0 00 2v2 0 −1 00 0 1 −1 −1

v′1v′2v′3v′4v′5

= ei, (6.13)

Page 233: Undergraduate Text

220 Applied Numerical Methods

where ei is the vector whose i-th component is 1 and all of whose othercomponents are 0. When x1 = 1 and x2 = 2, we have

1 0 0 0 00 1 0 0 02 0 −1 0 00 4 0 −1 00 0 1 −1 −1

v′1v′2v′3v′4v′5

= ei .

Now, ∂f/∂x1 can be computed by ignoring the row and column correspondingto v′2, while ∂f/∂x2 can be computed by ignoring the row and column corre-sponding to v′1. We thus obtain ∂f/∂x1 = 2 and ∂f/∂x2 = −4 (Exercise 10).

In fact, a directional derivative can be computed in the reverse mode withthe same amount of work it takes to compute a single partial derivative. Forexample, the directional derivative of f(x1, x2) at (x1, x2) = (1, 2) in thedirection of u = (1/

√2, 1/√

2)T can be obtained by solving the linear system

1 0 0 0 00 1 0 0 02 0 −1 0 00 4 0 −1 00 0 1 −1 −1

v′1v′2v′3v′4v′5

=

1/√

2

1/√

2000

(6.14)

for v′5.

6.2.3 Implementation of Automatic Differentiation

Automatic differentiation can be incorporated directly into the program-ming language compiler, or the technology of operator overloading (availablein object-oriented languages) can be used. A number of packages are availableto do automatic differentiation. The best packages (such as ADOLC, for differ-entiating “C” programs and ADIFOR, for differentiating Fortran programs) canaccept the definition of the function f in the form of a fairly generally writtencomputer program. Some of them (such as ADIFOR) produce a new programthat will evaluate both the function and derivatives, while others (such asADOLC) produce a code list or “tape” from the original program, then operateon the code list to produce the derivatives. The monograph [17] contains acomprehensive overview of theory and implementation of both the forwardand backward modes.

Within matlab, intlab has a special gradient data type that provides theforward mode of automatic differentiation through operator overloading. Hereis how we might do Example 6.3 (but with the first derivative only), usingintlab.

>> x = gradientinit(pi/4)

Page 234: Undergraduate Text

Numerical Differentiation and Integration 221

gradient value x.x =

0.7854

gradient derivative(s) x.dx =

1

>> x*sin(x) - 1

gradient value ans.x =

-0.4446

gradient derivative(s) ans.dx =

1.2625

>>

Above, notice that we initialize a variable to be of type “gradient” (providedby intlab) with the constructor gradientinit. The “gradient” type hastwo or more components, corresponding to a value and the partial derivatives.(There will be more than two components if the argument to gradientinit

is a vector with more than one component.) The value ans.x contains thefunction value, while ans.dx contains the derivative value.

6.3 Numerical Integration

The problem throughout the remainder of this chapter is determining ac-

curate methods for approximating the integral∫ b

af(x)dx. Approximating

integrals is called numerical integration or quadrature.

6.3.1 Introduction

Our goal is to approximate an integral

J(f) =

∫ b

a

f(x)dx. (6.15)

with quadrature formulas of the form

Q(f) = (b− a)

m∑

j=0

αjf(xj), (6.16)

where the α0, α1, . . . , αm are called weights and the x0, x1, . . . , xm are thesample or nodal points. We have

J(f) = Q(f) + E(f), (6.17)

where E(f) is the error in the quadrature formula.To simplify derivation and use of the formulas, we derive the formulas over

the simple interval [a, b] = [−1, 1], then use a change of variables to apply

Page 235: Undergraduate Text

222 Applied Numerical Methods

these formulas over arbitrary intervals [a, b]. Furthermore, the basic formulaswe so derive may not work well over intervals [a, b] that are wide in relationto how fast the integrand f is varying. In such instances, we divide [a, b] intosubintervals, and apply the basic formula over each subinterval, effectivelycomputing the integral as a sum of integrals.

As with numerical differentiation, we present the essentials and numerousexamples here, while we present alternative methods and derivations, as wellas a more complete analysis in [1].

6.3.2 Newton-Cotes Formulas

In the approximation

J(f) =

∫ b

a

f(x)dx ≈ Q(f) = (b− a)

m∑

j=0

αjf(xj), (6.18)

the Newton–Cotes Formulas , are derived by setting the sample points xj

beforehand to be equally spaced, then determining the weights to make theformula exact for as high a degree polynomial as possible.

DEFINITION 6.1 The (m+1 point) open Newton–Cotes formulas havepoints xj = x0 + (j + 1)h, j = 0, 1, 2, . . . , m, where h = (b − a)/(m + 2) andx0 = a + h. The (m + 1 point) closed Newton–Cotes formulas have pointsxj = x0 + jh, j = 0, 1, . . . , m, where h = (b − a)/m and x0 = a. That is, thesample points in the open formulas do not include the end points, whereas thesample points in the closed formulas do.

Example 6.5

Suppose that a = −1, b = 1, and m = 2. Then, the open points are

x0 = −0.5, x1 = 0, and x2 = 0.5,

while the closed points are

x0 = −1, x1 = 0, and x2 = 1.

We now derive both the closed and open Newton–Cotes formulas with threepoints. We obtain three equations for the three unknowns wi by matchingthe first three powers of x. That is, we plug f(x) ≡ 1, f(x) ≡ x, andf(x) ≡ x2 into (6.18) and solve the resulting system for the αi. For notationalconvenience, we set wi = (b − a)αi, and solve for the wi rather than the αi.

Page 236: Undergraduate Text

Numerical Differentiation and Integration 223

For the open formula, we obtain

∫ 1

−1

1dx = 2 = w0 + w1 + w2,

∫ 1

−1

xdx = 0 = −0.5w0 + 0.5w2,

∫ 1

−1

x2dx =2

3= 0.25w0 + 0.25w2.

This is the system of equations

1 1 1−0.5 0 0.5

0.25 0 0.25

w0

w1

w2

=

20

2/3

,

whose solution is w0 = w2 = 4/3 and w1 = −2/3. Hence, the open quadratureformula is:

∫ 1

−1

f(x)dx ≈ 4

3f(

− 1

2

)

− 2

3f(0) +

4

3f(1

2

)

.

Using the same technique for the closed formula, we get

∫ 1

−1

f(x)dx ≈ 1

3f(−1) +

4

3f(0) +

1

3f(1).

This closed formula is called Simpson’s Rule. (You will do the computationsto derive Simpson’s rule in Exercise 1 on page 249.)

6.3.3 Gaussian Quadrature

We have just seen that Newton-Cotes formulas can be derived by

(a) choosing the sample (or nodal) points xi, 0 ≤ i ≤ m, equidistant on[a, b], and

(b) choosing the weights wi, 0 ≤ i ≤ m, so that numerical quadrature isexact for the highest degree polynomial possible.

In Gaussian quadrature, the points and weights xi and wi, 0 ≤ i ≤ m, areboth chosen so that the quadrature formula is exact for the highest degreepolynomial possible. This results in the degree of precision for (m + 1)-pointGaussian quadrature being 2m + 1. Consider the following example.

Example 6.6

Take J(f) =∫ b

af(x)dx and m = 1. We want to find w0, w1, x0, and x1

such that Q(g) = J(g) for the highest degree polynomial possible. Letting

Page 237: Undergraduate Text

224 Applied Numerical Methods

g(x) = 1, g(x) = x, g(x) = x2, and g(x) = x3, we obtain the followingnonlinear system:

∫ 1

−1

1 dx = 2 = w0 + w1,

∫ 1

−1

x dx = 0 = w0x0 + w1x1,

∫ 1

−1

x2dx =2

3= w0x

20 + w1x

21,

∫ 1

−1

x3dx = 0 = w0x30 + w1x

31 .

Solving, we obtain w0 = w1 = 1, x0 = −1/√

3, x1 = 1/√

3, which are the2-point Gaussian weights and points. The formula therefore is

∫ 1

−1

f(x)dx ≈ f

(

− 1√3

)

+ f

(

1√3

)

.

This formula, known as the 2-point Gauss-Legendre quadrature rule, is exact5

by design when f is a polynomial of degree 3 or less.

In Example 6.6, we had four unknowns w0, w1, x0, and x1, and we hadfour conditions (fitting 1, x, x2, and x3 to the formula exactly, leading to fourequations). Even though the equations are nonlinear in x0 and x1, the samenumber of equations as unknowns allowed us to specify the unknowns. Ingeneral, if we have m + 1 points {xi}mi=0, we will have 2(m + 1) unknowns,and will be able to fit 2m + 2 powers of x exactly. That is, we can design theformula to be exact for polynomials of degree 2m + 1 or less.

How might we determine the weights and sample points in Gaussian quadra-ture? We now answer this question. Suppose we want to design a formulawith m + 1 points {xi}mi=0 such that

∫ 1

−1

f(x)dx ≈m∑

i=0

wif(xi) = 2

m∑

i=0

αif(xi) (6.19)

5that is, has no approximation error

Page 238: Undergraduate Text

Numerical Differentiation and Integration 225

is exact if f = f2m+1 is a polynomial of degree 2m + 1 or less. Let pm+1 be apolynomial of degree m + 1 with the following properties:

1. The roots of pm+1 are x0 through xm, that is, pm+1(xi) = 0,0 ≤ i ≤ m.

2.∫ 1

−1 pm+1(x)qm(x) = 0 whenever qm is a polynomial of degreem or less.

(6.20)

Then, by long division of polynomials, we may write

f2m+1(x) = pm+1(x)qm(x) + rm(x),

where qm is the quotient polynomial, of degree m or less, and rm is theremainder polynomial, also of degree m or less. Plugging this into (6.19)gives

∫ 1

−1

f2m+1(x)dx =

∫ 1

−1

pm+1(x)qm(x)dx +

∫ 1

−1

rm(x)dx

=

∫ 1

−1

rm(x)dx

=

m∑

i=0

wipm+1(xi)qm(xi) +

m∑

i=0

wirm(xi)

=

m∑

i=0

wirm(xi)

Thus, if the xi are chosen according to (6.20), all we need to do is choose theweights wi (or αi) to make the formula exact for polynomials of degree m orless, since then,

∫ 1

−1

rm(x)dx =

m∑

i=0

wirm(xi).

We can compute such xi with a special technique we illustrate in the followingexample.

Example 6.7

Suppose we want to derive x1 and x2 in 2-point Gaussian quadrature, asin Example 6.6. Matching the integral with powers of x, we get the fourequations in Example 6.6. Let’s assume the polynomial with roots x0 andx1 is of the form p2(x) = x2 + c1x + c0. Then, if we take c0 times the firstequation plus c1 times the second equation plus 1 times the third equation,

Page 239: Undergraduate Text

226 Applied Numerical Methods

we obtain

2c0 +2

3= 2α0(c0 + c1x0 + c1x

20) + 2α1(c0 + c1x1 + c1x

21)

= 0.

Similarly, taking c0 times the second equation plus c1 times the third equationplus 1 times the fourth equation gives.

2

3c1 = 2α0x0(c0 + c1x0 + c1x

20) + 2α1x1(c0 + c1x1 + c1x

21)

= 0.

We thus get the following system of two linear equations in the two unknownsc0 and c1:

(

2 00 2/3

)(

c0

c1

)

=

(

−2/30

)

,

giving c0 = −1/3, c1 = 0, and

p2(x) = x2 − 1

3,

with roots x0 = −1/√

3 and x1 = 1/√

3. We then plug x0 and x1 into thefirst two equations to obtain

2

1 1

−1/√

3 1/√

3

α0

α1

=

2

0

,

giving α0 = α1 = 1/2.We can diagram the linear combinations of the fitting equations we take to

obtain the coefficients of pm+1 as follows:

∫ 1

−1

1 dx = 2 = α0 + α1 c0

∫ 1

−1

x dx = 0 = α0x0 + α1x1 c1 c0

∫ 1

−1

x2dx =2

3= α0x

20 + α1x

21 1 c1

∫ 1

−1

x3dx = 0 = α0x30 + α1x

31 1

In fact, the technique illustrated in Example 6.7 works for general m, toreduce finding the xi in (m + 1)-point Gaussian quadrature to finding thezeros of a degree m polynomial.

Page 240: Undergraduate Text

Numerical Differentiation and Integration 227

A more sophisticated way of computing the xi is through the theory oforthogonal polynomials. We explain orthogonal polynomials in the book [1]for our second course in numerical analysis. In particular, the polynomialspm+1 constructed as in Example 6.7 are, to within a constant, the Legendrepolynomials of degree m + 1. Similarly, (m + 1)-point Gaussian quadrature

rules to compute∫ 1

−1 f(x)dx exactly when f is a polynomial of degree 2m+1or less are termed Gauss–Legendre quadrature rules.

Sample points and weights for Gauss–Legendre quadrature for various mappear in Table 6.1.

TABLE 6.1: Weights and sample points: Gauss–Legendre quadrature

1 point (m = 0) α1 = 1, z1 = 0 (midpoint rule)

2 point (m = 1) α1 = α2 = 1/2, z1 = − 1√3, z2 = 1√

3

3 point (m = 2)α1 = 5

18 , α2 = 818 , α3 = 5

18 ,

z1 = −√

35 , z2 = 0, z3 =

35

4 point (m = 3)

α1 = 14 − 1

6√

4.8, α2 = 1

4 + 16√

4.8, α3 = α2, α4 = α1,

z1 = −√

3+√

4.87 , z2 = −

3−√

4.87 , z3 = −z2, z4 = −z1.

2 point (m = 1) α0 = α1 = 1/2, z0 = − 1√3, z1 = 1√

3

3 point (m = 2)α0 = 5

18 , α1 = 818 , α2 = 5

18 ,

z0 = −√

35 , z1 = 0, z2 =

35

4 point (m = 3)

α0 = 14 − 1

6√

4.8, α1 = 1

4 + 16√

4.8, α2 = α1, α3 = α0,

z0 = −√

3+√

4.87 , z1 = −

3−√

4.87 , z2 = −z1, z3 = −z0.

6.3.4 More General Integrals

The techniques for deriving formulas in the previous sections apply for moregeneral integrals of the form

J(f) =

∫ b

a

ρ(x)f(x)dx, (6.21)

where ρ is not necessarily equal to 1 and where a, b, or both might be infinite.

Page 241: Undergraduate Text

228 Applied Numerical Methods

Example 6.8

Suppose we want a quadrature rule of the form

∫ ∞

0

e−xf(x)dx ≈ w0f(x0) + w1f(x1)

that is exact when f is a polynomial of degree 3 or less. We may use thetechnique illustrated in Example 6.7 gives

∫ ∞

0

1 · e−x dx = 1 = w0 + w1,

∫ ∞

0

x e−xdx = 1 = w0x0 + w1x1,

∫ ∞

0

x2e−xdx = 2 = w0x20 + w1x

21,

∫ ∞

0

x3e−xdx = 6 = w0x30 + w1x

31.

This gives

c0 + c1 + 2 = 0

c0 + 2c1 + 6 = 0,

whence c1 = −4, c0 = 2, and

p2(x) = x2 − 4x + 2,

so x0 = 2−√

2 and x1 = 2+√

2. Plugging these into the equations matchingf(x) ≡ 1 and f(x) ≡ x gives

1 = w0 + w1,

1 = (2 −√

2)w0 + (2 +√

2)w1,

whence w0 = (√

2 + 1)/(2√

2), w1 = (√

2− 1)/(2√

2), and

∫ ∞

0

e−xf(x)dx ≈√

2 + 1

2√

2f(2−

√2) +

√2− 1

2√

2f(2 +

√2).

This is known as the 2-point Gauss–Laguerre quadrature formula. In general,Gaussian formulas that approximate integrals with a = 0, b =∞, and ρ(x) =e−x with m + 1 points are known as Gauss-Laguerre. In general, Gaussianformulas that approximate integrals with a = 0, b = ∞, and ρ(x) = e−x

with m+1 points are known as Gauss–Laguerre formulas. The correspondingpolynomials pm+1 are known as Laguerre polynomials.

Page 242: Undergraduate Text

Numerical Differentiation and Integration 229

Example 6.9

Suppose a = −∞, b = ∞, ρ(x) = e−x2

, and we wish to derive a 2-pointGauss formula. That is, we seek an approximation of the form

∫ ∞

−∞e−x2

f(x)dx ≈ w0f(x0) + w1f(x1).

Proceeding as in Example 6.8, we have

∫ ∞

−∞1 · e−x2

dx =√

π = w0 + w1,

∫ ∞

−∞x e−x2

dx = 0 = w0x0 + w1x1,

∫ ∞

−∞x2e−x2

dx =

√π

2= w0x

20 + w1x

21,

∫ ∞

−∞x3e−x2

dx = 0 = w0x30 + w1x

31.

Continuing as in Example 6.8, we obtain c0 = −1/2, c1 = 0, and from themobtain x0 = −1/

√2, x1 = 1/

√2, then w0 = w1 =

√π/2. (You will fill in the

details in Exercise 4.) The formula is thus

∫ ∞

−∞e−x2

f(x)dx ≈√

π

2f

(

− 1√2

)

+

√π

2f

(

1√2

)

.

This is known as the 2-point Gauss–Hermite quadrature formula, and Gaus-sian formulas that approximate integrals with a = −∞, b = ∞, and ρ(x) =e−x with m + 1 points are known as Gauss–Hermite formulas. The corre-sponding orthogonal polynomials pm+1 are known as Hermite polynomials.

Gauss–Laguerre and Gauss–Hermite formulas are useful for integratingsmooth functions over semi-infinite and infinite intervals, respectively. Occa-sionally, an application requires a derivation of a special formula with differenta, b, and ρ. An example of this is in Exercise 5.

6.3.5 Error Terms

So far, we have seen error terms for approximation of a function and deriva-tive, as indicated in Table 6.2.

All of these approximations are of the form

{Exact value} = {Approximate value}+ {Error term}

or∣

∣{Exact value} − {Approximate value}∣

∣ ≤ {Error bound}.

Page 243: Undergraduate Text

230 Applied Numerical Methods

TABLE 6.2: Error terms seen so far

Itemapprox-imated

Approximationmethod

Formula for error Reference

function

fTaylor polyno-mial

f (n+1)(ξ(x))(x − x0)n+1

(n + 1)!

Taylor’s Theo-rem, on page 3and page 145

fpolynomial in-terpolation

f (n+1)(ξ(x))n∏

j=0

(x − xj)

(n + 1)!

Formula (4.7) onpage 154

f

piecewise lin-ear polynomialinterpolation

1

8h2 max

x∈[a,b]|f ′′(x)| Theorem 4.4 on

page 162

f

clampedboundarycubic splineinterpolation

5

384h4 max

x∈[a,b]

d4f

dx4(x)

Theorem 4.6 onpage 168

f ′forward differ-ence quotient

−h

2f ′′(ξ) Above (6.1) on

page 211

f ′′central differ-ence quotient

−h2

12f (4)(ξ) Formula (6.4) on

page 212

We now explain how to compute error bounds when

{Exact value} = J(f) =

∫ b

a

ρ(x)f(x)dx

and

{Approximate value} = Q(f) =

1. A Newton–Cotes quadrature rule,

2. a Gaussian quadrature rule, or

3. a special quadrature rule.

.

In fact, we can use a general approach given in [18, Chapter 16].

Page 244: Undergraduate Text

Numerical Differentiation and Integration 231

THEOREM 6.1

For the usual Newton–Cotes, Gauss, Gauss–Laguerre, and Gauss–Hermiteformulas with positive weights, we have

J(f) = Q(f) +Eµ

µ!f (µ)(ξ)

for some ξ ∈ [a, b], where J(f) = Q(f) when f is a polynomial of degree µ−1or less, but J(xµ) 6= Q(xµ), and where

Eµ = J(xµ)−Q(xµ).

(For the Newton–Cotes formulas with m odd, µ = m+1, for the Newton–Cotesformulas with m even, µ = m + 2, and for Gaussian formulas, µ = 2m + 2.)

PROOF The proof depends on the influence function G(s) for the formulabeing of constant sign. See [18, Chapter 16].

Example 6.10

Suppose we want to find the error term in Simpson’s rule (derived in Ex-ample 6.5, starting on page 222). We designed the formula to be exact forf(x) = 1, x, and x2, so µ ≥ 3. In fact

∫ 1

−1

x3dx = 0 =1

3(−1)3 +

4

3(0)3 +

1

3(1)3,

so µ ≥ 4. We compute

E4 =

∫ 1

−1

x4dx−[

1

3(−1)4 +

4

3(0)4 +

1

3(1)4

]

=2

5− 2

3= − 4

156= 0.

Thus, the multiplying factor for the error term is −4/(15 · 4!) = −1/90, and∫ 1

−1

f(x)dx =1

3f(−1) +

4

3f(0) +

1

3f(1)− 1

90f (4)(ξ)

for some ξ ∈ [−1, 1].

Example 6.11

Let’s compute the error in the 2-point Gauss–Legendre formula, given inExample 6.6 (starting on page 223). The formula was designed to be exactwhen f(x) = xk for k ≤ 3, so we try µ = 4:

J(f)−Q(f) =

∫ 1

−1

x4dx−[

f

(

− 1√3

)

+ f

(

1√3

)]

=2

5− 2

9=

8

456= 0.

Page 245: Undergraduate Text

232 Applied Numerical Methods

Thus, µ = 4, E4 = 8/45, the multiplying factor is E4/4! = 1/135, and

∫ 1

−1

f(x)dx = f

(

− 1√3

)

+ f

(

1√3

)

+1

135f (4)(ξ)

for some ξ ∈ [−1, 1].

We summarize these error terms for quadrature formulas in Table 6.3.

TABLE 6.3: Some quadrature formula error terms

Formula Formula for error Reference

Trapezoidal rule (2-pointclosed Newton–Cotes) −2

3f ′′(ξ)

Exercise 2 onpage 249

Simpson’s Rule (3-pointclosed Newton–Cotes) − 1

90f (4)(ξ)

Example 6.10on page 231

1-point Gauss–Legendre(midpoint rule)

1

3f (2)(ξ)

Table 6.1 onpage 227

2-point Gauss–Legendre1

135f (4)(ξ)

Example 6.11on page 231

2-point Gauss–Laguerre1

6f (4)(ξ)

Exercise 6 onpage 250

2-point Gauss–Hermite√

π

48f (4)(ξ)

Exercise 7 onpage 250

Higher-order Newton–Cotes and Gaussian formulas, along with their errorterms, are available in published tables, on the web, and in software. Whencomputing error terms for specially derived formulas, care must be taken toassure the conditions under which Theorem 6.1 is true are satisfied, or elseuse other methods, such as those in our second text [1]. In particular, oneshould study [18, Chapter 16] and determine if the influence function G(s) forthe formula is of constant sign.

Page 246: Undergraduate Text

Numerical Differentiation and Integration 233

6.3.6 Changes of Variables

To use a quadrature rule with error term over an interval [a, b] other than[−1, 1], that is, to evaluate an integral of the form

∫ t=b

t=a

f(t)dt,

we may use the change of variables

t(x) = a +b− a

2(x + 1), dt = t′(x)dx =

b− a

2dx. (6.22)

We thus have

∫ b

a

f(t)dt =b − a

2

∫ 1

−1

f

(

a +b− a

2(x + 1)

)

dx (6.23)

We use this change of variables both in the formula and the error term.

Example 6.12

Suppose we want to apply Simpson’s rule over a small interval [xi, xi+1] =[xi, xi + h]. Simpson’s rule over [−1, 1] is

∫ 1

−1

f(x)dx = 0 =1

3f(−1) +

4

3f(0) +

1

3f(1)− 1

90

d4f(x)

dx4

x=ξ

.

Using the change of variables (6.22) to change from x to t, we have a = xi,b = xi + h, and (b − a)/2 = h/2. Furthermore, since x0 = −1, x1 = 0, andx2 = 1 in Simpson’s rule, we have

t0 = xi +h

2(−1 + 1) = xi,

t1 = xi +h

2(0 + 1) = xi +

h

2,

t2 = xi +h

2(1 + 1) = xi + h.

Also, since t′′(x) = 0, repeated application of the chain rule gives

d4f(t)

dx4=

d4f(t(x))

dt4d4t

dx4=

(

h

2

)4

f (4)(t).

Page 247: Undergraduate Text

234 Applied Numerical Methods

Simpson’s rule with error term over the interval [xi, xi + h] thus becomes∫ xi+h

xi

f(t)dt =h

2

∫ 1

x=−1

f(t(x))dx

=h

2

[

1

3f(xi) +

4

3f(xi +

h

2) +

1

3f(xi + h)

− 1

90

(

h

2

)4d4f(t)

dt4

t=ζ

]

=h

6

[

f(xi) + 4f(xi +h

2) + f(xi + h)

]

− h5

2880

d4f(ζ)

dt4

(6.24)

for some ζ ∈ [xi, xi + h].

With a change of variables, we may use the formulas we have seen, or higher-order formulas derived with the techniques we have seen, to approximate∫ b

af(t)dt for arbitrary a and b. However, the error, given in Theorem 6.1 for

many formulas, is proportional to both the µ-th derivative of f and (b−a)µ+1.As in polynomial interpolation, we may not be able to decrease the errorby increasing the order µ of the formula. In such instances, the compositeformulas of the next section may be appropriate.

6.3.7 Composite Quadrature

Using Theorem 6.1 and the change of variables we have presented, for manyquadrature rules Q(f) that are exact when f(x) = xµ−1 but not exact whenf(x) = xµ, we have

∫ b

a

f(t)dt = Q(f) + K

(

b− a

2

)µ+1

f (µ)(ζ), (6.25)

for some ζ ∈ [a, b] and constant K = Emu/µ! that depends on the quadratureformula but not on f , provided f has µ continuous derivatives. However, theerror can easily increase if we apply, say, a sequence of Newton–Cotes formulaswith increasing numbers of points (and hence increasing µ), even though theconstant K is smaller if we use a larger number of points. One reason for thisis that errors in evaluation of f (due either to roundoff error or measurementerror, if the function values are obtained from measuring a physical process)can be viewed as making the higher-and-higher order derivatives ever larger.

Observing (6.25), we see that we may subdivide the interval [a, b] into Nsubintervals, [xi, xi+1] = [xi, xi + h], each of length h = (b − a)/N , x0 = a,xN = b, and, with our change of variables,

∫ xi+h

xi

f(t)dt = Qi(f) + K

(

h

2

)µ+1

f (µ)(ζi),

Page 248: Undergraduate Text

Numerical Differentiation and Integration 235

where Qi(f) is the quadrature rule applied to [xi, xi + h] and ζi is someunknown number in [xi, xi + h]. Thus,

∫ b

a

f(t)dt =

N−1∑

i=0

∫ xi+h

xi

f(t)dt

=

N−1∑

i=0

Qi(f) +

N−1∑

i=0

K

(

h

2

)µ+1

f (µ)(ζi)

=N−1∑

i=0

Qi(f) + K

(

h

2

)µ+1 N−1∑

i=0

f (µ)(ζi)

=

N−1∑

i=0

Qi(f) + K

(

h

2

)µ+1

Nf (µ)(ζ(N))

=

N−1∑

i=0

Qi(f) + Khµ b− a

2µ+1f (µ)(ζ(N))

=

N−1∑

i=0

Qi(f) + K

(

1

N

)µ (b− a

2

)µ+1

f (µ)(ζ(N)),

for some ζ ∈ [a, b], where

N−1∑

i=0

f (µ)(ζi) = Nf (µ)(ζ(N)) =b− a

hf (µ)(ζ(N))

by the intermediate value theorem. The error is thus proportional to (1/N)µ,and we can decrease the error by decreasing the number of subintervals. Thecomputation

QC,N =

N−1∑

i=0

Qi(f) (6.26)

is called the composite quadrature rule with N sub-panels corresponding to Q.In principle, we can compute and bound f (µ)(x) using automatic differentia-tion, to determine the N that is needed to achieve a particular error bound.Until recently 6, however, heuristic7 estimates for the error can be obtainedby assuming ζ(N) does not depend on N , so the error in the composite ruleis

EC,N (f) = K

(

1

N

)µ(b− a

2

)µ+1

f (µ)(ζ(N)) ≈ K

(

1

N

. (6.27)

6and, indeed, in many instances when f comes from measured data or from a computationto which automatic differentiation cannot be easily used7that is, rule-of-thumb

Page 249: Undergraduate Text

236 Applied Numerical Methods

We then have

EC,2N (f) ≈(

1

2

EC,N (f),

so

∫ b

a

f(t)dt = QC,N(f) + EC,N (F )

= QC,2N(f) + EC,2N (f)

≈ QC,2N(f) +

(

1

2

EC,N (f).

We may solve this approximation for EC,N (f) to obtain

EC,N (f) ≈ 1

1−(

12

)µ (QC,2N (f)−QC,N(f)),

and

∫ b

af(t)dt ≈ QC,N(f) +

1

1−(

12

)µ (QC,2N (f)−QC,N(f))

=QC,2N −

(

12

)µQC,N

1−(

12

)µ .

(6.28)

This technique both gives an approximation for the error and a higher-orderapproximation to the exact value. The technique is used in various contexts innumerical analysis and software for computing integrals and other quantities.Used iteratively, it is called Richardson extrapolation. Richardson extrapola-tion used with the Trapezoidal rule (the 2-point closed Newton–Cotes rule,Exercise 2) is called Romberg integration. For details, see our text [1] for asecond course in numerical analysis.

Example 6.13

Suppose Q(f) is Simpson’s rule (as in Example 6.12 on page 233) to compute

∫ π

−π

sin(t)

tdt,

where limt→0 sin(t)/t = 1. We have f(x) = sin(x)/x, a = −π, b = π, µ = 4,and K = −1/90. For N = 1, h = 2π, and

QC,1 =π

3[f(−π) + 4f(0) + f(π)] =

π

3[0 + 4 · 1 + 0] =

3.

Page 250: Undergraduate Text

Numerical Differentiation and Integration 237

For N = 2, h = π, and

QC,2 =π

6

[

f(−π) + 4f(−π

2) + f(0)

]

6

[

f(0) + 4f(π

2

)

+ f(π)]

6

[

f(−π) + 4f(

−π

2

)

+ 2f(0) + 4f(π

2

)

+ f(π)]

6

[

0 + 4

(

2

π

)

+ 2(1) + 4

(

2

π

)

+ 0

]

=8

3+

π

3.

Thus,

EC,1(f) ≈ 1

1−(

12

)4

(

8

3+

π

3− 4π

3

)

=16

15

(

8

3− π

)

≈ −0.5066,

and∫ π

−π

sin(t)

tdt = QC,1 + EC,1 ≈

3+

16

15

(

8

3− π

)

≈ 3.682

The error approximation we can use is

EC,2 ≈1

16EC,1 ≈ −.03167.

Thus, we would expect the first two digits of our answer 3.682 to be correct.In fact, the function

Si(x) =

∫ x

0

sin(x)

xdx =

1

2

∫ x

−x

sin(x)

xdx

is called the sine integral , and is available as the matlab function sinint.Thus, the integral in this example is 2Si(π), and we have the following matlab

dialog.

>> exact = 2*sinint(pi)

exact = 3.703874103964933

>> approx = 8/3 + pi/3

approx = 3.713864217863264

>> true_error = exact-approx

true_error = -0.009990113898332

>>

Assuming matlab’s routine sinint gives a result that has all or most ofits digits correct, we see that the actual error in our computed value for theintegral is well within our heuristic estimate for the error.

Example 6.14

We supply a routine composite Newton Cotes.m on the web site http://

interval.louisiana.edu/Classical-and-Modern-NA/#Chapter_6. This

Page 251: Undergraduate Text

238 Applied Numerical Methods

routine implements composite Newton–Cotes formulas of various orders, anddoubles N until the heuristic estimate for the error is within a specified toler-ance. On the other hand, with the change of variables πu = t, πdu = dt, theintegral in Example 6.13 can be written as

∫ π

−π

sin(t)

t= π

∫ 1

−1

sin(πu)

πudu = π

∫ 1

−1

sinc(u)du,

where

sinc(u) =

1, u = 0,

sin(πu)

πu, u 6= 0

is known as the sinc function. matlab has a routine sinc to evaluate the sincfunction. We may use sinc with our routine composite Newton Cotes.m touse a composite Simpson’s rule to compute the integral from Example 6.13:

>> [value, success] = composite_Newton_Cotes(’sinc’,-1,1,3,1e-14)

2 1.1821604 1.1791548 1.178990

16 1.17898032 1.178980

64 1.178980128 1.178980

256 1.178980512 1.178980

1024 1.178980

2048 1.1789804096 1.178980

value = 1.178979744472170success = 1>> integral = pi*value

integral = 3.703874103964942>>

We see that, with 4096 subintervals, we obtain an approximation to theerror of less than 10−14, and the first 14 digits of the value returned agreewith the first 14 digits of the value matlabreturns for 2 times the sine integral.

6.3.8 Adaptive Quadrature

If the function varies more rapidly in one part of the interval of integrationthan in other parts, and it is not known beforehand where the rapid variationis, then a single rule or a composite rule in which the subintervals all havethe same length is not the most efficient. Also, in general, routines withinlarger numerical software libraries or packages, a user typically supplies afunction f , an interval of integration [a, b], and an error tolerance ǫ, withoutsupplying any additional information about the function’s smoothness.8 In

8A function is “smooth” if it has many continuous derivatives. Generally the “degree ofsmoothness” refers to the number of continuous derivatives available. Even if a function has,

Page 252: Undergraduate Text

Numerical Differentiation and Integration 239

such cases, the quadrature routine itself should detect which portions of theinterval of integration (or domain of integration in the multidimensional case)need to have a small interval length, and which portions need to have a largerinterval length, to achieve the specified tolerance ǫ. In such instances, adaptivequadrature is appropriate.

Adaptive quadrature can be considered to be a type of branch and boundmethod9. In particular, the following general procedure can be used to com-

pute∫ b

a f(x)dx.

ALGORITHM 6.1

(Adaptive quadrature)

INPUT:

1. the interval of integration [a, b] and the function f ;

2. an absolute error tolerance ǫ, and a minimum interval length δ.

OUTPUT: Either “tolerance has not been met” or “tolerance has been met”and an approximation sum to the integral

1. (Initialization)

(a) Input an absolute error tolerance ǫ, and a minimum interval lengthδ.

(b) Input the interval of integration [a, b] and the function f .

(c) sum← 0.

(d) sum← 0.

(e) L ← {[a, b]}, where L is a list of subintervals that needs to beconsidered.

2. DO WHILE L 6= ∅.

(a) Remove the first interval from L and place it in the current interval[c, c].

(b) Apply a quadrature formula over the current interval [c, c] to obtainan approximation Ic.

in theory, many continuous derivatives, we might consider it not to be smooth numericallyif it changes curvature rapidly at certain points. An example of this is the function f(x) =√

x2 + ǫ: as ǫ gets small, the graph of this function becomes indistinguishable from that off(x) = |x|.9We explain another type of branch and bound method, of common use in optimization, in[1, Section 9.6.3].

Page 253: Undergraduate Text

240 Applied Numerical Methods

(c) (bound): Use an error formula for the rule to obtain a bound Ec

for the error, or else obtain Ec as a heuristic estimate for the error;This can be done by either using an error formula or by comparingwith a different quadrature rule of the same or different order.

(d) IF Ec < ǫ(c− c), THEN

sum← sum+ Ic.

ELSE

IF (c− c) < δ THEN

RETURN with a message that the tolerance ǫ could not bemet with the given minimum step size δ.

ELSE

(branch): form two new intervals [c, (c + c)/2] and [(c +c)/2, c], and store each into the list L.

END IF

END IF

END DO

3. RETURN with a message that the tolerance has been met, and returnsum as the approximation to the integral.

END ALGORITHM 6.1.

An early good example implementation of an adaptive quadrature routineis given in the classic text [15] of Forsythe, Malcolm, and Moler.10 Thisroutine, quanc8, is based on an 8-panel Newton-Cotes quadrature formulaand a heuristic estimate for the error. The heuristic estimate is obtained bycomparing the approximation with 8-panel rule over the entire subintervalIc and the approximation with the composite rule obtained by applying the8-point rule over the two halves of Ic; see [15, pp. 94–105] for details. Theroutine itself11 can be found in NETLIB, presently at http://www.netlib.

org/fmm/quanc8.f.An extremely elegant implementation, using recursion, is the pair of rou-

tines matlab functions quadtx and quadgui described in [25, Section 6.3].In recursion, the adaptive process is arranged so the loop of Step 2 of Algo-rithm 6.1 is absent, and, instead, the quadrature routine is called again. Arecursive version of Algorithm 6.1 is as follows.

10This text doubles as an elementary numerical analysis text and as a “user guide” for theroutines it explains. It distinguished itself from other texts of the time by featuring routinesthat were simple enough to be used to explain the elementary concepts, yet sophisticatedenough to be used to solve practical problems.11In Fortran 66, but written carefully and clearly.

Page 254: Undergraduate Text

Numerical Differentiation and Integration 241

ALGORITHM 6.2

(Recursive version of adaptive quadrature)

INPUT:

1. the interval of integration [a, b] and the function f ;

2. an absolute error tolerance ǫ, and a minimum interval length δ.

OUTPUT: Either “failure” (tolerance has not been met) or “success” (toler-ance has been met) and an approximation I to the integral

1. Apply a quadrature formula over the current interval [a, b] to obtain anapproximation I.

2. (bound): Use an error formula for the rule to obtain an approximationI for the integral and a bound E for the error, or else obtain E as aheuristic estimate for the error; This can be done by either using anerror formula or by comparing with a different quadrature rule of thesame or different order.

3. IF E < ǫ, THEN

RETURN “success” and I.

ELSE

IF (b − a) < δ THEN

RETURN “failure”.

ELSE (branch)

(a) Invoke this algorithm with function f , interval of integration[a, (a + b)/2], error tolerance ǫ/2, minimum interval length δ,and output success1 and I1.

(b) Invoke this algorithm with function f , interval of integration[(a + b)/2, b], error tolerance ǫ/2, minimum interval length δ,and output success2 and I2.

(c) IF both success1 and success2, THEN

RETURN “success” and I = I1 + I2,

ELSE

RETURN “failure”.

END IF

END IF

END IF

Page 255: Undergraduate Text

242 Applied Numerical Methods

END ALGORITHM 6.2.

Usually, adaptive algorithms that are implemented recursively are simpler,easier to program, and easier for humans to understand than adaptive algo-rithms that use lists and loops. However, functions that are invoked recur-sively usually involve more overhead and are less efficient.

An illustration of the behavior of an adaptive quadrature algorithm is [25,Figure 6.3].

The matlab routine quad does an adaptive quadrature based on Simpson’srule.

Example 6.15We can use the matlab routine quad, the matlab function sinc, and the

change of variables from Example 6.14 to compute the sine integral fromExample 6.13 to a specified (heuristically determined) accuracy:

>> format long

>> [I,n_function_values] = quad(’sinc’,-1,1,1e-14)

I = 1.178979744472168

n_function_values = 1177

>> I = pi*I

I = 3.703874103964933

>>

We see that the first 14 digits agree with the result we obtained in Exam-ple 6.14, but with only about 1/4 the number of function evaluations.

6.3.9 Multiple Integrals, Singular Integrals, and Infinite In-tervals

We describe some special considerations in numerical integration in thissection.

6.3.9.1 Multiple Integrals

Consider

∫ b

a

∫ d

c

f(x, y)dydx or

∫ b

a

∫ d

c

∫ s

r

f(x, y, z)dxdydz.

How can we approximate these integrals? One way is with a product formula,in which we apply a one-dimensional quadrature rule in each variable.

Using recursion, it is not hard to write a system of matlab “m” files thatcomputes

∫ bn

an

∫ bn−1

an−1

· · ·∫ b1

a1

f(x1, x2, · · · , xn)dx1dx2 . . . dxn.

Page 256: Undergraduate Text

Numerical Differentiation and Integration 243

for general n using an already-programmed quadrature routine for one dimen-sional integrals. For instance, we may modify the routinecomposite Newton Cotes.m (which we used in Example 6.14 on page 237).We create a function multiquad, which calls our modificationmultiquad composite Newton Cotes, which in turn calls the integration func-tion multiquad func, which in turn calls our top routine multiquad. Thatis, we have:

multiquad→ multiquad composite Newton Cotes→ multiquad func

→ multiquad.

The routines can be as follows.

function [value, n_eval, success] = multiquad...(n, a, b, current_arg, n_eval, f, tol, m)

[value, success, n_eval] = multiquad_composite_Newton_Cotes...(f, a(n), b(n), m, tol, a(1:n), b(1:n), n, current_arg, n_eval);

% (Calls multiquad_func)if (~success)

return

end

function [value, n_eval,success] = multiquad_func...(x, a, b, n, current_arg, n_eval, f, tol, m)current_arg(n)=x;

if (n == 1)value = feval(f,current_arg);

n_eval = n_eval + 1;success = 1;

else[value, n_eval, success] = multiquad...

(n-1, a(1:n-1), b(1:n-1), current_arg, n_eval, f, tol, m);

if (~success)return

endend

Example 6.16

Consider the illustrative exampleIn this case, we know that

I3 =

(∫ 1

0

e−xdx

)3

=(

1− e−1)3 ≈ 0.252580457827647,

so we may check any results we obtain. We program the integrand as

function [y] = multiquad_example(x)y = exp(-(x(1)+x(2)+x(3)));

We now use our recursive routine multiquad:

>> format long

>>[value, n_eval, success] = multiquad(3,a,b,current_arg, 0, ’multiquad_example’, 1e-8,3)value = 0.252580458119098

n_eval = 6180168success = 1

Page 257: Undergraduate Text

244 Applied Numerical Methods

>>[value, n_eval, success] = multiquad(3,a,b,current_arg, 0, ’multiquad_example’, 1e-8,5)

value = 0.252580457923039n_eval = 27000

success = 1>>[value, n_eval, success] = multiquad(3,a,b,current_arg, 0, ’multiquad_example’, 1e-12,5)value = 0.252580457827670

n_eval = 3459640success = 1

>>[value, n_eval, success] = multiquad(3,a,b,current_arg, 0, ’multiquad_example’, 1e-12,7)value = 0.252580457827654

n_eval = 83888success = 1>>

We see that decreasing the tolerance has a much larger effect on the numberof functions with this multidimensional quadrature than if we are computinga single integral. We also see the effect of using higher-order formulas.

Programming multidimensional quadrature with recursion is perhaps theeasiest way to understand quadrature as iterated integrals with the productrule concept. However, this is usually not the most efficient way of program-ming multidimensional quadrature: Programming the recursion explicitly ina single routine leads to a more complicated program, but to a program thatcompletes more quickly. Other increases in efficiency may be obtained withproduct rules based on adaptive routines. A third possibility would be tothink of the product rule as a single rule, to be applied in an adaptive routinethat subdivides the n-dimensional region directly, analogously to the processwe explain for optimization in Chapter ??. For triple integrals, matlab hasthe function triplequad.

Example 6.17

We use triplequad to approximate our integral from Example 6.16:

>> value = triplequad(@(x,y,z) exp(-(x+y+z)),0,1,0,1,0,1,1e-12)

value = 0.252580457827648

>>

The response is still slow, but possibly somewhat faster than our recursiveimplementation with composite Newton–Cotes. See matlab’s help facilityfor additional control over the accuracy of the computation, etc. Also, seematlab’s help facility for anonymous functions for an explanation of thesyntax “@(x,y,z) exp(-(x+y+z)).”

6.3.9.2 Singularities and Infinite Intervals

Consider∫ v

af(x)dx. Suppose that f is Riemann integrable but has a sin-

gularity somewhere on [a, b]. (Alternately, for example, f may be continuousbut f ′ may have a singularity on [a, b], which results in low accuracy of the nu-merical quadrature methods used unless a large number of intervals is taken.)

Page 258: Undergraduate Text

Numerical Differentiation and Integration 245

Example 6.18

An illustrative example is

∫ 1

0

1√x

dx = 2.

We get an error if we apply a standard quadrature routine that tries to eval-uate the integrand at the end points, and we get low accuracy or inefficientcomputation, even if we don’t attempt to evaluate at the end point, if wedon’t take account of the singularity. For example:

>> [value, success] = composite_Newton_Cotes(@(x) 1/sqrt(x),0,1,3,1e-4)2 Inf

4 Infvalue = Inf

success = 1>>

matlab’s adaptive routine quad does somewhat better:

>> [value,n_function_values] = quad(@(x) 1./sqrt(x),0,1,1e-4)

value = 2.001366443256776n_function_values = 134>> [value,n_function_values] = quad(@(x) 1./sqrt(x),0,1,1e-12)

value = 1.999999999792113n_function_values = 2426

>>

An adaptive routine should be able to detect that it cannot cannot evaluateexactly at an end point to have a chance of evaluating a singular integralaccurately and efficiently.

Example 6.19

Suppose we want to evaluate the illustrative integral

∫ ∞

0

e−xdx = 1.

We cannot use quad directly, since the limits must be finite:

>> [value,n_function_values] = quad(@(x) exp(-x),0,Inf,1e-12)value = NaNn_function_values = 13

>>

In fact, the matlab function quadgk supports infinite limits:

>> [value, error_bound] = quadgk(@(x) exp(-x),0,Inf)value = 1

error_bound = 1.644999922012122e-011>>

We outline a few basic techniques for handling singular integrals in ourtext for the second course [1, Section 6.3.5]. See matlab’s “help” facility forvarious quadrature routines available within matlaband its toolboxes.

Page 259: Undergraduate Text

246 Applied Numerical Methods

6.3.10 Interval Bounds

Mathematically rigorous bounds on integrals can be computed, for sim-ple integration, for composite rules, and for adaptive quadrature, if intervalarithmetic is used in the error formulas. As an example, take the two pointGauss–Legendre quadrature rule:

∫ 1

−1

f(x)dx =

{

f

(−1√3

)

+ f

(

1√3

)}

+1

135f (4)(ξ), (6.29)

for some ξ ∈ [−1, 1], where the quadrature formula is obtained from Table 6.1(on page 227) and where the error term is obtained from Theorem 6.1. Now,suppose we want to find guaranteed error bounds on the integral

∫ 1

−1

e0.1xdx.

Then, the fourth derivative of e0.1x is (0.1)4e0.1x, and an interval evaluationof this over x = [−1, 1] gives

(0.1)4e0.1x ∈ [0.9048, 1.1052]× 10−4 for x ∈ [−1, 1],

where the interval enclosure for the range e0.1[−1,1] was obtained using thematlab toolbox intlab [33]. The application of the quadrature rule thusgives

∫ 1

−1

e0.1xdx ∈ e−0.1/√

3 + e0.1/√

3 + [0.9048, 1.1052]× 10−4

⊆ e−0.1/[1.7320,1.7321] + e0.1/[1.7320,1.7321] + [0.9048, 1.1052]× 10−4

⊆ [2.0034, 2.0035],

where the computations were done within intlab. This set of computationsprovides a mathematical proof that the exact integral lies within [2.0034, 2.0035].

The higher order derivatives required in the quadrature formulas can bebounded over intervals using a combination of automatic differentiation (ex-plained in §6.2, starting on page 215, of this book) and interval arithmetic.

The mathematically rigorous error bounds obtained by this technique canbe used in an adaptive quadrature technique, and the resulting routine cangive mathematically rigorous bounds, provided Ic and sum are computed withinterval arithmetic and the error bounds are added to each Ic when it is addedto sum. Such a routine is described in [10], although an updated package wasnot widely available at the time this book was written.

Page 260: Undergraduate Text

Numerical Differentiation and Integration 247

6.4 Applications

Consider the following ordinary differential equation for population growth

dx

dt= (a− bx)x = ax− bx2, (6.30)

where a and b are positive constants. Here, x(t) is the population density attime t, a is the birth rate and bx(t) is the density-dependent death rate. Thus,(a−bx(t)) is the density-dependent growth rate of the population. (This typeof equation is also well-known as the logistic equation.) The solution of alogistic equation can be derived by separating variables, and is

x(t) =a

Ce−at + b,

where C is an arbitrary constant. It follows that limt→∞ x(t) = a/b.

For illustration, let us approximate the differential equation by a differenceequation. The derivative dx/dt can be approximated by a difference quotient,

dx

dt≈ x(t + h)− x(t)

h.

This leads tox(t + h)− x(t)

h= ax(t)− bx2(t).

After simplification, we obtain the following difference equation

x(t + h) = (1 + ah)x(t)− bhx2(t). (6.31)

We will see that equation (6.31) has different dynamical behavior dependingon the magnitude of ah. To visualize the dynamics of equation (6.31), we runthe following matlab code:

clear all

h=0.01;

b=0.2;a=25; %Change a=200,250,295

y(1)=25;T=0.5;t=0:h:T;

K=a/b

for j=1:length(t)-1x(j+1)=(1+a*h)*x(j)-b*h*x(j)^2;

endplot(t,x,’k o-’,’LineWidth’,2)

Page 261: Undergraduate Text

248 Applied Numerical Methods

0 0.1 0.2 0.3 0.4 0.520

40

60

80

100

120

140

t

x(t+

h)

a=25

0 0.1 0.2 0.3 0.4 0.50

200

400

600

800

1000

1200

t

x(t+

h)

a = 200

0 0.1 0.2 0.3 0.4 0.50

200

400

600

800

1000

1200

1400

1600

t

x(t+

h)

a = 250

0 0.1 0.2 0.3 0.4 0.50

200

400

600

800

1000

1200

1400

1600

1800

2000

t

x(t+

h)

a = 295

Notice that, in contrast to (6.30), the dynamics of (6.31) changes as theparameter a varies. Consequently, we seek another difference equation toapproximate the differential equation so that it has same dynamics as (6.30).Since x(t) is continuous on its domain x(t + h) is a close approximation tox(t) for h > 0 small. Thus,

(1 + ah)x(t)− bhx2(t) ≈ (1 + ah)x(t) − bhx(t + h)x(t).

Rearranging this gives

x(t + h)− x(t)

h= (1 + ah)x(t)− bhx(t + h)x(t).

Solving the above equation for x(t + h) gives the difference equation

x(t + h) =(1 + ah)x(t)

1 + bhx(t). (6.32)

The following figure shows that with different values of a, the populationmodeled by 6.32 has the same outcome (all converge to a/b).

Page 262: Undergraduate Text

Numerical Differentiation and Integration 249

0 0.1 0.2 0.3 0.4 0.520

40

60

80

100

120

140

t

x(t+

h)

a = 25

0 0.1 0.2 0.3 0.4 0.50

100

200

300

400

500

600

700

800

900

1000

t

x(t+

h)

a = 200

0 0.1 0.2 0.3 0.4 0.50

200

400

600

800

1000

1200

1400

t

x(t+

h)

a = 250

0 0.1 0.2 0.3 0.4 0.50

500

1000

1500

t

x(t+

h)

a = 295

Finally, we point out that the exact difference equation version of logis-tic growth can be obtained by separating variables and integrating equation(6.30), giving

x(t + h) =aeahx(t)

a + b(eah − 1)x(t). (6.33)

6.5 Exercises

1. Fill in the details in the derivation of Simpson’s rule. (See page 223.)

2. Derive the trapezoidal rule (the 2-point closed Newton–Cotes rule)

∫ 1

−1

f(x)dx = w0f(−1) + w1f(1) + E(f),

where E(f) is the error term. (That is, find w0, w1, and E(f).)

3. Use the transformations as in Section 6.3.6 to transform the trapezoidalrule and corresponding error term to the interval [xi, xi + h].

4. Fill in the details of the computations in Example 6.9.

Page 263: Undergraduate Text

250 Applied Numerical Methods

5. Derive a 2-point Gauss formula that integrates

∫ π

−π

f(x) sin(x)dx

exactly when f is a polynomial of degree 3 or less.

6. Use Theorem 6.1 (on page 231)to derive the error in the 2-point Gauss–Laguerre quadrature formula. (See Example 6.8 on page 228.)

7. Use Theorem 6.1 (on page 231)to derive the error in the 2-point Gauss-Hermite quadrature formula. (See Example 6.9 on page 229.)

8. Carry out the details of the computation to derive (6.29).

9. Assume that we have a finite-difference approximation method wherethe roundoff error is O(ǫ/h) and the truncation error is O(hn). Usingthe error bounding technique exemplified in (6.5) on page 212, show thatthe optimal h is O(ǫ1/(n+1)) and the minimum achievable error boundis O(ǫn/(n+1)).

10. Fill in the details of the computations for Example 6.4.

11. Solve the system (6.14) (on page 220) and compare your result to thecorresponding directional derivative of f computed by taking the gradi-ent of f and taking the dot product with the direction.

12. Consider quadrature formulas of the form

∫ 1

0

f(x) [x ln(1/x)] dx = a0f(0) + a1f(1).

(a) Find a0 and a1 such that the formula is exact for linear polynomials.

(b) Describe how the above formula, for h > 0, can be used to approx-

imate

∫ h

0

g(t) t ln(h/t) dt.

13. Suppose that I(h) is an approximation to

∫ b

a

f(x) dx, where h is the

width of a uniform subdivision of [a, b]. Suppose that the error satisfies

I(h)−∫ b

a

f(x) dx = c1h + c2h2 +O(h3),

where c1 and c2 are constants independent of h. Let I(h), I(h/2), andI(h/3) be calculated for a given value of h. Use the values I(h), I(h/2)

and I(h/3) to find an O(h3) approximation to

∫ b

a

f(x) dx.

Page 264: Undergraduate Text

Numerical Differentiation and Integration 251

14. Compute an accurate approximation to the following integral:

I =

∫ 1

−∞

1√2π

e−x2/2 dx .

15. Find the nodes xi and the corresponding weights Ai, i = 0, 1, 2, so theformula

∫ 1

−1

1√1− x2

f(x) dx ≈2∑

i=0

Aif(xi)

is exact when f(x) is any polynomial of degree 5. Compare your solutionwith the roots of the Chebyshev polynomial of the first kind T3, givenby T3(x) = cos(3 cos−1(x)).

16. Suppose that a particular composite quadrature rule is used to approx-

imate

∫ 2

0

ex2

dx. The following values are obtained for N = 8, 16, and

32 intervals, respectively: 16.50606, 16.45436, and 16.45347. Using onlythese values, estimate the power of h to which the error is proportional,where h = 2

N .

17. A two dimensional Gaussian quadrature formula has the form

∫ 1

−1

∫ 1

−1

f(x, y) dx dy = f(α, α) + f(−α, α) + f(α,−α) + f(−α,−α)

+ E(f).

Find the value of α such that the formula is exact (i.e. E(f) = 0) forevery polynomial f(x, y) of degree less than or equal to 2 in 2 variables

i.e., f(x, y) =

2∑

i,j=0

aijxiyj .

18. (A significant programming project) Add a function

[a k, b k] = abfunc(k, x)

to the system consisting of multiquad,multiquad composite Newton Cotes and multiquad func from Sec-tion 6.3.9.1, so the resulting system computes

∫ bn

an

∫ bn−1(xn)

an−1(xn)

· · ·∫ b1(x2,...,xn)

a1(x2,...,xn)

f(x1, x2, · · · , xn)dx1dx2 . . . dxn.

The function name abfunc should be passed as an argument tomultiquad (and hence also to multiquad composite Newton Cotes and

Page 265: Undergraduate Text

252 Applied Numerical Methods

multiquad func), and evaluated in multiquad composite Newton Cotes.Its evaluation

[a k, b k] = feval(abfunc, k, current arg)

returns the appropriate value of the lower bound ak and upper boundbk. (Note that xi, k + 1 ≤ i ≤ n has already been stored incurrent arg(n-i+1), that the value of n inmultiquad composite Newton Cotes is in the array n array, and thevalue of k appropriate for the actual call is n.)

Test your modified routine by computing an approximation to

∫ 1

0

∫ 1−x3

0

∫ 1−x3−x2

0

x1 + x2 + x3dx1dx2dx3 =1

8= 0.125.

Page 266: Undergraduate Text

Chapter 7

Initial Value Problems for OrdinaryDifferential Equations

7.1 Introduction

In this chapter, we study solution of initial-value problems (IVP) for systemsof differential equations. We can write such initial value problems as findingsolutions to

{

y′(t) = f(t, y(t)), a ≤ t ≤ b,y(a) = y0,

(7.1)

where f is a given function and y0 is given. To introduce the solution con-cepts, we first consider y, y0, and f to represent real-values and real-valuedfunctions, then later show that our techniques hold when y, y0, and f arevectors with n components. In fact, we will see that arbitrary systems ofnonlinear differential equations can be transformed into the form (7.1), andthe form (7.1) is the form in which software for finding approximate solutionsto initial value problems solves such systems.

Generally, the approximate numerical solution that software delivers is in the form of a table of values tk of the independent variable and corresponding values yk of the dependent variable (that is, of the function that solves the differential equation). We think of tk+1 = tk + hk, where hk is the k-th step length, and yk is our numerical approximation to y(tk).

An important consideration when finding approximate numerical solutions to an initial value problem is whether such a solution exists and whether it is unique. Roughly, the solution exists and is unique if f is continuous and satisfies a Lipschitz condition in y (see Definition 2.2 on page 48). We give examples and some theory in our second-level text [1].

We now consider a prototypical method; although more efficient methods are usually used in practice, this method illustrates the basic ideas of numerical solution of initial value problems in practice.


7.2 Euler’s method

The simplest method we consider is Euler's method. One can view Euler's method as a kind of "successive tangent line approximation" method, or repeated approximation of y by degree-1 Taylor polynomials. In particular,

    y(tk+1) = y(tk) + hk y′(tk) + O(hk²)
            = y(tk) + hk f(tk, y(tk)) + O(hk²)
            = y(tk) + hk Φ(tk, y(tk)) + O(hk²),                         (7.2)

where Φ(t, y) is the iteration function for the step, which for Euler's method is Φ(t, y) = f(t, y). If we replace y(tk) by our approximation yk, we obtain Euler's method:

yk+1 = yk + hkf(tk, yk). (7.3)

Example 7.1

For illustration, suppose a = 0, b = 1, f(t, y) = t, h = 0.25, and y(0) = 0.
We see immediately that the solution is

    y(t) = ∫_0^t s ds = t²/2.

    y1 = y0 + 0.25 t0 = 0 + 0.25(0) = 0,
    y2 = y1 + 0.25 t1 = 0 + 0.25(0.25) = 0.0625,
    y3 = 0.0625 + 0.25(0.5) = 0.1875,
    y4 = 0.1875 + 0.25(0.75) = 0.375.

We see that, when f does not depend on the unknown function y, the initial
value problem reduces to finding the values of an indefinite integral, and
Euler's method reduces to approximating an integral as

    ∫_{tk}^{tk+hk} f(s) ds ≈ hk f(tk),

that is, we approximate the integral over [tk, tk + hk] by the area of the
rectangle of width hk and height f(tk), that is, as in a left Riemann sum.
Indeed, methods for finding approximate solutions to initial value problems
resemble methods for finding integrals in various ways, and we speak of
"integrating" the differential equation.

Example 7.2

Consider

    y′(t) = t + y,
    y(1) = 2.


(The exact solution is y(t) = −t − 1 + 4e^{t−1}.) Assuming a constant step size hk = h, Euler's method has the form

    yk+1 = yk + h f(tk, yk) = yk + h(tk + yk),
    y0 = 2,  t0 = 1.

Applying Euler’s method, we obtain

    h = 0.1:
    k   tk    yk        y(tk)
    0   1     2         2
    1   1.1   2.3       2.32068
    2   1.2   2.64      2.68561
    3   1.3   3.024     3.09944
    4   1.4   3.4564    3.56730
    5   1.5   3.94304   4.09489

    h = 0.05:
    k   tk    yk        y(tk)
    0   1     2         2
    1   1.05  2.1500    2.15508
    2   1.1   2.3100    2.32068

The error for h = 0.1 at t = 1.1 is about 0.02 and the error for h = 0.05 at t = 1.1 is about 0.01. If h is cut in half, the error is cut in half, suggesting that the error is proportional to h. This seems to be consistent with the truncation error in one step being O(h²), since the total number of steps is proportional to 1/h. (However, this is not entirely obvious.)
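The computations in this example are easy to reproduce; the following short matlab sketch (ours, not from the text) carries out the five Euler steps for h = 0.1:

f = @(t,y) t + y;          % right-hand side of the ODE
t = 1; y = 2; h = 0.1;     % initial condition and step size
for k = 1:5
    y = y + h*f(t,y);      % Euler step (7.3)
    t = t + h;
    fprintf('%4.2f  %8.5f\n', t, y);
end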

See our text [1] (or other sources) for a convergence analysis of Euler's method. In fact, the "global" error (that is, when Euler's method is applied for N = (b − a)/h steps to find an approximation to y(b)) can be shown to be O(h), as in Example 7.2. Furthermore, it can be shown that, for small h with floating point arithmetic, the rounding error is proportional to 1/h. Thus, just as in our analysis of total error in the forward difference approximation to the derivative (see Section 6.1.2, starting on page 212), the minimum achievable error in Euler's method, regardless of how small we make the step sizes hk, is proportional to the square root of the accuracy to which we can evaluate f. For this reason, as well as for reasons of efficiency1, higher-order methods are often used.

DEFINITION 7.1 A method of the form

yk+1 = yk + hΦ(f, tk, yk),

that is, as in (7.2), in which yk+1 depends only on tk and yk and not on previous values such as yk−1, is termed a single-step method. If

    |y(tk+1) − y(tk) − hΦ(f, tk, y(tk))| = O(h^{ω+1}),

1that is, having the computations complete in a practical amount of time


where y(tk) is the exact solution to the initial value problem at tk, we say that the method has order ω.

For example, Euler's method has order 1.

Do not confuse the order of the method for finding approximate solutions to the differential equation with the order of the differential equation itself. Before we examine higher-order methods, we pause to look at how we handle higher-order differential equations.

7.3 Higher-Order and Systems of Differential Equations

Traditionally2, IVP's for higher-order differential equations are not considered separately from first-order equations. By a change of variables, higher-order problems can be reduced to a system of the form of (7.1). For example, consider the scalar IVP for the m-th-order scalar differential equation:

    y^(m)(t) = g(t, y^(m−1)(t), y^(m−2)(t), · · ·, y″(t), y′(t), y(t)),  a ≤ t ≤ b,
    y(a) = u0,  y′(a) = u1,  · · ·,  y^(m−1)(a) = u_{m−1}.              (7.4)

We can reduce this high-order IVP to a first-order system of the form (7.1)
by defining x : [a, b] → R^m componentwise by

    x(t) = [x1(t), x2(t), · · ·, xm(t)]^T = [y(t), y′(t), y″(t), · · ·, y^(m−1)(t)]^T.

Then,

    x′1(t) = x2(t),
    x′2(t) = x3(t),
    x′3(t) = x4(t),
      ...
    x′_{m−1}(t) = xm(t),
    x′m(t) = g(t, xm, x_{m−1}, · · ·, x2, x1),

and x(a) = [u0, u1, · · ·, u_{m−1}]^T.                                  (7.5)

That is, in this case f(t, x) is defined by:

    fi(t, x) = x_{i+1},  1 ≤ i ≤ m − 1,
    fm(t, x) = g(t, xm, x_{m−1}, · · ·, x2, x1).

2Recently, there has been some discussion concerning efficient methods that do consider higher-order problems separately.


Example 7.3

Consider

    y″(t) = y′(t) cos(y(t)) + e^{−t},
    y(0) = 1,
    y′(0) = 2.

Let x1 = y and x2 = y′. Then,

    x′1(t) = x2(t),
    x′2(t) = x2(t) cos(x1(t)) + e^{−t},
    x1(0) = 1,  x2(0) = 2,

which can be represented in vector form as

    dx/dt = f(t, x) = ( x2(t) ; x2(t) cos(x1(t)) + e^{−t} ),   x(0) = ( 1 ; 2 ).

Now, we may interpret y and f in (7.3) as vectors, and apply a few steps of Euler's method, with h = 0.1 and the help of matlab. We use the function

function [f] = ode_sys_example(t,x)

f = zeros(2,1);

f(1) = x(2);

f(2) = x(2) * cos(x(1)) + exp(-t);

(Here, we initialize the array f, since, otherwise, matlab forms a row vector by default.) We compute an approximation to the solution at t = 1.4 with the following matlab dialog (note that, in this illustration, the integration starts at t = 1).

>> x = [1;2]
x =
     1
     2
>> t = 1
t = 1
>> h = 0.1
h = 0.100000000000000
>> x = x + h*ode_sys_example(t,x)
x =
   1.200000000000000
   2.144848405290772
>> t = t + h
t = 1.100000000000000
>> x = x + h*ode_sys_example(t,x)
x =
   1.414484840529077
   2.255855758843984
>> t = t + h
t = 1.200000000000000
>> x = x + h*ode_sys_example(t,x)
x =
   1.640070416413476
   2.321093379172201
>> t = t + h
t = 1.300000000000000
>> x = x + h*ode_sys_example(t,x)
x =
   1.872179754330696
   2.332280252695239
>>

Of course, many mathematical models begin as systems of differential equations, rather than as a single differential equation involving higher-order derivatives. The same techniques apply for such systems.

Example 7.4

Partial differential equations can be written as systems of ordinary differential equations by discretizing the derivatives with respect to all but one of the independent variables. When the resulting system of ordinary differential equations is then solved, this is called the method of lines. For example, processes of diffusion (say of chemicals or fluids through media) may be modeled by the differential equation

    ∂u/∂t = D Δu,                                                       (7.6)

where D is related to the medium in which diffusion is taking place, and
where

    Δu = ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z²

is the Laplacian operator. If we are looking at diffusion in a single spatial dimension, such as the distribution of temperature along a rod (or, say, vertical diffusion of a fluid that is assumed to be uniform in the horizontal dimensions), then Equation (7.6), also known as the heat equation, becomes

    ∂u/∂t = D ∂²u/∂x²,                                                  (7.7)

where u = u(x, t). (D may in general depend on x and t, but we will assumefor simplicity here that it is constant.) Proceeding as in Example 3.18 onpage 93, we use

    ∂²u/∂x² ≈ [u(x + h, t) − 2u(x, t) + u(x − h, t)] / h².              (7.8)

For example, suppose we have a rod that is initially at temperature 0, and, starting at time t = 0, the end at x = 0 is immersed in ice, so u(0, t) = 0 for t ≥ 0, while the end at x = 1 is heated so that u(1, t) = t. Suppose, as in Example 3.18, we subdivide 0 ≤ x ≤ 1 into 4 subintervals, having u1(t) correspond to u(1/4, t), u2(t) correspond to u(1/2, t), and u3(t) correspond to u(3/4, t). In this way, we replace the boundary value problem


consisting of (7.7) and u(0, t) = 0, u(1, t) = t, u(x, 0) = 0 by the system of ordinary differential equations

    u′1 = 16u2 − 32u1,
    u′2 = 16u3 − 32u2 + 16u1,
    u′3 = 16t − 32u3 + 16u2,

with u1(0) = u2(0) = u3(0) = 0.

Such discretizations may be solved with software for initial value problems for ordinary differential equations. However, with such finite difference discretizations in space, and, indeed, with various other discretizations in space, the resulting system of ordinary differential equations is usually stiff, in the sense we describe in Section 7.9 on page 273 to follow. This is especially so if h is small. Thus, generally, software for stiff systems should be used with the method of lines.
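For instance, the three-equation system above can be handed to matlab's stiff solver ode15s as follows (a minimal sketch of ours; the function name heat_mol is not from the text):

function [f] = heat_mol(t,u)
% Right-hand side of the three-equation method-of-lines system above.
f = zeros(3,1);
f(1) = 16*u(2) - 32*u(1);
f(2) = 16*u(3) - 32*u(2) + 16*u(1);
f(3) = 16*t - 32*u(3) + 16*u(2);

>> [T,U] = ode15s('heat_mol',[0,1],[0;0;0]);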

We now study various higher-order methods, that is, methods for which the order ω as in Definition 7.1 is greater than 1.

7.4 Higher-Order Taylor Series Methods

If y(t), the solution of (7.1), is sufficiently smooth3, we see that

    y(tk+1) = y(tk) + h y′(tk) + (h²/2) y″(tk) + · · · + (h^p/p!) y^(p)(tk) + O(h^{p+1}),   (7.9)

where, using (7.1), these derivatives can be computed explicitly with the multivariate chain rule of the usual calculus. Thus, (7.9) leads to the following numerical scheme:

    y0 = y(a),
    yk+1 = yk + h f(tk, yk) + (h²/2)(d/dt) f(tk, yk) + · · · + (h^p/p!)(d^{p−1}/dt^{p−1}) f(tk, yk),   (7.10)

for k = 0, 1, 2, · · ·, N − 1. This is called a Taylor series method. (Note that Euler's method is a Taylor series method of order p = 1.)

3that is, if the solution y(t) contains enough continuous derivatives for Taylor's theorem (on page 3) to hold


In the past, these methods were seldom used in practice since they required evaluations of high-order derivatives. However, with efficient implementations of automatic differentiation,4 these methods are increasingly being used to solve important real-world problems. For example, very high-order Taylor methods (of order 30 or higher) are used, with the aid of automatic differentiation, in the "COSY Infinity" package, which is used world-wide to model atomic particle accelerator beams. (See, for example, [8].)

By construction, the order of the method (7.10) is ω = p. In weighing the practicality of this method, one should consider the structure of the problem itself, along with the ease (or lack thereof) of computing the derivatives. For example, with n = 1, we must compute

    (d/dt) f(t, y) = ∂f/∂t + f ∂f/∂y,

    (d²/dt²) f(t, y) = ∂²f/∂t² + 2f ∂²f/∂t∂y + f² ∂²f/∂y²
                       + (∂f/∂t)(∂f/∂y) + f (∂f/∂y)²,

etc.

If f is mildly complicated, then it is impractical to compute these formulas by hand5; also, observe that, for n > 1, the number of terms can become large, although many may be zero; thus, an implementation of automatic differentiation should take advantage of the structure in f.

Example 7.5

Consider

    y′(t) = f(t, y) = t + y,
    y(1) = 2,

which has exact solution y(t) = −t − 1 + 4e^{t−1}. The Taylor series method of order 2 for this example has

    f(t, y) = t + y

and

    (d/dt) f(t, y) = ∂f/∂t + f ∂f/∂y = 1 + t + y.

Therefore,

    yk+1 = yk + h f(tk, yk) + (h²/2)(d/dt) f(tk, yk)
         = yk + h(tk + yk) + (h²/2)(1 + tk + yk).

Letting h = 0.1, we obtain the following results:

4These implementations can be very sophisticated.
5but this does not rule out automatic differentiation


    k   tk    yk (Euler)   yk (T.S. order 2)   y(tk) (Exact)
    0   1     2            2                   2
    1   1.1   2.3          2.32                2.3207
    2   1.2   2.64         2.6841              2.6856
    3   1.3   3.024        3.0969              3.0994
    4   1.4   3.4564       3.5636              3.5673
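The order-2 Taylor series column of this table can be reproduced with a few matlab lines (a sketch of ours, not from the text):

f  = @(t,y) t + y;             % f(t,y)
df = @(t,y) 1 + t + y;         % (d/dt) f(t,y), computed above
t = 1; y = 2; h = 0.1;
for k = 1:4
    y = y + h*f(t,y) + (h^2/2)*df(t,y);   % order-2 Taylor step
    t = t + h;
    fprintf('%4.2f  %8.4f\n', t, y);
end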

7.5 Runge–Kutta Methods

A classic form taken by higher-order methods that do not explicitly require derivatives is that of Runge–Kutta methods. We now show how Runge–Kutta methods are derived by deriving a simple one. For simplicity, we derive it for a scalar differential equation, although Runge–Kutta methods can easily be applied to systems.

If y(t) is the exact solution of (7.1), then

    y(tk+1) − y(tk) = ∫_{tk}^{tk+1} f(t, y(t)) dt,  0 ≤ k ≤ N − 1.      (7.11)

Approximating the integral on the right side by the midpoint rule, we obtain

    ∫_{tk}^{tk+1} f(t, y(t)) dt ≈ h f(tk + h/2, y(tk + h/2)).           (7.12)

Now, by Taylor's Theorem,

    y(tk + h/2) ≈ y(tk) + (h/2) y′(tk) = y(tk) + (h/2) f(tk, y(tk)).    (7.13)

By (7.11), (7.12), and (7.13), it is seen that y(t) approximately satisfies

    y(tk+1) ≈ y(tk) + h f(tk + h/2, K1),  0 ≤ k ≤ N − 1,
    with K1 = y(tk) + (h/2) f(tk, y(tk)),                               (7.14)

which suggests the following numerical method, known as the midpoint method, for solution of (7.1). We seek yk, 0 ≤ k ≤ N, such that

    y0 = y(t0),
    yj+1 = yj + h f(tj + h/2, K1,j),  j = 0, 1, 2, · · ·, N − 1,
    K1,j = yj + (h/2) f(tj, yj).                                        (7.15)


We can write (7.15) in the form:

    y0 = y(t0),
    yj+1 = yj + hΦ(tj, yj, h),                                          (7.16)

where

    Φ(tj, yj, h) = f(tj + h/2, yj + (h/2) f(tj, yj)).

It can be shown that, when f(t, y) does not depend on y, a step of the
midpoint method reduces to the midpoint rule, that is, the degree-0
Gauss–Legendre quadrature formula:

    y(tk+1) = y(tk) + ∫_{tk}^{tk+h} f(s) ds = y(tk) + h f(tk + h/2) + (h³/24) f″(ξ).

(See Table 6.1, Table 6.3, and Section 6.3.6.) Indeed, the midpoint method
has order ω = 2. We present a proof in our second course [1].

In general, Runge–Kutta methods have the form

    y0 = y(a),
    yk+1 = yk + hΦ(tk, yk, h),                                          (7.17)

where

    Φ(t, y, h) = Σ_{r=1}^R cr Kr,
    K1 = f(t, y),
    Kr = f(t + ar h, y + h Σ_{s=1}^{r−1} brs Ks),

and

    ar = Σ_{s=1}^{r−1} brs,  r = 2, 3, · · ·, R.

Such a method is called an R-stage Runge–Kutta method. Notice that Euler's
method is a one-stage Runge–Kutta method and the midpoint method is a
two-stage Runge–Kutta method with c1 = 0, c2 = 1, a2 = 1/2, b21 = 1/2, i.e.,

    yk+1 = yk + h f(tk + h/2, yk + (h/2) f(tk, yk)).

The coefficients ar, brs, and cr can be derived by matching terms in the Taylor expansion. In general, for a particular number of stages and a particular order, the coefficients ar, brs, and cr are not unique, that is, there are in general various R-stage methods of a given order. We discuss these issues in [1].


The most well-known Runge–Kutta scheme is 4-th order; it has the form:

    y0 = y(t0),
    yk+1 = yk + (h/6)[K1 + 2K2 + 2K3 + K4],
    K1 = f(tk, yk),
    K2 = f(tk + h/2, yk + (h/2)K1),
    K3 = f(tk + h/2, yk + (h/2)K2),
    K4 = f(tk + h, yk + hK3),                                           (7.18)

i.e.,

    Φ(tk, yk, h) = (1/6)[K1 + 2K2 + 2K3 + K4].

Notice that in single-step methods, yk+1 = yk + hΦ(tk, yk, h), hΦ(tk, yk, h) is an approximation to the "rise" in y in going from tk to tk + h. In the fourth-order Runge–Kutta method, Φ(tk, yk, h) is a weighted average of approximate "slopes" K1, K2, K3, K4 evaluated at tk, tk + h/2, tk + h/2, and tk + h, respectively.

Example 7.6

Consider y′(t) = t + y, y(1) = 2, with h = 0.1. We obtain

    k   tk    Euler   Runge–Kutta order 2   Runge–Kutta   y(tk) (exact)
                      (Modified Euler)      order 4
    0   1     2       2                     2             2
    1   1.1   2.30    2.32                  2.32068       2.32068
    2   1.2   2.64    2.6841                2.68561       2.68561
    3   1.3   3.024   3.09693               3.09943       3.09944
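As an illustration of (7.18), the fourth-order column of this table can be generated with the following matlab sketch (ours, not from the text):

f = @(t,y) t + y;
t = 1; y = 2; h = 0.1;
for k = 1:3
    K1 = f(t, y);
    K2 = f(t + h/2, y + (h/2)*K1);
    K3 = f(t + h/2, y + (h/2)*K2);
    K4 = f(t + h, y + h*K3);
    y = y + (h/6)*(K1 + 2*K2 + 2*K3 + K4);   % one Runge-Kutta step (7.18)
    t = t + h;
end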

Higher-order Runge–Kutta methods are sometimes used, such as in the adaptive step control schemes we describe in Section 7.7.

7.6 Stability

In a method for integrating an IVP, it is important to know how small errors that have accumulated in the value yk ≈ y(tk) propagate to subsequent approximations yℓ ≈ y(tℓ), ℓ > k.


DEFINITION 7.2 Assume we take a constant step size h = (b − a)/N to compute
yk, 1 ≤ k ≤ N, with yN ≈ y(b), in a single-step method for integrating an
initial value problem. We say the method is numerically stable if there is a
constant c independent of h such that

    ‖yN − zN‖ ≤ c‖yk − zk‖  for all k ≤ N.                              (7.19)

Under certain continuity and Lipschitz conditions (see [1]), Runge–Kutta methods are stable in the sense that (7.19) is satisfied. This implies that an error ‖zk − yk‖ will not be magnified by more than a constant c at final time tN, i.e., "small errors" have "small effect."

The above definition of stability is not satisfactory if the constant c is
very large. Consider, for example, Euler's method applied to the scalar
equation y′ = λy, λ = constant. Then Euler's scheme gives
yj+1 = yj(1 + λh), 0 ≤ j ≤ N − 1. An error, say at t = tk, will cause us to
compute zj+1 = zj(1 + λh) instead, and hence
|zj+1 − yj+1| = |1 + λh| |zj − yj|, k ≤ j ≤ N − 1. Thus, the error will be
magnified if |1 + λh| > 1, will remain the same if |1 + λh| = 1, and will be
suppressed if |1 + λh| < 1. Consider the problem

    y′ = −1000y + 1000t² + 2t,  0 ≤ t ≤ 1,  y(0) = 0,

whose exact solution is y(t) = t², 0 ≤ t ≤ 1. We find for Euler's method that |zj+1 − yj+1| = |1 − 1000h| |zj − yj|, 0 ≤ j ≤ N − 1. The error will be suppressed if |1 − 1000h| < 1, i.e., 0 < h < 0.002. Consider the following table:

    h         N     yN
    1         1     0
    0.1       10    9 × 10^16
    0.01      10²   overflow
    0.001     10³   0.99999900
    0.0001    10⁴   0.99999990
    0.00001   10⁵   0.99999999

For h > .002, small errors are violently magnified. For example, for h = .01, the errors are magnified by |1 − 1000(.01)| = 9 at each time step, even though there exists a c as in (7.19).

This motivates a second concept of stability that will be important when we discuss stiff systems.

DEFINITION 7.3 A numerical method for solution of (7.1) is called absolutely stable if, when applied to the scalar equation y′ = λy, t ≥ 0, it yields values {yj}j≥0 with the property that yj → 0 as j → ∞. The set of values λh for which a method is absolutely stable is called the set of absolute stability.


Example 7.7

(Absolute stability of Euler’s method and the midpoint method)

1. Euler's Method applied to y′ = λy yields yj+1 = yj(1 + λh), whence

       yj = y0(1 + λh)^j.

   Clearly, assuming that λ is real, yj → 0 as j → ∞ if and only if
   |1 + λh| < 1, or −2 < λh < 0. Hence, the interval of absolute stability
   of Euler's method is (−2, 0).

   Generally, however, when we analyze stability for systems of differential
   equations, we need to consider the possibility of complex λh. (We will
   see why in Section 7.9.2.) In such a context, we seek a region in the
   complex plane for which |1 + λh| < 1. If λh = x + yi, where i is the
   imaginary unit, we have

       |1 + λh|² = |1 + x + yi|² = (1 + x)² + y² < 1.

   This describes a circle of radius 1 centered at −1 + 0i. [Figure: the
   circle |1 + λh| = 1 in the complex λh-plane; the interior of this circle
   is the region of stability of Euler's Method.]

2. The midpoint method applied to y′ = λy yields

       yj+1 = yj + hλ(yj + (h/2)λyj) = yj (1 + λh + λ²h²/2).

   Hence, yj → 0 as j → ∞ if |1 + λh + λ²h²/2| < 1, which for λ real leads
   to an interval of absolute stability (−2, 0).

   When we consider λ to be complex, we obtain

       |1 + λh + λ²h²/2|² = (1 + x + (x² − y²)/2)² + (y + xy)² < 1.


Using a computer algebra system to find the boundary curves of this region of stability, then plotting them with matlab, we find the region of stability to be the interior of the following oval.

[Figure: the oval region of absolute stability of the midpoint method in the λh-plane, extending over roughly −2 < Re(λh) < 0.]
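As an alternative to the computer-algebra approach, the boundary curve can be sketched numerically; the following matlab fragment (ours, not from the text) draws the level curve |1 + z + z²/2| = 1:

[x, y] = meshgrid(-3:0.01:1, -1.5:0.01:1.5);
z = x + 1i*y;                    % trial values of lambda*h
R = abs(1 + z + z.^2/2);         % growth factor of the midpoint method
contour(x, y, R, [1 1], 'k')     % the boundary |R(z)| = 1
axis equal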

In general, explicit methods (such as the Taylor series methods or Runge–Kutta methods we have considered so far) require significantly smaller h for accurate approximations of problems with large |λ|. (Notice that the linear case with λ models the nonlinear case y′ = f(t, y) with Lipschitz constant L ≈ |λ|.) These methods should not be used for such problems or their system analogs (stiff systems). We will later consider methods suitable for such problems.

7.7 Adaptive Step Controls

It often occurs that the curvature of solutions y(t) to initial value problems varies considerably as t changes. To achieve a given accuracy, software can take large steps hk in regions of small curvature, but must take smaller steps in regions of large curvature. Not only does the user of such software usually not know beforehand which step size h will give the required accuracy, but choosing a fixed step size h that is small enough for the intervals where the curvature (that is, y″) is large will result in many more steps than necessary over intervals in which the curvature is small, and hence lead to inefficiency.

Analogously to composite formulas for numerical integration, we have the global error yN − y(tN), where t0 = a and tN = b, and the local error y1 − y(t1) incurred by taking a single step of the method. (The analogous concepts for numerical integration are the error over the entire interval and the error for a single application of the quadrature formula.) As in adaptive quadrature


routines, we focus on computing estimates for the error. The most common technique for deriving such estimates is to assume we cannot evaluate an exact range for the error term, but that we know the order ω of the method as in Definition 7.1 (on page 255). To take a step from yk to yk+1, we assume that yk is exact6, and we have two methods, one of order ω and one of order ω + 1, giving yk+1,1,h and yk+1,2,h with

    yk+1,1,h − y(tk+1) ≈ C1 h^{ω+1},
    yk+1,2,h − y(tk+1) ≈ C2 h^{ω+2},

whence

    yk+1,1,h − yk+1,2,h ≈ C1 h^{ω+1} − C2 h^{ω+2}                       (7.20)
                        ≈ C1 h^{ω+1}
                        ≈ yk+1,1,h − y(tk+1).

An error tolerance ε is specified; h is decreased if the error estimated through (7.20) is too large, and h is increased if the error estimated through (7.20) is sufficiently small.
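For instance, a common step-size update rule (a sketch of typical practice, not necessarily the exact rule used in any particular code) chooses the new step so that the predicted error meets the tolerance:

omega  = 4;                  % order of the lower-order formula
epstol = 1e-6;               % user-specified error tolerance
h      = 0.1;                % current step size
err    = 3e-7;               % stand-in value for |y_{k+1,1,h} - y_{k+1,2,h}|
h_new  = 0.9*h*(epstol/err)^(1/(omega+1));   % 0.9 is a typical safety factor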

Several schemes have been developed in which combining the same values of f(t, y) in two different ways results in Runge–Kutta methods of orders ω and ω + 1. A classic example of this is the routine RKF45 in [15, page 129 ff]. The method used there, combining six evaluations of f per step to obtain both a fourth-order and a fifth-order Runge–Kutta method, is called the Runge–Kutta–Fehlberg method. We derive it in detail, as well as present an algorithm, in [1]. The Runge–Kutta–Fehlberg method is available in matlab with the function ode45.

Example 7.8

A classic illustrative problem in population dynamics is the predator-prey equations. Suppose we have a prey animal (say, rabbits) and a predator (say, foxes). The population x1(t) of rabbits depends on the number of rabbits present, and also on the number x2(t) of foxes. Similarly, the number of foxes present depends on how much food they have (that is, on the number of rabbits present) and on the number of foxes present. The classic model is

x′1(t) = αx1 − βx1x2,

x′2(t) = −γx2 + δx1x2.

To use ode45, we program the following.

function [f] = predator_prey(t,y)
global alpha
global beta
global gamma
global delta
f = zeros(2,1);
f(1) = alpha * y(1) - beta*y(1)*y(2);
f(2) = -gamma * y(2) + delta*y(1)*y(2);

6That is, for the purposes of analysis, we assume yk = y(tk).

Suppose we want to see what happens out to time t = 0.5, with 1000 rabbits and 100 foxes initially, and with α = 2, β = 0.1, γ = 2, and δ = 0.1. We then run ode45 with the following matlab dialog.

>> global alpha
>> global beta
>> global gamma
>> global delta
>> alpha = 2;
>> beta = 0.1;
>> gamma = 2;
>> delta = 0.1;
>> [T,Y] = ode45('predator_prey',[0,0.5], [1000,100]);
>> plot(T,Y(:,1),T,Y(:,2))
>> hold
Current plot held
>> plot(T,zeros(size(T,1),1),'LineStyle','none','Marker','+',...
'MarkerEdgeColor','red','Markersize',4)
>> size(T)
ans = 129 1

This results in the following figure.

[Figure: populations of rabbits and foxes versus time on 0 ≤ t ≤ 0.5, with the accepted time steps marked by + signs along the horizontal axis at level 0.]

This figure indicates that, with these values of the parameters α, β, γ, and δ and the chosen initial number of rabbits and foxes, the foxes rapidly increase and the rabbits die out before time t = 0.1, then the foxes slowly die out. The output of size(T) shows that a total of 129 steps were taken, and we see on the horizontal line at level 0 that the steps become larger where the solutions are not varying as much.

If a Taylor series method is used, an alternative error control is an
interval evaluation of the error term. That is, if we are taking a step from
tk to tk+1 = tk + hk, the actual error in the Taylor series method of order
ω (that is, expanding y′ = f in a Taylor polynomial of degree ω − 1) is of
the form

    [h^{ω+1}/(ω + 1)!] (d^ω f(ξ, y(ξ))/dt^ω)                            (7.21)

for some ξ ∈ [tk, tk + hk]. The actual derivative in (7.21) is a linear combination of products of partial derivatives of f of various orders with respect to t and the components of the vector y, but values of it can be obtained effectively with automatic differentiation. If the automatic differentiation uses interval arithmetic with the interval t = [tk, tk + hk] and an interval bound y (obtained and verified in various ways), we obtain mathematically rigorous bounds on the error. This has proven effective in simulations of particle beams in atomic accelerators and other applications [8], but general software based on it is not yet publicly available.

7.8 Multistep, Implicit, and Predictor-Corrector Methods

In multistep methods, values yℓ with ℓ < k, in addition to yk, are used to obtain yk+1. A common class of multistep methods is the class of Adams–Bashforth methods, in which f is approximated by an interpolating polynomial based on the values fk−n, . . ., fk, and then the interpolating polynomial is integrated to obtain yk+1. For instance, to obtain a so-called "3-step method," in which 3 previous values of the solution are used, we pass an interpolating polynomial through fk = f(tk, yk), fk−1 = f(tk−1, yk−1), and fk−2 = f(tk−2, yk−2). The corresponding Lagrange form representation (see (4.4) on page 149) is

p2(t) = ℓk(t)fk + ℓk−1(t)fk−1 + ℓk−2(t)fk−2,

where

    ℓk(t) = (t − (tk − hk))(t − (tk − hk − hk−1)) / [hk(hk + hk−1)],

    ℓk−1(t) = −(t − tk)(t − (tk − hk − hk−1)) / [hk hk−1],  and

    ℓk−2(t) = (t − tk)(t − (tk − hk)) / [hk−1(hk + hk−1)].


The next approximation yk+1 ≈ y(tk+1) is then defined by

    yk+1 = yk + ∫_{tk}^{tk+hk+1} p2(t) dt
         = yk + fk ∫_{tk}^{tk+hk+1} ℓk(t) dt + fk−1 ∫_{tk}^{tk+hk+1} ℓk−1(t) dt
              + fk−2 ∫_{tk}^{tk+hk+1} ℓk−2(t) dt.

Under the simplifying assumption7 that hk+1 = hk = hk−1 = h, we have

    ∫_{tk}^{tk+h} ℓk(t) dt = (23/12)h,
    ∫_{tk}^{tk+h} ℓk−1(t) dt = −(4/3)h,  and
    ∫_{tk}^{tk+h} ℓk−2(t) dt = (5/12)h,

so

    yk+1 = yk + h[(23/12)fk − (4/3)fk−1 + (5/12)fk−2].                  (7.22)

This is known as the Adams–Bashforth 3-step method. This method has order ω = 3 (and, in general, the s-step Adams–Bashforth method, involving s previously computed values of f, has order ω = s).

Adams–Bashforth methods cannot compute y1 through ys−1 on their own, since they do not have the required previously computed values of f for these initial points. Generally, a separate method of order s or higher, such as an order-s Runge–Kutta method, is used to start the process.
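To illustrate, here is a minimal matlab sketch (ours, not from the text) of the 3-step formula (7.22) applied to y′ = t + y, y(1) = 2, with Runge–Kutta starting values:

f = @(t,y) t + y;
h = 0.1;
t = 1 + (0:2)*h;                 % t_0, t_1, t_2
y = [2, 2.32068, 2.68561];       % starting values (e.g., from Runge-Kutta)
fv = f(t, y);                    % stored values of f
for k = 3:6
    y(k+1) = y(k) + h*((23/12)*fv(k) - (4/3)*fv(k-1) + (5/12)*fv(k-2));
    t(k+1) = t(k) + h;
    fv(k+1) = f(t(k+1), y(k+1)); % only one new f evaluation per step
end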

Under certain conditions, namely, when the system is stiff, we may want to use as-yet-unknown information to perform the step from tk to tk + hk. For example, we may pass an interpolating polynomial of degree 2 through fk+1

(as yet unknown), fk, and fk−1, to obtain

q2(t) = ℓk+1(t)fk+1 + ℓk(t)fk + ℓk−1(t)fk−1,

where

    ℓk+1(t) = (t − tk)(t − (tk − hk)) / [hk+1(hk + hk+1)],

    ℓk(t) = −(t − (tk + hk+1))(t − (tk − hk)) / [hk hk+1],  and

    ℓk−1(t) = (t − (tk + hk+1))(t − tk) / [hk(hk + hk+1)],

7good here for illustration, but not made in practical software


and where, as before,

    yk+1 = yk + ∫_{tk}^{tk+hk+1} q2(t) dt.

We integrate the ℓ's as before, to obtain the coefficients of the formula.
Under the simplifying assumption that hk+1 = hk = h, we have

    yk+1 = yk + h[(5/12)fk+1 + (2/3)fk − (1/12)fk−1].                   (7.23)

This is called the Adams–Moulton implicit method of order 3. (The Adams–Moulton implicit method of order ω uses fk+1, fk, . . ., fk−ω+2.) For vector y and f, computing yk+1 in an implicit method involves solving a system of equations, in general nonlinear, in the components of yk+1.

Example 7.9

Let y′(t) = t + y, y(1) = 2, with h = 0.1, as in Example 7.6 (on page 263).
The order 3 Adams–Moulton method for this example reduces to

    yk+1 = yk + (0.1)[(5/12)(tk+1 + yk+1) + (2/3)(tk + yk) − (1/12)(tk−1 + yk−1)].

For the purposes of illustration, we may solve this equation symbolically for yk+1 (although, in general, numerical methods, such as the multivariate Newton method we describe in Chapter 8, are used to solve the nonlinear system for the components of yk+1). We obtain

    yk+1 = (1/115)(8tk − tk−1 + 5tk+1 + 128yk − yk−1).

We have t0 = 1, y0 = 2. If we use the fourth-order Runge–Kutta method as in Example 7.6 to get starting values, we obtain t1 = 1.1, t2 = 1.2, and y1 ≈ 2.32068. Applying the order-3 Adams–Moulton method then gives

y2 ≈ 2.685626,

which compares favorably with the Runge–Kutta method in Example 7.6.
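In matlab, the closed-form iteration above can be carried out directly (a sketch of ours, not from the text; note matlab's 1-based indexing, so y(k+1) approximates the solution at t(k+1)):

t = 1 + (0:5)*0.1;               % grid points t_0,...,t_5
y = [2, 2.32068];                % y_0 and the Runge-Kutta value y_1
for k = 2:5
    y(k+1) = (8*t(k) - t(k-1) + 5*t(k+1) + 128*y(k) - y(k-1))/115;
end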

Implicit methods are appropriate for stiff systems, which we discuss in the next section. Adams–Bashforth methods are used when a high-order method is needed, but evaluations of f are expensive. (In a high-order Adams–Bashforth method, only one additional evaluation of f is required per step, since previous values are recycled. In contrast, in a Taylor series method, values and many derivatives are required, and, in a Runge–Kutta method, many function values are required per step.)


Another way that implicit and explicit methods are used is in predictor-corrector methods. In such a method, an explicit formula is used to compute an approximation ŷk+1 to y(tk+1). The approximation ŷk+1 is then used in the right side of an implicit formula (generally of higher order than the explicit formula) to obtain a better approximation yk+1 to y(tk+1).

The matlab function ode113 implements predictor-corrector Adams–Bashforth and Adams–Moulton methods of various orders. In particular, not only is the step size adjusted, but the software also uses heuristics to change the order.

Example 7.10

We will use ode113 to solve the predator-prey system of Example 7.8 (on page 267). We have

>> [T,Y] = ode113('predator_prey',[0,.5], [1000,100]);
>> plot(T,Y(:,1),T,Y(:,2))
>> hold
Current plot held
>> plot(T,zeros(size(T,1),1),'LineStyle','none','Marker','+',...
'MarkerEdgeColor','red','Markersize',4)
>> size(T)
ans = 70 1

with figure:

[Figure: the same predator-prey solution as in Example 7.8, computed with ode113; the + marks along the axis show the steps taken.]

We see that only 70 steps are taken, instead of 129, and the steps are farther apart on the smooth part of the graph. For this illustrative example, the difference in performance is not significant on a modern laptop computer, but the difference can be significant for certain larger problems.

We give a theoretical analysis of explicit and implicit multistep methods in general, as well as of predictor-corrector methods, in [1].


7.9 Stiff Systems

Stiff systems are common both in primary applications and in approximating partial differential equations by systems of ordinary differential equations. We begin our study of stiff systems with an explanation of a simplified context.

7.9.1 Stiff Systems and Linear Systems

To understand the basic ideas about stiff systems, we think of approximating
the general initial value problem

    y′(t) = f(t, y(t)),  t ≥ 0,  y(0) = y0,

by the linear problem

    y′(t) = Ay(t),  t ≥ 0,  y(0) = y0.                                  (7.24)

In particular, the system (7.24) is a model for nonlinear systems
y′ = f(t, y). The matrix A is a model for the Jacobian matrix ∂f/∂y, i.e.,
expanding in a Taylor series about a fixed ȳ,

    f(t, y) ≈ f(t, ȳ) + (∂f/∂y)(t, ȳ)(y − ȳ).

To further simplify our study, we will assume that the matrix A has simple eigenvalues, that is, that A has n distinct eigenvalues λi, 1 ≤ i ≤ n, and thus has n corresponding linearly independent eigenvectors vi, 1 ≤ i ≤ n. (Further analysis can show that the systems behave similarly without these simplifications, but the basic ideas are clear in the simpler context.) In our simplified context, if we form the n by n matrix V whose i-th column is vi, we have

    AV = VΛ,  or  V⁻¹AV = Λ,

where Λ is the diagonal matrix such that its i-th diagonal entry is the i-th
eigenvalue λi. (See Example 5.1 on page 192.) If we make the change of
dependent variables z = V⁻¹y, or y = Vz, and we observe (Vz)′ = Vz′, we have

    Vz′ = A(Vz),  or  z′ = (V⁻¹AV)z = Λz.

Interpreted component-by-component, this last system is simply

    z′i = λi zi,  1 ≤ i ≤ n,

which has solution

    zi = ci e^{λi t}.


The vector equation y = Vz can thus be written componentwise as

    y(t) = Σ_{i=1}^n ci e^{λi t} vi.                                    (7.25)

The ci can then be found by solving the linear system

    Σ_{i=1}^n ci vi = Vc = y0

for the vector c = (c1, . . ., cn)^T.

Example 7.11

As an illustrative example, take the equation

    u″ + u′ + u = 0,  u(0) = 1,  u′(0) = 2.

(This equation is a simplified model of a damped mechanical system, such as automobile springs with shock absorbers.) Converting to a system with y1(t) = u(t), y2(t) = y′1(t), we obtain the system of equations

    y′ = ( y′1 )   (    y2    )   (  0   1 ) ( y1 )
         ( y′2 ) = ( −y1 − y2 ) = ( −1  −1 ) ( y2 ).

Using matlab, we obtain

>> A = [0 1;-1 -1]
A =
     0     1
    -1    -1
>> [V,Lambda] = eig(A)
V =
   0.7071             0.7071
  -0.3536 + 0.6124i  -0.3536 - 0.6124i
Lambda =
  -0.5000 + 0.8660i        0
        0            -0.5000 - 0.8660i
>> y0 = [1;2]
y0 =
     1
     2
>> c = V\y0
c =
   0.7071 - 2.0412i
   0.7071 + 2.0412i
>>

In fact, it can be verified that the exact eigenvalues of A are the roots of
the characteristic equation

    λ² + λ + 1 = 0

for the original second-order linear differential equation, namely,

    λ1 = −1/2 − (√3/2)i  and  λ2 = −1/2 + (√3/2)i.


Thus,

    z1(t) = e^{(−1/2 − (√3/2)i)t}  and  z2(t) = e^{(−1/2 + (√3/2)i)t},

and the solution to the initial value problem is

    y(t) ≈ (0.7071 − 2.0412i) e^{(−1/2 − (√3/2)i)t} v1
         + (0.7071 + 2.0412i) e^{(−1/2 + (√3/2)i)t} v2
         ≈ (0.7071 − 2.0412i) e^{(−1/2 − (√3/2)i)t} ( 0.7071 ; −0.3536 + 0.6124i )
         + (0.7071 + 2.0412i) e^{(−1/2 + (√3/2)i)t} ( 0.7071 ; −0.3536 − 0.6124i ).

Simplifying using e^{a+bi} = e^a e^{bi} and Euler's formula

    e^{bi} = cos(b) + i sin(b),

we obtain

    u(t) = y1(t) ≈ e^{−t/2} [ cos((√3/2)t) + (5√3/3) sin((√3/2)t) ].

(In fact, by solving the original equation symbolically as a linear second-order equation, we obtain exactly this solution.)

The term stiff system originated in the study of mechanical systems with springs. A spring is "stiff" if its damping constant is large; in such a mechanical system, motions of the spring will damp out fast relative to the time scale on which we are studying the system. In the numerical solution of initial value problems, "stiffness" has come to mean that the solution to the ODE has some components that vary or die out rapidly in relation to the other components, or in relation to the time interval over which the integration proceeds. For example, the scalar equation y′ = −1000y might be considered to be moderately stiff when it is integrated for 0 ≤ t ≤ 1, but not stiff if the interval of integration is 0 ≤ t ≤ 0.001.

Example 7.12

Let’s consider the system

    y′ = Ay,  t ≥ 0,  y(0) = (1, 0, −1)^T,                              (7.26)

where

    A = ( −21   19  −20
           19  −21   20
           40  −40  −40 ).

Page 289: Undergraduate Text

276 Applied Numerical Methods

The eigenvalues of A are λ1 = −2, λ2 = −40 + 40i, and λ3 = −40 − 40i, and
the exact solution of (7.26) is

    y1(t) = (1/2)e^{−2t} + (1/2)e^{−40t}(cos 40t + sin 40t),
    y2(t) = (1/2)e^{−2t} − (1/2)e^{−40t}(cos 40t + sin 40t),            (7.27)
    y3(t) = −e^{−40t}(cos 40t − sin 40t).

This system is stiff over the time interval in which we expect e^{−2t} to die out, since the component e^{−40t} dies out much faster. The graphs of the solution to this initial value problem are in Figure 7.1 (obtained using matlab's stiff ODE routine ode15s). Notice that for 0 ≤ t ≤ .1, the yi(t), 1 ≤ i ≤ 3, vary rapidly, but

[FIGURE 7.1: Actual solutions y1(t), y2(t), and y3(t) to the stiff ODE system of Example 7.12, plotted for 0 ≤ t ≤ 0.3.]

for t ≥ 0.1 the yi vary slowly. Hence, a small time step must be used in the interval [0, 0.1] for adequate resolution, whereas for t ≥ 0.1 large time steps should suffice. Suppose, however, we use Euler's method starting at t = 0.2 with initial conditions taken as the exact values yi(0.2), 1 ≤ i ≤ 3. We obtain:


For h = 0.04:

    j    tj    y1j      y2j      y3j
    0    0.2   0.335    0.335   -0.00028
    5    0.4   0.218    0.223    0.0031
    10   0.6   0.186    0.106   -0.0283
    15   0.8  -0.519    0.711    0.1436
    20   1.0   9.032   -8.91     1.9236
    21   1.1  -6.862    6.98    27.55

For h = 0.02:

    j    tj    y1j      y2j      y3j
    0    0.2   0.3353   0.3350  -0.00028
    5    0.3   0.2734   0.2732  -0.000065
    10   0.4   0.2229   0.2228  -0.0000054

Violent instability occurs for h = 0.04, but the method is stable for h = 0.02. What happened? Why do we need h so small?
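Before answering, note that these iterates are easy to reproduce; the following matlab sketch (ours, with the starting vector taken as the rounded exact values yi(0.2) from the table) exhibits the blow-up:

A = [-21 19 -20; 19 -21 20; 40 -40 -40];
y = [0.335; 0.335; -0.00028];    % approximate exact values y_i(0.2)
h = 0.04;                         % try h = 0.02 as well
for j = 1:21
    y = y + h*A*y;                % Euler step for y' = Ay
end
y                                 % grows violently for h = 0.04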

The answer lies in understanding the concept of stability, as in Definition 7.3 (on page 264).

7.9.2 Stability of Stiff Systems

Earlier (Definition 7.3 on page 264), we defined absolute stability of methods for solving the IVP in terms of the scalar equation y′ = λy. We now extend the definition to systems.

DEFINITION 7.4 Let A satisfy the stated assumptions and suppose that Re λi < 0, 1 ≤ i ≤ n. A numerical method for solving the linear IVP (7.24) is called absolutely stable for a particular value of h if it yields numerical solutions yj, j ≥ 0, in R^n such that yj → 0 as j → ∞ for all y0. As in Definition 7.3, we speak of the region of absolute stability as being the set of λh in the complex plane for which the method, applied to a scalar equation y′ = λy, y ∈ R, is absolutely stable.

We now show in our simplified context why Definition 7.4 makes sense. In particular, a method for a system is absolutely stable if and only if the method is absolutely stable for the scalar equations z′ = λi z, for 1 ≤ i ≤ n. To see this, consider, for example, the k-step method

    Σ_{l=0}^k αl y_{l+j} = h Σ_{l=0}^k βl f_{l+j} = h Σ_{l=0}^k βl A y_{l+j}.


Thus,

    Σ_{l=0}^k (αl I − hβl A) y_{l+j} = 0,  j ≥ 0.

Let V⁻¹AV = Λ, where this decomposition is guaranteed if A has n simple
eigenvalues and Λ = diag(λ1, λ2, · · ·, λn). We conclude that

    Σ_{l=0}^k (αl I − hβl Λ) V⁻¹ y_{l+j} = 0.

Setting zj = V⁻¹ yj, we see that

    Σ_{l=0}^k (αl − hβl λi)(z_{l+j})_i = 0,  1 ≤ i ≤ n,

where (z_{l+j})_i is the i-th component of z_{l+j}. Since (zj)_i → 0,
1 ≤ i ≤ n, as j → ∞ if and only if yj → 0 as j → ∞, we see that the method
will be absolutely stable for system (7.24) if and only if it is absolutely
stable for the scalar equation z′ = λi z, 1 ≤ i ≤ n. In this case, it will be
absolutely stable provided that the roots of p(z, h; i) = ρ(z) − hλi σ(z),
1 ≤ i ≤ n, satisfy |z_{l,i}| < 1, 1 ≤ l ≤ k, 1 ≤ i ≤ n, where
ρ(z) = Σ_{l=0}^k αl z^l and σ(z) = Σ_{l=0}^k βl z^l are the characteristic
polynomials formed from the method's coefficients.

Example 7.13

Recall that, in Example 7.7 (on page 265), we found that the region of
absolute stability for Euler's method (the Adams–Bashforth 1-step method) is
the open disk

    {λh : |1 + λh| < 1},                                                (7.28)

[Figure: the disk |1 + λh| < 1 in the λh-plane, centered at −1.]

(Recall that yj+1 = yj + λh yj for Euler's method applied to y′ = λy gives yj → 0 if |1 + λh| < 1.)

Applying Euler's method symbolically to

    y′ = −1000y,  y(0) = ε

gives

    yk = (1 − 1000h)^k ε.

The graph of the exact solution for 0 ≤ t ≤ 1 is indistinguishable from the
graph of the constant function y ≡ 0, if the y-scale is 0 ≤ y ≤ 1. However,
examine the


following simple matlab computation, representing steps of Euler's method for this problem with h = 0.1.

>> y = 1e-3;

>> h=0.1;

>> y = y - 1000*h*y

y = -0.0990 % t = 0.1

>> y = y - 1000*h*y

y = 9.8010 % t = 0.2

>> y = y - 1000*h*y

y = -970.2990 % t = 0.3

>> y = y - 1000*h*y

y = 9.6060e+004 % t = 0.4

>> y = y - 1000*h*y

y = -9.5099e+006 % t = 0.5

>> y = y - 1000*h*y

y = 9.4148e+008 % t = 0.6

>>

In fact, to avoid this kind of behavior, we would need h < 0.002, and thus need more than 500 steps to go from t = 0 to t = 1.

In contrast, the implicit Euler method (that is, the Adams–Moulton method of
order 1) has iteration equation defined by

    yk+1 = yk + h f(tk+1, yk+1),                                        (7.29)

which, for the test equation y′ = λy used to determine stability, becomes

    yk+1 = yk + λh yk+1,

which, solving for yk+1, becomes

    yk+1 = [1/(1 − λh)] yk,

with a region of absolute stability defined by

    |1/(1 − (x + iy))| < 1.

Namely, the implicit Euler method is stable for the entire region outside of the circle of radius 1 centered at (x, y) = (1, 0), and, in particular, in the entire left half of the complex plane. We perform a simple matlab computation for the implicit Euler method on y′ = −1000y, y(0) = 10⁻³, with h = 0.1:

>> y = 1e-3;
>> h = 0.1;
>> y = (1/(1+1000*h))*y
y = 9.9010e-006
>> y = (1/(1+1000*h))*y
y = 9.8030e-008
>> y = (1/(1+1000*h))*y
y = 9.7059e-010
>> y = (1/(1+1000*h))*y
y = 9.6098e-012
>> y = (1/(1+1000*h))*y
y = 9.5147e-014
>>

We see that, although the relative accuracy of the solution is not high, the approximate solution tends to 0, as it should.

Example 7.14

Analyzing Euler's method computations for Example 7.12 in the same way, we see that, for the numerical solutions to go to zero (that is, for absolute stability), we must have |1 + λi h| < 1, 1 ≤ i ≤ 3. For i = 1 (λ1 = −2), this yields h < 1. However, i = 2, 3 (λ2 = −40 + 40i, λ3 = −40 − 40i) yields h < 1/40 = .025, which is violated if h = .04. We conclude that, although the terms with eigenvalues λ2, λ3 contribute almost nothing to the solution of (7.26) after t = .1, they force the selection of a small time step h which must satisfy |1 + λ2h| < 1, |1 + λ3h| < 1.

The implicit Euler method applied to Example 7.12 takes the form

    yk+1 = yk + hAyk+1,  that is,  yk+1 = (I − hA)⁻¹ yk.                (7.30)

Without worrying about implementing the computations efficiently in thissimple example, we use matlab to iterate (7.30) directly, with h = 0.04:

>> A = [-21 19 -20; 19 -21 20; 40 -40 -40]
A =
   -21    19   -20
    19   -21    20
    40   -40   -40
>> h = 0.04;
>> y = [1;0;-1]
y =
     1
     0
    -1
>> I = eye(3);
>> y = (I-A)^(-1)*y   % t=0.04
y =
    0.1790
    0.1543
   -0.0003
>> y = (I-A)^(-1)*y   % t=0.08
y =
    0.0557
    0.0554
    0.0003
>> y = (I-A)^(-1)*y   % t=0.12
y =
    0.0185
    0.0185
    0.0000
>> y = (I-A)^(-1)*y   % t=0.16
y =
    0.0062
    0.0062
    0.0000
>> y = (I-A)^(-1)*y   % t=0.20
y =
    0.0021
    0.0021
    0.0000
>> y = (I-A)^(-1)*y   % t=0.24
y =
   1.0e-003 *
    0.6859
    0.6859
   -0.0000
>>

We see that the solutions are tending to 0, as they should.

7.9.3 Methods for Stiff Systems

Generally, implicit methods, with regions of stability that contain the entire negative real axis (or at least a large portion of it), are appropriate for stiff systems. We give further theory and other methods (including Padé methods) in [1].

In matlab, routines for stiff systems include ode15s, ode23s, ode23t, and ode23tb. All of these matlab functions share the same arguments, and are used in the same way as the other matlab functions, such as ode45, to integrate initial value problems. We have already mentioned that we used ode15s to produce Figure 7.1. Here are the matlab function and command-window dialog:

function [f] = stiff_example(t,y)
A = [-21 19 -20
      19 -21 20
      40 -40 -40];
f = A*y;

>> [T,Y] = ode15s('stiff_example',[0,.3], [1,0,-1]);
>> plot(T,Y(:,1),T,Y(:,2),T,Y(:,3))
>> hold
Current plot held
>> plot(T,zeros(size(T,1),1),'LineStyle','none','Marker','+',...
'MarkerEdgeColor','black','Markersize',4)
>> size(T)
ans = 74 1

Thus, 74 steps were taken, and we can see the steps on Figure 7.1. In contrast, if we replace ode15s in this dialog by ode45 (but leave the other statements the same), we obtain the following plot:


[Figure: the ode45 solution of the stiff system of Example 7.12 on 0 ≤ t ≤ 0.3, with the accepted steps marked by + signs.]

We also obtained size(T) = 89 1, that is, 89 steps were taken, more than the 74 for ode15s. For this example, the difference is not particularly significant, since the system is only moderately stiff over the interval t ∈ [0, 0.3]. In Exercise 12 at the end of this chapter, you will try matlab's various IVP solvers on a stiffer problem.

7.10 Application to Parameter Estimation in Differential Equations

Many phenomena in the biological and physical sciences have been described by parameter-dependent systems of differential equations such as those discussed previously in this chapter. Furthermore, some of the parameters in these models cannot be directly measured from observed data. Thus, the parameter estimation techniques discussed in this section are crucial for using such differential equation models as prediction tools.

In this section we focus on the following question: Given the set of data
{dj}_{j=1}^n at the respective points tj ∈ [0, T], j = 1, . . ., n, find the
parameter a ∈ Q, where Q is a compact set contained in C[0, T] (the space of
continuous functions on [0, T]), which minimizes the least-squares index

    Σ_{j=1}^n |y(tj; a) − dj|²


subject to

    dy/dt = f(t, y; a),  y(0; a) = y0,

where y(t; a) represents the parameter-dependent solution of the above
initial value problem. We combine two methods discussed in this book to
provide a numerical algorithm for solving this problem. In particular, we
will use approximation theory together with numerical methods for solving
differential equations to present an algorithm for solving the least-squares
problem. To this end, divide the interval [0, T] into m equal-size intervals
and denote the bin points by t0, t1, . . ., tm. Let ϕi be a spline function
(e.g., a linear or cubic spline) centered at ti, i = 0, . . ., m, and define
am(t) = Σ_{i=0}^m ci ϕi(t). Denote by yk(a) the numerical approximation
(using any of the numerical methods discussed in this chapter) of the
solution of the differential equation y(tk; a), k = 1, . . ., N, with
tk − tk−1 = h = T/N. Let yN(t; a) be a piecewise interpolant (e.g., piecewise
linear) of yk(a) at the points tk. Then one can define an approximating
problem of the above constrained least-squares problem as follows: Find the
parameter am ∈ Qm, where Qm is the space spanned by the m + 1 spline
elements ϕ0, . . ., ϕm, which minimizes the least-squares index

    Σ_{j=1}^n |yN(tj; am) − dj|².

Clearly, the above problem is a finite-dimensional minimization problem and is equivalent to the problem: Find (c0, . . ., cm) ∈ R^{m+1} which minimizes the least-squares index Σ_{j=1}^n |yN(tj; c0, . . ., cm) − dj|². One can apply many optimization routines to solve this problem (e.g., the nonlinear least-squares routine "lsqnonlin," available in matlab, works well for such a problem).
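As a minimal illustration of this setup (a sketch of ours, assuming linear "hat" splines on [0, T] and an illustrative model y′ = a(t)y; the observation times tdata and values d below are hypothetical):

T = 2; m = 2; tnode = linspace(0, T, m+1);         % spline nodes t_0,...,t_m
tdata = [0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0];     % hypothetical data times
d = [1.3 1.7 2.2 2.9 3.8 5.0 6.5 8.5].';           % hypothetical observations
hat = @(t) max(0, 1 - abs(t - tnode)/(T/m));       % row [phi_0(t),...,phi_m(t)]
resid = @(c) deval(ode45(@(t,y) (hat(t)*c)*y, [0 T], 1), tdata).' - d;
c_fit = lsqnonlin(resid, ones(m+1,1));             % minimizes sum(resid(c).^2)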

7.11 Application for Solving an SIRS Epidemic Model

The following system of ordinary differential equations (7.31) is an epidemic model describing the spread of an infectious disease after it starts within a population.

    dS/dt = −(β/N) S I + νR,
    dI/dt = (β/N) S I − γI,                                             (7.31)
    dR/dt = γI − νR,

with initial conditions S(0) > 0, I(0) > 0, R(0) ≥ 0. Here, we assume that the total population is a constant with no births or deaths. Thus, the total size of the population, N, equals S(0) + I(0) + R(0).


Any individual in the population is in one of three distinct classes: the susceptible, S, who are not infected but can contract the disease; the infected, I, who have the disease and are capable of transmitting it; and the removed, R, who have recovered from the disease, are permanently immune, or are isolated until recovered.

The SIRS model (7.31) implies the following: First, the susceptible class moves to the infected class at a rate of βSI/N, where β is the average number of adequate contacts made by an infected individual per unit time [4]; hence βSI/N gives the total number of infections caused by the infected class per unit time. Second, individuals recover from the infected class at a rate of γI, where γ is defined as the recovery rate; thus, 1/γ is the average time spent in the infectious class [4]. Third, the removed individuals lose immunity at a rate of νR and become susceptible again.

The ratio β/γ, called the basic reproduction number [4] and denoted by R0, is the number of secondary infections caused by an infective individual during his/her infectious period ([4], [6]). If more than one secondary infection is produced by one infective individual during his/her infectious period, that is, if R0 > 1, the disease becomes endemic; and if R0 < 1, then the disease fades out, and the whole population becomes healthy but susceptible [14].

Now, we use a second-order Runge–Kutta method to find the numerical solutions to the SIRS model (7.31) with two sets of parameters. We choose the initial sizes of the susceptible, infected, and removed classes to be 5, 3, and 7, respectively, so the total size of the population is 15. We pick β = 2, γ = 1.5, and ν = 0.2, so in this case R0 = β/γ > 1. Using the following matlab program and functions, we obtain a graph of the population sizes of each class in 500 iterations. Graph (a) shows the infected population stays positive and hence the epidemic continues. Then we set β to 1 without changing the other parameters, so R0 < 1. Running the program again, we get graph (b), which shows all individuals become susceptible and the infected population becomes zero.

function [f] = SIRS(t, y)
N = y(1) + y(2) + y(3);
b = 2;
r = 1.5;
m = 0.2;
f(1) = -b/N*y(1)*y(2) + m*y(3);
f(2) =  b/N*y(1)*y(2) - r*y(2);
f(3) =  r*y(2) - m*y(3);
return

function [t,y] = Runge_Kutta_2_for_systems(t0, tf, y0, f, n_steps)
%
% [t,y] = Runge_Kutta_2_for_systems(t0, tf, y0, f, n_steps) performs
% n_steps of the modified Euler method (explained in Section 7.3.4.2 of the
% text), on the system represented by the right-hand-side function f, with
% constant step size h = (tf - t0)/n_steps, and starting with initial
% independent variable value t0 and initial dependent variable values y0.
% The corresponding independent and dependent variable values are returned
% in t(1:n_steps+1) and y(1:n_steps+1,:), respectively.


h = (tf - t0) / n_steps;
t = linspace(t0, tf, n_steps+1);
y(1,:) = y0;  % y(1,:) are the initial values at t=t0
for i=1:n_steps
    k1 = h*feval(f, t(i), y(i,:));
    k2 = h*feval(f, t(i)+h, y(i,:)+k1);
    y(i+1,:) = y(i,:) + (k1+k2)/2;
end  % k1, k2, f, and y(i,:) are vectors

% Matlab script run_Runge_Kutta_2_for_systems.m
%
clear
clf
t0 = 0;
tf = 50;
n_steps = 500;
y0(1) = 5;
y0(2) = 3;
y0(3) = 7;
[t,y] = Runge_Kutta_2_for_systems(t0, tf, y0, 'SIRS', n_steps);
set(gca,'fontsize',15,'linewidth',1.5);
plot(t,y(:,1),'g-',t,y(:,2),'b--',t,y(:,3),'r-.','linewidth',1.5)
axis([0,tf,0,20]);
xlabel('Time')
ylabel('Populations')

[Graphs (a) and (b): population sizes of the Susceptible, Infected, and Removed classes versus time, 0 ≤ t ≤ 50; in (a) the infection persists, while in (b) the infected population tends to zero.]

7.12 Exercises

1. Suppose we consider an example of the initial value problem (7.1) (on
   page 253), such that a = 0, b = 1, such that y and f are scalar valued,
   and such that f(t, y(t)) = f(t), that is, f is a function of the
   independent variable t only, and not of the dependent variable. In that
   case,

       y(1) = y(0) + ∫_0^1 f(t) dt.


   (a) To what method of approximating the integral does Euler's method
       correspond?

   (b) In view of your answer to item 1a, do you think Euler's method is
       appropriate to use in practice for accuracy and efficiency?

2. Show that Euler's method fails to approximate the solution
   y(x) = (2x/3)^{3/2} of the initial value problem y′(x) = y^{1/3},
   y(0) = 0. Explain why.

3. Consider Euler's method for approximating the IVP y′(x) = f(x, y),
   0 < x < a, y(0) = α. Let yh(xi+1) = yh(xi) + h f(xi, yh(xi)) for
   i = 0, 1, . . ., N, where yh(0) = α. It is known that
   yh(xi) − y(xi) = c1 h + c2 h² + c3 h³ + . . ., where the cm,
   m = 1, 2, 3, . . ., depend on xi but not on h. Suppose that yh(a),
   y_{h/2}(a), and y_{h/3}(a) have been calculated using interval widths
   h, h/2, and h/3, respectively. Find an approximation ŷ(a) to y(a) that is
   accurate to order h³.

4. Duplicate the table on page 260, but for h = 0.05 and h = 0.01. (You will probably want to write a short computer program to do this. You also may need to display more digits than in the original table on page 260.) By taking ratios of errors, illustrate that the global error in the order-two Taylor series method is O(h²).

5. Suppose that

       y‴(t) = t + 2t y″(t) + 2t² y(t),  1 ≤ t ≤ 2,
       y(1) = 1,  y′(1) = 2,  y″(1) = 3.

   Convert this third-order equation into a first-order system and compute
   yk for k = 1, 2 for Euler's method with step length h = 0.1.

6. Calculate the real part of the region for the absolute stability of the fourth-order Runge–Kutta method (7.18).

7. Consider the Runge–Kutta method

       yi+1 = yi + h f(ti + h/8, yi + (h/8) f(ti, yi)).

   Apply this method to y′ = λy to find the interval of absolute stability
   of the method. (Assume that λh < 0.)

8. Find the region of absolute stability for the

   (a) Trapezoidal method:

       yj+1 = yj + (h/2)(f(tj, yj) + f(tj+1, yj+1)),  j = 0, 1, · · ·, N − 1.

   (b) Backward Euler method:

       yj+1 = yj + h f(tj+1, yj+1),  j = 0, 1, · · ·, N − 1.


9. Consider solving the initial value problem y′ = λy, y(0) = α, where
   λ < 0, by the implicit trapezoid method, given by

       y0 = α,  yi+1 = yi + (h/2)[f(ti+1, yi+1) + f(ti, yi)],  0 ≤ i ≤ N − 1,

   ti = ih, h = T/N. Prove that any two numerical solutions yi and ȳi
   satisfy

       |yi − ȳi| ≤ e^K |y0 − ȳ0|

   for 0 ≤ ti ≤ T, assuming that |λ|h ≤ 1, where K = 3|λ|T/2 and y0, ȳ0 are
   respective initial values with y0 ≠ ȳ0. (That is, yi and ȳi satisfy the
   same difference equations except for different initial values.)

10. Consider the initial-value system

        dy/dt = (I − Bt)⁻¹ y,  y(0) = y0,  y(t) ∈ R^n,  0 ≤ t ≤ 1,

    where B is an n × n matrix with ‖B‖∞ ≤ 1/2. Euler's method for
    approximating y(t) has the form

        yi+1 = yi + h(I − Bti)⁻¹ yi = (I + h(I − Bti)⁻¹) yi,  i = 0, 1, · · ·, N − 1,

    where ti = ih and h = 1/N. Noting that ‖Bti‖∞ ≤ 1/2 for all i, prove
    that

        ‖yi+1‖∞ ≤ (1 + 2h)‖yi‖∞

    for i = 0, 1, · · ·, N − 1, and

        ‖yN‖∞ ≤ e²‖y0‖∞

    for any value of N ≥ 1.

11. Consider the following time-dependent logistic model for t ∈ [0, 2]:

        dy/dt = a(t) y (1 − y/5),  y(0) = 4.

    (a) Find parameters ci to approximate the time-varying coefficient
        a(t) ≈ Σ_{i=0}^2 ci ϕi(t). Here, ϕi denotes the hat function
        centered at ti, with respect to the nodes [t0, t1, t2] = [0, 1, 2].
        (See page 161.) Compute those ci which provide the best
        least-squares fit for the (t, a) data set:

        {(0.3, 5), (0.6, 5.2), (0.9, 4.8), (1.2, 4.7), (1.5, 5.5), (1.8, 5.2), (2, 4.9)}.

    (b) Solve the resulting initial value problem numerically. Somehow
        estimate the error in your numerical solution.


12. Try the matlab routines ode45, ode15s, ode23s, ode23t, and ode23tb on
    the following initial value problems:

    (a) y′ = −10⁴ y, y(0) = 1, and

    (b) y″ = −10⁴ y, y(0) = 1, y′(0) = 0.

    In each case, integrate from t = 0 to t = 1. Graph the solutions given,
    and form a table of the number of steps each of the routines took.

13. Experiment with the predator-prey model in Examples 7.8 and 7.10. (You may use the function predator_prey on page 267 and the script on page 268.) In particular, it is known that, for some values of α, β, γ, and δ, the populations of rabbits and foxes oscillate, instead of dying out. Use your intuition to find such values, and display the results. (For instance, to decrease the possibility that the rabbits will die out, you can increase the birth rate α of the rabbits, or decrease the predation rate β. Similarly, to decrease the chances that the foxes will die out, you can decrease the resource competition factor γ or increase the fox growth rate δ.) You may find some solutions where the two populations oscillate, and some solutions where the population of foxes dies out and the population of rabbits increases exponentially. Print your graphs as PDF files, and supply written explanation.

14. The following matlab script and function implement the discretization described in Example 7.4 for an arbitrary number of subintervals N. (N is set on the second line of the script.) The script generates a plot by selecting some of the time steps (N_t_divisions of them), and also prints the total number of time steps taken (represented as size(T)). The script is:

global N

N=8

N_t_divisions = 8;

[T,Y] = ode45(’example_7p4_func’,[0,5], [0;0;0;0;0;0;0]);

stride = floor(size(T,1)/N_t_divisions);

Yplot = zeros(N_t_divisions,N-1);

Yplot(1,:) = Y(1,:);

Xsurf(1) = T(1);

for i=2:N_t_divisions

Yplot(i,:) = Y(1+i*stride,:);

Xsurf(i) = T(1+i*stride);

end

for i=1:N-1;Ysurf(i) = i/(N);end;

surf(Ysurf,Xsurf,Yplot)

size(T)

while the function is:


function [ f ] = example_7p4_func( t, u )

% Function for example 7.4 of the manuscript.

global N;

f = zeros(N-1,1);

h = 1/N;

f(1) = N^2*(u(2) - 2*u(1));

for i=2:N-2

f(i) = N^2*(u(i+1)-2*u(i)+u(i-1));

end

f(N-1) = N^2*(t - 2*u(N-1) + u(N-2));

end

Try using N = 8, 50, 100, and 500, using ode45 (if practical) and ode15s. Make a table of the number of steps taken in each case, and compare the surface plots obtained.


Chapter 8

Numerical Solution of Systems of Nonlinear Equations

In this chapter, we study numerical methods for the solution of nonlinear systems. That is, we study numerical methods for finding x = (x_1, x_2, · · · , x_n)^T ∈ D ⊂ R^n that solves

F(x) = 0,   (8.1)

where F(x) = (f_1(x), f_2(x), f_3(x), · · · , f_n(x))^T, F : D ⊆ R^n → R^n.

Example 8.1

The following system of two equations in two unknowns arises from a problem in phase stability in chemical engineering. (The problem was communicated by Alberto Copati in 1999.) Find x_1 and x_2 such that

f_1(x_1, x_2) = x_1^2 + x_1 x_2^3 − 9 = 0,
f_2(x_1, x_2) = 3x_1^2 x_2 − x_2^3 − 4 = 0.

This system is interesting because it has four solutions.

8.1 Introduction

A basic tool in computing solutions to nonlinear systems of equations is a multivariate version of Newton's method. In turn, central to the concept of a multivariate Newton method is that of a Jacobian matrix.



DEFINITION 8.1 The matrix of partial derivatives

F′(x) = [ ∂f_1/∂x_1(x)  ∂f_1/∂x_2(x)  · · ·  ∂f_1/∂x_n(x)
          ∂f_2/∂x_1(x)  ∂f_2/∂x_2(x)  · · ·  ∂f_2/∂x_n(x)
          · · ·
          ∂f_n/∂x_1(x)  ∂f_n/∂x_2(x)  · · ·  ∂f_n/∂x_n(x) ]   (8.2)

is called the Jacobian matrix for the function F. The Jacobian matrix of F is sometimes denoted by J(F)(x). It is also an instance of the Fréchet derivative, which we define in this context in [1, Chapter 8].

Example 8.2

The Jacobian matrix for the function F(x) = (f_1(x_1, x_2), f_2(x_1, x_2))^T from Example 8.1 is

F′(x) = [ 2x_1 + x_2^3   3x_1 x_2^2
          6x_1 x_2       3x_1^2 − 3x_2^2 ].

Just as for functions of one variable, we can form linear models of functions F with n components, each component of which is a function of n variables. In particular, if x ∈ R^n and x^(0) ∈ R^n, we have

F(x) = F(x^(0)) + F′(x^(0))(x − x^(0)) + O(‖x − x^(0)‖^2),   (8.3)

that is, there is a constant c such that, for all x sufficiently close to x^(0),

‖F(x) − [F(x^(0)) + F′(x^(0))(x − x^(0))]‖ ≤ c‖x − x^(0)‖^2.

In fact, the i-th component of F(x^(0)) + F′(x^(0))(x − x^(0)) is the tangent plane approximation to f_i(x) at x^(0).

Example 8.3

The linear approximation to the function F from Examples 8.1 and 8.2 at the point x^(0) = (1, 2)^T is

(f_1(x_1, x_2); f_2(x_1, x_2)) ≈ L(x) = F(1, 2) + F′(1, 2)[x − (1; 2)]

= (0; −6) + [10 12; 12 −9] [(x_1; x_2) − (1; 2)]

= (10(x_1 − 1) + 12(x_2 − 2); −6 + 12(x_1 − 1) − 9(x_2 − 2)).


Related to such linear approximations, the following multivariate version of the mean value theorem can lead to insight.

THEOREM 8.1

(A multivariate mean value theorem) Suppose F : D ⊂ R^n → R^n has continuous first-order partial derivatives, and suppose that x ∈ D, x̄ ∈ D, and the line segment {x̄ + t(x − x̄) | t ∈ [0, 1]} is in D. Then

F(x) = F(x̄) + A(x − x̄),   (8.4)

where A is some matrix whose i-th row is of the form

( ∂f_i/∂x_1(c_i), ∂f_i/∂x_2(c_i), . . . , ∂f_i/∂x_n(c_i) ),

where the c_i ∈ R^n, 1 ≤ i ≤ n, are (possibly distinct) points on the line between x and x̄.

We can think of the linear approximation (8.3) as a degree-1 multivariate Taylor polynomial. Higher-order Taylor expansions are of the form

F(x) = F(x̄) + F′(x̄)(x − x̄) + (1/2)F″(x̄)(x − x̄)(x − x̄) + . . . ,   (8.5)

where F′ is the Jacobian matrix as in (8.2) and F″, F‴, etc. are higher-order derivative tensors. For example, F″(x̄) can be viewed as a matrix of matrices, whose (i, j, k)-th element is

∂^2 f_i / (∂x_j ∂x_k) (x̄),

and F″(x̄)(x − x̄) can be viewed as a matrix whose (i, j)-th entry is computed as

(F″(x̄)(x − x̄))_{i,j} = Σ_{k=1}^{n} ∂^2 f_i / (∂x_j ∂x_k) (x̄) (x_k − x̄_k).

Just as in univariate Taylor expansions, if we truncate the expansion in Equation (8.5) by taking terms only up to and including the k-th Fréchet derivative, then the resulting multivariate Taylor polynomial T_k(x) satisfies

F(x) = T_k(x) + O(‖x − x̄‖^{k+1}).


8.2 Newton’s Method

The multivariate Newton method, for finding solutions of systems of equations such as in Example 8.1, can be viewed in the same way as the univariate Newton method: we replace the function by its tangent-line approximation, then repeatedly find where the approximation is equal to zero. In the multivariate case, setting the linear approximation to the function to zero gives

F(x^(0)) + F′(x^(0))(x − x^(0)) = 0,

that is,

F′(x^(0))(x − x^(0)) = −F(x^(0)).   (8.6)

Equation (8.6) is a linear system of equations in the unknown vector v = x − x^(0). If F′(x^(0)) is non-singular, this system of equations can be solved for v, giving a new value x = x^(1) = x^(0) + v. In the case, such as Example 8.1, of two equations in two unknowns, x^(1) would represent the intersection of the tangent planes to f_1 and f_2 at x^(0) with the (x_1, x_2)-plane.

Equation (8.6), combined with practical considerations, leads to the following algorithm for Newton's method.

ALGORITHM 8.1

(Newton's method)

INPUT:

(a) an initial guess x^(0);

(b) a maximum number of iterations M;

(c) a domain stopping tolerance ε_d and a range stopping tolerance ε_r.

OUTPUT: either "success" or "failure." If "success," then also output the number of iterations k and the approximation x^(k+1) to the solution x^∗.

1. "success" ← "false".

2. FOR k = 0 to M:

(a) Evaluate F′(x^(k)). (That is, evaluate the corresponding n^2 partial derivatives at x^(k).)

(b) Solve F′(x^(k)) v^(k) = −F(x^(k)) for v^(k).

◦ IF F′(x^(k)) v^(k) = −F(x^(k)) cannot be solved (such as when F′(x^(k)) is numerically singular) THEN EXIT.

(c) x^(k+1) ← x^(k) + v^(k).

(d) IF ( ‖v^(k)‖ < ε_d or ‖F(x^(k+1))‖ < ε_r ) THEN

i. "success" ← "true".

ii. EXIT.

END FOR

END ALGORITHM 8.1.

Example 8.4

We will use matlab to do a few iterations of the multivariate Newton method, with x^(0) = (1, 2)^T, for Example 8.1. We obtain the following (condensed to save space, and with comments added):

>> x = [1; 2]
x =  1
     2
>> F = [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4]
F =  0
    -6
>> Fprime = [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2]
Fprime = 10  12
         12  -9
>> v = -Fprime \ F
v =  0.3077
    -0.2564
>> x = x + v
x =  1.3077   % (k = 1)
     1.7436
>> F = [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4]
F = -0.3583
    -0.3558
>> Fprime = [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2]
Fprime =  7.9161  11.9266
         13.6805  -3.9901
>> v = -Fprime \ F
v =  0.0291
     0.0107
>> x = x + v
x =  1.3368   % (k = 2)
     1.7543
>> F = [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4]
F =  0.0045
     0.0063
>> Fprime = [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2]
Fprime =  8.0726  12.3424
         14.0711  -3.8714
>> v = -Fprime \ F
v = 1.0e-003 *
    -0.4651
    -0.0601
>> x = x + v
x =  1.3364   % (k = 3)
     1.7542
>> F = [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4]
F = 1.0e-005 *
     0.0500
     0.1343
>> Fprime = [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2]
Fprime =  8.0711  12.3373
         14.0657  -3.8745
>> v = -Fprime \ F
v = 1.0e-007 *
    -0.9037
     0.1863
>> x = x + v
x =  1.3364   % (k = 4)
     1.7542

We see that Newton's method converges rapidly, with the five displayed digits unchanging after four iterations. In fact, if we look at the ratios

‖F(x^(k+1))‖ / ‖F(x^(k))‖^2,

we see that these ratios are approximately constant, indicating quadratic convergence.
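The following loop (our illustrative sketch, not part of the original example) recomputes the iteration above and prints these ratios:

% Minimal sketch: recompute the iteration of Example 8.4 and print the
% ratios ||F(x^(k+1))|| / ||F(x^(k))||^2, which are roughly constant.
F  = @(x) [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4];
Fp = @(x) [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2];
x = [1; 2];
r_old = norm(F(x));
for k = 1:4
    x = x - Fp(x)\F(x);              % one Newton step
    r = norm(F(x));
    fprintf('k = %d, ratio = %.4g\n', k, r/r_old^2);
    r_old = r;
end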

An interesting aspect of this example is that different starting points lead to different approximate solutions. For example, with x^(0) = (3, 0.5)^T, the method converges to x^(5) ≈ (2.9984, 0.1484)^T.

The multivariate Newton method is subject to the same pitfalls as the univariate Newton method, as illustrated in Figure 2.6 (on page 55), where the analog of a horizontal tangent line is a singular (or ill-conditioned) F′.

The following matlab function implements a simplified version of Algorithm 8.1. (An electronic copy is available at http://interval.louisiana.edu/Classical-and-Modern-NA/newton_sys.m.)

function [x_star,success] = newton_sys (x0, f, f_prime, eps, maxitr)
%
% [x_star,success] = newton_sys(x0,f,f_prime,eps,maxitr)
% does iterations of Newton's method for systems,
% using x0 as initial guess, f (a character string giving
% an m-file name) as function, and f_prime (also a character
% string giving an m-file name) as the derivative of f.
% iteration stops successfully if ||f(x)|| < eps, and iteration
% stops unsuccessfully if maxitr iterations have been done
% without stopping successfully or if a zero derivative
% is encountered.
% On return:
%   success = 1 if iteration stopped successfully, and
%   success = 0 if iteration stopped unsuccessfully.
%   x_star is set to the approximate solution to f(x) = 0
%      if iteration stopped successfully, and x_star
%      is set to x0 otherwise.

success = 0;
x = x0;
for i=1:maxitr;
    fval = feval(f,x);
    i
    x
    norm_fval = norm(fval,2)
    if norm_fval < eps;
        success = 1;
        x_star = x;
        return;
    end;
    fpval = feval(f_prime,x);
    if fpval == 0;
        x_star = x0;
        return;   % stop unsuccessfully on a zero derivative
    end;
    v = fpval \(-fval);
    x = x + v;
end;
x_star = x0;
if (~success)
    disp('Warning: Maximum number of iterations reached');
end

The following theorem tells us that we can often expect Newton's method to be locally quadratically convergent.

THEOREM 8.2

Assume that F is defined on a subset of R^n, that the partial derivatives of each of the n components of F are continuous, that F(x^(∗)) = 0 (where 0 is interpreted to be the 0-vector here), that F′(x^(∗)) is non-singular, and that x^(∗) is in the interior of the domain of definition of F. Then Newton's method will converge to x^(∗) for all initial guesses x^(0) sufficiently close to x^(∗). Additionally, if for some constant c,

‖F′(x) − F′(x^(∗))‖ ≤ c‖x − x^(∗)‖   (8.7)

for all x in some neighborhood of x^(∗), then there exists a positive constant c̃ such that

‖x^(k+1) − x^(∗)‖ ≤ c̃ ‖x^(k) − x^(∗)‖^2.   (8.8)

A proof of a generalization of this theorem can be found in [1, Chapter 8]. A classic theorem on the convergence of Newton's method is the Newton–Kantorovich Theorem, which we do not give here, but which can also be found in our second-level text [1, Chapter 8].

The multivariate Newton method is applied in a broad range of contexts. For example, nonlinear systems of equations, with n very large, arise in the discretization of nonlinear partial differential equations, and Newton's method, combined with the use of banded or sparse matrix structures to solve the resulting linear systems, is used to solve these nonlinear systems. Newton's method is also used in implicit methods for solving stiff initial value problems involving nonlinear systems of ordinary differential equations.

Although there may be significant computational cost to evaluating the Jacobian matrix at each iteration of Newton's method, actually deriving it and coding it in a programming language is often unnecessary today, when using modern packages (such as those that incorporate automatic differentiation). Divided differences are still sometimes used to evaluate the partial derivatives in the Jacobian matrix, but that technique can be more costly, it is unclear in particular problems how to choose the step h, and it can lead to significantly inaccurate values.
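For concreteness, here is a minimal sketch of such a divided-difference approximation, showing where the step h enters. (This is our illustration; fd_jacobian is a hypothetical helper, not part of the text's software.)

function [ J ] = fd_jacobian( F, x, h )
% fd_jacobian (a hypothetical helper) approximates the Jacobian matrix
% of F at x by forward differences with step h, one column at a time.
n = length(x);
Fx = feval(F, x);
J = zeros(length(Fx), n);
for j = 1:n
    e = zeros(n, 1);
    e(j) = 1;
    J(:, j) = (feval(F, x + h*e) - Fx)/h;   % j-th column of the Jacobian
end
end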

For general nonlinear systems, it may be difficult to choose a starting guess x^(0). Furthermore, as is the case for Example 8.1, the system may have more than one solution, and all solutions may be desired. In such cases, Newton's method or related iterative methods are embedded in more sophisticated algorithms. We will discuss these algorithms after we explain general multivariate fixed point iteration.

8.3 Multidimensional Fixed Point Iteration

Multidimensional fixed-point iteration is a close multidimensional analogue of the univariate fixed point iteration method discussed in Section 2.2 on page 47, and it is used in various contexts. For example, the iterative methods for linear systems we discussed in Section 3.5 are examples of fixed point iteration. In fixed point iteration methods, just as in solutions of nonlinear systems of equations, we have a function

G(x) = (g_1(x_1, . . . , x_n), . . . , g_n(x_1, . . . , x_n))^T

with n components, each of which is a function of n variables. However, instead of finding a vector (x_1, . . . , x_n)^T at which f_i(x_1, . . . , x_n) = 0, 1 ≤ i ≤ n, we seek a vector x such that

x = G(x),   (8.9)

that is, x_i = g_i(x), 1 ≤ i ≤ n. Such x with x = G(x) are called fixed points of G. Note that, if F(x) = G(x) − x, solutions to F(x) = 0 are precisely fixed points of G. The fixed point equation (8.9) leads to the iteration scheme

x^(k+1) = G(x^(k)), k ≥ 0, x^(0) given in R^n.   (8.10)

Example 8.5

Just as we saw on page 55 for the univariate Newton method, the multivariate Newton method can be viewed as a multivariate fixed point iteration, with

G(x) = x − (F′(x))^{−1} F(x).


Example 8.6

In Example 3.18 (on page 93), we saw that we could discretize the boundary value problem

x″(t) = − sin(πt), x(0) = x(1) = 0

by replacing the second derivative with a central difference approximation, to obtain the linear system

[ 2 −1 0; −1 2 −1; 0 −1 2 ] (x_1; x_2; x_3) = (1/16) (sin(π/4); sin(π/2); sin(3π/4)).

We further saw in Example 3.30 (on page 121) that we could write this system as

(x_1; x_2; x_3) = [ 0 1/2 0; 1/2 0 1/2; 0 1/2 0 ] (x_1; x_2; x_3) + (1/32) (sin(π/4); sin(π/2); sin(3π/4)).

This represents a fixed point iteration, with

G(x) = [ 0 1/2 0; 1/2 0 1/2; 0 1/2 0 ] x + (1/32) (sin(π/4); sin(π/2); sin(3π/4)).

In this case, G(x) happens to be a linear function (although it is not a linear operator, since it has the constant term). Note that the function G(x) here is not to be confused with the matrix G we used in explaining iterative methods in Chapter 3, where, following the notation used in [44], we used G to denote the iteration matrix.

Example 8.7

If, instead of x″ = − sin(πt) as in Example 8.6, the differential equation were x″ = −e^x, replacing x″ by its central difference approximation as before gives the system

(x_1; x_2; x_3) = [ 0 1/2 0; 1/2 0 1/2; 0 1/2 0 ] (x_1; x_2; x_3) + (1/32) (e^{x_1}; e^{x_2}; e^{x_3}).

The fixed-point iteration function

G(x) = [ 0 1/2 0; 1/2 0 1/2; 0 1/2 0 ] x + (1/32) (e^{x_1}; e^{x_2}; e^{x_3})


is now nonlinear. We do a small experiment in matlab with fixed point iteration for this G(x):

>> x = [0;0;0]
x = 0
    0
    0
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));
(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.0313   % k = 1
    0.0313
    0.0313
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));
(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.0479   % k = 2
    0.0635
    0.0479
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));
(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.0645   % k = 3
    0.0812
    0.0645
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));
(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.0739   % k = 4
    0.0984
    0.0739
.
.
.
x = 0.1053   % k = 22
    0.1412
    0.1053
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));
(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.1053   % k = 23
    0.1413
    0.1053
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));
(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.1054   % k = 24
    0.1413
    0.1054
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));
(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.1054   % k = 25
    0.1414
    0.1054
>> x = [(1/2)*x(2) + (1/32)*exp(x(1));
(1/2)*x(1) + (1/2)*x(3) + (1/32)*exp(x(2));
(1/2)*x(2) + (1/32)*exp(x(3))]
x = 0.1054   % k = 26
    0.1414
    0.1054
>>

Linear convergence can be inferred from these computations.
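The repeated commands above can of course be wrapped in a loop; here is a minimal sketch (our condensation of the experiment, not from the original text):

% Minimal sketch: the fixed point iteration of Example 8.7 as a loop.
G = @(x) [ x(2)/2 + exp(x(1))/32;
           x(1)/2 + x(3)/2 + exp(x(2))/32;
           x(2)/2 + exp(x(3))/32 ];
x = [0; 0; 0];
for k = 1:26
    x = G(x);   % after about 26 iterations, x is (0.1054, 0.1414, 0.1054)'
end
x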

When will such a nonlinear multivariate fixed point iteration converge? This question is answered by a multivariate version of the contraction mapping theorem. (We saw the univariate version on page 49.)


DEFINITION 8.2 A mapping G : D ⊂ R^n → R^n is a contraction on a set D_0 ⊂ D if there is an α < 1 such that ‖G(x) − G(y)‖ ≤ α‖x − y‖ for all x, y ∈ D_0. (Here, the norm can be any norm.)

THEOREM 8.3

(Contraction Mapping Theorem) Suppose that G : D ⊂ R^n → R^n is a contraction on a closed set D_0 ⊂ D (recall that closed sets in n-space are simply sets that contain their boundaries, that is, sets that contain all of their limit points), and suppose that G : D_0 → D_0, i.e., if x ∈ D_0, then G(x) ∈ D_0. Then G has a unique fixed point x^∗ ∈ D_0. Moreover, for any x^(0) ∈ D_0, the iterates {x^(k)} defined by x^(k+1) = G(x^(k)) converge to x^∗. We also have the error estimates

‖x^(k) − x^∗‖ ≤ (α/(1 − α)) ‖x^(k) − x^(k−1)‖, k = 1, 2, · · · ,   (8.11)

‖x^(k) − x^∗‖ ≤ (α^k/(1 − α)) ‖G(x^(0)) − x^(0)‖,   (8.12)

where α is as in Definition 8.2.

The proof is similar to the proof of Theorem 2.3 on page 49, the univariate contraction mapping theorem, which we also gave without proof. Readers interested in the proof can consult [1] or other references.
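As an illustration (our sketch, not from the original text), the a posteriori estimate (8.11) can serve as a stopping criterion, with α estimated from successive differences; for the iteration of Example 8.7:

% Minimal sketch: estimate the contraction factor alpha from successive
% differences, and bound the error by (8.11), for Example 8.7's iteration.
G = @(x) [ x(2)/2 + exp(x(1))/32;
           x(1)/2 + x(3)/2 + exp(x(2))/32;
           x(2)/2 + exp(x(3))/32 ];
xold = [0; 0; 0];
x = G(xold);
d_old = norm(x - xold);
for k = 2:25
    xnew = G(x);
    d = norm(xnew - x);          % ||x^(k) - x^(k-1)||
    alpha = d/d_old;             % crude estimate of alpha
    xold = x;  x = xnew;  d_old = d;
end
err_bound = alpha/(1 - alpha)*d  % estimated bound on ||x^(k) - x*||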

Showing that G is a contraction can be facilitated by the following theorem, which requires that the set over which G is defined be convex:

DEFINITION 8.3 A set D_0 is said to be convex provided

λx + (1 − λ)y ∈ D_0 whenever x ∈ D_0, y ∈ D_0, and λ ∈ [0, 1].

For example, hyper-rectangles (that is, interval vectors, defined by lower and upper bounds on each variable) are convex.

THEOREM 8.4

Let D_0 be a convex subset of R^n, and let G be a mapping of D_0 into R^n whose components g_1, g_2, · · · , g_n have continuous and bounded first-order derivatives on D_0. Then the mapping G satisfies the Lipschitz condition

‖G(x) − G(y)‖ ≤ L‖x − y‖ for all x, y ∈ D_0,   (8.13)

where L = sup_{w ∈ D_0} ‖G′(w)‖ and G′ is the Jacobian matrix of G. If L ≤ α < 1, then G is a contraction on D_0.



Here, ‖ · ‖ signifies any vector norm and the corresponding induced matrix norm, i.e., ‖A‖ = sup_{‖x‖ ≠ 0} ‖Ax‖/‖x‖; thus, ‖Ax‖ ≤ ‖A‖‖x‖. The reader can consult our second-level text [1] for a proof of Theorem 8.4.

Example 8.8

We will apply Theorem 8.4 and Theorem 8.3 to Example 8.7. Take

D_0 = x = ([−1, 1], [−1, 1], [−1, 1])^T.

We have

G′(x) = [ (1/32)e^{x_1}  1/2  0; 1/2  (1/32)e^{x_2}  1/2; 0  1/2  (1/32)e^{x_3} ].

Applying interval arithmetic over x (we used intlab to obtain the interval enclosure of the range (1/32)e^{[−1,1]}), we obtain the naive interval evaluation of G′:

G′(x) ⊆ [ [0.0114, 0.0850]  0.5  0
          0.5  [0.0114, 0.0850]  0.5
          0  0.5  [0.0114, 0.0850] ].

Storing this matrix Gprime in matlab, the norm function for interval matrices provided with intlab gives

>> norm(Gprime,2)
intval ans = [ 0.6791, 0.7526]

This means that the largest induced 2-norm of any matrix in G′(x) is at most 0.7526 < 1, so Theorem 8.4 shows that G is a contraction over x = D_0. Furthermore, we perform an interval evaluation of G, with a mean value extension over x and centered at x^(0) = (0, 0, 0)^T, as follows:

>> x = [infsup(-1,1);infsup(-1,1);infsup(-1,1)]
intval x = [ -1.0000, 1.0000]
           [ -1.0000, 1.0000]
           [ -1.0000, 1.0000]
>> G0 = [1/32;1/32;1/32]
G0 = 0.0313
     0.0313
     0.0313
>> G = G0 + Gprime*x
intval G = [ -0.5537, 0.6162]
           [ -0.5537, 0.6162]
           [ -0.5537, 0.6162]


Since [−0.5537, 0.6162] ⊂ [−1, 1], this shows that G maps x = D_0 into D_0. Therefore, Theorem 8.3 implies that G has a unique fixed point in D_0, and the iterates defined by x^(k+1) = G(x^(k)) converge to this unique fixed point from any starting point whose coordinates are between −1 and 1.

8.4 Multivariate Interval Newton Methods

Multivariate interval Newton methods are similar to univariate interval Newton methods (as presented in Section 2.4, starting on page 56), in the sense that they provide rigorous bounds on solutions, in addition to existence and uniqueness proofs [20, 29]. Because of this, multivariate interval Newton methods have good potential for computing mathematically rigorous bounds on a solution to a nonlinear system of equations, given an approximate solution (computed, say, by a point Newton method). Interval Newton methods are also used as parts of more involved algorithms to find all solutions to a nonlinear system, or for global optimization. (See [1, Section 9.6.3].)

Most multivariate interval Newton methods follow a form similar to that of the multivariate point method seen in steps 2b and 2c of Algorithm 8.1. We summarize the algorithm and its theoretical properties here. For generalizations and details, see [1, Chapter 8].

Interval Newton methods can now be viewed as follows.

DEFINITION 8.4 Suppose F : D ⊆ R^n → R^n, suppose x ∈ D is an interval n-vector, suppose x̌ ∈ x is a point n-vector, and suppose that F′(x) is an interval extension of the Jacobian matrix of F (obtained, for example, by evaluating each component with interval arithmetic). Then a multivariate interval Newton operator for F is any mapping N(F, x, x̌) from the set of ordered pairs (x, x̌) of interval n-vectors x and point n-vectors x̌ to the set of interval n-vectors, such that

N(F, x, x̌) = x̌ + v,   (8.14)

where v ∈ IR^n is any box that bounds the solution set to the linear interval system

F′(x)v = −F(x̌).   (8.15)

(The matrix here can be somewhat more general than an interval extension of the Jacobian matrix; see, for example, [1], or, for even more details, [29].)

In implementations of interval Newton methods on computers, the vector F(x̌) is evaluated using interval arithmetic, even though the value sought is at a point. This is to take account of roundoff error, so the results will be



mathematically rigorous. Bounding the solution set to (8.15) may be done by interval Gaussian elimination (see Section 3.3.4 on page 111), the interval Gauss–Seidel method (see Section 3.5.5 on page 130), variants of these, or other methods, such as the Krawczyk method. See [1], [29], etc.

The following facts (stated as a theorem) form the basis of computations to prove existence and uniqueness of solutions, as well as to obtain mathematically rigorous bounds on solutions.

THEOREM 8.5

Suppose F has continuous first-order partial derivatives, and suppose N(F, x, x̌) is the image of the box x under an interval Newton method. Then:

1. Any solution x^∗ ∈ x of F(x) = 0 must also lie in N(F, x, x̌).

2. If, in addition, N(F, x, x̌) ⊂ x, then there exists an x^∗ ∈ x such that F(x^∗) = 0, and that x^∗ is unique. (A uniqueness theorem of this type can be stated in general for any interval Newton operator.)

This theorem is related to the contraction mapping theorem through a multivariate version of the mean value theorem and through the range inclusion properties of interval arithmetic. For details, see [1, Chapter 8], or, for a more comprehensive treatment, [29].

Example 8.9

Take the discretization from Example 8.7, but, instead of splitting it as in the Jacobi method, write the discretized system as

F(x) = [ −1 1/2 0; 1/2 −1 1/2; 0 1/2 −1 ] (x_1; x_2; x_3) + (1/32) (e^{x_1}; e^{x_2}; e^{x_3}) = 0.

The Jacobian matrix is then

F′(x) = [ −1 + (1/32)e^{x_1}  1/2  0
          1/2  −1 + (1/32)e^{x_2}  1/2
          0  1/2  −1 + (1/32)e^{x_3} ].

We first use newton_sys (see page 296) to find an approximate solution. The function and Jacobian matrix are programmed as follows.

function [ F ] = F_nonlinear_BVP( x )
A = [-1, 1/2, 0
     1/2, -1, 1/2
     0, 1/2, -1];
F = A*x + (1/32)*[exp(x(1));exp(x(2));exp(x(3))];
end

function [ Fp ] = F_prime_nonlinear_BVP( x )
Fp = [ -1 + (1/32)*exp(x(1)), 1/2, 0
       1/2, -1 + (1/32)*exp(x(2)), 1/2
       0, 1/2, -1 + (1/32)*exp(x(3))]
end

The matlab dialog (abridged) for Newton's method, using starting point x^(0) = (0, 0, 0)^T, is then as follows:

>> [x_star, success] = newton_sys([0;0;0], ...
'F_nonlinear_BVP','F_prime_nonlinear_BVP',1e-15,30)
i = 1
x = 0
    0
    0
norm_fval = 0.0541
Fp = -0.9688  0.5000  0
      0.5000 -0.9688  0.5000
      0       0.5000 -0.9688
.
.
.
i = 4
x = 0.1054
    0.1414
    0.1054
norm_fval = 1.3055e-016
x_star = 0.1054
         0.1414
         0.1054
success = 1

We now construct a box around x_star, then use intlab's overloading of the "\" operator to perform a step of an interval Newton method. (See the intlab help file for an explanation of which method is used to bound the solution set when the backslash operator is used with interval data.)

>> xx = midrad(x_star,0.1)
intval xx =
[ 0.0054, 0.2055]
[ 0.0414, 0.2415]
[ 0.0054, 0.2055]
>> xp = midrad(x_star,0)
intval xp =
0.1054
0.1414
0.1054
>> Fstar = F_nonlinear_BVP(xp)
intval Fstar =
1.0e-015 *
[ 0.0416, 0.0556]
[ 0.1040, 0.1180]


[ 0.0416, 0.0556]
>> Fp = F_prime_nonlinear_BVP(xx)
intval Fp =
-0.97__  0.5000  0.0000
 0.5000 -0.96__  0.5000
 0.0000  0.5000 -0.97__
>> v = -Fp\Fstar
intval v =
1.0e-015 *
[ 0.2104, 0.2653]
[ 0.3248, 0.3991]
[ 0.2104, 0.2653]
>> xx_new = x_star + v
intval xx_new =
0.1054
0.1414
0.1054
>> intvalinit('DisplayInfsup')
>> format long
>> xx_new
intval xx_new =
[ 0.10544866558254, 0.10544866558255]
[ 0.14144676492828, 0.14144676492829]
[ 0.10544866558254, 0.10544866558255]

This computation shows that there is a unique solution to the nonlinear system of equations within xx, a box with diameter approximately 0.2 centered on x_star, and that the coordinates of this solution lie within the bounds given by xx_new. Note that this means that we have at least 13 correct digits. If we wanted to demonstrate uniqueness within a larger box, we could do so:

>> xx = midrad(x_star,2)
intval xx =
[ -1.8946, 2.1055]
[ -1.8586, 2.1415]
[ -1.8946, 2.1055]
>> Fp = F_prime_nonlinear_BVP(xx)
intval Fp =
[ -0.9954, -0.7434] [ 0.5000, 0.5000] [ 0.0000, 0.0000]
[ 0.5000, 0.5000] [ -0.9952, -0.7340] [ 0.5000, 0.5000]
[ 0.0000, 0.0000] [ 0.5000, 0.5000] [ -0.9954, -0.7434]
>> v = -Fp\Fstar
intval v =
1.0e-014 *
[ -0.1409, 0.2184]
[ -0.1983, 0.3136]
[ -0.1409, 0.2184]
>> xx_new = x_star + v
intval xx_new =
[ 0.1054, 0.1055]
[ 0.1414, 0.1415]
[ 0.1054, 0.1055]
>> format long
>> xx_new
intval xx_new =
[ 0.10544866558254, 0.10544866558256]
[ 0.14144676492828, 0.14144676492830]
[ 0.10544866558254, 0.10544866558256]

This shows that the solution is unique within a box of radius 2 centered on x_star.

Note that we have proved existence and uniqueness, as well as computed bounds, for the solution to the nonlinear system of equations arising from the discretization of the boundary value problem. To show existence and uniqueness of the solution of the original boundary value problem, more work would need to be done. Indeed, discretizations sometimes have solutions not present in the original problem. Existence and uniqueness of solutions to the original problem can sometimes be proven, and bounds on the solutions to the original problem can sometimes be obtained, using interval Newton methods. However, the error in the discretization needs to be taken into account, and the process is more sophisticated than that given here. (Michael Plum, Mitsuhiro Nakao, and others have been active in developing such algorithms.)

8.5 Quasi-Newton (Multivariate Secant) Methods

In the 1960's and 1970's, much effort went into developing methods that had the advantages of Newton's method but avoided computing the matrix of partial derivatives. This was to avoid the inaccuracy of finite difference approximations (with the uncertainty in choosing the stepsize h) and the manual labor and possibility for blunders in deriving partial derivatives by hand. Automatic differentiation (see Section 6.2 on page 215) and other technology have since removed many of the concerns with using Newton's method directly, but replacing the Jacobian matrix F′ by approximations is still advisable in many situations, such as for very large systems, as an alternative to methods such as the conjugate gradient method (see [1, Section 3.4.10] or numerous other references). Such quasi-Newton methods are found in many widely available software packages, where they are combined with step controls that try to prevent the erratic divergence behavior of Newton's method. These methods can be considered generalizations, to multiple equations and variables, of the secant method for one variable.

Quasi-Newton methods have the general form of Newton's method, namely,

v^(k) = −(B^(k))^{−1} F(x^(k)),
x^(k+1) = x^(k) + t^(k) v^(k),   for k = 0, 1, 2, · · · ,   (8.16)

where B^(k) is an n × n matrix and t^(k) is a scalar. Using B^(k) = F′(x^(k)) and t^(k) = 1, (8.16) gives Newton's method.

A commonly used quasi-Newton method is Broyden's method. Broyden's method is designed around the following two conditions:

secant condition: B^(k+1)(x^(k+1) − x^(k)) = F(x^(k+1)) − F(x^(k)). (B^(k+1) reproduces the function change in the direction of the step.)

orthogonality or least change condition: B^(k+1)z = B^(k)z if z^T(x^(k+1) − x^(k)) = 0. (The effect of B^(k+1) is not altered in directions orthogonal to the step.)

These two conditions imply that B^(k+1) is given by

B^(k+1) = B^(k) + ( (y^(k) − B^(k) s^(k)) (s^(k))^T ) / ( (s^(k))^T s^(k) ),   (8.17)

where

y^(k) = F(x^(k+1)) − F(x^(k)) and s^(k) = x^(k+1) − x^(k).

Equation (8.17) provides what is known as the Broyden update to the approximate Jacobian matrix. Note that it is a rank-one update, in the sense that B^(k+1) is obtained from B^(k) by adding a rank-one matrix. See our second-level text [1] for details of the derivation, as well as for other quasi-Newton updates, such as symmetric and rank-two updates appropriate for optimization. Information can also be found in the research and review literature, such as [12], or in texts such as [11]. It is also possible to update the inverses of these matrices (say, computing (B^(k))^{−1} F(x^(k))) in fewer than O(n^3) operations; see the aforementioned references, and the sketch below.
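As a sketch of the idea (our illustration, not the text's derivation), the inverse can be carried along and updated with the Sherman–Morrison formula, so no linear system is solved at any step; here applied to the F of Example 8.9, with starting matrix equal to the identity:

% Minimal sketch: Broyden's method on the system of Example 8.9, keeping
% Hinv = inv(B) and applying the rank-one update to the inverse directly.
F = @(x) [-1, 1/2, 0; 1/2, -1, 1/2; 0, 1/2, -1]*x + (1/32)*exp(x);
x = [0; 0; 0];
Hinv = eye(3);                 % inverse of the starting matrix B0 = I
Fk = F(x);
for k = 1:25
    if norm(Fk) < 1e-10, break, end     % stop once converged
    s = -Hinv*Fk;              % quasi-Newton step, O(n^2) work
    x = x + s;
    Fkp1 = F(x);
    y = Fkp1 - Fk;
    Hy = Hinv*y;
    Hinv = Hinv + ((s - Hy)*(s'*Hinv))/(s'*Hy);  % inverse Broyden update
    Fk = Fkp1;
end
x   % approximately (0.1054, 0.1414, 0.1054)'

This update is algebraically equivalent to (8.17), provided (s^(k))^T Hinv y^(k) is nonzero.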

Interestingly, Broyden's method has a convergence rate that is faster than linear but does not exhibit the order-2 convergence of Newton's method. This is analogous to the convergence of the secant method for one variable, which exhibits a convergence order between 1 and 2.

Example 8.10

We will illustrate Broyden's method in matlab using F from Example 8.9, but without using F′. As before, we start the computation with x^(0) = (0, 0, 0)^T, but we also must supply an initial matrix B^(0). Traditionally, finite differences are used to approximate the Jacobian matrix at x^(0), or else B^(0) is set to the identity matrix. We'll try the latter, and we'll take t^(k) = 1. (Convergence may occur once a sufficient number of steps have been taken to build up a reasonable approximation to the action of F′. Also, t^(k) is often chosen to minimize ‖F‖ in the direction of v^(k).) We use the following matlab function

function [ xstar, success ] = simple_Broyden( x0, B0, F, eps, maxitr )
% [ xstar, success ] = simple_Broyden( x0, B0, F, eps, maxitr )
% is analogous to newton_sys: It computes up to maxitr iterations of
% Broyden's method for the function F, with starting point x0 and
% starting matrix B0. It stops when either the norm of F is less than
% eps or maxitr iterations have been exceeded.
%
xk = x0;
Fk = feval(F,xk);
norm_Fk = norm(Fk,2);
B = B0;
success = 0;
for k=1:maxitr
    k
    if (norm_Fk < eps)
        success = 1;
        xstar = xk;
        return
    end
    s = -B\Fk;
    xkp1 = xk+s;
    Fkp1 = feval(F,xkp1);
    y = Fkp1 - Fk;
    B = B + (y - B*s)*s'/(s'*s)
    xk = xkp1
    Fk = Fkp1;
    norm_Fk = norm(Fk,2)
end
success = 0;
xstar = xk;   % maxitr iterations done without meeting the tolerance
end

with the following results (abridged):

>> [xstar,success] = simple_Broyden([0;0;0],eye(3),...
'F_nonlinear_BVP',1e-10,50)
k = 1, norm_Fk = 0.0716
k = 2, norm_Fk = 0.1114
k = 3, norm_Fk = 0.2657
k = 4, norm_Fk = 7.9479e-004
k = 5, norm_Fk = 1.5246e-004
k = 6, norm_Fk = 1.5021e-005
k = 7, norm_Fk = 1.2445e-007
k = 8,
B =  0.0806  0.5263 -0.9194
     0.3942 -1.0081  0.3942
    -0.9194  0.5263  0.0806
xk = 0.1054
     0.1414
     0.1054
norm_Fk = 2.3587e-013
k = 9,
xstar = 0.1054
        0.1414
        0.1054
success = 1
>> F_prime_nonlinear_BVP(xstar)
ans = -0.9653  0.5000  0
       0.5000 -0.9640  0.5000
       0       0.5000 -0.9653

This illustrates the fast convergence. It also illustrates the fact that the matrix B^(k) need not converge to F′, even though x^(k) converges quickly to a solution of F(x) = 0.

8.6 Nonlinear Least Squares

Nonlinear least squares is a special type of optimization problem, involving nonlinear systems of equations, that is important in applications. The nonlinear case is similar to the linear case, which we treated in (3.18) (page 117) and (4.18) (page 172). We have m data points (t_i, y_i), as well as a model having n parameters {x_j}_{j=1}^n, and, in general, m is much larger than n. In nonlinear least squares, however, instead of having a model of the form

y ≈ f(t) = Σ_{j=1}^{n} x_j ϕ_j(t),

we assume that the parameters x_j occur nonlinearly, so the model is of the more general form

y ≈ f(t) = f(x_1, . . . , x_n; t),

while the optimization problem remains of the form

min over {x_j}_{j=1}^{n} of ϕ(x) = (1/2) Σ_{i=1}^{m} ( y_i − f(x_1, . . . , x_n; t_i) )^2.   (8.18)

(We include the factor of 1/2, even though it does not affect the optimal x, for simplicity, since a factor of 2 is introduced when we differentiate.)

The corresponding system of equations

f_i(x) = f(x_1, . . . , x_n; t_i) − y_i, 1 ≤ i ≤ m,   (8.19)

still has more equations than unknowns, but now is, additionally, nonlinear instead of linear. The "matrix" for this system of equations in this case can


be viewed not as the matrix

[ ϕ_1(t_1)  ϕ_2(t_1)  · · ·  ϕ_n(t_1)
  ϕ_1(t_2)  ϕ_2(t_2)  · · ·  ϕ_n(t_2)
  · · ·
  ϕ_1(t_m)  ϕ_2(t_m)  · · ·  ϕ_n(t_m) ]

as in the linear case, but as the Jacobian matrix of F = (f_1(x), . . . , f_m(x))^T. The components f_i of F are known as the residuals of the fit.

Example 8.11

Find a least squares fit of the form

y(t) = x_1 e^{x_2 t}

to the data

t    y
0    1.7
1    2.6
2    7.3

Here n = 2 and m = 3, and the least squares problem is nonlinear, since the parameter x_2 occurs nonlinearly in the expression for y(t). We have

F(x) = ( x_1 − 1.7; x_1 e^{x_2} − 2.6; x_1 e^{2x_2} − 7.3 ) = (0; 0; 0),

and the Jacobian matrix is

F′(x) = [ 1  0
          e^{x_2}  x_1 e^{x_2}
          e^{2x_2}  2x_1 e^{2x_2} ].

In the Gauss–Newton method, we do an iteration of the form (8.16), except that we use the pseudo-inverse (see page 135) of F′ in place of (B^(k))^{−1}. This is equivalent to replacing B^(k) by the matrix (F′)^T F′ corresponding to the normal equations (introduced in formula (3.22) on page 118).


Example 8.12

We will apply several iterations of the Gauss–Newton method to the nonlinear least squares problem of Example 8.11. We have programmed F in

function [ F ] = F_nonlinear_LS_example( x )

and F′ in

function [ Fp ] = F_prime_nonlinear_LS_example( x ).

Using the pattern in newton_sys and simple_Broyden, we have programmed the simple Gauss–Newton method as follows.

function [ xstar, success ] = simple_Gauss_Newton( x0, F, Fprime, eps, maxitr )
% [ xstar, success ] = simple_Gauss_Newton( x0, F, Fprime, eps, maxitr )
% is analogous to newton_sys: It computes up to maxitr iterations of
% the Gauss--Newton method for the function F, with starting point x0 and
% Jacobian matrix Fprime. It stops when either the norm of the step is less
% than eps or maxitr iterations have been exceeded.
%
% This is an illustrative function for Chapter 8 of Ackleh and Kearfott,
% "Applied Numerical Methods."
x = x0;
Fval = feval(F,x);      % (store the value in Fval, so the function
B = feval(Fprime,x);    %  name in F is not overwritten)
success = 0;
for k=1:maxitr
    k
    s = -pinv(B)*Fval;
    x = x+s;
    norm_step = norm(s,2)
    if (norm_step < eps)
        success = 1;
        xstar = x;
        return
    end
    Fval = feval(F,x);
    B = feval(Fprime,x);
end
success = 0;
xstar = x;
end

We obtain the following output (abridged):

>> [xstar,success] = simple_Gauss_Newton([0;0], ...
'F_nonlinear_LS_example','F_prime_nonlinear_LS_example', 1e-10,30)
k = 1, norm_step = 3.8667
k = 2, norm_step = 2.8921
k = 3, norm_step = 0.2766
k = 4, norm_step = 0.0483
k = 5, norm_step = 0.0113
k = 6, norm_step = 0.0015
k = 7, norm_step = 1.3465e-004
k = 8, norm_step = 1.2115e-005
k = 9, norm_step = 1.0948e-006
k = 10, norm_step = 9.8967e-008
k = 11, norm_step = 8.9469e-009
k = 12, norm_step = 8.0882e-010
k = 13, norm_step = 7.3119e-011
xstar =
1.2342
0.8832
success = 1
>>

Analyzing the convergence rate in the following table, we observe the convergence to be linear:

 k    ‖s^(k+1)‖/‖s^(k)‖
 1    0.7480
 2    0.0956
 3    0.1746
 4    0.2340
 5    0.0311
 6    0.0898
 7    0.0900
 8    0.0904
 9    0.0904
10    0.0904
11    0.0904
12    0.0904

A further computation illustrates that using the pseudo-inverse (through the matlab function pinv) is equivalent to computing the step s^(k) by solving the system

(F′(x))^T F′(x) s = −(F′(x))^T F(x)

that corresponds to the normal equations:

>> x = rand(2,1)

x =

0.8936

0.0579

>> Fp = F_prime_nonlinear_LS_example(x)

Fp =

1.0000 0

1.0596 0.9469

1.1228 2.0067

>> F= F_nonlinear_LS_example(x)

F =

-0.8064

-1.6531

-6.2967


>> s = - pinv(Fp)*F

s =

0.1913

2.7578

>> s_tilde = -(Fp’*Fp)\(Fp’*F)

s_tilde =

0.1913

2.7578

These computations of the step are equivalent when (F′(x))^T F′(x) is non-singular.

In fact, the Gauss–Newton method is in general only linearly convergent, with a convergence rate that depends on the norm of the residuals at the solution x^(∗) to which the iteration is converging; the convergence is slower when the minimum residual norm is larger. For faster convergence, Newton's method can be applied to the gradient of the function ϕ in (8.18). However, the Jacobian matrix of the gradient of ϕ contains second-order partial derivatives. In particular, if ϕ is as in (8.18) and F = (f_1, . . . , f_m)^T is as in (8.19), that is, if

ϕ(x) = (1/2) Σ_{i=1}^{m} f_i^2(x),

we have the nonlinear system of equations

∇ϕ(x) = ( ∂ϕ/∂x_1(x), . . . , ∂ϕ/∂x_n(x) )^T = (F′(x))^T F(x) = G(x) = 0,   (8.20)

and the Jacobian matrix of this system is

H(x) = G′(x) = (F′(x))^T F′(x) + Σ_{i=1}^{m} f_i(x) H_i(x),   (8.21)

where

H_i(x) = [ ∂^2 f_i/∂x_1^2 (x)      ∂^2 f_i/∂x_1∂x_2 (x)  . . .  ∂^2 f_i/∂x_1∂x_n (x)
           ∂^2 f_i/∂x_2∂x_1 (x)   ∂^2 f_i/∂x_2^2 (x)     . . .  ∂^2 f_i/∂x_2∂x_n (x)
           · · ·
           ∂^2 f_i/∂x_n∂x_1 (x)   ∂^2 f_i/∂x_n∂x_2 (x)   . . .  ∂^2 f_i/∂x_n^2 (x) ]   (8.22)

is known as the Hessian matrix of f_i, while H(x) is the Hessian matrix of ϕ.


Example 8.13

We will apply Newton's method to the system G(x) = 0 corresponding to Example 8.11. We have

H_1(x) = [ 0  0; 0  0 ],
H_2(x) = [ 0  e^{x_2}; e^{x_2}  x_1 e^{x_2} ],
H_3(x) = [ 0  2e^{2x_2}; 2e^{2x_2}  4x_1 e^{2x_2} ].

To apply Newton's method to the nonlinear system of equations G(x) = 0, we have created gradient_NLS_example and Hessian_NLS_example as follows. (This is not the most efficient way of programming this, since there are redundant calculations; however, this presentation makes the underlying mathematics clearer than the most efficient way would.)

function [ G ] = gradient_NLS_example( x )

F = F_nonlinear_LS_example(x);

Fp = F_prime_nonlinear_LS_example(x);

G = Fp’*F;

end

function [ H ] = Hessian_NLS_example( x )

Hi = zeros(2,2,3);

Hi(:,:,1) = [0 0

0 0];

Hi(:,:,2) = [ 0 exp(x(2))

exp(x(2)) x(1)*exp(x(2))];

Hi(:,:,3) = [ 0 2*exp(2*x(2))

2*exp(2*x(2)) 4*x(1)*exp(2*x(2))];

Fp = F_prime_nonlinear_LS_example(x);

F = F_nonlinear_LS_example(x);

H = Fp’*Fp + F(1)*Hi(:,:,1) + F(2)*Hi(:,:,2) + F(3)*Hi(:,:,3);

end

In fact, when newton_sys is started at x^(0) = (0, 0)^T, a point for which the Gauss–Newton method converged, Newton's method does not converge:

>> [xstar,success] = newton_sys([0;0],...

’gradient_NLS_example’,’Hessian_NLS_example’,1e-10,20)

i = 1

x = 0

0



i = 2
x = 0
   -0.6744
i = 3
x = 0
   -1.6364
i = 4
x = 0
   -3.9796
norm_fval = 1.7512
i = 5
x = 0
  -36.5880
i = 6
x = 1.0e+015 *
    0
   -5.0751
i = 7
x = 1.7000
    NaN

However, we observe quadratic convergence when we start Newton's method sufficiently close to the solution:

>> [xstar,success] = newton_sys([1.2;0.8],...

’gradient_NLS_example’,’Hessian_NLS_example’,1e-10,20)

i = 1, norm_fval = 17.4291

i = 2, norm_fval = 10.3042

i = 3, norm_fval = 0.6856

i = 4, norm_fval = 0.4781

i = 5, norm_fval = 0.0018

i = 6, norm_fval = 7.7870e-006

i = 7, norm_fval = 4.0175e-013

xstar =

1.2342

0.8832

success =

1


8.7 Methods for Finding All Solutions

To this point, we have discussed iterative methods for finding approximations to a single solution of a nonlinear system of equations. In many applications, finding all solutions to a nonlinear system is required. Salient among the methods for doing so are homotopy methods and branch and bound methods.

8.7.1 Homotopy Methods

In a homotopy method, one starts with a simple function g(x), g : D ⊆ R^n → R^n, such that every point with g(x) = 0 is known, then transforms that function into the function f(x), f : D ⊆ R^n → R^n, for which all points satisfying f(x) = 0 are desired. During the process, one solves various intermediate systems, using the solution to the previous system as an initial guess for an iterative method applied to the next system. An example of such a transformation is

H(x, t) = (1 − t)g(x) + tf(x),   (8.23)

so H(x, 0) = g(x) and H(x, 1) = f(x). One way of following the curves H(x, t) = 0 from t = 0 to t = 1 is to consider y = (x, t) ∈ R^{n+1} and to differentiate (8.23), obtaining

H′(y)y′ = 0,   (8.24)

where H′(y) is the n by n + 1 Jacobian matrix of H. Equation (8.24), along with some (possibly implicit) normalization condition N(y) = 0 (representing a parametrization of the curve y(t)), defines a derivative y′, so, in principle, methods and software for finding solutions to initial value problems for ordinary differential equations can be used to follow the curves of the homotopy. Indeed, this approach has been used. (A sketch of the simpler discrete continuation process appears below.)
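As a minimal sketch of the discrete continuation process described above (our illustration; the choices F, Fp, and g(x) = x − x^(0) below are our assumptions, reusing the F of Example 8.4):

% Minimal sketch: march t from 0 to 1, doing a few Newton steps on
% H(x,t) = (1-t)*g(x) + t*F(x) at each t, starting each solve from the
% solution found for the previous t.
F  = @(x) [x(1)^2 + x(1)*x(2)^3 - 9; 3*x(1)^2*x(2) - x(2)^3 - 4];
Fp = @(x) [2*x(1) + x(2)^3, 3*x(1)*x(2)^2; 6*x(1)*x(2), 3*x(1)^2 - 3*x(2)^2];
x0 = [1; 2];
g  = @(x) x - x0;              % simple function whose only zero, x0, is known
x  = x0;
for t = 0.1:0.1:1
    for j = 1:5                % a few Newton steps on H(x,t) = 0
        H  = (1-t)*g(x) + t*F(x);
        Hp = (1-t)*eye(2) + t*Fp(x);
        x  = x - Hp\H;
    end
end
x   % should be near the solution (1.3364, 1.7542)' found in Example 8.4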

Determining an appropriate homotopy H is crucial when finding all solutions to a system of equations. Particularly interesting is finding such an H for polynomial systems of equations, where there is an interplay between numerical analysis and algebraic geometry. Significant results were obtained during the 1980's; for example, see [28]. In such techniques, the homotopy is generally defined in a space derived from complex n-space, rather than real n-space.

We say more about homotopy methods in our section on software.

8.7.2 Branch and Bound Methods

Branch and bound methods, which we explain in some detail in [1, Section 9.6.3] in the context of global optimization, can also be used to solve systems of nonlinear equations. In this context, the equations F(x) = 0 can be considered as constraints, and the objective function can be, for example,

Σ_{i=1}^{n} f_i^2(x).

(A minimal sketch of this reformulation follows.)
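For instance (our sketch, not the text's software; note that fminsearch finds only one local minimizer, whereas a branch and bound method subdivides the whole box to locate and bound all of them):

% Minimal sketch: the system of Example 8.1 recast as minimization of the
% sum of squared residuals; a zero minimum value signals a solution.
phi = @(x) (x(1)^2 + x(1)*x(2)^3 - 9)^2 + (3*x(1)^2*x(2) - x(2)^3 - 4)^2;
[x, phival] = fminsearch(phi, [1; 2])   % one local minimizer only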

8.8 Software

Much software, both proprietary and public, that contains the basic methods we have introduced here as parts of more sophisticated schemes is available. In the matlab optimization toolbox, the function fsolve solves a system of nonlinear equations using a trust region algorithm, in which heuristics (that is, rules of thumb) are used to modify the length and direction of the Newton step −(F′(x))^{−1}F(x), to make it more likely that the iteration will converge for starting points x^(0) far away from a solution.

Example 8.14

Let us solve the problem from Example 8.7 with fsolve. fsolve has an option to use the Jacobian matrix, or to approximate the Jacobian matrix with finite differences if it is not available. We will use the Jacobian matrix, but we need to provide fsolve with a function whose output arguments contain both F and F′:

function [ F, Fp ] = F_and_Fp_nonlinear_BVP( x )
F = F_nonlinear_BVP(x);
Fp = F_prime_nonlinear_BVP(x);   % the second output must be assigned to Fp
end

We then have

>> options = optimset(’Jacobian’,’on’);

>> xstar = fsolve('F_and_Fp_nonlinear_BVP',[0;0;0],options)

Equation solved.

fsolve completed because the vector of function values is near zero

as measured by the default value of the function tolerance, and

the problem appears regular as measured by the gradient.

xstar =

0.1054

0.1414

0.1054



A freely available homotopy-method-based package for solving polynomial systems of equations is POLSYS_PLP [43].

Our GlobSol software [21] implements a branch and bound algorithm for finding all solutions to small global optimization problems and for finding all solutions to nonlinear systems of equations. GlobSol is freely available, although it continues to be under development as of the writing of this work. On problems for which GlobSol is able to finish its computations, it uses a domain subdivision process and interval arithmetic techniques to find tight, mathematically rigorous bounds on all solutions within a particular hyper-rectangle. As an example, let us reconsider Example 8.1, for which we previously found an approximation to one of the solutions using Newton's method.

Example 8.15

The function in Example 8.1 is programmed in GlobSol as a Fortran-90 program. (See [21] for details.) Starting with initial bounds x_1 ∈ [−10, 10], x_2 ∈ [−10, 10], and with the configuration of GlobSol appropriate for solving nonlinear systems of equations, we obtain the following output from GlobSol (abridged):

Output from FIND_GLOBAL_MIN on 06/29/2010 at 06:51:15.

Box data file name is: copatiopt.DT1

Initial box: [ -10, 10 ], [ -10, 10 ]

LIST OF BOXES CONTAINING VERIFIED FEASIBLE POINTS:

Box no.: 1

Box coordinates: [ -3.01, -3 ], [ .147, .149 ]

Box no.: 2

Box coordinates: [ -.902, -.901 ], [ -2.09, -2.08 ]

Box no.: 3

Box coordinates: [ 1.33, 1.34 ], [ 1.75, 1.76 ]

Box no.: 4

Box coordinates: [ 2.99, 3 ], [ .148, .149 ]

Number of bisections: 10

Total number of boxes processed in loop: 23

Overall CPU time: .06

The function lsqnonlin in matlab's optimization toolbox computes solutions to nonlinear least squares problems, using a choice of a Gauss–Newton algorithm or other techniques, combined with trust regions. An example of its use is in the following applications section.

8.9 Applications

In this section, we give an example that involves parameter estimation using the nonlinear least-squares routine lsqnonlin in matlab. The following nonlinear system [3] is a stage-structured discrete-time model describing the dynamics of a population whose life cycle can be divided into three stages: juvenile, non-breeder adult, and breeder adult. For example, in a green tree frog population, each individual can be classified as a tadpole, a sexually immature frog, or a sexually mature frog. In the equations, J(t), N(t), and B(t) denote the sizes of the juvenile, non-breeder, and breeder populations at time t, respectively. The survivorship functions s_i, i = 1, 2, 3, of the stages and the birth rate function b(t) are assumed to be time-dependent functions. The parameters γ_1, γ_2 ∈ (0, 1] represent the fraction of juveniles that become non-breeders and the fraction of non-breeders that become breeders, respectively. We have

J(t + 1) = b(t)B(t) + (1 − γ_1)s_1(J(t))J(t),
N(t + 1) = γ_1 s_1(J(t))J(t) + (1 − γ_2)s_2(N(t) + B(t))N(t),
B(t + 1) = γ_2 s_2(N(t) + B(t))N(t) + s_3(N(t) + B(t))B(t).   (8.25)

Now, suppose the survivorship functions are of the form

s_1(J(t)) = a_1 / (1 + k_1 J(t)),
s_2(N(t) + B(t)) = a_2 / (1 + k_2(N(t) + B(t))),
s_3(N(t) + B(t)) = a_3 / (1 + k_3(N(t) + B(t))),

where the a_i and k_i are unknown positive parameters, for i = 1, 2, 3. We also assume that the breeders give birth periodically. For our specific problem, we let b(t) = b_max > 0 for t = 1, 2, . . . , 26, b(t) = 0 for t = 27, 28, . . . , 52, and so forth. Here, we choose one week as the time unit and start counting the population from the beginning of its breeding season, which lasts for about 26 weeks; during the next 26 weeks, the population has no new births.

Next, we want to show how to estimate the seven parameters b_max, a_i, and k_i, i = 1, 2, 3, by using a set of data points. Instead of using real data in our example, we generate the data points as follows: First, we prearrange the values of the parameters in the nonlinear system (8.25) and compute the corresponding solutions. We use the total number of adults (N + B) in the population, and we allow the data values to be close to, but have small random deviations from, the solutions. We then invoke the lsqnonlin function in matlab to find the best estimates for the parameters, and compare them to the actual ones.

The following matlab code, with comments added, demonstrates the steps above. In our experiment, we set γ_1 = 1/5 (it takes about 5 weeks on average for a tadpole to become an immature frog) and γ_2 = 1/52 (it takes about a year on average for a sexually immature frog to become mature). We prearrange the other parameters to be b_max = 30, a_1 = 0.8, k_1 = 0.002, a_2 = 0.9, k_2 = 0.001, a_3 = 0.7, k_3 = 0.004.

clear all

J(1)=5;

N(1)=15;

B(1)=10;

S(1)=N(1)+B(1); % set up the initial values

gamma_1=1/5;

gamma_2=1/52;

b_max=30; % prearrange the values of the parameters

a1=0.8;

a2=0.9;

a3=0.7;

k1=0.002;

k2=0.001;

k3=0.004;

T=150; % number of iterations

for t=1:T

b(t)=b_max*(1-mod(floor((t-1)/26),2)); % periodic birth rate

s1(t)=a1/(1+k1*J(t));

s2(t)=a2/(1+k2*(N(t)+B(t)));

s3(t)=a3/(1+k3*(N(t)+B(t)));

J(t+1)=b(t)*B(t)+(1-gamma_1)*s1(t)*J(t);

N(t+1)=gamma_1*s1(t)*J(t)+(1-gamma_2)*s2(t)*N(t);

B(t+1)=gamma_2*s2(t)*N(t)+s3(t)*B(t);

S(t+1)=N(t+1)+B(t+1);

end

Q = S.*(1+0.1*randn(1,T+1)); % add small deviation

x=1:(T+1);

plot(x,S,’r-’,x,Q,’linewidth’,1.5) % plot the actual values and

%% the generated data

global Q

lb=zeros(1,7); % lower bound for the estimates

Page 335: Undergraduate Text

322 Applied Numerical Methods

ub=[50,1,1,1,1,1,1]; % upper bound for the estimates

v=rand(1,7);

P_0=lb.*(1-v)+v.*ub; % initial guess of the parameter vector

[parameters, LS, o]=lsqnonlin(@lsquare_errors, P_0, lb, ub);

parameters % display the estimate of the parameters

LS % shows the sum of the least squares errors

JJ(1)=5; % use the estimate of the parameters to

NN(1)=15; %% compute the corresponding solutions

BB(1)=10;

for t=1:T

bb(t)=parameters(1)*(1-mod(floor((t-1)/26),2));

ss1(t)=parameters(2)/(1+parameters(3)*JJ(t));

ss2(t)=parameters(4)/(1+parameters(5)*(NN(t)+BB(t)));

ss3(t)=parameters(6)/(1+parameters(7)*(NN(t)+BB(t)));

JJ(t+1)=bb(t)*BB(t)+(1-gamma_1)*ss1(t)*JJ(t);

NN(t+1)=gamma_1*ss1(t)*JJ(t)+(1-gamma_2)*ss2(t)*NN(t);

BB(t+1)=gamma_2*ss2(t)*NN(t)+ss3(t)*BB(t);

SS(t+1)=NN(t+1)+BB(t+1);

end

figure

plot(x,S,’r-’,x,SS,’b-.’,’linewidth’,2) % plot both the solutions with

%% the original and the estimated

%% parameters.

function L=lsquare_errors(p) % the function computes the sum of

%% the least square errors

global Q

J(1)=5;

N(1)=15;

B(1)=10;

gamma_1=1/5;

gamma_2=1/52;

T=150;

for t=1:T

b(t)=p(1)*(1-mod(floor((t-1)/26),2));

s1(t)=p(2)/(1+p(3)*J(t));

s2(t)=p(4)/(1+p(5)*(N(t)+B(t)));

s3(t)=p(6)/(1+p(7)*(N(t)+B(t)));

J(t+1)=b(t)*B(t)+(1-gamma_1)*s1(t)*J(t);

N(t+1)=gamma_1*s1(t)*J(t)+(1-gamma_2)*s2(t)*N(t);

B(t+1)=gamma_2*s2(t)*N(t)+s3(t)*B(t);

S(t+1)=N(t+1)+B(t+1);

end

L=0;

Page 336: Undergraduate Text

Numerical Solution of Systems of Nonlinear Equations 323

LS=0;

for i=1:T

L(i+1)=(S(i+1)-Q(i+1));

LS=LS+L(i)^2;

end

Note that we will get different results each time we run the code, because the generated data are random and not unique. Below are several results displayed in the command window after we run the code:

Maximum number of function evaluations exceeded;

increase options.MaxFunEvals

parameters =

39.9582 0.8215 0.0020 0.8885 0.0008 0.6380 0.0127

LS =

3.8691e+003

Maximum number of function evaluations exceeded;

increase options.MaxFunEvals

parameters =

41.2113 0.8232 0.0022 0.8989 0.0009 0.5560 0.0092

LS =

3.6080e+003

Maximum number of function evaluations exceeded;

increase options.MaxFunEvals

parameters =

44.2568 0.7712 0.0020 0.8935 0.0008 0.6417 0.0139

LS =

5.0750e+003

The following figure is based on our first result, i.e.,

parameters =

39.9582 0.8215 0.0020 0.8885 0.0008 0.6380 0.0127

[Figure: two plots of the population of the adults versus t, for 0 ≤ t ≤ 160. The left panel shows the "Actual" and "Generated" populations; the right panel shows the "Actual" and "Estimated" populations.]

The plot on the left compares the generated data with the adult population sizes obtained from the original nonlinear system, and the one on the right shows how well the equations with the estimated parameters fit the given data set.


8.10 Exercises

1. Use the univariate mean value theorem (which you can find stated as Theorem 1.4 on page 5) to prove the multivariate mean value theorem (stated as Theorem 8.1 on page 293).

2. Write down the degree-2 Taylor polynomials for f_1(x_1, x_2) and f_2(x_1, x_2), centered at x̄ = (x̄_1, x̄_2) = (0, 0), for F as in Example 8.2. Lumping terms together in an appropriate way, interpret your values in terms of the Jacobian matrix and a second-derivative tensor.

3. Let F be as in Example 8.2 (on page 292), and define

G(x) = x − Y F(x), where Y ≈ [ 2.0030  1.2399; 0.0262  0.0767 ].

Do several iterations of fixed point iteration, starting with initial guess x^(0) = (8.0, −0.9)^T. What do you observe?

4. The nonlinear system

x_1^2 − 10x_1 + x_2^2 + 8 = 0,
x_1 x_2^2 + x_1 − 10x_2 + 8 = 0

can be transformed into the fixed-point problem

x_1 = g_1(x_1, x_2) = (x_1^2 + x_2^2 + 8)/10,
x_2 = g_2(x_1, x_2) = (x_1 x_2^2 + x_1 + 8)/10.

Perform 4 iterations of the fixed-point method on this problem, with initial vector x^(0) = (0.5, 0.5)^T.

5. Univariate Newton iteration applied to find complex roots of

f(x + iy) = u(x, y) + iv(x, y) = 0

is equivalent to multivariate Newton iteration with the functions

f_1(x, y) = u(x, y) = 0 and f_2(x, y) = v(x, y) = 0.

(a) Repeat Exercise 12 on page 67, except doing the iterations on the corresponding system u(x, y) = 0, v(x, y) = 0 of two equations in two unknowns.


(b) Compare the results, number by number, to the results you obtained in Exercise 12 on page 67.

6. Consider solving the nonlinear system

x_1^2 − 10x_1 + x_2^2 + 8 = 0,
x_1 x_2^2 + x_1 − 10x_2 + 8 = 0.

Experiment with Newton's method, with various initial vectors, and discuss what you observe.

7. Consider finding the minimum of

f(x_1, x_2) = e^{x_1} + e^{x_2} − x_1 x_2 + x_1^2 + x_2^2 − x_1 − x_2 + 4

on R^2. Experiment with Newton's method:

x^(k+1) = x^(k) − ( ∇^2 f(x^(k)) )^{−1} ∇f(x^(k)).

(That is, try Newton's method to compute zeros of the gradient.) What can you surmise from your experiments?

8. Let F be as in Exercise 5 on page 324, let x = ([−0.1, 0.2], [0.8, 1.1])T ,and x = (0.05, 0.95)T .

(a) Apply several iterations of the interval Gauss–Seidel method; in-terpret your results.

(b) Apply several iterations of the Krawczyk method; interpret yourresults.

(c) Apply several iterations of the interval Newton method you obtainby using the linear system solution bounder verifylss in intlab.

9. Let F be as in Exercise 5 on page 324. Do several iterations of Broyden's method, using the same starting points as you did for Exercise 5; observe not only x(k), but also Bk. What do you observe? Do you observe superlinear convergence?
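A minimal sketch of Broyden's method, assuming F is a (hypothetical) function handle and taking the identity as the initial Jacobian approximation:

% Hedged sketch of Broyden's method for Exercise 9; F and x0 are
% hypothetical names for the function handle and starting point.
x = x0;  B = eye(numel(x0));  Fx = F(x);
for k = 1:15
    s = -B \ Fx;                          % quasi-Newton step
    x = x + s;
    Fnew = F(x);
    y = Fnew - Fx;
    B = B + ((y - B*s) * s') / (s' * s);  % Broyden rank-one update
    Fx = Fnew;
    disp([x' reshape(B, 1, [])])          % watch x(k) and Bk
end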

10. Consider the nonlinear system of equations from Example 8.1 at the beginning of this chapter. In Example 8.15 on page 319, we presented mathematically proven enclosures for the four solutions of this nonlinear system obtained with the GlobSol software system.

(a) Use the raw Newton's method (as in newton_sys on page 296) with different starting guesses, to see if you can find approximations to each of the four solutions. Describe what you have found.

(b) Proceed as in part (a), but using fsolve from the matlab optimization toolbox, without using a Jacobian matrix.


(c) Proceed as in part (b), but using fsolve, with the Jacobian matrix.
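For parts (b) and (c), the calls might look like the following sketch, where Ffun (a hypothetical name) returns F(x) and, when its second output is requested, the Jacobian matrix:

% Hedged sketch for parts (b) and (c) of Exercise 10.
x_b = fsolve(@Ffun, x0)               % (b): fsolve approximates the Jacobian
opts = optimset('Jacobian', 'on');    % (c): tell fsolve to use Ffun's Jacobian
x_c = fsolve(@Ffun, x0, opts)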

11. Satisfactory solutions to some (but not all) algebraic systems of equations can be obtained symbolically, without roundoff error, using computer algebra systems such as Mathematica or Maple. For example, to solve the system in Example 8.15 in Mathematica, one possibility would be to use Solve as follows.

solutions = Solve[{x1^2 + x1*x2^3 - 9 == 0,
                   3 x1^2*x2 - x2^3 - 4 == 0},
                  {x1, x2}]

If you have access to a computer algebra system, try obtaining solutions to the system in Example 8.15. Explain what you observe.

12. Proceeding as in Example 8.7 (on page 299), compute approximate solutions to the boundary value problem x'' = −e^x, x(0) = x(1) = 0, with N = 8, 50, 100, and 500. (Hint: Write a routine to do this, rather than repeating commands in the command window. Also, use matlab's sparse matrix structure, for the computation will otherwise be too lengthy. Plot your results using matlab's plot routine, and comment on the results.)
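One hedged way to set this up (not necessarily identical to Example 8.7) is Newton's method on the centered-difference discretization, with the coefficient matrix stored in matlab's sparse format:

% Hedged sketch for Exercise 12: Newton iteration on the discretized
% system (x_{i-1} - 2 x_i + x_{i+1})/h^2 + exp(x_i) = 0, i = 1, ..., N.
N = 100;  h = 1/(N + 1);
A = spdiags(ones(N,1)*[1 -2 1], -1:1, N, N) / h^2;  % sparse 2nd-difference matrix
x = zeros(N, 1);
for k = 1:20
    Fx = A*x + exp(x);                  % residual of the nonlinear system
    Jx = A + spdiags(exp(x), 0, N, N);  % Jacobian: A + diag(exp(x_i))
    x = x - Jx \ Fx;
end
t = (0:N+1)' * h;
plot(t, [0; x; 0])                      % include the boundary values
xlabel('t')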

13. Consider the data from Example 8.11 (on page 311).

(a) Use the routine fminimax from matlab's optimization toolbox to find the minimax fit to the data. Do this by defining the functions fi(t) = y(ti) − yi, i = 1, 2, 3, and no constraints.

(b) Use fmincon to find the minimax solution. Do this by reformulating the problem as a constrained optimization problem as follows:

   minimize v
   subject to:
      v ≥ y(t1) − y1,    v ≥ −(y(t1) − y1),
      v ≥ y(t2) − y2,    v ≥ −(y(t2) − y2),
      v ≥ y(t3) − y3,    v ≥ −(y(t3) − y3).

(c) Redo Example 8.11 in the following ways.

(i) Use fminunc with objective function

   f(x) = ∑_{i=1}^{3} (yi − y(ti))^2.


(ii) Use lsqcurvefit from matlab’s optimization toolbox.

(iii) Use lsqnonlin from matlab’s optimization toolbox.

(d) Compare the solutions from (a) and (b). (They should be the same to within the stopping tolerances used.) Also compare the three solutions from (c); the solutions from (c)(i), (c)(ii), and (c)(iii) should be the same to within the stopping tolerances.

(e) Plot the solutions from (b) and (c) and the points {(ti, yi)}, i = 1, 2, 3, on the same plot, using matlab's plot routine. (Examples of the use of plot are Example 4.11, page 167, etc.)
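A hedged sketch of the constrained reformulation in part (b), where ymodel(p, t) is a hypothetical function computing y(t) for parameter vector p, and tdata, ydata, p0 are hypothetical names for the data and the starting parameters:

% Hedged sketch of part (b): minimize v subject to |y(t_i) - y_i| <= v.
% The unknown vector is z = [p; v].
resid = @(p) [ ymodel(p, tdata(1)) - ydata(1);
               ymodel(p, tdata(2)) - ydata(2);
               ymodel(p, tdata(3)) - ydata(3) ];
confun = @(z) deal([ resid(z(1:end-1)) - z(end);
                    -resid(z(1:end-1)) - z(end)], []);  % c <= 0, no ceq
obj = @(z) z(end);                                      % minimize v
z = fmincon(obj, [p0; 0], [], [], [], [], [], [], confun);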

14. Consider the nonlinear systems in Problems 6 and 7 and from Example 8.1.

(a) If the symbolic math toolbox from matlab is available to you, reprogram newton_sys (on page 296) to create a function newton_sys_symbolic that does not need the argument f_prime. Make the program as general as you can (that is, so that it possibly handles n variables, with n not specified beforehand). Hint: you may wish to use the function jacobian, and you may want to look at the section “Generating Code from Symbolic Expressions” in the symbolic math toolbox. Try your function on the three examples. Does it work the way you expected? Are the generated functions for the Jacobian matrix similar to the ones you would write by hand?

(b) Use INTLAB's “gradient” data type to modify newton_sys as in part (a) of this problem, using automatic differentiation instead of symbolic differentiation, to create a function newton_sys_automatic that does not need the argument f_prime. You can consult the “Gradients: automatic differentiation” section of the INTLAB demo package for the syntax you can use. Try newton_sys_automatic on the same three nonlinear systems as in part (a). Does newton_sys_automatic give the same results as newton_sys_symbolic and as newton_sys?

(c) Use central differences

   ∂fi/∂xj (x) ≈ [fi(x1, ..., xj + h, ..., xn) − fi(x1, ..., xj − h, ..., xn)] / (2h)

in a routine newton_sys_central_difference that does not need the argument f_prime. Does newton_sys_central_difference give the same results as newton_sys_symbolic or newton_sys_automatic? (Try it on the three problems, with different h.)
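For part (c), the finite-difference Jacobian itself might be computed by a helper like this sketch (function and variable names hypothetical):

function J = jac_central(F, x, h)
% Hedged sketch: central-difference approximation to the Jacobian of F
% at x, one column per variable, for use inside a routine such as
% newton_sys_central_difference.
n = numel(x);
J = zeros(n);
for j = 1:n
    e = zeros(n, 1);
    e(j) = h;                              % perturb only the j-th variable
    J(:, j) = (F(x + e) - F(x - e)) / (2*h);
end
end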


Index

absolute error, 14
absolute stability, 264
  methods for systems, 277
Adams–Bashforth methods, 269
  3-step, 270
Adams–Moulton implicit method, 271
adaptive quadrature, 239
anonymous function, 244
augmented matrix, 80
automatic differentiation, 215
  forward mode, 216
  reverse mode, 218

B-spline, 164
back substitution, 92
back-substitution, 79
backward error analysis, 110
banded matrices, 97
basis
  collocating, 149
basis functions, 117
big-O notation, 6
bisection
  method of, 39
black box function, 60
boundary value problem, 258
branch and bound algorithm, 239
  for nonlinear systems of equations, 317
Broyden update, 308

C, 30
C++, 30
Cauchy–Schwarz inequality, 101
central difference formula, 212
characteristic polynomial, 191
Chebyshev
  equi-oscillation property, 157
  polynomial, 157
Cholesky factorization, 93
clamped spline interpolant, 165
code list, 216
collocating basis, 149, 160
column vector, 70
compatible matrix and vector norms, 103
composite quadrature rule, 235
condition
  ill, 106
  number
    generalized, 139
    of a function, 17
    of a matrix, 106
  perfect, 108
contraction, 48
Contraction Mapping Theorem, 301
  in one variable, 49
convergence
  iterative method for linear systems, 123
  linear, 7
  of a sequence of vectors, 123
  order of, 7
  quadratic, 7
convex
  set, 301
correct rounding, 24
Cramer's rule, 86
cubic spline, 163

defective matrix, 193
dense matrix, 98
dependency, interval, 28
derivative tensor, 293
determinant, of a matrix, 86
diagonally dominant
  strictly, 88
differentiation
  automatic, 215
direct method, linear systems of equations, 69
distance
  in a normed space, 101
divided difference
  k-th order, 150
  first order, 150
  Newton's backward formula, 152
  Newton's formula, 151
dot product, 72

eigenvalue, 77
  simple, 196
eigenvector, 77
elementary row operations
  for linear systems, 79
equi-oscillation property, 157
equilibration, row, 109
equivalent norms, 101
error
  absolute, 14
  backward analysis, 110
  forward analysis, 110
  method, 10
  relative, 14
  roundoff, 10
  roundout, 26
  truncation, 10
Euclidean norm, 101
Euler's method, 254
excess width, 28
expansion by minors, 76
extended real numbers, 25

Fast Fourier Transform, 179
FFT, 179
fill-in, sparse matrix, 99
fixed point, 47, 298
  iteration method, 47
floating point numbers, 11
fortran, 30
forward difference formula, 152, 211
forward error analysis, 14, 110
forward mode, automatic differentiation, 216
Fourier analysis, 179
Fréchet derivative, 292
Frobenius norm, 103
full matrix, 98
full pivoting
  Gaussian elimination, 90
full rank matrix, 75
function
  matlab, 32
functional programming, 32
fundamental theorem of interval arithmetic, 27

Gauss–Hermite quadrature, 229
Gauss–Laguerre formula, 228
Gauss–Laguerre quadrature, 228
Gauss–Legendre quadrature, 227
Gauss–Newton method, 311
Gauss–Seidel method, 124
Gaussian elimination, 79
  full pivoting, 90
  partial pivoting, 91
  pivoting, 90
Gaussian quadrature, 223
  2-point, 224
generalized condition number, 139
Gerschgorin's Circle Theorem, 194
  for Hermitian matrices, 196
Givens rotation, 206
global error
  of a method for integrating an initial value problem, 266
GlobSol, 181

hat functions, 161
heat equation, 258
Hermite polynomials, 229
Hermitian matrix, 76, 194
Hessenberg form, 204
Hessian matrix, 314
Hilbert matrix, 108
homotopy method, 317
HUGE, 20

identity matrix, 73
IEEE arithmetic, 19
ill-conditioned, 106
implicit
  Euler method, 279
implicit trapezoid method, 287
improper integrals, 244
infinite integrals, 244
initial value problems, 253
integration, 221
  infinite, 244
  multiple, 242
  singular, 244
interpolant
  piecewise linear, 159
interpolating polynomial
  Lagrange form, 149
  Newton form, 151
interval arithmetic
  fundamental theorem of, 27
  operational definitions, 25
interval dependency, 28
interval extension
  first order, 29
  second order, 29
interval Newton
  operator, 303
  univariate, 57
interval Newton method
  multivariate, 303
  quadratic convergence of, 59
  univariate, 57
INTLAB, 31, 35, 63, 113, 132, 144, 188, 220, 246, 325
inverse
  of a matrix, 73, 86
inverse midpoint matrix, 131
inverse power method, 202
invertible matrix, 73
iterative method
  linear system of equations, 123
IVP, 253

Jacobi diagonalization, 205
Jacobi method, 123
  for computing eigenvalues, 205
Jacobi rotation, 206
Jacobian matrix, 291, 292

Kantorovich Theorem, 297
Kronecker delta function, 87, 116

Lagrange
  basis, 149
  polynomial interpolation, 149
Laguerre polynomials, 228
Laplacian operator, 258
least squares
  approximation, 117
least squares approximation, 172
left singular vector, 135
Legendre polynomials, 227
Lemaréchal's technique, 173, 175
linear algebra, numerical, 69
linear convergence, 7
linear model, 292
linearly independent
  vectors, 75
Lipschitz condition, 48
logistic equation, 247
LU
  decomposition, 86
  factorization, 86

m-file, 45
machine constants, 20
machine epsilon, 20
mag, 132
magnitude (of an interval), 132
mantissa, 11
Maple, 32
Mathematica, 32
Matlab, 30
  function, 32, 45
  m-file, 45
  script, 32
matrix
  banded, 97
  dense, 98
  determinant of, 86
  full, 98
  inverse of, 86
  orthogonal, 116
  permutation, 88
  singular, 74
  sparse, 98
    fill-in, 99
  upper triangular, 82
  Vandermonde, 147
matrix (definition), 70
matrix multiplication, 70
matrix norm, 103
  compatible, 103
  Frobenius, 103
  induced, 104
  natural, 104
mean value theorem
  for integrals, 2
  multivariate, 293
  univariate, 5
method error, 10
method of bisection, 39
method of lines, 258
midpoint method
  for solution of initial value problems, 261
midpoint rule
  for quadrature, 227
mig, 132
mignitude (of an interval), 132
Moore–Penrose pseudo-inverse, 135
multiple integrals, 242
multiplication
  matrix, 70
multivariate interval Newton operator, 303
multivariate mean value theorem, 293

NaN, 21
natural or induced matrix norm, 104
natural spline, 165
Newton's backward difference formula, 152
Newton's divided difference formula, 151
Newton's forward difference formula, 152
Newton's method
  multivariate
    local convergence of, 297
  univariate, 54
Newton–Cotes formulas, 222
  closed, 222
  open, 222
Newton–Kantorovich Theorem, 297
nonlinear least squares, 310
  Gauss–Newton method, 311
nonsingular matrix, 73
norm, 100
  equivalent, 101
  Euclidean, 101
  important ones on C^n, 100
  matrix, 103
    compatible, 103
    Frobenius, 103
    induced, 104
    natural, 104
  scaled, 101
normal distribution
  standard, 151
normal equations, 118
not a number, 21
numerical linear algebra, 69
numerical stability, 100

object-oriented programming, 31
Octave, 31
operator overloading, 220
order
  of a single-step method for solving an IVP, 256
  of convergence, 7
origin shifts, 205
orthogonal
  matrix, 116
orthogonal decomposition, 116
orthogonal vectors, 116
orthonormal vectors, 116
outlier, 186
outliers, 175
outward rounding, 26
overflow, 20
overloading, operator, 220
overrelaxation factor, 126

partial pivoting
  Gaussian elimination, 91
perfectly conditioned, 108
permutation matrix, 88
piecewise linear interpolant, 159
pivoting, in Gaussian elimination, 90
plane rotation, 206
polynomial interpolation, 146
positive
  definite, 76
  semi-definite, 76
preconditioning, 130
predictor-corrector methods
  for solving initial value problems, 272
product formula, 242
pseudo-inverse, 135

QR
  decomposition, 116
  factorization, 116
  method, 204
quadratic convergence, 7
quadrature, 221
  Gauss–Hermite, 229
  Gauss–Laguerre, 228
  Gaussian, 223
    2-point, 224
  midpoint rule, 227
  Newton–Cotes, 222
  product formula, 242
quadrature rule
  composite, 235
quasi-Newton methods, 307

R-stage Runge–Kutta method, 262
rank
  of a matrix, 75
rank-one update, 308
Rayleigh quotient, 203
recursion, 240
regression
  robust, 177
relative error, 14
residual, 80
residuals, 311
Richardson extrapolation, 236
right singular vector, 135
robust fit, 177
Romberg integration, 236
round
  down, 11
  to nearest, 12
  to zero, 12
  up, 12
rounding modes, 20
roundoff error, 10
  in Gaussian elimination, 110
roundout error, 26
row equilibration, 109
Runge's function, 155
Runge–Kutta method
  fourth order classic, 263
  R-stage, 262
Runge–Kutta methods, 261
Runge–Kutta–Fehlberg method, 267

scalar, 70
scalar multiplication, 77
scaled norm, 101
Schwarz inequality, 101
Scilab, 31
script, matlab, 32
script, Matlab, 45
secant method, 61
  convergence of, 61
semi-definite, 76
significant digits, 18
similarity transformation, 194
simple eigenvalue, 196
Simpson's rule, 223
sinc function, 238
sine integral, 237
single use expression, 28
single-step method, 255
  order of a, 256
singular integrals, 244
singular matrix, 74
singular vector
  left, 135
  right, 135
smoothness, 238
solution set, 112
SOR
  matrix, 126
  method, 125
sparse matrices, 98
spectral radius, 77, 104, 192
spectrum, 191
spline
  B-, 164
  clamped, 165
  cubic, 163
  natural, 165
stability
  numerical, 100
  of a method for initial value problems, 264, 277
standard normal distribution, 151
stiff
  system of ODE's, 259, 270, 271, 275
subdistributivity, 26
successive overrelaxation, 125
successive relaxation method, 124
SUE, 28
symmetric matrix, 76

tape, 216
Taylor polynomial
  approximation by, 145
  multivariate, 293
Taylor series methods
  for solving IVP's, 259
Taylor's theorem, 3
tensor
  derivative, 293
TINY, 20
trapezoid method
  implicit, 287
trapezoidal rule, 249
triangle inequality, 100
triangular
  decomposition, 86
  factorization, 86
trigonometric polynomials, 179
truncation error, 10
trust region algorithm, 318
two-point compactification, 25

underflow, 20
underrelaxation factor, 126
unitary matrix, 108
upper triangular matrix, 82

Vandermonde matrix, 147
Vandermonde system, 147