Petr Krysl

An Engineer's Toolkit of Numerical Algorithms

With the MATLAB® toolbox
https://github.com/PetrKryslUCSD/AETNA

July 2015

Pressure Cooker Press, San Diego
© 2009-2015 Petr Krysl


Contents

1 Motivation
2 Modeling with differential equations
  2.1 Simple model of motion
  2.2 Euler's method
    2.2.1 A simple implementation of Euler's method
    2.2.2 Solving the Stokes IVP with built-in MATLAB integrator
    2.2.3 Refining the Stokes IVP
    2.2.4 Some properties of Euler's method
    2.2.5 A variation on Euler's method
    2.2.6 Implementations of forward and backward Euler method
  2.3 Beam bending model
    Illustration 1
  2.4 Model of satellite motion
  2.5 On existence and uniqueness of solutions to IVPs
  2.6 First look at accuracy
    2.6.1 Modified Euler method
    2.6.2 Deeper look at errors: going to the limit
  2.7 Runge-Kutta integrators
    Illustration 2
  2.8 Annotated bibliography
3 Preservation of solution features: stability
  3.1 Scalar real linear ODE
  3.2 Eigenvalue problem
  3.3 Forward Euler method for a decaying solution
  3.4 Backward Euler method for a decaying solution
  3.5 Backward Euler method for a growing solution
  3.6 Forward Euler method for a growing solution
  3.7 Complex IVP
    Illustration 1
    Illustration 2
    Illustration 3
  3.8 Single scalar equation versus two coupled equations: eigenvalues
  3.9 Case of Re k ≠ 0 and Im k = 0
  3.10 Case of Re k = 0 and Im k ≠ 0
  3.11 Application of the Euler integrators to the IVP (3.10)
  3.12 Euler methods for oscillating solutions
  3.13 General complex k
    Illustration 4
  3.14 Summary of integrator stability
    Illustration 5
    3.14.1 Visualizing the stability regions
  3.15 Annotated bibliography
4 Linear Single Degree of Freedom Oscillator
  4.1 Linear single degree of freedom oscillator
    Illustration 1
    4.1.1 ω = 0: No oscillation
    4.1.2 α = −(c/2m): Oscillation
    4.1.3 Critically damped oscillator
  4.2 Supercritically damped oscillator
    Illustration 2
  4.3 Change of coordinates: similarity transformation
    Illustration 3
  4.4 Subcritically damped oscillator
    Illustration 4
  4.5 Undamped oscillator: alternative treatment
    Illustration 5
    4.5.1 Subcritically damped oscillator: alternative treatment
  4.6 Matrix-exponential solution
  4.7 Critically damped oscillator
    Illustration 6
    Illustration 7
  4.8 Annotated bibliography
5 Linear Multiple Degree of Freedom Oscillator
  5.1 Model of a vibrating system
  5.2 Undamped vibrations
    5.2.1 Second order form
      Illustration 1
      Illustration 2
    5.2.2 First order form
  5.3 Direct time integration and eigenvalues
    5.3.1 Practical use of eigenvalues for integration
  5.4 Analyzing the frequency content
    Illustration 3
    Illustration 4
  5.5 Proportionally damped system
  5.6 Non-proportionally damped system
    Illustration 5
  5.7 Singular stiffness, damped
  5.8 Annotated bibliography
6 Analyzing errors
  6.1 Taylor series
    Illustration 1
  6.2 Order-of analysis
    Illustration 2
    6.2.1 Using the big-O notation
      Illustration 3
      Illustration 4
    6.2.2 Error of the Riemann-sum approximation of integrals
    6.2.3 Error of the Midpoint approximation of integrals
  6.3 Estimating error in ODE integrators
    6.3.1 Local error of forward Euler
    6.3.2 Global error of forward Euler
      Illustration 5
  6.4 Approximation of derivatives
    Illustration 6
    Illustration 7
  6.5 Computer arithmetic
    6.5.1 Integer data types
    6.5.2 Floating-point data types
    6.5.3 Summary
  6.6 Interplay of errors
    Illustration 8
    Illustration 9
  6.7 Annotated bibliography
7 Solution of systems of equations
  7.1 Single-variable nonlinear algebraic equation
    Illustration 1
    7.1.1 Convergence rate of Newton's method
      Illustration 2
    7.1.2 Robustness of Newton's method
    7.1.3 Bisection method
  7.2 System of nonlinear algebraic equations
    Illustration 3
    7.2.1 Numerical Jacobian evaluation
      Illustration 4
    7.2.2 Nonlinear structural analysis example
  7.3 LU factorization
    7.3.1 Forward and backward substitution
    7.3.2 Factorization
    7.3.3 Pivoting
    7.3.4 Computational cost
      Illustration 5
      Illustration 6
      Illustration 7
    7.3.5 Large systems of coupled equations
    7.3.6 Uses of the LU Factorization
  7.4 Errors and condition numbers
    7.4.1 Perturbation of b
      Illustration 8
    7.4.2 Condition number
      Illustration 9
    7.4.3 Perturbation of A
    7.4.4 Induced matrix norm
      Illustration 10
    7.4.5 Condition number in pictures
      Illustration 11
    7.4.6 Condition number for symmetric matrices
      Illustration 12
  7.5 QR factorization
    7.5.1 Householder reflections
      Illustration 13
      Illustration 14
  7.6 Annotated bibliography
8 Solution methods for eigenvalue problems
  8.1 Repeated multiplication by matrix
  8.2 Power iteration
    Illustration 1
  8.3 Inverse power iteration
    Illustration 2
    Illustration 3
    8.3.1 Shifting used with inverse power iteration
      Illustration 4
      Illustration 5
  8.4 Simultaneous power iteration
    Illustration 6
  8.5 QR iteration
    8.5.1 Schur factorization
    8.5.2 QR iteration: Shifting and deflation
      Illustration 7
  8.6 Spectrum slicing
    Illustration 8
  8.7 Generalized eigenvalue problem
    Illustration 9
    8.7.1 Shifting
      Illustration 10
  8.8 Annotated bibliography
9 Unconstrained Optimization
  9.1 Basic ideas
  9.2 Two degrees of freedom static equilibrium: unstable structure
    Illustration 1
  9.3 Two degrees of freedom static equilibrium: stable structure
  9.4 Potential function
    Illustration 2
  9.5 Determining definiteness
  9.6 Two degrees of freedom static equilibrium: computing displacement
  9.7 One degree of freedom total energy minimization example
  9.8 Two degrees of freedom total energy minimization example
  9.9 Application of the total energy minimization
    9.9.1 Line search
    9.9.2 Line search for the quadratic-form objective function
      Illustration 3
  9.10 Conjugate Gradients method
  9.11 Generalization to multiple equations
    Illustration 4
  9.12 Direct versus iterative methods
  9.13 Least-squares minimization
    9.13.1 Geometry of least squares fitting
    9.13.2 Solving least squares problems
  9.14 Annotated bibliography
Index


1 Motivation

The narrative in this chapter is provided in the hope that it will motivate the esteemed reader to take the present subject seriously. Please do not be discouraged if the text in this chapter is found lacking in entertainment value. The rest of the book will make it up to you to excess.

Let us consider the experience of the renowned structural engineer WTP (in Figure 1.1 accompanied by his attorney CR). It concerns a planar truss structure designed by WTP and analyzed for static loads. The structural software used was developed by Owl & Co.

Fig. 1.1. The renowned structural engineer WTP depicted on the stairs of his house with his attorney CR. CR was instrumental in keeping WTP's engineering career on track.

The structure was first analyzed with default analysis settings and the shape after the deformation is shown in Figure 1.2 (also included is a visual representation of the applied loading and the three pin supports). The shape before deformation is shown in broken line. The deformation is highly magnified.

Later that day WTP was exploring the menus of the analysis software (there is always a first time for everything), and got intrigued by the fact that the analysis option "Use automatic stabilization" was checked. The documentation was not very helpful in explaining the effects of this option (WTP was in fact not sure in which language the documentation was written), and therefore an experiment was in order. The analysis option "Use automatic stabilization" was unchecked, and the analysis was repeated. To WTP's surprise the results were practically identical, except that a slight displacement unsymmetry developed (for instance, a displacement of -0.7039 versus -0.6518 units). This was disquieting since the structure and the boundary conditions (loads and supports) were symmetric. WTP was however nonplussed, especially given that the analysis was to be delivered to the client the next day.

Fig. 1.2. The planar truss structure. Deformation under the indicated static loads is shown in solid line (magnified). The undeformed structure is shown in dashed line.

Several weeks later the software developers alerted WTP that a bug had been found in the analysis software, the bug was fixed, and an update was to be installed. WTP remembered the slight unsymmetry, and therefore checked whether the update removed it. Since the unsymmetry remained, a brief discussion ensued in whose course WTP ascertained that the bug had to do with the color in which the logo of the company was drawn on the splash screen, and it was therefore somewhat unlikely to be the cause of the unsymmetry.

During a discussion with a colleague WTP was able to convince himself that no unsymmetry was to be expected in the analysis, and that if it appeared it should be considered an error. At that point WTP began to draw on his immense powers of reasoning. After only a few hours he was able to recall the name of the text in which properties of coupled systems of linear algebraic equations were discussed in his junior year in college. An intense session with the textbook followed, and WTP was quickly able to find the page that pertained to errors that can appear in the solution of systems of equations. The error was found to be proportional to the error in the right-hand side (the loads) and to the "condition number" of the stiffness matrix. The loads were, as WTP checked, specified correctly, and consequently the mysterious "condition number" was probably the source of the confounding error.

WTP was now able to find in the textbook that the condition number of the stiffness matrix was rather expensive to compute as one had to solve an "eigenvalue problem". WTP was not to be deterred however, and subcontracted this work out to a group of students from the local university, cost it what it may (it wasn't much). The magnitudes of the eigenvalues of the stiffness matrix found by the students are shown in Figure 1.3.

Fig. 1.3. The magnitudes of the eigenvalues of the stiffness matrix of the structure from Figure 1.2 (eigenvalue number on the horizontal axis, eigenvalue magnitude on a logarithmic vertical axis).


The rather small first eigenvalue did not escape WTP and a few more rewarding hours were spent looking for information that could lead to an understanding of the relationship between the condition number, the eigenvalue problem, and the stiffness matrix. Eventually the critical piece of information that a so-called singular matrix has at least one zero eigenvalue was located, and the conclusion that the stiffness matrix was somehow close to singular was reached.

The displacement shape corresponding to the first eigenvector (Figure 1.4) facilitated the ultimate breakthrough. The structure contained a mechanism: a floppy piece of structure that was insufficiently connected to the rest of the structure (which was in fact sufficiently supported).

Fig. 1.4. Eigenvector 1 of the stiffness matrix of the structure from Figure 1.2.

The structure was consequently subjected to a redesign to remove the mechanism, and the redesign was eagerly adopted by the client who remarked on the propitious circumstance that a superior design became available before the structure was realized. WTP has yet again demonstrated that superior skill and knowledge cannot fail to win the day. Even though his friend CR's assistance was not required in this matter, his comforting presence during these trials and tribulations was gratefully noted by WTP.


2 Modeling with differential equations

Summary

1. In this chapter we develop an understanding of initial value problems (IVPs). We look at the simple but illustrative model of motion in a viscous fluid, and the model of satellite motion. The main idea: these models can be treated similarly since they are both members of the class of IVPs. The constituents of an IVP are the governing equation and the initial conditions.

2. The IVPs that will be considered in this book will be in the form of coupled first-order (only one derivative with respect to the independent variable) equations.

3. We develop simple methods for integrating IVPs numerically in time. The main idea: approximate the curve by its tangent in order to make one discrete step in time. The basic visual picture is provided by the direction field.

4. We discuss the essential differences between IVPs and BVPs (boundary value problems). The main idea: BVPs are harder to solve than IVPs because the problem data is located on the entire boundary of the domain of the independent variables.

5. We investigate the accuracy of some simple numerical solvers for IVPs. The main concepts: the monomial relationship between the error and the time step length gives us formulas to estimate the error, and the log-log plot illuminates the dependence of the error on the time step and reveals the convergence rate.

6. We wrap up the exposition of the various time integrators by describing the Runge-Kutta integrators. Main idea: try to aim the time step for optimal accuracy by sampling the right-hand side function (that is, the slope) within the time step.

2.1 Simple model of motion

George Gabriel Stokes was a 19th century mathematician who has had an enormous impact on many areas of engineering through his work on properties of fluids. Perhaps his most significant accomplishment was the work describing the motion of a sphere in a viscous fluid. This work led to the development of Stokes' Law. This is a mathematical description of the force required to move a sphere through a viscous fluid at a specific velocity.

Stokes' Law for a sphere descending under the influence of gravity in a viscous fluid is written as

$\eta\, 6\pi r\, v = \frac{4}{3}\pi r^3 (\rho_s - \rho_f)\, g$ ,   (2.1)

where $\frac{4}{3}\pi r^3$ is the volume of the sphere, $\eta$ is the dynamic fluid viscosity (for instance in SI units Pa·s), $6\pi r$ is the shape factor of the sphere of radius r, v is the velocity of the falling sphere relative to the fluid, m is the mass of the sphere, and g is the gravitational acceleration. On the left of equation (2.1) is the so-called drag force $F_d$; on the right is the gravitational force $F_g$ (i.e. $\frac{4}{3}\pi r^3 \rho_s g$, where $\rho_s$ is the mass density of the material of the sphere) minus the buoyancy force (i.e. $\frac{4}{3}\pi r^3 \rho_f g$, where $\rho_f$ is the mass density of the fluid); compare with Figure 2.1.

Fig. 2.1. Sphere falling in viscous fluid. (The figure indicates the drag force $F_d$, the gravitational force $F_g$, and the coordinate x.)

An application of this law to structural engineering may be found for instance in composites manufacturing: a manufacturing technique commonly used for large parts infuses dry fibers laid up on a bagged mold with resin by creating a degree of vacuum (Vacuum Assisted Resin Transfer Moulding, VARTM) to "suck the resin into the fibers". A critical property of the polymer resins is their dynamic viscosity: if the resin is too viscous, the fibers may be incompletely impregnated and the part must be discarded. Some of the techniques to determine the viscosity of the liquid resemble a high school science experiment: drop a ball into a tube filled with this liquid. Measure the time it takes the ball to travel some distance. From that calculate the ball's velocity (distance/time), and knowing the ball's diameter and mass obtain from (2.1) the liquid's viscosity.
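To make the recipe concrete, here is a small numerical sketch of the falling-ball measurement (this is an added example; the distance, time, and resulting values are made up for illustration):

L    = 0.25;          % distance traveled by the ball [m] (assumed measurement)
T    = 0.75;          % measured travel time [s] (assumed measurement)
v    = L/T;           % ball velocity, taken as the terminal velocity [m/s]
r    = 0.005;         % sphere radius [m]
rhos = 7.85e3;        % sphere (steel) mass density [kg/m^3]
rhof = 1.10e3;        % fluid (resin) mass density [kg/m^3]
g    = 9.81;          % gravitational acceleration [m/s^2]
% Solving Stokes' Law (2.1), eta*6*pi*r*v = (4/3)*pi*r^3*(rhos-rhof)*g, for eta:
eta  = 2*r^2*(rhos-rhof)*g/(9*v)  % dynamic viscosity [Pa s], about 1.1 here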

Of course, if we calculate the ball's velocity as (distance/time) it had better be uniform in that interval. So how does the velocity of the falling ball vary with time? Let us say we observe the proceedings with a high-speed camera. We drop the ball from rest, and then we see the ball rapidly accelerate. Eventually it seems to settle down to a steady speed on the way downwards. The modeling keyword is "acceleration", and consequently we shall use Newton's equation: acceleration is proportional to force. The acceleration may be written as $\ddot{x}$ (measuring the distance traveled downwards, x: Figure 2.1), and the total applied force is $F_g - F_d$. Therefore, we write

$\frac{4}{3}\pi r^3 \rho_s\, \ddot{x} = \frac{4}{3}\pi r^3 (\rho_s - \rho_f)\, g - \eta\, 6\pi r\, v$ .   (2.2)

Simplifying a little bit, we obtain

$\ddot{x} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} v$ .   (2.3)

We see that we have one equation, but two variables: x and v. These are not independent, since the velocity is defined as the rate of change of the distance fallen, $v = \dot{x}$. We have two choices. Either we express equation (2.3) in terms of the distance

$\ddot{x} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} \dot{x}$   (2.4)

and we obtain a second order differential equation, or we express equation (2.3) in terms of the velocity

$\dot{v} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} v$   (2.5)


and we obtain a first order differential equation. Since we are at the moment primarily interested in the velocity, we will stick to the latter.

All these equations are the so-called equations of motion. They are differential equations, expressing the rate of change of some variable (x or v) in terms of the same variable (and/or other variables, in general). The independent variable is the time t and the dependent variable is the velocity v.

We realize that to obtain a solution we must somehow integrate both sides of the equation of motion. From calculus we know that integration brings in constant(s) of integration. So, for instance for equation (2.5), we may write

$\int_{t_0}^{t} \dot{v}(\tau)\, d\tau = \int_{t_0}^{t} \left[ \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} v(\tau) \right] d\tau$

and evaluating the left-hand side we arrive at

$v(t) - v(t_0) = \int_{t_0}^{t} \left[ \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} v(\tau) \right] d\tau$ .

Here the task is to find a suitable form of the function $v(\tau)$ to satisfy this equation for all times. The value $v(t_0)$ is arbitrary. Its physical meaning is that of the velocity at the beginning of the interval $t_0 \le \tau \le t$. Therefore, setting the value $v(t_0)$ to some particular number

$v(t_0) = v_0$   (2.6)

is called specifying the initial condition. The initial condition makes the solution to the equation of motion meaningful to a particular problem. Therefore, we always think of the models of this type in terms of the pair "governing equation" (the equation of motion) plus the "initial condition". This type of model is called the initial value model (and the problem which is modeled this way is called an initial value problem: IVP). The problem of the falling sphere is an initial value problem, and the model that needs to be solved is

$\dot{v} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} v$ ,   $v(0) = v_0$   (2.7)

where we have quite sensibly taken $t_0 = 0$.

For future reference we will sketch the construction of an analytical solution. One possible approach uses the decomposition of the solution into a general solution of the homogeneous equation

$\dot{v}_h = -\frac{9\eta}{2 r^2 \rho_s} v_h$

and one particular solution to the inhomogeneous equation

$\dot{v}_p = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} v_p$ .

The homogeneous equation may be solved by assuming the solution in the form of an exponential

$v_h(\tau) = \exp(a\tau)$ .

Differentiating $v_h$ we find $a = -\frac{9\eta}{2 r^2 \rho_s}$. The particular solution can be guessed as $v_p$ = constant, and differentiating we find

$v_p = \frac{2 r^2 (\rho_s - \rho_f)}{9\eta}\, g$ .

The solution to the initial value problem is the sum of the particular solution and some multiple of the general solution


$v(\tau) = v_p(\tau) + C\, v_h(\tau)$

and it must satisfy the initial condition v(0) = v0. Substitution of τ = 0 leads to

$C = v_0 - \frac{2 r^2 (\rho_s - \rho_f)}{9\eta}\, g$

and the analytical solution to the initial value problem (2.7)

$v(\tau) = \frac{2 r^2 (\rho_s - \rho_f)}{9\eta}\, g + \left( v_0 - \frac{2 r^2 (\rho_s - \rho_f)}{9\eta}\, g \right) \exp\left( -\frac{9\eta}{2 r^2 \rho_s}\, \tau \right)$ .

Figure 2.2 displays the time variation of the speed of the falling sphere for some common data (epoxy resin, and a steel sphere of 5 mm radius). The analytical formula is easily recognizable in the MATLAB code that produces the figure (see aetna/Stokes/stokesref.m; MATLAB® is a registered trademark of The MathWorks, Inc., www.mathworks.com):

t=linspace(tspan(1),tspan(2), 100);
vt =(2*r^2*(rhos-rhof))/(9*eta)*g +...
    (v0-(2*r^2*(rhos-rhof))/(9*eta)*g)*exp(-(9*eta)/(2*r^2*rhos)*t);
plot(t,vt, 'linewidth', 2, 'color', 'black', 'marker', '.'); hold on
xlabel('t [s]'),ylabel('v(t) [m/s]')

As we can see, the sphere attains an essentially unchanging velocity within a fraction of a second. Our model can give us additional information: we can see that in theory the time dependence of the velocity vanishes only for τ → ∞. In other words, it takes an infinite time for the sphere to stop accelerating. The corresponding velocity is called terminal velocity. So much for theory. In experiments we expect that practical limits to measurement accuracy would allow us to say that terminal velocity will be reached within a finite time (for instance, if we can measure velocities with accuracy of about 1 mm per second, the time to reach terminal velocity in our example is less than 0.3 seconds).
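The 0.3-second figure can be checked directly from the analytical solution. A minimal sketch (an added example reusing the constants r, rhos, rhof, eta, g, v0 defined in the scripts of Section 2.2.1; the 1 mm/s tolerance is the assumed measurement accuracy):

vterm = 2*r^2*(rhos-rhof)/(9*eta)*g;   % terminal velocity [m/s]
tau   = 2*r^2*rhos/(9*eta);            % time constant of the exponential [s]
tol   = 1e-3;                          % velocity measurement accuracy [m/s]
% |v(t)-vterm| = |v0-vterm|*exp(-t/tau) drops below tol at time
tsettle = tau*log(abs(v0-vterm)/tol)   % roughly 0.23 s for the data used here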

Fig. 2.2. Sphere falling in viscous fluid. Time variation of the descent speed.

2.2 Euler’s method

For the simple problem (2.7) it was relatively straightforward to derive an analytical solution. As we encounter more complicated problems, it will dawn on us that analytical solutions are in general not available. The tools available to engineers for these problems will most likely be numerical in nature. (Hence the reason for this book.)

The simplest method with which to introduce numerical solutions to initial value problems (IVPs) is Euler's method. It is based on a very simple observation: the solution graph is a curve. The solution process itself could be understood as constructing a curve. The curve passes through a point that is known to us from the specification of the IVP: the point $(t_0, v_0)$. A curve consists of an infinite number of points, and we do not want to have to compute the coordinates of an infinite number of points. The next best thing would be to compute the solution at only a few points along the curve, and somehow approximate the curve in-between. It is logical to try to approach this task by starting from the point we know from the beginning, $(t_0, v_0)$, and to compute next another point on the curve, let us say $(t_1, v_1)$. Then restart the process by moving one point forward in time, compute $(t_2, v_2)$, and so on. This is an important aspect of numerical methods: the algorithms make discrete steps and they produce discrete solutions (as opposed to a continuous analytical solution).

In general we will not be able to compute this sequence of points so that they all lie on the "exact" solution curve. The points will only be "close" to the curve we wish to find (they will be only approximately "on the curve"). In fact, there is in general an infinite number of solution curves, those passing through all possible initial conditions. Refer to Figure 2.3: shown are five solution curves for five different initial velocities. So if our numerical solution process drifts off the desired solution curve, it will most probably lie on an adjacent solution curve.

Since the process is repetitive (start from a known solution point and then compute the next solution point), we may just as well think in terms of the pair $(t_j, v_j)$ (known) and $(t_{j+1}, v_{j+1})$ (unknown, to be computed). How do we approximately locate the point $(t_{j+1}, v_{j+1})$ from what we know of the solution curve passing through $(t_j, v_j)$? We know the location $(t_j, v_j)$, but is there anything else? The answer is yes: having $(t_j, v_j)$ allows us to substitute these values on the right-hand side of the governing equation (2.7) and compute

$\dot{v}(t_j, v_j) = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} v_j$   (2.8)

(there is no mention of $t_j$, so it does not appear). The meaning of $\dot{v}(t_j, v_j)$ (that is, $\frac{dv}{dt}(t_j, v_j)$) is the slope of the solution curve that passes through $(t_j, v_j)$! And here is Euler's critical insight: if we can't move along the actual curve (since we don't know it), we will move instead along the straight line tangent to the solution curve. How far? Just a little bit, since if the curve really curves, the straight line motion will quickly become a very poor approximation of the curve. Therefore we compute the next solution point as (compare with Figure 2.3; see aetna/Stokes/stokesdirf.m)

$(t_{j+1}, v_{j+1}) \leftarrow (t_j + \Delta t,\ v_j + \Delta t\, \dot{v}(t_j, v_j))$ ,   $\Delta t$ "small".   (2.9)

Here $\dot{v}(t_j, v_j)$ may become confusing, since by the superimposed dot we don't mean that a time derivative of some quantity was taken. We simply mean the value of the given function on the right of (2.8). Therefore, we give the right-hand side function a name and we use the notation

$\dot{v} = f(t, v)$ ,   $v(t_0) = v_0$   (2.10)

for the IVP. Here by f(t, v) we mean that the right-hand side of the governing equation is known as a function of t and v. Then the Euler algorithm may be written as

$(t_{j+1}, v_{j+1}) \leftarrow (t_j + \Delta t,\ v_j + \Delta t\, f(t_j, v_j))$ ,   $\Delta t$ "small".   (2.11)

One more remark is in order in reference to Figure 2.3. The short red lines indicate the slope of the solution curves passing through the points from which the straight red lines emanate (the left-hand side ends). The straight lines represent the tangents to the solution curves (the slopes). They are also known as the direction field. Plotting the direction field is a good way in which the behavior of solutions to ordinary differential equations can be understood. It works best for a single scalar equation since it is hard to visualize the direction fields when there is more than one dependent variable.
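For the Stokes problem a direction field of this kind can be reproduced with a few lines of MATLAB. The following is a minimal sketch (an added example, not one of the aetna scripts; it assumes the problem constants defined in Section 2.2.1):

f = @(t,v)((rhos-rhof)/rhos*g - (9*eta)/(2*r^2*rhos)*v);
[tg,vg] = meshgrid(linspace(0,0.2,15), linspace(-0.1,0.4,15));
dvdt = f(tg,vg);                 % slope of the solution curve at each grid point
dt   = ones(size(dvdt));         % unit advance in the time direction
quiver(tg, vg, dt, dvdt);        % short tangent segments: the direction field
xlabel('t [s]'), ylabel('v(t) [m/s]')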



Fig. 2.3. Stokes problem solutions corresponding to different initial conditions, with the direction field shown at a few selected points.

2.2.1 A simple implementation of Euler’s method

First we define the variables that appear in the definition of the IVP (2.7) (see aetna/Stokes/stokes1.m):

g=9.81;%m.s^-2

r=0.005;%m

eta=1100*1e-3;%1 Centipoise =1 mPa s

rhos=7.85e3;%kg/m^3

rhof=1.10e3;%kg/m^3

v0 = 0;% meters per second

This is the solution time span.

tspan =[0 0.5];% seconds

Define an anonymous function (assigned to the variable f) to return the value of the right-hand side of (2.7) for a given time t and velocity v.

f=@(t,v)((rhos-rhof)/rhos*g - (9*eta)/(2*r^2*rhos)*v);

Decide how many steps the algorithm should take, and compute the time step to cover the time span in the selected number of time steps.

nsteps =20;

dt= diff(tspan)/nsteps;

Initialize two arrays to hold the solution pairs. Note that the two lines in the loop correspond exactly to the algorithm formula (2.11). We call the function f defined above to evaluate the right-hand side.

t(1)=tspan(1);
v(1)=v0;
for j=1:nsteps
    t(j+1) = t(j)+dt;
    v(j+1) = v(j)+dt*f(t(j),v(j));
end

Finally, we graphically represent the solution as a series of markers that correspond to the computed solution pairs $(t_j, v_j)$.

plot(t,v,'o')
xlabel('t [s]'),ylabel('v(t) [m/s]')


Fig. 2.4. Stokes problem solution computed with a simple implementation of Euler's method.

2.2.2 Solving the Stokes IVP with built-in MATLAB integrator

Numerically solving IVPs is a common and important task. Not surprisingly, MATLAB has a menagerie of functions that can do this job very well indeed. Here we illustrate how to use the MATLAB integrator (that's what these types of functions are usually called) ode23 (see aetna/Stokes/stokes2.m). For brevity we omit the definitions of the constants (same as above). Then as above we define an anonymous function for the right-hand side of the governing equation, and we pass it as the first argument to the integrator. The integrator returns two arrays, whose meaning is the same as in our simple code above.

f=@(t,v)((rhos-rhof)/rhos*g - (9*eta)/(2*r^2*rhos)*v);

[t,v] = ode23 (f, tspan, [v0]);

The solution pairs are now plotted. However, this time we let the plotter connect the computed points (as indicated by markers) with straight lines. Note well: we are not computing the points in between, those are interpolated from the computed data points only "for show". We may take note of the spacing of the computed data points: the spacing is not uniform. The integrator is clever enough to figure out how long a step may be taken from one time instant to the next without losing too much accuracy. We will do our own so-called adaptive time stepping later on in the book.

plot(t,v,'o-')
xlabel('t [s]'),ylabel('v(t) [m/s]')

Fig. 2.5. Stokes problem solution computed with a MATLAB integrator.

2.2.3 Refining the Stokes IVP

Now consider the equation of motion written in terms of the second derivative of the distance traveled (2.3). Since two time derivatives are present, we should expect to have to integrate twice to obtain a solution. This will result in two constants of integration. The first integration yields

$\int_{t_0}^{t} \ddot{x}\, d\tau = \int_{t_0}^{t} \left( \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} \dot{x} \right) d\tau$ ,   (2.12)

which results in

$\dot{x}(t) - \dot{x}(t_0) = (t - t_0)\, \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} \left( x(t) - x(t_0) \right)$ .

Similarly the second integration gives

$\int_{t_0}^{t} \left( \dot{x}(\tau) - \dot{x}(t_0) \right) d\tau = \int_{t_0}^{t} \left( (\tau - t_0)\, \frac{\rho_s - \rho_f}{\rho_s} g \right) d\tau - \int_{t_0}^{t} \left( \frac{9\eta}{2 r^2 \rho_s} \left( x(\tau) - x(t_0) \right) \right) d\tau$ .

This expression could be further simplified, but my point can be made here: the two constants of integration are already present, $x(t_0)$ and $\dot{x}(t_0)$. Therefore the IVP (the governing equation plus the initial conditions) may be written

$\ddot{x} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} \dot{x}$ ,   $x(t_0) = x_0$, $\dot{x}(t_0) = v_0$.   (2.13)

The meaning of the IVP is: Find a function (distance traveled) x(t) such that it satisfies the equation of motion, and such that the initial distance and the initial velocity at the time $t_0$ are $x_0$ and $v_0$ respectively.

The integration of IVPs in MATLAB is made general by requiring that all IVPs be first order (only first derivatives of the variables may be present). Our IVP (2.13) is second order, but we can see that it may be converted to a first order form. Just introduce the velocity to write

$\dot{v} = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} v$ ,   $\dot{x} = v$ ,   $x(t_0) = x_0$, $v(t_0) = v_0$.   (2.14)

The price to pay for having to deal with only the first order derivatives is an increased number of variables: now we have two. Since we have two variables, we better also have two equations. Note that the initial conditions are now written in terms of the two variables, but we still have two of them. That is not entirely surprising since we still need two integration constants: we have two first-order equations, each of them needs to be integrated once, which will again result in two constants of integration.

The IVP now deals with a system of coupled ordinary differential equations. Such systems are usually written in the so-called vector form. We introduce a vector to collect our variables

$z = \begin{bmatrix} x \\ v \end{bmatrix}$

and then the IVP (2.14) is put as

$\dot{z} = \begin{bmatrix} v \\[4pt] \dfrac{\rho_s - \rho_f}{\rho_s} g - \dfrac{9\eta}{2 r^2 \rho_s} v \end{bmatrix} = f(t, z)$ ,   $z(t_0) = \begin{bmatrix} x_0 \\ v_0 \end{bmatrix}$ .   (2.15)

Formally, this is the same as the IVP (2.7), except that our variable is a vector, and the function on the right-hand side returns a vector and takes the time and a vector as arguments. This parallel makes it possible to treat a variety of IVPs with the same code in MATLAB. Here we show an implementation (see aetna/Stokes/stokes3.m) that computes the solution to (2.15).

The definitions of the constants are the same as above, except for the initial conditions. The initial condition now is a column vector.

z0 = [0;0];% Initial distance [m] and velocity [m/s]

The right-hand side function looks very similar to the one introduced above, except that it needs to return a vector, and whenever it refers to the velocity it needs to take it out of the input vector as z(2)

f=@(t,z)([z(2); (rhos-rhof)/rhos*g-(9*eta)/(2*r^2*rhos)*z(2)]);

The MATLAB integrator is called exactly as before.

[t,z] = ode23 (f, tspan, z0);

The arrays returned by the integrator collect results in the form of a table:

t(:)    z(:,1)    z(:,2)
t1      x1        v1
t2      x2        v2
...     ...       ...

Plotting the two arrays then yields two curves: the distance traveled and the velocity (Figure 2.6).

plot(t,z,'o-')
xlabel('t [s]'),ylabel('x(t) [m], v(t) [m/s]')
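Since the columns of z hold the distance and the velocity respectively, either quantity can also be plotted on its own by indexing the corresponding column (a small added illustration):

plot(t, z(:,2), 'o-')   % the velocity alone, i.e. the second column of z
xlabel('t [s]'), ylabel('v(t) [m/s]')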

Fig. 2.6. Stokes problem. Solution of (2.15) computed with a MATLAB integrator.

2.2.4 Some properties of Euler’s method

The solutions are computed in the form of a table. An important parameter in that table is the spacing along the time direction, the time step. The time step is a form of a control knob: it appears that turning the knob so that the time step decreases would compute more points along the solution curve. This should be helpful, if for nothing else than to render better representations of the solutions. Since the major approximation in Euler's method is the replacement of the actual curve with a straight line, we can also see that making the time step smaller will somehow decrease the error that we will make in each step.


Making the step smaller is however expensive. The more steps we make the algorithm take, the longer we have to wait for the computer to give us the solution. Hence we may wish to use a step that is sufficiently large for the results to arrive quickly, but large steps also have consequences. What if I wanted to take only 10 steps instead of 20 in the first MATLAB script (Section 2.2.1, set nsteps=10;)? The result (see aetna/Stokes/stokes4.m) is shown in Figure 2.7 and it is clearly unphysical: in the actual experiment (and in the analytical solution and in our prior calculations) the dropped sphere certainly seems to be monotonically speeding up, whereas here the result tells us that the velocity oscillates, and moreover at times seems to be higher than the terminal velocity.

Fig. 2.7. Stokes problem. Solution with a larger time step than in Figure 2.4.

An explanation of this phenomenon (see aetna/Stokes/stokes4ill.m) may be found in Figure 2.8. Note the direction field, which will help us understand the numerical solution. Starting from $(t_1 = 0, v_1 = 0)$ we proceed along the steep straight line so far that the next solution point $(t_2, v_2)$ overshoots the terminal velocity. The next step is along a straight line with a negative slope, and again we go so far that we undershoot the terminal velocity. The third step takes us along a straight line with a positive slope, and we overshoot again. This kind of computed response is not useful to us since the qualitative feature of the solution, namely the monotonic increase of the speed, is lost in the numerical solution.

Fig. 2.8. Stokes problem. Solution with a larger time step than in Figure 2.4. The direction field and the analytical solution are shown.


In summary, we see that the selection of the time step length has two kinds of implications. Firstly, the time step affects the accuracy (how far are the computed solutions from the curve that we would like to track?). Secondly, the time step affects the quality of the solution (is the shape of the computed solution series a reasonable approximation of the shape of the "exact" solution curve?). The first aspect is generally referred to as accuracy. The second aspect is generally considered a manifestation of stability (or instability, depending on how we look at it).

2.2.5 A variation on Euler’s method

Euler's method proposes to follow a straight line set up at $(t_j, v_j)$ to arrive at the point $(t_{j+1}, v_{j+1})$. As an alternative, let us consider the possibility of following a straight line set up at the (initially unknown) point $(t_{j+1}, v_{j+1})$. For simplicity let us work with the IVP (2.7). The right-hand side of the equation of motion is the function

$f(t, v) = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} v$ .

The slope of the solution curve passing through (tj+1, vj+1) is

$f(t_{j+1}, v_{j+1}) = \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} v_{j+1}$ ,

which we substitute into the formula that expresses the movement from the point $(t_j, v_j)$ to $(t_{j+1}, v_{j+1})$ along a straight line

$(t_{j+1}, v_{j+1}) = (t_j + \Delta t,\ v_j + \Delta t\, f(t_{j+1}, v_{j+1}))$ .   (2.16)

The resulting expression for the velocity at time tj+1 reads

$v_{j+1} = v_j + \Delta t\, f(t_{j+1}, v_{j+1}) = v_j + \Delta t \left( \frac{\rho_s - \rho_f}{\rho_s} g - \frac{9\eta}{2 r^2 \rho_s} v_{j+1} \right)$ ,

which has the unknown velocity $v_{j+1}$ on both sides of the equation. Equations of this type are called implicit, as opposed to the original Euler's algorithmic equation (2.9). The latter was explicit: the unknown was explicitly defined by the right-hand side. In implicit equations, the unknown may be hidden in some nasty expressions on both sides of the equation, and typically a numerical method must be used to extract the value of the unknown.

In the present case, solving the implicit equation is not that hard

$v_{j+1} = \dfrac{v_j + \Delta t\, \dfrac{\rho_s - \rho_f}{\rho_s} g}{1 + \Delta t\, \dfrac{9\eta}{2 r^2 \rho_s}}$ .

The MATLAB script of Section 2.2.1 may be easily modified to incorporate our new algorithm. The only change occurs inside the time-stepping loop

for j=1:nsteps
    t(j+1) = t(j)+dt;
    v(j+1) = (v(j)+dt*(rhos-rhof)/rhos*g)/(1+dt*(9*eta)/(2*r^2*rhos));
end

With this modification (see aetna/Stokes/stokes5.m) we can now compute the numerical solution without overshoot not only with just 10 steps, but with just five or even two; see Figure 2.9. The computed points are not particularly accurate, but the qualitative character of the solution curves is preserved. In this sense, the present modification of Euler's algorithm has rather different properties than the original.

9See: aetna/Stokes/stokes5.m


Fig. 2.9. Stokes problem. Solution with the algorithm (2.16).

In order to be able to distinguish between these algorithms we will call the original algorithm of Section 2.2.1 the forward Euler, and the algorithm introduced in this section will be called backward Euler. The justification for this nomenclature may be sought in the visual analogy of approximating a curve with a tangent: in the forward Euler method this tangent points forward from the point (tj, vj), in the backward Euler method the tangent points backward from (tj+1, vj+1).

2.2.6 Implementations of forward and backward Euler method

In this book we shall spend some time experimenting with the forward and backward Euler method. However, MATLAB does not come with integrators implementing these methods. They are too simplistic to serve the general-purpose aspirations of MATLAB. Since it will make our life easier if we don't have to code the forward and backward Euler method every time we want to apply it to a different problem, the toolbox aetna that comes with the book provides integrators for this pair of methods.

The aetna forward and backward Euler integrators are called in the same way as the built-in MATLAB integrators. We have seen in Section 2.2.2 an example of the built-in MATLAB integrator, ode23. There is one difference, however, which is unavoidable. The built-in MATLAB integrators are able to determine the time step automatically, and in general the time step is changed from step to step. The aetna forward and backward Euler integrators are fixed-time-step implementations: the user controls the time step, and it will not change. Therefore, we have to supply the initial time step as an option to the integrator. (In fact, even the MATLAB built-in integrators take that options argument. It is used to control various aspects of the solution process.) The MATLAB odeset function is used to create the options argument. To compute the solution with the forward Euler integrator odefeul10, replace the ode23 line in the script in Section 2.2.2 with these two lines11:

options =odeset(’initialstep’, 0.01);

[t,v] = odefeul(f, tspan, [v0], options);

To compute the solution with a backward Euler integrator, use odebeul12 instead. The inquisitive reader now probably wonders: how does odebeul solve for vj+1 from the implicit equation

vj+1 = vj +∆tf(tj+1, vj+1)

when it cannot even know how the function f was defined (all it is given is the function handle f)? The answer is: the equation is solved numerically. Solving (systems of) non-linear algebraic equations is

10See: aetna/utilities/ODE/integrators/odefeul.m
11See: aetna/Stokes/stokes6.m
12See: aetna/utilities/ODE/integrators/odebeul.m


so important that MATLAB cannot fail to deliver some methods for dealing with them. MATLAB's fzero implements a few methods by which the root of a single nonlinear equation may be located. It takes two arguments: a function handle (the function whose zero we wish to find) and the initial guess of the root location. First we define the function

F (vj+1) = vj+1 − vj −∆tf(tj+1, vj+1)

by moving all the terms to the left-hand side, and our goal will be to find vj+1 such that F(vj+1) = 0. For that purpose we will create a handle to an anonymous function @(v1)(v1-v(j)-dt*f(t(j+1),v1)) in which we readily recognize the function F(vj+1) (the argument vj+1 is called v1). Finally, the time stepping loop for the backward Euler method is written as13

for j=1:nsteps
    t(j+1) = t(j) + dt;
    % solve the implicit equation F(v1) = v1 - v(j) - dt*f(t(j+1),v1) = 0
    v(j+1) = fzero(@(v1)(v1-v(j)-dt*f(t(j+1),v1)), v(j));
end

where the second line inside the loop solves the implicit equation using fzero.
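The two approaches (the closed-form solution of the implicit equation versus the numerical root finder) can be checked against each other on a single step. The following is a minimal sketch, not part of the aetna toolbox; the parameter values are purely illustrative, not necessarily those used in the stokes scripts.

rhos = 1200; rhof = 1000; eta = 1.0e-3; r = 1e-3; g = 9.81;  % assumed illustrative data
f  = @(t,v) (rhos-rhof)/rhos*g - (9*eta)/(2*r^2*rhos)*v;     % right-hand side as above
dt = 0.05; vj = 0; tjp1 = dt;                                % one step from v(0) = 0
v_closed = (vj + dt*(rhos-rhof)/rhos*g)/(1 + dt*(9*eta)/(2*r^2*rhos));
v_fzero  = fzero(@(v1)(v1 - vj - dt*f(tjp1,v1)), vj);
disp([v_closed, v_fzero])   % the two values should agree to the fzero tolerance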

2.3 Beam bending model

The reader will find more than initial value problems (IVP) in this book. Here is a boundary value problem (BVP). It is of particular interest to structural engineers as it describes the planar bending of a prismatic thin isotropic elastic beam. Using the orientations of the transverse displacement, distributed transverse load, and the resultant shear forces and moments in Figure 2.10, we can introduce the definitions

V ′ = q , M ′ = V , EIv′′ = M , (2.17)

where V is the shear force resultant, q is the applied transverse load, M is the moment resultant, E is the Young's modulus, I is the moment of inertia of the cross-section, and (.)′ = d(.)/dx. Therefore, the governing equation (static equilibrium of the differential element of the beam) is fourth order

EIv′′′′ = q . (2.18)

Our knowledge of the particular configuration of the beam would be expressed in terms of the conditions at either end: Is the cross-section at the end of the beam free of loading? Is it supported? Is the support a roller or is the rotation at the supported cross-section disallowed? At the cross-section x = 0 we could write the following four conditions

v(0) = v0 , v′(0) = s0 , v′′(0) = (1/EI) M0 , v′′′(0) = (1/EI) V0 ,

depending of course on what was known: deflection v0, slope s0, moment M0, or shear force V0. Similar four conditions could be written for the cross-section at x = L (L = the length of the beam). Normally we would know two quantities out of the four at each end of the beam. For instance, for a beam supported on rollers on each end (the so-called simply supported beam) the known quantities would be v0 = M0 = vL = ML = 0. Since the quantities are specified at the boundary x = 0 and x = L of the domain 0 ≤ x ≤ L on which the governing equation is written, we call these the boundary conditions. The entire setup leads consequently to a boundary value problem (BVP), which for the instance of the simply supported beam would be defined as

EIv′′′′ = q , v(0) = 0 , v′′(0) = 0 , v(L) = 0 , v′′(L) = 0 . (2.19)

The difference between an IVP and a BVP is seemingly innocuous but rather consequential. All the conditions from which the integration constants need to be obtained are given at the same point for the IVP. On the other hand, they are not given at the same point for the BVP, and therefore the boundary value problem is considerably more difficult to solve. One of the difficulties is that solutions to BVPs do not necessarily exist for some combinations of boundary conditions.
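For readers who wish to experiment, MATLAB ships with a general-purpose BVP solver, bvp4c. The following is only a sketch (it is not part of aetna, and the values of E, I, L, and q are illustrative assumptions); it solves the simply supported beam (2.19) after rewriting it as a system of four first-order equations.

E = 210e9; I = 1e-6; L = 2; q = 1000;            % assumed illustrative data
odefun  = @(x,y) [y(2); y(3); y(4); q/(E*I)];    % y = [v; v'; v''; v''']
bcfun   = @(ya,yb) [ya(1); ya(3); yb(1); yb(3)]; % v = v'' = 0 at x = 0 and x = L
solinit = bvpinit(linspace(0,L,10), [0;0;0;0]);  % initial mesh and guess
sol = bvp4c(odefun, bcfun, solinit);
x = linspace(0,L,100); y = deval(sol,x);
plot(x, y(1,:)), xlabel('x [m]'), ylabel('v(x) [m]')
% For a uniform load the midspan deflection should be close to 5*q*L^4/(384*E*I).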

13See: aetna/Stokes/stokes7.m


Illustration 1

Consider the beam-bending BVP with

EIv′′′′ = 0 , v(0) = 0 , v′′(0) = 0 , v′′(L) = 0 , v′′′(L) = 1 .

Note that the beam is not loaded (q = 0). The boundary conditions correspond to a beam simply supported at one end and free at the other end, where a nonzero shear force is applied. In the absence of other forces and moments, the shear force at x = L cannot be balanced. Such a beam is not stably supported, and therefore no solution exists for these boundary conditions.

We will handle some simple boundary value problems in this book, but most of their intricacies are outside its scope.

Fig. 2.10. Beam bending schematic

It is relatively straightforward to add the aspect of dynamics to the equation of motion (2.18). All terms are moved to one side of the equation, and they represent the total applied force on the differential element of the beam. Then invoking Newton's law of motion, we obtain

µv̈ = q − EIv′′′′ . (2.20)

Here µ is the mass per unit length, and v̈ is the acceleration. The equation of motion has now become a partial differential equation, since there are now derivatives with respect to both space and time. With the time derivatives there comes the need for more "constants" of integration. It is consistent with our physical reality that the integration constants will come from the beginning of the time interval on which the equation of motion (2.20) holds. Therefore, they will be obtained from the so-called initial conditions. The solution will still be subject to the boundary conditions as before, and thus we obtain an initial boundary value problem (IBVP) for the function v(t, x) of the midline deflection. For instance

µv̈ = q − EIv′′′′ , v(t, 0) = 0 , v′′(t, 0) = 0 , v(t, L) = 0 , v′′(t, L) = 0 ,
v(0, x) = vt0(x) , v̇(0, x) = v̇t0(x) .     (2.21)

This IBVP model describes the vibration of a simply-supported beam, whose deflection at time t = 0 (initial deflection) is described by the shape vt0(x) and whose (initial) velocity at time t = 0 is given as v̇t0(x).


2.4 Model of satellite motion

For the moment we shall continue investigating initial value problems. Now we look at another mechanical IVP. Consider the unpowered motion of an Earth-orbiting satellite. The only force in the problem is the force of gravity. The force acting on the satellite is shown in Figure 2.11, and a corresponding force of equal magnitude but opposite direction is also acting on the Earth.

F = −(GmM/∥r∥³) r .

Here G is the gravitational constant, m and M are the masses of the satellite and the planet respectively, and r is the vector from the center of the Earth to the location of the satellite. The IVP

Fig. 2.11. Satellite motion. Satellite path and the gravitational force.

formulation is straightforward. The equation of motion is a classical Newton's law: the acceleration of the mass of the satellite is proportional to the acting force

F = mr̈ .

Substituting for the force, we obtain

mr̈ = −(GmM/∥r∥³) r ,

which is entirely expressed in terms of the components of the location of the satellite with respect to the Earth. The initial conditions are the location and velocity of the satellite at some time instant, let us say at t = 0

r(0) = r0 , ṙ(0) = v0 .

Thus, the IVP reads

mr̈ = −(GmM/∥r∥³) r , r(0) = r0 , ṙ(0) = v0 . (2.22)

As for the problem discussed in Section 2.2.3, the dynamics of this IVP is driven by a second order equation. In order to convert to the first order form, we shall use the obvious definition of a new variable, the velocity v = ṙ. With this definition, the IVP may be written in first order form as

v̇ = −(GM/∥r∥³) r , ṙ = v , r(0) = r0 , v(0) = v0 . (2.23)


Note that the mass of the satellite canceled in the equation of motion. As before we can introduce the same formal way of writing the IVP using a single dependent variable. Introduce the vector

z = [ r ]
    [ v ]

and the definition of the right-hand side function

f(t, z) = [        v        ]
          [ −(GM/∥r∥³) r ] .     (2.24)

Then the IVP is simply

ż = f(t, z) , z(0) = z0 .

The complete MATLAB code14 to compute the solution starts with a few definitions. Especially note the initial conditions, velocity v0, and position r0.

G=6.67428 *10^-11;% cubic meters per kilogram second squared;

M=5.9736e24;% kilogram

R=6378e3;% meters

v0=[-2900;-3200;0]*0.9;% meters per second

r0=[R+20000e3;0;0];% meters

dt=0.125*60;% in seconds

te=50*3600; % seconds

Now the right-hand side function is defined (as an anonymous function, assigned to the variable f). Clearly the MATLAB code corresponds very closely to equation (2.24).

f=@(t,z)([z(4:6);-G*M*z(1:3)/(norm(z(1:3))^3)]);

We set the initial time step (the MATLAB integrator may or may not consider it: it is always driven by accuracy), and then we call the integrator ode45.

opts=odeset(’InitialStep’,dt);

[t,z]=ode45(f,[0,te],[r0;v0]);

Finally, we do some visualization in order to understand the output better than a printout of the numbers can afford. In Figure 2.12 we compare results for this problem obtained with two MATLAB integrators, ode45 and ode23, and with the forward and backward Euler integrators, odefeul and odebeul. Some of the interesting features are: ode45 is nominally of higher accuracy than ode23. However, we can see the individual curves spread out quite distinctly for ode45 while only a single curve, at this resolution of the image, is presented for ode23. From what we know about analytical solutions to this problem (remember Kepler?), the curve is an ellipse and the computed paths for repeated revolutions of the satellite around the planet would ideally overlap and represent a single curve. Therefore we have to conclude that ode23 is actually doing a better job, but not perfect (the trajectory is not actually closed). The two Euler integrators produce altogether useless solutions. The problem is not accuracy, it is the qualitative character of the orbits. From years and years of observations of the motion of satellites (and from the analytical solution to this model) we know that the energy of a satellite moving without contact with the atmosphere should be conserved to a high degree. For the forward Euler the satellite is spiraling out (which would correspond to its gaining energy), while for the backward Euler it is spiraling in (losing energy). A lot of energy! We say that all these integrators fail to reproduce the qualitative character of the solution, but some fail more spectacularly than others.

Looking at energy is a good way of judging the performance of the above integrators. The kinetic energy of the satellite is

14See: aetna/SatelliteMotion/satellite1.m


Fig. 2.12. Satellite motion. Solution computed with (left to right): ode45, ode23, odefeul, odebeul.

K = m∥v∥²/2

and the potential energy of the satellite is written as

V = −GmM/∥r∥ .

The total energy T = K + V should be conserved for all times. Let us compute this quantity for the solutions produced by these various integrators, and graph it. Or, rather we will graph T/m = K/m + V/m so that the expressions do not depend on the mass of the satellite, which did not appear in the IVP in the first place. Here is the code15 to produce Figure 2.13 which shows what the time variation of the energies should look like (the total energy is conserved – hence a horizontal line).

Km=0*t;

Vm=0*t;

for i=1:length(t)

Km(i)=norm(z(i,4:6))^2/2;

Vm(i)=-G*M/norm(z(i,1:3));

end

plot(t,Km,’k--’); hold on

plot(t,Vm,’k:’); hold on

plot(t,Km+Vm,’k-’); hold on

xlabel(’t [s]’),ylabel(’T/m,K/m,V/m [m^2/s^2]’)

In Figure 2.14 we compare the four integrators. Only the total energy is shown, so ideally we should see horizontal lines corresponding to perfectly conserved energy. On the contrary, we can see that none of the four integrators conserves the total energy. Note that the vertical axes have rather different scales. The Euler integrators perform very poorly: the change in total energy is huge. The ode45 is significantly outperformed by ode23, but both integrators lose kinetic energy nevertheless. Since ode45 is significantly more expensive than ode23, this example illustrates that choosing an appropriate integrator can make the difference between success and failure.
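A compact way of quantifying what Figure 2.14 shows is to reduce the energy history to a single drift number. The following is a small sketch (not taken from the aetna scripts) that reuses the t, z arrays and the constants G, M from the code above:

Tm = zeros(size(t));
for i = 1:length(t)
    Tm(i) = norm(z(i,4:6))^2/2 - G*M/norm(z(i,1:3));  % total energy per unit mass
end
drift = (max(Tm) - min(Tm))/abs(Tm(1))  % relative energy drift; ideally zero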

2.5 On existence and uniqueness of solutions to IVPs

With many common engineering models we are not worried about the existence and uniqueness of the solutions to the IVPs. Existence and uniqueness are guaranteed under certain conditions concerning the smoothness of the right-hand side function f (see equation (2.15)), and for many models these conditions are satisfied.

There are nevertheless engineering models where the right-hand side has a built-in non-smoothness. A good example of such a model deals with dry (Coulomb) friction. Consider an eccentric mass shaker

15See: aetna/SatelliteMotion/satellite_energy.m


Fig. 2.13. Satellite motion. Total energy (solid line), potential energy (dotted line), kinetic energy (dashed line).

Fig. 2.14. Satellite motion. Total energy computed with (left to right, top to bottom): ode45, ode23, odefeul, odebeul.


Fig. 2.15. Dry friction sliding of eccentric mass shaker. Sliding motion: Displacement in dotted line, velocity in solid line.

with a polished steel base lying on a steel platform. The mass of the shaker is m. The harmonic force due to the eccentric mass motion is added to the weight of the shaker to give the normal contact force between those two as mg + A sin(ωt), and the horizontal force parallel to the platform A cos(ωt). The IVP of the shaker sliding motion may be written in terms of its velocity as

mv̇ + µ(v)(mg + A sin(ωt)) sign v = A cos(ωt) , v(0) = v0 .

Here µ(v) is the friction coefficient. For steel on steel contact we could take

µ(v) = { µs         for |v| ≤ vstick,
       { µk ≪ µs    otherwise.

Here µs is the so-called static friction coefficient, µk is the kinetic friction coefficient, and vstick is the sticking velocity. In words, for low sliding velocity the coefficient of friction is high, for high sliding velocity the coefficient of friction is low.
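The discontinuous right-hand side can be expressed directly in MATLAB. The sketch below is purely illustrative (the parameter values are assumptions, not those used in the stickslip scripts):

m = 10; g = 9.81; A = 40; omega = 2*pi*10;      % assumed illustrative data
mus = 0.8; muk = 0.3; vstick = 1e-3;
mu = @(v) mus*(abs(v) <= vstick) + muk*(abs(v) > vstick);   % friction coefficient
f  = @(t,v) (A*cos(omega*t) - mu(v).*(m*g + A*sin(omega*t)).*sign(v))/m;
% f changes discontinuously at |v| = vstick and across v = 0; this built-in
% non-smoothness is the source of the difficulties discussed below.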

Run the simulation script stickslip_harm_2_animate and watch the animation a few times to get a feel for the motion.16 Figure 2.15 shows the displacement and velocity of the sliding motion. The brief stick phases should be noted. Also note the drift of the shaker due to the lift-off force which reduces the contact force and hence also the friction force when the mass is moving upwards.

Now consider Figure 2.16 which shows the velocity of the sliding motion for two slightly different initial conditions.17 Note well the direction field and consider how quickly (in fact discontinuously) it changes for some values of the velocity of sliding.

We take the initial velocity of the shaker to be 0.99vstick and 1.01vstick. We may expect that for such close initial conditions the velocity curves would also stay close, but they don't. The reason is the discontinuous (and divergent) direction field, as especially evident in the close-up on the right. The direction field is also discontinuous at zero velocity, but there it is convergent, and solution curves that arrive there are forced to remain at zero (and sticking occurs).

The divergent discontinuous direction field makes the solution non-unique. As users of numerical algorithms for IVPs we must be aware of such potential complications, and address them by careful consideration of the formulation of the problem and interpretation of the results.

2.6 First look at accuracy

In this section we will have a first look at how to get good accuracy for initial value problems. Or, more generally, how to control the error.

16See: aetna/Stickslip/stickslip_harm_2_animate.m
17See: aetna/Stickslip/stickslip_harm_1_dirf.m


Fig. 2.16. Dry friction sliding of eccentric mass shaker. Direction field and velocity curves for two initial conditions.

First, what do we mean by error? Consider for example that we want to obtain a numerical solution to the IVP

ẏ = f(t, y) , y(0) = y0 (2.25)

in the sense that our goal is the approximation of the value of the solution at some given point t = t̄. The difference between our computed solution yt̄ and the true answer y(t̄) will be the true error

Et̄ = y(t̄) − yt̄ .

We have already seen that the time step length apparently controls the error (we will see later exactly how it achieves this feat). So let us compute the solution for a few decreasing time step lengths. The result will be a table of time step length versus true error.

Time step length    Solution at t̄ for ∆tj    True error for ∆tj
∆t1                 yt̄,1                      Et̄,1 = y(t̄) − yt̄,1
∆t2                 yt̄,2                      Et̄,2 = y(t̄) − yt̄,2
...                 ...                       ...

The true error is a fine concept, but not very useful as knowing it implies knowing the exact solution. In practical applications of numerical methods we will never know the true error (otherwise why would we be computing a numerical solution?). In practice we will have to be content with the concept of an approximate error. A useful form of approximate error is the difference of successive solutions. So now we can construct the table of approximate errors

Time step length    Solution at t̄ for ∆tj    Approximate error for ∆tj
∆t1                 yt̄,1                      –
∆t2                 yt̄,2                      Ea,1 = yt̄,2 − yt̄,1
∆t3                 yt̄,3                      Ea,2 = yt̄,3 − yt̄,2
∆t4                 yt̄,4                      Ea,3 = yt̄,4 − yt̄,3
...                 ...                       ...

For definiteness we will be working in this section with the IVP

ẏ = −(1/2) y , y(0) = 1.0 (2.26)

and our goal will be to compute y(t̄ = 4). Figure 2.17 shows on the left the succession of computed solutions with various time steps. As we can see, the two methods used, the forward and backward Euler, are approaching the same value as the time step gets smaller. We call this behavior convergence. From the computed sequence of solutions we can obtain the approximate errors as discussed above18. The approximate errors are shown in Figure 2.17 on the right.

Fig. 2.17. Successive approximations to y(t̄ = 4) for various time steps (on the left), and approximate errors (on the right). Red curve backward Euler, blue curve forward Euler.

With this data at hand we can try to ask some questions. How does the error depend on the time step length? The curves in Figure 2.17 suggest a linear relationship. Before we look at this question in more detail, we will consider the problem of numerical integration of the IVP again, this time with a view towards devising a better (read: more accurate) integrator than the first two Euler algorithms.

2.6.1 Modified Euler method

As discussed below equation (2.5), the governing equation of the IVP (2.25) may be subject to integration from t0 to t to obtain

y(t) − y(t0) = ∫_{t0}^{t} f(τ, y(τ)) dτ .

To use this expression to obtain an actual solution may not be easy because of the integral on the right-hand side. This gives us an incentive to try to approximate the right-hand side integral. One possibility is to write

∫_{t0}^{t} f(τ, y(τ)) dτ ≈ (t − t0) f(t0, y(t0))

and we see that we’ve obtained the forward Euler algorithm

y(t) = y(t0) + (t− t0)f(t0, y(t0)) .

Or, we may write

∫_{t0}^{t} f(τ, y(τ)) dτ ≈ (t − t0) f(t, y(t))

and we get the backward Euler method. This should be familiar: we are approximating the "areas under curves" (integrals of functions) by rectangles (recall the concept of the Riemann sum). A better approximation would be achieved with trapezoids. Thus we may try

18See: aetna/ScalarODE/scalardecayconv.m


∫_{t0}^{t} f(τ, y(τ)) dτ ≈ ((t − t0)/2) [f(t0, y(t0)) + f(t, y(t))] .

The resulting trapezoidal method

(tj+1, vj+1) ← (tj + ∆t , Solution vj+1 to vj+1 = vj + (∆t/2) [f(tj, vj) + f(tj+1, vj+1)])

has very attractive properties, and we will devote more attention to it later. (It is implemented in aetna as odetrap19.) One factor that may discourage its use is cost: it is an implicit method, and to obtain y(t) one has to solve a (in general, nonlinear) equation for y(t)

y(t) = y(t0) + ((t − t0)/2) [f(t0, y(t0)) + f(t, y(t))] . (2.27)

To obtain a method that is explicit in y(t) one may try the following trick: in the above equation approximate y(t) in f(t, y(t)) using the forward Euler step to arrive at

ya = y(t0) + (t − t0) f(t0, y(t0)) ,
y(t) = y(t0) + ((t − t0)/2) [f(t0, y(t0)) + f(t, ya)] .     (2.28)

This formula defines one of the so-called modified Euler algorithms. It turns out to be only a little bit more expensive than the basic forward Euler method, but its accuracy is superior as we will immediately see on some results. (An implementation is available in odemeul20.)
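Spelled out as code, the formula (2.28) applied step by step to the IVP (2.26) reads as in the sketch below (this is only a minimal illustration; the aetna implementation is odemeul):

f  = @(t,y) -1/2*y;               % right-hand side of (2.26)
dt = 0.5; t = 0:dt:4; y = zeros(size(t)); y(1) = 1.0;
for j = 1:length(t)-1
    ya     = y(j) + dt*f(t(j), y(j));                        % forward Euler predictor
    y(j+1) = y(j) + dt/2*(f(t(j), y(j)) + f(t(j+1), ya));    % trapezoidal corrector
end
y(end)   % approximation of the solution at t = 4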

2.6.2 Deeper look at errors: going to the limit

We will now compute21 the solution to (2.26) also with the modified Euler method (2.28). Figure 2.18 shows that the modified Euler approaches the solution somewhat more quickly than both the backward and forward Euler methods. This is especially clear when we look at the approximate errors (on the right).

Fig. 2.18. Successive approximations to y(t̄ = 4) for various time steps (on the left), and approximate errors (on the right). Red curve backward Euler, blue curve forward Euler, black curve modified Euler.

The errors seem to decrease roughly linearly for the backward and forward Euler methods. The modified Euler method errors decrease along some curve. Can we find out what kind of curve? Could it be a polynomial? That would be the first thing to try, because polynomials tend to be very useful in this way (viz the Taylor series later in the book).

19See: aetna/utilities/ODE/integrators/odetrap.m
20See: aetna/utilities/ODE/integrators/odemeul.m
21See: aetna/ScalarODE/scalardecayconv1.m



We will assume that the approximate errors depend on the time step length as

Ea(∆t) ≈ C ∆t^β . (2.29)

The exponent β is unknown. One clever way in which we can use data to find out the value of β relies on taking logarithms on both sides of the equation

log(Ea(∆t)) ≈ log(C ∆t^β) ,

which yields

log(Ea(∆t)) ≈ log(C) + β log(∆t) .

This is an expression for a straight line on a plot with logarithmic axes. The slope of the line would be β. Figure 2.19 shows the approximate errors re-plotted on the log-log scale. Also shown are two red triangles. The hypotenuse in those triangles has slope 1 or 2 respectively. This may be compared with the plotted data. The forward and backward Euler approximate errors (at least for the smaller time step lengths) appear to lie along a straight line with slope equal to one. The modified Euler approximate errors on the other hand are close to a straight line with slope equal to two. Therefore, we may hypothesize that the approximate errors behave as Ea(∆t) ≈ C∆t for the forward and backward Euler, and as Ea(∆t) ≈ C∆t² for the modified Euler. The exponent β is called the rate of convergence (convergence rate). The higher the rate of convergence, the faster the errors drop. Later in the book we will use mathematical analysis tools (the Taylor series) to understand where the convergence rate is coming from.

Fig. 2.19. The approximate errors from Figure 2.18 re-plotted on the log-log scale. Red curve backward Euler, blue curve forward Euler, black curve modified Euler.

What about the first few points in the computed series, which fail to lie along a straight line on the log-log plot? We have assumed that the errors depended on only a single power of the time step length. This is a good assumption for very small time step lengths (the so-called asymptotic range, where ∆t → 0), but for larger time step lengths (the so-called pre-asymptotic range) the error more likely depends on a mixture of powers of the time step length. Then the data points will not lie on a straight line on the log-log plot.

Plotting the data as in Figure 2.19 is very useful in that it gives us the convergence rate. Could we use this information to get a handle on the true error? As explained above, we assumed that the approximate error depended on the time step length through the monomial relation (2.29). Using a simple trick, we can relate the approximate errors to the true errors

Ea,j = yt̄,j+1 − yt̄,j = yt̄,j+1 − y(t̄) + y(t̄) − yt̄,j ,

where −Et̄,j+1 = yt̄,j+1 − y(t̄) and Et̄,j = y(t̄) − yt̄,j, and so we have

Ea,j = Et̄,j − Et̄,j+1 . (2.30)

Then if the approximate error on the left behaves as the monomial (2.29), then so will the true errors on the right. There are two parameters in (2.29), the rate β and the constant C. We have estimated the rate by plotting the approximate errors on a log-log scale. Now we can estimate the constant C by taking

Et̄,j ≈ C ∆tj^β , Et̄,j+1 ≈ C ∆tj+1^β

to obtain

Ea,j = C ∆tj^β − C ∆tj+1^β (2.31)

and

C = Ea,j / (∆tj^β − ∆tj+1^β) .

For instance, for the forward Euler we have obtained the following approximate errors

>> ea_f =

6.2500e-2 3.7613e-2 1.7954e-2 8.7217e-3 4.2952e-3 2.1311e-3

>> dts =

2, 1, 1/2, 1/4, 1/8, 1/16, 1/32

and we have estimated from Figure 2.19 that the convergence rate was β = 1. Therefore, we can estimate the constant using (for example) Ea,3 as

>> C=ea_f(3)/(dts(3)-dts(4))

C =

0.071816687928745

This is useful: we can now predict for instance how small a time step will be required to obtain the solution within the absolute tolerance 10⁻⁴:

Et̄,j ≈ C ∆tj^β ≤ 10⁻⁴  ⇒  ∆tj ≤ (10⁻⁴ / C)^(1/β)

>> 1e-4/(ea_f(3)/(dts(3)-dts(4)))

ans =

0.001392434027300

Indeed, we find that for time step length 1/1024 < 0.00139 the true error is computed as 0.000066 < 10⁻⁴.

If we do not have an estimate of the convergence rate, we can try solving for it. Provided we have at least two approximate errors, let us say Ea,1 and Ea,2, we can write (2.31) twice as

Ea,1 = C ∆t1^β − C ∆t2^β , Ea,2 = C ∆t2^β − C ∆t3^β .

This system of two nonlinear equations will allow us to solve for both unknowns C and β. In general a numerical solution of this nonlinear system of equations will be required. Only if the time steps are always related by a constant factor so that


∆tj+1 = α∆tj , (2.32)

where α is a fixed constant, will we be able to solve the system analytically: First we write

Ea,1 / (∆t1^β − ∆t2^β) = Ea,2 / (∆t2^β − ∆t3^β) . (2.33)

Now we realize that

∆t1^β − ∆t2^β = ∆t1^β − (α∆t1)^β = ∆t1^β (1 − α^β)

and also

∆t2^β − ∆t3^β = ∆t2^β − (α∆t2)^β = ∆t2^β (1 − α^β)

so that we can rewrite (2.33) as

Ea,1 / (∆t1^β (1 − α^β)) = Ea,2 / (∆t2^β (1 − α^β)) ,

and canceling the factor (1 − α^β)

Ea,1 / ∆t1^β = Ea,2 / ∆t2^β .

This is then easily solved for the convergence rate by taking a logarithm of both sides to give

β = (log Ea,1 − log Ea,2) / (log ∆t1 − log ∆t2) . (2.34)

The second unknown C then follows

C = Ea,1 / (∆t1^β − ∆t2^β) . (2.35)

The described procedure for the estimation of the parameters of the relation (2.29) is a special case of the so-called Richardson extrapolation. When the data for the extrapolation is "nice", this procedure is very useful. The data may not be nice: for instance for some reason we haven't reached the asymptotic range. Or, perhaps there is a lot of noise in the data. Then the extrapolation procedure cannot work. It is a good idea to always visualize the approximate error on the log-log graph. If the approximate error data does not appear to lie on a straight line, the extrapolation should not be attempted.

An important note: the above Richardson extrapolation can work only for results obtained with fixed step-length integrators. The step length is the parameter in the extrapolation formula. It varies from step to step when the MATLAB adaptive step length integrators (e.g. ode23, ...) are used, which the formula cannot accommodate, and the extrapolation is then not applicable.
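As a small sketch (not one of the aetna scripts), the estimates (2.34) and (2.35) can be applied directly to the forward Euler data listed above, with Ea,j associated with the pair of time steps ∆tj and ∆tj+1:

ea_f = [6.2500e-2 3.7613e-2 1.7954e-2 8.7217e-3 4.2952e-3 2.1311e-3];
dts  = [2, 1, 1/2, 1/4, 1/8, 1/16, 1/32];
j = 3;   % pick a pair from the asymptotic range
beta = (log(ea_f(j)) - log(ea_f(j+1)))/(log(dts(j)) - log(dts(j+1)))  % close to 1
C    = ea_f(j)/(dts(j)^beta - dts(j+1)^beta)                          % close to 0.0718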

2.7 Runge-Kutta integrators

The modified Euler method (2.28) is an example of the so-called Runge-Kutta (RK) algorithms. A large subclass of such algorithms (the explicit RK methods) advances the solution by the prescription

y(t) = y(t0) +∆t (b1k1 + b2k2 + · · ·+ bsks) , (2.36)

which means that from y(t0) we follow a slope which is determined as a linear combination of slopes kj evaluated at various points within the current time step


k1 = f(t0 + c1∆t, y(t0))
k2 = f(t0 + c2∆t, y(t0) + a21∆t k1)
k3 = f(t0 + c3∆t, y(t0) + a31∆t k1 + a32∆t k2)
· · ·
ks = f(t0 + cs∆t, y(t0) + as1∆t k1 + as2∆t k2 + · · · + as,s−1∆t ks−1)     (2.37)

where ∆t = (t − t0), and the coefficients asj, bj, cj are determined in various ingenious ways so that the method has the best accuracy and stability properties.

Figure 2.20 shows graphically an example of such an explicit Runge-Kutta method, the modified Euler method. It can be written in the above notation as

y(t) = y(t0) + ∆t ( (1/2) k1 + (1/2) k2 )
k1 = f(t0 + 0 × ∆t, y(t0))
k2 = f(t0 + 1 × ∆t, y(t0) + 1 × ∆t k1)     (2.38)

We see that the coefficients of this method are c1 = 0, c2 = 1, a21 = 1 and b1 = b2 = 1/2.

Fig. 2.20. The modified Euler algorithm as a graphical schematic: The solution at the time t = t0 + ∆t is arrived at from y(t0) using the average slope (1/2)k1 + (1/2)k2.

The coefficients of Runge-Kutta methods asj, bj, cj are usually presented in the form of the so-called Butcher tableau

c │ a
──┼───
  │ b          (2.39)

where the coefficients are elements of the three matrices. For the explicit RK methods c1 = 0 always, and the matrix a is strictly lower triangular. The modified Euler method is an RK method with s = 2 and the tableau

0 │ 0    0
1 │ 1    0
──┼─────────
  │ 1/2  1/2

The forward Euler method is an RK method with s = 1 with the tableau

0 │ 0
──┼───
  │ 1

We must mention the fourth-order explicit Runge-Kutta. It represents perhaps the most common RK method. An improvement of this method in the form of the explicit Runge-Kutta (4,5) pair of Dormand and Prince (a combination of fourth- and fifth-order methods) makes an appearance in MATLAB in the ode45 integrator. The tableau of the fourth-order explicit Runge-Kutta with a fixed time step is

0   │ 0    0    0    0
1/2 │ 1/2  0    0    0
1/2 │ 0    1/2  0    0
1   │ 0    0    1    0
────┼────────────────────
    │ 1/6  1/3  1/3  1/6

The aetna toolbox implements the fixed-time-step fourth-order RK integrator in oderk4.22
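Written out for one step (this is only an illustrative sketch; the full integrator is oderk4), the tableau above amounts to:

f  = @(t,y) -1/2*y;               % the model IVP (2.26) as an example
dt = 0.5; t0 = 0; y0 = 1.0;
k1 = f(t0,        y0);
k2 = f(t0 + dt/2, y0 + dt/2*k1);
k3 = f(t0 + dt/2, y0 + dt/2*k2);
k4 = f(t0 + dt,   y0 + dt  *k3);
y1 = y0 + dt*(k1/6 + k2/3 + k3/3 + k4/6)   % solution after one step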

Illustration 2

Figure 2.21 shows the same results as Figure 2.19, but supplemented23 with results for the fourth-order Runge-Kutta oderk4. Clearly the fourth-order method is much more accurate, and by drawing a triangle in the log-log scale plot we easily ascertain that RK4 converges with a convergence rate of 4.

Fig. 2.21. The approximate errors plotted on the log-log scale. Red curve backward Euler, blue curve forward Euler, black curve with "o" markers modified Euler, black curve with "x" markers fourth-order Runge-Kutta oderk4.

To round off this discussion we will consider the adaptive-step Runge-Kutta method implemented in the MATLAB ode23 integrator. The tableau reads

22See: aetna/utilities/ODE/integrators/oderk4.m
23See: aetna/ScalarODE/scalardecayconv2.m


0   │ 0     0    0    0
1/2 │ 1/2   0    0    0
3/4 │ 0     3/4  0    0
1   │ 2/9   1/3  4/9  0
────┼──────────────────────
    │ 2/9   1/3  4/9  0
    │ 7/24  1/4  1/3  1/8

The array b with two rows instead of one makes the method so useful: the solution at the time t = t0 + ∆t may be computed in two different ways from the slopes k1, ..., k4. One of these (the first row) is third-order accurate and the other (the second row) is second-order accurate. The difference between them can be used to guide the change of the time step to maintain accuracy.
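A sketch of this idea for a single step is given below. It only illustrates how the two rows of b yield an error estimate; the actual ode23 implementation differs in many details.

f  = @(t,y) -1/2*y;  dt = 0.5;  t0 = 0;  y0 = 1.0;     % the model IVP (2.26)
k1 = f(t0,          y0);
k2 = f(t0 + dt/2,   y0 + dt/2*k1);
k3 = f(t0 + 3*dt/4, y0 + 3*dt/4*k2);
y3 = y0 + dt*(2/9*k1 + 1/3*k2 + 4/9*k3);               % third-order solution (first row)
k4 = f(t0 + dt,     y3);
y2 = y0 + dt*(7/24*k1 + 1/4*k2 + 1/3*k3 + 1/8*k4);     % second-order solution (second row)
err = abs(y3 - y2)   % local error estimate used to adapt the time step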

Suggested experiments

1. The integrator ode87fixed24 uses a high order Runge-Kutta formula and fixed time step length. Repeat the above exercise with this integrator, and estimate its convergence rate.

2.8 Annotated bibliography

First of all, the MATLAB documentation on the MathWorks website is eminently useful. I copy and paste code from there all the time. Check out the "Getting Started" material, the "User Guides", and the numerous "Examples" at www.mathworks.com/access/helpdesk/help/techdoc/MATLAB product page.html.

1. V. I. Arnold, Ordinary Differential Equations, Universitext, Springer, 2006. This book provides great insights into differential equation models by using a crisp language and lots of pictures. Highly recommended especially for Chapters 1 – 3.

2. C. Moler, Numerical Computing with MATLAB, SIAM, 2004. Written by one of the co-authors of MATLAB, this is a gem of a textbook. Covered are selected numerical methods and basics of MATLAB. An electronic version of it is available for free at http://www.mathworks.com/moler/index ncm.html, including dozens of MATLAB codes.

3. L. F. Shampine, I. Gladwell, S. Thompson, Solving ODEs with MATLAB, Cambridge University Press, 2003. Especially Chapters 1 and 2 are a valuable source of basic theory and examples for IVPs.

4. S. S. Rao, Mechanical Vibrations, Addison-Wesley, second edition, 1990. Comprehensive treatment of many mechanical systems. Suitable as a reference for almost all vibrations examples in this book.

24See: aetna/utilities/ODE/integrators/ode87fixed.m


3 Preservation of solution features: stability

Summary

1. In this chapter we investigate the central role that the eigenvalue problem plays in the design of ODE integrators. The goal is to preserve important solution features. This is referred to as the stability of the integration algorithm. Main idea: stability can be investigated on the model equation of the scalar linear ODE.

2. For the model IVP, the formula of a particular integrator can be written down so that the new value of the solution is expressed as a multiple of the solution value in the previous step. Main idea: the amplification factor depends on the product of the eigenvalue and the time step, and therefore the "shape" of the numerical solution is determined by these quantities. The eigenvalue is given as data, the time step can be (needs to be) chosen by the user.

3. The scalar linear ODE with a complex coefficient is equivalent to two coupled real equations in two real variables. Main idea: the ODE with a complex coefficient describes harmonic oscillations.

4. For the model IVP with a complex coefficient, the same procedure that leads to an amplification factor is used. Main idea: the amplification factor and the solution now "live" in the complex plane. The magnitude of the amplification factor again is seen to play a role in the stability investigation.

5. Understanding the amplification factors is aided by appropriate diagrams. Main idea: The preservation of solution features is illustrated by a complete stability diagram for the various methods. The magnitude of the amplification factor may also be visualized as a surface above the complex plane.

3.1 Scalar real linear ODE

At the end of the previous chapter we had a brief look at accuracy. This is the first aspect of the application of numerical integration to the solution of initial value problems. The second aspect has to do with the preservation (or lack thereof) of the important features of the solutions. As an example of such important features, in the modeling of mechanical systems we worry a lot about the conservation of momentum or energy. Often just the general shape of the solution curve may be a characteristic of an analytical solution that we would really like to see preserved in the numerical solution. We refer to the ability of the numerical algorithms to preserve these important aspects of the analytical solution as stability.

We will begin to study the issue of stability on the simplest and nicest possible differential equation: a scalar linear ordinary differential equation with a constant coefficient k

ẏ = ky , y(0) = y0 . (3.1)

For the moment we shall consider k real. As an example take k = −1/2, with an arbitrary initial condition


ẏ = −(1/2) y , y(0) = 1.3361 . (3.2)

3.2 Eigenvalue problem

For this type of equation (derivative of the function proportional to the function itself) we can guess that the function is an exponential. Both real and complex exponentials have this property. Our guess

y = B exp(λt)

is differentiated and substituted into the differential equation

ẏ = Bλ exp(λt) = ky = kB exp(λt)

which can be reshuffled to give

B(λ− k) exp(λt) = 0 . (3.3)

The constant B ≠ 0 (otherwise we don't have a solution!), and for the above to hold for all times t we must require

B(λ− k) = 0 .

The above equation is called the eigenvalue problem, and this is definitely not the last time we will encounter this type of equation in the present book. Here λ is the eigenvalue, and B is the eigenvector. The solution is easy: we see that λ = k. Any B ≠ 0 will satisfy the eigenvalue equation. We could determine B so that the initial value problem (3.1) was solved by substituting into the initial condition to obtain B = y0.

The solution to the IVP (3.2) is drawn with a solid line in Figure 3.1. It is a "decaying" solution. In the same figure there's also a "growing" solution (for k = 1/2), and a constant solution (for k = 0).

Fig. 3.1. Solution to (3.1) for k positive, negative, and zero

3.3 Forward Euler method for a decaying solution

Let us now look at what the forward Euler (2.11) will produce for the model equation (3.2). We substitute f(tj, yj) = kyj into the Euler method

yj+1 = yj +∆tf(tj , yj) ,


to obtain

yj+1 = yj +∆tkyj = (1 +∆tk)yj . (3.4)

We would like to see a monotonically decaying numerical solution, |yj+1| < |yj|, so the so-called amplification factor (1 + ∆tk) must be positive and its magnitude must be less than one

|1 +∆tk| < 1 .

If this condition is satisfied but (1 + ∆tk) < 0 the solution decreases in magnitude, but changes sign from step to step. Finally, (1 + ∆tk) = 0 implies that the solution drops to zero in one step and stays zero. Recall that for our example k = −1/2. Correspondingly, in Figure 3.2 we see1 a monotonically decaying solution for ∆t = 1.0 (|1 + ∆tk| = |1 + 1.0 × (−1/2)| = 1/2 < 1), a solution dropping to zero in one step for ∆t = 2.0, a solution decaying, but non-monotonically, for ∆t = 3.0 (as 1 + ∆tk = 1 + 3.0 × (−1/2) = −1/2), and finally for ∆t = 4.0 we get a solution which oscillates between ±y0. Note that for an even bigger time step we would get an oscillating solution which would increase in amplitude rather than decrease.
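Since the forward Euler recursion (3.4) gives yj = (1 + ∆tk)^j y0, these behaviors are easy to reproduce directly; the following is a minimal sketch (the figures in the book come from the aetna script cited in the footnote):

k = -1/2; y0 = 1.3361; te = 50;
for dt = [1.0, 2.0, 3.0, 4.0]
    n = round(te/dt);
    y = y0*(1 + dt*k).^(0:n);    % y_j = (1 + dt*k)^j * y0
    plot((0:n)*dt, y, 'o-'); hold on
end
xlabel('t'), ylabel('y')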

Fig. 3.2. Forward Euler solutions to (3.2) for time steps (left to right) ∆t = 1.0, 2.0, 3.0, 4.0

In summary, for a negative coefficient k < 0 in the model IVP (3.1) we can reproduce the correct shape of the solution curve with the forward Euler method provided

0 < ∆t ≤ −1/k . (3.5)

This is visualized in Figure 3.3. On the top we show the real line; the thick part indicates where the eigenvalues λ = k are located when they are negative. On the bottom we show the real line for the quantity λ∆t. The thick segment corresponds to equation (3.5). The filled circle indicates "included", the empty circle indicates "excluded". The meaning of (3.5) is expressed in words as:

1See: aetna/ScalarODE/scalarsimple.m

Page 42: Aetna Book 2015 Hyper

36 3 Preservation of solution features: stability

for a negative λ = k the forward Euler will reproduce the correct decaying behavior provided the quantity λ∆t lands in the segment −1 ≤ λ∆t < 0 as indicated by the arrow.

The time step lengths that satisfy equation (3.5) are called stable. If we need to be precise, we would say that such time step lengths are stable for the forward Euler applied to IVPs with decaying solutions.

Fig. 3.3. Forward Euler stability when applied to the model problem (3.1) for negative eigenvalues. The given coefficient λ is located in the negative part of the real axis on top. The time step ∆t needs to be chosen to place the product λ∆t in the unit interval −1 ≤ λ∆t < 0 on the axis at the bottom.

Sometimes it is useful to set the time step from the condition

0 < ∆t ≤ −2/k (3.6)

so that ∆tk is allowed to be in the interval between −2 and zero. This will guarantee that the solution decays, albeit non-monotonically. Such behavior is considered admissible when all we care about is that the solution decays. Detailed discussion follows in Section ??.

3.4 Backward Euler method for a decaying solution

How would the backward Euler (2.16) handle the model equation (3.2)? Upon substitution of the expression f(tj+1, yj+1) = kyj+1 into

yj+1 = yj +∆tf(tj+1, yj+1) ,

we obtain

yj+1 = yj +∆tkyj+1 . (3.7)

This may be solved for yj+1

yj+1 = yj / (1 − ∆tk) ,

where

1 / (1 − ∆tk)

is the amplification factor for this Euler scheme. Now if we realize that by assumption k < 0, we see that the solution is going to decay monotonically for all nonzero time step lengths, since 1 − ∆tk > 1 for ∆t > 0. Hence we can state that any time step length is stable for the backward Euler method applied to an IVP with a decaying solution.
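The two amplification factors derived so far can also be compared visually against the exact per-step factor exp(∆tk) of the analytical solution. A small sketch (not from aetna):

dtk = linspace(-4, 0, 200);      % values of the product dt*k (decaying solutions)
plot(dtk, 1 + dtk, dtk, 1./(1 - dtk), dtk, exp(dtk), '--')
legend('forward Euler: 1 + \Delta t k', 'backward Euler: 1/(1 - \Delta t k)', ...
       'exact: exp(\Delta t k)')
xlabel('\Delta t k')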


3.5 Backward Euler method for a growing solution

Consider now the model IVP (3.1) with k > 0. The solution should be monotonically growing in magnitude. When the backward Euler method is applied to such an equation, the amplification factor

1 / (1 − ∆tk)

is now going to be greater than one in magnitude without changing sign provided

0 < ∆t ≤ 1/k . (3.8)

The time step lengths that satisfy equation (3.8) are called stable. If we need to be precise, we would say that such time step lengths are stable for the backward Euler applied to IVPs with growing solutions. We see that the situation somehow mirrors the one discussed for the forward Euler applied to decaying solutions. Figure 3.4, which corresponds to (3.8), illustrates this quite clearly, as it is quite literally a mirror image of Figure 3.3 for the forward Euler and k < 0.

Fig. 3.4. Backward Euler stability when applied to the model problem (3.1) for positive eigenvalues. The given coefficient λ is located in the positive part of the real axis on top. The time step ∆t needs to be chosen to place the product λ∆t in the unit interval 0 < λ∆t ≤ 1 on the axis at the bottom.

3.6 Forward Euler method for a growing solution

Now again we consider k > 0, this time with the forward Euler method. When it is applied to such an equation, the amplification factor

(1 +∆tk)

is positive for all time step lengths, and also (1 + ∆tk) > 1. Hence we see that any time step length is stable for the forward Euler method applied to an IVP with a growing solution. This is again a mirror image, this time of the backward Euler applied to an IVP with a decaying solution.

3.7 Complex IVP

The model IVP (3.1) admits the possibility of the coefficient k being a complex number. Section 3.2 still applies, and the solution may therefore again be sought as


y = y0 exp(kt) .

Note that the complex exponential may be expressed in terms of sine and cosine

exp(kt) = exp [(Rek + i Imk)t] = exp(Rek t) [cos(Imk t) + i sin(Imk t)] .

The solution is now to be sought with a time dependence in the form of a complex exponential. Let us write the solution in terms of the real and imaginary parts

y = Rey + i Imy ,

which can be substituted into the differential equation, together with k = Rek + i Imk, to give

Reẏ + i Imẏ = (Rek + i Imk)(Rey + i Imy) .

Expanding we obtain

Reẏ + i Imẏ = Rek Rey − Imk Imy + i Rek Imy + i Imk Rey .

Now we group the real and imaginary terms

[Reẏ − (Rek Rey − Imk Imy)] + i [Imẏ − (Rek Imy + Imk Rey)] = 0 ,

and in order to get a real zero on the right-hand side, we require that both brackets vanish identically, and we obtain a system of coupled real differential equations

Reẏ = Rek Rey − Imk Imy
Imẏ = Rek Imy + Imk Rey .     (3.9)

Also, the initial condition y(0) = y0 is equivalent to

Rey(0) = Rey0 , Imy(0) = Imy0 .

So we see that to solve (3.1) with k complex is equivalent to solving the real IVP (profitably written in matrix form)

[ Reẏ ]   [ Rek  −Imk ] [ Rey ]        [ Rey(0) ]   [ Rey0 ]
[ Imẏ ] = [ Imk   Rek ] [ Imy ] ,      [ Imy(0) ] = [ Imy0 ] .     (3.10)

The method of Section 3.2 can be used again, but with a little modification since we now have a matrix differential equation instead of a scalar ODE. We will seek the solution to (3.10) as

[ Rey ]             [ z1 ]
[ Imy ] = exp(λt) [ z2 ] .

For brevity we will introduce the notation

w = [ Rey ]
    [ Imy ] ,

and

K = [ Rek  −Imk ]
    [ Imk   Rek ]     (3.11)

and the IVP could then be written as

ẇ = Kw , w(0) = w0 . (3.12)

Correspondingly, the solution will be sought as


w = exp(λt)z , (3.13)

where z is a time independent vector. Performing the time differentiation, we obtain

ẇ = λ exp(λt)z = K exp(λt)z .

Collecting the terms, we get, entirely analogously to (3.3),

exp(λt) (Kz − λz) = 0 .

To satisfy this equation for all times, the following condition must be true

Kz = λz . (3.14)

This is the so-called matrix eigenvalue problem. The vector z is the eigenvector, the scalar λ is the eigenvalue, and they both may be complex. The eigenvalue problem (EP) is highly nonlinear, and therefore for larger matrices impossible to solve analytically and quite difficult to solve numerically.

Looking at (3.14) we realize that there are too many unknowns here: λ, z1, and z2 (three), and not enough equations (two). We need one more equation, and to get it we rewrite (3.14) as

(K − λ1)z = 0 ,

where 1 is an identity matrix. This is a system of linear equations for the vector z with a zero right-hand side. In order for the above equation to have a nonzero solution, the square matrix

K − λ1

must be singular. (A linear combination of the columns of K − λ1 yields a zero vector, which is just another way of saying that the columns are linearly dependent. Hence, the matrix is singular.) We may put the fact that K − λ1 is singular differently by referring to its determinant

det (K − λ1) = 0 . (3.15)

This is the additional equation that makes the solution of the eigenvalue problem possible (the characteristic equation).

Illustration 1

Expand the determinant of the 2 × 2 matrix

[  2  −1 ]     [ 1  0 ]
[ −1   1 ] − λ [ 0  1 ]

The determinant may be defined recursively in terms of cofactors (Laplace formula). For a 2 × 2 matrix we obtain the familiar "diagonal products" rule

det ( [  2  −1 ]     [ 1  0 ] )
    ( [ −1   1 ] − λ [ 0  1 ] ) = (2 − λ)(1 − λ) − (−1)(−1) = λ² − 3λ + 1

We see that the expanded determinant is a polynomial in λ, the so-called characteristic polynomial.

For a 2 × 2 matrix the polynomial is quadratic, and with each additional row and column the order of the polynomial goes up by one. As a consequence, to solve the eigenvalue problem means to find the roots of the characteristic polynomial. This is a highly nonlinear and unstable computation, which for larger matrices must be done numerically since no analytical formulas exist.


Illustration 2

Display the characteristic polynomial of the matrix [2,-1,0,0;-1,1,-1,0;0,-1,1,-1;0,0,-1,1]. The MATLAB symbolic solution

>> syms lambda ’real’

>> det( [2,-1,0,0;-1,1,-1,0;0,-1,1,-1;0,0,-1,1]-lambda*eye(4))

ans =

lambda+6*lambda^2-5*lambda^3-2+lambda^4

>> ezplot(ans)

>> grid on

yields a curve similar to the one shown in Figure 3.5. One has to zoom in to be able to estimate where the roots lie. There are going to be four of them, corresponding to the highest power λ⁴.

Fig. 3.5. Characteristic polynomial of [2,-1,0,0;-1,1,-1,0;0,-1,1,-1;0,0,-1,1]

Illustration 3

We may familiarize ourselves with the concepts of the EP solutions by looking at some simple 2 × 2 matrices.

• Zero matrix. The characteristic polynomial is

det ( [ 0  0 ]     [ 1  0 ] )
    ( [ 0  0 ] − λ [ 0  1 ] ) = λ² = 0

which has the double root λ1,2 = 0. Apparently any vector v is an eigenvector since

0v = 0× v .

The MATLAB solution agrees with our analytical consideration (columns of V are the eigenvectors, the diagonal elements of D are the eigenvalues). The eigenvectors we obtained are particularly nice because they are orthogonal.

>> [V,D]=eig([0,0;0,0])

V =

1 0

0 1

D =

0 0

0 0


• Identity matrix. The characteristic polynomial is

det ( [ 1  0 ]     [ 1  0 ] )
    ( [ 0  1 ] − λ [ 0  1 ] ) = (1 − λ)² = 0

which has the double root λ1,2 = 1. Again any vector is an eigenvector. The MATLAB solution agrees with our analytical consideration (note that the eigenvectors are again orthonormal).

>> [V,D]=eig([1,0;0,1])

V =

1 0

0 1

D =

1 0

0 1

• Diagonal matrix. The characteristic polynomial is

det ( [ a  0 ]     [ 1  0 ] )
    ( [ 0  b ] − λ [ 0  1 ] ) = (a − λ)(b − λ) = 0

which has the roots λ1 = a and λ2 = b. The eigenvectors may be calculated by substituting the eigenvalue (let us start with λ1)

[ a  0 ]
[ 0  b ] v1 = λ1 v1 = a v1

and by guessing that this can be satisfied with the vector

v1 = [ 1 ]
     [ 0 ] .

Similarly for the second eigenvalue. The symbolic MATLAB solution agrees with our analytical consideration. (a, b are real symbolic constants.)

>> syms a b ’real’

>> [V,D]=eig([a,0;0,b])

V =

[ 1, 0]

[ 0, 1]

D =

[ a, 0]

[ 0, b]

• General real matrix. The characteristic polynomial is

det ( [ a  d ]     [ 1  0 ] )
    ( [ c  b ] − λ [ 0  1 ] ) = (a − λ)(b − λ) − cd = 0

The roots λ1 and λ2 need to be solved for from this quadratic equation. The symbolic MATLAB expression below evaluates the determinant

>> syms a b c d lambda ’real’

>> det([a,d;c,b]-lambda*[1,0;0,1])

ans =

a*b-a*lambda-lambda*b+lambda^2-d*c


A helpful observation usually made in a linear algebra course is that the trace of the 2 × 2 matrix (i.e. the sum of the diagonal elements) is equal to the sum of the eigenvalues, a + b = λ1 + λ2, and the determinant of the matrix is equal to the product of the eigenvalues, ab − cd = λ1λ2. We can easily verify this symbolically in MATLAB by first computing the eigenvalues and eigenvectors (symbolically)

syms a b c d lambda ’real’

[V,D]=eig([a,d;c,b])

and then using the symbolic expressions

D(1,1) +D(2,2)-a-b

simple(D(1,1)*D(2,2)-a*b+c*d)

we check that we get identically zero. As an example consider the matrix

[  2  −1 ]
[ −1   2 ]

We find the eigenvalues from 2 + 2 = 4 = λ1 + λ2, and 2 × 2 − (−1) × (−1) = 3 = λ1λ2. We easily guess λ1 = 3 and λ2 = 1. The eigenvectors are found by substituting the eigenvalue into the eigenvalue problem, and then solving the singular system of equations. For instance,

( [  2  −1 ]      [ 1  0 ] ) [ z11 ]   [ 0 ]
( [ −1   2 ] − λ1 [ 0  1 ] ) [ z21 ] = [ 0 ]

So that

[ −1  −1 ] [ z11 ]   [ 0 ]
[ −1  −1 ] [ z21 ] = [ 0 ]

These two equations are linearly dependent, and we cannot determine both elements z11, z21 from a single equation. Choosing for instance z11 = 1 gives (one possible) solution for the first eigenvector

[ z11 ]   [  1 ]
[ z21 ] = [ −1 ]

• Real matrix of the form (3.11)

det ( [ a  −b ]     [ 1  0 ] )
    ( [ b   a ] − λ [ 0  1 ] ) = (a − λ)² + b² = 0

Taking the helpful formula for the eigenvalues of the 2 x 2 matrix

λ1 + λ2 = 2a , λ1λ2 = a² + b²

and the identity (a + i b)(a − i b) = a² + b² we can see that the eigenvalues are in fact

λ1 = a+ i b , λ2 = a− i b .

So the diagonal elements of the matrix are the real parts of the eigenvalues, and the off-diagonal elements are (the real values of) the imaginary parts of the eigenvalues.


Suggested experiments

1. When we compute the eigenvector by solving the system with the singular matrix we have to choose one element of the vector, apparently arbitrarily. Discuss whether the choice is truly arbitrary. For instance, could we choose z11 = 0?

3.8 Single scalar equation versus two coupled equations: eigenvalues

We know that the eigenvalue of the scalar IVP (3.1) may be obtained from the complex eigenvalue problem discussed in Section 3.2. We have also seen that the scalar complex IVP is equivalent to the real coupled IVP (3.12). What is the correspondence of the eigenvalue obtained from the scalar equation with the eigenvalues obtained from the coupled matrix equations?

The eigenvalues of the matrix (3.11) may be obtained from the characteristic equation

det (K − λ1) = 0 ,

which yields

det ( [ Rek  −Imk ]     [ 1  0 ] )
    ( [ Imk   Rek ] − λ [ 0  1 ] ) = (Rek − λ)² + (Imk)² = 0 .     (3.16)

We know that for the scalar case the eigenvalue is λ = k = Rek + i Imk. Would this eigenvalue satisfy also the characteristic equation above? Substituting and simplifying we obtain:

(Rek − λ)² + (Imk)² = (Rek − Rek − i Imk)² + (Imk)² = i²(Imk)² + (Imk)² = 0 .

It does! That is not all, however. Numbers whose imaginary parts have equal magnitude but opposite signs are called complex conjugate (see Figure 3.6). The characteristic equation (3.16) also has the root λ = k̄ = Rek − i Imk, where the overbar means "complex conjugate". This holds because (i Imk)² = (−i Imk)². The eigenvalue problem in Section 3.2 is saying the same thing: forming the complex conjugate of the equation B(λ − k) = 0,

B̄(λ̄ − k̄) = 0 ,

is equally valid as the original equation.

3.9 Case of Rek ≠ 0 and Imk = 0

For Rek ≠ 0 and Imk = 0 the matrix

K = [Rek, 0; 0, Rek]

becomes a multiple of the identity. The eigenvalues are λ1,2 = Rek. Depending on the numerical value of Rek both equations describe the same growth, decay, or stagnation.

3.10 Case of Rek = 0 and Imk ≠ 0

For Rek = 0 and Imk ≠ 0 the matrix becomes skew-symmetric

K = [0, −Imk; Imk, 0] . (3.17)


Fig. 3.6. Graphical interpretation of complex conjugate quantities

These are interesting matrices, which occur commonly in many important applications. We will hear more about them. The eigenvalues are λ1,2 = ±i Imk, which means purely imaginary. We write λ̄1 = λ2 (and λ1 = λ̄2).

We solve for the components of the first eigenvector. The procedure is the same as in the example above: substitute the computed eigenvalue into the eigenproblem equation, and since the resulting equations are linearly dependent, choose one of the components of the eigenvector and solve for the rest. Thus we get for λ1 = i Imk

([0, −Imk; Imk, 0] − λ1 [1, 0; 0, 1]) [z11; z21] = [0; 0] .

This may be rewritten

Imk [−i, −1; 1, −i] [z11; z21] = [0; 0]

and choosing z21 = 1 we obtain the first eigenvector

z1 = [z11; z21] = [i; 1] .

Similarly we obtain the second eigenvector as

z2 = [z12; z22] = [−i; 1] .

Note that z1 and z2 are complex conjugate, as are their corresponding eigenvalues. We can easily convince ourselves that an eigenvalue problem with complex conjugate eigenvalues must have complex conjugate eigenvectors. For an arbitrary real matrix A write the complex conjugate on either side of the equation

A · z = λz −→ conj(A · z) = A · z̄ = λ̄ z̄ (3.18)

Both eigenvalue/eigenvector pairs

w1 = exp(λ1t) z1

and

w2 = exp(λ2t) z2 = w̄1


could be solutions of the IVP (3.10). A general solution therefore is likely to be a mix of these two

w = C1w1 + C2w2 .

We expect w to be a real vector, whereas w1 and w2 are both complex quantities. However, they are complex conjugate, which suggests that if the constants are also complex conjugates the expression on the right may be real (refer to Figure 3.6):

w = C1w1 + C̄1w̄1 .

In general, the complex constant may be written as

C1 = ReC1 + i ImC1 (3.19)

and the complex exponential has the equivalent expression (Euler’s formula from complex analysis)

exp(i Imk t) = cos(Imk t) + i sin(Imk t) . (3.20)

Therefore, the product of the three complex quantities may be expanded as

C1w1 = C1 exp(λ1t) z1 = [−ReC1 sin(Imk t) − ImC1 cos(Imk t); ReC1 cos(Imk t) − ImC1 sin(Imk t)] + i [ReC1 cos(Imk t) − ImC1 sin(Imk t); ReC1 sin(Imk t) + ImC1 cos(Imk t)] .

Next we take into account that C2w2 = C̄1w̄1 and we readily attain the simplification of the expression w = C1w1 + C2w2 by canceling the imaginary part

w = 2 [−ReC1 sin(Imk t) − ImC1 cos(Imk t); ReC1 cos(Imk t) − ImC1 sin(Imk t)] . (3.21)

Substituting the above expression into the initial condition we arrive at

w(0) = 2 [−ImC1; ReC1] = w0

and that allows us to solve immediately for ImC1, ReC1. So finally we can write the solution to the matrix IVP (3.10)

w = [−Imy0 sin(Imk t) + Rey0 cos(Imk t); Imy0 cos(Imk t) + Rey0 sin(Imk t)]

or even more profitably in the matrix form

w = [cos(Imk t), −sin(Imk t); sin(Imk t), cos(Imk t)] [Rey0; Imy0] . (3.22)

The matrix in the above equation is the so-called rotation matrix. The quantity Imk has the meaning of angular velocity, and correspondingly Imk t is the rotation angle. One way of visualizing rotations is through phasors: see Figure 3.7. A phasor is a rotating vector whose components vary harmonically.
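The constant amplitude implied by (3.22) is easy to verify numerically. A small check (the values Imk = 0.3, Rey0 = 0, Imy0 = 8 are borrowed from Figure 3.8; this snippet is added for illustration and is not one of the toolbox scripts):

Imk = 0.3; w0 = [0; 8];          % initial condition [Rey0; Imy0]
for t = linspace(0, 20, 7)
    R = [cos(Imk*t), -sin(Imk*t); sin(Imk*t), cos(Imk*t)];
    w = R*w0;                    % solution (3.22) at time t
    disp(norm(w) - norm(w0))     % zero: the rotation preserves the length of w
end

Since the rotation matrix is orthogonal, the length of w never changes: the solution points stay on a circle.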

Figure 3.8 provides a link between different ways of visualizing rotations². The black circle corresponds to the trace of the tip of the rotating vector of Figure 3.7. The rainbow-colored helical tube (time advances from blue to red) is the black circle stretched in the time dimension. (Think Slinky.)

The red curve is the projection of the helix onto the plane Imy = 0, and it is the graph of t versus Rey. The blue curve is the projection of the helix onto the plane Rey = 0, and it is the graph of t versus Imy. When we plot the solutions computed by the MATLAB integrators they are the superposition of these (red and blue) curves in one plane, as shown on the right in Figure 3.8³. The rotating-vector picture tells us what kind of curves we should expect: the vector rotates with constant angular velocity, which when projected onto either of the two coordinates will yield a sinusoidal phase-shifted curve in time – compare with Figure 3.8.

²See: aetna/ScalarODE/scalaroscillstream.m
³See: aetna/ScalarODE/scalaroscillplot.m


Fig. 3.7. Representation of the solution to (3.10) as a rotating vector (phasor)

Fig. 3.8. Graphical representation of the solution to (3.12): Imk = 0.3, Rey0 = 0, Imy0 = 8

3.11 Application of the Euler integrators to the IVP (3.10)

For our numerical experiments we shall consider (3.10) with Imk = 3, and the initial conditions Rey0 = 0, Imy0 = 8. The code⁴ to integrate the system of ODEs starts by defining the matrix K, the initial condition, and the time span. The right-hand side function literally copies the definition of the IVP (3.12).

K=[0,-3;3,0];

w0=[0;8];

tspan =[0,10];

options=odeset('InitialStep', 0.099);

[t,sol] = ode45(@(t,w) (K*w), tspan, w0, options);

From our analysis we would expect the numerical solution to reproduce the shifted sine waves that we found as the analytical solution. The built-in MATLAB integrator ode45 does a good job, at least at first sight (Figure 3.9).

Now replace ode45 with odefeul in the above code fragment. With the step of 0.099 the forward Euler integrator takes more than 20 steps per period of oscillation. This seems like a sufficiently fine

⁴See: aetna/ScalarODE/scalaroscill1st.m


Fig. 3.9. Example of Section 3.11, ode45 integrator

time step, but the forward Euler integrator odefeul fails spectacularly: the solution blows up very quickly (Figure 3.10 on the left). The backward Euler integrator is not much better, except that the amplitude goes to zero (Figure 3.10 on the right). With smaller time steps we can reduce the rate of the blowup (decay) of the amplitude, but we can never remove it (try it: decrease the time step by a couple of orders of magnitude – and arm yourselves with patience, it is going to take a long time to integrate). We consider the constant amplitude the main feature of the solutions to this problem. Therefore, we must conclude that for this problem the two integrators appear to be unconditionally unstable, as they are unable to maintain an unchanging amplitude of the oscillations no matter how small the time step. For comparison we show the results for the built-in ode45 integrator, applied

Fig. 3.10. Example of Section 3.11, odefeul integrator (on the left) and odebeul integrator (on the right). Time step ∆t = 0.099.

to the same problem over a long integration time⁵ in Figure 3.11 (there are so many oscillations that the curves visually melt into a solid block). We see that even for this integrator there is a systematic change (decay) in the amplitude of the oscillation. By reducing the time step length we can reduce the drift, but we cannot remove it entirely (as observed in numerical experiments). Again, this behavior has to do with stability, not accuracy.
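The integrator odefeul used above belongs to the AETNA toolbox. A minimal sketch of a fixed-step forward Euler integrator with an ode45-like calling sequence might look as follows (this is an assumed interface written for illustration; the actual toolbox function may differ in its details):

function [t, y] = odefeul(rhs, tspan, y0, options)
% Fixed-step forward Euler: y_{j+1} = y_j + dt*f(t_j, y_j).
% The (constant) step length is taken from options.InitialStep.
dt = odeget(options, 'InitialStep');
t = (tspan(1):dt:tspan(2))';
y = zeros(length(t), length(y0)); y(1,:) = y0(:)';
for j = 1:length(t)-1
    y(j+1,:) = y(j,:) + dt*rhs(t(j), y(j,:)')';
end

With such a sketch, the call [t,sol] = odefeul(@(t,w)(K*w), tspan, w0, options); mirrors the ode45 call of Section 3.11.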

3.12 Euler methods for oscillating solutions

We shall now analyze the Euler methods in order to gain an understanding of the results reported in the previous section. In the first step we will apply the forward Euler method to the model IVP (3.1)

y_{j+1} = y_j + ∆t k y_j = (1 + ∆t k) y_j ,

⁵See: aetna/ScalarODE/scalaroscill1stlong.m


Fig. 3.11. Example of Section 3.11, ode45 integrator, long integration time

and we work in the knowledge that k, y_j, y_{j+1} are complex. We now understand that for a purely imaginary k = i Imk the solution may be represented as a circle in the plane Rey, Imy. Another way of saying this is “the modulus of the complex quantity y is constant”. We take the modulus on both sides

|y_{j+1}| = |(1 + ∆t k) y_j| = |1 + ∆t k| |y_j| , (3.23)

and in order to get |y_{j+1}| = |y_j| (so that the solution points lie on a circle) we need the complex amplification factor to satisfy

|1 + ∆t k| = 1 . (3.24)

Figure 3.12 illustrates the meaning of the above equation graphically. The circle of radius equal to 1.0 centered at (0, 0) is translated to be centered at (−1, 0) in order for the complex number ∆tk to satisfy (3.24). Now consider the purely imaginary value of the coefficient k = i Imk. Such numbers

Fig. 3.12. Representation of equation (3.24)

lie along the imaginary axis, Rek = 0, and when multiplied by ∆t > 0 the resulting product just moves closer to or further away from the origin. One such number ∆tk is shown in Figure 3.12. In order for ∆tk to satisfy (3.24) the dot representing the number must move to the thick circle in Figure 3.12. We can see that no such non-zero time step length exists: only ∆t = 0 will make ∆tk = 0 lie on the circle at (0, 0). Therefore, we must conclude that the forward Euler method is unconditionally unstable for imaginary k, as there is no time step length that would satisfy the stability requirement (3.24).
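The same conclusion follows from a one-line computation (added here for emphasis): for k = i Imk,

|1 + ∆t k| = |1 + i ∆t Imk| = √(1 + (∆t Imk)²) ≥ 1 ,

with equality only for ∆t Imk = 0. The amplitude computed by the forward Euler method therefore grows a little on every single step, no matter how small the step.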


Next we shall consider the backward Euler method (3.7) for the same problem. Taking the modulus on both sides we obtain

|y_{j+1}| = |y_j / (1 − ∆t k)| = |y_j| / |1 − ∆t k| ,

and in order to get |y_{j+1}| = |y_j| (so that the solution points lie on a circle) we need

|1 − ∆t k| = 1 . (3.25)

Figure 3.13 now illustrates that the circle of radius equal to 1.0 centered at (0, 0) needs to be translated to be centered at (+1, 0) in order for ∆tk to satisfy (3.25). Again, we must conclude that the backward Euler method is unconditionally unstable for imaginary k, as there is no non-zero time step length that would satisfy the stability requirement (3.25).

Fig. 3.13. Representation of equation (3.25)

3.13 General complex k

For a general complex coefficient k (meaning neither the real part nor the imaginary part is zero) in the IVP (3.1), the general solution will still be the sum of two complex conjugate terms as in (3.19). The eigenvalues are general complex numbers, and hence formula (3.20) will need to become the general Euler formula

exp(λ1t) = exp(Rek t) [cos(Imk t) + i sin(Imk t)] . (3.26)

The solution will be in the form of (3.21), except that everything will be multiplied by the real exponential exp(Rek t)

w = 2 exp(Rek t) [−ReC1 sin(Imk t) − ImC1 cos(Imk t); ReC1 cos(Imk t) − ImC1 sin(Imk t)] .

Following the same steps as in Section 3.10, we arrive at the solution to the IVP in the form

w = exp(Rek t) [cos(Imk t), −sin(Imk t); sin(Imk t), cos(Imk t)] [Rey0; Imy0] , (3.27)

which may be interpreted readily as the rotation of a phasor with exponentially decreasing (Rek < 0) or increasing (Rek > 0) amplitude.


Let us take first Rek < 0 and the forward Euler algorithm. Equation (3.23) is still our starting point, but now we are asking if there is a time step length that would make the modulus of the solution decrease in time, or in mathematical terms

|1 + ∆t k| < 1 . (3.28)

For the accompanying picture refer to Figure 3.14: one possible complex coefficient k is shown, as is its scaling (down) by the time step, ∆tk. Clearly, it is now possible by choosing a sufficiently small time step length to bring ∆tk inside the circle so that its distance from (−1, 0) is less than one, and so that the stability criterion (3.28) is satisfied. Since now there is a time step length so that the forward Euler can reproduce the correct solution shape, we call forward Euler for general complex k and Rek < 0 conditionally stable. The condition implied by “conditionally” is equation (3.28), and for a given k we can use it to solve for an appropriate ∆t.
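The condition (3.28) can in fact be solved for ∆t in closed form (this short derivation is added here; it is not in the original text). Writing |1 + ∆tk|² = (1 + ∆t Rek)² + (∆t Imk)² < 1 and expanding gives 2 ∆t Rek + ∆t² |k|² < 0, so that for Rek < 0 the forward Euler step is stable whenever

0 < ∆t < −2 Rek / |k|² .

For instance, with k = −0.1 + i3 (the value used in Example 1 of Illustration 5 below) the bound is ∆t < 0.2/9.01 ≈ 0.022.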

Fig. 3.14. Representation of equation (3.28)

On the other hand, we can now see that for the forward Euler algorithm we achieve stability for Rek > 0 for any ∆t; the requirement is now that the numerical amplitude grows, |1 + ∆tk| > 1, just as the exact solution does. The coefficient k is in the right-hand side half-plane, and the stability circle is in the left-hand side half-plane, so multiplying a complex k with an arbitrary ∆t > 0 will satisfy |1 + ∆tk| > 1. Hence, for Rek > 0 the forward Euler method is unconditionally stable.

This state of affairs is again mirrored by the behavior of the backward Euler algorithm. First take Rek > 0. Equation (3.25) is now used to figure out if there is a time step length that would make the modulus of the solution increase in time, or in mathematical terms

1 / |1 − ∆t k| > 1 . (3.29)

For the accompanying picture refer to Figure 3.15: one possible complex coefficient k is shown, as is its scaling ∆tk. Clearly, it is now possible by choosing a sufficiently small time step length to bring ∆tk inside the circle so that its distance from (+1, 0) is less than one, which will ensure satisfaction of (3.29). Thus, the backward Euler method is conditionally stable for general complex k and Rek > 0. Also, we now conclude that the backward Euler algorithm achieves stability for Rek < 0 for any ∆t (the requirement is now a decreasing amplitude, 1/|1 − ∆tk| < 1): the coefficient k is in the left-hand side half-plane, and the stability circle is in the right-hand side half-plane. Similar reasoning as for the forward Euler leads us to conclude that backward Euler is unconditionally stable for complex k and Rek < 0.

Illustration 4

Apply the modified Euler (2.28) to the model equation (3.1), and derive the amplification factor.


Fig. 3.15. Representation of equation (3.29)

Substituting the right-hand side of the model equation into the formula (2.28) we get

y_a = y(t0) + (t − t0) f(t0, y(t0)) = y(t0) + (t − t0) k y(t0)

and

y(t) = y(t0) + (t − t0)/2 [k y(t0) + k y_a]
     = y(t0) + (t − t0)/2 [k y(t0) + k (y(t0) + (t − t0) k y(t0))]
     = y(t0) [1 + k(t − t0) + (k(t − t0))²/2] .

The term in square brackets that multiplies y(t0) is the amplification factor for the modified Euler.
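The same result can be obtained with the symbolic toolbox (a quick check, not one of the AETNA scripts):

syms y0 k dt 'real'              % dt stands for the step t - t0
ya = y0 + dt*k*y0;               % predictor: forward Euler step
y1 = y0 + dt/2*(k*y0 + k*ya);    % corrector: modified Euler
simplify(y1/y0 - (1 + k*dt + (k*dt)^2/2))   % evaluates to 0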

Suggested experiments

1. Derive the amplification factor for the trapezoidal rule (2.27).

3.14 Summary of integrator stability

Figure 3.16 shows the classification of the various behaviors for the linear differential equation with constant coefficients

ẏ = k y , y(0) = y0 , k, y complex .

The eigenvalue λ = k (a complex number) is plotted in the complex plane. Depending on where it lands, the analytical solution will display the following behaviors: in the left half-plane we get decaying oscillations, in the right half-plane we get growing oscillations. If the eigenvalue is purely imaginary, we get pure oscillation. If the eigenvalue is purely real, we get either exponentially decaying or growing solutions. Finally, a zero eigenvalue yields a stagnant (unchanging) solution. Figure 3.17 shows the behaviors produced by the forward Euler integrator. The same color coding as in Figure 3.16 is used. The key to understanding whether the forward Euler integrator can give us a discrete


Fig. 3.16. Behavior classification for the first order linear differential equation

solution that mimics the analytical one is to compare the two figures. The complex number ∆tλ is plotted in the complex plane in Figure 3.17. Forward Euler can reproduce the desired behavior if there is such a ∆t as to place the number ∆tλ in Figure 3.17 in the region with the same color as the one in which λ was located in Figure 3.16.

Illustration 5

Example 1: consider λ = −0.1 + i3. The analytical solution is decaying oscillation. In Figure 3.17 we can see that a sufficiently small time step ∆t will indeed place ∆tλ inside the circle of unit radius centered at −1, which has the same color as the left-hand side half-plane in Figure 3.16. Forward Euler is conditionally stable in this case. (The condition is that ∆t must be sufficiently small.)

Example 2: consider λ = −i3. The analytical solution is pure oscillation. In Figure 3.17 we can see that it is not possible to find any other time step but ∆t = 0 to place ∆tλ on the circle of unit radius centered at −1 (which has the same color as the imaginary axis in Figure 3.16). Forward Euler is unconditionally unstable for pure oscillations.

Example 3: consider λ = 13.3. The analytical solution is exponential growth. In Figure 3.17 we can see that the positive part of the real axis has the same color in both figures. Therefore, for all ∆t > 0 we get the correct behavior. Forward Euler is unconditionally stable for exponentially growing solutions.

Example 4: consider λ = −0.61. The analytical solution is exponentially decaying. In Figure 3.17 we can see that a sufficiently small time step ∆t will indeed place ∆tλ within the interval −1 ≤ ∆tλ < 0, which has the same color as the negative part of the real axis in Figure 3.16. Forward Euler is conditionally stable in this case. (The condition is that ∆t must be sufficiently small.)

In words, using the pair of images 3.16 and 3.17, the forward Euler integrator is found to be unconditionally unstable for pure oscillations, unconditionally stable for growing oscillations and exponentially growing non-oscillating solutions, and conditionally stable for exponentially decaying oscillating and non-oscillating solutions. Analogous observations can be made about the backward Euler integrator, which is found to be unconditionally unstable for pure oscillations, conditionally stable for growing oscillations and exponentially growing non-oscillating solutions, and unconditionally stable for exponentially decaying oscillating and non-oscillating solutions.


Fig. 3.17. Behavior classification for the first order linear differential equation, Forward Euler algorithm

Fig. 3.18. Behavior classification for the first order linear differential equation, Backward Euler algorithm.

3.14.1 Visualizing the stability regions

The stability diagrams that we have developed for the Euler algorithms are complete and unambiguous. Nevertheless, it will be instructive to visualize the amplification factors of the algorithms discussed so far in yet another way⁶.

For instance, the amplification factor for the modified Euler may be written in terms of ∆tλ as (see the Illustration section on page 50)

1 + ∆tλ + (1/2)(∆tλ)² . (3.30)

All possible complex λ are allowed, which means that ∆tλ may represent an arbitrary point of the complex plane. The magnitude of the amplification factor may therefore be considered a function of the complex number ∆tλ, and it is often useful to visualize such functions as surfaces raised above the complex plane. The MATLAB function surf is designed to do just that. It takes three matrices which represent the coordinates of points of a logically rectangular grid. The elements

⁶See: aetna/StabilitySurfaces/StabilitySurfaces.m


x(k,m), y(k,m), z(k,m) represent the Cartesian coordinates of the k,m vertex of the grid. The grid then may be rendered with surf(x,y,z). Here we set up a grid with 99 rectangular faces in each direction (which is why we have 100 × 100 matrices for the corners of those faces). First the extent of the grid and the number of corners.

xlow =-3.2; xhigh= 0.9;

ylow =-3.2; yhigh= 3.2;

n=100;

Then we set up the matrices for the coordinates. Note that the index k corresponds to moving in the x direction, the index m corresponds to moving in the y direction. dtlambda is a complex number (1i is the complex unit), so taking its absolute value means getting the magnitude of the amplification factor.

x=zeros(n,n); y=zeros(n,n); z=zeros(n,n);

for k =1:n

for m =1:n

x(k,m) =xlow +(k-1)/(n-1)*(xhigh-xlow);

y(k,m) =ylow +(m-1)/(n-1)*(yhigh-ylow);

dtlambda = x(k,m) + 1i*y(k,m);

z(k,m) = abs(1 + dtlambda + 0.5*dtlambda.^2);

end

end

Of course there is more than one way of accomplishing this. Here is the whole setup accomplished with just three lines using the handy meshgrid and linspace functions.

[x,y] = meshgrid(linspace(xlow,xhigh,n),linspace(ylow,yhigh,n));

dtlambda = x + 1i*y;

% Modified Euler

z = abs(1 + dtlambda + 0.5*dtlambda.^2);

Next we draw the color-coded surface that represents the height z above the complex plane: blue is the lowest, red is the highest.

surf(x,y,z,'edgecolor','none')

Then we draw into the same figure the level curve at height 1.0 of the same function z of x, y. We set the linewidth of the curve using a handle returned from the function contour3.

hold on

[C,H] = contour3(x,y,z,[1, 1],'k')

set(H,'linewidth', 3)

Finally set up the view, and label the axes.

axis([-4 0.6 -4 4 0 8])

axis equal,

xlabel('Re (\Delta{t}\lambda)')
ylabel('Im (\Delta{t}\lambda)')

Voilà Figure 3.19. It shows how the amplification factor falls below 1.0 in amplitude inside an oval shape in the left-hand side half-plane.

As shown in the MATLAB script StabilitySurfaces, corresponding surface representations of the amplification factors for the methods discussed so far, forward and backward Euler (Figures 3.20 and 3.21), trapezoidal rule (Figure 3.22), and the fourth-order Runge-Kutta (Figure 3.23), are easily obtained just by commenting out or uncommenting the appropriate definitions of the variable z.
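For reference, the definitions of z for the other integrators are the standard amplification factors of ẏ = λy, written here in the same vectorized style (the organization inside StabilitySurfaces may differ slightly):

% Forward Euler
z = abs(1 + dtlambda);
% Backward Euler
z = abs(1./(1 - dtlambda));
% Trapezoidal rule
z = abs((1 + dtlambda/2)./(1 - dtlambda/2));
% Fourth-order Runge-Kutta
z = abs(1 + dtlambda + dtlambda.^2/2 + dtlambda.^3/6 + dtlambda.^4/24);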

Figure 3.24 compares the level curves at 1.0 for the amplitude of the amplification factor for the first order linear differential equation for the integrators FEUL=forward Euler algorithm,


Fig. 3.19. Surface of the amplitude of the amplification factor for the first order linear differential equation, MEUL=modified Euler algorithm. The contour of unit amplitude is shown in black.

Fig. 3.20. Surface of the amplitude of the amplification factor for the first order linear differential equation, FEUL=forward Euler algorithm. The contour of unit amplitude is shown in black.

Fig. 3.21. Surface of the amplitude of the amplification factor for the first order linear differential equation, BEUL=backward Euler algorithm. The contour of unit amplitude is shown in black.


Fig. 3.22. Surface of the amplitude of the amplification factor for the first order linear differential equation, TRAP=trapezoidal rule algorithm. The contour of unit amplitude is shown in black.

Fig. 3.23. Surface of the amplitude of the amplification factor for the first order linear differential equation, RK4=fourth-order Runge-Kutta algorithm. The contour of unit amplitude is shown in black.

BEUL=backward Euler algorithm, MEUL=modified Euler algorithm, TRAP=trapezoidal rule algorithm, RK4=fourth-order Runge-Kutta algorithm. Note that the level curve for the trapezoidal rule coincides with the vertical axis in the figure. For a decaying solution, the integrator will produce a stable solution if it is inside the contours in the left-hand side plane, or outside the circle in the right-hand side plane for the backward Euler. Clearly, comparing Figure 3.24 with the surface representations of integrator stability in Figures 3.20 – 3.23 we can see that visualizing the stability with contours is only part of the story: the surface figures supply the missing information about the magnitude of the amplification factor.

Suggested experiments

1. Use the information in Figure 3.22 to estimate the stability diagram for the integrator odetrap, similar to that shown in Figures 3.17 and 3.18.


Fig. 3.24. Level curves (contours) at value 1.0 of the amplitude of the amplification factor for the first order linear differential equation, FEUL=forward Euler algorithm, BEUL=backward Euler algorithm, MEUL=modified Euler algorithm, TRAP=trapezoidal rule algorithm, RK4=fourth-order Runge-Kutta algorithm.

3.15 Annotated bibliography

1. V. I. Arnold, Ordinary Differential Equations, Universitext, Springer, 2006. A nice introduction to the topic of classification of solutions of linear differential equations.

2. B. Leimkuhler, S. Reich, Simulating Hamiltonian Dynamics, Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press, 2005. A valuable discussion of the stability of numerical integrators, especially for mechanical systems.


4 Linear Single Degree of Freedom Oscillator

Summary

1. The model of the linear oscillator with a single degree of freedom is investigated from the point of view of the uncoupling procedure (so-called modal expansion), and the solution in the form of a matrix exponential. Main idea: solve the eigenvalue problem for the governing ODE system, and expand the original variables in terms of the eigenvectors. The modal expansion is a critical piece in engineering vibration analysis.

2. For the single degree of freedom linear vibrating system we study how to transform between the second order and the first order matrix form, and we discuss the relationship of the scalar equation with the complex coefficient from Chapter 3 with the linear oscillator model. Main idea: the two IVPs are shown to be equivalent descriptions.

3. It is shown that modal analysis is possible as long as the system matrix is not defective, i.e. as long as it has a full set of eigenvectors. The case of critical damping is discussed as a special case which leads to a defective system matrix.

4. The modal analysis allows multiple degree of freedom systems to be understood in terms of the properties of multiple single degree of freedom linear oscillators.

4.1 Linear single degree of freedom oscillator

The second-order equation of the free (unforced) motion of the single degree of freedom damped linear oscillator (see Figure 4.1) is

m ẍ = −k x − c ẋ .

When supplemented with the initial conditions

x(0) = x0 , ẋ(0) = v0

together this will constitute the complete definition of the IVP of the linear oscillator. Using the definition of the velocity

v = ẋ

will yield the general first-order form of the 1-dof damped oscillator IVP as

ẏ = A · y , y(0) = y0 , (4.1)

where

A = [0, 1; −k/m, −c/m] (4.2)


and

y = [x; v] .

Fig. 4.1. Linear one-degree of freedom oscillator

The discussion of Section 3.7 (refer to equation (3.13)) applies here too. We assume the solution in the form of an exponential

y = e^{λt} z .

The characteristic equation for the damped oscillator is

det (A − λ1) = det [−λ, 1; −k/m, −c/m − λ] = λ² + (c/m)λ + k/m = 0 .

The quantity ωn,

ωn² = k/m , (4.3)

appears everywhere in vibration analysis as the natural frequency of undamped vibration.

Illustration 1

Show that the IVP of the undamped (c = 0) one degree of freedom oscillator is equivalent to a single scalar equation with a complex coefficient as in equation (3.12).

The solution is obtained by substitution of (4.3) into the matrix of (4.2)

A = [0, 1; −ωn², 0] .

The IVP is therefore expanded as

[ẋ; v̇] = [0, 1; −ωn², 0] [x; v] , [x(0); v(0)] = [x0; v0] .

The trick (yes, there is one) is to introduce a new set of variables. The first is the same as the deflection of the mass x and the second is the velocity scaled by the negative natural frequency:

[w1; w2] = [x; −v/ωn] .


Therefore, the differential equation of motion may be written in terms of the new variables as

[ẇ1; −ωn ẇ2] = [0, 1; −ωn², 0] [w1; −ωn w2] ,

and by canceling −ωn in the second equation we obtain

[ẇ1; ẇ2] = [0, −ωn; ωn, 0] [w1; w2] , [w1(0); w2(0)] = [x0; −v0/ωn] ,

which is in perfect agreement with Section 3.10: we get two variables, the displacement x and the velocity scaled by the angular velocity, −v/ωn, coupled together by a skew-symmetric matrix which is the same as in equation (3.17) (where ωn = Imk). The solution in the new variables w1, w2 is therefore expressed by the rotation matrix as in (3.22). Now we can understand that Figure 3.7 describes the motion of an oscillating mass.

Assuming λ to be in general complex, we can write λ = α + iω, and substituting into the characteristic equation we obtain

λ² + (c/m)λ + k/m = (α² − ω² + (c/m)α + k/m) + i ((c/m)ω + 2ωα) = 0 .

Since both the real and the imaginary part of this equation must vanish at the same time, we must have

α² − ω² + (c/m)α + k/m = 0 , ((c/m) + 2α) ω = 0 .

The second equation allows us to branch out into two subcases, since there are two ways in which the second equation could be satisfied.

4.1.1 ω = 0: No oscillation

For ω = 0 the imaginary part of the eigenvalue vanishes, and then there is no oscillation. The real component α is obtained from

α² + (c/m)α + k/m = 0 ,

giving

α1,2 = −c/(2m) ± √((c/(2m))² − k/m) .

Notice that we must require

(c/(2m))² ≥ k/m (4.4)

for α1,2 to come out real.

4.1.2 α = −(c/2m): Oscillation

This is the second subcase: substituting α = −(c/2m) into the first equation, we obtain

(c/(2m))² − ω² + (c/m)(−c/(2m)) + k/m = 0 ,


which immediately gives for the imaginary component ω

ω = ±√(k/m − (c/(2m))²) .

For ω to come out real and positive (we include the latter condition since ω = 0 was already covered in the preceding section) we require

(c/(2m))² < k/m .

4.1.3 Critically damped oscillator

The case of ω = 0 and at the same time

(c/(2m))² = k/m

yields a special case: the critically damped oscillator. The damping coefficient

ccr = 2 m ωn (4.5)

is the so-called critical damping. The critically damped oscillator needs special handling, which we will postpone to its own section that will follow the discussion of the generic cases of the supercritically and the subcritically damped oscillator.

4.2 Supercritically damped oscillator

The oscillator is supercritically damped when the damping is sufficiently strong to eliminate oscillation, ω = 0, and when equation (4.4) gives two real roots. For this to occur we require

(c/(2m))² > k/m ,

i.e. a sharp inequality. In other words, the damping coefficient is greater than the critical damping coefficient

c = ζ ccr > ccr , (i.e. ζ > 1)

from equation (4.5). Here ζ is the so-called damping ratio. The characteristic equation gives two real roots

λ1,2 = −c/(2m) ± √((c/(2m))² − k/m) .

Let us compute the first eigenvector, corresponding to λ1 = −c/(2m) + √((c/(2m))² − k/m). We are looking for the vector z1 that solves

(A − λ1 1) z1 = 0 .

Substituting we have

[−λ1, 1; −k/m, −c/m − λ1] [z11; z21] = [0; 0] .

The two equations are really only one equation (the rows and columns of the matrix on the left are linearly dependent, since that is the condition from which we solved for λ1). Therefore, using


for instance the first equation and choosing z11 = 1, we compute z21 = λ1. We repeat the same procedure for the second root to arrive at the two eigenvectors

z1 = [z11; z21] = [1; λ1] , z2 = [z12; z22] = [1; λ2] .

The general solution of the differential equation of motion of the oscillator is therefore

y = c1 e^{λ1 t} z1 + c2 e^{λ2 t} z2 . (4.6)

The two constants cj can be determined from the initial condition

y(0) = c1 e^{λ1 · 0} z1 + c2 e^{λ2 · 0} z2 = c1 z1 + c2 z2 = y0 .

This can be conveniently cast in matrix form using the matrix of eigenvectors

V = [z1, z2]

as the matrix-vector multiplication

V [c1; c2] = y0 .

Provided λ1 ≠ λ2, the two eigenvectors are linearly independent, which means that the matrix V is non-singular. The constants are then

[c1; c2] = V⁻¹ y0 .

Illustration 2

It may be illustrative to work out in detail the inverse of the matrix of eigenvectors. We write the matrix of the eigenvectors as above:

V = [1, 1; λ1, λ2] .

The cofactor equation yields immediately

V⁻¹ = 1/(λ2 − λ1) [λ2, −1; −λ1, 1] .

Note that (4.6) may be written as

y = [e^{λ1 t} z1, e^{λ2 t} z2] [c1; c2]

or, even slicker,

y = [z1, z2] [e^{λ1 t}, 0; 0, e^{λ2 t}] [c1; c2] = V [e^{λ1 t}, 0; 0, e^{λ2 t}] [c1; c2] .

Substituting for the integration constants we obtain

y = V [e^{λ1 t}, 0; 0, e^{λ2 t}] V⁻¹ y0 . (4.7)


If we pre-multiply this equation by V⁻¹, we obtain this eminently useful representation

V⁻¹ y = [e^{λ1 t}, 0; 0, e^{λ2 t}] V⁻¹ y0 .

Namely, with the definition of a new variable w (also commonly referred to as a change of coordinates)

w(t) = V⁻¹ y(t) , w0 = V⁻¹ y0 (4.8)

we can write

w = [e^{λ1 t}, 0; 0, e^{λ2 t}] w0

as a completely equivalent solution to the oscillator IVP, using the new variable w. Each component of the solution is independent of the other, as we can see from the scalar equivalent to the above matrix equation

w1(t) = e^{λ1 t} w10 , w2(t) = e^{λ2 t} w20 . (4.9)
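Formula (4.7) is easily cross-checked numerically against a built-in integrator (using the data of the Illustration that follows, m = 13, k = 6100, ζ = 3/2, x0 = 0, v0 = 1; the snippet is added here for illustration):

m = 13; k = 6100; omega_n = sqrt(k/m);
c = 3/2*2*m*omega_n;                     % supercritical damping
A = [0, 1; -k/m, -c/m]; y0 = [0; 1];
[V, D] = eig(A);                         % columns of V are z1, z2
t = 0.05;
y_modal = V*diag(exp(diag(D)*t))/V*y0;   % formula (4.7)
[~, y_num] = ode45(@(t,y) A*y, [0, t/2, t], y0);
disp(y_modal' - y_num(end,:))            % difference at the ode45 tolerance level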

4.3 Change of coordinates: similarity transformation

As an alternative way of deriving the solution, the change of coordinates (4.8) may also be introduced already into the equation of motion (4.1)

V⁻¹ ẏ = V⁻¹ A y = V⁻¹ A V V⁻¹ y

and with the new variable w we may rewrite the original IVP in this form

ẇ = V⁻¹ A V w , w(0) = w0 = V⁻¹ y0 .

The matrix V⁻¹AV is a very nice one: it is diagonal. To see this, we realize that for each column of the matrix V the eigenvalue problem

A zj = λj zj , j = 1, 2 (4.10)

holds, and writing all such eigenvalue problems in one shot is possible as

A [z1, z2] = [z1, z2] [λ1, 0; 0, λ2] (4.11)

using the diagonal matrix

Λ = [λ1, 0; 0, λ2] . (4.12)

Therefore we have

A [z1, z2] = A V = V Λ

and pre-multiplying with V⁻¹

V⁻¹ A V = Λ . (4.13)

We say that the matrix A is similar to a diagonal matrix Λ. (We also say that A is diagonalizable.) So the IVP for the oscillator can be written in the new variable w as


ẇ = V⁻¹ A V w = Λ w , w(0) = w0 = V⁻¹ y0 .

This means that we can write totally independent scalar IVPs for each component

ẇ1(t) = λ1 w1 , w1(0) = w10 , ẇ2(t) = λ2 w2 , w2(0) = w20 ,

which as we know have the solutions (4.9):

wj(t) = e^{λj t} wj0 . (4.14)

This is the well-known decoupling procedure: the original variables y are in general coupled together since the matrix A is in general non-diagonal. Therefore, to make things easier for us we switch to a different set of variables w with the transformation (4.8) in which all the variables are uncoupled. The uncoupled variables each have their own IVP which is easily solved. Finally, if we wish to, we switch back to the original variables y. This procedure may be summarized as

ẏ = A y , y(0) = y0 (original IVP), (4.15)
ẇ = V⁻¹ A V w = Λ w , w(0) = V⁻¹ y0 (uncoupled IVP),
w = [e^{λ1 t}, 0; 0, e^{λ2 t}] w(0) (solution to uncoupled IVP),
y = V w = V [e^{λ1 t}, 0; 0, e^{λ2 t}] V⁻¹ y(0) (solution to original IVP).

It is well worth understanding this sequence of operations. It is the essence of linear vibration analysis. The variables w are the modal coordinates (often called normal coordinates), and the meaning of y = V w is that of expansion of the solution y as a linear combination of modes (the columns of V, which are the eigenvectors), where the coefficients of the linear combination are the modal coordinates w

y = V w = Σ_{j=1}^{2} zj wj(t) .
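A two-line numerical sanity check of the similarity transformation (4.13) (reusing the supercritically damped matrix from the snippet above; added here for illustration):

A = [0, 1; -6100/13, -3*sqrt(6100/13)];  % c/m = 3*omega_n for zeta = 3/2
[V, D] = eig(A);
disp(V\A*V)      % essentially diagonal; off-diagonal entries are round-off
disp(D)          % the same diagonal matrix of eigenvalues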

Illustration 3

Plot the analytical solution to the IVP (4.1) with m = 13, k = 6100, ζ = 3/2, x0 = 0, and v0 = 1.

We shall follow the procedure (4.15). The MATLAB solution is based on the symbolic algebra toolbox¹. First the definitions of the variables. The variable names are self-explanatory.

function froscill_super_symb

syms m k c omega_n t x0 v0 'real'

y0= [x0;v0];

c_cr=2*m*omega_n;

c=3/2*c_cr;

A = [0, 1; -omega_n^2, -(c/m)];

We compute symbolically the eigenvalues and eigenvectors, and we construct the diagonal matrix Λ (called L in the code)

[V,D] =eig(A);

L =simple(inv(V)*A*V);

(Control question: How do L and D compare?) Next we can compute the matrix with e^{λj t} on the diagonal (called eLt). Note that calling the MATLAB function exp on a matrix would exponentiate each element of the matrix. This is not what we intend: only the elements on the diagonal should be affected. Therefore we have to extract the diagonal of L with diag, exponentiate, and then reconstruct a square matrix with another call to diag

¹See: aetna/LinearOscillator/froscill_super_symb.m


eLt =diag(exp(diag(L)*t));

Now we are ready to write down the last equation of (4.15) to construct the solution components (displacement and velocity).

y=simple(V*eLt*inv(V))*y0;

It only remains to substitute numbers and plot. These are the given numbers and we also define an auxiliary variable.

x0= 0; v0=1;% [initial displacement; initial velocity]

m= 13; k= 6100; omega_n= sqrt(k/m);

For the plotting we need data to plot on the horizontal and vertical axis. Here we set it up so that the time variable array t consists of 200 points spanning two periods of vibration of the undamped system.

T_n=(2*pi)/omega_n;

t=linspace(0, 2*T_n, 200);

Finally the plotting of the components of the solution.

plot(t,eval(vectorize(y(1))),'m-'); hold on

plot(t,eval(vectorize(y(2))),'r--'); hold on

Remember that the components of y are symbolic expressions. Now that we have provided all the variables with numerical values, we need to evaluate the numerical value of the solution components using the MATLAB function eval. It also doesn’t hurt to use the function vectorize: the variable t is an array. In case the expression for the solution components contained arithmetic operators of two or more terms that referred to t (such as exp(t)*sin(t)) we would want the expressions to evaluate element-by-element. vectorize replaces all references to operators such as “*” or “^” with “.*” or “.^” so that these operators work on each scalar element of the arrays in turn.

4.4 Subcritically damped oscillator

The eigenvalues are

λ1,2 = −c/(2m) ± i √(k/m − (c/(2m))²) .

Let us remind ourselves that an undamped oscillator is a special case of the subcritically damped oscillator for c = 0.

The same procedure as in Section 4.2 leads to the eigenvectors

z1 = [z11; z21] = [1; λ1] , z2 = [z12; z22] = [1; λ2] ,

which are complex, since λj are complex numbers. The solution is again written as in (4.6) but with the important difference that all quantities on the right-hand side are complex while the left-hand side is expected to be real.

The second eigenvector corresponds to the second eigenvalue, which is the complex conjugate of the first one, λ2 = λ̄1. We see this easily by writing the complex conjugate of the equation A · z = λz (see equation (3.18)). The two constants cj can be determined from the initial condition

y(0) = c1 e^{λ1 · 0} z1 + c2 e^{λ2 · 0} z2 = c1 z1 + c2 z2 = y0


and since y0 is real, the two constants must be complex conjugates of each other, c2 = c̄1. The constants are still determined by

[c1; c2] = V⁻¹ y0 .

Now we can follow all the derivations from the previous section, and the solution will still be arrived at in the form of (4.7). Since both y and y0 are real, the product of the three complex matrices

V [e^{λ1 t}, 0; 0, e^{λ2 t}] V⁻¹

must also be real, and however surprising it may seem, it is real. (We can do the algebra by hand or with MATLAB to check this.)
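For instance, the following symbolic check (added here; it is not one of the toolbox scripts) shows the product for the undamped special case c = 0:

syms omega_n t 'real'
A = [0, 1; -omega_n^2, 0];                    % undamped oscillator
[V, D] = eig(A);                              % complex eigenpairs +-i*omega_n
P = simplify(V*diag(exp(diag(D)*t))*inv(V))   % should simplify to real cos/sin entries

The complex exponentials combine into cosines and sines, so P comes out real.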

Illustration 4

Plot the analytical solution to the IVP (4.1) with m = 13, k = 6100, ζ = 0.2 (< 1 so that the damping is subcritical), x0 = 0, and v0 = 1.

We shall follow the procedure (4.15). The MATLAB solution is based on the symbolic algebra toolbox². First the definitions of the variables. The variable names are self-explanatory. The code is pretty much the same as for the supercritically damped oscillator example above, except

function froscill_sub_symb

...

c=0.2*c_cr;

...

We may verify that the eigenvalues (and eigenvectors) are now general complex numbers. For instance

K>> D(1,1)

ans =

(-1/5+2/5*i*6^(1/2))*omega_n

It is rather satisfying to find that no modifications to the code of froscill_super_symb that was written for the real (supercritical) case are required to account for the complex eigenvalues and eigenvectors: it just works as is.

4.5 Undamped oscillator: alternative treatment

The characteristic equation for the undamped oscillator gives

λ1,2 = ±i √(k/m) = ±i ωn .

Let us compute the first eigenvector, corresponding to λ1 = i ωn:

z1 = [z11; z21] = [1; i ωn] .

The second eigenvector corresponds to the second eigenvalue, which is the complex conjugate of the first one, λ2 = λ̄1 = −i ωn,

²See: aetna/LinearOscillator/froscill_sub_symb.m


z2 = [z12; z22] = z̄1 = [1; −i ωn] .

The general solution of the free undamped oscillator motion is a linear combination of the eigenvectors

y = c1 e^{λ1 t} z1 + c2 e^{λ2 t} z2 .

Because of the complex conjugate status of the pairs of the eigenvalues and eigenvectors, we have

y = c1 e^{λ1 t} z1 + c2 e^{λ̄1 t} z̄1 .

Introducing the initial condition, which is real, we obtain

y(0) = c1 z1 + c2 z̄1

and we must conclude c̄1 = c2, otherwise the right-hand side couldn’t be real. Using

Re a = (a + ā)/2

we see that the sum c1 z1 + c2 z̄1 therefore evaluates to 2 Re(c1 z1), and the constants can be determined from

y(0) = 2 Re(c1 z1) = 2 (Rec1 Rez1 − Imc1 Imz1) = 2 [Rez1, −Imz1] [Rec1; Imc1] .

We will introduce the matrix composed of the real and imaginary parts of the eigenvector z1

Z = [Rez1, −Imz1] = [1, 0; 0, −ωn] . (4.16)

Then we can write

[Rec1; Imc1] = (1/2) Z⁻¹ y(0) = (1/2) [1, 0; 0, −ωn⁻¹] y(0) .

Using the same principle that we obtain a real number from the sum of the complex conjugates, we write

y = 2 Re(c1 e^{λ1 t} z1) , (4.17)

which may be expanded into

y = 2 [Rec1 (cos(ωn t) Rez1 − sin(ωn t) Imz1) − Imc1 (sin(ωn t) Rez1 + cos(ωn t) Imz1)] .

Then collecting the terms leads to the matrix expression

y = 2 [Rez1, −Imz1] [cos ωn t, −sin ωn t; sin ωn t, cos ωn t] [Rec1; Imc1] ,

which after substitution of Rec1, Imc1 finally results in the matrix expression

y = 2 Z [cos ωn t, −sin ωn t; sin ωn t, cos ωn t] (1/2) Z⁻¹ y(0) = Z R(t) Z⁻¹ y(0) . (4.18)

We have in this way introduced the time-dependent rotation matrix

R(t) = [cos ωn t, −sin ωn t; sin ωn t, cos ωn t] . (4.19)

The solution for the displacement and velocity of the linear single degree of freedom oscillator can therefore be understood as the result of the rotation of the initial-value quantity Z⁻¹y(0) (phasor)

Z⁻¹ y(t) = R(t) Z⁻¹ y(0) . (4.20)


Illustration 5

Check that the procedure (4.15) and the alternative formula (4.18) lead to the same solution.

We don’t want to do this by hand. It is faster to use the MATLAB symbolic algebra. The function froscill_un_symb³ computes the solution twice, and then subtracts one from the other. If we get zeroes as a result, the solutions were the same.

The code begins with the same variable definitions and solution of the eigenvalue problem as for froscill_sub_symb. We compute the first solution using (4.15).

L =simple(inv(V)*A*V);

eLt =diag(exp(diag(L)*t));

y1=simple(V*eLt*inv(V))*y0;

Next, we compute the solution using the alternative with the rotation matrix (4.19).

Z =[real(V(:,1)),-imag(V(:,1))];

R = [cos(omega_n*t),-sin(omega_n*t);

sin(omega_n*t),cos(omega_n*t)];

y2 =simple(Z*R*inv(Z))*y0;

Finally we evaluate y1-y2.

Finally we can realize that the solution (4.20) is of the same form as that derived in Section 3.10 (as in (3.22)) and then again in the Illustration in Section 4.1. The new variables are w1 = y1, w2 = −y2/ωn as in Section 4.1.

4.5.1 Subcritically damped oscillator: alternative treatment

The eigenvalues are

λ1,2 = −c/(2m) ± i √(k/m − (c/(2m))²) .

Equation (4.17) is still applicable. The only difference is that λ1,2 now have a real component. Using

e^{(α+iω)t} = e^{αt} e^{iωt}

we see that (4.18) requires only a change of the matrix R, which should for the damped oscillator read

R(t) = e^{αt} [cos ωt, −sin ωt; sin ωt, cos ωt] .

Here

α = −c/(2m) , ω = √(k/m − (c/(2m))²) .

Let us note that ω is the frequency of damped oscillation.

³See: aetna/LinearOscillator/froscill_un_symb.m


4.6 Matrix-exponential solution

Consider a linear differential equation with constant coefficients in a single variable

ẏ = a y , y(0) = y0 .

We have derived the solution before in the form

y = e^{at} y0 ,

that is as an exponential. Therefore it may not be a terrible stretch of imagination to anticipate the solution to (4.1) to be formally identical, so that the IVP

ẏ = A · y , y(0) = y0

would have the solution

y = e^{At} y0 .

Of course, we must explain the meaning of the matrix exponential e^{At}. Even here the analogy with the scalar case is of help: consider defining a scalar exponential using the Taylor series starting from t = 0

e^{at} = e^{a·0} + a e^{a·0} t + a² e^{a·0} t²/2 + ... = Σ_{k=0}^∞ a^k t^k / k! .

The matrix exponential could be defined (and in fact this is one of its definitions) as

e^{At} = Σ_{k=0}^∞ A^k t^k / k! . (4.21)

For a general matrix A evaluating the infinite series would be difficult. Fortunately, for some special matrices it turns out to be easy. Especially the nice diagonal matrix makes this a breeze:

e^{Dt} = Σ_{k=0}^∞ D^k t^k / k! = diag( Σ_{k=0}^∞ D11^k t^k / k! , Σ_{k=0}^∞ D22^k t^k / k! , ... , Σ_{k=0}^∞ Dnn^k t^k / k! ) .

This result is easily verified by just multiplying through the diagonal matrix with itself. Finally we realize that on the right-hand side we have a matrix with exponentials e^{Djj t} on the diagonal

e^{Dt} = Σ_{k=0}^∞ D^k t^k / k! = diag( e^{D11 t} , e^{D22 t} , ... , e^{Dn−1,n−1 t} , e^{Dnn t} ) . (4.22)

This is very helpful indeed, since we already saw that having a full set of eigenvectors as in equation (4.13) allows us to write the matrix A as similar to a diagonal matrix. Let us substitute into the definition of a matrix exponential the similarity

V⁻¹ A V = Λ , A = V Λ V⁻¹


as

e^{At} = Σ_{k=0}^∞ A^k t^k / k! = Σ_{k=0}^∞ (V Λ V⁻¹)^k t^k / k! .

Now we work out the matrix powers. The zeroth and first,

(V Λ V⁻¹)⁰ = 1 = V 1 V⁻¹ , (V Λ V⁻¹)¹ = V Λ V⁻¹ ,

and the second,

(V Λ V⁻¹)² = (V Λ V⁻¹)(V Λ V⁻¹) = V Λ (V⁻¹ V) Λ V⁻¹ = V Λ Λ V⁻¹ = V Λ² V⁻¹ .

The pattern is clear: we get

(V Λ V⁻¹)^k = V Λ^k V⁻¹ .

The matrix exponential will become

e^{At} = Σ_{k=0}^∞ V Λ^k V⁻¹ t^k / k! = V ( Σ_{k=0}^∞ Λ^k t^k / k! ) V⁻¹ = V e^{Λt} V⁻¹ .

To compute the matrix exponential of the diagonal Λt is easy, so the only thing we need in order to compute the exponential of At is a full set of eigenvectors of A. (Warning: there are matrices that do not have a full set of linearly independent eigenvectors. Such matrices are called defective. More details are discussed in the next section.)
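MATLAB provides the matrix exponential directly as the built-in function expm, which gives a quick numerical check of e^{At} = V e^{Λt} V⁻¹ (reusing the subcritically damped data m = 13, k = 6100, ζ = 0.2; added here for illustration):

m = 13; k = 6100; omega_n = sqrt(k/m);
c = 0.2*2*m*omega_n;
A = [0, 1; -k/m, -c/m];
[V, D] = eig(A);
t = 0.03;
E1 = expm(A*t);                          % built-in matrix exponential
E2 = real(V*diag(exp(diag(D)*t))/V);     % V e^{Lambda t} V^{-1}
disp(norm(E1 - E2))                      % difference at round-off level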

As a matter of fact we have been using the matrix exponential all along. The solution (4.7) is of the form V e^{Λt} V⁻¹. In equation (4.18) the matrix R(t) (rotation matrix) is also a matrix exponential of a special matrix: the skew-symmetric matrix

S = [0, −ωn; ωn, 0] = ωn [0, −1; 1, 0] .

Note that the powers of S have this special structure

S² = −ωn² 1 , S³ = −ωn² S , S⁴ = ωn⁴ 1 , S⁵ = ωn⁴ S , ... .

Therefore, for the rotation matrix we have

R(t) = e^{St} = Σ_{k=0}^∞ S^k t^k / k! = 1 t⁰/0! + S t¹/1! + (−ωn² 1) t²/2! + (−ωn² S) t³/3! + ... .

Constructing the infinite matrix series, this gives the correct Taylor expansions for the cosines and sines of the rotation matrix

R(t) = e^{St} = [1 − ωn² t²/2! + ωn⁴ t⁴/4! + ...] 1 + [ωn t − ωn³ t³/3! + ωn⁵ t⁵/5! + ...] (1/ωn) S ,

where the first bracket is the Taylor series of cos ωn t and the second bracket is the Taylor series of sin ωn t.

4.7 Critically damped oscillator

The oscillator is critically damped when at the same time α = −c/(2m) and ω = 0. The characteristic equation has a double real root λ1,2 = α = −c/(2m).

Let us compute the first eigenvector. Substituting we have


[c/(2m), 1; −k/m, c/(2m) − c/m] [z11; z21] = [0; 0] .

Further simplifying with k/m = (c/(2m))² leads to

[c/(2m), 1; −(c/(2m))², −c/(2m)] [z11; z21] = [0; 0] .

Arbitrarily choosing one component of the eigenvector, for instance z21 = −c/(2m), yields

z1 = [z11; z21] = [1; −c/(2m)] .

Inconveniently, this is the only eigenvector that we are going to get for the case of the critically damped oscillator. Since we obtained a double real root, the second eigenvector is exactly the same as the first. We say inconveniently, because our approach was developed for an invertible eigenvector matrix

V = [z1, z2]

and it will now fail since both columns of V are the same, and such a matrix is not invertible. We call matrices that have missing eigenvectors defective. For the critically damped oscillator the matrix A is defective.

Fig. 4.2. Location of the roots for ζ = 1.005

Let us approach the degenerate case of the critically damped oscillator as the limit of the supercritically damped oscillator whose two eigenvalues approach each other to become one. Figure 4.2 shows a circle of radius equal to ωn for the data of the IVP (4.1) set to m = 13, k = 6100, ζ = 1.005 (in other words close to critical damping). The two (real) eigenvalues are indicated by small circular markers (the function animated_eigenvalue_diagram⁴ illustrates with an animation how the eigenvalues change in dependence on the amount of damping). For critical damping (ζ = 1.0) the two eigenvalues would merge on the black circle and become one real eigenvalue (also referred to

⁴See: aetna/LinearOscillator/animated_eigenvalue_diagram.m


as a repeated eigenvalue). As the eigenvalues approach each other, λ2 −→ λ1, the solution may still be written as

y = c1 e^{λ1 t} z1 + c2 e^{λ2 t} z2 .

In order to understand the behavior of the eigenvalues as they approach each other, we can write the exponential e^{λ2 t} using the Taylor series with λ1 as the starting point

e^{λ2 t} = e^{λ1 t} + d/dλ2 (e^{λ2 t})|_{λ1} (λ2 − λ1) + ... = e^{λ1 t} + t e^{λ1 t} (λ2 − λ1) + ... .

So we see that the difference between e^{λ1 t} and e^{λ2 t} is

e^{λ2 t} − e^{λ1 t} = e^{λ1 t} + t e^{λ1 t} (λ2 − λ1) + ... − e^{λ1 t} = t e^{λ1 t} (λ2 − λ1) + ... .

From this result we conclude that as λ2 −→ λ1, a linearly independent basis will be the two functions

e^{λ1 t} , and t e^{λ1 t} .

With essentially the same reasoning we can now look for the missing eigenvector. Write (again assuming λ2 −→ λ1)

z2 ≈ z1 + dz2/dλ2 |_{λ1} (λ2 − λ1) .

This allows us to subtract the two eigenvector equations from each other to obtain

(+) A z2 = λ2 z2
(−) A z1 = λ1 z1
A (z2 − z1) = λ2 z2 − λ1 z1 ,

where we can substitute the difference of the eigenvectors to arrive at

A dz2/dλ2 |_{λ1} (λ2 − λ1) = (λ2 − λ1) z1 + λ2 dz2/dλ2 |_{λ1} (λ2 − λ1) ,

and, factoring out (λ2 − λ1), finally

A dz2/dλ2 |_{λ1} = z1 + λ2 dz2/dλ2 |_{λ1} .

Fig. 4.3. Relationship of eigenvectors for λ2 → λ1

Note that dz2/dλ2 |_{λ1} has the direction of the difference between the two vectors z2 and z1. Since z2 and z1 are linearly independent vectors for λ2 ≠ λ1, so are the vectors z1 and dz2/dλ2 |_{λ1}. Therefore, when λ2 = λ1, we can obtain a full set of linearly independent vectors that go with the double root as the two vectors z1 and p2 that solve


Az1 = λ1z1 , Ap2 = z1 + λ2p2 . (4.23)

Here p2 is not an eigenvector. Rather, it is called a principal vector. To continue with our critically damped oscillator: we can compute the principal vector as

[0, 1; −k/m, −c/m] [p12; p22] = [z11; z21] + λ2 [p12; p22] ,

or, upon substitution,

[0, 1; −(c/(2m))², −c/m] [p12; p22] = [1; −c/(2m)] − c/(2m) [p12; p22] ,

or, rearranging the terms,

[c/(2m), 1; −(c/(2m))², −c/(2m)] [p12; p22] = [1; −c/(2m)] .

Since the matrix on the left-hand side is singular, the principal vector is not determined uniquely. One possible solution (taking p12 = 0) is

p2 = [p12; p22] = [0; 1] .

Similarly as for the general oscillator eigenproblem (4.10), which could be written in the matrix form (4.11), we can write here for the critically damped oscillator

A [z1, p2] = [z1, p2] [λ1, 1; 0, λ2] , (4.24)

where we introduce the so-called Jordan matrix

J = [λ1, 1; 0, λ2] = [λ1, 1; 0, λ1] (since λ1 = λ2) (4.25)

and the matrix of the principal vectors

M = [z1, p2] .

We see that for critical damping the matrix A cannot be diagonalized (i.e. be made similar to a diagonal matrix). It becomes defective (i.e. it doesn’t have a full set of eigenvectors). The best we can do is to make it similar to the Jordan matrix

M⁻¹ A M = J . (4.26)

Illustration 6

Plot the analytical solution to the IVP (4.1) with m = 13, k = 6100, ζ = 1.0 (critical damping), x0 = 0, and v0 = 1.

We shall follow the procedure that leads to the Jordan matrix. The MATLAB solution is based on the symbolic algebra toolbox⁵.

The solution to the eigenvalue problem yields a rectangular one-column V. Therefore we solve for the principal vector p2, and we form the matrix M

⁵See: aetna/LinearOscillator/froscill_crit_symb.m


[V,D] =eig(A);% this gives V with only one column

% so here we solve for the principal vector

p2 = (A-D(2,2)*eye(2))\V(:,1);

M = [V(:,1),p2];

We compute the Jordan matrix, the exponential of the Jordan matrix, and the solution follows as before (see for instance the Illustration on page 65).

J =simple(inv(M)*A*M);

eJt =expm(J*t);

y=simple(M*eJt*inv(M))*y0;

Illustration 7

Compute the matrix exponential of the Jordan matrix

J = t [λ, 1; 0, λ] .

Solution: The matrix can be decomposed as

J = t λ 1 + t [0, 1; 0, 0] = t λ 1 + t Θ .

Because we have

(t λ 1)(t Θ) = (t Θ)(t λ 1)

(i.e. the matrices commute), it holds for the matrix exponential that

e^{t λ 1 + t Θ} = e^{t λ 1} e^{t Θ} = e^{t Θ} e^{t λ 1} .

The exponential of the diagonal matrix is easy: see equation (4.22). For the matrix Θ, using the definition (4.21) we readily get

e^{t Θ} = Σ_{k=0}^∞ Θ^k t^k / k! = 1 + t Θ

because all its powers beyond the first are zero matrices, Θ² = 0, and so on. Therefore, we have

e^{t λ 1 + t Θ} = e^{t λ 1} e^{t Θ} = e^{λ t} 1 (1 + t Θ) = e^{λ t} [1, t; 0, 1] .
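A numerical spot check with expm (arbitrary illustrative values λ = −2, t = 0.5; added here for confirmation):

lambda = -2; t = 0.5;
J = t*[lambda, 1; 0, lambda];
disp(expm(J))                        % built-in matrix exponential
disp(exp(lambda*t)*[1, t; 0, 1])     % the closed form derived above; identical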

4.8 Annotated bibliography

1. V. I. Arnold, Ordinary Differential Equations, Universitext, Springer, 2006. This book has a great discussion of the issues of complex differential equations, including explanations of the relationship of complex differential equations and the linear oscillator IVP.

2. D. E. Newland, Mechanical Vibration Analysis and Computation, Dover Publications Inc., 2006. An excellent reference for all vibration subjects. It covers thoroughly the single degree of freedom oscillator, matrix analysis of natural frequencies and mode shapes, and numerical methods for modal analysis. Did I mention that it was inexpensive?


5 Linear Multiple Degree of Freedom Oscillator

Summary

1. For the multiple degree of freedom linear vibrating system we study how to transform between the second order and the first order matrix form. Modal analysis is discussed in detail for both forms.

2. Modal analysis decouples the equations of the multiple degree of freedom system. The original coupled system may be understood in terms of the individual modal components. Main idea: whether coupled or uncoupled, the response of the system is determined by the modal characteristics. Each uncoupled equation evolves as governed by its own eigenvalue.

3. We can analyze a scalar real or complex linear differential equation to gain insight into the stability behavior. When the equations are coupled, stability is usually decided by the fastest changing component of the solution (as dictated by the largest eigenvalue). This information is used to select the time step for direct integration of the equations of motion.

4. The frequency content (spectrum) is a critical piece of information. We use the Fourier transform and we discuss the Nyquist frequency.

5. The first-order form of the vibrating system equations is used to analyze damped systems.

5.1 Model of a vibrating system

The second-order equation of the free (unforced) motion of a system of interconnected damped linear oscillators (see Figure 5.1) is

Mẍ = −Kx − Cẋ , (5.1)

where M is the mass matrix, K is the stiffness matrix, C is the damping matrix, and x is the vector of displacements. In conjunction with the initial conditions

x(0) = x0 , ẋ(0) = v0

this will define the multi-degree of freedom (dof) damped oscillator IVP. Using the definition

v = ẋ

will yield the general first-order form of the multi-dof damped oscillator IVP as

ẏ = A · y , y(0) = y0 , (5.2)

where

A = [0, 1; −M⁻¹K, −M⁻¹C]


and

y = [x; v] .

The vector variable y collects both the vector of displacements x and the vector of velocities v. Figure 5.1 shows an example of a multi-degree of freedom oscillator that is physically realized as three carriages connected by springs and dampers. This will be our sample mechanical system that will be studied in the following sections.

Fig. 5.1. Linear 3-degree of freedom oscillator (masses m1, m2, m3 with displacements x1, x2, x3, connected by springs and dampers k1, c1, k2, c2, k3, c3)

5.2 Undamped vibrations

Let us take a system where all the springs are of equal stiffness kj = k = 61, all the masses are equal, mj = m = 1.3, and the system is undamped, cj = 0.

5.2.1 Second order form

The second order equations of motion (5.1) have a solution (this is an educated guess)

x = e^{λt} z ,

which upon substitution into (5.1) gives

λ² M e^{λt} z = −K e^{λt} z .

This yields the eigenvalue problem

−λ² M z = K z ,

which is a form of the so-called generalized eigenvalue problem

ω² M z = K z (5.3)

for the eigenvalues ω². For the mechanical system of Figure 5.1 the mass and stiffness matrices are

M = [m, 0, 0; 0, m, 0; 0, 0, m] ,   K = [2k, −k, 0; −k, 2k, −k; 0, −k, k] .

Similarly to the characteristic equation for the standard eigenvalue problem (3.15) we can write

det(K − ω²M) = 0 . (5.4)


Illustration 1

For the stiffness and mass matrices given above, the characteristic polynomial is

det( [2k, −k, 0; −k, 2k, −k; 0, −k, k] − ω² [m, 0, 0; 0, m, 0; 0, 0, m] ) = k³ − 6k²mω² + 5km²(ω²)² − m³(ω²)³

The eigenvalues ω² are the roots of this polynomial.

Illustration 2

For the stiffness and mass matrices given above, the characteristic equation is

k³ − 6k²mω² + 5km²(ω²)² − m³(ω²)³ = 0 .

Find the roots.

A symbolic solution can be delivered by MATLAB, but it is far from tidy. A numerical solution of the eigenvalues for the data m = 1.3, k = 61, c = 0 results from the roots of

−(2197/1000)(ω²)³ + (10309/20)(ω²)² − (145119/5)ω² + 226981 = 0 ,

which may be (crudely) solved graphically [1], by plotting the cubic polynomial in ω² and locating where it crosses zero, or numerically using solve.

The eigenvalues (and eigenvectors) of the generalized eigenvalue problem are known to be real for M, K symmetric. Also, when the stiffness matrix is nonsingular, the eigenvalues will be positive. Hence we write

−λ² = ω² ≥ 0 .

The generalized eigenvalue problem is solved in MATLAB [2] using

[V,D]=eig(K,M);

[1] See: aetna/ThreeCarriages/n3_undamped_modes_MK_symbolic.m
[2] See: aetna/ThreeCarriages/n3_undamped_modes_MK.m


For the above matrices, the eigenvalues are ω1² = 9.2937 (i.e. angular frequency ω1 = ±3.0486), ω2² = 72.9634 (i.e. angular frequency ω2 = ±8.5419), and ω3² = 152.3583 (i.e. angular frequency ω3 = ±12.3433). Therefore we see that the λ's are all imaginary, λj = ±iωj. Note that there are three eigenvalues, but each eigenvalue generates two solutions because of the ± for the square roots. That is necessary, because there are six constants needed to satisfy the initial conditions (two conditions, each with three equations).
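These numbers are easy to reproduce with a minimal numerical sketch (the variable names are illustrative; the data m = 1.3, k = 61 are taken from above):

m = 1.3; k = 61;
M = m*eye(3);
K = [2*k, -k, 0; -k, 2*k, -k; 0, -k, k];
[V, D] = eig(K, M);      % generalized eigenvalue problem K*z = omega^2*M*z
omega2 = diag(D)'        % approximately 9.2937, 72.9634, 152.3583 (possibly in a different order)
omega = sqrt(omega2)     % angular frequencies 3.0486, 8.5419, 12.3433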

The solutions are therefore found to be both

x = e^{+iωjt} zj and x = e^{−iωjt} zj ,

which are complex vectors. The solution however needs to be real. This is easily accomplished by taking as the solutions a linear combination of the above, for instance

x = Re( e^{+iωjt} + e^{−iωjt} ) z and x = Im( e^{+iωjt} − e^{−iωjt} ) z .

From Euler's formula we know that

Re( e^{+iωjt} + e^{−iωjt} ) = 2 cos ωjt

and

Im( e^{+iωjt} − e^{−iωjt} ) = 2 sin ωjt .

Therefore, we can take as the three linearly independent solutions (j = 1, 2, 3)

x = cos(ωjt) zj and x = sin(ωjt) zj .

In this way we will obtain enough integration constants to satisfy the initial conditions, since the general solution may be written as

x = Σ_{j=1}^{3} ( Aj cos ωjt + Bj sin ωjt ) zj .

The undamped mode shapes for our example are shown in Figure 5.2, both graphically as arrows and numerically as the values of the components [3].

5.2.2 First order form

Next we will explore the free vibration of the same system in its first-order form. The system matrix is (note: no damping)

A = [0, 1; −M⁻¹K, 0] .

The standard eigenvalue problem is solved in MATLAB as [4]

[V,D]=eig(A);

Note that the results for the eigenvalues on the diagonal of D indicate that the eigenvalues are not ordered from smallest in absolute value to the largest as we would like to see them.

D =
    12.34i        0        0        0        0        0
         0  -12.34i        0        0        0        0
         0        0   +8.54i        0        0        0
         0        0        0   -8.54i        0        0
         0        0        0        0    3.05i        0
         0        0        0        0        0   -3.05i

[3] See: aetna/ThreeCarriages/n3_undamped_modes_MK.m
[4] See: aetna/ThreeCarriages/n3_undamped_modes_A.m


Fig. 5.2. Linear 3-degree of freedom oscillator: second-order model, undamped modes. The mode components are z11 = −0.288, z21 = −0.518, z31 = −0.646; z12 = 0.646, z22 = 0.288, z32 = −0.518; z13 = −0.518, z23 = 0.646, z33 = −0.288.

We can reorder them using the sort function: the first line sorts the diagonal elements by ascending modulus, the second line re-orders the rows and columns of D and constructs the new D, and the third line then reorders the columns of V.

[Ignore,ix] = sort(abs(diag(D)));

D =D(ix,ix);

V =V(:,ix);

Here is the reordered D (be sure to compare with the eigenvalues computed in the previous section for the generalized EP)

D =
    0+3.05i        0        0        0        0        0
          0  0-3.05i        0        0        0        0
          0        0  0+8.54i        0        0        0
          0        0        0  0-8.54i        0        0
          0        0        0        0  0+12.3i        0
          0        0        0        0        0  0-12.3i

and the corresponding eigenvectors as columns of V

V = 10⁻² ×
    0-10.2i   0+10.2i   0-8.57i   0+8.57i   0+4.77i   0-4.77i
    0-18.4i   0+18.4i   0-3.81i   0+3.81i   0-5.95i   0+5.95i
      0-23i     0+23i   0+6.87i   0-6.87i   0+2.65i   0-2.65i
       31.2      31.2      73.2      73.2     -58.9     -58.9
       56.2      56.2      32.6      32.6      73.5      73.5
         70        70     -58.7     -58.7     -32.7     -32.7

Note that the eigenvalues come in complex conjugate pairs. The corresponding eigenvectors are also complex conjugate. Each pair of complex conjugate eigenvalues corresponds to a one-degree of freedom oscillator with complex-conjugate solutions.

Figure 5.3 illustrates graphically the modes of the A matrix. There are six components to each eigenvector: the first three elements represent the components of the displacement, and the last three elements represent the components of the velocity. Therefore, the eigenvectors are visualized using two arrows at each mass. We use the classical complex-vector (phasor) representation: the real part is on the horizontal axis, and the imaginary part is on the vertical axis. Note that all displacement components (green) are purely imaginary, while all the velocity components (red) are real. An animation of the motion described by a single eigenvector


x = e^{λjt} zj (no sum over j)

is implemented in the script n3_undamped_A_animation [5].

Fig. 5.3. Linear 3-degree of freedom oscillator: first-order model, undamped modes

Figure 5.4 shows the free-vibration response to excitation in the form of the initial condition set to (the real part of) mode 2 [6]. Note that the displacements go through zero at the same time, and that the amplitude does not change.

Fig. 5.4. Linear 3-degree of freedom oscillator: first-order model, undamped. Free-vibration response to initial condition in the form of mode 2. (Plot of the displacements y(1:3) versus t.)

We have made the observation that the eigenvalues and eigenvectors come in complex conjugate pairs. Each pair of complex conjugate eigenvalues corresponds to a one-degree of freedom oscillator with complex-conjugate solutions. We have shown in Section 4.3 that all the individual eigenvalue problems for the 2 × 2 matrix A may be written as one matrix expression

AV = V Λ ,

where each column of V corresponds to one eigenvector, and the eigenvalues are the diagonal elements of the diagonal matrix Λ. So that provided V was invertible, the matrix A was similar to a

[5] See: aetna/ThreeCarriages/n3_undamped_A_animation.m
[6] See: aetna/ThreeCarriages/n3_undamped_IC.m


diagonal matrix (4.13). Exactly the same transformation may be used no matter what the size of the matrix A. The 6 × 6 A is also similar to a diagonal matrix

V⁻¹AV = D

using the matrix of eigenvectors V. Therefore, the original IVP (5.2) may be written in the completely equivalent form

ẇ = D · w , w(0) = V⁻¹ y0 (5.5)

for the new variables, the modal coordinates, w. Each modal coordinate wj is independent of the others since the matrix D is diagonal.
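A minimal sketch of what the decoupling means in practice (assuming A, V, D from above; the initial condition y0 is just an illustrative choice):

y0 = [1; 0; 0; 0; 0; 0];      % some initial condition
w0 = V\y0;                    % modal initial condition, w(0) = V^{-1}*y0
t = 0.5;
w = exp(diag(D)*t).*w0;       % each modal coordinate evolves on its own, w_j(t) = e^{lambda_j t} w_j(0)
y = V*w;                      % back to physical coordinates; the same result as expm(A*t)*y0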

5.3 Direct time integration and eigenvalues

Let us consider the task of finding a numerical solution to the IVP (5.2) by the so-called direct time integration using a MATLAB integrator. Let us assume such an integrator is conditionally stable for our current vibration system (pure oscillation, no decay, no growth). As an example, let us take the fourth-order Runge-Kutta integrator (oderk4 [7]). The amplification factor for this method when applied to the scalar IVP for the modal coordinate

ẇ = λw , w(0) = w0

reads

α = 1 + ∆tλ + (∆tλ)²/2 + (∆tλ)³/6 + (∆tλ)⁴/24 .

The stability diagram is shown in Figure 3.24. The intersection of the imaginary axis with the level curve |α| = 1 of the amplification factor gives one and only one stable time step for purely oscillatory solutions. Numerically we can solve for the corresponding stable time step with fzero as

F=@(dt)(abs(1+(dt*lambda)+(dt*lambda)^2/2+(dt*lambda)^3/6+(dt*lambda)^4/24)-1);

dt =fzero(F, 1.0)

Integrating with the stable time step leads to an oscillating solution with unchanging amplitude, using a longer time step yields oscillating solutions with increasing amplitude, and decreasing the time step leads to oscillations with decaying amplitude. Figure 5.5 was produced by the script n3_undamped_direct_modal [8]. The modal coordinate w2 (λ2 = 3.0486i) was integrated by oderk4 with a stable time step ∆t (horizontal curve), slightly longer time step 1.00001∆t (rising curve), and shorter time step ∆t/10 (dropping curve), and it is a good illustration of the above derivation.

If we were to numerically integrate the IVP (5.5), i.e. the uncoupled form of the original (5.2), we could integrate each equation separately from all the others since in the uncoupled form they are totally independent. Hence we could also use different time steps for different equations. Let us say we were to use a conditionally stable integrator such as oderk4. Then for each equation j we could find a stable time step and integrate wj with that time step. Of course, to construct the original solution as y = V w would take additional work: all the wj would be computed at different time instants, whereas all the components of y should be known at the same time instants.

Alternatively, if we were to integrate the original IVP (5.2) in the coupled form, the uncoupled modal coordinates wj would still be present in the solution y, only now they would be mixed together (coupled) in the variables yk. Again, let us assume that we need to use a conditionally stable integrator such as oderk4. However, now we have to use only one time step for all the components of the solution. It would be in general impossible for purely oscillatory solutions to

[7] See: aetna/utilities/ODE/integrators/oderk4.m
[8] See: aetna/ThreeCarriages/n3_undamped_direct_modal.m


Fig. 5.5. Integration of modal coordinate w2 (λ2 = 3.0486i). The real and imaginary part of the solution (phase-space diagram, Re(w2) versus Im(w2)) on the left, absolute value |w| of the complex solution versus t on the right. Integrated with stable time step ∆t (exactly one circle on the left, on the horizontal curve on the right), slightly longer time step 1.00001∆t (increasing radius on the left, rising curve on the right), and shorter time step ∆t/10 (decreasing radius on the left, dropping curve on the right)

integrate at a time step that was stable for all wj at the same time. If we cannot integrate all solution components so that their amplitude of oscillation is conserved, then we would probably elect to have the amplitudes decay rather than grow. Therefore, we would integrate the coupled IVP with the time step equal to or shorter than the shortest stable time step. For our example the stable time step lengths are [9]

dts =

0.9278 0.9278 0.3311 0.3311 0.2291 0.2291

The shortest stable time step (for solution components five and six) is ∆tmin ≈ 0.2291. Figure 5.6 shows that running the integrator at the shortest stable time step yields a solution of the original, coupled, vibrating system which is non-growing (decaying oscillations), because two components are integrated at the stable time step (and therefore their amplitude is maintained), and the first four components are integrated below their stable time step and hence their amplitude decays. Running the integration at just a slightly longer time step than ∆tmin means that the first four components are still integrated below their stable time steps. Their amplitude will still decay. The last two components are integrated very slightly above their stable time step, which means that the amplification factor for them is just a tad greater than one. We can clearly see how that can easily destroy the solution as we get a sharply growing oscillation amplitude of the coupled solution (on the right).
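The stable time step lengths listed above can be reproduced with a short loop (a sketch only; it assumes the eigenvalues of the first-order matrix A computed earlier):

lambdas = diag(D);     % eigenvalues of A, here +/-3.0486i, +/-8.5419i, +/-12.3433i
dts = zeros(size(lambdas));
for j = 1:length(lambdas)
    lambda = lambdas(j);
    F = @(dt)(abs(1 + dt*lambda + (dt*lambda)^2/2 + (dt*lambda)^3/6 + (dt*lambda)^4/24) - 1);
    dts(j) = fzero(F, 1.0);   % time step at which |alpha| = 1 for mode j
end
dt_min = min(dts)             % approximately 0.2291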

5.3.1 Practical use of eigenvalues for integration

The eigenvalues of the matrix A of the IVP (5.2) (sometimes referred to as the spectrum of A) need to be taken into account when the IVP is to be integrated numerically. We have shown the reasons for this above, and now we are going to try to summarize a few practical rules.

• If the decoupling of the original system is feasible and cost-effective, each of the resulting independent modal equations can be integrated separately with its own time step. In particular, exponentially decaying (or growing) solutions may require the time step to be smaller than some appropriate length for stability. Purely oscillating solutions may also pose a limit on the time step, depending on the integrator. To achieve stability we need to solve for an appropriate time step from the amplification factor, as shown for instance above for the fourth-order Runge-Kutta integrator, or for the Euler integrators in Chapter 3.

[9] See: aetna/ThreeCarriages/n3_undamped_stable_rk4.m


Fig. 5.6. Integration of the undamped IVP with the shortest stable time step ∆tmin (non-growing solution on the left), and slightly longer time step than the shortest stable time step, 1.002∆tmin (growing solution on the right). (Plots of y(1:3) versus t.)

• All types of solutions may also require a time step that provides sufficient accuracy. In this respect we should remember that equations should not be integrated at a time step that is longer than the stable time step. Therefore we first consider stability, and then, if necessary, we further shorten the time step length for accuracy. For oscillating solutions, good accuracy is typically achieved if the time step is less than 1/10 of the period of oscillation. In particular, let us say we got a purely imaginary eigenvalue for the jth mode, λj = iωj. Then the time step for acceptable accuracy should be

∆t ≤ Tj/10 ,

where Tj is the period of vibration for the jth mode

Tj = 2π/ωj .

• If the equations cannot be decoupled (such as when the cost of solving the complete eigenvalue problem is too high), the system has to be integrated in its coupled form. Firstly, we shall think about stability. A time step must be chosen that works well for all the eigenvalues and eigenvectors in the system. That shouldn't be a problem for unconditionally stable integrators: they would give reasonable answers for any time step length. Unfortunately, there is really only one such integrator on our list, the trapezoidal rule. For conditionally stable integrators we have to choose a suitable time step length. In particular, we would most likely try to avoid integrating at a time step length that would make some of the solution components grow when they should not grow (oscillating or decaying components). Then we should choose a time step that is the smallest of all the time step limits computed for the individual eigenvector/eigenvalue pairs. Secondly, the time step is typically assessed with respect to accuracy requirements, as discussed above.

More on the topic of time step selection appears in the next two sections, which deal with solutions to initial boundary value problems.

5.4 Analyzing the frequency content

Next we look at a couple of experiments that will provide insight into the frequency content of the response. First we simulate the free vibration of the undamped system, with the initial condition being a mixture of the modes 1, 2, 5, 6 [10]. The "measurement" of the response will be the displacement

[10] See: aetna/ThreeCarriages/n3_undamped_fft.m


of the mass 3, which the simulation will give us as a "discrete signal". The signal is a sequence of numbers xj measured at equally spaced time intervals tj such that tj − tj−1 = ∆t.

The sampling interval is a critical quantity. With a given sampling interval length it is only possible to sample signals faithfully up to a certain frequency. Figure 5.7 shows two signals of different frequencies sampled with the same sampling interval. Even though the signals have different frequencies, their sampling produces exactly the same numbers and therefore we would be interpreting them as one and the same. This is called aliasing. The so-called Nyquist rate 1/∆t is the minimum sampling rate required to avoid aliasing, i.e. viewing two very different frequencies as being the same due to inadequate sampling.

Fig. 5.7. Illustration of the Nyquist rate. Sampling of a signal s(t) at a rate that is lower than the Nyquist rate for the signal represented with the dashed line. Clearly, as far as the information obtained from the sampling is concerned, the two signals shown in the figure are completely equivalent, even though they have different frequencies.

We can see from Figure 5.8 that the Nyquist rate is twice the frequency that we wish to reproduce faithfully. The highest frequency that is reproduced faithfully when sampling at the Nyquist rate is the Nyquist frequency

fNy = (1/2)(1/∆t) , (5.6)

where ∆t is the sampling interval. If we sample with an even higher rate (with a smaller sampling interval), the signal is going to be reproduced much better; on the other hand, sampling slower, below the Nyquist rate, i.e. with a longer sampling interval, the signal is going to be aliased: we will get the wrong idea of its frequency.
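A two-line numerical illustration of aliasing (a sketch; the numbers are chosen only so that one frequency folds exactly onto the other): with a sampling interval ∆t = 0.1 s the Nyquist frequency is 5 Hz, and a 9 Hz cosine then produces exactly the same samples as a 1 Hz cosine.

dt = 0.1; t = (0:9)*dt;                       % sampling at 10 Hz, so f_Ny = 5 Hz
max(abs(cos(2*pi*9*t) - cos(2*pi*1*t)))       % zero to machine precision: 9 Hz aliases to 1 Hz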

In order to extract the frequencies that contribute to the response from the measured signal we perform an FFT analysis. A quick refresher: the discrete Fourier transform (DFT) is expressed by the formula

Am = (1/N) Σ_{n=1}^{N} e^{−i2π(n−1)(m−1)/N} an , m = 1, ..., N (5.7)

that links two sets of numbers, the input signal an and its Fourier transform coefficients Am. The Fast Fourier transform (FFT) is simply a fast way of multiplying with the complex transform matrix, i.e. evaluating the sum Σ_{n=1}^{N} e^{−i2π(n−1)(m−1)/N} an.

The Fourier transform (Fourier series) of a periodic function x(t) with period T is defined as

x(t) = Σ_{m=−∞}^{∞} Xm e^{im(2π/T)t} , (5.8)

where


Fig. 5.8. Illustration of the Nyquist frequency. Frequencies which are lower than the Nyquist frequency are sampled at a higher rate. (Panels show a signal s(t) sampled with f = 1 × fNy, f = 1.1 × fNy, f = 2 × fNy, and f = 10 × fNy.)

Xm = (1/T) ∫_0^T x(t) e^{−im(2π/T)t} dt . (5.9)

Here 2π/T = ω0 is the fundamental frequency. The following illustration shows how equation (5.7), which defines the transformation between the Fourier coefficients and the input discrete signal, can be obtained from the above expressions for the continuous transform by a numerical approximation of the integral.

Illustration 3

Consider the possibility that the function x(t) is known only by its values xj = x(tj) at equally spaced time intervals tj such that tj − tj−1 = ∆t. Assume the period of the function is an integer number of the time intervals, T = N∆t, and the function is periodic between 0 and T. The integral (5.9) may then be approximated by a Riemann sum

(1/T) ∫_0^T x(t) e^{−i2πmt/T} dt ≈ (1/T) Σ_{n=1}^{N} x(tn) e^{−i2πmtn/T} ∆t ,

where m = 0, 1, .... After we substitute T = N∆t, tn = (n− 1)∆t, and x(tn) = xn we obtain

(1/T) ∫_0^T x(t) e^{−i2πmt/T} dt ≈ (1/(N∆t)) Σ_{n=1}^{N} xn e^{−i2πm(n−1)∆t/(N∆t)} ∆t

and finally

(1/T) ∫_0^T x(t) e^{−i2πmt/T} dt ≈ (1/N) Σ_{n=1}^{N} xn e^{−i2πm(n−1)/N} .

This is already close to formula (5.7). The remaining difference may be removed by a shift of the index m. Therefore, if we set m = 1, 2, ..., then the above will change to

(1/T) ∫_0^T x(t) e^{−i2πmt/T} dt ≈ (1/N) Σ_{n=1}^{N} xn e^{−i2π(m−1)(n−1)/N} .
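The correspondence between formula (5.7) and MATLAB's fft is easy to verify numerically (a minimal sketch with a made-up signal):

N = 8; a = randn(N,1);                  % an arbitrary test signal
A_direct = zeros(N,1);
for m = 1:N
    for n = 1:N
        A_direct(m) = A_direct(m) + exp(-1i*2*pi*(n-1)*(m-1)/N)*a(n);
    end
end
A_direct = A_direct/N;                  % the sum in (5.7), scaled by 1/N
A_fft = (1/N)*fft(a);                   % the same coefficients obtained from fft
norm(A_direct - A_fft)                  % essentially zero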

As an example of the use of the DFT we will analyze the spectrum of an earthquake acceleration record to find out which frequencies were represented strongly in the ground motion.


Fig. 5.9. Workspace variables stored in elcentro.mat. The variable desc is the description of the data stored in the file.

Illustration 4

The earthquake record is from the notorious 1940 El Centro earthquake. The acceleration data is stored in elcentro.mat (Figure 5.9), and processed by the script dft_example_1 [11]. Note that when the file is loaded as Data=load('elcentro.mat');, the variables stored in the file become fields of a structure (in this case called Data).

Data = load('elcentro.mat');
dt = Data.delt; % The sampling interval
x = Data.han; % This is the signal: let us process the north-south acceleration
t = (0:1:length(x)-1)*dt; % The times at which samples were taken

Next the signal is going to be padded to a length which is an integer power of 2, for efficiency. The product of the complex transform matrix with the signal is carried out by fft.

N = 2^nextpow2(length(x)); % Next power of 2 from length of x

X = (1/N)*fft(x,N);% Now we compute the coefficients X_k

The Nyquist frequency is calculated and used to determine the N/2 frequencies of interest, which are all frequencies lower than one half of the Nyquist rate.

f_Ny=(1/dt)/2; % This is the Nyquist frequency

f = f_Ny*linspace(0,1,N/2);% These are the frequencies

Because of the aliasing there is a symmetry of the computed coefficients, and hence we also take only one half of the coefficients, X(1:N/2). In order to preserve the energy of the signal we multiply by two.

absX=2*abs(X(1:N/2)); % Take 2 times one half of the coefficients

Finally, the coefficients are plotted.

plot(f, absX, 'Color', 'r', 'LineWidth', 3, 'LineStyle', '-', 'Marker', '.'); hold on
xlabel('Frequency f [Hz]'); ylabel('|X(f)|');

[11] See: aetna/FourierTransform/dft_example_1.m


[Plot: |X(f)| versus frequency f (Hz) for the El Centro record.]

We can see that the highest-magnitude accelerations in the north-south direction occur with frequencies below 5 Hz.

Finally, we are ready to come back to our vibration example. The displacement at the third mass is the signal to transform.

x=y(:,3);% this is the signal to transform

The computation of the Fourier transform coefficients proceeds as

N = 2^nextpow2(length(x)); % Next power of 2 from length of x

X = (1/N)*fft(x,N);% Now we compute the coefficients X_k

f_Ny=(1/dt)/2; % This is the Nyquist frequency

f = f_Ny*linspace(0,1,N/2);% These are the frequencies

absX=2*abs(X(1:N/2)); % Take 2 times one half of the coefficients

Note that the absolute value of one half of the coefficients (shown in Figure 5.10) is often called the one-sided amplitude spectrum.

The three frequencies that we may expect to show up correspond to the angular frequencies above and are 0.485 Hz, 1.359 Hz and 1.965 Hz. As is evident from Figure 5.10, the intermediate frequency, 1.359 Hz, is missing in the FFT. By including only the modes 1, 2 and 5, 6, with frequencies 0.485 Hz and 1.965 Hz, in the initial condition, we have excluded the intermediate two modes from the response. Since they were not excited by the initial condition, the two modes will not appear in the FFT: they will not contribute to the response of the system at any time.

Next we simulate the forced vibration of the system, with zero initial condition and a sinusoidal force at the frequency of 3 Hz applied at the mass 3 [12]. With the inclusion of forcing, the second order equations of motion are rewritten as

Mẍ = −Kx + L ,

where L is the vector of forces applied to the individual masses. Converting this to first order form results in

d/dt [x; v] = [0, 1; −M⁻¹K, 0] [x; v] + [0; L] .

Therefore, we add the forcing to the right-hand side function supplied to the integrator: now it includes a harmonic force applied to mass 3 [13].

[12] See: aetna/ThreeCarriages/n3_undamped_fft_f.m
[13] See: aetna/utilities/ODE/integrators/odetrap.m


Fig. 5.10. Linear 3-degree of freedom oscillator: first-order model, undamped. Free-vibration response to initial condition in the form of a mode 1, 2, 5, 6 mixture. (One-sided amplitude spectrum |X(f)| versus frequency f in Hz.)

[t,y] = odetrap(@(t,y)A*y + sin(2*pi*3*t)*[0;0;0;0;0;1],...
    tspan, y0, odeset('InitialStep',dt));

Again, the "measurement" of the response (the signal) will be the displacement of the mass 3. The simulation will give us the displacement x3 as a discrete signal. The FFT analysis on this signal is shown in Figure 5.11. We can see that now all free-vibration frequencies are present, and of course the forcing frequency shows up strongly.

Fig. 5.11. Linear 3-degree of freedom oscillator: first-order model, undamped. Forced-vibration response. (One-sided amplitude spectrum |X(f)| versus frequency f in Hz.)

5.5 Proportionally damped system

In this section we are again considering the system of Section 5.2, but this time with nonzero damping c [14].

Here we consider the damping matrix to be a multiple of the stiffness matrix (so-called stiffness-proportional damping). This manifests itself by the damping matrix having the same structure of nonzero elements as the stiffness matrix

[14] See: aetna/ThreeCarriages/n3_damped_modes_A.m


C = [2c, −c, 0; −c, 2c, −c; 0, −c, c] ,

where for our particular data c = 3.13. This is an example of the so-called Rayleigh damping. (In addition to stiffness-proportional damping there is also a mass-proportional Rayleigh damping.) The eigenvalues are now complex with negative real parts

D =
    -0.238+3.04i              0              0              0              0              0
               0   -0.238-3.04i              0              0              0              0
               0              0    -1.87+8.33i              0              0              0
               0              0              0    -1.87-8.33i              0              0
               0              0              0              0    -3.91+11.7i              0
               0              0              0              0              0    -3.91-11.7i

Clearly the system is strongly damped (the real parts of the eigenvalues are quite large in magnitude). The eigenvectors show that the velocities (the last three components) are no longer phase-shifted by 90° with respect to the displacements.

V = 10⁻² ×
     -0.8-10.2i    -0.8+10.2i   -1.88-8.36i   -1.88+8.36i    1.51+4.53i    1.51-4.53i
    -1.44-18.4i   -1.44+18.4i  -0.836-3.72i  -0.836+3.72i   -1.88-5.64i   -1.88+5.64i
     -1.8-22.9i    -1.8+22.9i     1.51+6.7i     1.51-6.7i   0.839+2.51i   0.839-2.51i
           31.2          31.2          73.2          73.2         -58.9         -58.9
           56.2          56.2          32.6          32.6          73.5          73.5
             70            70         -58.7         -58.7         -32.7         -32.7
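A minimal sketch of how this damped first-order matrix can be assembled and analyzed (assuming the data m = 1.3, k = 61, c = 3.13 quoted above; the variable names are illustrative):

m = 1.3; k = 61; c = 3.13;
M = m*eye(3);
K = [2*k, -k, 0; -k, 2*k, -k; 0, -k, k];
C = (c/k)*K;                               % stiffness-proportional (Rayleigh) damping
A = [zeros(3), eye(3); -(M\K), -(M\C)];    % first-order system matrix
[V, D] = eig(A);                           % complex eigenvalues with negative real parts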

Fig. 5.12. Linear 3-degree of freedom oscillator: first-order model, modes for stiffness-proportional damping

Figure 5.13 shows the free-vibration response to excitation in the form of the initial condition set to (the real part of) mode 2 [15]. Note that the displacements go through zero at the same time. This may also be deduced in Figure 5.12 from the fact that all the displacement arrows for any particular mode are parallel, which means they all have the same phase shift. Next we repeat the frequency analysis we've performed for the undamped system previously: we simulate the forced vibration of the damped system, with zero initial condition and a sinusoidal force at the frequency of 3 Hz applied at the mass 3 [16]. Again, the "measurement" of the response will be the displacement of the mass 3. The one-sided amplitude FFT analysis on this signal is shown in Figure 5.14. We can see that not all free-vibration frequencies are clearly distinguishable, while the forcing frequency shows up strongly.

[15] See: aetna/ThreeCarriages/n3_damped_IC.m
[16] See: aetna/ThreeCarriages/n3_damped_fft_f.m


Fig. 5.13. Linear 3-degree of freedom oscillator: first-order model, stiffness-proportional damping. Free-vibration response to initial condition in the form of mode 2. (Plot of y(1:3) versus t.)

Fig. 5.14. Linear 3-degree of freedom oscillator: first-order model, stiffness-proportional damping. Forced-vibration response. (One-sided amplitude spectrum |X(f)| versus frequency f in Hz.)

5.6 Non-proportionally damped system

In this section we consider general damping, by which we mean that it is represented by a damping matrix that does not have the structure of either the stiffness or the mass matrix. We assume only damper 1 is active (c1 = 33.3), and dampers 2 and 3 are absent:

C = [c1, 0, 0; 0, 0, 0; 0, 0, 0] .

Otherwise the mass and stiffness properties are unchanged [17].

The eigenvalues are quite interesting. There are two negative and real eigenvalues (each corresponding to an exponentially decaying mode), and two pairs of complex conjugate eigenvalues for one-degree of freedom oscillators.

[17] See: aetna/ThreeCarriages/n3_damped_non_modes_A.m


D =
    -2.4              0              0               0               0       0
       0   -0.641+3.98i              0               0               0       0
       0              0   -0.641-3.98i               0               0       0
       0              0              0    -0.254+11.1i               0       0
       0              0              0               0    -0.254-11.1i       0
       0              0              0               0               0   -21.4

Correspondingly, the first and last eigenvectors are real, and the rest are complex conjugate pairs.

V = 10⁻² ×
      26    5.22+1.33i    5.22-1.33i   -1.25+0.191i   -1.25-0.191i     4.65
    21.1    4.16+12.5i    4.16-12.5i   -0.173-7.57i   -0.173+7.57i    0.398
    18.8    3.09+19.2i    3.09-19.2i    0.446+4.61i    0.446-4.61i   0.0369
   -62.5   -8.64+19.9i   -8.64-19.9i     -1.8-13.9i     -1.8+13.9i    -99.5
   -50.7    -52.5+8.5i    -52.5-8.5i           84.1           84.1    -8.52
   -45.2         -78.2         -78.2    -51.3+3.79i    -51.3-3.79i    -0.79

We can illustrate that the motion, for instance for mode 6, is non-oscillatory: in Figure 5.15 we show the response for the initial conditions in the form of mode 6 [18].

Fig. 5.15. Linear 3-degree of freedom oscillator: first-order model, modes for non-proportional damping. Response for initial conditions in the form of mode 6. (Plot of y(1:3) versus t.)

Figure 5.16 illustrates graphically the modes of the A matrix. It is noteworthy that the displacements and velocities for the purely decaying modes are phase shifted by 180° (they are out of phase).

Figure 5.17 shows the free-vibration response to excitation in the form of the initial condition set to (the real part of) mode 2. Note that the displacements no longer go through zero at the same time: they are phase shifted. This may also be deduced in Figure 5.16 because the displacement arrows for any particular mode are not parallel any more.

Illustration 5

The dynamics of the system discussed above is to be integrated with the time step ∆t = 0.06 s with the modified Euler integrator. Determine if this integrator will be stable.

The natural angular frequencies are diag(D)

lambda = [ -2.4030
           -0.6411+3.9785i
           -0.6411-3.9785i
           -0.2541+11.1142i
           -0.2541-11.1142i
          -21.4220 ]

[18] See: aetna/ThreeCarriages/n3_damped_non_IC.m


Fig. 5.16. Linear 3-degree of freedom oscillator: first-order model, modes for non-proportional damping

Fig. 5.17. Linear 3-degree of freedom oscillator: first-order model, non-proportional damping. Free-vibration response to initial condition in the form of mode 2. (Plot of y(1:3) versus t.)


Each angular frequency needs to be substituted into the amplification factor for the modified Euler (3.30), and its modulus (absolute value) needs to be evaluated. The result is

>> abs(1+dt*lambda+1/2*(dt*lambda).^2)

ans =

0.8662

0.9616

0.9616

1.0063

1.0063

0.5407

Since two of the amplification factors (for the complex-conjugate natural frequencies 4 and 5) are greater than one in modulus, the integrator is not going to be stable with the given time step, as the contribution of the modes 4 and 5 would grow in time.


5.7 Singular stiffness, damped

Now we consider a system with a singular stiffness matrix (the first spring is absent). We also include damping in the form considered in the previous section [19].

Fig. 5.18. Linear 3-degree of freedom oscillator: first-order model, modes for singular-stiffness non-proportional damping

Note the zero eigenvalue: for a singular stiffness the entire matrix A must be singular (consider whether the first three columns of A can be linearly independent when K has linearly dependent columns).

D =
    0              0              0              0              0       0
    0   -0.679+4.31i              0              0              0       0
    0              0   -0.679-4.31i              0              0       0
    0              0              0   -0.237+11.2i              0       0
    0              0              0              0   -0.237-11.2i       0
    0              0              0              0              0   -23.8

Correspondingly, the first and last eigenvectors are real, and the rest are complex conjugate pairs. The first eigenvector is as expected: all displacements the same, no velocities:

V = 10⁻² ×
    -57.7   -4.98+1.26i   -4.98-1.26i   -1.16+0.371i   -1.16-0.371i    -4.19
    -57.7   -4.03-10.8i   -4.03+10.8i   -0.161-7.57i   -0.161+7.57i     -0.3
    -57.7   -2.86-18.2i   -2.86+18.2i    0.409+4.56i    0.409-4.56i   -0.023
        0   -2.07-22.4i   -2.07+22.4i      -3.86-13i       -3.86+13i    99.7
        0      49.3-10i      49.3+10i           84.4            84.4    7.13
        0          80.4          80.4      -51+3.48i       -51-3.48i   0.546

Under these conditions no forces are generated in any of the springs or the damper.

5.8 Annotated bibliography

1. D. E. Newland, Mechanical Vibration Analysis and Computation, Dover Publications Inc., 2006. An excellent reference for all vibrations subjects. It covers thoroughly the single- and multiple-degree of freedom oscillators, matrix analysis of natural frequencies and mode shapes, and numerical methods for modal analysis. Also a nice exposition of the discrete Fourier transform (DFT).

[19] See: aetna/ThreeCarriages/n3_damped_sing_modes_A.m


6 Analyzing errors

Summary

1. The basic tool here is the Taylor series. Especially important is the Lagrange remainder term.

2. We use it to reason about order-of estimates (i.e. big-O notation). Main idea: as we control error in numerical algorithms by decreasing the time step length, the element size, and other control parameters, towards zero, the first term of the Taylor series that is missing in our model will dominate the error. We use these ideas to evaluate errors of integrals and estimate local and global errors of ODE integrators.

3. Combining order-of error estimates with repeated solutions with different time step lengths allows us to construct time-adaptive integrators. Main idea: by controlling the local error (estimated from the Taylor series) we attempt to deliver the solution within a user-given error tolerance.

4. We discuss the approximation of derivatives by the so-called finite difference stencils. Main idea: the total error has components of a distinct nature, the truncation error and the machine-representation error.

5. The computer represents numbers as collections of bits. Main idea: the machine-representation error (round-off) is due to the computer being able to store only some values, to which the results of arithmetic operations must be converted (with the attendant loss of precision).

6.1 Taylor series

For a reasonably smooth function (for instance it helps if all the function's derivatives exist), we can write the infinite series

y(x) = y(x0) + (dy(x0)/dx) (x − x0) + (d²y(x0)/dx²) (x − x0)²/2 + ... .

Its purpose is to approximate the function value at x from the function derivatives at x0 (the function value may be considered the zeroth derivative). When the above series converges, the Taylor series will become a better and better approximation with any additional term. (When the Taylor series for a given function converges, we call such a function analytic.)

Illustration 1

Warning: The Taylor series need not be convergent. For instance, the function log(1 + x) has a convergent Taylor series in the interval −1 < x < 1. Outside this interval the Taylor series does not converge (the more terms are added, the worse the approximation becomes). Try the following code that uses the taylor MATLAB function.


syms x 'real'
t = taylor(log(1+x), 6);
x = linspace(-1, +2, 100);
plot(x, log(1+x))
hold on
plot(x, eval(vectorize(t)), '--')

Note the use of vectorize: MATLAB will choke on all those powers of x from the Taylor series function when x is an array of numbers.

Often it is useful to truncate the Taylor series exactly (that is, to write down a finite number of the terms, but still preserve the exact meaning). The Lagrange remainder can be used for this purpose. For instance, we can write

y(x) = y(x0) + (dy(x̄)/dx) (x − x0)

to truncate after the first term, or

y(x) = y(x0) + (dy(x0)/dx) (x − x0) + (d²y(x̄)/dx²) (x − x0)²/2

to truncate after the second term. Both truncations are exact (when the Taylor series converges, of course). The trick is to write the last term (which is the Lagrange remainder) with a derivative taken at a point x̄ somewhere between x0 and x. The location x̄ is not the same in the two truncations above.

In general, we would write

y(x) = y(x0) + (dy(x0)/dx) (x − x0) + (d²y(x0)/dx²) (x − x0)²/2 + ... + (dⁿy(x0)/dxⁿ) (x − x0)ⁿ/n! + Rn ,

where the Lagrange remainder Rn is

Rn = (d^{n+1}y(x̄)/dx^{n+1}) (x − x0)^{n+1}/(n + 1)! . (6.1)

Having reminded ourselves of the basics of Taylor series approximation, we can look at a very useful tool (terminology, really) to help us with engineering analyses of all kinds.

6.2 Order-of analysis

The order-of analysis helps us to make sweeping statements about things such as errors by highlighting the most important contributions and obscuring the rest. To begin, consider how to say as simply as possible that for x → 0 the value of a given function f(x) decreases toward zero. The function f could vary in some complicated way. Perhaps we could compare it to something really simple such as the function g(x) = x, which also decreases to zero for x → 0?

That is the idea behind this definition: The function f(x) is of the order of g(x) if

lim_{x→0} |f(x)|/|g(x)| < M < ∞ ,

where we require g(x) ≠ 0 for x ≠ 0. In words, the absolute values of the two functions are in some proportion that is of finite magnitude. We write f(x) ∈ O(g(x)) and say "f of x is big o g of x as x goes to zero". The meaning of this definition is that "|f(x)| decreases towards zero at least as fast as |g(x)|".


Illustration 2

Example 1: Consider f(x) = 0.1x + 30x², for x > 0. Show that it is of order g(x) = x as x → 0.

We form the fraction and simplify

lim_{x→0} |f(x)|/|g(x)| = lim_{x→0} |0.1x + 30x²|/|x| = lim_{x→0} (0.1x + 30x²)/x = lim_{x→0} (0.1 + 30x) = 0.1 < ∞

Conclusion: f(x) = 0.1x + 30x² is of order g(x) = x as x → 0. We say "f of x is big o x", and write f(x) = 0.1x + 30x² ∈ O(x).

Example 2: Consider f(x) = 0.1x + 30, for x > 0. Show that it is of order g(x) = 1 as x → 0.

We form the fraction and simplify

lim_{x→0} |f(x)|/|g(x)| = lim_{x→0} |0.1x + 30|/|1| = lim_{x→0} (0.1x + 30)/1 = 30 < ∞

Conclusion: f(x) = 0.1x + 30 is of order g(x) = 1 as x → 0. We say "f of x is big o one", and write f(x) = 0.1x + 30 ∈ O(1).

Example 3: Consider f(x) = 0.1x + 30x², for x > 0. Show that f(x) is not of order g(x) = x² as x → 0.

We form the fraction and simplify

lim_{x→0} |f(x)|/|g(x)| = lim_{x→0} |0.1x + 30x²|/|x²| = lim_{x→0} (0.1x + 30x²)/x² = lim_{x→0} (0.1/x + 30) → ∞

Conclusion: f(x) = 0.1x + 30x² is not of order g(x) = x² as x → 0.
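These limits can also be checked numerically (a small illustrative sketch, not part of the AETNA toolbox):

x = 10.^(-(1:6))';               % x approaching zero
f = 0.1*x + 30*x.^2;
[f./x, f./x.^2]                  % first column tends to 0.1 (finite), second column blows up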

6.2.1 Using the big-O notation

When analyzing algorithms, our interest is typically to find out how quickly their errors decrease as a function of the accuracy control knob (which may be the time step, or the grid spacing, according to the algorithm). The assumption is that accuracy is improving as the control knob makes the time step (or the grid spacing) smaller (approaching zero).

Given an expression such as f(∆t) = 0.1∆t + 30∆t², our interest would be to find the dominant term, that is the term that decreases to zero the slowest, as ∆t → 0. In the examples above we have discovered that f(∆t) = 0.1∆t + 30∆t² ∈ O(∆t). This to us indicates that f(∆t) decreases toward zero at most as quickly as ∆t. It does not decrease as quickly as ∆t². Also, it does decrease toward zero, which a constant, 1, does not. The notation f(∆t) ∈ O(∆t), and f(∆t) ∉ O(∆t²), f(∆t) ∉ O(1), helps us filter out things that are not important, the numerical values of the coefficients (0.1 and 30), what other unimportant terms there might be (∆t²), and keep just the information that matters to us: f(∆t) = 0.1∆t + 30∆t² ∈ O(∆t).

Illustration 3

Use the order-of notation to compare the following polynomials as t → 0.

p(t) = 100,003 t³ + 0.16131 t² − 555 ,   q(t) = −703 t⁶ − (1 + 2πt) ,   r(t) = 3t⁶ − log e .

Solution: As all of these polynomial expressions include a constant term, all of these polynomials are O(1).


Illustration 4

Estimate the resulting magnitude of the Taylor series sum for tj+1 → tj. Assume that all the derivatives exist and are finite numbers.

(d²y(tj)/dt²) (tj+1 − tj)²/2 + (d³y(tj)/dt³) (tj+1 − tj)³/3! + (d⁴y(tj)/dt⁴) (tj+1 − tj)⁴/4! + ... .

First of all, the Taylor series is a polynomial in the quantity tj+1 − tj, and this quantity goes to zero as tj+1 → tj. Therefore, we can introduce the new variable τ = tj+1 − tj and write

(d²y(tj)/dt²) τ²/2 + (d³y(tj)/dt³) τ³/3! + (d⁴y(tj)/dt⁴) τ⁴/4! + ... .

The quantities d²y(tj)/(2! dt²), d³y(tj)/(3! dt³), d⁴y(tj)/(4! dt⁴), ... are just inconsequential coefficients, and we can easily convince ourselves that

(d²y(tj)/dt²) τ²/2 + (d³y(tj)/dt³) τ³/3! + (d⁴y(tj)/dt⁴) τ⁴/4! + ... ∈ O(τ²)

by evaluating

lim_{τ→0} [ (d²y(tj)/dt²) τ²/2 + (d³y(tj)/dt³) τ³/3! + (d⁴y(tj)/dt⁴) τ⁴/4! + ... ] / τ²
= lim_{τ→0} [ (d²y(tj)/dt²) (1/2) + (d³y(tj)/dt³) τ/3! + (d⁴y(tj)/dt⁴) τ²/4! + ... ] = (d²y(tj)/dt²) (1/2) < ∞ .

In conclusion, the Taylor series sum is O((tj+1 − tj)²).

6.2.2 Error of the Riemann-sum approximation of integrals

The goal here is to estimate the error of the Riemann-sum approximation of integrals of one variable using the order-of analysis. For instance, as shown in Figure 6.1, approximate the integral

∫_a^b y(x) dx

using the Riemann-sum approximation indicated by the filled rectangles in the figure. The error of approximating the actual area between x0 and x0 + h by the rectangle y(x0) h may be estimated by expressing the Taylor series of y(x) at x0

y(x) = y(x0) + (dy(x0)/dx) (x − x0) + (d²y(x0)/dx²) (x − x0)²/2 + ...

and integrating the Taylor series, where we can conveniently introduce the change of variables s = x − x0

∫_{x0}^{x0+h} y(x) dx = ∫_0^h ( y(x0) + (dy(x0)/dx) s + (d²y(x0)/dx²) s²/2 + ... ) ds .

We obtain

∫_{x0}^{x0+h} y(x) dx = y(x0) h + (dy(x0)/dx) h²/2 + (d²y(x0)/dx²) h³/6 + ... .


Comparing with the approximate area y(x0) h, we express the error as

e = (dy(x0)/dx) h²/2 + (d²y(x0)/dx²) h³/6 + ... .

We recall that the lowest polynomial power dominates, and therefore

e ∈ O(h²) .

The integral of the function y(x) between a and b is approximated as a sum of the areas of the rectangles, let us say all of the same width h. There are

n = (b − a)/h

such rectangles. A pessimistic estimate of the total error magnitude would ignore the possibility of error canceling, so that the absolute value of the total error could be bounded by the sum of the absolute values of the errors committed for each subinterval

|E| ≤ Σ_{i=1}^{n} |ei| = Σ_{i=1}^{n} O(h²) = n O(h²) = ((b − a)/h) O(h²) = O(h) .

Note that when we write the equals sign in the above equation, we don't really mean equality; we use it rather informally to mean "is". In terms of the order-of analysis, we would write for the error E of the integral from a to b

E ∈ O(h) .

From the point of view of the user of the Riemann-sum approximation this is good news: the error can be controlled. By decreasing h (that is, by using more subintervals) we can make the total error smaller. It would be even nicer if the error was O(h²), since then it would decrease faster when h was decreased. We demonstrate this as follows: assume that we use twice as many subintervals. For E ∈ O(h) the error would decrease as

h → h/2 ⇒ E ∈ O(h) → Enew ∈ O(h/2) = O(h)/2

so the error decreases by a factor of two. For E ∈ O(h²) the error would decrease as

h → h/2 ⇒ E ∈ O(h²) → Enew ∈ O((h/2)²) = O(h²)/4

so the error decreases by a factor of four. The payoff of using twice as many intervals is better this time.

6.2.3 Error of the Midpoint approximation of integrals

Now we estimate the error of the midpoint approximation of integrals of one variable using the order-of analysis. For instance, as shown in Figure 6.2, approximate the integral

∫_a^b y(x) dx

using the midpoint approximation indicated by the filled rectangles in the figure. The error of approximating the actual area between x0 − h/2 and x0 + h/2 by the rectangle y(x0) h may be estimated by expressing the Taylor series of y(x) at x0

y(x) = y(x0) + (dy(x0)/dx) (x − x0) + (d²y(x0)/dx²) (x − x0)²/2 + ...


Fig. 6.1. Riemann-sum approximation of the integral of a scalar variable.

and integrating the Taylor series, where we introduce the change of variables s = x − x0

∫_{x0−h/2}^{x0+h/2} y(x) dx = ∫_{−h/2}^{h/2} ( y(x0) + (dy(x0)/dx) s + (d²y(x0)/dx²) s²/2 + ... ) ds .

We obtain

∫_{x0−h/2}^{x0+h/2} y(x) dx = y(x0) h + (d²y(x0)/dx²) h³/24 + ... .

Importantly, the term with dy(x0)/dx produced one negative contribution (triangle) which canceled with the corresponding positive contribution (triangle), and so this term with its associated h² dropped out. Comparing with the approximate area y(x0) h, we express the error as

e = (d²y(x0)/dx²) h³/24 + ... ∈ O(h³) ,

which is one order higher than the error estimated for the Riemann sum. The integral of the function y(x) between a and b is approximated as a sum of the areas of the n rectangles, and the absolute value of the total error could be bounded by the sum of the absolute values of the errors committed for each subinterval

|E| ≤ Σ_{i=1}^{n} |ei| = Σ_{i=1}^{n} O(h³) = n O(h³) = ((b − a)/h) O(h³) = O(h²) .

The order-of analysis tells us that the error E of the integral from a to b for the midpoint rule is

E ∈ O(h²)

and therefore the midpoint rule is more accurate than either of the Riemann sum rules.
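A short numerical sketch (illustrative only) confirms these rates for the integral of sin(x) over [0, 1]: halving h roughly halves the left-endpoint (Riemann) error and quarters the midpoint error.

a = 0; b = 1; exact = 1 - cos(1);
for n = [10, 20, 40]
    h = (b - a)/n;
    xl = a + (0:n-1)*h;                    % left endpoints of the subintervals
    xm = xl + h/2;                         % midpoints of the subintervals
    Eriem = abs(h*sum(sin(xl)) - exact);   % O(h)
    Emid  = abs(h*sum(sin(xm)) - exact);   % O(h^2)
    fprintf('n = %3d   Riemann %.2e   midpoint %.2e\n', n, Eriem, Emid);
end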

6.3 Estimating error in ODE integrators

Intuitively we can see the forward Euler algorithm as related to the Riemann sum approximation of integrals. That is especially clear when the right-hand side function does not depend on y:

ẏ = f(t) , y(0) = y0 .

To solve this equation we integrate


Fig. 6.2. Midpoint approximation of the integral of a scalar variable.

y(∆t) = y0 + ∫_0^{∆t} f(τ) dτ .

Forward Euler approximates the integral on the right as

y(∆t) ≈ y0 + f(0) ∆t ,

which leads exactly to the same kind of error estimate, O(∆t²), for moving the solution forward by one time step.

The situation is complicated somewhat by considering right-hand sides which depend both on the time t and the solution y. For instance, Figure 6.3 shows what happens for the equation

ẏ = −cos(2t) y , y(0) = 1 .

Each step of the forward Euler algorithm drifts off from the original curve. So we see one solution curve departing from the starting point (t0, y0), but after one step the forward Euler no longer tries to follow that curve, but rather the one starting at (t1, y1), and so on. Clearly, here is the potential for amplifying small errors if the solution curves part company rapidly as the time goes on. However, provided we use time steps which are sufficiently small so that the forward Euler does not excessively amplify these little drifts, we can estimate the error on the entire solution interval (the so-called global error) from the so-called local errors in each time step.

Fig. 6.3. Forward Euler integration drifting off the original solution path.


6.3.1 Local error of forward Euler

Let us consider the vector ODE

ẏ = f(t, y) , y(0) = y0 ,

which is advanced from tj to tj+1 using the forward Euler algorithm as

yj+1 = yj + (tj+1 − tj)f(tj ,yj) .

At the same time we can expand the solution in a Taylor series at (tj ,yj)

y(tj+1) = y(tj) + (dy(tj)/dt) (tj+1 − tj) + (d²y(tj)/dt²) (tj+1 − tj)²/2 + ... .

Here y(tj+1) is the true solution that lies on the solution curve passing through the point (tj, yj), and yj+1 is what we get from forward Euler. Now we can substitute from the definition

dy(tj)/dt = f(tj, yj)

to get

y(tj+1) = y(tj) + f(tj, yj) (tj+1 − tj) + (d²y(tj)/dt²) (tj+1 − tj)²/2 + ...

and then move the first two terms on the right-hand side onto the left-hand side

y(tj+1) − y(tj) − f(tj, yj) (tj+1 − tj) = (d²y(tj)/dt²) (tj+1 − tj)²/2 + ... .

Finally, the second and third term on the left-hand side are −yj+1, and so we obtain the local error (also called truncation error) in this time step as

y(tj+1) − yj+1 = (d²y(tj)/dt²) (tj+1 − tj)²/2 + (d³y(tj)/dt³) (tj+1 − tj)³/3! + ... .

Two observations: firstly, the local error is second order in the time step

y(tj+1) − yj+1 ∈ O((tj+1 − tj)²)

and secondly, the coefficient of this term is the second derivative at (tj, yj), which measures the curvature of the solution curve at that point. The more the curve curves, the larger the error. If the solution happens to have a zero curvature at (tj, yj), then we would predict that the Euler step should not incur any error. It still might: our prediction neglected all those "dots" (the higher order terms) in the Taylor series, but at least for zero curvature the second order term in the error would be absent. The local error resulted from the truncation of the Taylor series, which is a good explanation of why it is called the truncation error.

6.3.2 Global error of forward Euler

We have demonstrated above (see Figure 6.3) that the global error, that is the difference between the analytical exact solution y(tn) and the computational solution yn, is a mixture of two components. Now we will look at the global error in detail. We will try to estimate the global error at time tn+1, GEn+1, from the global error GEn at time tn: see Figure 6.4. Note that we are thinking in terms of a scalar differential equation, but the conclusions may be readily generalized to coupled equations.


The first component of the global error is the local (truncation) error, which is caused by the truncation of the Taylor series as explained in the previous section.

The second component is caused by the drift-off in the previous steps of the algorithm: every step of the integrator will cause the solution to drift off the original curve passing through the initial condition. Let us consider performing one single step of the numerical integration, from tn to tn+1. Two different curves pass through the two points (tn, yn) and (tn, y(tn)): let us say ȳ(t) passes through (tn, yn), and y(t) passes through (tn, y(tn)). The difference between the points (tn, yn) and (tn, y(tn)) is the global error at time tn, GEn.

The difference between the two curves, y(tn+1) − ȳ(tn+1), at time tn+1 measures the propagated error. We can estimate the propagated error PEn+1 as the global error GEn plus the increase of the distance between the two curves. The increase can be approximated to first order from the slopes of the two curves at tn, which are f(tn, yn) and f(tn, y(tn)):

PEn+1 ≈ GEn + ( f(tn, y(tn)) − f(tn, yn) ) (tn+1 − tn) .

We can also use the Taylor series to expand the right-hand side function f as

f(tn, y(tn)) ≈ f(tn, yn) + (∂f(tn, yn)/∂y) (y(tn) − yn)

to obtain

PEn+1 ≈ GEn + (∂f(tn, yn)/∂y) (y(tn) − yn) (tn+1 − tn)

and substituting GEn = y(tn) − yn we arrive at

PEn+1 ≈ GEn ( 1 + (∂f(tn, yn)/∂y) (tn+1 − tn) ) .

This is really saying that the propagated error in step tn+1 is the global error in step tn plus a little bit more due to the difference between the slopes at yn and y(tn). As an illustration consider a model equation

ẏ = λy , y(0) = y0 .

For this model equation the propagated error will read

PEn+1 ≈ GEn (1 + λ(tn+1 − tn)) .

Thus we see that the propagated error will be controlled by the stability (growth versus decay) of the analytical solution: for positive λ the propagated error will exponentially increase, as (1 + λ(tn+1 − tn)) > 1; for negative λ (and a sufficiently small time step) the propagated error will likely decrease, as (1 + λ(tn+1 − tn)) < 1.

Under reasonable assumptions concerning the smoothness of the right-hand side function f (note well that this will not include models such as the friction stick-slip), the global error may be estimated from the local errors using a (pessimistic) assumption that the local errors will never cancel each other; they will always add up. Then we can estimate the global error E = y(tn+1) − yn+1 as

|GEn+1| ≤ Σ_{i=1}^{n} |ei| = Σ_{i=1}^{n} O(∆t²) = n O(∆t²) = (t/∆t) O(∆t²) = O(∆t) .

Thus we see that we lost one order in the error estimate going from local to global errors. The forward Euler algorithm was second order locally, but it is only first order globally.
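This first-order global convergence is easy to observe numerically (a minimal sketch on the model equation ẏ = −y, y(0) = 1, integrated to t = 1; halving the time step roughly halves the error):

for n = [10, 20, 40]
    dt = 1/n; y = 1;
    for j = 1:n
        y = y + dt*(-y);          % one forward Euler step for y' = -y
    end
    fprintf('dt = %.3f   global error %.2e\n', dt, abs(y - exp(-1)));
end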

Illustration 5

Now we can go back to the graphs of Chapter 1, especially Figure 2.19. The slopes of the error curves on the log-log scale will now be making sense. For the forward Euler we now know that its local error is second order, but the global error is first order. The graph 2.19 displays the global error, and hence the slope (i.e. the convergence rate) is one. For the modified Euler the global error is second order; consequently its local error is cubic in the time step.


Fig. 6.4. Global error of the forward Euler integration. LEn+1 = local (truncation) error, PEn+1 = propagated error, GEn = global error at time tn, GEn+1 = global error at time tn+1.

Suggested experiments

1. Estimate from Figure 2.21 the order of the local error of the oderk4 Runge-Kutta integrator.

6.4 Approximation of derivatives

We can write the Taylor series of the function whose derivative we wish to approximate at x0 as

f(x) = f(x0) + (df(x0)/dx) (x − x0) + (d²f(x0)/dx²) (x − x0)²/2! + R2 ,

where R2 is the Lagrange remainder. We can write for instance

R2 = (d³f(ξx + (1 − ξ)x0)/dx³) (x − x0)³/3! , 0 ≤ ξ ≤ 1 .

The Taylor series expression can be solved for the derivative

df(x0)

dx=

f(x)− f(x0)

(x− x0)− d2f(x0)

dx2

(x− x0)

2!− R2

(x− x0).

When second-order derivative term and the remainder term are ignored, presumably because theyare much smaller in magnitude then what we keep from the right-hand side expression, we get anapproximation of the derivative as

df(x0)

dx≈ f(x)− f(x0)

(x− x0)(6.2)

Because of the form of this expression, we call this formula the divided differences. All the formulasfor the approximation of derivatives derived in this section are of this nature.

What we’ve neglected above is


−d²f(x0)/dx² (x − x0)/2! − R2/(x − x0)

and we realize that this is the error of the approximation. Unless d³f(ξx + (1 − ξ)x0)/dx³ behaves like 1/(x − x0), in words unless its magnitude blows up to infinity as x → x0, we can estimate the magnitude of the error using the order-of notation we developed earlier

| −R2/(x − x0) | ∈ O(|x − x0|²) .

Since

d²f(x0)/dx² (x − x0)/2! ∈ O(|x − x0|)   (6.3)

we see that it will dominate the error as the control parameter, the step along the x axis, x − x0, becomes shorter and shorter. The accuracy of the algorithm (6.2) is quite poor, the error being only O(|x − x0|). We call this kind of error the truncation error, since it is the result of the truncation of the Taylor series.

Illustration 6

Consider a common counterexample where (6.3) is not valid. In Figure 6.5 a piecewise linear function is shown (in solid line) with its derivative (dashed). If we take (6.2) with x0 to the left of b, for x < b the formula works perfectly. The second derivative is in fact

d²f(x0)/dx² = 0 ,

which makes our derivative computation perfect – no error. Now we will make x0 approach b from the left arbitrarily closely. The error estimate (6.3) is then no longer valid since at x0 = b

d²f(x0)/dx² → ∞ .

This unfortunate behavior is due to the first derivative being discontinuous at b.

Fig. 6.5. Piecewise linear function with its derivative.


Now we will consider the approximation formula (6.2) for two cases: x > x0 and x < x0. When x > x0 we are looking “forward” with the formula to determine the slope at x0, hence we get the forward Euler approximation of the derivative. Let us write h = |x − x0|. Then the formula (6.2) may be rewritten in the familiar form

df(x0)/dx ≈ (f(x0 + h) − f(x0))/h .

On the other hand, when x < x0 we are looking “backward” with the formula to determine the slope at x0, hence we get the backward Euler approximation of the derivative as this version of the formula (6.2)

df(x0)/dx ≈ (f(x0) − f(x0 − h))/h .

Fig. 6.6. Forward and backward Euler approximation of the derivative.

Figure 6.6 illustrates these concepts. The actual derivative is the slope of the green line (tangent at (x0, f(x0))), which is approximated by the forward Euler algorithm as the slope of the red dashed line, and by the backward Euler algorithm as the slope of the blue dashed line.

Evidently the figure suggests an improvement on these two algorithms. The green line seems to have a slope rather close to the average of the slopes of the red and blue lines. (The angles between the blue and green line and between the red and green line are about the same.) So what happens if we average those Euler predictions?

(1/2) ( (f(x0 + h) − f(x0))/h + (f(x0) − f(x0 − h))/h ) = (f(x0 + h) − f(x0 − h))/(2h) .   (6.4)

The above formula defines another algorithm, the centered difference approximation of the derivative. Figure 6.7 shows the dashed green line which represents the centered difference approximation of the tangent, and we can see that the slopes of the dashed and solid green lines are indeed quite close. It appears that the centered difference approximation should be more accurate, in general, and we can investigate this analytically by averaging not only the approximation formulas, but the entire expressions including the errors.

The forward difference approximation of the derivative, including the truncation error R2,f

df(x0)/dx = (f(x0 + h) − f(x0))/h − d²f(x0)/dx² h/2! − R2,f/h


Fig. 6.7. Forward and backward Euler and centered difference approximation of the derivative.

is added to the backward difference approximation of the derivative, including the truncation error R2,b

df(x0)/dx = (f(x0 − h) − f(x0))/(−h) − d²f(x0)/dx² (−h)/2! − R2,b/(−h)

to result in the expression of the centered difference approximation

2 df(x0)/dx = (f(x0 + h) − f(x0 − h))/h − R2,f/h − R2,b/(−h) .

The truncation error of the centered difference approximation is seen to be

−R2,f/(2h) − R2,b/(−2h) ∈ O(h²) .   (6.5)

It is one order higher than the truncation errors of the Euler algorithms (O(h²) versus O(h)), and higher is better – the error decreases faster with decreasing h.

The formulas for the numerical approximation of derivatives of functions, forward and backward Euler, and the centered differences, are called finite difference stencils, and many more, sometimes with a considerably higher accuracy, can be found in the technical literature. The price to pay is that with higher accuracy one needs more function values around the point x0.

Illustration 7

We shall now investigate the numerical evidence for these estimates of truncation error.1 In the script compare_conv_driver, x is the point where the derivative is evaluated, n is the number of reductions of the step, dx0 is the initial step, which is then subsequently reduced by the factor divFactor. funhand and funderhand are the handles of the function and its derivative (as anonymous MATLAB functions).

funhand=@(x)2*x^2-1/3*x^3;
funderhand=@(x)4*x-3/3*x^2;
x=1e1;
n= 9;
dx0= 0.3;
divFactor=4;
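The body of the script is not reproduced here; the following is a minimal sketch of what such a comparison loop might look like (the actual aetna script may be organized differently). For each step size it evaluates the forward (FE), backward (BE), and centered (CD) difference approximations and records their errors against the exact derivative funderhand:

dx = dx0; dxs = zeros(n,1); errs = zeros(n,3);
for k = 1:n
    FE = (funhand(x+dx) - funhand(x))/dx;         % forward Euler stencil
    BE = (funhand(x) - funhand(x-dx))/dx;         % backward Euler stencil
    CD = (funhand(x+dx) - funhand(x-dx))/(2*dx);  % centered difference stencil
    errs(k,:) = abs([FE, BE, CD] - funderhand(x));
    dxs(k) = dx;
    dx = dx/divFactor;                            % reduce the step
end
loglog(dxs, errs, 'o-'); xlabel('dx'); ylabel('err'); legend('FE','BE','CD');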


Fig. 6.8. Forward and backward Euler and centered difference approximation of the derivative. Error versus the step size.

Figure 6.8 both confirms the expected outcome and presents an unexpected one: the forward and backward Euler are of the same accuracy, and on the log-log scale the error decreases with rate of convergence equal to one, and the centered difference is both more accurate in absolute terms and its error decreases with a convergence rate of two. What may be unexpected, however, is the behavior of the centered difference error for very small steps. The error does not decrease anymore; rather the opposite occurs.

Shifting the point x at which the derivative is evaluated (change the third line to read

x=1e4;

gives the results in Figure 6.9. The performance of the numerical differentiation algorithms has now very much deteriorated, and a decrease in the step size does not necessarily lead to an improvement in the result, either in the two Euler derivative approximations or in the centered difference approximation.

The explanation for the behavior described in the Illustration above rests in what is displayed in the graphs: the graphs present the total error incurred by the numerical algorithm, and this error is the result of the interplay between the truncation error and the effect of the so-called machine-representation error. The term “round-off error” is commonly used for this type of error. However, round-off is only a special case of the broader class of machine-representation errors. Another term which would be equivalent is “computer-arithmetic”, or just “arithmetic” error. We will sometimes use machine-representation and arithmetic error interchangeably.

6.5 Computer arithmetic

The machine-representation (arithmetic) error is due to the limited capability of computers to store numbers: only some numbers may be represented in the computer.

6.5.1 Integer data types

The computer architectures in current use are based on binary storage: the smallest piece of data is a bit, which assumes values 0 or 1. A collection of bits can store a binary number. In particular, computers nowadays use a chunk of eight bits called a byte. The position of the bit in the byte indicates

1See: aetna/RoundoffTruncation/compare_conv_driver.m


Fig. 6.9. Forward and backward Euler and centered difference approximation of the derivative. As Figure 6.8, but the point of evaluation is shifted towards a much bigger number, x=1e4.

the power of two, similarly to what we’re used to with decimal numbers. For instance, the decimal number 13 = 1×10¹ + 3×10⁰ can be written in the binary system as 13 = 1×2³ + 1×2² + 0×2¹ + 1×2⁰. Hence its binary representation is 1101. We can use the MATLAB function dec2bin:

>> dec2bin(13)

ans =

1101

The largest number we can store in a byte (more precisely in an unsigned byte) is 255, viz

>> dec2bin(255)

ans =

11111111

since in that case all the bits are toggled to 1. If we wish to represent signed numbers, we must reserve one bit for the storage of the sign (positive or negative). Then we have only seven bits for the storage of the actual pattern of 0s and 1s. The largest number that seems to be available then is

>> bin2dec(’1111111’)

ans =

127

However, by some clever manipulation it is possible to squeeze one more number out of the eight bits, and so we get as the algebraically smallest and largest integers using the MATLAB functions intmin and intmax

>> intmin(’int8’)

ans =

-128

>> intmax(’int8’)

ans =

127

The clever trick is called the “2’s complement” representation, and the bits represent numbers as shown here

00000000=0

00000001=1

00000010=2

00000011=3

...


01111111=127

11111111=-1

11111110=-2

11111101=-3

...

10000000 =-128
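A quick check of these patterns, not from the book, can be done with typecast, which reinterprets the bits of an int8 value as an unsigned byte so that dec2bin can display them:

>> dec2bin(typecast(int8(-3),'uint8'),8)
ans =
11111101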

The argument ’int8’ denotes the so-called integer type, and there are four signed and four unsigned varieties in MATLAB (with 8, 16, 32, and 64 bits). As an example, here are the smallest and largest unsigned 64-bit integers

>> intmin(’uint64’)

ans =

0

>> intmax(’uint64’)

ans =

18446744073709551615

Integers are nice to work with, and they are very useful for instance as counters in loops. If we’re not careful, bad things can happen though. Take the following code fragment: First we create the variable a as an 8-bit integer zero with int8

>> a= int8(0);

and then we increment it 1000 times by one. The result is a bit unexpected, perhaps:

for i=1: 1000
    a=a+1;
end
a
a =
   127

What happened? Overflow! When the variable reached the largest value that can be stored in a variable of this type, it stopped increasing: the variable overflowed.

6.5.2 Floating-point data types

The floating-point numbers are represented with values for the so-called mantissa M and exponent E, stored in bits essentially as described above, as

M*2^E

The basic datatype in MATLAB is a floating-point number stored in 64 bits, the so-called double. The machine representation for this number is standardized, as described in the ANSI/IEEE Standard 754-1985, Standard for Binary Floating Point Arithmetic. The exponent and the mantissa are stored as patterns of bits, which may be represented as numbered from 0 to 63, left to right. The first bit is the sign bit, ’S’, the next eleven bits are the exponent bits, ’E’, and the final 52 bits are the mantissa bits ’M’:

S EEEEEEEEEEE MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM

0 1 11 12 63

The value V represented by the 64-bit word may be determined by an algorithm (how else?):

1. If E=2047
   a) If M is nonzero, then V=NaN (“Not a Number”)
   b) Else (M==0)
      i. If S is 1, then V=-Inf
      ii. Else V=Inf
2. Else if 0<E<2047 then we get the so-called normalized values
   V=(-1)^S * 2^(E-1023) * (1.M)
   where 1.M is intended to represent the binary number created by prefixing M with an implicit leading 1 and a binary point.
3. Else (E==0)
   a) If M is nonzero, then we get the so-called unnormalized values
      V=(-1)^S * 2^(-1022) * (0.M)
   b) Else (M==0)
      i. If S is 1, then V=-0
      ii. Else V=0
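The bit fields can be inspected directly in MATLAB. The following is a small sketch, not part of the aetna toolbox, that pulls the sign, exponent, and mantissa bits out of a double with typecast and bitget and reassembles the value according to the normalized-number rule above:

x = pi;
b = double(bitget(typecast(x,'uint64'), 64:-1:1)); % b(1) = sign bit S, b(2:12) = E, b(13:64) = M
S = b(1);
E = sum(b(2:12) .* 2.^(10:-1:0));                  % exponent field as an integer
M = b(13:64);                                      % 52 mantissa bits
V = (-1)^S * 2^(E-1023) * (1 + sum(M .* 2.^(-(1:52)))) % reproduces x (here pi)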

The cleverness of this representation should be appreciated. It allows us to store not only a zero (twice!) and the regular (normalized) numbers, but also numbers which are extremely small (the unnormalized values). In addition, we can also store negative and positive infinity (Inf and -Inf), which may result if we divide by zero (most likely by accident)

>> 1/0

ans =

Inf

and finally we can also store something that isn’t a number at all (for instance because the result of an operation isn’t defined at all):

>> 0/0

ans =

NaN

The following two functions can be used to obtain the smallest (normalized) and largest floating-point double value

>> realmin(’double’)

ans =

2.225073858507201e-308

>> realmax(’double’)

ans =

1.797693134862316e+308

Here we have one unnormalized value

>> realmin(’double’)/1e6

ans =

2.225073858324152e-314

The machine representation of the double floating-point values described above tells us something important about which values can be represented in the computer, and they are not all the numbers we can think of! To get from one particular number to the one right next to it we change one of the bits in the mantissa. The least significant one is the 52nd. So if we take a normalized number

V=(-1)^S * 2^(E-1023) * (1.M)

changing the bit in the 52nd position behind the binary point amounts to adding to or subtracting from the total value of the number V the tiny value (-1)^S * 2^(E-1023)*2^(-52). For instance for E=1023 and S=0 the tiny difference is

>> 2^-52

ans =

2.220446049250313e-016


For S=0, E=1023, and all the bits of the mantissa M set to zero, the value is V=1.0. Then the next closest number to 1.0 that can be represented in the computer is 1.0 + 2^(-52). Between these two values is a gap where no numbers live. It is a tiny gap, so it may not bother us too much, but consider what happens as the exponent 2^(E-1023) gets bigger. The MATLAB function eps computes the size of the gap next to any particular double value. For instance, for a value representative of the Young’s modulus in some units the gap will get bigger

>> eps(3e11)

ans =

6.103515625000000e-005

For the distance across the Milky Way, the gap already amounts to (in meters)

>> eps(100000 *9.46e15)

ans =

131072

and the distance to the outermost object in the universe can be recorded in the computer with a double value only with precision amounting to tens of millions of kilometers

>> eps(18e9 *9.46e15)

ans =

3.435973836800000e+010

The gap between adjacent numbers represented in the computer is called machine epsilon, and it is an important quantity, especially when we consider round-off. As an example, consider the addition of two numbers: watch what happens to the underlined red digits.

>> pi

ans =

3.141592653589793

>> 1+pi

ans =

4.141592653589793

>> 1e6+pi

ans =

1.000003141592654e+006

In the second example, the disparate magnitude led to truncation of the previously significant digits. Addition results in loss of significance when the numbers are disparate. For subtraction, the dangerous situation occurs when the two numbers are close. Consider this example:

>> 3.14159265-pi

ans =

-3.589792907376932e-009

The computer essentially made up the underlined red digits. (To check this we go to the Web: a lot of digits of π are available on the web: 3.14159265358979323846264338....) This problem is referred to as loss of significance, and it is one of the most deleterious aspects of computer arithmetic.

Another example illustrates the so-called catastrophic cancellation of terms, for both addition and subtraction. We may consider the expressions below equivalent, but it evidently matters what the MATLAB engine thinks of the expression we typed in:

>> (1 +2e-38)-(1 -2e-38)

ans =

0

>> (1 +2e-38-1 +2e-38)

ans =

2.000000000000000e-038


>> (1-1 +2e-38+2e-38)

ans =

4.000000000000000e-038

In the first expression, the parentheses wiped out the small terms altogether. In the second expression, the order in which the terms were processed wiped out only one of the tiny ones. Finally, in the last expression the order of processing worked in our favor, and we got the correct answer.

One of the implications of using a binary computer representation of numbers is that numbers that look really nice to us may be bad for the computer operations. This example illustrates it nicely:

>> a= 0;
for i=1: 10000
    a=a+0.0001;
end
a
a =
   0.999999999999906

to be compared with

>> a=0.0001*10000

a =

1

Similarly to the integer data types, floating-point values can overflow .

>> realmax(’double’)+1==realmax(’double’)

ans =

1

This test should not have evaluated to “true” (numerical value 1), but since the left-hand side overflowed it did. Floating-point values can also underflow: the value becomes so small that it gets converted to the exact zero in the computer.

>> realmin(’double’)/1e16

ans =

0

There is a second floating-point type available in MATLAB, the single. Since there are only 32 bits available for its storage, the budgets for the exponent (eight bits) and the mantissa (23 bits) are correspondingly reduced with respect to the double. This simply compounds all the problems we described above for the double. For instance, the machine epsilon for the single at 1.0 is

>> eps(single(1.0))

ans =

1.1920929e-007

so almost 9 orders of magnitude larger than the one for the double. This will make the precision considerably lower for all sorts of operations. The only reason one may consider using a single is that it saves half the storage space compared to a double. Therefore (pieces of) some commercial software use single-precision storage, which could make them considerably less robust for certain inputs than we would hope for. We as users need to be aware of such pitfalls.
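A small experiment, not taken from the book, makes the reduced precision tangible: repeating the accumulation of 0.0001 shown above in single precision gives a result that is noticeably farther from 1 than the double-precision result.

a = single(0);
for i = 1:10000
    a = a + single(0.0001);  % accumulate in single precision
end
a    % farther from 1 than the double result 0.999999999999906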

6.5.3 Summary

A few points to sum up the basics of the computer representation of numbers:

1. Only some values can be represented in the computer. For both integers and floating-point values there’s a range in which numbers can be represented, and no values can exist outside of the range.


2. The range in which floating-point values are represented is sparsely populated: there are gaps between numbers (the so-called machine epsilon), which increase with the magnitude of the number. The machine epsilon essentially limits the largest number of significant digits that we can expect (15 for double, 6 for single).

3. Operations on the computer-represented numbers rarely lead to exact answers, and especially addition and subtraction can prove devastating to our budget of significant digits (overflow, underflow, round-off, cancellation of terms,...). Consequently the number of significant digits is usually much less than the number of digits in the computer printouts.

6.6 Interplay of errors

We will inspect the centered-difference formula for the approximation of derivatives (6.4). As the step size in the denominator decreases, the numerator contains the difference of two numbers which are closer and closer in magnitude. We can estimate that the arithmetic error of the subtraction f(x0 + h) − f(x0 − h) is going to be on the order of machine epsilon. Therefore, the error of the derivative can be estimated as

ER = ε(f(x0)) / (2h) .

Here ε(f(x0)) is the machine epsilon at the real number f(x0). We can see that the arithmetic error ER increases in inverse proportion to the step size h. Also, we see that the error increases with the magnitude of the numbers to be subtracted ≈ f(x0), since the machine epsilon depends on it.
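The two contributions can be added into a simple model of the total error. The following is a sketch only: it assumes a bound M3 on the third derivative near x0, so that the truncation error of the centered difference is roughly M3 h²/6, and the function and the point are arbitrary choices for the experiment.

f = @(x) sin(x); x0 = 1;
M3 = 1;                          % assumed bound on |f'''| near x0
hs = logspace(-12, 0, 200);
trunc = M3*hs.^2/6;              % truncation error model, O(h^2)
arith = eps(f(x0))./(2*hs);      % arithmetic error model ER
[~,i] = min(trunc + arith);
h_opt = hs(i)                    % step near which the total error is smallest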

Now let us go back to Figure 6.8. The total error displayed in Figure 6.8 is the sum of the truncation error and the arithmetic error. The descending branches of the errors are dominated by the truncation error, either O(h) (slope +1) or O(h²) (slope +2). In the climbing branch of the total error the arithmetic error dominates. The dependence of the arithmetic error on 1/h = h⁻¹ can be clearly detected in the slope −1 of the climbing branch of the total error of the derivative in Figure 6.9.

Note well that while talking about the total error we disregard the avoidable errors of the nature of bugs and mistakes. Sadly, these errors are sometimes present, but their unpredictable nature makes them very difficult to discuss in general.

Illustration 8

The report “DCS Upgrades for Nuclear Power Plants: Saving Money and Reducing Risk through Virtual-Stimulation Control System Checkout” by G. McKim, M. Yeager and C. Weirich from 2011, states on page 5, when discussing a software simulator of a nuclear power plant subsystem: “Here was the first surprise. The emulated Bailey response in Figure 5 didn’t show this rate limiting. The controller output traveled as fast as 12% per second. This led to a line-by-line examination of the FORTRAN source code for the Bailey emulation, whereupon it was discovered that, contrary to belief, the rate limiting was not included in the simulation.”

This is an example of a software bug: the feature that was supposed to be programmed was either never implemented or was implemented and later deleted.

Illustration 9

The Deepwater Horizon Accident Investigation Report from September 8, 2010 states on page 64: “The 13.97 ppg interval at 18,200 ft. was included in the OptiCem model report as the reservoir zone. The investigation team was unable to clarify why this pressure (13.97 ppg) was used in the model since available log data measured the main reservoir pressure to be 12.6 ppg at the equivalent depth. Use of the higher pressure would tend to increase the predicted gas flow potential. The same OptiCem report refers to a 14.01 ppg zone at 17,700 ft. (which, in fact, should be 14.1 ppg: the actual pressure measured using the GeoTap logging-while-drilling tool).” (Emphasis is mine.)

These two instances are illustrations of an input error (mistake of the operator). Undoubtedly important, but they are outside of the scope of error control that numerical methods can exercise and therefore will not be discussed in this book.

6.7 Annotated bibliography

1. J.E. Marsden, A. Weinstein, Calculus I and II, Springer, 1985. Excellent presentation of the background to the Taylor series.

2. L.F. Shampine, I. Gladwell, S. Thompson, Solving ODEs with MATLAB, Cambridge University Press, 2003. A very complete discussion of the subject of errors in numerical integration of differential equations.

3. G.W. Stewart, Matrix algorithms. Volume I: Basic decompositions. SIAM, 1998. Good discussion of the rounding error (machine representation error). Complete and readable presentation of the QR factorization and least squares.


7 Solution of systems of equations

Summary

1. We discuss a couple of representative methods for the solution of a scalar nonlinear equation. Main idea: the Newton and bisection methods are complementary with respect to the rate of convergence and robustness.

2. Newton’s method (in one of its several variants) is a crucial building block in nonlinear analysis of structures, where systems of coupled nonlinear equations need to be solved repeatedly. Main idea: efficient solvers for systems of coupled linear equations are critical to the success of Newton’s method.

3. We discuss the solution of systems of coupled linear equations by methods that fall under the class of factorizations, on the examples of the LU and QR decompositions. Main idea: factorizations provide critical infrastructure to a variety of numerical algorithms, especially Newton-like solvers of nonlinear equations and eigenvalue problem solvers.

4. Errors produced by factorization algorithms depend on the so-called condition number. Main idea: condition numbers are related to eigenvalues.

7.1 Single-variable nonlinear algebraic equation

We will start with a scalar nonlinear algebraic equation, but immediately thereafter we will move into the subject of a system of coupled nonlinear algebraic equations. Our motivation will be provided here by the so-called implicit time integration algorithms, of which we have seen as examples the backward Euler and the trapezoidal integration algorithms.

As an example of a nonlinear algebraic equation consider the application of the backward Euler integrator to a single-variable IVP. We recall that the time step advance was expressed as an implicit function (implicit in the sense of “unresolved”) in the solution yj+1

yj+1 = yj + (tj+1 − tj)f(tj+1, yj+1) . (7.1)

In general, for an arbitrary right-hand side function f this will require the solution of a nonlinear algebraic equation to obtain yj+1. For convenience we will define the function of the unknown yj+1

F (yj+1) = yj+1 − yj − (tj+1 − tj)f(tj+1, yj+1) .

The solution y(∗) to the equation

F (y(∗)) = 0

is the sought yj+1. We attempt to find the solution to F(y(∗)) = 0 by first guessing where the root may be, y(0), and then using the Taylor series expansion of F at y(0)


F(y(∗)) = F(y(0)) + dF(y(0))/dyj+1 (y(∗) − y(0)) + R1 = 0 .

The term

dF(y(0))/dyj+1

is referred to as the Jacobian. Provided the remainder R1 is negligible compared to the other terms, we can write approximately

F(y(0)) + dF(y(0))/dyj+1 (y(∗) − y(0)) ≈ 0 ,   (7.2)

which may be solved for y(∗) as a better approximation of the root

y(∗) ≈ y(0) − F(y(0)) / (dF(y(0))/dyj+1) = y(0) − (dF(y(0))/dyj+1)^(−1) F(y(0)) .

Thus we arrive at Newton’s algorithm for finding the solution of a nonlinear algebraic equation: Guess the starting point of the iteration, y(0), as close to the expected root of the equation, y(∗), as possible. Then repeat until the error (in some measure to be determined) drops below acceptable tolerance.

y(k) = y(k−1) − (dF(y(k−1))/dyj+1)^(−1) F(y(k−1))   (7.3)
if error e(k) < tolerance, break; otherwise go on
k = k + 1 and repeat from the top

The error could be measured as the difference between the successive iterations

e(k) = y(k) − y(k−1)

or by comparing the value of the function F with zero

e(k) = F(y(k)) .

Or, convergence can be decided by looking at some composite of the above errors; for instance the iteration could be considered converged when either of these errors drops below a certain tolerance.
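The algorithm (7.3) fits in a few lines of MATLAB. The following is a minimal sketch (it is not the aetna function newt; the function name and the combined stopping test are choices made here for illustration):

function x = newton_scalar_sketch(F, dFdx, x0, tol, maxit)
% Scalar Newton iteration: F = handle of the function, dFdx = handle of its derivative.
x = x0;
for k = 1:maxit
    dx = -F(x)/dFdx(x);                      % Newton correction
    x  = x + dx;
    if (abs(dx) < tol) || (abs(F(x)) < tol)  % either error measure below tolerance
        return
    end
end
error('Newton iteration did not converge');
end

For instance, newton_scalar_sketch(@(x)-0.5+(x-1)^3, @(x)3*(x-1)^2, 2.6, 1e-12, 20) converges to the root 1 + 0.5^(1/3) used in Illustration 2 below.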

Illustration 1

How do we apply Newton’s algorithm to solve the nonlinear equation that defines a single step of the backward Euler algorithm?

To advance the solution we have to solve F(yj+1) = yj+1 − yj − (tj+1 − tj)f(tj+1, yj+1) = 0. The only difficulty may be presented by the derivative of the function f which we need to compute

dF(y(k−1))/dyj+1 = 1 − (tj+1 − tj) ∂f(tj+1, y(k−1))/∂yj+1 .

This turns out to be really easy for the simple function f of a linear ODE with a constant coefficient

f(t, y) = λy

and the Jacobian is

dF(y(k−1))/dyj+1 = 1 − (tj+1 − tj)λ .

The Newton algorithm gives the solution in one iteration step as

y(∗) = y(1) = y(0) − (dF(y(0))/dyj+1)^(−1) F(y(0)) = yj / (1 − (tj+1 − tj)λ)

For this special right-hand side function it works out precisely as we would expect from the definition of the backward Euler method. For general right-hand side functions f the solution will require several iterations of the algorithm, until some tolerance is reached as discussed above.

Concerning the implementation of the backward Euler time integration in MATLAB: Either we have to provide not only a function for the right-hand side f but also one for its derivative ∂f(·, y)/∂y, or the software must make do without the derivative. Fortunately, we realize that numerical differentiation could be used, and we have developed some approaches in the previous Chapter.
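To make this concrete, here is a sketch of a single backward Euler step that combines the Newton iteration with a forward-difference approximation of ∂f/∂y. The function name, the choice of differentiation step, and the iteration limit are assumptions made for the example; this is not an aetna function.

function ynew = beuler_step_sketch(f, t, y, dt, tol)
% One backward Euler step for a scalar IVP y' = f(t,y), solved with Newton's method.
ynew = y;                                    % initial guess: the previous value
for k = 1:20
    F  = ynew - y - dt*f(t+dt, ynew);        % residual of the implicit equation (7.1)
    h  = sqrt(eps)*max(abs(ynew), 1);        % step for numerical differentiation
    dfdy = (f(t+dt, ynew+h) - f(t+dt, ynew))/h;
    dFdy = 1 - dt*dfdy;                      % Jacobian of F
    dy = -F/dFdy;                            % Newton correction
    ynew = ynew + dy;
    if abs(dy) < tol, return; end
end
error('Backward Euler step did not converge');
end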

7.1.1 Convergence rate of Newton’s method

Write the Taylor series for the scalar function F (x), but this time keep the remainder

F(x(∗)) = F(x(k)) + dF(x(k))/dx (x(∗) − x(k)) + R1 .

The remainder is written as

R1 = d²F(ξ)/dx² (x(∗) − x(k))²/2! , ξ ∈ ⟨x(k), x(∗)⟩ .

Since x(∗) is the root, we know that

F(x(∗)) = 0

and so it follows that

F(x(k)) + dF(x(k))/dx (x(∗) − x(k)) + R1 = 0 .   (7.4)

Newton’s algorithm would use the above equation, neglect the remainder R1, and thus obtain an estimate of the root x(k+1) from

F(x(k)) + dF(x(k))/dx (x(k+1) − x(k)) = 0 .   (7.5)

Now notice that the errors in iterations k and k + 1 are

Ek = x(∗) − x(k) , Ek+1 = x(∗) − x(k+1)

and these may be substituted both in Equation (7.4) and in the expression for the remainder. Thus (7.4) may be written in terms of the errors as

F(x(k)) + dF(x(k))/dx Ek + d²F(ξ)/dx² Ek²/2! = 0 .   (7.6)

Equation (7.5) may also be rewritten in terms of the errors by expressing

x(k+1) − x(k) = x(k+1) − x(k) + x(∗) − x(∗) = Ek − Ek+1

so that (7.5) becomes

F(x(k)) + dF(x(k))/dx (Ek − Ek+1) = 0 .   (7.7)

Now (7.7) may be subtracted from (7.6) to yield

dF(x(k))/dx Ek+1 + d²F(ξ)/dx² Ek²/2! = 0 .

Therefore we can write the error in iteration k + 1 in terms of the error in iteration k as

Ek+1 = − (dF(x(k))/dx)^(−1) d²F(ξ)/dx² Ek²/2! .   (7.8)

We say that Newton’s method attains a quadratic convergence rate, because the error in the current iteration is proportional to the square of the error in the previous iteration (and this is good, assuming the error is going to be small, and the square of a small number is even smaller).

Illustration 2

We shall solve the equation f(x) = −0.5 + (x − 1)³ = 0 with Newton’s method.1 The solver used is the aetna implementation of Newton’s method, newt.2 The approximate errors in the seven iterations required for convergence to machine precision are

Iteration   Approximate Error
1           0.806299474015900
2           0.338070307349234
3           0.090929677656663
4           0.009026273904915
5           0.000101115651663
6           0.000000012879718
7           0.000000000000000

A good rule of thumb is that the number of zeros behind the decimal point of the error doubles with each iteration. That is excellent convergence indeed.

Figure 7.1 illustrates the formula (7.8). We plot the approximate errors Ek+1 versus Ek as

plot(e(1:end-2),e(2:end-1),’ro-’)% e = approximate errors

Clearly the data resembles a parabolic arc, exactly as predicted by the formula. Re-plotted on a log-log scale (Figure 7.2) as

loglog(e(1:end-2),e(2:end-1),’ro-’)

confirms the relationship between Ek+1 and Ek. It is quadratic, since the slope on the log-log plot is very close to 2.

1See: aetna/NonlinearEquations/testnewt_conv_rate.m
2See: aetna/NonlinearEquations/newt.m


Fig. 7.1. Typical convergence of Newton’s method: Ek+1 versus Ek for @(x)(−0.5+(x−1)^3), x0=2.6.

Fig. 7.2. Failure of Newton’s method due to divergence (left), and successful convergence upon the selection of the initial guess closer to the root (right).

7.1.2 Robustness of Newton’s method

Newton’s method can converge very satisfactorily, but the bad news is it can also spectacularly fail to deliver the goods. Consider for instance Figure 7.3. On the left: Choosing as the initial guess x0 leads to a succession of xj which drift away from the root rather than converging to it. On the right: The function graph is scaled in the horizontal direction for clarity. Therefore the initial guess x0 that is shown there is in fact chosen much closer to the desired root than in the figure on the left. Consequently, Newton’s method generates a succession of root locations which converge. This is quite typical: as good an initial guess of the location of the root as possible is critical to the success of the method.

Figure 7.3 shows a situation in which one may be looking for a root where there is none. Use your imagination to reduce the gap between the horizontal axis and the hump of the function so that the two almost merge visually. (In the figure we keep the gap large for clarity.) Then starting the iteration in the vicinity of the presumed root will not lead to convergence. In fact, since the function graph has a zero slope at some point at the top of the hump, there is a potential for Newton’s method to blow up (remember, we need to divide by the value of the derivative).

Figure 7.4 illustrates another difficulty. For rapidly oscillating functions with many roots it is quite possible for Newton’s method to jump from root to root, and to eventually locate a root, but not the one we were looking for originally. If the Newton solver is used in an automatic fashion, we might not even be aware of the switch.


Fig. 7.3. Failure of Newton’s method: first it gets stuck next to a false root (maximum), then the iterations blast off to infinity.

Fig. 7.4. Failure of Newton’s method: if the initial guess is not sufficiently close to the root, the method does not find the root that was intended.

7.1.3 Bisection method

The bisection method is a complement to Newton’s method. (a) While Newton’s method converges quickly, bisection is slow to converge. (b) While Newton’s method may fail to find a root, bisection is guaranteed to converge to a root. (c) Newton’s method needs to know both the function and its derivative, while bisection can work with just the function. (d) While for bisection we need the so-called bracket (a pair of locations at which the given function gives values of opposite signs), this is not needed for Newton’s method.

Perhaps the best way to describe the bisection method is by an algorithm:

function [xl,xu] = bisect(funhandle,xl,xu,tolx,tolf)
if (xl >xu)
    temp =xl; xl = xu; xu =temp;
end
fl=feval(funhandle,xl);
fu=feval(funhandle,xu);
... a bit of error checking omitted for brevity
while 1
    xr=(xu+xl)/2; % bisect interval
    fr=feval(funhandle,xr); % value at the midpoint
    if (fr*fl < 0), xu=xr; fu=fr;% upper --> midpoint
    elseif (fr == 0), xl=xr; xu=xr;% exactly at the root
    else, xl=xr; fl=fr;% lower --> midpoint
    end
    if (abs(xu-xl) < tolx) || (abs(fr) < tolf)
        break; % We are done
    end
end

3See: aetna/NonlinearEquations/bisect.m

Figure 7.5 shows typical convergence of the bisection method: the relationship between Ek+1 and Ek is roughly linear. (Data produced by testbisect_conv_rate.4 The solver is an aetna implementation of the bisection method, bisect.5)

Fig. 7.5. Typical convergence of the bisection method: Ek+1 versus Ek for @(x)(−0.5+(x−1)^3), with the bracket xl=1.7934, xu=1.7953.

Figure 7.6 is a good comparison of the typical convergence properties of the Newton and bisection methods.6 Evidently the bisection method requires many more iterations than Newton’s method. When each evaluation of the function is expensive, the quicker converging method wins. When the robustness of bisection is required (such as when Newton’s method would not converge), the slower method is preferable. Wouldn’t it make sense to combine such disparate methods and switch between them as needed? That is how the MATLAB fzero function works. (Find out from the documentation which methods are combined in fzero.)
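As an aside, fzero needs only a function handle and a starting guess (or a bracket); for the function from Illustration 2 the call might look like this:

% fzero on the function of Illustration 2; the starting guess 2.6 is arbitrary.
xroot = fzero(@(x)(-0.5+(x-1)^3), 2.6)   % approximately 1.7937, i.e. 1 + 0.5^(1/3)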

Fig. 7.6. Comparison of the convergence of the bisection method (dashed line) and Newton’s method (solid line): error Ek versus iteration k.

4See: aetna/NonlinearEquations/testbisect_conv_rate.m
5See: aetna/NonlinearEquations/bisect.m
6See: aetna/NonlinearEquations/bisection_versus_Newton.m


7.2 System of nonlinear algebraic equations

As an example of a system of nonlinear algebraic equations we now consider the application of the backward Euler integrator to a multiple-variable (vector) IVP. The time step advance is expressed as a vector function implicit in the solution yj+1

yj+1 = yj + (tj+1 − tj)f(tj+1,yj+1) .

For an arbitrary right-hand side function f this will require the solution of a nonlinear vector algebraic equation to obtain yj+1. We will define the vector function of the vector unknown y

F (y) = y − yj − (tj+1 − tj)f(tj+1,y) ,

which is clearly the backward Euler algorithm for y = yj+1. The solution y(∗) to the equation

F (y(∗)) = 0

is the sought yj+1.

Now we take as the initial guess y(0), and we expand into a Taylor series

F(y(∗)) = F(y(0)) + dF(y(0))/dy (y(∗) − y(0)) + R1 = 0 .

Provided the remainder R1 is negligible compared to the other terms, we can write approximately

F(y(0)) + dF(y(0))/dy (y(∗) − y(0)) ≈ 0 ,   (7.9)

which at first sight looks exactly like (7.2). There must be a difference here, however, as we are dealing with a system of equations. What do we mean by

dF(y(0))/dy (y(∗) − y(0)) ?   (7.10)

The expression (7.9) holds for each component (row) of the vector (column matrix) separately. The components of the vector function F and of the argument y may be written as

[F(.)]r , [y]c .

Each of the components [F(.)]r is a function of all the components [y]c. Therefore, equation (7.9) in components must have the meaning

[F(y(0))]r + Σ_{c=1:n} ∂[F(y(0))]r/∂[y]c [y(∗) − y(0)]c ≈ 0 ,

i.e. in words: the change in the component [F(y(0))]r is due to the change of this component in the direction of each of the c components of the argument [y]c, which is expressed by the first term of the Taylor series. Thus we see that the left-hand side of (7.9) is the sum of two vectors, F(y(0)) and the vector

dF(y(0))/dy (y(∗) − y(0)) ,

which is the product of a square matrix dF(y(0))/dy and the vector (y(∗) − y(0)). The matrix

dF(y(0))/dy


is the so-called Jacobian matrix, whose components are

[dF(y(0))/dy]rc = ∂[F(y(0))]r/∂[y]c .

Thus we arrive at Newton’s algorithm for finding the solution of a nonlinear algebraic equation: Initially guess y(0); then compute

y(k) = y(k−1) − (dF(y(k−1))/dy)^(−1) F(y(k−1))   (7.11)
k = k + 1 and repeat previous line

until the error (in some measure to be determined) drops below acceptable tolerance. In general it is a good idea not to invert a matrix if we can help it. Rewriting the Newton algorithm as

J(y(k−1)) = dF(y(k−1))/dy   % Compute the Jacobian matrix
J(y(k−1)) ∆y = −F(y(k−1))   % Compute the increment ∆y
y(k) = y(k−1) + ∆y          % Update the solution   (7.12)
if error e(k) < tolerance, break; otherwise go on
k = k + 1 and go to the top

we see that the Newton algorithm will require repeated solutions of a system of linear algebraic equations, since the second line of the above algorithm means: solve for ∆y.

Clearly this could mean major computational effort, depending on how many equations there are (how big the matrix J(y(k−1)) is), whether the Jacobian is symmetric, how many zeros and in what pattern there might be in the Jacobian (in other words, is it dense, and if it isn’t what is the pattern of the sparse matrix), and so on. We will take up the subject of the solution of systems of equations in the next chapter.

The error e(k) of the solution in iteration k could be measured as the difference between successive iterations, as for the scalar equation (7.3), and it should be expressed in terms of vector norms

e(k) = ∥y(k) − y(k−1)∥

or by comparing the norm of the function F with zero

e(k) = ∥F (y(k))∥ .

Illustration 3

Find the solution of the simultaneous equations

f(x, y) = (x² + 3y²)/2 − 2 = 0 , g(x, y) = xy + 3/4 = 0 .

The two expressions f(x, y) and g(x, y) may be interpreted as surfaces raised above the x, y plane. Setting these to zero is equivalent to forcing the points that satisfy these equations, individually, to lie on the level curves of the surfaces. The solution of the two equations being satisfied simultaneously corresponds to the intersection of the level curves. The figures of the surfaces were produced by the script two_surfaces.7

The solution will be attempted with the Newton method. The vector argument is

y = [x; y]

and the vector function is

7See: aetna/NonlinearEquations/two_surfaces.m


F(y) = [f(x, y); g(x, y)] .

Therefore the necessary Jacobian matrix is

J11 = ∂f(x, y)/∂x = x , J12 = ∂f(x, y)/∂y = 3y , J21 = ∂g(x, y)/∂x = y , J22 = ∂g(x, y)/∂y = x .

The MATLAB code defines both the vector function and the Jacobian matrix as anonymous functions.

F=@(x,y) [((x.^2 + 3*y.^2)/2 -2); (x.*y +3/4)];

J=@(x,y) [x, 3*y; y, x];

With these functions at hand it is easy to carry out the iteration interactively, step-by-step. For instance, guessing

w0= [-0.5;0.5];

we update the solution as

>> w=w0-J(w0(1),w0(2))\F(w0(1),w0(2))

w =

-0.5000

1.5000

For the next iteration, we reset the variable w0

>> w0=w;

and repeat the solution

w=w0-J(w0(1),w0(2))\F(w0(1),w0(2))

w =

-0.6154

1.1538

We can watch the differences between the successive iterations getting smaller. With four iterations we get five decimal digits converged.

w =

-0.6923

1.0833

This point will be one of the four possible solutions (level-curve intersections). To get a different solution we need to start with a different guess w0, for instance w0= [-2;0.5];.
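The interactive session above can of course be wrapped into a loop. The following is a minimal sketch (the iteration limit and the tolerance are arbitrary choices for the example):

F = @(x,y) [((x.^2 + 3*y.^2)/2 -2); (x.*y +3/4)];
J = @(x,y) [x, 3*y; y, x];
w = [-0.5; 0.5];                         % initial guess
for k = 1:20
    dw = -J(w(1),w(2))\F(w(1),w(2));     % Newton correction
    w  = w + dw;
    if norm(dw, inf) < 1e-10, break; end % converged
end
w                                        % one of the four level-curve intersections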


7.2.1 Numerical Jacobian evaluation

To compute the Jacobian analytically is often not possible or feasible. The elements of the Jacobian matrix can be computed by numerical differentiation. MATLAB includes a sophisticated routine for forming Jacobians numerically, numjac. Here we discuss just the basic idea.

Consider the vector function F(z), whose derivative should be evaluated at z. Each element of the matrix

∂[F(z)]r/∂[z]c

is a partial derivative of the component r of the vector F with respect to the component c of the argument. The index c is the column index. Therefore, just one evaluation of the vector function F per column is necessary for forward or backward difference evaluation of the numerical derivative.

First, evaluate F = F (z). Then, for each column c of the Jacobian matrix evaluate

cF = F (z + hcec) ,

where

[ec]m = 1 for c = m, [ec]m = 0 otherwise,

and hc is a suitably small number (not too small: let us not forget the effect of computer arithmetic). The Jacobian matrix is approximated by the computed column vectors as

[∂F(z)/∂z] ≈ [ (1F − F)/h1, (2F − F)/h2, . . . , (nF − F)/hn ] .   (7.13)

In these columns we recognize numerical approximations of derivatives of the vector function F (divided differences).

One can recognize in Newton’s method with the numerical approximation of the Jacobian a variation of the so-called secant method.
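Formula (7.13) translates into a few lines of MATLAB. This is a bare-bones sketch only (numjac itself is considerably more careful, for instance about the choice of the steps hc), and the function name is invented for the example:

function J = numjac_sketch(F, z, h)
% Forward-difference approximation of the Jacobian of F at the column vector z.
Fz = F(z);                        % base value
n = numel(z); m = numel(Fz);
J = zeros(m, n);
for c = 1:n
    zc = z; zc(c) = zc(c) + h;    % perturb component c
    J(:,c) = (F(zc) - Fz)/h;      % divided difference: column c of (7.13)
end
end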

Illustration 4

In the script Jacobian_example8 we compare the analytically determined Jacobian matrix with its numerical approximation. The vector function is taken as

F(z) = [ z1² + 2 z1 z2 + z2² ; z1 z2 ] .

Therefore the Jacobian matrix is evaluated at z1 = −0.23, z2 = 0.6 as

8See: aetna/NonlinearEquations/Jacobian_example.m


>> F=@(z)[z(1)^2+2*z(1)*z(2)+z(2)^2;...
    z(1)*z(2)];
dFdz=@(z)[2*z(1)+2*z(2),2*z(1)+2*z(2);...
    z(2), z(1)];
zbar = [-0.23;0.6];
>> Jac =dFdz(zbar)
Jac =
    0.7400    0.7400
    0.6000   -0.2300

Evaluating the function at the base point and using the step size of 0.1

>> Fbar =F(zbar);

h=1e-1;

we obtain the approximate (numerically differentiated) Jacobian matrix

>> Jac_approx =[(F(zbar+[1;0]*h)-Fbar)/h, (F(zbar+[0;1]*h)-Fbar)/h]

Jac_approx =

0.8400 0.8400

0.6000 -0.2300

We may note that the second row is in fact exact. (Why?) On the other hand the Jacobian matrix will not be evaluated exactly in any component for the second example in Jacobian_example. Check it out.

7.2.2 Nonlinear structural analysis example

Consider a high-strength steel cable structure shown in Figure 7.7. The dashed line shows how the cables are connected, but the geometry has no physical meaning. In reality, the cables are strung between the joints so that the unstressed lengths of three sections of the main cable, connecting joints p3, p1, p2, and p4, are given as 1.025× the distance between the joints. The two tiedowns between joints p5 and p1 and between joints p5 and p2 have unstressed lengths which are less than the distances between the points p5 and p1 and p5 and p2: tiedown 4 has unstressed length 0.88× distance between p5 and p1, and tiedown 5 has unstressed length 0.81× distance between p5 and p2. Therefore after the structure is assembled, the structure must deform and it must experience tensile stress (it becomes prestressed). The goal is to find the forces in the cables after the structure was assembled. Since the problem is statically indeterminate, we will use the deformation method. The requisite equations are going to be the equilibrium equations for the joints p1, p2, and the unknowns are going to be the locations of these two joints. (Note that the locations of joints p3, p4, and p5 are fixed, those joints are supported.)

For instance, for the joint 1 we write the equilibrium equations as

−N1 ∆x,1/L1 + N2 ∆x,2/L2 − N4 ∆x,4/L4 = 0
−N1 ∆y,1/L1 + N2 ∆y,2/L2 − N4 ∆y,4/L4 = 0 .

The geometrical relationships for cable 1 are based on these expressions

∆x,1 = Y1 − px,3 , ∆y,1 = Y2 − py,3 , L1 = √(∆x,1² + ∆y,1²) ,

where Y1, Y2 are the coordinates of joint 1 after deformation. Similarly for cable 2


Fig. 7.7. Cable structure configuration. Dashed line: schematic of the connections. Filled dots indicate supported joints.

∆x,2 = Y3 − Y1 , ∆y,2 = Y4 − Y2 , L2 = √(∆x,2² + ∆y,2²) ,

where Y3, Y4 are the coordinates after deformation of joint 2. Together Y1, Y2, Y3, Y4 constitute the unknowns in the problem. Finally for the third cable running into joint 1 we have

∆x,4 = Y1 − px,5 , ∆y,4 = Y2 − py,5 , L4 = √(∆x,4² + ∆y,4²) .

The forces can be based upon the following constitutive equation

N1 = EA1 (L1 − L10)/L10

(and analogously for the other cables) which relates the relative stretch of the cable to the axial force. This is based on the assumption that the stretches are small compared to 1.0 and therefore the stresses are small compared to the elastic modulus, and this assumption is verified in the present problem. In general it is a good idea to verify that the assumptions that go into a model are reasonable by backtracking from the results. For instance, in the current problem we would find the locations of the joints, and from those we would compute the forces, and stresses. If the stresses in the cables were well below the yield stress (or negligibly small with respect to the elastic modulus), our assumption would have been verified.

Working out in detail just the first term in the first equation gives us an idea of the complexity of the resulting equations. We do get an appreciation for the tedium associated with computing derivatives of such terms with respect to the unknowns Y1, Y2, Y3, Y4 to construct the Jacobian matrix analytically:

−N1 ∆x,1/L1 = −EA1 (√((Y1 − px,3)² + (Y2 − py,3)²) − L10)/L10 · (Y1 − px,3)/√((Y1 − px,3)² + (Y2 − py,3)²)

Therefore, we are going to construct the Jacobian matrix numerically using the numerical differentiation technique from the preceding section.

We are going to present the computation as implemented in a MATLAB function.9 First we define the data of the problem.

function [y,sigma]=cable_config_myjac
% undeformed configuration, lengths in millimeters
p =[10,10; 25,25; 0,0; 40,40; 40,0]*1000;
y =p;% Initialize deformed configuration
% Which joints are connected by which cables?
conn = [3,1; 1,2; 2,4; 5,1; 5,2];
% Initial cable lengths
Initial_L =[...
    1.025*[Length(1),Length(2),Length(3)],...
    0.88*Length(4),0.81*Length(5)];
A= [150,150,150,100,100];% Cable cross-section mm^2
E=200000;% Young's modulus, MPa
AbsTol = 0.0000005;%Tolerance in millimeters
maximum_iteration = 12;% How many iterations are allowed?
N =zeros(size(conn,1),1);% Prestressing force

9See: aetna/NonlinearEquations/cable_config_myjac.m

We introduce some ancillary functions (implemented as nested MATLAB functions). They are used to compute the geometry characteristics from the locations of the joints on the deformed structure.

function Delt=Delta(j)% Compute the run/rise
    Delt =diff(y(conn(j,:),:));
end
function L=Length(j)% Deformed cable length
    L=sqrt(sum(Delta(j).^2));
end

This function computes the total force on each joint. When equilibrium is reached, the total force (meaning the sum of the forces from all the cables connected at a joint) vanishes. When the structure is not in equilibrium, the force residual is in general nonzero.

function R=Force_residual(Y)% Compute the force residual
    y(1,:) =Y(1:2)';
    y(2,:) =Y(3:4)';
    F =zeros(size(p,1),2);
    for j=1:size(conn,1)
        L=Length(j);
        N(j)=E*A(j)*(L-Initial_L(j))/L;
        F(conn(j,1),:) =F(conn(j,1),:) +N(j)*Delta(j)/L;
        F(conn(j,2),:) =F(conn(j,2),:) -N(j)*Delta(j)/L;
    end
    R =[F(1,:)';F(2,:)'];
end

In this function we compute the numerical approximation of the Jacobian matrix. Compare the expressions inside the loop with equation (7.13). Note that the step used in the numerical differentiation is hardwired at 1/1000th of the current value of the unknown. A more sophisticated implementation could adjust these better.

function dRdy=myjac(Y)% Compute the current Jacobian
    R=Force_residual(Y);
    for j=1:size(Y,1)
        Ys =Y; Ys(j)=Ys(j) +Ys(j)/1000;
        dRdy(:,j) =(Force_residual(Ys)-R)/(Ys(j)/1000);
    end
end

Here is the Newton solver loop.

Y=[y(1,:)';y(2,:)'];% Initialize deformed configuration
for iteration = 1: maximum_iteration % Newton loop
    R=Force_residual(Y);% Compute residual
    dRdy = myjac(Y);% Compute Jacobian
    dY=-dRdy\R;% Solve for correction
    if norm(dY,inf)<AbsTol % Check convergence
        y(1,:) =Y(1:2)';% Converged
        y(2,:) =Y(3:4)';
        R=Force_residual(Y);% update the forces
        sigma =N./A';% Stress
        return;
    end
    Y=Y+dY;% Update configuration
end
error('Not converged')% bummer :(

The output is

>> [y,sigma]=cable_config_myjac

y =

1.0e+004 *

1.293851615236427 0.660813145442825

2.924407085679964 2.105124641641672

0 0

4.000000000000000 4.000000000000000

4.000000000000000 0

sigma =

1.0e+002 *

4.494507981897321

4.851283944479819

2.463132114467833

2.051731752058516

3.659994343900810

Figure 7.8 displays the results of the computation. Note that the stresses are distributed somewhat non-uniformly. A cool improvement on our computation would be to optimize the unstressed lengths of the cables so that the prestress was uniform across the structure.

Fig. 7.8. Cable structure configuration. Dashed line: schematic of the connections. Filled dots indicate supported joints. Thick solid line: actual configuration of the prestressed structure. Tensile stresses are indicated.


As a final note, we shall point out that MATLAB comes with its own sophisticated function for the numerical evaluation of the Jacobian matrix, numjac. The pieces of code that would need to be changed with respect to our implementation10 are the computation of the residual (the function needs to accept additional arguments)

function R=Force_residual(Ignore1,Y,varargin)
    y(1,:) =Y(1:2)';
    y(2,:) =Y(3:4)';
    F =zeros(size(p,1),2);
    for j=1:size(conn,1)
        L=Length(j);
        N(j)=E*A(j)*(L-Initial_L(j))/L;
        F(conn(j,1),:) =F(conn(j,1),:) +N(j)*Delta(j)/L;
        F(conn(j,2),:) =F(conn(j,2),:) -N(j)*Delta(j)/L;
    end
    R =[F(1,:)';F(2,:)'];
end

and the evaluation of the numerical Jacobian in the Newton loop (there are a few additional arguments to pass)

Y=[y(1,:)';y(2,:)'];% Initialize deformed configuration
for iteration = 1: maximum_iteration % Newton loop
    R=Force_residual(0,Y);% Compute residual
    [dRdy] = numjac(@Force_residual,0,...
        Y,R,Y/1e3,[],0);% Compute Jacobian
    dY=-dRdy\R;% Solve for correction
    if norm(dY,inf)<AbsTol % Check convergence
        y(1,:) =Y(1:2)';% Converged
        y(2,:) =Y(3:4)';
        R=Force_residual(0,Y);% update the forces
        sigma =N./A';% Stress
        return;
    end
    Y=Y+dY;% Update configuration
end
error('Not converged')% bummer :(

We can easily check that the two implementations of the computation give identical results.

In summary, Newton’s method, in its several variants and refinements, has a special place among the mainstream methods for solving a system of nonlinear algebraic equations in engineering applications. One of the building blocks of this class of algorithms is a solver for repeatedly solving a system of linear algebraic equations. This is the topic we will take up in the following sections.

7.3 LU factorization

Consider a system of linear algebraic equations

Ax = b

with a square matrix A. It is possible to factorize the matrix into the product of a lower triangular matrix and an upper triangular matrix

A = LU

10See: aetna/NonlinearEquations/cable_config_numjac.m


The triangular matrices are not determined uniquely. Here we will consider the variant where the lower triangular matrix L has ones on the diagonal.

7.3.1 Forward and backward substitution

What is the value of the LU factorization? It derives from the efficiency with which a system with a triangular matrix can be solved. For instance, consider the system

Ly =
  [ •           ]
  [ • •         ]
  [ • • •       ]  y = b ,
  [ • • • •     ]
  [ • • • • •   ]
  [ • • • • • • ]

where L is lower triangular (non-zeros are indicated by the black dots, the zeros are not shown). In the first row of L there is only one nonzero, L11. Therefore we can solve immediately for y1. Next, y1 may be substituted into the second equation, from which we can solve for y2, and so on. Since we are solving for the unknowns in the order of their indexes, 1, 2, 3, ..., n, we call this the forward substitution.11

function c=fwsubs(L,b)
    [n m] = size(L);
    if n ~= m, error('Matrix must be square!'); end
    c=zeros(n,1);
    c(1)=b(1)/L(1,1);
    for i=2:n
        c(i)=(b(i)-L(i,1:i-1)*c(1:i-1))/L(i,i);
    end
end

Now consider the system

Ux =
  [ • • • • • • ]
  [   • • • • • ]
  [     • • • • ]  x = c ,
  [       • • • ]
  [         • • ]
  [           • ]

where U is upper triangular. In the last row of U there is only one nonzero, Unn. Therefore we can solve immediately for xn. Next, xn may be substituted into the last but one equation, from which we can solve for xn−1, and so on. Since we are solving for the unknowns in the reverse order of their indexes, n, n−1, n−2, ..., 2, 1, we call this the backward substitution.12

function x=bwsubs(U,c)
    [n m] = size(U);
    if n ~= m, error('Matrix must be square!'); end
    x=zeros(n,1);
    x(n)=c(n)/U(n,n);
    for i=n-1:-1:1
        x(i)=(c(i)-U(i,i+1:n)*x(i+1:n))/U(i,i);
    end
end

11See: aetna/LUFactorization/fwsubs.m
12See: aetna/LUFactorization/bwsubs.m


And so we come to the punchline: provided we can factorize a general matrix A into the triangular factors, we can solve the system Ax = b in two steps. Write

Ax = LUx = L(Ux) = Ly = b , where y = Ux .

Step one, solve for y from

Ly = b .

And step two, solve for x from

Ux = y .

Both solution steps can be done very efficiently since the matrices involved are triangular. This is handy in many situations where the right-hand side b will change several times while the matrix A stays the same. For instance, here is how we compute the inverse of a general square matrix A: write the definition of the inverse

A A⁻¹ = 1

column-by-column as

A cₖ(A⁻¹) = cₖ(1) .

Here by cₖ(A⁻¹) we mean the kth column of A⁻¹, and by cₖ(1) we mean the kth column of the identity matrix. So if we successively set the right-hand side vector to b = cₖ(1), k = 1, 2, ..., and solve Ax = b, we obtain the columns of the inverse matrix as cₖ(A⁻¹) = x.
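The following is a minimal MATLAB sketch of the two-step solve and of the column-by-column computation of the inverse. It assumes that triangular factors L and U of A are already in hand (how to compute them is the subject of the next section), and it uses the fwsubs and bwsubs functions listed above; the numbers are made up for illustration.

L = [1 0 0; 0.5 1 0; 0.25 0.4 1];  % assumed unit lower triangular factor
U = [4 1 2; 0 3 1; 0 0 2];         % assumed upper triangular factor
A = L*U;                           % the matrix they factorize
b = [1; 2; 3];
y = fwsubs(L,b);                   % step one: solve L y = b
x = bwsubs(U,y);                   % step two: solve U x = y
norm(A*x - b)                      % should be at the level of machine precision
% Columns of the inverse: reuse the factors with b set to columns of the identity
n = size(A,1); Ainv = zeros(n);
for k = 1:n
    e = zeros(n,1); e(k) = 1;
    Ainv(:,k) = bwsubs(U, fwsubs(L,e));
end
norm(Ainv - inv(A))                % again a tiny number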

7.3.2 Factorization

The crucial question is: how do we compute the factors? LU factorization can be easily explained by reference to the well-known Gaussian elimination. We shall start with an example:

$$A = \begin{bmatrix}
0.796 & 0.7448 & 0.1201 & 0.0905 \\
-0.3649 & 1.216 & -0.3435 & -0.5449 \\
0.0186 & -0.093 & 1.204 & -0.0012 \\
-0.1734 & -0.6695 & -0.0653 & 0.4113
\end{bmatrix}$$

First we will change the numbers below the diagonal in the first column to zeros. Gaussian elimination does this by replacing a row in which a zero should be introduced, let us say row j, by a combination of the row j and the so-called pivot row. Thus a zero will be introduced in the element 2,1 by subtracting (−0.3649)/(0.796) × row 1 from row 2 to obtain

$$\begin{bmatrix}
0.796 & 0.7448 & 0.1201 & 0.0905 \\
0 & 1.558 & -0.2884 & -0.5034 \\
0.0186 & -0.093 & 1.204 & -0.0012 \\
-0.1734 & -0.6695 & -0.0653 & 0.4113
\end{bmatrix}$$

The element 1,1 (the number 0.796) is called a pivot. Evidently, the success of the proceedings is going to rely on the pivot being different from zero (not only strictly different from zero, but "sufficiently different": it shouldn't be too small compared to the other numbers in the same column). The manipulation described above can be executed by the following code fragment

i=1;
A(2,i:end) =A(2,i:end)-A(2,i)/A(i,i)*A(i,i:end)


Importantly, the same can also be written as a result of a matrix-matrix multiplication by the so-called elimination matrix

$$E_{(2,1)} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0.4584 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}$$

The elimination matrices are easily computed in MATLAB as¹³

function E =elim_matrix(A,i,j)
    E =eye(size(A));
    E(i,j) =-A(i,j)/A(j,j); % the multiplier that zeroes out the element i,j
end

We can readily verify that the element 2, 1 of A can be eliminated (zeroed out) by multiplying

$$E_{(2,1)}A = \begin{bmatrix}
0.796 & 0.7448 & 0.1201 & 0.0905 \\
0 & 1.558 & -0.2884 & -0.5034 \\
0.0186 & -0.093 & 1.204 & -0.0012 \\
-0.1734 & -0.6695 & -0.0653 & 0.4113
\end{bmatrix}$$

Next we will change 0.0186 to a zero. Again, we will do this with an elimination matrix, and note well that we will be working with the above right-hand side matrix, not the original A. So we will construct

$$E_{(3,1)} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
-0.02337 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}$$

and compute

$$E_{(3,1)}E_{(2,1)}A = \begin{bmatrix}
0.796 & 0.7448 & 0.1201 & 0.0905 \\
0 & 1.558 & -0.2884 & -0.5034 \\
0 & -0.1104 & 1.201 & -0.003315 \\
-0.1734 & -0.6695 & -0.0653 & 0.4113
\end{bmatrix} .$$

And so on: the elimination of the non-zeros in the first column is constructed as the sequence

$$E_{(4,1)}E_{(3,1)}E_{(2,1)}A = \begin{bmatrix}
0.796 & 0.7448 & 0.1201 & 0.0905 \\
0 & 1.558 & -0.2884 & -0.5034 \\
0 & -0.1104 & 1.201 & -0.003315 \\
0 & -0.5073 & -0.03914 & 0.431
\end{bmatrix} .$$

Now we start working on the second column. Note again that we are working with the matrix E₍₄,₁₎E₍₃,₁₎E₍₂,₁₎A, not the elements of the original matrix. Thus 0.07087 = −(−0.1104/1.558), and the elimination matrix to put a zero in the element 3,2 reads

$$E_{(3,2)} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0.07087 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix} .$$

The element 4,2 is eliminated analogously with E₍₄,₂₎. Finally, we apply the elimination matrix for the element 4,3 and the entire Gaussian elimination sequence will read

13 See: aetna/LUFactorization/elim_matrix.m


$$E_{(4,3)}E_{(4,2)}E_{(3,2)}E_{(4,1)}E_{(3,1)}E_{(2,1)}A = \begin{bmatrix}
0.796 & 0.7448 & 0.1201 & 0.0905 \\
0 & 1.558 & -0.2884 & -0.5034 \\
0 & 0 & 1.18 & -0.03899 \\
0 & 0 & 0 & 0.2627
\end{bmatrix} .$$

We recall that we wish to construct the factorization A = LU, which means that the above matrix on the right is U and consequently

$$L^{-1} = E_{(4,3)}E_{(4,2)}E_{(3,2)}E_{(4,1)}E_{(3,1)}E_{(2,1)} .$$

So now we have the matrix U and the inverse of L. Fortunately, L is obtained very easily. Not by inverting the above product, but rather by inverting each of the terms separately

$$L = E_{(2,1)}^{-1}E_{(3,1)}^{-1}E_{(4,1)}^{-1}E_{(3,2)}^{-1}E_{(4,2)}^{-1}E_{(4,3)}^{-1} .$$

For instance, to invert E₍₂,₁₎ we realize that the effect of the matrix multiplication in the product E₍₂,₁₎A is to make the second row of the result the sum of a multiple of the first row and 1× the second row. Therefore, to multiply with the inverse of E₍₂,₁₎ is to undo this operation, to subtract a multiple of the first row from the second row. The inverse of E₍₂,₁₎ also has ones on the diagonal, the only change is that the off-diagonal element changes its sign (we want subtraction instead of addition)

$$E_{(2,1)}^{-1} = 2\,\mathbf{1} - E_{(2,1)} .$$

The same reasoning applies to the other elimination matrices. Now we only have to figure out the product of the inverses of the elimination matrices. Take for instance the product E₍₂,₁₎⁻¹E₍₃,₁₎⁻¹:

$$E_{(2,1)}^{-1}E_{(3,1)}^{-1} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
-0.4584 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0.02337 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
= \begin{bmatrix}
1 & 0 & 0 & 0 \\
-0.4584 & 1 & 0 & 0 \\
0.02337 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix} .$$

The pattern is clear: each matrix in the product will simply copy its only nonzero off-diagonal element into the same location in the resulting matrix. Thus we have

$$L = \begin{bmatrix}
1 & 0 & 0 & 0 \\
-0.4584 & 1 & 0 & 0 \\
0.02337 & -0.07087 & 1 & 0 \\
-0.2178 & -0.3256 & -0.1127 & 1
\end{bmatrix} .$$

The entire elimination process for our given matrix can be expressed as a series of matrix multiplications

E21 =elim_matrix(A,2,1)
E31 =elim_matrix(E21*A,3,1)
E41 =elim_matrix(E31*E21*A,4,1)
E32 =elim_matrix(E41*E31*E21*A,3,2)
E42 =elim_matrix(E32*E41*E31*E21*A,4,2)
E43 =elim_matrix(E42*E32*E41*E31*E21*A,4,3)
U = E43*E42*E32*E41*E31*E21*A

Inefficient, but correct. In reality the elimination is usually done in-place. The upper triangle and the diagonal of A store the matrix U, and the lower triangle (below the diagonal) of A stores the matrix L (we do not store the diagonal, since we know that the diagonal of L consists of ones). naivelu4 is one of the naive implementations of the LU factorization in aetna.¹⁴

14 See: aetna/LUFactorization/naivelu4.m


function [l,u] = naivelu4(a)
    [n m] = size(a);
    if n ~= m
        error('Matrix must be square!')
    end
    for col=1:n-1
        ks=col+1:n;
        ls=a(ks,col)/a(col,col); % multipliers: column col of L
        a(ks,ks)=a(ks,ks)-ls*a(col,ks); % update the remaining block in-place
        a(ks,col)=ls; % store the multipliers below the diagonal
    end
    l=tril(a,-1)+eye(n,n);
    u=triu(a);
end

(Note the use of tril and triu to extract the lower and upper triangle from a matrix respectively.)
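A quick check of naivelu4 can be run with a few lines (a sketch; the test matrix is made up and made diagonally dominant so that the naive, non-pivoting factorization does not run into small pivots):

n = 5;
A = rand(n) + n*eye(n);   % diagonally dominant, so no pivoting is needed
[l,u] = naivelu4(A);
norm(l*u - A)             % should be at the level of machine precision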

7.3.3 Pivoting

The implementation of the LU factorization presented above is naive: it blithely divides by the numerical value in the diagonal element, the so-called pivot. Unless the user is reasonably sure that all the numbers encountered in the pivot locations are sufficiently large during the factorization, it is preferable to use an implementation that does either partial or full pivoting. The MATLAB implementation of the LU factorization can perform pivoting. Normally only the so-called partial pivoting is performed. Partial pivoting consists of selecting which row should be used as the pivot row when working in column j, and all the rows j and below are considered. The row with the largest number in absolute value in column j is chosen. Complete pivoting would also consider the possibility of switching columns in order to get the best element in the pivot position, but that involves extensive searching throughout the matrix and is therefore expensive (and hence rarely done).

The MATLAB implementation of the LU factorization will return the information in three matrices. Consider this example

$$A = \begin{bmatrix}
0.4653 & 0.1766 & 0.8463 & 0.7917 \\
0.1805 & 0.9188 & 0.3244 & 0.6952 \\
0.7891 & 0.236 & 0.007259 & 0.4891 \\
0.09073 & 0.6998 & 0.9637 & 0.9205
\end{bmatrix} .$$

Compute the factorization using this command

[L,U,P]=lu(A)

with the result

$$L = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0.2287 & 1 & 0 & 0 \\
0.5897 & 0.04327 & 1 & 0 \\
0.115 & 0.7778 & 0.8597 & 1
\end{bmatrix} , \quad
U = \begin{bmatrix}
0.7891 & 0.236 & 0.007259 & 0.4891 \\
0 & 0.8648 & 0.3228 & 0.5833 \\
0 & 0 & 0.828 & 0.478 \\
0 & 0 & 0 & -0.0003876
\end{bmatrix}$$

and the so-called permutation matrix

$$P = \begin{bmatrix}
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix} .$$

The meaning of the output is that


LU = PA .

The matrix P permutes (switches) the rows of the matrix A. That is the actual pivoting. Note that the permutation matrix has an interesting inverse: it is its own transpose (the permutation matrix is orthogonal). Therefore we can write the above as

Pᵀ LU = A .

The matrix

$$P^T L = \begin{bmatrix}
0.5897 & 0.04327 & 1 & 0 \\
0.2287 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0.115 & 0.7778 & 0.8597 & 1
\end{bmatrix}$$

is the so-called psychologically lower triangular matrix. Such a matrix would be returned if we called lu with only two output arguments

[L,U]=lu(A)

How do we use the three output matrices? Symbolically, we can write the way in which we use the LU factorization (A = LU) as (we do not actually compute inverses, we use forward and backward substitution!)

y = L⁻¹b , x = U⁻¹y

or

x = U⁻¹(L⁻¹b) .

When pivoting is used, we have rather A = PᵀLU so that we are solving

y = (PᵀL)⁻¹b = L⁻¹(Pb) , x = U⁻¹y

or

x = U⁻¹(L⁻¹(Pb)) .

In MATLAB syntax, we write

x=U\(L\(P*b));

In other words, the LU factorization is used as before, except that the rows of the right-hand side vector are reordered (permuted) by P.

A more efficient approach to working with the LU factorization when pivoting is applied is to compute the so-called permutation vector.

[L,U,p]=lu(A,'vector')

The permutation vector is p = [3, 2, 1, 4]. We can see that it correlates with the position of the 1's in the rows of the permutation matrix. The permutation vector is used for multiple right-hand sides as

x=U\(L\b(p));

which is a shorthand for

y=L\b(p); x=U\y;
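Here is a minimal end-to-end sketch of this usage; the matrix and the right-hand side are made up for illustration.

A = [4 1 0 2; 1 5 1 0; 0 1 6 1; 2 0 1 7];   % a made-up nonsingular matrix
b = [1; 2; 3; 4];
[L,U,p] = lu(A,'vector');   % factorize once
x = U\(L\b(p));             % reuse the factors for this (and any further) b
norm(A*x - b)               % should be at the level of machine precision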


7.3.4 Computational cost

How much does it cost to perform an LU factorization? We see that the procedure is essentially that of Gaussian elimination, which processes the matrix in blocks. First the block A(2:n,1:n) is modified, then the block A(3:n,2:n), A(4:n,3:n), all the way down to A(n:n,n-1:n). If we take as a measure of required time the number of modified elements of the matrix, we have

$$C(n-1)n + C(n-2)(n-1) + C(n-3)(n-2) + \ldots + C(2)(3) + C(1)(2) = C\bigl((n-1)n + (n-2)(n-1) + (n-3)(n-2) + \ldots + (2)(3) + (1)(2)\bigr) ,$$

where C is a time constant that measures how much time it takes to manipulate a single element of the matrix. Multiplying through we see that the required time is the sum

$$C\bigl(n^2 - n + (n-1)^2 - (n-1) + (n-2)^2 - (n-2) + \ldots + 3^2 - 3 + 2^2 - 2\bigr) = C\bigl(n^2 + (n-1)^2 + (n-2)^2 + \ldots + 3^2 + 2^2\bigr) - C\bigl(n + (n-1) + (n-2) + \ldots + 3 + 2\bigr) .$$

So finally, recalling the analogy between the integrals

$$\int_0^x (s^2 - s)\,\mathrm{d}s = \frac{x^3}{3} - \frac{x^2}{2}$$

and our sums, we conclude that the required factorization time is

$$t_{LU} = C\left(\frac{n^3}{3} - \frac{n^2}{2}\right) .$$

In Chapter 6 we have seen the big-O notation used as a means of describing how a function value decreases as the argument decreases towards zero. Here we introduce the opposite viewpoint: the notation can also be used to express how quickly a function value grows. As we discussed, the big-O notation typically expresses how complicated functions behave in terms of a simple monomial (say x²). When measuring how quickly a function value decreases, the low powers dominate; contrariwise, when we measure how quickly a function value grows, the high powers dominate.

Illustration 5

Consider the simple function f(x) = x² + 30000x. Use the big-O notation to describe its behavior as x → 0 and as x → ∞.

As x → 0 the decrease of the function value is dominated by the linear term (30000x), as it drops in magnitude much more slowly than the square. On the contrary, the square term grows much faster than the linear term as x → ∞. Therefore we conclude that f(x) ∈ O(x) as x → 0 and that f(x) ∈ O(x²) as x → ∞.

The big-O notation is often used in computer science to express how quickly the cost of an algorithm grows as the number of quantities to be processed grows. For instance, nice algorithms are those that grow linearly or logarithmically – for instance computing the mean of a vector of length n is an operation of O(n), and the FFT is an operation of O(n log n). Not so nice algorithms may be very expensive for large n – for instance a naïve discrete Fourier transform (the slow version of the FFT) is O(n²). Much more expensive than the FFT!

The LU factorization is one of the more computationally intensive algorithms. Based on the expression that includes both a cubic term and a quadratic term we conclude that for sufficiently large n we should write t_LU = O(n³). Rather costly!


Illustration 6

Figure 7.9 shows the results of a numerical experiment. The MATLAB LU factorization is run for a sequence of variously sized matrices, and the factorization time is recorded.

t = [];
for n = 10:10:600
    A=rand(n);
    tic;
    for Trial = 1: 1000
        [L,U,p]=lu(A,'vector');
    end
    t(end+1) =toc % total time for 1000 factorizations of the n-by-n matrix
end

The curve of required CPU time per factorization illustrates our estimate: first the time grows more slowly than predicted, but asymptotically it appears to approach a straight line of slope 3 in the log-log plot, which corresponds to a cubic dependence on the number of equations.

[Figure: log-log plot of the factorization time [s] versus the matrix size n, with a reference line of slope 1:3.]

Fig. 7.9. Timing of the LU factorization

In a similar way, we can show that the time for forward or backward substitution is going to grow as O(n²). This is good news, since for many right-hand sides the time is only going to grow as quickly as for the factorization itself. For instance, to compute a matrix inverse we need to solve n times an n × n system of linear algebraic equations. If we use LU factorization with forward and backward substitution, it will take

$$\underbrace{O(n^3)}_{\text{factorization}} + \underbrace{n \times O(n^2)}_{\text{forward/backward substitution}} = O(n^3)$$

time. If we use just plain Gaussian elimination for each solve, it will take

$$n\, O(n^3) = O(n^4) .$$

A much more quickly growing cost!


Illustration 7

The cost estimate t_LU = C O(n³) can be put to good use guessing the time that it may take to factorize larger matrices. From Figure 7.9 we can read off that on this particular computer a 400 × 400 matrix takes about one hundredth of a second:

t_LU,400 = C O(400³) = 0.01 s .

Therefore we can express the time constant as

$$C = \frac{0.01\ \mathrm{s}}{O(400^3)} .$$

To factorize a 3000 × 3000 matrix, that is one 7.5 times larger, we estimate that it would take

$$t_{LU,3000} = C\, O(3000^3) = \frac{0.01\ \mathrm{s}}{O(400^3)}\, O(3000^3) = 4.2\ \mathrm{s} .$$

Running the calculation we find 2.35 s. This is a substantial difference with respect to the prediction. First, the measurement of 0.01 s is likely to be substantially in error, as it is difficult to measure the execution times of computations that conclude very quickly – there are just too many confounding factors in the software (think of all the operating system overhead) and hardware. Second, our estimate was based on the cubic term, but we know there is also a quadratic term and that was not taken into account. The matrix may not be large enough for the asymptotic big-O estimate to work based on the largest term only.

Furthermore, let us say we want to use the second measurement, t_LU,3000 = 2.35 s, to predict the factorization time for a 30,000 × 30,000 matrix. If we had a computer with enough memory to accommodate a matrix of this size, our prediction would be that the factorization time would go up by a factor of 1000 = 10³ with respect to the time measured for the 3000 × 3000 matrix, so about 40 minutes. We would find the prediction rather more accurate this time. (Try it with a slightly more modest increase: for instance a factor of 2 increase in the size of the matrix would increase the factorization time by a factor of 8.)
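The cubic extrapolation itself is a one-liner; here is a sketch using the measured values quoted in this Illustration.

t3000 = 2.35;                    % measured: 3000-by-3000 factorization [s]
t30000 = t3000*(30000/3000)^3;   % predicted: 30000-by-30000 factorization [s]
t30000/60                        % roughly 40 minutes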

7.3.5 Large systems of coupled equations

Structural engineers nowadays meet almost daily with results produced by models which are much larger than the ones encountered so far in this book. Structural analysis programs, or more generally finite element analysis programs, work on a daily basis with models where one million unknowns is not uncommon. In recent years there have been reports of successful analyses with billions of unknowns (simulation of seismic events). How do our algorithms handle the linear algebra in big models?

First we may note that in many analyses we work with symmetric matrices. Considerable savings are possible then. Take the LU factorization of a symmetric matrix

A = LU .

Now it is possible to factor U by dividing its rows by the diagonal elements, so that we can write U as the product of the diagonal D = diag(diag(U)) (expressed in MATLAB notation) with a matrix Ū that has ones on the diagonal

U = D Ū .

Since we must have A = Aᵀ, substituting


A = LU = L D Ū

implies that Ū = Lᵀ. Therefore for symmetric A we can make one more step from the LU factorization to the LDLᵀ factorization

A = LDLᵀ .

This saves both time (we don't have to compute U) and space (we don't have to store U).

Figure 7.10 displays a finite element model with over 2000 unknowns. A small model, it can be handled comfortably on a reasonably equipped laptop, yet it will serve us well to illustrate some of the aspects of the so-called large-scale computing algorithms of which we need to be aware.

The figure shows a tuning fork. This one sounds approximately the note of A (440 Hz, international "concert pitch"). To find this vibration frequency, we need to solve an eigenvalue problem (in our terminology, the free vibration problem).

Fig. 7.10. Tuning fork finite element mesh.

The impedance matrix A = K − ω²M which couples together the stiffness and the mass matrix is of dimension of roughly 2000 × 2000. However, not all 4 million numbers are nonzero. Figure 7.11 illustrates this by displaying the nonzeros as black dots (the zeros are not shown). The code to get an image like this for the matrix A is as simple as

spy(A)

Where do the unknowns come from? The vibration model describes the motion of each node (that would be the corners and the midsides of the edges of the tetrahedral shapes which constitute the mesh of the tuning fork). At each node we have three displacements. Through the stiffness and mass of each of the tetrahedra, the nodes which are connected by the tetrahedra are dynamically coupled (in the sense that the motion of one node creates forces on another node). All these coupling interactions are recorded in the impedance matrix A. If an unknown displacement j at node K is coupled to an unknown displacement k at node M, there will be a nonzero element Ajk in the impedance matrix. If we do not care how we number the individual unknowns, the impedance matrix may look for instance as shown in Figure 7.11: there are some interesting patterns in the matrix, but otherwise the connections seem to be pretty random.

An important aspect of working with large matrices is that as a rule only the non-zeros in matrices will be stored. The matrices will be stored as sparse. So far we have been working with dense matrices: all the numbers were stored in a two-dimensional table. A sparse matrix has a more complicated storage, since only the non-zeros are kept, and all the zeros are implied (not stored, but when we ask for an element of the matrix that is not in storage, we will get back a zero). This may mean considerable savings for matrices that hold only a very small number of non-zeros.


Fig. 7.11. The structure of the tuning fork impedance matrix. Left to right: A = LU, L, U. Original numbering of the unknowns. The black dots represent non-zeros, zeros are not shown.

The reason we might want to worry about how the unknowns are numbered lies in the way the LU factorization works. Remember, we are removing non-zeros below the diagonal by combining rows. That means that if we are eliminating element k,m, we are adding a multiple of the row k and the row m. If the row m happens to have non-zeros to the right of the column m, all those non-zeros will now appear in row k. In this way, some of the zeros in a certain envelope around the diagonal will become non-zeros during the elimination. This is clearly evident in Figure 7.11, where we can see almost entirely black (non-zero) matrices L and U. Why is this a problem? Because there are a lot more non-zeros in the LU factors than in the original matrix A. The more numbers we have to operate on, the more it costs to factorize the matrix, and the longer it takes. Also, all the non-zeros need to be stored, and to update a sparse matrix with additional non-zeros is very expensive.

The appearance of additional non-zeros in the matrix during the elimination is called fill-in. Fortunately, there are ways in which the fill-in may be minimized by carefully numbering coupled unknowns. Figure 7.12 and Figure 7.13 visualize the impedance matrix and its factors for two different renumbering schemes: the reverse Cuthill-McKee and the symmetric approximate minimum degree permutation. The matrix A holds the same number of non-zeros in all three figures (original numbering, and the two renumbered cases). However the factors in the renumbered cases hold about 10 times fewer non-zeros than the original factors. This may be significant. Recall that for a dense matrix the cost scales as O(N³). For a sparse matrix with a nice numbering which will limit the fill-in to say 100 elements per row, the cost will scale as O(100 × N²). For N = 10⁶ this will be the difference between having to wait for the factors for one minute or for a full week.
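The effect of the renumbering on the fill-in can be observed with a few lines of MATLAB. The sketch below uses a made-up sparse symmetric matrix (the tuning fork matrices are not reproduced here); symrcm and symamd are the MATLAB reordering functions mentioned above.

% Sketch: how renumbering of the unknowns affects the fill-in of the LU factors.
n = 1000;
S = sprandsym(n,0.01) + 10*speye(n);   % made-up sparse symmetric matrix
[L0,U0] = lu(S);                       % original numbering
r = symrcm(S);  [L1,U1] = lu(S(r,r));  % reverse Cuthill-McKee renumbering
m = symamd(S);  [L2,U2] = lu(S(m,m));  % approximate minimum degree renumbering
[nnz(L0)+nnz(U0), nnz(L1)+nnz(U1), nnz(L2)+nnz(U2)] % non-zeros in the factors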

Fig. 7.12. The structure of the tuning fork impedance matrix. Left to right: A = LU, L, U. Renumbering of the unknowns with symrcm. The black dots represent non-zeros, zeros are not shown.

As a last note on the subject we may take into account other techniques of solving systems of linear algebraic equations than factorization. There is a large class of iterative algorithms, a lineup starting with the Jacobi and Gauss-Seidel solvers and currently ending with the so-called multigrid solvers. These algorithms are much less sensitive to the numbering of the unknowns. In this book we do not discuss these techniques, only a couple of minimization-based solvers, including the powerful conjugate gradients, but refer for instance to Trefethen, Bau for an interesting story on current iterative solvers. They are becoming ubiquitous in commercial software, hence we had better know something about them.


Fig. 7.13. The structure of the tuning fork impedance matrix. Left to right: A = LU, L, U. Renumbering of the unknowns with symamd. The black dots represent non-zeros, zeros are not shown.

7.3.6 Uses of the LU Factorization

Some of the uses of the LU factorization have been mentioned above: computing the matrix inverse, in particular. Some other uses to which the factorization can be put are computing the matrix determinant, finding out whether the matrix has full rank, and assessing the so-called definiteness (especially positive definiteness is of interest to us).

The determinant of the matrix A can be computed from the LU factorization as

det A = det(LU) = det L × det U .

Provided L is indeed lower triangular, the determinant of each of the two triangular matrices is the product of its diagonal elements, which yields

$$\det L = \prod_{i=1}^{n} L_{ii} = 1 , \qquad \det U = \prod_{i=1}^{n} U_{ii}$$

so that we have det A = ∏ᵢ₌₁ⁿ Uᵢᵢ. If on the other hand L has been modified by pivoting permutations, its determinant can be ±1, according to how many permutations occurred. (It is probably best to use the MATLAB built-in det function. It uses the LU factorization, and correctly accounts for pivoting.)
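A small sketch of the determinant computed from the pivoted factorization (the test matrix is arbitrary):

A = [4 1 0 2; 1 5 1 0; 0 1 6 1; 2 0 1 7];
[L,U,P] = lu(A);
d = det(P)*prod(diag(U));   % det(P) is +1 or -1, depending on the row swaps
[d, det(A)]                 % the two numbers should agree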

That's how determinants are computed, not by Cramer's rule (not if we wish to live to see the result).

We might consider using the LU factorization for determining the number of independent rows (columns) of a matrix, the so-called rank. If the LU factorization succeeds, the matrix A has full rank. Otherwise, it is possible that the factorization failed just because full pivoting was not applied: it is possible that the factorization might succeed if all possibilities for pivoting are exploited. MATLAB does not use factorization for this reason (and other reasons that have to do with the stability of the computation); it rather takes advantage of the so-called singular value decomposition. If the matrix A does not have full rank (the number of linearly independent columns, or linearly independent rows, is less than the dimension of the matrix) it is singular, and cannot be LU factorized.

On the diagonal of the matrix U we have the pivots. The signs of the pivots determine the so-called positive or negative definiteness (or indefiniteness) of a matrix. More about this in the chapter on optimization.

7.4 Errors and condition numbers

When solving a system of linear algebraic equations Ax = b we should not expect to get an exact solution. In other words, if the obtained solution vector x is substituted into the equation, the left-hand side does not equal the right-hand side. One way of looking at the reasons for this error is to consider that each operation results in some arithmetic error, so in a way both the right-hand side vector b and the coefficient matrix A itself are not represented faithfully during the solution process. Therefore, in this section we will consider how the properties of A and b affect the error of the solution x.

First, we shall inspect the sensitivity of the solution of the system of coupled linear algebraic equations Ax = b to the magnitude of the error of the right-hand side, and the properties of the matrix A. Equivalently, we could also state this in terms of errors: how large can they get?

7.4.1 Perturbation of b

Imagine the right-hand side vector changes just a little bit to b + ∆b. The solution will then also change

A (x+∆x) = (b+∆b) ,

which then gives

A∆x = ∆b .

Now we would like to measure the relative change in the solution ∥∆x∥/∥x∥ due to the relative change in the right-hand side ∥∆b∥/∥b∥. In terms of norms we can write (symbolically, we never actually invert the matrix)

∥∆x∥ = ∥A⁻¹∆b∥ (7.14)

so that using the so-called CBS inequality (CBS: Cauchy, Bunyakovsky, Schwarz) we estimate

∥∆x∥ ≤ ∥A⁻¹∥∥∆b∥ . (7.15)

It does not matter very much which norm is meant here, they are all equivalent. Also we can write for the norms of the solution vector on the left-hand side and the vector on the right-hand side

∥Ax∥ = ∥b∥ → ∥A∥∥x∥ ≥ ∥b∥ (7.16)

Now we take (7.15) and divide both sides by ∥b∥

$$\frac{\|\Delta x\|}{\|b\|} \le \|A^{-1}\| \frac{\|\Delta b\|}{\|b\|} .$$

On the right-hand side we now have the relative error ∥∆b∥/∥b∥. Now we can introduce (7.16) to replace ∥b∥ on the left-hand side

$$\frac{\|\Delta x\|}{\|A\|\|x\|} \le \|A^{-1}\| \frac{\|\Delta b\|}{\|b\|} ,$$

which will give us the relative error of the solution ∥∆x∥/∥x∥. Finally we rearrange this result into

$$\frac{\|\Delta x\|}{\|x\|} \le \|A\|\|A^{-1}\| \frac{\|\Delta b\|}{\|b\|} . \qquad (7.17)$$

The quantity ∥A∥∥A⁻¹∥ is the so-called condition number of the matrix A. This inequality relates the relative error of the solution to the relative error of the right-hand side vector. The coefficient of proportionality is found to be determined by the properties of the coefficient matrix.


Illustration 8

When the condition number is large, we see that there is a possibility of the change in the right-hand side being very much magnified in the change of the solution. An example of the effect is given here.¹⁵ Consider the least-squares computation of a quadratic function passing through three points: the point locations are x= [0,1.11,1.13]', and the values of the function at those three points are y= [1,0.5,0.513]'. The least squares computation is set up as

A = [x.^2,x.^1,x.^0];
p=(A'*A)\(A'*y)

to solve for the parameters p of the quadratic fit from the so-called normal equations (see details in Section 9.13). The solution is

p =
   0.973849956151390
  -1.531423901778500
   1.000000000000001

Now change the values of the quadratic function by dy= [0,0.00746,-0.006658]';, which is a relative change norm(dy)/norm(y) of less than 1%. The solution changes by

dp=(A'*A)\(A'*dy)
dp =
  -0.630637805947415
   0.706728685322350
  -0.000000000000000

which can be appreciated as a pretty substantial change. We see that

norm(dy)/norm(y)
ans =
   0.008128568566353
norm(dp)/norm(p)
ans =
   0.457113748779369

This means that while the data changed by less than 1%, the solution for the parameters changed by almost 50%. We call matrices that produce this kind of large sensitivity ill conditioned. Figure 7.14, produced by

x =linspace(0,1.13,100)';
plot(x,[x.^2,x.^1,x.^0]*p,'r-','linewidth',2); hold on
plot(x,[x.^2,x.^1,x.^0]*(p+dp),'k--','linewidth',2)

shows the effect of the ill conditioning: it shows two quadratic curves fitted to the original data y (red solid curve), and to the perturbed data y+dy (black dashed curve). The curves are very different despite the fact that the points through which they pass have been moved only very little.

7.4.2 Condition number

The amplification of the right-hand side error can be measured as shown in equation (7.17) by assessing the magnitude of the condition number. In MATLAB this can be evaluated with the function cond. For instance, we find for the matrix A from the Illustration above

15 See: aetna/DifficultMatrices/ill_conditioned.m


[Figure: the two fitted quadratic curves plotted as y versus x over the interval 0 ≤ x ≤ 1.13.]

Fig. 7.14. Quadratic curves fitted to the original data y (red solid curve), and to the perturbed data y+dy (black dashed curve).

cond(A'*A)
ans =
   7.145344297615475e+004

The magnitude of the condition number can be understood in relative terms by considering the condition numbers of identity matrices (these are probably the best matrices to work with!), which are equal to one. More generally, orthogonal matrices also have condition numbers that are equal to one. That is as low as the condition number goes; all other matrices have larger condition numbers. The bigger the condition number, the bigger the ill conditioning problem. In particular, we can see that the condition number depends on the existence of the inverse of A. The closer the matrix A is to being not invertible, the larger the condition number is going to get. For a singular matrix the condition number is defined to be infinite. In the present case, the condition number is seen to be fairly large. Hence we get the substantial amplification of the change of the right-hand side in the solution vector.

Illustration 9

To continue the previous Illustration, we change the horizontal position of one of the points, x= [0,0.61,1.13]'.¹⁶ The perturbed quadratic curve is found to differ only slightly from the original. The condition number confirms that the matrix is considerably less ill-conditioned

cond(A'*A)
ans =
   193.7789

7.4.3 Perturbation of A

We can also consider the effect of changes in the matrix itself. For instance, when the elements of the matrix are calculated with some error. So when the matrix changes (not the right-hand side, that remains the same), we write for the changed solution

16 See: aetna/DifficultMatrices/better_conditioned.m


(A+∆A) (x+∆x) = b

canceling Ax = b gives

A∆x+∆A (x+∆x) = 0

or

A∆x = −∆A (x+∆x) .

Considering the problem in terms of norms as before

∥∆x∥ = ∥−A⁻¹∆A (x+∆x)∥

and

∥∆x∥ ≤ ∥A⁻¹∥∥∆A∥∥x+∆x∥ .

To bring in relative changes again, we divide by ∥x+∆x∥ on both sides and divide and multiply with ∥A∥ on the right-hand side

$$\frac{\|\Delta x\|}{\|x+\Delta x\|} \le \|A\|\|A^{-1}\| \frac{\|\Delta A\|}{\|A\|} .$$

We see that the relative change in the solution is expressed as before. It is bounded by the relative change in the left-hand side matrix, and the multiplier is again the condition number.

The condition number appears to be an important quantity. In order to understand the condition number we have to understand a little bit where the norms of the matrix and its inverse come from.

7.4.4 Induced matrix norm

An easy way in which we can talk of matrix norms while introducing nothing more than norms of vectors stems from the so-called induced matrix norm. We think of the matrix A (here we will discuss only square matrices, but this would also apply to rectangular matrices) as producing a map from the vector space Rⁿ to the same vector space by taking input x and producing output y

y = Ax .

We can measure "how big" a matrix is (that is its norm) by measuring how much all possible input vectors x get stretched by A. We take the largest possible stretch as the induced norm of A

$$\|A\| = \max_{\|x\| \neq 0} \frac{\|Ax\|}{\|x\|} .$$

Note that on the left we have a matrix norm, and on the right we have a vector norm. That is why we say that the matrix norm on the left is induced by the vector norm on the right. An alternative form of the above equation, and a very useful one, can be expressed as

$$\|A\| = \max_{\|x\| = 1} \|Ax\| . \qquad (7.18)$$

In other words, test the stretching on vectors of unit length.

We could take any norm of the vector x. We can pick one of all those generated by the definition

$$\|x\|_p = \left(\sum_{j=1}^{n} |x_j|^p\right)^{1/p}$$


(we may recall the similarity with the root-mean-square formula for p = 2). Taking p = 1 we get ∥x∥₁ = Σⱼ₌₁ⁿ |xⱼ| (the so-called 1-norm), taking p = 2 we obtain the usual Euclidean norm (also called the 2-norm)

$$\|x\|_2 = \left(\sum_{j=1}^{n} |x_j|^2\right)^{1/2} .$$

Also used is the so-called infinity norm, which has to be worked out by a limiting process: ∥x∥∞ = maxⱼ₌₁:ₙ |xⱼ|.
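In MATLAB the three norms are computed with the built-in function norm; a quick sketch with an arbitrary vector:

x = [3; -4; 1];
norm(x,1)      % 1-norm: 3 + 4 + 1 = 8
norm(x,2)      % 2-norm: sqrt(9 + 16 + 1) = sqrt(26)
norm(x,inf)    % infinity norm: the largest magnitude, 4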

Illustration 10

The three norms introduced above are illustrated in Figure 7.15. The squares and the circle represent vectors of unit norm, as measured by the various norm definitions. The arrows are vectors of unit norm, using the three norm definitions given above.¹⁷

[Figure: unit "circles" in the (x1, x2) plane for ∥x∥₁ = 1, ∥x∥₂ = 1, and ∥x∥∞ = 1, with sample unit vectors drawn as arrows.]

Fig. 7.15. Illustration of vector norms (1, 2, ∞).

7.4.5 Condition number in pictures

We are going to work out a useful visual association for the condition number ∥A∥∥A⁻¹∥. We have the definition (7.18), and the induced matrix norm of the matrix inverse can be obtained by the following substitution

Ay = x

into the definition of the induced matrix norm

$$\|A^{-1}\| = \max_{\|x\| \neq 0} \frac{\|A^{-1}x\|}{\|x\|} = \max_{\|Ay\| \neq 0} \frac{\|y\|}{\|Ay\|} .$$

17See: aetna/MatrixNorms/vecnordemo.m


Note that we assume A to be invertible, and then ∥Ay∥ ≠ 0 for ∥y∥ ≠ 0. Also, we can change the maximum into a one-over-minimum fraction, so that we can write for the norm of A⁻¹

$$\|A^{-1}\| = \left(\min_{\|y\| \neq 0} \frac{\|Ay\|}{\|y\|}\right)^{-1} = \left(\min_{\|y\| = 1} \|Ay\|\right)^{-1} .$$

With these formulas for the norms, we can write for the condition number

$$\|A\|\|A^{-1}\| = \frac{\displaystyle\max_{\|x\| = 1} \|Ax\|}{\displaystyle\min_{\|y\| = 1} \|Ay\|} . \qquad (7.19)$$

Now this is relatively easy to visualize. Figures 7.16 and 7.17 present a gallery of matrices. The images visualize the results of the multiplication of unit-length vectors pointing in various directions from the origin. The induced 2-norm is used, and consequently the heads of the unit-length vectors form a circle of unit radius. We can see how the formula for the condition number (7.19) correlates with the largest and smallest length of the vector that results from the multiplication of the matrix and the unit vector. For instance, for the matrix A we may estimate the lengths of the longest and shortest Ax vectors as ≈ 3 and ≈ 2, and therefore we guess the condition number to be ≈ 3/2. This may be compared with the computed condition number ∥A∥∥A⁻¹∥ ≈ 1.414. Alternatively, we could take the length of the longest vector Ax as ≈ 3 and the length of the longest vector A⁻¹x as ≈ 1/2, and therefore we guess the condition number to be ≈ 3 × 1/2.

Illustration 11

Use the function matnordemo¹⁸ to create for each of the three norms a diagram similar to those of Figure 7.16 for the matrix [2 -0.2; -1.5 3], and then try to read off the norm of this matrix from the figure. Compare with the matrix norm computed as

norm([2 -0.2; -1.5 3],1)
norm([2 -0.2; -1.5 3],2)
norm([2 -0.2; -1.5 3],inf)

7.4.6 Condition number for symmetric matrices

Note that for the symmetric matrices B, D, F in Figures 7.16 and 7.17 the largest and the smallest stretch occurs in the direction of some vector x. In other words, we have

Bx = λx

and we see that the extreme stretches have to do with the eigenvalues of the symmetric matrix. This may be contrasted with for instance the unsymmetric matrix A, where the stretch Ax never occurs in the direction of x. Other examples similar to A are the matrices C, E in Figure 7.16 and Figure 7.17.

Symmetric matrices have real eigenvalues and can always be made similar to a diagonal matrix, which means that symmetric matrices always have a full set of eigenvectors. Now we have seen that for symmetric matrices the 2-norms are directly related to their eigenvalues.

18See: aetna/MatrixNorms/matnordemo.m


[Figure: six panels showing unit vectors x and their images under each matrix and its inverse: A=[2 1.5; −1.5 3] (A*x and inv(A)*x), B=[3 −1.2; −1.2 2] (B*x and inv(B)*x), C=[0 −1; 1 0] (C*x and inv(C)*x).]

Fig. 7.16. Matrix and matrix inverse norm illustration. Matrix condition numbers: ∥A∥∥A⁻¹∥ = 1.414; ∥B∥∥B⁻¹∥ = 3.167; ∥C∥∥C⁻¹∥ = 1.0.

We will all fondly remember the stress and strain representations as symmetric matrices: the principal stresses and strains, and the directions of the principal stresses and strains, are the eigenvalues and eigenvectors of these matrices.

In fact, for all matrices, symmetric and unsymmetric, the matrix norm has something to do with eigenvalues and eigenvectors. Consider the definition of the induced matrix norm

$$\|A\| = \max_{\|x\| \neq 0} \frac{\|Ax\|}{\|x\|}$$

and square both sides

$$\|A\|^2 = \max_{\|x\| \neq 0} \frac{\|Ax\|^2}{\|x\|^2} .$$


[Figure: six panels showing unit vectors x and their images under each matrix and its inverse: D=[2 0; 0 0.2] (D*x and inv(D)*x), E=[1,1; 0,1] (E*x and inv(E)*x), F=[1,1; 1,1.2] (F*x and inv(F)*x).]

Fig. 7.17. Matrix and matrix inverse norm illustration. Matrix condition numbers: ∥D∥∥D⁻¹∥ = 10.0; ∥E∥∥E⁻¹∥ = 2.618; ∥F∥∥F⁻¹∥ = 22.15.

For the moment we shall consider that the vector norms are Euclidean norms (2-norms). From the definition of the vector norms, we have

∥Ax∥² = (Ax)ᵀ(Ax)

so that we can write

$$\|A\|^2 = \max_{\|x\| \neq 0} \frac{x^T A^T A x}{x^T x} .$$

The expression on the right is the so-called Rayleigh quotient of the matrix AᵀA (not of A itself!). It is the result of the pre-multiplication of the eigenvalue problem


AᵀAx = λx (7.20)

with xᵀ, which can be rearranged as

$$\lambda = \frac{x^T A^T A x}{x^T x} .$$

Note that

xᵀAᵀAx ≥ 0 , xᵀx > 0

where xᵀx = 0 is not allowed by the definition of the norm. Clearly, the Rayleigh quotient attains its maximum for the largest eigenvalue in absolute value max |λ|, and its minimum for the smallest eigenvalue in absolute value min |λ|. From this we can deduce

$$\|A\| = \sqrt{\max|\lambda|} .$$

Similarly, we obtain

$$\|A^{-1}\| = 1/\sqrt{\min|\lambda|} .$$

Hence, the condition number of A is found to be

$$\|A\|\|A^{-1}\| = \frac{\sqrt{\max|\lambda|}}{\sqrt{\min|\lambda|}} .$$

If the matrix A is symmetric, we write an eigenvalue problem for it as

Av = λ′v . (7.21)

Now pre-multiplication of both sides of this equation with Aᵀ = A gives

AᵀAv = λ′Aᵀv = (λ′)²v . (7.22)

In comparison with (7.20) we see that λ = (λ′)². Therefore, the norm of a symmetric matrix will be

∥A∥ = max |λ′| ,

where λ′ solves the eigenvalue problem (7.21). Analogously, the norm of the inverse of a symmetric matrix will be

$$\|A^{-1}\| = \frac{1}{\min|\lambda'|} ,$$

and the condition number of the symmetric matrix is therefore

$$\|A\|\|A^{-1}\| = \frac{\max|\lambda'|}{\min|\lambda'|} . \qquad (7.23)$$
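Formula (7.23) is easy to verify numerically; a sketch with the symmetric matrix B of Figure 7.16:

B = [3 -1.2; -1.2 2];          % a symmetric matrix (from Figure 7.16)
lam = eig(B);                  % its (real) eigenvalues
max(abs(lam))/min(abs(lam))    % ratio of the extreme eigenvalue magnitudes
cond(B)                        % should agree: the 2-norm condition number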

Illustration 12

Apply formula (7.23) to a singular matrix.

Any singular matrix has at least one zero eigenvalue. No matter how large an eigenvalue of a singular matrix can get, we know that its smallest eigenvalue (in absolute value) is equal to zero. Consequently, the condition number of the singular matrix → ∞.


7.5 QR factorization

Consider a system of linear algebraic equations

Ax = b

with a square matrix A. It is possible to factorize the matrix into the product of an orthogonal matrix Q and an upper triangular matrix R

A = QR .

How does this work? If we write this relationship between the matrices in terms of their columns, things become clearer:

cₖ(A) = Q cₖ(R) .

Now remember, R is an upper triangular matrix. For instance like this

$$R = \begin{bmatrix}
\spadesuit & \cdot & \cdot & \diamond & \cdot & \cdot & \cdot \\
 & \cdot & \cdot & \diamond & \cdot & \cdot & \cdot \\
 &  & \cdot & \diamond & \cdot & \cdot & \cdot \\
 &  &  & \diamond & \cdot & \cdot & \cdot \\
 &  &  &  & \cdot & \cdot & \cdot \\
 &  &  &  &  & \cdot & \cdot \\
 &  &  &  &  &  & \cdot
\end{bmatrix} .$$

Then the first column of A is c₁(A) = c₁(Q) R₁₁ (R₁₁ = ♠, all other coefficients in the first column of R are zero). The fourth column of A is a linear combination of the first four columns of Q (the coefficients are the ⋄'s)

c₄(A) = c₁(Q) R₁₄ + c₂(Q) R₂₄ + c₃(Q) R₃₄ + c₄(Q) R₄₄

and so on. The principle is now clear: each of the columns of A is constructed of columns of Q which are orthogonal, and the columns of Q can be obtained by straightening out the columns of A as long as the columns of A are linearly independent (refer to Figure 7.18): q₁ is a unit vector in the direction of a₁, and q₂ is obtained from the part of a₂ that is orthogonal to q₁.

Fig. 7.18. Two arbitrary linearly independent vectors a₁ and a₂, and two orthonormal vectors q₁ and q₂ that span the same plane

The great advantage that can be derived from this factorization stems from the fact that the inverse of an orthogonal matrix is simply its transpose

Q⁻¹ = Qᵀ .

If we substitute this factorization into Ax = b we obtain

Ax = QRx = b

and this allows us to rewrite the system as


Rx = Qᵀb .

Now since the matrix R is upper triangular, solving for the unknown x is very efficient: starting at the bottom we proceed by back substitution. The solution is not for free, of course. We had to construct the factorization in the first place.
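A minimal sketch of the QR solve, using MATLAB's built-in qr and the bwsubs function of Section 7.3.1; the matrix and right-hand side are made up for illustration.

A = [2 1 0; 1 3 1; 0 1 4];
b = [1; 2; 3];
[Q,R] = qr(A);
x = bwsubs(R, Q'*b);   % R x = Q'*b, solved by back substitution
norm(A*x - b)          % should be at the level of machine precision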

An additional benefit of this particular factorization is in the ability to factorize rectangular matrices, not just square ones. Furthermore, due to the orthogonality of Q, operations with it are as nice numerically as possible (remember the perfect condition number of one?). Therefore the QR factorization is used when numerical stability is at a premium. Examples may be found in the least-squares fitting subject. Also, the QR factorization leads to a valuable algorithm for the computation of eigenvalues and eigenvectors for general matrices.

7.5.1 Householder reflections

The question now is how to compute the QR factorization. A particularly popular and effective algorithm is based on the so-called Householder reflections.

The Householder transformation (reflection) is designed to modify a column matrix so that the result of the transformation has only one nonzero element, the first one, but the length of the result (that is its norm) is preserved. Matrix transformations that preserve lengths are either rotations or reflections (the Householder transformation is the latter):

Ha = ã , where ∥ã∥ = ∥a∥ .

The transformation produces the vector ã

$$\tilde{a} = \begin{bmatrix} \pm\|a\| \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

by reflection in a plane that is defined by the normal generated as the difference n = ã − a and passes through the origin O (see Figure 7.19). This follows from the two vectors a and ã being of the same length.

[Figure: two sketches of the vectors a, ã, and the normal n at the origin O, one for each of the two possible reflection planes.]

Fig. 7.19. Householder transformation: the geometrical relationships. The reflection plane is shown by the dashed line. Consider that in the two-dimensional figure there are two possible reflection planes.

The relationship between the three vectors may be written as ã = a + n, which may be tweaked using a little trick (note carefully the position of the parentheses)

$$\tilde{a} = a + n\left(\frac{n^T a}{n^T a}\right) = a + \frac{\left(nn^T\right)a}{n^T a} = \left(\mathbf{1} + \frac{nn^T}{n^T a}\right)a .$$

Note that both matrices (the identity and the rest) in the parentheses are square. Together they constitute an orthogonal matrix


$$H = \mathbf{1} + \frac{nn^T}{n^T a} , \qquad H^T H = \mathbf{1} . \qquad (7.24)$$

Interestingly, this matrix is also symmetric. This is really how it should be: H produces a mirror image of a, ã = Ha. The mirror image of ã, the inverse operation a = H⁻¹ã, must give us back a, but the inverse operation is again a reflection, the same reflection that gave us ã from a.

To compute the Householder matrix we could use the function Householder_matrix.¹⁹ The sign of the non-zero element of ã is computed with particular attention to numerical stability: when we compute n = ã − a, the vector ã has only one nonzero element. To avoid numerical error when subtracting two similar numbers ã₁ − a₁ we choose sign ã₁ = −sign a₁.

function H = Householder_matrix(a)
    if (a(1)>0) at1 =-norm(a);% choose the sign wisely
    else at1 =+norm(a); end
    n=-a; n(1)=n(1)+at1;% this is the subtraction n = a~ - a
    H = eye(length(a))+(n*n')/(n'*a);% this is the formula (7.24)
end
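A quick check of what the reflection does to a column vector (a sketch with an arbitrary vector):

a = [3; 4; 12];
H = Householder_matrix(a);
H*a                   % only the first element is nonzero, here equal to -norm(a)
norm(H*a)             % the length is preserved: equal to norm(a) = 13
norm(H'*H - eye(3))   % H is orthogonal (and symmetric)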

How do we use the Householder transformation? We consider the columns of the matrix to be transformed as the vectors that we can reflect as shown above. The first step zeroes out the elements of A below the diagonal of the first column.

$$H_1 A = \underset{6\times 6}{H_1}
\begin{bmatrix}
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet
\end{bmatrix}
= \begin{bmatrix}
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet
\end{bmatrix} .$$

We write H₁ for the 6 × 6 matrix obtained from the first column of A. We write H₂ for the 5 × 5 matrix obtained from the second column of A, from the diagonal to the bottom of the column. Analogously for the other Householder matrices.

$$\begin{bmatrix} 1 & \\ & \underset{5\times 5}{H_2} \end{bmatrix}
\begin{bmatrix}
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet
\end{bmatrix}
= \begin{bmatrix}
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 &  & \bullet & \bullet & \bullet & \bullet \\
 &  & \bullet & \bullet & \bullet & \bullet \\
 &  & \bullet & \bullet & \bullet & \bullet \\
 &  & \bullet & \bullet & \bullet & \bullet
\end{bmatrix} .$$

The last step that leads to an upper triangular matrix is

$$\begin{bmatrix} 1 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & 1 & \\ & & & & \underset{2\times 2}{H_5} \end{bmatrix}
\begin{bmatrix}
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 &  & \bullet & \bullet & \bullet & \bullet \\
 &  &  & \bullet & \bullet & \bullet \\
 &  &  &  & \bullet & \bullet \\
 &  &  &  & \bullet & \bullet
\end{bmatrix}
= \begin{bmatrix}
\bullet & \bullet & \bullet & \bullet & \bullet & \bullet \\
 & \bullet & \bullet & \bullet & \bullet & \bullet \\
 &  & \bullet & \bullet & \bullet & \bullet \\
 &  &  & \bullet & \bullet & \bullet \\
 &  &  &  & \bullet & \bullet \\
 &  &  &  &  & \bullet
\end{bmatrix} = R .$$

To obtain A from R we would successively invert the above relationships one by one. That is not difficult since we realize that those matrices are orthogonal and symmetric, so the inverse is equal to the original matrix. We just have to switch the order of the matrices. We get

19 See: aetna/QRFactorization/Householder_matrix.m


$$A = H_1 \begin{bmatrix} 1 & \\ & H_2 \end{bmatrix} \cdots \begin{bmatrix} 1 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & 1 & \\ & & & & H_5 \end{bmatrix} R ,$$

which means that the orthogonal matrix Q is obtained as

$$Q = H_1 \begin{bmatrix} 1 & \\ & H_2 \end{bmatrix} \cdots \begin{bmatrix} 1 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & 1 & \\ & & & & H_5 \end{bmatrix} .$$

Illustration 13

Here we present a factorization which is based directly on the schemas above. The function Householder_matrix²⁰ computes the Householder matrix of equation (7.24). Note that the matrices Hj are blocks embedded in an identity matrix. The following code fragment should be stepped through, and I will bet that it will nicely reinforce our ideas of how Householder reflections work.

format short
A=rand(5); R=A % this is where R starts
Q=eye(size(A));% this is where Q starts
for k=1:size(A,1)-1
    H=eye(size(A));% Start with an identity...
    % ...and then put in the Householder matrix as a block, computed
    % from the k-th column of R, from the diagonal down
    H(k:end,k:end) = Householder_matrix(R(k:end,k:k));
    R= H*R % this matrix is becoming R
    Q= Q*H % this matrix is becoming Q
end
Q*Q'% check that this is an orthogonal matrix: should get identity
A-Q*R % check that the factorization is correct
R-Q'*A % another way to check

The algorithm to produce the QR factorization²¹ is designed to be a little bit more efficient than the code above, but it is still surprisingly short and readable

function [Q,R] = HouseQR(A)
    m=size(A,1);
    Q=eye(m); R =A;
    for k=1:size(A,1)-1
        n = Householder_normal(R(k:end,k:k));
        R(k:end,k:end) =R(k:end,k:end)-2*n*(n'*R(k:end,k:end));
        Q(:,k:end)=Q(:,k:end)-2*(Q(:,k:end)*n)*n';
    end
end

20 See: aetna/QRFactorization/Householder_matrix.m
21 See: aetna/QRFactorization/HouseQR.m


Instead of the Householder matrix (7.24) we use in HouseQR the equivalent expression

H = 1 − 2NNᵀ ,

where N has the same direction as n but is of unit length

$$N = \frac{n}{\|n\|} .$$

Substituting this expression and

∥n∥ = 2|Nᵀa| and Nᵀa < 0

into the relationship ã = a + n we obtain the above alternative expression of the Householder matrix

$$\tilde{a} = \left(\mathbf{1} + \frac{nn^T}{n^T a}\right)a = \left(\mathbf{1} + \frac{\|n\|^2 NN^T}{\|n\|\, N^T a}\right)a = \left(\mathbf{1} - 2NN^T\right)a .$$

The Householder normal is also computed with attention to numerical stability, by choosing the sign of the nonzero element of ã to eliminate cancellation. Note well that the computed normal is of unit length.²²

function n = Householder_normal (a)
    if (a(1)>0) at1 =-norm(a);% choose the sign wisely
    else at1 =+norm(a); end
    n=-a; n(1)=n(1)+at1;% this is the subtraction n = a~ - a
    n=n/sqrt(n'*n);% normalize to unit length
end

Illustration 14

The HouseQR function acts as a black box: A goes in, Q, R come out in their finished form. It is however possible to set a breakpoint inside the function to watch the matrices form layer-by-layer by the Householder reflections. Try it.

7.6 Annotated bibliography

1. C. Meyer, Matrix Analysis and Applied Linear Algebra: Book and Solutions Manual, SIAM: Society for Industrial and Applied Mathematics, 2001. Good treatment of the Gaussian elimination. Some very instructive examples of ill-conditioned matrices. More than you ever wanted to know about matrix norms. Best of all, freely available at http://matrixanalysis.com/.

2. L. N. Trefethen, D. Bau III, Numerical Linear Algebra, SIAM: Society for Industrial and Applied Mathematics, 1997. The treatment of the QR factorization is excellent.

3. G. W. Stewart, Matrix Algorithms. Volume I: Basic Decompositions, SIAM, 1998. Complete and readable presentation of the QR factorization.

4. G. Strang, Linear Algebra and Its Applications, Brooks Cole; 4th edition, 2005. (Alternatively, the 3rd edition, 1988.) Overall one of the best references for introductory linear algebra. Clearly written, and full of examples.

22 See: aetna/QRFactorization/Householder_normal.m


8

Solution methods for eigenvalue problems

Summary

1. We discover a few basic algorithms for the solution of the eigenvalue problem, both the standard and the generalized form.

2. Repeated multiplication with matrices tends to amplify directions associated with eigenvectors of dominant eigenvalues. Main idea: write the modal expansion, and consider the powers of eigenvalues.

3. Various forms of the power iteration, including the QR iteration, form the foundations of some of the workhorse routines used in vibration analysis and in general purpose software (with appropriate, and sometimes considerable, refinements).

4. The Rayleigh quotient is an invaluable tool both for algorithm design and for quick ad hoc checks.

5. This area of numerical analysis has seen considerable progress in recent years and some powerful new algorithms have emerged. Solving large-scale eigenvalue problems nevertheless remains nontrivial, even with sophisticated software packages.

8.1 Repeated multiplication by matrix

Consider the effect of repeated multiplication by the matrix A on its eigenvector

Avⱼ = λⱼvⱼ .

Multiplying both the left-hand side and the right-hand side of the above equation again with A yields

A(Avⱼ) = A²vⱼ = λⱼAvⱼ = λⱼ²vⱼ .

In general, we will have after k − 1 multiplications

Aᵏvⱼ = λⱼᵏvⱼ .

Now imagine that an arbitrary vector x is going to be multiplied repeatedly by A. Our goal is to analyze the result of Aᵏx. We will use an expansion of the arbitrary vector x in terms of the eigenvectors of the matrix A (the so-called modal expansion)

$$x = \sum_{j=1:n} c_j v_j .$$

The product Aᵏx may be written using the expansion as

$$A^k x = \sum_{j=1:n} c_j A^k v_j = \sum_{j=1:n} c_j \lambda_j^k v_j .$$


The eigenvalues will be ordered by absolute value so that

|λ₁| ≥ |λ₂| ≥ |λ₃| ≥ ... ≥ |λₙ₋₁| ≥ |λₙ| .

For the moment we shall assume that the first eigenvalue is dominant: its absolute value is strictly larger than the absolute value of any other eigenvalue, |λ₁| > |λⱼ|, j = 2, 3, .... With these assumptions we can write

$$A^k x = \sum_{j=1:n} c_j A^k v_j = |\lambda_1|^k \sum_{j=1:n} c_j \frac{\lambda_j^k}{|\lambda_1|^k} v_j .$$

Due to our assumption that the first eigenvalue dominates, the coefficients λⱼᵏ/|λ₁|ᵏ will approach zero in absolute value as k → ∞, except for λ₁ᵏ/|λ₁|ᵏ which will maintain absolute value equal to one. Therefore, as k → ∞ the only term left from the modal expansion of x will be

$$\lim_{k\to\infty} A^k x = c_1 \lambda_1^k v_1 .$$

Figure 8.1 illustrates the effect of repeated multiplication of an arbitrary vector x by the 2 × 2 matrix A

Ax , AAx = A²x , ...

The eigenvalues are λ₁ = 1.6 (with eigenvector v₁), λ₂ = 0.37 (with eigenvector v₂), so the first eigenvalue is dominant, and evidently the result of the multiplication leans more and more towards the first eigenvector. The "leaning" is very rapid. The reason is that the fraction λ₂ᵏ/|λ₁|ᵏ = (0.23125)ᵏ will decrease very rapidly with higher powers (for instance, (0.23125)⁴ = 0.00285). Therefore, the contribution of the eigenvector v₂ to the vector Aᵏx will become vanishingly small rather quickly.

[Figure: the vectors v1, v2 and the iterates A⁰x, A¹x, A²x, A³x, A⁴x for A=[1.68 0.548; −0.202 0.286].]

Fig. 8.1. The effect of several matrix-vector multiplications. Eigenvalues λ1 = 1.6, λ2 = 0.37

The repeated multiplication to amplify the components of the dominant eigenvector is the principle behind the so-called power iteration method for the calculation of the dominant eigenvalue/eigenvector.

8.2 Power iteration

The power method (power iteration) relies on the above observation that provided there is one dominant eigenvalue, the repeated multiplication of an arbitrary starting vector x by the coefficient matrix will diminish the contributions of all other eigenvectors except the first one, so that eventually the product Aᵏx will be mostly in the direction of the first eigenvector v₁.

The method is not failproof. Firstly, it appears that if the starting vector x does not contain any contribution of the first eigenvector, c₁ = 0, the power method is not going to converge. Fortunately, any amount of the inevitable arithmetic error will likely introduce some contribution of the first eigenvector to which the power method will ultimately converge. Unfortunately, it may take a long time.

Secondly, the method is definitely going to have trouble with converging for |λ₂| ≈ |λ₁| (in words, when the second eigenvalue is close to the first eigenvalue in magnitude). The ratio λ₂ᵏ/|λ₁|ᵏ will decrease slowly, resulting in slow convergence. Such a situation is illustrated in Figure 8.2: the eigenvalues are λ₁ = −0.8, λ₂ = 0.75. The iterated vector Aᵏx appears to converge to the direction of v₁, but slowly.

A few observations can be made from Figure 8.2. The iterated vector Aᵏx decreases in magnitude (|λ₁| < 1), and if we iterate sufficiently long the vector will get so short that we may risk underflow, or at least numerical issues due to arithmetic error. (Note that for |λ₁| > 1 the approximations to the eigenvector will grow, which may eventually result in overflow.) Further, since λ₁ < 0 the iterated vector aligns itself alternately with v₁ and −v₁. This is fine, since both are perfectly good eigenvectors, but it complicates somewhat the issue of how to measure convergence. We want to measure convergence of directions, not of the individual components of the vector!

[Figure: the vectors v1, v2 and the iterates A⁰x through A⁴x for A=[−1.4 −0.894; 1.43 1.35].]

Fig. 8.2. The effect of several matrix-vector multiplications. Eigenvalues λ1 = −0.8, λ2 = 0.75

To address the concerns about underflow and overflow we may introduce normalization (rescaling) of the iterated vector as

x⁽⁰⁾ given
for k = 1, 2, ...
    x⁽ᵏ⁾ = A x⁽ᵏ⁻¹⁾
    x⁽ᵏ⁾ = x⁽ᵏ⁾ / ∥x⁽ᵏ⁾∥

How to measure the convergence of the algorithm may be made easier by considering the associated problem of finding the eigenvalue λ₁. An excellent tool is offered by the Rayleigh quotient. Pre-multiply the eigenvalue problem on both sides with vⱼᵀ

$$Av_j = \lambda_j v_j \quad\Longrightarrow\quad v_j^T A v_j = \lambda_j v_j^T v_j ,$$

which gives (the Rayleigh quotient)


$$\lambda_j = \frac{v_j^T A v_j}{v_j^T v_j} .$$

Now consider the vector x⁽ᵏ⁾ as an approximation of the eigenvector v₁. A good approximation of the eigenvalue will be

$$\lambda_1 \approx \frac{x^{(k)T} A x^{(k)}}{x^{(k)T} x^{(k)}} .$$

It will be much easier to measure relative approximate errors in the eigenvalue than to measure the convergence of the direction of the eigenvector. An actual implementation of the power iteration algorithm then follows easily:¹

function [lambda,v,converged]=pwr2(A,v,tol,maxiter)
    ... some error checking omitted
    plambda=Inf;% eigenvalue in previous iteration
    converged = false;
    for iter=1:maxiter
        u=A*v; % update eigenvector approximation
        lambda=(u'*v)/(v'*v);% Rayleigh quotient
        v=u/norm(u);% normalize
        if (abs(lambda-plambda)/abs(lambda)<tol)
            converged = true; break;% converged!
        end
        plambda=lambda;% eigenvalue in previous iteration
    end
end

Note that we have to return a Boolean flag to indicate whether the iteration process inside the function converged or not. This is a common design feature of software implementing iterative processes, since the iterations may or may not succeed.
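For instance, a minimal usage sketch might look as follows (the matrix, the starting vector, and the tolerances are made up for the purposes of illustration):

A = [2 1 0; 1 3 1; 0 1 4];          % small symmetric test matrix (made up)
v = rand(size(A,1),1);              % random starting vector
[lambda,v,converged] = pwr2(A,v,1e-6,200);
if ~converged
    warning('Power iteration did not converge');
end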

We conclude this section by pointing out that power iteration relies on the existence of a dominant eigenvalue. This assumption does not hold in many important problems, for example for the first order form of the equations of motion of a vibrating system. For such systems eigenvalues come in complex conjugate pairs. There is no single dominant eigenvalue, and consequently power iteration will not converge. This is illustrated in Figure 8.3, where we show the progress of the power iteration for two different starting vectors for a matrix with eigenvalues λ1,2 = ±0.7. There is no progress towards any of the eigenvectors, since the iterated vectors just switch between two different directions, neither of which is an eigenvector direction.

In what follows we shall work with real symmetric matrices, unless we explicitly say otherwise. The main reasons: these matrices are very important in practice, we don't have to treat special cases such as missing eigenvectors, and the eigenvalues and eigenvectors are real.

Illustration 1

Figure 8.4 shows the model of two linked buildings. Each floor of each building is represented by a concentrated mass m standing in for the total mass of the floor, and by springs kc linking the floors, representative of the total horizontal stiffness of the columns between the floors (or between the bottom floor and the ground). The buildings are linked at each floor with another spring kℓ, representative of the walkways (bridges) that connect the buildings. The masses in the system are numbered as shown.

1 See: aetna/EigenvalueProblems/pwr2.m


Fig. 8.3. The effect of several matrix-vector multiplications (two panels, for two different starting vectors; A = [−1.24, −0.808; 1.29, 1.24]). Eigenvalues λ1,2 = ±0.7.

Fig. 8.4. Vibration model of linked buildings (masses numbered 1–10).

The mass matrix is simply m × (10 × 10 identity matrix). The stiffness matrix K has the structure shown below. Note that if the buildings are not linked by the walkways (kℓ = 0), the stiffness matrix will split into two uncoupled 5 × 5 diagonal blocks that correspond to each building separately. Nonzero walkway stiffness will couple the vibrations of the two buildings together.

K =
[ kc+kℓ   −kc      0        0        0        −kℓ      0        0        0        0
  −kc     2kc+kℓ   −kc      0        0        0        −kℓ      0        0        0
  0       −kc      2kc+kℓ   −kc      0        0        0        −kℓ      0        0
  0       0        −kc      2kc+kℓ   −kc      0        0        0        −kℓ      0
  0       0        0        −kc      2kc+kℓ   0        0        0        0        −kℓ
  −kℓ     0        0        0        0        kc+kℓ    −kc      0        0        0
  0       −kℓ      0        0        0        −kc      2kc+kℓ   −kc      0        0
  0       0        −kℓ      0        0        0        −kc      2kc+kℓ   −kc      0
  0       0        0        −kℓ      0        0        0        −kc      2kc+kℓ   −kc
  0       0        0        0        −kℓ      0        0        0        −kc      2kc+kℓ ]
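As an illustration only (this is not necessarily how the toolbox function lb_prop used below builds its matrices), the matrices M, K, and A = K/m could be assembled in MATLAB along these lines:

m = 133; kc = 61000; kl = 3136;      % values assumed from the text below
Kb = kc*(2*eye(5) - diag(ones(4,1),1) - diag(ones(4,1),-1));
Kb(1,1) = kc;                        % first mass of each building: single column spring
K = [Kb + kl*eye(5), -kl*eye(5);
     -kl*eye(5),     Kb + kl*eye(5)];
M = m*eye(10);                       % mass matrix
A = K/m;                             % matrix of the standard eigenvalue problem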

The vibration problem can be described by the equation (5.3)

ω2Mz = Kz .

Since the mass matrix is just a multiple of the identity, this may be written as

Az = λz ,

Page 172: Aetna Book 2015 Hyper

166 8 Solutions methods for eigenvalue problems

where we define

A = (1/m) K , and λ = ω^2 .

As a first exercise we apply the power method to the computation of the largest frequency of vibration. We assume m = 133, kc = 61000, kℓ = 3136 (in consistent units). The solution with MATLAB's eig is written for the standard eigenvalue problem as

[M,K,A] = lb_prop;
[V,D]=eig(A) % This may be replaced with [V,D]=eig(K,M)
disp('Frequencies [Hz]')
sqrt(diag(D)')/(2*pi)

which yields the resulting frequencies as

Frequencies [Hz]

ans =

0.9702 1.4614 2.8319 3.0354 4.4641 4.5960 5.7348 5.8380 6.5408 6.6315

Applying the power method as shown in the script lb_A_power² with a random starting vector yields an approximation of the highest eigenvalue, but it is not anywhere close to being converged. This should not surprise us. We would expect the convergence to be slow: the two largest eigenvalues are very closely spaced (the largest eigenvalue is only weakly dominant): see Figure 8.5. Together with the inherent symmetry in the structure, this makes for an interesting experiment: see below.

Suggested experiments

1. Use a starting vector in the form of ones(10,1). Do we get convergence to the largest eigenvalue? If not, try to explain. [Difficult]

Fig. 8.5. The highest modes of the linked buildings: f9 = 6.5408 [Hz] and f10 = 6.6315 [Hz].

2 See: aetna/EigenvalueProblems/LinkedBuildings/lb_A_power.m


8.3 Inverse power iteration

The power iteration can be used to compute the eigenvalue/eigenvector pair for the eigenvalue with the largest absolute value. The inverse power iteration can look at the other end of the spectrum, at the smallest eigenvalues.

The eigenvalues of a matrix A and A^−1 are related as follows: Provided the matrix is invertible (and therefore does not have λ = 0 among its eigenvalues), we can multiply the eigenvalue problem for A

Ax = λx

with A−1 and divide by λ to give

(1/λ) A^−1 A x = (1/λ) A^−1 λ x   ⇒   (1/λ) x = A^−1 x .

In words, the matrices A and A^−1 have the same eigenvectors, and the eigenvalues of A^−1 are the inverses of the eigenvalues of A. Clearly, the largest eigenvalue of A^−1 will be one over the smallest eigenvalue of A

max |eigenvalue of A^−1| = 1 / min |eigenvalue of A| .

Therefore, to find the eigenvalue/eigenvector pair of A for the smallest eigenvalue in absolute value we can perform the power iteration on A^−1. We would not wish to invert the matrix, of course, and so we formulate the algorithm as

x^(0) given
for k = 1, 2, ...
    A x^(k) = x^(k−1)
    x^(k) = x^(k) / ∥x^(k)∥

which simply means solve for x^(k) from A x^(k) = x^(k−1). (Compare with the power iteration algorithm on page 163; there is only one change, but an important one.) Since a solution of a linear system is needed during each iteration, we may conveniently and efficiently take advantage of the LU factorization. The inverse power iteration algorithm is summarized in the code below. Note the changes with respect to the power iteration in the first two lines in the for loop.³

function [lambda,v,converged]=invpwr2(A,v,tol,maxiter)
... some error checking omitted
plambda=Inf;% initialize eigenvalue in previous iteration
[L,U,p]=lu(A,'vector');%Factorization
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\v(p)); % update eigenvector approx, equiv. to u=A\v
    lambda=(v'*v)/(u'*v);% Rayleigh quotient: note the inverse
    v=u/norm(u);% normalize
    if (abs(lambda-plambda)/abs(lambda)<tol)
        converged = true; break;% converged!
    end
    plambda=lambda;
end
end

3 See: aetna/EigenvalueProblems/invpwr2.m


Note the shortcut to the value of the Rayleigh quotient: the vector product (u'*v) incorporates the multiplication with A^−1. Then, because we are iterating to find 1/λ, we invert the fraction.

The inverse power iteration also relies on the existence of a dominant eigenvalue. Dominant here means that the smallest eigenvalue should be strictly smaller in absolute value than any other eigenvalue of A. We assume again that they are ordered in decreasing magnitude, and for the success of the inverse iteration we require

|λ1| ≥ |λ2| ≥ |λ3| ≥ ... ≥ |λn−1| > |λn| .

Analogously to the power iteration, the convergence of the inverse power iteration will be faster for very dominant eigenvalues, |λn−1| ≫ |λn|, and painfully slow for |λn−1| ≈ |λn|.

Illustration 2

Here we illustrate the convergence of the inverse power iteration on the example of two symmetric matrices.⁴ We construct two random matrices with spectra that are identical except for the smallest eigenvalue. The smallest eigenvalue is dominant in one matrix, and rather close to the second eigenvalue in magnitude in the second matrix. Consequently Figure 8.6 displays quite disparate convergence behaviors of the inverse power iteration: very good in the first case, poor in the second.

Fig. 8.6. The relative error of the smallest eigenvalue (plotted against the iteration number) for two symmetric 13 × 13 matrices with eigenvalues [13, 14:25] and [6.1, 14:25].

Illustration 3

Apply the inverse power iteration method to the structure described in Illustration on page 164. The inverse power method as shown in the script lb_A_invpower⁵ with a random starting vector yields an approximation of the lowest eigenvalue with satisfactory convergence. The first two mode shapes are shown in Figure 8.7 (only the mode on the left was computed with inverse power iteration, the mode on the right was added using eig()).

4 See: aetna/EigenvalueProblems/test_invpwr_conv1.m
5 See: aetna/EigenvalueProblems/LinkedBuildings/lb_A_invpower.m


Fig. 8.7. The lowest modes of the linked buildings: f1 = 0.97015 [Hz] and f2 = 1.4614 [Hz].

Suggested experiments

1. Change the stiffness of the link spring to kℓ = 0. Does the inverse power iteration converge? If not, why?

8.3.1 Shifting used with inverse power iteration

Consider the effect of subtracting σx from both sides of the eigenvalue problem (that is, adding the identity −σx = −σx).

Ax− σx = λx− σx .

At first blush, this does not seem to have any effect, but rewritten as

(A− σ1)x = (λ− σ)x

or

(A− σ1)x = ϱx

it is revealed that it leads to a slightly different eigenvalue problem, with the same eigenvector, but a shifted eigenvalue ϱ = λ − σ. This leads to the idea of searching for an eigenvalue/eigenvector pair for the shifted matrix, not the original one, because the smallest min |λ| can be made to correspond to min |ϱ| ≈ 0. Then, the eigenvalue min |ϱ| could be very strongly dominant, since 1/min |ϱ| is going to be large compared to the other eigenvalues.

Figure 8.8 illustrates this concept with an example with four eigenvalues

λ = [2.80, 1.167, 0.609, 0.452]

The ratio λ3/λ4 ≈ 1.34. Applying a shift σ = 0.3 leads to a shifted problem with eigenvalues

ϱ = [2.50, 0.867, 0.309, 0.152]

and the ratio ϱ3/ϱ4 ≈ 2.04 > 1.34. The larger this ratio, the better. The inverse power iteration on the shifted problem will converge faster.

Page 176: Aetna Book 2015 Hyper

170 8 Solutions methods for eigenvalue problems

Fig. 8.8. Visual representation of the effect of shifting (the eigenvalues λ1, ..., λ4, their reciprocals 1/λj, and the shifted values ϱ = λ − σ, σ > 0, with their reciprocals).

Fig. 8.9. The relative error of the smallest eigenvalue λ4 (plotted against the iteration number) for the symmetric 4 × 4 matrix with eigenvalues [2.80, 1.167, 0.609, 0.452]. Comparison of un-shifted (no shift) and shifted (σ = 0.3 and σ = 0.4) inverse power iteration.

Figure 8.9 shows the effect of shifting. Two shifts are applied, one corresponding to Figure 8.8, and one even closer to the eigenvalue λ4 in magnitude, σ = 0.4. The effect of shifting is quite dramatic. The closer we can guess the magnitude of the smallest eigenvalue (so that we can set the shift equal to the guess), the higher the convergence rate.

The inverse power iteration algorithm with shifting is given in the MATLAB code below.⁶

function [lambda,v,converged]=sinvpwr2(A,v,sigma,tol,maxiter)
... some error checking omitted
n=size(A,1);% problem size (definition spelled out here for completeness)
plambda=Inf;% initialize eigenvalue in previous iteration
v=v/norm(v);% normalize
[L,U,p]=lu((A-sigma*eye(n)),'vector');%Factorization
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\v(p)); % update eigenvector approx, equiv. to u=A\v
    lambda=(u'*A*u)/(u'*u);% Rayleigh q. using the definition
    v=u/norm(u);% normalize
    if (abs(lambda-plambda)/abs(lambda)<tol)
        converged = true; break;% converged!
    end
    plambda=lambda;
end
end

6 See: aetna/EigenvalueProblems/sinvpwr2.m

Note that we factorize the shifted matrix, and also note that we compute the Rayleigh quotient using the definition formula instead of the shortcut possible in the plain-vanilla inverse power iteration.

How to choose the shift in the first place is a bit of a ticklish question. We do not know the smallest eigenvalue to begin with, and yet the smallest eigenvalue itself would be the best shift to apply! If we guess the shift incorrectly, the iteration may converge to an eigenvalue that we did not want.

Illustration 4

Consider the following eigenvalue problem with a 3 × 3 matrix whose eigenvalues are 1, 2, 4.⁷

A =[ 2.486697669648270  -0.326429831194336  -1.065046141649933
    -0.326429831194336   2.167809045836811   1.032918306492685
    -1.065046141649933   1.032918306492685   2.345493284514918];
n=3;
[V,D]=eig(A)
tol =1e-6; maxiter= 24;
v=rand(n,1);% starting vector
sigma =1.6;% the shift
[lambda,phi,converged]=sinvpwr2(A,v,sigma,tol,maxiter)

We guessed that the smallest eigenvalue was close to 1.6 and applied the shift 1.6. The shifted inverse power iteration produced the eigenvalue approximation of 2, instead of the smallest eigenvalue we hoped to find.

Illustration 5

Apply the inverse power iteration method to the structure described in Illustration on page 164, but change the stiffness of the link spring to kℓ = 0. Would shifting help with convergence to the first frequency?

8.4 Simultaneous power iteration

So far we have been pointing out how the components of the dominant eigenvector are magnified in each iteration. In fact, components of all eigenvectors are magnified, except not as strongly. This leads to the idea of applying the power iteration (or the inverse power iteration) to several starting vectors at once, with the goal of extracting the components of the eigenvectors for several dominant eigenvalues concurrently.

The first decision we have to make concerns the starting vectors: we should make every effort to avoid starting vectors that are orthogonal to the eigenvectors we are looking for. Oftentimes this is achieved by choosing starting vectors with random components.

7 See: aetna/EigenvalueProblems/test_shift.m


Fig. 8.10. The effect of several matrix-vector multiplications on two iteration vectors w_1^(k), w_2^(k), k = 0, ..., 4, for A = [1.05, −0.171; −0.171, 0.614]. Eigenvalues λ1 = 1.11, λ2 = 0.556. No effort is made to maintain the iteration vectors linearly independent.

The fact that the most dominant eigenvector will be swamping out all the other eigenvectors is going to keep us from obtaining reasonable approximations of the other eigenvectors. In other words, since the dominant eigenvector components will be getting magnified more than the components of the other vectors, eventually all the vectors on which we iterate will become aligned with the dominant eigenvector. Figure 8.10 illustrates the effect of simultaneous iteration on two vectors: the starting vectors are w_1^(0), w_2^(0). After just four iterations the vectors w_1^(4), w_2^(4) are pretty much aligned with the dominant eigenvector v1. They are still linearly independent, but only barely.

So iteration on multiple vectors will be tricky. The desired eigenvectors will still be present, but they will be hard to extract from such an ill conditioned basis (all vectors essentially parallel). Therefore, similarly to the power (inverse power) iteration, where we normalized the approximation in each step so as to avoid underflow or overflow, we will normalize the set of vectors on which we iterate: not only so that they are of unit magnitude, but also so that they are mutually orthogonal. (Technical term: the vectors are orthonormal.) An excellent tool for this purpose is the QR factorization: the columns of the matrix Q are orthonormal, and they come from the columns of the input matrix.
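In MATLAB, the orthonormalization of a set of iteration vectors stored as the columns of a matrix W can be done with the economy QR factorization; a minimal sketch (the data here is made up):

W = rand(10,3);       % three iteration vectors as columns (made-up data)
[Q,R] = qr(W,0);      % economy QR factorization: Q is 10-by-3 with orthonormal columns
W = Q;                % replace the iterated vectors with their orthonormal version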

In this way we get the so-called simultaneous power iteration (also called block power iteration). The starting vectors will be arranged as columns of a rectangular matrix

W^(0) = [ w_1^(0), w_2^(0), ..., w_p^(0) ] .

The algorithm will repeatedly multiply the iterated n × p matrix W^(k) by the n × n matrix A and also orthogonalize the columns of the iterated matrix by the QR factorization.

W^(0) given
for k = 1, 2, ...
    W^(k) = A W^(k−1)
    QR = W^(k)    % compute QR factorization
    W^(k) = Q
(8.1)

The eigenvalue approximations may be computed as before from the Rayleigh quotient

λ_j^(k) = w_j^(k)T A w_j^(k) .

Note that we have omitted dividing by w_j^(k)T w_j^(k) because these vectors are orthonormal:

w_j^(k)T w_m^(k) = 1 when j = m, and 0 otherwise.

Figure 8.11 shows the effect of orthogonalization for the same matrix and the same starting vectors as in Figure 8.10, but this time with QR factorization. The iterated vectors now converge to the two eigenvectors.

Fig. 8.11. The effect of several matrix-vector multiplications on two iteration vectors w_1^(k), w_2^(k), k = 0, ..., 4, for A = [1.05, −0.171; −0.171, 0.614]. Eigenvalues λ1 = 1.11, λ2 = 0.556. Iteration vectors are orthogonalized after each iteration.

In order to switch from the block power iteration to the block inverse power iteration we just switch the one line that refers to the repeated multiplication with the coefficient matrix, so that the multiplication is with its inverse

W^(0) given
for k = 1, 2, ...
    A W^(k) = W^(k−1)    % solve
    QR = W^(k)           % compute QR factorization
    W^(k) = Q
(8.2)

The MATLAB code for the block inverse power iteration is given below. Note that the so-called economy QR factorization is used: the matrix Q is rectangular rather than square.⁸

function [lambda,v,converged]=binvpwr2(A,v,tol,maxiter)
... some error checking omitted
nvecs=size(v,2);% How many eigenvalues?
plambda=Inf+zeros(nvecs,1);
lambda =plambda;
[v,r]=qr(v,0);% normalize
[L,U,p] =lu(A,'vector');% Factorized for efficiency
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\v(p,:)); % update vectors
    for j=1:nvecs % Rayleigh quotient
        lambda(j)=(v(:,j)'*v(:,j))./(u(:,j)'*v(:,j));
    end
    [v,r]=qr(u,0);% economy QR factorization
    if (norm(lambda-plambda)/norm(lambda)<tol)
        converged = true; break;
    end
    plambda=lambda;
end
end

8 See: aetna/EigenvalueProblems/binvpwr2.m

Note that when we are computing the Rayleigh quotient we have to account for u being the result of the inverse power iteration. Also, we could have replaced lambda(j)=(v(:,j)'*v(:,j))./(u(:,j)'*v(:,j)) with lambda(j)= 1.0./(u(:,j)'*v(:,j)) (why?).

Shifting could also be applied to block inverse power iteration. Even though only one shift value can be used, the beneficial effect applies to all iterated eigenvectors: the iteration will converge to the eigenvectors with eigenvalues closest to the shift.

Illustration 6

Apply the block inverse power iteration method to the structure described in Illustration on page 164, but change the stiffness of the link spring to kℓ = 0. Use it to find the first two modes.

A possible solution is given in the script lb_A_blinvpower.⁹

Suggested experiments

1. Interpret the mode shapes obtained above with the solution provided by MATLAB's eig. The mode shapes are different. Does it matter?

8.5 QR iteration

An obvious step to take with simultaneous power iteration is to compute all the eigenvalues and eigenvectors of the n × n matrix A by iterating on n vectors at the same time. This is shown in the following algorithm (note the choice of the initial orthonormal vectors as the columns of an identity matrix):

W^(0) = 1
for k = 1, 2, ...
    W^(k) = A W^(k−1)
    QR = W^(k)    % compute QR factorization
    W^(k) = Q
(8.3)

The matrix W^(k) converges to a matrix of eigenvectors. Recall that the matrix of eigenvectors can make the matrix A similar to a diagonal matrix, the matrix of the eigenvalues (recall (4.13)). The matrix W^(k) is only close to the matrix of eigenvectors (and getting closer with the iteration), and therefore the matrix

A^(k) = W^(k)T A W^(k)

9 See: aetna/EigenvalueProblems/LinkedBuildings/lb_A_blinvpower.m


will be only close to a diagonal matrix, not perfectly diagonal, and the numbers on the diagonal will approximate the eigenvalues.

It can be shown that the above simultaneous iteration is equivalent to the so-called QR iteration (note well that this is different from QR factorization). The QR iteration is given by the following algorithm:

A^(0) = A
for k = 1, 2, ...
    QR = A^(k−1)    % compute QR factorization
    A^(k) = RQ      % note the switched factors
(8.4)

The matrix A^(k) that appears in the last step of (8.4) is the same as A^(k) = W^(k)T A W^(k) in the algorithm (8.3) (explained in detail in Trefethen, Bau (1997)). In this sense the two algorithms are equivalent. The script qr_power_correspondence¹⁰ demonstrates the equivalence of the two algorithms for a randomly generated matrix.
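One step of the basic (unshifted) QR iteration (8.4) can be coded, for example, as in the following sketch (the toolbox script qrstep referenced in the footnotes presumably plays this role):

function A = qrstep_sketch(A)
% One step of the basic (unshifted) QR iteration (8.4): illustration only.
[Q,R] = qr(A);     % QR factorization of the current iterate
A = R*Q;           % recombine the factors in the reverse order
end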

The QR iteration (8.4) is amenable to several significant enhancements, as pointed out below. The QR iteration is one of the most important algorithms used in eigenvalue/eigenvector problems. First we will inspect the properties of the transformations effected by the above algorithm.

8.5.1 Schur factorization

The matrix A^(k) in (8.4) converges to an upper triangular matrix. In fact, for our assumption of A being symmetric, A^(k) converges to a diagonal matrix. In the limit of k → ∞ the transformation

A^(k) = W^(k)T A W^(k)

will lead to the so-called Schur factorization. The Schur factorization in fact exists for all square matrices, complex or real, symmetric (Hermitian¹¹) or un-symmetric (non-Hermitian), non-defective or defective¹². The so-called Schur lemma claims: for any square matrix A there is a unitary matrix¹³ U such that the matrix T

T = Ū^T A U    (8.5)

is upper triangular. This can be shown as follows: the square matrix A has at least one eigenvalue and one eigenvector. Therefore, we can write (for simplicity the procedure is demonstrated here for a 6 × 6 matrix; the symbols •, ♢, ... stand for general complex numbers; zeros are not shown)

A U1 = U1 [ λ1  •  •  •  •  •
                ♢  ♢  ♢  ♢  ♢
                ♢  ♢  ♢  ♢  ♢
                ♢  ♢  ♢  ♢  ♢
                ♢  ♢  ♢  ♢  ♢
                ♢  ♢  ♢  ♢  ♢ ] ,

where the first column of U1 is an eigenvector of A: A u1 = λ1 u1, and the other columns of U1 are arbitrarily selected to form an orthonormal basis (this is always possible). Now we write

10 See: aetna/EigenvalueProblems/qr_power_correspondence.m
11 Hermitian matrix: A = Ā^T, where Ā^T is the so-called conjugate transpose (its elements are complex conjugates of the transposed matrix).
12 A defective matrix does not have a full set of eigenvectors. Example: [0, 1; 0, 0]. Double eigenvalue 0, a single eigenvector [1; 0].
13 Unitary matrix: a complex matrix U such that U Ū^T = Ū^T U = 1. For real matrices unitary = orthogonal.


U1^T A U1 = [ λ1  •  •  •  •  •
                  ♢  ♢  ♢  ♢  ♢
                  ♢  ♢  ♢  ♢  ♢
                  ♢  ♢  ♢  ♢  ♢
                  ♢  ♢  ♢  ♢  ♢
                  ♢  ♢  ♢  ♢  ♢ ]

and we apply exactly the same argument to the smaller 5 × 5 matrix (the ♢ elements). This again leads to the first column having zeros below the diagonal, which we write as

U2^T (U1^T A U1) U2 = U2^T U1^T A U1 U2 = [ λ1  •  •  •  •  •
                                                λ2  ♢  ♢  ♢  ♢
                                                    ♠  ♠  ♠  ♠
                                                    ♠  ♠  ♠  ♠
                                                    ♠  ♠  ♠  ♠
                                                    ♠  ♠  ♠  ♠ ]

And so we continue until we construct

U5^T ... U2^T U1^T A U1 U2 ... U5 = [ λ1  •  •  •  •  •
                                          λ2  ♢  ♢  ♢  ♢
                                              λ3  ♠  ♠  ♠
                                                  λ4  ⋄  ⋄
                                                      λ5  ♣
                                                          λ6 ] .

Since we can define a unitary matrix as U = U1 U2 ... U5, we have completed the Schur factorization. This construction highlights the main attraction of the Schur factorization: the upper triangular matrix on the right-hand side has the eigenvalues of A on the diagonal. It also points to a major difficulty: in order to compute the Schur factorization we have to solve a sequence of eigenvalue problems. This is not possible in a finite number of steps in general, as follows from the impossibility of finding the roots of an arbitrarily high order polynomial by explicit formulas. As a consequence, computing the Schur factorization must be an iterative procedure, and in fact the QR iteration is precisely such a procedure.
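MATLAB computes the Schur factorization directly with the built-in function schur. A quick numerical check of (8.5), with a made-up symmetric matrix, could read:

A = randn(5); A = A + A';   % random symmetric test matrix (made up)
[U,T] = schur(A);           % for real A this is the real Schur form; T is diagonal here
norm(U*T*U' - A)            % small number (round-off): U*T*U' reproduces A
norm(U'*U - eye(5))         % U is orthogonal (unitary and real)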

8.5.2 QR iteration: Shifting and deflation

The QR iteration is a numerically stable procedure because it proceeds by applying successive orthogonal transformations, similarly to the construction we just outlined. To show this we write for the QR factors in one step

Q(k)R(k) = A(k−1)

A(k) = R(k)Q(k)

and substitute in the second equation R(k) = Q(k)TA(k−1) from the first equation:

A(k) = Q(k)TA(k−1)Q(k) .

This is a similarity transformation with orthogonal matrices Q^(k).

The basic QR iteration¹⁴ convergence process is illustrated in Figure 8.12 (see the script Visualize_qr_iteration¹⁵). The QR iteration is demonstrated by the gradual emergence of a dominant diagonal of the matrix A^(k) (which contains the approximations to the eigenvalues). The magnitude of the elements of the matrix is coded in shades of gray: white for elements close to zero, and the larger the element in absolute value the darker the shade. We see how the off-diagonal


Fig. 8.12. QR factorization example. Matrix eigenvalues [−3, 3, 4, 4.5, 5, 7]. QR iterations 1, 5, 9, 13 are shown top to bottom, left to right. (Panel titles, in that order: Approx λ = [4.980, 5.016, −1.332, 3.872, 4.702, 3.263], [6.854, 4.528, 4.623, 1.073, 0.516, 2.907], [6.994, 4.576, 4.897, 3.491, −2.351, 2.894], [7.000, 4.678, 4.812, 3.954, −2.837, 2.893].)

elements decrease in magnitude with successive iterations, and the diagonal elements come to dominate. Figure 8.13 shows a similar computation as in Figure 8.12, but with a different matrix. This time the QR iteration gets stuck on the three eigenvalues in the top left corner, and the iteration does not result in a diagonal matrix. The lack of convergence is due to the repeated eigenvalues (in absolute value), and additional sophistication is needed to extract the repeated eigenvalues.

Shifting may be introduced into the QR iteration similarly as in the simultaneous inverse iteration. The QR iteration may in fact be shown to be equivalent not only to simultaneous iteration, but also to simultaneous inverse iteration. Therefore, the shifting will have a very similar effect: faster convergence in the lower eigenvalues. The shift can be selected in various judicious ways. Here we will discuss a simple choice: the Rayleigh quotient shift. We have seen that the QR iteration was successively transforming the original matrix to a diagonal matrix. The elements on the diagonal of the iterated matrix are in fact the Rayleigh quotients. A good shift therefore is the element A_nn^(k−1) of the iterated matrix. The shift is applied as

A^(0) = A
for k = 1, 2, ...
    ρ = A_nn^(k−1)
    QR = (A^(k−1) − ρ1)    % compute QR factorization
    A^(k) = RQ + ρ1
(8.6)

This translates directly into MATLAB code:16

function A = qrstepS(A)
[m,n]=size(A);
rho = A(n,n); % shift
[Q,R]=qr(A-rho*eye(n,n));
A = R*Q + rho*eye(n,n);
end

14 See: aetna/EigenvalueProblems/qrstep.m
15 See: aetna/EigenvalueProblems/Visualize_qr_iteration.m
16 See: aetna/EigenvalueProblems/qrstepS.m


Fig. 8.13. QR factorization example. Matrix eigenvalues [5, −5, 5, 3, 2, 1]. QR iterations 1, 5, 9, 13 are shown top to bottom, left to right. (Panel titles, in that order: Approx λ = [1.257, 4.873, −0.387, 2.183, 1.108, 1.967], [0.208, 4.915, −0.163, 3.041, 1.997, 1.001], [0.180, 4.913, −0.094, 3.001, 2.000, 1.000], [0.180, 4.913, −0.093, 3.000, 2.000, 1.000].)

In practice, once an eigenvalue converges, the corresponding row and column are removed from the matrix, and the QR iteration continues on the smaller remaining matrix. This is called deflation.
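A sketch of how deflation might be combined with the shifted step qrstepS is shown below. The convergence test on the last row and the tolerance are simplistic and for illustration only.

function lambdas = qr_deflation_sketch(A,tol,maxiter)
% Shifted QR iteration with deflation: illustration only.
lambdas = [];
while size(A,1) > 1
    for iter = 1:maxiter
        A = qrstepS(A);                            % one shifted QR step
        n = size(A,1);
        if norm(A(n,1:n-1)) < tol*norm(A,'fro')    % last row off-diagonal negligible?
            break
        end
    end
    n = size(A,1);
    lambdas(end+1,1) = A(n,n);                     % accept the converged eigenvalue
    A = A(1:n-1,1:n-1);                            % deflation: drop the last row and column
end
lambdas(end+1,1) = A(1,1);                         % the remaining 1-by-1 block
end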

Illustration 7

Apply the shifted QR iteration method to the structure described in Illustration on page 164. The shifted QR algorithm using qrstepS in the script lb_A_qr¹⁷ does not in fact converge very well. The basic algorithm without shifting¹⁸ actually works better. Even better is the strategy of shifting known under the name of Wilkinson (James Hardy Wilkinson, 1919 - 1986, was a giant in the 20th century history of numerical algorithms)¹⁹.

Suggested experiments

1. Change the stiffness of the link spring to kℓ = 0. Does the QR iteration converge? Try the variants with shifting.

17 See: aetna/EigenvalueProblems/LinkedBuildings/lb_A_qr.m
18 See: aetna/EigenvalueProblems/qrstep.m
19 See: aetna/EigenvalueProblems/qrstepW.m


8.6 Spectrum slicing

For a real symmetric matrix A the transformation Ā = M A M^T is called a congruence. It is not a similarity transformation, so it does not preserve the eigenvalues. However, according to the so-called Sylvester's Law of inertia, the congruence transformation preserves the number of positive, zero, and negative eigenvalues. This fact leads to a simple and convenient method called spectrum slicing: to find the number of eigenvalues of the matrix A less than σ, form the LDL^T factorization of the matrix A − σ1

A − σ1 = L D L^T

and count the number of negative elements (these are the pivots) in the diagonal matrix D.

This spectrum slicing approach is also easily extended to the generalized eigenvalue problem. To find the number of eigenvalues of Kx = λMx less than σ, form the LDL^T factorization of the matrix

K − σM = L D L^T

and count the number of negative elements in the diagonal matrix D.
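In MATLAB the count may be obtained with the ldl function. A sketch (K, M, and the shift sigma are assumed to be defined):

[L,D] = ldl(K - sigma*M);         % LDL^T factorization of the shifted matrix
nbelow = sum(eig(full(D)) < 0)    % number of negative pivots; eig(D) also handles
                                  % the 2-by-2 blocks that ldl may produce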

Illustration 8

For the mechanical system of Figure 5.1 the mass and stiffness matrices are

M = [ m  0  0
      0  m  0
      0  0  m ] ,    K = [ 2k  −k   0
                           −k   2k  −k
                            0   −k   k ]

Here k = 61, and all the masses are equal, m = 1.3. For instance, we can check how many natural frequencies lie below 0.5 Hz. We form the matrix

A = K − (0.5 × 2π)^2 M

Using the MATLAB LU factorization as [L,U,P] = lu(A) yields

L = [ 1       0       0
      −0.559  1       0
      0       −0.812  1 ] ,   U = [ 109  −61   0
                                    0    75.1  −61
                                    0    0     −1.39 ] ,   P = [ 1  0  0
                                                                 0  1  0
                                                                 0  0  1 ]

Since there is only one negative number on the diagonal of U (that is, on the diagonal of the matrix D from the LDL^T matrix factorization) we conclude that only one natural frequency lies below 0.5 Hz.

Next we check how many natural frequencies lie below 2.0 Hz. The factorization gives

L = [ 1      0      0
      −0     1      0
      0.732  0.633  1 ] ,   U = [ −83.3  −61   0
                                  0      −61   −144
                                  0      0     30.3 ] ,   P = [ 1  0  0
                                                                0  0  1
                                                                0  1  0 ] ,

which we compare with the frequencies given in Section 5.4 and conclude that something is wrong: there are two negative numbers on the diagonal, but all three frequencies are in fact below 2.0 Hz. The reason is that once the partial pivoting introduces a non-identity permutation matrix, so that

LU = PA

the congruence that the Sylvester theorem relies upon is no longer applicable. In fact, the product LU is no longer symmetric and it is not possible to factor it into LDL^T. The pivoting has to be done carefully to preserve the symmetry of the resulting product of factors. For instance, the MATLAB


function ldl produces directly the LDL^T factorization and returns the "psychologically" lower-triangular factor L. We can write [L,D] = ldl(A), with the result

L = [ 1      0      0
      0.732  0.423  1
      −0     1      0 ] ,   D = [ −83.3  0     0
                                  0      −144  0
                                  0      0     −12.8 ] .

Now we see three negative numbers on the diagonal of D, which indeed corresponds to our prior knowledge that all three frequencies are below 2.0 Hz.

8.7 Generalized eigenvalue problem

Small generalized eigenvalue problems may be approached by converting them to a standard eigenvalue problem form. For instance, if the stiffness matrix is nonsingular, we may form the so-called Cholesky factorization. It can be produced from the LDL^T factorization as

L D L^T = K

by defining R = L√D so that R R^T = K. We see that we need to work with a positive definite stiffness matrix so that the diagonal matrix D will give real roots. With the Cholesky factors at hand we transform the generalized eigenvalue problem Kz = ω^2 Mz as

Kz = R R^T z = ω^2 Mz

and by introducing y = R^T z we obtain the standard eigenvalue problem

(1/ω^2) y = R^−1 M R^−T y .
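A sketch of this conversion in MATLAB (here chol(K,'lower') produces a lower-triangular factor with R*R' = K, playing the role of the factor R above):

R = chol(K,'lower');       % Cholesky factor: R*R' = K (K must be positive definite)
B = R \ (M / R');          % B = R^-1 * M * R^-T
[Y,D] = eig(B);            % standard eigenvalue problem: (1/omega^2) y = B y
omega = sqrt(1./diag(D));  % natural angular frequencies
Z = R' \ Y;                % recover the eigenvectors z from y = R'*z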

If the stiffness happens to be singular, but the mass matrix is not, the roles of these two matrices may be reversed.

For larger generalized eigenvalue problems (and in vibration analysis it is not uncommon nowadays to work with millions of equations) the conversion to the standard eigenvalue problem would be too expensive. Moreover, we are typically not interested in all the eigenvalues anyway, and a better suited technique will help us extract a few eigenvalues of interest, typically the lowest ones.

The inverse iteration method (8.2) is easily adapted to the generalized eigenvalue problem. The simultaneous inverse iteration for the generalized eigenvalue problem is written as

W^(0) given
for k = 1, 2, ...
    K W^(k) = M W^(k−1)    % solve
    QR = W^(k)             % compute QR factorization
    W^(k) = Q

The eigenvalues may be estimated during the iteration using the Rayleigh quotient. For the generalized eigenvalue problem the Rayleigh quotient is computed from

ω^2 Mz = Kz   ⇒   ω^2 = (z^T K z) / (z^T M z) .

The MATLAB code for the generalized eigenvalue problem solved with block inverse power iteration is given below:²⁰

20 See: aetna/EigenvalueProblems/gepbinvpwr2.m


function [lambda,v,converged]=gepbinvpwr2(K,M,v,tol,maxiter)
... some error checking omitted
nvecs=size(v,2);% How many eigenvalues?
plambda=Inf+zeros(nvecs,1);% previous eigenvalue
lambda =plambda;
[L,U,p] =lu(K,'vector');
converged = false;% not yet
for iter=1:maxiter
    u=U\(L\(M*v(p,:))); % update vectors
    for j=1:nvecs
        lambda(j)=(v(:,j)'*K*v(:,j))/(v(:,j)'*M*v(:,j));% Rayleigh quotient
    end
    [v,r]=qr(u,0);% economy factorization
    if (norm(lambda-plambda)/norm(lambda)<tol)
        converged = true; break;
    end
    plambda=lambda;
end
end

Illustration 9

Apply the block inverse power iteration method for the generalized eigenvalue problem to the structure described in Illustration on page 164.

The algorithm gepbinvpwr2 converges as well as the regular block inverse power iteration for the standard eigenvalue problem.²¹ No surprise, given how easy it was to transition from the generalized to the standard eigenvalue problem for this particular mass matrix.

8.7.1 Shifting

Shifting could also be introduced into the block inverse power iteration for the generalized eigenvalue problem: not only to speed up convergence to the smallest eigenvalue by making it more dominant, but also for precisely the opposite, to make the smallest eigenvalue less dominant. What we mean by this is that if a structure contains rigid body modes (the structure can move without experiencing any resisting forces), it has at least one zero frequency of vibration. Such a frequency is very strongly dominant in the inverse power iteration (1/0!!!). The effect of this dominance cannot be exploited, however, since the matrix K is not invertible. This would make the block inverse power iteration algorithm (page 180) impossible.

Shifting can help. To the eigenvalue problem (with λ = ω2)

λMz = Kz

we add the term σMz on both sides

σMz + λMz = σMz +Kz

and obtain

(σ + λ)Mz = (σM +K)z .

21 See: aetna/EigenvalueProblems/LinkedBuildings/lb_A_gepblinvpower.m


This can be written to resemble the original equation as

ϱMz = K̄z ,

where ϱ = (σ + λ), and K̄ = (σM + K). The matrix K̄ is the shifted stiffness.

Illustration 10

We consider a variation on the three-carriage vibrating system of Section 5.1, where the middle spring is removed. The stiffness matrix of such a vibrating system is singular.

K = [ k   −0   0
      −0   k  −k
       0  −k   k ]

Equivalently, we say that the structure has a rigid body mode. The frequency corresponding to the rigid body mode is zero. Figure 8.14 shows this rigid body mode as a translation of the masses 2, 3. Mass 1 does not displace.²² Clearly, all springs maintain their unstressed length: the rigid body motion does not induce any forces in the structure.

Fig. 8.14. Structure with a singular stiffness matrix. The rigid body mode (ω = 0): z11 = 0, z21 = −0.62, z31 = −0.62.

Now we shall try to apply the block inverse power iteration with gepbinvpwr2.²³ The script n3_sing_undamped_modes_MK2²⁴ invokes gepbinvpwr2 to obtain the first mode without shifting, and the resulting eigenvector and eigenvalue are worthless. The eigenvector in fact contains not-a-numbers (NaN). Why? Because the stiffness matrix is singular, its LU factorization should not exist. The MATLAB function lu (put a breakpoint inside gepbinvpwr2) returns the factors as

K>> L,U
L =
     1     0     0
     0     1     0
     0    -1     1
U =
    61     0     0
     0    61   -61
     0     0     0

The 0 in the element 3,3 of the U factor is a problem: at some point we will have to divide by it. Hence the not-a-numbers.

The script n3_sing_undamped_modes_MK3²⁵ invokes gepbinvpwr2 to obtain the first mode with shifting. The shift is guessed as 0.2. This number is arbitrary, but it should be sufficiently small

22 See: aetna/ThreeCarriages/n3_sing_undamped_modes_MK1.m
23 See: aetna/EigenvalueProblems/gepbinvpwr2.m
24 See: aetna/ThreeCarriages/n3_sing_undamped_modes_MK2.m
25 See: aetna/ThreeCarriages/n3_sing_undamped_modes_MK3.m


to avoid getting close to the first nonzero frequency. The script shows how we invoke gepbinvpwr2 for a stiffness matrix that is modified by the addition of a multiple of the mass matrix to make it non-singular.

[M,C,K,A,k1,k2,k3,c1,c2,c3] = properties_sing_undamped;
v=rand(size(M,1),1);% initial guess of the eigenvector
tol=1e-9; maxiter =4;% tolerance, how many iterations allowed?
sigma = 0.2;% this is the shift
[lambda,v,converged]=gepbinvpwr2(K+sigma*M,M,v,tol,maxiter)
lambda =lambda-sigma % subtract the shift to get the original eigenvalue

The output evidently shows that the iteration was successful.

lambda =
    0.2000        % shifted
v =
   -0.0000
   -0.7071
   -0.7071
converged =
     1
lambda =
   6.3838e-016    % shift removed: ~0

Suggested experiments

For the structure from Illustration on page 164:

1. Change the stiffness of the link spring to kℓ = 0. Does the block inverse power iteration converge?
2. Use the spectrum slicing approach to check the number of eigenvalues located by the power iteration above.

8.8 Annotated bibliography

1. C. Meyer, Matrix Analysis and Applied Linear Algebra Book and Solutions Manual, SIAM: Society for Industrial and Applied Mathematics, 2001.
   Good coverage of eigenvalue and eigenvector problems. Interesting examples. Best of all, freely available at http://matrixanalysis.com/.
2. D. E. Newland, Mechanical Vibration Analysis and Computation, Dover Publications Inc., 2006.
   It covers well matrix analysis of natural frequencies and mode shapes, and some numerical methods for modal analysis.
3. G. Strang, Linear Algebra and Its Applications, Brooks Cole; 4th edition, 2005. (Alternatively, the 3rd edition, 1988.)
   Good coverage of the basics of the eigenvalue problem.
4. L. N. Trefethen, D. Bau III, Numerical Linear Algebra, SIAM: Society for Industrial and Applied Mathematics, 1997.
   The treatment of QR factorization is excellent.


9 Unconstrained Optimization

Summary

1. A number of basic techniques in structural analysis rely on results from the area of optimization. Main idea: Equilibrium of structures and minimization of potential functions are intimately tied. Equilibrium equations are the conditions of the minimum.

2. Stability of structures is connected to the classification of the stiffness matrix. Main idea: positive definite matrices correspond to stable structures.

3. The line search is a basic tool in minimization. Main idea: Monitor the gradient of the objective function. A minimum (extremum) is indicated when the gradient becomes orthogonal to the line search direction.

4. Solving a system of linear equations and minimizing an objective function are two roads to the same destination. Main idea: We show that minimizing the so-called quadratic form solves a system of linear algebraic equations.

5. The method of steepest descent may be improved by the method of conjugate gradients. Main idea: keep track of directions of past line searches.

6. Direct versus iterative methods. Main idea: direct and iterative methods are rather different in their properties (cost vs. accuracy). Iterative algorithms seem to be becoming more and more important in modern software.

7. Least-squares fitting is an important example of optimization.

9.1 Basic ideas

We shall start to explore the subject of unconstrained optimization on problems of static equilibrium of bar structures. The optimization problems are unconstrained in the sense that the minimum or maximum is sought for variables (displacements) that can take on any value; we say they are not constrained. An example of constrained optimization – an equilibrium problem with a contact condition – is treated in a subsequent chapter.

The optimization can search either for a minimum or a maximum of the so-called objective function (or, as it is sometimes called, the cost function). Without any loss of generality we can assume that the objective function is always minimized. It is possible to switch between minimization and maximization by this trick: Let us assume the goal is to minimize the function f(x) by seeking the location of the minimum as

Find x∗ such that f(x∗) ≤ f(x) for all x .    (9.1)

This can be easily changed into a maximization task by flipping the objective function about the horizontal axis (i.e. changing its sign) and seeking the maximum as

Find x∗ such that − f(x∗) ≥ −f(x) for all x . (9.2)


9.2 Two degrees of freedom static equilibrium: unstable structure

Consider the simple static system in Figure 9.1. The stretch of the spring can be expressed as

s = x1 cos 30° + x2 sin 30° .

The energy stored in the spring (the energy of deformation) is

DE = (1/2) k s^2 .

Using a matrix expression (for reasons that will become clear later), the stretch of the spring can be expressed as

s = [cos 30°  sin 30°] [x1; x2] .

The energy stored in the spring can also be written as

DE = (1/2) k s^T s ,

where by s^T we mean the transpose (never mind that the transpose of a scalar doesn't do anything). Substituting for the stretch we obtain

DE = (1/2) k ([cos 30°  sin 30°] [x1; x2])^T [cos 30°  sin 30°] [x1; x2] ,

which gives in short order

DE = (1/2) k [x1  x2] [cos 30°; sin 30°] [cos 30°  sin 30°] [x1; x2] .

If we define the matrix

K = k [cos 30°; sin 30°] [cos 30°  sin 30°] = k [cos 30° cos 30° , cos 30° sin 30° ; sin 30° cos 30° , sin 30° sin 30°] ,    (9.3)

we can write the energy stored in the spring as

DE = (1/2) x^T K x ,    (9.4)

where we write for convenience

x = [x1; x2] .

The matrix K is the stiffness matrix. The energy DE is a quadratic function of the displacements x1, x2. Expressed in the form of the matrix expression (9.4), it is called a quadratic form. In the optimization arena the energy function DE would be referred to as the objective function, and the displacements that minimize this function would be sought as the solution to the optimization problem.

As a scalar function of two variables, x1, x2, the energy DE may be visualized as a surface raised above the plane x1, x2. Figure 9.2 shows the surface of the deformation energy in two views: from the top, and isometric. We can see that the surface is a trough, with the bottom indicated by the thick white level curve at DE = 0. This level curve appears to run in the direction

[−sin 30°; cos 30°] .


Fig. 9.1. Static equilibrium of particle suspended on a spring (spring stiffness k, spring inclined at 30°, displacements x1, x2).

Taking the displacement as

[x1; x2] = α [−sin 30°; cos 30°]    (9.5)

we can compute the stretch in the spring as

s = [cos 30°  sin 30°] [x1; x2] = [cos 30°  sin 30°] α [−sin 30°; cos 30°] = 0 .

This confirms that for the displacements (9.5) the energy stored in the spring is equal to zero. This property is encountered in structures which are mechanisms: they can move in some ways without deformation, that is without the need to store energy. Such structures are unstable.

Furthermore, we can see that for the displacements (9.5) we get

Kx = 0 .

Thus we see that the matrix K of (9.3) is singular. Clearly, the fact that the matrix is singular and the fact that the deformation energy may be zero for some nonzero displacement are related.
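A quick numerical check of this observation (a sketch; the spring stiffness is arbitrary):

k = 1;                             % arbitrary spring stiffness for the check
n = [cos(pi/6); sin(pi/6)];        % direction of the spring (30 degrees)
K = k*(n*n');                      % the stiffness matrix (9.3)
d = [-sin(pi/6); cos(pi/6)];       % zero-energy displacement direction (9.5)
K*d                                % equals [0;0]: no restoring force
eig(K)                             % one zero eigenvalue: K is singular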

Fig. 9.2. Static equilibrium of particle suspended on a spring. The surface of the deformation energy.


Illustration 1

Modify the code below to display the surface in Figure 9.2. The second and the last line need to be modified to reflect a particular objective function. The last line is supposed to draw arrows representing the gradient.

[x,y]=meshgrid(-10:10,-10:10);
z=x.*y; % function
surf(x,y,z,'Edgecolor','none'); hold on
contour3(x,y,z,20,'w'); hold on
quiver(x,y,y,x); % gradient

9.3 Two degrees of freedom static equilibrium: stable structure

Figure 9.3 shows a static equilibrium system with two degrees of freedom as before, but this time with two springs. The stiffness matrix consists of two contributions, one from each spring

K = k [cos 30°; sin 30°] [cos 30°  sin 30°] + (k/2) [cos(−30°); sin(−30°)] [cos(−30°)  sin(−30°)]
  = k [cos 30° cos 30° , cos 30° sin 30° ; sin 30° cos 30° , sin 30° sin 30°]
  + (k/2) [cos(−30°) cos(−30°) , cos(−30°) sin(−30°) ; sin(−30°) cos(−30°) , sin(−30°) sin(−30°)] .    (9.6)

Fig. 9.3. Static equilibrium of particle suspended on two springs (stiffnesses k and k/2, each inclined at 30°, displacements x1, x2).

Figure 9.4 shows the variation of the deformation energy as a function of x1, x2: the only point where the DE assumes the value of zero is at x1 = 0, x2 = 0. Everywhere else the deformation energy is positive. This means that whenever the displacements are different from zero, the springs will store nonzero energy. This is the hallmark of stable structures.

Matrices A that have the property

xTAx > 0

for all x ≠ 0, and for which


Fig. 9.4. Static equilibrium of particle suspended on a spring. The surface of the deformation energy.

xTAx = 0

only for x = 0, are called positive definite. Stable structures have positive definite stiffness matrices. Positive definite matrices are nonsingular (they are regular). This is a fact well worth retaining.

Note that the stiffness matrix is symmetric. An important property of the quadratic forms is that only symmetric matrices contribute to the value of the quadratic form. We can show that as follows: For the moment assume that A is in general unsymmetric. The quadratic form is a scalar (real number), and as such it is equal to its transpose

x^T A x = (x^T A x)^T .

Therefore, we can write

xTAx = xTATx

or

x^T A x − x^T A^T x = x^T (A − A^T) x = 0 .    (9.7)

The general matrix A may be written as a sum of a symmetric matrix and a skew-symmetric (anti-symmetric) matrix

A = (1/2)(A + A^T) + (1/2)(A − A^T) .

In the expression (9.7) we recognize the anti-symmetric part of A. Therefore, we conclude that the anti-symmetric part does not contribute to the quadratic form, only the symmetric part does. Therefore, normally we work only with symmetric matrices in quadratic forms.

9.4 Potential function

There is another reason for using symmetric matrices to generate quadratic forms: the quadratic form often results as an expression of a potential function (potential, for short). The potential function must have a symmetric matrix of second derivatives. The first derivative, the so-called gradient, comes from

d/dx (x^T A x) .


Consider how to compute the derivative with respect to x of the product a^T b: both vectors need to be differentiated in turn using the chain rule. So that we don't have to differentiate a transpose of the vector a, we take advantage of the fact that the result of the product a^T b is a scalar, which may be transposed at will without changing anything:

(a^T b) = (b^T a) .

To differentiate the vector b in the product aT b with respect to x is straightforward:

a^T ∂b/∂x .

To differentiate the vector a in the product a^T b with respect to x, we will first transpose the product to get b^T a, and then we differentiate

b^T ∂a/∂x .

So the product (a^T b) is differentiated as

∂/∂x (a^T b) = a^T ∂b/∂x + b^T ∂a/∂x .

Now back to the quadratic function. The quadratic term may be identified with the above product of vectors if we write

a = x , b = Ax ,

which means that we have for the gradient

∂/∂x (x^T A x) = x^T A + x^T A^T = x^T (A + A^T) .    (9.8)

The second-order derivative, the Hessian, is the derivative of the gradient

d/dx (x^T (A + A^T)) = (A + A^T) .

We see that for the potential function x^T A x both the gradient and the Hessian are expressed in terms of a symmetric matrix (A + A^T).

Illustration 2

Compute the components of the Hessian of the potential Φ(x) = (1/2) x^T A x.

As shown above, the gradient of Φ(x) is

∇Φ(x) = (1/2) x^T (A + A^T) .

The result is a row matrix, with components

[∇Φ(x)]_c = Σ_i x_i (1/2)(A_ic + A_ci) .

The components of the Hessian matrix H_rc can be obtained by differentiating the gradient with respect to each x_r. Therefore, we obtain

H_rc = (1/2)(A_rc + A_cr) .

Clearly, the Hessian is symmetric, Hrc = Hcr. For A symmetric we have

Hrc = Arc .


9.5 Determining definiteness

The LU factorization is a useful tool for determining positive definiteness. If the upper triangular factor matrix has only positive pivots on the diagonal, the matrix is positive definite. Otherwise the matrix is indefinite, or the factorization may have failed, in which case a more substantial investigation is needed to determine the rank of the matrix (does it have zero eigenvalues?). To show the above we write the deformation energy as

x^T A x .

Assuming symmetric A, its LU factorization is A = LU = LDL^T, where D is the diagonal of U (i.e. the pivots used in the factorization). The quadratic form may be written as

xTAx = xTLUx = xTLDLTx .

Now note that we can write this same quadratic form using the new variable z = LTx as

xTAx = xTLDLTx = zTDz

or since D is diagonal we can write in components

x^T A x = z^T D z = Σ_{i=1:n} D_ii z_i^2 .

The last expression is going to be positive for any combination of the z_i only if D_ii > 0 for all i. So D_ii > 0 for all i guarantees that the quadratic form is positive definite.

If any of the D_ii was equal to zero (to get this factorization if any of the elements in the pivot position was zero would be tricky!) and all the others were positive, the matrix would be positive semi-definite (and singular). (Just for completeness, if the pivots were a mixture of positive and negative numbers, the matrix would be indefinite.)
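A sketch of such a test in MATLAB (A is assumed symmetric; as noted above, a zero pivot signals that a closer look is needed):

[L,D] = ldl(A);              % symmetric factorization A = L*D*L'
d = eig(full(D));            % pivot values (eig also handles possible 2-by-2 blocks in D)
if all(d > 0)
    disp('positive definite')
elseif all(d >= 0)
    disp('positive semi-definite (and singular)')
else
    disp('indefinite')
end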

9.6 Two degrees of freedom static equilibrium: computing displacement

Now consider the system from Figure 9.1 with the stiffness matrix (9.3) loaded by the forces

L = [L1; L2] .

The solution of the static equilibrium problem is (presumably) available from the equilibrium equations

Kx = L .

We recall that the matrix (9.3) was singular. This means it is not invertible, and it isn't possible to obtain a solution for just any load L, since we cannot write

x = K^−1 L .

Some particular sets of forces may lead to solutions: namely all such loads that are proportional to the eigenvectors of K. Recall that the eigenproblem is written as

Kx = λx

so that setting L = β λ_j x_j, where x_j is an eigenvector, will lead to a solution of the equilibrium problem in the form

x = βxj .

The stiffness matrix (9.3) has the eigenvalues λ1 = 0, λ2 = k. The corresponding eigenvectors are

x1 = [sin 30°; −cos 30°] ,   x2 = [−cos 30°; −sin 30°] .

Evidently, for the zero eigenvalue, no nonzero load is admissible (L = β λ1 x1 = β × 0 × x1 = 0). For the nonzero eigenvalue, the load is seen to be in the direction of the spring.


9.7 One degree of freedom total energy minimization example

Consider a one degree of freedom system (a particle on a grounded spring). The deformation energy (the elastic energy stored in the spring) is

DE = (1/2) x K x ,

where K is the stiffness constant of the spring. The potential energy of the applied forces is defined as

W = −Lx .

The total energy is defined as

TE = DE +W . (9.9)

The solution for the equilibrium displacement is determined by the principle of minimum total energy: for the equilibrium displacement x∗ the total energy assumes the smallest possible value

x∗ = argmin TE .    (9.10)

(It should be read: find x∗ as the argument that minimizes TE.) This is an unconstrained minimization problem. The minimum of the total energy is distinguished by the condition that the slope at the minimum is zero:

dTE/dx = d/dx ((1/2) x K x − L x) = K x − L = 0 .

This condition is seen to be simply the equation of equilibrium, whose solution indeed is the equilibrium displacement.

The meaning of equation (9.9) and of the minimization problem (9.10) is illustrated in Figure 9.5. The deformation energy is represented by a parabolic arc (dashed line), which attains zero value (that is its minimum) at zero displacement. The potential energy of the external force is represented by the straight dashed line. The sum of the deformation energy and the energy of the external force tilts the dashed parabola into the solid line parabola, the total energy. That shifts the original minimum on the dashed parabola into the new minimum on the solid parabola (negative value) at x∗. The minimum is easily seen to be

min TE = (1/2) x∗ K x∗ − L x∗ = (1/2) x∗ K x∗ − K x∗ x∗ = −(1/2) x∗ K x∗ = −(1/2) x∗ L .

9.8 Two degrees of freedom total energy minimization example

Now we shall consider again the system of Figure 9.3. The potential energy of the applied forces L is expressed as

W = −L^T x .

The effect of this term on the parabolic surface in Figure 9.4 is very similar to that of Figure 9.5, except now it is in more than one variable: the parabolic surface of the deformation energy is tilted into the parabolic surface of the total energy (TE). This surface is shown in Figure 9.6. The red cross at the bottom represents the solution of the static equilibrium equations.


Fig. 9.5. Total energy minimization diagram (total energy TE, potential energy of the external force W, deformation energy DE, equilibrium displacement x∗).

Fig. 9.6. Static equilibrium of particle suspended on two springs. The surface of total energy.

9.9 Application of the total energy minimization

We have seen above that to find the minimum of the total potential energy TE is equivalent to solving a system of linear equations (the equilibrium equations). By finding the solution to one of these problems, one has automatically solved the other. The application we have in mind here is to solve the system of linear equations

Kx = L

by minimizing the energy

TE = (1/2) x^T K x − L^T x .    (9.11)

This is a classical optimization problem. We have an objective function, the total potential energy TE, and our goal is to find the displacement x∗ such that the objective function attains a minimum for that displacement

x∗ = argmin_x TE .    (9.12)

Since all candidate displacements x may be considered in the minimization without any restrictions, the minimization problem is called "unconstrained".


9.9.1 Line search

A commonly used technique for these kinds of problems is the so-called line search method. It works as follows: start at a point. Then repeat as many times as necessary: pick a direction, and find along this direction a location where the objective function has a lower value than at the start point. Make this the new start point, and go back to picking a direction.

The algorithm is seen to be a sort of walkabout on the surface of the objective function. The goal is to reach the lowest point. Two issues need to be addressed: how to choose the direction, and how to choose where to stop when moving along this direction from the starting point. One particular strategy for addressing the first issue is to choose the direction of the negative gradient at the starting point. Since this direction leads to the steepest decrease of the objective function out of all the directions at the starting point, this strategy is called the steepest descent. For general objective functions, the second issue is difficult to address. To know when to stop when moving from the starting point along the chosen direction could be expensive to compute. Compare with Figure 9.7: the objective function appears to be rather complex, and the minimum in the middle is only a local one, not a global minimum: following the drop-off of the objective function in either of the descending corners would lead to further decrease. Fortunately, our present objective function (9.11) is much simpler, and hence much nicer to work with.

Fig. 9.7. Walk towards the minimum of the objective function. Starting point is p0, the walk proceeds against the direction of the gradient (through the points p1, p2, p3).

9.9.2 Line search for the quadratic-form objective function

First things first: let us figure out the gradient of the objective function. Since the objective function is based on a quadratic form, we have in fact already done something very much like this before

∇TE = ∂TE/∂x = ∂/∂x ( (1/2) x^T K x − L^T x ) .

From (9.8) we have

∂TE/∂x = (1/2) x^T (K + K^T) − L^T .

Since the matrix K is symmetric, we can simplify

∂/∂x ( (1/2) x^T K x ) = x^T K

and finally


∇TE = ∂TE/∂x = ∂/∂x ( (1/2) x^T K x − L^T x ) = x^T K − L^T . (9.13)

Note that the gradient is a row matrix.

So now we know how to determine the direction in which to move from a given point in order to decrease the objective function. For direction vectors we usually use column matrices, and so we define the direction of steepest descent as

r = −(∇TE)^T = L − Kx .

The vector r is called the residual. We make it into a column matrix in order for the addition of the vectors x and r to make sense.

Next we have to find out how far to go. One possible strategy is to go as far as possible, meaning that we would follow along a given direction until we have reached the lowest possible value of the objective function starting from a given point in a given direction. Denoting the starting point x0, we write the motion in the direction r

x = x0 + αr .

The lowest point will be reached when we stop descending; if we went any further we would start ascending on the surface of the objective function. We are moving along a direction which subtends various angles with the gradient at any given point. When we are descending we are moving against the direction of the gradient. This would be expressed as (see Figure 9.8, and observe the gradient of function f at point p2)

∇f(p2)r(p0) < 0 .

Note that the result of the multiplication ∇f(p2)r (row matrix with one row times column matrix with one column) is a number proportional to the cosine of the angle that these two arrows subtend.

Fig. 9.8. Walk to find the minimum of the objective function along a given direction. Starting point is p0, the walk proceeds in the direction of r(p0) towards the point p1. (The gradients ∇f are indicated at the points p0, p2, p3, and p4.)

On the other hand, when we are ascending we are moving broadly in the same direction in which the gradient points, and we have (see Figure 9.8, and observe the gradient of function f at point p3)

∇f(p3)r(p0) > 0 .


Finally, we must conclude that when we are standing at a point from which to move in any direction would mean ascending, the path at that point must be perpendicular to the direction of the gradient at that point (see Figure 9.8, observe the gradient of function f at point p4)

∇f(p4)r(p0) = 0 .

(Remark: This may be an oversimplification for more general objective functions. There is also the possibility that a part of the path from p0 to p1 runs level – no descending or ascending.)

The condition that the gradient (9.13) at the lowest point x∗ must be orthogonal to the direction of descent r can be written down as

∇f(x∗) r = ( x∗^T K − L^T ) r = 0

and writing x∗ = x0 + α∗ r we obtain

∇f(x∗) r = ( x0^T K + α∗ r^T K − L^T ) r = 0 .

Further, we recognize that x0^T K − L^T = −r^T, so that we arrive at

α∗ = (r^T r) / (r^T K r) .

This is really the entire algorithm of steepest descent applied to the quadratic-form objective function (9.11): improve the location of the lowest value of the objective function by moving from the starting point x0 to the new point x

x = x0 + ( (r^T r) / (r^T K r) ) r ,   r = L − K x0

and then reset the starting point x0 = x. Such an algorithm is concisely written in MATLAB as

for iter = 1:maxiter
    r = b - A*x0;                        % residual = direction of steepest descent
    x = x0 + (dot(r,r)/dot(A*r,r))*r;    % move to the lowest point along r
    x0 = x;                              % reset the starting point
end

The steepest descent solver for quadratic objective functions is provided in the toolbox as SteepestAxb (see aetna/SteepestDescent/SteepestAxb.m).
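A minimal self-contained sketch of how the loop above might be driven for a concrete small system, with an added residual-based stopping test; the values of A, b, the starting point, and the tolerance are assumed for illustration.

A = [3, -1; -1, 2]; b = [1; 2];        % assumed symmetric positive definite system
x0 = [0; 0]; maxiter = 100; tol = 1e-10;
for iter = 1:maxiter
    r = b - A*x0;                      % residual = direction of steepest descent
    if norm(r) < tol, break; end       % stop once the residual is small enough
    x0 = x0 + (dot(r,r)/dot(A*r,r))*r;
end
x0                                     % compare with the direct solution A\b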

Illustration 3

In Figure 9.9 we apply the solver SteepestAxb to the two-spring equilibrium problem from Section 9.8. Given that this is a two-unknowns system of linear algebraic equations, it takes a lot of iterations to arrive at a solution: inefficient! So why would we bother with this method? It does have some redeeming characteristics. To mention one, it requires very little memory. More about this later in Section 9.12.

9.10 Conjugate Gradients method

The system of two equations for the structure of Figure 9.3 will be considered again in the light of what we have learned about the steepest descent method. By inspection of Figure 9.10 we realize that effort is wasted by zigzagging in towards the minimum, with each step going too much sideways and making too little progress in the direction of the minimum.

Fig. 9.9. Convergence in the norm of the solution error for the steepest descent algorithm applied to the two-spring equilibrium problem (norm of the error versus iteration number, logarithmic scale).

Fig. 9.10. Walk towards the minimum of the quadratic-form (total energy) objective function. Starting point is p0, the walk proceeds against the direction of the gradient (through the points p1, p2, p3, p4).

We realize that there are only two independent directions in the plane x1, x2. The first direction is d(0) = −∇f(x(0))^T, the direction for the first descent step. Therefore, it must be possible to find a direction for the second step d(1) that would lead directly to the minimum. The reason is that at the point x(2) (that is at the minimum) the gradient must vanish, which will make it perpendicular to any vector, including the first and second descent direction

∇f(x(2))d(0) = 0 , and ∇f(x(2))d(1) = 0 ⇒ ∇f(x(2)) = 0 ⇒ x(2) is minimum .

The second orthogonality condition, that is ∇f(x(2))d(1) = 0, occurs naturally as a stopping condition for the step along d(1) (we go as far downhill as possible). We write

x(2) = x(1) + αd(1)

and the second condition will allow us to express



α = −( ∇f(x(1)) d(1) ) / ( d(1)^T K d(1) ) .

The first condition may be put as

∇f(x(2)) d(0) = ( x(2)^T K − L^T ) d(0) = ( x(1)^T K + α d(1)^T K − L^T ) d(0)

and since x(1)^T K − L^T = ∇f(x(1)) is orthogonal to d(0), we get

d(1)^T K d(0) = 0 . (9.14)

From this condition we can determine the second descent direction. We can see that it must be a combination of the first direction d(0) and of −∇f(x(1))^T: these two vectors are orthogonal and therefore they span the plane. In other words any vector can be expressed as a linear combination of these two. Thus we write

d(1) = −∇f(x(1))^T + β d(0) .

From (9.14) we obtain

β = ( ∇f(x(1)) K d(0) ) / ( d(0)^T K d(0) ) .

That the solution can indeed be obtained in just two steps in this case can be shown with MATLAB symbolic math (see aetna/SteepestDescent/analytical CG in 2D book.m):

K = [sym('K11'),sym('K12');sym('K12'),sym('K22')]; % stiffness
L = [sym('L1');sym('L2')];                         % load
X0 = [sym('X01');sym('X02')];                      % starting point
g = @(x)(x'*K-L');                                 % compute gradient
a = @(x,d)(-g(x)*d)/(d'*K*d);                      % compute alpha
b = @(x,d)(g(x)*K*d)/(d'*K*d);                     % compute beta
d0 = -g(X0)';                                      % first descent direction
X1 = X0 + a(X0,d0)*d0;                             % second point
d1 = b(X1,d0)*d0 - g(X1)';                         % second descent direction
X2 = X1 + a(X1,d1)*d1;                             % final point
simplify(g(X2))                                    % gradient at final point ~ 0

The gradient at X2 indeed comes out as the zero matrix. (Word of caution: the symbolic computation may take a while – computer-assisted algebra is not very efficient.)

9.11 Generalization to multiple equations

Now let us consider the solution of the linear system of coupled equations, this time with an n × n matrix K. We will revise the method proposed in the previous section so that it works for more than two equations.

Consider the situation in which the iteration attained the point x(k) and now we want to determine a new search direction d(k) to find the next point x(k+1)

x(k+1) = x(k) + αd(k) .



Fig. 9.11. Walk towards the minimum of the quadratic-form (total energy) objective function. Starting point is p0, the walk proceeds in the directions determined to reach the minimum in just two steps (p0, p1, p2).

We will again make the gradient at the point x(k+1) orthogonal to the two directions d(k−1) and d(k),

∇f(x(k+1))d(k) = 0 , ∇f(x(k+1))d(k−1) = 0 ,

only this time the gradient does not have to vanish identically at x(k+1) since there are many vectors to which it could be orthogonal without having to become identically zero. First we will work out the gradient at the point x(k+1)

∇f(x(k+1)) = x(k+1)^T K − L^T = ( x(k) + α d(k) )^T K − L^T = x(k)^T K + α d(k)^T K − L^T ,

which results in

∇f(x(k+1)) = ∇f(x(k)) + α d(k)^T K .

We substitute into the first orthogonality condition

∇f(x(k+1)) d(k) = ∇f(x(k)) d(k) + α d(k)^T K d(k) = 0

so that we arrive at the step length coefficient

α = −( ∇f(x(k)) d(k) ) / ( d(k)^T K d(k) ) .

The second orthogonality condition gives

∇f(x(k+1)) d(k−1) = ∇f(x(k)) d(k−1) + α d(k)^T K d(k−1) = 0 .

We realize that the point x(k) was reached along the direction d(k−1) and at that point the gradient was orthogonal to the marching direction

∇f(x(k))d(k−1) = 0 .

Therefore, we must also have


d(k)^T K d(k−1) = 0 . (9.15)

We say that the directions d(k−1) and d(k) are K-orthogonal or K-conjugate (or just conjugate directions for short).

So that we can determine the new direction d(k) to be K-conjugate to the old one d(k−1), we assume the new descent direction is a combination of the direction of steepest descent −∇f(x(k))^T and the old direction d(k−1)

d(k) = −∇f(x(k))^T + β d(k−1) .

Substituting into the K-conjugate condition (9.15) we obtain

β = ( ∇f(x(k)) K d(k−1) ) / ( d(k−1)^T K d(k−1) ) .

The conjugate gradients algorithm may be succinctly sketched as follows (see aetna/SteepestDescent/ConjGradAxb.m):

x = x0;
g = x'*A - b';                   % gradient (a row matrix)
d = -g';                         % first direction: steepest descent
for iter = 1:maxiter
    alpha = (-g*d)/(d'*A*d);     % step length along d
    x = x + alpha*d;             % new point
    g = x'*A - b';               % gradient at the new point
    beta = (g*A*d)/(d'*A*d);     % coefficient that makes the new direction A-conjugate
    d = beta*d - g';             % new descent direction
end

Note well that this is not at all an efficient implementation. For instance, the product A*d should be computed just once. For a real industrial-strength conjugate gradient implementation check out the MATLAB pcg solver.
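As a point of comparison, a minimal sketch of calling the built-in solver on the kind of matrix used in the Illustration that follows; the right-hand side, the tolerance, and the iteration limit are assumed for illustration only.

A = gallery('poisson', 18);            % the 324x324 matrix used in the Illustration below
b = A*ones(size(A,1), 1);              % assumed right-hand side with known solution (all ones)
x = pcg(A, b, 1e-10, 500);             % conjugate gradients, here without a preconditioner
norm(x - ones(size(A,1), 1))           % error of the iterative solution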

Illustration 4

Here we apply the steepest descent and the conjugate gradient solvers to a system of linear algebraic equations with a “standard” 324 × 324 matrix (see aetna/SteepestDescent/test cg 1.m).

Figure 9.12 illustrates that the method of conjugate gradients is a definite improvement over the method of steepest descent. The convergence is much quicker. Note that after just 75 iterations or so we could have stopped the conjugate gradient iteration since it reached a limit imposed by the machine precision. The difference between the two methods can also be dramatically displayed by showing how the solution is approached during the iterations. Figure 9.13 shows how the iterated solution (red dashed curve) approaches the converged solution (black solid line) for the steepest descent method in relation to the number of iterations. Figure 9.14 shows the same kind of information. Clearly, even though the two methods started with essentially the same magnitude of error, conjugate gradients managed to reduce it much more quickly.



Fig. 9.12. Comparison of the convergence of the steepest-descent algorithm (dashed line) and the Conjugate Gradients algorithm (solid line). Matrix: poisson(18), 324 unknowns. (Norm of the error versus iteration number, logarithmic scale.)

Fig. 9.13. Solution obtained with the Steepest Descent algorithm for the matrix gallery('poisson',18), 324 unknowns, using various numbers of iterations (iter = 6, 16, 32, 65, 108, 162).

Fig. 9.14. Solution obtained with the Conjugate Gradients algorithm for the matrix poisson(18), 324 unknowns, using various numbers of iterations (iter = 3, 6, 16).


9.12 Direct versus iterative methods

We have seen two representatives of two classes of numerical methods: the LU factorization as a representative of the so-called direct methods, and the method of steepest descent as a representative of the iterative methods.

The direct methods will complete their work in a number of steps that can be determined before they start. If we took the time and effort, we could count every single addition and multiplication that will be required for a given size matrix.

On the other hand, for iterative methods this is not possible. There may be constituents in the iteration procedure whose cost may be evaluated a priori, but the number of iterations is typically impossible to determine beforehand.

Where does the method of conjugate gradients fit? It can be shown that even though we have enforced the orthogonality of gradients to two successive directions at a time, orthogonality of all previous directions to the gradient at the current point is carried forward. Therefore, theoretically, given infinite arithmetic precision, after n steps we will again reach a point, x(n), where the gradient must be orthogonal to all n descent directions. Thus, the gradient at x(n) must vanish identically, otherwise in an n-dimensional space it could not be simultaneously orthogonal to n (linearly independent) directions. In this sense, the method of conjugate gradients is able to complete its work in time that can be determined before the computation starts. On the other hand, it can also be used as an iteration procedure since it is possible to stop it at any time, and the current point would be an improvement of the initial guess.

The characteristics we have just introduced can be illustrated in Figure 9.15. The direct method will start computing, and after a certain time and effort (which we can predict in advance: advantage!) it will stop and deliver the solution with an error within the limits of computer precision (machine epsilon). Until it does, we have nothing. (Disadvantage.)

The iterative method will start reducing the error of the initial guess right away. After a certain time and effort the method will reduce the error to machine precision. (For simplicity we have assumed that the two methods we are comparing will reach machine precision in the same time; this may or may not be so.) Importantly, the iterative method can be stopped before it reaches machine precision. If we are satisfied with a cruder tolerance, we could accept the solution much sooner, and potentially save time (advantage!). For the iterative method we will not know in advance how long it is going to take to compute an acceptable solution. (Disadvantage.)

Nowadays there seems to be an agreement in the scientific and engineering computing community that iterative methods are for many applications the preferred algorithms. This makes it a little bit harder for the users of software built on iterative algorithms, since iterative algorithms typically include some tuning parameters, and various tolerances are involved. A judicious choice of these is not always easy, and it can have a very significant impact on the cost of such computations.

9.13 Least-squares minimization

Consider the following problem. The load-deflection diagram of a stainless steel 303 round coupon was determined experimentally as shown in Figure 9.16. The data comes from the initial, more or less straight portion of the curve. What is the stiffness coefficient of the coupon? If the data were all located on a straight line, it would be the slope of that straight line. However, we can see that not only is there some experimental scatter, but the data points appear to lie on a curve, not a straight line. The so-called linear regression approach to the above problem could start from the assumption that the stiffness could be determined as the slope of a straight line which somehow “best” approximates the measured data points. If the data were all on a straight line, we could write

F(w) = p1 w + p2

for the relationship between the displacement w and the force F, where p1 is the stiffness coefficient of the coupon, K = dF/dw = p1. The data points are not located on a straight line however, which


Fig. 9.15. Comparison of effort versus error for direct and iterative methods (error versus effort; the levels of the machine epsilon eps and of a coarser tolerance tol are marked).

Fig. 9.16. Stainless steel 303 round coupon, and the load-deflection diagram (force [lb] versus deflection [in]).

means that substituting the displacement wk and the force measured for that displacement Fk into the above relationship will not render it an equality; something will be left over: we will call it the residual.

Fk − F(wk) = Fk − p1 wk − p2 = rk .

This may be written in matrix form for all the data points as

[ F1; F2; ...; Fn ] − [ w1, 1; w2, 1; ...; wn, 1 ] [ p1; p2 ] = [ r1; r2; ...; rn ] .

For convenience, using the measured data w1, w2, ..., wn and F1, F2, ..., Fn we will define the matrix

A = [ w1, 1; w2, 1; ...; wn, 1 ]

(in which each row contains one measured deflection wk and a 1)

and the vector


b = [ F1; F2; ...; Fn ] .

The vector of the parameters of the linear fit is

u = [ p1; p2 ] .

The vector of the residuals (also called the error of the linear fit) is

e = [ r1; r2; ...; rn ] .

So we write

b−Au = e ,

where the matrix A has more rows than columns. This is the reason why it will not be possible to make the error exactly zero in general: there are more equations than unknowns.

We realize that, unable to zero out the error, we have to go for the next best thing, which is to somehow minimize the magnitude of the error. In terms of the norm of the vector e it means to find the minimum of the following objective function

min ‖e‖^2 = min e^T e = min (b − Au)^T (b − Au)

with respect to the parameter vector of the linear fit u. This is a classical unconstrained minimization problem. The argument u∗ for which the minimum of the objective function is attained is found from the above expression, which in expanded form reads

u∗ = argmin_u (b − Au)^T (b − Au) = argmin_u ( u^T A^T A u − 2 b^T A u + b^T b ) .

The first-order condition for the existence of an extremum is the vanishing of the gradient of the objective function

∂/∂u ( u^T A^T A u − 2 b^T A u + b^T b ) = 2 u^T A^T A − 2 b^T A = 0 .

Canceling out the factor 2 and transposing leads to the so-called normal equations

A^T A u = A^T b .

They are linear algebraic equations with a symmetric matrix, and since the columns of A are linearly independent (if the wk's are not all the same), the matrix A^T A is invertible. (It is however not necessarily well conditioned. We have seen evidence of this in the Illustration on page 148.)

The coupon data are

w = 10^−2 × [ 1.3, 1.8, 2.3, 2.8, 3.1, 3.6, 4.1, 4.5, 4.8, 5.3, 5.8, 6.3, 6.7 ]

and

F = 10^2 × [ 1.2, 2.6, 4.3, 6.1, 7.3, 9.2, 11, 13, 14, 16, 18, 20, 21 ] .

Substituting our data into the normal equations leads to the solution


A = [w, ones(length(w),1)];
pl = (A'*A)\A'*F
pl =
   1.0e+004 *
   3.799600652696673
  -0.042276210924081

So the stiffness of the coupon based on the linear fit is approximately 37996 lb/in. Continuing our investigation, we realize that the data points appear to lie on an S-shaped curve, which suggests a linear regression with a cubic polynomial. This is easily accommodated in our model by taking

F(w) = p1 w^3 + p2 w^2 + p3 w + p4 .

The matrix A becomes

A = [ w1^3, w1^2, w1, 1; w2^3, w2^2, w2, 1; ...; wn^3, wn^2, wn, 1 ]

and the solution is

A = [w.^3, w.^2, w, ones(length(w),1)];
pc = (A'*A)\A'*F
pc =
   1.0e+006 *
  -7.000925471829832
   0.862362550370550
   0.006168259997214
  -0.000087120026727
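As a cross-check (not part of the original computation above), the same cubic coefficients should be reproduced by MATLAB's built-in polynomial fitting, which solves the same least-squares problem:

pc_check = polyfit(w, F, 3)   % cubic least-squares fit; compare with pc above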

Figure 9.17 shows the linear and cubic polynomial fit of the experimental data. The difference is somewhat inconspicuous, but plotting the residuals is quite enlightening. Figure 9.18 shows the residual for the linear and the cubic polynomial fit. The linear polynomial fit residual shows a clear bias in the form of a cubic curve. This indicates that a cubic polynomial would be a better fit. That is indeed true, as both the magnitude decreased and the bias was removed from the cubic-fit residual.

Fig. 9.17. Stainless steel 303 round coupon, and the load-deflection diagram (force [lb] versus deflection [in]). Linear polynomial fit on the left, cubic polynomial fit on the right.

Fig. 9.18. Stainless steel 303 round coupon load-deflection diagram: residual [lb] versus deflection [in]. Linear polynomial fit residual in dashed line, cubic polynomial fit in solid line.

Figure 9.19 shows the variation of the stiffness coefficient as a function of the deflection for both the linear and the cubic polynomial fit. It may be appreciated that the stiffness varies by a substantial amount when determined from the cubic fit, while it is constant based on the linear fit.

Fig. 9.19. Stainless steel 303 round coupon, and the load-deflection diagram. Stiffness coefficient [lb/in] as a function of deflection [in]. Dashed line: from linear polynomial fit, solid line: from cubic polynomial fit.

9.13.1 Geometry of least squares fitting

Let us now come back to the geometrical meaning of the least squares equations. The equation

b−Au = e

expresses that we cannot satisfy all the individual equations since there are more equations than there are unknown parameters: the vector b belongs to Rn, while u belongs to Rm, and we have m < n. In other words, the matrix A is rectangular (tall and skinny).

The geometry viewpoint would imagine b as a vector (arrow) in Rn. Each of the columns of the matrix A also represents a vector (arrow) in Rn. The product Au is a linear combination of the columns of the matrix A

Au = c1(A) u1 + c2(A) u2 + ... + cm(A) um ,

where c1(A) is used to mean column 1 of the matrix A and so on. To reach every single point of Rn, we would need n linearly independent basis vectors. Since there are only m columns of the matrix A, they cannot serve as such basis vectors, and the linear combination of the columns of the matrix A is only going to cover a subset of Rn. Inspect Figure 9.20: the columns of the matrix A generate the gray plane as a graphical representation of the subset of Rn. The vector b is of course not confined to the plane and somehow sticks out of it. The difference e between b and Au also sticks out. To make the error e as small as possible (as short as possible) then amounts to making it orthogonal to the gray plane Au. The shortest possible error e∗ = b − Au∗ will be orthogonal to all possible vectors in the gray plane, Au, as expressed here

(Au)^T e∗ = 0 .

Substituting we obtain

(Au)^T (b − Au∗) = 0

or

u^T A^T (b − Au∗) = u^T ( A^T b − A^T A u∗ ) = 0 .

When we say for all possible vectors in the gray plane, Au, we mean for all parameters u, and since the above equation must be true for all u, we have again the normal equations

A^T b − A^T A u∗ = 0 .

The solution to the normal equations is the set of parameters u∗ that makes the error of the least squares fitting as small as possible.
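The orthogonality statement can be verified numerically; a minimal sketch using the coupon data and the linear fit computed earlier (w and F are assumed to be column vectors, as in the fits above):

A = [w, ones(length(w),1)];   % matrix of the linear fit, as before
u = (A'*A)\(A'*F);            % parameters from the normal equations
e = F - A*u;                  % the shortest possible error
A'*e                          % ~ zero: the error is orthogonal to the columns of A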

Fig. 9.20. Least squares fitting: the geometrical relationships.

9.13.2 Solving least squares problems

A practical approach to the solution of the normal equations needs to consider numerical stability: the normal equations are relatively poorly conditioned as a rule.

A good approach is based on the QR factorization of the matrix A

A = QR .

Note that Q and R are computed by the so-called economy QR factorization: Q has the same dimensions as A, and R is a square upper triangular m × m matrix. We substitute this factorization into A^T A u = A^T b and we obtain

R^T Q^T Q R u = R^T Q^T b .

Canceling the product Q^T Q (the columns of Q are orthonormal, so Q^T Q = I) and multiplying on the left with (R^T)^−1 yields

R u = Q^T b .

The meaning is: project the vector b onto the orthonormal basis of the m columns of the Q matrix. Then solve (easily!) the upper triangular system for the fitting parameters. Thanks to the QR factorization all these operations are numerically stable.
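A minimal sketch of this procedure in MATLAB, using the economy QR factorization and the coupon data matrix from before (w and F again assumed to be column vectors):

A = [w, ones(length(w),1)];   % matrix of the linear fit
[Q, R] = qr(A, 0);            % economy QR factorization: Q is n x m, R is m x m
u = R\(Q'*F)                  % project b = F onto the columns of Q, then back-substitute

Incidentally, the backslash operator applied to the rectangular system, u = A\F, performs this kind of QR-based least-squares solution automatically.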


9.14 Annotated bibliography

1. R. Fletcher, Practical Methods of Optimization, second edition, John Wiley and Sons, 2000.
   Lucid presentation of the basics of unconstrained optimization.

2. P.Y. Papalambros, D.J. Wilde, Principles of Optimal Design, second edition, Cambridge University Press, 2000.
   Practical engineering treatment of both unconstrained and constrained optimization.

3. G. Strang, Linear Algebra and Its Applications, Brooks Cole, 4th edition, 2005.
   Great reference for the least-squares methodology.

