Numerical Analysis / Scientific Computing (CS450)
Andreas Kloeckner
Fall 2021
1
Outline
Introduction to Scientific Computing
  Notes
  Notes (unfilled, with empty boxes)
  About the Class
  Errors, Conditioning, Accuracy, Stability
  Floating Point
Systems of Linear Equations
Linear Least Squares
Eigenvalue Problems
Nonlinear Equations
Optimization
Interpolation
Numerical Integration and Differentiation
Initial Value Problems for ODEs
Boundary Value Problems for ODEs
Partial Differential Equations and Sparse Linear Algebra
Fast Fourier Transform
Additional Topics
2
What's the point of this class?
'Scientific Computing' describes a family of approaches to obtain approximate solutions to problems once they've been stated mathematically.
Name some applications:
Engineering simulation: e.g. drag from flow over airplane wings, behavior of photonic devices, radar scattering → differential equations (ordinary and partial)
Machine learning: statistical models with unknown parameters → optimization
Image and audio processing: enlargement/filtering → interpolation
Lots more!
3
What do we study, and how?
Problems with real numbers (i.e. continuous problems)
As opposed to discrete problems. Including: How can we put a real number into a computer? (And with what restrictions?)
What's the general approach?
Pick a representation (e.g. a polynomial). Existence/uniqueness?
4
What makes for good numerics?
How good of an answer can we expect to our problem?
Can't even represent numbers exactly. Answers will always be approximate. So it's natural to ask how far off the mark we really are.
How fast can we expect the computation to complete?
Aka, what algorithms do we use? What is the cost of those algorithms? Are they efficient? (I.e. do they make good use of available machine time?)
5
Implementation concerns
How do numerical methods get implemented
Like anything in computing: a layer cake of abstractions (“careful lies”).
What tools/languages are available? Are the methods easy to implement? If not, how do we make use of existing tools? How robust is our implementation (e.g. for error cases)?
6
Class web page
https://bit.ly/cs450-f21
Assignments: HW1, pre-lecture quizzes, in-lecture interactive content (bring computer or phone if possible)
Textbook, exams, class outline (with links to notes/demos/activities/quizzes), discussion forum, policies, video
7
Programming Language: Python/numpy
Reasonably readable. Reasonably beginner-friendly. Mainstream (top 5 in 'TIOBE Index'). Free, open-source. Great tools and libraries (not just) for scientific computing. Python 2/3? 3! numpy: provides an array datatype.
Will use this and matplotlib all the time. See class web page for learning materials.
Demo: Sum the squares of the integers from 0 to 100. First without numpy, then with numpy.
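A minimal sketch of what that demo might look like (illustrative, not the actual demo code):

    import numpy as np

    # Without numpy: loop over Python integers
    total = 0
    for i in range(101):
        total += i**2
    print(total)          # 338350

    # With numpy: vectorized operation on an array
    n = np.arange(101)
    print(np.sum(n**2))   # 338350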
8
Supplementary Material
Numpy (from the SciPy Lectures)
100 Numpy Exercises
Dive into Python3
9
Sources for these Notes
M.T. Heath, Scientific Computing: An Introductory Survey, Revised Second Edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2018.
CS 450 Notes by Edgar Solomonik
Various bits of prior material by Luke Olson
10
Open Source <3
These notes (and the accompanying demos) are open-source!
Bug reports and pull requests welcome: https://github.com/inducer/numerics-notes
Copyright (C) 2020 Andreas Kloeckner
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
11
What problems can we study in the first place?
To be able to compute a solution (through a process that introduces errors), the problem
needs to have a solution, that solution should be unique, and it should depend continuously on the inputs.
If it satisfies these criteria, the problem is called well-posed. Otherwise: ill-posed.
12
Dependency on Inputs
We excluded discontinuous problems, because we don't stand much chance for those. What if the problem's input dependency is just close to discontinuous?
We call those problems sensitive to their input data. Such problems are obviously trickier to deal with than non-sensitive ones.
Ideally, the computational method will not amplify the sensitivity.
13
Approximation
When does approximation happen?
Before computation: modeling, measurements of input data, computation of input data.
During computation: truncation / discretization, rounding.
Demo Truncation vs Rounding [cleared]
14
Example Surface Area of the Earth
Compute the surface area of the earth. What parts of your computation are approximate?
All of them: A = 4πr².
Earth isn't really a sphere. What does radius mean if the earth isn't a sphere? How do you compute with π? (By rounding/truncating.)
15
Measuring Error
How do we measure error? Idea: Consider all error as being added onto the result.
Absolute error = approximate value − true value
Relative error = absolute error / true value
Problem: True value not known. Estimate 'how big, at worst?' → establish upper bounds.
16
Recap Norms
What's a norm?
f(x): ℝⁿ → ℝ₀⁺, returns a 'magnitude' of the input vector. In symbols: often written ∥x∥.
Define norm.
A function ∥x∥: ℝⁿ → ℝ₀⁺ is called a norm if and only if
1. ∥x∥ > 0 ⇔ x ≠ 0,
2. ∥γx∥ = |γ| ∥x∥ for all scalars γ,
3. it obeys the triangle inequality: ∥x + y∥ ≤ ∥x∥ + ∥y∥.
17
Norms Examples
Examples of norms?
The so-called p-norms:
∥(x₁, …, xₙ)ᵀ∥_p = (|x₁|^p + ⋯ + |xₙ|^p)^(1/p)   (p ⩾ 1)
p = 1, 2, ∞ are particularly important.
Demo Vector Norms [cleared]
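A small numpy sketch of the p-norms (the example vector is an assumption):

    import numpy as np

    x = np.array([1.0, -2.0, 3.0])
    print(np.sum(np.abs(x)))        # 1-norm
    print(np.sqrt(np.sum(x**2)))    # 2-norm
    print(np.max(np.abs(x)))        # infinity-norm
    # Same results via numpy's built-in norm:
    for p in [1, 2, np.inf]:
        print(np.linalg.norm(x, p))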
18
Norms Which one
Does the choice of norm really matter much?
In finitely many dimensions, all norms are equivalent. I.e. for fixed n and two norms ∥·∥, ∥·∥∗, there exist α, β > 0 so that for all vectors x ∈ ℝⁿ:
α∥x∥ ≤ ∥x∥∗ ≤ β∥x∥.
So: No, doesn't matter that much. Will start mattering more for so-called matrix norms; see later.
19
Norms and Errors
If we're computing a vector result, the error is a vector. That's not a very useful answer to 'how big is the error?' What can we do?
Apply a norm!
How? Attempt 1:
Magnitude of error = ∥true value∥ − ∥approximate value∥
WRONG! (How does it fail?) Attempt 2:
Magnitude of error = ∥true value − approximate value∥
20
Forward/Backward Error
Suppose we want to compute y = f(x), but approximate ŷ = f̂(x).
What are the forward error and the backward error?
Forward error: Δy = ŷ − y.
Backward error: Imagine all error came from feeding the wrong input into a fully accurate calculation. Backward error is the difference between true and 'wrong' input. I.e. find the x̂ closest to x so that f(x̂) = ŷ; then Δx = x̂ − x.
[Figure: inputs x and x̂, outputs y = f(x) and ŷ = f(x̂); backward error between x and x̂, forward error between f(x) and ŷ.]
21
Forward/Backward Error: Example
Suppose you wanted y = √2 and got ŷ = 1.4.
What's the (magnitude of) the forward error?
|Δy| = |1.4 − 1.41421…| ≈ 0.0142
Relative forward error:
|Δy| / |y| = 0.0142 / 1.41421… ≈ 0.01
About 1 percent, or two accurate digits.
22
Forward/Backward Error: Example
Suppose you wanted y = √2 and got ŷ = 1.4.
What's the (magnitude of) the backward error?
Need x̂ so that f(x̂) = 1.4:
√1.96 = 1.4 ⇒ x̂ = 1.96.
Backward error:
|Δx| = |1.96 − 2| = 0.04.
Relative backward error:
|Δx| / |x| ≈ 0.02.
About 2 percent.
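A quick numerical check of these numbers (a sketch, with the values from the example):

    import numpy as np

    y = np.sqrt(2.0)
    y_hat = 1.4
    rel_fwd = abs(y_hat - y) / abs(y)     # relative forward error, ~0.01
    x_hat = y_hat**2                      # input that exactly yields 1.4
    rel_bwd = abs(x_hat - 2.0) / 2.0      # relative backward error, ~0.02
    print(rel_fwd, rel_bwd)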
23
Forward/Backward Error: Observations
What do you observe about the relative magnitude of the relative errors?
In this case: got smaller, i.e. variation damped out. Typically: not that lucky, input error amplified. If the backward error is smaller than the input error, the
result is “as good as possible”.
24
Sensitivity and Conditioning
What can we say about amplification of error?
Define the condition number as the smallest number κ so that
|rel. fwd. err.| ≤ κ · |rel. bwd. err.|
Or, somewhat sloppily, with x/y notation as in the previous example:
cond = max_x (|Δy| / |y|) / (|Δx| / |x|).
(Technically, should use 'supremum'.)
If the condition number is small, the problem is well-conditioned or insensitive; if large, the problem is ill-conditioned or sensitive.
Can also talk about the condition number for a single input x.
25
Example Condition Number of Evaluating a Function
y = f(x). Assume f differentiable.
κ = max_x (|Δy| / |y|) / (|Δx| / |x|)
Forward error:
Δy = f(x + Δx) − f(x) ≈ f′(x)Δx
Condition number:
κ ⩾ (|Δy| / |y|) / (|Δx| / |x|) = (|f′(x)| |Δx| / |f(x)|) · (|x| / |Δx|) = |x f′(x)| / |f(x)|.
Demo Conditioning of Evaluating tan [cleared]
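A small sketch in the spirit of that demo, evaluating κ(x) = |x f′(x)/f(x)| for f = tan (the grid of sample points is an assumption):

    import numpy as np

    x = np.linspace(1.0, 1.5, 6)
    # f(x) = tan(x), f'(x) = 1/cos(x)^2
    kappa = np.abs(x * (1 / np.cos(x))**2 / np.tan(x))
    print(kappa)   # grows rapidly as x approaches pi/2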
26
Stability and Accuracy
Previously: considered problems or questions. Next: consider methods, i.e. computational approaches to find solutions. When is a method accurate?
Closeness of method output to true answer for unperturbed input.
When is a method stable?
“A method is stable if the result it produces is the exact solution to a nearby problem.”
The above is commonly called backward stability and is a stricter requirement than just the temptingly simple:
If the method's sensitivity to variation in the input is no (or not much) greater than that of the problem itself.
Note: Necessarily includes insensitivity to variation in intermediate results.
27
Getting into Trouble with Accuracy and Stability
How can I produce inaccurate results?
Apply an inaccurate method. Apply an unstable method to a well-conditioned problem. Apply any type of method to an ill-conditioned problem.
28
In-Class Activity: Forward/Backward Error
In-class activity: Forward/Backward Error
29
Wanted: Real Numbers... in a computer
Computers can represent integers using bits:
23 = 1·2⁴ + 0·2³ + 1·2² + 1·2¹ + 1·2⁰ = (10111)₂
How would we represent fractions?
Idea: Keep going down past zero exponent:
23.625 = 1·2⁴ + 0·2³ + 1·2² + 1·2¹ + 1·2⁰ + 1·2⁻¹ + 0·2⁻² + 1·2⁻³
So: Could store a fixed number of bits with exponents ⩾ 0 and a fixed number of bits with exponents < 0.
This is called fixed-point arithmetic.
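A tiny illustration of the fixed-point idea in Python (the choice of 3 fractional bits is an assumption for this example):

    value = 23.625
    stored = int(value * 2**3)   # keep value * 2^3 as an integer
    print(bin(stored))           # 0b10111101, i.e. (10111.101)_2
    print(stored / 2**3)         # recover 23.625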
30
Fixed-Point Numbers
Suppose we use units of 64 bits, with 32 bits for exponents ⩾ 0 and 32 bits for exponents < 0. What numbers can we represent?
2³¹ ⋯ 2⁰ . 2⁻¹ ⋯ 2⁻³²
Smallest: 2⁻³² ≈ 10⁻¹⁰
Largest: 2³¹ + ⋯ + 2⁻³² ≈ 10⁹
How many 'digits' of relative accuracy (think: relative rounding error) are available for the smallest vs. the largest number?
For large numbers: about 19. For small numbers: few or none.
Idea: Instead of fixing the location of the 0 exponent, let it float.
31
Floating Point Numbers
Convert 13 = (1101)₂ into floating point representation.
13 = 2³ + 2² + 2⁰ = (1.101)₂ · 2³
What pieces do you need to store an FP number?
Significand: (1.101)₂
Exponent: 3
32
Floating Point: Implementation, Normalization
Previously: considered the mathematical view of FP (via example (1.101)₂). Next: consider the implementation of FP in hardware. Do you notice a source of inefficiency in our number representation?
Idea: Notice that the leading digit (in binary) of the significand is always one. Only store '101'. Final storage format:
Significand: 101 (a fixed number of bits)
Exponent: 3 (a signed integer allowing a certain range)
The exponent is most often stored as a positive 'offset' from a certain negative number, e.g.
3 = −1023 (implicit offset) + 1026 (stored).
Actually stored: 1026, a positive integer.
33
Unrepresentable numbers?
Can you think of a somewhat central number that we cannot represent as
x = (1._________)₂ · 2⁻ᵖ?
Zero. Which is somewhat embarrassing.
Core problem: the implicit 1. It's a great idea, were it not for this issue.
Have to break the pattern. Idea: Declare one exponent 'special', and turn off the leading one for that exponent (say, −1023, aka stored exponent 0).
For all larger exponents, the leading one remains in effect.
Bonus Q: With this convention, what is the binary representation of a zero?
Demo Picking apart a floating point number [cleared] 34
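A sketch of what such a demo could do for an IEEE double, using Python's struct module (field layout per IEEE 754 binary64):

    import struct

    bits = struct.unpack('<Q', struct.pack('<d', 13.0))[0]
    sign = bits >> 63
    stored_exponent = (bits >> 52) & 0x7FF        # 11 bits, offset 1023
    significand_bits = bits & ((1 << 52) - 1)     # leading 1 is implicit
    print(sign, stored_exponent, stored_exponent - 1023)
    # -> 0 1026 3, i.e. 13.0 = (1.101)_2 * 2^3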
Subnormal Numbers
What is the smallest representable number in an FP system with 4 stored bits in the significand and a stored exponent range of [−7, 8]?
First attempt:
Significand as small as possible → all zeros after the implicit leading one,
Exponent as small as possible: −7.
So:
(1.0000)₂ · 2⁻⁷
Unfortunately: wrong.
35
Subnormal Numbers, Attempt 2
What is the smallest representable number in an FP system with 4 stored bits in the significand and a (stored) exponent range of [−7, 8]?
Can go way smaller, using the special exponent (which turns off the leading one).
Assume that the special exponent is −7. So: (0.0001)₂ · 2⁻⁷ (with all four digits stored).
Numbers with the special exponent are called subnormal (or denormal) FP numbers. Technically, zero is also a subnormal.
Why learn about subnormals?
Subnormal FP is often slow (not implemented in hardware). Many compilers support options to 'flush subnormals to zero'.
36
Underflow
FP systems without subnormals will underflow (return 0) as soon as the exponent range is exhausted.
The smallest representable normal number is called the underflow level, or UFL.
Beyond the underflow level, subnormals provide for gradual underflow by 'keeping going' as long as there are bits in the significand, but it is important to note that subnormals don't have as many accurate digits as normal numbers. (Read a story on the epic battle about gradual underflow.)
Analogously (but much more simply, no 'supernormals'): the overflow level, OFL.
37
Rounding Modes
How is rounding performed? (Imagine trying to represent π.)
(1.1101010 | 11)₂, where the part before the '|' is representable.
“Chop”, aka round-to-zero: (1.1101010)₂
Round-to-nearest: (1.1101011)₂ (most accurate)
What is done in case of a tie? 0.5 = (0.1)₂ ('nearest'?)
Up or down? It turns out that picking the same direction every time introduces bias. Trick: round-to-even:
0.5 → 0,  1.5 → 2.
Demo: Density of Floating Point Numbers [cleared]
Demo: Floating Point vs Program Logic [cleared]
38
Smallest Numbers Above
What is the smallest FP number > 1? Assume 4 bits in the significand.
(1.0001)₂ · 2⁰ = x · (1 + (0.0001)₂), with x = 1
What's the smallest FP number > 1024 in that same system?
(1.0001)₂ · 2¹⁰ = x · (1 + (0.0001)₂), with x = 1024
Can we give that number a name
39
Unit Roundoff
Unit roundoff, or machine precision, or machine epsilon, or ε_mach, is the smallest number such that
float(1 + ε) > 1.
Technically, that makes ε_mach depend on the rounding rule.
Assuming round-towards-infinity in the above system,
ε_mach = (0.00001)₂.
Note the extra zero! Another related quantity is ULP, or unit in the last place.
(ε_mach = 0.5 ULP)
40
FP: Relative Rounding Error
What does this say about the relative error incurred in floating point calculations?
The factor to get from one FP number to the next larger one is (mostly) independent of magnitude: 1 + ε_mach.
Since we can't represent any results between x and x · (1 + ε_mach), that's really the minimum error incurred.
In terms of relative error:
|x̃ − x| / |x| = |x(1 + ε_mach) − x| / |x| = ε_mach.
At least theoretically, ε_mach is the maximum relative error in any FP operation. (Practical implementations do fall short of this.)
41
FP Machine Epsilon
What's that same number for double-precision floating point (52 bits in the significand)?
2⁻⁵³ ≈ 10⁻¹⁶
Bonus Q: What does 1 + 2⁻⁵³ do on your computer? Why?
We can expect FP math to consistently introduce little relative errors of about 10⁻¹⁶.
Working in double precision gives you about 16 (decimal) accurate digits.
Demo Floating Point and the Harmonic Series [cleared]
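A quick check of the bonus question on any IEEE-double machine (a sketch):

    import numpy as np

    print(1.0 + 2.0**-53 == 1.0)       # True: rounds back to 1
    print(1.0 + 2.0**-52 == 1.0)       # False: one ULP above 1
    print(np.finfo(np.float64).eps)    # 2.220446049250313e-16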
42
In-Class Activity Floating Point
In-class activity Floating Point
43
Implementing Arithmetic
How is floating point addition implemented? Consider adding a = (1.101)₂ · 2¹ and b = (1.001)₂ · 2⁻¹ in a system with three bits in the significand.
Rough algorithm:
1. Bring both numbers onto a common exponent,
2. Do grade-school addition from the front, until you run out of digits in your system,
3. Round the result.
a = 1.101 · 2¹
b = 0.01001 · 2¹
a + b ≈ 1.111 · 2¹
44
Problems with FP Addition
What happens if you subtract two numbers of very similar magnitude? As an example, consider a = (1.1011)₂ · 2¹ and b = (1.1010)₂ · 2¹.
a = 1.1011 · 2¹
b = 1.1010 · 2¹
a − b ≈ 0.0001 · 2¹,
or, once we normalize, 1.??? · 2⁻³.
There is no data to indicate what the missing digits should be.
→ The machine fills them with its 'best guess', which is often not good.
This phenomenon is called Catastrophic Cancellation.
Demo Catastrophic Cancellation [cleared] 45
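A quick sketch of cancellation in double precision (the value 1e-12 is an arbitrary choice):

    eps = 1e-12
    a = 1.0 + eps
    b = 1.0
    print(a - b)   # e.g. 1.000088900582341e-12: only ~4 digits correct
    # The true difference is 1e-12; the subtraction exposed the rounding error in a.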
Supplementary Material
Josh Haberman, Floating Point Demystified, Part 1
David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic
46
Outline
Introduction to Scientific Computing
Systems of Linear Equations
  Theory: Conditioning
  Methods to Solve Systems
  LU: Application and Implementation
Linear Least Squares
Eigenvalue Problems
Nonlinear Equations
Optimization
Interpolation
Numerical Integration and Differentiation
Initial Value Problems for ODEs
Boundary Value Problems for ODEs
Partial Differential Equations and Sparse Linear Algebra
Fast Fourier Transform
Additional Topics
47
Solving a Linear System
Given: an m × n matrix A and an m-vector b.
What are we looking for here, and when are we allowed to ask the question?
Want: an n-vector x so that Ax = b, i.e. a linear combination of the columns of A that yields b. Restrict to the square case (m = n) for now. Even with that, a solution may not exist, or may not be unique.
A unique solution exists iff A is nonsingular.
Next: Want to talk about the conditioning of this operation. Need to measure distances of matrices.
48
Matrix Norms
What norms would we apply to matrices?
Could use: “flatten” the matrix as a vector, use a vector norm. But we won't: not very meaningful.
Instead: Choose norms for matrices to interact with an 'associated' vector norm ∥·∥, so that ∥A∥ obeys
∥Ax∥ ≤ ∥A∥ ∥x∥.
For a given vector norm, define the induced matrix norm ∥·∥:
∥A∥ = max_{∥x∥=1} ∥Ax∥.
For each vector norm, we get a different matrix norm; e.g. for the vector 2-norm ∥x∥₂ we get a matrix 2-norm ∥A∥₂.
49
Intuition for Matrix Norms
Provide some intuition for the matrix norm.
max_{x≠0} ∥Ax∥ / ∥x∥ = max_{x≠0} ∥A(x / ∥x∥)∥ = max_{∥y∥=1} ∥Ay∥ = ∥A∥.
I.e. the matrix norm gives the maximum (relative) growth of the vector norm after multiplying a vector by A.
50
Identifying Matrix Norms
What is ∥A∥₁? ∥A∥∞?
∥A∥₁ = max_{col j} Σ_{row i} |A_ij|,  ∥A∥∞ = max_{row i} Σ_{col j} |A_ij|.
2-norm: actually fairly difficult to evaluate. See in a bit.
How do matrix and vector norms relate for n × 1 matrices?
They agree. Why? For n × 1, the 'vector' x in Ax is just a scalar:
max_{∥x∥=1} ∥Ax∥ = max_{x∈{−1,1}} ∥Ax∥ = ∥A(:, 1)∥.
This can help to remember the 1- and ∞-norm.
Demo: Matrix norms [cleared]
51
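A small numpy check of the 1- and ∞-norm formulas (example matrix assumed):

    import numpy as np

    A = np.array([[1., -2.], [3., 4.]])
    print(np.max(np.sum(np.abs(A), axis=0)), np.linalg.norm(A, 1))       # max column sum
    print(np.max(np.sum(np.abs(A), axis=1)), np.linalg.norm(A, np.inf))  # max row sum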
Properties of Matrix Norms
Matrix norms inherit the vector norm properties:
∥A∥ > 0 ⇔ A ≠ 0,
∥γA∥ = |γ| ∥A∥ for all scalars γ,
triangle inequality: ∥A + B∥ ≤ ∥A∥ + ∥B∥.
But also some more properties that stem from our definition:
∥Ax∥ ≤ ∥A∥ ∥x∥,
∥AB∥ ≤ ∥A∥ ∥B∥ (an easy consequence).
Both of these are called submultiplicativity of the matrix norm.
52
Conditioning
What is the condition number of solving a linear system Ax = b?
Input: b, with error Δb. Output: x, with error Δx.
Observe A(x + Δx) = b + Δb, so AΔx = Δb.
(rel. err. in output) / (rel. err. in input)
= (∥Δx∥ / ∥x∥) / (∥Δb∥ / ∥b∥)
= (∥Δx∥ ∥b∥) / (∥Δb∥ ∥x∥)
= (∥A⁻¹Δb∥ ∥Ax∥) / (∥Δb∥ ∥x∥)
≤ (∥A⁻¹∥ ∥A∥ ∥Δb∥ ∥x∥) / (∥Δb∥ ∥x∥)
= ∥A⁻¹∥ ∥A∥.
53
Conditioning of Linear Systems Observations
Showed κ(solve Ax = b) ≤ ∥A⁻¹∥ ∥A∥, i.e. found an upper bound on the condition number. With a little bit of fiddling, it's not too hard to find examples that achieve this bound, i.e. that show it is sharp.
So we've found the condition number of linear system solving, also called the condition number of the matrix A:
cond(A) = κ(A) = ∥A∥ ∥A⁻¹∥.
54
Conditioning of Linear Systems: More properties
cond is relative to a given norm. So, to be precise, use cond₂ or cond∞.
If A⁻¹ does not exist: cond(A) = ∞ by convention.
What is κ(A⁻¹)?
κ(A).
What is the condition number of matrix-vector multiplication?
κ(A), because it is equivalent to solving with A⁻¹.
Demo: Condition number visualized [cleared]
Demo: Conditioning of 2x2 Matrices [cleared]
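A sketch in the spirit of those demos: perturb the right-hand side and compare the error amplification to cond(A) (random test data assumed):

    import numpy as np

    rng = np.random.default_rng(5)
    A = rng.normal(size=(4, 4))
    x = rng.normal(size=4)
    b = A @ x
    db = 1e-8 * rng.normal(size=4)                 # perturb the RHS
    dx = np.linalg.solve(A, b + db) - x
    amp = (np.linalg.norm(dx) / np.linalg.norm(x)) / (np.linalg.norm(db) / np.linalg.norm(b))
    print(amp, np.linalg.cond(A, 2))               # amp <= cond_2(A)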
55
Residual Vector
What is the residual vector of solving the linear system b = Ax?
It's the thing that's 'left over'. Suppose our approximate solution is x̂. Then the residual vector is
r = b − Ax̂.
56
Residual and Error Relationship
How do the (norms of the) residual vector r and the error Δx = x̂ − x relate to one another?
∥Δx∥ = ∥x̂ − x∥ = ∥A⁻¹(b − Ax̂)∥ = ∥A⁻¹r∥.
Divide both sides by ∥x̂∥:
∥Δx∥ / ∥x̂∥ = ∥A⁻¹r∥ / ∥x̂∥ ≤ ∥A⁻¹∥ ∥r∥ / ∥x̂∥ = cond(A) ∥r∥ / (∥A∥ ∥x̂∥) ≤ cond(A) ∥r∥ / ∥Ax̂∥.
rel. err. ≤ cond · rel. resid. Given a small (rel.) residual, the (rel.) error is only (guaranteed to be) small if the condition number is also small.
57
Changing the Matrix
So far, only discussed changing the RHS, i.e. Ax = b → Ax̂ = b̂. The matrix consists of FP numbers too; it, too, is approximate. I.e.
Ax = b → Âx̂ = b.
What can we say about the error due to an approximate matrix?
Consider
Δx = x̂ − x = A⁻¹(Ax̂ − b) = A⁻¹(Ax̂ − Âx̂) = −A⁻¹ ΔA x̂.
Thus
∥Δx∥ ≤ ∥A⁻¹∥ ∥ΔA∥ ∥x̂∥,
and we get
∥Δx∥ / ∥x̂∥ ≤ cond(A) ∥ΔA∥ / ∥A∥.
58
Changing Condition Numbers
Once we have a matrix A in a linear system Ax = b, are we stuck with its condition number? Or could we improve it?
Diagonal scaling is a simple strategy that sometimes helps:
Row-wise: DAx = Db
Column-wise: AD x̃ = b. Different x̃! Recover x = D x̃.
What is this called as a general concept?
Preconditioning.
Left preconditioning: MAx = Mb
Right preconditioning: AM x̃ = b. Different x̃! Recover x = M x̃.
59
In-Class Activity Matrix Norms and Conditioning
In-class activity Matrix Norms and Conditioning
60
Singular Value Decomposition (SVD)
What is the Singular Value Decomposition of an m × n matrix?
A = UΣVᵀ, with:
U: m × m and orthogonal. Columns called the left singular vectors.
Σ = diag(σᵢ): m × n and non-negative. Typically σ₁ ≥ σ₂ ≥ ⋯ ≥ σ_{min(m,n)} ≥ 0. Called the singular values.
V: n × n and orthogonal. Columns called the right singular vectors.
Existence? Computation? Not yet, later.
61
Computing the 2-Norm
Using the SVD of A, identify the 2-norm.
A = UΣVᵀ with U, V orthogonal. The 2-norm satisfies ∥QB∥₂ = ∥B∥₂ = ∥BQ∥₂ for any matrix B and orthogonal Q. So ∥A∥₂ = ∥Σ∥₂ = σ_max.
Express the matrix condition number cond₂(A) in terms of the SVD.
A⁻¹ has singular values 1/σᵢ, so cond₂(A) = ∥A∥₂ ∥A⁻¹∥₂ = σ_max / σ_min.
62
Not a matrix norm: Frobenius
The 2-norm is very costly to compute. Can we make something simpler?
∥A∥_F = sqrt( Σ_{i=1}^m Σ_{j=1}^n |a_ij|² )
is called the Frobenius norm.
What about its properties?
Satisfies the matrix norm properties (proofs via Cauchy-Schwarz): definiteness, scaling, triangle inequality, submultiplicativity.
63
Frobenius Norm Properties
Is the Frobenius norm induced by any vector norm?
Can't be: What's ∥I∥_F? What's ∥I∥ for an induced norm?
How does it relate to the SVD?
∥A∥_F = sqrt( Σ_{i=1}^n σᵢ² ).
(Proof?)
64
Solving Systems: Simple cases
Solve Dx = b if D is diagonal. (Computational cost?)
xᵢ = bᵢ / Dᵢᵢ, with cost O(n).
Solve Qx = b if Q is orthogonal. (Computational cost?)
x = Qᵀb, with cost O(n²).
Given the SVD A = UΣVᵀ, solve Ax = b. (Computational cost?)
Compute z = Uᵀb. Solve Σy = z (diagonal). Compute x = Vy.
Cost: O(n²) to solve, and O(n³) to compute the SVD.
65
Solving Systems: Triangular matrices
Solve
[a₁₁ a₁₂ a₁₃ a₁₄]   [x]   [b₁]
[    a₂₂ a₂₃ a₂₄] · [y] = [b₂]
[        a₃₃ a₃₄]   [z]   [b₃]
[            a₄₄]   [w]   [b₄]
Rewrite as individual equations. This process is called back-substitution. The analogous process for lower triangular matrices is called forward substitution.
Demo: Coding back-substitution [cleared] (a short sketch follows below)
What about non-triangular matrices?
Can do Gaussian elimination, just like in linear algebra class.
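A minimal back-substitution sketch (no singularity checks):

    import numpy as np

    def back_substitution(U, b):
        # Solve Ux = b for upper triangular U, last row first
        n = len(b)
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (b[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
        return x

    U = np.array([[2., 1., 3.], [0., 4., 1.], [0., 0., 5.]])
    b = np.array([1., 2., 3.])
    print(np.allclose(U @ back_substitution(U, b), b))   # True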
66
Gaussian Elimination
Demo: Vanilla Gaussian Elimination [cleared]
What do we get by doing Gaussian elimination?
Row echelon form (REF).
How is that different from being upper triangular?
REF reveals the rank of the matrix. REF can take “multiple column-steps” to the right per row.
What if we do not just eliminate downward, but also upward?
That's called Gauss-Jordan elimination. Turns out to be computationally inefficient; we won't look at it.
67
LU Factorization
What is the LU factorization?
A factorization A = LU, with
L lower triangular, unit diagonal,
U upper triangular.
68
Solving Ax = b
Does LU help solve Ax = b?
Ax = b ⇔ L(Ux) = b. Let y = Ux:
Ly = b ← solvable by forward substitution,
Ux = y ← solvable by backward substitution.
Now we know the x that solves Ax = b.
69
Determining an LU factorization
Write
[a₁₁ a₁₂ᵀ]   [ 1       ] [u₁₁ u₁₂ᵀ]
[a₂₁ A₂₂ ] = [l₂₁  L₂₂ ] [     U₂₂ ]
Clear: u₁₁ = a₁₁ and u₁₂ᵀ = a₁₂ᵀ.
a₂₁ = u₁₁ l₂₁, or l₂₁ = a₂₁ / u₁₁.
A₂₂ = l₂₁u₁₂ᵀ + L₂₂U₂₂, or L₂₂U₂₂ = A₂₂ − l₂₁u₁₂ᵀ.
Demo LU Factorization [cleared] 70
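A recursive sketch of exactly this recipe (no pivoting, so it assumes nonzero pivots):

    import numpy as np

    def lu(A):
        # A = LU with unit lower triangular L, upper triangular U
        n = A.shape[0]
        L, U = np.eye(n), np.zeros((n, n))
        U[0, :] = A[0, :]                      # u11 and u12^T
        L[1:, 0] = A[1:, 0] / U[0, 0]          # l21 = a21 / u11
        if n > 1:                              # L22 U22 = A22 - l21 u12^T
            L[1:, 1:], U[1:, 1:] = lu(A[1:, 1:] - np.outer(L[1:, 0], U[0, 1:]))
        return L, U

    A = np.array([[4., 3., 2.], [8., 7., 9.], [6., 5., 2.]])
    L, U = lu(A)
    print(np.allclose(L @ U, A))   # True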
Computational Cost
What is the computational cost of multiplying two n × n matrices?
O(n³).
Recall the LU steps: u₁₁ = a₁₁, u₁₂ᵀ = a₁₂ᵀ, l₂₁ = a₂₁ / u₁₁, L₂₂U₂₂ = A₂₂ − l₂₁u₁₂ᵀ.
What is the computational cost of carrying out LU factorization on an n × n matrix?
O(n²) for each step, n − 1 of those steps: O(n³).
Demo Complexity of Mat-Mat multiplication and LU [cleared]
71
LU: Failure Cases?
Is LU/Gaussian elimination bulletproof?
Not bulletproof:
A = [0 1; 2 1].
Q: Is this a problem with the process, or with the entire idea of LU?
[ 1    ] [u₁₁ u₁₂]   [0 1]
[ℓ₂₁ 1 ] [    u₂₂] = [2 1]
→ u₁₁ = 0, yet u₁₁ · ℓ₂₁ (= 0) would have to equal 2.
It turns out that A doesn't have an LU factorization. LU has exactly one failure mode: the division when u₁₁ = 0.
72
Saving the LU Factorization
What can be done to get something like an LU factorization?
Idea from linear algebra class: In Gaussian elimination, simply swap rows; the result is an equivalent linear system.
Good idea: Swap rows if there's a zero in the way.
Even better idea: Find the largest entry (by absolute value), swap it to the top row.
The entry we divide by is called the pivot.
Swapping rows to get a bigger pivot is called partial pivoting.
Swapping rows and columns to get an even bigger pivot is called complete pivoting. (Downside: additional O(n²) cost to find the pivot.)
Demo: LU Factorization with Partial Pivoting [cleared]
73
Cholesky: LU for Symmetric Positive Definite
LU can be used for SPD matrices. But can we do better?
Cholesky factorization: A = LLᵀ, i.e. like LU, but using Lᵀ for U:
[ℓ₁₁     ] [ℓ₁₁ l₂₁ᵀ]   [a₁₁ a₂₁ᵀ]
[l₂₁ L₂₂ ] [    L₂₂ᵀ] = [a₂₁ A₂₂ ]
ℓ₁₁² = a₁₁, then ℓ₁₁ l₂₁ = a₂₁, i.e. l₂₁ = a₂₁ / ℓ₁₁. Finally,
l₂₁l₂₁ᵀ + L₂₂L₂₂ᵀ = A₂₂, or L₂₂L₂₂ᵀ = A₂₂ − l₂₁l₂₁ᵀ.
Fails if a₁₁ is negative at any point (⇒ A not SP(semi)D). If a₁₁ is zero, A is positive semidefinite.
Cheaper than LU: no pivoting, only one factor to compute.
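A short Cholesky sketch following the recipe above (assumes A is SPD):

    import numpy as np

    def cholesky(A):
        n = A.shape[0]
        L = np.zeros_like(A)
        S = A.copy()
        for k in range(n):
            L[k, k] = np.sqrt(S[k, k])                         # l11 = sqrt(a11)
            L[k+1:, k] = S[k+1:, k] / L[k, k]                  # l21 = a21 / l11
            S[k+1:, k+1:] -= np.outer(L[k+1:, k], L[k+1:, k])  # A22 - l21 l21^T
        return L

    A = np.array([[4., 2.], [2., 3.]])
    L = cholesky(A)
    print(np.allclose(L @ L.T, A))   # True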
74
More cost concerns
What's the cost of solving Ax = b?
LU: O(n³). FW/BW substitution: 2 × O(n²) = O(n²).
What's the cost of solving Ax = b₁, b₂, …, bₙ?
LU: O(n³). FW/BW substitution: 2n × O(n²) = O(n³).
What's the cost of finding A⁻¹?
Same as solving AX = I, so still O(n³).
75
Cost: Worrying about the Constant, BLAS
O(n³) really means
α·n³ + β·n² + γ·n + δ.
All the non-leading terms and constants are swept under the rug. But at least the leading constant ultimately matters.
Shrinking the constant: surprisingly hard (even for 'just' matmul).
Idea: Rely on library implementation: BLAS (Fortran).
Level 1: z = αx + y, vector-vector operations, O(n), e.g. axpy
Level 2: z = Ax + y, matrix-vector operations, O(n²), e.g. gemv
Level 3: C = AB + βC, matrix-matrix operations, O(n³), e.g. gemm, trsm
Show (using perf): numpy matmul calls BLAS dgemm.
76
LAPACK
LAPACK: Implements 'higher-end' things (such as LU) using BLAS.
Special matrix formats can also help save constants significantly, e.g.: banded, sparse, symmetric, triangular.
Sample routine names: dgesvd, zgesdd, dgetrf, dgetrs.
77
LU on Blocks: The Schur Complement
Given a matrix
[A B]
[C D],
can we do 'block LU' to get a block triangular matrix?
Multiply the top row by −CA⁻¹, add to the second row; this gives
[A B          ]
[0 D − CA⁻¹B].
D − CA⁻¹B is called the Schur complement. Block pivoting is also possible, if needed.
78
LU: Special cases
What happens if we feed a non-invertible matrix to LU?
PA = LU (with L invertible, U not invertible. Why?)
What happens if we feed LU an m × n non-square matrix?
Think carefully about the sizes of the factors and which columns/rows do/don't matter. Two cases:
m > n (tall & skinny): L is m × n, U is n × n.
m < n (short & fat): L is m × m, U is m × n.
This is called reduced LU factorization.
79
Round-off Error in LU without Pivoting
Consider the factorization of
[ε 1]
[1 1],
where ε < ε_mach.
Without pivoting: L = [1 0; 1/ε 1], U = [ε 1; 0 1 − 1/ε].
Rounding: fl(U) = [ε 1; 0 −1/ε].
This leads to L · fl(U) = [ε 1; 1 0], a backward error of [0 0; 0 1].
80
Round-off Error in LU with Pivoting
Permuting the rows of A in partial pivoting gives PA = [1 1; ε 1].
We now compute L = [1 0; ε 1], U = [1 1; 0 1 − ε], so
fl(U) = [1 1; 0 1].
This leads to L · fl(U) = [1 1; ε 1 + ε], a backward error of
[0 0]
[0 ε].
81
Changing matrices
Seen: LU is cheap to re-solve if the RHS changes. (Able to keep the expensive bit, the LU factorization.) What if the matrix changes?
Special cases allow something to be done (a so-called rank-one update):
Â = A + uvᵀ.
The Sherman-Morrison formula gives us
(A + uvᵀ)⁻¹ = A⁻¹ − (A⁻¹uvᵀA⁻¹) / (1 + vᵀA⁻¹u).
Proof: Multiply the above by Â, get the identity.
FYI: There is a rank-k analog called the Sherman-Morrison-Woodbury formula.
Demo Sherman-Morrison [cleared] 82
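A numerical check of the formula (a sketch with random data):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    A = rng.normal(size=(n, n))
    u, v = rng.normal(size=n), rng.normal(size=n)
    Ainv = np.linalg.inv(A)
    correction = Ainv @ np.outer(u, v) @ Ainv / (1 + v @ Ainv @ u)
    print(np.allclose(Ainv - correction, np.linalg.inv(A + np.outer(u, v))))   # True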
In-Class Activity LU
In-class activity LU and Cost
83
Outline
Introduction to Scientific Computing
Systems of Linear Equations
Linear Least Squares
  Introduction
  Sensitivity and Conditioning
  Solving Least Squares
Eigenvalue Problems
Nonlinear Equations
Optimization
Interpolation
Numerical Integration and Differentiation
Initial Value Problems for ODEs
Boundary Value Problems for ODEs
Partial Differential Equations and Sparse Linear Algebra
Fast Fourier Transform
Additional Topics
84
What about non-square systems?
Specifically, what about linear systems with 'tall and skinny' matrices? (A: m × n with m > n, aka overdetermined linear systems)
Specifically, any hope that we will solve those exactly?
Not really: more equations than unknowns.
85
Example: Data Fitting: Too much data
Lots of equations, but not many unknowns:
f(x) = ax² + bx + c.
Only three parameters to set. What are the 'right' a, b, c?
Want the 'best' solution.
Have data (xᵢ, yᵢ) and model
y(x) = α + βx + γx².
Find α, β, γ so that the model (best) fits the data.
86
Data Fitting, Continued
α + βx₁ + γx₁² = y₁
⋮
α + βxₙ + γxₙ² = yₙ
Not going to happen (exactly) for n > 3. Instead:
|α + βx₁ + γx₁² − y₁|² + ⋯ + |α + βxₙ + γxₙ² − yₙ|² → min!
→ Least Squares. This is called linear least squares, specifically because the coefficients x enter linearly into the residual.
87
Rewriting Data Fitting
Rewrite in matrix form:
∥Ax − b∥₂² → min,
with
A = [1 x₁ x₁²; ⋮; 1 xₙ xₙ²],  x = [α; β; γ],  b = [y₁; ⋮; yₙ].
Matrices like A are called Vandermonde matrices. Easy to generalize to higher polynomial degrees.
88
Least Squares: The Problem in Matrix Form
∥Ax − b∥₂² → min
is cumbersome to write. Invent new notation, defined to be equivalent:
Ax ≅ b.
NOTE: Data fitting is one example where LSQ problems arise. Many other applications lead to Ax ≅ b, with different matrices.
89
Data Fitting: Nonlinearity
Give an example of a nonlinear data fitting problem.
|exp(α) + βx₁ + γx₁² − y₁|² + ⋯ + |exp(α) + βxₙ + γxₙ² − yₙ|² → min
But that would be easy to remedy: do linear least squares with exp(α) as the unknown. More difficult:
|α + exp(βx₁ + γx₁²) − y₁|² + ⋯ + |α + exp(βxₙ + γxₙ²) − yₙ|² → min
Demo: Interactive Polynomial Fit [cleared]
90
Properties of Least-Squares
Consider the LSQ problem Ax ≅ b and its associated objective function φ(x) = ∥b − Ax∥₂². Assume A has full rank. Does this always have a solution?
Yes: φ ⩾ 0, φ → ∞ as ∥x∥ → ∞, φ continuous ⇒ it has a minimum.
Is it always unique?
No: for example, if A has a nullspace.
91
Least-Squares Finding a Solution by Minimization
Examine the objective function, find its minimum.
φ(x) = (b − Ax)ᵀ(b − Ax) = bᵀb − 2xᵀAᵀb + xᵀAᵀAx
∇φ(x) = −2Aᵀb + 2AᵀAx
∇φ(x) = 0 yields AᵀAx = Aᵀb. Called the normal equations.
92
Least squares Demos
Demo Polynomial fitting with the normal equations [cleared]
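A sketch in the spirit of that demo: fit a quadratic via the normal equations (the synthetic data are an assumption):

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(-1, 1, 50)
    y = 2 - x + 3*x**2 + 0.05*rng.normal(size=x.size)   # noisy quadratic
    A = np.stack([x**0, x, x**2], axis=1)               # Vandermonde matrix
    coeffs = np.linalg.solve(A.T @ A, A.T @ y)          # A^T A x = A^T b
    print(coeffs)   # close to [2, -1, 3]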
What's the shape of AᵀA?
Always square.
Demo Issues with the normal equations [cleared]
93
Least Squares: Viewed Geometrically
Why is r ⊥ span(A) a good thing to require?
Because then the distance between y = Ax and b is minimal. Q: Why? Because of Pythagoras's theorem: another y would mean additional distance traveled in span(A).
94
Least Squares: Viewed Geometrically (II)
Phrase the Pythagoras observation as an equation.
span(A) ⊥ (b − Ax)  ⇔  Aᵀb − AᵀAx = 0.
Congratulations: just rediscovered the normal equations!
Write that with an orthogonal projection matrix P:
Ax = Pb.
95
About Orthogonal Projectors
What is a projector?
A matrix satisfying P² = P.
What is an orthogonal projector?
A symmetric projector.
How do I make one projecting onto span{q₁, q₂, …, q_ℓ} for orthogonal qᵢ?
First define Q = [q₁ q₂ ⋯ q_ℓ]. Then
QQᵀ
will project, and is obviously symmetric.
96
Least Squares and Orthogonal Projection
Check that P = A(AᵀA)⁻¹Aᵀ is an orthogonal projector onto colspan(A).
P² = A(AᵀA)⁻¹AᵀA(AᵀA)⁻¹Aᵀ = A(AᵀA)⁻¹Aᵀ = P.
Symmetry: also yes.
Onto colspan(A)? The last matrix applied is A → the result of Px must be in colspan(A).
Conclusion: P is the projector from the previous slide!
What assumptions do we need to define the P from the last question?
AᵀA has full rank (i.e. is invertible).
97
Pseudoinverse
What is the pseudoinverse of A?
A nonsquare m × n matrix A has no inverse in the usual sense. If rank(A) = n, the pseudoinverse is A⁺ = (AᵀA)⁻¹Aᵀ. (The colspan-projector with the final A missing.)
What can we say about the condition number in the case of a tall-and-skinny, full-rank matrix?
cond₂(A) = ∥A∥₂ ∥A⁺∥₂.
If not full rank: cond(A) = ∞ by convention.
What does all this have to do with solving least squares problems?
x = A⁺b solves Ax ≅ b.
98
In-Class Activity Least Squares
In-class activity Least Squares
99
Sensitivity and Conditioning of Least Squares
Relate ∥Ax∥ and b with θ via trig functions.
cos(θ) = ∥Ax∥₂ / ∥b∥₂.
100
Sensitivity and Conditioning of Least Squares (II)
Derive a conditioning bound for the least squares problem.
Recall x = A⁺b; also Δx = A⁺Δb. Take norms, divide by ∥x∥₂:
∥Δx∥₂ / ∥x∥₂ ≤ ∥A⁺∥₂ ∥Δb∥₂ / ∥x∥₂
= κ(A) · (∥b∥₂ / (∥A∥₂ ∥x∥₂)) · (∥Δb∥₂ / ∥b∥₂)
≤ κ(A) · (∥b∥₂ / ∥Ax∥₂) · (∥Δb∥₂ / ∥b∥₂)
= κ(A) · (1 / cos θ) · (∥Δb∥₂ / ∥b∥₂).
What values of θ are bad?
b ⊥ colspan(A), i.e. θ ≈ π/2.
101
Sensitivity and Conditioning of Least Squares (III)
Any comments regarding dependencies?
Unlike for Ax = b, the sensitivity of the least squares solution depends on both A and b.
What about changes in the matrix?
∥Δx∥₂ / ∥x∥₂ ≤ [cond(A)² tan(θ) + cond(A)] · (∥ΔA∥₂ / ∥A∥₂).
Two behaviors: If tan(θ) ≈ 0, the condition number is cond(A). Otherwise, cond(A)².
102
Recap Orthogonal Matrices
What's an orthogonal (= orthonormal) matrix?
One that satisfies QᵀQ = I and QQᵀ = I.
How do orthogonal matrices interact with the 2-norm?
∥Qv∥₂² = (Qv)ᵀ(Qv) = vᵀQᵀQv = vᵀv = ∥v∥₂².
103
Transforming Least Squares to Upper Triangular
Suppose we have A = QR, with Q square and orthogonal, and R upper triangular. This is called a QR factorization. How do we transform the least squares problem Ax ≅ b to one with an upper triangular matrix?
∥Ax − b∥₂ = ∥Qᵀ(QRx − b)∥₂ = ∥Rx − Qᵀb∥₂.
104
Simpler Problems: Triangular
What do we win from transforming a least-squares system to upper triangular form?
[R_top; 0] x ≅ [(Qᵀb)_top; (Qᵀb)_bottom]
How would we minimize the residual norm?
For the residual vector r, we find
∥r∥₂² = ∥(Qᵀb)_top − R_top x∥₂² + ∥(Qᵀb)_bottom∥₂².
R is invertible, so we can find x to zero out the first term, leaving
∥r∥₂² = ∥(Qᵀb)_bottom∥₂².
105
Computing QR
Gram-Schmidt, Householder reflectors, Givens rotations.
Demo: Gram-Schmidt: The Movie [cleared]
Demo: Gram-Schmidt and Modified Gram-Schmidt [cleared]
Demo: Keeping track of coefficients in Gram-Schmidt [cleared]
Seen: Even modified Gram-Schmidt is still unsatisfactory in finite precision arithmetic because of roundoff.
NOTE: The textbook makes a further modification to 'modified' Gram-Schmidt: orthogonalize subsequent rather than preceding vectors. Numerically there is no difference, but it is sometimes algorithmically helpful.
106
Economical/Reduced QR
Is QR with square Q for A ∈ ℝ^{m×n} with m > n efficient?
No. Can obtain an economical or reduced QR with Q ∈ ℝ^{m×n} and R ∈ ℝ^{n×n}. The least squares solution process works unmodified with the economical form, though the equivalence proof relies on the 'full' form.
107
In-Class Activity QR
In-class activity QR
108
Householder Transformations
Find an orthogonal matrix Q to zero out the lower part of a vector a.
Intuition for Householder: (Worksheet 10, Problem 1a)
Orthogonality in the figure: (a − ∥a∥₂e₁) · (a + ∥a∥₂e₁) = ∥a∥₂² − ∥a∥₂² = 0.
Let's call v = a − ∥a∥₂e₁. How do we reflect about the plane orthogonal to v? Project-and-keep-going:
H = I − 2 (vvᵀ)/(vᵀv).
This is called a Householder reflector.
109
Householder Reflectors: Properties
Seen (from the picture, and easy to see with algebra):
Ha = ±∥a∥₂e₁.
Remarks:
Q: What if we want to zero out only the (i+1)-th through n-th entries? A: Use eᵢ above.
A product Hₙ ⋯ H₁A = R of Householders makes it easy (and quite efficient) to build a QR factorization.
It turns out v′ = a + ∥a∥₂e₁ works, too; just pick whichever one causes less cancellation.
H is symmetric.
H is orthogonal.
Demo 3x3 Householder demo [cleared]
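A minimal Householder sketch matching the construction above (sign chosen to avoid cancellation):

    import numpy as np

    def householder(a):
        v = a.copy()
        v[0] += np.sign(a[0]) * np.linalg.norm(a)   # v = a + sign(a_0) ||a|| e_1
        return np.eye(len(a)) - 2 * np.outer(v, v) / (v @ v)

    a = np.array([3., 4., 0.])
    print(householder(a) @ a)   # [-5, 0, 0] up to rounding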
110
Givens Rotations
If reflections work, can we make rotations work, too?
[c s; −s c] [a₁; a₂] = [√(a₁² + a₂²); 0].
Not hard to solve for c and s.
Downside: Produces only one zero at a time.
Demo 3x3 Givens demo [cleared]
111
Rank-Deficient Matrices and QR
What happens with QR for rank-deficient matrices?
A = QR, where R has some '(small)' (nearly zero) diagonal entries among generic non-zero entries.
Practically, it makes sense to ask for all these 'small' columns to be gathered near the 'right' of R → column pivoting.
Q: What does the resulting factorization look like?
AP = QR.
Also used as the basis for rank-revealing QR.
112
Rank-Deficient Matrices and Least Squares
What happens with least squares for rank-deficient matrices? Ax ≅ b:
QR still finds a solution with minimal residual.
By QR, it's easy to see that least squares with a short-and-fat matrix is equivalent to a rank-deficient one.
But: No longer unique. x + n for n ∈ N(A) has the same residual. In other words: have more freedom.
Or: Can demand another condition, for example: minimize ∥b − Ax∥₂² and minimize ∥x∥₂² simultaneously.
Unfortunately, QR does not help much with that → need a better tool, the SVD A = UΣVᵀ. Let's learn more about it.
113
SVD: What's this thing good for? (I)
Recall ∥A∥₂ = σ₁.
Recall cond₂(A) = σ₁/σₙ.
Nullspace: N(A) = span{vᵢ : σᵢ = 0}; rank(A) = #{i : σᵢ ≠ 0}.
Computing rank in the presence of round-off error this way is not robust. More robust:
Numerical rank:
rank_ε = #{i : σᵢ > ε}.
114
SVD: What's this thing good for? (II): Low-rank Approximation
Theorem (Eckart-Young-Mirsky). If k < r = rank(A) and
A_k = Σ_{i=1}^k σᵢ uᵢ vᵢᵀ, then
min_{rank(B)=k} ∥A − B∥₂ = ∥A − A_k∥₂ = σ_{k+1},
min_{rank(B)=k} ∥A − B∥_F = ∥A − A_k∥_F = sqrt( Σ_{j=k+1}^n σⱼ² ).
Demo Image compression [cleared] 115
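A small sketch of the theorem with a random matrix in place of an image:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(8, 6))
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    k = 3
    Ak = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]   # best rank-3 approximation
    print(np.linalg.norm(A - Ak, 2), sigma[k])       # equal, per the theorem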
SVD: What's this thing good for? (III)
The minimum norm solution to Ax ≅ b:
UΣVᵀx ≅ b  ⇔  ΣVᵀx (=: y) ≅ Uᵀb  ⇔  Σy ≅ Uᵀb.
Then define
Σ⁺ = diag(σ₁⁺, …, σₙ⁺),
where Σ⁺ is n × m if A is m × n, and
σᵢ⁺ = 1/σᵢ if σᵢ ≠ 0, and 0 if σᵢ = 0.
116
SVD: Minimum-Norm, Pseudoinverse
What is the minimum 2-norm solution to Ax ≅ b, and why?
Observe ∥x∥₂ = ∥y∥₂, and recall that ∥y∥₂ was already minimal (why?).
x = VΣ⁺Uᵀb
solves the minimum-norm least-squares problem.
Generalize the pseudoinverse to the case of a rank-deficient matrix.
Define A⁺ = VΣ⁺Uᵀ and call it the pseudoinverse of A.
Coincides with the prior definition in the case of full rank.
117
Comparing the Methods
Methods to solve least squares with A an m × n matrix:
Form AᵀA: about mn²/2 (symmetric: only need to fill half). Solve with AᵀA: n³/6 (Cholesky ≈ symmetric LU).
Solve with Householder: mn² − n³/3.
If m ≈ n: about the same. If m ≫ n: Householder QR requires about twice as much work as the normal equations.
SVD: mn² + n³ (with a large constant).
Demo Relative cost of matrix factorizations [cleared]
118
In-Class Activity Householder Givens SVD
In-class activity Householder Givens SVD
119
Outline
Introduction to Scientific Computing
Systems of Linear Equations
Linear Least Squares
Eigenvalue Problems
  Properties and Transformations
  Sensitivity
  Computing Eigenvalues
  Krylov Space Methods
Nonlinear Equations
Optimization
Interpolation
Numerical Integration and Differentiation
Initial Value Problems for ODEs
Boundary Value Problems for ODEs
Partial Differential Equations and Sparse Linear Algebra
Fast Fourier Transform
Additional Topics
120
Eigenvalue Problems: Setup / Math Recap
A is an n × n matrix.
x ≠ 0 is called an eigenvector of A if there exists a λ so that
Ax = λx.
In that case, λ is called an eigenvalue.
The set of all eigenvalues λ(A) is called the spectrum.
The spectral radius is the magnitude of the biggest eigenvalue:
ρ(A) = max{ |λ| : λ ∈ λ(A) }.
121
Finding Eigenvalues
How do you find eigenvalues?
Ax = λx ⇔ (A − λI)x = 0 ⇔ A − λI singular ⇔ det(A − λI) = 0.
det(A − λI) is called the characteristic polynomial, which has degree n, and therefore n (potentially complex) roots.
Does that help algorithmically? Abel-Ruffini theorem: for n ⩾ 5 there is no general formula for the roots of a polynomial. IOW: no.
For LU and QR, we obtain exact answers (except for rounding). For eigenvalue problems: not possible; must approximate.
Demo: Rounding in characteristic polynomial using SymPy [cleared]
122
Multiplicity
What is the multiplicity of an eigenvalue?
Actually, there are two notions called multiplicity:
Algebraic multiplicity: multiplicity of the root of the characteristic polynomial.
Geometric multiplicity: number of linearly independent eigenvectors.
In general: AM ⩾ GM.
If AM > GM, the matrix is called defective.
123
An Example
Give the characteristic polynomial, eigenvalues, and eigenvectors of
[1 1]
[  1].
CP: (λ − 1)².
Eigenvalues: 1 (with multiplicity 2).
Eigenvectors:
[1 1; 0 1][x; y] = [x; y] ⇒ x + y = x ⇒ y = 0.
So: only a 1D space of eigenvectors.
124
Diagonalizability
When is a matrix called diagonalizable?
If it is not defective, i.e. if it has n linearly independent eigenvectors (i.e. a full basis of them). Call those (xᵢ)_{i=1}^n.
In that case, let
X = [x₁ ⋯ xₙ]
and observe AX = XD, or
A = XDX⁻¹,
where D is a diagonal matrix with the eigenvalues.
125
Similar Matrices
Related definition: Two matrices A and B are called similar if there exists an invertible matrix X so that A = XBX⁻¹.
In that sense: “diagonalizable” = “similar to a diagonal matrix”.
Observe: Similar A and B have the same eigenvalues. (Why?)
Suppose Av = λv. We have B = X⁻¹AX. Let w = X⁻¹v. Then
Bw = X⁻¹Av = λw.
126
Eigenvalue Transformations (I)
What do the following transformations of the eigenvalue problem Ax = λx do?
Shift: A → A − σI:
(A − σI)x = (λ − σ)x.
Inversion: A → A⁻¹:
A⁻¹x = λ⁻¹x.
Power: A → Aᵏ:
Aᵏx = λᵏx.
127
Eigenvalue Transformations (II)
Polynomial: A → aA² + bA + cI:
(aA² + bA + cI)x = (aλ² + bλ + c)x.
Similarity: T⁻¹AT, with T invertible:
Let y = T⁻¹x. Then
T⁻¹ATy = λy.
128
Sensitivity (I)
Assume A is not defective. Suppose X⁻¹AX = D. Perturb A → A + E. What happens to the eigenvalues?
X⁻¹(A + E)X = D + F
A + E and D + F have the same eigenvalues.
D + F is not necessarily diagonal.
Suppose v is a perturbed eigenvector:
(D + F)v = μv
⇔ Fv = (μI − D)v
⇔ (μI − D)⁻¹Fv = v (when is that invertible?)
⇒ ∥v∥ ≤ ∥(μI − D)⁻¹∥ ∥F∥ ∥v∥
⇒ ∥(μI − D)⁻¹∥⁻¹ ≤ ∥F∥.
129
Sensitivity (II)
X⁻¹(A + E)X = D + F. Have ∥(μI − D)⁻¹∥⁻¹ ≤ ∥F∥, and
∥(μI − D)⁻¹∥⁻¹ = |μ − λ_k|,
where λ_k is the closest eigenvalue of D to μ. So
|μ − λ_k| = ∥(μI − D)⁻¹∥⁻¹ ≤ ∥F∥ = ∥X⁻¹EX∥ ≤ cond(X) ∥E∥.
This is 'bad' if X is ill-conditioned, i.e. if the eigenvectors are nearly linearly dependent.
If X is orthogonal (e.g. for symmetric A), then eigenvalues are always well-conditioned.
This bound is in terms of all eigenvalues, so it may overestimate for each individual eigenvalue.
Demo Bauer-Fike Eigenvalue Sensitivity Bound [cleared] 130
Power Iteration
What are the eigenvalues of A^1000?
Assume |λ₁| > |λ₂| > ⋯ > |λₙ|, with eigenvectors x₁, …, xₙ. Further assume ∥xᵢ∥ = 1.
Use x = αx₁ + βx₂:
y = A^1000 (αx₁ + βx₂) = αλ₁^1000 x₁ + βλ₂^1000 x₂.
Or:
y / λ₁^1000 = αx₁ + β (λ₂/λ₁)^1000 x₂, where (λ₂/λ₁)^1000 ≪ 1.
Idea: Use this as a computational procedure to find x₁. Called Power Iteration.
131
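A sketch of (normalized) power iteration with a Rayleigh-quotient readout (random symmetric test matrix assumed):

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.normal(size=(4, 4)); A = A + A.T        # symmetric: real eigenvalues
    x = rng.normal(size=4)
    for _ in range(100):
        x = A @ x
        x /= np.linalg.norm(x)                      # normalize to avoid overflow
    print(abs(x @ A @ x))                           # estimate of max |eigenvalue|
    print(np.max(np.abs(np.linalg.eigvalsh(A))))    # reference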
Power Iteration Issues
What could go wrong with power iteration?
The starting vector has no component along x₁. Not a problem in practice: rounding will introduce one.
Overflow in computing λ₁^1000 → normalized power iteration.
|λ₁| = |λ₂|: a real problem.
132
What about Eigenvalues
Power iteration generates eigenvectors. What if we would like to know eigenvalues?
Estimate them:
xᵀAx / xᵀx
= λ if x is an eigenvector with eigenvalue λ. Otherwise, an estimate of a 'nearby' eigenvalue.
This is called the Rayleigh quotient.
133
Convergence of Power Iteration
What can you say about the convergence of the power method? Say v₁^(k) is the kth estimate of the eigenvector x₁, and
e_k = ∥x₁ − v₁^(k)∥.
Easy to see:
e_{k+1} ≈ (|λ₂| / |λ₁|) e_k.
We will later learn that this is linear convergence. It's quite slow. What does a shift do to this situation?
e_{k+1} ≈ (|λ₂ − σ| / |λ₁ − σ|) e_k.
Picking σ ≈ λ₁ does not help.
Idea: Invert and shift to bring |λ₁ − σ| into the numerator.
134
Inverse Iteration
Describe inverse iteration.
x_{k+1} = (A − σI)⁻¹ x_k.
Implemented by storing/solving with an LU factorization.
Converges to the eigenvector for the eigenvalue closest to σ, with
e_{k+1} ≈ (|λ_closest − σ| / |λ_second-closest − σ|) e_k.
135
Rayleigh Quotient Iteration
Describe Rayleigh quotient iteration.
Compute the shift σ_k = x_kᵀAx_k / x_kᵀx_k, the Rayleigh quotient for x_k.
Then apply inverse iteration with that shift:
x_{k+1} = (A − σ_k I)⁻¹ x_k.
Demo Power Iteration and its Variants [cleared]
136
In-Class Activity Eigenvalues
In-class activity Eigenvalues
137
Schur form
Show: Every matrix is orthonormally similar to an upper triangular matrix, i.e. A = QUQᵀ. This is called the Schur form or Schur factorization.
Assume A non-defective for now. Suppose Av = λv (v ≠ 0). Let V = span{v}. Then
A: V → V,
   V⊥ → V ⊕ V⊥.
With Q₁ = [v | basis of V⊥]:
A = Q₁ [λ ∗ ⋯ ∗; 0 ∗ ⋯ ∗; ⋮; 0 ∗ ⋯ ∗] Q₁ᵀ.
Repeat n times to reach triangular form: A = (Q₁ ⋯ Qₙ) U (Q₁ ⋯ Qₙ)ᵀ.
138
Schur Form: Comments, Eigenvalues, Eigenvectors
A = QUQᵀ. For complex λ: either complex matrices, or 2 × 2 blocks on the diagonal.
If we had a Schur form of A, how can we find the eigenvalues?
The eigenvalues (of U and A) are on the diagonal of U.
And the eigenvectors?
Find an eigenvector of U. Suppose λ is an eigenvalue:
U − λI = [U₁₁ u U₁₃; 0 0 vᵀ; 0 0 U₃₃].
x = [U₁₁⁻¹u, −1, 0]ᵀ is an eigenvector of U, and Qx is an eigenvector of A.
139
Computing Multiple Eigenvalues
All power iteration methods compute one eigenvalue at a time. What if I want all eigenvalues?
Two ideas:
1. Deflation: similarity transform to
[λ₁ ∗]
[   B],
i.e. one step in Schur form. Now find the eigenvalues of B.
2. Iterate with multiple vectors simultaneously.
140
Simultaneous Iteration
What happens if we carry out power iteration on multiple vectors simultaneously?
Simultaneous Iteration:
1. Start with X₀ ∈ ℝ^{n×p} (p ≤ n) with (arbitrary) iteration vectors in the columns.
2. X_{k+1} = AX_k.
Problems: Needs rescaling. X becomes increasingly ill-conditioned: all columns go towards x₁.
Fix: orthogonalize!
141
Orthogonal Iteration
Orthogonal Iteration:
1. Start with X₀ ∈ ℝ^{n×p} (p ≤ n) with (arbitrary) iteration vectors in the columns.
2. Q_k R_k = X_k (reduced QR).
3. X_{k+1} = AQ_k.
Good: X_k obeys the convergence theory from power iteration. Bad: slow/linear convergence, expensive iteration.
142
Toward the QR Algorithm
Q₀R₀ = X₀
X₁ = AQ₀
Q₁R₁ = X₁ = AQ₀  ⇒  Q₁R₁Q₀ᵀ = A
X₂ = AQ₁
Q₂R₂ = X₂ = AQ₁  ⇒  Q₂R₂Q₁ᵀ = A
Once the Q_k converge (Q_{n+1} ≈ Q_n), we have a Schur factorization!
Problem: 'Q_{n+1} ≈ Q_n' works poorly as a convergence test.
Observation 1: Once Q_{n+1} ≈ Q_n, we also have Q_nR_nQ_nᵀ ≈ A.
Observation 2: X̃_n := Q_nᵀAQ_n ≈ R_n.
Idea: Use the below-diagonal part of X̃_k for the convergence check.
Q: Can we restate the iteration to compute X̃_k directly?
Demo Orthogonal Iteration [cleared] 143
QR Iteration / QR Algorithm
Orthogonal iteration: X₀ = A; Q_kR_k = X_k; X_{k+1} = AQ_k; X̃_k = Q_kᵀAQ_k.
QR iteration: X̃₀ = A; Q̃_kR̃_k = X̃_k; X̃_{k+1} = R̃_kQ̃_k.
These correspond, with Q_k = Q̃₀Q̃₁ ⋯ Q̃_k.
Orthogonal iteration showed that the X̃_k converge. Also,
X̃_{k+1} = R̃_kQ̃_k = Q̃_kᵀX̃_kQ̃_k,
so the X̃_k are all similar → all have the same eigenvalues → QR iteration produces a Schur form.
144
QR Iteration: Incorporating a Shift
How can we accelerate convergence of QR iteration using shifts?
X̃₀ = A
Q̃_kR̃_k = X̃_k − σ_kI
X̃_{k+1} = R̃_kQ̃_k + σ_kI
Still a similarity transform:
X̃_{k+1} = R̃_kQ̃_k + σ_kI = [Q̃_kᵀX̃_k − Q̃_kᵀσ_k]Q̃_k + σ_kI = Q̃_kᵀX̃_kQ̃_k.
Q: How should the shifts be chosen? Ideally: close to an existing eigenvalue. Heuristics:
Lower right entry of X̃_k.
Eigenvalues of the lower right 2 × 2 block of X̃_k.
145
QR Iteration: Computational Expense
A full QR factorization at each iteration costs O(n³); can we make that cheaper?
Idea: upper Hessenberg form:
A = Q [∗ ∗ ∗ ∗; ∗ ∗ ∗ ∗; 0 ∗ ∗ ∗; 0 0 ∗ ∗] Qᵀ
(nonzeros only in the upper triangle and the first subdiagonal).
Attainable by similarity transforms (!), HAHᵀ, with Householders that start one entry lower than 'usual'.
QR factorization of Hessenberg matrices can be achieved in O(n²) time using Givens rotations.
Demo: Householder Similarity Transforms [cleared]
146
QR/Hessenberg: Overall procedure
Overall procedure:
1. Reduce the matrix to Hessenberg form.
2. Apply QR iteration using Givens QR to obtain Schur form.
Why does QR iteration stay in Hessenberg form?
Assume X̃_k is upper Hessenberg ('UH').
Q̃_kR̃_k = X̃_k; Q̃_k = X̃_kR̃_k⁻¹ is UH (UH · upper = UH).
X̃_{k+1} = R̃_kQ̃_k is UH (upper · UH = UH).
What does this process look like for symmetric matrices?
Use Householders to attain tridiagonal form. Use QR iteration with Givens to attain diagonal form.
147
Krylov space methods: Intro
What subspaces can we use to look for eigenvectors?
QR: span{Aˡy₁, Aˡy₂, …, Aˡyₖ}.
Krylov space: span{x, Ax, …, A^{k−1}x} =: span{x₀, x₁, …, x_{k−1}}.
Define
K_k = [x₀ ⋯ x_{k−1}]  (n × k).
148
Krylov for Matrix Factorization
What matrix factorization is obtained through Krylov space methods?
AK_n = [x₁ ⋯ xₙ] = K_n [e₂ ⋯ eₙ  K_n⁻¹xₙ] =: K_n C_n,
so K_n⁻¹AK_n = C_n.
C_n is upper Hessenberg.
So Krylov is 'just' another way to get a matrix into upper Hessenberg form. But: it works well when only a matvec is available (it searches in the Krylov space, not the space spanned by the first columns).
149
Conditioning in Krylov Space Methods / Arnoldi Iteration (I)
What is a problem with Krylov space methods? How can we fix it?
(xᵢ) converge rapidly to the eigenvector for the largest eigenvalue → the K_k become ill-conditioned.
Idea: Orthogonalize (at the end, for now):
Q_nR_n = K_n ⇒ Q_n = K_nR_n⁻¹.
Then
Q_nᵀAQ_n = R_n (K_n⁻¹AK_n) R_n⁻¹ = R_nC_nR_n⁻¹.
C_n is upper Hessenberg; Q_nᵀAQ_n is also UH (because upper · UH = UH and UH · upper = UH).
150
Conditioning in Krylov Space Methods / Arnoldi Iteration (II)
We find that Q_nᵀAQ_n =: H is upper Hessenberg. Also readable as AQ_n = Q_nH, which, read column-by-column, is
Aq_k = h_{1,k}q₁ + ⋯ + h_{k+1,k}q_{k+1}.
We find h_{j,k} = q_jᵀAq_k. Important consequence: Can compute
q_{k+1} from q₁, …, q_k, and
the (k+1)-st column of H,
analogously to Gram-Schmidt QR!
This is called Arnoldi iteration. For symmetric matrices: Lanczos iteration.
Demo Arnoldi Iteration [cleared] (Part 1)
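A compact Arnoldi sketch (Gram-Schmidt style; assumes no breakdown):

    import numpy as np

    def arnoldi(A, x0, k):
        # Produces A Q[:, :k] = Q H with Q (n x k+1), H (k+1 x k) upper Hessenberg
        n = len(x0)
        Q = np.zeros((n, k + 1)); H = np.zeros((k + 1, k))
        Q[:, 0] = x0 / np.linalg.norm(x0)
        for j in range(k):
            w = A @ Q[:, j]
            for i in range(j + 1):              # orthogonalize against q_1..q_j
                H[i, j] = Q[:, i] @ w
                w -= H[i, j] * Q[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            Q[:, j + 1] = w / H[j + 1, j]
        return Q, H

    rng = np.random.default_rng(4)
    A = rng.normal(size=(6, 6))
    Q, H = arnoldi(A, rng.normal(size=6), 4)
    print(np.allclose(A @ Q[:, :4], Q @ H))     # True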
151
Krylov: What about eigenvalues?
How can we use Arnoldi/Lanczos to compute eigenvalues?
Q = [Q_k U_k], where Q_k is known (i.e. already computed) and U_k is not yet computed.
H = QᵀAQ = [Q_k U_k]ᵀ A [Q_k U_k] is upper Hessenberg, and its top-left k × k block Q_kᵀAQ_k is fully known.
Use the eigenvalues of that top-left block as approximate eigenvalues. (They still need to be computed, using QR iteration.)
Those are called Ritz values.
Demo Arnoldi Iteration [cleared] (Part 2) 152
Computing the SVD (Kiddy Version)
1. Compute (orthogonal) eigenvectors V and eigenvalues D of AᵀA:
AᵀAV = VD ⇒ VᵀAᵀAV = D = Σ².
2. Find U from A = UΣVᵀ ⇔ UΣ = AV.
Observe: U is orthogonal if Σ⁻¹ exists (if not, can choose it so):
UᵀU = Σ⁻¹VᵀAᵀAVΣ⁻¹ = Σ⁻¹Σ²Σ⁻¹ = I.
Demo: Computing the SVD [cleared]
“Actual”/“non-kiddy” computation of the SVD:
Bidiagonalize A = U [B; 0] Vᵀ, then diagonalize via a variant of QR iteration.
References: Chan '82, or Golub/van Loan, Sec. 8.6.
153
Outline
Introduction to Scientific Computing
Systems of Linear Equations
Linear Least Squares
Eigenvalue Problems
Nonlinear Equations
  Introduction
  Iterative Procedures
  Methods in One Dimension
  Methods in n Dimensions (“Systems of Equations”)
Optimization
Interpolation
Numerical Integration and Differentiation
Initial Value Problems for ODEs
Boundary Value Problems for ODEs
Partial Differential Equations and Sparse Linear Algebra
Fast Fourier Transform
Additional Topics
154
Solving Nonlinear Equations
What is the goal here?
Solve f(x) = 0 for f: ℝⁿ → ℝⁿ.
If looking for a solution to f̃(x) = y, simply consider f(x) = f̃(x) − y.
Intuition: Each of the n equations describes a surface. Looking for intersections.
Demo: Three quadratic functions [cleared]
155
Showing Existence
How can we show existence of a root?
Intermediate value theorem (uses continuity; 1D only).
Inverse function theorem (relies on an invertible Jacobian J_f): get local invertibility, i.e. f(x) = y is solvable.
Contraction mapping theorem: A function g: ℝⁿ → ℝⁿ is called contractive if there exists 0 < γ < 1 so that ∥g(x) − g(y)∥ ≤ γ ∥x − y∥. A fixed point of g is a point where g(x) = x. Then: on a closed set S ⊆ ℝⁿ with g(S) ⊆ S, there exists a unique fixed point. Example: a (real-world) map.
In general, no uniqueness results are available.
156
Sensitivity and Multiplicity
What is the sensitivity/conditioning of root finding?
cond(root finding) = cond(evaluation of the inverse function).
What are multiple roots?
0 = f(x) = f′(x) = ⋯ = f^{(m−1)}(x).
This is called a root of multiplicity m.
How do multiple roots interact with conditioning?
The inverse function is steep near one, so conditioning is poor.
157
In-Class Activity Krylov and Nonlinear Equations
In-class activity Krylov and Nonlinear Equations
158
Rates of Convergence
What is linear convergence? Quadratic convergence?
e_k = u_k − u: error in the kth iterate u_k. Assume e_k → 0 as k → ∞.
An iterative method converges with rate r if
lim_{k→∞} ∥e_{k+1}∥ / ∥e_k∥^r = C, with C > 0 and C < ∞.
r = 1 is called linear convergence.
r > 1 is called superlinear convergence.
r = 2 is called quadratic convergence.
Examples:
Power iteration is linearly convergent.
Rayleigh quotient iteration is quadratically convergent.
159
About Convergence Rates
Demo: Rates of Convergence [cleared]
Characterize linear and quadratic convergence in terms of the 'number of accurate digits'.
Linear convergence gains a constant number of digits each step:
∥e_{k+1}∥ ≤ C ∥e_k∥
(and C < 1 matters).
Quadratic convergence:
∥e_{k+1}∥ ≤ C ∥e_k∥².
(Only starts making sense once ∥e_k∥ is small; C doesn't matter much.)
160
Stopping Criteria
Comment on the 'foolproof-ness' of these stopping criteria:
1. |f(x)| < ε ('residual is small')
2. ∥x_{k+1} − x_k∥ < ε
3. ∥x_{k+1} − x_k∥ / ∥x_k∥ < ε
1. Can trigger far away from a root in the case of multiple roots (or a 'flat' f).
2. Allows different 'relative accuracy' in the root depending on its magnitude.
3. Enforces a relative accuracy in the root, but does not actually check that the function value is small. So if convergence 'stalls' away from a root, this may trigger without being anywhere near the desired solution.
Lesson: No stopping criterion is bulletproof. The 'right' one almost always depends on the application.
161
Bisection Method
Demo: Bisection Method [cleared]
What's the rate of convergence? What's the constant?
Linear, with constant 1/2.
162
Fixed Point Iteration
x_0 = ⟨starting guess⟩
x_{k+1} = g(x_k)

Demo: Fixed point iteration [cleared]

When does fixed point iteration converge? Assume g is smooth.

Let x* be the fixed point with x* = g(x*).
If |g'(x*)| < 1 at the fixed point, FPI converges.
Error: e_{k+1} = x_{k+1} − x* = g(x_k) − g(x*)
[cont’d]
163
Fixed Point Iteration: Convergence, cont’d
Error in FPI: e_{k+1} = x_{k+1} − x* = g(x_k) − g(x*)

The mean value theorem says: there is a θ_k between x_k and x* so that

g(x_k) − g(x*) = g'(θ_k)(x_k − x*) = g'(θ_k) e_k.

So e_{k+1} = g'(θ_k) e_k, and if |g'| ≤ C < 1 near x*, then we have linear convergence with constant C.

Q: What if g'(x*) = 0?

By Taylor:

g(x_k) − g(x*) = g''(ξ_k)(x_k − x*)^2 / 2

So we have quadratic convergence in this case.

We would obviously like a systematic way of finding a g that produces quadratic convergence.
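As a small concrete example (hypothetical, not from the lecture demos): g(x) = cos(x) is contractive near its fixed point, since |g'(x*)| = |sin(x*)| < 1, so the iteration converges linearly:

```python
import math

x = 1.0                # starting guess
for k in range(50):
    x = math.cos(x)    # x_{k+1} = g(x_k)
print(x)               # ~0.7390851, the fixed point of cos
```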
164
Newton’s Method

Derive Newton’s method.

Idea: Approximate f at x_k using Taylor: f(x_k + h) ≈ f(x_k) + f'(x_k)h.
Now find the root of this linear approximation in terms of h:

f(x_k) + f'(x_k)h = 0  ⇔  h = −f(x_k) / f'(x_k)

So:

x_0 = ⟨starting guess⟩
x_{k+1} = x_k − f(x_k) / f'(x_k) = g(x_k)

Demo: Newton’s method [cleared]
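A minimal sketch of the iteration (no safeguards; the function and starting guess are made up for illustration):

```python
def newton(f, df, x, n_steps=10):
    for k in range(n_steps):
        x = x - f(x) / df(x)   # x_{k+1} = x_k - f(x_k)/f'(x_k)
    return x

print(newton(lambda x: x**2 - 2, lambda x: 2*x, x=1.0))  # ~1.414213...
```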
165
Convergence and Properties of Newton
What’s the rate of convergence of Newton’s method?

g'(x) = f(x) f''(x) / f'(x)^2

So if f(x*) = 0 and f'(x*) ≠ 0, we have g'(x*) = 0, i.e. quadratic convergence toward single roots.

Drawbacks of Newton:
The convergence argument is only good locally.
Will see: convergence is only local (near a root).
Have to have the derivative.

Demo: Convergence of Newton’s Method [cleared]
166
Secant Method
What would Newton without the use of the derivative look like?

Approximate:

f'(x_k) ≈ (f(x_k) − f(x_{k−1})) / (x_k − x_{k−1})

So:

x_0 = ⟨starting guess⟩
x_{k+1} = x_k − f(x_k) / [ (f(x_k) − f(x_{k−1})) / (x_k − x_{k−1}) ]
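A sketch (function and starting values made up for illustration; note the two required starting guesses):

```python
def secant(f, x0, x1, n_steps=12):
    for k in range(n_steps):
        slope = (f(x1) - f(x0)) / (x1 - x0)   # approximates f'(x_k)
        x0, x1 = x1, x1 - f(x1) / slope
    return x1

print(secant(lambda x: x**2 - 2, 0.0, 2.0))  # ~1.414213...
```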
167
Convergence and Properties of Secant

Rate of convergence is (1 + √5)/2 ≈ 1.618. (proof)

Drawbacks of Secant:
The convergence argument is only good locally.
Will see: convergence is only local (near a root).
Slower convergence.
Need two starting guesses.

Demo: Secant Method [cleared]
Demo: Convergence of the Secant Method [cleared]

Secant (and similar methods) are called Quasi-Newton Methods.
168
Improving on Newton
How would we do “Newton + 1” (i.e. even faster, even better)?

Easy: Use a second-order Taylor approximation, solve the resulting quadratic. Get cubic convergence.
But: we get a method that’s extremely fast and extremely brittle:
Need second derivatives.
What if the quadratic has no solution?
169
Root Finding with Interpolants
The secant method uses a linear interpolant based on the points f(x_k), f(x_{k−1}). Could we use more points and a higher-order interpolant?

Can fit a polynomial to (a subset of) (x_0, f(x_0)), …, (x_k, f(x_k)) and look for a root of that.
Fit a quadratic to the last three: Muller’s method.
Also finds complex roots.
Convergence rate r ≈ 1.84.
What about existence of roots in that case?
Inverse quadratic interpolation:
Interpolate a quadratic polynomial q so that q(f(x_i)) = x_i for i ∈ {k, k − 1, k − 2}.
Approximate the root by evaluating x_{k+1} = q(0).
170
Achieving Global Convergence
The linear approximations in Newton and Secant are only good locally. How could we use that?

Hybrid methods: bisection + Newton. Stop if Newton leaves the bracket.
Fix a region where they’re ‘trustworthy’ (trust region methods).
Limit the step size.
171
In-Class Activity: Nonlinear Equations
In-class activity: Nonlinear Equations
172
Fixed Point Iteration
x_0 = ⟨starting guess⟩
x_{k+1} = g(x_k)

When does this converge?

Converges (locally) if ∥J_g(x*)∥ < 1 in some norm, where the Jacobian matrix is

J_g(x*) = [ ∂g_1/∂x_1  ···  ∂g_1/∂x_n ]
          [     ⋮        ⋱      ⋮     ]
          [ ∂g_n/∂x_1  ···  ∂g_n/∂x_n ]

Similarly: if J_g(x*) = 0, we have at least quadratic convergence.
Better: there exists a norm ∥·∥_A such that ρ(A) ≤ ∥A∥_A < ρ(A) + ϵ, so ρ(A) < 1 suffices. (proof)
173
Newton’s Method
What does Newton’s method look like in n dimensions?

Approximate by a linear function: f(x + s) ≈ f(x) + J_f(x)s.
Set to 0: J_f(x)s = −f(x), i.e. s = −(J_f(x))^{−1} f(x).

x_0 = ⟨starting guess⟩
x_{k+1} = x_k − (J_f(x_k))^{−1} f(x_k)

Downsides of n-dimensional Newton:
Still only locally convergent.
Computing and inverting J_f is expensive.

Demo: Newton’s method in n dimensions [cleared]
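A numpy sketch (the test system below is made up for illustration); in practice one solves with J_f rather than forming its inverse:

```python
import numpy as np

def newton_nd(f, jac, x, n_steps=10):
    for k in range(n_steps):
        s = np.linalg.solve(jac(x), -f(x))   # solve J_f(x_k) s = -f(x_k)
        x = x + s
    return x

# Hypothetical system: the unit circle intersected with y = x^3
f = lambda v: np.array([v[0]**2 + v[1]**2 - 1, v[1] - v[0]**3])
jac = lambda v: np.array([[2*v[0], 2*v[1]],
                          [-3*v[0]**2, 1.0]])
print(newton_nd(f, jac, np.array([1.0, 1.0])))
```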
174
Secant in n dimensions
What would the secant method look like in n dimensions?

Need an ‘approximate Jacobian’ satisfying

J̃(x_{k+1} − x_k) = f(x_{k+1}) − f(x_k).

Suppose we have already taken a step to x_{k+1}. Could we ‘reverse engineer’ J̃ from that equation?
No: n^2 unknowns in J̃, but only n equations.
Can only hope to ‘update’ J̃ with information from the current guess.
Some choices, all called Broyden’s method:
update J_n: minimize ∥J_n − J_{n−1}∥_F
update J_n^{−1} (via Sherman-Morrison): minimize ∥J_n^{−1} − J_{n−1}^{−1}∥_F
multiple variants (“good” Broyden and “bad” Broyden)
175
Numerically Testing Derivatives
Getting derivatives right is important. How can I test/debug them?

Verify convergence of the Taylor remainder by checking that, for a unit vector s and an input vector x,

∥ (f(x + hs) − f(x)) / h − J_f(x)s ∥ → 0 as h → 0.

The same trick can be used to check the Hessian (needed in optimization): it is the Jacobian of the gradient.
The norm is not necessarily small: convergence (i.e. decrease with h) is what matters.
Important to divide by h so that the norm is O(1).
Can “bootstrap” the derivatives: do the above one term at a time.
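A sketch of such a check (the test function is made up; the remainder norms should shrink like O(h)):

```python
import numpy as np

def check_jacobian(f, jac, x, s, hs=(1e-1, 1e-2, 1e-3, 1e-4)):
    # The remainder norms need not be small by themselves;
    # what matters is that they decrease with h.
    for h in hs:
        err = np.linalg.norm((f(x + h*s) - f(x)) / h - jac(x) @ s)
        print(f"h = {h:.0e}: remainder norm = {err:.3e}")

f = lambda v: np.array([v[0]**2 + v[1]**2 - 1, v[1] - v[0]**3])
jac = lambda v: np.array([[2*v[0], 2*v[1]], [-3*v[0]**2, 1.0]])
check_jacobian(f, jac, np.array([0.5, 0.7]), np.array([1.0, 0.0]))
```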
176
Outline
Introduction to Scientific Computing
Systems of Linear Equations
Linear Least Squares
Eigenvalue Problems
Nonlinear Equations
Optimization
  Introduction
  Methods for unconstrained opt in one dimension
  Methods for unconstrained opt in n dimensions
  Nonlinear Least Squares
  Constrained Optimization
Interpolation
Numerical Integration and Differentiation
Initial Value Problems for ODEs
Boundary Value Problems for ODEs
Partial Differential Equations and Sparse Linear Algebra
Fast Fourier Transform
Additional Topics
177
Optimization Problem Statement
Have: Objective function f: R^n → R.
Want: Minimizer x* ∈ R^n so that

f(x*) = min_x f(x) subject to g(x) = 0 and h(x) ≤ 0.

g(x) = 0 and h(x) ≤ 0 are called constraints.
They define the set of feasible points x ∈ S ⊆ R^n.

If g or h are present, this is constrained optimization;
otherwise, unconstrained optimization.

If f, g, h are linear, this is called linear programming;
otherwise, nonlinear programming.
178
Optimization: Observations
Q: What if we are looking for a maximizer, not a minimizer? (Then minimize −f instead.)
Give some examples:

What is the fastest/cheapest/shortest way to do …?
Solve a system of equations ‘as well as you can’ (if no exact solution exists); similar to what least squares does for linear systems:

min_x ∥F(x)∥

What about multiple objectives?

In general: look up Pareto optimality.
For 450: make up your mind; decide on one (or build a combined objective). Then we’ll talk.
179
Existence/Uniqueness
Terminology: global minimum / local minimum.

Under what conditions on f can we say something about existence/uniqueness?

If f: S → R is continuous on a closed and bounded set S ⊆ R^n, then a minimum exists.

f: S → R is called coercive on S ⊆ R^n (which must be unbounded) if

lim_{∥x∥→∞} f(x) = +∞.

If f is coercive, a global minimum exists (but is possibly non-unique).
180
Convexity
S ⊆ R^n is called convex if for all x, y ∈ S and all 0 ≤ α ≤ 1,

αx + (1 − α)y ∈ S.

f: S → R is called convex on S ⊆ R^n if for all x, y ∈ S and all 0 ≤ α ≤ 1,

f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y).

With ‘<’: strictly convex.

Q: Give an example of a convex, but not strictly convex function. (E.g. any linear function.)
181
Convexity Consequences
If f is convex
then f is continuous at interior points.
(Why? What would happen if it had a jump?)
a local minimum is a global minimum
If f is strictly convex
a local minimum is a unique global minimum
182
Optimality Conditions
If we have found a candidate x* for a minimum, how do we know it actually is one? Assume f is smooth, i.e. has all needed derivatives.

In one dimension:
Necessary: f'(x*) = 0 (i.e. x* is an extremal point).
Sufficient: f'(x*) = 0 and f''(x*) > 0 (implies x* is a local minimum).

In n dimensions:
Necessary: ∇f(x*) = 0 (i.e. x* is an extremal point).
Sufficient: ∇f(x*) = 0 and H_f(x*) positive definite (implies x* is a local minimum),

where the Hessian is

H_f(x*) = [ ∂²/∂x_1²        ···  ∂²/(∂x_1 ∂x_n) ]
          [      ⋮           ⋱        ⋮         ]  f(x*)
          [ ∂²/(∂x_n ∂x_1)  ···  ∂²/∂x_n²       ]
183
Optimization: Observations

Q: Come up with a hypothetical approach for finding minima.
A: Solve ∇f = 0.

Q: Is the Hessian symmetric?
A: Yes (Schwarz’s theorem).

Q: How can we practically test for positive definiteness?
A: Attempt a Cholesky factorization. If it succeeds, the matrix is PD.
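A sketch of that test in numpy (assumes the matrix is symmetric; np.linalg.cholesky raises LinAlgError exactly when the factorization fails):

```python
import numpy as np

def is_positive_definite(H):
    try:
        np.linalg.cholesky(H)
        return True
    except np.linalg.LinAlgError:
        return False

print(is_positive_definite(np.array([[2.0, 1.0], [1.0, 2.0]])))  # True
print(is_positive_definite(np.array([[1.0, 2.0], [2.0, 1.0]])))  # False
```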
184
Sensitivity and Conditioning (1D)
How does optimization react to a slight perturbation of the minimum?

Suppose we still have |f(x) − f(x*)| < tol (where x* is the true minimum). Apply Taylor’s theorem:

f(x* + h) = f(x*) + f'(x*) h + f''(x*) h²/2 + O(h³),

where f'(x*) = 0. Solve for h:

|x − x*| ≤ √(2 tol / f''(x*)).

In other words: can expect half as many digits in x as in f(x).
This is important to keep in mind when setting tolerances.

It’s only possible to do better when derivatives are explicitly known and convergence is not based on function values alone (then one can solve ∇f = 0).
185
Sensitivity and Conditioning (nD)
How does optimization react to a slight perturbation of the minimum?

Suppose we still have |f(x) − f(x*)| < tol, where x* is the true minimum and x = x* + hs with ∥s∥ = 1. Then

f(x* + hs) = f(x*) + h ∇f(x*)^T s + (h²/2) s^T H_f(x*) s + O(h³),

where ∇f(x*) = 0. This yields

|h|² ≤ 2 tol / λ_min(H_f(x*)).

In other words: the conditioning of H_f determines the sensitivity of x*.
186
Unimodality
Would like a method like bisection, but for optimization.
In general: no invariant that can be preserved. Need an extra assumption.

f is called unimodal on an open interval if there exists an x* in the interval such that for all x_1 < x_2 in the interval:
x_2 < x* ⇒ f(x_1) > f(x_2)
x* < x_1 ⇒ f(x_1) < f(x_2)
187
In-Class Activity: Optimization Theory
In-class activity: Optimization Theory
188
Golden Section Search
Suppose we have an interval with f unimodal.

Would like to maintain unimodality:

1. Pick x_1, x_2.
2. If f(x_1) > f(x_2), reduce to (x_1, b).
3. If f(x_1) ≤ f(x_2), reduce to (a, x_2).
189
Golden Section Search: Efficiency
Where to put x_1, x_2?

Want symmetry:
x_1 = a + (1 − τ)(b − a)
x_2 = a + τ(b − a)
Want to reuse function evaluations.

Demo: Golden Section Proportions [cleared]

τ² = 1 − τ. Find τ = (√5 − 1)/2, also known as the golden section.
Hence: golden section search.

Convergence rate?

Linearly convergent. Can we do better?
190
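A sketch of the search (test function made up for illustration). Note how τ² = 1 − τ lets one interior point be reused, so only one new function evaluation is needed per step:

```python
import math

def golden_section(f, a, b, tol=1e-8):
    tau = (math.sqrt(5) - 1) / 2
    x1, x2 = a + (1 - tau)*(b - a), a + tau*(b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 > f2:                 # minimum is in (x1, b)
            a, x1, f1 = x1, x2, f2
            x2 = a + tau*(b - a); f2 = f(x2)
        else:                       # minimum is in (a, x2)
            b, x2, f2 = x2, x1, f1
            x1 = a + (1 - tau)*(b - a); f1 = f(x1)
    return (a + b) / 2

print(golden_section(lambda x: (x - 1.3)**2, 0, 3))  # ~1.3
```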
Newton’s Method
Reuse the Taylor approximation idea, but for optimization:

f(x + h) ≈ f(x) + f'(x) h + f''(x) h²/2 =: f̂(h)

Solve 0 = f̂'(h) = f'(x) + f''(x) h, so h = −f'(x)/f''(x):

1. x_0 = ⟨some starting guess⟩
2. x_{k+1} = x_k − f'(x_k)/f''(x_k)

Q: Notice something? This is identical to Newton for solving f'(x) = 0.
Because of that: locally quadratically convergent.

Good idea: combine a slow-and-safe (bracketing) strategy with a fast-and-risky one (Newton).

Demo: Newton’s Method in 1D [cleared]
191
Steepest Descent/Gradient Descent
Given a scalar function f: R^n → R at a point x, which way is down?

Direction of steepest descent: −∇f

Q: How far along the gradient should we go?

Unclear; do a line search, for example using Golden Section Search:
1. x_0 = ⟨some starting guess⟩
2. s_k = −∇f(x_k)
3. Use line search to choose α_k to minimize f(x_k + α_k s_k)
4. x_{k+1} = x_k + α_k s_k
5. Go to 2.

Observation (from demo): linear convergence.

Demo: Steepest Descent [cleared] (Part 1)
192
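A sketch of the loop (the quadratic test function is made up; scipy's 1D minimizer stands in for the line search):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad, x, n_steps=50):
    for k in range(n_steps):
        s = -grad(x)                                      # steepest descent direction
        alpha = minimize_scalar(lambda a: f(x + a*s)).x   # line search for alpha_k
        x = x + alpha*s
    return x

# Hypothetical ill-conditioned quadratic: f(x) = x_1^2 + 10 x_2^2
f = lambda v: v[0]**2 + 10*v[1]**2
grad = lambda v: np.array([2*v[0], 20*v[1]])
print(steepest_descent(f, grad, np.array([1.0, 1.0])))    # -> near [0, 0]
```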
Steepest Descent: Convergence
Consider the quadratic model problem:

f(x) = (1/2) x^T A x + c^T x,

where A is SPD. (A good model of f near a minimum.)

Define the error e_k = x_k − x*. Then one can show:

∥e_{k+1}∥_A = √(e_{k+1}^T A e_{k+1}) = ((σ_max(A) − σ_min(A)) / (σ_max(A) + σ_min(A))) ∥e_k∥_A

→ confirms linear convergence.

The convergence constant is related to conditioning:

(σ_max(A) − σ_min(A)) / (σ_max(A) + σ_min(A)) = (κ(A) − 1) / (κ(A) + 1).
193
Hacking Steepest Descent for Better Convergence
Extrapolation methods:

Look back a step, maintain ‘momentum’:

x_{k+1} = x_k − α_k ∇f(x_k) + β_k (x_k − x_{k−1})

Heavy ball method: constant α_k = α and β_k = β. Gives:

∥e_{k+1}∥_A = ((√κ(A) − 1) / (√κ(A) + 1)) ∥e_k∥_A

Demo: Steepest Descent [cleared] (Part 2)
194
Optimization in Machine Learning
What is stochastic gradient descent (SGD)?

Common in ML: objective functions of the form

f(x) = (1/n) Σ_{i=1}^n f_i(x),

where each f_i comes from an observation (“data point”) in a (training) data set. Then “batch” (i.e. normal) gradient descent is

x_{k+1} = x_k − α (1/n) Σ_{i=1}^n ∇f_i(x_k).

Stochastic GD uses one (or a few, a “minibatch”) observation at a time:

x_{k+1} = x_k − α ∇f_{φ(k)}(x_k).

ADAM optimizer: GD with exponential moving averages of ∇ and its square.
195
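A sketch of SGD on a made-up least-squares objective f(x) = (1/n) Σ_i (a_i·x − y_i)²/2, one observation per step:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 3))
y = A @ np.array([1.0, -2.0, 0.5])     # synthetic data with a known true x

x = np.zeros(3)
alpha = 1e-3
for k in range(20000):
    i = rng.integers(len(y))           # phi(k): pick one random observation
    grad_i = (A[i] @ x - y[i]) * A[i]  # gradient of the single-sample f_i
    x = x - alpha * grad_i
print(x)                               # ~[1, -2, 0.5]
```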
Conjugate Gradient Methods
Can we optimize in the space spanned by the last two step directions?

(α_k, β_k) = argmin_{α_k, β_k} f(x_k − α_k ∇f(x_k) + β_k (x_k − x_{k−1}))

Will see this in more detail later (for solving linear systems).
Provably optimal first-order method for the quadratic model problem.
Turns out to be closely related to Lanczos (A-orthogonal search directions).

Demo: Conjugate Gradient Method [cleared]
196
Nelder-Mead Method
Idea:

Form an (n + 1)-point polytope (a simplex) in n-dimensional space and adjust the worst point (highest function value) by moving it along a line passing through the centroid of the remaining points.

Demo: Nelder-Mead Method [cleared]
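Nelder-Mead is available off the shelf; a quick illustrative call via scipy on the Rosenbrock function:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda v: (1 - v[0])**2 + 100*(v[1] - v[0]**2)**2   # Rosenbrock
res = minimize(f, x0=np.array([-1.0, 2.0]), method="Nelder-Mead")
print(res.x)   # ~[1, 1], found without any derivatives
```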
197
Newton’s method (n D)
What does Newton’s method look like in n dimensions?

Build a Taylor approximation:

f(x + s) ≈ f(x) + ∇f(x)^T s + (1/2) s^T H_f(x) s =: f̂(s)

Then solve ∇f̂(s) = 0 for s, to find

H_f(x) s = −∇f(x).

1. x_0 = ⟨some starting guess⟩
2. Solve H_f(x_k) s_k = −∇f(x_k) for s_k
3. x_{k+1} = x_k + s_k
198
Newton’s method (n D): Observations

Drawbacks:
Need second (!) derivatives.
(Addressed by Conjugate Gradients, later in the class.)
Only local convergence.
Works poorly when H_f is nearly indefinite.

Demo: Newton’s method in n dimensions [cleared]
199
Quasi-Newton Methods
Secant/Broyden-type ideas carry over to optimization. How?

Come up with a way to update the approximate Hessian:

x_{k+1} = x_k − α_k B_k^{−1} ∇f(x_k),

where α_k is a line search/damping parameter.

BFGS: a secant-type method, similar to Broyden:

B_{k+1} = B_k + (y_k y_k^T) / (y_k^T s_k) − (B_k s_k s_k^T B_k) / (s_k^T B_k s_k),

where s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k).
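In practice one rarely hand-rolls BFGS; an illustrative call to scipy's implementation (Rosenbrock function again, gradient supplied):

```python
import numpy as np
from scipy.optimize import minimize

f = lambda v: (1 - v[0])**2 + 100*(v[1] - v[0]**2)**2
grad = lambda v: np.array([
    -2*(1 - v[0]) - 400*v[0]*(v[1] - v[0]**2),
    200*(v[1] - v[0]**2)])

res = minimize(f, x0=np.array([-1.0, 2.0]), jac=grad, method="BFGS")
print(res.x, res.nit)   # ~[1, 1] in a modest number of iterations
```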
200
In-Class Activity: Optimization Methods
In-class activity: Optimization Methods
201
Nonlinear Least Squares: Setup
What if the f to be minimized is actually a 2-norm?

f(x) = ∥r(x)∥_2,  r(x) = y − a(x)

Define a ‘helper function’

φ(x) = (1/2) r(x)^T r(x) = (1/2) f²(x)

and minimize that instead.

∂φ/∂x_i = (1/2) Σ_{j=1}^n ∂/∂x_i [r_j(x)²] = Σ_j (∂r_j/∂x_i) r_j,

or, in matrix form:

∇φ = J_r(x)^T r(x).
202
Gauss-Newton
For brevity: J := J_r(x).

Can show similarly:

H_φ(x) = J^T J + Σ_i r_i H_{r_i}(x).

The Newton step s can be found by solving H_φ(x) s = −∇φ.

Observation: Σ_i r_i H_{r_i}(x) is inconvenient to compute and unlikely to be large (since it is multiplied by components of the residual, which is supposed to be small) → forget about it.

Gauss-Newton method: find the step s by solving J^T J s = −∇φ = −J^T r(x).
Does that remind you of the normal equations? J s ≅ −r(x). Solve that using our existing methods for least-squares problems.
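A sketch (the exponential-fit example is made up for illustration); each step solves the linearized least-squares problem J s ≅ −r with numpy's lstsq:

```python
import numpy as np

def gauss_newton(r, jac, x, n_steps=15):
    for k in range(n_steps):
        s, *_ = np.linalg.lstsq(jac(x), -r(x), rcond=None)
        x = x + s
    return x

# Hypothetical model fit: y ~ exp(c t), data generated with c = 0.7
t = np.linspace(0, 1, 20)
y = np.exp(0.7*t)
r = lambda c: y - np.exp(c[0]*t)                    # residual
jac = lambda c: (-t*np.exp(c[0]*t)).reshape(-1, 1)  # Jacobian of r
print(gauss_newton(r, jac, np.array([0.0])))        # ~[0.7]
```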
203
Gauss-Newton Observations
Demo: Gauss-Newton [cleared]

Observations:
Newton on its own is still only locally convergent.
Gauss-Newton is clearly similar.
It’s worse because the step is only approximate.
→ Much depends on the starting guess.
204
Levenberg-Marquardt
If Gauss-Newton on its own is poorly conditioned, can try Levenberg-Marquardt:

(J_r(x_k)^T J_r(x_k) + μ_k I) s_k = −J_r(x_k)^T r(x_k)

for a ‘carefully chosen’ μ_k. This makes the system matrix ‘more invertible’, but also less accurate/faithful to the problem. Can also be translated into a least squares problem (see book).

What Levenberg-Marquardt does is generically called regularization: make H_φ ‘more positive definite’.
It is easy to rewrite as a least-squares problem:

[ J_r(x_k) ]        [ −r(x_k) ]
[ √μ_k I   ] s_k ≅  [    0    ]
205
Constrained Optimization: Problem Setup
Want x* so that

f(x*) = min_x f(x) subject to g(x) = 0

No inequality constraints just yet: this is equality-constrained optimization. Develop a (local) necessary condition for a minimum.

Necessary condition: “no feasible descent possible”. Assume g(x) = 0.
Recall the unconstrained necessary condition (“no descent possible”):

∇f(x) = 0

Look for feasible descent directions from x. (Necessary condition: none exist.)

s is a feasible direction at x if

x + αs is feasible for α ∈ [0, r] (for some r > 0).
206
Constrained Optimization Necessary Condition
Need ∇f(x) · s ≥ 0 (“uphill that way”) for any feasible direction s.
If not at the boundary: s and −s are both feasible directions ⇒ ∇f(x) = 0.
⇒ Only the boundary of the feasible set is different from the unconstrained case (i.e. interesting).

At the boundary (the common case, g(x) = 0), need:

−∇f(x) ∈ rowspan(J_g),

aka “all descent directions would cause a change (→ violation) of the constraints”.
Q: Why ‘rowspan’? Think about the shape of J_g.

⇔ −∇f(x) = J_g^T λ for some λ.
207
Lagrange Multipliers
Seen: need −∇f(x) = J_g^T λ at the (constrained) optimum.

Idea: Turn the constrained optimization problem for x into an unconstrained optimization problem for (x, λ). How?

Need a new function L(x, λ) to minimize:

L(x, λ) = f(x) + λ^T g(x).
208
Lagrange Multipliers Development
L(x, λ) = f(x) + λ^T g(x)

Then ∇L = 0 at the unconstrained minimum, i.e.

0 = ∇L = [ ∇_x L ]  =  [ ∇f + J_g(x)^T λ ]
         [ ∇_λ L ]     [      g(x)       ]

Convenient: this matches our necessary condition!

So we could use any unconstrained method to minimize L.
For example: using Newton to minimize L is called Sequential Quadratic Programming (‘SQP’).

Demo: Sequential Quadratic Programming [cleared]
209
Inequality-Constrained Optimization
Want x* so that

f(x*) = min_x f(x) subject to g(x) = 0 and h(x) ≤ 0

Develop a necessary condition for a minimum.

Again: assume we are at a feasible point on the boundary of the feasible region. Must ensure descent directions are infeasible.

Motivation: g = 0 ⇔ two inequality constraints, g ≤ 0 and g ≥ 0.

Consider the condition −∇f(x) = J_h^T λ_2: a descent direction must start violating a constraint. But only one direction is dangerous here:
−∇f: descent direction of f,
∇h_i: ascent direction of h_i.
If we assume λ_2 > 0, going towards −∇f would increase h (and start violating h ≤ 0).
210
Lagrangian, Active/Inactive
Put together the overall Lagrangian:

L(x, λ_1, λ_2) = f(x) + λ_1^T g(x) + λ_2^T h(x)

What are active and inactive constraints?

Active: active ⇔ h_i(x*) = 0 ⇔ at the ‘boundary’ of the inequality constraint.
(Equality constraints are always ‘active’.)

Inactive: if h_i is inactive (h_i(x*) < 0), must force λ_{2,i} = 0.
Otherwise: the behavior of h could change the location of the minimum of L.
Use the complementarity condition h_i(x*) λ_{2,i} = 0
⇔ at least one of h_i(x*) and λ_{2,i} is zero.
211
Karush-Kuhn-Tucker (KKT) Conditions
Develop a set of necessary conditions for a minimum.

Assuming J_g and J_{h,active} have full rank, this set of conditions is necessary:

(∗) ∇_x L(x*, λ_1*, λ_2*) = 0
(∗) g(x*) = 0
    h(x*) ≤ 0
    λ_2 ≥ 0
(∗) h(x*) · λ_2 = 0

These are called the Karush-Kuhn-Tucker (‘KKT’) conditions.

Computational approach: solve the (∗) equations by Newton.
212
Outline
Introduction to Scientific Computing
Systems of Linear Equations
Linear Least Squares
Eigenvalue Problems
Nonlinear Equations
Optimization
Interpolation
  Introduction
  Methods
  Error Estimation
  Piecewise interpolation, Splines
Numerical Integration and Differentiation
Initial Value Problems for ODEs
Boundary Value Problems for ODEs
Partial Differential Equations and Sparse Linear Algebra
Fast Fourier Transform
Additional Topics
213
Interpolation Setup
Given: (x_i)_{i=1}^N, (y_i)_{i=1}^N.
Wanted: a function f so that f(x_i) = y_i.

How is this not the same as function fitting (from least squares)?

It’s very similar; the key difference is that we are asking for exact equality, not just minimization of a residual norm.
→ Better error control: the error is not dominated by the residual.

Idea: there is an underlying function that we are approximating from the known point values.
Error here: distance from that underlying function.
214
Interpolation Setup (II)
Given: (x_i)_{i=1}^N, (y_i)_{i=1}^N.
Wanted: a function f so that f(x_i) = y_i.

Does this problem have a unique answer?

No; there are infinitely many functions that satisfy the problem as stated.
215
Interpolation Importance
Why is interpolation important?

It brings all of calculus within range of numerical operations.
Why? Because calculus works on functions.
How?
1. Interpolate (go from discrete to continuous).
2. Apply calculus.
3. Re-discretize (evaluate at points).
216
Making the Interpolation Problem Unique
Limit the set of functions to the span of an interpolation basis (φ_i)_{i=1}^{N_func}:

f(x) = Σ_{j=1}^{N_func} α_j φ_j(x)

Interpolation becomes solving the linear system

y_i = f(x_i) = Σ_{j=1}^{N_func} α_j φ_j(x_i), i.e. Vα = y with V_{ij} = φ_j(x_i).

Want a unique answer: pick N_func = N → V square.
V is called the (generalized) Vandermonde matrix.
V (coefficients) = (values at nodes).
Can prescribe derivatives: use φ'_j, f' in a row (Hermite interpolation).
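A sketch with the monomial basis, where V is the classical Vandermonde matrix (data made up for illustration):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 0.0, 5.0])

V = np.vander(x, increasing=True)   # V_ij = x_i**j: columns 1, x, x^2, x^3
coeffs = np.linalg.solve(V, y)      # solve V alpha = y

# Evaluate the interpolant at a new point:
xx = 1.5
print(np.vander(np.array([xx]), len(coeffs), increasing=True) @ coeffs)
```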
217
Existence/Sensitivity
Solution to the interpolation problem: existence? uniqueness?

Equivalent to existence/uniqueness for the linear system.

Sensitivity?

Shallow answer: simply consider the condition number of the linear system.
But ∥coefficients∥ does not suffice as a measure of stability: f(x) can be evaluated in many places (f is an interpolant).
Want: max_{x∈[a,b]} |f(x)| ≤ Λ ∥y∥_∞. Λ: the Lebesgue constant. Λ depends on n and (x_i)_i.
Technically, Λ also depends on (φ_i)_i, but it is the same for all polynomial bases.
218
Modes and Nodes (aka Functions and Points)
Both the function basis and the point set are under our control. What do we pick?

Ideas for basis functions:
Monomials 1, x, x², x³, x⁴, …
Functions that make V = I → ‘Lagrange basis’
Functions that make V triangular → ‘Newton basis’
Splines (piecewise polynomials)
Orthogonal polynomials
Sines and cosines
‘Bumps’ (‘Radial Basis Functions’)

Ideas for points:
Equispaced
‘Edge-clustered’ (so-called Chebyshev/Gauss nodes)

Specific issues:
Why not monomials on equispaced points?
Demo: Monomial interpolation [cleared]
Why not equispaced?
Demo: Choice of Nodes for Polynomial Interpolation [cleared]
219
Lagrange Interpolation
Find a basis so that V = I, i.e.

φ_j(x_i) = 1 if i = j, 0 otherwise.

Start with a simple example: three nodes x_1, x_2, x_3.

φ_1(x) = (x − x_2)(x − x_3) / ((x_1 − x_2)(x_1 − x_3))
φ_2(x) = (x − x_1)(x − x_3) / ((x_2 − x_1)(x_2 − x_3))
φ_3(x) = (x − x_1)(x − x_2) / ((x_3 − x_1)(x_3 − x_2))

Numerator: ensures φ_i is zero at the other nodes.
Denominator: ensures φ_i(x_i) = 1.
220
Lagrange Polynomials General Form
φ_j(x) = Π_{k=1, k≠j}^m (x − x_k) / Π_{k=1, k≠j}^m (x_j − x_k)

Write down the Lagrange interpolant for nodes (x_i)_{i=1}^m and values (y_i)_{i=1}^m:

p_{m−1}(x) = Σ_{j=1}^m y_j φ_j(x)
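A direct sketch of that formula (data made up; fine for small m, though the barycentric form is preferred in practice):

```python
import numpy as np

def lagrange_eval(nodes, y, x):
    # p(x) = sum_j y_j phi_j(x), with phi_j built from the product formula
    p = 0.0
    for j in range(len(nodes)):
        phi = 1.0
        for k in range(len(nodes)):
            if k != j:
                phi *= (x - nodes[k]) / (nodes[j] - nodes[k])
        p += y[j] * phi
    return p

nodes = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 2.0])
print(lagrange_eval(nodes, y, 0.5))
print(lagrange_eval(nodes, y, 1.0))   # reproduces y[1] exactly
```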
221
Newton Interpolation
Find a basis so that V is triangular.

Easier to build than Lagrange, but coefficient finding costs O(n²):

φ_j(x) = Π_{k=1}^{j−1} (x − x_k)

(At least) two possibilities for coefficient finding:
Set up V, run forward substitution.
Divided Differences (Wikipedia link).

Why not Lagrange/Newton?

Cheap to form, expensive to evaluate, expensive to do calculus on.
222
Better conditioning: Orthogonal polynomials

What caused monomials to have a terribly conditioned Vandermonde?

Being close to linearly dependent.

What’s a way to make sure two vectors are not like that?

Orthogonality.

But polynomials are functions!
223
Orthogonality of Functions
How can functions be orthogonal?

Need an inner product; ‘orthogonal’ then just means ⟨f, g⟩ = 0.
For vectors:

f · g = Σ_{i=1}^n f_i g_i = ⟨f, g⟩

For functions:

⟨f, g⟩ = ∫_{−1}^{1} f(x) g(x) dx
224
Constructing Orthogonal Polynomials
How can we find an orthogonal basis?

Apply Gram-Schmidt to the monomials.

Demo: Orthogonal Polynomials [cleared] (we get the Legendre polynomials)
But how can I practically compute the Legendre polynomials?

→ DLMF, chapter on orthogonal polynomials.
There exist three-term recurrences, e.g. (for Chebyshev) T_{n+1} = 2xT_n − T_{n−1}.
There is a whole zoo of polynomials there, depending on the weight function w in the (generalized) inner product:

⟨f, g⟩ = ∫ w(x) f(x) g(x) dx.

Some sets of orthogonal polynomials live on intervals other than (−1, 1).
225
Chebyshev Polynomials: Definitions

Three equivalent definitions:
1. Result of Gram-Schmidt with weight 1/√(1 − x²).
What is that weight? 1/(half circle), i.e. x² + y² = 1, with y = √(1 − x²).
(Like for Legendre, you won’t exactly get the standard normalization if you do this.)
2. T_k(x) = cos(k cos⁻¹(x))
3. T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x), plus T_0 = 1, T_1 = x
226
Chebyshev Interpolation
What is the Vandermonde matrix for Chebyshev polynomials?

Need to know the nodes to answer that.
The answer would be very simple if the nodes were cos(∗).
So why not cos(equispaced)? Maybe

x_i = cos(iπ/k)  (i = 0, 1, …, k).

These are just the extrema (minima/maxima) of T_k. Then

V_{ij} = cos(j cos⁻¹(cos(iπ/k))) = cos(jiπ/k).

Called the Discrete Cosine Transform (DCT).
Matvec (and inverse) available with O(N log N) cost (→ FFT).
227
Chebyshev Nodes
Might also consider the roots (instead of the extrema) of T_k:

x_i = cos((2i − 1)π / (2k))  (i = 1, …, k).

The Vandermonde matrix for these nodes (with T_k) can be applied in O(N log N) time, too.

Edge-clustering seemed like a good thing in interpolation nodes. Do these do that?

Yes.
Demo: Chebyshev Interpolation [cleared] (Part I-IV)
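A sketch of Chebyshev interpolation at the extrema nodes, on a Runge-type test function (made up for illustration). Here the Vandermonde system is solved directly, though a DCT would do it in O(N log N):

```python
import numpy as np

k = 20
x = np.cos(np.arange(k + 1) / k * np.pi)   # Chebyshev extrema
f = lambda t: 1 / (1 + 16*t**2)            # Runge-type test function

V = np.polynomial.chebyshev.chebvander(x, k)   # V_ij = T_j(x_i)
coeffs = np.linalg.solve(V, f(x))

xx = np.linspace(-1, 1, 1000)
err = np.max(np.abs(np.polynomial.chebyshev.chebval(xx, coeffs) - f(xx)))
print(err)   # small, and decays rapidly as k grows
```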
228
Chebyshev Interpolation: Summary

Chebyshev interpolation is fast and works extremely well. See http://www.chebfun.org and ATAP.
In 1D, it is a very good answer to the interpolation question.
But sometimes a piecewise approximation (with a specifiable level of smoothness) is better suited to the application.
229
In-Class Activity: Interpolation
In-class activity: Interpolation
230
Truncation Error in Interpolation
If f is n times continuously differentiable on a closed interval I and p_{n−1}(x) is a polynomial of degree at most n − 1 that interpolates f at n distinct points x_i (i = 1, …, n) in that interval, then for each x in the interval there exists a ξ in that interval such that

f(x) − p_{n−1}(x) = (f^(n)(ξ) / n!) (x − x_1)(x − x_2) ··· (x − x_n).

Set the error term to be R(x) = f(x) − p_{n−1}(x), and set up an auxiliary function

Y_x(t) = R(t) − (R(x)/W(x)) W(t), where W(t) = Π_{i=1}^n (t − x_i).

Note also the introduction of t as an additional variable, independent of the point x where we hope to prove the identity.
231
Truncation Error in Interpolation, cont’d

Y_x(t) = R(t) − (R(x)/W(x)) W(t), where W(t) = Π_{i=1}^n (t − x_i)

Since the x_i are roots of R(t) and W(t), we have Y_x(x) = Y_x(x_i) = 0, which means Y_x has at least n + 1 roots.
By Rolle’s theorem, Y_x'(t) has at least n roots; iterating, Y_x^(n) has at least one root ξ, where ξ ∈ I.
Since p_{n−1}(x) is a polynomial of degree at most n − 1, R^(n)(t) = f^(n)(t). Thus

Y_x^(n)(t) = f^(n)(t) − (R(x)/W(x)) n!.

Plugging Y_x^(n)(ξ) = 0 into the above yields the result.
232
Error Result: Connection to Chebyshev
What is the connection between the error result and Chebyshev interpolation?

The error bound suggests choosing the interpolation nodes such that the product |Π_{i=1}^n (x − x_i)| is as small as possible. The Chebyshev nodes achieve this.
If nodes are edge-clustered, Π_{i=1}^n (x − x_i) clamps down the (otherwise quickly-growing) error there.
Confusing: the Chebyshev approximating polynomial (or “polynomial best-approximation”) is not the same as the Chebyshev interpolant.
Chebyshev nodes also do not minimize the Lebesgue constant.

Demo: Chebyshev Interpolation [cleared] (Part V)
233
Error Result: Simplified Form
Boil the error result down to a simpler form.

Assume x_1 < ··· < x_n and |f^(n)(x)| ≤ M for x ∈ [x_1, x_n].
Set the interval length h = x_n − x_1. Then |x − x_i| ≤ h.
Altogether: there is a constant C independent of h so that

max_x |f(x) − p_{n−1}(x)| ≤ C M h^n.

For grid spacing h → 0, we have E(h) = O(h^n). This is called convergence of order n.

Demo: Interpolation Error [cleared]
Demo: Jump with Chebyshev Nodes [cleared]
234
Going piecewise: Simplest Case

Construct a piecewise linear interpolant at four points:

(x_0, y_0), (x_1, y_1), (x_2, y_2), (x_3, y_3)

| f_1 = a_1 x + b_1 | f_2 = a_2 x + b_2 | f_3 = a_3 x + b_3 |
| 2 unknowns        | 2 unknowns        | 2 unknowns        |
| f_1(x_0) = y_0    | f_2(x_1) = y_1    | f_3(x_2) = y_2    |
| f_1(x_1) = y_1    | f_2(x_2) = y_2    | f_3(x_3) = y_3    |
| 2 equations       | 2 equations       | 2 equations       |
Why three intervals
General situation rarr two end intervals and one middle interval Canjust add more middle intervals if needed
235
Piecewise Cubic (‘Splines’)
Data: (x_0, y_0), (x_1, y_1), (x_2, y_2), (x_3, y_3), with one cubic per interval:

f_1 = a_1 x³ + b_1 x² + c_1 x + d_1
f_2 = a_2 x³ + b_2 x² + c_2 x + d_2
f_3 = a_3 x³ + b_3 x² + c_3 x + d_3

i.e. 4 unknowns per interval. Interpolation conditions:
f_1(x_0) = y_0, f_2(x_1) = y_1, f_3(x_2) = y_2,
f_1(x_1) = y_1, f_2(x_2) = y_2, f_3(x_3) = y_3.

Not enough; need more conditions. Ask for more smoothness:
f_1'(x_1) = f_2'(x_1), f_2'(x_2) = f_3'(x_2),
f_1''(x_1) = f_2''(x_1), f_2''(x_2) = f_3''(x_2).

Not enough; need yet more conditions:
f_1''(x_0) = 0, f_3''(x_3) = 0.

Now have a square system.
236
Piecewise Cubic (‘Splines’): Accounting
Same setup as before: cubics f_1, f_2, f_3 on the intervals between (x_0, y_0), …, (x_3, y_3).

Number of conditions: 2 N_intervals + 2 N_middle nodes + 2, where

N_intervals − 1 = N_middle nodes,

so

2 N_intervals + 2 (N_intervals − 1) + 2 = 4 N_intervals,

which is exactly the number of unknown coefficients.

These conditions are fairly arbitrary: one can choose different ones basically at will. The above choice: the ‘natural spline’.

Can also come up with a basis of spline functions (with the chosen smoothness conditions). These are called B-Splines.
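In practice, libraries handle this bookkeeping; an illustrative call (data made up) to scipy, where bc_type="natural" imposes exactly the f'' = 0 end conditions above:

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 0.0, 5.0])

spl = CubicSpline(x, y, bc_type="natural")
print(spl(1.5))      # interpolant value at x = 1.5
print(spl(0.0, 2))   # second derivative at the left end: 0 by construction
```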
237
Outline
Introduction to Scientific Computing
Systems of Linear Equations
Linear Least Squares
Eigenvalue Problems
Nonlinear Equations
Optimization
Interpolation
Numerical Integration and Differentiation
  Numerical Integration
  Quadrature Methods
  Accuracy and Stability
  Gaussian Quadrature
  Composite Quadrature
  Numerical Differentiation
  Richardson Extrapolation
Initial Value Problems for ODEs
Boundary Value Problems for ODEs
Partial Differential Equations and Sparse Linear Algebra
Fast Fourier Transform
Additional Topics
238
Numerical Integration: About the Problem

What is numerical integration? (Also known as: quadrature.)

Given a, b, f, compute

∫_a^b f(x) dx.

What about existence and uniqueness?

The answer exists e.g. if f is integrable in the Riemann or Lebesgue sense.
The answer is unique if f is e.g. piecewise continuous and bounded (this also implies existence).
239
Conditioning
Derive the (absolute) condition number for numerical integration.

Let f̃(x) = f(x) + e(x), where e(x) is a perturbation. Then

|∫_a^b f̃(x) dx − ∫_a^b f(x) dx| = |∫_a^b e(x) dx| ≤ ∫_a^b |e(x)| dx ≤ (b − a) max_{x∈[a,b]} |e(x)|.
240
Interpolatory Quadrature
Design a quadrature method based on interpolation.

With interpolation, f(x) ≈ Σ_{i=1}^n α_i φ_i(x). Then

∫_a^b f(x) dx ≈ ∫_a^b Σ_{i=1}^n α_i φ_i(x) dx = Σ_{i=1}^n α_i ∫_a^b φ_i(x) dx.

The α_i are linear combinations of the function values (f(x_i))_{i=1}^n.
Want:

∫_a^b f(x) dx ≈ Σ_{i=1}^n ω_i f(x_i).

The nodes (x_i) and weights (ω_i) together make up the quadrature rule.

Idea: any interpolation method (nodes + basis) gives rise to an interpolatory quadrature method.
241
Interpolatory Quadrature: Examples

Example: Fix the nodes (x_i). Then

f(x) ≈ Σ_i f(x_i) ℓ_i(x),

where ℓ_i(x) is the Lagrange polynomial for the node x_i. Then

∫_a^b f(x) dx ≈ Σ_i f(x_i) ∫_a^b ℓ_i(x) dx, with ω_i = ∫_a^b ℓ_i(x) dx.

With polynomials and (often) equispaced nodes, this is called Newton-Cotes quadrature.
With Chebyshev nodes and (Chebyshev or other) polynomials, this is called Clenshaw-Curtis quadrature.
242
Interpolatory Quadrature: Computing Weights
How do the weights in interpolatory quadrature get computed?

By solving a linear system. We know this quadrature should at least integrate monomials exactly:

b − a = ∫_a^b 1 dx = ω_1 · 1 + ··· + ω_n · 1
⋮
(b^{k+1} − a^{k+1}) / (k + 1) = ∫_a^b x^k dx = ω_1 x_1^k + ··· + ω_n x_n^k

Write down n equations for n unknowns, solve the linear system, done.

This is called the method of undetermined coefficients.

Demo: Newton-Cotes weight finder [cleared]
243
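A sketch of the method of undetermined coefficients (nodes chosen for illustration); row k enforces exact integration of x^k:

```python
import numpy as np

def quad_weights(nodes, a, b):
    n = len(nodes)
    A = np.array([[x**k for x in nodes] for k in range(n)], dtype=float)
    rhs = np.array([(b**(k+1) - a**(k+1)) / (k+1) for k in range(n)])
    return np.linalg.solve(A, rhs)

# Equispaced nodes on [0, 1] recover Simpson's rule: weights 1/6, 4/6, 1/6
print(quad_weights([0.0, 0.5, 1.0], 0, 1))
```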
Examples and Exactness
To what polynomial degree are the following rules exact?

Midpoint rule: (b − a) f((a + b)/2)
Trapezoidal rule: ((b − a)/2) (f(a) + f(b))
Simpson’s rule: ((b − a)/6) (f(a) + 4f((a + b)/2) + f(b))  (fits a parabola)

Midpoint: technically 0 (constants), actually 1 (linears).
Trapezoidal: 1 (linears).
Simpson’s: technically 2 (parabolas), actually 3 (cubics).
Cancellation of the odd-order error requires symmetric nodes.
(trapz − midpt) is usable as an (“a-posteriori”) error estimate.
244
Interpolatory Quadrature: Accuracy
Let p_{n−1} be an interpolant of f at nodes x_1, …, x_n (of degree n − 1). Recall

Σ_i ω_i f(x_i) = ∫_a^b p_{n−1}(x) dx.

What can you say about the accuracy of the method?

|∫_a^b f(x) dx − ∫_a^b p_{n−1}(x) dx|
≤ ∫_a^b |f(x) − p_{n−1}(x)| dx
≤ (b − a) ∥f − p_{n−1}∥_∞
≤ C (b − a) h^n ∥f^(n)∥_∞   (using the interpolation error)
≤ C h^{n+1} ∥f^(n)∥_∞.
245
Quadrature: Overview of Rules

              n   Deg   ExIntDeg     IntpOrd   QuadOrd     QuadOrd
                        (w/odd)                (regular)   (w/odd)
(general)     n   n−1   (n−1)+1odd   n         n+1         (n+1)+1odd
Midpoint      1   0     1            1         2           3
Trapezoidal   2   1     1            2         3           3
Simpson’s     3   2     3            3         4           5
Simpson 3/8   4   3     3            4         5           5

n: number of points.
“Deg”: degree of the polynomial used in interpolation (= n − 1).
“ExIntDeg”: polynomials of up to (and including) this degree actually get integrated exactly (including the odd-order bump).
“IntpOrd”: order of accuracy of the interpolation: O(h^n).
“QuadOrd (regular)”: order of accuracy for quadrature predicted by the error result above: O(h^{n+1}).
“QuadOrd (w/odd)”: actual order of accuracy for quadrature, given ‘bonus’ degrees for rules with odd point count.

Observation: quadrature gets (at least) ‘one order higher’ than interpolation, even more for odd-order rules (i.e. more accurate).
Demo: Accuracy of Newton-Cotes [cleared]
246
Interpolatory Quadrature: Stability
Let p_{n−1} be an interpolant of f at nodes x_1, …, x_n (of degree n − 1). Recall

Σ_i ω_i f(x_i) = ∫_a^b p_{n−1}(x) dx.

What can you say about the stability of this method?

Again consider f̃(x) = f(x) + e(x):

|Σ_i ω_i f̃(x_i) − Σ_i ω_i f(x_i)| ≤ Σ_i |ω_i e(x_i)| ≤ (Σ_i |ω_i|) ∥e∥_∞.

So what quadrature weights make for bad stability bounds?

Quadratures with large negative weights. (Recall: Σ_i ω_i is fixed.)
247
About Newton-Cotes
What’s not to like about Newton-Cotes quadrature?
Demo: Newton-Cotes weight finder [cleared] (again, with many nodes)

In fact, Newton-Cotes must have at least one negative weight as soon as n ≥ 11.

More drawbacks:
All the fun of high-order interpolation with monomials and equispaced nodes (i.e. convergence not guaranteed).
Weights possibly negative (→ stability issues).
Coefficients determined by a (possibly ill-conditioned) Vandermonde matrix.
Thus hard to extend to arbitrary numbers of points.
248