
MATH 2XX3 - Advanced Calculus II

Prof. S. Alama

Class notes recorded, adapted, and illustrated by Sang Woo Park

Revised March 13, 2018

© 2017, 2018 by S. Alama and S.W. Park. All Rights Reserved. Not to be distributed without the authors' permission.


Contents

Preface

Bibliography

1 Space, vectors, and sets
  1.1 Vector norms
  1.2 Subsets of R^n
  1.3 Practice problems

2 Functions of several variables
  2.1 Limits and continuity
  2.2 Differentiability
  2.3 Chain rule
  2.4 Practice problems

3 Paths, directional derivative, and gradient
  3.1 Paths and curves
  3.2 Directional derivatives and gradient
  3.3 Gradients and level sets
  3.4 Lagrange multipliers
  3.5 Practice problems

4 The geometry of space curves
  4.1 The Frenet frame and curvature
  4.2 Dynamics
  4.3 Practice problems

5 Implicit functions
  5.1 The Implicit Function Theorem I
  5.2 The Implicit Function Theorem II
  5.3 Inverse Function Theorem
  5.4 Practice problems

6 Taylor's Theorem and critical points
  6.1 Taylor's Theorem in one dimension
  6.2 Taylor's Theorem in higher dimensions
  6.3 Local minima/maxima
  6.4 Practice problems

7 Calculus of Variations
  7.1 One-dimensional problems
  7.2 Problems in higher dimensions
  7.3 Practice problems

8 Fourier Series
  8.1 Orthogonality
  8.2 Fourier Series
  8.3 Convergence
  8.4 Orthogonal functions
    8.4.1 Gram-Schmidt process
    8.4.2 Eigenvalue problems
  8.5 Application: The vibrating string
  8.6 Practice problems


Preface

These notes are intended as a kind of organic textbook for a class in advanced calculus, for students who have already studied three terms of calculus (from a standard textbook such as Stewart [Stew]) and one or two terms of linear algebra.

Typically, these two subjects (calculus and linear algebra) have been introduced as separate areas of study, and aside from some small intersection, the true intertwined nature of multivariable and vector calculus with linear algebra has been hidden. For example, in Stewart the derivative of multivariable functions is only discussed in special cases, introducing only the partial derivatives and certain physically important combinations (curl and divergence), but avoiding the discussion of the derivative as a linear transformation (i.e., a matrix). As a consequence, the notion of differentiability is not properly presented. In addition, the chain rule is presented as an ad hoc collection of separate formulas dealing with specific cases of composite functions, together with some visual tricks to make them easier to calculate. There are many other instances where mysterious-looking calculation techniques are introduced in order to avoid a more powerful and general formulation via matrices and linear maps. This is because standard calculus books cannot assume that students have linear algebra as a prerequisite or corequisite course.

A second issue which we address here is the level and mathematical content of the multivariable calculus introduced in the first three terms. In a mixed audience with not only mathematics students but also physicists, engineers, biologists, economists, etc., the emphasis in the calculus class is on formulas, calculation, and problem solving directed to the applications of calculus, and less stress is put on the actual mathematics. This can lead to a shock when students encounter higher-level mathematics, such as real and complex analysis or topology, where calculational proficiency becomes secondary to an understanding of when and why mathematical facts are true. In this course we will pay careful attention to precise definitions, and the need to verify that they are satisfied in specific examples. Facts will be presented as Theorems, and students will need to understand the hypotheses and the conclusions, how they apply (or not) to specific cases, and what their consequences are. Some computation will be needed to do examples, but we will make a step up in mathematical rigor, and the emphasis will be on understanding the definitions and theorems.

The topics presented here begin with the foundations of multivariable calculus: a discussion of the structure of Euclidean space R^n, followed by limits and continuity, the derivative, and important properties of the derivative. This covers Chapters 1–3. The succeeding chapters are more or less independent of each other and may be covered in any order (and some may be deemed optional). Chapter 4 concerns the differential geometry of curves in R³, with the Frenet frame and curvature. The Implicit and Inverse Function Theorems are the subject of Chapter 5. Chapter 6 introduces the second-order Taylor's Theorem in order to understand the role of the second derivative matrix in classifying critical points. Chapter 7 is an introduction to the Calculus of Variations, which puts together many aspects of the previous material in the study of extremal problems in spaces of functions. And finally, in Chapter 8 we introduce Fourier Series and orthogonal decompositions, a field which is too important to mathematics and its applications to be relegated to more specialized classes in partial differential equations. (An inexpensive paperback Dover book by Tolstov [Tol] is also used for this segment of the course.)

Why write these notes when there are so many calculus textbooks on the market? It's true that there are some excellent books dealing with advanced calculus in a way which introduces correct mathematical definitions and statements of theorems, without introducing the difficulties of rigorously proving all of the (fairly complex) theorems of vector calculus; the book by Marsden & Tromba [MT] is one. However, these books are intended as a replacement for Stewart, and spend many pages on developing the calculation techniques which students have already learned in the 3-term calculus sequence. And the linear algebra background presumed by [MT] is actually much less than what students learn in the 2-semester linear algebra sequence 1B03–2R03. More advanced books, such as Marsden & Hoffman [MH], are really analysis books, and the level is too high for a second-year course. And so we have a need for an intermediate-level text.

Fortunately, a student, Sang Woo Park, made a LaTeX transcription of his class notes for Math 2XX3 in the Winter 2017 term, which he generously offered to his fellow students on his personal web page as the document evolved. Sang Woo then added some graphics to illustrate various concepts, and this text was born!

Finally, I will be keeping a corrected copy and a list of corrections on the course web page. If you find an error, or some place where the text is unclear (and it doesn't already appear on the list of corrections), please send me email at [email protected] so I can make the correction.

Prof. S. Alama, McMaster University, October 2017. Revised March 13, 2018


Bibliography

[AR] H. Anton and C. Rorres. Elementary Linear Algebra: Applications Version, 11th Edition. Wiley Global Education, 2013.

[MH] J. Marsden and M. Hoffman. Elementary Classical Analysis. W. H. Freeman, 1993.

[MT] J. E. Marsden and A. Tromba. Vector Calculus. Macmillan, 2003.

[Stew] J. Stewart. Calculus: Early Transcendentals. Cengage Learning, 2015.

[Tol] G. Tolstov. Fourier Series. Dover Books on Mathematics. Dover Publications, 2012.


Chapter 1

Space, vectors, and sets

Over the last three terms you've studied (at least) two flavors of mathematics: calculus and linear algebra. They have been taught separately, and probably feel very different to you. Linear algebra is about vectors and matrices, and how you use them to solve linear algebraic equations. Calculus is about functions, continuity, and rates of change. The main goal of this course is to bring these two subjects together. Indeed, to understand calculus in more than one dimension it is essential to use concepts from linear algebra. And the constructions from linear algebra (which may have seemed strange and arbitrary to you when you learned them) are then motivated by their use in understanding calculus in the plane and in space.

We begin with a brief discussion of space,

R^n = {(x_1, x_2, . . . , x_n) : x_1, x_2, . . . , x_n ∈ R},

where it all happens.

1.1 Vector norms

To connect linear algebra to calculus we need to talk about lengths and distances. This is done via a norm.

Definition 1.1. The Euclidean norm of ~x = (x_1, x_2, . . . , x_n) is given by

‖~x‖ = √(~x · ~x) = √( ∑_{j=1}^n x_j² ).

Using the norm we can then define the distance between any two points ~p and ~q in R^n,

dist(~p, ~q) = ‖~p − ~q‖.

The Euclidean norm is intimately related to the dot product of vectors,

~x · ~y = ∑_{i=1}^n x_i y_i,

1

Page 9: MATH 2XX3 - Advanced Calculus II - ms.mcmaster.ca · MATH 2XX3 - Advanced Calculus II Prof. S. Alama 1 Class notes recorded, adapted, and illustrated by Sang Woo Park} Revised March

CHAPTER 1. SPACE, VECTORS, AND SETS 2

which gives us the useful concept of orthogonality: ~x ⊥ ~y when ~x · ~y = 0. More generally, the dot product enables us to calculate angles between vectors:

~x · ~y = ‖~x‖ ‖~y‖ cos θ,

where θ is the angle between ~x and ~y, drawn in the plane determined by the two vectors. Since |cos θ| ≤ 1 for any angle, we also have the famous

Theorem 1.2 (Cauchy-Schwarz Inequality).

|~x · ~y| ≤ ‖~x‖ ‖~y‖
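As a quick numerical sanity check (my own illustration, not part of the original notes, and of course not a proof), one can test the inequality on random vectors. This sketch assumes Python with numpy available:

```python
import numpy as np

def cauchy_schwarz_gap(x, y):
    """Return ||x|| ||y|| - |x . y|; Theorem 1.2 says this is never negative."""
    return np.linalg.norm(x) * np.linalg.norm(y) - abs(np.dot(x, y))

rng = np.random.default_rng(0)
gaps = [cauchy_schwarz_gap(rng.normal(size=5), rng.normal(size=5))
        for _ in range(1000)]
print(min(gaps) >= -1e-12)  # True: the inequality held in every trial
```

The tiny tolerance allows for floating-point round-off; the gap is exactly zero precisely when ~x and ~y are parallel.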

The Euclidean norm has certain properties, which are intuitively clear if we think about it as measuring the length of a vector:

Theorem 1.3 (Properties of a norm).

1. ‖~x‖ ≥ 0 and ‖~x‖ = 0 iff ~x = ~0 = (0, 0, . . . , 0).

2. For all scalars a ∈ R, ‖a~x‖ = |a| · ‖~x‖.

3. (Triangle inequality) ‖~x+ ~y‖ ≤ ‖~x‖+ ‖~y‖.

Proof. (i) If ~x = ~0, then by the formula ‖~0‖ = √0 = 0. On the other hand, if ‖~x‖ = 0, then

0 = ‖~x‖² = ∑_{i=1}^n x_i².

This is a sum of squares, so each term in the sum is non-negative, and thus each must equal zero. This proves (i).

For (ii), this is just factoring (being careful that √(a²) = |a|!):

‖a~x‖ = √( a² ∑_{i=1}^n x_i² ) = |a| √( ∑_{i=1}^n x_i² ) = |a| ‖~x‖.

For (iii), we look at the square of the norm, which is easier than the norm itself:

‖~x + ~y‖² = (~x + ~y) · (~x + ~y) = ~x · ~x + ~y · ~y + 2 ~x · ~y = ‖~x‖² + ‖~y‖² + 2 ~x · ~y
           ≤ ‖~x‖² + ‖~y‖² + 2 ‖~x‖ ‖~y‖   (by Cauchy-Schwarz)
           = (‖~x‖ + ‖~y‖)².

Since both sides are non-negative, we can take the square root of each side to obtain the triangle inequality (iii).

We separate out the properties in this way to point out that there may be many other ways of measuring the length of a vector in R^n (giving different numerical values) which would be equally valid as norms.

Page 10: MATH 2XX3 - Advanced Calculus II - ms.mcmaster.ca · MATH 2XX3 - Advanced Calculus II Prof. S. Alama 1 Class notes recorded, adapted, and illustrated by Sang Woo Park} Revised March

CHAPTER 1. SPACE, VECTORS, AND SETS 3

Example 1.4 (A non-Euclidean norm: the Taxi Cab norm). Consider the vector ~p = (p_1, p_2) ∈ R². The Euclidean norm gives the length of the diagonal line joining ~0 to ~p. On the other hand,

‖~p‖₁ = |p_1| + |p_2|

gives the length traveled in a rectangular grid system, like the number of blocks traveled in a car in a city of streets meeting at right angles. We can define a Taxi Cab norm in any dimension: for ~p = (p_1, p_2, . . . , p_n) ∈ R^n, ‖~p‖₁ = ∑_{j=1}^n |p_j|.

As with the Euclidean norm, we may define the distance between two points ~p, ~q ∈ R^n as the Taxi Cab length of the vector joining them, dist₁(~p, ~q) = ‖~p − ~q‖₁. In the plane, this is the number of city blocks (east-west plus north-south) you need to drive to get from one point to the other. Unless the two points lie on the same street (and can be joined by a segment parallel to one of the axes), this distance will be larger than the Euclidean distance, which is the distance "as the bird flies" from one point to the other along the (diagonal) straight-line path.

The Taxi Cab norm is a valid norm because it satisfies all the properties of a norm above. (You will show this as an exercise!) So it also gives us a valid alternative way to measure distance in R^n, dist₁(~p, ~q) = ‖~p − ~q‖₁. This way of measuring distance gives R^n a different geometry, as we will see below. (However, this norm is not compatible with the dot product, as you will also see from the exercise below.)
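The two distances are easy to compare numerically; here is a small sketch of my own (assuming Python with numpy), not part of the original notes:

```python
import numpy as np

def dist_euclid(p, q):
    """Euclidean distance ||p - q||."""
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

def dist_taxi(p, q):
    """Taxi Cab distance ||p - q||_1: sum of coordinate-wise absolute differences."""
    return float(np.sum(np.abs(np.asarray(p) - np.asarray(q))))

# A point 3 blocks east and 4 blocks north of the origin:
print(dist_euclid((0.0, 0.0), (3.0, 4.0)))  # 5.0, "as the bird flies"
print(dist_taxi((0.0, 0.0), (3.0, 4.0)))    # 7.0, driving along the grid
```

The two distances agree exactly when the segment joining the points is parallel to one of the axes, as the text observes.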

Definition 1.5. The neighborhood of radius r > 0 about a point ~p is the set

D_r(~p) = { ~x ∈ R^n : ‖~x − ~p‖ < r }.

We may also call this the disk of radius r > 0 in R², or the ball of radius r > 0 in R³.

Remark. Different books use different notations for a neighborhood, and so depending on which of the calculus books you read, the neighborhood around ~a of radius r may be written using any of the following notations:

Dr(~a) = Br(~a) = B(~a, r)

Definition 1.6. The sphere of radius r > 0 about ~p is defined as

S_r(~p) = { ~x ∈ R^n : ‖~x − ~p‖ = r }.

The shape of the neighborhood and sphere depends strongly on which norm you choose. First, let's start with the familiar Euclidean norm. Then the sphere is given by

‖~x − ~p‖ = r ⟺ √( ∑_{j=1}^n (x_j − p_j)² ) = r,

which is equivalent to

∑_{j=1}^n (x_j − p_j)² = r².


When n = 3, we have

(x_1 − p_1)² + (x_2 − p_2)² + (x_3 − p_3)² = r²,

the usual sphere in R³ with center ~p = (p_1, p_2, p_3). When n = 2, we have

(x_1 − p_1)² + (x_2 − p_2)² = r²,

the usual circle in R² with center ~p = (p_1, p_2).

Figure 1.1. Neighborhood around (0, 0) of radius 1 using the Taxi Cab norm.

If we replace the Euclidean norm by the Taxi Cab norm (for simplicity, take ~p = ~0), we have

S^taxi_r(~0) = { ~x ∈ R^n : ‖~x − ~0‖₁ = r }.

In other words,

~x ∈ S^taxi_r(~0) ⟺ ∑_{j=1}^n |x_j| = r.

In R², we have ~x = (x_1, x_2). Then r = |x_1| + |x_2|, which is a diamond (see Figure 1.1).¹ Notice that in the first quadrant, x_1, x_2 ≥ 0, we have x_1 + x_2 = r, which is the line segment with slope −1 connecting (r, 0) to (0, r). In the second quadrant, |x_1| = −x_1, and so the equation of the Taxi Cab circle is −x_1 + x_2 = r, the segment with slope +1 joining (−r, 0) to (0, r). The two other segments (in the third and fourth quadrants) forming the taxi-sphere may be derived in a similar manner.

1.2 Subsets of Rn

Let’s introduce some properties of subsets in Rn. A ⊂ Rn means A is a collection of points~x, drawn from Rn.

¹Note that |x_1| + |x_2| = r is a circle in R² under the Taxi Cab norm. Its circumference (measured in the Taxi Cab norm, four segments of length 2r each) is 8r, so we have

π = circumference / diameter = 8r / 2r = 4.


Definition 1.7. Let A ⊂ R^n, and ~p ∈ A. We say ~p is an interior point of A (Figure 1.2) if there exists a neighbourhood of ~p, i.e. an open disk, which is entirely contained in A:

Dr(~p) ⊂ A.

So an interior point is one which is "well inside" the set A, in the sense that all of its neighbors up to distance r > 0 are elements of the set. It means that one can walk a certain distance r from ~p and stay within the set A.

Figure 1.2. The point on the left is an interior point; the point on the right is a boundary point.

Example 1.8. Let

A = { ~x ∈ R^n : ~x ≠ ~0 }.

Take any ~p ∈ A, so ~p ≠ ~0. Then let r = ‖~p − ~0‖ > 0, so that D_r(~p) ⊂ A, since ~0 ∉ D_r(~p). (Notice: any smaller disk D_s(~p) ⊂ D_r(~p) ⊂ A, where 0 < s < r, also works to show that ~p is an interior point.)

So every ~p ∈ A is an interior point of A.

Definition 1.9. If every ~p ∈ A is an interior point, we call A an open set.

Example 1.10. A = { ~x ∈ R^n : ~x ≠ ~0 } is an open set.

Proposition 1.11. A = D_R(~0) is an open set.

Proof. If ~p = ~0, then D_r(~0) ⊆ A = D_R(~0) provided r ≤ R, so ~p = ~0 is interior to A. Consider any other ~p ∈ A. By the triangle inequality, D_r(~p) ⊂ A = D_R(~0) provided that 0 < r ≤ R − ‖~p‖. Therefore, A = D_R(~0) is an open set.

Example 1.12 (Figure 1.3). Suppose we use Taxi Cab disks instead of Euclidean disks. This does not change which points are interior to A, since the diamond is inscribed in a circle. In other words,

D^taxi_r(~p) ⊂ D^Euclid_r(~p).


Figure 1.3. A disk of radius r constructed with the Euclidean norm contains the disk of the same radius constructed with the Taxi Cab norm; the Taxi Cab disk in turn contains a disk of radius r/√2 constructed with the Euclidean norm.

Definition 1.13. The complement of a set A is

A^c = { ~x : ~x ∉ A }.

That is, A^c consists of all points which are not elements of A. Any point in R^n must belong either to A or to A^c, but never to both.

Using complements, we define a complementary notion to openness:

Definition 1.14. A set A is closed if Ac is open.

Notice that this does NOT mean that a set A is closed if it is not open. Sets are not doors: they don't have to be either open or closed. The open and closed sets are special, and not every set falls into one of the two categories. Here is a better way of understanding when a set is closed. First, we define the boundary of a set A:

Definition 1.15. ~b is a boundary point of A (Figure 1.2) if for every r > 0, D_r(~b) contains both points in A and points not in A:

D_r(~b) ∩ A ≠ ∅ and D_r(~b) ∩ A^c ≠ ∅.

The boundary ∂A is the collection of all the boundary points of A.

In Proposition 1.11, the set of all boundary points of A = D_R(~0) is

{ ~b : ‖~b‖ = R },

the sphere of radius R.

Notice that a boundary point of A can never be an interior point, and an interior point of A can never be in ∂A.

Theorem 1.16. A is closed if and only if A contains all its boundary points.


Example 1.17. Consider the following set:

A = { (x_1, x_2) ∈ R² : x_1 ≥ 0, x_2 > 0 }.

Not every point in A is an interior point. In fact, the interior points are the ~p = (p_1, p_2) where both p_1 > 0 and p_2 > 0. To see this, apply the definition with r = min{p_1, p_2}. Then D_r(~p) ⊂ A.

On the other hand, any ~p that lies on either of the axes (including ~0) is a boundary point. So none of the points on the segment x_1 = 0 (which are included in A) are interior points, and hence A is not open.

At the same time, points lying on the axis x_2 = 0 are boundary points of A but are not elements of A, and so A cannot be closed either.

1.3 Practice problems

1. Verify that the Taxi Cab norm on R², ‖~x‖₁ = |x_1| + |x_2|, satisfies the conditions which make it a valid norm, that is:

(a) ‖~x‖₁ = 0 if and only if ~x = ~0;

(b) ‖a~x‖₁ = |a| ‖~x‖₁ for all a ∈ R and ~x ∈ R²;

(c) ‖~x + ~y‖₁ ≤ ‖~x‖₁ + ‖~y‖₁ for all ~x, ~y ∈ R².

2. (a) Verify that the following identities hold for the Euclidean norm on R^n, defined by

‖~x‖ = √(~x · ~x) = √( ∑_{j=1}^n x_j² ):

(i) [Parallelogram Law] ‖~x + ~y‖² + ‖~x − ~y‖² = 2‖~x‖² + 2‖~y‖²;

(ii) [Polarization Identity] ~x · ~y = (1/4) ( ‖~x + ~y‖² − ‖~x − ~y‖² ).

(b) Show that the Parallelogram Law becomes false if we replace the Euclidean norm on R² by the Taxi Cab norm (as defined in problem 1).

3. Let U = { (x_1, x_2) ∈ R² : |x_2| ≤ x_1 and x_1 > 0 }. Find all interior points of U and all boundary points of U. Is U an open set? Is U a closed set?

4. Show each set is open, by showing every point ~a ∈ U is an interior point. [So you need to explicitly find a radius r > 0 so that D_r(~a) ⊂ U.]

(a) U = { (x_1, x_2) ∈ R² : x_1² + x_2² > 0 };

(b) U = { (x_1, x_2) ∈ R² : 1 < x_1² + x_2² < 4 }.


Chapter 2

Functions of several variables

2.1 Limits and continuity

In this section, we will be considering vector-valued functions

F : A ⊆ R^n → R^k.

Using matrix notation we can write F as a column vector of k component functions:

F(x_1, x_2, . . . , x_n) = ( F_1(x_1, x_2, . . . , x_n), F_2(x_1, x_2, . . . , x_n), . . . , F_k(x_1, x_2, . . . , x_n) ).

Example 2.1. Given a (k × n) matrix M, we can define the following function:

F (~x) = M~x.

This looks like a pretty special example, but it turns out to be an important one. And it gives a direct connection between calculus and matrices which we will exploit.

The first order of business is to define what we mean by the limit of a function; then we can define the notion of continuity. What does lim_{~x→~a} F(~x) = ~L mean? First, we start by noting that it is not sufficient to treat each variable, x_1, x_2, . . . , x_n, separately.

Example 2.2. Consider the following function:

F(x, y) = x y / (x² + 4y²), (x, y) ≠ (0, 0).

We can try to find its limit at (0, 0) by treating each variable separately:

lim_{x→0} ( lim_{y→0} F(x, y) ) = lim_{x→0} ( 0 / x² ) = 0.

Similarly, we have

lim_{y→0} ( lim_{x→0} F(x, y) ) = 0.


However, if we take (x, y) → (0, 0) along a straight-line path y = mx, where m is constant, we have

F(x, mx) = m x² / (x² + 4m²x²) = m / (1 + 4m²).

In this case, we have

lim_{(x,y)→(0,0), along y=mx} F(x, y) = m / (1 + 4m²).

Therefore, F(x, y) does not approach any particular value as (x, y) → (0, 0).
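The path dependence above is easy to see numerically; this small Python sketch is my own illustration (not from the notes), evaluating F very close to the origin along lines of several slopes:

```python
# Along y = m x, the value of F is m / (1 + 4 m^2), so each slope
# produces a different "limit" and no overall limit can exist.
def F(x, y):
    return x * y / (x**2 + 4 * y**2)

for m in (0.0, 0.5, 1.0, 2.0):
    x = 1e-8                # a point very close to (0, 0) on the line y = m x
    print(m, F(x, m * x))   # matches m / (1 + 4 m^2), no matter how small x is
```

Shrinking x further does not change the printed values, because F is constant along each line through the origin.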

Example 2.3 (Worse). Consider the following function:

F(x, y) = y² / (x⁴ + y²).

If we approach (0, 0) along y = mx with m ≠ 0, the values F(x, mx) = m²/(x² + m²) → 1. However, if we approach along a parabola, y = mx², we obtain a limiting value of m²/(1 + m²). We get different limits along different parabolas, and so the limit as (x, y) → (0, 0) does not exist.

From these examples we conclude that if there is to be a limit

lim_{~x→~a} F(~x) = ~b,

we must have a criterion which doesn't depend on the path or the direction in which ~x approaches ~a, but only on proximity. In other words, we want ‖F(~x) − ~b‖ to go to zero as the distance ‖~x − ~a‖ goes to zero, regardless of the path taken in getting there.

Definition 2.4. We say lim_{~x→~a} F(~x) = ~b if for any given ε > 0, there exists δ > 0 such that whenever 0 < ‖~x − ~a‖ < δ, we have ‖F(~x) − ~b‖ < ε. Therefore,

lim_{~x→~a} F(~x) = ~b ⟺ lim_{~x→~a} ‖F(~x) − ~b‖ = 0.

Remark. We can state the definition equivalently in a geometrical form: for any given ε > 0, there exists a radius δ > 0 such that

F(~x) ∈ D_ε(~b)

whenever ~x ∈ D_δ(~a) and ~x ≠ ~a.

Before we look at an example, here's a useful observation. Take ~v = (v_1, v_2, . . . , v_n) ∈ R^n. Then we have

‖~v‖ = √( ∑_{j=1}^n v_j² ) ≥ √(v_i²) = |v_i|

for each individual coordinate i = 1, 2, . . . , n. So when applying the definition of the limit, we may use the following inequalities: for any ~x, ~a ∈ R^n,

|x_i − a_i| ≤ ‖~x − ~a‖, (2.1)

for each i = 1, . . . , n.


Example 2.5. Show that

lim_{(x,y)→(0,0)} 2x²y / (x² + y²) = 0.

Solution: We must set up the definition of the limit. Note that F : R² \ {~0} → R is real-valued, so we drop the vector symbols in the range (but not in the domain, which is R²!). Matching the notation of the definition, we have b = 0 and ~a = (0, 0). Call

R = ‖~x − ~a‖ = ‖~x‖ = √(x² + y²),

the distance from ~x to ~a. By the above observation (2.1),

|x| ≤ R, |y| ≤ R.

Since F(~x) ∈ R, the norm ‖F(~x) − ~b‖ = |F(~x) − 0| is just the absolute value. We now want to estimate the difference |F(~x) − 0| and show that it is bounded by a quantity which depends only on R and tends to zero as R → 0. To do this, we find an upper bound for the numerator and a lower bound for the denominator, as a fraction becomes larger when its numerator is increased and its denominator is decreased. In this case, the denominator is exactly R², and so:

|F(~x) − b| = | 2x²y / (x² + y²) − 0 | = 2|x|²|y| / (x² + y²) ≤ (2 · R² · R) / R² = 2R = 2‖~x − ~a‖.

By requiring ‖~x − ~a‖ = ‖~x‖ < ε/2, we get |F(~x) − 0| < ε. Therefore, the definition is satisfied by choosing δ > 0 to be any value with 0 < δ ≤ ε/2.
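The key estimate |F(~x)| ≤ 2R can be checked numerically at random sample points; the sketch below is my own illustration (assuming Python with numpy), not part of the notes:

```python
import numpy as np

def F(x, y):
    return 2 * x**2 * y / (x**2 + y**2)

rng = np.random.default_rng(1)
pts = rng.normal(size=(1000, 2))       # random points (x, y), none at the origin
R = np.hypot(pts[:, 0], pts[:, 1])     # R = ||(x, y) - (0, 0)||
vals = np.abs(F(pts[:, 0], pts[:, 1]))
print(bool(np.all(vals <= 2 * R)))     # True: |F| <= 2R at every sample
```

Of course, sampling only illustrates the bound; the ε-δ argument above is what proves it.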

Example 2.6. Consider the following function, F : R³ \ {~0} → R:

F(x, y, z) = (3z³ + 2x² + 4y² + 6z²) / (x² + 2y² + 3z²), (x, y, z) ≠ (0, 0, 0).

Prove that

lim_{(x,y,z)→(0,0,0)} F(x, y, z) = 2.

Solution: In analogy with the previous example, let R = ‖(x, y, z) − (0, 0, 0)‖ = √(x² + y² + z²). We use the common denominator to write the desired quantity as a fraction, and estimate as before:

|F(x, y, z) − 2| = | (3z³ + 2x² + 4y² + 6z²) / (x² + 2y² + 3z²) − 2 |
                 = 3|z|³ / (x² + 2y² + 3z²)
                 ≤ 3R³ / (x² + y² + z²)
                 = 3R³ / R²
                 = 3R.

Then

|F(x, y, z) − 2| ≤ 3R < ε

provided that

R = ‖~x − ~0‖ < δ = ε/3.

We are now ready to define continuity. A continuous function is one whose limit coincides with the value of the function at the limit point:

Definition 2.7. Consider a function F : R^n → R^k with domain A ⊆ R^n. For ~a ∈ A, we say F is continuous at ~a iff

F(~a) = lim_{~x→~a} F(~x).

Example 2.8. Going back to Example 2.6, if we redefine F as follows,

F(x, y, z) = (3z³ + 2x² + 4y² + 6z²) / (x² + 2y² + 3z²) for (x, y, z) ≠ (0, 0, 0), and F(0, 0, 0) = 2,

then F is continuous at (0, 0, 0) (and at all ~x ∈ R³).

If F is continuous at every ~a ∈ A, we say F is continuous on the set A. Continuity is always preserved by the usual algebraic operations: sums, products, quotients, and compositions of continuous functions are continuous.

2.2 Differentiability

You’re used to thinking of the derivative as another function, “derived” from the originalone, which gives numerical values of the rate of change (slope of the tangent line) of thefunction. But having a derivative is a property of the function, a measure of how smoothlythe values of the function vary, and so we talk about “differentiability”, the ability of thefunction to be differentiated at all.

Let’s recall how the derivative was defined for functions of one variable, because it’s thebasic idea which we’ll return to:


Definition 2.9. For a function f : R → R, its derivative is defined as

f′(x) = lim_{h→0} ( f(x + h) − f(x) ) / h.

If this limit exists, we say f is differentiable at x.

Differentiability is a stronger property than continuity:

Theorem 2.10. If f : R → R is differentiable at x, then f is also continuous at x.

Differentiable functions f(x) are well approximated by their tangent lines (also known as their linearizations). We wish to extend this idea to F : R^n → R^m.

First, we can try dealing with the independent variables, x_1, x_2, . . . , x_n, one at a time by using partial derivatives. We start by introducing the standard basis in R^n:

~e_1 = (1, 0, 0, . . . , 0)
~e_2 = (0, 1, 0, . . . , 0)
...
~e_n = (0, 0, 0, . . . , 1)

In particular, we have the usual ~e_1 = ~i, ~e_2 = ~j, ~e_3 = ~k in R³. For any ~x ∈ R^n and h ∈ R, the point ~x + h~e_j moves from ~x parallel to the x_j axis by distance h. In other words,

~x + h~e_j = (x_1, x_2, . . . , x_j + h, x_{j+1}, . . . , x_n).

Definition 2.11. The partial derivative of f with respect to xj is defined as
\[ \frac{\partial f}{\partial x_j}(\vec x) = \lim_{h\to 0} \frac{f(\vec x + h\vec e_j) - f(\vec x)}{h}, \]
for each j = 1, 2, . . . , n (provided the limit exists).

A partial derivative is calculated by treating xj as the only variable, with all the others treated as constants. For a vector-valued function F : Rn → Rm,
\[ F(\vec x) = \begin{bmatrix} F_1(\vec x) \\ F_2(\vec x) \\ \vdots \\ F_m(\vec x) \end{bmatrix}, \]
we treat each component Fi(~x) : Rn → R separately as a real-valued function. Each has n partial derivatives, and so F : Rn → Rm has m × n partial derivatives, which form an (m × n) matrix, the Jacobian matrix or total derivative matrix,
\[ DF(\vec x) = \left( \frac{\partial F_i}{\partial x_j} \right)_{\substack{i = 1, 2, \dots, m \\ j = 1, 2, \dots, n}}. \]

Be careful: each row (with fixed i and j = 1, . . . , n) contains the partial derivatives of one component Fi(~x). Each column (with fixed j and i = 1, . . . , m) corresponds to differentiating the vector F(~x) with respect to one independent variable xj. That is, we count the components of F top to bottom, and the independent variables’ derivatives left to right.


Example 2.12. Consider a function F : R2 → R3:
\[ F(\vec x) = \begin{bmatrix} x_1^2 \\ x_1 x_2 \\ x_2^4 \end{bmatrix}. \]
The Jacobian of the function is given by
\[ DF(\vec x) = \begin{bmatrix} \partial F_1/\partial x_1 & \partial F_1/\partial x_2 \\ \partial F_2/\partial x_1 & \partial F_2/\partial x_2 \\ \partial F_3/\partial x_1 & \partial F_3/\partial x_2 \end{bmatrix} = \begin{bmatrix} 2x_1 & 0 \\ x_2 & x_1 \\ 0 & 4x_2^3 \end{bmatrix}.
\]
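As an aside (not part of the original notes), the analytic Jacobian of Example 2.12 can be cross-checked numerically with central finite differences; `numerical_jacobian` below is an illustrative helper written for this sketch, not a library routine:

```python
def F(x1, x2):
    # F : R^2 -> R^3 from Example 2.12
    return [x1**2, x1 * x2, x2**4]

def numerical_jacobian(f, x, h=1e-6):
    """Approximate DF(x) by central differences, one column per variable."""
    n = len(x)
    m = len(f(*x))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        fp, fm = f(*xp), f(*xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

def exact_jacobian(x1, x2):
    # DF(x) = [[2 x1, 0], [x2, x1], [0, 4 x2^3]]
    return [[2 * x1, 0.0], [x2, x1], [0.0, 4 * x2**3]]
```

Evaluating both at any test point, the finite-difference matrix agrees with the analytic one up to the discretization error.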

The question now is whether the derivative matrix DF(~x) gives us the same information and properties as the ordinary derivative did in single-variable calculus. The following example shows that we must be more careful with functions of several variables:

Example 2.13. Consider the following function:
\[ f(x, y) = \begin{cases} \dfrac{xy}{(x^2 + y^2)^2}, & (x, y) \neq (0, 0) \\[4pt] 0, & (x, y) = (0, 0) \end{cases} \]
Do the partial derivatives exist at (0, 0)? By definition,
\[ \frac{\partial f}{\partial x}(0,0) = \lim_{h\to 0} \frac{f(0 + h, 0) - f(0, 0)}{h} = \lim_{h\to 0} \frac{\frac{h\cdot 0}{(h^2 + 0^2)^2} - 0}{h} = \lim_{h\to 0} \frac{0}{h} = 0.
\]
Similarly, ∂f/∂y (0, 0) = 0 by the symmetry of x and y. Therefore,
\[ Df(0, 0) = \begin{bmatrix} 0 & 0 \end{bmatrix}. \]
Although the partial derivatives exist, f is not continuous at (0, 0). (For example, along the line y = mx with m ≠ 0, f(x, mx) = m/((1 + m^2)^2 x^2) → ±∞ as x → 0.)

By the previous example, we see that the mere existence of the partial derivatives is not a very strong property for a function of several variables; despite the existence of the partial derivatives, the function isn’t even continuous at ~x = ~0. So it’s doubtful that the partial derivatives are giving us very significant information about the smoothness of the function in this example. A “differentiable” function should at least be continuous.
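A quick numerical illustration (added here as an aside, with the blow-up line y = x chosen for concreteness): the difference quotient defining ∂f/∂x at the origin is identically zero, while the values of f along y = x grow without bound.

```python
def f(x, y):
    # the function of Example 2.13
    if (x, y) == (0, 0):
        return 0.0
    return x * y / (x**2 + y**2)**2

def partial_x_at_origin(h):
    # difference quotient (f(h, 0) - f(0, 0)) / h -- identically 0
    return (f(h, 0) - f(0, 0)) / h

# along the line y = x we have f(t, t) = 1 / (4 t^2), which blows up as t -> 0
values = [f(t, t) for t in (0.1, 0.01, 0.001)]
```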

To get reasonable information from Df(~x), we need to ask for more than just its existence. To understand what is needed, let’s go back to f : R → R. We rewrite the derivative at x = a, making the substitution h = x − a in the definition, so that
\[ f'(a) = \lim_{x\to a} \frac{f(x) - f(a)}{x - a}. \]


Equivalently, we have:
\[ 0 = \lim_{x\to a} \left( \frac{f(x) - f(a)}{x - a} - f'(a) \right) = \lim_{x\to a} \frac{f(x) - \overbrace{[\, f(a) + f'(a)(x - a) \,]}^{L_a(x)}}{x - a},
\]
where L_a(x) = f(a) + f'(a)(x − a) is the linearization of f at a, the equation of the tangent line to y = f(x) at x = a. Thus, the difference between the value of f(x) and its tangent line L_a(x) is very small compared to the distance (x − a) from x to a. In other words, f is differentiable at a if its linear approximation estimates the value f(x) to within an error which is small compared to x − a.

This is the attribute of the one-dimensional derivative which we want to extend to higher dimensions. Let’s make the analogy. For F : Rn → Rm, F(~x) has m × n partial derivatives (see Definition 2.11). Then the linearization of F at ~a is
\[ L_{\vec a}(\vec x) = \underbrace{F(\vec a)}_{m\times 1} + \underbrace{DF(\vec a)}_{m\times n}\,\underbrace{(\vec x - \vec a)}_{n\times 1}. \]
So L_{~a} : Rn → Rm, just like F. The derivative matrix DF(~a) defines a linear transformation from Rn to Rm.

Notice that when n = 2 and m = 1, for F : R2 → R, we have
\[ DF(\vec a) = \begin{bmatrix} \frac{\partial F}{\partial x_1}(\vec a) & \frac{\partial F}{\partial x_2}(\vec a) \end{bmatrix}, \]
a (1 × 2) row vector, and
\[ \vec x - \vec a = \begin{bmatrix} x_1 - a_1 \\ x_2 - a_2 \end{bmatrix}, \]
so we have
\[ L_{\vec a}(\vec x) = F(\vec a) + \frac{\partial F}{\partial x_1}(\vec a)(x_1 - a_1) + \frac{\partial F}{\partial x_2}(\vec a)(x_2 - a_2), \]
the familiar equation of the tangent plane to z = F(x_1, x_2).

We’re now ready to introduce the idea of differentiability:

Definition 2.14 (Differentiability). We say F : Rn → Rm is differentiable at ~a if
\[ \lim_{\vec x\to\vec a} \frac{\| F(\vec x) - F(\vec a) - DF(\vec a)(\vec x - \vec a) \|}{\| \vec x - \vec a \|} = 0. \]
Equivalently,
\[ \lim_{\vec h\to\vec 0} \frac{\| F(\vec x + \vec h) - F(\vec x) - DF(\vec x)\,\vec h \|}{\| \vec h \|} = 0. \]

In summary, F is differentiable at ~a if ‖F(~x) − L_{~a}(~x)‖ is small compared to ‖~x − ~a‖, or if F(~x) is approximated by L_{~a}(~x) with an error which is much smaller than ‖~x − ~a‖.


There is a very useful notation to express the idea that one quantity is very small compared to another: the “little-o” notation. We write o(h), “little-oh of h”, for a quantity which is small compared to h, in the sense that
\[ g(h) = o(h) \iff \lim_{h\to 0} \frac{g(h)}{h} = 0. \]
Using this notation, differentiability can be written as
\[ \| F(\vec x) - [\, F(\vec a) + DF(\vec a)(\vec x - \vec a) \,] \| = o(\| \vec x - \vec a \|). \]

Example 2.15. Is the following function differentiable at ~a = ~0?
\[ F(x_1, x_2) = \begin{cases} \dfrac{x_2^2 \sin x_1}{\sqrt{x_1^2 + x_2^2}}, & \vec x \neq \vec 0 \\[4pt] 0, & \vec x = \vec 0 \end{cases} \]
Solution: First, we have
\[ \frac{\partial F}{\partial x_1}(\vec 0) = \lim_{h\to 0} \frac{F(\vec 0 + h\vec e_1) - F(\vec 0)}{h} = \lim_{h\to 0} \frac{0 - 0}{h} = 0. \]
Similarly, we have ∂F/∂x_2 (~0) = 0. So we have
\[ DF(\vec 0) = \begin{bmatrix} \frac{\partial F}{\partial x_1} & \frac{\partial F}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 0 & 0 \end{bmatrix}. \]
For differentiability, we have to look at:
\[ \left| \frac{x_2^2 \sin x_1}{\sqrt{x_1^2 + x_2^2}} - 0 - \begin{bmatrix} 0 & 0 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right| = \frac{x_2^2 |\sin x_1|}{\sqrt{x_1^2 + x_2^2}}.
\]
Then, writing R = ‖~x − ~0‖ = √(x_1^2 + x_2^2), and using x_2^2 ≤ R^2 and |sin x_1| ≤ |x_1| ≤ R,
\[ \frac{|F(\vec x) - L_{\vec 0}(\vec x)|}{\|\vec x - \vec 0\|} = \frac{x_2^2 |\sin x_1|}{\left(\sqrt{x_1^2 + x_2^2}\right)^2} = \frac{x_2^2 |\sin x_1|}{x_1^2 + x_2^2} \le \frac{R^2 \cdot R}{R^2} = R = \|\vec x - \vec 0\|.
\]
By the squeeze theorem, we have
\[ \lim_{\vec x\to\vec 0} \frac{|F(\vec x) - L_{\vec 0}(\vec x)|}{\|\vec x - \vec 0\|} = 0. \]
Therefore, F is differentiable at ~x = ~0.


Example 2.16. Verify that F is differentiable at ~a = ~0, where
\[ F(\vec x) = \begin{bmatrix} 1 + x_1 + x_2^2 \\ 2x_2 - x_1^2 \end{bmatrix}. \]
First, note that
\[ F(\vec a) = F(\vec 0) = \begin{bmatrix} 1 \\ 0 \end{bmatrix}. \]
We also need to compute the Jacobian at ~0:
\[ DF(\vec 0) = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}. \]
Then we get the following linearization of the function:
\[ L_{\vec 0}(\vec x) = F(\vec 0) + DF(\vec 0)(\vec x - \vec 0) = \begin{bmatrix} 1 \\ 0 \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 + x_1 \\ 2x_2 \end{bmatrix}.
\]
Then, to verify the definition of differentiability, we estimate the quotient,
\[ \frac{\| F(\vec x) - L_{\vec 0}(\vec x) \|}{\|\vec x - \vec 0\|} = \frac{\left\| \begin{bmatrix} x_2^2 \\ -x_1^2 \end{bmatrix} \right\|}{\|\vec x\|} = \frac{\sqrt{x_2^4 + x_1^4}}{\sqrt{x_1^2 + x_2^2}} \le \frac{\sqrt{R^4 + R^4}}{R} = \sqrt 2\, R = \sqrt 2\, \|\vec x - \vec 0\|,
\]
where we have used x_1^4 ≤ R^4 and x_2^4 ≤ R^4, with R = √(x_1^2 + x_2^2).

As ~x → ~0, ‖~x − ~0‖ = R → 0, so by the squeeze theorem the quotient goes to 0. Therefore, F is differentiable at ~0.
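The estimate in Example 2.16 can also be seen numerically (an illustrative aside, not from the notes): the quotient ‖F(~x) − L_{~0}(~x)‖ / ‖~x‖ shrinks as ~x approaches ~0, and stays below the bound √2 ‖~x‖.

```python
import math

def F(x1, x2):
    # the map of Example 2.16
    return (1 + x1 + x2**2, 2 * x2 - x1**2)

def L(x1, x2):
    # linearization at 0: L(x) = (1 + x1, 2 x2)
    return (1 + x1, 2 * x2)

def quotient(x1, x2):
    # ||F(x) - L(x)|| / ||x||
    f1, f2 = F(x1, x2)
    l1, l2 = L(x1, x2)
    return math.hypot(f1 - l1, f2 - l2) / math.hypot(x1, x2)
```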

Verifying differentiability can involve quite a bit of work, but fortunately there is a powerful theorem which makes differentiability much easier to show.

Theorem 2.17. Suppose F : Rn → Rm, and ~a ∈ Rn. If there exists a disk Dr(~a) in which all the partial derivatives ∂Fi(~x)/∂xj exist and are continuous, then F is differentiable at ~x = ~a.

So it’s enough to calculate the partial derivatives and verify that each one is a continuous function, and then we can conclude that the function is differentiable, without going through the definition! It’s convenient to give this property a name:

Definition 2.18. Suppose F : Rn → Rm, and ~a ∈ Rn. If there exists a disk Dr(~a) in which all the partial derivatives ∂Fi(~x)/∂xj exist and are continuous, then we say that F is continuously differentiable, or C1, at ~x = ~a. If F is continuously differentiable at all points ~x ∈ A, we say that F is continuously differentiable on A, or F ∈ C1(A).


Figure 2.1. Relationship between differentiability and continuity: f is C1 ⟹ f is differentiable ⟹ f is continuous; and f differentiable ⟹ the partial derivatives exist.

Returning to Example 2.15, we calculate the partial derivative for ~x ≠ ~0:
\[ \frac{\partial F}{\partial x_1} = x_2^2 \left( \cos x_1\, (x_1^2 + x_2^2)^{-1/2} + \left( -\tfrac 12 (x_1^2 + x_2^2)^{-3/2}\, 2x_1 \right) \sin x_1 \right) = \frac{x_2^2}{(x_1^2 + x_2^2)^{3/2}} \left[ \cos x_1\, (x_1^2 + x_2^2) - x_1 \sin x_1 \right].
\]
This is a quotient of sums, products, and compositions of continuous functions as long as the denominator is not zero (that is, when (x_1, x_2) ≠ (0, 0)), and therefore it is continuous. I invite you to calculate ∂F/∂x_2, which has a similar form and is also continuous for all (x_1, x_2) ≠ (0, 0). By Definition 2.18, F is C1 at all ~x ≠ ~0, so by Theorem 2.17 we conclude that F is differentiable at all ~x ≠ ~0; differentiability at ~x = ~0 was verified directly in Example 2.15, so F is differentiable at all points in its domain.

We summarize the ideas in this chapter in Figure 2.1. A function which is continuously differentiable is automatically differentiable (by Theorem 2.17). A function which is differentiable must be continuous. However, none of these implications can be reversed. There are continuous functions which are not differentiable. There are functions whose partial derivatives all exist but which are discontinuous. And there are differentiable functions which are not C1.

2.3 Chain rule

As in single-variable calculus, the composition of differentiable functions is differentiable, and we have a convenient formula for calculating the derivative.

Let’s be a little careful about domains. Suppose A ⊆ Rn and B ⊆ Rm are open sets, F : A ⊆ Rn → Rm with F(A) ⊆ B, that is,
\[ F(\vec x) \in B \quad \text{for all } \vec x \in A, \]
and G : B ⊆ Rm → Rp. Then we define the composition H : A ⊆ Rn → Rp via
\[ H(\vec x) = G \circ F(\vec x) = G(F(\vec x)). \]

Example 2.19. This may seem like an overly simple example, but it is an important one. Consider the case where F and G are linear maps,
\[ F(\vec x) = M\vec x, \quad M \text{ an } (m\times n) \text{ matrix}; \qquad G(\vec y) = N\vec y, \quad N \text{ a } (p\times m) \text{ matrix}. \]


Then
\[ H(\vec x) = G(F(\vec x)) = NM\vec x \]
is also a linear function, represented by the product NM. (Recall that the order of multiplication matters a lot!)

In fact this example leads right into the general form of the chain rule for compositions in higher dimensions. Recall that however nonlinear the functions F and G are, their derivatives are matrices!

Theorem 2.20. Assume F : Rn → Rm is differentiable at ~x = ~a and G : Rm → Rp is differentiable at ~b = F(~a). Then H = G ∘ F is differentiable at ~x = ~a and
\[ DH(\vec a) = \underbrace{DG(\vec b)}_{DG(F(\vec a))} \, DF(\vec a). \]

The product above is matrix multiplication, so be careful of the order of multiplication. Unless the matrices are both square (which only happens when the dimensions n = m = p), the product in the wrong order is not even defined, but be very careful anyway.

Note that all of the various forms of the chain rule done in first-year calculus (and explained by complicated tree diagrams) can be derived directly from this general formula. So the tree diagrams used in Stewart are just imitating the natural structure of matrix multiplication!

Finally, we point out that the Chain Rule is more than just a formula: it also contains the information that the composition H = G ∘ F is differentiable at ~a, provided F is differentiable at ~a and G is differentiable at ~b = F(~a).

Example 2.21. Consider the following functions, F : R3 → R2 and G : R2 → R2:
\[ F(\vec x) = \begin{bmatrix} x_1^2 + x_2 x_3 \\ x_1^2 + x_3^2 \end{bmatrix}, \qquad G(\vec y) = \begin{bmatrix} -y_2^3 \\ y_1 + y_2 \end{bmatrix}. \]
Let H = G ∘ F. Find DH(~a), where ~a = (1, −1, 0).

First, we have
\[ DF(\vec x) = \begin{bmatrix} 2x_1 & x_3 & x_2 \\ 2x_1 & 0 & 2x_3 \end{bmatrix}, \qquad DF(1, -1, 0) = \begin{bmatrix} 2 & 0 & -1 \\ 2 & 0 & 0 \end{bmatrix}. \]
Similarly, since ~b = F(1, −1, 0) = (1, 1),
\[ DG(\vec y) = \begin{bmatrix} 0 & -3y_2^2 \\ 1 & 1 \end{bmatrix}, \qquad DG(1, 1) = \begin{bmatrix} 0 & -3 \\ 1 & 1 \end{bmatrix}. \]
As each entry of DF(~x), DG(~y) (the partial derivatives of F, G) is a continuous function for every ~x, ~y, both F and G are C1, and hence by Theorem 2.17 they are differentiable. By the Chain Rule, we get
\[ DH(1, -1, 0) = DG(1, 1)\, DF(1, -1, 0) = \begin{bmatrix} 0 & -3 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 2 & 0 & -1 \\ 2 & 0 & 0 \end{bmatrix} = \begin{bmatrix} -6 & 0 & 0 \\ 4 & 0 & -1 \end{bmatrix}.
\]


2.4 Practice problems

1. Prove that:

(a) \(\displaystyle \lim_{\vec x\to\vec 0} \frac{x_1^6}{x_1^6 + 3x_2^2}\) does not exist.

(b) \(\displaystyle \lim_{\vec x\to\vec 0} \frac{x_1^3 x_2^2}{[x_1^2 + x_2^2]^2} = 0\).

(c) \(\displaystyle \lim_{\vec x\to(1,2)} \frac{(x_1 - 1)^2 (x_2 - 2)^2}{[(x_1 - 1)^2 + (x_2 - 2)^2]^{3/2}} = 0\).

2. Let \( f(x, y) = \dfrac{xy^2}{(x^2 + y^2)^2} \) for (x, y) ≠ (0, 0) and f(0, 0) = 0.

(a) Show that ∂f/∂x (0, 0) and ∂f/∂y (0, 0) both exist.

(b) Show that f is not continuous at (0, 0).

3. Let M be a k × n matrix, and define F(~x) = M~x (i.e., via matrix multiplication, with ~x as a column vector). Use the definition of differentiability to show that F is differentiable at all ~x ∈ Rn, and DF(~x) = M.

4. Let F : R2 → R3 be defined by
\[ F(\vec x) = \begin{bmatrix} x_1 \cos x_2 \\ x_1 x_2^2 \\ x_1 e^{-x_2} \end{bmatrix}. \]
Find DF(~x). Calculate the linearization L_{~a}(~x) around ~a = (1, 0).

5. Let F : R2 → R2, G : R2 → R2 be defined by
\[ F(r, \theta) = \begin{bmatrix} r\cos\theta \\ r\sin\theta \end{bmatrix}, \qquad G(x, y) = \begin{bmatrix} 2xy \\ y^2 - x^2 \end{bmatrix}, \]
and let H = G ∘ F. Use the chain rule to calculate DH(√2, π/4).

6. Justify the approximation
\[ \ln(1 + 2x_1 + 4x_2) = 2x_1 + 4x_2 + o(\|\vec x\|) \]
for ~x close to ~0.


Chapter 3

Paths, directional derivative, and gradient

In this section we use the chain rule to explore two very common (and complementary) situations:

• parametrized paths, ~c : R → Rn, for which the domain is one-dimensional;

• real-valued f : Rn → R, which are functions of n ≥ 2 variables but are scalar-valued.

In fact, we will use paths as a tool to derive some qualitative facts about functions of several variables.

3.1 Paths and curves

We start with paths, which parametrize curves in Rn. First, recall that an interval is a connected set in the real line, which includes the open intervals (a, b), (−∞, b), (a, ∞), or (−∞, ∞) = R; the closed intervals [a, b], [a, ∞), (−∞, b] (which include endpoints); or intervals like (a, b] which are neither open nor closed.

Definition 3.1. Let I ⊆ R be an interval. A path ~c : I ⊆ R → Rn is a vector-valued function of a scalar independent variable, usually t:
\[ \vec c(t) = \begin{bmatrix} c_1(t) \\ c_2(t) \\ \vdots \\ c_n(t) \end{bmatrix}. \]

The path ~c(t) can be thought of as a vector (or a point, representing the tip of the vector) moving in time t. The moving point traces out a curve in Rn as t increases, and thus we use paths to attach functions (the moving coordinates) to a geometrical object, a curve in space. Note that this is not the only way to describe a curve:

Example 3.2. A unit circle in R2 described as a path is
\[ \vec c(t) = (\cos t, \sin t), \]


where t ∈ [0, 2π). But we could also describe the unit circle non-parametrically as
\[ x^2 + y^2 = 1. \]
We will talk about such non-parametric curves and surfaces in a later chapter.

Note that the same curve can be described by different paths. Going back to the unit circle, we can also write
\[ \vec b(t) = \left( \sin(t^2), \cos(t^2) \right), \quad t \in [0, \sqrt{2\pi}). \]
Using a different path can change (1) the time dynamics and (2) the direction of travel along the curve. This path has non-constant speed and reversed orientation; that is, the path ~c(t) traces the circle counter-clockwise while ~b(t) draws it clockwise.

If ~c is differentiable, D~c(t) is an (n × 1) matrix. Since each component c_j(t) is a real-valued function of only one variable, the partial derivative is the usual (or ordinary) derivative:
\[ \frac{\partial c_j}{\partial t} = \frac{d c_j}{dt} = c_j'(t) = \lim_{h\to 0} \frac{c_j(t+h) - c_j(t)}{h}. \]
So D~c(t) = ~c′(t) is written as a column vector:
\[ D\vec c(t) = \begin{bmatrix} c_1'(t) \\ c_2'(t) \\ \vdots \\ c_n'(t) \end{bmatrix} = \lim_{h\to 0} \frac{\vec c(t+h) - \vec c(t)}{h},
\]
which is a vector tangent to the curve traced out at the point ~c(t). Physically, ~c′(t) is the velocity vector for motion along the path (Figure 3.1).

Example 3.3 (Lines). Given two points ~p1, ~p2 ∈ Rn, there is a unique line connecting them. One path which represents this line is
\[ \vec c(t) = \vec p_1 + t\vec v, \]
where ~v = ~p2 − ~p1. The velocity is then ~c′(t) = ~v, a constant.

A path ~c(t) is continuous (or differentiable, or C1) provided that each of the components c_j(t), j = 1, 2, . . . , n, is continuous (or differentiable, or C1). Note that {~c(t) : t ∈ [a, b]} traces out a curve in Rn, with initial endpoint ~c(a) and final endpoint ~c(b). The path ~c(t) parameterizes the curve drawn out.

Recall that for any function F : Rk → Rn, differentiability means that the tangent (i.e. linearization) gives a good approximation to the function itself. For a differentiable path, ~c′(t) is a tangent vector to the curve drawn out when ~c′(t) ≠ 0. We call ~v(t) = ~c′(t) the velocity vector. The norm v = ‖~v‖ = ‖~c′(t)‖ is the speed.

Finally, we can define the unit tangent vector, when the speed is non-vanishing:

Definition 3.4. Let ~c(t) be a differentiable path with ‖~c′(t)‖ ≠ 0. The unit tangent vector is defined as
\[ \vec T(t) = \frac{\vec v(t)}{\|\vec v(t)\|} = \frac{\vec c\,'(t)}{\|\vec c\,'(t)\|}. \]


Figure 3.1. ~c′(t) is the velocity vector for motion along the path ~c(t). It is always tangent to the curve.

Example 3.5. Consider the “astroid”, parametrized by ~c : R → R2:
\[ \vec c(t) = \left( \cos^3 t, \sin^3 t \right), \quad t \in [-\pi, \pi]. \]
This is a C1 path (in fact, it is C∞, differentiable to all orders!) Its velocity vector is given by
\[ \vec c\,'(t) = \left( -3\cos^2 t \sin t,\; 3\sin^2 t \cos t \right). \]
To find the unit tangent, we have to find its speed first:
\[ v = \sqrt{9\cos^4 t\,\sin^2 t + 9\sin^4 t\,\cos^2 t} = \sqrt{9\cos^2 t\,\sin^2 t\,(\sin^2 t + \cos^2 t)} = 3|\sin t \cos t|,
\]
remembering that √(x^2) = |x|, because we always choose the positive square root. Then the unit tangent is given by
\[ \vec T(t) = \frac{\vec v(t)}{\|\vec v(t)\|} = \left( -|\cos t|\, \frac{\sin t}{|\sin t|},\; |\sin t|\, \frac{\cos t}{|\cos t|} \right).
\]
Note that the tangent is undefined when sin t = 0 or cos t = 0, i.e. at multiples of π/2. Worse, sin t/|sin t| and cos t/|cos t| flip discontinuously from −1 to +1 (or vice versa) as t crosses a multiple of π/2. Although the path is C1, the curve is not smooth! When ~v(t) = ~c′(t) = ~0, the curve is allowed to have cusps (Figure 3.2).
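A small numerical aside (not in the notes) makes the cusp at t = 0 visible: sampling the unit tangent just before and just after t = 0, its first component jumps from near +1 to near −1.

```python
import math

def unit_tangent(t):
    # T(t) = c'(t)/||c'(t)|| for c(t) = (cos^3 t, sin^3 t); undefined at cusps
    vx = -3 * math.cos(t)**2 * math.sin(t)
    vy = 3 * math.sin(t)**2 * math.cos(t)
    speed = math.hypot(vx, vy)
    return (vx / speed, vy / speed)

# just before and after the cusp at t = 0, the tangent reverses direction
before = unit_tangent(-0.01)
after = unit_tangent(0.01)
```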

Note that it is possible to have a nice tangent direction even when ~c ′(t) = 0:

Example 3.6. Consider a parameterized straight line:
\[ \vec c(t) = \vec a + \vec w\, t^3. \]
Its velocity vector, ~c′(t) = 3~w t^2, equals ~0 when t = 0. However, the curve still has a tangent direction there, parallel to ~w.

Although it’s possible to have a continuously varying tangent vector even when the speed of the path vanishes, we wish to avoid such tricky cases. So we will only call a curve “smooth” if the speed is never zero, so that the unit tangent ~T(t) is defined and continuous:


Figure 3.2. For ~c(t) = (cos^3 t, sin^3 t), the unit tangent ~T(t) suddenly changes direction at the cusps.

Definition 3.7. We say a path ~c(t), t ∈ I, is smooth (or regular) if ~c is a C1 path with ‖~c′(t)‖ ≠ 0 for all t ∈ I. We call a geometrical curve smooth if it can be parametrized by a path ~c(t) which is smooth (i.e., C1 with non-vanishing speed).

For a smooth curve the unit tangent vector ~T(t) is continuous, and so we eliminate the bad behavior of the astroid example above.

Specializing to curves in space, ~c(t) ∈ R3, we will want to use vector operations to study the velocity and tangent vectors associated to the paths. The following product rules will be useful:

Theorem 3.8.

1. If f : R → R and ~c : R → Rn are both differentiable, then
\[ \frac{d}{dt}\big( f(t)\,\vec c(t) \big) = f(t)\,\vec c\,'(t) + f'(t)\,\vec c(t). \]

2. If ~c, ~d : R → Rn are differentiable, then
\[ \frac{d}{dt}\big( \vec c(t)\cdot\vec d(t) \big) = \vec c\,'(t)\cdot\vec d(t) + \vec c(t)\cdot\vec d\,'(t). \]

3. If ~c, ~d : R → R3 are differentiable, then
\[ \frac{d}{dt}\big( \vec c(t)\times\vec d(t) \big) = \vec c(t)\times\vec d\,'(t) + \vec c\,'(t)\times\vec d(t), \]
where \(\vec c\times\vec d = \sum_{i,j,k=1}^{3} c_i d_j\,\varepsilon_{ijk}\,\vec e_k\), with
\[ \varepsilon_{ijk} = \begin{cases} 0 & \text{if } i = j \text{ or } j = k \text{ or } k = i \\ 1 & \text{if } (i, j, k) \text{ is positively ordered} \\ -1 & \text{if } (i, j, k) \text{ is negatively ordered.} \end{cases} \]


To prove these, we write them in components,
\[ \vec c(t) = (c_1(t), c_2(t), \dots, c_n(t)) = \sum_{j=1}^{n} c_j(t)\,\vec e_j, \]
using the standard basis {~e1, . . . , ~en} of Rn, and apply the ordinary product rule. For example,
\[ \frac{d}{dt}\big( f(t)\,\vec c(t) \big) = \frac{d}{dt}\left( \sum_{j=1}^{n} f(t)\,c_j(t)\,\vec e_j \right) = \sum_{j=1}^{n} \frac{d}{dt}\big( f(t)\,c_j(t) \big)\,\vec e_j = \sum_{j=1}^{n} \big( f'(t)\,c_j(t) + f(t)\,c_j'(t) \big)\,\vec e_j = f(t)\,\vec c\,'(t) + f'(t)\,\vec c(t).
\]

The verification of the other two formulae is left as an exercise.
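One can also spot-check the dot-product rule numerically (an aside, not from the notes; the paths ~c and ~d below are arbitrary choices made for this sketch): a central difference of t ↦ ~c(t)·~d(t) should match ~c′(t)·~d(t) + ~c(t)·~d′(t).

```python
import math

def c(t):
    return (math.cos(t), math.sin(t), t**2)

def d(t):
    return (t, math.exp(t), 1.0)

def c_prime(t):
    return (-math.sin(t), math.cos(t), 2 * t)

def d_prime(t):
    return (1.0, math.exp(t), 0.0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lhs(t, h=1e-6):
    # central-difference derivative of c(t) . d(t)
    return (dot(c(t + h), d(t + h)) - dot(c(t - h), d(t - h))) / (2 * h)

def rhs(t):
    # product rule:  c'(t) . d(t) + c(t) . d'(t)
    return dot(c_prime(t), d(t)) + dot(c(t), d_prime(t))
```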

Example 3.9. Suppose ~c is a twice-differentiable path whose acceleration vector ~a(t) = ~c″(t) satisfies the equation
\[ \vec a(t) = k\,\vec c(t) \]
for some constant k ≠ 0. Show that ~c(t) describes a motion in a fixed plane.

Define a vector
\[ \vec n(t) = \vec c(t)\times\vec v(t) = \vec c(t)\times\vec c\,'(t). \]
Notice ~n(t) ⊥ ~c(t) and ~n(t) ⊥ ~v(t); i.e., ~n(t) is normal to the plane containing the vectors ~c, ~v (drawn with a common base point). Our goal is to show that ~n is constant, independent of t, in which case the path ~c(t) always lies in the same plane, normal to ~n. To do this, we differentiate the equation defining ~n:
\[ \frac{d\vec n}{dt} = \frac{d}{dt}\big( \vec c(t)\times\vec c\,'(t) \big) = \vec c(t)\times\underbrace{\vec c\,''(t)}_{\vec a(t)} + \underbrace{\vec c\,'(t)\times\vec c\,'(t)}_{\vec 0} = \vec c(t)\times k\,\vec c(t) = \vec 0.
\]
Therefore, ~n is constant in time.

As ~n is a constant vector, it defines a fixed plane,
\[ P = \{ \vec w \mid \vec w\cdot\vec n = 0 \}, \]
passing through the origin, which contains the moving vector ~c(t) for all t. We have just shown that the path moves in the plane P.

Definition 3.10 (Arc length). The arc length (or distance traveled along the parameterized curve) for a ≤ t ≤ b is given by
\[ \int_a^b \underbrace{\|\vec c\,'(t)\|}_{\text{speed}}\, dt. \]


We are also interested in keeping track of how much distance we travel along a path, starting at a fixed time t = a. Think of the trip odometer in a car, which you set to zero at the start of a road trip and which keeps track of how far you’ve driven since you started out.

Definition 3.11 (Arc length function). The arc length function associated to the path ~c(t), with starting time t = a, is
\[ s(t) = \int_a^t \|\vec c\,'(u)\|\, du. \]

We use the dummy variable u in the integral since the independent variable t is one of the limits of integration. So the arc length function s(t) is an antiderivative of the speed ‖~c′(t)‖,
\[ \frac{ds}{dt} = \|\vec c\,'(t)\|, \]
but with the constant of integration chosen to make s(a) = 0. If this seems mysterious, you should review the First Fundamental Theorem of Calculus from Stewart.

Example 3.12. Consider the following path (a helix):
\[ \vec c(t) = (3\cos t, 3\sin t, 4t), \quad t \in [0, 4\pi]. \]
Its velocity vector is given by
\[ \vec v(t) = (-3\sin t, 3\cos t, 4). \]
It follows that its speed is ‖~c′(t)‖ = √(3^2 + 4^2) = 5. Then we can compute the arc length function:
\[ s(t) = \int_0^t \|\vec v(u)\|\, du = \int_0^t 5\, du = 5t. \]
In particular, the total arc length over t ∈ [0, 4π] is s(4π) = 20π.
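As a numerical aside (not from the notes), a midpoint Riemann sum of the speed reproduces the total arc length 20π of the helix:

```python
import math

def speed(t):
    # ||c'(t)|| for c(t) = (3 cos t, 3 sin t, 4t); constant, equal to 5
    return math.hypot(math.hypot(-3 * math.sin(t), 3 * math.cos(t)), 4.0)

def arc_length(a, b, n=10000):
    # midpoint-rule approximation of the arc length integral of the speed
    h = (b - a) / n
    return sum(speed(a + (i + 0.5) * h) for i in range(n)) * h
```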

Definition 3.13. When the path ~c(t) traces out the curve with speed ‖~v(t)‖ = 1 for all t,we say that the curve is arc length parameterized.

We note that a C1 arc length parametrized path is smooth. If a curve is arc length parameterized, its arc length function (with starting time a = 0) becomes
\[ s(t) = t. \]
In this case it is conventional to use the letter s instead of t as the parameter in the path.

Theorem 3.14. Let ~c(t), t ∈ I, be a smooth path. Then there is an arc length reparametrization ~γ(s) of the curve, with ‖~γ′(s)‖ = 1 for all s.

Example 3.15. In Example 3.12, the helix is not arc length parameterized, but we can re-parameterize it so that it is. To do so, we need to invert the function s(t), solving for t = ϕ(s).

Going back to the example, we had s(t) = 5t. It follows that t = s/5. Then
\[ \vec\gamma(s) = \vec c(\varphi(s)) = \vec c\!\left( \frac s5 \right) = \left( 3\cos\frac s5,\; 3\sin\frac s5,\; \frac{4s}{5} \right) \]
is an arc length parameterization of the original helix, i.e. ‖~γ′(s)‖ = 1 for all s.
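The unit-speed claim can be spot-checked numerically (an aside added here; the finite-difference speed of the reparametrized helix should be 1 at every sample point):

```python
import math

def gamma(s):
    # arc length reparametrization of the helix from Example 3.15
    return (3 * math.cos(s / 5), 3 * math.sin(s / 5), 4 * s / 5)

def speed_of_gamma(s, h=1e-6):
    # finite-difference approximation of ||gamma'(s)||
    p = gamma(s + h)
    m = gamma(s - h)
    return math.sqrt(sum((a - b)**2 for a, b in zip(p, m))) / (2 * h)
```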


3.2 Directional derivatives and gradient

Now let’s use curves to explore functions defined on Rn. We restrict our attention here to scalar-valued functions, that is, f : Rn → R. Think of f(~x) as describing a quantity such as temperature or air pressure in a room, and an insect or tiny airplane moving around the room along a path ~c(t), measuring the value of f as it moves. In other words, it measures the composition h(t) = f(~c(t)).

If f : Rn → R is differentiable, Df(~x) is a (1 × n) matrix:
\[ Df(\vec x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}. \]
We use paths ~c(t) to explore f(~x) by looking at
\[ h(t) = f\circ\vec c(t) = f(\vec c(t)), \]
where h : R → R. By the chain rule,
\[ Dh(t) = h'(t) = \underbrace{Df(\vec c(t))}_{1\times n}\,\underbrace{D\vec c(t)}_{n\times 1} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix} \begin{bmatrix} c_1' \\ c_2' \\ \vdots \\ c_n' \end{bmatrix}.
\]
We can think of this as the dot product of ~c′(t) with the vector Df^T = ∇f, the gradient vector:
\[ h'(t) = \nabla f(\vec c(t))\cdot\vec c\,'(t). \]

Suppose f : Rn → R is differentiable at ~a ∈ Rn, and we have a path ~c : R → Rn with ~c(0) = ~a. Let ~v = ~c′(0). Then h′(0) measures the rate of change of f along the path ~c as we pass through ~a:
\[ h'(0) = \nabla f(\vec c(0))\cdot\vec c\,'(0) = \nabla f(\vec a)\cdot\vec v. \]
Note that we get the same value of h′(0) for any path ~c(t) passing through ~a with velocity ~c′(0) = ~v: other than insisting that the path pass through the point ~a with velocity vector ~v, this quantity is path-independent. In other words, h′(0) says something about f at ~a, and not about the past or future trajectory of the path ~c(t).

Definition 3.16 (Directional derivative). The directional derivative of f at ~a in direction ~v is given by
\[ D_{\vec v} f(\vec a) = Df(\vec a)\,\vec v = \nabla f(\vec a)\cdot\vec v. \]

Now we can make some observations. Since the shape of the path ~c(t) doesn’t matter, only that it passes through ~a with velocity ~v, we might as well choose a straight-line path when calculating directional derivatives: ~c(t) = ~a + t~v, with ~c(0) = ~a and ~c′(0) = ~v. Using the Chain Rule, the directional derivative can be rewritten as
\[ D_{\vec v} f(\vec a) = \lim_{t\to 0} \frac{f(\vec a + t\vec v) - f(\vec a)}{t}. \]


From this formula we recognize that the partial derivatives are just special cases of directional derivatives, obtained by choosing ~v = ~ej. That is,
\[ \frac{\partial f}{\partial x_j}(\vec a) = D_{\vec e_j} f(\vec a). \]

There is one little problem with our definition of directional derivatives: D_{~v} f(~a) depends not only on the “direction” of the vector ~v, but also on its magnitude. To see this, notice that although ~v and 2~v are parallel (and thus have the same direction), D_{2~v} f(~a) = ∇f(~a) · (2~v) = 2 D_{~v} f(~a). To get information on how fast f is changing at ~a, we need to restrict to unit vectors, ‖~v‖ = 1. So we often use the term “directional derivative” only when ~v is a unit vector.

Directional derivatives also give a geometrical interpretation of the gradient vector ∇f(~a), when ∇f(~a) ≠ ~0. Applying the Cauchy-Schwarz inequality (Theorem 1.2) to a unit vector ~v, we have
\[ D_{\vec v} f(\vec a) = \nabla f(\vec a)\cdot\vec v \le \|\nabla f(\vec a)\|\,\|\vec v\| = \|\nabla f(\vec a)\|, \]
and equality holds if and only if ~v is parallel to ∇f(~a). Therefore, we can conclude that the length of the gradient vector, ‖∇f(~a)‖, gives the largest value of D_{~v} f(~a) among all choices of unit direction ~v:
\[ \|\nabla f(\vec a)\| = \max\left\{ D_{\vec u} f(\vec a) : \vec u\in\mathbb R^n \text{ with } \|\vec u\| = 1 \right\}. \]
In other words, the direction in which f(~x) increases most rapidly at ~a is the direction of ∇f(~a), i.e.
\[ \vec v = \frac{\nabla f(\vec a)}{\|\nabla f(\vec a)\|}, \]
provided that ∇f(~a) ≠ ~0. Similarly, −∇f(~a) points in the direction of the least (most negative) D_{~v} f(~a); i.e.,
\[ \vec v = -\frac{\nabla f(\vec a)}{\|\nabla f(\vec a)\|} \]
gives the direction in which f decreases fastest at ~a.
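This maximality property is easy to observe numerically (an illustrative aside; the sample function f below is an arbitrary choice for this sketch): scanning D_{~v} f(~a) over many unit directions ~v, the largest value found approaches ‖∇f(~a)‖.

```python
import math

def f(x, y):
    # an arbitrary sample scalar function for this sketch
    return x**2 + 3 * x * y

def grad_f(x, y):
    return (2 * x + 3 * y, 3 * x)

def directional_derivative(x, y, v):
    # D_v f(a) = grad f(a) . v, for a unit vector v
    gx, gy = grad_f(x, y)
    return gx * v[0] + gy * v[1]

a = (1.0, 2.0)
g = grad_f(*a)              # (8, 3) at this point
gnorm = math.hypot(*g)

# sample many unit directions; the largest D_v f should approach ||grad f||
best = max(directional_derivative(*a, (math.cos(th), math.sin(th)))
           for th in [k * 2 * math.pi / 3600 for k in range(3600)])
```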

3.3 Gradients and level sets

We recall that if G : Rn → R, the level set of G at level k ∈ R (a constant) is
\[ S = \{ \vec x\in\mathbb R^n : G(\vec x) = k \}. \]
For example, in R2, the function f(x, y) = x^2 + y^2 has level sets x^2 + y^2 = k, which define circles of radius √k when k > 0. When k = 0 the level set consists only of the origin, and for k < 0 the level set is empty, S = ∅. This suggests that generically (that is, for all but a few values of k) the level set of a C1 function of 2 variables is a curve, but that it can be lower-dimensional (a point, or the empty set). Later on, with the Implicit Function Theorem, we will understand why this is so.

What does the gradient ∇G tell us in this situation? The graph of G(~x) lies in the higher-dimensional space Rn+1, and so when n ≥ 3 we can’t really visualize the graph.


However, we know from the last section that G increases most rapidly if we move ~x along the direction of ∇G(~x). Moving ~x along the level set S doesn’t change the value of G at all, so this suggests that the directional derivatives “along” the level set are zero. The precise statement uses the concept of the tangent plane to the surface:

Theorem 3.17. Assume G : Rn → R is a C1 function, ~a ∈ Rn, k = G(~a), and S = {~x ∈ Rn : G(~x) = k} is the level set.

If ∇G(~a) ≠ ~0, then ∇G(~a) is a normal vector to the tangent plane of S at ~a.

To see why this should be true, look at the case n = 3, so S is a surface in R3. As before, we consider paths ~c(t), but now we insist that the path remain on the level set S for all times t: G(~c(t)) = k. Assume the path crosses through the point ~a at time t = 0, ~c(0) = ~a, with velocity vector ~v = ~c′(0); then ~v is a tangent vector to the surface S at ~a. Using implicit differentiation and the chain rule, since G(~c(t)) = k is constant,
\[ 0 = \frac{d}{dt}(k) = \frac{d}{dt}\, G(\vec c(t)) = \nabla G(\vec c(t))\cdot\vec c\,'(t). \tag{3.1} \]
Evaluating at t = 0, we get ∇G(~a) · ~v = 0, so the vector ∇G(~a) is orthogonal to the vector ~v, which is tangent to S. Since this is true for any path ~c(t) in S with any velocity vector ~v, it follows that ∇G(~a) is either the zero vector or a normal vector to the plane containing all the tangent vectors.

Example 3.18. Suppose g : R → R is C1 with g′(r) ≠ 0 for all r > 0, and G : R3 → R is defined by G(x, y, z) = g(x^2 + y^2 + z^2). If the origin does not lie on the level set S = {(x, y, z) : G(x, y, z) = k}, then the normal vector to S is parallel to the position vector ~x = (x, y, z).

To see this, use the chain rule to calculate the gradient of G: define r = f(x, y, z) = x^2 + y^2 + z^2, and then
\[ DG(x, y, z) = Dg(r)\, Df(x, y, z) = [\, g'(r) \,]\begin{bmatrix} 2x & 2y & 2z \end{bmatrix} = 2g'(r)\begin{bmatrix} x & y & z \end{bmatrix}.
\]
Writing this as a gradient vector, ∇G = 2g′(r)~x, which is parallel to ~x as long as g′(r) ≠ 0; this holds since the origin (r = 0) does not lie on the level set.

3.4 Lagrange multipliers

As an application of these ideas we present the method of Joseph-Louis Lagrange (1736–1813) for finding constrained maxima and minima. Here's the setup: we want to minimize or maximize a function F : Rn → R, but only over those vectors which satisfy a constraint condition, given as the level set of G : Rn → R. That is, we want to

maximize (or minimize) F (~x) for all ~x with G(~x) = k, (3.2)

where k is a fixed constant. Such optimization problems arise naturally in fields such as economics, where fixed resources or capacities restrict our choice when optimizing.

The way that Lagrange solves this problem is to introduce a new unknown scalar λ, the“Lagrange multiplier”, into the problem:


Theorem 3.19 (Lagrange multipliers). Assume F, G : Rn → R are C1 functions. If the constrained optimization problem (3.2) is attained at ~x = ~a, and ∇G(~a) ≠ ~0, then there is a constant λ ∈ R with

∇F (~a) = λ∇G(~a).

Proof of Theorem 3.19. Assume F has a maximum at ~a ∈ S, that is, with G(~a) = k. Consider any path ~c(t) which lies on the surface S and passes through ~a, so G(~c(t)) = k is constant for all t, with ~c(0) = ~a, and ~c′(0) = ~v is a tangent vector to S at ~a. The function h(t) = F(~c(t)) has its maximum at t = 0 (when ~c(0) = ~a), and so by first year calculus h′(0) = 0. As in the derivation of (3.1), by implicit differentiation and the chain rule we have

0 = h′(0) = ∇F (~c(0)) · ~c′(0) = ∇F (~a) · ~v,

for any tangent vector ~v to S at ~a. In other words, ∇F(~a) is orthogonal to all of the vectors in the tangent plane to S, so it must point parallel to the normal vector of S. But by Theorem 3.17 that is given by ∇G(~a), and so there is a constant λ of proportionality for which ∇F(~a) = λ∇G(~a).

So the gradients of F and G are parallel at a solution of the constrained extremal problem. In practice, this means that to solve (3.2) we must simultaneously solve the (n+1) equations,

∇F (~x) = λ∇G(~x), G(~x) = k, (3.3)

for the (n+ 1) unknowns ~x = (x1, . . . , xn) and λ.

Example 3.20. Find the closest and furthest points from the origin to the surface 4x2 + 2xy + 4y2 + z2 = 1.

First we need to set this up as a problem with a level set constraint. The function to be optimized is the distance to the origin, which is ‖~x‖. However, it is equivalent (and much easier) to instead minimize/maximize the square of the distance, F(~x) = ‖~x‖2 = x2 + y2 + z2, since the square is a monotone increasing function of positive numbers. The constraint is G(x, y, z) = 1, with G(x, y, z) = 4x2 + 2xy + 4y2 + z2. These are C1 functions, so by Theorem 3.19 at an extremum we must satisfy the equations (3.3), which in this example gives:

2x = λ(8x+ 2y), 2y = λ(2x+ 8y), 2z = 2λz, 4x2 + 2xy + 4y2 + z2 = 1,

to be solved for (x, y, z) and λ. Note that these equations are nonlinear, but with a bit of care we will use linear algebra anyway. But be very careful not to let solutions go unnoticed!

The easiest equation is the third one, 2z = 2λz or (1 − λ)z = 0. So there are two possibilities (both of which must be dealt with!): z = 0 or λ = 1. First, assume z = 0. We rewrite the first two equations as a system,

4x + y = (1/λ)x,    x + 4y = (1/λ)y,

or, equivalently,

(4 − 1/λ)x + y = 0,    x + (4 − 1/λ)y = 0.


If this looks familiar it's because it's an eigenvalue problem (except the eigenvalue is called 1/λ). To find a nonzero solution we need the determinant of the matrix of coefficients to vanish (as in the eigenvalue problem!), that is

0 = det [ (4 − 1/λ)  1 ; 1  (4 − 1/λ) ] = (4 − 1/λ)^2 − 1 = (5 − 1/λ)(3 − 1/λ).

So we can solve for λ = 1/5, 1/3; that is, there are two different possibilities for λ when z = 0.

When λ = 1/5, substituting into the first equation of the system gives y = x. Then, plugging into the constraint equation,

1 = 4x2 + 2xy + 4y2 + z2 = 4x2 + 2x2 + 4x2 + 0 = 10x2,

and so x = ±√(1/10). This gives our pair of solutions to the equations,

(x, y, z) = (√(1/10), √(1/10), 0),    (x, y, z) = (−√(1/10), −√(1/10), 0).

(Be careful: since y = x we don't get all the permutations of the signs!) For the other Lagrange multiplier λ = 1/3, we again use the first equation to get y = −x, and then the constraint gives us

1 = 4x2 + 2xy + 4y2 + z2 = 4x2 − 2x2 + 4x2 + 0 = 6x2,

and so x = ±√(1/6). This gives our pair of solutions to the equations,

(x, y, z) = (√(1/6), −√(1/6), 0),    (x, y, z) = (−√(1/6), √(1/6), 0).

This exhausts all the possibilities for the case z = 0. The other case was λ = 1, in which case the equations for (x, y) become

3x + y = 0,    x + 3y = 0,

which has only the trivial solution (x, y) = (0, 0) (check the determinant of the coefficient matrix!). Plugging into the constraint,

1 = 4x2 + 2xy + 4y2 + z2 = 0 + 0 + 0 + z2 = 1,

so z = ±1, and we get

(x, y, z) = (0, 0, 1), (x, y, z) = (0, 0,−1).

Thus, we have 6 points which satisfy the Lagrange multiplier equations. The maximum and minimum must be attained among these, so test them in F(x, y, z):

F(±(√(1/10), √(1/10), 0)) = 1/5,    F(±(√(1/6), −√(1/6), 0)) = 1/3,    F(0, 0, ±1) = 1.

The minimum value is 1/5, attained at the points (x, y, z) = ±(√(1/10), √(1/10), 0), and the maximum value is 1, attained at the points (x, y, z) = (0, 0, ±1).
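As a quick numerical sanity check, we can confirm in Python that all six candidate points satisfy the constraint and that F takes the values 1/5, 1/3, and 1 on them (an added illustration, not part of the derivation):

```python
import math

# Verify the six candidate points of Example 3.20: each lies on the
# constraint surface G = 4x^2 + 2xy + 4y^2 + z^2 = 1, and
# F = x^2 + y^2 + z^2 takes the values 1/5, 1/3, 1 on them.
def G(x, y, z): return 4*x*x + 2*x*y + 4*y*y + z*z
def F(x, y, z): return x*x + y*y + z*z

s10, s6 = math.sqrt(1/10), math.sqrt(1/6)
points = [(s10, s10, 0), (-s10, -s10, 0),   # F = 1/5  (the minimum)
          (s6, -s6, 0), (-s6, s6, 0),       # F = 1/3
          (0, 0, 1), (0, 0, -1)]            # F = 1    (the maximum)

for p in points:
    assert abs(G(*p) - 1) < 1e-12   # on the constraint surface
assert abs(F(*points[0]) - 1/5) < 1e-12
assert abs(F(*points[2]) - 1/3) < 1e-12
assert F(*points[4]) == 1
```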


3.5 Practice Problems

1. Consider the path ~c : R→ R2 defined by ~c(t) = (cos t , sin2 t).

(a) What curve is traced out by ~c(t) for t ∈ R?

[Careful! | sin t|, | cos t| ≤ 1 for all t.]

(b) Verify that the path is smooth (or regular) for 0 < t < π, but not for any larger interval. What goes wrong at t = 0, π?

(c) Find a smooth reparametrization ~r(u), u ∈ [−1, 1], which traces out the entire curve.

2. Assume ~c(t) is a smooth parametrized curve.

(a) If ~c(t) lies on the surface of the sphere of radius R > 0 for all t ∈ R, show that the velocity ~c′(t) is always orthogonal to the position ~c(t).

[Hint: the sphere of radius R > 0 can be described as all vectors with ‖~x‖2 = R2. Plug in ~c(t) and do implicit differentiation in t.]

(b) Conversely, show that if the position and velocity are always orthogonal, ~c(t) · ~c′(t) = 0 for all t ∈ R, then the path ~c(t) always remains on a sphere.

3. Let F : R3 → R be a C1 function, and S the level set defined by F(~x) = 4. Assume that the origin lies inside the level surface S, and there is a point ~a ∈ S at which the distance to the origin is maximized. Show that the normal vector to the surface S at ~a is parallel to ~a.

[Hint: the distance is maximized means that ‖~x‖ ≤ ‖~a‖ holds for all ~x ∈ S, which also means that G(~x) := ‖~x‖2 has a maximum at ~a.]


Chapter 4

The geometry of space curves

In this chapter we look in more detail at curves in R3, and the relationship between the dynamics (how they are traced out in time) and the geometry (their shape). This is an introduction to the field of differential geometry, and is based on finding a moving basis of vectors which is carried by the space curve along the trajectory.

4.1 The Frenet frame and curvature

A path, or moving vector,

~c(t) = (x(t), y(t), z(t)) = (x, y, z)(t),

traces out a curve, for t ∈ [a, b], in space. We recall that its velocity vector and speed are given by ~v(t) = ~c′(t) and ds/dt = ‖~c′(t)‖, respectively. Recall also that we have a smooth parameterization if ~c ∈ C1 and ‖~c′(t)‖ ≠ 0 for any t ∈ [a, b].

In the previous chapter we introduced the arc length function,

s(t) = ∫_a^t ‖~c′(u)‖ du,

the total distance along the curve up to time t. We also introduced the idea of arc length parameterization, in which s(t) = t. Then, since ds/dt = ‖~c′(t)‖, arc length parameterization is a path that travels along the curve with unit speed, ds/dt = 1, throughout. In fact, any path with ‖~c′(t)‖ ≠ 0 can be parameterized by arc length by inverting s = s(t) such that t = ϕ(s). Note that we can always do this for a smooth path (ds/dt > 0, so s(t) is monotonically increasing). In practice, however, you may not be able to find an explicit formula for the arc length parameterization!

Example 4.1. Consider the following path:

~c(t) = (x(t), y(t)) = (t, (1/2)t2).

Since y = x2/2, it's a parabola. Then, we observe that

~c′(t) = (1, t),    ‖~c′(t)‖ = √(1 + t2) ≥ 1 > 0.


So the path is smooth. Then, we have

s(t) = ∫_0^t ‖~c′(u)‖ du = ∫_0^t √(1 + u2) du = (1/2)(ln|√(1 + t2) + t| + t√(1 + t2)).

Clearly, there’s no way we can explicitly solve for t as a function of s. The way out of thistrouble is to treat all ~c as if they were parameterized by arc length and use Chain rule withds/dt = ‖~c ′(t)‖ to compensate.

Recall that the unit tangent vector to ~c(t) is

~T(t) = ~c′(t)/‖~c′(t)‖.

We wish to understand how the shape of the curve changes over its length: how curved is it?

Definition 4.2. The curvature of a curve is defined as the rate of change of the unit tangent vector with respect to arc length:

κ = ‖d~T/ds‖.

By the chain rule,

d~T/dt = (d~T/ds)(ds/dt).

So, in the original time parameter, t,

κ(t) = ‖(d~T/dt)/(ds/dt)‖ = ‖~T′(t)‖/‖~c′(t)‖.

While the curvature is naturally defined in terms of the arc length function s (because it is a geometrical quantity, and shouldn't depend on the speed at which the curve is traced out), this formula is more practical for finding κ in specific examples.
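For instance, we can approximate κ(t) = ‖~T′(t)‖/‖~c′(t)‖ for the parabola ~c(t) = (t, t2/2) of Example 4.1 by finite differences, and compare with the standard plane-curve value κ = 1/(1 + t2)^(3/2) for y = x2/2 (a sketch added for illustration):

```python
import math

# Estimate kappa(t) = ||T'(t)|| / ||c'(t)|| for c(t) = (t, t^2/2)
# by a central difference on the unit tangent T, and compare with
# the known value kappa = 1/(1+t^2)^(3/2) for the parabola y = x^2/2.
def c_prime(t):
    return (1.0, t)

def unit_tangent(t):
    vx, vy = c_prime(t)
    n = math.hypot(vx, vy)
    return (vx/n, vy/n)

def curvature(t, h=1e-6):
    (ax, ay), (bx, by) = unit_tangent(t - h), unit_tangent(t + h)
    Tprime = math.hypot((bx - ax)/(2*h), (by - ay)/(2*h))  # ||T'(t)||
    return Tprime / math.hypot(*c_prime(t))

for t in (0.0, 1.0, 2.0):
    assert abs(curvature(t) - (1 + t*t)**-1.5) < 1e-6
```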

We’ve understood the magnitude of derivative of the unit tangent ‖~T ′(s)‖ as a measure

of curvature; what does its direction tell us? Since ‖~T (s)‖ = 1 is a unit vector for all s, byimplicit differentiation,

0 = d/ds (1) = d/ds (‖~T(s)‖2) = d/ds (~T(s) · ~T(s)) = 2~T(s) · ~T′(s),

using the product rule for dot products, Theorem 3.8. Therefore, we have ~T′(s) ⊥ ~T(s) at all points on the curve.

Definition 4.3. The principal normal vector to a smoothly parametrized curve ~c(s) is

~N(s) = ~T′(s)/‖~T′(s)‖.


Looking at the equation satisfied by the derivative of the unit tangent, ~T′(s) = ‖~T′(s)‖ ~N = κ ~N, so the tangent turns in the direction of ~N. Note that ~N is only defined when ‖~T′(s)‖ ≠ 0, that is, when the curvature is nonzero! When the curvature is zero there is no preferred direction apart from the unit tangent; this is easy to understand for a straight line path. If ~c(s) is a straight line, the curvature κ(s) = 0 for all s, and there is no second direction in which the tangent turns.

As we don’t always want to express our paths with arclength parametrization, we shouldhave a formula for ~N(t) in terms of an arbitrary parameter t. As usual, this is easy using

the chain rule. Since ~T ′(t) = ~T ′(s)dsdt

, the term dsdt

cancels out in the fraction, and we havethe easy formula,

~N(t) =~T ′(t)

‖~T ′(t)‖Example 4.4. Consider a circle of radius R > 0 in xy-plane:

~c(t) = (R sin t, R cos t).

Now, we can easily find its velocity vector and speed:

~c′(t) = (R cos t, −R sin t),    ‖~c′(t)‖ = R.

Notice that this travels with constant speed (clockwise) but is not arc length parameterized (unless the radius R = 1).

Its unit tangent is

~T(t) = ~c′(t)/‖~c′(t)‖ = ~c′(t)/R = (cos t, − sin t).

Then, since ‖~T′(t)‖ = 1,

~N(t) = ~T′(t) = (− sin t, − cos t).

Again, notice that ~N(t) is perpendicular to ~T(t). Finally, we have

κ(t) = ‖~T′(t)‖/‖~c′(t)‖ = 1/R.

Therefore, a circle with a larger radius has less curvature.

Example 4.5. Consider the following helix:

~c(t) = (3 cos t, 3 sin t, 4t).

Following the same approach as shown in the previous example, we get

~c′(t) = (−3 sin t, 3 cos t, 4),    ‖~c′(t)‖ = 5,

~T(t) = (−(3/5) sin t, (3/5) cos t, 4/5),    ~T′(t) = (−(3/5) cos t, −(3/5) sin t, 0).


Then,

κ(t) = ‖~T′(t)‖/‖~c′(t)‖ = (3/5)/5 = 3/25.

This curve also has a constant curvature.

As long as the curvature κ(s) ≠ 0, ~T and ~N determine a plane in R3, the osculating plane. The normal vector to the osculating plane is given by

~B = ~T × ~N,

the Binormal vector to the curve. We observe that ~B ⊥ ~T , ~B ⊥ ~N , and

‖ ~B‖ = ‖~T‖‖ ~N‖| sin θ| = 1 · 1 · sin (π/2) = 1

Therefore, {~T(s), ~N(s), ~B(s)} is a moving orthonormal basis for R3 at each point along the curve. This basis is also referred to as the moving frame or Frenet frame.

Remark. • If the curvature κ(s) = 0 for all s, then the curve is a straight line. This is easy to see, since then ~T′(s) = 0 for all s, which may be integrated to get ~T(s) = ~u, a constant, for all s. But ~c′(s) = ~T(s) = ~u is integrated to ~c(s) = s~u + ~a, where ~u, ~a are constants, which is the parametric equation of a line.

• When κ = 0, ~N and ~B cannot be defined.

• If ~B(s) is a constant vector, then ~c(t) lies in a fixed plane, with normal vector ~B.

Now, suppose ~B(s) isn't constant. First, ‖ ~B(s)‖ = 1 for all s. Then,

1 = ‖ ~B(s)‖2 = ~B(s) · ~B(s)

holds for all s. So we can apply implicit differentiation:

0 = d/ds (1) = d/ds (~B · ~B) = 2 ~B′ · ~B.

Then, it follows that ~B′ ⊥ ~B for every s. Next, since ~B(s) ⊥ ~T(s) for all s, we have ~B · ~T = 0 for all s. Then,

d/ds (~B · ~T) = ~B′(s) · ~T(s) + ~B(s) · ~T′(s) = 0.

Since ~T ′ = κ ~N and ~B · ~N = 0, it follows that

~B ′(s) · ~T (s) = 0 ⇐⇒ ~B ′(s) ⊥ ~T (s)

Since {~T, ~N, ~B} form an orthonormal basis for R3, we must have ~B′(s) parallel to ~N. Therefore,

~B′(s) = −τ(s) ~N(s)

for a function τ(s) called the torsion. Since τ = ‖d~B/ds‖, it measures how fast the normal ~B to the osculating plane is twisting.


Definition 4.6 (Torsion).

τ = ‖d~B/ds‖ = ‖~B′(t)‖/‖~c′(t)‖.

Putting all the information together we get the Frenet formulas:

Theorem 4.7 (Frenet formulas). Let ~c(t) be a smoothly parametrized curve. Then the Frenet frame {~T, ~N, ~B} satisfies the system of equations,

d~T/ds = κ ~N,    d~B/ds = −τ ~N,    d~N/ds = −κ~T + τ ~B.

Example 4.8. Consider the following helix:

~c(t) = (3 cos t, 3 sin t, 4t)

Then, we have

‖~c′(t)‖ = 5,
~T(t) = (−(3/5) sin t, (3/5) cos t, 4/5),
~T′(t) = (−(3/5) cos t, −(3/5) sin t, 0),
κ = 3/25,
~N(t) = (− cos t, − sin t, 0),
~B(t) = ~T × ~N = ((4/5) sin t, −(4/5) cos t, 3/5),
~B′(t) = ((4/5) cos t, (4/5) sin t, 0),
τ = 4/25.
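A numerical check of this example (added as an illustration): the frame is orthonormal at every t, and κ = 3/25, τ = 4/25.

```python
import math

# Check Example 4.8 for the helix c(t) = (3 cos t, 3 sin t, 4t):
# {T, N, B} is orthonormal, and kappa = ||T'||/||c'|| = 3/25,
# tau = ||B'||/||c'|| = 4/25, with ||c'|| = 5.
def frame(t):
    T = (-3/5*math.sin(t), 3/5*math.cos(t), 4/5)
    N = (-math.cos(t), -math.sin(t), 0.0)
    B = (4/5*math.sin(t), -4/5*math.cos(t), 3/5)
    return T, N, B

def dot(a, b): return sum(x*y for x, y in zip(a, b))

t = 0.7   # arbitrary parameter value
T, N, B = frame(t)
for v in (T, N, B):
    assert abs(dot(v, v) - 1) < 1e-12                     # unit vectors
assert abs(dot(T, N)) < 1e-12 and abs(dot(T, B)) < 1e-12  # mutually
assert abs(dot(N, B)) < 1e-12                             # orthogonal

Tp = (-3/5*math.cos(t), -3/5*math.sin(t), 0.0)
Bp = (4/5*math.cos(t), 4/5*math.sin(t), 0.0)
assert abs(math.sqrt(dot(Tp, Tp))/5 - 3/25) < 1e-12       # kappa
assert abs(math.sqrt(dot(Bp, Bp))/5 - 4/25) < 1e-12       # tau
```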

4.2 Dynamics

How do the geometrical quantities {~T, ~N, ~B}, κ, τ relate to dynamical quantities: velocity ~c′(t) = ~v(t), speed ‖~c′(t)‖ = ds/dt, and acceleration ~a(t) = ~v′(t) = ~c′′(t)?

First, observe that

~v(t) = ~c′(t) = (ds/dt) ~T(t).


Then,

~a(t) = d/dt ((ds/dt) ~T(t)) = (d2s/dt2) ~T(t) + (ds/dt) ~T′(t) = (d2s/dt2) ~T + (ds/dt)((d~T/ds)(ds/dt)).

So we have

~a(t) = (d2s/dt2) ~T + κ (ds/dt)2 ~N,

where the first term is the linear acceleration and the second is the "steering" term.

By looking at the steering term, we see that the acceleration needed to turn on a curve is proportional to the curvature and to (speed)2.

Example 4.9. Consider the following path

~c(t) = (et cos t, et sin t, et)

which traces a spiral in the xy-plane while increasing monotonically in the z coordinate. First, observe that

~v(t) = ~c′(t) = (−et sin t + et cos t, et cos t + et sin t, et),    ds/dt = ‖~c′(t)‖ = √3 et.

Then, we have

~T(t) = ~c′(t)/‖~c′(t)‖ = (1/√3)(− sin t + cos t, cos t + sin t, 1),
~T′(t) = (1/√3)(− cos t − sin t, − sin t + cos t, 0).

Since ‖~T′(t)‖ = √(2/3), we can easily find the principal normal vector:

~N(t) = ~T′(t)/‖~T′(t)‖ = (1/√2)(− cos t − sin t, − sin t + cos t, 0).

Then,

κ = ‖~T′(t)‖/‖~c′(t)‖ = (√2/3) e−t.

Furthermore,

~B(t) = ~T(t) × ~N(t) = · · · = (1/√6)(cos t − sin t, − sin t − cos t, 2),
~B′(t) = (1/√6)(− sin t − cos t, − cos t + sin t, 0).

Therefore, the torsion of the curve is given by

τ(t) = ‖~B′(t)‖/‖~c′(t)‖ = (1/3) e−t.

We can then verify the formula for ~a in terms of ~T, ~N, κ (and verify that it agrees with ~a = ~v′(t) calculated directly).


Now, we present an alternative equation for curvature using dynamical quantities:

Theorem 4.10.

κ(t) = ‖~c′(t) × ~c′′(t)‖/‖~c′(t)‖3 = ‖~v(t) × ~a(t)‖/‖~v(t)‖3.

Proof. To verify it, we use the decomposition of ~a:

~v × ~a = ~v × ((d2s/dt2) ~T + κ (ds/dt)2 ~N)
       = (d2s/dt2)(~v × ~T) + κ (ds/dt)2 (~v × ~N)
       = κ (ds/dt)3 (~T × ~N)          (since ~v = (ds/dt)~T, so ~v × ~T = ~0)
       = κ (ds/dt)3 ~B.

Then, κ (ds/dt)3 ‖~B‖ = ‖~v × ~a‖. Since ~B is a unit vector, we get the desired identity by taking the norm on each side of the equation.
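As an added illustration, the theorem is easy to test numerically on the helix of Example 4.5, where we already computed κ = 3/25 from the definition:

```python
import math

# Check kappa = ||v x a|| / ||v||^3 on the helix c(t) = (3 cos t, 3 sin t, 4t),
# where Example 4.5 gives kappa = 3/25 at every t.
def cross(u, w):
    return (u[1]*w[2] - u[2]*w[1],
            u[2]*w[0] - u[0]*w[2],
            u[0]*w[1] - u[1]*w[0])

def norm(u):
    return math.sqrt(sum(x*x for x in u))

t = 1.3                                     # arbitrary parameter value
v = (-3*math.sin(t), 3*math.cos(t), 4.0)    # c'(t)
a = (-3*math.cos(t), -3*math.sin(t), 0.0)   # c''(t)
kappa = norm(cross(v, a)) / norm(v)**3
assert abs(kappa - 3/25) < 1e-12
```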

4.3 Practice problems

Practice problems will be given from Stewart [Stew].


Chapter 5

Implicit functions

5.1 The Implicit Function Theorem I

Often, we have an implicit relationship between variables,

F (x1, x2, . . . , xn) = 0,

rather than an explicit function relation, such as

xn = f(x1, x2, . . . , xn−1).

Example 5.1. Look at a familiar example in R2 (See Figure 5.1), the unit circle,

x2 + y2 = 1.

This fails the vertical line test (y ≠ f(x)) as well as the horizontal line test (x ≠ g(y)); globally, this relation does not define a function. Locally, we can write this as a function, i.e. by restricting attention to small pieces of the curve.

First, define

F(x, y) = x2 + y2 − 1.

If y0 > 0, x0^2 + y0^2 = 1, i.e. F(x0, y0) = 0, and we look at a window (or neighborhood) around (x0, y0) which lies entirely in the upper half plane, we can solve for y = f(x),

y = √(1 − x2) =: f(x).

We could calculate y′ = f′(x) from the explicit formula, but we can also get it via implicit differentiation:

0 = d/dx (F(x, f(x))) = (∂F/∂x)(dx/dx) + (∂F/∂y) f′(x) = 2x + 2y f′(x),

so f ′(x) = −x/y.
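As an added check, the implicit formula f′(x) = −x/y agrees with a finite-difference derivative of the explicit branch y = √(1 − x2):

```python
import math

# Compare f'(x) = -x/y from implicit differentiation with a central
# finite difference of the explicit upper-circle branch f(x) = sqrt(1-x^2).
def f(x):
    return math.sqrt(1 - x*x)

x, h = 0.3, 1e-6
numeric = (f(x + h) - f(x - h)) / (2*h)
implicit = -x / f(x)
assert abs(numeric - implicit) < 1e-8
```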

For a general F (x, y) = 0, we can solve for f ′(x) if

(∂F/∂y)(x0, y0) ≠ 0,


[Figure 5.1 shows the unit circle F(x, y) = x2 + y2 − 1 = 0 in the xy-plane, with marked points A and B.]

Figure 5.1. Near the neighborhood around point A, the vertical line test fails, as a single x value corresponds to two y values. On the other hand, the neighborhood around point B can be solved as y = √(1 − x2).

where y is the variable we want to solve for. This gives the limitation on when we can solve for y = f(x) locally. For the circle example, we had

∂F/∂y = 2y.

When y = 0, the vertical line test fails in every neighborhood of (x0, y0) = (±1, 0), which is exactly when the partial derivative with respect to y vanishes.

In general, suppose we have a C1 function,

F : Rn+1 → R,

and consider the solution set of F(~x, y) = 0. In order to write y = f(~x) – i.e. to solve for y as a differentiable function of ~x – we do the same implicit differentiation, with the chain rule,

∂/∂xi (F(x1, x2, . . . , xn, f(~x))) = ∂F/∂xi + (∂F/∂y)(∂f/∂xi)

for each i = 1, 2, . . . , n. We can then solve for each

∂f/∂xi = −(∂F/∂xi)/(∂F/∂y),

provided ∂F/∂y ≠ 0. This is a sufficient condition to solve for y = f(~x).

Theorem 5.2 (Implicit Function Theorem I). Assume F : Rn+1 → R is C1 in a neighborhood of (~x0, y0) with F(~x0, y0) = 0. If (∂F/∂y)(~x0, y0) ≠ 0, then there exist neighborhoods U of ~x0 and V of y0 and a C1 function

f : U ⊂ Rn → V ⊂ R,


for which F (~x, f(~x)) = 0 for all ~x ∈ U . In addition,

Df(~x) = −(1/(∂F/∂y)(~x, y)) D~xF(~x, y),

where

D~xF(~x, y) = [∂F/∂x1  ∂F/∂x2  · · ·  ∂F/∂xn].

Example 5.3. Consider the following equation:

xy + y2z + z3 = 1.

For which part of this surface can we write z = f(x, y)? Here

F : R3 → R,    F(x, y, z) = xy + y2z + z3 − 1

is C1.

We want to solve for z, so we look at

∂F/∂z = y2 + 3z2.

We observe that ∂F/∂z = 0 iff y = 0 and z = 0. However, no point with y = 0 and z = 0 lies on this surface (such a point would give xy + y2z + z3 = 0 ≠ 1). So at all points on this surface, ∂F/∂z ≠ 0, and at every (x0, y0, z0) with F(x0, y0, z0) = 0 we can solve for z = f(x, y) locally near (x0, y0).

We can then use the implicit differentiation formula in the theorem to calculate Df(x, y):

D(x,y)F = [y  (x + 2yz)],

so we get

Df(x, y) = −D(x,y)F/(∂F/∂z) = [−y/(y2 + 3z2)   −(x + 2yz)/(y2 + 3z2)]

or

∇f(x, y) = (−y/(y2 + 3z2), −(x + 2yz)/(y2 + 3z2)).
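We can test the formula for ∂f/∂x numerically (an added sketch; the chosen point (0, 1, z0), with z0 solving z + z3 = 1, lies on the surface):

```python
# Check df/dx = -y/(y^2 + 3z^2) at a point of xy + y^2 z + z^3 = 1.
# For fixed (x, y) near (0, 1), F is strictly increasing in z, so we can
# recover z = f(x, y) by bisection and difference it numerically.
def solve_z(x, y, lo=-2.0, hi=2.0):
    F = lambda z: x*y + y*y*z + z**3 - 1
    for _ in range(200):                 # bisection: F(lo) < 0 < F(hi)
        mid = (lo + hi) / 2
        if F(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

x0, y0 = 0.0, 1.0
z0 = solve_z(x0, y0)                     # z0 solves z + z^3 = 1
h = 1e-6
numeric = (solve_z(x0 + h, y0) - solve_z(x0 - h, y0)) / (2*h)
formula = -y0 / (y0**2 + 3*z0**2)
assert abs(numeric - formula) < 1e-6
```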

Example 5.4. Consider the following equation:

x4 + xz2 + z4 = 1.

Show that we can solve for z = g(x) near (x1, z1) = (−1, 1) but not near (x2, z2) = (1, 0).

Proof. First, let

F(x, z) = x4 + xz2 + z4 − 1.

Clearly, F : R2 → R is C1 for all (x, z) ∈ R2. Observe that

∂F/∂z = 2xz + 4z3,

and so (∂F/∂z)(−1, 1) = 2 ≠ 0.


By the Implicit Function Theorem, we can solve for z = g(x) locally near (x1, z1) = (−1, 1). In addition, we can get an explicit formula for its derivative:

Dg(x) = g′(x) = −(∂F/∂x)/(∂F/∂z) = −(4x3 + z2)/(2xz + 4z3).

Finally, since (∂F/∂z)(1, 0) = 0, the Implicit Function Theorem does not apply near (1, 0).

Example 5.5. Consider the following equation:

x− z3 = 0

Clearly, F (x, z) = x− z3 is C1 for all (x, z) ∈ R2. Note that

∂F/∂z = −3z2.

Clearly, ∂F/∂z = 0 at (x, z) = (0, 0). However, we can write z = x^(1/3) globally. So an explicit representation of z as a function of x exists, z = g(x) = x^(1/3). On the other hand, g(x) isn't differentiable at (x0, z0) = (0, 0), so the vanishing of the partial derivative ∂F/∂z is making itself felt in a more subtle way.

5.2 The Implicit Function Theorem II

We also have a version of the Implicit Function Theorem which applies to under-determined systems of (nonlinear) equations.

Example 5.6. Suppose we have a system of equations with more unknowns than equations:

u2 − v2 − x3 = 0
2uv − y5 = 0

Can we solve for (u, v) as functions of (x, y)?

First, consider a C1 function, F : R4 → R2, defined as follows:

F1(x, y, u, v) = u2 − v2 − x3 = 0
F2(x, y, u, v) = 2uv − y5 = 0

Following what we did before, we can assume (u, v) = g(x, y) and see when we can calculate Dg. Note that

0 = ∂/∂x F1(x, y, u(x, y), v(x, y)) = ∂F1/∂x + (∂F1/∂u)(∂u/∂x) + (∂F1/∂v)(∂v/∂x),
0 = ∂/∂x F2(x, y, u(x, y), v(x, y)) = ∂F2/∂x + (∂F2/∂u)(∂u/∂x) + (∂F2/∂v)(∂v/∂x).

Then, we can solve for ∂u/∂x and ∂v/∂x. Rearranging,

[ ∂F1/∂u  ∂F1/∂v ; ∂F2/∂u  ∂F2/∂v ] [ ∂u/∂x ; ∂v/∂x ] = [ −∂F1/∂x ; −∂F2/∂x ].


This can be solved if D(u,v)F is invertible, i.e. det[D(u,v)F] ≠ 0.

Similarly, we can also solve for ∂u/∂y and ∂v/∂y. As a result, we get a different linear system to solve but with the same matrix [D(u,v)F]. The second version of the Implicit Function Theorem says that this is the correct condition to solve for g(~x) in this setting.

Let’s see how this works in general. The notation is a bit complicated, but the basicidea is the same as in the example. Given the following system of functions,

F1(x1, . . . , xn, u1, . . . , um) = 0
F2(x1, . . . , xn, u1, . . . , um) = 0
...
Fm(x1, . . . , xn, u1, . . . , um) = 0

we want to solve for ~u = (u1, . . . , um) ∈ Rm as a function, ~u = g(~x), of ~x = (x1, . . . , xn) ∈ Rn. Via implicit differentiation, for the case of n = m = 2, we arrived at an appropriate condition where this is possible.

Theorem 5.7 (Implicit Function Theorem II - General Form). Let

F : Rn+m → Rm

be a C1 function in a neighborhood of (~x0, ~u0) ∈ Rn+m, with F(~x0, ~u0) = ~0. If, in addition, D~uF(~x0, ~u0) is invertible, then there exist neighborhoods V of ~x0 and U of ~u0, for which solutions of F(~x, ~u) = ~0 lie on a C1 graph, ~u = g(~x),

g : V ⊂ Rn → U ⊂ Rm.

Example 5.8. Consider the following set of equations:

2xu2 + yv4 = 3
xy(u2 − v2) = 0

Can we solve for (u, v) = g(x, y) near (x0, y0, u0, v0) = (1, 1, −1, −1)?

Let

F = [F1 ; F2],    ~x = (x, y),    ~u = (u, v),

where F is defined as

F1(~x, ~u) = 2xu2 + yv4 − 3 = 0
F2(~x, ~u) = xy(u2 − v2) = 0        (5.1)

Then, we get the following Jacobian

D~uF = ∂(F1, F2)/∂(u, v) = [ ∂F1/∂u  ∂F1/∂v ; ∂F2/∂u  ∂F2/∂v ] = [ 4xu  4yv3 ; 2uxy  −2vxy ].


Substituting the given values, we have

D~uF(1, 1, −1, −1) = [ −4  −4 ; −2  2 ].

Since det D~uF = −16 ≠ 0, the Implicit Function Theorem does apply, and we can solve for ~u = (u, v) = g(~x) = g(x, y) near (x0, y0, u0, v0) = (1, 1, −1, −1).

In general, we can’t get an explicit formula for g, but we can get a formula for Dg(x, y),i.e. its partial derivatives, using implicit differentiation. For the above example, we may cal-culate ∂u

∂x, ∂v∂x

by implicitly differentiating the equations (5.1) with respect to x, rememberingthat u, v are functions of (x, y):

0 = ∂F1/∂x = ∂/∂x [2xu2 + yv4 − 3] = 2u2 + 4xu ux + 4yv3 vx,
0 = ∂F2/∂x = ∂/∂x [xy(u2 − v2)] = y(u2 − v2) + xy(2u ux − 2v vx).

At (x, y, u, v) = (1, 1,−1,−1) we obtain the system of equations,

−4ux − 4vx = −2

−2ux + 2vx = 0.

Notice that the coefficient matrix is exactly D~uF(1, 1, −1, −1). This is not an accident! Using Cramer's Rule¹ the system may be easily solved using determinants, and we have ux(1, 1) = vx(1, 1) = 1/4. To find uy(1, 1), vy(1, 1), do the same thing but take the implicit derivative with respect to y.
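A numerical confirmation of ux(1, 1) = 1/4 (an added sketch): with y = 1, the second equation forces v = u near (−1, −1), so u(x) solves the scalar equation 2xu2 + u4 = 3, which we can track by bisection:

```python
# With y = 1 and v = u near (-1, -1), the system of Example 5.8 reduces
# to 2x u^2 + u^4 = 3.  Solve for u(x) near u = -1 by bisection (the
# left side is decreasing in u on (-1.5, -0.5) for x near 1), then
# difference numerically to check u_x(1, 1) = 1/4.
def solve_u(x, lo=-1.5, hi=-0.5):
    g = lambda u: 2*x*u*u + u**4 - 3
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid) > 0:      # g decreasing: positive means u is too negative
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

h = 1e-6
ux = (solve_u(1 + h) - solve_u(1 - h)) / (2*h)
assert abs(ux - 0.25) < 1e-6
```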

5.3 Inverse Function Theorem

In general, suppose we have a C1 function, f : Rn → Rn, where ~x = f(~u). Is it possible to solve for ~u = g(~x)?

In single-variable calculus, a function f : R → R is one-to-one on an interval [a, b] if and only if f is strictly monotone on [a, b]. For these functions, f has an inverse g = f−1,

g(f(x)) = x, ∀x ∈ [a, b]

If f is differentiable on [a, b], and f′(x) > 0 on [a, b] (or f′(x) < 0 on [a, b]), then the inverse g(x) is also differentiable, and

g′(f(x)) = 1/f′(x), ∀x ∈ [a, b].

1 Cramer’s Rule. Given a system of linear equations that is represented by 2× 2 matrices,{ax+ by = s

cx+ dy = t,

solution of the system is given by

x =

det

(s bt d

)det

(a bc d

) , y =

det

(a sc t

)det

(a bc d

)


If, for example, f′(x) > 0 for all x ∈ R, then it's globally invertible, i.e. g(f(x)) = x for all x ∈ R. How do we apply this for f : Rn → Rn with n ≥ 2? There's no such thing as a monotone function of several variables!

Let’s look at a simple example for guidance.

Example 5.9. Consider the following set of equations:

x = u2 − v2
y = 2uv

Note that this example fits the form of the Implicit Function Theorem, but it's a special case. We want to invert this relation, i.e. given ~x = f(~u), we want to solve for ~u = g(~x).

To get a nice theorem for this special case, we can use the framework of the Implicit Function Theorem:

F1(~x, ~u) = f1(~u) − x = 0
F2(~x, ~u) = f2(~u) − y = 0

Since

D~uF(~x, ~u) = Df(~u),

we can do this locally near a point (~x0, ~u0) provided that

det (Df(~u)) ≠ 0.

This is a very satisfying result, because it's a direct extension of the linear algebra case. If we had a linear change of variables, ~x = M~u with a constant matrix M, we can solve ~u = M−1~x provided det M ≠ 0. So while it makes no sense to calculate the determinant of a nonlinear function, it turns out that the nonlinear function is invertible (at least locally) when its derivative matrix is invertible!

Theorem 5.10 (Inverse Function Theorem). Suppose f : Rn → Rn is C1 in a neighborhood of ~u0, with f(~u0) = ~x0. If det (Df(~u0)) ≠ 0, then there exist neighborhoods U of ~u0 and V of ~x0 and a C1 function g : V → U, with

~x = f(~u) with ~u ∈ U  ⇐⇒  ~u = g(~x) with ~x ∈ V,

i.e. near ~x0 and ~u0, g is the inverse of f. Moreover,

Dg(~x) = [Df(~u)]−1 . (5.2)

The formula for the derivative matrix of the inverse function follows directly from the Chain Rule: since ~x = f(~u) = f(g(~x)), ignore the middle equality and differentiate with respect to ~x. Since the function h(~x) = ~x = I~x, with I the identity matrix, Dh(~x) = I. So

I = Dh(~x) = D(f ◦ g(~x)) = Df(~u)Dg(~x).

Since we are assuming in the theorem that det(Df(~u)) ≠ 0, the matrix is invertible, and the above formula (5.2) is verified.


Example 5.11. Apply the Inverse Function Theorem to the function that was defined in the previous example:

x = u2 − v2
y = 2uv

Observe that

det (Df(u, v)) = det [ 2u  −2v ; 2v  2u ] = 4u2 + 4v2 ≠ 0

as long as (u0, v0) ≠ (0, 0). So we can invert the variables and solve for (u, v) = g(x, y), locally near any (u0, v0) ≠ (0, 0).

Notice that

f1(−u, −v) = x = f1(u, v)
f2(−u, −v) = y = f2(u, v)

So in any neighborhood of (0, 0) there are 2 values of (u, v) corresponding to each (x, y). Therefore, f is not invertible near (u, v) = (0, 0).
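As an added illustration: this map is w = z2 for z = u + iv, so a local inverse near z0 = 1 + 2i is the principal complex square root, and its finite-difference Jacobian matches [Df]−1 as formula (5.2) predicts:

```python
import cmath

# f(u, v) = (u^2 - v^2, 2uv) is w = z^2 with z = u + iv.  Near
# z0 = 1 + 2i a local inverse is g = principal sqrt.  Check that the
# finite-difference partials of g match [Df]^{-1} at (u, v) = (1, 2).
z0 = 1 + 2j
w0 = z0 * z0                          # = -3 + 4i

def g(x, y):
    z = cmath.sqrt(complex(x, y))     # principal branch; valid near w0
    return z.real, z.imag

h = 1e-6
du_dx = (g(w0.real + h, w0.imag)[0] - g(w0.real - h, w0.imag)[0]) / (2*h)
du_dy = (g(w0.real, w0.imag + h)[0] - g(w0.real, w0.imag - h)[0]) / (2*h)

# [Df]^{-1} at (u, v) = (1, 2): Df = [2u -2v; 2v 2u], det = 4(u^2+v^2)
u, v = 1.0, 2.0
det = 4*(u*u + v*v)
inv_row0 = (2*u/det, 2*v/det)         # first row of [Df]^{-1}
assert abs(du_dx - inv_row0[0]) < 1e-6
assert abs(du_dy - inv_row0[1]) < 1e-6
```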

Example 5.12. Consider the following equations:

x = eu cos v
y = eu sin v

For which (u, v, x, y) can we solve for u, v as functions of x, y?

Let

f(u, v) = (eu cos v, eu sin v).

Then, we have

Df(u, v) = [ eu cos v  −eu sin v ; eu sin v  eu cos v ].

Then, we can compute det(Df(u, v)) (or det(∂(x, y)/∂(u, v))):

det(Df(u, v)) = e2u.

Clearly, det(Df(u, v)) > 0 for all u, v. By the Inverse Function Theorem, we can invert and solve for (u, v) = g(x, y) near any (u0, v0).

We can invert locally near any point; can we find a global inverse, i.e. a g for which (u, v) = g(x, y) for every (u, v) ∈ R2? If so, then f would have to be a one-to-one function. However,

f(u, v + 2πk) = f(u, v)

for all k ∈ Z. Therefore, f can’t be globally inverted.
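Both claims in Example 5.12 are easy to check numerically. The sketch below (the helper names `f` and `jacobian_det` are ours, not from the notes) verifies det Df(u, v) = e^(2u) by finite differences, and the 2π-periodicity in v that prevents a global inverse:

```python
import math

def f(u, v):
    return (math.exp(u) * math.cos(v), math.exp(u) * math.sin(v))

def jacobian_det(u, v, h=1e-6):
    # Central finite differences for the 2x2 Jacobian determinant of f.
    dxu = (f(u + h, v)[0] - f(u - h, v)[0]) / (2 * h)
    dxv = (f(u, v + h)[0] - f(u, v - h)[0]) / (2 * h)
    dyu = (f(u + h, v)[1] - f(u - h, v)[1]) / (2 * h)
    dyv = (f(u, v + h)[1] - f(u, v - h)[1]) / (2 * h)
    return dxu * dyv - dxv * dyu

u0, v0 = 0.3, 1.2
print(abs(jacobian_det(u0, v0) - math.exp(2 * u0)) < 1e-5)   # det Df = e^(2u)

# 2*pi-periodicity in v: the reason f has no global inverse.
p, q = f(u0, v0), f(u0, v0 + 2 * math.pi)
print(abs(p[0] - q[0]) < 1e-9 and abs(p[1] - q[1]) < 1e-9)
```

Any base point with (u0, v0) works here, since the determinant never vanishes.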

Example 5.13. Consider the following equations:

    x = f1(u, v) = u^3 − 3uv^2
    y = f2(u, v) = −v^3 + 3u^2 v


Since they are polynomials, f : R^2 → R^2 is C^1. Then we have

    ∂(x, y)/∂(u, v) = [ 3u^2 − 3v^2      −6uv      ]
                      [     6uv       −3v^2 + 3u^2 ]

    det(∂(x, y)/∂(u, v)) = (3u^2 − 3v^2)^2 + (6uv)^2.

Clearly, det(∂(x, y)/∂(u, v)) = 0 iff u = v = 0. So the Inverse Function Theorem holds for all (u0, v0) ≠ (0, 0), and we can solve for (u, v) = g(x, y) locally near any (u0, v0) ≠ (0, 0).

5.4 Practice problems

1. [MT, p. 210] Show that the equation x + y − z + cos(xyz) = 1 can be solved for z = g(x, y) near the origin. Find ∂g/∂x and ∂g/∂y at (0, 0).

2. [MT, p. 210] Show that xy + z + 3xz^5 = 4 is solvable for z as a function of (x, y) near (1, 0, 1). Compute ∂z/∂x and ∂z/∂y at (1, 0).

3. [MT, p. 210]

(a) Check directly where we can solve the equation

    F(x, y) = y^2 + y + 3x + 1 = 0

for y in terms of x.

(b) Check that your answer in part (a) agrees with the answer you expect from the implicit function theorem. Compute dy/dx.

4. [MT, p. 210] Repeat problem 3 with

    F(x, y) = xy^2 − 2y + x^2 + 2 = 0.

5. [MT, p. 210] Let F(x, y) = 0 define a curve in the xy-plane through the point (x0, y0), where F is C^1. Assume that (∂F/∂y)(x0, y0) ≠ 0. Show that this curve can be locally represented by the graph of a function y = g(x). Show that (i) the line orthogonal to ∇F(x0, y0) agrees with (ii) the tangent line to the graph of y = g(x).

6. [MT, p. 210] Show that x^3 z^2 − z^3 yx = 0 is solvable for z as a function of (x, y) near (1, 1, 1), but not near the origin. Compute ∂z/∂x and ∂z/∂y at (1, 1).

7. [MT, p. 210] Investigate whether or not the system

    u(x, y, z) = x + xyz
    v(x, y, z) = y + xy
    w(x, y, z) = z + 2x + 3z^2

can be solved for x, y, z in terms of u, v, w near (x, y, z) = (0, 0, 0).


8. [MT, p. 210] Let (x0, y0, z0) be a point of the locus defined by x^2 + xy − a = 0, z^2 + x^2 − y^2 − b = 0, where a and b are constants.

(a) Under what conditions may the part of the locus near (x0, y0, z0) be represented in the form x = f(z), y = g(z)?

(b) Compute f ′(z) and g′(z).

9. [MT, p. 211] Let F(x, y) = x^3 − y^2 and let C denote the level curve given by F(x, y) = 0.

(a) Without using the implicit function theorem, show that we can describe C as the graph of x as a function of y near any point.

(b) Show that Fx(0, 0) = 0. Does this contradict the implicit function theorem?

10. [MT, p. 211] Consider the system of equations

    x^5 v^2 + 2y^3 u = 3
    3yu − xuv^3 = 2.

Show that near the point (x, y, u, v) = (1, 1, 1, 1), this system defines u and v implicitly as functions of x and y. For such local functions u and v, define the local function f by f(x, y) = (u(x, y), v(x, y)). Find Df(1, 1).

11. [MT, p. 211] Consider the equations

    x^2 − y^2 − u^3 + v^2 + 4 = 0
    2xy + y^2 − 2u^2 + 3v^4 + 8 = 0.

(a) Show that these equations determine functions u(x, y) and v(x, y) near the point (x, y, u, v) = (2, −1, 2, 1).

(b) Compute ∂u/∂x at (x, y) = (2, −1).


Chapter 6

Taylor’s Theorem and critical points

6.1 Taylor’s Theorem in one dimension

Consider a one-dimensional function

    g : R → R,

which is C^(k+1), i.e. (k + 1) times continuously differentiable: each derivative

    d^j g/dt^j (t),  j = 1, 2, . . . , k + 1  (of order up to and including the (k + 1)st),

exists and is a continuous function (in some interval). Then we can approximate g(t) locally near t = a by a polynomial of degree k, the Taylor polynomial P_k(t):

    P_k(t) = g(a) + g′(a)(t − a) + (1/2!) g′′(a)(t − a)^2 + · · · + (1/k!) (d^k g/dt^k)(a) (t − a)^k

The Taylor coefficients are chosen to match g(t) up to the kth derivative at t = a:

    (d^j P_k/dt^j)(a) = (d^j g/dt^j)(a),  j = 0, 1, 2, . . . , k.

To review this material, look back to section 11.11 in Stewart [Stew]. There, the emphasis was on the Taylor series, and questions of its convergence, but here we are interested in truncating the series for the purpose of approximation. And in any question of approximation the error made (the remainder term) is of the highest importance.

For example, P_1(t) = g(a) + g′(a)(t − a) is the tangent line. Since we know that g is differentiable,

    lim_(t→a) |g(t) − P_1(t)| / |t − a| = 0,  or  g(t) = P_1(t) + o(|t − a|).

Theorem 6.1 (Taylor's Theorem, 1D version). Assume g : R → R is C^(k+1) in a neighborhood of t = a. Then, for each t, there exists c between a and t for which

    g(t) = P_k(t) + R_k(a, t),

with remainder R_k(a, t) = (1/(k+1)!) (d^(k+1) g/dt^(k+1))(c) (t − a)^(k+1).


Since we assumed that d^(k+1) g/dt^(k+1) is continuous, the remainder term is very small when t → a:

    lim_(t→a) R_k(a, t) / |t − a|^k = 0,

i.e. R_k(a, t) = o(|t − a|^k). So R_k(a, t) is small compared with each of the terms appearing in P_k(t).

Remark. Locally, g(t) is well approximated by its Taylor polynomial, but only near t = a.

Example 6.2. For g(t) = cos t at a = 0, we look at the 3rd order Taylor polynomial:

    g(0) = cos 0 = 1,  g′(0) = −sin 0 = 0,  g′′(0) = −cos 0 = −1,  g′′′(0) = sin 0 = 0.

Therefore, Taylor's Theorem implies that for t near 0,

    g(t) = cos t = 1 − 0·t − (1/2)t^2 + 0·t^3 + R_3(0, t) = 1 − (1/2)t^2 + o(t^3),

where the first four terms form P_3(t).

This tells us that cos t is approximately quadratic near a = 0, with an error which is very small compared to t^3 (it is of the order of t^4). This is local information, as the graph of cos t doesn't look in the least like an inverted parabola when t is large, since the cosine oscillates periodically between ±1 while the parabola diverges to −∞ as t → ±∞. (See Figure 6.1.)

Figure 6.1. Graphs of y = cos x and y = 1 − (1/2)x^2: the third degree Taylor expansion of cos x tells us that the curve is locally quadratic near x = 0.
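The claim that the error is of the order of t^4 can also be checked numerically; this small sketch (our code, not part of the notes) compares cos t with 1 − t^2/2 and watches the ratio of the error to t^4:

```python
import math

# Error in the quadratic Taylor approximation of cos t near 0.
# The Taylor series predicts err ~= t^4/24, i.e. err/t^4 -> 1/24 ~ 0.0417.
for t in (0.5, 0.1, 0.01):
    err = abs(math.cos(t) - (1 - t**2 / 2))
    print(t, err / t**4)   # ratios settle near 1/24 as t -> 0
```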

6.2 Taylor’s Theorem in higher dimensions

To obtain a version of Taylor's Theorem in several variables we apply the one-dimensional version in a careful way, by restricting our attention to each line, ~x = ~a + t~u, through ~a in direction ~u.

Assume f ∈ C^3 near ~a ∈ R^n. Let's sample f(~x) along a line running through ~a. Take a unit vector ~u, ‖~u‖ = 1, and the line

    ~l(t) = ~a + t~u,


that goes through ~a at t = 0 in the direction of ~u. Then we let

    g(t) = f(~l(t)) = f(~a + t~u),

so g : R → R. By the chain rule, if f is C^3 near ~a, then g is C^3 near t = 0, and so we may apply Taylor's Theorem to g to approximate its values near t = 0. In doing so, we will generate a Taylor expansion for f(~x) with ~x = ~a + t~u.

We calculate the terms in the Taylor polynomial for g, which has the form

    g(t) = g(0) + g′(0)(t − 0) + (1/2) g′′(0)(t − 0)^2 + R_2, with R_2 = o(t^2).

We begin with:

    g(0) = f(~a),
    g′(t) = Df(~a + t~u) ~l′(t) = Df(~a + t~u) ~u.

So g′(0) = Df(~a)~u = ∇f(~a) · ~u. Using the definition ~x = ~l(t) = ~a + t~u, we have t~u = ~x − ~a, and so the first derivative term may be rewritten as

    g′(0)(t − 0) = Df(~a)[~x − ~a].

Finding the second derivative is a little tricky. We write the first derivative out in its coordinates,

    g′(t) = Σ_(i=1)^n (∂f/∂x_i)(~a + t~u) u_i.

Then we use the Chain Rule in the form you learned last year to calculate

    g′′(t) = Σ_(i=1)^n d/dt [ (∂f/∂x_i)(~a + t~u) ] u_i = Σ_(i=1)^n Σ_(j=1)^n (∂^2 f/∂x_i ∂x_j)(~a + t~u) u_j u_i.

Therefore,

    g′′(0) = Σ_(i=1)^n Σ_(j=1)^n (∂^2 f/∂x_i ∂x_j)(~a) u_j u_i,

which is a quadratic form with matrix

    H(~a) = [ (∂^2 f/∂x_i ∂x_j)(~a) ]_(i,j=1,...,n),

the Hessian matrix of f at ~a. We will also write H(~x) = D^2 f(~x), as it is the matrix of all second partial derivatives of f. For f a C^2 function, f_(x_i x_j) = f_(x_j x_i), so H(~a) is a symmetric matrix. Again using t~u = ~x − ~a, we may then write

    g′′(0)(t − 0)^2 = t~u · H(~a) t~u = [~x − ~a] · H(~a)[~x − ~a].

In this equation, the matrix H(~a) multiplies [~x − ~a], and the result (a vector) is dotted with [~x − ~a].

Finally, the remainder term in the expansion of g is o(t^2), but notice that |t| = ‖~x − ~a‖, so the remainder is very small compared to the square distance, ‖~x − ~a‖^2.

We now have all the ingredients for the one-dimensional Taylor's Theorem for g, and so applying it we get the following theorem:


Theorem 6.3 (Second order Taylor approximation). Assume f : R^n → R is C^3 in a neighborhood of ~a. Then

    f(~x) = f(~a) + Df(~a)(~x − ~a) + (1/2)(~x − ~a) · H(~a)(~x − ~a) + R_2(~a, ~x),

where

    H(~a) = D^2 f(~a) = [ ∂^2 f/∂x_i ∂x_j ]_(i,j=1,2,...,n)

is the Hessian, and

    lim_(~x→~a) R_2(~a, ~x) / ‖~x − ~a‖^2 = 0,

i.e. R_2(~a, ~x) = o(‖~x − ~a‖^2).

Alternatively, we may write ~x = ~a + ~h, and then the second order Taylor approximation can be written as

    f(~a + ~h) = f(~a) + Df(~a)~h + (1/2) ~h · D^2 f(~a) ~h + R_2(~a, ~h),

with

    lim_(~h→~0) R_2(~a, ~h) / ‖~h‖^2 = 0.

We will find this form to be more useful at times in the next section.
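To see the remainder estimate in action, here is a small numerical sketch on a test function of our own choosing (not from the notes): f(x, y) = e^x cos y at ~a = (0, 0), where Df(~a) = [1 0] and D^2 f(~a) = [[1, 0], [0, −1]]. The ratio R_2/‖~h‖^2 shrinks as ‖~h‖ → 0, exactly as the theorem predicts:

```python
import math

def f(x, y):
    return math.exp(x) * math.cos(y)

def taylor2(h1, h2):
    # f(a) + Df(a)h + (1/2) h . D2f(a) h  with a = (0,0)
    return 1.0 + h1 + 0.5 * (h1 ** 2 - h2 ** 2)

for s in (1.0, 0.1, 0.01):
    h1, h2 = 0.3 * s, 0.4 * s
    ratio = abs(f(h1, h2) - taylor2(h1, h2)) / (h1 ** 2 + h2 ** 2)
    print(s, ratio)   # the ratio tends to 0 with s
```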

Example 6.4. Find the second order Taylor polynomial of the function

    f(x, y) = cos(xy^2)

near ~a = (π, 1).

First, we compute the derivatives:

    f(~a) = f(π, 1) = cos(π) = −1,
    ∂f/∂x = −y^2 sin(xy^2),
    ∂f/∂y = −2xy sin(xy^2),
    ∂^2 f/∂x^2 = −y^4 cos(xy^2),
    ∂^2 f/∂x∂y = −2y sin(xy^2) − 2xy^3 cos(xy^2),
    ∂^2 f/∂y^2 = −2x sin(xy^2) − 4x^2 y^2 cos(xy^2).

Then, at ~a = (π, 1), we find that

    Df(~a) = [ 0  0 ],    D^2 f(~a) = [ 1    2π   ]
                                      [ 2π  4π^2 ]


So we have

    f(~a + ~h) = −1 + (1/2) ~h · [ 1    2π   ] ~h + R_2
                                 [ 2π  4π^2 ]

    f(π + h1, 1 + h2) = −1 + (1/2) [h1, h2] · [h1 + 2π h2, 2π h1 + 4π^2 h2]
                      = −1 + (1/2)(h1^2 + 4π h1 h2 + 4π^2 h2^2) + o(‖~h‖^2).

In terms of a point ~x (near ~a), we can write ~x = ~a + ~h, so ~h = ~x − ~a, and then

    cos(xy^2) = −1 + (1/2)((x − a1)^2 + 4π(x − a1)(y − a2) + 4π^2 (y − a2)^2) + R_2,

and R_2 = o(‖~x − ~a‖^2). The advantage of the f(~a + ~h) form is that it is easier to guess the behaviour of f(~x) near ~x = ~a. This is what we will do in the next section, to classify critical points as maxima and minima!
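The derivative computations above can be double-checked by finite differences. This sketch (our code, not part of the notes) recovers D^2 f(π, 1) = [[1, 2π], [2π, 4π^2]] numerically:

```python
import math

def f(x, y):
    return math.cos(x * y * y)

def hessian(x, y, h=1e-4):
    # Standard second-order central difference stencils.
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return fxx, fxy, fyy

fxx, fxy, fyy = hessian(math.pi, 1.0)
print(abs(fxx - 1) < 1e-3, abs(fxy - 2 * math.pi) < 1e-3,
      abs(fyy - 4 * math.pi ** 2) < 1e-3)
```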

6.3 Local minima/maxima

This is exactly the same definition as in Stewart [Stew], only generalized to R^n. (So if you were paying attention to the definitions last year, this is old news!)

Definition 6.5. We say ~a is a local minimum for f if there exists an open disk D_r(~a) for which

    f(~a) ≤ f(~x)

for all ~x ∈ D_r(~a). ~a is a strict local minimum if

    f(~a) < f(~x)

for all ~x ≠ ~a, ~x ∈ D_r(~a).

Definition 6.6. We say ~a is a local maximum for f if ∃ r > 0 with f(~a) ≥ f(~x) for all ~x ∈ D_r(~a). ~a is a strict local max if f(~a) > f(~x) for all ~x ∈ D_r(~a) \ {~a}.

Example 6.7. As you have seen in your previous multivariable course, f(x, y) = x^2 + y^2 has a local minimum (in fact, the global minimum) at (x, y) = (0, 0), and g(x, y) = −x^2 − y^2 has a local maximum (in fact, the global maximum) at (x, y) = (0, 0). Verifying this directly is not hard, since a sum of squares is always non-negative, and so

    f(0, 0) = 0 ≤ x^2 + y^2 = f(x, y), for all (x, y) ∈ R^2.

We leave the verification for g(x, y) as an exercise.

Note that if f is differentiable, we have a necessary condition for local maxima and minima: they can only occur at a critical point of f:

Theorem 6.8. If f has a local maximum or minimum at ~a and is differentiable at ~a, then Df(~a) = ~0.


Proof. Again, we start by restricting to a line through ~a:

    g(t) = f(~a + t~u),

where ~u is a unit vector. If f has a local minimum at ~a, then

    g(0) = f(~a) ≤ f(~a + t~u) = g(t)

for all t with |t| < r. So g(t) has a local minimum at t = 0. By a calculus theorem, g′(0) = 0. But

    0 = g′(0) = Df(~a)~u

for all ~u. Then, by taking ~u = ~e1, ~e2, . . . , ~en, we get

    (∂f/∂x_j)(~a) = 0

for each j = 1, 2, . . . , n. Therefore, Df(~a) = 0.

Definition 6.9. An ~a for which Df(~a) = 0 is called a critical point.

Example 6.10. In Example 6.4, ~a = (π, 1) was a critical point.

Recall that not all critical points must be local maxima or minima, even in one dimension. But in higher dimensions the possible variety of critical point behavior is much larger. We single out a class of critical points which are neither maxima nor minima:

Definition 6.11. We say ~a is a saddle point for f if it is a critical point and there exist two linearly independent vectors ~u and ~v so that

• Restricted to the line in direction ~u, f(~a+ t~u) has a strict local minimum at t = 0;

• Restricted to the line in direction ~v, f(~a+ t~v) has a strict local maximum at t = 0.

Example 6.12. The function h(x, y) = x2− y2 has a saddle point at (0, 0). It is minimizedalong direction ~u = (1, 0), and maximized along direction ~v = (0, 1).

Now, we want to combine Taylor's Theorem and linear algebra to classify critical points as local minima, maxima, or other. Taylor's Theorem states that for ~x = ~a + ~h, if ‖~h‖ is small,

    f(~x) = f(~a + ~h) = f(~a) + Df(~a)~h + (1/2) ~h · D^2 f(~a) ~h + R_2(~a, ~h),   (6.1)

where Df(~a)~h = 0 at a critical point, the middle term is a quadratic form, and R_2(~a, ~h) = o(‖~h‖^2).

So we expect the behaviour of f(~x) near ~a to be determined by the quadratic term.

Recall that the Hessian matrix, M = D^2 f(~a), is a symmetric matrix. This allows us to apply the following theorem:

Theorem 6.13 (The Spectral Theorem). Assume M is a symmetric (n × n) matrix. Then:

• All eigenvalues of M are real: λ_i ∈ R for all i = 1, 2, . . . , n.

• There is an orthonormal basis {~u1, ~u2, . . . , ~un} of R^n composed of eigenvectors of M:

    M~u_i = λ_i ~u_i,  ‖~u_i‖ = 1,  ~u_i · ~u_j = 0 for i ≠ j.

• In the basis of eigenvectors, M is a diagonal matrix. In other words, if we let U be the matrix whose columns are the ~u_i, then

    MU = UΛ,

where Λ = diag(λ1, λ2, . . . , λn).

Remark. Note that since the eigenvalues are real, they can be ordered, smallest to largest:

    λ1 ≤ λ2 ≤ · · · ≤ λn.

However, they need not be distinct. (For example, the identity matrix I has each λ_j = 1.)

Written in the orthonormal basis of eigenvectors, the quadratic form ~h · M~h has an easy expression. First, since the ~u_i form a basis, any vector may be expressed as a linear combination,

    ~h = Σ_(i=1)^n c_i ~u_i,

with coefficients c_i ∈ R, i = 1, 2, . . . , n. Notice that

    ~h · ~u_j = Σ_(i=1)^n c_i ~u_i · ~u_j = c_j,   (6.2)

since ~u_i · ~u_j = 0 if i ≠ j and ~u_i · ~u_j = 1 if i = j. In particular, we can evaluate the norm of ~h in terms of the coefficients c_i:

    ‖~h‖^2 = ~h · ~h = ~h · (Σ_(j=1)^n c_j ~u_j) = Σ_(j=1)^n c_j (~h · ~u_j) = Σ_(j=1)^n c_j^2,   (6.3)

using (6.2).

To evaluate the quadratic form we make a similar calculation, using the orthogonality of the normalized eigenvectors:

    ~h · M~h = ~h · Σ_(i=1)^n c_i M~u_i = ~h · Σ_(i=1)^n λ_i c_i ~u_i = Σ_(i=1)^n λ_i c_i (~h · ~u_i) = Σ_(i=1)^n λ_i c_i^2.   (6.4)


Note the similarity of the two forms (6.3) and (6.4). The fundamental difference is that the sum in (6.4) is weighted by the eigenvalues λ_i. Using these calculations, we prove the following fact about quadratic forms and eigenvalues:

Theorem 6.14. Suppose M is a symmetric matrix with eigenvalues λ1 ≤ λ2 ≤ · · · ≤ λn. Then, for any vector ~h ∈ R^n,

    λ1 ‖~h‖^2 ≤ ~h · M~h ≤ λn ‖~h‖^2.

Proof. First, we apply (6.4) and use the fact that λn ≥ λ_i for any i, and so:

    ~h · M~h = Σ_(i=1)^n λ_i c_i^2 ≤ Σ_(i=1)^n λn c_i^2 = λn Σ_(i=1)^n c_i^2 = λn ‖~h‖^2,

by (6.3), which proves the right-hand inequality. For the left-hand side, we use the fact that λ1 ≤ λ_i for all i, so

    ~h · M~h = Σ_(i=1)^n λ_i c_i^2 ≥ Σ_(i=1)^n λ1 c_i^2 = λ1 ‖~h‖^2.

This proves both sides of the inequality.
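Theorem 6.14 is also easy to test empirically. This sketch (using numpy, our choice of tool) samples random vectors ~h against a random symmetric matrix and checks both bounds:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
M = (A + A.T) / 2                   # symmetrize to get a symmetric M
lam = np.linalg.eigvalsh(M)         # eigenvalues, sorted ascending

ok = True
for _ in range(200):
    h = rng.standard_normal(4)
    q, n2 = h @ M @ h, h @ h
    # lam[0]*||h||^2 <= h.Mh <= lam[-1]*||h||^2, up to rounding
    ok = ok and (lam[0] * n2 - 1e-10 <= q <= lam[-1] * n2 + 1e-10)
print(ok)
```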

Now we apply this idea to the Hessian via Taylor’s Theorem to get the following theorem:

Theorem 6.15 (Second derivative test). Suppose f is C^3 in a neighborhood of a critical point ~a. Let λ1 ≤ λ2 ≤ · · · ≤ λn be the eigenvalues of D^2 f(~a). Then,

(1) If all eigenvalues are positive, then ~a is a strict local minimum of f .

(2) If all eigenvalues are negative, then ~a is a strict local maximum of f .

(3) If D2f(~a) has at least one positive and at least one negative eigenvalue, then ~a is asaddle point.

We note that if there are zero eigenvalues we might still have a local max or a local min; in that case, the behavior of f(~x) near the critical point will be determined by higher order terms in the Taylor expansion. A matrix M with zero eigenvalues is called degenerate, as is a critical point ~a whose Hessian D^2 f(~a) has one or more zero eigenvalues.

Proof. We prove (1); parts (2) and (3) will be for you to do as exercises! So assume ~a is a critical point for f, and the Hessian D^2 f(~a) has all strictly positive eigenvalues, 0 < λ1 ≤ λ2 ≤ · · · ≤ λn.


The idea is to show that the quadratic term is much more important than the error term when we're close to ~a. Since R_2(~a, ~h) = o(‖~h‖^2), there exists r > 0 for which

    |R_2(~a, ~h)| / ‖~h‖^2 < (1/4)λ1, whenever ‖~h‖ < r.

In other words, |R_2(~a, ~h)| < (1/4)λ1 ‖~h‖^2 when ‖~h‖ < r. (In fact, what we'll need here is R_2(~a, ~h) > −(1/4)λ1 ‖~h‖^2, so the remainder term isn't very negative.)

For any ~x ∈ D_r(~a) \ {~a}, let ~h = ~x − ~a, so ‖~h‖ < r, and by (6.1) and Theorem 6.14,

    f(~x) = f(~a + ~h) = f(~a) + 0 + (1/2) ~h · D^2 f(~a) ~h + R_2(~a, ~h)
          ≥ f(~a) + (1/2)λ1 ‖~h‖^2 − (1/4)λ1 ‖~h‖^2
          > f(~a),

which shows that ~a is a strict local minimum.

[To prove (2), do the same thing but choose your r > 0 based on the remainder being smaller (in absolute value) than (1/4)|λn| ‖~h‖^2. To do (3), look at points near ~a along an eigenvector direction, ~h = t~u_i, and apply the same idea as in (1) or (2), depending on whether λ_i is positive or negative.]

Are conditions (1) and (2) necessary conditions to have a local minimum or a local maximum? Not quite (see Example 6.18 below!) but almost:

Theorem 6.16. Suppose f : Rn → R is a C3 function.

(i) If ~a is a local minimum of f, then every eigenvalue of D^2 f(~a) satisfies λ_i ≥ 0, i = 1, . . . , n;

(ii) If ~a is a local maximum of f, then every eigenvalue of D^2 f(~a) satisfies λ_i ≤ 0, i = 1, . . . , n.

So at a local minimum we have no negative eigenvalues, and at a local maximum there are no positive eigenvalues of the Hessian.

Proof. We only show (i); (ii) is similar (but upside down!) Suppose ~a is a local minimizer of f, but (for a contradiction) assume λ1 < 0. Let ~u be the corresponding eigenvector, D^2 f(~a)~u = λ1 ~u, with ‖~u‖ = 1. Then Df(~a) = 0, and the quadratic form ~u · D^2 f(~a)~u = λ1 < 0, so applying Taylor's Theorem,

    f(~a + t~u) = f(~a) + 0 + (1/2) t~u · D^2 f(~a)(t~u) + o(‖t~u‖^2)
                = f(~a) + (t^2 λ1)/2 + o(t^2)
                < f(~a),

for all sufficiently small |t| (i.e., for |t| < r where r is chosen so that |R_2|/t^2 < (1/4)|λ1|). So ~a cannot be a local minimum, which contradicts our initial assumption. In conclusion, λ1 ≥ 0, so all the eigenvalues are nonnegative.


Example 6.17. Consider

    f(x, y, z) = x^3 − 3xy + y^3 + cos z.

Find all critical points and classify them using the Hessian.

First, we calculate the gradient:

    ∂f/∂x = 3x^2 − 3y
    ∂f/∂y = −3x + 3y^2
    ∂f/∂z = −sin z

Critical points are defined by ∇f(~a) = ~0, so we get the following critical points:

    (x, y, z) = (0, 0, nπ) and (1, 1, nπ),

where n ∈ Z.

Then we want to compute the Hessian at each point:

    D^2 f(~x) = [ 6x  −3     0    ]
                [ −3  6y     0    ]
                [  0   0  −cos z ]

Notice that at (0, 0, nπ), the Hessian depends on whether n is even or odd:

    D^2 f(0, 0, 2kπ) = [ 0  −3   0 ]      D^2 f(0, 0, (2k + 1)π) = [ 0  −3  0 ]
                       [ −3  0   0 ]                               [ −3  0  0 ]
                       [ 0   0  −1 ]                               [ 0   0  1 ]

When n is even, we find the eigenvalues of the Hessian are

    λ = −3, −1, 3,

so we get a saddle at (0, 0, 2kπ), k ∈ Z. Similarly, when n is odd, we find that its eigenvalues are

    λ = −3, 1, 3,

which is also a saddle. Thus, we get a saddle at (0, 0, nπ) for all n ∈ Z.

At (1, 1, nπ), we get

    D^2 f(1, 1, nπ) = [ 6  −3       0      ]
                      [ −3  6       0      ]
                      [ 0   0  (−1)^(n+1) ]

By observation, we find that ~e3 = (0, 0, 1)^T is an eigenvector with λ = (−1)^(n+1). Then the two remaining eigenvalues are the eigenvalues of the 2 × 2 submatrix [ 6 −3 ; −3 6 ]. Since its trace is 12 and its determinant is 27, its characteristic equation is given by

    λ^2 − 12λ + 27 = 0.


So we find that the two other eigenvalues are λ = 3, 9. Therefore, (1, 1, (2k + 1)π) is a local minimum, and the points (1, 1, 2kπ) are saddles.
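The eigenvalue computations in this example can be confirmed with numpy (a quick sketch; the matrix names are ours):

```python
import numpy as np

H_saddle = np.array([[0.0, -3.0, 0.0],
                     [-3.0, 0.0, 0.0],
                     [0.0, 0.0, -1.0]])   # D^2 f at (0, 0, 2k*pi)
H_min = np.array([[6.0, -3.0, 0.0],
                  [-3.0, 6.0, 0.0],
                  [0.0, 0.0, 1.0]])       # D^2 f at (1, 1, (2k+1)*pi)

ev_saddle = np.linalg.eigvalsh(H_saddle)  # mixed signs -> saddle
ev_min = np.linalg.eigvalsh(H_min)        # all positive -> strict local minimum
print(ev_saddle, ev_min)
```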

Remark. When D^2 f(~a) has zero as an eigenvalue, things can get complicated. For example, if λ_i ≥ 0 for all i, you could still have a local minimum. In this case, the behaviour is determined by higher order terms in the Taylor series. We call such an ~a a degenerate critical point.

Example 6.18. Consider

    f(x, y) = x^2 + y^4.

We find that

    ∇f(x, y) = [  2x  ]
               [ 4y^3 ]

so we get only one critical point, (x, y) = (0, 0). Notice that

    D^2 f(x, y) = [ 2    0    ]    so    D^2 f(0, 0) = [ 2  0 ]
                  [ 0  12y^2 ]                         [ 0  0 ]

So we get λ = 2, 0. Since the quadratic term doesn't dominate the remainder, we call this a degenerate case.

Still, f(0, 0) < f(x, y) for all (x, y) ≠ (0, 0), so it is a strict local minimum even if the Hessian test doesn't tell us so.

Example 6.19. Consider

    g(x, y) = x^2 − y^4.

This has the same second order Taylor expansion as the previous example but has a different remainder, R_2 = −y^4. This is a degenerate saddle.

Notice that, for the converse, eigenvalues don't have to be strictly larger or smaller than 0. In other words, if ~a is a local minimum, then ~a is a critical point and all the eigenvalues of D^2 f(~a) must be greater than or equal to 0 (not necessarily strictly greater than 0).

Where does the 2nd derivative test in Stewart come from?

For f : R^2 → R, we're looking at the eigenvalues of a 2 × 2 matrix M = D^2 f(x0, y0). Call

    a = f_xx(x0, y0), b = f_xy(x0, y0) = f_yx(x0, y0), c = f_yy(x0, y0),

so the eigenvalue equation is

    0 = det [ a − λ    b   ] = λ^2 − (a + c)λ + (ac − b^2) = λ^2 − Tλ + D,
            [   b    c − λ ]

where T = a + c is the trace and D = ac − b^2 the determinant of M. If the two eigenvalues are λ1, λ2, then those are the roots of this polynomial, so it factors as

    λ^2 − Tλ + D = (λ − λ1)(λ − λ2).

Multiplying out the product and matching powers of λ,

    D = λ1 λ2,  T = λ1 + λ2.


So D > 0 if and only if both eigenvalues have the same sign (and neither is zero). Note also that if D > 0, then ac > b^2 ≥ 0, so a and c have the same sign. Since T = a + c = λ1 + λ2, each of T, a, c, λ1, λ2 must have the same sign. In conclusion, if D > 0 and a = f_xx(x0, y0) > 0, then both eigenvalues λ1, λ2 > 0, and we get a local minimum. If D > 0 and a = f_xx(x0, y0) < 0, then both eigenvalues λ1, λ2 < 0, and we get a local maximum. If D < 0, then the product of the eigenvalues is negative, so they have opposite signs, and (x0, y0) is a saddle point. (The case D = 0 is the degenerate case.)
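A quick numerical check of the trace/determinant bookkeeping above, on a sample symmetric matrix of our own choosing:

```python
import numpy as np

M = np.array([[4.0, 1.0], [1.0, 3.0]])    # a = 4, b = 1, c = 3
D, T = np.linalg.det(M), np.trace(M)      # D = ac - b^2 = 11, T = a + c = 7
lam = np.linalg.eigvalsh(M)

# D is the product and T the sum of the eigenvalues:
print(abs(D - lam[0] * lam[1]) < 1e-10, abs(T - lam.sum()) < 1e-10)
# D > 0 and a > 0: both eigenvalues positive, the local-minimum case.
print(D > 0 and M[0, 0] > 0 and (lam > 0).all())
```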

6.4 Practice problems

1. Assume f : R^n → R is C^3, ~a is a critical point of f, and the Hessian is strictly negative definite: ~h · D^2 f(~a)~h < 0 for all ~h ≠ ~0. Prove that ~a is a strict local maximum of f, that is: there exists r > 0 so that f(~x) < f(~a) for all ~x ∈ D_r(~a) \ {~a}.

2. For each function, find all critical points and use the Hessian to determine whether they are local maxima, minima, or saddle points.

(a) f(x, y, z) = x − 2 sin x − 3yz

(b) g(x, y, z) = cosh x + 4yz − 2y^2 − z^4

(c) u(x, y, z) = (x − z)^4 − x^2 + y^2 + 6xz − z^2

3. Let M be a positive definite, symmetric (n × n) matrix. Show that

    λn = max { ‖M~x‖ / ‖~x‖ : ~x ≠ ~0 }

is the largest eigenvalue of the matrix M.

[Hint: do it in two steps. First, show ‖M~x‖/‖~x‖ ≤ λn for all ~x ≠ ~0; then find a particular ~x for which equality holds.]

4. [MT, p. 166] Determine the second-order Taylor formula for the given function about the given point (x0, y0).

(a) f(x, y) = e^(x+y), where x0 = 0, y0 = 0.

(b) f(x, y) = sin(xy) + cos(x), where x0 = 0, y0 = 0.

5. [MT, p. 166] Calculate the second-order Taylor approximation to f(x, y) = cos x sin y at the point (π, π/2).


Chapter 7

Calculus of Variations

7.1 One-dimensional problems

The calculus of variations studies extremal (minimization or maximization) problems where the quantity to be optimized is a function, and not just a number or a vector in finite dimensional space. A typical set-up is to consider all possible C^1 functions u : [a, b] ⊂ R → R connecting given fixed endpoints P1 = (a, c) and P2 = (b, d), that is,

u(a) = c, u(b) = d. (7.1)

Denote by A the collection of all such u ∈ C^1([a, b]) satisfying the endpoint condition (or boundary condition) given above in (7.1). We then try to find u ∈ A which minimizes (or maximizes, or is a critical point of) an integral expression,

    I(u) = ∫_a^b F(u′(x), u(x), x) dx,   (7.2)

where F : R^3 → R is a given C^2 function. [It will be convenient to write F(p, u, x), with the letter p holding the place of the derivative u′.] We call such an I : A → R a functional; it is a real-valued function whose domain is a set of functions!

Whether we can actually find an extremum for I, and what properties it has, depends strongly on the form of the function F. Let's start with some common examples:

Example 7.1. Set [a, b] ⊂ R and choose values c, d ∈ R. Consider all C^2 functions u(x) joining P1 = (a, c) and P2 = (b, d), i.e. u(a) = c and u(b) = d. Among all C^2 curves u(x) connecting P1 to P2, find the one with shortest arclength.

Let ~c(t) = (t, u(t)). Then we have

    ‖~c′(t)‖ = √(1 + u′(t)^2).

This allows us to compute the arc length:

    I(u) = ∫_a^b √(1 + u′(x)^2) dx.

Now call A = {u : [a, b] → R | u ∈ C^2, u(a) = c, u(b) = d}. Then I : A → R is a function of functions, or functional. We want to minimize I(u) over all u ∈ A.
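Before doing any calculus of variations, we can gather numerical evidence that the straight line wins: discretize I(u) and compare it against a perturbed competitor. A sketch (the endpoint choice (0, 0) to (1, 1) and the helper names are ours, for illustration only):

```python
import math

def arclength(u_prime, n=2000):
    # Midpoint rule for I(u) = integral of sqrt(1 + u'(x)^2) over [0, 1].
    hstep = 1.0 / n
    return sum(math.sqrt(1 + u_prime((i + 0.5) * hstep) ** 2) * hstep
               for i in range(n))

line = arclength(lambda x: 1.0)                                    # u(x) = x
bent = arclength(lambda x: 1.0 + math.pi * math.cos(math.pi * x))  # u(x) = x + sin(pi*x)

# The straight line has length sqrt(2); the bent curve (same endpoints) is longer.
print(abs(line - math.sqrt(2)) < 1e-6, bent > line)
```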


Example 7.2. For the same class A of functions, take u ∈ A and rotate its graph around the x-axis, creating a surface of revolution. We would then like to choose u(x) for which the surface of revolution has the least area.

This surface can be parametrized using cylindrical coordinates around the x-axis, r = √(y^2 + z^2) and θ measured in the (y, z) plane, by r = u(x), x ∈ [a, b]. The surface area is then given by

then given by

    I(u) = ∫_a^b 2π u(x) √(1 + (u′(x))^2) dx,

corresponding to F(p, u, x) = 2πu √(1 + p^2). The surface of revolution determined by a u which minimizes I(u) is called a minimal surface.

Example 7.3. Same setup: we fix two endpoints P1, P2 and look at all graphs y = u(x) connecting them, but now we imagine that the graph describes a track with a marble rolling down it, and we want to choose the graph y = u(x) so that the marble spends the least time rolling downhill from one point to the other. To make things a little easier, we take P1 = (0, 0), orient the y-axis downward, and let P2 = (b, d) with b, d > 0 (so it's below the x-axis and to the right of P1). Doing a little mechanics (see the chapter in Simmons' book), the total time taken in rolling downhill is:

    I(u) = ∫_0^b √(1 + (u′(x))^2) / √(2g u(x)) dx,  u ∈ A,   (7.3)

where g is the (constant) acceleration due to gravity. So this fits the standard pattern, with F(p, u, x) = √(1 + p^2) / √(2gu). [You may notice that the denominator vanishes when u(x) = 0, but we won't let it worry us, although it should.]

This is one of the oldest variational problems, the "Brachistochrone" problem, posed by Johann Bernoulli in 1696 in a journal as a challenge to other mathematicians to solve (partly to show off the fact that he knew the solution, and to embarrass his brother, Jacob). In the end, it was solved by at least four others: Jacob Bernoulli, Newton, L'Hôpital, and Leibniz.

The First Variation. We use the analogy to finding critical points of a function f : R^n → R. At a local extremum ~a of f, we know that the one-dimensional restrictions g(t) = f(~a + t~v) must have critical points at t = 0, g′(0) = 0. The Chain Rule then asserts

    0 = g′(0) = Df(~a)~v, for all ~v ∈ R^n.

That is, all directional derivatives of f at ~a vanish, for all directions ~v. In particular, the partial derivatives of f, which are the directional derivatives with ~v = ~e_j, the standard basis vectors, are all zero, so the matrix Df(~a) is the zero matrix, and f has a critical point at ~a.

Unfortunately, there is no clear meaning of "partial derivative" for I(u) in (7.2), but we can use the idea of one-dimensional reduction as above to make sense of the derivative DI(u) via directional derivatives. To do this, we introduce the idea of a variation of u ∈ A. As above, we want to look at I(u + tv) and compare its value to that of I(u). To do that, we need u(x) + tv(x) to be C^1([a, b]) and have the same fixed endpoints (7.1) as u(x). This is accomplished by choosing variations v which are C^1 and vanish at the endpoints. We write v ∈ A0, with

    A0 = { v ∈ C^1([a, b]) | v(a) = 0 = v(b) }.   (7.4)


Figure 7.1. An admissible curve y = u(x), u ∈ A, connecting P1 = (a, c) to P2 = (b, d). For v ∈ A0 (vanishing at both x = a and x = b), we create the variations u + tv.

So then we will always have u + tv ∈ A for all t. See Figure 7.1.

Now we proceed as in the vector case, and look at g(t) = I(u + tv) for any v ∈ A0; if u is an extremal of I, then g′(0) = 0 for all variations v ∈ A0, that is,

    0 = (d/dt) I(u + tv) |_(t=0).   (7.5)

Let's see how this works in Example 7.1, the shortest arclength problem. Since I(u) = ∫_a^b √(1 + (u′(x))^2) dx, we calculate:

    0 = (d/dt) I(u + tv) |_(t=0)
      = (d/dt) ∫_a^b √(1 + (u′(x) + tv′(x))^2) dx |_(t=0)
      = ∫_a^b (d/dt) [ √(1 + (u′(x) + tv′(x))^2) ] dx |_(t=0)
      = ∫_a^b (u′(x) + tv′(x)) v′(x) / √(1 + (u′(x) + tv′(x))^2) dx |_(t=0)
      = ∫_a^b u′(x) v′(x) / √(1 + (u′(x))^2) dx,

for every v ∈ A0. Compare with the vector case: this is supposed to be the analogue of theequation 0 = Df(~a)~v. But in the above integral, we have v′(x) rather than just v(x). We


fix this by integrating by parts: for every v ∈ A0,

\[
\begin{aligned}
0 &= \int_a^b \frac{u'(x)}{\sqrt{1+(u'(x))^2}}\,v'(x)\,dx \\
&= \frac{u'(x)}{\sqrt{1+(u'(x))^2}}\,v(x)\bigg|_a^b
- \int_a^b \frac{d}{dx}\bigg[\frac{u'(x)}{\sqrt{1+(u'(x))^2}}\bigg]\,v(x)\,dx \\
&= -\int_a^b \frac{d}{dx}\bigg[\frac{u'(x)}{\sqrt{1+(u'(x))^2}}\bigg]\,v(x)\,dx,
\end{aligned}
\]

since v ∈ A0 implies that at the endpoints, v(a) = 0 = v(b). Now, an integral can be zero without the integrand being the zero function, but here we are saying that the integral vanishes for every v ∈ A0, and this can only happen when the integrand is flat zero for all x ∈ [a, b]:

Lemma 7.4 (The Fundamental Lemma of the Calculus of Variations). Assume h is continuous on [a, b], and for every v ∈ A0 we have \int_a^b h(x)\,v(x)\,dx = 0. Then h(x) = 0 for all x ∈ [a, b].

Accepting the FLCoV for the moment, let’s continue with the Example. We may then conclude that the integrand is always zero, and obtain a 2nd order ODE, the Euler–Lagrange equation for the functional I,

\[
\frac{d}{dx}\bigg[\frac{u'(x)}{\sqrt{1+(u'(x))^2}}\bigg] = 0.
\]

This may be solved by integration: first, u'(x)/\sqrt{1+(u'(x))^2} = C for some constant of integration C. After some algebra, u'(x) = \pm C/\sqrt{1-C^2} = C_1, another constant. So y = u(x) = C1 x + C2, a linear function. (The values of C1, C2 may be chosen to match the endpoints (7.1).) So we’ve reconfirmed that the shortest distance in the plane between two points is a straight line segment!
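As a quick numerical sanity check (an added illustration, not part of the notes): among curves joining two fixed points, the straight line should give the least value of the arclength functional. Here I(u) is approximated by a Riemann sum with finite-difference slopes; the variation v(x) = x(1−x) is one hypothetical choice vanishing at the endpoints.

```python
import math

# Approximate I(u) = ∫_0^1 sqrt(1 + (u')^2) dx by a Riemann sum with
# finite-difference slopes, for curves joining (0,0) to (1,1).
N = 10_000
dx = 1.0 / N

def arclength(u):
    total = 0.0
    for i in range(N):
        x0, x1 = i * dx, (i + 1) * dx
        slope = (u(x1) - u(x0)) / dx
        total += math.sqrt(1.0 + slope * slope) * dx
    return total

line = lambda x: x                  # the extremal: u(x) = C1*x + C2
v = lambda x: x * (1.0 - x)         # a variation vanishing at the endpoints

L_line = arclength(line)
L_pert = arclength(lambda x: line(x) + 0.5 * v(x))
print(L_line, L_pert)  # the straight line is shorter, close to sqrt(2)
```

The perturbed curve has strictly larger arclength, consistent with the straight line being the minimizer.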

Proof of FLCoV. Suppose (for a contradiction) that there is a point x0 with h(x0) ≠ 0. By considering −h(x) instead if necessary, we can assume h(x0) > 0. Since h is continuous, there is an open interval (α, β) around x0 on which h(x) > 0. Now choose v to be a C1 function with v(x) > 0 in (α, β) and v(x) = 0 outside that interval. An example of such a function is:

\[
v(x) = \begin{cases} (x-\alpha)^2(\beta-x)^2, & \text{if } x\in(\alpha,\beta),\\ 0, & \text{otherwise.}\end{cases}
\]

Such a function is depicted in Figure 7.2. This v ∈ A0, and h(x)v(x) > 0 in (α, β) and vanishes outside. Therefore, there is positive area under the graph, and so

\[
0 = \int_a^b h(x)\,v(x)\,dx = \int_\alpha^\beta h(x)\,v(x)\,dx > 0,
\]

which is a contradiction. What went wrong? We assumed h was not always zero; this proves that h(x) = 0 for all x ∈ [a, b].


Figure 7.2. One of the functions v as described in the proof of Lemma 7.4, which is positive in (−1, 1) and zero outside of that interval. It attaches to zero smoothly; the derivatives v′(±1) = 0, and so v ∈ C1.

Before doing the other two examples, let’s develop a general formula for the first variation and Euler–Lagrange equations for I of the form (7.2). We proceed as above:

\[
\begin{aligned}
0 &= \frac{d}{dt} I(u+tv)\bigg|_{t=0}
= \frac{d}{dt}\int_a^b F(u'(x)+tv'(x),\,u(x)+tv(x),\,x)\,dx\,\bigg|_{t=0}\\
&= \int_a^b \frac{d}{dt}\big[F(u'(x)+tv'(x),\,u(x)+tv(x),\,x)\big]\,dx\,\bigg|_{t=0}\\
&= \int_a^b \Big[\frac{\partial F}{\partial p}(u'+tv',u+tv,x)\,v'(x)
+ \frac{\partial F}{\partial u}(u'+tv',u+tv,x)\,v(x)\Big]\,dx\,\bigg|_{t=0}\\
&= \int_a^b \Big[\frac{\partial F}{\partial p}(u'(x),u(x),x)\,v'(x)
+ \frac{\partial F}{\partial u}(u'(x),u(x),x)\,v(x)\Big]\,dx\\
&= \frac{\partial F}{\partial p}(u'(x),u(x),x)\,v(x)\bigg|_a^b
- \int_a^b \frac{d}{dx}\Big(\frac{\partial F}{\partial p}(u',u,x)\Big)v(x)\,dx
+ \int_a^b \frac{\partial F}{\partial u}(u',u,x)\,v(x)\,dx\\
&= \int_a^b \Big[-\frac{d}{dx}\Big(\frac{\partial F}{\partial p}(u'(x),u(x),x)\Big)
+ \frac{\partial F}{\partial u}(u'(x),u(x),x)\Big]\,v(x)\,dx,
\end{aligned}
\]

for all v ∈ A0. Note that we integrated the first term by parts, and used v(a) = 0 = v(b) to eliminate the term coming from the endpoints, as in the first example. Applying the FLCoV, we conclude that the integrand must always be zero, and obtain the Euler–Lagrange equation,

\[
\frac{d}{dx}\Big(\frac{\partial F}{\partial p}(u'(x),u(x),x)\Big)
- \frac{\partial F}{\partial u}(u'(x),u(x),x) = 0, \tag{7.6}
\]

which is a 2nd order ODE, and usually a nonlinear one!! This equation may be very difficult


to solve, as we’ll see in Example 7.3. In Example 7.1, F depended only on p = u′(x), and in that case it was easy to solve by direct integration. (In fact, if F = F(p) is independent of u and x, the solutions are always straight lines – check it yourself as an exercise!)
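The Euler–Lagrange recipe (7.6) can be spot-checked symbolically. The sketch below (an added illustration, assuming SymPy is available; the placeholder symbols p and w for u′ and u are my own) applies it to F(p, u, x) = √(1 + p²) from Example 7.1 and confirms that the resulting equation forces u″ = 0, i.e., straight lines.

```python
import sympy as sp

# Symbolic check of the Euler–Lagrange equation (7.6) for the arclength
# integrand F(p, u, x) = sqrt(1 + p^2); p, w stand in for u'(x), u(x).
x = sp.symbols('x')
u = sp.Function('u')
p, w = sp.symbols('p w')

F = sp.sqrt(1 + p**2)

# Partial derivatives of F, evaluated along the curve (u'(x), u(x), x):
Fp = sp.diff(F, p).subs({p: u(x).diff(x), w: u(x)})
Fu = sp.diff(F, w).subs({p: u(x).diff(x), w: u(x)})

# Left-hand side of (7.6): (d/dx)(∂F/∂p) - ∂F/∂u
EL = sp.simplify(sp.diff(Fp, x) - Fu)
print(EL)  # proportional to u''(x), so EL = 0 forces u'' = 0
```

Substituting any linear function for u(x) makes the expression vanish, in agreement with the notes.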

There is an interesting special case: when F = F(p, u) does not depend explicitly on x. In that case, we can integrate the EL equation using a trick: by a careful application of the Chain Rule,

\[
\begin{aligned}
\frac{d}{dx}\Big(u'(x)\frac{\partial F}{\partial p}(u',u) - F(u',u)\Big)
&= u'\,\frac{d}{dx}\Big(\frac{\partial F}{\partial p}\Big) + u''\,\frac{\partial F}{\partial p}
- \Big(\frac{\partial F}{\partial p}u'' + \frac{\partial F}{\partial u}u'\Big)\\
&= \Big[\frac{d}{dx}\Big(\frac{\partial F}{\partial p}\Big) - \frac{\partial F}{\partial u}\Big]\,u'(x)\\
&= 0,
\end{aligned}
\]

for any solution of the EL equation. In conclusion, for F = F(p, u), the Euler–Lagrange equation is equivalent to

\[
u'(x)\frac{\partial F}{\partial p}(u',u) - F(u',u) = C_1, \tag{7.7}
\]

with C1 a constant, which is only a first order ODE!

Now let’s apply this same procedure to the minimal surface of revolution, Example 7.2,

for which F(p, u) = 2\pi u\sqrt{1+p^2}. Then the EL equation is

\[
\frac{d}{dx}\bigg(\frac{2\pi u(x)\,u'(x)}{\sqrt{1+(u'(x))^2}}\bigg) - 2\pi\sqrt{1+(u'(x))^2} = 0,
\]

a 2nd order nonlinear ODE, which is pretty ugly. Instead we recognize that F(p, u) has no explicit x-dependence, and use the reduced form (7.7), to get:

\[
C = \frac{2\pi u(x)\,[u'(x)]^2}{\sqrt{1+(u'(x))^2}} - 2\pi u(x)\sqrt{1+(u'(x))^2}
= -\frac{2\pi u(x)}{\sqrt{1+(u'(x))^2}}.
\]

Call C1 = C/(2π). So −u/C_1 = \sqrt{1+(u'(x))^2}, and squaring both sides and solving for u′ gives u'(x) = \pm\sqrt{u^2/C_1^2 - 1}, which is a separable ODE. By resisting the temptation to do a trig substitution with a secant function, and instead doing a hyperbolic substitution with cosh, we get:

\[
x + C_2 = \int dx = \int \frac{du}{\sqrt{u^2/C_1^2 - 1}}
\qquad [\text{substitute } u = C_1\cosh\theta,\ du = C_1\sinh\theta\,d\theta]
\]
\[
= \int \frac{C_1\sinh\theta\,d\theta}{\sinh\theta}
= \int C_1\,d\theta = C_1\theta = C_1\cosh^{-1}\Big(\frac{u}{C_1}\Big).
\]

So the general solution to the EL equation is

\[
u(x) = C_1\cosh\Big(\frac{x+C_2}{C_1}\Big).
\]


Figure 7.3. A surface of revolution, the catenoid.

To find the constants C1, C2 we should plug in the endpoints (7.1), but in general this is not very pretty. The main interest here is in the shape and not the specific values of the constants. We have shown that the surface of revolution with the least area for given boundary conditions, a “minimal surface”, is called the catenoid; it is generated by one of this family of cosh graphs, and is drawn in Figure 7.3.
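A quick numerical check (an added illustration, not part of the notes): with the hypothetical choice C1 = 1, C2 = 0, the candidate u(x) = cosh(x) should make u/√(1 + (u′)²) constant, which is exactly the reduced Euler–Lagrange relation derived above.

```python
import math

# For u(x) = cosh(x): u'(x) = sinh(x), and u / sqrt(1 + (u')^2)
# should be the constant C1 = 1 at every x.
def ratio(x):
    u = math.cosh(x)
    up = math.sinh(x)
    return u / math.sqrt(1.0 + up * up)

values = [ratio(x) for x in (-1.0, -0.3, 0.0, 0.7, 2.0)]
print(values)  # all equal to 1 up to rounding
```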

Finally, we return to the Brachistochrone, the original problem posed by Bernoulli: determine the marble track which leads to the shortest time of descent. (This was discussed in Example 7.3.) This example is also of the special form, F(p, u) = \sqrt{1+p^2}/\sqrt{2gu}, independent of x, so the same reduction of order will be useful. Taking the partial derivatives as in (7.7), we get a constant C for which

\[
C = u'\frac{\partial F}{\partial p}(u',u) - F(u',u)
= \frac{(u')^2}{\sqrt{1+(u')^2}\sqrt{2gu}} - \frac{\sqrt{1+(u')^2}}{\sqrt{2gu}}
= -\frac{1}{\sqrt{1+(u')^2}\sqrt{2gu}}.
\]

Simplifying somewhat, we get

\[
u(x)\big[1+(u'(x))^2\big] = k, \qquad \text{with } k = \frac{1}{2gC^2}.
\]

Although we could write this as a separable first order ODE, it has no closed-form (explicit) solution! However, by a method which will remain mysterious, we can exhibit a solution in the form of a parametrized curve, x = (k/2)(t − sin t), y = (k/2)(1 − cos t), for parameter t ∈ R. This is a famous curve, the cycloid, obtained by following a point on the rim of a rolling wheel!
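The cycloid claim can be verified numerically (an added illustration, not part of the notes): along the parametrized curve, dy/dx = (dy/dt)/(dx/dt), and the combination y[1 + (dy/dx)²] should equal the constant k at every parameter value.

```python
import math

# Check that the cycloid x = (k/2)(t - sin t), y = (k/2)(1 - cos t)
# satisfies the reduced brachistochrone equation y * (1 + (dy/dx)^2) = k.
k = 3.0

def residual(t):
    y = 0.5 * k * (1.0 - math.cos(t))
    dydx = math.sin(t) / (1.0 - math.cos(t))   # (dy/dt) / (dx/dt)
    return y * (1.0 + dydx * dydx) - k

residuals = [residual(t) for t in (0.5, 1.0, 2.0, 3.0)]
print(residuals)  # all numerically zero
```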

7.2 Problems in higher dimensions

We can also consider variational problems in higher dimensions, looking for an unknown function u : D ⊂ R^n → R which minimizes a multiple integral over the domain D,

\[
I(u) = \int_D F(\nabla u, u, \vec{x})\,dx_1\cdots dx_n,
\]


among C1 functions u with given values on the boundary ∂D: u(~x) = g(~x) for all ~x ∈ ∂D, where g is a given function. A common choice is to take g(~x) = 0, and that’s what we’ll assume in this section: u ∈ A0, that is, u is C1 inside D and on its boundary ∂D, with

\[
u(\vec{x}) = 0 \quad \text{for all } \vec{x}\in\partial D. \tag{7.8}
\]

(This boundary condition is often called the Dirichlet condition.)

Rather than develop formulas for a general form of F(~p, u, ~x), let’s look at a specific example. Imagine a very thin membrane, represented by the planar domain D ⊂ R², which is attached to the boundary ∂D (like a drum skin attached to its rim, although the rim ∂D need not be circular). If we weigh the membrane down with a load, exerting a pressure f(x, y) on the surface of the membrane, it will stretch and take a new form z = u(x, y), (x, y) ∈ D. Since the membrane is attached to the rim, we have u(x, y) = 0 for all (x, y) ∈ ∂D, so such a u ∈ A0. From linear elasticity theory, the shape of the loaded membrane minimizes the elastic energy,

\[
I(u) = \iint_D \Big[\tfrac{1}{2}|\nabla u(x,y)|^2 + f(x,y)\,u(x,y)\Big]\,dx\,dy,
\]

among all u ∈ A0.

As we have done before, we look at variations of u by v ∈ A0: expanding the square,

\[
\begin{aligned}
I(u+tv) &= \iint_D \Big[\tfrac{1}{2}|\nabla u + t\nabla v|^2 + f(x,y)(u+tv)\Big]\,dx\,dy\\
&= \iint_D \Big[\tfrac{1}{2}|\nabla u|^2 + f(x,y)u
+ t\,(\nabla u\cdot\nabla v + f(x,y)v) + \tfrac{t^2}{2}|\nabla v|^2\Big]\,dx\,dy\\
&= I(u) + t\iint_D (\nabla u\cdot\nabla v + f(x,y)v)\,dx\,dy
+ \frac{t^2}{2}\iint_D |\nabla v|^2\,dx\,dy. \tag{7.9}
\end{aligned}
\]

We notice in passing that this is a Taylor expansion – we’ll come back to that later!

First, let’s do the usual thing for the first variation:

\[
0 = \frac{d}{dt} I(u+tv)\bigg|_{t=0}
= \iint_D (\nabla u\cdot\nabla v + f(x,y)\,v)\,dx\,dy, \tag{7.10}
\]

for all v ∈ A0. As before, we must integrate by parts to express the integral in terms of v(x, y) and not ∇v. For that, we need to go back to vector calculus, and use the Divergence Theorem!

Theorem 7.5 (Multiple Integration by Parts). Let D ⊂ R^n (n = 2, 3) be an open set with smooth boundary ∂D and unit normal vector ~n. Suppose v : D ⊂ R^n → R is a C1 function, and ~A : R^n → R^n is a C1 vector field. Then, for D ⊂ R²,

\[
\iint_D \vec{A}\cdot\nabla v\,dx\,dy
= \int_{\partial D} v\,\vec{A}\cdot\vec{n}\,ds
- \iint_D v\,\nabla\cdot\vec{A}\,dx\,dy,
\]

and for D ⊂ R³,

\[
\iiint_D \vec{A}\cdot\nabla v\,dx\,dy\,dz
= \iint_{\partial D} v\,\vec{A}\cdot\vec{n}\,dS
- \iiint_D v\,\nabla\cdot\vec{A}\,dx\,dy\,dz.
\]


Recall that ∇ · ~A = div(~A) = trace(D~A) is the divergence, the trace of the derivative matrix (I bet you never thought of it like that before!). In 2D, the boundary integral is a line integral around the curve ∂D; in 3D it is a surface integral over the surface ∂D.

Proof. We do it in 2D; 3D is basically the same, plus one more integral sign all around. Integration by parts is the product rule plus the Fundamental Theorem of Calculus; in higher dimensions, that means the Divergence Theorem (Green’s in 2D, Gauss’s in 3D). By a vector identity (valid in any dimension),

\[
\nabla\cdot(v\vec{A}) = \nabla v\cdot\vec{A} + v\,\nabla\cdot\vec{A}.
\]

We integrate both sides (double integral in 2D, triple in 3D), and apply the Divergence Theorem (a variant of Green’s Theorem in 2D, Gauss’s Theorem in 3D),

\[
\iint_D \big[\nabla v\cdot\vec{A} + v\,\nabla\cdot\vec{A}\big]\,dx\,dy
= \iint_D \nabla\cdot(v\vec{A})\,dx\,dy
= \int_{\partial D} v\,\vec{A}\cdot\vec{n}\,ds,
\]

which is the same thing as what we want.

Let’s get back to what we were doing: trying to get the Euler–Lagrange equation for our elastic energy functional. Applying the Integration by Parts Theorem with ~A = ∇u to the first term in (7.10), we have:

\[
\begin{aligned}
0 &= \iint_D \nabla u\cdot\nabla v\,dx\,dy + \iint_D f(x,y)\,v\,dx\,dy\\
&= \int_{\partial D} v\,\nabla u\cdot\vec{n}\,ds
- \iint_D \operatorname{div}(\nabla u)\,v\,dx\,dy
+ \iint_D f(x,y)\,v\,dx\,dy\\
&= \iint_D \big(-\Delta u + f(x,y)\big)\,v\,dx\,dy,
\end{aligned}
\]

for all v ∈ A0. In the above, we have used the fact that v = 0 on ∂D to eliminate the line integral, and written the Laplacian Δu = div(∇u) (also written as ∇²u by physicists). We apply our old friend, the Fundamental Lemma of the Calculus of Variations (which is true in any dimension, for pretty much the same reasons as in 1D), to conclude that the term multiplying v must always be zero. Thus, we get the Euler–Lagrange equation, which is a linear 2nd order partial differential equation (PDE),

\[
\Delta u(x,y) = f(x,y),
\]

which in this case is Poisson’s equation; it appears in many different physical contexts.

As a last remark, note that equation (7.9) really is the second order Taylor polynomial for the functional I(u); if u solves the EL equation (Poisson’s equation above), then (as in the case of vector functions at a critical point) we have

\[
I(u+tv) = I(u) + \frac{t^2}{2}\iint_D |\nabla v|^2\,dx\,dy > I(u)
\]

for all t ≠ 0, as long as v ≢ 0. Thus, the solution to Poisson’s equation represents the global minimizer of I(u) among all u ∈ A0.
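A one-dimensional discrete analogue makes this minimizing property concrete (an added illustration, not part of the notes, with a hypothetical load f ≡ 1): for I(u) = ∫ [½(u′)² + f u] dx with u(0) = u(1) = 0, the EL equation u″ = f is solved by u(x) = x(x−1)/2, and that solution should have strictly smaller energy than any perturbation by v ∈ A0.

```python
# Discrete energy I(u) ≈ Σ [ (1/2)(Δu/Δx)^2 + f*u ] Δx on [0, 1],
# with f ≡ 1 and zero boundary values.
N = 200
dx = 1.0 / N
xs = [i * dx for i in range(N + 1)]

def energy(u):
    total = 0.0
    for i in range(N):
        up = (u[i + 1] - u[i]) / dx
        total += (0.5 * up * up + 1.0 * u[i]) * dx
    return total

u_star = [x * (x - 1.0) / 2.0 for x in xs]   # solves u'' = 1, vanishes at 0, 1
v = [x * (1.0 - x) for x in xs]              # a variation in A0
u_pert = [a + 0.3 * b for a, b in zip(u_star, v)]

print(energy(u_star), energy(u_pert))  # the solution has lower energy
```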


7.3 Practice problems

1. Let I(u) = \int_0^1 F(u'(x), u(x), x)\,dx. Find the first variation and Euler–Lagrange equation for each choice of F below. Then solve the equations to find a general solution for the extremals of I(u).

[Note: your solutions should have two constants of integration.]

(a) F(p, u, x) = \sqrt{1+p^2}/u

(b) F(p, u, x) = \sqrt{p^2+u^2}

(c) F(p, u, x) = \tfrac{1}{2}u^2 - \tfrac{1}{2}p^2

2. Let I(u) = \int_0^4 \big[(u')^2 - xu'\big]\,dx, for u ∈ C1 satisfying u(0) = 0 and u(4) = 3. Find the first variation and the Euler–Lagrange equation, and then solve it to find the (unique) solution (which gives the minimizer of I(u) over the admissible set).

3. Consider the surface of the right circular cylinder, x² + y² = a², in R³. Using cylindrical coordinates, this surface may be parametrized by the polar angle θ and z: Φ(θ, z) = (a cos θ, a sin θ, z).

A geodesic on a surface is a curve of shortest length joining two points. Take two points on the cylinder, identified by local coordinates (θ₁, z₁) and (θ₂, z₂) with θ₁ < θ₂, and consider the collection of all paths of the form

\[
\vec{r}(\theta) = (a\cos\theta,\ a\sin\theta,\ z(\theta)), \qquad \theta\in[\theta_1,\theta_2],
\]

with z(θ) a C1 function, z(θ₁) = z₁ and z(θ₂) = z₂.

(a) Show that the arclength of the curve ~r(θ) equals I(z) = \int_{\theta_1}^{\theta_2}\sqrt{a^2 + [z'(\theta)]^2}\,d\theta.

(b) Find the first variation and Euler–Lagrange equation for I(z).

(c) Solve the differential equation, and show that the extremals (“geodesics”) are helices, i.e., the z-coordinate is a linear function of the polar angle θ.

4. Let D ⊂ R² be a bounded open set with smooth boundary curve ∂D. Define the admissible class A0 of C1 functions u : D ⊂ R² → R which vanish on the boundary, u(x, y) = 0 for all (x, y) ∈ ∂D. Find the first variation and Euler–Lagrange equation for the functional

\[
I(u) = \iint_D \Big[\tfrac{1}{2}|\nabla u|^2 - \tfrac{1}{3}u^3\Big]\,dx\,dy.
\]

(Do not try to solve the resulting PDE! Just the form of the equation will do.)

5. Let D ⊂ R³ be a bounded open set with smooth boundary surface ∂D. Define the admissible class A0 of C1 functions u : D ⊂ R³ → R which vanish on the boundary, u(x, y, z) = 0 for all (x, y, z) ∈ ∂D. Let ~E : R³ → R³ be a C1 vector field, and consider the functional

\[
I(u) = \iiint_D \tfrac{1}{2}\big|\nabla u - \vec{E}(x,y,z)\big|^2\,dx\,dy\,dz.
\]

(a) Expand out I(u + tv) for u, v ∈ A0, in powers of t.

(b) Use part (a) to find the Euler–Lagrange equation. (Do not try to solve it!)


(c) Explain why any solution u of the Euler–Lagrange equation must be a strict minimizer of I(u) among functions in A0.

6. Let D ⊂ R² be as in problem 4, but now think of points in R² as representing space and time, (x, t) ∈ R², and let u(x, t) : D ⊂ R² → R be a C1 function. For a constant c > 0, consider the functional

\[
I(u) = \frac{1}{2}\iint_D \big[(u_t(x,t))^2 - c^2(u_x(x,t))^2\big]\,dx\,dt.
\]

Calculate the first variation, but to avoid confusion write it as \frac{d}{ds}I(u+sv)\big|_{s=0} for v ∈ A0 (as before, v(x, t) = 0 on ∂D), and find the Euler–Lagrange equation (a.k.a. the Wave Equation).

[You can still apply the Integration by Parts Theorem, by writing ∇v = (v_x, v_t) and treating t as if it were y.]


Chapter 8

Fourier Series

In 1807, Jean-Baptiste Joseph Fourier (1768–1830) published a research paper on solutions of the heat equation, the partial differential equation describing heat flow in a solid object. An important part of his work involved linear combinations of special solutions of the form

\[
e^{kt}\sin(\omega x).
\]

At this point, Fourier makes what looks like a ridiculous claim: that any continuous function may be represented as a series of terms of this form, that is,

\[
\sum_{n=1}^{\infty}\sin(\omega_n x),
\]

for an appropriate choice of frequencies ω_n. On the face of it, this is absurd. The sine functions oscillate periodically, with period 2π/ω_n, so how could they represent any continuous function?

In the end Fourier was essentially right: almost any function (even discontinuous ones) can be represented by a convergent series of sines (or cosines), which we now call a Fourier Series,

\[
\alpha_0 + \sum_{n=1}^{\infty}\big[\alpha_n\cos(n\omega_0 x) + \beta_n\sin(n\omega_0 x)\big].
\]

However, it is not entirely clear in what sense we should interpret the series; being an infinite series, there is an important (and difficult) question of convergence. The questions of which functions can be obtained as convergent Fourier Series, and in what sense the convergence takes place, have been keeping mathematicians busy since Fourier’s time. The study of Fourier Series and its applications has also stimulated a huge amount of mathematical research in analysis, which has contributed to solving many, many other important problems in mathematics and in applications (which include data transmission, audio and video compression, medical scanning, and image processing!).

Let’s begin by reviewing the properties of the sine (and cosine) functions. Consider a function of the form

\[
f_1(x) = A\sin(\omega x + \varphi). \tag{8.1}
\]

The constant ω is the frequency, and φ is called the phase. f₁ is a periodic function with period T = 2π/ω; in other words,

\[
f_1(x+T) = f_1(x) \quad \text{for all } x\in\mathbb{R}.
\]


CHAPTER 8. FOURIER SERIES 76

Similarly, if k ∈ N, then f_k(x) = A sin(kωx + φ) is also T-periodic. This may be verified easily, since

\[
f_k(x+T) = A\sin(k\omega(x+T)+\varphi) = A\sin(k\omega x+\varphi+2\pi k)
= A\sin(k\omega x+\varphi) = f_k(x).
\]

Since any linear combination of these f_k(x) is T-periodic, we can’t hope to represent a non-periodic function by a Fourier series, at least not on the whole real line R. So the solution is to restrict our attention to functions defined on an interval of length T.

Next, by the angle addition identity from trigonometry, we have

\[
A\sin(\omega x+\varphi) = \alpha\cos(\omega x) + \beta\sin(\omega x),
\]

where A = \sqrt{\alpha^2+\beta^2}, with α = A sin φ and β = A cos φ. Conversely, cos(ωx + φ) = sin(ωx + φ + π/2), and so we see that using both sines and cosines with the same frequency ω produces the exact same family of wave forms as (8.1) with a phase angle.

It will be convenient to use both sin and cos, and to use a symmetric interval of length T = 2L, where x ∈ [−L, L]. We follow Fourier and look at integer multiples of the frequency ω = π/L, i.e.,

\[
\alpha_k\cos\Big(\frac{k\pi x}{L}\Big) + \beta_k\sin\Big(\frac{k\pi x}{L}\Big),
\]

where k = 1, 2, 3, . . . , as our basic building blocks. Using these building blocks, we want to answer two questions: (1) given a function f(x) on [−L, L], can we choose the coefficients α_k, β_k such that f is represented in the form

\[
\alpha_0 + \sum_{k=1}^{\infty}\Big(\alpha_k\cos\Big(\frac{k\pi x}{L}\Big)
+ \beta_k\sin\Big(\frac{k\pi x}{L}\Big)\Big),
\]

and (2) does the series converge to f(x), and in what sense does it converge?

8.1 Orthogonality

Fourier Series are so useful because they are related to a natural sense of orthogonality in a vector space whose “points” are functions. Again, we restrict our attention to T-periodic functions, but for convenience we choose a symmetric interval of length T, [−L, L] with T = 2L. Then the collection of all continuous, 2L-periodic functions f : R → R forms a vector space, which we call C_L. We introduce an inner product (or scalar product) on C_L:

Definition 8.1 (Inner product). Let f, g be continuous functions on [−L, L]. Their L² inner product (or scalar product) is \langle f,g\rangle = \int_{-L}^{L} f(x)\,g(x)\,dx.

We can easily verify the symmetry and linearity of the inner product. Now, we also define an associated norm:

Definition 8.2 (Norm). Let f be a continuous function on [−L, L]. Its L² norm is

\[
\|f\| = \sqrt{\langle f,f\rangle} = \sqrt{\int_{-L}^{L}(f(x))^2\,dx}.
\]


The term “L²” inner product and norm comes from the square (and square root) appearing in the form of the norm ‖f‖. One may define a whole family of norms on functions f ∈ C_L,

\[
\bigg[\int_{-L}^{L}|f(x)|^p\,dx\bigg]^{1/p},
\]

for any p ≥ 1, the so-called L^p norms. We won’t deal with these here, but you will run into them later on in your analysis courses!

Using the L² definition of norm, we can define the distance between f and g as ‖f − g‖, which is equal to the square root of the area under the graph of (f(x) − g(x))².

This definition of norm should have the property that ‖f‖ = 0 iff f(x) ≡ 0. This is true by an argument which is similar to the proof of the FLCoV (and which we omit here).

Example 8.3. With this definition of inner product, we find that

\[
\Big\{\,1,\ \cos\Big(\frac{k\pi x}{L}\Big),\ \sin\Big(\frac{k\pi x}{L}\Big)\ \Big|\ k=1,2,3,\dots\Big\}
\]

form an orthogonal family of functions. In other words,

\[
\begin{aligned}
\Big\langle\cos\Big(\frac{k\pi x}{L}\Big),\ \cos\Big(\frac{m\pi x}{L}\Big)\Big\rangle &= 0,
&&\text{for } m\neq k;\\
\Big\langle\sin\Big(\frac{k\pi x}{L}\Big),\ \sin\Big(\frac{m\pi x}{L}\Big)\Big\rangle &= 0,
&&\text{for } m\neq k;\\
\Big\langle\sin\Big(\frac{k\pi x}{L}\Big),\ \cos\Big(\frac{m\pi x}{L}\Big)\Big\rangle &= 0,
&&\text{for } m=0,1,\dots \text{ and } k=1,2,\dots
\end{aligned}
\]

(Note that 1 = cos(0 · πx/L) is included in the formulas for the cosine!)

To derive these we use the angle addition formulas (this was done in section 7.2 of Stewart!) to calculate the integrals explicitly. For example,

\[
\int_{-L}^{L}\cos\Big(\frac{k\pi x}{L}\Big)\cos\Big(\frac{m\pi x}{L}\Big)dx
= \int_{-L}^{L}\frac{1}{2}\Big[\cos\Big(\frac{(k-m)\pi x}{L}\Big)
+ \cos\Big(\frac{(k+m)\pi x}{L}\Big)\Big]dx = 0
\]

if k 6= m. The others are done in a similar way.
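These orthogonality relations are easy to confirm numerically (an added illustration, not part of the notes), here with L = π and a midpoint-rule approximation of the L² inner product on [−L, L]:

```python
import math

# Midpoint-rule approximation of <f, g> = ∫_{-L}^{L} f(x) g(x) dx, L = pi.
L = math.pi
N = 100_000
dx = 2 * L / N

def inner(f, g):
    total = 0.0
    for i in range(N):
        x = -L + (i + 0.5) * dx
        total += f(x) * g(x) * dx
    return total

cos2 = lambda x: math.cos(2 * x)
cos3 = lambda x: math.cos(3 * x)
sin2 = lambda x: math.sin(2 * x)

print(inner(cos2, cos3))  # ~0: distinct cosines are orthogonal
print(inner(sin2, cos3))  # ~0: sines are orthogonal to cosines
print(inner(cos2, cos2))  # ~pi = L: the squared norm when k = m
```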

8.2 Fourier Series

The idea is to take a function f(x) defined on an interval of length T = 2L, typically (−L, L), and associate to it a series of trigonometric functions of period T = 2L:

\[
f(x) \sim S(x) = \frac{a_0}{2}
+ \sum_{k=1}^{\infty}\Big(a_k\cos\Big(\frac{k\pi x}{L}\Big)
+ b_k\sin\Big(\frac{k\pi x}{L}\Big)\Big).
\]

(We do not write this as an equality, since we have no idea if the infinite series converges!) Further, we want to derive the coefficients a_k and b_k from f(x) and test whether the series converges or not.


Figure 8.1. y = sin(kx), where k = 1, 2, 3, 4. As k increases, period decreases.

To find the coefficients, we use orthogonality. We use the L² inner product on C_L, the set of continuous periodic functions defined on [−L, L]:

\[
\langle f,g\rangle = \int_{-L}^{L} f(x)\,g(x)\,dx.
\]

Then f ⊥ g when ⟨f, g⟩ = 0, and we found in Example 8.3 that

\[
\Big\{\,\frac{1}{2},\ \cos\Big(\frac{k\pi x}{L}\Big),\ \sin\Big(\frac{k\pi x}{L}\Big)\ \Big|\ k=1,2,3,\dots\Big\}
\]

is an orthogonal family in C_L. By calculating the inner products when k = m, we find that

\[
\Big\langle\cos\Big(\frac{k\pi x}{L}\Big),\ \cos\Big(\frac{m\pi x}{L}\Big)\Big\rangle
= \Big\langle\sin\Big(\frac{k\pi x}{L}\Big),\ \sin\Big(\frac{m\pi x}{L}\Big)\Big\rangle
= \begin{cases} L, & m = k,\\ 0, & m\neq k.\end{cases}
\]

Recall that for vectors ~u, ~v ∈ R^n, the projection of ~v onto ~u is

\[
\frac{\vec{v}\cdot\vec{u}}{\|\vec{u}\|^2}\,\vec{u}.
\]

By analogy, we get

\[
\begin{aligned}
a_0 &= \frac{\langle f, 1\rangle}{\|1\|^2}
= \frac{1}{L}\int_{-L}^{L} f(x)\,dx,\\
a_k &= \frac{\big\langle f,\ \cos\big(\frac{k\pi x}{L}\big)\big\rangle}{\big\|\cos\big(\frac{k\pi x}{L}\big)\big\|^2}
= \frac{1}{L}\int_{-L}^{L} f(x)\cos\Big(\frac{k\pi x}{L}\Big)dx,\\
b_k &= \frac{\big\langle f,\ \sin\big(\frac{k\pi x}{L}\big)\big\rangle}{\big\|\sin\big(\frac{k\pi x}{L}\big)\big\|^2}
= \frac{1}{L}\int_{-L}^{L} f(x)\sin\Big(\frac{k\pi x}{L}\Big)dx.
\end{aligned}
\]

So the Fourier series of f(x) is

\[
f(x) \sim S(x) = \frac{a_0}{2}
+ \sum_{k=1}^{\infty}\Big(a_k\cos\Big(\frac{k\pi x}{L}\Big)
+ b_k\sin\Big(\frac{k\pi x}{L}\Big)\Big).
\]


Figure 8.2. Dashed curves represent the first three partial sums of the Fourier series for f(x) = x² for x ∈ (−π, π). As the number of terms increases, the sum converges to x² in the interval. The solid curve represents f(x) = x².

If we look at the nth partial sum of this series,

\[
S_n(x) = \frac{a_0}{2}
+ \sum_{k=1}^{n}\Big(a_k\cos\Big(\frac{k\pi x}{L}\Big)
+ b_k\sin\Big(\frac{k\pi x}{L}\Big)\Big),
\]

we find that S_n(x) is a trigonometric polynomial. Then, for large values of n, this should be a good approximation to f(x) on (−L, L).

Example 8.4. Consider f(x) = cos²x = ½ + ½cos(2x), with L = π on [−π, π]. Then

\[
a_0 = 1,\quad a_1 = 0,\quad a_2 = \tfrac{1}{2},\quad a_k = 0 \text{ for } k\ge 3;
\qquad b_k = 0 \text{ for all } k.
\]

Any function of the form cos^m x, sin^m x, or cos^m x sin^n x is a trig polynomial, and it can be written as a linear combination of cosines and sines using trigonometric identities. And so this function is exactly equal to its Fourier Series! Since the series is actually a finite sum, there is no convergence question to deal with.

Example 8.5. Consider f(x) = x² for x ∈ (−π, π). Now, x² is not a periodic function, but the Fourier Series built on (−π, π) will be periodic outside that interval, so what the series generates is not equal to x² outside of that interval, but rather the periodic extension of the piece of x² inside (−π, π): see Figure 8.2.

Calculating the coefficients, we get

\[
\begin{aligned}
b_k &= \frac{1}{\pi}\int_{-\pi}^{\pi} x^2\sin(kx)\,dx = 0 \quad \text{for all } k\in\mathbb{N},\\
a_0 &= \frac{1}{\pi}\int_{-\pi}^{\pi} x^2\,dx = \frac{2\pi^2}{3},\\
a_k &= \frac{1}{\pi}\int_{-\pi}^{\pi} x^2\cos(kx)\,dx = \cdots = \frac{4}{k^2}(-1)^k,
\end{aligned}
\]

where you must integrate by parts twice to calculate a_k. So

\[
f(x) \sim S(x) = \frac{\pi^2}{3} + \sum_{k=1}^{\infty}\frac{4(-1)^k}{k^2}\cos(kx).
\]
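This series can be spot-checked numerically (an added illustration, not part of the notes): the partial sums should approach x² at each point of (−π, π), with the error shrinking as more terms are kept.

```python
import math

# Partial sums of S(x) = pi^2/3 + sum_{k>=1} 4(-1)^k cos(kx)/k^2,
# the Fourier series of x^2 on (-pi, pi).
def S(x, n):
    return math.pi**2 / 3 + sum(
        4 * (-1) ** k * math.cos(k * x) / k**2 for k in range(1, n + 1)
    )

x = 1.0
for n in (5, 50, 500):
    print(n, abs(S(x, n) - x**2))  # the error shrinks as n grows
```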


Figure 8.3. Dashed curves represent the first four partial sums of the Fourier series for f(x) = x² for x ∈ (0, 2π). As the number of terms increases, the sum converges to x² in the interval. The solid curve represents f(x) = x².

Example 8.6. Consider f(x) = x², where x ∈ (0, 2π) (Figure 8.3). This periodic extension is different from the one above: it agrees with x² on the asymmetrical interval (0, 2π) and replicates that shape periodically on the rest of R. Unlike the previous example, this extension is discontinuous at all x = 2πm. So even in the best case, where the series S(x) converges everywhere, it is a discontinuous function at x = 2πm, for all m ∈ Z. Still, does it converge there, and if so, to which values?

We calculate the Fourier coefficients, remembering that a 2π-periodic function can be integrated over any interval of length 2π and give the same answer. So

\[
a_k = \frac{1}{\pi}\int_0^{2\pi} x^2\cos(kx)\,dx, \qquad
b_k = \frac{1}{\pi}\int_0^{2\pi} x^2\sin(kx)\,dx.
\]

In this case, we get a₀ = 8π²/3, a_k = 4/k², and b_k = −4π/k. Since b_k ≠ 0, we know that this periodic extension is not an even function.

Example 8.7. Consider f(x) = x, where x ∈ (−π, π) and L = π (Figure 8.4). Then we get

\[
\begin{aligned}
a_k &= \frac{1}{\pi}\int_{-\pi}^{\pi} x\cos(kx)\,dx = 0,\\
b_k &= \frac{1}{\pi}\int_{-\pi}^{\pi} x\sin(kx)\,dx
= \frac{2}{\pi}\int_0^{\pi} x\sin(kx)\,dx
= \frac{2}{k}(-1)^{k+1}.
\end{aligned}
\]

So, as f(x) = x on (−π, π) is odd, we have a sine series:

\[
f(x) \sim S(x) = \sum_{k=1}^{\infty}\frac{2(-1)^{k+1}}{k}\sin(kx).
\]


Figure 8.4. Dashed curves represent the first three partial sums of the Fourier series for f(x) = x for x ∈ (−π, π). As the number of terms increases, the sum converges to x in the interval. The solid curve represents f(x) = x. The series converges at each x (by the Pointwise Convergence Theorem), but the accuracy of the approximation by partial sums is poor near the jump discontinuities, x = (2m − 1)π, m ∈ Z.

S(x) has discontinuities: f(−π) = −π ≠ π = f(π). So what happens at the discontinuities, x = (2m − 1)π?

\[
S(n\pi) = \sum_{k=1}^{\infty}\frac{2(-1)^{k+1}}{k}\sin(kn\pi)
= \sum_{k=1}^{\infty} 0 = 0.
\]

It will be a general fact that at a jump discontinuity, the Fourier series S(x) takes the midpoint value. It does not choose the value of f(x) from either the right or the left at a jump discontinuity.
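The coefficient formula from Example 8.7 can also be confirmed numerically (an added illustration, not part of the notes), by approximating b_k = (1/π)∫ x sin(kx) dx with a midpoint rule and comparing it to 2(−1)^{k+1}/k:

```python
import math

# Midpoint-rule approximation of b_k = (1/pi) ∫_{-pi}^{pi} x sin(kx) dx.
def bk(k, N=20_000):
    dx = 2 * math.pi / N
    total = 0.0
    for i in range(N):
        x = -math.pi + (i + 0.5) * dx
        total += x * math.sin(k * x) * dx
    return total / math.pi

for k in (1, 2, 3):
    print(k, bk(k), 2 * (-1) ** (k + 1) / k)  # numeric vs. exact b_k
```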

8.3 Convergence

Let’s talk about convergence; not only is it interesting and important, but it goes to the heart of what we get when we use Fourier Series to approximate a function. First, let’s review convergence of numerical series \sum_{k=1}^{\infty} c_k, where the c_k are real numbers. A series converges to a number s if

\[
\lim_{n\to\infty}\sum_{k=1}^{n} c_k = s.
\]

We call S_n = \sum_{k=1}^{n} c_k the n-th partial sum of the series. If the limit \lim_{n\to\infty} S_n does not exist, we say the series diverges.

Example 8.8 (Geometric Series). This is the most important example. Consider

\[
\sum_{k=0}^{\infty} r^k.
\]

This example is exceptional, because we can explicitly calculate S_n:

\[
S_n = \frac{1-r^{n+1}}{1-r}, \qquad r\neq 1.
\]


So the series converges if |r| < 1 and diverges if |r| ≥ 1. For r = 1, S_n = n + 1, and it diverges as n → ∞. When r = −1, S_n = 1 − 1 + 1 − 1 + 1 − · · · ± 1, so

\[
S_n = \begin{cases} 1, & n \text{ even},\\ 0, & n \text{ odd},\end{cases}
\]

and S_n diverges. In conclusion, the geometric series converges iff |r| < 1, and then s = 1/(1 − r).
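The partial-sum formula and its limit are easy to confirm directly (an added illustration, not part of the notes):

```python
# Geometric series: S_n = sum_{k=0}^{n} r^k should match (1 - r^(n+1))/(1 - r),
# and tend to 1/(1 - r) when |r| < 1.
def partial_sum(r, n):
    return sum(r**k for k in range(n + 1))

r, n = 0.5, 20
formula = (1 - r ** (n + 1)) / (1 - r)
print(partial_sum(r, n), formula, 1 / (1 - r))  # the first two agree; both near 2
```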

Example 8.9 (p-series). \sum_{k=1}^{\infty} 1/k^p converges iff p > 1.

Using the comparison test, we can determine convergence and divergence of many series using these examples.

Theorem 8.10 (Comparison Test). Assume that 0 ≤ a_k ≤ b_k for all k ∈ N.

• If \sum_{k=1}^{\infty} b_k converges, then \sum_{k=1}^{\infty} a_k also converges.

• If \sum_{k=1}^{\infty} a_k diverges, then \sum_{k=1}^{\infty} b_k also diverges.

Example 8.11. The series \sum_{k=1}^{\infty} \frac{1}{2k^2+k+1} converges, since a_k = \frac{1}{2k^2+k+1} \le \frac{1}{2k^2} and \sum_{k=1}^{\infty}\frac{1}{2k^2} = \frac{1}{2}\sum_{k=1}^{\infty}\frac{1}{k^2} converges (p-series with p = 2).

The series \sum_{k=1}^{\infty} \frac{k}{2k^2+k+1} diverges, since k + 1 \le 2k^2 for k \ge 1, and so b_k = \frac{k}{2k^2+k+1} \ge \frac{k}{4k^2} = \frac{1}{4k}, and \sum_{k=1}^{\infty}\frac{1}{4k} diverges (p-series with p = 1).

While the Comparison Test and p-series deal with series of positive terms, Fourier Series contain sines and cosines, and so will have terms changing in sign. We distinguish two types of convergence for series with sign-changing terms:

Definition 8.12 (Absolute and conditional convergence). A series \sum_{k=1}^{\infty} c_k converges absolutely if \sum_{k=1}^{\infty} |c_k| converges. The series converges conditionally if \sum_{k=1}^{\infty} c_k converges but \sum_{k=1}^{\infty} |c_k| diverges.

Example 8.13 (Alternating series). Consider

\[
\sum_{k=1}^{\infty}(-1)^k b_k,
\]

with b_k → 0 and b_k ≥ b_{k+1} ≥ 0 for all k; by the Alternating Series Test, such a series converges. For example, if we let b_k = 1/k, the alternating series \sum_{k=1}^{\infty}(-1)^k\frac{1}{k} converges, but the series of its absolute values, \sum_{k=1}^{\infty}\frac{1}{k}, is the harmonic series, which diverges — so the convergence is only conditional.

For absolutely convergent series we can have a good estimate on the rate of convergence, i.e., how many terms we need to keep to have a good approximation of the limit by a partial sum. On the other hand, conditional convergence is delicate, and the series converges slowly.

Let’s apply these ideas to Fourier series. An important distinction is that Fourier series are series of functions, not just numbers:

\[
S(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty}\big(a_k\cos(kx) + b_k\sin(kx)\big)
= \sum_{k=0}^{\infty} g_k(x).
\]

There are several different (and important) notions of convergence for series of functions. The first is easy to grasp: for each fixed value of x, \sum_{k=1}^{\infty} g_k(x) is a numerical series, which converges or doesn’t. So we can ask for each individual x whether the series converges or diverges.

Page 90: MATH 2XX3 - Advanced Calculus II - ms.mcmaster.ca · MATH 2XX3 - Advanced Calculus II Prof. S. Alama 1 Class notes recorded, adapted, and illustrated by Sang Woo Park} Revised March

CHAPTER 8. FOURIER SERIES 83

Definition 8.14 (Pointwise convergence). We say that the series $\sum_{k=1}^{\infty} g_k(x)$ converges pointwise on a set $A \subseteq \mathbb{R}$ if the series converges for every individual $x \in A$.

The notion of pointwise convergence is simple and easy to verify, but it has some drawbacks. In particular, a series of continuous functions might have a discontinuous sum, and we may not be able to differentiate or integrate the series term-by-term and obtain a correct answer!

Example 8.15. Consider
\[ \sum_{k=0}^{\infty} (1-x)x^k, \qquad x \in [0, 1]. \]
Then, we get the following partial sums:
\[ S_n(x) = \sum_{k=0}^{n} (1-x)x^k = (1-x)\sum_{k=0}^{n} x^k. \]
If $x \in [0, 1)$, the series converges:
\[ \lim_{n\to\infty} S_n(x) = (1-x)\underbrace{\lim_{n\to\infty}\sum_{k=0}^{n} x^k}_{1/(1-x)} = 1. \]
However, when $x = 1$, every term vanishes, so $S_n(1) = 0$ and the series converges to $0$ there, i.e.
\[ \sum_{k=0}^{\infty} (1-x)x^k = \begin{cases} 1 & \text{if } x \in [0, 1), \\ 0 & \text{if } x = 1. \end{cases} \]
Therefore, the series converges pointwise on the set $A = [0, 1]$. Notice that, although each partial sum $S_n(x)$ is a continuous function on $A$, the series converges to a discontinuous limit function!
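Here the partial sums telescope to the closed form $S_n(x) = 1 - x^{n+1}$, which makes the pointwise behaviour easy to see numerically. A small Python sketch (our own addition; the sample points and tolerances are arbitrary choices):

```python
# Partial sums of Example 8.15: S_n(x) = sum_{k=0}^{n} (1-x) x^k.
# Pointwise, S_n(x) -> 1 for x in [0,1), but S_n(1) = 0 for every n.
def S(n, x):
    return sum((1 - x) * x**k for k in range(n + 1))

assert abs(S(200, 0.5) - 1.0) < 1e-12   # fast convergence away from x = 1
assert S(200, 1.0) == 0.0               # every partial sum vanishes at x = 1
assert abs(S(200, 0.99) - 1.0) > 0.1    # slow convergence near x = 1
```

The third assertion shows why the convergence is not uniform: near $x = 1$ the error stays large however many terms are kept.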

A series of continuous functions can converge pointwise to a discontinuous function, and there are examples for which the series of the integrals is not equal to the integral of the series,
\[ \int_a^b \left( \sum_{k=1}^{\infty} g_k(x) \right) dx \ne \sum_{k=1}^{\infty} \int_a^b g_k(x)\,dx. \]
So we need to be very careful about doing calculus with a series which is only pointwise convergent.

Fortunately, there is a nice theorem about pointwise convergence of trigonometric Fourier series, which is easy to apply. (However, it is fairly tricky to prove!)

Definition 8.16. We say that $f(x)$ is piecewise $C^1$, or piecewise smooth, on $[a, b]$ if (i) $f$ is differentiable and $f'(x)$ is continuous except maybe at finitely many points; (ii) at each exceptional point, $f(x)$ and $f'(x)$ have jump discontinuities.


We need to be careful with this definition when dealing with Fourier Series, because even if $f$ is $C^1([-\pi, \pi])$, when we construct a Fourier series we get the $2\pi$-periodic extension of $f$ to $\mathbb{R}$, which might create discontinuities. We saw this happen in Examples 8.6 and 8.7.

Theorem 8.17 (Pointwise Convergence Theorem for Trigonometric Fourier Series). Assume $f$ is a piecewise $C^1$, $2\pi$-periodic function. Then, its trigonometric Fourier Series,
\[ S(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \bigl[a_k \cos(kx) + b_k \sin(kx)\bigr], \]
converges pointwise at every $x \in \mathbb{R}$ to
\[ S(x) = \frac{1}{2}\bigl(f(x^+) + f(x^-)\bigr), \]
where
\[ f(x^+) = \lim_{t \to x^+} f(t), \qquad f(x^-) = \lim_{t \to x^-} f(t). \]

In other words, if $f$ is continuous at $x$, then $S(x) = f(x)$. If $f$ jumps at $x$, then $S(x)$ averages the values across the jump.

Remark. It is not sufficient to assume $f$ is continuous to conclude that
\[ \lim_{n\to\infty} S_n(x) = f(x); \]
some additional information is needed. There are examples of functions which are continuous (but not piecewise $C^1$) for which the Fourier Series diverges at some values of $x$.

Example 8.18. Consider $f(x) = x$, $x \in (-\pi, \pi)$. Then,
\[ S(x) = \sum_{k=1}^{\infty} \frac{2(-1)^{k+1}}{k} \sin(kx) = x, \]
if $x \in (-\pi, \pi)$, and $S(k\pi) = 0$ for each $k \in \mathbb{Z}$, which averages the values across the jump discontinuities in the $2\pi$-periodic extension of $f$.
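A quick numerical check of the Pointwise Convergence Theorem for this series (an illustration we add here, not part of the notes; the number of terms is an arbitrary choice): at $x = \pi/2$ the partial sums approach $\pi/2$, while at the jump $x = \pi$ every partial sum is zero, the average of the jump values $\pm\pi$.

```python
import math

# Partial sums of the Fourier series of f(x) = x on (-pi, pi):
# S_n(x) = sum_{k=1}^{n} 2(-1)^(k+1)/k * sin(kx).
def S(n, x):
    return sum(2 * (-1)**(k + 1) / k * math.sin(k * x) for k in range(1, n + 1))

assert abs(S(20000, math.pi / 2) - math.pi / 2) < 1e-3  # converges, but slowly
assert abs(S(20000, math.pi)) < 1e-9                    # sin(k*pi) = 0 termwise
```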

Example 8.19. Consider $f(x) = x$ for $x \in (-\pi, \pi)$, extended $2\pi$-periodically. Then, $f(x)$ is discontinuous at the odd multiples of $\pi$, where the series takes the value $S((2m-1)\pi) = 0$, and we get the Fourier series
\[ S(x) = \sum_{k=1}^{\infty} \frac{2(-1)^{k+1}}{k} \sin(kx). \]
Observe that
\[ |b_k| = \left| \frac{2(-1)^{k+1}}{k} \right| = \frac{2}{k}. \]
Since $\sum_{k=1}^{\infty} 2/k$ diverges, the series does not converge absolutely; for each fixed $x$ it converges only conditionally. So we can see how slowly it converges by looking at graphs of $S_n$ for large $n$.


Example 8.20. Consider $f(x) = x^2$ with $x \in (-\pi, \pi)$, extended $2\pi$-periodically. In this case, the extension is continuous, so $S(x) = f(x)$ for all $x \in \mathbb{R}$. Further, we find that
\[ S(x) = \frac{\pi^2}{3} + \sum_{k=1}^{\infty} \frac{4(-1)^k}{k^2} \cos(kx). \]
Then, by looking at the individual terms,
\[ |a_k \cos(kx)| = \left| \frac{4(-1)^k}{k^2} \cos(kx) \right| \le \frac{4}{k^2}, \]
we find that $S(x)$ converges absolutely for all $x \in \mathbb{R}$ by the Comparison Test.

Since it converges for all $x \in \mathbb{R}$, let's try some values. First, when $x = 0$, we get
\[ \frac{\pi^2}{3} + \sum_{k=1}^{\infty} \frac{4(-1)^k}{k^2} = 0, \]
so we find that
\[ \sum_{k=1}^{\infty} \frac{(-1)^k}{k^2} = -\frac{\pi^2}{12}. \]
Likewise, when $x = \pi$, we find that
\[ \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6}. \]
In this way Fourier Series provides us with exact numerical values for some common series.
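Both values are easy to confirm numerically. The following check (our own addition; $N$ and the tolerances are chosen to match the expected tail sizes) sums the first $10^5$ terms:

```python
import math

# Numerical check of the two values from Example 8.20:
# sum (-1)^k / k^2 = -pi^2/12  and  sum 1/k^2 = pi^2/6.
N = 100_000
alt = sum((-1)**k / k**2 for k in range(1, N + 1))
pos = sum(1 / k**2 for k in range(1, N + 1))

assert abs(alt + math.pi**2 / 12) < 1e-8  # alternating: error ~ 1/N^2
assert abs(pos - math.pi**2 / 6) < 1e-4   # positive terms: tail ~ 1/N
```

Note how much faster the alternating (absolutely convergent after grouping) series settles than the positive one, matching the remark about rates of convergence.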

This example illustrates a second notion of convergence, uniform convergence. A uniformly convergent series is one for which the partial sums $S_n(x)$ approximate the limiting function $S(x)$ to within the same error at all points $x$. If you were to draw a thin strip of uniform width around the graph $y = S(x)$, the graphs $y = S_n(x)$ would all fit completely inside the strip for $n$ large. Figure 8.5 shows how closely the Fourier partial sums approximate $x^2$ on $[-\pi, \pi]$.

Definition 8.21 (Uniform convergence). Consider $S(x) = \sum_{k=1}^{\infty} g_k(x)$ with $x \in [a, b]$. We say that the series converges uniformly to $S(x)$ if
\[ \lim_{n\to\infty} \left( \max_{x \in [a,b]} \left| S(x) - \sum_{k=1}^{n} g_k(x) \right| \right) = 0. \]
In other words, for any $\varepsilon > 0$, there is an $N$ for which
\[ \left| S(x) - \sum_{k=1}^{n} g_k(x) \right| < \varepsilon \quad \text{for all } x \in [a, b], \]
whenever $n \ge N$.


Figure 8.5. Partial sums $S_n$ for $n = 3, 5, 15, 50$ for $f(x) = x^2$ on $(-\pi, \pi)$. The convergence is uniform.

This is a bit complicated, but fortunately there is a test for uniform convergence which is easy to apply; it looks a lot like the Comparison Test for numerical series!

Theorem 8.22 (Weierstrass). Assume $g_k(x)$ are continuous on $[a, b]$, where $k = 1, 2, 3, \dots$.

• M-Test. If $|g_k(x)| \le M_k$ for all $x \in [a, b]$, and $\sum_k M_k$ converges, then $\sum_k g_k(x)$ converges uniformly on $[a, b]$.

• If $S(x) = \sum_{k=1}^{\infty} g_k(x)$ converges uniformly on $[a, b]$, then $S(x)$ is continuous, and
\[ \int_a^b S(x)\,dx = \sum_{k=1}^{\infty} \int_a^b g_k(x)\,dx. \]

When you study Real Analysis you will prove this theorem (which actually is not very difficult!)

Going back to the examples, the series for $f(x) = x$ cannot converge uniformly, because $f(x) = x$ extended $2\pi$-periodically is discontinuous, contrary to the Weierstrass theorem. On the other hand, the series for $f(x) = x^2$ extended $2\pi$-periodically does converge uniformly, because
\[ |g_k(x)| = \left| \frac{4(-1)^k}{k^2} \cos(kx) \right| \le \frac{4}{k^2} \quad \text{for all } x. \]
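The contrast between the two examples can be seen on a grid. For $x^2$ the worst-case error over $[-\pi, \pi]$ obeys the M-test tail bound $\sum_{k>n} 4/k^2 < 4/n$, while for $x$ the error near $x = \pm\pi$ never shrinks. (A Python sketch we add here; the grid size and $n$ are arbitrary choices.)

```python
import math

# Sup-norm error of the Fourier partial sums on a grid over [-pi, pi].
xs = [i * math.pi / 500 for i in range(-500, 501)]
n = 50

def Sn_sq(x):  # partial sum for f(x) = x^2
    return math.pi**2 / 3 + sum(4 * (-1)**k / k**2 * math.cos(k * x)
                                for k in range(1, n + 1))

def Sn_id(x):  # partial sum for f(x) = x
    return sum(2 * (-1)**(k + 1) / k * math.sin(k * x) for k in range(1, n + 1))

err_sq = max(abs(x * x - Sn_sq(x)) for x in xs)
err_id = max(abs(x - Sn_id(x)) for x in xs)

assert err_sq < 4 / n   # uniform: bounded by the M-test tail sum_{k>n} 4/k^2
assert err_id > 1.0     # non-uniform: near the jump the error stays O(1)
```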

Remark. Uniform convergence holds for Fourier Series when the absolute values of the Fourier coefficients are summable, that is,
\[ \sum_{k=1}^{\infty} \bigl(|a_k| + |b_k|\bigr) \quad \text{converges.} \]
In other words, if both $\sum_k |a_k|$ and $\sum_k |b_k|$ converge, then by the Weierstrass M-Test the Fourier Series converges uniformly. By the second part of Theorem 8.22, this can only happen when the $2\pi$-periodic extension of $f$ is continuous.


Example 8.23. Consider $f(x) = |\sin x|$ with $x \in (-\pi, \pi)$. The $2\pi$-periodic extension is continuous (graph it, either by hand or using the computer!) Then, we get
\[ b_k = \frac{1}{\pi} \int_{-\pi}^{\pi} |\sin x| \sin(kx)\,dx = 0, \qquad a_0 = \frac{4}{\pi}, \qquad a_1 = 0, \]
\[ a_k = \frac{2}{\pi} \bigl((-1)^{k+1} - 1\bigr) \frac{1}{k^2 - 1}, \quad k \ge 2. \]
So
\[ S(x) = \frac{2}{\pi} - \frac{4}{\pi} \sum_{j=1}^{\infty} \frac{1}{4j^2 - 1} \cos(2jx), \]
which converges for each individual $x$ to $f(x)$ by the Pointwise Convergence Theorem. Since
\[ |a_{2j}| = \left| -\frac{4}{\pi} \cdot \frac{1}{4j^2 - 1} \right| = \frac{4}{\pi(4j^2 - 1)} \le \frac{4}{\pi} \cdot \frac{1}{4j^2 - j^2} = \frac{4}{3\pi j^2} \]
and $\sum 1/j^2$ converges, by the M-Test the series converges uniformly. (See figure 8.6.)

We can now use it to calculate some numerical series. When $x = 0$, by the convergence theorem,
\[ 0 = f(0) = \frac{2}{\pi} - \frac{4}{\pi} \sum_{j=1}^{\infty} \frac{1}{4j^2 - 1}, \]
and so
\[ \sum_{j=1}^{\infty} \frac{1}{4j^2 - 1} = \frac{1}{2}. \]

Figure 8.6. Partial sums $S_n$ for $n = 3, 5, 15, 50$ for $f(x) = |\sin x|$ on $(-\pi, \pi)$. The convergence is uniform.


8.4 Orthogonal functions

Fourier Series is really about orthogonality: representing functions in an orthogonal basis. Cosines and sines give only one example of an orthogonal family of functions. So where else do orthogonal functions come from? Here are two mechanisms which produce orthogonal families of functions, which are useful and natural in various contexts.

8.4.1 Gram-Schmidt process

Take a linearly independent collection $\{v_1, v_2, v_3, \dots\}$ in an inner product space. Gram-Schmidt gives an iterative procedure to create an orthogonal collection $\{u_1, u_2, u_3, \dots\}$ by orthogonal projection. (See section 6.3 in Anton & Rorres [AR] to review this concept.)

Take $u_1 = v_1$. Then,
\[ u_2 = v_2 - \frac{\langle v_2, u_1 \rangle}{\|u_1\|^2} u_1 \]
is orthogonal to $u_1$.

Proof.
\[ \langle u_1, u_2 \rangle = \langle u_1, v_2 \rangle - \left\langle u_1, \frac{\langle v_2, u_1 \rangle}{\|u_1\|^2} u_1 \right\rangle = \langle u_1, v_2 \rangle - \frac{\langle v_2, u_1 \rangle}{\|u_1\|^2} \underbrace{\langle u_1, u_1 \rangle}_{\|u_1\|^2} = 0. \]

We then iterate the preceding step, and create the orthogonal family by induction. Assume we already have an orthogonal family with $k$ elements,
\[ \{u_1, u_2, \dots, u_k\}. \]
Then, the vector obtained by projection as above,
\[ u_{k+1} = v_{k+1} - \sum_{j=1}^{k} \frac{\langle v_{k+1}, u_j \rangle}{\|u_j\|^2} u_j, \]
is orthogonal to $\{u_1, u_2, \dots, u_k\}$, and $\{u_1, \dots, u_{k+1}\}$ will have the same span as $\{v_1, v_2, \dots, v_{k+1}\}$.

A common application is to polynomials, $P(x) = a_0 + a_1 x + \dots + a_n x^n$. A basis of polynomials is $\{1, x, x^2, \dots\}$. For each choice of inner product, we get a different orthogonal family of polynomials (Legendre, Chebyshev, ...).

Example 8.24. Take the vector space of polynomials, spanned by the linearly independent set
\[ \{1, x, x^2, \dots, x^k, \dots\}, \]
on the interval $x \in [-1, 1]$. We choose the familiar $L^2$ inner product,
\[ \langle f, g \rangle = \int_{-1}^{1} f(x) g(x)\,dx. \]


Start with $P_0(x) = 1$, the constant function. In Problem #12 below, you'll verify that $P_1(x) = x$ and $P_1(x) \perp P_0(x)$. Continuing with Gram-Schmidt, we get the next orthogonal element,
\[ P_2(x) = x^2 - \frac{\langle x^2, P_0 \rangle}{\|P_0\|^2} P_0 - \frac{\langle x^2, P_1 \rangle}{\|P_1\|^2} P_1 = x^2 - \frac{\int_{-1}^{1} x^2 \cdot 1\,dx}{\int_{-1}^{1} 1^2\,dx} - \underbrace{\frac{\int_{-1}^{1} x^3\,dx}{\int_{-1}^{1} x^2\,dx}}_{=0}\, x = x^2 - \frac{1}{3}. \]
This procedure generates (up to normalization) the Legendre polynomials on $[-1, 1]$, which are very often used in numerical analysis.
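The computation above can be automated. The sketch below (illustrative; the helper names are our own) runs Gram-Schmidt on $\{1, x, x^2\}$ with exact rational arithmetic, using $\int_{-1}^{1} x^m\,dx = 2/(m+1)$ for even $m$ and $0$ for odd $m$:

```python
from fractions import Fraction

# Polynomials are coefficient lists: p[i] is the coefficient of x^i.
def inner(p, q):
    # <p, q> = integral_{-1}^{1} p(x) q(x) dx; only even powers contribute.
    total = Fraction(0)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if (i + j) % 2 == 0:
                total += Fraction(2, i + j + 1) * a * b
    return total

def sub(p, q, c):
    # p - c*q, padding the shorter coefficient list with zeros
    n = max(len(p), len(q))
    p = p + [Fraction(0)] * (n - len(p))
    q = q + [Fraction(0)] * (n - len(q))
    return [a - c * b for a, b in zip(p, q)]

basis = [[Fraction(1)],                               # 1
         [Fraction(0), Fraction(1)],                  # x
         [Fraction(0), Fraction(0), Fraction(1)]]     # x^2
orth = []
for v in basis:
    u = v
    for w in orth:
        u = sub(u, w, inner(v, w) / inner(w, w))
    orth.append(u)

assert orth[2] == [Fraction(-1, 3), Fraction(0), Fraction(1)]  # x^2 - 1/3
assert inner(orth[2], orth[0]) == 0 and inner(orth[2], orth[1]) == 0
```

Because the arithmetic is exact, the orthogonality checks hold with equality, not just to a tolerance.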

To get other families, we can substitute a different inner product. A common choice is a "weighted" inner product,
\[ \langle f, g \rangle = \int_{-1}^{1} f(x) g(x) w(x)\,dx, \]
where $w(x)$ is a given continuous positive function. The principle is the same as for the Legendre polynomials, but changing the inner product will change the actual polynomials obtained as an orthogonal family. (For instance, the weight $w(x) = 1/\sqrt{1 - x^2}$ produces the Chebyshev polynomials.)

8.4.2 Eigenvalue problems

This mechanism for generating orthogonal families comes from a basic linear algebra fact about symmetric matrices. Recall that a matrix $M$ is symmetric if its transpose $M^T = M$. This is equivalent to symmetry with respect to the inner (dot) product, $(M\vec{u}) \cdot \vec{v} = \vec{u} \cdot (M\vec{v})$, for all vectors $\vec{u}, \vec{v} \in \mathbb{R}^n$. In a general inner product space $V$ this is the appropriate way to talk about symmetry:

Definition 8.25. Let $M$ be a linear transformation on an inner product space $V$. We say $M$ is symmetric if
\[ \langle Mu, v \rangle = \langle u, Mv \rangle, \quad \text{for all } u, v \in V. \]

Theorem 8.26. Let $M$ be a symmetric linear transformation on an inner product space $V$. If $Mu = \lambda u$ and $Mv = \mu v$ with distinct eigenvalues $\lambda \ne \mu$, then $\langle u, v \rangle = 0$.

Proof. Notice that due to symmetry, we get
\[ \lambda \langle u, v \rangle = \langle \lambda u, v \rangle = \langle Mu, v \rangle = \langle u, Mv \rangle = \langle u, \mu v \rangle = \mu \langle u, v \rangle. \]
Since $\lambda \ne \mu$, we conclude that $\langle u, v \rangle = 0$.

Now, consider the vector space $V$ of all $C^2$ functions
\[ u : [0, L] \to \mathbb{R} \]
satisfying the boundary conditions $u(0) = 0$ and $u(L) = 0$. Then, we take the inner product $\langle u, v \rangle = \int_0^L u(x) v(x)\,dx$ and the linear transformation $Mu = u''(x)$.


Lemma 8.27. $M$ is symmetric with respect to the inner product on $V$.

Proof. Integrating by parts twice, and using $u(0) = u(L) = 0 = v(0) = v(L)$ to discard the boundary terms,
\[ \langle Mu, v \rangle = \int_0^L u''(x) v(x)\,dx = \underbrace{u'(x) v(x)\Big|_0^L}_{=0} - \int_0^L u'(x) v'(x)\,dx = \underbrace{-v'(x) u(x)\Big|_0^L}_{=0} + \int_0^L u(x) v''(x)\,dx = \langle u, Mv \rangle. \]

In $\mathbb{R}^n$, any symmetric linear transformation (i.e. matrix) provides an orthogonal basis of eigenvectors of $M$ for $\mathbb{R}^n$. Even though $V$ is infinite dimensional, it turns out that this is still true in the sense of Fourier series.

Now, this leads us to the eigenvalue problem. We want to find $u \in V$, $u \ne 0$, and $\lambda \in \mathbb{R}$ such that $Mu = \lambda u$. This is the same thing as solving the ordinary differential equation (ODE) $u'' = \lambda u$ with the boundary conditions which define the vector space $V$, $u(0) = 0 = u(L)$. One way to do this is to find all solutions of the ODE with the given $u(0) = 0$ as initial condition, and check if they also solve $u(L) = 0$. When finding the general solution to the ODE we have 3 cases:
\[ \lambda > 0, \qquad \lambda = 0, \qquad \lambda < 0. \]

When $\lambda > 0$, the general solution to the ODE is
\[ u(x) = A \cosh(\sqrt{\lambda}\, x) + B \sinh(\sqrt{\lambda}\, x). \]
Since we need $u(0) = 0$, we must have $A = 0$. The other boundary condition requires $0 = u(L) = B \sinh(\sqrt{\lambda}\, L)$, but $\sinh x > 0$ whenever $x > 0$, so this can never be satisfied except when $B = 0$ also. So we only get the trivial solution $u(x) \equiv 0$, and there is no eigenvalue with $\lambda > 0$.

Similarly, when $\lambda = 0$ the general solution to the ODE is $u(x) = A + Bx$, and the only straight line with $u(0) = 0 = u(L)$ is the trivial one, $u(x) \equiv 0$. So $\lambda = 0$ is not an eigenvalue either.

We conclude that all eigenvalues satisfy $\lambda < 0$. For convenience, write $\lambda = -\mu^2$ with $\mu > 0$. Then $u'' + \mu^2 u = 0$, with general solution
\[ u(x) = A \cos(\mu x) + B \sin(\mu x). \]
Applying the condition $u(0) = 0$, we have $A = 0$ as before. But at the other endpoint we have nontrivial solutions:
\[ 0 = u(L) = B \sin(\mu L) \]
is solved when $\mu L = k\pi$, $k \in \mathbb{N}$, that is, $\mu = \mu_k = \frac{k\pi}{L}$. So we get a sequence of eigenvalues and associated eigenfunctions,
\[ \lambda_k = -\mu_k^2 = -\frac{k^2 \pi^2}{L^2}, \qquad u_k(x) = \sin\left(\frac{k\pi x}{L}\right). \]
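As a sanity check (our own numerical illustration, using a simple trapezoid rule; $L$ and the grid size are arbitrary), the eigenfunctions $\sin(k\pi x/L)$ are pairwise orthogonal on $[0, L]$, with $\|u_k\|^2 = L/2$:

```python
import math

L = 2.0
M = 20000                                 # trapezoid-rule panels
xs = [i * L / M for i in range(M + 1)]

def ip(k, j):
    # <u_k, u_j> = integral_0^L sin(k pi x/L) sin(j pi x/L) dx, trapezoid rule
    vals = [math.sin(k * math.pi * x / L) * math.sin(j * math.pi * x / L)
            for x in xs]
    return (L / M) * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

assert abs(ip(1, 2)) < 1e-6               # distinct eigenvalues: orthogonal
assert abs(ip(2, 5)) < 1e-6
assert abs(ip(3, 3) - L / 2) < 1e-6       # ||u_k||^2 = L/2
```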


By Lemma 8.27 (together with Theorem 8.26), these form an orthogonal family of functions on $[0, L]$. Writing $\varphi_k(x) = u_k(x) = \sin(k\pi x/L)$, given $f : [0, L] \to \mathbb{R}$ we can make a Fourier expansion in this family,
\[ f(x) \sim S(x) = \sum_{k=1}^{\infty} c_k \varphi_k(x), \quad \text{where} \quad c_k = \frac{\langle f, \varphi_k \rangle}{\|\varphi_k\|^2}, \]

and the norm is defined via the inner product as usual, $\|f\| = \sqrt{\langle f, f \rangle}$. As for the full trigonometric Fourier Series (sines and cosines), in what sense are $f(x)$ and $S(x)$ equal, and does the series converge?

As always, we define the partial sums,
\[ S_n(x) = \sum_{k=1}^{n} c_k \varphi_k(x). \]

Definition 8.28. We say the series converges in norm if
\[ \lim_{n\to\infty} \|S_n - f\| = 0. \]

Definition 8.29. We call the family $\{\varphi_k(x)\}$ complete if for every continuous $f$ the series converges to $f$ in norm, i.e. $\{\varphi_k(x)\}$ forms an orthogonal basis for the continuous functions on $[a, b]$.

Let's use orthogonality to actually calculate the square of the error made in approximating $f$ by $S_n$ in the norm. (We have to be careful about indices!)
\begin{align*}
\|S_n - f\|^2 &= \langle S_n - f,\, S_n - f \rangle = \left\langle \left[\sum_{k=1}^{n} c_k \varphi_k - f\right], \left[\sum_{j=1}^{n} c_j \varphi_j - f\right] \right\rangle \\
&= \left\langle \sum_{k=1}^{n} c_k \varphi_k,\, \sum_{j=1}^{n} c_j \varphi_j \right\rangle - 2 \left\langle \sum_{k=1}^{n} c_k \varphi_k,\, f \right\rangle + \|f\|^2 \\
&= \sum_{k=1}^{n} \sum_{j=1}^{n} c_k c_j \langle \varphi_k, \varphi_j \rangle - 2 \sum_{k=1}^{n} c_k \langle \varphi_k, f \rangle + \|f\|^2 \\
&= \sum_{k=1}^{n} c_k^2 \|\varphi_k\|^2 - 2 \sum_{k=1}^{n} c_k^2 \|\varphi_k\|^2 + \|f\|^2 \\
&= \|f\|^2 - \sum_{k=1}^{n} c_k^2 \|\varphi_k\|^2,
\end{align*}
using $\langle \varphi_k, \varphi_j \rangle = 0$ for $k \ne j$ and $\langle \varphi_k, f \rangle = c_k \|\varphi_k\|^2$.

We draw two conclusions from this computation. The first comes from the observation that $\|S_n - f\|^2 \ge 0$, so the infinite series satisfies
\[ \sum_{k=1}^{\infty} c_k^2 \|\varphi_k\|^2 = \lim_{n\to\infty} \sum_{k=1}^{n} c_k^2 \|\varphi_k\|^2 \le \|f\|^2 < \infty, \]


so that numerical series converges, to something at most $\|f\|^2$. This fact is called Bessel's inequality. The second conclusion is that the Fourier Series converges in norm if and only if
\[ 0 = \lim_{n\to\infty} \|f - S_n\|^2 = \lim_{n\to\infty} \left[ \|f\|^2 - \sum_{k=1}^{n} c_k^2 \|\varphi_k\|^2 \right] = \|f\|^2 - \sum_{k=1}^{\infty} c_k^2 \|\varphi_k\|^2, \]
that is, if and only if equality holds in Bessel's inequality. In fact, this is always the case as long as $f$ is square integrable, which includes all bounded piecewise continuous functions.

Theorem 8.30 (Parseval's Theorem). Suppose $f(x)$ is defined on $[-\pi, \pi]$ and
\[ \int_{-\pi}^{\pi} (f(x))^2\,dx < \infty. \]
Then, the trigonometric Fourier Series of $f(x)$,
\[ S(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \bigl(a_k \cos(kx) + b_k \sin(kx)\bigr), \]
converges to $f(x)$ in norm on $[-\pi, \pi]$. In addition,
\[ \int_{-\pi}^{\pi} (f(x))^2\,dx = \pi \left[ \frac{a_0^2}{2} + \sum_{k=1}^{\infty} \bigl(a_k^2 + b_k^2\bigr) \right]. \tag{8.2} \]

The formula (8.2) is known as Parseval's Identity. The proof of Parseval's Theorem is quite intricate, and too advanced for a course at this level. However, we may appreciate what it says about convergence of Fourier Series, and use it to calculate exact values for some more numerical series.

Example 8.31. Consider $f(x) = x$ with $x \in [-\pi, \pi]$. Then, we have
\[ a_k = 0, \qquad b_k = \frac{2}{k}(-1)^{k+1}. \]
Thus, Parseval's Identity gives
\[ \pi \sum_{k=1}^{\infty} \frac{4}{k^2} = \int_{-\pi}^{\pi} x^2\,dx = \frac{2}{3}\pi^3, \]
and so
\[ \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6}. \]
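Parseval's Identity for this example is easy to test numerically (a sketch we add here; $N$ is an arbitrary choice): the partial sums of $\pi \sum 4/k^2$ approach $\int_{-\pi}^{\pi} x^2\,dx$, with a tail smaller than $4\pi/N$.

```python
import math

# Parseval's Identity for f(x) = x: integral of x^2 over [-pi, pi]
# equals pi * sum b_k^2 with b_k = 2(-1)^(k+1)/k.
lhs = 2 * math.pi**3 / 3
N = 100_000
rhs = math.pi * sum((2.0 / k)**2 for k in range(1, N + 1))

assert abs(lhs - rhs) < 4 * math.pi / N   # tail: pi * sum_{k>N} 4/k^2 < 4pi/N
```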

8.5 Application: The vibrating string

To conclude, we look at a problem which is very similar to Fourier's original motivation, the solution of partial differential equations arising in science. Instead of the heat equation, let's look at the wave equation, which describes a vibrating string.


Consider a string in a violin, guitar, piano, etc., with length $L$. Suppose that the string is attached at its endpoints, $x = 0, L$, but is free to vibrate up and down in a vertical plane in between. We call $u(x, t)$ the shape of the string at the point $x$ and time $t$, so a snapshot of the string at time $t$ gives us a graph $y = u(x, t)$. The fact that the string is attached at the endpoints means that we are imposing boundary conditions,
\[ u(0, t) = 0, \qquad u(L, t) = 0, \qquad \text{for all time } t \in \mathbb{R}. \tag{8.3} \]

The equation of motion for the string is the wave equation,
\[ \frac{\partial^2 u}{\partial t^2} = c^2 \frac{\partial^2 u}{\partial x^2}, \]
where $c$ is a constant which depends on the density and tension of the string.

We can solve this using Fourier Series! Recall from section 8.4.2 that we obtain an orthogonal family of eigenfunctions by solving a boundary-value problem. For the choice (8.3), these are:
\[ v_k(x) = \sin\left(\frac{k\pi}{L} x\right). \]

So we seek a solution in the form of a Fourier Series with time-dependent coefficients,
\[ u(x, t) = \sum_{k=1}^{\infty} a_k(t)\, v_k(x). \]

This is already chosen to satisfy the boundary conditions, so we plug into the wave equation to determine the equation satisfied by the coefficients $a_k(t)$ and get:
\[ a_k''(t) = -\left(\frac{k\pi c}{L}\right)^2 a_k(t), \]
which has as a general solution
\[ a_k(t) = A_k \cos\left(\frac{k\pi c}{L} t + \phi_k\right), \]
where $\phi_k$ is a phase factor.¹

To understand what is going on, look at each individual term in the series,
\[ a_k(t)\, v_k(x) = A_k \cos\left(\frac{k\pi c}{L} t + \phi_k\right) \sin\left(\frac{k\pi}{L} x\right). \]

Each of these terms is a simple standing wave: a fixed wave form in $x$ multiplied by an oscillatory term in $t$ (with phase $\phi_k$) which makes the wave move up and down but with a fixed profile. The profile comes from $v_k(x)$, which has $(k-1)$ nodes (zeros) inside the string, $x \in (0, L)$; see figure 8.7. The time dependence determines the sound produced, which is a single pure frequency:
\[ \omega_k = \frac{kc}{2L}, \qquad k = 1, 2, 3, \dots \]

¹You could write it as $a_k(t) = \alpha_k \cos\frac{k\pi c}{L}t + \beta_k \sin\frac{k\pi c}{L}t$, but by a trig identity the two expressions are the same.


So each term $k = 1, 2, 3, \dots$ produces a different musical note, and the entire wave makes a sound which is a superposition of these different notes!

The $k = 1$ term gives the fundamental tone, $\omega_1 = c/2L$, the lowest note produced by the string. For $k = 2, 3, 4, \dots$, we get $\omega_k = k\omega_1$, the overtones.
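One can check by finite differences (an illustration we add here; the parameters, sample point, and step size are arbitrary choices) that a single standing-wave term satisfies the wave equation $u_{tt} = c^2 u_{xx}$:

```python
import math

# Standing wave u(x,t) = cos(k*pi*c*t/L) * sin(k*pi*x/L) for the string.
k, c, L = 2, 1.5, math.pi
om = k * math.pi * c / L

def u(x, t):
    return math.cos(om * t) * math.sin(k * math.pi * x / L)

# Central second differences approximate u_tt and u_xx at a sample point.
x0, t0, h = 0.7, 0.3, 1e-4
u_tt = (u(x0, t0 + h) - 2 * u(x0, t0) + u(x0, t0 - h)) / h**2
u_xx = (u(x0 + h, t0) - 2 * u(x0, t0) + u(x0 - h, t0)) / h**2

assert abs(u_tt - c * c * u_xx) < 1e-4   # wave equation holds at (x0, t0)
```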

Suppose the length, density, and tension of the string are chosen so that the fundamental frequency is $\omega_1 = 440$ Hz, which is A4, the A just above middle C on a piano. Then, the first overtone $\omega_2 = 880$ Hz gives A5, which is exactly one octave higher. The next overtone $\omega_3 = 1320$ Hz is E6, which is a perfect fifth above A5. Continuing, $\omega_4$ is A6 (two octaves above the fundamental tone $\omega_1$) and $\omega_5$ is C$\sharp$7. When you play A4 on a stringed instrument you are actually hearing a combination of the intended note with a bit of each overtone.

Figure 8.7. The standing wave profiles $v_k$ for the first five harmonics ($k = 1, \dots, 5$) of the vibrating string. Higher frequencies mean more nodes (zeros) in the waveform.

The same principle applies to woodwind instruments. The vibrating string becomes a vibrating column of air, and $u(x, t)$ is the fluctuation of air pressure away from the atmospheric pressure in the room. A flute is an air column which is open to the atmosphere on both ends, and so it gives the same boundary conditions (8.3) as for a stringed instrument, and therefore the same harmonic sequence of frequencies. However, clarinets are different: they have an end with a mouthpiece and a reed, which is not open to the atmosphere and where the pressure is maximal. This changes the boundary condition at the end $x = 0$ to $u_x(0, t) = 0$. The consequence is that the orthogonal basis is different, with different eigenvalues, and a completely different harmonic sequence of overtones!

8.6 Practice problems

1. Find the Fourier Series on $[-\pi, \pi]$ of the following functions: $\sin^2 x$, $\sin x \cos x$, $\sin^3 x$, and $\sin^4 x$. [Hint: you don't need to calculate any integrals!]

2. Calculate the Fourier Series $S(x)$ of $f(x) = \sin(x/2)$, $-\pi < x < \pi$. What is $S(\pi)$?

3. Find the Fourier Series of $f(x) = \cosh x$ and $g(x) = \sinh x$, $-1 < x < 1$. What are the values of the Fourier Series at $x = -1$?


4. Find the Fourier Series on $(-\pi, \pi)$ of
\[ f(x) = \begin{cases} -\cos x, & \text{if } -\pi < x < 0, \\ \cos x, & \text{if } 0 < x < \pi. \end{cases} \]
At what points in $\mathbb{R}$ is the Fourier Series discontinuous?

5. (a) Show that $\{1, \cos(kx) \mid k \in \mathbb{N}\}$ is an orthogonal family on $[0, \pi]$ with inner product
\[ \langle f, g \rangle = \int_0^{\pi} f(x)\, g(x)\,dx. \]
(b) Find the formulas for the coefficients in a Fourier Series in these functions, $f(x) \sim S(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} a_k \cos(kx)$.
(c) Show that $S(x)$ represents the even, $2\pi$-periodic extension of $f(x)$ from $0 < x < \pi$ to the real line. Draw the graph of $S(x)$ on $-3\pi < x < 3\pi$ when $f(x) = x$.

6. (a) Calculate the Fourier Series of $f(x) = |x|$, $-\pi < x < \pi$.
(b) Draw the $2\pi$-periodic function which the series converges to, over the interval $-3\pi < x < 3\pi$.
(c) Does the series converge absolutely or conditionally? Does it converge uniformly? Why or why not?
(d) Use your result to calculate $\displaystyle\sum_{k=1}^{\infty} \frac{1}{(2k-1)^2}$ and $\displaystyle\sum_{k=1}^{\infty} \frac{1}{(2k-1)^4}$.

7. Answer (a)–(c) for $f(x) = \cos(x/4)$, $-\pi < x < \pi$. Use the Fourier coefficients to obtain explicit values for the series $\displaystyle\sum_{k=1}^{\infty} \frac{1}{16k^2 - 1}$, $\displaystyle\sum_{k=1}^{\infty} \frac{(-1)^k}{16k^2 - 1}$, and $\displaystyle\sum_{k=1}^{\infty} \frac{1}{(16k^2 - 1)^2}$.

8. Answer (a)–(c) for $f(x) = \sin(x/4)$, $-\pi < x < \pi$. Use the Fourier coefficients to obtain an explicit value for the series $\displaystyle\sum_{k=1}^{\infty} \frac{k^2}{(16k^2 - 1)^2}$.

9. Consider the vector space $V$ of all $C^2$ functions $u : [0, L] \to \mathbb{R}$ satisfying the boundary conditions $u'(0) = 0$ and $u(L) = 0$, and the linear transformation $Mu = u''(x)$.
(a) Show that for any $u, v \in V$, $\langle Mu, v \rangle = \langle u, Mv \rangle$. [Hint: integrate by parts twice, and use the boundary conditions.]
(b) Suppose $\lambda, \mu$ are eigenvalues of $M$, so there exist nontrivial functions $u, v \in V$ with $Mu = u''(x) = \lambda u(x)$ and $Mv = v''(x) = \mu v(x)$. Show that if $\lambda \ne \mu$, then $\langle u, v \rangle = 0$.
(c) Show that $\left\{ \cos\left(\frac{(2k-1)\pi x}{2L}\right) : k = 1, 2, 3, \dots \right\}$ are eigenfunctions of $M$ with distinct eigenvalues $\lambda_k = -\left(\frac{(2k-1)\pi}{2L}\right)^2$, and conclude that they form an orthogonal family in $V$.


10. (a) Find the Fourier Series $S(x)$ of $f(x) = \cos(x/2)$, $-\pi < x < \pi$.
(b) Graph $S(x)$ for $-3\pi < x < 3\pi$. Does $S(x)$ represent a continuous function on $\mathbb{R}$?
(c) Use (a) to evaluate $\displaystyle\sum_{n=1}^{\infty} \frac{1}{4n^2 - 1}$ and $\displaystyle\sum_{n=1}^{\infty} \frac{(-1)^n}{4n^2 - 1}$.

11. Consider the family of functions $\left\{ \varphi_k(x) = \sin\left[\left(k + \frac{1}{2}\right)x\right], \; k = 0, 1, 2, 3, \dots \right\}$ on $x \in [0, \pi]$.
(a) Show that this family is orthogonal with respect to the inner product
\[ \langle f, g \rangle = \int_0^{\pi} f(x) g(x)\,dx. \]
(b) Give the formula for the Fourier coefficients of a function $f(x)$ with respect to this family,
\[ f(x) \sim S(x) = \sum_{k=0}^{\infty} c_k \varphi_k(x). \]

12. For $x \in [-1, 1]$, define the polynomials
\[ P_0(x) = 1, \quad P_1(x) = x, \quad P_2(x) = 3x^2 - 1, \quad P_3(x) = 5x^3 - 3x. \]
(a) Show that these four functions are orthogonal with respect to the inner product
\[ \langle f, g \rangle = \int_{-1}^{1} f(x) g(x)\,dx. \]
(b) If $f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \dots + a_n x^n$ is a nontrivial polynomial of degree $n$, and $f$ is orthogonal to each of $P_0, P_1, P_2, P_3$, show that the degree $n \ge 4$. [Hint: Assume for a contradiction that $n \le 3$, so $f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3$, and show $a_0 = a_1 = a_2 = a_3 = 0$ (and so $f(x)$ is the zero polynomial).]
(c) Find a polynomial $P_4(x)$ which is of degree 4 and which is orthogonal to each of $P_0, P_1, P_2, P_3$.

(c) Find a polynomial P4(x) which is of order 4 and which is orthogonal to each ofP0, P1, P2, P3.