MATH 2210Q Applied Linear Algebra, Fall 2017
Arthur J. Parzygnat
These are my personal notes. This is not a substitute for Lay’s book. I will frequently reference
both recent versions of this book. The 4th edition will henceforth be referred to as [2] while the 5th
edition will be [3]. In case comments apply to both versions, these two books will both be referred
to as [Lay]. You will not be responsible for any Remarks in these notes. However, everything
else, including what is in [Lay] (even if it’s not here), is fair game for homework, quizzes, and
exams. At the end of each lecture, I provide a list of recommended exercise problems that should
be done after that lecture. Some of these exercises will appear on homework, quizzes, or exams!
I also provide additional exercises throughout the notes which I believe are good to know. You
should also browse other books and do other problems as well to get better at writing proofs and
understanding the material.
Notes in light red are for the reader.
Notes in light green are reminders for me.
When a word or phrase is underlined, that typically means the definition of this word or phrase is
being given.
Contents
Introduction: What is linear algebra and why study it?
1 Linear systems, row operations, and examples
2 Vectors and span
3 Solution sets of linear systems
4 Linear independence and dimension of solution sets
5 Subspaces, bases, and linear manifolds
6 Convex spaces and linear programming
7 Linear transformations and their matrices
8 Visualizing linear transformations
9 Subspaces associated to linear transformations
10 Iterating linear transformations—matrix multiplication
11 Hamming's error correcting code
12 Inverses of linear transformations
13 The signed volume scale of a linear transformation
14 The determinant and the formula for inverse matrices
15 Orthogonality
16 The Gram-Schmidt procedure
17 Least squares approximation
Decision making and support vector machines
18 Markov chains and complex networks
19 Eigenvalues and eigenvectors
20 Diagonalizable matrices
Spectral decomposition and the Stern-Gerlach experiment
21 Solving ordinary differential equations
22 Abstract vector spaces
Acknowledgments
I’d like to thank Philip Parzygnat, Benjamin Russo, Xing Su, Yun Yang, and George Zoghbi for
many helpful suggestions and comments.
Introduction: What is linear algebra and why study it?
Before saying what one studies in linear algebra, let us consider the following scenarios. You should
not feel that you must understand all of these examples. They are merely meant to illustrate the
scope of linear algebra and its applications and to give a better idea of the language used in linear
algebra. In time, as you learn more tools, these examples will make more sense. Furthermore,
some of these examples might be things you've seen before, but perhaps from a new perspective.
In this case, you'll have a relatable example in the back of your mind as you learn more abstract
concepts.
Example 0.1. Queens, New York has several one-way streets throughout its many neighborhoods.
Figure 1 shows an intersection in Middle Village, New York. We can represent the flow of traffic
Figure 1: An intersection in Middle Village, New York in the borough of Queens. This image is obtained from Map data ©2017 Google https://www.google.com/maps/place/Middle+Village,+Queens,+NY/@40.7281297,-73.8802503,18.72z/data=!4m5!3m4!1s0x89c25e6887df03e7:0xef1a62f95c745138!8m2!3d40.717372!4d-73.87425
around 81st and 82nd streets diagrammatically as
[Diagram: the network of one-way streets around 81st St, 82nd St, 58th Ave, and the Queens Midtown Expy.]
Imagine that we send out detectors (such as scouts) to record the average number of cars per hour
along each street. What is the smallest number of scouts we will need to determine the traffic flow
on every street? To answer this question, we first point out one important fact that appears in
several different contexts:
The net flow into an intersection equals the net flow out of an intersection.
From this, it actually follows that the net flow into the network itself is equal to the net flow out
of the network. Each edge connecting any two intersections represents an unknown and each fact
above provides an equation. Hence, this system has 9 unknowns and 4 equations. The minimum
number of scouts needed is therefore 9−4 = 5. This does not mean that you can place these scouts
anywhere and find out what the traffic flow looks like. For example, if you place the 5 scouts on
the following streets
[Diagram: one possible placement of the five scouts on the street network.]
then you still don’t know the traffic flow leaving 58th Ave and 81st Street on the bottom left.
Let’s see what happens explicitly by first sending out 3 scouts, which observe the following traffic
flow per hour
[Diagram: the street network with observed flows of 100, 70, and 30 cars per hour and unknown flows labelled x1 through x6 on the remaining streets.]
The unknown traffic flows, on the streets where we did not send out any scouts, have been labelled
by the variables x1, x2, x3, x4, x5, x6. The equations for "flow in" equals "flow out" are given by
(they are written going clockwise starting at the top left)
100 = x1 + x5
x1 + x2 = 70
x3 = x2 + x4
x4 + x5 = 30 + x6
(0.2)
This system of linear equations can be rearranged in the following way
x1 +x5 = 100
x1 +x2 = 70
x2 −x3 +x4 = 0
x4 +x5 −x6 = 30
(0.3)
which makes it easier to see how to manipulate these expressions algebraically by adding or subtracting
multiples of different rows. When adding these rows, all we ever add are the coefficients
and the variables are just there to remind us of our organization. We can therefore replace these
equations with the augmented matrix
[ 1 0  0 0 1  0 | 100 ]
[ 1 1  0 0 0  0 |  70 ]
[ 0 1 −1 1 0  0 |   0 ]
[ 0 0  0 1 1 −1 |  30 ].   (0.4)
Adding and subtracting rows here corresponds to the same operations for the equations. For
example, subtract row 1 from row 2 to get
[ 1 0  0 0  1  0 | 100 ]
[ 0 1  0 0 −1  0 | −30 ]
[ 0 1 −1 1  0  0 |   0 ]
[ 0 0  0 1  1 −1 |  30 ]   (0.5)
Subtract row 2 from row 3 to get
[ 1 0  0 0  1  0 | 100 ]
[ 0 1  0 0 −1  0 | −30 ]
[ 0 0 −1 1  1  0 |  30 ]
[ 0 0  0 1  1 −1 |  30 ]   (0.6)
Subtract row 4 from row 3 to get
[ 1 0  0 0  1  0 | 100 ]
[ 0 1  0 0 −1  0 | −30 ]
[ 0 0 −1 0  0  1 |   0 ]
[ 0 0  0 1  1 −1 |  30 ]   (0.7)
Multiply row 3 by −1 to get rid of the negative coefficient for the x3 variable (this step is not
necessary and is mostly just for the A E S T H E T I C S)
[ 1 0 0 0  1  0 | 100 ]
[ 0 1 0 0 −1  0 | −30 ]
[ 0 0 1 0  0 −1 |   0 ]
[ 0 0 0 1  1 −1 |  30 ]   (0.8)
This tells us that the initial system of linear equations is equivalent to
x1 +x5 = 100
x2 −x5 = −30
x3 −x6 = 0
x4 +x5 −x6 = 30
(0.9)
or
x1 = 100− x5
x2 = x5 − 30
x3 = x6
x4 = 30 + x6 − x5
(0.10)
so that all of the traffic flows are expressed in terms of just x5 and x6. This choice is arbitrary and
we could have expressed another four traffic flows in terms of the other two (again, not any four,
but some). In this case, x5 and x6 are called free variables. In order to figure out the entire traffic
flow through all streets, we can therefore send out two more scouts to observe x5 and x6. The
scouts discover that x5 = 40 and x6 = 20. Knowing these values, we can figure out the traffic flow
through every street without sending out additional scouts by plugging these values into (0.10)
[Diagram: the street network with all flows now determined: x1 = 100 − 40 = 60, x2 = 40 − 30 = 10, x3 = x6 = 20, and x4 = 30 + 20 − 40 = 10, alongside the observed flows 100, 70, 40, 20, and 30.]
We could have also figured out x1, x2, x3, and x4 by brute force, by just imposing flow conservation
at every intersection without using any of the above methods. For instance, first we can figure out x1.
[Diagram: the same network with x1 = 100 − 40 = 60 determined first from the top-left intersection.]
x2 and x4 can then each be solved immediately
[Diagram: the network with x2 = 70 − 60 = 10 and x4 = 20 + 30 − 40 = 10 determined next.]
and finally x3.
[Diagram: the network with x3 = 10 + 10 = 20 determined, completing the traffic flow.]
Although the second method was much faster in this case, when dealing with large scale intersec-
tions over several blocks, this becomes much more difficult. The first method is more systematic
and applies to all situations.
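For readers who want to experiment, here is a minimal sketch (not part of the notes) of the first, systematic method using Python's sympy library; the matrix entries come from (0.4), and x5 = 40, x6 = 20 are the two extra observations reported above.

```python
# A minimal sketch (assumes sympy is installed): row reduce the augmented matrix (0.4)
# and read off the free variables, then plug in the two extra observations.
from sympy import Matrix

# columns: x1 x2 x3 x4 x5 x6 | right-hand side
aug = Matrix([
    [1, 0,  0, 0, 1,  0, 100],
    [1, 1,  0, 0, 0,  0,  70],
    [0, 1, -1, 1, 0,  0,   0],
    [0, 0,  0, 1, 1, -1,  30],
])

rref, pivot_cols = aug.rref()
print(rref)        # matches (0.8)
print(pivot_cols)  # (0, 1, 2, 3): x1..x4 are pivot variables, x5 and x6 are free

x5, x6 = 40, 20    # the two additional scout observations
x1, x2, x3, x4 = 100 - x5, x5 - 30, x6, 30 + x6 - x5
print(x1, x2, x3, x4)   # 60 10 20 10, as in the completed diagram
```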
Example 0.11. If you ever cook a dish and need to make twice as much or half as much as the
recipe calls for, it’s very easy to alter the quantity of the ingredients: you either double or halve the
quantity, respectively. When you’re making several different kinds of recipes in large quantities,
the situation becomes a bit more complicated. For example, a small bakery produces pancakes,
tres leches cake, strawberry shortcake, and egg tarts. The recipes for these are provided in Table
1. One of the workers in the bakery is responsible for purchasing all the required ingredients and
is told to buy enough for the quantity the bakery plans to make and nothing more. You see the
worker purchase 7 dozen eggs, 12 pounds of strawberries, 8 quarts of heavy cream, 18 pounds of
flour, 30 sticks of butter, 6 pounds of sugar, and a gallon of milk. Eggs are sold by the dozen,
strawberries by the pound, heavy cream by the quart, flour by the pound, butter by the stick,
sugar by the pound, and milk by the gallon. Assuming that only an integral number of batches
for each recipe are made, how many egg tarts do you expect to see the next morning when the
bakery opens?
Let p, t, s, e denote the number of batches of pancakes, tres leches cakes, strawberry shortcakes,
and egg tarts, respectively. Then, these quantities must satisfy the following inequalities
                 Pancakes       Tres leches        Strawberry shortcake   Egg tarts
                 (makes 12)     (makes 8 slices)   (makes 8 slices)       (makes 16 tarts)
eggs             2              6                  0                      6
strawberries     1 + 1/2 lbs    0                  1 + 1/2 lbs            0
heavy cream      1 cup          1 pint             3 cups                 0
flour            3 cups         1 cup              4 cups                 3 + 3/4 cups
butter           4 tbsp         0                  1 + 1/4 cups           1 + 1/3 cups
sugar            3 tbsp         1 cup              1/2 cup                2/5 cup
milk             2 cups         1/3 cup            0                      1/3 cup

Table 1: Ingredients for some recipes. Not all ingredients are listed.
(appearing in the same order as above):
72 ≤ 2p + 6t + 0s + 6e ≤ 84
11 ≤ (3/2)p + 0t + (3/2)s + 0e ≤ 12
28 ≤ 1p + 2t + 3s + 0e ≤ 32
170/3 ≤ 3p + 1t + 4s + (5/3)e ≤ 60
29 ≤ (1/2)p + 0t + (5/2)s + (8/3)e ≤ 30
(0.12)
We’ve ignored the sugar and milk in this system of inequalities to avoid too much clutter. We can
write this as a matrix inequality in the form Ax ≤ b, where
x := (p, t, s, e)^T,   b := (84, 24, 32, 180, 180, −72, −22, −28, −170, −172)^T,   &   A :=
[  2   6    0    6 ]
[  3   0    3    0 ]
[  1   2    3    0 ]
[  9   3   12    5 ]
[  3   0   15   16 ]
[ −2  −6    0   −6 ]
[ −3   0   −3    0 ]
[ −1  −2   −3    0 ]
[ −9  −3  −12   −5 ]
[ −3   0  −15  −16 ].   (0.13)
The positive entries come from the right-hand-side of the inequality (0.12) while the negative entries
come from the left-hand-side of the inequality. It’s highly likely that some of these inequalities
are redundant, but let us ignore this possibility. The method is to eliminate the variables one at
a time until the last remaining variable is expressed in terms of some inequality. This procedure
grows exponentially with the number of variables and equations, and will not be shown here. It
goes by the name Fourier-Motzkin elimination. The solution is given by
x = (3, 7, 4, 6)^T   (0.14)
so that you expect there to be 6 ∗ 16 = 96 egg tarts tomorrow morning.
Example 0.15. A drunk walks out of a bar and onto a sidewalk that has a barrier preventing
him from crossing the street. Assume the sidewalk is infinitely long in both directions and the
drunk takes steps of equal distance every time. Furthermore, assume that the likelihood of walking
up and down the sidewalk is 1/2 each. Label the starting point 0, and choose one direction to be
positive. So, for example, after one step, the drunk is 50% likely to be at the point 1 or −1. After
a total of two steps, the drunk has a 50% chance at being at the point 0. This is because if he was
at 1, then he’s 50% likely to go to 0 and similarly if he was at −1. However, he was 50% likely to
be at the point 1 in the first place. Hence, the net probability is the sum of the two possibilities
(1/2)(1/2) + (1/2)(1/2) = 1/2.   (0.16)
Similarly, he has a 25% chance each of being at the points −2 or 2. This is because there is only
one way to get from position 0 to position 2 in two steps, each step having probability 1/2. What is
the probability that the drunk will be at position n, where n is any integer, after k steps?
The question is asking to figure out the resulting probability distribution after k steps. We
could do this by hand for each k one at a time until we find a pattern, but this would take us too
long. Figure 2 shows what happens after the first few steps. Instead, if we represent the probability
Figure 2: The probability distributions after the first few steps.
distribution for each time step as a function of the position, all we have to do is track what this
function looks like for each k, and for each position. In each time step, there is a transition from
the initial location to a 1/2 probability on each of the two adjacent points.
[Diagram: in one time step, the probability at position i moves with weight 1/2 to each of the adjacent positions i − 1 and i + 1.]
We can represent this as a transition matrix
Ai←j := 1/2 if i = j ± 1, and 0 otherwise,   (0.17)
where i ← j indicates the transition from position j to i. Instead of writing Ai←j, we write Ai,j
for short. The transition matrix after two time steps is given by summing over all intermediate
possibilities weighted by the appropriate probability,
∑j Ai,jAj,k = 1/2 if i = k,  1/4 if i = k ± 2,  and 0 otherwise,   (0.18)
which agrees with our earlier observations. There is a helpful notation that will make finding
successive iterates of this much simpler. Let
δi,j := 1 if i = j, and 0 otherwise   (0.19)
denote the Kronecker delta function. The most important thing about this function is that for
any other function fj labelled by the same set of indices,
∑j δi,jfj = fi.   (0.20)
Using this notation, the transition matrix entries become
Ai,j = (1/2)δi,j−1 + (1/2)δi,j+1.   (0.21)
If our initial probability distribution is given by
fj := δj,0,   (0.22)
which means that the drunk begins at the position j = 0 with probability 1, then after applying
the transition matrix
∑j Ai,jδj,0 = (1/2) ∑j (δi,j−1δj,0 + δi,j+1δj,0) = (1/2)(δi,−1 + δi,1),   (0.23)
which agrees with what we found earlier. The transition matrix after two time steps is
∑j Ai,jAj,k = ∑j ((1/2)δi,j−1 + (1/2)δi,j+1)((1/2)δj,k−1 + (1/2)δj,k+1)
  = (1/4) ∑j (δi,j−1δj,k−1 + δi,j+1δj,k−1 + δi,j−1δj,k+1 + δi,j+1δj,k+1)
  = (1/4) (δi+1,k−1 + δi−1,k−1 + δi+1,k+1 + δi−1,k+1).   (0.24)
Again, applying our initial configuration gives the resulting probability distribution
∑j,k Ai,jAj,kδk,0 = (1/4) ∑k (δi+1,k−1δk,0 + δi−1,k−1δk,0 + δi+1,k+1δk,0 + δi−1,k+1δk,0)
  = (1/4) (δi+1,−1 + δi−1,−1 + δi+1,1 + δi−1,1).   (0.25)
This tells us the probability of being at position i after two steps, which agrees with what we found
earlier. After taking three steps, the transition is given by
∑j,k Ai,jAj,kAk,l = (1/8) ∑k (δi+1,k−1 + δi−1,k−1 + δi+1,k+1 + δi−1,k+1)(δk,l−1 + δk,l+1)
  = (1/8) (δi+1,l−2 + δi+1,l + δi−1,l−2 + δi−1,l + δi+1,l + δi+1,l+2 + δi−1,l + δi−1,l+2)   (0.26)
Starting at the position 0 gives the probability distribution
∑j,k,l Ai,jAj,kAk,lδl,0 = (1/8) (δi+1,−2 + δi+1,0 + δi−1,−2 + δi−1,0 + δi+1,0 + δi+1,2 + δi−1,0 + δi−1,2).   (0.27)
We can check to make sure this is consistent with Figure 2. It describes the probability that the
drunk will end up at position i after 3 steps. From this, we can write down a formal solution to our
problem. The probability that the drunk will end up at position n after k steps is
∑i1,...,ik−1 An,i1Ai1,i2 · · · Aik−2,ik−1Aik−1,0.   (0.28)
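To see this formula in action numerically, here is a small sketch (not part of the notes) that truncates the infinite sidewalk to finitely many positions and iterates the transition matrix; the truncation size N is an arbitrary choice, and edge effects are ignored for small k.

```python
# A small numerical sketch (not from the notes): truncate the sidewalk to positions
# -N..N and iterate the transition matrix of (0.17) k times.
import numpy as np

N = 10                      # positions -N, ..., N
size = 2 * N + 1
A = np.zeros((size, size))
for j in range(size):
    if j - 1 >= 0:
        A[j - 1, j] = 0.5   # step to the left with probability 1/2
    if j + 1 < size:
        A[j + 1, j] = 0.5   # step to the right with probability 1/2

f = np.zeros(size)
f[N] = 1.0                  # start at position 0 (index N) with probability 1

k = 3
for _ in range(k):
    f = A @ f               # one time step

for pos in range(-N, N + 1):
    if f[pos + N] > 0:
        print(pos, f[pos + N])   # after k = 3: 1/8, 3/8, 3/8, 1/8 at -3, -1, 1, 3
```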
Example 0.29. Newton’s force law states that the force F acting on an object of mass m is
proportional to the acceleration a of the object
F = ma. (0.30)
This is an example of a linear relationship because the acceleration depends linearly on the force
since a = F/m. As a special case, imagine throwing a ball vertically up and letting it fall back down.
If we declare up to be the positive direction, the force on the ball, ignoring air resistance and other
factors, is
F = −mg, (0.31)
where g is the gravitational acceleration, approximately 9.8 meters per second per second, i.e. meters
per second². Therefore, the acceleration is just a = −g. What do all the possible trajectories look
like? If v denotes the velocity and x the displacement from the ground, then these variables are
all related in terms of derivatives with respect to time
v = dx/dt   &   a = dv/dt.   (0.32)
So to solve for the position for all time, first integrate the latter to solve for v as a function of time:
∫ dv = ∫ a(t) dt = −∫ g dt   ⟹   v(t) = −gt + v0,   (0.33)
where v0 is the initial velocity (it is the integration constant). Using this, we can then use the first
equation to solve for the position x as a function of time:
∫ dx = ∫ v(t) dt = ∫ (−gt + v0) dt   ⟹   x(t) = −(g/2)t² + v0t + x0,   (0.34)
where x0 is the initial position (if someone was on a ladder or dug a hole in the ground, the position
might be different from 0). What do we notice about this solution? It depends on two parameters.
Furthermore, if we write these parameters as a pair of numbers (x0, v0), then we notice that any
linear combination of these numbers (adding them component by component) provides us with
another solution to the problem. More precisely, suppose that
x(t) = −(g/2)t² + v0t + x0   &   x′(t) = −(g/2)t² + v′0t + x′0   (0.35)
are two solutions (the prime here does not denote the derivative; it is merely used to distinguish
two possible solutions with two different pairs of initial conditions). Adding these solutions is
defined by adding their corresponding initial conditions¹
(x ⊕ x′)(t) := −(g/2)t² + (v0 + v′0)t + (x0 + x′0).   (0.36)
Then, the acceleration associated to this motion is
d²/dt² ((x ⊕ x′)(t)) = d²/dt² (−(g/2)t² + (v0 + v′0)t + (x0 + x′0)) = −g,   (0.37)
¹The notation A := B is read "A is defined by B" and defines the left-hand-side A in terms of the right-hand-side B. Also, ⊕ is used instead of + because the addition rule is not the usual addition of numbers. Notice how the quadratic term was not added.
so that x ⊕ x′ is another solution. We can also scale the initial conditions by any other real number
and obtain yet another solution. If c is a real number then the scaled solution is defined by²
(c ⊙ x)(t) := −(g/2)t² + cv0t + cx0.   (0.38)
The acceleration is the same for c ⊙ x as it is for x. Note that a negative velocity makes sense if we
dug a hole in the ground and/or if we threw the ball from a ladder or the leaning tower of Pisa.
Also, standing on the ground and not throwing the ball at all means the ball will stay at rest.
Furthermore, no other solutions are possible. Notice that we can obtain any solution by taking
linear combinations of the solutions (0, 1) (throwing the ball up from the ground with a velocity
of 1 meter per second) and (1, 0) (letting go of the ball off of a ladder 1 meter above the ground).
Thus, these two solutions span the space of all solutions. Furthermore, one cannot obtain one
solution via algebraic manipulations without using the other one, so the two solutions are linearly
independent. All of this is summarized by saying that the solutions of the 2nd order differential
equation −mg = md2xdt2
is a 2-dimensional vector space and these two solutions form a basis for the
set of all solutions.
Now let’s add a slight level of complication by assuming there is air resistance and that it
is of the form −γv, where γ is some non-negative constant. This means that the air resistance
is proportional to the velocity but pushes the object in the direction opposite to its motion.
The force is then
F = −mg − γv (0.39)
and Newton’s law then becomes
−mg − γv = ma. (0.40)
In terms of the velocity, this looks like
−mg − γv = m dv/dt.   (0.41)
Applying a similar procedure to the above (this time the integration is a bit more complicated and is
left as an exercise to refresh your memory on calculating integrals), the velocity as a function
of time is
v(t) = (v0 + mg/γ) exp(−(γ/m)t) − mg/γ,   (0.42)
where v0 is the initial velocity and exp denotes the exponential. The position is therefore
x(t) = (m/γ)(v0 + mg/γ)(1 − exp(−(γ/m)t)) − (mg/γ)t + x0,   (0.43)
where x0 is the initial position. We add solutions x ⊕ x′ and scale them c ⊙ x just as before (by
adding and scaling the initial conditions, respectively). The space of all solutions is therefore also
2-dimensional, just as before!
²The notation ⊙ for multiplying a scalar c by a solution x(t) is used instead of · or just concatenation (namely, cx(t)) because the scalar multiplication rule is not the usual scalar multiplication of a number by a polynomial. Notice how the quadratic term was not affected.
What are some of the common themes that we used to solve both of these problems? First,
notice that we introduced a variable, the velocity v, as an intermediate variable to solve for the
position x as a function of time t. Then, we replaced the acceleration a with the derivative of
velocity, a = dv/dt, and solved for v first before then using v = dx/dt to solve for the position. This
method is used to turn a 2nd order linear differential equation into a system of two 1st order linear
differential equations.
F(x, dx/dt, t) = m d²x/dt²   ⟺   F(x, v, t) = m dv/dt   and   v = dx/dt   (0.44)
Since the force depends in general on position, velocity, and maybe even time, we write F as
F(x, dx/dt, t) or F(x, v, t) (in the above two cases, we only had a situation where F was either
independent of these variables or F only depended on the velocity). Furthermore, we saw that all
solutions depended on exactly two initial conditions and that these initial conditions parametrized
the set of all solutions. Rather than solving each problem separately, wouldn’t it be nice to know
that all solutions can be parametrized in such a way with just two variables? Wouldn’t it be
even better if there was some systematic way to solve systems like (0.44) without performing any
integrals whatsoever? We will be able to solve systems like (0.44) under the additional assumption
that the force F is a linear function in x and v without calculating any integrals in this course!
For example, for a spring with spring constant k and a damping coefficient of γ, the force is
F = −kx− γv and so the system of 1st order differential equations is given by
−(k/m)x − (γ/m)v = dv/dt
0·x + 1·v = dx/dt
(0.45)
If you’ve taken a course in differential equations, you should be able to solve this (the solutions
will depend on what γ is). But if you’ve only taken linear algebra, you can solve it, too! The four
coefficients of x and v above form a matrix of numbers
[ −k/m  −γ/m ]
[   0      1 ]   (0.46)
and the initial conditions form a vector of numbers
[ v0 ]
[ x0 ].   (0.47)
Matrices are objects that act on vectors and spit out vectors. Square matrices (ones for which
the number of columns equals the number of rows) can be exponentiated to produce yet another
matrix, so that exp of the matrix in (0.46) makes sense. The solution to the 1st order differential
equation is then just
[ v(t) ]         (   [ −k/m  −γ/m ] ) [ v0 ]
[ x(t) ]  =  exp ( t [   0      1 ] ) [ x0 ].   (0.48)
The exponential of a matrix is another matrix and so the solution can be written more explicitly
after working out what this exponential is. Even when F is not linear in x and v, we will be able
to study the general behavior of solutions near their equilibrium points.
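As an illustration of (0.48), here is a small numerical sketch (not part of the notes) using scipy's matrix exponential; the values of m, k, γ and the initial conditions are arbitrary sample choices, and the coefficient matrix below is written with its columns ordered to match the state vector (v, x).

```python
# A small numerical sketch (not from the notes) of the matrix-exponential solution (0.48)
# for the damped spring.  State vector ordered as (v, x).
import numpy as np
from scipy.linalg import expm

m, k, gamma = 1.0, 4.0, 0.5           # sample values (assumptions, not from the notes)
M = np.array([[-gamma / m, -k / m],   # dv/dt = -(gamma/m) v - (k/m) x
              [1.0,         0.0  ]])  # dx/dt = v

v0, x0 = 0.0, 1.0                     # release from rest at x = 1
state0 = np.array([v0, x0])

for t in [0.0, 0.5, 1.0, 2.0]:
    v_t, x_t = expm(t * M) @ state0   # (v(t), x(t)) = exp(tM) (v0, x0)
    print(f"t = {t:3.1f}   v = {v_t:+.4f}   x = {x_t:+.4f}")
```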
All of these examples have some features in common. In particular, they exhibit linear behavior
of some sort. However, each system is quite different and one might think that to properly analyze
these systems, one needs to work with each system separately. To a large extent, this is false.
Instead, if one can abstract the crucial properties of linearity more precisely without the particular
model one is looking at, then one can study these properties and make conclusions abstractly. Then
by going back to the particular problem, one can apply these conclusions to say something about
the particular system.
Linear algebra is the study of these abstract properties.
Not all problems in nature behave in such a linear fashion. Nevertheless, certain aspects of the
system can be approximated linearly. This is where the techniques of linear algebra apply.
1 Linear systems, row operations, and examples
Linear algebra is the study of systems of linear equations. Many physical situations are described
by non-linear equations. However, the first order approximations of such systems are always linear.
Linear systems are much simpler to solve and give a decent approximation to the local behavior
of a physical system.
Definition 1.1. A linear system (or a system of linear equations) in a finite number of variables
is a collection of equations of the form
a11x1 + a12x2 + · · ·+ a1nxn = b1
a21x1 + a22x2 + · · ·+ a2nxn = b2
...
am1x1 + am2x2 + · · ·+ amnxn = bm,
(1.2)
where the aij are real numbers (typically known constants), the bi are real numbers (also typically
known values), and the xj are the variables (which we would often like to solve for). The solution
set of a linear system (1.2) is the collection of all (x1, x2, . . . , xn) that satisfy (1.2). A linear system
where the solution set is non-empty is said to be consistent. A linear system where the solution
set is empty is said to be inconsistent.
It helps to start off immediately with some examples. We will slowly develop a more formal
and rigorous approach to linear algebra as the semester progresses.
Example 1.3. Consider the linear system given by
−x − y + z = −2
−2x + y + z = 1   (1.4)
These two equations are plotted in Figure 3.
Figure 3: A plot of the equations −x− y + z = −2 and −2x+ y + z = 1.
It is clear from this picture that there are solutions, in fact a line's worth of solutions instead
of a unique one (the intersection of the two planes is the set of solutions). How can we describe
this line explicitly? Looking at (1.4), we can add the two equations to get³
−3x + 2z = −1 ⇐⇒ z = (1/2)(3x − 1)   (1.5)
We can also subtract the second equation from the first to get
x − 2y = −3 ⇐⇒ y = (1/2)(3 + x).   (1.6)
Hence, the set of points given by
(x, (1/2)(3 + x), (1/2)(3x − 1))   (1.7)
as x varies over real numbers, are all solutions of (1.4). We can plot this in Figure 4.
Figure 4: A plot of the equations −x − y + z = −2 and −2x + y + z = 1 together with
the intersection shown in red and given parametrically as x ↦ (x, (1/2)(3 + x), (1/2)(3x − 1)).
Hence, (1.4) is an example of a consistent system.
In this example, we saw that not only could we find solutions, but there were infinitely many
solutions. Sometimes, a solution to a linear system need not exist at all!
Example 1.8. Let
2x + 3y = 5
4x + 6y = −2   (1.9)
be two linear equations in the variables x and y. There is no solution to this system. If there were
a solution, then dividing the second equation by 2 would give 2x + 3y = −1, which together with the first equation gives 5 = −1, which is impossible.⁴ This can
Figure 5: A plot of the equations 2x+ 3y = 5 and 4x+ 6y = −2.
also be seen by plotting these two equations in the plane as in Figure 5. These two lines do not
intersect. Hence, (1.9) is an example of an inconsistent system.
To test yourself that you understand these definitions, try to answer the following true or false
questions.
Problem 1.10. State whether the following claims are True or False. If the claim is true, be
able to precisely deduce why the claim is true. If the claim is false, be able to provide an explicit
counter-example.
(a) If a linear system has infinitely many solutions, then the linear system is inconsistent.
(b) If a linear system is consistent, then it has infinitely many solutions.
(c) Every linear system of the form
a11x1 + a12x2 = 0
a21x1 + a22x2 = 0   (1.11)
is consistent.
Many of these questions have very simple answers. The difficulty is not in knowing just which
statement is true or false but being able to prove your claim. At first, many of the claims we
will prove will seem insultingly simple. The reason for proving them therefore is not so much to
convince you of their validity but to get used to the way in which proofs are done in the simplest
of examples. Therefore, we will present the solutions for now, but will eventually leave many such
exercises throughout the notes.
Answer. you found me!
(a) False: A counterexample is in Example 1.3, which has infinitely many solutions.
³The ⇐⇒ symbol means "if and only if," which in this context means that the two equations are equivalent.
⁴This is an example of a proof by contradiction.
(b) False: A counterexample will be presented shortly in Problem 1.12 below. The linear system
described there is consistent and has only one solution.
(c) True: Setting x1 = 0 and x2 = 0 gives one solution regardless of what a11, a12, a21, and a22 are.
In general, you should think about every definition that you are introduced to and be able
to relate it to examples and general situations. Always compare definitions to understand the
differences if some seem similar. You may be exposed to new true and false questions on quizzes
and/or exams that test your ability to identify the truth of such claims. We will now
go through a more complicated and challenging linear system where it will be useful to introduce
the concept of a matrix.
Problem 1.12 (Exercise 1.1.33 in [Lay]). The temperature on the boundary of a cross section of
a metal beam is fixed and known but is unknown in the intermediate points on the interior
[Diagram (1.13): a square cross section of the beam with interior points T1, T2, T3, T4 and fixed boundary temperatures of 10, 10, 20, 20, 30, 30, 40, 40 along the surrounding edges.]
Assume the temperature at these intermediate points equals the average of the temperature at
the nearest neighboring points.5 Write a system of linear equations to describe the temperatures
T1, T2, T3, and T4.
Answer. The system of equations is given by
T1 = (1/4)(10 + 20 + T2 + T4)
T2 = (1/4)(T1 + 20 + 40 + T3)
T3 = (1/4)(T4 + T2 + 40 + 30)
T4 = (1/4)(10 + T1 + T3 + 30)
(1.14)
Rewriting them in the form provided above gives
4T1 − 1T2 + 0T3 − 1T4 = 30
−1T1 + 4T2 − 1T3 + 0T4 = 60
0T1 − 1T2 + 4T3 − 1T4 = 70
−1T1 + 0T2 − 1T3 + 4T4 = 40.
(1.15)
5This is true to a good approximation and is in fact how approximation techniques can be used to solve problems
like this though the mesh will usually be much finer, and the boundary might not look so nice. Furthermore, the
solution we are obtaining is the steady state solution, which is what the temperatures will be after you wait long
enough. For instance, if you dumped the beam into an ice bath, it would take time for the temperatures to be
stable on the inside of the beam so that this method would work. We use the phrase “steady state” instead of
“equilibrium” because something is forcing the temperatures to be different on the different edges of the beam.
Is there a solution for the temperatures in the previous problem? If there is a solution, is it
unique? Notice that the coefficients and numbers in (1.15) can be put together in an array⁶
[  4 −1  0 −1 | 30 ]
[ −1  4 −1  0 | 60 ]
[  0 −1  4 −1 | 70 ]
[ −1  0 −1  4 | 40 ]   (1.16)
This augmented matrix will aid in implementing calculations to solve for the temperatures. From
a course in algebra, you might guess that one way to solve for the temperatures is to solve for
one and then plug in this value successively into the other ones. This becomes difficult when we
have more than two variables. Some things we can do, which are more effective, are adding linear
combinations of equations within the system (1.15). For instance, subtracting row 2 of (1.15) from
row 4 gives
   −1T1 + 0T2 − 1T3 + 4T4 = 40
 −(−1T1 + 4T2 − 1T3 + 0T4 = 60)
    0T1 − 4T2 + 0T3 + 4T4 = −20   (1.17)
for row 4. We know we can do this because all we are doing is adding two equations of the form
A = B and C = D and obtaining A+C = B+D. This is based on the assumption that a solution
exists in the first place. We can also multiply this equation by 1/4 without changing the values of
the variables. This gives
0T1 − 1T2 + 0T3 + 1T4 = −5. (1.18)
From this, we see that we are only manipulating the entries in the augmented matrix (1.16) and
we don’t have to constantly rewrite all the T variables. In other words, the augmented matrix
becomes
[  4 −1  0 −1 | 30 ]
[ −1  4 −1  0 | 60 ]
[  0 −1  4 −1 | 70 ]
[  0 −1  0  1 | −5 ]   (1.19)
after these two row operations. If we could get rid of T2 from this last row, we could solve for T4 (or
vice versa). Similarly, we should try to solve for all the other temperatures by finding combinations
of rows to eliminate as many entries as possible from the left-hand-side of the augmented matrix. This left-
hand-side of the augmented matrix is just called a matrix.
Problem 1.20 (Exercise 1.1.34 in [Lay]). Solve the system of linear equations in (1.15).
Answer. Let’s begin by adding 4 of row 2 to row 10 15 −4 −1 270
−1 4 −1 0 60
0 −1 4 −1 70
0 −1 0 1 −5
. (1.21)
⁶[Lay] does not draw a vertical line to separate the two sides. I find this confusing. We will always draw this
line to be clear.
Add 15 of row 4 to row 1:
[  0  0 −4 14 | 195 ]
[ −1  4 −1  0 | 60 ]
[  0 −1  4 −1 | 70 ]
[  0 −1  0  1 | −5 ]   (1.22)
Subtract row 4 from row 3:
[  0  0 −4 14 | 195 ]
[ −1  4 −1  0 | 60 ]
[  0  0  4 −2 | 75 ]
[  0 −1  0  1 | −5 ]   (1.23)
Add row 3 to row 1:
[  0  0  0 12 | 270 ]
[ −1  4 −1  0 | 60 ]
[  0  0  4 −2 | 75 ]
[  0 −1  0  1 | −5 ]   (1.24)
Divide row 1 by 6:
[  0  0  0  2 | 45 ]
[ −1  4 −1  0 | 60 ]
[  0  0  4 −2 | 75 ]
[  0 −1  0  1 | −5 ]   (1.25)
Add row 1 to row 3 and subtract half of row 1 from row 4:
[  0  0  0  2 | 45 ]
[ −1  4 −1  0 | 60 ]
[  0  0  4  0 | 120 ]
[  0 −1  0  0 | −27.5 ]   (1.26)
Add 4 of row 4 to row 2 and divide row 3 by 4:
[  0  0  0  2 | 45 ]
[ −1  0 −1  0 | −50 ]
[  0  0  1  0 | 30 ]
[  0 −1  0  0 | −27.5 ]   (1.27)
Add row 3 to row 2:
[  0  0  0  2 | 45 ]
[ −1  0  0  0 | −20 ]
[  0  0  1  0 | 30 ]
[  0 −1  0  0 | −27.5 ]   (1.28)
Multiply rows 2 and 4 by −1 and divide row 1 by 2:
[ 0 0 0 1 | 22.5 ]
[ 1 0 0 0 | 20 ]
[ 0 0 1 0 | 30 ]
[ 0 1 0 0 | 27.5 ]   (1.29)
In other words, we have found a solution
T1 = 20
T2 = 27.5
T3 = 30
T4 = 22.5
[Diagram (1.30): the beam cross section from (1.13) with the solved interior temperatures 20, 27.5, 30, and 22.5 shown at the corresponding interior points alongside the boundary temperatures.]
Because it helps to visualize this the same way, we can permute the rows and still have the same
equations describing our problem
[ 1 0 0 0 | 20 ]
[ 0 1 0 0 | 27.5 ]
[ 0 0 1 0 | 30 ]
[ 0 0 0 1 | 22.5 ]   (1.31)
This is another example of a row operation.
You should check these solutions by plugging them back into the original linear system (1.15).
In total, we have used three row operations to help us solve linear systems:
(a) scaling rows,
(b) adding rows, and
(c) permuting rows.
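For checking hand computations like the one above, here is a quick sketch (not part of the notes) that solves the system (1.15) numerically with numpy.

```python
# A quick check (not from the notes) of Problem 1.20: solve the 4x4 system (1.15)
# directly and compare with the hand row reduction.
import numpy as np

A = np.array([[ 4., -1.,  0., -1.],
              [-1.,  4., -1.,  0.],
              [ 0., -1.,  4., -1.],
              [-1.,  0., -1.,  4.]])
b = np.array([30., 60., 70., 40.])

T = np.linalg.solve(A, b)
print(T)   # [20.  27.5 30.  22.5] -> T1, T2, T3, T4
```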
In this situation, we were lucky and a solution existed and was unique. In problem 1.12, there
is only one element in the solution set. Occasionally, two arbitrary linear systems may have the
same set of solutions.
Definition 1.32. Two linear systems of equations with the same variables that have the same set
of solutions are said to be equivalent.
Hence, the two linear systems of equations given in (1.15) and (1.31) are equivalent.
As we do more problems, we get familiar with faster methods of solving systems of linear
equations. We start with another problem from circuits with batteries and resistors.
Problem 1.33. Consider a circuit of the following form
[Circuit diagram: a two-loop circuit containing a 2 V battery, a 6 V battery, and three resistors.]
Here the jagged lines represent resistors and the two parallel lines, with one shorter than the
other, represent batteries with the positive terminal on the longer side. The units of resistance
are Ohms and the units for voltage are Volts. Find the current (in units of Amperes) through each
resistor along with the direction of current flow.
Answer. Before solving this, we recall a crucial result from physics, which is
Kirchhoff’s rule: the voltage difference across any closed loop in a circuit with resistors and
batteries is always zero.
Across a resistor, the voltage drop is the current times the resistance (this is called Ohm’s law).
Across a battery from the negative to positive terminal, there is a voltage increase given by the
voltage of the battery. There is also the rule that says current is always conserved, meaning that
at a junction, “current in” equals “current out”, just as in Example 0.1. Knowing this, we label
the currents in the wires by I1, I2, and I3 as follows.
[Circuit diagram: the same circuit with currents labelled I1 (through the 4 Ohm resistor), I2 (through the 2 Ohm resistor), and I3 (through the 1 Ohm resistor), each with an arbitrarily chosen direction.]
The directionality of these currents has been chosen arbitrarily. Conservation of current gives
I1 = I2 + I3. (1.34)
Kirchhoff’s rule for the left loop in the circuit gives
2− 4I1 − 1I3 = 0 (1.35)
and for the right loop gives
− 6− 2I2 + 1I3 = 0. (1.36)
These are three equations in three unknowns.
If you were lost up until this point, that’s fine. You can start by assuming the following form
for the linear system of equations.
Rearranging them gives
1I1 − 1I2 − 1I3 = 0
0I1 − 2I2 + 1I3 = 6
4I1 + 0I2 + 1I3 = 2
(1.37)
and putting it in augmented matrix form gives
[ 1 −1 −1 | 0 ]
[ 0 −2  1 | 6 ]
[ 4  0  1 | 2 ]   (1.38)
To solve this, we perform row operations. Subtract 4 times row 1 from row 3:
[ 1 −1 −1 | 0 ]
[ 0 −2  1 | 6 ]
[ 0  4  5 | 2 ]   (1.39)
Adding 2 of row 2 to row 3 gives
[ 1 −1 −1 | 0 ]
[ 0 −2  1 | 6 ]
[ 0  0  7 | 14 ]   (1.40)
The matrix is now in echelon form (more on this after we solve the actual problem). Dividing row
3 by 7 gives
[ 1 −1 −1 | 0 ]
[ 0 −2  1 | 6 ]
[ 0  0  1 | 2 ]   (1.41)
Adding row 3 to row 1 and subtracting row 3 from row 2 gives
[ 1 −1 0 | 2 ]
[ 0 −2 0 | 4 ]
[ 0  0 1 | 2 ]   (1.42)
Divide row 2 by −2:
[ 1 −1 0 | 2 ]
[ 0  1 0 | −2 ]
[ 0  0 1 | 2 ]   (1.43)
Add row 2 to row 1:
[ 1 0 0 | 0 ]
[ 0 1 0 | −2 ]
[ 0 0 1 | 2 ]   (1.44)
The matrix is now in reduced echelon form (more on this later) and we have found our solution
I1 = 0 A
I2 = −2 A
I3 = 2 A
(1.45)
The negative sign means that the current is actually flowing in the opposite direction to what we
assumed.
You should check these solutions by plugging them back into the original linear system.
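Here is one way (not part of the notes) to do that check numerically with numpy, using the coefficient matrix from (1.38).

```python
# A quick numerical check (not from the notes) of the circuit solution (1.45).
import numpy as np

A = np.array([[1., -1., -1.],   # I1 - I2 - I3 = 0   (current conservation)
              [0., -2.,  1.],   # -2 I2 + I3 = 6     (right loop)
              [4.,  0.,  1.]])  # 4 I1 + I3 = 2      (left loop)
b = np.array([0., 6., 2.])

I = np.linalg.solve(A, b)
print(I)   # [ 0. -2.  2.] -> I1 = 0 A, I2 = -2 A, I3 = 2 A
```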
Go back to Example 0.1. You should now have enough experience to understand it.
Recommended Exercises. Exercises 12, 16, 18, 20, and 28 in Section 1.1 of [3]. Be able to show
all your work, step by step! Do not use calculators or computer programs to solve any problems!
In this lecture, we finished Section 1.1 and worked through parts of Section 1.6 of [Lay].
2 Vectors and span
Given a linear system of equations as in (1.2), which is written as an augmented matrix as
[ a11 a12 · · · a1n | b1 ]
[ a21 a22 · · · a2n | b2 ]
[  ⋮   ⋮    ⋱   ⋮   |  ⋮ ]
[ am1 am2 · · · amn | bm ],   (2.1)
an echelon form of such an augmented matrix is an equivalent augmented matrix whose matrix
components (to the left of the vertical line) satisfy the following conditions.
(a) All nonzero rows are above any rows containing only zeros.
(b) The first nonzero entry (from the left), also known as a pivot, of any nonzero row is always to the right
of the first nonzero entry of the row above it.
(c) All entries in the column below a pivot are zeros.
The column corresponding to a pivot is called a pivot column.
A matrix is in reduced echelon form if in addition the following hold.
(d) All pivots are 1.
(e) The pivots are the only nonzero entries in the corresponding pivot columns.
Exercise 2.2. State whether the following matrices are in echelon form. If they are not, use row
operations to find an equivalent matrix that is in echelon form.
(a)
[ 5 0 1 −1 3 ]
[ 0 0 2  1 0 ]
[ 0 0 0  0 0 ]

(b)
[ 5 0 1 −1 3 ]
[ 0 0 0  0 0 ]
[ 0 0 2  1 0 ]

(c)
[ 5 0 1 −1 3 ]
[ 5 0 2  1 0 ]
[ 0 0 0  0 0 ]
Do the same thing for the previous examples but replace “echelon” with “reduced echelon.”
It is a fact that the reduced row echelon form of a matrix is always unique provided that the
linear system corresponding to it is consistent. Furthermore, a linear system is consistent if and
only if an echelon form of the augmented matrix does not contain any rows of the form
[ 0 · · · 0 | b ]   with b nonzero.   (2.3)
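If you want to verify your answers to Exercise 2.2, a sketch like the following (not part of the notes) computes the reduced echelon form with sympy; the matrix shown is the one from part (a).

```python
# A minimal sketch (not from the notes): sympy's rref() returns the reduced echelon
# form and the pivot columns, which can be used to check Exercise 2.2.
from sympy import Matrix

M = Matrix([[5, 0, 1, -1, 3],
            [0, 0, 2,  1, 0],
            [0, 0, 0,  0, 0]])

R, pivots = M.rref()
print(R)        # reduced echelon form (pivots scaled to 1, zeros above and below them)
print(pivots)   # (0, 2): the pivot columns
```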
In our earlier examples of temperatures in a beam and currents in a circuit, the arrays of numbers
given by
(T1, T2, T3, T4)^T   &   (I1, I2, I3)^T   (2.4)
are examples of vectors in R4 and R3, respectively. Here R is the set of real numbers and Rn is
the set of n-tuples of real numbers, where n is a positive integer (one of 1, 2, 3, 4, . . .),
Rn := {(x1, . . . , xn) : xi ∈ R ∀ i = 1, . . . , n}.   (2.5)
Given two vectors in Rn,
(a1, . . . , an)^T   &   (b1, . . . , bn)^T,   (2.6)
we can take their sum, defined by
(a1, . . . , an)^T + (b1, . . . , bn)^T := (a1 + b1, . . . , an + bn)^T.   (2.7)
We can also scale each vector by any number c in R by
c (a1, . . . , an)^T := (ca1, . . . , can)^T.   (2.8)
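These two operations are exactly componentwise addition and scaling of arrays; a tiny sketch (not part of the notes, with arbitrary sample vectors):

```python
# A tiny illustration (not from the notes) of (2.7) and (2.8) with numpy arrays.
import numpy as np

a = np.array([1.0, -2.0, 0.5])
b = np.array([3.0,  4.0, -1.0])

print(a + b)    # componentwise sum, as in (2.7): [ 4.   2.  -0.5]
print(2.5 * a)  # scaling by c = 2.5, as in (2.8): [ 2.5 -5.    1.25]
```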
The above descriptions of vectors are algebraic and we’ve illustrated their algebraic structures
(addition and scaling). Vectors can also be visualized when n = 1, 2, 3. Vectors are more than just
points in space. For example, a billiard ball on an infinite pool table has a well-defined position.
[Figure: a point marking the billiard ball's position.]
You see it. It’s right there. When it moves in one instant of time, the difference from the final
position to the initial position provides us with a length together with a direction.
[Figure: the initial and final positions of the ball, whose difference defines a length together with a direction.]
Thus, to define vectors, a reference point must be specified. In the above example, the reference
point is the initial position. A fixed observer (one that does not move in time) can also act as a
reference point. This provides any other point with a length and a direction.
[Figure: a fixed observer o and a point, giving that point a length and a direction relative to the observer.]
In many applications, the reference point will be called “zero” because often, the numerical values
of the entries can be taken to be 0. One such example is the vector of temperatures, currents,
traffic flows, etc. We will often write vectors with an arrow over them as in ~a and ~b when n, the
number of entries of said vector, is understood.
Definition 2.9. Let S := {~v1, . . . , ~vm} be a set of m vectors in Rn. The span of S is the set of all
vectors of the form⁷
∑_{i=1}^{m} ai~vi ≡ a1~v1 + · · · + am~vm,   (2.10)
where the ai can be any real number. For a fixed set of ai, the right-hand-side of (2.10) is called
a linear combination of the vectors ~vi.
In set-theoretic notation, we would write the span in this definition as
span(S) := { ∑_{i=1}^{m} ai~vi ∈ Rn : a1, . . . , am ∈ R }.   (2.11)
The span of vectors in R2 and R3 can be visualized quite nicely.
⁷Please do not confuse the notation ~vi with the components of the vector ~vi. It can be confusing with these
indices, but to be very clear, we could write the components of the vector ~vi as ((vi)1, . . . , (vi)n)^T.
Problem 2.12. In the following figure, vectors ~u,~v, ~w1, ~w2, ~w3, ~w4 are depicted with a grid showing
unit markings.
[Figure (2.13): the vectors ~u, ~v, ~w1, ~w2, ~w3, ~w4 drawn in the plane on a grid with unit markings.]
What linear combinations of ~u and ~v will produce the other bullets drawn in the graph?
Answer. To answer this question, it helps to draw the integral grid associated to the vectors ~u
and ~v. This is the set of linear combinations
a~u+ b~v (2.14)
such that a, b ∈ Z, i.e. a and b are both integers. The intersections of the red lines in the following
image depict these integral linear combinations.
[Figure (2.15): the same picture with the integral linear combinations of ~u and ~v marked by the intersections of a red grid.]
As you can see, the bullets lie exactly on these intersections. Hence, we should be able to find
integers ai, bi ∈ Z such that
~wi = ai~u+ bi~v (2.16)
for all i = 1, 2, 3, 4. For example, ~w1 = ~u+ ~v so that a1 = 1 and b1 = 1.
The intersections of the red grid only depict integral linear combinations, but it’s clear that
scaling these vectors by any real number should fill the entire plane.
Problem 2.17. In the previous example, show that every vector
(b1, b2)^T   (2.18)
can be written as a linear combination of ~u and ~v. Thus {~u, ~v} spans R2.
Answer. To see this, note that
~u = (2, −1)^T   &   ~v = (−1, 2)^T.   (2.19)
To prove the claim, we must find real numbers a1 and a2 such that
a1~u + a2~v = (b1, b2)^T.   (2.20)
But the left-hand-side is given by
a1~u + a2~v = a1 (2, −1)^T + a2 (−1, 2)^T = (2a1, −a1)^T + (−a2, 2a2)^T = (2a1 − a2, −a1 + 2a2)^T,   (2.21)
using (2.8) for the second equality and (2.7) for the third.
Therefore, we need to solve the linear system of equations given by
2a1 − a2 = b1
−a1 + 2a2 = b2,   (2.22)
which should by now be a familiar procedure. Put it in augmented matrix form
[  2 −1 | b1 ]
[ −1  2 | b2 ]   (2.23)
Permute the first and second rows
[ −1  2 | b2 ]
[  2 −1 | b1 ]   (2.24)
Add two of row 1 to row 2 to get
[ −1 2 | b2 ]
[  0 3 | b1 + 2b2 ]   (2.25)
This is now in echelon form. Multiply row 1 by −1 and divide row 2 by 3
[ 1 −2 | −b2 ]
[ 0  1 | (1/3)(b1 + 2b2) ]   (2.26)
Add 2 of row 2 to row 1
[ 1 0 | −b2 + (2/3)(b1 + 2b2) ]
[ 0 1 | (1/3)(b1 + 2b2) ]   (2.27)
which is equal to
[ 1 0 | (1/3)(2b1 + b2) ]
[ 0 1 | (1/3)(b1 + 2b2) ]   (2.28)
which says that
(b1, b2)^T = ((2b1 + b2)/3) ~u + ((b1 + 2b2)/3) ~v.   (2.29)
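A quick numerical check of (2.29) (not part of the notes): pick any b1 and b2, solve for the coefficients, and confirm they reproduce b. The sample values below are arbitrary.

```python
# A small check (not from the notes) of (2.29): solve a1*u + a2*v = b numerically.
import numpy as np

u = np.array([2.0, -1.0])
v = np.array([-1.0, 2.0])
b = np.array([5.0, 1.0])          # an arbitrary sample right-hand side

# columns of the matrix are u and v, so solving gives the coefficients a1, a2
a1, a2 = np.linalg.solve(np.column_stack([u, v]), b)
print(a1, a2)                     # (2*5 + 1)/3 = 11/3 and (5 + 2*1)/3 = 7/3
print(a1 * u + a2 * v)            # reproduces b = [5. 1.]
```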
Recommended Exercises. Exercises 12, 16, 23, and 31 in Section 1.2 of [3]. Exercises 8, 25, 26,
and 28 (you may use a calculator for exercise 28) in Section 1.3 of [3]. Be able to show all your
work, step by step! Do not use calculators or computer programs to solve any problems unless
otherwise stated!
In this lecture, we finished Sections 1.2 and 1.3 of [Lay].
3 Solution sets of linear systems
HW #01 is due at the beginning of class!
Go over Exercise 1.3.28 in [Lay].
We will skip Section 1.4 for now. We just discussed the span of vectors, but let’s review it and
discuss the relationship between the span and the solution set of a linear system.
Problem 3.1. In Example 1.3, we graphed two planes and their intersection, which was the set
of solutions to the corresponding linear system. This intersection was a line. This line is spanned
by a vector. What is that vector and what is its origin?
Answer. The set of solutions was given by
{ (t, (1/2)(3 + t), (1/2)(3t − 1)) ∈ R3 : t ∈ R }.   (3.2)
We have used the variable t only because we will interpret it as time. Using column vector notation,
this looks like
{ (t, (1/2)(3 + t), (1/2)(3t − 1))^T ∈ R3 : t ∈ R }.   (3.3)
We can split any vector in this set into a constant vector plus a vector multiplied by (the common
factor) t as
(t, (1/2)(3 + t), (1/2)(3t − 1))^T = (0, 3/2, −1/2)^T + t (1, 1/2, 3/2)^T.   (3.4)
As t varies over the set of real numbers, this traces out a straight line. This describes the solution
to (1.4) in parametric form. This line coincides with the span of the vector
(1, 1/2, 3/2)^T   (3.5)
whose origin is at
(0, 3/2, −1/2)^T.   (3.6)
Problem 3.7. Find two vectors with the same origin so that they span the solution set of the
linear system
− x− y + z = −2. (3.8)
Answer. Since z can be solved in terms of x and y via z = x + y − 2, the set of solutions is given
by
{ (x, y, x + y − 2)^T ∈ R3 : x, y ∈ R }.   (3.9)
Each such vector can be expressed as
(x, y, x + y − 2)^T = (0, 0, −2)^T + x (1, 0, 1)^T + y (0, 1, 1)^T.   (3.10)
The origin can therefore be taken as
(0, 0, −2)^T.   (3.11)
Since x and y can vary over the set of real numbers, two vectors that span this solution set are
(1, 0, 1)^T   &   (0, 1, 1)^T.   (3.12)
Proposition 3.13. Using the notation from Definition 1.1, if
~y = (y1, . . . , yn)^T   and   ~z = (z1, . . . , zn)^T   (3.14)
are both solutions of the linear system (1.2), then every point on the straight line passing through
both of the vectors ~y and ~z is a solution to (1.2).
This result is surprising! In particular, it says that if we have two distinct solutions of a linear
system, then we automatically have infinitely many solutions! To see that this fails for non-linear
systems, consider the quadratic equation x² − 2 = 0 (with x taking values in R).
[Figure: the graph of x² − 2, crossing the x-axis only at ±√2.]
The two solutions are y := √2 and z := −√2, but there are no other solutions at all!
Proof. The straight line passing through ~y and ~z can be described parametrically as⁸
R ∋ t ↦ (1 − t)~y + t~z = ((1 − t)y1 + tz1, . . . , (1 − t)yn + tzn)^T.   (3.15)
We have to show that each point on this line is a solution. It suffices to show this for the i-th
equation in (1.2) for any i ∈ {1, . . . , m}. Plugging in a point along the straight line, we get
ai1((1 − t)y1 + tz1) + · · · + ain((1 − t)yn + tzn) = (1 − t)(ai1y1 + · · · + ainyn) + t(ai1z1 + · · · + ainzn)
  = (1 − t)bi + tbi
  = bi   (3.16)
using the distributive law (among other properties) for adding and multiplying real numbers.
Problem 3.17 (Exercise 1.6.8 in [Lay]). Consider a chemical reaction that turns limestone CaCO3
and acid H3O into water H2O, calcium Ca, and carbon dioxide CO2. In a chemical reaction, all
elements must be accounted for. Find the appropriate ratios of these compounds and elements
needed for this reaction to occur without other waste products.
Answer. Introduce variables x1, x2, x3, x4 and x5 for the coefficients of limestone, acid, water, cal-
cium, and carbon dioxide, respectively. The elements appearing in these compounds and elements
are H, O, C, and Ca. We can therefore write the compounds as a vector in these variables (in this
order). For example, limestone, CaCO3, is
(0, 3, 1, 1)^T   (3.18)
with the entries counting the number of H, O, C, and Ca atoms, respectively,
since it is composed of zero hydrogen atoms, three oxygen atoms, one carbon atom, and one
calcium atom. Thus, the linear system we need to solve is given by
x1 CaCO3 + x2 H3O = x3 H2O + x4 Ca + x5 CO2
x1 (0, 3, 1, 1)^T + x2 (3, 1, 0, 0)^T = x3 (2, 1, 0, 0)^T + x4 (0, 0, 0, 1)^T + x5 (0, 2, 1, 0)^T   (3.19)
The associated augmented matrix is
[ 0 3 −2  0  0 | 0 ]
[ 3 1 −1  0 −2 | 0 ]
[ 1 0  0  0 −1 | 0 ]
[ 1 0  0 −1  0 | 0 ]   (3.20)
⁸You might have written down a different formula, but the line you get should be the same as the one we get
here.
Subtract row 4 from row 3 and subtract 3 of row 4 from row 2:
[ 0 3 −2  0  0 | 0 ]
[ 0 1 −1  3 −2 | 0 ]
[ 0 0  0  1 −1 | 0 ]
[ 1 0  0 −1  0 | 0 ]   (3.21)
Subtract 3 of row 2 from row 1:
[ 0 0  1 −9  6 | 0 ]
[ 0 1 −1  3 −2 | 0 ]
[ 0 0  0  1 −1 | 0 ]
[ 1 0  0 −1  0 | 0 ]   (3.22)
Permute the rows so that the augmented matrix is in echelon form:
[ 1 0  0 −1  0 | 0 ]
[ 0 1 −1  3 −2 | 0 ]
[ 0 0  1 −9  6 | 0 ]
[ 0 0  0  1 −1 | 0 ]   (3.23)
Add row 4 to row 1 and add row 3 to row 2:
[ 1 0 0  0 −1 | 0 ]
[ 0 1 0 −6  4 | 0 ]
[ 0 0 1 −9  6 | 0 ]
[ 0 0 0  1 −1 | 0 ]   (3.24)
Add 9 of row 4 to row 3 and add 6 of row 4 to row 2:
[ 1 0 0 0 −1 | 0 ]
[ 0 1 0 0 −2 | 0 ]
[ 0 0 1 0 −3 | 0 ]
[ 0 0 0 1 −1 | 0 ]   (3.25)
Now the augmented matrix is in reduced echelon form. Notice that although solutions exist, they
are not unique! We saw this happening in Example 1.3. Let us write the coefficients in terms
of x5, the coefficient of carbon dioxide (this choice is somewhat arbitrary).
x1 = x5, x2 = 2x5, x3 = 3x5, & x4 = x5. (3.26)
Thus, the resulting reaction is given by
x5CaCO3 + 2x5H3O→ 3x5H2O + x5Ca + x5CO2 (3.27)
It is common to set the smallest quantity to 1 so that this becomes
CaCO3 + 2H3O→ 3H2O + Ca + CO2. (3.28)
Nevertheless, we do not have to do this, and a proper way to express the solution is in terms of
the coefficient of carbon dioxide (for instance) as
(x1, x2, x3, x4, x5)^T = x5 (1, 2, 3, 1, 1)^T.   (3.29)
We did not have to choose carbon dioxide as the free variable. Any of the other compounds or
elements would have been just as good a choice, but in some instances, the resulting coefficients
might be fractions.
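Since the balancing problem is the homogeneous system (3.20), its solutions form the null space of the coefficient matrix, which sympy can compute directly; a small sketch (not part of the notes):

```python
# A small sketch (not from the notes): compute the null space of the coefficient
# matrix of (3.20); products carry a minus sign.
from sympy import Matrix

# rows: H, O, C, Ca ; columns: CaCO3, H3O, H2O, Ca, CO2
A = Matrix([[0, 3, -2,  0,  0],
            [3, 1, -1,  0, -2],
            [1, 0,  0,  0, -1],
            [1, 0,  0, -1,  0]])

basis = A.nullspace()
print(basis[0].T)   # proportional to (1, 2, 3, 1, 1), as in (3.29)
```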
The previous example leads us to the notion of homogeneous linear systems. For brevity, instead
of writing the linear system (1.2) over and over again, we use the shorthand notation (for now, it
is only notation!)
A~x = ~b. (3.30)
Definition 3.31. A linear system A~x = ~b is said to be homogeneous whenever ~b = ~0.
Note that a homogeneous linear system always has at least one solution (as we saw in Problem
1.10), namely ~x = ~0, which is called the trivial solution. We have also noticed in the example that
there is a free variable in the solution. This is a generic phenomenon:
Theorem 3.32. The homogeneous equation A~x = ~0 has a nontrivial solution if and only if⁹ the
corresponding system of linear equations has a free variable.
Proof. you found me!
(⇒) Let ~x be a non-zero vector (i.e. a non-trivial solution) that satisfies A~x = ~0. Since ~0 is also
a solution, this system has two distinct solutions. By Proposition 3.13, all the points along the
straight line
R 3 t 7→ (1− t)~0 + t~x = t~x (3.33)
are solutions as well. Since ~x is not zero, at least one of the components is non-zero. Suppose it
is xi for some i ∈ {1, . . . , n}. Setting t := 1/xi shows that the vector
(x1/xi, . . . , xi−1/xi, 1, xi+1/xi, . . . , xn/xi)^T   (3.34)
⁹To prove a statement of the form "A if and only if B," one must show that A implies B and B implies A. In a
proof, we often depict the former by (⇒) and the latter by (⇐).
is also a solution. Hence, any constant multiple of this vector is also a solution. In particular, xi
can be taken to be a free variable for the linear system since
(x1, . . . , xi, . . . , xn)^T = xi (x1/xi, . . . , 1, . . . , xn/xi)^T.   (3.35)
(⇐) Suppose that xi is a free variable. Setting xi = 1 and all other free variables (if they exist) to
zero gives a non-trivial solution to A~x = ~0.
In (3.29), the solution of the homogeneous equation was written in the form
~x = ~p+ t~v (3.36)
where in that example ~p was ~0, t was x5, and ~v was the vector
(1, 2, 3, 1, 1)^T.   (3.37)
This form of the solution of a linear equation is also in parametric form because its value depends
on an additional unspecified parameter, which in this case is t. In other words, all solutions are
obtained as t varies over the real numbers. For a homogeneous equation, ~p is always ~0. In fact, there
could be more than one such parameter involved.
Theorem 3.38. Suppose that the linear system described by A~x = ~b is consistent and let ~x = ~p be
a solution. Then the solution set of A~x = ~b is the set of all vectors of the form ~p + ~u where ~u is
any solution of the homogeneous equation A~u = ~0.
Proof. Let S be the solution set of A~x = ~b, i.e.
S := { ~x ∈ Rn : A~x = ~b }.   (3.39)
The claim that we must prove is that for some fixed solution ~p ∈ Rn satisfying A~p = ~b,
S = { ~p + ~u ∈ Rn : A~u = ~0 }.   (3.40)
Let’s call the right-hand-side of (3.40) T. To prove that two sets are equal, S = T, we must show
that each one is contained in the other. First let us show that T is contained in S, which is written
mathematically as T ⊆ S. To prove this, let ~x := ~p+~u ∈ T so that ~p satisfies A~p = ~b and ~u satisfies
A~u = ~0. By a similar calculation as in (3.16), we see that A~x = ~b (I'm leaving this calculation to
you as an exercise). This shows that ~x ∈ S so that T ⊆ S (because we showed that any arbitrary
element in T is in S).
Now let ~x ∈ S. This means that A~x = ~b. Our goal is to find a ~u that satisfies the two conditions
(a) A~u = ~0 and
(b) ~x = ~p+ ~u.
This would prove that ~x ∈ T. Let’s therefore define ~u to be ~u := ~x − ~p. Then, by a similar
calculation as in (3.16), we see that A~u = ~0 (exercise!). Also, from this definition, it immediately
follows that ~p + ~u = ~p + (~x − ~p) = ~x. Hence, S ⊆ T.
Together, T ⊆ S and S ⊆ T prove that S = T.
This theorem says that the solution set of a consistent linear system A~x = ~b can be expressed
as
~x = ~p+ t1~u1 + · · ·+ tk~uk, (3.41)
where ~p is one solution of A~x = ~b, k is a positive integer, t1, . . . , tk are the parameters (real
numbers), and the set {~u1, . . . , ~uk} spans the solution set of A~x = ~0. A linear combination of
solutions to A~x = ~0 is a solution as well. Here’s an application of the theorem.
Problem 3.42. Consider the linear system
2x1 + 4x2 − 2x5 = 2
−x1 − 2x2 + x3 − x4 = −1
x4 − x5 = 1
x3 − x4 − x5 = 0
(3.43)
Check that
~p = (1, 0, 0, 0, 1)^T   (3.44)
is a solution to this linear system. Then, find all the solutions of this system.
Answer. I’ll leave the check that ~p is a solution to you. To find all the solutions, all we need to
do now is solve the homogeneous system
2x1 + 4x2 − 2x5 = 0
−x1 − 2x2 + x3 − x4 = 0
x4 − x5 = 0
x3 − x4 − x5 = 0
(3.45)
This is a little easier than solving the original system because we have fewer numbers to keep track
of (and therefore a lower probability of making a mistake!). After row reduction, the
augmented matrix becomes (exercise!)
[ 1 2 0 0 −1 | 0 ]
[ 0 0 1 0 −2 | 0 ]
[ 0 0 0 1 −1 | 0 ]
[ 0 0 0 0  0 | 0 ]   (3.46)
which is in reduced echelon form. The solutions here are all of the form
~u = (x5 − 2x2, x2, 2x5, x5, x5)^T = x2 (−2, 1, 0, 0, 0)^T + x5 (1, 0, 2, 1, 1)^T   (3.47)
Hence, the set of solutions of the linear system consists of vectors of the form
(1, 0, 0, 0, 1)^T + s (−2, 1, 0, 0, 0)^T + t (1, 0, 2, 1, 1)^T   (3.48)
where s, t ∈ R.
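A quick check of the homogeneous solutions (3.47) (not part of the notes): sympy's null space of the coefficient matrix of (3.45) should be spanned by the same two vectors.

```python
# A quick check (not from the notes) of (3.47): the null space of the coefficient
# matrix of the homogeneous system (3.45).
from sympy import Matrix

A = Matrix([[ 2,  4, 0,  0, -2],
            [-1, -2, 1, -1,  0],
            [ 0,  0, 0,  1, -1],
            [ 0,  0, 1, -1, -1]])

for v in A.nullspace():
    print(v.T)   # spans the same plane as (-2, 1, 0, 0, 0) and (1, 0, 2, 1, 1)
```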
Exercise 3.49. State whether the following claims are True or False. If the claim is true, be
able to precisely deduce why the claim is true. If the claim is false, be able to provide an explicit
counter-example.
(a) If there is a nonzero (aka nontrivial) solution to a linear homogeneous system, then there are
infinitely many solutions.
(b) If there are infinitely many solutions to a linear system, then the system is homogeneous.
(c) ~x = ~0 is always a solution to every linear system A~x = ~b.
Recommended Exercises. Exercises 15, 27, and 40 in Section 1.5 of [Lay]. Exercises 7 and 15
(this one is similar to the circuit problem from last class) in Section 1.6 of [Lay]. You may (and
are encouraged to) use any theorems we have done in class! Be able to show all your work, step
by step! Do not use calculators or computer programs to solve any problems!
Today, we finished Sections 1.5 and 1.6 of [Lay] (we skipped Section 1.4). Whenever you see
the equation A~x = ~b, just read it as the associated linear system as in (1.2). We will not provide
an algebraic interpretation of the expression “A~x ” until a few lectures from now.
4 Linear independence and dimension of solution sets
The solution sets of Problems 3.1 and 3.7 are visually different, and we would like to say that the
span of one nonzero vector is a 1-dimensional line and the span of the two vectors in (3.12) is a
2-dimensional plane. But we need to be a bit more precise about what we mean by dimension.
To get there, we first introduce the notion of linear independence. Heuristically, a solution set is
n-dimensional if the minimum number of vectors needed to span it is n. This would answer our
earlier question when we did Example 0.1 for traffic flow. The dimensionality of the solution set
corresponds to the minimal number of people needed to count traffic to obtain the full traffic flow
for a given set of streets and intersections. Where we place those people is related to a choice of
linearly independent vectors that span the set of solutions.
Definition 4.1. A set of vectors ~u1, . . . , ~uk in Rn is linearly independent if the solution set of
the vector equation
x1~u1 + · · ·+ xk~uk = ~0 (4.2)
consists of only the trivial solution. Otherwise, the set is said to be linearly dependent in which
case there exist some coefficients x1, . . . , xk, not all of which are zero, such that (4.2) holds.
Example 4.3. The vectors

(1, −2, 0)   &   (−3, 6, 0)    (4.4)

are linearly dependent because

(−3, 6, 0) = −3 (1, −2, 0)    (4.5)

so that

3 (1, −2, 0) + (−3, 6, 0) = ~0.    (4.6)
Example 4.7. The vectors

(1, 1)   &   (−1, 1)    (4.8)

are linearly independent for the following reason. Let x1 and x2 be two real numbers such that

x1 (1, 1) + x2 (−1, 1) = (0, 0).    (4.9)

This equation describes the linear system associated to the augmented matrix

[ 1  −1 | 0 ]
[ 1   1 | 0 ].    (4.10)

Performing the row operations R2 → R2 − R1, then R2 → (1/2)R2, then R1 → R1 + R2 gives

[ 1 −1 | 0 ]      [ 1 −1 | 0 ]      [ 1 −1 | 0 ]      [ 1  0 | 0 ]
[ 1  1 | 0 ]  →   [ 0  2 | 0 ]  →   [ 0  1 | 0 ]  →   [ 0  1 | 0 ].    (4.11)
The only solution to (4.9) is therefore x1 = 0 and x2 = 0. Thus, the two vectors in (4.8) are linearly
independent.
Example 4.12. A set ~u1, ~u2 of two vectors in Rm is linearly dependent if and only if10 one can
be written as a scalar multiple of the other, i.e. there exists a real number c such that ~u1 = c~u2
or11 c~u1 = ~u2.12
Proof. First13 note that the associated vector equation is of the form
x1~u1 + x2~u2 = ~0, (4.13)
where14 x1 and x2 are coefficients, or upon rearranging
x1~u1 = −x2~u2. (4.14)
(⇒) If the set is linearly dependent, then x1 and x2 cannot both be zero.15 Without loss of
generality, suppose that x1 is nonzero.16 Then dividing both sides of (4.14) by x1 gives
~u1 = −(x2/x1) ~u2.    (4.15)

Thus, setting c := −x2/x1 proves the first claim^17 (a similar argument can be made if x2 is nonzero).
(⇐) Conversely,18 suppose that there exists a real number c such that19 ~u1 = c~u2. Then
~u1 − c~u2 = ~0 (4.16)
10If A and B are statements, the phrase “A if and only if B” means two things. First, it means “A implies B.”
Second, it means “B implies A.”11In mathematics, the word “or” is never exclusive. If “A or B” are true, it always means that “at least one of
A or B is true.” It does not mean that if A is true, then B is false, or vice versa. If A happens to be true, we make
no additional assumptions about B (and vice versa).12 In what follows, we will work through the proof very closely. We will try to guide you using footnotes so that
you know what is part of the proof and what is based on intuition. Instead of first teaching you how to do proofs
from scratch, we will go through several examples so that you see what they are like first. This is like learning a new
language. Before learning the grammar, you want to first listen to people talking to get a feel for what the language
sounds like. Then, when you learn the alphabet, you want to read a few passages before you start constructing
sentences on your own. The point is not to know/memorize proofs. The point is to know how to read, understand,
and construct proofs of your own.13Before proving anything, we just recall what the vector equation is to remind us of what we’ll need to refer to.14If you introduce notation in a proof, please say what it is every time!15What we have done so far is just state the definition of what it means for ~u1, ~u2 to be linearly dependent.
Stating these definitions to remind ourselves of what we know is a large part of the battle in constructing a proof.16We know from the definition that at least one of x1 or x2 is not zero but we do not know which one. It won’t
matter which one we pick in the end (some insight is required to notice this), so we may use the phrase “without
loss of generality” to cover all other possible cases.17Remember, we wanted to show that ~u1 is a scalar multiple of ~u2.18We say “conversely” when we want to prove an assertion in the opposite direction to the previously proven
assertion.19Remember, this is literally the latter assumption in the claim.
showing that the set ~u1, ~u2 is linearly dependent since the coefficient in front of ~u1 is nonzero (it
is 1).20
At the end of a proof, you should always check your work!
Example 4.17. Let

x := (1, 0, 0),   y := (0, 1, 0),   &   z := (0, 0, 1)    (4.18)

be the three unit vectors in R3. A lot of different notation is used for these, sometimes i, j, and k, and sometimes ~e1, ~e2, and ~e3, respectively. In addition, let ~u be any other vector in R3. Then the set {x, y, z, ~u} is linearly dependent because ~u can be written as a linear combination of the three unit vectors. This is apparent if we write

~u = (u1, u2, u3)    (4.19)
since
~u = u1x+ u2y + u3z. (4.20)
Here’s a less trivial example.
Example 4.21. The vectors

(1, 0, 1),   (2, 1, 3),   &   (−1, −2, −3)    (4.22)

are linearly dependent. This is a little bit more difficult to see, so let us try to solve it from scratch. We must find x1, x2, and x3 such that

x1 (1, 0, 1) + x2 (2, 1, 3) + x3 (−1, −2, −3) = (0, 0, 0).    (4.23)

Putting the left-hand side together into a single vector gives us an equality of vectors

(x1 + 2x2 − x3,  x2 − 2x3,  x1 + 3x2 − 3x3) = (0, 0, 0).    (4.24)

We therefore have to solve the linear system whose augmented matrix is given by

[ 1  2  −1 | 0 ]
[ 0  1  −2 | 0 ]
[ 1  3  −3 | 0 ]    (4.25)
20Recall the definition of what it means to be linearly dependent and confirm that you agree with the conclusion.
which after some row operations is equivalent to

[ 1  0   3 | 0 ]
[ 0  1  −2 | 0 ]
[ 0  0   0 | 0 ].    (4.26)

This has non-zero solutions. Setting x3 = −1 (we don’t have to do this—we can leave x3 as a free variable, but I just want to show that we can write the last vector in terms of the first two) shows that

(−1, −2, −3) = 3 (1, 0, 1) − 2 (2, 1, 3).    (4.27)
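As a quick numerical cross-check (not a substitute for the row reduction above), you can ask a computer for the rank of the matrix whose columns are these three vectors; a rank smaller than the number of columns means the columns are linearly dependent. A minimal sketch:

import numpy as np

# columns are the three vectors from (4.22)
A = np.array([[1, 2, -1],
              [0, 1, -2],
              [1, 3, -3]], dtype=float)
print(np.linalg.matrix_rank(A))       # 2 < 3, so the vectors are linearly dependent

# verify the dependence relation (4.27)
v1, v2, v3 = A[:, 0], A[:, 1], A[:, 2]
print(np.allclose(3*v1 - 2*v2, v3))   # True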
The previous examples hint at a more general situation.
Theorem 4.28. Let S := ~u1, . . . , ~uk be a set of vectors in Rn. S is linearly dependent if and
only if at least one vector from S can be written as a linear combination of the others.
The proof of Theorem 4.28 will be similar to the previous example. Why should we expect
this? Well, if k = 3, then we have ~u1, ~u2, ~u3 and we could imagine doing something very similar.
Think about this! If you’re not comfortable working with arbitrary k just yet, specialize to the
case k = 3 and try to mimic the previous proof. Then try k = 4. Do you see the pattern? Once
you’re ready, try the following.21
Proof. The vector equation associated to S is

∑_{j=1}^{k} xj ~uj = ~0,    (4.29)

where the xj are coefficients.
(⇒) If the set S is linearly dependent, then there exists^22 a nonzero xi (for some i between 1 and k). Therefore,

~ui = ∑_{j≠i} (−xj/xi) ~uj,    (4.30)

where the sum is over all numbers j from 1 to k except i. Hence, the vector ~ui can be written as a linear combination of the others.
(⇐) Conversely, suppose that there exists a vector ~ui from S that can be written as a linear combination of the others, i.e.

~ui = ∑_{j≠i} yj ~uj,    (4.31)
21If this is your first time proving things outside of geometry in highschool, study how these proofs are written.
Try to prove things on your own. Do not be discouraged if you are wrong. Keep trying. A good book on learning
how to think about proofs is How to Solve It by G. Polya [4]. A course in discrete mathematics also helps. Practice,
practice, practice!22By definition of a linearly dependent set, at least one of the xi’s must be nonzero. This is phrased concisely
by the statement “there exists a nonzero xi...”.
where the yj are real numbers.^23 Rearranging gives

~ui − ∑_{j≠i} yj ~uj = ~0,    (4.32)
and we see that the coefficient in front of ~ui is nonzero (it is 1). Hence S is linearly dependent.
Let’s give a simple application of this theorem.
Example 4.33. On a computer, colors can be obtained by choosing three integers from the set of numbers 0, 1, 2, . . . , 255. These three integers represent the levels of red, green, and blue. If we collect these three levels into a column “vector”,^24 we can write

(R, G, B)    (4.34)

for a color. The pure red, green, and blue colors then correspond to

R ↔ (255, 0, 0),   G ↔ (0, 255, 0),   B ↔ (0, 0, 255).

The ↔ should be read as “corresponds to.” Because there are 256 numbers allowed for each of these three colors, the total number of vectors allowed is

(256)^3 = 16,777,216.    (4.35)
8-bit computer displays (?) work using these colors. Therefore, each pixel on your computer display has this many possibilities. Multiply that by the number of pixels on your display. That’s a lot of information. These colors are all obtained from linear combinations of the form

xR (1, 0, 0) + xG (0, 1, 0) + xB (0, 0, 1)    (4.36)

with xR, xG, xB ∈ {0, 1, 2, 3, . . . , 255}. For example, yellow is

Y ↔ (255, 255, 0) = 255 (1, 0, 0) + 255 (0, 1, 0),   i.e.  Y = R + G,
23We call our variables y to avoid potentially confusing them with the previous variables x.24Technically, these are not vectors. They are just arrays. The reason these are not vectors is because we cannot
scale these arrays by an arbitrary number because the maximum value of any entry is 255. Similarly, we cannot
add combinations of colors arbitrarily because of the maximum and minimum values allowed. Nevertheless, this
example describes the content of the previous theorem with hopefully something you can relate to.
magenta is

M ↔ (255, 0, 255) = 255 (1, 0, 0) + 255 (0, 0, 1),   i.e.  M = R + B,

and cyan is

C ↔ (0, 255, 255) = 255 (0, 1, 0) + 255 (0, 0, 1),   i.e.  C = G + B.

The colors R, G, and B are linearly independent in the sense of the above definition, namely the only solution to

xR R + xG G + xB B = Black    (4.37)

is

xR = xG = xB = 0.    (4.38)
Using the previous theorem, another way of saying this is that none of the colors R , G , and B
can be expressed in terms of the other two as linear combinations. What about the colors M ,
R , and B ? Are these linearly independent? Or can we express any of these colors in terms of
the others? Well, we already know we can express M in terms of R and B so the three are
not linearly independent—they are linearly dependent. However, the colors Y , M , and C are
linearly independent—none of these colors can be expressed in terms of the others.
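If you trust a computer more than your eyes, you can check these claims by computing ranks; this is just a sanity-check sketch.

import numpy as np

R, G, B = np.array([255, 0, 0]), np.array([0, 255, 0]), np.array([0, 0, 255])
Y, M, C = R + G, R + B, G + B

# rank 3 means linearly independent; rank < 3 means linearly dependent
print(np.linalg.matrix_rank(np.column_stack([R, G, B])))   # 3
print(np.linalg.matrix_rank(np.column_stack([M, R, B])))   # 2, since M = R + B
print(np.linalg.matrix_rank(np.column_stack([Y, M, C])))   # 3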
The following two theorems will give quick methods to figure out whether a given set of vectors
is linearly dependent.
Theorem 4.39. Let S := ~u1, . . . , ~uk be a set of vectors in Rn with k > n. Then S is linearly
dependent.
Proof. Recall, S is linearly dependent^25 if there exist numbers x1, . . . , xk, not all zero, such that

∑_{i=1}^{k} xi ~ui = ~0.    (4.40)

This equation can be expressed as a linear system

∑_{i=1}^{k} xi (ui)_1 = 0
        ...
∑_{i=1}^{k} xi (ui)_n = 0,    (4.41)
25Again, it is always helpful to constantly remind yourself and the reader of definitions that are crucial to
solving the problem at hand. It is also helpful to use them to introduce notation that has not been introduced in
the statement of the claim (the theorem).
where26 (ui)j is the j-th component of the vector ~ui. In this linear system, there are k unknowns
given by the variables x1, . . . , xk and there are n equations. Because k > n, there are more
unknowns than equations, and hence there is at least one free variable.27 Let xp be one of these
free variables. Then the other xi’s might depend on xp so we may write xi(xp).28 Then by setting
xp = 1, we find
1 ~up + ∑_{i≠p} xi(xp = 1) ~ui = ~0    (4.42)
showing that S is linearly dependent (again since the coefficient in front of ~up is nonzero).
Warning! Using an example of S := ~u1, . . . , ~uk and showing that it is linearly dependent is not
a proof! We have to prove the claim for all potential cases. Nevertheless, an example helps to see
why the claim might be true in the first place.
Another warning! The theorem does not say that if k < n, then the set is linearly independent!
This is an important distinction, one that comes from logic. If A is a statement and B is a
statement, then the claim “A implies B” does not imply that “not A implies not B.” Also, “A
implies B” does not imply “B implies A.” However, “A implies B” does imply “not B implies not
A.” To see this in a concrete example, suppose the manager of some company is in charge of giving
his workers a raise, particularly to those who do not make so much money. If a worker’s salary is
less than $50,000 a year (A), the manager will give them a raise (B). Does this statement imply
that if a worker’s salary is greater than $50,000 a year (not A), then that worker will not get a
raise (not B)? No, it doesn’t. We don’t know what happens in this situation. Similarly, if a worker
received a raise (B), does this mean that the worker must have made less than $50,000 a year (A)?
No, it doesn’t mean that either. If a worker does not receive a raise (not B), then what do we
know? We know that this guarantees that the worker in question could not have a salary that is
less than $50,000 a year because otherwise that worker would get a raise! Hence, the worker must make at least $50,000 a year (not A).
Theorem 4.43. Let S := ~u1, . . . , ~uk be a set of vectors in Rn with at least one of the ~ui being
zero. Then S is linearly dependent.
Proof. Suppose ~ui = ~0. Then choose29 the coefficient of ~uj to be
xj := 1 if j = i,   and   xj := 0 otherwise.    (4.44)
26We have introduced some notation, so we should define it.27But wait, how do we know that a solution even exists? If a solution doesn’t exist, then our conclusion must
be false! Thankfully, by our earlier comments from the previous lecture, we know that every homogeneous linear
system has at least one solution, namely the trivial solution. Hence, the solution set is not empty.28This is read as “xi is a function of xp.”29To show that the set is linearly dependent, we have to find a set of coefficients, not all of which are zero, so
that their linear combination results in the zero vector. The coefficients that I’ve chosen here are not the only
coefficients that will work. You may have chosen others. All we have to do is exhibit the existence of one such choice. We do not have to exhaust all possibilities.
Then

∑_{j=1}^{k} xj ~uj = 1 ~ui = 1(~0) = ~0    (4.45)
because any scalar multiple of the zero vector is the zero vector. Since not all of the coefficients
are zero (one of them is 1), S is linearly dependent.
Definition 4.46. Let A~x = ~b be a consistent linear system with a particular solution ~p. The
dimension of the solution set of A~x = ~b is the number k ∈ N of linearly independent homogeneous
solutions ~u1, . . . , ~uk needed so that the solution set consists of all vectors of the form ~p+ t1~u1 +
· · ·+ tk~uk with t1, . . . , tk ∈ R.
There is something sneaky about this “definition.” If you find the vectors ~u1, . . . , ~uk but
your neighbor finds another linearly independent set of vectors ~v1, . . . , ~vl with l ∈ N, then does
l = k? In order for the above definition to make sense, the answer to this question better be yes.
Theorem 4.47. Let ~u1, . . . , ~uk and ~v1, . . . , ~vl be two sets of linearly independent vectors that
span the solution set to the homogeneous linear system A~x = ~0. Then k = l.
The proof of this introduces more ideas from logic, namely proof by contradiction. This logic
is as follows. To prove that “A implies B” we can “assume that A is true and B is false.” If we
can deduce some logical contradiction (such as “A is false” or “C is false” where C is something
that must be true provided A is true), then the initial assumption, namely that “A is true and
B is false”, is false. Since the claim assumes “A is true,” we must conclude that “B is true.”
The following proof might not be easy to follow if this is your first time proving something by
contradiction. What also makes the following proof difficult is that it is broken up into many
steps. Before we prove it, we will prove an important preliminary fact.
Lemma 4.48. Let V be the solution set to the homogeneous linear system A~x = ~0 of m equations
in n variables with m,n ∈ N. Let S := ~u1, . . . , ~uk be a linearly independent set of vectors such
that span(S) = V and let T := ~v1, . . . , ~vl be a set that spans V, where k, l ∈ N. Then k ≤ l.
Proof. Since T spans V, ~u1 can be written as a linear combination of the ~v’s, i.e. there exist
coefficients c11, . . . , c1l ∈ R such that
~u1 = c11~v1 + · · ·+ c1l~vl. (4.49)
Since S is linearly independent, Theorem 4.43 says that ~u1 is not zero. Therefore, one of the c’s
is not zero, i.e. there exists an i1 ∈ 1, . . . , l such that c1i1 6= 0. Therefore, we can divide by it to
write ~vi1 in terms of the other vectors, namely (the second line just compactifies the expression)
~vi1 = (1/c1i1) ~u1 − (c11/c1i1) ~v1 − · · · − (c1,i1−1/c1i1) ~v_{i1−1} − (c1,i1+1/c1i1) ~v_{i1+1} − · · · − (c1l/c1i1) ~vl
     = (1/c1i1) ~u1 − ∑_{j≠i1} (c1j/c1i1) ~vj.    (4.50)

We might sometimes write this as

~vi1 = (1/c1i1) ~u1 − (c11/c1i1) ~v1 − · · · − \widehat{i1-th term} − · · · − (c1l/c1i1) ~vl,    (4.51)

where the wide hat over an expression indicates to exclude it. Now, define

T1 := {~u1, ~v1, . . . , \widehat{~vi1}, . . . , ~vl},    (4.52)
which appends ~u1 to T and removes ~vi1 . Notice that T1 still spans V. Hence, there exist coefficients
d21, c21, . . . , \widehat{c2i1}, . . . , c2l ∈ R (the hat still means that we exclude this term—I’m only writing it to keep track of the numbers) such that

~u2 = d21 ~u1 + c21 ~v1 + · · · + \widehat{c2i1 ~vi1} + · · · + c2l ~vl.    (4.53)

Claim. There exists an i2 ∈ {1, . . . , \widehat{i1}, . . . , l} such that c2i2 ≠ 0. Proof of claim. Suppose to the contrary that all of the c coefficients were zero in (4.53). Since ~u2 is not zero, this would say ~u2 ∝ ~u1,^30 which contradicts that S is linearly independent (see Example 4.12). End of proof of claim. Therefore,

~vi2 = (1/c2i2) ~u2 − (d21/c2i2) ~u1 − ∑_{j≠i1, j≠i2} (c2j/c2i2) ~vj.    (4.54)
Hence, we can also remove this vector from T1, define
T2 := {~u1, ~u2, ~v1, . . . , \widehat{~vi1}, . . . , \widehat{~vi2}, . . . , ~vl},    (4.55)
and T2 still spans V. We can continue to remove one ~vij at a time from Tj−1 and replace it with
a ~uj to construct Tj (instead of using Example 4.12, we need to use Theorem 4.28 to show that
there exist nonzero coefficients cjij—I encourage you to do the next step to see this explicitly).
This process ends at Tk, where we have
Tk := {~u1, . . . , ~uk, ~vr1 , . . . , ~vr_{l−k}},    (4.56)

where the ~vr’s are the leftover ~v’s from this procedure. Note that there are exactly l − k of them. Tk still spans V and we have that S is now a subset of Tk, written as S ⊆ Tk. Therefore k ≤ l.
Proof of Theorem 4.47. Lemma 4.48 shows that k ≤ l and l ≤ k. Hence l = k.
Theorem 4.57. Let A~x = ~b be a consistent linear system. The dimension of the solution set of
A~x = ~b is the number of free variables.
Proof. Sorry, that last proof wore me out and so I’m leaving this as an exercise. However, you
should be able to use an idea similar to the proof of Theorem 3.32.
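In computational terms, Theorem 4.57 says the dimension of the solution set is n minus the number of pivots, i.e. n − rank(A). Here is a small sketch for the system from Problem 3.42, using the sympy library (its rref and nullspace methods do exact arithmetic); again, this is only a sanity check, not a substitute for doing the row reduction yourself.

import sympy as sp

A = sp.Matrix([[ 2,  4, 0,  0, -2],
               [-1, -2, 1, -1,  0],
               [ 0,  0, 0,  1, -1],
               [ 0,  0, 1, -1, -1]])

rref_form, pivots = A.rref()
n = A.shape[1]
print(pivots)              # (0, 2, 3): the pivot columns
print(n - len(pivots))     # 2 free variables = dimension of the solution set
print(A.nullspace())       # two linearly independent homogeneous solutions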
Recommended Exercises. Exercises 6, 8 (use a theorem!), 36, and 38 in Section 1.7 of [Lay].
You may (and are encouraged to) use any theorems we have done in class! Be able to show all
your work, step by step! Do not use calculators or computer programs to solve any problems!
In this lecture, we finished Section 1.7 of [Lay] and explored some ideas from Section 2.9 of
[Lay]. Whenever you see the equation A~x = ~b, just read it as the associated linear system as in
(1.2).
30The notation A ∝ B means “A is proportional to B”.
5 Subspaces, bases, and linear manifolds
We need some more experience with vectors in Euclidean space (Rn). We have already seen that
the set of solutions to a system of linear equations is always “linear” in the sense that it is either
a point, a line, a plane, or a higher-dimensional plane.
Definition 5.1. A subspace of Rn is a set H of vectors in Rn satisfying the following conditions.
(a) ~0 ∈ H.
(b) For every pair of vectors ~u and ~v in H, their sum ~u+ ~v is also in H.
(c) For every vector ~v and constant c, the scalar multiple c~v is in H.
Example 5.2. Rn itself is a subspace of Rn. Also, the set ~0 consisting of just the zero vector in
Rn is a subspace.
Are there other subspaces?
Exercise 5.3. Let H be the set of points in R3 described by the solution set of
3x− 2y + z = 0, (5.4)
which is depicted in Figure 6.
Figure 6: A plot of the planes described by 3x − 2y + z = 0 (Exercise 5.3) and 3x − 2y + z = 12 (Exercise 5.6).

Is ~0 in H? Let

~u = (u1, u2, u3)   &   ~v = (v1, v2, v3)    (5.5)

be two vectors in H and let c be a real number. Is ~u + ~v in H? Is c~v in H?
Exercise 5.6. Is the set of solutions to
3x− 2y + z = 12 (5.7)
a subspace of R3? See Figure 6 for a comparison of this solution set to the one from Exercise 5.3.
If not, what goes wrong?
The last example was a subspace that has been shifted by a vector. To see this, notice that
the set of solutions to 3x − 2y + z = 12 consists of all vectors of the form

(x, y, 12 − 3x + 2y) = (0, 0, 12) + x (1, 0, −3) + y (0, 1, 2)    (5.8)
for all x, y ∈ R. This is almost the same as the set of solutions of 3x − 2y + z = 0 except for the
additional constant vector (0, 0, 12). The set of solutions is of the form ~p+ ~u where
~p = (0, 0, 12)   &   ~u = x (1, 0, −3) + y (0, 1, 2)    (5.9)
are particular and homogeneous solutions, respectively. In other words, if we denote the set of
solutions to 3x− 2y + z = 0 by H and the set of solutions to 3x− 2y + z = 12 by S, then
S = ~p+H, (5.10)
where the latter notation means
~p + H := {~p + ~u : ~u ∈ H}.    (5.11)
Definition 5.12. A linear manifold (sometimes called an affine subspace31) in Rn is a subset S of
Rn for which there exists a vector ~p such that the set
S − ~p := {~v − ~p : ~v ∈ S}    (5.13)
is a subspace of Rn.
Every subspace is a linear manifold but not conversely, meaning that not every linear manifold
is a subspace.
Exercise 5.14. Is the set of solutions to

3x − 2y + z = 0    (5.15)

with the constraint that

x^2 + y^2 ≤ 1    (5.16)

a subspace of R3? See Figure 7a. What goes wrong? Which of the three properties of the definition of subspace remain valid even in this example? What about the same linear system but with the constraint that

1/3 ≤ x^2 + y^2 ≤ 1?    (5.17)

See Figure 7b. Are either of these linear manifolds?
31I will avoid this terminology because we are already using the word “subspace” to mean something else.
Figure 7: A plot of the set of solutions to 3x − 2y + z = 0 with different constraints: (a) with the constraint x^2 + y^2 ≤ 1; (b) with the constraint 1/3 ≤ x^2 + y^2 ≤ 1.
The previous example leads to the following definition and hints at the following fact.
Theorem 5.18. Let A~x = ~b be a consistent linear system of m equations in n unknowns. Then
the set of solutions to this system is a linear manifold in Rn. Furthermore, if ~b = 0, then the set
of solutions is a subspace.
Proof. We first prove the second claim. Let H be the solution set of A~x = ~0. Then A~0 = ~0, so ~0 is a solution. If ~u and ~v are solutions, then A(~u + ~v) = A~u + A~v = ~0 + ~0 = ~0, so ~u + ~v is a solution. If ~u is a solution, then A(c~u) = cA~u = c~0 = ~0, so c~u is a solution for all c ∈ R.
To prove the first claim, let S be the solution set of A~x = ~b and let ~p ∈ S. By Theorem 3.38, S = H + ~p. Hence, H = S − ~p is a subspace by the previous paragraph. Therefore, S is a linear manifold.
Definition 5.19. A basis for a subspace H of Rn is a set of vectors that is both linearly independent
and spans H. A basis for a linear manifold S of Rn is a basis for some subspace H for which there
exists a vector ~p ∈ Rn with S = ~p+H.
Exercise 5.20. Going back to our previous example of the plane in R3 specified by the linear
system
3x− 2y + z = 0, (5.21)
what is a basis for the vectors in this plane? Since the set of all vectors

(x, y, z)    (5.22)
satisfying this linear system define this plane, we just need to find a basis for these solutions. We
know that if we specify x and y as our free variables, then a general solution of this system is of
the form

(x, y, −3x + 2y)    (5.23)

with x and y free. How about testing the cases x = 1 with y = 0 and x = 0 with y = 1? This gives

(1, 0, −3)   &   (0, 1, 2)    (5.24)
respectively. Any other vector in the solution set is a linear combination of these two vectors.
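For a numerical cross-check, a null space routine recovers a basis for the same plane. The basis it returns may look different from (5.24), which is fine: bases are not unique. A minimal sketch using sympy:

import sympy as sp

A = sp.Matrix([[3, -2, 1]])   # the single equation 3x - 2y + z = 0
print(A.nullspace())          # a basis for the plane (possibly different from (5.24))

# the basis found by hand in (5.24) also works:
for v in (sp.Matrix([1, 0, -3]), sp.Matrix([0, 1, 2])):
    print(A * v)              # both give the zero vector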
Definition 5.25. The number of elements in a basis for a subspace H of Rn is the dimension of
H and is denoted by dimH.
As before, the fact that this number is well-defined is not obvious. How can we be sure that
any two choices of bases have the same number of vectors in them? However, the proof given for
the dimension of a solution set in Theorem 4.47 is exactly the same as the proof of this claim.
A linear manifold is equivalently described by the following geometric property.
Theorem 5.26. A nonempty subset S ⊆ Rn is a linear manifold if and only if for any two vectors
~x and ~y in S, the vector
t~x+ (1− t)~y (5.27)
is in S for all t ∈ R.
Proof. you found me!
(⇒) Suppose that S is a linear manifold. Let ~x, ~y ∈ S and let t ∈ R. The goal is to show that t~x + (1 − t)~y ∈ S. By definition of a linear manifold, if ~p ∈ S, the set H := S − ~p is a subspace. Therefore, ~x − ~p, ~y − ~p ∈ H. By definition of a subspace, t(~x − ~p) + (1 − t)(~y − ~p) ∈ H. Therefore, t(~x − ~p) + (1 − t)(~y − ~p) + ~p ∈ S since S = H + ~p. But this expression is equal to

t(~x − ~p) + (1 − t)(~y − ~p) + ~p = t~x + (1 − t)~y,    (5.28)

which shows that t~x + (1 − t)~y ∈ S.
(⇐) Suppose S is a set for which the straight line going through any two vectors in S is also in
S. We must find a subspace H and a vector ~p such that S = ~p+H. First, since S is nonempty, it
contains at least one vector. Let ~p be any such vector in S and set H := S − ~p. We show that H
is a subspace.
(a) ~0 ∈ H because ~0 = ~p− ~p ∈ S − ~p since ~p ∈ S.
(b) Let us actually check the scalar condition first since that is easier. Let ~u ∈ H and let t ∈ R.
[Figure: the points ~0, ~u, ~p, and ~u + ~p.]

Then

t~u = (1 − t)~p + t(~u + ~p) − ~p,    (5.29)

where (1 − t)~p + t(~u + ~p) ∈ S by our assumption on S,
showing that t~u ∈ H = S − ~p.
(c) Let ~u,~v ∈ H. Since ~0 is also in H, we can draw the straight lines through any pairs of these
two vectors.
[Figure: the points ~0, ~u, and ~v.]
Drawing where ~u+ ~v shows that it lies along a line parallel to the line straight through ~u and
~v and is explicitly given by
~u + ~v = (1/2)(2~u) + (1/2)(2~v).    (5.30)
A visualization of where this expression comes from is given by the following graphic.
[Figure: the points ~0, ~u, 2~u, ~v, 2~v, and ~u + ~v.]

Just as in (b), we can draw this in S by adding ~p. Using our assumption gives ~u + ~v + ~p ∈ S. This shows that ~u + ~v ∈ H = S − ~p.
Hence, H is a subspace of Rn showing that S is a linear manifold.
The previous proof indicates how 3 distinct points determine a plane. Notice that the point ~0
could have been anything and the geometric idea would still hold.
Definition 5.31. Let S be a set of vectors in Rn. The affine span of S is the set of all linear
combinations of vectors ~v in S of the form

∑_{~v} a_{~v} ~v   such that   ∑_{~v} a_{~v} = 1    (5.32)
and all but finitely many a~v are zero. The affine span of S is denoted by aff(S).
Example 5.33. In the proof of Theorem 5.26, ~u + ~v is in the affine span of the vectors {~0, ~u, ~v} while ~u + ~v is not in the affine span of {~u, ~v}. However, ~u + ~v is in the affine span of {2~u, 2~v}.
The affine span, not the usual span of vectors, is used to add vectors in linear manifolds. The
reason is because if we do not impose the additional condition that the sum of the coefficients is
1, we might “jump off” the linear manifold. We formalize this as follows.
Theorem 5.34. A set S is a linear manifold if and only if for every subset R ⊆ S, the affine span
of R is in S, i.e. aff(R) ⊆ S.
Proof. Exercise.
Recommended Exercises. Exercises 3, 4, and 17 in Section 2.8 of [Lay] and Exercises 7 and
20 in Section 2.9 of [Lay]. Be able to show all your work, step by step! Do not use calculators or
computer programs to solve any problems!
In this lecture, we went through parts of Sections 2.8, 2.9, and 8.1 of [Lay].
6 Convex spaces and linear programming
Linear manifolds with certain constraints are described by convex spaces. The quintessential
example of a convex space that will occur in many contexts, especially probability theory, is that
of a simplex.
Example 6.1. The set of all probability distributions on an n-element set can be described by a
mathematical object known as the standard (n− 1)-simplex and denoted by ∆n−1. It is defined by
∆^{n−1} := { (p1, . . . , pn) ∈ Rn : ∑_{i=1}^{n} pi = 1 and pi ≥ 0 ∀ i = 1, . . . , n }.    (6.2)
The interpretation of an (n− 1) simplex is as follows. For a set of events labeled by the numbers
1 through n, the probability of the event i taking place is pi. For example, the 2-simplex looks like
the following subset of R3 viewed from two different angles
[Figure: the standard 2-simplex in R3, drawn with axes p1, p2, p3 and vertices at (1, 0, 0), (0, 1, 0), and (0, 0, 1), viewed from two different angles.]
The 1-simplex describes the probability space associated with flipping a weighted coin. Is the
n-simplex a vector subspace of Rn+1? Is it a linear manifold? Why or why not?
The previous example motivates the following definition.
Definition 6.3. A convex space32 is a subset C of Rn such that if ~u,~v are any two vectors in C
then the interval
λ~u+ (1− λ)~v (6.4)
with λ ∈ [0, 1], is also in C.
Example 6.5. Every linear manifold is a convex space. However, not every convex space is a
linear manifold. For example, the n-simplex is a convex space but it is not a linear manifold.
Exercise 6.6. Which of the examples in the previous section are convex spaces?
Example 6.7. PacMan
32I do not want to go into the technicalities of closed and open sets, but throughout, we will always assume
that our convex spaces are also closed. This just means that we will be dealing with convex spaces that come from
linear systems of inequalities (to be described below). Visually, it means that our convex spaces always include
their boundaries, (faces, edges, vertices, etc.).
[Figure: a PacMan-shaped region with two points ~u and ~v marked]
is not a convex space since the interval connecting ~u and ~v is not in the space.
Convex spaces are important in linear algebra because they often arise as the solution sets of
systems of linear inequalities (instead of systems of equalities) of the form
a11x1 + a12x2 + · · ·+ a1nxn ≤ b1
a21x1 + a22x2 + · · ·+ a2nxn ≤ b2
...
am1x1 + am2x2 + · · ·+ amnxn ≤ bm.
(6.8)
You might ask why not also allow the reversed inequality on some of these equations? The reason
for this is because we can multiply the whole inequality by −1, reverse the ≥ to a ≤, and then
just rename the constant coefficients reproducing something of the form (6.8). How can we include
equalities? This is done by replacing any equality, such as
a21x1 + a22x2 + · · ·+ a2nxn = b2 (6.9)
by the two inequalities
a21x1 + a22x2 + · · ·+ a2nxn ≤ b2
−a21x1 − a22x2 − · · · − a2nxn ≤ −b2
(6.10)
(the second inequality is equivalent to the first one with a ≥ instead after multiplying both sides
by −1).
Example 6.11. Consider the following linear system of inequalities
y ≤ 1
2x− y ≤ 0
−2x− y ≤ 0
(6.12)
These regions are depicted in the following figure (the central point is the origin)
[Figure: the three half-planes y ≤ 1, y ≥ 2x, and y ≥ −2x and their intersection; the central point is the origin.]
The intersection describes an isosceles triangle with vertices given by

(0, 0),   (1/2, 1),   &   (−1/2, 1).    (6.13)
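A quick way to double-check vertices like these is to plug them into all of the inequalities at once. A small sketch for the system (6.12):

import numpy as np

# the system (6.12) written as A x <= b
A = np.array([[ 0,  1],
              [ 2, -1],
              [-2, -1]], dtype=float)
b = np.array([1, 0, 0], dtype=float)

for v in [(0, 0), (0.5, 1), (-0.5, 1)]:
    print(v, np.all(A @ np.array(v) <= b + 1e-12))   # True for each vertex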
Solving systems of inequalities is difficult in general. Row operations do not work because
multiplying by negative numbers reverses the sign of the inequality. Sometimes, one deals with
a combination of linear systems of equalities and inequalities such as in Example 6.1. There, the
linear system consists of only a single equation in n variables given by
p1 + p2 + · · ·+ pn = 1 (6.14)
and a system of n inequalities
p1 ≥ 0
...
pn ≥ 0.
(6.15)
Theorem 6.16. The set of solutions to any linear system of inequalities (6.8) is a convex space.
If you parse through the definition of a convex space, this says that if ~y and ~z are two solutions to
(6.8), then the interval connecting these two vectors, i.e. the set of points of the form λ~y+(1−λ)~z
with λ ∈ [0, 1], is also in the solution set.
Proof. The proof is almost the same as it was for a linear system of equalities (Proposition 3.13)
with one important difference. To see this let the system be described by A~x ≤ ~b and let ~y and ~z
be two solutions. Fix λ ∈ [0, 1]. Then,
a11 (λy1 + (1 − λ)z1) + · · · + a1n (λyn + (1 − λ)zn)
    = λ(a11 y1 + · · · + a1n yn) + (1 − λ)(a11 z1 + · · · + a1n zn)
    ≤ λ b1 + (1 − λ) b1
    = b1.    (6.17)
Note that it was crucial in the second last step that λ ∈ [0, 1] since if λ was not in this set, either
λ or 1− λ would be negative, and then one of the inequalities would have to flip.
The following table summarizes the relationship between the three types of linear systems we
have come across and their associated solution spaces.
Linear system Solution set
A~x = ~0 subspace
A~x = ~b linear manifold
A~x ≤ ~b convex space
Problem 6.18 (Problem 9.2.9 in [2]). Find all solutions to the linear system of inequalities
−2x+ y ≤ 4
x− 2y ≤ −4
−x ≤ 0
−y ≤ 0.
(6.19)
Draw the solution set as a convex subset of R2.
Answer. These regions are depicted in the following figure (the origin is represented by a •)
[Figure: the half-planes y ≥ 0, x ≥ 0, y ≤ 4 + 2x, and x − 2y ≤ −4, together with their intersection; the origin is marked by •.]
One often wants to optimize functions over linear constraints. As a result, one studies functions
over convex spaces.
Go through Example 1 in Section 9.2 of [2].
Definition 6.20. A linear functional on Rn is a function f : Rn → R satisfying the condition
f(a~u+ ~v) = af(~u) + f(~v) (6.21)
for all real numbers a and vectors ~u,~v ∈ Rn.
For example, in Example 1 from Section 9.2 of [2], the function was given by
R2 ∋ (x, y) ↦ 2x + 3y ∈ R.    (6.22)
This described the profit obtained. In trying to maximize this function, we looked at the level sets
of this function, which are described by the formula 2x+ 3y = c, with c a fixed constant.
Problem 6.23. Maximize the linear functional f : R2 → R given by
R2 ∋ (x, y) ↦ 2x + 7y ∈ R    (6.24)
over the convex set in Problem 6.18.
Answer. The level sets of this function are drawn for a few values below
[Figure: the level sets 2x + 7y = c for c = 0, 7, 14, 21, 28, 35 drawn over the solution set from Problem 6.18; the origin is marked by •.]
Although the minimum of this function is attained, namely at (x, y) = (0, 0), the maximum is not
attained.
The fact that a maximum was not attained in the previous problem is due largely to the fact
that the solution set of the linear system of inequalities is unbounded.
Definition 6.25. A subset S ⊆ Rn is bounded if there exists a positive number R > 0 such that
S ⊆ [−R,R]× · · · × [−R,R], where [−R,R] denotes the interval of diameter 2R centered at 0 and
[−R,R]× · · · × [−R,R] denotes the n-dimensional cube centered at 0 and whose side lengths are
all 2R. If no such R exists, S is said to be unbounded.
We have already learned that the set of solutions to a linear system of inequalities is a convex
set. If this convex set is, in addition, bounded, then it is a polytope (in two dimensions, a polytope
is just a polygon).
Theorem 6.26. Let A~x ≤ ~b be a consistent linear system of inequalities, let S denote the set of
solutions, and let f : Rn → R be a linear functional. If S is bounded, then f attains a maximum
and a minimum on S.
In fact, much more is true! The example we went over, namely Example 1 in Section 9.2
of [2], shows us that not only did a maximum occur on the polytope, but it occurred on a point of
intersection given by the lines separating regions from the inequality. These points are incredibly
significant because instead of checking infinitely many possible values to optimize a functional, one
can merely check these extreme points.
Definition 6.27. An extreme point of a convex space C in Rn is a vector ~u ∈ C such that if
~u = λ~v + (1− λ)~w for some ~v, ~w ∈ C and λ ∈ [0, 1], then ~v = ~u and/or ~w = ~u. The set of extreme
points of C is denoted by ex(C).
Example 6.28. The following figures show two examples of extreme points of convex spaces
illustrating the wide variety of possibilities.
[Figure: two convex regions and their sets ex(·) of extreme points.]
Extreme points of a convex space are important because the entire convex space can be obtained
via convex combinations of these points.
Definition 6.29. Let ~u1, . . . , ~um be a set of vectors in Rn. A convex combination of these vectors
is a linear combination of these vectors of the form
p1~u1 + · · ·+ pm~um, (6.30)
where (p1, . . . , pm) ∈ ∆m−1, i.e.
pi ≥ 0 ∀ i ∈ {1, . . . , m}   &   ∑_{i=1}^{m} pi = 1.    (6.31)
A convex combination of vectors is similar to an affine combination except that one can only
obtain the vectors “between” the vectors given.
Definition 6.32. The convex hull of a set S of vectors in Rn is the set of all vectors in Rn that
are (finite) convex combinations of the vectors in S. The convex hull of S is denoted by conv(S).
Example 6.33.
[Figure: a finite set of points • and their convex hull conv(·), the filled polygon they span.]
Theorem 6.34. Let C be a bounded convex set. Then
C = conv(ex(C)).    (6.35)
Theorem 6.36. Let A~x ≤ ~b be a consistent linear system of inequalities, let S denote the set of
solutions, and let f : Rn → R be a linear functional. If S is bounded, then f attains a maximum
and a minimum on ex(S).
Proof. See Theorem 16 in Section 8.5 of [3].
This theorem is supplemented by the important fact that the number of extreme points of a
bounded convex set of solutions to a linear system of inequalities is finite. Let’s use it to solve
Example 1 from Section 9.2 in [2].
Solve Example 1 in Section 9.2 of [3] by calculating the profit function on the extreme points.
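We do not reproduce the data of that example here, but the general recipe is worth seeing in code: list the finitely many extreme points, evaluate the linear functional at each, and take the largest value. The vertices below are made up purely for illustration.

# evaluating a linear functional at the extreme points of a polytope
# (hypothetical vertices, only to illustrate the recipe behind Theorem 6.36)
def f(x, y):
    return 2*x + 3*y            # a profit-style functional as in (6.22)

extreme_points = [(0, 0), (4, 0), (3, 2), (0, 3)]
values = {p: f(*p) for p in extreme_points}
print(values)                       # {(0, 0): 0, (4, 0): 8, (3, 2): 12, (0, 3): 9}
print(max(values, key=values.get))  # (3, 2) attains the maximum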
Recommended Exercises. Exercises 5, 11, 12 in Section 8.3 and 1 in Section 8.5 of [3]. Exercises
1 and 15 in Section 9.2 of [2]. Be able to show all your work, step by step!
In this lecture, we went through parts of Sections 8.3, 8.5, and 9.2 of [Lay].
7 Linear transformations and their matrices
In the context of linear programming, we discussed the notion of a linear functional. These are
special cases of linear transformations, which can be thought of as families of linear functionals.
Linear transformations arise in many familiar situations.
Example 7.1. In Example 0.11, a bakery was said to produce pancakes, tres leches cake, straw-
berry shortcake, and egg tarts. Each of these items requires several ingredients, some of which are
listed in Table 1, reproduced here for your convenience.
              Pancakes      Tres leches        Strawberry shortcake   Egg tarts
              (makes 12)    (makes 8 slices)   (makes 8 slices)       (makes 16 tarts)
eggs          2             6                  0                      6
strawberries  1 + 1/2 lbs   0                  1 + 1/2 lbs            0
heavy cream   1 cup         1 pint             3 cups                 0
flour         3 cups        1 cup              4 cups                 3 + 3/4 cups
butter        4 tbsp        0                  1 + 1/4 cups           1 + 1/3 cups
sugar         3 tbsp        1 cup              1/2 cup                2/5 cup
milk          2 cups        1/3 cup            0                      1/3 cup
How much flour is needed to make 4 batches of pancakes, 3 batches of tres leches cakes, 2 batches
of strawberry shortcakes, and 4 batches of egg tarts? To answer this, we would simply multiply
the number of batches for each item by the amount of flour needed for that item, and then add
up all these quantities for the different recipes:
4 × 3 + 3 × 1 + 2 × 4 + 4 × (3 + 3/4) = 38    (7.2)
cups of flour. This procedure of decomposing a pastry into its ingredients in this way can be
described by a transformation
ingredients (R7)  ←−−  pastries (R4).
On the right, we view the four different pastry items as four linearly independent vectors. This
is because no item, once made, can be obtained from the other items as a linear combination (we
cannot take an egg tart, two strawberry shortcakes, and one tres leches cake, and turn them into
a pancake). On the left, we view the seven different ingredients as seven linearly independent
vectors. Of course, milk, butter, and heavy cream are closely related, but let’s say that we do not
want to put in the extra effort to turn one of them into another form.
Now let’s try to provide a general abstract formula for which we can simply plug in numbers
when needed. How much flour is needed to make p batches of pancakes, t batches of tres leches
cakes, s batches of strawberry shortcakes, and e batches of egg tarts? Let F denote cups of flour
so that F is a function of p, t, s, and e. The idea is similar to above and the flour function is
F(p, t, s, e) = 3p + t + 4s + (15/4)e.    (7.3)
Notice that the flour function is an example of a linear functional (see Definition 6.20). Similar
equations can be written for the other ingredients. To describe the full transformation, we would
have to write a list of such linear functionals for each ingredient. The functions for eggs, strawber-
ries (by pounds), heavy cream (by cups), butter (by cups), sugar (by cups), and milk (by cups),
respectively, would also all be linear functionals.
Example 7.4. Figure 8 shows the difference between what an individual with protanopia sees
(on the left) and what an individual without any colorblindness sees (on the right).33 Since each
Figure 8: Protanopia colorblindness.
color can be identified numerically, by testing out several points, we can find out how each color
changes. We can then hope to describe the relationship between normal eyesight and protanopia
as a type of transformation
color space (R3)  ←−P−−  color space (R3),
which we have called P. It is far from obvious, but it is true (to a reasonable degree),34 that such
a transformation is linear. In particular, this means that it can be determined by just knowing
where three linearly independent colors go (recall Example 4.33). Under this transformation, the
images of the pure RGB colors look like:
33These figures were obtained from https://ssodelta.wordpress.com/tag/rgb-to-lms/ and the original
photo is from https://animals.desktopnexus.com/wallpaper/480778/. If you have protanopia, you should
see no difference between these images.34The filter might not be exactly linear. I need to do some research to figure out if there are corrections to
linearity. Hence, take this with a grain of salt.
[Figure: color swatches showing P(R), P(G), and P(B).]

For the YMC colors, the transformation looks like:

[Figure: color swatches showing P(Y), P(M), and P(C).]
Example 7.5. An experiment35 was done in 1926 to determine which color paint on walls helps a
baby sleep more. In this study, it was not only found that different color paints are more conducive
to healthier sleeping habits, but also that different genders were affected by colors differently. Table
2 shows the fractions of babies that had the healthiest sleeping habits to the corresponding color
paints.

        peach   lavender   sky blue   light green   light yellow
boy     0.15    0.2        0.2        0.2           0.25
girl    0.3     0.2        0.25       0.1           0.15

Table 2: Percentages for a study examining the healthiest sleeping habits for baby boys and girls depending on the color used for painting walls in a baby’s room.

A couple visited the doctor, who, upon analyzing their DNA, indicated that their odds of giving birth to a baby boy are actually 60%. The couple wants to finish the paint job in the baby’s
room well before the baby is born. What color should they paint their walls?
For this situation, we multiply all the respective probabilities for a boy by 0.6 and for a girl by
0.4 and then sum the results as in Table 3. The highest percentage occurs for sky blue. Hence,
the couple should paint the room sky blue.
How would these results change if the doctor told them they are actually only 40% likely to
give birth to a boy? For this situation, we multiply all the respective probabilities for a boy by
0.4 and for a girl by 0.6 and then sum the results as in Table 4. In this case, peach wins.

^35 This is an example of a poorly designed experiment, but let’s just go with it...
peach lavender sky blue light green light yellow
boy 0.09 0.12 0.12 0.12 0.15
girl 0.12 0.08 0.1 0.04 0.06
sum 0.21 0.2 0.22 0.16 0.21
Table 3: Percentages for 60% chance of giving birth to a boy
peach lavender sky blue light green light yellow
boy 0.06 0.08 0.08 0.08 0.1
girl 0.18 0.12 0.15 0.06 0.09
sum 0.24 0.2 0.23 0.14 0.19
Table 4: Percentages for 40% chance of giving birth to a boy
Definition 7.6. A linear transformation (sometimes called an operator) from Rn to Rm is an
assignment, denoted by T, sending any vector ~x in Rn to a unique vector T (~x) in Rm satisfying
T (~x+ ~y) = T (~x) + T (~y) (7.7)
and
T (c~x) = cT (~x) (7.8)
for all ~x, ~y in Rn and all c in R. Such a linear transformation can be written in any of the following
ways36
T : Rn → Rm, Rn T−→ Rm, Rm ← Rn : T, or Rm T←− Rn. (7.9)
Given a vector ~x in Rn and a linear operator Rm T←− Rn, the vector T (~x) in Rm is called the image
of ~x under T. Rn is called the domain of T and Rm is called the codomain. The image of all vectors
in Rn under T is called the range of T.
A linear transformation is completely determined by what it does to a basis.
Example 7.10. In Example 7.1, we only needed to know the values of p, t, s, and e to determine
the amount of flour needed. The four pastry items pancakes, tres leches, strawberry shortcakes,
and egg tarts, form a basis in the sense that no one of these items can be obtained from any
combination of any other (linear independence) and all pastry items obtainable are precisely these
(they span the possible products of the bakery). Let E, T,H, F,B, S,M denote the functions for
eggs, strawberries (by pounds), heavy cream (by cups), flour (by cups), butter (by cups), sugar
(by cups), and milk (by cups), respectively. Given any value of p, t, s, and e, which can be viewed
as a vector in R4 as

(p, t, s, e),    (7.11)
36My preference are the second and last.
the resulting ingredients are given as a vector in R7 as

(E, T, H, F, B, S, M) = p (2, 3/2, 1, 3, 1/4, 3/16, 2) + t (6, 0, 2, 1, 0, 1, 1/3)
                        + s (0, 3/2, 3, 4, 5/4, 1/2, 0) + e (6, 0, 0, 15/4, 4/3, 2/5, 1/3).    (7.12)
We can then use the notation A~x = ~b to express this as
A (p, t, s, e) = (E, T, H, F, B, S, M),   where   A =

[ 2      6      0      6    ]
[ 3/2    0      3/2    0    ]
[ 1      2      3      0    ]
[ 3      1      4      15/4 ]
[ 1/4    0      5/4    4/3  ]
[ 3/16   1      1/2    2/5  ]
[ 2      1/3    0      1/3  ].    (7.13)
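As a quick check that this matrix reproduces the earlier flour count (38 cups for 4, 3, 2, and 4 batches), here is a short sketch; the numbers are only a sanity check.

import numpy as np

A = np.array([[2,    6,    0,    6   ],   # eggs
              [1.5,  0,    1.5,  0   ],   # strawberries (lbs)
              [1,    2,    3,    0   ],   # heavy cream (cups)
              [3,    1,    4,    3.75],   # flour (cups)
              [0.25, 0,    1.25, 4/3 ],   # butter (cups)
              [3/16, 1,    0.5,  0.4 ],   # sugar (cups)
              [2,    1/3,  0,    1/3 ]])  # milk (cups)
batches = np.array([4, 3, 2, 4])          # pancakes, tres leches, shortcake, egg tarts
print(A @ batches)                        # the flour entry (index 3) is 38.0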
Example 7.14. Colorblindness can be modeled in terms of a linear transformation.37 Referring
back to Example 4.33 for notation, protanopia (a type of colorblindness) can be described in terms
of a linear transformation. To obtain this linear transformation, we should write the corresponding
RGB codes for what happens to R, G, and B when we view these colors as vectors in R3. These
are given by
P(R) = (28.6579, 28.6579, 0)      for R = (255, 0, 0),
P(G) = (226.3423, 226.3423, 0)    for G = (0, 255, 0),
P(B) = (0, 0, 255)                for B = (0, 0, 255),
and for the YMC colors, the transformation looks like:
37The following is based on my understanding of the information on the website https://ssodelta.wordpress.
com/tag/rgb-to-lms/.
P(Y) = (255, 255, 0)              for Y = (255, 255, 0),
P(M) = (28.6579, 28.6579, 255)    for M = (255, 0, 255),
P(C) = (226.3423, 226.3423, 255)  for C = (0, 255, 255).
From where R,G, and B go, the matrix
P := [ 0.112384   0.887617   0 ]
     [ 0.112384   0.887617   0 ]
     [ 0          0          1 ]    (7.15)

describes the filter for protanopia because

(Rnew, Gnew, Bnew) = P (R, G, B) = R (0.112384, 0.112384, 0) + G (0.887617, 0.887617, 0) + B (0, 0, 1).    (7.16)
For example, let’s calculate P applied to the color whose RGB vector is

(251, 212, 185).    (7.17)

By linearity,

P(251, 212, 185) = P(251, 0, 0) + P(0, 212, 0) + P(0, 0, 185)
                 ≈ (28, 28, 0) + (188, 188, 0) + (0, 0, 185)
                 = (216, 216, 185).    (7.18)
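The same computation in a couple of lines (the rounding to whole RGB values is mine):

import numpy as np

P = np.array([[0.112384, 0.887617, 0],
              [0.112384, 0.887617, 0],
              [0,        0,        1]])
color = np.array([251, 212, 185])
print(np.rint(P @ color))   # [216. 216. 185.]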
Using this notation, we can now make sense of linear systems and why they are expressed in
the form A~x = ~b.
66
Definition 7.19. Let n ∈ N be a positive integer and let i ∈ 1, . . . , n. The i-th unit vector in
Rn is the vector
~ei := (0, . . . , 0, 1, 0, . . . , 0),    (7.20)

which is 0 in every entry except the i-th entry, where it is 1.
Example 7.21. In R, there is only one unit vector, ~e1, and it can be drawn as
[Figure: ~e1 drawn on the number line.]
The distance between each pair of consecutive tick marks is 1.
Example 7.22. In R2, ~e1 and ~e2 can be drawn as
[Figure: ~e1 and ~e2 drawn in the plane.]
Example 7.23. In R3, all of the following notations are used to express certain vectors
~e1 = x = i = (1, 0, 0),   ~e2 = y = j = (0, 1, 0),   &   ~e3 = z = k = (0, 0, 1)    (7.24)
[Figure: the unit vectors ~e1, ~e2, ~e3 drawn along the x, y, and z axes in R3, with a blue box around them.]
The blue box is merely drawn to help the viewer in visualizing the coordinates in three dimensions.
Definition 7.25. Let Rm T←− Rn be a linear transformation. The m × n matrix associated to T is the m × n array of numbers whose entries are given by

[ T(~e1)   · · ·   T(~en) ].    (7.26)

More explicitly, if the vector T(~ei) is written as

T(~ei) = (a1i, . . . , ami),    (7.27)

then

[ T(~e1)   · · ·   T(~en) ]  =  [ a11   a12   · · ·   a1n ]
                                [ a21   a22   · · ·   a2n ]
                                [ ...   ...           ... ]
                                [ am1   am2   · · ·   amn ]    (7.28)
The relationship between matrices and augmented matrices from before is given as follows. An
augmented matrix of the form (2.1) corresponding to a linear system (1.2) with variables x1, . . . , xn,
can be expressed as
A~x = ~b, (7.29)
where the notation A~x stands for the vector^38

[ a11   a12   · · ·   a1n ] [ x1 ]      [ a11 x1 + a12 x2 + · · · + a1n xn ]
[ a21   a22   · · ·   a2n ] [ x2 ]  :=  [ a21 x1 + a22 x2 + · · · + a2n xn ]
[ ...   ...           ... ] [ .. ]      [               ...                ]
[ am1   am2   · · ·   amn ] [ xn ]      [ am1 x1 + am2 x2 + · · · + amn xn ]    (7.30)

in Rm. Notice that this vector can be decomposed, by factoring out the common factors, as

A~x = x1 (a11, a21, . . . , am1) + x2 (a12, a22, . . . , am2) + · · · + xn (a1n, a2n, . . . , amn).    (7.31)
This is nothing more than the linearity of T expressed in matrix form. Equivalently, this equation
can be written as

[ T(~e1)   · · ·   T(~en) ] (x1, . . . , xn) = x1 T(~e1) + · · · + xn T(~en).    (7.32)
38The vector on the right-hand-side is a definition of the notation on the left-hand-side. Don’t be confused by
the fact that there are a lot of terms inside each component of the vector on the right-hand-side of (7.30)—it is not
an m× n matrix!
In this way, we see the columns of the matrix A more clearly. Furthermore, an m×n matrix can be
viewed as having an existence independent of a linear transformation, at least a-priori. Therefore,
an m× n matrix acts on a vector in Rn to produce a vector in Rm. This is a way of consolidating
the augmented matrix and A is precisely the matrix corresponding to the linear system. One can
express the matrix A as a row of column vectors
A = [ ~a1   ~a2   · · ·   ~an ],    (7.33)
where the i-th component of the j-th vector ~aj is given by
(~aj)i = aij. (7.34)
In this case, ~b is explicitly expressed as a linear combination of the vectors ~a1, . . . ,~an via
~b = x1~a1 + · · ·+ xn~an. (7.35)
Therefore, solving for the variables x1, . . . , xn for the linear system (1.2) is equivalent to finding
coefficients x1, . . . , xn that satisfy (7.35). A~x = ~b is called a matrix equation.
Warning: we do not provide a definition for an m × n matrix acting on a vector in Rk with
k 6= n.
Thus, there are three equivalent ways to express a linear system.
(a) m linear equations in n variables (1.2).
(b) An augmented matrix (2.1).
(c) A matrix equation A~x = ~b as in (7.30).
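The column-by-column description in (7.31) translates directly into code. The function below is a bare-bones sketch of the definition (numpy’s built-in A @ x computes the same thing).

import numpy as np

def matvec(A, x):
    # compute A x as the linear combination x1*a1 + ... + xn*an of the columns of A
    m, n = A.shape
    result = np.zeros(m)
    for j in range(n):
        result += x[j] * A[:, j]   # x_j times the j-th column of A
    return result

A = np.array([[1., -1., 2.],
              [0.,  3., -1.]])
x = np.array([2., 1., 1.])
print(matvec(A, x), A @ x)   # both give [3. 2.]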
The above observations also lead to the following.
Theorem 7.36. Let A be a fixed m × n matrix. The following statements are equivalent (which
means that any one implies the other and vice versa).
(a) For every vector ~b in Rm, the solution set of the equation A~x = ~b, meaning the set of all ~x
satisfying this equation, is nonempty.
(b) Every vector ~b in Rm can be written as a linear combination of the columns of A, viewed as
vectors in Rm, i.e. the columns of A span Rm.
(c) A has a pivot position in every row.
Proof. Let’s just check part of the equivalence between (a) and (b) by showing that (b) implies
(a). Suppose that a vector ~b can be written as a linear combination
~b = x1~a1 + · · ·+ xn~an, (7.37)
where the x1, . . . , xn are some coefficients. Rewriting this using column vector notation gives

(b1, . . . , bm) = x1 ((a1)_1, . . . , (a1)_m) + · · · + xn ((an)_1, . . . , (an)_m).    (7.38)
We can set our notation and write
(aj)i ≡ aij. (7.39)
Then, writing out this equation of vectors gives

(b1, . . . , bm) = (x1 a11 + · · · + xn a1n, . . . , x1 am1 + · · · + xn amn)    (7.40)
by the rules about scaling and adding vectors from last lecture. The resulting equation is exactly
the linear system corresponding to A~x = ~b. Hence, the x’s from the linear combination in (7.37)
give a solution of the matrix equation A~x = ~b.
Theorem 7.41. Let A be an m×n matrix, let ~x and ~y be two vectors in Rn, and let c be any real
number. Then
A(~x+ ~y) = A~x+ A~y & A(c~x) = cA~x. (7.42)
In other words, every m× n matrix determines a linear transformation Rm T←− Rn.
Exercise 7.43. Prove this! To do this, write out an arbitrary A matrix with entries as in (7.30)
along with two vectors ~x and ~y and simply work out both sides of the equation using the rule in
(7.30).
Recommended Exercises. Exercises 4, 13 (there is a typo in the 5th edition: you can ignore
the symbol R3), 17, 25, in Section 1.4 of [Lay], Exercises 10, 12, 17, 18, 27 (26)—this is the line
segment problem, 31, 33 in Section 1.8 of [Lay], and Exercise 3 in Section 1.10 of [Lay]. Be able
to show all your work, step by step!
In this lecture, we went through parts of Sections 1.4, 1.8, 1.9, and 1.10 of [Lay].
8 Visualizing linear transformations
Every m× n matrix A acts on a vector ~x in Rn and produces a vector ~b in Rm as in
A~x = ~b. (8.1)
Furthermore, a matrix acting on vectors in Rn in this way satisfies the following two properties
A(~x+ ~y) = A~x+ A~y (8.2)
and
A(c~x) = cA~x (8.3)
for any other vector ~y in Rn and any scalar c. Since ~x is arbitrary, we can think of A as an operation
that acts on all of Rn. Any time you input a vector in Rn, you get out a vector in Rm. We can
depict this diagrammatically as
Rm  ←−A−−  Rn    (8.4)
You will see right now (and several times throughout this course) why we write the arrows from
right to left (your book does not, which I personally find confusing).^39 For example,^40

(2, 1, −4, −7)  ←−[   [ 1  −1   2 ]
                      [ 0   3  −1 ]   (−1, 1, 2)    (8.5)
                      [ 4  −2   1 ]
                      [ 2  −3  −1 ]
is a 4× 3 matrix (in the middle) acting on a vector in R3 (on the right) and producing a vector in
R4 (on the left).
Example 8.6. In Exercise 1.3.28 in [Lay], two types of coal, denoted by A and B, respectively,
produce a certain amount of heat (H), sulfur dioxide (S), and pollutants (P ) based on the quantity
of input for the two types of coal. Let HA, SA, and PA denote these quantities for one ton of A
and let HB, SB, and PB denote these quantities for one ton of B. Visually, these can be described
as a linear transformation
(HA, SA, PA)  ←−[  A        (HB, SB, PB)  ←−[  B    (8.7)
and the matrix associated to this transformation is

[ HA   HB ]
[ SA   SB ]
[ PA   PB ]    (8.8)
39It doesn’t matter how you draw it as long as you are consistent and you know what it means. It’s not a ‘rule’
and only my preference.40We use arrows with a vertical dash as in ← [ at the beginning when we act on specific vectors.
and it acts on vectors of the form

(x, y),    (8.9)
where x is the number of tons of coal of type A and y is the number of tons of coal of type B. The
rows of the matrix describe the type of output while the columns correspond to all outputs due to
a given input (the type of coal used). Indeed, the net output given x tons of A and y tons of B is

[ HA   HB ] [ x ]     [ x HA + y HB ]        [ HA ]        [ HB ]
[ SA   SB ] [ y ]  =  [ x SA + y SB ]  =  x  [ SA ]  +  y  [ SB ]    (8.10)
[ PA   PB ]           [ x PA + y PB ]        [ PA ]        [ PB ]
as you probably already know from doing that exercise. The rows in the resulting vector cor-
respond to the total heat, sulfur dioxide, and pollutant outputs, respectively. But thinking of
the transformation (8.7) abstractly without matrices, it can be viewed as a linear transformation
without reference to any given set of vectors. Abstractly, the transformation of the power plant
produces 3 outputs (heat, sulfur dioxide, and pollutants) from 2 inputs (the two types of coal
used).
From the above discussion, every m × n matrix is an example of a linear transformation from Rn to Rm. In the example above, namely (8.5), the image of

(−1, 1, 2)    (8.11)

under the linear operator given by the matrix

[ 1  −1   2 ]
[ 0   3  −1 ]
[ 4  −2   1 ]
[ 2  −3  −1 ]    (8.12)

is

(2, 1, −4, −7).    (8.13)
Notice that the operator can act on any other vector in R3 as well, not just the particular choice
we made. So for example, the image of

(0, 3, −1)    (8.14)

would be

[ 1  −1   2 ] [  0 ]     [ −5 ]
[ 0   3  −1 ] [  3 ]  =  [ 10 ]
[ 4  −2   1 ] [ −1 ]     [ −7 ]
[ 2  −3  −1 ]            [ −8 ].    (8.15)
Maybe now you see why we wrote our arrows from right to left. It makes acting on the vectors with
the matrix much more straightforward (as written on the page). If we didn’t, we would have to flip
the vector to the other side of the matrix every time to calculate the image. In this calculation,
we showed

(−5, 10, −7, −8)  ←−[   [ 1  −1   2 ]
                        [ 0   3  −1 ]   (0, 3, −1).    (8.16)
                        [ 4  −2   1 ]
                        [ 2  −3  −1 ]
Notice that the center matrix always stays the same no matter what vectors in R3 we put on
the right. The matrix in the center is a rule that applies to all vectors in R3. When the matrix
changes, the rule changes, and we have a different linear transformation.
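In code, the rule is literally the matrix: build A from (8.12) once and apply it to as many vectors in R3 as you like. A minimal sketch:

import numpy as np

A = np.array([[1, -1,  2],
              [0,  3, -1],
              [4, -2,  1],
              [2, -3, -1]], dtype=float)

print(A @ np.array([-1, 1, 2]))    # [ 2.  1. -4. -7.]
print(A @ np.array([ 0, 3, -1]))   # [-5. 10. -7. -8.]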
Example 8.17. Consider the transformation that multiplies every vector by 2. Under this trans-
formation, the vector

(1, 2, 2)    (8.18)

gets sent to

(2, 4, 4).    (8.19)

This transformation is linear and the matrix representing it is

[ 2  0  0 ]
[ 0  2  0 ]
[ 0  0  2 ].    (8.20)
Example 8.21. Let θ be some angle in [0, 2π). Let Rθ : R2 → R2 be the transformation that rotates (counter-clockwise) all the vectors in the plane by the angle θ (for the pictures, let’s say θ = π/2). This transformation is linear and is represented by the matrix

Rθ := [ cos θ   −sin θ ]
      [ sin θ    cos θ ]    (8.22)

For θ = π/2, this looks like

[Figure: ~e1 and ~e2 together with their images R_{π/2}(~e1) and R_{π/2}(~e2) under the matrix [ 0 −1 ; 1 0 ].]
Visually, it is not difficult to believe that rotation by an angle θ is a linear transformation.
However, to prove it is a bit non-trivial.
Problem 8.23. Prove that R2 Rθ←− R2, defined by rotating all vectors by θ, is a linear transforma-
tion.
Answer. We will not prove this here as it is a homework problem, but we will set it up so that
you know what is involved in proving such a claim. Any vector in R2 can be expressed in the
following two ways

(x, y) = x (1, 0) + y (0, 1),    (8.24)
where x, y ∈ R are the coordinates of the vector. In this context, Rθ is a linear transformation
whenever
Rθ((x, y)) = x Rθ((1, 0)) + y Rθ((0, 1))    (8.25)
for all x, y ∈ R. Therefore, you must calculate each side of this equality and prove that the results
you obtain are the same.
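Aside (not in [Lay]): a numerical spot check does not prove linearity, but it can make (8.25) believable before you prove it. Here is a small Python/numpy sketch; the angle and coordinates are arbitrary choices of mine.
\begin{verbatim}
import numpy as np

theta = 0.7   # any angle works for a spot check
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x, y = 3.0, -2.0
lhs = R @ np.array([x, y])   # left side of (8.25)
rhs = x * (R @ np.array([1.0, 0.0])) + y * (R @ np.array([0.0, 1.0]))
print(np.allclose(lhs, rhs))   # True, as linearity predicts
\end{verbatim}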
Example 8.26. A vertical shear in R2 is given by a matrix of the form
\[
S^{|}_k := \begin{bmatrix} 1 & 0 \\ k & 1 \end{bmatrix}, \tag{8.27}
\]
while a horizontal shear is given by a matrix of the form
\[
S^{-}_k := \begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}, \tag{8.28}
\]
where k is a real number. When k = 1, the former is depicted by a figure in which ~e1 is sent to
S^|_1(~e1) = (1, 1) while ~e2 is fixed, and the latter is depicted by a figure in which ~e2 is sent to
S^-_1(~e2) = (1, 1) while ~e1 is fixed.
Example 8.29. Many more examples are given in Section 1.9 of [Lay]. You should be comfortable
with all of them!
Recommended Exercises. Exercises 6 (8) and 13 in Section 1.9 of [Lay]. Many of the Chapter
1 Supplementary Exercises are good as well! Be able to show all your work, step by step!
In this lecture, we finished Sections 1.4 and 1.9 of [Lay]. We still have a few concepts to cover
from Section 1.8.
9 Subspaces associated to linear transformations
Definition 9.1. Let \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) be a linear transformation with associated m × n matrix denoted by
A. The kernel of T is the set of all vectors ~x ∈ Rn such that T(~x) = ~0,
\[
\ker(T) = \{\vec{x} \in \mathbb{R}^n : T(\vec{x}) = \vec{0}\}. \tag{9.2}
\]
Equivalently, the null space of A is the set of all solutions to the homogeneous equation
\[
A\vec{x} = \vec{0}. \tag{9.3}
\]
Problem 9.4. Jake bought stocks A and B in 2013 at a cost of CA and CB per stock, respectively.
He spent a total of $10,000. In 2017, he sold the stocks at a selling price of SA and SB per stock,
respectively. Suppose that SB ≠ CB and SA ≠ CA. In the end, he broke even, because he was a
scrub and didn't diversify his assets. How many of each stock did Jake buy? What if he initially
spent $15,000 and still broke even?

Answer. Let x and y denote the number of stocks (possibly not a whole number) of A and B
that Jake had purchased in 2013. Because he spent $10,000,
\[
x C_A + y C_B = 10000. \tag{9.5}
\]
Because he broke even, his profit function \(\mathbb{R}_{\ge 0} \times \mathbb{R}_{\ge 0} \ni (x, y) \mapsto p(x, y)\) satisfies
\[
x(S_A - C_A) + y(S_B - C_B) = 0. \tag{9.6}
\]
The different possible combinations of purchasing stocks A and B and breaking even describes the
kernel of the profit function. These two equations describe a linear manifold and a subspace of R2,
sketched in a figure whose axes are the total value of A and the total value of B. The intersection
of the two lines indicates the quantity of stocks that were purchased. If Jake had spent $15,000,
only the first equation would change, and this would merely shift the blue line in the figure.
Kernels can also be used to describe, in some sense, the information that is lost. This will be
discussed more precisely in Definition 9.43.
Example 9.7. Consider the example of protanopia colorblindness. The kernel of the protanopia
filter P can be calculated by solving
\[
\begin{bmatrix} 0.112384 & 0.887617 & 0 & 0 \\ 0.112384 & 0.887617 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\to
\begin{bmatrix} 0.112384 & 0.887617 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}. \tag{9.8}
\]
Therefore, the kernel is the set of vectors of the form
\[
G \begin{bmatrix} -7.89807 \\ 1 \\ 0 \end{bmatrix}, \tag{9.9}
\]
where G is a free variable. As you can tell, the only solution that physically makes sense is when
G = 0 since colors cannot be chosen to be negative. Hence, although the kernel associated to the
linear transformation P is spanned by the vector
\[
\begin{bmatrix} -7.89807 \\ 1 \\ 0 \end{bmatrix}, \tag{9.10}
\]
no nonzero multiple of this vector intersects the set of allowed color values. So is there any information
actually lost? Is it possible for a "reverse filter" to be applied to somebody with protanopia so
that they can see in full color? I'll let you think about it.
Theorem 9.11. The kernel (null space) of a linear transformation \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) is a subspace of Rn.

Proof. We must check the axioms of a subspace.

(a) The zero vector satisfies T(~0) = ~0 because T(~0) = T(0~0) = 0T(~0) = ~0, since 0 times any vector
is the zero vector. Linearity of T was used in the second equality.

(b) Let ~x ∈ ker(T) and let c ∈ R. Then T(c~x) = cT(~x) = c~0 = ~0. The first equality follows from
linearity of T.

(c) Let ~x, ~y ∈ ker(T). Then T(~x + ~y) = T(~x) + T(~y) = ~0 + ~0 = ~0. The first equality follows from
linearity of T.
There is actually an important consequence in the above proof. We will illustrate why this is
so in a short example.
Corollary 9.12. Let \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) be a linear transformation. Then T(~0) = ~0.

This is important because it provides one quick method of showing that certain functions are
not linear transformations. This is because this corollary says that it is necessary (i.e. it must be
the case) that ~0 is in the kernel of every linear transformation. In other words, if you show that
~0 is not in the kernel of some function, then that means it cannot be linear.
Problem 9.13. Let \(\mathbb{R}^3 \xleftarrow{T} \mathbb{R}^3\) be the function defined by
\[
\mathbb{R}^3 \ni (x, y, z) \mapsto T(x, y, z) := (x + y - z,\ 2x - 3y + 2,\ 3x - 5z). \tag{9.14}
\]
Show that T is not a linear transformation.

Answer. T(0, 0, 0) = (0, 2, 0) ≠ (0, 0, 0), so T is not a linear transformation.
Warning: showing that T (~0) = ~0 does not mean that the function is linear.
Problem 9.15. Let \(\mathbb{R}^3 \xleftarrow{T} \mathbb{R}^2\) be the function defined by
\[
\mathbb{R}^2 \ni (x, y) \mapsto T(x, y) := \big((2 - y)x,\ x + 3y,\ 2x - y\big). \tag{9.16}
\]
Show that T is not a linear transformation.

Notice that T(0, 0) = (0, 0, 0), so this does not help in showing that T is not linear.

Answer. 2T(1, 1) = 2(1, 4, 1) = (2, 8, 2) while T(2, 2) = (0, 8, 2). Since 2T(1, 1) ≠ T(2, 2), T is
not a linear transformation.
Note that all we had to show was one instance where linearity failed. Linearity is supposed to
hold for all inputs, so if we find just one case where it fails, the function cannot be linear. We now
turn to illustrating some examples of Theorem 9.11.
Example 9.17. Consider the linear system
\[
3x - 2y + z = 0 \tag{9.18}
\]
from an earlier section. The matrix corresponding to this linear system is just
\[
A = \begin{bmatrix} 3 & -2 & 1 \end{bmatrix}, \tag{9.19}
\]
a 1 × 3 matrix. Hence, it describes a linear transformation from R3 to R1. The null space of A
exactly corresponds to the solutions of
\[
\begin{bmatrix} 3 & -2 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= \begin{bmatrix} 0 \end{bmatrix}. \tag{9.20}
\]
Definition 9.21. Let \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) be a linear transformation with associated m × n matrix denoted
by A. The image (also called range) of T is the set of all vectors in Rm of the form T(~x) with ~x in
Rn. Equivalently, the column space of A is the span of the columns of A.

The reason the image of a transformation \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) is the same as the column space is because
the image of T is spanned by the vectors in the columns of the associated matrix
\[
\begin{bmatrix} | & & | \\ T(\vec{e}_1) & \cdots & T(\vec{e}_n) \\ | & & | \end{bmatrix}. \tag{9.22}
\]
In other words, ~b is in the image of A if and only if there exist coefficients x1, . . . , xn such that
\[
\vec{b} = x_1 T(\vec{e}_1) + \cdots + x_n T(\vec{e}_n). \tag{9.23}
\]
Example 9.24. For the baker's linear transformation from batches of pastries to the ingredients
required, the image of this transformation describes the quantities of ingredients needed to make
the batches of pastries exactly, without any excess. For any point not in the image, the baker will
either have an excess of certain ingredients or a lack of certain ingredients.
Theorem 9.25. The image of a linear transformation \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) is a subspace of Rm.

To avoid confusion with "imaginary," the image of T will be denoted by ran(T).

Proof. We must check the axioms of a subspace.

(a) Since every linear transformation takes ~0 to ~0, the zero vector satisfies ~0 = T(~0). Hence,
~0 ∈ ran(T).

(b) Let ~x ∈ ran(T) and let c ∈ R. By the first assumption, there exists a ~z ∈ Rn such that
T(~z) = ~x. Then c~x = cT(~z) = T(c~z), which shows that c~x ∈ ran(T).

(c) Let ~x, ~y ∈ ran(T). Then there exist ~z, ~w such that T(~z) = ~x and T(~w) = ~y. Therefore,
~x + ~y = T(~z) + T(~w) = T(~z + ~w), so that ~x + ~y ∈ ran(T).
Definition 9.26. Let \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) be a linear transformation. The dimension of the image of T is
called the rank of T and is denoted by rank T,
\[
\operatorname{rank}(T) = \dim(\operatorname{ran}(T)). \tag{9.27}
\]

The rank of a linear transformation \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) can be calculated by counting the number of
pivot columns in the associated m × n matrix. In fact, the pivot columns (from the original matrix)
form a basis for the image of T.

Proposition 9.28. Let \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) be a linear transformation. Then the pivot columns of the associated matrix
\[
\begin{bmatrix} | & & | \\ T(\vec{e}_1) & \cdots & T(\vec{e}_n) \\ | & & | \end{bmatrix} \tag{9.29}
\]
form a basis for ran(T).

Proof. We have already established that the columns span ran(T). Let i1, . . . , ik denote the indices
corresponding to the pivot columns. Then, removing the non-pivot columns from this matrix, we
get
\[
\begin{bmatrix} | & & | \\ T(\vec{e}_{i_1}) & \cdots & T(\vec{e}_{i_k}) \\ | & & | \end{bmatrix}
\xrightarrow{\text{row operations}}
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 \\
0 & 0 & \cdots & 0
\end{bmatrix} \tag{9.30}
\]
after reducing to reduced row echelon form. This shows that the pivot columns of the original
matrix are linearly independent. Hence, they form a basis for the image of T.
Example 9.31. Consider the linear transformation from R2 to R3 described by the matrix
\[
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -3 & 2 \end{bmatrix}. \tag{9.32}
\]
The vectors ~e1 and ~e2 get sent to the columns of this matrix, and these columns span the plane
shown in Figure 9.
Problem 9.33. Let
\[
A := \begin{bmatrix} 1 & -2 & 3 \\ -5 & 10 & -15 \end{bmatrix} \tag{9.34}
\]
and set
\[
\vec{b} := \begin{bmatrix} 2 \\ -10 \end{bmatrix}. \tag{9.35}
\]

(a) Find a vector ~x such that A~x = ~b.

(b) Is there more than one such ~x as in part (a)?
Figure 9: A plot of the plane described by 3x − 2y + z = 0 along with two vectors
spanning it.
(c) Is the vector
\[
\vec{v} := \begin{bmatrix} 3 \\ 0 \end{bmatrix} \tag{9.36}
\]
in the range of A viewed as a linear transformation?

Answer. you found me!

(a) To answer this, we must solve
\[
\begin{bmatrix} 1 & -2 & 3 \\ -5 & 10 & -15 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
= \begin{bmatrix} 2 \\ -10 \end{bmatrix}, \tag{9.37}
\]
which we can do in the usual way we have learned:
\[
\left[\begin{array}{ccc|c} 1 & -2 & 3 & 2 \\ -5 & 10 & -15 & -10 \end{array}\right]
\xrightarrow{\text{add 5 times row 1 to row 2}}
\left[\begin{array}{ccc|c} 1 & -2 & 3 & 2 \\ 0 & 0 & 0 & 0 \end{array}\right]. \tag{9.38}
\]
There are two free variables here, say x2 and x3. Then x1 is expressed in terms of them via
\[
x_1 = 2 + 2x_2 - 3x_3. \tag{9.39}
\]
Therefore, any vector of the form
\[
\begin{bmatrix} 2 + 2x_2 - 3x_3 \\ x_2 \\ x_3 \end{bmatrix} \tag{9.40}
\]
for any choice of x2 and x3 will have image ~b.
(b) By the analysis from part (a), yes there is more than one such vector.
(c) To see if ~v is in the range of A, we must find a solution to
\[
\begin{bmatrix} 1 & -2 & 3 \\ -5 & 10 & -15 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
= \begin{bmatrix} 3 \\ 0 \end{bmatrix}, \tag{9.41}
\]
but applying row operations as above,
\[
\left[\begin{array}{ccc|c} 1 & -2 & 3 & 3 \\ -5 & 10 & -15 & 0 \end{array}\right]
\xrightarrow{\text{add 5 times row 1 to row 2}}
\left[\begin{array}{ccc|c} 1 & -2 & 3 & 3 \\ 0 & 0 & 0 & 15 \end{array}\right], \tag{9.42}
\]
shows that the system is inconsistent. This means that there are no solutions and therefore ~v
is not in the range of A.
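Aside (not in [Lay]): the computations in this problem can be double-checked numerically. The following Python/numpy sketch (my own illustration) confirms that the parametric solutions from part (a) hit ~b, and that nothing is sent to ~v in part (c).
\begin{verbatim}
import numpy as np

A = np.array([[ 1., -2.,   3.],
              [-5., 10., -15.]])

# Part (a): any choice of the free variables x2, x3 gives a solution (9.40).
x2, x3 = 1.0, 2.0
x = np.array([2 + 2*x2 - 3*x3, x2, x3])
print(A @ x)   # [  2. -10.], which is b

# Part (c): least squares finds the closest we can get to v = (3, 0);
# the nonzero residual confirms that v is not in the range of A.
v = np.array([3.0, 0.0])
sol = np.linalg.lstsq(A, v, rcond=None)[0]
print(A @ sol - v)   # not the zero vector
\end{verbatim}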
Definition 9.43. A linear transformation \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) is onto if every vector ~b in Rm is in the range
of T, and is one-to-one if for any vector ~b in the range of T, there is only a single vector ~x in Rn
whose image is ~b.
Theorem 9.44. The following are equivalent for a linear transformation \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\).

(a) T is one-to-one.

(b) The only solution to the linear system T(~x) = ~0 is ~x = ~0.

(c) The columns of the matrix associated to T are linearly independent.

Proof. We will prove (a) =⇒ (b) =⇒ (c) =⇒ (a).

((a) =⇒ (b)) Suppose that T is one-to-one. Suppose there is an ~x ∈ Rn such that T(~x) = ~0. Since
T is linear, T(~0) = ~0. Since T is one-to-one, ~x = ~0.

((b) =⇒ (c)) Suppose that the only solution to T(~x) = ~0 is ~x = ~0. The goal is to show that
{T(~e1), . . . , T(~en)} is linearly independent, since these are precisely the columns of the matrix
associated to T. The linear system
\[
y_1 T(\vec{e}_1) + \cdots + y_n T(\vec{e}_n) = \vec{0} \tag{9.45}
\]
can be expressed as
\[
T(y_1 \vec{e}_1 + \cdots + y_n \vec{e}_n) = \vec{0} \tag{9.46}
\]
using linearity of T. By assumption, the only solution to this is
\[
y_1 \vec{e}_1 + \cdots + y_n \vec{e}_n = \vec{0}. \tag{9.47}
\]
Since {~e1, . . . , ~en} is linearly independent, the only solution to this system is y1 = · · · = yn = 0.
Hence {T(~e1), . . . , T(~en)} is linearly independent.

((c) =⇒ (a)) Suppose that the columns of the matrix associated to T are linearly independent.
Let ~x, ~y ∈ Rn satisfy T(~x) = T(~y). The goal is to prove that ~x = ~y. By linearity of T, T(~x − ~y) = ~0.
Since ~x = x1~e1 + · · · + xn~en and similarly for ~y, this reads
\[
T\big((x_1 - y_1)\vec{e}_1 + \cdots + (x_n - y_n)\vec{e}_n\big) = \vec{0}. \tag{9.48}
\]
By linearity of T, this can be expressed as
\[
(x_1 - y_1) T(\vec{e}_1) + \cdots + (x_n - y_n) T(\vec{e}_n) = \vec{0}. \tag{9.49}
\]
Since {T(~e1), . . . , T(~en)} is linearly independent, the only solution to this system is
\[
x_1 - y_1 = \cdots = x_n - y_n = 0. \tag{9.50}
\]
In other words,
\[
x_1 = y_1, \ \ldots, \ x_n = y_n, \tag{9.51}
\]
so that ~x = ~y.
Theorem 9.52. The following are equivalent for a linear transformation \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\).

(a) T is onto.

(b) For every ~b ∈ Rm, the linear system T(~x) = ~b always has at least one solution.

(c) The columns of the associated m × n matrix A span Rm, i.e.
\[
\operatorname{span}\big(T(\vec{e}_1), \ldots, T(\vec{e}_n)\big) = \mathbb{R}^m. \tag{9.53}
\]

Proof. This one is left as an exercise. Compare this to Theorem 7.36 (it's the same thing!).
Theorem 9.54. Let \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) be a linear transformation. Then
\[
\operatorname{rank} T + \dim(\ker T) = n. \tag{9.55}
\]
In other words,
\[
\#\text{ pivot columns} + \#\text{ free variables} = \text{dimension of domain}. \tag{9.56}
\]

Proof. Omitted.
The idea behind this “rank-nullity” theorem is that the information you start with (the domain)
equals the information you obtained (the image) plus the information that was lost (the kernel).
Notice that this has nothing to do with the codomain of the linear transformation.
As a summary, given a linear transformation
\[
\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n \tag{9.57}
\]
with domain (source) domain(T) = Rn and codomain (target) codomain(T) = Rm, the image and
kernel of T are given by
\[
\operatorname{image}(T) := \{T(\vec{x}) \in \mathbb{R}^m : \vec{x} \in \mathbb{R}^n\} \tag{9.58}
\]
and
\[
\ker(T) := \{\vec{x} \in \mathbb{R}^n : T(\vec{x}) = \vec{0}\}. \tag{9.59}
\]
Note that image(T) ⊆ codomain(T) and ker(T) ⊆ domain(T). Since every linear transformation
T has the matrix form
\[
\begin{bmatrix} | & & | \\ T(\vec{e}_1) & \cdots & T(\vec{e}_n) \\ | & & | \end{bmatrix}, \tag{9.60}
\]
the span of the columns is
\[
\operatorname{span}\{T(\vec{e}_1), \ldots, T(\vec{e}_n)\} = \operatorname{image}(T). \tag{9.61}
\]
In fact, the pivot columns provide a basis for image(T). Hence,
\[
\dim\big(\operatorname{image}(T)\big) = \#\text{ of pivots of } T. \tag{9.62}
\]
Solving
\[
\left[\begin{array}{ccc|c} | & & | & | \\ T(\vec{e}_1) & \cdots & T(\vec{e}_n) & \vec{0} \\ | & & | & | \end{array}\right] \tag{9.63}
\]
in parametric form will give you a basis for ker(T). Hence, the number of free variables is the
dimension of this kernel,
\[
\dim\big(\ker(T)\big) = \#\text{ of free variables of (9.63)}. \tag{9.64}
\]
Because of this, it immediately makes sense why the number of pivots plus the number of free
variables is n.
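Aside (not in [Lay]): the rank-nullity relation (9.55) is easy to test numerically. In the Python sketch below (my own example matrix), the rank is the number of pivot columns and the nullity is the number of free variables.
\begin{verbatim}
import numpy as np

# A 3 x 4 matrix, i.e. a linear transformation from R^4 to R^3.
# The third row is the sum of the first two, so the rank is 2.
A = np.array([[1, 2, 0, 1],
              [0, 1, 1, 0],
              [1, 3, 1, 1]])

rank = np.linalg.matrix_rank(A)       # dimension of the image
nullity = A.shape[1] - rank           # dimension of the kernel
print(rank, nullity, rank + nullity)  # 2 2 4: rank + nullity = dim of domain
\end{verbatim}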
Recommended Exercises. Exercises 10 and 12 in Section 1.8 of [Lay]. Exercises 22, 23, and 31
in Section 2.8 of [Lay]. Exercises 13, 15, and 20 in Section 2.9 of [Lay]. Be able to show all your
work, step by step!
In this lecture, we finished Sections 1.8, 2.8, and parts of 2.9 of [Lay].
10 Iterating linear transformations—matrix multiplication

If you think of a linear transformation as a process (recall our examples: a recipe decomposes
batches of pastries into ingredients, colorblindness can be described by a filter, etc.), you can
perform processes in succession. For example, imagine you had two linear transformations
\[
\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n \tag{10.1}
\]
and
\[
\mathbb{R}^l \xleftarrow{S} \mathbb{R}^m. \tag{10.2}
\]
Then it should be reasonable to perform these operations in succession as
\[
\mathbb{R}^l \xleftarrow{S} \mathbb{R}^m \xleftarrow{T} \mathbb{R}^n, \tag{10.3}
\]
so that the result is some operation, denoted by ST, from Rn to Rl:
\[
\mathbb{R}^l \xleftarrow{ST} \mathbb{R}^n. \tag{10.4}
\]
Problem 10.5. Most grocery stores give you discounts on items if you purchase more. In this
case, the cost per quantity of a good is not typically linear. So imagine, for purposes of illustrating
the example of iterating linear transformations, that a particular grocery store has a fixed price per
quantity of goods regardless of how much you buy. Refer to Example 0.11 on baking pastries and
their ingredients. For convenience, we have reproduced the table as well as the cost of ingredients
per quantity. In addition, we have added the selling price at the bottom of the table per item (as
opposed to per batch).

                Pancakes     Tres leches        Strawberry shortcake   Egg tarts     cost per
                (makes 12)   (makes 8 slices)   (makes 8 slices)       (makes 16)    ingredient
eggs            2            6                  0                      6             $0.09 per egg
strawberries    1 + 1/2 lbs  0                  1 + 1/2 lbs            0             $1.00 per lb
heavy cream     1 cup        1 pint             3 cups                 0             $1.50 per cup
flour           3 cups       1 cup              4 cups                 3 + 3/4 cups  $0.20 per cup
butter          4 tbsp       0                  1 + 1/4 cups           1 + 1/3 cups  $2.00 per cup
sugar           3 tbsp       1 cup              1/2 cup                2/5 cup       $0.25 per cup
milk            2 cups       1/3 cup            0                      1/3 cup       $0.25 per cup
selling price   $2.00        $3.50              $3.00                  $1.00

Note that there are 16 tablespoons in a cup and 2 cups in a pint. Ignoring the costs of maintaining
a business, what is the profit of the bakery if they sell 4 batches of pancakes, 3 batches of tres
leches cakes, 2 batches of strawberry shortcakes, and 4 batches of egg tarts?
Answer. Before calculating, conceptually, we have the following linear transformations: the
"ingredients" transformation takes batches of pastries in R4 to quantities of ingredients in R7; the
"cost" transformation takes ingredients in R7 to a cost in R1; the "sell price" transformation takes
batches of pastries in R4 directly to a revenue in R1; and the "profit" transformation takes the
pair (cost, sell price) in R1 × R1 to a profit in R1. The cost and sell price can be calculated
separately, but we have boxed them together because the profit is calculated as a difference of the
two. The explicit matrices corresponding to these linear transformations are given by
\[
\text{ingredients} =
\begin{bmatrix}
2 & 6 & 0 & 6 \\
\frac{3}{2} & 0 & \frac{3}{2} & 0 \\
1 & 2 & 3 & 0 \\
3 & 1 & 4 & \frac{15}{4} \\
\frac{1}{4} & 0 & \frac{5}{4} & \frac{4}{3} \\
\frac{3}{16} & 1 & \frac{1}{2} & \frac{2}{5} \\
2 & \frac{1}{3} & 0 & \frac{1}{3}
\end{bmatrix},
\qquad
\text{cost} = \begin{bmatrix} 0.09 & 1 & 1.50 & 0.20 & 2 & 0.25 & 0.25 \end{bmatrix},
\]
\[
\text{sell price} = \begin{bmatrix} 24 & 28 & 24 & 16 \end{bmatrix},
\qquad
\text{profit} = \begin{bmatrix} -1 & 1 \end{bmatrix}.
\]
Given the known quantities that are sold, we can calculate the images of the batches sold under
these different linear transformations. The batches sold are
\[
\begin{bmatrix} 4 \\ 3 \\ 2 \\ 4 \end{bmatrix},
\]
so the ingredients needed are
\[
\begin{bmatrix}
2 & 6 & 0 & 6 \\
\frac{3}{2} & 0 & \frac{3}{2} & 0 \\
1 & 2 & 3 & 0 \\
3 & 1 & 4 & \frac{15}{4} \\
\frac{1}{4} & 0 & \frac{5}{4} & \frac{4}{3} \\
\frac{3}{16} & 1 & \frac{1}{2} & \frac{2}{5} \\
2 & \frac{1}{3} & 0 & \frac{1}{3}
\end{bmatrix}
\begin{bmatrix} 4 \\ 3 \\ 2 \\ 4 \end{bmatrix}
=
\begin{bmatrix} 50 \\ 9 \\ 16 \\ 38 \\ \frac{53}{6} \\ \frac{127}{20} \\ \frac{31}{3} \end{bmatrix},
\]
the cost and sell price are
\[
\begin{bmatrix} 66.94 \\ 292.00 \end{bmatrix}
\]
(with the cost rounded to the nearest cent), and applying the profit transformation gives
\[
\begin{bmatrix} -1 & 1 \end{bmatrix}
\begin{bmatrix} 66.94 \\ 292.00 \end{bmatrix}
= 225.06.
\]
This leads to a profit of $225.06.
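Aside (not in [Lay]): because the arithmetic above involves many fractions, it is worth verifying by computer. The following Python sketch (my own illustration, using exact fractions) reproduces the ingredient totals, the cost, and the profit.
\begin{verbatim}
from fractions import Fraction as F

# Ingredients per batch (rows: eggs, strawberries, heavy cream, flour,
# butter, sugar, milk; columns: pancakes, tres leches, shortcake, egg tarts).
ingredients = [
    [F(2),     F(6),    F(0),    F(6)],
    [F(3, 2),  F(0),    F(3, 2), F(0)],
    [F(1),     F(2),    F(3),    F(0)],
    [F(3),     F(1),    F(4),    F(15, 4)],
    [F(1, 4),  F(0),    F(5, 4), F(4, 3)],
    [F(3, 16), F(1),    F(1, 2), F(2, 5)],
    [F(2),     F(1, 3), F(0),    F(1, 3)],
]
cost = [F(9, 100), F(1), F(3, 2), F(1, 5), F(2), F(1, 4), F(1, 4)]
sell = [F(24), F(28), F(24), F(16)]
batches = [F(4), F(3), F(2), F(4)]

needed = [sum(row[j] * batches[j] for j in range(4)) for row in ingredients]
total_cost = sum(c * q for c, q in zip(cost, needed))
revenue = sum(s * b for s, b in zip(sell, batches))
print(needed)                        # [50, 9, 16, 38, 53/6, 127/20, 31/3]
print(float(total_cost))             # 66.9375
print(float(revenue - total_cost))   # 225.0625
\end{verbatim}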
Definition 10.6. The composition of \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) followed by \(\mathbb{R}^l \xleftarrow{S} \mathbb{R}^m\) is the function \(\mathbb{R}^l \xleftarrow{ST} \mathbb{R}^n\)
defined by
\[
\mathbb{R}^n \ni \vec{x} \mapsto S\big(T(\vec{x})\big). \tag{10.7}
\]
In words, ST is the transformation that sends a vector ~x to T(~x) by applying T first and then
applies S to the result, which is T(~x). Diagrammatically, this looks like
\[
\begin{array}{ccc}
\mathbb{R}^l & \xleftarrow{\ ST\ } & \mathbb{R}^n \\
{\scriptstyle S}\,\nwarrow & & \swarrow\,{\scriptstyle T} \\
& \mathbb{R}^m &
\end{array} \tag{10.8}
\]
Proposition 10.9. The composition of a linear transformation \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) followed by a linear
transformation \(\mathbb{R}^l \xleftarrow{S} \mathbb{R}^m\) is a linear transformation \(\mathbb{R}^l \xleftarrow{ST} \mathbb{R}^n\).

Proof. Let c ∈ R and ~x, ~y ∈ Rn. Then
\[
S\big(T(\vec{x} + \vec{y})\big)
= S\big(T(\vec{x}) + T(\vec{y})\big)
= S\big(T(\vec{x})\big) + S\big(T(\vec{y})\big), \tag{10.10}
\]
where the first equality is by linearity of T and the second by linearity of S, and
\[
S\big(T(c\vec{x})\big)
= S\big(cT(\vec{x})\big)
= c\,S\big(T(\vec{x})\big), \tag{10.11}
\]
again by linearity of T and then of S. Hence, ST is a linear transformation.
Exercise 10.12. Show that (ST )(~0) = ~0.
Because ST is a linear transformation, it must have a matrix associated to it. Let A be the
matrix associated to S and let B be the matrix associated to T. Remember, this means
\[
A = \begin{bmatrix} | & & | \\ S(\vec{e}_1) & \cdots & S(\vec{e}_m) \\ | & & | \end{bmatrix}
\qquad \& \qquad
B = \begin{bmatrix} | & & | \\ T(\vec{e}_1) & \cdots & T(\vec{e}_n) \\ | & & | \end{bmatrix}. \tag{10.13}
\]
Notice the difference in the unit vector inputs! Let's try to figure out the matrix associated to ST.
To do this, we need to figure out what the columns of this matrix are, and this should be given by
\[
\begin{bmatrix} | & & | \\ (ST)(\vec{e}_1) & \cdots & (ST)(\vec{e}_n) \\ | & & | \end{bmatrix}. \tag{10.14}
\]
Therefore, all we have to do is figure out what an arbitrary column in this matrix looks like.
Therefore, pick some i ∈ {1, . . . , n}. Our goal is to calculate this column
\[
\begin{bmatrix} | \\ (ST)(\vec{e}_i) \\ | \end{bmatrix}
= \begin{bmatrix} | \\ S\big(T(\vec{e}_i)\big) \\ | \end{bmatrix}. \tag{10.15}
\]
By definition, T(~ei) is the i-th column of
\[
B = \begin{bmatrix} | & & | \\ T(\vec{e}_1) & \cdots & T(\vec{e}_n) \\ | & & | \end{bmatrix}. \tag{10.16}
\]
Let's therefore give some notation to the elements of this column:
\[
\begin{bmatrix} | \\ T(\vec{e}_i) \\ | \end{bmatrix}
=: \begin{bmatrix} b_{1i} \\ \vdots \\ b_{mi} \end{bmatrix}. \tag{10.17}
\]
Notice that the indices make sense because B is an m × n matrix, so its columns must have m
entries. We left the i index on the right because we want to keep track that it is the i-th column.
Now we apply the linear transformation S to this vector, which we know how to compute:
\[
S\big(T(\vec{e}_i)\big)
= \begin{bmatrix} | & & | \\ S(\vec{e}_1) & \cdots & S(\vec{e}_m) \\ | & & | \end{bmatrix}
\begin{bmatrix} b_{1i} \\ \vdots \\ b_{mi} \end{bmatrix}
= b_{1i} S(\vec{e}_1) + \cdots + b_{mi} S(\vec{e}_m). \tag{10.18}
\]
Therefore, this particular linear combination of the columns of S is the i-th column of ST. Let's
also put in some notation here. Writing the l × m matrix associated to S as
\[
\begin{bmatrix} | & & | \\ S(\vec{e}_1) & \cdots & S(\vec{e}_m) \\ | & & | \end{bmatrix}
= \begin{bmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & & \vdots \\ a_{l1} & \cdots & a_{lm} \end{bmatrix}, \tag{10.19}
\]
we can express the above linear combination as
\[
S\big(T(\vec{e}_i)\big)
= b_{1i} S(\vec{e}_1) + \cdots + b_{mi} S(\vec{e}_m)
= b_{1i} \begin{bmatrix} a_{11} \\ \vdots \\ a_{l1} \end{bmatrix}
+ \cdots +
b_{mi} \begin{bmatrix} a_{1m} \\ \vdots \\ a_{lm} \end{bmatrix}
= \begin{bmatrix} b_{1i} a_{11} + \cdots + b_{mi} a_{1m} \\ \vdots \\ b_{1i} a_{l1} + \cdots + b_{mi} a_{lm} \end{bmatrix}. \tag{10.20}
\]
Yes, this looks complicated. And remember, this is only the i-th column of ST. If we now
did this for all columns and entries, we would find that the matrix associated to ST is
\[
AB = \begin{bmatrix}
\sum_{k=1}^{m} a_{1k} b_{k1} & \sum_{k=1}^{m} a_{1k} b_{k2} & \cdots & \sum_{k=1}^{m} a_{1k} b_{kn} \\
\sum_{k=1}^{m} a_{2k} b_{k1} & \sum_{k=1}^{m} a_{2k} b_{k2} & \cdots & \sum_{k=1}^{m} a_{2k} b_{kn} \\
\vdots & \vdots & & \vdots \\
\sum_{k=1}^{m} a_{lk} b_{k1} & \sum_{k=1}^{m} a_{lk} b_{k2} & \cdots & \sum_{k=1}^{m} a_{lk} b_{kn}
\end{bmatrix}. \tag{10.21}
\]
From this calculation, we see that the ij component (meaning the i-th row and j-th column entry)
(AB)ij of the matrix AB is given by
\[
(AB)_{ij} := \sum_{k=1}^{m} a_{ik} b_{kj}. \tag{10.22}
\]
The resulting formula seems overwhelming, but there is a convenient way to remember it instead
of this long derivation. The ij-th component of AB is given by multiplying the entries of the i-th
row of A with the entries of the j-th column of B one by one in order and then adding them all
together:
\[
(AB)_{ij}
= \begin{bmatrix} a_{i1} & a_{i2} & \cdots & a_{im} \end{bmatrix}
\begin{bmatrix} b_{1j} \\ b_{2j} \\ \vdots \\ b_{mj} \end{bmatrix}
= \sum_{k=1}^{m} a_{ik} b_{kj}
\quad \text{(} i\text{-th row of } A \text{ against the } j\text{-th column of } B\text{)}. \tag{10.23}
\]
This operation makes sense because the number of entries in a row of A is m while the number of
entries in a column of B is also m. Yet another way of thinking about the matrix product AB is
if we write B as
\[
B = \begin{bmatrix} | & & | \\ \vec{b}_1 & \cdots & \vec{b}_n \\ | & & | \end{bmatrix}. \tag{10.24}
\]
Then AB is the matrix
\[
AB = \begin{bmatrix} | & & | \\ A\vec{b}_1 & \cdots & A\vec{b}_n \\ | & & | \end{bmatrix}. \tag{10.25}
\]
Example 10.26. Consider the following two linear transformations on R2 given by a shear S and
then a rotation R by angle θ (in the figures, k = 1 and θ = π/2):
\[
\mathbb{R}^2
\xleftarrow{\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}}
\mathbb{R}^2
\xleftarrow{\begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}}
\mathbb{R}^2. \tag{10.27}
\]
(The figure shows ~e1 and ~e2 first being sheared to S^-_1(~e1) and S^-_1(~e2), and then rotated to
R(S^-_1(~e1)) and R(S^-_1(~e2)).) Let us compute the matrix associated to RS by calculating the
first and second columns, i.e. (RS)(~e1) and (RS)(~e2). The first one is
\[
R\big(S(\vec{e}_1)\big)
= \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} 1 \\ 0 \end{bmatrix}
= \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}, \tag{10.28}
\]
while the second is
\[
R\big(S(\vec{e}_2)\big)
= \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} k \\ 1 \end{bmatrix}
= k \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}
+ \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix}
= \begin{bmatrix} k\cos\theta - \sin\theta \\ k\sin\theta + \cos\theta \end{bmatrix}. \tag{10.29}
\]
Therefore, the resulting linear transformation is given by
\[
\mathbb{R}^2
\xleftarrow{\begin{bmatrix} \cos\theta & k\cos\theta - \sin\theta \\ \sin\theta & k\sin\theta + \cos\theta \end{bmatrix}}
\mathbb{R}^2, \tag{10.30}
\]
which with k = 1 and θ = π/2 becomes
\[
\mathbb{R}^2
\xleftarrow{\begin{bmatrix} 0 & -1 \\ 1 & 1 \end{bmatrix}}
\mathbb{R}^2. \tag{10.31}
\]
If, however, we executed these operations in the opposite order,
\[
\mathbb{R}^2
\xleftarrow{\begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}}
\mathbb{R}^2
\xleftarrow{\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}}
\mathbb{R}^2 \tag{10.32}
\]
(the figure shows ~e1 and ~e2 first being rotated to R(~e1) and R(~e2), and then sheared to
S^-_1(R(~e1)) and S^-_1(R(~e2))), we would find the resulting linear transformation to be
\[
\mathbb{R}^2
\xleftarrow{\begin{bmatrix} k\sin\theta + \cos\theta & k\cos\theta - \sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}}
\mathbb{R}^2, \tag{10.33}
\]
which with k = 1 and θ = π/2 becomes
\[
\mathbb{R}^2
\xleftarrow{\begin{bmatrix} 1 & -1 \\ 1 & 0 \end{bmatrix}}
\mathbb{R}^2. \tag{10.34}
\]
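Aside (not in [Lay]): this example shows that matrix multiplication is not commutative. The quickest way to convince yourself is to multiply the two matrices in both orders; the Python/numpy sketch below (my own check, with k = 1 and θ = π/2) reproduces (10.31) and (10.34).
\begin{verbatim}
import numpy as np

theta, k = np.pi / 2, 1.0
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation
S = np.array([[1.0, k],
              [0.0, 1.0]])                        # horizontal shear

print(np.round(R @ S))   # [[ 0 -1], [ 1  1]]: shear first, then rotate (10.31)
print(np.round(S @ R))   # [[ 1 -1], [ 1  0]]: rotate first, then shear (10.34)
\end{verbatim}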
If A is an m × m matrix, then
\[
A^k := \underbrace{A \cdots A}_{k \text{ times}}. \tag{10.35}
\]
Note that it does not make sense to raise an m × n matrix with m ≠ n to some power other than 1.
By definition,
\[
A^0 := \mathbb{1}_m \tag{10.36}
\]
is the identity m × m matrix.
Exercise 10.37. State whether the following claims are True or False. If the claim is true, be
able to precisely deduce why the claim is true. If the claim is false, be able to provide an explicit
counter-example.

(a) \(\begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}^{15} = \begin{bmatrix} 1 & 15k \\ 0 & 1 \end{bmatrix}\) for all real numbers k.

(b) The matrix \(\begin{bmatrix} -0.6 & 0.8 \\ -0.8 & -0.6 \end{bmatrix}\) represents a rotation.
Exercise 10.38. Compute the matrices from exercises 7-11 in Section 1.9 of [Lay] in the following
two ways. First, calculate each of the individual matrices for the transformations and then matrix
multiply (compose). Second, write the matrix associated to the over-all transformation. How are
these two methods of calculating related?
Recommended Exercises. Exercises 9, 10, 11, and 12 in Section 2.1 of [Lay]. Be able to show
all your work, step by step! Do not use calculators or computer programs to solve any problems!
In this lecture, we finished Section 2.1 of [Lay].
11 Hamming’s error correcting code
We will review several concepts in the context of an example. This example and a lot of the
wording come directly from an exercise in [1].
Example 11.1. In binary, one works with the numbers 0 and 1 only. The way we add these
numbers is the same way we normally add numbers except with the rule that 1 + 1 = 0. For
example, 2017 = 1 while 2018 = 0. In other words, every even number is treated as 0 while every
odd number is treated as 1. Multiplication of these numbers is also treated in the same way as with
ordinary integers. This makes sense because, for example, an even number times an odd number
is an even number while an odd number times an odd number is still an odd number. Just like the
set of real numbers is so important that we give it notation, such as R, the set of binary numbers
is so important that we also give it notation. Unfortunately, people will disagree on what letter
to use. I prefer the notation Z_2 to remind myself that I am working with integers where "2" is
actually 0. Just like we can form n-component vectors of real numbers (recall, this set is denoted
by Rn), we can form n-component vectors of binary numbers, and we denote this set by Z_2^n. Most
of the manipulations, definitions, and theorems that worked for vectors in Rn work for vectors in
Z_2^n.

Exercise 11.2. Because there are infinitely many real numbers, there are infinitely many vectors
in Rn for n > 0. How many vectors are there in Z_2^n?

Answer. For each component of a vector in Z_2^n, there are 2 possibilities: either a 0 or a 1.
Therefore, for n components, this gives 2^n possible vectors. Therefore, the number of vectors in
Z_2^n is finite!

Definition 11.3. An element of Z_2 is typically called a bit, a vector in Z_2^8 is typically called a
byte, and a vector in Z_2^4 is typically called a nibble.

In 1950, Richard Hamming introduced a method of recovering transmitted information that
was subject to certain kinds of errors during its transmission. A Hamming matrix with n rows is
a matrix with 2^n − 1 columns whose columns consist of exactly all the non-zero vectors of Z_2^n.
For example, one such Hamming matrix with n = 3 rows (therefore 2^3 − 1 = 7 columns) is given
by
\[
H = \begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 & 1 & 1 & 0
\end{bmatrix}. \tag{11.4}
\]
Exercise 11.5. Express the kernel of H as the span of four vectors in Z_2^7 of the form
\[
\vec{v}_1 = \begin{bmatrix} * \\ * \\ * \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
\vec{v}_2 = \begin{bmatrix} * \\ * \\ * \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad
\vec{v}_3 = \begin{bmatrix} * \\ * \\ * \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
\vec{v}_4 = \begin{bmatrix} * \\ * \\ * \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}. \tag{11.6}
\]
Answer. All we have to do is solve the augmented matrix problem (find the homogeneous solution
to)
\[
\left[\begin{array}{ccccccc|c}
1 & 0 & 0 & 1 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 0 & 0
\end{array}\right] \tag{11.7}
\]
and we see immediately (since this matrix is already in reduced echelon form) that the general
solution is
\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
= \begin{bmatrix} -x_4 - x_6 - x_7 \\ -x_4 - x_5 - x_7 \\ -x_4 - x_5 - x_6 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
= x_4 \begin{bmatrix} -1 \\ -1 \\ -1 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}
+ x_5 \begin{bmatrix} 0 \\ -1 \\ -1 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}
+ x_6 \begin{bmatrix} -1 \\ 0 \\ -1 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}
+ x_7 \begin{bmatrix} -1 \\ -1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}, \tag{11.8}
\]
where x4, x5, x6, and x7 are free variables. But, don't forget that −1 = 1 in Z_2, so this actually
becomes
\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
= x_4 \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}
+ x_5 \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}
+ x_6 \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}
+ x_7 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}, \tag{11.9}
\]
where x4, x5, x6, and x7 are free variables. By the way, this expression of the set of solutions is
now in parametric form (only the free variables appear). From this, we can immediately read off
the requested vectors:
\[
\vec{v}_1 = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
\vec{v}_2 = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad
\vec{v}_3 = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
\vec{v}_4 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}. \tag{11.10}
\]
Let's make sure this answer makes sense. H is a 3 × 7 matrix. The first three columns are pivot
columns and the last four columns provide us with free variables. Therefore, we expect the kernel
of H to be 4-dimensional. This agrees with the fact that we found the four vectors ~v1, ~v2, ~v3, ~v4.
Rank-nullity tells us that these vectors form a basis for the kernel of H (but you can also check
this explicitly by showing that these four vectors are linearly independent). Furthermore, because
H is a 3 × 7 matrix, it describes a linear transformation \(\mathbb{Z}_2^3 \xleftarrow{H} \mathbb{Z}_2^7\). Hence, the kernel should consist
of vectors with 7 components. Again, this is consistent with the basis we found.
Using ~v1, ~v2, ~v3, ~v4, we can construct a new matrix
\[
M := \begin{bmatrix} | & | & | & | \\ \vec{v}_1 & \vec{v}_2 & \vec{v}_3 & \vec{v}_4 \\ | & | & | & | \end{bmatrix}
= \begin{bmatrix}
1 & 0 & 1 & 1 \\
1 & 1 & 0 & 1 \\
1 & 1 & 1 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}. \tag{11.11}
\]
Exercise 11.12. Show that image(M) = ker(H). In particular, what is the resulting matrix HM
obtained by first performing M and then H?

Answer. The image of M is the span of the columns of the matrix associated with M. But these
vectors also span the kernel of H by construction. Therefore,
\[
\operatorname{image}(M) = \operatorname{span}\{\vec{v}_1, \vec{v}_2, \vec{v}_3, \vec{v}_4\} = \ker(H). \tag{11.13}
\]
This implies, in particular, that image(M) ⊆ ker(H). In other words, a vector ~x in image(M) is
also in ker(H), i.e. H(M(~x)) = ~0. This means that HM is the zero 3 × 4 matrix. Notice that we
didn't even have to calculate HM explicitly.
Now, consider a "message", which will mathematically be described by a vector ~u in Z_2^4, i.e. a
nibble (so a message consisting of four bits), and suppose you would like to transmit this message
to someone. While this message is traveling, the environment might perturb it slightly and might
change some of the components of the vector ~u. We assume for simplicity that at most one entry
can change during transmission. Because we are working with binary, the way in which ~u might
change is completely determined by which component gets altered. Thus, there are only five
possible outcomes for what happens to ~u. These possible outcomes are
\[
\vec{u}, \quad \vec{u} + \vec{e}_1, \quad \vec{u} + \vec{e}_2, \quad \vec{u} + \vec{e}_3, \quad \vec{u} + \vec{e}_4, \tag{11.14}
\]
since ~u is a four-component vector. Notice that adding a vector ~ei is the same as subtracting one
since we are working with binary. For example, imagine we started with the vector
\[
\vec{u} = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}. \tag{11.15}
\]
When this message is transmitted, if at most one error occurs, the five possible outcomes are
\[
\vec{u} = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}, \quad
\vec{u} + \vec{e}_1 = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 0 \end{bmatrix}, \quad
\vec{u} + \vec{e}_2 = \begin{bmatrix} 0 \\ 2 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
\vec{u} + \vec{e}_3 = \begin{bmatrix} 0 \\ 1 \\ 2 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad
\vec{u} + \vec{e}_4 = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 1 \end{bmatrix}. \tag{11.16}
\]
The information we are working with in our message is precious! We would like to not lose any
information. How can we do this? Again, this only works assuming that at most one error occurs,
but nevertheless it is an amazing idea. Hamming realized that you can first transform the vector
~u ∈ Z_2^4 into a seven-component vector ~v ∈ Z_2^7 by using the matrix M, i.e.
\[
\vec{v} = M\vec{u}, \tag{11.17}
\]
and then transmit that vector. This is beneficial for the following several reasons. First of all, ~u is
part of the vector ~v. In fact, if we split up M into two parts (the top and bottom parts)
\[
M^{\text{top}} = \begin{bmatrix} 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}
\qquad \& \qquad
M^{\text{bot}} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \tag{11.18}
\]
we see that
\[
M\vec{u} = \begin{bmatrix} M^{\text{top}}\vec{u} \\ \vec{u} \end{bmatrix}, \tag{11.19}
\]
so that all of the information of ~u is stored directly in the vector ~v since it is contained in the
last four components. Secondly, there are now eight (instead of five) possible outcomes of ~v after
transmission:
\[
\vec{v}, \quad \vec{v} + \vec{e}_1, \quad \vec{v} + \vec{e}_2, \quad \vec{v} + \vec{e}_3, \quad \vec{v} + \vec{e}_4, \quad \vec{v} + \vec{e}_5, \quad \vec{v} + \vec{e}_6, \quad \vec{v} + \vec{e}_7. \tag{11.20}
\]
Exercise 11.21. Using the matrix H, how can the receiver detect if there was an error during
transmission? [Hint: consider the case that there was no error first and then analyze the situation
if there was an error.]

Answer. If there is no error, then ~v = M~u is the message that the receiver sees. They write down
this information. The receiver can then apply the transformation H to this message and will find
that the resulting vector is H~v = H(M~u) = ~0 by our calculations above. When they see the string
of all 0's, they know that the message they received was the same as the original message, and they
keep only the last 4 rows of the vector ~v that they received since they know these 4 numbers give
the vector ~u.

On the other hand, if an error occurs during transmission, then the resulting vector that the
receiver obtains will be of the form ~v + ~ei for some i ∈ {1, 2, . . . , 7}. Remember, the receiver does
not know that the vector is of this form (all they see is a string of 0's and 1's). Nevertheless, when
they apply H they will obtain the vector
\[
H(\vec{v} + \vec{e}_i) = H(M(\vec{u}) + \vec{e}_i) = H(M(\vec{u})) + H(\vec{e}_i) = \vec{0} + H(\vec{e}_i) = H\vec{e}_i, \tag{11.22}
\]
where H~ei is the i-th column of the matrix H. Remember, this is because
\[
H = \begin{bmatrix} | & | & | & | & | & | & | \\ H\vec{e}_1 & H\vec{e}_2 & H\vec{e}_3 & H\vec{e}_4 & H\vec{e}_5 & H\vec{e}_6 & H\vec{e}_7 \\ | & | & | & | & | & | & | \end{bmatrix}. \tag{11.23}
\]
By looking at the matrix H, notice that every single column in that matrix is different from
every other column. Therefore, H~ei will inform the receiver what i is, and remember that i
represents the component where the error occurred during transmission. First suppose the error
occurred where i ∈ {1, 2, 3}. Because ~u, the original message, was in the last four slots of the
vector M~u, the receiver knows that the message they read before applying H is indeed the original
message that was sent, since the error only occurred in one of the first 3 entries and did not
affect the last 4 entries (which is where ~u is). However, if i ∈ {4, 5, 6, 7}, then the receiver knows
that an error occurred in the original message. Fortunately, because they can see H~ei, they can
identify which component the error occurred in. They can then fix this error (again, because we
are in binary, there is only one other number it could have been) and obtain the original message.
Fascinating!
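Aside (not in [Lay]): the entire encode–transmit–correct procedure fits in a few lines of code. Here is a Python/numpy sketch of mine; the function names are my own choices, and all arithmetic is performed mod 2.
\begin{verbatim}
import numpy as np

H = np.array([[1, 0, 0, 1, 0, 1, 1],     # the Hamming matrix (11.4)
              [0, 1, 0, 1, 1, 0, 1],
              [0, 0, 1, 1, 1, 1, 0]])
M = np.array([[1, 0, 1, 1],              # the encoding matrix (11.11)
              [1, 1, 0, 1],
              [1, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]])

def encode(u):
    return (M @ u) % 2

def decode(w):
    """Correct at most one flipped bit, then read off the last four entries."""
    syndrome = (H @ w) % 2
    if syndrome.any():                   # nonzero: it equals some column of H
        for i in range(7):
            if np.array_equal(syndrome, H[:, i]):
                w = w.copy()
                w[i] ^= 1                # flip the corrupted bit back
                break
    return w[3:]                         # the nibble sits in entries 4 through 7

u = np.array([0, 1, 1, 0])
v = encode(u)
corrupted = v.copy()
corrupted[4] ^= 1                        # flip the fifth entry, as in (11.28)
print(decode(corrupted))                 # [0 1 1 0]: the original message
\end{verbatim}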
In case you didn’t quite get that, let’s work with a concrete example. Suppose a sender has
the initial message
~u =
0
1
1
0
. (11.24)
After applying M, this vector becomes
~v = M~u =
1
1
2
0
1
1
0
=
1
1
0
0
1
1
0
(11.25)
since we must remember that 2 = 0 in Z2. Notice how ~u is still preserved in the bottom four
entries. So now let’s say this message is transmitted and an error occurs in the second entry (of
course, the receiver does not know this) so what the receiver sees first is the vector
~v + ~e2 =
1
2
0
0
1
1
0
=
1
0
0
0
1
1
0
. (11.26)
The receiver writes this information down. Then they apply the linear transformation H to the
message and obtain
\[
H(\vec{v} + \vec{e}_2)
= \begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 & 1 & 1 & 0
\end{bmatrix}
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}
= \begin{bmatrix} 2 \\ 1 \\ 2 \end{bmatrix}
= \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}. \tag{11.27}
\]
Notice that the math tells us that H(~v + ~e2) = H~e2, but the receiver does not know beforehand
where the error occurred, so we should not express the above equation as equal to H~e2 (even
though it's true) because that would presume the receiver already knows the state—the receiver
must apply H to the received vector as they obtained it because they are initially ignorant. So we
see that we obtained the second column of H, which tells us that an error occurred in the second
entry of the transmitted message. Therefore, the last four entries were not altered and the receiver
can safely conclude that the original message was indeed our starting vector ~u.
Now consider the situation where the fifth entry of ~v gets altered during the transmission.
Therefore, the receiver sees
\[
\vec{v} + \vec{e}_5
= \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 2 \\ 1 \\ 0 \end{bmatrix}
= \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}. \tag{11.28}
\]
Applying H to this gives
\[
\begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 & 1 & 1 & 0
\end{bmatrix}
\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}
= \begin{bmatrix} 2 \\ 1 \\ 1 \end{bmatrix}
= \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}. \tag{11.29}
\]
Therefore, the receiver knows that the 5th component of the transmitted message has an error
because this resulting 3-component vector is the 5th column of H. By flipping this fifth entry and
looking at the last four components, they get back the original message ~u.
In summary, the receiver can perform a (linear) operation on the received message (namely,
H) and figure out the entire original message even if there was an error during transmission! It
is very important to notice that neither H nor M were constructed in any way that depends on
the original message! They apply to all 4-bit transmissions with at most one error occurrence.
Remark 11.30. The previous experiment unfortunately fails if bits are replaced by their quantum
analogues, known as qubits. The reason is that whenever the receiver looks at the message, they
necessarily alter the state. This is what makes quantum cryptography challenging, but these
aspects can also be used as strengths.
What happens if you allow for more than just one error? Or what if you want to transmit
longer messages?
Recommended Exercises. See homework. Be able to show all your work, step by step! Do not
use calculators or computer programs to solve any problems!
In this lecture, we reviewed many important concepts: kernel, image, matrix multiplication,
bases, etc. all through an example that is an exercise in [1].
12 Inverses of linear transformations

Refer back to Example 7.1 of the ingredients needed to make a set of pastries, but imagine one
now considers all possible pastries (or at least a sufficiently large number of pastries, such as 20)
one can make with those seven ingredients. As we discussed in that example, a recipe defines a
linear transformation, which is, in particular, a function
\[
\underbrace{\mathbb{R}^7}_{\text{ingredients}} \xleftarrow{\ \text{recipe}\ } \underbrace{\mathbb{R}^{\#\text{ of pastries}}}_{\text{pastries}}.
\]
Is there a way to go back,
\[
\underbrace{\mathbb{R}^7}_{\text{ingredients}} \xrightarrow{\ ?\ } \underbrace{\mathbb{R}^{\#\text{ of pastries}}}_{\text{pastries}}\,?
\]
The way this question is phrased is a bit meaningless, because there are definitely many ways to
go back. For example, you can just send every ingredient to the vector ~0. A more meaningful
question would be to ask if there is a way to go back that recovers the pastry you started with.
Is this possible? Phrased differently, imagine being given a set of ingredients such as flour, milk,
eggs, sugar, etc. What kinds of pastries can you make with your set of ingredients? Is there
only one possibility? Of course not! Depending on the chef, one could make many different kinds
of pastries with a given set of ingredients. Hence, there is no well-defined rule to go back, i.e.
there is no function satisfying these requirements.41 In the context of linear algebra, given a linear
transformation
\[
\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n \tag{12.1}
\]
taking vectors with n components in and providing vectors with m components out, you might want
to know if there is a way to go back to reverse the process. This would be a linear transformation
going in the opposite direction (I've drawn it going backwards to our usual convention)
\[
\mathbb{R}^m \xrightarrow{S} \mathbb{R}^n \tag{12.2}
\]
so that if we perform these two processes in succession, the result would be the transformation
that does nothing, i.e. the identity transformation. In other words, going along any closed loop in

41 If we had only used 4 pastries as in Example 7.1 and we used the recipes provided, then there actually is a way
to go back because the columns of the matrix associated to the recipe are linearly independent. However, there are
still many ways to go back and no unique choice.
the diagram
\[
\mathbb{R}^m \overset{T}{\underset{S}{\leftrightarrows}} \mathbb{R}^n \tag{12.3}
\]
is the identity. Expressed another way, this means that
\[
ST = \mathbb{1}_n \quad \text{as a transformation from } \mathbb{R}^n \text{ to } \mathbb{R}^n \tag{12.4}
\]
and
\[
TS = \mathbb{1}_m \quad \text{as a transformation from } \mathbb{R}^m \text{ to } \mathbb{R}^m. \tag{12.5}
\]
Here 1m is the identity transformation on Rm and similarly 1n on Rn. Often, the inverse S of T
is written as T−1 and the inverse T of S is written as S−1. This is because inverses, if they exist,
are unique.
Definition 12.6. A linear transformation \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) is invertible (also known as non-singular) if
there exists a linear transformation \(\mathbb{R}^n \xleftarrow{S} \mathbb{R}^m\) such that
\[
ST = \mathbb{1}_n \qquad \& \qquad TS = \mathbb{1}_m. \tag{12.7}
\]
A linear transformation that is not invertible is called a non-invertible (also known as singular)
linear transformation.
Example 12.8. Consider the matrix Rθ describing rotation in R2 counterclockwise about the
origin by angle θ:
\[
\mathbb{R}^2
\xleftarrow{\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}}
\mathbb{R}^2. \tag{12.9}
\]
For θ = π/2, this is the matrix
\[
\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix},
\]
which rotates ~e1 and ~e2 a quarter turn counter-clockwise to Rπ/2(~e1) and Rπ/2(~e2) (see the figure).
The inverse of such a transformation is very intuitive! We just want to rotate back by angle −θ,
i.e. clockwise by angle θ. This inverse should therefore be given by the matrix
\[
R_{-\theta}
= \begin{bmatrix} \cos(-\theta) & -\sin(-\theta) \\ \sin(-\theta) & \cos(-\theta) \end{bmatrix}
= \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}. \tag{12.10}
\]
For θ = π/2, this is the matrix
\[
\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix},
\]
which rotates ~e1 and ~e2 a quarter turn clockwise to R−π/2(~e1) and R−π/2(~e2) (see the figure).
Is this really the inverse, though? We have to check the definition. Remember, this means we
need to show
\[
R_\theta R_{-\theta} = \mathbb{1}_2 \qquad \& \qquad R_{-\theta} R_\theta = \mathbb{1}_2. \tag{12.11}
\]
It turns out that we only need to check any one of these conditions (this is one of the exercises in
[Lay]), so let's check the second one:
\[
R_{-\theta} R_\theta
= \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
= \begin{bmatrix} \cos^2\theta + \sin^2\theta & -\cos\theta\sin\theta + \sin\theta\cos\theta \\ -\sin\theta\cos\theta + \cos\theta\sin\theta & \sin^2\theta + \cos^2\theta \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \tag{12.12}
\]
There is something quite interesting about this last example, but to explain it, we provide the
following definition.
Definition 12.13. The transpose of an m × n matrix A,
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}, \tag{12.14}
\]
is the n × m matrix
\[
A^T := \begin{bmatrix}
a_{11} & a_{21} & \cdots & a_{m1} \\
a_{12} & a_{22} & \cdots & a_{m2} \\
\vdots & \vdots & & \vdots \\
a_{1n} & a_{2n} & \cdots & a_{mn}
\end{bmatrix}. \tag{12.15}
\]
Another way of writing the transpose that makes it easier to remember is
\[
\begin{bmatrix} | & & | \\ \vec{a}_1 & \cdots & \vec{a}_n \\ | & & | \end{bmatrix}^{T}
:= \begin{bmatrix} \text{—} & \vec{a}_1 & \text{—} \\ & \vdots & \\ \text{—} & \vec{a}_n & \text{—} \end{bmatrix}. \tag{12.16}
\]
In other words, the columns become rows and vice versa. In the previous example of a rotation,
we discovered that
\[
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}^{-1}
= \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}^{T}. \tag{12.17}
\]
These are special types of matrices, known as orthogonal matrices, and they will be discussed in
more detail later in this course.
Example 12.18. Consider the matrix S^|_k describing a vertical shear in R2 of length k:
\[
\mathbb{R}^2
\xleftarrow{\begin{bmatrix} 1 & 0 \\ k & 1 \end{bmatrix}}
\mathbb{R}^2. \tag{12.19}
\]
When k = 1, this transformation is depicted by a figure in which ~e1 is sent to S^|_1(~e1) = (1, 1)
while ~e2 is fixed. In this case as well, it seems intuitively clear that the inverse should be also a
vertical shear, but where the shift is in the opposite vertical direction, namely, k should be replaced
with −k. Thus, we propose that the inverse vertical shear, S^|_{−k}, is given by
\[
S^{|}_{-k} = \begin{bmatrix} 1 & 0 \\ -k & 1 \end{bmatrix}. \tag{12.20}
\]
When k = 1, this inverse transformation is depicted by a figure in which ~e1 is sent to
S^|_{−1}(~e1) = (1, −1) while ~e2 is fixed. We check that this works:
\[
S^{|}_{-k} S^{|}_{k}
= \begin{bmatrix} 1 & 0 \\ -k & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ k & 1 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \tag{12.21}
\]
Theorem 12.22. A 2 × 2 matrix
\[
A := \begin{bmatrix} a & b \\ c & d \end{bmatrix} \tag{12.23}
\]
is invertible if and only if ad − bc ≠ 0. When this happens,
\[
A^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}. \tag{12.24}
\]

Proof. This is an if and only if statement, so its proof must be broken into two major steps.

(⇐) Suppose that ad − bc ≠ 0. Then the formula for A−1 shows that an inverse to A exists (matrix
multiply to verify this). Hence, A is invertible.

(⇒) Suppose that A is invertible. The goal is to show that ad − bc ≠ 0. Since A is invertible, there
exists a 2 × 2 matrix B such that AB = 12 = BA. Notice that this means solutions to the two
systems
\[
A\vec{x} = \vec{e}_1 \qquad \& \qquad A\vec{y} = \vec{e}_2 \tag{12.25}
\]
are given by the respective columns of B, since applying B to both sides of these two equations
gives
\[
\vec{x} = \mathbb{1}_2 \vec{x} = (BA)(\vec{x}) = B(A\vec{x}) = B\vec{e}_1
\qquad \& \qquad
\vec{y} = \mathbb{1}_2 \vec{y} = (BA)(\vec{y}) = B(A\vec{y}) = B\vec{e}_2. \tag{12.26}
\]
In other words, we have to be able to solve the system
\[
\left[\begin{array}{cc|c} a & b & e \\ c & d & f \end{array}\right] \tag{12.27}
\]
for all e, f ∈ R. In an earlier homework problem, we showed that if a ≠ 0, then this row reduces
to
\[
\left[\begin{array}{cc|c} a & b & e \\ c & d & f \end{array}\right]
\mapsto
\left[\begin{array}{cc|c} a & b & e \\ 0 & ad - bc & af - ec \end{array}\right]. \tag{12.28}
\]
This is consistent for all e, f ∈ R provided that ad − bc ≠ 0. But what if a = 0? Then it must be
that c ≠ 0, so that we can swap rows and row reduce in a similar way to make the same conclusion.
Why can't both a and c be equal to 0? If this happened, then A would not have two pivot columns
and it would not be possible to solve our two systems. Therefore, at least one of a or c is nonzero
and the formula ad − bc ≠ 0 must hold.

This actually concludes the proof, but you might wonder where the formula for A−1 comes
from. Without loss of generality, suppose that a ≠ 0 (we say without loss of generality because
we can swap the rows to put c in the position of a and row reduction would give us an analogous
result). Setting e = 1 and f = 0 gives
\[
\left[\begin{array}{cc|c} a & b & 1 \\ c & d & 0 \end{array}\right]
\mapsto
\left[\begin{array}{cc|c} a & b & 1 \\ 0 & ad - bc & -c \end{array}\right], \tag{12.29}
\]
which says
\[
\begin{aligned}
a x_1 + b x_2 &= 1 \\
(ad - bc) x_2 &= -c
\end{aligned}
\quad \Rightarrow \quad
\vec{x} = \frac{1}{ad - bc} \begin{bmatrix} d \\ -c \end{bmatrix}. \tag{12.30}
\]
Setting e = 0 and f = 1 gives
\[
\left[\begin{array}{cc|c} a & b & 0 \\ c & d & 1 \end{array}\right]
\mapsto
\left[\begin{array}{cc|c} a & b & 0 \\ 0 & ad - bc & a \end{array}\right], \tag{12.31}
\]
which says
\[
\begin{aligned}
a y_1 + b y_2 &= 0 \\
(ad - bc) y_2 &= a
\end{aligned}
\quad \Rightarrow \quad
\vec{y} = \frac{1}{ad - bc} \begin{bmatrix} -b \\ a \end{bmatrix}. \tag{12.32}
\]
Therefore,
\[
B = \begin{bmatrix} \vec{x} & \vec{y} \end{bmatrix}
= \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}, \tag{12.33}
\]
which agrees with the formula for A−1 in the statement of the theorem.
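Aside (not in [Lay]): formula (12.24) is easy to implement and test. Below is a small Python/numpy sketch of mine; the function name is my own.
\begin{verbatim}
import numpy as np

def inverse_2x2(A):
    """Inverse of a 2 x 2 matrix via (12.24); fails when ad - bc = 0."""
    a, b = A[0]
    c, d = A[1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("ad - bc = 0: the matrix is not invertible")
    return np.array([[d, -b], [-c, a]]) / det

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(inverse_2x2(A))        # [[-2.   1. ], [ 1.5 -0.5]]
print(A @ inverse_2x2(A))    # the identity, up to rounding
\end{verbatim}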
Exercise 12.34. If a = 0, then c ≠ 0. By going through a similar procedure, find B, the inverse
of A, and show that it agrees with the formula we found.
Remark 12.35. You might have also tried to prove the second part of the theorem by writing
the inverse of A as some matrix (the e and f here are not the same as in the above proof)
\[
B = \begin{bmatrix} e & f \\ g & h \end{bmatrix} \tag{12.36}
\]
and then matrix multiply with A to get the equation AB = 12, which reads
\[
\begin{bmatrix} a & b \\ c & d \end{bmatrix}
\begin{bmatrix} e & f \\ g & h \end{bmatrix}
= \begin{bmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \tag{12.37}
\]
This provides us with four equations in four unknowns (the knowns are a, b, c, d and the unknown
variables are e, f, g, h),
\[
\begin{aligned}
ae + 0f + bg + 0h &= 1 \\
0e + af + 0g + bh &= 0 \\
ce + 0f + dg + 0h &= 0 \\
0e + cf + 0g + dh &= 1,
\end{aligned} \tag{12.38}
\]
which is a linear system described by the augmented matrix
\[
\left[\begin{array}{cccc|c}
a & 0 & b & 0 & 1 \\
0 & a & 0 & b & 0 \\
c & 0 & d & 0 & 0 \\
0 & c & 0 & d & 1
\end{array}\right]. \tag{12.39}
\]
By a similar argument as before, at least one of a or c cannot be 0. Without loss of generality,
assume that a ≠ 0. Then we can row reduce this augmented matrix to
\[
\left[\begin{array}{cccc|c}
a & 0 & b & 0 & 1 \\
0 & a & 0 & b & 0 \\
c & 0 & d & 0 & 0 \\
0 & c & 0 & d & 1
\end{array}\right]
\mapsto
\left[\begin{array}{cccc|c}
a & 0 & b & 0 & 1 \\
0 & a & 0 & b & 0 \\
0 & 0 & ad - bc & 0 & -c \\
0 & 0 & 0 & ad - bc & a
\end{array}\right]. \tag{12.40}
\]
Because a and c can't both be 0, in order for this system to be consistent, ad − bc cannot be
zero. This again concludes the proof since all that was needed to be shown was ad − bc ≠ 0. But
again, we can proceed and try to find the inverse by solving this augmented matrix completely.
Proceeding with row reduction gives
\[
\left[\begin{array}{cccc|c}
a & 0 & b & 0 & 1 \\
0 & a & 0 & b & 0 \\
0 & 0 & ad - bc & 0 & -c \\
0 & 0 & 0 & ad - bc & a
\end{array}\right]
\mapsto
\left[\begin{array}{cccc|c}
a & 0 & b & 0 & 1 \\
0 & a & 0 & b & 0 \\
0 & 0 & 1 & 0 & \frac{-c}{ad - bc} \\
0 & 0 & 0 & 1 & \frac{a}{ad - bc}
\end{array}\right]
\mapsto
\left[\begin{array}{cccc|c}
1 & 0 & 0 & 0 & \frac{d}{ad - bc} \\
0 & 1 & 0 & 0 & \frac{-b}{ad - bc} \\
0 & 0 & 1 & 0 & \frac{-c}{ad - bc} \\
0 & 0 & 0 & 1 & \frac{a}{ad - bc}
\end{array}\right]. \tag{12.41}
\]
This gives us the matrix B = A−1, and it agrees with our earlier result. Notice how much longer
this construction was.
The quantity ad − bc of a matrix as in this theorem is called the determinant of the matrix
A and is denoted by det A. In all of the examples, the matrices were square matrices, i.e. m × n
matrices where m = n. It turns out that an m × n matrix cannot be invertible if m ≠ n. Our
examples from above are consistent with this theorem.
Example 12.42. For the 2 × 2 rotation matrix Rθ from our earlier examples, the determinant is
given by
\[
\det R_\theta = \cos\theta \cos\theta - \sin\theta(-\sin\theta) = \cos^2\theta + \sin^2\theta = 1. \tag{12.43}
\]

Example 12.44. For the 2 × 2 vertical shear matrix S^|_k from our earlier examples, the determinant
is given by
\[
\det S^{|}_k = 1 \cdot 1 - 0 \cdot k = 1. \tag{12.45}
\]
You could imagine just by the form of the inverse of a 2 × 2 matrix that finding formulas for
inverses of 3 × 3 or 4 × 4 matrices will be incredibly complicated. This is true. But we will still find
that a certain number, also called the determinant, will completely determine whether an inverse
exists. We will describe this in the next two sections. But before going there, let's look at some of
the properties of the inverse of a linear transformation. We can still study properties even though
we might not have an explicit formula for the inverse. Invertible matrices are quite useful for the
following reason.

Theorem 12.46. Let A be an invertible m × m matrix and let ~b be a vector in Rm. Then the
linear system
\[
A\vec{x} = \vec{b} \tag{12.47}
\]
has a unique solution. Furthermore, this solution is given by
\[
\vec{x} = A^{-1}\vec{b}. \tag{12.48}
\]

Proof. The fact that ~x = A−1~b is a solution follows from
\[
A\big(A^{-1}\vec{b}\big) = (AA^{-1})\vec{b} = \mathbb{1}_m \vec{b} = \vec{b}. \tag{12.49}
\]
To see that it is the only solution, suppose that ~y is another solution. Then by taking the difference
of A~x = ~b and A~y = ~b, we get
\[
A(\vec{x} - \vec{y}) = \vec{0}
\ \Rightarrow\
\underbrace{A^{-1}A}_{\mathbb{1}_m}(\vec{x} - \vec{y}) = A^{-1}\vec{0}
\ \Rightarrow\
\vec{x} - \vec{y} = \vec{0}, \tag{12.50}
\]
so that ~x = ~y.
Exercise 12.51. Let
\[
\vec{b} := \begin{bmatrix} \sqrt{3} \\ 1 \end{bmatrix} \tag{12.52}
\]
and let Rπ/6 be the matrix that rotates by 30° (in the counterclockwise direction). Find the vector
~x whose image is ~b under this rotation.

Steps:

(1) Write the matrix Rπ/6 explicitly.

(2) Draw the vector ~b.

(3) Guess a solution ~x by thinking about how Rπ/6 acts.

(4) Use the theorem to calculate ~x to test your guess.

(5) Compare your results and then make sure it works.
Theorem 12.53. If A is an invertible m × m matrix, then
\[
\big(A^{-1}\big)^{-1} = A. \tag{12.54}
\]
If A and B are invertible m × m matrices, then BA is invertible and
\[
(BA)^{-1} = A^{-1}B^{-1}. \tag{12.55}
\]
This theorem is completely intuitive! To reverse two processes, you do each one in reverse as if
you're rewinding a movie! The inverse of an m × m matrix A can be computed, if it exists, in the
following way, reminiscent of how we solved linear systems. In fact, this idea is a generalization
of the method we used to solve for the inverse of a 2 × 2 matrix. The idea is to row reduce the
augmented matrix
\[
\left[\begin{array}{c|c} A & \mathbb{1}_m \end{array}\right] \tag{12.56}
\]
to the form
\[
\left[\begin{array}{c|c} \mathbb{1}_m & B \end{array}\right], \tag{12.57}
\]
where B is some new m × m matrix. If this can be done, B = A−1.
Example 12.58. The inverse of the matrix
\[
A := \begin{bmatrix} 1 & -1 & 1 \\ -1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \tag{12.59}
\]
can be calculated by some row reductions:
\[
\left[\begin{array}{ccc|ccc}
1 & -1 & 1 & 1 & 0 & 0 \\
-1 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 0 & 1
\end{array}\right]
\mapsto
\left[\begin{array}{ccc|ccc}
1 & -1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 0 \\
0 & 1 & 0 & -1 & 0 & 1
\end{array}\right]
\mapsto
\left[\begin{array}{ccc|ccc}
1 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 1 & 0 \\
0 & 1 & 0 & -1 & 0 & 1
\end{array}\right] \tag{12.60}
\]
and then
\[
\left[\begin{array}{ccc|ccc}
1 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 1 & 0 \\
0 & 1 & 0 & -1 & 0 & 1
\end{array}\right]
\mapsto
\left[\begin{array}{ccc|ccc}
1 & 0 & 0 & -1 & -1 & 1 \\
0 & 0 & 1 & 1 & 1 & 0 \\
0 & 1 & 0 & -1 & 0 & 1
\end{array}\right]
\mapsto
\left[\begin{array}{ccc|ccc}
1 & 0 & 0 & -1 & -1 & 1 \\
0 & 1 & 0 & -1 & 0 & 1 \\
0 & 0 & 1 & 1 & 1 & 0
\end{array}\right]. \tag{12.61}
\]
So the supposed inverse is
\[
A^{-1} = \begin{bmatrix} -1 & -1 & 1 \\ -1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}. \tag{12.62}
\]
To verify this, we should check that it works:
\[
\begin{bmatrix} -1 & -1 & 1 \\ -1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}
\begin{bmatrix} 1 & -1 & 1 \\ -1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{12.63}
\]
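Aside (not in [Lay]): you can confirm the row-reduction result above with a computer. The Python/numpy sketch below (my own check) computes the inverse directly and verifies (12.63).
\begin{verbatim}
import numpy as np

A = np.array([[ 1., -1., 1.],
              [-1.,  1., 0.],
              [ 1.,  0., 1.]])

print(np.round(np.linalg.inv(A)))      # matches the matrix in (12.62)
print(np.round(np.linalg.inv(A) @ A))  # the 3 x 3 identity
\end{verbatim}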
Exercise 12.64. A rotation by angle θ (about the origin) in R3 in the plane spanned by ~e1 and
~e2 is given by the matrix
\[
\begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{12.65}
\]
Show that the inverse of this matrix is
\[
\begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{12.66}
\]
Theorem 12.67 (The Invertible Matrix Theorem). Let \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^m\) be a linear transformation with
corresponding m × m matrix denoted by A. Then the following are equivalent (which means that if
one condition holds, then all the other conditions hold).

(a) T is invertible.

(b) The columns of A span Rm, i.e. T is onto.

(c) The columns of A are linearly independent, i.e. T is one-to-one.

(d) For every ~b ∈ Rm, there exists a unique solution to A~x = ~b.

(e) A^T is invertible.
Please see [Lay] for the full version of this theorem, which provides even more characterizations
for a matrix to be invertible. Later, we will add other characterizing properties to this list as well.
Theorem 12.68. Let \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) be a linear transformation. The following are equivalent.

(a) T is one-to-one.

(b) There exists a linear transformation \(\mathbb{R}^m \xrightarrow{S} \mathbb{R}^n\) such that ST = 1n.

(c) The columns of the standard matrix associated to T are linearly independent.

(d) The only vector ~x ∈ Rn satisfying T~x = ~0 is ~x = ~0.
Theorem 12.69. Let \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^n\) be a linear transformation. The following are equivalent.

(a) T is onto.

(b) There exists a linear transformation \(\mathbb{R}^m \xrightarrow{S} \mathbb{R}^n\) such that TS = 1m.

(c) The columns of the standard matrix associated to T span Rm.

(d) For every vector ~b ∈ Rm, there exists a vector ~x ∈ Rn such that T~x = ~b.
Exercise 12.70. State whether the following claims are True or False. If the claim is true, be
able to precisely deduce why the claim is true. If the claim is false, be able to provide an explicit
counter-example.

(a) \(\begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}^{-15} = \begin{bmatrix} 1 & -15k \\ 0 & 1 \end{bmatrix}\) for all real numbers k.

(b) The inverse of the matrix \(\begin{bmatrix} -0.6 & 0.8 \\ -0.8 & -0.6 \end{bmatrix}\) is the matrix \(\begin{bmatrix} 0.6 & -0.8 \\ 0.8 & 0.6 \end{bmatrix}\).

(c) If A, B, C, and D are invertible 2 × 2 matrices, then (ABCD)−1 = A−1B−1C−1D−1.
Recommended Exercises. Exercises 7, 9 (except part (e)), 13, 14, 21, 22, 23, 24, 25, 26, 33, 34,
and 35 in Section 2.2 of [Lay]. Exercises 11, 12, 13, 14, 21, 29, and 36 in Section 2.3 of [Lay]. Be
able to show all your work, step by step! Do not use calculators or computer programs to solve
any problems!
In this lecture, we finished Sections 2.2 and 2.3 of [Lay]. This concludes our study of Chapters
1 and 2 in [Lay]. In particular, we have skipped Sections 2.4, 2.5, 2.6, and 2.7.
13 The signed volume scale of a linear transformation

We're going to do things a little differently from your book [Lay], so please pay close attention.
Instead of starting with Section 3.1 on the formula for a determinant, we will explore some of the
geometric properties of the determinant, vaguely combining parts of Sections 3.2 and 3.3 (and
also some stuff from Section 6.1). In the next lecture, we will talk about cofactor expansions (in
fact, we will derive them). In the previous section, we defined the determinant of a 2 × 2 matrix
\[
A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \tag{13.1}
\]
to be
\[
\det A := ad - bc. \tag{13.2}
\]
We were partially motivated to give this quantity a special name because if det A ≠ 0, then the
inverse of the matrix A is given by
\[
A^{-1} = \frac{1}{\det A} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}. \tag{13.3}
\]
There is another perspective on determinants that allows a simple generalization to higher di-
mensions, i.e. for m × m matrices where m does not necessarily equal 2. To understand this
generalization, we first explore some of the geometric properties of the determinant for 2 × 2
matrices.
Example 13.4. Consider the linear transformation given by the matrix
\[
A := \begin{bmatrix} -2 & 2 \\ 1 & 0 \end{bmatrix}.
\]
(The figure shows the unit square spanned by ~e1 and ~e2, decorated with a face, being sent to the
parallelogram spanned by A~e1 and A~e2.) The square obtained from the vectors ~e1 and ~e2 gets
transformed into a parallelogram obtained from the vectors A~e1 and A~e2. The area (a.k.a.
2-dimensional volume) of the square is initially 1. Under the transformation, the area becomes
twice as big, so that gives a resulting area of 2. Also notice that the orientation of the face gets
flipped once (the tear is initially on the left side of the face and after the transformation, it is on
the right side). This is the same thing that happens to you when you look in the mirror. It turns
out that
\[
\det A = (\text{sign of orientation})(\text{volume of parallelogram}) = (-1)(2) = -2, \tag{13.5}
\]
which we can check:
\[
\det A = (-2)(0) - (1)(2) = -2. \tag{13.6}
\]
Notice that if we swap the columns of A, then the transformation becomes
\[
B := \begin{bmatrix} 2 & -2 \\ 0 & 1 \end{bmatrix}
\]
(the figure shows the unit square being sent to the parallelogram spanned by B~e1 and B~e2), and
the face is oriented the same way as in the original situation. The volume is scaled by 2, so we
expect the determinant to be 2, and it is:
\[
\det B = (2)(1) - (0)(-2) = 2. \tag{13.7}
\]
As another example, imagine writing the vector in the first column in the following way:
\[
\begin{bmatrix} 2 \\ 0 \end{bmatrix}
= \begin{bmatrix} 2 \\ 1 \end{bmatrix} + \begin{bmatrix} 0 \\ -1 \end{bmatrix}. \tag{13.8}
\]
Then how is the determinant of B related to the determinants of the transformations
\[
C := \begin{bmatrix} 2 & -2 \\ 1 & 1 \end{bmatrix}
\qquad \text{and} \qquad
D := \begin{bmatrix} 0 & -2 \\ -1 & 1 \end{bmatrix}
\]
(each depicted in a figure by its action on the unit square)? A quick calculation shows that
\[
\det B = \det C + \det D, \qquad 2 = 4 - 2. \tag{13.9}
\]
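Aside (not in [Lay]): all four determinants in this example, and the decomposition (13.9), can be checked numerically. Here is a Python/numpy sketch of mine.
\begin{verbatim}
import numpy as np

A = np.array([[-2,  2], [ 1, 0]])
B = np.array([[ 2, -2], [ 0, 1]])
C = np.array([[ 2, -2], [ 1, 1]])
D = np.array([[ 0, -2], [-1, 1]])

for name, X in [("A", A), ("B", B), ("C", C), ("D", D)]:
    print(name, round(np.linalg.det(X)))   # -2, 2, 4, -2

# The column decomposition (13.9): det B = det C + det D.
print(np.isclose(np.linalg.det(B), np.linalg.det(C) + np.linalg.det(D)))
\end{verbatim}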
The previous example illustrates many of the basic properties of the determinant function. For
a linear transformation \(\mathbb{R}^m \xleftarrow{T} \mathbb{R}^m\), the determinant of the resulting matrix
\[
\begin{bmatrix} | & & | \\ T(\vec{e}_1) & \cdots & T(\vec{e}_m) \\ | & & | \end{bmatrix} \tag{13.10}
\]
is the signed volume of the parallelepiped obtained from the column vectors in the above matrix.
The sign of the determinant is determined by the resulting orientation: +1 if the orientation
is right-handed and −1 if the orientation is left-handed. This definition has several important
properties, many of which have been illustrated in the previous example. But we should gain more
confidence that the determinant of a 2 × 2 matrix is really the area of the parallelogram whose
two edges are obtained from the column vectors.
Theorem 13.11. Let a, b, c, d > 0 and suppose that a > b and d > c. Then the area of the
parallelogram obtained from \(\vec{u} := \begin{bmatrix} a \\ c \end{bmatrix}\) and \(\vec{v} := \begin{bmatrix} b \\ d \end{bmatrix}\) is ad − bc.

Proof. Draw the two vectors and the resulting parallelogram inside the bounding rectangle with
side lengths a + b and c + d. (The figure decomposes this rectangle into the parallelogram together
with two rectangles of area bc each and two pairs of triangles of areas ac/2 and bd/2.) The area
of the parallelogram is therefore
\[
(a + b)(c + d) - ac - bd - 2bc = ad - bc, \tag{13.12}
\]
which is the desired result. Notice that this agrees with the determinant of the matrix \([\vec{u}\ \ \vec{v}]\). Also
notice that if the vectors ~u and ~v get swapped, the orientation of the parallelogram also changes.
This accounts for the fact that the determinant could be negative.
The property of decomposing a column into two parts and calculating the determinant should
also be proved, and its proof is actually more intuitive and provides sufficient justification for the
result.

Theorem 13.13. Let a, b, c, d, e, f > 0 and suppose that a > c > e and b > d > f. Then
\[
\det \begin{bmatrix} a + c & e \\ b + d & f \end{bmatrix}
= \det \begin{bmatrix} a & e \\ b & f \end{bmatrix}
+ \det \begin{bmatrix} c & e \\ d & f \end{bmatrix}. \tag{13.14}
\]
The result is true regardless of the relationship between the numbers a, b, c, d, e, and f, but one
has to keep track of signs.

Proof. Instead of proving this algebraically (which you should be able to do), let's prove it geo-
metrically, which is far more intuitive. Set
\[
\vec{u} := \begin{bmatrix} a \\ b \end{bmatrix}, \quad
\vec{v} := \begin{bmatrix} c \\ d \end{bmatrix}, \quad
\vec{w} := \begin{bmatrix} e \\ f \end{bmatrix}. \tag{13.15}
\]
Then one obtains a picture of the parallelograms spanned by ~u and ~w, by ~v and ~w, and by ~u + ~v
and ~w. The area of the orange shaded region, af − be, plus the area of the green shaded region,
cf − de, is equal to the purple shaded region, (a + c)f − (b + d)e. This proves the theorem.
Another simple consequence of the area interpretation of the determinant is what happens
when columns are scaled.

Theorem 13.16. Let a, b, c, d > 0 and suppose that a > b and d > c. Also, let λ > 0. Then
\[
\det \begin{bmatrix} \lambda a & b \\ \lambda c & d \end{bmatrix}
= \lambda \det \begin{bmatrix} a & b \\ c & d \end{bmatrix}. \tag{13.17}
\]
The result is true regardless of the relationship between the numbers a, b, c, d, and λ, but one
has to keep track of signs.

Proof. As before, set \(\vec{u} := \begin{bmatrix} a \\ c \end{bmatrix}\) and \(\vec{v} := \begin{bmatrix} b \\ d \end{bmatrix}\). For the purposes of the picture, suppose that λ > 1
(a completely analogous proof holds when λ ≤ 1, and drawing the corresponding picture is left as
an exercise). Using the generic picture for ~u and ~v as in the proof of Theorem 13.11, with ~u
stretched to λ~u, shows that the area increases by a factor of λ. This proves the claim.
What happens in higher dimensions? Consider the following 3-dimensional example.

Example 13.18. Consider the transformation from R3 to R3 that scales the second unit vector
by a factor of 2 and shears everything by one unit along the first unit vector. (The figure shows
the unit cube spanned by ~e1, ~e2, ~e3 being sent to the parallelepiped spanned by T~e1, T~e2, T~e3.)
The matrix associated to T is
\[
[T] = \begin{bmatrix} | & | & | \\ T(\vec{e}_1) & T(\vec{e}_2) & T(\vec{e}_3) \\ | & | & | \end{bmatrix}
= \begin{bmatrix} 1 & 1 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{13.19}
\]
Although we do not have a formula for the determinant yet, we can imagine that the determinant
of this transformation is 2 since the volume doubles and the orientation stays the same. However,
now consider reflecting through the ~e2~e3-plane. (The figure shows the parallelepiped spanned by
T~e1, T~e2, T~e3 being reflected to the one spanned by RT~e1, RT~e2, RT~e3.) This reflection reverses
the orientation and hence has determinant −1. Combined with the scale and shear from before,
the transformation RT has determinant −2. As practice, it is useful to verify that the matrix
associated to the transformation RT, which can be seen from the picture (by where the blue
vectors are) to be
\[
[RT] = \begin{bmatrix} | & | & | \\ RT(\vec{e}_1) & RT(\vec{e}_2) & RT(\vec{e}_3) \\ | & | & | \end{bmatrix}
= \begin{bmatrix} -1 & -1 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \tag{13.20}
\]
is the matrix product of the transformations [R] and [T]:
\[
[R][T] = \begin{bmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{13.21}
\]
The first example also illustrates that the determinant for m×m matrices itself can be viewed as
a function from m vectors in Rm to the real numbers R. These m vectors specify the parallelepiped
in Rm and the function det gives the signed volume of this parallelepiped. Before presenting the
general definition of the determinant of m×m matrices in the most abstract version by highlighting
its essential properties that we have discovered above, we will first describe another formula for
the determinant of 3 × 3 matrices, which can, and will, be derived from the abstract definition.
You might have learned in multivariable calculus that the volume of the parallelepiped P obtained from three vectors ~v1, ~v2, and ~v3 is given by
|(~v1 × ~v2) · ~v3|, (13.22)
where × is the cross product, · is the dot product, and | · | denotes the absolute value of a number.
In fact, the orientation of the parallelepiped P is given by the sign, so it is better to write
(~v1 × ~v2) · ~v3. (13.23)
Recall, the dot product of two vectors ~u and ~v in R3 is a number and is given by
~u · ~v := [u1 u2 u3][v1; v2; v3] = u1v1 + u2v2 + u3v3. (13.24)
The cross product of two vectors ~u and ~v in R3 is a vector in R3 and can be expressed in terms of
determinants of 2× 2 matrices. It is given by
~u × ~v := (u2v3 − u3v2)~e1 − (u1v3 − u3v1)~e2 + (u1v2 − u2v1)~e3
        = det[u2 v2; u3 v3]~e1 − det[u1 v1; u3 v3]~e2 + det[u1 v1; u2 v2]~e3. (13.25)
There is a lot of geometric meaning behind these formulas so we should explore them for the
moment. Given two vectors ~u and ~v, the dot product of ~u with ~v is given by
~u · ~v = ‖~u‖‖~v‖ cos θ, (13.26)
where θ is the angle between ~u and ~v and ‖ · ‖ applied to a vector is the length of that vector.
Note that the length ‖~u‖ of a vector ~u in R3 is itself given in terms of the dot product by
‖~u‖ =√~u · ~u (13.27)
This last identity follows from the Pythagorean Theorem (used twice), as the following picture
illustrates.
(Picture: the vector ~u in R³ with its projections ~uxy onto the xy-plane and ~ux, ~uy onto the x- and y-axes.)
In this picture,
~uxy := [u1; u2; 0], ~ux := [u1; 0; 0] & ~uy := [0; u2; 0] (13.28)
are the projections of the vector ~u onto the xy-plane, the x-axis, and the y-axis, respectively. The
cross product of ~u and ~v satisfies the following condition
‖~u× ~v‖ = ‖~u‖‖~v‖ sin θ, (13.29)
where θ is the angle between ~u and ~v. This provides the length of the vector ~u × ~v. In fact, this
length is the area of the parallelogram obtained from ~u and ~v. Notice that it is zero if ~u and ~v are
parallel (i.e. one is a scalar multiple of the other). The direction of ~u× ~v (up to a plus or minus
sign) follows from the fact that it satisfies the conditions
(~u× ~v) · ~u = 0 & (~u× ~v) · ~v = 0, (13.30)
which can be verified via a quick computation. Setting ϕu be the angle between ~u× ~v and ~u and
setting ϕv to be the angle between ~u× ~v and ~v, these conditions say
0 = ‖(~u× ~v)‖‖~u‖ cosϕu & 0 = ‖(~u× ~v)‖‖~v‖ cosϕv (13.31)
i.e. ~u× ~v is orthogonal to both ~u and ~v (provided that the lengths of all of these vectors are not
zero). So the only thing left to determine the vector ~u × ~v is the sign of the direction. This is
determined by the right-hand rule, which says that you place the edge of your right hand, pinky first, along ~u, curl your fingers towards ~v, and then your thumb points up towards ~u × ~v.
Going back to our definition for the volume of a parallelepiped in R3, denote the components
of the vectors ~v1, ~v2, and ~v3, as follows
~v1 = [v11; v21; v31], ~v2 = [v12; v22; v32], & ~v3 = [v13; v23; v33]. (13.32)
Then
~v1 × ~v2 := [v21v32 − v31v22; v31v12 − v11v32; v11v22 − v21v12] (13.33)
while the dot product just multiplies corresponding entries and adds everything together:
(~v1 × ~v2) · ~v3 = (v21v32 − v31v22)v13 − (v11v32 − v31v12)v23 + (v11v22 − v21v12)v33. (13.34)
In the parallelepiped example from above, we have
([−1; 0; 0] × [−1; 2; 0]) · [0; 0; 1] = [0; 0; −2] · [0; 0; 1] = −2. (13.35)
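The homework must be done by hand, but a quick numerical check can help build intuition. Here is a minimal Python sketch (my own illustration; the function names are not from [Lay]) that computes the dot product, the cross product from (13.25), and the scalar triple product, reproducing (13.35).

```python
# Minimal sketch: dot product, cross product, and scalar triple product in R^3.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cross(u, v):
    # Formula (13.25): each component is a 2 x 2 determinant.
    return (u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0])

def triple(v1, v2, v3):
    # Signed volume of the parallelepiped spanned by v1, v2, v3.
    return dot(cross(v1, v2), v3)

# The example from (13.35): the scaled, sheared, and reflected unit cube.
print(triple((-1, 0, 0), (-1, 2, 0), (0, 0, 1)))  # prints -2
```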
If we look closely at expression (13.34), we see that this is a linear combination of 2×2 determinants!
Namely,
(~v1 × ~v2) · ~v3 = v13 det[v21 v22; v31 v32] − v23 det[v11 v12; v31 v32] + v33 det[v11 v12; v21 v22]. (13.36)
We can visualize these determinants together with their appropriate factors by writing the 3 × 3 array of the entries vij three times, once for each term, combined with the signs +, −, +: in each copy, the relevant third-column entry (v13, v23, or v33) is highlighted together with the 2 × 2 block that remains after deleting its row and column. (13.37)
In fact, one can show that this is also equal to the analogous picture built by highlighting the first-row entries v11, v12, v13 and their complementary 2 × 2 blocks, (13.38),
i.e.
(~v1 × ~v2) · ~v3 = v11 det[v22 v23; v32 v33] − v12 det[v21 v23; v31 v33] + v13 det[v21 v22; v31 v32]. (13.39)
This equality of expressions can be thought of as a proof of the identity
(~v1 × ~v2) · ~v3 = (~v2 × ~v3) · ~v1. (13.40)
In other words, under the permutation^42
(1 2 3; 2 3 1) (13.41)
that sends 1 to 2, sends 2 to 3, and sends 3 to 1, the expression for the determinant does not
change. If however, we swapped only two indices once, then we would get a negative sign (check
this!):
(~v1 × ~v2) · ~v3 = −(~v2 × ~v1) · ~v3. (13.42)
It helps to understand the properties a little more precisely in the 3× 3 case.
Exercise 13.43. In this exercise, you will explore the properties of the dot product and then use
these properties to reconstruct the formula for the dot product.
(a) Using the formula for the dot product of 3-component vectors, prove that
~v · ~w = ~w · ~v (13.44)
for all vectors ~v, ~w ∈ R3.
42The general definition of a permutation is given in the next lecture.
(b) Using the formula for the dot product of 3-component vectors prove that
~ei · ~ej = δij, (13.45)
where
δij := { 1 if i = j, 0 otherwise } (13.46)
For example, ~e1 · ~e1 = 1 while ~e1 · ~e2 = 0.
(c) Using the formula for the dot product of 3-component vectors, prove that
(~u+ ~v) · ~w = ~u · ~w + ~v · ~w (13.47)
for all vectors ~u,~v, ~w ∈ R3.
(d) Now, using only the results from parts (a) and (c) of this exercise (and not the formula for
the dot product in terms of the components of the vectors), prove that
~v · (~u+ ~w) = ~v · ~u+ ~v · ~w (13.48)
for all vectors ~u,~v, ~w ∈ R3.
(e) Using only the results of the previous parts of this exercise (and not the formula for the dot
product in terms of the components of the vectors) and the fact that all vectors ~v and ~w can
be expressed as linear combinations of the unit vectors
~v = Σ_{i=1}^{3} vi ~ei & ~w = Σ_{j=1}^{3} wj ~ej, (13.49)
prove that
~v · ~w = Σ_{i=1}^{3} vi wi, (13.50)
agreeing with the formula we had initially for the dot product.
This shows us that conditions (a), (b), and (c) are enough to define the dot product of vectors.
A similar fact holds for the cross product, as the following exercise shows.
Exercise 13.51. In this exercise, you will explore the properties of the cross product and then
use these properties to reconstruct the formula for the cross product.
(a) Using the formula for the cross product of 3-component vectors, prove that
~v × ~w = −~w × ~v (13.52)
for all vectors ~v, ~w ∈ R3.
(b) Using the formula for the cross product of 3-component vectors prove that
~ei × ~ej = Σ_{k=1}^{3} εijk ~ek, (13.53)
where
εijk := { 1 if ijk is a cyclic permutation, −1 if ijk is an anti-cyclic permutation, 0 if i = j, j = k, or i = k. } (13.54)
Note that a cyclic permutation of 1, 2, 3 is one that is read off going around the cycle 1 → 2 → 3 → 1 in the clockwise direction, (13.55),
beginning at any number. An anti-cyclic permutation of 1, 2, 3 is one that is read off going around the same cycle in the counter-clockwise direction, (13.56),
beginning at any number. So for example, 231 is cyclic while 321 is anti-cyclic. Hence,
~e2 × ~e3 = ~e1 while ~e3 × ~e2 = −~e1.
(c) Using the formula for the cross product of 3-component vectors, prove that
(~u+ ~v)× ~w = ~u× ~w + ~v × ~w (13.57)
for all vectors ~u,~v, ~w ∈ R3.
(d) Now, using only the results from parts (a) and (c) of this exercise (and not the formula for
the cross product in terms of the components of the vectors), prove that
~v × (~u+ ~w) = ~v × ~u+ ~v × ~w (13.58)
for all vectors ~u,~v, ~w ∈ R3.
(e) Using only the results of the previous parts of this exercise (and not the formula for the cross
product in terms of the components of the vectors) and the fact that all vectors ~v and ~w can
be expressed as linear combinations of the unit vectors
~v = Σ_{i=1}^{3} vi ~ei & ~w = Σ_{j=1}^{3} wj ~ej, (13.59)
prove that
~v × ~w = Σ_{i=1}^{3} Σ_{j=1}^{3} Σ_{k=1}^{3} vi wj εijk ~ek, (13.60)
and then express this result as a 3-component vector by showing that it equals
~v × ~w = [v2w3 − v3w2; v3w1 − v1w3; v1w2 − v2w1], (13.61)
agreeing with the formula we had initially for the cross product.
This shows us that conditions (a), (b), and (c) are enough to define the cross product of vectors.
Therefore, since the determinant is expressed in terms of the cross product and the dot product,
it, too, can be characterized by similar properties, as we will see later.
Exercise 13.62. Construct a geometric proof of the determinant formula for a 3 × 3 matrix
analogous to the geometric construction of the determinant for 2 × 2 matrices. In other words,
prove that the (signed) volume obtained from three linearly independent vectors ~u,~v, ~w is
det[~u ~v ~w]. (13.63)
As an initial step, first provide a geometric proof that
det[1 0 0; 0 a b; 0 c d] = det[a b; c d]. (13.64)
Recommended Exercises. Please check HuskyCT for the homework. Be able to show all your
work, step by step! Do not use calculators or computer programs to solve any problems!
In this lecture, we discussed parts of Sections 3.2, 3.3, and 6.1 of [Lay].
14 The determinant and the formula for inverse matrices
All of the properties that we have discovered for the determinant of 2 × 2 and 3 × 3 matrices in
terms of signed volumes can be turned into a definition for arbitrary m × m matrices. We will
therefore think of the determinant of an m×m matrix as the signed volume of the parallelepiped
obtained from the m columns of the matrix. In other words, the determinant should be a function
sending m m-component vectors to some real number, which is to be interpreted as the signed
volume of the parallelepiped obtained from these m vectors. It should satisfy all of the properties
discussed in the previous section. Namely,
(a) the signed volume changes sign when any two vectors are swapped,
(b) the signed volume is linear in each column,
(c) the signed volume of the unit cube should be 1.
Definition 14.1. The determinant for m × m matrices is a function^43
det : Rm × · · · × Rm → R (m factors), (14.2)
which we think of as assigning a number to m m-component vectors aligned as in a matrix,
[~v1 · · · ~vm] 7→ det(~v1, . . . , ~vm), (14.3)
satisfying the following conditions.
(a) For every m-tuple of vectors (~v1, . . . , ~vm) in Rm,
det(~v1, . . . , ~vi, . . . , ~vj, . . . , ~vm) = − det(~v1, . . . , ~vj, . . . , ~vi, . . . , ~vm), (14.4)
where the vectors in positions i and j have been switched. This is sometimes called the skew-symmetry of det.
(b) det is multilinear, i.e.
det(~v1, . . . , a~vi + b~ui, . . . , ~vm) = a det(~v1, . . . , ~vi, . . . , ~vm) + b det(~v1, . . . , ~ui, . . . , ~vm) (14.5)
for all i = 1, . . . , m, all scalars a, b, and all vectors ~v1, . . . , ~vi−1, ~vi, ~ui, ~vi+1, . . . , ~vm.
(c) The determinant of the unit vectors, listed in order, is 1:
det(~e1, . . . , ~em) = 1. (14.6)
For an m × m matrix A, the input for the determinant function consists of the columns of A written in order:
detA := det(A~e1, . . . , A~em). (14.7)
43 I found this description of the determinant at http://math.stackexchange.com/questions/668/whats-an-intuitive-way-to-think-about-the-determinant
Corollary 14.8. Let (~v1, . . . , ~vi, . . . , ~vj, . . . , ~vm) be a sequence of vectors in Rm where ~vi = ~vj (and yet i ≠ j). Then
det(~v1, . . . , ~vi, . . . , ~vj, . . . , ~vm) = 0. (14.9)
Proof. By the skew-symmetry condition (a) in the definition of the determinant, switching the i-th and j-th entries gives
det(~v1, . . . , ~vi, . . . , ~vj, . . . , ~vm) = − det(~v1, . . . , ~vj, . . . , ~vi, . . . , ~vm), (14.10)
but since ~vi = ~vj,
− det(~v1, . . . , ~vj, . . . , ~vi, . . . , ~vm) = − det(~v1, . . . , ~vi, . . . , ~vj, . . . , ~vm). (14.11)
Putting these two equalities together gives
det(~v1, . . . , ~vi, . . . , ~vj, . . . , ~vm) = − det(~v1, . . . , ~vi, . . . , ~vj, . . . , ~vm). (14.12)
A number can only equal the negative of itself if it is zero. Hence,
det(~v1, . . . , ~vi, . . . , ~vj, . . . , ~vm) = 0 (14.13)
whenever ~vi = ~vj and i ≠ j.
Corollary 14.14. Let (~v1, . . . , ~vi, . . . , ~vj, . . . , ~vm) be a list of m vectors in Rm. Then
det(~v1, . . . , ~vi, . . . , ~vj, . . . , ~vm) = det(~v1, . . . , ~vi + ~vj, . . . , ~vj, . . . , ~vm) (14.15)
for any i and j between 1 and m with i ≠ j.
Proof. This follows immediately from multilinearity and the previous Corollary:
det(~v1, . . . , ~vi + ~vj, . . . , ~vj, . . . , ~vm) = det(~v1, . . . , ~vi, . . . , ~vj, . . . , ~vm) + det(~v1, . . . , ~vj, . . . , ~vj, . . . , ~vm), (14.16)
and the second term is 0 because ~vj repeats itself in the argument of det.
Let’s check to make sure this definition reduces to the determinant of a 2× 2 matrix. Let
A := [a b; c d]. (14.17)
Then the columns of A are
([a; c], [b; d]) = (a~e1 + c~e2, b~e1 + d~e2). (14.18)
Therefore,
detA = det(a~e1 + c~e2, b~e1 + d~e2)
     = a det(~e1, b~e1 + d~e2) + c det(~e2, b~e1 + d~e2)
     = ab det(~e1, ~e1) + ad det(~e1, ~e2) + cb det(~e2, ~e1) + cd det(~e2, ~e2)
     = ab · 0 + ad · 1 + cb · (−1) + cd · 0
     = ad − bc. (14.19)
Wow! This abstract technique actually worked! And we didn’t have to memorize the specific
formulas for all the different cases of 2×2, 3×3, or more general matrices. All we have to remember
are the three conditions in Definition 14.1. Granted, it takes longer to use this definition right
now, and we will learn a particular pattern that will help us soon.
Example 14.20. Let's try an example in R³, this time working toward a more explicit formula. Let
A := [1 1 0; 2 0 1; 0 −1 1]. (14.21)
Then
detA = det(A~e1, A~e2, A~e3)
     = det(~e1 + 2~e2, ~e1 − ~e3, ~e2 + ~e3)
     = det(~e1, ~e1 − ~e3, ~e2 + ~e3) + 2 det(~e2, ~e1 − ~e3, ~e2 + ~e3)
     = det(~e1, ~e1, ~e2 + ~e3) − det(~e1, ~e3, ~e2 + ~e3) + 2 det(~e2, ~e1, ~e2 + ~e3) − 2 det(~e2, ~e3, ~e2 + ~e3)
     = − det(~e1, ~e3, ~e2) − det(~e1, ~e3, ~e3) + 2 det(~e2, ~e1, ~e2) + 2 det(~e2, ~e1, ~e3) − 2 det(~e2, ~e3, ~e2) − 2 det(~e2, ~e3, ~e3)
     = −(−1) − 0 + 0 + 2(−1) − 0 − 0
     = −1. (14.22)
We can also calculate the determinant using our first formula in terms of the signed volume of the
parallelepiped.
With this general idea, you can calculate determinants of very large matrices as well, though
perhaps it may take a lot of time. We will try to do this by first recalling the definition of a
permutation. We have already discussed the special case of permutations of two and three indices,
but we need a more general definition for n indices.
Definition 14.23. A permutation of a list of numbers (1, 2, . . . , n) is a rearrangement of these same numbers in a different order (σ(1), σ(2), . . . , σ(n)). An elementary permutation is a permutation in which exactly two numbers, say i and j, switch places while every other number stays where it is. (14.24)
Theorem 14.25. Fix a positive integer n. Every permutation of (1, 2, . . . , n) is obtained from
successive elementary permutations. Furthermore, the number of such elementary permutations is
either always even or always odd for a given permutation.
Definition 14.26. If a permutation can be expressed as an even number of elementary permu-
tations, then its sign is +1. Otherwise, its sign is −1. The sign of a permutation σ is written as
sign(σ).
An arbitrary permutation is often written as
(1 2 3 · · · n; σ(1) σ(2) σ(3) · · · σ(n)). (14.27)
As an example, for a permutation of 3 numbers
(1 2 3; i j k), (14.28)
the sign reproduces something we have already seen before:
sign(1 2 3; i j k) = εijk, (14.29)
where εijk was defined in (13.54). Now, let A be an arbitrary m×m matrix as in
A = [a11 a12 · · · a1m; a21 a22 · · · a2m; ... ; am1 am2 · · · amm]. (14.30)
The j-th column of A is
[a1j; a2j; ... ; amj] = Σ_{ij=1}^{m} a_{ij j} ~e_{ij}. (14.31)
Therefore,
detA = det( Σ_{i1=1}^{m} a_{i1 1} ~e_{i1}, . . . , Σ_{im=1}^{m} a_{im m} ~e_{im} )
     = Σ_{i1=1}^{m} · · · Σ_{im=1}^{m} a_{i1 1} · · · a_{im m} det(~e_{i1}, . . . , ~e_{im})
     = Σ_{i1, . . . , im all distinct} a_{i1 1} · · · a_{im m} sign(σ_{i1 . . . im}), (14.32)
where σ_{i1 . . . im} is the permutation defined by
(1 2 · · · m; i1 i2 · · · im). (14.33)
In the second equality in (14.32), the multi-linearity of det was used. In the third equality, the
skew-symmetry of det was used together with the fact that the determinant of the identity matrix
is 1.
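To see formula (14.32) in action, here is a short Python sketch (my own, purely illustrative and not part of the course material) that computes the determinant by summing over all permutations, with the sign obtained by counting inversions.

```python
from itertools import permutations

def sign(perm):
    # An even number of inversions means an even permutation (sign +1).
    inversions = sum(1 for i in range(len(perm))
                       for j in range(i + 1, len(perm))
                       if perm[i] > perm[j])
    return 1 if inversions % 2 == 0 else -1

def det(A):
    # Formula (14.32): sum over permutations of sign times a_{i1 1} ... a_{im m}.
    m = len(A)
    total = 0
    for perm in permutations(range(m)):
        prod = 1
        for col in range(m):
            prod *= A[perm[col]][col]
        total += sign(perm) * prod
    return total

A = [[1, 1, 0],
     [2, 0, 1],
     [0, -1, 1]]
print(det(A))  # prints -1, agreeing with Example 14.20
```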
Example 14.34. Let’s use this formula to compute the determinant of a 3× 3 matrix and check
that it agrees with our previous definition. Let the 3× 3 matrix be of the form
A := [a11 a12 a13; a21 a22 a23; a31 a32 a33]. (14.35)
Then,
detA = Σ_{i1, i2, i3 all distinct} a_{i1 1} a_{i2 2} a_{i3 3} sign(σ_{i1 i2 i3})
     = a11 Σ_{i2, i3 ∈ {2,3}, i2 ≠ i3} a_{i2 2} a_{i3 3} sign(σ_{1 i2 i3}) + a12 Σ_{i1, i3 ∈ {2,3}, i1 ≠ i3} a_{i1 1} a_{i3 3} sign(σ_{i1 1 i3}) + a13 Σ_{i1, i2 ∈ {2,3}, i1 ≠ i2} a_{i1 1} a_{i2 2} sign(σ_{i1 i2 1})
     = a11 Σ a_{i2 2} a_{i3 3} sign(σ_{i2 i3}) − a12 Σ a_{i1 1} a_{i3 3} sign(σ_{i1 i3}) + a13 Σ a_{i1 1} a_{i2 2} sign(σ_{i1 i2})
     = a11(a22a33 − a32a23) − a12(a21a33 − a31a23) + a13(a21a32 − a31a22)
     = a11 det[a22 a23; a32 a33] − a12 det[a21 a23; a31 a33] + a13 det[a21 a22; a31 a32] (14.36)
by some rearrangements of the terms. We therefore see that this agrees with our initial definition!
We can also re-write this expression in terms of the 2× 2 matrices by introducing the notation
detA = a11 detA11 − a12 detA12 + a13 detA13. (14.37)
A1j is the matrix obtained from A by deleting the 1st row and the j-th column; pictorially, A11, A12, and A13 are obtained from the 3 × 3 array of the entries aij by crossing out the first row together with the first, second, and third column, respectively. (14.38)
Note that we could have chosen to do this rearrangement by separating out the second or third
row instead of the first (we have already analyzed a similar situation earlier). Just to illustrate
the slight difference, let’s separate out the second row explicitly.
detA = Σ_{i1, i2, i3 all distinct} a_{i1 1} a_{i2 2} a_{i3 3} sign(σ_{i1 i2 i3})
     = a21 Σ_{i2, i3 ∈ {1,3}, i2 ≠ i3} a_{i2 2} a_{i3 3} sign(σ_{2 i2 i3}) + a22 Σ_{i1, i3 ∈ {1,3}, i1 ≠ i3} a_{i1 1} a_{i3 3} sign(σ_{i1 2 i3}) + a23 Σ_{i1, i2 ∈ {1,3}, i1 ≠ i2} a_{i1 1} a_{i2 2} sign(σ_{i1 i2 2})
     = −a21 Σ a_{i2 2} a_{i3 3} sign(σ_{i2 i3}) + a22 Σ a_{i1 1} a_{i3 3} sign(σ_{i1 i3}) − a23 Σ a_{i1 1} a_{i2 2} sign(σ_{i1 i2})
     = −a21(a12a33 − a32a13) + a22(a11a33 − a31a13) − a23(a11a32 − a31a12)
     = −a21 detA21 + a22 detA22 − a23 detA23, (14.39)
where A21, A22, and A23 are obtained from A by deleting the second row together with the first, second, and third column, respectively. (14.40)
The preceding example is quite general. A similar calculation gives an inductive formula for
the determinant of an m×m matrix A as in (14.30) in terms of determinants of (m− 1)× (m− 1)
matrices. A completely similar calculation, just more involved, gives
detA = Σ_{j=1}^{m} (−1)^{j+1} a1j detA1j, (14.41)
where A1j is the (m − 1) × (m − 1) matrix obtained by deleting the first row of A and the j-th column of A, (14.42), i.e.
A1j = [a21 · · · a2,j−1  a2,j+1 · · · a2m; ... ; am1 · · · am,j−1  am,j+1 · · · amm]. (14.43)
More generally, we could have chosen the i-th row and the j-th column in calculating this deter-
minant
detA = Σ_{j=1}^{m} (−1)^{j+i} aij detAij, (14.44)
where Aij is the (m − 1) × (m − 1) matrix obtained from A by deleting the i-th row and the j-th column, (14.45), i.e.
Aij = [a11 · · · a1,j−1  a1,j+1 · · · a1m; ... ; a_{i−1,1} · · · a_{i−1,j−1}  a_{i−1,j+1} · · · a_{i−1,m}; a_{i+1,1} · · · a_{i+1,j−1}  a_{i+1,j+1} · · · a_{i+1,m}; ... ; am1 · · · am,j−1  am,j+1 · · · amm]. (14.46)
Definition 14.47. Let A be an m×m matrix as in the previous lecture. In the previous formula
for the determinant,
Cij := (−1)i+j detAij (14.48)
is called the (i, j)-cofactor of A.
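The cofactor expansion (14.44) is naturally recursive: an m × m determinant is expressed through (m − 1) × (m − 1) determinants. Here is a hedged Python sketch of that recursion (expanding along the first row, as in (14.41)); it is an illustration, not something to use on the homework.

```python
def minor(A, i, j):
    # The matrix A_{ij}: delete row i and column j (indices start at 0 here).
    return [row[:j] + row[j+1:] for k, row in enumerate(A) if k != i]

def det(A):
    # Cofactor expansion along the first row, formula (14.41).
    m = len(A)
    if m == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(m))

print(det([[1, 1, 0], [2, 0, 1], [0, -1, 1]]))  # prints -1, as in Example 14.20
```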
Theorem 14.49. Let A be an m × m matrix. A is invertible if and only if detA ≠ 0. Furthermore, when this happens,
A^{−1} = (1/detA) [C11 C21 · · · Cm1; C12 C22 · · · Cm2; ... ; C1m C2m · · · Cmm], (14.50)
where Cij is the (i, j)-cofactor of A.
Recall the definition of the transpose of a matrix.
Definition 14.51. The transpose of an m × n matrix
A = [a11 a12 · · · a1n; a21 a22 · · · a2n; ... ; am1 am2 · · · amn] (14.52)
is the n × m matrix
A^T := [a11 a21 · · · am1; a12 a22 · · · am2; ... ; a1n a2n · · · amn]. (14.53)
We will provide several interpretations of the transpose later in this course.
Theorem 14.54. Let A be an m×m matrix. Then
detAT = detA. (14.55)
Proof. One can prove this directly from the formula for the determinant.
Exercise 14.56. Provide a geometric proof of
det[a c; b d] = det[a b; c d]. (14.57)
Definition 14.58. The matrix
[C11 C21 · · · Cm1; C12 C22 · · · Cm2; ... ; C1m C2m · · · Cmm] ≡ [C11 C12 · · · C1m; C21 C22 · · · C2m; ... ; Cm1 Cm2 · · · Cmm]^T (14.59)
in Theorem 14.49 is called the adjugate of A. It is often written as adjA.
From the definition of the transpose of a matrix, the definition of the determinant, and Corollary
14.14, one can calculate the determinant using row operations. Swapping rows introduces a minus
sign from the definition of det. Adding a multiple of one row to another row does not change the determinant by this theorem and
Corollary 14.14. In your homework, you are asked to find a formula for the determinant of
an upper-triangular matrix. In other words, if a matrix A is row reduced to a matrix A′ without
any scaling of rows, then
detA = (−1)number of row swaps detA′. (14.60)
This is a faster method for computing the determinant.
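Here is a hedged Python sketch of this faster method (my own illustration): eliminate below each pivot by adding multiples of rows, swap rows when a pivot entry is zero, and multiply the diagonal entries of the resulting upper-triangular matrix, keeping track of the sign from the swaps.

```python
def det(A):
    # Determinant by row reduction without scaling rows, as in (14.60).
    A = [row[:] for row in A]       # work on a copy
    m, swaps = len(A), 0
    for col in range(m):
        # Find a row at or below 'col' with a nonzero entry in this column.
        pivot = next((r for r in range(col, m) if A[r][col] != 0), None)
        if pivot is None:
            return 0                # the matrix is singular, so det = 0
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            swaps += 1              # each swap contributes a factor of -1
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    prod = 1
    for i in range(m):
        prod *= A[i][i]             # upper-triangular determinant: product of the diagonal
    return (-1) ** swaps * prod

print(det([[1, 1, 3], [2, -2, 1], [0, 1, 0]]))  # prints 5.0 (up to floating point)
```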
Let’s look back at Example 14.34 by using these results to calculate the inverse of a specific
3× 3 matrix.
Example 14.61. Let's find the inverse of the matrix
A := [1 1 3; 2 −2 1; 0 1 0]. (14.62)
Expanding along the third row (only its middle entry is nonzero), the determinant of A is
detA = 1 · (−1)^{3+2} det[1 3; 2 1] = −(1 − 6) = 5. (14.63)
The adjugate matrix of A is given by
adjA := [ det[−2 1; 1 0]   −det[2 1; 0 0]   det[2 −2; 0 1] ;
          −det[1 3; 1 0]   det[1 3; 0 0]   −det[1 1; 0 1] ;
          det[1 3; −2 1]   −det[1 3; 2 1]   det[1 1; 2 −2] ]^T
      = [−1 0 2; 3 0 −1; 7 5 −4]^T = [−1 3 7; 0 0 5; 2 −1 −4]. (14.64)
Therefore,
A^{−1} = (1/detA) adjA = (1/5)[−1 3 7; 0 0 5; 2 −1 −4]. (14.65)
Let's just make sure this works:
A^{−1}A = (1/5)[−1 3 7; 0 0 5; 2 −1 −4][1 1 3; 2 −2 1; 0 1 0]
        = (1/5)[−1 + 6 + 0   −1 − 6 + 7   −3 + 3 + 0; 0 + 0 + 0   0 + 0 + 5   0 + 0 + 0; 2 − 2 + 0   2 + 2 − 4   6 − 1 + 0]
        = [1 0 0; 0 1 0; 0 0 1]. (14.66)
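If you want to double-check an adjugate computation like (14.64), here is a small Python sketch (an illustration only, reusing the same minor and det functions as in the sketch after Definition 14.47).

```python
def minor(A, i, j):
    return [row[:j] + row[j+1:] for k, row in enumerate(A) if k != i]

def det(A):
    m = len(A)
    if m == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(m))

def adjugate(A):
    # Transpose of the cofactor matrix: the (j, i) entry is C_{ij} = (-1)^{i+j} det(A_{ij}).
    m = len(A)
    return [[(-1) ** (i + j) * det(minor(A, i, j)) for i in range(m)] for j in range(m)]

A = [[1, 1, 3], [2, -2, 1], [0, 1, 0]]
adj = adjugate(A)
d = det(A)                                            # 5
print(adj)                                            # [[-1, 3, 7], [0, 0, 5], [2, -1, -4]]
print([[entry / d for entry in row] for row in adj])  # A^{-1}, matching (14.65)
```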
Recommended Exercises. See homework. Be able to show all your work, step by step! Do not
use calculators or computer programs to solve any problems!
In this lecture, we finished Sections 3.1, 3.2, and 3.3 of [Lay]. Notice that we skipped Cramer’s
rule.
15 Orthogonality
The dot product can be used to define the length and orthogonality of vectors in Euclidean space.
Later, we will use this idea to diagonalize certain kinds of matrices. Along the way, we will discuss
projections, which are special examples of linear transformations. To get a better understanding
of the dot product, we have the following theorems, which include some useful inequalities for the
dot product in Euclidean space. In what follows, the notation 〈~v, ~u〉 will be used to denote the
dot product, ~v · ~u, which is also often referred to as the inner product of ~v and ~u. We will use this
notation to avoid confusing this with matrix multiplication.
Definition 15.1. The Euclidean norm/length on Rn is the function Rn → R defined by
‖(x1, x2, . . . , xn)‖ := √(x1² + x2² + · · · + xn²). (15.2)
The Euclidean inner product on Rn is the function Rn × Rn → R defined by
〈(x1, x2, . . . , xn), (y1, y2, . . . , yn)〉 := x1y1 + x2y2 + · · · + xnyn. (15.3)
Often, the short-hand notation
Σ_{i=1}^{n} zi := z1 + z2 + · · · + zn (15.4)
will be used to denote the sum of n real numbers zi ∈ R, i ∈ {1, . . . , n}.
Theorem 15.5. Rn with these structures satisfies the following for all vectors ~x, ~y, ~z ∈ Rn and for
all numbers c ∈ R,
(a) 〈~x, ~x〉 ≥ 0 and 〈~x, ~x〉 = 0 if and only if ~x = 0,
(b) ‖~x‖ = √〈~x, ~x〉,
(c) 〈~x, ~y〉 = 〈~y, ~x〉,
(d) 〈c~x, ~y〉 = c〈~x, ~y〉 = 〈~x, c~y〉,
(e) 〈~x + ~z, ~y〉 = 〈~x, ~y〉 + 〈~z, ~y〉 and 〈~x, ~y + ~z〉 = 〈~x, ~y〉 + 〈~x, ~z〉,
(f) |〈~x, ~y〉| ≤ ‖~x‖‖~y‖ and equality holds if and only if ~x and ~y are linearly dependent (Cauchy-
Schwarz inequality),
(g) ‖~x+ ~y‖ ≤ ‖~x‖+ ‖~y‖ (triangle inequality),
(h) 〈~x, ~y〉 = (‖~x + ~y‖² − ‖~x − ~y‖²)/4 (polarization identity).
Proof.
(a) Let (x1, . . . , xn) := ~x denote the components of ~x. Then
〈~x, ~x〉 = Σ_{i=1}^{n} xi² ≥ 0 (15.6)
since the square of any real number is at least 0. Furthermore, a sum of squares is zero if and only if each term is zero, and the square of a real number is zero if and only if that number is zero. Hence xi = 0 for all i if and only if 〈~x, ~x〉 = 0.
(b) This follows immediately from the definitions of ‖ · ‖ in (15.2) and 〈 · , · 〉 in (15.3).
(c) This follows from commutativity of multiplication of real numbers and the formula (15.3).
(d) This follows from commutativity and associativity of multiplication of real numbers and the
formula (15.3).
(e) This follows from the distributive law for real numbers and the formula (15.3).
(f) Suppose that ~x and ~y are linearly independent. Then ~x ≠ λ~y for all λ ∈ R. Hence,
0 < ‖λ~y − ~x‖² = Σ_{i=1}^{n} (λyi − xi)² = λ² Σ_{i=1}^{n} yi² − 2λ Σ_{i=1}^{n} xiyi + Σ_{i=1}^{n} xi². (15.7)
In particular, this is a quadratic equation in the variable λ that has no real solutions, so its discriminant is negative:
(−2 Σ_{i=1}^{n} xiyi)² − 4 (Σ_{i=1}^{n} yi²)(Σ_{j=1}^{n} xj²) < 0. (15.8)
Rewriting this and canceling the common factor of 4 gives
Σ_{i=1}^{n} Σ_{j=1}^{n} xiyixjyj < Σ_{i=1}^{n} Σ_{j=1}^{n} xi² yj². (15.9)
Applying the square root to both sides gives the desired result (a strict inequality in this case).
Now suppose ~x = λ~y for some λ ∈ R. Then by parts (b) and (d), equality holds.
(g) Notice that by the Cauchy-Schwarz inequality,
‖~x + ~y‖² = Σ_{i=1}^{n} (xi + yi)²
          = Σ_{i=1}^{n} xi² + 2 Σ_{i=1}^{n} xiyi + Σ_{i=1}^{n} yi²
          ≤ Σ_{i=1}^{n} xi² + 2 |Σ_{i=1}^{n} xiyi| + Σ_{i=1}^{n} yi²
          ≤ ‖~x‖² + 2‖~x‖‖~y‖ + ‖~y‖²
          = (‖~x‖ + ‖~y‖)². (15.10)
Applying the square root and using parts (a) and (b) gives the desired result.
(h) This calculation is left to the reader.
Definition 15.11. A vector ~u in Rn of length 1 is called a normalized/unit vector. If ~v is any nonzero vector in Rn, its normalization is given by
~v/‖~v‖ = ~v / √(v1² + · · · + vn²), (15.12)
where
~v = [v1; ... ; vn] (15.13)
are the components of the vector ~v. Occasionally, the notation v will be used for the normalized vector associated to ~v.
Normalized vectors always lie on the unit sphere in whatever Euclidean space the vector is in. Below are two drawings of the unit sphere along with the normalized versions of vectors ~v in two dimensions and three dimensions.^44
(Pictures: the unit circle in R² and the unit sphere in R³, each showing a vector ~v and its normalization ~v/‖~v‖.)
Definition 15.14. Let ~v and ~u be two vectors in Rn with ~u nonzero. The vector projection of ~v onto ~u is the vector Pu~v given by
Pu~v := (〈~v, ~u〉/‖~u‖²) ~u. (15.15)
In terms of the normalized vector u, this is expressed as
Pu~v = 〈~v, u〉u. (15.16)
The scalar projection of ~v onto ~u is the Euclidean inner product of ~v with u:
Su~v := 〈~v, u〉. (15.17)
Notice that the scalar projection is a number while the vector projection is a vector. In fact,
Pu~v = (Su~v)u. (15.18)
The vector projection of ~v onto ~u is interpreted geometrically by the following picture.
(Picture: the vectors ~u and ~v with the projection Pu~v lying along ~u.)
44 The sphere drawing was done using a modified version of code written by Christian Feuersanger and was found at http://tex.stackexchange.com/questions/124916/draw-sphere-pgfplots-with-axis-at-center.
In this drawing,
~u = [4; 1] & ~v = [2; 3], (15.19)
so that
Pu~v = (〈[2; 3], [4; 1]〉 / ‖[4; 1]‖²) [4; 1] = (11/17)[4; 1]. (15.20)
The scalar projection is just the signed/oriented length of the projection. It is positive if Pu~v
points in the same direction as ~u and it is negative if it points in the opposite direction.
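As a quick sanity check of formulas (15.15)–(15.18), here is a small Python sketch (purely illustrative; the function names are my own) that computes the scalar and vector projections for the ~u and ~v in the drawing.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def scalar_projection(v, u):
    # S_u(v) = <v, u/||u||>
    return dot(v, u) / norm(u)

def vector_projection(v, u):
    # P_u(v) = (<v, u> / ||u||^2) u
    coeff = dot(v, u) / dot(u, u)
    return [coeff * ui for ui in u]

u, v = [4, 1], [2, 3]
print(vector_projection(v, u))   # (11/17) * [4, 1], as in (15.20)
print(scalar_projection(v, u))   # the signed length of that projection
# Projecting twice gives the same answer, illustrating P_u^2 = P_u:
print(vector_projection(vector_projection(v, u), u))
```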
Theorem 15.21. Let ~u be a nonzero vector in Rn. Then the functions
Su : Rn → R, ~v 7→ Su~v, (15.22)
and
Pu : Rn → Rn, ~v 7→ Pu~v, (15.23)
are linear transformations. Furthermore, Pu satisfies Pu² = Pu.
Proof. This can be proven algebraically, but there is a nice geometric proof. Similarity of triangles
(Picture: the similar triangles formed by ~v, c~v and their projections Pu~v, Pu(c~v) onto ~u.)
proves that Pu(c~v) = cPu~v while geometry involving parallel lines
(Picture: the parallelogram formed by ~v, ~w, and ~v + ~w, together with their projections Pu~v, Pu~w, and Pu(~v + ~w) onto ~u.)
proves that Pu(~v+ ~w) = Pu~v+Pu ~w. The fact that P 2u = Pu follows immediately from the geometric
picture of the definition: once you have projected onto ~u, you can’t project anywhere else. The
algebraic proof of this is less enlightening:
Pu²(~v) = Pu(Pu~v) = Pu(〈~v, u〉u) = 〈~v, u〉Pu u = 〈~v, u〉〈u, u〉u = 〈~v, u〉u = Pu~v (using 〈u, u〉 = 1) (15.24)
for all vectors ~v ∈ Rn. Hence, P 2u = Pu.
In other words, the projection onto a vector is a linear transformation. In fact, the signed
length of the projection is also a linear transformation. There are many other kinds of projections
besides projections onto single vectors. One could imagine projecting onto a plane (such as when
you shine a flashlight onto a figure and it makes a shadow on the wall). In higher dimensions,
there are several ways to project onto different dimensional planes. To discuss these situations, we
need to discuss when vectors are perpendicular.
Definition 15.25. Two vectors ~u and ~v are orthogonal/perpendicular whenever 〈~v, ~u〉 = 0.
In terms of the definition of angle, this says that the angle between ~v and ~u is an odd integer multiple of π/2, as we would expect from our common-day use of the word orthogonal/perpendicular.
Definition 15.26. Let ~v be a nonzero vector in Rn and let L := span{~v}. The orthogonal complement of L is
L⊥ := {~u ∈ Rn : 〈~v, ~u〉 = 0}, (15.27)
i.e. the set of vectors ~u in Rn that are orthogonal to ~v.
Example 15.28. Consider the vector
~v := [1; 2] (15.29)
in R² and let L := span{~v}. Then
L⊥ = {a R_{π/2}(~v) : a ∈ R}, (15.30)
where R_{π/2} is the 2 × 2 matrix describing rotation by π/2, is the set of all scalar multiples of the vector ~v after it has been rotated by 90°. More explicitly,
L⊥ = span{[−2; 1]}. (15.31)
(Picture: the line L through ~v and the perpendicular line L⊥.)
Example 15.32. Consider the vector
~v := [3; 2; 1] (15.33)
in R³ and let L := span{~v}. Then
L⊥ = {[x; y; z] ∈ R³ : 〈[3; 2; 1], [x; y; z]〉 = 0} = {[x; y; z] ∈ R³ : 3x + 2y + z = 0}, (15.34)
which describes the plane z = −3x − 2y. In other words,
L⊥ = {x[1; 0; −3] + y[0; 1; −2] ∈ R³ : x, y ∈ R} = span{[1; 0; −3], [0; 1; −2]}. (15.35)
More generally, we can define the orthogonal complement of any subspace.
Definition 15.36. Let W be a k-dimensional subspace of Rn. The orthogonal complement of W
is
W⊥ := {~v ∈ Rn : 〈~v, ~w〉 = 0 for all ~w ∈ W}, (15.37)
i.e. the set of vectors ~v in Rn that are orthogonal to all vectors ~w in W.
Example 15.38. The orthogonal complement of the plane P described by the equation z = −3x − 2y is precisely the one-dimensional subspace L from Example 15.32 because
P⊥ = {[v1; v2; v3] ∈ R³ : 〈[v1; v2; v3], x[1; 0; −3] + y[0; 1; −2]〉 = 0 for all x, y ∈ R}
   = {[v1; v2; v3] ∈ R³ : v1x + v2y − (3x + 2y)v3 = 0 for all x, y ∈ R}
   = {[v1; v2; v3] ∈ R³ : x(v1 − 3v3) + y(v2 − 2v3) = 0 for all x, y ∈ R}
   = {[v1; v2; v3] ∈ R³ : v1 = 3v3 and v2 = 2v3}
   = span{[3; 2; 1]}. (15.39)
If we have a basis for a subspace, we can use that basis to find the orthogonal complement.
Theorem 15.40. Let W ⊆ Rn be a k-dimensional subspace and let ~w1, . . . , ~wk be a basis of W.
Then W⊥ is the solution set to the homogeneous system (of k equations in n unknowns)
[~w1 · · · ~wk]^T ~v = ~0. (15.41)
Proof. We first prove that
W⊥ = {~v ∈ Rn : 〈~v, ~w1〉 = 0, . . . , 〈~v, ~wk〉 = 0}. (15.42)
Let ~v ∈ W⊥. Then, by definition, 〈~v, ~w〉 = 0 for all ~w ∈ W. In particular, 〈~v, ~wi〉 = 0 for all
i ∈ 1, . . . , k since ~wi ∈ W for all i ∈ 1, . . . , k. Hence
W⊥ ⊆ {~v ∈ Rn : 〈~v, ~w1〉 = 0, . . . , 〈~v, ~wk〉 = 0}. (15.43)
Now suppose ~v satisfies 〈~v, ~wi〉 = 0 for all i ∈ 1, . . . , k. Let ~w ∈ W. Since ~w1, . . . , ~wk is a basis
of W, there exist coefficients x1, . . . , xk such that ~w = x1 ~w1 + · · ·+ xk ~wk. Hence,
〈~v, ~w〉 = 〈~v, x1 ~w1 + · · ·+ xk ~wk〉 = x1〈~v, ~w1〉+ · · ·+ xk〈~v, ~wk〉 = 0 (15.44)
proving that ~v ∈ W⊥. Hence,
W⊥ ⊇ {~v ∈ Rn : 〈~v, ~w1〉 = 0, . . . , 〈~v, ~wk〉 = 0}. (15.45)
This proves our first claim. Now, notice that the conditions 〈~v, ~wi〉 = 0 for all i ∈ 1, . . . , k is
equivalent to
[~w1 · · · ~wk]^T ~v = ~0 (15.46)
since
[~w1 · · · ~wk]^T ~v = [〈~w1, ~v〉; ... ; 〈~wk, ~v〉]. (15.47)
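So computing W⊥ amounts to solving a homogeneous system, which we already know how to do by row reduction. If you want to check an answer, here is a hedged sketch using sympy (the nullspace method belongs to sympy, not to anything in [Lay]); the matrix is the one coming from Example 15.38.

```python
from sympy import Matrix

# Rows are a basis of W (the plane from Example 15.38),
# so W-perp is the null space of this matrix.
B = Matrix([[1, 0, -3],
            [0, 1, -2]])

for vec in B.nullspace():
    print(vec)   # a basis vector for W-perp; here a multiple of (3, 2, 1)
```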
The phenomenon of taking the orthogonal complement twice to get back what you started (this
happened in Example 15.38) is true in general, but we will need an important result to prove it. A
similar question we might ask, which is very intuitive, is the following. Given a subspace W ⊆ Rn,
is there a linear transformation that acts as a projection onto W? If so, how can one express this
linear transformation as a matrix? Visually, the projection of ~v onto W should be a vector PW~v
that satisfies the condition of being the closest vector to ~v inside W, i.e.
‖~v − PW~v‖ = min_{~w ∈ W} ‖~v − ~w‖. (15.48)
(Picture: the subspace W, its orthogonal complement W⊥, a vector ~v, and its projection PW~v onto W.)
In the process of answering this question, we will prove that every subspace has an orthonormal
basis.
Exercise 15.49. If Rm S←− Rn is a linear transformation, prove that
〈~w, S~v〉 = 〈ST ~w,~v〉 (15.50)
for all vectors ~w ∈ Rm and ~v ∈ Rn. Furthermore, when m = n, show that S = ST if and only if
〈~w, S~v〉 = 〈S ~w,~v〉 for all vectors ~w ∈ Rm and ~v ∈ Rn. This gives some geometric meaning to the
transpose. [Hint: write out what the inner product is in terms of matrices and the transpose.]
Recommended Exercises. Please check HuskyCT for the homework. Be able to show all your
work, step by step! Do not use calculators or computer programs to solve any problems!
In this lecture, we covered Sections 6.1 and 6.2.
16 The Gram-Schmidt procedure
From the examples previously, we noticed some interesting properties of sets of orthogonal vectors
when studying the orthogonal complement.
Theorem 16.1. Let S := {~v1, ~v2, . . . , ~vk} be an orthogonal set of nonzero vectors in Rn. Then S is linearly independent. In particular, S is a basis for span(S).
Proof. Let x1, . . . , xk be coefficients such that
x1~v1 + · · ·+ xk~vk = ~0. (16.2)
The goal is to show that x1 = · · · = xk = 0. To see this, take the inner product of both sides of
the above equation with the vector ~vi for some i ∈ 1, . . . , k. This gives
x1〈~vi, ~v1〉+ · · ·+ xk〈~vi, ~vk〉 = 〈~vi,~0〉 = 0. (16.3)
Because S is orthogonal, the only term on the left that can be nonzero is xi〈~vi, ~vi〉 = xi‖~vi‖². Since ~vi ≠ ~0, we have ‖~vi‖² > 0. Therefore,
xi‖~vi‖² = 0 ⇒ xi = 0. (16.4)
Since i ∈ 1, . . . , k was arbitrary, x1 = · · · = xk = 0, which shows that S is linearly independent.
Given an orthogonal set {~v1, ~v2, . . . , ~vk} of nonzero vectors in Rn, the set
{~v1/‖~v1‖, ~v2/‖~v2‖, . . . , ~vk/‖~vk‖} (16.5)
is an orthonormal set. An orthonormal set of vectors in Rn has many similar properties to the
vectors ~e1, ~e2, · · · , ~ek in Rn, where k ≤ n. One such property is described in the following
theorem.
Theorem 16.6. Let S := {~u1, ~u2, . . . , ~uk} be an orthonormal set in Rn, and let ~v be a vector in span(S). Then
~v = 〈~v, ~u1〉~u1 + 〈~v, ~u2〉~u2 + · · · + 〈~v, ~uk〉~uk. (16.7)
Furthermore, since S is a basis for span(S), these linear combinations are unique.
The infinite-dimensional generalization of this theorem is what makes Fourier series work. We
might discuss this later in the course.
Proof. Since S is a basis (because S are linearly independent and they span span(S) by definition),
there exist unique coefficients x1, . . . , xk ∈ R such that
~v = x1~u1 + · · ·+ xk~uk. (16.8)
Fix some i ∈ 1, . . . , k. Applying the inner product with ~ui to both sides gives
〈~ui, ~v〉 = xi〈~ui, ~ui〉 = xi, (16.9)
which shows that xi = 〈~ui, ~v〉 = 〈~v, ~ui〉. Since i ∈ 1, . . . , k was arbitrary, the above decomposition
holds.
138
You can see from the proof, specifically (16.9), how crucial it was that the vectors in S were
not just a basis but an orthonormal basis. What happens if ~v is not in the subspace spanned by an
orthonormal set? Given an orthonormal set of vectors, what is the linear transformation describing
the projection onto the subspace spanned by these vectors? The following theorem answers these
questions.
Theorem 16.10. Let S := {~u1, ~u2, . . . , ~uk} be an orthonormal set of vectors in Rn, and let ~v be a vector in Rn. Then there exists a unique decomposition
~v = 〈~v, ~u1〉~u1 + 〈~v, ~u2〉~u2 + · · · + 〈~v, ~uk〉~uk + ~v⊥, (16.11)
where
~v⊥ := ~v − (〈~v, ~u1〉~u1 + 〈~v, ~u2〉~u2 + · · · + 〈~v, ~uk〉~uk) (16.12)
is in span(S)⊥. In fact, the transformation Rn PW←−− Rn defined by sending ~v in Rn to
PW(~v) := 〈~v, ~u1〉~u1 + 〈~v, ~u2〉~u2 + · · · + 〈~v, ~uk〉~uk (16.13)
is a linear transformation satisfying
PW² = PW (16.14)
and
PW^T = PW. (16.15)
Proof. The first thing we check is that ~v⊥ is in span(S)⊥. For this, notice that
〈~v⊥, ~ui〉 = 〈~v − (〈~v, ~u1〉~u1 + · · · + 〈~v, ~uk〉~uk), ~ui〉
          = 〈~v, ~ui〉 − Σ_{j=1}^{k} 〈~v, ~uj〉〈~uj, ~ui〉
          = 〈~v, ~ui〉 − Σ_{j=1}^{k} 〈~v, ~uj〉δji
          = 〈~v, ~ui〉 − 〈~v, ~ui〉
          = 0. (16.16)
Since this is true for every i = 1, 2, . . . , k, it follows that 〈~v⊥, ~u〉 = 0 for every vector ~u in span(S), since every such vector is expressed as a linear combination of the elements of S. Hence, ~v⊥ is in span(S)⊥.
The second thing we check is that P 2W = PW (it is not difficult to see that PW is in fact a linear
transformation). For this, let ~v be a vector in Rn; calculating PW²(~v) gives
PW²(~v) = PW( Σ_{i=1}^{k} 〈~v, ~ui〉~ui )
        = Σ_{i=1}^{k} 〈~v, ~ui〉 PW(~ui)
        = Σ_{i=1}^{k} 〈~v, ~ui〉 Σ_{j=1}^{k} 〈~ui, ~uj〉~uj
        = Σ_{i=1}^{k} 〈~v, ~ui〉 Σ_{j=1}^{k} δij ~uj
        = Σ_{i=1}^{k} 〈~v, ~ui〉~ui
        = PW(~v), (16.17)
as needed.
The final thing to check is PW^T = PW. By Exercise 15.49, it suffices to show that 〈PW~v, ~u〉 = 〈~v, PW~u〉 for all ~u, ~v ∈ Rn. This follows from
〈PW~v, ~u〉 = 〈 Σ_{i=1}^{k} 〈~v, ~ui〉~ui, ~u 〉
          = Σ_{i=1}^{k} 〈~v, ~ui〉〈~ui, ~u〉
          = Σ_{i=1}^{k} 〈~u, ~ui〉〈~v, ~ui〉
          = 〈 ~v, Σ_{i=1}^{k} 〈~u, ~ui〉~ui 〉
          = 〈~v, PW~u〉. (16.18)
The last conditions in Theorem 16.10 are what defines an orthogonal projection.
Definition 16.19. A linear transformation Rn S←− Rn is called a projection whenever S2 = S. S is
called an orthogonal projection whenever S2 = S and ST = S.
A nice way to think about projections is given at http://math.stackexchange.com/questions/1303977/what-is-the-idea-behind-a-projection-operator-what-does-it-do. Are there projections that are not orthogonal projections? We will answer this shortly, though you should be
able to find counter-examples at this point. Given an arbitrary subspace, we should be able to
define the orthogonal projection onto that subspace. However, we have only been able to construct
140
such a projection when we have an orthonormal basis for such a subspace. Given an arbitrary sub-
space W, how do we know if a orthonormal set spanning W even exists? We know a basis always
exists, but a basis is quite different from an orthonormal set (not every basis is an orthonormal
set but every orthonormal set is a basis for the space it spans). The Gram-Schmidt process, a
process we are about to describe, provides a construction of an orthonormal set for any subspace
of a finite-dimensional inner product space. Hence, not only does it prove the existence of an
orthonormal set for a subspace, it provides a step-by-step procedure for producing one!
Construction 16.20 (The Gram-Schmidt Procedure). Let W be a nonzero subspace of Rn.
Because W is finite-dimensional, there exists a basis B := {~w1, ~w2, . . . , ~wk}, with k ≥ 1, for W. Set
~u1 := ~w1/‖~w1‖ (16.21)
and let W1 := span(~w1) ≡ span(~u1). Set
~v2 := ~w2 − 〈~w2, ~u1〉~u1, (16.22)
i.e. the component of ~w2 perpendicular to W1. Normalize it by setting
~u2 := ~v2/‖~v2‖ (16.23)
and set W2 := span(~u1, ~u2). Proceed inductively. Namely, suppose that ~um and Wm have been constructed and 1 ≤ m < k. Then, set
~v_{m+1} := ~w_{m+1} − Σ_{i=1}^{m} 〈~w_{m+1}, ~ui〉~ui, (16.24)
i.e. the component of ~w_{m+1} perpendicular to Wm. Normalize it by setting
~u_{m+1} := ~v_{m+1}/‖~v_{m+1}‖ (16.25)
and set W_{m+1} := span(~u1, ~u2, . . . , ~u_{m+1}). After this construction terminates at step k (which must happen since k ≤ n is finite), the set S := {~u1, ~u2, . . . , ~uk} of vectors has been constructed along with
a sequence of subspaces W1, W2, . . . , Wk satisfying the following properties.
(a) The set S is orthonormal.
(b) Wk = W. In fact, span(S) = W.
The first claim follows from the construction. The second claim follows from the fact that S is a
basis of Wk containing k elements, but Wk is a subspace of W, which also has a basis containing
k elements, which implies W = Wk.
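Here is a compact Python sketch of the Gram-Schmidt procedure as described above (my own illustration; it assumes the input vectors are linearly independent, so no ~v_{m+1} is ever the zero vector). It is used below on the basis from Example 16.26.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(basis):
    # Turns a basis w_1, ..., w_k of a subspace W into an orthonormal basis u_1, ..., u_k.
    ortho = []
    for w in basis:
        # Subtract the projections onto the previously constructed u_i, as in (16.24).
        v = list(w)
        for u in ortho:
            coeff = dot(w, u)
            v = [vi - coeff * ui for vi, ui in zip(v, u)]
        length = math.sqrt(dot(v, v))   # nonzero because the w's are linearly independent
        ortho.append([vi / length for vi in v])
    return ortho

# The basis of the plane 3x + 2y + z = 0 used in Example 16.26 below.
print(gram_schmidt([[1, 0, -3], [0, 1, -2]]))
```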
Example 16.26. Going back to Example 15.32, a quick calculation shows that the basis
B := { ~w1 := [1; 0; −3], ~w2 := [0; 1; −2] } (16.27)
for the plane described by the solutions to 3x + 2y + z = 0 is not orthogonal.
(Picture: the plane 3x + 2y + z = 0 through the points (−1, −1, 5), (−1, 1, 1), (1, 1, −5), (1, −1, −1), with the basis vectors ~w1, ~w2 and the normal vector (3, 2, 1).)
It does not look like it, but the vector (3, 2, 1) is orthogonal to the plane in the above figure.
Applying the Gram-Schmidt procedure, the vector ~u1 is
~u1 := (1/√10)[1; 0; −3] (16.28)
and ~v2 is
~v2 := ~w2 − 〈~w2, ~u1〉~u1
    = [0; 1; −2] − 〈[0; 1; −2], (1/√10)[1; 0; −3]〉 (1/√10)[1; 0; −3]
    = [0; 1; −2] − (3/5)[1; 0; −3]
    = (1/5)[−3; 5; −1]. (16.29)
Thus, the unit vector in this direction is given by
~u2 := ~v2/‖~v2‖ = (1/√35)[−3; 5; −1]. (16.30)
(Picture: the orthonormal vectors ~u1 and ~u2 lying in the plane.)
Now consider the vector
~v := [1; 2; −2]. (16.31)
The projection of this vector onto the plane W spanned by B is given by
PW(~v) = 〈~v, ~u1〉~u1 + 〈~v, ~u2〉~u2
      = 〈[1; 2; −2], (1/√10)[1; 0; −3]〉 (1/√10)[1; 0; −3] + 〈[1; 2; −2], (1/√35)[−3; 5; −1]〉 (1/√35)[−3; 5; −1]
      = (7/10)[1; 0; −3] + (9/35)[−3; 5; −1]
      = (1/14)[−1; 18; −33]. (16.32)
(Picture: the vector ~v and its projection PW(~v) onto the plane W.)
We can calculate the matrix form of PW in at least two ways. One way is to calculate PW~e1, PW~e2,
and PW~e3. The other way is to use the relationship between inner products and the transpose. In
143
our case, since W is spanned by the two orthonormal vectors ~u1 and ~u2, this is given by
PW = ~u1~u1^T + ~u2~u2^T
   = (1/10)[1; 0; −3][1 0 −3] + (1/35)[−3; 5; −1][−3 5 −1]
   = (1/10)[1 0 −3; 0 0 0; −3 0 9] + (1/35)[9 −15 3; −15 25 −5; 3 −5 1]
   = (1/14)[5 −6 −3; −6 10 −2; −3 −2 13]. (16.33)
From this expression, you can see that PW^T = PW. It is a bit more cumbersome to calculate PW², but it can be done and one sees that PW² = PW. However, we already know that we do not have to do this because we have constructed PW in such a way that it satisfies the conditions of an orthogonal projection.
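As a check of (16.33), here is a short numpy sketch (illustrative only) that builds PW = ~u1~u1^T + ~u2~u2^T and verifies the two defining properties of an orthogonal projection.

```python
import numpy as np

u1 = np.array([1, 0, -3]) / np.sqrt(10)
u2 = np.array([-3, 5, -1]) / np.sqrt(35)

P = np.outer(u1, u1) + np.outer(u2, u2)   # P_W = u1 u1^T + u2 u2^T

print(np.allclose(P, P.T))        # True: P_W is symmetric
print(np.allclose(P @ P, P))      # True: P_W is idempotent
print(P @ np.array([1, 2, -2]))   # (1/14) * (-1, 18, -33), matching (16.32)
print(14 * P)                     # 14 times the matrix in (16.33)
```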
Why did we need an orthonormal basis to construct an orthogonal projection onto a subspace?
Let us look at what happens if we do not use an orthonormal basis but instead just use a basis of
normalized vectors.
Example 16.34. Let
~u1 := [1; 0; 0] & ~u2 := (1/√2)[1; 1; 0] (16.35)
and let W be the plane spanned by these two vectors (this is just the xy plane). We can define the linear transformation R3 TW←−− R3 by
R³ ∋ ~v 7→ TW~v := 〈~u1, ~v〉~u1 + 〈~u2, ~v〉~u2 (16.36)
exactly as we did for an orthonormal set of vectors. Again, we can calculate the matrix associated to TW via finding its columns from TW~ei or we can use the transpose method:
TW = ~u1~u1^T + ~u2~u2^T = (1/2)[3 1 0; 1 1 0; 0 0 0]. (16.37)
The range/image/column space of TW is exactly W. However, even though TW^T = TW,
TW² = (1/2)[5 2 0; 2 1 0; 0 0 0] ≠ TW. (16.38)
Therefore, it is not a projection. An intuitive way to see why TW isn’t a projection is to look at
what happens to ~e1 and ~e2. Under a projection, these vectors must be fixed since they already
lie in the plane W (this is the meaning of the equation T 2W = TW and is why it is such a crucial
condition for the definition of a projection). Under these transformations
TW~e1 = (3/2)~e1 + (1/2)~e2 & TW~e2 = (1/2)~e1 + (1/2)~e2, (16.39)
neither of which are fixed.
A projection onto a plane in R3 can be visualized as shining a flashlight onto that plane from
any angle. If the angle makes a right angle with the plane, then the projection is an orthogo-
nal projection. However, there are many other possible projections besides the orthogonal one.
Consider, for example, shining the flashlight at a 45° angle with respect to the normal of a plane.
Exercise 16.40. Let W = span{~e1, ~e2} in R³. Express the matrix A associated to the linear transformation that is obtained from shining a flashlight at a 45° angle with respect to the ~e3 axis and pointing in the direction of ~e1, i.e. the flashlight points in the direction
(1/√2)(~e1 − ~e3). (16.41)
(Picture: a vector ~v above the plane W and its shadow A~v in W, cast at a 45° angle.)
Theorem 16.42. Let W be a subspace of Rn. Then for every vector ~v in Rn, there exist unique
vectors ~w in W and ~u in W⊥ satisfying ~v = ~w + ~u. In fact, setting PW to be the orthogonal
projection onto W, this decomposition is written as
~v = PW~v + (~v − PW~v) . (16.43)
Furthermore, PW~v minimizes the distance between the vector ~v and all vectors in W.
Proof. Let S be an orthonormal set for W, the existence of which is guaranteed by the Gram-
Schmidt procedure. Then apply Theorem 16.10 for uniqueness. To see that PW~v minimizes the
distance from ~v to W, let ~w be another vector in W that is not equal to PW~v. Then the vectors
~v − PW~v and PW~v − ~w are orthogonal because ~v − PW~v ∈ W⊥ and PW~v − ~w ∈ W.
(Picture: the subspace W with the vectors PW~v and ~w in it, the complement W⊥, and the vector ~v.)
Hence, by the Pythagorean theorem,
‖~v − ~w‖² = ‖~v − PW~v + PW~v − ~w‖² = ‖~v − PW~v‖² + ‖PW~v − ~w‖² > ‖~v − PW~v‖², (16.44)
which shows that PW~v minimizes the distance from ~v to W.
Theorem 16.45. Let W be a k-dimensional subspace of Rn. Then the orthogonal complement
W⊥ is an (n − k)-dimensional subspace of Rn. Furthermore, the orthogonal complement of the
orthogonal complement is the original subspace itself, i.e.
(W⊥)⊥ = W. (16.46)
Proof. To prove this, it must be shown that (W⊥)⊥ ⊆ W and (W⊥)⊥ ⊇ W.
Let us begin with the second claim first. Let ~w ∈ W. The goal is to show that 〈~w, ~u〉 = 0 for
all ~u ∈ W⊥. By definition of W⊥, this means that 〈~u, ~w′〉 = 0 for all ~w′ ∈ W. Since ~w ∈ W, this
implies 〈~u, ~w〉 = 0 so that ~w ∈ (W⊥)⊥.
Let ~v ∈ (W⊥)⊥. The goal is to show that ~v ∈ W. By Theorem 16.42, there exists unique vectors
~w ∈ W and ~u ∈ W⊥ such that ~v = ~w + ~u. Taking the inner product of both sides with ~u gives
〈~u, ~v〉 = 〈~u, ~w〉 + 〈~u, ~u〉. (16.47)
The left-hand side is 0 since ~v ∈ (W⊥)⊥ and ~u ∈ W⊥, and 〈~u, ~w〉 = 0 since ~u ∈ W⊥ and ~w ∈ W, so 〈~u, ~u〉 = 0.
This implies ~u = ~0. Hence, ~v ∈ W.
Recommended Exercises. Please check HuskyCT for the homework. Be able to show all your
work, step by step! Do not use calculators or computer programs to solve any problems!
In this lecture, we covered Sections 6.3, 6.4, and parts of 6.7.
17 Least squares approximation
If Rm A←− Rn is a linear transformation and ~b ∈ Rm is a vector, a solution to A~x = ~b exists if
and only if the system is consistent. In some applications, a system might be over-constrained,
meaning that there are many more equations than unknowns. This corresponds to A having more
rows than columns. In this situation, it is often the case that A~x = ~b will only be consistent for
very special values of ~b. However, given what we have learned, it should now be possible to find
the best approximation to such a system
(Picture: the image W := image(A), its orthogonal complement W⊥, the vector ~b, and its projection PW~b onto W.)
by projecting ~b onto the image of A. In this case, the best approximation solutions to A~x = ~b are
the solutions to A~x = PW~b. The latter is always consistent since W is precisely the image (column
space) of A. The reason this is the best approximation to A~x = ~b follows from Theorem 16.42.
Definition 17.1. Let Rm A←− Rn be a linear transformation and let ~b ∈ Rm. A least-squares
approximation to A~x = ~b is a solution to A~x = PW~b, where W := image(A).
It takes a bit of work to calculate the projection onto a subspace. Fortunately, because the
subspace is the image of a linear transformation, a drastic simplification can be made to the above
definition.
Theorem 17.2. Let Rm A←− Rn be a linear transformation and let ~b ∈ Rm. ~x ∈ Rn is a least-squares
approximation to A~x = ~b if and only if ~x is a solution to ATA~x = AT~b.
Proof. For this proof, set W := image(A).
(⇒) Let ~x ∈ Rn be a solution to A~x = PW~b. Since ~b− PW~b ∈ W⊥,
〈A~ek,~b− PW~b〉 = 0 (17.3)
for all k ∈ 1, . . . , n. By the relationship between the inner product and the transpose of vectors,
this equation says
(A~ek)T (~b− PW~b) = 0 (17.4)
for all k ∈ 1, . . . , n. Hence,
AT (~b− PW~b) = ~0. (17.5)
147
By assumption, PW~b = A~x so that this says
AT (~b− A~x) = 0 ⇒ ATA~x = AT~b. (17.6)
(⇐) Let ~x ∈ Rn be a solution to ATA~x = AT~b. Then AT (~b− A~x) = ~0 implies that ~b− A~x ∈ W⊥.
(Picture: the image W := image(A), its orthogonal complement W⊥, the vector ~b, the vector A~x in W, and the difference ~b − A~x.)
By Theorem 16.42, ~b can be expressed uniquely as ~b = ~w + ~u with ~w ∈ W and ~u ∈ W⊥, but by
the above calculations~b = A~x+ (~b− A~x) (17.7)
is such a decomposition. By uniqueness, this means that A~x = PW~b.
Problem 17.8. In her introductory physics lab, Joanna drops a ball from several different heights
and counts how long it takes for the ball to reach the ground. She does several measurements
for each height in increments of 50 centimeters from 0 to 2 meters. She then takes the average
of her measurements for each height. Her results are listed in the following table without their
uncertainties.
Height (meters) 0.5 1.0 1.5 2.0 2.5 3.0
Time (seconds) 0.38 0.49 0.55 0.66 0.74 0.76
She expects the relationship between the height, x, and time, t, to be quadratic and of the form
x(t) = (1/2)gt², where g is a constant to be determined from these data. What is the best approxi-
mation to g?
Answer. Squaring the times gives the following table
Height (meters) 0.5 1.0 1.5 2.0 2.5 3.0
Time2 (seconds2) 0.14 0.24 0.30 0.44 0.55 0.58
It therefore suffices to find the slope m so that the line x = mt² best fits these data. Let A be the
6× 1 matrix for the input data, the square of time and let ~b be the vector for the output data, the
148
height. In this case, A~x = ~b takes the form
[0.14; 0.24; 0.30; 0.44; 0.55; 0.58][m] = [0.5; 1.0; 1.5; 2.0; 2.5; 3.0] (17.9)
and the goal is to solve for the best approximation to m. Here ATA = [0.9997] and AT~b = [4.755], so the normal equation ATA~x = AT~b reads
0.9997m = 4.755 ⇒ m = 4.76 ⇒ g = 2m = 9.51. (17.10)
The actual value of g is known to be 9.81 in Joanna’s location. The data points, her best fit curve,
and the actual curve are depicted in the following figure.
(Plot: the data points, Joanna's best fit curve, and the actual curve, with x in meters plotted against t in seconds.)
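Here is a small numpy sketch (again just an illustration, not the way the homework should be done) that sets up and solves the normal equations ATA~x = AT~b for Joanna's data.

```python
import numpy as np

t_squared = np.array([0.14, 0.24, 0.30, 0.44, 0.55, 0.58])  # input data (seconds^2)
height    = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])         # output data (meters)

A = t_squared.reshape(-1, 1)     # the 6 x 1 matrix from (17.9)
b = height

# Solve the normal equations A^T A x = A^T b.
m = np.linalg.solve(A.T @ A, A.T @ b)[0]
print(m)        # approximately 4.76
print(2 * m)    # the estimate of g, approximately 9.51
```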
Problem 17.11. The Michaelis-Menten equation [7] is used as a model in biochemistry to describe
the reaction rate, v, of an enzymatic reaction to the concentration, [S], of the substrate. The
formula is given by
v = vmax[S] / (KM + [S]), (17.12)
where vmax is a constant that describes the maximum possible achieved reaction rate and KM is
another constant that describes the substrate concentration when the reaction rate is half of vmax.
My friend did such an experiment and here is his data45
[S] (millimolar) 0.900 0.675 0.450 0.225 0.090 0.045 0.0225
v (micromoles per min) 0.210 0.200 0.167 0.120 0.053 0.031 0.017
Find the coefficients vmax and KM that best fit these data.
Answer. The Michaelis-Menten equation can be flipped so that it is of the form
1/v = (KM + [S]) / (vmax[S]) = 1/vmax + (KM/vmax)(1/[S]). (17.13)
In terms of the variable 1/[S] instead of [S], this is a linear equation of the form y = mx + b, where
m and b are the unknowns. Hence, we can apply the least-squares method. For this, we translate
our data in terms of these reciprocated variables.
45I’d like to thank David Lei for sharing his data.
1/[S] 1.11 1.48 2.22 4.44 11.1 22.2 44.4
1/v 4.76 5.00 5.99 8.33 18.9 32.3 58.8
Therefore, set
A := [1 1.11; 1 1.48; 1 2.22; 1 4.44; 1 11.1; 1 22.2; 1 44.4], ~b := [4.76; 5.00; 5.99; 8.33; 18.9; 32.3; 58.8], & ~x := [1/vmax; KM/vmax]. (17.14)
Solving ATA~x = AT~b would give us the coefficients, which we can then use to find the best fit curve. This system becomes
[7.00 87.0; 87.0 2620][1/vmax; KM/vmax] = [134; 3600]. (17.15)
Row reduction then gives
[1/vmax; KM/vmax] = [3.50; 1.26]. (17.16)
Solving for vmax and KM gives
vmax = 0.286 & KM = 0.360. (17.17)
Plotting the inverse relation gives the following curve
(Plot: 1/v plotted against 1/[S], showing the data points and the best fit line.)
Plotting the best fit curve
v([S]) = 0.286[S] / (0.360 + [S]) (17.18)
for the original Michaelis-Menten equation gives
(Plot: v plotted against [S], showing the data points, the best fit curve, and the actual curve.)
The “actual” plot is actually the best fit curve obtained through a different method that is more
standard for this specific kind of equation.46 Nevertheless, the least-squares method provides a
reasonable approximation.
In the previous examples, one could fit data to any curve provided that one is using linear
combinations of functions that are linearly independent. This will make more sense when we
discuss vector spaces and how certain spaces of functions are vector spaces, but for now we can
provide a specific, yet somewhat imprecise, definition.
Definition 17.19. A set of functions {f1, . . . , fk} on some common domain in R is said to be linearly independent iff
a1f1 + · · · + akfk = 0 ⇒ a1 = · · · = ak = 0, (17.20)
i.e.
a1f1(x) + · · · + akfk(x) = 0 for all x in the domain ⇒ a1 = · · · = ak = 0. (17.21)
In this notation a1, . . . , ak are just some coefficients and the expression a1f1 + · · · + akfk is a linear combination of the functions in the set {f1, . . . , fk}.
If you have data that must fit to some curve of the form
a1f1 + · · · + akfk, (17.22)
where {f1, . . . , fk} is some set of linearly independent functions, then your goal is to find the coefficients a1, . . . , ak so that the function a1f1 + · · · + akfk best fits your data. If your data inputs are x1, . . . , xd and your data outputs are y1, . . . , yd, then your matrix A is given by
A := [f1(x1) · · · fk(x1); ... ; f1(xd) · · · fk(xd)] (17.23)
and the vector ~b is
~b := [y1; ... ; yd]. (17.24)
In the previous two examples, the functions are given as follows. For the ball being dropped
from a height, there is actually only one function and it is given by f(t) = t², and the coefficient is g/2. For the Michaelis-Menten equation, after taking the reciprocal, there are two functions. The first one is f1([S]) = 1, which is just a constant, and the second one is f2([S]) = 1/[S]. Taking the reciprocal was important because this allowed us to express 1/v as a linear combination of these two functions, namely
1/v = (1/vmax) f1 + (KM/vmax) f2. (17.25)
46The reason the actual fit is used is because most of the data points are clustered near small values of 1/[S]
as opposed to being distributed somewhat evenly. This means that the least-squares method we are using is not
as accurate due to the lack of data for larger values of 1/[S]. The original data is much more evenly distributed in
terms of [S].
Recommended Exercises. Please check HuskyCT for the homework. Be able to show all your
work, step by step! Do not use calculators or computer programs to solve any problems!
Decision making and support vector machines
We now move on to a different application of orthogonality in the context of machine learning and
artificial intelligence.47 The setup is that one has a large range of data X := ~x1, . . . , ~xd described
by vectors in Rn and these data separate into two types, X+ and X−. If a new data point ~xd+1 is
provided, the machine must then decide to place this new data point in X+ or X−.
(Picture: data points labeled − and +, together with a new unlabeled data point drawn as a bullet •.)
The new data point is drawn as a bullet •. To make this decision, the machine must draw a
hyperplane, an (n − 1)-dimensional linear manifold, that divides Rn into two parts in the most
optimal way. Different hyperplanes will give different answers.
(Picture: the same data with two different separating hyperplanes drawn, which classify the new point • differently.)
We would like to therefore establish a convention for a unique such hyperplane that is also the
most optimal one to allow for the most accurate identification. How do we define the most optimal
hyperplane? We will define a separating hyperplane and then define optimality, but first there are
a few facts we should establish.
Lemma 17.26. Let H ⊆ Rn be a hyperplane. Then there exists a vector ~w ∈ Rn \ ~0 and a real
number c ∈ R such that H is the solution set of 〈~w, ~x〉 − c = 0.
47I’d like to thank Benjamin Russo for helpful discussions on this topic.
Proof. Let ~h ∈ H. Then H − ~h is an (n − 1)-dimensional subspace in Rn. Hence, (H − ~h)⊥ is a
one-dimensional subspace spanned by some normalized vector u. Because spanu is perpendicular
to H, there exists an a ∈ R such that au ∈ H. Then, H is the solution set of 〈u, ~x〉 − a = 0.
Notice that the vectors ~w and numbers c need not be unique. Indeed, we can multiply the
previous system by any non-zero real number λ to get 〈λu, ~x〉 − λa = 0. Furthermore, notice that
if H is the solution set of 〈~w, ~x〉 − c = 0 for some nonzero vector ~w and some number c ∈ R, then
the vector
~x = (c/‖~w‖) w = (c/‖~w‖²) ~w (17.27)
is in H. This is because
〈~w, (c/‖~w‖) w〉 − c = c〈w, w〉 − c = c − c = 0. (17.28)
This tells us that the orthogonal distance from the origin, the zero vector, to the hyperplane H is c/‖~w‖.
Definition 17.29. Let ~w ∈ Rn \ {~0} and c ∈ R with associated hyperplane H given by the solution set of 〈~w, ~x〉 − c = 0. The marginal planes H+ and H− associated to H are the solution sets to 〈~w, ~x〉 − c = 1 and 〈~w, ~x〉 − c = −1, respectively, i.e.
H± := {~x ∈ Rn : 〈~w, ~x〉 − c = ±1}. (17.30)
For example, in R², if
~w = (1/3)[3; 1] & c = 4, (17.31)
then these planes would look like the following.
(Picture: the parallel lines H−, H, and H+ in R² with the normal vector ~w.)
To check this, H is described by the linear system
(1/3)(3x + y) = 4 ⇒ y = 12 − 3x, (17.32)
H+ is described by
(1/3)(3x + y) = 5 ⇒ y = 15 − 3x, (17.33)
and H− is described by
(1/3)(3x + y) = 3 ⇒ y = 9 − 3x. (17.34)
Notice that the vector
(c/‖~w‖²) ~w = (6/5)[3; 1] (17.35)
lies on the plane H.
Lemma 17.36. Let (~w, c) ∈ (Rn \ {~0}) × R describe a hyperplane H. The perpendicular distance
between H and H+ is 1/‖~w‖, and similarly for the distance between H and H−.
Proof. Let ~x+ ∈ H+ and ~x ∈ H. The orthogonal distance between H and H+ is given by

〈~x+ − ~x, ~w/‖~w‖〉 = (1/‖~w‖)(〈~w, ~x+〉 − 〈~w, ~x〉) = (1/‖~w‖)((1 + c) − c) = 1/‖~w‖    (17.37)

[Figure: the lines H−, H, H+ with ~x ∈ H and ~x+ ∈ H+, and the difference ~x+ − ~x projected onto ŵ = ~w/‖~w‖.]

by the definition of H and H+ in terms of (~w, c). A similar calculation holds for H−.
In these two cases, notice that the vectors

~x± := (c/‖~w‖ ± 1/‖~w‖) ŵ    (17.38)

are vectors in H±. This is because

〈~w, (c/‖~w‖ ± 1/‖~w‖) ŵ〉 − c = (c/‖~w‖)〈~w, ŵ〉 ± (1/‖~w‖)〈~w, ŵ〉 − c = c ± 1 − c = ±1.    (17.39)
In the example we have been using, we have

~x− = ((c − 1)/‖~w‖²) ~w = (9/10)[3; 1],   ~x = (c/‖~w‖²) ~w = (6/5)[3; 1],   &   ~x+ = ((c + 1)/‖~w‖²) ~w = (3/2)[3; 1]    (17.40)

as the three vectors that lie in the span of ~w and lie on H−, H, and H+, respectively.

[Figure: the lines H−, H, H+ with the three points ~x−, ~x, ~x+ marked along the direction of ~w.]
Definition 17.41. Let (~w, c) ∈ (Rn \ {~0}) × R describe a hyperplane H. The convex region between
H+ and H− is called the margin of (~w, c). The orthogonal distance between H+ and H−, which is
given by 2/‖~w‖, is called the margin width of (~w, c).
Even though (~w, c) can be scaled to (λ~w, λc) to give the same H, notice that the marginal
planes are different. This is because the margin width has scaled by a factor of 1/λ. For example, if
we set λ = 2 in our example, the margin shrinks by a factor of 1/2.

[Figure: the original margin between H− and H+ together with the narrower margin between Hλ− and Hλ+ obtained from (λ~w, λc) with λ = 2.]
In this drawing, we've used the notation Hλ± to signify the resulting marginal planes for (λ~w, λc).
If instead we only scale ~w, but not c, to get (λ~w, c), then we change the position of the hyperplane
because the new equation that it is the solution set of is

〈λ~w, ~x〉 − c = 0   ⇐⇒   〈~w, ~x〉 − c/λ = 0.    (17.42)

Therefore, the hyperplane (λ~w, c) is equivalent to the hyperplane (~w, c/λ). However, their margins,
and hence their marginal planes, will be different. Therefore, think of the ~w in (~w, c) as determining
a direction as well as a margin width and think of c in (~w, c) as determining the position of the
central hyperplane. We make this relationship between (~w, c) and such triples of hyperplanes
formal in the following Lemma.
Lemma 17.43. Two parallel hyperplanes H− and H+ in Rn determine a unique (~w, c) ∈ (Rn \ {~0}) × R
whose marginal planes agree with H− and H+.
Proof. Let ~x+ ∈ H+ and pick a unit vector û ∈ (H+ − ~x+)⊥ such that if λû ∈ H− and µû ∈ H+, then λ < µ
(i.e. choose a unit normal vector û perpendicular to H+ that points from H− to H+). Also, let ~x− ∈ H−
(any choice of such vectors will work). The orthogonal separation between the planes H+ and H− is
given by 〈~x+ − ~x−, û〉.

[Figure: the parallel planes H− and H+ with ~x− ∈ H−, ~x+ ∈ H+, and the difference ~x+ − ~x− projected onto the unit normal û.]
Therefore, set

~w := (2/〈~x+ − ~x−, û〉) û.    (17.44)

Now, pick any ~x+ ∈ H+ and set

c := 〈~w, ~x+〉 − 1.    (17.45)

Then (~w, c) has H+ and H− as its marginal planes.
Exercise 17.46. Finish the proof by showing that (~w, c) has H+ and H− as its marginal planes,
i.e. show that H+ is the solution set to 〈~w, ~x〉−c = 1 and H− is the solution set to 〈~w, ~x〉−c = −1.
This result says that there is a 1-1 correspondence between the set of (ordered) pairs of parallel
hyperplanes and the set (Rn \ {~0}) × R.
Definition 17.47. Let (X, X+, X−) denote a non-empty set X of vectors in Rn that are separated
into the two (disjoint) non-empty sets X+ and X−. Such a collection of sets is called a training
data set. A hyperplane H ⊆ Rn, described by (~w, c) ∈ (Rn \ {~0}) × R, separates (X, X+, X−) iff

〈~w, ~x+〉 − c > 0   &   〈~w, ~x−〉 − c < 0    (17.48)

for all ~x+ ∈ X+ and for all ~x− ∈ X−. In this case, H is said to be a separating hyperplane for
(X, X+, X−). H marginally separates (X, X+, X−) iff

〈~w, ~x+〉 − c ≥ 1   &   〈~w, ~x−〉 − c ≤ −1    (17.49)

for all ~x+ ∈ X+ and for all ~x− ∈ X−. Let SX ⊆ (Rn \ {~0}) × R denote the set of hyperplanes that
marginally separate (X, X+, X−). Let f : SX → R be the function defined by

(Rn \ {~0}) × R ∋ (~w, c) ↦ f(~w, c) := 2/‖~w‖,    (17.50)

i.e. the margin width. A support vector machine (SVM) for (X, X+, X−) is a maximum of f, i.e. an SVM
is a pair (~w, c) ∈ SX such that 1/‖~w′‖ ≤ 1/‖~w‖ for every other pair (~w′, c′) ∈ SX.
Some examples of separating hyperplanes and marginally separating hyperplanes are depicted
in the following figures on the left and right, respectively.
[Figure: on the left, a separating hyperplane drawn through the labeled data; on the right, a marginally separating hyperplane drawn together with its two marginal planes.]
An SVM is a hyperplane that maximizes the margin, as in the following figure.
[Figure: the maximum-margin hyperplane H with its marginal planes H− and H+ touching the nearest − and + points.]
Because of this, it is useful to know when a given hyperplane that marginally separates a training
data set can be enlarged. This will be useful because then instead of looking at the set of all
marginally separating hyperplanes, we can focus our attention on those whose margins have been
maximized. Afterwards, we will maximize the margin over this resulting set.
Definition 17.51. Let (X, X+, X−) be a training data set and let (~w, c) be a marginally separating
hyperplane for this set. The elements of X ∩ H± are called support vectors for (~w, c). The set of
support vectors is denoted by H^supp_X. The notation H^supp_{X±} := H^supp_X ∩ H± will also be used to denote
the set of positive and negative support vectors.
In the following figures, the support vectors have been circled for two different marginal hy-
perplanes.
[Figure: two different marginally separating hyperplanes for the same training data, with the support vectors (the points lying on H− or H+) circled in each case.]
Lemma 17.52. Let (X, X+, X−) be a training data set and let (~w, c) be a marginally separating
hyperplane for this set. Then there exists a marginally separating hyperplane (~v, d) such that

v̂ = ŵ   &   2/‖~v‖ = min_{~x+∈X+, ~x−∈X−} 〈~x+ − ~x−, ŵ〉.    (17.53)
In other words, if the marginal planes do not contain any of the training data set, then the
separating hyperplane can be translated and the margin width can be enlarged until the margin
touches both positive and negative training data sets.
Proof. It will be convenient to define the function

X ∋ ~x ↦ θ(~x) := +1 if ~x ∈ X+,  −1 if ~x ∈ X−.    (17.54)

By the discussions after Lemma 17.26 and Lemma 17.36, we have vectors in each of H−, H, and H+
given by

((c − 1)/‖~w‖) ŵ ∈ H−,   (c/‖~w‖) ŵ ∈ H,   &   ((c + 1)/‖~w‖) ŵ ∈ H+.    (17.55)

Set m+ to be the remaining minimum orthogonal distance between H+ and X+ and set m− to be
the remaining minimum orthogonal distance between H− and X−, namely

m± := min_{~x±∈X±} θ(~x±)(〈~x±, ŵ〉 − (c ± 1)/‖~w‖).    (17.56)
[Figure: the original planes H−, H, H+ together with the enlarged planes K−, K, K+ whose margin touches the nearest − and + training points.]
Therefore, the planes K± containing the vectors

((c ± 1)/‖~w‖ ± m±) ŵ    (17.57)

that are perpendicular to ŵ intersect X± but do not contain points of X on the interior of their
margin. By Lemma 17.43, there exists a (~v, d) ∈ (Rn \ {~0}) × R that describes these marginal
separating hyperplanes, namely

~v := (2 / (2/‖~w‖ + m+ + m−)) ŵ    (17.58)

(since 2/‖~v‖ is now the margin width between the new marginal hyperplanes) and

d := 〈~v, ((c + 1)/‖~w‖ + m+) ŵ〉 − 1
   = 2((c + 1)/‖~w‖ + m+) / (2/‖~w‖ + m+ + m−) − 1
   = (2c + ‖~w‖(m+ − m−)) / (2 + ‖~w‖(m+ + m−))    (17.59)

(since this is the required number so that a vector on K+ satisfies the positive marginal plane
equation).
Exercise 17.60. Verify that (~v, d) in the above proof defines marginally separating hyperplanes
that are perpendicular to w. Furthermore, explain why they cannot be enlarged any farther.
Theorem 17.61. Let (X, X+, X−) be a training data set for which there exists a separating hyper-
plane for (X, X+, X−). Then there exists a unique SVM for (X, X+, X−).
Proof. By Lemma 17.52, it suffices to maximize the margin function f on the subset S^supp_X ⊆ SX
consisting of marginally separating hyperplanes that have both positive and negative support
vectors, namely on

S^supp_X := {(~w, c) ∈ SX : H± ∩ X± ≠ ∅}.    (17.62)
The goal is therefore to maximize the margin function, which is a function of (~w, c), subject to the
constraint

〈~w, ~x〉 − c ∓ 1 = 0    (17.63)

for all ~x ∈ H^supp_X, or equivalently

θ(~x)(〈~w, ~x〉 − c) − 1 = 0    (17.64)

for all ~x ∈ H^supp_X. Maximizing the margin function is equivalent to minimizing the function

(Rn \ {~0}) × R ∋ (~w, c) ↦ (1/2)‖~w‖²    (17.65)

subject to these same constraints. It is therefore equivalent to maximize the function g given by

(Rn \ {~0}) × R ∋ (~w, c) ↦ g(~w, c) := (1/2)‖~w‖² − Σ_{~x∈X} α~x (θ(~x)(〈~w, ~x〉 − c) − 1).    (17.66)
Here, α~x = 0 for all ~x ∈ X \ H^supp_X and α~x needs to be determined for all ~x ∈ H^supp_X. This condition
guarantees that the extra sum vanishes on S^supp_X, so that g agrees with the objective (17.65) there
(but notice that this is not true on the larger domain SX of all marginally separating hyperplanes).
The α~x are called Lagrange multipliers. The extrema of g occur at points (~w, c) for which the
derivative of g vanishes with respect to these coordinates48

∂g/∂~w |_(~w,c) = 0   &   ∂g/∂c |_(~w,c) = 0.    (17.67)

The first equation gives

~w = Σ_{~x∈X} α~x θ(~x) ~x,    (17.68)

which is the desired result, except that it has many unknown coefficients given by all of the
Lagrange multipliers. The second equation gives

Σ_{~x∈X} α~x θ(~x) = 0,    (17.69)

48Notice that it would not have made sense to take these derivatives if we had worked with the function f
constrained to S^supp_X. This is because to define the derivative we need to take a limit of nearby points, but if
(~w, c) ∈ S^supp_X, then it might not be true that (~w + ~ε, c + δ) is also in S^supp_X for arbitrarily small vectors ~ε and
arbitrarily small numbers δ.
which is a condition that the Lagrange multipliers have to satisfy. Plugging in these results back
into the function g gives

g(~w, c) = (1/2) ‖Σ_{~x∈X} α~x θ(~x) ~x‖² − Σ_{~x∈X} α~x [θ(~x)(〈Σ_{~y∈X} α~y θ(~y) ~y, ~x〉 − c) − 1]
        = (1/2) Σ_{~x,~y∈X} α~x α~y θ(~x) θ(~y) 〈~x, ~y〉 − Σ_{~x,~y∈X} α~x α~y θ(~x) θ(~y) 〈~y, ~x〉 + c Σ_{~x∈X} α~x θ(~x) + Σ_{~x∈X} α~x
        = Σ_{~x∈X} α~x − (1/2) Σ_{~x,~y∈X} α~x α~y θ(~x) θ(~y) 〈~x, ~y〉.    (17.70)
Notice that although we have not yet solved the full problem, the maximizer only depends on the
inner products between the vectors in the training data set. Setting (for the original function g)

∂g/∂α~x |_(~w,c) = 0    (17.71)

for each ~x ∈ H^supp_X will give additional conditions that the Lagrange multipliers have to satisfy.
This equation then reads

θ(~x)(〈~w, ~x〉 − c) = 1    (17.72)

for each ~x ∈ H^supp_X, and after plugging in the result for ~w, this gives

Σ_{~y∈X} α~y θ(~y) 〈~y, ~x〉 − θ(~x) = c    (17.73)
for each ~x ∈ H^supp_X. However, there is one subtle point, and that is that we do not know what
H^supp_X is. Nevertheless, there is still an optimization procedure left over, and it is based on the
different possible choices of H^supp_X. For each choice of H^supp_X, one has the linear system

Σ_{~y∈X} θ(~y) α~y = 0
Σ_{~y∈X} θ(~y) 〈~y, ~x〉 α~y − c = θ(~x)    (17.74)

in the variables {α~x : ~x ∈ H^supp_X} ∪ {c} obtained from equations (17.69) and (17.73). Notice that
the second equation in this linear system is actually a set of |H^supp_X| equations. Therefore, this
describes a linear system of |H^supp_X| + 1 equations (+1 because of the first equation) in |H^supp_X| + 1
variables (+1 because of the extra variable c). There are only a finite number of possible choices
of H^supp_X and therefore only a finite number of linear systems one needs to solve. These systems
are all consistent because we have assumed that the training data set can be separated. Hence, a
solution to the SVM problem exists.
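To see the linear system (17.74) in action, here is an optional Python sketch (again, no calculators or computer programs on homework or exams!) that sets up and solves (17.74) for a chosen set of support vectors and then assembles ~w from (17.68). The function name svm_from_support and the layout of the code are my own inventions for illustration; the sample data at the bottom is that of Problem 17.75 below.

```python
import numpy as np

def svm_from_support(support, labels):
    """Solve the linear system (17.74) for a chosen set of support vectors.

    support : list of vectors assumed to be the support vectors
    labels  : list of +1/-1 values theta(x) for those vectors
    Returns (w, c), or None if the system cannot be solved.
    """
    k = len(support)
    A = np.zeros((k + 1, k + 1))
    b = np.zeros(k + 1)
    A[0, :k] = labels                      # sum_y theta(y) alpha_y = 0
    for i, x in enumerate(support):        # one equation (17.73) per support vector x
        for j, y in enumerate(support):
            A[i + 1, j] = labels[j] * np.dot(y, x)
        A[i + 1, k] = -1.0                 # the -c term
        b[i + 1] = labels[i]               # right-hand side theta(x)
    try:
        sol = np.linalg.solve(A, b)
    except np.linalg.LinAlgError:
        return None
    alphas, c = sol[:k], sol[k]
    w = sum(a * t * x for a, t, x in zip(alphas, labels, support))   # equation (17.68)
    return w, c

# Problem 17.75: x_+ = (0, 1) and x_- = (0, -1).
w, c = svm_from_support([np.array([0., 1.]), np.array([0., -1.])], [+1, -1])
print(w, c)   # expected: w = [0, 1], c = 0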
Some simple examples should help illustrate what could happen.
Problem 17.75. Find the SVM for the training data set given by

X− := {~x− := [0; −1]}   &   X+ := {~x+ := [0; 1]}.    (17.76)
Answer. In this case, there is only one positive and one negative vector. We expect the margin
width to be 2 since this is the distance between the two points. Let us see that this works. The
inner products are given by

〈~x−, ~x−〉 = 1,   〈~x+, ~x+〉 = 1,   〈~x−, ~x+〉 = −1.    (17.77)

The associated linear system (17.74) is

θ(~x+) α~x+ + θ(~x−) α~x− = 0
θ(~x+) 〈~x+, ~x+〉 α~x+ + θ(~x−) 〈~x−, ~x+〉 α~x− − c = θ(~x+)
θ(~x+) 〈~x+, ~x−〉 α~x+ + θ(~x−) 〈~x−, ~x−〉 α~x− − c = θ(~x−),    (17.78)

which becomes

α~x+ − α~x− = 0
α~x+ + α~x− − c = 1
−α~x+ − α~x− − c = −1    (17.79)

after substitution. This linear system corresponds to the augmented matrix

[1 −1 0 | 0; 1 1 −1 | 1; −1 −1 −1 | −1]  ↦  [1 0 0 | 1/2; 0 1 0 | 1/2; 0 0 1 | 0]    (17.80)

so that the solution is

α~x+ = 1/2,   α~x− = 1/2,   c = 0.    (17.81)

Plugging this into the equation for ~w (17.68) gives

~w = α~x+ θ(~x+) ~x+ + α~x− θ(~x−) ~x− = [0; 1].    (17.82)
Therefore, the plane H is described as the set of vectors ~x such that 〈~w, ~x〉 − c = 0. Since c = 0,
the set of solutions is all vectors ~x of the form

x [1; 0]    (17.83)

with x ∈ R. The plane H+ is the set of vectors ~x = [x; y] such that 〈~w, ~x〉 − c = 1. Since c = 0, this
equation forces y = 1 but the x component is arbitrary, i.e. H+ consists of all vectors of the form

[0; 1] + x [1; 0]    (17.84)

with x ∈ R. The plane H− is the set of vectors ~x = [x; y] such that 〈~w, ~x〉 − c = −1. Since c = 0,
this equation forces y = −1 but the x component is arbitrary, i.e. H− consists of all vectors of the
form

[0; −1] + x [1; 0]    (17.85)
with x ∈ R. Therefore, ~w = [0; 1] and c = 0 indeed describe the following strip.

[Figure: the horizontal lines H− (through the − point), H (the x-axis), and H+ (through the + point), with the margin between them.]
Problem 17.86. Find the SVM for the training data set given by

X− := {~x1− := [−1; −1], ~x2− := [1; −1]}   &   X+ := {~x+ := [0; 1]}.    (17.87)
Answer. If we solve for the SVM by including only one of the vectors from the negative training
data set, then we expect to get a strip such as follows

[Figure: the margin obtained by using only one of the negative vectors together with ~x+ as support vectors.]

and an analogous picture if we include only the other negative vector. Therefore, let us include all
the points as support vectors. Their inner products are

〈~x1−, ~x1−〉 = 2,   〈~x1−, ~x2−〉 = 0,   〈~x1−, ~x+〉 = −1,
〈~x2−, ~x2−〉 = 2,   〈~x2−, ~x+〉 = −1,   〈~x+, ~x+〉 = 1.    (17.88)
The associated linear system (17.74) is

θ(~x+) α~x+ + θ(~x1−) α~x1− + θ(~x2−) α~x2− = 0
θ(~x+)〈~x+, ~x+〉 α~x+ + θ(~x1−)〈~x1−, ~x+〉 α~x1− + θ(~x2−)〈~x2−, ~x+〉 α~x2− − c = θ(~x+)
θ(~x+)〈~x+, ~x1−〉 α~x+ + θ(~x1−)〈~x1−, ~x1−〉 α~x1− + θ(~x2−)〈~x2−, ~x1−〉 α~x2− − c = θ(~x1−)
θ(~x+)〈~x+, ~x2−〉 α~x+ + θ(~x1−)〈~x1−, ~x2−〉 α~x1− + θ(~x2−)〈~x2−, ~x2−〉 α~x2− − c = θ(~x2−)    (17.89)
which becomes

α~x+ − α~x1− − α~x2− = 0
α~x+ + α~x1− + α~x2− − c = 1
−α~x+ − 2α~x1− − 0α~x2− − c = −1
−α~x+ − 0α~x1− − 2α~x2− − c = −1    (17.90)

after substitution. This linear system corresponds to the augmented matrix

[1 −1 −1 0 | 0; 1 1 1 −1 | 1; −1 −2 0 −1 | −1; −1 0 −2 −1 | −1]  ↦  [1 0 0 0 | 1/2; 0 1 0 0 | 1/4; 0 0 1 0 | 1/4; 0 0 0 1 | 0]    (17.91)

so that the solution is

α~x+ = 1/2,   α~x1− = 1/4,   α~x2− = 1/4,   c = 0.    (17.92)

Plugging this into the equation for ~w (17.68) gives

~w = α~x+ θ(~x+) ~x+ + α~x1− θ(~x1−) ~x1− + α~x2− θ(~x2−) ~x2− = [0; 1],    (17.93)
which gives the following margin.

[Figure: the horizontal margin between H− (through the two − points) and H+ (through the + point), with H halfway between them.]
The previous two examples assumed that all of the vectors given were actually support vectors.
What if there are vectors in the training data set that are not support vectors? When this
happens, we have to exclude them from the calculation. The difficulty with this is that it might
not be clear a priori what the support vectors should be because we have not yet found the SVM.
One then uses a method of exhaustion (trial and error, if you will). Because the training data set
is finite, there are only a finite number of possibilities. However, as the training data set grows,
the number of possibilities increases dramatically. One must then make educated guesses as to
which combinations to try. The possibilities will usually be more transparent after drawing a
visualization of the training data set.
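Here is an optional Python sketch of the method of exhaustion just described: it tries every candidate set of support vectors, solves the linear system (17.74) for each, and keeps the marginally separating solution with the largest margin. It repeats the small solver from the sketch after Theorem 17.61 so that it can be run on its own, and the sample data at the bottom is that of Problem 17.94 below; the function names are my own choices for illustration.

```python
import numpy as np
from itertools import combinations

def solve_for_support(support, labels):
    """Solve (17.74) for a chosen support set; return (w, c) or None."""
    k = len(support)
    A, b = np.zeros((k + 1, k + 1)), np.zeros(k + 1)
    A[0, :k] = labels                                 # sum_y theta(y) alpha_y = 0
    for i, x in enumerate(support):                   # one equation per support vector
        A[i + 1, :k] = [labels[j] * np.dot(y, x) for j, y in enumerate(support)]
        A[i + 1, k], b[i + 1] = -1.0, labels[i]
    try:
        sol = np.linalg.solve(A, b)
    except np.linalg.LinAlgError:
        return None
    w = sum(a * t * x for a, t, x in zip(sol[:k], labels, support))   # (17.68)
    return w, sol[k]

def find_svm(X_plus, X_minus):
    """Try every candidate support set and keep the widest valid margin."""
    data = [(x, +1) for x in X_plus] + [(x, -1) for x in X_minus]
    best = None
    for k in range(2, len(data) + 1):
        for subset in combinations(data, k):
            labels = [t for _, t in subset]
            if +1 not in labels or -1 not in labels:
                continue
            result = solve_for_support([x for x, _ in subset], labels)
            if result is None:
                continue
            w, c = result
            # Keep it only if it marginally separates all of the data, and
            # prefer the smallest ||w||, i.e. the largest margin 2/||w||.
            if all(t * (np.dot(w, x) - c) >= 1 - 1e-9 for x, t in data):
                if best is None or np.linalg.norm(w) < np.linalg.norm(best[0]):
                    best = (w, c)
    return best

# Problem 17.94: expected answer w = [0, 1], c = 0 (margin width 2).
print(find_svm([np.array([0., 1.])], [np.array([0., -1.]), np.array([1., -2.])]))
```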
Problem 17.94. Find the SVM for the training data set given by

X− := {~x1− := [0; −1], ~x2− := [1; −2]}   &   X+ := {~x+ := [0; 1]}.    (17.95)
Answer. We will solve this problem by first showing what the solution is when H^supp_X is taken to
be all of X and then what the solution is if H^supp_X = {~x1−, ~x+} (there is still the other possibility of
taking H^supp_X = {~x2−, ~x+}, but we will ignore this situation because an optimization for this would
result in a non-separating solution). In either case, it is useful to have the inner products of these
vectors handy:

〈~x1−, ~x1−〉 = 1,   〈~x1−, ~x2−〉 = 2,   〈~x1−, ~x+〉 = −1,
〈~x2−, ~x2−〉 = 5,   〈~x2−, ~x+〉 = −2,   〈~x+, ~x+〉 = 1.    (17.96)
i. We can immediately throw out the case H^supp_X = {~x2−, ~x+} because the resulting maximal
margin would contain ~x1− as shown in the following figure.

[Figure: the margin determined by ~x2− and ~x+ alone; the point ~x1− lies inside it.]
ii. If H^supp_X = X, we have the linear system

−α~x1− − α~x2− + α~x+ = 0
−α~x1− − 2α~x2− − α~x+ − c = −1
−2α~x1− − 5α~x2− − 2α~x+ − c = −1
α~x1− + 2α~x2− + α~x+ − c = 1    (17.97)

whose solution is

α~x1− = 2,   α~x2− = −1,   α~x+ = 1,   c = 0.    (17.98)

Therefore,

~w = Σ_{~x∈X} θ(~x) α~x ~x = −2[0; −1] + [1; −2] + [0; 1] = [1; 1]    (17.99)

so that the margin width is √2. The resulting margin is depicted in the following figure.
[Figure: the margin for case ii, perpendicular to ~w = [1; 1]; both negative points lie on H− and ~x+ lies on H+.]
iii. If H^supp_X = {~x1−, ~x+}, we have the linear system obtained from the first by removing all α~x2−
terms since α~x2− = 0, which is what the Lagrange multiplier must satisfy because ~x2− ∉ H^supp_X,
i.e. ~x2− is not a support vector. We must also remove the equation obtained from ∂g/∂α~x2− = 0
since α~x2− = 0. The resulting linear system is

−α~x1− + α~x+ = 0
−α~x1− − α~x+ − c = −1
α~x1− + α~x+ − c = 1    (17.100)

and its solution is

α~x1− = 1/2,   α~x+ = 1/2,   c = 0.    (17.101)

Therefore,

~w = Σ_{~x∈X} θ(~x) α~x ~x = −(1/2)[0; −1] + 0[1; −2] + (1/2)[0; 1] = [0; 1]    (17.102)

so that the margin width is 2. The resulting margin is depicted in the following figure.
[Figure: the horizontal margin for case iii, with ~x1− on H−, ~x+ on H+, and ~x2− strictly below H−.]
From both of these solutions, we can read off the SVM by choosing the solution that has the
largest margin, which is the second one.
Problem 17.103. Find the SVM for the training data set given by

X− := {~x1− := [−1; −2], ~x2− := [1; −1]}   &   X+ := {~x+ := [0; 1]}.    (17.104)
Answer. The inner products are given by

〈~x1−, ~x1−〉 = 5,   〈~x1−, ~x2−〉 = 1,   〈~x1−, ~x+〉 = −2,
〈~x2−, ~x2−〉 = 2,   〈~x2−, ~x+〉 = −1,   〈~x+, ~x+〉 = 1.    (17.105)
There are three cases to consider.
i. Assume H^supp_X = {~x1−, ~x+}. The resulting linear system is

−α~x1− + α~x+ = 0
−5α~x1− − 2α~x+ − c = −1
2α~x1− + α~x+ − c = 1    (17.106)

and its solution is

α~x1− = 1/5,   α~x+ = 1/5,   c = −2/5.    (17.107)

Thus,

~w = −(1/5)[−1; −2] + (1/5)[0; 1] = (1/5)[1; 3]    (17.108)

so that the margin width is

2/‖~w‖ = 2/‖(1/5)[1; 3]‖ = 10/√10 = √10.    (17.109)
The resulting margin along with ~w and

(c/‖~w‖²) ~w = ((−2/5)/(10/25)) ((1/5)[1; 3]) = −(1/5)[1; 3] = −~w    (17.110)

(which is a vector on the middle hyperplane H) are depicted in the following figure on the left.

[Figure: on the left, the margin for case i with ~w and (c/‖~w‖²)~w drawn; on the right, the same margin with the vectors ((c−1)/‖~w‖²)~w and ((c+1)/‖~w‖²)~w drawn on H− and H+.]
On the right, the two vectors

((c + 1)/‖~w‖²) ~w = (3/10)[1; 3]   &   ((c − 1)/‖~w‖²) ~w = −(7/10)[1; 3]    (17.111)

that are on the hyperplanes H+ and H− are drawn. The lines for these hyperplanes are
obtained by solving the equations (the slope comes from the ratio of the y-component to the
x-component of a vector orthogonal to ~w)

y+ = −(1/3)x + b+
y  = −(1/3)x + b
y− = −(1/3)x + b−    (17.112)

by using the fact that these vectors are on these planes, i.e.

〈((c + 1)/‖~w‖²) ~w, ~e2〉 = −(1/3) 〈((c + 1)/‖~w‖²) ~w, ~e1〉 + b+
〈(c/‖~w‖²) ~w, ~e2〉 = −(1/3) 〈(c/‖~w‖²) ~w, ~e1〉 + b
〈((c − 1)/‖~w‖²) ~w, ~e2〉 = −(1/3) 〈((c − 1)/‖~w‖²) ~w, ~e1〉 + b−,    (17.113)
which reads

9/10 = −(1/3)(3/10) + b+
−3/5 = −(1/3)(−1/5) + b
−21/10 = −(1/3)(−7/10) + b−,    (17.114)

which gives the following equations for these lines:

y+ = −(1/3)x + 1
y  = −(1/3)x − 2/3
y− = −(1/3)x − 7/3.    (17.115)

This margin has a negative training data point in its interior, so it cannot be an SVM because it
is not described by a marginally separating hyperplane.
ii. Assume H^supp_X = {~x2−, ~x+}. The resulting linear system is

−α~x2− + α~x+ = 0
−2α~x2− − α~x+ − c = −1
α~x2− + α~x+ − c = 1    (17.116)

and its solution is

α~x2− = 2/5,   α~x+ = 2/5,   c = −1/5.    (17.117)

Thus,

~w = −(2/5)[1; −1] + (2/5)[0; 1] = (2/5)[−1; 2]    (17.118)
so that the margin width is √5. The other relevant quantities for obtaining the marginally
separating hyperplanes are

((c − 1)/‖~w‖²) ~w = (3/5)[1; −2],   (c/‖~w‖²) ~w = (1/10)[1; −2],   &   ((c + 1)/‖~w‖²) ~w = (2/5)[−1; 2].    (17.119)

Therefore, the lines describing the different hyperplanes are

y+ = (1/2)x + 1
y  = (1/2)x − 1/4
y− = (1/2)x − 3/2.    (17.120)
Hence, the resulting margin is given by

[Figure: on the left, the margin for case ii with ~w and (c/‖~w‖²)~w drawn; on the right, the same margin with the vectors ((c−1)/‖~w‖²)~w and ((c+1)/‖~w‖²)~w drawn on H− and H+.]
iii. Assume H^supp_X = {~x1−, ~x2−, ~x+}. Since the previous case already contains ~x1− as a support vector,
we already know the result will be the same. Hence, this is the SVM.
Exercise 17.121. Let X+ = {~e1}, let X− = {~e2}, and set X = {~e1, ~e2}.

(a) Sketch or describe SX, the set of all marginally separating hyperplanes for (X, X+, X−). Note
that SX must be a subset of (R2 \ {~0}) × R, which may be a bit challenging to draw.

(b) Sketch or describe S^supp_X, the set of all marginally separating hyperplanes for X for which their
margin widths have been enlarged to include support vectors. Again, this should be a subset
of (R2 \ {~0}) × R.

(c) Using the method employed in the preceding problems, find the SVM for (X, X+, X−).

(d) Draw the SVM in R2 together with (X, X+, X−).

(e) What is the margin width of this SVM?
Recommended Exercises. Please check HuskyCT for the homework. Be able to show all your
work, step by step! Do not use calculators or computer programs to solve any problems!
In this lecture, we covered Sections 6.5 and 6.6 in addition to several topics outside what is
covered in [Lay].
18 Markov chains and complex networks
Today we will cover some applications in the context of stochastic processes and Markov chains.
To gain some motivation for this, we recall what a function is.
Definition 18.1. Let X and Y be two finite sets. A function f from X to Y written as Yf←− X
is an assignment sending every x in X to a unique element, denoted by f(x), in Y.
Example 18.2. The following illustrates two examples of a function

[Figure: two diagrams, each showing a source set and a target set of symbols, with an arrow assigning to every element of the source exactly one element of the target.]    (18.3)
Example 18.4. The following two assignments are not functions.

[Figure: two diagrams of assignments between the same sets of symbols; in the one on the left, some element is sent to two different elements, and in the one on the right, some element is not sent anywhere.]    (18.5)

The assignment on the left is not a function because, for instance, ? gets assigned two entities.
The assignment on the right is not a function because, for instance, ♣ is not
assigned anything.
Today, we will think of the sets X, Y, and so on, as sets of events that could occur in a given
situation. We will often denote the elements of X as a list x1, x2, . . . , xn. Thus, a function could
be thought of as a deterministic process. What if instead of sending an element x in X to a unique
element f(x) in Y we instead distributed the element x over Y in some fashion? For this to be a
reasonable definition, we would want the sum of the probabilities of the possible outcomes to be 1
so that something is always guaranteed to happen. But for this, we should talk about probability
distributions.
Definition 18.6. A probability distribution on X = {x1, x2, . . . , xn} is a function R p←− X such
that

p(xi) ≥ 0 for all i   &   Σ_{i=1}^n p(xi) = 1.    (18.7)

Equivalently, such a probability distribution can be expressed as an n-component vector

[p(x1); p(x2); . . . ; p(xn)] ≡ [p1; p2; . . . ; pn]    (18.8)

again with the condition that each entry is at least 0 and the sum of all entries is equal to 1.
Exercise 18.9. Show that the set of all probability distributions on a finite set X is not a vector
space. Is it a linear manifold? Is it a convex space?
Example 18.10. Let X := {H, T}, where H stands for "heads" and T stands for "tails." Let
R p←− X denote a "fair" coin toss, i.e.

p(H) = 1/2   &   p(T) = 1/2.    (18.11)

Then p is a probability distribution on X.

Example 18.12. Again, let X := {H, T} be the set of events of a coin flip: either heads or tails.
But this time, fix some weight r, an arbitrary number strictly between 0 and 1. Let R qr←− X
be the probability distribution

qr(H) = r   &   qr(T) = 1 − r.    (18.13)

Then qr is a probability distribution on X. This is called an "unfair" coin toss if r ≠ 1/2. Thus, the
set of all probability distributions on X looks like the following subset of R2.

[Figure: the line segment from (1, 0) to (0, 1) in the plane, i.e. the set of points (r, 1 − r) with 0 ≤ r ≤ 1.]
Definition 18.14. Let X and Y be two finite sets. A stochastic map/matrix from X to Y is an
assignment sending a probability distribution on X to a probability distribution on Y. Such a map
is drawn as T : X ⇝ Y.
Let us parse out what this definition is saying. Write X := {x1, x2, . . . , xn} and Y :=
{y1, y2, . . . , ym}. As we've already discussed, any probability distribution p on X can be expressed
as a vector (18.8) and similarly on Y. Thus T (p) is a probability distribution on Y, i.e. is some
vector (this time with m components). Is this starting to look familiar? T is an operation taking
an n-component vector to an m-component vector. It almost sounds as if T is described by some
matrix. Furthermore, we can look at the special probability distribution δxi defined by

δxi(xj) := δij := 1 if i = j, and 0 otherwise    (18.15)

(you may recognize this as the Kronecker-delta function). In other words, δxi describes the proba-
bility distribution that says the event xi will occur with 100% probability and no other event will
occur. As a vector, this looks like

δxi = [0; . . . ; 0; 1; 0; . . . ; 0]    (18.16)

with the single 1 in the i-th entry.
Therefore, we might expect that the probability distribution T(p) on Y is determined by the
probability distributions δxi since p itself can be written as a linear combination of these! Indeed,
we have

p = Σ_{i=1}^n p(xi) δxi,    (18.17)

or in vector form

[p(x1); p(x2); . . . ; p(xn)] = p(x1)[1; 0; . . . ; 0] + p(x2)[0; 1; . . . ; 0] + · · · + p(xn)[0; 0; . . . ; 1].    (18.18)
Furthermore, whatever T is, it has to send the Kronecker-delta probability distribution to some
distribution on Y, which is represented by an m-component vector

T(δxi) =: [T1i; T2i; . . . ; Tmi].    (18.19)
The meaning of this vector is as follows. Imagine that the event xi takes place with 100% proba-
bility. Then the stochastic map says that after xi occurs, there is a T1i probability that the event
y1 will occur, a T2i probability that the event y2 will occur,..., and a Tmi probability that the event
ym will occur. This exactly describes the i-th column of a matrix. In other words, the stochastic
process is described by a matrix given by

T = [T11 T12 · · · T1n; T21 T22 · · · T2n; . . . ; Tm1 Tm2 · · · Tmn]    (18.20)
where the i-th column represents physically the situation described in the past few sentences. Now
let's go back to our initial probability distribution p on X. In this case, the event xi takes place
with probability p(xi) instead of 100%. Given this information, what is the probability of the event
yj taking place after the stochastic process? This would be obtained by taking the j-th entry of
the resulting m-component vector from the matrix operation

[T11 T12 · · · T1n; T21 T22 · · · T2n; . . . ; Tm1 Tm2 · · · Tmn] [p(x1); p(x2); . . . ; p(xn)].    (18.21)

In other words,

(T(p))(yj) = Σ_{i=1}^n Tji p(xi)    (18.22)

is the probability of the event yj taking place given that the stochastic process T takes place and
the initial probability distribution on X was given by p.
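Here is a tiny (and optional) Python sketch of this matrix description of a stochastic map. The particular matrix below is made up; the only requirements are that its entries are nonnegative and that each column sums to 1.

```python
import numpy as np

# A hypothetical stochastic matrix from X = {x1, x2, x3} to Y = {y1, y2}:
# column i is the distribution T(delta_{x_i}) on Y, so each column sums to 1.
T = np.array([[0.9, 0.3, 0.5],
              [0.1, 0.7, 0.5]])

p = np.array([0.2, 0.5, 0.3])   # a probability distribution on X

q = T @ p                        # the distribution T(p) on Y, as in (18.21)
print(q, q.sum())                # entries are nonnegative and sum to 1
```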
Example 18.23. Imagine a machine that flips a coin and is programmed to always obtain heads
when given heads and always obtains tails when it is given tails. Unfortunately, machines are never
perfect and there are always subtle changes in the environment that actually make the probability
distribution slightly different. Oddly enough, the distribution for heads and tails was slightly
different after performing the tests over and over again. Given heads, the machine is 88%
likely to flip the coin and land heads again (leaving 12% for tails). Given tails, the machine is only
86% likely to flip the coin and land tails again (leaving 14% for heads). The matrix associated to
this stochastic process is

T = [0.88 0.14; 0.12 0.86].    (18.24)
Imagine I give the machine the coin heads up at first. After how many flips will the probability of
seeing heads be less than 65%? After one flip, the probability of seeing heads is

[0.88 0.14; 0.12 0.86] [1; 0] = [0.88; 0.12]    (18.25)

as we could have guessed. After another turn, it becomes (after rounding and suppressing the
higher order terms)

[0.88 0.14; 0.12 0.86] [0.88; 0.12] = [0.79; 0.21]    (18.26)

and so on,

[0.68; 0.32] ⟵T [0.73; 0.27] ⟵T [0.79; 0.21],    (18.27)

until after 5 turns we finally get

[0.88 0.14; 0.12 0.86]^5 [1; 0] = [0.64; 0.36].    (18.28)
If we draw these points on the space of probability distributions, they look as follows.

[Figure: the successive distributions plotted on the probability simplex in R2, marching from (1, 0) toward a limiting point.]

which by the way makes it look like they are converging. We will get back to this soon.
Definition 18.29. Given a set X, a stochastic map T from X to itself, and a
probability distribution p on X, a Markov chain is the sequence of probability vectors

(p, T(p), T²(p), T³(p), . . .).    (18.30)
Example 18.31. In the previous example, what happens if we keep iterating the stochastic map?
Does the resulting distribution eventually converge to some probability distribution on X? And
if it does converge to some probability distribution q, does that probability remain “steady”? In
other words, can we find a vector q such that Tq = q? Could there be more than one such “steady”
probability distribution? Let’s first try to find such a vector before answering all of these questions.
We want to solve the equation

[0.88 0.14; 0.12 0.86] [q; 1 − q] = [q; 1 − q].    (18.32)

Working out the left-hand side gives the two equations

0.88q + 0.14(1 − q) = q
0.12q + 0.86(1 − q) = 1 − q.    (18.33)

This is a bit scary: two equations and one unknown! But maybe we can still solve it... The first
equation gives the solution

q = 0.14/0.26 ≈ 0.54.    (18.34)
Fortunately, the second equation gives the same exact solution! What this is saying is that if I
was 54% sure that I gave the machine a coin with heads up, then the probability of the outcome
would be 54% heads every single time!
Definition 18.35. Let X be a set and T a stochastic process on X. A steady state probability
distribution for X and T is a probability distribution p on X such that T (p) = p.
A more clever way to solve for steady state probability distributions is to rewrite the equation
T (p) = p as (T − 1)(p) = 0, where 1 is the stochastic process that does nothing (in other words,
it leaves every single probability distribution alone). Since T − 1 can be represented as a matrix
and p as a vector, this amounts to solving a homogeneous system, which you are quite familiar
with by now.
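For instance, here is an optional Python sketch that finds the steady state of the coin-flipping machine by solving the homogeneous system (T − 1)(p) = 0 together with the normalization condition that the entries sum to 1. The way the normalization is appended is my own choice for illustration.

```python
import numpy as np

T = np.array([[0.88, 0.14],
              [0.12, 0.86]])

# Solve (T - I) q = 0 together with the normalization q1 + q2 = 1 by
# appending the normalization row to the homogeneous system.
A = np.vstack([T - np.eye(2), np.ones((1, 2))])
b = np.array([0.0, 0.0, 1.0])
q, *_ = np.linalg.lstsq(A, b, rcond=None)
print(q)   # approximately [0.538, 0.462], matching (18.34)
```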
Go through Example 2 in Section 10.2 in [2].
Problem 18.36. If S : X //Y is a stochastic map, what is the meaning of ST , the transpose of
the stochastic map?
Answer. If we write out the elements of X and Y as X = {x1, . . . , xn} and Y = {y1, . . . , ym}, then S has the matrix form

S = [S~e1 · · · S~en]    (18.37)

with the vectors S~ek as its columns. It's helpful to write out the components of S explicitly:

S = [s11 s12 · · · s1n; s21 s22 · · · s2n; . . . ; sm1 sm2 · · · smn].    (18.38)

Note that S~ek, the k-th column of S, is the probability distribution associated to the stochastic
map with a definitive starting value of xk. In other words, it describes all possible outputs given the
input xk with their corresponding probabilities. The k-th row of S describes all ways of achieving
the output yk from all possible inputs with their corresponding probabilities. Notice that the sum
of the entries in each row of S does not have to add up to 1. For example, if S gave the same output
no matter what input was given, then it would look like a matrix of all 0's except for one row
consisting of all 1's. So the transpose of S is in general not a stochastic matrix. Nevertheless, we
still have an interpretation of the rows of S, which are the columns of ST. Therefore, ST assigns
to each yk the possible elements in X that could have led to yk as the output of S together
with the corresponding probability that that specific element in X led to yk. Stochastic matrices
S for which ST is also a stochastic matrix are called doubly stochastic matrices.
We now come to answering the many questions we had raised earlier.
Definition 18.39. Let X be a finite set. A stochastic map T on X is said to be regular if
there exists a positive integer k such that the matrix associated to T^k has entries (T^k)ij satisfying
176
Theorem 18.40. Let X be a finite set and T a regular stochastic map on X. Then, there exists a
unique probability distribution q on X such that T(q) = q. Furthermore, for any other probability
distribution p, the sequence

(p, T(p), T²(p), T³(p), T⁴(p), . . .)    (18.41)

converges to q. In fact, when the probability distribution q is written as a vector ~q and T as a
stochastic matrix,

lim_{n→∞} T^n = [~q · · · ~q].    (18.42)

This limit is meant to be interpreted pointwise, i.e. the limit of each of the entries.
We already saw this in the example above. We found a unique solution to the coin toss scenario
and we also observed how our initial configuration tended towards the steady state solution. If T is
not regular, the sequence might not converge to a steady state solution. Markov chains appear in
several other contexts. For example, Google prioritizes search results based on stochastic matrices.
The internet can be viewed as a directed graph where webpages are represented as vertices and a
directed edge from one vertex to another means that the source webpage has a hyperlink to the
target webpage.
Definition 18.43. A directed graph consists of a set V, a set E, and two functions s, t : E → V
such that

(a) s(e) ≠ t(e) for all e ∈ E,

(b) if s(e) = s(e′) and t(e) = t(e′), then e = e′,

(c) for each v ∈ V, there exists an e ∈ E such that either s(e) = v or t(e) = v.
This definition is interpreted in the following way. The elements of V are called vertices (also
called nodes) and the elements of E are called directed edges. The functions s and t are interpreted
as the source and target of each directed edge, respectively. The first condition guarantees that
there are no loops. In terms of the internet example, this means that there is no webpage that
hyperlinks to itself (of course, some webpages do this, but we will not consider such cases). The
second condition guarantees that there is at most one directed edge from one vertex to another.
In terms of the internet example, this means that a webpage has at most one hyperlink to another
webpage. The third condition guarantees that there are no isolated vertices. In terms of the
internet example, this means that each webpage is connected to some other webpage, either by
having a hyperlink to another webpage or by being linked to from another webpage. Note that
we do allow directed edges to go in both directions between two vertices. This means that we allow
the situation where a webpage A hyperlinks to B and B hyperlinks back to A.
Go through PageRank and the Google Matrix on page 19 in Section 10.2 of [2]. The main
idea is that a surfer clicks a hyperlink with a uniform distribution. With this information, a
stochastic matrix can be obtained. This stochastic matrix is not regular and two adjustments
need to be made. The first adjustment has a wonderful geometric interpretation which will be
explained below. The second adjustment is a convex combination with a uniform distribution
allowing for the possibility of a surfer selecting a website at random regardless of whether or
not a hyperlink exists on that webpage. These two modifications construct a regular stochastic
matrix so that Theorem 18.40 holds.
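To make the second adjustment concrete, here is an optional Python sketch that builds a stochastic matrix from a small made-up directed graph (every page in this toy example has at least one outgoing link, so the first adjustment is not needed) and then forms the convex combination with the uniform distribution before iterating toward the steady state. The graph, the damping value 0.85, and the variable names are my own choices for illustration, not the example from [2].

```python
import numpy as np

# A hypothetical 4-page web: links[j] lists the pages that page j links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = 4

# Column j of S is the uniform distribution over the pages that page j links to.
S = np.zeros((n, n))
for j, targets in links.items():
    for i in targets:
        S[i, j] = 1.0 / len(targets)

# Second adjustment: convex combination with the uniform distribution
# (this makes the matrix regular, so Theorem 18.40 applies).
alpha = 0.85
G = alpha * S + (1 - alpha) * np.ones((n, n)) / n

# Iterate to approximate the steady state distribution.
p = np.ones(n) / n
for _ in range(100):
    p = G @ p
print(p)
```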
Note that adjustment 1 in Lay’s book may change the topology of the graph in the sense that
one cannot draw it on the plane without intersections. Naively drawing the adjusted graph results
in
[Figure: the original graph on seven numbered vertices, and the graph obtained from adjustment 1, naively drawn in the plane.]
As you can see, there are three edges that cannot be drawn without intersecting any other edge. If
we could somehow cut out two holes in the plane and somehow glue the outer circles together, we
could "tunnel" from the outside to the inside of the graph and connect the edges so that they
do not intersect.
[Figure: the same graph redrawn with two holes cut out of the plane and their boundary circles glued, letting the three extra edges tunnel from the outside to the inside.]
We’ll go to three dimensions to see why the three different edges do not intersect each other.
[Figure: a three-dimensional rendering of the same construction, showing that the three extra edges do not intersect each other.]
If we view our original planar graph on a two-dimensional sphere (which we can always do since
the graph is compact), cutting out two such holes and gluing the boundary circles together means
that the adjusted graph actually lives on a torus (see Figure 10). Notice how none of the edges
intersect each other now.

[Figure 10: Embedding the graph on a torus.]
Such graphs are part of the more general area of study known as complex networks. Since the
internet is a vastly larger network than the examples we illustrated above, computing the steady
state vectors, and therefore obtaining the ranking of website importance, is a challenging task.
Imagine trying to do this for such a large network (see Figure 11).49
There are methods to reduce such big data problems to more manageable ones, but this neces-
sarily involves some approximations. Such methods are discussed in [3]. Figuring out the topology
is a great help in obtaining certain features of the network. Unfortunately, the kind of topology
we have discussed above is rarely touched on in a first course in topology, unless it is towards the
end of the course. Such material is more often deferred to a course on algebraic topology or graph
theory. If you’d like to get a good taste of topology for beginners, I recommend the book The
Shape of Space by Weeks [6].
49This figure was obtained from Grandjean, Martin, "Introduction à la visualisation de données, l'analyse de
réseau en histoire", Geschichte und Informatik 18/19, pp. 109–128, 2015.
Figure 11: A complex network, similar to the one described by webpages and hyperlinks.
The larger, warmer-colored nodes depict higher importance.
Recommended Exercises. Exercises 4 and 18 in Section 4.9 of [Lay]. Exercises 8, 15, 23, and
24 in Section 10.1 of [2]. Exercises 3, 13, 27, 28, 34, and 35 in Section 10.2 of [2]. Be able to show
all your work, step by step! Do not use calculators or computer programs to solve any problems!
In this lecture, we covered parts of Sections 4.9, 10.1, 10.2, and my own personal notes.
19 Eigenvalues and eigenvectors
The steady state vectors from the lecture on Markov chains are special cases of what are called
eigenvectors with eigenvalue 1.
Definition 19.1. Let Rn T←− Rn be a linear transformation. An eigenvector for T is a non-zero
vector ~v ∈ Rn for which T(~v) ∝ ~v (read: T(~v) is proportional to ~v), i.e. T(~v) = λ~v for some scalar
λ ∈ R. The proportionality constant λ in the expression T(~v) = λ~v is called the eigenvalue of the
eigenvector ~v.

Equivalently, ~v is an eigenvector for T iff

span{T~v} ⊆ span{~v}.    (19.2)
Please note that although eigenvectors are assumed to be nonzero, eigenvalues can certainly be
zero. For example, take the matrix [1 0; 0 0], which has eigenvalues 1 and 0 with corresponding
eigenvectors [1; 0] and [0; 1], respectively. This is why the condition is not span{T~v} = span{~v}.
Besides the example studied in the previous lecture associated with Markov chains and stochas-
tic processes, we have several other examples.
Example 19.3. Consider the vertical shear transformation in R2 given by the matrix

S|1 = [1 0; 1 1].

[Figure: the unit vectors ~e1, ~e2 and their images S|1(~e1), S|1(~e2) under the shear.]
Visually, it is clear that the vector ~e2 is an eigenvector of eigenvalue 1. Let us check to see if this
is true and if this is the only eigenvector for the vertical shear. The system we wish to solve is

[1 0; 1 1] [x; y] = λ [x; y]    (19.4)

for all possible values of x and y as well as λ. Following a similar procedure to what we did last
class, we subtract

λ [x; y] = λ [1 0; 0 1] [x; y]    (19.5)

from both sides,

([1 0; 1 1] − λ [1 0; 0 1]) [x; y] = [0; 0],    (19.6)

which becomes the homogeneous system

[1 − λ, 0; 1, 1 − λ] [x; y] = [0; 0].    (19.7)
Notice that we are trying to find a nontrivial solution to this system. Rather than trying to
manipulate these equations algebraically and solving these systems on a case-by-case basis, let us
analyze this system from the linear algebra perspective. Finding an eigenvector for the vertical
shear matrix amounts to finding a nontrivial solution to the system described by equation (19.7),
which means that the kernel of the matrix [1 − λ, 0; 1, 1 − λ] must be nonzero, which, by the Invertible
Matrix Theorem, means the determinant of this matrix must be zero, i.e.

det [1 − λ, 0; 1, 1 − λ] = 0.    (19.8)

Solving this, we arrive at the polynomial equation

(1 − λ)² = 0.    (19.9)

The only root of this polynomial is λ = 1 (in fact, λ = 1 appears twice, which means it has
multiplicity 2—more on this soon). Knowing this information, we can then solve the system (19.7)
much more easily since the equation reduces to

[0 0; 1 0] [x; y] = [0; 0]    (19.10)

and the set of solutions to this system is

{t [0; 1] : t ∈ R},    (19.11)

where t is a free variable. Therefore, all of the vectors of the form (19.11) are eigenvectors with
eigenvalue 1.
Example 19.12. Consider the rotation by angle π/2 given by the matrix

Rπ/2 = [0 −1; 1 0].

[Figure: the unit vectors ~e1, ~e2 and their images Rπ/2(~e1), Rπ/2(~e2) under the rotation.]
Following a similar procedure to the previous example, namely solving

[0 −1; 1 0] [x; y] = λ [x; y]    (19.13)

for all possible values of x and y as well as λ, we obtain

[−λ, −1; 1, −λ] [x; y] = [0; 0].    (19.14)

Again, we want to find eigenvectors and eigenvalues for this system, and this means the matrix
[−λ, −1; 1, −λ] must be non-invertible so that

det [−λ, −1; 1, −λ] = 0,    (19.15)

but the determinant is given by

λ² + 1 = 0.    (19.16)
This polynomial has no real root. The only roots are λ = ±√−1. Therefore, there are no eigen-
vectors with real eigenvalues for the rotation matrix. This is plausible because if you rotate in the
plane, nothing except the zero vector is fixed, agreeing with our intuition.
The previous example showed us that we could make sense of eigenvalues for a real matrix, but
we would have to allow them to be complex.
Example 19.17. As we saw in Example 19.12, the characteristic polynomial equation associated to the
rotation by angle π/2, i.e. to the matrix

Rπ/2 := [0 −1; 1 0],    (19.18)

is

λ² + 1 = 0.    (19.19)

Using just real numbers, no such solution exists. However, using complex numbers, we know
exactly what λ should be. The possible choices are

λ1 = √−1   &   λ2 = −√−1    (19.20)

since both satisfy λ² = −1. What are the corresponding eigenvectors? As usual, we solve a
homogeneous problem for each eigenvalue. For λ1, we have

[−√−1 −1 | 0; 1 −√−1 | 0] → [−1 √−1 | 0; 1 −√−1 | 0] → [1 −√−1 | 0; 0 0 | 0]    (19.21)

which has solutions of the form

t [√−1; 1]    (19.22)

with t a free variable. Hence one such eigenvector for λ1 is

~v1 = [1; −√−1]    (19.23)

(I multiplied throughout by −√−1 so that the first entry is a 1). Similarly, for λ2 one finds that
one such eigenvector is

~v2 = [1; √−1].    (19.24)

Hence, the rotation matrix does have eigenvalues and eigenvectors—we just can't see them! Com-
plex numbers are briefly reviewed at the end of this lesson. When we discuss ordinary differential
equations, we will give a physical interpretation for complex eigenvalues (briefly, they describe
oscillations).
Example 19.25. Consider the transformation given by the matrix

A := [0 −2; −1 1].

[Figure: the unit vectors ~e1, ~e2 and their images A~e1, A~e2 under this transformation.]
At a first glance, it does not look like there are any eigenvectors for this transformation, but this
is misleading. The possible eigenvalues are obtained by solving the quadratic polynomial equation

det [−λ, −2; −1, 1 − λ] = −λ(1 − λ) − 2 = 0   ⇐⇒   λ² − λ − 2 = 0.    (19.26)

The roots of this quadratic polynomial are given in terms of the quadratic formula

λ = (−(−1) ± √((−1)² − 4(1)(−2))) / (2(1)) = 1/2 ± 3/2,    (19.27)

which has two solutions. The eigenvalues are therefore

λ1 = −1   &   λ2 = 2.    (19.28)

Associated to the first eigenvalue, we have the linear system

[−λ1, −2; −1, 1 − λ1] [x1; y1] = [1, −2; −1, 2] [x1; y1] = [0; 0].    (19.29)
The solutions of this system are all scalar multiples of the vector

~v1 := [2; 1].    (19.30)

Therefore, ~v1 is an eigenvector of [0 −2; −1 1] with eigenvalue −1 because

[0 −2; −1 1] [2; 1] = −1 [2; 1].    (19.31)

Similarly, associated to the second eigenvalue, we have the linear system

[−λ2, −2; −1, 1 − λ2] [x2; y2] = [−2, −2; −1, −1] [x2; y2] = [0; 0]    (19.32)

whose solutions are all scalar multiples of the vector

~v2 := [1; −1].    (19.33)

Again, this means that ~v2 is an eigenvector of [0 −2; −1 1] with eigenvalue 2 because

[0 −2; −1 1] [1; −1] = 2 [1; −1].    (19.34)
Visually, under the transformation, these eigenvectors stay along the same line where they started.

[Figure: the vectors ~e1, ~e2, ~v1, ~v2 and their images under A; the images A~v1 and A~v2 lie along the same lines through the origin as ~v1 and ~v2.]
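If you would like to double-check eigenvalue and eigenvector computations like the ones above on your own time (not on homework or exams, where computers are not allowed), numpy can do it. Here is a sketch using the matrix from Example 19.25; note that numpy normalizes the eigenvectors, so you will see scalar multiples of ~v1 and ~v2 rather than the exact vectors found above.

```python
import numpy as np

A = np.array([[0., -2.],
              [-1., 1.]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)            # approximately [-1., 2.] (possibly in another order)

# Each column of `eigenvectors` is a (normalized) eigenvector; check A v = lambda v.
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))
```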
I highly recommend checking out 3Blue1Brown's video https://www.youtube.com/watch?v=PFDu9oVAE-g
on eigenvectors and eigenvalues for geometric animations describing what eigenvectors are. The
previous examples indicate the following algebraic fact relating eigenvalues to determinants.
Theorem 19.35. Let A be an n× n matrix. The eigenvalues for the linear transformation asso-
ciated to A are given by the solutions to the polynomial equation
det (A− λ1n) = 0 (19.36)
in the variable λ.
Proof. Let ~v be an eigenvector of A with eigenvalue λ. Then A~v = λ~v. Therefore, (A− λ1)~v = ~0,
i.e. ~v ∈ ker(A−λ1). Since ~v is a non-zero vector, A−λ1 is not invertible. Hence, det(A−λ1) = 0.
Conversely, suppose that λ satisfies det(A − λ1) = 0. Then A − λ1 is not invertible. Since
A− λ1 is an n× n matrix, A− λ1 is not injective, i.e. its nullspace is at least 1-dimensional, i.e.
(A−λ1)~v = ~0 has a nontrivial solution, i.e. there exists a nonzero vector ~v such that A~v = λ~v.
Definition 19.37. For an n × n matrix A, the resulting polynomial det (A− λ1n) is called the
characteristic polynomial of A. If an eigenvalue appears k times as a root, then k is called the
multiplicity of that eigenvalue. The span of the eigenvectors of A for a particular eigenvalue λ is
called the eigenspace associated to A and λ.
One consequence of the above theorem is the following.
Corollary 19.38. λ is an eigenvalue of an n × n matrix A if and only if λ is an eigenvalue of
AT , the transpose of A.
Proof. Exercise.
Another immediate consequence of the above theorem is the following.
Theorem 19.39. Let A be an upper-triangular n × n matrix of the form

A = [a11 a12 a13 · · · a1n; 0 a22 a23 · · · a2n; 0 0 a33 · · · a3n; . . . ; 0 0 · · · 0 ann].    (19.40)

Then the eigenvalues of A are given by the roots of the polynomial equation

(a11 − λ)(a22 − λ)(a33 − λ) · · · (ann − λ) = 0    (19.41)

in the variable λ.

Proof. Since

A − λ1 = [a11 − λ, a12, a13, · · ·, a1n; 0, a22 − λ, a23, · · ·, a2n; 0, 0, a33 − λ, · · ·, a3n; . . . ; 0, 0, · · ·, 0, ann − λ]    (19.42)

is an upper triangular matrix, its determinant is the product of the diagonal entries, i.e.

det (A − λ1n) = (a11 − λ)(a22 − λ)(a33 − λ) · · · (ann − λ).    (19.43)
Unfortunately, this theorem only tells us what the eigenvalues of an upper triangular matrix
are. It is not so easy to write down a set of eigenvectors (if they even exist). It is now important
to understand why we sometimes found complex eigenvalues even though we started out with
matrices that only had real numbers in their entries. This is because an arbitrary polynomial of
the form

a0 + a1x + a2x² + · · · + anx^n = 0    (19.44)

with real coefficients has roots that are in general complex! You might then wonder whether an
arbitrary polynomial

a0 + a1x + a2x² + · · · + anx^n = 0    (19.45)

with complex coefficients could have roots that are not complex; it turns out the answer is no.
Furthermore, how many roots are there? Hopefully there are n roots, provided that an ≠ 0.
The answer is yes, and the following theorem makes both of these statements precise.
Theorem 19.46 (Fundamental Theorem of Algebra). Every polynomial equation of the form

a0 + a1x + a2x² + · · · + anx^n = 0,    (19.47)

with all a0, a1, . . . , an ∈ C and an ≠ 0, has exactly n complex roots (possibly with multiplicity). In
other words, there exist complex numbers r1, r2, . . . , rn ∈ C such that

a0 + a1x + a2x² + · · · + anx^n = an(x − r1)(x − r2) · · · (x − rn).    (19.48)
Proof. There are many interesting proofs of this theorem, though we will not concern ourselves
here with a proof.
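As an optional numerical illustration of this theorem, numpy's root finder returns all the complex roots of a polynomial with real (or complex) coefficients, counted with multiplicity; the polynomials below are my own choices for illustration.

```python
import numpy as np

# The characteristic polynomial lambda^2 + 1 of the rotation matrix has two
# complex roots even though its coefficients are real:
print(np.roots([1, 0, 1]))      # approximately [0.+1.j, 0.-1.j]

# A degree-3 polynomial always has 3 complex roots counted with multiplicity:
print(np.roots([1, 0, 0, -1]))  # roots of x^3 - 1: one real root and two complex roots
```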
For this reason, we may often work with complex numbers even if our original matrices were
real-valued. This is because some eigenvalues can be complex! When this happens, we may find
eigenvectors whose eigenvalues are complex, but these eigenvectors will also be complex. If we
started with a real linear transformation Rn T←− Rn and we find a complex eigenvalue with an
eigenvector with complex coefficients, then that vector does not live in Rn, because Rn consists of
vectors with only real entries. Instead, just like we enlarge our set of real numbers R to include
complex numbers C, we can enlarge Rn to include complex linear combinations of vectors. We
denote the set of n-component vectors with complex entries by Cn.
Another interesting fact that we did not point out explicitly from the examples but is also
visible is the following.
Theorem 19.49. Let {~v1, ~v2} be two eigenvectors of a linear transformation Rn T←− Rn with
associated eigenvalues λ1, λ2 that are distinct. Then {~v1, ~v2} is a linearly independent set of vectors.

Proof. We must show that the only solution to

x1~v1 + x2~v2 = ~0    (19.50)

is x1 = x2 = 0. Without loss of generality, suppose that λ1 ≠ 0 (we know that at least one of λ1
and λ2 must be nonzero since they are distinct). Then, by applying T to

x1~v1 + x2~v2 = ~0    (19.51)

we obtain

x1λ1~v1 + x2λ2~v2 = ~0.    (19.52)

Since λ1 ≠ 0, we can divide by it to obtain

x1~v1 + x2(λ2/λ1)~v2 = ~0.    (19.53)

Subtracting this from the first equation, we get

(1 − λ2/λ1) x2~v2 = ~0.    (19.54)

By assumption, λ1 ≠ λ2 so that the term in parentheses is nonzero. Furthermore, ~v2 is nonzero
because by definition of being an eigenvector, it must be nonzero. Hence x2 = 0. Therefore, we
are left with

x1~v1 = ~0    (19.55)

and again, since ~v1 is nonzero by definition of being an eigenvector, this implies x1 = 0. Hence,
{~v1, ~v2} is a linearly independent set of vectors.
Theorem 19.56. Suppose that ~v1, . . . , ~vk are eigenvectors of a linear transformation Rn T←− Rn
with associated eigenvalues λ1, . . . , λk, all of which are distinct. Then the set {~v1, . . . , ~vk} is
linearly independent.
Proof. Use the previous theorem and induction.
We now briefly discuss complex numbers, complex eigenvalues, and complex eigenvectors. A
geometric visualization of complex numbers is obtained from the fact that every complex number
has a polar decomposition.

Theorem 19.57. Every complex number (a, b) ≡ a + b√−1 can be written in the form r(cos θ, sin θ) ≡
r cos(θ) + r sin(θ)√−1 for some unique number r ≥ 0 and for some angle θ in [0, 2π). Further-
more, if a and b are not both zero, then the angle θ is unique.

Proof. It helps to draw (a, b) in the plane.

[Figure: the point (a, b) in the plane, at distance r = √(a² + b²) from the origin and at angle θ measured from the positive x-axis.]
Set

r := √(a² + b²)   &   θ :=
  arctan(b/a)         for a > 0, b ≥ 0,
  π/2                 for a = 0, b > 0,
  π + arctan(b/a)     for a < 0,
  3π/2                for a = 0, b < 0,
  2π + arctan(b/a)    for a > 0, b < 0,    (19.58)

where θ is defined as long as a and b are not both zero. If a = b = 0, θ can be anything.
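Here is an optional Python sketch of the polar decomposition using the standard library's cmath module. Note that cmath reports the angle in (−π, π], so a small shift is needed to land in [0, 2π) as in (19.58); the particular complex number chosen is just an example.

```python
import cmath

z = complex(-1.0, 1.0)           # the complex number (a, b) = (-1, 1)
r, theta = cmath.polar(z)        # r = sqrt(a^2 + b^2), theta in (-pi, pi]
print(r, theta)                  # sqrt(2) and 3*pi/4

# Shift negative angles by 2*pi to land in [0, 2*pi) as in (19.58).
theta_0_2pi = theta if theta >= 0 else theta + 2 * cmath.pi
print(theta_0_2pi)
```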
Exercise 19.59. Show that the eigenvalues of a matrix of the form

[a −b; b a]    (19.60)

are of the form

a ± b√−1.    (19.61)
Complex eigenvalues and eigenvectors can be interpreted in this way as types of rotations in a
particular plane. Hence, they always come in pairs.
Definition 19.62. Let (a, b) ≡ a + b√−1 be a complex number (a and b are real numbers). The
complex conjugate of (a, b) ≡ a + b√−1, denoted by an overline, is

\overline{(a, b)} = (a, −b)    (19.63)

or in terms of the notation a + b√−1 this is written as

\overline{a + b√−1} = a − b√−1.    (19.64)
Definition 19.65. Let Cn be the set of n-component vectors but whose entries are all complex
numbers. Addition, scalar multiplication (this time using complex numbers!), and the zero vector
are analogous to how they are defined for Rn. Similarly, m × n matrices can be taken to have
complex numbers as their entries and are (complex) linear transformations Cm ← Cn.
Theorem 19.66. Let A be an n × n matrix with real entries. If λ is an eigenvalue of A, then its
complex conjugate \overline{λ} is also an eigenvalue.

Theorem 19.67. Let A be an n × n matrix with complex entries. Then A has n (complex)
eigenvalues, counted with multiplicity.
Recommended Exercises. Please check HuskyCT for the homework. Please show your work!
Do not use calculators or computer programs to solve any problems! In this lecture, we covered
parts of Sections 5.1, 5.2, 5.5, and Appendix B in [Lay].
20 Diagonalizable matrices
As we saw in the previous lecture, it is easy to compute the eigenvalues of an upper-triangular
matrix. It is equally simple to compute the eigenvalues of a diagonal matrix.
Theorem 20.1. Let

D = [λ1 0 · · · 0; 0 λ2 · · · 0; . . . ; 0 · · · 0 λn]    (20.2)

be a diagonal matrix. Then ~ei is an eigenvector of D with eigenvalue λi for all i ∈ {1, . . . , n}.
Proof. Exercise.
This Theorem tells us more than Theorem 19.39 since we know not only the eigenvalues but
also the eigenvectors in the case of a diagonal matrix. However, not all matrices are diagonal. The
good news is that many matrices are close to being diagonal in a sense we will describe soon.
Theorem 20.3. Let A be an n× n matrix. Let P be an invertible matrix and define
B := PAP−1. (20.4)
Then det(A) = det(B).
Proof. This follows immediately from the product rule for determinants, which says
det(B) = det(PAP−1) = det(P ) det(A) det(P−1) = det(P ) det(A) det(P )−1 = det(A). (20.5)
Definition 20.6. An n× n matrix A is similar to an n× n matrix B iff there exists an invertible
n× n matrix P such that B = PAP−1.
Definition 20.7. Let A be an n × n matrix. A is diagonalizable iff it is similar to a diagonal
matrix D, i.e. iff there exists an invertible n× n matrix P such that A = PDP−1.
Note that solving for D in the above equation gives D = P−1AP.
Theorem 20.8. An n × n matrix A is diagonalizable if and only if A has n linearly independent
eigenvectors. Furthermore, when A is diagonalizable and its eigenvectors are given by {~v1, . . . , ~vn},
setting

P := [~v1 · · · ~vn]    (20.9)

(the matrix whose columns are the eigenvectors), A can be expressed as

A = P [λ1 0 · · · 0; 0 λ2 · · · 0; . . . ; 0 · · · 0 λn] P⁻¹,    (20.10)

where λi is the eigenvalue corresponding to the eigenvector ~vi, namely

A~vi = λi~vi    (20.11)

for all i ∈ {1, . . . , n}.
Proof.

(⇒) Suppose A = PDP⁻¹ with D as in (20.10). Set ~vi := P~ei for each i ∈ {1, . . . , n}. Then

A~vi = AP~ei = PDP⁻¹P~ei = PD~ei = Pλi~ei = λiP~ei = λi~vi    (20.12)

showing that ~vi is an eigenvector of A with eigenvalue λi. Because P is invertible, its columns,
which are exactly the vectors ~vi = P~ei, are linearly independent, i.e. {~v1, . . . , ~vn} is a linearly
independent set of vectors.

(⇐) Suppose that {~v1, . . . , ~vn} is a linearly independent set of eigenvectors of A with corresponding
eigenvalues λ1, . . . , λn. Set

P := [~v1 · · · ~vn].    (20.13)

Then define the matrix D by D := P⁻¹AP. Notice that the columns of D are given by

D~ei = P⁻¹AP~ei = P⁻¹A~vi = P⁻¹λi~vi = λiP⁻¹~vi = λiP⁻¹P~ei = λi~ei.    (20.14)

Therefore, D is a diagonal matrix and is of the form given in (20.10).
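Here is an optional Python sketch of this theorem in action: numpy computes the eigenvectors, we place them as the columns of P, and we check that A = PDP⁻¹. The matrix used is the one from Example 19.25; numpy's normalized eigenvectors are fine here since any nonzero scalar multiples of the eigenvectors work as columns of P.

```python
import numpy as np

A = np.array([[0., -2.],
              [-1., 1.]])

# Columns of P are eigenvectors of A (normalized by numpy, which is fine).
eigenvalues, P = np.linalg.eig(A)
D = np.diag(eigenvalues)

# Verify A = P D P^{-1} as in Theorem 20.8.
print(np.allclose(A, P @ D @ np.linalg.inv(P)))   # True
```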
Theorem 20.15. If an n×n matrix A is similar to an n×n matrix B, then the set of eigenvalues
of A is the same as the set of eigenvalues of B and their multiplicities will also be the same.
Proof. Let P be an invertible matrix such that B = PAP−1. Let ~v be an eigenvector of A with
eigenvalue λ. Then P~v is an eigenvector of B with eigenvalue λ because
BP~v = PAP−1P~v = PA~v = Pλ~v = λP~v. (20.16)
A similar calculation shows that if ~w is an eigenvector of B with eigenvalue µ, then P−1 ~w is an
eigenvector of A with eigenvalue µ.
Remark 20.17. The converse of this Theorem is not true! For example, the matrices

A = [1 1; 0 1]   &   B = [1 0; 0 1]    (20.18)

both have the same eigenvalues (1 with multiplicity 2) but are not similar. To see this, one could
suppose, to the contrary, that there exists an invertible matrix

P = [a b; c d]    (20.19)

such that PA = BP and show that this leads to some contradiction (such as showing that this
system of equations does not have a solution). Another argument avoiding any calculations is
to notice that similarity of matrices is an equivalence relation on the set of square matrices and
diagonalizability is preserved under this equivalence relation. Namely,

i. A is similar to itself,

ii. if A is similar to B, then B is similar to A,

iii. if A is similar to B and B is similar to C, then A is similar to C.

Then, if A is diagonalizable and A is similar to B, then B is diagonalizable. Now, to apply this
to our problem, since B is diagonal, we know that it is diagonalizable. We also
know that A does not have two linearly independent eigenvectors; all of its eigenvectors are scalar
multiples of the vector ~e1. Therefore, A is not diagonalizable and hence A cannot be similar to
B.
Example 20.20. In Example 19.25, we found that the two eigenvalues with their corresponding
eigenvectors of the matrix

A := [0 −2; −1 1]    (20.21)

are

(λ1 = −1, ~v1 = [2; 1])   &   (λ2 = 2, ~v2 = [1; −1]).    (20.22)

As in Theorem 20.8, set

P := [~v1 ~v2] = [2 1; 1 −1]    (20.23)

from which we can calculate the inverse as

P⁻¹ = (1/3)[1 1; 1 −2].    (20.24)

We verify the claim of the theorem:

PDP⁻¹ = [2 1; 1 −1] [−1 0; 0 2] (1/3)[1 1; 1 −2]
      = (1/3)[2 1; 1 −1] [−1 −1; 2 −4]
      = (1/3)[0 −6; −3 3]
      = [0 −2; −1 1] = A.    (20.25)
The following is a more complicated example but illustrates how to actually solve cubic poly-
nomials when calculating eigenvalues.
Example 20.26. Let A be the matrix50

A := [0 1 1; 2 1 2; 3 3 2].    (20.27)

The eigenvalues of A are obtained from solving the polynomial equation

det(A − λ1₃) = det [−λ, 1, 1; 2, 1 − λ, 2; 3, 3, 2 − λ]
  = −λ det [1 − λ, 2; 3, 2 − λ] − 1 det [2, 2; 3, 2 − λ] + 1 det [2, 1 − λ; 3, 3]
  = −λ((1 − λ)(2 − λ) − 6) − (2(2 − λ) − 6) + (6 − 3(1 − λ))
  = −λ³ + 3λ² + 9λ + 5
  = 0.    (20.28)
Solving cubic equations is not impossible, but it is difficult. A good strategy is to guess one solution
and then factor out that term to reduce it to a quadratic polynomial. By inspection, λ = −1 is
one such solution. Knowing this, we can use long division of polynomials
    (λ + 1) | −λ^3 + 3λ^2 + 9λ + 5

to figure out the other factors. To compute this division, one does what one is used to from
ordinary long division by calculating remainders

              −λ^2
    (λ + 1) | −λ^3 + 3λ^2 + 9λ + 5
            −(−λ^3 −  λ^2)
                      4λ^2 + 9λ + 5

Proceeding, we get

              −λ^2 + 4λ
    (λ + 1) | −λ^3 + 3λ^2 + 9λ + 5
            −(−λ^3 −  λ^2)
                      4λ^2 + 9λ + 5
                    −(4λ^2 + 4λ)
                             5λ + 5

until finally

              −λ^2 + 4λ + 5
    (λ + 1) | −λ^3 + 3λ^2 + 9λ + 5
            −(−λ^3 −  λ^2)
                      4λ^2 + 9λ + 5
                    −(4λ^2 + 4λ)
                             5λ + 5
                           −(5λ + 5)
                                  0

From this, we can factor our polynomial equation to the form

    −λ^3 + 3λ^2 + 9λ + 5 = (λ + 1)(−λ^2 + 4λ + 5).                                (20.29)

Now we can apply the quadratic formula to find the roots of −λ^2 + 4λ + 5 and we get

    λ = (−4 ± √(4^2 − 4(−1)(5))) / (2(−1)) = 2 ∓ 3.                               (20.30)

Thus, the polynomial factors into

    −λ^3 + 3λ^2 + 9λ + 5 = (λ + 1)(λ + 1)(−λ + 5) = (λ + 1)^2(−λ + 5)             (20.31)

and the roots of the polynomial, i.e. the eigenvalues of the matrix A, are λ1 = −1, λ2 = −1, λ3 = 5.

50 This is exercise 5.3.11 in [Lay] but without knowledge of any of the eigenvalues.
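As an aside (not for homework), the roots of the cubic, and hence the eigenvalues, can be
double-checked numerically, assuming Python with numpy; this is only a sketch for checking
arithmetic, not a substitute for the long division above.

    import numpy as np

    # Coefficients of -λ^3 + 3λ^2 + 9λ + 5, highest degree first.
    print(np.roots([-1., 3., 9., 5.]))          # approximately 5, -1, -1

    # The same numbers come straight out of the matrix.
    A = np.array([[0., 1., 1.], [2., 1., 2.], [3., 3., 2.]])
    print(np.linalg.eigvals(A))                 # approximately -1, 5, -1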
Problem 20.32. From Example 20.26, calculate the eigenvectors ~v1, ~v2, ~v3 of A associated to the
eigenvalues λ1, λ2, λ3, respectively. Construct the matrix P. Calculate P−1 and verify explicitly
that A = PDP−1.
Answer. We first find the eigenvectors for eigenvalue λ1 = λ2 = −1. We must solve

    [ 0  1  1 ] [ x ]        [ x ]
    [ 2  1  2 ] [ y ] = −1 · [ y ]                                                (20.33)
    [ 3  3  2 ] [ z ]        [ z ]

i.e.

    [ 0−(−1)     1        1    ] [ x ]   [ 1  1  1 ] [ x ]   [ 0 ]
    [    2     1−(−1)     2    ] [ y ] = [ 2  2  2 ] [ y ] = [ 0 ]                (20.34)
    [    3       3      2−(−1) ] [ z ]   [ 3  3  3 ] [ z ]   [ 0 ]

The solutions to this can be written in parametric form as

    [ x ]       [ −1 ]       [ −1 ]
    [ y ] = y · [  1 ] + z · [  0 ]                                               (20.35)
    [ z ]       [  0 ]       [  1 ]

with y and z free variables. Hence, we can take

    ~v1 := [ −1 ]        &        ~v2 := [ −1 ]                                   (20.36)
           [  1 ]                        [  0 ]
           [  0 ]                        [  1 ]
(in fact, you can set ~v1 and ~v2 to be any two linearly independent vectors in this kernel because
the corresponding eigenvalue is the same). For λ3 = 5, the corresponding linear system to solve is
given by

    [ 0−5    1     1   ] [ x ]   [ −5   1   1 ] [ x ]   [ 0 ]
    [  2   1−5     2   ] [ y ] = [  2  −4   2 ] [ y ] = [ 0 ]                     (20.37)
    [  3     3   2−5   ] [ z ]   [  3   3  −3 ] [ z ]   [ 0 ]

The solutions to this are written in parametric form as

    [ x ]       [ 1 ]
    [ y ] = z · [ 2 ]                                                             (20.38)
    [ z ]       [ 3 ]

(I've removed an overall factor of −1/3 which one obtains when row reducing). Hence, we can take

    ~v3 := [ 1 ]                                                                  (20.39)
           [ 2 ]
           [ 3 ]

as an eigenvector of A with eigenvalue λ3 = 5. Therefore,

    P = [ −1  −1   1 ]        &        P^{-1} = (1/6) [ −2   4  −2 ]              (20.40)
        [  1   0   2 ]                                [ −3  −3   3 ]
        [  0   1   3 ]                                [  1   1   1 ]
Furthermore, the equality A = PDP^{-1}, which reads

    [ 0  1  1 ]   [ −1  −1   1 ] [ −1   0  0 ]       [ −2   4  −2 ]
    [ 2  1  2 ] = [  1   0   2 ] [  0  −1  0 ] (1/6) [ −3  −3   3 ]               (20.41)
    [ 3  3  2 ]   [  0   1   3 ] [  0   0  5 ]       [  1   1   1 ]

holds.
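As an aside (again, not for homework), this 3 × 3 decomposition is easy to double-check
numerically, assuming Python with numpy; variable names are mine.

    import numpy as np

    A = np.array([[0., 1., 1.], [2., 1., 2.], [3., 3., 2.]])
    P = np.array([[-1., -1., 1.], [1., 0., 2.], [0., 1., 3.]])  # eigenvectors as columns
    D = np.diag([-1., -1., 5.])

    print(np.allclose(A @ P, P @ D))                    # True: A~vi = λi~vi, column by column
    print(np.allclose(P @ D @ np.linalg.inv(P), A))     # True: A = P D P^{-1}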
Warning: not all matrices are diagonalizable.
Example 20.42. The vertical shear matrix

    S|1 := [ 1  0 ]                                                               (20.43)
           [ 1  1 ]

is not diagonalizable. One way to see this is to use Theorem 20.8. We have already calculated the
eigenvectors of this matrix in Example 19.3. We only found a single eigenvector

    [ 0 ]                                                                         (20.44)
    [ 1 ]

with eigenvalue 1. Since a single vector in R2 cannot span all of R2, this means there aren't 2
linearly independent eigenvectors associated to this matrix. Therefore, S|1 is not diagonalizable.
Remark 20.45. Even though S|1 is not diagonalizable, it can be put into what is called Jordan
normal form, which is the next best thing after diagonal matrices. Namely, every matrix is similar
to a matrix of the form D + N, where D is diagonal and N has all zero entries except possibly
a few 1’s directly above the diagonal (and satisfies other properties). However, there is a caveat,
which is that one must allow for the matrices to be expressed in terms of complex numbers and
to allow for the diagonal matrix D to contain complex numbers as well.
Here’s a theorem that gives a sufficient condition for a matrix to be diagonalizable.
Theorem 20.46. Let A be an n× n matrix with n distinct real eigenvalues. Then A is diagonal-
izable.
Proof. Let λ1, . . . , λn be these eigenvalues with corresponding eigenvectors ~v1, . . . , ~vn. By The-
orem 19.56, the set of vectors ~v1, . . . , ~vn is linearly independent (in fact, they form a basis of Rn
since there are n of them). By Theorem 20.8, A is therefore diagonalizable.
It is not necessary for a matrix A to have distinct eigenvalues to be diagonalizable. A simple
example is a matrix that is already in diagonal form. Once it is known that a matrix A is similar
to a diagonal matrix, it becomes easy to calculate “polynomials” in the matrix A. This is because
if P is an invertible matrix such that A = PDP^{-1}, then

    A^2 = (PDP^{-1})^2 = PD(P^{-1}P)DP^{-1} = PDDP^{-1} = PD^2P^{-1}              (20.47)

and similarly for higher orders

    A^k = (PDP^{-1})^k = PD(P^{-1}P)D(P^{-1}P) · · · (P^{-1}P)DP^{-1} = PD^kP^{-1},   (20.48)

since each factor P^{-1}P in the middle is the identity 1_n. And to compute the power of a diagonal
matrix is a piece of cake:

    [ λ1   0   · · ·  0  ]^k    [ λ1^k   0    · · ·   0   ]
    [ 0    λ2         :  ]      [ 0     λ2^k          :   ]
    [ :        . . .  0  ]   =  [ :           . . .   0   ]                       (20.49)
    [ 0    · · ·  0   λn ]      [ 0     · · ·   0    λn^k ]
Example 20.50. Let A be as in Example 20.20. Then

    A^5 = PD^5P^{-1}

        = [ 2   1 ] [ −1  0 ]^5 (1/3) [ 1   1 ]
          [ 1  −1 ] [  0  2 ]         [ 1  −2 ]

        = [ 2   1 ] [ −1   0 ] (1/3) [ 1   1 ]
          [ 1  −1 ] [  0  32 ]       [ 1  −2 ]

        = (1/3) [ −2   32 ] [ 1   1 ]
                [ −1  −32 ] [ 1  −2 ]

        = (1/3) [  30  −66 ]  =  [  10  −22 ]                                     (20.51)
                [ −33   63 ]     [ −11   21 ]
It is also instructive to write this formula for arbitrary powers. Let λ1 = −1 and λ2 = 2 and fix
n ∈ N. Then

    A^n = PD^nP^{-1}

        = [ 2   1 ] [ λ1^n    0   ] (1/3) [ 1   1 ]
          [ 1  −1 ] [   0   λ2^n  ]       [ 1  −2 ]

        = (1/3) [ 2λ1^n    λ2^n ] [ 1   1 ]
                [  λ1^n   −λ2^n ] [ 1  −2 ]

        = (1/3) [ 2λ1^n + λ2^n    2λ1^n − 2λ2^n ]                                 (20.52)
                [  λ1^n − λ2^n     λ1^n + 2λ2^n ]
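As an aside (not for homework), the closed form (20.52) can be compared with direct matrix
powers, assuming Python with numpy; the helper function below is mine.

    import numpy as np

    A = np.array([[0., -2.], [-1., 1.]])
    l1, l2 = -1., 2.

    def A_power(n):
        # The closed form from (20.52).
        return (1/3) * np.array([[2*l1**n + l2**n, 2*l1**n - 2*l2**n],
                                 [l1**n - l2**n,   l1**n + 2*l2**n]])

    for n in range(1, 8):
        assert np.allclose(A_power(n), np.linalg.matrix_power(A, n))
    print("closed form matches the matrix powers")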
Definition 20.53. Let A be an n × n matrix. Let p be a positive integer. A degree p polynomial
in A is an expression of the form

    t0 1_n + t1 A + t2 A^2 + · · · + tp A^p,                                      (20.54)

where the t's are real numbers.

Theorem 20.55. Let A be a diagonalizable n × n matrix with a decomposition of the form

    A = PDP^{-1},                                                                 (20.56)

where D is diagonal and P is a matrix of eigenvectors of A. Then

    t0 1_n + t1 A + t2 A^2 + · · · + tp A^p = P(t0 1_n + t1 D + t2 D^2 + · · · + tp D^p)P^{-1}.   (20.57)

Proof. This follows from the previous discussion since

    t0 1_n + t1 A + · · · + tp A^p = t0 1_n + t1 PDP^{-1} + t2 (PDP^{-1})^2 + · · · + tp (PDP^{-1})^p
                                   = t0 1_n + t1 PDP^{-1} + t2 PD^2P^{-1} + · · · + tp PD^pP^{-1}
                                   = P(t0 1_n + t1 D + t2 D^2 + · · · + tp D^p)P^{-1}.   (20.58)
As we saw in earlier lectures, some n × n matrices do not have n eigenvalues or a basis of
n eigenvectors. For example, shears in two dimensions only have a one-dimensional eigenspace,
and rotations in two dimensions (through an angle other than 0 or π) have no real eigenvectors
at all! Each of these problems has a resolution, the former in terms of what are called generalized
eigenvectors, and the latter in terms of complex eigenvectors. The former is typically studied in a
more advanced course on linear algebra.
Recommended Exercises. Please check HuskyCT for the homework. Please show your work!
Do not use calculators or computer programs to solve any problems! In this lecture, we covered
Section 5.3.
Spectral decomposition and the Stern-Gerlach experiment
We will now learn about sufficient conditions for a matrix to be diagonalizable as well as a useful
decomposition for any such matrix.
Definition 20.59. An m× n matrix A is symmetric whenever AT = A.
Notice that m = n for symmetric m× n matrices.
A similar definition can be made for complex matrices, but one takes into account the complex
conjugation. Why this is so will be made clear shortly.
Definition 20.60. Let A be an m × n matrix with possibly complex entries

    A = [ a11  a12  · · ·  a1n ]
        [ a21  a22  · · ·  a2n ]                                                  (20.61)
        [  :    :   . . .   :  ]
        [ am1  am2  · · ·  amn ]

The conjugate transpose of A is the n × m matrix

    A† := [ ā11  ā21  · · ·  ām1 ]
          [ ā12  ā22  · · ·  ām2 ]                                                (20.62)
          [  :    :   . . .   :  ]
          [ ā1n  ā2n  · · ·  āmn ]

obtained by transposing A and taking the complex conjugate of every entry (the bar denotes
complex conjugation).
The superscript † is pronounced “dagger.”
Definition 20.63. An m× n matrix A is Hermitian whenever A† = A.
Again, if an m × n matrix A is Hermitian, then m = n. Projections are defined similarly to
before but with this complex conjugation taken into account.
Definition 20.64. A complex m × m matrix P is an orthogonal projection iff P^2 = P and P† = P.
Example 20.65. The three matrices

    σx = [ 0  1 ],     σy = [ 0  −i ],     &     σz = [ 1   0 ]                   (20.66)
         [ 1  0 ]           [ i   0 ]                 [ 0  −1 ]

are all Hermitian. Notice that the eigenvalues of these three matrices are all ±1, which are real.
We will say something about this phenomenon soon.
Definition 20.67. The inner product on Cn is a function

    〈 · , · 〉 : Cn × Cn → C                                                      (20.68)

defined by

    Cn × Cn ∋ (~u, ~v) ↦ 〈~u, ~v〉 := ~u†~v.                                        (20.69)

In terms of components, this says

    〈 (u1, . . . , un) , (v1, . . . , vn) 〉 = ū1 v1 + · · · + ūn vn.              (20.70)
Notice that for any complex number c and for any two vectors ~u and ~v in Cn,

    〈~u, ~v〉 = the complex conjugate of 〈~v, ~u〉                                   (20.71)

and

    〈c~u, ~v〉 = c̄ 〈~u, ~v〉.                                                       (20.72)
Theorem 20.73. Let A be a complex m× n matrix. Then
〈~v, A~w〉 = 〈A†~v, ~w〉 (20.74)
for all vectors ~w in Cn and ~v in Cm.
Proof. A calculation easily proves this.
Theorem 20.75. The eigenvalues of a Hermitian matrix are real.
Proof. Let A be a Hermitian matrix and let λ be an eigenvalue with a corresponding eigenvector
~v. Then

    λ〈~v, ~v〉 = 〈~v, λ~v〉 = 〈~v, A~v〉 = 〈A†~v, ~v〉 = 〈A~v, ~v〉 = 〈λ~v, ~v〉 = λ̄〈~v, ~v〉.   (20.76)

Since ~v is nonzero, 〈~v, ~v〉 is nonzero as well, and this shows that λ = λ̄, i.e. λ is real.
Theorem 20.77. Let A be a Hermitian matrix. Then eigenvectors corresponding to different
eigenvalues are orthogonal. An analogous statement holds for symmetric real matrices provided
one considers real eigenvalues.
Proof. Let λ1 and λ2 be two distinct eigenvalues of A and let ~v1 and ~v2 be corresponding eigen-
vectors. Then
    λ2〈~v1, ~v2〉 = 〈~v1, λ2~v2〉 = 〈~v1, A~v2〉 = 〈A†~v1, ~v2〉 = 〈A~v1, ~v2〉
                = 〈λ1~v1, ~v2〉 = λ̄1〈~v1, ~v2〉 = λ1〈~v1, ~v2〉,                    (20.78)
where in the last step we used the fact that eigenvalues of Hermitian matrices are real. This
calculation shows that
(λ2 − λ1)〈~v1, ~v2〉 = 0, (20.79)
which is only possible if either λ2 − λ1 = 0 or 〈~v1, ~v2〉 = 0. However, by assumption, since λ1 ≠ λ2,
it follows that λ2 − λ1 ≠ 0. Hence 〈~v1, ~v2〉 = 0, which means that ~v1 is orthogonal to ~v2.
Theorem 20.80. A matrix A is Hermitian if and only if there exists a diagonal matrix D with real
entries and a matrix P all of whose columns are orthogonal such that A = PDP^{-1}.
In fact, let A be an n × n Hermitian matrix, let

    D := [ λ1   0   · · ·  0  ]
         [ 0    λ2         :  ]                                                   (20.81)
         [ :        . . .  0  ]
         [ 0    · · ·  0   λn ]

be a diagonal matrix of its eigenvalues, and let

    P := [ ~u1  · · ·  ~un ]                                                      (20.82)

be the matrix of orthonormal eigenvectors (given a matrix P that initially might have all orthogonal
eigenvectors, one can simply scale them so that they have unit length). Then, a quick calculation
shows that

    P† = P^{-1}.                                                                  (20.83)

Using this, the previous theorem says that

    A = [ ~u1  · · ·  ~un ] D [ ~u1  · · ·  ~un ]† = Σ_{k=1}^{n} λk ~uk ~uk†.      (20.84)

Notice that Pk := ~uk ~uk† is an n × n matrix satisfying Pk^2 = Pk and Pk† = Pk. In fact, this operator
is the orthogonal projection operator onto the subspace span{~uk}. Hence,

    A = Σ_{k=1}^{n} λk Pk                                                         (20.85)

provides a formula for the Hermitian matrix A as a weighted sum of projection operators onto
orthogonal one-dimensional subspaces of Cn generated by the eigenvectors of A. This decomposition
is referred to as the spectral decomposition of A.
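As an aside (not for homework), the spectral decomposition can be reproduced numerically,
assuming Python with numpy; numpy's eigh routine is its eigensolver for Hermitian matrices, and
the reconstruction below is a sketch of formula (20.85), with variable names of my own choosing.

    import numpy as np

    sigma_y = np.array([[0., -1j], [1j, 0.]])

    # eigh returns real eigenvalues and orthonormal eigenvectors (as columns of U).
    lams, U = np.linalg.eigh(sigma_y)

    # Rebuild sigma_y as a weighted sum of rank-one orthogonal projections, as in (20.85).
    recon = sum(lam * np.outer(U[:, k], U[:, k].conj()) for k, lam in enumerate(lams))
    print(np.allclose(recon, sigma_y))    # True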
Example 20.86. Consider the matrix

    σy = [ 0  −i ]                                                                (20.87)
         [ i   0 ]

from Example 20.65. The eigenvalues are λ↑y = +1 and λ↓y = −1 with corresponding eigenvectors
given by

    ~v↑y = [ 1 ]        &        ~v↓y = [ i ]                                     (20.88)
           [ i ]                        [ 1 ]

respectively. Associated normalized eigenvectors are

    ~u↑y = (1/√2) [ 1 ]        &        ~u↓y = (1/√2) [ i ]                        (20.89)
                  [ i ]                               [ 1 ]
Then, the orthogonal matrix Py that diagonalizes σy is given by

    Py = [ 1/√2   i/√2 ]                                                          (20.90)
         [ i/√2   1/√2 ]

as we can check51

    Py [ λ↑y   0  ] Py† = (1/2) [ 1  i ] [ 1   0 ] [  1  −i ] = [ 0  −i ] = σy     (20.91)
       [  0   λ↓y ]             [ i  1 ] [ 0  −1 ] [ −i   1 ]   [ i   0 ]

and the projection matrices P↑y and P↓y that project onto the eigenspaces span(~u↑y) and span(~u↓y),
respectively, are given by (each calculated in two different ways to illustrate the possible methods)52
    P↑y = ~u↑y ~u↑y† = ( (1/√2) [ 1 ] ) ( (1/√2) [ 1  −i ] ) = [ 1/2  −i/2 ]       (20.92)
                                [ i ]                          [ i/2   1/2 ]

and53

    P↓y = [ 〈~u↓y, ~e1〉 ~u↓y    〈~u↓y, ~e2〉 ~u↓y ]
        = [ (−i/√2) ~u↓y    (1/√2) ~u↓y ]  =  (1/2) [  1  i ]                      (20.93)
                                                    [ −i  1 ]

Therefore, the matrix σy decomposes as

    σy = λ↑y P↑y + λ↓y P↓y = 1 · [ 1/2  −i/2 ] − 1 · [  1/2   i/2 ].               (20.94)
                                 [ i/2   1/2 ]       [ −i/2   1/2 ]
Example 20.95. For the record, consider the matrix σz from Example 20.65. The eigenvalues
are λ↑z = +1 and λ↓z = −1 with corresponding normalized eigenvectors given by

    ~u↑z = [ 1 ]        &        ~u↓z = [ 0 ]                                     (20.96)
           [ 0 ]                        [ 1 ]

respectively. The projection operators onto these eigenspaces are easy to read off because the
matrix σz is already in diagonal form. These projections are

    P↑z = [ 1  0 ]        &        P↓z = [ 0  0 ]                                 (20.97)
          [ 0  0 ]                       [ 0  1 ]
51 For a complex matrix P with orthogonal columns, the inverse P^{-1} is given by P† instead of P^T, which is what
happens when P is real.
52 In Dirac bra-ket notation, this reads P↑y = |~u↑y〉〈~u↑y|.
53 Remember, the matrix A associated to a linear transformation C2 T←− C2 is given by A = [ T(~e1)  T(~e2) ], the
matrix whose columns are T(~e1) and T(~e2).
Example 20.98 (The Stern-Gerlach experiment and quantum mechanics). Consider the following
experiment where (light) classical magnets are sent through a specific type of magnetic field (fixed
throughout the experiment). Depending on the orientation of the magnet, the deflection will be
distributed continuously according to this orientation. However, if a silver atom is sent through
the same apparatus, its deflection is discrete: it is deflected either only up or only down. The silver
atoms are shot out of an oven so that their internal properties are distributed as uniformly as
possible.
Watch video at: https://upload.wikimedia.org/wikipedia/commons/9/9e/Quantum_spin_
and_the_Stern-Gerlach_experiment.ogv
Let us visualize this Stern-Gerlach experiment by the following schematic cartoon.

    [Diagram: Ag oven → SG-Z apparatus → 50% ↑ along z, 50% ↓ along z]
Now, imagine a second experiment where we isolate the silver atoms that were deflected up along
the z axis and we send those atoms through a Stern-Gerlach experiment oriented along the y axis.
Experimentally, we find that on average 50% of the atoms are deflected in the positive y direction
and 50% in the negative y direction with the same magnitude as the deflection in the z direction
from the first experiment.
    [Diagram: Ag oven → SG-Z (↑ along z beam kept, ↓ along z beam discarded) → SG-Y → 50% ↑ along y, 50% ↓ along y]
What do you think happens if we take the atoms that were deflected in the positive z direction first,
then deflected in the positive y direction, and we send these through yet another Stern-Gerlach
experiment oriented again back along the z direction?
    [Diagram: Ag oven → SG-Z (keep ↑ z) → SG-Y (keep ↑ y) → SG-Z → 50% ↑ along z, 50% ↓ along z]
It turns out that not all of them will still deflect in the positive z direction. Instead, 50% of
these atoms will be deflected in the positive z direction and 50% in the negative z direction!
Preposterous! How can we possibly explain this phenomenon?
Because we only see two possibilities in the deflection of the silver atoms, we postulate that the
state corresponding to these two possibilities is described by a normalized complex 2-component
vector.
    ~ψ := [ ψ1 ]                                                                  (20.99)
          [ ψ2 ]
We postulate that an atom that gets deflected in the positive z direction corresponds to an eigen-
vector of the σz operator with eigenvalue +1 and an atom that gets deflected in the negative z
direction corresponds to an eigenvector of the σz operator with eigenvalue −1. This is a weird pos-
tulate (what do vectors in complex space have anything to do with reality!?), but let us see where
it takes us. Furthermore, we postulate that blocking out states that get deflected in a positive
or negative direction corresponds to projecting the state onto the eigenspace that is allowed to
pass through. Finally, we postulate that the probability of a state ~ψ as above to be measured in
another state
    ~φ := [ φ1 ]                                                                  (20.100)
          [ φ2 ]

is given by

    |〈~φ, ~ψ〉|^2.                                                                 (20.101)
We now rely on our analysis from the previous examples and use the notation set therein. We
interpret ~u↑z and ~u↓z as the states (silver atoms) that are deflected up and down along the z axis,
respectively. Similarly, ~u↑y and ~u↓y are the states that are deflected up and down along the y axis,
respectively. Because initially the oven provides the experimentally observed fact that half of the
particles get deflected (up) along the positive z direction and half of them get deflected (down)
along the negative z direction, any vector of the form

    ~ψ = (e^{iθ}/√2) [ 1 ]                                                        (20.102)
                     [ 1 ]

suffices to describe the initial state (in fact, we can ignore the e^{iθ} and set θ = 0). This is because

    |〈~u↑z, ~ψ〉|^2 = 1/2        &        |〈~u↓z, ~ψ〉|^2 = 1/2.                    (20.103)
    [Diagram: Ag oven emits ~ψ → SG-Z → 50% ↑ along z, 50% ↓ along z]
By selecting only the silver atoms that are deflected up along the z axis, we project our initial state
onto this state

    P↑z ~ψ = [ 1  0 ] ( (1/√2) [ 1 ] ) = (1/√2) [ 1 ] = (1/√2) ~u↑z                (20.104)
             [ 0  0 ]          [ 1 ]            [ 0 ]
    [Diagram: Ag oven emits ~ψ → SG-Z → upper beam in state P↑z~ψ kept, ↓ along z beam blocked]
and then we normalize this state to ~u↑z so that our probabilistic interpretation will still hold. In
the second experiment, this state gets sent through the Stern-Gerlach apparatus oriented along
the y direction. Let us check that indeed half of the particles are deflected up along the y direction
and the other half are deflected down:

    |〈~u↑y, ~u↑z〉|^2 = 1/2        &        |〈~u↓y, ~u↑z〉|^2 = 1/2.                 (20.105)
    [Diagram: Ag oven emits ~ψ → SG-Z → keep P↑z~ψ → SG-Y → 50% ↑ along y, 50% ↓ along y]
So far so good. Now let’s put these postulates to the ultimate test of projecting our resulting state
onto the up state along the y direction to obtain
    P↑y ~u↑z = (1/√2) ~u↑y                                                        (20.106)
    [Diagram: Ag oven emits ~ψ → SG-Z → keep P↑z~ψ → SG-Y → keep P↑y P↑z~ψ, ↓ along y beam blocked]
and again normalize to obtain just ~u↑y. Now let’s send these silver atoms back to another Stern-
Gerlach apparatus oriented along the z direction. So what does the math tell us? The probabilities
of this resulting state deflecting either up or down along the z axis after going through all these
experiments is

    |〈~u↑z, ~u↑y〉|^2 = 1/2        &        |〈~u↓z, ~u↑y〉|^2 = 1/2,                 (20.107)
    [Diagram: Ag oven emits ~ψ → SG-Z → keep P↑z~ψ → SG-Y → keep P↑y P↑z~ψ → SG-Z → 50% ↑ along z, 50% ↓ along z]
which beautifully agrees with the experiment! Is all of this a coincidence? We think not. And
it turns out that the study of abstract vector spaces is intimately related to other phenomena
centered around the subject called quantum mechanics. The effect described in this example is
due to the spin of an electron (in the case of silver, this is due to the single valence electron), a
feature inherent to particles and described adequately in the theory of quantum mechanics.
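As an aside (not for homework), the whole chain of projections and probabilities above can be
reproduced numerically, assuming Python with numpy; np.vdot computes the conjugated inner
product 〈~φ, ~ψ〉 = ~φ†~ψ, and the variable names are mine.

    import numpy as np

    u_up_z = np.array([1., 0.])                    # ~u↑z
    u_up_y = np.array([1., 1j]) / np.sqrt(2)       # ~u↑y
    psi    = np.array([1., 1.]) / np.sqrt(2)       # initial state from the oven

    def prob(phi, psi):
        # |<phi, psi>|^2 with the complex inner product
        return abs(np.vdot(phi, psi))**2

    print(prob(u_up_z, psi))        # 0.5 : first SG-Z apparatus
    print(prob(u_up_y, u_up_z))     # 0.5 : SG-Y applied to the up-along-z state
    print(prob(u_up_z, u_up_y))     # 0.5 : final SG-Z applied to the up-along-y state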
Recommended Exercises. Please check HuskyCT for the homework. Please show your work!
Do not use calculators or computer programs to solve any problems! In this lecture, we covered
Section 7.1 and some additional material based on Chapter 1 of [5].
21 Solving ordinary differential equations
Let us now study an application of diagonalization to the problem of a harmonic oscillator with
friction. Consider the ordinary differential equation in R2 of the form
    dx/dt = v
                                                                                  (21.1)
    dv/dt = −(k/m) x − γ v

with initial condition (x0, v0) and m > 0 and k, γ ≥ 0. This corresponds to the second order linear
differential equation given by

    d^2x/dt^2 + γ dx/dt + (k/m) x = 0.                                            (21.2)
This system describes a one-dimensional oscillator with friction. The x variable is interpreted as
the position and the v variable as the velocity, k is a constant dictating the strength of the spring,
m is the mass of the particle that is affected by the spring force and the frictional force, and γ is a
constant related to the strength of the frictional force. γ = 0 corresponds to no friction and k = 0
corresponds to having no spring. The name for this model is the damped harmonic oscillator. We
will not plug in numbers for k,m, and γ because we want to study what happens to the motion
depending on how strong the frictional constant is compared to the spring constant and mass. The
matrix associated to this system is

    A = [   0      1 ]                                                            (21.3)
        [ −k/m    −γ ]

because then the differential equation (21.1) can be expressed as a vector equation

    (d/dt) [ x ] = [   0      1 ] [ x ]                                           (21.4)
           [ v ]   [ −k/m    −γ ] [ v ]
More generally, if ~x is a vector in Rm and A is an m×m matrix, then
    (d/dt) ~x = A~x                                                               (21.5)
describes an ordinary differential equation (ODE) and if the initial condition is ~x0 := ~x(0), then
the solution is given by
~x(t) = exp(tA)~x0. (21.6)
Remember, the exponential of a square matrix B is defined to be
    exp(B) := Σ_{n=0}^{∞} B^n / n!.                                               (21.7)
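As an aside (not for homework), the solution formula (21.6) can be tried out numerically before
we verify it by hand, assuming Python with numpy and scipy; scipy.linalg.expm computes the
matrix exponential, and the sample values of k, m, γ below are arbitrary choices of mine.

    import numpy as np
    from scipy.linalg import expm   # matrix exponential

    k, m, gamma = 1.0, 1.0, 0.5     # sample values, chosen only for illustration
    A = np.array([[0., 1.], [-k/m, -gamma]])
    x0 = np.array([1., 0.])         # initial position 1, initial velocity 0

    for t in (0.0, 1.0, 2.0):
        print(t, expm(t * A) @ x0)  # the solution ~x(t) = exp(tA) ~x0 at time t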
Let us check this claim informally by differentiating54

    (d/dt) ~x = (d/dt) ( exp(tA) ~x0 )
              = (d/dt) ( Σ_{n=0}^{∞} (t^n A^n / n!) ~x0 )
              = Σ_{n=0}^{∞} (d/dt) ( (t^n A^n / n!) ~x0 )
              = Σ_{n=1}^{∞} (t^{n−1} A^n / (n−1)!) ~x0                            (21.8)
              = A Σ_{n=1}^{∞} (t^{n−1} A^{n−1} / (n−1)!) ~x0
              = A exp(tA) ~x0
              = A ~x
and

    ~x(0) = exp(0A) ~x0 = 1_m ~x0 = ~x0                                           (21.9)

so that this is indeed the solution. However, as we know, if we could diagonalize A via some matrix
P as A = PDP^{-1}, then computing the exponential would be vastly simplified. This is because

    exp(tPDP^{-1}) = Σ_{n=0}^{∞} (tPDP^{-1})^n / n! = Σ_{n=0}^{∞} t^n PD^nP^{-1} / n!
                   = P ( Σ_{n=0}^{∞} t^n D^n / n! ) P^{-1} = P exp(tD) P^{-1}.    (21.10)
So, if A is diagonalizable, and let's say the matrix D is of the form

    D = [ λ1   0 ]                                                                (21.11)
        [  0  λ2 ]

with λ1 and λ2 the eigenvalues of A, then the solution to our ODE is given simply by

    ~x(t) = exp(tA) ~x0 = P exp(tD) P^{-1} ~x0 = P [ e^{λ1 t}     0     ] P^{-1} ~x0.   (21.12)
                                                   [    0      e^{λ2 t} ]
Notice how much simpler the solution has become. Rather than calculating an infinite sum of
increasingly complicated products, we only have to calculate three matrix products and then act
on the vector to get the equation for ~x as a function of time. We will now work out this general
theory for the example ODE given by (21.1). The eigenvalues of this system are therefore given
by solving
    0 = det [  −λ        1    ] = λ(γ + λ) + k/m = λ^2 + γλ + k/m,                (21.13)
            [ −k/m    −γ − λ  ]
54One subtle point is regarding convergence, but let us take for granted that this exponential function acting on
matrices converges uniformly on any compact interval in time. This fact allows us to bring in the derivative with
respect to t into the summation in the third line.
which has solutions

    λ = −γ/2 ± (1/2)√(γ^2 − 4k/m).                                                (21.14)

Let's use the notation

    λ1 := (1/2)( −γ − √(γ^2 − 4k/m) )    &    λ2 := (1/2)( −γ + √(γ^2 − 4k/m) )   (21.15)
for these two possibilities for the eigenvalues. The corresponding eigenvectors would therefore be
solutions to the augmented matrix problem

    [  −λ        1     | 0 ]
    [ −k/m    −γ − λ   | 0 ]                                                      (21.16)

for each λ. Rather than plugging in these values right away, we can first simplify. Using the first
row, we find that

    −λx + y = 0    ⟺    y = λx                                                    (21.17)

so that we can set x = 1 to obtain a pair of eigenvectors of the form

    ~v1 = [  1 ]        &        ~v2 = [  1 ]                                     (21.18)
          [ λ1 ]                       [ λ2 ]
for λ1 and λ2, respectively. Note that if λ1 = λ2, we will not be able to find a second eigenvector.
We will discuss this case later. In the case that λ1 ≠ λ2, the similarity transformation matrix
corresponding to this choice of eigenvectors is given by
    P = [  1    1 ]                                                               (21.19)
        [ λ1   λ2 ]

and the inverse of this matrix is

    P^{-1} = (1/(λ2 − λ1)) [  λ2  −1 ]                                            (21.20)
                           [ −λ1   1 ]
For what follows, it helps to note that

    λ1 λ2 = (1/4)( γ + √(γ^2 − 4k/m) )( γ − √(γ^2 − 4k/m) ) = k/m                 (21.21)

and

    λ1 + λ2 = −γ.                                                                 (21.22)
With these facts, you can check (do it!) that this similarity transformation enables us to express
A as

    [   0      1 ] = [  1    1 ] [ λ1   0 ] (1/(λ2 − λ1)) [  λ2  −1 ]             (21.23)
    [ −k/m    −γ ]   [ λ1   λ2 ] [  0  λ2 ]               [ −λ1   1 ]
Therefore, the exponential of this matrix with the variable t appended to it is given by

    exp(tA) = exp(tPDP^{-1}) = P exp(tD) P^{-1}
            = [  1    1 ] [ e^{λ1 t}     0     ] (1/(λ2 − λ1)) [  λ2  −1 ]         (21.24)
              [ λ1   λ2 ] [    0      e^{λ2 t} ]               [ −λ1   1 ]
Multiplying this expression out, we get

    exp(tA) = (1/(λ2 − λ1)) [    e^{λ1 t}        e^{λ2 t}   ] [  λ2  −1 ]
                            [ λ1 e^{λ1 t}    λ2 e^{λ2 t}   ] [ −λ1   1 ]

            = (1/(λ2 − λ1)) [ λ2 e^{λ1 t} − λ1 e^{λ2 t}        e^{λ2 t} − e^{λ1 t}    ]
                            [ λ1 λ2 (e^{λ1 t} − e^{λ2 t})   λ2 e^{λ2 t} − λ1 e^{λ1 t} ]

            = (1/(λ2 − λ1)) [ λ2 e^{λ1 t} − λ1 e^{λ2 t}        e^{λ2 t} − e^{λ1 t}    ]   (21.25)
                            [ (k/m)(e^{λ1 t} − e^{λ2 t})    λ2 e^{λ2 t} − λ1 e^{λ1 t} ]
If γ > 0, there are three cases to consider depending on the value of γ^2 − 4k/m, which appears inside
a square root. Either this is negative (complex eigenvalues), positive (real eigenvalues), or zero
(degenerate case when λ1 = λ2). We will discuss these three different cases one at a time in order.
i. (γ^2 < 4k/m, the underdamped case) When γ = 1 and k = m, the vector field associated to this
   system looks like (after rescaling)

    [Figure: the vector field in the (x, v) plane, with x and v each ranging from −2 to 2; the arrows spiral in toward the origin]
In this graph, we are plotting the vector (d/dt)(x(t), v(t)) at each point (x(t), v(t)). In this case, the
eigenvalues are complex and distinct. It is convenient to set

    α := γ/2        &        ω := (1/2)√(4k/m − γ^2)                              (21.26)

so that

    λ1 = −α − iω        &        λ2 = −α + iω.                                    (21.27)
With these substitutions, the solution becomes

    exp(tA) = (1/(2iω)) [ e^{−αt}((iω − α)e^{−iωt} + (iω + α)e^{iωt})      e^{−αt}(e^{iωt} − e^{−iωt})               ]
                        [ (k/m) e^{−αt}(e^{−iωt} − e^{iωt})               e^{−αt}((iω − α)e^{iωt} + (iω + α)e^{−iωt}) ]

            = (1/(2iω)) [ e^{−αt}(2iω cos(ωt) + 2iα sin(ωt))      2i e^{−αt} sin(ωt)                ]
                        [ −2i (k/m) e^{−αt} sin(ωt)               e^{−αt}(2iω cos(ωt) − 2iα sin(ωt)) ]

            = [ e^{−αt}(cos(ωt) + (α/ω) sin(ωt))        (1/ω) e^{−αt} sin(ωt)             ]          (21.28)
              [ −(k/(mω)) e^{−αt} sin(ωt)               e^{−αt}(cos(ωt) − (α/ω) sin(ωt))  ]

In this derivation, we have used the fact that

    e^{iωt} = cos(ωt) + i sin(ωt).                                                (21.29)
Therefore, plugging in this result to our initial condition gives the complete trajectory both
in position and velocity

    [ x(t) ]   [ e^{−αt}(cos(ωt) + (α/ω) sin(ωt))        (1/ω) e^{−αt} sin(ωt)            ] [ x0 ]
    [ v(t) ] = [ −(k/(mω)) e^{−αt} sin(ωt)               e^{−αt}(cos(ωt) − (α/ω) sin(ωt)) ] [ v0 ]

             = [ x0 e^{−αt}(cos(ωt) + (α/ω) sin(ωt)) + (v0/ω) e^{−αt} sin(ωt)             ]          (21.30)
               [ −(k x0/(mω)) e^{−αt} sin(ωt) + v0 e^{−αt}(cos(ωt) − (α/ω) sin(ωt))       ]
For example, with initial conditions x0 = 1 and v0 = 0, which corresponds to letting the
oscillator go without giving it an initial velocity, the position as a function of time is given by
x(t) = e^{−αt}(cos(ωt) + (α/ω) sin(ωt)) (with γ = 2 and k = 2m, so that α = 1 and ω = 1 in the
graph below).

    [Figure: plot of x(t) = e^{−αt}(cos(ωt) + (α/ω) sin(ωt)) for 0 ≤ t ≤ 8, a damped oscillation starting at x = 1]
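As an aside (not for homework), the closed form (21.30) can be compared against a direct
numerical matrix exponential, assuming Python with numpy and scipy; the parameter values match
the graph above, and the variable names are mine.

    import numpy as np
    from scipy.linalg import expm

    k, m, gamma = 2.0, 1.0, 2.0                          # so that alpha = 1 and omega = 1
    alpha, omega = gamma/2, 0.5*np.sqrt(4*k/m - gamma**2)
    A = np.array([[0., 1.], [-k/m, -gamma]])
    x0, v0 = 1.0, 0.0

    for t in np.linspace(0., 8., 5):
        x_closed = np.exp(-alpha*t)*(np.cos(omega*t) + (alpha/omega)*np.sin(omega*t))*x0 \
                   + (1/omega)*np.exp(-alpha*t)*np.sin(omega*t)*v0
        x_expm = (expm(t*A) @ np.array([x0, v0]))[0]
        print(round(x_closed, 6), round(x_expm, 6))      # the two columns agree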
ii. (γ^2 > 4k/m, the overdamped case) This case is left as an exercise.
iii. (γ^2 = 4k/m, the critically damped case) This is the degenerate case where the matrix for the
     system is not diagonalizable. Let us again use the notation α := γ/2. The vector field associated
     to this system looks like (after rescaling)

    [Figure: the vector field in the (x, v) plane, with x and v each ranging from −2 to 2]
Nevertheless, it can be put into Jordan normal form,55 and a matrix that accomplishes this is

    P = [ −α^{-1}   −α^{-2} ]                                                     (21.31)
        [    1         0    ]

with a Jordan matrix given by

    J = [ −α    1 ]                                                               (21.32)
        [  0   −α ]

A matrix J is in Jordan normal form iff it is of the form

    J = [ λ1  ∗                       ]
        [     ...  ∗                  ]
        [         λ1                  ]
        [             ...             ]                                           (21.33)
        [                 λk  ∗       ]
        [                     ...  ∗  ]
        [                         λk  ]

where the ∗'s are either 0 or 1 and every other off-diagonal term is 0. The inverse of P is given by

    P^{-1} = α^2 [  0    α^{-2} ] = [   0     1  ]                                 (21.34)
                 [ −1   −α^{-1} ]   [ −α^2   −α  ]
55You will not be responsible for knowing how to find a Jordan normal decomposition. This case is merely
worked out to illustrate that the degenerate case can still be solved using matrix methods.
Indeed, you can check that A = PJP^{-1}. The exponential of tJ is given by

    exp(tJ) = exp( [ −αt    t  ] )
                   [   0  −αt  ]

            = [ 1  0 ] + [ −αt    t  ] + (1/2!) [ (−αt)^2   −2αt^2  ] + (1/3!) [ (−αt)^3   3α^2 t^3 ] + · · ·
              [ 0  1 ]   [   0  −αt  ]          [    0      (−αt)^2 ]          [    0      (−αt)^3  ]

            = [ 1 + (−αt) + (−αt)^2/2! + (−αt)^3/3! + · · ·     t(1 + (−αt) + (−αt)^2/2! + · · ·)         ]
              [                0                                1 + (−αt) + (−αt)^2/2! + (−αt)^3/3! + · · · ]

            = [ e^{−αt}   t e^{−αt} ]                                             (21.35)
              [    0        e^{−αt} ]

for all t ∈ R. Therefore,

    exp(tA) = [ −α^{-1}   −α^{-2} ] [ e^{−αt}   t e^{−αt} ] [   0     1  ]
              [    1         0    ] [    0        e^{−αt} ] [ −α^2   −α  ]

            = [ −α^{-1}   −α^{-2} ] [ −α^2 t e^{−αt}    (1 − αt) e^{−αt} ]
              [    1         0    ] [ −α^2 e^{−αt}         −α e^{−αt}    ]

            = [ (1 + αt) e^{−αt}        t e^{−αt}     ]                            (21.36)
              [ −α^2 t e^{−αt}      (1 − αt) e^{−αt}  ]
Notice that there are no oscillatory terms and any initial configuration with zero velocity
asymptotically approaches 0 without passing through 0. When α = 1, and x0 = 1 and v0 = 0,
the trajectory looks like
    [Figure: plot of x(t) = (1 + t)e^{−t} for 0 ≤ t ≤ 8, decaying monotonically from x = 1 toward 0]
However, if there is an initial sufficiently strong “kick,” the trajectory will pass through the
origin once (but only once!). Such a trajectory occurs, for instance, when x0 = 1 and v0 = −2
(with k = m and γ = 2) and is depicted here
    [Figure: plot of x(t) = (1 − t)e^{−t} for 0 ≤ t ≤ 8, starting at x = 1, crossing zero at t = 1, then approaching 0 from below]
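As an aside (not for homework), the critically damped formulas can also be checked against the
matrix exponential, assuming Python with numpy and scipy; the two trajectories below are the
two initial conditions discussed above, and the variable names are mine.

    import numpy as np
    from scipy.linalg import expm

    alpha = 1.0
    A = np.array([[0., 1.], [-alpha**2, -2*alpha]])      # gamma = 2*alpha and k/m = alpha^2

    for t in (0.5, 1.0, 2.0, 4.0):
        x1 = (expm(t*A) @ np.array([1., 0.]))[0]         # released from rest: stays positive
        x2 = (expm(t*A) @ np.array([1., -2.]))[0]        # strong kick: crosses zero exactly once
        print(t, round(x1, 4), round((1+t)*np.exp(-t), 4),
                 round(x2, 4), round((1-t)*np.exp(-t), 4))   # numerical and closed forms agree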
Recommended Exercises. Please check HuskyCT for the homework. Please show your work!
Do not use calculators or computer programs to solve any problems! In this lecture, we covered
Section 5.7.
22 Abstract vector spaces
The applicability of the subject matter of linear algebra goes far beyond the analysis of vectors
in finite-dimensional real (and complex) Euclidean space. Many of the constructions, definitions,
and arguments we gave had nothing to do with the specific structure associated with Euclidean
space. Take for example the definition of a linear transformation. A function Rm T←− Rn is a linear
transformation iff it satisfies
    T(~u + ~v) = T(~u) + T(~v)        &        T(c~v) = cT(~v)                     (22.1)
for all numbers c and vectors ~u,~v. The definition looks the same whether we use a function
R5 T←− R8 or a function R19 T←− R34. Furthermore, the only structure needed to write down the
above condition is that of being able to add vectors and the ability to multiply vectors by scalars.
Before giving the abstract definitions of vector spaces, linear transformations, etc., an example
should help motivate why we might want to do this.
Example 22.2. A real degree m polynomial is a function p : R → R of a single real variable of
the form

    p(x) = a0 + a1 x + a2 x^2 + · · · + am x^m,                                   (22.3)

where a0, a1, a2, . . . , am are real numbers. A degree zero polynomial is a constant, a degree one
polynomial is a linear function, and a degree two polynomial is a quadratic function. The sum of
a degree m polynomial p as above and a degree n polynomial q given by

    q(x) = b0 + b1 x + b2 x^2 + · · · + bn x^n                                    (22.4)

is the degree M := max{m, n} polynomial p + q, which is given by

    (p + q)(x) := p(x) + q(x) ≡ (a0 + b0) + (a1 + b1)x + (a2 + b2)x^2 + · · · + (aM + bM)x^M.   (22.5)

In this notation, if n > m, then am+1, . . . , aM are defined to be zero (and vice versa if m > n). If
c is a real number, then cp is the degree m polynomial given by

    (cp)(x) := cp(x) ≡ ca0 + ca1 x + ca2 x^2 + · · · + cam x^m.                    (22.6)
Therefore, polynomials can be added and scaled just as vectors in Euclidean space. Moreover, the
zero polynomial, which is just the polynomial whose coefficients are all zero, i.e. it’s the constant
function whose value is always zero, satisfies the same conditions as the zero vector does. Namely,
if you add it to any polynomial, you get that polynomial back. If you wanted to make sense of
a basis for all polynomials, you would want to be sure that any polynomial can be written as a
unique linear combination of the polynomials of said basis. One such basis could be the set of
monomials, {p0, p1, p2, p3, . . . }, where

    pk(x) := x^k.                                                                 (22.7)
Notice that this set of monomials has infinitely many polynomials. Such a set of monomials
spans the set of all polynomials because every polynomial is a (finite) linear combination of such
monomials. Furthermore, the set of these monomials is linearly independent because no monomial
is a (finite) linear combination of other monomials. An example of a linear transformation on the
set of polynomials is the derivative operation. Recall, for k > 0,
    p′k(x) ≡ (d/dx)(pk(x)) ≡ (d/dx)(x^k) = k x^{k−1}.                             (22.8)

What about

    (d/dx)(x^k + x^l)        &        (d/dx)(c x^k)?                              (22.9)

This is something usually covered in a course on calculus, but from the definition of the derivative
(in terms of limits), you can prove that

    (d/dx)(x^k + x^l) = (d/dx)(x^k) + (d/dx)(x^l)    &    (d/dx)(c x^k) = c (d/dx)(x^k).   (22.10)

This is true no matter what polynomial you plug in:

    (d/dx)(p(x) + q(x)) = (d/dx)p(x) + (d/dx)q(x)    &    (d/dx)(c p(x)) = c (d/dx)(p(x)).   (22.11)
This just means that taking the derivative is a linear transformation. But to make sense of it being
a linear transformation, we have to specify its source and target. What are we allowed to take
derivatives of in this case? If we write P as the set of all polynomials, then we can think of the
derivative as a function d/dx : P → P on polynomials. We could analyze this linear transformation
in many ways. For instance, what is its kernel? The analogue of the definition of kernel would say
that

    ker(d/dx) = { p ∈ P : (d/dx)p(x) = 0 },                                       (22.12)

so the set of all polynomials whose first derivatives are the zero function. The only polynomials
whose derivatives are the zero function are exactly the constant functions. Therefore,56

    ker(d/dx) = span{p0}                                                          (22.13)

since p0 is the function representing p0(x) = x^0 = 1. What is the range/image of d/dx?57 By
definition, the range/image would be

    image(d/dx) = { (d/dx)q(x) : q ∈ P },                                         (22.14)
56 It is more precise to write span{p0} instead of span{1} because we want to make sure we understand that
an entire function is being represented inside the span and not just a number. For example, span{x} could be
interpreted as cx for all numbers c ∈ R, but if x is a fixed number (and not a variable), then this represents just all
possible numbers. However, span{p1}, where p1 is the function defined by p1(x) = x for all x, is more accurate and
less ambiguous. It might seem strange to write it this way if you are not used to thinking of functions as rules. The
main point is that f(x) is not a function—f(x) is the value of the function f at x. I might be sloppy on occasion,
however, and call f(x) the function even though it's really f because this tends to confuse most students (even
though it confuses me!).
57 Notice, by the way, that we can't use the terminology "column space" because we do not have a matrix
anymore.
i.e. the derivatives of all possible polynomials. Is d/dx surjective/onto? To answer this, we would
try to solve the following problem. If p(x) = a0 + a1 x + a2 x^2 + · · · + am x^m is some polynomial,
can we find a polynomial q whose derivative is p? As an equation, we want to find a q such that

    (d/dx) q(x) = a0 + a1 x + a2 x^2 + · · · + am x^m.                            (22.15)

Finding such a polynomial can be obtained by integrating p. Namely, take q to be

    q(x) = a0 x + (a1/2) x^2 + (a2/3) x^3 + · · · + (am/(m+1)) x^{m+1}.           (22.16)

Then, you can check that (d/dx)q(x) = p(x) for all x ∈ R. Of course, there are many other possibilities,
one for each integration constant. Therefore, because we can always integrate polynomials,

    image(d/dx) = P,                                                              (22.17)

the image of d/dx is actually all of P, i.e. d/dx
is onto/surjective. The set of polynomials with their usual
notion of addition and scalar multiplication forms what is known as a vector space, a mathematical
object that formalizes the structure of vector addition and scalar multiplication along with its
properties resembling those of Euclidean vectors and polynomials.
Definition 22.18. A real (complex) vector space consists of a set V, the elements of which are
called vectors, together with
i) a binary operation + : V × V → V, called addition and written as ~u+ ~v
ii) a vector ~0 in V, called the zero vector
iii) a function R× V → V (C× V → V ) called scalar multiplication and written as c~u
satisfying
(a) addition is commutative: ~u+ ~v = ~v + ~u
(b) addition is associative: ~u+ (~v + ~w) = (~u+ ~v) + ~w
(c) the zero vector is an identity for addition: ~u+~0 = ~u
(d) addition is invertible: for each vector ~u, there is a vector ~v such that ~u+ ~v = ~0
(e) scalar multiplication is distributive over vector addition: c(~u+ ~v) = c~u+ c~v
(f) scalar multiplication is distributive over scalar addition: (c+ d)~u = c~u+ d~u
(g) scalar multiplication is distributive over itself: c(d~u) = (cd)~u
(h) the scalar unit axiom: 1~u = ~u
If in a particular definition or statement it does not matter whether the real or complex numbers
are used, the terminology “vector space” will be used more generically.58 Depending on the context,
we might not write arrows over our vectors. For example, we usually do not write arrows over
polynomials and other functions, which can be viewed as vectors in some vector space.
To understand why definitions are the way they are written, imagine trying to define a piece
of chalk.59 What is a piece of chalk other than the properties and structure that characterize
it? For example, chalk is made mostly of calcium, but this is not enough to define it. It serves the
function of a writing utensil on certain surfaces such as blackboards. It can come in a variety of
colors. It is often in cylindrical shape. To really specify what chalk is, we must keep describing
its characterizing features. However, we do not want to specify so much that we identify only one
particular chalk in the universe. Instead, we wish to identify the characterizing features so that
any chalk can be placed into this set, but so that nothing else is in this set. Identifying these
characterizing features is what goes behind setting up a mathematical definition (and also what
goes behind image recognition software). Features such as “color” might not be so relevant when
one is merely interested in a writing utensil for blackboards. Hence, we would not include these in
a definition of chalk—we would instead save that for the definition of chalk of a particular color.
You might also ask: why do we need such an abstract definition? What do we gain? The
pay-off is actually phenomenal. As we will see shortly, there are many examples of vector spaces.
Rather than proving something about each and every one of these examples, if we can prove or
discover something for general vector spaces, then all of these results will hold for every single
example. The reason is because most of the concepts we have learned about vectors and linear
transformations in Euclidean space have completely natural analogues for vector spaces.
For example, here is a list of some concepts/definitions that have natural analogues for arbitrary
vector spaces: linear combinations, linear independence, span, linear transformations, subspace,
basis, dimension, image, kernel, composing linear transformations (in succession), projections,60
inverses,61 eigenvectors, eigenvalues, and eigenspaces.
There are some other concepts/definitions that we have studied that do not have immediate (or
straightforward) analogues for arbitrary vector spaces and linear transformations. These include,
but are not limited to: the determinant (and therefore the characteristic polynomial).
Remark 22.19. Besides these concepts, there are also some definitions that do not make sense
without additional structure on the vector space. For example, it is not clear how to define:
orthogonality, orthogonal projections, the transpose of a linear transformation, least squares ap-
proximations, diagonalization, and spectral decomposition. To define these, we will need the notion
of an inner product as well. As this requires its own section, we will not be able to cover it here.
58 In the example of Hamming's error correcting codes, we manipulated the numbers {0, 1} in much the same
way as real or complex numbers but with a different rule for addition. All our vectors and matrices had entries
that were all either 0 or 1. In fact, one can extend the definition of a real (complex) vector space to the notion of a
vector space over a field. Z2 is an example of a field. The axioms for a field are comparable in complication to the
definition of a vector space and will therefore be omitted since most of the examples of vector spaces that we will
be dealing with from now on are real or complex.
59 I learned this analogy from Prof. Balakrishnan at IIT Madras.
60 But not orthogonal projections—keep reading.
61 The notion of an inverse exists, but it is a tiny bit more subtle in infinite dimensions.
Rather than spending all of our time and wasting space redefining all of the concepts that do
have analogues for arbitrary vector spaces, it seems more appropriate to give examples to illustrate
the broad scope.
Example 22.20. Any subspace of Rn is a real vector space with the induced addition, zero vector,
and scalar multiplication. For example, if Rm T←− Rn is a linear transformation, then ker(T ) and
image(T ) are both real vector spaces.
Here is a rather strange example of a vector space.
Example 22.21. The set of real m × n matrices is a real vector space with addition given by
component-wise addition, the zero vector is the zero matrix, and the scalar multiplication is given
by distributing the scalar to each component. This vector space is denoted by mMn. Taking the
transpose of an m × n matrix defines a function nMm T←− mMn defined by

    mMn ∋ A ↦ A^T ∈ nMm.                                                          (22.22)

Is this function linear? Linearity would say

    (A + B)^T = A^T + B^T        &        (cA)^T = cA^T                            (22.23)
for all A,B ∈ mMn and for all c ∈ R. A quick check shows that this is true.
Exercise 22.24. Let
    Q := [  1  −2 ]
         [  0   3 ]                                                               (22.25)
         [ −1   1 ]

and let 3M4 T←− 2M4 be the function defined by

    T(A) = QA                                                                     (22.26)
for all 2× 4 matrices A.
(a) Prove that T is a linear transformation.
(b) Find the kernel of T.
(c) Find the image of T.
The example of polynomials given at the beginning is another example of a vector space. The
following example generalizes this example quite a bit.
Example 22.27. More generally, functions from R to R form a vector space in the following way.
Let f and g be two functions. Then f + g is the function defined by
(f + g)(x) := f(x) + g(x). (22.28)
If c is a real number, then cf is the function defined by
(cf)(x) := cf(x). (22.29)
The zero function is the function 0 defined by
0(x) := 0. (22.30)
It might seem complicated to think of functions as vectors. In fact, we can think of functions
as vectors with infinitely many components. To see this, imagine taking a function, such as the
Gaussian e−x2.
x
y
If you wanted to plug in the data for this function on a computer for instance, you wouldn’t give
the computer an infinite number of values since that just wouldn’t be possible. You could specify
certain values of this function at certain positions such as
x
y
• • ••
•••
•• • • x
y
• • ••
•••
•• • •
For example, the Gaussian in this picture on the left could be represented by a vector with 11
components (since there are 11 values chosen). Then you can piece them together to get a rough
image of the function by linear extrapolation as shown above on the right. The more values you
keep, the better your approximation.
    [Figure: the Gaussian sampled at 51 points (left) and the corresponding piecewise-linear approximation (right)]
In this case, we have a vector with 51 components. As you can see, you will never quite reach
the exact Gaussian because you will always be using a finite approximation. This is why we
view the set of functions as a vector space instead. We can visualize adding functions as follows.
For example, adding the Gaussian centered at 0 to a negative Gaussian centered at 1 might look
something like the following
    [Figure: the Gaussian centered at 0, plus a negative Gaussian centered at 1, equals their sum]
where the sum is drawn in purple. We could view this as summing components of vectors by using
an approximation as above
    [Figure: the same sum carried out on the sampled (discretized) versions of the two functions, added componentwise]
Example 22.31. The set P2 of all degree 2 polynomials, i.e. functions of the form

    a0 + a1 x + a2 x^2,                                                           (22.32)

is a subspace of the vector space P3 of all degree 3 polynomials, i.e. functions of the form

    b0 + b1 x + b2 x^2 + b3 x^3.                                                  (22.33)

In fact, P2 is a subspace of Pn for every n ≥ 2. Even more generally, Pk is a subspace of P for all
natural numbers k.
Example 22.34. For each natural number n, let fn and gn be the functions on [0, 1] given by

    fn(x) := cos(2πnx)        and        gn(x) := sin(2πnx).                       (22.35)

Then, the set of all (finite) linear combinations of such sines and cosines is a subspace of the set of
all continuous functions on [0, 1]. The zero vector is g0 because sin 0 = 0. Such functions are useful
in expressing periodic functions. For example, the linear combination

    h̃ := (4/π) ( g1 + (1/3) g3 + (1/5) g5 + (1/7) g7 )                            (22.36)

given at x by

    h̃(x) := (4/π) ( sin(2πx) + (1/3) sin(6πx) + (1/5) sin(10πx) + (1/7) sin(14πx) )   (22.37)

is a decent approximation to the "square wave" piece-wise function given by

    h(x) := {  1   for 0 ≤ x ≤ 1/2                                                 (22.38)
            { −1   for 1/2 ≤ x ≤ 1.
Figure 12: Plots of the step function h (in blue) and the approximation h̃ (in orange).
See Figure 12 for an illustration of this. We denote the set of all linear combinations of the fn
and gn by F (for Fourier),

    F := { Σ_{n=0}^{finite} an fn + Σ_{m=0}^{finite} bm gm  :  an, bm ∈ R }.       (22.39)
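As an aside (not for homework), the approximation (22.37) can be evaluated numerically and
compared with the square wave, assuming Python with numpy; the function names h_tilde and h
are mine.

    import numpy as np

    def h_tilde(x):
        # The combination (22.37): (4/π)(sin(2πx) + sin(6πx)/3 + sin(10πx)/5 + sin(14πx)/7)
        return (4/np.pi) * sum(np.sin(2*np.pi*n*x)/n for n in (1, 3, 5, 7))

    def h(x):
        # The square wave (22.38)
        return np.where(x <= 0.5, 1.0, -1.0)

    xs = np.array([0.1, 0.25, 0.4, 0.6, 0.9])
    print(h_tilde(xs))   # values close to +1 on the first half, close to -1 on the second half
    print(h(xs))         # [ 1.  1.  1. -1. -1.]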
Example 22.40. Let (a1, a2, a3, . . . ) be a sequence of real numbers satisfying

    Σ_{n=1}^{∞} |an| < ∞                                                          (22.41)

(recall, this means Σ_{n=1}^{∞} an is an absolutely convergent series). Given another sequence of real
numbers (b1, b2, b3, . . . ) satisfying

    Σ_{n=1}^{∞} |bn| < ∞,                                                         (22.42)

we define the sum of the two to be given by

    (a1, a2, a3, . . . ) + (b1, b2, b3, . . . ) := (a1 + b1, a2 + b2, a3 + b3, . . . ).   (22.43)

The zero vector is the sequence (0, 0, 0, . . . ). If c is a real number, then the scalar multiplication is
defined to be

    c(a1, a2, a3, . . . ) := (ca1, ca2, ca3, . . . ).                              (22.44)

The sum of two sequences still satisfies the absolutely convergent condition because

    Σ_{n=1}^{∞} |an + bn| ≤ Σ_{n=1}^{∞} (|an| + |bn|) = Σ_{n=1}^{∞} |an| + Σ_{n=1}^{∞} |bn| < ∞   (22.45)

by the triangle inequality. Furthermore, the scalar multiplication of an absolutely convergent series
is still absolutely convergent:

    Σ_{n=1}^{∞} |c an| = Σ_{n=1}^{∞} |c||an| = |c| Σ_{n=1}^{∞} |an| < ∞.           (22.46)

The set of such sequences whose associated series are absolutely convergent is denoted by ℓ1 (read
"little ell one").
Exercise 22.47. A closely related example of a vector space that shows up in quantum mechanics
is the set of sequences of complex numbers (a1, a2, a3, . . . ) such that

    Σ_{n=1}^{∞} |an|^2 < ∞.                                                       (22.48)

The zero vector, sum, and scalar multiplication are defined just as in Example 22.40. Check that
this is a complex vector space (this vector space is denoted by ℓ2 and is read "little ell two").

Exercise 22.49. A generalization of the vector spaces ℓ1 and ℓ2 is the following. Let p ≥ 1. Check
that the set of sequences (a1, a2, a3, . . . ) satisfying

    Σ_{n=1}^{∞} |an|^p < ∞,                                                       (22.50)

with similar structure as in the previous examples, is a vector space. This space is denoted by ℓp.

Exercise 22.51. Show that ℓ1 is a subspace of ℓ2. Note that ℓ1 was defined in Example 22.40 and
ℓ2 was defined in Exercise 22.47 (let both of the sequences be either real or complex). [Warning:
this exercise is not trivial.]

Exercise 22.52. Show that ℓ2 is not a subspace of ℓ1. [Hint: give an example of a sequence
(a1, a2, a3, . . . ) such that Σ_{n=1}^{∞} |an|^2 < ∞ but Σ_{n=1}^{∞} |an| does not converge.]
Let us go through more examples of eigenvectors and eigenvalues for linear transformations
between more general vector spaces. But before doing so, it is helpful to understand more about
matrix representations of linear transformations. The m × n matrix associated to a linear trans-
formation Rm T←− Rn was a convenient tool for calculating certain expressions. In fact, we were
able to use the basis ~e1, . . . , ~en in Rn to write down the matrix associated to T—but we didn’t
have to for many of the calculations we did. For example, T (4~e2 − 7~e5) = 4T (~e2) − 7T (~e5) does
not require writing down this matrix. If we know where a basis goes under a linear transformation
T, then we know what the linear transformation does to any vector. For example, if ~v1, . . . , ~vnwas a basis for Rn, then any vector ~u ∈ Rn can be expressed as a linear combination of these basis
elements, let’s say as
~u = u1~v1 + · · ·+ un~vn. (22.53)
Then by linearity of T,
T (~u) = T(u1~v1 + · · ·+ un~vn
)= u1T (~v1) + · · ·+ unT (~vn). (22.54)
Therefore, we only need to know what the vectors T (~v1), . . . , T (~vn) are.
Furthermore, when we express the actual components of a vector such as T (~e2), we would be
using the basis ~e1, . . . , ~em for Rm (notice that we’re now looking at Rm and not Rn because T (~e2)
is a vector in Rm). In other words, we can use this basis to express the vector T (~e2) as a linear
combination of these basis vectors. But we could have also used any other basis. Therefore, the
notion of a matrix associated to a linear transformation makes sense for any basis on the source
and target of T.
Definition 22.55. Let V be a vector space. A set of vectors S := {~v1, . . . , ~vk} in V is linearly
independent if the only values of x1, . . . , xk that satisfy the equation

    Σ_{i=1}^{k} xi ~vi ≡ x1 ~v1 + · · · + xk ~vk = ~0                              (22.56)

are

    x1 = 0,  x2 = 0,  . . . ,  xk = 0.                                            (22.57)

A set S of vectors as above is linearly dependent if there exists a solution to the above equation
for which not all of the xi's are zero. If S is an infinite set of vectors in V, indexed, say, by some
set Λ so that S = {~vα}α∈Λ, then S is linearly independent if for every finite subset Ω of Λ, the
only solution to62

    Σ_{α∈Ω} xα ~vα = ~0                                                           (22.58)

is

    xα = 0  for all α ∈ Ω.                                                        (22.59)

S is linearly dependent if it is not linearly independent, i.e. if there exists a finite subset Ω of Λ
with a solution to Σ_{α∈Ω} xα ~vα = ~0 in which not all of the xα's are 0.
Example 22.60. Let S be the set of degree 7 polynomials of the form S := {p1, p3, p7}. Here pk
is the k-th degree monomial

    pk(x) = x^k.                                                                  (22.61)

The set S is linearly independent. This is because the only solution to

    a1 x + a3 x^3 + a7 x^7 = 0                                                    (22.62)

that holds for all x is

    a1 = a3 = a7 = 0.                                                             (22.63)

However, the set {p1, p3, 3p3 − 7p7, p7} is linearly dependent because the third entry can be written
as a linear combination of the other entries. The set {p1 + 2p3, p4 − p5, 3p6 + 5p0 − p1} is linearly
independent. This is because the only solution to

    a(x + 2x^3) + b(x^4 − x^5) + c(3x^6 + 5 − x) = 0                              (22.64)

is a = b = c = 0. To see this, rewrite the left-hand-side as

    a(x + 2x^3) + b(x^4 − x^5) + c(3x^6 + 5 − x) = 5c + (a − c)x + 2a x^3 + b x^4 − b x^5 + 3c x^6.   (22.65)

The only way the right-hand-side vanishes for all values of x is when c = 0, which then forces a = 0
and b = 0 as well.
62Here, the notation α ∈ Ω means that α is an element of Ω.
Example 22.66. The three 2 × 2 matrices

    A := [ 1   2 ],     B := [ −3  1 ],     and     C := [ −1  5 ]                 (22.67)
         [ 0  −1 ]           [  0  2 ]                    [  0  0 ]

are linearly dependent in 2M2 because

    2A + B = C.                                                                   (22.68)
Definition 22.69. Let V be a vector space and let S := {~v1, . . . , ~vk} be a set of k vectors in V.
The span of S is the set of vectors in V of the form

    Σ_{i=1}^{k} xi ~vi ≡ x1 ~v1 + · · · + xk ~vk,                                  (22.70)

with x1, . . . , xk arbitrary real or complex numbers. The span of S is often denoted by span(S). If S
is an infinite set of vectors, say indexed by some set Λ, in which case S is written as S := {~vα}α∈Λ,
then the span of S is the set of vectors in V of the form63

    Σ_{α∈Ω⊆Λ, Ω finite} xα ~vα.                                                   (22.71)
The sum in the definition of span must be finite (even if S itself is infinite). At this point, it
does not make sense to take an infinite sum of vectors because the latter requires a discussion on
sequences, series, and convergence.
Example 22.72. Let S = {p0, p1, p2, p3, . . . } be the set of all monomials. Then span(S) = P, the
vector space of all polynomials. Indeed, every polynomial has some finite degree so it is a finite
linear combination of monomials. A power series is not a polynomial.
Example 22.73. Consider the vector space F of Example 22.34. Recall, this is the vector space
of linear combinations of the functions fn(x) := cos(2πnx) and gm(x) := sin(2πmx) for arbitrary
natural numbers n and m. Let S := {f0, f1, g1, f2, g2, f3, g3, . . . }. Then the function

    x ↦ sin(2πx − π/4)                                                            (22.74)

is in the span of S. Notice that this function is not in the set S. The fact that this function is in
the span follows from the sum angle formula for sine:

    sin(θ + φ) = sin(θ) cos(φ) + cos(θ) sin(φ),                                   (22.75)

which gives

    sin(2πx − π/4) = sin(2πx) cos(−π/4) + cos(2πx) sin(−π/4)
                   = (√2/2) sin(2πx) − (√2/2) cos(2πx)
                   = (√2/2) g1(x) + (−√2/2) f1(x).                                (22.76)
63Here, the notation Ω ⊆ Λ means that Ω is a subset of Λ and α ∈ Ω means that α is an element of Ω.
As another example, the function

    x ↦ cos^2(2πx)                                                                (22.77)

is also in the span of S. For this, recall the other angle sum formula

    cos(θ + φ) = cos(θ) cos(φ) − sin(θ) sin(φ)                                    (22.78)

and of course the identity

    cos^2(θ) + sin^2(θ) = 1,                                                      (22.79)

which can be used to rewrite

    cos(2θ) = cos^2(θ) − sin^2(θ) = cos^2(θ) − (1 − cos^2(θ)) = 2 cos^2(θ) − 1.    (22.80)

Using this last identity, we can write

    cos^2(2πx) = 1/2 + (1/2) cos(4πx) = (1/2) f0(x) + (1/2) f2(x).                (22.81)
Definition 22.82. Let V be a vector space. A basis for V is a set B of vectors in V that is linearly
independent and spans V.
Definition 22.83. The number of elements in a basis for a vector space V is the dimension of V
and is denoted by dimV. A vector space V with dimV < ∞ is said to be finite-dimensional. A
vector space V with dimV =∞ is said to be infinite-dimensional.
Example 22.84. A basis of Pn is given by the monomials {p0, p1, p2, . . . , pn}, where

    pk(x) := x^k.                                                                 (22.85)

Therefore, dim Pn = n + 1. Similarly, {p0, p1, p2, p3, . . . } is a basis for P. Therefore, dim P = ∞.
Example 22.86. A basis for m × n matrices is given by matrices of the form

    Eij := the m × n matrix whose only non-zero entry is a 1 in the i-th row and j-th column.   (22.87)

In other words, Eij has a 1 in the i-th row and j-th column and is zero everywhere else. Therefore,
dim mMn = mn. For example, in 2M2, this basis looks like

    { E11 = [ 1  0 ],   E12 = [ 0  1 ],   E21 = [ 0  0 ],   E22 = [ 0  0 ] }       (22.88)
            [ 0  0 ]          [ 0  0 ]          [ 1  0 ]          [ 0  1 ]
Example 22.89. Let F be the vector space from Example 22.34. A basis for F is given by
{f0, f1, g1, f2, g2, . . . }. Hence, dim F = ∞. Notice that we have excluded g0 because g0 is the zero
function and would render the set linearly dependent if added.
Definition 22.90. Let V and W be two finite-dimensional vector spaces. Let V := {~v1, ~v2, . . . , ~vn}
be an (ordered) basis for V and W := {~w1, ~w2, . . . , ~wm} be an (ordered) basis for W. The m × n
matrix associated to a linear transformation W T←− V with respect to the bases V and W is the
m × n matrix whose ij-th entry is the unique coefficient [T]VW ij in front of ~wi in the expansion64

    T(~vj) = Σ_{i=1}^{m} [T]VW ij ~wi.                                             (22.91)
The same definition can be made for vector spaces of infinite dimensions provided one uses only
finite linear combinations.
Example 22.92. Let Pn be the set of degree n polynomials. The derivative operator d/dx is a linear
transformation Pn d/dx←− Pn that takes a degree n polynomial and differentiates it:

    (d/dx)(a0 + a1 x + a2 x^2 + · · · + an x^n) = a1 + 2a2 x + 3a3 x^2 + · · · + n an x^{n−1}.   (22.93)

Does d/dx have any eigenvectors? What are the eigenvalues? We could express d/dx as a matrix using
the basis of monomials P := {p0, p1, . . . , pn}. Notice that

    (d/dx) pk = k pk−1                                                            (22.94)

for all k ∈ {0, 1, . . . , n}. Hence, with respect to this basis, the linear transformation d/dx takes the
form

    [ d/dx ]PP = [ 0  1  0  0  · · ·   0   0 ]
                 [ 0  0  2  0  · · ·   0   0 ]
                 [ 0  0  0  3  · · ·   0   0 ]
                 [ :            . . .      : ]                                     (22.95)
                 [ 0  0  0  0  · · ·  n−1  0 ]
                 [ 0  0  0  0  · · ·   0   n ]
                 [ 0  0  0  0  · · ·   0   0 ]

This is an (n + 1) × (n + 1) matrix. The eigenvalues are obtained by solving

    0 = det [ −λ   1   0   0  · · ·   0    0  ]
            [  0  −λ   2   0  · · ·   0    0  ]
            [  0   0  −λ   3  · · ·   0    0  ]
            [  :             . . .         :  ]  =  (−λ)^{n+1} = (−1)^{n+1} λ^{n+1}   (22.96)
            [  0   0   0   0  · · ·  −λ    n  ]
            [  0   0   0   0  · · ·   0   −λ  ]
64 Such an expansion exists because W spans W and such an expansion is unique because W is linearly independent.
because this is an upper triangular matrix and the determinant is therefore just the product along
the diagonal. But upon inspection, the only solution to this equation is λ = 0. Therefore, the
only eigenvalue of d/dx is 0. Are there any eigenvectors? To find the eigenvectors associated to the
eigenvalue 0, we would have to find polynomials p such that

    (d/dx) p = 0                                                                  (22.97)

since 0p = 0 for all polynomials p. The only polynomials whose derivative is 0 are the constant
polynomials. Hence, the set of all eigenvectors for d/dx with eigenvalue 0 is

    { t p0 : t ∈ R }.                                                             (22.98)
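As an aside (not for homework), the matrix (22.95) and its eigenvalues can be built and checked
numerically, assuming Python with numpy; the sketch below works in P4, so the matrix is 5 × 5,
and the variable names are mine.

    import numpy as np

    n = 4   # work in P_4, so the matrix (22.95) is 5x5
    D = np.diag(np.arange(1, n + 1), k=1).astype(float)   # superdiagonal 1, 2, ..., n

    print(D)
    print(np.linalg.eigvals(D))   # all (numerically) zero: 0 is the only eigenvalue, as above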
The fact that there are no other eigenvectors should surprise you. In case you’re not sure why,
the following example should illustrate.
Example 22.99. Let A(R) denote the set of all analytic functions. Recall, these are all in-
finitely differentiable functions f whose associated Taylor series expansions agree with the original
function, i.e. for all real numbers a,
    f(x + a) = f(x) + a (df/dx)(x) + (a^2/2!) (d^2f/dx^2)(x) + (a^3/3!) (d^3f/dx^3)(x) + · · · .   (22.100)

Then A(R), with the addition and scalar multiplication for functions, is a real vector space. Just
as the derivative on polynomials is a linear transformation, d/dx is also a linear transformation on
A(R). Does d/dx have any eigenvectors whose eigenvalue is not zero? Namely, for what real number
λ and for what analytic functions f does (d/dx)f(x) = λf(x) for all x ∈ R? We cannot actually write
down a basis for A(R)—nobody (and I mean nobody) knows even one basis for A(R) so we cannot
write down any matrices here to help us. However, we can still answer this question from a more
conceptual point of view by using the definitions. To solve this, we separate the variables and
solve

    ∫ (1/f) df = ∫ λ dx    ⇒    ln(f) = λx + c    ⇒    f(x) = C e^{λx}            (22.101)

for some constant c and C := e^c. Let eλ be the function eλ(x) = e^{λx}. Hence, if d/dx were to have any
eigenvector, it should be eλ for each real number λ. But is eλ analytic, i.e. is eλ really a vector in
A(R)? This is true and it follows from the definition of the exponential (as well as a theorem that
says the exponential series uniformly converges on any compact interval). The eigenvalue of eλ is
λ. Hence, d/dx has infinitely many (linearly independent!) eigenvectors, with each eigenspace being
spanned by an exponential function.
Example 22.102. Consider the linear transformation nMn T←− nMn defined by sending an n × n
matrix A to T(A) := A^T, the transpose of A. What are the eigenvalues and eigenvectors of
this transformation? Let us be concrete and analyze this problem for n = 2. Then, the linear
transformation acts as

    T( [ a  b ] ) = [ a  c ]                                                      (22.103)
       [ c  d ]     [ b  d ]

We want to find solutions, 2 × 2 matrices A, together with eigenvalues λ, satisfying T(A) = λA,
i.e. A^T = λA. Right off the bat, we can guess three eigenvectors (remember, our vectors are now
2 × 2 matrices!) by just looking at what T does in (22.103). These are

    A1 = [ 1  0 ],     A2 = [ 0  0 ],     &     A3 = [ 0  1 ]                      (22.104)
         [ 0  0 ]           [ 0  1 ]                 [ 1  0 ]
Furthermore, their corresponding eigenvalues are all 1. This is because all of these matrices satisfy A_i^T = A_i for i = 1, 2, 3. Is there a fourth eigenvector?^{65} For this, we could express the linear transformation T in terms of the basis E := {E_{11}, E_{12}, E_{21}, E_{22}}. In this basis, the matrix representation of T is given by

[T]_E^E = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}    (22.105)
because

T(E_{11}) = E_{11}, \qquad T(E_{12}) = E_{21}, \qquad T(E_{21}) = E_{12}, \qquad T(E_{22}) = E_{22}.    (22.106)
The characteristic polynomial associated to this transformation is

\det \begin{bmatrix} 1-\lambda & 0 & 0 & 0 \\ 0 & -\lambda & 1 & 0 \\ 0 & 1 & -\lambda & 0 \\ 0 & 0 & 0 & 1-\lambda \end{bmatrix}
= (1-\lambda)\,\det \begin{bmatrix} -\lambda & 1 & 0 \\ 1 & -\lambda & 0 \\ 0 & 0 & 1-\lambda \end{bmatrix}
= (1-\lambda)^2\,\det \begin{bmatrix} -\lambda & 1 \\ 1 & -\lambda \end{bmatrix}
= (1-\lambda)^2(\lambda^2 - 1)
= (\lambda - 1)^3(\lambda + 1).    (22.107)
Hence, we see that there is another eigenvalue, namely λ_4 = −1. The corresponding eigenvector can be found by solving the linear system

\begin{bmatrix} 2 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 2 & 0 \end{bmatrix} \longrightarrow \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix}    (22.108)
65 We do not have to go through the entire calculation that follows to find this fourth eigenvector; one could guess it. We go through the calculation anyway to illustrate what one would do even when the answer is not apparent.
whose solutions are all of the form

s \begin{bmatrix} 0 \\ -1 \\ 1 \\ 0 \end{bmatrix}_E    (22.109)

with s a free variable, which in terms of 2 × 2 matrices is given by

s \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}.    (22.110)
Hence, our fourth eigenvector for T can be taken to be

A_4 = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}    (22.111)

and its corresponding eigenvalue is λ_4 = −1.
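As a sanity check (again, only an illustration), one can hand the matrix (22.105) to sympy and let it find the characteristic polynomial, eigenvalues, and eigenvectors. The names below are my own.

```python
import sympy as sp

# Matrix of the transpose map T(A) = A^T on 2x2 matrices in the basis
# E = {E11, E12, E21, E22}; see (22.105).
T = sp.Matrix([
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
])

print(T.charpoly())   # lambda**4 - 2*lambda**3 + 2*lambda - 1 = (lambda - 1)**3 (lambda + 1)
print(T.eigenvals())  # {1: 3, -1: 1}
for value, multiplicity, vectors in T.eigenvects():
    print(value, [list(v) for v in vectors])
# The eigenvalue -1 has an eigenvector proportional to (0, -1, 1, 0),
# which is the coordinate vector of the matrix [[0, -1], [1, 0]] = A_4.
```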
Example 22.112. Consider the following two bases of degree 2 polynomials,

q_0(x) = 0 + 2x + 3x^2, \qquad p_0(x) = 1,
q_1(x) = 1 + 1x + 3x^2, \qquad p_1(x) = x,
q_2(x) = 1 + 2x + 2x^2, \qquad p_2(x) = x^2,

and the linear transformation P_2 T←− P_2 satisfying T(p_i) = q_i for i = 0, 1, 2. A matrix representation of this transformation in the P := {p_0, p_1, p_2} basis is given by

[T]_P^P = \begin{bmatrix} 0 & 1 & 1 \\ 2 & 1 & 2 \\ 3 & 3 & 2 \end{bmatrix}.    (22.113)
Because this matrix has nonzero determinant (det T = 5), it is invertible. In fact, we studied this matrix in Example 20.26. A matrix representation of the inverse of this transformation is given in the Q := {q_0, q_1, q_2} basis by

[T^{-1}]_Q^Q = \frac{1}{5} \begin{bmatrix} -4 & 1 & 1 \\ 2 & -3 & 2 \\ 3 & 3 & -2 \end{bmatrix}    (22.114)
and satisfies T^{-1}(q_i) = p_i for i = 0, 1, 2. To check this, let us make sure that the first column of this matrix expresses the polynomial p_0 in the Q basis:

-\frac{4}{5}\,q_0(x) + \frac{2}{5}\,q_1(x) + \frac{3}{5}\,q_2(x)
= -\frac{4}{5}\left(0 + 2x + 3x^2\right) + \frac{2}{5}\left(1 + 1x + 3x^2\right) + \frac{3}{5}\left(1 + 2x + 2x^2\right)
= 1 + 0x + 0x^2
= p_0(x).
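If you prefer to verify the inverse on a computer rather than by hand, here is a minimal sympy check (the variable names are mine):

```python
import sympy as sp

A = sp.Matrix([[0, 1, 1], [2, 1, 2], [3, 3, 2]])  # [T]_P^P from (22.113)
print(A.det())      # 5
print(5 * A.inv())  # Matrix([[-4, 1, 1], [2, -3, 2], [3, 3, -2]]), as in (22.114)
```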
Anyway, we'd like to find the eigenvalues and corresponding eigenvectors of T. To do this, we can use any basis we'd like and the matrix representation of T in that basis. Therefore, we can simply find the roots of the characteristic polynomial, which we have already done in Example 20.26. They were λ_1 = −1, λ_2 = −1, λ_3 = 5. We should now calculate the corresponding eigenvectors. For λ_1 = λ_2 = −1, we have to solve

\begin{bmatrix} 1 & 1 & 1 & 0 \\ 2 & 2 & 2 & 0 \\ 3 & 3 & 3 & 0 \end{bmatrix} \longrightarrow \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}    (22.115)
which has solutions

y \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}_P + z \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}_P    (22.116)

with two free variables y and z. Hence, a basis for such solutions, and therefore two eigenvectors for λ_1 and λ_2, is given by the two vectors

\vec{v}_1 = \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}_P \qquad \& \qquad \vec{v}_2 = \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}_P.    (22.117)
(Note: your choice of eigenvectors could be different from mine!) For the eigenvalue λ_3 = 5, we must solve

\begin{bmatrix} -5 & 1 & 1 & 0 \\ 2 & -4 & 2 & 0 \\ 3 & 3 & -3 & 0 \end{bmatrix} \longrightarrow \begin{bmatrix} 0 & 0 & 0 & 0 \\ 1 & -2 & 1 & 0 \\ 1 & 1 & -1 & 0 \end{bmatrix} \longrightarrow \begin{bmatrix} 0 & 0 & 0 & 0 \\ 1 & -2 & 1 & 0 \\ 0 & 3 & -2 & 0 \end{bmatrix} \longrightarrow \begin{bmatrix} 0 & 0 & 0 & 0 \\ 1 & 0 & -1/3 & 0 \\ 0 & 1 & -2/3 & 0 \end{bmatrix}    (22.118)
which has solutions

z \begin{bmatrix} 1/3 \\ 2/3 \\ 1 \end{bmatrix}_P    (22.119)

with z a free variable. Thus, an eigenvector for λ_3 is

\vec{v}_3 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}_P.    (22.120)
In terms of the polynomials, the eigenvalues together with their corresponding eigenvectors in P_2 are given by

λ_1 :  −p_0 + p_1 ↔ −1 + x
λ_2 :  −p_0 + p_2 ↔ −1 + x^2
λ_3 :  p_0 + 2p_1 + 3p_2 ↔ 1 + 2x + 3x^2    (22.121)
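The same matrix can also be handed to sympy to confirm the eigenvalues and eigenvectors found above (only a sketch; the names are mine):

```python
import sympy as sp

A = sp.Matrix([[0, 1, 1], [2, 1, 2], [3, 3, 2]])  # [T]_P^P

print(A.eigenvals())  # {-1: 2, 5: 1}
for value, multiplicity, vectors in A.eigenvects():
    print(value, multiplicity, [list(v) for v in vectors])
# eigenvalue -1: two eigenvectors spanning the same plane as (-1, 1, 0) and (-1, 0, 1)
# eigenvalue  5: an eigenvector proportional to (1, 2, 3)
```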
Example 22.122. Let P_n be the vector space of degree n polynomials (in the variable x). The function P_{n+1} T←− P_n, given by

(T(p))(x) := x\,p(x)    (22.123)
for any degree n polynomial p, is a linear transformation. Here p is a degree n polynomial and T(p) is a degree n + 1 polynomial. For example, if n = 1 and p is of the form p(x) = mx + b with m and b real numbers, then T(p) is the quadratic polynomial given by mx^2 + bx. Let us check that T is indeed a linear transformation. We must show two things.
(a) Let p and q be two degree n polynomials, which are of the form

p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n \qquad \& \qquad q(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_n x^n.    (22.124)

Then p + q is the polynomial given by

(p + q)(x) = (a_0 + b_0) + (a_1 + b_1)x + (a_2 + b_2)x^2 + \cdots + (a_n + b_n)x^n.    (22.125)

Denote the polynomial T(p + q) by r. By the definition of T, r is given by

r(x) = x\bigl((a_0 + b_0) + (a_1 + b_1)x + (a_2 + b_2)x^2 + \cdots + (a_n + b_n)x^n\bigr)
     = (a_0 + b_0)x + (a_1 + b_1)x^2 + (a_2 + b_2)x^3 + \cdots + (a_n + b_n)x^{n+1}
     = \bigl(a_0 x + a_1 x^2 + a_2 x^3 + \cdots + a_n x^{n+1}\bigr) + \bigl(b_0 x + b_1 x^2 + b_2 x^3 + \cdots + b_n x^{n+1}\bigr)
     = x\,p(x) + x\,q(x)
     = (T(p))(x) + (T(q))(x).    (22.126)

This shows that T(p + q) = T(p) + T(q).
(b) Now let λ be a real number. Then T(λp) is the polynomial given by

(T(\lambda p))(x) = x\bigl(\lambda a_0 + \lambda a_1 x + \lambda a_2 x^2 + \cdots + \lambda a_n x^n\bigr)
                  = \lambda a_0 x + \lambda a_1 x^2 + \lambda a_2 x^3 + \cdots + \lambda a_n x^{n+1}
                  = \lambda\bigl(a_0 x + a_1 x^2 + a_2 x^3 + \cdots + a_n x^{n+1}\bigr)
                  = \lambda\bigl(x\,p(x)\bigr)
                  = \lambda\bigl(T(p)\bigr)(x),    (22.127)

which shows that T(λp) = λT(p).
These two calculations prove that T is a linear transformation.
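In coordinates, T simply shifts the coefficient list of p and inserts a 0 constant term. Here is a small Python sketch of this (the function name and the coefficient-list representation are my own choices, not notation from [Lay]):

```python
def multiply_by_x(coeffs):
    """Send the coefficients [a0, a1, ..., an] of p to those of x*p(x).

    The output [0, a0, a1, ..., an] has length n+2, reflecting the fact that
    T sends a degree n polynomial to a degree n+1 polynomial.
    """
    return [0] + list(coeffs)

# The example from the text: n = 1 and p(x) = m*x + b, i.e. coefficients [b, m].
b, m = 2, 3
print(multiply_by_x([b, m]))  # [0, 2, 3], i.e. b*x + m*x^2 = 2x + 3x^2
```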
The following two theorems are still true for arbitrary vector spaces and linear transformations.

Theorem 22.128. The kernel of a linear transformation W T←− V is a subspace of V. The image of a linear transformation W T←− V is a subspace of W.
Example 22.129. Let P_{n+1} T←− P_n be the linear transformation from Example 22.122. The image (column space) of this linear transformation is the set of degree n + 1 polynomials of the form

a_1 x + a_2 x^2 + \cdots + a_n x^n + a_{n+1} x^{n+1}.    (22.130)

In other words, it is the set of all polynomials with no constant term. Mathematically, as a set, this would be written as

\{a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n + a_{n+1} x^{n+1} : a_0 = 0, \; a_1, a_2, \ldots, a_{n+1} \in \mathbb{R}\}.    (22.131)
The kernel of T consists of only the constant 0 polynomial because if p is a degree n polynomial of the form

p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n    (22.132)

then T(p) is the polynomial

(T(p))(x) = a_0 x + a_1 x^2 + a_2 x^3 + \cdots + a_n x^{n+1}    (22.133)
and this is zero for all x ∈ R if and only if a_0 = a_1 = a_2 = \cdots = a_n = 0. This linear transformation is also defined on the space of all polynomials, P T←− P. Does this linear transformation have any eigenvalues with corresponding eigenvectors? If λ were such an eigenvalue and p an eigenvector (a polynomial), then we would require xp(x) = λp(x) for all input values of x. Since p must be nonzero (in order for it to be an eigenvector), for any value of x for which p(x) ≠ 0, this equation demands that λ = x, which is impossible since p(x) ≠ 0 for at least two distinct values of x (in fact, for all but finitely many). Therefore, P T←− P has no eigenvalues either.
Example 22.134. Fix N ∈ N to be some large natural number and set ∆ := 1/N. Consider the set of periodic functions on the interval [0, 1] with values given only on the points k/N for k ∈ {0, 1, . . . , N − 1} (k = N is not included because, by periodicity, we assume that the value of these functions at 0 equals the value at 1). Record such a function by its vector of sample values (y_0, y_1, \ldots, y_{N−1}) ∈ R^N. Let T : R^N → R^N be the linear transformation given by

\bigl(T(y)\bigr)_k := \frac{y_{k+1} − y_{k−1}}{2∆},    (22.135)

where the indices are read periodically, i.e. y_{−1} ≡ y_{N−1} and y_N ≡ y_0. T represents an approximation to the derivative: at each sample point it computes the approximate slope using the values of the function at the two nearest neighbors. For example, the cosine function with N = 20 looks like
[Figure: the N = 20 sample values y_0, y_1, y_2, . . . , y_{N−1} of the cosine function plotted against x.]
Using the formula for T then gives the set of points
[Figure: the values \tfrac{1}{2∆}(y_1 − y_{N−1}), \tfrac{1}{2∆}(y_2 − y_0), \tfrac{1}{2∆}(y_3 − y_1), . . . , \tfrac{1}{2∆}(y_0 − y_{N−2}) plotted against x.]
These points are seen to lie almost exactly along the −sin curve. If fewer points were chosen, such as 10, the approximation might not be as good. The approximation
[Figure: the N = 10 sample values of the cosine function plotted against x]
gets mapped to
[Figure: the corresponding approximate derivative values for N = 10 plotted against x]
so you can see that the approximation for the derivative is not as good.
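Here is a short numpy sketch of this discrete derivative that you can use to reproduce the pictures above for any N. I am assuming the sampled function is cos(2πx) (one full period on [0, 1]), as the pictures suggest; the variable names are mine.

```python
import numpy as np

N = 20
Delta = 1.0 / N
x = np.arange(N) / N          # the sample points k/N for k = 0, ..., N-1
y = np.cos(2 * np.pi * x)     # sampled cosine

# Central difference with periodic (wrap-around) indices:
# (T y)_k = (y_{k+1} - y_{k-1}) / (2 * Delta)
Ty = (np.roll(y, -1) - np.roll(y, 1)) / (2 * Delta)

exact = -2 * np.pi * np.sin(2 * np.pi * x)  # the true derivative of cos(2*pi*x)
print(np.max(np.abs(Ty - exact)))  # about 0.10 for N = 20; about 0.41 if you rerun with N = 10
```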
Example 22.136. The set of solutions to an n-th order homogeneous linear ordinary differential equation is a vector space. To see this, let us first write down such an ODE as

a_n f^{(n)} + a_{n−1} f^{(n−1)} + \cdots + a_1 f^{(1)} + a_0 f = 0.    (22.137)
Here f denotes a sufficiently smooth function of a single variable (one that admits all of these derivatives) and f^{(k)} denotes the k-th derivative of f. The coefficients a_k are all constants, independent of the variable input for f. An example of such an ODE is

f^{(2)} + f = 0    (22.138)
whose solutions are all of the form
f(x) = a cos(x) + b sin(x), (22.139)
with a and b real numbers. Notice that in this example, the set of solutions is given by span{cos(x), sin(x)}. More generally, let A(Ω) denote the vector space of analytic functions of a single variable on a domain Ω ⊆ R. Let A(Ω) L←− A(Ω) be the transformation defined by

A(Ω) ∋ f ↦ L(f) := a_n f^{(n)} + a_{n−1} f^{(n−1)} + \cdots + a_1 f^{(1)} + a_0 f.    (22.140)
Then L is a linear transformation and ker(L) is exactly the set of solutions to our general ODE.
In particular, it is a vector space, since every kernel of a linear transformation is. Inhomogeneous equations can also be formulated in this framework. Let g ∈ A(Ω) be another function and let

a_n f^{(n)} + a_{n−1} f^{(n−1)} + \cdots + a_1 f^{(1)} + a_0 f = g    (22.141)

be an n-th order linear inhomogeneous ordinary differential equation. With L defined just as above, the solution set of this ODE is exactly the solution set of Lf = g (which should remind you of A\vec{x} = \vec{b}, with A replaced by L, \vec{x} replaced by a function f, and \vec{b} replaced by a function g). In particular, the solution set of the inhomogeneous ODE is a linear manifold in A(Ω).
Theorem 3.38 from a while back tells us that the general solution to the inhomogeneous equation Lf = g is therefore of the form

f(x) = f_p(x) + f_h(x),    (22.142)

where f_p is one particular solution to Lf = g and f_h is any solution to the homogeneous equation Lf = 0.
Therefore, many of the concepts from the theory of differential equations are special cases of the
concepts from linear algebra.
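To see the decomposition f = f_p + f_h in a concrete case, sympy can solve both the homogeneous equation (22.138) and an inhomogeneous version of it. The right-hand side g(x) = x below is my own choice, made only for illustration.

```python
import sympy as sp

x = sp.symbols('x')
f = sp.Function('f')

# Homogeneous equation (22.138): f'' + f = 0
print(sp.dsolve(sp.Eq(f(x).diff(x, 2) + f(x), 0), f(x)))
# Eq(f(x), C1*sin(x) + C2*cos(x))  -- compare with (22.139)

# Inhomogeneous equation f'' + f = x
print(sp.dsolve(sp.Eq(f(x).diff(x, 2) + f(x), x), f(x)))
# Eq(f(x), C1*sin(x) + C2*cos(x) + x)  -- a particular solution f_p(x) = x
# plus the general homogeneous solution f_h, exactly as in (22.142)
```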
Recommended Exercises. Please check HuskyCT for the homework. Please show your work!
Do not use calculators or computer programs to solve any problems! In this lecture, we covered
Chapter 4, most notably Section 4.1. The other sections were essentially covered earlier in these
notes. We have also covered additional topics giving more context and utility for the notion of
vector spaces.
References
[1] Otto Bretscher, Linear algebra with applications, 3rd ed., Prentice Hall, 2005.
[2] David C. Lay, Linear algebra and its applications, 4th ed., Pearson, 2011.
[3] David C. Lay, Steven R. Lay, and Judi J. McDonald, Linear algebra and its applications, 5th ed., Pearson, 2015.
[4] G. Polya, How to solve it: A new aspect of mathematical method, Princeton University Press, 2014.
[5] Jun John Sakurai, Modern quantum mechanics; rev. ed., Addison-Wesley, Reading, MA, 1994.
[6] Jeffrey R. Weeks, The shape of space, 2nd ed., Monographs and Textbooks in Pure and Applied Mathematics, vol. 249, Marcel Dekker, Inc., New York, 2002.
[7] Wikipedia, Michaelis–Menten kinetics — Wikipedia, The Free Encyclopedia, 2017. [Online; accessed 23-October-2017].