Linear regression
CS 446
1. Overview
todo
check some continuity bugs
make sure nothing missing from old lectures (both mine and Daniel's)
fix some of those bugs, like b replacing y
delete the excess material from end
add proper summary slide which boils down concepts and reduces student worry
1 / 94
Lecture 1: supervised learning
Training data: labeled examples
(x1, y1), (x2, y2), . . . , (xn, yn)
where
I each input xi is a machine-readable description of an instance (e.g., image, sentence), and
I each corresponding label yi is an annotation relevant to the task, typically not easy to obtain automatically.
Goal: learn a function f from labeled examples that accurately “predicts” the labels of new (previously unseen) inputs.
[Diagram: past labeled examples feed a learning algorithm, which outputs a learned predictor; the learned predictor maps a new (unlabeled) example to a predicted label.]
2 / 94
Lecture 2: nearest neighbors and decision trees
[Scatter plot of labeled points in the (x1, x2) plane.]
Nearest neighbors. Training/fitting: memorize data. Testing/predicting: find k closest memorized points, return plurality label. Overfitting? Vary k.
Decision trees. Training/fitting: greedily partition space, reducing “uncertainty”. Testing/predicting: traverse tree, output leaf label. Overfitting? Limit or prune tree.
3 / 94
Lectures 3-4: linear regression
[Scatter plot: eruption duration vs. delay until next eruption.]
Linear regression / least squares.
Our first (of many!) linear prediction methods.
Today:
I Example.
I How to solve it: ERM and SVD.
I Features.
Next lecture: advanced topics, including overfitting.
4 / 94
2. Example: Old Faithful
Prediction problem: Old Faithful geyser (Yellowstone)
Task: Predict time of next eruption.
5 / 94
Time between eruptions
Historical records of eruptions:
[Timeline: the i-th eruption starts at time ai and ends at time bi; Y1, Y2, Y3 mark the gaps between consecutive eruptions.]
Time until next eruption: Yi := ai − bi−1.
Prediction task: At later time t (when an eruption ends), predict time of next eruption t + Y.
On “Old Faithful” data:
I Using 136 past observations, we form mean estimate µ = 70.7941.
I Can we do better?
6 / 94
Looking at the data
Naturalist Harry Woodward observed that time until the next eruption seems to be related to the duration of the last eruption.
[Scatter plot: duration of last eruption vs. time until next eruption.]
7 / 94
Using side-information
At prediction time t, duration of last eruption is available as side-information.
[Timeline: past eruptions with their gaps (the data (Xi, Yi)) up to the current time t; X is the duration of the last eruption and Y is the unknown time until the next eruption.]
IID model for supervised learning: (X1, Y1), . . . , (Xn, Yn), (X,Y ) are iid random pairs (i.e., labeled examples).
X takes values in X (e.g., X = R), Y takes values in R.
1. We observe (X1, Y1), . . . , (Xn, Yn), and then choose a prediction function (a.k.a. predictor)
f : X → R.
This is called “learning” or “training”.
2. At prediction time, observe X, and form prediction f(X).
How should we choose f based on data? Recall:
I The model is our choice.
I We must contend with overfitting, bad fitting algorithms, . . .
8 / 94
3. Least squares and linear regression
Which line?
[Scatter plot: duration vs. delay.]
Let’s predict with a linear regressor:
$\hat{y} := w^\top \begin{bmatrix} x \\ 1 \end{bmatrix},$
where w ∈ R2 is learned from data.
Remark: appending 1 makes this an affine function x ↦ w1 x + w2. (More on this later. . . )
If data lies along a line, we should output that line. But what if not?
9 / 94
ERM setup for least squares.
I Predictors/model: f(x) = wTx; a linear predictor/regressor.
(For linear classification: x ↦ sgn(wTx).)
I Loss/penalty: the least squares loss
$\ell_{\mathrm{ls}}(\hat{y}, y) = \ell_{\mathrm{ls}}(y, \hat{y}) = (y - \hat{y})^2.$
(Some conventions scale this by 1/2.)
I Goal: minimize the least squares empirical risk
$\widehat{R}_{\mathrm{ls}}(f) = \frac{1}{n}\sum_{i=1}^n \ell_{\mathrm{ls}}(y_i, f(x_i)) = \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2.$
I Specifically, we choose w ∈ Rd according to
$\arg\min_{w \in \mathbb{R}^d} \widehat{R}_{\mathrm{ls}}(x \mapsto w^\top x) = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2.$
I More generally, this is the ERM approach: pick a model and minimize empirical risk over the model parameters.
10 / 94
ERM in general
I Pick a family of models/predictors F. (For today, we use linear predictors.)
I Pick a loss function ℓ. (For today, we chose squared loss.)
I Minimize the empirical risk over the model parameters.
We haven't discussed: true risk and overfitting; how to minimize; why this is a good idea.
Remark: ERM is convenient in pytorch: just pick a model, a loss, and an optimizer, and tell it to minimize (see the sketch below).
11 / 94
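To make the pytorch remark concrete, here is a minimal sketch (an illustration, not part of the original slides) of least squares ERM with a linear model; the data, learning rate, and iteration count are arbitrary assumptions.

    import torch

    # toy data: n examples with d features (shapes chosen only for illustration)
    n, d = 100, 3
    X = torch.randn(n, d)
    y = torch.randn(n, 1)

    model = torch.nn.Linear(d, 1, bias=False)   # linear predictor x -> w^T x
    loss_fn = torch.nn.MSELoss()                # least squares loss
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(500):                        # minimize the empirical risk
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()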
Least squares ERM in pictures
Red dots: data points.
Affine hyperplane: our predictions (via affine expansion (x1, x2) ↦ (1, x1, x2)).
ERM: minimize sum of squared vertical lengths from hyperplane to points.
12 / 94
Empirical risk minimization in matrix notation
Define n× d matrix A and n× 1 column vector b by
$A := \frac{1}{\sqrt{n}} \begin{bmatrix} \leftarrow & x_1^\top & \rightarrow \\ & \vdots & \\ \leftarrow & x_n^\top & \rightarrow \end{bmatrix}, \qquad b := \frac{1}{\sqrt{n}} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}.$
Can write empirical risk as
$\widehat{R}(w) = \frac{1}{n}\sum_{i=1}^n (y_i - x_i^\top w)^2 = \|Aw - b\|_2^2.$
Necessary condition for w to be a minimizer of R:
∇R(w) = 0, i.e., w is a critical point of R.
This translates to (ATA)w = ATb,
a system of linear equations called the normal equations.
In an upcoming lecture we'll prove every critical point of R is a minimizer of R.
13 / 94
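As a concrete illustration (not from the slides), a minimal numpy sketch that forms A and b as above and solves the normal equations; the data is randomly generated just to show the shapes.

    import numpy as np

    n, d = 100, 3
    X = np.random.randn(n, d)               # rows are the x_i
    y = np.random.randn(n)

    A = X / np.sqrt(n)                       # the 1/sqrt(n) scaling from the slide
    b = y / np.sqrt(n)

    # normal equations: (A^T A) w = A^T b  (np.linalg.solve assumes A^T A is invertible)
    w = np.linalg.solve(A.T @ A, A.T @ b)

    risk = np.sum((A @ w - b) ** 2)          # empirical risk ||Aw - b||_2^2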
Summary on ERM and linear regression
Procedure:
I Form matrix A and vector b with data (resp. xi, yi) as rows.
(Scaling factor 1/√n is not standard, doesn't change solution.)
I Find w satisfying the normal equations ATAw = ATb.
(E.g., via Gaussian elimination, taking time O(nd²).)
I In general, solutions are not unique. (Why not?)
I If ATA is invertible, can choose the (unique) (ATA)−1ATb.
I Recall our original conundrum: we want to fit some line. We chose least squares; it gives one (family of) choice(s). Next lecture, with logistic regression, we get another.
I Note: if Aw = b for some w, then data lies along a line, and we might as well not worry about picking a loss function.
I Note: Aw − b = 0 may not have solutions, but the least squares setting means we instead work with AT(Aw − b) = 0, which does have solutions. . .
14 / 94
4. SVD and least squares
SVD
Recall the Singular Value Decomposition (SVD) M = USV T ∈ Rm×n, where
I U ∈ Rm×r is orthonormal, S ∈ Rr×r is diag(s1, . . . , sr) with s1 ≥ s2 ≥ · · · ≥ sr ≥ 0, and V ∈ Rn×r is orthonormal, with r := rank(M). (If r = 0, use the convention of S = 0 ∈ R1×1.)
I This convention is sometimes called the thin SVD.
I Another notation is to write $M = \sum_{i=1}^r s_i u_i v_i^\top$. This avoids the issue with 0 (the empty sum is 0). Moreover, this notation makes it clear that $(u_i)_{i=1}^r$ span the column space and $(v_i)_{i=1}^r$ span the row space of M.
I The full SVD will not be used in this class; it fills out U and V to be full rank and orthonormal, and pads S with zeros. It agrees with the eigendecompositions of MTM and MMT.
I Note: numpy and pytorch have SVD (interfaces differ slightly). Determining r runs into numerical issues.
15 / 94
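For reference, a small numpy sketch (illustrative, not part of the slides) of the thin SVD and one common way to estimate r numerically; the tolerance below mirrors a standard default and is an assumption.

    import numpy as np

    M = np.random.randn(5, 3)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # thin SVD: M = U @ diag(s) @ Vt

    # estimate r = rank(M): count singular values above a tolerance
    tol = s.max() * max(M.shape) * np.finfo(M.dtype).eps
    r = int((s > tol).sum())

    # keep only the first r components (drops numerically-zero directions)
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
    assert np.allclose(M, U @ np.diag(s) @ Vt)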
Pseudoinverse
Let the SVD $M = \sum_{i=1}^r s_i u_i v_i^\top$ be given.
I Define the pseudoinverse $M^+ = \sum_{i=1}^r \frac{1}{s_i} v_i u_i^\top$.
(If 0 = M ∈ Rm×n, then 0 = M+ ∈ Rn×m.)
I Alternatively, define the pseudoinverse S+ of a diagonal matrix to be ST but with reciprocals of non-zero elements; then M+ = V S+UT.
I Also called the Moore-Penrose pseudoinverse; it is unique, even though the SVD is not unique (why not?).
I If M−1 exists, then M−1 = M+ (why?).
16 / 94
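A small numpy check (illustrative, with an assumed rank tolerance) that building M+ from the thin SVD matches numpy's own pinv:

    import numpy as np

    M = np.random.randn(5, 3)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = int((s > 1e-10).sum())               # crude rank estimate (assumed tolerance)

    # M^+ = sum_i (1/s_i) v_i u_i^T, using only the nonzero singular values
    M_pinv = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

    assert np.allclose(M_pinv, np.linalg.pinv(M))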
SVD and least squares
Recall: we’d like to find w such that
ATAw = ATb.
If w = A+b, then
$A^\top A w = \Big(\sum_{i=1}^r s_i v_i u_i^\top\Big)\Big(\sum_{i=1}^r s_i u_i v_i^\top\Big)\Big(\sum_{i=1}^r \frac{1}{s_i} v_i u_i^\top\Big) b = \Big(\sum_{i=1}^r s_i v_i u_i^\top\Big)\Big(\sum_{i=1}^r u_i u_i^\top\Big) b = A^\top b.$
Henceforth, define wols = A+b as the OLS solution. (OLS = “ordinary least squares”.)
Note: in general, $AA^+ = \sum_{i=1}^r u_i u_i^\top \neq I$.
17 / 94
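A minimal numpy sketch (with assumed random data) of computing wols = A+b and checking that it satisfies the normal equations:

    import numpy as np

    n, d = 100, 3
    A = np.random.randn(n, d) / np.sqrt(n)
    b = np.random.randn(n) / np.sqrt(n)

    w_ols = np.linalg.pinv(A) @ b            # w_ols = A^+ b
    # np.linalg.lstsq(A, b, rcond=None)[0] returns the same least-norm solution

    assert np.allclose(A.T @ A @ w_ols, A.T @ b)   # normal equations hold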
5. Summary of linear regression so far
Main points
I Model/function/predictor class of linear regressors x ↦ wTx.
I ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.
I ERM solution for least squares: pick w satisfying ATAw = ATb, which is not unique; one unique choice is the ordinary least squares solution A+b.
18 / 94
Part 2 of linear regression lecture. . .
Recap on SVD. (A messy slide, I’m sorry.)
Suppose 0 ≠ M ∈ Rn×d, thus r := rank(M) > 0.
I “Decomposition form” thin SVD: $M = \sum_{i=1}^r s_i u_i v_i^\top$ with $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = \sum_{i=1}^r \frac{1}{s_i} v_i u_i^\top$, and in general $M^+ M = \sum_{i=1}^r v_i v_i^\top \neq I$.
I “Factorization form” thin SVD: M = USVT, where U ∈ Rn×r and V ∈ Rd×r have orthonormal columns (so UTU = VTV = I ∈ Rr×r, though UUT and VVT are not identity matrices in general), and S = diag(s1, . . . , sr) ∈ Rr×r with s1 ≥ · · · ≥ sr > 0; pseudoinverse M+ = V S−1UT, and in general M+M ≠ MM+ ≠ I.
I Full SVD: $M = U_f S_f V_f^\top$, where $U_f \in \mathbb{R}^{n\times n}$ and $V_f \in \mathbb{R}^{d\times d}$ are orthonormal and full rank (so $U_f^\top U_f$ and $V_f^\top V_f$ are identity matrices), and $S_f \in \mathbb{R}^{n\times d}$ is zero everywhere except the first r diagonal entries, which are $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = V_f S_f^+ U_f^\top$, where $S_f^+$ is obtained by transposing $S_f$ and then taking reciprocals of the nonzero entries, and in general $M^+M \neq MM^+ \neq I$. Additional property: agreement with the eigendecompositions of $MM^\top$ and $M^\top M$.
The “full SVD” adds columns to U and V which hit zeros of S and therefore don't matter (as a sanity check, verify for yourself that all these SVDs are equal).
19 / 94
Recap on SVD, zero matrix case
Suppose 0 = M ∈ Rn×d, thus r := rank(M) = 0.
I In all types of SVD, M+ is MT (another zero matrix).
I Technically speaking, s is a singular value of M iff there exist nonzero vectors (u, v) with Mv = su and MTu = sv; the zero matrix therefore has no singular values (or left/right singular vectors).
I “Factorization form thin SVD” becomes a little messy.
20 / 94
6. More on the normal equations
Recall our matrix notation
Let labeled examples ((xi, yi))ni=1 be given.
Define n× d matrix A and n× 1 column vector b by
$A := \frac{1}{\sqrt{n}} \begin{bmatrix} \leftarrow & x_1^\top & \rightarrow \\ & \vdots & \\ \leftarrow & x_n^\top & \rightarrow \end{bmatrix}, \qquad b := \frac{1}{\sqrt{n}} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}.$
Can write empirical risk as
$\widehat{R}(w) = \frac{1}{n}\sum_{i=1}^n (y_i - x_i^\top w)^2 = \|Aw - b\|_2^2.$
Necessary condition for w to be a minimizer of R:
∇R(w) = 0, i.e., w is a critical point of R.
This translates to (ATA)w = ATb,
a system of linear equations called the normal equations.
We’ll now finally show that normal equations imply optimality.
21 / 94
Normal equations imply optimality
Consider w with ATAw = ATb, and any w′; then
$\|Aw' - b\|^2 = \|Aw' - Aw + Aw - b\|^2 = \|Aw' - Aw\|^2 + 2(Aw' - Aw)^\top(Aw - b) + \|Aw - b\|^2.$
Since
$(Aw' - Aw)^\top(Aw - b) = (w' - w)^\top(A^\top A w - A^\top b) = 0,$
then $\|Aw' - b\|^2 = \|Aw' - Aw\|^2 + \|Aw - b\|^2 \geq \|Aw - b\|^2$. This means w is optimal.
Moreover, writing $A = \sum_{i=1}^r s_i u_i v_i^\top$,
$\|Aw' - Aw\|^2 = (w' - w)^\top (A^\top A)(w' - w) = (w' - w)^\top \Big(\sum_{i=1}^r s_i^2 v_i v_i^\top\Big)(w' - w),$
so w′ optimal iff w′ −w is in the right nullspace of A.
(We’ll revisit all this with convexity later.)
22 / 94
Geometric interpretation of least squares ERM
Let aj ∈ Rn be the j-th column of matrix A ∈ Rn×d, so
$A = \begin{bmatrix} \uparrow & & \uparrow \\ a_1 & \cdots & a_d \\ \downarrow & & \downarrow \end{bmatrix}.$
Minimizing ‖Aw − b‖₂² is the same as finding the vector b̂ ∈ range(A) closest to b.
Solution b̂ is the orthogonal projection of b onto range(A) = {Aw : w ∈ Rd}.
[Figure: b and its orthogonal projection b̂ onto the plane spanned by a1 and a2.]
I b̂ is uniquely determined; indeed, $\hat{b} = AA^+ b = \sum_{i=1}^r u_i u_i^\top b$.
I If r = rank(A) < d, then there is more than one way to write b̂ as a linear combination of a1, . . . , ad.
If rank(A) < d, then the ERM solution is not unique.
To get w from b̂: solve the system of linear equations Aw = b̂.
23 / 94
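A small numpy illustration (assumed random data) of the projection view: b̂ = AA+b lies in range(A), and the residual b − b̂ is orthogonal to the columns of A.

    import numpy as np

    n, d = 20, 3
    A = np.random.randn(n, d)
    b = np.random.randn(n)

    b_hat = A @ np.linalg.pinv(A) @ b          # orthogonal projection of b onto range(A)

    assert np.allclose(A.T @ (b - b_hat), 0)   # residual is orthogonal to every column of A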
7. Features
Enhancing linear regression models with features
Linear functions alone are restrictive, but become powerful with creative side-information, or features.
Idea: Predict with x ↦ wTφ(x), where φ is a feature mapping.
Examples:
1. Non-linear transformations of existing variables: for x ∈ R,
φ(x) = ln(1 + x).
2. Logical formula of binary variables: for x = (x1, . . . , xd) ∈ {0, 1}d,
φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7).
3. Trigonometric expansion: for x ∈ R,
φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).
4. Polynomial expansion: for x = (x1, . . . , xd) ∈ Rd,
$\phi(x) = (1, x_1, \ldots, x_d, x_1^2, \ldots, x_d^2, x_1 x_2, \ldots, x_1 x_d, \ldots, x_{d-1} x_d).$
24 / 94
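A minimal sketch (illustrative names and data, not from the slides) of ERM with a feature map: expand each x with a hypothetical φ, then run ordinary least squares on the expanded data.

    import numpy as np

    def phi(x):
        # hypothetical trigonometric expansion of a scalar x (as in example 3 above)
        return np.array([1.0, np.sin(x), np.cos(x), np.sin(2 * x), np.cos(2 * x)])

    x = np.random.uniform(0, 6, size=50)
    y = np.sin(x) + 0.1 * np.random.randn(50)        # toy targets

    Phi = np.vstack([phi(xi) for xi in x])           # n-by-(number of features) matrix
    w = np.linalg.pinv(Phi) @ y                      # least squares in feature space

    y_pred = Phi @ w                                 # predictions w^T phi(x_i)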
Example: Taking advantage of linearity
Suppose you are trying to predict some health outcome.
I Physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:
φ(x) = (xtemp − 98.6)².
I What if you didn't know about this magic constant 98.6?
I Instead, use φ(x) = (1, xtemp, xtemp²).
Can learn coefficients w such that
wTφ(x) = (xtemp − 98.6)²,
or any other quadratic polynomial in xtemp (which may be better!).
25 / 94
Quadratic expansion
Quadratic function f : R → R,
f(x) = ax² + bx + c, x ∈ R,
for a, b, c ∈ R.
This can be written as a linear function of φ(x), where
φ(x) := (1, x, x²),
since f(x) = wTφ(x) where w = (c, b, a).
For a multivariate quadratic function f : Rd → R, use
$\phi(x) := (\underbrace{1, x_1, \ldots, x_d}_{\text{linear terms}}, \underbrace{x_1^2, \ldots, x_d^2}_{\text{squared terms}}, \underbrace{x_1 x_2, \ldots, x_1 x_d, \ldots, x_{d-1} x_d}_{\text{cross terms}}).$
26 / 94
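An illustrative sketch of the multivariate quadratic expansion above (the function name is an assumption, not from the slides):

    import numpy as np
    from itertools import combinations

    def quadratic_expansion(x):
        # x is a length-d vector; returns (1, linear terms, squared terms, cross terms)
        d = len(x)
        cross = [x[i] * x[j] for i, j in combinations(range(d), 2)]
        return np.concatenate(([1.0], x, x ** 2, cross))

    x = np.array([2.0, -1.0, 3.0])
    print(quadratic_expansion(x))   # length 1 + d + d + d*(d-1)/2 = 10 for d = 3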
Affine expansion and “Old Faithful”
Woodward needed an affine expansion for “Old Faithful” data:
φ(x) := (1, x).
[Scatter plot: duration of last eruption vs. time until next eruption, with a fitted affine function.]
Affine function fa,b : R→ R for a, b ∈ R,
fa,b(x) = a+ bx,
is a linear function fw of φ(x) for w = (a, b).
(This easily generalizes to multivariate affine functions.)
27 / 94
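A minimal numpy sketch of fitting the affine expansion φ(x) = (1, x) by least squares; the data below is a made-up stand-in for the Old Faithful records, not the actual dataset.

    import numpy as np

    # made-up (duration, delay) pairs standing in for the Old Faithful data
    duration = np.random.uniform(1.5, 5.0, size=136)
    delay = 30.0 + 10.0 * duration + 3.0 * np.random.randn(136)

    Phi = np.column_stack([np.ones_like(duration), duration])   # phi(x) = (1, x)
    a, b = np.linalg.lstsq(Phi, delay, rcond=None)[0]           # fit f(x) = a + b*x

    predicted_delay = a + b * 4.0     # prediction for a 4-minute eruption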
Final remarks on features
I “Feature engineering” can drastically change the power of a model.
I Some people consider it messy, unprincipled, pure “trial-and-error”.
I Deep learning is somewhat touted as removing some of this, but it doesn't do so completely (e.g., it took a lot of work to come up with the “convolutional neural network”; side question: who came up with that?).
28 / 94
8. Statistical view of least squares; maximum likelihood
Maximum likelihood estimation (MLE) refresher
Parametric statistical model: P = {Pθ : θ ∈ Θ}, a collection of probability distributions for observed data.
I Θ: parameter space.
I θ ∈ Θ: a particular parameter (or parameter vector).
I Pθ: a particular probability distribution for observed data.
Likelihood of θ ∈ Θ given observed data x: For discrete X ∼ Pθ with probability mass function pθ,
L(θ) := pθ(x).
For continuous X ∼ Pθ with probability density function fθ,
L(θ) := fθ(x).
Maximum likelihood estimator (MLE): Let θ̂ be the θ ∈ Θ of highest likelihood given observed data.
29 / 94
Distributions over labeled examples
X: Space of possible side-information (feature space). Y: Space of possible outcomes (label space or output space).
Distribution P of random pair (X,Y ) taking values in X × Y can be thought of in two parts:
1. Marginal distribution PX of X:
PX is a probability distribution on X .
2. Conditional distribution PY |X=x of Y given X = x for each x ∈ X :
PY |X=x is a probability distribution on Y.
This lecture: Y = R (regression problems).
30 / 94
Optimal predictor
What function f : X → R has smallest (squared loss) risk
R(f) := E[(f(X)− Y )2]?
Note: earlier we discussed empirical risk.
I Conditional on X = x, the minimizer of conditional risk
y ↦ E[(y − Y)² | X = x]
is the conditional mean E[Y | X = x].
I Therefore, the function f* : R → R where
f*(x) = E[Y | X = x], x ∈ R
has the smallest risk.
I f* is called the regression function or conditional mean function.
31 / 94
Linear regression models
When side-information is encoded as vectors of real numbers x = (x1, . . . , xd) (called features or variables), it is common to use a linear regression model, such as the following:
Y |X = x ∼ N(xTw, σ2), x ∈ Rd.
I Parameters: w = (w1, . . . , wd) ∈ Rd, σ2 > 0.
I X = (X1, . . . , Xd), a random vector (i.e., a vector of random variables).
I Conditional distribution of Y given X is normal.
I Marginal distribution of X not specified.
In this model, the regression function f* is a linear function fw : Rd → R,
$f_w(x) = x^\top w = \sum_{i=1}^d w_i x_i, \quad x \in \mathbb{R}^d.$
(We'll often refer to fw just by w.)
[Plot: the regression function f* shown as a line over x.]
32 / 94
Maximum likelihood estimation for linear regression
Linear regression model with Gaussian noise: (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, with
Y |X = x ∼ N(xTw, σ2), x ∈ Rd.
(Traditional to study linear regression in context of this model.)
Log-likelihood of (w, σ²), given data (Xi, Yi) = (xi, yi) for i = 1, . . . , n:
$\sum_{i=1}^n \Big\{ -\frac{1}{2\sigma^2}(x_i^\top w - y_i)^2 + \frac{1}{2}\ln\frac{1}{2\pi\sigma^2} \Big\} + \big\{ \text{terms not involving } (w, \sigma^2) \big\}.$
The w that maximizes the log-likelihood is also the w that minimizes
$\frac{1}{n}\sum_{i=1}^n (x_i^\top w - y_i)^2.$
This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .
33 / 94
Empirical distribution and empirical risk
Empirical distribution Pn on (x1, y1), . . . , (xn, yn) has probability mass function pn given by
$p_n((x, y)) := \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{(x, y) = (x_i, y_i)\}, \quad (x, y) \in \mathbb{R}^d \times \mathbb{R}.$
Plug-in principle: Goal is to find function f that minimizes (squared loss) risk
R(f) = E[(f(X)− Y )2].
But we don’t know the distribution P of (X,Y ).
Replace P with Pn → Empirical (squared loss) risk R̂(f):
$\widehat{R}(f) := \frac{1}{n}\sum_{i=1}^n (f(x_i) - y_i)^2.$
(“Plug-in principle” is used throughout statistics in this same way.)
34 / 94
Empirical risk minimization
Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.
For linear functions and squared loss: ERM returns
$\hat{w} \in \arg\min_{w \in \mathbb{R}^d} \widehat{R}(w),$
which coincides with MLE under the basic linear regression model.
In general:
I MLE makes sense in context of statistical model for which it is derived.
I ERM makes sense in context of general iid model for supervised learning.
Further remarks.
I In MLE, we assume a model, and we not only maximize likelihood, but can try to argue we “recover” a “true” parameter.
I In ERM, by default there is no assumption of a “true” parameter to recover.
Useful examples: medical testing, gene expression, . . .
35 / 94
Old Faithful data under this least squares statistical model
Recall our data, consisting of historical records of eruptions:
[Timeline: eruption start/end times with gaps Y1, Y2, Y3 between eruptions.]
Statistical model (not just IID!): Y1, . . . , Yn, Y ∼iid N(µ, σ2).
I Data: Yi := ai − bi−1, i = 1, . . . , n.
(Admittedly not a great model, since durations are non-negative.)
Task: At later time t (when an eruption ends), predict time of next eruption t + Y. For the linear regression model, we'll assume
Y |X = x ∼ N(xTw, σ2), x ∈ Rd.
(This extends the model above if we add the “1” feature.)
36 / 94
9. Regularization and ridge regression
Inductive bias
Suppose ERM solution is not unique. What should we do?
One possible answer: Pick the w of shortest length.
I Fact: The shortest solution w to (ATA)w = ATb is always unique.
I Obtain w via w = A+b,
where A+ is the (Moore-Penrose) pseudoinverse of A.
Why should this be a good idea?
I Data does not give reason to choose a shorter w over a longer w.
I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the “true” w* is short), not for others.
All learning algorithms encode some kind of inductive bias.
37 / 94
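A small numpy illustration (random underdetermined data, assumed only for this example) of the minimum-norm inductive bias: when n < d there are many ERM solutions, and A+b picks the shortest one.

    import numpy as np

    n, d = 5, 10                              # underdetermined: fewer examples than features
    A = np.random.randn(n, d)
    b = np.random.randn(n)

    w_min_norm = np.linalg.pinv(A) @ b        # shortest w satisfying A^T A w = A^T b

    # another solution: add a vector from the nullspace of A
    _, _, Vt = np.linalg.svd(A)               # full SVD; last rows of Vt span the nullspace
    w_other = w_min_norm + Vt[-1]
    assert np.allclose(A @ w_other, A @ w_min_norm)              # same predictions, same risk
    assert np.linalg.norm(w_other) > np.linalg.norm(w_min_norm)  # but longer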
Example
ERM with scaled trigonometric feature expansion:
$\phi(x) = (1, \sin(x), \cos(x), \tfrac{1}{2}\sin(2x), \tfrac{1}{2}\cos(2x), \tfrac{1}{3}\sin(3x), \tfrac{1}{3}\cos(3x), \ldots).$
[Plots: training data, an arbitrary ERM fit, and the least ℓ2 norm ERM fit.]
It is not a given that the least norm ERM is better than the other ERM!
38 / 94
Regularized ERM
Combine the two concerns: For a given λ ≥ 0, find the minimizer of
R(w) + λ‖w‖₂²
over w ∈ Rd.
Fact: If λ > 0, then the solution is always unique (even if n < d)!
I This is called ridge regression.
(λ = 0 is ERM / Ordinary Least Squares.)
Explicit solution: (ATA + λI)−1ATb.
I Parameter λ controls how much attention is paid to the regularizer ‖w‖₂² relative to the data fitting term R(w).
I Choose λ using cross-validation.
Note: in deep networks, this regularization is called “weight decay”. (Why?) Note: another popular regularizer for linear regression is ℓ1.
39 / 94
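A small numpy sketch (not from the slides) of the ridge closed form just stated; A and b include the 1/√n scaling used on these slides, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 10, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=n)

A = X / np.sqrt(n)                             # so that ||Aw - b||^2 = (1/n) sum_i (x_i'w - y_i)^2
b = y / np.sqrt(n)

# Ridge solution (A'A + lam I)^{-1} A'b; unique for any lam > 0.
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

def objective(w):
    return np.mean((X @ w - y) ** 2) + lam * np.dot(w, w)

print(objective(w_ridge))                                  # value at the ridge solution
print(objective(w_ridge + 0.01 * rng.normal(size=d)))      # any perturbation does worse
```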
10. True risk and overfitting
Statistical interpretation of ERM
Let (X, Y) ∼ P, where P is some distribution on ℝᵈ × ℝ. Which w have smallest risk R(w) = E[(Xᵀw − Y)²]?
Necessary condition for w to be a minimizer of R:
∇R(w) = 0, i.e., w is a critical point of R.
This translates to
E[XXᵀ] w = E[Y X],
a system of linear equations called the population normal equations.
It can be proved that every critical point of R is a minimizer of R.
Looks familiar?
If (X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid, then
E[AᵀA] = E[XXᵀ] and E[Aᵀb] = E[Y X],
so ERM can be regarded as a plug-in estimator for a minimizer of R.
40 / 94
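An illustrative numpy sketch (not from the slides) of the plug-in view: the empirical quantities AᵀA and Aᵀb converge to E[XXᵀ] and E[YX], so solving the empirical normal equations approaches the population minimizer. The distribution below is made up so that w_star is that minimizer.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
w_star = np.array([1.0, -2.0, 0.5])            # population minimizer, since E[XX'] = I here

def sample(n):
    X = rng.normal(size=(n, d))
    Y = X @ w_star + rng.normal(size=n)        # noisy labels: Y is not exactly X w_star
    return X, Y

for n in [100, 10_000, 1_000_000]:
    X, Y = sample(n)
    S = X.T @ X / n                            # empirical version of E[XX']
    v = X.T @ Y / n                            # empirical version of E[YX]
    w_hat = np.linalg.solve(S, v)              # solve the (empirical) normal equations
    print(n, np.linalg.norm(w_hat - w_star))   # shrinks as n grows
```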
Risk of ERM
IID model: (X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid, taking values in ℝᵈ × ℝ.
Let w⋆ be a minimizer of R over all w ∈ ℝᵈ, i.e., w⋆ satisfies the population normal equations
E[XXᵀ] w⋆ = E[Y X].
I If the ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w⋆).
I What about when the ERM solution is unique?
Theorem. Under mild assumptions on the distribution of X,
R(ŵ) − R(w⋆) = O( tr(cov(εW)) / n )
“asymptotically”, where W := E[XXᵀ]^(−1/2) X and ε := Y − Xᵀw⋆.
41 / 94
Risk of ERM analysis (rough sketch)
Let εᵢ := Yᵢ − Xᵢᵀw⋆ for each i = 1, . . . , n, so
E[εᵢXᵢ] = E[YᵢXᵢ] − E[XᵢXᵢᵀ] w⋆ = 0
and
√n (ŵ − w⋆) = ( (1/n) ∑ᵢ₌₁ⁿ XᵢXᵢᵀ )⁻¹ · (1/√n) ∑ᵢ₌₁ⁿ εᵢXᵢ.
1. By LLN: (1/n) ∑ᵢ₌₁ⁿ XᵢXᵢᵀ →p E[XXᵀ].
2. By CLT: (1/√n) ∑ᵢ₌₁ⁿ εᵢXᵢ →d cov(εX)^(1/2) Z, where Z ∼ N(0, I).
Therefore, the asymptotic distribution of √n (ŵ − w⋆) is
√n (ŵ − w⋆) →d E[XXᵀ]⁻¹ cov(εX)^(1/2) Z.
A few more steps gives
n ( E[(Xᵀŵ − Y)²] − E[(Xᵀw⋆ − Y)²] ) →d ‖E[XXᵀ]^(−1/2) cov(εX)^(1/2) Z‖₂².
Random variable on RHS is “concentrated” around its mean tr(cov(εW)).
42 / 94
Risk of ERM: postscript
I Analysis does not assume that the linear regression model is “correct”; the data distribution need not be from the normal linear regression model.
I Only assumptions are those needed for LLN and CLT to hold.
I However, if the normal linear regression model holds, i.e.,
Y | X = x ∼ N(xᵀw⋆, σ²),
then the bound from the theorem becomes
R(ŵ) − R(w⋆) = O( σ²d / n ),
which is familiar to those who have taken introductory statistics.
I With more work, can also prove a non-asymptotic risk bound of similar form.
I In homework/reading, we look at a simpler setting for studying ERM for linear regression, called “fixed design”.
43 / 94
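A hedged simulation sketch (not from the slides) of the σ²d/n rate under the normal linear model: with X ∼ N(0, I), the excess risk of ERM is ‖ŵ − w⋆‖₂², and its average over trials should be close to σ²d/n.

```python
import numpy as np

rng = np.random.default_rng(3)
d, sigma = 10, 1.0
w_star = rng.normal(size=d)

for n in [50, 200, 800]:
    excess = []
    for _ in range(200):
        X = rng.normal(size=(n, d))
        y = X @ w_star + sigma * rng.normal(size=n)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        excess.append(np.sum((w_hat - w_star) ** 2))   # R(w_hat) - R(w_star) when X ~ N(0, I)
    print(n, np.mean(excess), sigma**2 * d / n)        # the last two columns should be close
```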
Risk vs empirical risk
Let ŵ be the ERM solution.
1. Empirical risk of ERM: R̂(ŵ)
2. True risk of ERM: R(ŵ)
Theorem. E[ R̂(ŵ) ] ≤ E[ R(ŵ) ].
(Empirical risk can sometimes be larger than true risk, but not on average.)
Overfitting: empirical risk is “small”, but true risk is “much higher”.
44 / 94
Overfitting example
(X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid; X is a continuous random variable in ℝ.
Suppose we use the degree-k polynomial expansion
φ(x) = (1, x, x², . . . , xᵏ), x ∈ ℝ,
so the dimension is d = k + 1.
Fact: Any function on ≤ k + 1 points can be interpolated by a polynomial of degree at most k.
(Figure omitted; axes: x vs. y.)
Conclusion: If n ≤ k + 1 = d, the ERM solution ŵ with this feature expansion has R̂(ŵ) = 0 always, regardless of its true risk (which can be ≫ 0).
45 / 94
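A minimal numpy sketch (not from the slides) of this overfitting example: with n = k + 1 points and a degree-k polynomial expansion, the empirical risk is driven to (numerically) zero even though the labels are pure noise, so the true risk stays large.

```python
import numpy as np

rng = np.random.default_rng(4)
k = 9                       # polynomial degree, so d = k + 1
n = k + 1                   # exactly enough points to interpolate

def phi(x):                 # feature expansion (1, x, x^2, ..., x^k)
    return np.vander(x, N=k + 1, increasing=True)

x_train = rng.uniform(size=n)
y_train = rng.normal(size=n)                    # pure noise: nothing real to learn

w, *_ = np.linalg.lstsq(phi(x_train), y_train, rcond=None)

x_test = rng.uniform(size=1000)
y_test = rng.normal(size=1000)

print(np.mean((phi(x_train) @ w - y_train) ** 2))   # ~0: empirical risk vanishes
print(np.mean((phi(x_test) @ w - y_test) ** 2))     # typically much larger: true risk is not small
```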
Estimating risk
IID model: (X₁, Y₁), . . . , (Xₙ, Yₙ), (X̃₁, Ỹ₁), . . . , (X̃ₘ, Ỹₘ) ∼iid P.
I training data (X₁, Y₁), . . . , (Xₙ, Yₙ) used to learn f̂.
I test data (X̃₁, Ỹ₁), . . . , (X̃ₘ, Ỹₘ) used to estimate risk, via the test risk
R̂test(f̂) := (1/m) ∑ᵢ₌₁ᵐ (f̂(X̃ᵢ) − Ỹᵢ)².
I Training data is independent of test data, so f̂ is independent of test data.
I Let Lᵢ := (f̂(X̃ᵢ) − Ỹᵢ)² for each i = 1, . . . , m, so
E[ R̂test(f̂) | f̂ ] = (1/m) ∑ᵢ₌₁ᵐ E[ Lᵢ | f̂ ] = R(f̂).
I Moreover, L₁, . . . , Lₘ are conditionally iid given f̂, and hence by the Law of Large Numbers,
R̂test(f̂) →p R(f̂) as m → ∞.
I By CLT, the rate of convergence is m^(−1/2).
46 / 94
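A small simulation sketch (not from the slides): held-out test risk tracks the true risk of the learned predictor, while training risk is biased downward. The data-generating model here is synthetic, chosen so the true risk can be computed exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, m = 5, 100, 10_000
w_star = rng.normal(size=d)

X_train = rng.normal(size=(n, d))
y_train = X_train @ w_star + rng.normal(size=n)
X_test = rng.normal(size=(m, d))               # fresh draws, independent of the training data
y_test = X_test @ w_star + rng.normal(size=m)

w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_risk = np.mean((X_train @ w_hat - y_train) ** 2)
test_risk = np.mean((X_test @ w_hat - y_test) ** 2)
true_risk = np.sum((w_hat - w_star) ** 2) + 1.0      # exact R(w_hat) for this simulated model

print(train_risk, test_risk, true_risk)    # test risk ~ true risk; training risk is smaller on average
```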
Rates for risk minimization vs. rates for risk estimation
One may think that ERM “works” because, somehow, training risk is a good “plug-in” estimate of true risk.
I Sometimes this is partially true—we’ll revisit this when we discuss generalization theory.
Roughly speaking, under some assumptions, can expect that
|R̂(w) − R(w)| ≤ O( √(d/n) ) for all w ∈ ℝᵈ.
However . . .
I By CLT, we know the following holds for a fixed w:
R̂(w) →p R(w) at n^(−1/2) rate.
(Here, we ignore the dependence on d.)
I Yet, for the ERM ŵ,
R(ŵ) →p R(w⋆) at n^(−1) rate.
(Also ignoring dependence on d.)
Implication: Selecting a good predictor can be “easier” than estimating how good predictors are!
47 / 94
Old Faithful example
I Linear regression model + affine expansion on “duration of last eruption”.
I Learn w = (35.0929, 10.3258) from 136 past observations.
I Mean squared loss of w on next 136 observations is 35.9404.
(Recall: mean squared loss of µ = 70.7941 was 187.1894.)
(Figure omitted: duration of last eruption vs. time until next eruption, with the linear model and the constant prediction overlaid.)
(Unfortunately, √35.9 ≈ 6.0 > mean duration ≈ 3.5.)
48 / 94
11. ℓ1 regularization: the LASSO
Regularization with a different norm
Lasso: For a given λ ≥ 0, find minimizer of
R(w) + λ‖w‖₁
over w ∈ ℝᵈ. Here, ‖v‖₁ = ∑ᵢ₌₁ᵈ |vᵢ| is the ℓ1-norm.
I Prefers shorter w, but using a different notion of length than ridge.
I Tends to produce w that are sparse—i.e., have few non-zero coordinates—or at least well-approximated by sparse vectors.
Fact: Vectors with small ℓ1-norm are well-approximated by sparse vectors.
If w̃ contains just the 1/ε² largest coefficients (by magnitude) of w, then
‖w − w̃‖₂ ≤ ε‖w‖₁.
49 / 94
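An illustrative sketch (not from the slides) of the sparsity effect, using scikit-learn’s Lasso and Ridge (assumed installed) on data with only three relevant features.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge    # assumes scikit-learn is installed

rng = np.random.default_rng(6)
n, d = 100, 50
w_star = np.zeros(d)
w_star[:3] = [2.0, -1.5, 1.0]                    # only 3 relevant features

X = rng.normal(size=(n, d))
y = X @ w_star + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print(np.sum(lasso.coef_ != 0))   # few non-zero coordinates: a sparse w
print(np.sum(ridge.coef_ != 0))   # typically all 50 coordinates are non-zero
```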
Sparse approximations
Claim: If w̃ contains just the T largest coefficients (by magnitude) of w, then
‖w − w̃‖₂ ≤ ‖w‖₁ / √(T + 1).
WLOG |w₁| ≥ |w₂| ≥ · · · , so w̃ = (w₁, . . . , w_T, 0, . . . , 0).
‖w − w̃‖₂² = ∑_{i≥T+1} wᵢ² ≤ ∑_{i≥T+1} |wᵢ| · |w_{T+1}| ≤ ‖w‖₁ · |w_{T+1}| ≤ ‖w‖₁ · ‖w‖₁ / (T + 1).
This is a consequence of the “mismatch” between ℓ1- and ℓ2-norms.
Can get similar results for other ℓp norms.
50 / 94
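A quick numeric check (not from the slides) of the claim above on a random dense vector.

```python
import numpy as np

rng = np.random.default_rng(7)
w = rng.laplace(size=1000)                  # an arbitrary dense vector
T = 50

idx = np.argsort(np.abs(w))[::-1]           # indices sorted by decreasing magnitude
w_tilde = np.zeros_like(w)
w_tilde[idx[:T]] = w[idx[:T]]               # keep only the T largest coefficients

lhs = np.linalg.norm(w - w_tilde)           # ||w - w~||_2
rhs = np.linalg.norm(w, 1) / np.sqrt(T + 1) # ||w||_1 / sqrt(T + 1)
print(lhs, rhs, lhs <= rhs)                 # the bound holds
```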
Example: Coefficient profile (`2 vs. `1)
Y = levels of prostate cancer antigen, X = clinical measurements
Horizontal axis: varying λ (large λ to left, small λ to right). Vertical axis: coefficient value in ℓ2-regularized ERM and ℓ1-regularized ERM, for eight different variables.
51 / 94
Other approaches to sparse regression
I Subset selection:
Find w that minimizes empirical risk among all vectors with at most k non-zero entries.
Unfortunately, this seems to require time exponential in k.
I Greedy algorithms:
Repeatedly choose new variable to “include” in support of w until k variables are included.
Forward stepwise regression / Orthogonal matching pursuit
Often works as well as `1-regularized ERM.
Why do we care about sparsity?
52 / 94
12. Summary
Summary
I ERM for OLS
I ERM in general
I Normal equations
I Pseudoinverse (least norm) solution
I Ridge regression
I Statistical view of ERM
53 / 94
Another interpretation of ridge regression
Define the (n+d) × d matrix A and (n+d) × 1 column vector b by
A := (1/√n) · [ x₁ᵀ ; · · · ; xₙᵀ ; √(nλ) I_d ],   b := (1/√n) · (y₁, . . . , yₙ, 0, . . . , 0)ᵀ,
where the top n rows of A are the feature vectors and the bottom d × d block is the diagonal matrix with √(nλ) on the diagonal.
Then
‖Aw − b‖₂² = R(w) + λ‖w‖₂².
Interpretation:
I d “fake” data points; they ensure that the augmented data matrix A has rank d.
I Squared length of each “fake” feature vector is nλ. All corresponding labels are 0.
I Prediction of w on the i-th fake feature vector is √(nλ) wᵢ.
57 / 94
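A numpy sketch (not from the slides) checking this interpretation: ordinary least squares on the augmented (A, b) reproduces the ridge solution.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, lam = 30, 5, 0.2
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Augment: d "fake" rows sqrt(n*lam) * I with labels 0, everything scaled by 1/sqrt(n).
A = np.vstack([X, np.sqrt(n * lam) * np.eye(d)]) / np.sqrt(n)
b = np.concatenate([y, np.zeros(d)]) / np.sqrt(n)

w_aug, *_ = np.linalg.lstsq(A, b, rcond=None)
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

print(np.allclose(w_aug, w_ridge))    # True: least squares on (A, b) is exactly ridge regression
```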
Enhancing linear regression models
Linear functions might sound rather restricted, but actually they can be quite powerful if you are creative about side-information.
Examples:
1. Non-linear transformations of existing variables: for x ∈ ℝ,
φ(x) = ln(1 + x).
2. Logical formula of binary variables: for x = (x₁, . . . , x_d) ∈ {0, 1}ᵈ,
φ(x) = (x₁ ∧ x₅ ∧ ¬x₁₀) ∨ (¬x₂ ∧ x₇).
3. Trigonometric expansion: for x ∈ ℝ,
φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).
4. Polynomial expansion: for x = (x₁, . . . , x_d) ∈ ℝᵈ,
φ(x) = (1, x₁, . . . , x_d, x₁², . . . , x_d², x₁x₂, . . . , x₁x_d, . . . , x_{d−1}x_d).
59 / 94
Example: Taking advantage of linearity
Suppose you are trying to predict some health outcome.
I Physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:
φ(x) = (x_temp − 98.6)².
I What if you didn’t know about this magic constant 98.6?
I Instead, use
φ(x) = (1, x_temp, x_temp²).
Can learn coefficients w such that
wᵀφ(x) = (x_temp − 98.6)²,
or any other quadratic polynomial in x_temp (which may be better!).
60 / 94
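An illustrative numpy sketch (not from the slides, with simulated temperatures): fitting w over φ(x) = (1, x, x²) recovers a quadratic close to (x − 98.6)² without being told the constant 98.6.

```python
import numpy as np

rng = np.random.default_rng(9)
temp = rng.normal(loc=98.6, scale=2.0, size=1000)        # simulated body temperatures
y = (temp - 98.6) ** 2 + 0.01 * rng.normal(size=1000)    # outcome driven by squared deviation

Phi = np.column_stack([np.ones_like(temp), temp, temp ** 2])   # phi(x) = (1, x, x^2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(w)   # roughly (98.6^2, -2 * 98.6, 1) = (9721.96, -197.2, 1.0)
```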
Quadratic expansion
Quadratic function f : ℝ → ℝ,
f(x) = ax² + bx + c, x ∈ ℝ,
for a, b, c ∈ ℝ.
This can be written as a linear function of φ(x), where
φ(x) := (1, x, x²),
since f(x) = wᵀφ(x) where w = (c, b, a).
For a multivariate quadratic function f : ℝᵈ → ℝ, use
φ(x) := (1, x₁, . . . , x_d [linear terms], x₁², . . . , x_d² [squared terms], x₁x₂, . . . , x₁x_d, . . . , x_{d−1}x_d [cross terms]).
61 / 94
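A small Python sketch (not from the slides) of the multivariate quadratic feature map, with linear, squared, and cross terms; the helper name quadratic_expansion is ours.

```python
import numpy as np
from itertools import combinations

def quadratic_expansion(x):
    """phi(x) = (1, x_1..x_d, x_1^2..x_d^2, and all cross terms x_i x_j with i < j)."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], x, x ** 2, cross])

x = np.array([2.0, -1.0, 3.0])
print(quadratic_expansion(x))
# [ 1.  2. -1.  3.  4.  1.  9. -2.  6. -3.]   -> dimension 1 + d + d + d(d-1)/2
```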
Affine expansion and “Old Faithful”
Woodward needed an affine expansion for the “Old Faithful” data:
φ(x) := (1, x).
(Figure omitted: duration of last eruption vs. time until next eruption, with the fitted affine function.)
An affine function f_{a,b} : ℝ → ℝ for a, b ∈ ℝ,
f_{a,b}(x) = a + bx,
is a linear function f_w of φ(x) for w = (a, b).
(This easily generalizes to multivariate affine functions.)
62 / 94
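A minimal numpy sketch (not from the slides) fitting the affine expansion φ(x) = (1, x) by least squares; the (duration, wait) pairs below are made up for illustration and do not reproduce the actual Old Faithful numbers.

```python
import numpy as np

rng = np.random.default_rng(10)
duration = rng.uniform(1.5, 5.5, size=136)                      # made-up eruption durations
wait = 35.0 + 10.0 * duration + 6.0 * rng.normal(size=136)      # made-up waiting times

Phi = np.column_stack([np.ones_like(duration), duration])       # phi(x) = (1, x)
(a, b), *_ = np.linalg.lstsq(Phi, wait, rcond=None)

print(a, b)             # intercept and slope of the affine predictor
print(a + b * 4.0)      # predicted wait after a 4-minute eruption
```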
Why linear regression models?
1. Linear regression models benefit from good choice of features.
2. Structure of linear functions is very well-understood.
3. Many well-understood and efficient algorithms for learning linear functions from data, even when n and d are large.
63 / 94
13. From data to prediction functions
Maximum likelihood estimation for linear regression
Linear regression model with Gaussian noise: (X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid, with
Y | X = x ∼ N(xᵀw, σ²), x ∈ ℝᵈ.
(Traditional to study linear regression in the context of this model.)
Log-likelihood of (w, σ²), given data (Xᵢ, Yᵢ) = (xᵢ, yᵢ) for i = 1, . . . , n:
∑ᵢ₌₁ⁿ { −(1/(2σ²)) (xᵢᵀw − yᵢ)² + (1/2) ln(1/(2πσ²)) } + { terms not involving (w, σ²) }.
The w that maximizes the log-likelihood is also the w that minimizes
(1/n) ∑ᵢ₌₁ⁿ (xᵢᵀw − yᵢ)².
This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .
64 / 94
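A numeric sketch (not from the slides): for fixed σ², differences in the Gaussian log-likelihood equal −n/(2σ²) times differences in empirical risk, so the MLE for w is exactly the empirical risk minimizer. The data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(11)
n, d, sigma = 40, 3, 0.7
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 2.0]) + sigma * rng.normal(size=n)

def log_likelihood(w):
    resid = X @ w - y
    return np.sum(-resid ** 2 / (2 * sigma ** 2) + 0.5 * np.log(1 / (2 * np.pi * sigma ** 2)))

def empirical_risk(w):
    return np.mean((X @ w - y) ** 2)

w1, w2 = rng.normal(size=d), rng.normal(size=d)
lhs = log_likelihood(w1) - log_likelihood(w2)
rhs = -(n / (2 * sigma ** 2)) * (empirical_risk(w1) - empirical_risk(w2))
print(np.isclose(lhs, rhs))   # True: maximizing the likelihood over w = minimizing empirical risk
```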
Empirical distribution and empirical risk
Empirical distribution Pn on (x1, y1), . . . , (xn, yn) has probability mass function pn given by

pn((x, y)) := (1/n) ∑_{i=1}^n 1{(x, y) = (xi, yi)},   (x, y) ∈ R^d × R.
Plug-in principle: Goal is to find function f that minimizes (squared loss) risk
R(f) = E[(f(X)− Y )2].
But we don’t know the distribution P of (X,Y ).
Replace P with Pn → Empirical (squared loss) risk R̂(f):

R̂(f) := (1/n) ∑_{i=1}^n (f(xi) − yi)^2.
65 / 94
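A minimal sketch of the empirical risk as a function of data and a predictor (added for illustration; the toy numbers are made up):

import numpy as np

def empirical_risk(f, xs, ys):
    """Average squared loss of predictor f on the sample (x1,y1),...,(xn,yn)."""
    preds = np.array([f(x) for x in xs])
    return np.mean((preds - ys) ** 2)

# Example with a linear predictor f(x) = w^T x.
w = np.array([1.0, -0.5])
xs = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
ys = np.array([0.0, -0.5, 3.5])
print(empirical_risk(lambda x: w @ x, xs, ys))  # 0.0 here, since ys = xs @ w exactly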
Empirical risk minimization
Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.
For linear functions and squared loss: ERM returns
ŵ ∈ arg min_{w ∈ R^d} R̂(w),
which coincides with MLE under the basic linear regression model.
In general:
I MLE makes sense in context of statistical model for which it is derived.
I ERM makes sense in context of general iid model for supervised learning.
66 / 94
Empirical risk minimization in pictures
Red dots: data points.
Affine hyperplane: linear function ŵ (via affine expansion (x1, x2) ↦ (1, x1, x2)).

ERM: minimize sum of squared vertical lengths from hyperplane to points.
67 / 94
Empirical risk minimization in matrix notation
Define n× d matrix A and n× 1 column vector b by
A := (1/√n) [ ← x1^T → ; . . . ; ← xn^T → ],    b := (1/√n) (y1, . . . , yn)^T.

Can write empirical risk as

R̂(w) = ‖Aw − b‖_2^2.

Necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to

(A^T A) w = A^T b,

a system of linear equations called the normal equations.

It can be proved that every critical point of R̂ is a minimizer of R̂.
68 / 94
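A small sketch (not from the slides; synthetic data) building A and b as defined above and checking that ‖Aw − b‖_2^2 equals the empirical risk:

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
xs, ys, w = rng.normal(size=(n, d)), rng.normal(size=n), rng.normal(size=d)

A = xs / np.sqrt(n)
b = ys / np.sqrt(n)
lhs = np.sum((A @ w - b) ** 2)          # ||Aw - b||^2
rhs = np.mean((xs @ w - ys) ** 2)       # (1/n) sum_i (x_i^T w - y_i)^2
print(np.isclose(lhs, rhs))             # True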
Aside: Convexity
Let f : Rd → R be a differentiable function.
Suppose we find x ∈ Rd such that ∇f(x) = 0. Is x a minimizer of f?
Yes, if f is a convex function:
f((1− t)x+ tx′) ≤ (1− t)f(x) + tf(x′),
for any 0 ≤ t ≤ 1 and any x,x′ ∈ Rd.
69 / 94
Convexity of empirical risk
Checking convexity of g(x) = ‖Ax − b‖_2^2:

g((1 − t)x + tx′)
  = ‖(1 − t)(Ax − b) + t(Ax′ − b)‖_2^2
  = (1 − t)^2 ‖Ax − b‖_2^2 + t^2 ‖Ax′ − b‖_2^2 + 2(1 − t)t (Ax − b)^T (Ax′ − b)
  = (1 − t) ‖Ax − b‖_2^2 + t ‖Ax′ − b‖_2^2
      − (1 − t)t [ ‖Ax − b‖_2^2 + ‖Ax′ − b‖_2^2 ] + 2(1 − t)t (Ax − b)^T (Ax′ − b)
  ≤ (1 − t) ‖Ax − b‖_2^2 + t ‖Ax′ − b‖_2^2,

where the last step uses the Cauchy-Schwarz inequality and the arithmetic mean/geometric mean (AM/GM) inequality.
70 / 94
Convexity of empirical risk, another way
Preview of convex analysis
Recall R̂(w) = (1/n) ∑_{i=1}^n (xi^T w − yi)^2.
I Scalar function g(z) = cz2 is convex for any c ≥ 0.
I Composition (g ◦ a) : Rd → R of any convex function g : R → R and any affine function a : Rd → R is convex.
I Therefore, function w ↦ (1/n) (xi^T w − yi)^2 is convex.
I Sum of convex functions is convex.
I Therefore R is convex.
Convexity is a useful mathematical property to understand!
(We’ll study more convex analysis in a few weeks.)
71 / 94
Algorithm for ERM
Algorithm for ERM with linear functions and squared loss†
input Data (x1, y1), . . . , (xn, yn) from R^d × R.
output Linear function w ∈ R^d.
1: Find solution w to the normal equations defined by the data(using, e.g., Gaussian elimination).
2: return w.
†Also called “ordinary least squares” in this context.
Running time (dominated by Gaussian elimination): O(nd2).Note: there are many approximate solvers that run in nearly linear time!
72 / 94
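A sketch of this algorithm in numpy (not from the slides; synthetic data): form the normal equations (A^T A) w = A^T b and solve the d × d system directly.

import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
xs = rng.normal(size=(n, d))
ys = xs @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

A = xs / np.sqrt(n)
b = ys / np.sqrt(n)
w_hat = np.linalg.solve(A.T @ A, A.T @ b)   # Gaussian elimination on the d x d system

# Same answer as a library least-squares solver (up to numerical error).
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(w_hat, w_lstsq))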
Geometric interpretation of least squares ERM
Let aj ∈ Rn be the j-th column of matrix A ∈ Rn×d, so
A = [ a1 | a2 | · · · | ad ].

Minimizing ‖Aw − b‖_2^2 is the same as finding the vector b̂ ∈ range(A) closest to b.

Solution b̂ is the orthogonal projection of b onto range(A) = {Aw : w ∈ R^d}.

[Figure: b, its projection b̂, and the plane spanned by columns a1, a2.]

I b̂ is uniquely determined.

I If rank(A) < d, then there is more than one way to write b̂ as a linear combination of a1, . . . , ad.

If rank(A) < d, then ERM solution is not unique.

To get ŵ from b̂: solve the system of linear equations Aw = b̂.
73 / 94
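A small numerical illustration of the projection view (a sketch added here; random data): the least-squares fit Aŵ equals the orthogonal projection of b onto range(A), and the residual is orthogonal to range(A).

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)

w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ w_hat

# Projection via an orthonormal basis Q for range(A).
Q, _ = np.linalg.qr(A)
b_proj = Q @ (Q.T @ b)

print(np.allclose(b_hat, b_proj))          # same vector
print(np.allclose(A.T @ (b - b_hat), 0))   # residual is orthogonal to range(A)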
Statistical interpretation of ERM
Let (X, Y) ∼ P, where P is some distribution on R^d × R.
Which w have smallest risk R(w) = E[(X^T w − Y)^2]?
Necessary condition for w to be a minimizer of R:
∇R(w) = 0, i.e., w is a critical point of R.
This translates to E[XX^T] w = E[Y X],
a system of linear equations called the population normal equations.
It can be proved that every critical point of R is a minimizer of R.
Looks familiar?
If (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, then
E[ATA] = E[XXT] and E[ATb] = E[YX],
so ERM can be regarded as a plug-in estimator for a minimizer of R.
74 / 94
14. Risk, empirical risk, and estimating risk
Risk of ERM
IID model: (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, taking values in Rd × R.
Let w* be a minimizer of R over all w ∈ R^d, i.e., w* satisfies the population normal equations

E[XX^T] w* = E[Y X].

I If ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w*).

I What about when ERM solution is unique?

Theorem. Under mild assumptions on distribution of X,

R(ŵ) − R(w*) = O( tr(cov(εW)) / n )

“asymptotically”, where W := E[XX^T]^{−1/2} X and ε := Y − X^T w*.
75 / 94
Risk of ERM analysis (rough sketch)
Let εi := Yi − Xi^T w* for each i = 1, . . . , n, so

E[εi Xi] = E[Yi Xi] − E[Xi Xi^T] w* = 0

and

√n (ŵ − w*) = ( (1/n) ∑_{i=1}^n Xi Xi^T )^{−1} · (1/√n) ∑_{i=1}^n εi Xi.

1. By LLN: (1/n) ∑_{i=1}^n Xi Xi^T →p E[XX^T].

2. By CLT: (1/√n) ∑_{i=1}^n εi Xi →d cov(εX)^{1/2} Z, where Z ∼ N(0, I).

Therefore, asymptotic distribution of √n(ŵ − w*) is

√n (ŵ − w*) →d E[XX^T]^{−1} cov(εX)^{1/2} Z.

A few more steps gives

n ( E[(X^T ŵ − Y)^2] − E[(X^T w* − Y)^2] ) →d ‖ E[XX^T]^{−1/2} cov(εX)^{1/2} Z ‖_2^2.

Random variable on RHS is “concentrated” around its mean tr(cov(εW)).
76 / 94
Risk of ERM: postscript
I Analysis does not assume that the linear regression model is “correct”; the data distribution need not be from the normal linear regression model.
I Only assumptions are those needed for LLN and CLT to hold.
I However, if normal linear regression model holds, i.e.,
Y | X = x ∼ N(x^T w*, σ^2),

then the bound from the theorem becomes

R(ŵ) − R(w*) = O( σ^2 d / n ),
which is familiar to those who have taken introductory statistics.
I With more work, can also prove non-asymptotic risk bound of similar form.
I In homework/reading, we look at a simpler setting for studying ERM for linear regression, called “fixed design”.
77 / 94
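A small simulation of the bound above (an illustrative sketch; the model, constants, and trial counts are made up for the example): under the normal linear regression model with X ∼ N(0, I), the excess risk of ERM equals ‖ŵ − w*‖^2 and behaves roughly like σ^2 d / n.

import numpy as np

rng = np.random.default_rng(3)
d, sigma = 10, 0.5
w_star = rng.normal(size=d)

def excess_risk_of_erm(n, trials=200):
    vals = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        y = X @ w_star + sigma * rng.normal(size=n)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        # For X ~ N(0, I): R(w) - R(w*) = ||w - w*||^2.
        vals.append(np.sum((w_hat - w_star) ** 2))
    return np.mean(vals)

for n in [50, 100, 200, 400]:
    print(n, excess_risk_of_erm(n), sigma ** 2 * d / n)  # the two columns are roughly comparable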
Risk vs empirical risk
Let ŵ be the ERM solution.

1. Empirical risk of ERM: R̂(ŵ)

2. True risk of ERM: R(ŵ)

Theorem. E[ R̂(ŵ) ] ≤ E[ R(ŵ) ].
(Empirical risk can sometimes be larger than true risk, but not on average.)
Overfitting: empirical risk is “small”, but true risk is “much higher”.
78 / 94
Overfitting example
(X1, Y1), . . . , (Xn, Yn), (X,Y ) are iid; X is continuous random variable in R.
Suppose we use degree-k polynomial expansion

φ(x) = (1, x, x^2, . . . , x^k), x ∈ R,

so dimension is d = k + 1.

Fact: Any function on ≤ k + 1 points can be interpolated by a polynomial of degree at most k.

[Figure: data points (x ∈ [0, 1], y ∈ [−3, 3]) interpolated exactly by a degree-k polynomial.]

Conclusion: If n ≤ k + 1 = d, ERM solution ŵ with this feature expansion has R̂(ŵ) = 0 always, regardless of its true risk (which can be ≫ 0).
79 / 94
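A sketch of this overfitting effect (added for illustration; the data is pure noise, made up for the example): with n = k + 1 points and a degree-k expansion, ERM interpolates the training data, so training risk is ~0 while test risk stays large.

import numpy as np

rng = np.random.default_rng(4)
n, k = 10, 9                      # n = k + 1: enough parameters to interpolate
x_train = rng.uniform(size=n)
y_train = rng.normal(size=n)      # labels are pure noise: nothing to learn

Phi = np.vander(x_train, k + 1)   # columns x^k, ..., x, 1
w_hat, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

train_risk = np.mean((Phi @ w_hat - y_train) ** 2)
x_test, y_test = rng.uniform(size=1000), rng.normal(size=1000)
test_risk = np.mean((np.vander(x_test, k + 1) @ w_hat - y_test) ** 2)
print(train_risk, test_risk)      # ~0 vs. much larger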
Estimating risk
IID model: (X1, Y1), . . . , (Xn, Yn), (X̃1, Ỹ1), . . . , (X̃m, Ỹm) ∼ iid P.

I training data (X1, Y1), . . . , (Xn, Yn) used to learn f̂.

I test data (X̃1, Ỹ1), . . . , (X̃m, Ỹm) used to estimate risk, via test risk

R̂test(f̂) := (1/m) ∑_{i=1}^m ( f̂(X̃i) − Ỹi )^2.

I Training data is independent of test data, so f̂ is independent of test data.

I Let Li := ( f̂(X̃i) − Ỹi )^2 for each i = 1, . . . , m, so

E[ R̂test(f̂) | f̂ ] = (1/m) ∑_{i=1}^m E[ Li | f̂ ] = R(f̂).

I Moreover, L1, . . . , Lm are conditionally iid given f̂, and hence by Law of Large Numbers,

R̂test(f̂) →p R(f̂) as m → ∞.

I By CLT, the rate of convergence is m^{−1/2}.
80 / 94
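A sketch of risk estimation with held-out data (synthetic data for illustration; in practice the split comes from your actual dataset):

import numpy as np

rng = np.random.default_rng(5)
d = 3
w_star = rng.normal(size=d)

def sample(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_star + 0.2 * rng.normal(size=n)

X_train, y_train = sample(200)
X_test, y_test = sample(10_000)          # larger m gives a more accurate estimate

w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
test_risk = np.mean((X_test @ w_hat - y_test) ** 2)
print(test_risk)   # close to the noise variance 0.2**2 = 0.04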
Rates for risk minimization vs. rates for risk estimation
One may think that ERM “works” because, somehow, training risk is a good “plug-in” estimate of true risk.

I Sometimes this is partially true; we’ll revisit this when we discuss generalization theory.

Roughly speaking, under some assumptions, can expect that

| R̂(w) − R(w) | ≤ O( √(d / n) ) for all w ∈ R^d.

However . . .

I By CLT, we know the following holds for a fixed w:

R̂(w) →p R(w) at n^{−1/2} rate.

(Here, we ignore the dependence on d.)

I Yet, for ERM ŵ,

R(ŵ) →p R(w*) at n^{−1} rate.

(Also ignoring dependence on d.)

Implication: Selecting a good predictor can be “easier” than estimating how good predictors are!
81 / 94
Old Faithful example
I Linear regression model + affine expansion on “duration of last eruption”.
I Learn w = (35.0929, 10.3258) from 136 past observations.
I Mean squared loss of w on next 136 observations is 35.9404.
(Recall: mean squared loss of µ = 70.7941 was 187.1894.)
[Figure: Old Faithful data (duration of last eruption vs. time until next eruption) with the fitted linear model and the constant prediction.]

(Unfortunately, √35.9 ≈ 6.0 > mean duration ≈ 3.5.)
82 / 94
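A sketch of the fit above (added for illustration): `durations` and `waits` stand in for the Old Faithful columns; the values below are made-up placeholders, and loading the real data is left to the reader. The numbers on the slide came from the actual 136 past observations.

import numpy as np

durations = np.array([3.6, 1.8, 3.3, 2.3, 4.5, 2.9])   # placeholder values
waits = np.array([79.0, 54.0, 74.0, 62.0, 85.0, 68.0])  # placeholder values

Phi = np.column_stack([np.ones_like(durations), durations])  # affine expansion (1, x)
w_hat, *_ = np.linalg.lstsq(Phi, waits, rcond=None)
print(w_hat)                                 # (intercept, slope)
print(np.mean((Phi @ w_hat - waits) ** 2))   # training mean squared loss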
15. Regularization
Inductive bias
Suppose ERM solution is not unique. What should we do?
One possible answer: Pick the w of shortest length.
I Fact: The shortest solution w to (A^T A)w = A^T b is always unique.

I Fact: the OLS solution A⁺b (via the pseudoinverse A⁺) is the least norm solution.
Why should this be a good idea?
I Data does not give reason to choose a shorter w over a longer w.
I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when “true” w* is short), not for others.
All learning algorithms encode some kind of inductive bias.
83 / 94
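A sketch of the least-norm fact (added for illustration; random underdetermined data): when rank(A) < d there are many solutions, and np.linalg.pinv returns the one with smallest Euclidean norm.

import numpy as np

rng = np.random.default_rng(6)
n, d = 5, 10                       # underdetermined: n < d
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w_min_norm = np.linalg.pinv(A) @ b
w_other = w_min_norm + np.linalg.svd(A)[2][-1]   # add a vector from the null space of A

print(np.allclose(A @ w_min_norm, A @ w_other))             # both fit the data exactly
print(np.linalg.norm(w_min_norm) < np.linalg.norm(w_other)) # but the pinv solution is shorter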
Example
ERM with scaled trigonometric feature expansion:
φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), . . . ).

[Figures across the slide builds: the training data, an arbitrary ERM fit, and the least ℓ2 norm ERM fit.]

It is not a given that the least norm ERM is better than the other ERM!
84 / 94
Regularized ERM
Combine the two concerns: For a given λ ≥ 0, find minimizer of
R̂(w) + λ‖w‖_2^2
over w ∈ Rd.
Fact: If λ > 0, then the solution is always unique (even if n < d)!
I This is called ridge regression.
(λ = 0 is ERM / Ordinary Least Squares.)
I Parameter λ controls how much attention is paid to the regularizer ‖w‖_2^2 relative to the data fitting term R̂(w).
I Choose λ using cross-validation.
85 / 94
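A sketch of ridge regression in numpy (added for illustration; random data): the minimizer of R̂(w) + λ‖w‖_2^2 solves (A^T A + λI) w = A^T b, which is solvable even when A^T A is singular (e.g., n < d).

import numpy as np

def ridge(xs, ys, lam):
    n, d = xs.shape
    A, b = xs / np.sqrt(n), ys / np.sqrt(n)
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

rng = np.random.default_rng(7)
xs = rng.normal(size=(5, 10))          # fewer samples than features
ys = rng.normal(size=5)
for lam in [0.01, 0.1, 1.0]:
    print(lam, np.linalg.norm(ridge(xs, ys, lam)))  # larger lambda gives a shorter w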
Another interpretation of ridge regression
Define (n + d) × d matrix A and (n + d) × 1 column vector b by

A := (1/√n) [ x1^T ; . . . ; xn^T ; √(nλ) e1^T ; . . . ; √(nλ) ed^T ],    b := (1/√n) (y1, . . . , yn, 0, . . . , 0)^T,

i.e., the original data rows followed by the d × d block √(nλ) · I, with label 0 for each appended row.

Then

‖Aw − b‖_2^2 = R̂(w) + λ‖w‖_2^2.
Interpretation:
I d “fake” data points; ensure that augmented data matrix A has rank d.
I Squared length of each “fake” feature vector is nλ.
All corresponding labels are 0.
I Prediction of w on i-th fake feature vector is √(nλ) · wi.
86 / 94
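A quick numerical check of this interpretation (added for illustration; random data): least squares on the augmented data gives exactly the closed-form ridge solution.

import numpy as np

rng = np.random.default_rng(8)
n, d, lam = 30, 6, 0.5
xs, ys = rng.normal(size=(n, d)), rng.normal(size=n)

# Augmented data: original rows plus sqrt(n*lam) * I with zero labels.
X_aug = np.vstack([xs, np.sqrt(n * lam) * np.eye(d)])
y_aug = np.concatenate([ys, np.zeros(d)])
w_aug, *_ = np.linalg.lstsq(X_aug / np.sqrt(n), y_aug / np.sqrt(n), rcond=None)

# Closed-form ridge on the original data.
A, b = xs / np.sqrt(n), ys / np.sqrt(n)
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
print(np.allclose(w_aug, w_ridge))  # True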
Regularization with a different norm
Lasso: For a given λ ≥ 0, find minimizer of
R̂(w) + λ‖w‖_1

over w ∈ R^d. Here, ‖v‖_1 = ∑_{i=1}^d |vi| is the ℓ1-norm.

I Prefers shorter w, but using a different notion of length than ridge.

I Tends to produce ŵ that are sparse—i.e., have few non-zero coordinates—or at least well-approximated by sparse vectors.

Fact: Vectors with small ℓ1-norm are well-approximated by sparse vectors.

If w̃ contains just the 1/ε^2 largest coefficients (by magnitude) of w, then

‖w − w̃‖_2 ≤ ε ‖w‖_1.
87 / 94
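One standard way to compute the lasso solution is proximal gradient descent (ISTA); below is a minimal sketch (added for illustration, not the only or official algorithm; synthetic data, and the step size and iteration count are just reasonable defaults).

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(xs, ys, lam, iters=2000):
    """Minimize (1/n)||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent."""
    n, d = xs.shape
    step = n / (2 * np.linalg.norm(xs, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(iters):
        grad = (2.0 / n) * xs.T @ (xs @ w - ys)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(9)
xs = rng.normal(size=(100, 20))
w_sparse = np.zeros(20); w_sparse[:3] = [3.0, -2.0, 1.5]
ys = xs @ w_sparse + 0.1 * rng.normal(size=100)
print(np.nonzero(lasso_ista(xs, ys, lam=0.1))[0])  # mostly the first three coordinates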
Sparse approximations
Claim: If w̃ contains just the T largest coefficients (by magnitude) of w, then

‖w − w̃‖_2 ≤ ‖w‖_1 / √(T + 1).

WLOG |w1| ≥ |w2| ≥ · · · , so w̃ = (w1, . . . , wT, 0, . . . , 0). Then

‖w − w̃‖_2^2 = ∑_{i ≥ T+1} wi^2
            ≤ ∑_{i ≥ T+1} |wi| · |w_{T+1}|
            ≤ ‖w‖_1 · |w_{T+1}|
            ≤ ‖w‖_1 · ‖w‖_1 / (T + 1),

where the last step uses (T + 1) |w_{T+1}| ≤ |w1| + · · · + |w_{T+1}| ≤ ‖w‖_1.

This is a consequence of “mismatch” between ℓ1- and ℓ2-norms.
Can get similar results for other ℓp norms.
88 / 94
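A quick numerical check of the claim above (added for illustration; the vector is random and made up for the example):

import numpy as np

rng = np.random.default_rng(10)
w = rng.laplace(size=1000)              # arbitrary dense vector
for T in [10, 50, 200]:
    thresh = np.sort(np.abs(w))[-T]
    w_tilde = np.where(np.abs(w) >= thresh, w, 0.0)   # keep the T largest entries
    lhs = np.linalg.norm(w - w_tilde)
    rhs = np.linalg.norm(w, 1) / np.sqrt(T + 1)
    print(T, lhs <= rhs)                # True for each T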
Example: Coefficient profile (`2 vs. `1)
Y = levels of prostate cancer antigen, X = clinical measurements

Horizontal axis: varying λ (large λ to left, small λ to right).
Vertical axis: coefficient value in ℓ2-regularized ERM and ℓ1-regularized ERM, for eight different variables.
89 / 94
Other approaches to sparse regression
I Subset selection:
Find w that minimizes empirical risk among all vectors with at most k non-zero entries.
Unfortunately, this seems to require time exponential in k.
I Greedy algorithms:
Repeatedly choose a new variable to “include” in the support of w until k variables are included.

Forward stepwise regression / Orthogonal matching pursuit (a sketch follows below).

Often works as well as ℓ1-regularized ERM.
Why do we care about sparsity?
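A minimal sketch of one such greedy method, orthogonal matching pursuit, on synthetic data (the helper name omp, the data, and k = 3 are illustrative assumptions, not part of the lecture):

    import numpy as np

    # Orthogonal matching pursuit: repeatedly add the column most correlated
    # with the current residual, then refit least squares on the chosen columns.
    def omp(X, y, k):
        n, d = X.shape
        support, w = [], np.zeros(d)
        residual = y.copy()
        for _ in range(k):
            correlations = np.abs(X.T @ residual)
            correlations[support] = -np.inf           # never re-pick a chosen column
            support.append(int(np.argmax(correlations)))
            w_s, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
            w = np.zeros(d)
            w[support] = w_s
            residual = y - X @ w
        return w

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    w_true = np.zeros(20)
    w_true[[3, 7, 11]] = [2.0, -1.5, 1.0]
    y = X @ w_true + rng.normal(0, 0.1, size=100)
    w_hat = omp(X, y, k=3)                            # typically recovers support {3, 7, 11}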
90 / 94
Key takeaways
1. IID model for supervised learning.
2. Optimal predictors, linear regression models, and optimal linear predictors.
3. Empirical risk minimization for linear predictors.
4. Risk of ERM; training risk vs. test risk; risk minimization vs. risk estimation.

5. Inductive bias, ℓ1- and ℓ2-regularization, sparsity.
Make sure you do the assigned reading, especially from the handouts!
91 / 94
misc
svd

pytorch/numpy; gpu; gpu errors. maybe even sgd. they'll use it in homework.

talk about regression and classification somewhere early on. can mention how to do it for dt and knn too i guess, though it's a little gross in this lecture?

before MLE slide, give a quick one-slide refresher/primer on MLE.

ridge and soln existence. for homework maybe prove λ → 0 gives svd?

daniel's 1/n. talk about loss functions

look at my old lec

svd topics: not unique; pseudoinverse equal inverse always; pseudoinverse always unique(?) or at least when inverse exists? talk about things it satisfies like XX^+X = X etc; “meaning” of the U, V matrices in svd; introduce svd via eigendecomposition
92 / 94
misc
logistic regression: optimize w ↦ (1/n) ∑_{i=1}^n ln(1 + exp(−y_i w^T x_i)).
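A minimal pytorch sketch of this objective on synthetic data, using autograd's .backward for the gradient (the step size, iteration count, and data below are placeholder choices):

    import torch
    import torch.nn.functional as F

    # Logistic loss (1/n) * sum_i log(1 + exp(-y_i <w, x_i>)), minimized by
    # plain gradient descent; softplus(z) = log(1 + exp(z)).
    n, d = 200, 5
    X = torch.randn(n, d)
    y = (torch.randint(0, 2, (n,)) * 2 - 1).float()   # labels in {-1, +1}
    w = torch.zeros(d, requires_grad=True)

    for _ in range(500):
        loss = F.softplus(-y * (X @ w)).mean()
        loss.backward()                               # autograd fills in w.grad
        with torch.no_grad():
            w -= 0.1 * w.grad
            w.grad.zero_()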
SVD solution for ols:
- write ‖Xw − y‖_2^2.
- normal equations (differentiate and set to zero): X^T X w = X^T y.
- writing X = USV^T, have V S^2 V^T w = V S U^T y.
- thus the pseudoinverse solution X^+ y = V S^+ U^T y satisfies the normal equations.

for homework maybe also suggest experiment with ridge regression (adding λ‖w‖^2/2).

for pytorch solver, can have them manually do gradient, and also use pytorch's .backward; see the sample code for lecture 1 (in the repository, not in the slides).

features: replace x_i with φ(x_i) where φ is some function. E.g., φ(x) = (1, x_1, . . . , x_d, x_1x_1, x_1x_2, . . . , x_1x_d, . . . , x_dx_d) means w^T φ(x) is a quadratic (and now we can search over all possible quadratics with our optimization).
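A minimal numpy sketch of the SVD route just described, on synthetic data (it assumes X has no zero singular values; in general, zero singular values are simply skipped when forming S^+):

    import numpy as np

    # Pseudoinverse solution w = V S^+ U^T y; it satisfies the normal equations
    # X^T X w = X^T y and matches np.linalg.lstsq.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 5))
    y = rng.standard_normal(100)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T
    w_svd = Vt.T @ ((U.T @ y) / s)                     # V S^+ U^T y
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    assert np.allclose(w_svd, w_lstsq)
    assert np.allclose(X.T @ (X @ w_svd), X.T @ y)     # normal equations hold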
93 / 94
16. Summary of linear regression so far
Main points
I Model/function/predictor class of linear regressors x ↦ w^T x.

I ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.

I ERM solution for least squares: pick w satisfying A^T A w = A^T b; this w is not unique in general, and one canonical choice is the ordinary least squares solution A^+ b.

I We also discussed feature expansion; affine and polynomial expansions are good to keep in mind! (A small sketch follows below.)
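A minimal sketch tying the last two points together, on synthetic data (the expansion φ below and the data are illustrative assumptions): quadratic feature expansion followed by the pseudoinverse solution A^+ b.

    import numpy as np

    # Affine + quadratic feature expansion, then ordinary least squares via the
    # pseudoinverse (the ERM solution A^+ b for the expanded design matrix A).
    rng = np.random.default_rng(0)
    x = rng.uniform(1.5, 5.0, size=50)              # scalar inputs, e.g. durations
    b = 30 + 10 * x + rng.normal(0, 3, size=50)     # noisy targets

    def phi(x):
        # map a vector of scalar inputs to rows (1, x, x^2)
        return np.stack([np.ones_like(x), x, x ** 2], axis=1)

    A = phi(x)
    w = np.linalg.pinv(A) @ b                       # ordinary least squares solution
    prediction = phi(np.array([3.0])) @ w           # predict at a new input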
94 / 94