
Linear regression

CS 446

1. Overview

todo

- check some continuity bugs
- make sure nothing missing from old lectures (both mine and Daniel's)
- fix some of those bugs, like b replacing y
- delete the excess material from end
- add proper summary slide which boils down concepts and reduces student worry

1 / 94

Lecture 1: supervised learning

Training data: labeled examples

(x1, y1), (x2, y2), . . . , (xn, yn)

where

- each input x_i is a machine-readable description of an instance (e.g., image, sentence), and

- each corresponding label y_i is an annotation relevant to the task—typically not easy to automatically obtain.

Goal: learn a function f from labeled examples that accurately “predicts” the labels of new (previously unseen) inputs.

[Diagram: past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → learned predictor → predicted label.]

2 / 94

Lecture 2: nearest neighbors and decision trees

[Scatter plot of labeled example points in the (x_1, x_2) plane.]

Nearest neighbors.
Training/fitting: memorize data.
Testing/predicting: find k closest memorized points, return plurality label.
Overfitting? Vary k.

Decision trees.
Training/fitting: greedily partition space, reducing “uncertainty”.
Testing/predicting: traverse tree, output leaf label.
Overfitting? Limit or prune tree.

3 / 94

Lectures 3-4: linear regression

[Scatter plot: delay until next eruption vs. duration of last eruption.]

Linear regression / least squares.

Our first (of many!) linear prediction methods.

Today:

- Example.

- How to solve it; ERM, and SVD.

- Features.

Next lecture: advanced topics, including overfitting.

4 / 94

2. Example: Old Faithful

Prediction problem: Old Faithful geyser (Yellowstone)

Task: Predict time of next eruption.

5 / 94

Time between eruptions

Historical records of eruptions:

[Timeline diagram of eruption records a_0, b_0, a_1, b_1, a_2, b_2, a_3, b_3, . . . with inter-eruption gaps Y_1, Y_2, Y_3.]

Time until next eruption: Y_i := a_i − b_{i−1}.

Prediction task:
At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.

On “Old Faithful” data:

- Using 136 past observations, we form the mean estimate µ = 70.7941.

- Can we do better?

6 / 94

Looking at the data

Naturalist Harry Woodward observed that time until the next eruption seems to be related to duration of last eruption.

[Scatter plot: time until next eruption vs. duration of last eruption.]

7 / 94

Using side-information

At prediction time t, duration of last eruption is available as side-information.

[Timeline diagram: past eruption records with observed pairs (X_n, Y_n), and the current time t at which side-information X is available and Y is to be predicted.]

IID model for supervised learning:
(X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples).

X takes values in X (e.g., X = R), Y takes values in R.

1. We observe (X_1, Y_1), . . . , (X_n, Y_n), and then choose a prediction function (a.k.a. predictor)

   f : X → R.

   This is called “learning” or “training”.

2. At prediction time, observe X, and form prediction f(X).

How should we choose f based on data? Recall:

- The model is our choice.

- We must contend with overfitting, bad fitting algorithms, . . .

8 / 94

3. Least squares and linear regression

Which line?

[Scatter plot: delay vs. duration, with a candidate regression line.]

Let's predict with a linear regressor:

y := w^T [x, 1]^T,

where w ∈ R^2 is learned from data.

Remark: appending 1 makes this an affine function x ↦ w_1 x + w_2. (More on this later. . . )

If data lies along a line, we should output that line.
But what if not?

9 / 94

ERM setup for least squares.

- Predictors/model: f(x) = w^T x; a linear predictor/regressor.
  (For linear classification: x ↦ sgn(w^T x).)

- Loss/penalty: the least squares loss

  ℓ_ls(ŷ, y) = ℓ_ls(y, ŷ) = (y − ŷ)^2.

  (Some conventions scale this by 1/2.)

- Goal: minimize the least squares empirical risk

  R̂_ls(f) = (1/n) ∑_{i=1}^n ℓ_ls(y_i, f(x_i)) = (1/n) ∑_{i=1}^n (y_i − f(x_i))^2.

- Specifically, we choose w ∈ R^d according to

  arg min_{w ∈ R^d} R̂_ls(x ↦ w^T x) = arg min_{w ∈ R^d} (1/n) ∑_{i=1}^n (y_i − w^T x_i)^2.

- More generally, this is the ERM approach: pick a model and minimize empirical risk over the model parameters.

10 / 94

ERM in general

- Pick a family of models/predictors F.
  (For today, we use linear predictors.)

- Pick a loss function ℓ.
  (For today, we chose squared loss.)

- Minimize the empirical risk over the model parameters.

We haven't discussed: true risk and overfitting; how to minimize; why this is a good idea.

Remark: ERM is convenient in pytorch: just pick a model, a loss, an optimizer, and tell it to minimize (see the sketch below).

11 / 94
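
To make the pytorch remark concrete, here is a minimal ERM sketch (the data, learning rate, and iteration count are made-up stand-ins, not from the lecture):

# ERM in pytorch: pick a model, a loss, an optimizer, and minimize.
import torch

n, d = 100, 2
X = torch.randn(n, d)                                  # stand-in inputs
y = X @ torch.tensor([1.5, -0.5]) + 0.1 * torch.randn(n)

model = torch.nn.Linear(d, 1, bias=False)              # the model: x -> w^T x
loss_fn = torch.nn.MSELoss()                           # the loss: least squares
opt = torch.optim.SGD(model.parameters(), lr=0.1)      # the optimizer

for _ in range(500):                                   # tell it to minimize
    opt.zero_grad()
    loss = loss_fn(model(X).squeeze(1), y)
    loss.backward()
    opt.step()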

Least squares ERM in pictures

Red dots: data points.

Affine hyperplane: our predictions (via affine expansion (x_1, x_2) ↦ (1, x_1, x_2)).

ERM: minimize the sum of squared vertical lengths from the hyperplane to the points.

12 / 94

Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [x_1^T; . . . ; x_n^T]  (i-th row is x_i^T/√n),    b := (1/√n) (y_1, . . . , y_n)^T.

Can write the empirical risk as

R̂(w) = (1/n) ∑_{i=1}^n (y_i − x_i^T w)^2 = ‖Aw − b‖_2^2.

Necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to

(A^T A) w = A^T b,

a system of linear equations called the normal equations.

In an upcoming lecture we'll prove every critical point of R̂ is a minimizer of R̂.

13 / 94
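
A small numpy sketch of this recipe (the data here is a synthetic stand-in; A^T A happens to be invertible in this example, so np.linalg.solve suffices):

import numpy as np

n, d = 50, 3
rng = np.random.default_rng(0)
xs = rng.normal(size=(n, d))                             # stand-in inputs x_i
ys = xs @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

A = xs / np.sqrt(n)                                      # rows x_i^T / sqrt(n)
b = ys / np.sqrt(n)

w = np.linalg.solve(A.T @ A, A.T @ b)                    # solve the normal equations
emp_risk = np.sum((A @ w - b) ** 2)                      # equals (1/n) * sum_i (y_i - x_i^T w)^2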

Summary on ERM and linear regression

Procedure:

- Form matrix A and vector b with data (resp. x_i, y_i) as rows.
  (Scaling factor 1/√n is not standard, doesn't change solution.)

- Find w satisfying the normal equations A^T A w = A^T b.
  (E.g., via Gaussian elimination, taking time O(nd^2).)

- In general, solutions are not unique. (Why not?)

- If A^T A is invertible, can choose the (unique) (A^T A)^{-1} A^T b.

- Recall our original conundrum: we want to fit some line.
  We chose least squares; it gives one (family of) choice(s).
  Next lecture, with logistic regression, we get another.

- Note: if Aw = b for some w, then data lies along a line, and we might as well not worry about picking a loss function.

- Note: Aw − b = 0 may not have solutions, but the least squares setting means we instead work with A^T(Aw − b) = 0, which does have solutions. . .

14 / 94

4. SVD and least squares

SVD

Recall the Singular Value Decomposition (SVD) M = U S V^T ∈ R^{m×n}, where

- U ∈ R^{m×r} is orthonormal, S ∈ R^{r×r} is diag(s_1, . . . , s_r) with s_1 ≥ s_2 ≥ · · · ≥ s_r ≥ 0, and V ∈ R^{n×r} is orthonormal, with r := rank(M). (If r = 0, use the convention S = 0 ∈ R^{1×1}.)

- This convention is sometimes called the thin SVD.

- Another notation is to write M = ∑_{i=1}^r s_i u_i v_i^T. This avoids the issue with 0 (empty sum is 0). Moreover, this notation makes it clear that (u_i)_{i=1}^r span the column space and (v_i)_{i=1}^r span the row space of M.

- The full SVD will not be used in this class; it fills out U and V to be full rank and orthonormal, and pads S with zeros. It agrees with the eigendecompositions of M^T M and M M^T.

- Note: numpy and pytorch have SVD (interfaces slightly differ). Determining r runs into numerical issues.

15 / 94
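
A short numpy illustration of the thin SVD and of the numerical-rank issue noted above (the tolerance is a common heuristic, assumed here rather than prescribed by the lecture):

import numpy as np

M = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 1.0]])     # 3 x 2 matrix of rank 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)       # thin factors: U (3 x 2), s (2,), Vt (2 x 2)

tol = max(M.shape) * np.finfo(float).eps * s.max()     # treat tiny singular values as zero
r = int((s > tol).sum())                               # numerical rank
M_rebuilt = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]         # equals M up to rounding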

Pseudoinverse

Let the SVD M = ∑_{i=1}^r s_i u_i v_i^T be given.

- Define the pseudoinverse M^+ = ∑_{i=1}^r (1/s_i) v_i u_i^T.
  (If 0 = M ∈ R^{m×n}, then 0 = M^+ ∈ R^{n×m}.)

- Alternatively, define the pseudoinverse S^+ of a diagonal matrix to be S^T but with reciprocals of the non-zero elements; then M^+ = V S^+ U^T.

- Also called the Moore-Penrose pseudoinverse; it is unique, even though the SVD is not unique (why not?).

- If M^{-1} exists, then M^{-1} = M^+ (why?).

16 / 94

SVD and least squares

Recall: we'd like to find w such that

A^T A w = A^T b.

If w = A^+ b, then

A^T A w = (∑_{i=1}^r s_i v_i u_i^T) (∑_{i=1}^r s_i u_i v_i^T) (∑_{i=1}^r (1/s_i) v_i u_i^T) b
        = (∑_{i=1}^r s_i v_i u_i^T) (∑_{i=1}^r u_i u_i^T) b = A^T b.

Henceforth, define w_ols = A^+ b as the OLS solution.
(OLS = “ordinary least squares”.)

Note: in general, A A^+ = ∑_{i=1}^r u_i u_i^T ≠ I.

17 / 94
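
A quick numpy check of this fact, using the built-in Moore-Penrose pseudoinverse (the data is an arbitrary stand-in):

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 3)) / np.sqrt(20)
b = rng.normal(size=20) / np.sqrt(20)

w_ols = np.linalg.pinv(A) @ b                     # w_ols = A^+ b
assert np.allclose(A.T @ A @ w_ols, A.T @ b)      # w_ols satisfies the normal equations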

5. Summary of linear regression so far

Main points

- Model/function/predictor class of linear regressors x ↦ w^T x.

- ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.

- ERM solution for least squares: pick w satisfying A^T A w = A^T b, which is not unique; one unique choice is the ordinary least squares solution A^+ b.

18 / 94

Part 2 of linear regression lecture. . .

Recap on SVD. (A messy slide, I’m sorry.)

Suppose 0 ≠ M ∈ R^{n×d}, thus r := rank(M) > 0.

- “Decomposition form” thin SVD: M = ∑_{i=1}^r s_i u_i v_i^T with s_1 ≥ · · · ≥ s_r > 0, and M^+ = ∑_{i=1}^r (1/s_i) v_i u_i^T, and in general M^+ M = ∑_{i=1}^r v_i v_i^T ≠ I.

- “Factorization form” thin SVD: M = U S V^T, with U ∈ R^{n×r} and V ∈ R^{d×r} orthonormal (but U U^T and V V^T are not identity matrices in general), and S = diag(s_1, . . . , s_r) ∈ R^{r×r} with s_1 ≥ · · · ≥ s_r > 0; pseudoinverse M^+ = V S^{-1} U^T, and in general M^+ M ≠ M M^+ ≠ I.

- Full SVD: M = U_f S_f V_f^T, with U_f ∈ R^{n×n} and V_f ∈ R^{d×d} orthonormal and full rank, so U_f^T U_f and V_f^T V_f are identity matrices, and S_f ∈ R^{n×d} is zero everywhere except the first r diagonal entries, which are s_1 ≥ · · · ≥ s_r > 0; pseudoinverse M^+ = V_f S_f^+ U_f^T, where S_f^+ is obtained by transposing S_f and then taking reciprocals of the nonzero entries, and in general M^+ M ≠ M M^+ ≠ I. Additional property: agreement with the eigendecompositions of M M^T and M^T M.

The “full SVD” adds columns to U and V which hit zeros of S_f and therefore don't matter (as a sanity check, verify for yourself that all these SVDs are equal).

19 / 94

Recap on SVD, zero matrix case

Suppose 0 = M ∈ R^{n×d}, thus r := rank(M) = 0.

- In all types of SVD, M^+ is M^T (another zero matrix).

- Technically speaking, s is a singular value of M iff there exist nonzero vectors (u, v) with Mv = su and M^T u = sv; the zero matrix therefore has no singular values (or left/right singular vectors).

- The “factorization form thin SVD” becomes a little messy.

20 / 94

6. More on the normal equations

Recall our matrix notation

Let labeled examples ((x_i, y_i))_{i=1}^n be given.

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [x_1^T; . . . ; x_n^T],    b := (1/√n) (y_1, . . . , y_n)^T.

Can write the empirical risk as

R̂(w) = (1/n) ∑_{i=1}^n (y_i − x_i^T w)^2 = ‖Aw − b‖_2^2.

Necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to

(A^T A) w = A^T b,

a system of linear equations called the normal equations.

We’ll now finally show that normal equations imply optimality.

21 / 94

Normal equations imply optimality

Consider w with A^T A w = A^T b, and any w′; then

‖Aw′ − b‖^2 = ‖Aw′ − Aw + Aw − b‖^2
            = ‖Aw′ − Aw‖^2 + 2(Aw′ − Aw)^T(Aw − b) + ‖Aw − b‖^2.

Since

(Aw′ − Aw)^T(Aw − b) = (w′ − w)^T(A^T A w − A^T b) = 0,

then ‖Aw′ − b‖^2 = ‖Aw′ − Aw‖^2 + ‖Aw − b‖^2 ≥ ‖Aw − b‖^2. This means w is optimal.

Moreover, writing A = ∑_{i=1}^r s_i u_i v_i^T,

‖Aw′ − Aw‖^2 = (w′ − w)^T (A^T A) (w′ − w) = (w′ − w)^T (∑_{i=1}^r s_i^2 v_i v_i^T) (w′ − w),

so w′ is also optimal iff w′ − w is in the right nullspace of A.

(We’ll revisit all this with convexity later.)

22 / 94

Geometric interpretation of least squares ERM

Let a_j ∈ R^n be the j-th column of matrix A ∈ R^{n×d}, so

A = [a_1 · · · a_d]  (columns a_j).

Minimizing ‖Aw − b‖_2^2 is the same as finding the vector b̂ ∈ range(A) closest to b.

The solution b̂ is the orthogonal projection of b onto range(A) = {Aw : w ∈ R^d}.

[Figure: b and its orthogonal projection b̂ onto the plane spanned by a_1 and a_2.]

- b̂ is uniquely determined; indeed, b̂ = A A^+ b = ∑_{i=1}^r u_i u_i^T b (a numerical check follows below).

- If r = rank(A) < d, then there is more than one way to write b̂ as a linear combination of a_1, . . . , a_d.

If rank(A) < d, then the ERM solution is not unique.

To get w from b̂: solve the system of linear equations Aw = b̂.

23 / 94
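
A small numpy check of the projection picture (stand-in data; b̂ is computed as A A^+ b and its residual is verified to be orthogonal to every column a_j):

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(10, 3))
b = rng.normal(size=10)

bhat = A @ np.linalg.pinv(A) @ b              # orthogonal projection of b onto range(A)
assert np.allclose(A.T @ (b - bhat), 0.0)     # residual is orthogonal to the columns of A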

7. Features

Enhancing linear regression models with features

Linear functions alone are restrictive, but become powerful with creative side-information, or features.

Idea: Predict with x ↦ w^T φ(x), where φ is a feature mapping. (A small fitting sketch follows the examples below.)

Examples:

1. Non-linear transformations of existing variables: for x ∈ R,

   φ(x) = ln(1 + x).

2. Logical formula of binary variables: for x = (x_1, . . . , x_d) ∈ {0, 1}^d,

   φ(x) = (x_1 ∧ x_5 ∧ ¬x_{10}) ∨ (¬x_2 ∧ x_7).

3. Trigonometric expansion: for x ∈ R,

   φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).

4. Polynomial expansion: for x = (x_1, . . . , x_d) ∈ R^d,

   φ(x) = (1, x_1, . . . , x_d, x_1^2, . . . , x_d^2, x_1 x_2, . . . , x_1 x_d, . . . , x_{d−1} x_d).

24 / 94
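
A sketch of least squares with a feature map, using a truncated version of the trigonometric expansion in example 3 (the cutoff K, the data, and the target function are illustrative stand-ins):

import numpy as np

K = 3
def phi(x):                                    # scalar x -> feature vector of length 2K + 1
    feats = [1.0]
    for k in range(1, K + 1):
        feats += [np.sin(k * x), np.cos(k * x)]
    return np.array(feats)

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 2.0 * np.pi, size=60)
y = np.sin(x) + 0.5 * np.cos(2.0 * x) + 0.1 * rng.normal(size=60)

Phi = np.stack([phi(xi) for xi in x])          # n x (2K + 1) design matrix
w = np.linalg.lstsq(Phi, y, rcond=None)[0]     # least squares in feature space
prediction = phi(1.0) @ w                      # predict with x -> w^T phi(x)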

Example: Taking advantage of linearity

Suppose you are trying to predict some health outcome.

- Physician suggests that body temperature is relevant, specifically the (square) deviation from normal body temperature:

  φ(x) = (x_temp − 98.6)^2.

- What if you didn't know about this magic constant 98.6?

- Instead, use

  φ(x) = (1, x_temp, x_temp^2).

  Can learn coefficients w such that

  w^T φ(x) = (x_temp − 98.6)^2,

  or any other quadratic polynomial in x_temp (which may be better!).

25 / 94
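
A small numerical check of this point (synthetic data; the noise level and sample size are arbitrary stand-ins): fitting least squares on the features (1, x_temp, x_temp^2) recovers the quadratic (x_temp − 98.6)^2 without knowing 98.6 in advance.

import numpy as np

rng = np.random.default_rng(4)
temps = rng.uniform(96.0, 102.0, size=200)
outcome = (temps - 98.6) ** 2 + 0.01 * rng.normal(size=200)   # synthetic health outcome

Phi = np.stack([np.ones_like(temps), temps, temps ** 2], axis=1)
w = np.linalg.lstsq(Phi, outcome, rcond=None)[0]
# w is approximately (98.6^2, -2 * 98.6, 1), the expansion of (x - 98.6)^2.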

Quadratic expansion

Quadratic function f : R → R,

f(x) = ax^2 + bx + c, x ∈ R,

for a, b, c ∈ R.

This can be written as a linear function of φ(x), where

φ(x) := (1, x, x^2),

since

f(x) = w^T φ(x)

where w = (c, b, a).

For a multivariate quadratic function f : R^d → R, use

φ(x) := (1, x_1, . . . , x_d [linear terms], x_1^2, . . . , x_d^2 [squared terms], x_1 x_2, . . . , x_1 x_d, . . . , x_{d−1} x_d [cross terms]).

26 / 94

Affine expansion and “Old Faithful”

Woodward needed an affine expansion for “Old Faithful” data:

φ(x) := (1, x).

[Scatter plot: time until next eruption vs. duration of last eruption, with the fitted affine function overlaid.]

Affine function f_{a,b} : R → R for a, b ∈ R,

f_{a,b}(x) = a + bx,

is a linear function f_w of φ(x) for w = (a, b).

(This easily generalizes to multivariate affine functions.)

27 / 94
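
A sketch of Woodward's affine fit in numpy; the arrays below are synthetic stand-ins for the (duration of last eruption, time until next eruption) records, with made-up coefficients:

import numpy as np

rng = np.random.default_rng(5)
duration = rng.uniform(1.5, 5.0, size=136)
delay = 33.0 + 10.7 * duration + 6.0 * rng.normal(size=136)      # stand-in historical data

Phi = np.stack([np.ones_like(duration), duration], axis=1)       # affine expansion phi(x) = (1, x)
a, b = np.linalg.lstsq(Phi, delay, rcond=None)[0]                # intercept a, slope b
predicted_delay = a + b * 4.0                                    # prediction after a 4-minute eruption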

Final remarks on features

- “Feature engineering” can drastically change the power of a model.

- Some people consider it messy, unprincipled, pure “trial-and-error”.

- Deep learning is somewhat touted as removing some of this, but it doesn't do so completely (e.g., it took a lot of work to come up with the “convolutional neural network”; side question: who came up with that?).

28 / 94

8. Statistical view of least squares; maximum likelihood

Maximum likelihood estimation (MLE) refresher

Parametric statistical model:
P = {P_θ : θ ∈ Θ}, a collection of probability distributions for observed data.

- Θ: parameter space.

- θ ∈ Θ: a particular parameter (or parameter vector).

- P_θ: a particular probability distribution for observed data.

Likelihood of θ ∈ Θ given observed data x:
For discrete X ∼ P_θ with probability mass function p_θ,

L(θ) := p_θ(x).

For continuous X ∼ P_θ with probability density function f_θ,

L(θ) := f_θ(x).

Maximum likelihood estimator (MLE):
Let θ̂ be the θ ∈ Θ of highest likelihood given observed data.

29 / 94

Distributions over labeled examples

X: Space of possible side-information (feature space).
Y: Space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in X × Y can be thought of in two parts:

1. Marginal distribution P_X of X:

   P_X is a probability distribution on X.

2. Conditional distribution P_{Y|X=x} of Y given X = x for each x ∈ X:

   P_{Y|X=x} is a probability distribution on Y.

This lecture: Y = R (regression problems).

30 / 94

Optimal predictor

What function f : X → R has the smallest (squared loss) risk

R(f) := E[(f(X) − Y)^2]?

Note: earlier we discussed empirical risk.

- Conditional on X = x, the minimizer of the conditional risk

  y ↦ E[(y − Y)^2 | X = x]

  is the conditional mean E[Y | X = x].

- Therefore, the function f⋆ : R → R where

  f⋆(x) = E[Y | X = x], x ∈ R,

  has the smallest risk.

- f⋆ is called the regression function or conditional mean function.

31 / 94

Linear regression models

When side-information is encoded as vectors of real numbers x = (x_1, . . . , x_d) (called features or variables), it is common to use a linear regression model, such as the following:

Y | X = x ∼ N(x^T w, σ^2), x ∈ R^d.

- Parameters: w = (w_1, . . . , w_d) ∈ R^d, σ^2 > 0.

- X = (X_1, . . . , X_d), a random vector (i.e., a vector of random variables).

- Conditional distribution of Y given X is normal.

- Marginal distribution of X not specified.

In this model, the regression function f⋆ is a linear function f_w : R^d → R,

f_w(x) = x^T w = ∑_{i=1}^d x_i w_i, x ∈ R^d.

(We'll often refer to f_w just by w.)

[Figure: sample (x, y) pairs with the regression function f⋆ overlaid.]

32 / 94

Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise:
(X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid, with

Y | X = x ∼ N(x^T w, σ^2), x ∈ R^d.

(Traditional to study linear regression in the context of this model.)

Log-likelihood of (w, σ^2), given data (X_i, Y_i) = (x_i, y_i) for i = 1, . . . , n:

∑_{i=1}^n { −(1/(2σ^2)) (x_i^T w − y_i)^2 + (1/2) ln(1/(2πσ^2)) } + { terms not involving (w, σ^2) }.

The w that maximizes the log-likelihood is also the w that minimizes

(1/n) ∑_{i=1}^n (x_i^T w − y_i)^2.

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .

33 / 94

Empirical distribution and empirical risk

The empirical distribution P_n on (x_1, y_1), . . . , (x_n, y_n) has probability mass function p_n given by

p_n((x, y)) := (1/n) ∑_{i=1}^n 1{(x, y) = (x_i, y_i)}, (x, y) ∈ R^d × R.

Plug-in principle: The goal is to find a function f that minimizes the (squared loss) risk

R(f) = E[(f(X) − Y)^2].

But we don't know the distribution P of (X, Y).

Replace P with P_n → empirical (squared loss) risk R̂(f):

R̂(f) := (1/n) ∑_{i=1}^n (f(x_i) − y_i)^2.

(“Plug-in principle” is used throughout statistics in this same way.)

34 / 94
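
The plug-in step is just an average over the observed pairs; a minimal numpy sketch (the predictor, data, and weight vector are illustrative stand-ins):

import numpy as np

def empirical_risk(predict, xs, ys):
    # empirical squared-loss risk: (1/n) * sum_i (f(x_i) - y_i)^2
    return np.mean((predict(xs) - ys) ** 2)

rng = np.random.default_rng(6)
xs = rng.normal(size=(30, 2))
ys = xs @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=30)
risk_of_linear_f = empirical_risk(lambda X: X @ np.array([1.0, 2.0]), xs, ys)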

Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss: ERM returns

ŵ ∈ arg min_{w ∈ R^d} R̂(w),

which coincides with MLE under the basic linear regression model.

In general:

- MLE makes sense in the context of the statistical model for which it is derived.

- ERM makes sense in the context of the general iid model for supervised learning.

Further remarks.

- In MLE, we assume a model, and we not only maximize likelihood, but can try to argue we “recover” a “true” parameter.

- In ERM, by default there is no assumption of a “true” parameter to recover.

Useful examples: medical testing, gene expression, . . .

35 / 94

Old Faithful data under this least squares statistical model

Recall our data, consisting of historical records of eruptions:

[Timeline diagram of eruption records a_0, b_0, a_1, b_1, . . . with inter-eruption gaps Y_1, Y_2, Y_3.]

Statistical model (not just IID!): Y_1, . . . , Y_n, Y ∼_iid N(µ, σ^2).

- Data: Y_i := a_i − b_{i−1}, i = 1, . . . , n.

(Admittedly not a great model, since durations are non-negative.)

Task:
At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.
For the linear regression model, we'll assume

Y | X = x ∼ N(x^T w, σ^2), x ∈ R^d.

(This extends the model above if we add the “1” feature.)

36 / 94

9. Regularization and ridge regression

Inductive bias

Suppose ERM solution is not unique. What should we do?

One possible answer: Pick the w of shortest length.

- Fact: The shortest solution w to (A^T A) w = A^T b is always unique.

- Obtain w via

  w = A^+ b,

  where A^+ is the (Moore-Penrose) pseudoinverse of A (see the sketch below).

Why should this be a good idea?

- Data does not give reason to choose a shorter w over a longer w.

- The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the “true” w⋆ is short), not for others.

All learning algorithms encode some kind of inductive bias.

37 / 94
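
A small numpy illustration of the minimum-norm choice on a deliberately rank-deficient problem (the data is a contrived stand-in with an unused third feature):

import numpy as np

rng = np.random.default_rng(7)
A = np.column_stack([rng.normal(size=(10, 2)), np.zeros(10)])   # rank 2, d = 3
b = rng.normal(size=10)

w_min = np.linalg.pinv(A) @ b                    # shortest solution A^+ b (third entry is 0)
w_other = w_min + np.array([0.0, 0.0, 5.0])      # also satisfies the normal equations
assert np.allclose(A.T @ A @ w_other, A.T @ b)
assert np.linalg.norm(w_min) <= np.linalg.norm(w_other)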

Example

ERM with scaled trigonometric feature expansion:

φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), . . . ).

[Figures, shown in three builds: the training data; the training data with some arbitrary ERM; the training data with the least ℓ2 norm ERM.]

It is not a given that the least norm ERM is better than the other ERM!

38 / 94

Regularized ERM

Combine the two concerns: For a given λ ≥ 0, find minimizer of

R(w) + λ‖w‖₂²

over w ∈ Rd.

Fact: If λ > 0, then the solution is always unique (even if n < d)!

I This is called ridge regression.

(λ = 0 is ERM / Ordinary Least Squares.)

Explicit solution: w = (AᵀA + λI)⁻¹ Aᵀb.

I Parameter λ controls how much attention is paid to the regularizer ‖w‖₂² relative to the data-fitting term R(w).

I Choose λ using cross-validation.

Note: in deep networks, this regularization is called “weight decay”. (Why?)
Note: another popular regularizer for linear regression is ℓ1.

39 / 94
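
A minimal sketch of the explicit ridge solution above, assuming NumPy and writing the data-fit term as ‖Aw − b‖₂² (as in the normal equations):

    import numpy as np

    def ridge(A, b, lam):
        """Minimizer of ||A w - b||_2^2 + lam * ||w||_2^2."""
        d = A.shape[1]
        return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

    # Usage sketch on random data: larger lam gives a shorter w.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 5))
    b = rng.normal(size=20)
    for lam in [0.0, 0.1, 10.0]:
        print(lam, np.linalg.norm(ridge(A, b, lam)))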

10. True risk and overfitting

Statistical interpretation of ERM

Let (X, Y) ∼ P, where P is some distribution on Rd × R. Which w has the smallest risk R(w) = E[(Xᵀw − Y)²]?

Necessary condition for w to be a minimizer of R:

∇R(w) = 0, i.e., w is a critical point of R.

This translates to

E[XXᵀ]w = E[Y X],

a system of linear equations called the population normal equations.

It can be proved that every critical point of R is a minimizer of R.

Looks familiar?

If (X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid, then

E[AᵀA] = E[XXᵀ] and E[Aᵀb] = E[Y X],

so ERM can be regarded as a plug-in estimator for a minimizer of R.

40 / 94
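
A quick simulation of this plug-in view (a sketch, assuming NumPy; the data distribution below is chosen only for illustration): the empirical normal equations approach the population ones as n grows.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 3
    w_star = np.array([1.0, -2.0, 0.5])

    def sample(n):
        X = rng.normal(size=(n, d))                     # E[X X^T] = I for this design
        Y = X @ w_star + rng.normal(scale=0.3, size=n)  # so E[Y X] = w_star
        return X, Y

    for n in [50, 500, 50000]:
        X, Y = sample(n)
        # Plug-in estimates of E[X X^T] and E[Y X], then solve the empirical normal equations.
        w_hat = np.linalg.solve(X.T @ X / n, X.T @ Y / n)
        print(n, np.linalg.norm(w_hat - w_star))        # shrinks as n grows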

Risk of ERM

IID model: (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, taking values in Rd × R.

Let w⋆ be a minimizer of R over all w ∈ Rd, i.e., w⋆ satisfies the population normal equations

E[XXᵀ]w⋆ = E[Y X].

I If the ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w⋆).

I What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,

R(ŵ) − R(w⋆) = O( tr(cov(εW)) / n )

“asymptotically”, where W := E[XXᵀ]^(−1/2) X and ε := Y − Xᵀw⋆.

41 / 94

Risk of ERM analysis (rough sketch)

Let εᵢ := Yᵢ − Xᵢᵀw⋆ for each i = 1, . . . , n, so

E[εᵢXᵢ] = E[YᵢXᵢ] − E[XᵢXᵢᵀ]w⋆ = 0

and

√n (ŵ − w⋆) = ( (1/n) Σ_{i=1}^n XᵢXᵢᵀ )^(−1) · (1/√n) Σ_{i=1}^n εᵢXᵢ.

1. By LLN: (1/n) Σ_{i=1}^n XᵢXᵢᵀ →p E[XXᵀ].

2. By CLT: (1/√n) Σ_{i=1}^n εᵢXᵢ →d cov(εX)^(1/2) Z, where Z ∼ N(0, I).

Therefore, the asymptotic distribution of √n (ŵ − w⋆) is

√n (ŵ − w⋆) →d E[XXᵀ]^(−1) cov(εX)^(1/2) Z.

A few more steps gives

n ( E[(Xᵀŵ − Y)²] − E[(Xᵀw⋆ − Y)²] ) →d ‖ E[XXᵀ]^(−1/2) cov(εX)^(1/2) Z ‖₂².

Random variable on RHS is “concentrated” around its mean tr(cov(εW)).

42 / 94

Risk of ERM: postscript

I Analysis does not assume that the linear regression model is “correct”; the data distribution need not come from a normal linear regression model.

I Only assumptions are those needed for LLN and CLT to hold.

I However, if the normal linear regression model holds, i.e.,

Y | X = x ∼ N(xᵀw⋆, σ²),

then the bound from the theorem becomes

R(ŵ) − R(w⋆) = O( σ²d / n ),

which is familiar to those who have taken introductory statistics.

I With more work, can also prove a non-asymptotic risk bound of similar form.

I In homework/reading, we look at a simpler setting for studying ERM for linear regression, called “fixed design”.

43 / 94

Risk vs empirical risk

Let ŵ be the ERM solution.

1. Empirical risk of ERM: R̂(ŵ)

2. True risk of ERM: R(ŵ)

Theorem. E[R̂(ŵ)] ≤ E[R(ŵ)].

(Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: empirical risk is “small”, but true risk is “much higher”.

44 / 94

Overfitting example

(X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid; X is a continuous random variable in R.

Suppose we use the degree-k polynomial expansion

φ(x) = (1, x, x², . . . , xᵏ), x ∈ R,

so the dimension is d = k + 1.

Fact: Any function on ≤ k + 1 points can be interpolated by a polynomial of degree at most k.

[Figure: n points (x, y) interpolated exactly by a degree-k polynomial.]

Conclusion: If n ≤ k + 1 = d, the ERM solution w with this feature expansion has empirical risk R̂(w) = 0 always, regardless of its true risk (which can be ≫ 0).

45 / 94
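
A minimal illustration of this interpolation effect (a sketch assuming NumPy; the data-generating distribution is made up for the demo):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    k = n - 1                                   # degree k = n - 1, so d = k + 1 = n
    X = rng.uniform(size=n)
    Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=n)

    Phi = np.vander(X, N=k + 1)                 # degree-k polynomial features of each x_i
    w, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

    print("empirical risk:", np.mean((Phi @ w - Y) ** 2))   # essentially 0: the data are interpolated

    # Fresh data from the same distribution: the true risk is far from 0.
    X_new = rng.uniform(size=10000)
    Y_new = np.sin(2 * np.pi * X_new) + rng.normal(scale=0.2, size=10000)
    print("estimated true risk:", np.mean((np.vander(X_new, N=k + 1) @ w - Y_new) ** 2))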

Estimating risk

IID model: (X₁, Y₁), . . . , (Xₙ, Yₙ), (X̃₁, Ỹ₁), . . . , (X̃ₘ, Ỹₘ) ∼iid P.

I training data (X₁, Y₁), . . . , (Xₙ, Yₙ) used to learn f.

I test data (X̃₁, Ỹ₁), . . . , (X̃ₘ, Ỹₘ) used to estimate risk, via test risk

Rtest(f) := (1/m) Σ_{i=1}^m (f(X̃ᵢ) − Ỹᵢ)².

I Training data is independent of test data, so f is independent of test data.

I Let Lᵢ := (f(X̃ᵢ) − Ỹᵢ)² for each i = 1, . . . , m, so

E[Rtest(f) | f] = (1/m) Σ_{i=1}^m E[Lᵢ | f] = R(f).

I Moreover, L₁, . . . , Lₘ are conditionally iid given f, and hence by the Law of Large Numbers,

Rtest(f) →p R(f) as m → ∞.

I By the CLT, the rate of convergence is m^(−1/2).

46 / 94
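
A short sketch of this train/test estimate, assuming NumPy (the linear-plus-noise data model is only for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, m = 5, 100, 100000
    w_true = rng.normal(size=d)

    def draw(size):
        X = rng.normal(size=(size, d))
        Y = X @ w_true + rng.normal(size=size)
        return X, Y

    X_train, Y_train = draw(n)    # used only to learn f
    X_test, Y_test = draw(m)      # held out, used only to estimate the risk of f

    w_hat, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

    train_risk = np.mean((X_train @ w_hat - Y_train) ** 2)
    test_risk = np.mean((X_test @ w_hat - Y_test) ** 2)    # unbiased estimate of R(f) given f
    print(train_risk, test_risk)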

Rates for risk minimization vs. rates for risk estimation

One may think that ERM “works” because, somehow, training risk is a good “plug-in” estimate of true risk.

I Sometimes this is partially true—we’ll revisit this when we discuss generalization theory.

Roughly speaking, under some assumptions, can expect that

|R̂(w) − R(w)| ≤ O(√(d/n)) for all w ∈ Rd.

However . . .

I By the CLT, we know the following holds for a fixed w:

R̂(w) →p R(w) at n^(−1/2) rate.

(Here, we ignore the dependence on d.)

I Yet, for the ERM ŵ,

R(ŵ) →p R(w⋆) at n^(−1) rate.

(Also ignoring dependence on d.)

Implication: Selecting a good predictor can be “easier” than estimating how good predictors are!

47 / 94

Old Faithful example

I Linear regression model + affine expansion on “duration of last eruption”.

I Learn w = (35.0929, 10.3258) from 136 past observations.

I Mean squared loss of w on next 136 observations is 35.9404.

(Recall: mean squared loss of µ = 70.7941 was 187.1894.)

[Figure: time until next eruption vs. duration of last eruption, showing the fitted linear model and the constant prediction.]

(Unfortunately, √35.9404 ≈ 6.0 > mean duration ≈ 3.5.)

48 / 94
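
A sketch of the computation behind these numbers, assuming the 136 (duration, delay) pairs are available as NumPy arrays (the data itself is not reproduced here):

    import numpy as np

    def fit_affine(durations, delays):
        """Least-squares fit of the affine model: delay ~ w0 + w1 * duration.

        `durations` and `delays` are 1-D arrays of past observations (the 136
        observations on this slide would go here)."""
        A = np.column_stack([np.ones_like(durations), durations])   # phi(x) = (1, x)
        w, *_ = np.linalg.lstsq(A, delays, rcond=None)
        return w                                                     # expect roughly (35.09, 10.33)

    def mean_squared_loss(w, durations, delays):
        preds = w[0] + w[1] * durations
        return np.mean((preds - delays) ** 2)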

11. ℓ1 regularization: the LASSO

Regularization with a different norm

Lasso: For a given λ ≥ 0, find the minimizer of

R(w) + λ‖w‖₁

over w ∈ Rd. Here, ‖v‖₁ = Σ_{i=1}^d |vᵢ| is the ℓ1-norm.

I Prefers shorter w, but using a different notion of length than ridge.

I Tends to produce w that are sparse—i.e., have few non-zero coordinates—or at least well-approximated by sparse vectors.

Fact: Vectors with small ℓ1-norm are well-approximated by sparse vectors.

If w̃ contains just the 1/ε² largest coefficients (by magnitude) of w, then

‖w − w̃‖₂ ≤ ε‖w‖₁.

49 / 94
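
A minimal Lasso sketch using scikit-learn (an assumption: scikit-learn is available; note that its Lasso minimizes (1/(2n))‖Xw − y‖₂² + α‖w‖₁, so its α matches the λ here only up to rescaling):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, d = 100, 20
    X = rng.normal(size=(n, d))
    w_sparse = np.zeros(d)
    w_sparse[:3] = [2.0, -1.0, 0.5]                  # only 3 of the 20 coordinates matter
    y = X @ w_sparse + rng.normal(scale=0.1, size=n)

    model = Lasso(alpha=0.1, fit_intercept=False)
    model.fit(X, y)
    print(np.round(model.coef_, 3))                  # most coordinates are driven exactly to 0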

Sparse approximations

Claim: If w̃ contains just the T largest coefficients (by magnitude) of w, then

‖w − w̃‖₂ ≤ ‖w‖₁ / √(T + 1).

WLOG |w₁| ≥ |w₂| ≥ · · · , so w̃ = (w₁, . . . , w_T, 0, . . . , 0).

‖w − w̃‖₂² = Σ_{i≥T+1} wᵢ²
          ≤ Σ_{i≥T+1} |wᵢ| · |w_{T+1}|
          ≤ ‖w‖₁ · |w_{T+1}|
          ≤ ‖w‖₁ · ‖w‖₁ / (T + 1).

This is a consequence of the “mismatch” between ℓ1- and ℓ2-norms.
Can get similar results for other ℓp norms.

50 / 94
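
A quick numerical check of this claim (a sketch assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.laplace(size=200)                     # an arbitrary dense vector

    def top_T_approximation(w, T):
        w_tilde = np.zeros_like(w)
        idx = np.argsort(np.abs(w))[-T:]          # indices of the T largest-magnitude entries
        w_tilde[idx] = w[idx]
        return w_tilde

    for T in [5, 20, 80]:
        err = np.linalg.norm(w - top_T_approximation(w, T))   # ell_2 error of the sparse approximation
        bound = np.linalg.norm(w, 1) / np.sqrt(T + 1)         # the bound from the claim
        print(T, err <= bound, err, bound)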

Example: Coefficient profile (ℓ2 vs. ℓ1)

Y = levels of prostate cancer antigen, X = clinical measurements

Horizontal axis: varying λ (large λ to left, small λ to right).
Vertical axis: coefficient value in ℓ2-regularized ERM and ℓ1-regularized ERM, for eight different variables.

51 / 94

Other approaches to sparse regression

I Subset selection:

Find w that minimizes empirical risk among all vectors with at most k non-zero entries.

Unfortunately, this seems to require time exponential in k.

I Greedy algorithms:

Repeatedly choose a new variable to “include” in the support of w until k variables are included.

Forward stepwise regression / Orthogonal matching pursuit (a minimal sketch follows after this slide).

Often works as well as ℓ1-regularized ERM.

Why do we care about sparsity?

52 / 94
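
A minimal greedy sketch in the spirit of forward stepwise regression / orthogonal matching pursuit (assumes NumPy; scikit-learn's OrthogonalMatchingPursuit is a polished alternative):

    import numpy as np

    def greedy_sparse_regression(A, b, k):
        """Greedily add one variable at a time until k variables are in the support.

        Each step picks the column most correlated with the current residual,
        then refits least squares on the selected columns."""
        n, d = A.shape
        support = []
        residual = b.copy()
        for _ in range(k):
            scores = np.abs(A.T @ residual)
            scores[support] = -np.inf                # never pick the same column twice
            support.append(int(np.argmax(scores)))
            w_s, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
            residual = b - A[:, support] @ w_s
        w = np.zeros(d)
        w[support] = w_s
        return w

    # Usage sketch: w_hat = greedy_sparse_regression(A, b, k=5)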

12. Summary

Summary

I ERM for OLS
I ERM in general
I normal equations
I pseudoinverse solution
I ridge regression
I statistical view (say 1-2 things that should be remembered)

53 / 94

Another interpretation of ridge regression

Define the (n + d) × d matrix A and the (n + d) × 1 column vector b by

A := (1/√n) · [ x₁ᵀ ; . . . ; xₙᵀ ; √(nλ)·I_d ]   (rows stacked; the last d rows are √(nλ) times the d × d identity),

b := (1/√n) · (y₁, . . . , yₙ, 0, . . . , 0)ᵀ.

Then ‖Aw − b‖₂² = R(w) + λ‖w‖₂².

Interpretation:

I d “fake” data points; they ensure that the augmented data matrix A has rank d.

I Squared length of each “fake” feature vector is nλ.

All corresponding labels are 0.

I Prediction of w on the i-th fake feature vector is √(nλ) · wᵢ.

57 / 94
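
A sketch verifying this identity numerically (assumes NumPy; here X denotes the n × d matrix with rows xᵢᵀ and y the label vector, notation introduced just for this snippet):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 30, 4, 0.5
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)

    # Augmented system: d "fake" rows sqrt(n*lam)*I_d with labels 0, all scaled by 1/sqrt(n).
    A = np.vstack([X, np.sqrt(n * lam) * np.eye(d)]) / np.sqrt(n)
    b = np.concatenate([y, np.zeros(d)]) / np.sqrt(n)

    # Ordinary least squares on (A, b) ...
    w_aug, *_ = np.linalg.lstsq(A, b, rcond=None)

    # ... equals ridge regression on the original data, R(w) = (1/n) * sum_i (x_i^T w - y_i)^2.
    w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    print(np.allclose(w_aug, w_ridge))   # True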

Enhancing linear regression models

Linear functions might sound rather restricted, but actually they can be quite powerful if you are creative about side-information.

Examples:

1. Non-linear transformations of existing variables: for x ∈ R,

φ(x) = ln(1 + x).

2. Logical formula of binary variables: for x = (x1, . . . , xd) ∈ {0, 1}d,

φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7).

3. Trigonometric expansion: for x ∈ R,

φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).

4. Polynomial expansion: for x = (x1, . . . , xd) ∈ Rd,

φ(x) = (1, x1, . . . , xd, x1², . . . , xd², x1x2, . . . , x1xd, . . . , xd−1 xd).

59 / 94
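
A few of these feature maps written out as code (a sketch assuming NumPy):

    import numpy as np

    def phi_log(x):
        """Non-linear transformation of a scalar variable: ln(1 + x)."""
        return np.log1p(x)

    def phi_trig(x, K):
        """Trigonometric expansion (1, sin(x), cos(x), ..., sin(Kx), cos(Kx)) of a scalar x."""
        feats = [1.0]
        for k in range(1, K + 1):
            feats += [np.sin(k * x), np.cos(k * x)]
        return np.array(feats)

    def phi_quadratic(x):
        """Degree-2 polynomial expansion of a vector x: constant, linear, squared and cross terms."""
        x = np.asarray(x, dtype=float)
        d = x.shape[0]
        cross = [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
        return np.concatenate([[1.0], x, x ** 2, cross])

    print(phi_trig(0.3, K=2))
    print(phi_quadratic([1.0, 2.0, 3.0]))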

Example: Taking advantage of linearity

Suppose you are trying to predict some health outcome.

I Physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:

φ(x) = (xtemp − 98.6)².

I What if you didn’t know about this magic constant 98.6?

I Instead, use

φ(x) = (1, xtemp, xtemp²).

Can learn coefficients w such that

wᵀφ(x) = (xtemp − 98.6)²,

or any other quadratic polynomial in xtemp (which may be better!).

60 / 94

Quadratic expansion

Quadratic function f : R → R,

f(x) = ax² + bx + c, x ∈ R,

for a, b, c ∈ R.

This can be written as a linear function of φ(x), where

φ(x) := (1, x, x²),

since f(x) = wᵀφ(x)

where w = (c, b, a).

For a multivariate quadratic function f : Rd → R, use

φ(x) := (1, x1, . . . , xd [linear terms], x1², . . . , xd² [squared terms], x1x2, . . . , x1xd, . . . , xd−1 xd [cross terms]).

61 / 94

Affine expansion and “Old Faithful”

Woodward needed an affine expansion for “Old Faithful” data:

φ(x) := (1, x).

[Figure: scatter plot of time until next eruption vs. duration of last eruption, with an affine function fit to the data.]

Affine function f_{a,b} : R → R for a, b ∈ R,

f_{a,b}(x) = a + bx,

is a linear function f_w of φ(x) for w = (a, b).

(This easily generalizes to multivariate affine functions.)

62 / 94
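
As a sketch of how such an affine fit is computed (the durations/delays arrays below are made-up placeholders, not the actual Old Faithful records):

    import numpy as np

    # Hypothetical (duration, time-until-next-eruption) pairs, not the real records.
    durations = np.array([1.8, 2.1, 3.4, 3.9, 4.3, 4.7])
    delays    = np.array([54.0, 57.0, 71.0, 77.0, 80.0, 84.0])

    # Affine expansion phi(x) = (1, x): a column of ones next to the durations.
    Phi = np.column_stack([np.ones_like(durations), durations])

    # Least squares fit of w = (a, b) so that a + b*duration approximates delay.
    w, *_ = np.linalg.lstsq(Phi, delays, rcond=None)
    a, b = w
    print(f"predicted delay = {a:.2f} + {b:.2f} * duration")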

Why linear regression models?

1. Linear regression models benefit from good choice of features.

2. Structure of linear functions is very well-understood.

3. Many well-understood and efficient algorithms for learning linear functions from data, even when n and d are large.

63 / 94

13. From data to prediction functions

Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise: (X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid, with

Y | X = x ∼ N(x^T w, σ^2), x ∈ R^d.

(Traditional to study linear regression in context of this model.)

Log-likelihood of (w, σ^2), given data (X_i, Y_i) = (x_i, y_i) for i = 1, . . . , n:

∑_{i=1}^n { −(1/(2σ^2)) (x_i^T w − y_i)^2 + (1/2) ln(1/(2πσ^2)) } + { terms not involving (w, σ^2) }.

The w that maximizes the log-likelihood is also the w that minimizes

(1/n) ∑_{i=1}^n (x_i^T w − y_i)^2.

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .

64 / 94
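
A quick numerical sanity check of this equivalence, as a sketch on synthetic data (nothing here is specific to the course materials): the least squares solution also maximizes the Gaussian log-likelihood over w, so perturbing it can only lower the log-likelihood.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma = 200, 3, 0.5
    X = rng.normal(size=(n, d))
    y = X @ np.array([1.0, -2.0, 0.5]) + sigma * rng.normal(size=n)

    # Least squares / ERM solution.
    w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

    def log_likelihood(w, sigma2):
        # Gaussian log-likelihood of (w, sigma^2) given the data (up to constants in w).
        resid = X @ w - y
        return -resid @ resid / (2 * sigma2) - n / 2 * np.log(2 * np.pi * sigma2)

    # Perturbing w_ls never increases the log-likelihood, for any fixed sigma^2.
    for _ in range(5):
        w_other = w_ls + 0.1 * rng.normal(size=d)
        assert log_likelihood(w_other, sigma**2) <= log_likelihood(w_ls, sigma**2) + 1e-9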

Empirical distribution and empirical risk

Empirical distribution P_n on (x_1, y_1), . . . , (x_n, y_n) has probability mass function p_n given by

p_n((x, y)) := (1/n) ∑_{i=1}^n 1{(x, y) = (x_i, y_i)}, (x, y) ∈ R^d × R.

Plug-in principle: Goal is to find function f that minimizes (squared loss) risk

R(f) = E[(f(X) − Y)^2].

But we don’t know the distribution P of (X, Y).

Replace P with P_n → Empirical (squared loss) risk R̂(f):

R̂(f) := (1/n) ∑_{i=1}^n (f(x_i) − y_i)^2.

65 / 94
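
A minimal sketch of the empirical risk as code (our own helper name empirical_risk; the toy data below is illustrative only):

    import numpy as np

    def empirical_risk(f, xs, ys):
        # Average squared loss of predictor f on the sample (x_1, y_1), ..., (x_n, y_n).
        preds = np.array([f(x) for x in xs])
        return np.mean((preds - ys) ** 2)

    xs = np.array([0.0, 1.0, 2.0])
    ys = np.array([1.2, 2.9, 5.1])
    print(empirical_risk(lambda x: 2 * x + 1, xs, ys))   # risk of the fixed predictor f(x) = 2x + 1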

Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss: ERM returns

ŵ ∈ arg min_{w ∈ R^d} R̂(w),

which coincides with MLE under the basic linear regression model.

In general:

I MLE makes sense in context of statistical model for which it is derived.

I ERM makes sense in context of general iid model for supervised learning.

66 / 94

Empirical risk minimization in pictures

Red dots: data points.

Affine hyperplane: linear function w (via affine expansion (x_1, x_2) ↦ (1, x_1, x_2)).

ERM: minimize sum of squared vertical lengths from hyperplane to points.

67 / 94

Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [x_1^T; . . . ; x_n^T] (the rows of A are the x_i^T, scaled by 1/√n), b := (1/√n) (y_1, . . . , y_n)^T.

Can write empirical risk as

R̂(w) = ‖Aw − b‖_2^2.

Necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to (A^T A)w = A^T b,

a system of linear equations called the normal equations.

It can be proved that every critical point of R̂ is a minimizer of R̂ (see the convexity aside below).

68 / 94
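
A minimal NumPy sketch of this setup on synthetic data (the variable names mirror the slide; nothing here is prescribed by the course): build A and b with the 1/√n scaling, solve the normal equations, and check the result against a generic least squares solver.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 100, 4
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    A = X / np.sqrt(n)          # rows x_i^T / sqrt(n)
    b = y / np.sqrt(n)

    # Solve the normal equations (A^T A) w = A^T b.
    w = np.linalg.solve(A.T @ A, A.T @ b)

    # Same answer (up to numerical error) as a generic least squares solve of Aw ~ b.
    w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
    assert np.allclose(w, w_lstsq)

    emp_risk = np.sum((A @ w - b) ** 2)   # empirical risk ||Aw - b||_2^2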

Aside: Convexity

Let f : Rd → R be a differentiable function.

Suppose we find x ∈ Rd such that ∇f(x) = 0. Is x a minimizer of f?

Yes, if f is a convex function:

f((1− t)x+ tx′) ≤ (1− t)f(x) + tf(x′),

for any 0 ≤ t ≤ 1 and any x,x′ ∈ Rd.

69 / 94

Convexity of empirical risk

Checking convexity of g(x) = ‖Ax − b‖_2^2:

g((1 − t)x + tx′)
  = ‖(1 − t)(Ax − b) + t(Ax′ − b)‖_2^2
  = (1 − t)^2 ‖Ax − b‖_2^2 + t^2 ‖Ax′ − b‖_2^2 + 2(1 − t)t (Ax − b)^T (Ax′ − b)
  = (1 − t)‖Ax − b‖_2^2 + t‖Ax′ − b‖_2^2 − (1 − t)t [‖Ax − b‖_2^2 + ‖Ax′ − b‖_2^2] + 2(1 − t)t (Ax − b)^T (Ax′ − b)
  ≤ (1 − t)‖Ax − b‖_2^2 + t‖Ax′ − b‖_2^2,

where the last step uses the Cauchy–Schwarz inequality and the arithmetic mean/geometric mean (AM/GM) inequality.

70 / 94

Convexity of empirical risk, another way

Preview of convex analysis

Recall R̂(w) = (1/n) ∑_{i=1}^n (x_i^T w − y_i)^2.

I Scalar function g(z) = cz^2 is convex for any c ≥ 0.

I Composition (g ◦ a) : R^d → R of any convex function g : R → R and any affine function a : R^d → R is convex.

I Therefore, the function w ↦ (1/n)(x_i^T w − y_i)^2 is convex.

I Sum of convex functions is convex.

I Therefore R̂ is convex.

Convexity is a useful mathematical property to understand! (We’ll study more convex analysis in a few weeks.)

71 / 94

Algorithm for ERM

Algorithm for ERM with linear functions and squared loss†

input Data (x_1, y_1), . . . , (x_n, y_n) from R^d × R.
output Linear function ŵ ∈ R^d.

1: Find solution ŵ to the normal equations defined by the data (using, e.g., Gaussian elimination).

2: return ŵ.

†Also called “ordinary least squares” in this context.

Running time (dominated by Gaussian elimination): O(nd^2).
Note: there are many approximate solvers that run in nearly linear time!

72 / 94
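
As a sketch, this entire pipeline is a few lines in NumPy (assuming the data is already in arrays X of shape (n, d) and y of shape (n,); np.linalg.lstsq is just one of several reasonable solvers here, and the function name erm_least_squares is ours):

    import numpy as np

    def erm_least_squares(X, y):
        # Ordinary least squares ERM: a w minimizing the empirical squared loss.
        # Uses an SVD-based least squares routine, which also handles
        # rank-deficient X (returning a minimum-norm solution).
        w, residuals, rank, svals = np.linalg.lstsq(X, y, rcond=None)
        return w

    # Usage on synthetic data.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, 0.0, -1.0]) + 0.05 * rng.normal(size=50)
    w_hat = erm_least_squares(X, y)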

Geometric interpretation of least squares ERM

Let a_j ∈ R^n be the j-th column of matrix A ∈ R^{n×d}, so

A = [a_1 · · · a_d] (columns).

Minimizing ‖Aw − b‖_2^2 is the same as finding the vector b̂ ∈ range(A) closest to b.

The solution b̂ is the orthogonal projection of b onto range(A) = {Aw : w ∈ R^d}.

[Figure: b projected orthogonally onto the plane spanned by a_1 and a_2, giving b̂.]

I b̂ is uniquely determined.

I If rank(A) < d, then there is more than one way to write b̂ as a linear combination of a_1, . . . , a_d.

If rank(A) < d, then the ERM solution is not unique.

To get ŵ from b̂: solve the system of linear equations Aŵ = b̂.

73 / 94
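
A small NumPy sketch of this picture (our own construction, not from the course): compute b̂ as the projection of b onto range(A) and check that the residual b − b̂ is orthogonal to every column of A.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 6, 2
    A = rng.normal(size=(n, d))
    b = rng.normal(size=n)

    # Least squares solution and the corresponding projection b_hat = A w_hat.
    w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    b_hat = A @ w_hat

    # The residual is orthogonal to range(A): A^T (b - b_hat) = 0.
    assert np.allclose(A.T @ (b - b_hat), 0.0, atol=1e-10)

    # Equivalently, b_hat = A A^+ b, where A^+ is the Moore-Penrose pseudoinverse.
    assert np.allclose(b_hat, A @ np.linalg.pinv(A) @ b)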

Statistical interpretation of ERM

Let (X, Y) ∼ P, where P is some distribution on R^d × R. Which w have smallest risk R(w) = E[(X^T w − Y)^2]?

Necessary condition for w to be a minimizer of R:

∇R(w) = 0, i.e., w is a critical point of R.

This translates to E[X X^T] w = E[Y X],

a system of linear equations called the population normal equations.

It can be proved that every critical point of R is a minimizer of R.

Looks familiar?

If (X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid, then

E[A^T A] = E[X X^T] and E[A^T b] = E[Y X],

so ERM can be regarded as a plug-in estimator for a minimizer of R.

74 / 94

14. Risk, empirical risk, and estimating risk

Risk of ERM

IID model: (X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid, taking values in R^d × R.

Let w⋆ be a minimizer of R over all w ∈ R^d, i.e., w⋆ satisfies the population normal equations

E[X X^T] w⋆ = E[Y X].

I If the ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w⋆).

I What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,

R(ŵ) − R(w⋆) = O( tr(cov(εW)) / n )

“asymptotically”, where W := E[X X^T]^{−1/2} X and ε := Y − X^T w⋆.

75 / 94

Risk of ERM analysis (rough sketch)

Let ε_i := Y_i − X_i^T w⋆ for each i = 1, . . . , n, so

E[ε_i X_i] = E[Y_i X_i] − E[X_i X_i^T] w⋆ = 0

and

√n (ŵ − w⋆) = ( (1/n) ∑_{i=1}^n X_i X_i^T )^{−1} (1/√n) ∑_{i=1}^n ε_i X_i.

1. By LLN: (1/n) ∑_{i=1}^n X_i X_i^T →_p E[X X^T].

2. By CLT: (1/√n) ∑_{i=1}^n ε_i X_i →_d cov(εX)^{1/2} Z, where Z ∼ N(0, I).

Therefore, the asymptotic distribution of √n (ŵ − w⋆) is

√n (ŵ − w⋆) →_d E[X X^T]^{−1} cov(εX)^{1/2} Z.

A few more steps gives

n ( E[(X^T ŵ − Y)^2] − E[(X^T w⋆ − Y)^2] ) →_d ‖E[X X^T]^{−1/2} cov(εX)^{1/2} Z‖_2^2.

The random variable on the RHS is “concentrated” around its mean tr(cov(εW)).

76 / 94

Risk of ERM: postscript

I Analysis does not assume that the linear regression model is “correct”; the data distribution need not be from the normal linear regression model.

I Only assumptions are those needed for LLN and CLT to hold.

I However, if the normal linear regression model holds, i.e.,

Y | X = x ∼ N(x^T w⋆, σ^2),

then the bound from the theorem becomes

R(ŵ) − R(w⋆) = O(σ^2 d / n),

which is familiar to those who have taken introductory statistics.

I With more work, can also prove a non-asymptotic risk bound of similar form.

I In homework/reading, we look at a simpler setting for studying ERM for linear regression, called “fixed design”.

77 / 94

Risk vs empirical risk

Let ŵ be the ERM solution.

1. Empirical risk of ERM: R̂(ŵ)

2. True risk of ERM: R(ŵ)

Theorem. E[R̂(ŵ)] ≤ E[R(ŵ)].

(Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: empirical risk is “small”, but true risk is “much higher”.

78 / 94

Overfitting example

(X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid; X is a continuous random variable in R.

Suppose we use the degree-k polynomial expansion

φ(x) = (1, x, x^2, . . . , x^k), x ∈ R,

so the dimension is d = k + 1.

Fact: Any function on ≤ k + 1 points can be interpolated by a polynomial of degree at most k.

[Figure: a degree-k polynomial passing exactly through a handful of noisy (x, y) points.]

Conclusion: If n ≤ k + 1 = d, the ERM solution ŵ with this feature expansion has R̂(ŵ) = 0 always, regardless of its true risk (which can be ≫ 0).

79 / 94
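
A small simulation of this phenomenon, as a sketch on synthetic data (np.vander builds the (1, x, . . . , x^k) expansion; the choice of sin(2πx) as the target is ours):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 8
    k = n - 1                       # degree-k expansion, so d = k + 1 = n
    x_train = rng.uniform(0, 1, size=n)
    y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.normal(size=n)

    Phi_train = np.vander(x_train, k + 1, increasing=True)    # (1, x, ..., x^k)
    w_hat, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)

    train_risk = np.mean((Phi_train @ w_hat - y_train) ** 2)  # essentially 0: interpolation

    x_test = rng.uniform(0, 1, size=1000)
    y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.normal(size=1000)
    Phi_test = np.vander(x_test, k + 1, increasing=True)
    test_risk = np.mean((Phi_test @ w_hat - y_test) ** 2)     # typically much larger

    print(train_risk, test_risk)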

Estimating risk

IID model: (X_1, Y_1), . . . , (X_n, Y_n), (X̃_1, Ỹ_1), . . . , (X̃_m, Ỹ_m) ∼_iid P.

I training data (X_1, Y_1), . . . , (X_n, Y_n) used to learn f̂.

I test data (X̃_1, Ỹ_1), . . . , (X̃_m, Ỹ_m) used to estimate risk, via test risk

R̂_test(f̂) := (1/m) ∑_{i=1}^m (f̂(X̃_i) − Ỹ_i)^2.

I Training data is independent of test data, so f̂ is independent of test data.

I Let L_i := (f̂(X̃_i) − Ỹ_i)^2 for each i = 1, . . . , m, so

E[R̂_test(f̂) | f̂] = (1/m) ∑_{i=1}^m E[L_i | f̂] = R(f̂).

I Moreover, L_1, . . . , L_m are conditionally iid given f̂, and hence by the Law of Large Numbers,

R̂_test(f̂) →_p R(f̂) as m → ∞.

I By CLT, the rate of convergence is m^{−1/2}.

80 / 94
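
A minimal sketch of this recipe in NumPy (a train/test split on synthetic data; no claim that this mirrors the course's datasets):

    import numpy as np

    rng = np.random.default_rng(5)
    N, d = 1000, 3
    X = rng.normal(size=(N, d))
    y = X @ np.array([0.5, -1.0, 2.0]) + 0.2 * rng.normal(size=N)

    # Split into training data (first n points) and held-out test data (remaining m points).
    n = 600
    X_train, y_train = X[:n], y[:n]
    X_test, y_test = X[n:], y[n:]

    # Learn f_hat on training data only.
    w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    # Test risk: average squared loss on the independent test data.
    test_risk = np.mean((X_test @ w_hat - y_test) ** 2)
    print(test_risk)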

Rates for risk minimization vs. rates for risk estimation

One may think that ERM “works” because, somehow, training risk is a good “plug-in” estimate of true risk.

I Sometimes this is partially true; we’ll revisit this when we discuss generalization theory.

Roughly speaking, under some assumptions, can expect that

|R̂(w) − R(w)| ≤ O(√(d/n)) for all w ∈ R^d.

However . . .

I By CLT, we know the following holds for a fixed w:

R̂(w) →_p R(w) at n^{−1/2} rate.

(Here, we ignore the dependence on d.)

I Yet, for the ERM solution ŵ,

R(ŵ) →_p R(w⋆) at n^{−1} rate.

(Also ignoring dependence on d.)

Implication: Selecting a good predictor can be “easier” than estimating how good predictors are!

81 / 94

Old Faithful example

I Linear regression model + affine expansion on “duration of last eruption”.

I Learn ŵ = (35.0929, 10.3258) from 136 past observations.

I Mean squared loss of ŵ on next 136 observations is 35.9404.

(Recall: mean squared loss of µ = 70.7941 was 187.1894.)

[Figure: fitted linear model and constant prediction overlaid on the (duration of last eruption, time until next eruption) scatter plot.]

(Unfortunately, √35.9404 ≈ 6.0 > mean duration ≈ 3.5.)

82 / 94

15. Regularization

Inductive bias

Suppose ERM solution is not unique. What should we do?

One possible answer: Pick the ŵ of shortest length.

I Fact: The shortest solution ŵ to (A^T A)w = A^T b is always unique.

I Fact: the OLS solution A^+ b (using the pseudoinverse A^+) is the least norm solution.

Why should this be a good idea?

I Data does not give reason to choose a shorter w over a longer w.

I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the “true” w⋆ is short), not for others.

All learning algorithms encode some kind of inductive bias.

83 / 94
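
A quick NumPy illustration of the least norm solution in an underdetermined problem (our own sketch; here n < d, so the normal equations have many solutions):

    import numpy as np

    rng = np.random.default_rng(6)
    n, d = 5, 10                      # fewer equations than unknowns
    A = rng.normal(size=(n, d))
    b = rng.normal(size=n)

    # Minimum-norm least squares solution via the pseudoinverse: w = A^+ b.
    w_min_norm = np.linalg.pinv(A) @ b

    # Any other solution, e.g. w_min_norm plus a null-space vector,
    # fits the data equally well but has larger norm.
    _, _, Vt = np.linalg.svd(A)
    null_dir = Vt[-1]                 # a direction in the null space of A
    w_other = w_min_norm + 3.0 * null_dir

    assert np.allclose(A @ w_min_norm, A @ w_other)
    assert np.linalg.norm(w_min_norm) < np.linalg.norm(w_other)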

Example

ERM with scaled trigonometric feature expansion:

φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), . . . ).

[Figures: the training data alone; the training data with some arbitrary ERM fit; the training data with the least ℓ2 norm ERM fit.]

It is not a given that the least norm ERM is better than the other ERM!

84 / 94

Regularized ERM

Combine the two concerns: For a given λ ≥ 0, find the minimizer of

R̂(w) + λ‖w‖_2^2

over w ∈ R^d.

Fact: If λ > 0, then the solution is always unique (even if n < d)!

I This is called ridge regression.

(λ = 0 is ERM / Ordinary Least Squares.)

I Parameter λ controls how much attention is paid to the regularizer ‖w‖_2^2 relative to the data fitting term R̂(w).

I Choose λ using cross-validation.

85 / 94
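
Setting the gradient of ‖Aw − b‖_2^2 + λ‖w‖_2^2 to zero gives (A^T A + λI)w = A^T b, so ridge regression has a closed form. A brief NumPy sketch (the helper name ridge is ours; A and b carry the 1/√n scaling from the earlier slide):

    import numpy as np

    def ridge(X, y, lam):
        # Minimizes (1/n)||Xw - y||^2 + lam * ||w||^2 via the regularized normal equations.
        n, d = X.shape
        A = X / np.sqrt(n)
        b = y / np.sqrt(n)
        return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

    rng = np.random.default_rng(7)
    X = rng.normal(size=(30, 50))             # n < d: plain ERM is not unique here
    y = rng.normal(size=30)
    w_ridge = ridge(X, y, lam=0.1)            # unique for any lam > 0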

Another interpretation of ridge regression

Define the (n + d) × d matrix A and the (n + d) × 1 column vector b by

A := (1/√n) [x_1^T; . . . ; x_n^T; √(nλ) I_d] (the original rows, followed by the d rows of √(nλ) times the identity),
b := (1/√n) (y_1, . . . , y_n, 0, . . . , 0)^T.

Then ‖Aw − b‖_2^2 = R̂(w) + λ‖w‖_2^2.

Interpretation:

I d “fake” data points; they ensure that the augmented data matrix A has rank d.

I Squared length of each “fake” feature vector is nλ.

All corresponding labels are 0.

I Prediction of w on the i-th fake feature vector is √(nλ) w_i.

86 / 94
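
A short check of this equivalence on synthetic data (a self-contained sketch; both solvers should agree up to numerical error):

    import numpy as np

    rng = np.random.default_rng(8)
    n, d, lam = 40, 6, 0.3
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)

    A = X / np.sqrt(n)
    b = y / np.sqrt(n)

    # Ridge via the regularized normal equations: (A^T A + lam I) w = A^T b.
    w_closed = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

    # Ridge via the augmented data matrix: d fake rows (1/sqrt(n)) * sqrt(n*lam) * e_i with label 0.
    A_aug = np.vstack([A, np.sqrt(n * lam) / np.sqrt(n) * np.eye(d)])
    b_aug = np.concatenate([b, np.zeros(d)])
    w_aug, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)

    assert np.allclose(w_closed, w_aug)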

Regularization with a different norm

Lasso: For a given λ ≥ 0, find the minimizer of

R̂(w) + λ‖w‖_1

over w ∈ R^d. Here, ‖v‖_1 = ∑_{i=1}^d |v_i| is the ℓ1-norm.

I Prefers shorter w, but using a different notion of length than ridge.

I Tends to produce w that are sparse, i.e., have few non-zero coordinates, or at least are well-approximated by sparse vectors.

Fact: Vectors with small ℓ1-norm are well-approximated by sparse vectors.

If w̃ contains just the 1/ε^2-largest coefficients (by magnitude) of w, then

‖w − w̃‖_2 ≤ ε‖w‖_1.

87 / 94
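
The slides do not prescribe an algorithm for the lasso; as a hedged sketch, one standard option is proximal gradient descent (ISTA), where each step is a gradient step on R̂ followed by coordinate-wise soft-thresholding (the proximal operator of λ‖·‖_1). The function names below (soft_threshold, lasso_ista) are our own.

    import numpy as np

    def soft_threshold(v, tau):
        # Proximal operator of tau * ||.||_1: shrink each coordinate toward 0 by tau.
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def lasso_ista(X, y, lam, n_iters=500):
        # Minimizes (1/n)||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent.
        n, d = X.shape
        # Step size 1/L, where L = 2 * lambda_max(X^T X) / n bounds the gradient's Lipschitz constant.
        L = 2.0 * np.linalg.eigvalsh(X.T @ X / n)[-1]
        eta = 1.0 / L
        w = np.zeros(d)
        for _ in range(n_iters):
            grad = 2.0 / n * X.T @ (X @ w - y)
            w = soft_threshold(w - eta * grad, eta * lam)
        return w

    rng = np.random.default_rng(9)
    X = rng.normal(size=(100, 20))
    w_true = np.zeros(20); w_true[:3] = [2.0, -1.5, 1.0]   # sparse ground truth
    y = X @ w_true + 0.1 * rng.normal(size=100)
    print(np.round(lasso_ista(X, y, lam=0.1), 2))          # typically sparse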

Sparse approximations

Claim: If w̃ keeps just the T largest coefficients (by magnitude) of w and zeroes out the rest, then

‖w − w̃‖₂ ≤ ‖w‖₁ / √(T + 1).

WLOG |w₁| ≥ |w₂| ≥ · · · , so w̃ = (w₁, …, w_T, 0, …, 0).

‖w − w̃‖₂² = ∑_{i ≥ T+1} wᵢ²
          ≤ ∑_{i ≥ T+1} |wᵢ| · |w_{T+1}|
          ≤ ‖w‖₁ · |w_{T+1}|
          ≤ ‖w‖₁ · ‖w‖₁ / (T + 1),

where the last step uses that each of the T + 1 largest entries has magnitude at least |w_{T+1}|, so (T + 1)|w_{T+1}| ≤ ‖w‖₁. Taking square roots gives the claim.

This is a consequence of the "mismatch" between the ℓ₁- and ℓ₂-norms. Similar results hold for other ℓ_p norms.

88 / 94
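A quick numerical illustration of the claim (toy coefficients, illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal(200) * np.exp(-0.05 * np.arange(200))  # rapidly decaying coefficients
    T = 10

    # Keep only the T largest-magnitude coefficients of w.
    w_sparse = np.zeros_like(w)
    top = np.argsort(-np.abs(w))[:T]
    w_sparse[top] = w[top]

    lhs = np.linalg.norm(w - w_sparse, 2)
    rhs = np.linalg.norm(w, 1) / np.sqrt(T + 1)
    print(lhs <= rhs)  # True: the sparse-approximation error is controlled by ||w||_1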


Example: Coefficient profile (`2 vs. `1)

Y = levels of prostate cancer antigen, X = clinical measurements

Horizontal axis: varying λ (large λ to the left, small λ to the right). Vertical axis: coefficient value in ℓ₂-regularized ERM and ℓ₁-regularized ERM, for eight different variables.

89 / 94

Other approaches to sparse regression

I Subset selection:

Find w that minimizes empirical risk among all vectors with at most k non-zero entries.

Unfortunately, this seems to require time exponential in k.

I Greedy algorithms:

Repeatedly choose a new variable to "include" in the support of w until k variables are included.

Forward stepwise regression / Orthogonal matching pursuit (see the sketch after this slide).

Often works as well as ℓ₁-regularized ERM.

Why do we care about sparsity?

90 / 94
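As referenced above, a minimal orthogonal matching pursuit sketch (an illustrative implementation, not the course's reference code): greedily add the feature most correlated with the current residual, then refit least squares on the selected support.

    import numpy as np

    def omp(X, y, k):
        """Greedy sparse regression: select at most k features, refitting least squares each round."""
        n, d = X.shape
        support, residual = [], y.copy()
        w = np.zeros(d)
        for _ in range(k):
            # Pick the unselected feature most correlated with the residual.
            scores = np.abs(X.T @ residual)
            if support:
                scores[support] = -np.inf
            support.append(int(np.argmax(scores)))
            # Refit ordinary least squares restricted to the selected support.
            w_s, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
            w = np.zeros(d)
            w[support] = w_s
            residual = y - X @ w
        return w

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 30))
    w_true = np.zeros(30)
    w_true[[2, 7, 19]] = [1.5, -2.0, 0.8]
    y = X @ w_true + 0.05 * rng.standard_normal(100)
    print(np.flatnonzero(omp(X, y, k=3)))  # often recovers the support {2, 7, 19}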


Key takeaways

1. IID model for supervised learning.

2. Optimal predictors, linear regression models, and optimal linear predictors.

3. Empirical risk minimization for linear predictors.

4. Risk of ERM; training risk vs. test risk; risk minimization vs. risk estimation.

5. Inductive bias, ℓ₁- and ℓ₂-regularization, sparsity.

Make sure you do the assigned reading, especially from the handouts!

91 / 94

misc

svd
pytorch/numpy; gpu; gpu errors. maybe even sgd. they'll use it in homework.
talk about regression and classification somewhere early on. can mention how to do it for dt and knn too i guess, though it's a little gross in this lecture?
before MLE slide, give a quick one-slide refresher/primer on MLE.
ridge and soln existence. for homework maybe prove λ → 0 gives svd?
daniel's 1/n. talk about loss functions
look at my old lec
svd topics: not unique; pseudoinverse equal inverse always; pseudoinverse always unique(?) or at least when inverse exists? talk about things it satisfies like XX⁺X = X etc; "meaning" of the U, V matrices in svd; introduce svd via eigendecomposition

92 / 94

misc

logistic regression: optimize w ↦ (1/n) ∑_{i=1}^n ln(1 + exp(−yᵢ wᵀxᵢ)).

SVD solution for OLS:
- write ‖Xw − y‖₂².
- normal equations (differentiate and set to zero): XᵀXw = Xᵀy.
- writing X = USVᵀ, have VS²Vᵀw = VSUᵀy.
- thus the pseudoinverse solution X⁺y = VS⁺Uᵀy satisfies the normal equations.
for homework maybe also suggest experimenting with ridge regression (adding λ‖w‖²/2).
for pytorch solver, can have them manually do the gradient, and also use pytorch's .backward; see the sample code for lecture 1 (in the repository, not in the slides).
features: replace xᵢ with φ(xᵢ) where φ is some function. E.g., φ(x) = (1, x₁, …, x_d, x₁x₁, x₁x₂, …, x₁x_d, …, x_dx_d) means wᵀφ(x) is a quadratic (and now we can search over all possible quadratics with our optimization).

93 / 94
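A minimal numpy sanity check of the SVD/pseudoinverse route sketched in these notes (illustrative; a torch version using .backward would follow the same pattern):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((40, 6))
    y = rng.standard_normal(40)

    # Pseudoinverse solution via the SVD: X = U S V^T, so X^+ y = V S^+ U^T y.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    w = Vt.T @ ((U.T @ y) / S)

    # It satisfies the normal equations X^T X w = X^T y (up to numerical error).
    print(np.allclose(X.T @ X @ w, X.T @ y))      # True
    print(np.allclose(w, np.linalg.pinv(X) @ y))  # matches numpy's pinv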

16. Summary of linear regression so far

Main points

I Model/function/predictor class of linear regressors x ↦ wᵀx.

I ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.

I ERM solution for least squares: pick w satisfying AᵀAw = Aᵀb, which need not be unique; one canonical choice is the ordinary least squares (pseudoinverse) solution A⁺b.

I We also discussed feature expansion; affine and polynomial expansions are good to keep in mind!

94 / 94
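To tie these points together, a minimal end-to-end sketch (illustrative data and feature map, not course-provided code): expand inputs with an affine-plus-quadratic map φ and solve least squares via the minimum-norm (pseudoinverse) solution.

    import numpy as np

    def phi(X):
        """Affine + quadratic feature expansion: (1, x, all pairwise products x_i * x_j)."""
        n, d = X.shape
        ones = np.ones((n, 1))
        quad = np.stack([X[:, i] * X[:, j] for i in range(d) for j in range(d)], axis=1)
        return np.hstack([ones, X, quad])

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 3))
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] * X[:, 2] + 0.1 * rng.standard_normal(200)

    A = phi(X)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimum-norm least-squares solution A^+ y
    print(np.mean((A @ w - y) ** 2))           # small training error, close to the noise level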
