Linear regression
CS 446

Slides: mjt.cs.illinois.edu/courses/ml-s19/files/slides-linear_regression.pdf (Lectures 3-4, University of Illinois)


1. Overview


todo

- check some continuity bugs
- make sure nothing is missing from old lectures (both mine and Daniel's)
- fix some of those bugs, like b replacing y
- delete the excess material from the end
- add a proper summary slide which boils down concepts and reduces student worry

1 / 94


Lecture 1: supervised learning

Training data: labeled examples

(x1, y1), (x2, y2), . . . , (xn, yn)

where

- each input xi is a machine-readable description of an instance (e.g., image, sentence), and

- each corresponding label yi is an annotation relevant to the task; typically not easy to obtain automatically.

Goal: learn a function f from labeled examples that accurately "predicts" the labels of new (previously unseen) inputs.

[Diagram: past labeled examples → learning algorithm → learned predictor; new (unlabeled) example → learned predictor → predicted label]

2 / 94


Lecture 2: nearest neighbors and decision trees

[Figure: scatter plot of labeled points in the (x1, x2) plane]

Nearest neighbors.
Training/fitting: memorize data.
Testing/predicting: find the k closest memorized points, return the plurality label.
Overfitting? Vary k.

Decision trees.
Training/fitting: greedily partition space, reducing "uncertainty".
Testing/predicting: traverse the tree, output the leaf label.
Overfitting? Limit or prune the tree.

3 / 94


Lectures 3-4: linear regression

[Figure: scatter plot of duration (1.5-5.0) vs. delay (50-90)]

Linear regression / least squares.

Our first (of many!) linear prediction methods.

Today:

- Example.

- How to solve it; ERM, and SVD.

- Features.

Next lecture: advanced topics, including overfitting.

4 / 94


2. Example: Old Faithful


Prediction problem: Old Faithful geyser (Yellowstone)

Task: Predict time of next eruption.

5 / 94


Time between eruptions

Historical records of eruptions:

[Timeline: eruption start times a0, a1, a2, . . . and end times b0, b1, b2, . . . , with the gaps Y1, Y2, Y3, . . . between them]

Time until next eruption: Yi := ai − bi−1.

Prediction task:
At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.

On "Old Faithful" data:

- Using 136 past observations, we form the mean estimate µ = 70.7941.

- Can we do better?

6 / 94


Looking at the data

Naturalist Harry Woodward observed that the time until the next eruption seems to be related to the duration of the last eruption.

[Figure: scatter plot of duration of last eruption (1.5-5.5 min) vs. time until next eruption (50-90 min)]

7 / 94


Using side-information

At prediction time t, the duration of the last eruption is available as side-information.

[Timeline: past eruptions (an−1, bn−1), (an, bn), . . . , observed data (Xn, Yn), and the current time t with side-information X and unknown Y]

IID model for supervised learning:
(X1, Y1), . . . , (Xn, Yn), (X, Y) are iid random pairs (i.e., labeled examples).

X takes values in 𝒳 (e.g., 𝒳 = R), Y takes values in R.

1. We observe (X1, Y1), . . . , (Xn, Yn), and then choose a prediction function (a.k.a. predictor)

f : 𝒳 → R.

This is called "learning" or "training".

2. At prediction time, observe X, and form the prediction f(X).

How should we choose f based on data? Recall:

- The model is our choice.

- We must contend with overfitting, bad fitting algorithms, . . .

8 / 94


3. Least squares and linear regression


Which line?

[Figure: scatter plot of duration (1.5-5.0) vs. delay (50-90), with candidate regression lines]

Let's predict with a linear regressor:

ŷ := wᵀ (x, 1),

where w ∈ R² is learned from data.

Remark: appending 1 makes this an affine function x ↦ w1 x + w2. (More on this later. . . )

If data lies along a line, we should output that line. But what if not?

9 / 94


ERM setup for least squares.

- Predictors/model: f(x) = wᵀx; a linear predictor/regressor.
  (For linear classification: x ↦ sgn(wᵀx).)

- Loss/penalty: the least squares loss

  ℓ_ls(ŷ, y) = ℓ_ls(y, ŷ) = (ŷ − y)².

  (Some conventions scale this by 1/2.)

- Goal: minimize the least squares empirical risk

  R_ls(f) = (1/n) ∑_{i=1}^n ℓ_ls(yi, f(xi)) = (1/n) ∑_{i=1}^n (yi − f(xi))².

- Specifically, we choose w ∈ R^d according to

  arg min_{w ∈ R^d} R_ls(x ↦ wᵀx) = arg min_{w ∈ R^d} (1/n) ∑_{i=1}^n (yi − wᵀxi)².

- More generally, this is the ERM approach:
  pick a model and minimize empirical risk over the model parameters.

10 / 94
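
To make the objective concrete, a minimal numpy sketch that evaluates this empirical risk for a candidate w on a small synthetic dataset (the numbers are illustrative, not the Old Faithful data):

```python
import numpy as np

def least_squares_risk(w, X, y):
    """Empirical risk R_ls(x -> w^T x) = (1/n) * sum_i (y_i - w^T x_i)^2."""
    residuals = y - X @ w          # shape (n,)
    return np.mean(residuals ** 2)

# Toy data: n = 5 examples with d = 2 features (second feature is the constant 1,
# i.e. the affine expansion from the "Which line?" slide).
X = np.array([[1.8, 1.0], [2.2, 1.0], [3.3, 1.0], [4.1, 1.0], [4.7, 1.0]])
y = np.array([56.0, 59.0, 74.0, 82.0, 88.0])

w = np.array([10.0, 35.0])         # a candidate predictor x -> 10*x + 35
print(least_squares_risk(w, X, y))
```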


ERM in general

- Pick a family of models/predictors F. (For today, we use linear predictors.)

- Pick a loss function ℓ. (For today, we chose squared loss.)

- Minimize the empirical risk over the model parameters.

We haven't discussed: true risk and overfitting; how to minimize; why this is a good idea.

Remark: ERM is convenient in pytorch: just pick a model, a loss, an optimizer, and tell it to minimize.

11 / 94
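
A minimal sketch of that pytorch recipe on the same kind of toy data; the particular model, loss, optimizer, learning rate, and iteration count below are just one reasonable instantiation:

```python
import torch

# Toy data (illustrative): inputs with a trailing 1 for the bias, and real-valued labels.
X = torch.tensor([[1.8, 1.0], [2.2, 1.0], [3.3, 1.0], [4.1, 1.0], [4.7, 1.0]])
y = torch.tensor([[56.0], [59.0], [74.0], [82.0], [88.0]])

model = torch.nn.Linear(2, 1, bias=False)     # the model: x -> w^T x
loss_fn = torch.nn.MSELoss()                  # the loss: least squares
opt = torch.optim.SGD(model.parameters(), lr=0.05)

for _ in range(5000):                         # minimize the empirical risk
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print(model.weight.detach())                  # learned w (close to the least squares solution)
```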


Least squares ERM in pictures

Red dots: data points.

Affine hyperplane: our predictions (via affine expansion (x1, x2) ↦ (1, x1, x2)).

ERM: minimize the sum of squared vertical lengths from the hyperplane to the points.

12 / 94


Empirical risk minimization in matrix notation

Define the n × d matrix A and n × 1 column vector b by

A := (1/√n) [← x1ᵀ →; . . . ; ← xnᵀ →],   b := (1/√n) (y1, . . . , yn)ᵀ,

i.e., the rows of A are the scaled inputs xiᵀ/√n and the entries of b are the scaled labels yi/√n.

Can write the empirical risk as

R(w) = (1/n) ∑_{i=1}^n (yi − xiᵀw)² = ‖Aw − b‖₂².

Necessary condition for w to be a minimizer of R:

∇R(w) = 0, i.e., w is a critical point of R.

This translates to

(AᵀA)w = Aᵀb,

a system of linear equations called the normal equations.

In an upcoming lecture we'll prove every critical point of R is a minimizer of R.

13 / 94
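
A small numpy sketch of this procedure on synthetic data; the 1/√n scaling is kept only to match the slides and does not change the solution:

```python
import numpy as np

# Synthetic data (illustrative): n examples, d features.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

A = X / np.sqrt(n)          # rows are x_i^T / sqrt(n)
b = y / np.sqrt(n)          # entries are y_i / sqrt(n)

# Solve the normal equations (A^T A) w = A^T b.
w = np.linalg.solve(A.T @ A, A.T @ b)

# Same solution, posed directly as a least squares problem min_w ||Aw - b||^2.
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(w, w_lstsq))   # True (here A^T A is invertible)
```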


Summary on ERM and linear regression

Procedure:

- Form matrix A and vector b with the data (resp. xi, yi) as rows.
  (The scaling factor 1/√n is not standard; it doesn't change the solution.)

- Find w satisfying the normal equations AᵀAw = Aᵀb.
  (E.g., via Gaussian elimination, taking time O(nd²).)

- In general, solutions are not unique. (Why not?)

- If AᵀA is invertible, can choose the (unique) solution (AᵀA)⁻¹Aᵀb.

- Recall our original conundrum: we want to fit some line.
  We chose least squares; it gives one (family of) choice(s).
  Next lecture, with logistic regression, we get another.

- Note: if Aw = b for some w, then the data lies along a line, and we might as well not worry about picking a loss function.

- Note: Aw − b = 0 may not have solutions, but the least squares setting means we instead work with Aᵀ(Aw − b) = 0, which does have solutions. . .

14 / 94


4. SVD and least squares


SVD

Recall the Singular Value Decomposition (SVD) M = USVᵀ ∈ R^{m×n}, where

- U ∈ R^{m×r} is orthonormal, S ∈ R^{r×r} is diag(s1, . . . , sr) with s1 ≥ s2 ≥ · · · ≥ sr ≥ 0, and V ∈ R^{n×r} is orthonormal, with r := rank(M). (If r = 0, use the convention S = 0 ∈ R^{1×1}.)

- This convention is sometimes called the thin SVD.

- Another notation is to write M = ∑_{i=1}^r si ui viᵀ. This avoids the issue with 0 (an empty sum is 0). Moreover, this notation makes it clear that (ui)_{i=1}^r span the column space and (vi)_{i=1}^r span the row space of M.

- The full SVD will not be used in this class; it fills out U and V to be full rank and orthonormal, and pads S with zeros. It agrees with the eigendecompositions of MᵀM and MMᵀ.

- Note: numpy and pytorch have SVD (interfaces slightly differ). Determining r runs into numerical issues.

15 / 94
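
For reference, a numpy sketch of the thin SVD and of determining r numerically; the tolerance rule below mirrors numpy's matrix_rank default and is an implementation choice, not something fixed by the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))   # a 6x5 matrix with rank 2 (generically)

# numpy returns U (m x k), s (length k), Vt (k x n) with k = min(m, n);
# full_matrices=False avoids the zero-padded "full" SVD.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Determine r = rank(M) numerically: count singular values above a tolerance.
tol = max(M.shape) * np.finfo(M.dtype).eps * s[0]
r = int(np.sum(s > tol))
print(r)                                   # 2

# Keep only the top-r parts to get the thin SVD from the slide.
U_r, s_r, V_r = U[:, :r], s[:r], Vt[:r, :].T
print(np.allclose(M, U_r @ np.diag(s_r) @ V_r.T))   # True
```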


Pseudoinverse

Let the SVD M = ∑_{i=1}^r si ui viᵀ be given.

- Define the pseudoinverse M⁺ = ∑_{i=1}^r (1/si) vi uiᵀ.
  (If 0 = M ∈ R^{m×n}, then 0 = M⁺ ∈ R^{n×m}.)

- Alternatively, define the pseudoinverse S⁺ of a diagonal matrix to be Sᵀ but with reciprocals of the non-zero elements; then M⁺ = V S⁺ Uᵀ.

- Also called the Moore-Penrose pseudoinverse; it is unique, even though the SVD is not unique (why not?).

- If M⁻¹ exists, then M⁻¹ = M⁺ (why?).

16 / 94
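
A quick numerical check, on a synthetic rank-deficient matrix, that the sum form of M⁺ matches numpy's built-in Moore-Penrose pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))   # rank 2, so M is not invertible

U, s, Vt = np.linalg.svd(M, full_matrices=False)
tol = max(M.shape) * np.finfo(M.dtype).eps * s[0]
r = int(np.sum(s > tol))

# M^+ = sum_{i=1}^r (1/s_i) v_i u_i^T, built from the top-r singular triples.
M_pinv = Vt[:r, :].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

print(np.allclose(M_pinv, np.linalg.pinv(M)))   # True
```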


SVD and least squares

Recall: we'd like to find w such that

AᵀAw = Aᵀb.

If w = A⁺b, then

AᵀAw = (∑_{i=1}^r si vi uiᵀ) (∑_{i=1}^r si ui viᵀ) (∑_{i=1}^r (1/si) vi uiᵀ) b
     = (∑_{i=1}^r si vi uiᵀ) (∑_{i=1}^r ui uiᵀ) b = Aᵀb.

Henceforth, define wols = A⁺b as the OLS solution.
(OLS = "ordinary least squares".)

Note: in general, AA⁺ = ∑_{i=1}^r ui uiᵀ ≠ I.

17 / 94
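
A sketch of computing wols = A⁺b on synthetic, deliberately rank-deficient data, checking the normal equations and the remark that AA⁺ is a projection rather than the identity:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 4
X = rng.normal(size=(n, d))
X[:, 3] = X[:, 0] + X[:, 1]            # make A rank deficient: rank(A) = 3 < d
y = rng.normal(size=n)

A, b = X / np.sqrt(n), y / np.sqrt(n)

w_ols = np.linalg.pinv(A) @ b          # w_ols = A^+ b

# It satisfies the normal equations even though A^T A is singular here.
print(np.allclose(A.T @ A @ w_ols, A.T @ b))     # True

# And A A^+ is a projection onto range(A), not the identity.
P = A @ np.linalg.pinv(A)
print(np.allclose(P, np.eye(n)))                 # False
print(np.allclose(P @ P, P))                     # True (idempotent)
```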


5. Summary of linear regression so far


Main points

- Model/function/predictor class of linear regressors x ↦ wᵀx.

- ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.

- ERM solution for least squares: pick w satisfying AᵀAw = Aᵀb, which is not unique; one unique choice is the ordinary least squares solution A⁺b.

18 / 94


Part 2 of linear regression lecture. . .


Recap on SVD. (A messy slide, I'm sorry.)

Suppose 0 ≠ M ∈ R^{n×d}, thus r := rank(M) > 0.

- "Decomposition form" thin SVD: M = ∑_{i=1}^r si ui viᵀ with s1 ≥ · · · ≥ sr > 0, and M⁺ = ∑_{i=1}^r (1/si) vi uiᵀ; in general M⁺M = ∑_{i=1}^r vi viᵀ ≠ I.

- "Factorization form" thin SVD: M = USVᵀ, where U ∈ R^{n×r} and V ∈ R^{d×r} have orthonormal columns (so UᵀU = VᵀV = I_r, but UUᵀ and VVᵀ are not identity matrices in general), and S = diag(s1, . . . , sr) ∈ R^{r×r} with s1 ≥ · · · ≥ sr > 0; the pseudoinverse is M⁺ = V S⁻¹ Uᵀ, and in general neither M⁺M nor MM⁺ is the identity.

- Full SVD: M = U_f S_f V_fᵀ, where U_f ∈ R^{n×n} and V_f ∈ R^{d×d} are orthonormal and full rank (so U_fᵀU_f and V_fᵀV_f are identity matrices), and S_f ∈ R^{n×d} is zero everywhere except the first r diagonal entries, which are s1 ≥ · · · ≥ sr > 0; the pseudoinverse is M⁺ = V_f S_f⁺ U_fᵀ, where S_f⁺ is obtained by transposing S_f and then inverting its nonzero entries, and in general neither M⁺M nor MM⁺ is the identity. Additional property: agreement with the eigendecompositions of MMᵀ and MᵀM.

The "full SVD" adds columns to U and V which hit zeros of S and therefore don't matter
(as a sanity check, verify for yourself that all these SVDs are equal).

19 / 94


Recap on SVD, zero matrix case

Suppose 0 = M ∈ R^{n×d}, thus r := rank(M) = 0.

- In all types of SVD, M⁺ is Mᵀ (another zero matrix).

- Technically speaking, s is a singular value of M iff there exist nonzero vectors (u, v) with Mv = su and Mᵀu = sv, and the zero matrix therefore has no singular values (or left/right singular vectors).

- "Factorization form thin SVD" becomes a little messy.

20 / 94


6. More on the normal equations


Recall our matrix notation

Let labeled examples ((xi, yi))_{i=1}^n be given.

Define the n × d matrix A and n × 1 column vector b by

A := (1/√n) [← x1ᵀ →; . . . ; ← xnᵀ →],   b := (1/√n) (y1, . . . , yn)ᵀ.

Can write the empirical risk as

R(w) = (1/n) ∑_{i=1}^n (yi − xiᵀw)² = ‖Aw − b‖₂².

Necessary condition for w to be a minimizer of R:

∇R(w) = 0, i.e., w is a critical point of R.

This translates to

(AᵀA)w = Aᵀb,

a system of linear equations called the normal equations.

We'll now finally show that the normal equations imply optimality.

21 / 94


Normal equations imply optimality

Consider w with AᵀAw = Aᵀb, and any w′; then

‖Aw′ − b‖² = ‖Aw′ − Aw + Aw − b‖²
           = ‖Aw′ − Aw‖² + 2(Aw′ − Aw)ᵀ(Aw − b) + ‖Aw − b‖².

Since

(Aw′ − Aw)ᵀ(Aw − b) = (w′ − w)ᵀ(AᵀAw − Aᵀb) = 0,

then ‖Aw′ − b‖² = ‖Aw′ − Aw‖² + ‖Aw − b‖² ≥ ‖Aw − b‖². This means w is optimal.

Moreover, writing A = ∑_{i=1}^r si ui viᵀ,

‖Aw′ − Aw‖² = (w′ − w)ᵀ(AᵀA)(w′ − w) = (w′ − w)ᵀ (∑_{i=1}^r si² vi viᵀ) (w′ − w),

so w′ is also optimal iff w′ − w is in the right nullspace of A.

(We'll revisit all this with convexity later.)

22 / 94
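
A small numerical illustration of both claims on synthetic data; the extra column is constructed so that A has a nontrivial nullspace:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
A = rng.normal(size=(n, 3))
A = np.column_stack([A, A[:, 0] - A[:, 1]])     # 4th column in the span: nontrivial nullspace
b = rng.normal(size=n)

w = np.linalg.pinv(A) @ b                       # one solution of the normal equations

def risk(v):
    return np.sum((A @ v - b) ** 2)             # ||Av - b||^2

# Any perturbation increases (or keeps) the risk ...
for _ in range(5):
    w_prime = w + 0.5 * rng.normal(size=A.shape[1])
    assert risk(w_prime) >= risk(w) - 1e-9

# ... and moving within the nullspace of A leaves the risk unchanged.
z = np.array([1.0, -1.0, 0.0, -1.0])            # A z = 0 by construction of the 4th column
print(np.allclose(A @ z, 0), np.isclose(risk(w + z), risk(w)))
```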


Geometric interpretation of least squares ERM

Let aj ∈ Rⁿ be the j-th column of the matrix A ∈ R^{n×d}, so A = [a1 · · · ad].

Minimizing ‖Aw − b‖₂² is the same as finding the vector b̂ ∈ range(A) closest to b.

The solution b̂ is the orthogonal projection of b onto range(A) = {Aw : w ∈ R^d}.

[Figure: b projected orthogonally onto the plane spanned by a1 and a2, giving b̂]

- b̂ is uniquely determined; indeed, b̂ = AA⁺b = ∑_{i=1}^r ui uiᵀ b.

- If r = rank(A) < d, then there is more than one way to write b̂ as a linear combination of a1, . . . , ad.

If rank(A) < d, then the ERM solution is not unique.

To get w from b̂: solve the system of linear equations Aw = b̂.

23 / 94
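
A short numpy sketch of this picture on synthetic data: the residual b − b̂ is orthogonal to every column of A, and Aw = b̂ is solvable exactly:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 20, 3
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

b_hat = A @ np.linalg.pinv(A) @ b      # orthogonal projection of b onto range(A)

# Residual b - b_hat is orthogonal to every column a_j of A.
print(np.allclose(A.T @ (b - b_hat), 0))        # True

# To get w from b_hat, solve Aw = b_hat (least squares finds an exact solution here).
w, *_ = np.linalg.lstsq(A, b_hat, rcond=None)
print(np.allclose(A @ w, b_hat))                # True
```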


7. Features


Enhancing linear regression models with features

Linear functions alone are restrictive, but become powerful with creative side-information, or features.

Idea: Predict with x ↦ wᵀφ(x), where φ is a feature mapping.

Examples:

1. Non-linear transformations of existing variables: for x ∈ R,

   φ(x) = ln(1 + x).

2. Logical formula of binary variables: for x = (x1, . . . , xd) ∈ {0, 1}^d,

   φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7).

3. Trigonometric expansion: for x ∈ R,

   φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).

4. Polynomial expansion: for x = (x1, . . . , xd) ∈ R^d,

   φ(x) = (1, x1, . . . , xd, x1², . . . , xd², x1x2, . . . , x1xd, . . . , x_{d−1}x_d).

24 / 94
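
A minimal sketch of using such feature maps with least squares; the feature maps and the synthetic data-generating function below are illustrative choices, not from the slides:

```python
import numpy as np

def phi_trig(x, k=3):
    """Trigonometric expansion of a scalar x: (1, sin x, cos x, ..., sin kx, cos kx)."""
    feats = [1.0]
    for j in range(1, k + 1):
        feats += [np.sin(j * x), np.cos(j * x)]
    return np.array(feats)

def phi_quadratic(x):
    """Quadratic expansion of a vector x: (1, x_1..x_d, x_1^2..x_d^2, cross terms)."""
    d = len(x)
    cross = [x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([[1.0], x, x ** 2, cross])

# Linear regression on top of a feature map: build the feature matrix, then least squares.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = 1.0 + X[:, 0] - 2.0 * X[:, 0] * X[:, 1] + 0.05 * rng.normal(size=200)

Phi = np.stack([phi_quadratic(x) for x in X])   # n x (1 + d + d + d(d-1)/2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(w, 2))   # roughly (1, 1, 0, 0, 0, -2): bias, x1, x2, x1^2, x2^2, x1*x2
```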


Example: Taking advantage of linearity

Suppose you are trying to predict some health outcome.

- A physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:

  φ(x) = (x_temp − 98.6)².

- What if you didn't know about this magic constant 98.6?

- Instead, use

  φ(x) = (1, x_temp, x_temp²).

Can learn coefficients w such that

wᵀφ(x) = (x_temp − 98.6)²,

or any other quadratic polynomial in x_temp (which may be better!).

25 / 94
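
A small numerical illustration with synthetic temperatures (noise-free, so the recovery is essentially exact): fitting w over φ(x) = (1, x, x²) recovers the physician's feature, since (x − 98.6)² = 9721.96 − 197.2x + x²:

```python
import numpy as np

rng = np.random.default_rng(6)
temps = rng.normal(loc=98.6, scale=1.5, size=300)   # synthetic body temperatures (deg F)
outcome = (temps - 98.6) ** 2                       # the physician's feature, noise-free for clarity

Phi = np.column_stack([np.ones_like(temps), temps, temps ** 2])   # phi(x) = (1, x, x^2)
w, *_ = np.linalg.lstsq(Phi, outcome, rcond=None)

print(np.round(w, 4))    # approximately [9721.96, -197.2, 1.0]
```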


Quadratic expansion

Quadratic function f : R → R,

f(x) = ax² + bx + c, x ∈ R,

for a, b, c ∈ R.

This can be written as a linear function of φ(x), where

φ(x) := (1, x, x²),

since

f(x) = wᵀφ(x)

where w = (c, b, a).

For a multivariate quadratic function f : R^d → R, use

φ(x) := (1, x1, . . . , xd, x1², . . . , xd², x1x2, . . . , x1xd, . . . , x_{d−1}x_d),

i.e., linear terms, squared terms, and cross terms.

26 / 94


Affine expansion and "Old Faithful"

Woodward needed an affine expansion for the "Old Faithful" data:

φ(x) := (1, x).

[Figure: scatter plot of duration of last eruption (0-6 min) vs. time until next eruption (0-100 min), with the fitted affine function overlaid]

An affine function f_{a,b} : R → R for a, b ∈ R,

f_{a,b}(x) = a + bx,

is a linear function f_w of φ(x) for w = (a, b).

(This easily generalizes to multivariate affine functions.)

27 / 94


Final remarks on features

- "Feature engineering" can drastically change the power of a model.

- Some people consider it messy, unprincipled, pure "trial-and-error".

- Deep learning is somewhat touted as removing some of this, but it doesn't do so completely (e.g., it took a lot of work to come up with the "convolutional neural network"; side question: who came up with that?).

28 / 94


8. Statistical view of least squares; maximum likelihood


Maximum likelihood estimation (MLE) refresher

Parametric statistical model:
P = {P_θ : θ ∈ Θ}, a collection of probability distributions for observed data.

- Θ: parameter space.

- θ ∈ Θ: a particular parameter (or parameter vector).

- P_θ: a particular probability distribution for observed data.

Likelihood of θ ∈ Θ given observed data x:
For discrete X ∼ P_θ with probability mass function p_θ,

L(θ) := p_θ(x).

For continuous X ∼ P_θ with probability density function f_θ,

L(θ) := f_θ(x).

Maximum likelihood estimator (MLE):
Let θ̂ be the θ ∈ Θ of highest likelihood given the observed data.

29 / 94


Distributions over labeled examples

𝒳: space of possible side-information (feature space).
𝒴: space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in 𝒳 × 𝒴 can be thought of in two parts:

1. Marginal distribution P_X of X:

   P_X is a probability distribution on 𝒳.

2. Conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ 𝒳:

   P_{Y|X=x} is a probability distribution on 𝒴.

This lecture: 𝒴 = R (regression problems).

30 / 94


Optimal predictor

What function f : 𝒳 → R has the smallest (squared loss) risk

R(f) := E[(f(X) − Y)²]?

Note: earlier we discussed empirical risk.

- Conditional on X = x, the minimizer of the conditional risk

  ŷ ↦ E[(ŷ − Y)² | X = x]

  is the conditional mean E[Y | X = x].

- Therefore, the function f* : 𝒳 → R where

  f*(x) = E[Y | X = x], x ∈ 𝒳,

  has the smallest risk.

- f* is called the regression function or conditional mean function.

31 / 94


Linear regression models

When side-information is encoded as vectors of real numbers x = (x1, . . . , xd) (called features or variables), it is common to use a linear regression model, such as the following:

Y | X = x ∼ N(xᵀw, σ²), x ∈ R^d.

- Parameters: w = (w1, . . . , wd) ∈ R^d, σ² > 0.

- X = (X1, . . . , Xd), a random vector (i.e., a vector of random variables).

- Conditional distribution of Y given X is normal.

- Marginal distribution of X not specified.

In this model, the regression function f* is a linear function f_w : R^d → R,

f_w(x) = xᵀw = ∑_{i=1}^d xi wi, x ∈ R^d.

(We'll often refer to f_w just by w.)

[Figure: data (x, y) scattered around the line f*, for x ∈ [−1, 1]]

32 / 94
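
A tiny sketch of sampling from this model; the particular w, σ, and marginal for X below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)
d, n = 2, 5
w_true, sigma = np.array([1.5, -0.5]), 0.3   # model parameters (w, sigma^2), chosen for illustration

X = rng.uniform(-1.0, 1.0, size=(n, d))      # marginal of X: up to us, the model doesn't specify it
Y = X @ w_true + sigma * rng.normal(size=n)  # conditional: Y | X = x  ~  N(x^T w, sigma^2)

f_star = X @ w_true                          # regression function f*(x) = x^T w at the sampled points
print(np.column_stack([f_star, Y]))          # each Y is its f*(x) plus N(0, sigma^2) noise
```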

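To make the generative story concrete, here is a minimal NumPy sketch of sampling from this model; the particular w, σ, and marginal distribution of X are illustrative choices (the model itself leaves the marginal of X unspecified).

```python
# Sketch: sampling from the linear regression model Y | X = x ~ N(x'w, sigma^2).
# The specific w, sigma, and X-distribution below are illustrative, not fixed
# by the slides (the model leaves the marginal distribution of X unspecified).
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 1000
w_true = np.array([1.0, -2.0, 0.5])   # hypothetical parameter w
sigma = 0.3                            # hypothetical noise level

X = rng.normal(size=(n, d))            # some marginal distribution for X
y = X @ w_true + sigma * rng.normal(size=n)   # Y | X = x ~ N(x'w, sigma^2)

# The regression function f_w(x) = x'w at a query point:
x_query = np.array([0.2, 0.1, -1.0])
print("f_w(x) =", x_query @ w_true)
```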

Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise: (X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid, with

      Y | X = x ∼ N(x^T w, σ^2),   x ∈ R^d.

(Traditional to study linear regression in the context of this model.)

Log-likelihood of (w, σ^2), given data (X_i, Y_i) = (x_i, y_i) for i = 1, . . . , n:

      ∑_{i=1}^n { −(1/(2σ^2)) (x_i^T w − y_i)^2 + (1/2) ln(1/(2πσ^2)) } + { terms not involving (w, σ^2) }.

The w that maximizes the log-likelihood is also the w that minimizes

      (1/n) ∑_{i=1}^n (x_i^T w − y_i)^2.

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .

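A short sketch of the equivalence on this slide, under the same illustrative data-generating choices as above: the MLE of w is exactly the least-squares solution, and the MLE of σ^2 is the mean squared residual at that solution.

```python
# Sketch: under the Gaussian model, maximizing the log-likelihood in w is the
# same as minimizing (1/n) sum_i (x_i'w - y_i)^2, i.e., least squares.
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=n)

# Least-squares solution = MLE of w.
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# MLE of sigma^2 = mean squared residual at w_mle.
sigma2_mle = np.mean((X @ w_mle - y) ** 2)
print(w_mle, sigma2_mle)
```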

Empirical distribution and empirical risk

Empirical distribution P_n on (x_1, y_1), . . . , (x_n, y_n) has probability mass function p_n given by

      p_n((x, y)) := (1/n) ∑_{i=1}^n 1{(x, y) = (x_i, y_i)},   (x, y) ∈ R^d × R.

Plug-in principle: Goal is to find a function f that minimizes the (squared loss) risk

      R(f) = E[(f(X) − Y)^2].

But we don't know the distribution P of (X, Y).

Replace P with P_n → empirical (squared loss) risk R̂(f):

      R̂(f) := (1/n) ∑_{i=1}^n (f(x_i) − y_i)^2.

("Plug-in principle" is used throughout statistics in this same way.)

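As a small illustration, here is a helper that evaluates the empirical risk R̂(f) of any predictor on a sample; the predictor and toy data below are arbitrary placeholders.

```python
# Sketch: empirical (squared-loss) risk of a predictor f on a sample, i.e. the
# plug-in estimate of R(f) obtained by replacing P with the empirical P_n.
import numpy as np

def empirical_risk(f, xs, ys):
    """R_hat(f) = (1/n) * sum_i (f(x_i) - y_i)^2."""
    preds = np.array([f(x) for x in xs])
    return np.mean((preds - np.array(ys)) ** 2)

# Example with a linear predictor f_w(x) = x'w (w chosen arbitrarily):
w = np.array([1.0, -2.0])
xs = [np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, -1.0])]
ys = [-2.1, -0.9, 4.2]
print(empirical_risk(lambda x: x @ w, xs, ys))
```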

Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss: ERM returns

      ŵ ∈ arg min_{w ∈ R^d} R̂(w),

which coincides with MLE under the basic linear regression model.

In general:

I MLE makes sense in the context of the statistical model for which it is derived.

I ERM makes sense in the context of the general iid model for supervised learning.

Further remarks.

I In MLE, we assume a model, and we not only maximize likelihood, but can try to argue that we "recover" a "true" parameter.

I In ERM, by default there is no assumption of a "true" parameter to recover.

Useful examples: medical testing, gene expression, . . .

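A minimal sketch of ERM for linear predictors under squared loss, written with the data matrix A (rows x_i^T) and label vector b (entries y_i) used on the following slides; the data here are synthetic placeholders.

```python
# Sketch: ERM for linear predictors under squared loss is least squares on the
# data matrix A (rows x_i') and label vector b (entries y_i).
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 4
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w_erm, *_ = np.linalg.lstsq(A, b, rcond=None)   # argmin_w ||Aw - b||^2
emp_risk = np.mean((A @ w_erm - b) ** 2)        # R_hat(w_erm)
print(w_erm, emp_risk)
```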

Old Faithful data under this least squares statistical model

Recall our data, consisting of historical records of eruptions:

(Figure: a timeline of eruptions with start times a_0, a_1, a_2, . . . and end times b_0, b_1, b_2, . . .; the waiting times Y_1, Y_2, Y_3, . . . are the gaps between one eruption's end and the next eruption's start, and prediction happens at a later time t.)

Statistical model (not just IID!): Y_1, . . . , Y_n, Y ∼_iid N(µ, σ^2).

I Data: Y_i := a_i − b_{i−1}, i = 1, . . . , n.

(Admittedly not a great model, since durations are non-negative.)

Task:
At a later time t (when an eruption ends), predict the time of the next eruption, t + Y.
For the linear regression model, we'll assume

      Y | X = x ∼ N(x^T w, σ^2),   x ∈ R^d.

(This extends the model above if we add the "1" feature.)

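For concreteness, a sketch of how the waiting times Y_i and the constant prediction would be computed; the eruption start/end times below are hypothetical placeholders, and the sample mean of the Y_i is the MLE of µ under the model above.

```python
# Sketch: forming the waiting times Y_i = a_i - b_{i-1} from hypothetical
# eruption start times a_i and end times b_i, and the constant predictor
# mu_hat = mean(Y), the MLE of mu under the N(mu, sigma^2) model.
import numpy as np

a = np.array([10.0, 85.0, 160.0, 250.0])   # hypothetical eruption start times
b = np.array([14.0, 89.0, 165.0, 254.0])   # hypothetical eruption end times

Y = a[1:] - b[:-1]          # waiting times between eruptions
mu_hat = Y.mean()           # constant prediction of the next waiting time

t = b[-1]                   # current time: the last eruption just ended
print("predicted next eruption at", t + mu_hat)
```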


9. Regularization and ridge regression

Inductive bias

Suppose the ERM solution is not unique. What should we do?

One possible answer: Pick the w of shortest length.

I Fact: The shortest solution w to (A^T A) w = A^T b is always unique.

I Obtain this ŵ via
      ŵ = A^+ b,
  where A^+ is the (Moore-Penrose) pseudoinverse of A.

Why should this be a good idea?

I The data does not give a reason to choose a shorter w over a longer w.

I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the "true" w⋆ is short), not for others.

All learning algorithms encode some kind of inductive bias.

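A sketch of the minimum-norm choice when n < d, using NumPy's pseudoinverse; it also constructs a second, longer solution (by adding a null-space direction of A) that fits the data equally well, which is exactly the non-uniqueness this slide is about.

```python
# Sketch: when n < d, many w fit the data exactly; the pseudoinverse picks the
# shortest one.  Any vector in the null space of A can be added without
# changing the fit, giving other (longer) ERM solutions.
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 10
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w_min_norm = np.linalg.pinv(A) @ b          # shortest solution, w = A^+ b

# Another exact solution: add the null-space component of a basis vector.
null_dir = np.eye(d)[0] - np.linalg.pinv(A) @ (A @ np.eye(d)[0])
w_other = w_min_norm + 5.0 * null_dir

print(np.allclose(A @ w_min_norm, b), np.allclose(A @ w_other, b))   # both fit
print(np.linalg.norm(w_min_norm), np.linalg.norm(w_other))           # lengths
```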

Example

ERM with scaled trigonometric feature expansion:

      φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), . . . ).

(Figures: the training data alone; the training data with an arbitrary ERM fit; the training data with the least ℓ_2 norm ERM fit.)

It is not a given that the least norm ERM is better than the other ERM!

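A sketch of this feature expansion, truncated at K frequencies (the truncation level and the toy data are illustrative), together with the least ℓ_2 norm ERM obtained from the pseudoinverse.

```python
# Sketch: scaled trigonometric feature expansion truncated at K frequencies,
# with the least l2-norm ERM fit via the pseudoinverse (here n < d, so the
# fit interpolates the training data).
import numpy as np

def phi(x, K=10):
    feats = [1.0]
    for k in range(1, K + 1):
        feats += [np.sin(k * x) / k, np.cos(k * x) / k]
    return np.array(feats)

rng = np.random.default_rng(4)
xs = rng.uniform(0, 6, size=8)                 # a few training inputs
ys = np.sin(xs) + 0.1 * rng.normal(size=8)     # toy targets

A = np.stack([phi(x) for x in xs])             # 8 x (2K+1) design matrix
w = np.linalg.pinv(A) @ ys                     # least-norm ERM

print(phi(1.0)[:5])                            # first few features at x = 1
print(np.allclose(A @ w, ys))                  # interpolates the training data
```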

Regularized ERM

Combine the two concerns: For a given λ ≥ 0, find the minimizer of

      R̂(w) + λ ‖w‖_2^2

over w ∈ R^d.

Fact: If λ > 0, then the solution is always unique (even if n < d)!

I This is called ridge regression.
  (λ = 0 is ERM / ordinary least squares.)

Explicit solution: (A^T A + λI)^{−1} A^T b.

I Parameter λ controls how much attention is paid to the regularizer ‖w‖_2^2 relative to the data-fitting term R̂(w).

I Choose λ using cross-validation.

Note: in deep networks, this regularization is called "weight decay". (Why?)
Note: another popular regularizer for linear regression is ℓ_1.

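A sketch of ridge regression via the closed form above; the solution exists and is unique for any λ > 0 even with n < d, and larger λ shrinks ‖w‖_2. The λ values below are illustrative; in practice λ would be chosen by cross-validation.

```python
# Sketch: ridge regression via the closed form (A'A + lambda*I)^{-1} A'b.
# For any lambda > 0 the matrix A'A + lambda*I is invertible, so the solution
# is unique even when n < d.
import numpy as np

def ridge(A, b, lam):
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

rng = np.random.default_rng(5)
n, d = 5, 10                       # fewer examples than features
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

for lam in [0.01, 0.1, 1.0, 10.0]:
    w = ridge(A, b, lam)
    print(lam, np.linalg.norm(w))  # larger lambda shrinks ||w||
```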


10. True risk and overfitting

Statistical interpretation of ERM

Let (X, Y) ∼ P, where P is some distribution on R^d × R.
Which w has the smallest risk R(w) = E[(X^T w − Y)^2]?

Necessary condition for w to be a minimizer of R:

      ∇R(w) = 0, i.e., w is a critical point of R.

This translates to
      E[XX^T] w = E[Y X],
a system of linear equations called the population normal equations.

It can be proved that every critical point of R is a minimizer of R.

Looks familiar?

If (X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid, then

      E[A^T A] = E[XX^T] and E[A^T b] = E[Y X],

so ERM can be regarded as a plug-in estimator for a minimizer of R.

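A small simulation of the plug-in view: the ERM solution solves the empirical normal equations, and as n grows it approaches the population minimizer (known here by construction; all distributional choices are illustrative).

```python
# Sketch: the ERM solution solves the (empirical) normal equations, the
# plug-in version of E[XX']w = E[YX].  With more data, the empirical solution
# approaches the population minimizer w*.
import numpy as np

rng = np.random.default_rng(6)
d = 3
w_star = np.array([1.0, -2.0, 0.5])      # population minimizer, by construction

def sample(n):
    X = rng.normal(size=(n, d))
    y = X @ w_star + rng.normal(size=n)  # noise has mean zero given X
    return X, y

for n in [50, 500, 5000]:
    X, y = sample(n)
    # Solve (1/n) X'X w = (1/n) X'y  -- the empirical normal equations.
    w_hat = np.linalg.solve(X.T @ X / n, X.T @ y / n)
    print(n, np.linalg.norm(w_hat - w_star))
```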

Risk of ERM

IID model: (X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid, taking values in R^d × R.

Let w⋆ be a minimizer of R over all w ∈ R^d, i.e., w⋆ satisfies the population normal equations

      E[XX^T] w⋆ = E[Y X].

I If the ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w⋆).

I What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,

      R(ŵ) − R(w⋆) = O( tr(cov(εW)) / n )

"asymptotically", where W := E[XX^T]^{−1/2} X and ε := Y − X^T w⋆.


Risk of ERM analysis (rough sketch)

Let ε_i := Y_i − X_i^T w⋆ for each i = 1, . . . , n, so

      E[ε_i X_i] = E[Y_i X_i] − E[X_i X_i^T] w⋆ = 0

and

      √n (ŵ − w⋆) = ( (1/n) ∑_{i=1}^n X_i X_i^T )^{−1} (1/√n) ∑_{i=1}^n ε_i X_i.

1. By the LLN:  (1/n) ∑_{i=1}^n X_i X_i^T →_p E[XX^T].

2. By the CLT:  (1/√n) ∑_{i=1}^n ε_i X_i →_d cov(εX)^{1/2} Z, where Z ∼ N(0, I).

Therefore, the asymptotic distribution of √n (ŵ − w⋆) is given by

      √n (ŵ − w⋆) →_d E[XX^T]^{−1} cov(εX)^{1/2} Z.

A few more steps give

      n ( E[(X^T ŵ − Y)^2] − E[(X^T w⋆ − Y)^2] ) →_d ‖E[XX^T]^{−1/2} cov(εX)^{1/2} Z‖_2^2.

The random variable on the RHS is "concentrated" around its mean tr(cov(εW)).


Risk of ERM: postscript

I The analysis does not assume that the linear regression model is "correct"; the data distribution need not come from the normal linear regression model.

I The only assumptions are those needed for the LLN and CLT to hold.

I However, if the normal linear regression model does hold, i.e.,

      Y | X = x ∼ N(x^T w⋆, σ^2),

  then the bound from the theorem becomes

      R(ŵ) − R(w⋆) = O(σ^2 d / n),

  which is familiar to those who have taken introductory statistics.

I With more work, one can also prove a non-asymptotic risk bound of similar form.

I In homework/reading, we look at a simpler setting for studying ERM for linear regression, called "fixed design".

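A quick simulation of the well-specified case, where the excess risk averaged over repetitions should be roughly σ^2 d / n; with X ∼ N(0, I) the excess risk simplifies to ‖ŵ − w⋆‖_2^2. All constants below are illustrative.

```python
# Sketch: excess risk of ERM under the well-specified Gaussian model.
# With X ~ N(0, I), the excess risk is (w_hat - w*)' E[XX'] (w_hat - w*)
# = ||w_hat - w*||^2, and its average over repetitions is roughly sigma^2*d/n.
import numpy as np

rng = np.random.default_rng(7)
d, n, sigma, trials = 10, 500, 2.0, 200
w_star = rng.normal(size=d)

excess = []
for _ in range(trials):
    X = rng.normal(size=(n, d))
    y = X @ w_star + sigma * rng.normal(size=n)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    excess.append(np.sum((w_hat - w_star) ** 2))

print("average excess risk:", np.mean(excess))
print("sigma^2 * d / n    :", sigma**2 * d / n)
```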

Risk vs empirical risk

Let ŵ be the ERM solution.

1. Empirical risk of ERM: R̂(ŵ)

2. True risk of ERM: R(ŵ)

Theorem.
      E[R̂(ŵ)] ≤ E[R(ŵ)].

(Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: empirical risk is "small", but true risk is "much higher".


Overfitting example

(X_1, Y_1), . . . , (X_n, Y_n), (X, Y) are iid; X is a continuous random variable in R.

Suppose we use the degree-k polynomial expansion

      φ(x) = (1, x, x^2, . . . , x^k),   x ∈ R,

so the dimension is d = k + 1.

Fact: Any function on ≤ k + 1 points can be interpolated by a polynomial of degree at most k.

(Figure: n data points in [0, 1] × [−3, 3] interpolated exactly by such a polynomial.)

Conclusion: If n ≤ k + 1 = d, the ERM solution ŵ with this feature expansion has R̂(ŵ) = 0 always, regardless of its true risk (which can be ≫ 0).

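A sketch of this phenomenon: with d = n polynomial features, the ERM fit interpolates the training points (empirical risk numerically zero), while its risk on fresh data from the same distribution is far larger. The data-generating process below is an illustrative choice.

```python
# Sketch: overfitting with a degree-(n-1) polynomial expansion.  The ERM fit
# interpolates the n training points (empirical risk ~ 0), but its risk on
# fresh data from the same distribution is much larger.
import numpy as np

rng = np.random.default_rng(8)
n = 12                                   # training points; degree k = n - 1

def poly_features(x, k):
    return np.vander(x, N=k + 1, increasing=True)   # (1, x, ..., x^k)

def sample(m):
    x = rng.uniform(0, 1, size=m)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=m)
    return x, y

x_tr, y_tr = sample(n)
A = poly_features(x_tr, n - 1)
w = np.linalg.solve(A, y_tr)             # exact interpolation (square system)

x_te, y_te = sample(10000)
train_risk = np.mean((poly_features(x_tr, n - 1) @ w - y_tr) ** 2)
test_risk = np.mean((poly_features(x_te, n - 1) @ w - y_te) ** 2)
print(train_risk, test_risk)             # ~0 vs. much larger
```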

Estimating risk

IID model: (X_1, Y_1), . . . , (X_n, Y_n), (X'_1, Y'_1), . . . , (X'_m, Y'_m) ∼_iid P.

I Training data (X_1, Y_1), . . . , (X_n, Y_n) is used to learn f̂.

I Test data (X'_1, Y'_1), . . . , (X'_m, Y'_m) is used to estimate risk, via the test risk

      R̂_test(f̂) := (1/m) ∑_{i=1}^m (f̂(X'_i) − Y'_i)^2.

I Training data is independent of test data, so f̂ is independent of the test data.

I Let L_i := (f̂(X'_i) − Y'_i)^2 for each i = 1, . . . , m, so

      E[R̂_test(f̂) | f̂] = (1/m) ∑_{i=1}^m E[L_i | f̂] = R(f̂).

I Moreover, L_1, . . . , L_m are conditionally iid given f̂, and hence by the Law of Large Numbers,

      R̂_test(f̂) →_p R(f̂) as m → ∞.

I By the CLT, the rate of convergence is m^{−1/2}.

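A sketch of risk estimation with a held-out test set: the predictor is fit on the training split only, so the test risk is an unbiased estimate of its true risk (the data-generating process below is illustrative).

```python
# Sketch: estimating the risk of a learned predictor with a held-out test set.
# The predictor is fit on the training split only, so the test risk is an
# unbiased estimate of its true risk and converges at an m^{-1/2} rate.
import numpy as np

rng = np.random.default_rng(9)
d, n, m = 5, 200, 2000
w_star = rng.normal(size=d)

def sample(size):
    X = rng.normal(size=(size, d))
    y = X @ w_star + 0.5 * rng.normal(size=size)
    return X, y

X_tr, y_tr = sample(n)                       # training data: learn f_hat
X_te, y_te = sample(m)                       # test data: estimate risk
w_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

test_risk = np.mean((X_te @ w_hat - y_te) ** 2)
print("test risk estimate:", test_risk)      # ~ sigma^2 + excess risk
```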

Rates for risk minimization vs. rates for risk estimation

One may think that ERM "works" because, somehow, training risk is a good "plug-in" estimate of true risk.

I Sometimes this is partially true; we'll revisit this when we discuss generalization theory.
  Roughly speaking, under some assumptions, one can expect that

      |R̂(w) − R(w)| ≤ O(√(d/n))   for all w ∈ R^d.

However . . .

I By the CLT, we know the following holds for a fixed w:

      R̂(w) →_p R(w) at an n^{−1/2} rate.

  (Here, we ignore the dependence on d.)

I Yet, for the ERM solution ŵ,

      R(ŵ) →_p R(w⋆) at an n^{−1} rate.

  (Also ignoring the dependence on d.)

Implication: Selecting a good predictor can be "easier" than estimating how good predictors are!


Old Faithful example

I Linear regression model + affine expansion on “duration of last eruption”.

I Learn ŵ = (35.0929, 10.3258) from 136 past observations.

I Mean squared loss of ŵ on the next 136 observations is 35.9404.

(Recall: mean squared loss of the constant prediction µ = 70.7941 was 187.1894.)

[Figure: time until next eruption vs. duration of last eruption, showing the fitted linear model and the constant prediction.]

(Unfortunately, √35.9 > mean duration ≈ 3.5.)

48 / 94
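The fit above is just ordinary least squares on the affine expansion φ(x) = (1, x). A minimal numpy sketch, assuming the 136 past observations sit in one-dimensional arrays durations and waits (hypothetical names, not from the slides):

import numpy as np

def fit_affine(durations, waits):
    """Least-squares fit of waits ~ a + b * durations via the affine expansion (1, x)."""
    Phi = np.column_stack([np.ones_like(durations), durations])
    w, *_ = np.linalg.lstsq(Phi, waits, rcond=None)
    return w  # on the slide's training split this comes out near (35.09, 10.33)

def mean_squared_loss(w, durations, waits):
    """Mean squared loss of the affine predictor x -> w[0] + w[1] * x."""
    return float(np.mean((w[0] + w[1] * durations - waits) ** 2))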

11. ℓ1 regularization: the LASSO

Regularization with a different norm

Lasso: For a given λ ≥ 0, find a minimizer of

R̂(w) + λ‖w‖1

over w ∈ Rd. Here, ‖v‖1 = ∑_{i=1}^d |vi| is the ℓ1-norm.

I Prefers shorter w, but using a different notion of length than ridge.

I Tends to produce w that are sparse—i.e., have few non-zero coordinates—or at least well-approximated by sparse vectors.

Fact: Vectors with small ℓ1-norm are well-approximated by sparse vectors.

If w̃ contains just the 1/ε²-largest coefficients (by magnitude) of w, then

‖w − w̃‖2 ≤ ε‖w‖1.

49 / 94
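The Lasso objective is convex but not differentiable, so the normal-equations approach does not apply directly. One standard solver is proximal gradient descent (ISTA); what follows is a minimal numpy sketch of that method for the objective (1/n)‖Xw − y‖2² + λ‖w‖1, offered as an illustration rather than the course's prescribed algorithm.

import numpy as np

def lasso_ista(X, y, lam, iters=500):
    """Minimize (1/n) * ||Xw - y||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    w = np.zeros(d)
    # Step size 1/L, where L is the Lipschitz constant of the smooth part's gradient.
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    eta = 1.0 / L
    for _ in range(iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)      # gradient of the smooth data-fitting term
        z = w - eta * grad
        # Soft-thresholding: the proximal operator of eta * lam * ||.||_1.
        w = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)
    return w

The soft-thresholding step is what drives coordinates exactly to zero, which is where the sparsity comes from.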

Sparse approximations

Claim: If w̃ contains just the T-largest coefficients (by magnitude) of w, then

‖w − w̃‖2 ≤ ‖w‖1 / √(T + 1).

WLOG |w1| ≥ |w2| ≥ · · · , so w̃ = (w1, . . . , wT, 0, . . . , 0).

[Figure: sorted coefficient magnitudes plotted against index i.]

Then

‖w − w̃‖2² = ∑_{i≥T+1} wi²
          ≤ ∑_{i≥T+1} |wi| · |wT+1|
          ≤ ‖w‖1 · |wT+1|
          ≤ ‖w‖1 · ‖w‖1/(T + 1),

where the last step uses that each of the T + 1 largest coefficients has magnitude at least |wT+1|, so ‖w‖1 ≥ (T + 1)|wT+1|.

This is a consequence of the “mismatch” between ℓ1- and ℓ2-norms. Can get similar results for other ℓp norms.

50 / 94
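A quick numerical sanity check of the claim (a throwaway numpy sketch, not from the slides): keep the T largest-magnitude entries of a random vector and compare the two sides of the bound.

import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(200)
T = 10

# w_tilde keeps only the T largest-magnitude coefficients of w.
idx = np.argsort(-np.abs(w))[:T]
w_tilde = np.zeros_like(w)
w_tilde[idx] = w[idx]

lhs = np.linalg.norm(w - w_tilde)             # ||w - w~||_2
rhs = np.linalg.norm(w, 1) / np.sqrt(T + 1)   # ||w||_1 / sqrt(T + 1)
print(lhs <= rhs)                             # True, as the claim guarantees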

Example: Coefficient profile (ℓ2 vs. ℓ1)

Y = levels of prostate cancer antigen, X = clinical measurements.

Horizontal axis: varying λ (large λ to the left, small λ to the right).
Vertical axis: coefficient value in ℓ2-regularized ERM and ℓ1-regularized ERM, for eight different variables.

51 / 94

Other approaches to sparse regression

I Subset selection:

Find the w that minimizes empirical risk among all vectors with at most k non-zero entries.

Unfortunately, this seems to require time exponential in k.

I Greedy algorithms:

Repeatedly choose a new variable to “include” in the support of w until k variables are included.

Forward stepwise regression / orthogonal matching pursuit (a minimal sketch follows below).

Often works as well as ℓ1-regularized ERM.

Why do we care about sparsity?

52 / 94
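For concreteness, here is a short numpy sketch of orthogonal matching pursuit in the greedy spirit described above (my own illustration, not code from the course): each step adds the column most correlated with the current residual, then refits by least squares on the selected columns.

import numpy as np

def omp(X, y, k):
    """Greedy construction of a k-sparse w: orthogonal matching pursuit."""
    n, d = X.shape
    support, w = [], np.zeros(d)
    residual = y.astype(float).copy()
    for _ in range(k):
        # Pick the column most correlated with the residual.
        j = int(np.argmax(np.abs(X.T @ residual)))
        if j not in support:
            support.append(j)
        # Refit by least squares using only the selected columns.
        w_s, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        w = np.zeros(d)
        w[support] = w_s
        residual = y - X @ w
    return w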


12. Summary

Summary

I ERM for OLS
I ERM in general
I normal equations
I pseudoinverse soln
I ridge regression
I statistical view (say 1-2 things that should be remembered)

53 / 94

Inductive bias

Suppose the ERM solution is not unique. What should we do?

One possible answer: Pick the w of shortest length.

I Fact: The shortest solution w to (AᵀA)w = Aᵀb is always unique.

I Obtain this w via w = A⁺b, where A⁺ is the (Moore-Penrose) pseudoinverse of A.

Why should this be a good idea?

I The data does not give a reason to choose a shorter w over a longer w.

I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the “true” w⋆ is short), not for others.

All learning algorithms encode some kind of inductive bias.

54 / 94
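In numpy, both np.linalg.pinv and np.linalg.lstsq return exactly this minimum-norm solution; a small sketch on random data (illustration only):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))    # n < d: infinitely many empirical risk minimizers
b = rng.standard_normal(5)

w_pinv = np.linalg.pinv(A) @ b                    # Moore-Penrose pseudoinverse solution
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # also the least-norm least-squares solution
print(np.allclose(w_pinv, w_lstsq))               # True
print(np.allclose(A @ w_pinv, b))                 # exact fit here, since A has full row rank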

Example

ERM with scaled trigonometric feature expansion:

φ(x) = (1, sin(x), cos(x), (1/2)sin(2x), (1/2)cos(2x), (1/3)sin(3x), (1/3)cos(3x), . . . ).

[Figure: training data, an arbitrary ERM fit, and the least ℓ2 norm ERM fit of f(x) over x ∈ [0, 6].]

It is not a given that the least norm ERM is better than the other ERM!

55 / 94

Regularized ERM

Combine the two concerns: For a given λ ≥ 0, find a minimizer of

R̂(w) + λ‖w‖2²

over w ∈ Rd.

Fact: If λ > 0, then the solution is always unique (even if n < d)!

I This is called ridge regression. (λ = 0 is ERM / ordinary least squares.)

I The parameter λ controls how much attention is paid to the regularizer ‖w‖2² relative to the data-fitting term R̂(w).

I Choose λ using cross-validation.

56 / 94
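Setting the gradient of the regularized objective to zero gives ((1/n)XᵀX + λI)w = (1/n)Xᵀy, where X has rows x1ᵀ, . . . , xnᵀ, so ridge regression has a closed form. A minimal numpy sketch for this scaling convention:

import numpy as np

def ridge(X, y, lam):
    """Ridge regression for the objective (1/n) * ||Xw - y||^2 + lam * ||w||^2."""
    n, d = X.shape
    # Regularized normal equations: ((1/n) X^T X + lam * I) w = (1/n) X^T y.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

Note the 1/n in the empirical risk: other texts and libraries scale the regularizer differently, so a λ tuned for one convention does not transfer directly to another.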

Another interpretation of ridge regression

Define the (n + d) × d matrix A and the (n + d) × 1 column vector b by stacking the data on top of d “fake” rows:

A := (1/√n) [ x1ᵀ ; . . . ; xnᵀ ; √(nλ)·Id ],   b := (1/√n) (y1, . . . , yn, 0, . . . , 0)ᵀ,

i.e., the first n rows of A are the feature vectors and the last d rows form the diagonal matrix with √(nλ) on the diagonal. Then

‖Aw − b‖2² = R̂(w) + λ‖w‖2².

Interpretation:

I d “fake” data points; they ensure that the augmented data matrix A has rank d.

I The squared length of each “fake” feature vector is nλ. All corresponding labels are 0.

I The prediction of w on the i-th fake feature vector is √(nλ) wi.

57 / 94
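The equivalence is easy to check numerically; a throwaway numpy sketch on random data (illustration only) comparing ordinary least squares on the augmented (A, b) with the ridge normal equations:

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Augmented system: stack sqrt(n * lam) * I_d under X and d zero labels under y, scale by 1/sqrt(n).
A = np.vstack([X, np.sqrt(n * lam) * np.eye(d)]) / np.sqrt(n)
b = np.concatenate([y, np.zeros(d)]) / np.sqrt(n)

w_aug, *_ = np.linalg.lstsq(A, b, rcond=None)                           # OLS on the augmented data
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)   # ridge normal equations
print(np.allclose(w_aug, w_ridge))                                      # True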

Enhancing linear regression models

Linear functions might sound rather restricted, but actually they can be quite powerful if you are creative about side-information.

Examples:

1. Non-linear transformations of existing variables: for x ∈ R,

φ(x) = ln(1 + x).

2. Logical formula of binary variables: for x = (x1, . . . , xd) ∈ {0, 1}d,

φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7).

3. Trigonometric expansion: for x ∈ R,

φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).

4. Polynomial expansion: for x = (x1, . . . , xd) ∈ Rd,

φ(x) = (1, x1, . . . , xd, x1², . . . , xd², x1x2, . . . , x1xd, . . . , xd−1xd).

59 / 94
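Feature expansions like item 3 are straightforward to build and then feed into ordinary least squares; a minimal numpy sketch (the function name and the cutoff K are my own choices):

import numpy as np

def trig_features(x, K):
    """Trigonometric expansion of a 1-D array x: columns (1, sin x, cos x, ..., sin Kx, cos Kx)."""
    x = np.atleast_1d(x).astype(float)
    cols = [np.ones_like(x)]
    for k in range(1, K + 1):
        cols.append(np.sin(k * x))
        cols.append(np.cos(k * x))
    return np.stack(cols, axis=1)   # shape (len(x), 2K + 1)

# ERM on the expanded features (x_train and y_train are hypothetical arrays):
# Phi = trig_features(x_train, K=3)
# w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)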

Example: Taking advantage of linearity

Suppose you are trying to predict some health outcome.

I A physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:

φ(x) = (xtemp − 98.6)².

I What if you didn’t know about this magic constant 98.6?

I Instead, use

φ(x) = (1, xtemp, xtemp²).

Can learn coefficients w such that

wᵀφ(x) = (xtemp − 98.6)²   (namely w = (98.6², −2·98.6, 1)),

or any other quadratic polynomial in xtemp (which may be better!).

60 / 94

Quadratic expansion

A quadratic function f : R → R,

f(x) = ax² + bx + c, x ∈ R,

for a, b, c ∈ R, can be written as a linear function of φ(x), where

φ(x) := (1, x, x²),

since f(x) = wᵀφ(x) where w = (c, b, a).

For a multivariate quadratic function f : Rd → R, use

φ(x) := (1, x1, . . . , xd, x1², . . . , xd², x1x2, . . . , x1xd, . . . , xd−1xd),

i.e., linear terms, squared terms, and cross terms.

61 / 94
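The multivariate expansion can be built explicitly; a small numpy sketch (the function name is mine) producing the linear, squared, and cross terms for one input vector:

import numpy as np
from itertools import combinations

def quadratic_features(x):
    """Quadratic expansion of x in R^d: (1, x_1..x_d, x_1^2..x_d^2, and x_i * x_j for i < j)."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], x, x ** 2, cross))

print(quadratic_features([2.0, 3.0]))   # [1. 2. 3. 4. 9. 6.]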

Affine expansion and “Old Faithful”

Woodward needed an affine expansion for the “Old Faithful” data:

φ(x) := (1, x).

[Figure: time until next eruption vs. duration of last eruption, with the fitted affine function.]

An affine function fa,b : R → R for a, b ∈ R,

fa,b(x) = a + bx,

is a linear function fw of φ(x) for w = (a, b).

(This easily generalizes to multivariate affine functions.)

62 / 94

Why linear regression models?

1. Linear regression models benefit from a good choice of features.

2. The structure of linear functions is very well-understood.

3. There are many well-understood and efficient algorithms for learning linear functions from data, even when n and d are large.

63 / 94


13. From data to prediction functions

Maximum likelihood estimation for linear regression

Linear regression model with Gaussian noise: (X1, Y1), . . . , (Xn, Yn), (X, Y) are iid, with

Y | X = x ∼ N(xᵀw, σ²), x ∈ Rd.

(It is traditional to study linear regression in the context of this model.)

Log-likelihood of (w, σ²), given data (Xi, Yi) = (xi, yi) for i = 1, . . . , n:

∑_{i=1}^n [ −(1/(2σ²))(xiᵀw − yi)² + (1/2) ln(1/(2πσ²)) ] + (terms not involving (w, σ²)).

The w that maximizes the log-likelihood is also the w that minimizes

(1/n) ∑_{i=1}^n (xiᵀw − yi)².

This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .

64 / 94

Empirical distribution and empirical risk

The empirical distribution Pn on (x1, y1), . . . , (xn, yn) has probability mass function pn given by

pn((x, y)) := (1/n) ∑_{i=1}^n 1{(x, y) = (xi, yi)}, (x, y) ∈ Rd × R.

Plug-in principle: The goal is to find a function f that minimizes the (squared loss) risk

R(f) = E[(f(X) − Y)²].

But we don’t know the distribution P of (X, Y).

Replace P with Pn → empirical (squared loss) risk R̂(f):

R̂(f) := (1/n) ∑_{i=1}^n (f(xi) − yi)².

65 / 94
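The empirical risk is just an average over the training points; a minimal numpy helper (the name is mine):

import numpy as np

def empirical_risk(predict, xs, ys):
    """Empirical squared-loss risk: (1/n) * sum_i (predict(x_i) - y_i)^2."""
    preds = np.array([predict(x) for x in xs])
    return float(np.mean((preds - np.asarray(ys)) ** 2))

# e.g., for a linear predictor with weights w (xs, ys are hypothetical arrays):
# empirical_risk(lambda x: w @ x, xs, ys)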

Empirical risk minimization

Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.

For linear functions and squared loss: ERM returns

ŵ ∈ arg min_{w ∈ Rd} R̂(w),

which coincides with the MLE under the basic linear regression model.

In general:

I MLE makes sense in the context of the statistical model for which it is derived.

I ERM makes sense in the context of the general iid model for supervised learning.

66 / 94

Empirical risk minimization in pictures

Red dots: data points.

Affine hyperplane: linear function w (via the affine expansion (x1, x2) ↦ (1, x1, x2)).

ERM: minimize the sum of squared vertical lengths from the hyperplane to the points.

67 / 94

Empirical risk minimization in matrix notation

Define the n × d matrix A and the n × 1 column vector b by

A := (1/√n) [ x1ᵀ ; . . . ; xnᵀ ],   b := (1/√n) (y1, . . . , yn)ᵀ.

Can write the empirical risk as

R̂(w) = ‖Aw − b‖2².

Necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to

(AᵀA)w = Aᵀb,

a system of linear equations called the normal equations.

It can be proved that every critical point of R̂ is a minimizer of R̂:

68 / 94
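In code one rarely forms (AᵀA)⁻¹ explicitly; solving the normal equations or calling a least-squares routine gives the same minimizer. A throwaway numpy sketch on random data (illustration only):

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

A, b = X / np.sqrt(n), y / np.sqrt(n)
w_normal = np.linalg.solve(A.T @ A, A.T @ b)      # solve the normal equations (A has full column rank here)
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # preferred in practice; also handles rank-deficient A
print(np.allclose(w_normal, w_lstsq))             # True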

Page 256: Linear regression - University Of Illinoismjt.cs.illinois.edu/courses/ml-s19/files/slides-linear_regression.pdf · Lectures 3-4: linear regression 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Aside: Convexity

Let f : Rd → R be a differentiable function.

Suppose we find x ∈ Rd such that ∇f(x) = 0. Is x a minimizer of f?

Yes, if f is a convex function:

    f((1 − t)x + tx′) ≤ (1 − t)f(x) + tf(x′),

for any 0 ≤ t ≤ 1 and any x, x′ ∈ R^d.


Convexity of empirical risk

Checking convexity of g(x) = ‖Ax − b‖₂²:

g((1 − t)x + tx′)
  = ‖(1 − t)(Ax − b) + t(Ax′ − b)‖₂²
  = (1 − t)²‖Ax − b‖₂² + t²‖Ax′ − b‖₂² + 2(1 − t)t (Ax − b)ᵀ(Ax′ − b)
  = (1 − t)‖Ax − b‖₂² + t‖Ax′ − b‖₂²
      − (1 − t)t [ ‖Ax − b‖₂² + ‖Ax′ − b‖₂² ] + 2(1 − t)t (Ax − b)ᵀ(Ax′ − b)
  ≤ (1 − t)‖Ax − b‖₂² + t‖Ax′ − b‖₂²,

where the last step uses the Cauchy-Schwarz inequality and the arithmetic mean/geometric mean (AM/GM) inequality.
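To make the inequality concrete, here is a small numerical check (my addition, with made-up A and b): it verifies g((1 − t)x + tx′) ≤ (1 − t)g(x) + t g(x′) on random points.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))     # made-up A
b = rng.normal(size=5)          # made-up b
g = lambda x: np.sum((A @ x - b) ** 2)

for _ in range(1000):
    x, xp = rng.normal(size=3), rng.normal(size=3)
    t = rng.uniform()
    # convexity: the function lies below the chord between x and x'
    assert g((1 - t) * x + t * xp) <= (1 - t) * g(x) + t * g(xp) + 1e-9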


Convexity of empirical risk, another way

Preview of convex analysis

Recall R̂(w) = (1/n) Σᵢ₌₁ⁿ (xᵢᵀw − yᵢ)².

I Scalar function g(z) = cz² is convex for any c ≥ 0.

I Composition (g ◦ a) : R^d → R of any convex function g : R → R and any affine function a : R^d → R is convex.

I Therefore, the function w ↦ (1/n)(xᵢᵀw − yᵢ)² is convex.

I Sum of convex functions is convex.

I Therefore R̂ is convex.

Convexity is a useful mathematical property to understand!
(We'll study more convex analysis in a few weeks.)


Algorithm for ERM

Algorithm for ERM with linear functions and squared loss†

input: Data (x1, y1), . . . , (xn, yn) from R^d × R.
output: Linear function ŵ ∈ R^d.

1: Find a solution ŵ to the normal equations defined by the data
   (using, e.g., Gaussian elimination).
2: return ŵ.

†Also called “ordinary least squares” in this context.

Running time (dominated by Gaussian elimination): O(nd²).
Note: there are many approximate solvers that run in nearly linear time!
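As one illustration of the iterative alternatives alluded to in that note (my choice of library, not the slides'), scipy's LSQR solves the least-squares problem using only matrix-vector products with X and Xᵀ; a rough sketch:

import numpy as np
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
n, d = 10_000, 50
X = rng.normal(size=(n, d))                       # hypothetical design matrix
y = X @ rng.normal(size=d) + rng.normal(size=n)   # hypothetical labels

# Direct route: form X^T X in O(n d^2) time, then solve a d x d system.
w_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Iterative route (LSQR): never forms X^T X; each iteration only needs
# matrix-vector products with X and X^T, i.e. O(nd) work.
w_iter = lsqr(X, y, atol=1e-12, btol=1e-12)[0]

assert np.allclose(w_direct, w_iter, atol=1e-5)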


Geometric interpretation of least squares ERM

Let aⱼ ∈ R^n be the j-th column of the matrix A ∈ R^{n×d}, so

    A = [ a1 · · · ad ].

Minimizing ‖Aw − b‖₂² is the same as finding the vector b̂ ∈ range(A) closest to b.

The solution b̂ is the orthogonal projection of b onto range(A) = {Aw : w ∈ R^d}.

[Figure: the vector b and its orthogonal projection b̂ onto the plane spanned by a1 and a2.]

I b̂ is uniquely determined.

I If rank(A) < d, then there is more than one way to write b̂ as a linear combination of a1, . . . , ad.

If rank(A) < d, then the ERM solution is not unique.

To get ŵ from b̂: solve the system of linear equations Aw = b̂.
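A small numpy sketch of this picture (my addition): the projection b̂ = AA⁺b is unique even when A is rank-deficient, while the weight vector that produces it is not.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))
A = np.column_stack([A, A[:, 0] + A[:, 1]])   # 4th column is dependent, so rank(A) < d
b = rng.normal(size=6)

P = A @ np.linalg.pinv(A)    # orthogonal projector onto range(A)
b_hat = P @ b                # the unique projection of b

w1 = np.linalg.pinv(A) @ b                    # one solution (the least-norm one)
w2 = w1 + np.array([1.0, 1.0, 0.0, -1.0])     # another: that direction lies in the null space of A
assert np.allclose(A @ w1, b_hat) and np.allclose(A @ w2, b_hat)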


Statistical interpretation of ERM

Let (X, Y) ∼ P, where P is some distribution on R^d × R.
Which w have the smallest risk R(w) = E[(Xᵀw − Y)²]?

Necessary condition for w to be a minimizer of R:

    ∇R(w) = 0, i.e., w is a critical point of R.

This translates to

    E[XXᵀ] w = E[Y X],

a system of linear equations called the population normal equations.

It can be proved that every critical point of R is a minimizer of R.

Looks familiar?

If (X1, Y1), . . . , (Xn, Yn), (X, Y) are iid, then

    E[AᵀA] = E[XXᵀ] and E[Aᵀb] = E[Y X],

so ERM can be regarded as a plug-in estimator for a minimizer of R.
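A quick simulation (my addition, using a made-up distribution P) of the plug-in view: as n grows, the ERM solution computed from the sample approaches the population minimizer w⋆.

import numpy as np

rng = np.random.default_rng(2)
d = 3
w_star = np.array([1.0, -2.0, 0.5])    # hypothetical population minimizer
sigma = 0.3

def erm(n):
    X = rng.normal(size=(n, d))                      # E[X X^T] = I here
    y = X @ w_star + sigma * rng.normal(size=n)      # so E[Y X] = w_star
    return np.linalg.lstsq(X, y, rcond=None)[0]

for n in [50, 500, 5000]:
    print(n, np.linalg.norm(erm(n) - w_star))        # shrinks roughly like 1/sqrt(n)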


14. Risk, empirical risk, and estimating risk


Risk of ERM

IID model: (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, taking values in Rd × R.

Let w⋆ be a minimizer of R over all w ∈ R^d, i.e., w⋆ satisfies the population normal equations

    E[XXᵀ] w⋆ = E[Y X].

I If the ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w⋆).

I What about when the ERM solution is unique?

Theorem. Under mild assumptions on the distribution of X,

    R(ŵ) − R(w⋆) = O( tr(cov(εW)) / n )

"asymptotically", where W := E[XXᵀ]^(−1/2) X and ε := Y − Xᵀw⋆.


Risk of ERM analysis (rough sketch)

Let εᵢ := Yᵢ − Xᵢᵀw⋆ for each i = 1, . . . , n, so

    E[εᵢXᵢ] = E[YᵢXᵢ] − E[XᵢXᵢᵀ] w⋆ = 0

and

    √n (ŵ − w⋆) = ( (1/n) Σᵢ₌₁ⁿ XᵢXᵢᵀ )⁻¹ (1/√n) Σᵢ₌₁ⁿ εᵢXᵢ.

1. By the LLN: (1/n) Σᵢ₌₁ⁿ XᵢXᵢᵀ →p E[XXᵀ].

2. By the CLT: (1/√n) Σᵢ₌₁ⁿ εᵢXᵢ →d cov(εX)^(1/2) Z, where Z ∼ N(0, I).

Therefore, the asymptotic distribution of √n (ŵ − w⋆) is

    √n (ŵ − w⋆) →d E[XXᵀ]⁻¹ cov(εX)^(1/2) Z.

A few more steps gives

    n ( E[(Xᵀŵ − Y)²] − E[(Xᵀw⋆ − Y)²] ) →d ‖ E[XXᵀ]^(−1/2) cov(εX)^(1/2) Z ‖₂².

The random variable on the RHS is "concentrated" around its mean tr(cov(εW)).


Risk of ERM: postscript

I Analysis does not assume that the linear regression model is "correct"; the data distribution need not be from the normal linear regression model.

I Only assumptions are those needed for the LLN and CLT to hold.

I However, if the normal linear regression model holds, i.e.,

    Y | X = x ∼ N(xᵀw⋆, σ²),

then the bound from the theorem becomes

    R(ŵ) − R(w⋆) = O(σ²d / n),

which is familiar to those who have taken introductory statistics.

I With more work, can also prove a non-asymptotic risk bound of similar form.

I In homework/reading, we look at a simpler setting for studying ERM for linear regression, called "fixed design".


Risk vs empirical risk

Let ŵ be the ERM solution.

1. Empirical risk of ERM: R̂(ŵ)

2. True risk of ERM: R(ŵ)

Theorem.  E[ R̂(ŵ) ] ≤ E[ R(ŵ) ].

(Empirical risk can sometimes be larger than true risk, but not on average.)

Overfitting: empirical risk is “small”, but true risk is “much higher”.
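A small simulation (my addition, synthetic Gaussian data where the true risk has a closed form) illustrating the theorem: averaged over many training sets, the training risk of ŵ sits below its true risk.

import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 20, 10, 1.0
w_star = rng.normal(size=d)

train_risks, true_risks = [], []
for _ in range(500):
    X = rng.normal(size=(n, d))
    y = X @ w_star + sigma * rng.normal(size=n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    train_risks.append(np.mean((X @ w_hat - y) ** 2))
    # for X ~ N(0, I): true risk = ||w_hat - w_star||^2 + sigma^2
    true_risks.append(np.sum((w_hat - w_star) ** 2) + sigma**2)

print(np.mean(train_risks), "<=", np.mean(true_risks))   # roughly 0.5 vs 2 here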


Overfitting example

(X1, Y1), . . . , (Xn, Yn), (X, Y) are iid; X is a continuous random variable in R.

Suppose we use the degree-k polynomial expansion

    φ(x) = (1, x, x², . . . , xᵏ),   x ∈ R,

so the dimension is d = k + 1.

Fact: Any function on ≤ k + 1 points can be interpolated by a polynomial of degree at most k.

[Figure: a high-degree polynomial passing exactly through every training point (x vs. y).]

Conclusion: If n ≤ k + 1 = d, the ERM solution ŵ with this feature expansion has R̂(ŵ) = 0 always, regardless of its true risk (which can be ≫ 0).
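A compact numpy illustration of this conclusion (my addition, with a made-up target function): with n = k + 1 training points, the fitted polynomial drives the training risk to (numerically) zero while the test risk stays large.

import numpy as np

rng = np.random.default_rng(4)
k = 9                     # polynomial degree, so d = k + 1 parameters
n = k + 1                 # exactly as many training points as parameters
phi = lambda x: np.vander(x, k + 1, increasing=True)   # (1, x, ..., x^k)

f = lambda x: np.sin(2 * np.pi * x)                    # made-up target function
x_tr = rng.uniform(size=n)
y_tr = f(x_tr) + 0.1 * rng.normal(size=n)
x_te = rng.uniform(size=1000)
y_te = f(x_te) + 0.1 * rng.normal(size=1000)

w = np.linalg.lstsq(phi(x_tr), y_tr, rcond=None)[0]
print("train risk:", np.mean((phi(x_tr) @ w - y_tr) ** 2))   # ~0: the fit interpolates
print("test risk: ", np.mean((phi(x_te) @ w - y_te) ** 2))   # typically much larger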


Estimating risk

IID model: (X1, Y1), . . . , (Xn, Yn), (X̃1, Ỹ1), . . . , (X̃m, Ỹm) ∼iid P.

I training data (X1, Y1), . . . , (Xn, Yn) used to learn f̂.

I test data (X̃1, Ỹ1), . . . , (X̃m, Ỹm) used to estimate risk, via the test risk

    R̂test(f̂) := (1/m) Σᵢ₌₁ᵐ (f̂(X̃ᵢ) − Ỹᵢ)².

I Training data is independent of test data, so f̂ is independent of test data.

I Let Lᵢ := (f̂(X̃ᵢ) − Ỹᵢ)² for each i = 1, . . . , m, so

    E[ R̂test(f̂) | f̂ ] = (1/m) Σᵢ₌₁ᵐ E[ Lᵢ | f̂ ] = R(f̂).

I Moreover, L1, . . . , Lm are conditionally iid given f̂, and hence by the Law of Large Numbers,

    R̂test(f̂) →p R(f̂) as m → ∞.

I By the CLT, the rate of convergence is m^(−1/2).
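A minimal sketch of this protocol (my addition, synthetic data): fit on the training sample only, then report the held-out average squared loss as the estimate of R(f̂).

import numpy as np

rng = np.random.default_rng(5)
n, m, d = 200, 200, 5
w_star = rng.normal(size=d)

def draw(size):
    X = rng.normal(size=(size, d))
    return X, X @ w_star + rng.normal(size=size)     # noise variance 1

X_tr, y_tr = draw(n)    # training data: used only to learn f_hat
X_te, y_te = draw(m)    # test data: used only to estimate the risk

w_hat = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]
test_risk = np.mean((X_te @ w_hat - y_te) ** 2)      # estimates R(f_hat); here close to 1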


Rates for risk minimization vs. rates for risk estimation

One may think that ERM "works" because, somehow, training risk is a good "plug-in" estimate of true risk.

I Sometimes this is partially true—we'll revisit this when we discuss generalization theory.

Roughly speaking, under some assumptions, can expect that

    |R̂(w) − R(w)| ≤ O( √(d/n) )  for all w ∈ R^d.

However . . .

I By the CLT, we know the following holds for a fixed w:

    R̂(w) →p R(w) at an n^(−1/2) rate.

(Here, we ignore the dependence on d.)

I Yet, for the ERM solution ŵ,

    R(ŵ) →p R(w⋆) at an n^(−1) rate.

(Also ignoring dependence on d.)

Implication: Selecting a good predictor can be "easier" than estimating how good predictors are!


Old Faithful example

I Linear regression model + affine expansion on “duration of last eruption”.

I Learn ŵ = (35.0929, 10.3258) from 136 past observations.

I Mean squared loss of ŵ on the next 136 observations is 35.9404.

(Recall: mean squared loss of µ = 70.7941 was 187.1894.)

[Figure: time until next eruption vs. duration of last eruption, with the fitted linear model and the constant prediction.]

(Unfortunately, √35.9 ≈ 6.0 > mean duration ≈ 3.5.)
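For reference, fitting and evaluating such an affine model would look roughly like the sketch below (my addition; durations_train, delays_train, etc. are hypothetical arrays standing in for the 136 past and 136 next observations, not the actual dataset).

import numpy as np

def fit_affine(durations, delays):
    # least-squares fit of: delay ~ w0 + w1 * duration
    A = np.column_stack([np.ones_like(durations), durations])
    return np.linalg.lstsq(A, delays, rcond=None)[0]

def mean_squared_loss(w, durations, delays):
    A = np.column_stack([np.ones_like(durations), durations])
    return np.mean((A @ w - delays) ** 2)

# hypothetical usage (the arrays below are placeholders, not the real data):
# w_hat = fit_affine(durations_train, delays_train)
# test_loss = mean_squared_loss(w_hat, durations_test, delays_test)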


15. Regularization


Inductive bias

Suppose ERM solution is not unique. What should we do?

One possible answer: Pick the w of shortest length.

I Fact: The shortest solution w to (AᵀA)w = Aᵀb is always unique.

I Fact: the OLS solution A⁺b (via the pseudoinverse A⁺) is the least norm solution.

Why should this be a good idea?

I Data does not give reason to choose a shorter w over a longer w.

I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the "true" w⋆ is short), not for others.

All learning algorithms encode some kind of inductive bias.
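A short numpy illustration (my addition): when A is rank-deficient, np.linalg.pinv(A) @ b picks out the least-norm solution among the many minimizers.

import numpy as np

rng = np.random.default_rng(6)
n, d = 5, 8                 # n < d, so the ERM solution is not unique
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w_min_norm = np.linalg.pinv(A) @ b                 # least-norm minimizer of ||Aw - b||^2
w_other = w_min_norm + np.linalg.svd(A)[2][-1]     # add a null-space direction of A

assert np.allclose(A @ w_min_norm, A @ w_other)                 # identical fit / empirical risk
assert np.linalg.norm(w_min_norm) < np.linalg.norm(w_other)     # but strictly shorter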


Example

ERM with scaled trigonometric feature expansion:

φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), . . .).

(Figure: training data, an arbitrary ERM fit, and the least ℓ2 norm ERM fit; x ranges over [0, 6], f(x) over [−2.5, 2.5].)

It is not a given that the least norm ERM is better than the other ERM!

84 / 94


Regularized ERM

Combine the two concerns: For a given λ ≥ 0, find minimizer of

R(w) + λ‖w‖₂²

over w ∈ Rd.

Fact: If λ > 0, then the solution is always unique (even if n < d)!

I This is called ridge regression.

(λ = 0 is ERM / Ordinary Least Squares.)

I Parameter λ controls how much attention is paid to the regularizer ‖w‖₂² relative to the data-fitting term R(w).

I Choose λ using cross-validation.

85 / 94
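A short numpy sketch of ridge regression in closed form, assuming R(w) is the average squared error (1/n)∑(wᵀxᵢ − yᵢ)²; under that convention the minimizer of R(w) + λ‖w‖₂² solves (XᵀX + nλI)w = Xᵀy. The data is random and only for illustration.

import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n)||Xw - y||^2 + lam * ||w||^2 in closed form.

    Assumes R(w) is the *average* squared error; with that convention the
    minimizer solves (X^T X + n*lam*I) w = X^T y, which is unique for lam > 0.
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # n < d: plain ERM is not unique
y = rng.normal(size=5)

w_ridge = ridge_fit(X, y, lam=0.1)   # unique solution even though n < d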


Another interpretation of ridge regression

Define the (n + d) × d matrix A and the (n + d) × 1 column vector b by

A := (1/√n) · [ x₁ᵀ ; . . . ; xₙᵀ ; √(nλ)·I_d ],        b := (1/√n) · (y₁, . . . , yₙ, 0, . . . , 0)ᵀ,

where the bottom d rows of A form the diagonal matrix √(nλ)·I_d. Then

‖Aw − b‖₂² = R(w) + λ‖w‖₂².

Interpretation:

I d “fake” data points; they ensure that the augmented data matrix A has rank d.

I Squared length of each “fake” feature vector is nλ.

All corresponding labels are 0.

I Prediction of w on the i-th fake feature vector is √(nλ)·wᵢ.

86 / 94
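A quick numeric check of this identity (again assuming R(w) is the average squared error), with random data; the sizes and λ below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 6, 3, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

# Augmented system: d "fake" rows sqrt(n*lam)*e_i with label 0, all scaled by 1/sqrt(n).
A = np.vstack([X, np.sqrt(n * lam) * np.eye(d)]) / np.sqrt(n)
b = np.concatenate([y, np.zeros(d)]) / np.sqrt(n)

lhs = np.linalg.norm(A @ w - b) ** 2
rhs = np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)   # R(w) + lam * ||w||^2
print(np.isclose(lhs, rhs))                              # True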


Regularization with a different norm

Lasso: For a given λ ≥ 0, find minimizer of

R(w) + λ‖w‖₁

over w ∈ Rd. Here, ‖v‖₁ = ∑_{i=1}^{d} |vᵢ| is the ℓ1-norm.

I Prefers shorter w, but using a different notion of length than ridge.

I Tends to produce w that are sparse, i.e., have few non-zero coordinates, or at least well-approximated by sparse vectors.

Fact: Vectors with small ℓ1-norm are well-approximated by sparse vectors.

If w̃ contains just the 1/ε²-largest coefficients (by magnitude) of w, then

‖w − w̃‖₂ ≤ ε‖w‖₁.

87 / 94
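A hedged sketch using scikit-learn's Lasso, whose objective (1/(2n))‖y − Xw‖₂² + α‖w‖₁ matches R(w) + λ‖w‖₁ up to a factor-of-two rescaling of λ when R(w) is the average squared error. The sparse “true” coefficients below are made up.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]            # sparse "true" coefficients (made up)
y = X @ w_true + 0.1 * rng.normal(size=n)

# sklearn's Lasso minimizes (1/(2n))||y - Xw||^2 + alpha*||w||_1,
# i.e. the slide's R(w) + lambda*||w||_1 up to a factor-of-two rescaling of lambda.
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print(np.nonzero(lasso.coef_)[0])        # typically only a few non-zero coordinates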


Sparse approximations

Claim: If w̃ contains just the T largest coefficients (by magnitude) of w, then

‖w − w̃‖₂ ≤ ‖w‖₁ / √(T + 1).

WLOG |w₁| ≥ |w₂| ≥ · · ·, so w̃ = (w₁, . . . , w_T, 0, . . . , 0).

‖w − w̃‖₂² = ∑_{i≥T+1} wᵢ²
           ≤ ∑_{i≥T+1} |wᵢ| · |w_{T+1}|
           ≤ ‖w‖₁ · |w_{T+1}|
           ≤ ‖w‖₁ · ‖w‖₁ / (T + 1),

where the last step uses that each of the T + 1 largest coefficients has magnitude at least |w_{T+1}|, so (T + 1)|w_{T+1}| ≤ ‖w‖₁. Taking square roots gives the claim.

This is a consequence of the “mismatch” between the ℓ1- and ℓ2-norms.
Can get similar results for other ℓp norms.

88 / 94
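A small numpy check of this claim on a random heavy-tailed vector (the choice of distribution and of T is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_cauchy(size=1000)       # heavy-tailed, so a few large coefficients

T = 10
idx = np.argsort(np.abs(w))[::-1]        # coefficients sorted by decreasing magnitude
w_top = np.zeros_like(w)
w_top[idx[:T]] = w[idx[:T]]              # keep only the T largest (in magnitude)

lhs = np.linalg.norm(w - w_top)                      # ||w - w~||_2
rhs = np.linalg.norm(w, 1) / np.sqrt(T + 1)          # ||w||_1 / sqrt(T + 1)
print(lhs <= rhs)                                    # True, per the claim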


Example: Coefficient profile (ℓ2 vs. ℓ1)

Y = levels of prostate cancer antigen, X = clinical measurements.

Horizontal axis: varying λ (large λ to the left, small λ to the right).
Vertical axis: coefficient value in ℓ2-regularized ERM and ℓ1-regularized ERM, for eight different variables.

89 / 94
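A sketch of how such coefficient profiles can be computed, using scikit-learn's Ridge and Lasso on synthetic stand-in data (the prostate data itself is not reproduced here); scikit-learn's α plays the role of λ only up to the scaling conventions noted earlier.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 8))             # stand-in for the 8 clinical measurements
y = X @ rng.normal(size=8) + 0.5 * rng.normal(size=80)

lambdas = np.logspace(2, -3, 30)         # large lambda (left) to small lambda (right)
ridge_path = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])
lasso_path = np.array([Lasso(alpha=lam, max_iter=10000).fit(X, y).coef_ for lam in lambdas])

# ridge_path / lasso_path have shape (30, 8): one coefficient profile per variable.
# Lasso columns hit exactly zero for large lambda; ridge columns only shrink toward zero.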


Other approaches to sparse regression

I Subset selection:

Find w that minimizes the empirical risk among all vectors with at most k non-zero entries.

Unfortunately, this seems to require time exponential in k.

I Greedy algorithms:

Repeatedly choose a new variable to “include” in the support of w until k variables are included.

Forward stepwise regression / Orthogonal matching pursuit.

Often works as well as ℓ1-regularized ERM.

Why do we care about sparsity?

90 / 94
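A sketch of the greedy approach using scikit-learn's OrthogonalMatchingPursuit, which adds one variable at a time until k variables are in the support; the synthetic sparse problem below is made up for illustration.

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n, d, k = 100, 50, 3
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[rng.choice(d, size=k, replace=False)] = rng.normal(size=k)   # k-sparse target
y = X @ w_true + 0.01 * rng.normal(size=n)

# Greedily add one variable at a time until k variables are in the support.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(X, y)
print(np.nonzero(omp.coef_)[0], np.nonzero(w_true)[0])   # supports typically agree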


Key takeaways

1. IID model for supervised learning.

2. Optimal predictors, linear regression models, and optimal linear predictors.

3. Empirical risk minimization for linear predictors.

4. Risk of ERM; training risk vs. test risk; risk minimization vs. risk estimation.

5. Inductive bias, ℓ1- and ℓ2-regularization, sparsity.

Make sure you do the assigned reading, especially from the handouts!

91 / 94


misc

svd
pytorch/numpy; gpu; gpu errors. maybe even sgd. they'll use it in homework.
talk about regression and classification somewhere early on. can mention how to do it for dt and knn too i guess, though it's a little gross in this lecture?
before MLE slide, give a quick one-slide refresher/primer on MLE.
ridge and soln existence. for homework maybe prove λ → 0 gives svd?
daniel's 1/n. talk about loss functions
look at my old lec
svd topics: not unique; pseudoinverse equal inverse always; pseudoinverse always unique(?) or at least when inverse exists? talk about things it satisfies like XX⁺X = X etc; “meaning” of the U, V matrices in svd; introduce svd via eigendecomposition

92 / 94


misc

logistic regression: optimize w ↦ (1/n) ∑_{i=1}^{n} ln(1 + exp(−yᵢ wᵀxᵢ)).

SVD solution for ols:
- write ‖Xw − y‖₂².
- normal equations (differentiate and set to zero): XᵀXw = Xᵀy.
- writing X = USVᵀ, have V S²Vᵀ w = V S Uᵀ y.
- thus the pseudoinverse solution X⁺y = V S⁺ Uᵀ y satisfies the normal equations.
for homework maybe also suggest experiment with ridge regression (adding λ‖w‖²/2).
for pytorch solver, can have them manually do gradient, and also use pytorch's .backward; see the sample code for lecture 1 (in the repository, not in the slides).
features: replace xᵢ with φ(xᵢ) where φ is some function. E.g., φ(x) = (1, x₁, . . . , x_d, x₁x₁, x₁x₂, . . . , x₁x_d, . . . , x_d x_d) means wᵀφ(x) is a quadratic (and now we can search over all possible quadratics with our optimization).

93 / 94
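A small numpy sketch of the SVD route to the least-norm OLS solution described above (X⁺y = V S⁺ Uᵀ y); the tolerance for treating singular values as zero is an arbitrary choice.

import numpy as np

def ols_via_svd(X, y):
    # Least-norm OLS solution X^+ y = V S^+ U^T y, following the note above.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_pinv = np.zeros_like(s)
    nonzero = s > 1e-10 * s.max()          # arbitrary tolerance (assumption)
    s_pinv[nonzero] = 1.0 / s[nonzero]     # pseudo-invert the nonzero singular values
    return Vt.T @ (s_pinv * (U.T @ y))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
y = rng.normal(size=10)
w = ols_via_svd(X, y)

# Check the normal equations X^T X w = X^T y from the note:
print(np.allclose(X.T @ X @ w, X.T @ y))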


16. Summary of linear regression so far


Main points

I Model/function/predictor class of linear regressors x ↦ wᵀx.

I ERM principle: we choose a loss (least squares) and find a good predictor by minimizing the empirical risk.

I ERM solution for least squares: pick w satisfying AᵀAw = Aᵀb, which need not be unique; a canonical choice is the ordinary least squares solution A⁺b (the least-norm solution).

I We also discussed feature expansion; affine and polynomial expansions are good to keep in mind!

94 / 94