Linear regression
CS 446
1. Overview
todo
check some continuity bugs
make sure nothing missing from old lectures (both mine and Daniel's)
fix some of those bugs, like b replacing y
delete the excess material from end
add proper summary slide which boils down concepts and reduces student worry
1 / 94
Lecture 1: supervised learning
Training data: labeled examples
(x1, y1), (x2, y2), . . . , (xn, yn)
where
I each input xi is a machine-readable description of an instance (e.g., image, sentence), and
I each corresponding label yi is an annotation relevant to the task, typically not easy to obtain automatically.
Goal: learn a function f from labeled examples that accurately “predicts” the labels of new (previously unseen) inputs.
[Diagram: past labeled examples feed a learning algorithm, which outputs a learned predictor; the learned predictor maps a new (unlabeled) example to a predicted label.]
2 / 94
Lecture 2: nearest neighbors and decision trees
[Scatter plot of labeled points in the (x1, x2) plane.]
Nearest neighbors. Training/fitting: memorize data. Testing/predicting: find k closest memorized points, return plurality label. Overfitting? Vary k.
Decision trees. Training/fitting: greedily partition space, reducing “uncertainty”. Testing/predicting: traverse tree, output leaf label. Overfitting? Limit or prune tree.
3 / 94
Lectures 3-4: linear regression
[Scatter plot: eruption duration vs. delay until next eruption.]
Linear regression / least squares.
Our first (of many!) linear prediction methods.
Today:
I Example.
I How to solve it: ERM and SVD.
I Features.
Next lecture: advanced topics, including overfitting.
4 / 94
2. Example: Old Faithful
Prediction problem: Old Faithful geyser (Yellowstone)
Task: Predict time of next eruption.
5 / 94
Time between eruptions
Historical records of eruptions:
[Timeline: the i-th eruption starts at time ai and ends at time bi; Y1, Y2, Y3 mark the gaps between consecutive eruptions.]
Time until next eruption: Yi := ai − bi−1.
Prediction task: At later time t (when an eruption ends), predict time of next eruption t + Y.
On “Old Faithful” data:
I Using 136 past observations, we form mean estimate µ = 70.7941.
I Can we do better?
6 / 94
Looking at the data
Naturalist Harry Woodward observed that time until the next eruption seems to be related to the duration of the last eruption.
[Scatter plot: duration of last eruption vs. time until next eruption.]
7 / 94
Using side-information
At prediction time t, duration of last eruption is available as side-information.
[Timeline: past eruptions with their gaps (the data (Xi, Yi)) up to the current time t; X is the duration of the last eruption and Y is the unknown time until the next eruption.]
IID model for supervised learning: (X1, Y1), . . . , (Xn, Yn), (X,Y ) are iid random pairs (i.e., labeled examples).
X takes values in X (e.g., X = R), Y takes values in R.
1. We observe (X1, Y1), . . . , (Xn, Yn), and then choose a prediction function (a.k.a. predictor)
f : X → R.
This is called “learning” or “training”.
2. At prediction time, observe X, and form prediction f(X).
How should we choose f based on data? Recall:
I The model is our choice.
I We must contend with overfitting, bad fitting algorithms, . . .
8 / 94
3. Least squares and linear regression
Which line?
[Scatter plot: duration vs. delay.]
Let’s predict with a linear regressor:
$\hat{y} := w^\top \begin{bmatrix} x \\ 1 \end{bmatrix},$
where w ∈ R2 is learned from data.
Remark: appending 1 makes this an affine function x ↦ w1 x + w2. (More on this later. . . )
If data lies along a line, we should output that line. But what if not?
9 / 94
ERM setup for least squares.
I Predictors/model: f(x) = wTx; a linear predictor/regressor.
(For linear classification: x ↦ sgn(wTx).)
I Loss/penalty: the least squares loss
$\ell_{\mathrm{ls}}(\hat{y}, y) = \ell_{\mathrm{ls}}(y, \hat{y}) = (y - \hat{y})^2.$
(Some conventions scale this by 1/2.)
I Goal: minimize the least squares empirical risk
$\widehat{R}_{\mathrm{ls}}(f) = \frac{1}{n}\sum_{i=1}^n \ell_{\mathrm{ls}}(y_i, f(x_i)) = \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2.$
I Specifically, we choose w ∈ Rd according to
$\arg\min_{w \in \mathbb{R}^d} \widehat{R}_{\mathrm{ls}}(x \mapsto w^\top x) = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2.$
I More generally, this is the ERM approach: pick a model and minimize empirical risk over the model parameters.
10 / 94
ERM in general
I Pick a family of models/predictors F. (For today, we use linear predictors.)
I Pick a loss function ℓ. (For today, we chose squared loss.)
I Minimize the empirical risk over the model parameters.
We haven't discussed: true risk and overfitting; how to minimize; why this is a good idea.
Remark: ERM is convenient in pytorch: just pick a model, a loss, and an optimizer, and tell it to minimize (see the sketch below).
11 / 94
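To make the pytorch remark concrete, here is a minimal sketch (an illustration, not part of the original slides) of least squares ERM with a linear model; the data, learning rate, and iteration count are arbitrary assumptions.

    import torch

    # toy data: n examples with d features (shapes chosen only for illustration)
    n, d = 100, 3
    X = torch.randn(n, d)
    y = torch.randn(n, 1)

    model = torch.nn.Linear(d, 1, bias=False)   # linear predictor x -> w^T x
    loss_fn = torch.nn.MSELoss()                # least squares loss
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(500):                        # minimize the empirical risk
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()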
Least squares ERM in pictures
Red dots: data points.
Affine hyperplane: our predictions (via affine expansion (x1, x2) ↦ (1, x1, x2)).
ERM: minimize sum of squared vertical lengths from hyperplane to points.
12 / 94
Empirical risk minimization in matrix notation
Define n× d matrix A and n× 1 column vector b by
$A := \frac{1}{\sqrt{n}} \begin{bmatrix} \leftarrow & x_1^\top & \rightarrow \\ & \vdots & \\ \leftarrow & x_n^\top & \rightarrow \end{bmatrix}, \qquad b := \frac{1}{\sqrt{n}} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}.$
Can write empirical risk as
$\widehat{R}(w) = \frac{1}{n}\sum_{i=1}^n (y_i - x_i^\top w)^2 = \|Aw - b\|_2^2.$
Necessary condition for w to be a minimizer of R:
∇R(w) = 0, i.e., w is a critical point of R.
This translates to (ATA)w = ATb,
a system of linear equations called the normal equations.
In an upcoming lecture we'll prove every critical point of R is a minimizer of R.
13 / 94
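As a concrete illustration (not from the slides), a minimal numpy sketch that forms A and b as above and solves the normal equations; the data is randomly generated just to show the shapes.

    import numpy as np

    n, d = 100, 3
    X = np.random.randn(n, d)               # rows are the x_i
    y = np.random.randn(n)

    A = X / np.sqrt(n)                       # the 1/sqrt(n) scaling from the slide
    b = y / np.sqrt(n)

    # normal equations: (A^T A) w = A^T b  (np.linalg.solve assumes A^T A is invertible)
    w = np.linalg.solve(A.T @ A, A.T @ b)

    risk = np.sum((A @ w - b) ** 2)          # empirical risk ||Aw - b||_2^2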
Summary on ERM and linear regression
Procedure:
I Form matrix A and vector b with data (resp. xi, yi) as rows.
(Scaling factor 1/√n is not standard, doesn't change solution.)
I Find w satisfying the normal equations ATAw = ATb.
(E.g., via Gaussian elimination, taking time O(nd²).)
I In general, solutions are not unique. (Why not?)
I If ATA is invertible, can choose the (unique) (ATA)−1ATb.
I Recall our original conundrum: we want to fit some line. We chose least squares; it gives one (family of) choice(s). Next lecture, with logistic regression, we get another.
I Note: if Aw = b for some w, then data lies along a line, and we might as well not worry about picking a loss function.
I Note: Aw − b = 0 may not have solutions, but the least squares setting means we instead work with AT(Aw − b) = 0, which does have solutions. . .
14 / 94
4. SVD and least squares
SVD
Recall the Singular Value Decomposition (SVD) M = USV T ∈ Rm×n, where
I U ∈ Rm×r is orthonormal, S ∈ Rr×r is diag(s1, . . . , sr) with s1 ≥ s2 ≥ · · · ≥ sr ≥ 0, and V ∈ Rn×r is orthonormal, with r := rank(M). (If r = 0, use the convention of S = 0 ∈ R1×1.)
I This convention is sometimes called the thin SVD.
I Another notation is to write $M = \sum_{i=1}^r s_i u_i v_i^\top$. This avoids the issue with 0 (the empty sum is 0). Moreover, this notation makes it clear that $(u_i)_{i=1}^r$ span the column space and $(v_i)_{i=1}^r$ span the row space of M.
I The full SVD will not be used in this class; it fills out U and V to be full rank and orthonormal, and pads S with zeros. It agrees with the eigendecompositions of MTM and MMT.
I Note: numpy and pytorch have SVD (interfaces differ slightly). Determining r runs into numerical issues.
15 / 94
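For reference, a small numpy sketch (illustrative, not part of the slides) of the thin SVD and one common way to estimate r numerically; the tolerance below mirrors a standard default and is an assumption.

    import numpy as np

    M = np.random.randn(5, 3)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # thin SVD: M = U @ diag(s) @ Vt

    # estimate r = rank(M): count singular values above a tolerance
    tol = s.max() * max(M.shape) * np.finfo(M.dtype).eps
    r = int((s > tol).sum())

    # keep only the first r components (drops numerically-zero directions)
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
    assert np.allclose(M, U @ np.diag(s) @ Vt)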
Pseudoinverse
Let the SVD $M = \sum_{i=1}^r s_i u_i v_i^\top$ be given.
I Define the pseudoinverse $M^+ = \sum_{i=1}^r \frac{1}{s_i} v_i u_i^\top$.
(If 0 = M ∈ Rm×n, then 0 = M+ ∈ Rn×m.)
I Alternatively, define the pseudoinverse S+ of a diagonal matrix to be ST but with reciprocals of non-zero elements; then M+ = V S+UT.
I Also called the Moore-Penrose pseudoinverse; it is unique, even though the SVD is not unique (why not?).
I If M−1 exists, then M−1 = M+ (why?).
16 / 94
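A small numpy check (illustrative, with an assumed rank tolerance) that building M+ from the thin SVD matches numpy's own pinv:

    import numpy as np

    M = np.random.randn(5, 3)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = int((s > 1e-10).sum())               # crude rank estimate (assumed tolerance)

    # M^+ = sum_i (1/s_i) v_i u_i^T, using only the nonzero singular values
    M_pinv = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

    assert np.allclose(M_pinv, np.linalg.pinv(M))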
SVD and least squares
Recall: we’d like to find w such that
ATAw = ATb.
If w = A+b, then
$A^\top A w = \Big(\sum_{i=1}^r s_i v_i u_i^\top\Big)\Big(\sum_{i=1}^r s_i u_i v_i^\top\Big)\Big(\sum_{i=1}^r \frac{1}{s_i} v_i u_i^\top\Big) b = \Big(\sum_{i=1}^r s_i v_i u_i^\top\Big)\Big(\sum_{i=1}^r u_i u_i^\top\Big) b = A^\top b.$
Henceforth, define wols = A+b as the OLS solution. (OLS = “ordinary least squares”.)
Note: in general, $AA^+ = \sum_{i=1}^r u_i u_i^\top \neq I$.
17 / 94
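A minimal numpy sketch (with assumed random data) of computing wols = A+b and checking that it satisfies the normal equations:

    import numpy as np

    n, d = 100, 3
    A = np.random.randn(n, d) / np.sqrt(n)
    b = np.random.randn(n) / np.sqrt(n)

    w_ols = np.linalg.pinv(A) @ b            # w_ols = A^+ b
    # np.linalg.lstsq(A, b, rcond=None)[0] returns the same least-norm solution

    assert np.allclose(A.T @ A @ w_ols, A.T @ b)   # normal equations hold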
5. Summary of linear regression so far
Main points
I Model/function/predictor class of linear regressors x ↦ wTx.
I ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.
I ERM solution for least squares: pick w satisfying ATAw = ATb, which is not unique; one unique choice is the ordinary least squares solution A+b.
18 / 94
Part 2 of linear regression lecture. . .
Recap on SVD. (A messy slide, I’m sorry.)
Suppose 0 ≠ M ∈ Rn×d, thus r := rank(M) > 0.
I “Decomposition form” thin SVD: $M = \sum_{i=1}^r s_i u_i v_i^\top$ with $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = \sum_{i=1}^r \frac{1}{s_i} v_i u_i^\top$, and in general $M^+ M = \sum_{i=1}^r v_i v_i^\top \neq I$.
I “Factorization form” thin SVD: M = USVT, where U ∈ Rn×r and V ∈ Rd×r have orthonormal columns (so UTU = VTV = I ∈ Rr×r, though UUT and VVT are not identity matrices in general), and S = diag(s1, . . . , sr) ∈ Rr×r with s1 ≥ · · · ≥ sr > 0; pseudoinverse M+ = V S−1UT, and in general M+M ≠ MM+ ≠ I.
I Full SVD: $M = U_f S_f V_f^\top$, where $U_f \in \mathbb{R}^{n\times n}$ and $V_f \in \mathbb{R}^{d\times d}$ are orthonormal and full rank (so $U_f^\top U_f$ and $V_f^\top V_f$ are identity matrices), and $S_f \in \mathbb{R}^{n\times d}$ is zero everywhere except the first r diagonal entries, which are $s_1 \geq \cdots \geq s_r > 0$; pseudoinverse $M^+ = V_f S_f^+ U_f^\top$, where $S_f^+$ is obtained by transposing $S_f$ and then taking reciprocals of the nonzero entries, and in general $M^+M \neq MM^+ \neq I$. Additional property: agreement with the eigendecompositions of $MM^\top$ and $M^\top M$.
The “full SVD” adds columns to U and V which hit zeros of S and therefore don't matter (as a sanity check, verify for yourself that all these SVDs are equal).
19 / 94
Recap on SVD, zero matrix case
Suppose 0 = M ∈ Rn×d, thus r := rank(M) = 0.
I In all types of SVD, M+ is MT (another zero matrix).
I Technically speaking, s is a singular value of M iff there exist nonzero vectors (u, v) with Mv = su and MTu = sv; the zero matrix therefore has no singular values (or left/right singular vectors).
I “Factorization form thin SVD” becomes a little messy.
20 / 94
6. More on the normal equations
Recall our matrix notation
Let labeled examples ((xi, yi))ni=1 be given.
Define n× d matrix A and n× 1 column vector b by
$A := \frac{1}{\sqrt{n}} \begin{bmatrix} \leftarrow & x_1^\top & \rightarrow \\ & \vdots & \\ \leftarrow & x_n^\top & \rightarrow \end{bmatrix}, \qquad b := \frac{1}{\sqrt{n}} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}.$
Can write empirical risk as
$\widehat{R}(w) = \frac{1}{n}\sum_{i=1}^n (y_i - x_i^\top w)^2 = \|Aw - b\|_2^2.$
Necessary condition for w to be a minimizer of R:
∇R(w) = 0, i.e., w is a critical point of R.
This translates to (ATA)w = ATb,
a system of linear equations called the normal equations.
We’ll now finally show that normal equations imply optimality.
21 / 94
Normal equations imply optimality
Consider w with ATAw = ATb, and any w′; then
$\|Aw' - b\|^2 = \|Aw' - Aw + Aw - b\|^2 = \|Aw' - Aw\|^2 + 2(Aw' - Aw)^\top(Aw - b) + \|Aw - b\|^2.$
Since
$(Aw' - Aw)^\top(Aw - b) = (w' - w)^\top(A^\top A w - A^\top b) = 0,$
then $\|Aw' - b\|^2 = \|Aw' - Aw\|^2 + \|Aw - b\|^2 \geq \|Aw - b\|^2$. This means w is optimal.
Moreover, writing $A = \sum_{i=1}^r s_i u_i v_i^\top$,
$\|Aw' - Aw\|^2 = (w' - w)^\top (A^\top A)(w' - w) = (w' - w)^\top \Big(\sum_{i=1}^r s_i^2 v_i v_i^\top\Big)(w' - w),$
so w′ optimal iff w′ −w is in the right nullspace of A.
(We’ll revisit all this with convexity later.)
22 / 94
Geometric interpretation of least squares ERM
Let aj ∈ Rn be the j-th column of matrix A ∈ Rn×d, so
$A = \begin{bmatrix} \uparrow & & \uparrow \\ a_1 & \cdots & a_d \\ \downarrow & & \downarrow \end{bmatrix}.$
Minimizing ‖Aw − b‖₂² is the same as finding the vector b̂ ∈ range(A) closest to b.
Solution b̂ is the orthogonal projection of b onto range(A) = {Aw : w ∈ Rd}.
[Figure: b and its orthogonal projection b̂ onto the plane spanned by a1 and a2.]
I b̂ is uniquely determined; indeed, $\hat{b} = AA^+ b = \sum_{i=1}^r u_i u_i^\top b$.
I If r = rank(A) < d, then there is more than one way to write b̂ as a linear combination of a1, . . . , ad.
If rank(A) < d, then the ERM solution is not unique.
To get w from b̂: solve the system of linear equations Aw = b̂.
23 / 94
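A small numpy illustration (assumed random data) of the projection view: b̂ = AA+b lies in range(A), and the residual b − b̂ is orthogonal to the columns of A.

    import numpy as np

    n, d = 20, 3
    A = np.random.randn(n, d)
    b = np.random.randn(n)

    b_hat = A @ np.linalg.pinv(A) @ b          # orthogonal projection of b onto range(A)

    assert np.allclose(A.T @ (b - b_hat), 0)   # residual is orthogonal to every column of A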
7. Features
Enhancing linear regression models with features
Linear functions alone are restrictive, but become powerful with creative side-information, or features.
Idea: Predict with x ↦ wTφ(x), where φ is a feature mapping.
Examples:
1. Non-linear transformations of existing variables: for x ∈ R,
φ(x) = ln(1 + x).
2. Logical formula of binary variables: for x = (x1, . . . , xd) ∈ {0, 1}d,
φ(x) = (x1 ∧ x5 ∧ ¬x10) ∨ (¬x2 ∧ x7).
3. Trigonometric expansion: for x ∈ R,
φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).
4. Polynomial expansion: for x = (x1, . . . , xd) ∈ Rd,
$\phi(x) = (1, x_1, \ldots, x_d, x_1^2, \ldots, x_d^2, x_1 x_2, \ldots, x_1 x_d, \ldots, x_{d-1} x_d).$
24 / 94
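A minimal sketch (illustrative names and data, not from the slides) of ERM with a feature map: expand each x with a hypothetical φ, then run ordinary least squares on the expanded data.

    import numpy as np

    def phi(x):
        # hypothetical trigonometric expansion of a scalar x (as in example 3 above)
        return np.array([1.0, np.sin(x), np.cos(x), np.sin(2 * x), np.cos(2 * x)])

    x = np.random.uniform(0, 6, size=50)
    y = np.sin(x) + 0.1 * np.random.randn(50)        # toy targets

    Phi = np.vstack([phi(xi) for xi in x])           # n-by-(number of features) matrix
    w = np.linalg.pinv(Phi) @ y                      # least squares in feature space

    y_pred = Phi @ w                                 # predictions w^T phi(x_i)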
Example: Taking advantage of linearity
Suppose you are trying to predict some health outcome.
I Physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:
φ(x) = (xtemp − 98.6)².
I What if you didn't know about this magic constant 98.6?
I Instead, use φ(x) = (1, xtemp, xtemp²).
Can learn coefficients w such that
wTφ(x) = (xtemp − 98.6)²,
or any other quadratic polynomial in xtemp (which may be better!).
25 / 94
Quadratic expansion
Quadratic function f : R → R,
f(x) = ax² + bx + c, x ∈ R,
for a, b, c ∈ R.
This can be written as a linear function of φ(x), where
φ(x) := (1, x, x²),
since f(x) = wTφ(x) where w = (c, b, a).
For a multivariate quadratic function f : Rd → R, use
$\phi(x) := (\underbrace{1, x_1, \ldots, x_d}_{\text{linear terms}}, \underbrace{x_1^2, \ldots, x_d^2}_{\text{squared terms}}, \underbrace{x_1 x_2, \ldots, x_1 x_d, \ldots, x_{d-1} x_d}_{\text{cross terms}}).$
26 / 94
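An illustrative sketch of the multivariate quadratic expansion above (the function name is an assumption, not from the slides):

    import numpy as np
    from itertools import combinations

    def quadratic_expansion(x):
        # x is a length-d vector; returns (1, linear terms, squared terms, cross terms)
        d = len(x)
        cross = [x[i] * x[j] for i, j in combinations(range(d), 2)]
        return np.concatenate(([1.0], x, x ** 2, cross))

    x = np.array([2.0, -1.0, 3.0])
    print(quadratic_expansion(x))   # length 1 + d + d + d*(d-1)/2 = 10 for d = 3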
Affine expansion and “Old Faithful”
Woodward needed an affine expansion for “Old Faithful” data:
φ(x) := (1, x).
[Scatter plot: duration of last eruption vs. time until next eruption, with a fitted affine function.]
Affine function fa,b : R→ R for a, b ∈ R,
fa,b(x) = a+ bx,
is a linear function fw of φ(x) for w = (a, b).
(This easily generalizes to multivariate affine functions.)
27 / 94
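A minimal numpy sketch of fitting the affine expansion φ(x) = (1, x) by least squares; the data below is a made-up stand-in for the Old Faithful records, not the actual dataset.

    import numpy as np

    # made-up (duration, delay) pairs standing in for the Old Faithful data
    duration = np.random.uniform(1.5, 5.0, size=136)
    delay = 30.0 + 10.0 * duration + 3.0 * np.random.randn(136)

    Phi = np.column_stack([np.ones_like(duration), duration])   # phi(x) = (1, x)
    a, b = np.linalg.lstsq(Phi, delay, rcond=None)[0]           # fit f(x) = a + b*x

    predicted_delay = a + b * 4.0     # prediction for a 4-minute eruption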
Final remarks on features
I “Feature engineering” can drastically change the power of a model.
I Some people consider it messy, unprincipled, pure “trial-and-error”.
I Deep learning is somewhat touted as removing some of this, but it doesn't do so completely (e.g., it took a lot of work to come up with the “convolutional neural network”; side question: who came up with that?).
28 / 94
8. Statistical view of least squares; maximum likelihood
Maximum likelihood estimation (MLE) refresher
Parametric statistical model: P = {Pθ : θ ∈ Θ}, a collection of probability distributions for observed data.
I Θ: parameter space.
I θ ∈ Θ: a particular parameter (or parameter vector).
I Pθ: a particular probability distribution for observed data.
Likelihood of θ ∈ Θ given observed data x: For discrete X ∼ Pθ with probability mass function pθ,
L(θ) := pθ(x).
For continuous X ∼ Pθ with probability density function fθ,
L(θ) := fθ(x).
Maximum likelihood estimator (MLE): Let θ̂ be the θ ∈ Θ of highest likelihood given observed data.
29 / 94
Distributions over labeled examples
X: Space of possible side-information (feature space). Y: Space of possible outcomes (label space or output space).
Distribution P of random pair (X,Y ) taking values in X × Y can be thought of in two parts:
1. Marginal distribution PX of X:
PX is a probability distribution on X .
2. Conditional distribution PY |X=x of Y given X = x for each x ∈ X :
PY |X=x is a probability distribution on Y.
This lecture: Y = R (regression problems).
30 / 94
Optimal predictor
What function f : X → R has smallest (squared loss) risk
R(f) := E[(f(X)− Y )2]?
Note: earlier we discussed empirical risk.
I Conditional on X = x, the minimizer of conditional risk
y ↦ E[(y − Y)² | X = x]
is the conditional mean E[Y | X = x].
I Therefore, the function f* : R → R where
f*(x) = E[Y | X = x], x ∈ R
has the smallest risk.
I f* is called the regression function or conditional mean function.
31 / 94
Linear regression models
When side-information is encoded as vectors of real numbers x = (x1, . . . , xd) (called features or variables), it is common to use a linear regression model, such as the following:
Y |X = x ∼ N(xTw, σ2), x ∈ Rd.
I Parameters: w = (w1, . . . , wd) ∈ Rd, σ2 > 0.
I X = (X1, . . . , Xd), a random vector (i.e., a vector of random variables).
I Conditional distribution of Y given X is normal.
I Marginal distribution of X not specified.
In this model, the regression function f* is a linear function fw : Rd → R,
$f_w(x) = x^\top w = \sum_{i=1}^d w_i x_i, \quad x \in \mathbb{R}^d.$
(We'll often refer to fw just by w.)
[Plot: the regression function f* shown as a line over x.]
32 / 94
Maximum likelihood estimation for linear regression
Linear regression model with Gaussian noise: (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, with
Y |X = x ∼ N(xTw, σ2), x ∈ Rd.
(Traditional to study linear regression in context of this model.)
Log-likelihood of (w, σ²), given data (Xi, Yi) = (xi, yi) for i = 1, . . . , n:
$\sum_{i=1}^n \Big\{ -\frac{1}{2\sigma^2}(x_i^\top w - y_i)^2 + \frac{1}{2}\ln\frac{1}{2\pi\sigma^2} \Big\} + \big\{ \text{terms not involving } (w, \sigma^2) \big\}.$
The w that maximizes the log-likelihood is also the w that minimizes
$\frac{1}{n}\sum_{i=1}^n (x_i^\top w - y_i)^2.$
This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .
33 / 94
Empirical distribution and empirical risk
Empirical distribution Pn on (x1, y1), . . . , (xn, yn) has probability mass function pn given by
$p_n((x, y)) := \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{(x, y) = (x_i, y_i)\}, \quad (x, y) \in \mathbb{R}^d \times \mathbb{R}.$
Plug-in principle: Goal is to find function f that minimizes (squared loss) risk
R(f) = E[(f(X)− Y )2].
But we don’t know the distribution P of (X,Y ).
Replace P with Pn → Empirical (squared loss) risk R̂(f):
$\widehat{R}(f) := \frac{1}{n}\sum_{i=1}^n (f(x_i) - y_i)^2.$
(“Plug-in principle” is used throughout statistics in this same way.)
34 / 94
Empirical risk minimization
Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.
For linear functions and squared loss: ERM returns
$\hat{w} \in \arg\min_{w \in \mathbb{R}^d} \widehat{R}(w),$
which coincides with MLE under the basic linear regression model.
In general:
I MLE makes sense in context of statistical model for which it is derived.
I ERM makes sense in context of general iid model for supervised learning.
Further remarks.
I In MLE, we assume a model, and we not only maximize likelihood, but can try to argue we “recover” a “true” parameter.
I In ERM, by default there is no assumption of a “true” parameter to recover.
Useful examples: medical testing, gene expression, . . .
35 / 94
Old Faithful data under this least squares statistical model
Recall our data, consisting of historical records of eruptions:
[Timeline: eruption start/end times with gaps Y1, Y2, Y3 between eruptions.]
Statistical model (not just IID!): Y1, . . . , Yn, Y ∼iid N(µ, σ2).
I Data: Yi := ai − bi−1, i = 1, . . . , n.
(Admittedly not a great model, since durations are non-negative.)
Task: At later time t (when an eruption ends), predict time of next eruption t + Y. For the linear regression model, we'll assume
Y |X = x ∼ N(xTw, σ2), x ∈ Rd.
(This extends the model above if we add the “1” feature.)
36 / 94
9. Regularization and ridge regression
Inductive bias
Suppose ERM solution is not unique. What should we do?
One possible answer: Pick the w of shortest length.
I Fact: The shortest solution w to (ATA)w = ATb is always unique.
I Obtain w via w = A+b,
where A+ is the (Moore-Penrose) pseudoinverse of A.
Why should this be a good idea?
I Data does not give reason to choose a shorter w over a longer w.
I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when the “true” w* is short), not for others.
All learning algorithms encode some kind of inductive bias.
37 / 94
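A small numpy illustration (random underdetermined data, assumed only for this example) of the minimum-norm inductive bias: when n < d there are many ERM solutions, and A+b picks the shortest one.

    import numpy as np

    n, d = 5, 10                              # underdetermined: fewer examples than features
    A = np.random.randn(n, d)
    b = np.random.randn(n)

    w_min_norm = np.linalg.pinv(A) @ b        # shortest w satisfying A^T A w = A^T b

    # another solution: add a vector from the nullspace of A
    _, _, Vt = np.linalg.svd(A)               # full SVD; last rows of Vt span the nullspace
    w_other = w_min_norm + Vt[-1]
    assert np.allclose(A @ w_other, A @ w_min_norm)              # same predictions, same risk
    assert np.linalg.norm(w_other) > np.linalg.norm(w_min_norm)  # but longer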
Example
ERM with scaled trigonometric feature expansion:
$\phi(x) = (1, \sin(x), \cos(x), \tfrac{1}{2}\sin(2x), \tfrac{1}{2}\cos(2x), \tfrac{1}{3}\sin(3x), \tfrac{1}{3}\cos(3x), \ldots).$
[Plots: training data, an arbitrary ERM fit, and the least ℓ2 norm ERM fit.]
It is not a given that the least norm ERM is better than the other ERM!
38 / 94
Regularized ERM
Combine the two concerns: For a given λ ≥ 0, find the minimizer of
R(w) + λ‖w‖₂²
over w ∈ Rd.
Fact: If λ > 0, then the solution is always unique (even if n < d)!
I This is called ridge regression.
(λ = 0 is ERM / Ordinary Least Squares.)
Explicit solution: (ATA + λI)−1ATb.
I Parameter λ controls how much attention is paid to the regularizer ‖w‖₂² relative to the data fitting term R(w).
I Choose λ using cross-validation.
Note: in deep networks, this regularization is called “weight decay”. (Why?) Note: another popular regularizer for linear regression is ℓ1.
39 / 94
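A small numpy sketch (not from the slides) of the ridge closed form just stated; A and b include the 1/√n scaling used on these slides, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 10, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=n)

A = X / np.sqrt(n)                             # so that ||Aw - b||^2 = (1/n) sum_i (x_i'w - y_i)^2
b = y / np.sqrt(n)

# Ridge solution (A'A + lam I)^{-1} A'b; unique for any lam > 0.
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

def objective(w):
    return np.mean((X @ w - y) ** 2) + lam * np.dot(w, w)

print(objective(w_ridge))                                  # value at the ridge solution
print(objective(w_ridge + 0.01 * rng.normal(size=d)))      # any perturbation does worse
```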
10. True risk and overfitting
Statistical interpretation of ERM
Let (X, Y) ∼ P, where P is some distribution on ℝᵈ × ℝ. Which w have smallest risk R(w) = E[(Xᵀw − Y)²]?
Necessary condition for w to be a minimizer of R:
∇R(w) = 0, i.e., w is a critical point of R.
This translates to
E[XXᵀ] w = E[Y X],
a system of linear equations called the population normal equations.
It can be proved that every critical point of R is a minimizer of R.
Looks familiar?
If (X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid, then
E[AᵀA] = E[XXᵀ] and E[Aᵀb] = E[Y X],
so ERM can be regarded as a plug-in estimator for a minimizer of R.
40 / 94
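An illustrative numpy sketch (not from the slides) of the plug-in view: the empirical quantities AᵀA and Aᵀb converge to E[XXᵀ] and E[YX], so solving the empirical normal equations approaches the population minimizer. The distribution below is made up so that w_star is that minimizer.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
w_star = np.array([1.0, -2.0, 0.5])            # population minimizer, since E[XX'] = I here

def sample(n):
    X = rng.normal(size=(n, d))
    Y = X @ w_star + rng.normal(size=n)        # noisy labels: Y is not exactly X w_star
    return X, Y

for n in [100, 10_000, 1_000_000]:
    X, Y = sample(n)
    S = X.T @ X / n                            # empirical version of E[XX']
    v = X.T @ Y / n                            # empirical version of E[YX]
    w_hat = np.linalg.solve(S, v)              # solve the (empirical) normal equations
    print(n, np.linalg.norm(w_hat - w_star))   # shrinks as n grows
```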
Risk of ERM
IID model: (X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid, taking values in ℝᵈ × ℝ.
Let w⋆ be a minimizer of R over all w ∈ ℝᵈ, i.e., w⋆ satisfies the population normal equations
E[XXᵀ] w⋆ = E[Y X].
I If the ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w⋆).
I What about when the ERM solution is unique?
Theorem. Under mild assumptions on the distribution of X,
R(ŵ) − R(w⋆) = O( tr(cov(εW)) / n )
“asymptotically”, where W := E[XXᵀ]^(−1/2) X and ε := Y − Xᵀw⋆.
41 / 94
Risk of ERM analysis (rough sketch)
Let εᵢ := Yᵢ − Xᵢᵀw⋆ for each i = 1, . . . , n, so
E[εᵢXᵢ] = E[YᵢXᵢ] − E[XᵢXᵢᵀ] w⋆ = 0
and
√n (ŵ − w⋆) = ( (1/n) ∑ᵢ₌₁ⁿ XᵢXᵢᵀ )⁻¹ · (1/√n) ∑ᵢ₌₁ⁿ εᵢXᵢ.
1. By LLN: (1/n) ∑ᵢ₌₁ⁿ XᵢXᵢᵀ →p E[XXᵀ].
2. By CLT: (1/√n) ∑ᵢ₌₁ⁿ εᵢXᵢ →d cov(εX)^(1/2) Z, where Z ∼ N(0, I).
Therefore, the asymptotic distribution of √n (ŵ − w⋆) is
√n (ŵ − w⋆) →d E[XXᵀ]⁻¹ cov(εX)^(1/2) Z.
A few more steps gives
n ( E[(Xᵀŵ − Y)²] − E[(Xᵀw⋆ − Y)²] ) →d ‖E[XXᵀ]^(−1/2) cov(εX)^(1/2) Z‖₂².
Random variable on RHS is “concentrated” around its mean tr(cov(εW)).
42 / 94
Risk of ERM: postscript
I Analysis does not assume that the linear regression model is “correct”; the data distribution need not be from the normal linear regression model.
I Only assumptions are those needed for LLN and CLT to hold.
I However, if the normal linear regression model holds, i.e.,
Y | X = x ∼ N(xᵀw⋆, σ²),
then the bound from the theorem becomes
R(ŵ) − R(w⋆) = O( σ²d / n ),
which is familiar to those who have taken introductory statistics.
I With more work, can also prove a non-asymptotic risk bound of similar form.
I In homework/reading, we look at a simpler setting for studying ERM for linear regression, called “fixed design”.
43 / 94
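A hedged simulation sketch (not from the slides) of the σ²d/n rate under the normal linear model: with X ∼ N(0, I), the excess risk of ERM is ‖ŵ − w⋆‖₂², and its average over trials should be close to σ²d/n.

```python
import numpy as np

rng = np.random.default_rng(3)
d, sigma = 10, 1.0
w_star = rng.normal(size=d)

for n in [50, 200, 800]:
    excess = []
    for _ in range(200):
        X = rng.normal(size=(n, d))
        y = X @ w_star + sigma * rng.normal(size=n)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        excess.append(np.sum((w_hat - w_star) ** 2))   # R(w_hat) - R(w_star) when X ~ N(0, I)
    print(n, np.mean(excess), sigma**2 * d / n)        # the last two columns should be close
```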
Risk vs empirical risk
Let ŵ be the ERM solution.
1. Empirical risk of ERM: R̂(ŵ)
2. True risk of ERM: R(ŵ)
Theorem. E[ R̂(ŵ) ] ≤ E[ R(ŵ) ].
(Empirical risk can sometimes be larger than true risk, but not on average.)
Overfitting: empirical risk is “small”, but true risk is “much higher”.
44 / 94
Overfitting example
(X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid; X is a continuous random variable in ℝ.
Suppose we use the degree-k polynomial expansion
φ(x) = (1, x, x², . . . , xᵏ), x ∈ ℝ,
so the dimension is d = k + 1.
Fact: Any function on ≤ k + 1 points can be interpolated by a polynomial of degree at most k.
(Figure omitted; axes: x vs. y.)
Conclusion: If n ≤ k + 1 = d, the ERM solution ŵ with this feature expansion has R̂(ŵ) = 0 always, regardless of its true risk (which can be ≫ 0).
45 / 94
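A minimal numpy sketch (not from the slides) of this overfitting example: with n = k + 1 points and a degree-k polynomial expansion, the empirical risk is driven to (numerically) zero even though the labels are pure noise, so the true risk stays large.

```python
import numpy as np

rng = np.random.default_rng(4)
k = 9                       # polynomial degree, so d = k + 1
n = k + 1                   # exactly enough points to interpolate

def phi(x):                 # feature expansion (1, x, x^2, ..., x^k)
    return np.vander(x, N=k + 1, increasing=True)

x_train = rng.uniform(size=n)
y_train = rng.normal(size=n)                    # pure noise: nothing real to learn

w, *_ = np.linalg.lstsq(phi(x_train), y_train, rcond=None)

x_test = rng.uniform(size=1000)
y_test = rng.normal(size=1000)

print(np.mean((phi(x_train) @ w - y_train) ** 2))   # ~0: empirical risk vanishes
print(np.mean((phi(x_test) @ w - y_test) ** 2))     # typically much larger: true risk is not small
```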
Estimating risk
IID model: (X₁, Y₁), . . . , (Xₙ, Yₙ), (X̃₁, Ỹ₁), . . . , (X̃ₘ, Ỹₘ) ∼iid P.
I training data (X₁, Y₁), . . . , (Xₙ, Yₙ) used to learn f̂.
I test data (X̃₁, Ỹ₁), . . . , (X̃ₘ, Ỹₘ) used to estimate risk, via the test risk
R̂test(f̂) := (1/m) ∑ᵢ₌₁ᵐ (f̂(X̃ᵢ) − Ỹᵢ)².
I Training data is independent of test data, so f̂ is independent of test data.
I Let Lᵢ := (f̂(X̃ᵢ) − Ỹᵢ)² for each i = 1, . . . , m, so
E[ R̂test(f̂) | f̂ ] = (1/m) ∑ᵢ₌₁ᵐ E[ Lᵢ | f̂ ] = R(f̂).
I Moreover, L₁, . . . , Lₘ are conditionally iid given f̂, and hence by the Law of Large Numbers,
R̂test(f̂) →p R(f̂) as m → ∞.
I By CLT, the rate of convergence is m^(−1/2).
46 / 94
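A small simulation sketch (not from the slides): held-out test risk tracks the true risk of the learned predictor, while training risk is biased downward. The data-generating model here is synthetic, chosen so the true risk can be computed exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, m = 5, 100, 10_000
w_star = rng.normal(size=d)

X_train = rng.normal(size=(n, d))
y_train = X_train @ w_star + rng.normal(size=n)
X_test = rng.normal(size=(m, d))               # fresh draws, independent of the training data
y_test = X_test @ w_star + rng.normal(size=m)

w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_risk = np.mean((X_train @ w_hat - y_train) ** 2)
test_risk = np.mean((X_test @ w_hat - y_test) ** 2)
true_risk = np.sum((w_hat - w_star) ** 2) + 1.0      # exact R(w_hat) for this simulated model

print(train_risk, test_risk, true_risk)    # test risk ~ true risk; training risk is smaller on average
```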
Rates for risk minimization vs. rates for risk estimation
One may think that ERM “works” because, somehow, training risk is a good “plug-in” estimate of true risk.
I Sometimes this is partially true—we’ll revisit this when we discuss generalization theory.
Roughly speaking, under some assumptions, can expect that
|R̂(w) − R(w)| ≤ O( √(d/n) ) for all w ∈ ℝᵈ.
However . . .
I By CLT, we know the following holds for a fixed w:
R̂(w) →p R(w) at n^(−1/2) rate.
(Here, we ignore the dependence on d.)
I Yet, for the ERM ŵ,
R(ŵ) →p R(w⋆) at n^(−1) rate.
(Also ignoring dependence on d.)
Implication: Selecting a good predictor can be “easier” than estimating how good predictors are!
47 / 94
Old Faithful example
I Linear regression model + affine expansion on “duration of last eruption”.
I Learn w = (35.0929, 10.3258) from 136 past observations.
I Mean squared loss of w on next 136 observations is 35.9404.
(Recall: mean squared loss of µ = 70.7941 was 187.1894.)
(Figure omitted: duration of last eruption vs. time until next eruption, with the linear model and the constant prediction overlaid.)
(Unfortunately, √35.9 ≈ 6.0 > mean duration ≈ 3.5.)
48 / 94
11. ℓ1 regularization: the LASSO
Regularization with a different norm
Lasso: For a given λ ≥ 0, find minimizer of
R(w) + λ‖w‖₁
over w ∈ ℝᵈ. Here, ‖v‖₁ = ∑ᵢ₌₁ᵈ |vᵢ| is the ℓ1-norm.
I Prefers shorter w, but using a different notion of length than ridge.
I Tends to produce w that are sparse—i.e., have few non-zero coordinates—or at least well-approximated by sparse vectors.
Fact: Vectors with small ℓ1-norm are well-approximated by sparse vectors.
If w̃ contains just the 1/ε² largest coefficients (by magnitude) of w, then
‖w − w̃‖₂ ≤ ε‖w‖₁.
49 / 94
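An illustrative sketch (not from the slides) of the sparsity effect, using scikit-learn’s Lasso and Ridge (assumed installed) on data with only three relevant features.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge    # assumes scikit-learn is installed

rng = np.random.default_rng(6)
n, d = 100, 50
w_star = np.zeros(d)
w_star[:3] = [2.0, -1.5, 1.0]                    # only 3 relevant features

X = rng.normal(size=(n, d))
y = X @ w_star + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print(np.sum(lasso.coef_ != 0))   # few non-zero coordinates: a sparse w
print(np.sum(ridge.coef_ != 0))   # typically all 50 coordinates are non-zero
```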
Sparse approximations
Claim: If w̃ contains just the T largest coefficients (by magnitude) of w, then
‖w − w̃‖₂ ≤ ‖w‖₁ / √(T + 1).
WLOG |w₁| ≥ |w₂| ≥ · · · , so w̃ = (w₁, . . . , w_T, 0, . . . , 0).
‖w − w̃‖₂² = ∑_{i≥T+1} wᵢ² ≤ ∑_{i≥T+1} |wᵢ| · |w_{T+1}| ≤ ‖w‖₁ · |w_{T+1}| ≤ ‖w‖₁ · ‖w‖₁ / (T + 1).
This is a consequence of the “mismatch” between ℓ1- and ℓ2-norms.
Can get similar results for other ℓp norms.
50 / 94
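A quick numeric check (not from the slides) of the claim above on a random dense vector.

```python
import numpy as np

rng = np.random.default_rng(7)
w = rng.laplace(size=1000)                  # an arbitrary dense vector
T = 50

idx = np.argsort(np.abs(w))[::-1]           # indices sorted by decreasing magnitude
w_tilde = np.zeros_like(w)
w_tilde[idx[:T]] = w[idx[:T]]               # keep only the T largest coefficients

lhs = np.linalg.norm(w - w_tilde)           # ||w - w~||_2
rhs = np.linalg.norm(w, 1) / np.sqrt(T + 1) # ||w||_1 / sqrt(T + 1)
print(lhs, rhs, lhs <= rhs)                 # the bound holds
```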
Example: Coefficient profile (`2 vs. `1)
Y = levels of prostate cancer antigen, X = clinical measurements
Horizontal axis: varying λ (large λ to left, small λ to right). Vertical axis: coefficient value in ℓ2-regularized ERM and ℓ1-regularized ERM, for eight different variables.
51 / 94
Other approaches to sparse regression
I Subset selection:
Find w that minimizes empirical risk among all vectors with at most k non-zero entries.
Unfortunately, this seems to require time exponential in k.
I Greedy algorithms:
Repeatedly choose new variable to “include” in support of w until k variables are included.
Forward stepwise regression / Orthogonal matching pursuit
Often works as well as `1-regularized ERM.
Why do we care about sparsity?
52 / 94
12. Summary
Summary
I ERM for OLS
I ERM in general
I Normal equations
I Pseudoinverse (least norm) solution
I Ridge regression
I Statistical view of ERM
53 / 94
Another interpretation of ridge regression
Define the (n+d) × d matrix A and (n+d) × 1 column vector b by
A := (1/√n) · [ x₁ᵀ ; · · · ; xₙᵀ ; √(nλ) I_d ],   b := (1/√n) · (y₁, . . . , yₙ, 0, . . . , 0)ᵀ,
where the top n rows of A are the feature vectors and the bottom d × d block is the diagonal matrix with √(nλ) on the diagonal.
Then
‖Aw − b‖₂² = R(w) + λ‖w‖₂².
Interpretation:
I d “fake” data points; they ensure that the augmented data matrix A has rank d.
I Squared length of each “fake” feature vector is nλ. All corresponding labels are 0.
I Prediction of w on the i-th fake feature vector is √(nλ) wᵢ.
57 / 94
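A numpy sketch (not from the slides) checking this interpretation: ordinary least squares on the augmented (A, b) reproduces the ridge solution.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, lam = 30, 5, 0.2
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Augment: d "fake" rows sqrt(n*lam) * I with labels 0, everything scaled by 1/sqrt(n).
A = np.vstack([X, np.sqrt(n * lam) * np.eye(d)]) / np.sqrt(n)
b = np.concatenate([y, np.zeros(d)]) / np.sqrt(n)

w_aug, *_ = np.linalg.lstsq(A, b, rcond=None)
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

print(np.allclose(w_aug, w_ridge))    # True: least squares on (A, b) is exactly ridge regression
```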
Enhancing linear regression models
Linear functions might sound rather restricted, but actually they can be quite powerful if you are creative about side-information.
Examples:
1. Non-linear transformations of existing variables: for x ∈ ℝ,
φ(x) = ln(1 + x).
2. Logical formula of binary variables: for x = (x₁, . . . , x_d) ∈ {0, 1}ᵈ,
φ(x) = (x₁ ∧ x₅ ∧ ¬x₁₀) ∨ (¬x₂ ∧ x₇).
3. Trigonometric expansion: for x ∈ ℝ,
φ(x) = (1, sin(x), cos(x), sin(2x), cos(2x), . . . ).
4. Polynomial expansion: for x = (x₁, . . . , x_d) ∈ ℝᵈ,
φ(x) = (1, x₁, . . . , x_d, x₁², . . . , x_d², x₁x₂, . . . , x₁x_d, . . . , x_{d−1}x_d).
59 / 94
Example: Taking advantage of linearity
Suppose you are trying to predict some health outcome.
I Physician suggests that body temperature is relevant, specifically the (squared) deviation from normal body temperature:
φ(x) = (x_temp − 98.6)².
I What if you didn’t know about this magic constant 98.6?
I Instead, use
φ(x) = (1, x_temp, x_temp²).
Can learn coefficients w such that
wᵀφ(x) = (x_temp − 98.6)²,
or any other quadratic polynomial in x_temp (which may be better!).
60 / 94
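An illustrative numpy sketch (not from the slides, with simulated temperatures): fitting w over φ(x) = (1, x, x²) recovers a quadratic close to (x − 98.6)² without being told the constant 98.6.

```python
import numpy as np

rng = np.random.default_rng(9)
temp = rng.normal(loc=98.6, scale=2.0, size=1000)        # simulated body temperatures
y = (temp - 98.6) ** 2 + 0.01 * rng.normal(size=1000)    # outcome driven by squared deviation

Phi = np.column_stack([np.ones_like(temp), temp, temp ** 2])   # phi(x) = (1, x, x^2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(w)   # roughly (98.6^2, -2 * 98.6, 1) = (9721.96, -197.2, 1.0)
```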
Quadratic expansion
Quadratic function f : ℝ → ℝ,
f(x) = ax² + bx + c, x ∈ ℝ,
for a, b, c ∈ ℝ.
This can be written as a linear function of φ(x), where
φ(x) := (1, x, x²),
since f(x) = wᵀφ(x) where w = (c, b, a).
For a multivariate quadratic function f : ℝᵈ → ℝ, use
φ(x) := (1, x₁, . . . , x_d [linear terms], x₁², . . . , x_d² [squared terms], x₁x₂, . . . , x₁x_d, . . . , x_{d−1}x_d [cross terms]).
61 / 94
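A small Python sketch (not from the slides) of the multivariate quadratic feature map, with linear, squared, and cross terms; the helper name quadratic_expansion is ours.

```python
import numpy as np
from itertools import combinations

def quadratic_expansion(x):
    """phi(x) = (1, x_1..x_d, x_1^2..x_d^2, and all cross terms x_i x_j with i < j)."""
    x = np.asarray(x, dtype=float)
    cross = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], x, x ** 2, cross])

x = np.array([2.0, -1.0, 3.0])
print(quadratic_expansion(x))
# [ 1.  2. -1.  3.  4.  1.  9. -2.  6. -3.]   -> dimension 1 + d + d + d(d-1)/2
```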
Affine expansion and “Old Faithful”
Woodward needed an affine expansion for the “Old Faithful” data:
φ(x) := (1, x).
(Figure omitted: duration of last eruption vs. time until next eruption, with the fitted affine function.)
An affine function f_{a,b} : ℝ → ℝ for a, b ∈ ℝ,
f_{a,b}(x) = a + bx,
is a linear function f_w of φ(x) for w = (a, b).
(This easily generalizes to multivariate affine functions.)
62 / 94
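A minimal numpy sketch (not from the slides) fitting the affine expansion φ(x) = (1, x) by least squares; the (duration, wait) pairs below are made up for illustration and do not reproduce the actual Old Faithful numbers.

```python
import numpy as np

rng = np.random.default_rng(10)
duration = rng.uniform(1.5, 5.5, size=136)                      # made-up eruption durations
wait = 35.0 + 10.0 * duration + 6.0 * rng.normal(size=136)      # made-up waiting times

Phi = np.column_stack([np.ones_like(duration), duration])       # phi(x) = (1, x)
(a, b), *_ = np.linalg.lstsq(Phi, wait, rcond=None)

print(a, b)             # intercept and slope of the affine predictor
print(a + b * 4.0)      # predicted wait after a 4-minute eruption
```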
Why linear regression models?
1. Linear regression models benefit from good choice of features.
2. Structure of linear functions is very well-understood.
3. Many well-understood and efficient algorithms for learning linear functions from data, even when n and d are large.
63 / 94
13. From data to prediction functions
Maximum likelihood estimation for linear regression
Linear regression model with Gaussian noise: (X₁, Y₁), . . . , (Xₙ, Yₙ), (X, Y) are iid, with
Y | X = x ∼ N(xᵀw, σ²), x ∈ ℝᵈ.
(Traditional to study linear regression in the context of this model.)
Log-likelihood of (w, σ²), given data (Xᵢ, Yᵢ) = (xᵢ, yᵢ) for i = 1, . . . , n:
∑ᵢ₌₁ⁿ { −(1/(2σ²)) (xᵢᵀw − yᵢ)² + (1/2) ln(1/(2πσ²)) } + { terms not involving (w, σ²) }.
The w that maximizes the log-likelihood is also the w that minimizes
(1/n) ∑ᵢ₌₁ⁿ (xᵢᵀw − yᵢ)².
This coincides with another approach, called empirical risk minimization, which is studied beyond the context of the linear regression model . . .
64 / 94
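A numeric sketch (not from the slides): for fixed σ², differences in the Gaussian log-likelihood equal −n/(2σ²) times differences in empirical risk, so the MLE for w is exactly the empirical risk minimizer. The data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(11)
n, d, sigma = 40, 3, 0.7
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 2.0]) + sigma * rng.normal(size=n)

def log_likelihood(w):
    resid = X @ w - y
    return np.sum(-resid ** 2 / (2 * sigma ** 2) + 0.5 * np.log(1 / (2 * np.pi * sigma ** 2)))

def empirical_risk(w):
    return np.mean((X @ w - y) ** 2)

w1, w2 = rng.normal(size=d), rng.normal(size=d)
lhs = log_likelihood(w1) - log_likelihood(w2)
rhs = -(n / (2 * sigma ** 2)) * (empirical_risk(w1) - empirical_risk(w2))
print(np.isclose(lhs, rhs))   # True: maximizing the likelihood over w = minimizing empirical risk
```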
Empirical distribution and empirical risk
Empirical distribution Pn on (x1, y1), . . . , (xn, yn) has probability mass function pn given by

pn((x, y)) := (1/n) ∑_{i=1}^n 1{(x, y) = (xi, yi)},   (x, y) ∈ R^d × R.
Plug-in principle: Goal is to find function f that minimizes (squared loss) risk
R(f) = E[(f(X)− Y )2].
But we don’t know the distribution P of (X,Y ).
Replace P with Pn → Empirical (squared loss) risk R̂(f):

R̂(f) := (1/n) ∑_{i=1}^n (f(xi) − yi)^2.
65 / 94
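A minimal sketch of the empirical risk as a function of data and a predictor (added for illustration; the toy numbers are made up):

import numpy as np

def empirical_risk(f, xs, ys):
    """Average squared loss of predictor f on the sample (x1,y1),...,(xn,yn)."""
    preds = np.array([f(x) for x in xs])
    return np.mean((preds - ys) ** 2)

# Example with a linear predictor f(x) = w^T x.
w = np.array([1.0, -0.5])
xs = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
ys = np.array([0.0, -0.5, 3.5])
print(empirical_risk(lambda x: w @ x, xs, ys))  # 0.0 here, since ys = xs @ w exactly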
Empirical risk minimization
Empirical risk minimization (ERM) is the learning method that returns a function (from a specified function class) that minimizes the empirical risk.
For linear functions and squared loss: ERM returns
ŵ ∈ arg min_{w ∈ R^d} R̂(w),
which coincides with MLE under the basic linear regression model.
In general:
I MLE makes sense in context of statistical model for which it is derived.
I ERM makes sense in context of general iid model for supervised learning.
66 / 94
Empirical risk minimization in pictures
Red dots: data points.
Affine hyperplane: linear function ŵ (via affine expansion (x1, x2) ↦ (1, x1, x2)).

ERM: minimize sum of squared vertical lengths from hyperplane to points.
67 / 94
Empirical risk minimization in matrix notation
Define n× d matrix A and n× 1 column vector b by
A := (1/√n) [ ← x1^T → ; . . . ; ← xn^T → ],    b := (1/√n) (y1, . . . , yn)^T.

Can write empirical risk as

R̂(w) = ‖Aw − b‖_2^2.

Necessary condition for w to be a minimizer of R̂:

∇R̂(w) = 0, i.e., w is a critical point of R̂.

This translates to

(A^T A) w = A^T b,

a system of linear equations called the normal equations.

It can be proved that every critical point of R̂ is a minimizer of R̂.
68 / 94
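A small sketch (not from the slides; synthetic data) building A and b as defined above and checking that ‖Aw − b‖_2^2 equals the empirical risk:

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
xs, ys, w = rng.normal(size=(n, d)), rng.normal(size=n), rng.normal(size=d)

A = xs / np.sqrt(n)
b = ys / np.sqrt(n)
lhs = np.sum((A @ w - b) ** 2)          # ||Aw - b||^2
rhs = np.mean((xs @ w - ys) ** 2)       # (1/n) sum_i (x_i^T w - y_i)^2
print(np.isclose(lhs, rhs))             # True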
Aside: Convexity
Let f : Rd → R be a differentiable function.
Suppose we find x ∈ Rd such that ∇f(x) = 0. Is x a minimizer of f?
Yes, if f is a convex function:
f((1− t)x+ tx′) ≤ (1− t)f(x) + tf(x′),
for any 0 ≤ t ≤ 1 and any x,x′ ∈ Rd.
69 / 94
Convexity of empirical risk
Checking convexity of g(x) = ‖Ax − b‖_2^2:

g((1 − t)x + tx′)
  = ‖(1 − t)(Ax − b) + t(Ax′ − b)‖_2^2
  = (1 − t)^2 ‖Ax − b‖_2^2 + t^2 ‖Ax′ − b‖_2^2 + 2(1 − t)t (Ax − b)^T (Ax′ − b)
  = (1 − t) ‖Ax − b‖_2^2 + t ‖Ax′ − b‖_2^2
      − (1 − t)t [ ‖Ax − b‖_2^2 + ‖Ax′ − b‖_2^2 ] + 2(1 − t)t (Ax − b)^T (Ax′ − b)
  ≤ (1 − t) ‖Ax − b‖_2^2 + t ‖Ax′ − b‖_2^2,

where the last step uses the Cauchy-Schwarz inequality and the arithmetic mean/geometric mean (AM/GM) inequality.
70 / 94
Convexity of empirical risk, another way
Preview of convex analysis
Recall R̂(w) = (1/n) ∑_{i=1}^n (xi^T w − yi)^2.
I Scalar function g(z) = cz2 is convex for any c ≥ 0.
I Composition (g ◦ a) : Rd → R of any convex function g : R → R and any affine function a : Rd → R is convex.
I Therefore, function w ↦ (1/n) (xi^T w − yi)^2 is convex.
I Sum of convex functions is convex.
I Therefore R is convex.
Convexity is a useful mathematical property to understand!
(We’ll study more convex analysis in a few weeks.)
71 / 94
Algorithm for ERM
Algorithm for ERM with linear functions and squared loss†
input Data (x1, y1), . . . , (xn, yn) from R^d × R.
output Linear function w ∈ R^d.
1: Find solution w to the normal equations defined by the data(using, e.g., Gaussian elimination).
2: return w.
†Also called “ordinary least squares” in this context.
Running time (dominated by Gaussian elimination): O(nd2).Note: there are many approximate solvers that run in nearly linear time!
72 / 94
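A sketch of this algorithm in numpy (not from the slides; synthetic data): form the normal equations (A^T A) w = A^T b and solve the d × d system directly.

import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
xs = rng.normal(size=(n, d))
ys = xs @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

A = xs / np.sqrt(n)
b = ys / np.sqrt(n)
w_hat = np.linalg.solve(A.T @ A, A.T @ b)   # Gaussian elimination on the d x d system

# Same answer as a library least-squares solver (up to numerical error).
w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(w_hat, w_lstsq))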
Geometric interpretation of least squares ERM
Let aj ∈ Rn be the j-th column of matrix A ∈ Rn×d, so
A = [ a1 | a2 | · · · | ad ].

Minimizing ‖Aw − b‖_2^2 is the same as finding the vector b̂ ∈ range(A) closest to b.

Solution b̂ is the orthogonal projection of b onto range(A) = {Aw : w ∈ R^d}.

[Figure: b, its projection b̂, and the plane spanned by columns a1, a2.]

I b̂ is uniquely determined.

I If rank(A) < d, then there is more than one way to write b̂ as a linear combination of a1, . . . , ad.

If rank(A) < d, then ERM solution is not unique.

To get ŵ from b̂: solve the system of linear equations Aw = b̂.
73 / 94
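A small numerical illustration of the projection view (a sketch added here; random data): the least-squares fit Aŵ equals the orthogonal projection of b onto range(A), and the residual is orthogonal to range(A).

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)

w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ w_hat

# Projection via an orthonormal basis Q for range(A).
Q, _ = np.linalg.qr(A)
b_proj = Q @ (Q.T @ b)

print(np.allclose(b_hat, b_proj))          # same vector
print(np.allclose(A.T @ (b - b_hat), 0))   # residual is orthogonal to range(A)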
Statistical interpretation of ERM
Let (X, Y) ∼ P, where P is some distribution on R^d × R.
Which w have smallest risk R(w) = E[(X^T w − Y)^2]?
Necessary condition for w to be a minimizer of R:
∇R(w) = 0, i.e., w is a critical point of R.
This translates to E[XX^T] w = E[Y X],
a system of linear equations called the population normal equations.
It can be proved that every critical point of R is a minimizer of R.
Looks familiar?
If (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, then
E[ATA] = E[XXT] and E[ATb] = E[YX],
so ERM can be regarded as a plug-in estimator for a minimizer of R.
74 / 94
14. Risk, empirical risk, and estimating risk
Risk of ERM
IID model: (X1, Y1), . . . , (Xn, Yn), (X, Y ) are iid, taking values in Rd × R.
Let w* be a minimizer of R over all w ∈ R^d, i.e., w* satisfies the population normal equations

E[XX^T] w* = E[Y X].

I If ERM solution ŵ is not unique (e.g., if n < d), then R(ŵ) can be arbitrarily worse than R(w*).

I What about when ERM solution is unique?

Theorem. Under mild assumptions on distribution of X,

R(ŵ) − R(w*) = O( tr(cov(εW)) / n )

“asymptotically”, where W := E[XX^T]^{−1/2} X and ε := Y − X^T w*.
75 / 94
Risk of ERM analysis (rough sketch)
Let εi := Yi − Xi^T w* for each i = 1, . . . , n, so

E[εi Xi] = E[Yi Xi] − E[Xi Xi^T] w* = 0

and

√n (ŵ − w*) = ( (1/n) ∑_{i=1}^n Xi Xi^T )^{−1} · (1/√n) ∑_{i=1}^n εi Xi.

1. By LLN: (1/n) ∑_{i=1}^n Xi Xi^T →p E[XX^T].

2. By CLT: (1/√n) ∑_{i=1}^n εi Xi →d cov(εX)^{1/2} Z, where Z ∼ N(0, I).

Therefore, asymptotic distribution of √n(ŵ − w*) is

√n (ŵ − w*) →d E[XX^T]^{−1} cov(εX)^{1/2} Z.

A few more steps gives

n ( E[(X^T ŵ − Y)^2] − E[(X^T w* − Y)^2] ) →d ‖ E[XX^T]^{−1/2} cov(εX)^{1/2} Z ‖_2^2.

Random variable on RHS is “concentrated” around its mean tr(cov(εW)).
76 / 94
Risk of ERM: postscript
I Analysis does not assume that the linear regression model is “correct”; the data distribution need not be from the normal linear regression model.
I Only assumptions are those needed for LLN and CLT to hold.
I However, if normal linear regression model holds, i.e.,
Y | X = x ∼ N(x^T w*, σ^2),

then the bound from the theorem becomes

R(ŵ) − R(w*) = O( σ^2 d / n ),
which is familiar to those who have taken introductory statistics.
I With more work, can also prove non-asymptotic risk bound of similar form.
I In homework/reading, we look at a simpler setting for studying ERM for linear regression, called “fixed design”.
77 / 94
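A small simulation of the bound above (an illustrative sketch; the model, constants, and trial counts are made up for the example): under the normal linear regression model with X ∼ N(0, I), the excess risk of ERM equals ‖ŵ − w*‖^2 and behaves roughly like σ^2 d / n.

import numpy as np

rng = np.random.default_rng(3)
d, sigma = 10, 0.5
w_star = rng.normal(size=d)

def excess_risk_of_erm(n, trials=200):
    vals = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        y = X @ w_star + sigma * rng.normal(size=n)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        # For X ~ N(0, I): R(w) - R(w*) = ||w - w*||^2.
        vals.append(np.sum((w_hat - w_star) ** 2))
    return np.mean(vals)

for n in [50, 100, 200, 400]:
    print(n, excess_risk_of_erm(n), sigma ** 2 * d / n)  # the two columns are roughly comparable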
Risk vs empirical risk
Let ŵ be the ERM solution.

1. Empirical risk of ERM: R̂(ŵ)

2. True risk of ERM: R(ŵ)

Theorem. E[ R̂(ŵ) ] ≤ E[ R(ŵ) ].
(Empirical risk can sometimes be larger than true risk, but not on average.)
Overfitting: empirical risk is “small”, but true risk is “much higher”.
78 / 94
Overfitting example
(X1, Y1), . . . , (Xn, Yn), (X,Y ) are iid; X is continuous random variable in R.
Suppose we use degree-k polynomial expansion

φ(x) = (1, x, x^2, . . . , x^k), x ∈ R,

so dimension is d = k + 1.

Fact: Any function on ≤ k + 1 points can be interpolated by a polynomial of degree at most k.

[Figure: data points (x ∈ [0, 1], y ∈ [−3, 3]) interpolated exactly by a degree-k polynomial.]

Conclusion: If n ≤ k + 1 = d, ERM solution ŵ with this feature expansion has R̂(ŵ) = 0 always, regardless of its true risk (which can be ≫ 0).
79 / 94
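A sketch of this overfitting effect (added for illustration; the data is pure noise, made up for the example): with n = k + 1 points and a degree-k expansion, ERM interpolates the training data, so training risk is ~0 while test risk stays large.

import numpy as np

rng = np.random.default_rng(4)
n, k = 10, 9                      # n = k + 1: enough parameters to interpolate
x_train = rng.uniform(size=n)
y_train = rng.normal(size=n)      # labels are pure noise: nothing to learn

Phi = np.vander(x_train, k + 1)   # columns x^k, ..., x, 1
w_hat, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

train_risk = np.mean((Phi @ w_hat - y_train) ** 2)
x_test, y_test = rng.uniform(size=1000), rng.normal(size=1000)
test_risk = np.mean((np.vander(x_test, k + 1) @ w_hat - y_test) ** 2)
print(train_risk, test_risk)      # ~0 vs. much larger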
Estimating risk
IID model: (X1, Y1), . . . , (Xn, Yn), (X̃1, Ỹ1), . . . , (X̃m, Ỹm) ∼ iid P.

I training data (X1, Y1), . . . , (Xn, Yn) used to learn f̂.

I test data (X̃1, Ỹ1), . . . , (X̃m, Ỹm) used to estimate risk, via test risk

R̂test(f̂) := (1/m) ∑_{i=1}^m ( f̂(X̃i) − Ỹi )^2.

I Training data is independent of test data, so f̂ is independent of test data.

I Let Li := ( f̂(X̃i) − Ỹi )^2 for each i = 1, . . . , m, so

E[ R̂test(f̂) | f̂ ] = (1/m) ∑_{i=1}^m E[ Li | f̂ ] = R(f̂).

I Moreover, L1, . . . , Lm are conditionally iid given f̂, and hence by Law of Large Numbers,

R̂test(f̂) →p R(f̂) as m → ∞.

I By CLT, the rate of convergence is m^{−1/2}.
80 / 94
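A sketch of risk estimation with held-out data (synthetic data for illustration; in practice the split comes from your actual dataset):

import numpy as np

rng = np.random.default_rng(5)
d = 3
w_star = rng.normal(size=d)

def sample(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_star + 0.2 * rng.normal(size=n)

X_train, y_train = sample(200)
X_test, y_test = sample(10_000)          # larger m gives a more accurate estimate

w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
test_risk = np.mean((X_test @ w_hat - y_test) ** 2)
print(test_risk)   # close to the noise variance 0.2**2 = 0.04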
Rates for risk minimization vs. rates for risk estimation
One may think that ERM “works” because, somehow, training risk is a good “plug-in” estimate of true risk.

I Sometimes this is partially true; we’ll revisit this when we discuss generalization theory.

Roughly speaking, under some assumptions, can expect that

| R̂(w) − R(w) | ≤ O( √(d / n) ) for all w ∈ R^d.

However . . .

I By CLT, we know the following holds for a fixed w:

R̂(w) →p R(w) at n^{−1/2} rate.

(Here, we ignore the dependence on d.)

I Yet, for ERM ŵ,

R(ŵ) →p R(w*) at n^{−1} rate.

(Also ignoring dependence on d.)

Implication: Selecting a good predictor can be “easier” than estimating how good predictors are!
81 / 94
Old Faithful example
I Linear regression model + affine expansion on “duration of last eruption”.
I Learn w = (35.0929, 10.3258) from 136 past observations.
I Mean squared loss of w on next 136 observations is 35.9404.
(Recall: mean squared loss of µ = 70.7941 was 187.1894.)
[Figure: Old Faithful data (duration of last eruption vs. time until next eruption) with the fitted linear model and the constant prediction.]

(Unfortunately, √35.9 ≈ 6.0 > mean duration ≈ 3.5.)
82 / 94
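A sketch of the fit above (added for illustration): `durations` and `waits` stand in for the Old Faithful columns; the values below are made-up placeholders, and loading the real data is left to the reader. The numbers on the slide came from the actual 136 past observations.

import numpy as np

durations = np.array([3.6, 1.8, 3.3, 2.3, 4.5, 2.9])   # placeholder values
waits = np.array([79.0, 54.0, 74.0, 62.0, 85.0, 68.0])  # placeholder values

Phi = np.column_stack([np.ones_like(durations), durations])  # affine expansion (1, x)
w_hat, *_ = np.linalg.lstsq(Phi, waits, rcond=None)
print(w_hat)                                 # (intercept, slope)
print(np.mean((Phi @ w_hat - waits) ** 2))   # training mean squared loss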
15. Regularization
Inductive bias
Suppose ERM solution is not unique. What should we do?
One possible answer: Pick the w of shortest length.
I Fact: The shortest solution w to (A^T A)w = A^T b is always unique.

I Fact: the OLS solution A⁺b (via the pseudoinverse A⁺) is the least norm solution.
Why should this be a good idea?
I Data does not give reason to choose a shorter w over a longer w.
I The preference for shorter w is an inductive bias: it will work well for some problems (e.g., when “true” w* is short), not for others.
All learning algorithms encode some kind of inductive bias.
83 / 94
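A sketch of the least-norm fact (added for illustration; random underdetermined data): when rank(A) < d there are many solutions, and np.linalg.pinv returns the one with smallest Euclidean norm.

import numpy as np

rng = np.random.default_rng(6)
n, d = 5, 10                       # underdetermined: n < d
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

w_min_norm = np.linalg.pinv(A) @ b
w_other = w_min_norm + np.linalg.svd(A)[2][-1]   # add a vector from the null space of A

print(np.allclose(A @ w_min_norm, A @ w_other))             # both fit the data exactly
print(np.linalg.norm(w_min_norm) < np.linalg.norm(w_other)) # but the pinv solution is shorter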
Example
ERM with scaled trigonometric feature expansion:
φ(x) = (1, sin(x), cos(x), (1/2) sin(2x), (1/2) cos(2x), (1/3) sin(3x), (1/3) cos(3x), . . . ).

[Figures across the slide builds: the training data, an arbitrary ERM fit, and the least ℓ2 norm ERM fit.]

It is not a given that the least norm ERM is better than the other ERM!
84 / 94
Regularized ERM
Combine the two concerns: For a given λ ≥ 0, find minimizer of
R̂(w) + λ‖w‖_2^2
over w ∈ Rd.
Fact: If λ > 0, then the solution is always unique (even if n < d)!
I This is called ridge regression.
(λ = 0 is ERM / Ordinary Least Squares.)
I Parameter λ controls how much attention is paid to the regularizer ‖w‖_2^2 relative to the data fitting term R̂(w).
I Choose λ using cross-validation.
85 / 94
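A sketch of ridge regression in numpy (added for illustration; random data): the minimizer of R̂(w) + λ‖w‖_2^2 solves (A^T A + λI) w = A^T b, which is solvable even when A^T A is singular (e.g., n < d).

import numpy as np

def ridge(xs, ys, lam):
    n, d = xs.shape
    A, b = xs / np.sqrt(n), ys / np.sqrt(n)
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

rng = np.random.default_rng(7)
xs = rng.normal(size=(5, 10))          # fewer samples than features
ys = rng.normal(size=5)
for lam in [0.01, 0.1, 1.0]:
    print(lam, np.linalg.norm(ridge(xs, ys, lam)))  # larger lambda gives a shorter w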
Another interpretation of ridge regression
Define (n + d) × d matrix A and (n + d) × 1 column vector b by

A := (1/√n) [ x1^T ; . . . ; xn^T ; √(nλ) e1^T ; . . . ; √(nλ) ed^T ],    b := (1/√n) (y1, . . . , yn, 0, . . . , 0)^T,

i.e., the original data rows followed by the d × d block √(nλ) · I, with label 0 for each appended row.

Then

‖Aw − b‖_2^2 = R̂(w) + λ‖w‖_2^2.
Interpretation:
I d “fake” data points; ensure that augmented data matrix A has rank d.
I Squared length of each “fake” feature vector is nλ.
All corresponding labels are 0.
I Prediction of w on i-th fake feature vector is √(nλ) · wi.
86 / 94
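A quick numerical check of this interpretation (added for illustration; random data): least squares on the augmented data gives exactly the closed-form ridge solution.

import numpy as np

rng = np.random.default_rng(8)
n, d, lam = 30, 6, 0.5
xs, ys = rng.normal(size=(n, d)), rng.normal(size=n)

# Augmented data: original rows plus sqrt(n*lam) * I with zero labels.
X_aug = np.vstack([xs, np.sqrt(n * lam) * np.eye(d)])
y_aug = np.concatenate([ys, np.zeros(d)])
w_aug, *_ = np.linalg.lstsq(X_aug / np.sqrt(n), y_aug / np.sqrt(n), rcond=None)

# Closed-form ridge on the original data.
A, b = xs / np.sqrt(n), ys / np.sqrt(n)
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
print(np.allclose(w_aug, w_ridge))  # True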
Regularization with a different norm
Lasso: For a given λ ≥ 0, find minimizer of
R̂(w) + λ‖w‖_1

over w ∈ R^d. Here, ‖v‖_1 = ∑_{i=1}^d |vi| is the ℓ1-norm.

I Prefers shorter w, but using a different notion of length than ridge.

I Tends to produce ŵ that are sparse—i.e., have few non-zero coordinates—or at least well-approximated by sparse vectors.

Fact: Vectors with small ℓ1-norm are well-approximated by sparse vectors.

If w̃ contains just the 1/ε^2 largest coefficients (by magnitude) of w, then

‖w − w̃‖_2 ≤ ε ‖w‖_1.
87 / 94
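One standard way to compute the lasso solution is proximal gradient descent (ISTA); below is a minimal sketch (added for illustration, not the only or official algorithm; synthetic data, and the step size and iteration count are just reasonable defaults).

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(xs, ys, lam, iters=2000):
    """Minimize (1/n)||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent."""
    n, d = xs.shape
    step = n / (2 * np.linalg.norm(xs, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(iters):
        grad = (2.0 / n) * xs.T @ (xs @ w - ys)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(9)
xs = rng.normal(size=(100, 20))
w_sparse = np.zeros(20); w_sparse[:3] = [3.0, -2.0, 1.5]
ys = xs @ w_sparse + 0.1 * rng.normal(size=100)
print(np.nonzero(lasso_ista(xs, ys, lam=0.1))[0])  # mostly the first three coordinates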
Sparse approximations
Claim: If w̃ contains just the T largest coefficients (by magnitude) of w, then

‖w − w̃‖_2 ≤ ‖w‖_1 / √(T + 1).

WLOG |w1| ≥ |w2| ≥ · · · , so w̃ = (w1, . . . , wT, 0, . . . , 0). Then

‖w − w̃‖_2^2 = ∑_{i ≥ T+1} wi^2
            ≤ ∑_{i ≥ T+1} |wi| · |w_{T+1}|
            ≤ ‖w‖_1 · |w_{T+1}|
            ≤ ‖w‖_1 · ‖w‖_1 / (T + 1),

where the last step uses (T + 1) |w_{T+1}| ≤ |w1| + · · · + |w_{T+1}| ≤ ‖w‖_1.

This is a consequence of “mismatch” between ℓ1- and ℓ2-norms.
Can get similar results for other ℓp norms.
88 / 94
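A quick numerical check of the claim above (added for illustration; the vector is random and made up for the example):

import numpy as np

rng = np.random.default_rng(10)
w = rng.laplace(size=1000)              # arbitrary dense vector
for T in [10, 50, 200]:
    thresh = np.sort(np.abs(w))[-T]
    w_tilde = np.where(np.abs(w) >= thresh, w, 0.0)   # keep the T largest entries
    lhs = np.linalg.norm(w - w_tilde)
    rhs = np.linalg.norm(w, 1) / np.sqrt(T + 1)
    print(T, lhs <= rhs)                # True for each T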
Example: Coefficient profile (`2 vs. `1)
Y = levels of prostate cancer antigen, X = clinical measurements

Horizontal axis: varying λ (large λ to left, small λ to right).
Vertical axis: coefficient value in ℓ2-regularized ERM and ℓ1-regularized ERM, for eight different variables.
89 / 94
Other approaches to sparse regression
I Subset selection:
Find w that minimizes empirical risk among all vectors with at most k non-zero entries.
Unfortunately, this seems to require time exponential in k.
I Greedy algorithms:
Repeatedly choose a new variable to “include” in the support of w until k variables are included.

Forward stepwise regression / Orthogonal matching pursuit (a sketch follows below).

Often works as well as ℓ1-regularized ERM.
Why do we care about sparsity?
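A minimal sketch of one such greedy method, orthogonal matching pursuit, on synthetic data (the helper name omp, the data, and k = 3 are illustrative assumptions, not part of the lecture):

    import numpy as np

    # Orthogonal matching pursuit: repeatedly add the column most correlated
    # with the current residual, then refit least squares on the chosen columns.
    def omp(X, y, k):
        n, d = X.shape
        support, w = [], np.zeros(d)
        residual = y.copy()
        for _ in range(k):
            correlations = np.abs(X.T @ residual)
            correlations[support] = -np.inf           # never re-pick a chosen column
            support.append(int(np.argmax(correlations)))
            w_s, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
            w = np.zeros(d)
            w[support] = w_s
            residual = y - X @ w
        return w

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    w_true = np.zeros(20)
    w_true[[3, 7, 11]] = [2.0, -1.5, 1.0]
    y = X @ w_true + rng.normal(0, 0.1, size=100)
    w_hat = omp(X, y, k=3)                            # typically recovers support {3, 7, 11}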
90 / 94
Key takeaways
1. IID model for supervised learning.
2. Optimal predictors, linear regression models, and optimal linear predictors.
3. Empirical risk minimization for linear predictors.
4. Risk of ERM; training risk vs. test risk; risk minimization vs. risk estimation.

5. Inductive bias, ℓ1- and ℓ2-regularization, sparsity.
Make sure you do the assigned reading, especially from the handouts!
91 / 94
misc
svd

pytorch/numpy; gpu; gpu errors. maybe even sgd. they'll use it in homework.

talk about regression and classification somewhere early on. can mention how to do it for dt and knn too i guess, though it's a little gross in this lecture?

before MLE slide, give a quick one-slide refresher/primer on MLE.

ridge and soln existence. for homework maybe prove λ → 0 gives svd?

daniel's 1/n. talk about loss functions

look at my old lec

svd topics: not unique; pseudoinverse equal inverse always; pseudoinverse always unique(?) or at least when inverse exists? talk about things it satisfies like XX^+X = X etc; “meaning” of the U, V matrices in svd; introduce svd via eigendecomposition
92 / 94
misc
logistic regression: optimize w ↦ (1/n) ∑_{i=1}^n ln(1 + exp(−y_i w^T x_i)).
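A minimal pytorch sketch of this objective on synthetic data, using autograd's .backward for the gradient (the step size, iteration count, and data below are placeholder choices):

    import torch
    import torch.nn.functional as F

    # Logistic loss (1/n) * sum_i log(1 + exp(-y_i <w, x_i>)), minimized by
    # plain gradient descent; softplus(z) = log(1 + exp(z)).
    n, d = 200, 5
    X = torch.randn(n, d)
    y = (torch.randint(0, 2, (n,)) * 2 - 1).float()   # labels in {-1, +1}
    w = torch.zeros(d, requires_grad=True)

    for _ in range(500):
        loss = F.softplus(-y * (X @ w)).mean()
        loss.backward()                               # autograd fills in w.grad
        with torch.no_grad():
            w -= 0.1 * w.grad
            w.grad.zero_()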
SVD solution for ols:
- write ‖Xw − y‖_2^2.
- normal equations (differentiate and set to zero): X^T X w = X^T y.
- writing X = USV^T, have V S^2 V^T w = V S U^T y.
- thus the pseudoinverse solution X^+ y = V S^+ U^T y satisfies the normal equations.

for homework maybe also suggest experiment with ridge regression (adding λ‖w‖^2/2).

for pytorch solver, can have them manually do gradient, and also use pytorch's .backward; see the sample code for lecture 1 (in the repository, not in the slides).

features: replace x_i with φ(x_i) where φ is some function. E.g., φ(x) = (1, x_1, . . . , x_d, x_1x_1, x_1x_2, . . . , x_1x_d, . . . , x_dx_d) means w^T φ(x) is a quadratic (and now we can search over all possible quadratics with our optimization).
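A minimal numpy sketch of the SVD route just described, on synthetic data (it assumes X has no zero singular values; in general, zero singular values are simply skipped when forming S^+):

    import numpy as np

    # Pseudoinverse solution w = V S^+ U^T y; it satisfies the normal equations
    # X^T X w = X^T y and matches np.linalg.lstsq.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 5))
    y = rng.standard_normal(100)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T
    w_svd = Vt.T @ ((U.T @ y) / s)                     # V S^+ U^T y
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    assert np.allclose(w_svd, w_lstsq)
    assert np.allclose(X.T @ (X @ w_svd), X.T @ y)     # normal equations hold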
93 / 94
16. Summary of linear regression so far
Main points
I Model/function/predictor class of linear regressors x ↦ w^T x.

I ERM principle: we choose a loss (least squares) and find a good predictor by minimizing empirical risk.

I ERM solution for least squares: pick w satisfying A^T A w = A^T b; this w is not unique in general, and one canonical choice is the ordinary least squares solution A^+ b.

I We also discussed feature expansion; affine and polynomial expansions are good to keep in mind! (A small sketch follows below.)
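A minimal sketch tying the last two points together, on synthetic data (the expansion φ below and the data are illustrative assumptions): quadratic feature expansion followed by the pseudoinverse solution A^+ b.

    import numpy as np

    # Affine + quadratic feature expansion, then ordinary least squares via the
    # pseudoinverse (the ERM solution A^+ b for the expanded design matrix A).
    rng = np.random.default_rng(0)
    x = rng.uniform(1.5, 5.0, size=50)              # scalar inputs, e.g. durations
    b = 30 + 10 * x + rng.normal(0, 3, size=50)     # noisy targets

    def phi(x):
        # map a vector of scalar inputs to rows (1, x, x^2)
        return np.stack([np.ones_like(x), x, x ** 2], axis=1)

    A = phi(x)
    w = np.linalg.pinv(A) @ b                       # ordinary least squares solution
    prediction = phi(np.array([3.0])) @ w           # predict at a new input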
94 / 94