
Lecture 15 Regularized Linear Regression


Page 1: Lecture 15 Regularized Linear Regression

Nakul Gopalan

Georgia Tech

Regularized Linear Regression

Machine Learning CS 4641

These slides are adapted from slides by Andrew Zisserman, Jonathan Taylor, Chao Zhang, Mahdi Roozbahani, and Yaser Abu-Mostafa.

Page 2: Lecture 15 Regularized Linear Regression

Recap

• Linear regression: $y = X\theta$

• Mean squared error (MSE) loss
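A minimal NumPy sketch of this recap, fitting $y = X\theta$ by least squares on hypothetical toy data and reporting the MSE:

```python
import numpy as np

# Hypothetical toy data: y is roughly linear in x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=x.shape)

# Design matrix with a bias column, so y ≈ X @ theta.
X = np.column_stack([np.ones_like(x), x])

# Least-squares fit: theta minimizes (1/N) * ||X @ theta - y||^2.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

mse = np.mean((X @ theta - y) ** 2)
print(theta, mse)
```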

Page 3: Lecture 15 Regularized Linear Regression

Polynomial regression

Page 4: Lecture 15 Regularized Linear Regression

Polynomial regression when the order is not known

Page 5: Lecture 15 Regularized Linear Regression

Outline

• Overfitting and regularized learning

• Ridge regression

• Lasso regression

• Determining regularization strength


Page 6: Lecture 15 Regularized Linear Regression

Regression: Recap


Page 7: Lecture 15 Regularized Linear Regression

Regression: Recap

$z = [1, x, x^2, \dots, x^d] \in \mathbb{R}^{d+1}$

$y = z\theta$
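A small sketch of this feature map using NumPy's `np.vander` (the inputs and degree are arbitrary choices for illustration):

```python
import numpy as np

def poly_features(x, d):
    """Map each scalar x to z = [1, x, x^2, ..., x^d] (d+1 features)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=d + 1, increasing=True)

x = np.array([0.0, 0.5, 1.0])
Z = poly_features(x, d=3)  # shape (3, 4): columns are 1, x, x^2, x^3
print(Z)
```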

Page 8: Lecture 15 Regularized Linear Regression
Page 9: Lecture 15 Regularized Linear Regression
Page 10: Lecture 15 Regularized Linear Regression

Which One is Better?

[Figure: polynomial fits of degree D = 0, 1, 3, and 9 on the same data]

Is the highest-degree fit always better? No, this can lead to overfitting!

Page 11: Lecture 15 Regularized Linear Regression

The Overfitting Problem

• The training error is very low, but the error on the test set is large.

• The model captures not only patterns but also noisy nuisances in the training data.

[Figure: degree D = 9 polynomial fit passing through every training point]

Page 12: Lecture 15 Regularized Linear Regression

The Overfitting Problem

• In regression, overfitting is often associated with large weights (severe oscillation).

• How can we address overfitting?

[Figure: degree D = 9 fit with large, oscillating weights]

Page 13: Lecture 15 Regularized Linear Regression

Regularization (a smart way to cure the overfitting disease)

Fit a line to a sinusoid using just two data points.

Put a brake on the fitting.

Page 14: Lecture 15 Regularized Linear Regression

Who is the winner?

$\bar{g}(x)$: the average over all fitted lines

Unregularized fit: bias = 0.21, var = 1.69. Regularized fit: bias = 0.23, var = 0.33. The regularized fit wins: nearly the same bias with far lower variance.

Page 15: Lecture 15 Regularized Linear Regression

Regularized Learning

Minimize $E(\theta) + \frac{\lambda}{N}\theta^T\theta$

Why does this term lead to regularization of the parameters?

Page 16: Lecture 15 Regularized Linear Regression

Polynomial Model

$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_d x^d + \epsilon$

Let's rewrite it as:

$y = \theta_0 + \theta_1 z_1 + \theta_2 z_2 + \dots + \theta_d z_d + \epsilon = z\theta$

Page 17: Lecture 15 Regularized Linear Regression

Regularizing is just constraining the weights ($\theta$)

For example, let's apply a hard constraint:

$y = \theta_0 + \theta_1 z_1 + \theta_2 z_2 + \dots + \theta_d z_d$

subject to $\theta_j = 0$ for $j > 2$, which gives

$y = \theta_0 + \theta_1 z_1 + \theta_2 z_2 + 0 + \dots + 0$

Page 18: Lecture 15 Regularized Linear Regression

Let's not penalize $\theta$ in such a harsh way; let's cut them some slack.

$\hat{\theta} = \arg\min_\theta E(\theta), \qquad E(\theta) = \frac{1}{N}\sum_{i=1}^{N}(y_i - z_i\theta)^2$

Minimize $\frac{1}{N}(z\theta - y)^T(z\theta - y)$

Subject to $\theta^T\theta \le C$

For simplicity, let's call $\theta_{lin}$ the weight solution of the unconstrained problem and $\theta$ the solution of the constrained model.

Page 19: Lecture 15 Regularized Linear Regression

$E(\theta) = \frac{1}{N}(z\theta - y)^T(z\theta - y)$

[Figure: surface of $E(\theta)$ for different values of $\theta_0$ and $\theta_1$ given the observed data: 3D view and top view]

Page 20: Lecture 15 Regularized Linear Regression

Gradient of $\theta^T\theta$

$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} \;\Rightarrow\; \theta^T\theta = \theta_0^2 + \theta_1^2$

If you imagine standing at a point $(\theta_0, \theta_1)$, $\nabla(\theta^T\theta)$ tells you which direction you should travel to increase the value of $\theta^T\theta$ most rapidly.

$\nabla(\theta^T\theta) = \begin{bmatrix} \frac{\partial}{\partial\theta_0}\theta^T\theta \\ \frac{\partial}{\partial\theta_1}\theta^T\theta \end{bmatrix} = \begin{bmatrix} 2\theta_0 \\ 2\theta_1 \end{bmatrix} \propto \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$

$\nabla(\theta^T\theta)$ is a vector field: it points radially outward, along any line passing through the center of the circle.

Page 21: Lecture 15 Regularized Linear Regression

Plotting the regularization term $\theta^T\theta$

$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} \;\Rightarrow\; \theta^T\theta = \theta_0^2 + \theta_1^2$

[Figure: the $\theta^T\theta$ surface: 3D view and top view with circular level sets]

Page 22: Lecture 15 Regularized Linear Regression

$E(\theta) = \frac{1}{N}(z\theta - y)^T(z\theta - y)$

$\theta_{lin}$ is the unconstrained solution (the absolute minimum). $E(\theta)$ is constant on the surface of each ellipsoid, so its level sets are ellipses in the $(\theta_0, \theta_1)$ plane.

Find the solution that minimizes $E(\theta)$ subject to $\theta^T\theta \le C$.

Page 23: Lecture 15 Regularized Linear Regression

Constraint and Loss

[Figure: elliptical level sets of $E(\theta)$ around $\theta_{lin}$ in the $(\theta_0, \theta_1)$ plane, together with the circular constraint region]

Page 24: Lecture 15 Regularized Linear Regression

Considering the $E(\theta)$ and $C$ below, what is a candidate $\theta$ here?

[Figure: elliptical contours of $E(\theta)$ around $\theta_{lin}$ and the constraint circle $\theta^T\theta = C$]

$\nabla E(\theta)$: the gradient of the objective function, which is orthogonal to the ellipse (the level set); changes in the error happen in the orthogonal direction.

What is the orthogonal direction on the other surface? It is just $\theta$: $\nabla(\theta^T\theta)$ points along a line passing through the center of the circle.

Applying the constraint $\theta^T\theta \le C$, where does the best solution occur? On the boundary of the circle, as that is the feasible point closest to the unconstrained minimum.

Page 25: Lecture 15 Regularized Linear Regression

Considering the $E(\theta)$ and $C$ below, what is the new $\theta$ solution here?

At the constrained optimum, the two gradients are anti-parallel:

$\nabla E(\theta) \propto -\theta$

Writing the proportionality constant as $-\frac{2\lambda}{N}$:

$\nabla E(\theta) = -\frac{2\lambda}{N}\theta \quad\Rightarrow\quad \nabla E(\theta) + \frac{2\lambda}{N}\theta = 0$

Integrating this condition, it is exactly the stationarity condition of the unconstrained problem:

Minimize $E(\theta) + \frac{\lambda}{N}\theta^T\theta$

As $C$ increases, the corresponding $\lambda$ decreases: $C \uparrow \;\Leftrightarrow\; \lambda \downarrow$.
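To make this concrete, here is a small numerical check on synthetic data that the ridge solution (derived in closed form on a later slide) satisfies exactly this stationarity condition; the data and λ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 4
Z = rng.normal(size=(N, D))
y = rng.normal(size=N)
lam = 0.5

# Ridge solution: theta = (Z^T Z + lam * I)^{-1} Z^T y.
theta = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)

# Gradient of E(theta) = (1/N) * (Z theta - y)^T (Z theta - y).
grad_E = (2.0 / N) * Z.T @ (Z @ theta - y)

# Optimality condition from this slide: grad E(theta) + (2 lam / N) theta = 0.
print(np.allclose(grad_E + (2.0 * lam / N) * theta, 0.0))  # True
```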

Page 26: Lecture 15 Regularized Linear Regression

Outline

• Overfitting and regularized learning

• Ridge regression

• Lasso regression

• Determining regularization strength


Page 27: Lecture 15 Regularized Linear Regression

Ridge Regression

$f(x, \theta) = \theta_0 + \theta_1 z_1 + \theta_2 z_2 + \dots + \theta_d z_d + \epsilon = z\theta$

$\hat{\theta} = \arg\min_\theta \frac{1}{N}\sum_{i=1}^{N}(y_i - z_i\theta)^2 + \frac{\lambda}{N}\|\theta\|^2$

Page 28: Lecture 15 Regularized Linear Regression

Solving for the Weights $\theta$

Stack the feature vectors of the $n$ samples into the design matrix $z$ and the $d{+}1$ weights into $\theta$:

$z = \begin{bmatrix} 1 & z_1(x_1) & \cdots & z_d(x_1) \\ 1 & z_1(x_2) & \cdots & z_d(x_2) \\ \vdots & \vdots & & \vdots \\ 1 & z_1(x_n) & \cdots & z_d(x_n) \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}, \qquad z\theta = \begin{bmatrix} z(x_1)\theta \\ z(x_2)\theta \\ \vdots \\ z(x_n)\theta \end{bmatrix}$

Page 29: Lecture 15 Regularized Linear Regression

$\tilde{E}(\theta) = \frac{1}{N}\sum_{i=1}^{N}(y_i - z_i\theta)^2 + \frac{\lambda}{N}\|\theta\|^2 = \frac{1}{N}(z\theta - y)^T(z\theta - y) + \frac{\lambda}{N}\theta^T\theta$

Setting the derivative with respect to $\theta$ to zero:

$\frac{d\tilde{E}}{d\theta} \propto -z^T(y - z\theta) + \lambda\theta = 0$

$(z^Tz + \lambda I)\,\theta = z^Ty$

$\theta = (z^Tz + \lambda I)^{-1}z^Ty$

Page 30: Lecture 15 Regularized Linear Regression

With dimensions $\underbrace{\theta}_{D\times 1} = \underbrace{(z^Tz + \lambda I)^{-1}}_{D\times D}\,\underbrace{z^T}_{D\times N}\,\underbrace{y}_{N\times 1}$, assuming $N > D$.

When $\lambda = 0$, this reduces to the ordinary least-squares solution via the pseudo-inverse $z^+$:

$\theta = (z^Tz)^{-1}z^Ty = z^+y$
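A minimal sketch of the closed form above on synthetic data with $N > D$, verifying that $\lambda = 0$ recovers the pseudo-inverse solution:

```python
import numpy as np

def ridge_solve(Z, y, lam):
    """Closed-form ridge solution: (Z^T Z + lam * I)^{-1} Z^T y."""
    D = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)

rng = np.random.default_rng(2)
Z = rng.normal(size=(30, 5))  # N = 30 > D = 5, as on the slide
y = rng.normal(size=30)

# With lam = 0 this reduces to ordinary least squares, i.e. the pseudo-inverse.
print(np.allclose(ridge_solve(Z, y, 0.0), np.linalg.pinv(Z) @ y))  # True
```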

Page 31: Lecture 15 Regularized Linear Regression

Ridge Regression Example

$f(x, \theta) = z\theta$, where $z : x \mapsto z$ maps each input to $D{+}1$ features.

$\tilde{E}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\big(f(x_i, \theta) - y_i\big)^2 + \frac{\lambda}{N}\|\theta\|^2$
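One way to run an example like this is with scikit-learn; the data and regularization values below are hypothetical, and scikit-learn's `alpha` plays the role of λ up to scaling:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: noisy samples of a sine curve.
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 15))[:, None]
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=15)

# Degree-7 polynomial features with increasing regularization strength;
# a tiny alpha is effectively unregularized.
for alpha in [1e-6, 1e-2, 1.0]:
    model = make_pipeline(PolynomialFeatures(degree=7), Ridge(alpha=alpha))
    model.fit(x, y)
    coefs = model.named_steps["ridge"].coef_
    print(f"alpha={alpha}: max |theta| = {np.abs(coefs).max():.2f}")
```

Larger alpha shrinks the weights, taming the oscillations seen in the unregularized high-degree fits.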

Page 32: Lecture 15 Regularized Linear Regression

[Figure: ridge regression fit with polynomial degree D = 7]

Page 33: Lecture 15 Regularized Linear Regression

[Figure: ridge regression fits with polynomial degrees D = 3 and D = 5]

Page 34: Lecture 15 Regularized Linear Regression

Outline

• Overfitting and regularized learning

• Ridge regression

• Lasso regression

• Determining regularization strength


Page 35: Lecture 15 Regularized Linear Regression

Regularized Regression


Now let’s look at another regularization choice.


Page 36: Lecture 15 Regularized Linear Regression

The Lasso Regularization (L1 norm)

$\hat{\theta} = \arg\min_\theta \frac{1}{N}\sum_{i=1}^{N}(y_i - z_i\theta)^2 + \frac{\lambda}{N}\|\theta\|_1$

Page 37: Lecture 15 Regularized Linear Regression

Let's say we have two parameters ($\theta_0$ and $\theta_1$):

Minimize $E(\theta) = \frac{1}{N}(z\theta - y)^T(z\theta - y)$, which is constant on the surface of each ellipsoid around $\theta_{lin}$,

subject to $\|\theta\|_1 \le C$.

The L1 constraint region is a diamond with corners at $\pm C$ on the axes. Because of these sharp edges, the solution often lands on a corner such as $\theta = \begin{bmatrix} \theta_0 \\ 0 \end{bmatrix}$, where some coefficients are exactly zero. This makes lasso an interesting mechanism for feature selection.
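A quick sketch of this sparsity effect on synthetic data where only two of ten features matter (the `alpha` value is arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: y depends only on features 0 and 3.
rng = np.random.default_rng(4)
Z = rng.normal(size=(100, 10))
y = 3.0 * Z[:, 0] - 2.0 * Z[:, 3] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(Z, y)
ridge = Ridge(alpha=0.1).fit(Z, y)

# Lasso drives most coefficients exactly to zero (feature selection),
# while ridge only shrinks them toward zero.
print("lasso nonzeros:", np.sum(lasso.coef_ != 0))
print("ridge nonzeros:", np.sum(ridge.coef_ != 0))
```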

Page 38: Lecture 15 Regularized Linear Regression

Outline

• Overfitting and regularized learning

• Ridge regression

• Lasso regression

• Determining regularization strength


Page 39: Lecture 15 Regularized Linear Regression

Leave-One-Out Cross Validation


Page 40: Lecture 15 Regularized Linear Regression

K-Fold Cross Validation

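A minimal illustration of both split strategies using scikit-learn (the tiny dataset is just for display):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(6)[:, None]

# Leave-one-out: N folds, each holding out a single point.
print(sum(1 for _ in LeaveOneOut().split(X)))  # 6 splits

# K-fold: each point lands in exactly one validation fold.
for train_idx, val_idx in KFold(n_splits=3).split(X):
    print("train:", train_idx, "val:", val_idx)
```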

Page 41: Lecture 15 Regularized Linear Regression


Choosing λ Using Validation Dataset

Pick the λ with the lowest mean RMSE computed by the cross-validation approach.
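A sketch of this selection procedure, assuming ridge regression, synthetic data, and a hypothetical λ grid; scikit-learn returns negated RMSE, so we flip the sign before averaging:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
Z = rng.normal(size=(80, 8))
y = Z @ rng.normal(size=8) + rng.normal(scale=0.5, size=80)

# Evaluate a grid of lambdas; pick the one with the lowest mean CV RMSE.
lambdas = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
mean_rmse = []
for lam in lambdas:
    scores = cross_val_score(Ridge(alpha=lam), Z, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    mean_rmse.append(-scores.mean())

best = lambdas[int(np.argmin(mean_rmse))]
print("best lambda:", best)
```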

Page 42: Lecture 15 Regularized Linear Regression

Take-Home Messages

• What is overfitting

• What is regularization

• How does Ridge regression work

• Sparsity properties of Lasso regression

• How to choose the regularization coefficient λ
