
Regularization

Jia-Bin Huang

ECE-5424G / CS-5824, Virginia Tech, Spring 2019

Administrative

• Women in Data Science Blacksburg

• Location: Holtzman Alumni Center

• Welcome, 3:30 - 3:40, Assembly hall

• Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40 - 4:05, Assembly hall

• Career Panel, 4:05 - 5:00, Assembly hall

• Break, 5:00 - 5:20, Grand hall

• Keynote Speaker: Sally Morton, "Bias," 5:20 - 5:45, Assembly hall

• Dinner with breakout discussion groups, 5:45 - 7:00, Museum

• Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization," 7:00 - 8:15, Assembly hall

• Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7:00 - 8:15, 2nd floor board room

k-NN (Classification/Regression)

â€Ē Modelð‘Ĩ 1 , ð‘Ķ 1 , ð‘Ĩ 2 , ð‘Ķ 2 , â‹Ŋ , ð‘Ĩ 𝑚 , ð‘Ķ 𝑚

â€Ē Cost function

None

â€Ē Learning

Do nothing

â€Ē Inference

ā·œð‘Ķ = ℎ ð‘Ĩtest = ð‘Ķ(𝑘), where 𝑘 = argmin𝑖 𝐷(ð‘Ĩtest, ð‘Ĩ(𝑖))

Linear regression (Regression)

• Model: h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_n x_n = θ⊤x

• Cost function: J(θ) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

• Learning:
  1) Gradient descent: Repeat { θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i) }
  2) Solving the normal equation: θ = (X⊤X)^{-1} X⊤y

• Inference: ŷ = h_θ(x_test) = θ⊤x_test

Naïve Bayes (Classification)

• Model: h_θ(x) = P(Y | X_1, X_2, ⋯, X_n) ∝ P(Y) ∏_i P(X_i | Y)

• Cost function:
  Maximum likelihood estimation: J(θ) = −log P(Data | θ)
  Maximum a posteriori estimation: J(θ) = −log P(Data | θ) P(θ)

• Learning:
  π_k = P(Y = y_k)
  (Discrete X_i) θ_ijk = P(X_i = x_ij | Y = y_k)
  (Continuous X_i) mean μ_ik, variance σ_ik², with P(X_i | Y = y_k) = 𝒩(X_i | μ_ik, σ_ik²)

• Inference: Y ← argmax_{y_k} P(Y = y_k) ∏_i P(X_i^test | Y = y_k)

Logistic regression (Classification)

• Model: h_θ(x) = P(Y = 1 | X_1, X_2, ⋯, X_n) = 1 / (1 + e^{−θ⊤x})

• Cost function: J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i)), where
  Cost(h_θ(x), y) = −log h_θ(x)          if y = 1
  Cost(h_θ(x), y) = −log(1 − h_θ(x))     if y = 0

• Learning: Gradient descent: Repeat { θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i) }

• Inference: ŷ = h_θ(x_test) = 1 / (1 + e^{−θ⊤x_test})

Logistic Regression

• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification

ℎ𝜃 ð‘Ĩ =1

1 + 𝑒−𝜃âŠĪð‘Ĩ

Cost(ℎ𝜃 ð‘Ĩ , ð‘Ķ) = āĩâˆ’log ℎ𝜃 ð‘Ĩ if ð‘Ķ = 1

−log 1 − ℎ𝜃 ð‘Ĩ if ð‘Ķ = 0

𝜃𝑗 ≔ 𝜃𝑗 − 𝛞1

𝑚

𝑖=1

𝑚

ℎ𝜃 ð‘Ĩ 𝑖 − ð‘Ķ(𝑖) ð‘Ĩ𝑗(𝑖)

How about MAP?

• Maximum conditional likelihood estimate (MCLE):
  θ_MCLE = argmax_θ ∏_{i=1}^{m} P_θ(y^(i) | x^(i))

• Maximum conditional a posteriori estimate (MCAP):
  θ_MCAP = argmax_θ ∏_{i=1}^{m} P_θ(y^(i) | x^(i)) P(θ)

Prior 𝑃(𝜃)

• Common choice of P(θ): Normal distribution, zero mean, identity covariance

• "Pushes" parameters towards zero

• Corresponds to regularization: helps avoid very large weights and overfitting

Slide credit: Tom Mitchell

MLE vs. MAP

• Maximum conditional likelihood estimate (MCLE):

θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)

• Maximum conditional a posteriori estimate (MCAP):

θ_j := θ_j − α λ θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)

Logistic Regression

• Hypothesis representation

• Cost function

• Logistic regression with gradient descent

• Regularization

• Multi-class classification

Multi-class classification

• Email foldering/tagging: Work, Friends, Family, Hobby

• Medical diagrams: Not ill, Cold, Flu

• Weather: Sunny, Cloudy, Rain, Snow

Slide credit: Andrew Ng

Binary classification vs. multiclass classification

[Figure: example datasets plotted in the (x_1, x_2) plane, one with two classes and one with several classes]

One-vs-all (one-vs-rest)

[Figure: a three-class dataset (Class 1, Class 2, Class 3) in the (x_1, x_2) plane, split into three binary problems with one classifier h_θ^(1)(x), h_θ^(2)(x), h_θ^(3)(x) per class]

h_θ^(i)(x) = P(y = i | x; θ)   (i = 1, 2, 3)

Slide credit: Andrew Ng

One-vs-all

• Train a logistic regression classifier h_θ^(i)(x) for each class i to predict the probability that y = i

• Given a new input x, pick the class i that maximizes h_θ^(i)(x): max_i h_θ^(i)(x)

Slide credit: Andrew Ng
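A minimal one-vs-all sketch following the two steps above, with plain gradient-descent logistic regression as the base classifier (all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.5, iters=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / X.shape[0]
    return theta

def one_vs_all_fit(X, y, classes):
    # one binary classifier per class: relabel as 1 if y == i, else 0
    return {i: fit_logistic(X, (y == i).astype(float)) for i in classes}

def one_vs_all_predict(thetas, X_test):
    # pick the class i whose h_theta_i(x) is largest
    scores = np.column_stack([sigmoid(X_test @ thetas[i]) for i in sorted(thetas)])
    return np.array(sorted(thetas))[np.argmax(scores, axis=1)]

# toy usage: three classes along one feature
X = np.column_stack([np.ones(9), [-3.0, -2.5, -2.0, 0.0, 0.5, 1.0, 3.0, 3.5, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
thetas = one_vs_all_fit(X, y, classes=[0, 1, 2])
print(one_vs_all_predict(thetas, X))  # expected: [0 0 0 1 1 1 2 2 2]
```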

Generative Approach (e.g., Naïve Bayes)

Estimate P(Y) and P(X|Y)

Prediction: ŷ = argmax_y P(Y = y) P(X = x | Y = y)

Discriminative Approach (e.g., Logistic regression)

Estimate P(Y|X) directly

(Or a discriminant function: e.g., SVM)

Prediction: ŷ = argmax_y P(Y = y | X = x)

Further readings

• Tom M. Mitchell, "Generative and discriminative classifiers: Naïve Bayes and Logistic Regression" http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf

• Andrew Ng, Michael Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes" http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression


Example: Linear regression

[Figure: three fits of housing price ($ in 1000's) vs. size in feet^2]

h_θ(x) = θ_0 + θ_1 x                                   (Underfitting)
h_θ(x) = θ_0 + θ_1 x + θ_2 x²                          (Just right)
h_θ(x) = θ_0 + θ_1 x + θ_2 x² + θ_3 x³ + θ_4 x⁴ + ⋯    (Overfitting)

Slide credit: Andrew Ng

Overfitting

• If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well,

J(θ) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))² ≈ 0,

but fail to generalize to new examples (e.g., predicting prices for new houses).

Slide credit: Andrew Ng

Example: Linear regression

[Figure: the same three fits of price ($ in 1000's) vs. size in feet^2]

h_θ(x) = θ_0 + θ_1 x                                   (Underfitting, high bias)
h_θ(x) = θ_0 + θ_1 x + θ_2 x²                          (Just right)
h_θ(x) = θ_0 + θ_1 x + θ_2 x² + θ_3 x³ + θ_4 x⁴ + ⋯    (Overfitting, high variance)

Slide credit: Andrew Ng

Bias-Variance Tradeoff

• Bias: difference between what you expect to learn and the truth
  • Measures how well you expect to represent the true solution
  • Decreases with a more complex model

• Variance: difference between what you expect to learn and what you learn from a particular dataset
  • Measures how sensitive the learner is to a specific dataset
  • Increases with a more complex model

[Figure: 2×2 grid illustrating the combinations of low/high bias and low/high variance]

Bias–variance decomposition

• Training set {(x_1, y_1), (x_2, y_2), ⋯, (x_n, y_n)}

• y = f(x) + ε

• We want f̂(x) that minimizes E[(y − f̂(x))²]

E[(y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²

Bias[f̂(x)] = E[f̂(x)] − f(x)

Var[f̂(x)] = E[f̂(x)²] − E[f̂(x)]²

https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
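A small simulation sketch of this decomposition: repeatedly resample a training set, fit a model, and estimate Bias² and Var at a single test point (the true function, noise level, and polynomial degree are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                      # true function f(x)
sigma = 0.3                     # noise std, so sigma^2 is the irreducible error
x0 = 1.0                        # test point where bias and variance are evaluated
degree = 3                      # complexity of the fitted model

preds = []
for _ in range(2000):           # many independent training sets
    x = rng.uniform(0, 2 * np.pi, 30)
    y = f(x) + sigma * rng.normal(size=30)
    coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    preds.append(np.polyval(coeffs, x0))       # f_hat(x0) for this training set
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2            # Bias[f_hat(x0)]^2
var = preds.var()                              # Var[f_hat(x0)]
print(f"bias^2={bias2:.4f}  var={var:.4f}  bias^2+var+sigma^2={bias2 + var + sigma**2:.4f}")
```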

Overfitting

[Figure: three decision boundaries for classifying tumors by tumor size and age]

h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2)                                                                     (Underfitting)
h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_2² + θ_5 x_1 x_2)
h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_2² + θ_5 x_1 x_2 + θ_6 x_1³ x_2 + θ_7 x_1 x_2³ + ⋯)   (Overfitting)

Slide credit: Andrew Ng

Addressing overfitting

â€Ē ð‘Ĩ1 = size of house

â€Ē ð‘Ĩ2 = no. of bedrooms

â€Ē ð‘Ĩ3 = no. of floors

â€Ē ð‘Ĩ4 = age of house

â€Ē ð‘Ĩ5 = average income in neighborhood

â€Ē ð‘Ĩ6 = kitchen size

â€Ē â‹Ū

â€Ē ð‘Ĩ100

Price ($)in 1000’s

Size in feet^2

Slide credit: Andrew Ng

Addressing overfitting

• 1. Reduce number of features.
  • Manually select which features to keep.
  • Model selection algorithm (later in course).

• 2. Regularization.
  • Keep all the features, but reduce the magnitude/values of parameters θ_j.
  • Works well when we have a lot of features, each of which contributes a bit to predicting y.

Slide credit: Andrew Ng

Overfitting Thriller

• https://www.youtube.com/watch?v=DQWI1kvmwRg

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Intuition

• Suppose we penalize and make θ_3, θ_4 really small:

min_θ J(θ) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))² + 1000·θ_3² + 1000·θ_4²

h_θ(x) = θ_0 + θ_1 x + θ_2 x²
h_θ(x) = θ_0 + θ_1 x + θ_2 x² + θ_3 x³ + θ_4 x⁴

[Figure: quadratic and quartic fits of price ($ in 1000's) vs. size in feet^2]

Slide credit: Andrew Ng

Regularization

• Small values for parameters θ_1, θ_2, ⋯, θ_n
  • "Simpler" hypothesis
  • Less prone to overfitting

• Housing:
  • Features: x_1, x_2, ⋯, x_100
  • Parameters: θ_0, θ_1, θ_2, ⋯, θ_100

J(θ) = (1/(2m)) [ Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))² + λ Σ_{j=1}^{n} θ_j² ]

Slide credit: Andrew Ng

Regularization

ð― 𝜃 =1

2𝑚

𝑖=1

𝑚

ℎ𝜃 ð‘Ĩ 𝑖 − ð‘Ķ 𝑖 2+ 𝜆

𝑗=1

𝑛

𝜃𝑗2

min𝜃

ð―(𝜃)

Price ($)in 1000’s

Size in feet^2

𝜆: Regularization parameter

Slide credit: Andrew Ng

Question

ð― 𝜃 =1

2𝑚

𝑖=1

𝑚

ℎ𝜃 ð‘Ĩ 𝑖 − ð‘Ķ 𝑖 2+ 𝜆

𝑗=1

𝑛

𝜃𝑗2

What if 𝜆 is set to an extremely large value (say 𝜆 = 1010)?

1. Algorithm works fine; setting to be very large can’t hurt it

2. Algorithm fails to eliminate overfitting.

3. Algorithm results in underfitting. (Fails to fit even training data well).

4. Gradient descent will fail to converge.

Slide credit: Andrew Ng

Question

ð― 𝜃 =1

2𝑚

𝑖=1

𝑚

ℎ𝜃 ð‘Ĩ 𝑖 − ð‘Ķ 𝑖 2+ 𝜆

𝑗=1

𝑛

𝜃𝑗2

What if 𝜆 is set to an extremely large value (say 𝜆 = 1010)?Price ($)in 1000’s

Size in feet^2

ℎ𝜃 ð‘Ĩ = 𝜃0 + 𝜃1ð‘Ĩ1 + 𝜃2ð‘Ĩ2 +â‹Ŋ+ 𝜃𝑛ð‘Ĩ𝑛 = 𝜃âŠĪð‘ĨSlide credit: Andrew Ng

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Regularized linear regression

ð― 𝜃 =1

2𝑚

𝑖=1

𝑚

ℎ𝜃 ð‘Ĩ 𝑖 − ð‘Ķ 𝑖 2+ 𝜆

𝑗=1

𝑛

𝜃𝑗2

min𝜃

ð―(𝜃)

𝑛: Number of features

𝜃0 is not panelizedSlide credit: Andrew Ng

Gradient descent (Previously)

Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))                      (j = 0)
  θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)              (j = 1, 2, 3, ⋯, n)
}

Slide credit: Andrew Ng

Gradient descent (Regularized)

Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
  θ_j := θ_j − α (1/m) [ Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i) + λ θ_j ]
}

Equivalently: θ_j := θ_j (1 − α λ/m) − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)

Slide credit: Andrew Ng
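A sketch of the regularized update above in its weight-decay form, leaving θ_0 unpenalized as the earlier slide notes (illustrative names and data):

```python
import numpy as np

def fit_ridge_gd(X, y, lam=1.0, alpha=0.3, iters=5000):
    """Regularized linear regression by gradient descent.
    X includes a leading column of ones; theta_0 is not penalized."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m          # (1/m) sum (h(x^(i)) - y^(i)) x_j^(i)
        theta[0] -= alpha * grad[0]               # j = 0: no regularization term
        # j >= 1: theta_j := theta_j*(1 - alpha*lam/m) - alpha*grad_j  (weight decay)
        theta[1:] = theta[1:] * (1 - alpha * lam / m) - alpha * grad[1:]
    return theta

# toy usage: larger lam shrinks the non-intercept coefficients toward zero
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=40)
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
print(fit_ridge_gd(X, y, lam=0.0)[1:], fit_ridge_gd(X, y, lam=10.0)[1:])
```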

Comparison

Regularized linear regression:

θ_j := θ_j (1 − α λ/m) − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)

Un-regularized linear regression:

θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)

(1 − α λ/m) < 1: Weight decay

Normal equation

• X = [ (x^(1))⊤ ; (x^(2))⊤ ; ⋯ ; (x^(m))⊤ ] ∈ R^(m×(n+1)),   y = [ y^(1) ; y^(2) ; ⋯ ; y^(m) ] ∈ R^m

• min_θ J(θ)

• θ = ( X⊤X + λ·diag(0, 1, 1, ⋯, 1) )^{-1} X⊤y, where diag(0, 1, 1, ⋯, 1) is the (n+1) × (n+1) identity matrix with its first (θ_0) diagonal entry set to 0

Slide credit: Andrew Ng
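A sketch of the regularized normal equation above, with the (n+1)×(n+1) identity-like matrix that leaves θ_0 unpenalized (illustrative names and data):

```python
import numpy as np

def ridge_normal_equation(X, y, lam=1.0):
    """theta = (X^T X + lam * D)^{-1} X^T y, where D is the identity with D[0, 0] = 0
    so that theta_0 is not penalized. X includes a leading column of ones."""
    n_plus_1 = X.shape[1]
    D = np.eye(n_plus_1)
    D[0, 0] = 0.0                      # do not penalize the intercept theta_0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

# toy usage: lam = 0 recovers the plain normal equation
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=30)
X = np.column_stack([np.ones_like(x), x, x**2])
print(ridge_normal_equation(X, y, lam=0.0), ridge_normal_equation(X, y, lam=5.0))
```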

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Regularized logistic regression

• Cost function:

J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ] + (λ/2) Σ_{j=1}^{n} θ_j²

[Figure: decision boundary for classifying tumors by tumor size and age]

h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1² + θ_4 x_2² + θ_5 x_1 x_2 + θ_6 x_1³ x_2 + θ_7 x_1 x_2³ + ⋯)

Slide credit: Andrew Ng

Gradient descent (Regularized)

Repeat {
  θ_0 := θ_0 − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
  θ_j := θ_j − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i) − α λ θ_j
}

h_θ(x) = 1 / (1 + e^{−θ⊤x})

Each step moves θ_j along −α ∂J(θ)/∂θ_j.

Slide credit: Andrew Ng

‖θ‖_1: Lasso regularization

J(θ) = (1/(2m)) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))² + λ Σ_{j=1}^{n} |θ_j|

LASSO: Least Absolute Shrinkage and Selection Operator

Single predictor: Soft Thresholding

• minimize_θ (1/(2m)) Σ_{i=1}^{m} (x^(i) θ − y^(i))² + λ|θ|

θ = (1/m)⟨x, y⟩ − λ     if (1/m)⟨x, y⟩ > λ
θ = 0                   if (1/m)|⟨x, y⟩| ≤ λ
θ = (1/m)⟨x, y⟩ + λ     if (1/m)⟨x, y⟩ < −λ

That is, θ = S_λ((1/m)⟨x, y⟩), where the soft-thresholding operator is S_λ(x) = sign(x)(|x| − λ)_+
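A quick sketch of the soft-thresholding operator and the single-predictor lasso solution above (this assumes the usual setup where x is standardized so that (1/m) Σ x_i² = 1; names are illustrative):

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lambda(z) = sign(z) * max(|z| - lambda, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# single-predictor lasso: theta = S_lambda((1/m) <x, y>) when x is standardized
rng = np.random.default_rng(0)
m = 200
x = rng.normal(size=m)
x = x / np.sqrt(np.mean(x ** 2))            # standardize so (1/m) sum x_i^2 = 1
y = 0.5 * x + 0.1 * rng.normal(size=m)      # true coefficient 0.5

corr = np.mean(x * y)                       # (1/m) <x, y>
for lam in [0.0, 0.2, 0.6]:
    print(lam, soft_threshold(corr, lam))   # shrinks toward 0, exactly 0 once lam >= |corr|
```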

Multiple predictors: Cyclic Coordinate Descent

• minimize_θ (1/(2m)) Σ_{i=1}^{m} ( x_j^(i) θ_j + Σ_{k≠j} x_k^(i) θ_k − y^(i) )² + λ Σ_{k≠j} |θ_k| + λ|θ_j|

For each j, update θ_j with

minimize_{θ_j} (1/(2m)) Σ_{i=1}^{m} ( x_j^(i) θ_j − r_j^(i) )² + λ|θ_j|

where the partial residual is r_j^(i) = y^(i) − Σ_{k≠j} x_k^(i) θ_k

L1 and L2 balls

Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf

Terminology

Regularization function            | Name                                        | Solver
‖θ‖_2² = Σ_{j=1}^{n} θ_j²          | Tikhonov regularization / Ridge regression  | Closed form
‖θ‖_1 = Σ_{j=1}^{n} |θ_j|          | LASSO regression                            | Proximal gradient descent, least angle regression
α‖θ‖_1 + (1 − α)‖θ‖_2²             | Elastic net regularization                  | Proximal gradient descent

Things to remember

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression
