Regularization. Jia-Bin Huang, Virginia Tech, Spring 2019, ECE-5424G / CS-5824



Page 1:

Regularization

Jia-Bin Huang

Virginia Tech, Spring 2019, ECE-5424G / CS-5824

Page 2:

Administrative

• Women in Data Science Blacksburg

• Location: Holtzman Alumni Center

• Welcome, 3:30 - 3:40, Assembly hall

• Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40 - 4:05, Assembly hall

• Career Panel, 4:05 - 5:00, Assembly hall

• Break, 5:00 - 5:20, Grand hall

• Keynote Speaker: Sally Morton, "Bias," 5:20 - 5:45, Assembly hall

• Dinner with breakout discussion groups, 5:45 - 7:00, Museum

• Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization", 7:00 - 8:15, Assembly hall

• Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7-8:15, 2nd floor board room

Page 3:

k-NN (Classification/Regression)

• Model: the training data $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})$

• Cost function

None

• Learning

Do nothing

• Inference

$\hat{y} = h(x_{\text{test}}) = y^{(k)}$, where $k = \arg\min_i D(x_{\text{test}}, x^{(i)})$
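To make the inference rule concrete, here is a minimal NumPy sketch of the $k = 1$ case, assuming Euclidean distance for $D$; the function name and toy data are illustrative and not from the slides.

```python
import numpy as np

def nn_predict(X_train, y_train, x_test):
    """1-NN: return the label of the stored training point closest to x_test."""
    # D(x_test, x^(i)) = Euclidean distance to every stored training example
    dists = np.linalg.norm(X_train - x_test, axis=1)
    k = np.argmin(dists)          # index of the nearest neighbor
    return y_train[k]             # y_hat = y^(k)

# "Learning" is just storing the data
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y_train = np.array([0, 0, 1])
print(nn_predict(X_train, y_train, np.array([2.6, 2.9])))  # -> 1
```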

Page 4:

Linear regression (Regression)

• Model: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$

• Cost function: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

• Learning

1) Gradient descent: Repeat { $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ }

2) Solving the normal equation: $\theta = (X^\top X)^{-1}X^\top y$

• Inference: $\hat{y} = h_\theta(x_{\text{test}}) = \theta^\top x_{\text{test}}$
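As a rough illustration of the two learning options above, a small NumPy sketch that fits $\theta$ both by batch gradient descent and by the normal equation; the synthetic data, step size, and iteration count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # prepend x_0 = 1
true_theta = np.array([4.0, 2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)

# 1) Gradient descent on J(theta) = (1/2m) * sum (h_theta(x) - y)^2
theta = np.zeros(n + 1)
alpha = 0.1
for _ in range(2000):
    grad = (X.T @ (X @ theta - y)) / m        # (1/m) * sum (h - y) * x_j
    theta -= alpha * grad

# 2) Normal equation: theta = (X^T X)^{-1} X^T y
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

print(theta, theta_ne)   # both should be close to [4, 2, -3]
```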

Page 5:

Naïve Bayes (Classification)

• Model: $h_\theta(x) = P(Y \mid X_1, X_2, \cdots, X_n) \propto P(Y)\,\Pi_i P(X_i \mid Y)$

• Cost function: maximum likelihood estimation: $J(\theta) = -\log P(\text{Data} \mid \theta)$; maximum a posteriori estimation: $J(\theta) = -\log P(\text{Data} \mid \theta)P(\theta)$

• Learning: $\pi_k = P(Y = y_k)$

(Discrete $X_i$) $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$

(Continuous $X_i$) mean $\mu_{ik}$, variance $\sigma_{ik}^2$, $P(X_i \mid Y = y_k) = \mathcal{N}(X_i \mid \mu_{ik}, \sigma_{ik}^2)$

• Inference: $Y \leftarrow \arg\max_{y_k} P(Y = y_k)\,\Pi_i P(X_i^{\text{test}} \mid Y = y_k)$
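A compact sketch of the continuous-feature case above (per-class prior, mean, and variance, with prediction by the argmax of the log of the product). The toy data, function names, and the small variance floor are assumptions for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate pi_k, mu_ik, sigma_ik^2 for each class k and feature i."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])               # P(Y = y_k)
    means  = np.array([X[y == k].mean(axis=0) for k in classes])        # mu_ik
    vars_  = np.array([X[y == k].var(axis=0) + 1e-9 for k in classes])  # sigma_ik^2
    return classes, priors, means, vars_

def predict_gaussian_nb(model, x):
    """Y <- argmax_k P(Y=y_k) * prod_i N(x_i | mu_ik, sigma_ik^2), computed in log space."""
    classes, priors, means, vars_ = model
    log_post = (np.log(priors)
                - 0.5 * np.sum(np.log(2 * np.pi * vars_) + (x - means) ** 2 / vars_, axis=1))
    return classes[np.argmax(log_post)]

X = np.array([[1.0, 2.0], [1.2, 1.9], [4.0, 5.0], [4.2, 5.1]])
y = np.array([0, 0, 1, 1])
print(predict_gaussian_nb(fit_gaussian_nb(X, y), np.array([4.1, 4.8])))  # -> 1
```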

Page 6:

Logistic regression (Classification)

• Model: $h_\theta(x) = P(Y = 1 \mid X_1, X_2, \cdots, X_n) = \frac{1}{1 + e^{-\theta^\top x}}$

• Cost function: $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\text{Cost}(h_\theta(x^{(i)}), y^{(i)})$, where $\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

• Learning: gradient descent: Repeat { $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ }

• Inference: $\hat{y} = h_\theta(x_{\text{test}}) = \frac{1}{1 + e^{-\theta^\top x_{\text{test}}}}$
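A small sketch of the learning rule above (un-regularized batch gradient descent on the logistic cost). The data, learning rate, and iteration count are illustrative assumptions, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m = 200
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])   # x_0 = 1
true_theta = np.array([-0.5, 2.0, -1.0])
y = (sigmoid(X @ true_theta) > rng.uniform(size=m)).astype(float)

theta = np.zeros(3)
alpha = 0.5
for _ in range(3000):
    h = sigmoid(X @ theta)                    # h_theta(x) = 1 / (1 + e^{-theta^T x})
    theta -= alpha * (X.T @ (h - y)) / m      # theta_j -= alpha * (1/m) sum (h - y) x_j

print(theta)                                        # roughly recovers true_theta
print(sigmoid(theta @ np.array([1.0, 1.0, 0.0])))   # P(y = 1 | x_test)
```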

Page 7:

Logistic Regression

•Hypothesis representation

•Cost function

• Logistic regression with gradient descent

•Regularization

•Multi-class classification

$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$

$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$

$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

Page 8:

How about MAP?

• Maximum conditional likelihood estimate (MCLE)

• Maximum conditional a posterior estimate (MCAP)

$\theta_{\text{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta\left(y^{(i)} \mid x^{(i)}\right)$

$\theta_{\text{MCAP}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta\left(y^{(i)} \mid x^{(i)}\right)P(\theta)$

Page 9:

Prior 𝑃(𝜃)

• Common choice of $P(\theta)$: normal distribution with zero mean and identity covariance

• "Pushes" parameters towards zero

• Corresponds to regularization

• Helps avoid very large weights and overfitting

Slide credit: Tom Mitchell

Page 10:

MLE vs. MAP

• Maximum conditional likelihood estimate (MCLE)

$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

• Maximum conditional a posterior estimate (MCAP)

$\theta_j := \theta_j - \alpha\lambda\theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
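The two updates above differ only by the extra shrinkage term $-\alpha\lambda\theta_j$ in the MCAP case. A sketch of both inner loops follows; the data, $\lambda$, and step size are assumptions made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_step(theta, X, y, alpha, lam=0.0):
    """One gradient step; lam = 0 gives the MCLE update, lam > 0 the MCAP update."""
    h = sigmoid(X @ theta)
    grad = (X.T @ (h - y)) / len(y)         # (1/m) sum (h_theta(x) - y) x_j
    return theta - alpha * lam * theta - alpha * grad   # extra -alpha*lam*theta for MAP

# Toy comparison: the MAP estimate is pulled toward zero
rng = np.random.default_rng(2)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 1))])
y = (X[:, 1] > 0).astype(float)
theta_mle = theta_map = np.zeros(2)
for _ in range(2000):
    theta_mle = gd_step(theta_mle, X, y, alpha=0.1)
    theta_map = gd_step(theta_map, X, y, alpha=0.1, lam=0.1)
print(np.abs(theta_mle), np.abs(theta_map))   # MAP weights are smaller in magnitude
```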

Page 11:

Logistic Regression

•Hypothesis representation

•Cost function

• Logistic regression with gradient descent

•Regularization

•Multi-class classification

Page 12:

Multi-class classification

• Email foldering/tagging: Work, Friends, Family, Hobby

• Medical diagnosis: Not ill, Cold, Flu

• Weather: Sunny, Cloudy, Rain, Snow

Slide credit: Andrew Ng

Page 13:

Binary classification vs. multiclass classification

(Figure: two scatter plots over features $x_1$ and $x_2$; two classes on the left, multiple classes on the right)

Page 14:

One-vs-all (one-vs-rest)

(Figure: a three-class dataset in the $x_1$, $x_2$ plane is split into three binary problems, Class 1 vs. rest, Class 2 vs. rest, Class 3 vs. rest, giving classifiers $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, $h_\theta^{(3)}(x)$)

$h_\theta^{(i)}(x) = P(y = i \mid x; \theta) \quad (i = 1, 2, 3)$

Slide credit: Andrew Ng

Page 15:

One-vs-all

• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$

• Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$ (see the sketch below)

Slide credit: Andrew Ng
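A sketch of the one-vs-all recipe above: train one binary logistic regression per class on the indicator y == i, then predict the class whose $h_\theta^{(i)}(x)$ is largest. The training loop reuses the gradient-descent update from the earlier logistic regression slides; the data and hyperparameters are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary_lr(X, y, alpha=0.3, iters=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * (X.T @ (sigmoid(X @ theta) - y)) / len(y)
    return theta

def one_vs_all(X, y, classes):
    # One classifier per class i, trained on the indicator y == i
    return {i: train_binary_lr(X, (y == i).astype(float)) for i in classes}

def predict(models, x):
    # Pick the class i that maximizes h_theta^(i)(x)
    return max(models, key=lambda i: sigmoid(models[i] @ x))

rng = np.random.default_rng(3)
centers = np.array([[0, 0], [4, 0], [0, 4]])
X = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)
Xb = np.hstack([np.ones((90, 1)), X])               # add x_0 = 1
models = one_vs_all(Xb, y, classes=[0, 1, 2])
print(predict(models, np.array([1.0, 3.8, 0.2])))   # likely class 1 (near center [4, 0])
```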

Page 16:

Generative approach (e.g., Naïve Bayes)

Estimate $P(Y)$ and $P(X \mid Y)$

Prediction: $\hat{y} = \arg\max_y P(Y = y)\,P(X = x \mid Y = y)$

Discriminative approach (e.g., logistic regression)

Estimate $P(Y \mid X)$ directly

(Or a discriminant function: e.g., SVM)

Prediction: $\hat{y} = \arg\max_y P(Y = y \mid X = x)$

Page 17:

Further readings

• Tom M. Mitchell, "Generative and discriminative classifiers: Naïve Bayes and Logistic Regression." http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf

• Andrew Ng and Michael Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes." http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf

Page 18:

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Page 19:

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Page 20:

Example: Linear regression (three fits of house price in $1000's vs. size in feet^2)

$h_\theta(x) = \theta_0 + \theta_1 x$ (underfitting)

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (just right)

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (overfitting)

Slide credit: Andrew Ng

Page 21:

Overfitting

• If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \approx 0$

but fail to generalize to new examples (predict prices on new examples).

Slide credit: Andrew Ng

Page 22:

Example: Linear regression (three fits of house price in $1000's vs. size in feet^2)

$h_\theta(x) = \theta_0 + \theta_1 x$ (underfitting: high bias)

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (just right)

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (overfitting: high variance)

Slide credit: Andrew Ng

Page 23:

Bias-Variance Tradeoff

• Bias: difference between what you expect to learn and the truth
  • Measures how well you expect to represent the true solution
  • Decreases with a more complex model

• Variance: difference between what you expect to learn and what you learn from a particular dataset
  • Measures how sensitive the learner is to the specific dataset
  • Increases with a more complex model

Page 24:

(Figure: dartboard illustration of the four combinations of low/high bias and low/high variance)

Page 25:

Bias–variance decomposition

• Training set $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$

• $y = f(x) + \varepsilon$

• We want $\hat{f}(x)$ that minimizes $E\left[\left(y - \hat{f}(x)\right)^2\right]$

$E\left[\left(y - \hat{f}(x)\right)^2\right] = \mathrm{Bias}\left[\hat{f}(x)\right]^2 + \mathrm{Var}\left[\hat{f}(x)\right] + \sigma^2$

$\mathrm{Bias}\left[\hat{f}(x)\right] = E\left[\hat{f}(x)\right] - f(x)$

$\mathrm{Var}\left[\hat{f}(x)\right] = E\left[\hat{f}(x)^2\right] - E\left[\hat{f}(x)\right]^2$

https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
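A rough numerical illustration of the decomposition above: repeatedly resample training sets from $y = f(x) + \varepsilon$, fit a low-degree and a higher-degree polynomial, and estimate $\mathrm{Bias}^2$ and $\mathrm{Var}$ of the fitted value at a single test point. The true function, noise level, degrees, and sample sizes are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * x)           # true function f(x)
sigma = 0.3                           # noise std, so Var(eps) = sigma^2
x0 = 0.8                              # test point where bias/variance are estimated

def experiment(degree, trials=500, n=15):
    preds = []
    for _ in range(trials):
        x = rng.uniform(0, 2, size=n)
        y = f(x) + sigma * rng.normal(size=n)        # y = f(x) + eps
        coef = np.polyfit(x, y, degree)              # f_hat from this training set
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2              # (E[f_hat(x0)] - f(x0))^2
    var = preds.var()                                # E[f_hat^2] - E[f_hat]^2
    return bias2, var

print("degree 1 (simple):  bias^2=%.4f  var=%.4f" % experiment(1))
print("degree 5 (complex): bias^2=%.4f  var=%.4f" % experiment(5))
# Typically the simple model shows the larger bias^2 and the complex one the larger variance.
```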

Page 26:

Overfitting

(Figure: three tumor size vs. age plots with increasingly complex decision boundaries)

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ (underfitting)

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2)$

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$ (overfitting)

Slide credit: Andrew Ng

Page 27:

Addressing overfitting

• 𝑥1 = size of house

• 𝑥2 = no. of bedrooms

• 𝑥3 = no. of floors

• 𝑥4 = age of house

• 𝑥5 = average income in neighborhood

• 𝑥6 = kitchen size

• ⋮

• 𝑥100

(Figure: house price in $1000's vs. size in feet^2)

Slide credit: Andrew Ng

Page 28:

Addressing overfitting

• 1. Reduce number of features.

• Manually select which features to keep.

• Model selection algorithm (later in course).

• 2. Regularization.

• Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.

• Works well when we have a lot of features, each of which contributes a bit to predicting 𝑦.

Slide credit: Andrew Ng

Page 29:

Overfitting Thriller

• https://www.youtube.com/watch?v=DQWI1kvmwRg

Page 30:

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Page 31:

Intuition

• Suppose we penalize and make 𝜃3, 𝜃4 really small.

$\min_\theta J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$

$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$

(Figure: two house price vs. size plots; with $\theta_3, \theta_4 \approx 0$, the quartic hypothesis behaves essentially like the quadratic one)

Slide credit: Andrew Ng

Page 32:

Regularization.

• Small values for parameters $\theta_1, \theta_2, \cdots, \theta_n$
  • "Simpler" hypothesis
  • Less prone to overfitting

• Housing:
  • Features: $x_1, x_2, \cdots, x_{100}$
  • Parameters: $\theta_0, \theta_1, \theta_2, \cdots, \theta_{100}$

$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]$

Slide credit: Andrew Ng

Page 33:

Regularization

$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]$

$\min_\theta J(\theta)$

(Figure: house price in $1000's vs. size in feet^2)

$\lambda$: regularization parameter

Slide credit: Andrew Ng

Page 34:

Question

$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]$

What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)?

1. Algorithm works fine; setting $\lambda$ to be very large can't hurt it.

2. Algorithm fails to eliminate overfitting.

3. Algorithm results in underfitting. (Fails to fit even training data well).

4. Gradient descent will fail to converge.

Slide credit: Andrew Ng

Page 35:

Question

$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]$

What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)?

(Figure: house price vs. size; with $\lambda$ this large, $\theta_1, \ldots, \theta_n \approx 0$, so $h_\theta(x) \approx \theta_0$, a flat line that underfits)

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$

Slide credit: Andrew Ng

Page 36:

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Page 37:

Regularized linear regression

$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]$

$\min_\theta J(\theta)$

𝑛: Number of features

$\theta_0$ is not penalized. Slide credit: Andrew Ng

Page 38:

Gradient descent (Previously)

Repeat {

$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$   $(j = 0)$

$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$   $(j = 1, 2, 3, \cdots, n)$

}

Slide credit: Andrew Ng

Page 39:

Gradient descent (Regularized)

Repeat {

$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$

$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$

}

Equivalently, $\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

Slide credit: Andrew Ng
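A sketch of the regularized update above, written in the equivalent weight-decay form $\theta_j := \theta_j(1 - \alpha\lambda/m) - \alpha\frac{1}{m}\sum(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$, with $\theta_0$ left un-penalized; the data and hyperparameters are made up for illustration.

```python
import numpy as np

def ridge_gd(X, y, alpha=0.1, lam=1.0, iters=5000):
    """Gradient descent for linear regression with an L2 penalty on theta_1..theta_n."""
    m, n1 = X.shape
    theta = np.zeros(n1)
    decay = np.full(n1, 1.0 - alpha * lam / m)
    decay[0] = 1.0                                # theta_0 is not penalized
    for _ in range(iters):
        grad = (X.T @ (X @ theta - y)) / m        # (1/m) sum (h_theta(x) - y) x_j
        theta = decay * theta - alpha * grad      # weight decay + the usual step
    return theta

rng = np.random.default_rng(5)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
y = X @ np.array([1.0, 3.0, 0.0, -2.0]) + 0.1 * rng.normal(size=50)
print(ridge_gd(X, y, lam=0.0))     # un-regularized fit
print(ridge_gd(X, y, lam=50.0))    # larger lambda shrinks theta_1..theta_3 toward 0
```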

Page 40:

Comparison

Regularized linear regression:

$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

Un-regularized linear regression:

$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

$1 - \alpha\frac{\lambda}{m} < 1$: weight decay

Page 41:

Normal equation

• $X = \begin{bmatrix} (x^{(1)})^\top \\ (x^{(2)})^\top \\ \vdots \\ (x^{(m)})^\top \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$, $\quad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}$

• $\min_\theta J(\theta)$

• $\theta = \left(X^\top X + \lambda \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}\right)^{-1} X^\top y$, where the penalty matrix is $(n+1) \times (n+1)$

Slide credit: Andrew Ng
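The closed-form solution above as a NumPy sketch; the penalty matrix is the identity with its (0, 0) entry zeroed so that $\theta_0$ is not regularized. The synthetic data is an assumption added for illustration.

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """theta = (X^T X + lam * diag(0, 1, ..., 1))^{-1} X^T y."""
    n1 = X.shape[1]                    # n + 1 columns (including x_0 = 1)
    L = np.eye(n1)
    L[0, 0] = 0.0                      # do not penalize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

rng = np.random.default_rng(6)
X = np.hstack([np.ones((40, 1)), rng.normal(size=(40, 3))])
y = X @ np.array([2.0, 1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=40)
print(ridge_normal_equation(X, y, lam=0.0))     # ordinary least squares
print(ridge_normal_equation(X, y, lam=100.0))   # shrunken coefficients
```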

Page 42:

Regularization

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression

Page 43:

Regularized logistic regression

• Cost function:

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$

(Figure: tumor size vs. age with a highly complex decision boundary)

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$

Slide credit: Andrew Ng

Page 44:

Gradient descent (Regularized)

Repeat {

$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$

$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$   (the bracketed term is $\frac{\partial}{\partial\theta_j}J(\theta)$)

}

with $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$

Slide credit: Andrew Ng

Page 45:

$\|\theta\|_1$: Lasso regularization

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}|\theta_j|$

LASSO: Least Absolute Shrinkage and Selection Operator

Page 46:

Single predictor: Soft Thresholding

• $\min_{\theta}\; \frac{1}{2m}\sum_{i=1}^{m}\left(x^{(i)}\theta - y^{(i)}\right)^2 + \lambda\|\theta\|_1$

$\hat{\theta} = \begin{cases} \frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle - \lambda & \text{if } \frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle > \lambda \\ 0 & \text{if } \frac{1}{m}\left|\langle \boldsymbol{x}, \boldsymbol{y}\rangle\right| \le \lambda \\ \frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle + \lambda & \text{if } \frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle < -\lambda \end{cases}$

$\hat{\theta} = S_\lambda\!\left(\frac{1}{m}\langle \boldsymbol{x}, \boldsymbol{y}\rangle\right)$

Soft-thresholding operator: $S_\lambda(x) = \mathrm{sign}(x)\,\left(|x| - \lambda\right)_+$
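A sketch of the soft-thresholding operator and the single-predictor solution above. The closed form implicitly assumes the predictor is scaled so that $\frac{1}{m}\sum_i (x^{(i)})^2 = 1$, so the sketch rescales $x$ accordingly; the data, $\lambda$, and the brute-force check are assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """S_lambda(z) = sign(z) * max(|z| - lambda, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Single-predictor lasso: minimize (1/2m) sum (x_i*theta - y_i)^2 + lam*|theta|
rng = np.random.default_rng(7)
m, lam = 200, 0.3
x = rng.normal(size=m)
x = x / np.sqrt(np.mean(x ** 2))      # scale so (1/m) sum x_i^2 = 1 (assumed by the formula)
y = 0.8 * x + 0.5 * rng.normal(size=m)

theta_closed = soft_threshold(np.mean(x * y), lam)    # S_lambda((1/m) <x, y>)

# Sanity check against a brute-force search over theta
grid = np.linspace(-2, 2, 4001)
obj = [np.mean((x * t - y) ** 2) / 2 + lam * abs(t) for t in grid]
print(theta_closed, grid[int(np.argmin(obj))])        # the two should agree closely
```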

Page 47:

Multiple predictors: Cyclic coordinate descent

• $\min_{\theta}\; \frac{1}{2m}\sum_{i=1}^{m}\left(x_j^{(i)}\theta_j + \sum_{k \ne j} x_k^{(i)}\theta_k - y^{(i)}\right)^2 + \lambda\sum_{k \ne j}|\theta_k| + \lambda|\theta_j|$

For each $j$, update $\theta_j$ with

$\min_{\theta_j}\; \frac{1}{2m}\sum_{i=1}^{m}\left(x_j^{(i)}\theta_j - r_j^{(i)}\right)^2 + \lambda|\theta_j|$

where $r_j^{(i)} = y^{(i)} - \sum_{k \ne j} x_k^{(i)}\theta_k$
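A sketch of the cyclic coordinate descent loop above: for each $j$, form the partial residual $r_j$, solve the single-predictor problem with soft thresholding, and sweep repeatedly. The division by $\frac{1}{m}\langle x_j, x_j\rangle$ is added here so un-normalized columns are handled; it equals 1 when columns are scaled as in the single-predictor slide. Data, $\lambda$, and sweep count are assumptions.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize (1/2m) sum_i (x^(i)^T theta - y^(i))^2 + lam * ||theta||_1."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(sweeps):
        for j in range(n):
            # Partial residual r_j^(i) = y^(i) - sum_{k != j} x_k^(i) theta_k
            r_j = y - X @ theta + X[:, j] * theta[j]
            # Single-predictor update: theta_j = S_lam((1/m)<x_j, r_j>) / ((1/m)<x_j, x_j>)
            theta[j] = soft_threshold(np.mean(X[:, j] * r_j), lam) / np.mean(X[:, j] ** 2)
    return theta

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=100)
print(lasso_coordinate_descent(X, y, lam=0.5))   # sparse: mostly zeros except features 0 and 3
```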

Page 48:

L1 and L2 balls

Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf

Page 49:

Terminology

Regularization function | Name | Solver
$\|\theta\|_2^2 = \sum_{j=1}^{n}\theta_j^2$ | Tikhonov regularization / Ridge regression | Closed form
$\|\theta\|_1 = \sum_{j=1}^{n}|\theta_j|$ | LASSO regression | Proximal gradient descent, least-angle regression
$\alpha\|\theta\|_1 + (1 - \alpha)\|\theta\|_2^2$ | Elastic net regularization | Proximal gradient descent

Page 50:

Things to remember

• Overfitting

• Cost function

• Regularized linear regression

• Regularized logistic regression