Regularization
Jia-Bin Huang
Virginia Tech, Spring 2019, ECE-5424G / CS-5824
Administrative
• Women in Data Science Blacksburg
• Location: Holtzman Alumni Center
• Welcome, 3:30 - 3:40, Assembly hall
• Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40 - 4:05, Assembly hall
• Career Panel, 4:05 - 5:00, Assembly hall
• Break, 5:00 - 5:20, Grand hall
• Keynote Speaker: Sally Morton, "Bias," 5:20 - 5:45, Assembly hall
• Dinner with breakout discussion groups, 5:45 - 7:00, Museum
• Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization," 7:00 - 8:15, Assembly hall
• Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7:00 - 8:15, 2nd floor board room
k-NN (Classification/Regression)
• Model: training set $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})$
• Cost function
None
• Learning
Do nothing
• Inference
$\hat{y} = h(x^{\text{test}}) = y^{(k)}$, where $k = \arg\min_i D(x^{\text{test}}, x^{(i)})$
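Below is a minimal NumPy sketch of this inference rule, assuming Euclidean distance for $D$; the names (knn_predict, X_train, and so on) are my own, and a majority vote handles $k > 1$.

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=1):
    """k-NN inference: no learning, just distances to the training set."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # D(x_test, x^(i)) for each i
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    # k = 1 reproduces y^(k) with k = argmin_i D(x_test, x^(i));
    # for k > 1 we take a majority vote over the neighbors' labels.
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y = np.array([0, 0, 1])
print(knn_predict(X, y, np.array([0.9, 1.2])))   # nearest neighbor has label 0
```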
Linear regression (Regression)
• Model: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
• Cost function
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Learning
1) Gradient descent: Repeat $\{ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} ( h_\theta(x^{(i)}) - y^{(i)} ) x_j^{(i)} \}$
2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$
• Inference: $\hat{y} = h_\theta(x^{\text{test}}) = \theta^\top x^{\text{test}}$
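Both learning options can be sketched in a few lines of NumPy (illustrative code with my own names; $X$ is assumed to include the intercept column $x_0 = 1$).

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Repeated update theta_j := theta_j - alpha * (1/m) * sum (h(x)-y) * x_j."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
    return theta

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, via a linear solve rather than an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column is x_0 = 1
y = np.array([2.0, 3.0, 4.0])
print(gradient_descent(X, y))   # approaches [1, 1]
print(normal_equation(X, y))    # exactly [1, 1]
```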
Naïve Bayes (Classification)
• Model: $h_\theta(x) = P(Y \mid X_1, X_2, \cdots, X_n) \propto P(Y) \prod_i P(X_i \mid Y)$
• Cost function
Maximum likelihood estimation: $J(\theta) = -\log P(\text{Data} \mid \theta)$
Maximum a posteriori estimation: $J(\theta) = -\log P(\text{Data} \mid \theta) P(\theta)$
• Learning
$\pi_k = P(Y = y_k)$
(Discrete $X_i$) $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$
(Continuous $X_i$) mean $\mu_{ik}$, variance $\sigma_{ik}^2$: $P(X_i \mid Y = y_k) = \mathcal{N}(X_i \mid \mu_{ik}, \sigma_{ik}^2)$
• Inference
$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{\text{test}} \mid Y = y_k)$
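For the continuous case, here is a Gaussian naive Bayes sketch matching the learning and inference rules above (toy code, hypothetical names, with a small variance floor added for numerical stability).

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate pi_k, mu_ik, sigma_ik^2 for each class."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}                  # pi_k
    means = {c: X[y == c].mean(axis=0) for c in classes}            # mu_ik
    variances = {c: X[y == c].var(axis=0) + 1e-9 for c in classes}  # sigma_ik^2
    return classes, priors, means, variances

def predict_gnb(model, x):
    """argmax_k P(Y=y_k) * prod_i N(x_i | mu_ik, sigma_ik^2), computed in log space."""
    classes, priors, means, variances = model
    def log_joint(c):
        return np.log(priors[c]) - 0.5 * np.sum(
            np.log(2 * np.pi * variances[c]) + (x - means[c]) ** 2 / variances[c])
    return max(classes, key=log_joint)

X = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 5.0], [4.2, 4.8]])
y = np.array([0, 0, 1, 1])
print(predict_gnb(fit_gnb(X, y), np.array([4.1, 5.1])))   # -> 1
```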
Logistic regression (Classification)
• Model: $h_\theta(x) = P(Y = 1 \mid X_1, X_2, \cdots, X_n) = \frac{1}{1 + e^{-\theta^\top x}}$
• Cost function
$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)})$, where
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
• Learning
Gradient descent: Repeat $\{ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} ( h_\theta(x^{(i)}) - y^{(i)} ) x_j^{(i)} \}$
• Inference
$\hat{Y} = h_\theta(x^{\text{test}}) = \frac{1}{1 + e^{-\theta^\top x^{\text{test}}}}$
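The whole model fits in a short NumPy sketch (again illustrative, my own names; note the gradient update has the same form as linear regression, only $h_\theta$ changes).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=2000):
    """Gradient descent on the logistic cost."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)                  # h_theta(x^(i)) for every example
        theta -= alpha * (X.T @ (h - y)) / m    # same update form as linear regression
    return theta

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])   # x_0 = 1 intercept
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
print(sigmoid(X @ theta))   # low probabilities for the y = 0 rows, high for y = 1
```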
Logistic Regression
•Hypothesis representation
•Cost function
• Logistic regression with gradient descent
•Regularization
•Multi-class classification
$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
How about MAP?
• Maximum conditional likelihood estimate (MCLE)
• Maximum conditional a posterior estimate (MCAP)
$\theta_{\text{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$
$\theta_{\text{MCAP}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)}) P(\theta)$
Prior 𝑃(𝜃)
• Common choice of $P(\theta)$:
• Normal distribution, zero mean, identity covariance
• "Pushes" parameters towards zero
• Corresponds to regularization
• Helps avoid very large weights and overfitting
Slide credit: Tom Mitchell
MLE vs. MAP
• Maximum conditional likelihood estimate (MCLE)
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
• Maximum conditional a posteriori estimate (MCAP)
$\theta_j := \theta_j - \alpha \lambda \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
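In code, the two updates differ only by the weight-decay term; a one-step sketch (lam is my name for the regularization strength).

```python
import numpy as np

def logistic_step(theta, X, y, alpha=0.1, lam=0.0):
    """One gradient step; lam = 0 gives MCLE, lam > 0 gives MCAP."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    grad = (X.T @ (h - y)) / X.shape[0]
    return theta - alpha * lam * theta - alpha * grad   # extra -alpha*lam*theta_j for MCAP
```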
Logistic Regression
•Hypothesis representation
•Cost function
• Logistic regression with gradient descent
•Regularization
•Multi-class classification
Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby
• Medical diagnosis: Not ill, Cold, Flu
• Weather: Sunny, Cloudy, Rain, Snow
Slide credit: Andrew Ng
Binary classification
[Figure: two-class data in the (x1, x2) plane]
Multiclass classification
[Figure: three-class data in the (x1, x2) plane]
One-vs-all (one-vs-rest)
[Figure: three-class data in the (x1, x2) plane split into three binary problems, Class 1 vs. rest, Class 2 vs. rest, Class 3 vs. rest, with one decision boundary per classifier $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, $h_\theta^{(3)}(x)$]
$h_\theta^{(i)}(x) = P(y = i \mid x; \theta)$, $i = 1, 2, 3$
Slide credit: Andrew Ng
One-vs-all
• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$
• Given a new input $x$, pick the class $i$ that maximizes $\max_i h_\theta^{(i)}(x)$
Slide credit: Andrew Ng
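A runnable one-vs-rest sketch (my own names; the binary trainer is the same logistic update as before, and the data and classes here are toy assumptions).

```python
import numpy as np

def fit_binary(X, y, alpha=0.1, iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta -= alpha * (X.T @ (h - y)) / X.shape[0]
    return theta

def fit_one_vs_all(X, y, classes):
    # One classifier per class i, trained on the binary labels 1{y = i}.
    return {c: fit_binary(X, (y == c).astype(float)) for c in classes}

def predict_one_vs_all(models, x):
    # Pick the class i maximizing h_theta^(i)(x).
    return max(models, key=lambda c: 1.0 / (1.0 + np.exp(-(x @ models[c]))))

X = np.array([[1.0, 0.0, 0.0], [1.0, 0.5, 0.5],    # class 0 near the origin
              [1.0, 5.0, 0.0], [1.0, 5.5, 0.5],    # class 1 along x_1
              [1.0, 0.0, 5.0], [1.0, 0.5, 5.5]])   # class 2 along x_2
y = np.array([0, 0, 1, 1, 2, 2])
models = fit_one_vs_all(X, y, classes=[0, 1, 2])
print(predict_one_vs_all(models, np.array([1.0, 4.8, 0.2])))   # -> 1
```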
Generative Approach
Ex: Naïve Bayes
Estimate $P(Y)$ and $P(X|Y)$
Prediction: $\hat{y} = \arg\max_y P(Y = y) P(X = x \mid Y = y)$
Discriminative Approach
Ex: Logistic regression
Estimate $P(Y|X)$ directly
(Or a discriminant function: e.g., SVM)
Prediction: $\hat{y} = \arg\max_y P(Y = y \mid X = x)$
Further readings
• Tom M. Mitchell, Generative and discriminative classifiers: Naïve Bayes and Logistic Regression. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• Andrew Ng, Michael Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Example: Linear regression
[Figure: three fits of housing data, Price ($) in 1000's vs. Size in feet^2]
$h_\theta(x) = \theta_0 + \theta_1 x$ (Underfitting)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (Just right)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (Overfitting)
Slide credit: Andrew Ng
Overfitting
• If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \approx 0$
but fail to generalize to new examples (predict prices on new examples).
Slide credit: Andrew Ng
Example: Linear regression
[Figure: the same three fits of housing data, Price ($) in 1000's vs. Size in feet^2]
$h_\theta(x) = \theta_0 + \theta_1 x$ (Underfitting: high bias)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (Just right)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (Overfitting: high variance)
Slide credit: Andrew Ng
Bias-Variance Tradeoff
• Bias: difference between what you expect to learn and truth
• Measures how well you expect to represent the true solution
• Decreases with more complex model
• Variance: difference between what you expect to learn and what you learn from a particular dataset
• Measures how sensitive the learner is to a specific dataset
• Increases with more complex model
[Figure: 2×2 grid of dartboards illustrating the four combinations of low/high bias and low/high variance]
Bias–variance decomposition
• Training set $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$
• $y = f(x) + \varepsilon$
• We want $\hat{f}(x)$ that minimizes $E\left[ \left( y - \hat{f}(x) \right)^2 \right]$
$E\left[ \left( y - \hat{f}(x) \right)^2 \right] = \text{Bias}\left[ \hat{f}(x) \right]^2 + \text{Var}\left[ \hat{f}(x) \right] + \sigma^2$
$\text{Bias}\left[ \hat{f}(x) \right] = E\left[ \hat{f}(x) \right] - f(x)$
$\text{Var}\left[ \hat{f}(x) \right] = E\left[ \hat{f}(x)^2 \right] - E\left[ \hat{f}(x) \right]^2$
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
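A quick Monte-Carlo sanity check of the decomposition at a single point $x_0$; every setting here (the sine target, the straight-line fit, the noise level) is an arbitrary assumption of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)                      # true f
sigma, x0, trials = 0.3, 1.0, 5000
preds = []
for _ in range(trials):
    x = rng.uniform(0, 2, size=20)           # a fresh training set each trial
    y = f(x) + rng.normal(0, sigma, size=20)
    coef = np.polyfit(x, y, deg=1)           # f_hat: a straight-line fit
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
mse = np.mean((f(x0) + rng.normal(0, sigma, trials) - preds) ** 2)
print(mse, bias2 + var + sigma ** 2)         # the two sides roughly agree
```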
Overfitting
[Figure: three classifiers on tumor data, Tumor Size vs. Age]
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ (Underfitting)
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2)$ (Just right)
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$ (Overfitting)
Slide credit: Andrew Ng
Addressing overfitting
• 𝑥1 = size of house
• 𝑥2 = no. of bedrooms
• 𝑥3 = no. of floors
• 𝑥4 = age of house
• 𝑥5 = average income in neighborhood
• 𝑥6 = kitchen size
• ⋮
• $x_{100}$
[Figure: overfit curve on housing data, Price ($) in 1000's vs. Size in feet^2]
Slide credit: Andrew Ng
Addressing overfitting
• 1. Reduce number of features.
• Manually select which features to keep.
• Model selection algorithm (later in course).
• 2. Regularization.
• Keep all the features, but reduce the magnitude/values of parameters $\theta_j$.
• Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Slide credit: Andrew Ng
Overfitting Thriller
• https://www.youtube.com/watch?v=DQWI1kvmwRg
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Intuition
• Suppose we penalize and make $\theta_3$, $\theta_4$ really small:
$\min_\theta J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ vs. $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$
[Figure: quadratic fit vs. quartic fit on housing data, Price ($) in 1000's vs. Size in feet^2]
Slide credit: Andrew Ng
Regularization
• Small values for parameters $\theta_1, \theta_2, \cdots, \theta_n$
• "Simpler" hypothesis
• Less prone to overfitting
• Housing:
• Features: $x_1, x_2, \cdots, x_{100}$
• Parameters: $\theta_0, \theta_1, \theta_2, \cdots, \theta_{100}$
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
Slide credit: Andrew Ng
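The penalized cost translates directly to code; a sketch (my names) that deliberately skips $\theta_0$, anticipating the convention on the later slides.

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """J(theta) = (1/2m) [ sum (h - y)^2 + lam * sum_{j>=1} theta_j^2 ]."""
    m = X.shape[0]
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # j runs from 1 to n; theta_0 unpenalized
    return (np.sum(residual ** 2) + penalty) / (2 * m)
```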
Regularization
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
$\min_\theta J(\theta)$
$\lambda$: Regularization parameter
[Figure: regularized fit on housing data, Price ($) in 1000's vs. Size in feet^2]
Slide credit: Andrew Ng
Question
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)?
1. Algorithm works fine; setting $\lambda$ to be very large can't hurt it.
2. Algorithm fails to eliminate overfitting.
3. Algorithm results in underfitting (fails to fit even the training data well).
4. Gradient descent will fail to converge.
Slide credit: Andrew Ng
Question
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)?
[Figure: housing data, Price ($) in 1000's vs. Size in feet^2, with a heavily regularized fit]
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
Slide credit: Andrew Ng
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Regularized linear regression
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
$\min_\theta J(\theta)$
$n$: Number of features; $\theta_0$ is not penalized
Slide credit: Andrew Ng
Gradient descent (Previously)
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$   $(j = 0)$
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$   $(j = 1, 2, 3, \cdots, n)$
}
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$
}
Equivalently: $\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
Slide credit: Andrew Ng
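One regularized step in code (a sketch under my naming; note it reduces to the weight-decay form on the next slide).

```python
import numpy as np

def ridge_gd_step(theta, X, y, alpha, lam):
    """theta_j := theta_j(1 - alpha*lam/m) - alpha*(1/m) sum (h-y) x_j; theta_0 unpenalized."""
    m = X.shape[0]
    new_theta = theta - alpha * (X.T @ (X @ theta - y)) / m
    new_theta[1:] -= alpha * (lam / m) * theta[1:]   # the weight-decay term, j >= 1
    return new_theta
```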
Comparison
Regularized linear regression:
$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
Un-regularized linear regression:
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
$1 - \alpha \frac{\lambda}{m} < 1$: Weight decay
Normal equation
• $X = \begin{bmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(m)\top} \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$, $y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^m$
• $\min_\theta J(\theta)$
• $\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \right)^{-1} X^\top y$, where the matrix is $(n+1) \times (n+1)$
Slide credit: Andrew Ng
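The closed form is one linear solve in NumPy; a sketch where L is the identity with its top-left entry zeroed so $\theta_0$ escapes the penalty (with the all-ones intercept column, this matrix is invertible for $\lambda > 0$).

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """theta = (X^T X + lam * L)^{-1} X^T y with L = diag(0, 1, ..., 1)."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                                    # do not penalize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(ridge_closed_form(X, y, lam=0.1))   # slightly shrunk relative to [1, 1]
```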
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Regularized logistic regression
• Cost function:
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - h_\theta(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
[Figure: overfit classifier on tumor data, Tumor Size vs. Age]
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$
}
Same update form as regularized linear regression, but now $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$; the bracketed term is $\frac{\partial}{\partial \theta_j} J(\theta)$.
Slide credit: Andrew Ng
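The full regularized loop, sketched with my own names ($\theta_0$ left out of the penalty, as on the slide).

```python
import numpy as np

def fit_logistic_l2(X, y, alpha=0.1, lam=0.1, iters=2000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = (X.T @ (h - y)) / m
        grad[1:] += (lam / m) * theta[1:]   # penalty gradient for j >= 1
        theta -= alpha * grad
    return theta
```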
$\|\theta\|_1$: Lasso regularization
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$
LASSO: Least Absolute Shrinkage and Selection Operator
Single predictor: Soft Thresholding
• minimize$_\theta$ $\frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)} \theta - y^{(i)} \right)^2 + \lambda \|\theta\|_1$
$\hat{\theta} = \begin{cases} \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle - \lambda & \text{if } \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle > \lambda \\ 0 & \text{if } \frac{1}{m} \left| \langle \boldsymbol{x}, \boldsymbol{y} \rangle \right| \leq \lambda \\ \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle + \lambda & \text{if } \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle < -\lambda \end{cases}$
$\hat{\theta} = S_\lambda\left( \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle \right)$
Soft thresholding operator: $S_\lambda(x) = \text{sign}(x)\left( |x| - \lambda \right)_+$
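The operator and the single-predictor solution in code; a sketch assuming, as the closed form requires, a standardized predictor with $\frac{1}{m}\sum_i (x^{(i)})^2 = 1$.

```python
import numpy as np

def soft_threshold(x, lam):
    """S_lambda(x) = sign(x) * max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_single_predictor(x, y, lam):
    """theta_hat = S_lambda(<x, y> / m), valid for a standardized predictor."""
    return soft_threshold(np.dot(x, y) / len(x), lam)

x = np.array([-1.0, 1.0, -1.0, 1.0])           # (1/m) sum x_i^2 = 1
y = np.array([-1.2, 0.8, -0.9, 1.1])
print(lasso_single_predictor(x, y, lam=0.3))   # <x, y>/m = 1.0, so theta_hat = 0.7
```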
Multiple predictors: Cyclic Coordinate Descent
• minimize$_\theta$ $\frac{1}{2m} \sum_{i=1}^{m} \left( x_j^{(i)} \theta_j + \sum_{k \neq j} x_k^{(i)} \theta_k - y^{(i)} \right)^2 + \lambda \sum_{k \neq j} |\theta_k| + \lambda |\theta_j|$
For each $j$, update $\theta_j$ with
minimize$_{\theta_j}$ $\frac{1}{2m} \sum_{i=1}^{m} \left( x_j^{(i)} \theta_j - r_j^{(i)} \right)^2 + \lambda |\theta_j|$
where $r_j^{(i)} = y^{(i)} - \sum_{k \neq j} x_k^{(i)} \theta_k$
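Putting the pieces together gives a short cyclic-coordinate-descent lasso (my own loop structure; columns assumed standardized so $\frac{1}{m}\|x_j\|^2 = 1$, which makes each coordinate update a single soft threshold).

```python
import numpy as np

def lasso_cd(X, y, lam, sweeps=100):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(sweeps):
        for j in range(n):
            r_j = y - X @ theta + X[:, j] * theta[j]             # partial residual r_j
            rho = np.dot(X[:, j], r_j) / m                       # (1/m) <x_j, r_j>
            theta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)   # soft threshold
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X /= np.sqrt((X ** 2).mean(axis=0))          # standardize: (1/m)||x_j||^2 = 1
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
print(lasso_cd(X, y, lam=0.1))               # roughly [3, -2, 0, 0, 0], shrunk by lam
```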
L1 and L2 balls
Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf
Terminology (regularization function, name, solver)
• $\|\theta\|_2^2 = \sum_{j=1}^{n} \theta_j^2$: Tikhonov regularization / Ridge regression; closed-form solution
• $\|\theta\|_1 = \sum_{j=1}^{n} |\theta_j|$: LASSO regression; proximal gradient descent, least angle regression
• $\alpha \|\theta\|_1 + (1 - \alpha) \|\theta\|_2^2$: Elastic net regularization; proximal gradient descent
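If scikit-learn is available, its solvers map onto these rows; a usage sketch with arbitrary parameter values (sklearn's alpha plays the role of $\lambda$, and its l1_ratio the role of $\alpha$ above).

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=100)

print(Ridge(alpha=1.0).fit(X, y).coef_)                       # Tikhonov / ridge
print(Lasso(alpha=0.1).fit(X, y).coef_)                       # LASSO (sparse)
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)    # elastic net
```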
Things to remember
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression