Regularization
Jia-Bin Huang
Virginia Tech, Spring 2019, ECE-5424G / CS-5824
Administrative
• Women in Data Science Blacksburg
• Location: Holtzman Alumni Center
• Welcome, 3:30 - 3:40, Assembly hall
• Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40 - 4:05, Assembly hall
• Career Panel, 4:05 - 5:00, Assembly hall
• Break, 5:00 - 5:20, Grand hall
• Keynote Speaker: Sally Morton, "Bias," 5:20 - 5:45, Assembly hall
• Dinner with breakout discussion groups, 5:45 - 7:00, Museum
• Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization," 7:00 - 8:15, Assembly hall
• Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7:00 - 8:15, 2nd floor board room
k-NN (Classification/Regression)
• Model: training set $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})$
• Cost function
None
• Learning
Do nothing
• Inference
$\hat{y} = h(x^{\text{test}}) = y^{(k)}$, where $k = \arg\min_i D(x^{\text{test}}, x^{(i)})$
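Below is a minimal NumPy sketch of this inference rule, assuming Euclidean distance for $D$; the names (knn_predict, X_train, and so on) are my own, and a majority vote handles $k > 1$.

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=1):
    """k-NN inference: no learning, just distances to the training set."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # D(x_test, x^(i)) for each i
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    # k = 1 reproduces y^(k) with k = argmin_i D(x_test, x^(i));
    # for k > 1 we take a majority vote over the neighbors' labels.
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y = np.array([0, 0, 1])
print(knn_predict(X, y, np.array([0.9, 1.2])))   # nearest neighbor has label 0
```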
Linear regression (Regression)
• Model: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
• Cost function
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
• Learning
1) Gradient descent: Repeat $\{ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} ( h_\theta(x^{(i)}) - y^{(i)} ) x_j^{(i)} \}$
2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$
• Inference: $\hat{y} = h_\theta(x^{\text{test}}) = \theta^\top x^{\text{test}}$
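Both learning options can be sketched in a few lines of NumPy (illustrative code with my own names; $X$ is assumed to include the intercept column $x_0 = 1$).

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Repeated update theta_j := theta_j - alpha * (1/m) * sum (h(x)-y) * x_j."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
    return theta

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, via a linear solve rather than an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column is x_0 = 1
y = np.array([2.0, 3.0, 4.0])
print(gradient_descent(X, y))   # approaches [1, 1]
print(normal_equation(X, y))    # exactly [1, 1]
```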
Naïve Bayes (Classification)
• Model: $h_\theta(x) = P(Y \mid X_1, X_2, \cdots, X_n) \propto P(Y) \prod_i P(X_i \mid Y)$
• Cost function
Maximum likelihood estimation: $J(\theta) = -\log P(\text{Data} \mid \theta)$
Maximum a posteriori estimation: $J(\theta) = -\log P(\text{Data} \mid \theta) P(\theta)$
• Learning
$\pi_k = P(Y = y_k)$
(Discrete $X_i$) $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$
(Continuous $X_i$) mean $\mu_{ik}$, variance $\sigma_{ik}^2$: $P(X_i \mid Y = y_k) = \mathcal{N}(X_i \mid \mu_{ik}, \sigma_{ik}^2)$
• Inference
$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{\text{test}} \mid Y = y_k)$
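For the continuous case, here is a Gaussian naive Bayes sketch matching the learning and inference rules above (toy code, hypothetical names, with a small variance floor added for numerical stability).

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate pi_k, mu_ik, sigma_ik^2 for each class."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}                  # pi_k
    means = {c: X[y == c].mean(axis=0) for c in classes}            # mu_ik
    variances = {c: X[y == c].var(axis=0) + 1e-9 for c in classes}  # sigma_ik^2
    return classes, priors, means, variances

def predict_gnb(model, x):
    """argmax_k P(Y=y_k) * prod_i N(x_i | mu_ik, sigma_ik^2), computed in log space."""
    classes, priors, means, variances = model
    def log_joint(c):
        return np.log(priors[c]) - 0.5 * np.sum(
            np.log(2 * np.pi * variances[c]) + (x - means[c]) ** 2 / variances[c])
    return max(classes, key=log_joint)

X = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 5.0], [4.2, 4.8]])
y = np.array([0, 0, 1, 1])
print(predict_gnb(fit_gnb(X, y), np.array([4.1, 5.1])))   # -> 1
```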
Logistic regression (Classification)
• Model: $h_\theta(x) = P(Y = 1 \mid X_1, X_2, \cdots, X_n) = \frac{1}{1 + e^{-\theta^\top x}}$
• Cost function
$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \text{Cost}(h_\theta(x^{(i)}), y^{(i)})$, where
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
• Learning
Gradient descent: Repeat $\{ \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} ( h_\theta(x^{(i)}) - y^{(i)} ) x_j^{(i)} \}$
• Inference
$\hat{Y} = h_\theta(x^{\text{test}}) = \frac{1}{1 + e^{-\theta^\top x^{\text{test}}}}$
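The whole model fits in a short NumPy sketch (again illustrative, my own names; note the gradient update has the same form as linear regression, only $h_\theta$ changes).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=2000):
    """Gradient descent on the logistic cost."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)                  # h_theta(x^(i)) for every example
        theta -= alpha * (X.T @ (h - y)) / m    # same update form as linear regression
    return theta

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])   # x_0 = 1 intercept
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
print(sigmoid(X @ theta))   # low probabilities for the y = 0 rows, high for y = 1
```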
Logistic Regression
•Hypothesis representation
•Cost function
• Logistic regression with gradient descent
•Regularization
•Multi-class classification
$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
How about MAP?
• Maximum conditional likelihood estimate (MCLE)
• Maximum conditional a posterior estimate (MCAP)
$\theta_{\text{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)})$
$\theta_{\text{MCAP}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta(y^{(i)} \mid x^{(i)}) P(\theta)$
Prior 𝑃(𝜃)
• Common choice of $P(\theta)$:
• Normal distribution, zero mean, identity covariance
• "Pushes" parameters towards zero
• Corresponds to regularization
• Helps avoid very large weights and overfitting
Slide credit: Tom Mitchell
MLE vs. MAP
• Maximum conditional likelihood estimate (MCLE)
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
• Maximum conditional a posteriori estimate (MCAP)
$\theta_j := \theta_j - \alpha \lambda \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
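In code, the two updates differ only by the weight-decay term; a one-step sketch (lam is my name for the regularization strength).

```python
import numpy as np

def logistic_step(theta, X, y, alpha=0.1, lam=0.0):
    """One gradient step; lam = 0 gives MCLE, lam > 0 gives MCAP."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    grad = (X.T @ (h - y)) / X.shape[0]
    return theta - alpha * lam * theta - alpha * grad   # extra -alpha*lam*theta_j for MCAP
```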
Logistic Regression
•Hypothesis representation
•Cost function
• Logistic regression with gradient descent
•Regularization
•Multi-class classification
Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby
• Medical diagnosis: Not ill, Cold, Flu
• Weather: Sunny, Cloudy, Rain, Snow
Slide credit: Andrew Ng
Binary classification
[Figure: two-class data in the (x1, x2) plane]
Multiclass classification
[Figure: three-class data in the (x1, x2) plane]
One-vs-all (one-vs-rest)
[Figure: three-class data in the (x1, x2) plane split into three binary problems, Class 1 vs. rest, Class 2 vs. rest, Class 3 vs. rest, with one decision boundary per classifier $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, $h_\theta^{(3)}(x)$]
$h_\theta^{(i)}(x) = P(y = i \mid x; \theta)$, $i = 1, 2, 3$
Slide credit: Andrew Ng
One-vs-all
• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$
• Given a new input $x$, pick the class $i$ that maximizes $\max_i h_\theta^{(i)}(x)$
Slide credit: Andrew Ng
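A runnable one-vs-rest sketch (my own names; the binary trainer is the same logistic update as before, and the data and classes here are toy assumptions).

```python
import numpy as np

def fit_binary(X, y, alpha=0.1, iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta -= alpha * (X.T @ (h - y)) / X.shape[0]
    return theta

def fit_one_vs_all(X, y, classes):
    # One classifier per class i, trained on the binary labels 1{y = i}.
    return {c: fit_binary(X, (y == c).astype(float)) for c in classes}

def predict_one_vs_all(models, x):
    # Pick the class i maximizing h_theta^(i)(x).
    return max(models, key=lambda c: 1.0 / (1.0 + np.exp(-(x @ models[c]))))

X = np.array([[1.0, 0.0, 0.0], [1.0, 0.5, 0.5],    # class 0 near the origin
              [1.0, 5.0, 0.0], [1.0, 5.5, 0.5],    # class 1 along x_1
              [1.0, 0.0, 5.0], [1.0, 0.5, 5.5]])   # class 2 along x_2
y = np.array([0, 0, 1, 1, 2, 2])
models = fit_one_vs_all(X, y, classes=[0, 1, 2])
print(predict_one_vs_all(models, np.array([1.0, 4.8, 0.2])))   # -> 1
```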
Generative Approach
Ex: Naïve Bayes
Estimate $P(Y)$ and $P(X|Y)$
Prediction: $\hat{y} = \arg\max_y P(Y = y) P(X = x \mid Y = y)$
Discriminative Approach
Ex: Logistic regression
Estimate $P(Y|X)$ directly
(Or a discriminant function: e.g., SVM)
Prediction: $\hat{y} = \arg\max_y P(Y = y \mid X = x)$
Further readings
• Tom M. Mitchell, Generative and discriminative classifiers: Naïve Bayes and Logistic Regression. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• Andrew Ng, Michael Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Example: Linear regression
[Figure: three fits of housing data, Price ($) in 1000's vs. Size in feet^2]
$h_\theta(x) = \theta_0 + \theta_1 x$ (Underfitting)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (Just right)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (Overfitting)
Slide credit: Andrew Ng
Overfitting
• If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \approx 0$
but fail to generalize to new examples (predict prices on new examples).
Slide credit: Andrew Ng
Example: Linear regression
[Figure: the same three fits of housing data, Price ($) in 1000's vs. Size in feet^2]
$h_\theta(x) = \theta_0 + \theta_1 x$ (Underfitting: high bias)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (Just right)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (Overfitting: high variance)
Slide credit: Andrew Ng
Bias-Variance Tradeoff
• Bias: difference between what you expect to learn and truth
• Measures how well you expect to represent the true solution
• Decreases with more complex model
• Variance: difference between what you expect to learn and what you learn from a particular dataset
• Measures how sensitive the learner is to a specific dataset
• Increases with more complex model
[Figure: 2×2 grid of dartboards illustrating the four combinations of low/high bias and low/high variance]
Bias–variance decomposition
• Training set $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$
• $y = f(x) + \varepsilon$
• We want $\hat{f}(x)$ that minimizes $E\left[ \left( y - \hat{f}(x) \right)^2 \right]$
$E\left[ \left( y - \hat{f}(x) \right)^2 \right] = \text{Bias}\left[ \hat{f}(x) \right]^2 + \text{Var}\left[ \hat{f}(x) \right] + \sigma^2$
$\text{Bias}\left[ \hat{f}(x) \right] = E\left[ \hat{f}(x) \right] - f(x)$
$\text{Var}\left[ \hat{f}(x) \right] = E\left[ \hat{f}(x)^2 \right] - E\left[ \hat{f}(x) \right]^2$
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
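A quick Monte-Carlo sanity check of the decomposition at a single point $x_0$; every setting here (the sine target, the straight-line fit, the noise level) is an arbitrary assumption of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)                      # true f
sigma, x0, trials = 0.3, 1.0, 5000
preds = []
for _ in range(trials):
    x = rng.uniform(0, 2, size=20)           # a fresh training set each trial
    y = f(x) + rng.normal(0, sigma, size=20)
    coef = np.polyfit(x, y, deg=1)           # f_hat: a straight-line fit
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
mse = np.mean((f(x0) + rng.normal(0, sigma, trials) - preds) ** 2)
print(mse, bias2 + var + sigma ** 2)         # the two sides roughly agree
```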
Overfitting
[Figure: three classifiers on tumor data, Tumor Size vs. Age]
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ (Underfitting)
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2)$ (Just right)
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$ (Overfitting)
Slide credit: Andrew Ng
Addressing overfitting
• 𝑥1 = size of house
• 𝑥2 = no. of bedrooms
• 𝑥3 = no. of floors
• 𝑥4 = age of house
• 𝑥5 = average income in neighborhood
• 𝑥6 = kitchen size
• ⋮
• $x_{100}$
[Figure: overfit curve on housing data, Price ($) in 1000's vs. Size in feet^2]
Slide credit: Andrew Ng
Addressing overfitting
• 1. Reduce number of features.
• Manually select which features to keep.
• Model selection algorithm (later in course).
• 2. Regularization.
• Keep all the features, but reduce the magnitude/values of parameters $\theta_j$.
• Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Slide credit: Andrew Ng
Overfitting Thriller
• https://www.youtube.com/watch?v=DQWI1kvmwRg
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Intuition
• Suppose we penalize and make $\theta_3$, $\theta_4$ really small:
$\min_\theta J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ vs. $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$
[Figure: quadratic fit vs. quartic fit on housing data, Price ($) in 1000's vs. Size in feet^2]
Slide credit: Andrew Ng
Regularization
• Small values for parameters $\theta_1, \theta_2, \cdots, \theta_n$
• "Simpler" hypothesis
• Less prone to overfitting
• Housing:
• Features: $x_1, x_2, \cdots, x_{100}$
• Parameters: $\theta_0, \theta_1, \theta_2, \cdots, \theta_{100}$
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
Slide credit: Andrew Ng
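The penalized cost translates directly to code; a sketch (my names) that deliberately skips $\theta_0$, anticipating the convention on the later slides.

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """J(theta) = (1/2m) [ sum (h - y)^2 + lam * sum_{j>=1} theta_j^2 ]."""
    m = X.shape[0]
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # j runs from 1 to n; theta_0 unpenalized
    return (np.sum(residual ** 2) + penalty) / (2 * m)
```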
Regularization
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
$\min_\theta J(\theta)$
$\lambda$: Regularization parameter
[Figure: regularized fit on housing data, Price ($) in 1000's vs. Size in feet^2]
Slide credit: Andrew Ng
Question
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)?
1. Algorithm works fine; setting $\lambda$ to be very large can't hurt it.
2. Algorithm fails to eliminate overfitting.
3. Algorithm results in underfitting (fails to fit even the training data well).
4. Gradient descent will fail to converge.
Slide credit: Andrew Ng
Question
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)?
[Figure: housing data, Price ($) in 1000's vs. Size in feet^2, with a heavily regularized fit]
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
Slide credit: Andrew Ng
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Regularized linear regression
$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
$\min_\theta J(\theta)$
$n$: Number of features; $\theta_0$ is not penalized
Slide credit: Andrew Ng
Gradient descent (Previously)
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$   $(j = 0)$
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$   $(j = 1, 2, 3, \cdots, n)$
}
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$
}
Equivalently: $\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
Slide credit: Andrew Ng
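One regularized step in code (a sketch under my naming; note it reduces to the weight-decay form on the next slide).

```python
import numpy as np

def ridge_gd_step(theta, X, y, alpha, lam):
    """theta_j := theta_j(1 - alpha*lam/m) - alpha*(1/m) sum (h-y) x_j; theta_0 unpenalized."""
    m = X.shape[0]
    new_theta = theta - alpha * (X.T @ (X @ theta - y)) / m
    new_theta[1:] -= alpha * (lam / m) * theta[1:]   # the weight-decay term, j >= 1
    return new_theta
```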
Comparison
Regularized linear regression:
$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
Un-regularized linear regression:
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$
$1 - \alpha \frac{\lambda}{m} < 1$: Weight decay
Normal equation
• $X = \begin{bmatrix} x^{(1)\top} \\ x^{(2)\top} \\ \vdots \\ x^{(m)\top} \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$, $y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^m$
• $\min_\theta J(\theta)$
• $\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \right)^{-1} X^\top y$, where the matrix is $(n+1) \times (n+1)$
Slide credit: Andrew Ng
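The closed form is one linear solve in NumPy; a sketch where L is the identity with its top-left entry zeroed so $\theta_0$ escapes the penalty (with the all-ones intercept column, this matrix is invertible for $\lambda > 0$).

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """theta = (X^T X + lam * L)^{-1} X^T y with L = diag(0, 1, ..., 1)."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                                    # do not penalize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
print(ridge_closed_form(X, y, lam=0.1))   # slightly shrunk relative to [1, 1]
```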
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Regularized logistic regression
• Cost function:
$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - h_\theta(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$
[Figure: overfit classifier on tumor data, Tumor Size vs. Age]
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$
}
Same update form as regularized linear regression, but now $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$; the bracketed term is $\frac{\partial}{\partial \theta_j} J(\theta)$.
Slide credit: Andrew Ng
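The full regularized loop, sketched with my own names ($\theta_0$ left out of the penalty, as on the slide).

```python
import numpy as np

def fit_logistic_l2(X, y, alpha=0.1, lam=0.1, iters=2000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = (X.T @ (h - y)) / m
        grad[1:] += (lam / m) * theta[1:]   # penalty gradient for j >= 1
        theta -= alpha * grad
    return theta
```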
$\|\theta\|_1$: Lasso regularization
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$
LASSO: Least Absolute Shrinkage and Selection Operator
Single predictor: Soft Thresholding
• minimize$_\theta$ $\frac{1}{2m} \sum_{i=1}^{m} \left( x^{(i)} \theta - y^{(i)} \right)^2 + \lambda \|\theta\|_1$
$\hat{\theta} = \begin{cases} \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle - \lambda & \text{if } \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle > \lambda \\ 0 & \text{if } \frac{1}{m} \left| \langle \boldsymbol{x}, \boldsymbol{y} \rangle \right| \leq \lambda \\ \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle + \lambda & \text{if } \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle < -\lambda \end{cases}$
$\hat{\theta} = S_\lambda\left( \frac{1}{m} \langle \boldsymbol{x}, \boldsymbol{y} \rangle \right)$
Soft thresholding operator: $S_\lambda(x) = \text{sign}(x)\left( |x| - \lambda \right)_+$
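The operator and the single-predictor solution in code; a sketch assuming, as the closed form requires, a standardized predictor with $\frac{1}{m}\sum_i (x^{(i)})^2 = 1$.

```python
import numpy as np

def soft_threshold(x, lam):
    """S_lambda(x) = sign(x) * max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_single_predictor(x, y, lam):
    """theta_hat = S_lambda(<x, y> / m), valid for a standardized predictor."""
    return soft_threshold(np.dot(x, y) / len(x), lam)

x = np.array([-1.0, 1.0, -1.0, 1.0])           # (1/m) sum x_i^2 = 1
y = np.array([-1.2, 0.8, -0.9, 1.1])
print(lasso_single_predictor(x, y, lam=0.3))   # <x, y>/m = 1.0, so theta_hat = 0.7
```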
Multiple predictors: Cyclic Coordinate Descent
• minimize$_\theta$ $\frac{1}{2m} \sum_{i=1}^{m} \left( x_j^{(i)} \theta_j + \sum_{k \neq j} x_k^{(i)} \theta_k - y^{(i)} \right)^2 + \lambda \sum_{k \neq j} |\theta_k| + \lambda |\theta_j|$
For each $j$, update $\theta_j$ with
minimize$_{\theta_j}$ $\frac{1}{2m} \sum_{i=1}^{m} \left( x_j^{(i)} \theta_j - r_j^{(i)} \right)^2 + \lambda |\theta_j|$
where $r_j^{(i)} = y^{(i)} - \sum_{k \neq j} x_k^{(i)} \theta_k$
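Putting the pieces together gives a short cyclic-coordinate-descent lasso (my own loop structure; columns assumed standardized so $\frac{1}{m}\|x_j\|^2 = 1$, which makes each coordinate update a single soft threshold).

```python
import numpy as np

def lasso_cd(X, y, lam, sweeps=100):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(sweeps):
        for j in range(n):
            r_j = y - X @ theta + X[:, j] * theta[j]             # partial residual r_j
            rho = np.dot(X[:, j], r_j) / m                       # (1/m) <x_j, r_j>
            theta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)   # soft threshold
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X /= np.sqrt((X ** 2).mean(axis=0))          # standardize: (1/m)||x_j||^2 = 1
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
print(lasso_cd(X, y, lam=0.1))               # roughly [3, -2, 0, 0, 0], shrunk by lam
```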
L1 and L2 balls
Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf
Terminology (regularization function, name, solver)
• $\|\theta\|_2^2 = \sum_{j=1}^{n} \theta_j^2$: Tikhonov regularization / Ridge regression; closed-form solution
• $\|\theta\|_1 = \sum_{j=1}^{n} |\theta_j|$: LASSO regression; proximal gradient descent, least angle regression
• $\alpha \|\theta\|_1 + (1 - \alpha) \|\theta\|_2^2$: Elastic net regularization; proximal gradient descent
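If scikit-learn is available, its solvers map onto these rows; a usage sketch with arbitrary parameter values (sklearn's alpha plays the role of $\lambda$, and its l1_ratio the role of $\alpha$ above).

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=100)

print(Ridge(alpha=1.0).fit(X, y).coef_)                       # Tikhonov / ridge
print(Lasso(alpha=0.1).fit(X, y).coef_)                       # LASSO (sparse)
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)    # elastic net
```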
Things to remember
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression