Regularization
Jia-Bin Huang
Virginia Tech, Spring 2019, ECE-5424G / CS-5824
Administrative
• Women in Data Science Blacksburg
• Location: Holtzman Alumni Center
• Welcome, 3:30 - 3:40, Assembly hall
• Keynote Speaker: Milinda Lakkam, "Detecting automation on LinkedIn's platform," 3:40 - 4:05, Assembly hall
• Career Panel, 4:05 - 5:00, Assembly hall
• Break, 5:00 - 5:20, Grand hall
• Keynote Speaker: Sally Morton, "Bias," 5:20 - 5:45, Assembly hall
• Dinner with breakout discussion groups, 5:45 - 7:00, Museum
• Introductory track tutorial: Jennifer Van Mullekom, "Data Visualization," 7:00 - 8:15, Assembly hall
• Advanced track tutorial: Cheryl Danner, "Focal-loss-based Deep Learning for Object Detection," 7:00 - 8:15, 2nd floor board room
k-NN (Classification/Regression)
• Model
$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$
• Cost function
None
• Learning
Do nothing
• Inference
$\hat{y} = h(x^{\text{test}}) = y^{(k)}$, where $k = \arg\min_i D(x^{\text{test}}, x^{(i)})$
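A minimal NumPy sketch of the inference rule above (1-nearest neighbor with Euclidean distance; function and variable names are illustrative, not from the slides):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test):
    # D(x_test, x^(i)): Euclidean distance to every stored training example
    dists = np.linalg.norm(X_train - x_test, axis=1)
    k = np.argmin(dists)          # k = argmin_i D(x_test, x^(i))
    return y_train[k]             # y_hat = y^(k)
```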
Linear regression (Regression)
• Model
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
• Cost function
$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
• Learning
1) Gradient descent: Repeat $\{\theta_j \leftarrow \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}\}$
2) Solving the normal equation: $\theta = (X^\top X)^{-1} X^\top y$
• Inference
$\hat{y} = h_\theta(x^{\text{test}}) = \theta^\top x^{\text{test}}$
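Both learning routes can be sketched in a few lines of NumPy (illustrative helper names; assumes X already carries a leading column of ones for $\theta_0$):

```python
import numpy as np

def fit_normal_equation(X, y):
    # theta = (X^T X)^{-1} X^T y  (solve the linear system rather than forming an explicit inverse)
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_gradient_descent(X, y, alpha=0.01, iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = (X.T @ (X @ theta - y)) / m   # (1/m) * sum_i (h(x^(i)) - y^(i)) x_j^(i)
        theta -= alpha * grad
    return theta

# Inference: y_hat = theta^T x_test, e.g. y_hat = X_test @ theta
```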
Naïve Bayes (Classification)
• Model
$h_\theta(x) = P(Y \mid X_1, X_2, \cdots, X_n) \propto P(Y)\,\prod_i P(X_i \mid Y)$
• Cost function
Maximum likelihood estimation: $J(\theta) = -\log P(\text{Data} \mid \theta)$
Maximum a posteriori estimation: $J(\theta) = -\log P(\text{Data} \mid \theta)\, P(\theta)$
• Learning
$\pi_k = P(Y = y_k)$
(Discrete $X_i$) $\theta_{ijk} = P(X_i = x_{ij} \mid Y = y_k)$
(Continuous $X_i$) mean $\mu_{ik}$, variance $\sigma_{ik}^2$: $P(X_i \mid Y = y_k) = \mathcal{N}(X_i \mid \mu_{ik}, \sigma_{ik}^2)$
• Inference
$\hat{y} \leftarrow \arg\max_{y_k} P(Y = y_k)\,\prod_i P(X_i^{\text{test}} \mid Y = y_k)$
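A rough Gaussian naïve Bayes sketch for the continuous-feature case (illustrative names, not from the slides; discrete features would use count-based estimates of $\theta_{ijk}$ instead):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    # Learn class priors pi_k and per-class, per-feature mean/variance
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),           # prior pi_k = P(Y = y_k)
                     Xc.mean(axis=0),             # mu_ik
                     Xc.var(axis=0) + 1e-9)       # sigma_ik^2 (small floor for stability)
    return params

def predict_gaussian_nb(params, x):
    # Score each class by log P(y_k) + sum_i log N(x_i | mu_ik, sigma_ik^2), pick the argmax
    def log_score(prior, mu, var):
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=lambda c: log_score(*params[c]))
```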
Logistic regression (Classification)
• Model
$h_\theta(x) = P(Y = 1 \mid X_1, X_2, \cdots, X_n) = \frac{1}{1 + e^{-\theta^\top x}}$
• Cost function
$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\text{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$, where
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$
• Learning
Gradient descent: Repeat $\{\theta_j \leftarrow \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}\}$
• Inference
$\hat{y} = h_\theta(x^{\text{test}}) = \frac{1}{1 + e^{-\theta^\top x^{\text{test}}}}$
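A short NumPy sketch of the model, the gradient-descent update, and inference (illustrative names; X is assumed to include an intercept column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)                   # h_theta(x^(i)) for all i
        theta -= alpha * (X.T @ (h - y)) / m     # same update form as linear regression
    return theta

# Inference: P(y = 1 | x_test) = sigmoid(theta^T x_test)
```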
Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$
$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log h_\theta(x) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$
$\theta_j \leftarrow \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
How about MAP?
• Maximum conditional likelihood estimate (MCLE)
$\theta_{\text{MCLE}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta\left(y^{(i)} \mid x^{(i)}\right)$
• Maximum conditional a posteriori estimate (MCAP)
$\theta_{\text{MCAP}} = \arg\max_\theta \prod_{i=1}^{m} P_\theta\left(y^{(i)} \mid x^{(i)}\right) P(\theta)$
Prior $P(\theta)$
• Common choice of $P(\theta)$:
• Normal distribution, zero mean, identity covariance
• "Pushes" parameters towards zero
• Corresponds to regularization
• Helps avoid very large weights and overfitting
Slide credit: Tom Mitchell
MLE vs. MAP
• Maximum conditional likelihood estimate (MCLE)
$\theta_j \leftarrow \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
• Maximum conditional a posteriori estimate (MCAP)
$\theta_j \leftarrow \theta_j - \alpha\lambda\theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
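A one-step numeric comparison of the two updates, using toy values for the gradient and hyperparameters (illustrative only; the gradient vector is assumed to have been computed as above):

```python
import numpy as np

theta = np.array([0.5, -2.0, 3.0])
grad = np.array([0.1, -0.4, 0.2])    # (1/m) sum_i (h(x^(i)) - y^(i)) x_j^(i), assumed precomputed
alpha, lam = 0.1, 1.0

theta_mcle = theta - alpha * grad                        # MCLE: no prior
theta_mcap = theta - alpha * lam * theta - alpha * grad  # MCAP: extra term shrinks weights toward zero
```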
Logistic Regression
• Hypothesis representation
• Cost function
• Logistic regression with gradient descent
• Regularization
• Multi-class classification
Multi-class classification
• Email foldering/tagging: Work, Friends, Family, Hobby
• Medical diagnosis: Not ill, Cold, Flu
• Weather: Sunny, Cloudy, Rain, Snow
Slide credit: Andrew Ng
Binary classification
[Scatter plot of two classes in the $(x_1, x_2)$ plane]
Multiclass classification
[Scatter plot of three classes in the $(x_1, x_2)$ plane]
One-vs-all (one-vs-rest)
[Three-class data in the $(x_1, x_2)$ plane split into three binary problems: Class 1 vs. rest, Class 2 vs. rest, Class 3 vs. rest, with decision boundaries $h_\theta^{(1)}(x)$, $h_\theta^{(2)}(x)$, $h_\theta^{(3)}(x)$]
$h_\theta^{(i)}(x) = P(y = i \mid x; \theta)$ for $i = 1, 2, 3$
Slide credit: Andrew Ng
One-vs-all
• Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$
• Given a new input $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$
Slide credit: Andrew Ng
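A compact sketch of this recipe (illustrative helper names; each binary problem is trained with plain gradient descent on an intercept-augmented X):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y01, alpha=0.1, iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * (X.T @ (sigmoid(X @ theta) - y01)) / len(y01)
    return theta

def fit_one_vs_all(X, y, classes):
    # one classifier h_theta^(i) per class i (label 1 for class i, 0 for the rest)
    return {c: fit_binary(X, (y == c).astype(float)) for c in classes}

def predict_one_vs_all(thetas, x):
    # pick the class i with the largest h_theta^(i)(x)
    return max(thetas, key=lambda c: sigmoid(x @ thetas[c]))
```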
Generative Approach (Ex: Naïve Bayes)
Estimate $P(Y)$ and $P(X \mid Y)$
Prediction: $\hat{y} = \arg\max_y P(Y = y)\, P(X = x \mid Y = y)$
Discriminative Approach (Ex: Logistic regression)
Estimate $P(Y \mid X)$ directly
(Or a discriminant function: e.g., SVM)
Prediction: $\hat{y} = \arg\max_y P(Y = y \mid X = x)$
Further readings
• Tom M. Mitchell, "Generative and discriminative classifiers: Naïve Bayes and Logistic Regression." http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• Andrew Ng and Michael Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes." http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Example: Linear regression
[Three panels: house price ($ in 1000's) vs. size in feet^2, fit with increasingly complex hypotheses]
$h_\theta(x) = \theta_0 + \theta_1 x$ (Underfitting)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (Just right)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (Overfitting)
Slide credit: Andrew Ng
Overfitting
• If we have too many features (i.e., a complex model), the learned hypothesis may fit the training set very well
$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \approx 0$
but fail to generalize to new examples (e.g., predicting prices for new houses).
Slide credit: Andrew Ng
Example: Linear regression
[Same three panels: price ($ in 1000's) vs. size in feet^2]
$h_\theta(x) = \theta_0 + \theta_1 x$ (Underfitting: high bias)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ (Just right)
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \cdots$ (Overfitting: high variance)
Slide credit: Andrew Ng
Bias-Variance Tradeoff
• Bias: difference between what you expect to learn and the truth
• Measures how well you expect to represent the true solution
• Decreases with more complex model
• Variance: difference between what you expect to learn and what you learn from a particular dataset
• Measures how sensitive the learner is to a specific dataset
• Increases with more complex model
[Illustration: 2×2 grid of low/high bias vs. low/high variance]
Bias-variance decomposition
• Training set $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$
• $y = f(x) + \varepsilon$
• We want $\hat{f}(x)$ that minimizes $E\left[\left(y - \hat{f}(x)\right)^2\right]$
$E\left[\left(y - \hat{f}(x)\right)^2\right] = \text{Bias}\left[\hat{f}(x)\right]^2 + \text{Var}\left[\hat{f}(x)\right] + \sigma^2$
$\text{Bias}\left[\hat{f}(x)\right] = E\left[\hat{f}(x)\right] - f(x)$
$\text{Var}\left[\hat{f}(x)\right] = E\left[\hat{f}(x)^2\right] - E\left[\hat{f}(x)\right]^2$
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
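The decomposition can be checked empirically. The toy simulation below (an illustrative setup not from the slides: $f(x) = \sin x$, polynomial fits, a fixed query point $x_0$) estimates squared bias and variance by refitting on many resampled training sets:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin
x0, sigma, degree, n, trials = 1.0, 0.3, 4, 20, 2000

preds = []
for _ in range(trials):
    x = rng.uniform(0, 2 * np.pi, n)
    y = f(x) + rng.normal(0, sigma, n)       # y = f(x) + eps
    coeffs = np.polyfit(x, y, degree)        # f_hat learned from this particular dataset
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2          # Bias[f_hat(x0)]^2
var = preds.var()                            # Var[f_hat(x0)]
print(bias2, var, bias2 + var + sigma ** 2)  # last value ~ E[(y - f_hat(x0))^2]
```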
Overfitting
[Three panels: tumor size vs. age, with increasingly complex decision boundaries]
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ (Underfitting)
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2)$
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$ (Overfitting)
Slide credit: Andrew Ng
Addressing overfitting
• $x_1$ = size of house
• $x_2$ = no. of bedrooms
• $x_3$ = no. of floors
• $x_4$ = age of house
• $x_5$ = average income in neighborhood
• $x_6$ = kitchen size
• ⋮
• $x_{100}$
[Plot: price ($ in 1000's) vs. size in feet^2]
Slide credit: Andrew Ng
Addressing overfitting
• 1. Reduce the number of features.
  • Manually select which features to keep.
  • Model selection algorithm (later in course).
• 2. Regularization.
  • Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.
  • Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Slide credit: Andrew Ng
Overfitting Thriller
âĒ https://www.youtube.com/watch?v=DQWI1kvmwRg
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Intuition
• Suppose we penalize and make $\theta_3$, $\theta_4$ really small.
$\min_\theta J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$ vs. $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$
[Two panels: price ($ in 1000's) vs. size in feet^2, quadratic fit vs. quartic fit]
Slide credit: Andrew Ng
Regularization
• Small values for parameters $\theta_1, \theta_2, \cdots, \theta_n$
• "Simpler" hypothesis
• Less prone to overfitting
• Housing:
• Features: $x_1, x_2, \cdots, x_{100}$
• Parameters: $\theta_0, \theta_1, \theta_2, \cdots, \theta_{100}$
$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$
Slide credit: Andrew Ng
Regularization
$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$
$\min_\theta J(\theta)$
[Plot: price ($ in 1000's) vs. size in feet^2 with a regularized fit]
$\lambda$: Regularization parameter
Slide credit: Andrew Ng
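A small NumPy sketch of this regularized cost (assumes the bracketed $\frac{1}{2m}$ form above and that $\theta_0$ is excluded from the penalty; names are illustrative):

```python
import numpy as np

def cost_regularized(theta, X, y, lam):
    m = len(y)
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)       # skip theta_0
    return (residual @ residual + penalty) / (2 * m)
```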
Question
$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$
What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)?
1. Algorithm works fine; setting $\lambda$ to be very large can't hurt it
2. Algorithm fails to eliminate overfitting.
3. Algorithm results in underfitting (fails to fit even the training data well).
4. Gradient descent will fail to converge.
Slide credit: Andrew Ng
Question
$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$
What if $\lambda$ is set to an extremely large value (say $\lambda = 10^{10}$)?
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^\top x$
With a very large $\lambda$, all of $\theta_1, \cdots, \theta_n$ are pushed close to zero, so $h_\theta(x) \approx \theta_0$ and the model underfits.
[Plot: price ($ in 1000's) vs. size in feet^2 with a nearly flat fit]
Slide credit: Andrew Ng
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Regularized linear regression
$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$
$\min_\theta J(\theta)$
$n$: Number of features
$\theta_0$ is not penalized
Slide credit: Andrew Ng
Gradient descent (Previously)
Repeat {
$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$    $(j = 0)$
$\theta_j \leftarrow \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$    $(j = 1, 2, 3, \cdots, n)$
}
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$
$\theta_j \leftarrow \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$
}
Equivalently: $\theta_j \leftarrow \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Slide credit: Andrew Ng
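The regularized update can be sketched directly (illustrative function name; $\theta_0$ gets no $\lambda$ term, matching the slide; X includes an intercept column):

```python
import numpy as np

def fit_ridge_gd(X, y, lam=1.0, alpha=0.01, iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = (X.T @ (X @ theta - y)) / m
        grad[1:] += (lam / m) * theta[1:]    # + (lambda/m) * theta_j for j >= 1
        theta -= alpha * grad                # equivalent to shrinking theta_j by (1 - alpha*lambda/m) first
    return theta
```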
Comparison
Regularized linear regression
$\theta_j \leftarrow \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
Un-regularized linear regression
$\theta_j \leftarrow \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
$1 - \alpha\frac{\lambda}{m} < 1$: Weight decay
Normal equation
• $X = \begin{bmatrix} (x^{(1)})^\top \\ (x^{(2)})^\top \\ \vdots \\ (x^{(m)})^\top \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$,  $y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}$
• $\min_\theta J(\theta)$
• $\theta = \left(X^\top X + \lambda \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}\right)^{-1} X^\top y$, where the added matrix is $(n+1) \times (n+1)$
Slide credit: Andrew Ng
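A NumPy sketch of the regularized normal equation (the identity-like matrix has a zero in its top-left entry so $\theta_0$ is not penalized; names are illustrative):

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    n_plus_1 = X.shape[1]
    L = np.eye(n_plus_1)
    L[0, 0] = 0.0                                   # do not penalize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```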
Regularization
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression
Regularized logistic regression
• Cost function:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
[Plot: tumor size vs. age with a complex decision boundary]
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6 x_1^3 x_2 + \theta_7 x_1 x_2^3 + \cdots)$
Slide credit: Andrew Ng
Gradient descent (Regularized)
Repeat {
$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$
$\theta_j \leftarrow \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$
}
where now $h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$ and the bracketed term is $\frac{\partial}{\partial\theta_j}J(\theta)$.
Slide credit: Andrew Ng
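A sketch of the regularized logistic cost and one gradient step (illustrative names; $\theta_0$ left unpenalized as on the slides; X includes an intercept column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    ce = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))   # cross-entropy part
    return ce + lam / (2 * m) * np.sum(theta[1:] ** 2)       # + (lambda/2m) * sum theta_j^2

def step(theta, X, y, lam, alpha):
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m                # gradient of the cross-entropy part
    grad[1:] += (lam / m) * theta[1:]                        # regularization term for j >= 1
    return theta - alpha * grad
```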
$\ell_1$: Lasso regularization
$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}|\theta_j|$
LASSO: Least Absolute Shrinkage and Selection Operator
Single predictor: Soft Thresholding
• minimize$_\theta$ $\frac{1}{2m}\sum_{i=1}^{m}\left(x^{(i)}\theta - y^{(i)}\right)^2 + \lambda\|\theta\|_1$
$\hat{\theta} = \begin{cases} \frac{1}{m}\langle x, y\rangle - \lambda & \text{if } \frac{1}{m}\langle x, y\rangle > \lambda \\ 0 & \text{if } \frac{1}{m}\left|\langle x, y\rangle\right| \le \lambda \\ \frac{1}{m}\langle x, y\rangle + \lambda & \text{if } \frac{1}{m}\langle x, y\rangle < -\lambda \end{cases}$
$\hat{\theta} = S_\lambda\left(\frac{1}{m}\langle x, y\rangle\right)$
Soft thresholding operator: $S_\lambda(x) = \text{sign}(x)\left(|x| - \lambda\right)_+$
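The operator and the single-predictor solution in NumPy (assuming, as the slide's closed form does, a standardized predictor with $\frac{1}{m}\langle x, x\rangle = 1$; names are illustrative):

```python
import numpy as np

def soft_threshold(x, lam):
    # S_lambda(x) = sign(x) * max(|x| - lambda, 0)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_single_predictor(x, y, lam):
    # theta_hat = S_lambda(<x, y> / m), valid when <x, x> / m = 1
    m = len(y)
    return soft_threshold(x @ y / m, lam)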
Multiple predictors: Cyclic Coordinate Descent
• minimize$_\theta$ $\frac{1}{2m}\sum_{i=1}^{m}\left(x_j^{(i)}\theta_j + \sum_{k \ne j} x_k^{(i)}\theta_k - y^{(i)}\right)^2 + \lambda\sum_{k \ne j}|\theta_k| + \lambda\|\theta_j\|_1$
For each $j$, update $\theta_j$ with
minimize$_{\theta_j}$ $\frac{1}{2m}\sum_{i=1}^{m}\left(x_j^{(i)}\theta_j - r_j^{(i)}\right)^2 + \lambda\|\theta_j\|_1$
where $r_j^{(i)} = y^{(i)} - \sum_{k \ne j} x_k^{(i)}\theta_k$
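A sketch of the full cyclic loop (assumes each column of X is standardized so $\frac{1}{m}\langle x_j, x_j\rangle = 1$, as in the single-predictor case; function name is illustrative):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, iters=100):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        for j in range(n):
            # partial residual r_j = y - sum_{k != j} x_k theta_k
            r_j = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r_j / m                              # (1/m) <x_j, r_j>
            theta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)   # soft threshold
    return theta
```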
L1 and L2 balls
Image credit: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf
Terminology
Regularization function | Name | Solver
$\|\theta\|_2^2 = \sum_{j=1}^{n}\theta_j^2$ | Tikhonov regularization / Ridge regression | Closed form
$\|\theta\|_1 = \sum_{j=1}^{n}|\theta_j|$ | LASSO regression | Proximal gradient descent, least angle regression
$\alpha\|\theta\|_1 + (1 - \alpha)\|\theta\|_2^2$ | Elastic net regularization | Proximal gradient descent
Things to remember
• Overfitting
• Cost function
• Regularized linear regression
• Regularized logistic regression