CSE 446: Machine Learning
Emily Fox, University of Washington
January 13, 2017
©2017 Emily Fox
Ridge Regression: Regulating overfitting when using many features
Training, true, & test error vs. model complexity
[Figure: training, true, and test error plotted against model complexity (x-axis: model complexity; y-axis: error), with small x vs. y fits illustrating an underfit and an overfit model. Overfitting: training error is low but true/test error is high.]
Error vs. amount of data
[Figure: error vs. # data points in training set (x-axis: # data points in training set; y-axis: error).]
Overfitting of polynomial regression
Flexibility of high-order polynomials
[Figure: two fits of price ($) vs. square feet (sq.ft.): a low-order fit fŵ and a high-order polynomial fit fŵ that passes through the training points but oscillates wildly (OVERFIT).]

Model: yi = w0 + w1 xi + w2 xi^2 + … + wp xi^p + εi
Symptom of overfitting
Often, overfitting is associated with very large estimated parameters ŵ
Overfitting of linear regression models more generically
Overfitting with many features
Overfitting is not unique to polynomial regression; it also arises with lots of inputs (d large) or, generically, lots of features (D large):

yi = Σ_{j=0}^D wj hj(xi) + εi

Example features:
- Square feet
- # bathrooms
- # bedrooms
- Lot size
- Year built
- …
How does # of observations influence overfitting?
Few observations (N small) → rapidly overfit as model complexity increases
Many observations (N very large) → harder to overfit
[Figure: two plots of price ($) vs. square feet (sq.ft.) with fitted fŵ: with few observations the fit oscillates through the points; with many observations the fit is smooth.]
How does # of inputs influence overfitting?
1 input (e.g., sq.ft.): Data must include representative examples of all possible (sq.ft., $) pairs to avoid overfitting
[Figure: price ($) vs. square feet (sq.ft.) with fit fŵ; covering even this 1-D input space well is HARD.]
How does # of inputs influence overfitting?
d inputs (e.g., sq.ft., #bath, #bed, lot size, year,…):
Data must include examples of all possible (sq.ft., #bath, #bed, lot size, year, …, $) combinations to avoid overfitting
[Figure: price ($) vs. two inputs x[1] (square feet) and x[2]; covering all combinations of a higher-dimensional input space is MUCH HARDER.]
Adding term to cost-of-fit to prefer small coefficients
Desired total cost format
Want to balance:
i. How well the function fits the data (measure of fit: small # = good fit to training data)
ii. Magnitude of the coefficients (measure of magnitude: small # = not overfit)

Total cost = measure of fit + measure of magnitude of coefficients
Measure of fit to training data
RSS(w) = Σ_{i=1}^N (yi − h(xi)^T w)^2

[Figure: price ($) vs. square feet (x[1]) and # bathrooms (x[2]) with a fitted regression surface.]
Measure of magnitude of regression coefficients

What summary # is indicative of the size of the regression coefficients?
- Sum?
- Sum of absolute values?
- Sum of squares (L2 norm)
Consider specific total cost

Total cost = measure of fit + measure of magnitude of coefficients
           = RSS(w) + ||w||_2^2
Consider resulting objective

What if ŵ is selected to minimize RSS(w) + λ||w||_2^2, where the tuning parameter λ balances fit and magnitude?

If λ=0: reduces to the standard least squares (RSS) fit
If λ=∞: the solution is ŵ = 0
If λ in between: balances fit to the data against the magnitude of the coefficients
This objective is called ridge regression (a.k.a. L2 regularization).
Bias-variance tradeoff
Large λ: high bias, low variance (e.g., ŵ = 0 for λ=∞)
Small λ: low bias, high variance (e.g., standard least squares (RSS) fit of a high-order polynomial for λ=0)

In essence, λ controls model complexity.
Revisit polynomial fit demo
What happens if we refit our high-order polynomial, but now using ridge regression?
Will consider a few settings of λ …
Coefficient path

[Figure: coefficient paths (Ridge). Estimated coefficients ŵj (y-axis) plotted against λ (x-axis, roughly 0 to 200,000) for the features bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront.]
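As an aside (not from the slides), a coefficient path like this can be traced numerically. The sketch below uses the closed-form ridge solution derived later in this lecture on synthetic stand-in data; the data, λ grid, and function names are my own.

```python
import numpy as np

def ridge_closed_form(H, y, lam):
    """Ridge estimate: w_hat = (H^T H + lam*I)^(-1) H^T y."""
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)

# Synthetic stand-in data (N observations, D features).
rng = np.random.default_rng(0)
N, D = 100, 8
H = rng.normal(size=(N, D))
y = H @ rng.normal(size=D) + rng.normal(scale=0.5, size=N)

# One row of coefficients per lambda value: the coefficient path.
lambdas = np.logspace(-2, 5, 30)
path = np.array([ridge_closed_form(H, y, lam) for lam in lambdas])
# path[k, j] is coefficient j at lambdas[k]; coefficients shrink toward 0 as lambda grows.
```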
Fitting the ridge regression model (for given λ value)
Step 1: Rewrite total cost in matrix notation
Recall matrix form of RSS
Model for all N observations together
y = Hw + ε
RSS(w) = Σ_{i=1}^N (yi − h(xi)^T w)^2 = (y − Hw)^T (y − Hw)
Rewrite magnitude of coefficients in vector notation
||w||_2^2 = w0^2 + w1^2 + w2^2 + … + wD^2 = w^T w
Putting it all together
In matrix form, ridge regression cost is:
RSS(w) + λ||w||_2^2 = (y − Hw)^T (y − Hw) + λ w^T w
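For concreteness (my own illustration, not part of the slides), this matrix-form cost translates directly into NumPy, assuming H, y, and w are already arrays of compatible shape:

```python
import numpy as np

def ridge_cost(w, H, y, lam):
    """Ridge total cost: (y - Hw)^T (y - Hw) + lam * w^T w."""
    resid = y - H @ w
    return resid @ resid + lam * (w @ w)
```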
Step 2: Compute the gradient
Gradient of ridge regression cost

∇[RSS(w) + λ||w||_2^2] = ∇[(y − Hw)^T (y − Hw) + λ w^T w]
= ∇[(y − Hw)^T (y − Hw)] + λ ∇[w^T w]
= −2H^T (y − Hw) + λ(2w)

Why 2w? By analogy to the 1-d case: w^T w is analogous to w^2, and the derivative of w^2 is 2w.
Step 3, Approach 1: Set the gradient = 0
Ridge closed-form solution
∇cost(w) = −2H^T (y − Hw) + 2λIw = 0

Solve for w:
H^T Hw + λIw = H^T y
(H^T H + λI) w = H^T y
ŵ = (H^T H + λI)^{-1} H^T y
Interpreting ridge closed-form solution
ŵ = (H^T H + λI)^{-1} H^T y

If λ=0: ŵ = (H^T H)^{-1} H^T y, the standard least squares solution
If λ=∞: ŵ = 0
Recall discussion on previous closed-form solution
ŵ = (H^T H)^{-1} H^T y

Invertible if: in general, # linearly independent observations N > D
Complexity of inverse: O(D^3)
Discussion of ridge closed-form solution
ŵ = (H^T H + λI)^{-1} H^T y

Invertible if: always, as long as λ > 0, even if N < D
Complexity of inverse: O(D^3)… big for large D!
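A quick NumPy check of the invertibility claim (my own sketch, with made-up sizes): when N < D, H^T H is rank deficient, but H^T H + λI can still be solved against for any λ > 0.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5, 10                      # fewer observations than features
H = rng.normal(size=(N, D))
y = rng.normal(size=N)
lam = 0.1

print(np.linalg.matrix_rank(H.T @ H))   # at most N = 5 < D = 10, so H^T H is singular
w_hat = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)   # works for any lam > 0
```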
Step 3, Approach 2: Gradient descent
Elementwise ridge regression gradient descent algorithm
∇cost(w) = −2H^T (y − Hw) + 2λw

Update to jth feature weight:
wj(t+1) ← wj(t) − η [ −2 Σ_{i=1}^N hj(xi)(yi − ŷi(w(t))) + 2λ wj(t) ]
Recall previous algorithm
init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇RSS(w(t))|| > ε:
  for j = 0, …, D:
    partial[j] = −2 Σ_{i=1}^N hj(xi)(yi − ŷi(w(t)))
    wj(t+1) ← wj(t) − η partial[j]
  t ← t + 1
Summary of ridge regression algorithm
init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇RSS(w(t))|| > ε:
  for j = 0, …, D:
    partial[j] = −2 Σ_{i=1}^N hj(xi)(yi − ŷi(w(t)))
    wj(t+1) ← (1 − 2ηλ) wj(t) − η partial[j]
  t ← t + 1
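The pseudocode above can be vectorized over the features; here is a minimal NumPy translation (my own sketch; the step size, tolerance, and iteration cap are assumptions, not values from the lecture):

```python
import numpy as np

def ridge_gradient_descent(H, y, lam, eta=1e-3, eps=1e-3, max_iter=10000):
    """Gradient descent on the ridge cost RSS(w) + lam * ||w||^2."""
    D = H.shape[1]
    w = np.zeros(D)                       # init w = 0
    for _ in range(max_iter):
        resid = y - H @ w                 # y_i - yhat_i(w)
        partial = -2 * H.T @ resid        # partial[j] = -2 sum_i h_j(x_i)(y_i - yhat_i)
        if np.linalg.norm(partial) <= eps:
            break
        w = (1 - 2 * eta * lam) * w - eta * partial
    return w
```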
How to choose λ
The regression/ML workflow
1. Model selection: choose the tuning parameter λ controlling model complexity
2. Model assessment: having selected a model, assess its generalization error
Hypothetical implementation
1. Model selection: for each considered λ:
   i. Estimate parameters ŵλ on the training data
   ii. Assess performance of ŵλ on the test data
   iii. Choose λ* to be the λ with lowest test error
2. Model assessment: compute the test error of ŵλ* (the fitted model for the selected λ*) to approximate generalization error
[Data split: Training set | Test set]

Overly optimistic!
Hypothetical implementation
Issue: this is just like fitting ŵ and assessing its performance both on training data.
• λ* was selected to minimize test error (i.e., λ* was fit on the test data)
• If the test data are not representative of the whole world, then ŵλ* will typically perform worse than the test error indicates
Practical implementation

Solution: create two "test" sets!
1. Select λ* such that ŵλ* minimizes error on the validation set
2. Approximate the generalization error of ŵλ* using the test set

[Data split: Training set | Validation set | Test set]
Practical implementation
[Data split and roles: Training set: fit ŵλ; Validation set: test performance of ŵλ to select λ*; Test set: assess generalization error of ŵλ*]
Typical splits
Training set | Validation set | Test set
80%          | 10%            | 10%
50%          | 25%            | 25%
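Putting the selection procedure together, here is a small NumPy sketch of the 80/10/10 recipe on synthetic stand-in data (the split sizes shown, the λ grid, and the helper names are my own, not from the slides):

```python
import numpy as np

def fit_ridge(H, y, lam):
    """Closed-form ridge estimate for a given lambda."""
    D = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)

def mse(H, y, w):
    resid = y - H @ w
    return np.mean(resid ** 2)

# Synthetic stand-in data, split 80% / 10% / 10%.
rng = np.random.default_rng(2)
N, D = 1000, 8
H = rng.normal(size=(N, D))
y = H @ rng.normal(size=D) + rng.normal(scale=0.5, size=N)
n_train, n_val = int(0.8 * N), int(0.1 * N)
H_train, y_train = H[:n_train], y[:n_train]
H_val, y_val = H[n_train:n_train + n_val], y[n_train:n_train + n_val]
H_test, y_test = H[n_train + n_val:], y[n_train + n_val:]

# 1. Model selection: choose lambda* with lowest validation error.
lambdas = np.logspace(-3, 3, 13)
val_errors = [mse(H_val, y_val, fit_ridge(H_train, y_train, lam)) for lam in lambdas]
lam_star = lambdas[int(np.argmin(val_errors))]

# 2. Model assessment: approximate generalization error on the test set.
w_star = fit_ridge(H_train, y_train, lam_star)
test_error = mse(H_test, y_test, w_star)
```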
How to handle the intercept (OPTIONAL)
Recall multiple regression model
Model: yi = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) + εi
     = Σ_{j=0}^D wj hj(xi) + εi

feature 1 = h0(x) … often 1 (constant)
feature 2 = h1(x) … e.g., x[1]
feature 3 = h2(x) … e.g., x[2]
…
feature D+1 = hD(x) … e.g., x[d]
If constant feature…
yi = w0 + w1 h1(xi) + … + wD hD(xi) + εi

In matrix notation for N observations: y = Hw + ε, where the first column of H is a column of 1s multiplying w0.
Do we penalize intercept?
Standard ridge regression cost: RSS(w) + λ||w||_2^2 (λ: strength of penalty)

This encourages the intercept w0 to also be small. Do we want a small intercept? Conceptually, a large intercept is not indicative of overfitting…
Option 1: Don’t penalize intercept
Modified ridge regression cost: RSS(w0, wrest) + λ||wrest||_2^2

How to implement this in practice?
Option 1: Don’t penalize intercept – Closed-form solution –
ŵ = (H^T H + λI_mod)^{-1} H^T y, where I_mod is the identity matrix with its first diagonal entry (the intercept position) set to 0.
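In code, the only change from the standard closed form is the modified identity matrix; a minimal sketch (mine), assuming column 0 of H is the constant feature:

```python
import numpy as np

def ridge_no_intercept_penalty(H, y, lam):
    """Closed-form ridge where the intercept weight w0 is not penalized.
    Assumes column 0 of H is the constant (all-ones) feature."""
    D = H.shape[1]
    I_mod = np.eye(D)
    I_mod[0, 0] = 0.0          # do not penalize w0
    return np.linalg.solve(H.T @ H + lam * I_mod, H.T @ y)
```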
Option 1: Don’t penalize intercept – Gradient descent algorithm –
while ||∇RSS(w(t))|| > ε:
  for j = 0, …, D:
    partial[j] = −2 Σ_{i=1}^N hj(xi)(yi − ŷi(w(t)))
    if j == 0:
      w0(t+1) ← w0(t) − η partial[j]
    else:
      wj(t+1) ← (1 − 2ηλ) wj(t) − η partial[j]
  t ← t + 1
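The same special-casing of j = 0 in runnable form (my own sketch; the step size, tolerance, and the assumption that column 0 of H is the constant feature are mine):

```python
import numpy as np

def ridge_gd_no_intercept_penalty(H, y, lam, eta=1e-3, eps=1e-3, max_iter=10000):
    """Ridge gradient descent that leaves the intercept weight w[0] unpenalized."""
    D = H.shape[1]
    w = np.zeros(D)
    for _ in range(max_iter):
        resid = y - H @ w
        partial = -2 * H.T @ resid
        if np.linalg.norm(partial) <= eps:
            break
        w_new = (1 - 2 * eta * lam) * w - eta * partial   # penalized update for j != 0
        w_new[0] = w[0] - eta * partial[0]                 # unpenalized update for w0
        w = w_new
    return w
```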
Option 2: Center data first
If the data are first centered about 0, then favoring a small intercept is not so worrisome.

Step 1: Transform y to have 0 mean
Step 2: Run ridge regression as normal (closed-form or gradient algorithms)
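One way to realize this option in code (a sketch under my own assumptions: H contains only the non-constant features, and the intercept is recovered as the mean of y at prediction time):

```python
import numpy as np

def fit_ridge_centered(H, y, lam):
    """Center y, run ridge as normal, and keep y's mean as the intercept."""
    y_mean = y.mean()                      # Step 1: transform y to have 0 mean
    D = H.shape[1]
    w = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ (y - y_mean))   # Step 2
    return w, y_mean                       # predict with H @ w + y_mean
```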
Summary for ridge regression
What you can do now…
• Describe what happens to magnitude of estimated coefficients when model is overfit
• Motivate form of ridge regression cost function
• Describe what happens to estimated coefficients of ridge regression as tuning parameter λ is varied
• Interpret coefficient path plot
• Estimate ridge regression parameters:
 - In closed form
- Using an iterative gradient descent algorithm
• Use a validation set to select the ridge regression tuning parameter λ