Dimension Reduction Methods and Bayesian Machine Learning
Marek Petrik
2/28
Previously in Machine Learning
- How to choose the right features if we have (too) many options
- Methods:
  1. Subset selection
  2. Regularization (shrinkage)
  3. Dimensionality reduction (next class)
Best Subset Selection
- Want to find a subset of the p features
- The subset should be small and predict well
- Example: credit ∼ rating + income + student + limit
Algorithm 1: Best Subset Selection
  M_0 <- null model (no features)
  for k = 1, 2, ..., p do
      fit all (p choose k) models that contain exactly k features
      M_k <- best of these (p choose k) models according to a metric (CV error, R^2, etc.)
  end
  return the best of M_0, M_1, ..., M_p according to the metric above
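To make the enumeration concrete, here is a minimal Python sketch of best subset selection, assuming scikit-learn is available; the function name best_subset_selection and the choice of cross-validated MSE as the metric are illustrative, not part of the slides.

```python
# A minimal sketch of best subset selection (assumes p is small enough that
# enumerating all subsets is feasible); the null model M_0 is omitted for brevity.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def best_subset_selection(X, y, cv=5):
    """Return the feature subset with the lowest cross-validated MSE."""
    n, p = X.shape
    best_subset, best_score = (), -np.inf
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            # Negative MSE so that "higher is better", as scikit-learn expects.
            score = cross_val_score(
                LinearRegression(), X[:, list(subset)], y,
                scoring="neg_mean_squared_error", cv=cv,
            ).mean()
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, -best_score
```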
Achieving Scalability

- Complexity of Best Subset Selection?
- Examine all possible subsets? How many?
- O(2^p)!
- Heuristic approaches:
  1. Stepwise selection: solve the problem approximately (greedy)
  2. Regularization: solve a different (easier) problem (relaxation)
Which Metric to Use?

(Recall Algorithm 1: Best Subset Selection picks the best M_k, and the final model, according to a metric.)

1. Direct error estimate: cross-validation; precise but computationally intensive
2. Indirect error estimate: Mallows' Cp,

   C_p = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right), \qquad \hat{\sigma}^2 \approx \mathrm{Var}[\varepsilon]

   Also the Akaike information criterion (AIC), BIC, and many others; these have theoretical foundations.
3. Interpretability penalty: what is the cost of extra features?
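As a sanity check on the formula, a small sketch of computing Mallows' Cp for an ordinary least squares fit follows; it assumes numpy and scikit-learn, and the estimate sigma2_hat of Var[ε] would typically come from the full model containing all p features.

```python
# A small sketch of Mallows' Cp for a fitted linear model; variable names
# (X, y, sigma2_hat) are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression


def mallows_cp(X, y, sigma2_hat):
    """Cp = (RSS + 2 * d * sigma2_hat) / n for an OLS fit on the d columns of X."""
    n, d = X.shape
    model = LinearRegression().fit(X, y)
    rss = np.sum((y - model.predict(X)) ** 2)
    return (rss + 2 * d * sigma2_hat) / n
```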
Regularization
1. Stepwise selection: Solve the problem approximately (greedy)
2. Regularization: Solve a different (easier) problem (relaxation)
   - Solve a machine learning problem, but penalize solutions that use "too many" of the features
Regularization

- Ridge regression (parameter λ), ℓ2 penalty:

  \min_\beta \; \mathrm{RSS}(\beta) + \lambda \sum_j \beta_j^2
  \;=\; \min_\beta \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_j \beta_j^2

- Lasso (parameter λ), ℓ1 penalty:

  \min_\beta \; \mathrm{RSS}(\beta) + \lambda \sum_j |\beta_j|
  \;=\; \min_\beta \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_j |\beta_j|

- Both are approximations to the ℓ0 (subset selection) solution
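A brief sketch of the two penalized problems with scikit-learn follows; note that sklearn calls the penalty weight alpha rather than λ, and the simulated data below are purely illustrative.

```python
# Ridge vs. lasso on simulated data: the l1 penalty drives some coefficients
# to exactly zero, the l2 penalty only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # l2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # l1 penalty: sets some coefficients to 0

print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))
print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))
```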
Why Lasso Works

- Bias-variance trade-off
- Increasing λ increases bias
- Example: all features relevant

[Figure: Mean Squared Error vs. λ (left) and vs. R² on the training data (right); purple: test MSE, black: bias, green: variance; dotted lines: ridge regression]
Why Lasso Works

- Bias-variance trade-off
- Increasing λ increases bias
- Example: only some features relevant

[Figure: Mean Squared Error vs. λ (left) and vs. R² on the training data (right); purple: test MSE, black: bias, green: variance; dotted lines: ridge regression]
Regularization: Constrained Formulation

- Ridge regression (parameter s), ℓ2 constraint:

  \min_\beta \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2
  \quad \text{subject to} \quad \sum_j \beta_j^2 \le s

- Lasso (parameter s), ℓ1 constraint:

  \min_\beta \; \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2
  \quad \text{subject to} \quad \sum_j |\beta_j| \le s

- Approximations to the ℓ0 solution
Lasso Solutions are Sparse
Constrained Lasso (left) vs. Constrained Ridge Regression (right)
Constraint regions are shown in blue; the red curves are contours of the objective
Today
- Dimension reduction methods
  - Principal component regression
  - Partial least squares
- Interpretation in high dimensions
- Bayesian view of ridge regression and lasso
Dimensionality Reduction Methods

- A different approach to model selection
- We have many features: X_1, X_2, ..., X_p
- Transform the features into a smaller number Z_1, ..., Z_M
- Find constants φ_{jm}
- New features Z_m are linear combinations of the X_j:

  Z_m = \sum_{j=1}^p \varphi_{jm} X_j

- Dimension reduction: M is much smaller than p
Using Transformed Features

- New features Z_m are linear combinations of the X_j:

  Z_m = \sum_{j=1}^p \varphi_{jm} X_j

- Fit a linear regression model on the transformed features:

  y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \varepsilon_i

- Run plain linear regression, logistic regression, LDA, or anything else on the transformed features
Recovering Coefficients for Original Features

- Prediction using transformed features:

  y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \varepsilon_i

- New features Z_m are linear combinations of the X_j:

  Z_m = \sum_{j=1}^p \varphi_{jm} X_j

- Consider the prediction for data point i:

  \sum_{m=1}^M \theta_m z_{im}
  = \sum_{m=1}^M \theta_m \sum_{j=1}^p \varphi_{jm} x_{ij}
  = \sum_{j=1}^p \Big( \sum_{m=1}^M \theta_m \varphi_{jm} \Big) x_{ij}
  = \sum_{j=1}^p \beta_j x_{ij}

- So the implied coefficients on the original features are β_j = Σ_m θ_m φ_{jm}; see the sketch below.
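The coefficient recovery above takes only a few lines of numpy. This is a sketch under the assumption that the loadings φ come from PCA (the top-M right singular vectors of the centered X), which is exactly the choice principal component regression makes later; all names and data are illustrative.

```python
# Dimension reduction followed by coefficient recovery:
#   Z = X_centered @ phi,  theta from regression on Z,  beta_j = sum_m theta_m * phi_{jm}
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))                      # original features, n x p
y = X @ rng.normal(size=8) + rng.normal(scale=0.3, size=200)

M = 3
Xc = X - X.mean(axis=0)
phi = np.linalg.svd(Xc, full_matrices=False)[2][:M].T   # p x M loading matrix

Z = Xc @ phi                          # transformed features z_{im}
fit = LinearRegression().fit(Z, y)    # theta_0 (intercept) and theta_1..theta_M

beta = phi @ fit.coef_                # implied coefficients on the original features
print("beta:", np.round(beta, 3))
```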
Dimension Reduction

1. Reduce the dimension: compute features Z from X
2. Fit the prediction model to compute θ
3. Compute the weights β for the original features
How (and Why) Reduce Features?
1. Principal Component Analysis (PCA)
2. Partial least squares
3. Also: many other non-linear dimensionality reduction methods
Principal Component Analysis

- Unsupervised dimensionality reduction method
- Works with the n × p data matrix X (no labels)
- Example with correlated features: pop and ad

[Figure: scatter plot of Ad Spending vs. Population]
1st Principal Component

[Figure: Ad Spending vs. Population with the first principal component direction drawn through the data]

- 1st principal component: the direction with the largest variance

  Z_1 = 0.839 \times (\mathrm{pop} - \overline{\mathrm{pop}}) + 0.544 \times (\mathrm{ad} - \overline{\mathrm{ad}})

- Is this linear? Yes, after mean centering.
1st Principal Component

[Figure: left: Ad Spending vs. Population with the first principal component (green line); right: the same data replotted in the coordinates of the 1st and 2nd principal components]

- Green line: 1st principal component; it minimizes the distances to all points
- Is this the same as linear regression? No; it is like total least squares (perpendicular distances to the line rather than vertical ones).
2nd Principal Component

[Figure: Ad Spending vs. Population with both principal component directions drawn]

- 2nd principal component: orthogonal to the 1st component, with the largest remaining variance

  Z_2 = 0.544 \times (\mathrm{pop} - \overline{\mathrm{pop}}) - 0.839 \times (\mathrm{ad} - \overline{\mathrm{ad}})
1st Principal Component

[Figure: four panels plotting Population and Ad Spending against the 1st principal component and against the 2nd principal component]
Properties of PCA

- There are no more principal components than features
- Principal components are mutually perpendicular (orthogonal)
- Principal component directions are eigenvectors of X⊤X (for mean-centered X)
- Assumes approximate normality; can break with heavy tails
- PCA depends on the scale of the features
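A short numpy sketch of the eigenvector property above: computing the principal components of a mean-centered matrix directly from X⊤X. The data are simulated and the variable names are illustrative.

```python
# PCA via the eigendecomposition of X^T X on mean-centered data.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
Xc = X - X.mean(axis=0)                       # PCA needs mean-centered features

eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)  # columns of eigvecs are PC directions
order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
components = eigvecs[:, order]                # 1st PC = direction of largest variance

scores = Xc @ components                      # coordinates of each point in PC space
print("variance captured by each component:", np.round(eigvals[order] / len(X), 3))
```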
Principal Component Regression

1. Use PCA to reduce the features to a small number of principal components
2. Fit the regression using the principal components

[Figure: two panels of Mean Squared Error vs. number of components; legend: squared bias, test MSE, variance]
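A minimal scikit-learn sketch of the two PCR steps follows; the pipeline (standardize, project onto M principal components, regress) and the choice M = 5 are illustrative assumptions, as are the simulated data.

```python
# Principal component regression as a scikit-learn pipeline.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 40))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=150)

# Standardize, project onto the first M principal components, then regress.
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
mse = -cross_val_score(pcr, X, y, scoring="neg_mean_squared_error", cv=5).mean()
print("cross-validated MSE with 5 components:", round(mse, 3))
```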
PCR vs. Ridge Regression & Lasso

[Figure: left: squared bias, test MSE, and variance for PCR vs. number of components; right: the same quantities for ridge regression and the lasso vs. the shrinkage factor]

- PCR selects combinations of all features (it is not feature selection)
- PCR is closely related to ridge regression
PCR Application

[Figure: left: standardized coefficients (Income, Limit, Rating, Student) vs. number of components; right: cross-validation MSE vs. number of components]
Standardizing Features

- Regularization and PCR depend on the scales of the features
- Good practice: standardize the features to have the same variance,

  \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}}

- Do not standardize features when they already have the same units
- PCA needs mean-centered features:

  \tilde{x}_{ij} = x_{ij} - \bar{x}_j
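A short sketch of both preprocessing steps with numpy and scikit-learn; note that StandardScaler both mean-centers and scales to unit variance, so it covers the standardization formula above, while PCA strictly needs only the centering. The small matrix is illustrative.

```python
# Standardization (for regularization and PCR) and mean-centering (for PCA).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0, 200.0], [12.0, 240.0], [9.0, 180.0], [11.0, 260.0]])

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
X_centered = X - X.mean(axis=0)             # mean-centering only

print(np.round(X_std, 2))
print(np.round(X_centered, 2))
```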
Partial Least Squares

- Supervised version of PCR: the directions are chosen using the response as well as the features

[Figure: Ad Spending vs. Population with the first partial least squares direction]
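For comparison with PCR, a minimal scikit-learn sketch of partial least squares follows; the simulated data and the number of components are illustrative choices.

```python
# Partial least squares: unlike PCA, the directions are computed using y as well as X.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 20))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=150)

pls = PLSRegression(n_components=2).fit(X, y)
print("first PLS direction (weights on X):", np.round(pls.x_weights_[:, 0], 2))
print("R^2 on the training data:", round(pls.score(X, y), 3))
```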
High-dimensional Data
1. Predicting blood pressure from DNA: n = 200, p = 500 000
2. Predicting user behavior online: n = 10 000, p = 200 000
Problem With High Dimensions

- Computational complexity
- Overfitting is a problem

[Figure: two panels plotting Y against X]
Overfitting with Many Variables

[Figure: three panels showing R², training MSE, and test MSE as functions of the number of variables]
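To reproduce the pattern in the figure qualitatively, here is a small simulated sketch: with only 20 training points, adding pure-noise variables keeps pushing the training R² up while the test MSE deteriorates. All names and numbers are illustrative.

```python
# Overfitting with many variables: only the first column of X is relevant.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n_train, n_test = 20, 200
X_train = rng.normal(size=(n_train, 15))
X_test = rng.normal(size=(n_test, 15))
y_train = X_train[:, 0] + rng.normal(scale=1.0, size=n_train)
y_test = X_test[:, 0] + rng.normal(scale=1.0, size=n_test)

for p in (1, 5, 10, 15):
    model = LinearRegression().fit(X_train[:, :p], y_train)
    train_r2 = model.score(X_train[:, :p], y_train)
    test_mse = np.mean((y_test - model.predict(X_test[:, :p])) ** 2)
    print(f"p={p:2d}  training R^2={train_r2:.2f}  test MSE={test_mse:.2f}")
```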
Interpreting Feature Selection
1. Solutions may not be unique
2. Must be careful about how we report solutions
3. Just because one combination of features predicts well does not mean others will not