Steepest Descent Methods for Variable Selection
Julian Wolfson
593 Final Project
May 29, 2007
1 Motivation
2 Boosting for Beginners
3 L1 Penalization and Boosting - Separated at birth?
4 Beyond Boosting: TGDR
5 Applying TGDR
6 Extensions
7 Final Thoughts
The Penalty Box
Focus of this class: solve

  min_β L(β) + λ P(β)

for some penalty function P(β)
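As a concrete illustration (not from the talk), here is a minimal sketch of this criterion with the L1 penalty P(β) = Σ_j |β_j| and squared-error loss, using scikit-learn's Lasso; the simulated data and the value of λ (called alpha in scikit-learn) are purely illustrative.

```python
# Illustrative only: L1 penalty P(beta) = sum_j |beta_j| with squared-error loss,
# solved by scikit-learn's Lasso. Data and lambda (alpha) are made up.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))           # hypothetical design matrix
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]         # only three covariates matter
y = X @ beta_true + rng.normal(size=100)

fit = Lasso(alpha=0.1).fit(X, y)         # alpha plays the role of lambda
print(np.flatnonzero(fit.coef_))         # variable selection: indices of nonzero coefficients
```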
The Penalty Box, cont’d
Good stuff (with right choice of penalty)
Does variable selection
Nice asymptotic results
Can be solved QUICKLY in simple situations (e.g. linear regression)
Software available ⇒ ubiquitous, many variants
Bad stuff
Constrained optimization is more complicated for loss functions other than squared error
Unclear whether these methods (particularly the more complex variants) can be applied to real-life large problems
New loss functions/penalties are tackled in a piecemeal fashion - seemingly a new “trick” is required for every adaptation of the LASSO
Alternatives?
Let’s restrict ourselves to the problem of obtaining good predictions only (don’t worry about interpretability)
How do we get good predictions for big problems?
Fundamental question for those working in the area of machine learning
Boosting is a popular technique:
Fast
Simple
General
Very quick intro to boosting
Boosting is an iterative technique for building an additive model
  F_T(·) = Σ_{j=1}^{J} h_j(·) · β_j^(T)

We call H = {h_j, j = 1, . . . , J} a dictionary of candidate predictors (or weak learners)

β_j^(T) is the coefficient derived after T iterations of boosting

Define some loss function L with which to evaluate the predictions Ŷ = F_T(·)
Very quick intro to boosting, cont’d.
Super Simplified Boosting Algorithm
1 Set the coefficient vector β = 0
2 For t = 1:T,
  1 Pick the h_j ∈ H for which a change in its coefficient results in the greatest decrease in the loss function
  2 Increment the coefficient associated with h_j by some (small) amount
Boosting = Steepest Descent in Predictor Space
Towards Interpretability
Coefficients of functions in the weak learner set H aren’t interpretable for general H. But if we take H to be {X_1, . . . , X_p}, the set of covariates, then the coefficients are interpretable in the standard way.
Boosting in Covariate Space
Boosting algorithm simplifies to
1 Set β^(0) = 0
2 For t = 1:T,
  1 Identify j_t = arg max_j |∇L(β^(t−1))_j|
  2 Set β_{j_t}^(t) = β_{j_t}^(t−1) − α_t · sign(∇L(β^(t−1))_{j_t})

Questions
1 How do we choose the increment α_t?
2 When do we stop?
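A minimal sketch of this covariate-space boosting update, assuming squared-error loss so the gradient has a closed form; the step size ε and iteration count T are placeholder choices (the talk treats L generically).

```python
import numpy as np

def epsilon_boost(X, y, eps=0.01, T=1000):
    """Coordinate-wise steepest descent ("epsilon-boosting") for squared-error loss.

    Each iteration moves only the coordinate with the largest |gradient| by a
    small fixed step eps; T would be chosen by cross-validation in practice.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(T):
        grad = -X.T @ (y - X @ beta) / n       # gradient of (1/2n)||y - X beta||^2
        j = np.argmax(np.abs(grad))            # steepest coordinate j_t
        beta[j] -= eps * np.sign(grad[j])      # small step against the gradient sign
    return beta
```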
Incrementing and Stopping
Options for α_t
1 Exact line search (= Forward Selection... too greedy)
2 Inexact line search (≈ LARS in linear regression case, TGDR in general)
3 Small constant value (= Forward Stagewise Selection)
4 Etc
When to stop
1 Leave-one-out CV
2 k-fold CV
3 Etc
Beyond the scope of this talk
Boosting and L1 - separated at birth?
Some empirical evidence:
Coincidence? What is the connection between boosting and L1-penalized methods?
A theorem (Rosset, Zhu, Hastie, 2004)
Theorem 1
Consider applying the boosting algorithm with α_t = ε to any convex loss function, generating a path of solutions β^(ε)(t). If the LASSO coefficient paths are monotone for all c < c_0, i.e. if ∀j, |β̂(c)_j| is non-decreasing over the range c < c_0, then

  lim_{ε→0} β^(ε)(c_0/ε) = β̂(c_0)

where β̂(c_0) is the LASSO solution, i.e.

  β̂(c_0) = arg min_{β : Σ_j |β_j| ≤ c_0} L(β)
Some intuition
Consider the problem

  min L(β)
  s.t. ‖β‖_1 − ‖β_0‖_1 ≤ ε
       |β_j| ≥ |β_{0,j}| component-wise

Expand L(β) about β_0:

  L(β) = L(β_0) + ∇L(β_0)ᵀ(β − β_0) + O(ε²)

As ε → 0, L(β) is seen to be optimized by updating the element of β for which |∇L(β_0)_j| is maximal, provided sign(β_{0,j}) = −sign(∇L(β_0)_j).
Boosting solves the local L1-constrained problems it encounters along the way
Beyond Boosting
  β_{j_t}^(t) = β_{j_t}^(t−1) − α_t · sign(∇L(β^(t−1))_{j_t})

An Observation
Boosting only updates one element of the coefficient vector at each iteration - could we do better by updating multiple elements at once?
Enter TGDR
TGDR: Threshold Gradient Descent Regularization
Suggested by Friedman and Popescu (2004)
Motivated by boosting and gradient descent methods
Allows multiple directions to be updated in each iteration
Early stopping provides regularization
Tweaking the update rule
ε-boosting:

  β_{j_t}^(t) = β_{j_t}^(t−1) − ε · sign(∇L(β^(t−1))_{j_t})

TGDR:

  β^(t) = β^(t−1) − ε · f(β^(t−1)) ∘ ∇L(β^(t−1))    (∘ = componentwise product)

where the threshold vector f has components

  f_j(·) = 1[ |∇L(·)_j| ≥ τ · max_{k=1,...,p} |∇L(·)_k| ]
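A minimal sketch of the TGDR iteration as written above, assuming we are minimizing L so the step runs against the gradient; grad_fn, ε, τ, and T are illustrative placeholders, with T (early stopping) and τ chosen by cross-validation in practice.

```python
import numpy as np

def tgdr(grad_fn, p, tau=0.9, eps=0.01, T=1000):
    """Threshold Gradient Descent Regularization (sketch).

    grad_fn(beta) returns the gradient of the loss L at beta (length p).
    All coordinates whose |gradient| is within a factor tau of the largest
    are updated together: tau close to 1 mimics one-at-a-time boosting,
    tau = 0 gives ordinary gradient descent. T acts as the regularizer.
    """
    beta = np.zeros(p)
    for _ in range(T):
        g = grad_fn(beta)
        f = (np.abs(g) >= tau * np.abs(g).max()).astype(float)   # threshold indicator f(beta)
        beta -= eps * f * g                                       # move only thresholded coordinates
    return beta
```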
Thresholding

[Figure: the components of ∇L(β), with the threshold τ · max_{k=1,...,p} |∇L(β)_k| marking which components are large enough to be updated]

  f_j(β) = 1[ |∇L(β)_j| ≥ τ · max_{k=1,...,p} |∇L(β)_k| ]
An example
Gui and Li (2005) extended TGDR for Cox regression
Use the partial likelihood loss: L(β) = −ℓ_p(β; X)
I adapted TGDR to handle time-varying covariates
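As a hedged sketch of the descent direction in this case, the score of the Cox log partial likelihood for time-fixed covariates (Breslow handling of ties) is shown below; it does not implement the time-varying adaptation, and the function and variable names are illustrative.

```python
import numpy as np

def cox_score(beta, X, time, event):
    """Score (gradient of the log partial likelihood) for the Cox model.

    X: (n, p) time-fixed covariates; time: (n,) follow-up times;
    event: (n,) 1 = regimen failure observed, 0 = censored.
    Ties are handled Breslow-style. With L = -log partial likelihood,
    this score is the (negative-gradient) descent direction fed to TGDR.
    """
    w = np.exp(X @ beta)                                  # risk weights exp(x_j' beta)
    score = np.zeros_like(beta)
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]                         # risk set at the i-th failure time
        score += X[i] - (w[at_risk] @ X[at_risk]) / w[at_risk].sum()
    return score
```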
Application: ACTG 398
Relevant Data
≈ 490 HIV-infected patients
Current drug regimen
HIV protein sequences (300 AAs) collected post-infection for approximately two years
Endpoint of Interest
(T, C), where
T is the time until a patient “fails” a drug regimen
C is the censoring indicator
Question
Which amino acid positions on HIV (mutations, insertions, deletions) are associated with time until drug regimen failure?
Results: ACTG 398 Data
Run TGDR on 60% of data (training set) for a range of values of τ ...
Estimated coefficients from the training set
Columns: 70R 74V 103N 108I 118I 122E 123E 181C 184V 190A
         (K) (L) (K)  (V)  (V)  (K)  (D)  (Y)  (M)  (G)

τ = 0.5:  0.134, 0.258, 0.134, −0.164, 0.131
τ = 0.55: 0.115, 0.421, 0.096, 0.092, 0.117, −0.255, 0.128
τ = 0.6:  0.115, 0.421, 0.117, −0.164, 0.128
τ = 0.65: 0.118, 0.434, 0.125, −0.143, 0.128
τ = 0.7:  0.092, 0.535, 0.086, 0.088, 0.207, −0.143, 0.229
τ = 0.75: 0.105, 0.542, 0.078, −0.080, 0.085, 0.075, 0.184, −0.143, 0.221
τ = 0.8:  0.434, −0.143
τ = 0.85: −0.063, 0.087, 0.554, 0.143, −0.082, 0.088, 0.142, 0.119, −0.201, 0.368
τ = 0.9:  −0.069, 0.083, 0.554, 0.147, −0.082, 0.087, 0.079, 0.119, −0.202, 0.310
τ = 0.95: −0.062, 0.145, 0.541, 0.206, −0.207, 0.147, 0.141, 0.105, −0.204, 0.380
τ = 0.96: −0.062, 0.092, 0.541, 0.206, −0.148, 0.144, 0.141, 0.094, −0.203, 0.387
τ = 0.97: −0.066, 0.098, 0.535, 0.208, −0.149, 0.082, 0.143, 0.087, −0.204, 0.386
τ = 0.98: −0.066, 0.092, 0.535, 0.146, −0.149, 0.084, 0.143, 0.094, −0.205, 0.381
τ = 0.99: −0.066, 0.086, 0.535, 0.147, −0.150, 0.087, 0.143, 0.094, −0.205, 0.380
Results (cont’d)
Get η = Xβ̂ from the test set (40% of data)
HR = hazard ratio comparing the group with η ≥ 0 (“high risk”) to the group with η < 0 (“low risk”)
τ      HR     95% CI
0.5    2.258  (1.438, 3.546)
0.55   2.360  (1.499, 3.716)
0.6    2.025  (1.290, 3.178)
0.65   2.025  (1.290, 3.178)
0.7    2.384  (1.492, 3.810)
0.75   2.349  (1.476, 3.739)
0.8    2.054  (1.311, 3.217)
0.85   2.441  (1.549, 3.846)
0.9    2.475  (1.571, 3.900)
0.95   2.429  (1.537, 3.837)
0.96   2.429  (1.537, 3.837)
0.97   2.463  (1.558, 3.893)
0.98   2.463  (1.558, 3.893)
0.99   2.463  (1.558, 3.893)
Loss functions and descent directions
For log-likelihood (or log partial likelihood) loss, the descent direction is just the score function ℓ̇(β) = ∂ℓ/∂β

Extensive literature on modified/adapted/approximate/quasi score functions which allow for:

Missing data
Measurement error
Heteroskedasticity
. . .

Idea
Estimating equations propose a “descent direction” in some sense - applying TGDR to these descent directions could allow us to do variable selection whenever estimating equations are available (even if closed-form likelihoods aren’t available).
Will this work?
Simplest way of solving estimating equations: iterative substitution
Consider an estimating equation g_n(β) = 0

If g_n(β) is asymptotically unbiased, then solutions β̂ will be such that

  g_n(β̂) ≈ 0

In the neighbourhood of a solution, write

  β̂ ≈ β + g_n(β)

Suggests the iteration

  β^(t) = β^(t−1) + ε · g_n(β^(t−1))
Add thresholding to get TGDR iteration
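A minimal sketch of this thresholded iterative-substitution scheme, with the estimating function g_n in place of a loss gradient; g_fn, ε, τ, and T are illustrative placeholders.

```python
import numpy as np

def tgdr_estimating_equation(g_fn, p, tau=0.9, eps=0.01, T=1000):
    """Thresholded iterative substitution for an estimating equation g_n(beta) = 0.

    g_fn(beta) returns the value of the estimating function (length p);
    no loss function or likelihood is required. The step is taken along
    +g_n, matching the iteration above, with the same TGDR thresholding.
    """
    beta = np.zeros(p)
    for _ in range(T):
        g = g_fn(beta)
        f = (np.abs(g) >= tau * np.abs(g).max()).astype(float)   # TGDR threshold indicator
        beta += eps * f * g                                       # step towards g_n(beta) = 0
    return beta
```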
Asymptotics
Some things I’d like to show:
Conjecture 1: Knight/Fu consistency
For suitably well-behaved loss functions L / descent directions ℓ̇, the TGDR estimate converges to the minimizer of L(β), or to the solution of ℓ̇(β) = 0, as the number of iterations → ∞.
Proof sketch?
Apply results of Bickel, Ritov, Zakai (2006), who show consistency for a very general class of boosting methods
Asymptotics, cont’d.
Conjecture 2: Greenshtein/Ritov persistency
For suitably well-behaved loss functions L / descent directions ℓ̇, the TGDR estimates are persistent.
Proof sketch?
Exploit the relationship between L1-penalization (shown to be persistent) and boosting (similar to TGDR)
Ideas are welcome!
In Conclusion
TGDR is...
Variable selection based on thresholded gradient descent
Beautifully simple
Computationally tractable
Easy to extend to more complex data structures
But TGDR is not...
Popular (yet)
Particularly amenable to inference (confidence intervals?)
Well studied from a theoretical perspective:
When does it work?
How well does it work?
How does it compare to competing methods?
Acknowledgements
Prof. Peter Gilbert (thesis supervisor)
Prof. Victor DeGruttola (for providing ACTG data)
Thanks!
Questions?