Steepest Descent Methods for Variable Selection
Julian Wolfson
593 Final Project
May 29, 2007
1 Motivation
2 Boosting for Beginners
3 L1 Penalization and Boosting - Separated at birth?
4 Beyond Boosting: TGDR
5 Applying TGDR
6 Extensions
7 Final Thoughts
The Penalty Box
Focus of this class: solve

  min_β L(β) + λ P(β)

for some penalty function P(β)
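As a concrete illustration (not from the talk), here is a minimal sketch of this criterion with the L1 penalty P(β) = Σ_j |β_j| and squared-error loss, using scikit-learn's Lasso; the simulated data and the value of λ (called alpha in scikit-learn) are purely illustrative.

```python
# Illustrative only: L1 penalty P(beta) = sum_j |beta_j| with squared-error loss,
# solved by scikit-learn's Lasso. Data and lambda (alpha) are made up.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))           # hypothetical design matrix
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]         # only three covariates matter
y = X @ beta_true + rng.normal(size=100)

fit = Lasso(alpha=0.1).fit(X, y)         # alpha plays the role of lambda
print(np.flatnonzero(fit.coef_))         # variable selection: indices of nonzero coefficients
```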
The Penalty Box, cont’d
Good stuff (with right choice of penalty)
Does variable selection
Nice asymptotic results
Can be solved QUICKLY in simple situations (e.g. linear regression)
Software available ⇒ ubiquitous, many variants
Bad stuff
Constrained optimization is more complicated for loss functions other than squared error
Unclear whether these methods (particularly the more complex variants) can be applied to real-life large problems
New loss functions/penalties are tackled in a piecemeal fashion - seemingly a new “trick” is required for every adaptation of the LASSO
Alternatives?
Let’s restrict ourselves to the problem of obtaining good predictions only (don’t worry about interpretability)
How do we get good predictions for big problems?
Fundamental question for those working in the area of machine learning
Boosting is a popular technique:
Fast
Simple
General
Very quick intro to boosting
Boosting is an iterative technique for building an additive model
  F_T(·) = Σ_{j=1}^{J} h_j(·) · β_j^(T)

We call H = {h_j, j = 1, . . . , J} a dictionary of candidate predictors (or weak learners)

β_j^(T) is the coefficient derived after T iterations of boosting

Define some loss function L with which to evaluate the predictions Ŷ = F_T(·)
Very quick intro to boosting, cont’d.
Super Simplified Boosting Algorithm
1 Set the coefficient vector β = 0
2 For t = 1:T,
  1 Pick the h_j ∈ H for which a change in its coefficient results in the greatest decrease in the loss function
  2 Increment the coefficient associated with h_j by some (small) amount
Boosting = Steepest Descent in Predictor Space
Towards Interpretability
Coefficients of functions in the weak learner set H aren’t interpretable for general H. But if we take H to be {X_1, . . . , X_p}, the set of covariates, then the coefficients are interpretable in the standard way.
Boosting in Covariate Space
Boosting algorithm simplifies to
1 Set β^(0) = 0
2 For t = 1:T,
  1 Identify j_t = arg max_j |∇L(β^(t−1))_j|
  2 Set β_{j_t}^(t) = β_{j_t}^(t−1) − α_t · sign(∇L(β^(t−1))_{j_t})

Questions
1 How do we choose the increment α_t?
2 When do we stop?
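A minimal sketch of this covariate-space boosting update, assuming squared-error loss so the gradient has a closed form; the step size ε and iteration count T are placeholder choices (the talk treats L generically).

```python
import numpy as np

def epsilon_boost(X, y, eps=0.01, T=1000):
    """Coordinate-wise steepest descent ("epsilon-boosting") for squared-error loss.

    Each iteration moves only the coordinate with the largest |gradient| by a
    small fixed step eps; T would be chosen by cross-validation in practice.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(T):
        grad = -X.T @ (y - X @ beta) / n       # gradient of (1/2n)||y - X beta||^2
        j = np.argmax(np.abs(grad))            # steepest coordinate j_t
        beta[j] -= eps * np.sign(grad[j])      # small step against the gradient sign
    return beta
```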
Incrementing and Stopping
Options for α_t
1 Exact line search (= Forward Selection... too greedy)
2 Inexact line search (≈ LARS in linear regression case, TGDR in general)
3 Small constant value (= Forward Stagewise Selection)
4 Etc
When to stop
1 Leave-one-out CV
2 k-fold CV
3 Etc
Beyond the scope of this talk
Boosting and L1 - separated at birth?
Some empirical evidence:
Coincidence? What is the connection between boosting and L1-penalized methods?
A theorem (Rosset, Zhu, Hastie, 2004)
Theorem 1
Consider applying the boosting algorithm with α_t = ε to any convex loss function, generating a path of solutions β^(ε)(t). If the LASSO coefficient paths are monotone for all c < c_0, i.e. if ∀j, |β̂(c)_j| is non-decreasing over the range c < c_0, then

  lim_{ε→0} β^(ε)(c_0/ε) = β̂(c_0)

where β̂(c_0) is the LASSO solution, i.e.

  β̂(c_0) = arg min_{β : Σ_j |β_j| ≤ c_0} L(β)
Some intuition
Consider the problem

  min L(β)
  s.t. ‖β‖_1 − ‖β_0‖_1 ≤ ε
       |β_j| ≥ |β_{0,j}| component-wise

Expand L(β) about β_0:

  L(β) = L(β_0) + ∇L(β_0)ᵀ(β − β_0) + O(ε²)

As ε → 0, L(β) is seen to be optimized by updating the element of β for which |∇L(β_0)_j| is maximal, provided sign(β_{0,j}) = −sign(∇L(β_0)_j).
Boosting solves the local L1-constrained problems it encounters along the way
Beyond Boosting
  β_{j_t}^(t) = β_{j_t}^(t−1) − α_t · sign(∇L(β^(t−1))_{j_t})

An Observation
Boosting only updates one element of the coefficient vector at each iteration - could we do better by updating multiple elements at once?
Enter TGDR
TGDR: Threshold Gradient Descent Regularization
Suggested by Friedman and Popescu (2004)
Motivated by boosting and gradient descent methods
Allows multiple directions to be updated in each iteration
Early stopping provides regularization
Tweaking the update rule
ε-boosting:

  β_{j_t}^(t) = β_{j_t}^(t−1) − ε · sign(∇L(β^(t−1))_{j_t})

TGDR:

  β^(t) = β^(t−1) − ε · f(β^(t−1)) ∘ ∇L(β^(t−1))    (∘ = componentwise product)

where the threshold vector f has components

  f_j(·) = 1[ |∇L(·)_j| ≥ τ · max_{k=1,...,p} |∇L(·)_k| ]
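A minimal sketch of the TGDR iteration as written above, assuming we are minimizing L so the step runs against the gradient; grad_fn, ε, τ, and T are illustrative placeholders, with T (early stopping) and τ chosen by cross-validation in practice.

```python
import numpy as np

def tgdr(grad_fn, p, tau=0.9, eps=0.01, T=1000):
    """Threshold Gradient Descent Regularization (sketch).

    grad_fn(beta) returns the gradient of the loss L at beta (length p).
    All coordinates whose |gradient| is within a factor tau of the largest
    are updated together: tau close to 1 mimics one-at-a-time boosting,
    tau = 0 gives ordinary gradient descent. T acts as the regularizer.
    """
    beta = np.zeros(p)
    for _ in range(T):
        g = grad_fn(beta)
        f = (np.abs(g) >= tau * np.abs(g).max()).astype(float)   # threshold indicator f(beta)
        beta -= eps * f * g                                       # move only thresholded coordinates
    return beta
```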
Thresholding

[Figure: the components of ∇L(β), with the threshold τ · max_{k=1,...,p} |∇L(β)_k| marking which components are large enough to be updated]

  f_j(β) = 1[ |∇L(β)_j| ≥ τ · max_{k=1,...,p} |∇L(β)_k| ]
An example
Gui and Li (2005) extended TGDR for Cox regression
Use the partial likelihood loss: L(β) = −ℓ_p(β; X)
I adapted TGDR to handle time-varying covariates
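As a hedged sketch of the descent direction in this case, the score of the Cox log partial likelihood for time-fixed covariates (Breslow handling of ties) is shown below; it does not implement the time-varying adaptation, and the function and variable names are illustrative.

```python
import numpy as np

def cox_score(beta, X, time, event):
    """Score (gradient of the log partial likelihood) for the Cox model.

    X: (n, p) time-fixed covariates; time: (n,) follow-up times;
    event: (n,) 1 = regimen failure observed, 0 = censored.
    Ties are handled Breslow-style. With L = -log partial likelihood,
    this score is the (negative-gradient) descent direction fed to TGDR.
    """
    w = np.exp(X @ beta)                                  # risk weights exp(x_j' beta)
    score = np.zeros_like(beta)
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]                         # risk set at the i-th failure time
        score += X[i] - (w[at_risk] @ X[at_risk]) / w[at_risk].sum()
    return score
```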
Application: ACTG 398
Relevant Data
≈ 490 HIV-infected patients
Current drug regimen
HIV protein sequences (300 AAs) collected post-infection for approximately two years
Endpoint of Interest
(T, C), where
T is the time until a patient “fails” a drug regimen
C is the censoring indicator
Question
Which amino acid positions on HIV (mutations, insertions, deletions) are associated with time until drug regimen failure?
Results: ACTG 398 Data
Run TGDR on 60% of data (training set) for a range of values of τ ...
Estimated coefficients from the training set
Columns: 70R 74V 103N 108I 118I 122E 123E 181C 184V 190A
         (K) (L) (K)  (V)  (V)  (K)  (D)  (Y)  (M)  (G)

τ = 0.5:  0.134, 0.258, 0.134, −0.164, 0.131
τ = 0.55: 0.115, 0.421, 0.096, 0.092, 0.117, −0.255, 0.128
τ = 0.6:  0.115, 0.421, 0.117, −0.164, 0.128
τ = 0.65: 0.118, 0.434, 0.125, −0.143, 0.128
τ = 0.7:  0.092, 0.535, 0.086, 0.088, 0.207, −0.143, 0.229
τ = 0.75: 0.105, 0.542, 0.078, −0.080, 0.085, 0.075, 0.184, −0.143, 0.221
τ = 0.8:  0.434, −0.143
τ = 0.85: −0.063, 0.087, 0.554, 0.143, −0.082, 0.088, 0.142, 0.119, −0.201, 0.368
τ = 0.9:  −0.069, 0.083, 0.554, 0.147, −0.082, 0.087, 0.079, 0.119, −0.202, 0.310
τ = 0.95: −0.062, 0.145, 0.541, 0.206, −0.207, 0.147, 0.141, 0.105, −0.204, 0.380
τ = 0.96: −0.062, 0.092, 0.541, 0.206, −0.148, 0.144, 0.141, 0.094, −0.203, 0.387
τ = 0.97: −0.066, 0.098, 0.535, 0.208, −0.149, 0.082, 0.143, 0.087, −0.204, 0.386
τ = 0.98: −0.066, 0.092, 0.535, 0.146, −0.149, 0.084, 0.143, 0.094, −0.205, 0.381
τ = 0.99: −0.066, 0.086, 0.535, 0.147, −0.150, 0.087, 0.143, 0.094, −0.205, 0.380
Results (cont’d)
Get η = Xβ̂ from the test set (40% of data)
HR = hazard ratio comparing the group with η ≥ 0 (“high risk”) to the group with η < 0 (“low risk”)
τ      HR     95% CI
0.5    2.258  (1.438, 3.546)
0.55   2.360  (1.499, 3.716)
0.6    2.025  (1.290, 3.178)
0.65   2.025  (1.290, 3.178)
0.7    2.384  (1.492, 3.810)
0.75   2.349  (1.476, 3.739)
0.8    2.054  (1.311, 3.217)
0.85   2.441  (1.549, 3.846)
0.9    2.475  (1.571, 3.900)
0.95   2.429  (1.537, 3.837)
0.96   2.429  (1.537, 3.837)
0.97   2.463  (1.558, 3.893)
0.98   2.463  (1.558, 3.893)
0.99   2.463  (1.558, 3.893)
Loss functions and descent directions
For log-likelihood (or log partial likelihood) loss, the descent direction is just the score function ℓ̇(β) = ∂ℓ/∂β

Extensive literature on modified/adapted/approximate/quasi score functions which allow for:

Missing data
Measurement error
Heteroskedasticity
. . .

Idea
Estimating equations propose a “descent direction” in some sense - applying TGDR to these descent directions could allow us to do variable selection whenever estimating equations are available (even if closed-form likelihoods aren’t available).
Will this work?
Simplest way of solving estimating equations: iterative substitution
Consider an estimating equation g_n(β) = 0

If g_n(β) is asymptotically unbiased, then solutions β̂ will be such that

  g_n(β̂) ≈ 0

In the neighbourhood of a solution, write

  β̂ ≈ β + g_n(β)

Suggests the iteration

  β^(t) = β^(t−1) + ε · g_n(β^(t−1))
Add thresholding to get TGDR iteration
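A minimal sketch of this thresholded iterative-substitution scheme, with the estimating function g_n in place of a loss gradient; g_fn, ε, τ, and T are illustrative placeholders.

```python
import numpy as np

def tgdr_estimating_equation(g_fn, p, tau=0.9, eps=0.01, T=1000):
    """Thresholded iterative substitution for an estimating equation g_n(beta) = 0.

    g_fn(beta) returns the value of the estimating function (length p);
    no loss function or likelihood is required. The step is taken along
    +g_n, matching the iteration above, with the same TGDR thresholding.
    """
    beta = np.zeros(p)
    for _ in range(T):
        g = g_fn(beta)
        f = (np.abs(g) >= tau * np.abs(g).max()).astype(float)   # TGDR threshold indicator
        beta += eps * f * g                                       # step towards g_n(beta) = 0
    return beta
```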
Asymptotics
Some things I’d like to show:
Conjecture 1: Knight/Fu consistency
For suitably well-behaved loss functions L / descent directions ℓ̇, the TGDR estimate converges to the minimizer of L(β), or to the solution of ℓ̇(β) = 0, as the number of iterations → ∞.
Proof sketch?
Apply results of Bickel, Ritov, Zakai (2006), who show consistency for a very general class of boosting methods
Asymptotics, cont’d.
Conjecture 2: Greenshtein/Ritov persistency
For suitably well-behaved loss functions L / descent directions ℓ̇, the TGDR estimates are persistent.
Proof sketch?
Exploit the relationship between L1-penalization (shown to be persistent) and boosting (similar to TGDR)
Ideas are welcome!
In Conclusion
TGDR is...
Variable selection based on thresholded gradient descent
Beautifully simple
Computationally tractable
Easy to extend to more complex data structures
But TGDR is not...
Popular (yet)
Particularly amenable to inference (confidence intervals?)
Well studied from a theoretical perspective:
When does it work?
How well does it work?
How does it compare to competing methods?
Acknowledgements
Prof. Peter Gilbert (thesis supervisor)
Prof. Victor DeGruttola (for providing ACTG data)
Thanks!
Questions?