Sections: Bradley Efron | Motivation | Bootstrap Smoothing | Results
Estimation and Accuracy after Model Selection, by Bradley Efron (Stanford)

Sahir Rai Bhatnagar
McGill University
April 7, 2014
Who?

- Born in St. Paul, Minnesota in 1938 to Jewish-Russian immigrants
- B.S., Mathematics, Caltech (1960)
- Ph.D., Statistics (1964), under the direction of Rupert Miller and Herb Solomon
- Professor of Statistics at Stanford for the past 50 years
Achievements

- Best known for the Bootstrap, Annals of Statistics (1977)
- Founding Editor, Annals of Applied Statistics
- Awarded the Guy Medal in Gold by the RSS (2014); 34 awarded since 1892, including Rao, Cox, Fisher, and Nelder
National Medal of Science (2005)

Established by Congress in 1959 and administered by the National Science Foundation, the medal is the nation's highest scientific honour.
Some Quotes

- "Statistics is the science of information gathering, especially when the information arrives in little pieces instead of big ones."
- "Statistics did not come naturally to me. Dad's keeping score for the baseball league helped a lot."
- "I spent the first year at Stanford in the Math Department... After, I started taking stats courses, which I thought would be easy. In fact I found them harder."
Motivation

- A Quick Review of the Bootstrap
- Typical Model Selection Setting
- Cholesterol Data Example
- Prostate Data Example
Typical Model Selection Setting

- Look at the data: one response, many covariates
- Identify a list of candidate models M: 2^p submodels; linear, quadratic, cubic, ...
- Perform model selection (see Abbas' class notes)
- Do inference based on the chosen model: prediction, confidence intervals

Today's Question: Should we care about the variability of the variable selection step in our post-selection inference?
An Example: Cholesterol Data

- n = 164 men took cholestyramine (meant to reduce cholesterol in the blood) for 7 years
- x: a compliance measure, adjusted so that x ~ N(0, 1)
- y: cholesterol decrease
- Perform a regression of y on x
- We want to predict cholesterol decrease for a given compliance value:

    µ = E[y | x]
An Example (continued)

Multiple linear regression model:

    y = Xβ + ε,   ε_i ~ N(0, σ²)

Six candidate models, M = {linear, quadratic, ..., sextic}, e.g.

    y = β0 + β1 x + β2 x² + ... + β6 x⁶ + ε

Cp criterion for model selection:

    Cp(M) = SS_res(M)/n + 2σ² p_M / n

where the first term measures goodness of fit, the second penalizes complexity, and p_M is the number of parameters in model M. Use the OLS estimate of β from the chosen model and predict:

    µ̂ = X β̂
An Example: Nonparametric Bootstrap Analysis

Bootstrap the data:

    data* = {(x_j, y_j)*, j = 1, ..., n}

where the pairs (x_j, y_j)* are drawn randomly with replacement from the original data. Then:

    data*  --Cp-->  M*  --OLS-->  β̂*_{M*}  -->  µ̂* = X_{M*} β̂*_{M*}

Repeat B = 4000 times.
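The resampling loop above can be sketched in a few lines of Python (the talk's actual analysis was done in R; the toy data, the candidate list of polynomial degrees, the reduced B, and the helper `cp_select` are all stand-ins introduced here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def cp_select(x, y, max_deg=3, sigma2=1.0):
    """Pick the polynomial degree minimising Cp; return its OLS coefficients."""
    n = len(y)
    best_cp, best_beta = np.inf, None
    for deg in range(1, max_deg + 1):
        X = np.vander(x, deg + 1, increasing=True)   # columns 1, x, ..., x^deg
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        cp = resid @ resid / n + 2 * sigma2 * (deg + 1) / n
        if cp < best_cp:
            best_cp, best_beta = cp, beta
    return best_beta

# toy data standing in for the cholesterol example (true signal is quadratic)
n, B = 50, 200
x = rng.normal(size=n)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(size=n)

# nonparametric bootstrap: resample (x_j, y_j) pairs with replacement,
# redo the Cp selection and the OLS fit each time, predict at x = 1
mu_star = np.empty(B)
for i in range(B):
    idx = rng.integers(0, n, size=n)
    beta = cp_select(x[idx], y[idx])
    mu_star[i] = np.polyval(beta[::-1], 1.0)   # beta is in increasing order

print(round(float(mu_star.mean()), 2), round(float(mu_star.std()), 2))
```

The histogram of `mu_star` is the analogue of the fitted-value histograms on the following slides: its spread reflects both coefficient noise and the jumpiness of the selection step.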
[Figure reproduced from Efron 2013]
[Figure reproduced from Efron 2013]
Prostate Data

- Examine the relation between the level of PSA and clinical measures
- n = 97 men who were about to receive a prostatectomy
- x = (x1, ..., x8): clinical measures (adjusted so that x ~ N(0, 1))
- y = log PSA
- Perform a regression of y on x
- 8 candidate models were identified using regsubsets with nbest = 1
- We want to estimate

    µ_j = E[y | x_j],   j = 1, ..., 97
[Figure: fitted values for subject 95, from B = 4000 nonparametric bootstrap replications of the Cp-chosen model; 60% of the replications exceed the original estimate of 3.6]
[Figure: fitted values for subject 95, from B = 4000 nonparametric bootstrap replications, separated by the three models most frequently chosen by Cp (m3: 18%, m5: 22%, m7: 24%)]
[Figure: boxplots of fitted values for subject 95 by Cp-chosen model (m2-m8, selection frequencies 1%, 18%, 12%, 22%, 15%, 24%, 8%; Model 7 most frequent), based on B = 4000 nonparametric bootstrap samples]
Questions

- Are you convinced there is a problem in the way we do post-selection inference?
- Is the juice worth the squeeze?
Bootstrap Smoothing

- Idea
- Standard Errors
- Theorem
- Confidence Intervals
Bagging (Breiman 1996)

Replace the original estimator µ̂ = t(y) with the bootstrap average

    µ̃ = s(y) = (1/B) Σ_{i=1}^{B} t(y*_i)

where y*_i is the i-th bootstrap sample. Also known as model averaging.

"If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy."
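A minimal Python sketch of bagging, using a deliberately non-smooth toy statistic (a thresholded sample mean, introduced here as a stand-in for an estimator that jumps with model selection; it is not an example from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

def t(y):
    """A non-smooth statistic: the sample mean, zeroed out when it is
    close to 0 (mimicking a hard selection step)."""
    m = float(y.mean())
    return m if abs(m) > 0.1 else 0.0

y = rng.normal(loc=0.12, scale=1.0, size=40)

# bagging: average the statistic over bootstrap resamples of y
B = 500
t_star = np.array([t(rng.choice(y, size=len(y), replace=True)) for _ in range(B)])

mu_original = t(y)                 # jumpy plug-in estimate t(y)
mu_smooth = float(t_star.mean())   # bagged / smoothed estimate s(y)
print(mu_original, mu_smooth)
```

Whereas `t(y)` jumps discontinuously as the data cross the threshold, `s(y)` varies smoothly, which is exactly the property exploited in the rest of the talk.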
Main Contribution of this Paper

- t*_i = t(y*_i), i = 1, ..., B  (value of the statistic in bootstrap sample i)
- Y*_ij = number of times the j-th data point appears in the i-th bootstrap sample
- cov_j = cov(Y*_ij, t*_i)

The nonparametric estimate of the standard deviation for the ideal smoothed bootstrap statistic µ̃ = s(y) = (1/B) Σ_{i=1}^{B} t(y*_i) is

    sd = [ Σ_{j=1}^{n} cov_j² ]^{1/2}
Main Contribution of this Paper (continued)

Note that cov_j = cov(Y*_ij, t*_i) is an unknown quantity, so we must estimate it. The estimate of the standard deviation for µ̃ = s(y) in the non-ideal case is

    sd_B = [ Σ_{j=1}^{n} ĉov_j² ]^{1/2}

where

    ĉov_j = (1/B) Σ_{i=1}^{B} (Y*_ij − Y*_·j)(t*_i − t*_·)

    Y*_·j = (1/B) Σ_{i=1}^{B} Y*_ij,    t*_· = (1/B) Σ_{i=1}^{B} t*_i
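This covariance computation takes only a few lines in Python (toy data and a median statistic, chosen here for illustration rather than taken from the talk); with B reasonably large the smoothed estimate comes out no larger than the plain bootstrap standard deviation, illustrating the inequality on the next slide:

```python
import numpy as np

rng = np.random.default_rng(2)
n, B = 30, 1000
y = rng.normal(size=n)

t = lambda s: float(np.median(s))   # any statistic t(y*)

Y = np.zeros((B, n))                # Y*_ij: count of point j in sample i
t_star = np.empty(B)
for i in range(B):
    idx = rng.integers(0, n, size=n)
    np.add.at(Y[i], idx, 1)         # tally the resampling counts
    t_star[i] = t(y[idx])

# cov_j: empirical covariance of Y*_ij and t*_i over the B replications
cov = ((Y - Y.mean(axis=0)) * (t_star - t_star.mean())[:, None]).mean(axis=0)

sd_smooth = float(np.sqrt((cov ** 2).sum()))   # Efron's estimate for s(y)
sd_plain = float(t_star.std())                 # usual bootstrap sd for t(y)
print(round(sd_smooth, 3), round(sd_plain, 3))
```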
Improvement on the Traditional Standard Error

The smoothed estimate

    sd_B = [ Σ_{j=1}^{n} ĉov_j² ]^{1/2}

is always less than the bootstrap estimate of the standard deviation of the unsmoothed statistic,

    ŝd_B = [ (1/B) Σ_{i=1}^{B} (t*_i − t*_·)² ]^{1/2}
Three Types of Confidence Intervals

1. Standard:   µ̂ ± 1.96 ŝd_B
2. Percentile: [µ̂*(0.025), µ̂*(0.975)]
3. Smoothed:   µ̃ ± 1.96 sd_B
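All three intervals are cheap to compute once the replications are in hand. A minimal sketch, with illustrative numbers loosely modelled on the SCAD row of the prostate table (the replication vector is simulated here, not real output from the talk's analysis):

```python
import numpy as np

rng = np.random.default_rng(3)

# stand-ins for quantities a bootstrap run would produce
mu_hat, sd_hat = 3.60, 0.35                    # original estimate and plain bootstrap sd
mu_star = rng.normal(3.57, 0.33, size=4000)    # simulated replications t(y*_i)
mu_tilde = float(mu_star.mean())               # smoothed (bagged) estimate
sd_smooth = 0.33                               # assumed covariance-based sd estimate

standard = (mu_hat - 1.96 * sd_hat, mu_hat + 1.96 * sd_hat)
percentile = tuple(np.quantile(mu_star, [0.025, 0.975]))
smoothed = (mu_tilde - 1.96 * sd_smooth, mu_tilde + 1.96 * sd_smooth)
print(standard, percentile, smoothed)
```

With sd_smooth < sd_hat, the smoothed interval is the shortest of the three, which is the pattern seen in the prostate table below.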
Results

- Setting
- Prostate Data: Revisited
- Parametric Bootstrap
- Discussion
L1-Norm Penalty Functions

Recall the optimization problem of interest:

    max_β  ℓ_n(β) − n Σ_{j=1}^{p} p(|β_j|; λ)
LASSO, SCAD and MCP Penalties

LASSO (Tibshirani, 1996):

    p(|β|; λ) = λ|β|

SCAD (Fan and Li, 2001), defined through its derivative:

    p′(|β|; λ, γ) = λ sign(β) { I(|β| ≤ λ) + [(γλ − |β|)₊ / ((γ − 1)λ)] I(|β| > λ) },   γ > 2

MCP (Zhang, 2010):

    p(|β|; λ, γ) = λ|β| − β²/(2γ)   if |β| ≤ γλ
                 = γλ²/2            if |β| > γλ
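These penalties translate directly into code. A minimal Python rendering (the γ defaults of 3.7 for SCAD and 3 for MCP are common conventions, assumed here rather than stated in the talk):

```python
import numpy as np

def lasso_pen(b, lam):
    """LASSO penalty p(|b|; lam) = lam * |b|."""
    return lam * np.abs(b)

def scad_pen_deriv(b, lam, gamma=3.7):
    """SCAD, as usually specified, through its derivative p'(|b|; lam, gamma)."""
    ab = np.abs(b)
    return lam * np.sign(b) * (
        (ab <= lam) + np.maximum(gamma * lam - ab, 0) / ((gamma - 1) * lam) * (ab > lam)
    )

def mcp_pen(b, lam, gamma=3.0):
    """MCP penalty: quadratic up to gamma*lam, then flat at gamma*lam^2/2."""
    ab = np.abs(b)
    return np.where(ab <= gamma * lam,
                    lam * ab - ab ** 2 / (2 * gamma),
                    gamma * lam ** 2 / 2)

b = np.linspace(-3, 3, 7)
print(lasso_pen(b, 0.5))
print(mcp_pen(b, 0.5))
```

Note how MCP and SCAD flatten out for large |β|, which is what reduces the bias that LASSO's constant-slope penalty imposes on large coefficients.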
Software

- Analysis was performed in R
- LASSO implemented with the glmnet package (Friedman, Hastie, Tibshirani, 2013)
- SCAD and MCP via the coordinate descent algorithm (Breheny and Huang, 2011) in the ncvreg package
- BIC and Cp model selection with the leaps package (Lumley, 2009)
[Figure: fitted values for subject 95, from B = 4000 nonparametric bootstrap replications, under the MCP, SCAD and LASSO penalties]
[Figure: fitted values for subject 95, from B = 4000 nonparametric bootstrap replications, under BIC and Cp model selection]
[Figure: 95% confidence intervals (standard, quantile, smooth) for the fitted value of subject 95, based on B = 4000 nonparametric bootstrap samples, for the MCP, SCAD and LASSO penalties]
[Figure: lengths of 95% confidence intervals (standard, quantile, smooth) for the fitted value of subject 95, based on B = 4000 nonparametric bootstrap samples, for Cp and BIC]
Table: Prostate data, B = 4000, Observation 95

    model  type      fitted value   sd     length   coverage
    LASSO  standard  3.62           0.31    1.21    0.94
           quantile                         1.20    0.95
           smooth    3.57           0.29    1.14    0.93
    SCAD   standard  3.60           0.35    1.37    0.95
           quantile                         1.33    0.95
           smooth    3.62           0.33    1.28    0.93
    MCP    standard  3.60           0.35    1.38    0.96
           quantile                         1.35    0.95
           smooth    3.61           0.33    1.29    0.94
    BIC    standard  5.50           4.75   18.62    0.84
           quantile                        16.05    0.95
           smooth    3.22           3.46   13.55    0.83
    Cp     standard  5.13           5.11   20.02    0.86
           quantile                        16.15    0.95
           smooth    0.64           4.40   17.24    0.97
An Example: Parametric Bootstrap Analysis

- Obtain OLS estimates µ̂_OLS based on the full model
- Generate y* ~ N(µ̂_OLS, I)  (full-model bootstrap)
- Then:

    y*  --Cp-->  M*, β̂*_{M*}  -->  µ̂* = X_{M*} β̂*_{M*}

- Repeat B = 4000 times, giving t*_ij = µ̂*_ij
- Smoothed estimates:

    s_j = (1/B) Σ_{i=1}^{B} t*_ij
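A compact Python sketch of this parametric-bootstrap pipeline (simulated data with a reduced B; the hypothetical helper `cp_fit` searches nested models over the first k columns, a simplification of the talk's regsubsets candidate list):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, B = 97, 8, 200

X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 0.5, 0, 0, 0.3, 0, 0, 0])
y = X @ beta_true + rng.normal(size=n)

# step 1: OLS fit of the FULL model gives mu_hat
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
mu_hat = X @ beta_full

def cp_fit(X, y, sigma2=1.0):
    """Cp selection over nested models using the first k columns;
    returns the fitted values of the winning model."""
    n = len(y)
    best_cp, best_mu = np.inf, None
    for k in range(1, X.shape[1] + 1):
        Xk = X[:, :k]
        b, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        r = y - Xk @ b
        cp = r @ r / n + 2 * sigma2 * k / n
        if cp < best_cp:
            best_cp, best_mu = cp, Xk @ b
    return best_mu

# step 2: parametric bootstrap y* ~ N(mu_hat, I), reselect + refit each time
t_star = np.empty((B, n))
for i in range(B):
    y_star = mu_hat + rng.normal(size=n)
    t_star[i] = cp_fit(X, y_star)

s = t_star.mean(axis=0)   # smoothed fitted values s_j
print(s[:3])
```

Averaging `t_star` over replications gives the smoothed fit s_j for every subject at once, and the same replication matrix feeds the covariance-based standard error from the Bootstrap Smoothing section.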
[Figure: fitted values for subject 95, from B = 4000 parametric bootstrap replications of the Cp-chosen model; 53% of the replications exceed the original estimate of 3.6]
[Figure: fitted values for subject 95, from B = 4000 parametric bootstrap replications, separated by the three models most frequently chosen by Cp (m6, m7, m8)]
[Figure: boxplots of fitted values for subject 95 by Cp-chosen model (m1-m8, selection frequencies 3%, 6%, 16%, 12%, 13%, 14%, 17%, 19%; Model 8 most frequent), based on B = 4000 parametric bootstrap samples]
[Figure: fitted values for subject 95, from B = 4000 parametric bootstrap replications of the BIC-chosen model; 40% of the replications exceed the original estimate of 3.7]
[Figure: fitted values for subject 95, from B = 4000 parametric bootstrap replications, separated by the three models most frequently chosen by BIC (m1: 27%, m2: 20%, m3: 18%)]
[Figure: boxplots of fitted values for subject 95 by BIC-chosen model (m1-m8, selection frequencies 20%, 18%, 27%, 13%, 9%, 5%, 5%, 3%; Model 3 most frequent), based on B = 4000 parametric bootstrap samples]
Discussion

- Improvements for regularized procedures, where tuning parameters are also chosen in a data-driven fashion
- GLMs?
- Why the parametric bootstrap?
What I have done so far
1. BSc Actuarial Math, Concordia (2005-2008)
2. Pension actuary (2008-2011)
3. RA at the Chest with Andrea Benedetti (2011-2012)
4. MSc Biostats, Queen's (2012-2013)

What's Next?
1. PhD Biostatistics, McGill (2013-???)
2. Supervisor: Celia Greenwood (Statistical Genetics)