22s:152 Applied Linear Regression
Chapter 16: Bootstrapping
An option for dealing with violation of assumptions.
————————————————————
• What do we do when our distributional assumptions (like normality) are not met?
• The assumptions get us valid p-values...
– valid standard errors
– valid confidence intervals
– valid hypothesis tests
• Consider inference on a mean µ where we have the point estimate X̄.
– Define the statistic: T = (X̄ − µ)/(s/√n)
– With normality, we have T ∼ t(n−1) (a t distribution with n − 1 degrees of freedom), which we use to...
∗ form valid 100(1 − α)% CIs
∗ perform α-level hypothesis tests
– Without normality, we cannot assume T has this distribution.
– Without the known distributional ‘behavior’ of the random variable T, we can’t make valid confidence intervals for µ, nor run hypothesis tests on µ with a known error rate using our usual methods.
• If we do not know the theoretical ‘sampling distribution’ of a relevant statistic (like X̄), we can instead use the bootstrapping approach to statistical inference.
Bootstrapping, what is it...
• A nonparametric approach to statistical inference that gives us...
– valid standard errors
– valid confidence intervals
– valid hypothesis tests
without the normality assumption.
• Some dislike the term nonparametric and prefer the term distribution-free.
• Assumption we do need:
– The sampled data provide a reasonable representation of the population from which they came.
• Bootstrapping is more computationally intensive than traditional inference because it is a ‘resampling’ method.
• Recall, a sampling distribution is...
the probability distribution of a statistic.
Examples (true under certain conditions):
1. X̄ ∼ N(µ, σ²/n)
2. β̂ ∼ N(β, V (β̂))
These sampling distributions allow us to perform hypothesis tests and form confidence intervals on parameters of interest, like µ and β.
But what do we do if our needed ‘conditions’ are not met? One option is bootstrapping.
Bootstrapping: Example for inference on ρ (population correlation)
• Average values of GPA and LSAT scores for students admitted to n = 15 law schools in 1973 (a random sample of law schools).
School LSAT GPA
1 576 3.39
2 635 3.30
3 558 2.81
4 579 3.03
5 666 3.44
6 580 3.07
7 555 3.00
8 661 3.43
9 651 3.36
10 605 3.13
11 653 3.12
12 575 2.74
13 545 2.76
14 572 2.88
15 594 2.96
– Can we make a confidence interval for ρ?
– The point estimate for ρ is the sample correlation r:
r = 0.7766
[Scatterplot of GPA (2.8 to 3.4) versus LSAT (560 to 660) for the 15 schools]
– Classical inference on ρ depends on X and Y having a bivariate normal distribution.
– In the above sample, there are a few outliers suggesting this assumption may be violated.
– We’ll use the bootstrap approach to do statistical inference instead.
– In the bootstrap method, we generate many possible sample data sets (based on the single data set we actually observed), and from each one, we calculate an estimate r. So, we’ll have an r∗b calculated from each hypothetical (or bootstrap) sample b.
– This provides an empirical ‘sampling distribution’ for the estimator (by empirical we mean based on the observed data):
r∗1, r∗2, r∗3, . . . , r∗B
The distribution of the above values will give us an idea of the variability (and sampling distribution) of our estimator r.
– We will assume these n = 15 observations are a representative sample from the population of all law schools (the assumption we make).
– Repeating steps (i) & (ii) below B times generates an empirical ‘sampling distribution’ for r:
(i) Resample from the original sample with replacement to create a bootstrap sample of the same size n = 15. One observation looks like xi = (x1i, x2i).
(ii) From this bth bootstrap sample, calculate the estimate r∗b.
## Create a function to get 1 new bootstrapped r.
## Input arguments: n, data
## Output: r
> get.1.bootstrapped.r=function(n=15,data=law){
    ## Get indices of resampled observations
    ## (sample() with replace=TRUE draws n indices by default):
    chosen=sample(1:n,replace=TRUE)
    ## Get new bootstrap sample:
    bootstrap.sample=data[chosen,]
    ## Calculate r:
    r=cor(bootstrap.sample$LSAT,bootstrap.sample$GPA)
    return(r)
  }
## Call the function:
> r.bootstrap=get.1.bootstrapped.r(n=15,data=law)
> r.bootstrap
GPA
LSAT 0.953635
This particular bootstrap sample gave a fairly high correlation, r∗1 = 0.9536.
The bootstrap sample looked like:
School LSAT GPA
9 651 3.36
4 579 3.03
6 580 3.07
3 558 2.81
6 580 3.07
8 661 3.43
9 651 3.36
4 579 3.03
10 605 3.13
8 661 3.43
7 555 3.00
13 545 2.76
3 558 2.81
15 594 2.96
6 580 3.07
## Call the function again (get a new bootstrap sample):
> r.bootstrap=get.1.bootstrapped.r(n=15,data=law)
> r.bootstrap
GPA
LSAT 0.7605992
This particular sample gave a lower correlation, r∗2 = 0.7606.
The bootstrap sample looked like:
School LSAT GPA
13 545 2.76
7 555 3.00
11 653 3.12
4 579 3.03
5 666 3.44
5 666 3.44
12 575 2.74
15 594 2.96
11 653 3.12
12 575 2.74
3 558 2.81
13 545 2.76
4 579 3.03
9 651 3.36
1 576 3.39
– Repeat the procedure B times to get the bootstrap estimates: r∗1, r∗2, . . . , r∗B.
> B=1000
## Allocate space to save B estimates:
> bootstrapped.r.values=rep(0,B)
> for (i in 1:B){bootstrapped.r.values[i]=
get.1.bootstrapped.r(n=15,data=law)}
> hist(bootstrapped.r.values,col="grey80",n=16)
> abline(v=0.7766,col="red",lwd=2)
> box()
> legend(.3,70,"Observed r",lwd=2,col="red")
[Histogram of bootstrapped.r.values (roughly 0.2 to 1.0), with a red vertical line at the observed r = 0.7766]
– This distribution gives us the empirical sampling distribution of our estimator r.
– We can use it to make a 90% empirical confidence interval for ρ.
– Order the r∗b values, and use the 5th percentile and the 95th percentile as the lower and upper end points...
> quantile(bootstrapped.r.values,0.05)
5%
0.5373854
> quantile(bootstrapped.r.values,0.95)
95%
0.9506562
The 90% empirical CI for ρ is: [0.5374, 0.9507]

[Histogram of bootstrapped.r.values with the lower 5% tail (50 observations) and the upper 5% tail marked]
– We were able to create a CI without any distributional assumptions, i.e. nonparametrically (very useful in many situations).
– This only works if the original sample is representative of the original population.
– Recall what sampling variability of an estimator is... BEFORE we collect our data, the estimator is a random variable because its value depends on the sample chosen.
– The bootstrap method uses resampling to get a handle on this variability (since we can’t get at it theoretically because we don’t have normality).
– We should resample from the n observations in the same manner as how the original data were sampled (here, we had a simple random sample).
• There is an R package called boot that will do bootstrapping for us...
R contributed packages: http://cran.r-project.org/web/packages/
Some of these packages may be available from the pull-down menu under Install Packages...
(‘boot’ is already pre-installed in SH 41; use > library(boot) ).
Installing a package from a ‘mirror’ (i.e. a package provider): > install.packages("boot")
Then choose a ‘mirror’; there is an Iowa one... IA.
To see the list of available packages from a ‘mirror’... > available.packages()
Then choose a ‘mirror’.
You can also download a package to your local drive and load it from the pull-down menu.
• After installing the boot package...
> library(boot)
## The main function is called boot():
> ?boot
## The form is ‘boot(data, compute.statistic, R)’.
## where the first argument is the data, and the
## second argument is a function computing the
## statistic of interest. ‘compute.statistic’ must
## take two inputs as (data, indices) where the data
## will be put in order by ‘indices’. For the original
## data, indices=1:n. R is the number of bootstrap
## samples requested.
## We define the ‘compute.statistic’ function:
> get.r=function(data,indices){
## order rows of data by ‘indices’:
data=data[indices,]
## Calculate r:
r=cor(data$LSAT,data$GPA)
return(r)
}
## Get a bootstrap confidence interval from
## 1000 bootstrap samples using boot():
> boot.out=boot(law, get.r, R=1000)
> boot.ci(boot.out, conf=0.90, type="perc")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = boot.out, conf = 0.9, type = "perc")
Intervals :
Level Percentile
90% ( 0.5399, 0.9538 )
Calculations and Intervals on Original Scale
This is quite close to the interval we made earlier, [0.5374, 0.9507].
Bootstrapping Regression Models
• You can use this same procedure for inference on βj in a regression model.
• Example: Anscombe data set:
U.S. State Public-School Expenditures in 1970
VARIABLES
education -- Per-capita education expenditures, $
income -- Per-capita income, $
> attach(Anscombe)
> plot(income,education)
[Scatterplot of education (150 to 350) versus income (2000 to 4500)]
> lm.out=lm(education ~ income)
> plot(lm.out$fitted.values,lm.out$residuals,pch=16)
> qqnorm(lm.out$residuals,pch=16)
[Residuals vs. fitted values plot (fitted values 140 to 260, residuals −50 to 100) and Normal Q–Q plot of the residuals]
> lm.out$coefficients
(Intercept) income
17.71003077 0.05537594
——————————————————–
Use the bootstrap method to get a confidence interval on both regression coefficients.
Define a function for ‘boot’ to get the coefficients:
> get.coeffic=function(data,indices){
data=data[indices,]
lm.out=lm(education ~ income,data=data)
return(lm.out$coefficients)
}
Call the function once for the original data:
> n=nrow(Anscombe)
> get.coeffic(Anscombe,1:n)
(Intercept) income
17.71003077 0.05537594
These estimates are from the original data.
——————————————————–
Using ‘boot’ to get 1000 bootstrap estimates...
> boot.out=boot(Anscombe,get.coeffic,R=1000)
## The 1000 bootstrap regression estimates are in...
> boot.out$t
[,1] [,2]
[1,] 17.52784692 0.05423758
[2,] 16.55435429 0.05580714
[3,] -51.16559342 0.07918406
[4,] 37.51375190 0.04516923
[5,] -6.07968045 0.06205171
. .
. .
. .
[999,] 22.27582056 0.05379981
[1000,] 110.27523283 0.02707580
Empirical 95% CI for β1
> boot.ci(boot.out,index=2,type="perc",conf=0.95)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = boot.out, type = "perc", index = 2)
Intervals :
Level Percentile
95% ( 0.0359, 0.0776 )
Calculations and Intervals on Original Scale
Empirical 95% CI for β0
> boot.ci(boot.out,index=1,type="perc",conf=0.95)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = boot.out, type = "perc", index = 1)
Intervals :
Level Percentile
95% (-51.41, 78.34 )
Calculations and Intervals on Original Scale
Joint sampling distribution of (β̂0, β̂1)
> dataEllipse(boot.out$t[,2],boot.out$t[,1],
xlab="slope",ylab="intercept",levels=c(.5,.95,.99))
[Data ellipse of the 1000 bootstrap (slope, intercept) pairs, with 50%, 95%, and 99% contours; slope 0.00 to 0.08, intercept −100 to 100]
{dataEllipse() is in the car library. It superimposes the
normal-probability contours over a scatterplot of the data}
Using the 95% CI for β1 to test the hypothesis H0 : β1 = 0 at the α = 0.05 level...
Intervals :
Level Percentile
95% ( 0.0359, 0.0776 )
The CI does not contain 0. We reject H0.
Using the 95% CI for β0 to test the hypothesis H0 : β0 = 0 at the α = 0.05 level...
Intervals :
Level Percentile
95% (-51.41, 78.34 )
The CI contains 0. We fail to reject H0.
——————————————————
We can use the average of the 1000 bootstrap estimates as our estimated regression coefficients:
> apply(boot.out$t,2,mean)
[1] 17.33916141 0.05551061
——————————————————
Comparing to parametric results from the original sample:
> summary(lm.out)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.710031 28.873840 0.613 0.542
income 0.055376 0.008823 6.276 8.76e-08 ***
(Recall the original wasn’t terribly non-normal.)
Comments
• We can use bootstrapping to do statistical inference when the assumptions of normality and/or constant variance are violated.
• The sample must be a representative sample from the population.
• The bootstrap approximation becomes more accurate for larger samples (larger n).
• There are bias-correction methods for bootstrapping, which we leave for further study.
• You can also use the distribution of bootstrap estimates to calculate a bootstrap estimate of the standard error of θ̂:

SE∗(θ̂∗) = sqrt[ Σ_{b=1}^{B} (θ̂∗b − θ̄∗)² / (B − 1) ],  where θ̄∗ = (1/B) Σ_{b=1}^{B} θ̂∗b
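This SE∗ formula is just the sample standard deviation of the B bootstrap estimates. A minimal sketch, in Python rather than the lecture's R (the helper name bootstrap_se is ours, standard library only):

```python
import math

def bootstrap_se(estimates):
    """Bootstrap standard error: sample standard deviation of the
    B bootstrap estimates (squared deviations from their mean,
    summed and divided by B - 1, then square-rooted)."""
    B = len(estimates)
    theta_bar = sum(estimates) / B  # the bootstrap mean, theta-bar*
    ss = sum((t - theta_bar) ** 2 for t in estimates)
    return math.sqrt(ss / (B - 1))
```

Applied to the vector of r∗b values from earlier (boot.out$t in R), this would give a bootstrap standard error for r.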
• The type of bootstrapping of regression models described here (random resampling of the observations) considers the regressors (the X’s) to be random, not fixed.
• If you have fixed X-values (as in an experimental design), you can use a parametric bootstrap and/or residual resampling, which takes this into account.
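Residual resampling is left for further study, but the idea can be sketched: keep the fixed x-values, and resample only the least-squares residuals to build new responses y∗ = fitted + e∗. A hedged illustration in Python rather than R (the function name and toy data are ours):

```python
import random

def residual_bootstrap(x, y, B=1000, seed=1):
    """Residual-resampling bootstrap for simple linear regression.

    Fit y = b0 + b1*x by least squares, then B times: build
    y* = fitted + residuals resampled with replacement (x stays
    fixed) and refit. Returns the B bootstrap slope estimates.
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    rng = random.Random(seed)
    slopes = []
    for _ in range(B):
        # New responses at the SAME x-values, with resampled errors:
        ystar = [fi + rng.choice(resid) for fi in fitted]
        ystar_bar = sum(ystar) / n
        b1_star = sum((xi - xbar) * (ysi - ystar_bar)
                      for xi, ysi in zip(x, ystar)) / sxx
        slopes.append(b1_star)
    return slopes
```

The returned slopes can then be ordered and their percentiles used for a CI, exactly as in the random-X case.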
Bootstrapping, the general process...
1. We’re interested in doing inference on the population parameter θ using the estimator θ̂.
2. We have a representative sample of n observations from the population.
3. We resample with replacement from the original sample of size n to create a bootstrap sample of the same size n.
4. From the bth bootstrap sample, calculate the estimate θ̂∗b.
5. Repeat steps 3 & 4 B times to generate an empirical sampling distribution for θ̂: {θ̂∗1, θ̂∗2, . . . , θ̂∗B}.
6. Use the distribution of the θ̂∗b’s to estimate properties of the sampling distribution of θ̂.
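The six steps above can be sketched generically for any statistic. A minimal percentile-bootstrap sketch in Python rather than the lecture's R (the function name percentile_bootstrap_ci and the toy data are ours, standard library only):

```python
import random

def percentile_bootstrap_ci(sample, statistic, B=1000, conf=0.90, seed=1):
    """Steps 3-6: resample with replacement B times, compute the
    statistic on each bootstrap sample, and use the empirical
    alpha/2 and 1 - alpha/2 quantiles as the CI endpoints."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        statistic([rng.choice(sample) for _ in range(n)])
        for _ in range(B)
    )
    alpha = 1 - conf
    lo = estimates[int(alpha / 2 * B)]
    hi = estimates[int((1 - alpha / 2) * B) - 1]
    return lo, hi

# Example: 90% bootstrap CI for a mean, using made-up data.
data = [3.1, 2.8, 3.4, 3.0, 2.9, 3.3, 3.2, 2.7, 3.5, 3.0]
lo, hi = percentile_bootstrap_ci(data, lambda s: sum(s) / len(s))
```

Passing a correlation or a regression-coefficient function as `statistic` recovers the two examples from this chapter.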