Applied Linear Regression
CSTAT Workshop, March 16, 2007
Vince Melfi
References
• “Applied Linear Regression,” Third Edition by Sanford Weisberg.
• “Linear Models with R,” by Julian Faraway.
• Countless other books on Linear Regression, statistical software, etc.
Statistical Packages
• Minitab (we’ll use this today)
• SPSS
• SAS
• R
• Splus
• JMP
• ETC!!
Outline
I. Simple linear regression review
II. Multiple Regression: Adding predictors
III. Inference in Regression
IV. Regression Diagnostics
V. Model Selection
I. Simple Linear Regression Review
Savings Rate Data
Data on Savings Rate and other variables for 50 countries. Want to explore the effect of variables on savings rate.
• SaveRate: Aggregate Personal Savings divided by disposable personal income. (Response variable.)
• Pop>75: Percent of the population over 75 years old. (One of the predictors.)
[Scatterplot of SaveRate vs pop>75]
Regression Output
The regression equation is
SaveRate = 7.152 + 1.099 pop>75
S = 4.29409 R-Sq = 10.0% R-Sq(adj) = 8.1%
Analysis of Variance
Source      DF   SS       MS       F     P
Regression   1    98.545  98.5454  5.34  0.025
Error       48   885.083  18.4392
Total       49   983.628
Callouts on the slide point to the fitted model (the regression equation), R2 (the coefficient of determination), and the test of the model (the ANOVA F test).
Importance of Plots
• Four data sets
• All have:
  – Regression line Y = 3 + 0.5x
  – R2 = 66.7%
  – S = 1.24
  – Same t statistics, etc.
• Without looking at plots, the four data sets would seem similar.
Importance of Plots (1)
[Fitted line plot: y1 = 3.000 + 0.5001 x1; S = 1.23660, R-Sq = 66.7%, R-Sq(adj) = 62.9%]
Importance of Plots (2)
[Fitted line plot: y2 = 3.001 + 0.5000 x1; S = 1.23721, R-Sq = 66.6%, R-Sq(adj) = 62.9%]
Importance of Plots (3)
[Fitted line plot: y3 = 3.002 + 0.4997 x1; S = 1.23631, R-Sq = 66.6%, R-Sq(adj) = 62.9%]
Importance of Plots (4)
[Fitted line plot: y4 = 3.002 + 0.4999 x2; S = 1.23570, R-Sq = 66.7%, R-Sq(adj) = 63.0%]
The model
• Yi = β0 + β1xi + ei, for i = 1, 2, …, n
• “Errors” e1, e2, …, en are assumed to be independent.
• Usually e1, e2, …, en are assumed to have the same standard deviation, σ.
• Often e1, e2, …, en are assumed to be normally distributed.
Least Squares
• The regression line (line of best fit) is based on “least squares.”
• The regression line is the line that minimizes the sum of the squared vertical deviations of the data from the line.
• The least squares line has certain optimality properties.
• The least squares line is denoted Ŷ = β̂0 + β̂1X, so the data decompose as Yi = β̂0 + β̂1Xi + êi
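In R (one of the packages listed earlier), the least squares fit is a single call to lm(). A minimal sketch, assuming the savings data sit in a data frame called savings with columns SaveRate and pop75 (the slide's pop>75, renamed to a legal R name):

# Least squares fit of SaveRate on pop75
fit <- lm(SaveRate ~ pop75, data = savings)
coef(fit)     # the estimates: 7.152 and 1.099 on the earlier slide
summary(fit)  # also reports S, R-Sq, and the t and F tests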
Residuals
• The residuals represent the difference between the data and the least squares line:
êi = Yi - Ŷi
[Plot of Y vs X illustrating the residuals as deviations of the data from the fitted line]
Checking assumptions
• Residuals are the main tool for checking model assumptions, including linearity and constant variance.
• Plotting the residuals versus the fitted values is always a good idea, to check linearity and constant variance.
• Histograms and Q-Q plots (normal probability plots) of residuals can help to check the normality assumption.
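As a sketch, here are the base-R analogues of Minitab's "four in one" residual display, using the fit from the sketch above:

plot(fitted(fit), resid(fit),
     xlab = "Fitted Value", ylab = "Residual")  # linearity, constant variance
abline(h = 0, lty = 2)
hist(resid(fit), xlab = "Residual")             # rough normality check
qqnorm(resid(fit)); qqline(resid(fit))          # normal probability plot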
[Residual Plots for SaveRate: normal probability plot, residuals versus fits, histogram of residuals, and residuals versus observation order — the "four in one" plot from Minitab]
Coefficient of determination (R2)
Residual sum of squares, aka sum of squares for error:

RSS = SSE = Σ êi², summing over i = 1, …, n

Total sum of squares:

TSS = SST = Σ (yi - ȳ)²

Coefficient of determination:

R2 = (TSS - RSS) / TSS
R2
• The coefficient of determination, R2, measures the proportion of the variability in Y that is explained by the linear relationship with X.
• It’s also the square of the Pearson correlation coefficient between X and Y.
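The identity is easy to verify numerically; a quick R check using the built-in mtcars data rather than the workshop data:

fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$r.squared        # R2 from the regression
cor(mtcars$wt, mtcars$mpg)^2  # squared Pearson correlation: same number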
II. Multiple regression: Adding predictors
Adding a predictor
• Recall: the fitted model was SaveRate = 7.152 + 1.099 pop>75 (the p-value for the test of whether pop>75 is significant was 0.025)
• Another predictor: DPI (per-capita disposable income)
• Fitted model: SaveRate = 8.57 + 0.000996 DPI (p-value for DPI: 0.124)
Adding a predictor (2)
• Model with both pop>75 and DPI is SaveRate = 7.06 + 1.30 pop>75 - 0.00034 DPI
• p-values are 0.100 and 0.738 for pop>75 and DPI
• The sign of the coefficient of DPI has changed!
• pop>75 was significant alone, but neither it nor DPI is significant when both are in the model!
Adding a predictor (3)
[Fitted line plot: pop>75 = 1.158 + 0.001025 DPI; S = 0.804599, R-Sq = 61.9%, R-Sq(adj) = 61.1%]
• What happened?
• The predictors pop>75 and DPI are highly correlated.
Added variable plots and partial correlation
1. Residuals from a fit of SaveRate versus pop>75 give the variability in SaveRate that’s not explained by pop>75.
2. Residuals from a fit of DPI versus pop>75 give the variability in DPI that’s not explained by pop>75.
3. A fit of the residuals from (1) versus the residuals from (2) gives the relationship between SaveRate and DPI after adjusting for pop>75. This is called an “added variable plot.”
4. The correlation between the residuals from (1) and the residuals from (2) is the “partial correlation” between SaveRate and DPI adjusted for pop>75.
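A sketch of steps 1–4 in R, again assuming a data frame savings with columns SaveRate, pop75, and DPI:

r1 <- resid(lm(SaveRate ~ pop75, data = savings))  # step 1
r2 <- resid(lm(DPI ~ pop75, data = savings))       # step 2
plot(r2, r1)          # step 3: the added variable plot
coef(lm(r1 ~ r2))[2]  # equals DPI's slope in the two-predictor model
cor(r1, r2)           # step 4: the partial correlation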
Added variable plot
[Added variable plot: RESSRvspop>75 = 0.0000 - 0.000341 RESDPIvspop>75; S = 4.28891, R-Sq = 0.2%, R-Sq(adj) = 0.0%]
Note that the slope term, -0.000341, is the same as the slope term for DPI in the two-predictor model.
Scatterplot matrices (Matrix Plots)
• With one predictor X, a scatterplot of Y vs. X is very informative.
• With more than one predictor, scatterplots of Y vs. each of the predictors, and of each of the predictors vs. each other, are needed.
• A scatterplot matrix (or matrix plot) is just an organized display of these plots.
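In R, pairs() draws this display in one call; a sketch with the hypothetical savings data frame from earlier:

# Scatterplot matrix of the response and all predictors
pairs(savings[, c("SaveRate", "pop15", "pop75", "DPI", "changeDPI")])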
[Matrix plot of SaveRate, pop<15, pop>75, DPI, and changeDPI plotted against one another]
Changes in R2
• Consider adding a predictor X2 to a model that already contains the predictor X1
• Let R2,1 be the R2 value for the fit of Y vs. X1, and let R2,2 be the R2 value for the fit of Y vs. X2
Changes in R2 (2)
• The R2 value for the multiple regression fit is always at least as large as R2,1 and R2,2
• The R2 value for the multiple regression fit of Y versus X1 and X2 may be:
  – less than R2,1 + R2,2 (if the two predictors are explaining the same variation)
  – equal to R2,1 + R2,2 (if the two predictors measure different things)
  – more than R2,1 + R2,2 (e.g., the response is the area of a rectangle, and the two predictors are length and width)
Multiple regression model
• Response variable Y
• Predictors X1, X2, …, Xp
• Yi = β0 + β1Xi1 + β2Xi2 + … + βpXip + ei
• Same assumptions on the errors ei (independent, constant variance, normality)
III. Inference in regression
Inference in regression
• Most inference procedures assume independence, constant variance, and normality of the errors.
• Most are “robust” to departures from normality, meaning that the p-values, confidence levels, etc. are approximately correct even if normality does not hold.
• In general, techniques like the bootstrap can be used when normality is suspect.
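One common scheme is the case-resampling bootstrap: resample rows with replacement, refit, and take percentiles of the refitted slopes. A sketch for the simple savings-rate model (data frame savings as before):

set.seed(1)
slopes <- replicate(2000, {
  i <- sample(nrow(savings), replace = TRUE)  # resample cases
  coef(lm(SaveRate ~ pop75, data = savings[i, ]))[2]
})
quantile(slopes, c(0.025, 0.975))  # percentile bootstrap 95% CI for the slope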
New data set
• Response variable:
  – Fuel = per-capita fuel consumption (times 1000)
• Predictors:
  – Dlic = proportion of the population who are licensed drivers (times 1000)
  – Tax = gasoline tax rate
  – Income = per-person income in thousands of dollars
  – logMiles = base-2 log of federal-aid highway miles in the state
t tests
Regression Analysis: Fuel versus Tax, Dlic, Income, logMiles

The regression equation is
Fuel = 154 - 4.23 Tax + 0.472 Dlic - 6.14 Income + 18.5 logMiles

Predictor   Coef     SE Coef  T      P
Constant    154.2    194.9     0.79  0.433
Tax         -4.228   2.030    -2.08  0.043
Dlic        0.4719   0.1285    3.67  0.001
Income      -6.135   2.194    -2.80  0.008
logMiles    18.545   6.472     2.87  0.006

(The T column gives the t statistics; the P column gives the p-values.)
t tests (2)
• The t statistic tests the hypothesis that a particular slope parameter is zero.
• The formula is
t = (coefficient estimate)/(standard error)
• degrees of freedom are n-(p+1)
• p-values given are for the two-sided alternative
• This is like simple linear regression
F tests
• General structure:
  – Ha: large model
  – H0: smaller model, obtained by setting some parameters in the large model to zero, or equal to each other, or equal to a constant
  – RSSAH = residual sum of squares after fitting the large (alternative hypothesis) model
  – RSSNH = residual sum of squares after fitting the smaller (null hypothesis) model
  – dfNH and dfAH are the corresponding degrees of freedom
F tests (2)
• Test statistic:

F = [(RSSNH - RSSAH) / (dfNH - dfAH)] / [RSSAH / dfAH]
•Null distribution: F distribution with dfNH – dfAH numerator and dfAH denominator degrees of freedom
F test example
• Can the “economic” variables Tax and Income be dropped from the model with all four predictors?
• AH model includes all predictors
• NH model includes only Dlic and logMiles
• Fit both models and get RSS and df values
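In R the comparison is one call to anova(); a sketch assuming the fuel data sit in a data frame called fuel:

nh <- lm(Fuel ~ Dlic + logMiles, data = fuel)                 # null model
ah <- lm(Fuel ~ Tax + Dlic + Income + logMiles, data = fuel)  # alternative model
anova(nh, ah)  # F = [(RSSNH - RSSAH)/(dfNH - dfAH)] / [RSSAH/dfAH]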
F test example (2)
• RSSAH = 193700; dfAH = 46
• RSSNH = 243006; dfNH = 48
F = [(243006 - 193700) / (48 - 46)] / [193700 / 46] = 5.85
• P-value is the area to the right of 5.85 under an F(2,46) distribution, approx. 0.0054
•There’s pretty strong evidence that removing both Tax and Income is unwise
Another F test example
• Question: Does it make sense that the two “economic” predictors should have the same coefficient?
• Ha: Y = β0 + β1 Tax + β2 Dlic + β3 Income + β4 logMiles + error
• H0: Y = β0 + β1 Tax + β2 Dlic + β1 Income + β4 logMiles + error
• Note: H0 can be rewritten as Y = β0 + β1 (Tax + Income) + β2 Dlic + β4 logMiles + error
Another F test example (2)
• Fit the full model (AH)
• Create a new predictor “TI” by adding Tax and Income, and fit a model with TI, Dlic, and logMiles (NH)
F = [(195487 - 193700) / (47 - 46)] / [193700 / 46] = 0.424
• P-value is the area to the right of 0.424 under an F(1,46) distribution, approx. 0.518
• This suggests that the simpler model with the same coefficient for Tax and Income fits well.
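The same constraint can be imposed in R without creating a new column: wrapping the sum in I() makes Tax + Income act as a single predictor. Continuing the sketch above:

nh2 <- lm(Fuel ~ I(Tax + Income) + Dlic + logMiles, data = fuel)
anova(nh2, ah)  # should reproduce F = 0.424 on (1, 46) df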
Removing one predictor
• We have two ways to test whether one predictor can be removed from the model:
  – t test
  – F test
• The tests are equivalent, in the sense that t2 = F, and that the p-values will be equivalent.
Confidence regions
• Confidence intervals for one parameter use the familiar t-interval.
• For example, to form a 95% confidence interval for the parameter of Income in the context of the full (four predictor) model:
• -6.135 ± (2.013)(2.194) = -6.135 ± 4.417.
(The estimate -6.135 and standard error 2.194 come from the Minitab output; 2.013 is the critical value from the t distribution with 46 df.)
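In R, confint() builds these t-intervals directly from the fitted model; a sketch using the four-predictor fit ah from the earlier sketch:

confint(ah, level = 0.95)            # intervals for all parameters
confint(ah, "Income", level = 0.95)  # just Income: -6.135 +/- 4.417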
Joint confidence regions
• Joint confidence regions for two or more parameters are more complex, and use the F distribution in place of the t distribution.
• Minitab (and SPSS, and …) can’t draw these easily
• On the next page is a joint confidence region for the parameters of Dlic and Tax, drawn in R.
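One way to draw such a region (an assumption on my part; the slide doesn’t say which R function was used) is confidenceEllipse() from the car package:

library(car)  # assumed to be installed
confidenceEllipse(ah, which.coef = c(3, 2))  # coefficients 3 and 2: Dlic and Tax
points(0, 0, pch = 3)  # mark (0,0) to read off the joint test visually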
[Joint confidence region for the parameters of Dlic and Tax, with dotted lines indicating the individual confidence intervals for the two; the boundary of the region and the point (0,0) are marked]
Prediction
• Given a new set of predictor values x1, x2, …, xp, what’s the predicted response?
• It’s easy to answer this: Just plug the new predictors into the fitted regression model:
Ŷ = β̂0 + β̂1 x1 + β̂2 x2 + … + β̂p xp

• But how do we assess the uncertainty in the prediction? How do we form a confidence interval?
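In R, predict() returns both kinds of intervals; a sketch that matches the Minitab output below:

new <- data.frame(Dlic = 900, Income = 28, logMiles = 15, Tax = 17)
predict(ah, newdata = new, interval = "confidence")  # CI for the mean response
predict(ah, newdata = new, interval = "prediction")  # PI for one new state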
Predicted Values for New Observations

New Obs  Fit     SE Fit  95% CI            95% PI
1        613.39  12.44   (588.34, 638.44)  (480.39, 746.39)

Values of Predictors for New Observations

New Obs  Dlic  Income  logMiles  Tax
1        900   28.0    15.0      17.0
Prediction interval for the fuel consumption for a state with Dlic=900, Income = 28, logMiles=15, and Tax = 17
Confidence interval for the average fuel consumption for states with Dlic = 900, Income = 28, logMiles=15, and Tax = 17
IV. Regression Diagnostics
Diagnostics
• Want to look for points that have a large influence on the fitted model.
• Want to look for evidence that one or more model assumptions are untrue.
• Tools:
  – Residuals
  – Leverage
  – Influence and Cook’s Distance
Leverage
• A point whose predictor values are far from the “typical” predictor values has high leverage.
• For a high-leverage point, the fitted value Ŷi will be close to the data value Yi.
• A rule of thumb: any point with leverage larger than 2(p+1)/n is interesting.
• Most statistical packages can compute leverages.
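In R the leverages are the hat values; a sketch applying the rule of thumb to the fuel model ah from earlier:

h <- hatvalues(ah)
p <- length(coef(ah)) - 1           # number of predictors
which(h > 2 * (p + 1) / length(h))  # points flagged as "interesting"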
[Scatterplot of y3 vs x1 with each point labeled by its leverage; values range from 0.090909 to 0.318182]
[Scatterplot of leverage vs index for the 50 countries in the savings data, labeled by country name, with a reference line at leverage 0.2]
Influential Observations
• A data point is influential if it has a large effect on the fitted model.
• Put another way, an observation is influential if the fitted model will change a lot if the observation is deleted.
• Cook’s Distance is a measure of the influence of an observation.
• It may make sense to refit the model after removing a few of the most influential observations.
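A sketch in R: cooks.distance() computes the measure, and refitting without the most influential observation shows how much the fit moves:

d <- cooks.distance(ah)
worst <- which.max(d)                 # most influential observation
refit <- update(ah, subset = -worst)  # refit without it
cbind(full = coef(ah), dropped = coef(refit))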
[Scatterplot of y3 vs x1 with each point labeled by its Cook’s Distance (values from 0.00035 to 1.39285); the high-leverage point has low influence, while another point has high influence]
[Scatterplot of Cook’s Distance vs index for the 50 countries, labeled by country name]
V. Model Selection
Model Selection
• Question: With a large number of potential predictors, how do we choose the predictors to include in the model?
• Want good prediction, but parsimony: Occam’s Razor.
• Also can be thought of as a bias-variance tradeoff.
Model Selection Example
• Data on all 50 states, from the 1970s:
  – Life.Exp = life expectancy (response)
  – Population (in thousands)
  – Income = per-capita income
  – Illiteracy (in percent of population)
  – Murder = murder rate per 100,000
  – HS.Grad (in percent of population)
  – Frost = mean number of days with minimum temperature < 32°F
  – Area = land area in square miles
Forward Selection
• Choose a cutoff α
• Start with no predictors
• At each step, add the predictor with the lowest p-value less than α
• Continue until there are no unused predictors with p-values less than α
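R’s step() runs a similar search, though it adds predictors by AIC rather than by a p-value cutoff, so it needn’t match Minitab exactly. A sketch, assuming a data frame states holding the variables above:

null <- lm(Life.Exp ~ 1, data = states)
full <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
             HS.Grad + Frost + Area, data = states)
step(null, scope = formula(full), direction = "forward")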
Stepwise Regression: Life.Exp versus Population, Income, ...

Forward selection. Alpha-to-Enter: 0.25
Response is Life.Exp on 7 predictors, with N = 50

Step          1        2        3        4
Constant      72.97    70.30    71.04    71.03

Murder        -0.284   -0.237   -0.283   -0.300
T-Value       -8.66    -6.72    -7.71    -8.20
P-Value       0.000    0.000    0.000    0.000

HS.Grad                0.044    0.050    0.047
T-Value                2.72     3.29     3.14
P-Value                0.009    0.002    0.003

Frost                           -0.0069  -0.0059
T-Value                         -2.82    -2.46
P-Value                         0.007    0.018

Population                               0.00005
T-Value                                  2.00
P-Value                                  0.052

S             0.847    0.796    0.743    0.720
R-Sq          60.97    66.28    71.27    73.60
R-Sq(adj)     60.16    64.85    69.39    71.26
Mallows Cp    16.1     9.7      3.7      2.0
Variations on FS
• Backward elimination:
  – Choose a cutoff α
  – Start with all predictors in the model
  – At each step, eliminate the predictor with the highest p-value that is greater than α
  – Continue until no remaining predictor has a p-value greater than α
• Stepwise: allow addition or elimination at each step (a hybrid of FS and BE)
All subsets
• Fit all possible models.
• Based on a “goodness” criterion, choose the model that fits best.
• Goodness criteria include AIC, BIC, Adjusted R2, and Mallows Cp
• Some of the criteria will be described next
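In R the leaps package handles the all-subsets search (an assumption: any all-subsets tool would do); a sketch:

library(leaps)  # assumed to be installed
subs <- regsubsets(Life.Exp ~ ., data = states, nvmax = 7)
summary(subs)$cp     # Mallows Cp for the best model of each size
summary(subs)$which  # which predictors each of those models uses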
Notation
• RSS* = Resid. Sum of Squares for the current model
• p* = Number of terms (including intercept) in the current model
• n = number of observations
• s2 = RSS/(n-(p+1)) = Estimate of σ2 from model with all predictors and intercept term.
Goodness criteria
• Smaller is better for AIC, BIC, Cp*. Larger is better for adjR2
• AIC = n log(RSS*/n) + 2p*
• BIC = n log(RSS*/n) + p* log(n)
• Cp* = RSS*/s2 + 2p* - n
• adjR2 = 1 - [(n - 1) / (n - p*)] (1 - R2)
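A sketch computing the criteria by hand for one candidate model, following the formulas above (this AIC differs from R’s AIC() by an additive constant, which doesn’t affect rankings):

cand <- lm(Life.Exp ~ Murder + HS.Grad + Frost + Population, data = states)
full <- lm(Life.Exp ~ ., data = states)          # for s^2
n <- nrow(states); pstar <- length(coef(cand))   # p* counts the intercept
RSS <- sum(resid(cand)^2)
s2  <- sum(resid(full)^2) / df.residual(full)
c(AIC = n * log(RSS / n) + 2 * pstar,
  BIC = n * log(RSS / n) + pstar * log(n),
  Cp  = RSS / s2 + 2 * pstar - n)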
Best Subsets Regression: Life.Exp versus Population, Income, ...

Response is Life.Exp

Vars  R-Sq  R-Sq(adj)  Mallows Cp  S        Variables in model
1     61.0  60.2       16.1        0.84732  Murder
2     66.3  64.8       9.7         0.79587  Murder, HS.Grad
3     71.3  69.4       3.7         0.74267  Murder, HS.Grad, Frost
4     73.6  71.3       2.0         0.71969  Population, Murder, HS.Grad, Frost
5     73.6  70.6       4.0         0.72773  best 5-predictor model
6     73.6  69.9       6.0         0.73608  best 6-predictor model
7     73.6  69.2       8.0         0.74478  all 7 predictors

(The original slide marks the predictors in each model with an X grid; the 1- through 4-variable models match the forward selection steps shown earlier.)
Model selection can overstate significance
• Generate Y and X1, X2, …, X50.
• All are independent and standard normal, so none of the predictors is related to the response.
• Fit the full model and look at the overall F test.
• Use model selection to choose a “good” smaller model, and look at its overall F test.
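A sketch of the experiment in R (step() selects by AIC rather than a p-value cutoff, but the effect is the same):

set.seed(2007)
dat  <- data.frame(y = rnorm(100), matrix(rnorm(100 * 50), 100, 50))
full <- lm(y ~ ., data = dat)
summary(full)$fstatistic  # overall F test: not significant
fwd <- step(lm(y ~ 1, data = dat), scope = formula(full),
            direction = "forward", trace = 0)
summary(fwd)  # the selected model's overall F test looks highly significant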
The full model
• Results from fitting the model with all 50 predictors
• Note that the F test is not significant

S = 0.915237   R-Sq = 57.6%   R-Sq(adj) = 14.3%

Analysis of Variance

Source          DF  SS       MS      F     P
Regression      50  55.7093  1.1142  1.33  0.160
Residual Error  49  41.0453  0.8377
Total           99  96.7546
The “good” small model
• Run FS with α = 0.05
• Predictors x38, x41, and x24 are chosen.
• Fit that three-predictor model. Now the F test is highly significant:

Analysis of Variance

Source          DF  SS       MS      F     P
Regression       3  20.9038  6.9679  8.82  0.000
Residual Error  96  75.8508  0.7901
Total           99  96.7546
What’s left?
• Weighted least squares
• Tests for lack of fit
• Transformations of response and predictors
• Analysis of Covariance
• Etc.