
Page 1: Inference and Diagnostics for Simple Linear Regression

Lecture 3: Inference and Diagnostics for Simple Linear Regression

Nancy R. Zhang

Statistics 203, Stanford University

January 12, 2010

Nancy R. Zhang (Statistics 203) Lecture 3 January 12, 2010 1 / 24

Page 2: Inference and Diagnostics for Simple Linear Regression

Review

Assumptions of the linear model

Y_i = β_0 + β_1 X_i + ε_i

1 Y has a linear dependence on X.
2 Error variances are equal.
3 Errors are independent.
4 Errors are Gaussian.

Data points that deviate from the "bulk", which we assume to satisfy these model assumptions, are called outliers.
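The four assumptions can be made concrete by simulating data that satisfies them. A minimal sketch; the coefficients, noise level, and sample size below are made up for illustration:

```python
import random

# Simulate Y_i = b0 + b1 * X_i + e_i with independent Gaussian errors of
# equal variance -- data satisfying all four model assumptions.
random.seed(0)
b0, b1, sigma = 1.0, 0.5, 0.2          # hypothetical true coefficients and error SD
x = [i / 10 for i in range(1, 21)]     # fixed design points
y = [b0 + b1 * xi + random.gauss(0.0, sigma) for xi in x]
```

An outlier, by contrast, would be a point generated by some other mechanism, e.g. with a much larger error variance or a shifted mean.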


Page 3: Inference and Diagnostics for Simple Linear Regression

Anscombe’s quartet


Page 4: Inference and Diagnostics for Simple Linear Regression

Anscombe’s quartet


Page 5: Inference and Diagnostics for Simple Linear Regression

Standardized Residuals

A simple way to evaluate model fit is through the residuals,

r_i = y_i − ŷ_i.

The residuals have variance

σ²(1 − h_ii),

where

h_ii = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)²

is called the leverage.

We need to standardize the residuals so that they are comparable:

z_i = r_i / (σ √(1 − h_ii)).

However, we don't know σ, so we need to estimate it.
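These quantities are easy to compute directly. A sketch with a small made-up data set (not from the lecture), using the fitted σ̂ from all of the data:

```python
import math

# Toy data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
n = len(x)

# Least-squares fit of y on x.
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

# Leverage: h_ii = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Standardize using the estimate of sigma from all of the data.
sigma_hat = math.sqrt(sum(r ** 2 for r in resid) / (n - 2))
z = [r / (sigma_hat * math.sqrt(1 - hi)) for r, hi in zip(resid, h)]
```

Note that in simple linear regression the leverages sum to 2 (one per fitted parameter), which is why their mean is 2/n.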

Page 6: Inference and Diagnostics for Simple Linear Regression

Standardized Residuals

There are two ways to estimate σ:

1 Using all of the data:

σ̂² = SSE / (n − 2)

2 Let SSE(i) be the sum of squared residuals obtained when we fit the model to the sample with the i-th point taken out:

σ̂²(i) = SSE(i) / (n − 3).

Thus, there are two ways to standardize r_i:

1 Studentized residuals: r_i / (σ̂ √(1 − h_ii))

2 Studentized deleted residuals:

r*_i = r_i / (σ̂(i) √(1 − h_ii)),

which has a t_{n−3} distribution.
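The studentized deleted residual can be computed naively by refitting without each point. A sketch on made-up toy data (a closed-form shortcut exists, but the refit makes the definition explicit):

```python
import math

def fit(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sxx = sum((u - xb) ** 2 for u in xs)
    b1 = sum((u - xb) * (v - yb) for u, v in zip(xs, ys)) / sxx
    return yb - b1 * xb, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy data for illustration
y = [1.1, 1.9, 3.2, 3.9, 5.1]
n = len(x)
b0, b1 = fit(x, y)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
r = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

r_star = []
for i in range(n):
    # Refit with the i-th point deleted, then estimate sigma from that fit.
    xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a0, a1 = fit(xs, ys)
    sse_i = sum((v - (a0 + a1 * u)) ** 2 for u, v in zip(xs, ys))
    sigma_i = math.sqrt(sse_i / (n - 3))
    r_star.append(r[i] / (sigma_i * math.sqrt(1 - h[i])))
```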


Page 7: Inference and Diagnostics for Simple Linear Regression

New York Rivers Data

Haith (1976) Study: How does land use around a river basin contribute to water pollution?

River         Forest  CommIndl  Nitrogen
Olean           63      0.29      1.1
Cassadaga       57      0.09      1.01
Oatka           26      0.58      1.9
Neversink       84      1.98      1
Hackensack      27      3.11      1.99
Wappinger       61      0.56      1.42
Fishkill        60      1.11      2.04
Honeoye         43      0.24      1.65
Susquehanna     62      0.15      1.01
Chenango        60      0.23      1.21
Tioughnioga     53      0.18      1.33
WestCanada      75      0.16      0.75
EastCanada      84      0.12      0.73
Saranac         81      0.35      0.8
Ausable         89      0.35      0.76
Black           82      0.15      0.87
Schoharie       70      0.22      0.8
Raquette        75      0.18      0.87
Oswegatchie     56      0.13      0.66
Cohocton        49      0.13      1.25

CommIndl: % land area in either commercial or industrial use
Nitrogen: mean nitrogen concentration (mg/liter)
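The table transcribes directly into code; a sketch fitting Nitrogen on CommIndl by least squares (the variable names are mine, the numbers are from the table above):

```python
# (CommIndl, Nitrogen) pairs from the Haith (1976) table.
rivers = {
    "Olean": (0.29, 1.1), "Cassadaga": (0.09, 1.01), "Oatka": (0.58, 1.9),
    "Neversink": (1.98, 1.0), "Hackensack": (3.11, 1.99),
    "Wappinger": (0.56, 1.42), "Fishkill": (1.11, 2.04),
    "Honeoye": (0.24, 1.65), "Susquehanna": (0.15, 1.01),
    "Chenango": (0.23, 1.21), "Tioughnioga": (0.18, 1.33),
    "WestCanada": (0.16, 0.75), "EastCanada": (0.12, 0.73),
    "Saranac": (0.35, 0.8), "Ausable": (0.35, 0.76), "Black": (0.15, 0.87),
    "Schoharie": (0.22, 0.8), "Raquette": (0.18, 0.87),
    "Oswegatchie": (0.13, 0.66), "Cohocton": (0.13, 1.25),
}
x = [v[0] for v in rivers.values()]   # CommIndl
y = [v[1] for v in rivers.values()]   # Nitrogen
n = len(x)

# Least-squares fit: Nitrogen = b0 + b1 * CommIndl + error
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```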


Page 8: Inference and Diagnostics for Simple Linear Regression

Diagnostic plots - scatter plot

Fit model:

Nitrogen = β_0 + β_1 Agr + error


Page 9: Inference and Diagnostics for Simple Linear Regression

Diagnostic plots - scatter plot

Fit model:

Nitrogen = β_0 + β_1 Forest + error


Page 10: Inference and Diagnostics for Simple Linear Regression

Diagnostic plots - standardized residual plot

Plot standardized residuals versus X_i; large standardized residuals indicate outliers.


Page 11: Inference and Diagnostics for Simple Linear Regression

Diagnostic plots - qq-plot of residuals

Quantile-quantile plots can sometimes be helpful for identifying outliers.


Page 12: Inference and Diagnostics for Simple Linear Regression

Diagnostic plots - scatter plot

Fit model:

Nitrogen = β_0 + β_1 CommInd + error



Page 16: Inference and Diagnostics for Simple Linear Regression

Outliers in X versus Outliers in Y

1 Outliers in Y can be detected by examining the standardized residuals,

r_i / (σ̂ √(1 − h_ii)),

which are easier to compute.

2 Outliers in X can be detected using the leverage values,

h_ii = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)².

The mean of the leverage values is 2/n. When the data set is reasonably large, leverage values larger than 2 × (mean leverage) are considered large. Another convention is to consider leverage between 0.2 and 0.5 as high.
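The 2 × (mean leverage) rule is a one-liner to apply. A sketch on made-up data in which one x-value sits far from the rest:

```python
# Toy design points; 10.0 is deliberately far out in x.
x = [1.0, 1.5, 2.0, 2.5, 3.0, 10.0]
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

# Leverage values h_ii = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2.
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
mean_h = sum(h) / n                 # equals 2/n in simple linear regression

# Flag points whose leverage exceeds twice the mean leverage.
flagged = [i for i, hi in enumerate(h) if hi > 2 * mean_h]
```

Here only the point at x = 10.0 is flagged; note that leverage depends on x alone, so this detects outliers in X regardless of the y-values.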


Page 17: Inference and Diagnostics for Simple Linear Regression

Leverage versus Residuals

[Figures: standardized residuals; leverage values]

Is there a measure that can detect both types of outliers?

Page 18: Inference and Diagnostics for Simple Linear Regression

Anscombe’s quartet


Page 19: Inference and Diagnostics for Simple Linear Regression

Influence of Outliers

How much influence does the data point have on the model fit?


Page 20: Inference and Diagnostics for Simple Linear Regression

Different measures of Influence

1 How much influence does observation i have on its own fit?

(DFFITS)_i = (ŷ_i − ŷ_i(i)) / √(MSE(i) · h_ii)

DFFITS exceeding 2√(p/n) is considered large.

2 How much influence does observation i have on the fitted β's?

(DFBETAS)_i = (β̂_1 − β̂_1(i)) / √(MSE(i) · c_11⁻¹),

where c_11 = Σ_i (x_i − x̄)². DFBETAS exceeding 2/√n is considered large.

These conventional rules work for "reasonably sized" data sets.
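Both measures can be computed by brute force, refitting with each point deleted. A sketch on made-up data (closed-form leave-one-out formulas exist, but the refit mirrors the definitions above):

```python
import math

def fit(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sxx = sum((u - xb) ** 2 for u in xs)
    b1 = sum((u - xb) * (v - yb) for u, v in zip(xs, ys)) / sxx
    return yb - b1 * xb, b1

x = [1.0, 2.0, 3.0, 4.0, 10.0]   # toy data; last point is high-leverage
y = [1.0, 2.1, 2.9, 4.2, 4.0]
n, p = len(x), 2                 # p = 2 parameters in simple linear regression
b0, b1 = fit(x, y)
xbar = sum(x) / n
c11 = sum((xi - xbar) ** 2 for xi in x)
h = [1 / n + (xi - xbar) ** 2 / c11 for xi in x]

dffits, dfbetas = [], []
for i in range(n):
    xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a0, a1 = fit(xs, ys)                               # fit without point i
    mse_i = sum((v - (a0 + a1 * u)) ** 2 for u, v in zip(xs, ys)) / (n - 3)
    yhat, yhat_i = b0 + b1 * x[i], a0 + a1 * x[i]
    dffits.append((yhat - yhat_i) / math.sqrt(mse_i * h[i]))
    dfbetas.append((b1 - a1) / math.sqrt(mse_i / c11))  # MSE(i) * c11^(-1)
```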


Page 21: Inference and Diagnostics for Simple Linear Regression

Different measures of Influence: Cook's Distance

1 Cook's distance is defined as

D_i = Σ_{j=1}^n (ŷ_j − ŷ_j(i))² / (p · MSE).

2 It considers the influence of Y_i on all of the fitted values, not just the i-th case.

3 It can be shown that D_i is equivalent to

(r_i² / (p · MSE)) · h_ii / (1 − h_ii)².

4 Compare D_i to the F_{p, n−p} distribution.
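The equivalence in point 3 can be checked numerically: compute D_i once from the definition (refitting without point i) and once from the closed form, and compare. A sketch on made-up toy data:

```python
def fit(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sxx = sum((u - xb) ** 2 for u in xs)
    b1 = sum((u - xb) * (v - yb) for u, v in zip(xs, ys)) / sxx
    return yb - b1 * xb, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # toy data for illustration
y = [1.2, 1.8, 3.1, 4.3, 4.6]
n, p = len(x), 2
b0, b1 = fit(x, y)
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - p)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Definition: sum over all fitted values of the change from deleting point i.
cooks_def = []
for i in range(n):
    a0, a1 = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    d = sum(((b0 + b1 * xj) - (a0 + a1 * xj)) ** 2 for xj in x) / (p * mse)
    cooks_def.append(d)

# Closed form: r_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2.
cooks_closed = [r * r / (p * mse) * hi / (1 - hi) ** 2
                for r, hi in zip(resid, h)]
```

The two lists agree to floating-point precision, confirming the identity without any refitting being needed in practice.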


Page 22: Inference and Diagnostics for Simple Linear Regression

Cook’s Distance


Page 23: Inference and Diagnostics for Simple Linear Regression

Masking of Outliers


Page 24: Inference and Diagnostics for Simple Linear Regression

What to do with outliers?

1 Sometimes they hint that our model assumptions are wrong.
2 Down-weight or delete outlying data points.
3 Transform the variables (more on this later).
