Page 1: Unit 4: Regression Assumptions and Diagnostics

Unit 4: Regression Assumptions and Diagnostics
Class 11…

http://xkcd.com/539/

Unit 4 / Page 1 © Andrew Ho, Harvard Graduate School of Education

Page 2: Unit 4: Regression Assumptions and Diagnostics

Where is Unit 4 in our 11-Unit Sequence?

Building a solid foundation
• Unit 1: Introduction to simple linear regression
• Unit 2: Correlation and causality
• Unit 3: Inference for the regression model

Mastering the subtleties
• Unit 4: Regression assumptions: Evaluating their tenability
• Unit 5: Transformations to achieve linearity

Adding additional predictors
• Unit 6: The basics of multiple regression
• Unit 7: Statistical control in depth: Correlation and collinearity

Generalizing to other types of predictors and effects
• Unit 8: Categorical predictors I: Dichotomies
• Unit 9: Categorical predictors II: Polychotomies
• Unit 10: Interaction and quadratic effects

Pulling it all together
• Unit 11: Regression in practice. Common extensions.

Unit 4 / Page 2 © Andrew Ho, Harvard Graduate School of Education

Page 3: Unit 4: Regression Assumptions and Diagnostics

In this unit, we’re going to learn about…

• Reprise of the assumptions required for OLS regression-based inference

• The four major types of model violations:
  – Outliers
  – Nonlinearity
  – Heteroscedasticity
  – Non-independence of errors

• Determining whether the regression assumptions hold—strategies and rationale
  – Why residuals provide a powerful lens for evaluating regression assumptions
  – Residuals as observations that "control for" or account for variables
  – Raw vs. standardized/studentized residuals
  – Leverage, outliers, and Cook's Distance
  – Residual plots: How to construct them and what to look for
  – What should we do if we identify an outlier or other unusual observation?

• How would we summarize our results?

Unit 4 / Page 3 © Andrew Ho, Harvard Graduate School of Education

Page 4: Unit 4: Regression Assumptions and Diagnostics

The linear regression model and its four assumptions

Assumption 1: At each value of $X$, there is a conditional distribution of $Y$ that is normal with mean $\mu_{Y|X}$ and SD $\sigma_{Y|X}$.

Assumption 2: The straight-line model is correct: the conditional means $\mu_{Y|X}$ fall on a line.

Assumption 3: Homoscedasticity. The conditional standard deviations, $\sigma_{Y|X}$, are equal across all values of $X$.

Assumption 4: Conditional independence. For any value of $X$, the $\epsilon_i$s are independent. They share no hidden common association. Cannot be visualized from the plot.

These four assumptions are often summarized in a shorthand: $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2_{Y|X})$. This reads: the population residuals, $\epsilon_i$, are independent and identically normally distributed with mean 0 (values centered on the prediction) and a common variance $\sigma^2_{Y|X}$ (and thus a common SD).

Unit 4 / Page 4 © Andrew Ho, Harvard Graduate School of Education

Page 5: Unit 4: Regression Assumptions and Diagnostics

Anscombe's Quartet: Four datasets with identical summary statistics

How can we investigate the distributions of our residuals?

Unit 4 / Page 5 © Andrew Ho, Harvard Graduate School of Education

Page 6: Unit 4: Regression Assumptions and Diagnostics

Diagnostic Plots: Residuals vs. Fitted Values

[Figure: scatterplot of y2 on x2 with the fitted line, alongside plots of residuals vs. x2 and residuals vs. fitted values]


To better visualize the residuals, we can subtract out the regression line. Instead of plotting y2 on x2, we can plot the residuals, y2 − ŷ2, on x2:

/* Graph the usual linear fit */
graph twoway (scatter y2 x2) (lfit y2 x2)
/* Run the regression */
regress y2 x2
/* Save residuals to variable resid */
predict resid, residuals
/* Plot residuals on x2 */
scatter resid x2
/* An equivalent visual representation is obtained by plotting residuals
   on their predicted ("fitted") values, the yhats. */
predict yhat
scatter resid yhat
// Or, in one smooth move after regress…
rvfplot, name(rvf2)

For simple linear regression, since the fitted values ŷ are linearly related to the x values, the plots look the same but just have a different horizontal axis. In practice, we use fitted values because it's harder to plot residuals against multiple x variables when we move to multiple regression.

Unit 4 / Page 6© Andrew Ho, Harvard Graduate School of Education

Page 7: Unit 4: Regression Assumptions and Diagnostics

Plots of residuals vs. fitted values for Anscombe's Quartet

[Figure: four residual-vs-fitted plots, one for each of Anscombe's datasets]

Your residuals will always sum to zero, and the correlation between your residuals and fitted values (or any x) will always be zero. RVF plots are a better lens through which to appreciate your residuals.
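A quick Stata check of these two properties, sketched under the assumption that one of the quartet's variable pairs (y2 and x2, as in the code above) is in memory:

regress y2 x2
predict yhat2               // fitted values (predict's default, xb)
predict resid2, residuals   // raw residuals
summarize resid2            // mean (and therefore sum) of the residuals is zero
correlate resid2 yhat2 x2   // residuals are uncorrelated with the fits and with x2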

Unit 4 / Page 7© Andrew Ho, Harvard Graduate School of Education

Page 8: Unit 4: Regression Assumptions and Diagnostics

Zagat: Can the price of your dinner predict its quality?

n = 76 Boston restaurants classified as "New American" in the '05–'06 guide.

Cost: Zagat surveyors' average estimate of the price of dinner, one drink, and tip.

Rating: Average of the restaurant's ratings for food, service, and decor. A 0–30 scale where 16–19 is good to very good, 20–25 is very good to excellent, and 26–30 is extraordinary to perfection.

Unit 4 / Page 8© Andrew Ho, Harvard Graduate School of Education

Page 9: Unit 4: Regression Assumptions and Diagnostics

Scatterplot of Zagat rating on Zagat-reported cost

[Figure: Zagat rating (0–30) vs. average estimated dinner cost w/drink and tip ($), with costs roughly $20–$70 and ratings roughly 16–26]

Unit 4 / Page 9© Andrew Ho, Harvard Graduate School of Education

Page 10: Unit 4: Regression Assumptions and Diagnostics

Regression Results

• 61.6% of the variance in ratings is associated with (accounted for by/explained by) the cost.

• The conditional standard deviation of ratings is 1.56 (the unconditional standard deviation was 2.5).

• A dollar difference in cost predicts (is associated with) a .186-point difference in ratings.

• The sample statistic of .186 is 10.9 estimated standard errors away from our null hypothesis of 0 slope. The probability of such a slope under the null hypothesis is very, very low. A plausible range for the "true" slope is between .15 and .22; 95% of these kinds of intervals are expected to contain the "true" slope.

• The relationship is moderately strong and statistically significant, where a $10 difference in cost predicts a 1.86-point difference in the Zagat ratings.
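These annotations describe output from a simple regression of rating on cost. A minimal sketch of commands that would produce such output (the variable names rating and cost follow the fitted equation on the next slide; the Zagat data are assumed to be in memory):

regress rating cost   // slope, SE, t, 95% CI, R-squared, and Root MSE (the conditional SD)
summarize rating      // the unconditional SD of ratings, for comparison with Root MSE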

Unit 4 / Page 10

Page 11: Unit 4: Regression Assumptions and Diagnostics

An estimated linear regression model of rating on cost

[Figure: the rating-on-cost scatterplot with the fitted regression line overlaid]

Unit 4 / Page 11© Andrew Ho, Harvard Graduate School of Education

And, subtracting out the regression line…

$\widehat{rating} = 13.83 + .186 \, cost$

Page 12: Unit 4: Regression Assumptions and Diagnostics

Plot of residuals vs. fitted values

[Figure: residuals (about −4 to 4) vs. fitted values (about 18 to 26) for the rating-on-cost regression]

Unit 4 / Page 12© Andrew Ho, Harvard Graduate School of Education

We look for: 1) linearity 2) outliers 3) homoscedasticity 4) normality

Page 13: Unit 4: Regression Assumptions and Diagnostics

Linearity: Eyeballing and lpoly (sparingly)

[Figure: residuals vs. fitted values, shown plain and with a local polynomial smooth overlaid (kernel = epanechnikov, degree = 0, bandwidth = .85)]

Unit 4 / Page 13 © Andrew Ho, Harvard Graduate School of Education

Responses to nonlinearity include polynomial or other nonlinear regression approaches (not covered here) and transformations (covered in Unit 5)

Ignoring or failing to address nonlinearity risks poor prediction, underestimation of R², and Type II error (missing a finding).
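A sketch of how such a smooth can be overlaid in Stata, assuming resid and yhat were created with predict as in the earlier code (the degree and bandwidth mirror the values printed on the plot):

graph twoway (scatter resid yhat) ///
             (lpoly resid yhat, degree(0) bwidth(.85))   // local polynomial smooth of residuals on fits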

Page 14: Unit 4: Regression Assumptions and Diagnostics

Outliers: Identifying and interpreting relatively large residuals

[Figure: residuals vs. fitted values with 29 Newbury, Bristol, and Stanhope Grille labeled]

But residuals can’t identify outliers alone.

Positive residuals. Underpredicted. High ratings after "controlling for," "partialling out," "regressing out," or accounting for cost.

Unit 4 / Page 14© Andrew Ho, Harvard Graduate School of Education

Page 15: Unit 4: Regression Assumptions and Diagnostics

Leverage: Extremity of an observation along predictor variables

© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 15

[Figure: two scatterplots—in one, the outlier has a large residual; in the other, it does not]

We should consider both the discrepancy from the regression line and the leverage exerted on the regression line.

With many predictors, the leverage of an observation is a single expression of distance from many predictor means.

With simple linear regression, it looks somewhat familiar:

It is called h because leverage is an element of the "hat matrix": it puts the hat on the Y.

$h_i = \dfrac{1}{n} + \dfrac{(X_i - \bar{X})^2}{\sum_j (X_j - \bar{X})^2}$

http://www.stat.sc.edu/~west/javahtml/Regression.html

Page 16: Unit 4: Regression Assumptions and Diagnostics

Visualizing Leverage: Look familiar?

[Figure: leverage (about .02 to .12) vs. average estimated dinner cost w/drink and tip ($); leverage is smallest near the mean cost and grows with distance from it]

Unit 4 / Page 16 © Andrew Ho, Harvard Graduate School of Education

Leverage values should be interpreted relatively. They are never more than 1, and the sum of all leverages is the number of predictors + 1 (here, 2).

The greater the leverage, the more a single observation can influence the regression line.

$h_i = \dfrac{1}{n} + \dfrac{(X_i - \bar{X})^2}{\sum_j (X_j - \bar{X})^2}$
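A minimal sketch for computing and inspecting leverage after the rating-on-cost regression (the variable names lev, rating, cost, and restaurant are my assumptions):

regress rating cost
predict lev, leverage             // hat-matrix diagonal for each observation
quietly summarize lev
display r(sum)                    // leverages sum to the number of predictors + 1 (here, 2)
gsort -lev
list restaurant cost lev in 1/5   // the restaurants most extreme on the cost scale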

Page 17: Unit 4: Regression Assumptions and Diagnostics

Incorporating leverage… a little.

[Figure: residuals vs. fitted values, again with 29 Newbury, Bristol, and Stanhope Grille labeled]

Unit 4 / Page 17© Andrew Ho, Harvard Graduate School of Education

To better interpret these residuals, we can “standardize” them: express them in terms of a kind of standard deviation unit that incorporates leverage.

Page 18: Unit 4: Regression Assumptions and Diagnostics

Standardized/Studentized residuals: predict stdres, rstandard

[Figure: standardized residuals vs. fitted values, with 29 Newbury, Bristol, and Stanhope Grille labeled]

Expresses residuals in terms of RMSE (conditional standard deviations) while inflating residuals with more leverage.

If residuals are homoscedastic and normally distributed about the regression line, we have a sense of how unexpected a standardized residual might be.

Unit 4 / Page 18© Andrew Ho, Harvard Graduate School of Education

Called “studentized” in honor of Gosset of the t-test (pseudonym: student). Stata calls these “standardized residuals.” Standardized residuals usually refer to dividing by RMSE alone. Very similar for large sample sizes.

$e_{std(i)} = \dfrac{Y_i - \hat{Y}_i}{RMSE\sqrt{1 - h_i}}$
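A sketch of how these can be computed and screened in Stata, again assuming the rating-on-cost regression has just been run and restaurant holds the labels:

predict stdres, rstandard   // Stata's "standardized" (studentized) residuals
list restaurant stdres if abs(stdres) > 2 & !missing(stdres)   // unusually large residuals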

Page 19: Unit 4: Regression Assumptions and Diagnostics

Cook's Distance: A metric incorporating both discrepancy and leverage

• A go-to metric to address the question: did this single observation have a notable impact on my results? A measure of influence.

Recall the standardized (studentized) residual:

$e_{std(i)} = \dfrac{Y_i - \hat{Y}_i}{RMSE\sqrt{1 - h_i}}$

Cook's Distance combines this discrepancy with leverage:

$d_i = \left(\dfrac{h_i}{1 - h_i}\right)\left(\dfrac{e_{std(i)}^2}{k + 1}\right)$

[Figure: leverage vs. normalized residual squared for all 76 restaurants, with restaurant names as marker labels. Leverage is on the vertical axis; the normalized residual squared ("just the squared studentized residual over n") is on the horizontal axis. Reference lines mark the average of each axis—these are averages, not benchmarks. An accompanying scatterplot of Zagat rating on cost illustrates leverage as horizontal extremity and the squared standardized residual as vertical discrepancy.]

Unit 4 / Page 19 © Andrew Ho, Harvard Graduate School of Education
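The labeled plot above is Stata's leverage-versus-residual-squared plot, which follows the most recent regress command. A sketch, assuming mlabel() is used to attach the restaurant names:

regress rating cost
lvr2plot, mlabel(restaurant)   // leverage vs. normalized residual squared,
                               // with reference lines at the means of each axis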

Page 20: Unit 4: Regression Assumptions and Diagnostics

Cook's Distance: predict cooksd, cooksd

[Figure: Cook's D vs. fitted values for the 76 restaurants, with restaurant names as marker labels; Cook's D runs from 0 to about .1. An accompanying scatterplot of Zagat rating (0–30) on average estimated dinner cost shows the same restaurants.]

Unit 4 / Page 20

• Benchmarks: distances greater than 1 or 4/n are worth noting, although we acknowledge the impossibility of simple cutoffs.

Stata Demo:
. findit regpt
. regpt

Central Kitchen has notable influence on the regression line, with moderately high amounts of leverage and discrepancy.
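A sketch of the corresponding commands, using the 4/n benchmark from the bullet above (the variable names carry over from earlier sketches and remain assumptions):

regress rating cost
predict yhat                              // fitted values
predict cooksd, cooksd                    // Cook's distance for each observation
scatter cooksd yhat, mlabel(restaurant)   // mirrors the labeled plot above
list restaurant cooksd if cooksd > 4/_N & !missing(cooksd)   // observations worth noting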

© Andrew Ho, Harvard Graduate School of Education

Page 21: Unit 4: Regression Assumptions and Diagnostics

Key concepts for outlying observations

• A residual is the vertical distance from an observed point to the estimated regression line. OLS minimizes the sum of these squared residuals.
• Absolute residuals are interpreted on the scale of Y. They are interpretable as Y above and beyond, "controlling" for, accounting for, or adjusting for X.
• Visualization of residuals can be improved by plotting residuals on fits (rvfplot).
• Leverage is a single measure of distance of an observation from the means of one or more predictor variables and indicates potential to influence slopes.
• Standardized/studentized residuals are expressed in terms of RMSE, or conditional standard deviations. The latter are adjusted slightly by leverage. These afford probabilistic interpretations if the normality assumption holds.
• Cook's distance incorporates both residuals (vertical discrepancy) and leverage in a single metric describing the influence of observations on the slope.
• The discrepancy and leverage of observations can be visualized by plotting leverage on squared studentized residuals (lvr2plot).
• An outlier is almost never removed absent an outright data entry error. Instead, the influence is described, and alternative models may be employed.
• Exclusion of an outlier usually requires an argument that it is not part of the population of interest.

Unit 4 / Page 21 © Andrew Ho, Harvard Graduate School of Education

Page 22: Unit 4: Regression Assumptions and Diagnostics

The Butterfly Ballot: The Palm Beach County Outlier

Although Democrats are listed second in the left-hand column, you vote Democratic by punching the third hole. If you punch the second hole, you are voting for the Reform Party (i.e., Pat Buchanan).

RQ: In the 2000 Presidential election, did Buchanan get more votes than we "would have expected" in Palm Beach County?

Of the nearly 6 million votes cast in Florida, the official tally has Bush beating Gore by 537 votes.

Unit 4 / Page 22 © Andrew Ho, Harvard Graduate School of Education

Page 23: Unit 4: Regression Assumptions and Diagnostics

After univariate descriptives, our scatterplot.

[Figure: two scatterplots of Buchanan votes in 2000 vs. Reform Party registered voters in 2000, by Florida county; BROWARD, HILLSBOROUGH, PALM BEACH, and PINELLAS are labeled]

© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 23

10 Nov 2000: "The Bush campaign claims that the number of votes for Buchanan in Palm Beach County is perfectly accurate. 'New information has come to our attention that puts in perspective the results of the vote in Palm Beach County,' Bush spokesman Ari Fleischer said on Thursday. 'Palm Beach County is a Pat Buchanan stronghold and that's why Pat Buchanan received 3,407 votes there.'" (Salon.com)

Page 24: Unit 4: Regression Assumptions and Diagnostics

Regressing Buchanan's votes on registered Reform Party voters

[Figure: residuals vs. fitted values with BROWARD, HILLSBOROUGH, PINELLAS, and PALM BEACH labeled; most residuals fall within about ±500 while PALM BEACH sits near 2000]

. regress bucvote reform

      Source |       SS           df       MS      Number of obs   =        67
-------------+----------------------------------   F(1, 65)        =     81.35
       Model |  7412113.59         1  7412113.59   Prob > F        =    0.0000
    Residual |  5922476.32        65  91115.0202   R-squared       =    0.5559
-------------+----------------------------------   Adj R-squared   =    0.5490
       Total |  13334589.9        66  202039.241   Root MSE        =    301.85

------------------------------------------------------------------------------
     bucvote |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      reform |   3.686713   .4087551     9.02   0.000     2.870372    4.503053
       _cons |   1.532519   46.60847     0.03   0.974    -91.55102    94.61606
------------------------------------------------------------------------------

Above and beyond, "controlling" for, adjusting for, accounting for… Some say, "after statistically controlling for…" These phrasings are acceptable. Safest: "…than the model predicts given the number of registered Reform Party voters."

Palm Beach has 2163 more votes for Buchanan than predicted given the number of registered Reform Party voters.

Unit 4 / Page 24 © Andrew Ho, Harvard Graduate School of Education
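A sketch of the residual calculation behind the 2163 figure (the county variable name appears on a later slide; buc_resid is my label):

regress bucvote reform
predict buc_resid, residuals
list county bucvote buc_resid if county == "PALM BEACH"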

Page 25: Unit 4: Regression Assumptions and Diagnostics

Standardized/Studentized Residuals

[Figure: standardized residuals vs. fitted values, with PALM BEACH (near 8) and PINELLAS labeled]

© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 25

Stata calls these "standardized residuals"; I and others call them "studentized residuals."

Palm Beach County is almost 8 standard deviations above the regression line.

Here, standard deviations are RMSEs slightly adjusted to accentuate high-leverage residuals.

Page 26: Unit 4: Regression Assumptions and Diagnostics

Leverage vs. Residual Squared (Discrepancy)

[Figure: lvr2plot of leverage vs. normalized residual squared for all 67 Florida counties, with county names as marker labels; leverage runs to about .25 and normalized residual squared to about .8]

© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 26

Page 27: Unit 4: Regression Assumptions and Diagnostics

Cook's Distance

[Figure: Cook's D vs. fitted values for the Florida counties, with BROWARD, PALM BEACH, and PINELLAS labeled; PALM BEACH's Cook's D is far larger than any other county's]

© Andrew Ho, Harvard Graduate School of Education Unit 4 / Page 27

Palm Beach County has an extreme influence on the regression line.

Page 28: Unit 4: Regression Assumptions and Diagnostics

Sensitivity Plots

[Figure: sensitivity plot—Buchanan votes in 2000 vs. Reform Party registered voters in 2000]

Unit 4 / Page 28 © Andrew Ho, Harvard Graduate School of Education

The sensitivity analysis is one of the most useful tools that you can take from this class. As we begin to appreciate that there are many plausibly correct models, we investigate whether our inferences are robust across these plausible decisions. Also called a robustness check.

"Here are my results and my conclusions." "But have you considered A, B, and C?" "We considered A, B, and C, and also D, E, and F… and it makes no difference… or it does, and here is why."

Page 29: Unit 4: Regression Assumptions and Diagnostics

What to do with outlying observations (1)

• Visualize them.
  – Scatterplots, residuals vs. fit plots (rvfplot), standardized/studentized residuals, leverage vs. residuals-squared plots (lvr2plot), Cook's distance.
• Describe their degree of discrepancy and influence in terms of the above and in context.
• Try to explain each outlying observation. What is the story of Central Kitchen? What is the story of Palm Beach County? What are the plausible rival stories that might account for the outlier?
• Severely influential or outlying observations must be documented for your audience.
• Conduct a sensitivity analysis: Results with and without the outlier.
• Address whether the outlier affects your primary inferences.

Comparison of regression models predicting Buchanan's total vote (Florida Presidential Election data, 2000)

                      All Florida counties (n=67)   Without Palm Beach (n=66)
Estimated Slope       3.69*                         2.45*
Estimated S.E.        (0.41)                        (0.12)
t statistic           9.02                          20.18
R²                    55.6%                         86.4%

* p<.001

Unit 4 / Page 29 © Andrew Ho, Harvard Graduate School of Education
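A sketch of how a comparison table like this can be produced in Stata, with the county filter mirroring the predict example on the next page:

regress bucvote reform
estimates store all_counties
regress bucvote reform if county != "PALM BEACH"
estimates store no_palm_beach
estimates table all_counties no_palm_beach, b(%9.2f) se stats(r2 N)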

Page 30: Unit 4: Regression Assumptions and Diagnostics

What to do with outlying observations (2)

• Conduct a full analysis without the outlying observation and retest assumptions.

• Sometimes eliminating an outlier reveals other outliers…

• Stata note: After running regress again without the outlier, use the predict command without it, too: predict cooksd_1 if county != "PALM BEACH", cooksd

• In general, do not remove outliers unless demonstrating sensitivity. Deciding to proceed with a model without an outlier requires very strong substantive justification:
  – Because of 1) problems with misleading ballots in this county and no other, 2) the absence of other explanations for the outlying observation, and 3) our interest in modeling the relationship between bucvote and reform in a population without misleading ballots, we proceed with a regression model excluding Palm Beach County.

[Figures: for the re-estimated model without Palm Beach—residuals vs. fitted values, and Cook's D vs. fitted values with BROWARD, COLLIER, DUVAL, MARION, PINELLAS, and POLK labeled]

Unit 4 / Page 30 © Andrew Ho, Harvard Graduate School of Education

Page 31: Unit 4: Regression Assumptions and Diagnostics

What to do with outlying observations (3)

• If outliers are part of your population of interest, they cannot be removed.
• If they are part of your population, then in spite of their influence, your estimated slope is still unbiased… your best guess over many samples.
• However, your standard error may be so large (sometimes you get the outliers, sometimes you don't) that you may prefer a more stable model, even if it's not unbiased.
• This leads to alternative approaches. One to remember: median regression, also called least absolute value (LAV) regression, where we model the conditional median and minimize the absolute deviations (instead of squared deviations).
  – Recall that the median is more robust to outliers than the mean. In Stata, this is a subset of quantile regression under the qreg command (see the sketch below).
• As a default, don't get fancy. Model in a familiar framework to best communicate your results.
• Sometimes, a simple transformation can draw in an outlier and let us preserve our familiar framework (next unit).

Unit 4 / Page 31 © Andrew Ho, Harvard Graduate School of Education
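A minimal median-regression sketch for the Florida example (qreg models the median, the .5 quantile, by default; variable names as before):

qreg bucvote reform      // least-absolute-value (median) regression
regress bucvote reform   // OLS fit for comparison; the outlier pulls OLS more than the median fit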

Page 32: Unit 4: Regression Assumptions and Diagnostics

Heteroscedasticity: Blood pressure example

[Figure: diastolic blood pressure vs. age; the spread of blood pressure increases with age]

Unit 4 / Page 32 © Andrew Ho, Harvard Graduate School of Education

Page 33: Unit 4: Regression Assumptions and Diagnostics

Heteroscedasticity: Visual inspection

[Figure: residuals vs. fitted values for the blood-pressure-on-age regression; the residuals fan out as the fitted values increase]

Unit 4 / Page 33 © Andrew Ho, Harvard Graduate School of Education

Clear increase in conditional variance in blood pressure as age increases, but a linear relationship seems appropriate. An unbiased slope and fine prediction, but also high standard errors and inaccurate conditional standard errors.

When the conditional variance can be modeled as a function of the predictor, we can use a weighted least squares (WLS) approach that diminishes the “pull” of noisy observations.

And sometimes, when the relationship is nonlinear, transformations can reduce heteroscedasticity.
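A sketch of a numerical companion to the visual check (estat hettest is Stata's Breusch–Pagan/Cook–Weisberg test after regress; the variable names dbp and age are my assumptions for this example):

regress dbp age
rvfplot         // the fan shape shown above
estat hettest   // Breusch-Pagan / Cook-Weisberg test for heteroskedasticity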

Page 34: Unit 4: Regression Assumptions and Diagnostics

Normality of residuals: hist res, kden percent

• Residuals will always have a mean of zero but should also have a symmetrical normal distribution, both conditionally and unconditionally.

• Violations can lead to poor estimation properties, particularly in small samples, although without nonlinearity, outliers or heteroscedasticity, the estimates can be quite robust.

• When accompanied by nonlinearity and heteroscedasticity, the remedial approach is typically transformation.
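A sketch of the commands behind histograms like those below, plus two common numerical companions (res is the residual variable named in the slide title):

predict res, residuals            // after the regression of interest
histogram res, kdensity percent   // the slide title's "hist res, kden percent", unabbreviated
qnorm res                         // normal quantile plot of the residuals
sktest res                        // skewness/kurtosis test of normality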

Unit 4 / Page 34

[Figures: histograms of residuals with kernel density overlays—the Blood Pressure Example (residuals roughly −20 to 20) and the Florida Example with Palm Beach removed (residuals roughly −200 to 300)]

http://www.people.vcu.edu/~rjohnson/regression/

Page 35: Unit 4: Regression Assumptions and Diagnostics

What are the takeaways from this unit?

• Inference from regression models requires adherence to important assumptions
  – Independent and identically normally distributed residuals with mean 0
• Residuals are the tools we use to test our assumptions
  – Linearity (Do residuals have a conditional mean of 0 in the observed range?)
  – Outliers (Are residuals identically distributed? Normally distributed?)
  – Homoscedasticity (Are residuals identically distributed?)
  – Normality (Are residuals normally distributed?)
• Residuals can be interpreted as values above and beyond, accounting for, "controlling" for, or adjusting for X
  – Residuals are error that remains to be explained.
• Outlying observations must be documented and described. Particularly influential outliers may require sensitivity studies, and all actions taken must be clearly and honestly communicated.
• For each regression assumption, we can now:
  – Communicate degrees of violation of assumptions.
  – Articulate how and where we might expect a violation to threaten our inferences.
  – Take remedial action to improve model fit or articulate threats to inferences.

Unit 4 / Page 35 © Andrew Ho, Harvard Graduate School of Education

Page 36: Unit 4: Regression Assumptions and Diagnostics

Terms to Note

• Assumptions
• Cook's distance
• Discrepancy
• Homoscedasticity
• Independence of observations
• Influence
• Leverage
• Leverage vs. residuals-squared plot
• Linearity
• Normality
• Outlier
• Residuals vs. fitted values plot
• Sensitivity studies and plots
• Standardized residuals
• Studentized residuals

Unit 4 / Page 36 © Andrew Ho, Harvard Graduate School of Education