Multiple Regression Assumptions & Diagnostics


Regression: Outliers

• Note: Even if regression assumptions are met, slope estimates can have problems

• Example: Outliers -- cases with extreme values that differ greatly from the rest of your sample

• More formally: “influential cases”

• Outliers can result from:

• Errors in coding or data entry

• Highly unusual cases

• Or, sometimes they reflect important “real” variation

• Even a few outliers can dramatically change estimates of the slope, especially if N is small.


• Outlier Example:

[Figure: scatterplot with both axes running from -4 to 4. An extreme case pulls the regression line up; a second line shows the regression with the extreme case removed from the sample.]


• Strategy for identifying outliers:

• 1. Look at scatterplots or regression partial plots for extreme values

• Easiest. A minimum for final projects

• 2. Ask SPSS to compute outlier diagnostic statistics

– Examples: “Leverage”, Cook’s D, DFBETA, residuals, standardized residuals.


• SPSS Outlier strategy: Go to Regression – Save

– Choose “influence” and “distance” statistics such as Cook’s Distance, DFFIT, standardized residual

– Result: SPSS will create new variables with values of Cook’s D, DFFIT for each case

– High values signal potential outliers

– Note: This is less useful if you have a VERY large dataset, because you have to look at each case value.

Scatterplots

• Example: Study time and student achievement.

– X variable: Average # hours spent studying per day

– Y variable: Score on reading test

Case  X     Y
1     2.6   28
2     1.4   13
3     .65   17
4     4.1   31
5     .25   8
6     1.9   16
7     3.5   6

[Figure: scatterplot of the seven cases, with X (hours studied, 0 to 4) on the X axis and Y (test score, 0 to 30) on the Y axis.]

Outliers

• Results with outlier:

Model Summary (Predictors: (Constant), HRSTUDY; Dependent Variable: TESTSCOR)

Model  R     R Square  Adjusted R Square  Std. Error of the Estimate
1      .466  .217      .060               9.1618

Coefficients (Dependent Variable: TESTSCOR)

Model 1      B       Std. Error  Beta  t      Sig.
(Constant)   10.662  6.402             1.665  .157
HRSTUDY      3.081   2.617       .466  1.177  .292

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

Outlier Diagnostics

• Residuals: The numerical value of the error

– Error = distance that a point falls from the line

– Cases with unusually large error may be outliers

– Note: residuals have many other uses!

• Standardized residuals

– Z-score of residuals… converts to a neutral unit

– Often, standardized residuals larger than 3 (in absolute value) are considered worthy of scrutiny

• But, it isn’t the best outlier diagnostic.

Outlier Diagnostics

• Cook’s D: Identifies cases that are strongly influencing the regression line

– SPSS calculates a value for each case

• Go to “Save” menu, click on Cook’s D

• How large of a Cook’s D is a problem?

– Rule of thumb: Values greater than: 4 / (n – k – 1)

– Example: N=7, K = 1: Cut-off = 4/5 = .80

– Cases with higher values should be examined.

Outlier Diagnostics

• Example: Outlier/Influential Case Statistics

Hours  Score  Resid   Std Resid  Cook’s D
2.60   28     9.32    1.01       .124
1.40   13     -1.97   -.215      .006
.65    17     4.33    .473       .070
4.10   31     7.70    .841       .640
.25    8      -3.43   -.374      .082
1.90   16     -.515   -.056      .0003
3.50   6      -15.4   -1.68      .941
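The statistics in this table can be reproduced by hand. The following is a minimal sketch in plain Python, not the SPSS procedure itself: the function names are made up for illustration, the standardized residual is taken as residual divided by the standard error of the estimate, and Cook's D uses the usual leverage-based formula.

```python
# Data from the study-time example (7 cases).
hours = [2.6, 1.4, 0.65, 4.1, 0.25, 1.9, 3.5]
score = [28, 13, 17, 31, 8, 16, 6]

def simple_ols(x, y):
    """Return (intercept, slope) for a bivariate least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def outlier_diagnostics(x, y):
    """Return (residual, standardized residual, Cook's D) for each case."""
    n, k = len(x), 1                       # k = number of predictors
    a, b = simple_ols(x, y)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - k - 1)
    rows = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - mx) ** 2 / sxx   # leverage of this case
        cooks_d = (e ** 2 / ((k + 1) * mse)) * h / (1 - h) ** 2
        rows.append((e, e / mse ** 0.5, cooks_d))
    return rows

n, k = len(hours), 1
cutoff = 4 / (n - k - 1)                   # rule of thumb from the slides
diagnostics = outlier_diagnostics(hours, score)
flagged = [case for case, (e, z, d) in enumerate(diagnostics, start=1) if d > cutoff]

for case, (e, z, d) in enumerate(diagnostics, start=1):
    note = "  <-- examine" if case in flagged else ""
    print(f"case {case}: resid={e:7.2f}  std resid={z:6.2f}  Cook's D={d:.3f}{note}")
```

Only the case with 3.5 hours and a score of 6 exceeds the 4 / (n – k – 1) cut-off, which agrees with the table.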

Outliers

• Results with outlier removed:

Model Summary (Predictors: (Constant), HRSTUDY; Dependent Variable: TESTSCOR)

Model  R     R Square  Adjusted R Square  Std. Error of the Estimate
1      .903  .816      .770               4.2587

Coefficients (Dependent Variable: TESTSCOR)

Model 1      B      Std. Error  Beta  t      Sig.
(Constant)   8.428  3.019             2.791  .049
HRSTUDY      5.728  1.359       .903  4.215  .014

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
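To see how much one case moves the slope, the two fits can be compared directly. This is a small sketch using the seven cases from the study-time example; `simple_ols` is an illustrative helper, not an SPSS function.

```python
# Study-time example; case 7 (3.5 hours, score 6) is the suspected outlier.
hours = [2.6, 1.4, 0.65, 4.1, 0.25, 1.9, 3.5]
score = [28, 13, 17, 31, 8, 16, 6]

def simple_ols(x, y):
    """Return (intercept, slope) for a bivariate least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

with_all = simple_ols(hours, score)              # all 7 cases
without_7 = simple_ols(hours[:6], score[:6])     # case 7 dropped

print(f"with outlier:    constant={with_all[0]:.3f}  slope={with_all[1]:.3f}")
print(f"outlier removed: constant={without_7[0]:.3f}  slope={without_7[1]:.3f}")
```

Dropping a single case out of seven nearly doubles the estimated slope, which is why small samples call for extra care.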


• Question: What should you do if you find outliers? Drop outlier cases from the analysis? Or leave them in?

– Obviously, you should drop cases that are incorrectly coded or erroneous

– But, generally speaking, you should be cautious about throwing out cases

• If you throw out enough cases, you can produce any result that you want! So, be judicious when destroying data.


• Circumstances where it can be good to drop outlier cases:

• 1. Coding errors

• 2. Single extreme outliers that radically change results

– Your results should reflect the dataset, not one case!

• 3. If there is a theoretical reason to drop cases

– Example: In analysis of economic activity, communist countries may be outliers

• If the study is about “capitalism”, they should be dropped.


• Circumstances when it is good to keep outliers:

• 1. If they form a meaningful cluster

– Often suggests an important subgroup in your data

• Example: Asian-Americans in a dataset on education

• In such a case, consider adding a dummy variable for them

– Unless, of course, the research design is not interested in that sub-group… then drop them!

• 2. If there are many

– Maybe they reflect a “real” pattern in your data.


• When in doubt: Present results both with and without outliers

– Or present one set of results, but mention how results differ depending on how outliers were handled

• For final projects: Check for outliers!

• At least with scatterplots

– In the text: Mention if there were outliers, how you handled them, and the effect it had on results.

Multicollinearity

• Another common regression problem: Multicollinearity

• Definition: collinear = highly correlated

– Multicollinearity = inclusion of highly correlated independent variables in a single regression model

• Recall: High correlation of X variables causes problems for estimation of slopes (b’s)

– Recall: denominators in the slope formulas approach zero, so coefficients may be wrong or too large.

Multicollinearity

• Multicollinearity symptoms:

• Unusually large standard errors and betas

– Compared to models in which the collinear variables aren’t both included

• Betas often exceed 1.0

• Two variables have the same large effect when included separately… but…

– When put together the effects of both variables shrink

– Or, one remains positive and the other flips sign

• Note: Not all “sign flips” are due to multicollinearity!

Multicollinearity

• What does multicollinearity do to models?

– Note: It does not violate regression assumptions

• But, it can mess things up anyway

• 1. Multicollinearity can inflate standard error estimates

– Large standard errors = small t-values = no rejected null hypotheses

– Note: Only collinear variables are affected. The rest of the model results are OK.

Multicollinearity

• What does multicollinearity do?

• 2. It leads to instability of coefficient estimates

– Variable coefficients may fluctuate wildly when a collinear variable is added

– These fluctuations may not be “real”, but may just reflect amplification of “noise” and “error”

• One variable may only be slightly better at predicting Y… but SPSS will give it a MUCH higher coefficient

– Note: These only affect variables that are highly correlated. The rest of the model is OK.

Multicollinearity

• Diagnosing multicollinearity:

• 1. Look at correlations of all independent vars

– Correlation of .7 is a concern, above .8 is often a problem

– But, problems aren’t always bivariate… and don’t always show up in bivariate correlations

• Ex: If you forget to omit a dummy variable

• 2. Watch out for the “symptoms”

• 3. Compute diagnostic statistics

• Tolerances, VIF (Variance Inflation Factor).

Multicollinearity

• Multicollinearity diagnostic statistics:

• “Tolerance”: Easily computed in SPSS

– Low values indicate possible multicollinearity

• Start to pay attention at .4; Below .2 is very likely to be a problem

– Tolerance is computed for each independent variable by regressing it on other independent variables.

Multicollinearity

• If you have 3 independent variables: X1, X2, X3…

– Tolerance is based on doing a regression: X1 is dependent; X2 and X3 are independent.

• Tolerance for X1 is simply 1 minus regression R-square.

• If a variable (X1) is highly correlated with all the others (X2, X3) then they will do a good job of predicting it in a regression

• Result: Regression r-square will be high… 1 minus r-square will be low… indicating a problem.

Multicollinearity

• Variance Inflation Factor (VIF) is the reciprocal of tolerance: 1/tolerance

• High VIF indicates multicollinearity

– Gives an indication of how much the variance of a coefficient estimate is inflated by the presence of the other variables; the Standard Error grows by the square root of the VIF.
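The tolerance and VIF definitions can be sketched in a few lines of plain Python. Everything here is illustrative rather than the SPSS implementation: the data are hypothetical (x1 is built to track x2 closely), and `solve`, `r_squared`, and `tolerance_and_vif` are made-up helper names.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def r_squared(y, predictors):
    """R-square from regressing y on the predictors plus a constant."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

def tolerance_and_vif(variables):
    """Map each variable name to (tolerance, VIF)."""
    out = {}
    for name, values in variables.items():
        others = [v for other, v in variables.items() if other != name]
        tol = 1 - r_squared(values, others)   # 1 minus regression R-square
        out[name] = (tol, 1 / tol)            # VIF is the reciprocal
    return out

# Hypothetical data: x1 is almost a copy of x2; x3 is unrelated to both.
variables = {
    "x1": [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.9, 8.0],
    "x2": [1, 2, 3, 4, 5, 6, 7, 8],
    "x3": [5, 3, 8, 1, 7, 2, 6, 4],
}
results = tolerance_and_vif(variables)
for name, (tol, vif) in results.items():
    print(f"{name}: tolerance={tol:.4f}  VIF={vif:.1f}")
```

With these made-up values, x1 and x2 fall far below the .2 tolerance threshold while x3 does not, so the diagnostic flags only the collinear pair.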

Multicollinearity

• Solutions to multicollinearity

– It can be difficult if a fully specified model requires several collinear variables

• 1. Drop unnecessary variables

• 2. If two collinear variables are really measuring the same thing, drop one or make an index

– Example: Attitudes toward recycling; attitude toward pollution. Perhaps they reflect “environmental views”

• 3. Advanced techniques: e.g., Ridge regression

• Uses a more efficient estimator (but not BLUE – may introduce bias).

What is Model Specification?

• Model Specification is two sets of choices:

– The set of variables that we include in a model

– The functional form of the relationships we specify

• These are central theoretical choices

• Can’t get the right answer if we ask the wrong question

What is the “Right” Model?

• In truth, all our models are misspecified to some extent.

• Our theories are always a simplification of reality, and all our measures are imperfect

• Our task is to seek models that are reasonably well specified, keeping our errors relatively modest

Omitting Variables and Model Specification

• These criteria give us our conceptual standards for determining when a variable must be included

• A variable must be included in a regression equation IF:

– The variable is correlated with other X’s

AND

– The variable is also a cause of Y

The Meaning of b in Multiple Regression

• Each element of the vector b is a slope coefficient for one of the X’s

• Same as in bivariate context except that b1 is the expected change in Y for a 1 unit increase in X1, while holding X2…Xn constant

• Thus b1 represents the direct effect of X1 on Y, controlling for X2…Xn

Illustrating Omitted Variable Bias

• Imagine a true model where X1 has a small effect on Y and is correlated with X2 that has a large effect on Y.

• Specifying both variables can distinguish these effects

[Figure: diagram relating X1, X2, and Y.]

Illustrating Omitted Variable Bias

• But when we run a simple model excluding X2, we attribute all causal influence to X1

• Coefficient is too big and variance of coefficient is too small

[Figure: diagram relating X1 and Y, with X2 omitted.]
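The omitted variable problem can be illustrated numerically. The sketch below uses hypothetical data in which X1 has a small true effect (1) and X2 a large one (5), with X1 and X2 correlated; all names and values are assumptions made for illustration. It also checks the in-sample identity that the short-model slope equals the full-model slope plus b2 times the slope from regressing X2 on X1.

```python
def s(a, b):
    """Sum of cross-products of deviations from the means."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))

# Hypothetical data: x2 rises with x1; u is a small disturbance.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.5, 1.0, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0]
u = [0.2, -0.1, 0.1, -0.2, 0.1, 0.0, -0.2, 0.1]
y = [1 * a + 5 * b + e for a, b, e in zip(x1, x2, u)]

s11, s22, s12 = s(x1, x1), s(x2, x2), s(x1, x2)
s1y, s2y = s(x1, y), s(x2, y)

# Full model: y on x1 and x2 (two-predictor least-squares formulas).
det = s11 * s22 - s12 ** 2
b1_full = (s22 * s1y - s12 * s2y) / det
b2_full = (s11 * s2y - s12 * s1y) / det

# Short model: y on x1 only, with x2 omitted.
b1_short = s1y / s11

# Slope from regressing x2 on x1 (how strongly the omitted variable tracks x1).
delta = s12 / s11

print(f"full model:  b1={b1_full:.3f}  b2={b2_full:.3f}")
print(f"short model: b1={b1_short:.3f}")
print(f"b1_full + b2_full * delta = {b1_full + b2_full * delta:.3f}")
```

The short model credits X1 with much of X2's effect, so its coefficient is several times too large even though nothing looks wrong in the output itself.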

Omitted Variable Bias: Causes

• This problem is explicitly theoretical rather than “statistical.”

• No statistical test can reveal a specification error or omitted variable bias

• Scholars can form hypotheses about other X’s that may be a source of omitted variable bias

Including Irrelevant Variables

• In this case:

[Equation: expected value of b1, E(b1).]

• If b2=0, our estimate of b1 is not affected

• If X1’X2=0, our estimate of b1 is not affected

• But including X2 if it is not relevant does unnecessarily inflate $\sigma_b^2$

Including Irrelevant Variables: Consequences

• $\sigma_b^2$ increases for two reasons:

• Addition of parameter for X2 reduces the degrees of freedom

– Part of estimator for $\sigma_u^2$

• If b2=0 but X1’X2 is not, then including X2 unnecessarily reduces independent variation of X1

• Thus parsimony remains a virtue

Rules for Model Specification

• Model specification is fundamentally a theoretical exercise. We build models to reflect our theories

• Theorizing process cannot be replaced with statistical tests

• Avoid mechanistic rules for specification such as stepwise regression

The Evils of Stepwise Regression

• Stepwise regression is a method of model specification that chooses variables on:

– Significance of their t-scores

– Their contribution to $R^2$

• Variables will be selected in or out depending on the order they are introduced into the model

Multiple Regression Analysis: Further Issues

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$

Functional Form

• OLS can be used for relationships that are not strictly linear in x and y by using nonlinear functions of x and y – will still be linear in the parameters

• Can take the natural log of x, y or both

• Can use quadratic forms of x

• Can use interactions of x variables

Interpretation of Log Models

• If the model is $\ln(y) = \beta_0 + \beta_1 \ln(x) + u$

– $\beta_1$ is the elasticity of y with respect to x

• If the model is $\ln(y) = \beta_0 + \beta_1 x + u$

– $100 \cdot \beta_1$ is approximately the percentage change in y given a 1 unit change in x

• If the model is $y = \beta_0 + \beta_1 \ln(x) + u$

– $\beta_1 / 100$ is approximately the change in y for a 1 percent change in x
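The word "approximately" in these interpretations can be checked with a little arithmetic. A small sketch follows; the coefficient values are hypothetical, chosen only to compare the rule-of-thumb reading with the exact change implied by the model.

```python
import math

# Log-level model: ln(y) = b0 + b1 * x. Hypothetical coefficient.
b1_log_level = 0.05
approx_pct = 100 * b1_log_level                    # rule-of-thumb % change in y per unit of x
exact_pct = 100 * (math.exp(b1_log_level) - 1)     # exact % change implied by the model

# Level-log model: y = b0 + b1 * ln(x). Hypothetical coefficient.
b1_level_log = 20.0
approx_change = b1_level_log / 100                 # rule-of-thumb change in y per 1% rise in x
exact_change = b1_level_log * math.log(1.01)       # exact change for a 1% rise in x

print(f"log-level: approx {approx_pct:.3f}%  exact {exact_pct:.3f}%")
print(f"level-log: approx {approx_change:.4f}  exact {exact_change:.4f}")
```

For small coefficients the two readings are close; the gap widens as the coefficient grows, so large coefficients are better converted exactly.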

Why use log models?

• Log models are invariant to the scale of the variables since they measure percent changes

• They give a direct estimate of elasticity

• For models with y > 0, the conditional distribution is often heteroskedastic or skewed, while ln(y) is much less so

• The distribution of ln(y) is more narrow, limiting the effect of outliers

Adjusted R-Squared

• Recall that the $R^2$ never decreases as more variables are added to the model

• The adjusted $R^2$ takes into account the number of variables in a model, and may decrease

$$\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)} = 1 - \frac{\hat{\sigma}^2}{SST/(n-1)}$$

Adjusted R-Squared (cont)

• It’s easy to see that the adjusted $R^2$ is just $1 - (1 - R^2)(n - 1)/(n - k - 1)$, but most packages will give you both $R^2$ and adj-$R^2$

• You can compare the fit of 2 models (with the same y) by comparing the adj-$R^2$

• You cannot use the adj-$R^2$ to compare models with different y’s (e.g. y vs. ln(y))
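As a quick check, the adjusted R-squared formula can be applied to the R Square values reported in the study-time example earlier in these slides. The function name is illustrative; the inputs are the R Square values and sample sizes from the two SPSS tables.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# R Square from the SPSS output, with n cases and k = 1 predictor.
with_outlier = adjusted_r2(0.217, n=7, k=1)       # all seven cases
outlier_removed = adjusted_r2(0.816, n=6, k=1)    # one case dropped

print(f"with outlier:    adjusted R^2 = {with_outlier:.3f}")
print(f"outlier removed: adjusted R^2 = {outlier_removed:.3f}")
```

Both results agree with the Adjusted R Square column in the SPSS output (.060 and .770).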

The Use and Interpretation of the Constant Term

General Rule

• Do Not Suppress the Constant Term (even if theory specifically calls for it)

• Do Not Rely on estimates of the Constant Term


Multiple Regression Assumptions

• 1. a. Linearity: The relationship between dependent and independent variables is linear

• Just like bivariate regression

• Points don’t all have to fall exactly on the line; but error (disturbance) must be random

– Check scatterplots of X’s and error (residual)

• Watch out for non-linear trends: error is systematically negative (or positive) for certain ranges of X

• There are strategies to cope with non-linearity, such as including X and X-squared to model curved relationship.


• 1. b. And, the model is properly specified:

– No extra variables are included in the model, and no important variables are omitted. This is HARD!

• Correct model specification is critical

• If an important variable is left out of the model, results are biased (“omitted variable bias”)

– Example: If we model job prestige as a function of family wealth, but do not include education

• Coefficient estimate for wealth would be biased

– Use theory and previous research to decide what critical variables must be included in your model.

• For final paper, it is OK if model isn’t perfect.


• 2. All variables are measured without error

• Unfortunately, error is common in measures

– Survey questions can be biased

– People give erroneous responses (or lie)

– Aggregate statistics (e.g., GDP) can be inaccurate

• This assumption is often violated to some extent

– We do the best we can:

– Design surveys well, use best available data

– And, there are advanced methods for dealing with measurement error.


• 3. The error term (ei) has certain properties

• Recall: error is a case’s deviation from the regression line

• Not the same as measurement error!

• After you run a regression, SPSS can tell you the error value for any or all cases (called the “residual”)

• 3. a. Error is conditionally normal

– For bivariate, we looked to see if Y was conditionally normal… Here, we look to see if error is normal

– Examine “residuals” (ei) for normality at different values of X variables.

Regression Assumptions

• Normality:

[Figure: scatterplot of HAPPY (0 to 10) against INCOME (0 to 100,000).]

Examine residuals at different values of X. Make histograms and check for normality.

[Figure: histogram of HAPPY; Std. Dev = 1.51, Mean = 3.84, N = 60. Labeled “Good”.]

[Figure: histogram of HAPPY; Std. Dev = 3.06, Mean = 4.58, N = 60. Labeled “Not very good”.]


• 3. b. The error term (ei) has a mean of 0

– This affects the estimate of the constant. (Not a huge problem)

• 3. c. The error term (ei) is homoskedastic (has constant variance)

– Note: This affects standard error estimates, hypothesis tests

– Look at residuals, to see if they spread out with changing values of X

• Or plot standardized residuals vs. standardized predicted values.

[Figure: scatterplot of HAPPY (0 to 10) against INCOME (0 to 100,000).]


• Homoskedasticity: Equal Error Variance

Examine error at different values of X.

Is it roughly equal?

Here, things look pretty good.

[Figure: scatterplot of HAPPY (0 to 10) against INCOME (0 to 100,000).]


• Heteroskedasticity: Unequal Error Variance

At higher values of X, error variance increases a lot.

This looks pretty bad.
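Besides eyeballing the plot, one crude numeric check is to compare the spread of the residuals at low and high values of X. The sketch below is not a formal test and is not an SPSS procedure; the data are hypothetical and built so that the error grows with X.

```python
# Hypothetical data: the disturbance alternates in sign and grows with x.
x = list(range(1, 11))
y = [2 + 3 * xi + 0.5 * xi * (-1) ** xi for xi in x]

def simple_ols(x, y):
    """Return (intercept, slope) for a bivariate least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

a, b = simple_ols(x, y)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Sort cases by x, then split the residuals into a low-x half and a high-x half.
ordered = [e for _, e in sorted(zip(x, resid))]
half = len(ordered) // 2
low, high = ordered[:half], ordered[half:]

mean_sq_low = sum(e ** 2 for e in low) / len(low)
mean_sq_high = sum(e ** 2 for e in high) / len(high)
ratio = mean_sq_high / mean_sq_low

print(f"mean squared residual, low x:  {mean_sq_low:.2f}")
print(f"mean squared residual, high x: {mean_sq_high:.2f}")
print(f"ratio (high / low): {ratio:.2f}")
```

A ratio near 1 is what homoskedastic errors would produce; a ratio well above 1, as in this constructed example, matches the fan shape in the heteroskedasticity plot.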


• 3. d. Predictors (Xis) are uncorrelated with error

– This assumption is most often violated when we leave out an important variable that is correlated with another Xi

– Example: Predicting job prestige with family wealth, but not including education

– Omission of education will affect error term. Those with lots of education will have large positive errors.

• Since wealth is correlated with education, it will be correlated with that error!

– Result: coefficient for family wealth will be biased.


• 4. In systems of equations, error terms of equations are uncorrelated

– This is not a concern for us in this class

• Worry about that later!


• 5. Sample is independent, errors are random

• Technically, part of 3.c.

– Not only should errors not increase with X (heteroskedasticity), there should be no pattern at all!

• Things that cause patterns in error (autocorrelation):

– Measuring data over long periods of time (e.g., every year). Error from nearby years may be correlated.

• Called: “Serial correlation”.
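One way to look for serial correlation is to correlate each residual with the one before it. The sketch below uses a hypothetical series of residuals ordered by year. It also computes the Durbin-Watson statistic, a standard summary that these slides do not cover: values near 2 suggest no first-order autocorrelation, and values well below 2 suggest positive autocorrelation.

```python
# Hypothetical residuals from a regression on yearly data, in time order.
residuals = [1.2, 0.9, 0.7, 0.1, -0.4, -0.9, -1.1, -0.6, -0.2, 0.3]

def lag1_autocorrelation(e):
    """Correlation between each residual and the previous one."""
    mean_e = sum(e) / len(e)
    d = [ei - mean_e for ei in e]
    return sum(a * b for a, b in zip(d[1:], d[:-1])) / sum(a * a for a in d)

def durbin_watson(e):
    """Sum of squared successive differences over the sum of squared residuals."""
    return sum((b - a) ** 2 for a, b in zip(e[:-1], e[1:])) / sum(a * a for a in e)

r1 = lag1_autocorrelation(residuals)
dw = durbin_watson(residuals)
print(f"lag-1 autocorrelation: {r1:.3f}")
print(f"Durbin-Watson:         {dw:.3f}")
```

In this made-up series, positive residuals cluster together and so do negative ones, so the lag-1 correlation is strongly positive and the Durbin-Watson value is far below 2.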


• More things that cause patterns in error (autocorrelation):

– Measuring data in families. All members are similar, will have correlated error

– Measuring data in geographic space.

• Example: data on 50 US states. States in a similar region have correlated error

• Called “spatial autocorrelation”

• There are variations of regression models to address each kind of correlated error.

Multiple Regression Estimation

• Calculating b’s involves solving a set of equations to minimize squared error

• Analogous to bivariate, but math is more complex

• The optimal estimator has minimum variance and is referred to as “BLUE”:

• Best Linear Unbiased Estimator

• The BLUE Multiple Regression has more assumptions than bivariate.
