
Page 1:

Diagnostics for Linear Regression Models

Prof. David Sibbritt

Page 2:

Session Outline

• Using residuals to check the model
  residual plots
• Multicollinearity diagnostics
  variance inflation factor (VIF)

Page 3:

Background

• When a regression model is selected, one cannot usually be certain in advance that the model is appropriate
• It is important to consider the series of regression diagnostics available
  they allow us to look for flaws that may affect our parameter estimates
  they let us consider whether the assumptions underlying the model are violated, and whether our results are heavily affected by influential observations

Page 4:

Using Residuals to Check the Model

• There are several plots we can produce to determine whether there are any departures from the regression model:

1) The regression function is not linear
  plot residuals vs independent variable (or dependent variable)
  residuals should be evenly scattered in a horizontal band about zero, and there should not be any other obvious pattern (e.g. curved)

Page 5:

2) The error terms do not have constant variance
  plot residuals vs independent variable (or dependent variable)
  residuals should be scattered about zero with constant spread (i.e. no 'fanning out')

3) The model fits all but one or a few outlier observations
  plot residuals vs independent variable (or dependent variable)
  there should not be any residuals positioned far from zero

Page 6:

4) The error terms are not normally distributed
  histogram of residuals and/or normal probability plot

5) One or several important independent variables have been omitted from the model
  plot residuals vs any other independent variables
  residuals should not be raised above, or lowered below, zero

A code sketch covering these checks follows.
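The original slides demonstrate these plots with SPSS screenshots; here is a minimal sketch of the same checks in Python (statsmodels, matplotlib, scipy), using simulated data as a stand-in for the lecture's dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm

# Simulated data standing in for the lecture's SPSS dataset
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)

# Fit the linear model and extract its residuals
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
resid = fit.resid

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Checks 1-3 and 5: residuals vs a predictor; look for
# curvature, fanning out, or points far from zero
axes[0].scatter(x, resid)
axes[0].axhline(0, color="grey")
axes[0].set(xlabel="X", ylabel="residual", title="Residuals vs X")

# Check 4: histogram of residuals
axes[1].hist(resid, bins=15)
axes[1].set(title="Histogram of residuals")

# Check 4: normal probability (Q-Q) plot
stats.probplot(resid, dist="norm", plot=axes[2])

plt.tight_layout()
plt.show()
```

In practice, `resid` would come from the model fitted to your own data rather than the simulated sample used here.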

Page 7:

• Below are some example residual plots (assuming a linear regression model) of the kinds that typically result:

(a) shows all residuals evenly scattered about zero, with no obvious pattern or outliers
  suggests that the model is appropriate

[Figure: four stylised residual plots (a)-(d); panels (a), (b) and (c) plot residuals ei against X, panel (d) plots residuals against Time, each with a horizontal reference line at zero]

Page 8:


(b) shows that as the predictor variable X increases, so does the variation in the residuals (fanning out)
  suggests that the model is not appropriate

(c) shows a definite pattern (curved) in the residuals
  suggests that the model is not appropriate (and also that a curved model would be a better choice)
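To make patterns (a) and (b) concrete, the following sketch (simulated data, not from the lecture) fits a least squares line to a well-behaved sample and to one whose error variance grows with X, then plots the residuals; the second panel should show the 'fanning out' described above:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))

# (a) constant error variance: residuals form an even band about zero
y_a = 1.0 + 2.0 * x + rng.normal(0, 1.0, x.size)
# (b) error standard deviation grows with X: residuals 'fan out'
y_b = 1.0 + 2.0 * x + rng.normal(0, 0.3 * x)

fig, axes = plt.subplots(1, 2, figsize=(9, 4), sharey=True)
for ax, y, label in [(axes[0], y_a, "(a) constant variance"),
                     (axes[1], y_b, "(b) fanning out")]:
    b, a = np.polyfit(x, y, 1)   # least squares slope and intercept
    resid = y - (a + b * x)      # residuals from the fitted line
    ax.scatter(x, resid, s=10)
    ax.axhline(0, color="grey")
    ax.set(xlabel="X", ylabel="residual", title=label)
plt.tight_layout()
plt.show()
```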

Page 9:

• In SPSS, we can produce residuals as follows (the dialog screenshots from the original slides are not reproduced here):

Page 10:

• The saved residuals, when plotted, should be randomly scattered about zero (plot not reproduced here)

Page 11:

Notes of caution:

a) Range of observations
  even if the fit appears satisfactory for the observations we have available, the model may not be a good fit when extended outside the range of past observations

b) Causality
  the presence of a regression relation between two variables does not imply a cause-and-effect relation between them

Page 12:

Multicollinearity Diagnostics: Variance Inflation Factor

• In multiple linear regression, problems can arise when the independent variables being considered for the regression model are highly correlated among themselves
  i.e. the correlated variables will have a similar relationship with the dependent variable

• There is a highly useful diagnostic: the variance inflation factor

Page 13:

• The variance inflation factor (VIF) measures how much the variances of the estimated regression coefficients are inflated compared to when the independent variables are not linearly related

• The largest VIF value among all X variables is often used as an indicator of the severity of multicollinearity
  a maximum VIF value in excess of 10 is often taken as an indication that multicollinearity may be unduly influencing the least squares estimates

Page 14:

• Mean VIF values considerably larger than 1 are indicative of serious multicollinearity problems

• In general, the VIF for the jth regression coefficient can be written as

  VIFj = 1 / (1 − Rj²)

  where Rj² is the coefficient of multiple determination obtained from regressing Xj on the other regressor variables
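As an illustration of this formula (not part of the original slides), the sketch below computes VIFj directly from the Rj² definition on simulated predictors, one pair nearly collinear and one independent, and checks the result against statsmodels' variance_inflation_factor helper:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated predictors: x2 is built to be strongly correlated with x1
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                    # unrelated to x1 and x2
X = sm.add_constant(np.column_stack([x1, x2, x3]))

def vif_from_definition(X, j):
    """VIFj = 1 / (1 - Rj^2), where Rj^2 comes from
    regressing column j on the remaining regressors."""
    others = [c for c in range(X.shape[1]) if c != j]
    r2 = sm.OLS(X[:, j], X[:, others]).fit().rsquared
    return 1.0 / (1.0 - r2)

# x1 and x2 should show large VIFs; x3 should be close to 1
for j, name in [(1, "x1"), (2, "x2"), (3, "x3")]:
    print(name,
          round(vif_from_definition(X, j), 2),
          round(variance_inflation_factor(X, j), 2))  # should agree
```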

Page 15:

• If Xj is nearly linearly dependent on some of the other regressors, then Rj² will be near unity and VIFj will be large

• If Xj is orthogonal to the remaining predictors, its VIF will be 1

• Regression models fit to data by the method of least squares, when strong multicollinearity is present, are notoriously poor prediction equations, and the values of the regression coefficients are often very sensitive to the data in the particular sample collected

Page 16:

Example: AIS Athletes Study

Page 17:

• Maximum VIF = 5.01 < 10 (not too bad)
• Mean VIF = 3.73 > 1 (not too bad?)

Coefficients(a)

Model          B        Std. Error   Beta     t       Sig.   Tolerance   VIF
1 (Constant)   -63.714   47.543               -1.340   .182
  RCC          -12.359   15.533      -.118     -.796   .427    .211      4.747
  Hg            13.884    5.356       .396     2.592   .010    .200      5.012
  Bfat           -.233     .626      -.030     -.372   .710    .700      1.428

(B, Std. Error: unstandardized coefficients; Beta: standardized coefficients; Tolerance, VIF: collinearity statistics)

a. Dependent Variable: Ferr

Mean VIF = (4.747 + 5.012 + 1.428) / 3 = 3.73
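For readers working outside SPSS, the same diagnostics can be reproduced roughly as sketched below; the file name ais.csv, and the assumption that it holds columns Ferr, RCC, Hg and Bfat, are hypothetical stand-ins for the study data:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical file name; any table with these columns would do
ais = pd.read_csv("ais.csv")

X = sm.add_constant(ais[["RCC", "Hg", "Bfat"]])
fit = sm.OLS(ais["Ferr"], X).fit()
print(fit.summary())  # coefficients, t statistics, p-values

# VIF for each predictor (skip the constant in column 0)
vifs = {name: variance_inflation_factor(X.values, j)
        for j, name in enumerate(X.columns) if name != "const"}
print(vifs)
print("max VIF:", max(vifs.values()))
print("mean VIF:", sum(vifs.values()) / len(vifs))
```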
