20
Linear regression models in R (session 1) Tom Price 22 Feb 2010 http:// rcourse.iop.kcl.ac.uk/S16/ lm1.ppt

Linear regression models in R (session 1) Tom Price 22 Feb 2010

Embed Size (px)

Citation preview

Page 1: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Linear regression models in R (session 1)

Tom Price

22 Feb 2010

http://rcourse.iop.kcl.ac.uk/S16/lm1.ppt

Page 2: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Linear model

1000 2000 3000 4000 5000 6000 7000

-2-1

01

23

4

climb

std

res(

lm1

)

Dependent variable

Inde

pend

ent

varia

ble

We wish to test the dependence of y on x

Page 3: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Linear regression model

y = b0 + b1x1 + … + bkxk + e

Wherey is the dependent variable

x1 … xk are independent variables (predictors)

b0 … bk are the regression coefficients

e denotes the residuals

• The residuals are assumed to be identically and independently Normally distributed with mean 0.

• The coefficients are usually estimated by the “least squares” technique – choosing values of b0 … bk that minimise the sum of the squares of the residuals e.

Page 4: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Scottish hill races

Set of data on record times of Scottish hill races against distance and total height climbed.

library(MASS)?hillsdata(hills)library(lattice)splom(~hills)

Scatter Plot Matrix

dist15

20

2515 20 25

5

10

15

5 10 15

climb4000

6000

8000

4000 6000 8000

0

2000

4000

0 2000 4000

time

150

200 150 200

50

100

50 100

Page 5: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Formula

?formula

Specifies the model e.g.y ~ x y is dependent var,

x is independent vary ~ factor( x ) dummy codedy ~ -1 + factor( x ) no intercepty ~ a + b + c 3 independent variablesy ~ a * b + c includes one interaction termy ~ a * b * c includes all interactions termsy ~ ( a + b + c )^3 same as abovey ~ ( a + b + c )^2 includes all 2-way interaction

terms

Page 6: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Linear model

?lmlm1=lm(time~dist,data=hills)summary(lm1)

Page 7: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Linear modellm1=lm(time~dist,data=hills)summary(lm1)

Call:lm(formula = time ~ dist)

Residuals: Min 1Q Median 3Q Max -35.745 -9.037 -4.201 2.849 76.170

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -4.8407 5.7562 -0.841 0.406 dist 8.3305 0.6196 13.446 6.08e-15 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.96 on 33 degrees of freedomMultiple R-squared: 0.8456, Adjusted R-squared: 0.841 F-statistic: 180.8 on 1 and 33 DF, p-value: 6.084e-15

Page 8: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Fitted points

attach(hills)(c1=coef(lm1))

(Intercept) dist -4.840720 8.330456

plot(dist,time)abline(c1)f1=fitted(lm1)for(i in 1:35) lines(c(dist[i],dist[i]), c(time[i],f1[i]),col=“red”)

5 10 15 20 25

50

10

01

50

20

0

dist

time

Page 9: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Standardised residualssr1=stdres(lm1)truehist(sr1)

-2 0 2 4

0.0

0.1

0.2

0.3

0.4

0.5

0.6

stdres(lm1)

Page 10: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Standardised residualsqqnorm(sr1)abline(0,1)

-2 -1 0 1 2

-2-1

01

23

4

Normal Q-Q Plot

Theoretical Quantiles

Sa

mp

le Q

ua

ntil

es

Page 11: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Standardised residualsplot(climb,sr1)

1000 2000 3000 4000 5000 6000 7000

-2-1

01

23

4

climb

sr1

Page 12: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Exercise 1• Using the Scottish hill races dataset, model the time as a function of

both distance and total height climbed. – What can you learn from looking at the fitted points and the

standardised residuals?

• Some useful commands:library(MASS)?hillsdata(hills)attach(hills)?lm?formula?fitted?stdres?truehist?plot?qqnorm

Page 13: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Model fit

• The linear model assumes that residuals are independently identically Normally distributed with mean 0.

• To assess whether the model fits the data, look at the residuals.

Page 14: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Model fit

• If the model does not fit, it may be because of:– Outliers– Unmodelled covariates– Heteroscedasticity (residuals have unequal variance)– Clustering (residuals have lower variance within subgroups)– Autocorrelation (correlation between residuals at successive

time points)

• All of these can be detected by looking for patterns in the residuals.

• In the next session we will look at some ways to find better fitting models.

Page 15: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Outlierssr1=stdres(lm1)truehist(sr1)

-2 0 2 4

0.0

0.1

0.2

0.3

0.4

0.5

0.6

stdres(lm1)

Page 16: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Outliersqqnorm(sr1)abline(0,1)

-2 -1 0 1 2

-2-1

01

23

4

Normal Q-Q Plot

Theoretical Quantiles

Sa

mp

le Q

ua

ntil

es

Page 17: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Unmodelled covariateplot(climb,sr1)

1000 2000 3000 4000 5000 6000 7000

-2-1

01

23

4

climb

std

res(

lm1

)

Page 18: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Heteroscedasticityplot(f1,sr1)

50 100 150 200

-2-1

01

23

4

fitted(lm1)

std

res(

lm1

)

Page 19: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Exercise 2

Investigate the dataset “trees” in the MASS package.

• How does Volume depend on Height and Girth? Try some models and examine the residuals to assess model fit.

• Transforming the data can give better fitting models, especially when the residuals are heteroscedastic. Try log and cube root transforms for Volume. Which do you think works better? How do you interpret the results?

Some useful commands:library(MASS)

?trees?lm?formula?stdres?fitted

?boxcox

Page 20: Linear regression models in R (session 1) Tom Price 22 Feb 2010

Reading

Venables & Ripley, “Modern Applied Statistics with S”, chapter 6.