simple linear reg stat

Chapter 10 Simple Linear Regression and Correlation

Linear Regression

Methods for studying the relationship of two or more quantitative variables

Examples:
• Predict salary from education and years of experience
• Predict sales from the amount of advertising expenditures
• Predict vocabulary size from the age and amount of education of parents

Variables:
• Response/outcome/dependent variable
• Predictor/explanatory/independent variable


Relationships between the response and predictor variables
• Functional or mathematical relation: $y = f(x)$ (deterministic)
• Structural or statistical relation: $y = f(x) + \text{error}$ (stochastic/probabilistic)

Goals:
1) What is a reasonable model?
   (a) $f$
   (b) errors
2) When $f$ has unknown parameters, estimate the parameters
3) Predict $y$ at a new $x$


Simple Linear Regression (SLR)

Basic model:

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n$$

• $Y_i$: the response/dependent variable
• $x_i$: the predictor/explanatory/independent variable
• $y_i$: the observed value of $Y_i$
• $x_i$: treated as a fixed quantity (or conditioned upon)
• $\epsilon_i$: the random error, typically assumed to satisfy $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$, and usually assumed normally distributed

Key assumptions (to be checked later):
• Linear relationship
• Independent (uncorrelated) errors
• Constant variance errors
• Normally distributed errors
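A small simulation can make the model concrete. The sketch below, in R, draws data from the basic model under the stated assumptions. The parameter values (`beta0`, `beta1`, `sigma`), the sample size, and the `x` grid are hypothetical choices for illustration and are not taken from any example in these notes.

```r
# Simulate from Y_i = beta0 + beta1 * x_i + eps_i (all numeric values are hypothetical)
set.seed(1)                            # for reproducibility
n     <- 50
beta0 <- 2
beta1 <- 0.5
sigma <- 1
x   <- seq(1, 10, length.out = n)      # predictor values, treated as fixed
eps <- rnorm(n, mean = 0, sd = sigma)  # independent N(0, sigma^2) errors
y   <- beta0 + beta1 * x + eps         # observed responses
mu  <- beta0 + beta1 * x               # true regression line E(Y | x)
```

Plotting `y` against `x` and overlaying `mu` shows the statistical relation: points scatter around the line with constant spread.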


The SLR model can also be written as

$$Y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2), \quad i = 1, \ldots, n$$


• The mean of $Y$ given $x$ (known as the conditional mean) is a linear function of $x$ given by $E(Y \mid x) = \beta_0 + \beta_1 x$
• $\beta_0$ is the conditional mean when $x = 0$
• If we replace $x$ by $x - \bar{x}$ then $\beta_0$ is interpreted as the conditional mean when $x = \bar{x}$
• $\beta_1$ is the slope, i.e. the change in the mean of $Y$ per unit change in $x$
• $\sigma^2$ is the variance of the responses about the mean
• The relationship is described by the true regression line $E(Y \mid x) = \beta_0 + \beta_1 x$
• The model is called "linear" not because it is linear in $x$, but rather because it is linear in the parameters $\beta_0$ and $\beta_1$
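The centering remark can be checked in one line. Here $\bar{x}$ is the sample mean of the predictor, and $\beta_0^{*}$ is a symbol introduced only for this illustration to denote the intercept of the centered model:

```latex
\begin{aligned}
Y_i &= \beta_0 + \beta_1 x_i + \epsilon_i \\
    &= (\beta_0 + \beta_1 \bar{x}) + \beta_1 (x_i - \bar{x}) + \epsilon_i \\
    &= \beta_0^{*} + \beta_1 (x_i - \bar{x}) + \epsilon_i,
\qquad \beta_0^{*} = \beta_0 + \beta_1 \bar{x} = E(Y \mid x = \bar{x}).
\end{aligned}
```

The slope is unchanged by centering; only the intercept is re-expressed as the mean response at $x = \bar{x}$.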


Example: Crime Rate

A criminologist studying the relationship between level of education and crime rate in medium-sized U.S. counties collected the following data for a random sample of 84 counties; $x$ is the percentage of individuals in the county having at least a high-school diploma and $y$ is the crime rate (crimes reported per 100,000 residents) last year.

[Figure: Scatter Plot for the Crime Rate Data. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents).]

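Fitting the SLR model - least squares (LS) estimation

Choose $\beta_0, \beta_1$ to minimize the sum of squared deviations (vertical distances) of all data points from the fitted line:

$$Q(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2, \qquad (\hat\beta_0, \hat\beta_1) \equiv \arg\min_{\beta_0, \beta_1} Q(\beta_0, \beta_1)$$

Taking first partial derivatives and setting them equal to zero yields the normal equations:

$$n \hat\beta_0 + \hat\beta_1 \sum x_i = \sum y_i$$
$$\hat\beta_0 \sum x_i + \hat\beta_1 \sum x_i^2 = \sum x_i y_i$$

which are equivalent to $\sum e_i = 0$ and $\sum x_i e_i = 0$, where $e_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$.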
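For readers who want the intermediate step, the derivation of the normal equations can be written out as follows, with the derivatives evaluated at the minimizer $(\hat\beta_0, \hat\beta_1)$:

```latex
\begin{aligned}
Q(\beta_0,\beta_1) &= \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \\
\frac{\partial Q}{\partial \beta_0}\Big|_{(\hat\beta_0,\hat\beta_1)}
  &= -2 \sum (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
  \;\Longrightarrow\; n\hat\beta_0 + \hat\beta_1 \sum x_i = \sum y_i \\
\frac{\partial Q}{\partial \beta_1}\Big|_{(\hat\beta_0,\hat\beta_1)}
  &= -2 \sum x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
  \;\Longrightarrow\; \hat\beta_0 \sum x_i + \hat\beta_1 \sum x_i^2 = \sum x_i y_i
\end{aligned}
```

The two bracketed sums are exactly $\sum e_i$ and $\sum x_i e_i$, which is why the normal equations say that the residuals sum to zero and are orthogonal to the predictor.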


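• Least squares estimators:

$$\hat\beta_1 = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$

$$\hat\beta_0 = \frac{(\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_i y_i)}{n \sum x_i^2 - (\sum x_i)^2} = \bar{y} - \hat\beta_1 \bar{x}$$

where $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum (x_i - \bar{x})^2$.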
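The closed-form estimators are easy to verify numerically. The following R sketch uses a small hypothetical data set (the `x` and `y` values are made up for illustration and are not the crime rate data), computes the estimates from the corrected sums, and compares them with `lm()`.

```r
# Hypothetical toy data (not the crime rate data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

Sxy <- sum((x - mean(x)) * (y - mean(y)))  # corrected sum of cross-products
Sxx <- sum((x - mean(x))^2)                # corrected sum of squares of x
b1  <- Sxy / Sxx                           # slope estimate
b0  <- mean(y) - b1 * mean(x)              # intercept estimate

fit <- lm(y ~ x)                           # the same fit via lm()
e   <- residuals(fit)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
print(c(sum(e), sum(x * e)))               # both numerically zero: the normal equations
```

The last line checks the two normal equations directly: the residuals sum to zero and are orthogonal to `x`, up to floating-point error.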


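• $\hat\beta_0$ and $\hat\beta_1$ are the best linear unbiased estimators of $\beta_0$ and $\beta_1$
• The fitted values: $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$
• Residuals: $e_i = y_i - \hat{y}_i$
• Least squares (LS) line: $\hat{y} = \hat\beta_0 + \hat\beta_1 x$
• $(\bar{x}, \bar{y})$ is the "centroid" of the scatter plot; the LS line passes through it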

[Figure: scatter plot of the crime rate data with the fitted LS line $\hat{y} = 20517.6 - 170.58x$. y-axis: Crime Rate (per 100K residents).]

[Figure: plot of the residuals against observation index (0 to 80). y-axis: Residuals.]

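Goodness of fit of the LS line

Residuals: $e_i = y_i - \hat{y}_i$, $i = 1, \ldots, n$

Error sum of squares (SSE): $SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$

Compare with the SSE for the simplest model, $Y_i = \beta_0 + \epsilon_i$: here $\hat\beta_0 = \bar{y}$ and $SSE = \sum (y_i - \bar{y})^2$, referred to as the (corrected) total sum of squares (SST), which measures the variability of the $y_i$ around their mean.

Then SST can be decomposed as

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

SST = SSR + SSE

SSR: the regression sum of squares, which measures the variation in $y$ that is accounted for by regression on $x$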


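The coefficient of determination:

$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, \quad 0 \le r^2 \le 1$$

which represents the proportion of variation in $y$ that is accounted for by regression on $x$.

Relationship to the sample correlation coefficient $r$:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \pm\sqrt{r^2}$$

The sign of $r$ is the same as the sign of $\hat\beta_1$.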
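As a worked check for the crime rate example, the R sketch below recomputes $r^2$ and $r$ from the two sums of squares reported by `anova(g1)` later in these notes. The numbers are copied from that output; the variable names are chosen here for illustration.

```r
# Sums of squares as reported by anova(g1) in these notes
SSR <- 93462942
SSE <- 455273165
SST <- SSR + SSE      # total sum of squares
r2  <- SSR / SST      # coefficient of determination
r   <- -sqrt(r2)      # negative because the fitted slope (-170.58) is negative
print(round(r2, 4))   # 0.1703, matching "Multiple R-squared" in summary(g1)
print(round(r, 4))
```

About 17% of the variation in crime rate is accounted for by the regression on the graduation percentage, so the linear association is real but weak.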


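Estimation of $\sigma^2$

A common unbiased estimator of $\sigma^2$ is given by

$$s^2 = \frac{\sum e_i^2}{n - 2} = \frac{SSE}{n - 2} = MSE$$

MSE: mean square error
• The d.f. for $s^2$ is $n - 2$, since 2 unknown parameters, $\beta_0$ and $\beta_1$, are estimated from the data of size $n$.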

Crime rate example continued:
Obtain the point estimates of the following:
(1) The difference in the mean crime rate for two counties whose high-school graduation rates differ by one percentage point;
(2) The mean crime rate last year in counties with high-school graduation percentage $x = 80$;
(3) The random error variance $\sigma^2$.
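Using the values printed in the R output that follows in these notes (coefficients from `summary(g1)$coeff`, SSE from `deviance(g1)`), the three point estimates can be worked out by hand. Item (3) is read here as asking for the error variance, which is an assumption about the intended quantity.

```latex
\begin{aligned}
\text{(1)}\quad & \hat\beta_1 = -170.58 \quad \text{(crimes per 100{,}000 residents, per percentage point)} \\
\text{(2)}\quad & \hat{y} = \hat\beta_0 + \hat\beta_1 \cdot 80 = 20517.60 - 170.5752 \times 80 \approx 6871.6 \\
\text{(3)}\quad & s^2 = \frac{SSE}{n-2} = \frac{455273165}{82} \approx 5552112, \qquad s \approx 2356.29
\end{aligned}
```

These agree with the `predict()` fit of 6871.585 and the residual standard error of 2356.292 reported by R.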


# read in the data set
> crime=read.table("crimerate.txt",header=FALSE)
> names(crime)=c("rate","percentage")

# scatter plot
> plot(crime$percentage,crime$rate,main="Scatter Plot for the Crime Rate Data",
+      xlab="Percentage of having at least high school diplomas",
+      ylab="Crime Rate (per 100K residents)",type="p",pch=16)

# fitting a SLR model using least squares
> g1=lm(rate~percentage,data=crime)

# adding the fitted regression line to the scatter plot
> abline(g1,col="red",lwd=2)

[Figure: scatter plot of the crime rate data with the fitted regression line added. y-axis: Crime Rate (per 100K residents).]

# LS estimation results
> summary(g1)

Call:
lm(formula = rate ~ percentage, data = crime)

Residuals:
    Min      1Q  Median      3Q     Max
-5278.3 -1757.5  -210.5  1575.3  6803.3

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
percentage   -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared: 0.1703, Adjusted R-squared: 0.1602
F-statistic: 16.83 on 1 and 82 DF, p-value: 9.571e-05


> summary(g1)$coeff
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 20517.5999 3277.64269  6.259865 1.672906e-08
percentage   -170.5752   41.57433 -4.102897 9.571396e-05

> predict(g1,data.frame(percentage=80),se=TRUE)
$fit
       1
6871.585

$se.fit
[1] 263.6425

$df
[1] 82

$residual.scale
[1] 2356.292

> deviance(g1) # SSE
[1] 455273165
> df.residual(g1) # df for SSE
[1] 82
> sqrt(deviance(g1)/df.residual(g1)) # estimate for sigma
[1] 2356.292


> residuals(g1)
          1           2           3           4           5           6           7           8
  591.96401  1648.56552  1660.99033  1518.99033   568.44147  -159.63749 -2357.48712  -828.00967
          9          10          11          12          13          14          15          16
   97.96401  1401.56552 -1233.46080   285.56552  2426.26477 -1594.28410 -1493.43448 -2615.16004
…
         81          82          83          84
-1363.25778  2533.01666   621.14071    28.11439

> summary(g1)$residuals # same as residuals(g1)
> sum(residuals(g1)^2) # SSE
[1] 455273165

> plot(residuals(g1),pch=16,main="Scatter Plot of Residuals",ylab="Residuals",xlab="")
> abline(h=0,lty=2)

[Figure: Scatter Plot of Residuals. x-axis: observation index (0 to 80); y-axis: Residuals.]

> fitted.values(g1)
       1        2        3        4        5        6        7        8        9
7895.036 6530.434 6701.010 6701.010 5677.559 9259.637 8918.487 6701.010 7895.036
      10       11       12       13       14       15       16       17       18
6530.434 7724.461 6530.434 7212.735 6189.284 6530.434 7042.160 7212.735 8065.611
…
      82       83       84
5506.983 6359.859 7553.886
> plot(crime$percentage,fitted.values(g1),pch=16,xlab="Percentage",ylab="Fitted Values")
> abline(g1,lty=2)
> plot(fitted.values(g1),residuals(g1),main="",ylab="Residuals",xlab=expression(hat(y)),pch=16)
> plot(crime$percentage,residuals(g1),main="",ylab="Residuals",xlab="Percentage",pch=16)

[Figure: Fitted Values against Percentage.]

[Figure: Residuals against fitted values $\hat{y}$.]

[Figure: Residuals against Percentage.]

Statistical Inference for Simple Linear Regression

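Inference on $\beta_0$ and $\beta_1$

With $S_{xx} = \sum (x_i - \bar{x})^2$:

$$E(\hat\beta_0) = \beta_0, \qquad \mathrm{Var}(\hat\beta_0) = \frac{\sigma^2 \sum x_i^2}{n S_{xx}}, \qquad SD(\hat\beta_0) = \sigma \sqrt{\frac{\sum x_i^2}{n S_{xx}}}$$

$$E(\hat\beta_1) = \beta_1, \qquad \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{S_{xx}}, \qquad SD(\hat\beta_1) = \frac{\sigma}{\sqrt{S_{xx}}}$$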


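Under the normality assumption,

$$\frac{\hat\beta_0 - \beta_0}{SD(\hat\beta_0)} \sim N(0, 1) \quad \text{and} \quad \frac{\hat\beta_1 - \beta_1}{SD(\hat\beta_1)} \sim N(0, 1)$$

$$\frac{(n-2) S^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}$$

$S^2$ is distributed independently of $\hat\beta_0$ and $\hat\beta_1$

Replacing $\sigma$ by $s$ gives the standard errors

$$SE(\hat\beta_0) = s \sqrt{\frac{\sum x_i^2}{n S_{xx}}} \quad \text{and} \quad SE(\hat\beta_1) = \frac{s}{\sqrt{S_{xx}}}, \qquad S_{xx} = \sum (x_i - \bar{x})^2$$

$$\frac{\hat\beta_0 - \beta_0}{SE(\hat\beta_0)} \sim t_{n-2} \quad \text{and} \quad \frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2}$$

$100(1-\alpha)\%$ CI's on $\beta_0$ and $\beta_1$ are given by

$$\hat\beta_0 \pm t_{n-2, \alpha/2} \, SE(\hat\beta_0)$$

$$\hat\beta_1 \pm t_{n-2, \alpha/2} \, SE(\hat\beta_1)$$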
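The interval formula can be applied by hand to the crime rate fit. The R sketch below takes the slope estimate and its standard error from the `summary(g1)$coeff` output shown earlier in these notes; the variable names are illustrative.

```r
# Values copied from summary(g1)$coeff in these notes
b1    <- -170.5752    # slope estimate
se_b1 <- 41.57433     # its standard error
df    <- 82           # n - 2 with n = 84 counties
alpha <- 0.05
tcrit <- qt(1 - alpha / 2, df)           # upper alpha/2 critical point of t_{n-2}
ci    <- b1 + c(-1, 1) * tcrit * se_b1   # 95% CI for beta1
print(ci)                                # compare with confint(g1): -253.28, -87.87
```

Changing `alpha` to 0.10 gives the 90% interval that `confint(g1, "percentage", level = 0.9)` reports.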


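Hypothesis tests: $H_0: \beta_1 = \beta_1^0$ vs. $H_1: \beta_1 \ne \beta_1^0$

Use the t-test:

$$t = \frac{\hat\beta_1 - \beta_1^0}{SE(\hat\beta_1)} \sim t_{n-2} \quad \text{when } H_0 \text{ is true}$$

Reject $H_0$ at level $\alpha$ if $|t| > t_{n-2, \alpha/2}$, or equivalently if p-value $= 2 P(T_{n-2} \ge |t|) < \alpha$

Particularly, for testing if there is a linear relationship, $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$

Reject $H_0$ at level $\alpha$ if $|t| = |\hat\beta_1| / SE(\hat\beta_1) > t_{n-2, \alpha/2}$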
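For the crime rate fit, the test of no linear relationship can be reproduced from two numbers in the coefficient table. This is a minimal sketch using the slope estimate and standard error printed by `summary(g1)$coeff` in these notes; the variable names are illustrative.

```r
# Values copied from summary(g1)$coeff in these notes
b1    <- -170.5752
se_b1 <- 41.57433
df    <- 82                                             # n - 2
tstat <- (b1 - 0) / se_b1                               # statistic under H0: beta1 = 0
pval  <- 2 * pt(abs(tstat), df, lower.tail = FALSE)     # two-sided p-value
reject <- abs(tstat) > qt(0.975, df)                    # decision at alpha = 0.05
print(c(t = tstat, p = pval))                           # compare with the "percentage" row
print(reject)
```

Both decision rules agree: the statistic exceeds the critical value and the p-value is far below 0.05, so $H_0$ is rejected.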


Crime Rate Example continued:
(1) Test linear relationship at $\alpha = 0.05$

> summary(g1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
percentage   -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared: 0.1703, Adjusted R-squared: 0.1602
F-statistic: 16.83 on 1 and 82 DF, p-value: 9.571e-05


(2) Calculate a 95% CI on the change in the mean crime rate for every one percentage point increase in high-school graduation rate.

> confint(g1)
                 2.5 %      97.5 %
(Intercept) 13997.3245 27037.87538
percentage   -253.2798   -87.87061

> # we can specify a particular parameter
> # as well as change confidence level
> confint(g1,"percentage",level=0.9)
                 5 %      95 %
percentage -239.7403 -101.4101


Analysis of Variance (ANOVA) for SLR

ANOVA is a statistical technique to decompose the total variability in the $y_i$'s into separate variance components associated with specific sources

Decomposition of the variability and degrees of freedom (d.f.)

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

        SST  =  SSR  +  SSE
d.f.    n-1  =   1   +  n-2

A mean square is defined by a sum of squares divided by its d.f.
Mean square regression: $MSR = SSR/1$
Mean square error: $MSE = SSE/(n-2)$


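Since

$$F = \frac{MSR}{MSE} \sim F_{1, n-2} \quad \text{when } \beta_1 = 0,$$

we can test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \ne 0$ at level $\alpha$ by rejecting $H_0$ if $F > f_{1, n-2, \alpha}$ (equivalent to $|t| > t_{n-2, \alpha/2}$, since $F = t^2$)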
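The equivalence of the F-test and the two-sided t-test can be checked numerically for the crime rate fit. The R sketch below uses the sums of squares from `anova(g1)` and the t value from `summary(g1)$coeff`, both printed in these notes; the variable names are illustrative.

```r
# Values copied from anova(g1) and summary(g1)$coeff in these notes
MSR   <- 93462942 / 1          # SSR / 1
MSE   <- 455273165 / 82        # SSE / (n - 2)
Fstat <- MSR / MSE
tstat <- -4.102897             # t value for the slope
print(c(F = Fstat, t_squared = tstat^2))            # both about 16.83
print(c(qf(0.95, 1, 82), qt(0.975, 82)^2))          # the two critical values coincide
pval <- pf(Fstat, 1, 82, lower.tail = FALSE)        # same p-value as the t-test
print(pval)
```

Because the statistic and the critical value are both squared versions of their t counterparts, the two tests always reach the same decision in simple linear regression.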

Analysis of variance (ANOVA) table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square          F statistic
(Source)              (SS)             (d.f.)               (MS)
Regression            SSR              1                    MSR = SSR/1          F = MSR/MSE
Error                 SSE              n-2                  MSE = SSE/(n-2)
Total                 SST              n-1

Crime Rate Example continued:
- Test the significance of the linear relationship between the crime rate and the high-school graduation rate at $\alpha = 0.05$

> anova(g1)
Analysis of Variance Table

Response: rate
           Df    Sum Sq  Mean Sq F value    Pr(>F)
percentage  1  93462942 93462942  16.834 9.571e-05 ***
Residuals  82 455273165  5552112
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


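Prediction of Future Observations

To predict the value of a future response $Y^*$ at a specified value $x^*$

Use a confidence interval to estimate the fixed unknown mean of $Y^*$, denoted by $\mu^* = E(Y^*) = \beta_0 + \beta_1 x^*$

$$\hat\mu^* = \hat\beta_0 + \hat\beta_1 x^*$$

$$\hat\mu^* \pm t_{n-2, \alpha/2} \, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

Use a prediction interval to predict the value of the r.v. $Y^*$

$$\hat{Y}^* - Y^* \sim N\left(0, \; \sigma^2 \left[1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]\right), \qquad \hat{Y}^* = \hat\beta_0 + \hat\beta_1 x^*$$

$$\hat{Y}^* \pm t_{n-2, \alpha/2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

where $S_{xx} = \sum (x_i - \bar{x})^2$.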


Crime rate example continued:
(a) Calculate a 95% CI for the average crime rate in counties with an 80% high-school graduation rate;
(b) Calculate a 95% PI for the crime rate of a future selected county with an 80% high-school graduation rate.

> predict(g1,data.frame(percentage=80), interval="confidence")
$fit
       fit      lwr      upr
1 6871.585 6347.116 7396.054

> predict(g1,data.frame(percentage=80), interval="prediction")
$fit
       fit     lwr      upr
1 6871.585 2154.92 11588.25
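Both intervals can be reproduced by hand from the earlier `predict(g1, ..., se=TRUE)` output in these notes. The sketch below relies on the identity that the standard error of the fitted mean is $s\sqrt{1/n + (x^*-\bar{x})^2/S_{xx}}$, so the prediction standard error is $\sqrt{s^2 + \text{se.fit}^2}$. The variable names are illustrative.

```r
# Values copied from predict(g1, data.frame(percentage=80), se=TRUE) in these notes
fit    <- 6871.585    # fitted mean at percentage = 80
se_fit <- 263.6425    # standard error of the fitted mean
s      <- 2356.292    # residual standard error
df     <- 82          # n - 2
tcrit  <- qt(0.975, df)

conf_int <- fit + c(-1, 1) * tcrit * se_fit                  # 95% CI for the mean response
pred_int <- fit + c(-1, 1) * tcrit * sqrt(s^2 + se_fit^2)    # 95% PI for a new response
print(conf_int)
print(pred_int)
```

The prediction interval is much wider because it adds the variance $\sigma^2$ of a single new observation to the uncertainty in the estimated mean.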


> grid=seq(60,90,1)
> conf=predict(g1,data.frame(percentage=grid),interval="confidence")
> pred=predict(g1,data.frame(percentage=grid),interval="prediction")
> matplot(grid,pred,lty=c(1,2,2),col=c("red","green","green"),type="l",lwd=2,main="CI vs PI",
+         xlab="Percentage of having at least high school diplomas",
+         ylab="Crime Rate (per 100K residents)")
> matplot(grid,conf[,2:3],lty=c(2,2),col=c("blue","blue"),type="l",add=TRUE,lwd=2)

[Figure: CI vs PI. Fitted line with confidence and prediction bands. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents).]

Both the CI and the PI are narrowest when $x^* = \bar{x}$.

Predicting beyond the range of the observed data (extrapolation) is risky and should generally be avoided.
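The widening of both intervals away from $\bar{x}$ can be seen numerically. The R sketch below uses a small hypothetical data set (the `x` and `y` values are made up for illustration, not the crime rate data) and a grid that deliberately extends past the observed range.

```r
# Hypothetical toy data (not the crime rate data); mean(x) is 4
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1)
fit  <- lm(y ~ x)
grid <- data.frame(x = seq(0, 10, by = 0.5))   # extends beyond the observed range 1..7

conf <- predict(fit, grid, interval = "confidence")
pred <- predict(fit, grid, interval = "prediction")
ci_width <- conf[, "upr"] - conf[, "lwr"]
pi_width <- pred[, "upr"] - pred[, "lwr"]

print(grid$x[which.min(ci_width)])   # 4, the sample mean of x
print(grid$x[which.min(pi_width)])   # 4 as well
print(cbind(grid, ci_width, pi_width))
```

The widths grow with $(x^* - \bar{x})^2$, so they are largest at the ends of the grid, which is one quantitative reason extrapolation is unreliable. The formulas also assume the linear model still holds outside the observed range, which the data cannot confirm.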

