Chapter 10 Simple Linear Regression and Correlation
Linear Regression
Methods for studying the relationship of two or more quantitative variables
Examples:
• Predict salary from education and years of experience
• Predict sales from the amount of advertising expenditures
• Predict vocabulary size from the age and amount of education of parents
Variables:
• Response/outcome/dependent variable
• Predictor/explanatory/independent variable
Relationships between the response and predictor variables
• Functional or mathematical relation: $Y = f(x)$ – deterministic
• Structural or statistical relation: $Y = f(x) + \text{error}$ – stochastic/probabilistic
Goals:
1) What is a reasonable model? (a) the function $f$ (b) the errors
2) When $f$ has unknown parameters, estimate the parameters
3) Predict $Y$ at a new $x$
Simple Linear Regression (SLR)
Basic model: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $i = 1, \dots, n$
• $Y_i$: the response/dependent variable
• $x_i$: the predictor/explanatory/independent variable
• $y_i$: the observed value of $Y_i$
• $x_i$: treated as a fixed quantity (or conditioned upon)
• $\epsilon_i$: the random error, typically assumed to have $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$, and usually assumed normally distributed
Key assumptions (to be checked later):
• Linear relationship
• Independent (uncorrelated) errors
• Constant variance errors
• Normally distributed errors
The SLR model can also be written as
$Y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$
• The mean of $Y$ given $x$ (known as the conditional mean) is a linear function of $x$ given by $E(Y \mid x) = \beta_0 + \beta_1 x$
• $\beta_0$ is the conditional mean when $x = 0$
• If we replace $x$ by $x - \bar{x}$ then $\beta_0$ is interpreted as the conditional mean when $x = \bar{x}$
• $\beta_1$ is the slope, i.e. the change in the mean of $Y$ per unit change in $x$
• $\sigma^2$ is the variation of responses about the mean
• The relationship is described by the true regression line $E(Y \mid x) = \beta_0 + \beta_1 x$
• The model is called “linear” not because it is linear in $x$, but rather because it is linear in the parameters $\beta_0$ and $\beta_1$
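As a rough illustration of the model, the following R sketch simulates responses from an SLR model and evaluates the conditional mean at a few points. The parameter values (`beta0 = 2`, `beta1 = 0.5`, `sigma = 1`), the sample size and the `x` values are assumptions chosen only for illustration; they are unrelated to any data set in these notes.

```r
# Simulate data from the SLR model Y_i = beta0 + beta1 * x_i + eps_i.
# All numeric values below are illustrative assumptions.
set.seed(1)
n     <- 200
beta0 <- 2      # assumed intercept
beta1 <- 0.5    # assumed slope
sigma <- 1      # assumed error standard deviation

x   <- seq(0, 10, length.out = n)        # predictor, treated as fixed
eps <- rnorm(n, mean = 0, sd = sigma)    # independent N(0, sigma^2) errors
y   <- beta0 + beta1 * x + eps           # simulated responses

# Conditional mean E(Y | x) on the true regression line at x = 0, 1, 2
cond_mean <- beta0 + beta1 * c(0, 1, 2)
print(cond_mean)        # 2.0 2.5 3.0
print(diff(cond_mean))  # each unit change in x shifts the mean by beta1
```

The first value is the conditional mean at `x = 0` (the intercept), and the constant differences show the slope interpretation.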
Example: Crime Rate
A criminologist studying the relationship between level of education and crime rate in medium-sized U.S. counties collected the following data for a random sample of 84 counties; $x$ is the percentage of individuals in the county having at least a high-school diploma and $Y$ is the crime rate (crimes reported per 100,000 residents) last year.
[Figure: Scatter Plot for the Crime Rate Data. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents).]
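Fitting the SLR model - least squares (LS) estimation
Choose $\beta_0, \beta_1$ to minimize the sum of squared deviations (vertical distances) of all data points to the fitted line:
$Q(\beta_0, \beta_1) = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2$
$(\hat{\beta}_0, \hat{\beta}_1) \equiv \arg\min_{\beta_0, \beta_1} Q(\beta_0, \beta_1)$
Taking first partial derivatives and setting them equal to zero yields the normal equations:
$n\hat{\beta}_0 + \hat{\beta}_1 \sum x_i = \sum y_i$
$\hat{\beta}_0 \sum x_i + \hat{\beta}_1 \sum x_i^2 = \sum x_i y_i$
which are equivalent to
$\sum [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)] = 0$
$\sum x_i [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)] = 0$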
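• Least squares estimators:
$\hat{\beta}_0 = \frac{(\sum x_i^2)(\sum y_i) - (\sum x_i)(\sum x_i y_i)}{n \sum x_i^2 - (\sum x_i)^2}$
$\hat{\beta}_1 = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2}$
With the notation
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{1}{n} (\sum x_i)(\sum y_i)$
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{1}{n} (\sum x_i)^2$
$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{1}{n} (\sum y_i)^2$
the estimators can be written as
$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$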
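The closed-form least squares estimators (slope = Sxy/Sxx, intercept = ybar - slope * xbar) can be checked against `lm()`. The sketch below uses simulated data; the sample size, the true coefficients 3 and 2, and the noise level are assumptions made for illustration.

```r
# Simulated data; the true values 3 and 2 are illustrative assumptions.
set.seed(123)
n <- 40
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)

# Centered sums of squares and cross-products
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))

# Equivalent computational forms
Sxx_alt <- sum(x^2) - sum(x)^2 / n
Sxy_alt <- sum(x * y) - sum(x) * sum(y) / n

# Least squares estimates computed by hand
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)

# Same estimates from lm()
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))   # should agree with the hand computation
```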
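• $\hat{\beta}_0$ and $\hat{\beta}_1$ are the best linear unbiased estimates of $\beta_0$ and $\beta_1$
• The fitted values: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
• Residuals: $e_i = y_i - \hat{y}_i$
• Least squares (LS) line: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
$(\bar{x}, \bar{y})$ is the “centroid” of the scatter plot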
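Two properties of the LS fit follow directly from the normal equations: the residuals sum to zero (also when weighted by the predictor), and the LS line passes through the centroid. A minimal check on simulated data (all data-generating values are assumptions for illustration):

```r
# Simulated data; generating values are illustrative assumptions.
set.seed(123)
n <- 40
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)      # fitted values
e     <- residuals(fit)   # residuals y - y_hat

# Normal equations in residual form: both sums are numerically zero
print(sum(e))
print(sum(x * e))

# The LS line evaluated at mean(x) gives mean(y)
y_at_xbar <- unname(coef(fit)[1] + coef(fit)[2] * mean(x))
print(c(y_at_xbar = y_at_xbar, ybar = mean(y)))
```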
[Figure: Scatter Plot for the Crime Rate Data with the fitted LS line $\hat{y} = 20517.6 - 170.58x$ and the centroid $(\bar{x}, \bar{y})$ marked. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents).]
[Figure: Residuals plotted against observation index. y-axis: Residuals.]
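Goodness of fit of the LS line
Residuals: $e_i = y_i - \hat{y}_i$
Error sum of squares (SSE): $SSE = \sum_{i=1}^{n} e_i^2$
Compare with the SSE for the simplest model $Y_i = \beta_0 + \epsilon_i$: there $\hat{\beta}_0 = \bar{y}$, and the SSE is $\sum (y_i - \bar{y})^2$, referred to as the (corrected) total sum of squares (SST), which measures the variability of the $y_i$ around their mean
Then SST can be decomposed as
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
SST = SSR + SSE
SSR: the regression sum of squares, which measures the variation in $y$ that is accounted for by regression on $x$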
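The coefficient of determination:
$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$, $0 \le r^2 \le 1$
which represents the proportion of variation in $y$ that is accounted for by regression on $x$.
Relationship to the sample correlation coefficient $r$: $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \hat{\beta}_1 \sqrt{\frac{S_{xx}}{S_{yy}}}$, and the coefficient of determination equals the square of $r$.
The sign of $r$ is the same as the sign of $\hat{\beta}_1$.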
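The decomposition SST = SSR + SSE and the link between the coefficient of determination and the sample correlation can be verified numerically. This is a sketch on simulated data; the data-generating values are assumptions for illustration.

```r
# Simulated data; generating values are illustrative assumptions.
set.seed(123)
n <- 40
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)            # total sum of squares
SSR <- sum((fitted(fit) - mean(y))^2)  # regression sum of squares
SSE <- sum(residuals(fit)^2)           # error sum of squares

r2 <- SSR / SST                        # coefficient of determination
r  <- cor(x, y)                        # sample correlation coefficient

print(c(SST = SST, SSR_plus_SSE = SSR + SSE))
print(c(r2 = r2, r_squared = r^2, lm_r2 = summary(fit)$r.squared))
print(sign(r) == sign(coef(fit)[["x"]]))  # same sign as the slope
```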
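Estimation of $\sigma^2$
A common unbiased estimator of $\sigma^2$ is given by
$s^2 = \frac{\sum e_i^2}{n - 2} = \frac{SSE}{n - 2} = MSE$
MSE: Mean square error
• The d.f. for $s^2$ is $n - 2$ since 2 unknown parameters, $\beta_0$ and $\beta_1$, are estimated from the data of size $n$.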
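A short sketch of the estimator SSE/(n - 2), using `deviance()` and `df.residual()` and comparing with the residual standard error reported by `summary()`. The data are simulated under assumed values (true error s.d. 1.5).

```r
# Simulated data; generating values are illustrative assumptions.
set.seed(123)
n <- 40
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

SSE <- deviance(fit)      # error sum of squares for an lm fit
df  <- df.residual(fit)   # n - 2 in simple linear regression
s2  <- SSE / df           # estimate of sigma^2 (the MSE)
s   <- sqrt(s2)           # estimate of sigma

print(c(SSE = SSE, df = df, s2 = s2, s = s))
print(summary(fit)$sigma) # residual standard error, should equal s
```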
Crime rate example continued:
Obtain the point estimates of the following:
(1) The difference in the mean crime rate for two counties whose high-school graduation rates differ by one percentage point;
(2) The mean crime rate last year in counties with high-school graduation percentage X=80;
(3) The random error variance $\sigma^2$.
# read in the data set
> crime=read.table("crimerate.txt",header=FALSE)
> names(crime)=c("rate","percentage")
# scatter plot
> plot(crime$percentage,crime$rate,main="Scatter Plot for the Crime Rate Data", xlab="Percentage of having at least high school diplomas", ylab="Crime Rate (per 100K residents)",type="p",pch=16)
# fitting a SLR model using least squares
> g1=lm(rate~percentage,data=crime)
# adding the fitted regression line to the scatter plot
> abline(g1,col="red",lwd=2)
[Figure: Scatter Plot for the Crime Rate Data with the fitted regression line in red. x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents).]
# LS estimation results
> summary(g1)

Call:
lm(formula = rate ~ percentage, data = crime)

Residuals:
    Min      1Q  Median      3Q     Max
-5278.3 -1757.5  -210.5  1575.3  6803.3

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
percentage   -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared: 0.1703, Adjusted R-squared: 0.1602
F-statistic: 16.83 on 1 and 82 DF, p-value: 9.571e-05
> summary(g1)$coeff
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 20517.5999 3277.64269  6.259865 1.672906e-08
percentage   -170.5752   41.57433 -4.102897 9.571396e-05

> predict(g1,data.frame(percentage=80),se=TRUE)
$fit
       1
6871.585

$se.fit
[1] 263.6425

$df
[1] 82

$residual.scale
[1] 2356.292

> deviance(g1) # SSE
[1] 455273165
> df.residual(g1) # df for SSE
[1] 82
> sqrt(deviance(g1)/df.residual(g1)) # estimate for sigma
[1] 2356.292
> residuals(g1)
          1           2           3           4           5           6           7           8
  591.96401  1648.56552  1660.99033  1518.99033   568.44147  -159.63749 -2357.48712  -828.00967
          9          10          11          12          13          14          15          16
   97.96401  1401.56552 -1233.46080   285.56552  2426.26477 -1594.28410 -1493.43448 -2615.16004
…
         81          82          83          84
-1363.25778  2533.01666   621.14071    28.11439

> summary(g1)$residuals # same as residuals(g1)
> sum(residuals(g1)^2) # SSE
[1] 455273165

> plot(residuals(g1),pch=16,main="Scatter Plot of Residuals",ylab="Residuals",xlab="")
> abline(h=0,lty=2)
[Figure: Scatter Plot of Residuals, plotted against observation index with a dashed horizontal line at 0. y-axis: Residuals.]
> fitted.values(g1)
       1        2        3        4        5        6        7        8        9
7895.036 6530.434 6701.010 6701.010 5677.559 9259.637 8918.487 6701.010 7895.036
      10       11       12       13       14       15       16       17       18
6530.434 7724.461 6530.434 7212.735 6189.284 6530.434 7042.160 7212.735 8065.611
…
      82       83       84
5506.983 6359.859 7553.886
> plot(crime$percentage,fitted.values(g1),pch=16,xlab="Percentage",ylab="Fitted Values")
> abline(g1,lty=2)
> plot(fitted.values(g1),residuals(g1),main="",ylab="Residuals",xlab=expression(hat(y)),pch=16)
> plot(crime$percentage,residuals(g1),main="",ylab="Residuals",xlab="Percentage",pch=16)
[Figure: Fitted Values against Percentage, with the fitted line dashed.]
[Figure: Residuals against fitted values $\hat{y}$.]
[Figure: Residuals against Percentage.]
Statistical Inference for Simple Linear Regression
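Inference on $\beta_0$ and $\beta_1$
$\hat{\beta}_0$ and $\hat{\beta}_1$ are linear combinations of the $Y_i$'s, e.g. $\hat{\beta}_1 = \frac{\sum (x_i - \bar{x}) Y_i}{\sum (x_i - \bar{x})^2}$, so under normal errors they are normally distributed with
$E(\hat{\beta}_0) = \beta_0$, $\mathrm{Var}(\hat{\beta}_0) = \frac{\sigma^2 \sum x_i^2}{n S_{xx}}$
$E(\hat{\beta}_1) = \beta_1$, $\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}}$
$\frac{\hat{\beta}_0 - \beta_0}{SD(\hat{\beta}_0)} \sim N(0,1)$ and $\frac{\hat{\beta}_1 - \beta_1}{SD(\hat{\beta}_1)} \sim N(0,1)$
$\frac{(n-2) S^2}{\sigma^2} \sim \chi^2_{n-2}$
$S^2$ is distributed independently of $\hat{\beta}_0$ and $\hat{\beta}_1$
$SE(\hat{\beta}_0) = s \sqrt{\frac{\sum x_i^2}{n S_{xx}}}$ and $SE(\hat{\beta}_1) = \frac{s}{\sqrt{S_{xx}}}$
$\frac{\hat{\beta}_0 - \beta_0}{SE(\hat{\beta}_0)} \sim t_{n-2}$ and $\frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim t_{n-2}$
$100(1-\alpha)\%$ CI’s on $\beta_0$ and $\beta_1$ are given by
$\hat{\beta}_0 \pm t_{n-2, \alpha/2} \, SE(\hat{\beta}_0)$
$\hat{\beta}_1 \pm t_{n-2, \alpha/2} \, SE(\hat{\beta}_1)$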
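The standard errors and t-based confidence intervals can be computed by hand and compared with `summary()` and `confint()`. The sketch below uses simulated data (all generating values are assumptions) and a 95% level.

```r
# Simulated data; generating values are illustrative assumptions.
set.seed(123)
n <- 40
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

b   <- unname(coef(fit))                 # (b0, b1)
s   <- sqrt(deviance(fit) / (n - 2))     # estimate of sigma
Sxx <- sum((x - mean(x))^2)

se_b0 <- s * sqrt(sum(x^2) / (n * Sxx))  # SE of the intercept
se_b1 <- s / sqrt(Sxx)                   # SE of the slope

alpha <- 0.05
tcrit <- qt(1 - alpha / 2, df = n - 2)   # upper alpha/2 critical value

ci_b0 <- b[1] + c(-1, 1) * tcrit * se_b0
ci_b1 <- b[2] + c(-1, 1) * tcrit * se_b1

print(rbind(ci_b0, ci_b1))
print(confint(fit, level = 1 - alpha))   # should match the rows above
```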
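Hypotheses tests: $H_0: \beta_1 = \beta_1^0$ vs. $H_1: \beta_1 \neq \beta_1^0$
Use the t-test:
$t = \frac{\hat{\beta}_1 - \beta_1^0}{SE(\hat{\beta}_1)} \sim t_{n-2}$ when $H_0$ is true
Reject $H_0$ at level $\alpha$ if $|t| > t_{n-2, \alpha/2}$, or if the p-value $= 2 P(T_{n-2} > |t|) < \alpha$
Particularly, for testing if there is a linear relationship, $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$
Reject $H_0$ at level $\alpha$ if $|t| = \frac{|\hat{\beta}_1|}{SE(\hat{\beta}_1)} > t_{n-2, \alpha/2}$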
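A minimal sketch of the two-sided t-test for a zero slope, computed by hand and compared with the `summary()` coefficient table. The data are simulated and the level 0.05 is an assumption for illustration.

```r
# Simulated data; generating values are illustrative assumptions.
set.seed(123)
n <- 40
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

b1    <- coef(fit)[["x"]]
se_b1 <- sqrt(deviance(fit) / (n - 2)) / sqrt(sum((x - mean(x))^2))

beta1_null <- 0                                  # hypothesized slope under H0
t_stat <- (b1 - beta1_null) / se_b1              # t statistic
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

alpha  <- 0.05
reject <- abs(t_stat) > qt(1 - alpha / 2, df = n - 2)

print(c(t = t_stat, p = p_val))
print(reject)
print(summary(fit)$coefficients["x", c("t value", "Pr(>|t|)")])
```

The critical-value rule and the p-value rule always give the same decision, since both compare the same statistic with the same t distribution.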
Crime Rate Example continued:
(1) Test linear relationship at $\alpha = 0.05$

> summary(g1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20517.60    3277.64   6.260 1.67e-08 ***
percentage   -170.58      41.57  -4.103 9.57e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2356 on 82 degrees of freedom
Multiple R-squared: 0.1703, Adjusted R-squared: 0.1602
F-statistic: 16.83 on 1 and 82 DF, p-value: 9.571e-05

Since the p-value for percentage (9.57e-05) is below 0.05, $H_0: \beta_1 = 0$ is rejected.
(2) Calculate a 95% CI on the change in the mean crime rate for every one percentage point increase in high-school graduation rate.
> confint(g1)
                 2.5 %      97.5 %
(Intercept) 13997.3245 27037.87538
percentage   -253.2798   -87.87061

> # we can specify a particular parameter
> # as well as change confidence level
> confint(g1,"percentage",level=0.9)
                 5 %      95 %
percentage -239.7403 -101.4101
Analysis of Variance (ANOVA) for SLR
ANOVA is a statistical technique to decompose the total variability in the $y_i$’s into separate variance components associated with specific sources
Decomposition of the variability and degrees of freedom (d.f.)
$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
SST = SSR + SSE
d.f.: $n-1 = 1 + (n-2)$
A mean square is defined by a sum of squares divided by its d.f.
Mean square regression: $MSR = SSR/1$
Mean square error: $MSE = SSE/(n-2)$
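Since $F = \frac{MSR}{MSE} \sim F_{1, n-2}$ when $H_0: \beta_1 = 0$ is true,
we can test $H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$ at level $\alpha$ by rejecting $H_0$ if $F > f_{1, n-2, \alpha}$ (equivalent to $|t| > t_{n-2, \alpha/2}$, since $F = t^2$)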
Analysis of variance (ANOVA) table

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F statistic
(Source)              (SS)             (d.f.)               (MS)
Regression            SSR              1                    MSR = SSR/1         F = MSR/MSE
Error                 SSE              n-2                  MSE = SSE/(n-2)
Total                 SST              n-1
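The entries of the ANOVA table can be rebuilt from the sums of squares and compared with `anova()`; the same sketch checks that the F statistic equals the squared t statistic for the slope. The data are simulated under assumed values.

```r
# Simulated data; generating values are illustrative assumptions.
set.seed(123)
n <- 40
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

SSR <- sum((fitted(fit) - mean(y))^2)
SSE <- sum(residuals(fit)^2)

MSR    <- SSR / 1            # mean square regression
MSE    <- SSE / (n - 2)      # mean square error
F_stat <- MSR / MSE
p_val  <- pf(F_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

t_stat <- summary(fit)$coefficients["x", "t value"]

print(c(F = F_stat, t_squared = t_stat^2, p = p_val))
print(anova(fit))            # same Df, Sum Sq, Mean Sq, F value, Pr(>F)
```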
Crime Rate Example continued:
- Test the significance of the linear relationship between the crime rate and the high-school graduation rate at $\alpha = 0.05$

> anova(g1)
Analysis of Variance Table

Response: rate
           Df    Sum Sq  Mean Sq F value    Pr(>F)
percentage  1  93462942 93462942  16.834 9.571e-05 ***
Residuals  82 455273165  5552112
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
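Prediction of Future Observations
To predict the value of a future response $Y^*$ at a specified value $x^*$
Use a confidence interval to estimate the fixed unknown mean of $Y^*$, denoted by $\mu^* = E(Y^*) = \beta_0 + \beta_1 x^*$
$\hat{\mu}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*$
$\hat{\mu}^* \pm t_{n-2, \alpha/2} \, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$
Use a prediction interval to predict the value of the r.v. $Y^*$
$\hat{Y}^* - Y^* \sim N\left(0, \sigma^2 \left[1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}\right]\right)$, where $\hat{Y}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*$
$\hat{Y}^* \pm t_{n-2, \alpha/2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$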
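The confidence interval for the mean response and the prediction interval for a new observation differ only by the extra "1 +" under the square root. The sketch below computes both by hand and compares them with `predict()`; the simulated data and the point `x_star = 4` are assumptions for illustration.

```r
# Simulated data; generating values and x_star are illustrative assumptions.
set.seed(123)
n <- 40
d <- data.frame(x = runif(n, 0, 10))
d$y <- 3 + 2 * d$x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x, data = d)

x_star <- 4
s      <- sqrt(deviance(fit) / (n - 2))
Sxx    <- sum((d$x - mean(d$x))^2)
tcrit  <- qt(0.975, df = n - 2)          # 95% intervals

mu_hat  <- unname(coef(fit)[1] + coef(fit)[2] * x_star)
se_mean <- s * sqrt(1 / n + (x_star - mean(d$x))^2 / Sxx)      # for the CI
se_pred <- s * sqrt(1 + 1 / n + (x_star - mean(d$x))^2 / Sxx)  # for the PI

ci <- mu_hat + c(-1, 1) * tcrit * se_mean
pi <- mu_hat + c(-1, 1) * tcrit * se_pred   # note: masks the constant pi here

new <- data.frame(x = x_star)
print(rbind(ci, pi))
print(predict(fit, new, interval = "confidence"))
print(predict(fit, new, interval = "prediction"))
```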
Crime rate example continued:
(a) Calculate a 95% CI for the average crime rate in counties with 80% high-school graduation rate;
(b) Calculate a 95% PI for the crime rate of a future selected county with 80% high-school graduation rate.

> predict(g1,data.frame(percentage=80), interval="confidence")
       fit      lwr      upr
1 6871.585 6347.116 7396.054

> predict(g1,data.frame(percentage=80), interval="prediction")
       fit     lwr      upr
1 6871.585 2154.92 11588.25
# CI and PI bands over a grid of percentages
> grid=seq(60,90,1)
> conf=predict(g1,data.frame(percentage=grid),interval="confidence")
> pred=predict(g1,data.frame(percentage=grid),interval="prediction")
# fitted line (red) with prediction limits (green)
> matplot(grid,pred,lty=c(1,2,2),col=c("red","green","green"),type="l",lwd=2,main="CI vs PI", xlab="Percentage of having at least high school diplomas", ylab="Crime Rate (per 100K residents)")
# add confidence limits (blue)
> matplot(grid,conf[,2:3],lty=c(2,2),col=c("blue","blue"),type="l",add=TRUE,lwd=2)
[Figure: CI vs PI. Fitted line (red) with prediction limits (green, dashed) and confidence limits (blue, dashed). x-axis: Percentage of having at least high school diplomas; y-axis: Crime Rate (per 100K residents).]
Both CI and PI have shortest widths when $x^* = \bar{x}$.
Predicting beyond the range of observed data (extrapolation) is risky and should generally be avoided.
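To see how the interval widths depend on the distance from the mean of the predictor, the following sketch evaluates both intervals on a grid that includes the sample mean and extends beyond the observed range. The simulated data and the grid limits are assumptions chosen for illustration.

```r
# Simulated data; generating values and grid limits are illustrative assumptions.
set.seed(123)
n <- 40
d <- data.frame(x = runif(n, 0, 10))
d$y <- 3 + 2 * d$x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x, data = d)

xbar <- mean(d$x)
# Grid reaching outside the observed range [0, 10], with xbar included
grid <- sort(c(xbar, seq(-5, 15, by = 0.5)))
new  <- data.frame(x = grid)

conf <- predict(fit, new, interval = "confidence")
pred <- predict(fit, new, interval = "prediction")

w_ci <- unname(conf[, "upr"] - conf[, "lwr"])   # CI widths
w_pi <- unname(pred[, "upr"] - pred[, "lwr"])   # PI widths

print(c(narrowest_at = grid[which.min(w_ci)], xbar = xbar))
print(range(w_ci))   # widths grow as x moves away from xbar
print(range(w_pi))
```

The widths at the ends of the grid, which lie outside the observed data, are the largest; this is one numerical symptom of the risk in extrapolation, although it does not capture the further risk that the linear form itself may fail there.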