Linear Regression
Hypothesis testing and Estimation
Assume that we have collected data on two variables X and Y. Let
(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)
denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population)
The Statistical Model
Each yi is assumed to be randomly generated from a normal distribution with
mean μi = α + βxi and standard deviation σ (α, β and σ are unknown).

[Figure: the regression line Y = α + βX with slope β, showing the point (xi, α + βxi) for an observation yi.]
The Data and the Linear Regression Model
• The data fall roughly about a straight line.

[Scatter plot of the data with the unseen regression line Y = α + βX superimposed.]
The Least Squares Line
Fitting the best straight line to “linear” data

Let Y = a + bX
denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y.
For example, if X = xi (as for the ith case) then the predicted value of Y is:
ŷi = a + bxi
The residual
ri = yi − ŷi = yi − (a + bxi)
can be computed for each case in the sample:
r1 = y1 − ŷ1, r2 = y2 − ŷ2, …, rn = yn − ŷn.
The residual sum of squares (RSS) is
RSS = Σ ri² = Σ (yi − ŷi)² = Σ (yi − a − bxi)²  (sum over i = 1, …, n),
a measure of the “goodness of fit” of the line Y = a + bX to the data.
The optimal choice of a and b will result in the residual sum of squares
RSS = Σ ri² = Σ (yi − ŷi)² = Σ (yi − a − bxi)²
attaining a minimum.
If this is the case then the line
Y = a + bX
is called the Least Squares Line.
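To make the residual sum of squares concrete, here is a minimal Python sketch (not part of the original slides; the function name `rss` and the data values are made up for illustration). A line close to the data gives a small RSS; a poor line gives a large one.

```python
def rss(a, b, xs, ys):
    """Residual sum of squares for the candidate line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Illustrative data (not from the slides)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

good = rss(0.1, 1.95, xs, ys)   # a line close to the data: small RSS
bad = rss(5.0, -1.0, xs, ys)    # a poor line: large RSS
```

The least squares line is the choice of a and b that minimizes this quantity over all possible lines.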
The equation for the least squares line

Let
Sxx = Σ (xi − x̄)²
Syy = Σ (yi − ȳ)²
Sxy = Σ (xi − x̄)(yi − ȳ)

Computing Formulae:
Sxx = Σ xi² − (Σ xi)² / n
Syy = Σ yi² − (Σ yi)² / n
Sxy = Σ xiyi − (Σ xi)(Σ yi) / n
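The three computing formulae can be sketched directly in Python (an illustration, not part of the slides; the function names are mine):

```python
def s_xx(xs):
    """S_xx = sum of x_i^2 minus (sum of x_i)^2 / n."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n

def s_yy(ys):
    """S_yy = sum of y_i^2 minus (sum of y_i)^2 / n."""
    n = len(ys)
    return sum(y * y for y in ys) - sum(ys) ** 2 / n

def s_xy(xs, ys):
    """S_xy = sum of x_i*y_i minus (sum x_i)(sum y_i) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
```

These are algebraically identical to the mean-centered definitions Σ(xi − x̄)², Σ(yi − ȳ)² and Σ(xi − x̄)(yi − ȳ), but avoid computing the deviations explicitly.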
Then the slope of the least squares line can be shown to be:
b = Sxy / Sxx = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
This is an estimator of the slope, β, in the regression model,
and the intercept of the least squares line can be shown to be:
a = ȳ − b x̄ = ȳ − (Sxy / Sxx) x̄
This is an estimator of the intercept, α, in the regression model.
The residual sum of squares
RSS = Σ (yi − ŷi)² = Σ (yi − a − bxi)²
Computing formula:
RSS = Syy − Sxy² / Sxx
Estimating σ, the standard deviation in the regression model:
s = √[ Σ (yi − ŷi)² / (n − 2) ] = √[ Σ (yi − a − bxi)² / (n − 2) ]
Computing formula:
s = √[ (Syy − Sxy² / Sxx) / (n − 2) ]
This estimate of σ is said to be based on n − 2 degrees of freedom.
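Putting the pieces together, a sketch of the least squares fit and the estimate of σ (the function name `fit` is mine, not from the slides):

```python
import math

def fit(xs, ys):
    """Return (a, b, s): intercept, slope, and the estimate of sigma
    based on n - 2 degrees of freedom, via the computing formulae."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = sxy / sxx                       # estimator of the slope beta
    a = sum(ys) / n - b * sum(xs) / n   # estimator of the intercept alpha
    rss = syy - sxy ** 2 / sxx          # residual sum of squares
    s = math.sqrt(max(rss, 0.0) / (n - 2))  # guard against tiny negative rounding
    return a, b, s
```

For data lying exactly on a line, the residual sum of squares is zero and s = 0.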
Sampling distributions of the estimators

The sampling distribution of the slope of the least squares line:
b = Sxy / Sxx = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
It can be shown that b has a normal distribution with mean and standard deviation
μb = β and σb = σ / √Sxx = σ / √[ Σ (xi − x̄)² ]
Thus
z = (b − β) / (σ / √Sxx)
has a standard normal distribution, and
t = (b − β) / (s / √Sxx)
has a t distribution with df = n − 2.
The sampling distribution of the intercept of the least squares line:
a = ȳ − b x̄
It can be shown that a has a normal distribution with mean and standard deviation
μa = α and σa = σ √[ 1/n + x̄² / Σ (xi − x̄)² ] = σ √[ 1/n + x̄² / Sxx ]
Thus
z = (a − α) / ( σ √[ 1/n + x̄² / Sxx ] )
has a standard normal distribution, and
t = (a − α) / ( s √[ 1/n + x̄² / Sxx ] )
has a t distribution with df = n − 2.
(1 − α)100% Confidence Limits for the slope β:
b ± t(α/2) s / √Sxx
where t(α/2) is the critical value for the t-distribution with n − 2 degrees of freedom.
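A sketch of these confidence limits in Python (function name mine; `t_crit` stands in for the tabled t(α/2) value, which could also come from `scipy.stats.t.ppf(1 - alpha/2, n - 2)`). The numbers plugged in below are from the cigarette example worked later in these slides.

```python
import math

def slope_ci(b, s, sxx, t_crit):
    """(1 - alpha)100% confidence limits for the slope beta."""
    margin = t_crit * s / math.sqrt(sxx)
    return b - margin, b + margin

# Cigarette example (later in these slides):
# b = 0.228, s = 8.35, S_xx = 14322.55, t_.025 = 2.262 with 9 df
lo, hi = slope_ci(0.2284, 8.35, 14322.55, 2.262)  # approx (0.071, 0.386)
```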
Testing the slope
H0: β = β0 vs HA: β ≠ β0
The test statistic is:
t = (b − β0) / (s / √Sxx)
which has a t distribution with df = n − 2 if H0 is true.
The Critical Region
Reject H0: β = β0 in favour of HA: β ≠ β0
if t < −t(α/2) or t > t(α/2), df = n − 2.
This is a two-tailed test. One-tailed tests are also possible.
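The test statistic can be sketched as follows (function name mine). With the cigarette-example values used later in these slides (b = 0.228, s = 8.35, Sxx = 14322.55, β0 = 0) it gives t ≈ 3.27.

```python
import math

def slope_t(b, beta0, s, sxx):
    """Test statistic for H0: beta = beta0; t-distributed with
    n - 2 degrees of freedom when H0 is true."""
    return (b - beta0) / (s / math.sqrt(sxx))
```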
(1 − α)100% Confidence Limits for the intercept α:
a ± t(α/2) s √[ 1/n + x̄² / Sxx ]
where t(α/2) is the critical value for the t-distribution with n − 2 degrees of freedom.
Testing the intercept
H0: α = α0 vs HA: α ≠ α0
The test statistic is:
t = (a − α0) / ( s √[ 1/n + x̄² / Sxx ] )
which has a t distribution with df = n − 2 if H0 is true.
The Critical Region
Reject H0: α = α0 in favour of HA: α ≠ α0
if t < −t(α/2) or t > t(α/2), df = n − 2.
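The analogous sketch for the intercept test (function name mine). Run on the cigarette-example values from later in these slides it gives t ≈ 1.38, which is my computation rather than a number from the slides; it is consistent with the 95% interval for α there covering 0.

```python
import math

def intercept_t(a, alpha0, s, n, xbar, sxx):
    """Test statistic for H0: alpha = alpha0; the standard error of a
    is s * sqrt(1/n + xbar^2 / S_xx)."""
    se = s * math.sqrt(1 / n + xbar ** 2 / sxx)
    return (a - alpha0) / se
```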
Example
The following data showed the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950. TABLE : Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.
Country (i)     Xi      Yi
Australia       48      18
Canada          50      15
Denmark         38      17
Finland         110     35
Great Britain   110     46
Holland         49      24
Iceland         23      6
Norway          25      9
Sweden          30      11
Switzerland     51      25
USA             130     20
[Scatter plot: death rates from lung cancer (1950) against per capita consumption of cigarettes (1930), with each country labelled.]
Fitting the Least Squares Line

First compute the following quantities:
Σ xi = 664,  Σ yi = 226
Σ xi² = 54,404,  Σ yi² = 6,018,  Σ xiyi = 16,914

Sxx = 54,404 − 664²/11 = 14,322.55
Syy = 6,018 − 226²/11 = 1,374.73
Sxy = 16,914 − (664)(226)/11 = 3,271.82
Computing Estimates of the Slope (β), Intercept (α) and standard deviation (σ):
b = Sxy / Sxx = 3,271.82 / 14,322.55 = 0.228
a = ȳ − b x̄ = 226/11 − (0.228)(664/11) = 6.756
s = √[ (Syy − Sxy²/Sxx) / (n − 2) ] = √[ (1,374.73 − 3,271.82²/14,322.55) / 9 ] = 8.35
95% Confidence Limits for slope β:
b ± t(.025) s / √Sxx = 0.228 ± 2.262 (8.35 / √14,322.55)
giving 0.0706 to 0.3862
(t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom).
95% Confidence Limits for intercept α:
a ± t(.025) s √[ 1/n + x̄²/Sxx ] = 6.756 ± 2.262 (8.35) √[ 1/11 + (664/11)²/14,322.55 ]
giving −4.34 to 17.85
(t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom).
[Scatter plot: death rates from lung cancer (1950) against per capita consumption of cigarettes (1930), countries labelled, with the fitted line Y = 6.756 + (0.228)X.]

95% confidence limits for slope β: 0.0706 to 0.3862
95% confidence limits for intercept α: −4.34 to 17.85
Testing for a positive slope
H0: β = 0 vs HA: β > 0
The test statistic is:
t = b / (s / √Sxx)
The Critical Region
Reject H0: β = 0 in favour of HA: β > 0
if t > t(0.05) = 1.833, df = 11 − 2 = 9 (a one-tailed test).
Since
t = 0.228 / (8.35 / √14,322.55) ≈ 3.27 > 1.833
we reject H0: β = 0 and conclude HA: β > 0.
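The whole cigarette example can be reproduced in a few lines of Python (a sketch; the variable names are mine, the data are from the table above):

```python
import math

# X = per capita cigarette consumption (1930), Y = lung-cancer death
# rate per 100,000 (1950), for the n = 11 countries in the table.
xs = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
ys = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]

n = len(xs)
sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n   # approx 14322.55
syy = sum(y * y for y in ys) - sum(ys) ** 2 / n   # approx 1374.73
sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n  # approx 3271.82

b = sxy / sxx                        # slope, approx 0.228
a = sum(ys) / n - b * sum(xs) / n    # intercept, approx 6.756
s = math.sqrt((syy - sxy ** 2 / sxx) / (n - 2))   # approx 8.35
t = b / (s / math.sqrt(sxx))         # approx 3.27 > 1.833, so reject H0
```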
Confidence Limits for Points on the Regression Line
• The intercept α is a specific point on the regression line.
• It is the y-coordinate of the point on the regression line when x = 0.
• It is the predicted value of y when x = 0.
• We may also be interested in other points on the regression line, e.g. when x = x0.
• In this case the y-coordinate of the point on the regression line when x = x0 is α + βx0.

[Figure: the line y = α + βx with the point (x0, α + βx0) marked.]

(1 − α)100% Confidence Limits for α + βx0:
a + bx0 ± t(α/2) s √[ 1/n + (x0 − x̄)²/Sxx ]
where t(α/2) is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
Prediction Limits for new values of the dependent variable y
• An important application of the regression line is prediction.
• Knowing the value of x (x0), what is the value of y?
• The predicted value of y when x = x0 is μ = α + βx0.
• This in turn can be estimated by ŷ = a + bx0.

The predictor ŷ = a + bx0
• gives only a single value for y.
• A more appropriate piece of information would be a range of values.
• A range of values that has a fixed probability of capturing the value of y.
• A (1 − α)100% prediction interval for y.

(1 − α)100% Prediction Limits for y when x = x0:
a + bx0 ± t(α/2) s √[ 1 + 1/n + (x0 − x̄)²/Sxx ]
where t(α/2) is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
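The two interval formulas can be sketched side by side (function names mine; the values in the comment are from the fire-damage example worked later in these slides). The only difference is the extra "1 +" under the square root, which accounts for the variability of a new observation, so the prediction interval is always wider than the confidence interval at the same x0.

```python
import math

def mean_ci(x0, a, b, s, n, xbar, sxx, t_crit):
    """Confidence limits for the mean alpha + beta*x0."""
    se = s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
    yhat = a + b * x0
    return yhat - t_crit * se, yhat + t_crit * se

def pred_limits(x0, a, b, s, n, xbar, sxx, t_crit):
    """Prediction limits for a new y observed at x0."""
    se = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
    yhat = a + b * x0
    return yhat - t_crit * se, yhat + t_crit * se

# Fire example (later in these slides): a = 10.28, b = 4.92, s = 2.316,
# n = 15, xbar = 3.28, Sxx = 34.784, t_.025 = 2.160 with 13 df
```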
Example
In this example we are studying building fires in a city and are interested in the relationship between:
1. X = the distance between the building that put out the alarm and the closest fire hall
and
2. Y = the cost of the damage ($1000s)
The data were collected on n = 15 fires.
The Data
Fire    Distance    Damage
1       3.4         26.2
2       1.8         17.8
3       4.6         31.3
4       2.3         23.1
5       3.1         27.5
6       5.5         36.0
7       0.7         14.1
8       3.0         22.3
9       2.6         19.6
10      4.3         31.3
11      2.1         24.0
12      1.1         17.3
13      6.1         43.2
14      4.8         36.4
15      3.8         26.1
[Scatter Plot: Damage ($1000s) against Distance (miles).]
Computations
Σ xi = 49.2,  Σ yi = 396.2
Σ xi² = 196.16,  Σ yi² = 11,376.5,  Σ xiyi = 1,470.65
Computations Continued
x̄ = Σ xi / n = 49.2/15 = 3.28
ȳ = Σ yi / n = 396.2/15 = 26.4133
Computations Continued
784.34152.4916.196
2
2
1
1
2
n
xxS
n
iin
iixx
517.911152.3965.11376
2
2
1
1
2
n
yyS
n
iin
iiyy
n
yxyxS
n
ii
n
iin
iiixy
11
1
114.171152.3962.4965.1470
Computations Continued
b = Sxy / Sxx = 171.114/34.784 = 4.92
a = ȳ − b x̄ = 26.4133 − (4.919)(3.28) = 10.28
s = √[ (Syy − Sxy²/Sxx) / (n − 2) ] = √[ (911.517 − 171.114²/34.784) / 13 ] = 2.316
95% Confidence Limits for slope β:
b ± t(.025) s / √Sxx = 4.92 ± 2.160 (2.316 / √34.784)
giving 4.07 to 5.77
(t.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom).
95% Confidence Limits for intercept α:
a ± t(.025) s √[ 1/n + x̄²/Sxx ] = 10.28 ± 2.160 (2.316) √[ 1/15 + 3.28²/34.784 ]
giving 7.21 to 13.35
(t.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom).
[Least Squares Line: scatter plot of Damage ($1000s) against Distance (miles) with the fitted line y = 4.92x + 10.28.]
(1 − α)100% Confidence Limits for α + βx0:
a + bx0 ± t(α/2) s √[ 1/n + (x0 − x̄)²/Sxx ]
where t(α/2) is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

95% Confidence Limits for α + βx0:
x0      lower   upper
1       12.87   17.52
2       18.43   21.80
3       23.72   26.35
4       28.53   31.38
5       32.93   36.82
6       37.15   42.44
[Plot: 95% confidence limits for α + βx0 drawn as a band around the fitted line, Damage ($1000s) against Distance (miles).]
(1 − α)100% Prediction Limits for y when x = x0:
a + bx0 ± t(α/2) s √[ 1 + 1/n + (x0 − x̄)²/Sxx ]
where t(α/2) is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

95% Prediction Limits for y when x = x0:
x0      lower   upper
1       9.68    20.71
2       14.84   25.40
3       19.86   30.21
4       24.75   35.16
5       29.51   40.24
6       34.13   45.45
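The fire example, including both interval tables, can be reproduced as a sketch (variable names mine; the data and t.025 = 2.160 with 13 df are from the slides):

```python
import math

# X = distance (miles), Y = damage ($1000s), n = 15 fires
xs = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
ys = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6,
      31.3, 24.0, 17.3, 43.2, 36.4, 26.1]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
b = sxy / sxx            # approx 4.92
a = ybar - b * xbar      # approx 10.28
s = math.sqrt((syy - sxy ** 2 / sxx) / (n - 2))  # approx 2.316

t_crit = 2.160  # t_.025 with 13 df, from the slides

def limits(x0, new_obs):
    # new_obs = 0: confidence limits for alpha + beta*x0;
    # new_obs = 1: prediction limits for a new y at x0.
    se = s * math.sqrt(new_obs + 1 / n + (x0 - xbar) ** 2 / sxx)
    yhat = a + b * x0
    return yhat - t_crit * se, yhat + t_crit * se

conf = [limits(x0, 0) for x0 in range(1, 7)]   # the confidence-limit table
pred = [limits(x0, 1) for x0 in range(1, 7)]   # the prediction-limit table
```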
[Plot: 95% prediction limits for y when x = x0 drawn as a band around the fitted line, Damage ($1000s) against Distance (miles).]