Linear Regression: Hypothesis Testing and Estimation

Page 1: Linear Regression

Linear Regression

Hypothesis testing and Estimation

Page 2: Linear Regression

Assume that we have collected data on two variables X and Y. Let

(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)

denote the pairs of measurements on the two variables X and Y for the n cases in a sample (or population).

Page 3: Linear Regression

The Statistical Model

Page 4: Linear Regression

Each yi is assumed to be randomly generated from a normal distribution with mean μi = α + βxi and standard deviation σ (α, β and σ are unknown).

[Figure: the regression line Y = α + βX with slope β; the distribution of yi is centred at α + βxi above each xi.]

Page 5: Linear Regression

The Data and the Linear Regression Model

• The data fall roughly about a straight line.

[Scatter plot of the data about the unseen population line Y = α + βX.]

Page 6: Linear Regression

The Least Squares Line

Fitting the best straight line

to “linear” data

Page 7: Linear Regression

Let

Y = a + bX

denote an arbitrary equation of a straight line. a and b are known values. This equation can be used to predict, for each value of X, the value of Y.

For example, if X = xi (as for the ith case) then the predicted value of Y is:

ŷi = a + bxi

Page 8: Linear Regression

The residual

ri = yi − ŷi = yi − (a + bxi)

can be computed for each case in the sample:

r1 = y1 − ŷ1, r2 = y2 − ŷ2, …, rn = yn − ŷn

The residual sum of squares (RSS) is a measure of the "goodness of fit" of the line Y = a + bX to the data:

RSS = Σ ri² = Σ (yi − ŷi)² = Σ (yi − a − bxi)²  (sums over i = 1, …, n)

Page 9: Linear Regression

The optimal choice of a and b will result in the residual sum of squares

RSS = Σ ri² = Σ (yi − ŷi)² = Σ (yi − a − bxi)²

attaining a minimum. If this is the case, then the line

Y = a + bX

is called the Least Squares Line.

Page 10: Linear Regression

The equation for the least squares line

Let

Sxx = Σ (xi − x̄)²

Syy = Σ (yi − ȳ)²

Sxy = Σ (xi − x̄)(yi − ȳ)

(sums over i = 1, …, n)

Page 11: Linear Regression

Computing Formulae:

Sxx = Σ xi² − (Σ xi)²/n

Syy = Σ yi² − (Σ yi)²/n

Sxy = Σ xi yi − (Σ xi)(Σ yi)/n
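The computing formulae are easy to check numerically against the definitions; here is a minimal sketch in Python using a small made-up data set (not from the slides):

```python
# Check that the computing formulae for Sxx, Syy, Sxy agree with the
# definitions.  The data below are hypothetical, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Definition form: sums of squared / cross deviations from the means
Sxx_def = sum((xi - xbar) ** 2 for xi in x)
Syy_def = sum((yi - ybar) ** 2 for yi in y)
Sxy_def = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# Computing-formula form: raw sums only, no means required
Sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
Syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

# The two forms agree up to floating-point rounding
assert abs(Sxx - Sxx_def) < 1e-9
assert abs(Syy - Syy_def) < 1e-9
assert abs(Sxy - Sxy_def) < 1e-9
```

The computing-formula form needs only the five raw sums, which is why it was the preferred hand-calculation route.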

Page 12: Linear Regression

Then the slope of the least squares line can be shown to be:

b = Sxy / Sxx = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

This is an estimator of the slope, β, in the regression model.

Page 13: Linear Regression

and the intercept of the least squares line can be shown to be:

a = ȳ − b x̄ = ȳ − (Sxy / Sxx) x̄

This is an estimator of the intercept, α, in the regression model.
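The two estimators can be computed in a couple of lines; a sketch using small made-up data (not from the slides):

```python
# Least-squares slope and intercept via the computing formulae.
# The data are hypothetical, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
n = len(x)

Sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = Sxy / Sxx                      # estimates the slope beta
a = sum(y) / n - b * sum(x) / n    # a = ybar - b*xbar, estimates alpha
```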

Page 14: Linear Regression

The residual sum of squares

RSS = Σ (yi − ŷi)² = Σ (yi − a − bxi)²

Computing formula:

RSS = Syy − Sxy²/Sxx

Page 15: Linear Regression

Estimating σ, the standard deviation in the regression model:

s = √[ (1/(n − 2)) Σ (yi − ŷi)² ] = √[ (1/(n − 2)) Σ (yi − a − bxi)² ]

Computing formula:

s = √[ (Syy − Sxy²/Sxx) / (n − 2) ]

This estimate of σ is said to be based on n − 2 degrees of freedom.
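The residual form and the computing formula give the same estimate of σ; a numerical check on made-up data (not from the slides):

```python
from math import sqrt

# Estimate of sigma: residual form vs computing formula.
# The data are hypothetical, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
n = len(x)

Sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
Syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b = Sxy / Sxx
a = sum(y) / n - b * sum(x) / n

# Residual form: s^2 = RSS / (n - 2)
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_resid = sqrt(rss / (n - 2))

# Computing formula: s^2 = (Syy - Sxy^2/Sxx) / (n - 2)
s = sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))

assert abs(s - s_resid) < 1e-9
```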

Page 16: Linear Regression

Sampling distributions of the estimators

Page 17: Linear Regression

The sampling distribution of the slope of the least squares line:

b = Sxy / Sxx = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

It can be shown that b has a normal distribution with mean and standard deviation

μb = β  and  σb = σ/√Sxx = σ / √[Σ (xi − x̄)²]

Page 18: Linear Regression

Thus

z = (b − β)/σb = (b − β) / (σ/√Sxx)

has a standard normal distribution, and

t = (b − β)/sb = (b − β) / (s/√Sxx)

has a t distribution with df = n − 2.

Page 19: Linear Regression

The sampling distribution of the intercept of the least squares line:

a = ȳ − b x̄ = ȳ − (Sxy / Sxx) x̄

It can be shown that a has a normal distribution with mean and standard deviation

μa = α  and  σa = σ √[ 1/n + x̄²/Σ (xi − x̄)² ]

Page 20: Linear Regression

Thus

z = (a − α)/σa = (a − α) / (σ √[1/n + x̄²/Sxx])

has a standard normal distribution, and

t = (a − α)/sa = (a − α) / (s √[1/n + x̄²/Sxx])

has a t distribution with df = n − 2.

Page 21: Linear Regression

(1 − α)100% Confidence Limits for slope β:

b ± tα/2 · s/√Sxx

tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.
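As a sketch, the interval is one line of arithmetic once b, s and Sxx are in hand. The numbers below are made up for illustration; the critical value 2.262 is the two-sided 95% t value for 9 degrees of freedom (n = 11), taken from tables:

```python
from math import sqrt

# (1 - alpha)100% confidence limits for the slope beta:
#     b +/- t_{alpha/2} * s / sqrt(Sxx)
# Hypothetical summary values, for illustration only.
b, s, Sxx = 0.5, 2.0, 400.0
t_crit = 2.262               # t_{.025}, df = n - 2 = 9 (from tables)

half_width = t_crit * s / sqrt(Sxx)
lower, upper = b - half_width, b + half_width
```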

Page 22: Linear Regression

Testing the slope

H0: β = β0 vs HA: β ≠ β0

The test statistic is:

t = (b − β0) / (s/√Sxx)

which has a t distribution with df = n − 2 if H0 is true.

Page 23: Linear Regression

The Critical Region

Reject H0: β = β0 in favour of HA: β ≠ β0

if t = (b − β0)/(s/√Sxx) < −tα/2 or t > tα/2

df = n − 2

This is a two-tailed test. One-tailed tests are also possible.

Page 24: Linear Regression

(1 − α)100% Confidence Limits for intercept α:

a ± tα/2 · s √[1/n + x̄²/Sxx]

tα/2 is the critical value for the t-distribution with n − 2 degrees of freedom.

Page 25: Linear Regression

Testing the intercept

H0: α = α0 vs HA: α ≠ α0

The test statistic is:

t = (a − α0) / (s √[1/n + x̄²/Σ (xi − x̄)²])

which has a t distribution with df = n − 2 if H0 is true.

Page 26: Linear Regression

The Critical Region

Reject H0: α = α0 in favour of HA: α ≠ α0

if t = (a − α0)/sa < −tα/2 or t > tα/2

df = n − 2

Page 27: Linear Regression

Example

Page 28: Linear Regression

The following data show the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950.

TABLE: Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.

Country (i)      Xi    Yi
Australia        48    18
Canada           50    15
Denmark          38    17
Finland         110    35
Great Britain   110    46
Holland          49    24
Iceland          23     6
Norway           25     9
Sweden           30    11
Switzerland      51    25
USA             130    20

Page 29: Linear Regression

[Scatter plot: death rates from lung cancer (1950) vs per capita consumption of cigarettes (1930), each country labelled.]

Page 30: Linear Regression

Fitting the Least Squares Line

First compute the summary statistics:

Σ xi = 664    Σ yi = 226

Σ xi² = 54404    Σ yi² = 6018    Σ xi yi = 16914

Page 31: Linear Regression

Fitting the Least Squares Line

First compute the following three quantities:

Sxx = 54404 − (664)²/11 = 14322.55

Syy = 6018 − (226)²/11 = 1374.73

Sxy = 16914 − (664)(226)/11 = 3271.82

Page 32: Linear Regression

Computing Estimates of the Slope (β), Intercept (α) and standard deviation (σ):

b = Sxy/Sxx = 3271.82/14322.55 = 0.2284

a = ȳ − b x̄ = 226/11 − 0.2284 (664/11) = 6.756

s = √[ (Syy − Sxy²/Sxx) / (n − 2) ] = 8.35
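The arithmetic on this slide can be reproduced directly from the data table; a short sketch in Python:

```python
from math import sqrt

# Reproduce the cigarette-consumption example: X = consumption in 1930,
# Y = lung-cancer death rate in 1950, for the n = 11 countries.
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)

Sxx = sum(v * v for v in x) - sum(x) ** 2 / n                 # 14322.55
Syy = sum(v * v for v in y) - sum(y) ** 2 / n                 # 1374.73
Sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n  # 3271.82

b = Sxy / Sxx                       # slope estimate, about 0.228
a = sum(y) / n - b * sum(x) / n     # intercept estimate, about 6.756
s = sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))  # about 8.35
```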

Page 33: Linear Regression

95% Confidence Limits for slope β:

b ± t.025 · s/√Sxx = 0.2284 ± 2.262 × 8.35/√14322.55

= 0.0706 to 0.3862

t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.

Page 34: Linear Regression

95% Confidence Limits for intercept α:

a ± t.025 · s √[1/n + x̄²/Sxx] = 6.756 ± 2.262 × 8.35 × √[1/11 + (664/11)²/14322.55]

= −4.34 to 17.85

t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.

Page 35: Linear Regression

[Scatter plot of the 11 countries with the fitted line Y = 6.756 + 0.228 X.]

95% confidence limits for slope β: 0.0706 to 0.3862

95% confidence limits for intercept α: −4.34 to 17.85

Page 36: Linear Regression

Testing for a positive slope

H0: β = 0 vs HA: β > 0

The test statistic is:

t = b / (s/√Sxx)

Page 37: Linear Regression

The Critical Region

Reject H0: β = 0 in favour of HA: β > 0

if t = b/(s/√Sxx) > t0.05 = 1.833

df = 11 − 2 = 9

This is a one-tailed test.

Page 38: Linear Regression

Since

t = b/(s/√Sxx) = 0.2284/(8.35/√14322.55) = 3.27 > 1.833

we reject H0: β = 0 and conclude HA: β > 0.
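The test statistic can be recomputed from the raw data as a check:

```python
from math import sqrt

# One-tailed test of H0: beta = 0 vs HA: beta > 0 for the
# cigarette data (n = 11, df = 9).
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)

Sxx = sum(v * v for v in x) - sum(x) ** 2 / n
Syy = sum(v * v for v in y) - sum(y) ** 2 / n
Sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
b = Sxy / Sxx
s = sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))

t = b / (s / sqrt(Sxx))      # about 3.27
t_crit = 1.833               # t_{.05}, df = 9 (from tables)
reject_H0 = t > t_crit       # True: conclude beta > 0
```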

Page 39: Linear Regression

Confidence Limits for Points on the Regression Line

• The intercept α is a specific point on the regression line.

• It is the y-coordinate of the point on the regression line when x = 0.

• It is the predicted value of y when x = 0.

• We may also be interested in other points on the regression line, e.g. when x = x0.

• In this case the y-coordinate of the point on the regression line when x = x0 is α + βx0.

Page 40: Linear Regression

[Figure: the line y = α + βx with the point (x0, α + βx0) marked.]

Page 41: Linear Regression

(1 − α)100% Confidence Limits for α + βx0:

(a + bx0) ± tα/2 · s √[1/n + (x0 − x̄)²/Sxx]

tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

Page 42: Linear Regression

Prediction Limits for new values of the dependent variable y

• An important application of the regression line is prediction.

• Knowing the value of x (x0), what is the value of y?

• The predicted value of y when x = x0 is α + βx0.

• This in turn can be estimated by:

ŷ = a + bx0

Page 43: Linear Regression

The predictor

ŷ = a + bx0

• gives only a single value for y.

• A more appropriate piece of information would be a range of values:

• a range of values that has a fixed probability of capturing the value for y,

• i.e. a (1 − α)100% prediction interval for y.

Page 44: Linear Regression

(1 − α)100% Prediction Limits for y when x = x0:

(a + bx0) ± tα/2 · s √[1 + 1/n + (x0 − x̄)²/Sxx]

tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

Page 45: Linear Regression

Example

In this example we are studying building fires in a city and are interested in the relationship between:

1. X = the distance between the closest fire hall and the building that sounds the alarm, and

2. Y = the cost of the damage ($1000s).

The data were collected on n = 15 fires.

Page 46: Linear Regression

The Data

Fire  Distance  Damage
 1      3.4      26.2
 2      1.8      17.8
 3      4.6      31.3
 4      2.3      23.1
 5      3.1      27.5
 6      5.5      36.0
 7      0.7      14.1
 8      3.0      22.3
 9      2.6      19.6
10      4.3      31.3
11      2.1      24.0
12      1.1      17.3
13      6.1      43.2
14      4.8      36.4
15      3.8      26.1

Page 47: Linear Regression

[Scatter Plot: Damage ($1000s) vs Distance (miles).]

Page 48: Linear Regression

Computations

From the data table, the summary statistics are:

Σ xi = 49.2    Σ yi = 396.2

Σ xi² = 196.16    Σ yi² = 11376.48    Σ xi yi = 1470.65

Page 49: Linear Regression

Computations Continued

x̄ = Σ xi / n = 49.2/15 = 3.28

ȳ = Σ yi / n = 396.2/15 = 26.4133

Page 50: Linear Regression

Computations Continued

Sxx = Σ xi² − (Σ xi)²/n = 196.16 − (49.2)²/15 = 34.784

Syy = Σ yi² − (Σ yi)²/n = 11376.48 − (396.2)²/15 = 911.517

Sxy = Σ xi yi − (Σ xi)(Σ yi)/n = 1470.65 − (49.2)(396.2)/15 = 171.114

Page 51: Linear Regression

Computations Continued

b = Sxy/Sxx = 171.114/34.784 = 4.92

a = ȳ − b x̄ = 26.4133 − (4.92)(3.28) = 10.28

s = √[ (Syy − Sxy²/Sxx) / (n − 2) ] = √[ (911.517 − (171.114)²/34.784) / 13 ] = 2.316
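These computations can be reproduced from the fire data table; a sketch in Python:

```python
from math import sqrt

# Reproduce the fire-damage example: X = distance (miles),
# Y = damage ($1000s), n = 15 fires.
x = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3,
     2.1, 1.1, 6.1, 4.8, 3.8]
y = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6, 31.3,
     24.0, 17.3, 43.2, 36.4, 26.1]
n = len(x)

Sxx = sum(v * v for v in x) - sum(x) ** 2 / n                 # 34.784
Syy = sum(v * v for v in y) - sum(y) ** 2 / n                 # 911.517
Sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n  # 171.114

b = Sxy / Sxx                       # slope, about 4.92
a = sum(y) / n - b * sum(x) / n     # intercept, about 10.28
s = sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))  # about 2.316
```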

Page 52: Linear Regression

95% Confidence Limits for slope β:

b ± t.025 · s/√Sxx = 4.92 ± 2.160 × 2.316/√34.784

= 4.07 to 5.77

t.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom.

Page 53: Linear Regression

95% Confidence Limits for intercept α:

a ± t.025 · s √[1/n + x̄²/Sxx] = 10.28 ± 2.160 × 2.316 × √[1/15 + (3.28)²/34.784]

= 7.21 to 13.35

t.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom.

Page 54: Linear Regression

[Scatter plot of Damage ($1000s) vs Distance (miles) with the Least Squares Line y = 4.92x + 10.28.]

Page 55: Linear Regression

(1 − α)100% Confidence Limits for α + βx0:

(a + bx0) ± tα/2 · s √[1/n + (x0 − x̄)²/Sxx]

tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

Page 56: Linear Regression

95% Confidence Limits for α + βx0:

x0   lower   upper
1    12.87   17.52
2    18.43   21.80
3    23.72   26.35
4    28.53   31.38
5    32.93   36.82
6    37.15   42.44
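Each row of this table follows from the formula on the previous slide; a sketch that recomputes the x0 = 1 row from the raw data:

```python
from math import sqrt

# 95% confidence limits for alpha + beta*x0 in the fire-damage
# example (n = 15, df = 13, t_.025 = 2.160 from tables).
x = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3,
     2.1, 1.1, 6.1, 4.8, 3.8]
y = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6, 31.3,
     24.0, 17.3, 43.2, 36.4, 26.1]
n = len(x)
xbar = sum(x) / n

Sxx = sum(v * v for v in x) - sum(x) ** 2 / n
Syy = sum(v * v for v in y) - sum(y) ** 2 / n
Sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
b = Sxy / Sxx
a = sum(y) / n - b * xbar
s = sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))
t_crit = 2.160  # t_{.025}, df = 13 (from tables)

def conf_limits(x0):
    """95% confidence limits for the mean response at x = x0."""
    half = t_crit * s * sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)
    yhat = a + b * x0
    return yhat - half, yhat + half

lo, hi = conf_limits(1.0)   # matches the table row for x0 = 1
```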

Page 57: Linear Regression

[Plot: the fitted line with the 95% confidence limits for α + βx0 drawn as curves on either side.]

Page 58: Linear Regression

(1 − α)100% Prediction Limits for y when x = x0:

(a + bx0) ± tα/2 · s √[1 + 1/n + (x0 − x̄)²/Sxx]

tα/2 is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

Page 59: Linear Regression

95% Prediction Limits for y when x = x0:

x0   lower   upper
1     9.68   20.71
2    14.84   25.40
3    19.86   30.21
4    24.75   35.16
5    29.51   40.24
6    34.13   45.45
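The only change from the confidence limits is the extra "1 +" under the square root, which is why these intervals are wider; a sketch recomputing the x0 = 1 row:

```python
from math import sqrt

# 95% prediction limits for y at x = x0 in the fire-damage
# example (n = 15, df = 13, t_.025 = 2.160 from tables).
x = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3,
     2.1, 1.1, 6.1, 4.8, 3.8]
y = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6, 31.3,
     24.0, 17.3, 43.2, 36.4, 26.1]
n = len(x)
xbar = sum(x) / n

Sxx = sum(v * v for v in x) - sum(x) ** 2 / n
Syy = sum(v * v for v in y) - sum(y) ** 2 / n
Sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
b = Sxy / Sxx
a = sum(y) / n - b * xbar
s = sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))
t_crit = 2.160  # t_{.025}, df = 13 (from tables)

def pred_limits(x0):
    """95% prediction limits for a new y observed at x = x0."""
    half = t_crit * s * sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)
    yhat = a + b * x0
    return yhat - half, yhat + half

lo, hi = pred_limits(1.0)   # matches the table row for x0 = 1
```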

Page 60: Linear Regression

[Plot: the fitted line with the 95% prediction limits for y when x = x0 drawn as curves on either side.]