Linear Regression
Hypothesis testing and Estimation
Assume that we have collected data on two variables X and Y. Let
(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)
denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population)
The Statistical Model
Each yi is assumed to be randomly generated from a normal distribution with
mean μi = α + βxi and standard deviation σ (α, β and σ are unknown).

[Figure: the regression line Y = α + βX with slope β, showing the point (xi, α + βxi) for an observation yi.]
The Data and the Linear Regression Model
• The data fall roughly about a straight line.

[Scatter plot of the data with the unseen regression line Y = α + βX superimposed.]
The Least Squares Line
Fitting the best straight line to “linear” data

Let Y = a + bX
denote an arbitrary equation of a straight line, where a and b are known values. This equation can be used to predict, for each value of X, the value of Y.
For example, if X = xi (as for the ith case) then the predicted value of Y is:
ŷi = a + bxi
The residual
ri = yi − ŷi = yi − (a + bxi)
can be computed for each case in the sample:
r1 = y1 − ŷ1, r2 = y2 − ŷ2, …, rn = yn − ŷn.
The residual sum of squares (RSS) is
RSS = Σ ri² = Σ (yi − ŷi)² = Σ (yi − a − bxi)²  (sum over i = 1, …, n),
a measure of the “goodness of fit” of the line Y = a + bX to the data.
The optimal choice of a and b will result in the residual sum of squares
RSS = Σ ri² = Σ (yi − ŷi)² = Σ (yi − a − bxi)²
attaining a minimum.
If this is the case then the line
Y = a + bX
is called the Least Squares Line.
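To make the residual sum of squares concrete, here is a minimal Python sketch (not part of the original slides; the function name `rss` and the data values are made up for illustration). A line close to the data gives a small RSS; a poor line gives a large one.

```python
def rss(a, b, xs, ys):
    """Residual sum of squares for the candidate line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Illustrative data (not from the slides)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

good = rss(0.1, 1.95, xs, ys)   # a line close to the data: small RSS
bad = rss(5.0, -1.0, xs, ys)    # a poor line: large RSS
```

The least squares line is the choice of a and b that minimizes this quantity over all possible lines.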
The equation for the least squares line

Let
Sxx = Σ (xi − x̄)²
Syy = Σ (yi − ȳ)²
Sxy = Σ (xi − x̄)(yi − ȳ)

Computing Formulae:
Sxx = Σ xi² − (Σ xi)² / n
Syy = Σ yi² − (Σ yi)² / n
Sxy = Σ xiyi − (Σ xi)(Σ yi) / n
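The three computing formulae can be sketched directly in Python (an illustration, not part of the slides; the function names are mine):

```python
def s_xx(xs):
    """S_xx = sum of x_i^2 minus (sum of x_i)^2 / n."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n

def s_yy(ys):
    """S_yy = sum of y_i^2 minus (sum of y_i)^2 / n."""
    n = len(ys)
    return sum(y * y for y in ys) - sum(ys) ** 2 / n

def s_xy(xs, ys):
    """S_xy = sum of x_i*y_i minus (sum x_i)(sum y_i) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
```

These are algebraically identical to the mean-centered definitions Σ(xi − x̄)², Σ(yi − ȳ)² and Σ(xi − x̄)(yi − ȳ), but avoid computing the deviations explicitly.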
Then the slope of the least squares line can be shown to be:
b = Sxy / Sxx = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
This is an estimator of the slope, β, in the regression model,
and the intercept of the least squares line can be shown to be:
a = ȳ − b x̄ = ȳ − (Sxy / Sxx) x̄
This is an estimator of the intercept, α, in the regression model.
The residual sum of squares
RSS = Σ (yi − ŷi)² = Σ (yi − a − bxi)²
Computing formula:
RSS = Syy − Sxy² / Sxx
Estimating σ, the standard deviation in the regression model:
s = √[ Σ (yi − ŷi)² / (n − 2) ] = √[ Σ (yi − a − bxi)² / (n − 2) ]
Computing formula:
s = √[ (Syy − Sxy² / Sxx) / (n − 2) ]
This estimate of σ is said to be based on n − 2 degrees of freedom.
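Putting the pieces together, a sketch of the least squares fit and the estimate of σ (the function name `fit` is mine, not from the slides):

```python
import math

def fit(xs, ys):
    """Return (a, b, s): intercept, slope, and the estimate of sigma
    based on n - 2 degrees of freedom, via the computing formulae."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = sxy / sxx                       # estimator of the slope beta
    a = sum(ys) / n - b * sum(xs) / n   # estimator of the intercept alpha
    rss = syy - sxy ** 2 / sxx          # residual sum of squares
    s = math.sqrt(max(rss, 0.0) / (n - 2))  # guard against tiny negative rounding
    return a, b, s
```

For data lying exactly on a line, the residual sum of squares is zero and s = 0.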
Sampling distributions of the estimators

The sampling distribution of the slope of the least squares line:
b = Sxy / Sxx = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
It can be shown that b has a normal distribution with mean and standard deviation
μb = β and σb = σ / √Sxx = σ / √[ Σ (xi − x̄)² ]
Thus
z = (b − β) / (σ / √Sxx)
has a standard normal distribution, and
t = (b − β) / (s / √Sxx)
has a t distribution with df = n − 2.
The sampling distribution of the intercept of the least squares line:
a = ȳ − b x̄
It can be shown that a has a normal distribution with mean and standard deviation
μa = α and σa = σ √[ 1/n + x̄² / Σ (xi − x̄)² ] = σ √[ 1/n + x̄² / Sxx ]
Thus
z = (a − α) / ( σ √[ 1/n + x̄² / Sxx ] )
has a standard normal distribution, and
t = (a − α) / ( s √[ 1/n + x̄² / Sxx ] )
has a t distribution with df = n − 2.
(1 − α)100% Confidence Limits for the slope β:
b ± t(α/2) s / √Sxx
where t(α/2) is the critical value for the t-distribution with n − 2 degrees of freedom.
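A sketch of these confidence limits in Python (function name mine; `t_crit` stands in for the tabled t(α/2) value, which could also come from `scipy.stats.t.ppf(1 - alpha/2, n - 2)`). The numbers plugged in below are from the cigarette example worked later in these slides.

```python
import math

def slope_ci(b, s, sxx, t_crit):
    """(1 - alpha)100% confidence limits for the slope beta."""
    margin = t_crit * s / math.sqrt(sxx)
    return b - margin, b + margin

# Cigarette example (later in these slides):
# b = 0.228, s = 8.35, S_xx = 14322.55, t_.025 = 2.262 with 9 df
lo, hi = slope_ci(0.2284, 8.35, 14322.55, 2.262)  # approx (0.071, 0.386)
```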
Testing the slope
H0: β = β0 vs HA: β ≠ β0
The test statistic is:
t = (b − β0) / (s / √Sxx)
which has a t distribution with df = n − 2 if H0 is true.
The Critical Region
Reject H0: β = β0 in favour of HA: β ≠ β0
if t < −t(α/2) or t > t(α/2), df = n − 2.
This is a two-tailed test. One-tailed tests are also possible.
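The test statistic can be sketched as follows (function name mine). With the cigarette-example values used later in these slides (b = 0.228, s = 8.35, Sxx = 14322.55, β0 = 0) it gives t ≈ 3.27.

```python
import math

def slope_t(b, beta0, s, sxx):
    """Test statistic for H0: beta = beta0; t-distributed with
    n - 2 degrees of freedom when H0 is true."""
    return (b - beta0) / (s / math.sqrt(sxx))
```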
(1 − α)100% Confidence Limits for the intercept α:
a ± t(α/2) s √[ 1/n + x̄² / Sxx ]
where t(α/2) is the critical value for the t-distribution with n − 2 degrees of freedom.
Testing the intercept
H0: α = α0 vs HA: α ≠ α0
The test statistic is:
t = (a − α0) / ( s √[ 1/n + x̄² / Sxx ] )
which has a t distribution with df = n − 2 if H0 is true.
The Critical Region
Reject H0: α = α0 in favour of HA: α ≠ α0
if t < −t(α/2) or t > t(α/2), df = n − 2.
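The analogous sketch for the intercept test (function name mine). Run on the cigarette-example values from later in these slides it gives t ≈ 1.38, which is my computation rather than a number from the slides; it is consistent with the 95% interval for α there covering 0.

```python
import math

def intercept_t(a, alpha0, s, n, xbar, sxx):
    """Test statistic for H0: alpha = alpha0; the standard error of a
    is s * sqrt(1/n + xbar^2 / S_xx)."""
    se = s * math.sqrt(1 / n + xbar ** 2 / sxx)
    return (a - alpha0) / se
```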
Example
The following data showed the per capita consumption of cigarettes per month (X) in various countries in 1930, and the death rates from lung cancer for men in 1950. TABLE : Per capita consumption of cigarettes per month (Xi) in n = 11 countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for men in 1950.
Country (i)     Xi      Yi
Australia       48      18
Canada          50      15
Denmark         38      17
Finland         110     35
Great Britain   110     46
Holland         49      24
Iceland         23      6
Norway          25      9
Sweden          30      11
Switzerland     51      25
USA             130     20
[Scatter plot: death rates from lung cancer (1950) against per capita consumption of cigarettes (1930), with each country labelled.]
Fitting the Least Squares Line

First compute the following quantities:
Σ xi = 664,  Σ yi = 226
Σ xi² = 54,404,  Σ yi² = 6,018,  Σ xiyi = 16,914

Sxx = 54,404 − 664²/11 = 14,322.55
Syy = 6,018 − 226²/11 = 1,374.73
Sxy = 16,914 − (664)(226)/11 = 3,271.82
Computing Estimates of the Slope (β), Intercept (α) and standard deviation (σ):
b = Sxy / Sxx = 3,271.82 / 14,322.55 = 0.228
a = ȳ − b x̄ = 226/11 − (0.228)(664/11) = 6.756
s = √[ (Syy − Sxy²/Sxx) / (n − 2) ] = √[ (1,374.73 − 3,271.82²/14,322.55) / 9 ] = 8.35
95% Confidence Limits for slope β:
b ± t(.025) s / √Sxx = 0.228 ± 2.262 (8.35 / √14,322.55)
giving 0.0706 to 0.3862
(t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom).
95% Confidence Limits for intercept α:
a ± t(.025) s √[ 1/n + x̄²/Sxx ] = 6.756 ± 2.262 (8.35) √[ 1/11 + (664/11)²/14,322.55 ]
giving −4.34 to 17.85
(t.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom).
[Scatter plot: death rates from lung cancer (1950) against per capita consumption of cigarettes (1930), countries labelled, with the fitted line Y = 6.756 + (0.228)X.]

95% confidence limits for slope β: 0.0706 to 0.3862
95% confidence limits for intercept α: −4.34 to 17.85
Testing for a positive slope
H0: β = 0 vs HA: β > 0
The test statistic is:
t = b / (s / √Sxx)
The Critical Region
Reject H0: β = 0 in favour of HA: β > 0
if t > t(0.05) = 1.833, df = 11 − 2 = 9 (a one-tailed test).
Since
t = 0.228 / (8.35 / √14,322.55) ≈ 3.27 > 1.833
we reject H0: β = 0 and conclude HA: β > 0.
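The whole cigarette example can be reproduced in a few lines of Python (a sketch; the variable names are mine, the data are from the table above):

```python
import math

# X = per capita cigarette consumption (1930), Y = lung-cancer death
# rate per 100,000 (1950), for the n = 11 countries in the table.
xs = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
ys = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]

n = len(xs)
sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n   # approx 14322.55
syy = sum(y * y for y in ys) - sum(ys) ** 2 / n   # approx 1374.73
sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n  # approx 3271.82

b = sxy / sxx                        # slope, approx 0.228
a = sum(ys) / n - b * sum(xs) / n    # intercept, approx 6.756
s = math.sqrt((syy - sxy ** 2 / sxx) / (n - 2))   # approx 8.35
t = b / (s / math.sqrt(sxx))         # approx 3.27 > 1.833, so reject H0
```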
Confidence Limits for Points on the Regression Line
• The intercept α is a specific point on the regression line.
• It is the y-coordinate of the point on the regression line when x = 0.
• It is the predicted value of y when x = 0.
• We may also be interested in other points on the regression line, e.g. when x = x0.
• In this case the y-coordinate of the point on the regression line when x = x0 is α + βx0.

[Figure: the line y = α + βx with the point (x0, α + βx0) marked.]

(1 − α)100% Confidence Limits for α + βx0:
a + bx0 ± t(α/2) s √[ 1/n + (x0 − x̄)²/Sxx ]
where t(α/2) is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
Prediction Limits for new values of the dependent variable y
• An important application of the regression line is prediction.
• Knowing the value of x (x0), what is the value of y?
• The predicted value of y when x = x0 is μ = α + βx0.
• This in turn can be estimated by ŷ = a + bx0.

The predictor ŷ = a + bx0
• gives only a single value for y.
• A more appropriate piece of information would be a range of values.
• A range of values that has a fixed probability of capturing the value of y.
• A (1 − α)100% prediction interval for y.

(1 − α)100% Prediction Limits for y when x = x0:
a + bx0 ± t(α/2) s √[ 1 + 1/n + (x0 − x̄)²/Sxx ]
where t(α/2) is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
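The two interval formulas can be sketched side by side (function names mine; the values in the comment are from the fire-damage example worked later in these slides). The only difference is the extra "1 +" under the square root, which accounts for the variability of a new observation, so the prediction interval is always wider than the confidence interval at the same x0.

```python
import math

def mean_ci(x0, a, b, s, n, xbar, sxx, t_crit):
    """Confidence limits for the mean alpha + beta*x0."""
    se = s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
    yhat = a + b * x0
    return yhat - t_crit * se, yhat + t_crit * se

def pred_limits(x0, a, b, s, n, xbar, sxx, t_crit):
    """Prediction limits for a new y observed at x0."""
    se = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
    yhat = a + b * x0
    return yhat - t_crit * se, yhat + t_crit * se

# Fire example (later in these slides): a = 10.28, b = 4.92, s = 2.316,
# n = 15, xbar = 3.28, Sxx = 34.784, t_.025 = 2.160 with 13 df
```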
Example
In this example we are studying building fires in a city and are interested in the relationship between:
1. X = the distance between the building that put out the alarm and the closest fire hall
and
2. Y = the cost of the damage ($1000s)
The data were collected on n = 15 fires.
The Data
Fire    Distance    Damage
1       3.4         26.2
2       1.8         17.8
3       4.6         31.3
4       2.3         23.1
5       3.1         27.5
6       5.5         36.0
7       0.7         14.1
8       3.0         22.3
9       2.6         19.6
10      4.3         31.3
11      2.1         24.0
12      1.1         17.3
13      6.1         43.2
14      4.8         36.4
15      3.8         26.1
[Scatter Plot: Damage ($1000s) against Distance (miles).]
Computations
Σ xi = 49.2,  Σ yi = 396.2
Σ xi² = 196.16,  Σ yi² = 11,376.5,  Σ xiyi = 1,470.65
Computations Continued
x̄ = Σ xi / n = 49.2/15 = 3.28
ȳ = Σ yi / n = 396.2/15 = 26.4133
Computations Continued
784.34152.4916.196
2
2
1
1
2
n
xxS
n
iin
iixx
517.911152.3965.11376
2
2
1
1
2
n
yyS
n
iin
iiyy
n
yxyxS
n
ii
n
iin
iiixy
11
1
114.171152.3962.4965.1470
Computations Continued
b = Sxy / Sxx = 171.114/34.784 = 4.92
a = ȳ − b x̄ = 26.4133 − (4.919)(3.28) = 10.28
s = √[ (Syy − Sxy²/Sxx) / (n − 2) ] = √[ (911.517 − 171.114²/34.784) / 13 ] = 2.316
95% Confidence Limits for slope β:
b ± t(.025) s / √Sxx = 4.92 ± 2.160 (2.316 / √34.784)
giving 4.07 to 5.77
(t.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom).
95% Confidence Limits for intercept α:
a ± t(.025) s √[ 1/n + x̄²/Sxx ] = 10.28 ± 2.160 (2.316) √[ 1/15 + 3.28²/34.784 ]
giving 7.21 to 13.35
(t.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom).
[Least Squares Line: scatter plot of Damage ($1000s) against Distance (miles) with the fitted line y = 4.92x + 10.28.]
(1 − α)100% Confidence Limits for α + βx0:
a + bx0 ± t(α/2) s √[ 1/n + (x0 − x̄)²/Sxx ]
where t(α/2) is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

95% Confidence Limits for α + βx0:
x0      lower   upper
1       12.87   17.52
2       18.43   21.80
3       23.72   26.35
4       28.53   31.38
5       32.93   36.82
6       37.15   42.44
[Plot: 95% confidence limits for α + βx0 drawn as a band around the fitted line, Damage ($1000s) against Distance (miles).]
(1 − α)100% Prediction Limits for y when x = x0:
a + bx0 ± t(α/2) s √[ 1 + 1/n + (x0 − x̄)²/Sxx ]
where t(α/2) is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.

95% Prediction Limits for y when x = x0:
x0      lower   upper
1       9.68    20.71
2       14.84   25.40
3       19.86   30.21
4       24.75   35.16
5       29.51   40.24
6       34.13   45.45
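The fire example, including both interval tables, can be reproduced as a sketch (variable names mine; the data and t.025 = 2.160 with 13 df are from the slides):

```python
import math

# X = distance (miles), Y = damage ($1000s), n = 15 fires
xs = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
ys = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6,
      31.3, 24.0, 17.3, 43.2, 36.4, 26.1]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
b = sxy / sxx            # approx 4.92
a = ybar - b * xbar      # approx 10.28
s = math.sqrt((syy - sxy ** 2 / sxx) / (n - 2))  # approx 2.316

t_crit = 2.160  # t_.025 with 13 df, from the slides

def limits(x0, new_obs):
    # new_obs = 0: confidence limits for alpha + beta*x0;
    # new_obs = 1: prediction limits for a new y at x0.
    se = s * math.sqrt(new_obs + 1 / n + (x0 - xbar) ** 2 / sxx)
    yhat = a + b * x0
    return yhat - t_crit * se, yhat + t_crit * se

conf = [limits(x0, 0) for x0 in range(1, 7)]   # the confidence-limit table
pred = [limits(x0, 1) for x0 in range(1, 7)]   # the prediction-limit table
```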
[Plot: 95% prediction limits for y when x = x0 drawn as a band around the fitted line, Damage ($1000s) against Distance (miles).]