Chapter 8: Simple Linear Regression

Yang Zhenlin

zlyang@smu.edu.sg

http://www.mysmu.edu/faculty/zlyang/


STAT306, Term II, 09/10

Chapter 8

STAT151, Term I 2015-16 © Zhenlin Yang, SMU

Learning Objectives

Describing the Relationship between Two Variables

-- Scatter plot

-- Numerical measures

Simple Linear Regression Model

Least Squares Method for Model Estimation

A Measure of Goodness of Fit: R-Square

Inference about the Regression Coefficients

Predictions

-- Predicting the value of a future observation

-- Predicting the mean of future observations


Introduction

We are interested in the relationship between two numerical variables X and Y.

One of these variables, say X, is known in advance and is called the explanatory variable, or independent variable.

The other variable, Y, is a random variable whose values or general random behavior are of interest. For this reason, Y is called the response variable, or dependent variable.

If there is a strong relationship between X and Y, one can predict a future value of Y from the known future value of X through this relationship.

To study the relationship, n pairs of observations on (X, Y) are collected, denoted as (X1, Y1), (X2, Y2), . . . , (Xn, Yn).

The Least Squares Method helps to find such a relationship.


Describing the Relationship

Example 8.1. Prices of used cars and the odometer readings. A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.

A random sample of 100 cars is selected, and the data recorded.

Construct a scatter plot of the data.

Car   Odometer (X)   Price (Y)
1     37388          14636
2     44758          14122
3     45833          14016
4     30862          15590
5     31705          15568
6     34010          14718
. . . . . . . . .


Scatter diagram: plot of the pairs of observed values (x1, y1) , (x2, y2) , . . . , (xn, yn) of variables X and Y. It is a very effective graphical tool for “revealing” the relationship between variables.


Describing the Relationship

The plot indeed shows a negative linear relation between the price and the odometer reading.


Describing the Relationship

Besides the graphical display of the data, numerical measures such as the sample covariance and the sample coefficient of correlation can be used to measure the direction and strength of the linear relationship between two variables.

Sample means:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad\text{and}\quad \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$

Sample variances:

$$s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 \quad\text{and}\quad s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

Sample covariance:

$$\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

Sample correlation coefficient:

$$r = \frac{\mathrm{Cov}(X, Y)}{s_X s_Y}$$

This is called the ‘five statistics summary’ of the data.
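The five statistics and the correlation coefficient can be computed directly from their definitions. The sketch below is illustrative only: the function names are hypothetical, and it uses just the first six cars listed in Example 8.1, not the full sample of 100, so its numbers will not match the full-sample summary.

```python
from math import sqrt


def five_statistics(x, y):
    """Return (x_bar, y_bar, s2_x, s2_y, cov_xy) for paired samples x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample variances and covariance use the divisor n - 1.
    s2_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    s2_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    return x_bar, y_bar, s2_x, s2_y, cov_xy


def correlation(s2_x, s2_y, cov_xy):
    """Sample correlation coefficient r = Cov(X, Y) / (s_X * s_Y)."""
    return cov_xy / sqrt(s2_x * s2_y)


# First six cars of Example 8.1 only (an illustrative subset).
odometer = [37388, 44758, 45833, 30862, 31705, 34010]
price = [14636, 14122, 14016, 15590, 15568, 14718]

x_bar, y_bar, s2_x, s2_y, cov_xy = five_statistics(odometer, price)
r = correlation(s2_x, s2_y, cov_xy)
print(x_bar, y_bar, s2_x, s2_y, cov_xy, r)
```

Even on this small subset the covariance and r come out negative, in line with the downward trend in the scatter plot.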


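Describing the Relationship

Example 8.2. Continuing Example 8.1, find the five statistics summary and comment on the linear relationship between price and odometer reading.

Shortcut Formulas:

$$\mathrm{Cov}(X, Y) = \frac{1}{n-1}\left[\sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}\right]$$

$$s_X^2 = \frac{1}{n-1}\left[\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}\right]; \quad s_Y^2 = \frac{1}{n-1}\left[\sum Y_i^2 - \frac{\left(\sum Y_i\right)^2}{n}\right]$$

Solution:

$$\bar{X} = 36{,}009.45; \quad \bar{Y} = 14{,}822.823;$$

$$s_X^2 = 43{,}528{,}690, \quad s_Y^2 = 259{,}996$$

$$\mathrm{Cov}(X, Y) = -2{,}712{,}511, \quad\text{or}\quad r = -0.8063$$

As r = −0.8063, there exists a strong negative linear relation …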
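As a quick check of the arithmetic, r can be recomputed from the summary statistics reported in Example 8.2. The sketch below also compares the shortcut covariance formula with the definition on the six cars listed in Example 8.1; the function names are hypothetical and the six-car subset is only for illustration.

```python
from math import isclose, sqrt


def cov_definition(x, y):
    """Sample covariance from the definition (divisor n - 1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def cov_shortcut(x, y):
    """Sample covariance from the shortcut formula."""
    n = len(x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    return (sum_xy - sum(x) * sum(y) / n) / (n - 1)


# Six cars listed in Example 8.1 (illustrative subset of the 100 cars).
odometer = [37388, 44758, 45833, 30862, 31705, 34010]
price = [14636, 14122, 14016, 15590, 15568, 14718]
same = isclose(cov_definition(odometer, price), cov_shortcut(odometer, price),
               rel_tol=1e-9)
print(same)

# Summary statistics reported in Example 8.2 (full sample, n = 100).
cov_xy = -2_712_511
s2_x = 43_528_690
s2_y = 259_996
r = cov_xy / sqrt(s2_x * s2_y)
print(round(r, 4))  # -0.8063
```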


Describing the Relationship

Sample covariance and sample coefficient of correlation:

Cov(X, Y) > 0, or r close to +1: Strong positive linear relationship. The scatter diagram shows a clear upward trend.

Cov(X, Y) = 0, or r = 0: No linear relationship. The scatter diagram shows either no pattern or a non-linear pattern.

Cov(X, Y) < 0, or r close to −1: Strong negative linear relationship. The scatter diagram shows a clear downward trend.


Simple Linear Regression Model

The simple linear regression model takes the form:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

Y = dependent variable

X = independent variable

$\beta_0$ = y-intercept

$\beta_1$ = slope of the line = Rise/Run

$\varepsilon$ = error variable

$\beta_0$ and $\beta_1$ are unknown population parameters and therefore need to be estimated from the data.

The scatter diagram in Example 8.1 shows a general trend: as the odometer reading increases, the price of the used car decreases. The relation is not deterministic, however, since cars with the same odometer reading can have different prices. Thus, price is also affected by some unknown random errors.


Simple Linear Regression Model

To learn this theoretical relationship, and in particular to estimate the parameters $\beta_0$ and $\beta_1$, a random sample of n experimental units is selected, and the values of (X, Y) for each unit are observed to give (X1, Y1), (X2, Y2), . . . , (Xn, Yn).

These n pairs of observations satisfy:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n.$$

As Y is a random variable, so must be $\varepsilon$. Due to the random sampling mechanism, {$Y_i$} must be independent, and so are the {$\varepsilon_i$}. Further, it is reasonable to assume that

$$E(\varepsilon_i) = 0, \quad i = 1, 2, \ldots, n.$$

For if they are not zero, the nonzero constant can be absorbed into $\beta_0$. Thus,

$$E(Y_i) = \beta_0 + \beta_1 X_i, \quad i = 1, 2, \ldots, n.$$


Least Squares Estimation

Based on the observed data, we are seeking a line that best fits the data when two variables are related to one another.

We define “best fit line” as a line for which the sum of squared differences between it and the data points is minimized.

[Figure: data points plotted against X and Y with a candidate line; the gaps between the points and the line are labelled "Errors".]

Different lines generate different errors, and thus different sums of squared errors.

There is a line that minimizes the sum of squared errors, and in this sense it is the best line.


Least Squares Estimation

Let $\hat{Y} = b_0 + b_1 X$ be a fitted line. Finding the best line, the one that minimizes the sum of squared errors, is equivalent to finding the intercept $b_0$ and the slope $b_1$ that

$$\text{minimize } \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

where $Y_i$ is the actual Y value of point i, and $\hat{Y}_i = b_0 + b_1 X_i$ is the value of point i calculated from the equation.

That is, to minimize

$$SS(b_0, b_1) = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$$


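Least Squares Estimation

Taking partial derivatives and setting them to zero:

$$\frac{\partial SS(b_0, b_1)}{\partial b_0} = -2\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) = 0$$

$$\frac{\partial SS(b_0, b_1)}{\partial b_1} = -2\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) X_i = 0$$

Leads to

$$\sum_{i=1}^{n} Y_i - n b_0 - b_1 \sum_{i=1}^{n} X_i = 0$$

$$\sum_{i=1}^{n} Y_i X_i - b_0 \sum_{i=1}^{n} X_i - b_1 \sum_{i=1}^{n} X_i^2 = 0$$

The first equation gives $b_0 = \bar{Y} - b_1 \bar{X}$. Substituting this into the second equation:

$$\sum_{i=1}^{n} Y_i X_i - (\bar{Y} - b_1 \bar{X}) \sum_{i=1}^{n} X_i - b_1 \sum_{i=1}^{n} X_i^2 = 0$$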


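Least Squares Estimation

Using $\sum X_i = n\bar{X}$, the substituted equation becomes

$$\sum_{i=1}^{n} X_i Y_i - n\bar{Y}\bar{X} + b_1 \bar{X} \sum_{i=1}^{n} X_i - b_1 \sum_{i=1}^{n} X_i^2 = 0$$

$$\sum_{i=1}^{n} X_i Y_i - n\bar{Y}\bar{X} + b_1 n \bar{X}^2 - b_1 \sum_{i=1}^{n} X_i^2 = 0$$

$$b_1 \left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y},$$

And the solutions:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\mathrm{Cov}(X, Y)}{s_X^2}$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

This gives the least squares equation: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$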
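The closed-form solutions translate into a few lines of code. The following is a minimal sketch, assuming plain Python lists as input; the function name `least_squares_fit` is hypothetical. It uses the raw-sum form of the slope from the derivation and then checks that both normal equations hold on the six cars listed in Example 8.1.

```python
def least_squares_fit(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope, in raw-sum form.
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi * xi for xi in x) - n * x_bar ** 2
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Points lying exactly on y = 2 + 3x are recovered exactly.
b0_line, b1_line = least_squares_fit([0, 1, 2, 3], [2, 5, 8, 11])
print(b0_line, b1_line)

# Six cars listed in Example 8.1 (illustrative subset of the 100 cars).
odometer = [37388, 44758, 45833, 30862, 31705, 34010]
price = [14636, 14122, 14016, 15590, 15568, 14718]
b0, b1 = least_squares_fit(odometer, price)

# Both normal equations should hold up to rounding error.
residuals = [yi - b0 - b1 * xi for xi, yi in zip(odometer, price)]
eq1 = sum(residuals)
eq2 = sum(e * xi for e, xi in zip(residuals, odometer))
print(eq1, eq2)
```

For data with very large values, the centered form of the slope (sums of deviations from the means) is numerically safer than the raw-sum form, although both are algebraically identical.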


Least Squares Estimation

Example 8.3. Continuing Example 8.2, find the least squares line relating odometer reading to the price of the used car.

Solution: The estimated coefficients are

$$\hat{\beta}_1 = \frac{\mathrm{Cov}(X, Y)}{s_X^2} = \frac{-2{,}712{,}511}{43{,}528{,}690} = -0.06232$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 14{,}822.82 - (-0.06232)(36{,}009.45) = 17{,}067$$

The least squares equation is

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X = 17{,}067 - 0.0623X$$

Interpretation of $\hat{\beta}_1 = -0.0623$: for one additional mile on the odometer, it is estimated that the average price of the cars decreases by $0.0623.
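The arithmetic in Example 8.3 can be reproduced from the summary statistics of Example 8.2 alone. This is a small check, not part of the original solution; the variable names are chosen here for illustration.

```python
# Summary statistics reported in Example 8.2.
x_bar = 36_009.45
y_bar = 14_822.823
cov_xy = -2_712_511
s2_x = 43_528_690

# Least squares estimates: slope = Cov(X, Y) / s_X^2, intercept = Y_bar - slope * X_bar.
b1 = cov_xy / s2_x
b0 = y_bar - b1 * x_bar

print(round(b1, 5))  # -0.06232
print(round(b0))     # 17067
```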


Least Squares Estimation

Interpreting the Linear Regression Equation

[Figure: Odometer Line Fit Plot, Price against Odometer, with the fitted line $\hat{Y} = 17{,}067 - 0.0623X$.]

The estimated slope of the line is −0.0623. For each additional mile on the odometer, the price decreases by an average of $0.0623.

The intercept is estimated as $17,067. There are no data near an odometer reading of 0, so do not interpret the intercept as the “Price of cars that have not been driven”.


Least Squares Estimation

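Properties of the Least Squares Estimators.

For the simple linear regression model:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, 2, \ldots, n,$$

where {$\varepsilon_i$} are independent with $E(\varepsilon_i) = 0$, the least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased estimators of $\beta_0$ and $\beta_1$.

To see this, note that since $E(Y_i) = \beta_0 + \beta_1 X_i$, we have

$$E(\bar{Y}) = \beta_0 + \beta_1 \bar{X}$$

More on white board in class.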
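Unbiasedness can also be illustrated by simulation: generate many samples from a model with known parameters, fit each one, and average the estimates. The sketch below is a rough illustration, not a proof. All parameter values are made up, the function names are hypothetical, and normal errors are assumed for convenience (unbiasedness itself only needs mean-zero errors).

```python
import random


def least_squares_fit(x, y):
    """Return (b0, b1) using the centered form of the slope."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def average_estimates(beta0, beta1, sigma, x, replications, seed=0):
    """Average the least squares estimates over simulated samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    b0_sum = b1_sum = 0.0
    for _ in range(replications):
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        b0, b1 = least_squares_fit(x, y)
        b0_sum += b0
        b1_sum += b1
    return b0_sum / replications, b1_sum / replications


# Hypothetical true parameters: beta0 = 1, beta1 = 2, sigma = 1.
mean_b0, mean_b1 = average_estimates(1.0, 2.0, 1.0, list(range(10)), 2000)
print(mean_b0, mean_b1)
```

The two averages should land close to the true values 1 and 2, and get closer as the number of replications grows.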


Measure of Goodness of Fit

Sum of Squares due to Errors (SSE)

This is the sum of squared differences between the points and the regression line.

It can serve as a measure of how well the line fits the data.

SSE is defined by

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2.$$

– A shortcut formula

$$SSE = (n-1)\left[s_Y^2 - \frac{\mathrm{Cov}(X, Y)^2}{s_X^2}\right]$$
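The definition and the shortcut formula should give the same SSE. A small numerical check, using only the six cars listed in Example 8.1 (so the value is not the SSE of the full sample), might look like this; the variable names are illustrative.

```python
from math import isclose

# Six cars listed in Example 8.1 (illustrative subset of the 100 cars).
odometer = [37388, 44758, 45833, 30862, 31705, 34010]
price = [14636, 14122, 14016, 15590, 15568, 14718]

n = len(odometer)
x_bar = sum(odometer) / n
y_bar = sum(price) / n
s2_x = sum((x - x_bar) ** 2 for x in odometer) / (n - 1)
s2_y = sum((y - y_bar) ** 2 for y in price) / (n - 1)
cov_xy = sum((x - x_bar) * (y - y_bar)
             for x, y in zip(odometer, price)) / (n - 1)

# Least squares line.
b1 = cov_xy / s2_x
b0 = y_bar - b1 * x_bar

# SSE from the definition: squared gaps between points and the line.
sse_definition = sum((y - (b0 + b1 * x)) ** 2
                     for x, y in zip(odometer, price))

# SSE from the shortcut formula.
sse_shortcut = (n - 1) * (s2_y - cov_xy ** 2 / s2_x)

print(sse_definition, sse_shortcut)
```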


Measure of Goodness of Fit

Coefficient of Determination $R^2$: a measure of the strength of the linear relationship between the response Y and the explanatory variable(s) X, defined as

$$R^2 = 1 - \frac{SSE}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} \quad\text{or}\quad R^2 = \frac{\mathrm{Cov}(X, Y)^2}{s_X^2 s_Y^2}$$

The first definition is a general one and applies to linear regression models with multiple predictors.

It simplifies to the second definition when there is only one predictor X.

In the case of simple linear regression, $R^2$ is also the square of the sample correlation coefficient r.


Measure of Goodness of Fit

• To understand the significance of the coefficient of determination, note:

$$\underbrace{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}_{(SST)} = \underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{(SSR)} + \underbrace{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}_{(SSE)}$$

SST: total variation (sum of squares) in Y,

SSR: sum of squares due to regression,

SSE: sum of squares due to error.

• It follows that $R^2 = 1 - SSE/SST = SSR/SST$.

• $R^2$ measures the proportion of the variation in Y that is explained by the variation in X, or by the model.

• $R^2$ takes on any value between zero and one.

$R^2 = 1$: Perfect match between the line and the data points.

$R^2 = 0$: There is no linear relationship between X and Y.
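The decomposition SST = SSR + SSE, and the fact that R² equals r² in simple linear regression, can be verified numerically. The sketch below again uses only the six cars listed in Example 8.1, so the R² it prints is for that subset and not for the full sample; the function name is hypothetical.

```python
from math import isclose, sqrt


def sums_of_squares(x, y):
    """Return (sst, ssr, sse, r) for the least squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sst = syy                                          # total variation in Y
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)      # due to regression
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # due to error
    r = sxy / sqrt(sxx * syy)  # the (n - 1) divisors cancel in r
    return sst, ssr, sse, r


# Six cars listed in Example 8.1 (illustrative subset of the 100 cars).
odometer = [37388, 44758, 45833, 30862, 31705, 34010]
price = [14636, 14122, 14016, 15590, 15568, 14718]

sst, ssr, sse, r = sums_of_squares(odometer, price)
r_squared = 1 - sse / sst
print(sst, ssr + sse, r_squared, r ** 2)
```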


Inferences for the Model

Error Variable: Required Conditions

The error $\varepsilon$ is a critical part of the regression model.

For formal statistical inferences for the model, four requirements involving the distribution of $\varepsilon$ must be satisfied.

• The probability distribution of $\varepsilon$ is normal.

• The mean of $\varepsilon$ is zero: $E(\varepsilon) = 0$.

• The standard deviation of $\varepsilon$ is $\sigma$ for all values of X.

• The set of errors associated with different observations on Y are all independent.

It follows that the response Y is normally distributed with mean $E(Y) = \beta_0 + \beta_1 X$ and standard deviation $\sigma$, and that the random sample of n observations {Y1, Y2, . . . , Yn} made on Y are independent.


Inferences for the Model

[Figure: Normality of $\varepsilon$. Normal curves of y at x1, x2 and x3, centred at $E(y|x_1) = \beta_0 + \beta_1 x_1$, $E(y|x_2) = \beta_0 + \beta_1 x_2$ and $E(y|x_3) = \beta_0 + \beta_1 x_3$.]

The standard deviation remains constant, but the mean value changes with x.

Changing the X value increases (or decreases if $\beta_1 < 0$) the mean of Y, but does not change its distributional shape.


Inferences for the Model

Estimate of Error Standard Deviation

• The mean error is equal to zero.

• If $\sigma$ is small, the errors tend to be close to zero (close to the mean error). Then, the model fits the data well.

• Therefore, we can also use $\sigma$ as a measure of the suitability of using a linear model.

• However, $\sigma$ is unknown and has to be estimated. As SSE is the sum of squared errors, it leads naturally to an

Estimate of Error Standard Deviation:

$$s = \sqrt{\frac{SSE}{n-2}}$$

It can be shown that $s^2 = SSE/(n-2)$ is an unbiased estimator of $\sigma^2$.


Inferences for the Model

Example 8.4. Calculate the estimate of the error standard deviation and the coefficient of determination for Example 8.1, and describe what they tell you about the model fit.

Solution

$$s_Y^2 = \frac{\sum (Y_i - \bar{Y})^2}{n-1} = 259{,}996 \quad \text{(calculated earlier)}$$

$$SSE = (n-1)\left[s_Y^2 - \frac{\mathrm{Cov}(X, Y)^2}{s_X^2}\right] = 99\left[259{,}996 - \frac{(-2{,}712{,}511)^2}{43{,}528{,}690}\right] = 9{,}005{,}450$$

$$s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{9{,}005{,}450}{98}} = 303.13$$

It is hard to assess the model based on s even when compared with the mean value of Y:

$$s = 303.1, \quad \bar{Y} = 14{,}823$$


Inferences for the Model

$$SST = (n-1)s_Y^2 = 99 \times 259{,}996$$

$$R^2 = 1 - SSE/SST = 1 - 9{,}005{,}450/(99 \times 259{,}996) = 0.6501$$

65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
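The quantities in Example 8.4 can be recomputed from the summary statistics of Example 8.2. This is a check of the arithmetic under the shortcut formula for SSE; because the inputs are already rounded, the SSE computed this way may differ slightly in its trailing digits from the value printed in the solution, while s and R² agree to the reported precision.

```python
from math import sqrt

# Summary statistics reported in Example 8.2.
n = 100
s2_x = 43_528_690
s2_y = 259_996
cov_xy = -2_712_511

sse = (n - 1) * (s2_y - cov_xy ** 2 / s2_x)  # shortcut formula for SSE
s = sqrt(sse / (n - 2))                      # estimate of error standard deviation
sst = (n - 1) * s2_y                         # total sum of squares
r_squared = 1 - sse / sst                    # coefficient of determination

print(round(sse), round(s, 1), round(r_squared, 4))
```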

Some Theoretical Results. If the errors {$\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$} are independent and identically distributed as $N(0, \sigma^2)$, then we have

(a) $\hat{\beta}_1 \sim N\left[\beta_1,\ \sigma^2 / \left((n-1)s_X^2\right)\right]$

(b) $\dfrac{(n-2)s^2}{\sigma^2} \sim \chi^2_{n-2}$

(c) $\hat{\beta}_1$ and $s^2$ are independent.


Inferences for the Model

Testing the Slope

We can draw inference about $\beta_1$ from $\hat{\beta}_1$ by testing

H0: $\beta_1 = 0$ versus

H1: $\beta_1 \neq 0$ (or < 0, or > 0)

The implication of this test is clear: if H0 is rejected, one can conclude that there is sufficient evidence to show that Y and X are linearly related; otherwise, there is not sufficient evidence of a linear relationship. The same question can be answered by constructing a confidence interval for $\beta_1$.

From the theoretical results given earlier and the results presented in Chapter 5b regarding the t-distribution, it follows immediately that

$$\frac{\hat{\beta}_1 - \beta_1}{s \big/ \sqrt{(n-1)s_X^2}} \sim t_{n-2}$$

This is a statistic for testing the slope parameter or constructing a confidence interval for it.


Inferences for the Model

A $100(1-\alpha)\%$ confidence interval for $\beta_1$ is given as

$$\left\{\hat{\beta}_1 - t_{n-2}(\alpha/2)\,\frac{s}{\sqrt{(n-1)s_X^2}},\ \ \hat{\beta}_1 + t_{n-2}(\alpha/2)\,\frac{s}{\sqrt{(n-1)s_X^2}}\right\}$$

The quantity $s \big/ \sqrt{(n-1)s_X^2}$ is an estimate of the standard deviation of $\hat{\beta}_1$, and is thus referred to as the estimated standard error of $\hat{\beta}_1$.

Inference concerning the intercept parameter $\beta_0$ can be carried out in a similar manner, but it is not as interesting and important as for the slope parameter $\beta_1$.


Inferences for the Model

Example 8.5. Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses, in Example 8.4. Use $\alpha$ = 5%.

Solution: H0: $\beta_1 = 0$ vs H1: $\beta_1 \neq 0$

$$\frac{s}{\sqrt{(n-1)s_X^2}} = \frac{303.1}{\sqrt{(99)(43{,}528{,}690)}} = 0.00462$$

$$t = \frac{\hat{\beta}_1 - \beta_1}{s \big/ \sqrt{(n-1)s_X^2}} = \frac{-0.0623 - 0}{0.00462} = -13.49$$

With $\nu = n - 2 = 98$ degrees of freedom, the rejection region is

$t > t_{98}(.025)$ or $t < -t_{98}(.025)$,

where $t_{98}(.025) \approx 1.984$. As t = −13.49 < −1.984, reject H0 at the 5% level of significance. Yes, there is enough evidence to …

A 95% CI for $\beta_1$:

$$-0.0623 \pm 1.984 \times 0.00462 = \{-0.0715,\ -0.0531\}$$
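The test statistic and the confidence interval in Example 8.5 can be reproduced with a few lines of arithmetic. The sketch below takes the rounded inputs used in the solution and the critical value 1.984 quoted there, rather than computing the t quantile itself; the variable names are illustrative.

```python
from math import sqrt

# Inputs as used in Example 8.5.
n = 100
b1_hat = -0.0623   # estimated slope (Example 8.3, rounded)
s = 303.1          # estimate of error standard deviation (Example 8.4)
s2_x = 43_528_690  # sample variance of X (Example 8.2)
t_crit = 1.984     # t_98(.025), as quoted in the solution

# Estimated standard error of the slope.
se_b1 = s / sqrt((n - 1) * s2_x)

# Test statistic for H0: beta1 = 0 against a two-sided alternative.
t_stat = (b1_hat - 0) / se_b1
reject_h0 = abs(t_stat) > t_crit

# 95% confidence interval for beta1.
ci_low = b1_hat - t_crit * se_b1
ci_high = b1_hat + t_crit * se_b1

print(round(se_b1, 5), round(t_stat, 2), reject_h0)
print(round(ci_low, 4), round(ci_high, 4))
```

The interval lies entirely below zero, which agrees with the rejection of H0 by the two-sided test.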


Predictions

• Before using the regression model, we need to assess how well it fits the data.

• If we are satisfied with how well the model fits the data, we can use it to predict a future value Y0, or the mean of Y0, based on the future value X0. This is in fact an important application of a regression model.

• The simple linear regression model can be easily extended to include more predictor variables, e.g., in the examples presented, the price of a used car is not only affected by its odometer reading, but also affected by its ‘age’, color, etc.

• Those constitute important topics in an advanced course: Applied Regression Methods (STAT312).
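As a simple illustration of prediction, the fitted line from Example 8.3 can be evaluated at a new odometer reading. The value X0 = 40,000 below is a made-up input, and the function name is hypothetical. The same point value serves as the prediction of a single future price and as the estimate of the mean price at X0; the two differ only in their interval widths, which are not covered in this sketch.

```python
# Least squares estimates from Example 8.3.
B0_HAT = 17_067.0
B1_HAT = -0.0623


def predict_price(odometer):
    """Point prediction of price from the fitted line Y_hat = b0 + b1 * X."""
    return B0_HAT + B1_HAT * odometer


x0 = 40_000  # hypothetical future odometer reading
y0_hat = predict_price(x0)
print(round(y0_hat, 2))
```

Predictions are most trustworthy for odometer readings inside the range of the observed data, for the same reason the intercept should not be read as the price of an undriven car.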

The end. Thank you.