Chapter 9 Statistical Inferences Based on Two Samples Business Statistics in Practice

Chapter 9

Statistical Inferences Based onTwo Samples

Business Statistics in Practice

Learning Objectives

In this chapter, you learn: How to use hypothesis testing for

comparing the difference between The means of two independent

populations The means of two related

populations The proportions of two independent

populations The variances of two independent

populations

Two-Sample Tests (两总体检验 )

Two-Sample Tests

Population Means,

Independent Samples

Means, Related Samples

Population Variances

Population 1 vs. independent Population 2

Same population before vs. after treatment

Variance 1 vs.Variance 2

Examples:

Population Proportions

Proportion 1 vs. Proportion 2

Difference Between Two Means

Population means, independent

samples

σ1 and σ2 known

Goal: Test hypothesis or form a confidence interval for the difference between two population means, μ1 – μ2

The point estimate for the difference is

X1 – X2

*

σ1 and σ2 unknown, assumed equal

σ1 and σ2 unknown, not assumed equal

Independent Samples (独立样本 )


samples

Different data sources Unrelated Independent

Sample selected from one population has no effect on the sample selected from the other population

Use the difference between 2 sample means

Use Z test, a pooled-variance t test, or a separate-variance t test

*

σ1 and σ2 known



Difference Between Two Means


samples

σ1 and σ2 known

*

Use a Z test statistic

Use Sp to estimate unknown σ , use a t test statistic and pooled standard deviation



Use S1 and S2 to estimate unknown σ1 and σ2, use a separate-variance t test


samples

σ1 and σ2 known

σ1 and σ2 Known

Assumptions:

Samples are randomly and independently drawn

Population distributions are normal or both sample sizes are 30

Population standard deviations are known

*σ1 and σ2 unknown,

assumed equal



samples

σ1 and σ2 known …and the standard error of

X1 – X2 is

When σ1 and σ2 are known and both populations are normal or both sample sizes are at least 30, the test statistic is a Z-value…

2

22

1

21

XX n

σ

n

σσ

21

(continued)

σ1 and σ2 Known


assumed equal



samples

σ1 and σ2 known

2

22

1

21

2121

nσ

nσ

μμXXZ

The test statistic for

μ1 – μ2 is:

σ1 and σ2 Known

*

(continued)



Hypothesis Tests forTwo Population Means

Lower-tail test:

H0: μ1 ≥ μ2

Ha: μ1 < μ2

i.e.,

H0: μ1 – μ2 ≥ 0Ha: μ1 – μ2 < 0

Upper-tail test:

H0: μ1 ≤ μ2

Ha: μ1 > μ2

i.e.,

H0: μ1 – μ2 ≤ 0Ha: μ1 – μ2 > 0

Two-tail test:

H0: μ1 = μ2

Ha: μ1 ≠ μ2

i.e.,

H0: μ1 – μ2 = 0Ha: μ1 – μ2 ≠ 0

Two Population Means, Independent Samples

Two Population Means, Independent Samples

Lower-tail test:

H0: μ1 – μ2 ≥ 0Ha: μ1 – μ2 < 0

Upper-tail test:

H0: μ1 – μ2 ≤ 0Ha: μ1 – μ2 > 0

Two-tail test:

H0: μ1 – μ2 = 0Ha: μ1 – μ2 ≠ 0

a a/2 a/2a

-za -za/2za za/2

Reject H0 if Z < -Za Reject H0 if Z > Za Reject H0 if Z < -Za/2

or Z > Za/2

Hypothesis tests for μ1 – μ2


samples

σ1 and σ2 known

2

22

1

21

21n

σ

n

σZXX

The confidence interval for

μ1 – μ2 is:

Confidence Interval, σ1 and σ2 Known


assumed equal


Customer Waiting Time Case

A random sample of size 100 waiting times observed under the current system of serving customers has a sample waiting time mean of 8.79 Call this population 1 Assume population 1 is normal or sample size is large The variance is 4.7

A random sample of size 100 waiting times observed under the new system of serving customers has a sample mean waiting time of 5.14 Call this population 2 Assume population 2 is normal or sample size is large The variance is 1.9

Then if the samples are independent …

Example 9.1Example 9.1


At 95% confidence, z/2 = z0.025 = 1.96, and

According to the calculated interval, the bank manager can be 95% confident that the new system reduces the mean waiting time by between 3.15 and 4.15 minutes

154153

50350653

100

91

100

74961145798

2

22

1

21

221

.,.

..

.....

nnzxx


Test the claim that the new system reduces the mean waiting time

Test at the = 0.05 significance level the nullH0: 1 – 2 ≤ 0 against the alternative Ha: 1 – 2 > 0

Use the rejection rule H0 if z > z

At the 5% significance level, z = z0.05 = 1.645

So reject H0 if z > 1.645

Use the sample and population data in Example 9.1 to calculate the test statistic

2114

25690

653

100

91

100

74

0145798

2

22

1

21

021 ..

.

..

..

nn

Dxxz


Because z = 14.21 > z0.05 = 1.645, reject H0

Conclude that 1 – 2 is greater than 0 and therefore the new system does reduce the waiting time by 3.65 minutes On average, reduces the mean time by 3.65 minutes

The p-value for this test is the area under the standard normal curve to the right of z = 14.21 This z value is off the table, so the p-value has to be

much less than 0.001 So, we have extremely strong evidence that H0 is false

and that Ha is true Therefore, we have extremely strong evidence that the

new system reduces the mean waiting time


samples

σ1 and σ2 known

σ1 and σ2 Unknown, Assumed Equal

Assumptions:


Populations are normally distributed or both sample sizes are at least 30

Population variances are unknown but assumed equal

*σ1 and σ2 unknown, assumed equal



samples

σ1 and σ2 known

(continued)

*

Forming interval estimates:

The population variances are assumed equal, so use the two sample variances and pool them to estimate the common σ2

the test statistic is a t value with (n1 + n2 – 2) degrees of freedom





samples

σ1 and σ2 knownThe pooled variance (合并方差 ) is

(continued)

1)n(n

S1nS1nS

21

222

2112

p

()1

*σ1 and σ2 unknown, assumed equal




samples

σ1 and σ2 known

Where t has (n1 + n2 – 2) d.f.,

and

21

2p

2121

n1

n1

S

μμXXt


μ1 – μ2 is:

*

1)n()1(n

S1nS1nS

21

222

2112

p

(continued)





samples

σ1 and σ2 known

21

2p2-nn21

n

1

n

1StXX

21

The confidence interval for

μ1 – μ2 is:

Where

1)n()1(n

S1nS1nS

21

222

2112

p

*

Confidence Interval, σ1 and σ2 Unknown



Catalyst Comparison Case

The difference in mean hourly yields of a chemical process Given:

Assume that populations of all possible hourly yields for the two catalysts are both normal with the same variance

The pooled estimate of 2 is

Let 1 be the mean hourly yield of catalyst 1 and let 2 be the mean hourly yield of catalyst 2

2.484,2.750,5

0.386,0.811,52222

2111

sxn

sxn

1.435

255

2.4841538615

2

11

21

222

2112

nn

snsnsp


continued Want the 95% confidence interval for 1 – 2

df = (n1 + n2 – 2) = (5 + 5 – 2) = 8

At 95% confidence, t/2 = t0.025. For 8 degrees of freedom, t0.025 = 2.306

The 95% confidence interval is

So we can be confident that the mean hourly yield from catalyst 1 is between 30.38 and 91.22 pounds higher than that of catalyst 2

On average, the mean yields will differ by 60.8 lbs

22.91,38.304217.308.60

5

1

5

11.435306.22.750811

11

21

2025.021

nnstxx p

Pooled-Variance t Test

You are a financial analyst for a brokerage firm. Is there a difference in dividend yield between stocks listed on the NYSE & NASDAQ? You collect the following data:

NYSE NASDAQNumber 21 25Sample mean 3.27 2.53Sample std dev 1.30 1.16

Assuming both populations are approximately normal with equal variances, isthere a difference in average yield ( = 0.05)?


Calculating the Test Statistic

1.5021

1)25(1)-(21

1.161251.30121

1)n()1(n

S1nS1nS

22

21

222

2112

p

2.040

251

211

5021.1

02.533.27

n1

n1

S

μμXXt

21

2p

2121

The test statistic is:

Solution

H0: μ1 - μ2 = 0 i.e. (μ1 = μ2)

Ha: μ1 - μ2 ≠ 0 i.e. (μ1 ≠ μ2)

= 0.05

df = 21 + 25 - 2 = 44Critical Values: t = ± 2.0154

Test Statistic: Decision:

Conclusion:

Reject H0 at a = 0.05

There is evidence of a difference in means.

t0 2.0154-2.0154

.025

Reject H0 Reject H0

.025

2.040

2.040

251

211

5021.1

2.533.27t


samples

σ1 and σ2 known

σ1 and σ2 Unknown, Not Assumed Equal

Assumptions:


Populations are normally distributed or both sample sizes are at least 30

Population variances are unknown but cannot be assumed to be equal*




samples

σ1 and σ2 known

(continued)

*

Forming the test statistic:

The population variances are not assumed equal, so include the two sample variances in the computation of the t-test statistic

the test statistic is a t value with v degrees of freedom (see next slide)





samples

σ1 and σ2 known

The number of degrees of freedom is the integer portion of:

(continued)

11

22

2

2

2

22

1

1

21

2

22

1

21

n

nS

n

nS

nS

nS

*





samples

σ1 and σ2 known

2

22

1

21

2121

nS

nS

μμXXt


μ1 – μ2 is:

*

(continued)




A recent EPA study compared the highway fuel economy of domestic and imported passenger cars. A sample of 15 domestic cars revealed a mean of 33.7 mpg with a sample standard deviation of 2.4 mpg.

A sample of 12 imported cars revealed a mean of 35.7 mpg with a sample standard deviation of 3.9. Assuming both populations are approximately normal with equal variances, At the .05 significance level can the EPA conclude that the mpg for the domestic cars is lower than the imported cars?

Exercise

Step 1 State the null and alternate hypotheses. H0: µD > µI H1: µD < µI

Step 2 The .05 significance level is stated in the problem

Step 3 Find the appropriate test statistic. we use the t distribution.

918.921215

)9.3)(112()4.2)(115(

2

))(1())(1(

22

21

222

2112

nn

snsns p

Step 4 The decision rule is to reject H0 if t<-1.708 or if p-value < .05. There are n-1 or 25 degrees of freedom.

Step 5 We compute the pooled variance.

640.1

12

1

15

1312.8

7.357.33

11

21

2

21

nns

XXt

p

Since a computed t of –1.64 > critical t of –1.71, the p-value of .0567 > of .05, H0 is not rejected. There is insufficient sample evidence to claim a higher mpg on the imported cars.

Dependent samples are samples that are paired or related in some fashion.

If you wished to buy a car you would look at the same car at two (or more) different dealerships and compare the prices.

If you wished to measure the effectiveness of a new diet you would weigh the dieters at the start and at the finish of the program.

Town and Country Cadillac Downtown Cadillac

Related Populations (相关总体 )

Example (Related Population):

An analyst for Educational Testing Service wants to Compare the mean GMAT scores of students before & after taking a GMAT review course.

Nike wants to see if there is a difference in durability of 2 sole materials. One type is placed on one shoe, the other type on the other shoe of the same pair..

Test of Related Populations

Tests Means of 2 Related Populations Paired or matched samples Repeated measures (before/after) Use difference between paired values:

Eliminates Variation Among Subjects Assumptions:

Both Populations Are Normally Distributed Or, if not Normal, use large samples

Related samples

di = X1i - X2i

Mean Difference

The ith paired difference is di , whereRelated samples

di = X1i - X2i

The point estimate for the population mean paired difference is d :

n

ii 1

n

dd

n is the number of pairs in the paired sample

We the unknown population standard deviation with a sample standard deviation:

Related samples

n2

ii 1

d

(d d)S

n 1

The sample standard deviation is

Mean Difference, Estimate of σd

Use a paired t test, the test statistic for d is a t statistic, with n-1 d.f.:Paired

samples

n2

ii 1

d

(d d)S

n 1

d

d μt

S

n

d

Where t has n - 1 d.f.

and Sd is:

Mean Difference (continued)

The confidence interval for μd isPaired samples

n2

ii 1

d

(d d)S

n 1

d/ 2, 1

S

nnd t

where

Confidence Interval

Lower-tail test:

H0: μd 0Ha: μd < 0

Upper-tail test:

H0: μd ≤ 0Ha: μd > 0

Two-tail test:

H0: μd = 0Ha: μd ≠ 0

Paired Samples

Hypothesis Testing for Mean Difference, σd Unknown

/2 /2

-t -t/2t t/2

Reject H0 if t < -t Reject H0 if t > t Reject H0 if t < -t or t > t Where t has n - 1 d.f.

Repair Cost Comparison

Table: A sample of n=7 paired differences of the repair costestimates at garages 1 and 2 (Cost estimate in hundreds of dollars)

Damaged cars

Repair cost estimates at garage 1

Repair cost estimate at garage 2

Paired differences

Car 1 $7.1 $7.9 d1=-0.8

Car 2 9.0 10.1 d2=-1.1

Car 3 11.0 12.2 d3=-1.2

Car 4 8.9 8.8 d4=0.1

Car 5 9.9 10.4 d5=-0.5

Car 6 9.1 9.8 d6=-0.7

Car 7 10.3 11.7 d7=-1.4



Sample of n = 7 damaged cars Each damaged car is taken to Garage 1 for its

estimated repair cost, and then is taken to Garage 2 for its estimated repair cost

Estimated repair costs at Garage 1: 1 = 9.329 Estimated repair costs at Garage 2: 2 = 10.129 Sample of n = 7 paired differences

and8012910329921 ...xxd

50330

253302

.s

.s

d

d

x

x

continued

At 95% confidence, want t/2 with n – 1 = 6 degrees of freedom t/2 = 2.447

The 95% confidence interval is

Can be 95% confident that the mean of all possible paired differences of repair cost estimates at the two garages is between -$126.54 and -$33.46

Can be 95% confident that the mean of all possible repair cost estimates at Garage 1 is between $126.54 and $33.46 less than the mean of all possible repair cost estimates at Garage 2

33460265414654080

7

503304472802

.,...

...

n

std d

/


Now, test if repair cost at Garage 1 is less expensive than at Garage 2, that is, test if d = 1 – 2 is less than zero

H0: d ≥ 0 Ha: d < 0 Test at the = 0.01 significance level. Reject if t < –t, that is , if t < –t0.01

With n – 1 = 6 degrees of freedom, t0.01 = 3.143

So reject H0 if t < –3.143

continued

Calculate the t statistic

Because t = –4.2053 is less than –t0.01 = – 3.143, reject H0

Conclude at the a = 0.01 significance level that the mean repair cost at Garage 1 is less than the mean repair cost of Garage 2

From a computer, for t = -4.2053, the p-value is 0.003 Because this p-value is very small, there is very strong evidence

that H0 should be rejected and that 1 is actually less than 2

2053.475033.0

8.00

ns

dt

d

Assume you send your salespeople to a “customer service” training workshop. Has the training made a difference in the number of complaints? You collect the following data:

Paired t Test (成对 t检验 )

Number of Complaints: (2) - (1)Salesperson Before (1) After (2) Difference, di

C.B. 6 4 - 2 T.F. 20 6 -14 M.H. 3 2 - 1 R.K. 0 0 0 M.O. 4 0 - 4 -21

d = di

n

2i

d

(d d)S

n 15.67

= -4.2


Has the training made a difference in the number of complaints (at the 0.01 level)?

- 4.2d =

d

d

μ 4.2 0t 1.66

S / n 5.67/ 5

d

H0: μd = 0Ha: μd 0

Test Statistic:

Critical Value = ± 4.604 d.f. = n - 1 = 4

Reject

/2

- 4.604 4.604

Decision: Do not reject H0

(t stat is not in the reject region)

Conclusion: There is not a significant change in the number of complaints.

Paired t Test: Solution

Reject

/2

- 1.66 = .01

Two Population Proportions

Goal: test a hypothesis or form a confidence interval for the difference between two population proportions,

p1 – p2

The point estimate for the difference is

Population proportions

Assumptions: n1 p1 5 , n1(1- p1) 5

n2 p2 5 , n2(1- p2) 5

1 2ˆ ˆp p

Two Population Proportions

Then the population of all possible values of Has approximately a normal distribution if each of

the sample sizes n1 and n2 is large

Here, n1 and n2 are large enough is n1 p1 ≥ 5, n1 (1 - p1) ≥ 5, n2 p2 ≥ 5, and n1 (1 – p2) ≥ 5

Has mean

Has standard deviation

21 pp

2

22

1

11 1121 n

pp

n

pppp

2121pppp

Confidence Interval forTwo Population Proportions


If the random samples are independent of each other, then the following a 100(1 – α) percent confidence interval for

p1 – p2 is:

2

22

1

11221

11

n

pp

n

ppzpp



p1 – p2 is a Z statistic:

(continued)

1 2

1 2 1 2

ˆ ˆ

ˆ ˆ ( )

p p

p p p pz =

Testing the Difference of Two

Population Proportions

Testing the Difference of TwoPopulation Proportions

If p1-p2= 0, estimate by

If p1-p2 ≠ 0, estimate by

1 2ˆ ˆ

1 2

1 1ˆ ˆ1 , p ps p p

n n

2

22

1

11 1121 n

pp

n

pps pp

21 pp

21 pp

(continued)

the total number of units in the two samples that fall into the category of interestˆ

the total number of units in the two samplesp

Hypothesis Tests forTwo Population Proportions


Lower-tail test:

H0: p1 p2

Ha: p1 < p2

i.e.,

H0: p1 – p2 0Ha: p1 – p2 < 0

Upper-tail test:

H0: p1 ≤ p2

Ha: p1 > p2

i.e.,

H0: p1 – p2 ≤ 0Ha: p1 – p2 > 0

Two-tail test:

H0: p1 = p2

Ha: p1 ≠ p2

i.e.,

H0: p1 – p2 = 0Ha: p1 – p2 ≠ 0

Hypothesis Tests forTwo Population Proportions


Lower-tail test:

H0: p1 – p2 0Ha: p1 – p2 < 0

Upper-tail test:

H0: p1 – p2 ≤ 0Ha: p1 – p2 > 0

Two-tail test:

H0: p1 – p2 = 0Ha: p1 – p2 ≠ 0

/2 /2

-z -z/2z z/2

Reject H0 if Z < -Z Reject H0 if Z > Z Reject H0 if Z < -Z

or Z > Z

(continued)

Two population Proportions

Is there a significant difference between the proportion of men and the proportion of women who will vote Yes on Proposition A?

In a random sample, 36 of 72 men and 31 of 50 women indicated they would vote Yes

Test at the .05 level of significance


The hypothesis test is:

1p

H0: p1 – p2 = 0 (the two proportions are equal)

Ha: p1 – p2 ≠ 0 (there is a significant difference between proportions)

The sample proportions are: Men: = 36/72 = .50

Women: = 31/50 = .62

1 2

1 2

X X 36 31 67ˆ .549

n n 72 50 122p

The pooled estimate for the overall proportion is:

Two population Proportions(continued)

2p

The test statistic for p1 – p2 is:

Example: Two population Proportions

(continued)

.025

-1.96 1.96

.025

-1.31

Decision: Do not reject H0

Conclusion: There is not significant evidence of a difference in proportions who will vote yes between men and women.

1 2 1 2

1 2

ˆ ˆz

1 1ˆ ˆ(1 )

n n

.50 .62 01.31

1 1.549(1 .549)

72 50

p p p p

p p

Reject H0 Reject H0

Critical Values = ±1.96For = .05

Chapter Summary

Compared two independent samples Performed Z test for the difference in two means Performed pooled variance t test for the difference in two

means Performed separate-variance t test for difference in two means Formed confidence intervals for the difference between two

means

Compared two related samples (paired samples) Performed paired sample Z and t tests for the mean difference

Formed confidence intervals for the mean difference

Chapter Summary

Compared two population proportions Formed confidence intervals for the difference between two

population proportions Performed Z-test for two population proportions

(continued)

Chapter 11

Simple Linear Regression Analysis (线性回归分析 )

Business Statistics in Practice

Table 11.1 lists the percentage of the labour force that was Table 11.1 lists the percentage of the labour force that was unemployed during the decade unemployed during the decade 1991-20001991-2000. Plot a graph with . Plot a graph with the time (years after 1991) on the the time (years after 1991) on the xx axis and percentage of axis and percentage of unemployment on theunemployment on the y y axis. Do the points follow a clear axis. Do the points follow a clear pattern? Based on these data, what would you expect the pattern? Based on these data, what would you expect the percentage of unemployment to be in the year percentage of unemployment to be in the year 20052005??

Number of Years Percentage of

Year from 1991 Unemployed

1991 0 6.8

1992 1 7.5

1993 2 6.9

1994 3 6.1

1995 4 5.6

1996 5 5.4

1997 6 4.9

1998 7 4.5

1999 8 4.2

2000 9 4.0

Table 11.1 Percentage of Civilian Unemployment

The pattern does suggest that we may be able to get useful The pattern does suggest that we may be able to get useful information by finding a line that “best fits” the data in information by finding a line that “best fits” the data in some meaningful way. It produces the “best-fitting line”. some meaningful way. It produces the “best-fitting line”.

Based on this formula, we can attempt a prediction of the Based on this formula, we can attempt a prediction of the unemployment rate in the year unemployment rate in the year 20052005: :

338.7389.0 xy

892.1338.7)14(389.0)14( yNote: Note: Care must be taken when making predictions by extrapolating Care must be taken when making predictions by extrapolating from known data, especially when the data set is as small as the one from known data, especially when the data set is as small as the one in this example. in this example.

Learning Objectives

In this chapter, you learn: How to use regression analysis to

predict the value of a dependent variable based on an independent variable

The meaning of the regression coefficients b0 and b1

To make inferences about the slope and correlation coefficient

To estimate mean values and predict individual values

Correlation(相关 ) vs. Regression(回归 )

A scatter diagram (散点图 ) can be used to show the relationship between two variables

Correlation (相关 ) analysis is used to measure strength of the association (linear relationship) between two variables Correlation is only concerned with strength of the

relationship

No causal effect (因果效应 ) is implied with correlation

Scatter Diagrams are used to examine possible relationships between two numerical variables

The Scatter Diagram: one variable is measured on the vertical

axis and the other variable is measured on the horizontal axis

Scatter Diagrams

Scatter Plots(散点图 )

Restaurant Ratings: Mean Preference vs. Mean Taste

Visualize the data to see patterns, especially “trends”

Introduction to Regression Analysis

Regression analysis is used to: Predict the value of a dependent variable based on the

value of at least one independent variable

Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to predict or explain

Independent variable: the variable used to explain the dependent variable

Simple Linear Regression Model

Only one independent variable, X

Relationship between X and Y is described by a linear function

Changes in Y are assumed to be caused by changes in X

Types of Relationships

Y

X

Y

X

Y

Y

X

X

Linear relationships Curvilinear relationships


Y

X

Y

X

Y

Y

X

X

Strong relationships Weak relationships

(continued)


Y

X

Y

X

No relationship

(continued)

ii10i εXββY Linear component


Population Y intercept

Population SlopeCoefficient

Random Error term

Dependent Variable

Independent Variable

Random Error component

(continued)

Random Error for this Xi value

Y

X

Observed Value of Y for Xi

Predicted Value of Y for Xi

ii10i εXββY

Xi

Slope = β1

Intercept = β0

εi


i10i XbbY

The simple linear regression equation provides an estimate of the population regression line

Simple Linear Regression Equation (Prediction Line)

Estimate of the regression

intercept

Estimate of the regression slope

Estimated (or predicted) Y value for observation i

Value of X for observation i

The individual random error terms ei have a mean of zero

Least Squares Method (最小二乘方法 )

b0 and b1 are obtained by finding the values

of b0 and b1 that minimize the sum of the

squared differences between Y and :

2i10i

2ii ))Xb(b(Ymin)Y(Ymin

Y

b0 is the estimated average value of Y

when the value of X is zero

b1 is the estimated change in the average

value of Y as a result of a one-unit change in X

Interpretation of the Slope(斜率 ) and the Intercept(截

距 )

Example 11.1Example 11.1 The House Price Case

A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)

A random sample of 10 houses is selected Dependent variable (Y) = house price in $1000s

Independent variable (X) = square feet

0

50

100

150

200

250

300

350

400

450

0 500 1000 1500 2000 2500 3000

Square Feet

Ho

use

Pri

ce (

$100

0s)

Graphical Presentation

House price model: scatter plot

0

50

100

150

200

250

300

350

400

450

0 500 1000 1500 2000 2500 3000

Square Feet

Ho

use

Pri

ce (

$100

0s)

Graphical Presentation

House price model: scatter plot and regression line

feet) (square 0.10977 98.24833 price house

Slope = 0.10977

Intercept = 98.248

Interpretation of the Intercept, b0

b0 is the estimated average value of Y when the

value of X is zero (if X = 0 is in the range of observed X values)

Here, no houses had 0 square feet, so b0 = 98.24833

just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet


Interpretation of the Slope Coefficient, b1

b1 measures the estimated change in the

average value of Y as a result of a one-unit change in X Here, b1 = .10977 tells us that the average value of a

house increases by .10977($1000) = $109.77, on average, for each additional one square foot of size


317.85

0)0.1098(200 98.25

(sq.ft.) 0.1098 98.25 price house

Predict the price for a house with 2000 square feet:

The predicted price for a house with 2000 square feet is 317.85($1,000s) = $317,850

Predictions using Regression Analysis

0

50

100

150

200

250

300

350

400

450

0 500 1000 1500 2000 2500 3000

Square Feet

Ho

use

Pri

ce (

$100

0s)

Interpolation vs. Extrapolation

When using a regression model for prediction, only predict within the relevant range of data

Relevant range for interpolation

Do not try to extrapolate

beyond the range of observed X’s

Estimation/prediction equation

Least squares point estimate of the slope 1

xbby 10 ˆ

n

yySS

n

xxxxSS

n

yxyxyyxxSS

SS

SSb

iiyy

iiixx

iiiiiixy

xx

xy

2

2

2

22

1

)(

)()(

The Least Square Point Estimates

Least squares point estimate of the y-intercept 0

n

xx

n

yyxbyb ii 10

Model Assumptions

1. Mean of ZeroAt any given value of x, the population of potential error term values has a mean equal to zero

2. Constant Variance AssumptionAt any given value of x, the population of potential error term values has a variance that does not depend on the value of x

3. Normality AssumptionAt any given value of x, the population of potential error term values has a normal distribution

4. Independence AssumptionAny one value of the error term is statistically independent of any other value of

Measures of Variation

Total variation is made up of two parts:

SSE SSR SST Total Sum of

SquaresRegression Sum

of SquaresError Sum of

Squares

2i )YY(SST 2

ii )YY(SSE 2i )YY(SSR

where:

= Average value of the dependent variable

Yi = Observed values of the dependent variable

i = Predicted value of Y for the given Xi valueY

Y

SST = total sum of squares

Measures the variation of the Yi values around their mean Y

SSR = regression sum of squares

Explained variation attributable to the relationship between X and Y

SSE = error sum of squares

Variation attributable to factors other than the relationship between X and Y

(continued)


(continued)

Xi

Y

X

Yi

SST = (Yi - Y)2

SSE = (Yi - Yi )2

SSR = (Yi - Y)2

_

_

_

Y

Y

Y_Y


The coefficient of determination (决定系数 ) is the portion of the total variation in the dependent variable that is explained by variation in the independent variable

The coefficient of determination is also called r-squared and is denoted as r2

Coefficient of Determination, r2

1r0 2 note:

squares of sum total

squares of sum regression

SST

SSRr2

r2 = 1

Examples of Approximate r2 Values

Y

X

Y

X

r2 = 1

r2 = 1

Perfect linear relationship between X and Y:

100% of the variation in Y is explained by variation in X


Y

X

Y

X

0 < r2 < 1

Weaker linear relationships between X and Y:

Some but not all of the variation in Y is explained by variation in X


r2 = 0

No linear relationship between X and Y:

The value of Y does not depend on X. (None of the variation in Y is explained by variation in X)

Y

Xr2 = 0

The Simple Correlation Coefficient (简单相关系数 )

The simple correlation coefficient measures the strength of the linear relationship between y and x and is denoted by r

negative is if

and positive, is if

12

12

brr=

brr=

Where, b1 is the slope of the least squares line

yyxx

xy

SSSS

SSr

r can also be calculated using the formula

Different Values of the CorrelationCoefficient

t test for a population slope Is there a linear relationship between X and Y?

Null and alternative hypotheses H0: β1 = 0 (no linear relationship)

Ha: β1 0 (linear relationship does exist)

Test statistic

Inference about the Slope: t Test

1b

11

s

βbt

2nd.f.

where:

b1 = regression slope coefficient

β1 = hypothesized slope

sb = standard error of the slope

1

xx

bSS

ss

1

Example 11.2Example 11.2 The House Price Case

House Price in $1000s

(y)

Square Feet (x)

245 1400

312 1600

279 1700

308 1875

199 1100

219 1550

405 2350

324 2450

319 1425

255 1700

(sq.ft.) 0.1098 98.25 price house

Simple Linear Regression Equation:

The slope of this model is 0.1098

Does square footage of the house affect its sales price?

H0: β1 = 0

Ha: β1 0

The House Price Case ＃ 2

32938.303297.0

010977.0

s

βbt

1b

11

Reject H0Reject H0

/2=.025

-tα/2

Do not reject H0

0 tα/2

/2=.025

-2.3060 2.3060 3.329

n=10 d.f. = 10-2 = 8

There is sufficient evidence that square footage affects house price

Confidence Interval Estimate for the Slope

Confidence Interval Estimate of the Slope:

At 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858)

1b2n1 Stb d.f. = n - 2

Since the units of the house price variable is $1000s, we are 95% confident that the average impact on sales price is between $33.70 and $185.80 per square foot of house size

This 95% confidence interval does not include 0.

Conclusion: There is a significant relationship between house price and square feet at the .05 level of significance

Chapter Summary

Introduced types of regression models Reviewed assumptions of regression and correlation Discussed determining the simple linear regression

equation Described measures of variation Described inference about the slope Discussed correlation -- measuring the strength of

the association Addressed estimation of mean values and prediction

of individual values

Documents

Chapter 9 Statistical Inferences Based on Two Samples Business Statistics in Practice