theresa-scott
Chapter 9
Statistical Inferences Based on Two Samples
Business Statistics in Practice
Learning Objectives
In this chapter, you learn how to use hypothesis testing for comparing the difference between:
  The means of two independent populations
  The means of two related populations
  The proportions of two independent populations
  The variances of two independent populations
Two-Sample Tests
Population means, independent samples. Example: Population 1 vs. independent Population 2
Means, related samples. Example: same population before vs. after treatment
Population proportions. Example: Proportion 1 vs. Proportion 2
Population variances. Example: Variance 1 vs. Variance 2
Difference Between Two Means
Goal: test a hypothesis or form a confidence interval for the difference between two population means, μ1 – μ2
The point estimate for the difference is $\bar{X}_1 - \bar{X}_2$
Population means, independent samples, three cases:
  σ1 and σ2 known
  σ1 and σ2 unknown, assumed equal
  σ1 and σ2 unknown, not assumed equal
Independent Samples
Different data sources: unrelated and independent
The sample selected from one population has no effect on the sample selected from the other population
Use the difference between the two sample means
Use a Z test, a pooled-variance t test, or a separate-variance t test, depending on the case:
  σ1 and σ2 known
  σ1 and σ2 unknown, assumed equal
  σ1 and σ2 unknown, not assumed equal
Difference Between Two Means
Population means, independent samples:
  σ1 and σ2 known: use a Z test statistic
  σ1 and σ2 unknown, assumed equal: use Sp to estimate the unknown common σ; use a t test statistic with the pooled standard deviation
  σ1 and σ2 unknown, not assumed equal: use S1 and S2 to estimate the unknown σ1 and σ2; use a separate-variance t test
σ1 and σ2 Known
Assumptions:
Samples are randomly and independently drawn
Population distributions are normal or both sample sizes are at least 30
Population standard deviations are known
When σ1 and σ2 are known and both populations are normal or both sample sizes are at least 30, the test statistic is a Z-value, and the standard error of $\bar{X}_1 - \bar{X}_2$ is
$$\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
σ1 and σ2 Known (continued)
The test statistic for μ1 – μ2 is:
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
Hypothesis Tests for Two Population Means
Lower-tail test:
H0: μ1 ≥ μ2
Ha: μ1 < μ2
i.e.,
H0: μ1 – μ2 ≥ 0
Ha: μ1 – μ2 < 0
Upper-tail test:
H0: μ1 ≤ μ2
Ha: μ1 > μ2
i.e.,
H0: μ1 – μ2 ≤ 0
Ha: μ1 – μ2 > 0
Two-tail test:
H0: μ1 = μ2
Ha: μ1 ≠ μ2
i.e.,
H0: μ1 – μ2 = 0
Ha: μ1 – μ2 ≠ 0
Two Population Means, Independent Samples
Lower-tail test:
H0: μ1 – μ2 ≥ 0
Ha: μ1 – μ2 < 0
Reject H0 if Z < -zα (area α in the lower tail)
Upper-tail test:
H0: μ1 – μ2 ≤ 0
Ha: μ1 – μ2 > 0
Reject H0 if Z > zα (area α in the upper tail)
Two-tail test:
H0: μ1 – μ2 = 0
Ha: μ1 – μ2 ≠ 0
Reject H0 if Z < -zα/2 or Z > zα/2 (area α/2 in each tail)
Confidence Interval, σ1 and σ2 Known
The confidence interval for μ1 – μ2 is:
$$(\bar{X}_1 - \bar{X}_2) \pm Z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
where Z is the standard normal critical value zα/2 for the chosen confidence level
Example 9.1
Customer Waiting Time Case
A random sample of 100 waiting times observed under the current system of serving customers has a sample mean waiting time of 8.79 minutes. Call this population 1. Assume population 1 is normal or the sample size is large. The variance is 4.7.
A random sample of 100 waiting times observed under the new system of serving customers has a sample mean waiting time of 5.14 minutes. Call this population 2. Assume population 2 is normal or the sample size is large. The variance is 1.9.
Then, if the samples are independent …
Customer Waiting Time Case
At 95% confidence, zα/2 = z0.025 = 1.96, and
$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = (8.79 - 5.14) \pm 1.96\sqrt{\frac{4.7}{100} + \frac{1.9}{100}} = 3.65 \pm 0.5035 = [3.15,\ 4.15]$$
According to the calculated interval, the bank manager can be 95% confident that the new system reduces the mean waiting time by between 3.15 and 4.15 minutes
Customer Waiting Time Case
Test the claim that the new system reduces the mean waiting time
Test at the α = 0.05 significance level the null H0: μ1 – μ2 ≤ 0 against the alternative Ha: μ1 – μ2 > 0
Use the rejection rule: reject H0 if z > zα
At the 5% significance level, zα = z0.05 = 1.645
So reject H0 if z > 1.645
Use the sample and population data in Example 9.1 to calculate the test statistic
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{(8.79 - 5.14) - 0}{\sqrt{\frac{4.7}{100} + \frac{1.9}{100}}} = \frac{3.65}{0.2569} = 14.21$$
Customer Waiting Time Case
Because z = 14.21 > z0.05 = 1.645, reject H0
Conclude that μ1 – μ2 is greater than 0, so the new system does reduce the mean waiting time. The point estimate of the reduction is 3.65 minutes
The p-value for this test is the area under the standard normal curve to the right of z = 14.21. This z value is off the table, so the p-value must be much less than 0.001
So we have extremely strong evidence that H0 is false and that Ha is true. Therefore, we have extremely strong evidence that the new system reduces the mean waiting time
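The arithmetic in the Customer Waiting Time Case can be checked with a short script. This is a minimal sketch using only the figures quoted in the example; the function name `two_sample_z` is a hypothetical helper, and the critical value 1.96 is taken from the text rather than computed.

```python
import math

def two_sample_z(mean1, mean2, var1, var2, n1, n2, d0=0.0):
    """Return (z, standard_error) for testing mu1 - mu2 = d0 with known variances."""
    se = math.sqrt(var1 / n1 + var2 / n2)
    z = (mean1 - mean2 - d0) / se
    return z, se

# Figures from the Customer Waiting Time Case
z, se = two_sample_z(8.79, 5.14, 4.7, 1.9, 100, 100)

z_crit = 1.96  # z_{0.025}, as given in the text
diff = 8.79 - 5.14
ci = (diff - z_crit * se, diff + z_crit * se)

print(f"z = {z:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The same helper covers the lower-tail and two-tail cases; only the comparison against the critical value changes.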
σ1 and σ2 Unknown, Assumed Equal
Assumptions:
Samples are randomly and independently drawn
Populations are normally distributed or both sample sizes are at least 30
Population variances are unknown but assumed equal
Forming interval estimates:
The population variances are assumed equal, so pool the two sample variances to estimate the common σ²
The test statistic is a t value with (n1 + n2 – 2) degrees of freedom
The pooled variance is
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}$$
σ1 and σ2 Unknown, Assumed Equal (continued)
The test statistic for μ1 – μ2 is:
$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where t has (n1 + n2 – 2) d.f., and
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}$$
Confidence Interval, σ1 and σ2 Unknown, Assumed Equal
The confidence interval for μ1 – μ2 is:
$$(\bar{X}_1 - \bar{X}_2) \pm t_{n_1 + n_2 - 2}\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
where
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}$$
Example 9.2
Catalyst Comparison Case
The difference in mean hourly yields of a chemical process
Let μ1 be the mean hourly yield of catalyst 1 and let μ2 be the mean hourly yield of catalyst 2
Given:
n1 = 5, x̄1 = 811.0, s1² = 386.0
n2 = 5, x̄2 = 750.2, s2² = 484.2
Assume that the populations of all possible hourly yields for the two catalysts are both normal with the same variance
The pooled estimate of σ² is
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(5 - 1)(386) + (5 - 1)(484.2)}{5 + 5 - 2} = 435.1$$
Catalyst Comparison Case (continued)
Want the 95% confidence interval for μ1 – μ2
df = (n1 + n2 – 2) = (5 + 5 – 2) = 8
At 95% confidence, tα/2 = t0.025. For 8 degrees of freedom, t0.025 = 2.306
The 95% confidence interval is
$$(\bar{x}_1 - \bar{x}_2) \pm t_{0.025}\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = (811 - 750.2) \pm 2.306\sqrt{435.1\left(\frac{1}{5} + \frac{1}{5}\right)} = 60.8 \pm 30.4217 = [30.38,\ 91.22]$$
So we can be 95% confident that the mean hourly yield from catalyst 1 is between 30.38 and 91.22 pounds higher than that of catalyst 2
The point estimate of the difference in mean yields is 60.8 pounds
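The pooled-variance interval for the Catalyst Comparison Case can be reproduced as follows. This is a sketch under the example's own assumptions (normal populations, equal variances); `pooled_variance` and `pooled_ci` are hypothetical helper names, and the critical value 2.306 is read from the text, not computed.

```python
import math

def pooled_variance(s1_sq, s2_sq, n1, n2):
    """Weighted average of the two sample variances, weights n - 1."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

def pooled_ci(mean1, mean2, s1_sq, s2_sq, n1, n2, t_crit):
    """Confidence interval for mu1 - mu2 assuming equal population variances."""
    sp_sq = pooled_variance(s1_sq, s2_sq, n1, n2)
    margin = t_crit * math.sqrt(sp_sq * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - margin, diff + margin

# Catalyst data from Example 9.2; t_{0.025} with 8 d.f. is 2.306 per the text
sp_sq = pooled_variance(386.0, 484.2, 5, 5)
low, high = pooled_ci(811.0, 750.2, 386.0, 484.2, 5, 5, t_crit=2.306)

print(f"pooled variance = {sp_sq:.1f}")
print(f"95% CI = ({low:.2f}, {high:.2f})")
```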
Example 9.3
Pooled-Variance t Test
You are a financial analyst for a brokerage firm. Is there a difference in dividend yield between stocks listed on the NYSE and NASDAQ? You collect the following data:
                  NYSE    NASDAQ
Number            21      25
Sample mean       3.27    2.53
Sample std dev    1.30    1.16
Assuming both populations are approximately normal with equal variances, is there a difference in average yield (α = 0.05)?
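Calculating the Test Statistic
The pooled variance is:
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)} = \frac{(21 - 1)1.30^2 + (25 - 1)1.16^2}{(21 - 1) + (25 - 1)} = 1.5021$$
The test statistic is:
$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(3.27 - 2.53) - 0}{\sqrt{1.5021\left(\frac{1}{21} + \frac{1}{25}\right)}} = 2.040$$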
Solution
H0: μ1 - μ2 = 0 i.e. (μ1 = μ2)
Ha: μ1 - μ2 ≠ 0 i.e. (μ1 ≠ μ2)
α = 0.05
df = 21 + 25 - 2 = 44
Critical values: t = ± 2.0154 (area .025 in each tail)
Test statistic:
$$t = \frac{3.27 - 2.53}{\sqrt{1.5021\left(\frac{1}{21} + \frac{1}{25}\right)}} = 2.040$$
Decision: Reject H0 at α = 0.05, because t = 2.040 > 2.0154
Conclusion: There is evidence of a difference in means.
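The decision in the dividend-yield example can be traced step by step in code. This is an illustrative sketch; `pooled_t_test` is a hypothetical helper working from summary statistics, and the two-tail critical value 2.0154 is the one quoted in the solution.

```python
import math

def pooled_t_test(mean1, s1, n1, mean2, s2, n2, d0=0.0):
    """Return (t, df) for the pooled-variance t test of mu1 - mu2 = d0."""
    df = n1 + n2 - 2
    sp_sq = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    t = (mean1 - mean2 - d0) / math.sqrt(sp_sq * (1 / n1 + 1 / n2))
    return t, df

# NYSE (sample 1) vs. NASDAQ (sample 2) summary statistics from Example 9.3
t_stat, df = pooled_t_test(3.27, 1.30, 21, 2.53, 1.16, 25)

t_crit = 2.0154  # two-tail critical value for 44 d.f. at alpha = 0.05, per the text
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```

Note how close the statistic is to the critical value: the conclusion here is sensitive to rounding in the inputs.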
σ1 and σ2 Unknown, Not Assumed Equal
Assumptions:
Samples are randomly and independently drawn
Populations are normally distributed or both sample sizes are at least 30
Population variances are unknown and cannot be assumed to be equal
Forming the test statistic:
The population variances are not assumed equal, so include the two sample variances in the computation of the t-test statistic
The test statistic is a t value with ν degrees of freedom
σ1 and σ2 Unknown, Not Assumed Equal (continued)
The number of degrees of freedom is the integer portion of:
$$\nu = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{\left(S_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(S_2^2/n_2\right)^2}{n_2 - 1}}$$
σ1 and σ2 Unknown, Not Assumed Equal (continued)
The test statistic for μ1 – μ2 is:
$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$
σ1 and σ2 Unknown, Not Assumed Equal
Exercise
A recent EPA study compared the highway fuel economy of domestic and imported passenger cars. A sample of 15 domestic cars revealed a mean of 33.7 mpg with a sample standard deviation of 2.4 mpg.
A sample of 12 imported cars revealed a mean of 35.7 mpg with a sample standard deviation of 3.9 mpg. Assuming both populations are approximately normal with equal variances, can the EPA conclude at the .05 significance level that the mean mpg for domestic cars is lower than that for imported cars?
Step 1 State the null and alternate hypotheses. H0: µD ≥ µI H1: µD < µI
Step 2 The .05 significance level is stated in the problem.
Step 3 Find the appropriate test statistic. Because the population standard deviations are unknown, we use the t distribution.
Step 4 The decision rule is to reject H0 if t < -1.708 or if the p-value < .05. There are n1 + n2 - 2 = 15 + 12 - 2 = 25 degrees of freedom.
Step 5 We compute the pooled variance and then the test statistic.
$$s_p^2 = \frac{(n_1 - 1)(s_1^2) + (n_2 - 1)(s_2^2)}{n_1 + n_2 - 2} = \frac{(15 - 1)(2.4)^2 + (12 - 1)(3.9)^2}{15 + 12 - 2} = 9.918$$
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{33.7 - 35.7}{\sqrt{9.918\left(\frac{1}{15} + \frac{1}{12}\right)}} = -1.640$$
Since the computed t of –1.64 > critical t of –1.71 (equivalently, the p-value of .0567 > α of .05), H0 is not rejected. There is insufficient sample evidence to claim a higher mean mpg for the imported cars.
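For comparison only, the separate-variance statistic and its degrees-of-freedom formula can be applied to the same fuel-economy figures. The exercise itself assumes equal variances, so this sketch is an illustration of the separate-variance formulas rather than part of the worked solution; `separate_variance_t` is a hypothetical helper name.

```python
import math

def separate_variance_t(mean1, s1, n1, mean2, s2, n2, d0=0.0):
    """Return (t, df) for the separate-variance t test of mu1 - mu2 = d0.

    df is the integer portion of the degrees-of-freedom formula.
    """
    v1 = s1 ** 2 / n1
    v2 = s2 ** 2 / n2
    t = (mean1 - mean2 - d0) / math.sqrt(v1 + v2)
    nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, math.floor(nu)

# Domestic (sample 1) vs. imported (sample 2) cars from the exercise
t_stat, df = separate_variance_t(33.7, 2.4, 15, 35.7, 3.9, 12)

print(f"t = {t_stat:.3f}, df = {df}")
```

With unequal sample standard deviations (2.4 vs. 3.9), the degrees of freedom drop from 25 to 17 and the statistic moves slightly toward zero.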
Related Populations
Dependent samples are samples that are paired or related in some fashion.
If you wished to buy a car, you would look at the same car at two (or more) different dealerships (for example, Town and Country Cadillac and Downtown Cadillac) and compare the prices.
If you wished to measure the effectiveness of a new diet, you would weigh the dieters at the start and at the finish of the program.
Example (Related Populations):
An analyst for Educational Testing Service wants to compare the mean GMAT scores of students before and after taking a GMAT review course.
Nike wants to see if there is a difference in durability of 2 sole materials. One type is placed on one shoe, the other type on the other shoe of the same pair.
Test of Related Populations
Tests means of 2 related populations
  Paired or matched samples
  Repeated measures (before/after)
  Use the difference between paired values: di = X1i - X2i
Eliminates variation among subjects
Assumptions:
  Both populations are normally distributed
  Or, if not normal, use large samples
Mean Difference
The ith paired difference is di, where
di = X1i - X2i
The point estimate for the population mean paired difference is $\bar{d}$:
$$\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n}$$
n is the number of pairs in the paired sample
Mean Difference, Estimate of σd
We estimate the unknown population standard deviation with a sample standard deviation.
The sample standard deviation is
$$S_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}}$$
Mean Difference (continued)
Use a paired t test. The test statistic for the mean difference is a t statistic with n - 1 d.f.:
$$t = \frac{\bar{d} - \mu_d}{S_d / \sqrt{n}}$$
where t has n - 1 d.f. and Sd is:
$$S_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}}$$
Confidence Interval
The confidence interval for μd is
$$\bar{d} \pm t_{\alpha/2,\, n-1}\frac{S_d}{\sqrt{n}}$$
where
$$S_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}}$$
Hypothesis Testing for Mean Difference, σd Unknown
Paired samples
Lower-tail test:
H0: μd ≥ 0
Ha: μd < 0
Reject H0 if t < -tα
Upper-tail test:
H0: μd ≤ 0
Ha: μd > 0
Reject H0 if t > tα
Two-tail test:
H0: μd = 0
Ha: μd ≠ 0
Reject H0 if t < -tα/2 or t > tα/2
where t has n - 1 d.f.
Example 9.4
Repair Cost Comparison
Table: A sample of n = 7 paired differences of the repair cost estimates at garages 1 and 2 (cost estimates in hundreds of dollars)
Damaged car   Estimate at garage 1   Estimate at garage 2   Paired difference
Car 1         $7.1                   $7.9                   d1 = -0.8
Car 2         9.0                    10.1                   d2 = -1.1
Car 3         11.0                   12.2                   d3 = -1.2
Car 4         8.9                    8.8                    d4 = 0.1
Car 5         9.9                    10.4                   d5 = -0.5
Car 6         9.1                    9.8                    d6 = -0.7
Car 7         10.3                   11.7                   d7 = -1.4
Repair Cost Comparison
Sample of n = 7 damaged cars
Each damaged car is taken to Garage 1 for its estimated repair cost, and then is taken to Garage 2 for its estimated repair cost
Estimated repair costs at Garage 1: x̄1 = 9.329
Estimated repair costs at Garage 2: x̄2 = 10.129
Sample of n = 7 paired differences:
$$\bar{d} = \bar{x}_1 - \bar{x}_2 = 9.329 - 10.129 = -0.8, \qquad s_d^2 = 0.2533, \qquad s_d = 0.5033$$
At 95% confidence, want tα/2 with n – 1 = 6 degrees of freedom: tα/2 = t0.025 = 2.447
The 95% confidence interval is
$$\bar{d} \pm t_{\alpha/2}\frac{s_d}{\sqrt{n}} = -0.8 \pm 2.447\frac{0.5033}{\sqrt{7}} = -0.8 \pm 0.4654 = [-1.2654,\ -0.3346]$$
Can be 95% confident that the mean of all possible paired differences of repair cost estimates at the two garages is between -$126.54 and -$33.46
Can be 95% confident that the mean of all possible repair cost estimates at Garage 1 is between $33.46 and $126.54 less than the mean of all possible repair cost estimates at Garage 2
Repair Cost Comparison
Now, test whether the mean repair cost at Garage 1 is less than at Garage 2, that is, test whether μd = μ1 – μ2 is less than zero
H0: μd ≥ 0
Ha: μd < 0
Test at the α = 0.01 significance level. Reject H0 if t < –tα, that is, if t < –t0.01
With n – 1 = 6 degrees of freedom, t0.01 = 3.143
So reject H0 if t < –3.143
Calculate the t statistic
$$t = \frac{\bar{d} - 0}{s_d / \sqrt{n}} = \frac{-0.8 - 0}{0.5033 / \sqrt{7}} = -4.2053$$
Because t = –4.2053 is less than –t0.01 = –3.143, reject H0
Conclude at the α = 0.01 significance level that the mean repair cost at Garage 1 is less than the mean repair cost at Garage 2
From a computer, for t = -4.2053, the p-value is 0.003. Because this p-value is very small, there is very strong evidence that H0 should be rejected and that μ1 is actually less than μ2
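The paired-difference quantities in the Repair Cost Comparison can be recomputed directly from the seven pairs of estimates in the table. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative, and the critical value 2.447 is taken from the text.

```python
import math
import statistics

# Repair cost estimates in hundreds of dollars, cars 1 to 7
garage1 = [7.1, 9.0, 11.0, 8.9, 9.9, 9.1, 10.3]
garage2 = [7.9, 10.1, 12.2, 8.8, 10.4, 9.8, 11.7]

diffs = [g1 - g2 for g1, g2 in zip(garage1, garage2)]
n = len(diffs)

d_bar = statistics.mean(diffs)   # point estimate of the mean paired difference
s_d = statistics.stdev(diffs)    # sample standard deviation, divisor n - 1
t_stat = (d_bar - 0) / (s_d / math.sqrt(n))

t_crit = 2.447  # t_{0.025} with 6 d.f., as given in the text
margin = t_crit * s_d / math.sqrt(n)
ci = (d_bar - margin, d_bar + margin)

print(f"d_bar = {d_bar:.1f}, s_d = {s_d:.4f}, t = {t_stat:.4f}")
print(f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```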
Example 9.5
Paired t Test
Assume you send your salespeople to a “customer service” training workshop. Has the training made a difference in the number of complaints? You collect the following data:
Number of Complaints
Salesperson   Before (1)   After (2)   Difference, di = (2) - (1)
C.B.          6            4           -2
T.F.          20           6           -14
M.H.          3            2           -1
R.K.          0            0           0
M.O.          4            0           -4
Total                                  -21
$$\bar{d} = \frac{\sum d_i}{n} = \frac{-21}{5} = -4.2$$
$$S_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}} = 5.67$$
Paired t Test: Solution
Has the training made a difference in the number of complaints (at the 0.01 level)?
H0: μd = 0
Ha: μd ≠ 0
α = .01, with $\bar{d}$ = -4.2
Critical value = ± 4.604, d.f. = n - 1 = 4 (area α/2 in each tail)
Test statistic:
$$t = \frac{\bar{d} - \mu_d}{S_d / \sqrt{n}} = \frac{-4.2 - 0}{5.67 / \sqrt{5}} = -1.66$$
Decision: Do not reject H0 (the t statistic, -1.66, is not in the rejection region)
Conclusion: There is not a significant change in the number of complaints.
Two Population Proportions
Goal: test a hypothesis or form a confidence interval for the difference between two population proportions, p1 – p2
Assumptions: n1 p1 ≥ 5, n1(1 - p1) ≥ 5, n2 p2 ≥ 5, n2(1 - p2) ≥ 5
The point estimate for the difference is $\hat{p}_1 - \hat{p}_2$
Two Population Proportions
The population of all possible values of $\hat{p}_1 - \hat{p}_2$:
Has approximately a normal distribution if each of the sample sizes n1 and n2 is large. Here, n1 and n2 are large enough if n1 p1 ≥ 5, n1(1 - p1) ≥ 5, n2 p2 ≥ 5, and n2(1 – p2) ≥ 5
Has mean
$$\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2$$
Has standard deviation
$$\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$
Confidence Interval for Two Population Proportions
If the random samples are independent of each other, then a 100(1 – α) percent confidence interval for p1 – p2 is:
$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
Testing the Difference of Two Population Proportions
The test statistic for p1 – p2 is a Z statistic:
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sigma_{\hat{p}_1 - \hat{p}_2}}$$
If p1 - p2 = 0, estimate $\sigma_{\hat{p}_1 - \hat{p}_2}$ by
$$s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}, \qquad \hat{p} = \frac{\text{total number of units in the two samples that fall into the category of interest}}{\text{total number of units in the two samples}}$$
If p1 - p2 ≠ 0, estimate $\sigma_{\hat{p}_1 - \hat{p}_2}$ by
$$s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
Hypothesis Tests for Two Population Proportions
Lower-tail test:
H0: p1 ≥ p2
Ha: p1 < p2
i.e.,
H0: p1 – p2 ≥ 0
Ha: p1 – p2 < 0
Upper-tail test:
H0: p1 ≤ p2
Ha: p1 > p2
i.e.,
H0: p1 – p2 ≤ 0
Ha: p1 – p2 > 0
Two-tail test:
H0: p1 = p2
Ha: p1 ≠ p2
i.e.,
H0: p1 – p2 = 0
Ha: p1 – p2 ≠ 0
Hypothesis Tests for Two Population Proportions (continued)
Lower-tail test:
H0: p1 – p2 ≥ 0
Ha: p1 – p2 < 0
Reject H0 if Z < -zα (area α in the lower tail)
Upper-tail test:
H0: p1 – p2 ≤ 0
Ha: p1 – p2 > 0
Reject H0 if Z > zα (area α in the upper tail)
Two-tail test:
H0: p1 – p2 = 0
Ha: p1 – p2 ≠ 0
Reject H0 if Z < -zα/2 or Z > zα/2 (area α/2 in each tail)
Example 9.6
Two Population Proportions
Is there a significant difference between the proportion of men and the proportion of women who will vote Yes on Proposition A?
In a random sample, 36 of 72 men and 31 of 50 women indicated they would vote Yes
Test at the .05 level of significance
Two Population Proportions (continued)
The hypothesis test is:
H0: p1 – p2 = 0 (the two proportions are equal)
Ha: p1 – p2 ≠ 0 (there is a significant difference between proportions)
The sample proportions are:
Men: $\hat{p}_1$ = 36/72 = .50
Women: $\hat{p}_2$ = 31/50 = .62
The pooled estimate for the overall proportion is:
$$\hat{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{36 + 31}{72 + 50} = \frac{67}{122} = .549$$
The test statistic for p1 – p2 is:
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(.50 - .62) - 0}{\sqrt{.549(1 - .549)\left(\frac{1}{72} + \frac{1}{50}\right)}} = -1.31$$
Critical values = ±1.96 for α = .05 (area .025 in each tail)
Decision: Do not reject H0 (z = -1.31 lies between -1.96 and 1.96)
Conclusion: There is not significant evidence of a difference between men and women in the proportion who will vote Yes.
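The Proposition A calculation can be written as a small function. This is a sketch for the case where the hypothesized difference is zero, so the pooled estimate is used in the standard error; `two_proportion_z` is a hypothetical helper name and 1.96 is the critical value quoted in the example.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Return (z, pooled_p) for testing p1 - p2 = 0 with a pooled standard error."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    pooled_p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled_p * (1 - pooled_p) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se, pooled_p

# Men: 36 of 72 vote Yes; women: 31 of 50 vote Yes
z, pooled_p = two_proportion_z(36, 72, 31, 50)

z_crit = 1.96  # two-tail critical value at alpha = .05, per the text
reject_h0 = abs(z) > z_crit

print(f"pooled p = {pooled_p:.3f}, z = {z:.2f}, reject H0: {reject_h0}")
```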
Chapter Summary
Compared two independent samples
  Performed Z test for the difference in two means
  Performed pooled-variance t test for the difference in two means
  Performed separate-variance t test for the difference in two means
  Formed confidence intervals for the difference between two means
Compared two related samples (paired samples)
  Performed paired sample Z and t tests for the mean difference
  Formed confidence intervals for the mean difference
Compared two population proportions
  Formed confidence intervals for the difference between two population proportions
  Performed Z test for two population proportions
Chapter 11
Simple Linear Regression Analysis
Business Statistics in Practice
Table 11.1 lists the percentage of the labour force that was unemployed during the decade 1991-2000. Plot a graph with the time (years after 1991) on the x axis and percentage of unemployment on the y axis. Do the points follow a clear pattern? Based on these data, what would you expect the percentage of unemployment to be in the year 2005?
Table 11.1 Percentage of Civilian Unemployment
Year   Number of Years from 1991   Percentage of Unemployed
1991   0                           6.8
1992   1                           7.5
1993   2                           6.9
1994   3                           6.1
1995   4                           5.6
1996   5                           5.4
1997   6                           4.9
1998   7                           4.5
1999   8                           4.2
2000   9                           4.0
The pattern does suggest that we may be able to get useful information by finding a line that “best fits” the data in some meaningful way. Least squares regression produces this “best-fitting line”:
$$y = -0.389x + 7.338$$
Based on this formula, we can attempt a prediction of the unemployment rate in the year 2005:
$$y(14) = -0.389(14) + 7.338 = 1.892$$
Note: Care must be taken when making predictions by extrapolating from known data, especially when the data set is as small as the one in this example.
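The fitted line for Table 11.1 can be recovered from the ten data points. This is a minimal sketch of a least squares fit written from the usual sums of squares; `least_squares` is a hypothetical helper, not a library function.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Table 11.1: years after 1991 and percentage unemployed
years_from_1991 = list(range(10))
unemployment = [6.8, 7.5, 6.9, 6.1, 5.6, 5.4, 4.9, 4.5, 4.2, 4.0]

b0, b1 = least_squares(years_from_1991, unemployment)
prediction_2005 = b0 + b1 * 14  # 2005 is 14 years after 1991

print(f"y = {b1:.3f}x + {b0:.3f}")
print(f"prediction for 2005 = {prediction_2005:.2f}")
```

With unrounded coefficients the 2005 prediction comes out near 1.90 rather than 1.892; the small gap comes from rounding the slope and intercept to three decimals before substituting.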
Learning Objectives
In this chapter, you learn:
  How to use regression analysis to predict the value of a dependent variable based on an independent variable
  The meaning of the regression coefficients b0 and b1
  To make inferences about the slope and correlation coefficient
  To estimate mean values and predict individual values
Correlation vs. Regression
A scatter diagram can be used to show the relationship between two variables
Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
  Correlation is only concerned with the strength of the relationship
  No causal effect is implied with correlation
Scatter Diagrams
Scatter diagrams are used to examine possible relationships between two numerical variables
In a scatter diagram, one variable is measured on the vertical axis and the other variable is measured on the horizontal axis
Scatter Plots
[Figure: scatter plot, Restaurant Ratings: Mean Preference vs. Mean Taste]
Visualize the data to see patterns, especially “trends”
Introduction to Regression Analysis
Regression analysis is used to:
  Predict the value of a dependent variable based on the value of at least one independent variable
  Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to predict or explain
Independent variable: the variable used to explain the dependent variable
Simple Linear Regression Model
Only one independent variable, X
Relationship between X and Y is described by a linear function
Changes in Y are assumed to be caused by changes in X
Types of Relationships
[Figure: scatter plots of Y against X showing linear relationships and curvilinear relationships]
[Figure: scatter plots of Y against X showing strong relationships and weak relationships]
[Figure: scatter plots of Y against X showing no relationship]
Simple Linear Regression Model
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
where
  Yi is the dependent variable
  β0 is the population Y intercept
  β1 is the population slope coefficient
  Xi is the independent variable
  εi is the random error term
β0 + β1 Xi is the linear component and εi is the random error component
[Figure: population regression line with intercept β0 and slope β1. For a given Xi, the random error εi is the vertical distance between the observed value of Y and the predicted value of Y on the line]
Simple Linear Regression Equation (Prediction Line)
The simple linear regression equation provides an estimate of the population regression line
$$\hat{Y}_i = b_0 + b_1 X_i$$
where
  Ŷi is the estimated (or predicted) Y value for observation i
  b0 is the estimate of the regression intercept
  b1 is the estimate of the regression slope
  Xi is the value of X for observation i
The individual random error terms ei have a mean of zero
Least Squares Method
b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared differences between Y and Ŷ:
$$\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum \left(Y_i - (b_0 + b_1 X_i)\right)^2$$
Interpretation of the Slope and the Intercept
b0 is the estimated average value of Y when the value of X is zero
b1 is the estimated change in the average value of Y as a result of a one-unit change in X
Example 11.1 The House Price Case
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
A random sample of 10 houses is selected
Dependent variable (Y) = house price in $1000s
Independent variable (X) = square feet
Graphical Presentation
[Figure: house price model, scatter plot of House Price ($1000s) against Square Feet]
Graphical Presentation
[Figure: house price model, scatter plot and regression line; Slope = 0.10977, Intercept = 98.248]
$$\text{house price} = 98.24833 + 0.10977\,(\text{square feet})$$
Interpretation of the Intercept, b0
$$\text{house price} = 98.24833 + 0.10977\,(\text{square feet})$$
b0 is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values)
Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
Interpretation of the Slope Coefficient, b1
$$\text{house price} = 98.24833 + 0.10977\,(\text{square feet})$$
b1 measures the estimated change in the average value of Y as a result of a one-unit change in X
Here, b1 = .10977 tells us that the average value of a house increases by .10977($1000) = $109.77 for each additional square foot of size
Predictions using Regression Analysis
Predict the price for a house with 2000 square feet:
$$\text{house price} = 98.25 + 0.1098\,(\text{sq.ft.}) = 98.25 + 0.1098(2000) = 317.85$$
The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850
Interpolation vs. Extrapolation
[Figure: scatter plot of House Price ($1000s) against Square Feet, marking the relevant range for interpolation]
When using a regression model for prediction, only predict within the relevant range of data
Do not try to extrapolate beyond the range of observed X’s
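The Least Squares Point Estimates
Estimation/prediction equation
$$\hat{y} = b_0 + b_1 x$$
Least squares point estimate of the slope β1
$$b_1 = \frac{SS_{xy}}{SS_{xx}}$$
$$SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}$$
$$SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$$
$$SS_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}$$
Least squares point estimate of the y-intercept β0
$$b_0 = \bar{y} - b_1 \bar{x}, \qquad \bar{y} = \frac{\sum y_i}{n}, \qquad \bar{x} = \frac{\sum x_i}{n}$$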
Model Assumptions
1. Mean of Zero: At any given value of x, the population of potential error term values has a mean equal to zero
2. Constant Variance Assumption: At any given value of x, the population of potential error term values has a variance that does not depend on the value of x
3. Normality Assumption: At any given value of x, the population of potential error term values has a normal distribution
4. Independence Assumption: Any one value of the error term is statistically independent of any other value of the error term
Measures of Variation
Total variation is made up of two parts:
$$SST = SSR + SSE$$
(Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)
$$SST = \sum (Y_i - \bar{Y})^2, \qquad SSR = \sum (\hat{Y}_i - \bar{Y})^2, \qquad SSE = \sum (Y_i - \hat{Y}_i)^2$$
where:
  $\bar{Y}$ = Average value of the dependent variable
  Yi = Observed values of the dependent variable
  Ŷi = Predicted value of Y for the given Xi value
SST = total sum of squares
Measures the variation of the Yi values around their mean Y
SSR = regression sum of squares
Explained variation attributable to the relationship between X and Y
SSE = error sum of squares
Variation attributable to factors other than the relationship between X and Y
[Figure: Measures of Variation. For an observation (Xi, Yi), the plot marks SST = Σ(Yi - Ȳ)², SSE = Σ(Yi - Ŷi)², and SSR = Σ(Ŷi - Ȳ)² as vertical distances relative to the regression line and the mean Ȳ]
Coefficient of Determination, r²
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
The coefficient of determination is also called r-squared and is denoted as r²
$$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$
note: $0 \le r^2 \le 1$
Examples of Approximate r² Values
r² = 1
[Figure: scatter plots of Y against X]
Perfect linear relationship between X and Y: 100% of the variation in Y is explained by variation in X
0 < r² < 1
[Figure: scatter plots of Y against X]
Weaker linear relationships between X and Y: Some but not all of the variation in Y is explained by variation in X
r² = 0
[Figure: scatter plot of Y against X]
No linear relationship between X and Y: The value of Y does not depend on X. (None of the variation in Y is explained by variation in X)
The Simple Correlation Coefficient
The simple correlation coefficient measures the strength of the linear relationship between y and x and is denoted by r
$$r = +\sqrt{r^2} \ \text{if } b_1 \text{ is positive, and} \quad r = -\sqrt{r^2} \ \text{if } b_1 \text{ is negative}$$
where b1 is the slope of the least squares line
r can also be calculated using the formula
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$$
Different Values of the Correlation Coefficient
Inference about the Slope: t Test
t test for a population slope: is there a linear relationship between X and Y?
Null and alternative hypotheses
H0: β1 = 0 (no linear relationship)
Ha: β1 ≠ 0 (linear relationship does exist)
Test statistic
$$t = \frac{b_1 - \beta_1}{s_{b_1}}, \qquad \text{d.f.} = n - 2$$
where:
  b1 = regression slope coefficient
  β1 = hypothesized slope
  s_b1 = standard error of the slope
$$s_{b_1} = \frac{s}{\sqrt{SS_{xx}}}$$
Here s is the standard error of the estimate, $s = \sqrt{SSE/(n - 2)}$
Example 11.2 The House Price Case
House Price in $1000s (y)   Square Feet (x)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700
Simple Linear Regression Equation:
$$\text{house price} = 98.25 + 0.1098\,(\text{sq.ft.})$$
The House Price Case # 2
The slope of this model is 0.1098
Does square footage of the house affect its sales price?
H0: β1 = 0
Ha: β1 ≠ 0
$$t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{0.10977 - 0}{0.03297} = 3.32938$$
n = 10, d.f. = 10 - 2 = 8, α/2 = .025, critical values ±tα/2 = ±2.3060
Because t = 3.329 > 2.3060, the test statistic falls in the upper rejection region: reject H0
There is sufficient evidence that square footage affects house price
Confidence Interval Estimate for the Slope
Confidence interval estimate of the slope:
$$b_1 \pm t_{n-2}\, S_{b_1}, \qquad \text{d.f.} = n - 2$$
At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858)
Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.70 and $185.80 per square foot of house size
This 95% confidence interval does not include 0.
Conclusion: There is a significant relationship between house price and square feet at the .05 level of significance
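The slope, its standard error, the t statistic, and the confidence interval for the House Price Case can all be recomputed from the ten observations. This is a sketch using the sums-of-squares formulas from this chapter; the variable names are illustrative and the critical value 2.3060 is the one quoted in the example.

```python
import math

# House price data from Example 11.2
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]  # in $1000s

n = len(square_feet)
x_bar = sum(square_feet) / n
y_bar = sum(price) / n

ss_xx = sum((x - x_bar) ** 2 for x in square_feet)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(square_feet, price))

b1 = ss_xy / ss_xx            # least squares slope
b0 = y_bar - b1 * x_bar       # least squares intercept

# Standard error of the estimate and of the slope
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, price))
s = math.sqrt(sse / (n - 2))
s_b1 = s / math.sqrt(ss_xx)

t_stat = (b1 - 0) / s_b1      # test of H0: beta1 = 0
t_crit = 2.3060               # t_{0.025} with 8 d.f., per the text
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)

print(f"b0 = {b0:.5f}, b1 = {b1:.5f}")
print(f"s_b1 = {s_b1:.5f}, t = {t_stat:.3f}")
print(f"95% CI for slope = ({ci[0]:.4f}, {ci[1]:.4f})")
```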
Chapter Summary
Introduced types of regression models
Reviewed assumptions of regression and correlation
Discussed determining the simple linear regression equation
Described measures of variation
Described inference about the slope
Discussed correlation: measuring the strength of the association
Addressed estimation of mean values and prediction of individual values