Comparing k Populations
Means – One way Analysis of Variance (ANOVA)
The F test – for comparing k means
Situation
• We have k normal populations
• Let $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of population i, for i = 1, 2, 3, …, k.
• Note: we assume that the standard deviation is the same for each population:
$$\sigma_1 = \sigma_2 = \dots = \sigma_k = \sigma$$
We want to test
$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$$
against
$$H_A: \mu_i \ne \mu_j \text{ for at least one pair } i, j$$
The data
• Assume we have collected data from each of the k populations.
• Let $x_{i1}, x_{i2}, x_{i3}, \dots$ denote the $n_i$ observations from population i, for i = 1, 2, 3, …, k.
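Let
$$\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij} \qquad \text{and} \qquad s_i = \sqrt{\frac{\sum_{j=1}^{n_i}\left(x_{ij} - \bar{x}_i\right)^2}{n_i - 1}}$$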
One possible solution (incorrect)
• Choose the populations two at a time,
• then perform a two-sample t test of $H_0: \mu_i = \mu_j$ vs $H_A: \mu_i \ne \mu_j$ using
$$t = \frac{\bar{x} - \bar{y}}{s_{pooled}\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}}$$
• Repeat this for every possible pair of populations.
• The flaw with this procedure is that you are performing a collection of tests rather than a single test.
• If each test is performed with α = 0.05, then the probability that any one test makes a type I error is 5%, but the probability that at least one test in the group makes a type I error could be considerably higher than 5%.
• i.e. suppose there is no difference in the means of the populations. The chance that this procedure declares a significant difference could be considerably higher than 5%.
The Bonferroni inequality
If N tests are each performed with significance level α, then
P[group of N tests makes a type I error] ≤ Nα
If the N tests are independent, this probability is exactly 1 – (1 – α)^N.
Example: Suppose α = 0.05 and N = 10 independent tests. Then
P[group of N tests makes a type I error] = 1 – (0.95)^10 ≈ 0.40
For this reason we are going to consider a single test for testing:
$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$$
against
$$H_A: \mu_i \ne \mu_j \text{ for at least one pair } i, j$$
Note: If k = 10, the number of pairs of means (and hence the number of tests that would have to be performed) is:
$$\binom{10}{2} = {}_{10}C_2 = \frac{10 \times 9}{2 \times 1} = 45$$
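The arithmetic above can be checked with a short Python sketch. It counts the pairwise tests for k populations and evaluates 1 – (1 – α)^N, which is the exact group error rate only under the assumption that the tests are independent (pairwise t tests on shared samples are not, so treat the figures as illustrative). The function name `familywise_error` is an illustrative choice, not something from the slides.

```python
from math import comb


def familywise_error(alpha: float, n_tests: int) -> float:
    """P[at least one type I error] for n_tests INDEPENDENT tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


k = 10
n_pairs = comb(k, 2)          # number of pairwise comparisons among k means
print(n_pairs)                # 45

# Group error rate for the N = 10 example, and for all 45 pairwise tests
print(round(familywise_error(0.05, 10), 4))
print(round(familywise_error(0.05, n_pairs), 4))

# Bonferroni upper bound N * alpha, valid without the independence assumption
print(min(1.0, 10 * 0.05))
```

With 45 tests the independent-test figure is already far above 5%, which is the motivation for replacing the collection of t tests by a single F test.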
The F test
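To test $H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
against $H_A: \mu_i \ne \mu_j$ for at least one pair i, j
use the test statistic
$$F = \frac{s^2_{Between}}{s^2_{Pooled}} = \frac{\sum_{i=1}^{k} n_i\left(\bar{x}_i - \bar{x}\right)^2 \Big/ (k - 1)}{\sum_{i=1}^{k} (n_i - 1)s_i^2 \Big/ \left(\sum_{i=1}^{k} n_i - k\right)}$$
where
$\bar{x}_i$ = mean for the i-th sample,
$s_i$ = standard deviation for the i-th sample,
$$\bar{x} = \frac{n_1\bar{x}_1 + \dots + n_k\bar{x}_k}{n_1 + \dots + n_k} = \text{overall mean}$$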
The statistic
$$SS_{Between} = \sum_{i=1}^{k} n_i\left(\bar{x}_i - \bar{x}\right)^2$$
is called the Between Sum of Squares and is denoted by SS_Between. It measures the variability between samples.
k – 1 is known as the Between degrees of freedom and
$$MS_{Between} = \frac{1}{k - 1}\sum_{i=1}^{k} n_i\left(\bar{x}_i - \bar{x}\right)^2$$
is called the Between Mean Square and is denoted by MS_Between.
The statistic
$$SS_{Within} = \sum_{i=1}^{k} (n_i - 1)s_i^2$$
is called the Within Sum of Squares and is denoted by SS_Within.
$$\sum_{i=1}^{k} n_i - k = N - k$$
is known as the Within degrees of freedom and
$$MS_{Within} = \frac{1}{N - k}\sum_{i=1}^{k} (n_i - 1)s_i^2$$
is called the Within Mean Square and is denoted by MS_Within.
Then
$$F = \frac{MS_{Between}}{MS_{Within}}$$
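As a minimal sketch of these definitions, the following Python function computes SS_Between, SS_Within and F directly from the sample sizes, sample means and sample standard deviations. The function name and the three-sample summary used to exercise it are hypothetical values chosen so the arithmetic is easy to follow by hand; they are not data from the slides.

```python
from typing import Sequence, Tuple


def f_from_summary(ns: Sequence[int],
                   means: Sequence[float],
                   sds: Sequence[float]) -> Tuple[float, float, float]:
    """Return (SS_Between, SS_Within, F) from per-sample n_i, x-bar_i and s_i."""
    k = len(ns)
    N = sum(ns)
    # Overall mean: weighted average of the sample means
    overall = sum(n * m for n, m in zip(ns, means)) / N
    ss_between = sum(n * (m - overall) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    ms_between = ss_between / (k - 1)      # Between d.f. = k - 1
    ms_within = ss_within / (N - k)        # Within d.f.  = N - k
    return ss_between, ss_within, ms_between / ms_within


# Hypothetical summary statistics for k = 3 samples of size 5
ss_b, ss_w, F = f_from_summary([5, 5, 5], [10.0, 12.0, 14.0], [2.0, 2.0, 2.0])
print(ss_b, ss_w, F)
```

For these made-up values the overall mean is 12, so SS_Between = 5(4 + 0 + 4) = 40, SS_Within = 3 × 4 × 4 = 48, and F = (40/2)/(48/12) = 5.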
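The Computing formula for F:
Compute
1) $\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2$
2) $T_i = \sum_{j=1}^{n_i} x_{ij}$ = Total for sample i
3) $G = \sum_{i=1}^{k} T_i = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}$ = Grand Total
4) $N = \sum_{i=1}^{k} n_i$ = Total sample size
5) $\sum_{i=1}^{k} \dfrac{T_i^2}{n_i}$
Then
1) $SS_{Within} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \dfrac{T_i^2}{n_i}$
2) $SS_{Between} = \sum_{i=1}^{k} \dfrac{T_i^2}{n_i} - \dfrac{G^2}{N}$
3) $F = \dfrac{SS_{Between}/(k - 1)}{SS_{Within}/(N - k)}$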
We reject
$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$$
if $F \ge F_\alpha$, where
$F_\alpha$ is the critical point under the F distribution with $\nu_1 = k - 1$ degrees of freedom in the numerator and $\nu_2 = N - k$ degrees of freedom in the denominator.
The critical region for the F test
Example
In the following example we are comparing weight gains resulting from the following six diets
1. Diet 1 - High Protein , Beef
2. Diet 2 - High Protein , Cereal
3. Diet 3 - High Protein , Pork
4. Diet 4 - Low protein , Beef
5. Diet 5 - Low protein , Cereal
6. Diet 6 - Low protein , Pork
Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)
Diet           1       2       3       4       5       6
              73      98      94      90     107      49
             102      74      79      76      95      82
             118      56      96      90      97      73
             104     111      98      64      80      86
              81      95     102      86      98      81
             107      88     102      51      74      97
             100      82     108      72      74     106
              87      77      91      90      67      70
             117      86     120      95      89      61
             111      92     105      78      58      82
Mean       100.0    85.9    99.5    79.2    83.9    78.7
Std. Dev.  15.14   15.02   10.92   13.89   15.71   16.55
Σx          1000     859     995     792     839     787
Σx²       102062   75819  100075   64462   72613   64401
Hence
$$\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 = 479432$$
$$N = \sum_{i=1}^{k} n_i = 60 = \text{Total sample size}$$
$$\sum_{i=1}^{k} \frac{T_i^2}{n_i} = 467846$$
i        1      2      3      4      5      6    Total (G)
T_i   1000    859    995    792    839    787    5272
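Thus
$$SS_{Within} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k} \frac{T_i^2}{n_i} = 479432 - 467846 = 11586$$
$$SS_{Between} = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{G^2}{N} = 467846 - \frac{5272^2}{60} = 4612.933$$
$$F = \frac{SS_{Between}/(k - 1)}{SS_{Within}/(N - k)} = \frac{4612.933/5}{11586/54} = \frac{922.6}{214.56} = 4.3$$
$F_{0.05} = 2.386$ with $\nu_1 = 5$ and $\nu_2 = 54$
Thus since F > 2.386 we reject H0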
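The calculation above can be reproduced with a short Python sketch of the computing formula, applied to the weight-gain data from the table. The function name `one_way_anova` and the dictionary layout are illustrative choices; the critical value 2.386 is taken from the slide rather than computed.

```python
from typing import Dict, Sequence


def one_way_anova(samples: Sequence[Sequence[float]]) -> Dict[str, float]:
    """One-way ANOVA via the computing formula (totals T_i and grand total G)."""
    k = len(samples)
    N = sum(len(s) for s in samples)
    sum_sq = sum(x * x for s in samples for x in s)            # sum of all x_ij^2
    totals = [sum(s) for s in samples]                          # T_i
    G = sum(totals)                                             # grand total
    sum_t2_over_n = sum(t * t / len(s) for t, s in zip(totals, samples))
    ss_within = sum_sq - sum_t2_over_n
    ss_between = sum_t2_over_n - G * G / N
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (N - k)
    return {"ss_between": ss_between, "ss_within": ss_within,
            "ms_between": ms_between, "ms_within": ms_within,
            "F": ms_between / ms_within}


# Weight gains (grams) for the six diets, one list per diet (columns of the table)
diets = [
    [73, 102, 118, 104, 81, 107, 100, 87, 117, 111],   # Diet 1
    [98, 74, 56, 111, 95, 88, 82, 77, 86, 92],         # Diet 2
    [94, 79, 96, 98, 102, 102, 108, 91, 120, 105],     # Diet 3
    [90, 76, 90, 64, 86, 51, 72, 90, 95, 78],          # Diet 4
    [107, 95, 97, 80, 98, 74, 74, 67, 89, 58],         # Diet 5
    [49, 82, 73, 86, 81, 97, 106, 70, 61, 82],         # Diet 6
]

result = one_way_anova(diets)
for name, value in result.items():
    print(f"{name}: {value:.3f}")

F_CRITICAL = 2.386   # F_0.05 with 5 and 54 d.f., as quoted on the slide
print("reject H0" if result["F"] > F_CRITICAL else "do not reject H0")
```

The computing formula avoids a second pass over the data to subtract means, which is why it was the standard hand-calculation layout.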
The ANOVA Table
A convenient method for displaying the calculations for the F-test
Source     d.f.     Sum of Squares    Mean Square    F-ratio
Between    k - 1    SS_Between        MS_Between     MS_B / MS_W
Within     N - k    SS_Within         MS_Within
Total      N - 1    SS_Total
ANOVA Table: The Diet Example
Source     d.f.    Sum of Squares    Mean Square    F-ratio
Between    5         4612.933          922.587      4.3 (p = 0.0023)
Within     54       11586.000          214.556
Total      59       16198.933
Equivalence of the F-test and the t-test when k = 2
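the t-test
$$t = \frac{\bar{x} - \bar{y}}{s_{Pooled}\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}}, \qquad s_{Pooled} = \sqrt{\frac{(n - 1)s_x^2 + (m - 1)s_y^2}{n + m - 2}}$$
the F-test
$$F = \frac{s^2_{Between}}{s^2_{Pooled}} = \frac{\sum_{i=1}^{k} n_i\left(\bar{x}_i - \bar{x}\right)^2 \Big/ (k - 1)}{\sum_{i=1}^{k} (n_i - 1)s_i^2 \Big/ \left(\sum_{i=1}^{k} n_i - k\right)}$$
With k = 2 this becomes
$$F = \frac{n_1\left(\bar{x}_1 - \bar{x}\right)^2 + n_2\left(\bar{x}_2 - \bar{x}\right)^2}{\left[(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\right] \big/ (n_1 + n_2 - 2)}$$
$$\text{denominator} = s^2_{Pooled}$$
$$\text{numerator} = n_1\left(\bar{x}_1 - \bar{x}\right)^2 + n_2\left(\bar{x}_2 - \bar{x}\right)^2$$
Now
$$n_2\left(\bar{x}_2 - \bar{x}\right)^2 = n_2\left(\bar{x}_2 - \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}\right)^2 = \frac{n_1^2 n_2}{(n_1 + n_2)^2}\left(\bar{x}_1 - \bar{x}_2\right)^2$$
$$n_1\left(\bar{x}_1 - \bar{x}\right)^2 = n_1\left(\bar{x}_1 - \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}\right)^2 = \frac{n_1 n_2^2}{(n_1 + n_2)^2}\left(\bar{x}_1 - \bar{x}_2\right)^2$$
so that
$$n_1\left(\bar{x}_1 - \bar{x}\right)^2 + n_2\left(\bar{x}_2 - \bar{x}\right)^2 = \frac{n_1 n_2 (n_1 + n_2)}{(n_1 + n_2)^2}\left(\bar{x}_1 - \bar{x}_2\right)^2 = \frac{n_1 n_2}{n_1 + n_2}\left(\bar{x}_1 - \bar{x}_2\right)^2 = \frac{\left(\bar{x}_1 - \bar{x}_2\right)^2}{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$$
Hence
$$F = \frac{\left(\bar{x}_1 - \bar{x}_2\right)^2}{s^2_{Pooled}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)} = t^2$$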
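The identity F = t² for k = 2 can also be checked numerically. The sketch below computes the pooled two-sample t statistic and the one-way ANOVA F statistic for two small samples and compares them. The two samples are hypothetical numbers made up for this check, and the function names are illustrative.

```python
from math import sqrt
from typing import Sequence


def pooled_t(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sample t statistic using the pooled standard deviation."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    ss_x = sum((v - xbar) ** 2 for v in x)      # (n - 1) * s_x^2
    ss_y = sum((v - ybar) ** 2 for v in y)      # (m - 1) * s_y^2
    s2_pooled = (ss_x + ss_y) / (n + m - 2)
    return (xbar - ybar) / sqrt(s2_pooled * (1 / n + 1 / m))


def f_statistic(samples: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA F = MS_Between / MS_Within."""
    k = len(samples)
    N = sum(len(s) for s in samples)
    overall = sum(sum(s) for s in samples) / N
    means = [sum(s) / len(s) for s in samples]
    ss_between = sum(len(s) * (m - overall) ** 2 for s, m in zip(samples, means))
    ss_within = sum((v - m) ** 2 for s, m in zip(samples, means) for v in s)
    return (ss_between / (k - 1)) / (ss_within / (N - k))


# Hypothetical samples of unequal size
x = [4.1, 5.0, 6.2, 5.5]
y = [6.8, 7.1, 5.9, 8.0, 7.4]

t = pooled_t(x, y)
F = f_statistic([x, y])
print(t * t, F)     # the two values agree up to floating-point rounding
```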
Using SPSS
Note: The use of another statistical package such as Minitab is similar to using SPSS
Assume the data is contained in an Excel file
Each variable is in a column
1. Weight gain (wtgn)
2. diet
3. Source of protein (Source)
4. Level of Protein (Level)
After starting the SPSS program, the following dialogue box appears:
If you select Opening an existing file and press OK the following dialogue box appears
The following dialogue box appears:
If the variable names are in the file ask it to read the names. If you do not specify the Range the program will identify the Range:
Once you “click OK”, two windows will appear
One that will contain the output:
The other containing the data:
To perform ANOVA select Analyze->General Linear Model-> Univariate
The following dialog box appears
Select the dependent variable and the fixed factors
Press OK to perform the Analysis
The Output
Tests of Between-Subjects Effects
Dependent Variable: wtgn
Source             Type III Sum of Squares    df    Mean Square      F          Sig.
Corrected Model        4612.933(a)             5       922.587        4.300     .002
Intercept            463233.067                1    463233.067     2159.036     .000
diet                   4612.933                5       922.587        4.300     .002
Error                 11586.000               54       214.556
Total                479432.000               60
Corrected Total       16198.933               59
a R Squared = .285 (Adjusted R Squared = .219)
Comments
• The F-test tests H0: μ1 = μ2 = μ3 = … = μk against HA: at least one pair of means is different.
• If H0 is not rejected, we conclude that the means are not significantly different.
• If H0 is rejected, we conclude that at least one pair of means is significantly different.
• The F-test gives no information about which pairs of means are different.
• One can now use two-sample t tests to determine which pairs of means are significantly different.
Fisher's LSD (least significant difference) procedure:
1. Test H0: μ1 = μ2 = μ3 = … = μk against HA: at least one pair of means is different, using the ANOVA F-test.
2. If H0 is not rejected, we conclude that the means are not significantly different, and we stop.
3. If H0 is rejected, we conclude that at least one pair of means is significantly different, and follow this by using two-sample t tests to determine which pairs of means are significantly different.
The ANOVA Table (diet example)
Source     d.f.    Sum of Squares    Mean Square    F-ratio
Between    5         4612.933          922.587      4.3 (p = 0.0023)
Within     54       11586.000          214.556
Total      59       16198.933
$F_{0.05} = 2.386$ with $\nu_1 = 5$ and $\nu_2 = 54$
Thus since F > 2.386 we reject H0
Conclusion: There are significant differences amongst the k = 6 means
Now we want to perform t tests to compare the k = 6 means, using
$$t = \frac{\bar{x}_i - \bar{x}_j}{s_{pooled}\sqrt{\dfrac{1}{n_i} + \dfrac{1}{n_j}}}, \qquad s_{pooled} = \sqrt{MS_{within}}$$
with $t_{0.025} = 2.005$ for 54 d.f.
Table of means
Level     High     High     High     Low      Low      Low
Source    Beef     Cereal   Pork     Beef     Cereal   Pork
Diet      1        2        3        4        5        6
Mean      100.0    85.9     99.5     79.2     83.9     78.7

t test results
Critical value $t_{0.025} = 2.005$ for 54 d.f.
t values that are significant are marked with *.
i     1 vs i     2 vs i     3 vs i     4 vs i     5 vs i
2     2.152*
3     0.076     -2.076*
4     3.175*     1.023      3.099*
5     2.458*     0.305      2.381*    -0.717
6     3.252*     1.099      3.175*     0.076      0.794
The value tabled is
$$t = \frac{\bar{x}_i - \bar{x}_j}{s_{pooled}\sqrt{\dfrac{1}{n_i} + \dfrac{1}{n_j}}} \quad \text{where } s_{pooled} = \sqrt{MS_{within}}$$
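The table of t values can be regenerated from the diet means and MS_Within with a short Python sketch. The means, the sample size of 10 per diet, MS_Within = 214.556 and the critical value 2.005 are taken from the slides; the function name `lsd_t_values` and the dictionary layout are illustrative choices.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, Tuple

means = {1: 100.0, 2: 85.9, 3: 99.5, 4: 79.2, 5: 83.9, 6: 78.7}
N_PER_DIET = 10          # each diet has 10 rats
MS_WITHIN = 214.556      # from the ANOVA table
T_CRITICAL = 2.005       # t_0.025 for 54 d.f., as quoted on the slide


def lsd_t_values(means: Dict[int, float], n: int,
                 ms_within: float) -> Dict[Tuple[int, int], float]:
    """t statistic for every pair (i, j), i < j, with s_pooled = sqrt(MS_within)."""
    s_pooled = sqrt(ms_within)
    std_error = s_pooled * sqrt(1 / n + 1 / n)   # equal sample sizes
    return {(i, j): (means[i] - means[j]) / std_error
            for i, j in combinations(sorted(means), 2)}


t_values = lsd_t_values(means, N_PER_DIET, MS_WITHIN)
significant = sorted(pair for pair, t in t_values.items() if abs(t) > T_CRITICAL)

for (i, j), t in sorted(t_values.items()):
    flag = "*" if abs(t) > T_CRITICAL else ""
    print(f"diet {i} vs diet {j}: t = {t:6.3f}{flag}")
```

Every significant pair involves diet 1 or diet 3 on one side and one of diets 2, 4, 5, 6 on the other.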
Conclusions:
1. There is no significant difference between diet 1 (high protein, beef) and diet 3 (high protein, pork).
2. There are no significant differences amongst diets 2, 4, 5 and 6, i.e. high protein, cereal (diet 2) and the low protein diets (diets 4, 5 and 6).
3. There are significant differences between diets 1 and 3 (high protein, meat) and the other diets (2, 4, 5, and 6).
Major conclusion: High protein diets result in a higher weight gain but only if the source of protein is a meat source.
These are similar conclusions to those made using exploratory techniques, such as examining box-plots.
[Figure: Box Plots: Weight Gains for Six Diets. Horizontal axis: Diet (1 to 6, labelled by level and source of protein). Vertical axis: Weight Gain (scale 40 to 130). Each box shows the median, the 25% and 75% points, and the non-outlier minimum and maximum.]
Conclusions
• Weight gain is higher for the high protein meat diets.
• Increasing the level of protein increases weight gain, but only if the source of protein is a meat source.
Carrying out the F-test and Fisher's LSD ensures the significance of the conclusions. Differences observed using exploratory methods could have occurred by chance.
Comparing k Populations
Proportions
The $\chi^2$ test for independence
The two sample test for proportions
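$$\text{test statistic: } z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\left(1 - \hat{p}\right)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where
$$\hat{p}_1 = \frac{x_1}{n_1}, \qquad \hat{p}_2 = \frac{x_2}{n_2} \qquad \text{and} \qquad \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$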
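A minimal Python sketch of this z statistic follows. The function name and the counts used to exercise it (45 successes out of 100 versus 30 out of 100) are hypothetical and chosen only to illustrate the arithmetic.

```python
from math import sqrt


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z statistic for H0: p1 = p2, using the pooled estimate of p."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_hat = (x1 + x2) / (n1 + n2)               # pooled proportion
    std_error = sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / std_error


# Hypothetical counts: 45 successes in 100 trials vs 30 successes in 100 trials
z = two_proportion_z(45, 100, 30, 100)
print(round(z, 3))
```

Here the pooled proportion is 0.375, so z² = 0.15² / (0.375 × 0.625 × 0.02) = 4.8 and z ≈ 2.19.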
The data can be displayed in the following table:
                 population
           1          2          Total
Success    x1         x2         x1 + x2
Failure    n1 - x1    n2 - x2    n1 + n2 - (x1 + x2)
Total      n1         n2         n1 + n2
This problem can be extended in two ways:
1. Increasing the populations (columns) from 2 to k (or c)
2. Increasing the number of categories (rows) from 2 to r.
         1      2      …      c      Total
1        x11    x12    …             R1
2        x21    x22    …             R2
…
r                                    Rr
Total    C1     C2     …      Cc     N
The $\chi^2$ test for independence
Situation
• We have two categorical variables R and C.
• The number of categories of R is r.
• The number of categories of C is c.
• We observe n subjects from the population and count xij = the number of subjects for which R = i and C = j.
• R = rows, C = columns
Example
Both Systolic Blood pressure (C) and Serum Cholesterol (R) were measured for a sample of n = 1237 subjects.
The categories for Blood Pressure are:
<127 127-146 147-166 167+
The categories for Cholesterol are:
<200 200-219 220-259 260+
Table: two-way frequency
Serum                 Systolic Blood pressure
Cholesterol    <127    127-146    147-166    167+    Total
<200           117     121        47         22      307
200-219         85      98        43         20      246
220-259        119     209        68         43      439
260+            67      99        46         33      245
Total          388     527       204        118     1237
The $\chi^2$ test for independence
Define
$$R_i = \sum_{j=1}^{c} x_{ij} = i^{th} \text{ row Total}$$
$$C_j = \sum_{i=1}^{r} x_{ij} = j^{th} \text{ column Total}$$
$$E_{ij} = \frac{R_i C_j}{n}$$
= Expected frequency in the (i,j) th cell in the case of independence.
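Justification for $E_{ij} = (R_i C_j)/n$ in the case of independence
Let $\pi_{ij} = P[R = i, C = j] = P[R = i]\,P[C = j] = \rho_i \gamma_j$ in the case of independence.
$$E_{ij} = n\pi_{ij} = n\rho_i\gamma_j \approx n\hat{\rho}_i\hat{\gamma}_j = n\,\frac{R_i}{n}\,\frac{C_j}{n} = \frac{R_i C_j}{n}$$
= Expected frequency in the (i,j) th cell in the case of independence.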
Then to test
H0: R and C are independent
against
HA: R and C are not independent
Use test statistic
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{\left(x_{ij} - E_{ij}\right)^2}{E_{ij}}$$
$x_{ij}$ = observed frequency in the (i,j) th cell
$E_{ij} = \dfrac{R_i C_j}{n}$ = Expected frequency in the (i,j) th cell in the case of independence.
Sampling distribution of test statistic when H0 is true
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{\left(x_{ij} - E_{ij}\right)^2}{E_{ij}}$$
has a $\chi^2$ distribution with degrees of freedom $\nu$ = (r - 1)(c - 1)
Critical and Acceptance Region
Reject H0 if: $\chi^2 \ge \chi^2_\alpha$
Accept H0 if: $\chi^2 < \chi^2_\alpha$
Table: Expected frequencies, Observed frequencies, Standardized Residuals
Each cell shows the expected frequency, the observed frequency in parentheses, and the standardized residual.
Serum                  Systolic Blood pressure
Cholesterol    <127      127-146    147-166    167+      Total
<200            96.29    130.79      50.63     29.29     307
               (117)     (121)       (47)      (22)
                 2.11     -0.86      -0.51     -1.35
200-219         77.16    104.80      40.57     23.47     246
                (85)      (98)       (43)      (20)
                 0.89     -0.66       0.38     -0.72
220-259        137.70    187.03      72.40     41.88     439
               (119)     (209)       (68)      (43)
                -1.59      1.61      -0.52      0.17
260+            76.85    104.38      40.40     23.37     245
                (67)      (99)       (46)      (33)
                -1.12     -0.53       0.88      1.99
Total          388       527        204       118       1237
$\chi^2 = 20.85$
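Standardized residuals
$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$
Test statistic
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{\left(x_{ij} - E_{ij}\right)^2}{E_{ij}} = \sum_{i=1}^{r}\sum_{j=1}^{c} r_{ij}^2 = 20.85$$
degrees of freedom $\nu$ = (r - 1)(c - 1) = 9
$$\chi^2_{0.05} = 16.919$$
Reject H0 using α = 0.05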
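The expected frequencies, standardized residuals and test statistic for the blood pressure and cholesterol table can be reproduced with a short Python sketch. The function name `chi_square_independence` is an illustrative choice; the critical value 16.919 is taken from the slide rather than computed.

```python
from math import sqrt
from typing import List, Tuple

# Observed frequencies: rows = cholesterol (<200, 200-219, 220-259, 260+),
# columns = systolic blood pressure (<127, 127-146, 147-166, 167+)
observed = [
    [117, 121, 47, 22],
    [85, 98, 43, 20],
    [119, 209, 68, 43],
    [67, 99, 46, 33],
]


def chi_square_independence(
    table: List[List[float]],
) -> Tuple[List[List[float]], List[List[float]], float, int]:
    """Return (expected frequencies, standardized residuals, chi-square, d.f.)."""
    row_totals = [sum(row) for row in table]            # R_i
    col_totals = [sum(col) for col in zip(*table)]      # C_j
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    residuals = [[(x - e) / sqrt(e) for x, e in zip(obs_row, exp_row)]
                 for obs_row, exp_row in zip(table, expected)]
    chi2 = sum(r * r for row in residuals for r in row)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, residuals, chi2, df


expected, residuals, chi2, df = chi_square_independence(observed)
print(f"chi-square = {chi2:.2f} with {df} d.f.")

CHI2_CRITICAL = 16.919   # chi-square_0.05 for 9 d.f., as quoted on the slide
print("reject H0" if chi2 >= CHI2_CRITICAL else "do not reject H0")
```

Computing the statistic as the sum of squared residuals mirrors the identity χ² = Σ r_ij² used above, so the residual table comes for free.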
Another Example
These data come from a Globe and Mail study examining the attitudes of the baby boomers. Data were collected on various age groups.
Age group                        Total
Echo (Age 20 – 29)               398
Gen X (Age 30 – 39)              342
Younger Boomers (Age 40 – 49)    378
Older Boomers (Age 50 – 59)      286
Pre Boomers (Age 60+)            445
Total                            1849
One question with responses
In an average week, how many times would you drink alcohol?
Age group                        never    once    twice    three or four times    five or more times    Total
Echo (Age 20 – 29)               115      135     64       48                     36                    398
Gen X (Age 30 – 39)              130      123     38       31                     20                    342
Younger Boomers (Age 40 – 49)    136       87     64       57                     34                    378
Older Boomers (Age 50 – 59)      109       74     40       43                     20                    286
Pre Boomers (Age 60+)            218       80     45       40                     62                    445
Total                            708      499    251      219                    172                   1849
Are there differences in weekly consumption of alcohol related to age?
Table: Expected frequencies
Age group                        never     once      twice    three or four times    five or more times    Total
Echo (Age 20 – 29)               152.40    107.41    54.03    47.14                  37.02                 398
Gen X (Age 30 – 39)              130.96     92.30    46.43    40.51                  31.81                 342
Younger Boomers (Age 40 – 49)    144.74    102.01    51.31    44.77                  35.16                 378
Older Boomers (Age 50 – 59)      109.51     77.18    38.82    33.87                  26.60                 286
Pre Boomers (Age 60+)            170.39    120.09    60.41    52.71                  41.40                 445
Total                            708       499       251      219                    172                   1849
Table: Residuals
$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$
Age group                        never     once      twice     three or four times    five or more times
Echo (Age 20 – 29)               -3.029     2.662     1.357     0.125                 -0.168
Gen X (Age 30 – 39)              -0.083     3.196    -1.237    -1.494                 -2.095
Younger Boomers (Age 40 – 49)    -0.726    -1.486     1.771     1.828                 -0.196
Older Boomers (Age 50 – 59)      -0.049    -0.362     0.189     1.568                 -1.280
Pre Boomers (Age 60+)             3.647    -3.659    -1.982    -1.750                  3.203
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{\left(x_{ij} - E_{ij}\right)^2}{E_{ij}} = \sum_{i=1}^{r}\sum_{j=1}^{c} r_{ij}^2 = 93.97$$
$$\chi^2_{0.05} = 26.296 \text{ for } \nu = 4 \times 4 = 16 \text{ d.f.}$$
Conclusion: There is a significant relationship between age group and weekly alcohol use
Examining the residuals allows one to identify the cells that indicate a departure from independence.
• Large positive residuals indicate cells where the observed frequencies were larger than expected under independence.
• Large negative residuals indicate cells where the observed frequencies were smaller than expected under independence.
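One way to carry out this examination is sketched below in Python for the alcohol data: the standardized residual is computed for each cell and the cells whose residual exceeds a cutoff in absolute value are listed. The cutoff of 2 is an assumption made for this illustration (the slides do not state a threshold), and the function name `flag_cells` is hypothetical.

```python
from math import sqrt
from typing import List, Tuple

age_groups = ["Echo", "Gen X", "Younger Boomers", "Older Boomers", "Pre Boomers"]
responses = ["never", "once", "twice", "three or four times", "five or more times"]

# Observed frequencies for "how many times would you drink alcohol?"
alcohol = [
    [115, 135, 64, 48, 36],
    [130, 123, 38, 31, 20],
    [136, 87, 64, 57, 34],
    [109, 74, 40, 43, 20],
    [218, 80, 45, 40, 62],
]


def flag_cells(table: List[List[float]], rows: List[str], cols: List[str],
               cutoff: float = 2.0) -> List[Tuple[str, str, float]]:
    """List (row label, column label, residual) for cells with |residual| > cutoff."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    flagged = []
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            e = row_totals[i] * col_totals[j] / n     # expected under independence
            resid = (x - e) / sqrt(e)
            if abs(resid) > cutoff:                   # cutoff is an assumed threshold
                flagged.append((rows[i], cols[j], resid))
    return flagged


flagged = flag_cells(alcohol, age_groups, responses)
for group, response, resid in flagged:
    direction = "more" if resid > 0 else "fewer"
    print(f"{group:16s} {response:20s} r = {resid:6.3f}  ({direction} than expected)")
```

With this cutoff the flagged cells fall in the Echo, Gen X and Pre Boomers rows only, which matches where the largest residuals sit in the table.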
Another question with responses
In an average week, how many times would you surf the internet?
Age group                        never    1 to 4 times    5 to 9 times    10 or more times    Total
Echo (Age 20 – 29)                48       72             100             178                 398
Gen X (Age 30 – 39)               51       82              92             117                 342
Younger Boomers (Age 40 – 49)     79      128              76              95                 378
Older Boomers (Age 50 – 59)       92       63              57              74                 286
Pre Boomers (Age 60+)            276       71              67              31                 445
Total                            546      416             392             495                1849
Are there differences in weekly internet use related to age?
Table: Expected frequencies
Age group                        never     1 to 4 times    5 to 9 times    10 or more times    Total
Echo (Age 20 – 29)               117.53     89.54          84.38           106.55              398
Gen X (Age 30 – 39)              100.99     76.95          72.51            91.56              342
Younger Boomers (Age 40 – 49)    111.62     85.04          80.14           101.20              378
Older Boomers (Age 50 – 59)       84.45     64.35          60.63            76.57              286
Pre Boomers (Age 60+)            131.41    100.12          94.34           119.13              445
Total                            546       416             392             495                1849
Table: Residuals
$$r_{ij} = \frac{x_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$
Age group                        never     1 to 4 times    5 to 9 times    10 or more times
Echo (Age 20 – 29)               -6.41     -1.85            1.70            6.92
Gen X (Age 30 – 39)              -4.97      0.58            2.29            2.66
Younger Boomers (Age 40 – 49)    -3.09      4.66           -0.46           -0.62
Older Boomers (Age 50 – 59)       0.82     -0.17           -0.47           -0.29
Pre Boomers (Age 60+)            12.61     -2.91           -2.82           -8.07
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{\left(x_{ij} - E_{ij}\right)^2}{E_{ij}} = \sum_{i=1}^{r}\sum_{j=1}^{c} r_{ij}^2 = 406.29$$
$$\chi^2_{0.05} = 21.03 \text{ for } \nu = 4 \times 3 = 12 \text{ d.f.}$$
Conclusion: There is a significant relationship between age group and weekly internet use
[Figure: five bar charts, one per age group: Echo (Age 20 – 29), Gen X (Age 30 – 39), Younger Boomers (Age 40 – 49), Older Boomers (Age 50 – 59), Pre Boomers (Age 60+). Horizontal axis: never, 1 to 4 times, 5 to 9 times, 10 or more times. Vertical axis: scale 0.0 to 70.0.]
Regressions and Correlation
Estimation by confidence intervals, Hypothesis Testing