ANOVA
Analysis of Variance
One way Analysis of Variance
(ANOVA)
Comparing k Populations
The F test – for comparing k means
Situation
• We have k normal populations
• Let $\mu_i$ and $\sigma^2$ denote the mean and variance
of population i.
• i = 1, 2, 3, …, k.
• Note: we assume that the variance for each
population is unknown but the same:
$\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2 = \sigma^2$
We want to test
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
against
$H_A: \mu_i \ne \mu_j$ for at least one pair $i, j$
The F statistic
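$$F = \frac{\frac{1}{k-1}\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2}{\frac{1}{N-k}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2}$$
where $x_{ij}$ = the jth observation in the ith sample, $i = 1, 2, \dots, k$ and $j = 1, 2, \dots, n_i$,
$\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$ = mean for the ith sample, $i = 1, 2, \dots, k$,
$N = \sum_{i=1}^{k} n_i$ = Total sample size,
$\bar{x} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}$ = Overall mean.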
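As a minimal sketch of the F statistic defined above, the following Python function computes F directly from the definition. The function name `one_way_f` and the toy data are illustrative assumptions, not part of the original slides.

```python
def one_way_f(samples):
    """Return (F, df_between, df_within) for a list of samples (lists of numbers)."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)                # N = sum of n_i
    grand_mean = sum(sum(s) for s in samples) / n_total   # overall mean
    means = [sum(s) / len(s) for s in samples]            # sample means
    # Between-sample sum of squares: sum of n_i * (mean_i - overall mean)^2
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    # Within-sample sum of squares: squared deviations from each sample's own mean
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return f_stat, k - 1, n_total - k


# Hypothetical toy data: k = 3 samples of size 3 each.
toy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df1, df2 = one_way_f(toy)
print(f_stat, df1, df2)
```

For these toy samples the between-sample sum of squares is 26 on 2 degrees of freedom and the within-sample sum of squares is 6 on 6 degrees of freedom, so F is 13 up to floating-point rounding.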
The ANOVA table
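$$SS_B = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2, \qquad MS_B = \frac{1}{k-1}\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2$$
$$SS_W = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2, \qquad MS_W = \frac{1}{N-k}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2$$
$$F = \frac{MS_B}{MS_W}$$
Source     S.S.    d.f.     M.S.    F
Between    SS_B    k - 1    MS_B    MS_B / MS_W
Within     SS_W    N - k    MS_W
The ANOVA table is a tool for displaying the
computations for the F test. It is very important when
the Between Sample variability is due to two or more
factors.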
Computing Formulae:
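Compute
1) $T_i = \sum_{j=1}^{n_i} x_{ij}$ = Total for sample i
2) $G = \sum_{i=1}^{k} T_i = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}$ = Grand Total
3) $N = \sum_{i=1}^{k} n_i$ = Total sample size
4) $\sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2$
5) $\sum_{i=1}^{k} \frac{T_i^2}{n_i}$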
The data
• Assume we have collected data from each of
k populations
• Let $x_{i1}, x_{i2}, x_{i3}, \dots$ denote the $n_i$ observations
from population i.
• i = 1, 2, 3, …, k.
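Then
1) $SS_{Within} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k}\frac{T_i^2}{n_i}$
2) $SS_{Between} = \sum_{i=1}^{k}\frac{T_i^2}{n_i} - \frac{G^2}{N}$
3) $F = \frac{SS_{Between}/(k-1)}{SS_{Within}/(N-k)}$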
Anova Table (MS = SS/df)
Source     d.f.     Sum of Squares    Mean Square    F-ratio
Between    k - 1    SS_Between        MS_Between     MS_B / MS_W
Within     N - k    SS_Within         MS_Within
Total      N - 1    SS_Total
Example
In the following example we are comparing weight
gains resulting from the following six diets
1. Diet 1 - High Protein, Beef
2. Diet 2 - High Protein, Cereal
3. Diet 3 - High Protein, Pork
4. Diet 4 - Low Protein, Beef
5. Diet 5 - Low Protein, Cereal
6. Diet 6 - Low Protein, Pork
Gains in weight (grams) for rats under six diets
differing in level of protein (High or Low)
and source of protein (Beef, Cereal, or Pork)
Diet             1        2        3        4        5        6
                73       98       94       90      107       49
               102       74       79       76       95       82
               118       56       96       90       97       73
               104      111       98       64       80       86
                81       95      102       86       98       81
               107       88      102       51       74       97
               100       82      108       72       74      106
                87       77       91       90       67       70
               117       86      120       95       89       61
               111       92      105       78       58       82
Mean         100.0     85.9     99.5     79.2     83.9     78.7
Std. Dev.    15.14    15.02    10.92    13.89    15.71    16.55
Σx            1000      859      995      792      839      787
Σx²         102062    75819   100075    64462    72613    64401
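Thus
$$SS_{Within} = \sum_{i=1}^{k}\sum_{j=1}^{n_i} x_{ij}^2 - \sum_{i=1}^{k}\frac{T_i^2}{n_i} = 479432 - 467846 = 11586$$
$$SS_{Between} = \sum_{i=1}^{k}\frac{T_i^2}{n_i} - \frac{G^2}{N} = 467846 - \frac{5272^2}{60} = 4612.933$$
$$F = \frac{SS_{Between}/(k-1)}{SS_{Within}/(N-k)} = \frac{4612.933/5}{11586/54} = \frac{922.6}{214.56} = 4.3$$
$F_{0.05} = 2.386$ with $\nu_1 = 5$ and $\nu_2 = 54$
Thus since F > 2.386 we reject H0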
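The arithmetic above can be checked with a short Python sketch that applies the computing formulae to the Σx and Σx² rows of the data table. The variable names are illustrative assumptions; the numbers are copied from the table.

```python
# Summary rows from the data table: totals T_i (sum of x) and sums of squares per diet.
totals = [1000, 859, 995, 792, 839, 787]
sums_of_squares = [102062, 75819, 100075, 64462, 72613, 64401]
n_per_diet = 10                      # every diet has 10 rats

k = len(totals)                      # number of populations (diets)
N = k * n_per_diet                   # total sample size
G = sum(totals)                      # grand total

sum_t2_over_n = sum(t * t / n_per_diet for t in totals)

ss_within = sum(sums_of_squares) - sum_t2_over_n
ss_between = sum_t2_over_n - G * G / N
f_ratio = (ss_between / (k - 1)) / (ss_within / (N - k))

print(round(ss_within, 3), round(ss_between, 3), round(f_ratio, 2))
```

The printed values agree with the hand computation: 11586 for the within sum of squares, 4612.933 for the between sum of squares, and an F-ratio of about 4.3.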
Anova Table
Source     d.f.    Sum of Squares    Mean Square    F-ratio
Between      5       4612.933          922.587      4.3** (p = 0.0023)
Within      54      11586.000          214.556
Total       59      16198.933
$SS_{Total} = SS_{Between} + SS_{Within}$
* - Significant at 0.05 (not 0.01)
** - Significant at 0.01
Equivalence of the F-test and the t-test
when k = 2
the t-test
$$t = \frac{\bar{x} - \bar{y}}{s_{Pooled}\sqrt{\frac{1}{n} + \frac{1}{m}}}$$
$$s_{Pooled}^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n + m - 2}$$
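the F-test
$$F = \frac{s^2_{Between}}{s^2_{Pooled}} = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 / (k-1)}{\sum_{i=1}^{k}(n_i - 1)s_i^2 \Big/ \left(\sum_{i=1}^{k} n_i - k\right)}$$
For k = 2:
$$F = \frac{n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2}{\left[(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\right]/(n_1 + n_2 - 2)}$$
numerator $= n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2$
denominator $= s^2_{Pooled}$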
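$$n_1(\bar{x}_1 - \bar{x})^2 = n_1\left(\bar{x}_1 - \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}\right)^2 = \frac{n_1 n_2^2}{(n_1 + n_2)^2}(\bar{x}_1 - \bar{x}_2)^2$$
$$n_2(\bar{x}_2 - \bar{x})^2 = n_2\left(\bar{x}_2 - \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}\right)^2 = \frac{n_1^2 n_2}{(n_1 + n_2)^2}(\bar{x}_1 - \bar{x}_2)^2$$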
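$$n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 = \frac{n_1 n_2^2 + n_1^2 n_2}{(n_1 + n_2)^2}(\bar{x}_1 - \bar{x}_2)^2 = \frac{n_1 n_2}{n_1 + n_2}(\bar{x}_1 - \bar{x}_2)^2 = \frac{(\bar{x}_1 - \bar{x}_2)^2}{\frac{1}{n_1} + \frac{1}{n_2}}$$
Hence
$$F = \frac{(\bar{x}_1 - \bar{x}_2)^2}{s^2_{Pooled}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = t^2$$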
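The identity F = t² for k = 2 can also be checked numerically. The sketch below is only an illustration: the helper names are assumptions, and the two samples are the Diet 1 and Diet 4 columns of the rat data.

```python
import math


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    ss_x = sum((v - xbar) ** 2 for v in x)   # equals (n - 1) * s_x^2
    ss_y = sum((v - ybar) ** 2 for v in y)   # equals (m - 1) * s_y^2
    s2_pooled = (ss_x + ss_y) / (n + m - 2)
    return (xbar - ybar) / math.sqrt(s2_pooled * (1 / n + 1 / m))


def f_two_samples(x, y):
    """One-way ANOVA F statistic for k = 2 samples."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    grand = (sum(x) + sum(y)) / (n + m)
    ss_between = n * (xbar - grand) ** 2 + m * (ybar - grand) ** 2
    ss_within = sum((v - xbar) ** 2 for v in x) + sum((v - ybar) ** 2 for v in y)
    return (ss_between / 1) / (ss_within / (n + m - 2))


diet1 = [73, 102, 118, 104, 81, 107, 100, 87, 117, 111]
diet4 = [90, 76, 90, 64, 86, 51, 72, 90, 95, 78]

t_stat = pooled_t(diet1, diet4)
f_stat = f_two_samples(diet1, diet4)
print(t_stat ** 2, f_stat)   # the two values agree up to rounding error
```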
SAS Code for one-way ANOVA
Data oneway;
Input diet $ weight_gain;
Datalines;
1 73
1 102
1 118
1 104
1 81
1 107
1 100
1 87
1 117
1 111
2 98
2 74
2 56
2 111
2 95
2 88
2 82
2 77
2 86
2 92
3 94
3 79
3 96
3 98
3 102
3 102
3 108
3 91
3 120
3 105
4 90
4 76
4 90
4 64
4 86
4 51
4 72
4 90
4 95
4 78
5 107
5 95
5 97
5 80
5 98
5 74
5 74
5 67
5 89
5 58
6 49
6 82
6 73
6 86
6 81
6 97
6 106
6 70
6 61
6 82
;
Run;
Note: there are easier ways to enter the data. We will come to that later.
SAS Code for one-way ANOVA
To test our hypothesis, we use the following code in SAS:
PROC ANOVA DATA = oneway;
class diet;
model weight_gain = diet;
RUN;
QUIT;
• “class” tells SAS the classification variable. In general, this is
going to be the effect that you are studying. In this case, the
effect is “diet.”
• “model” tells SAS the dependent variable. The general format
is “model Y = X” where Y is the dependent variable, and X is
the independent variable. In this case, weight_gain is
dependent on diet.
• Often a “quit” statement is necessary, because SAS may
continue to run a procedure until either another one has been
run, or SAS has been told to quit.
SAS Output
The ANOVA Procedure
Class Level Information
Class Levels Values
diet 6 1 2 3 4 5 6
Number of Observations Read 60
Number of Observations Used 60
The ANOVA Procedure
Dependent Variable: weight_gain
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 5 4612.93333 922.58667 4.30 0.0023
Error 54 11586.00000 214.55556
Corrected Total 59 16198.93333
R-Square Coeff Var Root MSE weight_gain Mean
0.284768 16.67039 14.64772 87.86667
Source DF Anova SS Mean Square F Value Pr > F
diet 5 4612.933333 922.586667 4.30 0.0023
Factorial Experiments
Analysis of Variance
• Dependent variable Y
• k Categorical independent variables A, B,
C, … (the Factors)
• Let
– a = the number of categories of A
– b = the number of categories of B
– c = the number of categories of C
– etc.
The Completely Randomized Design
• We form the set of all treatment combinations
– the set of all combinations of the k factors
• Total number of treatment combinations
– t = abc….
• In the completely randomized design n
experimental units (test animals, test plots, etc.)
are randomly assigned to each treatment
combination.
– Total number of experimental units N = nt = nabc…
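The treatment combinations can be thought of as
arranged in a k-dimensional rectangular block.
[Figure: a two-dimensional grid with rows A = 1, 2, …, a and columns B = 1, 2, …, b, and a three-dimensional block with axes A, B and C]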
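A small Python sketch of how the set of treatment combinations can be enumerated. The factor names and levels are borrowed from the rat diet example and n = 10 is the number of animals per diet there; the dictionary layout itself is an assumption made for illustration.

```python
from itertools import product

# Levels of each factor (here k = 2 factors with a = 2 and b = 3 categories).
levels = {
    "Protein": ["High", "Low"],
    "Source": ["Beef", "Cereal", "Pork"],
}
n = 10  # experimental units per treatment combination

# All treatment combinations: the Cartesian product of the factor levels.
combos = list(product(*levels.values()))
t = len(combos)   # t = a * b
N = n * t         # total number of experimental units

for combo in combos:
    print(combo)
print(t, N)
```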
• The Completely Randomized Design described here, with the same number n of observations for every treatment combination, is called balanced.
• If the number of observations per treatment combination is unequal the design is called unbalanced. (This results in a mathematically more complex analysis and more complex computations.)
• If for some of the treatment combinations there are no observations the design is called incomplete. (In this case it may happen that some of the parameters - main effects and interactions - cannot be estimated.)
Example: Two-way ANOVA
(two-factor experiment)
In this example we are examining the effect of
the level of protein A (High or Low) and
the source of protein B (Beef, Cereal, or
Pork) on weight gains (grams) in rats.
We have n = 10 test animals randomly
assigned to each of k = 6 diets.
The k = 6 diets are the 6 = 3×2 Level-Source combinations
1. High - Beef
2. High - Cereal
3. High - Pork
4. Low - Beef
5. Low - Cereal
6. Low - Pork
Treatment combinations
                       Source of Protein
Level of Protein    Beef      Cereal    Pork
High                Diet 1    Diet 2    Diet 3
Low                 Diet 4    Diet 5    Diet 6
Summary Table of Means
                       Source of Protein
Level of Protein    Beef      Cereal    Pork      Overall
High                100.00    85.90     99.50     95.13
Low                  79.20    83.90     78.70     80.60
Overall              89.60    84.90     89.10     87.87
Table: Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)
Level of Protein     High Protein               Low Protein
Source of Protein    Beef     Cereal   Pork     Beef     Cereal   Pork
Diet                 1        2        3        4        5        6
                     73       98       94       90       107      49
                     102      74       79       76       95       82
                     118      56       96       90       97       73
                     104      111      98       64       80       86
                     81       95       102      86       98       81
                     107      88       102      51       74       97
                     100      82       108      72       74       106
                     87       77       91       90       67       70
                     117      86       120      95       89       61
                     111      92       105      78       58       82
Mean                 100.0    85.9     99.5     79.2     83.9     78.7
Std. Dev.            15.14    15.02    10.92    13.89    15.71    16.55
Data twoway;
Input Protein $ Source $ weight_gain;
Datalines;
High Beef 73
High Beef 102
High Beef 118
High Beef 104
High Beef 81
High Beef 107
High Beef 100
High Beef 87
High Beef 117
High Beef 111
High Cereal 98
High Cereal 74
High Cereal 56
High Cereal 111
High Cereal 95
High Cereal 88
High Cereal 82
High Cereal 77
High Cereal 86
High Cereal 92
High Pork 94
High Pork 79
High Pork 96
High Pork 98
High Pork 102
High Pork 102
High Pork 108
High Pork 91
High Pork 120
High Pork 105
Low Beef 90
Low Beef 76
Low Beef 90
Low Beef 64
Low Beef 86
Low Beef 51
Low Beef 72
Low Beef 90
Low Beef 95
Low Beef 78
Low Cereal 107
Low Cereal 95
Low Cereal 97
Low Cereal 80
Low Cereal 98
Low Cereal 74
Low Cereal 74
Low Cereal 67
Low Cereal 89
Low Cereal 58
Low Pork 49
Low Pork 82
Low Pork 73
Low Pork 86
Low Pork 81
Low Pork 97
Low Pork 106
Low Pork 70
Low Pork 61
Low Pork 82
;
Run;
SAS Code for two-way ANOVA
To test our hypotheses, we use the following code in SAS:
PROC ANOVA DATA = twoway;
class Protein Source;
model weight_gain = Protein Source Protein*Source;
RUN;
QUIT;
• “class” tells SAS the two classification variables, which are
generally going to be the effects that you are studying. In this
case, the effects are “Protein” and “Source”.
• “model” tells SAS the dependent variable. The general format
is “model Y = X1 X2 X1*X2” where Y is the dependent
variable, and X1 and X2 are independent variables. X1*X2 means
the interaction of X1 and X2.
• Often a “quit” statement is necessary, because SAS may
continue to run a procedure until either another one has been
run, or SAS has been told to quit.
SAS Output
The ANOVA Procedure
Class Level Information
Class Levels Values
Protein 2 High Low
Source 3 Beef Cereal Pork
Number of Observations Read 60
Number of Observations Used 60
The ANOVA Procedure
Dependent Variable: weight_gain
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 5 4612.93333 922.58667 4.30 0.0023
Error 54 11586.00000 214.55556
Corrected Total 59 16198.93333
R-Square Coeff Var Root MSE weight_gain Mean
0.284768 16.67039 14.64772 87.86667
Source DF Anova SS Mean Square F Value Pr > F
Protein 1 3168.266667 3168.266667 14.77 0.0003
Source 2 266.533333 133.266667 0.62 0.5411
Protein*Source 2 1178.133333 589.066667 2.75 0.0732
Profiles of the response relative
to a factor
A graphical representation of the
effect of a factor on a response
variable (dependent variable)
Profile Y for A
[Figure: the response Y plotted against the levels of A (1, 2, 3, …, a)]
• The response Y could be for an individual case or averaged over a group of cases.
• The profile could be for a specific level of another factor or averaged over the levels of another factor.
[Figure: Profiles of Weight Gain for Source and Level of Protein. Weight Gain (axis from 70 to 110) plotted against source of protein (Beef, Cereal, Pork), with one line each for High Protein, Low Protein and Overall]
[Figure: Profiles of Weight Gain for Source and Level of Protein. Weight Gain (axis from 70 to 110) plotted against level of protein (High Protein, Low Protein), with one line each for Beef, Cereal, Pork and Overall]
Example – Four factor experiment
Four factors are studied for their effect on Y (luster of paint film). The four factors are:
1) Film Thickness (1 or 2 mils)
2) Drying conditions (Regular or Special)
3) Length of wash (20, 30, 40 or 60 Minutes), and
4) Temperature of wash (92 ˚C or 100 ˚C)
Two observations of film luster (Y) are taken
for each treatment combination.
The data are tabulated below (two observations per treatment combination):
                   Regular Dry                    Special Dry
Minutes       92 ˚C         100 ˚C           92 ˚C         100 ˚C
1-mil Thickness
20           3.4   3.4     19.6  14.5       2.1   3.8     17.2  13.4
30           4.1   4.1     17.5  17.0       4.0   4.6     13.5  14.3
40           4.9   4.2     17.6  15.2       5.1   3.3     16.0  17.8
60           5.0   4.9     20.9  17.1       8.3   4.3     17.5  13.9
2-mil Thickness
20           5.5   3.7     26.6  29.5       4.5   4.5     25.6  22.5
30           5.7   6.1     31.6  30.2       5.9   5.9     29.2  29.8
40           5.5   5.6     30.5  30.2       5.5   5.8     32.6  27.4
60           7.2   6.0     31.4  29.6       8.0   9.9     33.5  29.5
Definition:
A factor is said to not affect the response if
the profile of the factor is horizontal for all
combinations of levels of the other factors:
there is no change in the response when you change
the levels of the factor (true for all
combinations of levels of the other factors).
Otherwise the factor is said to affect the
response.
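Profile Y for A – A affects the response
[Figure: Y plotted against the levels of A (1, 2, 3, …, a), one profile for each level of B; the profiles are not horizontal]
Profile Y for A – no effect on the response
[Figure: Y plotted against the levels of A (1, 2, 3, …, a), one profile for each level of B; every profile is horizontal]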
Definition:
• Two (or more) factors are said to interact if changes in the response when you change the level of one factor depend on the level(s) of the other factor(s).
• Profiles of the factor for different levels of the other factor(s) are not parallel.
• Otherwise the factors are said to be additive.
• Profiles of the factor for different levels of the other factor(s) are parallel.
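As a descriptive illustration of the definition above, the sketch below compares the High and Low protein profiles across sources using the cell means of the rat diet example. If the profiles were exactly parallel, the High minus Low gap would be the same for every source. The dictionary layout and variable names are assumptions; sample profiles are never exactly parallel, so whether an interaction is present is decided by the F test, not by this check.

```python
# Cell means from the summary table of means (weight gain in grams).
cell_means = {
    "High": {"Beef": 100.0, "Cereal": 85.9, "Pork": 99.5},
    "Low": {"Beef": 79.2, "Cereal": 83.9, "Pork": 78.7},
}

# Gap between the two profiles at each source of protein.
gaps = {
    source: cell_means["High"][source] - cell_means["Low"][source]
    for source in cell_means["High"]
}

# For additive (non-interacting) factors all gaps would be equal, so the spread would be 0.
spread = max(gaps.values()) - min(gaps.values())

for source, gap in gaps.items():
    print(source, round(gap, 1))
print("spread of gaps:", round(spread, 1))
```

Here the gap is about 20.8 grams for Beef and Pork but only about 2.0 grams for Cereal, which is the non-parallel pattern seen in the profile plots.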
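Interacting factors A and B
[Figure: Y plotted against the levels of A (1, 2, 3, …, a), one profile for each level of B; the profiles are not parallel]
Additive factors A and B
[Figure: Y plotted against the levels of A (1, 2, 3, …, a), one profile for each level of B; the profiles are parallel]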
• If two (or more) factors interact, each factor
affects the response.
• If two (or more) factors are additive, it still
remains to be determined if the factors
affect the response.
• In factorial experiments we are interested in
determining
– which factors affect the response and
– which groups of factors interact.
Order of testing in factorial experiments
1. Test first the higher order interactions.
2. If an interaction is present there is no need to test lower order interactions or main effects involving those factors. All factors in the interaction affect the response and they interact.
3. The testing continues for lower order interactions and main effects for factors which have not yet been determined to affect the response.
More SAS Program: Proc GLM
The ANOVA procedure is one of several procedures available in SAS/STAT software for analysis of variance. The ANOVA procedure is designed to handle balanced data (that is, data with equal numbers of observations for every combination of the classification factors), whereas the GLM procedure can analyze both balanced and unbalanced data. Because PROC ANOVA takes into account the special structure of a balanced design, it is faster and uses less storage than PROC GLM for balanced data.
Proc GLM
PROC GLM DATA = twoway;
class Protein Source;
model weight_gain = Protein Source Protein*Source;
lsmeans Protein Source Protein*Source /out=outmns;
*gives least square means and outputs them into another data set called 'outmns';
means Protein Source /cldiff bon;
*ask SAS for the confidence limits for the difference of means and the type of comparison;
output out=resout p=preds rstudent=exstdres;
*outputs the residuals and predicted value to a data set called 'resout';
RUN;
QUIT;
Proc GLM, continued
title 'Profile/Interaction Plots';
symbol i=j;
*tells SAS to draw lines between joint means;
proc gplot data=outmns;
where Protein ne ' ' and Source ne ' ';
*remove the marginal means from the data set since we only wish to plot joint means (for a marginal mean the other class variable, a character variable here, is blank);
plot lsmean*Protein=Source;
plot lsmean*Source=Protein;
run; quit;
goptions reset=all; *resets PROC GPLOT options;
title 'Residual Plot';
proc gplot data=resout;
plot exstdres*preds;
run; quit;
Mean versus LS Mean (LSM)
Note: for balanced designs, as is true for our examples,
the mean and LSM are the same.
Bonferroni Pairwise Mean Comparisons
The GLM Procedure
Bonferroni (Dunn) t Tests for weight_gain
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type
II error rate than Tukey's for all pairwise comparisons.
Alpha 0.05
Error Degrees of Freedom 54
Error Mean Square 214.5556
Critical Value of t 2.00488
Minimum Significant Difference 7.5825
Comparisons significant at the 0.05 level are indicated by ***.
Difference
Protein Between Simultaneous 95%
Comparison Means Confidence Limits
High - Low 14.533 6.951 22.116 ***
Low - High -14.533 -22.116 -6.951 ***
The GLM Procedure
Bonferroni (Dunn) t Tests for weight_gain
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type
II error rate than Tukey's for all pairwise comparisons.
Alpha 0.05
Error Degrees of Freedom 54
Error Mean Square 214.5556
Critical Value of t 2.47085
Minimum Significant Difference 11.445
Comparisons significant at the 0.05 level are indicated by ***.
Difference Simultaneous
Source Between 95% Confidence
Comparison Means Limits
Beef - Pork 0.500 -10.945 11.945
Beef - Cereal 4.700 -6.745 16.145
Pork - Beef -0.500 -11.945 10.945
Pork - Cereal 4.200 -7.245 15.645
Cereal - Beef -4.700 -16.145 6.745
Cereal - Pork -4.200 -15.645 7.245
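The "Minimum Significant Difference" lines in the two Bonferroni listings can be reproduced from the other quantities printed there, as critical t times the standard error of a difference of two means. The sketch below copies the critical values and error mean square from the output; the function name is an assumption.

```python
import math


def min_sig_diff(t_crit, mse, n_per_mean):
    """Critical t times the standard error of a difference of two means."""
    return t_crit * math.sqrt(mse * 2 / n_per_mean)


mse = 214.5556                                    # Error Mean Square from the output

msd_protein = min_sig_diff(2.00488, mse, 30)      # 30 rats per protein level
msd_source = min_sig_diff(2.47085, mse, 20)       # 20 rats per protein source

# Simultaneous 95% limits for High - Low, using the means printed in the output.
diff = 95.133 - 80.600
lower, upper = diff - msd_protein, diff + msd_protein

print(round(msd_protein, 4), round(msd_source, 3))
print(round(lower, 3), round(upper, 3))
```

The critical value is larger for Source because three pairwise comparisons share the 0.05 error rate, while Protein has only one comparison.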
Tukey pairwise mean comparisons
PROC GLM DATA = twoway;
class Protein Source;
model weight_gain = Protein Source Protein*Source;
means Protein Source /tukey;
RUN;
QUIT;
The GLM Procedure
Tukey's Studentized Range (HSD) Test for weight_gain
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type
II error rate than REGWQ.
Alpha 0.05
Error Degrees of Freedom 54
Error Mean Square 214.5556
Critical Value of Studentized Range 2.83533
Minimum Significant Difference 7.5825
Means with the same letter are not significantly different.
Tukey Grouping Mean N Protein
A 95.133 30 High
B 80.600 30 Low
The GLM Procedure
Tukey's Studentized Range (HSD) Test for weight_gain
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type
II error rate than REGWQ.
Alpha 0.05
Error Degrees of Freedom 54
Error Mean Square 214.5556
Critical Value of Studentized Range 3.40823
Minimum Significant Difference 11.163
Means with the same letter are not significantly different.
Tukey Grouping Mean N Source
A 89.600 20 Beef
A
A 89.100 20 Pork
A
A 84.900 20 Cereal
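The Tukey minimum significant differences can be checked the same way: the studentized range critical value is multiplied by the square root of MSE divided by the number of observations per mean. The sketch below copies the critical values and means from the output above; the variable names are assumptions.

```python
import math
from itertools import combinations

mse = 214.5556          # Error Mean Square from the output

# Critical values of the studentized range, copied from the two listings.
hsd_protein = 2.83533 * math.sqrt(mse / 30)   # 30 observations per protein level
hsd_source = 3.40823 * math.sqrt(mse / 20)    # 20 observations per protein source

source_means = {"Beef": 89.600, "Pork": 89.100, "Cereal": 84.900}

# A pair of means differs significantly if the gap exceeds the HSD.
significant = {
    (first, second): abs(source_means[first] - source_means[second]) > hsd_source
    for first, second in combinations(source_means, 2)
}

print(round(hsd_protein, 4), round(hsd_source, 3))
print(significant)
```

None of the three source differences exceeds the HSD, which is why all three sources share the letter A in the grouping.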
Models for factorial Experiments
Part I. Factor Effects Model
The Single Factor Experiment (One-way ANOVA)
Situation
• We have t = a treatment combinations
• Let $\mu_i$ and $\sigma^2$ denote the mean and variance
of treatment (population) i.
• i = 1, 2, 3, …, a.
• Note: we assume that the variance for each
population is unknown but the same:
$\sigma_1^2 = \sigma_2^2 = \dots = \sigma_a^2 = \sigma^2$
The data
• Assume we have collected data for each of
the a treatments
• Let $y_{i1}, y_{i2}, y_{i3}, \dots, y_{in}$ denote the n
observations for treatment i.
• i = 1, 2, 3, …, a.
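The model
Note:
$$y_{ij} = \mu_i + (y_{ij} - \mu_i) = \mu_i + \varepsilon_{ij}$$
$$\phantom{y_{ij}} = \mu + (\mu_i - \mu) + \varepsilon_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$
where $\varepsilon_{ij} = y_{ij} - \mu_i$ has N(0, $\sigma^2$) distribution,
$\mu = \frac{1}{a}\sum_{i=1}^{a}\mu_i$ (overall mean effect),
$\alpha_i = \mu_i - \mu$ (Effect of Factor A).
Note: $\sum_{i=1}^{a}\alpha_i = 0$ by their definition.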
Model 1:
$y_{ij}$ (i = 1, … , a; j = 1, …, n) are independent
Normal with mean $\mu_i$ and variance $\sigma^2$.
Model 2:
$y_{ij} = \mu_i + \varepsilon_{ij}$
where $\varepsilon_{ij}$ (i = 1, … , a; j = 1, …, n) are independent
Normal with mean 0 and variance $\sigma^2$.
Model 3:
$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
where $\varepsilon_{ij}$ (i = 1, … , a; j = 1, …, n) are independent
Normal with mean 0 and variance $\sigma^2$ and
$\sum_{i=1}^{a}\alpha_i = 0$
The Two Factor Experiment
Situation
• We have t = ab treatment combinations
• Let $\mu_{ij}$ and $\sigma^2$ denote the mean and variance
of observations from the treatment
combination when A = i and B = j.
• i = 1, 2, 3, …, a; j = 1, 2, 3, …, b.
The data
• Assume we have collected data (n observations)
for each of the t = ab treatment combinations.
• Let $y_{ij1}, y_{ij2}, y_{ij3}, \dots, y_{ijn}$ denote the n observations
for treatment combination A = i, B = j.
• i = 1, 2, 3, …, a; j = 1, 2, 3, …, b.
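The model
Note:
$$y_{ijk} = \mu_{ij} + (y_{ijk} - \mu_{ij}) = \mu_{ij} + \varepsilon_{ijk}$$
$$\phantom{y_{ijk}} = \mu + (\mu_{i\cdot} - \mu) + (\mu_{\cdot j} - \mu) + (\mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu) + \varepsilon_{ijk}$$
$$\phantom{y_{ijk}} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$
where $\varepsilon_{ijk} = y_{ijk} - \mu_{ij}$ follows N(0, $\sigma^2$) distribution,
$$\mu = \frac{1}{ab}\sum_{i=1}^{a}\sum_{j=1}^{b}\mu_{ij}, \quad \mu_{i\cdot} = \frac{1}{b}\sum_{j=1}^{b}\mu_{ij}, \quad \text{and} \quad \mu_{\cdot j} = \frac{1}{a}\sum_{i=1}^{a}\mu_{ij}$$
$$\alpha_i = \mu_{i\cdot} - \mu, \quad \beta_j = \mu_{\cdot j} - \mu$$
and
$$(\alpha\beta)_{ij} = \mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu$$
Note: $\sum_{i=1}^{a}\alpha_i = 0$ by their definition.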
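Model:
$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$
Here $\mu$ is the Mean, $\alpha_i$ and $\beta_j$ are the Main effects, $(\alpha\beta)_{ij}$ is the Interaction Effect and $\varepsilon_{ijk}$ is the Error,
where $\varepsilon_{ijk}$ (i = 1, … , a; j = 1, …, b; k = 1, …, n) are
independent Normal with mean 0 and variance $\sigma^2$ and
$$\sum_{i=1}^{a}\alpha_i = 0, \quad \sum_{j=1}^{b}\beta_j = 0, \quad \text{and} \quad \sum_{i=1}^{a}(\alpha\beta)_{ij} = \sum_{j=1}^{b}(\alpha\beta)_{ij} = 0$$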
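Maximum Likelihood Estimates
$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$
where $\varepsilon_{ijk}$ (i = 1, … , a; j = 1, …, b; k = 1, …, n) are
independent Normal with mean 0 and variance $\sigma^2$ and
$$\hat{\mu} = \bar{y}_{\cdot\cdot\cdot} = \frac{1}{abn}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} y_{ijk}$$
$$\hat{\alpha}_i = \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot} = \frac{1}{bn}\sum_{j=1}^{b}\sum_{k=1}^{n} y_{ijk} - \bar{y}_{\cdot\cdot\cdot}$$
$$\hat{\beta}_j = \bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot} = \frac{1}{an}\sum_{i=1}^{a}\sum_{k=1}^{n} y_{ijk} - \bar{y}_{\cdot\cdot\cdot}$$
$$\widehat{(\alpha\beta)}_{ij} = \bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot} = \frac{1}{n}\sum_{k=1}^{n} y_{ijk} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot}$$
$$\hat{\sigma}^2 = \frac{1}{nab}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(y_{ijk} - \bar{y}_{ij\cdot})^2 = \frac{1}{nab}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}\left(y_{ijk} - \left[\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \widehat{(\alpha\beta)}_{ij}\right]\right)^2$$
This is not an unbiased estimator of $\sigma^2$ (usually the
case when estimating variance).
The unbiased estimator results when we divide by
ab(n - 1) instead of abn.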
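The unbiased estimator of $\sigma^2$ is
$$s^2 = \frac{1}{ab(n-1)}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(y_{ijk} - \bar{y}_{ij\cdot})^2 = \frac{1}{ab(n-1)}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}\left(y_{ijk} - \left[\hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \widehat{(\alpha\beta)}_{ij}\right]\right)^2$$
$$\phantom{s^2} = \frac{1}{ab(n-1)} SS_{Error} = MS_{Error}$$
where
$$SS_{Error} = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n}(y_{ijk} - \bar{y}_{ij\cdot})^2$$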
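Testing for Interaction:
We want to test:
H0: $(\alpha\beta)_{ij} = 0$ for all i and j, against
HA: $(\alpha\beta)_{ij} \ne 0$ for at least one i and j.
The test statistic
$$F = \frac{MS_{AB}}{MS_{Error}} = \frac{SS_{AB}/[(a-1)(b-1)]}{MS_{Error}}$$
where
$$SS_{AB} = n\sum_{i=1}^{a}\sum_{j=1}^{b}\widehat{(\alpha\beta)}_{ij}^{\,2} = n\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdot\cdot\cdot})^2$$
We reject
H0: $(\alpha\beta)_{ij} = 0$ for all i and j,
if
$$F = \frac{MS_{AB}}{MS_{Error}} > F_{\alpha}\big((a-1)(b-1),\ ab(n-1)\big)$$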
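Testing for the Main Effect of A:
We want to test:
H0: $\alpha_i = 0$ for all i, against
HA: $\alpha_i \ne 0$ for at least one i.
The test statistic
$$F = \frac{MS_A}{MS_{Error}} = \frac{SS_A/(a-1)}{MS_{Error}}$$
where
$$SS_A = bn\sum_{i=1}^{a}\hat{\alpha}_i^2 = bn\sum_{i=1}^{a}(\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot})^2$$
We reject
H0: $\alpha_i = 0$ for all i,
if
$$F = \frac{MS_A}{MS_{Error}} > F_{\alpha}\big((a-1),\ ab(n-1)\big)$$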
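Testing for the Main Effect of B:
We want to test:
H0: $\beta_j = 0$ for all j, against
HA: $\beta_j \ne 0$ for at least one j.
The test statistic
$$F = \frac{MS_B}{MS_{Error}} = \frac{SS_B/(b-1)}{MS_{Error}}$$
where
$$SS_B = an\sum_{j=1}^{b}\hat{\beta}_j^2 = an\sum_{j=1}^{b}(\bar{y}_{\cdot j\cdot} - \bar{y}_{\cdot\cdot\cdot})^2$$
We reject
H0: $\beta_j = 0$ for all j,
if
$$F = \frac{MS_B}{MS_{Error}} > F_{\alpha}\big((b-1),\ ab(n-1)\big)$$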
The ANOVA Table
Source    S.S.        d.f.              MS = SS/df    F
A         SS_A        a - 1             MS_A          MS_A / MS_Error
B         SS_B        b - 1             MS_B          MS_B / MS_Error
AB        SS_AB       (a - 1)(b - 1)    MS_AB         MS_AB / MS_Error
Error     SS_Error    ab(n - 1)         MS_Error
Total     SS_Total    abn - 1
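For a balanced design the A, B and AB rows of this table can be computed from the cell means alone. The sketch below does this for the rat diet example, using the usual balanced-design sums of squares (marginal or interaction deviations squared, multiplied by the number of observations behind each mean). The cell means come from the summary table of means, SS_Error = 11586 is the within sum of squares found earlier, and the variable names are assumptions.

```python
# Rows: Protein (High, Low); columns: Source (Beef, Cereal, Pork).
cell_means = [[100.0, 85.9, 99.5],
              [79.2, 83.9, 78.7]]
a, b, n = 2, 3, 10
ss_error = 11586.0                      # within-cell sum of squares from the one-way analysis

grand = sum(sum(row) for row in cell_means) / (a * b)
row_means = [sum(row) / b for row in cell_means]
col_means = [sum(cell_means[i][j] for i in range(a)) / a for j in range(b)]

# Sums of squares for the main effects and the interaction.
ss_a = b * n * sum((m - grand) ** 2 for m in row_means)
ss_b = a * n * sum((m - grand) ** 2 for m in col_means)
ss_ab = n * sum(
    (cell_means[i][j] - row_means[i] - col_means[j] + grand) ** 2
    for i in range(a) for j in range(b)
)

ms_error = ss_error / (a * b * (n - 1))
f_a = (ss_a / (a - 1)) / ms_error
f_b = (ss_b / (b - 1)) / ms_error
f_ab = (ss_ab / ((a - 1) * (b - 1))) / ms_error

print(round(ss_a, 3), round(ss_b, 3), round(ss_ab, 3))
print(round(f_a, 2), round(f_b, 2), round(f_ab, 2))
```

These values agree with the Protein, Source and Protein*Source lines of the SAS two-way output for the same data.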
The ANOVA Procedure
Dependent Variable: weight_gain
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 5 4612.93333 922.58667 4.30 0.0023
Error 54 11586.00000 214.55556
Corrected Total 59 16198.93333
R-Square Coeff Var Root MSE weight_gain Mean
0.284768 16.67039 14.64772 87.86667
Source DF Anova SS Mean Square F Value Pr > F
Protein 1 3168.266667 3168.266667 14.77 0.0003
Source 2 266.533333 133.266667 0.62 0.5411
Protein*Source 2 1178.133333 589.066667 2.75 0.0732
Part II. General Linear Model
One-way ANOVA
The ANOVA is indeed a special case of the
general linear model (GLM) when all the
predictors are categorical variables.
For one-way ANOVA, we have only one
categorical predictor. As shown in the
following slides, we can easily translate the
ANOVA into a GLM using dummy variables.
Dummy Variables
• Dummy coding
• 0s and 1s
– For a categorical predictor with k categories, k-1 dummy variables will go into the regression equation, leaving out one reference category (e.g. control)
• Coefficients are interpreted as change with respect to the reference category (the one with all zeros)
– In this case group 3
Group    D1    D2
1        1     0
2        0     1
3        0     0
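GLM representation and interpretations
• GLM model:
$Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \varepsilon$
• Relation to category/group means:
Group 1: $\mu_1 = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 0 = \beta_0 + \beta_1$
Group 2: $\mu_2 = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 1 = \beta_0 + \beta_2$
Group 3: $\mu_3 = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 = \beta_0$
• Therefore the ANOVA hypothesis:
$H_0: \mu_1 = \mu_2 = \mu_3$
• Can be expressed as:
$H_0: \beta_1 = \beta_2 = 0$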
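The relation between the coefficients and the group means can be illustrated with a short Python sketch. The three group means are borrowed from the High protein diets (Beef, Cereal, Pork) purely as an example, and the names `dummies` and `beta` are assumptions. With one dummy per non-reference group, the least-squares fit reproduces the group means, so the coefficients follow directly from them.

```python
# Example group means (groups 1, 2, 3); group 3 is the reference category.
group_means = {1: 100.0, 2: 85.9, 3: 99.5}

# Dummy coding: (D1, D2) for each group, as in the table above.
dummies = {1: (1, 0), 2: (0, 1), 3: (0, 0)}

# Coefficients implied by the group means.
beta0 = group_means[3]                    # mean of the reference group
beta1 = group_means[1] - group_means[3]   # group 1 relative to the reference
beta2 = group_means[2] - group_means[3]   # group 2 relative to the reference
beta = (beta0, beta1, beta2)

# Rebuild each group mean from the model: beta0 + beta1*D1 + beta2*D2.
rebuilt = {
    g: beta0 + beta1 * d1 + beta2 * d2
    for g, (d1, d2) in dummies.items()
}

print(beta)
print(rebuilt)
```

The hypothesis that all three means are equal corresponds to beta1 and beta2 both being zero, since those two coefficients are exactly the differences from the reference group.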
Two-way ANOVA
We will revisit the two-way ANOVA example
on the impact of two factors on weight_gain:
(1) Protein level (denoted as Protein) – it has
two levels: High/Low
(2) Protein source (denoted as Source) – it has
three levels: Beef/Cereal/Pork
Dummy Variables
Source D1 D2
Beef 1 0
Cereal 0 1
Pork 0 0
Protein D
High 1
Low 0
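GLM representation and interpretations
• GLM model:
$Y = \beta_0 + \beta_1 D + \beta_2 D_1 + \beta_3 D_2 + \beta_4 D{*}D_1 + \beta_5 D{*}D_2 + \varepsilon$
• Relation to category/group means:
High/Beef: $\mu_1 = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 1 + \beta_3 \cdot 0 + \beta_4 \cdot 1{*}1 + \beta_5 \cdot 1{*}0 = \beta_0 + \beta_1 + \beta_2 + \beta_4$
High/Cereal: $\mu_2 = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 1 + \beta_4 \cdot 1{*}0 + \beta_5 \cdot 1{*}1 = \beta_0 + \beta_1 + \beta_3 + \beta_5$
High/Pork: $\mu_3 = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 1{*}0 + \beta_5 \cdot 1{*}0 = \beta_0 + \beta_1$
Low/Beef: $\mu_4 = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 1 + \beta_3 \cdot 0 + \beta_4 \cdot 0{*}1 + \beta_5 \cdot 0{*}0 = \beta_0 + \beta_2$
Low/Cereal: $\mu_5 = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \beta_3 \cdot 1 + \beta_4 \cdot 0{*}0 + \beta_5 \cdot 0{*}1 = \beta_0 + \beta_3$
Low/Pork: $\mu_6 = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 0{*}0 + \beta_5 \cdot 0{*}0 = \beta_0$
GLM representation and interpretations
• Test for Interaction:
$H_0: \beta_4 = \beta_5 = 0$
• Test for Protein (level) main effect:
$H_0: \beta_1 = 0$
• Test for (protein) Source main effect:
$H_0: \beta_2 = \beta_3 = 0$
(With this 0/1 coding and the interaction terms in the model, $\beta_1$ is the High - Low difference for the reference source, Pork, and $\beta_2$, $\beta_3$ are the source differences from Pork at Low protein; they coincide with the main effects when $\beta_4 = \beta_5 = 0$.)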
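To make the coefficient interpretation concrete, the sketch below solves the six relations above for the coefficients using the cell means of the rat diet example, then rebuilds each cell mean from the dummy coding. The names `design` and `beta` are assumptions, and the coefficients are obtained from the cell means rather than from a fitted regression (for this saturated model the two agree).

```python
# Cell means, keyed by (Protein, Source).
mu = {
    ("High", "Beef"): 100.0, ("High", "Cereal"): 85.9, ("High", "Pork"): 99.5,
    ("Low", "Beef"): 79.2, ("Low", "Cereal"): 83.9, ("Low", "Pork"): 78.7,
}

# Solve the relations: Low/Pork is the all-zeros reference cell.
b0 = mu[("Low", "Pork")]
b1 = mu[("High", "Pork")] - b0                 # High - Low at Pork
b2 = mu[("Low", "Beef")] - b0                  # Beef - Pork at Low
b3 = mu[("Low", "Cereal")] - b0                # Cereal - Pork at Low
b4 = mu[("High", "Beef")] - b0 - b1 - b2       # interaction term for Beef
b5 = mu[("High", "Cereal")] - b0 - b1 - b3     # interaction term for Cereal
beta = [b0, b1, b2, b3, b4, b5]


def design(protein, source):
    """Row (1, D, D1, D2, D*D1, D*D2) for one treatment combination."""
    d = 1 if protein == "High" else 0
    d1 = 1 if source == "Beef" else 0
    d2 = 1 if source == "Cereal" else 0
    return [1, d, d1, d2, d * d1, d * d2]


rebuilt = {
    cell: sum(coef * x for coef, x in zip(beta, design(*cell)))
    for cell in mu
}
print([round(coef, 1) for coef in beta])
```

Note that b1 is about 20.8, the High minus Low gap for Pork, while the difference between the overall High and Low means is about 14.5; the two differ because the interaction coefficient b5 is far from zero in these data.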
Acknowledgement:
• We thank colleagues who posted their lecture notes on the internet!
• Please note that in SAS, we have several procedures that will enable you to perform ANOVA. These include Proc ANOVA and Proc GLM, plus several other procedures such as Proc Mixed, etc. The ANOVA procedures we have learned so far are just the basic fixed effect ANOVAs. In the future we will also learn those with random effect, and mixed effects. See the following websites for a review and preview:
• http://www.ats.ucla.edu/stat/sas/library/SASAnova_mf.htm
• http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#mixed_toc.htm
• http://www.hawaii.edu/hisug/pdf/AnnMariaprocmixed.pdf