Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
Analysis of variance (ANOVA)
• Suppose we observe bivariate data (X,Y ) in which the X
variable is qualitative and the Y variable is quantitative.
In the following example (Cox & Snell, 1981) four varieties
of winter wheat were grown in various plots of land, and the
yield (tons per hectare) was measured in each plot.
Variety (X) Yield (Y )
Huntsman 5.12 4.50 5.49 5.86Atou 4.65 5.07 5.59 6.53Armada 5.04 4.99 5.59 6.57Mardler 5.13 4.60 5.83 6.14
The X variable is the type of wheat, which is qualitative, and
in this context is called a factor. Specifically, it is a four level
factor, since there are four types of wheat. In general, we
will use m to denote the number of levels of the factor.
All 16 data values are assumed to be independent. The
four values in a given row are independent and identically
distributed (iid), and are referred to as replicates. Note that
this implies a key assumption – it is assumed that the mean
and variance within each row are fixed.
Our primary interest will be whether the means for different
rows (different varieties of wheat) differ. This would imply
that some varieties of wheat are better than others. The
analysis is easiest when we assume that the variances for all
rows are the same.
This type of data is called a balanced one-way layout. Theterm “balance” refers to the fact that there are the samenumber of observations in every row. The term “one-way”refers to the fact that there is only one X variable.
• Our notation for this type of data will be Yij, where i =1,2,3,4 indicates the type of wheat (i.e. 1 = Huntsman,2 = Atou, 3 = Armada, 4 = Mardler), and j indexes thereplicates. Thus, Y11 = 5.12, Y12 = 4.50, Y21 = 4.65, Y44 =6.14, etc.
Additional notation: Yi· =∑j Yij is the sum of all values
in the ith row, n is the number of values in each row (4in the example above), Yi· = Yi·/n is the average value inthe ith row, Y·· =
∑ij Yij is the sum of all observations, and
Y·· = Y··/mn is the overall (“grand”) mean.
All of these values can be displayed in an ANOVA table:
Variety (X) Yield (Y ) Yi· Yi·Huntsman 5.12 4.50 5.49 5.86 20.97 5.24Atou 4.65 5.07 5.59 6.53 21.84 5.46Armada 5.04 4.99 5.59 6.57 22.19 5.55Mardler 5.13 4.60 5.83 6.14 21.70 5.43
86.7 5.42
where the values in the lower right are Y·· = 86.7 and Y·· =
5.42.
• Analysis of variance (ANOVA) specifies the following simple
model for these data:
Yij = µ+ αi + εij.
The constant values µ, α1, α2, α3, and α4 are unknown pa-
rameters of the population, and the random variables εij are
iid errors with mean 0 and a common unknown variance σ2.
For example, the model gives the following for certain specific
data points:
Y11 = µ+ α1 + ε11
Y14 = µ+ α1 + ε14
Y23 = µ+ α2 + ε23
Y42 = µ+ α4 + ε42.
• One difficulty with the above model is that different valuesof µ and the αi will give the same mean values for every datapoint. Specifically, if we replace µ with µ + c and replaceeach αi with αi − c, the means will not change.
When this occurs the parameters are said to be unidentified.To estimate the parameters, we must impose a constraint.In the present situation, the constraint will be
∑iαi = 0.
This allows the αi to be interpreted as “deviations from themean” – if α2 = 3, then Atou wheat yields on average threetons more than the overall mean, and if α4 = −2, Mardlerwheat yields on average two tons less than the overall mean.
• In order to estimate the population parameters, we use thesame “sum of squared residuals” function that was used forsimple linear regression:
∑ij
(Yij − µ− αi)2.
As with simple linear regression, our estimates will be the val-
ues that we get by searching for the values of µ, α1, . . . , α4
that make the sum of squared residuals as small as possi-
ble. Without derivation, the following are the least squares
parameter estimates:
αi = Yi· − Y··µ = Y··
For the example given above we get
α1 = −0.176
α2 = 0.041
α3 = 0.129
α4 = 0.006
µ = 5.42
Also by analogy with simple linear regression, we can define
fitted values
Yij = µ+ αi
residuals
rij = Yij − µ− αi,
and an estimate of the standard deviation
σ =√∑
i
r2ij/(mn−m).
• Since EYi1 = · · · = EYin = µ+αi, it follows that EYi· = µ+αi.
Similarly, EY·· = µ+∑iαi/m = µ. Thus Eαi = µ+αi−µ = αi,
so αi is unbiased.
Similarly it can be shown that µ is unbiased.
• The variance of each αi can be calculated directly:
var(αi) = var(Yi· − Y··)= var(Yi·) + var(Y··)− 2cov(Yi·, Y··)
= σ2/n+ σ2/mn− 2cov(Yi·, Y··)/n2m
= σ2/n+ σ2/mn− 2cov(Yi·, Yi·)/n2m
= σ2/n+ σ2/mn− 2nσ2/n2m
= σ2/n+ σ2/mn− 2σ2/mn
= σ2/n− σ2/mn
= σ2 ·m− 1
mn.
• Based on the variance formula for αi, we can carry out hy-
pothesis tests. For example, to test α2 = 0 versus α2 > 0,
the test statistic would be
T =α2
σ·√
mn
m− 1.
When α2 = 0 (the null hypothesis) T has a tm(n−1) distribu-
tion. In ANOVA problems it is common for m(n − 1) to be
small, so the normal approximation should not generally be
used.
In the above example, we get σ = .71, so the (two sided)
test statistics and p-values are as follows:
Parameter Estimate |T | p-value
α1 -0.176 0.57 .58α2 0.041 0.13 .90α3 0.129 0.42 .68α4 0.006 0.02 .98
So in the example, none of the coefficients are significantlydifferent from zero – we can not confidently conclude thatany variety of wheat is better than the others.
As we have seen previously, the test statistic −T can beused to test the alternative hypothesis α2 < 0, and the teststatistic |T | can be used to test the alternative hypothesisα2 6= 0.
• We can also use the standard deviation formula to get a CIfor any of the αi’s. Since
P
(Q(.025) ≤
α2 − α2
σ·√
mn
m− 1≤ Q(.975)
)= .95,
where Q is the tm(n−1) quantile function, it follows that
P
α2 − σQ(.975)
√m− 1
mn≤ α2 ≤ α2 + σQ(.975)
√m− 1
mn
= .95.
The confidence intervals for the example are:
α1 (−.84, .48)α2 (−.62, .70)α3 (−.53, .79)α4 (−.65, .67).
• As with simple linear regression, we have a “sum of squares
law”:
SSTO = SSE + SSR∑ij
(Yij − Y··)2 =∑ij
(Yij − Yij)2 +∑ij
(Yij − Y··)2,
where SSR and SSE are uncorrelated.
We can define “mean squares”:
Sum of Squares DF Mean square
SSTO mn-1 SSTO / (mn-1)SSR m-1 SSR / (m-1)SSE m(n-1) SSE / (m(n-1))
Large values of SSR and small values of SSE suggest a good
fit to the model. Therefore F = MSR/MSE can be used
to test the fit of the model. F has a F distribution with
(m− 1,m(n− 1)) degrees of freedom.
In the example, the sums of squares are SSTO= 6.2350,
SSR= 0.1975, and SSE= 6.0374. The mean squares are
MSTO= 0.5196, MSR= 0.0658, and MSE= 0.5031. The F
statistic is F = .13079, which gives an insignficant p-value
of around .94. Thus there is no evidence that any type of
wheat has greater or lesser yield than any other.
Unabalanced one-way layout
• The balanced one way layout can easily be generalized to
the unbalanced case, where differing numbers of replicates
are made for different factor levels.
In this case, we use ni to denote the number of replicates for
factor level i, and let N =∑i ni denote the total number of
observations.
The definitions of Yi· and Y·· are the same as in the balanced
case, but now we have Yi· = Yi·/ni and Y·· = Y··/N .
The definitions of αi, µ, Yij, and rij are the same as in the
balanced case.
In the unbalanced one-way ANOVA, the αi are identified byrequiring
∑i
niαi = 0.
The standard deviation estimate becomes:
σ =√∑ij
r2ij/(N −m).
The variance of αi becomes:
Var(αi) = σ2(1/ni − 1/N) = σ2N − niNni
.
The test statistic for a hypothesis test αi = 0 versus αi > 0
is
T = αi
√NniN − ni
/σ,
which has a tN−m distribution.
Since
P
(Q(.025) ≤
αi − αiσ
·√
NniN − ni
≤ Q(.975)
)= .95,
where Q is the tN−m quantile function, it follows that
P
(αi − σQ(.975)
√N − niNni
≤ αi ≤ αi + σQ(.975)
√N − niNni
)= .95.
• The sum of squares law is the same as in the balanced case,however the degrees of freedom must be generalized:
Sum of Squares DF Mean square
SSTO N-1 SSTO / (N-1)SSR m-1 SSR / (m-1)SSE N-m SSE / (N-m)
Note that for everything above, the formulas for the balancedcase are special cases of the formulas for the unbalancedcase, replacing ni with n, and nm with N .
• Here is an example unbalanced one-way layout, along with
its ANOVA table:
Group (X) Response (Y ) Yi· Yi·
1 3.4 3.7 2.9 3.5 3.2 16.7 3.342 4.1 3.8 4.2 12.1 4.033 3.5 3.7 3.2 3.9 3.5 3.5 21.3 3.554 2.5 3.2 3.1 3.6 3.4 3.1 3.0 3.1 25.0 3.135 3.9 3.6 3.8 11.3 3.77
86.4 3.46
Note that n1 = 5, n2 = 3, n3 = 6, n4 = 8, n5 = 3, and
N = 25.
The standard deviation estimate is σ ≈ .27.
The parameter estimates, t-statistics, p-values, and confi-
dence intervals are contained in the following table:
Parameter Estimate |T | p-value CI
α1 -0.12 1.06 3.0× 10−1 (−0.34,0.11)α2 0.58 3.90 8.9× 10−4 (0.27,0.89)α3 0.09 0.97 3.5× 10−1 (−0.11,0.30)α4 -0.33 4.15 4.9× 10−4 (−0.50,−0.16)α5 0.31 2.10 4.8× 10−2 (0.01,0.62)
The sum of squares law for the example is
SSTO = SSR + SSE
3.78 = 2.29 + 1.50,
giving 0.57 as the MSR, 0.07 as the MSE, an F -statistic value
of F = 7.64, and an F -statistic p-value of ≈ 6.5× 10−4.
Since the F -statistic is highly significant, we can reject thenull hypothesis that α1 = α2 = α3 = α4. Thus some ofthe factor levels have different mean values. Based on thehypothesis tests for each αi, we are confident that α2 > 0and α4 < 0.
Balanced two-way layout
• Suppose now that there are two qualitative X variables (twofactors). Thus each observation has the form (X1, X2, Y ).
Let m1 be the number of levels of the first factor, and letm2 be the number of levels of the second factor. Thusthere are m1m2 distinct combinations of factor levels. Alsosuppose that there is exactly one observation made at eachcombination of levels.
For example, in the following data m1 = 4 and m2 = 5. One
way to display the data is in the following table:
X1 X2 Y X1 X2 Y
1 1 4 3 1 51 2 5 3 2 41 3 6 3 3 71 4 5 3 4 51 5 4 3 5 42 1 3 4 1 42 2 2 4 2 42 3 4 4 3 62 4 3 4 4 42 5 2 4 5 4
For example, X1 may indicate 4 different levels of fertilizer
(e.g. no, low, medium, high), X2 may indicate 5 different
varieties of wheat, and Y may indicate the yield.
A more informative way to display this data is as follows:
1 2 3 4 5
1 4 5 6 5 42 3 2 4 3 23 5 4 7 5 44 4 4 6 4 4
X1
X2
We may refer to the Y values using the X1 and X2 values as
subscripts – for instance when X1 = 3 and X2 = 4, we have
Y34 = 5, or when X1 = 2 and X2 = 1, we have Y21 = 3.
We will need the row sums Yi· =∑j Yij and the column sums
Y·j =∑i Yij. Similarly we will need the row means Yi· =
Yi·/m2 and the column means Y·j = Y·j/m1. Finally we will
need the overall sum Y·· =∑ij Yij, and the overall mean
Y·· = Y··/m1m2.
All of these quantities can be displayed in the following ANOVA
table:
4 5 6 5 4 24 4.83 2 4 3 2 14 2.85 4 7 5 4 25 5.04 4 6 4 4 22 4.4
16 15 23 17 14 854.00 3.75 5.75 4.25 3.50 4.25
For example, Y2· = 14, Y4· = 4.4, Y·2 = 15, Y·4 = 4.25,Y·· = 85, and Y·· = 4.25.
• We will consider several different models for this data. Thecentral model is the additive model, that specifies the fol-lowing mean values for each observation:
EYij = µ+ αi + βj.
The unknown parameters are α1, . . . , αm1, β1, . . . , βm2, µ, and
σ.
Looking at another small example, suppose that m1 = 2,
m2 = 3, the population values are µ = −1, α1 = 3, α2 = −3,
β1 = −5, β2 = 0, β3 = 5. The interior of the following
“table of means” shows the mean values for each Yij, and
the margins show the population αi and βj values.
-5 0 5
3 -3 2 7-3 -9 -4 1
The additive model is
Yij = µ+ αi + βj + εij
where the εij are iid random variables with mean 0 and stan-
dard deviation σ. The following table shows each observed
value expressed as its mean value plus an error term:
-5 0 5
3 −3 + ε11 2 + ε12 7 + ε13-3 −9 + ε21 −4 + ε22 1 + ε23
• In practice we won’t know the αi, βj, µ, and σ values, so we
will estimate them by minimizing the least squares function:
∑ij
(Yij − µ− αi − βj)2.
As with the one-way layout, we are not able to identify the
parameters – adding a constant to every αi and subtracting
that constant from every βj does not change the mean levels.
To be able to identify the parameters, we require
∑i
αi = 0
and
∑j
βj = 0.
The least squares parameter estimates are:
µ = Y··
αi = Yi· − Y··βj = Y·j − Y··
For the 4 × 5 example above, the parameter estimates and
fitted values are given in the following “table of fitted val-
ues”:
4.55 4.3 6.3 4.8 4.05 0.552.55 2.3 4.3 2.8 2.05 -1.454.75 4.5 6.5 5.0 4.25 0.754.15 3.9 5.9 4.4 3.65 0.15
-0.25 -0.5 1.5 0 -0.75 4.25
The residuals are given in the following “table of residuals”:
-0.55 0.70 -0.30 0.20 -0.050.45 -0.30 -0.30 0.20 -0.050.25 -0.50 0.50 0.00 -0.25
-0.15 0.10 0.10 -0.40 0.35
As with any least squares fit, you can check that the residuals
sum to zero.
• Applying a similar derivation as was used in the one-way
layout, the variances of the parameter estimates are:
var(αi) = σ2(1/m2 − 1/m1m2) = σ2m1 − 1
m1m2.
var(βj) = σ2(1/m1 − 1/m1m2) = σ2m2 − 1
m1m2.
It can also be easily shown that these estimates are unbiased:
so Eαi = αi and Eβj = βj.
We will also need to study the correlation between αi and βj,
for any pair of indices i, j:
cov(αi, βj) = cov(Yi· − Y··, Y·j − Y··)= σ2/m1m2 − σ2/m1m2 − σ2/m1m2 + σ2/m1m2
= 0.
Thus αi and βj are uncorrelated.
• The fitted values are
Yij = µ+ αi + βj = Yi·+ Y·j − Y··,
the residuals are
rij = Yij − Yi· − Y·j + Y··,
and the error standard deviation estimate is
σ =√∑ij
r2ij/(m1 − 1)(m2 − 1).
• The usual “sum of squares” law applies:
∑ij
(Yij − Y··)2 =∑ij
(Yij − Yij)2 +∑ij
(Yij − Y··)2,
but now we can take it a step further, decomposing SSR as:
∑ij
(Yij − Y··)2 =∑ij
(αi + βj)2
= m2∑i
α2i +m1
∑j
β2j +
∑ij
αiβj.
Since∑i αi =
∑j βj = 0, it follows that
∑ij
αiβj =∑i
αi∑j
βj = 0.
Thus we can decompose SSR = SSA + SSB, where
SSA = m2∑i
α2i
and
SSB = m1∑j
β2j .
Thus for the balanced, two-way layout we get: SSTO = SSE
+ SSA + SSB. This decomposition can be used to carry out
several tests regarding the structure of the model. Before we
see how to do this, the degrees of freedom and mean squares
are as follows (where N = m1m2 is the total sample size):
SS MS DF
SSTO MSTO N − 1SSR MSR m1 +m2 − 2SSA MSA m1 − 1SSB MSB m2 − 1SSE MSE (m1 − 1)(m2 − 1)
For the example given above, the sums of squares and mean
squares are:
SSTO 29.75 MSTO 1.57SSR 27.45 MSR 3.92SSA 14.95 MSA 4.98SSB 12.50 MSB 3.13SSE 2.30 MSE 0.19
• We can use the sum of squares to assess the model that bestfits the data. First note that every two-way layout containstwo one-way layouts as special cases. If α1 = · · · = αm1 = 0,then the model reduces to the one-way layout EYij = µ+ βj(the i index doesn’t affect the mean). Similarly, if β1 = · · · =βm2 = 0, then the model reduces to EYij = µ + αi (the j
index doesn’t affect the mean).
If SSA is small, then it may easily be true that all αi = 0,while if SSA is large, the data strongly suggest that someof the αi are nonzero. Thus we can use SSA as our teststatistic for the null hypothesis α1 = · · · = αm1 = 0. Wenormalize to get an F-statistic:
F =MSA
MSE,
with m1 − 1, (m1 − 1)(m2 − 1) DF.
In the example, we get F ≈ 26.21, giving a p-value < 10−5.
Similarly, to test the null hypothesis β1 = · · · = βm2 = 0, use
F =MSB
MSE,
which has m2 − 1, (m1 − 1)(m2 − 1) DF.
In the example, we get F ≈ 16.47, giving a p-value < 10−4.
We may also test the null hypothesis that all αi = 0 and allβj = 0 using
F =MSR
MSE,
which has m1 +m2 − 2, (m1 − 1)(m2 − 1) DF.
In the example, we get F ≈ 20.63, giving a p-value < 10−5.
Balanced two-way layout with replicates
• Suppose, as above, that there are two factors influencing
the response, so the data takes the form (X1, X2, Y ). Now
suppose that for each combination of X1&X2, we observe
r > 1 replicates of Y .
For example, the following data come from an experimentwhere the response (Y ) is the weight of 16 week old chicks,and the factors are “level of protein” (X1) and “level of fishsolubles” (X2). There are m1 = 3 levels of protein (low,medium, high), and m2 = 2 levels of fish solubles (absent,present). For each combination of factor levels, there arer = 2 replicates. The data can be written:
No FS FS
Low 7094, 7053 8005, 7657Med 6943, 6249 7359, 7292High 6748, 6422 6764, 6560
We use Yijk to denote replicate k within cell i, j (i.e. replicatek where X1 = i and X2 = j). For example, Y111 = 7094, andY112 = 7053.
• Row and column sums, Yi··, Y·j·, and the total Y··· now include
a sum over the replicates as well (hence the third · in the
subscript). The row and column means, and grand mean
must account for the replicates in their denominator, so Yi·· =Yi··/m2r, Y·j· = Y·j·/m1r, and Y··· = Y···/m1m2r.
The following table shows the data given above, with the
row and column sums and averages given in the margins.
No FS FS
Low 7094, 7053 8005, 7657 29809 7452Med 6943, 6249 7359, 7292 27843 6961High 6748, 6422 6764, 6560 26494 6624
40509 43637 841466752 7273 7012
There is also a new average, the “within cell average” Yij·formed by averaging all values in a single cell. The table of
within cell averages for the example follows:
No FS FS
Low 7074 7831Med 6596 7326High 6585 6662
• Parameter estimates are defined as in the case with no repli-
cates, but the standard errors will be smaller since we have
more data when r > 1:
Parameter Estimate Variance
αi Yi·· − Y··· σ2(m1 − 1)/Nβj Y·j· − Y··· σ2(m2 − 1)/Nµ Y··· σ2/N
Fitted values are defined as
Yijk = µ+ αi + βj.
Note that the fitted values for all replicates in a common cellare identical (i.e. Yijk does not depend on k).
• By direct analogy with the two-way layout with no replicates,we have the sum of squares law SSTO = SSE + SSA + SSB(note that SSA and SSB are scaled by r in this case):
∑(Yijk − Y )2 =
∑(Yijk − Yijk)2 +m2r
∑i
α2i +m1r
∑j
β2j .
Since we have replicates, we can further decompose SSE:
∑(Yijk − Yijk)2 =
∑(Yijk − Yij·+ Yij· − Yijk)2
=∑
(Yijk − Yij·)2 +∑
(Yij· − Yijk)2
+2∑
(Yijk − Yij·)(Yij· − Yijk).
Focusing on the third term, and using the fact that Yijk does
not depend on k,
∑ijk
(Yijk − Yij·)(Yij· − Yijk) =∑ij
(Yij· − Yijk)∑k
(Yijk − Yij·)
= 0
Thus we have a decomposition SSE = SSP + SSI, where
SSP =∑
(Yijk − Yij·)2
SSI =∑
(Yij· − Yijk)2,
where SSP is the “pure error” sum of squares, and SSI is the
“interaction” sum of squares. Note that when r = 1, SSP =
0 and SSI = SSE.
• Degrees of freedom and mean squares are as follows:
Sum Mean DF
SSTO MSTO N − 1SSR MSR m1 +m2 − 2SSA MSA m1 − 1SSB MSB m2 − 1SSE MSE N −m1 −m2 + 1SSP MSP m1m2(r − 1)SSI MSI (m1 − 1)(m2 − 1)
Note that when r = 1, SSP has 0 DF, and the DF’s for SSI
and SSE are identical.
For the example, the sums of squares and degrees of freedom
are:
Sum DF
SSTO 2879821 11SSR 2204881 3SSA 1389515 2SSB 815365 1SSE 674941 8SSP 378401 6SSI 296540 2
The usual estimate for the error standard deviation is
σ =
√∑r2ijk/(N −m1 −m2 + 1),
but we also have a different estimate
σpure =
√∑(Yij· − Yijk)2/m1m2(r − 1).
In the example, σ = 290 while σpure = 251.
• As in the case r = 1, we have F-tests for the null hypothesis
“all αi = 0” (F = MSA / MSE), for the null hypothesis “all
βj = 0” (F = MSB / MSE), and for the null hypothesis “all
αi = 0 and all βj = 0” (F = MSR / MSE).
In the example, MSA/MSE ≈ 8.2, giving a p-value ≈ .01,
and MSB/MSE ≈ 9.7, also giving a p-value ≈ .01. So both
the row and column effects are significantly different from
zero.
• When r > 1 we are also able to test the additivity hypothesis
(i.e. that EYijk = µ + αi + βj, as opposed to EYijk = θij,
where the θij are completely unconstrained). For example,
the table of population means on the left is additive, while
that on the right is not:
3 5 12 4 0
0 4 32 3 4
Before we consider formal tests, it is useful to construct some
simple plots that suggest whether the effects are interactive.
The following are plots of the points (j, Yij·), where points
with a common i value are connected by lines. The plot on
the left corresponds to the table of means on the left, above,
and the plot on the right corresponds to the table of means
on the right, above.
0
1
2
3
4
5
0 0.5 1 1.5 20
0.5
1
1.5
2
2.5
3
3.5
4
0 0.5 1 1.5 2
If the the lines are approximately parallel, then additivity is
suggested. In the above plots, the left side is perfectly addi-
tive, while the right side is strongly non-additive. Similarly,
we can plot (i, Yij·), connecting points with a common j
value.
2
2.5
3
3.5
4
4.5
5
0 0.2 0.4 0.6 0.8 10
0.5
1
1.5
2
2.5
3
3.5
4
0 0.2 0.4 0.6 0.8 1
The same conclusion is suggested.
Here are the two diagnostic plots for the chick weight exam-
ple data:
6400
6600
6800
7000
7200
7400
7600
7800
8000
0 0.2 0.4 0.6 0.8 16400
6600
6800
7000
7200
7400
7600
7800
8000
0 0.5 1 1.5 2
There is some evidence of non-additivity, but it is not ex-
tremely strong.
• Using sample means rather than population means, additivity
will not hold exactly due to random variation, so we need to
assess whether the means are sufficiently close to additive to
infer that the underlying population means are exactly addi-
tive. This is acccomplished using the F-test F = MSI/MSP.
In the chick weight example, the F-statistic is ≈ 2.4, giving a
p-value of ≈ .17. This provides quantitative evidence that the
interactivity suggested in the plot above is not unexpected
given the sample size and noise level.