gary-houston
Topic 20: Single Factor Analysis of Variance
Outline
• Analysis of Variance
– One set of treatments (i.e., single factor)
• Cell means model
• Factor effects model
– Link to linear regression using indicator explanatory variables
One-Way ANOVA
• The response variable Y is continuous
• The explanatory variable is categorical
– We call it a factor
– The possible values are called levels
• This approach is a generalization of the independent two-sample pooled t-test
• In other words, it can be used when there are more than two treatments
Data for One-Way ANOVA
• Y is the response variable
• X is the factor (it is qualitative/discrete)
– r is the number of levels
– often refer to these levels as groups or treatments
• Yi,j is the jth observation in the ith group
Notation
• For Yi,j we use
– i to denote the level of the factor
– j to denote the jth observation at factor level i
• i = 1, . . . , r levels of factor X
• j = 1, . . . , ni observations for level i of factor X
– ni does not need to be the same in each group
KNNL Example (p 685)
• Y is the number of cases of cereal sold
• X is the design of the cereal package
– there are 4 levels for X because there are 4 different package designs
• i =1 to 4 levels
• j =1 to ni stores with design i (ni=5,5,4,5)
• We will write n in place of ni when the ni are the same across groups
Data for one-way ANOVA
data a1;
  infile 'c:../data/ch16ta01.txt';
  input cases design store;
proc print data=a1;
run;
The data
Obs   cases   design   store
1     11      1        1
2     17      1        2
3     16      1        3
4     14      1        4
5     15      1        5
6     12      2        1
7     10      2        2
8     15      2        3
Plot the data
symbol1 v=circle i=none;
proc gplot data=a1;
  plot cases*design;
run;
The plot
Plot the means
proc means data=a1;
  var cases;
  by design;
  output out=a2 mean=avcases;
proc print data=a2;
symbol1 v=circle i=join;
proc gplot data=a2;
  plot avcases*design;
run;
New Data Set
Obs   design   _TYPE_   _FREQ_   avcases
1     1        0        5        14.6
2     2        0        5        13.4
3     3        0        4        19.5
4     4        0        5        27.2
Plot of the means
The Model
• We assume that the response variable is
– Normally distributed with a
1. mean that may depend on the level of the factor
2. constant variance
• All observations assumed independent
• NOTE: Same assumptions as linear regression except there is no assumed linear relationship between X and E(Y|X)
Cell Means Model
• A “cell” refers to a level of the factor
• Yij = μi + εij
– where μi is the theoretical mean or expected value of all observations at level (or cell) i
– the εij are iid N(0, σ²), which means
– Yij ~ N(μi, σ²) and independent
– This is called the cell means model
Parameters
• The parameters of the model are
– μ1, μ2, … , μr
– σ²
• Question (Version 1) – Does our explanatory variable help explain Y?
• Question (Version 2) – Do the μi vary?
H0: μ1= μ2= … = μr = μ (a constant)
Ha: not all μ’s are the same
Estimates
• Estimate μi by the mean of the observations at level i (sample mean)
• $\hat{\mu}_i = \bar{Y}_{i.} = \sum_j Y_{ij} / n_i$
• For each level i, also get an estimate of the variance
• $s_i^2 = \sum_j (Y_{ij} - \bar{Y}_{i.})^2 / (n_i - 1)$ (sample variance)
• We combine these to get an overall estimate of σ²
• Same approach as pooled t-test
Pooled estimate of σ²
• If the ni were all the same we would average the $s_i^2$
– Do not average the $s_i$
• In general we pool the $s_i^2$, giving weights proportional to the df, ni − 1
• The pooled estimate is
$s^2 = \dfrac{\sum_i (n_i - 1) s_i^2}{\sum_i (n_i - 1)} = \dfrac{\sum_i (n_i - 1) s_i^2}{n_T - r}$
Running proc glm
proc glm data=a1;
  class design;
  model cases=design;
  means design;
  lsmeans design;
run;
Difference 1: Need to specify factor variables
Difference 2: Ask for mean estimates
Output
Class Level Information
Class    Levels   Values
design   4        1 2 3 4

Number of Observations Read   19
Number of Observations Used   19

Important summaries: check them!
SAS 9.3 default output for MEANS statement
MEANS statement output
Level of          cases
design     N      Mean          Std Dev
1          5      14.6000000    2.30217289
2          5      13.4000000    3.64691651
3          4      19.5000000    2.64575131
4          5      27.2000000    3.96232255
Table of sample means and sample standard deviations
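As a rough check of the pooling rule, the standard deviations in the MEANS output can be squared and combined with weights ni − 1. This is only a worked sketch using the rounded values printed above; the pooled value should agree with the Error mean square that proc glm reports for these data.

```latex
\begin{aligned}
s_1^2 &\approx 2.3022^2 \approx 5.3, \quad
s_2^2 \approx 3.6469^2 \approx 13.3, \quad
s_3^2 \approx 2.6458^2 \approx 7.0, \quad
s_4^2 \approx 3.9623^2 \approx 15.7 \\
s^2 &= \frac{4(5.3) + 4(13.3) + 3(7.0) + 4(15.7)}{(5-1) + (5-1) + (4-1) + (5-1)}
     = \frac{158.2}{15} \approx 10.547
\end{aligned}
```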
SAS 9.3 default output for LSMEANS statement
LSMEANS statement output
design   cases LSMEAN   Standard Error   Pr > |t|
1        14.6000000     1.4523544        <.0001
2        13.4000000     1.4523544        <.0001
3        19.5000000     1.6237816        <.0001
4        27.2000000     1.4523544        <.0001
Provides estimates based on model (i.e., constant variance)
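The LSMEANS standard errors can be reproduced by hand, assuming each one is the square root of the pooled variance estimate divided by the group size. The value 10.5467 used below is the Error mean square for the cereal data; the formula is a sketch of what "based on the model" means here, not a statement of SAS internals.

```latex
\begin{aligned}
\mathrm{SE}(\bar{Y}_{i.}) &= \sqrt{\mathrm{MSE} / n_i} \\
n_i = 5: \quad & \sqrt{10.5467 / 5} \approx 1.4524 \\
n_i = 4: \quad & \sqrt{10.5467 / 4} \approx 1.6238
\end{aligned}
```

Design 3 has the larger standard error only because it has 4 stores instead of 5; the variance estimate is the same for every group.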
Notation
• $n_T = \sum_i n_i$ is the total number of observations
• $\bar{Y}_{i.} = \sum_j Y_{ij} / n_i$ (trt sample mean)
• $\bar{Y}_{..} = \sum_i \sum_j Y_{ij} / n_T$ (grand sample mean)
ANOVA Table
Source   df       SS                                                  MS
Model    r-1      $\sum_{i,j} (\bar{Y}_{i.} - \bar{Y}_{..})^2$        SSR/dfR
Error    nT-r     $\sum_{i,j} (Y_{ij} - \bar{Y}_{i.})^2$              SSE/dfE
Total    nT-1     $\sum_{i,j} (Y_{ij} - \bar{Y}_{..})^2$              SST/dfT
ANOVA SAS Output
Source            DF   Sum of Squares   Mean Square    F Value   Pr > F
Model              3   588.2210526      196.0736842    18.59     <.0001
Error             15   158.2000000       10.5466667
Corrected Total   18   746.4210526

R-Square   Coeff Var   Root MSE   cases Mean
0.788055   17.43042    3.247563   18.63158
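The entries in this output can be checked from the group sizes and means alone. The sketch below uses the grand mean 354/19 ≈ 18.6316 and the group means 14.6, 13.4, 19.5, 27.2; intermediate values are rounded, so the last digit may differ slightly from SAS.

```latex
\begin{aligned}
\mathrm{SSR} &= \sum_i n_i (\bar{Y}_{i.} - \bar{Y}_{..})^2
  \approx 81.268 + 136.847 + 3.017 + 367.089 = 588.221 \\
\mathrm{MSR} &= 588.221 / 3 \approx 196.074, \qquad
\mathrm{MSE} = 158.2 / 15 \approx 10.547 \\
F &= \mathrm{MSR} / \mathrm{MSE} \approx 18.59 \\
\mathrm{SST} &= \mathrm{SSR} + \mathrm{SSE} \approx 588.221 + 158.2 = 746.421, \qquad
R^2 = \mathrm{SSR} / \mathrm{SST} \approx 0.788
\end{aligned}
```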
Expected Mean Squares
• $E(\mathrm{MSE}) = \sigma^2$
• $E(\mathrm{MSR}) = \sigma^2 + \dfrac{\sum_i n_i (\mu_i - \mu_.)^2}{r - 1}$, where $\mu_. = \sum_i n_i \mu_i / n_T$
• E(MSR) > E(MSE) when the group means are different
• See KNNL p 694 – 698 for more details
• In more complicated models, these tell us how to construct the F test
F test
• F = MSR/MSE
• H0: μ1 = μ2 = … = μr
• Ha: not all of the μi are equal
• Under H0, F ~ F(r-1, nT-r)
• Reject H0 when F is large
• Report the P-value
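The P-value is the upper-tail area of the F(r-1, nT-r) distribution beyond the observed statistic. A minimal sketch in a data step, using the cereal values (F = 18.59 with 3 and 15 df); the variable names and the 0.05 level are illustrative assumptions, while PROBF and FINV are standard SAS functions.

```sas
data _null_;
  f_obs  = 18.59;   * observed F = MSR/MSE for the cereal example;
  df_num = 3;       * r - 1;
  df_den = 15;      * nT - r;
  p_value = 1 - probf(f_obs, df_num, df_den);   * upper-tail area under H0;
  f_crit  = finv(0.95, df_num, df_den);         * cutoff for an assumed 0.05 level;
  put p_value= f_crit=;
run;
```

The values are written to the log; H0 is rejected at the assumed level when f_obs exceeds f_crit, which is equivalent to p_value falling below 0.05.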
Maximum Likelihood Approach
proc glimmix data=a1;
class design;
model cases=design / dist=normal;
lsmeans design;
run;
GLIMMIX Output
Model Information
Data Set WORK.A1
Response Variable cases
Response Distribution Gaussian
Link Function Identity
Variance Function Default
Variance Matrix Diagonal
Estimation Technique Restricted Maximum Likelihood
Degrees of Freedom Method Residual
GLIMMIX Output
Fit Statistics
-2 Res Log Likelihood        84.12
AIC (smaller is better)      94.12
AICC (smaller is better)    100.79
BIC (smaller is better)      97.66
CAIC (smaller is better)    102.66
HQIC (smaller is better)     94.08
Pearson Chi-Square          158.20
Pearson Chi-Square / DF      10.55
GLIMMIX Output
Type III Tests of Fixed Effects
Effect   Num DF   Den DF   F Value   Pr > F
design   3        15       18.59     <.0001

design Least Squares Means
design   Estimate   Standard Error   DF   t Value   Pr > |t|
1        14.6000    1.4524           15   10.05     <.0001
2        13.4000    1.4524           15    9.23     <.0001
3        19.5000    1.6238           15   12.01     <.0001
4        27.2000    1.4524           15   18.73     <.0001
Factor Effects Model
• A reparameterization of the cell means model
• Useful way of looking at more complicated models
• Null hypotheses are easier to state
• Yij = μ + τi + εij
– the εij are iid N(0, σ²)
Parameters
• The parameters of the model are
– μ, τ1, τ2, … , τr
– σ²
• The cell means model had r + 1 parameters
– r μ’s and σ²
• The factor effects model has r + 2 parameters
– μ, the r τ’s, and σ²
– Cannot uniquely estimate all parameters
An example
• Suppose r=3; μ1 = 10, μ2 = 20, μ3 = 30
• What is an equivalent set of parameters for the factor effects model?
• We need to have μ + τi = μi
• μ = 0, τ1 = 10, τ2 = 20, τ3 = 30
• μ = 20, τ1 = -10, τ2 = 0, τ3 = 10
• μ = 5000, τ1 = -4990, τ2 = -4980, τ3 = -4970
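The three parameter sets listed are members of one family indexed by an arbitrary constant c (a symbol introduced here only for illustration). Any choice of c gives the same cell means, which is why the data cannot distinguish among them:

```latex
\mu = c, \qquad \tau_i = \mu_i - c
\quad \Longrightarrow \quad
\mu + \tau_i = c + (\mu_i - c) = \mu_i \quad \text{for every } c .
```

Taking c = 0, 20, and 5000 with (μ1, μ2, μ3) = (10, 20, 30) gives the three sets in the example.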
Problem with factor effects?
• These parameters are not estimable or not well defined (i.e., not unique)
• There are many solutions to the least squares problem
• The X΄X matrix for this parameterization does not have an inverse (perfect multicollinearity)
• The parameter estimators here are biased (SAS proc glm)
Factor effects solution
• Put a constraint on the τi
• Common to assume Σi τi = 0
• This effectively reduces the number of parameters by 1
• Numerous other constraints possible
Consequences
• Regardless of constraint, we always have μi = μ + τi
• The constraint Σi τi = 0 implies
– μ = (Σi μi)/r (unweighted grand mean)
– τi = μi – μ (group effect)
• The “unweighted” complicates things when the ni are not all equal; see KNNL p 702-708
Hypotheses
• H0: μ1 = μ2 = … = μr
• H1: not all of the μi are equal
are translated into
• H0: τ1 = τ2 = … = τr = 0
• H1: at least one τi is not 0
Estimates of parameters
• With the constraint Σi τi = 0
$\hat{\mu} = \sum_i \bar{Y}_{i.} / r \;=\; \bar{Y}_{..}$ (if $n_i = n$)
$\hat{\tau}_i = \bar{Y}_{i.} - \hat{\mu}$
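For the cereal data the group sizes are unequal (5, 5, 4, 5), so under the sum-to-zero constraint the estimate of μ is the unweighted average of the four group means rather than the grand sample mean. A worked sketch using the group means 14.6, 13.4, 19.5, 27.2:

```latex
\begin{aligned}
\hat{\mu} &= (14.6 + 13.4 + 19.5 + 27.2) / 4 = 74.7 / 4 = 18.675 \\
\hat{\tau}_1 &= 14.6 - 18.675 = -4.075, \quad
\hat{\tau}_2 = 13.4 - 18.675 = -5.275, \\
\hat{\tau}_3 &= 19.5 - 18.675 = 0.825, \quad
\hat{\tau}_4 = 27.2 - 18.675 = 8.525, \qquad
\textstyle\sum_i \hat{\tau}_i = 0
\end{aligned}
```

Note that 18.675 differs from the grand sample mean 354/19 ≈ 18.632, which weights each group by its size.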
Solution used by SAS
• Recall, X΄X does not have an inverse
• We can use a generalized inverse in its place
• (X΄X)- is the standard notation
• There are many generalized inverses, each corresponding to a different constraint
Solution used by SAS
• (X΄X)- used in proc glm corresponds to the constraint τr = 0
• Recall that μ and the τi are not estimable
• But the linear combinations μ + τi are estimable
• These are estimated by the cell means
Cereal package example
• Y is the number of cases of cereal sold
• X is the design of the cereal package
• i =1 to 4 levels
• j =1 to ni stores with design i
SAS coding for X
• Class statement generates r explanatory variables
• The ith explanatory variable is equal to 1 if the observation is from the ith group
• In other words, the rows of X are
1 1 0 0 0 for design=1
1 0 1 0 0 for design=2
1 0 0 1 0 for design=3
1 0 0 0 1 for design=4
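The same indicator coding can be built by hand and passed to an ordinary regression, which is one way to see the link to linear regression mentioned in the outline. This is a sketch, assuming the data set a1 read earlier; the names a1x, d1, d2, d3 are hypothetical. Leaving out the indicator for design 4 plays the role of the constraint τ4 = 0, so the X matrix has full column rank.

```sas
* Build indicator variables by hand, names d1 to d3 are hypothetical;
data a1x;
  set a1;
  d1 = (design = 1);   * 1 if the store used design 1, else 0;
  d2 = (design = 2);
  d3 = (design = 3);
run;

* Regression on the indicators with design 4 as the reference level;
proc reg data=a1x;
  model cases = d1 d2 d3;
run;
```

Under this coding the intercept estimates the mean for design 4 and each slope estimates the difference between a design's mean and the design 4 mean.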
Some options
proc glm data=a1;
  class design;
  model cases=design / xpx inverse solution;
run;
Output The X'X Matrix
         Int    d1    d2    d3    d4    cases
Int       19     5     5     4     5      354
d1         5     5     0     0     0       73
d2         5     0     5     0     0       67
d3         4     0     0     4     0       78
d4         5     0     0     0     5      136
cases    354    73    67    78   136     7342
Also contains X’Y
Output
X'X Generalized Inverse (g2)
         Int     d1      d2      d3      d4    cases
Int      0.2    -0.2    -0.2    -0.2     0      27.2
d1      -0.2     0.4     0.2     0.2     0     -12.6
d2      -0.2     0.2     0.4     0.2     0     -13.8
d3      -0.2     0.2     0.2     0.45    0      -7.7
d4       0       0       0       0       0       0
cases   27.2   -12.6   -13.8    -7.7     0     158.2
Output matrix
• Actually, this matrix is
$\begin{pmatrix} (X'X)^- & (X'X)^- X'Y \\ Y'X(X'X)^- & Y'Y - Y'X(X'X)^- X'Y \end{pmatrix}$
• Parameter estimates are in upper right corner, SSE is lower right corner (last column on previous page)
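The block structure can be verified by hand from the two printed matrices. The sketch below multiplies the first two rows of the generalized inverse by the X'Y column (354, 73, 67, 78, 136) and then forms the lower right entry; b denotes the vector of parameter estimates, a symbol introduced here for illustration.

```latex
\begin{aligned}
b_{\mathrm{Int}} &= 0.2(354) - 0.2(73) - 0.2(67) - 0.2(78) + 0(136) = 27.2 \\
b_{d1} &= -0.2(354) + 0.4(73) + 0.2(67) + 0.2(78) + 0(136) = -12.6 \\
\mathrm{SSE} &= Y'Y - Y'X\,b
  = 7342 - \bigl[\,354(27.2) + 73(-12.6) + 67(-13.8) + 78(-7.7) + 136(0)\,\bigr] \\
  &= 7342 - 7183.8 = 158.2
\end{aligned}
```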
Parameter estimates
Par    Est         St Err   t       P
Int     27.2 B     1.45     18.73   <.0001
d1     -12.6 B     2.05     -6.13   <.0001
d2     -13.8 B     2.05     -6.72   <.0001
d3      -7.7 B     2.17     -3.53   0.0030
d4       0.0 B     .        .       .
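The standard errors in this table appear to follow from the diagonal of the generalized inverse scaled by the error mean square. A worked sketch, assuming MSE = 158.2/15 ≈ 10.5467 and the diagonal entries 0.2, 0.4, and 0.45 from the generalized inverse output:

```latex
\begin{aligned}
\mathrm{SE}(\mathrm{Int}) &= \sqrt{10.5467 \times 0.2} \approx 1.452 \\
\mathrm{SE}(d1) = \mathrm{SE}(d2) &= \sqrt{10.5467 \times 0.4} \approx 2.054 \\
\mathrm{SE}(d3) &= \sqrt{10.5467 \times 0.45} \approx 2.179 \\
t_{d1} &= -12.6 / 2.054 \approx -6.13
\end{aligned}
```

The multipliers 0.4 = 1/5 + 1/5 and 0.45 = 1/4 + 1/5 are what one would expect for a difference between two independent group means.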
Caution Message
NOTE: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.
Interpretation
• If τr = 0 (in our case, τ4 = 0), then the corresponding estimate should be zero
• the intercept μ is estimated by the mean of the observations in group 4
• since μ + τi is the mean of group i, the τi are the differences between the mean of group i and the mean of group 4
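Because only the combinations μ + τi and the differences between τ's are estimable, one way to request them directly is with ESTIMATE statements in proc glm. This is a sketch assuming the data set a1 from earlier; the labels in quotes are arbitrary.

```sas
proc glm data=a1;
  class design;
  model cases = design;
  * cell mean for design 1, that is mu + tau1;
  estimate 'mean, design 1' intercept 1 design 1 0 0 0;
  * difference between design 1 and design 4, that is tau1 - tau4;
  estimate 'design 1 vs 4' design 1 0 0 -1;
run;
```

Neither estimate depends on which constraint is used, so they should match the group 1 mean and the d1 parameter estimate reported for this example.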
Recall the means output
Level of
design     N     Mean    Std Dev
1          5     14.6    2.3
2          5     13.4    3.6
3          4     19.5    2.6
4          5     27.2    3.9
Parameter estimates based on means
Level of
design     Mean    Estimate
                   $\hat{\mu} = \bar{Y}_{4.} = 27.2$
1          14.6    $\hat{\tau}_1$ = 14.6 - 27.2 = -12.6
2          13.4    $\hat{\tau}_2$ = 13.4 - 27.2 = -13.8
3          19.5    $\hat{\tau}_3$ = 19.5 - 27.2 = -7.7
4          27.2    $\hat{\tau}_4$ = 27.2 - 27.2 = 0
Last slide
• Read KNNL Chapter 16 up to 16.10
• We used program topic20.sas to generate the output for today
• Will focus more on the relationship between regression and one-way ANOVA in next topic