agatha-anderson
1
Analysis of Variance & One Factor Designs
Y= DEPENDENT VARIABLE
(“yield”)
(“response variable”)
(“quality indicator”)
X = INDEPENDENT VARIABLE
(A possibly influential FACTOR)
2
OBJECTIVE: To determine the impact of X on Y
Mathematical Model:
$Y = f(X, \varepsilon)$, where $\varepsilon$ = (impact of) all factors other than X
Ex: Y = Battery Life (hours)
X = Brand of Battery
$\varepsilon$ = Many other factors (possibly, some we’re unaware of)
3
Statistical Model
“LEVEL” OF BRAND (Brand is, of course, represented as “categorical”)

Data layout (rows: replication i = 1, ..., R; columns: level of brand j = 1, ..., C):

         1      2     ...     C
1       Y11    Y12    ...    Y1C
2       Y21     .             .
.        .      .     Yij     .
R       YR1     .     ...    YRC

$Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$
i = 1, ..., R
j = 1, ..., C
4
Where
$\mu$ = OVERALL AVERAGE
j = index for FACTOR (Brand) LEVEL
i = index for “replication”
$\tau_j$ = Differential effect (response) associated with the jth level of X
and $\varepsilon_{ij}$ = “noise” or “error” associated with the (particular) (i,j)th data value.
Let $\mu_j$ = AVERAGE associated with the jth level of X. Then
$\tau_j = \mu_j - \mu$ and $\mu$ = AVERAGE of the $\mu_j$.
5
$Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$
By definition, $\sum_{j=1}^{C} \tau_j = 0$.
The experiment produces R x C $Y_{ij}$ data values.
The analysis produces estimates of $\mu, \tau_1, \dots, \tau_C$. (We can then get estimates of the $\varepsilon_{ij}$ by subtraction.)
6
                 1       2      ...      C
                Y11     Y12     ...     Y1C
                Y21      .               .
                 .       .               .
                YR1      .      ...     YRC
Column mean     Y•1     Y•2     ...     Y•C      (general term: Y•j)

$\bar{Y}_{\cdot 1}$, $\bar{Y}_{\cdot 2}$, etc., are Column Means
• • • • •
7
$\bar{Y}_{\cdot\cdot} = \sum_{j=1}^{C} \bar{Y}_{\cdot j} / C$ = “GRAND MEAN”
(assuming same # data points in each column)
(otherwise, $\bar{Y}_{\cdot\cdot}$ = mean of all the data)
8
MODEL: $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$
$\bar{Y}_{\cdot\cdot}$ estimates $\mu$
$\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}$ estimates $\tau_j$ (= $\mu_j - \mu$) (for all j)
These estimates are based on Gauss’ (1796)
PRINCIPLE OF LEAST SQUARES
and (I would argue) on COMMON SENSE
9
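MODEL: $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$
If you insert the estimates into the MODEL,
(1) $Y_{ij} = \bar{Y}_{\cdot\cdot} + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + \hat{\varepsilon}_{ij}$,
it follows that our estimate of $\varepsilon_{ij}$ is
(2) $\hat{\varepsilon}_{ij} = Y_{ij} - \bar{Y}_{\cdot j}$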
10
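Then, $Y_{ij} = \bar{Y}_{\cdot\cdot} + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + (Y_{ij} - \bar{Y}_{\cdot j})$
or,
(3) $(Y_{ij} - \bar{Y}_{\cdot\cdot}) = (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + (Y_{ij} - \bar{Y}_{\cdot j})$
TOTAL VARIABILITY in Y = Variability in Y associated with X + Variability in Y associated with all other factors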
11
If you square both sides of (3), and double sum both sides (over i and j), you get [after some unpleasant algebra, but lots of terms which “cancel”]
$\sum_{j=1}^{C}\sum_{i=1}^{R} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 = R \sum_{j=1}^{C} (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2 + \sum_{j=1}^{C}\sum_{i=1}^{R} (Y_{ij} - \bar{Y}_{\cdot j})^2$
TSS = SSBC + SSW (SSE)
TOTAL SUM OF SQUARES = SUM OF SQUARES BETWEEN COLUMNS + SUM OF SQUARES WITHIN COLUMNS
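The “unpleasant algebra” can be sketched in a few lines. This is a standard argument written in the notation of identity (3), not taken from the slides: expanding the square leaves a cross term, and that term vanishes because deviations from a column mean sum to zero within each column.

```latex
\begin{align*}
\sum_{j=1}^{C}\sum_{i=1}^{R} (Y_{ij}-\bar{Y}_{\cdot\cdot})^2
  &= \sum_{j=1}^{C}\sum_{i=1}^{R}
     \big[(\bar{Y}_{\cdot j}-\bar{Y}_{\cdot\cdot}) + (Y_{ij}-\bar{Y}_{\cdot j})\big]^2 \\
  &= R\sum_{j=1}^{C} (\bar{Y}_{\cdot j}-\bar{Y}_{\cdot\cdot})^2
   + \sum_{j=1}^{C}\sum_{i=1}^{R} (Y_{ij}-\bar{Y}_{\cdot j})^2
   + 2\sum_{j=1}^{C} (\bar{Y}_{\cdot j}-\bar{Y}_{\cdot\cdot})
      \sum_{i=1}^{R} (Y_{ij}-\bar{Y}_{\cdot j}), \\
\sum_{i=1}^{R} (Y_{ij}-\bar{Y}_{\cdot j})
  &= R\,\bar{Y}_{\cdot j} - R\,\bar{Y}_{\cdot j} = 0
  \quad\text{for every } j .
\end{align*}
```

The factor R in the first term appears because $(\bar{Y}_{\cdot j}-\bar{Y}_{\cdot\cdot})^2$ does not depend on i, so summing it over the R replications simply multiplies it by R.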
12
ANOVA TABLE

SOURCE OF VARIABILITY             SSQ     DF            Mean square (M.S.)
Between Columns (due to brand)    SSBC    C - 1         MSBC = SSBC / (C - 1)
Within Columns (due to error)     SSW     (R - 1) • C   MSW = SSW / ((R - 1) • C)
TOTAL                             TSS     RC - 1
13
Example: Y = LIFETIME (HOURS)
BRAND (3 replications per level)

Brand          1     2     3     4     5     6     7     8
              1.8   4.2   8.6   7.0   4.2   4.2   7.8   9.0
              5.0   5.4   4.6   5.0   7.8   4.2   7.0   7.4
              1.0   4.2   4.2   9.0   6.6   5.4   9.8   5.8
Column mean   2.6   4.6   5.8   7.0   6.2   4.6   8.2   7.4      Grand mean: 5.8

SSBC = 3 ([2.6 - 5.8]^2 + [4.6 - 5.8]^2 + • • • + [7.4 - 5.8]^2)
     = 3 (23.04)
     = 69.12
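As a check on the arithmetic, here is a minimal Python sketch that recomputes the column means, the grand mean, the estimated brand effects and SSBC from the battery data. The variable names (`data`, `col_means`, `tau_hat`, `ssbc`) are illustrative choices, not part of the original slides.

```python
# Battery lifetimes (hours): 3 replications (rows) x 8 brands (columns).
data = [
    [1.8, 4.2, 8.6, 7.0, 4.2, 4.2, 7.8, 9.0],
    [5.0, 5.4, 4.6, 5.0, 7.8, 4.2, 7.0, 7.4],
    [1.0, 4.2, 4.2, 9.0, 6.6, 5.4, 9.8, 5.8],
]
R, C = len(data), len(data[0])

# Column means estimate the brand averages.
col_means = [sum(row[j] for row in data) / R for j in range(C)]

# Averaging the column means is valid here because every column has R values.
grand_mean = sum(col_means) / C

# Estimated differential effect of each brand: column mean minus grand mean.
tau_hat = [m - grand_mean for m in col_means]

# Sum of squares between columns.
ssbc = R * sum(t ** 2 for t in tau_hat)

print([round(m, 2) for m in col_means])
print(round(grand_mean, 2), round(ssbc, 2))
```

By construction the estimated effects in `tau_hat` sum to zero, mirroring the constraint placed on the brand effects in the model.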
14
Squared deviations of each data value from its own column mean:

Brand 1                  Brand 2                            Brand 8
(1.8 - 2.6)^2 = 0.64     (4.2 - 4.6)^2 = 0.16               (9.0 - 7.4)^2 = 2.56
(5.0 - 2.6)^2 = 5.76     (5.4 - 4.6)^2 = 0.64    • • • •    (7.4 - 7.4)^2 = 0
(1.0 - 2.6)^2 = 2.56     (4.2 - 4.6)^2 = 0.16               (5.8 - 7.4)^2 = 2.56
Column sum:  8.96                        0.96                               5.12

SSW = Total of (8.96 + 0.96 + • • • • • • + 5.12) = 46.72
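The within-column sum of squares can be verified the same way. This sketch (names are illustrative) also confirms numerically that the total sum of squares splits into the between-column and within-column parts for this data set.

```python
# Battery lifetimes (hours): 3 replications (rows) x 8 brands (columns).
data = [
    [1.8, 4.2, 8.6, 7.0, 4.2, 4.2, 7.8, 9.0],
    [5.0, 5.4, 4.6, 5.0, 7.8, 4.2, 7.0, 7.4],
    [1.0, 4.2, 4.2, 9.0, 6.6, 5.4, 9.8, 5.8],
]
R, C = len(data), len(data[0])

# Reorganise by column (brand) so each brand's values sit together.
columns = [[row[j] for row in data] for j in range(C)]
col_means = [sum(col) / R for col in columns]
grand_mean = sum(sum(col) for col in columns) / (R * C)

# Within columns: deviations of each value from its own column mean.
ssw = sum((y - m) ** 2 for col, m in zip(columns, col_means) for y in col)

# Between columns: deviations of column means from the grand mean.
ssbc = R * sum((m - grand_mean) ** 2 for m in col_means)

# Total: deviations of each value from the grand mean.
tss = sum((y - grand_mean) ** 2 for col in columns for y in col)

print(round(ssw, 2), round(ssbc, 2), round(tss, 2))
```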
15
ANOVA TABLE

Source of Variability    SSQ       df                    M.S.
BRAND                    69.12     7  (= 8 - 1)          9.87
ERROR                    46.72     16 (= 2 (8))          2.92
TOTAL                    115.84    23 (= (3 • 8) - 1)
16
We can show:
$E(MSBC) = \sigma^2 + V_{COL}$, where $V_{COL} = \frac{R}{C-1} \sum_{j} (\mu_j - \mu)^2$
($V_{COL}$ is a MEASURE OF DIFFERENCES AMONG COLUMN MEANS)
$E(MSW) = \sigma^2$
(Assuming each $Y_{ij}$ has (constant) standard deviation $\sigma$)
(More about assumptions later)
17
$E(MSBC) = \sigma^2 + V_{COL}$
$E(MSW) = \sigma^2$
This suggests that
if MSBC / MSW > 1, there’s some evidence of non-zero $V_{COL}$, or “level of X affects Y”;
if MSBC / MSW < 1, there’s no evidence that $V_{COL} > 0$, or that “level of X affects Y”.
18
With $H_0$: Level of X has no impact on Y
$H_1$: Level of X does have impact on Y,
we need MSBC / MSW >> 1 to reject $H_0$.
19
More Formally,
$H_0$: $\tau_1 = \tau_2 = \dots = \tau_C = 0$
$H_1$: not all $\tau_j = 0$
OR
$H_0$: $\mu_1 = \mu_2 = \dots = \mu_C$ (All column means are equal)
$H_1$: not all $\mu_j$ are EQUAL
20
The probability law of MSBC / MSW = “Fcalc” is the F-distribution with (C - 1, (R - 1)C) degrees of freedom, assuming $H_0$ true.
C = Table Value (the critical value read from the F table)
21
In our problem:
ANOVA TABLE

Source of Variability    SSQ      df    M.S.    Fcalc
BRAND                    69.12    7     9.87    3.38 (= 9.87 / 2.92)
ERROR                    46.72    16    2.92
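The whole table can be reproduced with a short function. The sketch below is one possible way to organise the computation for a balanced design; `one_way_anova` is a hypothetical helper name, and the critical value 2.66 is simply the F-table value for (7, 16) degrees of freedom at the .05 level used in this example, not something the code derives.

```python
def one_way_anova(columns):
    """One-factor ANOVA for a balanced design (same number of values per column)."""
    C = len(columns)
    R = len(columns[0])
    col_means = [sum(col) / R for col in columns]
    grand_mean = sum(col_means) / C
    ssbc = R * sum((m - grand_mean) ** 2 for m in col_means)
    ssw = sum((y - m) ** 2 for col, m in zip(columns, col_means) for y in col)
    df_between, df_within = C - 1, (R - 1) * C
    msbc, msw = ssbc / df_between, ssw / df_within
    return {"SSBC": ssbc, "SSW": ssw, "df": (df_between, df_within),
            "MSBC": msbc, "MSW": msw, "F": msbc / msw}


# Battery lifetimes (hours), one list per brand.
battery = [
    [1.8, 5.0, 1.0], [4.2, 5.4, 4.2], [8.6, 4.6, 4.2], [7.0, 5.0, 9.0],
    [4.2, 7.8, 6.6], [4.2, 4.2, 5.4], [7.8, 7.0, 9.8], [9.0, 7.4, 5.8],
]
table = one_way_anova(battery)

F_TABLE_VALUE = 2.66  # taken from the F table for (7, 16) DF at the .05 level
reject_h0 = table["F"] > F_TABLE_VALUE

print(round(table["MSBC"], 2), round(table["MSW"], 2), round(table["F"], 2), reject_h0)
```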
22
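[Figure: F distribution with (7, 16) DF; the upper-tail area $\alpha$ = .05 lies to the right of the table value C = 2.66, and Fcalc = 3.38 falls beyond C.]
F table coming up (7, 16 DF)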
23
F-Table
24
Hence, at $\alpha$ = .05, Reject $H_0$.
(i.e., Conclude that level of BRAND does have an impact on battery lifetime.)
25
ONE FACTOR ANOVA, Using EXCEL

1.8   4.2   8.6   7     4.2   4.2   7.8   9
5     5.4   4.6   5     7.8   4.2   7     7.4
1     4.2   4.2   9     6.6   5.4   9.8   5.8

Anova: Single-Factor

Summary
Groups      Count   Sum     Average   Variance
Column 1    3       7.8     2.6       4.48
Column 2    3       13.8    4.6       0.48
Column 3    3       17.4    5.8       5.92
Column 4    3       21      7         4
Column 5    3       18.6    6.2       3.36
Column 6    3       13.8    4.6       0.48
Column 7    3       24.6    8.2       2.08
Column 8    3       22.2    7.4       2.56

ANOVA
Source of Variation   SS       df    MS        F        P-value   F crit
Between Groups        69.12    7     9.87429   3.3816   0.02064   2.657
Within Groups         46.72    16    2.92
Total                 115.8    23
26
SPSS/MINITAB INPUT
VAR001 VAR002
1.8 1
5.0 1
1.0 1
4.2 2
5.4 2
4.2 2
. .
. .
. .
9.0 8
7.4 8
5.8 8
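The input layout above is the “long” (stacked) form: one row per observation, with the response in the first column and the brand label in the second. A minimal sketch of converting the wide battery table into that form follows; the names `wide` and `long_rows` are illustrative, and the column names VAR001/VAR002 are simply those shown on the slide.

```python
# Wide layout: brand label -> the 3 lifetimes (hours) observed for that brand.
wide = {
    1: [1.8, 5.0, 1.0],
    2: [4.2, 5.4, 4.2],
    3: [8.6, 4.6, 4.2],
    4: [7.0, 5.0, 9.0],
    5: [4.2, 7.8, 6.6],
    6: [4.2, 4.2, 5.4],
    7: [7.8, 7.0, 9.8],
    8: [9.0, 7.4, 5.8],
}

# Long layout: one (response, brand) pair per observation.
long_rows = [(y, brand) for brand, values in wide.items() for y in values]

print("VAR001 VAR002")
for y, brand in long_rows:
    print(f"{y:<6} {brand}")
```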
27
ONE-FACTOR ANOVA, using SPSS
- - - - - O N E W A Y - - - - -
Variable Lifetime By Variable Device

Analysis of Variance

Source            D.F.   Sum of Squares   Mean Squares   F Ratio   F Prob.
Between Groups    7      69.1200          9.8743         3.3816    .0206
Within Groups     16     46.7200          2.9200
Total             23     115.8400
28
ONE FACTOR ANOVA (MINITAB)
Analysis of Variance for life
Source DF SS MS F P
brand 7 69.12 9.87 3.38 0.021
Error 16 46.72 2.92
Total 23 115.84
MINITAB: STAT>>ANOVA>>ONE-WAY
29
[Figure: Dotplots of life by brand (group means are indicated by lines). x-axis: brand (1 to 8); y-axis: life (1 to 10).]
30
[Figure: Boxplots of life by brand (means are indicated by solid circles). x-axis: brand (1 to 8); y-axis: life (0 to 10).]
31
EXAMPLE: MORTAR
The tension bond strength of cement mortar is an important characteristic of the product. An engineer is interested in comparing the strength of a modified formulation, in which polymer latex emulsions have been added during mixing, to the strength of the unmodified mortar. The experimenter has collected 10 observations on strength for the modified formulation and another 10 observations for the unmodified formulation.
32
Modified   Unmodified
16.85      17.50
16.40      17.63
17.21      18.25
16.35      18.00
16.52      17.86
17.04      17.75
16.96      18.22
17.15      17.90
16.59      17.96
16.57      18.15
33
One-way ANOVA: strength versus type (Minitab)

Analysis of Variance for strength
Source   DF   SS       MS       F       P
type     1    6.7048   6.7048   82.98   0.000
Error    18   1.4544   0.0808
Total    19   8.1592
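The Minitab figures can be reproduced by hand from the two columns of mortar data. The sketch below uses the same between/within decomposition as the battery example, with C = 2 levels and R = 10 replications; variable names are illustrative.

```python
# Tension bond strength of mortar: 10 observations per formulation.
modified = [16.85, 16.40, 17.21, 16.35, 16.52, 17.04, 16.96, 17.15, 16.59, 16.57]
unmodified = [17.50, 17.63, 18.25, 18.00, 17.86, 17.75, 18.22, 17.90, 17.96, 18.15]
groups = [modified, unmodified]

C = len(groups)                 # number of levels (formulation types)
R = len(modified)               # replications per level (balanced design)

means = [sum(g) / R for g in groups]
grand_mean = sum(means) / C

ss_type = R * sum((m - grand_mean) ** 2 for m in means)               # between types
ss_error = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)  # within types

ms_type = ss_type / (C - 1)
ms_error = ss_error / ((R - 1) * C)
f_calc = ms_type / ms_error

print(round(ss_type, 4), round(ss_error, 4), round(ms_error, 4), round(f_calc, 2))
```

With only two levels, this F statistic is the square of the usual pooled two-sample t statistic, so the one-factor ANOVA and the pooled t-test lead to the same conclusion here.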
34
[Figure: Boxplots of strength by type (means are indicated by solid circles). x-axis: type (1, 2); y-axis: strength (16.5 to 18.5).]
35
ONE FACTOR ANOVA, using JMP
MVPC Survey Results

Amesbury   Andover   Methuen   Salem
66         55        56        64
66         50        56        70
66         51        57        62
67         47        58        64
70         57        61        66
64         48        54        62
71         52        62        67
66         50        57        60
71         48        61        68
67         50        58        68
63         48        54        66
60         49        51        66
66         52        57        61
70         48        60        63
69         48        59        67
66         48        56        67
70         51        61        70
65         49        55        62
71         46        62        62
63         51        53        68
69         54        59        70
67         54        58        62
64         49        54        63
68         55        58        65
65         47        55        68
67         47        58        68
65         53        55        64
70         51        60        65
68         50        58        69
73         54        64        62
36
[Figure: Score By Location. x-axis: Location (Amesbury, Andover, Methuen, Salem); y-axis: Score (45 to 75).]

Oneway Anova

Summary of Fit
RSquare                       0.841505
RSquare Adj                   0.837406
Root Mean Square Error        2.932527
Mean of Response              60.09167
Observations (or Sum Wgts)    120

Analysis of Variance
Source     DF     Sum of Squares   Mean Square   F Ratio    Prob > F
Model      3      5296.4250        1765.47       205.2947   <.0001
Error      116    997.5667         8.60
C Total    119    6293.9917        52.89

Means for Oneway Anova
Level       Number   Mean      Std Error
Amesbury    30       67.1000   0.53540
Andover     30       50.4000   0.53540
Methuen     30       57.5667   0.53540
Salem       30       65.3000   0.53540

Std Error uses a pooled estimate of error variance
37
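Assumptions
Basically, the same as in Regression analysis:
MODEL: $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$
1.) the $\varepsilon_{ij}$ are indep. random variables
2.) Each $\varepsilon_{ij}$ is Normally Distributed, with $E(\varepsilon_{ij}) = 0$ for all i, j
3.) $\sigma^2(\varepsilon_{ij})$ = constant for all i, j
Diagnostic plots: Normality plot (assumption 2), Residual plot (assumption 3), Run order plot (assumption 1)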
38
Diagnosis: Normality
• The points on the normality plot must more or less follow a straight line to claim “normally distributed”.
• There are statistical tests to verify it formally.
• The ANOVA method we learn here is not sensitive to the normality assumption. That is, a mild departure from the normal distribution will not change our conclusions much.
Normality plot: normal scores vs. residuals
39
From Mortar data:
[Figure: Normal Probability Plot of the Residuals (response is strength). x-axis: Residual (-0.5 to 0.5); y-axis: Normal Score (-2 to 2).]
40
Diagnosis: Constant Variances
• The points on the residual plot must be more or less within a horizontal band to claim “constant variances”.
• There are statistical tests to verify it formally.
• The ANOVA method we learn here is not sensitive to the constant variances assumption. That is, slightly different variances within groups will not change our conclusions much.
Residual plot: fitted values vs. residuals
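Alongside the residual plot, a quick numerical look at the within-group sample variances can help judge whether “slightly different” is a fair description. The sketch below does this for the mortar data using Python's standard `statistics` module; the ratio of the larger to the smaller variance is shown only as an informal summary, not as a formal test.

```python
from statistics import mean, variance

# Tension bond strength of mortar: 10 observations per formulation.
modified = [16.85, 16.40, 17.21, 16.35, 16.52, 17.04, 16.96, 17.15, 16.59, 16.57]
unmodified = [17.50, 17.63, 18.25, 18.00, 17.86, 17.75, 18.22, 17.90, 17.96, 18.15]

# Residuals are deviations from the group's own mean (the fitted value).
residuals = ([y - mean(modified) for y in modified]
             + [y - mean(unmodified) for y in unmodified])

# Sample variances (divisor n - 1) within each group.
var_modified = variance(modified)
var_unmodified = variance(unmodified)

# With equal group sizes, the pooled variance is the plain average of the two,
# and it equals the error mean square of the ANOVA table.
pooled = (var_modified + var_unmodified) / 2
ratio = max(var_modified, var_unmodified) / min(var_modified, var_unmodified)

print(round(var_modified, 4), round(var_unmodified, 4), round(pooled, 4), round(ratio, 2))
```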
41
From Mortar data:
[Figure: Residuals Versus the Fitted Values (response is strength). x-axis: Fitted Value (17.0 to 18.0); y-axis: Residual (-0.5 to 0.5).]
42
Diagnosis: Randomness/Independence
• The run order plot must show no “systematic” patterns to claim “randomness”.
• There are statistical tests to verify it formally.
• The ANOVA method is sensitive to the independence assumption. That is, even a small degree of dependence between data points can change our conclusions a lot.
Run order plot: order vs. residuals
43
From Mortar data:
[Figure: Residuals Versus the Order of the Data (response is strength). x-axis: Observation Order (2 to 20); y-axis: Residual (-0.5 to 0.5).]
44
This assumes a “fixed model”: Inherent interest in the specific levels of the factors under study. There’s no direct interest in extrapolating to other levels; inference will be limited to levels that appear in the experiment. The experimenter selects the levels.
If a “random model”: Levels in the experiment are randomly selected from a population of such levels, and inference is to be made about the entire population of levels.
Then, besides assumptions 1 to 3, there is another assumption:
4) a) the $\tau_j$ are independent random variables which are normally distributed with constant variance
b) the $\tau_j$ and $\varepsilon_{ij}$ are independent
45
With these assumptions, the estimates ($\bar{Y}_{\cdot\cdot}$ and the $\bar{Y}_{\cdot j}$) are “Maximum likelihood estimates” (a statistical notion which could be thought of as “efficiency” [“most likely value”]),
and, more directly relevant:
The “Conventional” F- and t- tests are applicable (VALID) for a variety of hypothesis testing and confidence interval computations.
46
KRUSKAL - WALLIS TEST
(Non - Parametric Alternative)
$H_0$: The probability distributions are identical for each level of the factor
$H_1$: Not all the distributions are the same
47
Brand
A B C
32 32 28
30 32 21
30 26 15
29 26 15
26 22 14
23 20 14
20 19 14
19 16 11
18 14 9
12 14 8
BATTERY LIFETIME (hours)
(each column rank ordered, for simplicity)
Mean: 23.9 22.1 14.9 (here, irrelevant!!)
48
$H_0$: No difference in distribution among the three brands with respect to battery lifetime
$H_1$: At least one of the 3 brands differs in distribution from the others with respect to lifetime
49
Brand
A B C
32 (29) 32 (29) 28 (24)
30 (26.5) 32 (29) 21 (18)
30 (26.5) 26 (22) 15 (10.5)
29 (25) 26 (22) 15 (10.5)
26 (22) 22 (19) 14 (7)
23 (20) 20 (16.5) 14 (7)
20 (16.5) 19 (14.5) 14 (7)
19 (14.5) 16 (12) 11 (3)
18 (13) 14 (7) 9 (2)
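12 (4) 14 (7) 8 (1)
T1 = 197 T2 = 178 T3 = 90
n1 = 10 n2 = 10 n3 = 10
(Ranks, taken over all 30 data values combined, are shown in parentheses; tied values share the average of their ranks.)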
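The ranking step can be sketched in Python as follows. The helper `midranks` is a hypothetical name written for this illustration: it ranks all 30 values together and gives tied values the average of the ranks they occupy, then the ranks are summed per brand.

```python
def midranks(values):
    """Rank values from 1 (smallest) upward; ties get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1   # mean of 1-based ranks pos+1 .. end+1
        for k in order[pos:end + 1]:
            ranks[k] = average_rank
        pos = end + 1
    return ranks


# Battery lifetimes (hours) for the three brands.
brands = {
    "A": [32, 30, 30, 29, 26, 23, 20, 19, 18, 12],
    "B": [32, 32, 26, 26, 22, 20, 19, 16, 14, 14],
    "C": [28, 21, 15, 15, 14, 14, 14, 11, 9, 8],
}

combined = [y for values in brands.values() for y in values]
ranks = midranks(combined)

# Split the combined ranks back into brands and sum them.
rank_sums = {}
start = 0
for name, values in brands.items():
    rank_sums[name] = sum(ranks[start:start + len(values)])
    start += len(values)

print(rank_sums)
```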
50
TEST STATISTIC:
$H = \frac{12}{N(N+1)} \sum_{j=1}^{K} \frac{T_j^2}{n_j} - 3(N+1)$
$n_j$ = # data values in column j
$N = \sum_{j=1}^{K} n_j$
K = # Columns (levels)
$T_j$ = SUM OF RANKS OF DATA IN COL j when all DATA COMBINED
(There is a slight adjustment in the formula as a function of the number of ties in rank.)
51
$H = \frac{12}{30(31)} \left[ \frac{197^2}{10} + \frac{178^2}{10} + \frac{90^2}{10} \right] - 3(31)$
= 8.41
(with adjustment for ties, we get 8.46)
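A short sketch of the H computation from the rank sums, followed by a tie adjustment. The slides do not spell out the adjustment; the version used here is the commonly used one, an assumption on my part, which divides H by 1 - Σ(t³ - t)/(N³ - N), where t runs over the sizes of the groups of tied values. In this sketch it gives roughly 8.47, close to the 8.46 quoted above.

```python
from collections import Counter

# Rank sums and sample sizes from the battery example.
rank_sums = [197, 178, 90]
sizes = [10, 10, 10]
N = sum(sizes)

# Kruskal-Wallis statistic without tie adjustment.
h = 12 / (N * (N + 1)) * sum(t ** 2 / n for t, n in zip(rank_sums, sizes)) - 3 * (N + 1)

# Tie adjustment (assumed standard form): count how often each value occurs.
data = [32, 30, 30, 29, 26, 23, 20, 19, 18, 12,
        32, 32, 26, 26, 22, 20, 19, 16, 14, 14,
        28, 21, 15, 15, 14, 14, 14, 11, 9, 8]
tie_sizes = [count for count in Counter(data).values() if count > 1]
correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (N ** 3 - N)
h_adjusted = h / correction

print(round(h, 2), round(h_adjusted, 2))
```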
52
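We can show that, under $H_0$, H is well approximated by a $\chi^2$ distribution with df = K - 1.
($\chi^2_{df} = df \cdot F_{df,\infty}$)
What do we do with H?
Here, df = 2, and at $\alpha$ = .05, the critical value = 5.99
[Figure: $\chi^2$ distribution with df = 2; the upper-tail area $\alpha$ = .05 lies to the right of 5.99, and H = 8.41 falls beyond it.]
Reject $H_0$; conclude that the lifetime distribution is NOT the same for all 3 BRANDS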
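For df = 2 specifically, the chi-square upper-tail probability has the closed form exp(-x/2), so the decision can be checked without tables. The sketch below relies on that special case only; it does not generalise to other degrees of freedom, and the variable names are illustrative.

```python
import math

ALPHA = 0.05
h = 8.41          # Kruskal-Wallis statistic from the battery example (unadjusted)

# For a chi-square distribution with 2 degrees of freedom:
#   P(X > x) = exp(-x / 2), so the critical value solves exp(-c / 2) = ALPHA.
critical_value = -2 * math.log(ALPHA)
p_value = math.exp(-h / 2)

reject_h0 = h > critical_value

print(round(critical_value, 2), round(p_value, 4), reject_h0)
```

Comparing H with the critical value and comparing the p-value with the significance level are equivalent ways of reaching the same decision.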