
1

Analysis of Variance & One Factor Designs

Y = DEPENDENT VARIABLE (“yield”, “response variable”, “quality indicator”)

X = INDEPENDENT VARIABLE (a possibly influential FACTOR)

2

OBJECTIVE: To determine the impact of X on Y

Mathematical Model:

$Y = f(X, \varepsilon)$, where $\varepsilon$ = (impact of) all factors other than X

Ex: Y = Battery Life (hours)

X = Brand of Battery

$\varepsilon$ = Many other factors (possibly, some we’re unaware of)

3

Statistical Model

“LEVEL” OF BRAND (Brand is, of course, represented as “categorical”)

                     Level of brand (j)
    Replication (i)   1      2      • • •   C
    1                 Y11    Y12    • • •   Y1C
    2                 Y21    Y22    • • •   Y2C
    •                 •      •              •
    •                 •      •      Yij     •
    •                 •      •              •
    R                 YR1    YR2    • • •   YRC

$Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$

i = 1, . . . , R

j = 1, . . . , C

4

Where

$\mu$ = OVERALL AVERAGE

j = index for FACTOR (Brand) LEVEL

i = index for “replication”

$\tau_j$ = Differential effect (response) associated with jth level of X

and $\varepsilon_{ij}$ = “noise” or “error” associated with the (particular) (i,j)th data value.

Let $\mu_j$ = AVERAGE associated with jth level of X

$\tau_j = \mu_j - \mu$ and $\mu$ = AVERAGE of the $\mu_j$.

5

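$Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$

By definition, $\sum_{j=1}^{C} \tau_j = 0$

The experiment produces R x C $Y_{ij}$ data values.

The analysis produces estimates of $\mu, \tau_1, \tau_2, \ldots, \tau_C$. (We can then get estimates of the $\varepsilon_{ij}$ by subtraction.)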

6

    Column (j)     1      2      • • •   C
                   Y11    Y12    • • •   Y1C
                   Y21    Y22    • • •   Y2C
                   •      •              •
                   •      •              •
                   YR1    YR2    • • •   YRC
    Column mean    Ȳ•1    Ȳ•2    • • •   Ȳ•C

$\bar{Y}_{\cdot 1}$, $\bar{Y}_{\cdot 2}$, etc., are Column Means (in general, $\bar{Y}_{\cdot j}$)

7

$\bar{Y}_{\cdot\cdot} = \sum_{j=1}^{C} \bar{Y}_{\cdot j} / C$ = “GRAND MEAN”

(assuming same # data points in each column)

(otherwise, $\bar{Y}_{\cdot\cdot}$ = mean of all the data)
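The difference between the two definitions of the grand mean can be seen in a minimal sketch. The data values and the function names `grand_mean` and `mean_of_column_means` are hypothetical, chosen only for illustration.

```python
from statistics import mean

def grand_mean(columns):
    """Mean of all data values, pooled across columns."""
    return mean(y for col in columns for y in col)

def mean_of_column_means(columns):
    """Unweighted average of the column means."""
    return mean(mean(col) for col in columns)

balanced = [[1.0, 3.0], [5.0, 7.0]]          # hypothetical: 2 values per column
unbalanced = [[1.0, 3.0], [5.0, 7.0, 9.0]]   # hypothetical: unequal column sizes

print(grand_mean(balanced), mean_of_column_means(balanced))      # 4.0 4.0
print(grand_mean(unbalanced), mean_of_column_means(unbalanced))  # 5.0 4.5
```

With equal column sizes the two agree; with unequal sizes only the pooled mean of all the data is the grand mean.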

8

MODEL: $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$

$\bar{Y}_{\cdot\cdot}$ estimates $\mu$

$\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}$ estimates $\tau_j$ (= $\mu_j - \mu$) (for all j)

These estimates are based on Gauss’ (1796)

PRINCIPLE OF LEAST SQUARES

and (I would argue) on COMMON SENSE

9

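MODEL: $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$

If you insert the estimates into the MODEL,

(1) $Y_{ij} = \bar{Y}_{\cdot\cdot} + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + \hat{\varepsilon}_{ij}$

it follows that our estimate of $\varepsilon_{ij}$ is

(2) $\hat{\varepsilon}_{ij} = Y_{ij} - \bar{Y}_{\cdot j}$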
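As a rough illustration of these estimates, the sketch below applies them to the battery lifetime data from the worked example later in these slides. The variable names (`columns`, `mu_hat`, `tau_hat`, `resid`) are illustrative choices, not notation from the slides.

```python
from statistics import mean

# Battery lifetimes (hours) from the worked example in these slides:
# one list per brand (column), three replications each.
columns = [
    [1.8, 5.0, 1.0], [4.2, 5.4, 4.2], [8.6, 4.6, 4.2], [7.0, 5.0, 9.0],
    [4.2, 7.8, 6.6], [4.2, 4.2, 5.4], [7.8, 7.0, 9.8], [9.0, 7.4, 5.8],
]

col_means = [mean(col) for col in columns]
mu_hat = mean(col_means)                     # grand mean (balanced design)
tau_hat = [m - mu_hat for m in col_means]    # estimated brand effects
# Estimated errors: each value minus its own column mean.
resid = [[y - m for y in col] for col, m in zip(columns, col_means)]

print(round(mu_hat, 2))
print([round(t, 2) for t in tau_hat])
```

The estimated effects sum to zero across brands, and the estimated errors sum to zero within each column, up to floating-point rounding.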

10

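Then, $Y_{ij} = \bar{Y}_{\cdot\cdot} + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + (Y_{ij} - \bar{Y}_{\cdot j})$

or,

(3) $(Y_{ij} - \bar{Y}_{\cdot\cdot}) = (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + (Y_{ij} - \bar{Y}_{\cdot j})$

TOTAL VARIABILITY in Y = Variability in Y associated with X + Variability in Y associated with all other factors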

11

If you square both sides of (3), and double sum both sides (over i and j), you get [after some unpleasant algebra, but lots of terms which “cancel”]

$\sum_{j=1}^{C} \sum_{i=1}^{R} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 = R \sum_{j=1}^{C} (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2 + \sum_{j=1}^{C} \sum_{i=1}^{R} (Y_{ij} - \bar{Y}_{\cdot j})^2$

TSS = SSBC + SSW (SSE)

TOTAL SUM OF SQUARES = SUM OF SQUARES BETWEEN COLUMNS + SUM OF SQUARES WITHIN COLUMNS
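The decomposition can be checked numerically. The following is a minimal sketch on a small hypothetical data set (C = 3 columns, R = 2 replications); the function name `sums_of_squares` is an illustrative choice.

```python
from statistics import mean

def sums_of_squares(columns):
    """Return (TSS, SSBC, SSW) for equal-sized columns of data."""
    r = len(columns[0])
    col_means = [mean(col) for col in columns]
    grand = mean(col_means)
    tss = sum((y - grand) ** 2 for col in columns for y in col)
    ssbc = r * sum((m - grand) ** 2 for m in col_means)
    ssw = sum((y - m) ** 2 for col, m in zip(columns, col_means) for y in col)
    return tss, ssbc, ssw

# Hypothetical data: column means are 2, 5, 8 and the grand mean is 5.
tss, ssbc, ssw = sums_of_squares([[1.0, 3.0], [4.0, 6.0], [8.0, 8.0]])
print(tss, ssbc, ssw)  # 40.0 36.0 4.0
```

Here 36 + 4 = 40, so the between-column and within-column pieces add up to the total, as the identity requires.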

12

ANOVA TABLE

    SOURCE OF VARIABILITY             SSQ     DF            Mean square (M.S.)
    Between Columns (due to brand)    SSBC    C - 1         MSBC = SSBC / (C - 1)
    Within Columns (due to error)     SSW     (R - 1) • C   MSW = SSW / ((R - 1) • C)
    TOTAL                             TSS     RC - 1

13

Example: Y = LIFETIME (HOURS), 3 replications per level

    BRAND          1     2     3     4     5     6     7     8
                   1.8   4.2   8.6   7.0   4.2   4.2   7.8   9.0
                   5.0   5.4   4.6   5.0   7.8   4.2   7.0   7.4
                   1.0   4.2   4.2   9.0   6.6   5.4   9.8   5.8
    Column mean    2.6   4.6   5.8   7.0   6.2   4.6   8.2   7.4     Grand mean = 5.8

$SSBC = 3\,([2.6 - 5.8]^2 + [4.6 - 5.8]^2 + \cdots + [7.4 - 5.8]^2)$

= 3 (23.04)

= 69.12

14

Brand 1: $(1.8 - 2.6)^2 = .64$, $(5.0 - 2.6)^2 = 5.76$, $(1.0 - 2.6)^2 = 2.56$; sum = 8.96

Brand 2: $(4.2 - 4.6)^2 = .16$, $(5.4 - 4.6)^2 = .64$, $(4.2 - 4.6)^2 = .16$; sum = .96

• • • •

Brand 8: $(9.0 - 7.4)^2 = 2.56$, $(7.4 - 7.4)^2 = 0$, $(5.8 - 7.4)^2 = 2.56$; sum = 5.12

SSW = total of (8.96 + .96 + • • • + 5.12) = 46.72
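The hand calculation above can be reproduced with a short sketch. The variable names are illustrative; the data are the 24 battery lifetimes from the example.

```python
from statistics import mean

# Battery lifetimes (hours): one list per brand, three replications each.
columns = [
    [1.8, 5.0, 1.0], [4.2, 5.4, 4.2], [8.6, 4.6, 4.2], [7.0, 5.0, 9.0],
    [4.2, 7.8, 6.6], [4.2, 4.2, 5.4], [7.8, 7.0, 9.8], [9.0, 7.4, 5.8],
]
R = 3  # replications per brand

col_means = [mean(col) for col in columns]
grand = mean(col_means)

# Between-column sum of squares: R times the squared deviations of column means.
ssbc = R * sum((m - grand) ** 2 for m in col_means)

# Within-column sum of squares, kept per brand so the pieces can be inspected.
within = [sum((y - m) ** 2 for y in col) for col, m in zip(columns, col_means)]
ssw = sum(within)

print(round(ssbc, 2), round(ssw, 2))  # 69.12 46.72
print([round(w, 2) for w in within])
```

The per-brand list `within` starts with 8.96 and .96 and ends with 5.12, matching the three columns worked out by hand.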

15

ANOVA TABLE

    Source of Variability   SSQ      df                   M.S.
    BRAND                   69.12    7  (= 8 - 1)         9.87
    ERROR                   46.72    16 (= 2 (8))         2.92
    TOTAL                   115.84   23 (= (3 • 8) - 1)

16

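We can show:

$E(\mathrm{MSBC}) = \sigma^2 + \mathrm{VCOL}$, where $\mathrm{VCOL} = \frac{R}{C-1} \sum_{j} (\mu_j - \mu)^2$ is a MEASURE OF DIFFERENCES AMONG COLUMN MEANS

$E(\mathrm{MSW}) = \sigma^2$

(Assuming each $Y_{ij}$ has (constant) standard deviation, $\sigma$)

(More about assumptions, later)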

17

$E(\mathrm{MSBC}) = \sigma^2 + \mathrm{VCOL}$

$E(\mathrm{MSW}) = \sigma^2$

This suggests that

if MSBC / MSW > 1, there’s some evidence of non-zero VCOL, or “level of X affects Y”

if MSBC / MSW < 1, there’s no evidence that VCOL > 0, or that “level of X affects Y”
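A small simulation can make the two expected values plausible. This is a sketch under stated assumptions: normal errors with sigma = 1, a balanced layout with C = 4 and R = 3, and hypothetical true column means; the function names are illustrative.

```python
import random

def mean_squares(columns):
    """Return (MSBC, MSW) for a balanced one-factor layout."""
    c, r = len(columns), len(columns[0])
    col_means = [sum(col) / r for col in columns]
    grand = sum(col_means) / c
    ssbc = r * sum((m - grand) ** 2 for m in col_means)
    ssw = sum((y - m) ** 2 for col, m in zip(columns, col_means) for y in col)
    return ssbc / (c - 1), ssw / ((r - 1) * c)

def average_mean_squares(mus, r, sigma, n_sims, rng):
    """Average MSBC and MSW over simulated experiments with true means `mus`."""
    total_b = total_w = 0.0
    for _ in range(n_sims):
        data = [[rng.gauss(mu, sigma) for _ in range(r)] for mu in mus]
        msbc, msw = mean_squares(data)
        total_b += msbc
        total_w += msw
    return total_b / n_sims, total_w / n_sims

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Equal true means: VCOL = 0, so both averages should be near sigma^2 = 1.
null_msbc, null_msw = average_mean_squares([5.0] * 4, 3, 1.0, 4000, rng)
# Unequal true means: VCOL = (3 / 3) * (1 + 0 + 0 + 1) = 2, so MSBC should be near 3.
alt_msbc, alt_msw = average_mean_squares([4.0, 5.0, 5.0, 6.0], 3, 1.0, 4000, rng)
print(null_msbc, null_msw)
print(alt_msbc, alt_msw)
```

MSW stays near $\sigma^2$ in both cases, while MSBC is inflated only when the column means differ, which is why the ratio MSBC / MSW is the natural test statistic.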

18

With $H_0$: Level of X has no impact on Y

$H_1$: Level of X does have impact on Y,

we need MSBC / MSW >> 1 to reject $H_0$.

19

More Formally,

$H_0$: $\tau_1 = \tau_2 = \cdots = \tau_C = 0$

$H_1$: not all $\tau_j = 0$

OR

$H_0$: $\mu_1 = \mu_2 = \cdots = \mu_C$ (all column means are equal)

$H_1$: not all $\mu_j$ are EQUAL

20

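The probability law of

MSBC / MSW = “Fcalc”

is the F-distribution with (C - 1, (R - 1)C) degrees of freedom, assuming $H_0$ true.

[Figure: F density with the table value marked. On the plot, C = table (critical) value; this is not the number of columns C.]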

21

In our problem:

ANOVA TABLE

    Source of Variability   SSQ     df   M.S.   Fcalc
    BRAND                   69.12   7    9.87   9.87 / 2.92 = 3.38
    ERROR                   46.72   16   2.92

22

[Figure: F density with (7, 16) DF and upper-tail area $\alpha$ = .05]

Table value C = 2.66; Fcalc = 3.38 > 2.66

F table coming up

23

F-Table

24

Hence, at $\alpha$ = .05, reject $H_0$.

(i.e., Conclude that level of BRAND does have an impact on battery lifetime.)
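The steps from sums of squares to the decision can be sketched as follows. The table value 2.66 is taken from the slides rather than computed; the variable names are illustrative.

```python
ssbc, ssw = 69.12, 46.72   # sums of squares from the ANOVA table
C, R = 8, 3                # 8 brands, 3 replications per brand

df_between = C - 1         # 7
df_within = (R - 1) * C    # 16
msbc = ssbc / df_between
msw = ssw / df_within
f_calc = msbc / msw

f_table = 2.66  # table value for alpha = .05 with (7, 16) DF, as quoted in the slides

print(round(msbc, 2), round(msw, 2), round(f_calc, 2))  # 9.87 2.92 3.38
decision = "reject H0" if f_calc > f_table else "do not reject H0"
print(decision)  # reject H0
```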

25

ONE FACTOR ANOVA, Using EXCEL

    1.8   4.2   8.6   7     4.2   4.2   7.8   9
    5     5.4   4.6   5     7.8   4.2   7     7.4
    1     4.2   4.2   9     6.6   5.4   9.8   5.8

Anova: Single-Factor

Summary

    Groups     Count   Sum    Average   Variance
    Column 1   3       7.8    2.6       4.48
    Column 2   3       13.8   4.6       0.48
    Column 3   3       17.4   5.8       5.92
    Column 4   3       21     7         4
    Column 5   3       18.6   6.2       3.36
    Column 6   3       13.8   4.6       0.48
    Column 7   3       24.6   8.2       2.08
    Column 8   3       22.2   7.4       2.56

ANOVA

    Source of Variation   SS      df   MS        F        P-value   F crit
    Between Groups        69.12   7    9.87429   3.3816   0.02064   2.657
    Within Groups         46.72   16   2.92

    Total                 115.8   23

26

SPSS/MINITAB INPUT

VAR001 VAR002

1.8 1

5.0 1

1.0 1

4.2 2

5.4 2

4.2 2

. .

. .

. .

9.0 8

7.4 8

5.8 8

27

ONE FACTOR ANOVA, using SPSS

- - - - - O N E W A Y - - - - -

Variable Lifetime
By Variable Device

Analysis of Variance

    Source           D.F.   Sum of Squares   Mean Squares   F Ratio   F Prob.
    Between Groups   7      69.1200          9.8743         3.3816    .0206
    Within Groups    16     46.7200          2.9200
    Total            23     115.8400

28

ONE FACTOR ANOVA (MINITAB)

Analysis of Variance for life

Source DF SS MS F P

brand 7 69.12 9.87 3.38 0.021

Error 16 46.72 2.92

Total 23 115.84

MINITAB: STAT>>ANOVA>>ONE-WAY

29

[Figure: Dotplots of life by brand (group means are indicated by lines). x-axis: brand (1 to 8); y-axis: life (1 to 10).]

30

[Figure: Boxplots of life by brand (means are indicated by solid circles). x-axis: brand (1 to 8); y-axis: life (0 to 10).]

31

EXAMPLE: MORTAR

The tension bond strength of cement mortar is an important characteristic of the product. An engineer is interested in comparing the strength of a modified formulation, in which polymer latex emulsions have been added during mixing, to the strength of the unmodified mortar. The experimenter has collected 10 observations on strength for the modified formulation and another 10 observations for the unmodified formulation.

32

    Modified   Unmodified
    16.85      17.50
    16.40      17.63
    17.21      18.25
    16.35      18.00
    16.52      17.86
    17.04      17.75
    16.96      18.22
    17.15      17.90
    16.59      17.96
    16.57      18.15

33

One-way ANOVA: strength versus type (Minitab)

Analysis of Variance for strength

    Source   DF   SS       MS       F       P
    type     1    6.7048   6.7048   82.98   0.000
    Error    18   1.4544   0.0808
    Total    19   8.1592
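The Minitab figures can be cross-checked with a short sketch. The function `one_way_anova` is a hypothetical helper written for this illustration (it is not a Minitab or library routine), and it weights each group by its own size so it also covers unequal group sizes.

```python
from statistics import mean

modified = [16.85, 16.40, 17.21, 16.35, 16.52, 17.04, 16.96, 17.15, 16.59, 16.57]
unmodified = [17.50, 17.63, 18.25, 18.00, 17.86, 17.75, 18.22, 17.90, 17.96, 18.15]

def one_way_anova(groups):
    """Return (SS_between, SS_within, df_between, df_within, F)."""
    all_values = [y for g in groups for y in g]
    grand = mean(all_values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_ratio

ss_b, ss_w, df_b, df_w, f_ratio = one_way_anova([modified, unmodified])
print(round(ss_b, 4), round(ss_w, 4), df_b, df_w, round(f_ratio, 2))
# 6.7048 1.4544 1 18 82.98
```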

34

[Figure: Boxplots of strength by type (means are indicated by solid circles). x-axis: type (1, 2); y-axis: strength (16.5 to 18.5).]

35

ONE FACTOR ANOVA, using JMP

MVPC Survey Results

    Amesbury   Andover   Methuen   Salem
    66         55        56        64
    66         50        56        70
    66         51        57        62
    67         47        58        64
    70         57        61        66
    64         48        54        62
    71         52        62        67
    66         50        57        60
    71         48        61        68
    67         50        58        68
    63         48        54        66
    60         49        51        66
    66         52        57        61
    70         48        60        63
    69         48        59        67
    66         48        56        67
    70         51        61        70
    65         49        55        62
    71         46        62        62
    63         51        53        68
    69         54        59        70
    67         54        58        62
    64         49        54        63
    68         55        58        65
    65         47        55        68
    67         47        58        68
    65         53        55        64
    70         51        60        65
    68         50        58        69
    73         54        64        62

36

[Figure: Score By Location. x-axis: Location (Amesbury, Andover, Methuen, Salem); y-axis: Score (45 to 75).]

Oneway Anova

Summary of Fit

    RSquare                       0.841505
    RSquare Adj                   0.837406
    Root Mean Square Error        2.932527
    Mean of Response              60.09167
    Observations (or Sum Wgts)    120

Analysis of Variance

    Source    DF    Sum of Squares   Mean Square   F Ratio    Prob > F
    Model     3     5296.4250        1765.47       205.2947   < .0001
    Error     116   997.5667         8.60
    C Total   119   6293.9917        52.89

Means for Oneway Anova

    Level      Number   Mean      Std Error
    Amesbury   30       67.1000   0.53540
    Andover    30       50.4000   0.53540
    Methuen    30       57.5667   0.53540
    Salem      30       65.3000   0.53540

Std Error uses a pooled estimate of error variance

37

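Assumptions

Basically, the same as in Regression analysis:

MODEL: $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$

1.) the $\varepsilon_{ij}$ are indep. random variables (check: run order plot)

2.) Each $\varepsilon_{ij}$ is Normally Distributed, with $E(\varepsilon_{ij}) = 0$ for all i, j (check: normality plot)

3.) $\sigma^2(\varepsilon_{ij})$ = constant for all i, j (check: residual plot)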

38

Diagnosis: Normality

• The points on the normality plot must more or less follow a straight line to claim “normally distributed”.

• There are statistical tests to verify it formally.

• The ANOVA method we learn here is not sensitive to the normality assumption. That is, a mild departure from the normal distribution will not change our conclusions much.

Normality plot: normal scores vs. residuals
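The coordinates of a normality plot can be sketched for the mortar data as below. The plotting positions (rank - 0.375) / (n + 0.25) are an assumption made for this illustration; statistical packages differ in the exact formula they use, and the function name `normal_scores` is hypothetical.

```python
from statistics import NormalDist, mean

modified = [16.85, 16.40, 17.21, 16.35, 16.52, 17.04, 16.96, 17.15, 16.59, 16.57]
unmodified = [17.50, 17.63, 18.25, 18.00, 17.86, 17.75, 18.22, 17.90, 17.96, 18.15]

# Residuals: each observation minus its own group mean.
residuals = []
for group in (modified, unmodified):
    group_mean = mean(group)
    residuals.extend(y - group_mean for y in group)

def normal_scores(values):
    """Normal score of each value, from (rank - 0.375) / (n + 0.25) positions."""
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])  # indices, smallest value first
    scores = [0.0] * n
    for rank, k in enumerate(order, start=1):
        scores[k] = NormalDist().inv_cdf((rank - 0.375) / (n + 0.25))
    return scores

scores = normal_scores(residuals)
# Print the plot coordinates: residual on one axis, normal score on the other.
for res, z in sorted(zip(residuals, scores)):
    print(f"{res:7.3f} {z:7.3f}")
```

If the printed pairs fall close to a straight line when plotted, the normality assumption looks reasonable.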

39

[Figure, from Mortar data: Normal Probability Plot of the Residuals (response is strength). x-axis: Residual (-0.5 to 0.5); y-axis: Normal Score (-2 to 2).]

40

Diagnosis: Constant Variances

• The points on the residual plot must be more or less within a horizontal band to claim “constant variances”.

• There are statistical tests to verify it formally.

• The ANOVA method we learn here is not sensitive to the constant variances assumption. That is, slightly different variances within groups will not change our conclusions much.

Residual plot: fitted values vs. residuals

41

[Figure, from Mortar data: Residuals Versus the Fitted Values (response is strength). x-axis: Fitted Value (17.0 to 18.0); y-axis: Residual (-0.5 to 0.5).]

42

Diagnosis: Randomness/Independence

• The run order plot must show no “systematic” patterns to claim “randomness”.

• There are statistical tests to verify it formally.

• The ANOVA method is sensitive to the independence assumption. That is, even a small degree of dependence between data points can change our conclusions a lot.

Run order plot: order vs. residuals

43

[Figure, from Mortar data: Residuals Versus the Order of the Data (response is strength). x-axis: Observation Order (2 to 20); y-axis: Residual (-0.5 to 0.5).]

44

This assumes a “fixed model”: there is inherent interest in the specific levels of the factor under study. There’s no direct interest in extrapolating to other levels, so inference is limited to the levels that appear in the experiment. The experimenter selects the levels.

In a “random model”, the levels in the experiment are randomly selected from a population of such levels, and inference is to be made about the entire population of levels.

Then, besides assumptions 1 to 3, there is another assumption:

4) a) the $\tau_j$ are independent random variables which are normally distributed with constant variance

b) the $\tau_j$ and $\varepsilon_{ij}$ are independent

45

With these assumptions, the estimates ($\bar{Y}_{\cdot\cdot}$ and the $\bar{Y}_{\cdot j}$) are “maximum likelihood estimates” (a statistical notion which could be thought of as “efficiency” [“most likely value”]),

and, more directly relevant:

The “conventional” F- and t-tests are applicable (VALID) for a variety of hypothesis testing and confidence interval computations.

46

KRUSKAL - WALLIS TEST

(Non - Parametric Alternative)

$H_0$: The probability distributions are identical for each level of the factor

$H_1$: Not all the distributions are the same

47

Brand

A B C

32 32 28

30 32 21

30 26 15

29 26 15

26 22 14

23 20 14

20 19 14

19 16 11

18 14 9

12 14 8

BATTERY LIFETIME (hours)

(each column rank ordered, for simplicity)

Mean: 23.9 22.1 14.9 (here, irrelevant!!)

48

$H_0$: No difference in distribution among the three brands with respect to battery lifetime

$H_1$: At least one of the 3 brands differs in distribution from the others with respect to lifetime

49

Brand

A B C

32 (29) 32 (29) 28 (24)

30 (26.5) 32 (29) 21 (18)

30 (26.5) 26 (22) 15 (10.5)

29 (25) 26 (22) 15 (10.5)

26 (22) 22 (19) 14 (7)

23 (20) 20 (16.5) 14 (7)

20 (16.5) 19 (14.5) 14 (7)

19 (14.5) 16 (12) 11 (3)

18 (13) 14 (7) 9 (2)

12 (4) 14 (7) 8 (1)

T1 = 197 T2 = 178 T3 = 90

n1 = 10 n2 = 10 n3 = 10

Ranks

50

TEST STATISTIC:

$H = \frac{12}{N(N+1)} \sum_{j=1}^{K} \frac{T_j^2}{n_j} - 3(N+1)$

$n_j$ = # data values in column j

$N = \sum_{j=1}^{K} n_j$

K = # Columns (levels)

$T_j$ = SUM OF RANKS OF DATA IN COL j when all DATA COMBINED

(There is a slight adjustment in the formula as a function of the number of ties in rank.)

51

$H = \frac{12}{30(31)} \left[ \frac{197^2}{10} + \frac{178^2}{10} + \frac{90^2}{10} \right] - 3(31)$

= 8.41

(with adjustment for ties, we get 8.46)
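The ranking and the H statistic can be reproduced with a short sketch. The helper `midranks` and the variable names are illustrative. The tie correction shown is the standard one (dividing H by 1 - sum(t^3 - t) / (N^3 - N)); it is an assumption that this is the adjustment the slide refers to, and it should land close to the adjusted value quoted above.

```python
brand_a = [32, 30, 30, 29, 26, 23, 20, 19, 18, 12]
brand_b = [32, 32, 26, 26, 22, 20, 19, 16, 14, 14]
brand_c = [28, 21, 15, 15, 14, 14, 14, 11, 9, 8]
groups = [brand_a, brand_b, brand_c]

def midranks(values):
    """Map each distinct value to its average rank (1 = smallest) in `values`."""
    ordered = sorted(values)
    ranks = {}
    for v in set(ordered):
        first = ordered.index(v) + 1      # rank of the first occurrence
        count = ordered.count(v)          # number of tied values
        ranks[v] = first + (count - 1) / 2
    return ranks

pooled = [y for g in groups for y in g]
rank_of = midranks(pooled)
n_total = len(pooled)

rank_sums = [sum(rank_of[y] for y in g) for g in groups]
h = (12 / (n_total * (n_total + 1))
     * sum(t ** 2 / len(g) for t, g in zip(rank_sums, groups))
     - 3 * (n_total + 1))

# Standard tie correction (assumed): t runs over the sizes of the tied groups.
tie_sizes = [pooled.count(v) for v in set(pooled)]
correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n_total ** 3 - n_total)
h_adjusted = h / correction

print(rank_sums)    # [197.0, 178.0, 90.0]
print(round(h, 2))  # 8.41
print(h_adjusted)
```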

52

What do we do with H?

We can show that, under $H_0$, H is well approximated by a $\chi^2$ distribution with df = K - 1.

($\chi^2_{df} = df \cdot F_{df,\infty}$)

Here, df = 2, and at $\alpha$ = .05, the critical value = 5.99.

H = 8.41 > 5.99

Reject $H_0$; conclude that the lifetime distribution is NOT the same for all 3 BRANDS.
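For df = 2 the chi-square upper-tail probability has the closed form exp(-x / 2), so the critical value and an approximate p-value can be sketched without tables. The function name `chi2_sf_df2` is a hypothetical helper, and the shortcut applies only to df = 2.

```python
import math

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2)

alpha = 0.05
critical = -2 * math.log(alpha)   # solves exp(-x / 2) = alpha
h = 8.41                          # Kruskal-Wallis statistic from the slides
p_value = chi2_sf_df2(h)

print(round(critical, 2))  # 5.99
print(round(p_value, 4))   # 0.0149
```

Since the approximate p-value is below .05, this agrees with the table-based decision to reject.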
