1 Comparing multiple tests for separating populations Juliet Popper Shaffer Paper presented at the Fifth International Conference on Multiple Comparisons,

1

Comparing multiple tests for separating populations

Juliet Popper Shaffer

Paper presented at the Fifth International Conference on Multiple Comparisons, Vienna, July 10, 2007

2

Outline

• Background

• Original separation concepts

• Revised separation concepts

• Planned comparisons of different FDR- and FWER-controlling methods

• Selected examples with FDR-controlling methods

• Summary and description of further planned work

3

Background

• I began thinking about this problem in the early 1970s, when I was approached by a faculty member with a rather common situation.

• He had compared the means of three treatments in an analysis of variance followed by pairwise tests. He found treatments 1 and 3 significantly different, but neither 1 and 2 nor 2 and 3 significantly different.

4

• I pointed out that this was a rather common outcome. His response was

• “What am I supposed to do with that?”

• A good question: No clear interpretation.

• The pattern of results of pairwise tests is important.

5

• Consider four treatments.

• Suppose the outcome of the pairwise tests is:

• (a) 1-4 significant, 1-3 significant, 2-4 significant.

• (b) 1-4 significant, 1-3 significant, 1-2 significant.

• (b) is clearly interpretable, (a) is not.

6

(Underlines join treatments not declared significantly different.)

(a)  1   2   3   4
     -----
         -----
             -----

(b)  1   2   3   4
         ---------

7

Original separation concepts

• I developed a measure of the interpretability of the outcome of pairwise tests. I published its description, together with a comparison of FWER-controlling methods (including a new one for comparing three treatments), as

• “Complexity: An interpretability criterion for multiple comparisons” (JASA, 1981).

8

• A pattern was defined as simple if it consisted of distinct groupings.

• The measure was the number of additional rejections necessary to make the pattern simple. For 3 treatments, this is a reasonable measure:

9

3 treatments: either no rejections or at least two rejections are necessary to achieve a simple pattern

Complexity =

• 2 if the overall test is significant but no pairwise differences are significant

• 1 if one pairwise difference is significant

• 0 if two or three pairwise differences are significant, or nothing is significant.

I.e., given that overall equality is rejected and 1-3 would be rejected before 1-2 or 2-3, the simple patterns are:

• 1 2 3 with 1 and 2 underlined together (2 rejections)

• 1 2 3 with 2 and 3 underlined together (2 rejections)

• 1 2 3 with no underlining (3 rejections)
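The three-treatment complexity values above can be written as a small lookup. This is only a sketch of the rule as stated on this slide; the function name and argument names are hypothetical, and it assumes a protected procedure in which no pairwise difference is declared significant unless the overall test is significant.

```python
def complexity_three(overall_significant, n_pairwise_rejections):
    """Complexity for 3 treatments: additional rejections needed for a simple pattern.

    Assumes (hypothetically) that pairwise rejections only occur when the
    overall test is significant.
    """
    if not 0 <= n_pairwise_rejections <= 3:
        raise ValueError("with 3 treatments there are at most 3 pairwise tests")
    if not overall_significant:
        return 0  # nothing is significant: the pattern is already simple
    # 0 rejections -> 2 more needed, 1 rejection -> 1 more, 2 or 3 -> none
    return max(0, 2 - n_pairwise_rejections)
```

For example, the faculty member's outcome (overall test significant, only 1-3 significant) has complexity 1.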

10

• The results were interesting, and the F test followed by individual t tests resulted in greater average simplicity than the range test, when both controlled FWER.

• The study was limited to three treatments.

11

• For more than 3 treatments, there is a multiplicity of patterns (e.g. 15 for four groups).

• It is also less clear that the measure used is best with more than three groups, and average complexity is certainly harder to interpret in that case.

• Furthermore, it seems desirable to distinguish true patterns from false patterns. If a pattern is false, a complex pattern is arguably more desirable than a simple pattern.

12

• Since that time, the issue has been raised occasionally by others, so I decided to try again with a simpler way of dividing patterns.

• Also, with new concepts of error control, especially FDR, it seemed interesting to see whether clearer patterns would emerge with FDR-controlling methods.

13

Revised separation concepts

• I’ll discuss patterns of treatment means, although this can be generalized to other parameters.

• Following Hartley (1955), I’ll call sets of populations with equal means (usually assumed identical) clusters.

14

• True pattern: a set of K clusters, of sizes n1, n2, …, nK, of the n true means. (If exact equality is considered impossible, think of virtual equality.)

• Observed outcome: Set of rejections of subset equality hypotheses.

• Outcome clusters: Subsets of sample means declared significantly different from all other means, with no subclusters within them.

15

• True outcome clusters: Outcome clusters in which all true means within the cluster are greater than all true means below it and smaller than all true means above it.

• False outcome clusters: Outcome clusters that are not true.

• If there is no separation into clusters, the number of outcome clusters is defined as zero.
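One way to make these definitions concrete is sketched below. It rests on an interpretation that is not stated on the slides: treatments are indexed 0..n-1 in increasing order of sample mean, and outcome clusters are read as the connected components of the graph whose edges are the pairs NOT declared different. The function names are hypothetical.

```python
def outcome_clusters(n, rejected):
    """Outcome clusters as connected components of the 'not rejected' graph.

    n        : number of treatments, indexed 0..n-1 by increasing sample mean
    rejected : iterable of pairs (i, j) declared significantly different
    Returns a list of clusters (sorted index lists), or [] if there is no
    separation, matching the convention that the count is then zero.
    """
    rejected = {frozenset(pair) for pair in rejected}
    unseen = set(range(n))
    clusters = []
    while unseen:
        start = min(unseen)
        unseen.remove(start)
        component, stack = {start}, [start]
        while stack:
            i = stack.pop()
            # items still unvisited that are NOT separated from i
            linked = {j for j in unseen if frozenset((i, j)) not in rejected}
            unseen -= linked
            component |= linked
            stack.extend(linked)
        clusters.append(sorted(component))
    return clusters if len(clusters) > 1 else []


def is_true_cluster(cluster, true_means):
    """True outcome cluster: every true mean inside exceeds all true means
    below the cluster and is smaller than all true means above it.
    A cluster that is not contiguous in sample order is treated as false
    (an assumption of this sketch)."""
    lo, hi = min(cluster), max(cluster)
    t_lo = min(true_means[i] for i in cluster)
    t_hi = max(true_means[i] for i in cluster)
    for j in range(len(true_means)):
        if j in cluster:
            continue
        if j < lo and not true_means[j] < t_lo:
            return False
        if j > hi and not true_means[j] > t_hi:
            return False
        if lo < j < hi:
            return False
    return True
```

With four treatments, outcome (b) from the background slides (1-4, 1-3, 1-2 significant) gives two outcome clusters, while outcome (a) gives none. Note that a rejection inside a component, which does not split it, leaves the cluster intact.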

16

• Note that there may be rejections within a cluster, as long as they don’t separate it into subclusters.

• Pure true outcome clusters: True outcome clusters with no false rejections.

17

• Note that there can be true rejections within a pure true outcome cluster if it contains true subclusters as long as there are no false rejections within it.

• A false rejection is either rejecting equality when a pair is equal (Type I error), or rejecting equality when a pair is unequal but deciding the difference is in the wrong direction (Type III error).
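The two kinds of false rejection can be checked mechanically once the true means are known, as in a simulation. The sketch below assumes treatments are indexed in increasing order of sample mean, so a rejected pair (i, j) with i < j is read as the decision that mean i is below mean j. The function name is hypothetical, and the check covers only the "no false rejections within it" condition, not whether the cluster is a true outcome cluster.

```python
def false_rejections_within(cluster, rejected, true_means):
    """List the false rejections among pairs that lie inside one cluster.

    cluster    : indices of the cluster (treatments sorted by sample mean)
    rejected   : iterable of rejected pairs (i, j)
    true_means : true means, in the same index order
    """
    members = set(cluster)
    found = []
    for a, b in rejected:
        i, j = min(a, b), max(a, b)
        if i not in members or j not in members:
            continue  # only rejections within the cluster are examined
        if true_means[i] == true_means[j]:
            found.append((i, j, "Type I"))    # equal pair declared different
        elif true_means[i] > true_means[j]:
            found.append((i, j, "Type III"))  # unequal pair, wrong direction
    return found
```

A true outcome cluster for which this list is empty would count as pure under the definition above.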

18

• False cluster rate: Expected value of the ratio of false observed clusters to total observed clusters, defined as zero if there are no observed clusters.

• Various measures of cluster power.
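In a simulation, the false cluster rate defined above could be estimated by averaging the per-replication ratio, with the ratio set to zero when no clusters are observed. This is a minimal sketch; the function name and the input layout (one pair of counts per replication) are assumptions.

```python
def false_cluster_rate(replications):
    """Monte Carlo estimate of the false cluster rate.

    replications : list of (n_false_clusters, n_total_clusters) pairs,
                   one per simulated data set
    """
    ratios = [
        n_false / n_total if n_total > 0 else 0.0  # zero if no observed clusters
        for n_false, n_total in replications
    ]
    return sum(ratios) / len(ratios)
```

For instance, four replications with counts (0, 2), (1, 2), (0, 0) and (2, 2) give ratios 0, 0.5, 0 and 1, so the estimate is 0.375.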

19

Comparisons of different FDR- and FWER-controlling methods

• Note that it isn’t clear that more liberal methods will produce more true outcome clusters, more pure true outcome clusters, or a smaller false cluster rate.

• With Rhonda Kowalchuk and Harvey Keselman, I am conducting a large study of cluster measures, as well as standard error and power measures, with several methods, all at nominal level .05 for either FWER or FDR control.

20

True mean configurations

• We’re looking at true configurations in which one mean is different from all K-1 others, and at various other cluster configurations of 3, 4, 8 and possibly 12 means. The work is still in progress.

21

Methods

• FWER-controlling:

    • Tukey-Welsch multiple range test

    • Modified Peritz multiple range test

• FDR-controlling:

    • Benjamini-Hochberg original step-up method (BH)

    • Yekutieli-revised BH method with proven FDR control

    • Newman-Keuls method with empirical evidence and limited proofs of FDR control (NK)
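For reference, the original BH step-up rule can be sketched in a few lines: order the m p-values, find the largest rank k whose p-value is at most k·q/m, and reject the k smallest. The function name is hypothetical, and applying it to the set of pairwise p-values is the use described in this talk, not something the code enforces.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure at nominal FDR level q.

    Returns a list of booleans, True where the hypothesis is rejected,
    in the same order as p_values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank  # keep the LARGEST rank that meets its threshold
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected
```

Because the rule is step-up, a small p-value can be rejected even if it misses its own threshold, provided a larger p-value meets a later one.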

22

• The Newman-Keuls method (NK) is little used these days. It is a multiple range method.

• Let M1 < M2 < … < Mn be the sample means of populations P1, P2, …, Pn with true means μ1, μ2, …, μn, respectively.

• For simplicity I’ll describe the method assuming the populations are identical except for possible location shift.

23

• Let $r_{j-i+1,\alpha}$ be the α-critical value of the range of $j - i + 1$ sample means. Then

• $H_{ij}: \mu_i = \mu_j$ is rejected if

$M_{j'} - M_{i'} > r_{j'-i'+1,\alpha}$ for all $j' \ge j$, $i' \le i$.

• In other words, it is identical in form to the Tukey-Welsch multiple range method, but every subrange is tested for significance at the same level α.
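The rejection rule can be sketched directly from its definition: a pair is rejected only if every range that contains it exceeds the critical value for that range size. In the sketch below the critical values are supplied by the caller as a function `crit(k)`; that argument, the function name, and the toy numbers in the usage note are assumptions for illustration. In practice the values would come from the studentized range distribution, scaled by the standard error.

```python
def range_test_rejections(sorted_means, crit):
    """Multiple range test in the form described above.

    sorted_means : sample means in increasing order
    crit         : function k -> critical value for the range of k means
                   (for NK, every k uses the same level alpha)
    Returns the rejected pairs (i, j), 0-indexed with i < j.
    """
    m = list(sorted_means)
    if any(a > b for a, b in zip(m, m[1:])):
        raise ValueError("means must be sorted in increasing order")
    n = len(m)
    rejected = []
    for i in range(n):
        for j in range(i + 1, n):
            # every enclosing range (i' <= i, j' >= j) must be significant
            if all(
                m[jp] - m[ip] > crit(jp - ip + 1)
                for ip in range(i + 1)
                for jp in range(j, n)
            ):
                rejected.append((i, j))
    return rejected
```

With made-up critical values 1.0, 1.5 and 2.0 for ranges of 2, 3 and 4 means, and sorted means 0, 0.5, 1.2, 3.0, only the pairs involving the largest mean are rejected. Since the Tukey-Welsch method has the same form, a different `crit` with size-dependent levels would give that method instead.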

24

BH and NK

• I’ll present some comparisons of these two FDR-controlling methods.

• Significant pairwise comparisons are ordered differently in these, since BH is based on individual pairwise p-values, and NK is a multiple-range-based method. This makes the comparison of cluster outcomes especially interesting.

25

• BH: The FWER increases with the number of populations being compared.

• NK: In addition to apparent FDR control, the NK has the additional property that the FWER is controlled at the nominal level α within each cluster. Thus either method can have the larger FWER, depending on the number of populations and the number of clusters.

26

True clusters (K-1)(1)

• BH apparently controls FDR according to simulation results. NK controls FDR, since it controls FWER in this case.

• With one true outcome cluster, it must be a pure true cluster. With two true outcome clusters, there may be one or two pure true clusters.

27

True clusters (K-1)(1)

Simulation results indicate that there are more true outcome clusters and pure true outcome clusters with NK than with BH through most of the range, and the difference is greater with pure true outcome clusters. (When there are 1 or 2 means in each cluster, every true outcome cluster is a pure true outcome cluster.)

28

29

30

31

Two clusters, more than 1 mean in each

• The following slides show results for clusters (2)(2) and (2)(4).

32

33

34

35

False cluster rate

• The false cluster rate seems to be generally higher for NK than for BH, and in fact can get higher than might be desired for both. The worst case is that in which there are two means in each cluster, since then one Type I error may result in two false clusters, while that can’t happen with more than two means in a cluster.

36

37

38

39

40

41

Summary

• Gave the background for an interest in separating populations into clusters and previous ways of formulating the problem.

• Described new measures of population separation.

• Compared Newman-Keuls and Benjamini-Hochberg methods on these measures in two-cluster examples.

42

Further work

• More combinations of numbers of clusters and numbers of means within clusters will be examined.

• FWER-controlling methods will be compared among themselves and with FDR-controlling methods.

• F-type measures will be added.

• Nonparametric versions of the various methods will be examined.

• Proofs of properties will be extended if possible.
