
CH4. Multiple Comparison Procedures

Michigan Technological University

Department of Mathematical Sciences

Shurong Fang

• In population-based association studies, we generally aim to test for the presence of associations between the trait and each of multiple genotypes across several SNPs and gene loci.

• However, testing multiple hypotheses can result in an inflation of the error rate.

• Measures of error (Section 4.1)

• Methods for adjusting for multiple testing (Section 4.2)

• Resampling-based approaches (Section 4.3)

4.1 Measures of error

• Much of the literature on methods for adjusting for multiple comparisons describes controlling one of two error rates:

• the family-wise error rate (FWER) and

• the false discovery rate (FDR).

Family-wise error rate

• FWER is the probability of making at least one type-I error.

• FWER under the complete null (FWEC) is the probability that at least one type-I error occurs given that all nulls are true.

• FWER under a partial null (FWEP) is the probability that at least one type-I error occurs given that a specified subset of the nulls is true.

• Weak control of FWER implies FWEC ≤ α.

• Strong control of FWER implies FWEP ≤ α for all subsets of null hypotheses.

[Eg] Consider m = 2 null hypotheses: H0¹, H0². There are a total of four partial nulls: both H0¹ and H0² true; only H0¹ true; only H0² true; neither true.

• Strong control of the FWER means that FWEP ≤ α for all four of these configurations of true nulls.

False discovery rate

• FDR is the expected proportion of null hypotheses that are true among those that are declared significant.

Relationship between FDR and FWER

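• Let R be the number of null hypotheses rejected and V the number of those rejected nulls that are in fact true, so that FDR = E[V/R] (with V/R taken as 0 when R = 0) and FWER = Pr(V ≥ 1).

• If all null hypotheses are true, then every rejection is a false discovery, V = R, and the FDR is equal to the FWER.

• If not all null hypotheses are true, then V ≤ R, so V/R is at most 1 and equals 0 whenever V = 0.

• In general, FDR ≤ FWER.

• This implies that any approach that controls the FWER will also control the FDR.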
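The inequality can be written out in one line. This is a sketch that assumes the usual notation (R rejected nulls, V of them true) and the convention that V/R is 0 when R = 0; the indicator notation is introduced here for illustration:

```latex
\mathrm{FDR}
  = E\!\left[\frac{V}{R}\,\mathbf{1}(R > 0)\right]
  \le E\!\left[\mathbf{1}(V \ge 1)\right]
  = \Pr(V \ge 1)
  = \mathrm{FWER}
```

The middle step holds because V/R never exceeds 1 and is 0 whenever V = 0.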

4.2 Single-step and step-down adjustments

(1) Single-step adjustment

• A single criterion is used to assess the significance of all test statistics or corresponding p-values.

(2) Step-down adjustment

• It involves ordering test statistics or p-values and then using a potentially different criterion for each of the ordered values.

Bonferroni adjustment (Single-step)

• Consider m hypotheses $H_0^1, \ldots, H_0^m$, where each test is controlled at a level α:

$\Pr(\text{Type-I error}_i) = \Pr(\text{reject } H_0^i \mid H_0^i \text{ is true}) \le \alpha$

• Assume that the m tests are independent. Then

$\mathrm{FWEC} \le 1 - (1 - \alpha)^m$

• Consider a level of 5%.

• If m = 1, the FWEC is controlled at a level of 5%.

• If m = 10, the FWEC is controlled only at a level of $1 - (1 - 0.05)^{10} = 0.401$, or 40.1%. This means that although we are controlling each of the ten individual tests at a level of 5%, the overall error may be as great as 40.1%.

• The Bonferroni adjustment for multiple comparisons is a single-step procedure that involves using $\alpha' = \alpha/m$ as the level for each individual test.

• If m = 10 and we want to control the FWEC at an overall level of 5%, then let $\alpha' = 0.05/10 = 0.005$. Now, $\mathrm{FWEC} \le 1 - (1 - 0.005)^{10} = 0.049$.

• If we control each of the m tests at level $\alpha' = \alpha/m$, then our overall FWEC will be controlled at a level of α.
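The Bonferroni arithmetic above can be checked with a few lines of Python. This is a minimal sketch under the independence assumption stated on the slide; the function names are made up for illustration.

```python
def fwec_bound(alpha, m):
    """Upper bound on the FWEC for m independent tests, each run at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_level(alpha, m):
    """Per-test level alpha' = alpha / m used by the Bonferroni adjustment."""
    return alpha / m


# Unadjusted: ten tests, each at the 5% level.
unadjusted_bound = fwec_bound(0.05, 10)

# Adjusted: ten tests, each at alpha' = 0.05 / 10 = 0.005.
adjusted_bound = fwec_bound(bonferroni_level(0.05, 10), 10)

print(round(unadjusted_bound, 3), round(adjusted_bound, 3))
```

The first number reproduces the 40.1% figure and the second the 0.049 bound quoted on the slide.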

Bonferroni adjustment is quite conservative:

(1) it assumes all tests are independent, which is generally not the case, and

(2) it is based on FWER control.

Tukey and Scheffe tests (single-step)

• Tukey's studentized range test is useful for the comparison of means between groups (Tukey, 1977).

• Suppose an ANOVA model with m groups and the overall null hypothesis H0: α1 = α2 = … = αm.

• If the F-test of this hypothesis is significant, indicating a departure from the null, then interest may lie in the pairwise hypotheses H0: αi = αj.

• Tukey's honestly significantly different (HSD) test is based on the studentized range distribution.

• This HSD statistic is very similar to the statistic we construct in performing a two-sample t-test for the comparison of two independent means.

• Specifically, ts = sqrt(2)*t, where t is a two-sample t-statistic based on equal sample sizes.
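The relation ts = sqrt(2)*t can be verified numerically. The sketch below assumes only two groups of equal size, so that the ANOVA mean squared error reduces to the pooled two-sample variance; the data values are made up for illustration.

```python
import math
from statistics import mean, variance

# Made-up trait values for two groups of equal size n.
group_a = [52.1, 48.3, 55.0, 50.7, 49.9]
group_b = [45.2, 47.8, 44.1, 46.5, 48.0]
n = len(group_a)

# With equal group sizes the pooled variance is the average of the two sample variances.
pooled_var = (variance(group_a) + variance(group_b)) / 2
diff = mean(group_a) - mean(group_b)

# Two-sample t-statistic: the difference divided by sqrt(s^2 * (1/n + 1/n)).
t_stat = diff / math.sqrt(pooled_var * 2 / n)

# Studentized range (HSD) statistic: the difference divided by sqrt(s^2 / n).
hsd_stat = diff / math.sqrt(pooled_var / n)

print(hsd_stat, math.sqrt(2) * t_stat)
```

The two printed values agree. With more than two groups the same relation holds once the pooled variance is replaced by the mean squared error from the full ANOVA.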

HSD and t-test

The adjustment for the number of tests

• As m, the number of group means being compared (the first parameter of the studentized range distribution), increases, the critical value $q_{m;\,mn-m}$ also increases, and it becomes harder to reject the null hypothesis.

Unbalanced data

• Suppose our trait is the age of onset of breast cancer and we have a single biallelic candidate SNP taking on the values A1A1, A1A2 and A2A2.

• Let u1, u2 and u3 be respectively the population-level mean age of onset of breast cancer for each genotype.

• H0: u1 = u2; H0: u1 = u3; H0: u2 = u3

• In this setting, the sample sizes could be unequal across genotypes, and thus an extension of Tukey's method is required.

Tukey-Kramer method

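• The Tukey-Kramer method involves simply replacing n with the harmonic mean of the two sample sizes, given by $n_{ij} = 2 / (1/n_i + 1/n_j) = 2 n_i n_j / (n_i + n_j)$, where $n_i$ and $n_j$ are the sizes of the two groups being compared.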
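As a small illustration of the harmonic-mean replacement, the sketch below uses hypothetical genotype counts (not taken from any data set) and compares a direct formula with Python's `statistics.harmonic_mean`.

```python
from statistics import harmonic_mean

# Hypothetical numbers of individuals with each genotype.
genotype_counts = {"A1A1": 120, "A1A2": 60, "A2A2": 15}


def tukey_kramer_n(n_i, n_j):
    """Harmonic mean of two group sizes, used in place of the common n."""
    return 2.0 / (1.0 / n_i + 1.0 / n_j)


n_common_vs_rare = tukey_kramer_n(genotype_counts["A1A1"], genotype_counts["A2A2"])
print(n_common_vs_rare)
print(harmonic_mean([genotype_counts["A1A1"], genotype_counts["A2A2"]]))
```

The harmonic mean is pulled toward the smaller group, so comparisons involving a rare genotype are treated as having a small effective sample size.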

Tukey’s single-step method using R

• ptukey(q, nmeans, df, lower.tail = TRUE, …)

• qtukey(p, nmeans, df, lower.tail = TRUE, …)

• TukeyHSD(aov, "factor", conf.level = 0.95, …): creates a set of confidence intervals on the differences between the means of the levels of a factor with the specified family-wise probability of coverage.

• TukeyHSD( ) and plot( ) will not work with a MANOVA (multivariate analysis of variance) fit.

Scheffe's method

• Tukey's method provides an adjustment for testing for differences between all pairs of means.

• Scheffe's method involves testing a larger set of hypotheses that includes all contrasts of the factor-level means.

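• A contrast in the one-way ANOVA setting is defined as a linear combination of the means, $\sum_{i=1}^m \lambda_i \mu_i$, such that $\sum_{i=1}^m \lambda_i = 0$.

• For example, the pairwise hypothesis H0: μi = μj can be written as the contrast H0: μi − μj = 0.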

• We let L be the set of all linear contrasts of the factor means. Scheffe's method is an adjustment approach that controls the FWER when we are interested in testing whether each element of L is equal to zero.

[Eg]

F-test

• In order to construct an F-test statistic for testing a single contrast H0: λ’μ=0, we begin by defining the vector ρ as

Scheffe's method

• In this case, the numerator df are set equal to m − 1, whereas in the usual setting for testing a single contrast they are set equal to 1.

• Consider m = 5 groups, each of size n = 20.

• H0: μ1 = μ2

• Without a multiple comparison adjustment: F = 4.5 and F1;40-1 = 4.08, so we reject H0.

• With the adjustment: Fs = 4.5/(5 − 1) = 1.125 and F4;100-5 = 2.47, so we fail to reject H0.
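The arithmetic of this example can be laid out as follows. The sketch assumes that the Scheffe statistic is the single-contrast F divided by m − 1, which is consistent with the numbers on the slide; the two critical values are copied from the slide rather than computed.

```python
# Values taken from the worked example.
m, n = 5, 20
f_single_contrast = 4.5

# Scheffe statistic: divide by the numerator df m - 1 (assumed form).
f_scheffe = f_single_contrast / (m - 1)

# Degrees of freedom for the Scheffe comparison: (m - 1, m*n - m).
df_scheffe = (m - 1, m * n - m)

# Critical values quoted on the slide (not computed here).
critical_unadjusted = 4.08
critical_scheffe = 2.47

reject_unadjusted = f_single_contrast > critical_unadjusted
reject_scheffe = f_scheffe > critical_scheffe

print(f_scheffe, df_scheffe, reject_unadjusted, reject_scheffe)
```

The same observed difference is significant as a single planned comparison but not once it is treated as one of all possible contrasts.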

False discovery rate control (FDR)

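• To control FDR, we begin with the Benjamini and Hochberg (B-H) adjustment, a stepwise adjustment that applies a different criterion to each ordered p-value (strictly a step-up procedure: it rejects every hypothesis up to the largest ordered p-value that meets its criterion).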

• Consider m independent null hypotheses given by $H_0^1, \ldots, H_0^m$, with resulting p-values given by $p_1, \ldots, p_m$.

• We want to control the false discovery rate at a level q.

Example

• We test for an association between each of 10 SNPs and the presence of disease.

• Assume that each SNP is in a separate gene and our tests are independent.

• For each SNP i = 1, …, 10, we construct a 2 by 3 contingency table and calculate a chi-squared statistic corresponding to H0: ORi = 1.

• The ordered p-values are given by 0.001, 0.012, 0.014, 0.122, 0.245, 0.320, 0.550, 0.776, 0.840, 0.995.

• The Bonferroni adjustment would lead us to use the adjusted significance level of 0.05/10 = 0.005. Then, we would reject only the 1st H0.

• Using the B-H method, we would compare the ith ordered p-value to αi = 0.05(i/10). The largest i for which the ordered p-value falls at or below αi is i = 3 (0.014 ≤ 0.015), so we would reject the first three H0, even though the 2nd p-value (0.012) exceeds α2 = 0.010.
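The comparison in this example can be reproduced with a short sketch. It assumes the standard step-up reading of the B-H rule (reject everything up to the largest ordered p-value that meets its threshold); the function names are chosen here for illustration.

```python
def bh_rejection_count(pvalues, q):
    """Number of hypotheses rejected by the B-H rule at FDR level q."""
    m = len(pvalues)
    largest_passing_rank = 0
    for rank, p in enumerate(sorted(pvalues), start=1):
        # Compare the rank-th smallest p-value with its threshold q * rank / m.
        if p <= q * rank / m:
            largest_passing_rank = rank
    return largest_passing_rank


def bonferroni_rejection_count(pvalues, alpha):
    """Number of hypotheses rejected when each test is run at level alpha / m."""
    m = len(pvalues)
    return sum(p <= alpha / m for p in pvalues)


snp_pvalues = [0.001, 0.012, 0.014, 0.122, 0.245,
               0.320, 0.550, 0.776, 0.840, 0.995]

print(bonferroni_rejection_count(snp_pvalues, 0.05))
print(bh_rejection_count(snp_pvalues, 0.05))
```

The loop keeps scanning after a p-value fails its threshold, which is why the second p-value is still rejected once the third one passes.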

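An equivalent way of defining rejection criteria is through adjusted p-values:

(1) calculate adjusted p-values $p_{(i)}^{adj} = \min(m\,p_{(i)}/i,\ 1)$ for the ordered p-values $p_{(1)} \le \cdots \le p_{(m)}$;

(2) update these p-values to ensure monotonicity, replacing each by $\min_{j \ge i} p_{(j)}^{adj}$.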
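The two steps can be sketched as below, assuming the usual form of the B-H adjusted p-value (the ordered p-value times m over its rank, followed by a running minimum taken from the largest p-value downward). The ten p-values are those of the SNP example.

```python
def bh_adjusted_pvalues(pvalues):
    """B-H adjusted p-values, returned in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the running
    # minimum enforces monotonicity (step 2) while step 1 is computed.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


example_pvalues = [0.001, 0.012, 0.014, 0.122, 0.245,
                   0.320, 0.550, 0.776, 0.840, 0.995]
example_adjusted = bh_adjusted_pvalues(example_pvalues)
print([round(p, 4) for p in example_adjusted])
```

Rejecting every hypothesis whose adjusted p-value is at most 0.05 gives the same three rejections as comparing ordered p-values with 0.05(i/10).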

• The B-H procedure for controlling the FDR assumes independence of the test statistics corresponding to the true null hypotheses.

• The B-H procedure will also control the FDR if the test statistics corresponding to the true null hypotheses are positively regression dependent (PRD).

• Benjamini and Yekutieli (2001) also propose an extension of the B-H approach that controls the FDR in settings for which PRD does not hold.

• This extension, the Benjamini and Yekutieli (B-Y) adjustment, is to simply replace q with $q / \sum_{i=1}^m (1/i)$.
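A minimal sketch of the B-Y replacement follows, assuming the usual form in which q is divided by the harmonic sum 1 + 1/2 + … + 1/m and the same step-up comparison is then applied. It reuses the ten p-values from the SNP example.

```python
def by_level(q, m):
    """FDR level after the B-Y correction: q divided by sum_{i=1}^{m} 1/i."""
    return q / sum(1.0 / i for i in range(1, m + 1))


def step_up_rejection_count(pvalues, q):
    """Largest rank whose ordered p-value is at or below q * rank / m."""
    m = len(pvalues)
    count = 0
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p <= q * rank / m:
            count = rank
    return count


ordered_pvalues = [0.001, 0.012, 0.014, 0.122, 0.245,
                   0.320, 0.550, 0.776, 0.840, 0.995]

q_by = by_level(0.05, len(ordered_pvalues))
rejections_bh = step_up_rejection_count(ordered_pvalues, 0.05)
rejections_by = step_up_rejection_count(ordered_pvalues, q_by)
print(round(q_by, 4), rejections_bh, rejections_by)
```

The price of dropping the dependence assumption is visible here: the corrected level is roughly a third of q for m = 10, and fewer hypotheses are rejected.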

The q-value

• The q-value is an alternative measure of significance based on the FDR concept that was recently proposed for GWAS.

positive false discovery rate (pFDR)

• The pFDR, defined as E[V/R | R > 0], cannot be controlled in a traditional sense since it is identically equal to 1 if the proportion of true H0, given by m0/m, is equal to 1. That is, we cannot guarantee that pFDR ≤ α for α < 1.

q-value

• The q-value is based on an alternative paradigm that eliminates this concern.

• Specifically, rather than fixing an error rate and then estimating a significance threshold that maintains this rate on average, we fix the significance threshold and then estimate the rate over that threshold.

• Significance Analysis of Microarrays (SAM): fix rejection regions and then estimate the corresponding FDRs.

p-value and q-value

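• We can write the p-value corresponding to an observed test statistic T = t as

$p\text{-value}(t) = \min_{\{\Gamma_\alpha :\, t \in \Gamma_\alpha\}} \Pr(T \in \Gamma_\alpha \mid H_0 \text{ is true})$

where {Γα} is a set of nested rejection regions.

• The q-value is defined similarly as

$q\text{-value}(t) = \inf_{\{\Gamma_\alpha :\, t \in \Gamma_\alpha\}} \mathrm{pFDR}(\Gamma_\alpha)$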

p-value and q-value

• In the context of SAM, the q-value is computed as the FDR for the smallest estimated rejection region for which the gene under consideration is called significant.

• Intuitively, the q-value is the minimum pFDR that can occur over the set of nested rejection regions when we reject our test statistic.

• This parallels the definition of the p-value as the minimum type-I error rate that can occur over all nested rejection regions when we reject our test statistic.

q-value

• Similar to FDR control, use of the q-value is most appropriate when the number of tests is large. In this case, the probability that at least one test is declared significant, given by Pr(R > 0), is close to 1.

• If we set the tuning parameter λ (a quantity that informs our estimate of the proportion of true null hypotheses) to 0, then the q-value results in the same adjusted p-values as the B-H adjusted FDR.

• However, this is a conservative estimate of the q-value, and optimizing the choice of λ is tenable.
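The role of λ can be illustrated with a simplified q-value estimator. This is a sketch under assumptions, not the implementation of any particular package: the proportion of true nulls is estimated as the share of p-values above a fixed λ divided by (1 − λ), capped at 1, and the result rescales the B-H style adjusted p-values.

```python
def estimate_qvalues(pvalues, lam=0.0):
    """Simplified q-value estimates for a fixed tuning parameter lam in [0, 1)."""
    m = len(pvalues)
    # Estimated proportion of true nulls; equals 1 when lam = 0 and all p > 0.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[idx] / rank)
        qvalues[idx] = running_min
    return qvalues


pvals = [0.001, 0.012, 0.014, 0.122, 0.245,
         0.320, 0.550, 0.776, 0.840, 0.995]

q_lambda_0 = estimate_qvalues(pvals, lam=0.0)   # matches B-H adjusted p-values
q_lambda_half = estimate_qvalues(pvals, lam=0.5)
print([round(q, 4) for q in q_lambda_0])
print([round(q, 4) for q in q_lambda_half])
```

With λ = 0.5, four of the ten p-values exceed λ, so the estimated proportion of true nulls is 4/(10 × 0.5) = 0.8 and every q-value shrinks by that factor relative to λ = 0.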

Resampling-based methods

• One primary advantage is that they offer a natural approach to account for underlying unknown correlation structure among multiple hypotheses.

free step-down resampling (FSDR)

• The idea behind the resampling-based approach is that by taking repeated samples of the observed data, we can simulate the distribution of the test statistics (or p-values) under the complete null hypothesis, and then compare the observed test statistics to this empirical distribution to ascertain the corresponding significance of our tests.

subset pivotality condition

• The distribution of test statistics is the same under any combination of true null hypotheses.

• The test statistic distribution is invariant to whether all null hypotheses are indeed true or a partial set of null hypotheses are true.

• The covariance between test statistics is assumed to be the same under all scenarios of true and false null hypotheses.

Algorithm: Free step-down resampling (MaxT):

1. Determine the “observed” test statistics T1, …, Tm and p-values, and let the corresponding ordered absolute values of the test statistics, sorted from smallest to largest, be given by |T|(1), …, |T|(m).

2. Generate the (approximate) distribution of test statistics under the complete null:

(a) Estimate the residuals.

(b) Draw a new set of residuals with replacement from the original set of residuals.

(c) Refit the model and get new test statistics.

(d) Take absolute values of the new test statistics.

3. Compare the observed test statistics to the test statistics under the complete null to get adjusted p-values. The resampling component of Step 2 is repeated B times.
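The three steps above can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the slides: the test statistic is a pooled two-sample t comparing carriers with non-carriers of each SNP, residuals come from an intercept-only model, the successive-maximum comparison follows the usual Westfall and Young form, and the data are simulated.

```python
import math
import random
from statistics import mean, variance


def t_statistic(trait, carrier):
    """Pooled two-sample t-statistic: carriers (1) versus non-carriers (0)."""
    a = [y for y, g in zip(trait, carrier) if g == 1]
    b = [y for y, g in zip(trait, carrier) if g == 0]
    pooled = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / len(a) + 1 / len(b)))


def free_step_down_maxt(trait, snps, n_boot=500, seed=1):
    rng = random.Random(seed)
    m = len(snps)
    # Step 1: observed |T|, ordered from smallest to largest.
    observed = [abs(t_statistic(trait, g)) for g in snps]
    order = sorted(range(m), key=lambda j: observed[j])
    # Step 2(a): residuals under the complete null (intercept-only model, assumed).
    grand_mean = mean(trait)
    residuals = [y - grand_mean for y in trait]
    counts = [0] * m
    for _ in range(n_boot):
        # Step 2(b)-(d): resample residuals, recompute statistics, take absolute values.
        resampled = rng.choices(residuals, k=len(residuals))
        boot = [abs(t_statistic(resampled, g)) for g in snps]
        # Step 3: successive maxima over hypotheses with smaller observed statistics.
        running_max = 0.0
        for pos, j in enumerate(order):
            running_max = max(running_max, boot[j])
            if running_max >= observed[j]:
                counts[pos] += 1
    adjusted_sorted = [c / n_boot for c in counts]
    # Enforce monotonicity: a smaller observed statistic cannot get a smaller p-value.
    for pos in range(m - 2, -1, -1):
        adjusted_sorted[pos] = max(adjusted_sorted[pos], adjusted_sorted[pos + 1])
    adjusted = [0.0] * m
    for pos, j in enumerate(order):
        adjusted[j] = adjusted_sorted[pos]
    return adjusted


# Simulated data: 60 subjects, 3 SNP carrier indicators, SNP 0 has a real effect.
data_rng = random.Random(7)
n_subjects = 60
snps = [[1 if (i // (2 ** j)) % 2 == 0 else 0 for i in range(n_subjects)] for j in range(3)]
trait = [data_rng.gauss(0.0, 1.0) + 2.0 * snps[0][i] for i in range(n_subjects)]

adjusted_pvalues = free_step_down_maxt(trait, snps)
print(adjusted_pvalues)
```

Because every SNP statistic is recomputed on the same resampled trait, the correlation among the statistics is carried into the null distribution, which is the advantage of resampling noted earlier.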

R package

• The mt.maxT() and mt.minP() functions in the multtest package can be used to implement the FSDR approach.

• However, application of these functions, originally written for gene expression data, to data arising from SNP association studies is not straightforward.

• Expression Data: association between the presence of the disease and gene expression levels across several genes.

• SNP Association Data: association between the trait and each of several class labels (the SNP genotypes).

a binary trait logistic model

• Rather than resampling from the residuals, generate new binary traits such that

• However, the subset pivotality condition is not met for this setting.

Null unrestricted bootstrap (for settings in which the subset pivotality assumption is violated)

• P0 is the data distribution under the complete null.

• P is the true data distribution.

• Qn(P0) is the distribution of test statistics under the null generated distribution.

• Q0n is the complete null distribution of test statistics.

• The basic idea behind the null unrestricted bootstrap approach is to project the distribution of test statistics, given by Qn(P) for the true data generating distribution, onto the space of mean zero distributions to arrive at Q0n.
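One simple reading of the projection onto mean-zero distributions is to subtract, for each hypothesis, the mean of its bootstrap statistics. The sketch below shows only that centering step and treats it as an assumption about how the projection is carried out; the bootstrap statistics are made-up numbers standing in for draws from Qn(P).

```python
from statistics import mean


def center_bootstrap_statistics(boot_stats):
    """Shift each column (one hypothesis) of bootstrap statistics to have mean zero.

    boot_stats is a list of bootstrap replicates, each a list of m statistics.
    """
    m = len(boot_stats[0])
    column_means = [mean(row[j] for row in boot_stats) for j in range(m)]
    return [[row[j] - column_means[j] for j in range(m)] for row in boot_stats]


# Made-up statistics from four unrestricted bootstrap samples, two hypotheses.
boot_stats = [[2.1, 0.3], [1.7, -0.4], [2.6, 0.1], [1.9, -0.2]]
null_stats = center_bootstrap_statistics(boot_stats)
print(null_stats)
```

Centering keeps the spread of each statistic and the dependence between statistics as estimated from the data, while removing the shift caused by any true associations.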

Alternative paradigms

Methods for reducing the number of tests:

• First, we present the approach, first described by Cheverud (2001), that takes into account LD structure to determine the effective number of tests (Meff).

• Second, we describe a global testing framework, proposed independently by Goeman et al. (2004) for gene expression data and Foulkes et al. (2005) for SNP data, that obviates the need for a multiple testing adjustment in some settings.

Effective number of tests

• Primary motivation: SNPs are usually correlated with one another.

• The variance of the eigenvalues of a correlation matrix of a set of variables captures information on the collective correlation of this set.

• The proportion reduction in the number of tests due to correlation is characterized by the ratio $\mathrm{Var}[\lambda_{obs}]/M$, where M is the number of variables and also the largest value this variance can take.

• The effective number of tests ranges from 1 to M and is defined by $M_{eff} = 1 + (M - 1)\left(1 - \mathrm{Var}[\lambda_{obs}]/M\right)$.

• If all variables are perfectly correlated, then Var[λobs] = M and Meff = 1.

• If we have no correlation among the set of SNPs under investigation, then Var[λobs] = 0 and Meff = M.

• Gao et al. (2008) proposed Meff-G: the number of principal components of the correlation matrix that explain C% of the variability in the data.

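• Formally, Meff-G is the minimum x such that $\sum_{i=1}^{x} \lambda_i \big/ \sum_{i=1}^{M} \lambda_i \ge C$, where $\lambda_1 \ge \cdots \ge \lambda_M$ are the ordered eigenvalues and C is expressed as a proportion.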

• Once the effective number of tests is determined, the resulting value can be used in place of M in our usual adjustment of multiple testing.

[Eg] Bonferroni adjustment: use $\alpha' = \alpha / M_{eff}$ in place of α/M.

• The results of this analysis may be sensitive to the choice of correlation matrix.
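Both effective-number calculations can be sketched without an eigenvalue routine. The sketch assumes the sample variance (divisor M − 1) of the eigenvalues, which reproduces the two boundary cases on the slide, and uses the fact that for a correlation matrix the eigenvalues sum to M and their squares sum to the sum of squared entries. The equicorrelated matrix and the threshold passed to the hypothetical `meff_g` helper are illustrative.

```python
def effective_number_of_tests(corr):
    """Meff = 1 + (M - 1) * (1 - Var[lambda] / M) for an M x M correlation matrix."""
    m = len(corr)
    # Sum of squared eigenvalues equals the sum of squared matrix entries.
    sum_sq_eigenvalues = sum(r * r for row in corr for r in row)
    # Eigenvalues have mean 1, so sum (lambda - 1)^2 = sum lambda^2 - M.
    var_lambda = (sum_sq_eigenvalues - m) / (m - 1)
    return 1 + (m - 1) * (1 - var_lambda / m)


def meff_g(eigenvalues, c):
    """Smallest number of leading eigenvalues explaining at least proportion c."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for x, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= c:
            return x
    return len(ordered)


# Four SNPs with pairwise correlation 0.5 (illustrative values).
rho = 0.5
corr = [[1.0 if i == j else rho for j in range(4)] for i in range(4)]
meff = effective_number_of_tests(corr)

# Eigenvalues of this matrix are 1 + 3*rho once and 1 - rho three times.
meff_g_value = meff_g([1 + 3 * rho, 1 - rho, 1 - rho, 1 - rho], c=0.8)

bonferroni_level = 0.05 / meff
print(meff, meff_g_value, round(bonferroni_level, 4))
```

For this matrix Meff is 3.25 rather than 4, so the Bonferroni level relaxes from 0.0125 to about 0.0154.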

Global tests

• ANOVA

[Eg]

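• Suppose we have two SNPs within a single gene so that there are 3^2 = 9 possible genotypes, given by the combinations of the three genotypes at the first SNP with the three genotypes at the second.

• Although the ANOVA F-test is a single test, it actually tends to become harder to reject the null hypothesis with more groups.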

• In other words, the power of our test is limited by the number of groups, as the df are expended on relatively rare genotypes.

• This well-known degrees-of-freedom problem is described for the analysis of haplotype-trait association studies in Chapman et al. (2003), Clayton et al. (2004), Tzeng et al. (2006) and Foulkes et al. (2008).

Extension of the ANOVA

• Assume

• The null is that there is no variability in the effects of genotypes on the trait. A likelihood ratio test can be applied to test this hypothesis, and the resulting test statistic has a

