

Sequential Selection Procedures and False Discovery Rate Control

Max Grazier G’Sell
Department of Statistics, Carnegie Mellon University, Pittsburgh, USA.

Stefan Wager
Department of Statistics, Stanford University, Stanford, USA.

Alexandra Chouldechova
Heinz College, Carnegie Mellon University, Pittsburgh, USA.

Robert Tibshirani
Departments of Health Research & Policy, and Statistics, Stanford University, Stanford, USA.

Summary. We consider a multiple hypothesis testing setting where the hypotheses are ordered and one is only permitted to reject an initial contiguous block, H1, ..., Hk, of hypotheses. A rejection rule in this setting amounts to a procedure for choosing the stopping point k. This setting is inspired by the sequential nature of many model selection problems, where choosing a stopping point or a model is equivalent to rejecting all hypotheses up to that point and none thereafter. We propose two new testing procedures, and prove that they control the false discovery rate in the ordered testing setting. We also show how the methods can be applied to model selection using recent results on p-values in sequential model selection settings.

Keywords: multiple hypothesis testing, stopping rule, false discovery rate, sequential testing

1. Introduction

Suppose that we have a sequence of null hypotheses, H1, H2, ..., Hm, and that we want to reject some hypotheses while controlling the False Discovery Rate (FDR, Benjamini and Hochberg, 1995). Moreover, suppose that these hypotheses must be rejected in an ordered fashion: a test procedure must reject hypotheses H1, ..., Hk for some k ∈ {0, 1, ..., m}. Classical methods for FDR control, such as the original Benjamini-Hochberg selection procedure, are ruled out by the requirement that the hypotheses be rejected in order.

In this paper we introduce new testing procedures that address this problem, and control the False Discovery Rate (FDR) in the ordered setting. Suppose that we have a sequence of p-values, p1, ..., pm ∈ [0, 1] corresponding to the hypotheses Hj, such that pj is uniformly distributed on [0, 1] when Hj is true. Our proposed methods start by transforming the sequence of p-values p1, ..., pm into a monotone increasing sequence of statistics 0 ≤ q1 ≤ ... ≤ qm ≤ 1. We then prove that we achieve ordered FDR control by applying the original Benjamini-Hochberg procedure to the monotone test statistics qi.

E-mail: [email protected]; [email protected]

arXiv:1309.5352v3 [math.ST] 23 Mar 2015


1.1. Variable Selection along a Regression Path

This problem of FDR control for ordered hypotheses arises naturally when implementing variable selection using a path-based regression algorithm; examples of such algorithms include forward stepwise regression (see Hocking, 1976, for a review) and least-angle regression (Efron et al., 2004). These methods build models by adding in variables one-by-one, and the number of non-zero variables in the final model depends only on a single sparsity-controlling tuning parameter. The lasso (Tibshirani, 1996) can also be used for path-based variable selection; however, the lasso also sometimes removes variables from its active set while building its model.

Each time we add a new variable to the model, we may want to ask—heuristically—whether adding the new variable to the model is a “good idea”. Because the path algorithmspecifies the order in which variables must be added to the model, asking these questionsyields a sequence of ordered hypotheses for which it is desirable to control the overall FDR.

To fix notation, suppose that we have data X ∈ R^{n×p} and Y ∈ R^n, and seek to fit the linear regression model

Y ∼ N(Xβ*, σ² I_{n×n})

using a sparse weight vector β. Path algorithms can then be seen as providing us with an ordering of the variables j1, j2, ... ∈ {1, ..., p} along with a sequence of nested models

∅ = M0 ⊂ M1 ⊂ ... ⊂ Mp, with Mk = {j1, ..., jk}.

The statistician then needs to pick one of the models Mk, and set to zero all coordinates βj with j ∉ Mk. The k-th ordered hypothesis Hk tests whether or not adding the k-th variable jk was informative.
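As an illustration, the entry ordering j1, j2, ... produced by one such path algorithm can be sketched in a few lines. The following is a minimal greedy forward stepwise pass (an illustrative sketch of one path algorithm, not the least-angle regression path and not code from this paper): at each step it adds the variable whose column is most correlated with the current least-squares residual.

```python
import numpy as np

def forward_stepwise_order(X, y):
    """Greedy forward stepwise path: at each step, add the variable whose
    column has the largest absolute inner product with the current residual.
    Returns the entry order j_1, j_2, ..., j_p as 0-based column indices."""
    n, p = X.shape
    order = []
    for _ in range(p):
        if order:
            # Least-squares fit on the current active set M_k = {j_1, ..., j_k}.
            beta, *_ = np.linalg.lstsq(X[:, order], y, rcond=None)
            resid = y - X[:, order] @ beta
        else:
            resid = y
        scores = np.abs(X.T @ resid)
        scores[order] = -np.inf  # variables already in the model are excluded
        order.append(int(np.argmax(scores)))
    return order  # nested models: M_k corresponds to order[:k]
```

The nested models M0 ⊂ M1 ⊂ ... ⊂ Mp are then simply the prefixes of the returned ordering.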

The null hypothesis Hk that adding the k-th variable along the regression path wasuninformative can be formalized in several ways.

• The Incremental Null: In the spirit of the classical AIC (Akaike, 1974) and BIC (Schwarz, 1978) procedures, Hk measures whether model Mk improves over Mk−1. In the case of linear regression, the null hypothesis states that the best regression fit for model Mk−1 is the same as the best regression fit for Mk or, more formally:

H^inc_k : P_{Mk−1} Xβ* = P_{Mk} Xβ*, where (1)

P_M = X_M (X_M^⊤ X_M)^† X_M^⊤ (2)

is the projection onto the column span of X_M. Here, we write X_M for the matrix comprised of the columns of X contained in M, and A^† denotes the Moore-Penrose pseudoinverse of a matrix A. Taylor et al. (2014) develop tests for H^inc_k in the context of both forward stepwise regression and least-angle regression.

• The Complete Null: We may also want to test the stronger null hypothesis that the model Mk−1 already captures all the available signal. More specifically, writing M* for the support set of β*, we define

H^comp_k : M* ⊆ Mk−1. (3)

Tests of H^comp_k for various pathwise regression models have been studied by, among others, Lockhart et al. (2014), Fithian et al. (2014), Loftus and Taylor (2014), and Taylor et al. (2013).


Table 1. Typical realization of p-values for H^inc_k with least-angle regression (LARS), as proposed by Taylor et al. (2014).

LARS step   1     2     3     4     5     6     7     8     9     10
Predictor   3     1     4     10    9     8     5     2     6     7
p-value     0.00  0.08  0.34  0.15  0.93  0.12  0.64  0.25  0.49  ·

• The Full-Model Null: Perhaps the simplest pathwise hypothesis we may want to test is that

H^FM_k : jk ∈ M*, (4)

i.e., that the k-th variable added to the regression path belongs to the support set of β*. Despite its simple appearance, however, the hypothesis H^FM_k is difficult to work with. The problem is that the truth of H^FM_k depends critically on variables that may not be contained in Mk, and so H^FM_k will have a “high-dimensional” character even when k is small. We are not aware of any general methods for testing H^FM_k along, for example, the least-angle regression path, and do not pursue this formalization further in this paper.

The incremental and complete null hypotheses may both be appropriate in different contexts, depending on the needs of the statistician. An advantage of testing H^inc_k is that it seeks parsimonious models where most non-zero variables are useful. On the other hand, H^comp_k has the advantage that, unlike with H^inc_k, subsequent hypotheses are nested; this can make interpretation easier. We note that, when X has full column rank:

H^comp_k = ∧_{l=k}^{p} H^inc_l.

The goal of this paper is to develop generic FDR control procedures for ordered hypotheses that can be used for pathwise variable selection regardless of a statistician’s choice of fitting procedure (forward stepwise or least-angle regression), null hypothesis (H^inc_k or H^comp_k), and test statistic. The flexibility of our approach should be a major asset, as the proliferation of methods for pathwise hypothesis testing suggests an interest in the topic (Lockhart et al., 2014; Loftus and Taylor, 2014; Fithian et al., 2014; G’Sell et al., 2013; Lee et al., 2013; Lee and Taylor, 2014; Taylor et al., 2013, 2014).

Example. To further illustrate our setup, consider a simple model selection problem. We have n observations from a linear model with p predictors,

yi = β0 + Σ_{j=1}^{p} xij βj + Zi with Zi ∼ N(0, 1), (5)

and seek to fit β by least-angle regression. As discussed above, this procedure adds variables to the model one-by-one, and we need to decide after how many variables k to stop. The recent work of Taylor et al. (2014) provides us with p-values for the sequence of hypotheses H^inc_k defined in (1); Table 1 has a typical realization of these p-values with data generated from a model with

n = 50, p = 10, xij iid∼ N(0, 1), β1 = 2, β3 = 4, β2 = β4 = β5 = ... = β10 = 0.


[Figure 1: two panels; left panel x-axis “Target FDR” (0.05, 0.1, 0.2, 0.35, 0.5), y-axis “Number of variables selected”; right panel x-axis “Target FDR”, y-axis “Observed FDR”.]

Fig. 1. For the model selection problem in Equation (5), 1000 random realizations were simulated and the ForwardStop procedure applied. The left panel shows the number of predictors selected at FDR levels 0.05, 0.1, 0.2, 0.35 and 0.5. The right panel shows the observed FDR on the Y axis and the Target FDR on the X axis. The 45° line is plotted in grey for reference.

These p-values are not exchangeable, and must be treated in the order in which the predictors were entered: 3, 1, 4, etc. Our goal is to use these p-values to produce an FDR-controlling stopping rule. In the following section, we introduce two procedures, ForwardStop and StrongStop, that control FDR. Figure 1 illustrates the performance of one of our proposed procedures; in this example, it allows us to accurately estimate the support of β while successfully controlling the FDR.

1.2. Stopping Rules for Ordered FDR Control

In the ordered setting, a valid rejection rule is a function of p1, ..., pm that returns a cutoff k such that hypotheses H1, ..., Hk are rejected. The False Discovery Rate (FDR) is defined as E[V(k)/max(1, k)], where V(k) is the number of null hypotheses among the rejected hypotheses H1, ..., Hk.

We propose two rejection functions for this scenario, called ForwardStop:

kF = max{ k ∈ {1, ..., m} : −(1/k) Σ_{i=1}^{k} log(1 − pi) ≤ α }, (6)

and StrongStop:

kS = max{ k ∈ {1, ..., m} : exp( Σ_{j=k}^{m} log(pj)/j ) ≤ αk/m }. (7)

We adopt the convention that max(∅) = 0, so that k = 0 whenever no rejections can bemade. In Section 2 we show that both ForwardStop and StrongStop control FDR at levelα.
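To make the two rules concrete, here is a direct transcription of (6) and (7) into code. This is a sketch based on the displayed formulas, not the authors’ reference implementation; it returns k = 0 when the defining set is empty, matching the convention max(∅) = 0.

```python
import numpy as np

def forward_stop(p, alpha):
    """ForwardStop (6): the largest k such that the running mean of
    Y_i = -log(1 - p_i) over i = 1, ..., k is at most alpha; 0 if none."""
    p = np.asarray(p, dtype=float)
    running_mean = np.cumsum(-np.log(1.0 - p)) / np.arange(1, len(p) + 1)
    passing = np.nonzero(running_mean <= alpha)[0]
    return 0 if passing.size == 0 else int(passing[-1]) + 1  # 1-indexed k

def strong_stop(p, alpha):
    """StrongStop (7): the largest k such that
    exp(sum_{j=k}^m log(p_j)/j) <= alpha * k / m; 0 if none."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    # Reversed cumulative sum gives the tail sums sum_{j=k}^m log(p_j)/j.
    tail_sums = np.cumsum((np.log(p) / np.arange(1, m + 1))[::-1])[::-1]
    q = np.exp(tail_sums)
    passing = np.nonzero(q <= alpha * np.arange(1, m + 1) / m)[0]
    return 0 if passing.size == 0 else int(passing[-1]) + 1

# Applied to the first nine Table 1 p-values; the reported 0.00 is replaced
# by 1e-4, an arbitrary small value, since the table rounds to two decimals.
p_lars = [1e-4, 0.08, 0.34, 0.15, 0.93, 0.12, 0.64, 0.25, 0.49]
print(forward_stop(p_lars, alpha=0.2))  # 4: the first four steps are rejected
```

On these p-values, ForwardStop at α = 0.2 gets past the isolated large p-value 0.34 at step 3, illustrating the max-over-k behavior discussed above.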


ForwardStop first transforms the p-values, and then sets the rejection threshold at thelargest k for which the first k transformed p-values have a small enough average. If the firstp-values are very small, then ForwardStop will always reject the first hypotheses regardlessof the last p-values. As a result, the rule is moderately robust to potential misspecificationof the null distribution of the p-values at high indexes. This is particularly important inmodel selection applications, where one may doubt whether the asymptotic distribution isaccurate in finite samples at high indexes.

Our second rule, StrongStop (7), comes with a stronger guarantee than ForwardStop. Aswe show in Section 2, provided that the non-null p-values precede the null ones, it not onlycontrols the FDR, but also controls the Family-Wise Error Rate (FWER) at level α. Recallthat the FWER is the probability that a decision rule makes even a single false discovery. Iffalse discoveries have a particularly high cost, then StrongStop may be more attractive thanForwardStop. The main weakness of StrongStop is that the decision to reject at k dependson all the p-values after k. If the very last p-values are slightly larger than they should beunder the uniform hypothesis, then the rule suffers a considerable loss of power.

A major advantage of both ForwardStop and StrongStop is that these procedures seekthe largest k at which an inequality holds, even if the inequality may not hold for someindex l with l < k. This property enables them to get past some isolated large p-values forthe early hypotheses, thus resulting in a substantial increase in power. This phenomenon isclosely related to the gain in power of the Benjamini and Hochberg (1995) procedure overthe Simes (1986) procedure.

1.3. Related Work

Although there is an extensive literature on FDR control and its variants (e.g., Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001; Blanchard and Roquain, 2008; Efron et al., 2001; Goeman and Solari, 2010; Romano and Shaikh, 2006; Storey et al., 2004), no definitive procedure for ordered FDR control has been proposed so far. The closest method we are aware of is an adaptation of the α-investing approach (Aharoni and Rosset, 2013; Foster and Stine, 2008). However, this procedure is not known to formally control the FDR (Foster and Stine prove that it controls the mFDR, defined as E[V]/(E[R] + η) for some constant η); moreover, in our simulations, this approach has lower power than our proposed methods.

The problem of providing FDR control in regression models has been studied by, among others, Barber and Candès (2014), Benjamini and Gavrilov (2009), Bogdan et al. (2014), Lin et al. (2011), Meinshausen and Bühlmann (2010), Shah and Samworth (2012), and Wu et al. (2007), using a wide variety of ideas involving resampling, pseudo-variables, and specifically tailored selection penalties. The goal of our paper is not to directly compete with these methods, but rather to provide “theoretical glue” that lets us transform the rapidly growing family of sequential p-values described in Section 1.1 into model selection procedures with FDR guarantees.

We note that the problem of variable selection for regression models can be thought of as a generalization of the standard multiple testing problem, where each p-value corresponds to its own variable (e.g., Churchill and Doerge, 1994; Consortium et al., 2012; Simonsen and McIntyre, 2004; Westfall and Young, 1993). In a standard genome-wide association study, for instance, one might test a family of null hypotheses of the form Hi,0: SNP i is not associated with the response, for i = 1, ..., m. When there is high spatial correlation across SNPs, the set of rejected hypotheses is likely to contain correlated subgroups of SNPs that are

Page 6: Sequential Selection Procedures and False Discovery Rate ... · Sequential FDR Control 3 Table 1. Typical realization of p-values for Hinc k with least-angle regression (LARS), as

6 Max Grazier G’Sell, et. al.

redundant: while each is marginally significant, all SNPs in a subgroup carry essentially the same information about the response. The goal of model selection is to avoid this type of redundancy by selecting a group of SNPs, each of which contains significant distinct information about the response.

It is also important to contrast the goal of our work with that of prediction-drivenmodel selection procedures such as cross-validation. Prediction-driven approaches selectmodels that minimize the estimated prediction error, but generally provide no guaranteeon the statistical significance of the selected predictors. Our goal is to conduct inferenceto select a parsimonious model with inferential guarantees, even though the selected modelwill generally be smaller than the model giving the lowest prediction error.

Finally, a key challenge in conducting inference in regression settings is dealing withcorrelated predictors. Indeed, when the predictors are highly correlated, the appropriateness(and definition) of FWER and FDR as error criteria may come into question. If we selecta noise variable that is highly correlated with a signal variable, should we consider it to bea false selection? This is a broad question that is beyond the scope of this paper, but isworth considering when discussing selection errors in problems with highly correlated X.This question is discussed in more detail in several papers (e.g., Benjamini and Gavrilov,2009; Bogdan et al., 2014; G’Sell et al., 2013; Lin et al., 2011; Wu et al., 2007).

1.4. Outline of this paper

We begin by presenting generic methods for FDR control in ordered settings. Section 2 develops our two main proposals for sequential testing, ForwardStop and StrongStop, along with their theoretical justification. We evaluate these rules on simulations in Section 3. In Section 4, we review the recent literature on sequential testing for model selection problems and discuss its relation to our procedures. Moreover, we develop a more specialized version of StrongStop, called TailStop, which takes advantage of special properties of some of the proposed sequential tests. Finally, in Section 5, we evaluate our sequential FDR controlling procedures in combination with the pathwise regression test statistics of Lockhart et al. (2014) and Taylor et al. (2014) in both simulations and a real data example.

All proofs are provided in Appendix A.

2. False Discovery Rate Control for Ordered Hypotheses

In this section, we study a generic ordered layout where we test a sequence of hypotheses that are associated with p-values p1, ..., pm ∈ [0, 1]. A subset N ⊆ {1, ..., m} of these p-values are null, with the property that

{pi : i ∈ N} iid∼ U([0, 1]). (8)

We can reject the first k hypotheses for some k of our choice. Our goal is to make k as large as possible, while controlling the number of false discoveries

V(k) = |{i ∈ N : i ≤ k}|.

Specifically, we want to use a rule k with a bounded false discovery rate

FDR(k) = E[ V(k) / max{k, 1} ]. (9)

Page 7: Sequential Selection Procedures and False Discovery Rate ... · Sequential FDR Control 3 Table 1. Typical realization of p-values for Hinc k with least-angle regression (LARS), as

Sequential FDR Control 7

We develop two procedures that provide such a guarantee.

Classical FDR literature focuses on rejecting a subset of hypotheses R ⊆ {1, ..., m} such that R contains few false discoveries. Benjamini and Hochberg (1995) showed that, in the context of (8), we can control the FDR as follows. Let p(1), ..., p(m) be the sorted list of p-values, and let

lα = max{ l : p(l) ≤ αl/m }.

Then, if we reject the hypotheses corresponding to the lα smallest p-values, we control the FDR at level α. This method for selecting the rejection set R is known as the BH procedure. The key difference between the setup of Benjamini and Hochberg (1995) and our problem is that, in the former, the rejection set R can be arbitrary, whereas here we must always reject the first k hypotheses for some k. For example, even if the p-value corresponding to the third hypothesis is very small, we cannot reject the third hypothesis unless we also reject the first and second hypotheses.
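For reference, the classical BH procedure just described can be written as follows (a standard textbook implementation, not code from this paper):

```python
import numpy as np

def benjamini_hochberg(p, alpha):
    """Classical BH: reject the hypotheses with the l_alpha smallest p-values,
    where l_alpha = max{l : p_(l) <= alpha * l / m}. Returns the 0-based
    indices of the rejected hypotheses (an arbitrary subset, not a prefix)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    passing = np.nonzero(sorted_p <= alpha * np.arange(1, m + 1) / m)[0]
    if passing.size == 0:
        return np.array([], dtype=int)
    l_alpha = passing[-1] + 1
    return np.sort(order[:l_alpha])
```

Note that the returned rejection set can be any subset of the hypotheses, which is exactly what the ordered setting forbids.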

2.1. A BH-Type Procedure for Ordered Selection

The main motivation behind our first procedure, ForwardStop, is the following thought experiment. Suppose that we could transform our p-values p1, ..., pm into statistics q1 < ... < qm, such that the qi behaved like a sorted list of p-values. Then, we could apply the BH procedure to the qi, and get a rejection set R of the form R = {1, ..., k}.

Under the global null where p1, ..., pm iid∼ U([0, 1]), we can achieve such a transformation using the Rényi representation theorem (Rényi, 1953). Rényi showed that if Y1, ..., Ym are independent standard exponential random variables, then

( Y1/m, Y1/m + Y2/(m−1), ..., Σ_{i=1}^{m} Yi/(m−i+1) ) =d ( E_{1,m}, E_{2,m}, ..., E_{m,m} ),

where the E_{i,m} are exponential order statistics, meaning that the E_{i,m} have the same distribution as a sorted list of m independent standard exponential random variables. The Rényi representation provides us with a tool that lets us map a list of independent exponential random variables to a list of sorted order statistics, and vice versa.

In our context, let

Yi = −log(1 − pi), (10)

Zi = Σ_{j=1}^{i} Yj / (m − j + 1), and (11)

qi = 1 − e^{−Zi}. (12)

Under the global null, the Yi are distributed as independent standard exponential random variables. Thus, by the Rényi representation, the Zi are distributed as exponential order statistics, and so the qi are distributed like uniform order statistics.
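This claim is easy to check numerically. The sketch below simulates the global null, applies the transformation (10)-(12), and compares the mean of each qi to i/(m + 1), the mean of the i-th uniform order statistic (the simulation sizes are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_sim = 20, 20000

# Global null: p_1, ..., p_m iid uniform, simulated n_sim times.
p = rng.uniform(size=(n_sim, m))
Y = -np.log(1.0 - p)                           # iid standard exponentials (10)
Z = np.cumsum(Y / (m - np.arange(m)), axis=1)  # Renyi partial sums (11)
q = 1.0 - np.exp(-Z)                           # candidate order statistics (12)

# The i-th uniform order statistic has mean i / (m + 1); compare empirically.
empirical = q.mean(axis=0)
theoretical = np.arange(1, m + 1) / (m + 1)
print(np.max(np.abs(empirical - theoretical)))  # small, of Monte Carlo order
```

The qi are monotone by construction (the Zi are partial sums of non-negative terms), and their empirical means line up with the uniform order-statistic means.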

This argument suggests that in an ordered selection setup, we should reject the first k^q_F hypotheses, where

k^q_F = max{ k : qk ≤ αk/m }. (13)

Page 8: Sequential Selection Procedures and False Discovery Rate ... · Sequential FDR Control 3 Table 1. Typical realization of p-values for Hinc k with least-angle regression (LARS), as

8 Max Grazier G’Sell, et. al.

The Rényi representation combined with the BH procedure immediately implies that the rule k^q_F controls the FDR at level α under the global null. Once we leave the global null, the Rényi representation no longer applies; however, as we show in the following results, our procedure still controls the FDR.

We begin by stating a result under a slightly restricted setup, where we assume that the first s p-values are non-null and the last m − s p-values are null. We will later relax this constraint. The proof of the following result is closely inspired by the martingale argument of Storey et al. (2004). As usual, our analysis is conditional on the non-null p-values (i.e., we treat them as fixed).

Lemma 1. Suppose that we have p-values p1, ..., pm ∈ (0, 1), the last m − s of which are null, i.e., independently drawn from U([0, 1]). Define qi as in (12). Then the rule k^q_F controls the FDR at level α, meaning that

E[ (k^q_F − s)_+ / max{k^q_F, 1} ] ≤ α. (14)

Now, the test statistics qi constructed in Lemma 1 depend on m. We can simplify the rule by augmenting our list of p-values with additional null test statistics (taking m → ∞), and using the fact that (1 − e^{−x})/x → 1 as x gets small. This gives rise to one of our main proposals:

Procedure 1 (ForwardStop). Let p1, ..., pm ∈ [0, 1], and let 0 < α < 1. We reject hypotheses 1, ..., kF, where

kF = max{ k ∈ {1, ..., m} : (1/k) Σ_{i=1}^{k} Yi ≤ α }, (15)

and Yi = −log(1 − pi).

We call this procedure ForwardStop because it scans the p-values in a forward manner: if (1/k) Σ_{i=1}^{k} Yi ≤ α, then we know that we can reject the first k hypotheses regardless of the remaining p-values. This property is desirable if we trust the first p-values more than the last p-values.

A major advantage of ForwardStop over the direct Rényi stopping rule (13) is that ForwardStop provides FDR control even when some null hypotheses are interspersed among the non-null ones. In particular, in the regression setting, this is important for achieving FDR control for the incremental hypotheses H^inc_k (1), which are not in general nested.

Theorem 2. Suppose that we have p-values p1, ..., pm ∈ (0, 1), of which a subset N ⊆ {1, ..., m} are null, i.e., independently drawn from U([0, 1]). Then, the ForwardStop procedure kF (15) controls FDR at level α, meaning that

E[ |{1, ..., kF} ∩ N| / max{kF, 1} ] ≤ α.
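Theorem 2 can be sanity-checked by simulation even when nulls are interspersed. The sketch below plants strong non-null p-values at every other one of the first twenty indices, fills the remaining indices with uniforms, and reports the Monte Carlo FDR of ForwardStop; all constants are arbitrary illustration choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, alpha, n_sim = 50, 0.2, 2000

def forward_stop(p, alpha):
    # ForwardStop rule (15): largest k whose running mean of Y_i is <= alpha.
    means = np.cumsum(-np.log(1.0 - p)) / np.arange(1, len(p) + 1)
    hits = np.nonzero(means <= alpha)[0]
    return 0 if hits.size == 0 else int(hits[-1]) + 1

# Non-nulls at indices 0, 2, ..., 18, so nulls are interspersed among them.
nonnull = np.arange(0, 20, 2)
is_null = np.ones(m, dtype=bool)
is_null[nonnull] = False

fdp = np.empty(n_sim)
for t in range(n_sim):
    p = rng.uniform(size=m)
    p[nonnull] = rng.beta(1.0, 30.0, size=nonnull.size)  # strong non-null signal
    k = forward_stop(p, alpha)
    fdp[t] = is_null[:k].sum() / max(k, 1)

print(fdp.mean())  # Monte Carlo FDR; Theorem 2 guarantees at most alpha = 0.2
```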

2.2. Strong Control for Ordered Selection

In the previous section, we created the ordered test statistics Zi in (11) by summing transformed p-values starting from the first p-value. This choice was in some sense arbitrary.

Page 9: Sequential Selection Procedures and False Discovery Rate ... · Sequential FDR Control 3 Table 1. Typical realization of p-values for Hinc k with least-angle regression (LARS), as

Sequential FDR Control 9

Under the global null, we could just as well obtain uniform order statistics qi by summing from the back:

Yi = −log(pi), (16)

Zi = Σ_{j=i}^{m} Yj / j, and (17)

qi = e^{−Zi}. (18)

If we run the BH procedure on these backward test statistics, we obtain another methodfor controlling the number of false discoveries.

Procedure 2 (StrongStop). Let p1, ..., pm ∈ [0, 1], and let 0 < α < 1. We reject hypotheses 1, ..., kS, where

kS = max{ k ∈ {1, ..., m} : qk ≤ αk/m } (19)

and qk is as defined in (18).

Unlike ForwardStop, this new procedure needs to look at the p-values corresponding tothe last hypotheses before it can choose to make any rejections. This can be a liability if wedo not trust the very last p-values much. Looking at the last p-values can however be usefulif the model is correctly specified, as it enables us to strengthen our control guarantees:StrongStop not only controls the FDR, but also controls the FWER.

Theorem 3. Suppose that we have p-values p1, ..., pm ∈ (0, 1), the last m − s of which are null (i.e., independently drawn from U([0, 1])). Then, the rule kS from (19) controls the FWER at level α, meaning that

P[kS > s] ≤ α. (20)

FWER control is stronger than FDR control, and so we immediately conclude fromTheorem 3 that StrongStop also controls the FDR. Note that the guarantees from Theorem3 only hold when the non-null p-values all precede the null ones.

3. Simulation Experiments: Simple Ordered Hypothesis Example

In this section, we demonstrate the performance of our methods in three simulation set-tings of varying difficulty. The simulation settings consist of ordered hypotheses where theseparation of the null and non-null hypotheses is varied to determine the difficulty of thescenario. Additional simulations are provided in Appendix B.

We consider a sequence of m = 100 hypotheses of which s = 20 are non-null. The p-values corresponding to the non-null hypotheses are drawn from a Beta(1, b) distribution, while those corresponding to true null hypotheses are drawn from U([0, 1]). At each simulation iteration, the indices of the true null hypotheses are selected by sampling without replacement from the set {1, 2, ..., m = 100} with probability of selection proportional to i^γ. In this scheme, lower indices have smaller probabilities of being selected, so null hypotheses tend to concentrate at higher indices. We present results for three simulation cases, which we refer to as ‘easy’ (perfect separation), ‘medium’ (γ = 8), and ‘hard’ (γ = 4).
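A sketch of this data-generating process is below. The helper name is ours, and the weighted sampling-without-replacement call is one way to realize “probability of selection proportional to i^γ”; it may differ in minor details from the authors’ implementation.

```python
import numpy as np

def simulate_ordered_pvalues(m=100, s=20, b=14, gamma=8, rng=None):
    """One draw from the Section 3 setting: s non-null p-values ~ Beta(1, b),
    m - s null p-values ~ U(0, 1), with null indices sampled without
    replacement with probability proportional to i^gamma."""
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(1, m + 1)
    w = idx.astype(float) ** gamma
    null_idx = rng.choice(m, size=m - s, replace=False, p=w / w.sum())  # 0-based
    is_null = np.zeros(m, dtype=bool)
    is_null[null_idx] = True
    p = np.where(is_null, rng.uniform(size=m), rng.beta(1.0, b, size=m))
    return p, is_null
```

With γ = 8, the weights favor high indices so strongly that the nulls pile up at the end of the sequence, reproducing the light inter-mixing of the medium setting.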

Page 10: Sequential Selection Procedures and False Discovery Rate ... · Sequential FDR Control 3 Table 1. Typical realization of p-values for Hinc k with least-angle regression (LARS), as

10 Max Grazier G’Sell, et. al.

Fig. 2. Observed p-values for 50 realizations of the ordered hypothesis simulations described inSection 3. p-values corresponding to non-null hypotheses are shown in orange, while those corre-sponding to null hypotheses are shown in gray. The smooth black curve is the average proportionof null hypotheses up to the given index, and is shown to help gauge the difficulty of the problem.This curve can be thought of as the FDR of a fixed stopping rule which always stops at exactly thegiven index. Non-null p-values are drawn from a Beta(1, b) distribution, with b = 23, 14, 8 for the easy,medium and hard settings, respectively.

In the easy setup, we have strong signal b = 23 and all the non-null hypotheses precede thenull hypotheses, so we have perfect separation. In the medium difficulty setup, b = 14 andthe null and non-null hypotheses are lightly inter-mixed. In the hard difficulty setup, b = 8and the two are much more inter-mixed.

For comparison, we also apply the following two rejection rules:

(a) Thresholding at α. We reject all hypotheses up to the first time that a p-value exceedsα. This is guaranteed to control FWER and FDR at level α (Marcus et al., 1976).

(b) α-investing. We use the α-investing scheme of Foster and Stine (2008). While this procedure is not generally guaranteed to yield rejections that obey the ordering restriction, we can select parameters for which it does. In particular, defining an investing rule such that the wealth is equal to zero at the first failure to reject, we get

k_invest = min{ k : p_{k+1} > (k + 1)α / (1 + (k + 1)α) }.

This is guaranteed to control E[V]/(E[R] + 1) at level α. We note that, using generalized α-investing (Aharoni and Rosset, 2013), we could tweak the α-investing procedure to have more power to reject the earliest hypotheses and less power for later ones; however, we will not explore that possibility here.
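The rule k_invest above can be transcribed directly (a sketch from the displayed formula, not the general Foster-Stine procedure):

```python
def alpha_investing_stop(p, alpha):
    """Ordered alpha-investing variant: scan forward and stop at the first
    index whose p-value exceeds (k+1)*alpha / (1 + (k+1)*alpha), rejecting
    hypotheses 1, ..., k. Unlike ForwardStop, this rule cannot look past
    the first p-value that fails its threshold."""
    k = 0
    for pk in p:
        thresh = (k + 1) * alpha / (1.0 + (k + 1) * alpha)
        if pk > thresh:
            return k
        k += 1
    return k
```

Because the scan stops at the first failure, a single medium-sized early p-value ends the procedure, which is the weakness noted in the next paragraph.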

These are the best competitors we are aware of for our problem. We emphasize that, unlikeForwardStop and StrongStop, these rules stop at the first p-value that exceeds a giventhreshold. Thus, these methods will fail to identify true rejections with very small p-valueswhen they are preceded by a few medium-sized p-values.

Figure 2 shows scatterplots of observed p-values for 50 realizations of the three setups.Figure 3 summarizes the performance of the four stopping rules. We note that StrongStop

Page 11: Sequential Selection Procedures and False Discovery Rate ... · Sequential FDR Control 3 Table 1. Typical realization of p-values for Hinc k with least-angle regression (LARS), as

Sequential FDR Control 11

[Figure 3: legend — AlphaInvesting, AlphaThresholding, ForwardStop, StrongStop; x-axis: Target FDR.]

Fig. 3. Average power and observed FDR level for the ordered hypothesis example based on 2000simulation instances. The notion of power used here is that of average power, defined as the fractionof non-null hypotheses that are rejected (i.e., (k − V )/s). All four stopping rules successfully controlFDR across the three difficulty settings. StrongStop and α-thresholding are both very conservative interms of FDR control. Even though ForwardStop and α-investing have similar observed FDR curves,ForwardStop emerges as the more powerful method, and thus has better performance in terms of aprecision-recall tradeoff.

appears to be more powerful than other methods weak signal/low α settings. This mayoccur because, unlike the other methods, StrongStop scans p-values back-to-front and istherefore less sensitive to the occurrence of large p-values early in the alternative.

4. Model Selection and Ordered Testing

We now revisit the application that motivated our ordered hypothesis testing formalism. As discussed in Section 1.1, we assume that a path-based regression procedure like forward stepwise regression or least-angle regression has given us a sequence of models ∅ = M_0 ⊂ M_1 ⊂ ... ⊂ M_p, and our task is to select one of these nested models. This results in an ordered hypothesis testing problem that is conditional on the order in which the regression algorithm adds variables along its path.

We gave two options for formalizing the hypothesis that M_k improves over M_{k−1} and that the k-th variable should be added to the model: the incremental null (1) and the complete null (3). In this section, we review recent proposals by Taylor et al. (2014) and Lockhart et al. (2014) for testing each of these nulls in the case of least-angle regression and the lasso respectively, and show how to incorporate them into our framework.

We emphasize again that the field of ordered hypothesis testing appears to be growing rapidly, and that the applicability of our sequential FDR controlling procedures is not limited to the tests surveyed here; for example, if we wanted to test H^comp_k for forward stepwise regression or the graphical lasso, we could use the test statistics of Loftus and Taylor (2014) or G'Sell et al. (2013) respectively.

4.1. Testing the Incremental Null for Least-Angle Regression

In the context of least-angle regression, Taylor et al. (2014) provide exact, finite-sample p-values for H^inc_k for generic design matrices X. The corresponding test statistic is called the spacing test. The first spacing test statistic T_1 has a simple form

T_1 = (1 − Φ(λ_1/σ)) / (1 − Φ(λ_2/σ)), (21)

where λ_1 and λ_2 are the first two knots along the least-angle regression path and σ is the noise scale. Given a standardized design matrix X and the null hypothesis H^inc_1, T_1 is uniformly distributed over [0, 1]. Remarkably, this result holds under general position conditions on X that hold almost surely if X is drawn from a continuous distribution, and does not require n or p to be large.

Taylor et al. (2014) also derive similar test statistics T_k for subsequent steps along the least-angle regression path, which can be used for testing H^inc_k. Assuming Gaussian noise, all the H^inc_k-null p-values produced by this test are 1-dependent and uniformly distributed over [0, 1]. For the purpose of our demonstrations, we apply our general FDR control procedures directly as though the p-values were independent. Developing a version of the spacing test that yields independent p-values remains an active area of research.

4.2. Testing the Complete Null for the Lasso

We also apply our formalism to testing the complete null for the lasso path, using the covariance test statistics of Lockhart et al. (2014). As our experiments will make clear, an advantage of testing the complete null instead of the incremental null is a substantial increase in power.

In the case of orthogonal X, the covariance test statistics have the particularly simple form

T_k = λ_k(λ_k − λ_{k+1}), (22)

where λ_1 ≥ λ_2 ≥ ... denote the knots of the lasso path. Because X is orthogonal, the lasso never removes variables along its path, and so we know there will be exactly p knots. Lockhart et al. (2014) show that these test statistics satisfy the following asymptotic guarantee. Recalling that the complete hypotheses are nested, suppose that H^comp_1, ..., H^comp_s are false, and H^comp_{s+1} is true. Then, in the limit with s fixed and n, p → ∞,

(T_{s+1}, ..., T_{s+ℓ}) ⇒ (Exp(1), Exp(1/2), ..., Exp(1/ℓ)) (23)

for any fixed ℓ ≥ 1. As shown below, we can use the harmonic asymptotics of these test statistics to improve the power of our sequential procedures.
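To see the limit (23) emerge, one can simulate the orthogonal-design case directly: there the lasso knots are the sorted absolute entries of X^T y, so under the global null (s = 0) the first statistic T_1 = λ_1(λ_1 − λ_2) should be approximately Exp(1). The simulation below is our illustration, not taken from the paper; the sample sizes are arbitrary.

```python
import numpy as np

# Under the global null with orthogonal design, z = X^T y has iid N(0, 1)
# entries, the knots are the sorted |z_i|, and T_1 = lam_1 * (lam_1 - lam_2)
# from (22) is approximately Exp(1) when p is large.
rng = np.random.default_rng(1)
p, reps = 5000, 2000
T1 = np.empty(reps)
for r in range(reps):
    lam = np.sort(np.abs(rng.standard_normal(p)))[::-1]  # lam_1 >= lam_2 >= ...
    T1[r] = lam[0] * (lam[0] - lam[1])
print(np.mean(T1))  # roughly 1, the Exp(1) mean; convergence in p is slow
```

The slow convergence visible here is the same stringency issue discussed in the next paragraph.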


The major limitation of the statistics (22) is that their distribution can only be controlled asymptotically, and for orthogonal X. Lockhart et al. (2014) also provide adaptations of (22) that hold for non-orthogonal X; however, the required asymptotic regime is then quite stringent, so we may prefer to use the finite-sample-exact tests of Taylor et al. (2014) discussed in Section 4.1. In the future, it may be possible to use ideas from Fithian et al. (2014) to devise non-asymptotic and powerful tests of H^comp_k for generic X.

4.2.1. False Discovery Rate Control for Harmonic Test Statistics

Motivated by the harmonic form of the test statistics T_k in (23), we show here how to improve the power of our sequential procedures in this setting. Similar harmonic asymptotics also arise in other contexts, e.g., the test statistics for the graphical lasso of G'Sell et al. (2013).

Abstracting away from concrete regression problems, suppose that we have a sequence of arbitrary statistics T_1, ..., T_m ≥ 0 corresponding to m hypotheses. The first s test statistics correspond to signal variables; the subsequent ones are independently distributed as

(T_{s+1}, ..., T_m) ∼ (Exp(1), Exp(1/2), ..., Exp(1/(m − s))), (24)

where Exp(µ) denotes the exponential distribution with mean µ. As before, we wish to construct a stopping rule that controls the FDR.

To apply either ForwardStop or StrongStop using p-values based on (24) would require knowledge of the number of signal variables s, and hence would not be practical. Fortuitously, however, an extension of this idea yields a variation of StrongStop that does not require knowledge of s and controls FDR. Under (24), we have j · T_{s+j} ∼ Exp(1). Using this fact, suppose that we knew s and formed the StrongStop rule for the m − s null test statistics. This would suggest a test based on

q*_i = exp( − Σ_{j=i}^{m} (max{1, j − s}/j) T_j ). (25)

This is not a usable test, since it depends on knowledge of s. Now suppose we set s = 0, giving

q*_i = exp( − Σ_{j=i}^{m} T_j ). (26)

An application of the BH procedure to the q*_i leads to the following rule.

Procedure 3 (TailStop). Let q*_i be defined as in (26). We reject hypotheses 1, ..., k_T, where

k_T = max{ k : q*_k ≤ αk/m }. (27)

Now the choice s = 0 is anti-conservative (in fact, it is the least conservative possibility for s), and so as expected we lose the strong control property of StrongStop. But surprisingly, in the idealized setting of (24), TailStop controls the FDR nearly exactly.
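A direct implementation of TailStop, together with a small Monte Carlo check of this near-exact FDR control under the idealized model (24), might look as follows. This is our sketch; the signal level, dimensions, and seed are arbitrary choices for illustration.

```python
import numpy as np

def tailstop(T, alpha):
    """TailStop (Procedure 3): q*_i = exp(-(T_i + ... + T_m)) as in (26),
    and k_T = max{k : q*_k <= alpha * k / m} as in (27); returns k_T."""
    m = len(T)
    tail_sums = np.cumsum(T[::-1])[::-1]          # sum_{j >= i} T_j
    q = np.exp(-tail_sums)
    hits = np.nonzero(q <= alpha * np.arange(1, m + 1) / m)[0]
    return hits[-1] + 1 if hits.size else 0

# Monte Carlo under (24): s signal statistics (set to a large constant for
# illustration) followed by independent Exp(1/j) nulls, j = 1, ..., m - s.
rng = np.random.default_rng(0)
m, s, alpha, reps = 50, 10, 0.2, 4000
fdp = np.empty(reps)
for r in range(reps):
    T = np.empty(m)
    T[:s] = 5.0
    T[s:] = rng.exponential(scale=1.0 / np.arange(1, m - s + 1))
    fdp[r] = max(tailstop(T, alpha) - s, 0) / max(tailstop(T, alpha), 1)
print(np.mean(fdp))  # close to alpha * (m - s) / m = 0.16
```

Note the back-to-front cumulative sum: the decision at index k depends on everything after it, which is what lets the rule adapt to the harmonic decay of the nulls.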


Fig. 4. Observed p-values for H^inc_k in 50 realizations of the spacing test (Taylor et al., 2014) for least-angle regression. p-values corresponding to non-null hypotheses are shown in orange, while those corresponding to null hypotheses are shown in gray. The smooth black curve is the average proportion of null hypotheses up to the given index. This example is similar to the Easy setting of the ordered hypothesis example of §3 in that the null and alternative are nearly perfectly separated. However, in the least-angle regression setting the p-values under the alternative are highly variable and can be quite large, particularly in the Hard setting.

Theorem 4. Given (24), the rule from (27) controls FDR at level α. More precisely,

E[ (k_T − s)_+ / max{k_T, 1} ] = α (m − s)/m.

The name TailStop emphasizes the fact that this procedure starts scanning the test statistics from the back of the list, rather than from the front. Scanning from the back allows us to adapt to the harmonic decay of the null p-values without knowing the number s of non-null predictors. An analogue to ForwardStop for this setup would be much more difficult to implement, as we would need to estimate s explicitly. We emphasize that the guarantees from Theorem 4 hold under the generative model (24), whereas the covariance test statistics only have this distribution asymptotically. However, in our simulation experiments, the asymptotic regime appears to hold well enough for this not to be an issue.

5. Model Selection Experiments

In this section, we use the sequential procedures from Section 2 for pathwise model selection in sparse regression. As discussed in Section 4, we focus on two particular problems: testing the incremental null for least-angle regression with generic design (Section 5.1), and testing the complete null for the lasso with orthogonal design (Section 5.2).

The first of these two settings is of course more immediately relevant to practice, and we verify that ForwardStop paired with the spacing test statistics of Taylor et al. (2014) performs well on a real medical dataset. Meanwhile, the orthogonal simulations in Section 5.2 showcase the power boost that we can obtain from testing the complete null instead of the incremental null. We believe that further theoretical advances in the pathwise testing literature will enable us to have similar power along with FDR guarantees in finite samples with generic X.



Fig. 5. Average power and observed FDR level for the spacing test p-values for H^inc_k (Taylor et al., 2014). Even though there is nearly perfect separation between the null and alternative regions, the presence of large alternative p-values early in the path makes this a difficult problem. StrongStop attains both the highest average power and the lowest observed FDR across the simulation settings. Unlike the other methods, StrongStop scans p-values back-to-front, and is therefore able to perform well despite the occurrence of large p-values early in the path.

Finally, although our testing procedures are mathematically motivated by different null hypotheses, namely the incremental and complete ones, we evaluate the performance of each method in terms of its full-model false discovery rate (that is, the fraction of selected variables that do not belong to the support of the true β*). This lets us make a more direct practical comparison between different methods.

5.1. Testing the Incremental Null for Least-Angle Regression

We compare the performance of ForwardStop, StrongStop, α-investing and α-thresholding on the spacing test statistics from Section 4.1. We try three different simulation settings with varying signal strength. TailStop is not included in this comparison because it should only be used when the null test statistics exhibit harmonic behaviour as in (23), whereas the spacing test p-values are uniform.

In all three settings we have n = 200 observations on p = 100 variables, of which 10 are non-null, and standard normal errors on the observations. The design matrix X is taken to have iid Gaussian entries. The non-zero entries of the parameter vector β are taken to be equally spaced values from 2γ to γ√(2 log p), where γ is varied to set the difficulty of the problem.

Fig. 6. The H^inc_k p-values from the spacing test for the least-angle regression path, applied to the Abacavir resistance data of Rhee et al. (2006). The vertical lines mark the stopping points of the four stopping rules, all with α = 0.2. ForwardStop selects the first 16 variables, even though the p-values at 9 and 14 are quite high. All but one of the selections made by ForwardStop are considered meaningful by a previous study (Rhee et al., 2005).

Figure 4 shows p-values from 50 realizations of each simulation setting. Note that while all three settings have excellent separation (meaning that least-angle regression selects most of the signal variables before admitting any noise variables), the p-values under the alternative can still be quite large. Figure 5 shows plots of average power and observed FDR level across the three simulation settings.

5.1.1. HIV Data

As a practical demonstration of our methods, we apply the same approach to the Human Immunodeficiency Virus Type 1 (HIV-1) data of Rhee et al. (2006), which studied the genetic basis of HIV-1 resistance to several antiretroviral drugs. We focus on one of the accompanying data sets, which measures the resistance of HIV-1 to six different Nucleoside RT inhibitors (a type of antiretroviral drug) over 1005 subjects, with mutations measured at 202 different locations (after removing missing and duplicate values). The paper sought to determine which particular mutations were predictive of resistance to these drugs.

In this section, we use least-angle regression to estimate a sparse linear model predicting drug resistance from the mutation marker locations. The ForwardStop, StrongStop, α-investing, and α-thresholding stopping rules are applied to the H^inc_k p-values from the spacing test described in Section 4.1 to select a model along the least-angle regression path. A previous study of Rhee et al. (2005) provides a list of known relationships between mutations and drug resistance, which allows partial assessment of the validity of the selected variables. This data set has also been studied by Barber and Candes (2014) in order to assess the performance of the knockoff filter for variable selection; the main difference is that, unlike us, they do not constrain the selection set to be the beginning of the least-angle regression path.


Table 2. Number of selections (R) made on the drug resistance data of Rhee et al. (2006) using least-angle regression, the p-values from the spacing test, and the four stopping rules (with α = 0.2). The number of correctly selected mutation locations (S) is assessed using results from a previous study. ForwardStop and StrongStop have the most competitive power, with the advantage varying by drug. The abbreviations match those used in the original paper.

Rule            3TC      ABC      AZT      D4T      DDI      TDF
                R   S    R   S    R   S    R   S    R   S    R   S
ForwardStop     4   4   16  15    4   4   18  14    3   3    6   6
StrongStop      4   4   10  10    8   8   10   9   12  12    6   6
α-Thresholding  4   4    8   8    4   4   10   9    3   3    2   2
α-Investing     4   4   13  12    4   4   21  14    3   3    2   2

Table 2 shows the number of rejections and the number of correct rejections for each method applied to each of the six drug resistance outcomes. For illustration, we plot the p-values and stopping points for resistance to Abacavir (ABC) in Figure 6. The theory supporting ForwardStop and StrongStop suggests that the selected models should contain no more than 20% false positives in expectation. The information available from the literature supports the validity of the selections, showing that the variables selected by our procedures largely corresponded to meaningful relationships based on previous studies conducted with independent data. We observe that, while all methods appear to achieve overall FDR control, in each case either ForwardStop or StrongStop yields the highest number of correct rejections.

5.2. Testing the Complete Null for the Lasso with Orthogonal X

In this section, we compare the performance of StrongStop, ForwardStop, α-investing, and α-thresholding, as well as TailStop, for testing the complete null for the lasso with orthogonal X using the covariance test statistics of Lockhart et al. (2014). As discussed in Section 4.2, these test statistics exhibit a harmonic behavior that TailStop is designed to take advantage of. All other procedures operate on conservative p-values, p_j = exp(−T_j), obtained by bounding the null distributions by Exp(1).
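For the procedures that operate on these conservative p-values, the conversion and the subsequent scan are straightforward. The sketch below uses our reading of the ForwardStop rule, k_F = max{k : (1/k) Σ_{i=1}^k −log(1 − p_i) ≤ α}, as recovered in the proof of Corollary 5 in the appendix; the statistics in the example are hypothetical.

```python
import numpy as np

def forward_stop(p, alpha):
    """ForwardStop: k_F = max{k : (1/k) * sum_{i<=k} -log(1 - p_i) <= alpha}."""
    y = -np.log(1.0 - np.asarray(p, dtype=float))
    running_avg = np.cumsum(y) / np.arange(1, len(y) + 1)
    hits = np.nonzero(running_avg <= alpha)[0]
    return hits[-1] + 1 if hits.size else 0

# Conservative p-values p_j = exp(-T_j), bounding each null by Exp(1);
# the covariance-type statistics below are made up for illustration.
T = np.array([8.0, 5.0, 2.5, 0.3, 0.8, 0.1])
p = np.exp(-T)
print(forward_stop(p, alpha=0.2))  # → 3
```

Unlike a simple thresholding rule, ForwardStop averages the transformed p-values, so an isolated medium-sized p-value early on does not necessarily end the path.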

We consider three scenarios, which we once again refer to as easy, medium and hard. In all of the settings we have n = 200 observations on p = 100 variables, of which 10 are non-null, and standard normal errors on the observations. The non-zero entries of the parameter vector β are taken to be equally spaced values from 2γ to γ√(2 log p), where γ is varied to set the difficulty of the problem. Figure 7 shows p-values from 50 realizations of each simulation setting; they exhibit harmonic behavior as described in Section 4.2.

Figure 8 shows plots of average power and observed FDR level across the three simulation settings. The superior performance of TailStop is both desirable and expected, as it is the only rule that can take advantage of the rapid decay of the test statistics in the null.

6. Conclusions

We have introduced a new setting for multiple hypothesis testing that is motivated by sequential model selection problems. In this setting, the hypotheses are ordered, and all rejections are required to lie in an initial contiguous block. Because of this constraint, existing multiple testing approaches do not control criteria like the False Discovery Rate (FDR).


Fig. 7. Observed p-values for 50 realizations of the covariance test (Lockhart et al., 2014) for H^comp_k with orthogonal X. p-values corresponding to non-null hypotheses are shown in orange, while those corresponding to null hypotheses are shown in gray. The smooth black curve is the average proportion of null hypotheses up to the given index. Note that these p-values behave very differently from those in the ordered hypothesis example presented in §3. The null p-values here exhibit Exp(1/ℓ) behaviour, as described in §4.2. Note that the non-null p-values for this test can be quite large on occasion. TailStop performs well in part because it is not sensitive to the presence of some large non-null p-values.

We proposed a pair of procedures for testing in this setting, denoted by ForwardStop and StrongStop. We proved that these procedures control FDR at a specified level while respecting the required ordering of the rejections. Two procedures were proposed because they provide different advantages. ForwardStop is simple and robust to assumptions on the particular behavior of the null distribution. Meanwhile, when the null distribution is dependable, StrongStop controls not only FDR, but also the Family-Wise Error Rate (FWER). We then applied our methods to model selection, and provided a modification of StrongStop, called TailStop, which takes advantage of the harmonic distributional guarantees that are available in some of those settings.

A variety of researchers are continuing to work on developing stepwise distributional guarantees for a wide range of model selection problems. As many of these procedures are sequential in nature, we hope that the stopping procedures from this paper will provide a way to convert these stepwise guarantees into model selection rules with accompanying inferential guarantees.

There are many important challenges for future work. For exact control of FDR or FWER, our methods require that the null p-values be independent. Except under orthogonal design, this is not true for any of the existing sequential p-value procedures that we are aware of. Further work is needed in extending our theory, and/or developing new sequential regression tests that yield independence under the null.

Acknowledgment

M.G. and S.W. contributed equally to this paper. The authors are grateful for helpful conversations with William Fithian and Jonathan Taylor, and to the editors and referees for their constructive comments and suggestions. M.G., S.W. and A.C. are respectively supported by an NSF GRFP Fellowship, a B.C. and E.J. Eaves SGF Fellowship, and an NSERC PGSD Fellowship; R.T. is supported by NSF grant DMS-9971405 and NIH grant N01-HV-28183. Most of this work was performed while M.G. and A.C. were at the Stanford Statistics Department.

Fig. 8. Average power and observed FDR level for the orthogonal lasso using the covariance test of Lockhart et al. (2014) for H^comp_k. In the bottom panels, we see that all methods control the FDR. However, in the medium and hard settings TailStop is the only method that shows sensitivity to the choice of target α level. All other methods have an observed FDR level that is effectively 0, irrespective of the target α. From the power plots we also see that TailStop has far higher power than the other procedures; in the medium setting at low α the power is almost 10 times higher than any other method. By taking advantage of the Exp(1/ℓ) behaviour of the null p-values, TailStop far outperforms the other methods in power across all the difficulty settings.

References

Aharoni, E. and S. Rosset (2013). Generalized α-investing: definitions, optimality results and application to public databases. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6), 716–723.

Barber, R. F. and E. Candes (2014). Controlling the false discovery rate via knockoffs. arXiv preprint arXiv:1404.5609.


Benjamini, Y. and Y. Gavrilov (2009). A simple forward selection procedure based on false discovery rate control. The Annals of Applied Statistics 3(1), 179–198.

Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 289–300.

Benjamini, Y. and D. Yekutieli (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 1165–1188.

Blanchard, G. and E. Roquain (2008). Two simple sufficient conditions for FDR control. Electronic Journal of Statistics 2, 963–992.

Bogdan, M., E. v. d. Berg, C. Sabatti, W. Su, and E. J. Candes (2014). SLOPE—adaptive variable selection via convex optimization. arXiv preprint arXiv:1407.3824.

Churchill, G. A. and R. W. Doerge (1994). Empirical threshold values for quantitative trait mapping. Genetics 138(3), 963–971.

1000 Genomes Project Consortium et al. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491(7422), 56–65.

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32(2), 407–499. With discussion, and a rejoinder by the authors.

Efron, B., R. Tibshirani, J. Storey, and V. Tusher (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 1151–1160.

Fithian, W., D. Sun, and J. Taylor (2014). Optimal inference after model selection. arXiv preprint arXiv:1410.2597.

Foster, D. P. and R. A. Stine (2008). α-investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70(2), 429–444.

Goeman, J. J. and A. Solari (2010). The sequential rejection principle of familywise error control. The Annals of Statistics, 3782–3810.

G'Sell, M., S. Wager, A. Chouldechova, and R. Tibshirani (2015). Supplementary material: Sequential selection procedures and false discovery rate control. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

G'Sell, M. G., T. Hastie, and R. Tibshirani (2013). False variable selection rates in regression. arXiv preprint arXiv:1302.2303.

G'Sell, M. G., J. Taylor, and R. Tibshirani (2013). Adaptive testing for the graphical lasso. arXiv preprint arXiv:1307.4765.

Hocking, R. R. (1976). A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics, 1–49.

Lee, J., D. Sun, Y. Sun, and J. Taylor (2013). Exact post-selection inference with the lasso. arXiv preprint arXiv:1311.6238.


Lee, J. D. and J. E. Taylor (2014). Exact post model selection inference for marginal screening. In Advances in Neural Information Processing Systems, Volume 27.

Lin, D., D. Foster, and L. Ungar (2011). VIF regression: A fast regression algorithm for large data. Journal of the American Statistical Association 106(493), 232–247.

Lockhart, R., J. Taylor, R. J. Tibshirani, and R. Tibshirani (2014). A significance test for the lasso. The Annals of Statistics (with discussion) 42(2), 413–468.

Loftus, J. R. and J. E. Taylor (2014). A significance test for forward stepwise model selection. arXiv preprint arXiv:1405.3920.

Marcus, R., P. Eric, and K. R. Gabriel (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63(3), 655–660.

Meinshausen, N. and P. Buhlmann (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(4), 417–473.

Renyi, A. (1953). On the theory of order statistics. Acta Mathematica Hungarica 4(3), 191–231.

Rhee, S.-Y., W. J. Fessel, A. R. Zolopa, L. Hurley, T. Liu, J. Taylor, D. P. Nguyen, S. Slome, D. Klein, M. Horberg, et al. (2005). HIV-1 protease and reverse-transcriptase mutations: correlations with antiretroviral therapy in subtype B isolates and implications for drug-resistance surveillance. Journal of Infectious Diseases 192(3), 456–465.

Rhee, S.-Y., J. Taylor, G. Wadhera, A. Ben-Hur, D. L. Brutlag, and R. W. Shafer (2006). Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences 103(46), 17355–17360. Data available at http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006/.

Romano, J. P. and A. M. Shaikh (2006). Stepup procedures for control of generalizations of the familywise error rate. The Annals of Statistics, 1850–1873.

Schwarz, G. et al. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.

Shah, R. and R. Samworth (2012). Variable selection with error control: Another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73(3), 751–754.

Simonsen, K. L. and L. M. McIntyre (2004). Using alpha wisely: improving power to detect multiple QTL. Statistical Applications in Genetics and Molecular Biology 3(1).

Storey, J. D., J. E. Taylor, and D. Siegmund (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(1), 187–205.


Taylor, J., R. Lockhart, R. J. Tibshirani, and R. Tibshirani (2014). Post-selection adaptive inference for least angle regression and the lasso. arXiv preprint arXiv:1401.3889.

Taylor, J., J. Loftus, R. Tibshirani, and R. Tibshirani (2013). Tests in adaptive regression via the Kac-Rice formula. arXiv preprint arXiv:1308.3020.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B 58(1), 267–288.

Westfall, P. H. and S. S. Young (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. John Wiley & Sons.

Wu, Y., D. Boos, and L. Stefanski (2007). Controlling variable selection by the addition of pseudovariables. Journal of the American Statistical Association 102(477), 235–243.

A. Proofs

Proof (Lemma 1). We can map any rejection threshold t to a number of rejections k. For the purpose of this proof, we will frame the problem as how to choose a rejection threshold t; any choice of t ∈ [0, 1] immediately leads to a rule

k_F = R(t) = |{i : q_i ≤ t}|.

Similarly, the number of false discoveries is given by V(t) = |{i > s : q_i ≤ t}|. We define the threshold selection rule

t_α = max{ t ∈ [0, 1] : t ≤ αR(t)/m }.

Here, R(t_α) = k_F, and so this rule is equivalent to the one defined in the hypothesis.

When coming in from 0, R(t) is piecewise continuous with upwards jumps, so

t_α = αR(t_α)/m,

allowing us to simplify our expression of interest:

V(t_α)/R(t_α) = (α/m) · V(t_α)/t_α.

Thus, in order to prove our result, it suffices to show that

E[ V(t_α)/t_α ] ≤ m.

The remainder of this proof establishes the above inequality using Renyi representation and a martingale argument due to Storey et al. (2004).


Recall that, by assumption, p_{s+1}, ..., p_m ~iid U([0, 1]). Thus, we can use Renyi representation to show that

(Z_{s+1} − Z_s, ..., Z_m − Z_s) = ( Y_{s+1}/(m − s), ..., Σ_{i=s+1}^{m} Y_i/(m − i + 1) ) =_d (E_{1,m−s}, ..., E_{m−s,m−s}),

where the E_{i,m−s} are standard exponential order statistics, and so

( e^{−(Z_{s+1}−Z_s)}, ..., e^{−(Z_m−Z_s)} )

are distributed as m − s order statistics drawn from the uniform U([0, 1]) distribution. Recalling that

1 − q_{s+i} = (1 − q_s) e^{−(Z_{s+i}−Z_s)},

we see that q_{s+1}, ..., q_m are distributed as uniform order statistics on [q_s, 1].

Because the last q_i are uniformly distributed,

M(t) = V(t)/(t − q_s)

is a martingale on (q_s, 1] with time running backwards. Here, the relevant filtration F_t tells us which of the q_i are strictly greater than t; we can also verify that t_α is a stopping time with respect to this backwards-time filtration. Now, let M^+(t), t^+_α, and F^+_t be the right-continuous modifications of the previous quantities (again, with respect to backwards-running time). By the optional sampling theorem,

E[ min{M^+(t^+_α), C}; t^+_α > q_s ] ≤ M(1) = (m − s)/(1 − q_s)

for any C ≥ 0; thus, by the (Lebesgue) monotone convergence theorem,

E[ M^+(t^+_α); t^+_α > q_s ] ≤ (m − s)/(1 − q_s).

Moreover, we can verify that

E[ M^+(t^+_α); t^+_α > q_s ] = E[ M(t_α); t_α > q_s ]

almost surely, and so

E[ M(t_α); t_α > q_s ] ≤ (m − s)/(1 − q_s).

For all t > q_s,

V(t)/t = ((t − q_s)/t) M(t) ≤ (1 − q_s) M(t),

and so

E[ V(t_α)/t_α; t_α > q_s ] ≤ m − s.

Meanwhile, E[ V(t_α)/t_α | t_α ≤ q_s ] = 0, and so, as claimed, E[ V(t_α)/t_α ] ≤ m. □


We begin our analysis of ForwardStop (Procedure 1) by showing that it satisfies the same guarantees as the stopping rule (13). Although the following corollary is subsumed by Theorem 2, its simple proof can still be helpful for understanding the motivation behind ForwardStop.

Corollary 5. Under the conditions of Lemma 1, the ForwardStop procedure defined in (15) controls the FDR at level α.

Proof. We can extend our original list of p-values $p_1, \dots, p_m$ by appending additional terms
$$
p_{m+1},\, p_{m+2},\, \dots,\, p_{m^*} \overset{\text{iid}}{\sim} U([0,1])
$$
to it. This extended list of p-values still satisfies the conditions of Lemma 1, and so we can apply procedure (13) to this extended list without losing the FDR control guarantee:
$$
k_F^{q,\,m^*} = \max\left\{k : \frac{m^*\, q_k^{m^*}}{k} \leq \alpha\right\}.
$$
As we take $m^* \to \infty$, we have
$$
\lim_{m^* \to \infty} m^*\, q_k^{m^*} = \lim_{m^* \to \infty} m^* \left(1 - \exp\left(-\sum_{j=1}^{k} \frac{Y_j}{m^* - j + 1}\right)\right) = \sum_{j=1}^{k} Y_j,
$$
and so, because the set $[0, \alpha]$ is closed, we recover the procedure described in the hypothesis:
$$
\lim_{m^* \to \infty} k_F^{q,\,m^*} = k_F.
$$
Thus, by dominated convergence, the rule $k_F$ controls the FDR at level $\alpha$. □
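Taking the limit above at face value, the rule $k_F$ amounts to a running average of the transformed p-values $Y_j = -\log(1-p_j)$: reject up to the largest $k$ with $\frac{1}{k}\sum_{j \leq k} Y_j \leq \alpha$. A minimal sketch (the function name and example p-values are ours, and the rule's form is the one recovered in the limit above):

```python
import numpy as np

def forward_stop(pvals, alpha):
    """Sketch of ForwardStop: largest k with (1/k) * sum_{j<=k} -log(1 - p_j)
    at most alpha. Returns 0 if no k qualifies (reject nothing)."""
    p = np.asarray(pvals, dtype=float)
    Y = -np.log1p(-p)                              # Y_j = -log(1 - p_j)
    running_avg = np.cumsum(Y) / np.arange(1, len(p) + 1)
    passing = np.nonzero(running_avg <= alpha)[0]
    return int(passing[-1] + 1) if passing.size else 0

# Small example: strong early signals followed by null-like p-values.
pvals = [1e-4, 1e-3, 0.02, 0.4, 0.9, 0.6]
k = forward_stop(pvals, alpha=0.1)
print(k)  # prints 3
```

Because the rule averages over all rejected hypotheses, a few strong early p-values can carry weaker ones past the threshold, which is what makes it less conservative than thresholding each p-value individually.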

Proof (Theorem 2). The proof of Lemma 1 used the quantities
$$
Z_i = \sum_{j=1}^{i} \frac{Y_j}{m - j + 1} = \sum_{j=1}^{i} \frac{Y_j}{\left|\{l \in \{j, \dots, m\}\}\right|}
$$
to construct the sorted test statistics $q_i$. The key difference between the setup of Lemma 1 and our current setup is that we can no longer assume that if the $i$th hypothesis is null, then all subsequent hypotheses will also be null.

In order to adapt our proof to this new possibility, we need to replace the $Z_i$ with
$$
Z^{\mathrm{ALT}}_i = \sum_{j=1}^{i} \frac{Y_j}{\nu(j)}, \quad \text{where } \nu(j) = \left|\{l \in \{j, \dots, m\} : l \in N\}\right|,
$$
and $N$ is the set of indices corresponding to null hypotheses. Defining
$$
q^{\mathrm{ALT}}_i = 1 - e^{-Z^{\mathrm{ALT}}_i},
$$
we can use the Rényi representation to check that these test statistics have distribution
$$
1 - q^{\mathrm{ALT}}_i \overset{d}{=} r(i)\left(1 - U_{\nu(i),\,|N|}\right), \quad \text{where } r(i) := \exp\left(-\sum_{\{j \leq i \,:\, j \notin N\}} \frac{Y_j}{\nu(j)}\right)
$$

and the $U_{\nu(j),\,|N|}$ are order statistics of the uniform $U([0,1])$ distribution. Here $r(i)$ is deterministic in the sense that it depends only on the locations and values of the non-null p-values.

If we base our rejection threshold $t^{\mathrm{ALT}}_\alpha$ on the $q^{\mathrm{ALT}}_i$, then, by an argument analogous to that in the proof of Lemma 1, we see that $V(t)/t$ is a sub-martingale with time running backwards. The key step in showing this is to notice that, now, the decay rate of $V(t)$ is accelerated by a factor $r^{-1}(i) \geq 1$. Thus, the rejection threshold $t^{\mathrm{ALT}}_\alpha$ controls the FDR at level $\alpha$ in our new setup, where null and non-null hypotheses are allowed to mix.

Now, of course, we cannot compute the rule $t^{\mathrm{ALT}}_\alpha$, because the $Z^{\mathrm{ALT}}_i$ depend on the unknown number $\nu(j)$ of null hypotheses remaining. However, we can apply the same trick as in the proof of Corollary 5, and append to our list an arbitrarily large number of p-values that are known to be null. In the limit where we append infinitely many null p-values to our list, we recover the ForwardStop rejection threshold. Thus, by dominated convergence, ForwardStop controls the FDR even when null and non-null hypotheses are interspersed. □

Proof (Theorem 3). We begin by considering the global null case. In this case, the $Y_i$ are all standard exponential, and so by the Rényi representation the $q_i$ are distributed as the order statistics of $m$ independent uniform $U([0,1])$ random variables. Thus, under the global null, the rule $k_S$ is just Simes' procedure (Simes, 1986) applied to the $q_i$. Simes' procedure is known to provide exact $\alpha$-level control under the global null, so (20) holds as an equality under the global null.

Now, consider the case where the global null does not hold. Suppose that we have $k_S = k > s$. From the definition of $q_k$, we see that $q_k$ depends only on $p_k, \dots, p_m$, and so the event $q_k \leq \alpha k/m$ is just as likely under the global null as under any alternative with fewer than $k$ non-null p-values. Thus, conditional on $s$,
$$
\sum_{k=s+1}^{m} \mathbb{P}\left[k_S = k \,\middle|\, \text{alternative}\right] = \sum_{k=s+1}^{m} \mathbb{P}\left[k_S = k \,\middle|\, \text{null}\right] \leq \alpha,
$$
and so the discussed procedure in fact provides strong control. □
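The rule $k_S$ can be sketched in a few lines. The transformation below, $q_k = \exp\left(\sum_{j=k}^{m} \log(p_j)/j\right)$, is our assumption for the statistics referenced in this proof (it depends only on $p_k, \dots, p_m$, and under the global null the Rényi representation makes the $q_k$ uniform order statistics); the function name is ours:

```python
import numpy as np

def strong_stop(pvals, alpha):
    """Sketch of the k_S rule: a Simes-type step-up comparison q_k <= alpha*k/m
    applied to statistics q_k that depend only on p_k, ..., p_m.

    We assume q_k = exp(sum_{j=k}^m log(p_j)/j), which yields uniform order
    statistics under the global null. Returns 0 if nothing is rejected."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    # Reverse cumulative sum gives the tail sums sum_{j=k}^m log(p_j)/j.
    tail_sums = np.cumsum((np.log(p) / np.arange(1, m + 1))[::-1])[::-1]
    q = np.exp(tail_sums)
    # Simes-type threshold: largest k with q_k <= alpha * k / m.
    ks = np.nonzero(q <= alpha * np.arange(1, m + 1) / m)[0]
    return int(ks[-1] + 1) if ks.size else 0
```

Under the global null, the step-up comparison $q_k \leq \alpha k/m$ on uniform order statistics is exactly Simes' procedure, so the probability of rejecting anything at all is $\alpha$.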

Proof (Theorem 4). Let $Z^*_i = \sum_{j=i}^{m} T_j$. By the Rényi representation,
$$
\left(Z^*_{s+1}, \dots, Z^*_m\right) \sim \left(E_{m-s,\,m-s}, \dots, E_{1,\,m-s}\right),
$$
where the $E_{i,\,j}$ are exponential order statistics. Thus, the null test statistics $\left(q^*_{s+1}, \dots, q^*_m\right)$ are distributed as $m-s$ order statistics drawn from the uniform $U([0,1])$ distribution. The result of Benjamini and Hochberg (1995) immediately implies that we can achieve FDR control by applying the BH procedure to the $q^*_i$, and so TailStop controls the FDR. The exact equality follows from the result of Benjamini and Yekutieli (2001). □
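The null case of this argument is easy to check by simulation: the BH step applied to independent uniform statistics rejects at least one hypothesis with probability exactly $\alpha$ under the global null (Simes' equality). A quick Monte Carlo sketch (all names and parameter values are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
m_null, alpha, n_sim = 30, 0.2, 4000

# Under the null, the transformed statistics behave like uniform order
# statistics; the BH step-up rule rejects anything at all with probability
# exactly alpha in this global-null setting.
def bh_any_rejection(q_sorted, alpha):
    m = len(q_sorted)
    return np.any(q_sorted <= alpha * np.arange(1, m + 1) / m)

hits = sum(
    bh_any_rejection(np.sort(rng.uniform(size=m_null)), alpha)
    for _ in range(n_sim)
)
print(hits / n_sim)  # should be close to alpha = 0.2
```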


B. Additional Simulations

In this section we revisit the ordered hypothesis example introduced in Section 3 and present the results of a more extensive simulation study. We explore the following perturbations of the problem:

(a) Varying signal strength while holding the level of separation fixed. (Figures 9, 10, 11)

(b) Increasing the number of hypotheses while retaining the same proportion of non-null hypotheses (Figure 12)

(c) Varying the proportion of non-null hypotheses (Figures 13, 14, 15)

We remind the reader of the three simulation settings introduced in Section 3, which we termed Easy, Medium and Hard. These settings were defined as follows:

Easy Perfect separation (all alternatives precede all nulls), and strong signal (Beta(1, 23))

Medium Good separation (mild intermixing of hypotheses), and moderate signal (Beta(1, 14))

Hard Moderate separation (moderate intermixing of hypotheses), and low signal (Beta(1, 8))

All results are based on 2000 simulation iterations. Unless otherwise specified, the simulations are carried out with m = 100 total hypotheses of which s = 20 are non-null.
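As a concrete reference, one draw of the simulated p-value sequence in the Easy regime can be sketched as follows (variable names are ours; the Medium and Hard regimes would intermix the two groups and use Beta(1, 14) or Beta(1, 8) for the signal):

```python
import numpy as np

rng = np.random.default_rng(3)
m, s = 100, 20          # total hypotheses, number of non-nulls

# Easy regime: the s non-null p-values are drawn from Beta(1, 23) and
# precede the m - s uniform null p-values (perfect separation).
p_alt = rng.beta(1, 23, size=s)
p_null = rng.uniform(size=m - s)
pvals = np.concatenate([p_alt, p_null])

print(pvals.shape)
print(round(p_alt.mean(), 3))  # Beta(1, 23) has mean 1/24, so this is small
```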


[Figure: curves for AlphaInvesting, AlphaThresholding, ForwardStop, and StrongStop, plotted against the target FDR.]

Fig. 9. Effect of signal strength on stopping rule performance: Perfect separation regime. ForwardStop remains the best performing method overall, except at the lowest α level in the moderate and low signal regimes. All of the methods become more conservative as the signal strength decreases.


Fig. 10. Effect of signal strength on stopping rule performance: Good separation regime. The effect of signal strength is qualitatively the same as in the perfect separation regime.

Fig. 11. Effect of signal strength on stopping rule performance: Moderate separation regime. The effect of signal strength is qualitatively the same as in the perfect separation and good separation regimes.


[Figure: curves for AlphaInvesting, AlphaThresholding, ForwardStop, and StrongStop, plotted against the target FDR.]

Fig. 12. Effect of increasing the total number of hypotheses. Instead of 100 hypotheses of which 20 are non-null, we consider 1000 hypotheses of which 200 are non-null. With the exception of α-thresholding, the performance of the methods remains largely unchanged. One small change is that ForwardStop loses power around α = 0.1 in the Hard setting. The key difference is that the performance of α-thresholding degrades considerably. This is not surprising when we consider that the stopping point of α-thresholding is simply a geometric random variable. Thus, as we increase the number of non-null hypotheses, we expect the average power of α-thresholding to drop to 0.


[Figure: curves for AlphaInvesting, AlphaThresholding, ForwardStop, and StrongStop, plotted against the target FDR.]

Fig. 13. Effect of varying the number of non-nulls out of m = 100 total hypotheses: Easy regime. With the exception of α-thresholding, the performance of the methods remains largely unchanged. The performance of α-thresholding degrades considerably as the number of non-null hypotheses increases. An explanation for this behaviour is presented in Figure 12.


Fig. 14. Effect of varying the number of non-nulls out of m = 100 total hypotheses: Medium regime. The effect of varying the number of non-null hypotheses is qualitatively the same as in the Easy regime.

Fig. 15. Effect of varying the number of non-nulls out of m = 100 total hypotheses: Hard regime. The effect of varying the number of non-null hypotheses is qualitatively the same as in the Easy and Medium regimes.