A Bayesian Perspective on Case Selection
Tasha Fairfield and A.E. Charman
Adapted from Chapter 11 in Social Inquiry and Bayesian Inference: Rethinking Qualitative
Research, October 2019.
1. INTRODUCTION
Case selection is a matter of ongoing debate in comparative and multi-method research that
combines large-N analysis with qualitative case studies. The voluminous literature contains a
proliferation of case selection strategies and a noteworthy lack of consensus on which strategies
serve which ends, and through what underlying logic. This paper presents a far simpler Bayesian
perspective on case selection, where the single overarching principle is to maximize information
that will help develop, refine, and/or compare hypotheses. If hypotheses are yet to be invented
or articulated clearly, we aim to study cases that will be rich in information. Once hypotheses
become better specified, we seek cases that are likely to provide information that discriminates
between rival explanations. In the latter context, optimal case selection is governed by a single
mathematical expression that quantifies expected information gain—the weight of evidence we
anticipate a given case to provide in favor of the true hypothesis, taking into account that from
the outset we do not know what evidence we will find when we examine a given case, and
we do not know which hypothesis is correct. In principle, optimal Bayesian case selection for
theory testing always entails maximizing expected information gain; this single principle either
supplants or subsumes all other case selection strategies discussed in the literature.
In practice, formally maximizing expected information gain to conduct optimal Bayesian case
selection is infeasible, given the impossibility of foreseeing all possible evidentiary outcomes in
advance (Fairfield & Charman 2019). Nevertheless, we can extract useful heuristic guidelines for
case selection that aim to approximate the full Bayesian prescription. Some of these guidelines
diverge sharply from existing recommendations, while many others are similar to approaches
commonly followed in qualitative research, but generally without recognition of their Bayesian
rationale. We also emphasize that the mathematical properties of expected information gain
ensure that we can expect to learn from any case that we study. While good practice entails
putting some careful thought into case selection, we accordingly advocate spending less time
and effort on this stage of research design than is generally the norm.
Our approach to case selection differs fundamentally from frequentism, where the goal is to estimate population-level parameters, and the central case-selection principle is random sampling
from a pre-specified population. In contrast, Bayesianism entails information-based inference
and accordingly employs an information-theoretic approach to case-selection. Instead of es-
timating population-level parameters, we compare causal hypotheses that have clearly stated
scope conditions delineating the range of contexts to which they apply. Each case we study
provides some overall aggregate weight of evidence that we use to update the odds on our hy-
potheses; we gain confidence that a given hypothesis will explain other cases that fall within
its scope to the extent that the evidence gathered from the cases studied so far increases the
posterior odds on that hypothesis relative to rivals. In this sense, we use case evidence to learn
about the plausibility of (more or less) general causal hypotheses that make predictions for
as yet unobserved cases. Our information-theoretic goal then is to select those cases that will
adjudicate between rival hypotheses as efficiently as possible—that is, to study the cases that
are likely to provide the largest weight of evidence in favor of the best explanation (the greatest
expected information gain).
Our approach to generalization—a common concern in the case selection literature—also differs
fundamentally from frequentism, which asks whether there is bias in the sample that may un-
dermine external validity and lead to findings that fail to hold for the population of cases from
which the sample was drawn. Within Bayesianism, in contrast, generalization entails refining a
theory by broadening its scope conditions. (Alternatively, we might find that we need to narrow
the scope conditions as we iterate between theory refinement and data analysis.) Case selec-
tion to test scope conditions follows the same information-theoretic principles described above;
in this situation, we seek to study additional cases that will help adjudicate most efficiently
between the refined hypothesis and salient rivals.
With respect to Humphreys and Jacobs’ (2015) Bayesian approach, we share a similar overar-
ching goal of finding cases with high “probative value.” However, we operationalize probative
value using a logarithmic scale, as is standard in information theory, not a linear scale. We
diverge more sharply from Humphreys and Jacobs’ approach in eschewing discussion of the pro-
portion of causal types in a population and instead focusing directly on the causal hypotheses
of interest and their stated scope of applicability.
The remainder of the paper proceeds as follows. Section 2 overviews recent literature on case
selection and provides a map through the terminological terrain that highlights fault-lines of
debate and analytic lacunae. Section 3 begins with a brief overview of the conceptual distinc-
tions between Bayesianism and frequentism that give rise to different approaches to inference
and case selection. We then present the basic mechanics of Bayesian inference that we will use
in Section 4, where we define expected information gain and develop our approach to case selec-
tion. Using the concept of expected information gain, we provide a Bayesian conceptualization
of critical cases and a precise mathematical statement of our ex-ante expectations about test
strength. Section 5 elaborates practical heuristic guidelines that arise from our Bayesian per-
spective. Finally, Section 6 applies our Bayesian framework to critique literature on most-likely
cases and least-likely cases. We explicate ambiguities and analytic pitfalls in the way they have
been conceptualized, from Eckstein (1975) through more contemporary treatments, and argue
that these notions should be replaced by expected information gain, which provides the only
sensible probabilistic way to evaluate how strong a test a given case will provide.
2. TOURING THE TERRAIN
KKV (1994) and the response from RSI (2004) stimulated a surge of interest in the logic of
case selection that has not abated. While this literature contains many helpful suggestions and
insights, charting the landscape of case selection strategies is challenging given the large number
of approaches that have been proposed, adjusted, and reinterpreted, as well as the overlapping
and sometimes contested purposes these strategies aim to serve. Table 1 classifies over two-
dozen case selection strategies in an effort to elucidate commonalities and disagreements. While
the table is not comprehensive, we have aimed to include the most familiar and widely discussed
strategies from recent literature. We organize these strategies according to six primary guiding
principles: outliers & extremes, model-concordance, variation, representativeness, control, and
informativeness.1 Several of these organizing principles are grounded in frequentist thinking—
1 Two of the strategies fall under more than one guiding principle. We have placed Lieberman’s (2005) “on-the-line” cases under model-concordance as the primary principle, although this strategy also espouses variation, and we double-list Seawright and Gerring’s (2008) “typical cases” under both model-concordance and representativeness.
variation and representativeness are central concerns in orthodox statistical inference, and the
model-concordance strategies, as well as many of the outliers & extremes strategies, aim to
incorporate case studies and large-N analysis within an (at least implicitly) frequentist approach
to multi-method research.2 The principle of control is grounded in an experimental, potential
outcomes approach to inference. In contrast, the strategies grouped under the principle of
informativeness come closest to the Bayesian approach that we will elaborate in Section 4.
While our six guiding principles are not mutually exclusive, they represent our best effort
toward a taxonomy of the literature.
Table 1 highlights several sources of potential confusion and/or lack of consensus. First, dif-
ferent terms are sometimes used for strategies that are closely analogous and/or overlapping.
Seawright and Gerring’s (2008) “diverse cases,” Lieberman’s (2005) “on-the-line” cases, and
Seawright and Gerring’s “typical cases” share many similarities. “On-the-line” cases appear
to combine “typical” and “diverse” selection strategies—the latter two categories as defined
by Seawright and Gerring are not disjoint. Likewise, there is substantial conceptual overlap
between Seawright and Gerring’s “extreme” and “diverse” case selection strategies in that both
aim to include a wide range of variation along a key variable.
Second, we find disagreements regarding what ends analogous strategies serve. For “extreme
value cases” on the independent variable, Van Evera (1997) designates the purpose as theory
testing, whereas Seawright and Gerring (2008) assert that this approach is for developing the-
ory. Among the model-concordance strategies, Gerring’s (2007) “pathway cases” and Goertz’s
(2016:61) “causal mechanism cases” are very similar in looking for cases that manifest both
the independent variable(s) X and the dependent variable Y of interest while minimizing the
possibility of overdetermination or the presence of confounders—aside from differences in the
technical criteria advocated for identifying these cases in relation to an established large-N
cross-case relationship.3 Yet Goertz (2016:56) implicitly frames “causal mechanism” cases as
aiding theory testing by: “confirm[ing] that the proposed causal mechanism is in fact working
for this observation,” whereas Gerring (2007:238) holds that “pathway cases” are for developing
theory: “not to confirm or disconfirm a causal hypothesis (because that hypothesis is already
well established) but rather to clarify a hypothesis. More specifically, the case study serves to
2 Goertz also discusses applications within QCA.
3 See Goertz (2016:214-17), Gerring (2007:242-46).
elucidate causal mechanisms.”
Third, the literature offers multiple different case selection strategies for any given purpose,
without clear guidelines regarding which strategies are applicable or optimal under particular
circumstances, and often without clearly explaining the inferential logic through which a given
strategy is thought to achieve its designated purpose(s). Table 1 includes more than 13 strategies
with the designated goal of theory testing, and more than six strategies with the designated
goal of theory development; both the theory-testing and the theory-developing strategies are
interspersed across the six overarching case-selection principles. This issue is salient not only
at the aggregate level of the case selection literature, but also within individual studies. For
example, Van Evera (1997) presents eleven case selection criteria,4 seven of which serve for
theory building and eight of which aid theory testing. While Van Evera (1997) offers many
useful suggestions that are often refreshingly grounded in common sense, from a methodological
perspective, the logic of these strategies and how they differ is not always adequately elucidated.
Likewise, Seawright and Gerring’s (2008) seven strategies overlap in their attributed uses: four
allow theory development while six can be used for testing. Here too, the rationale for why a
given strategy serves the designated purpose is not fully expounded; the authors’ emphasis is
instead on how to implement each selection technique.
Table 1 also highlights a number of additional points of contention in the literature. With regard
to outliers & extremes, should we focus on all off-line cases (Seawright and Gerring’s (2008)
“deviant cases”), or only those with (X,¬Y ) (Goertz’s (2016) “falsification-scope cases”)? Sim-
ilarly, with regard to model-concordance, should we focus just on cases with (X,Y ) (Goertz’s
(2016) “causal mechanism cases”), or cases with (¬X,¬Y ) as well (Lieberman’s (2005) “on-the-
line cases,” Seawright and Gerring’s (2008) “typical cases”)? When examining (X,Y ) cases,
there is also debate about whether over-determination should be avoided (Goertz 2016, Gerring
2007), or whether it is in fact preferable (Slater), or at least unproblematic (Beach and Pedersen
2016:17), to choose cases for which other explanations are plausible, beyond the hypothesis that
X alone causes Y . Turning to strategies classified under informativeness, we find substantial
lack of clarity or consensus in definitions and explications of critical cases. We will discuss the
literature on most-likely and least-likely cases in Section 6 after we develop a precise Bayesian
definition of a critical case.
4 Not all of these appear in Table 1.
Finally, the literature leaves open several important questions. The one point of consensus
across the literature appears to be that case selection requires first enumerating the population,
but how should we proceed when the “population” of cases is not well-defined in advance, cannot
be precisely enumerated or delineated, or does not remain stable over time? Such situations
may well be the norm rather than the exception in social science, even though frequentist
statistical inference requires that all elements of the sampling or data generation process must
be articulated in advance. Goertz (2016:53) advises that “it is almost always a good idea to
start with the complete, if provisional, list,” but it is difficult to conceive how a list could be
both.
Similarly, how should cases be selected when scores on key variables cannot easily be assessed
for even a moderate number of cases, let alone for what is considered the full population? In
many situations, scoring independent and dependent variables for a case may in itself require
in-depth research to obtain new information (as well as refinement of theory and concepts in
light of that new information in order to define the causal variables of interest). Yet in the
absence of readily-available and reliable large-N datasets, most of the strategies discussed in the
multi-method literature simply are not applicable. The qualitative methods literature does not
provide satisfactory answers here either. For instance, how can we identify a “crucial case” or a
“divergent predictions case” ahead of time, before actually conducting an in-depth case study?
Many typologies for clarifying case selection can only be effectively applied retrospectively
during case analysis, yet at that stage, such classifications are largely irrelevant for causal
inference in the light of actual case data obtained. We will address these questions from a
Bayesian perspective after presenting the basics of Bayesian inference.
3. INTRODUCTION TO BAYESIAN REASONING FOR QUALITATIVE RESEARCH
This section begins by clarifying the differences in the way Bayesianism and frequentism—
the epistemological framework that underpins classical statistics—conceptualize and apply
probability (Section 3.1). We then introduce Bayes’ rule and explain how to apply Bayesian
reasoning in qualitative research (Section 3.2), with a brief example (Section 3.3).5 Section 3.4
5 See Fairfield & Charman (2017) for more detailed guidance on applying Bayesian reasoning to qualitative research.
introduces a simple additive form of Bayes’ rule and defines the weight of evidence, an intuitive
concept introduced by I.J. Good that measures how strongly an evidentiary observation sup-
ports a given hypothesis over rivals. Section 3.5 then draws on this concept to explicate how
Bayesian inference proceeds when analyzing more than a single case.
3.1. Conceptualizing Probability
Bayesianism and frequentism differ first and foremost in how they define probability. Frequen-
tism conceptualizes probability as a limiting proportion in an infinite series of random trials or
repeated experiments. For example, the probability that a coin lands “heads” on a given toss
is equated with the fraction of times it turns up heads in an infinite sequence of throws. In this
view, probability reflects a state of nature—e.g., a property of the coin (fair or weighted) and
the flipping process (random or rigged). In contrast, Bayesianism understands probability as
a degree of belief based on a state of knowledge. The probability an individual assigns to the
next toss of a coin represents her strength of confidence about the outcome after taking into
account all relevant information she knows. Two observers watching the same coin flip would
rationally assign different probabilities to the proposition “the next toss will produce heads” if
they have different information about the coin or tossing procedure. For example, an observer
who has had the opportunity to examine the coin in advance and discerns that it is weighted
in favor of heads would rationally place a higher probability on that outcome than an observer
who is not privy to such information.
The Bayesian notion of probability offers multiple advantages—most centrally: it fits better
with how people intuitively reason under uncertainty; it can be applied to any proposition,
including causal hypotheses, which would be nonsensical from a frequentist perspective; it is
well suited for explaining unique events or working with a small number of cases, without need
to sample from a larger population; and inferences can be made from limited amounts of infor-
mation, using any relevant evidence (e.g., open-ended interviews, historical records), above and
beyond data generated from stochastic processes. These features make Bayesianism especially
appropriate for qualitative research, which evaluates competing explanations for complex so-
ciopolitical phenomena using evidence that cannot naturally be conceived as random samples
(e.g., information from expert informants, legislative records, archival sources). Strictly speaking, “frequentist inference is inapplicable to the nonstochastic setting” (Western & Jackman 1994:413).
The school of Bayesianism we advocate as the foundation for scientific inference—logical
Bayesianism—seeks to represent the rational degree of belief we should hold in propositions
given the information we possess, independently of hopes, subjective opinion, or personal
predilections. In Boolean logic, truth-values of all propositions are known with certainty. But
in most real-world contexts, we have limited and/or imperfect information, and we are always
at least somewhat unsure about whether a proposition is true or false. Bayesian probability is
an “extension of logic” (Jaynes 2003) in that it provides a prescription for how to reason when
we have incomplete knowledge and are thus uncertain about the truth of propositions. When
degrees of belief assume limiting values of zero (impossibility) or one (certainty), Bayesian
probability automatically reduces to Boolean logic.
3.2. Bayesian Inference
Intuitively speaking, Bayesian reasoning is simply a process of updating our views about which
hypothesis best explains the phenomena or outcomes of interest as we learn additional infor-
mation. We begin by identifying two or more alternative hypotheses. The literature we have
read along with our own previous experiences and observations give us an initial sense, or
“prior” view, about how plausible each hypothesis is—e.g., before heading into the field or the
archives, do we believe the median-voter theory is a much stronger contender for explaining lev-
els of redistribution in democracies than approaches focusing instead on the power of organized
actors including business associations and social movements? Or are we highly dubious that
the median-voter hypothesis provides an accurate explanation for the politics of inequality? As
our research proceeds, we ask whether the evidence we gather fits better with one hypothesis
as opposed to another. When we have finished collecting data, we arrive at a “posterior” view
regarding which hypothesis is most plausible. Bayes’ rule provides a mathematical framework
for how we should revise our confidence in a given hypothesis, considering both our previous
knowledge and the information we discovered during our research. If we remain too uncertain
about which hypothesis performs best after analyzing the data in hand, we may continue our
research and collect additional evidence.
Stated more formally, Bayesian inference generally proceeds by assigning prior probabilities to
salient rival hypotheses.6 These prior probabilities represent our rational degree of belief (or
confidence) in the truth of each hypothesis taking into account all relevant initial knowledge, or
background information (I), that we possess. Symbolically, we represent the prior probability
for hypothesis H as P (H | I). This follows the conventional notation whereby a conditional
probability P (A |B) represents the rational degree of belief that we should hold in proposition
A given proposition B—that is, how likely is A if we take proposition B to be true. We then
consider evidence E obtained during the investigation at hand. The evidence includes all obser-
vations (beyond our background information) that bear on the plausibility of the hypotheses.
Finally, we employ Bayes’ rule to update our degree of confidence in hypothesis H in light of
evidence E. Because inference always involves comparing hypotheses, we will work with the
odds-ratio form of Bayes’ rule:
posterior odds = prior odds × likelihood ratio:

$$\frac{P(H_i \mid E\,I)}{P(H_j \mid E\,I)} = \frac{P(H_i \mid I)}{P(H_j \mid I)} \times \frac{P(E \mid H_i\,I)}{P(E \mid H_j\,I)}, \qquad (1)$$
The posterior odds on the left-hand side of equation (1) tell us how much more plausible one
hypothesis Hi is relative to a rival hypothesis Hj in light of the evidence learned as well as the
background information we initially brought to the problem, while the prior odds on the right-
hand side is the plausibility of Hi compared to Hj based only on our background information.
For posterior odds and prior odds, we can think in terms of how willing we would be to bet
in favor of one hypothesis vs. the other. The likelihood ratio7—the second factor on the right-
hand side of (1)—represents how plausible, or expected, the evidence is under one hypothesis
relative to the other, or in other words, how likely the evidence would be if we assume Hi is
true, compared to how likely the evidence would be if we instead assume Hj is true. According
to Bayes’ rule, how much we end up favoring one hypothesis over another depends on both our
prior views and the extent to which the evidence weighs in favor of one hypothesis over another.
Assessing likelihood ratios P (E |Hi I)/P (E |Hj I) is therefore the critical inferential step that
tells us whether evidence E should make us more or less confident than we were initially in
one hypothesis relative to a rival. The likelihood ratio can be thought of as the probability of
observing E in a hypothetical world where Hi is true, relative to the probability of observing E
6 As we elaborate elsewhere, it is always possible to begin with a set of causal factors or causal hypotheses that are non-rival and create from them a set of hypotheses that are mutually exclusive (Fairfield & Charman 2017).
7 What we call the likelihood ratio is sometimes referred to as the Bayes factor.
in an alternative world where Hj is true. When evaluating likelihoods of the form P (E |Hi I), we must in effect (a) suppress our awareness that E is a known fact, and (b) suppose that
Hi is correct, even though the actual status of the hypothesis is uncertain. Recall that in the
notation of conditional probability, everything that appears to the right of the vertical bar is
either known, or assumed as a matter of conjecture when reasoning about the probability of
the proposition to the left of the bar. In qualitative research, we need to “mentally inhabit the
world” of each hypothesis (Hunter 1984) and ask how surprising (low probability) or expected
(high probability) the evidence E would be in each respective world. If E seems less surprising
in the “Hi world” relative to the “Hj world,” then that evidence increases our odds on Hi vs.
Hj . Again, we gain confidence in a given hypothesis to the extent that it makes the evidence
we observe more plausible compared to rivals.
3.3. Example: State-Building in Latin America
To illustrate how Bayesian reasoning can be applied in qualitative social science, suppose we
are interested in whether the resource-curse hypothesis or the warfare hypothesis (assumed mutually exclusive) provides a better explanation for institutional under-development:
HR = Mineral resource dependence is the central factor hindering institutional development
in Latin America. Mineral wealth makes collecting taxes irrelevant and creates incentives for
subsidies and patronage, instead of building administrative capacity.
HW = Absence of warfare is the central factor hindering institutional development in Latin
America. Without external threats that necessitate effective military defense, leaders lack in-
centives to collect taxes and build administrative capacity.
For simplicity, suppose we have no relevant background knowledge about state-building. We
would then reasonably assign even prior odds, such that our prior log-odds equal zero. We now
learn the following information about Peru:
E1 = Peru faced persistent military threats following independence, its economy was long
dominated by mineral exports, and it never developed an effective state.
Intuitively, E1 strongly favors the resource-curse hypothesis. Applying Bayesian reasoning, we
must evaluate the likelihood ratio P (E1 |HR I)/P (E1 |HW I). Imagining a world where HR is
the correct hypothesis, mineral dependence in conjunction with weak state capacity is exactly
what we would expect, and external threats are not surprising given that a weak state with
mineral resources could be an easy and attractive target for invasion. In the alternative world
of HW , E1 would be quite surprising; something very unusual, and hence improbable, must
have happened for Peru to end up with a weak state if the warfare hypothesis is nevertheless
correct, because weak state capacity despite military threats contradicts the expectations of the
theory. Because E1 is much more probable under HR relative to HW—that is, P (E1 |HR I) is much greater than P (E1 |HW I)—the likelihood ratio is large, and it significantly boosts our
confidence in the resource-curse hypothesis.
Our posterior log-odds in light of E1, which now strongly favor HR over HW , in turn become
our prior log-odds when we move forward to consider an additional evidentiary observation
E2. Updating proceeds iteratively in this manner until we decide to terminate our research and report our findings, or until a new or refined hypothesis comes to light. In the latter situation,
we would need to go back and set up a di↵erent inferential problem that compares the revised
set of hypotheses in light of our background information and all of the evidence obtained thus
far.
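The iterative updating just described can be mimicked numerically. In this sketch the likelihood ratios are purely illustrative placeholders of our own, not assessments drawn from the state-building literature:

```python
# Start from even odds on HR vs. HW (no relevant background knowledge).
odds = 1.0

# Illustrative likelihood ratios P(Ei | HR I) / P(Ei | HW I) for two
# successive observations; the numbers are hypothetical.
likelihood_ratios = [0.6 / 0.05,  # E1: strongly favors HR
                     1.5]         # E2: mildly favors HR

for lr in likelihood_ratios:
    odds *= lr  # posterior odds after Ei become prior odds for E(i+1)

# odds now reflect all evidence analyzed so far
# (roughly 18:1 in favor of HR with these made-up numbers)
```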
3.4. Bayes’ Rule in Log-Odds Form
If we take the logarithm of both sides of Bayes’ rule (1), we obtain a particularly simple, additive relationship that is easy to remember and easy to use:

$$\log\!\left[\frac{P(H_j \mid E\,I)}{P(H_k \mid E\,I)}\right] = \log\!\left[\frac{P(H_j \mid I)}{P(H_k \mid I)}\,\frac{P(E \mid H_j\,I)}{P(E \mid H_k\,I)}\right] = \log\!\left[\frac{P(H_j \mid I)}{P(H_k \mid I)}\right] + \log\!\left[\frac{P(E \mid H_j\,I)}{P(E \mid H_k\,I)}\right],$$

$$\text{posterior log-odds} = \text{prior log-odds} + \text{weight of evidence}, \qquad (2)$$
where we have used the fundamental property that the logarithm of a product equals the sum of the logarithms. The weight of evidence (Good 1983), which is just the logarithm of the likelihood ratio, conveys the probative value of the evidence—namely, how much it supports one hypothesis compared to another (setting aside our prior beliefs about the hypotheses). We will denote the weight of evidence in favor of hypothesis Hj relative to hypothesis Hk as:

$$\mathrm{WoE}\,(H_j : H_k) = \log\!\left[\frac{P(E \mid H_j\,I)}{P(E \mid H_k\,I)}\right]. \qquad (3)$$
As the term suggests, the weight of evidence is additive. In particular, when the aggregate or total evidence E can be decomposed into a conjunction of separate pieces, such that E = (E1E2 · · ·EN), the overall or net weight of evidence (3) can itself be broken down into the sum of weights for each distinct piece of evidence:

$$\mathrm{WoE}\,(H_j : H_k) = \log\!\left[\frac{P(E_N \mid E_1 E_2 \cdots E_{N-1}\,H_j\,I)}{P(E_N \mid E_1 E_2 \cdots E_{N-1}\,H_k\,I)} \cdots \frac{P(E_2 \mid E_1\,H_j\,I)}{P(E_2 \mid E_1\,H_k\,I)}\,\frac{P(E_1 \mid H_j\,I)}{P(E_1 \mid H_k\,I)}\right]$$
$$= \log\!\left[\frac{P(E_N \mid E_1 E_2 \cdots E_{N-1}\,H_j\,I)}{P(E_N \mid E_1 E_2 \cdots E_{N-1}\,H_k\,I)}\right] + \cdots + \log\!\left[\frac{P(E_2 \mid E_1\,H_j\,I)}{P(E_2 \mid E_1\,H_k\,I)}\right] + \log\!\left[\frac{P(E_1 \mid H_j\,I)}{P(E_1 \mid H_k\,I)}\right]$$
$$= \mathrm{WoE}_N\,(H_j : H_k,\, E_1 \cdots E_{N-1}) + \cdots + \mathrm{WoE}_2\,(H_j : H_k,\, E_1) + \mathrm{WoE}_1\,(H_j : H_k), \qquad (4)$$
where our notation denotes that for each successive piece of evidence we must take into ac-
count possible logical dependencies with previously-analyzed evidence, a task that we discuss
elsewhere (Fairfield & Charman 2017).8
Beyond the simplicity of equation (2) and the additivity of weights of evidence, there are deeper
reasons for using logarithms. As explained in Fairfield & Charman (2017), a logarithmic scale
allows us to better handle very large or very small probabilities and affords better consistency
with human sensory perception, when compared to a linear scale.
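A short sketch illustrates the additivity property in equations (3) and (4). We use base-10 logarithms here for concreteness (the choice of base is a convention), and the likelihood values are hypothetical; for simplicity the two observations are assumed conditionally independent:

```python
import math

def weight_of_evidence(p_e_given_hj, p_e_given_hk):
    """Equation (3): WoE(Hj : Hk) = log of the likelihood ratio
    (base-10 logs used here)."""
    return math.log10(p_e_given_hj / p_e_given_hk)

# Hypothetical likelihoods for two separate pieces of evidence.
woe1 = weight_of_evidence(0.9, 0.3)
woe2 = weight_of_evidence(0.6, 0.4)

# Additivity (equation 4): summing the individual weights equals
# taking the log of the product of the likelihood ratios.
total_woe = woe1 + woe2
```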
3.5. Bayesian Inference with Multiple Cases
Bayesian analysis is usually associated with single case studies (“process-tracing” and “within-
case” analysis), yet Bayesian inference drawing on multiple case studies proceeds in exactly
the same manner. We begin with rival hypotheses that include clearly specified scope condi-
tions. Any case that falls within the scope of the hypotheses contributes some overall weight
of evidence, which we obtain by adding up the weights of evidence for each salient observation
pertaining to that case. We then sum the aggregate weights of evidence associated with each
case (C1, C2, etc.) to obtain a total (multi-case) weight of evidence. Applying Bayes’ rule (2),
we have:
$$\text{posterior log-odds} = \text{prior log-odds} + \mathrm{WoE}_{C_1} + \mathrm{WoE}_{C_2} + \cdots + \mathrm{WoE}_{C_N}. \qquad (5)$$
8 Fairfield & Charman (2017) also provides guidelines for quantifying prior log-odds and weights of evidence.
To the extent that the N cases already studied increase our confidence in the truth of one
hypothesis relative to rivals, as reflected in our posterior log-odds, we become more confident
that this hypothesis successfully explains other, as-yet unobserved cases that fall within its
scope. This is an iterative process, where we can always add additional cases and revise theory
and scope conditions.
Returning to the state-building example, notice first that the two hypotheses we considered
include scope conditions that restrict their predictions to Latin American countries—in other
words, these hypotheses say nothing at all about institutional under-development in, for ex-
ample, Eastern Europe. Suppose a thorough investigation of the Peruvian case yields a weight
of evidence WoEP in favor of HR over HW . We then proceed to analyze the Venezuelan case
and obtain a weight of evidence of WoEV in favor of HR over HW . Starting from even prior
odds, our posterior log-odds then favor HR by an amount (WoEP + WoEV ), which we will
assume for illustration corresponds to a very strong degree of confidence in the resource-curse
relative to the warfare hypothesis. Moving forward, we accordingly have a very strong degree
of confidence that the resource curse will explain institutional under-development better than
the warfare hypothesis for any other Latin American case. Of course, we are conducting prob-
abilistic inference with incomplete information, so we might discover evidence in another case
that leads us to revise our views about these hypotheses.
If we wish to generalize our hypotheses beyond Latin America, we begin by articulating revised
versions with broader scope conditions; for example:
H′_R = Mineral resource dependence is the central factor hindering institutional development
in the global south. Mineral wealth makes collecting taxes irrelevant and creates incentives for
subsidies and patronage, instead of building administrative capacity.
H′_W = Absence of warfare is the central factor hindering institutional development in the
global south. Without external threats that necessitate effective military defense, leaders lack
incentives to collect taxes and build administrative capacity.
Because Peru and Venezuela also fall within the scope of these generalized hypotheses, our
previous study of these two cases gives us a very strong degree of confidence in H′_R vs. H′_W, just
as these cases led us to very strongly favor HR over HW. But as a next step, we would want
to examine cases from another developing region—perhaps India or Egypt. While studying
additional Latin American cases will contribute to our inference, seeking cases from other
developing regions will be the most effective way to assess how well H′_R fares against H′_W.
4. OPTIMAL BAYESIAN CASE SELECTION
Logical Bayesianism provides a comprehensive, information-theoretic approach to choosing
cases for hypothesis testing. In principle, the single overarching case-selection criterion should
entail maximizing anticipated informativeness; the more we expect to learn, the more “criti-
cal,” or inferentially decisive, the case becomes. More precisely, we seek cases that maximize
expected information gain—the anticipated weight of evidence in favor of whichever hypothesis
under consideration provides the best explanation. We begin with a general introduction to the
information-theoretic approach, where we invoke an analogy to efficient questioning. Readers
who prefer to skip mathematical details may proceed directly from this introduction to Section
4.4, where we present some practical caveats and principled insights, given that in practice,
calculating expected information gain in anything but a very rough approximation generally
will not be feasible.
4.1. Introduction to the Information-Theoretic Perspective
Logical Bayesianism is closely connected to information theory, which provides a framework for
efficient questioning. The idea is to figure out which cases we should select in order to adjudicate
between rival hypotheses as quickly as possible. We can think of experimentation or observation
as communication with the physical or social world; the observable features of the world are
messages transmitted—usually with noise—to the researcher, who endeavors to decode the
signals. The simplest scenarios involve hypotheses that make deterministic predictions, and
observations with negligible measurement error that reveal one among a finite number of possible
evidentiary outcomes. The real world is rarely so straightforward, but such scenarios illustrate
some of the salient issues that arise in case selection and suggest some general strategies.
Consider first the classic game of “twenty questions,” where we ask a series of yes-or-no queries
to figure out what subject a friend has in mind. Here we have an essentially infinite number of
possible hypotheses as to what the subject may be, with evidentiary outcomes that entail either
a “yes” or a “no” answer. To efficiently reduce uncertainty, instead of asking questions designed
to eliminate one specific possibility at a time (e.g., Barack Obama, duck-billed platypus, earl-grey
gelato) we should aim to ask questions that halve the remaining possible hypotheses at each
stage (e.g., something like: “is or was the subject a living organism?”). This strategy may not
be optimally informative with respect to any one hypothesis, but it is optimal for
winnowing down sets of hypotheses. Likewise, optimal case selection should not be conducted
with a single hypothesis in mind, but instead with the aim of efficiently distinguishing between
salient rival explanations.
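A quick simulation makes the efficiency gap concrete. Under the halving strategy, the number of questions grows only logarithmically in the number of candidate subjects, whereas eliminating one specific guess at a time grows linearly (the numbers below are illustrative):

```python
import math
import random

def questions_halving(n_hypotheses):
    """Each question splits the surviving candidates roughly in half."""
    return math.ceil(math.log2(n_hypotheses))

def questions_one_by_one(n_hypotheses, rng):
    """Eliminate one specific guess per question until the target is hit."""
    target = rng.randrange(n_hypotheses)
    guesses = list(range(n_hypotheses))
    rng.shuffle(guesses)
    return guesses.index(target) + 1

print(questions_halving(2 ** 20))  # 20 questions suffice for ~a million subjects

rng = random.Random(0)
avg = sum(questions_one_by_one(1000, rng) for _ in range(2000)) / 2000
print(450 < avg < 550)  # True: one-at-a-time needs about n/2 questions on average
```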
As a second example, consider Wason’s (1968) “selection task,” or four-card problem, which
in contrast to “twenty questions” involves a finite number of hypotheses, as well as a finite
number of noiseless evidentiary outcomes. Four cards are displayed on a table, as in Figure
1. We know that each card has a number on one side, and a color on the other side that can
be either red or blue. We are given a hypothesis about the relationship between numbers and
colors on the cards: H = If a card has an even number on one face, its opposite face must be
red. Given what we observe on the table—a 3, a blue face, an 8, and a red face, which card(s)
should we turn over to definitively test whether this hypothesis is true or false? The goal is to
flip as many cards as needed to be sure, but no more.
FIG. 1 Wason’s (1968) “selection task”
To solve the problem, first notice that H̄ = It is not the case that if a card has an even number
on one face, its opposite face must be red, logically implies the proposition: There is at least one
even card that is blue on the other side. Accordingly, we need to turn over the 8 and check its
color. If this card is blue, then H is false and we are done. If it is instead red, then H becomes
more plausible, but we need to turn over another card before we can reach a definitive
conclusion. The second card that we need to check is the one with the blue face. If it has an even
number on the other side, then H is false. If it instead has an odd number, then H is true
(assuming that the 8 card turned out to be red on its other face). It does not matter whether
we flip the blue card or the 8 card first. The important point is that these are the two cards
that will provide useful information for figuring out whether the hypothesis is true or false.
Nothing is learned by flipping the 3, because neither H nor H̄ says anything about what color
we should expect to find on the back of an odd card. We are also wasting time and effort if we
flip the red card, because both possible outcomes (odd or even) are equally consistent with H
and H̄—notice that H only tells us what to expect if we know one side to be even; it does not
tell us what we should find on the opposite side of a red card.
The four-card problem provides an analogy for purposive sampling, where we make decisions
about which cases (cards in this instance) to investigate based on the hypotheses under consid-
eration and the information we are likely to learn, rather than selecting at random. Moreover,
this example illustrates that random sampling—the guiding principle within frequentism—can
actually be a sub-optimal strategy for case selection. If we were to randomly choose two cards
to flip, the probability of drawing the blue card and the 8 card—the only two cards that provide
useful information for assessing the hypotheses—is only 1/6. It turns out that in many situa-
tions, random sampling is sub-optimal—we can do better by employing an information-theoretic
approach that directs us to identify cases with maximum informativeness.
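The reasoning above can be encoded directly: a card is informative exactly when at least one of its possible hidden faces would bear on H, and the 1/6 figure follows from counting two-card draws. A minimal sketch:

```python
from itertools import combinations
from fractions import Fraction

cards = ["3", "blue", "8", "red"]

def hidden_faces(card):
    """The unseen side of a number card is a color, and vice versa."""
    return ["red", "blue"] if card in ("3", "8") else ["even", "odd"]

def can_falsify_H(card, hidden):
    """H: if a card has an even number on one face, its opposite face is red."""
    return (card == "8" and hidden == "blue") or (card == "blue" and hidden == "even")

# A card is informative iff some possible hidden face bears on H.
informative = {c for c in cards if any(can_falsify_H(c, h) for h in hidden_faces(c))}
print(sorted(informative))  # ['8', 'blue']

# Probability that a random two-card draw picks exactly those two cards:
draws = list(combinations(cards, 2))
p = Fraction(sum(1 for d in draws if set(d) == informative), len(draws))
print(p)  # 1/6
```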
Wason’s (1960) “sequence task” provides a third example, involving the following setup. We
are asked to infer a particular rule (determined by the experimenter) that is used to generate
sequences of three integers. We are given one instance of a sequence satisfying the rule (e.g.,
2, 4, 6). We can then propose additional three-number sequences as test cases and will be told
whether or not each such test sequence satisfies the rule. This example is an analogue for infer-
ence involving an infinite number of potential “cases” with which to assess an infinite number
of possible hypotheses, none of which can be definitively confirmed.9 While the sequence task
might more naturally serve as an analogy for experimental rather than observational research,
since we generate the “cases” ourselves, some useful insights for case selection nevertheless
emerge from this problem.
Given the goal of efficiently narrowing down the field of candidate hypotheses with as few
test sequences as possible (i.e., examining as few cases as possible), choosing at random would
once again be suboptimal—in fact, that approach would be about the most inefficient strategy
9 In the original experiment, Wason (1960:131) told participants: “When you feel highly confident that you have discovered it [the rule], and not before, you are to write it down and tell me what it is.” If the proposed rule was correct, the experiment ended, so in this sense the participant’s hypothesis was confirmed in practice. Otherwise Wason (1960:132) allowed the participant to continue inventing hypotheses and proposing test sequences.
imaginable. Proposing trial sequences that conform to a pet hypothesis is also suboptimal.
Wason (1960) observed that many participants followed this kind of approach and succumbed
to confirmation bias—they became too confident too quickly in incorrect, often overly-complex
rules. A Popperian perspective would suggest that we should aim to falsify hypotheses rather
than confirm them. But that approach would also be inefficient, just as eliminating possibilities
one by one in the twenty-questions game is inefficient.
From a Bayesian or information-theoretic perspective, we should try to choose the test cases
that discriminate most effectively between rival hypotheses. A natural strategy for the sequence
task entails generating an initial set of ten or so reasonably plausible hypotheses (based per-
haps on an assumption that if the rule were too complicated, the experimenter would have
trouble judging in a timely manner whether proposed trial sequences obeyed it or not), orga-
nizing them into families or classes by identifying more general hypotheses and sub-hypotheses
that are more specific or more restricted instances thereof, and then choosing test cases that
differentiate between classes of hypotheses. This approach helps eliminate multiple candidate
rules with a single test case, regardless of whether we discover that our test sequence fits or
violates the experimenter’s rule. For example, if we learn that (6, 4, 2) does not fit, we can elim-
inate hypotheses postulating “all triples of integers,” “all even integers,” and “all arithmetic
sequences”10 along with more restricted hypotheses that fall within that class (e.g., “sequences
where each integer differs by ±2 from its predecessor”). If we instead learn that (6, 4, 2) satisfies
the sequence-generation rule, we can eliminate hypotheses postulating “three increasing
integers” and any more specific variants thereof (e.g., “sequences where each integer differs by
+2 from its predecessor”), and so forth. If none of our initial hypotheses survive after proposing
several test sequences, we can go back, invent more possibilities, and repeat this process.
Likewise, case selection in social science can be an iterative process that proceeds alongside
theory development.
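This class-based elimination strategy can be sketched by representing candidate rules as predicates and filtering them against the oracle's verdict on a test sequence; the particular rules below mirror those mentioned in the text:

```python
rules = {
    "any integer triple":  lambda s: True,
    "all even":            lambda s: all(x % 2 == 0 for x in s),
    "arithmetic sequence": lambda s: s[1] - s[0] == s[2] - s[1],
    "steps of ±2":         lambda s: abs(s[1] - s[0]) == 2 and abs(s[2] - s[1]) == 2,
    "steps of +2":         lambda s: s[1] - s[0] == 2 and s[2] - s[1] == 2,
    "increasing":          lambda s: s[0] < s[1] < s[2],
    "increasing evens":    lambda s: s[0] < s[1] < s[2] and all(x % 2 == 0 for x in s),
}

def surviving(rules, test_seq, fits):
    """Keep only the rules consistent with the oracle's verdict on a test sequence."""
    return {name for name, rule in rules.items() if rule(test_seq) == fits}

# If (6, 4, 2) does NOT satisfy the secret rule, every rule accepting it is out:
print(sorted(surviving(rules, (6, 4, 2), fits=False)))
# ['increasing', 'increasing evens', 'steps of +2']

# If (6, 4, 2) DOES satisfy it, the increasing-only rules are eliminated instead:
print(sorted(surviving(rules, (6, 4, 2), fits=True)))
# ['all even', 'any integer triple', 'arithmetic sequence', 'steps of ±2']
```

Either verdict prunes several hypotheses at once, which is what makes the test sequence informative.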
In sum, an information-theoretic perspective reveals that (i) random selection is generally not an
optimal strategy for case selection,11 and (ii) cases should not be chosen in an effort to confirm
a given hypothesis, nor to submit a single hypothesis to repeated attempts at falsification.
10 Such sequences are characterized by a constant difference between consecutive terms.
11 Jaynes (2003:532), an outspoken advocate of logical Bayesianism in the physical sciences, asserts much more generally that: “Whenever there is a randomized way of doing something, there is a nonrandomized way that yields better results from the same data, but requires more thinking.”
Instead, we should aim to examine cases for which rival hypotheses, or sets of rival hypotheses,
make the most divergent evidentiary predictions, or in other words, those cases we expect to be
most informative. In Bayesian terms, we want to choose cases that we anticipate will provide
a large weight of evidence—the quantity in Bayes’ rule that governs updating.
4.2. Discrimination Information
The first step toward quantifying anticipated informativeness entails recognizing that we cannot
know for certain what evidence we will discover before we actually investigate a given case C. At
least in principle, however, we could use our background information to anticipate the different
sorts and combinations of clues we might find, estimate their respective weights of evidence for
the hypotheses that we are considering, and then average over all of the anticipated evidentiary
possibilities.
Proceeding formally, we begin with the assumption (included in the background information I) that we have a finite, mutually exclusive and exhaustive set of hypotheses {Hj}.12 The logical
negation of any one of the hypotheses is then the disjunction over the remaining alternatives:

\bar{H}_j = \bigvee_{\ell \neq j} H_\ell = H_1 \text{ or } \cdots \text{ or } H_{j-1} \text{ or } H_{j+1} \text{ or } \cdots \text{ or } H_N. \quad (6)
We also assume that we have delineated a complete and mutually exclusive set of possible
evidentiary outcomes for the case, {Ek}. Each possible Ek represents the composite information
that could be learned from the case, given a research strategy SC . To illustrate with a simple
example, suppose our cases are democratic countries and our data gathering strategy (SC)
entails soliciting interviews with the president and the opposition leader and asking each a
specific yes/no question. One possible evidentiary outcome would be E1 = President says yes
and opposition says no. A second outcome could be E2 = President could not be interviewed and
opposition says no. Assuming that each of the two informants is associated with three possible
“clue outcomes” (yes, no, or could not be interviewed), we would have a set of 3 × 3 = 9 possible
evidentiary outcomes Ek. (In real-world case studies, the possible set of evidentiary outcomes
will of course be vastly larger.)
12 The assumption of exhaustiveness is necessary for posing a well-specified inferential problem (Jaynes 2003). If a new hypothesis comes to mind in the future, we simply start over with a new, expanded or revised set of hypotheses that we provisionally consider to be exhaustive. Bayesian inference is always tentative inference to the best existing explanation.
The discrimination information D(H_j : H̄_j | S_C H_j I), also known as the relative entropy, quantifies the expected weight of case evidence in favor of H_j relative to H̄_j when we assume that
H_j is in fact true:

D(H_j : \bar{H}_j \mid S_C H_j I) = \sum_k P(E_k \mid S_C H_j I)\, \log\!\left[\frac{P(E_k \mid S_C H_j I)}{P(E_k \mid S_C \bar{H}_j I)}\right]. \quad (7)

In other words, expression (7) averages the possible weights of evidence (the log of the likelihood
ratio), each weighted by its respective likelihood under H_j.13
The Gibbs inequality (a mathematical theorem from statistical mechanics) guarantees that the
discrimination information is nonnegative:

D(H_j : \bar{H}_j \mid S_C H_j I) \geq 0, \quad (8)
with equality if and only if P(E_k | S_C H_j I) = P(E_k | S_C H̄_j I) for every possible evidentiary
outcome E_k. This latter situation would mean that we cannot learn anything about the truth
of H_j from the case, which would seem to be extremely rare in practice. Aside from such
special situations, we have the nontrivial fact that, on average, we always expect to find evidence favoring a hypothesis, assuming that it is indeed true. This expectation must hold for
every hypothesis, even though: (i) for any given hypothesis, some of the possible clues must
produce non-negative weights of evidence while others must yield non-positive weights of evidence (because a hypothesis cannot be confirmed by all possible evidence—otherwise we could
boost its credence without bothering to actually gather any evidence from the case); and (ii)
any given clue must be associated with non-negative weights of evidence for some hypotheses
and non-positive weights for others (the same evidence cannot boost the plausibility of every hypothesis, otherwise the sum of their probabilities would exceed unity). Small values of
D(H_j : H̄_j | S_C H_j I) suggest that the case is expected to be uninformative about H_j if that
hypothesis is true, in the sense that H_j does not tend to make sharp predictions for this case
that differ from those predicted by its plausible rivals. A large value of D(H_j : H̄_j | S_C H_j I) instead predicts that if H_j is in fact true, the case will provide strong evidence supporting
this conclusion. In other words, small values of D(H_j : H̄_j | S_C H_j I) indicate that the case
prospectively provides a weak test for H_j, while large values indicate that the case constitutes
a prospectively strong test for H_j.
13 Although not manifest, the generalized discrimination information implicitly (and inconveniently) depends on the priors P(H_j | I).
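To make expression (7) concrete, the discrimination information can be computed directly once likelihoods are assigned to the possible evidentiary outcomes. The sketch below reuses the hypothetical nine-outcome interview example from above, with made-up likelihoods (any probability assignments summing to one would do):

```python
import math

def discrimination(p_given_h, p_given_hbar):
    """Equation (7): D(H : H̄) = sum_k P(E_k|H) * log[ P(E_k|H) / P(E_k|H̄) ]."""
    return sum(p * math.log(p / q)
               for p, q in zip(p_given_h, p_given_hbar) if p > 0)

# Hypothetical likelihoods over the nine interview outcomes E_1..E_9
# (president × opposition leader, each: yes / no / not interviewed).
p_H    = [0.30, 0.05, 0.10, 0.05, 0.20, 0.05, 0.10, 0.05, 0.10]
p_Hbar = [0.05, 0.20, 0.10, 0.25, 0.05, 0.10, 0.05, 0.10, 0.10]

print(discrimination(p_H, p_Hbar) >= 0)  # True — the Gibbs inequality (8)
print(discrimination(p_H, p_H))          # 0.0 — identical likelihoods teach us nothing
```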
We can also construct the “dual” discrimination information for H̄_j versus H_j,

D(\bar{H}_j : H_j \mid S_C \bar{H}_j I) = \sum_k P(E_k \mid S_C \bar{H}_j I)\, \log\!\left[\frac{P(E_k \mid S_C \bar{H}_j I)}{P(E_k \mid S_C H_j I)}\right], \quad (9)

which likewise satisfies the Gibbs inequality:

D(\bar{H}_j : H_j \mid S_C \bar{H}_j I) \geq 0, \quad (10)

with equality if and only if P(E_k | S_C H_j I) = P(E_k | S_C H̄_j I) for all possible observational
outcomes. This inequality ensures that if H_j is assumed false (i.e., one of the concrete rival
hypotheses postulated under I is true instead, but we do not know which), then we expect to
find some evidence tending to disconfirm H_j. A low value for D(H̄_j : H_j | S_C H̄_j I) indicates
that the case is expected to be uninformative when H_j is false, whereas a high value indicates
that the case is expected to strongly disconfirm H_j if it is false.
If we single out a particular hypothesis of primary interest, we can make a formal correspondence between discrimination information and a generalized version of Van Evera’s (1997) test
typology that allows for non-binary evidentiary outcomes (Figure 2). Large values for both the
discrimination information D(H_j : H̄_j | S_C H_j I) and its dual D(H̄_j : H_j | S_C H̄_j I) indicate a
prospective doubly-decisive case, where we expect to learn a lot about the truth of H_j regardless
of which clues we eventually discover. In contrast, small values of both D(H_j : H̄_j | S_C H_j I)
and D(H̄_j : H_j | S_C H̄_j I) indicate a prospective straw-in-the-wind case, which we expect to
provide little information about the truth of H_j.
Mismatches between the discrimination information and its dual yield Van Evera’s other two
test types. A large value of D(H_j : H̄_j | S_C H_j I) and small value of D(H̄_j : H_j | S_C H̄_j I)
corresponds to a prospective smoking-gun case for H_j (or equivalently, a hoop case for H̄_j).
Conversely, a small value of D(H_j : H̄_j | S_C H_j I) and large value of D(H̄_j : H_j | S_C H̄_j I)
indicates a prospective hoop test for H_j (or equivalently, a smoking-gun test for H̄_j). It is
important to emphasize that characterizing a case prospectively as a smoking-gun test does not
imply that we have a strong expectation of finding smoking-gun evidence upon studying that
case. On the contrary, in a smoking-gun test, it is much more likely that we will not actually
find the smoking gun. However, the case provides the potential for a large weight of evidence
in favor of H_j, if the smoking-gun evidence is indeed observed.
When working with prospective Van Evera test types for case selection, we must keep three
critical points in mind. First, the posterior informativeness of a case study for H_j vs. H̄_j, based
[Figure 2: a 2×2 diagram with axes D(H0 ; H1) and D(H1 ; H0), whose quadrants are labeled “Doubly decisive,” “Straw in the wind,” “Smoking gun for H1 or Hoop test for H0,” and “Smoking gun for H0 or Hoop test for H1.”]

FIG. 2 Prospective Van Evera case types classified in terms of discrimination information for a set of
binary MEE hypotheses
on the actual observations we make, may be higher or lower than the prior expectation. Second,
a case may be more or less informative (in either prospective expectation or retrospective
actuality) for adjudicating between some other hypothesis H_k (k ≠ j) and its negation. This latter
point is particularly salient for prospective straw-in-the-wind cases. We might not want to
dismiss such cases before considering whether we stand to learn substantial information
about the truth of one of the other plausible rival hypotheses H_k for k ≠ j. Third, we emphasize
again that from a Bayesian perspective, the goal is not to try to confirm or falsify a single
hypothesis of interest H_j, but rather to adjudicate between all plausible rival hypotheses.
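Assuming we could estimate the discrimination information and its dual for a candidate case, the prospective classification in Figure 2 reduces to a simple lookup. In this sketch the threshold separating “small” from “large” values is an arbitrary illustrative choice, not part of the formal framework:

```python
def van_evera_type(d_forward, d_dual, threshold=1.0):
    """Classify a prospective case from D(H : H̄ | ...) and its dual D(H̄ : H | ...)."""
    strong_if_true = d_forward >= threshold   # strong evidence expected if H is true
    strong_if_false = d_dual >= threshold     # strong evidence expected if H is false
    if strong_if_true and strong_if_false:
        return "doubly decisive"
    if strong_if_true:
        return "smoking gun for H (hoop test for H̄)"
    if strong_if_false:
        return "hoop test for H (smoking gun for H̄)"
    return "straw in the wind"

print(van_evera_type(2.0, 2.0))  # doubly decisive
print(van_evera_type(2.0, 0.1))  # smoking gun for H (hoop test for H̄)
print(van_evera_type(0.1, 0.1))  # straw in the wind
```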
4.3. Expected Information Gain
Given that we do not know which of the hypotheses under consideration is correct, a better
approach to case selection involves averaging the expected weight of case evidence, D(H_j :
H̄_j | S_C H_j I), over all of the hypotheses, each weighted by its respective prior probability. The
resulting expression is the expected information gain associated with case C:

D(S_C, I) = \sum_j P(H_j \mid I)\, D(H_j : \bar{H}_j \mid S_C H_j I)
          = \sum_j P(H_j \mid I) \sum_k P(E_k \mid H_j S_C I)\, \log\!\left[\frac{P(E_k \mid S_C H_j I)}{P(E_k \mid S_C \bar{H}_j I)}\right]
          = \sum_j P(H_j \mid I) \sum_k P(E_k \mid H_j S_C I)\, WoE(H_j : \bar{H}_j,\, S_C). \quad (11)

In essence, the expected information gain D averages the weight of evidence over (i) our uncertainty
about what clues we will discover (as represented by the likelihoods for E_k), and (ii) our
uncertainty regarding which hypothesis is correct (as represented by the prior probabilities).
In plain language, expected information gain is the anticipated weight of evidence in favor of
whichever hypothesis is true.
Invoking the Gibbs inequality as before, expected information gain must be nonnegative:

D(S_C, I) \geq 0, \quad (12)

and can vanish if and only if P(E_k | S_C H_j I) = P(E_k | S_C H̄_j I) for all evidentiary outcomes and
all hypotheses under consideration. For any given case, there is of course no guarantee that the
evidence uncovered will point toward the true hypothesis; case evidence might be misleading.
However, on average (i.e., across all plausible H_j and potential E_k), any case is expected to
provide evidentiary weight in favor of the true hypothesis.
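Equation (11) can likewise be computed directly in simple settings. In the sketch below (all numbers hypothetical), the likelihood under each negation H̄_j is obtained by mixing the rival hypotheses' likelihoods with renormalized priors, which is also why D implicitly depends on the priors (footnote 13):

```python
import math

def discrimination(p_given_h, p_given_hbar):
    """Equation (7): expected WoE for H over H̄, assuming H is true."""
    return sum(p * math.log(p / q)
               for p, q in zip(p_given_h, p_given_hbar) if p > 0)

def expected_information_gain(priors, likelihoods):
    """Equation (11): prior-weighted average of D(H_j : H̄_j) over all hypotheses.
    likelihoods[j][k] = P(E_k | S_C H_j I)."""
    n = len(priors)
    total = 0.0
    for j, p_j in enumerate(priors):
        # Likelihood under the negation H̄_j: renormalized mixture of the rivals.
        p_hbar = [sum(priors[m] * likelihoods[m][k] for m in range(n) if m != j)
                  / (1.0 - p_j)
                  for k in range(len(likelihoods[j]))]
        total += p_j * discrimination(likelihoods[j], p_hbar)
    return total

# Hypothetical two-hypothesis, three-outcome case with divergent predictions.
priors = [0.5, 0.5]
likes = [[0.7, 0.2, 0.1],
         [0.1, 0.2, 0.7]]
print(expected_information_gain(priors, likes) >= 0)  # True, per inequality (12)
```

Comparing this quantity across candidate cases would, in principle, implement optimal case selection; the caveats below explain why this is rarely feasible in practice.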
4.4. Practical Caveats and Principled Insights
The good news is that maximizing expected information gain, D(S_C, I), provides a cogent,
mathematically grounded principle to guide case selection for theory testing. Recall that this
quantity (equation 11) represents the anticipated weight of evidence in favor of the true hypothesis, which we obtain by averaging the weight of evidence over all possible evidentiary outcomes
for the case, weighted by the respective likelihoods, and then averaging over all of the plausible
hypotheses, weighted by the respective prior probabilities.
The bad news is that the prospects for calculating anything but a very rough approximation
to D(S_C, I) in advance are daunting—we would have to assess likelihoods for all possible evidence in
the cases under consideration. We do have some freedom in deciding how to partition the space
of possible observations, so we could try to work with a relatively coarse level of granularity
in order to contain the total number of distinct possible outcomes we must consider. However,
as discussed in Fairfield & Charman (2019), it is not unusual for the likelihoods of case-based
evidence to depend on subtle details of phrasing, timing, body language of informants, etc.
The power of evidence to discriminate between hypotheses might very well lie in features that
cannot easily be foreseen. We therefore face a formidable tradeoff: if we adopt too fine-grained
a partition, tracking and assessing all of the possible observations becomes cumbersome, inefficient, and ultimately unattainable given the impossibility of anticipating all salient details that
could arise. If we make do with too coarse a partition, the evidentiary possibilities we consider
will be too vague and insufficient for (or inefficient at) adjudicating between hypotheses.
Another difficulty with trying to calculate expected information gain is that likelihoods depend
critically on the search strategy we follow for finding evidence in the case. In line with
the analogy of Bayesian inference as a dialogue with the data, the actual manner in which we
gather evidence within a case typically is not pre-determined; instead, it is a highly dynamic
and interactive process that is contingent on evidence gathered previously. Recall that the case
evidence E_j will typically consist of the conjunction of multiple observations or clues such that:

P(E_j \mid S_C H I) = P(E_{j1} E_{j2} \cdots \mid S_C H I) = P(E_{j1} \mid S_C H I)\, P(E_{j2} \mid E_{j1} S_C H I)\, P(E_{j3} \mid E_{j2} E_{j1} S_C H I) \cdots. \quad (13)
Any real-world search strategy S_C may become highly complex, since particular clues may
prompt follow-up questions and efforts to dig deeper that may direct us to pursue certain sources
more diligently or even to discover sources we were not aware of previously. Given unforeseeable
contingencies, it will be prohibitively difficult to spell out S_C explicitly in advance for all forks
in the investigative path—the tree of possibilities will be too intricate and profuse.14 Even if
one could foresee all possibilities, there will be too many to analyze in any detail in advance.
By contrast, whereas D depends on likelihoods, Bayesian updating once the evidence is in hand
depends only on likelihood ratios, P(E_j | S_C H_1 I)/P(E_j | S_C H_2 I), which makes conditioning
on the search strategy largely irrelevant. Whatever search strategy ultimately produced the
evidence, and however likely or unlikely we were to obtain that evidence, post facto, the difference
in likelihoods depends only on which hypothesis we condition on. Accordingly, post-data
14 Nevertheless, despite appearances, our formal notation does at least allow for such highly contingent or path-dependent search strategies—the problem is simply that of prospectively specifying SC.
inference is a far simpler task than prospectively calculating expected information gain. To
illustrate, suppose that our research involves interviewing politicians to find out why a tax cut
was enacted. The search strategy entails working hard to gain access to members of parliament.
But it is so unlikely from the outset that we would be able to interview the prime minister that
we do not plan to invest much effort to that end. If E_k represents some kind of information we
could imagine receiving from an interview with the prime minister, E_k will contribute negligibly
to expected information gain, simply because the likelihood of obtaining any information
directly from the prime minister is so low. But now suppose that once in the field, by a stroke
of luck we meet the prime minister’s top aide, who then arranges a brief interview with the
prime minister. Post facto, how likely it was that we would be able to access the prime min-
ister is irrelevant, because obtaining access was essentially equally unlikely regardless of which
hypothesis provides the best explanation for the tax cut.
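The post-facto irrelevance of the search strategy can be seen in one line: any hypothesis-independent factor, such as the improbability of gaining access to the prime minister, multiplies both likelihoods and cancels from the likelihood ratio. A sketch with hypothetical numbers:

```python
import math

def woe(lik_h1, lik_h2):
    """Weight of evidence for H1 over H2: the log of the likelihood ratio."""
    return math.log(lik_h1 / lik_h2)

# Hypothetical likelihoods of the prime minister's testimony under two hypotheses.
p_testimony_h1, p_testimony_h2 = 0.6, 0.2

# Gaining access was equally improbable regardless of which hypothesis is true.
p_access = 0.01

base = woe(p_testimony_h1, p_testimony_h2)
with_access = woe(p_access * p_testimony_h1, p_access * p_testimony_h2)
print(math.isclose(base, with_access))  # True — the access probability cancels
```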
As a normative goal or prescriptive ideal, logical Bayesianism ignores the difficulty of calculating
quantities such as expected information gain; it presumes a sort of “logical omniscience,” where
the implications of propositions are presumed immediately known, and the temporal or economic
costs of information acquisition and information processing are largely ignored. At best,
we can aspire to approximate this ideal, taking practical limitations into account. For example,
it can be tedious to carry forward many different rival hypotheses, so instead of maximizing
expected information gain overall, it might make sense to try to find a “nail-in-the-coffin” case
for one or more of the least plausible hypotheses, which might then lose enough credence to be
effectively removed from serious consideration.
Yet however difficult to quantify in practice, expected information gain does tell us in principle
what we should look for in a case that is intended to test theory. From a Bayesian perspective,
the goal is not to try to confirm a primary hypothesis à la Hempel, nor to create an opportunity
for falsifying this hypothesis à la Popper, but to distinguish between alternative hypotheses,
considered on equal footing.
Our formal discussion of expected information gain fortifies this common-sense advice and
allows us to end on a highly encouraging note. If we wish to reduce our uncertainty over a set
of hypotheses, or to gain credence in the true hypothesis, then prospectively, we always expect
to learn from studying any given case. In that sense, there are no bad cases, just varying degrees
of better cases. Therefore, all is not lost if we end up (unknowingly) choosing sub-optimal cases;
a priori we can still expect the case study to contribute something to knowledge accumulation.
5. HEURISTIC BAYESIAN CASE SELECTION
In light of the obstacles facing efforts to perform truly optimized case selection, this section
offers practical, Bayesian-inspired advice and considers how this advice either differs from or
helps substantiate prevailing recommendations in the literature. While some of our suggestions
clearly break with extant views, many are matters of common sense and/or correspond to well-
established practices in qualitative research. Regarding the latter category, our goal in large
measure is precisely to show that Bayesianism justifies and underpins many widespread practices
that most scholars would probably consider intuitively reasonable, but without necessarily being
able to provide a consistent or coherent rationale as to why.
5.1. If substantial knowledge is available prior to case selection, aim to choose cases that
maximize some approximation to, or proxy for, expected information gain. If less is known about
potential cases a priori, seek cases that are anticipated to be data rich. If even less is known,
prioritize practical concerns and curtail effort spent on case selection.
As elaborated in Section 4, we should aim to choose the cases that will be most informa-
tive; however, very often we will not have enough knowledge or resources ahead of time to
be able to make anything resembling a mathematically optimal choice a priori. For example,
as discussed in Section 2, much of the literature assumes that scores on key independent and
dependent variables are known in advance of case selection, yet for many innovative qualitative
research projects, this information itself is often produced only during the course of in-depth
case research. In such situations, we advocate prioritizing practical concerns and moving on
to in-depth data collection. Among other factors, salient practical concerns include language
skills, safety of the research venue, quality of secondary sources, access to archives and relevant
documentation, accessibility of key informants, and budget or time constraints.
5.2. There is no need to list all known cases before choosing some for in-depth investigation.
The near-universally espoused advice to begin the case selection process by enumerating all
possible cases is grounded in a frequentist approach to inference, where the population and
the procedure for sampling from that population must be fully specified before data collection
and analysis. From a Bayesian perspective, however, we regard this advice as misguided and
unnecessary. Qualitative research generally does not (and quite possibly cannot) aspire to
estimate numerical values for population-level parameters; as such, case selection does not
require randomly sampling from a pre-defined population, and in this sense at least, selection
bias cannot be an issue. Moreover, providing readers with a list of cases that were not chosen
for close analysis provides no salient information for evaluating inferences drawn from the
cases that actually were studied. In contrast to frequentism, where inference is always based
on considerations of what would happen under (often only imagined) repetitions of the same
sampling procedure, cases (or data) that were not investigated (or not obtained) play no direct
role in Bayesian analysis. Within a Bayesian framework, we are concerned with (1) conducting
sound inferences based on the actual evidence at hand—without ignoring salient information
and without allowing subjective hopes or counterfactual speculations to affect our analysis—and
(2) articulating scope conditions in a manner that attempts to balance explanatory power and
generality—or stated differently, accuracy and simplicity.15 If scholars are concerned that the
studied cases are unusual or misleading, such that the theoretical argument would not fare well
against rivals if the breadth of analysis were expanded, then follow-up research to strengthen
inferences can be conducted on additional cases within the original scope of the theory, and/or
extensions of the research can be carried out to assess and refine the theory in new contexts
that push the boundaries of the original scope conditions.16
Our Bayesian perspective is particularly salient considering that delineating the entire
population of applicable cases ahead of time would often be prohibitively difficult. Because Bayesian
analysis invites an iterative dialog with the data where we may generate and refine hypotheses
and scope conditions as we gather new information, we are free to add cases as our research
progresses. In many situations, new cases will be constantly generated over time; consider
15 To that end, Bayesianism automatically incorporates an "Occam's razor" that penalizes complexity beyond what is needed to explain the data (Fairfield & Charman 2019).
16 See Fairfield & Charman (2020) for discussion of continued research and extended research in the context of debates on replication and reliability of inference.
for example research on policy initiatives (e.g., tax reform or social policy innovation), elec-
toral campaigns, populist leaders, or transitions between democratic and authoritarian regimes.
When case selection follows a multi-level process (e.g., choosing countries and then identifying
salient reform episodes, court cases, or protest events within those countries), ongoing fieldwork
may be necessary to discover relevant cases within the macroscopic unit; information may not
be available outside of the country to identify salient cases in advance of consulting primary
documents or interviewing local experts.17
While Bayesian case selection need not begin with a fully enumerated population, it can cer-
tainly be useful to list a number of salient cases from the outset. At early stages of research,
considering preliminary information known about a substantial range of cases can stimulate
ideas about plausible hypotheses and scope conditions. In addition, we will argue that includ-
ing diverse cases is a sound guideline for case selection (Section 5.4).
5.3. Provide a clear and honest rationale for focusing on particular cases and excluding others.
We advocate a common-sense approach to explaining case selection that provides an honest
rationale for the decisions made and a useful orientation for readers, without invoking the
frequentist logic of sampling from a population. To the first point, there is no need to pretend
that a particular case was selected as a strong test of theory in instances when insufficient
information was available ahead of time to effectively assess expected information gain. If a
case ends up providing strong discriminating evidence, it stands on its merit as a strong test
retrospectively, whatever the initial reasons for including it in the study. This perspective
relates to the broader point that post facto framing of iterative research as conforming to a
linear deductive template should be avoided—doing so is as senseless as it is dishonest, given
that Bayesian analysis allows and indeed encourages iterative research.
Providing a rationale for focusing on particular cases is a matter of transparency that facilitates
scrutiny by other scholars. A well-reasoned discussion of how choices were made helps dispel
concerns that the author may have deliberately avoided cases that were anticipated to contradict
the favored hypothesis, and allows readers to more easily assess whether justifiable claims have
17 See Fairfield (2015: Appendix 1.3).
been made about scope, or whether more research is needed to substantiate generalizations.
To these ends, useful information would include reasons for not examining case(s) that readers
might otherwise naturally expect the study to cover. For example, one might wonder why
Kurtz’s (2009) research on Latin American state building does not discuss the paradigmatic case
of Venezuela, which played a key role in generating the resource-curse hypothesis (Karl 1997),
or El Salvador, a prime Latin American example of labor-repressive agriculture (Wood 2000).
Since Kurtz (2009) argues that labor-repressive agriculture, not resource wealth, is the primary
factor deterring institutional development, these cases would seem highly salient, although the
author may well have had good reason to anticipate that less would be learned relative to
the countries he did examine.18 Additionally, highlighting salient cases for follow-up research
or extensions to other contexts can encourage knowledge accumulation by facilitating future
tests of the theory or refinement thereof. Qualitative research regularly includes preliminary
discussions of how findings and hypotheses might apply in other contexts—whether in different
regions, countries, time-periods, or policy areas.
Boas (2016) is an excellent example of the kind of case-selection rationale we have in mind.
Boas (2016:34) provides a concise step-by-step account of how he identified secondary country
cases for additional assessment of his “success-contagion” theory, which aims to explain salient
features of electoral campaign strategies (extent of cleavage priming, nature of candidate link-
ages to citizens, the degree of policy focus). After identifying countries that satisfy the theory’s
primary scope conditions—third-wave democracies—he focuses on those that (i) retained good
Freedom House scores on political rights (3 or lower) from 2000–06, and (ii) conducted enough
elections (at least 4) following the transition from authoritarian rule. The latter criterion il-
lustrates astute use of background knowledge, given that the primary case studies suggest that
candidates’ campaign strategies converge only after several rounds of learning from previous
elections. Boas (2016) then explains his rationale for excluding several countries on the basis of
unusual electoral institutions or inadequate information, and he notes that he includes several
countries that continued to hold elections despite democratic backsliding after 2006. Readers
might wonder why the 2006 cutoff matters and what might be learned by examining countries
that experienced democratic backsliding prior to 2006 but also continued to hold elections. Yet
with these clear case-selection criteria, interested scholars could easily identify such countries
18 Kurtz (2009) of course provides a detailed rationale for why the countries he does study merit comparison.
and conduct additional research. There are, however, two aspects of Boas's (2016:28-29, 34)
case-selection discussion that diverge from our recommendations: occasional language referring to
population-based sampling, and an emphasis on testing theory with cases other than those used
to build the theory (Boas 2016:28-29)—both are unnecessary within a Bayesian framework.
5.4. Diversity among cases is generally good.
Seeking diversity when selecting cases is generally useful for several (related) reasons. First,
diverse cases are more likely to provide logically independent weights of evidence, such that
we gain more information regarding the truth of competing hypotheses from a given number
of observations. Consider for instance Fairfield’s (2015) hypothesis that strong business power
deters progressive tax reform. After a certain point, we may not learn much more about the
truth of this hypothesis by examining additional instances of unsuccessful progressive tax initia-
tives in Country X where business has multiple strong sources of political power. Accumulated
background information about previous failed reforms in this country (EX1, . . . , EXN) may
in itself strongly predict any evidence we find in the additional case (EXN+1), such that the
likelihood P(EXN+1 | EX1 · · · EXN Hi I) will tend to be quite high under essentially any of the
hypotheses Hi under consideration. In contrast, evidence from a case of failed reform in
Country Y where business also has multiple sources of power may be more (logically) independent
of EX1, . . . , EXN, and hence better able to discriminate between the business-power hypothesis
and rivals even after conditioning on all of the previous observations.19
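A hypothetical numerical sketch of this point: after many similar failures in Country X, both hypotheses assign nearly the same (high) probability to finding the same kind of evidence again, so the weight of evidence is close to zero; a first case from Country Y, being less predictable from prior observations, can discriminate. All likelihood values below are invented for illustration.

```python
import math

# Hypothetical likelihoods of observing failed-reform evidence E, conditional
# on accumulated background knowledge. After N similar failures in Country X,
# that background already predicts E_X(N+1) strongly under BOTH hypotheses.
p_EX_given_H1 = 0.95  # business-power hypothesis
p_EX_given_H2 = 0.90  # rival hypothesis

# A first case in Country Y is less predictable from the Country X record,
# so the rival hypotheses can disagree sharply about what we should find.
p_EY_given_H1 = 0.80
p_EY_given_H2 = 0.20

def weight_of_evidence_db(p1, p2):
    """Weight of evidence for H1 over H2 in decibels: 10*log10(P(E|H1)/P(E|H2))."""
    return 10 * math.log10(p1 / p2)

print(weight_of_evidence_db(p_EX_given_H1, p_EX_given_H2))  # about 0.23 dB: nearly uninformative
print(weight_of_evidence_db(p_EY_given_H1, p_EY_given_H2))  # about 6.02 dB: discriminating
```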
Second, examining diverse cases can provide more stringent tests, especially if the theories in
question make different kinds of predictions in different contexts. In other words, we are able
to test more aspects of a complex theory. Theories will often explicitly or implicitly consist
of the conjunction of several different claims that emerge in different contexts. Suppose we
are considering hypotheses of the form H1 = Ha ∧ Hb, H2 = Ha ∧ H′b, H3 = H′a ∧ Hb, and
H4 = H′a ∧ H′b. These hypotheses inherit exclusivity from their conjuncts, assuming that Ha
is exclusive of H′a, and Hb is exclusive of H′b. If we only look at cases where Hb and H′b are
silent—that is, cases in which predictions are implicates of Ha or H′a (due to scope or other
conditions), then such cases alone cannot possibly adjudicate between H1 and H2, or between
H3 and H4.
19 See Fairfield & Charman (2017) on handling logical dependence.
Suppose H1 posits that democratization occurs via mass pressure from below in
countries with legacies of strong labor unions (Ha), but via international pressure in countries
where labor was historically weak (Hb); a second hypothesis H2 holds that democratization
occurs via mass pressure in countries with legacies of strong labor unions (Ha), but via
intra-elite conflict in countries where labor was historically weak (H′b); and a third possibility H3
proposes that democratization occurs via pressure from business in countries with legacies of
strong labor unions (H′a), but via international pressure in countries where labor was historically
weak (Hb). In this instance, it would clearly be important to consider cases of democratization
from both types of countries.
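The logic of silent conjuncts can be made mechanical. In the sketch below (Python, using the democratization labels from the example above; the encoding is our illustrative assumption), each composite hypothesis is a pair of claims, one for strong-labor countries and one for weak-labor countries, and we enumerate which hypothesis pairs a given case type can distinguish.

```python
# Hypothetical sketch: which composite hypotheses can a given case type separate?
# Each hypothesis is (claim about strong-labor countries, claim about weak-labor
# countries), following the text's democratization example.
hypotheses = {
    "H1": ("mass pressure", "international pressure"),
    "H2": ("mass pressure", "intra-elite conflict"),
    "H3": ("business pressure", "international pressure"),
}

def distinguishable(case_type):
    """Pairs of hypotheses whose predictions differ in this type of country."""
    idx = 0 if case_type == "strong labor" else 1
    names = sorted(hypotheses)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if hypotheses[a][idx] != hypotheses[b][idx]]

print(distinguishable("strong labor"))  # [('H1', 'H3'), ('H2', 'H3')] -- H1 vs H2 left unresolved
print(distinguishable("weak labor"))    # [('H1', 'H2'), ('H2', 'H3')] -- H1 vs H3 left unresolved
```

Strong-labor cases leave H1 versus H2 unresolved, and weak-labor cases leave H1 versus H3 unresolved, so only the two case types together can adjudicate among all three hypotheses.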
Third and relatedly, diversity helps assess and demarcate scope limitations. To assess different
scope conditions for related hypotheses of the form H1 = If A1 then B; H2 = If A2 then B; etc.,
we must examine cases which di↵er meaningfully in the antecedent conditions A1, A2, . . . . For
instance, if we have shown that a theory works well in middle-income countries, examining a low-
income country may provide evidence that the theory holds more broadly, or that it applies only
to intermediate levels of development. Diversity among cases regarding geographical location,
time period, level of inequality, and so forth can serve similar purposes.
With respect to existing literature on case selection, as a rule of thumb, “maximizing variation”
on the dependent variable and plausible independent variables certainly contributes to diver-
sity; this practice is common throughout comparative case-study research. However we stress
again that in contrast to frequentist prescriptions, the aim in Bayesian qualitative case-study
research is not to obtain a representative sample from some defined population, or to avoid
bias arising from "selecting on the dependent variable." The goal is to efficiently adjudicate
between rival explanations, which may often benefit from intentionally choosing cases with di-
verse features or variation on certain dimensions. Incorporating diversity on any factors that
may be relevant—including dependent variables, confounding variables, contributing factors,
and scope conditions—can often be helpful both for testing theory and assessing its scope.
A best practice would entail endeavoring to find at least one case where the advocated theory
does not seem to perform well, in order to locate boundary conditions and signal to readers
that the severe tests were not deliberately avoided. Boas (2016:198-204) is a model in this
regard as well. Among his ten secondary case studies to assess the generality of the success
contagion theory, he includes three countries where electoral campaigns have not conformed to
predictions. His analysis delineates concrete, theoretically compelling (as opposed to ad-hoc)
scope limitations for his theory and usefully provides preliminary alternative explanations for
these misfit cases (Boas 2016:205-7).
5.5. Similarities among cases can be useful for testing theories that ascribe different roles to a
given causal factor.
Research designs often aim to construct “structured focused comparisons” (George 1979, George
and Bennett 2004: Chapter 3) that hold some factor constant across cases in order to highlight
the role of a particular explanatory variable. This approach is especially useful when comparing
theories that make different predictions about the role of a given variable; one hypothesis
may focus on X as the primary explanatory factor for Y , whereas a rival hypothesis implies
that Y is unrelated to X. Here it is sensible to examine cases that vary on X but manifest
similarities on other plausible causal factors beyond X. The logic is not that we are exercising
experimental control over assignment and can hence ascribe a causal e↵ect to a treatment
variable by randomizing away the influence of confounders, but rather that we are finding a
set of cases with high expected information gain, in that taken together, these cases are likely
to serve as a doubly-decisive test as to whether factor X matters for Y .
Kurtz’s (2009) paired comparison of state-building in Peru and Chile is an instructive ex-
ample. Both countries experienced commodity booms during early stages of institutional
development—they shared the same natural resource: nitrates—and both countries experi-
enced external warfare—indeed, they entered into armed conflict with each other. Yet they
di↵er in that Peru’s agricultural economy relied on labor-repressive agriculture, whereas Chile’s
did not. These cases accordingly provide promising ground for assessing Kurtz’s hypothesis
that labor-repressive agriculture deters institutional development against rivals that focus on
the role of resource wealth or warfare.
5.6. Cases that appear overdetermined, in the sense that prior knowledge about key variable
scores suggests they may be consistent with multiple hypotheses, need not be avoided.
Suppose we are interested in adjudicating between two rival hypotheses that aim to explain
outcome Y . H1 posits that the presence of X1 causes Y through some particular mechanism,
while H2 holds that X2 causes Y through some other stated mechanism. Suppose further that
case selection is informed by known cross-case scores on X1, X2, and Y . Cases for which initial
(X1, X2, Y ) information is compatible with both H1 and H2 do not give us any expectations
a priori about which way within-case evidence will lean, or how strongly that evidence will
favor one explanation over the other. Nevertheless, H1 and H2 may well make very di↵erent
predictions about the kinds of evidence we should find in the case, and resourceful scholars can
be highly successful at uncovering discriminating evidence. As such, advice (often informed
by a frequentist perspective) to avoid cases where multiple possible causes or confounders are
present (Goertz 2016, Gerring 2007) is overstated.
Examples of research that produced highly decisive evidence upon examining such cases in-
clude Slater’s (2009) work on popular collective action against dictatorships and Fairfield’s
(2015) work on business power and progressive tax reform. For Slater, democratic protest in
the Philippines and Burma followed both precipitous economic decline and appeals by commu-
nal elites to nationalist and religious sentiments, consistent with rival hypotheses that focus
respectively on one or the other of these two causal factors. Yet close examination of historical
records and primary accounts revealed information about timing and causal sequences as well
as other kinds of evidence that weigh strongly in favor of the nationalist-religious appeals hy-
pothesis over the economic decline hypothesis. Turning to Fairfield, the absence of significant
corporate tax increases following democratization in Chile is consistent from the outset with
both a capital mobility (structural power) hypothesis, and a political engagement (instrumental
power) hypothesis—given business’s strong capacity for collective action. Yet in-depth inter-
views with decision-makers produced evidence that strongly favored the instrumental-power
explanation.
While our Bayesian perspective clarifies that we can indeed learn from cases where multiple
possible causes are present, it also reveals that such cases have no intrinsic advantage for
adjudicating between rival theories—they do not necessarily provide stronger tests than cases
that a priori manifest only a single plausible cause of the outcome of interest. Ultimately, the
strength of the test depends on the weight of the evidence we discover, independently of our
case selection strategy.
5.7. To adjudicate between different causal mechanisms that may underpin an established model
(H = X causes Y ), select model-conforming cases.
Multi-methods literature highlights an important role for qualitative case studies in learning
about causal mechanisms, in situations where we begin with a strong cross-case (X,Y ) cor-
relation or an accepted high-level model (H = X causes Y ). Our advice for such research
conforms to our overarching recommendation for case selection: if existing knowledge allows,
choose cases that maximize (some approximation of) expected information gain. The only
difference is that in the current situation, we are adjudicating between rival "sub-hypotheses"
that posit distinct causal mechanisms or pathways leading from X to Y , rather than adjudi-
cating between higher-level theories that posit rival explanations for the outcome (e.g., H =
Z causes Y ). Ross (2004) provides a frequentist-oriented example that could be understood in
these Bayesian terms. He begins with an “established” high-level hypothesis, HR = resource
wealth promotes civil conflict, which finds correlational support in statistical analyses. Ross
then identifies seven plausible causal mechanisms—which we would describe as sub-hypotheses
HRi—to be assessed through scrutiny of case-study literature.20
If (X,Y ) values are the only available information at the time of case selection, our rule of
thumb would follow advice to select model-conforming cases that fit well with the large-N,
X-Y relationship (corresponding to Lieberman’s (2005) “on-the-line cases” or Seawright and
Gerring’s (2008) “typical cases”). Model-conforming cases practically by definition provide
fertile ground for examining causal mechanisms—these are the set of cases that are compatible
with H, and hence the cases where we should best be able to learn about causal mechanisms
underpinning H, as opposed to assessing high-level rival hypotheses that might account for
off-line cases and/or identifying scope limitations of the working model.
20 To the extent that Ross's scrutiny of available sources constitutes true "evidence of absence," his study provides strong weight of evidence against two of the originally theorized mechanisms (looting and grievance). It is worth noting that from a Bayesian perspective, the fact that several of the selected cases display no plausible causal mechanisms connecting resource wealth to conflict suggests the need to assess the higher-level hypothesis HR against alternative explanations of civil war. This is particularly true of the relationship between natural resources and onset of civil war; Ross finds no plausible causal mechanisms for 8 of the 13 cases examined.
In contrast, Seawright (2016:86-7) departs from this common advice by counterintuitively
proposing that "deviant cases" (off-line) are the appropriate choice for learning about "unknown
or incompletely understood causal pathways," whereas "typical cases" (on-line) are essentially
never useful to this end. Seawright (2016:87) argues that off-line cases serve to investigate
heterogeneity of effect size within hypothesized causal pathway(s) W (i.e., unusual effects of
X on W, or unusual effects of W on Y relative to the population average), or to investigate
"an unusual direct effect [of X] on Y, net of the causal pathway of interest."21 Seawright
seems to presume that the basic features of the X-Y model still hold in off-line cases. Such an
assertion is not always justified. A more likely scenario is that the mechanism simply does not
apply across all cases, for example due to some as-yet unidentified scope conditions, and some
entirely different explanation (e.g., H′ = Q causes Y) is needed to understand these deviant
cases. Alternatively, some large "error term" may have perturbed the results in an off-line case,
indicating that the underlying mechanism governing on-line cases will either be obscured by
noise, or will at best be difficult to observe.
To illustrate our perspective, consider again Fairfield’s (2015) hypothesis that stronger business
power deters progressive taxation, which for the sake of argument could be represented by a
regression model. Whereas the hypothesis holds well in 47 cases across the three countries
examined, substantial tax increases were enacted despite strong business power in two Chilean
cases, and substantial tax increases were blocked despite weak business power in two Argentine
cases. Analysis of the latter cases reveals an unusual dynamic whereby radical reform design
and government strategic errors provoked effective contestation despite weak business power,
while analysis of the former cases highlights unusual contextual conditions that compelled
business to strategically acquiesce to tax increases (Fairfield 2015:299, 301). None of these
four “anomalous” cases elucidates the core causal mechanisms that underpin the more general
relationship between strong business power and absence of progressive tax reform.
While we have argued against “representative” or “random” sampling in qualitative research,
here we encounter a situation where choosing “typical” cases does make sense. Generally
speaking, we seek informativeness vis-à-vis the particular question we are asking. "Typical"
often connotes “unsurprising,” which would imply uninformativeness. But here, the typical
21 In our view, the latter scenario essentially declares the absence or irrelevance of pathway W, without addressing the question of how X then produces Y, although Seawright may instead associate this scenario with "finding out about unknown...causal pathways."
cases are deemed typical based only on (X,Y ) values under an established or assumed regression
relationship, and the questions of interest pertain to the underlying mechanisms responsible for
these observed correlations. Cases typical in this sense can still offer informative clues regarding
the causal sub-hypotheses, and they are well suited for making generalizable conclusions.
Turning to a related debate in the multi-methods literature, whether it makes more sense to look
only at cases with high values of X and Y (or for categorical variables, X and Y both present) or
also cases with low values of X and Y (or both absent) depends on the nature of the hypothesis.
If the primary goal is to explain why Y occurs, in a situation where low values (or absence) of
Y is the obvious default outcome, then examining ‘high-high’ cases would be a sensible starting
point, consistent with Goertz’s (2016) emphasis on the “(1, 1) cell.” Suppose we are interested
in understanding why living in greater proximity to power lines might appear to cause higher
cancer rates in children, or the likelihood of spontaneous mass protest in relation to economic
decline. In the first instance, the absence of cancer in children does not call for explanation, whereas
assessing whether living next to a power line creates biophysical changes would be critical for
identifying a plausible causal link to cancer. Similarly, in the second example, the best starting
place to identify or adjudicate between causal mechanisms would be a (relatively rare) instance
of spontaneous mass protest.
In most situations, however, looking at cases with varying locations along the regression line
will be useful. Theories often aim to explain both high and low values (or occurrence and
non-occurrence) of Y (e.g., state strength versus weakness, enactment of progressive versus
regressive economic policies), such that examining causal pathways in cases near both ends of
the X-Y curve will be of interest. For example, Singh’s (2015) multi-methods research on sub-
national identities and social policy in India includes an informative analysis of Uttar Pradesh,
a “(0,0)” case in which the absence of sub-nationalism in this province led to minimal welfare
development. While this study might be an exception to Goertz’s (2016) observation that multi-
methods designs almost always select case studies from the “(1, 1) cell,” studies of (0,0) cases
are common in qualitative research designs (e.g., Fairfield 2015, Garay 2016, Kurtz 2009, Slater
2009:214). How much work such case studies do for highlighting causal mechanisms, beyond
providing careful scoring of the X-Y values that help to establish the overall causal correlation,
naturally varies according to the nature of the project and the quality of the case-study evidence
uncovered.
If resources are limited, (0, 0) cases arguably are not the first to consider; however, even if the
research justifiably focuses on understanding occurrences of Y , as in the cancer epidemiology or
spontaneous protest examples, (0, 0) cases can still serve an important role in clarifying causal
mechanisms that operate in (1, 1) cases by providing a useful foil or baseline for comparison.
That baseline may be the simple absence of the causal mechanisms suggested by (1, 1) cases.
Or the baseline could reveal that some dynamics initially thought to be part of the (1, 1) causal
process are present more broadly and may not actually play a dominant role in bringing about
Y .
To illustrate, consider Goertz’s (2016) example of Snow’s famous investigation of the Broad
Street cholera epidemic, where a (0, 0) case would be a healthy person residing far from the
water pump, whereas a (1, 1) “causal mechanism” case would be a sick person residing near
the pump. We would certainly want to examine instances of the latter in order to learn about
causal mechanisms—this is in fact the logical starting point for discovery—but (0, 0) cases can
also provide important information. Suppose the working hypothesis supported by large-N
correlational data is simply that residential proximity to the Broad Street pump causes cholera.
We might theorize two di↵erent causal mechanisms: H1 = people living near the pump obtain
their drinking water from that source, which carries the disease; and H2 = people living near
the pump shop at the adjacent butcher store, which sells contaminated meat that transmits
the disease. Useful (0, 0) cases that could provide discriminating information to adjudicate
between causal pathways H1 and H2 would include healthy people residing elsewhere who
regularly travel in to the community surrounding the pump. For example, case studies of some
such individuals might reveal that they make weekend shopping trips to get a good price on
meat from the butcher but do not stay in the community long enough to come in contact with
pump water, which would provide a strong weight of evidence in favor of mechanism H1 over
H2. Alternatively, we might find that these healthy individuals regularly drink water during
tea-time visits with friends residing near the pump, but they do not stay for dinner and hence
do not consume meat from the butcher shop. The latter instance would lend a strong weight
of evidence in favor of H2 over H1. We thus disagree with Goertz’s (2016:61) view that “(0,0)”
cases “have little or no role to play” for understanding causal mechanisms.
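The discriminating power of such a (0, 0) case can be expressed with hypothetical numbers. Suppose we find a healthy out-of-neighborhood shopper who buys meat from the Broad Street butcher but never drinks pump water; the likelihoods below are invented solely to illustrate the updating arithmetic.

```python
# Hypothetical likelihoods for E = "healthy non-resident regularly eats the
# butcher's meat but never drinks pump water". Under H1 (waterborne disease),
# such a person staying healthy is expected; under H2 (contaminated meat),
# it is surprising, since the meat should have sickened them.
p_E_given_H1 = 0.85
p_E_given_H2 = 0.10

prior_odds = 1.0  # treat H1 and H2 as equally plausible before this case
posterior_odds = prior_odds * (p_E_given_H1 / p_E_given_H2)
posterior_H1 = posterior_odds / (1 + posterior_odds)
print(round(posterior_H1, 2))  # 0.89: one well-chosen (0, 0) case shifts belief substantially
```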
6. A REQUIEM FOR MOST-LIKELY AND LEAST-LIKELY CASES
We now return to the social science literature on critical cases, or equivalently, crucial cases,
which have played an important role in qualitative methods ever since Eckstein’s (1974) forma-
tive discussion. Indeed, “most-likely cases” and “least-likely cases” are among the most familiar
terms in qualitative case selection literature. Yet while authors agree that “most-likely cases”
should be disconfirming if the hypothesis performs poorly, whereas “least-likely cases” should
provide difficult tests and hence strong support for the hypothesis if it passes, we find that the
literature is fraught with ambiguities, inconsistencies or even contradictions, and questionable
inferential logic (Table 2).
The crux of the problem entails defining what exactly it means for a case to be “most-likely”
or “least-likely.” Within a Bayesian framework, probabilities apply to logical propositions, and
all probabilities are conditional on some body of background information and assumptions.
For example, the conditional probability P (H | I) represents the rational degree of belief that
we should hold in hypothesis H given background knowledge I. Yet the literature is unclear
regarding precisely what assertion about the nature of the case, the predictions of the hypoth-
esis, or both, should be regarded as “likely” or “unlikely,” and what should be taken as the
conditioning information. In and of itself, a case does not possess a probability. Nor can we
meaningfully speak of “the probability that a hypothesis explains a case”—a hypothesis with
specified scope conditions is a claim about the world that is either true or false. A hypothesis
may be applicable in the sense that it makes predictions in some cases but not in others, and
we may need to revise scope conditions in light of case evidence, but we cannot assert that a
hypothesis is true in some cases but false in others. Stated differently, a hypothesis has only
one probability under a given state of knowledge; the probability that a hypothesis is correct
cannot vary depending on which case we choose to investigate. The only meaningful way
to interpret a question regarding "the probability that a hypothesis explains a case" would
be to ask instead about the joint probability that (i) the hypothesis is true and (ii) the case in
question falls within its scope. But this interpretation in itself does not help us identify cases
that will provide strong tests or clarify what a most-likely or least-likely case might mean.
An additional problem is that the literature usually attempts to gauge “likely” or “unlikely”
with respect to a single theory, without reference to concrete alternatives. Yet from a Bayesian
perspective, the plausibility of hypotheses must always be compared rather than assessed in
absolute terms—theory-testing entails asking which of at least two rival hypotheses provides
the best explanation in light of the data obtained. Eckstein (2000:31) recognizes the importance
of rival hypotheses in discussing critical cases, yet he falls short of precisely articulating the
inferential logic in asserting that such cases “must closely fit a theory if one is to have confidence
in the theory’s validity, or, conversely, must not fit equally well any rule contrary to that
proposed”—we may recover the definition of a doubly-decisive test by replacing “or conversely”
with “and,” such that the evidence fits very well with the hypothesis but very poorly with
rivals.22 The role of rival hypotheses tends to become less clear in subsequent discussions of
most/least-likely cases. Here we find an odd asymmetry in that rival hypotheses appear more
often in discussions of least-likely cases than in discussions of most-likely cases.23
The following discussion begins by surveying different possible interpretations of most-likely
and least-likely cases as they have been treated in the literature (Section 6.1). We then dis-
cuss potential pitfalls with an inferential logic that is often associated with least-likely cases,
which Levy (2002) terms the “Sinatra inference” (Section 6.2). Our goal throughout is to dis-
cern the best-light, most sensible interpretation of the ideas that widely-cited authors in this
literature have tried to convey.24 Upon careful scrutiny, however, we conclude that the intuitions articulated, by and large, do not map onto Bayesian principles—a finding that is all the
more remarkable considering that many of the authors included in Table 2 invoke Bayesianism
(either informally or formally) as the underlying rationale for the most/least-likely case logic.
Accordingly, we advocate retiring the notion of most/least likely cases (Section 6.3).
22 It is worth highlighting here that Bacon’s instantia cruces and Hooke’s experimentum cruces, the forebears of Eckstein’s crucial cases, refer to observations which can definitively distinguish between two or more rival hypotheses.
23 Consider the authors summarized in Table 2. For least-likely cases, RSI (2004), Levy (2002), George and Bennett (2004), Bennett and Elman (2007), and Gerring (2007) explicitly mention rival hypotheses in some manner at some point, and rival hypotheses seem to be implicit in Eckstein’s (2000) discussion. But for most-likely cases, only George and Bennett (2004) discuss rival hypotheses explicitly, and only Gerring (2007) includes rivals implicitly. Turning to the two authors who explicitly adopt a Bayesian framework, the role of rival hypotheses is treated inconsistently (Rohlfing 2012), and occasionally incorrectly (Rapport 2015; see Appendix D for elaboration).
24 For reference, Table 2 provides authors’ definitions of most/least-likely cases in their own words, along with critiques of the ambiguities and inconsistencies they contain.
6.1. Approaches to Defining Critical Cases
At least five possible interpretations of “most-likely” and “least-likely cases” can be taken
from the literature. These interpretations focus respectively on (1) scope conditions, (2) prior
probabilities, (3) the likelihood of the case evidence conditional on the theory, (4) the marginal
(or unconditional) likelihood of case evidence, and (5) divergent likelihoods of case evidence
under rival hypotheses. None of these interpretations is satisfactory for prospectively identifying
critical cases at the case-selection stage of research; however, the relative likelihoods approach
does capture the notion of a strong test—at least in a retrospective sense, once the evidence is
in hand.
6.1.1. Scope Conditions
In defining critical cases, Levy (2002) sensibly requires a most-likely case to satisfy all of the
assumptions of the theory, whereas he proposes that “a least-likely case design identifies cases
in which the theory’s scope conditions are satisfied weakly if at all” (Levy 2002:144). This
approach is problematic because by construction, a theory makes no predictions for a case that
falls outside its scope conditions. That is, beyond its intended range of application, a theory
is silent. Therefore, studying a least-likely case defined as per this strong understanding of the
scope criteria would not allow us to draw any inferences about the truth of the theory. We
might learn useful information that would help us reevaluate or generalize the scope conditions,
but refining a theory in this manner is a different task from testing an existing theory in a case
where it makes concrete predictions.25
A close look at the least-likely case example Levy provides—Allison’s (1971) study of the Cuban
missile crisis—helps pinpoint some analytic ambiguities regarding the scope-condition criteria
that can be resolved by stating the theory to be tested more carefully. Levy (2002:145) recounts:
Allison argued that the missile crisis was a least-likely case for the organizational
and bureaucratic models of decision making and a most likely case for the rational-
unitary model. We might expect organizational routines and bureaucratic politics
25 Both Van Evera (1997:34-35) and Rapport (2015:439-440) make similar critiques of this scope-conditions understanding of least-likely cases.
to affect decision making on budgetary issues and on issues of low politics, but
not in cases involving the most severe threats to national security, where rational
calculations to maximize the national interest should dominate...
At first glance, it might appear that Allison (1971) has chosen a case outside the scope of the
organizational-bureaucratic theory (HOB). But if the scope of HOB is restricted to non-crisis
domains of international security policy, such as military procurement, budgetary allocation,
and formation of military doctrine, then we cannot draw any inferences about the truth of HOB
by examining a case of severe national security threats, for the reasons discussed above. How-
ever, the second sentence in Levy’s explication of Allison’s study suggests that the theory being
tested is not HOB, but a different hypothesis, HOB/RC = Organizational-bureaucratic factors
dominate in non-crisis domains of international security policy, whereas rational calculations
dominate in instances of severe security threats. The Cuban case, which involved a severe se-
curity threat, falls squarely within the scope of HOB/RC, which predicts that a rational-choice
logic should govern decision-making. Without bringing in further information, it is not clear
whether the Cuban missile crisis should be regarded as a critical case for HOB/RC, but we can
at least hope to update our confidence in HOB/RC by studying this case.
[Another possible way to interpret this example, which may reflect more closely what Levy had
in mind,26 is to think in terms of a logistic regression, where the values of the independent
variable(s) a↵ect the probability of the outcome. Here, the hypothesis to be tested would be
H = The probability that rational calculations prevail over organizational-bureaucratic factors
in policymaking increases with the perceived severity of the security threat. We will return to
this interpretation in Section 6.2 when we discuss the “Sinatra inference.” Note however that
the severity of the threat is a causal variable in the model, not a scope condition.]
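The bracketed logistic interpretation can be sketched numerically. The functional form and coefficient values below are purely illustrative assumptions, not estimates from any data:

```python
import math

# Hypothetical logistic model for the bracketed interpretation: the probability
# that rational calculations prevail over organizational-bureaucratic factors
# rises with perceived threat severity. Coefficients a, b are illustrative only.
def p_rational(threat_severity, a=-2.0, b=1.0):
    """P(rational calculations prevail | threat severity), logistic in severity."""
    return 1 / (1 + math.exp(-(a + b * threat_severity)))

print(p_rational(0.0))  # low-politics issue: ~0.12
print(p_rational(5.0))  # severe crisis: ~0.95
```

Note that on this reading, threat severity enters the model as a causal variable rather than as a scope condition, consistent with the point above.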
6.1.2. Prior Probabilities
Another criterion alluded to in the literature focuses on the prior probability of the hypothesis
to be tested. Some authors appear to associate a high prior probability for H with a most-likely
26 Personal communication, August 28, 2018.
case, and a low prior probability for H with a least-likely case (Rapport 2015:433, Beach 2018).27
This is a potential alternative interpretation of Levy’s (2002:145) approach; his statement:
“We might expect [emphasis added] organizational routines and bureaucratic politics to affect
decision making on budgetary issues and on issues of low politics, but not in cases involving
the most severe threats to national security, where rational calculations ... should dominate”
suggests that HOB/RC is very plausible a priori: P (HOB/RC | I) is high. Conversely, the rival
hypothesis HOB/OB = Organizational-bureaucratic factors dominate in both non-crisis domains
of international security policy and in instances of severe security threats, would have a low
prior probability.
However, the problem with this interpretation is that priors on hypotheses do not vary across
cases. The background information I on which we base the prior probability is fixed—it nec-
essarily includes not just everything relevant that we know from the outset about the case at
hand (IC1), but also whatever we already know about every other case within the scope of
our hypotheses that we might select for closer scrutiny (IC2 IC3 IC4 ...). Therefore, all cases
satisfying the scope conditions of a given hypothesis would have to have exactly the same sta-
tus on the “most” to “least-likely” continuum under that hypothesis. In Levy’s example, all
salient cases would have to be considered “least-likely” for HOB/OB and “most-likely” for the
rival HOB/RC. This interpretation accordingly fails to capture the intent of the most/least-likely
concept, since it does not admit the possibility that different cases may provide tests of varying
strength.
In sum, while hypotheses can make case-specific predictions, it is extremely difficult to make
sense of the idea that a hypothesis holds true or does not hold true in a particular case (recall
that a hypothesis makes no predictions whatsoever for cases that do not satisfy its scope condi-
tions). If we are uncertain from the outset about whether a given case falls within the scope of
the hypothesis, we might colloquially speak about the probability that the hypothesis explains
that case, but we should not conflate this meaning with the prior probability on the hypothesis
itself. Once we investigate that case more closely, we will either discover that it does satisfy the
scope conditions and then use the case evidence to update the probability on the hypothesis
relative to rivals, or we will find that it falls outside the scope and hence is irrelevant to testing
27 At other points Rapport appears to invoke different understandings, including (but not limited to) the marginal likelihood interpretation discussed in Section 6.1.4.
the hypothesis.
Alternatively, we could try to interpret a least/most-likely case to mean that HC—a restriction,
or specialization, of the broader theory H that applies only to case C—has a low/high prior
probability. For example, HOB/OB would yield the Cuban-case-restricted hypothesis HC(OB/OB) =
Organizational-bureaucratic factors dominated in the decision-making process surrounding the
Cuban missile crisis. This approach allows cases to be “more likely” or “less likely,” in the sense
that the prior probability for each case-restricted hypothesis will be different. Here P (HC1 | I) is determined by our background knowledge IC1 about case C1; whatever we know about other
cases (IC2 IC3 IC4 . . . ) is irrelevant, since HC1 is explicitly case-specific.
However, using restricted hypotheses would come at the price of any ability to generalize conclusions beyond single cases. By construction, HC1 is silent about any other cases C2, C3, C4, . . . .
Testing HC1 can raise or lower our confidence in HC1 compared to rival case-C1-restricted hypotheses, H′C1, H′′C1, . . . , but evidence from case C1 does not update our confidence in HC2
vs. H′C2, or HC3 vs. H′C3, etc., nor does it tell us anything about the more broadly scoped
hypotheses H, H′, H′′, . . . from which they were derived. Yet presumably these more broadly
scoped hypotheses are the primary propositions of interest.
Moreover, this interpretation of most/least likely is problematic in that the prior on a
hypothesis—whether H or HC—does not determine the strength of the test. Test strength
instead depends on the relative likelihoods of the evidence we discover upon investigating the
case. While we could decide to select cases based on priors for HC1 , HC2 , . . . , and then use the
case evidence to test H against H′, H′′, . . . , there is no compelling reason to anticipate that a
case Ck with a relatively low value for the prior on the case-restricted hypothesis, P (HCk | I), would be especially informative for assessing the more broadly scoped hypothesis H versus its
rivals. Appendix C provides more examples to further clarify why the prior-probability
interpretation of most/least-likely cases does not work.
6.1.3. Likelihood of Case Evidence
A third approach to defining most/least-likely cases focuses on the likelihood of case-specific
evidence under a given theory. Suppose P (EC |SC H I) denotes the likelihood of finding some
evidence, clue, or outcome EC after following a search strategy SC in case C. A most-likely case
for a hypothesis H would be associated with a large value for the likelihood P (EC |SC H I), such that the hypothesis strongly predicts EC in this case, while a least-likely case would be
associated with a low value for P (EC |SC H I). Consider Eckstein’s (2000:31) example of a
most-likely case:
Malinowski’s (1926) study of a highly primitive, communistic ... society, to de-
termine whether automatic, spontaneous obedience to norms in fact prevailed in
it, as was postulated by other anthropologists. The society selected was a ‘most-
likely’ case—the very model of primitive, communistic society—and the finding
was contrary to the postulate...
Here we presumably have an anthropological theory H that strongly predicts an evidentiary out-
come EC = Signatures of spontaneous obedience in primitive societies, such that P (EC |SC H I) is high. Levy (2002) and Gerring (2007:238) appear to classify Lijphart’s (1968) Netherlands
case as most-likely on similar grounds: pluralist theory H strongly predicted that the absence of
cross-cutting cleavages in this country would lead to EC = High levels of conflict and instability
in the Netherlands, such that P (EC |SC H I) is high. In both of these instances, whereas the
theory at hand predicted EC with high probability, the authors instead discovered ¬EC, such
that H failed the “most-likely case” test.
There are two key problems with the likelihood approach. First, from a case selection perspec-
tive, we generally do not know what evidence we will find in advance of actually investigating
the case. Nevertheless, some scholars seem to incorporate the evidentiary outcome implicitly
when defining most-likely and least-likely cases; Gerring (2007) does so explicitly by describ-
ing least-likely cases as “confirming” and most-likely cases as “disconfirming” (Table 2), while
Rohlfing (2012:62) lists “failed most-likely” and “passed least-likely” as types of “selection
strategy.” Similarly, usage of these terms in empirical work seems to be primarily retrospective.
Our searches did not find any studies that describe prospectively choosing a most-likely case
for which evidence ends up conforming to the theory’s predictions, nor a least-likely case that
failed to produce evidence concordant with the theory in question. Yet if our understanding
of most/least-likely cases is to have any ex-ante traction, then more often than not we should
discover the unsurprising evidentiary outcome upon investigating such cases. Second, even if we
do discover the unexpected outcome, the hypothesis in question does not necessarily fail/pass
a strong test. Test strength is a function of the likelihood ratio, not the likelihood under a sin-
gle hypothesis of interest. The unexpected evidence discovered could have a similar likelihood
under a rival hypothesis, such that the case provides only a very weak test and we learn little
about which hypothesis is correct.
6.1.4. Marginal Likelihood of Case Evidence
A fourth possible way to interpret what it means for a case to be least-likely or most-likely is
to focus on the marginal likelihood of the evidence, P (E | I), in comparison to the conditional
likelihood P (E |H I). Most discussions of critical cases assume that the theory makes strong
predictions, such that there exists some case evidence or clue EC for which P (EC |SC H I) is high. A least-likely scenario would then be a case for which we do not expect to find this piece
of evidence conditional on our background information alone; that is, P (EC |SC I) is low. In
contrast, for a most-likely case, we are not surprised to discover EC regardless of the truth of
H, such that P (EC |SC I) is also high. When examining a least-likely case, if contrary to prior
expectations we do find EC , we would have cause to strongly increase our confidence in H,
given that the updating factor P (EC |SC H I)/P (EC |SC I) appearing in Bayes’ theorem will
be large. When examining a most-likely case, if EC does indeed turn up as expected a priori,
the theory might be said to “survive a plausibility probe” (KKV 1994:209) in that the posterior
probability for H will not change much, since the ratio P (EC |SC H I)/P (EC |SC I) will be
close to one.
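The updating arithmetic just described can be made concrete with a small numeric sketch. All probability values are illustrative assumptions, chosen only to display the contrast between the two scenarios:

```python
# Two mutually exclusive, exhaustive hypotheses H and H'; values illustrative.
def posterior(prior_H, lik_H, lik_rival):
    """P(H | EC SC I) via Bayes' theorem; the denominator is P(EC | SC I)."""
    marginal = prior_H * lik_H + (1 - prior_H) * lik_rival
    return prior_H * lik_H / marginal

# Least-likely scenario: H strongly predicts EC (0.9), but EC is unexpected a
# priori because the rival barely predicts it (0.05) and H's prior is modest.
# Marginal P(EC|SC I) = 0.2*0.9 + 0.8*0.05 = 0.22; updating factor 0.9/0.22 ~ 4.1.
print(posterior(0.2, 0.9, 0.05))  # ~0.82: finding EC strongly boosts H

# Most-likely scenario: EC is expected regardless of H (marginal 0.86),
# so the updating factor 0.9/0.86 is close to one.
print(posterior(0.2, 0.9, 0.85))  # ~0.21: a "plausibility probe" -- little change
```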
One could potentially use this interpretation to make mathematical sense of Rapport’s
(2015:431) statement that least-likely cases “pose difficult tests of theories, in that ... one
would not expect a theory’s expectations to be borne out by a review of the case evidence.” If
we assume that the words “expect” and “expectations” are intended to have different mean-
ings, we could associate the second term with the predictions of the working hypothesis,
P (EC |SC H I), and the first term with our ex-ante expectations about what we will find,
namely, a weighted average of the predictions of all plausible hypotheses under consideration,
P (EC |SC I) = Σi P (Hi | I)P (EC |SC Hi I). It is important to stress that this marginal likelihood
is the probability of finding some case-specific evidence EC in the particular case C; contrary
to Rapport’s (2015:449) treatment, it is not the relative frequency of some analogous evidence
across a population of cases. We discuss this point further in Appendix D.
This marginal-likelihood approach is an improvement over the conditional likelihood interpre-
tation of most/least-likely cases discussed previously, because the marginal likelihood takes into
account expectations across different hypotheses, rather than attempting a classification based
on a single hypothesis. However, it is still inadequate. Suppose that against prior expectations,
EC is discovered in a case deemed least-likely in the present sense. While the probability of H
is substantially boosted, we cannot necessarily conclude that H comes out ahead, because one
of the rival theories might predict EC just as strongly and will be equally rewarded by Bayes’
theorem.28 Likewise, discovering ¬EC in a most-likely case does not necessarily imply that H
performs worst, because one of the competing hypotheses might perform just as poorly.
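The scenario flagged in footnote 28—evidence unexpected a priori yet strongly predicted by more than one hypothesis—can be sketched numerically; all probabilities are illustrative:

```python
# Three mutually exclusive, exhaustive hypotheses; all numbers illustrative.
priors = {"H1": 0.1, "H2": 0.1, "H3": 0.8}
liks   = {"H1": 0.9, "H2": 0.9, "H3": 0.02}   # P(EC | SC Hi I)

# EC is unexpected a priori: the marginal likelihood is low (0.196)...
marginal = sum(priors[h] * liks[h] for h in priors)
posteriors = {h: priors[h] * liks[h] / marginal for h in priors}

# ...so discovering EC boosts H1 substantially (0.1 -> ~0.46), but the rival H2
# is boosted identically: the posterior odds H1:H2 remain 1:1, and we learn
# nothing about which of the two is correct.
print(marginal, posteriors)
```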
6.1.5. Divergent Likelihoods
A fifth interpretation focuses on divergent (or discriminating) likelihoods of the evidence under
rival hypotheses. Suppose a theory HN strongly predicts EC under search strategy SC in
case C, whereas a rival theory HP strongly predicts ¬EC, such that P (EC |SC HN I) is high
but P (EC |SC HP I) is low. If we subsequently discover EC upon examining the case, the
likelihood ratio strongly favors HN over HP , and we can update our relative confidence in
these theories accordingly. Several authors appear to discuss least-likely cases for HN in this
manner, where HP often plays the role of some “prevailing” theory and HN that of a “new”
explanation to be tested against it (e.g. Bennett and Elman 2007:173). Consider Eckstein’s
(2000:31) example: “Michel’s inquiry into the ubiquitousness of oligarchy in organizations,
based on the argument that certain organizations (those consciously dedicated to grass-roots
democracy...) are least likely, or very unlikely, to be oligarchic if oligarchy were not universal.”
Here HP = Organizations that promote democratic ideals are non-oligarchic, which predicts
EC = Absence of oligarchy in the case of an organization espousing grass-roots democracy,
whereas Michels actually discovers ¬EC = Presence of oligarchy, in accord with HN = Oligarchy
is ubiquitous in organizations. Gerring’s discussion of Tsai’s (2007) study of village governance
in China could also be interpreted as focusing on divergent likelihoods under rival hypotheses.
28 If P (EC |SC I) is low, not all of the rivals can assign high probability to EC, because P (EC |SC I) is just the average likelihood conditional on each of the MEE hypotheses, weighted by their prior probabilities. If we consider only one rival hypothesis, then the probability of H can increase only if the probability of the rival decreases. But with three or more hypotheses, it is possible for the evidence to be unexpected a priori but strongly predicted by more than one hypothesis.
Gerring (2007:236) describes the Li Settlement as a least-likely case for Tsai’s hypothesis, HN
= High social solidarity leads to good governance. Prevailing rival explanations (HP ) that focus
on other causal factors strongly predicted poor governance (¬EC) for this case, whereas HN
correctly predicted good governance (EC).
Using the likelihood-based terminology discussed earlier (Section 6.1.3), one might say that
Eckstein’s and Gerring’s case examples are “most-likely for HP ” and “least-likely for HN” with
respect to EC , since P (EC |SC HP I) is high while P (EC |SC HN I) is low. Several authors
appear to follow this approach (Levy 2002, George and Bennett 2004). However, using the
labels most-likely and least-likely in this manner obscures the fact that all that matters for
testing hypotheses is the value of the likelihood ratio. Updating depends on the quantity
P (EC |SC HP I)/P (EC |SC HN I)—not on P (EC |SC HP I) or P (EC |SC HN I) separately.
Accordingly, a case need not be “most likely” for one hypothesis and “least likely” for a rival
in order to constitute a strong test; a case that is most likely for HN in some absolute sense,
and also most-likely for HP in that same absolute sense, could nevertheless strongly favor HP ,
as long as the evidence is much more probable under HP as compared to HN .29
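A brief sketch illustrates the point that only the ratio matters. The likelihood values, and the decibel (10·log10) convention for expressing weight of evidence, are illustrative assumptions here:

```python
import math

# Weight of evidence in decibels: 10*log10 of the likelihood ratio.
def woe_db(lik_HP, lik_HN):
    return 10 * math.log10(lik_HP / lik_HN)

# "Most likely" for both hypotheses in absolute terms, yet strongly favoring HP:
print(woe_db(0.9, 0.09))   # 10 dB for HP over HN
# Two small absolute likelihoods carry exactly the same weight:
print(woe_db(0.1, 0.01))   # also 10 dB
# Nearly equal likelihoods, however high, carry almost none:
print(woe_db(0.9, 0.85))   # ~0.25 dB
```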
6.1.6. Van Evera Tests
Leaving aside ambiguous language, we can identify a loose correspondence between most/least-
likely cases and Van Evera’s (1997) process-tracing tests. The underlying logic that many
authors appear to have in mind for most-likely cases is simply that of Van Evera’s hoop test,
where successfully jumping through the hoop provides only moderate support for hypothesis,
but failing is strongly disconfirmatory. Cases that are said to be most-likely with respect to one
hypothesis and least-likely with respect to a rival would correspond to Van Evera’s “divergent
predictions cases,” or doubly-decisive tests, where we learn a lot about which theory is correct
regardless of the test outcome. Logically speaking, it should therefore follow that least-likely
cases correspond to Van Evera’s smoking-gun tests, where a clue is unexpected a priori but if
29 Some authors do recognize the importance of the likelihood ratio. Rohlfing (2002:196) for example notes that “the general guideline for Bayesian case studies ... is to maximize the difference between the conditional likelihood of [the evidence under] the working proposition and the null hypothesis [better stated: rival hypotheses].” Yet while this statement takes us in the right direction (as discussed in Section 4, we want to maximize the expected difference between the logarithms of the likelihoods) we still lack a precise articulation of how to go about identifying a case that can be expected to provide a strong weight of evidence ex-ante.
found provides a strong boost to the theory. Rohlfing (2012:183) explicitly asserts the latter
equivalence; Bennett & Elman (2007:173) also seem to have this approach in mind (see Table
2).
These rough correspondences underscore a fundamental problem with treatments of the
most/least-likely characterization: there is no single dimension along which cases can be ordered
with respect to a single hypothesis from most-likely to least-likely that does the inferential work
intended. Van Evera’s test typology requires (at minimum) a two-dimensional parameterization,
involving dichotomous evidentiary outcomes (a binary clue, E or ¬E) and dichotomous hypothe-
ses (e.g., H0 or H1, considered mutually exclusive and exhaustive). As we show elsewhere, a
test is then characterized by the two weights of evidence WoE(E; H0 : H1) and WoE(¬E; H0 : H1).
Least-likely and most-likely cases are presumed to provide strong tests if a particular eviden-
tiary outcome occurs (E in a least-likely case, ¬E in a most-likely case), yet the strength of the
test that the case provides simply cannot be assessed with respect to a single theory H.
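To make the two-dimensional characterization concrete, here is a sketch with an illustrative hoop-test-like likelihood table; the numbers, and the decibel convention for weight of evidence, are assumptions for illustration:

```python
import math

def woe_db(p, q):
    """Weight of evidence (decibels) for H0 over H1, given an outcome with
    likelihood p under H0 and q under H1."""
    return 10 * math.log10(p / q)

# Binary clue E vs not-E; two MEE hypotheses. Hoop-like table:
# H0 almost always produces E, H1 produces it only half the time.
p_E_H0, p_E_H1 = 0.95, 0.5

woe_pass = woe_db(p_E_H0, p_E_H1)           # observing E: ~ +2.8 dB, weak support for H0
woe_fail = woe_db(1 - p_E_H0, 1 - p_E_H1)   # observing not-E: -10 dB, strong disconfirmation
print(woe_pass, woe_fail)
```

Characterizing the test requires both numbers; neither hypothesis alone determines them.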
The connections between most/least-likely cases and Van Evera’s (1997) test typology also
highlight a second difficulty mentioned earlier: the ex-ante problem. Retrospectively, test
classification is superfluous, since updating depends only on the weight of evidence for the
realized data and is always governed by the logic of Bayes’ theorem. Prospectively, however, we
do not know what evidence we will observe. A most/least-likely case may offer the possibility
of an outcome leading to a strong degree of confirmation or disconfirmation of one hypothesis
in comparison to rival(s), but if observing that particular evidentiary outcome is very unlikely,
anticipated test strength must suffer. Importantly, we cannot expect to be surprised by the
data30—for instance, setting out to find a smoking gun is not a good case selection strategy,
because smoking guns are unlikely to be found.
Assessing the prospective strength of a test thus requires more than pointing out the possi-
bility of highly probative evidence. As explained in Section 4.2, we need to average over all
possible evidentiary observations, which allows us to characterize Van Evera’s tests in terms of
discrimination information. And we should also average over our uncertainty regarding which
30 As an amusing example, scientists recently discovered fossilized bones of a giant three-foot tall pre-historic parrot in New Zealand and named the bird Heracles inexpectatus (nicknamed squawkzilla). The lead scientist told the press: “To have a parrot that big is surprising. This thing was way outside of expectations.” When asked about the research team’s plans to return to the site of the discovery for further excavation, he remarked: “We can’t go and plan to dig up a giant parrot. [But] if we turn over a lump of dirt and find one, we’ll be very pleased.” (New York Times Aug. 6, 2019: nytimes.com/2019/08/06/science/giant-parrot-new-zealand.html)
hypothesis is correct, whereby we obtain the expected information gain associated with a given
case (Section 4.3).
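The double averaging described above can be sketched for the simplest setting of a binary clue and two MEE hypotheses. All probabilities are illustrative, and the decibel scale (10·log10) is one common convention for weights of evidence:

```python
import math

# Expected information gain: average the weight of evidence for the true
# hypothesis over (i) the possible evidentiary outcomes and (ii) our prior
# uncertainty about which hypothesis is true. All inputs illustrative.
def expected_info_gain_db(prior_H0, p_E_H0, p_E_H1):
    gain = 0.0
    for true_h, prior, p_E in ((0, prior_H0, p_E_H0), (1, 1 - prior_H0, p_E_H1)):
        for out_H0, out_H1, p_out in (
            (p_E_H0, p_E_H1, p_E),             # clue E observed
            (1 - p_E_H0, 1 - p_E_H1, 1 - p_E)  # clue absent
        ):
            # weight of evidence for the true hypothesis under this outcome
            num, den = (out_H0, out_H1) if true_h == 0 else (out_H1, out_H0)
            gain += prior * p_out * 10 * math.log10(num / den)
    return gain

# Divergent likelihoods make a case far more informative in expectation than
# one where both hypotheses predict E about equally:
print(expected_info_gain_db(0.5, 0.9, 0.1))   # ~7.6 dB expected
print(expected_info_gain_db(0.5, 0.9, 0.85))  # ~0.05 dB expected
```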
6.1.7. Nested Regression Models
Several authors use terms reminiscent of regression models when discussing the inferential
logic of least-likely cases (RSI 2004, Levy 2002, and Gerring 2007). Consider for example
RSI (2004:335): “A least-likely case often has extreme values on variables associated with
rival hypotheses, such that we might expect these other variables to negate the causal effect
predicted by the theory.” Note that as stated, this assertion is problematic, considering that
rival theories cannot simultaneously operate to produce a causal effect, nor compete for control
over the outcome in a case.31 However, we can make sense of this statement if we interpret
it to mean that a simple hypothesis of interest, H1, which asserts that X1 alone causes the
outcome, is a member of some larger class of nested regression models, H, which also contains
alternative theories that invoke X1 along with additional independent variables X2, X3, . . .
acting and interacting in various ways. Suppose we choose a case for which the additional
variables X2, X3, . . . are expected to strongly influence the outcome according to the most
plausible alternative theories contained in H, in the sense that those variables take on extreme
values that, according to these more elaborate theories, will tend to push in the opposite
direction of what the simpler model H1 predicts. If H1 nevertheless works in this case, then we
have evidence suggesting that we need not resort to a more complex model that includes the
X2, X3, . . . variables.
Gerring’s (2007:236) example of Tsai’s (2007) case study can be interpreted according to this
logic, where least-likely cases are “villages that evidence a high level of social solidarity but
which, along other dimensions, would be judged least likely to develop good governance,”
such that “...all other plausible causal factors for [the good governance] outcome have been
minimized.”
This regression-like inferential logic is sound in principle, and even in purely qualitative research
it may sometimes be fruitful to pose hypotheses that are analogous to regression models. How-
31 If theories are truly rival, meaning mutually exclusive, only one can operate, although this one theory might invoke multiple variables acting in concert or in opposition.
ever, rival hypotheses cannot always be sensibly fit into this type of “nested model” structure.
The goal may be to test very different causal explanations that invoke distinct independent
variables and unrelated causal mechanisms, rather than to ascertain whether the “fit” would
be improved by adding additional control or explanatory variables to the working hypothesis.
6.2. Rethinking the “Sinatra Inference”
Most-likely and least-likely cases have been closely associated with a popular inferential logic
that can pose additional pitfalls. Levy (2002:144) provides the most explicit articulation of this
logic, which he describes as the Sinatra inference: if a theory can make it in a least-likely case,
it can make it anywhere—that is, a theory that survives a test in a least-likely case can be
anticipated to work in any other, “more-likely” case. Conversely, the inverse Sinatra inference
holds that if a theory cannot make it in a most-likely case, it cannot make it anywhere—we do
not expect the theory to work in “less-likely” cases if it has come up short in a most-likely case
(Levy 2002:144).
If the Sinatra inference simply means that hypothesis H has passed a severe test based on the
case evidence discovered, such that the posterior odds in favor of H relative to the rival(s)
are substantially boosted and we accordingly gain confidence in using H to make predictions
about new cases that fall within its defined scope, then this reasoning is perfectly sound from
a Bayesian perspective. However, the Sinatra analogy seems to imply a different interpretation
that is not justified within Bayesianism: if we find that the theory is successful in a least-likely
case, then it becomes even more probable that the theory will work in all other cases that we
had initially ranked as “more likely” for the theory. That is, a case we previously judged to be
“moderately likely” for H somehow gets boosted in status to become a “very likely” case, and
so forth.
This latter interpretation of the Sinatra logic is problematic—aside from all of the previously-
discussed difficulties inherent in ascertaining what it means for a case to be least likely with
respect to a hypothesis—because it suggests that the least-likely case lends support to the
theory above and beyond whatever weight of evidence an examination of that case produced,
or through some means other than simply increasing our posterior probability on the hypothesis.
Suppose that two cases A and B each end up yielding the same weight of evidence in favor of
H relative to a rival H′. Case A cannot be more or less informative than case B on the grounds
that before conducting the test, we judged case A to be “least likely,” or “less likely,” for H
relative to case B. Whichever (or both) of these two cases we choose to test our theory, we
become more confident that H is correct to the extent that the weight of evidence increases
the posterior odds on H versus H 0. Post facto—once we have conducted the case study and
observed the relevant clues and outcomes—two tests that provide the same weight of evidence
constitute tests of equivalent retrospective strength. Our prior expectations about what we
might have found in the case, or which case we had expected to provide a stronger test, become
irrelevant to the inference we draw once the data are in hand.
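The arithmetic behind this point can be made concrete with a small numerical sketch (the weight-of-evidence values below are invented purely for illustration, using the decibel convention of Good and Jaynes):

```python
import math

def update_odds(prior_odds, weight_of_evidence_db):
    """Multiply the prior odds on H vs. H' by the Bayes factor
    corresponding to a weight of evidence stated in decibels."""
    bayes_factor = 10 ** (weight_of_evidence_db / 10)
    return prior_odds * bayes_factor

# Hypothetical: both cases end up providing 6 dB of evidence for H over H'.
prior_odds = 1.0   # even prior odds on H vs. H' before either case study
woe_case_A = 6.0   # decibels, observed after studying case A
woe_case_B = 6.0   # decibels, observed after studying case B

# The posterior odds are identical; prior labels ("least likely" vs.
# "moderately likely" case) play no role once the evidence is in hand.
posterior_A = update_odds(prior_odds, woe_case_A)
posterior_B = update_odds(prior_odds, woe_case_B)
assert math.isclose(posterior_A, posterior_B)
```

Whatever a priori rankings we assigned to the two cases, identical weights of evidence produce identical posterior odds.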
A related problem with this interpretation of the Sinatra logic arises from the fact that prospectively,
we would not expect a least-likely case for H to produce a large weight of evidence in
favor of H. If the case does end up doing so despite prior expectations, it would be wise to
revisit the assumptions that informed our prior judgments, and perhaps the theory itself, before
postulating that H can “make it anywhere.” The very fact that we observed something unex-
pected signals grounds for caution—e.g., we may have mischaracterized the case as least-likely
ex-ante, we may have misjudged the nature of auxiliary or antecedent (scope) conditions for
our theory, we may have misestimated the prevalence and/or typical values of relevant variables
within the set of considered cases. While the least-likely case evidence does boost the credibil-
ity of H (relative to rivals) in the scenario at hand, we might also anticipate the possibility of
unexpected evidence in other cases, such that we should not conclude ahead of time that all
other cases will weigh in favor of the theory.
To further elucidate the potential pitfalls, consider an analogy to classroom testing, where
an instructor assesses student learning via an online exam that can be programmed to present
questions in some specified order. The analog of a “least-likely” approach would entail beginning
the exam with what the instructor a priori judges to be the hardest problem. Suppose a student
answers this first, ostensibly most difficult problem correctly. Can we conclude that because
the student succeeded on this question, s/he will have no trouble with any of the subsequent,
ostensibly easier problems? Taken to an extreme, the Sinatra logic suggests that the instructor
can program the online test to end if the student gets this first question right and confidently
award an ‘A’ on the exam (i.e., accept hypothesis HA = The student merits an A).32 But
32 In this vein, Beach and Pedersen (2016:19) go so far as to assert that finding that a hypothesis holds “where
this strategy would clearly be ill-advised. The student’s answer could have been a fluke. Or
the instructor may have misjudged the relative difficulty of the exam questions, such that the
student would find subsequent questions harder than problem #1. Furthermore, the student
may have studied only the topic covered in this particular question, but not others. Indeed,
question difficulty is multi-dimensional and may be highly context-dependent or sensitive to
characteristics of the student body that are not well known before the exam is administered.
In any of these scenarios, the student’s grade may well come out below an ‘A’ (i.e., HA) if
required to complete all of the exam questions.
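A quick Bayesian calculation illustrates why awarding the grade after one answer is premature (all numbers here are hypothetical, chosen only to make the point):

```python
# Hypothetical likelihoods and prior, for illustration only.
p_correct_given_A = 0.90     # an A-level student answers the hard item correctly
p_correct_given_notA = 0.30  # weaker students sometimes get it right anyway:
                             # a fluke, a lucky guess, or misjudged difficulty
prior_A = 0.25               # prior probability that the student merits an A

# Bayes' rule after observing one correct answer to the "hardest" question:
numerator = p_correct_given_A * prior_A
marginal = numerator + p_correct_given_notA * (1 - prior_A)
posterior_A = numerator / marginal
print(round(posterior_A, 3))  # 0.5: far from the near-certainty an 'A' requires
```

Even a correct answer on the ostensibly hardest question leaves the hypothesis HA only moderately supported; ending the exam there would be unwarranted.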
Likewise, a testing analogy for the “most-likely” logic would entail giving the easiest question
first, and assigning an ‘F’ on the exam (i.e., taking this evidence to confirm HF = The student
merits an F) if the student answers this one question incorrectly, reasoning that if s/he fails
this question, s/he will also fail questions that the instructor views as more difficult. The same
caveats apply. The student’s first answer may have resulted from distraction or a keyboard error,
the instructor’s ranking of question di�culty may be inaccurate from the student’s perspective,
and/or the student might perform differentially on distinct types of problems, such that the
exam grade might turn out very differently if the student were allowed to tackle all of the
questions (i.e., HF may be the correct hypothesis). Additionally, the goal is typically not
to adjudicate between HA vs. ¬HA or HF vs. ¬HF, but instead to reliably assess hypotheses
HA, HA−, HB+, HB, . . . , HF that discriminate between finer gradations of understanding. To
this end, the most informative questions tend not to lie at the extremes of difficulty (where
almost everyone or almost no one can answer correctly), but instead near the inflection points.33
Generally speaking, if exam time is limited, neither choosing only what we believe to be the
most difficult question, nor using what we deem to be the easiest question, serves as an effective
strategy for assessing hypotheses about student learning. Selecting discriminating questions and
diverse questions is a much better idea.
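The Item Response Theory intuition from the footnote can be sketched in a few lines: for a one-parameter (Rasch-style) logistic item, the Fisher information is p(1 − p), which peaks where the probability of a correct answer is 0.5 and collapses at the extremes of difficulty. (The ability scale and difficulty values below are illustrative assumptions.)

```python
import math

def p_correct(ability, difficulty):
    """One-parameter logistic (Rasch-style) response probability."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_information(ability, difficulty):
    """Fisher information of a Rasch item: p * (1 - p)."""
    p = p_correct(ability, difficulty)
    return p * (1.0 - p)

ability = 0.0  # a student of average ability on a hypothetical logit scale
# A very easy item, a very hard item, and one matched to the student:
for difficulty in (-4.0, 4.0, 0.0):
    info = item_information(ability, difficulty)
    print(f"difficulty {difficulty:+.1f}: information {info:.3f}")
```

Items far too easy or far too hard carry almost no information about the student's ability; the item matched to the student (near the inflection point) is the most discriminating.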
Likewise, in social science we should seek cases that are anticipated to be discriminating vis-à-vis
rival explanations, and cases that are expected to test a range of different implications of
one least expects it enables one to infer across cases that it should therefore be everywhere,” implying we have definitively confirmed the theory. This conclusion is clearly fallacious.
33 Item Response Theory has been developed to disentangle just these sorts of issues. It models each test question in terms of a logistic regression, predicting the probability that a person with given ability level (or levels across multiple topical or conceptual dimensions) will answer correctly, and includes parameters that account not only for the ability of the student but also an a posteriori assessment of the difficulty of the question, and the effects of chance guessing.
the theories, rather than focusing on a single case which may not be as informative as initially
anticipated, and may only speak to a limited range of theoretical implications. A case deemed
most/least-likely with regard to some particular aspect of a theory could well provide a hard
test for the theory in that regard, but only a weak test with respect to other important aspects
of the theory—just as exam questions may tap different areas of student knowledge and ability.
Let us now reexamine the Sinatra logic as Levy applies it to the Cuban missile crisis example.
For simplicity, we restrict the hypotheses to the realm of security crises, so that we aim to test
HRC = Rational calculations dominate decision-making in instances of severe threats, versus
HOB = Organizational-bureaucratic factors dominate decision-making in instances of severe
threats. Levy (2002:145) characterizes the Cuban missile crisis as a least-likely case for HOB,
and following the Sinatra logic, asserts that: “If Allison could show that bureaucratic and
organizational factors had a significant impact on key decisions in the Cuban missile crisis, we
would have good reasons to expect that these factors would be important in a wide range of
other situations.” If this assertion means that we have raised the posterior probability on HOB,
with the stated scope conditions, such that “other situations” means “other cases of severe
threats that fall within the scope of HOB,” then the reasoning is entirely valid. If instead this
assertion means we now expect organizational-bureaucratic factors to apply in other realms
of foreign policy beyond international security—for the sake of illustration, take trade policy
(although this area is admittedly not what Levy has in mind)—then we are revising the scope
conditions and ought to pursue further tests. The Cuban case as a test of HOB says nothing
about whether organizational-bureaucratic factors will be relevant in the realm of foreign trade.
In some situations, however, we can make sense of the generalization logic underlying the
Sinatra intuition by invoking logistic-regression-type models. Suppose our hypothesis models
the probability of outcome Y as a sigmoidal function of the independent variable(s) X, with
an unknown point of inflection. If we observe Y in a case with the least favorable value(s) of
X, we have some evidence to suggest that the logistic curve rises rapidly with X such that the
probability of Y is high over the entire range of X. Conversely, if Y fails to occur in a case
with the most favorable value(s) of X, we have some evidence suggesting that the probability
of Y remains low over the full range of X.
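This reasoning can be sketched as a grid-based Bayesian update over the unknown inflection point (the grid, prior, and observed value of X below are illustrative assumptions, not a model from the text):

```python
import math

# Hypothetical sketch: the hypothesis models P(Y=1 | x) = logistic(x - c),
# with an unknown inflection point c. Observing Y in a case with the least
# favorable value of x should concentrate belief on low values of c.

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

inflection_grid = [i / 2 for i in range(-10, 11)]          # candidate c values
prior = [1 / len(inflection_grid)] * len(inflection_grid)  # flat prior over c

x_least_favorable = -5.0   # the "least-likely" case's value of X
# Likelihood of observing Y = 1 in that case, for each candidate c:
likelihood = [logistic(x_least_favorable - c) for c in inflection_grid]

unnorm = [p * l for p, l in zip(prior, likelihood)]
posterior = [u / sum(unnorm) for u in unnorm]

# Posterior mass shifts toward low inflection points (the curve rises early),
# so the predicted probability of Y is high over the entire range of X.
post_mean_c = sum(c * p for c, p in zip(inflection_grid, posterior))
prior_mean_c = sum(c * p for c, p in zip(inflection_grid, prior))
assert post_mean_c < prior_mean_c
```

A single observation shifts the posterior mean of the inflection point sharply downward, which is the sense in which the Sinatra intuition can be rationalized; the caveats in the text about extrapolating from one case still apply.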
To illustrate, consider a different aspect of Levy’s (2002:157) reasoning for the Cuba example:
“if the rational-unitary model cannot explain state behavior in an international crisis as acute
as the one in 1962, we would have little confidence that it could explain behavior in situations of
non-crisis decision making.” Here we can model the underlying causal relationship as a logistic
regression, where the severity of crisis increases the probability that decision-making will be
dominated by rational calculations. If we find that at the highest level of crisis—approximated
by the Cuban missile crisis—rational calculations do not govern decision making, we can rule
out most of the “risk-level” parameter space for the location of the inflection point in the
logistic regression model. Accordingly, the probability of rational calculations dominating will
be low for most other instances of international security policymaking. Two caveats apply to
this logistic regression interpretation. First, caution is needed when extrapolating from a single
case, without filling in data points elsewhere along the X axis. Second, not all hypotheses can
be modeled in this manner.
6.3. Moving Forward: Retire the Most/Least-Likely Case
Even though the notion of most-likely and least-likely cases has a long history in qualitative
methods, our analysis shows that the intuition scholars seem to have in mind cannot be spelled
out clearly in Bayesian terms, except perhaps for special situations when the rival theories of
interest can be articulated as nested models, or when we are working with a logistic regression
model. If we instead take Bayesian probability as the starting point and ask how we might
prospectively identify cases that will provide strong tests of rival hypotheses, we are led to the
information-theoretic quantities defined in Section 4. While we can make associations between
discrimination information (Section 4.2) and Van Evera’s test types, we stress that neither a
prospective hoop test nor a prospective smoking-gun test justifies a “Sinatra inference” logic—
this logic can only make sense if we are working with a logistic-regression model as described
in Section 6.2. Accordingly, we believe the best way to preclude confusion is to set aside efforts
to classify cases as most/least likely—and even set aside discrete test-type typologies—and
focus instead on the fact that updating is always a matter of degree, and whatever our a priori
expectations about the probative value of a particular case, our inferences depend only on the
evidence that we ultimately obtain.
7. CONCLUSION
This paper has articulated a Bayesian approach to case selection that is grounded in information
theory, as opposed to a frequentist approach relying on random sampling from a pre-defined
population. The core principle underlying our Bayesian approach is to seek those cases that
will be most informative. At early stages of research, any information-rich case will be useful
for inventing hypotheses. Once we have clearly articulated a set of rival hypotheses, the goal
becomes choosing cases that we anticipate will serve as strong tests—namely, cases that can
be expected to provide a large weight of evidence in favor of whichever hypothesis provides the
best explanation. From this perspective, “critical cases” should not be thought of in terms of a
“most/least-likely” logic, which we have argued is difficult to reconcile with Bayesian reasoning,
but simply as cases that maximize anticipated weight of evidence in favor of the best hypothesis
under consideration.
While in principle, the optimal Bayesian approach to case selection involves maximizing our
mathematical measure of expected information gain, in practice, this task will usually be
prohibitively di�cult. Yet the properties of this mathematical measure tell us that prospec-
tively, we can expect to learn from any case we choose. So despite the fact that practical
realities will generally prevent us from identifying the optimal case(s) ahead of time, we can
still expect that on average, the (possibly sub-optimal) case(s) we do study will bring us closer
to the best explanation.
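The non-negativity property invoked here can be illustrated for a simple case with binary evidentiary outcomes under two rivals H and H′: the expected weight of evidence toward the true hypothesis is a prior-weighted mixture of Kullback–Leibler divergences, which is non-negative by Gibbs' inequality. (The likelihood and prior values below are invented for illustration.)

```python
import math

def expected_info_gain(p_H, p_E_given_H, p_E_given_Hp):
    """Expected weight of evidence toward the true hypothesis, for binary
    evidence E / not-E under rivals H and H': a prior-weighted mixture of
    KL divergences, non-negative by Gibbs' inequality."""
    def kl(p, q):
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    return (p_H * kl(p_E_given_H, p_E_given_Hp)
            + (1 - p_H) * kl(p_E_given_Hp, p_E_given_H))

# A discriminating case (likelihoods differ sharply) vs. a weak one:
strong = expected_info_gain(0.5, 0.9, 0.2)
weak = expected_info_gain(0.5, 0.55, 0.45)
assert strong > weak > 0   # prospectively, we expect to learn from any case
```

Even the weakly discriminating case has strictly positive expected information gain, which is why, on average, any case we study brings us closer to the best explanation.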
Finally, we emphasize that retrospectively, after the evidence is in hand, whatever we expected
or hoped we might learn from studying a particular case beforehand becomes irrelevant to
inference. Cases do not carry any extra inferential import from rankings or judgements made
a priori. We simply update the probabilities on rival hypotheses in accord with the weight of
evidence that the studied case actually provides, which may be stronger, weaker, or as decisive
as we anticipated ahead of time. To the extent that the case-study evidence boosts the posterior
probability on a particular hypothesis H above the salient rivals, we gain confidence that this
hypothesis will also explain other cases that satisfy its stated scope conditions. Learning then
proceeds iteratively, as we examine additional cases and potentially revise the scope conditions
of our hypothesis—whether generalizing by broadening the scope of applicability, or narrowing
down the scope to avoid overreach and loss of accuracy.
Inventing hypotheses:
- Seek information-rich cases.

Comparing hypotheses:
- In principle, maximize expected information gain—i.e., anticipated weight of evidence in favor of the best hypothesis.
- In practice, curtail efforts to identify optimal cases, prioritize pragmatic considerations, and proceed to gather evidence on the chosen case(s).
- Seek diversity among cases, with the goal of obtaining logically independent evidence across cases and testing multiple aspects of theory.
- Cases that are similar across many possible causal factors apart from X can potentially serve jointly as an informative test of whether X matters for Y.
- When starting with an accepted high-level model asserting that X causes Y, model-conforming cases provide fertile ground for adjudicating between hypotheses that posit different mechanisms through which X leads to Y.
- Provide a clear rationale for focussing on a particular set of cases, to signal that severe tests have not been deliberately avoided and to facilitate scholarly scrutiny and follow-up research.

Assessing scope:
- Seek diversity among cases, with the goal of identifying scope limitations.
- Include a case where the advocated theory does not seem to perform well, in order to identify boundary conditions.
- Provide a clear rationale for focussing on the selected cases, to facilitate scholarly scrutiny of the argument’s scope.

TABLE I Guidelines for Case Selection
8. REFERENCES
Allison, G.T. 1971. Essence of Decision. New York: Little Brown.
Beach, Derek. 2018. “Look Before you Leap.” Paper prepared for the American Political Science
Association Annual Conference, Boston.
Beach, Derek, and Rasmus Pedersen. 2016. “Selecting Appropriate Cases When Tracing Causal
Mechanisms,” Sociological Methods & Research https://doi.org/10.1177/0049124115622510
Bennett, Andrew, and Colin Elman. 2007. “Case Study Methods in the International Relations
Subfield.” Comparative Political Studies 40(2):170-195.
Boas, Taylor. 2016. Presidential Campaigns in Latin America: Electoral Strategies and Success
Contagion. Cambridge University Press.
Brady, Henry, and David Collier (RSI). 2010. Rethinking Social Inquiry. Lanham: Rowman
and Littlefield.
Eckstein, Harry. 2000. “Case Study and Theory in Political Science,” Chapter 6 in Case Study
Method, eds. Roger Gomm, Martyn Hammersley, and Peter Foster, Sage Research Methods.
http://dx.doi.org/10.4135/9780857024367.d11
Fairfield, Tasha. 2015. Private Wealth and Public Revenue in Latin America: Business Power
and Tax Politics. Cambridge University Press.
Fairfield, Tasha, and A.E. Charman. 2017. “Explicit Bayesian Analysis for Process Tracing,”
Political Analysis 25(3):363-380.
Fairfield, Tasha, and A.E. Charman. 2019. “The Bayesian Foundations of Iterative Research in
Qualitative Social Science: A Dialogue with the Data.” Perspectives on Politics 17(1):154-167.
Fairfield, Tasha, and A.E. Charman. 2020. “Reliability of Inference: Analogs of Replication in
Qualitative Research.” In The Production of Knowledge: Enhancing Progress in Social Science,
Eds. Colin Elman, John Gerring, and James Mahoney. Cambridge University Press.
Garay, Candelaria. 2007. “Social Policy and Collective Action: Unemployed Workers, Com-
munity Associations, and Protest in Argentina,” Politics & Society 35:301-328.
George, Alexander. 1979. “Case studies and theory development.” In Lauren, P., ed. Diplo-
macy: New Approaches in Theory, History, and Policy. New York: Free Press, 43-68.
George, Alexander, and Andrew Bennett. 2005. Case Studies and Theory Development. Cam-
bridge, MA: MIT Press.
Gerring, John. 2007. “Is There a (Viable) Crucial-Case Method?” Comparative Political Studies
40(3):231-253.
Goertz, Gary. 2017. Multimethod Research, Causal Mechanisms, and Case Studies. Princeton
University Press.
Good, I.J. 1983. Good Thinking. University of Minnesota Press.
Humphreys, Macartan, and Alan Jacobs. 2015. “Mixing Methods: A Bayesian Approach.”
American Political Science Review. 109(4):653-73.
Jaynes, E.T. 2003. Probability Theory: The Logic of Science. Cambridge University Press.
Karl, Terry. 1997. The Paradox of Plenty. University of California Press.
King, Gary, Robert Keohane, and Sidney Verba (KKV). 1994. Designing Social Inquiry. Prince-
ton University Press.
Kurtz, Marcus. 2009. “The Social Foundations of Institutional Order: Reconsidering War and
the ‘Resource Curse’ in Third World State Building.” Politics & Society 37(4):479–520.
Levy, Jack. 2002. In Brecher, Michael, and Frank Harvey, Eds., Evaluating Methodology in
International Studies. University of Michigan Press.
Levy, Jack. 2008. “Case Studies: Types, Designs, and Logics of Inference.” Conflict Manage-
ment and Peace Science 25:1-18.
Lieberman, Evan. 2005. “Nested Analysis as a Mixed-method Strategy for Comparative Re-
search.” American Political Science Review 99(3):435-452.
Lijphart, A. 1968. The Politics of Accommodation: Pluralism and Democracy in the Nether-
lands. Berkeley: University of California Press.
Patton, Michael Quinn. 2001. Qualitative Research and Evaluation Methods. Thousand Oaks,
CA: Sage.
Rapport, Aaron. 2015. “Hard Thinking about Hard and Easy Cases in Security Studies,”
Security Studies 24(3):431-465.
Rohlfing, Ingo. 2012. Case Studies and Causal Inference. Palgrave Macmillan.
Ross, Michael. 2004. “How Do Natural Resources Influence Civil War? Evidence from Thirteen
Cases.” International Organization 58(1):35-67.
Seawright, Jason. 2016. Multi-Method Social Science. Cambridge University Press.
Seawright, Jason, and John Gerring. 2008. “Case Selection Techniques in Case Study Research,
A Menu of Qualitative and Quantitative Options.” Political Research Quarterly 61(2):294-308.
Singh, Prerna. 2015. “Subnationalism and Social Development: A Comparative Analysis of
Indian States.” World Politics 67(3):506-62.
Slater, Dan. 2009. “Revolutions, Crackdowns, and Quiescence: Communal Elites and Demo-
cratic Mobilization in Southeast Asia.” American Journal of Sociology 115(1):203-254.
Tsai, Lily. 2007. Accountability without Democracy: How Solidary Groups Provide Public
Goods in Rural China. Cambridge, UK: Cambridge University Press.
Van Evera, Stephen. 1997. Guide to Methods for Students of Political Science. Ithaca, NY:
Cornell University Press.
Wason, P.C. 1968. “Reasoning about a rule.” Quarterly Journal of Experimental Psychology
20(3):273-281.
Wason, P.C. 1960. “On the failure to eliminate hypotheses in a conceptual task.” Quarterly
Journal of Experimental Psychology 12(3):129-140.
Western, Bruce, and Simon Jackman. 1994. “Bayesian Inference for Comparative Research.”
American Political Science Review 88(2):412-423.
Wood, Elisabeth. 2000. Forging Democracy from Below: Insurgent Transitions in South Africa
and El Salvador. New York: Cambridge University Press.
GUIDING PRINCIPLE / Selection Strategy — Procedure — Purpose (e.g. developing, testing, or generalizing theory)

GUIDING PRINCIPLE: Outliers and Extremes
- “Outlier Cases” (Van Evera). Procedure: Choose cases where the outcome is poorly explained by existing theories. Purpose: Developing theory.
- “Deviant Cases” (Seawright & Gerring, Levy). Procedure: Choose cases with IV and DV values that deviate from an established cross-case relationship. Purpose: Developing theory (primary use); testing theory (less common): “disconfirming a deterministic proposition” (p.302). Seawright 2016: finding ‘omitted variables,’ ‘testing hypotheses about causal paths,’ defining scope conditions (p.77).
- “Off-the-Line Cases” (Lieberman). Procedure: Choose “at least one case that has not been well predicted by the best-fitting statistical model,” with a focus on the DV (p.445). Purpose: Developing theory: model-building in multi-method research, when a preliminary model is not sufficiently robust (p.445).
- “Extreme Value Cases” (Seawright & Gerring). Procedure: Choose cases that take on extreme values on the IV or on the DV relative to that variable’s population mean. Purpose: Developing theory.
- “Extreme Value Cases on the IV” (Seawright 2016). Purpose: Finding ‘omitted variables,’ ‘testing hypotheses about causal paths,’ defining scope conditions (p.77).
- “Extreme Value Cases” on SV (Van Evera). Procedure: Choose cases with extreme values on the study variable (SV); the researcher may be interested in this variable’s causes, or its effects. Purpose: Developing theory: “If values on the SV are very high, its causes (or effects...) should be present in unusual abundance, hence these causes (or effects) should stand out against the background of the case more clearly,” and vice versa (p.80).
- “Extreme Value Cases” on IV (Van Evera). Procedure: Choose cases with extreme values on the IV. Purpose: Testing theory: “Such cases offer strong tests because the theory’s predictions about the case are certain and unique” (p.79).
- “Extreme Value Cases” with X, ~Y (Van Evera). Procedure: Choose cases where the causal factor is strongly present but the outcome is absent. Purpose: Assessing scope conditions.
- “Falsification-Scope Cases” (X, ~Y) (Goertz). Procedure: Choose cases where the causal factor is present but the outcome is absent. Purpose: Testing: disconfirming a hypothesized causal mechanism in multi-method research; assessing scope conditions.

GUIDING PRINCIPLE: Model-Concordance
- “On-the-Line Cases” (Lieberman). Procedure: For an accepted model of a cross-case relationship, choose cases that fit with the model’s predictions, and maximize variation on the IV. Purpose: Testing: “assessing the strength of a particular model” in multi-method research.
- “Typical / Representative Cases” (Seawright & Gerring). Procedure: Begin with a stable cross-case relationship; select low-residual (on-lier) cases. Purpose: Testing: “to probe causal mechanisms that may either confirm or disconfirm a given theory.”
- “Causal Mechanism Cases” (Goertz). Procedure: Begin with an established cross-case relationship; select cases where the IV and DV are both present, following various regression-based criteria; avoid over-determination. Purpose: Testing: confirming that “the proposed causal mechanism is in fact working for this observation” in multi-method research.
- “Pathway Cases” (Gerring). Procedure: Begin with an established cross-case relationship; look for cases close to the regression line that show scores on Y that are “strongly influenced by the theoretical variable of interest Xi, taking all other factors into account (X2)” (p.242). Purpose: Developing theory: elucidating causal mechanisms and thus clarifying the X→Y hypothesis.

GUIDING PRINCIPLE: Variation
- Maximize Variation on the DV (KKV). Procedure: Select cases that encompass the full range of the DV.
- “Diverse Cases” on IV or on DV (Seawright & Gerring). Procedure: Select cases that maximize variance on the relevant dimension. Purpose: Developing theory: “generating hypotheses.”
- “Diverse Cases” that follow the hypothesized X–Y relationship (Seawright & Gerring). Procedure: Select cases that capture the full range of values along the X–Y relationship. Purpose: Testing theory: “confirmatory hypothesis testing.”
- “Large within-case variance on the IV” (Van Evera). Procedure: Select cases with large within-case variance on the IV: “The more within-case variance in the IV’s value, the more predictions we have to test” (p.82). Purpose: Testing theory.

GUIDING PRINCIPLE: Representativeness
- Random Sampling (Herron & Quinn 2014; see also Fearon & Laitin 2008). Procedure: Choose cases via random selection from the population; advocated when selecting 5 or more cases. Purpose: Testing theory: estimating population-level average causal effects.
- “Typical / Representative Cases” (Seawright & Gerring). Procedure: If there is a strong cross-case relationship such that most cases are on-line, then cases that are typical of that relationship will also tend to be representative of the population. Purpose: Testing theory: “to probe causal mechanisms that may either confirm or disconfirm a given theory.”
- “Cases with Prototypical Background Characteristics” (Van Evera). Procedure: Choose cases with “average or typical background conditions, on the grounds that theories that pass the tests these cases pose are more likely to travel well, applying widely to other cases” (p.84). Purpose: Generalizability.

GUIDING PRINCIPLE: Control
- “Most Similar Cases” (Seawright & Gerring, Nielsen 2016; see also Mill 1872, Przeworski & Teune 1970, Lijphart 1971). Procedure: Choose cases that are “similar on all the measured independent variables, except the independent variable of interest” (p.304), using matching strategies when starting with a large-N dataset. Purpose: “Exploratory if the hypothesis is X- or Y-centered; confirmatory if X/Y-centered” (S&G p.298); testing: “helps to rule out alternative causes” in combination with process tracing (N p.574).
- “Most Different Cases” (Seawright & Gerring; see also Mill). Procedure: Choose cases where “just one independent variable as well as the dependent variable covary, and all other plausible independent variables show different values” (p.306). Purpose: Testing theory: “to eliminate necessary causes” (p.298).

GUIDING PRINCIPLE: Informativeness
- “Data Rich Cases” (Van Evera; see also Flyvbjerg’s “Information Rich” cases). Procedure: Select cases that are rich in data, to maximize learning. Purpose: Developing and testing theory using process tracing.
- “Divergence of Predictions” (Van Evera). Procedure: Select cases for which “competing theories make divergent predictions” (p.88). Purpose: Testing theory.
- “Crucial Cases” (Eckstein). Procedure: The case “must closely fit a theory if one is to have confidence in the theory’s validity, or, conversely, must not fit equally well any rule contrary to that proposed.” Purpose: Testing theory.
- “Most Likely Case.” Procedure: Choose a case that is “strongly expected to conform to the prediction of a particular theory” (RSI p.339). Purpose: Invalidation.
- “Least Likely Case.” Procedure: Choose a case that is “strongly expected not to conform to the prediction of the theory” (RSI p.339). Purpose: Confirmation.
M
ost-Likely C
ase (ML
C)
Least-L
ikely Case (L
LC
) E
ckstein (1975)
“ ‘Most-likely’ or ‘least-likely’ cases—
cases that ought, or ought not, to invalidate or confirm theories, if any cases can be expected to do so.” p.149
Parsing the above literally, we obtain:
“MLC
s ought to invalidate theories, if any cases can be expected to do so.” C
ontradiction: prospectively, a MLC
ought to support a theory.
MLC
example: “M
alinowski’s (1926) study of a highly prim
itive, comm
unistic ... society, to determ
ine whether autom
atic, spontaneous obedience to norms in
fact prevailed in it, as was postulated by other anthropologists. The society
selected was a ‘m
ost-likely’ case—the very m
odel of primitive, com
munistic
society—and the finding w
as contrary to the postulate...” “The ‘m
ost-likely’ case ... [seems especially tailored] to invalidation.” p.149
Am
biguity: Does not clearly distinguish betw
een prospective and retrospective assessm
ents. Only if the theory’s predictions are not born out--contrary to
expectations--can the case cast doubt on that theory. Interpretation: P(Ec|Sc H
I) is high, but Ec is not observed. (Section 3.1.3: likelihood approach) A
lternative interpretation: The example m
ay suggest a failed hoop test for H
(Section 3.3)
“LLCs ought not to confirm
theories, if any cases can be expected to do so.” LLC
example: “M
ichel’s inquiry into the ubiquitousness of oligarchy in organizations, based on the argum
ent that certain organizations (those consciously dedicated to grass-roots dem
ocracy...) are least likely, or very unlikely, to be oligarchic if oligarchy were
not universal.” “The ‘least-likely’ case ... seem
s especially tailored to confirmation...” p.149
Am
biguity: Does not clearly distinguish betw
een prospective and retrospective assessm
ents. Only if the theory’s predictions are realized--contrary to expectations
that a LLC “ought not to confirm
” the theory--can the case support that theory. Interpretation: Im
plicitly, the example suggests that P(Ec|Sc H
I) is high, P(Ec|Sc ~H
I) is low, and Ec is observed, increasing our confidence in H
. (Section 3.1.5: divergent likelihoods approach)
Overarching C
ritique: What it m
eans to say a case “ought to support” or “ought not to support” a theory remains unclear. A theory cannot predict evidence that runs
counter to the expectations of that very theory—w
e must understand predictions to be evidentiary outcom
es with high probability under H
. Prospectively, anticipations of evidence that is dam
aging to a theory must be based on other, rival theories, or averaged over all theories, yet rival theories are not explicitly discussed here.
Retrospectively, test strength depends not on whether evidence fits w
ith theory’s predictions, but how w
ell it fits with the theory relative to rivals.
KK
V
(1994) “If predictions of w
hat appears to be an implausible theory conform
with
observations of a ‘most-likely’ observation, the theory w
ill not have passed a rigorous test but w
ill have survived a ‘plausibility probe’ and may be w
orthy of further scrutiny.” p.209 Inconsistencies (external): O
ther authors who invoke prior probabilities
associate MLC
s with a high value of P(H
|I), not a low value (an im
plausible theory). M
ost authors focus on failing the test, rather than passing the test. A
mbiguity: D
oes not clearly define a most-likely observation. D
oes it mean
that P(Ec|Sc H I) is high? O
r that P(E|Sc I) is high, or P(Ec|Sc H’ I) is high for
a prevalent rival H’? Based on w
hat conditioning information do w
e expect the likelihood of case evidence to be high? M
ost sensible interpretation: P(H|I) is low
(implausible theory), P(Ec|Sc H
I ) is high (H
predicts Ec), and P(Ec|Sc I) is high (Ec is an a priori ‘most-likely
observation’). If Ec is found, then P(H|Ec I) does not differ m
uch from P(H
|I). (Section 3.1.4: m
arginal likelihood approach, combined w
ith a low prior
probability) C
ritique: no discussion of rival hypotheses. H m
ight well pass a “strong test”
upon observing Ec if one of the rivals does not predict Ec as strongly.
“If the investigator chooses a case study that seems on a priori grounds unlikely to accord with theoretical predictions—a ‘least-likely’ observation—but the theory turns out to be correct regardless, the theory will have passed a difficult test, and we will have reason to support it with greater confidence.” p.209

Inconsistency (internal): In contrast to the MLC discussion, there is no reference to P(H|I).

Ambiguity: Does not clearly define a least-likely observation. If H predicts Ec, then P(Ec|Sc H I) must be high. But a priori we do not expect to find Ec upon examining the case. Does this mean that P(Ec|Sc I) is low, or P(Ec|Sc H’ I) is low for a prevalent rival H’, or P(H|I) is low?

Most sensible interpretation: P(Ec|Sc H I) is high (theory H predicts Ec), P(Ec|Sc I) is low (an a priori ‘least-likely observation’), but Ec is found. (Section 3.1.4: marginal likelihood approach)

Critique: No discussion of rival hypotheses. Finding Ec does not necessarily imply that H passes a difficult test, because a rival might predict Ec just as strongly.
Most-Likely Case (MLC) / Least-Likely Case (LLC)

Levy (2002)
“A most-likely case is one that almost certainly must be true if the theory is true, in the sense that all the assumptions of a theory are satisfied and all the conditions hypothesized to contribute to a particular outcome are present, so the theory makes very strong predictions regarding outcomes in that case. If a detailed analysis of a most-likely case demonstrates that the theory’s predictions are not satisfied, then our confidence in the theory is seriously undermined.” p.143

Interpretation: P(Ec|Sc H I) is high (theory makes a strong prediction for the case), but we do not find Ec upon examining the case. (Section 3.1.3: likelihood approach)

Critique: No discussion of rival hypotheses. H may perform no worse, or even better, than competing explanations.
“A most-likely case design can involve selecting cases where the scope conditions for a theory are fully satisfied.” p.144

Critique: All cases within the theory’s scope would be ‘equally-likely’—no one case would be any ‘more-likely’ or ‘most-likely’ for the theory.

“Most-likely case designs follow the inverse Sinatra inference—if I cannot make it there I cannot make it anywhere.” p.144
“A least-likely case design... selects ‘hard’ cases in which the predictions of a theory are quite unlikely to be satisfied because few of its facilitating conditions are satisfied. If those predictions are nevertheless found to be valid, our confidence in the theory is increased, and we have good reasons to believe that the theory will hold in other situations that are even more favorable for the theory.” p.144

“A least-likely case design identifies cases in which the theory’s scope conditions are satisfied weakly if at all.” p.144

Interpretation: A LLC is one that falls outside of the theory’s scope (Section 3.1.1).

Critique: A theory makes no predictions for cases outside its scope, so cases that do not satisfy scope conditions cannot be used to test the theory.

“Least-likely case research designs follow... the ‘Sinatra inference’—if I can make it there I can make it anywhere.” p.144
Critique: Incorrectly suggests that a theory is penalized or rewarded above and beyond the weight of evidence actually obtained, based on some ex-ante ranking of cases. If two cases produce the same weight of evidence, they constitute tests of equal strength, regardless of our ex-ante expectations. (Section 3.2)

“Most-likely and least-likely case designs are often based on a strategy of selecting cases with extreme values on the independent variables, which should produce extreme outcomes on the dependent variable, at least for hypotheses positing monotonically increasing or decreasing functional relationships.” p.144

Interpretation: Under various unstated assumptions, we can interpret this as a logistic regression approach (Section 3.2).
“The power of most-likely and least-likely case analysis is further strengthened by defining most likely and least likely not only in terms of the predictions of a particular theory but also in terms of the predictions of leading alternative theories.” p.144

Interpretation: Hints at divergent likelihoods approach (Section 3.1.5).

“The strongest support for a theory comes when a case is least likely for a particular theory and most likely for the rival theory, and when observations are consistent with the predictions of the theory but not those of its competitor.” p.144-145

Most sensible interpretation: If we apply the likelihood definition of the MLC, and adopt a parallel definition of a LLC, we can only make sense of this statement if we take it to mean that P(Ec|Sc H I) is low (LLC for theory H), and P(~Ec|Sc H’ I) is high (MLC for rival H’ with respect to the opposite outcome ~Ec, so that the theories make different predictions). If P(Ec|Sc H’ I) is also much lower than P(Ec|Sc H I), then observing Ec will strongly support H. (See related critique of George & Bennett.)
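The divergent-likelihoods logic can be made concrete numerically. The short Python sketch below is our own illustration (all probability values are hypothetical); it computes the weight of evidence in decibels, i.e. ten times the base-10 logarithm of the likelihood ratio:

```python
import math

def weight_of_evidence(p_e_given_h: float, p_e_given_rival: float) -> float:
    """Weight of evidence (in decibels) that observing evidence E lends to H
    over a rival H': 10 * log10 of the likelihood ratio P(E|H) / P(E|H')."""
    return 10 * math.log10(p_e_given_h / p_e_given_rival)

# Hypothetical numbers for the scenario above: the case is least-likely for H
# in the sense that P(Ec|Sc H I) = 0.2, but the rival very strongly predicts
# the opposite outcome ~Ec, so P(Ec|Sc H' I) = 0.02.
woe = weight_of_evidence(0.2, 0.02)
print(round(woe, 1))  # 10.0 dB in favor of H, despite the low likelihood under H
```

The point of the sketch is that support depends only on the ratio of the likelihoods, not on how large either likelihood is on its own.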
George & Bennett (2004)
Summarizing Eckstein: “In a most-likely case, the independent variables posited by a theory are at values that strongly posit an outcome or posit an extreme outcome. ... Most-likely cases, he notes, are tailored to cast strong doubt on theories if the theories do not fit...” p.121

Interpretation: P(Ec|Sc H I) is high, but Ec is not observed. (Section 3.1.3: likelihood approach)
Summarizing Eckstein: “In a least-likely case, the independent variables in a theory are at values that only weakly predict an outcome or predict a low-magnitude outcome. ...least-likely cases can strengthen support for theories that fit even cases where they should be weak.” p.121

Ambiguity: What does it mean for a theory to fit a case where it “should be weak”? If the theory itself weakly predicts Ec in a given case, then the above statement seems to suggest that finding Ec supports the theory, but this can only be true if rivals predict that outcome even more weakly. Otherwise, we have cause to either revise the theory or increase confidence in a rival that predicted the outcome more strongly. Should we instead take the statement to mean that the theory is weak relative to rivals, in the sense that P(H|I) is comparatively low (Section 3.1.2)? Alternatively, should we understand that cases where the theory “should be weak” are cases that fall outside the theory’s scope conditions (Section 3.1.1)?

“One must consider not only whether a case is most or least likely for a given theory, but whether it is also most or least likely for alternative theories.” p.121

Interpretation: Hints at divergent likelihoods approach (Section 3.1.5).
“The strongest possible supporting evidence for a theory is a case that is least likely for that theory but most likely for all alternative theories, and one where the alternative theories collectively predict an outcome very different from that of the least-likely theory. If the least-likely theory turns out to be accurate, it deserves full credit for a prediction that cannot also be ascribed to other theories. ...Theories that survive such a difficult test may prove to be generally applicable to many types of cases, as they have already proven their robustness in the presence of countervailing mechanisms.” p.121-122

Most sensible interpretation: P(Ec|Sc H I) is low (LLC for H with respect to outcome Ec), P(Ec|Sc H’ I) is high (MLC for rival H’ with respect to Ec), but we do not find Ec, which supports H, since P(~Ec|Sc H I) must be high (H predicts ~Ec) and P(~Ec|Sc H’ I) must be low (H’ predicts Ec). (Section 3.1.5: divergent likelihoods approach, or Section 3.3: smoking-gun approach)

Alternative interpretation: The last sentence shifts gears and suggests instead a regression logic, where the hypotheses under consideration are nested models (Section 3.3). Note that the discussion of “countervailing mechanisms,” which suggests that different variables within a given model are pushing in opposite directions, contradicts the previous discussion of alternative theories, which we would understand to be mutually exclusive.
“The best possible evidence for weakening a theory is when a case is most likely for that theory and for alternative theories, and all these theories make the same prediction. If the prediction proves wrong, the failure of the theory cannot be attributed to the countervailing influence of variables from other theories (again, left-out variables can still weaken the strength of this inference). This might be called an easiest test case. If a theory and all the alternatives fail in such a case, it should be considered a deviant case and it might prove fruitful to look for an undiscovered causal path or variable. A theory’s failure in an easiest test case calls into question its applicability to many types of cases.” p.122

Contradiction: A literal reading of the first sentence suggests that P(Ec|Sc H I) is high (MLC for H), and P(Ec|Sc H’ I) is also high (MLC for alternative theories H’ that make the same prediction). But this case would then be the worst place to look for evidence that would weaken H, because all theories perform well if Ec is found and badly if Ec is not found. Regardless of the evidentiary outcome, we at best obtain a very small weight of evidence in favor of one theory over the other, meaning that H will not be substantially undermined.

Ambiguity: We would understand “alternative theories” as discussed in the first sentence to be mutually exclusive, yet the “countervailing influence of variables from other theories” in the second sentence seems to suggest that they are not mutually exclusive, and that the authors instead have in mind a regression logic.

Critique: If we are comparing rival hypotheses, then one of them may do somewhat better than the others, depending on how strongly each theory predicted the incorrect outcome Ec. In that sense, the theory in question does not necessarily “fail” an easy test, although we may well want to revise the theory or devise a new one. A similar critique follows if what the authors have in mind is a set of nested models that contains other mutually exclusive hypotheses with additional independent variables that might be relevant for the outcome but are not included in H. In the scenario posed, all of these models (incorrectly) predict Ec (the case is most-likely for all of the theories considered), but presumably H predicts Ec with somewhat lower probability since it does not include additional independent variables that would together push even more strongly toward Ec in this case. As such, if we observe ~Ec, H would be somewhat strengthened relative to the multivariate rivals, not weakened.
RSI (2004)
“A case that is strongly expected to conform to the prediction of a particular theory. If the case does not meet this expectation, there is a basis for revising or rejecting the theory.” p.297

Inconsistency (internal): In contrast to the LLC discussion, there is no reference to “values on variables associated with rival hypotheses.”

Ambiguity: Does not clearly articulate what it means to expect that a case will conform to a theory’s predictions. If H predicts Ec, then P(Ec|Sc H I) must be high, but our a priori expectations about what evidence we will find in the case must be based on something more than the likelihood under the working hypothesis. Should we understand that P(Ec|Sc I) is high, or P(Ec|Sc H’ I) is high for a prevalent rival H’, or P(H|I) is high?

Most sensible interpretation: P(Ec|Sc H I) is high (theory H predicts Ec), and P(Ec|Sc I) is high (case is expected to conform to H’s prediction). If Ec is not found, the theory fails a strong test. (Section 3.1.4: marginal likelihood approach)

Critique: No discussion of rival hypotheses. H may perform no worse, or even better, than competing explanations—although an unexpected finding might lead us to consider revising the theory.
A case that “is strongly expected not to conform to the prediction of the theory.” p.297

“A least-likely case often has extreme values on variables associated with rival hypotheses, such that we might expect these other variables to negate the causal effect predicted by the theory. If the case nonetheless conforms to the theory, this provides evidence against these rival hypotheses and, therefore, strong support for the theory.” p.293

Contradiction: Read literally, “variables associated with rival hypotheses” cannot “negate the causal effect predicted by the theory,” because a hypothesis and its rivals are mutually exclusive—they do not operate simultaneously to produce the observed outcome.

Most sensible interpretation: The hypothesis of interest, H, is one member of a larger class of nested models that contains other mutually exclusive hypotheses with additional independent variables that might be relevant for the outcome but are not included in H. (Section 3.3: regression-type approach)

Critique: The logic of the regression-approach interpretation is sound, but rival hypotheses do not always fit into a “nested model” structure.
Bennett & Elman (2007)

Most-likely case: not discussed.
“The more surprising an outcome is relative to extant theories, the more we increase our confidence in the theory or theories that are consistent with that outcome. In the strongest instance of such logic, if a theory of interest predicts one outcome in a case, if the variables of that theory are not at extreme levels that strongly push toward that outcome, and if all of the alternative hypotheses predict a different outcome in that case, this is a least-likely case for the theory of interest.” p.173

Contradiction: If “a theory predicts one outcome in a case,” say Ec, then presumably P(Ec|Sc H I) is high; otherwise H could be consistent with outcomes other than Ec. But “if the variables of that theory are not at extreme levels that strongly push toward that outcome,” then P(Ec|Sc H I) must be low. Furthermore, regardless of whether P(Ec|Sc H I) is high or low, our confidence in H increases to the extent that the evidence obtained is more likely under that hypothesis relative to rivals.

Most sensible interpretation: To avoid contradictions, we would have to understand that P(Ec|Sc H I) is low (the variables of the theory do not strongly push toward outcome Ec), but P(Ec|Sc H’ I) is much lower (rival H’ very strongly predicts a different outcome, ~Ec), so if Ec is found, the weight of evidence strongly favors H. (Section 3.3: smoking-gun approach)

Critique: Ex ante, setting out to find a smoking gun is not a good case selection strategy, because smoking guns are not likely to be found.

Alternative interpretation: The authors may have in mind the divergent likelihoods approach expressed in George & Bennett (above) but have not precisely articulated that idea.
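The critique of smoking-gun hunting can be quantified in terms of expected weight of evidence, the logic behind the expected-information-gain criterion discussed in the introduction. The Python sketch below is our own illustration with purely hypothetical probability values; it averages the weight of evidence over both possible evidentiary outcomes, assuming H is true:

```python
import math

def expected_woe_toward_h(p_e_h: float, p_e_r: float) -> float:
    """Expected weight of evidence (decibels) in favor of H when H is true,
    averaging over the two possible outcomes (evidence E found, or not found)."""
    woe_e = 10 * math.log10(p_e_h / p_e_r)                  # WoE if E is found
    woe_not_e = 10 * math.log10((1 - p_e_h) / (1 - p_e_r))  # WoE if ~E is found
    return p_e_h * woe_e + (1 - p_e_h) * woe_not_e

# Smoking-gun test: E is decisive when found (P(E|H)=0.10 vs P(E|H')=0.001),
# but H rarely produces it, so most of the time we learn almost nothing.
smoking_gun = expected_woe_toward_h(0.10, 0.001)

# A more balanced test: likelihoods differ less dramatically, but the
# discriminating evidence is common under H.
balanced = expected_woe_toward_h(0.80, 0.20)

print(round(smoking_gun, 2), round(balanced, 2))
```

With these illustrative numbers, the balanced test offers a larger expected payoff than the smoking-gun hunt, even though the smoking gun yields a far bigger reward in the rare event it is found.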
Gerring (2007)
“A most-likely case is one that, on all [1] dimensions except the [2] dimension of theoretical interest, is predicted to achieve a certain outcome and yet does not. It is therefore disconfirmatory.” p.232

Contradiction: This definition taken literally is logically equivalent to the definition provided for a least-likely case. If we apply the same interpretation of uses [1] and [2] of “dimensions” as for uses [3] and [4], respectively, then all of the variables X2, X3... associated with the more complex rival hypotheses push toward outcome Ec, whereas the variable X1 associated with the simpler working hypothesis H pushes toward ~Ec, and this case would be confirmatory for H, not disconfirmatory. In essence, compared to the least-likely case definition, the outcome has simply been labeled ~Ec instead of Ec: the more complex rival hypotheses that include X1 along with X2, X3... predict an incorrect outcome (e.g. Ec), whereas H predicts the correct outcome (~Ec).

Most sensible interpretation: To salvage this MLC definition, we would need to interpret use [1] of “dimensions” as the independent variable(s) X1 in the working hypothesis H, and use [2] (rather awkwardly) as the dependent variable of interest (Ec vs. ~Ec). With this generous interpretation, we could then recover the likelihood approach (Section 3.1.3), where P(Ec|Sc H I) is high, but ~Ec is observed.
“A least-likely case is one that, on all [3] dimensions except the [4] dimension of theoretical interest, is predicted not to achieve a certain outcome and yet does so. It is confirmatory.” p.232

Most sensible interpretation: We understand use [4] of “dimensions” to mean the independent variable X1 that is central to the working hypothesis H, and use [3] to mean independent variables X2, X3... associated with more complex rival hypotheses that also include X1. If H predicts outcome Ec, and Ec is in fact observed, then H is (to some degree) strengthened relative to the rivals that predicted ~Ec due to the (countervailing) role of the additional independent variables X2, X3.... (Section 3.3: regression-type approach)

Critique: The regression logic in our interpretation is sound, but rival hypotheses do not always fit into a “nested model” structure.
Beach & Pedersen (2016)
“A most-likely case is one where other conditions except the X in focus suggests that Y should occur but it does not, implying that we can disconfirm X being a cause across the population.” p.19

Contradiction: This statement follows Gerring’s (2007) definition (above) with “dimensions” understood (as in Gerring’s least-likely case) to be the independent variable X1 of the working hypothesis H. Taking Beach & Pedersen’s statement at face value, other variables X2, X3... that are not relevant to hypothesis H predict Y, whereas X1 does not, and the observed outcome is ~Y. But this case cannot be interpreted as most-likely for H, and it certainly cannot disconfirm H, since X1 assumes values in this case that correctly predict ~Y.
“A least-likely case is where other conditions except X point in the direction of Y not occurring but it does, enabling us to infer that given that it occurred where we least expected it, it should also occur in more probable places. It is vital to note that the likelihood of a causal relationship occurring in a case is based on theoretical reasons, for example, contextual conditions that are more/less conducive determine the likelihood of the causal relationship occurring.” p.19-20

Most sensible interpretation: See comments on Gerring above.

Critique: Strong articulation of the Sinatra logic, which can only be justified if we employ a logistic regression model.

Critique: The final sentence has little discernible meaning. If “contextual conditions” are theorized to affect the likelihood of X producing Y, then those conditions must be explicitly included within the hypothesis that is being tested.
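As noted, the Sinatra inference can only be justified under something like a logistic regression model, in which the outcome probability rises monotonically with the facilitating conditions. A minimal sketch (our own illustration; the coefficients are arbitrary) makes the point:

```python
import math

def p_outcome(x: float, b0: float = -1.0, b1: float = 2.0) -> float:
    """Hypothetical logistic model: probability of outcome Y given a single
    facilitating condition X. With b1 > 0 the relationship is monotonically
    increasing, which is what licenses the Sinatra inference."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# A 'least-likely' case sets X low, so the model assigns Y a low probability;
# monotonicity then guarantees that every higher-X case is at least as favorable.
hard_case, easy_case = p_outcome(0.0), p_outcome(2.0)
assert hard_case < easy_case  # "if I can make it there I can make it anywhere"
print(round(hard_case, 3), round(easy_case, 3))
```

Absent such a monotonic functional form, success in the hard case carries no guarantee about more favorable cases, which is why the inference requires the regression interpretation.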
Rohlfing (2012)
“A most-likely case has a relatively high probability of confirming the proposition under scrutiny, while a least-likely case goes hand in hand with a comparatively low probability.” p.84

“The conditional probability of interest” for identifying a most/least-likely case is “P(E|H & case), meaning that the probability of collecting confirming evidence–E–is conditional on the assumption that H is correct and in light of theory-relevant features of the chosen case.” p.86

Ambiguity: What it means to include the “case” in the conditioning information is unclear, since the background information I (omitted in Rohlfing’s notation) includes everything relevant we know about all cases. Likewise, the reference to “theory-relevant features” of the case is ambiguous; all details that we initially know about the case should be included as background information.

Inaccuracy: P(E|H I) is the probability of finding evidence E if H is true, not the probability of finding “confirming evidence”—the extent to which evidence supports or confirms a hypothesis depends on likelihood ratios. Here we have an instance where sloppy language may lead to sloppy thinking. If we call E “confirming evidence,” regardless of how large or small the likelihood is under H, then whenever we observe E, we are tempted to think that it does indeed confirm H, even though E may instead weigh in favor of a rival hypothesis that predicts this evidence more strongly.

Interpretation: Using better notation, this definition asserts that a MLC is characterized by a high value for P(Ec|Sc H I), while a LLC is characterized by a low value for P(Ec|Sc H I). (Section 3.1.3: likelihood approach)

Critique: This definition cannot capture the prospective notion expressed in the first quotation. If a MLC means that P(Ec|Sc H I) is high, then the theory makes a strong prediction for the case, but before examining the case, that fact alone gives us no cause to believe that this case will actually produce Ec; furthermore, whether Ec will “confirm” H depends on how strongly rival hypotheses predict that same outcome. Similarly, knowing that P(Ec|Sc H I) is low on its own does not tell us whether we will find Ec in the case. Moreover, finding Ec need not confirm H; if P(Ec|Sc H I) is low, there may well be a rival theory that predicts Ec with higher probability.
“The larger the conditional probability of the working hypothesis relative to the null hypothesis, the smaller the likelihood ratio is and the greater the confidence in the working hypothesis following a successful test. A least-likely case for the working hypothesis can meet this criterion but only if the conditional likelihood of the null hypothesis is much smaller. Similarly, a most-likely case for the working hypothesis offers considerable inferential leverage only when the conditional probability of the null proposition is smaller.” p.196

Translation: Using correct terminology and a better convention for the likelihood ratio, the above quote would instead read: “The larger the posterior probability of the working hypothesis relative to a rival in light of the evidence, the larger the likelihood ratio, P(Ec|Sc H I)/P(Ec|Sc H’ I), and the greater the confidence in the working hypothesis following a successful test. A least-likely case for the working hypothesis, where P(Ec|Sc H I) is low (following Rohlfing’s previous likelihood approach), can meet this criterion but only if the likelihood of the case evidence conditional on the rival hypothesis is much smaller than the likelihood of the evidence conditional on the working hypothesis. Similarly, a most-likely case for the working hypothesis, where P(Ec|Sc H I) is high, offers considerable inferential leverage only when the likelihood of the case evidence under the rival hypothesis is smaller than the likelihood of the evidence under the working hypothesis.”

Interpretation: Correctly notes that test strength depends on likelihood ratios. Associates a LLC with a successful smoking-gun test: P(Ec|Sc H I) is low, but P(Ec|Sc H’ I) is much lower for rivals, such that finding Ec supports H. Read literally, associates a MLC with a doubly-decisive test: P(Ec|Sc H I) is high, but P(Ec|Sc H’ I) is much lower. Elsewhere, Rohlfing (p.183) explicitly associates a MLC with a failed hoop test: P(Ec|Sc H I) is high, P(Ec|Sc H’ I) is somewhat lower, but we observe ~Ec. Presumably the intent here was to say instead that a MLC for H produces significant inferential leverage under the surprising outcome ~Ec for H.

Critique: A retrospective assessment of test strength once evidence has been found is not useful for prospective case selection.
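The corrected criterion can be illustrated with the odds-ratio form of Bayes’ rule. The Python sketch below is our own illustration (all likelihood values are hypothetical); it shows how the test types mentioned above move the odds on H versus a rival H′:

```python
def posterior_odds(prior_odds: float, p_e_h: float, p_e_rival: float) -> float:
    """Odds-ratio form of Bayes' rule: posterior odds on H vs. a rival H'
    equal the prior odds times the likelihood ratio P(E|H) / P(E|H')."""
    return prior_odds * (p_e_h / p_e_rival)

# Hypothetical likelihoods, starting from even prior odds (1:1).

# Smoking-gun test passed: Ec unlikely under H, far less likely under H'.
sg = posterior_odds(1.0, 0.1, 0.001)        # odds shift strongly toward H

# Doubly-decisive test passed: Ec likely under H, very unlikely under H'.
dd = posterior_odds(1.0, 0.9, 0.005)        # odds shift strongly toward H

# Hoop test failed: ~Ec observed, so use the likelihoods of the opposite
# outcome, 1 - P(Ec|H) and 1 - P(Ec|H').
hoop_fail = posterior_odds(1.0, 1 - 0.95, 1 - 0.5)  # odds shift against H

print(sg, dd, hoop_fail)
```

In every scenario the update depends only on the ratio of the likelihoods for the outcome actually observed, which is why a ranking of cases fixed before the evidence arrives cannot settle test strength.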
Rapport (2015)

“‘Least likely’ (LL) and ‘most likely’ (ML) cases [are] also referred to as ‘hard’ and ‘easy’ cases, respectively. The former pose difficult tests of theories, in that—unlike ML cases—one would not expect a theory’s expectations to be borne out by a review of the case evidence.” p.431

Most sensible interpretation: One could make mathematical sense of this statement by associating “a theory’s expectations” with the predictions of the working hypothesis, P(Ec|Sc H I), and associating the earlier use of the term “expect” with our ex-ante anticipation of what we will find, P(Ec|Sc I), which is a weighted average of the predictions of all plausible hypotheses under consideration. (Section 3.1.4: marginal likelihood approach)

Inaccuracy: In the formal Bayesian analysis, incorrectly identifies the marginal likelihood P(E|I) with the relative frequency of evidence E in a population (Appendix B).

The “Bayesian ... approach ... defines a case as LL or ML according to a researcher’s prior confidence that the theory being tested offers a valid explanation for the outcome of interest in other, similar cases.” p.433 (see also p.450)

Interpretation: P(H|I) is low for a LLC and high for a MLC. (Section 3.1.2: prior probability approach)

Critique: Priors do not vary across cases.
Appendix C: Understanding Why Prior Probabilities Do Not Vary Across Cases
In Section 6.1.2, we argued that prior probabilities on hypotheses are determined by a given
state of background knowledge and cannot vary depending on which case we select to study.
This appendix provides some additional examples to help illustrate this point, which may seem
counterintuitive given the way language is used colloquially—while we are making decisions
about how to define or revise the scope conditions for a hypothesis, we might loosely speak
about the probability that our hypothesis explains a given case, but this meaning should not be
conflated with the Bayesian prior probability of a well-articulated, concretely-scoped hypothesis.
The overarching remedy is to carefully define the hypothesis space before making statements
about prior probabilities.
Analogies to medical testing may be one source of confusion regarding the relationship between prior probabilities on hypotheses and specific cases.34 It is generally recognized that when diagnosing diseases, the base rate used for the prior probability should correspond to a group of subjects who are as similar as possible to the patient on characteristics that are known to matter for susceptibility to the suspected illnesses. Suppose two patients display symptoms that often accompany prostate cancer: Martin, a 74-year-old American male, and Cheng, a 22-year-old Chinese male. In the first instance, we might use for our prior the incidence of prostate cancer among American men in the age group 70–75, whereas in the second instance, we would instead use a base rate for this disease among 20–25 year-olds, ideally based on studies of Asian men, since we know that prostate cancer is much less common among younger men, and also less common among Asian males (possibly due to diet). One might then think that we have case-specific prior probabilities for the prostate cancer hypothesis—we might be tempted to say that Martin is a “most-likely case” for prostate cancer, whereas Cheng is a “least-likely case” for prostate cancer.
The flaw in this reasoning is that the different prior probabilities in question do not correspond to the same hypothesis. In the first instance, we are considering the hypothesis HMPC = Martin has prostate cancer, whereas in the second instance, we are considering a distinct hypothesis HCPC = Cheng has prostate cancer. These are two case-specific hypotheses. Whatever evidence we obtain from subjecting Martin to various medical tests has no direct implications for the hypothesis that Cheng has prostate cancer, and vice versa. Note also that we need to specify the salient rival hypothesis for each problem, which would posit that the patient in question has a different condition that produces similar symptoms (the most plausible alternative disease afflicting Martin may not be the most plausible alternative disease afflicting Cheng).

34 See Appendix D for discussion of additional misconceptions arising from medical testing examples.
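A small numerical sketch (our own illustration; the base rates and likelihoods are invented, not real epidemiology) makes clear that the two priors attach to two distinct case-specific hypotheses, each updated only by evidence from its own case:

```python
def posterior(prior: float, p_e_h: float, p_e_not_h: float) -> float:
    """Bayes' rule for a binary hypothesis: P(H|E) from the prior P(H) and
    the likelihoods of the evidence under H and under not-H."""
    num = prior * p_e_h
    return num / (num + (1 - prior) * p_e_not_h)

# Illustrative (not real) base rates for two DISTINCT case-specific hypotheses:
# HMPC = "Martin has prostate cancer" and HCPC = "Cheng has prostate cancer".
prior_martin, prior_cheng = 0.10, 0.001

# The same symptom evidence, with the same likelihoods, updates each
# hypothesis separately; Martin's test results say nothing about HCPC.
post_martin = posterior(prior_martin, p_e_h=0.8, p_e_not_h=0.1)
post_cheng = posterior(prior_cheng, p_e_h=0.8, p_e_not_h=0.1)
print(round(post_martin, 3), round(post_cheng, 3))
```

There is no single "prostate cancer hypothesis" with a prior that varies by case; there are two hypotheses, each with its own fixed prior given the background knowledge.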
A critical point to stress again in this discussion is that hypotheses and their scope conditions
must be clearly articulated before we can assign probabilities and conduct Bayesian inference.
As we emphasized in Chapter 3, a well-posed hypothesis includes a statement of any relevant
scope conditions that circumscribe the range of cases to which it applies. Once we have defined
the scope of the hypothesis, all background knowledge we have about cases that fall within
its scope contributes to the prior probability of that hypothesis—regardless of whether the
information we know about a given case seems to support or to undermine the hypothesis
relative to the rival(s) under consideration.
As an example, suppose we are considering a hypothesis from the literature that is understood to apply to all developing countries—to be concrete, take HSE = Stolen elections are the key causal factor that motivates democratic mobilization in developing countries, [via mechanism M]. Now suppose we have salient background knowledge about two developing countries—Serbia and the Philippines. Let’s say our background information about the Serbian case moderately supports HSE over the rival hypothesis HCE = Autonomous communal elites are the key causal factor that motivates democratic mobilization in developing countries, but our background knowledge of the Philippines (from reading Slater (2009)) strongly supports HCE over HSE.35 Our prior odds on HSE vs. HCE then weakly favor the communal-elites hypothesis. These are our prior odds regardless of which developing country we choose to study next—whether China, Ukraine, Venezuela, or Burma. If our study will involve additional research on the Serbian and/or Philippine case(s), our prior odds still weakly favor HCE over HSE. These prior odds remain unchanged regardless of whether we plan to gather new evidence for the Serbian case first, or for the Philippine case first. Even though we have different background information for each individual case, the priors on HSE vs. HCE reflect our combined background information about both cases. Because neither of these hypotheses is case-specific, we cannot have case-specific priors, and it would not make sense to assert that we have a different degree of confidence that HSE provides a better explanation than HCE for the Serbian case compared to the Philippine case.

35 See Chapter 4.
Now we might instead decide that in light of our background information, it is worth modifying
the scope conditions on the stolen-elections hypothesis. One approach would be to limit its
scope to Eastern Europe and then conduct additional research on Eastern European cases to
see whether the evidence further supports this narrower stolen-elections hypothesis over rivals.
Our new hypothesis, HEESE, does not apply to the Philippines, so any background information
we have about that case is irrelevant to the prior odds on HEESE vs. a rival that includes Eastern
Europe within its scope. So here it makes no sense to ask about the probability that HEESE
explains the Philippines, because this case lies outside its scope. And again our prior odds
remain fixed regardless of which Eastern European case we study next.
Alternatively, we might pose a new hypothesis that applies to both Eastern European and Asian cases: HS/C = Stolen elections are the key factor that motivates democratic mobilization in Eastern Europe, whereas autonomous communal elites are the key factor that motivates democratic mobilization in Asia, and compare it to the original hypothesis HSE, which holds that the stolen-elections logic applies in both regions. Now our Serbian background information (IS) favors neither HS/C nor HSE, while our Philippine background information (IP) strongly supports HS/C over HSE. Our prior odds, based on IS IP, then strongly favor HS/C over HSE. Again, these prior odds are determined by our unique state of knowledge IS IP, which remains unchanged regardless of which country in Eastern Europe or Asia we contemplate studying.
Appendix D: Understanding Marginal Likelihoods in the Context of Case-Based Research
Another misconception regarding the application of probabilistic reasoning to case selection in
qualitative research concerns the interpretation of the marginal likelihood in Bayes’ rule—it is
sometimes incorrectly taken to be the relative frequency of some evidence E in a population
of cases. This misconception, which is evident in Rapport’s (2015) analysis, may be fairly
widespread and thus merits some careful attention.
Referring to Bayes’ rule in the following form (where we have explicitly included the background information, which the author omits):

P(H|E I) = P(H|I) P(E|H I) / P(E|I),    (D1)
Rapport (2015:448) writes that: “Bayesian inference rests on three legs: the strength of one’s
prior beliefs about a theory, how closely evidence in a case fits theoretical expectations, and the
typicality of the evidence in a case.” We have argued that Bayesian inference in fact depends on
how much better or worse the evidence fits the respective expectations of competing theories, as
becomes more evident when working with the odds-ratio version of Bayes' rule. But the greater
concern at hand is whether “typicality of the evidence in a case” is gauged by comparison across
cases, or by comparison across theories.
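The odds-ratio form of Bayes' rule, posterior odds = prior odds × P (E | H1 I) / P (E | H2 I), makes this point easy to see numerically. The following sketch uses purely hypothetical likelihoods, chosen only to show that inference turns on the ratio of likelihoods across rival theories, not on how well the evidence fits any one theory in absolute terms:

```python
def posterior_odds(prior_odds, likelihood_h1, likelihood_h2):
    """Posterior odds = prior odds x likelihood ratio (the Bayes factor)."""
    return prior_odds * (likelihood_h1 / likelihood_h2)

# Evidence that fits H1 well in absolute terms (P(E | H1 I) = 0.8) barely
# moves our beliefs if it fits H2 nearly as well (P(E | H2 I) = 0.7)...
weak_update = posterior_odds(1.0, 0.8, 0.7)

# ...whereas evidence with a low absolute likelihood (P(E | H1 I) = 0.1)
# is highly discriminating if H2 makes it far less likely still.
strong_update = posterior_odds(1.0, 0.1, 0.001)

print(round(weak_update, 3))   # a modest shift in favor of H1
print(round(strong_update, 1)) # a decisive shift in favor of H1
```

On these (hypothetical) numbers, the "well-fitting" evidence shifts even odds by a factor of barely more than one, while the "rarely expected" evidence shifts them by a factor of a hundred.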
Rapport (2015:448) defines P (E | I) in Bayes’ rule as “the overall probability of observing E
at all given its prevalence in the population of interest—the broader class from which the case
is drawn.” Rapport (2015:449) accordingly asserts that: “one’s confidence in the theory ...
increases if the evidence one observes rarely occurs in comparable cases,” thinking this should
make the denominator P (E | I) in Bayes’ rule small and hence boost posterior probability.
However, P (E | I) has nothing to do with the prevalence of evidence E in other cases. This
marginal probability should instead reflect our surprise at seeing the evidence E in the case
at hand, based on our prior confidence in rival theoretical explanations that might account for
the evidence—not on some sort of relative frequency of how often similar evidence arises across
cases. Later in his discussion of Bayes’ rule, Rapport (2015:448) correctly writes that “P (E)
captures all the ways the evidence might have been generated in addition to the mechanisms
specified by the theory of interest." But throughout, he conflates these two very different
notions: a weighted average across different cases, and a weighted average across different
theories.36
The confusion regarding P (E | I) may be dispelled by (1) using notation that explicitly
captures the potentially case-specific nature of the evidence, and (2) remembering that Bayesian
probabilities are often unrelated to relative frequencies, as elaborated below.
1. Handling Case-Specific Evidence
In qualitative research, we work with detailed evidence that is often specific to a particular
case. To help keep this reality in mind, we will denote evidence gathered from case Ck as ECk .
With this more explicit notation, Bayes’ rule becomes:
P (HA | ECk I) = P (HA | I) P (ECk | HA I) / P (ECk | I) .    (D2)
Our background information may also contain case-specific knowledge, which we can highlight
by writing I = (I1 · · · Ik · · · ) I0, where a term Ik represents facts specific to case Ck, and I0 contains common background knowledge, which at minimum specifies the mutually exclusive
and exhaustive hypotheses {HA, HB, ...} under consideration. Using the law of total probability,
we can decompose the marginal likelihood P (ECk | I) as follows:
P (ECk | I) = P (HA | I) P (ECk | HA I1 · · · Ik · · · I0) + P (HB | I) P (ECk | HB I1 · · · Ik · · · I0) + · · · .    (D3)
This notation should clarify that the marginal likelihood has little to do with the prevalence of
similar-looking evidence that might be found in other cases. It is the likelihood of occurrence
of the particular evidence ECk in the particular case Ck, given the relevant background
information Ik I0, averaged over the possible hypotheses that might underpin the case dynamics,
weighted by our prior confidence in these rival explanations.37 If we include enough detail
36 This misconception may also underlie Rapport's (2015:435) earlier claim that the Bayesian "approach attends more to the population from which a case was drawn and defines a case as LL [least likely] or ML [most likely] according to a researcher's prior confidence that the theory being tested offers a valid explanation for the outcome of interest in other, similar cases." Compared to more conventional attitudes to qualitative research that tend to be grounded in extrapolations of frequentist reasoning, Bayesian conceptualizations are far less, not more, tied to "populations from which a case was drawn," and allow the researcher to assess within-case evidence on its own terms rather than exclusively in comparison to "other, similar cases."
37 It is possible for outside evidence to affect the likelihoods and marginal likelihoods in the case under investigation, but only indirectly. These probabilities are conditional on I1 I2 · · · , which might include evidence previously collected in other cases. In turn, this previous evidence might have been used for instance to tune
(regarding specific events in the country, direct quotations from a particular politician), we can
always ensure that the “evidence one observes rarely occurs in comparable cases,” (Rapport
2015:449)—for instance, it would hardly make sense to ask whether we might observe Cardinal
Sin of the Philippines calling the people into the streets to protect retreating coup leaders
following Marcos' fraudulent election when conducting a case study of democratic mobilization in
Vietnam or Burma. Yet the uniqueness of the evidence to the case is not what increases our
confidence in a theory; it is the uniqueness of the evidence to a theory that boosts the posterior
probability.
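The weighted average in the decomposition above is easy to illustrate numerically. In this sketch, three rival hypotheses stand in for {HA, HB, ...}, and all priors and likelihoods are hypothetical placeholders; the point is only that the marginal likelihood is an average over theories, weighted by prior confidence, not a frequency across cases:

```python
# Prior probabilities P(H | I) over mutually exclusive, exhaustive rivals.
priors = {"H_A": 0.5, "H_B": 0.3, "H_C": 0.2}

# P(E_Ck | H I): how strongly each hypothesis expects the case-specific
# evidence E_Ck, given the relevant background knowledge. (Hypothetical.)
likelihoods = {"H_A": 0.9, "H_B": 0.2, "H_C": 0.1}

# Marginal likelihood: P(E_Ck | I) = sum over H of P(H | I) * P(E_Ck | H I),
# i.e. the law of total probability as in (D3).
marginal = sum(priors[h] * likelihoods[h] for h in priors)
# 0.5*0.9 + 0.3*0.2 + 0.2*0.1 = 0.53

# Posterior P(H | E_Ck I) for each rival, via Bayes' rule as in (D2).
posteriors = {h: priors[h] * likelihoods[h] / marginal for h in priors}
# H_A's posterior is 0.45 / 0.53, roughly 0.85

print(round(marginal, 2))
print({h: round(p, 3) for h, p in posteriors.items()})
```

Nothing in this computation asks how often similar-looking evidence occurs in other cases; every term refers to the single case at hand, evaluated under each rival explanation.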
In sum, for qualitative social science, marginal likelihoods in Bayes’ theorem generally do not
reflect “prevalence in the population” of cases (Rapport 2015:449), because we typically observe
a unique piece or body of evidence in the context of a particular case. Even if we were to coarse-
grain the evidence into broad “clue types” for the sake of greater comparability across cases,
the marginal probability of evidence ECk cannot be equated with the relative frequency of its
clue type in some population of interest. Outside of survey research, cases or case evidence can
rarely be viewed as random samples drawn from some delineated and exchangeable population.
2. Bayesian Probabilities vs. Relative Frequencies
Introductions to Bayes’ theorem routinely include what we might call a “dreaded diagnosis”
example from medical testing, similar to our worked example at the end of Chapter 2. Such cal-
culations nicely highlight the importance of taking prior information into account, but they may
also lead to some confusion between Bayesian probabilities and frequencies that can contribute
to misinterpretations of the marginal likelihood in the context of case selection.
Reprising our Chapter 2 example, suppose Patient X tests positive (+) for disease D according
to some standard diagnostic test τ. How worried should X be? The answer will depend
on information about the reliability of the diagnostic test, and prior expectations about the
prevalence of the disease. Assume we know nothing else about the particulars of X’s medical
model parameters, which then can affect the likelihoods generated by the theories regarding new evidence observed in the current case Ck. But this indirect influence acts primarily through the hypotheses, and does not seem to be what Rapport had in mind. Note also that if we alter the set of hypotheses under consideration, we are changing the common background information I0, and P (ECk | I) will accordingly change as well. Hence, contrary to Rapport's (2015:449) assertion, the numerical value of the denominator P (ECk | I) definitely need not remain "constant" as the researcher "goes back and forth between theory and evidence in different cases."
condition or history other than the results of the test, but we do have information I about the
past performance of the test on a large number of other patients, and about the prevalence of
the disease in the general population. The probability of interest is then:
P (D | τ+ I) = P (D | I) P (τ+ | D I) / P (τ+ | I)
             = P (D | I) P (τ+ | D I) / [ P (D | I) P (τ+ | D I) + P (¬D | I) P (τ+ | ¬D I) ] .    (D4)
The hypotheses in question concern whether patient X suffers from the disease (D) or not
(¬D), and the evidence consists of a positive result on a diagnostic test, τ+. The point of the
example is to emphasize that even if the true positive rate P (τ+ | D I) is moderately high (but
definitely less than perfect), and the false positive rate P (τ+ | ¬D I) is moderately (but not
extremely) small, then the posterior probability of having the disease can still be much smaller
than the true positive rate of the test, if overall the disease is extremely rare in the sense that
P (D | I) ≪ P (¬D | I) = 1 − P (D | I).
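The arithmetic behind this point can be checked directly. The prevalence and error rates below are hypothetical, but typical of how such textbook examples are set up:

```python
base_rate = 0.001       # P(D | I): prevalence of the disease (hypothetical)
true_positive = 0.95    # P(tau+ | D I): test sensitivity (hypothetical)
false_positive = 0.05   # P(tau+ | not-D I): false alarm rate (hypothetical)

# Marginal probability of a positive result, P(tau+ | I), as in the
# denominator of (D4): a prior-weighted average over the two hypotheses.
p_positive = base_rate * true_positive + (1 - base_rate) * false_positive

# Posterior probability of disease given a positive test, P(D | tau+ I).
posterior = base_rate * true_positive / p_positive

print(round(posterior, 4))  # ≈ 0.0187, far below the 0.95 true positive rate
```

Despite a seemingly accurate test, the posterior probability of disease is under two percent, because positive results from the large healthy population swamp those from the rare diseased one.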
A common misunderstanding that seems to be taken from this type of example is that prior
probabilities can be identified with base rates, and marginal likelihoods can be identified with
relative frequencies in a population. These assumptions are unproblematic in the above
example, but they are not justifiable more generally. In the medical testing example, the prior
P (D | I) could be estimated as the fraction of individuals from the population (or more
practically, from a random sample drawn from the population) that do suffer from the disease, and
the marginal likelihood P (τ+ | I) could be estimated as the overall fraction of individuals who
(would) test positive, because we are effectively treating all individual cases as exchangeable
samples drawn at random from the larger population. However, if we include more specific
information about the condition of patient X into our background information, replacing I with
more detailed IX, then P (D | IX) can no longer be interpreted as the base rate of the disease
in the overall population; it instead represents our prior belief that X has the disease given
this patient-specific medical information, and P (τ+ | IX) is the probability that the particular
patient X will test positive, whether suffering from the disease or not. In the idealized limiting
case where IX contains exceptionally detailed physiological information about X and biological
information about both the test and the disease, then P (τ+ | IX) would have essentially nothing
to do with any population of other patients, but should instead reflect our knowledge of how the
individual physiology of X and the biochemical operation of the test might affect the outcome
of the test in this one case.
In the original formulation of the medical example,
(i) the propositions of interest concern whether a particular patient X has a disease D or
not;
(ii) the population of all possible patients is considered exchangeable, meaning in this context
that we know nothing about X other than the test result, and X is assumed to have the
same propensities as any other patient; and
(iii) the possible evidentiary outcomes (i.e., the test results τ+ or τ−) are coarse-grained and
could be measured for any individual patient.
In this context, the marginal probability P (τ+ | I) can be consistently estimated as a
relative frequency across cases in the population. But in qualitative case research, where we are
interested in the marginal probability P (ECk | I),
(i0) the propositions of interest are often general explanatory theories which we hope apply
to all cases satisfying the stated scope conditions;
(ii0) cases are known or expected to have unique features, and hence are not exchangeable;
and
(iii0) the possible evidentiary outcomes tend to be fine-grained, are not easily denumerable,
and generally vary from case to case.
Therefore, in the latter context, P (ECk | I) cannot be interpreted or estimated as a rate of
prevalence in some population of cases.