Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Controlling the false discovery rate in GWAS with population structure
Matteo Sesia
Department of Data Sciences and Operations, University of Southern California
Stephen Bates
Department of Statistics, Stanford University
Emmanuel Candes
Departments of Statistics and of Mathematics, Stanford University
Jonathan Marchini
Genetics Center, Regeneron Pharmaceuticals
Chiara Sabatti
Departments of Statistics and of Biomedical Data Sciences, Stanford University
Abstract
This paper proposes a novel statistical method to address population structure in genome-
wide association studies while controlling the false discovery rate, which overcomes some
limitations of existing approaches. Our solution accounts for linkage disequilibrium and
diverse ancestries by combining conditional testing via knockoffs with hidden Markov models
from state-of-the-art phasing methods. Furthermore, we account for familial relatedness by
describing the joint distribution of haplotypes sharing long identical-by-descent segments
with a generalized hidden Markov model. Extensive simulations affirm the validity of this
method, while applications to UK Biobank phenotypes yield many more discoveries compared
to BOLT-LMM, most of which are confirmed by the Japan Biobank and FinnGen data.
2
INTRODUCTION
Genome-wide association studies (GWAS) measure hundreds of thousands of single-nucleotide
polymorphisms (SNP) in thousands of individuals to identify variants affecting a phenotype of in-
terest. The objective is to reliably determine which genotype-phenotype associations are likely to be
important and which are spurious; several challenges make this difficult, particularly for polygenic
traits which may be influenced by thousands of variants. First, spurious associations may arise
from multiple comparisons; i.e., strong correlations will occur by chance as numerous variables are
tested simultaneously.1 The typical solution is to apply a stringent significance threshold designed
to control the family-wise error rate (FWER)—the probability of a single false positive. However,
this is too conservative for polygenic phenotypes because numerous discoveries are expected.2–5
Indeed, SNPs with effect sizes large enough to pass the FWER threshold do not fully explain the
heritability of complex traits.6 A second source of spurious associations is linkage disequilibrium
(LD): the stochastic dependence of alleles on the same chromosome,7,8 which tends to be stronger
for those physically closer to each other. Thus, SNPs with no effect on the phenotype may be
marginally associated with it simply because they are in LD with a causal variant.5,9 Standard
GWAS methods cannot account for LD,5 so their findings do not precisely localize causal variants
and may be difficult to recognize as distinct when multiple nearby loci are discovered.9 Such am-
biguity may help explain why large studies are reporting associations densely across the genome.10
Lastly, population structure (heterogeneous degrees of similarity between different individuals, due
to diverse ancestries or familial relatedness) may also analogously lead to spurious associations and
has long been of concern in genetic analyses.11–13 Population structure not only induces dependence
between distant loci, it can even create spurious associations when there are no causal variants at
all. For example, if two populations differ in the distribution of the trait solely due to their envi-
ronments, then any SNP whose allele frequency varies across populations will be associated with
the trait. Hence, several methods were developed to account for population structure. An early
remedy built on principal component analysis (PCA),14 although linear mixed models (LMM) have
subsequently become predominant.15–17 While these corrections mitigate the impact of population
structure, they are limited to marginal testing (i.e., they ignore LD) with FWER control.
KnockoffZoom9 was recently proposed to address the limitations of marginal testing and FWER
control. This method accounts for LD through conditional (on nearby loci) rather than marginal
testing, so that its discoveries are clearly distinct and indicative of the presence of causal variants.5,9
The inferences are valid for both quantitative (e.g., body measurement) and qualitative (e.g., dis-
3
ease status) phenotypes, regardless of their genetic architecture, in contrast to traditional methods,
which rely on linear models whose correctness may be difficult to verify. Practically, KnockoffZoom
partitions the genome into disjoint blocks and tests a conditional association hypothesis for each of
them, which, if rejected, suggests the presence of causal effects.9 While this requires testing the im-
portance of fixed groups of SNPs (determined at will, but before looking at the phenotype), it can
be easily applied at increasing levels of resolution, with finer and finer genome partitions, in order to
localize causal variants as precisely as possible. KnockoffZoom does not analyze imputed variants18
because, without additional assumptions, it is theoretically impossible to test conditional associa-
tions beyond the resolution of the SNP array (imputation cannot introduce additional information
about the phenotype compared to that contained in the typed SNPs).9 Nonetheless, its findings
achieve resolution comparable to that of fine-mapping methods19–21 applied to typed variants.9
Furthermore, KnockoffZoom is more powerful than LMM approaches because it controls the false
discovery rate22 (FDR)—the expected proportion of false discoveries—instead of the FWER. The
main limitation of KnockoffZoom is that, as originally implemented, it is only applicable to rela-
tively homogeneous samples because it does not fully account for population structure. To address
this missing component, we now present an extension that accounts for population structure while
retaining the advantages of conditional testing and FDR control.
Before presenting our results, we review the mechanics of KnockoffZoom. The method estab-
lishes statistical significance through careful data augmentation: it constructs imperfect copies
(knockoffs)5,23 of the genotypes for each individual that are in LD with the real variants and
also have the same allele frequencies, in such a way that replacing a group of genotypes with the
corresponding knockoffs would keep the modified data set statistically indistinguishable from the
original one, except possibly for some reduced association with the phenotype. Knockoffs serve as
negative controls:23,24 they are blindly analyzed jointly with the genotypes (the algorithm ignores
which variables are knockoffs until the very end), through any procedure of choice (e.g., an LMM, a
sparse regression model, or any machine learning tool). The significance of each genetic segment is
determined by contrasting the estimated importance of its genotypes to that of the corresponding
knockoffs. This is a fair comparison because knockoffs behave as the real non-causal variants. In
order to define precisely, and achieve practically, this exchangeability, the distribution of geno-
types is approximated with a hidden Markov model (HMM).5 This assumption is well grounded:
HMMs have already been widely and successfully employed to describe LD,8 for the purposes either
of ancestry inference,25,26 of identifying population-specific haplotype blocks,27 or of phasing and
imputation.28–32
4
However, KnockoffZoom has so far relied practically on the fastPHASE29 HMM, which is not
designed to account for population structure. In fact, this model cannot simultaneously describe
association between different loci due to mixture33 (individuals with different ancestries), admix-
ture34 (individuals of mixed ancestry), and linkage (allele dependencies between nearby loci)—we
demonstrate this limitation empirically in Supplementary Notes A– C and Supplementary Fig-
ures 1–13. Therefore, KnockoffZoom was previously applied only to homogeneous and unrelated
samples.5,9 In this paper, we build upon the more flexible SHAPEIT35–37 HMM and expand the
applicability of KnockoffZoom to populations with diverse and possibly admixed ancestries. We
also further extend this method to account for familial relatedness, by jointly describing the dis-
tribution of haplotypes from multiple individuals sharing long identical-by-descent (IBD) genetic
segments,38 and then developing a corresponding construction of knockoffs. Crucially, our method
is applicable even if the population structure and the familial relatedness are cryptic, and remains
completely model-free regarding the genetic architecture of the trait.
After testing our method through careful simulations with real genotypes and simulated pheno-
types, we shall apply it to study height, body mass index, platelet count, systolic blood pressure,
cardiovascular disease, respiratory disease, hypothyroidism, and diabetes in the UK Biobank data
set,39 including virtually all samples therein. We will show this yields more numerous (between 25%,
for height, and 320%, for cardiovascular disease) distinct discoveries compared to BOLT-LMM.40
Furthermore, we will verify that most additional discoveries (between 37.3%, for platelet count,
and 88.5%, for diabetes) are validated by the GWAS Catalog,52 the Japan Biobank Project41,
or the FinnGen resource,42 while many of the remaining ones have known associations to related
traits. Finally, we highlight a novel discovery for cardiovascular disease, as one of many promis-
ing findings that may be worthy of further validation. Our full results, which involve thousands
of discoveries, are available from https://msesia.github.io/knockoffzoom-v2/, along with an
efficient software implementation of our method.
RESULTS
Knockoffs preserving population structure
We assume all haplotypes have been phased9 and we approximate their distribution with an
HMM similar to that of SHAPEIT.35–37 This model describes each haplotype sequence as a mosaic
of K reference motifs corresponding to the haplotypes of other individuals in the data set, where
5
K is fixed (e.g., K = 100); critically, different haplotypes may use different sets of motifs. The
references are chosen based on haplotype similarity; see Methods. The idea is that the ancestry of
an individual should be approximately reflected by the choice of references; e.g., the haplotypes of
someone from England should be well-approximated by a mosaic of haplotypes primarily belonging
to other English individuals. Conditional on the references, the identity of the motif copied at each
position is described by a Markov chain with transition probabilities proportional to the genetic
distances between neighboring sites; different chromosomes are treated as independent. Conditional
on the Markov chain, the motifs are copied imperfectly: relatively rare mutations can independently
occur at any site. This model effectively accounts for population structure as well as LD in the
context of phasing,35–37 and our simulations will demonstrate its usefulness for conditional testing.
Having defined an HMM for each haplotype sequence, knockoffs are generated by repurposing
the algorithm in KnockoffZoom v1,9 which was originally based on the fastPHASE model, but is
sufficiently general to apply here. Our software implementation takes as input phased haplotypes
in standard binary format, builds the data-adaptive SHAPEIT HMM, and returns as a knockoff-
augmented data set that can be conveniently used by KnockoffZoom for conditional testing at the
desired resolution. The knockoff generation procedure is explained in the Methods.
Knockoffs preserving familial relatedness
The above model is not directly applicable to closely related samples because it describes them
independently (conditional on the reference motifs), whereas these share long IBD segments.38,43
Recall that knockoffs must follow the genotype distribution; therefore, processing related samples
independently would yield knockoffs breaking the relatedness structure, invalidating our inferences
(see Methods for a full explanation). We fix this by jointly modeling haplotypes in the same family.
We begin by detecting long IBD segments in the data.44–47 If the pedigree is known in advance,
one can restrict the IBD search within the given families; otherwise, there exists efficient software
to approximately reconstruct families from data at the UK Biobank scale.48 The results thus
obtained define a relatedness graph, where two haplotype sequences are connected if they share an
IBD segment; we refer to the connected components of this graph as the IBD-sharing families.
Conditional on the location of the IBD segments, we define a larger HMM jointly describing the
distribution of all haplotypes in each IBD-sharing family. Marginally, each haplotype is modeled by
the SHAPEIT HMM; however, different haplotypes are coupled along the IBD segments (and we
avoid using haplotypes in the same family as references for one another). This model is explicitly
6
described in the Methods, where we explain how to generate knockoffs for it. The details are
technical, since the algorithm in Sesia et al.9 is no longer feasible due to the much larger state
space of this new HMM. We shall demonstrate empirically that our method generates knockoffs
sharing the same IBD segments as the real haplotypes, and also simultaneously preserves LD.
Numerical experiments with genetic data
Setup
We test the proposed methods via simulations based on subsets of phased haplotypes from the
UK Biobank data, chosen as to have strong population structure either due to diverse ancestries
or familial relatedness. After some pre-processing (see Methods), we partition each of the 22
autosomes into contiguous groups of SNPs at 7 different levels of resolution, ranging from that
of single SNPs to that of 425 kb-wide groups; see Supplementary Table 1. These partitions are
obtained by applying complete-linkage hierarchical clustering to the SNPs (genetic distances are
used as similarity measures) and cutting the resulting dendrogram at different heights. For each
such partition, we generate knockoffs with two alternative methods. In one case, we fit fastPHASE
using K = 50 latent HMM motifs, even though this is not designed for data with population
structure, and then apply KnockoffZoom v1.9 In the other case, we apply our new method to
generate knockoffs, and then we proceed from there to select important groups of SNPs as in
KnockoffZoom v1; we refer to this approach as KnockoffZoom v2.
Knockoffs preserving population structure
We generate knockoffs for 10,000 unrelated individuals with one of 6 different self-reported
ancestries (Supplementary Table 2). We will perform a diagnostic check of these knockoffs by
verifying their exchangeability with the genotypes through a PCA and a covariance analysis; then,
we will assess the performance of KnockoffZoom for simulated phenotypes.
A property of valid knockoffs is that their distribution is the same as that of the genotypes.5,23
(This follows from the stronger exchangeability requirement; see Methods). Thus, the proportion of
knockoff variance explained by their top principal components should be close to the corresponding
quantity computed on the genotypes. The PCA in Supplementary Figures 14–15 demonstrates that
our knockoffs preserve population structure quite accurately in this sense, unlike those based on the
fastPHASE HMM. This holds even at low resolution, where the sensitivity to model misspecification
7
is highest.9 The covariance analysis9 in Supplementary Figures 16–22 confirms that our knockoffs
are approximately exchangeable with the genotypes and should have power comparable to that of
knockoffs based on the fastPHASE HMM.
Next, we simulate continuous phenotypes conditional on the true genotypes, from a homoscedas-
tic linear model with 500 causal variants distributed uniformly across the genome; the total heri-
tability is varied as a control parameter.9 We apply KnockoffZoom v1 and v2 on these data, using
lasso-based statistics.49 Supplementary Figure 23 shows the histogram of test statistics, which
should be symmetric around zero for null groups (i.e., those without causal variants). The statis-
tics obtained with the new knockoffs satisfy this property, while the fastPHASE model leads to a
rightward bias, which may result in an excess of false positives. By contrast, both methods yield
similarly distributed statistics for causal groups. The power9 and FDR are compared in Figure 1:
KnockoffZoom v2 has slightly lower power, but always controls the FDR.
single-SNP 208 kb 425 kb
FD
RP
ower
0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6
0.0
0.2
0.4
0.6
0.0
0.2
0.4
0.6
Heritability
Method
KnockoffZoom v2
KnockoffZoom v1
FIG. 1. Power and FDR in simulations involving samples with diverse ancestries. KnockoffZoom
performance at different resolutions on artificial phenotypes and real genotypes of 10,000 samples with very
diverse ancestries, using alternative knockoff constructions. The nominal FDR is 10%. The results are
averaged over 10 experiments with independent phenotypes; the vertical bars indicate standard errors.
Supplementary Figure 24 presents analogous results in simulations with sparser signals, while
Supplementary Figure 25 summarizes findings at different resolutions by counting only the most
specific ones.9 Supplementary Figure 26 reports on simulations where SNP importance is estimated
by BOLT-LMM instead of the lasso;50 the LMM is less powerful, but it makes fastPHASE knockoffs
even more susceptible to population structure, while our new method remains valid.
8
Knockoffs preserving familial relatedness
We test our method on 10,000 British individuals in 4,900 self-reported families; see Supplemen-
tary Table 3 and Supplementary Figure 27 for details. We use RaPID48 to detect IBD segments
wider than 3 cM, chromosome-by-chromosome, adopting the recommended parameters. After dis-
carding, for simplicity, segments shared by individuals who do not belong to the same self-reported
family, we are left with 723,454 of them. Their mean width is 19.6 Mb, or 26.1 cM, and each
contains 4238 SNPs on average (Supplementary Figure 28). We then generate knockoffs preserving
these IBD segments, and compare the results with those obtained disregarding relatedness.
Supplementary Figure 29 shows that knockoffs would not preserve IBD segments if we did not
explicitly enforce such constraint, especially at low resolution. The diagnostics in Supplementary
Figure 30 confirm that our method correctly preserves LD, and Supplementary Figure 31 demon-
strates that accounting for relatedness does not decrease power; to the contrary, it can increase it
by ensuring that closely related haplotypes are not used as references for one another, which would
reduce the desired contrast between genotypes and knockoffs.
We simulate binary phenotypes from a liability threshold (probit) model with 100 uniformly
distributed causal variants; the numbers of cases and controls are balanced. (We consider binary
phenotypes, as opposed to continuous phenotypes as in the previous section, simply to highlight
the flexibility of our method, which is equally valid regardless of the distribution of the trait). We
include in this model an additive random term for each family, mimicking shared environmental
effects, whose strength is smoothly controlled by a parameter γ ∈ [0, 1] (Methods). The phenotypes
of different individuals in the same family are conditionally independent given the genotypes if
γ = 0, while identical twins will always have the same phenotype if γ = 1. In theory, environmental
effects may introduce spurious associations, unless the knockoffs account for familial relatedness;
this point is demonstrated empirically below and explained rigorously in the Methods.
Figure 2 reports FDR and power at low-resolution, with and without preserving relatedness.
This shows that preserving IBD segments enables FDR control even with extreme environmental
factors (γ = 1), with virtually no power loss. However, KnockoffZoom v2 is reasonably robust even
if relatedness is ignored, especially at higher resolution (Supplementary Figure 32). This partly
depends on the multivariate importance statistics used here (i.e., sparse logistic regression); in fact,
marginal statistics are more vulnerable to confounding, as illustrated in Supplementary Figure 33.
Finally, Supplementary Figures 34–35 confirm that the test statistics for null groups of SNPs are
symmetrically distributed if relatedness is preserved.
9
γ: 0 γ: 0.655 γ: 1FDR
Pow
er
0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6
0.0
0.2
0.4
0.6
0.8
0.0
0.2
0.4
0.6
0.8
Heritability
Relatedness
Preserved
Ignored
FIG. 2. Power and FDR in simulations with familial relatedness. KnockoffZoom v2 performance on
artificial phenotypes and real genotypes of 10,000 related samples. Our method is applied with and without
preserving IBD segments. Results for phenotypes with different strengths of environmental effects γ are in
separate columns (γ = 0: no environmental effects, γ = 1: strongest environmental effects; see Methods for
more information about γ). Knockoff resolution equal to 425 kb. Other details are as in Figure 1.
.
Analysis of the UK Biobank phenotypes
Setup
We analyze 486,975 individuals both genotyped and phased in the UK Biobank, 136,818 of
which have close relatives; see Methods for details on quality control. Most samples are British
(429,934), Irish (12,702), or other Europeans (16,292). By running RaPID48 within the 57,164
families, as in the simulations, we detect 7,087,643 long IBD segments across the 22 autosomes.
We then apply KnockoffZoom v2 at different resolutions (as in the simulations), aiming to control
the FDR below 10%, using knockoffs preserving both population structure and familial relatedness.
10
Discoveries with different subsets of individuals
We study 4 continuous traits (height, body mass index, platelet count, systolic blood pressure)
and 4 diseases (cardiovascular disease, respiratory disease, hyperthyroidism, diabetes), using dif-
ferent subsets of samples to compare performance. The phenotypes are defined in Supplementary
Table 4. In order to increase power, we include in the KnockoffZoom predictive model, along with
the genotypes and the knockoffs, a few relevant covariates that help explain some variation in the
phenotype,9 as explained in the Methods. These covariates include the top principal components
of the genetic matrix,14 although it is worth emphasizing that the validity of our method does not
depend on them, since we account for population structure through the knockoffs. The numbers of
low-resolution (208 kb) discoveries are in Figure 3; this demonstrates that including relatives yields
many more discoveries, while including different ancestries tends to have a smaller impact (unsur-
prisingly, since we have relatively few non-British samples). Table I summarizes the gains in the
numbers of discoveries at different resolutions allowed by KnockoffZoom v2, either by leveraging
related samples, or by including non-British individuals. This shows an increase in power, except
at the single-SNP resolution; this exception may be partially explained by the fact that single-
SNP discoveries are fewer and thus more affected by the random variability in the knockoffs.9
Supplementary Tables 5–6 report the numbers of discoveries at other resolutions, as well as those
obtained from the analysis of European non-British samples only. The full results are available at
https://msesia.github.io/knockoffzoom-v2/, along with an interactive visualization tool.
Table II demonstrates that KnockoffZoom v2 is much more powerful than BOLT-LMM, when
the latter is applied to the 459k European samples;40 in fact, we discover almost all findings reported
by BOLT-LMM and many new ones. (See Supplementary Tables 7–8 for more detailed results at
other levels of resolution.) This is consistent with KnockoffZoom v1,9 although our method is even
more powerful (except for the single-SNP resolution), as shown in Supplementary Tables 9–10.
Validation of novel discoveries
We begin to validate our findings by comparing them with those in the GWAS Catalog,52
in the Japan Biobank Project,41 and in the FinnGen resource42 (we use the standard 5 × 10−8
threshold for the p-values reported by the latter two). For simplicity, hereafter we focus on our
findings obtained including both related and non-British individuals. Supplementary Tables 11–
12 show that most of our high-resolution discoveries correspond to SNPs previously known to be
11
cvd respiratory hypothyroidism diabetes
height bmi platelet sbp
Everyone British Everyone British Everyone British Everyone British
Everyone British Everyone British Everyone British Everyone British0
300600900
1200
0255075
100
0500
10001500
0
100
200
300
0500
1000150020002500
0
100
200
0100020003000
0250500750
Population
Discoveries
Related samples Included Excluded
FIG. 3. Numbers of KnockoffZoom discoveries for UK Biobank phenotypes. Low-resolution
(208 kb) discoveries using data from subsets of individuals with different self-reported ancestries.
Including related samples Including non-British samples
Everyone British Related Unrelated
ResolutionTotal
Change
(%)Total
Change
(%)Total
Change
(%)Total
Change
(%)
single-SNP 138–167 21.0 155–125 -19.4 125–167 33.6 155–138 -11.0
3 kb 921–992 7.7 655–971 48.2 971–992 2.2 655–921 40.6
20 kb 2814–3527 25.3 2808–3355 19.5 3355–3527 5.1 2808–2814 0.2
41 kb 4419–5867 32.8 4353–5354 23.0 5354–5867 9.6 4353–4419 1.5
81 kb 6784–8031 18.4 6676–7781 16.6 7781–8031 3.2 6676–6784 1.6
208 kb 8776–10270 17.0 8635–10049 16.4 10049–10270 2.2 8635–8776 1.6
425 kb 9401–10730 14.1 9028–10297 14.1 10297–10730 4.2 9028–9401 4.1
Sample size 408k–487k 19.4 356k–430k 20.8 430k–487k 13.0 356k–408k 14.6
TABLE I. Effect of sample size increases on numbers of KnockoffZoom v2 discoveries. Cumulative
numbers of discoveries for all UK Biobank phenotypes at different resolutions, with different subsets of the
samples. For example, including related individuals increases by 16.4% the number of discoveries obtained
from the British samples at the 208 kb resolution (from 8635 to 10,049). As another example, adding non-
British individuals (including related ones) increases by 2.2% the number of discoveries obtained from the
British samples (including related ones) at the 208 kb resolution (from 10049 to 10,270).
12
KnockoffZoom v2 BOLT-LMM
Phenotype Discoveries Overlap with LMM Discoveries Overlap with KZ
bmi 2395 898 (37.5%) 697 689 (98.9%)
cvd 940 274 (29.1%) 257 249 (96.9%)
diabetes 113 52 (46.0%) 62 55 (88.7%)
height 3339 2228 (66.7%) 2464 2430 (98.6%)
hypothyroidism 295 129 (43.7%) 143 142 (99.3%)
platelet 1743 1057 (60.6%) 1204 1183 (98.3%)
respiratory 262 82 (31.3%) 94 92 (97.9%)
sbp 1183 561 (47.4%) 568 530 (93.3%)
TABLE II. Comparison of KnockoffZoom v2 at low resolution with BOLT-LMM. KnockoffZoom
discoveries (208 kb resolution, 10% FDR) using all 487k UK Biobank samples for different phenotypes,
and corresponding BOLT-LMM genome-wide significant discoveries (5× 108); the latter is applied on 459k
European samples40 for all phenotypes except diabetes and respiratory disease, for which it is applied on
350k unrelated British samples,9 for the sake of consistency in phenotype definitions. For example, we report
940 distinct discoveries for cardiovascular disease, 274 of which contain significant associations according to
BOLT-LMM. The latter reports a total of 257 discoveries (clumped9 with the standard PLINK51 algorithm)
for this phenotype, 96.9% of which overlap with one of our discoveries.
associated with the phenotype of interest. This is particularly true for those findings that are
also detected by BOLT-LMM, although many of our additional discoveries are also confirmed; see
Supplementary Tables 13–14. For example, our method reports 1089 findings for cardiovascular
disease at the 425 kb resolution, only 255 of which can be detected by BOLT-LMM; however, 85.6%
of our additional 834 discoveries are confirmed in at least one of the aforementioned resources.
Furthermore, Supplementary Table 15 suggests that most relevant associations in the Catalog
(above 70%) are confirmed by our findings, which is again indicative of high power. The relative
power of our method (i.e., the proportion of previously reported associations that we discover) seems
to be above 90% for quantitative traits, but lower than 50% for all diseases except hypothyroidism,
probably due to the relatively small number of cases in the UK Biobank data set compared to
more targeted case-control studies.
The 5×10−8 genome-wide threshold for the Japan Biobank Project and the FinnGen resource is
overly conservative given that our goal is to confirm selected discoveries. Therefore, we next utilize
these independent summary statistics for an enrichment analysis. The idea is to compare the
distribution of the external statistics corresponding to our selected loci to that of loci from the rest
13
Total Not found by BOLT-LMM
Confirmed Confirmed
Phenotype Discoveries Other Other or Enrich. Discoveries Other Other or Enrich.
bmi 2395 1076 (44.9%) 1620 (67.6%) 1497 335 (22.4%) 806 (53.8%)
cvd 940 738 (78.5%) 764 (81.3%) 666 472 (70.9%) 493 (74.0%)
diabetes 113 97 (85.8%) 106 (93.8%) 61 46 (75.4%) 54 (88.5%)
height 3339 1886 (56.5%) 2493 (74.7%) 1111 164 (14.8%) 556 (50.0%)
hypothyroidism 295 156 (52.9%) 226 (76.6%) 166 43 (25.9%) 101 (60.8%)
platelet 1743 453 (26.0%) 1017 (58.3%) 686 29 (4.2%) 256 (37.3%)
respiratory 262 241 (92.0%) NA 180 159 (88.3%) NA
sbp 1183 643 (54.4%) 885 (74.8%) 622 154 (24.8%) 358 (57.6%)
TABLE III. Validation of findings through comparisons with other studies and enrichment.
Numbers of low-resolution (208 kb) discoveries obtained with our method and confirmed by other studies,
or by an enrichment analysis carried out on external summary statistics. For example, 81.3% of our 940
discoveries for cardiovascular disease are confirmed either by the results of other studies, or by the enrichment
analysis. The results are stratified based on whether our findings can be detected by BOLT-LMM using the
UK Biobank data (excluding non-European individuals).
of the genome, as explained in Supplementary Note D. This approach can estimate the number of
replicated discoveries but it has the limitation that it cannot tell exactly which ones are confirmed;
therefore, we will consider alternative validation methods later. (A more precise analysis is possible
here in theory, but has low power; see Supplementary Note D). Supplementary Tables 16–17
show that many additional discoveries can thus be validated, especially at high resolution. (See
Supplementary Tables 18–19 for more details about enrichment.) Table III summarizes these
confirmatory results. Respiratory disease is excluded from the enrichment analysis because the
FinnGen resource divides it among several fields, so it is unclear how to best obtain a single p-
value. In any case, the GWAS Catalog and the FinnGen resource already directly validate 90% of
our new findings for this phenotype.
We continue the validation by inspecting the novel discoveries (i.e., those missed by BOLT-
LMM and unconfirmed by the above studies) and cross-referencing them with the genetics litera-
ture, focusing for simplicity on the 20 kb resolution. Supplementary Table 20 shows that almost
all discoveries contain genes, and most have known associations to phenotypes closely related to
that of interest (Supplementary Table 21). Furthermore, most lead SNPs (those with the largest
14
importance measure in each group9) have functional annotations (Supplementary Table 22).
Figure 4 showcases one of our novel discoveries for cardiovascular disease. The finest finding
here spans 4 genes, but we could not find previously reported associations with cardiovascular
disease within this locus. However, one of these genes (SH3TC2) is known to be associated with
blood pressure,53 while another (ABLIM3) is known to be associated with body mass index.54
7.3
-lo
g 10(
p)
single-SNP3
204181
208425
Res
olu
tion
(kb
)
Manhattan plot (BOLT-LMM)
Chicago plot (KnockoffZoom v2)
Genes
NA
NA
NANA
NA
NA
NA
NANA
NA
NA
NA
NANA
NA
ABLIM3 →SH3TC2 ←
LOC255187 →
MIR584 ←
148.2 148.4 148.6 148.8
Chromosome 5 (Mb)
148.2 148.4 148.6 148.8
FIG. 4. Novel discovery for cardiovascular disease on the UK Biobank data. Center: the shaded
rectangles indicate the genetic segments detected by our method at different resolutions. Bottom: genes in
the locus spanned by our finest discovery. Top: BOLT-LMM marginal p-values computed on UK Biobank
samples with European ancestry,40 for genotyped and imputed variants within this locus. All BOLT-LMM p-
values within this locus are larger than 5×10−8; those larger than 10−3 are hidden. Note that KnockoffZoom
does not analyze imputed variants, since these do not carry any additional information about the phenotype.9
DISCUSSION
This paper has developed a new algorithm for constructing genetic knockoffs, inspired by the
SHAPEIT phasing method, that extends the applicability of KnockoffZoom to the analysis of data
with population structure. In particular, we can include related and ethnically diverse individuals
while controlling the FDR. This extension is crucial for several reasons. Firstly, very large studies
15
are sampling entire populations quite densely,42,55 which yields many closely related samples. It
would be wasteful to discard this information, and potentially dangerous not to carefully account for
relatedness. Secondly, the historical lack of diversity in GWAS (which mostly involve individuals
of European ancestry) is a well-recognized problem,56,57 which biases our scientific knowledge
and disadvantages the health of under-represented populations. While this issue goes beyond the
statistical difficulty of analyzing diverse GWAS data, our work should at least help remove a
technical barrier. Firstly, by allowing the simultaneous analysis of diverse populations, we can
borrow strength from one another and increase power, as the different patterns of LD in different
populations may uncover causal variants more effectively.58 Secondly, discoveries obtained from the
analysis of diverse data may improve our ability to explain phenotypic variation in the minority
populations.59,60 Since the UK Biobank mostly comprises of British individuals, the increase in
power resulting from the analysis of diverse samples can only be relatively small. Nonetheless,
we observe some gains when we include non-British individuals. This is promising, especially
since simulations demonstrate that our inferences are valid even when the population is extremely
heterogeneous. Therefore, in the near future, we would like to apply our method to more diverse
data sets, such as that collected by the Million Veteran Program,61 for example.
Our method accounts for LD, population structure, and cryptic relatedness, which are the major
sources of confounding in GWAS data, so our discoveries are directly interpretable and may be
portrayed in a causal sense relatively safely;5,9 indeed, a closely related approach yields formal
causal claims in the special case of parent-child trio data.62 Furthermore, our inferences require
no assumptions about the genetic architecture of the phenotype, which makes KnockoffZoom very
versatile. It is worth stressing that our simulations are based on genetic data, do not provide any
additional information to our method about the relatedness or population structure other than
that already available to the analyst, and do not exploit any knowledge of the architecture of the
simulated traits. Therefore, the results are informative about the general validity of our method.5,9
Confirmatory analyses demonstrate that a large proportion of our discoveries are either con-
sistent with the findings of BOLT-LMM on the same data, or are supported by other studies.
This is of interest, especially since we do not have access to studies larger than the UK Biobank,
so we cannot expect to already replicate all novel discoveries. Furthermore, many of our un-
confirmed findings are associated to related phenotypes or contain protein-coding genes that are
over-expressed in the relevant tissues. Even though our method does not perform fine-mapping in
the traditional sense—we do not work with imputed variants—it is more flexible and appreciably
more powerful than the existing genome-wide alternatives, such as BOLT-LMM. Furthermore, our
16
discoveries are more tightly localized than those based on marginal testing, and hence immediately
more interpretable, which will facilitate any follow-up analysis.
The possibility to include individuals of diverse ancestries opens alluring research opportunities.
For example, we would like to understand which discoveries are consistent across populations
and which are more specific, as this may help further weed out false positives, explain observed
variations in phenotypes, and possibly shed more light onto the underlying biology.
SOFTWARE AVAILABILITY
We have implemented our methods in a standalone software written in C++, which is available
from https://msesia.github.io/knockoffzoom-v2/. This takes as input phased haplotypes in
BGEN format63 and outputs genotype knockoffs at the desired resolution in the PLINK51 BED
format. Our software is designed for the analysis of large data: it is multi-threaded and memory
efficient. Furthermore, if sufficient computational resources are available, knockoffs for different
chromosomes can be constructed in parallel. For reference, it took us approximately 4 days using
10 cores and 80GB of memory to generate knockoffs for the UK Biobank data on chromosome 1
(approximately 1M haplotype sequences, 600k SNPs, and 600k IBD segments).
DATA AVAILABILITY
Data from the UK Biobank Resource (application 27837); see https://www.ukbiobank.ac.uk/.
ACKNOWLEDGEMENTS
M. S. was in the Department of Statistics at Stanford University. M. S., S. B., E. C. and C. S.
were supported by NSF grant DMS 1712800. S. B. was also supported by a Ric Weiland fellowship.
E. C. and C. S. were also supported by NSF grant OAC 1934578 and by a Math+X grant (Simons
Foundation). We thank Kevin Sharp (University of Oxford) for sharing useful computer code. The
authors acknowledge the participants and investigators of the UK Biobank project, the FinnGen
study, and the Japan Biobank Project.
1 Lander, E. & Kruglyak, L. Genetic dissection of complex traits: guidelines for interpreting and reporting
linkage results. Nat. Genet. 11, 241–247 (1995).
17
2 Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci.
U.S.A 100, 9440–9445 (2003).
3 Sabatti, C., Service, S. & Freimer, N. False discovery rate in linkage and association genome screens for
complex disorders. Genetics 164, 829–833 (2003).
4 Brzyski, D. et al. Controlling the rate of GWAS false discoveries. Genetics 205, 61–75 (2017).
5 Sesia, M., Sabatti, C. & Candes, E. Gene hunting with hidden Markov model knockoffs. Biometrika
106, 1–18 (2019).
6 Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
7 Hill, W. & Robertson, A. Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38, 226–231
(1968).
8 Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination hotspots using
single-nucleotide polymorphism data. Genetics 165, 2213–2233 (2003).
9 Sesia, M., Katsevich, E., Bates, S., Candes, E. & Sabatti, C. Multi-resolution localization of causal
variants across the genome. Nat. Comm. 11, 1093 (2020).
10 Tam, V. et al. Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 20, 467–484
(2019).
11 Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
12 Pritchard, J. K., Stephens, M., Rosenberg, N. A. & Donnelly, P. Association mapping in structured
populations. Am. J. Hum. Genet. 67, 170–181 (2000).
13 Sul, J. H., Martin, L. S. & Eskin, E. Population structure in genetic studies: Confounding factors and
mixed models. PLoS Genet. 14, e1007309 (2018).
14 Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association
studies. Nat. Genet. 38, 904–909 (2006).
15 Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of
relatedness. Nat. Genet. 38, 203–208 (2006).
16 Kang, H. M. et al. Efficient control of population structure in model organism association mapping.
Genetics 178, 1709–1723 (2008).
17 Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association
studies. Nat. Genet. 42, 348–354 (2010).
18 Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet.
11, 499–511 (2010).
19 Schaid, D. J., Chen, W. & Larson, N. B. From genome-wide associations to candidate causal variants
by statistical fine-mapping. Nat. Rev. Genet. 19, 491–504 (2018).
20 Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in
regression, with application to genetic fine mapping. J. R. Stat. Soc. B. (2020).
21 Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B. & Eskin, E. Identifying causal variants at loci
with multiple signals of association. Genetics 198, 497–508 (2014).
18
22 Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach
to multiple testing. J. R. Stat. Soc. B. 57, 289–300 (1995).
23 Candes, E., Fan, Y., Janson, L. & Lv, J. Panning for gold: Model-X knockoffs for high-dimensional
controlled variable selection. J. R. Stat. Soc. B. 80, 551–577 (2018).
24 Barber, R. F. & Candes, E. Controlling the false discovery rate via knockoffs. Ann. Stat. 43, 2055–2085
(2015).
25 Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype
data. Genetics 155, 945–959 (2000).
26 Falush, D., Stephens, M. & Pritchard, J. K. Inference of population structure using multilocus genotype
data: linked loci and correlated allele frequencies. Genetics 164, 1567–1587 (2003).
27 Koivisto, M. et al. An MDL method for finding haplotype blocks and for estimating the strength of
haplotype block boundaries. In Biocomputing 2003, 502–513 (World Scientific, 2002).
28 Browning, S. R. & Browning, B. L. Haplotype phasing: existing methods and new developments. Nat.
Rev. Genet. 12, 703 (2011).
29 Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data:
applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–644
(2006).
30 Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide
association studies by imputation of genotypes. Nat. Genet. 39, 906 (2007).
31 Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the
next generation of genome-wide association studies. PLoS Genet. 5, e1000529 (2009).
32 Li, Y., Willer, C. J., Ding, J., Scheet, P. & Abecasis, G. R. MaCH: using sequence and genotype data
to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34, 816–834 (2010).
33 Weir, B. S. Genetic Data Analysis (Sinauer, Sunderland, Massachusetts, 1990).
34 Stephens, J. C., Briscoe, D. & O’Brien, S. J. Mapping by admixture linkage disequilibrium in human
populations: limits and guidelines. Am. J. Hum. Genet. 55, 809 (1994).
35 Delaneau, O., Marchini, J. & Zagury, J.-F. A linear complexity phasing method for thousands of genomes.
Nat. Methods 9, 179 (2012).
36 Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and popu-
lation genetic studies. Nat. Methods 10, 5 (2013).
37 O’Connell, J. et al. Haplotype estimation for biobank-scale data sets. Nat. Genet. 48, 817 (2016).
38 Thompson, E. A. Identity by descent: variation in meiosis, across genomes, and in populations. Genetics
194, 301–326 (2013).
39 Bycroft, C. et al. The UK biobank resource with deep phenotyping and genomic data. Nature 562,
203–209 (2018).
40 Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-
scale datasets. Nat. Genet. 50, 906–908 (2018).
19
41 Japan, B. Biobank Japan Project (2020). URL http://jenger.riken.jp/en/.
42 FinnGen. FinnGen documentation of r3 release (2020). URL https://finngen.gitbook.io/
documentation/.
43 Sved, J. Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theor.
Popul. Biol. 2, 125–141 (1971).
44 Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nat.
Genet. 40, 1068 (2008).
45 Gusev, A. et al. Whole population, genome-wide mapping of hidden relatedness. Genome Res. 19,
318–326 (2009).
46 Browning, B. L. & Browning, S. R. A fast, powerful method for detecting identity by descent. Am. J.
Hum. Genet. 88, 173–182 (2011).
47 Bjelland, D. W., Lingala, U., Patel, P. S., Jones, M. & Keller, M. C. A fast and accurate method for
detection of IBD shared haplotypes in genome-wide SNP data. Eur. J. Hum. Genet. 25, 617–624 (2017).
48 Naseri, A., Liu, X., Tang, K., Zhang, S. & Zhi, D. Rapid: ultra-fast, powerful, and accurate detection of
segments identical by descent (ibd) in biobank-scale cohorts. Genome Biol. 20, 143–143 (2019).
49 Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society:
Series B (Methodological) 58, 267–288 (1996).
50 Marchini, J. L. Discussion of gene hunting with hidden Markov model knockoffs. Biometrika 106, 27–28
(2019).
51 Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses.
Am. J. Hum. Genet. 81, 559–575 (2007).
52 Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies,
targeted arrays and summary statistics 2019. Nucleic acids research 47, D1005–D1012 (2019).
53 Hoffmann, T. J. et al. Genome-wide association analyses using electronic health records identify new
loci influencing blood pressure variation. Nature Genet. 49, 54 (2017).
54 Locke, A. E. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature
518, 197–206 (2015).
55 deCODE genetics. https://www.decode.com/ (2019). Accessed: 2019-12-06.
56 Popejoy, A. B. & Fullerton, S. M. Genomics is failing on diversity. Nature News 538, 161 (2016).
57 Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177,
26–31 (2019).
58 Rosenberg, N. A. et al. Genome-wide association studies in diverse populations. Nat. Rev. Genet. 11,
356–366 (2010).
59 Bitarello, B. D. & Mathieson, I. Polygenic scores for height in admixed populations. bioRxiv preprint
(2020).
60 Cavazos, T. B. & Witte, J. S. Inclusion of variants discovered from diverse populations improves polygenic
risk score transferability. bioRxiv preprint (2020).
20
61 Gaziano, J. M. et al. Million Veteran Program: A mega-biobank to study genetic influences on health
and disease. J. Clin. Epidemiol. 70, 214–223 (2016).
62 Bates, S., Sesia, M., Sabatti, C. & Candes, E. J. Causal inference in genetic trio studies. arXiv preprint
2002.09644 (2020).
63 Band, G. & Marchini, J. BGEN: a binary file format for imputed genotype and haplotype data. BioRxiv
308296 (2018).
64 Sesia, M., Sabatti, C. & Candes, E. Rejoinder: Gene hunting with hidden Markov model knockoffs.
Biometrika 106, 35–45 (2019).
65 Sabatti, C. Multivariate Linear Models for GWAS, 188–207 (Cambridge University Press, 2013).
66 Prive, F., Aschard, H., Ziyatdinov, A. & Blum, M. G. B. Efficient analysis of large-scale genome-wide
data with two R, packages: bigstatsr and bigsnpr. Bioinformatics 34, 2781–2787 (2018).
67 Consortium, I. H. . et al. Integrating common and rare genetic variation in diverse human populations.
Nature 467, 52 (2010).
68 Kinderman, R. & Snell, S. Markov random fields and their applications (American Mathematical Society,
Providence, RI, USA, 1980).
69 Yedidia, J., Freeman, W. & Weiss, Y. Understanding belief propagation and its generalizations. In
Exploring Artificial Intelligence in the New Millenium, vol. 8, 239–269 (Morgan Kaufmann Publishers
Inc., San Francisco, CA, USA, 2003).
70 Bates, S., Candes, E., Janson, L. & Wang, W. Metropolized knockoff sampling. J. Am. Stat. Assoc.
1–25 (2020).
METHODS
Formal definition of the inferential objective
We begin by formally stating our objective. Let Y ∈ Rn be a phenotype of interest measured
for n subjects, and let X ∈ Rn×p be the corresponding genotypes at p sites, which are assumed
to be random variables from an HMM, as we will discuss shortly. Our goal is to detect genomic
regions containing distinct associations with the phenotype, as precisely as possible. Formally, we
seek this objective by testing conditional null hypotheses23 in the form of
H0,g : Y |= XGg | X−Gg , (1)
where G = (G1, . . . , GL) is a fixed partition of {1, . . . , p}, XGg denotes the variants in group Gg,
and X−Gg denotes those outside it. This framing is different from that employed by classical
GWAS techniques, where one is only able to test marginal independence between the phenotype
and a single variant Xj—a scientifically less interesting hypothesis.5,64 In particular, conditioning
21
on the remainder of the genome (X−Gg in the above notation) ensures that a rejection of the null
hypothesis implies the presence of a unique association of Y with the variants in the region Gg,
rather than a spurious correlation caused by LD. Moreover, such tests are naturally robust to the
confounding of population structure,64 and even enable formal causal inferences in some cases.62
While we do not have the space for a full account here, previous work has already discussed at
length the relative advantages of conditional testing over marginal testing.5,9,62,64 The only caveat is
that correctly implementing these tests using GWAS data is technically challenging in the presence
of population structure9—a difficulty that we address here by building upon KnockoffZoom.9
Review of KnockoffZoom
KnockoffZoom is a recently-introduced statistical technique for genome-wide association studies
that simultaneously tests the conditional null hypotheses defined above for all groups of variants in
any given partition of the genome, controlling the FDR (the expected number of false discoveries)
below a desired target level (e.g., 10%). By applying this procedure multiple times with increas-
ingly refined partitions, one can localize interesting conditional associations at different levels of
resolutions, ranging from a few hundred kilo-bases to single-SNP precision. Unlike those of other
approaches, the statistical guarantees of KnockoffZoom do not rely on any assumption about the
genetic architecture of the trait, such as linearity and Gaussian errors. Instead, KnockoffZoom only
needs to model the distribution of the genotypes with an HMM, consistently with the standard
approaches for phasing and imputation.5,9 To test the hypotheses in (1), KnockoffZoom leverages
a statistical technology called knockoffs,5,23,24 which we describe informally below.
For the genotypes X(i) of any individual i, we construct in silico a synthetic “knockoff” version
X(i) ∈ Rp, in such a way that X and X are statistically exchangeable at the population level,
except for the fact that X has no conditional associations with Y . In other words, for any group
Gg in the given genome partition, the distribution of XGg and XGg are indistinguishable unless
there is a conditional association between XGg and Y . This property, which will be formally stated
later, allows the knockoff genotypes X to serve as negative controls for X.23 The intuition is that,
by construction of the knockoffs, any detectable difference between XGg and XGg implies that the
region Gg must contain a distinct association with Y . Practically, one can control the FDR for the
conditional hypotheses in (1) by computing a test statistics for each group of variants (i.e., as the
contrast between any empirical association measures between Y and XGg , XGg , respectively) and
then rejecting the null hypotheses whose statistics are greater than the data-adaptive significance
22
threshold computed by the knockoff filter.24 To maximize the number of discoveries, one should use
powerful association (or, importance) measures; the typical solution is to compute these starting
from a multivariate predictive model, as explained next.23
KnockoffZoom9 fits a sparse generalized linear regression model of Y given the augmented data
[X, X] ∈ Rn×2p, after standardizing X and X to have unit variance. Then, letting βj(λCV) and
βj+p(λCV) indicate the estimated coefficients for Xj and Xj , respectively, it defines the importance
measures Tg =∑
j∈Gg|βj(λCV)| and Tg =
∑j∈Gg
|βj+p(λCV)| for each group of variants Gg. (The
regularization parameter λCV is tuned by cross-validation). We adopt these statistics because they
have the advantage of being powerful,9 interpretable for GWAS,65 and computationally feasible
even with very large data.66 However, the knockoffs framework can theoretically accommodate
virtually any statistics.23 The importance measures are combined into a test statistic for each
group of variants, i.e., Wg = Tg − Tg, which is finally provided as input to the knockoff filter24 to
select significant associations. In particular, the knockoff filter selects groups of SNPs whose W
is sufficiently large (i.e., here large values of W imply that the corresponding real genotypes have
larger regression coefficients compared to the knockoffs). See Figure 5 for a schematic summary of
the entire procedure.
Data
Data augmentation
Predictive system Calibration
Novel component
Y X H knockoff algorithm
haplotype model
H X
(X, X)
predictive model
genome partition
test statistics knockoff filter discoveries
FDR level
phasing de-phasing
FIG. 5. KnockoffZoom workflow. The novelty consists of an HMM for the distribution of haplotypes,
H, that can account for population structure and familial relatedness as well as LD, and of the associated
algorithm for generating knockoffs. For computational reasons, the genotypes are phased prior to generating
knockoffs, and the knockoff haplotypes are then de-phased to obtain knockoff genotypes.9
23
KnockoffZoom model setup
KnockoffZoom v15 assumes all samples to be independent and identically distributed (i.i.d.):
(X(i), Y (i))i.i.d.∼ PX,Y = PX · PY |X ,
where the distribution of the genotypes, PX , is approximated by the fastPHASE HMM,29 and PY |X
is left completely unspecified.23 The fastPHASE HMM restricts the applicability of KnockoffZoom
to homogeneous samples9 (see also Supplementary Notes A– C), while the i.i.d. assumption is
naturally ill-suited to described closely related samples that, in addition to sharing long nearly
identical portions of DNA, may also be exposed to the same environmental factors affecting their
phenotypes. Our present work shows how to overcome these limitations. We generalize the above
setup by assuming a fixed family structure and by grouping together individuals in the same
family, which are no longer treated independently. Furthermore, the distribution of the genotypes
is allowed to vary across families, depending on the population from which the family descended.
Formally, let F = {F1, . . . ,F|F|} indicate a fixed partition of {1, . . . , n}, where Ff ⊆ {1, . . . , n}for all f ∈ {1, . . . , |F|}. Then, we assume that genotypes and phenotypes, along with a shared
family effect E, are sampled independently for individuals in different families, from some joint
distribution:
(X(Ff ), Y (Ff ), Ef ) ∼ P fX,Y,E = P fE · PfX · P
fY |X,E . (2)
Above, the random matrix X(Ff ) ∈ R|Ff |×p contains the genotypes (i.e., the rows of X) for all
individuals in the fth family and is sampled jointly from P fX , the random vector Y (Ff ) ∈ R|Ff |
describes the corresponding phenotypes, and Ef is the environmental factor shared by all family
members. Crucially, these distributions are allowed to vary across families, which encodes the
fact that the distribution of genotypes and phenotypes varies across populations. Importantly, the
distributions P fE and P fY |X,E are not assumed to be known; only P fX must be specified to carry
out the procedure. Conditional on Xf and Ef , the phenotypes Y f are assumed to be sampled
independently for each individual:
p(Y (Ff ) | X(Ff ), Ef ) =∏i∈Ff
pf (Y (i) | X(i), Ef ),
for some distribution function pf , which may also vary across families. Lastly, note that X |= E,
which means E is assumed to be an environmental factor unrelated to the genetic effects.
Before proceeding to the mechanics of the test, we pause to explain why testing the hypotheses
in (1) properly already implicitly accounts for the family effects E. Since our motivation is to help
24
build a better genetic understanding the phenotype, the most intuitive goal within the above setup
is to test the null hypotheses
H0,g : Y |= XGg | X−Gg , E,
for all g ∈ {1, . . . , L}. However, these hypotheses are equivalent to those in (1) because we have
assumed X |= E. Therefore, tests of (1) already account for family effects, as long as we can carry
them out correctly, which is the problem we address here.
The last preliminary step is to formally define the knockoffs. This requires a small extension
of the existing framework,23 because there are now dependent groups of observations (families)
that follow different distributions. Nonetheless, the existing theory can be easily adapted for this
setting. In particular, we can still test the hypotheses in (1) with the knockoff filter,24 as long as
the KnockoffZoom test statistics satisfy the flip-sign property in Lemma 3.3 of Candes et al.23 To
ensure this property holds in the presence of families, the knockoff exchangeability property23 must
be defined in the following way: for any fixed partition G of the variants, we require that swapping
any group of SNPs with the corresponding knockoffs, simultaneously for all family members, would
keep the joint distribution of genotypes and knockoffs statistically indistinguishable. Formally, this
can be written as: (X(Ff ), X(Ff )
)swap(S;G)
d=(X(Ff ), X(Ff )
)∀S ⊆ G, (3)
where Ff indicates a particular family, and swap(S;G) is the operator that swaps all columns of
X(Ff ) indexed by S with the corresponding columns of X(Ff ). In the following, we will develop
a novel algorithm, based on a joint model for the genotypes in each family, to generate knockoff
genotypes satisfying (3).
Modeling haplotypes with population structure
We begin by recalling some useful notation for HMMs.9 We say that a sequence of phased
haplotypes H = (H1, . . . ,Hp), with Hj ∈ {0, 1}, is distributed as an HMM with K hidden states if
there exists a vector of latent random variables Z = (Z1, . . . , Zp), with Zj ∈ {1, . . . ,K}, such that:Z ∼ MC (Q) , (latent discrete Markov chain),
Hj | Z ∼ Hj | Zj ind.∼ fj(Hj | Zj), (emission distribution).
(4)
Above, the Markov chain MC (Q) has initial probabilities Q1 and transition matrices (Q2, . . . , Qp).
25
Taking inspiration from SHAPEIT,35–37 we assume the i-th haplotype sequence can be approx-
imated as an imperfect mosaic of K other haplotypes in the data set, indexed by {σi1, . . . , σiK} ⊆{1, . . . , 2n} \ {i}. (Note the slight overload of notation: i denotes hereafter a phased haplotype
sequence, two of which are available for each individual). We will discuss later how the references
are determined; for now, we take them as fixed and describe the other aspects of the model. Math-
ematically, the mosaic is described by an HMM in the form of (4), where the distribution of the
latent Markov chain is:
P[Z(i)1 = k] = α
(i)k ,
P[Z(i)j = k′ | Z(i)
j−1 = k] = Q(i)j (k′ | k) =
(1− e−ρdj
)α(i)k′ + e−ρdj , if k′ = k,(
1− e−ρdj)α(i)k′ , if k′ 6= k.
(5)
Above, dj indicates the genetic distance between loci j and j − 1, which is assumed to be fixed
and known (in practice, we will use distances previously estimated in a European population,67
although our method could easily accommodate different distances for different populations). The
parameter ρ > 0 controls the rate of recombination and can be estimated with an expectation-
maximization (EM) technique (Supplementary Methods). However, we have observed it works
well with our data to simply set ρ = 1; this is consistent with the approach of SHAPEIT,35–37
which also uses fixed parameters. The positive weights (α(i)1 , . . . , α
(i)K ) are normalized so that their
sum equals one and they can be interpreted as characterizing the ancestry of the i-th individual.
In the applications discussed within this paper, we simply assume α(i)k = 1/K for all k, although
these parameters could also be estimated by EM (Supplementary Methods). Conditional on the
sequence Z, each element of H follows an independent Bernoulli distribution:
f(i)j (H
(i)j | k) =
1− λj , if H
(i)j = H
(σik)j ,
λj , if H(i)j 6= H
(σik)j .
(6)
Above, the parameter λj is a site-specific mutation rate, which makes the HMM mosaic imperfect.
Earlier works that first proposed this model also suggested formulae for determining suitable values
of the parameters ρ and λ = (λ1, . . . , λp) in terms of the physical distances between SNPs and other
population genetics quantities.8 However, given that we have access to large a data set, we choose
instead to estimate λ by EM (Supplementary Methods). We have noticed that it works well to
also explicitly prevent λ from being too large or too small (e.g., 10−6 ≤ λj ≤ 10−3).
To reduce the computational burden and mitigate the risk of overfitting, the value of K should
not be too large; here, we adopt K = 100. We have observed that larger values improve the
26
goodness-of-fit relatively little, while reducing power by increasing the similarity between vari-
ables and knockoffs.9 Having thus fixed K, the identities of the reference haplotypes for each i,
{σi1, . . . , σiK}, are chosen in a data-adaptive fashion as those whose ancestry is most likely to be
similar to that of H(i). Concretely, one may carry this out as outlined by Algorithm 1, separately
chromosome-by-chromosome. Different options are available for defining similarities between hap-
lotypes; for simplicity, we use the Hamming distance. Since computing the pairwise distances
between all haplotypes would have quadratic complexity in the sample size, which is unfeasible
for large data, we first divide the haplotypes into clusters of size N , with K � N � 2n (i.e.,
N ≈ 5000), through recursive 2-means clustering, and then we only compute distances within
clusters, following in the footsteps of SHAPEIT v3.37
Algorithm 1 Choosing the HMM reference haplotypes
Input: haplotypes H ∈ {0, 1}2n×p, parameter K;
Input: hyperparameters N1, N2, s.t. K � N1 < N2 � n.
Input: a distance measure ξ between haplotypes.
Divide {1, . . . , 2n} into M sets Cc s.t. N1 ≤ |Cc| ≤ N2, by recursive 2-means clustering.37
for c = 1, . . . ,M do
Compute a distance matrix D ∈ R|Cc|×|Cc| for all haplotypes in Cc, with ξ.
for i in Cc do
Define R(i) as the set of K nearest neighbors of Hi in Cc.
Output: a set R(i) of K references for each haplotype H(i).
While the approach in Algorithm 1 is the easiest to introduce first, in practice it is preferable for
our purpose to utilize a different set of local references in different parts of the same chromosome.
We will describe this extension later, after discussing the novel knockoff generation algorithm.
Generating knockoffs preserving population structure
Conditional on the reference indices, {σi1, . . . , σiK}, the above setup describes each haplotype
sequence H(i) as an HMM; a model for which we already know how to generate knockoffs in
theory.5,9 Therefore, generating knockoffs in our setting is straightforward: all we need to do is to
define the customized HMM for each haplotype sequence, and then to apply the knockoff generation
algorithm previously developed in KnockoffZoom v1.9 This solution is outlined in Algorithm 2.
27
Algorithm 2 Knockoff haplotypes preserving population structure
Input: haplotypes H ∈ {0, 1}2n×p, genetic map ρ ∈ Rp−1, partition G of {1, . . . , p}; parameter K.
for i = 1, . . . , 2n do
Assign references R(i) = {σi1, . . . , σiK} with Algorithm 1.
Initialize α(i)k ← 1
K , for each k ∈ {1, . . . ,K}.
Estimate λ = (λ1, . . . , λp) by EM (Supplementary Methods), initialize ρ← 1.
for i = 1, . . . , 2n do
Define the HMM {R(i), ρ, λ}.Sample Z(i) = (Z
(i)1 , . . . , Z
(i)p ) from P[Z(i) | H(i)], with step I of Algorithm 3 in Sesia et al.9
Sample a knockoff copy Z(i) of Z(i) with respect to G, with step II of Algorithm 3 in Sesia et al.9
Sample H(i) from P[H(i) | Z(i) = Z(i)], with step III of Algorithm 3 in Sesia et al.9
Output: knockoff haplotypes H ∈ {0, 1}2n×p.
Knockoffs with local reference motifs based on hold-out distances
Relatedness is not necessarily homogeneous across the genome. This is particularly evident in
the case of admixture, which may cause an individual to share haplotypes with a certain population
only in part of a chromosome. Therefore, it is worth extending Algorithms 1–2 to accommodate
different local references within the same chromosome. We implement this as follows.
First, we divide each chromosome into relatively wide genetic windows (10 Mb, for concreteness);
then, we choose the reference haplotypes separately within each of them, based on their similarities
outside the window of interest. In order to allow the choice of references to be as locally adaptive as
possible, we only consider the alleles in the two neighboring windows when computing distances.
This approach is similar to that of SHAPEIT v3,37 although the latter does not hold out the
SNPs in the current window to determine local similarity. Our approach is better suited for
knockoff generation because it reduces overfitting—knockoffs too similar to the original variables—
and consequently increases power. Once the local references for each haplotype have been defined,
we can apply Algorithm 2 window-by-window. To avoid discontinuities at the boundaries, we
consider overlapping windows (expanded by 10 Mb on each side). More precisely, we condition on
all SNPs within 10 Mb when sampling the latent Markov chain (step I of Algorithm 3 in Sesia et
al.9) but then we only generate knockoffs within the window of interest.
28
Knockoffs preserving familial relatedness
Within an IBD-sharing family of size m, we model the joint distribution of the haplotype
sequences, namely (H(1), . . . ,H(m)), as an HMM with a Km-dimensional latent Markov chain. For
simplicity, we write this as Z(1 :m) = (Z(1), . . . , Z(m)), where Z(i) ∈ {1, . . . ,K}p. Conditional on
Z(i), each element of H(i) is independent and follows the same emission distribution as in (4)–(6):
P[H
(i)j = 1 | Z(i)
j = k]
= f(i)j (1 | k) =
1− λ, if H
(σik)j = 1,
λ, if H(σik)j = 0.
(7)
If Z(1), . . . , Z(m) were also independent of each other, this would reduce to a collection of m HMMs
of the type in (4)–(6). However, to account for familial relatedness, we assume that different Z(i)
are coupled at the IBD segments (which we have previously detected and hold fixed), as described
below.
Denote by ∂(i, j) ⊂ {1, . . . ,m} the set of haplotype indices that share an IBD segment with H(i)
at position j. Let us also define ηi,j = 1/(1 + |∂(i, j)|) ∈ (0, 1]. Then, we model the distribution of
Z(1 :m) as follows. For 1 < j ≤ p,
P[Z
(1 :m)j = z
(1 :m)j | Z(1 :m)
j−1 = z(1 :m)j−1
]=
m∏i=1
(Q
(i)j (z
(i)j | z
(i)j−1)
)ηi,j ∏i′∈∂(i,j)
1[z(i)j = z
(i′)j ], (8)
where the transition matrices Q(i)j are defined as in (5), while
P[Z
(1 :m)1 = (k(1), . . . , k(m))
]=
m∏i=1
(α(i)
k(i)
)ηi,j ∏i′∈∂(i,j)
1[k(i) = k(i′)]. (9)
The first term on the right-hand-side of (8) describes the transitions in the Markov chain, while
the second term is the constraint that makes the haplotypes match along the IBD segments. The
purpose of the ηi,j exponent is to make the marginal distribution of each sequence as consistent
as possible with the model presented earlier for unrelated haplotypes. (If we set instead ηi,j = 1,
transitions of the latent state may occur with significantly different frequency inside and outside
IBD segments). In the trivial cases of families of size one, ∂(1, j) = ∅ and η1,j = 1, for all
j ∈ {1, . . . , p}, so it is easy to see that (8)–(9) reduce to the original model for unrelated haplotypes
in (5). By contrast, in the general case, the latent states for different haplotypes in the same family
are constrained to be identical along all IBD segments. See Supplementary Figure 36 (a) for a
graphical representation of this model.
Generating knockoffs with the existing algorithm for HMMs in this case would have compu-
tational complexity equal to O(npKm), which is unfeasible for large data sets unless m = 1. To
29
understand why this complexity is O(npKm), note that the model is effectively an HMM with a
Km-dimensional latent Markov chain,9 where each vector-valued variable corresponds to the al-
leles at a specific site for all individuals in the family. Fortunately, one can equivalently look at
the joint distribution of (Z(1:m), H(1:m)) as a more general Markov random field68 with 2×m× pvariables, each taking values in {1, . . . ,K} or {0, 1}, respectively. See Supplementary Figure 36 (b)
for a graphical representation of this model, where each node corresponds to one of the two hap-
lotypes of an individual at a particular marker. The random field perspective opens the door to
more efficient inference and posterior sampling based on message-passing algorithms.69 Leaving the
technical details to the Supplementary Methods, we outline the new knockoff generation procedure
in Algorithm 3.
In a nutshell, we follow the same spirit of the construction for unrelated haplotypes in Algo-
rithm 2, with the important difference that the HMM with a K-dimensional latent Markov chain
of length p is replaced by a latent Markov random field with 2 ×m × p variables, which requires
some innovations.
• First, the K haplotype references in the model for each H(i) are shared by all haplotypes
within the same family; see Supplementary Algorithm 1.
• Second, the posterior sampling of Z(1:m) | H(1:m) is carried out via generalized belief
propagation69 instead of the standard forward-backward procedure for HMMs;5,9 see Sup-
plementary Algorithm 2 and Supplementary Figure 37. (This generally involves some degree
of approximation, but it is exact for many family structures).
• Third, the knockoff copies Z(1:m) of Z(1:m) are generated with a modified version of
the existing algorithm for HMMs9 that circumvents the computational difficulties of the
higher-dimensional model by breaking the couplings between different haplotypes through
conditioning70 upon the extremities of the IBD segments; see Supplementary Algorithm 3
and Supplementary Figure 38. To clarify, conditioning on the extremities of the IBD seg-
ments means that we make the knockoffs identical to the true genotypes for a few sites in
each family, which reduces power only slightly (we consider relatively long IBD segments, so
there are few extremities), but greatly reduces the computational burden (see Supplement
for a full explanation).
It is worth remarking that, for trivial families of size one, this is exactly equivalent to Algorithm 2.
Finally, note that the extension to local references with hold-out distances discussed earlier also
30
applies seamlessly here. Even though our software implements this extension, we do not explicitly
write down the algorithms with local references in this paper for the sake of clarity and space.
Algorithm 3 Knockoff haplotypes preserving population structure and familial relatedness
Input: H ∈ {0, 1}2n×p, d ∈ Rp−1, G, and K as in Algorithm 2, collection of IBD segments I.
Define the set U ⊆ {1, . . . , n} of haplotype indices that do not share any IBD segments in I.
Divide the remaining haplotypes into L distinct families Ff ⊆ {1, . . . , n}, for f ∈ {1, . . . , L}.for f ∈ 1, . . . , L do
Assign references R(f) = {σf1, . . . , σfK} with Supplementary Algorithm 1.
Initialize α(f)k ← 1
K , for each k ∈ {1, . . . ,K}.
Estimate λ = (λ1, . . . , λp) by EM (Supplementary Methods), initialize ρ← 1.
for f ∈ 1, . . . , L do
Define the HMM {R(f), ρ, λ}.Sample (Z(i))i∈Ff
∈ {1, . . . ,K}m×n from P[(Z(i))i∈Ff| (H(i))i∈Ff
], with Supplementary Algorithm 2.
Sample a knockoff copy (Z(i))i∈Ffof (Z(i))i∈Ff
with respect to G, with Supplementary Algorithm 3.
Sample H(i) from P[H(i) | Z(i) = Z(i)], coordinate by coordinate independently, for all i ∈ Ff .
for i ∈ U do
Generate a knockoff copy H(i) of H(i) as in Algorithm 2.
Output: knockoff haplotype matrix H ∈ {0, 1}2n×p that preserves the IBD segments.
Quality control and data pre-processing for the UK Biobank
We begin with 487,297 genotyped and phased subjects in the UK Biobank (application 27837).
Among these, 147,669 have reported at least one close relative in the data set. We define families by
clustering together individuals with kinship greater or equal to 0.05; then we discard 322 individuals
(chosen as those with the most missing phenotypes) to ensure that no families have size larger
than 10. This leaves us with 136,818 individuals labeled as related, divided into 57,164 families.
The median family size is 2, and the mean is 2.4). The total number of individuals that passed
this quality control (included those who have no relatives) is 486,975. We only analyze biallelic
SNPs with minor allele frequency above 0.1% and in Hardy-Weinberg equilibrium (10−6), among
the subset of 350,119 unrelated British individuals analyzed in previous work.9 The final SNPs
count is 591,513.
31
Including additional covariates
We account for the sex, age and squared age of the subjects to increase power (squared age is
not used for height), as in earlier work.9,40 These covariates are leveraged in the KnockoffZoom
predictive model, along with the top 5 genetic principal components. The inclusion of the principal
components, which may capture population-wide shifts in the distribution of the phenotype, is
useful to improve power because it increases the accuracy of the predictive model while keeping it
sparse,9 but it is unnecessary to avoid spurious discoveries due to population structure, since our
method already accounts for that through the knockoffs. We thus fit a sparse regression model to
predict Y given [Z,X, X] ∈ Rn×(r+2p), where Z,X, X are the r covariates, the p genotypes, and
the p knockoffs, respectively. The model coefficients for Z are not regularized, and their estimates
are not directly included in the test statistics.
Numerical experiments
Feature importance measures for each SNP are computed in three alternative ways: by fitting the
lasso with cross-validation and taking the absolute value of the estimated regression coefficients;9
by running BOLT-LMM40 and taking the negative logarithm of the marginal p-values; and by
performing univariate logistic regression (in the case of binary phenotypes) and taking the negative
logarithm of the marginal p-values. These models are designed to predict Y given [X, X]; in the first
two cases we also include the top 10 principal components of the genotype matrix (computed on
the entire UK Biobank data set) as additional covariates. Then, the feature importance measures
Tj and Tj , for the jth SNP and its corresponding knockoff, are combined in the usual way to define
the knockoff test statistics for each group Gg ⊆ {1, . . . , p} of variables: Wg =∑
j∈GgTj−
∑j∈Gg
Tj .
In the simulations with related samples, the latent Gaussian variable for the ith individual in
the probit model is given by:
L(i) =
p∑j=1
βjX(i)j + γE(fi) +
√1− γ2ε(i),
where E(fi) and ε(i) are independent standard normal random variables, γ ∈ [0, 1], and fi denotes
the family to which the ith individual belongs. The magnitude of the nonzero genetic coefficients
β is varied as a parameter, to control the total heritability of the trait.
Supplementary Information for
Controlling the false discovery rate in GWAS with population structure
Matteo Sesia, Stephen Bates, Emmanuel Candes, Jonathan Marchini, Chiara Sabatti
2
SUPPLEMENTARY METHODS
A. Estimating model parameters by EM
We can estimate the HMM parameters θ = (α, λ, ρ) in (5)–(6), in the main paper, with an
expectation-maximization (EM) method. To write down the algorithm explicitly, we begin by
noting the log-likelihood of θ given both the observable, H, and latent, Z, variables is:
`(θ;H,Z) = log p(H,Z | θ)
=n∑i=1
log p(H(i), Z(i) | θ)
=n∑i=1
log
p∏j=1
Qj(Z(i)j | Z
(i)j−1) ·
p∏j=1
f(i)j (H
(i)j | Z
(i)j )
=
n∑i=1
p∑j=1
logQj(Z(i)j | Z
(i)j−1) +
n∑i=1
p∑j=1
log f(i)j (H
(i)j | Z
(i)j ).
This log-likelihood cannot be directly minimized because we cannot observe Z. Instead, given an
initial estimate of the model parameters, θ(t−1), we iteratively update θ(t) by minimizing
L(θ, θ(t−1)) = EZ[`(θ;H,Z) | H, θ(t−1)
]=
n∑i=1
p∑j=1
EZ[logQj(Z
(i)j | Z
(i)j−1) | H(i), θ(t−1)
]+
n∑i=1
p∑j=1
EZ[log f
(i)j (H
(i)j | Z
(i)j ) | H(i), θ(t−1)
].
(1)
This quantity can be computed and minimized efficiently by leveraging the Markov property, as in
the Baum-Welch algorithm.
Let us begin by defining, for any fixed j ∈ {1, . . . , p}, the posterior marginals
γ(i)j (k) = P
[Z
(i)j = k | H(i), θ(t−1)
].
It is well-known that these quantities can be computed efficiently with the classical forward-
backward iteration that defines the expectation (E) step of the EM algorithm. What remains
to be developed explicitly is the maximization (M) step of the EM algorithm; we will do this in
the following, separately for α, λ, and ρ. These are fairly standard calculations but we outline the
details here for completeness.
3
1. Estimating the site-specific mutation rate λj
For any fixed j ∈ {1, . . . , p}, the parameter λj appears in the second term of (1):
1
n
n∑i=1
EZ[log f
(i)j (H
(i)j | Z
(i)j ) | H(i), θ(t−1)
]=
1
n
n∑i=1
∑z
log f(i)j (H
(i)j | Z
(i)j )P
[Z(i) = z | H(i), θ(t−1)
]=
1
n
n∑i=1
∑k
log f(i)j (H
(i)j | Z
(i)j = k)P
[Z
(i)j = k | H(i), θ(t−1)
]=
1
n
n∑i=1
∑k
log f(i)j (H
(i)j | Z
(i)j = k)γ
(i)j (k)
=1
n
n∑i=1
∑k
log
[(1− λj)δH(i)
j ,R(i)j (k)
+ λj(1− δH(i)j ,R
(i)j (k)
)
]γ
(i)j (k)
= log(1− λj)1
n
n∑i=1
∑k
δH
(i)j ,R
(i)j (k)
γ(i)j (k) + log(λj)
1
n
n∑i=1
∑k
(1− δH
(i)j ,R
(i)j (k)
)γ(i)j (k)
= log(1− λj)(1− Γj) + log(λj)Γj ,
where we have defined:
Γj =1
n
n∑i=1
∑k
(1− δH
(i)j ,R
(i)j (k)
)γ(i)j (k).
The above is maximized at λj = Γj . Therefore, the update rule for λj in the M step is: λj ← Γj .
2. Estimating the recombination scale
The parameter ρ appears in the first term of (1) through:
EZ[logQj(Z
(i)j | Z
(i)j−1) | H(i), θ(t−1)
]=∑z
logQj(zj | zj−1)P[Z(i) = z | H(i), θ(t−1)
]=∑k,l
logQj(k | l)∑
z−(j,j−1)
P[Z(i) = (k, l, z−(j,j−1) | H(i), θ(t−1)
]=∑k,l
logQj(k | l)P[Z
(i)j = k, Z
(i)j−1 = l | H(i), θ(t−1)
].
By defining
ξ(i)j (k, l) = P
[Z
(i)j = k, Z
(i)j−1 = l | H(i), θ(t−1)
],
we can writen∑i=1
p∑j=1
EZ[logQj(Z
(i)j | Z
(i)j−1) | H(i), θ(t−1)
]=
n∑i=1
p∑j=1
∑k,l
logQj(k | l)ξ(i)j (k, l).
4
We will discuss later how to compute ξ. Now, assume that ξ is available and we want to optimize the
above quantity with respect to the parameter ρ, which is hidden inside the transition matrices Q.
For simplicity, we also assume that α(i)k = 1/K, ∀i, k (we omit the computations for the general
case, which are more complicated). Note that
logQj(k | l) = log
(1− bjK
+ bjδk,l
)= log
(1− bjK
)+
[log
(1− bjK
+ bj
)− log
(1− bjK
)]δk,l
= const. + log (1− bj) + [log (1 + (K − 1)bj)− log (1− bj)] δk,l,
where bj = bj(ρ) = e−ρdj . Therefore,
1
n
n∑i=1
p∑j=1
∑k,l
logQj(k | l)ξ(i)j (k, l)
=1
n
n∑i=1
p∑j=1
log(1− bj)∑k,l
ξ(i)j (k, l) +
1
n
n∑i=1
p∑j=1
[log (1 + (K − 1)bj)− log (1− bj)]∑k
ξ(i)j (k, k)
=
p∑j=1
log(1− bj) +
p∑j=1
[log (1 + (K − 1)bj)− log (1− bj)]1
n
n∑i=1
∑k
ξ(i)j (k, k)
=
p∑j=1
log(1− bj) +
p∑j=1
[log (1 + (K − 1)bj)− log (1− bj)] Ξj ,
where we have defined:
Ξj =1
n
n∑i=1
∑k
ξ(i)j (k, k).
It is easy to verify that the above function is strictly quasiconcave in ρ, so it can be optimized
numerically by solving for its first derivative to be equal to zero. We will include the details of our
procedure later for completeness. Meanwhile, note that the computation of ξ(i)j (k, l) can be easily
obtained from the M step:
ξ(i)j (k, l) = P
[Z
(i)j−1 = l, Z
(i)j = k | H(i)
]∝ P
[Z
(i)j−1 = l, Z
(i)j = k,H(i)
]∝ F (i)
j−1(l)Q(i)j (k | l)f (i)
j (k | H(i)j )B
(i)j (k)
= ξ(i)j (k, l),
5
where F and B denote the forward and backward weights in the forward-backward algorithm. The
normalization constant for ξ(i)j (k, l) is:
∑k
∑l
ξ(i)j (k, l) =
∑k
∑l
F(i)j−1(l)Q
(i)j (k | l)f (i)
j (k | H(i)j )B
(i)j (k)
=∑l
F(i)j−1(l)
∑k
[aj + bjδk,l]f(i)j (k | H(i)
j )B(i)j (k)
= aj
(∑l
F(i)j−1(l)
)∑k
f(i)j (k | H(i)
j )B(i)j (k) + bj
∑k
F(i)j−1(k)f
(i)j (k | H(i)
j )B(i)j (k)
= aj∑k
f(i)j (k | H(i)
j )B(i)j (k) + bj
∑k
F(i)j−1(k)f
(i)j (k | H(i)
j )B(i)j (k)
=∑k
f(i)j (k | H(i)
j )B(i)j (k)
[aj + bjF
(i)j−1(k)
].
The diagonal elements of ξ are proportional to:
ξ(i)j (k, k) = F
(i)j−1(k)Q
(i)j (k | k)f
(i)j (k | H(i)
j )B(i)j (k)
= F(i)j−1(k) [aj + bj ] f
(i)j (k | H(i)
j )B(i)j (k).
Recall that we care about
Ξj =1
n
n∑i=1
∑k
ξ(i)j (k, k) =
1
n
n∑i=1
1∑k
∑l ξ
(i)j (k, l)
∑k
ξ(i)j (k, k),
which we can compute starting from
∑k
ξ(i)j (k, k) =
∑k
F(i)j−1(k) (aj + bj) f
(i)j (k | H(i)
j )B(i)j (k)
= (aj + bj)∑k
F(i)j−1(k)f
(i)j (k | H(i)
j )B(i)j (k).
Going back to the details of optimizing
1
n
n∑i=1
p∑j=1
∑k,l
logQj(k | l)ξ(i)j (k, l),
note that differentiating with respect to ρ yields:
0 = −p∑j=1
b′j1− bj
+
p∑j=1
b′j
[K − 1
1 + (K − 1)bj+
1
1− bj
]Ξj .
By definition of bj(ρ) = e−ρdj , it follows that b′j = −djbj . Therefore,
p∑j=1
djbj1− bj
=
p∑j=1
djbj
[K − 1
1 + (K − 1)bj+
1
1− bj
]Ξj = Ψ(ρ),
6
where we have defined:
Ψ(ρ) =
p∑j=1
djbj(ρ)
[K − 1
1 + (K − 1)bj(ρ)+
1
1− bj(ρ)
]Ξj .
Define also d = 1p
∑pj=1 dj . Then, we want to solve
Ψ(ρ) =
p∑j=1
djbj1− bj
= e−ρdp∑j=1
dj1− bj
e−ρ(dj−d) = e−ρdΦ(ρ),
where
Φ(ρ) =
p∑j=1
dj1− bj
e−ρ(dj−d).
Therefore, we can solve iteratively for ρ∗:
ρ∗ = −1
dlog
(Ψ(ρ∗)Φ(ρ∗)
).
Upon convergence (which we observe empirically but do not prove), the solution ρ∗ will give us
the M update for ρ in the EM algorithm: ρ← ρ∗.
3. Estimating the motif prevalences
For any fixed j ∈ {1, . . . , p}, the parameter α(i)k appears in the first term of (1) through:
logQj(k | l) = log(
(1− bj)α(i)k + bjδk,l
)= log
((1− bj)α(i)
k
)+[log(
(1− bj)α(i)k + bj
)− log
((1− bj)α(i)
k
)]δk,l
= (1− δk,l) logα(i)k + log
[(1− bj)α(i)
k + bj
]δk,l.
Therefore,
p∑j=1
∑k,l
logQj(k | l)ξ(i)j (k, l)
=∑k
log(α(i)k )
p∑j=1
∑l
ξ(i)j (k, l)−
∑k
log(α(i)k )
p∑j=1
ξ(i)j (k, k)
+
p∑j=1
∑k
log[(1− bj)α(i)
k + bj
]ξ
(i)j (k, k).
Differentiating this with respect to α(i)k gives:
0 =1
α(i)k
p∑j=1
∑l
ξ(i)j (k, l)− 1
α(i)k
p∑j=1
ξ(i)j (k, k) +
p∑j=1
1− bj(1− bj)α(i)
k + bjξ
(i)j (k, k)
=η(k)− ηα
(i)k
+
p∑j=1
1− bj(1− bj)α(i)
k + bjξ
(i)j (k, k),
7
where
η(k) =
p∑j=1
∑l
ξ(i)j (k, l), η =
p∑j=1
ξ(i)j (k, k).
In order to impose the constraint∑
k α(i)k = 1, we add a Lagrange multiplier W :
0 = −W +η(k)− ηα
(i)k
+
p∑j=1
1− bj(1− bj)α(i)
k + bjξ
(i)j (k, k)
= −Wα(i)k + (η(k)− η) + α
(i)k
p∑j=1
1− bj(1− bj)α(i)
k + bjξ
(i)j (k, k).
Therefore,
α(i)k =
1
W
η(k)− η + α(i)k
p∑j=1
1− bj(1− bj)α(i)
k + bjξ
(i)j (k, k)
.
We can solve this iteratively, setting W =∑
k α(i)k after each update of α(i). Upon convergence
(which we observe empirically but do not prove), the solution α(i)∗ will then give the M update in
the EM algorithm: α(i) ← α(i)∗.
B. Knockoffs preserving familial relatedness
1. Choosing the haplotype references
Supplementary Algorithm 1 modifies Algorithm 1 to ensure that: (i) IBD-sharing haplotypes
are not used as references for one another; (ii) all haplotypes in the same IBD-sharing family have
the same references.
8
Supplementary Algorithm 1 Choosing reference haplotypes preserving familial constraints
Input: H ∈ {0, 1}2n×p, K, and N1, N2 as in Algorithm 1;
Input: a collection of IBD-sharing families F1, . . . , FL, a distance measure ξ between haplotypes.
Divide the haplotypes into M sets Cc using ξ as in Algorithm 1, preserving the family structure.
for c = 1, . . . ,M do
Compute a distance matrix D ∈ R|Cc|×|Cc| for all haplotypes in Cc.
for i in Cc do
if ∃ l such that i ∈ Fl then
Define R(i) as the set of K nearest neighbors of Hi in Cc \ Fl.
else
Define R(i) as the set of K nearest neighbors of Hi in Cc.
for l in 1, . . . , L do
Initialize R(l) = ∩i∈FlR(i).
for i ∈ Fl do
Update R(i) = R(i) \ R(l).
if |R(l)| < K then
Update R(l) = R(l) ∪R(i).
else
break.
for i ∈ Fl do
Set R(i) = R(l).
Output: a set R(i) of K references for each haplotype H(i).
2. Posterior sampling via belief propagation
Conditional on H(1:m), the distribution of Z(1:m) is a Markov random field with m×p variables,
characterized by Equations (7)–(9) in the main paper. In order to sample Z(1:m) | H(1:m), we
implement belief propagation1 (BP) as follows. For any i ∈ {1, . . . ,m} and j ∈ {1, . . . , p − 1},denote by µ(i,j)→(i,j+1) ∈ RK the forward message from Z
(i)j to Z
(i)j+1. It is easy to verify that this
must satisfy the following recursive definition:
µ(i,j)→(i,j+1)(k) =
K∑l=1
[Q
(i)j+1(k | l)
]ηi,j+1 · f (i)j (H
(i)j | l) · µ(i,j−1)→(i,j)(l) ·
∏i′∈∂(i,j)
µ(i′,j)→(i,j)(l),
9
where it is understood that µ(i,0)→(i,1)(k) = 1, for all i and k. Above, µ(i′,j)→(i,j) indicates the
vertical message from Z(i′)j to Z
(i)j , for any i ∈ ∂(i, j). By the BP rules, this satisfies:
µ(i′,j)→(i,j)(k) =K∑l=1
δk,l · µ(i′,j−1)→(i′,j)(l) · µ(i′,j+1)→(i′,j)(l) ·∏
i′′∈∂(i′,j)\{i}µ(i′′,j)→(i′,j)(l),
where δk,l = 1 if k = l and 0 otherwise. Above, µ(i′,j+1)→(i′,j)(l) indicates the backward message
from Z(i′)j+1 to Z
(i′)j , which is defined recursively as:
µ(i,j)→(i,j−1)(k) =K∑l=1
[Q
(i)j (l | k)
]ηi,j · f (i)j (H
(i)j | l) · µ(i,j+1)→(i,j)(l) ·
∏i′∈∂(i,j)
µ(i′,j)→(i,j)(l).
Again, it is understood that µ(i,p+1)→(i,p)(k) = 1, for all i and k. Combined, the above message
update rules define a BP algorithm that is in principle already applicable to approximately sample
Z(1:m) | H(1:m). However, these recursion relations can be simplified by observing that Z(i)j = Z(i′)
whenever i′ ∈ ∂(i, j). Therefore, the corresponding nodes in the Markov random field can be
collapsed and treated as a single unit in the generalized belief propagation framework1 (GBP).
Thus, after defining
φ(i)j (l) = f
(i)j (H
(i)j | l) ·
∏i′∈∂(i,j)
f(i′)j (H
(i′)j | l),
ψ(i)j (k | l) =
[Q
(i)j (k | l)
]ηi,j · ∏i′∈∂(i,j)
[Q
(i′)j (k | l)
]ηi,j,
(2)
it is not difficult to verify that the GBP messages are given by:
µ(i,j)→(i,j+1)(k) =K∑l=1
ψ(i)j+1(k | l) · φ(i)
j (l) · µ(i,j−1)→(i,j)(l)
·∏
i′∈∂(i,j)\∂(i,j−1)
µ(i′,j−1)→(i,j)(l) ·∏
i′∈∂(i,j)\∂(i,j+1)
µ(i′,j+1)→(i,j)(l),
µ(i,j)→(i,j−1)(k) =
K∑l=1
ψ(i)j (l | k) · φ(i)
j (l) · µ(i,j+1)→(i,j)(l)
·∏
i′∈∂(i,j)\∂(i,j+1)
µ(i′,j+1)→(i,j)(l) ·∏
i′∈∂(i,j)\∂(i,j−1)
µ(i′,j−1)→(i,j)(l),
µ(i,j)→(i′,j+1)(k) = µ(i,j)→(i,j+1)(k), ∀i′ ∈ ∂(i, j + 1),
µ(i,j)→(i′,j−1)(k) = µ(i,j)→(i,j+1)(k), ∀i′ ∈ ∂(i, j − 1).
(3)
The GBP rules written above can be simplified even further analytically. Assuming for simplicity
that α(i)k = 1/K (as it is the case in our applications), we can write the transition matrices Q as:
Q(i)j (k | l) = Qj(k | l) = aj + bj1[k = l], aj =
1
K
(1− e−ρdj
), bj = e−ρdj .
10
Therefore,
ψ(i)j (k | l) =
[Q
(i)j (k | l)
]ηi,j · ∏i′∈∂(i,j)
[Q
(i′)j (k | l)
]ηi,j= [Qj(k | l)]ηi,j(1+|∂(i,j)|) = Qj(k | l).
This simplification allows us to equivalently rewrite the forward update rule in (3) as:
µ(i,j)→(i,j+1)(k) =
K∑l=1
[aj+1 + bj+11[k = l]] · φ(i)j (l) · µ(i,j−1)→(i,j)(l)
·∏
i′∈∂(i,j)\∂(i,j−1)
µ(i′,j−1)→(i,j)(l) ·∏
i′∈∂(i,j)\∂(i,j+1)
µ(i′,j+1)→(i,j)(l)
= aj+1
K∑l=1
φ(i)j (l) · µ(i,j−1)→(i,j)(l)
·∏
i′∈∂(i,j)\∂(i,j−1)
µ(i′,j−1)→(i,j)(l) ·∏
i′∈∂(i,j)\∂(i,j+1)
µ(i′,j+1)→(i,j)(l)+
+ bj+1φ(i)j (k) · µ(i,j−1)→(i,j)(k)
·∏
i′∈∂(i,j)\∂(i,j−1)
µ(i′,j−1)→(i,j)(k) ·∏
i′∈∂(i,j)\∂(i,j+1)
µ(i′,j+1)→(i,j)(k),
(4)
which has the advantage of having complexity O(K) to evaluate, rather than O(K2). Similarly,
we can rewrite the backward update rule as:
µ(i,j)→(i,j−1)(k) =K∑l=1
[aj + bj1[k = l]] · φ(i)j (l) · µ(i,j+1)→(i,j)(l)
·∏
i′∈∂(i,j)\∂(i,j+1)
µ(i′,j+1)→(i,j)(l) ·∏
i′∈∂(i,j)\∂(i,j−1)
µ(i′,j−1)→(i,j)(l)
= aj
K∑l=1
φ(i)j (l) · µ(i,j+1)→(i,j)(l)
·∏
i′∈∂(i,j)\∂(i,j+1)
µ(i′,j+1)→(i,j)(l) ·∏
i′∈∂(i,j)\∂(i,j−1)
µ(i′,j−1)→(i,j)(l)
+ bj · φ(i)j (k) · µ(i,j+1)→(i,j)(k)
·∏
i′∈∂(i,j)\∂(i,j+1)
µ(i′,j+1)→(i,j)(k) ·∏
i′∈∂(i,j)\∂(i,j−1)
µ(i′,j−1)→(i,j)(k),
(5)
which can also be evaluated at cost O(K).
The GBP formulation incorporates the IBD-sharing constraints implicitly, removing the vertical
messages and the corresponding small loops in the Markov random field. Even though some loops
may remain in the graphical model (e.g., if the same two haplotypes share two different IBD
segments), these will generally be large compared to the range of background LD, since we only
consider relatively long IBD segments. Therefore, we can expect the GBP approximation to work
11
well in general. Furthermore, in many practical cases, the resulting Markov random field is a tree,
so the GBP solution will be very fast to compute and provide exact posterior probabilities.1
GBP randomly initializes the messages µ(i,j)→(i′,j+1) and µ(i,j)→(i′,j−1), for all i, j and i′ ∈∂(i, j), and then recursively updates them until convergence according to the rules laid out in (3).
A schematic visualization of the updates is shown in Supplementary Figure 37. Even though
convergence to an exact solution is only theoretically guaranteed if the underlying graph structure
is a tree, the method often performs well in practice, especially if the graph is locally tree-like (i.e.,
it may have long loops but no short ones).2
Once the BP messages converge, the posterior distribution of Z(i)j | H(1:m) can be approximated
with the product of its incoming messages:
P[Z
(i)j = k | H(1:m)
]≈ µ(i,j−1)→(i,j)(k) · µ(i,j+1)→(i,j)(k)
·∏
i′∈∂(i,j)\∂(i,j−1)
µ(i′,j−1)→(i,j)(k) ·∏
i′∈∂(i,j)\∂(i,j+1)
µ(i′,j+1)→(i,j)(k).(6)
Crucially, the above relation is exact in the case of trees, which includes the previously well-known
example of a single haplotype sequence,3,4 as well as many non-trivial family structures (e.g., two
haplotypes sharing one IBD segment).
Since we are ultimately interested in sampling all coordinates of Z(1:m) | H(1:m) jointly, our
procedure does not end with (6). In general, after sampling Z(i)j | H(1:m) for some i and j, one
should update the Markov random field by conditioning on the observed value of Z(i)j and update all
messages until convergence before sampling the next variable, which is computationally unfeasible.
Fortunately, this procedure can be greatly simplified in our case because we only have relatively
long IBD segments, and thus there are few loops in the graphical model. We leverage this fact by
first sampling Z(i)j for all (i, j) in the set J ⊆ {1, . . . ,m} × {1, . . . , p} of junction nodes:
J = {(i, j) s.t. ∂(i, j) 6= ∂(i, j − 1) or ∂(i, j) 6= ∂(i, j + 1)}. (7)
Although this requires running |J | instances of GBP, this number will typically be small. Fur-
thermore, warm starts can decrease the number of iterations required for convergence. Once Z(i)j
has been sampled for all (i, j) ∈ J , the remaining random field is a collection of disjoint Markov
chains, as visualized in Supplementary Figure 38. Therefore, posterior sampling can be carried
out very efficiently with a simple forward-backward procedure that does not involve running BP
at each step, as outlined in Supplementary Algorithm 2.
12
Supplementary Algorithm 2 Posterior sampling preserving familial constraints
Input: H ∈ {0, 1}m×p, K, list of IBD segments {∂(i, j)}i∈{1,...,m},j∈{1,...,p};Input: a set R(i) of K references for each haplotype H(i).
Define the list of junction nodes J = {(i, j) s.t. ∂(i, j) 6= ∂(i, j − 1) or ∂(i, j) 6= ∂(i, j + 1)}.Initialize the list of active nodes A = {1, . . . ,m} × {1, . . . , p} and denote its complement as Ac.
Initialize the forward messages µ(i,j)→(i,j+1)(k) = 1K , for all i ∈ {1, . . . ,m} and j ∈ {1, . . . , p− 1}.
Initialize the backward messages µ(i,j)→(i,j−1)(k) = 1K , for all i ∈ {1, . . . ,m} and j ∈ {2, . . . , p}.
for (i∗, j∗) ∈ J ∩ A do
while messages not converged do
for j = 1, . . . , p− 1 do
for i = 1, . . . ,m do
if (i, j) ∈ A then
Update µ(i,j)→(i,j+1)(k), for all k ∈ {1, . . . ,K}, according to (4).
for j = p, . . . , 2 do
for i = 1, . . . ,m do
if (i, j) ∈ A then
Update µ(i,j)→(i,j−1)(k), for all k ∈ {1, . . . ,K}, according to (5).
Approximate the posteriors w(i∗)j∗ (k) of Z
(i∗)j∗ = k | H(1:m), {Z(i)
j }(i,j)∈Ac based on (6).
Sample Z(i∗)j∗ from P[Z
(i∗)j∗ = k] = w
(i∗)j∗ (k).
Update the list of active nodes: A ← A \ {(i∗, j∗)}.Update the Markov random field: φ
(i∗)j∗ (k)← 1[k = Z
(i∗)j∗ ], for each k ∈ {1, . . . , k}.
for i′ ∈ ∂(i∗, j∗) do
Set Z(i′)j∗ ← Z
(i∗)j∗ .
Update the list of active nodes: A ← A \ {(i′, j∗)}.Update the Markov random field: φ
(i′)j∗ (k)← 1[k = Z
(i′)j∗ ], for each k ∈ {1, . . . , k}.
Sample each disjoint segment of {Z(i)j }(i,j)∈J c | H(1:m), {Z(i)
j }(i,j)∈J , with standard forward-backward.4
Output: a latent Markov random field Z ∈ {1, . . . ,K}m×p that preserves the IBD structure.
3. Knockoff generation via conditioning
Having sampled Z(1:m) | H(1:m) with the procedure described above, we proceed to develop
a method for generating knockoff copies Z(1:m). Even though constructing exact knockoffs for a
general Markov random field may be computationally unfeasible, we can simplify the problem by
conditioning on some variables.5 In particular, we condition on all variables at the junctions of some
IBD segment, i.e., those in the set J defined in (7). This operation transforms the graphical model
13
for the remaining variables into a collection of disjoint one-dimensional chains, for which knockoffs
can be generated independently with the existing methods;3,4 see Supplementary Figure 38. This
solution is summarised in Supplementary Algorithm 3.
Supplementary Algorithm 3 Related knockoff haplotypes via conditioning
Input: H ∈ {0, 1}m×p, d ∈ Rp−1, G, and K as in Algorithm 2;
Input: IBD segments {∂(i, j)}i∈{1,...,m},j∈{1,...,p};Input: a set R(i) of K references for each haplotype H(i);
Input: Markov random field states Z ∈ {1, . . . ,K}m×p.
Define the list of junction nodes J = {(i, j) s.t. ∂(i, j) 6= ∂(i, j − 1) or ∂(i, j) 6= ∂(i, j + 1)}.for (i, j) ∈ J do
Define G as the group in partition G to which variant j belongs.
for j′ ∈ G do
Expand the list of junction nodes: J ← J ∪ {(i, j′)}.
for (i, j) ∈ J do
Make a trivial knockoff: Z(i)j ← Z
(i)j .
for each connected component C in {1, . . . ,m} × {1, . . . , p} \ J do
Generate group knockoffs {Z(i)j }(i,j)∈C of {Z(i)
j }(i,j)∈C | {Z(i)j }(i,j)∈J as in previous work.4
Output: knockoff matrix Z ∈ {1, . . . ,K}m×p.
14
SUPPLEMENTARY NOTES
A. Review of HMMs for genotype data
In the context of ancestry inference,6 the K states of the Markov chain in the HMM symbolize
the possible ancestry groups, and transitions represent admixture.7 A typical parametrization is:
P[Z(i)1 = k] = α
(i)k , P[Z
(i)j = k′ | Z(i)
j−1 = k] =
e−rj + (1− e−rj )α(i)
k′ , if k′ = k,
(1− e−rj )α(i)k′ , if k′ 6= k.
(8)
Above, rj > 0 controls the rate of admixture and is related to the physical distance between
consecutive loci, while the positive weights (α(i)1 , . . . , α
(i)K ) define the ancestry of the i-th individual
and are normalized so that their sum equals one. Finally, the Bernoulli emission distributions
fj(· | k) summarize the allele frequencies in each ancestral group.
By simply assigning different values and interpretations to the parameters of this model, one can
also obtain a good description of LD in homogeneous populations. For instance, the fastPHASE8
HMM takes the following form:
P[Z(i)1 = k] = αj,k, P[Z
(i)j = k′ | Z(i)
j−1 = k] =
e−rj + (1− e−rj )αj,k′ , if k′ = k,
(1− e−rj )αj,k′ , if k′ 6= k.
(9)
Here, the discrete latent states are understood to represent clusters of similar haplotypes rather
than precise ancestral groups, and the values of the rj parameters are typically larger, so that
transitions in the Markov chain occur more frequently, consistently with the shorter range of LD.
Note that, since the α parameters above do not depend on i, this model implicitly assumes all
individuals to be equally related.
B. Limitations of the fastPHASE HMM
Previous work on knockoffs for GWAS data3,4 adopted an HMM in the form of (9), with pa-
rameters estimated using fastPHASE. While this model can provide a good description of the
distribution of genotypes in an homogeneous samples, it cannot simultaneously account for popu-
lation structure. This problem is illustrated below by a numerical experiment.
We simulate haplotypes from a toy HMM that mimics the presence of different ancestries and
possible admixture, as well as LD. In particular, we consider a Markov chain of length p = 500, with
8 possible states divided into 2 ancestral classes of equal size. The Markov chain parameters are
15
defined such that transitions occur more frequently within rather than across classes (Supplemen-
tary Figure 1). Conditional on the Markov chain, the emission distributions at different positions
are independent Bernoulli with randomly fixed probabilities of success. We refer to the 8 sequences
of success probabilities corresponding to each state of the Markov chain as the motifs of this HMM.
These motifs have been randomly generated so that those within the same ancestral class are more
similar to each other than to those in the other class (Supplementary Figure 2). With this model,
we generate n = 10, 000 independent haplotype sequences and then perform a PCA to quantify the
population structure present in this synthetic data set (Supplementary Figure 3). By contrasting
the results of this analysis to those obtained with data generated from an HMM in the form of (9),
it becomes clear that our toy model induces significant population structure.
Supplementary Figure 3 suggests that knockoffs based on an estimated HMM in the form of (9)
cannot be valid in this context. We validate this conjecture by applying fastPHASE to the above
synthetic data set, setting K = 8, and utilize the estimated HMM to generate knockoffs with the
usual algorithm.4 Unsurprisingly, the PCA in Supplementary Figure 4 shows that the knockoffs
based on the fastPHASE HMM do not correctly preserve the population structure. Note that our
choice of K = 8 in the fastPHASE HMM corresponds to the number of latent states in the true
model, which is known in this toy example. Increasing K within the fastPHASE model would not
qualitatively change the results, since it would not address the main limitation that transitions
between any two different Markov states are equally likely.
C. Preview of our method
We give a preview of the proposed method in the toy example of Supplementary Note B.
We anticipate it will preserve the population structure much better than the fastPHASE HMM,
as shown in Supplementary Figure 5. In this example, knockoffs are constructed for groups of
50 variants each; if the resolution is even lower, as in Supplementary Figure 6, the fastPHASE
knockoffs completely break population structure, while our new method remains approximately
valid. In fact, lower-resolution knockoffs tend to be more loosely coupled to the data,4 so they are
naturally more sensitive to model misspecification; see also Supplementary Figure 7.
We can further examine the robustness of knockoffs at different resolutions in this toy example
by comparing their exchangeability with the original variables in terms of their second moments. In
particular, we compare values of cor(Xj , Xk) with cor(Xj , Xk), for j, k ∈ {1, . . . , p}, in Supplemen-
tary Figure 8. Here, we see that the knockoffs constructed with the fastPHASE HMM can preserve
16
short-range LD but not long-range dependencies due to population structure, at low resolution.
By contrast, the new knockoffs based on the SHAPEIT HMM seem approximately valid at all res-
olutions. Additional results consistent with this observation are given in Supplementary Figure 9,
where we compare values of cor(Xj , Xk) with cor(Xj , Xk), for j, k belonging to different groups in
{1, . . . , p}. We emphasize that valid knockoffs should produce scatter plots along the 45-degree line
in both Supplementary Figures 8–9. Therefore, we can conveniently summarise the goodness-of-fit
in both plots by computing the average mean square deviations from the reference 45-degree line,
relative to the values of cor(Xj , Xk). This information is displayed in Supplementary Figure 10 at
different resolutions, confirming again that the knockoffs based on the SHAPEIT HMM are more
robust to population structure, especially at lower resolutions.
Having verified that we can generate approximately valid knockoffs preserving population struc-
ture, we need to confirm that these are not trivially identical to the original variables. Therefore,
we describe in Supplementary Figures 11–12 the pairwise absolute correlations between each vari-
able and its corresponding knockoff. Roughly speaking, higher correlations will result in lower
power.9 These results show the new knockoffs should be almost as powerful as those obtained with
the fastPHASE HMM at low resolution, although they may tend to be less powerful at higher
resolutions. Even though this may be a reasonable price to pay for valid inference, we shall see
later that the loss in power on real data is lower than this toy example may suggest.
Finally, we give a first direct preview of the impact of population structure on the validity of
knockoff-based inference by generating a random response variable from a simple homoscedastic
linear model with 50 nonzero coefficients, and then computing knockoff test statistics, separately
with both types of knockoffs. More precisely, we define the following marginal importance measures:
Tj = − log(pj), Tj = − log(pj),
where pj and pj indicate the p-values for the marginal null hypotheses of no association between
Y and Xj or Xj , respectively. Then, we define the knockoff test statistics for each group Gg of
variables as usual: Wg =∑
j∈GgTj −
∑j∈Gg
Tj . Since marginal statistics are known to be more
susceptible to the confounding effect of population structure compared to multivariate methods,10
this choice is designed to emphasize the limitations of the fastPHASE HMM in this example. The
histogram of the test statistics for the null groups is shown in Supplementary Figure 13. We know
the distribution should be symmetric around zero if the knockoffs are valid;9 this appears to be
the case for the new knockoffs, but not for those based on the fastPHASE HMM, which are clearly
skewed to the right and therefore will tend to induce an excess of false positives.
17
D. Enrichment analysis with external summary statistics
We perform an enrichment analysis using external summary statistics from the Japan Biobank
project11 for the continuous traits, and from the FinnGen resource12 for all binary traits except
respiratory disease. These summary statistics were computed on data independent of those in the
UK Biobank, but some care must be exercised in the interpretation of these results because: (a)
the external statistics measure marginal association, not conditional importance; (b) the external
sample sizes are smaller than ours, which limits power. Despite these difficulties, which we address
as explained below, we find the enrichment analysis to be informative.
For each of group of SNPs Gg in the genome partition at the 20 kb resolution, we compute a
chi-square statistics with Fisher’s method: χ2g = −∑j∈Gg
log(pj), where {pj}j∈Gg denotes the set
of external marginal p-values within the region spanned by Gg. Since the UK Biobank and the
FinnGen project are based on different genome builds, our discoveries are matched to the external p-
values after appropriately lifting the physical positions. We then define: {χ2g}Snovel as the collection
of external Fisher statistics corresponding to our novel discoveries in Snovel; {χ2g}Sconfirmed as the
collection of external Fisher statistics corresponding to our previously confirmed discoveries (either
confirmed by BOLT-LMM, or by the other studies based on the GWAS catalog, and the Japan
Biobank or the FinnGen resource at the genome-wide significance level); and {χ2g}background as the
set of Fisher statistics for groups that are neither in Sconfirmed nor in Snovel.
We take the empirical distribution of {χ2g}background as the null distribution, and invert it to
compute an approximate enrichment p-value for each group in Snovel; we refer to these as penrichg .
The null hypothesis, under which the penrichg would be approximately uniformly distributed, is that
the Fisher statistics for the novel discoveries have the same distribution as those in {χ2g}background;
for instance, we expect this would be the case if all SNPs in our selected groups were independent
of the trait of interest. In theory, we could use these p-values in combination with any multiple
testing procedure; however, this turns out to have little power, due to the relatively small sample
size (compared to the UK Biobank) of the external data. However, it is clear that the empirical dis-
tribution of {penrichg }Snovel is not uniform, which suggest many discoveries are non-null. Therefore,
we take an empirical Bayes approach to estimate the proportion of non-null discoveries,13 as imple-
mented by the “quantile” method in the fdrtool R package.14 This computes an estimate of the
proportion of null enrichment p-values, which we bootstrap 10,000 times to assess its uncertainty.
Table III and Supplementary Tables 16–17 are based on the corresponding mean bootstrap values,
while Supplementary Tables 18–19 report the corresponding 90% bootstrap confidence intervals.
18
I. SUPPLEMENTARY FIGURES
Sample: 1 Sample: 2 Sample: 3
Pop
.structure
Hom
ogeneous
0 250 500 0 250 500 0 250 500
2
4
6
8
2
4
6
8
Variable
Markovchain
state
Ancestry
1
2
SUPPLEMENTARY FIGURE 1. Toy Markov chains with and without population structure. Visu-
alization of three independent trajectories of the latent Markov chain in a toy HMM simulating haplotypes
with population structure (top). The colors indicate the two possible ancestral classes, within which transi-
tions among the 8 possible states are more likely. This is contrasted to a toy HMM without any population
structure, where all state transitions are equally likely (bottom).
12345678
SUPPLEMENTARY FIGURE 2. Toy HMM motifs with population structure. Visualization of the
emission distributions for each of the 8 possible Markov states in the toy HMM of Supplementary Figure 1.
Darker shades of grey indicate higher probabilities of Hj = 1, for each of the 500 variable indices j. The
clustering dendrogram on the right-hand side shows that motifs in the same ancestral class are more similar
to each other.
19
-0.05
0.00
0.05
-0.06 -0.03 0.00 0.03 0.06First PC (6.13%)
Secon
dPC
(2.77%)
(a) Pop. structure
-0.05
0.00
0.05
0.10
-0.10 -0.05 0.00 0.05 0.10First PC (0.39%)
Secon
dPC
(0.38%
)
(b) Homogeneous
SUPPLEMENTARY FIGURE 3. PCA for a toy HMM with population structure. PCA of 10,000
haplotypes generated from the toy HMM of Supplementary Figure 1, with (a) and without (b) population
structure. The percentages indicate the proportion of genetic variance explained by the top principal compo-
nents. For simplicity, only 1000 randomly chosen points are shown explicitly. The color of each point reflects
the proportion of red and blue motifs (Supplementary Figure 2) in the Markov chain for the corresponding
sample (Supplementary Figure 1)—purple points indicate admixture of red and blue motifs.
-0.05
0.00
0.05
-0.06 -0.03 0.00 0.03 0.06First PC (6.13%)
Sec
ond
PC
(2.7
7%)
(a) Real genotypes
-0.10
-0.05
0.00
0.05
0.10
-0.08 -0.04 0 0.04 0.08First PC (2.64%)
Sec
on
dP
C(1
.52%
)
(b) Knockoffs (fastPHASE HMM)
SUPPLEMENTARY FIGURE 4. PCA with population structure and fastPHASE knockoffs. Prin-
cipal component analyses of 10,000 haplotypes (a) generated from the toy HMM of Supplementary Figure 1,
and of the corresponding knockoffs based on the fastPHASE model (b). Knockoffs for groups of size 50.
Other details are as in Supplementary Figure 3. Note that the colors in (b) are based on the motifs in the
real genotytpes, as in (a), not those in the knockoffs.
20
-0.05
0.00
0.05
-0.08 0 0.08First PC (6.13%)
Secon
dPC
(2.77%)
(a) Real data
-0.05
0.00
0.05
-0.08 0 0.08First PC (5.40%)
Secon
dPC
(2.30%
)
(b) SHAPEIT
-0.10
-0.05
0.00
0.05
0.10
-0.08 0 0.08First PC (2.64%)
Secon
dPC
(1.52%
)
(c) fastPHASE
SUPPLEMENTARY FIGURE 5. PCA for a toy HMM with population structure and knockoffs.
Principal component analyses of 10,000 haplotypes (a) generated from the toy HMM of Supplementary
Figure 1, of the corresponding knockoffs based on the SHAPEIT model (b), and of the knockoffs based on
the fastPHASE model (c). Knockoffs for groups of size 50. Other details are as in Supplementary Figure 4.
-0.05
0.00
0.05
-0.08 0 0.08First PC (6.13%)
SecondPC
(2.77%
)
(a) Real data
-0.05
0.00
0.05
-0.08 0 0.08First PC (5.96%)
SecondPC
(2.73%
)
(b) SHAPEIT
-0.04
0.00
0.04
-0.08 0 0.08First PC (1.41%)
Secon
dPC
(1.34%
)
(c) fastPHASE
SUPPLEMENTARY FIGURE 6. PCA for a toy HMM with population structure and knockoffs.
Principal component analyses of 10,000 haplotypes (a) generated from the toy HMM of Supplementary
Figure 1, of the corresponding knockoffs based on our new method (b), and of the knockoffs based on the
fastPHASE model (c). Knockoffs for groups of size 200. Other details are as in Supplementary Figure 5.
21
First PC Second PC
1 2 5 20 100 500 1 2 5 20 100 5000
2
4
6
8
Group size
Variance
explained
(%)
Model
SHAPEIT
fastPHASE
SUPPLEMENTARY FIGURE 7. PCA of knockoffs robustness to population structure. Proportion
of variance explained by the first (left) and second (right) principal component of knockoffs in the toy
example of Supplementary Figure 5, as a function of the knockoff resolution. The black horizontal lines
indicate the proportion of variance explained by principal components on the original data, which should
be approximately matched (up to finite-sample variations) by valid knockoffs.
Group size: 1 Group size: 20 Group size: 100
SHAPEIT
fastPHASE
-0.75 0.00 0.75 -0.75 0.00 0.75 -0.75 0.00 0.75
-0.75
0.00
0.75
-0.75
0.00
0.75
corr(Xj , Xk)
corr(X
j,X
k) Range
Short (d = 1)
Medium (20 < d < 25)
Long (100 < d < 110)
SUPPLEMENTARY FIGURE 8. Second-moment knockoff exchangeability diagnostics. Ex-
changeability diagnostics for different knockoffs, in the example of Supplementary Figure 7. We compare
cor(Xj , Xk) with cor(Xj , Xk), for j, k ∈ {1, . . . , p}, as a function of the distance between j and k. These
diagnostics should approximately lie on the 45-degree line if the knockoffs are valid.
22
Group size: 1 Group size: 20 Group size: 100
SHAPEIT
fastPHASE
-0.75 0.00 0.75 -0.75 0.00 0.75 -0.75 0.00 0.75
-0.75
0.00
0.75
-0.75
0.00
0.75
corr(Xj , Xk)
corr(X
j,X
k) Range
Short (d = 1)
Medium (20 < d < 25)
Long (100 < d < 110)
SUPPLEMENTARY FIGURE 9. Additional exchangeability diagnostics for in toy example. We
compare cor(Xj , Xk) to cor(Xj , Xk), for j, k in different groups. Other details are as in Supplementary
Figure 8.
23
Short range (d = 1) Medium range (20 < d < 25) Long range (100 < d < 110)Fullsw
apPartia
lsw
ap
1 2 5 20 100 500 1 2 5 20 100 500 1 2 5 20 100 500
0
25
50
75
100
0
25
50
75
100
Group size
Meansquared
loss
ofcorrelation(%
) Model
SHAPEIT
fastPHASE
SUPPLEMENTARY FIGURE 10. Second-moment knockoff goodness-of-fit in toy example.
Mean loss of correlation between variables upon: full swap with knockoffs (top); partial swap with
knockoffs (bottom). This is defined as∑
j,k[cor(Xj , Xk) − cor(Xj , Xk)]2/∑
j,k[cor(Xj , Xk)]2 (top), or∑j,k[cor(Xj , Xk) − cor(Xj , Xk)]2/
∑j,k[cor(Xj , Xk)]2 (bottom); the sums are taken over pairs of indices
whose distances are within the specified range. Valid knockoffs should have values close to zero. Equiva-
lently, this summarises the relative deviation of the scatter plots in Supplementary Figures 8–9 from the
45-degree line.
24
Group size: 1 Group size: 20 Group size: 100SHAPEIT
fastPHASE
0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00
0
100
200
300
0
100
200
300
|corr(Xj , Xj)|
Cou
nt
SUPPLEMENTARY FIGURE 11. Pairwise similarities between variables and knockoffs. His-
tograms of |corr(Xj , Xj)| in the example of Supplementary Figure 3, for different knockoff constructions
and different levels of resolution.
0.00
0.25
0.50
0.75
1.00
1 2 5 20 100 500Group size
|corr(X
j,X
j)|
Model
SHAPEIT
fastPHASE
SUPPLEMENTARY FIGURE 12. Similarity of variables and knockoffs at different resolutions.
Average |corr(Xj , Xj)| in the toy example of Supplementary Figure 3, as a function of the resolution.
Equivalently, this summarises the mean quantities in Supplementary Figure 11 for knockoffs at different
levels of resolution.
25
fastPHASE (+243, -157, p-value: 2.0e-05) new (+202, -198, p-value: 8.8e-01)
-5 0 5 -5 0 5
1
10
50
100
200
Null test statistics
Cou
nt
SUPPLEMENTARY FIGURE 13. Distribution of null knockoff test statistics in toy example.
Histogram of knockoff test statistics (based on marginal p-values, as defined in Supplementary Note C) for
null groups (of size 50), for a simulated phenotype in the toy example of Supplementary Figure 3. The
counts at the top indicate the numbers of statistics with positive or negative signs; the p-value is obtained
from an exact binomial test of the null hypothesis that positive and negative statistics are equally likely.
26
-0.4
-0.2
0.0
0.2
-0.4 -0.2 0.0 0.2First principal component
Secondprincipal
compon
ent
Ethnicity
British
Irish
Indian
Chinese
Caribbean
African
(a)
fastPHASE SHAPEIT
single-S
NP
425kb
-0.4 -0.2 0.0 0.2 -0.4 -0.2 0.0 0.2
-0.4
-0.2
0.0
0.2
-0.4
-0.2
0.0
0.2
First principal component
Secon
dprincipal
compon
ent
(b)
SUPPLEMENTARY FIGURE 14. PCA of individuals with diverse ancestries, and of knockoffs.
The first two genetic principal components of 10,000 individuals in the UK Biobank with one of 6 possible
self-reported ancestries (a) are compared to the corresponding quantities computed on knockoffs at different
resolutions (b). The knockoffs based on the SHAPEIT HMM preserve population structure quite accurately,
even at low resolution. By contrast, the fastPHASE HMM tends to produce knockoffs that shrink together
individuals with diverse ancestries, thus breaking population structure.
27
0.000
0.025
0.050
0.075
0.100
single-SNP 3 kb 20 kb 41 kb 81 kb 208 kb 425 kbResolution
Var
ian
ceex
pla
ined
Knockoffs
SHAPEIT
fastPHASE
SUPPLEMENTARY FIGURE 15. Variance explained by principal components of knockoffs. Pro-
portion of genetic variance explained by the first ten principal components of knockoffs at different resolu-
tions, for samples with diverse ancestries. The dashed horizontal line indicates the corresponding quantity
computed on the original data. Other details are as in Supplementary Figure 14.
(0, 0.1] Mb (0.1, 0.5] Mb (0.5, 1.0] Mb (1.0, 5.0] Mb (5.0, 10.0] Mb
SHAPEIT
fastPHASE
0.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.0
0.0
0.5
1.0
0.0
0.5
1.0
|cor(Xj , Xk)|
|cor(X
j,X
k)|
SUPPLEMENTARY FIGURE 16. High-resolution knockoff exchangeability, diverse ancestries.
Exchangeability diagnostics of knockoffs based on different HMMs, for 10,000 individuals with diverse an-
cestries, as in Supplementary Figure 14. We compare |cor(Xj , Xk)| with |cor(Xj , Xk)|, for j, k ∈ {1, . . . , p},as a function of the distance between variants j and k on chromosome 22. Only 1000 randomly chosen
points are shown, for clarity. Variants with minor allele frequency smaller than 0.01 are not shown here, due
to the limited sample size. These diagnostics should approximately lie on the 45-degree line if the knockoffs
are valid.4 Knockoff resolution: single-SNP groups.
28
(0, 0.1] Mb (0.1, 0.5] Mb (0.5, 1.0] Mb (1.0, 5.0] Mb (5.0, 10.0] MbSHAPEIT
fastPHASE
0.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.0
0.0
0.5
1.0
0.0
0.5
1.0
|cor(Xj , Xk)|
|cor(X
j,X
k)|
SUPPLEMENTARY FIGURE 17. Low-resolution knockoff exchangeability, diverse ancestries.
Exchangeability diagnostics for different knockoffs of 10,000 individuals with diverse ancestries. Knockoff
resolution: 425 kb. Other details are as in Supplementary Figure 16.
(0, 0.1] Mb (0.1, 0.5] Mb (0.5, 1.0] Mb (1.0, 5.0] Mb (5.0, 10.0] Mb
SHAPEIT
fastPHASE
0.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.0
0.0
0.5
1.0
0.0
0.5
1.0
|cor(Xj , Xk)|
|cor(X
j, X
k)|
SUPPLEMENTARY FIGURE 18. Additional diagnostics of high-resolution exchangeability. Ex-
changeability diagnostics for different knockoffs of 10,000 individuals with diverse ancestries. We compare
|cor(Xj , Xk)| with |cor(Xj , Xk)|, for j, k in different groups, as a function of the distance between variants
j and k. Other details are as in Supplementary Figure 16.
29
(0, 0.1] Mb (0.1, 0.5] Mb (0.5, 1.0] Mb (1.0, 5.0] Mb (5.0, 10.0] MbSHAPEIT
fastPHASE
0.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.0
0.0
0.5
1.0
0.0
0.5
1.0
|cor(Xj , Xk)|
|cor(X
j, X
k)|
SUPPLEMENTARY FIGURE 19. Additional diagnostics of low-resolution exchangeability. Ex-
changeability diagnostics for different knockoffs of 10,000 individuals with diverse ancestries. Knockoff
resolution: 425 kb. Other details are as in Supplementary Figure 18.
30
(0, 0.1] Mb (0.1, 0.5] Mb (0.5, 1.0] Mb (1.0, 5.0] Mb (5.0, 10.0] MbFullsw
apPartia
lsw
ap
single-SNP
3kb
20kb
41kb
81kb
208kb
425kb
single-SNP
3kb
20kb
41kb
81kb
208kb
425kb
single-SNP
3kb
20kb
41kb
81kb
208kb
425kb
single-SNP
3kb
20kb
41kb
81kb
208kb
425kb
single-SNP
3kb
20kb
41kb
81kb
208kb
425kb
0.00
0.01
0.02
0.03
0.00
0.01
0.02
0.03
Resolution
Meansquaredloss
ofcorrelation
Model SHAPEIT fastPHASE
SUPPLEMENTARY FIGURE 20. Second-moment knockoff goodness-of-fit, diverse ancestries.
Mean squared loss of correlation between variables upon: full swap with knockoffs (top); partial swap with
knockoffs (bottom). This is defined as [cor(Xj , Xk) − cor(Xj , Xk)]2 (top), or [cor(Xj , Xk) − cor(Xj , Xk)]2
(bottom), each averaged over pairs of variables j, k whose physical distances are within the specified range,
similarly to the diagnostics in Supplementary Figure 10. In words, these diagnostics summarise the aver-
age distances from the 45-degree line in the scatter plots of Supplementary Figures 16–19, including also
intermediate levels of resolutions. Valid knockoffs should have values close to zero.
31
0.00
0.25
0.50
0.75
1.00
single-SNP 3 kb 20 kb 41 kb 81 kb 208 kb 425 kbResolution
Pairw
isesimilarity
Model
SHAPEIT
fastPHASE
SUPPLEMENTARY FIGURE 21. Similarity between genotypes and knockoffs, diverse ancestries.
Average absolute pairwise correlation between genotypes and knockoffs on chromosome 22 for 10,000 sam-
ples with diverse ancestries in the UK Biobank, as a function of the resolution. Other details are as in
Supplementary Figure 14.
single-SNP 3 kb 20 kb 41 kb 81 kb 208 kb 425 kb
SHAPEIT
fastPHASE
0 1 0 1 0 1 0 1 0 1 0 1 0 1
0
500
1000
1500
2000
0
500
1000
1500
2000
Pairwise similarity
count
SUPPLEMENTARY FIGURE 22. Similarity between genotypes and knockoffs, diverse ancestries.
Histograms of absolute pairwise correlations between genotypes and knockoffs at different resolutions. Other
details are as in Supplementary Figure 21.
32
Null - fastPHASE
Null - SHAPEIT
-0.02 0.00 0.02
0
500
1000
1500
2000
0
500
1000
1500
2000
Test statistics
Causal - fastPHASE
Causal - SHAPEIT
-0.025 0.000 0.025 0.050 0.075 0.100
50
100150
50
100150
Test statistics
SUPPLEMENTARY FIGURE 23. Knockoff statistics in simulation with diverse ancestries. His-
togram of lasso-based knockoff test statistics for null (left) and causal (right) groups of variants, for a
simulated phenotype and real genotypes with population structure, as in Supplementary Figure 14. The
knockoffs are constructed by different algorithms at resolution equal to 425 kb.
single-SNP 208 kb 425 kb
FD
RP
ower
0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
Heritability
Method
KnockoffZoom v2
KnockoffZoom v1
SUPPLEMENTARY FIGURE 24. Simulations with diverse ancestries, strong signals. Knockoff-
Zoom performance at different levels of resolution in simulations with real genotypes and artificial pheno-
types. The number of causal variants is equal to 100. Other details are as in Figure 1.
33
100 200 500F
DR
Pow
erR
esolution
(kb)
0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6
0.00.10.20.30.40.5
0.00
0.25
0.50
0.75
0
200
400
600
800
Heritability
Method KnockoffZoom v2 KnockoffZoom v1
SUPPLEMENTARY FIGURE 25. Performance in simulations with diverse ancestries. Knockoff-
Zoom performance in simulations with real genotypes and artificial phenotypes. The discoveries at different
resolutions are combined, counting only the most specific findings in each locus.4 The results corresponding
to phenotypes with different numbers of causal variants are shown in separate columns. Other details are
as in Figure 1 and Supplementary Figure 24.
34
Null - fastPHASE
Null - SHAPEIT
-100 -50 0 50 100
1000
20003000
1000
20003000
Test statistics
Causal - fastPHASE
Causal - SHAPEIT
-50 0 50 100
4080120160
4080120160
Test statistics
SUPPLEMENTARY FIGURE 26. Knockoff statistics in simulation with diverse ancestries. His-
togram of LMM-based knockoff test statistics for null (left) and causal (right) groups of variants. Other
details are as in Supplementary Figure 23.
1
10
100
1000
0.1 0.2 0.3 0.4 0.5Pairwise kinship
Cou
nt
SUPPLEMENTARY FIGURE 27. Kinship in 10,000 related samples from the UK Biobank. His-
togram of pairwise kinship values between the 10,000 related British samples in the UK Biobank used for
our numerical experiments. Kinship is defined here as the fraction of shared DNA (e.g., 100% for identical
twins, 50% for parents or full siblings, 25% for half siblings, 12.5% for first cousins).
35
0
50
100
150
200
250
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22Chromosome
Width
(Mb)
SUPPLEMENTARY FIGURE 28. IBD segments in 10,000 related samples from the UK Biobank.
Violin plots displaying the distribution of IBD segment lengths in the 10,000 related British samples in the
UK Biobank used for our numerical experiments.
10.0
12.5
15.0
17.5
20.0
single-SNP 3 kb 20 kb 41 kb 81 kb 208 kb 425 kbResolution
Width
(Mb)
Relatedness
Preserved
Ignored
SUPPLEMENTARY FIGURE 29. Width of IBD segments in knockoffs for related samples. Average
width of IBD segments on chromosome 22 in knockoffs for 10,000 related samples in UK Biobank, as in
Supplementary Figure 27. The dashed horizontal line indicates the corresponding quantity computed on the
data. The knockoffs are generated with our new method, either preserving or ignoring familial relatedness.
36
(0, 0.1] Mb (0.1, 0.5] Mb (0.5, 1.0] Mb (1.0, 5.0] MbFullsw
apPartia
lsw
ap
single-SNP
3kb
20kb
41kb
81kb
208kb
425kb
single-SNP
3kb
20kb
41kb
81kb
208kb
425kb
single-SNP
3kb
20kb
41kb
81kb
208kb
425kb
single-SNP
3kb
20kb
41kb
81kb
208kb
425kb
0.00
0.01
0.02
0.03
0.00
0.01
0.02
0.03
Resolution
Meansquaredloss
ofcorrelation
Relatedness Preserved Ignored
SUPPLEMENTARY FIGURE 30. Second-moment goodness-of-fit of knockoffs for related samples.
Goodness-of-fit of knockoffs at different resolutions, using data from 10,000 related British samples in the
UK Biobank, defined as in Supplementary Figure 20. Valid knockoffs should have values close to zero. The
knockoffs are generated with our new method, either preserving or ignoring familial relatedness.
0.00
0.25
0.50
0.75
1.00
single-SNP 3 kb 20 kb 41 kb 81 kb 208 kb 425 kbResolution
Pairw
isesimilarity
Relatedness
Preserved
Ignored
SUPPLEMENTARY FIGURE 31. Similarity between genotypes and knockoffs for related samples.
Average absolute pairwise correlation between genotypes and knockoffs on chromosome 22, as a function of
the resolution. Other details are as in Supplementary Figure 30.
37
Resolution: single-SNP Resolution: 41 kb Resolution: 425 kbFDR
Pow
er
0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6
0.0
0.2
0.4
0.6
0.8
0.0
0.2
0.4
0.6
0.8
Heritability
Relatedness
Preserved
Ignored
SUPPLEMENTARY FIGURE 32. Performance in simulations with related samples. KnockoffZoom
v2 performance at different levels of resolution on related samples, using test statistics based on sparse
logistic regression, as in Figure 2. Strong environmental effects, γ = 1. Other details are as in Figure 2.
γ: 0 γ: 0.655 γ: 1
FDR
Pow
er
0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
Heritability
Relatedness
Preserved
Ignored
SUPPLEMENTARY FIGURE 33. Performance in simulations with marginal statistics. Knockoff-
Zoom v2 performance with marginal statistics. Other details are as in Figure 2. Note that marginal statistics
have almost no power, although an excess of false discoveries occurs if the relatedness is not preserved.
38
γ = 0, Ignored
γ = 0, Preserved
-0.04 -0.02 0.00 0.02 0.04
1
10
100
1000
1
10
100
1000
Test statistics
γ = 1, Ignored
γ = 1, Preserved
-0.04 -0.02 0.00 0.02 0.04
1
10
100
1000
1
10
100
1000
Test statistics
SUPPLEMENTARY FIGURE 34. Lasso knockoff statistics in simulation with related samples.
Histogram of lasso-based test statistics for null groups of variants, for simulated phenotypes and real geno-
types from 10,000 related samples in the UK Biobank, as in Supplementary Figure 29. (a): no environmental
effects within families; (b): strong environmental effects within families. Resolution equal to 425 kb. Little
difference can be observed when γ = 0, while γ = 1 induces a small but clear (notice the log scale on the
vertical axis) rightward bias in the test statistics if the relatedness is ignored.
γ = 0, Ignored
γ = 0, Preserved
-100 -50 0 50 100
1
10
100
1000
1
10
100
1000
Test statistics
γ = 1, Ignored
γ = 1, Preserved
-100 -50 0 50 100
1
10
100
1000
1
10
100
1000
Test statistics
SUPPLEMENTARY FIGURE 35. Marginal knockoff statistics in simulation with related samples.
Histogram of marginal knockoff test statistics for null groups of variants in a simulation with related samples.
Other details are as in Supplementary Figure 34.
39
Za1 Za
2 Za3 Za
4 Za5 Za
6 Za7 Za
8
Zb1 Zb
2 Zb3 Zb
4 Zb5 Zb
6 Zb7 Zb
8
(a)
Za1
Z2 Z3 Z4
Za5 Za
6
Z7 Z8
Zb1 Zb
5 Zb6
(b)
SUPPLEMENTARY FIGURE 36. Model for the latent states in the HMM for haplotype families.
Graphical representation of the distribution of latent states in the HMM for two haplotype sequences of
length 8 sharing 2 IBD segments (shaded). (a) Representation as a Markov chain with K2 possible states in
each position, with the constraint that nodes connected by a vertical edge must be identical to each other.
(b): Equivalent representation of this model as a Markov random field with 11 variables, each taking one of
K possible values.
Za1
Z2 Z3 Z4
Za5 Za
6
Z7 Z8
Zb1 Zb
5 Zb6
t− 2 t− 1t
t− 1
t− 2
SUPPLEMENTARY FIGURE 37. Visualization of belief propagation for haplotype families. Belief
propagation update of a message in the example of Supplementary Figure 36. The new message evaluated
here at time t is that from the third node of the first shared IBD segment to the successive node in the first
haplotype sequence (bold arrow). This is computed as a function of the messages labeled as t−1, which had
previously been computed as a function of those labeled as t − 2. Red: forward messages; blue: backward
messages.
40
Za1
Z2 Z3 Z4
Za5 Za
6
Z7 Z8
Zb1 Zb
5 Zb6
Z2
Z2
Z4
Z4
Z7
Z7
(a)
Za1
Z2 Z3 Z4
Za5 Za
6
Z7 Z8
Zb1 Zb
5 Zb6
Z2
Z2
Z4
Z4
Z7
Z7loop
(b)
SUPPLEMENTARY FIGURE 38. Graphical model and conditioning for haplotype families. Graph-
ical model for related haplotypes in the example of Supplementary Figure 36. (a): the nodes at the extremi-
ties of the IBD segments are shaded. (b): conditional on the extremities of the IBD segments, the remaining
latent nodes are distributed as independent Markov chains.
41
II. SUPPLEMENTARY TABLES
Median width
(kb)
Mean width
(kb)
Number of
groups
Median size
(SNPs)
Mean size
(SNPs)
single-SNP single-SNP 591513 1 1
3 11 151532 3 4
20 41 56562 8 10
41 74 33929 14 17
81 134 19500 26 30
208 303 8863 58 67
425 575 4738 113 125
SUPPLEMENTARY TABLE 1. Genome partitions at different resolutions. Summary of 7 partitions
of the genome into disjoint groups of contiguous SNPs. The first column (median width in kb) will be used
to reference particular resolutions throughout this chapter.
Ethnicity Count
African 1710
British 1710
Caribbean 1710
Chinese 1450
Indian 1710
Irish 1710
SUPPLEMENTARY TABLE 2. Ethnicities of 10,000 unrelated individuals in the UK Biobank.
Summary of the self-reported ethnicities for the individuals in the UK Biobank used in our simulations.
Family size Number of families Average kinship
1 1 N.A.
2 4702 0.273
3 193 0.270
4 4 0.265
SUPPLEMENTARY TABLE 3. Family structure and average kinship of 10,000 related samples.
Summary of the self-reported family structure for 10,000 British individuals used in our simulations. These
families are chosen as those with the largest average kinship, which is defined as in Supplementary Figure 27.
One extra individual is included without relatives to bring the total number to a round value.
42
Name Description Number of cases UK Biobank Fields UK Biobank Codes
bmi body mass index continuous 21001-0.0
cvd cardiovascular disease 148715 20002-0.0–20002-0.32 1065, 1066, 1067, 1068,
1081, 1082, 1083, 1425,
1473, 1493
diabetes diabetes 19897 20002-0.0–20002-0.32 1220
height standing height continuous 50-0.0
hypothyroidism hypothyroidism 22493 20002-0.0–20002-0.32 1226
platelet platelet count continuous 30080-0.0
respiratory respiratory disease 64945 20002-0.0–20002-0.32 1111, 1112, 1113, 1114,
1115, 1117, 1413, 1414,
1415, 1594
sbp systolic blood pressure continuous 4080-0.0, 4080-0.1
SUPPLEMENTARY TABLE 4. Phenotype definitions. Definition of the UK Biobank phenotypes used
in our analysis.4 In the case of case-control phenotypes, the number of cases refers to the subset of individuals
that passed our quality control.
43
Everyone British White (non-British)
Phenotype Resolution all unrel. all unrel. all unrel.
cvd single-SNP 0 0 0 0 0 0
3 kb 22 20 0 0 0 0
20 kb 239 152 169 140 0 0
41 kb 339 235 270 181 0 0
81 kb 566 428 611 462 0 0
208 kb 940 594 815 611 0 0
425 kb 1089 861 1004 711 0 0
diabetes single-SNP 0 0 0 0 0 0
3 kb 0 17 0 12 0 0
20 kb 83 44 63 53 0 0
41 kb 123 86 82 57 0 0
81 kb 193 152 165 129 0 0
208 kb 262 242 217 171 0 0
425 kb 383 346 289 291 0 0
hypothyroidism single-SNP 19 22 11 11 0 0
3 kb 40 42 60 32 0 0
20 kb 105 79 109 86 0 0
41 kb 222 156 164 130 0 0
81 kb 277 173 269 153 0 0
208 kb 295 257 288 256 0 0
425 kb 335 309 312 266 0 0
respiratory single-SNP 0 0 0 11 0 0
3 kb 21 0 37 0 0 0
20 kb 61 33 54 33 0 0
41 kb 109 62 73 66 0 0
81 kb 109 84 63 94 0 0
208 kb 113 79 119 84 0 0
425 kb 194 102 186 139 0 0
SUPPLEMENTARY TABLE 5. Discoveries for UK Biobank phenotypes (binary). Numbers of
discoveries at different resolutions, using different subsets of the UK Biobank samples. Binary phenotypes.
44
Everyone British White (non-British)
Phenotype Resolution all unrel. all unrel. all unrel.
bmi single-SNP 95 64 80 69 0 10
3 kb 570 483 609 377 0 0
20 kb 1503 1294 1610 1412 25 20
41 kb 2384 1966 2353 2141 74 80
81 kb 3006 2768 3002 2681 91 80
208 kb 3339 3111 3370 3117 112 101
425 kb 3073 2922 2938 2735 170 104
height single-SNP 0 0 0 12 0 0
3 kb 10 10 0 0 0 0
20 kb 343 230 207 180 0 0
41 kb 918 566 820 492 0 0
81 kb 1480 1194 1433 1280 0 0
208 kb 2395 1938 2381 1975 0 0
425 kb 2460 2109 2426 2092 0 10
platelet single-SNP 53 52 34 52 0 0
3 kb 246 259 223 202 0 0
20 kb 1002 820 977 777 26 31
41 kb 1261 995 1171 944 52 44
81 kb 1570 1350 1502 1292 69 55
208 kb 1743 1583 1809 1510 53 51
425 kb 1653 1550 1741 1521 76 60
sbp single-SNP 0 0 0 0 0 0
3 kb 83 90 42 32 0 0
20 kb 191 162 166 127 0 0
41 kb 511 353 421 342 0 0
81 kb 830 635 736 585 0 0
208 kb 1183 972 1050 911 0 0
425 kb 1543 1202 1401 1273 0 0
SUPPLEMENTARY TABLE 6. Discoveries for UK Biobank phenotypes (continuous). Numbers of
discoveries for continuous phenotypes. Other details are as in Supplementary Figure 5.
45
KnockoffZoom v2 BOLT-LMM
Phenotype Resolution Discoveries Overlap with LMM Discoveries Overlap with KZ
cvd 3 kb 22 22 (100.0%) 257 25 (9.7%)
20 kb 239 180 (75.3%) 257 189 (73.5%)
41 kb 339 212 (62.5%) 257 213 (82.9%)
81 kb 566 261 (46.1%) 257 241 (93.8%)
208 kb 940 274 (29.1%) 257 249 (96.9%)
425 kb 1089 255 (23.4%) 257 254 (98.8%)
diabetes 3 kb 21 20 (95.2%) 62 21 (33.9%)
20 kb 61 45 (73.8%) 62 47 (75.8%)
41 kb 109 54 (49.5%) 62 52 (83.9%)
81 kb 109 50 (45.9%) 62 54 (87.1%)
208 kb 113 52 (46.0%) 62 55 (88.7%)
425 kb 194 57 (29.4%) 62 59 (95.2%)
hypothyroidism single-SNP 19 19 (100.0%) 143 30 (21.0%)
3 kb 40 40 (100.0%) 143 53 (37.1%)
20 kb 105 89 (84.8%) 143 101 (70.6%)
41 kb 222 128 (57.7%) 143 123 (86.0%)
81 kb 277 133 (48.0%) 143 130 (90.9%)
208 kb 295 129 (43.7%) 143 142 (99.3%)
425 kb 335 122 (36.4%) 143 142 (99.3%)
respiratory 20 kb 83 60 (72.3%) 94 62 (66.0%)
41 kb 123 74 (60.2%) 94 75 (79.8%)
81 kb 193 83 (43.0%) 94 85 (90.4%)
208 kb 262 82 (31.3%) 94 92 (97.9%)
425 kb 383 82 (21.4%) 94 93 (98.9%)
SUPPLEMENTARY TABLE 7. Comparison with BOLT-LMM (binary). KnockoffZoom discoveries
for binary phenotypes using all UK Biobank samples vs. BOLT-LMM genome-wide significant discoveries (5×10−8). BOLT-LMM is applied on 459k European samples for cardiovascular disease and hypothyroidism,15
and on 350k unrelated British samples for diabetes and respiratory disease.4
46
KnockoffZoom v2 BOLT-LMM
Phenotype Resolution Discoveries Overlap with LMM Discoveries Overlap with KZ
bmi 3 kb 10 10 (100.0%) 697 15 (2.2%)
20 kb 343 309 (90.1%) 697 317 (45.5%)
41 kb 918 618 (67.3%) 697 548 (78.6%)
81 kb 1480 792 (53.5%) 697 641 (92.0%)
208 kb 2395 898 (37.5%) 697 689 (98.9%)
425 kb 2460 794 (32.3%) 697 695 (99.7%)
height single-SNP 95 95 (100.0%) 2464 225 (9.1%)
3 kb 570 570 (100.0%) 2464 891 (36.2%)
20 kb 1503 1469 (97.7%) 2464 1761 (71.5%)
41 kb 2384 2167 (90.9%) 2464 2167 (87.9%)
81 kb 3006 2417 (80.4%) 2464 2360 (95.8%)
208 kb 3339 2228 (66.7%) 2464 2430 (98.6%)
425 kb 3073 1804 (58.7%) 2464 2454 (99.6%)
platelet single-SNP 53 53 (100.0%) 1204 131 (10.9%)
3 kb 246 245 (99.6%) 1204 391 (32.5%)
20 kb 1002 900 (89.8%) 1204 963 (80.0%)
41 kb 1261 1041 (82.6%) 1204 1075 (89.3%)
81 kb 1570 1120 (71.3%) 1204 1138 (94.5%)
208 kb 1743 1057 (60.6%) 1204 1183 (98.3%)
425 kb 1653 911 (55.1%) 1204 1195 (99.3%)
sbp 3 kb 83 83 (100.0%) 568 101 (17.8%)
20 kb 191 177 (92.7%) 568 204 (35.9%)
41 kb 511 366 (71.6%) 568 380 (66.9%)
81 kb 830 496 (59.8%) 568 480 (84.5%)
208 kb 1183 561 (47.4%) 568 530 (93.3%)
425 kb 1543 538 (34.9%) 568 548 (96.5%)
SUPPLEMENTARY TABLE 8. Comparison with BOLT-LMM (continuous). KnockoffZoom dis-
coveries for continuous phenotypes using all UK Biobank samples vs. BOLT-LMM genome-wide significant
discoveries (5× 10−8) using 459k European samples.
47
Resolution KnockoffZoom v2 KnockoffZoom v1
v2 v1 Discoveries Overlap with v1 Discoveries Overlap with v2
cvd
41 kb 42 kb 339 49 (14.5%) 51 46 (90.2%)
81 kb 88 kb 566 175 (30.9%) 182 165 (90.7%)
208 kb 226 kb 940 449 (47.8%) 514 446 (86.8%)
425 kb 226 kb 1089 453 (41.6%) 514 466 (90.7%)
diabetes
3 kb 4 kb 21 8 (38.1%) 11 8 (72.7%)
20 kb 18 kb 61 9 (14.8%) 10 9 (90.0%)
41 kb 42 kb 109 19 (17.4%) 21 19 (90.5%)
81 kb 88 kb 109 28 (25.7%) 33 28 (84.8%)
208 kb 226 kb 113 45 (39.8%) 50 46 (92.0%)
425 kb 226 kb 194 48 (24.7%) 50 48 (96.0%)
hypothyroidism
single-SNP single-SNP 19 8 (42.1%) 21 8 (38.1%)
81 kb 88 kb 277 103 (37.2%) 108 100 (92.6%)
208 kb 226 kb 295 183 (62.0%) 212 186 (87.7%)
425 kb 226 kb 335 188 (56.1%) 212 194 (91.5%)
respiratory
20 kb 18 kb 83 12 (14.5%) 13 13 (100.0%)
41 kb 42 kb 123 35 (28.5%) 41 35 (85.4%)
81 kb 88 kb 193 61 (31.6%) 65 59 (90.8%)
208 kb 226 kb 262 132 (50.4%) 176 140 (79.5%)
425 kb 226 kb 383 154 (40.2%) 176 159 (90.3%)
SUPPLEMENTARY TABLE 9. Comparison with KnockoffZoom v1 (binary). KnockoffZoom v2
discoveries using all UK Biobank British samples vs. KnockoffZoom v1 discoveries using 350k unrelated
British samples; the latter are obtained using slightly different genome partitions.4 Binary phenotypes.
48
Resolution KnockoffZoom v2 KnockoffZoom v1
v2 v1 Discoveries Overlap with v1 Discoveries Overlap with v2
bmi
3 kb 4 kb 10 7 (70.0%) 24 7 (29.2%)
20 kb 18 kb 343 29 (8.5%) 33 30 (90.9%)
41 kb 42 kb 918 61 (6.6%) 60 58 (96.7%)
81 kb 88 kb 1480 515 (34.8%) 555 485 (87.4%)
208 kb 226 kb 2395 1653 (69.0%) 1804 1615 (89.5%)
425 kb 226 kb 2460 1592 (64.7%) 1804 1733 (96.1%)
height
single-SNP single-SNP 95 68 (71.6%) 173 68 (39.3%)
3 kb 4 kb 570 252 (44.2%) 336 251 (74.7%)
20 kb 18 kb 1503 360 (24.0%) 388 350 (90.2%)
41 kb 42 kb 2384 832 (34.9%) 823 780 (94.8%)
81 kb 88 kb 3006 1864 (62.0%) 1976 1836 (92.9%)
208 kb 226 kb 3339 2775 (83.1%) 3284 3021 (92.0%)
425 kb 226 kb 3073 2398 (78.0%) 3284 3198 (97.4%)
platelet
single-SNP single-SNP 53 40 (75.5%) 143 40 (28.0%)
3 kb 4 kb 246 136 (55.3%) 161 138 (85.7%)
20 kb 18 kb 1002 264 (26.3%) 276 265 (96.0%)
41 kb 42 kb 1261 398 (31.6%) 408 385 (94.4%)
81 kb 88 kb 1570 856 (54.5%) 890 834 (93.7%)
208 kb 226 kb 1743 1288 (73.9%) 1460 1325 (90.8%)
425 kb 226 kb 1653 1162 (70.3%) 1460 1393 (95.4%)
sbp
41 kb 42 kb 511 86 (16.8%) 95 84 (88.4%)
81 kb 88 kb 830 265 (31.9%) 297 262 (88.2%)
208 kb 226 kb 1183 619 (52.3%) 722 612 (84.8%)
425 kb 226 kb 1543 663 (43.0%) 722 678 (93.9%)
SUPPLEMENTARY TABLE 10. Comparison with KnockoffZoom v1 (continuous). KnockoffZoom
v2 discoveries using all UK Biobank British samples vs. KnockoffZoom v1 discoveries using 350k unrelated
British samples. Continuous phenotypes. Other details are as in Supplementary Figure 9.
49
Confirmed
Phenotype Resolution Discoveries Catalog Japan FinnGen Any
cvd 3 kb 22 21 (95.5%) NA 11 (50.0%) 22 (100.0%)
20 kb 239 173 (72.4%) NA 81 (33.9%) 188 (78.7%)
41 kb 339 241 (71.1%) NA 126 (37.2%) 266 (78.5%)
81 kb 566 353 (62.4%) NA 251 (44.3%) 422 (74.6%)
208 kb 940 524 (55.7%) NA 581 (61.8%) 738 (78.5%)
425 kb 1089 671 (61.6%) NA 837 (76.9%) 967 (88.8%)
diabetes 3 kb 21 20 (95.2%) 13 (61.9%) 8 (38.1%) 20 (95.2%)
20 kb 61 54 (88.5%) 26 (42.6%) 18 (29.5%) 54 (88.5%)
41 kb 109 88 (80.7%) 36 (33.0%) 30 (27.5%) 88 (80.7%)
81 kb 109 88 (80.7%) 39 (35.8%) 36 (33.0%) 89 (81.7%)
208 kb 113 95 (84.1%) 49 (43.4%) 43 (38.1%) 97 (85.8%)
425 kb 194 140 (72.2%) 58 (29.9%) 59 (30.4%) 142 (73.2%)
hypothyroidism single-SNP 19 7 (36.8%) NA 3 (15.8%) 7 (36.8%)
3 kb 40 23 (57.5%) NA 14 (35.0%) 24 (60.0%)
20 kb 105 71 (67.6%) NA 20 (19.0%) 71 (67.6%)
41 kb 222 101 (45.5%) NA 27 (12.2%) 105 (47.3%)
81 kb 277 126 (45.5%) NA 38 (13.7%) 135 (48.7%)
208 kb 295 141 (47.8%) NA 50 (16.9%) 156 (52.9%)
425 kb 335 139 (41.5%) NA 74 (22.1%) 174 (51.9%)
respiratory 20 kb 83 74 (89.2%) NA 35 (42.2%) 76 (91.6%)
41 kb 123 110 (89.4%) NA 58 (47.2%) 114 (92.7%)
81 kb 193 155 (80.3%) NA 115 (59.6%) 174 (90.2%)
208 kb 262 195 (74.4%) NA 202 (77.1%) 241 (92.0%)
425 kb 383 263 (68.7%) NA 330 (86.2%) 357 (93.2%)
SUPPLEMENTARY TABLE 11. Confirmatory comparison for binary traits. Numbers of Knockof-
fZoom v2 discoveries at different resolutions (all UK Biobank samples) containing associations previously
reported in the GWAS Catalog, Japan Biobank resource, FinnGen resource, or any of the above.
50
Confirmed
Phenotype Resolution Discoveries Catalog Japan FinnGen Any
bmi 3 kb 10 10 (100.0%) 4 (40.0%) NA 10 (100.0%)
20 kb 343 307 (89.5%) 32 (9.3%) NA 308 (89.8%)
41 kb 918 655 (71.4%) 53 (5.8%) NA 656 (71.5%)
81 kb 1480 865 (58.4%) 55 (3.7%) NA 865 (58.4%)
208 kb 2395 1076 (44.9%) 64 (2.7%) NA 1076 (44.9%)
425 kb 2460 1090 (44.3%) 68 (2.8%) NA 1091 (44.3%)
height single-SNP 95 63 (66.3%) 57 (60.0%) NA 81 (85.3%)
3 kb 570 357 (62.6%) 258 (45.3%) NA 417 (73.2%)
20 kb 1503 1032 (68.7%) 483 (32.1%) NA 1102 (73.3%)
41 kb 2384 1534 (64.3%) 572 (24.0%) NA 1607 (67.4%)
81 kb 3006 1822 (60.6%) 590 (19.6%) NA 1879 (62.5%)
208 kb 3339 1856 (55.6%) 561 (16.8%) NA 1886 (56.5%)
425 kb 3073 1653 (53.8%) 494 (16.1%) NA 1669 (54.3%)
platelet single-SNP 53 37 (69.8%) 22 (41.5%) NA 41 (77.4%)
3 kb 246 153 (62.2%) 72 (29.3%) NA 168 (68.3%)
20 kb 1002 352 (35.1%) 97 (9.7%) NA 374 (37.3%)
41 kb 1261 391 (31.0%) 97 (7.7%) NA 409 (32.4%)
81 kb 1570 426 (27.1%) 91 (5.8%) NA 436 (27.8%)
208 kb 1743 445 (25.5%) 94 (5.4%) NA 453 (26.0%)
425 kb 1653 425 (25.7%) 86 (5.2%) NA 429 (26.0%)
sbp 3 kb 83 69 (83.1%) 10 (12.0%) NA 69 (83.1%)
20 kb 191 166 (86.9%) 17 (8.9%) NA 166 (86.9%)
41 kb 511 358 (70.1%) 22 (4.3%) NA 359 (70.3%)
81 kb 830 517 (62.3%) 22 (2.7%) NA 517 (62.3%)
208 kb 1183 643 (54.4%) 23 (1.9%) NA 643 (54.4%)
425 kb 1543 709 (45.9%) 23 (1.5%) NA 709 (45.9%)
SUPPLEMENTARY TABLE 12. Confirmatory comparison for continuous traits. Numbers of Knock-
offZoom v2 discoveries at different resolutions (all UK Biobank samples) containing previously reported
associations. Other details are as in Supplementary Table 11.
51
Found by BOLT-LMM Not found by BOLT-LMM
Resolution Total Catalog Japan FinnGen Any Total Catalog Japan FinnGen Any
cvd
3 kb 22 95.5% NA 50.0% 100.0% 0 NA NA NA NA
20 kb 180 83.9% NA 38.3% 88.9% 59 37.3% NA 20.3% 47.5%
41 kb 212 86.8% NA 42.9% 91.5% 127 44.9% NA 27.6% 56.7%
81 kb 261 87.4% NA 53.3% 92.7% 305 41.0% NA 36.7% 59.0%
208 kb 274 90.5% NA 71.9% 97.1% 666 41.4% NA 57.7% 70.9%
425 kb 255 94.9% NA 84.7% 99.2% 834 51.4% NA 74.5% 85.6%
diabetes
3 kb 20 95.0% 65.0% 40.0% 95.0% 1 100.0% 0.0% 0.0% 100.0%
20 kb 45 95.6% 53.3% 40.0% 95.6% 16 68.8% 12.5% 0.0% 68.8%
41 kb 54 96.3% 50.0% 46.3% 96.3% 55 65.5% 16.4% 9.1% 65.5%
81 kb 50 100.0% 54.0% 56.0% 100.0% 59 64.4% 20.3% 13.6% 66.1%
208 kb 52 98.1% 61.5% 61.5% 98.1% 61 72.1% 27.9% 18.0% 75.4%
425 kb 57 98.2% 61.4% 63.2% 98.2% 137 61.3% 16.8% 16.8% 62.8%
hypothyroidism
single-SNP 19 36.8% NA 15.8% 36.8% 0 NA NA NA NA
3 kb 40 57.5% NA 35.0% 60.0% 0 NA NA NA NA
20 kb 89 76.4% NA 22.5% 76.4% 16 18.8% NA 0.0% 18.8%
41 kb 128 64.1% NA 18.0% 65.6% 94 20.2% NA 4.3% 22.3%
81 kb 133 75.2% NA 22.6% 78.9% 144 18.1% NA 5.6% 20.8%
208 kb 129 85.3% NA 25.6% 87.6% 166 18.7% NA 10.2% 25.9%
425 kb 122 88.5% NA 32.8% 93.4% 213 14.6% NA 16.0% 28.2%
respiratory
20 kb 60 98.3% NA 48.3% 98.3% 23 65.2% NA 26.1% 73.9%
41 kb 74 100.0% NA 51.4% 100.0% 49 73.5% NA 40.8% 81.6%
81 kb 83 98.8% NA 65.1% 100.0% 110 66.4% NA 55.5% 82.7%
208 kb 82 98.8% NA 79.3% 100.0% 180 63.3% NA 76.1% 88.3%
425 kb 82 96.3% NA 92.7% 100.0% 301 61.1% NA 84.4% 91.4%
SUPPLEMENTARY TABLE 13. Comparison with other studies and LMM (binary). Numbers of
KnockoffZoom v2 discoveries containing previously reported associations. The results are stratified based
on whether they are also detected by BOLT-LMM (as in Supplementary Table 7). Other details are as in
Supplementary Table 11.
52
Found by BOLT-LMM Not found by BOLT-LMM
Phenotype Resolution Total Catalog Japan FinnGen Any Total Catalog Japan FinnGen Any
bmi 3 kb 10 100.0% 40.0% NA 100.0% 0 NA NA NA NA
20 kb 309 94.2% 10.4% NA 94.5% 34 47.1% 0.0% NA 47.1%
41 kb 618 89.3% 8.3% NA 89.5% 300 34.3% 0.7% NA 34.3%
81 kb 792 85.2% 6.4% NA 85.2% 688 27.6% 0.6% NA 27.6%
208 kb 898 82.5% 6.2% NA 82.5% 1497 22.4% 0.5% NA 22.4%
425 kb 794 85.8% 7.6% NA 85.9% 1666 24.5% 0.5% NA 24.5%
height single-SNP 95 66.3% 60.0% NA 85.3% 0 NA NA NA NA
3 kb 570 62.6% 45.3% NA 73.2% 0 NA NA NA NA
20 kb 1469 69.8% 32.8% NA 74.5% 34 20.6% 2.9% NA 20.6%
41 kb 2167 68.7% 26.2% NA 72.0% 217 20.7% 2.3% NA 21.2%
81 kb 2417 71.0% 24.0% NA 73.1% 589 18.2% 1.5% NA 18.8%
208 kb 2228 76.1% 24.7% NA 77.3% 1111 14.5% 1.0% NA 14.8%
425 kb 1804 81.3% 26.6% NA 82.0% 1269 14.7% 1.1% NA 14.9%
platelet single-SNP 53 69.8% 41.5% NA 77.4% 0 NA NA NA NA
3 kb 245 62.4% 29.4% NA 68.6% 1 0.0% 0.0% NA 0.0%
20 kb 900 38.3% 10.8% NA 40.8% 102 6.9% 0.0% NA 6.9%
41 kb 1041 36.8% 9.3% NA 38.5% 220 3.6% 0.0% NA 3.6%
81 kb 1120 36.6% 8.0% NA 37.5% 450 3.6% 0.2% NA 3.6%
208 kb 1057 39.4% 8.5% NA 40.1% 686 4.2% 0.6% NA 4.2%
425 kb 911 42.9% 9.0% NA 43.4% 742 4.6% 0.5% NA 4.6%
sbp 3 kb 83 83.1% 12.0% NA 83.1% 0 NA NA NA NA
20 kb 177 89.3% 9.6% NA 89.3% 14 57.1% 0.0% NA 57.1%
41 kb 366 86.1% 6.0% NA 86.3% 145 29.7% 0.0% NA 29.7%
81 kb 496 86.5% 4.4% NA 86.5% 334 26.3% 0.0% NA 26.3%
208 kb 561 87.2% 4.1% NA 87.2% 622 24.8% 0.0% NA 24.8%
425 kb 538 90.0% 4.3% NA 90.0% 1005 22.4% 0.0% NA 22.4%
SUPPLEMENTARY TABLE 14. Comparison with other studies and LMM (continuous). Numbers
of KnockoffZoom v2 discoveries containing previously reported associations. The results are stratified based
on whether they are also detected by BOLT-LMM (as in Supplementary Table 8). Other details are as in
Supplementary Table 12.
53
Phenotype Catalog Japan FinnGen
bmi 4261 / 4514 (94.4%) 5016 / 5094 (98.5%) NA
cvd 2223 / 4229 (52.6%) NA 2491 / 6713 (37.1%)
diabetes 709 / 1906 (37.2%) 5904 / 8550 (69.1%) 93 / 577 (16.1%)
height 4324 / 4461 (96.9%) 61730 / 63254 (97.6%) NA
hypothyroidism 176 / 197 (89.3%) NA 89 / 462 (19.3%)
platelet 1121 / 1159 (96.7%) 7797 / 8012 (97.3%) NA
respiratory 1751 / 4112 (42.6%) NA 1129 / 9450 (11.9%)
sbp 1781 / 2048 (87.0%) 1757 / 1817 (96.7%) NA
SUPPLEMENTARY TABLE 15. Estimated power based on other studies. Total numbers of reported
associations in the GWAS Catalog, Japan Biobank resource, or FinnGen resource, along with the corre-
sponding fraction confirmed in our low-resolution analysis (425 kb). Other details are as in Supplementary
Tables 11–12.
54
Total Not found by BOLT-LMM
Confirmed Confirmed
Resolution Discoveries Other Other or Enrich. Discoveries Other Other or Enrich.
cvd
3 kb 22 22 (100.0%) 22 (100.0%) 0 NA NA
20 kb 239 188 (78.7%) 219 (91.6%) 59 28 (47.5%) 50 (84.7%)
41 kb 339 266 (78.5%) 309 (91.2%) 127 72 (56.7%) 107 (84.3%)
81 kb 566 422 (74.6%) 495 (87.5%) 305 180 (59.0%) 240 (78.7%)
208 kb 940 738 (78.5%) 764 (81.3%) 666 472 (70.9%) 493 (74.0%)
425 kb 1089 967 (88.8%) 968 (88.9%) 834 714 (85.6%) 715 (85.7%)
diabetes
3 kb 21 20 (95.2%) 20 (95.2%) 1 1 (100.0%) NA
20 kb 61 54 (88.5%) 57 (93.4%) 16 11 (68.8%) 13 (81.2%)
41 kb 109 88 (80.7%) 97 (89.0%) 55 36 (65.5%) 42 (76.4%)
81 kb 109 89 (81.7%) 99 (90.8%) 59 39 (66.1%) 48 (81.4%)
208 kb 113 97 (85.8%) 106 (93.8%) 61 46 (75.4%) 54 (88.5%)
425 kb 194 142 (73.2%) 157 (80.9%) 137 86 (62.8%) 100 (73.0%)
hypothyroidism
single-SNP 19 7 (36.8%) 7 (36.8%) 0 NA NA
3 kb 40 24 (60.0%) 24 (60.0%) 0 NA NA
20 kb 105 71 (67.6%) 91 (86.7%) 16 3 (18.8%) 8 (50.0%)
41 kb 222 105 (47.3%) 172 (77.5%) 94 21 (22.3%) 61 (64.9%)
81 kb 277 135 (48.7%) 219 (79.1%) 144 30 (20.8%) 93 (64.6%)
208 kb 295 156 (52.9%) 226 (76.6%) 166 43 (25.9%) 101 (60.8%)
425 kb 335 174 (51.9%) 231 (69.0%) 213 60 (28.2%) 116 (54.5%)
SUPPLEMENTARY TABLE 16. Enrichment analysis with independent GWAS (binary). Numbers
of KnockoffZoom v2 discoveries confirmed by other studies or enrichment analysis using independent GWAS
summary statistics. Enrichment results are estimates. The results are stratified based on whether they are
also detected by BOLT-LMM (as in Supplementary Table 13).
55
Total Not found by BOLT-LMM
Confirmed Confirmed
Phenotype Resolution Discover. Other Other or Enrich. Discover. Other Other or Enrich.
bmi 3 kb 10 10 (100.0%) 10 (100.0%) 0 NA NA
20 kb 343 308 (89.8%) 328 (95.6%) 34 16 (47.1%) 29 (85.3%)
41 kb 918 656 (71.5%) 821 (89.4%) 300 103 (34.3%) 234 (78.0%)
81 kb 1480 865 (58.4%) 1182 (79.9%) 688 190 (27.6%) 450 (65.4%)
208 kb 2395 1076 (44.9%) 1620 (67.6%) 1497 335 (22.4%) 806 (53.8%)
425 kb 2460 1091 (44.3%) 1567 (63.7%) 1666 409 (24.5%) 820 (49.2%)
height single-SNP 95 81 (85.3%) 81 (85.3%) 0 NA NA
3 kb 570 417 (73.2%) 417 (73.2%) 0 NA NA
20 kb 1503 1102 (73.3%) 1351 (89.9%) 34 7 (20.6%) 20 (58.8%)
41 kb 2384 1607 (67.4%) 1997 (83.8%) 217 46 (21.2%) 111 (51.2%)
81 kb 3006 1879 (62.5%) 2386 (79.4%) 589 111 (18.8%) 314 (53.3%)
208 kb 3339 1886 (56.5%) 2493 (74.7%) 1111 164 (14.8%) 556 (50.0%)
425 kb 3073 1669 (54.3%) 2231 (72.6%) 1269 189 (14.9%) 622 (49.0%)
platelet single-SNP 53 41 (77.4%) 41 (77.4%) 0 NA NA
3 kb 246 168 (68.3%) 230 (93.5%) 1 0 (0.0%) 0 (0.0%)
20 kb 1002 374 (37.3%) 778 (77.6%) 102 7 (6.9%) 49 (48.0%)
41 kb 1261 409 (32.4%) 934 (74.1%) 220 8 (3.6%) 127 (57.7%)
81 kb 1570 436 (27.8%) 1058 (67.4%) 450 16 (3.6%) 226 (50.2%)
208 kb 1743 453 (26.0%) 1017 (58.3%) 686 29 (4.2%) 256 (37.3%)
425 kb 1653 429 (26.0%) 922 (55.8%) 742 34 (4.6%) 297 (40.0%)
sbp 3 kb 83 69 (83.1%) 69 (83.1%) 0 NA NA
20 kb 191 166 (86.9%) 178 (93.2%) 14 8 (57.1%) 12 (85.7%)
41 kb 511 359 (70.3%) 441 (86.3%) 145 43 (29.7%) 97 (66.9%)
81 kb 830 517 (62.3%) 663 (79.9%) 334 88 (26.3%) 200 (59.9%)
208 kb 1183 643 (54.4%) 885 (74.8%) 622 154 (24.8%) 358 (57.6%)
425 kb 1543 709 (45.9%) 983 (63.7%) 1005 225 (22.4%) 474 (47.2%)
SUPPLEMENTARY TABLE 17. Enrichment analysis with independent GWAS (continuous).
Numbers of KnockoffZoom v2 discoveries confirmed by other studies or enrichment analysis using indepen-
dent GWAS summary statistics. Other details are as in Supplementary Table 16.
56
Total Not found by BOLT-LMM
Resolution Input Confirmed Input Confirmed
cvd
20 kb 51 23–40 (45%–78%) 31 15–27 (48%–87%)
41 kb 73 33–53 (45%–73%) 55 27–43 (49%–78%)
81 kb 144 57–88 (40%–61%) 125 45–74 (36%–59%)
208 kb 202 8–47 (4%–23%) 194 5–40 (3%–21%)
425 kb 122 0–7 (0%–6%) 120 0–5 (0%–4%)
diabetes
3 kb 1 1–1 (100%–100%) 0 NA
20 kb 7 0–5 (0%–71%) 5 0–5 (0%–100%)
41 kb 21 3–14 (14%–67%) 18 1–11 (6%–61%)
81 kb 20 6–15 (30%–75%) 19 5–14 (26%–74%)
208 kb 16 4–14 (25%–88%) 15 3–12 (20%–80%)
425 kb 52 6–27 (12%–52%) 51 4–25 (8%–49%)
hypothyroidism
single-SNP 12 12–12 (100%–100%) 0 NA
3 kb 16 11–16 (69%–100%) 0 NA
20 kb 34 13–26 (38%–76%) 13 2–9 (15%–69%)
41 kb 117 53–81 (45%–69%) 73 30–51 (41%–70%)
81 kb 142 69–98 (49%–69%) 114 49–76 (43%–67%)
208 kb 139 55–86 (40%–62%) 123 43–73 (35%–59%)
425 kb 161 41–74 (25%–46%) 153 40–73 (26%–48%)
SUPPLEMENTARY TABLE 18. Details of enrichment analysis (binary). Bootstrap confidence inter-
vals (90%) for the proportion of novel KnockoffZoom v2 discoveries confirmed by the enrichment analysis
in Supplementary Table 16.
57
Total Not found by BOLT-LMM
Resolution Input Confirmed Input Confirmed
bmi
20 kb 35 13–27 (37%–77%) 18 8–18 (44%–100%)
41 kb 262 146–184 (56%–70%) 197 115–147 (58%–75%)
81 kb 615 284–350 (46%–57%) 498 231–289 (46%–58%)
208 kb 1319 494–595 (37%–45%) 1162 422–518 (36%–45%)
425 kb 1369 424–529 (31%–39%) 1257 361–461 (29%–37%)
height
single-SNP 14 0–9 (0%–64%) 0 NA
3 kb 153 90–123 (59%–80%) 0 NA
20 kb 401 225–272 (56%–68%) 27 7–19 (26%–70%)
41 kb 777 353–426 (45%–55%) 171 47–84 (27%–49%)
81 kb 1127 460–552 (41%–49%) 478 174–234 (36%–49%)
208 kb 1453 555–660 (38%–45%) 947 349–434 (37%–46%)
425 kb 1404 509–615 (36%–44%) 1080 387–478 (36%–44%)
platelet
single-SNP 12 3–12 (25%–100%) 0 NA
3 kb 78 53–70 (68%–90%) 1 0–0 (0%–0%)
20 kb 628 373–433 (59%–69%) 95 29–55 (31%–58%)
41 kb 852 488–561 (57%–66%) 212 100–138 (47%–65%)
81 kb 1134 578–665 (51%–59%) 434 181–238 (42%–55%)
208 kb 1290 514–614 (40%–48%) 657 190–264 (29%–40%)
425 kb 1224 442–542 (36%–44%) 708 224–301 (32%–43%)
sbp
3 kb 14 3–12 (21%–86%) 0 NA
20 kb 25 5–18 (20%–72%) 6 2–6 (33%–100%)
41 kb 152 67–97 (44%–64%) 102 40–67 (39%–66%)
81 kb 313 122–169 (39%–54%) 246 90–133 (37%–54%)
208 kb 540 209–273 (39%–51%) 468 173–233 (37%–50%)
425 kb 834 232–316 (28%–38%) 780 209–289 (27%–37%)
SUPPLEMENTARY TABLE 19. Details of enrichment analysis (continuous). Bootstrap confidence
intervals (90%) for the proportion of novel KnockoffZoom v2 discoveries confirmed by the enrichment analysis
in Supplementary Table 17.
58
Phenotype Discoveries Contains geneKnown lead SNP
consequence
Known lead SNP
association
cvd 31 26 (84%) 28 (90%) 21 (68%)
diabetes 5 5 (100%) 3 (60%) 5 (100%)
hypothyroidism 13 12 (92%) 8 (62%) 9 (69%)
respiratory 6 5 (83%) 4 (67%) 3 (50%)
SUPPLEMENTARY TABLE 20. Functional follow-up on novel unconfirmed discoveries. Numbers
of novel discoveries (not found by BOLT-LMM and not confirmed by the other studies in Supplementary
Table 13) that either contain a gene or whose lead SNP has a known functional annotation or a known
association with phenotypes closely related to that of interest.
Phenotype Associations
cvd NA (10), blood pressure (9), BMI (8), obesity (3), cardiovascular disease (1),
CCL2 (1), cholesterol (1), triglycerides (1), heart rate (1)
diabetes diabetes (3), Factor VII (1), glyburide metabolism (1)
hypothyroidism NA (4), autoimmune thyroid disease (2), psoriasis (2), diabetic nephropathy (1),
Graves disease (1), hypothyroidism (1), rheumatoid arthritis (1), thyroid function (1)
respiratory NA (3), hypersomnia (1), interaction with air pollution (1), serum IgE (1)
SUPPLEMENTARY TABLE 21. Known associations of novel discoveries to related phenotypes.
Associations of our novel discoveries (20 kb resolution) in Supplementary Table 20 to related traits. The
same discovery may have more than one relevant association in this table.
59
Consequence cvd diabetes hypothyroidism respiratory
2KB Upstream 1
3 Prime UTR 2
500B Downstream 1
Intron 19 2 6 3
Missense 3 1
Non coding transcript exon 1
Regulatory region 2
Stop gained 1
Tf binding site 1
Unknown 3 2 5 2
Total 31 5 13 6
SUPPLEMENTARY TABLE 22. Known consequences of lead variants. Numbers of lead variants with
known consequences for our novel discoveries (20 kb resolution) in Supplementary Table 20.
60
1 Yedidia, J., Freeman, W. & Weiss, Y. Understanding belief propagation and its generalizations. In
Exploring Artificial Intelligence in the New Millenium, vol. 8, 239–269 (Morgan Kaufmann Publishers
Inc., San Francisco, CA, USA, 2003).
2 Wainwright, M. J. & Jordan, M. I. Graphical models, exponential families, and variational inference
(Now Publishers Inc, 2008).
3 Sesia, M., Sabatti, C. & Candes, E. Gene hunting with hidden Markov model knockoffs. Biometrika
106, 1–18 (2019).
4 Sesia, M., Katsevich, E., Bates, S., Candes, E. & Sabatti, C. Multi-resolution localization of causal
variants across the genome. Nat. Comm. 11, 1093 (2020).
5 Bates, S., Candes, E., Janson, L. & Wang, W. Metropolized knockoff sampling. J. Am. Stat. Assoc.
1–25 (2020).
6 Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype
data. Genetics 155, 945–959 (2000).
7 Falush, D., Stephens, M. & Pritchard, J. K. Inference of population structure using multilocus genotype
data: linked loci and correlated allele frequencies. Genetics 164, 1567–1587 (2003).
8 Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data:
applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–644
(2006).
9 Candes, E., Fan, Y., Janson, L. & Lv, J. Panning for gold: Model-X knockoffs for high-dimensional
controlled variable selection. J. R. Stat. Soc. B. 80, 551–577 (2018).
10 Klasen, J. R. et al. A multi-marker association method for genome-wide association studies without the
need for population structure correction. Nat. Commun. 7 (2016).
11 Japan, B. Biobank Japan Project (2020). URL http://jenger.riken.jp/en/.
12 FinnGen. FinnGen documentation of r3 release (2020). URL https://finngen.gitbook.io/
documentation/.
13 Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci.
U.S.A 100, 9440–9445 (2003).
14 Klaus, B., Strimmer, K. & Strimmer, M. K. Package ‘fdrtool‘ .
15 Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-
scale datasets. Nat. Genet. 50, 906–908 (2018).