Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
de Santiago et al.
BaalChIP: Bayesian analysis of allele-specific
transcription factor binding in cancer genomesInes de Santiago†, Wei Liu†, Martin O’Reilly, Ke Yuan, Chandra Sekhar Reddy Chilamakuri, Bruce A.J.
Ponder, Kerstin B. Meyerˆ and Florian Markowetz*ˆ
*Correspondence: flo-
University of Cambridge, Cancer
Research UK Cambridge
Institute, Robinson Way,
CB2 0RE Cambridge, UK
Full list of author information is
available at the end of the article
†Equal contributorsˆJoint
corresponding authors
Abstract
Allele-specific measurements of transcription factor binding from ChIP-seq data are
key to dissecting the allelic effects of non-coding variants and their contribution to
phenotypic diversity. However, most methods to detect allelic imbalance assume
diploid genomes. This assumption severely limits their applicability to cancer samples
with frequent DNA copy number changes. Here we present a Bayesian statistical
approach called BaalChIP to correct for the effect of background allele frequency on
the observed ChIP-seq read counts. BaalChIP allows the joint analysis of multiple
ChIP-seq samples across a single variant and outperforms competing approaches in
simulations. Using 548 ENCODE ChIP-seq and 6 targeted FAIRE-seq samples we
show that BaalChIP effectively corrects allele-specific analysis for copy number
variation and increases the power to detect putative cis-acting regulatory variants in
cancer genomes.
Keywords: Allele-specific binding; ChIP-sequencing; FAIRE-sequencing; Bayesian
statistics; Cancer; Copy-number change; Allele frequency
Background1
Allele-specific measurements of transcription factor binding from ChIP-seq data2
have provided important insights into the allelic effects of non-coding variants and3
their contribution to phenotypic diversity [1, 2, 3, 4, 5]. Since the majority of disease4
risk-associated SNPs occur in non-coding DNA, many of them might disrupt cis-5
regulatory elements. Thus, a direct method to identify functional SNPs is to use the6
information obtained from allelic-specific binding (ASB) of transcription factors.7
de Santiago et al. Page 2 of 57
ChIP-seq data are commonly used to study allele-specific binding of proteins at8
heterozygous non-coding SNPs and to infer putative effects of such variants on gene9
regulation. Allele-specific mapping corrects for environmental sources of variation10
that alter gene regulation, since both alleles are assayed in the same cellular context.11
Existing approaches to infer allelic imbalance Previous studies have used ChIP-seq12
and RNA-seq to identify ASB and allele-specific expression (ASE). These studies13
have described methods to address technical and methodological biases such as the14
sequence context of a SNP [6], alignment biases to the reference allele [7, 8], and15
the issue of increasing detection power by combining multiple SNPs in the same16
gene [9] or across multiple ChIP-seq samples [10]. However, all of these approaches17
are designed to examine the allelic imbalances in diploid samples and do not address18
copy number differences between the two alleles, an ubiquitous feature of cancer19
genomes (Figure S1).20
Few papers in the literature have tried to address confounding effects arising21
from copy number changes. Some studies have analyzed tumor and normal sam-22
ples without making any adjustments to the applied methodology, which risks23
identifying false positives where the detected ASE is mainly due to copy-number24
alterations [11]. Others have removed all sites overlapping copy number vari-25
ants [7, 12, 13], or used the genomic DNA allelic ratios to correct for the observed26
allelic imbalances [14, 15] or to remove SNPs with allelic imbalances detected in27
the control input DNA [2] which is only feasible when the coverage at any assayed28
heterozygous site is high (> 20x if a binomial test is to be applied to reach adequate29
statistical power [9, 12]). Recently, Liu et al. [16] developed cisASE, a likelihood-30
based method for detecting allele-specific expression. cisASE has been shown to31
successfully correct for copy number alterations that hinder the identification of32
ASE in RNA-seq data. Table S1 presents a summary of the different strategies33
of allelic-specific mapping analyses to deal with regions of altered copy-number.34
de Santiago et al. Page 3 of 57
While the analysis of allele-specific expression and binding are conceptually similar,35
in practice methods tailored to either ASE or ASB are not easily exchangeable,36
because the data types and thus the biological and statistical assumptions are dif-37
ferent.38
Bayesian analysis of allele specific binding To address the wide range of biases39
that can affect the detection of ASB from ChIP-seq data we developed BaalChIP40
(Bayesian Analysis of Allelic imbalance from ChIP-seq data) (Figure 1). BaalChIP41
makes it possible to quantify the significance of the allelic imbalance from ChIP-42
seq data in cell lines with genomic aberrations, which is a major advance over43
previous approaches. BaalChIP combines several important features: First, we use44
several strategies to rigorously account for allelic mapping bias, including filtering of45
SNPs in problematic alignment regions as well as simulations to identify reads with46
increased risk of mapping bias. Second, we implement a beta-binomial model for47
allelic read counts. This model accounts for the fact that the observed variance in the48
data is larger than expected from a standard binomial model, a phenomenon known49
as overdispersion [9, 17, 18]. Third, we take advantage of the fact that multiple50
transcription factor ChIP-seq data may be available for the same SNP to improve51
ASB detection. Finally, we use a Bayesian framework to account for the influence of52
the background allele composition and the reference mapping bias on the observed53
ChIP-seq allelic read count.54
We applied BaalChIP to 548 ENCODE samples [19] obtained from a panel of 1455
cell lines, representing different tissues and karyotypes, including 8 cancer cell lines56
(HeLa-S3, A549, MCF-7, T-47D, K562, HepG2, SK-N-SH, HL-60) and 6 non-cancer57
cell lines (H1-hESC, GM12878, GM12891, GM12892, MCF10A-Er-Src, IMR90), as58
well as 6 FAIRE-seq samples obtained from breast cancer cell lines (MDA-MB-59
134 and T-47D). We demonstrate that copy-number changes can easily give rise60
to spurious signals of allelic imbalance. We are able to integrate the quantitative61
de Santiago et al. Page 4 of 57
information obtained from ChIP-seq data along with information of the background62
allele composition to accurately detect allele-specific binding and correct for arti-63
facts caused by allele-specific losses or gains at the structural genomic level. Because64
of this integrated modeling, we were able to detect a large number of ASB events65
from ChIP-seq data generating a resource for future functional studies.66
BaalChIP is implemented as an R [20] package and freely available from67
Bioconductor repository https://bioconductor.org/packages/release/bioc/68
html/BaalChIP.html [21].69
Results70
Overview of BaalChIP workflow71
In this study we aim to devise a method that allows to correct for copy number72
changes and other biases in the analysis of ASB from ChIP-seq or similar data. The73
BaalChIP workflow requires three different sets of input data (Figure 1): the SNP74
variants file, ChIP-seq datasets (e.g. sets of BAM and BED files of ChIP-seq data75
obtained for different proteins that may or may not be grouped into replicates),76
and a corresponding set of genomic DNA files for each individual sample. The SNP77
variants file and the BED files are used to select the sites for the analysis, the ChIP-78
seq BAM files are used to compute the allelic read counts and the gDNA files will79
allow to determine reference-allele frequency (RAF) values to correct for effects of80
the background allele composition. Alternatively, RAF values may be directly pre-81
computed from BAF (B allele frequency) scores (Figure 1a). It has been shown that82
removing duplicate reads can reduce technical sources of biases at ASB sites [22]83
and we suggest that BAM files are pre-processed to contain only uniquely aligned84
reads or flagged duplicated reads.85
BaalChIP Quality Control86
The BaalChIP workflow starts by first computing the allelic read counts for each87
assayed heterozygous SNP overlapping genomic intervals of interest (e.g. ChIP-seq88
de Santiago et al. Page 5 of 57
peaks). By default, only uniquely mapped reads (MAQ > 15), and sites with base89
quality > 10 are used (Figure 1b). In the next step several quality control (QC)90
filters are applied to consider technical biases that may contribute to the false91
identification of regulatory SNPs (Figure 1b).92
In a first round, BaalChIP excludes SNPs mapping to regions of known prob-93
lematic read alignment: blacklisted regions [23], non-unique regions with genomic94
mappability score of less than 1 (based on the UCSC mappability track, 50bp seg-95
ments) [22], and collapsed repeat regions [24]. The excluded regions contain genomic96
intervals that are frequently enriched for repetitive elements and frequently cause97
ambiguous read mapping and sequencing artifacts [25].98
The second QC filter performs read-alignment simulations to consider an addi-99
tional type of read mapping bias named the intrinsic bias [6, 17]. This bias occurs100
due to intrinsic characteristics of the genome that translate into different probabil-101
ities of read mapping. Even when reads differ only in one location, reads carrying102
one of the alleles may have a higher chance of matching multiple locations (i.e. have103
many repeats in the genome) and may therefore be mapped to an incorrect locus.104
This, in turn, results in the underestimation of read counts and may cause both105
false-positive and false-negative inferences of ASB.106
The third QC filter selects those SNPs that pass all filters in all replicated samples,107
provided that replicated samples exist. This step will mainly remove SNPs in regions108
where the ChIP-seq signal is not consistently detected across all replicates (for109
instance when coverage is zero in one of the replicates).110
The final QC filter consists of removing possible homozygous SNPs by removing111
any site where only one allele is observed [4, 26].112
BaalChIP detection of allele-specific binding (ASB) events113
The filtered SNPs and their allelic read counts are merged into a table with the114
total number of read counts in the reference (Ref) and alternative (Alt) alleles. No115
de Santiago et al. Page 6 of 57
data is entered (missing data, NA) if a SNP did not pass the previously applied QC116
step for that sample (Figure 1c).117
BaalChIP considers two additional biases that may lead to inaccurate estimates118
of ASB: the reference mapping (RM) and the reference allele frequency (RAF) bi-119
ases. The RM bias occurs because the reference genome only represents a single120
"reference" allele at each SNP position. Reads that carry the "non-reference al-121
lele" have an extra mismatch to the reference sequence. Previous work has shown122
that this creates a marked bias in favor of the alignment of reads that contain the123
reference genome and could therefore affect the accuracy of allele-specific binding124
estimates [6].125
The RAF bias occurs due to alterations in the background abundance of each126
allele (e.g. in regions of copy-number alterations) and the correction for this bias is127
one of the key features of BaalChIP. RAF values at each heterozygous variant are128
used in the model likelihood to correct of the observed ChIP-seq read counts relative129
to the amount of the reference allele. These are given as relative measures from 0 to130
1, where values between 0.5 and 1 denote an underlying bias to the reference allele,131
and a value between 0 and 0.5 to the alternative allele. RAF scores can be obtained132
from SNP array B-allele frequencies (BAF) by assigning the correct reference and133
alternative alleles to allele A and B generic labels (RAF is equal to BAF if the134
reference allele corresponds to the B allele; RAF is equal to 1-BAF if the reference135
allele corresponds to the A allele). Alternatively, if whole-genome DNA (gDNA)136
sequencing samples are given as an input, BaalChIP calculates the RAF values137
directly from the gDNA allelic read counts.138
Finally, BaalChIP uses a beta-binomial distribution to model read count data139
therefore accounting for extra variability (over-dispersion) in allelic counts, a phe-140
nomenon that is often observed in sequencing data [9, 17]. The output of BaalChIP141
de Santiago et al. Page 7 of 57
is a posterior distribution of the estimated allelic balance ratio in read counts ob-142
served after considering all sources of underlying biases (Figure 1c).143
Evaluation of BaalChIP performance144
In a controlled simulation study, we compare BaalChIP with the two major compet-145
ing methods to infer allelic imbalance: binomial test and iASeq [10]. The binomial146
test is the most frequently used method for the analysis of allele-specificity from147
ChIP-seq data [1, 2, 3, 4, 7]. In a recent study [15], biases caused by differences148
in copy number were accounted for by weighting the binomial null with the allelic149
ratios observed from the gDNA samples (the number of reads mapping to each150
allele in the genomic DNA). Therefore, to take these strategy into consideration,151
we set the null hypothesis on the probability of success to the estimated RAF bias,152
instead of 0.5. iASeq is another available method that has been shown to improve153
the detection of ASB [10]. The iASeq method uses a Bayesian framework to combine154
information from different experiments or replicates, and models the read counts155
with a beta-binomal distribution, instead of a simple binomial distribution, to ac-156
count for extra-binomial variation caused by technical variability. Although iASeq157
was not originally designed to overcome copy-number biases, we included it in our158
simulation study because of its offered improvement of the traditional binomial test.159
All three methods are tested on synthetic data sets with varying sequencing depth,160
number of TF binding to a SNP and copy number states. In addition, we also exam-161
ine the robustness of the methods against a wide range of true allelic balance ratios.162
The allelic imbalance calling performance is assessed by ROC (Receiver operating163
characteristic) curves. In the absence of copy number changes, i.e. RAF is 0.5, the164
performances of the three methods are similar (Figure 2a, d and g).165
In the presence of copy number changes, BaalChIP shows significant improvements166
in the identification of true ASB events. In Figure 2 we highlight results on data sets167
with RAF set to either 0.3 or 0.1, which represent modest and severe background168
de Santiago et al. Page 8 of 57
allelic imbalance due to copy number changes. Significant improvements can be169
found across varying sequencing depth and across the number of TFs considered.170
Particularly, in the data sets with modest copy number changes, the performance171
of BaalChIP is similar to its performance in the absence of copy number changes,172
whereas the performance of both binomial test and iASseq suffers badly. When RAF173
is 0.1, BaalChIP is still able to achieve relatively good performance. In addition,174
the performance of BaalChIP benefits greatly from aggregating binding data for175
multiple TFs at the SNP (Figure S3).176
Overall, the simulation results show that BaalChIP is robust and performs well177
across a wide range of variations in the data set. In particular, in regions with copy178
number changes BaalChIP shows significant improvements over the state-of-the-art179
baselines. Thus BaalChIP offers a robust analysis of ASB in samples obtained with180
frequent copy number changes.181
Case study 1: ENCODE data182
We applied BaalChIP to 548 samples from the ENCODE project [19]. In total 271183
ChIP-seq experiments were analyzed, assaying a total of 8 cancer and 6 non-cancer184
cell lines representing different tissues. The data contained either 2 or 3 replicates185
per experiment and 4 to 42 DNA-binding proteins per cell line (Table S3 and Figure186
S4 show a summary of the cell lines, tissues, and number of ChIP-seq experiments187
used in this case study).188
The initial number of genotyped heterozygous SNPs per cell line ranged from189
139K to 356K SNPs. We selected those SNPs that occurred within ChIP-seq peaks190
in the corresponding cell lines, which amounts to between 1.6% and 5.8% of all191
SNPs. To ensure a reliable set of heterozygous SNPs we applied the BaalChIP QC192
step with the default parameters and options. We removed an average of 30.4%193
of all SNPs that hit peaks (Figure S5b); 0.8% of the excluded sites were found in194
regions of problematic read alignments; 7.4% had biases in simulations (Figure S5a);195
de Santiago et al. Page 9 of 57
16.4% were not consistent between replicates and 5.8% had only 1 observable allele196
(Figure S5c). After BaalChIP QC the number of heterozygous SNPs considered for197
downstream analysis ranged from 1,636-12,097 SNPs.198
Allele-specific copy-number alterations change ChIP-seq read densities199
First we examined of how allele-specific copy-number alterations affect the allelic200
ratios observed from ChIP-seq data (Figure 3). The relative presence of each allele201
is measured by the B allele frequency (BAF) obtained from SNP arrays [27].202
BAF values lie in the interval [0, 1]. A BAF value of 0.5 indicates an equal presence203
of the two alleles and is the expected value for heterozygous sites (AB) in a diploid204
genome, while values of 0 and 1 indicate homozygous genotypes (AA and BB). In205
a normal, non-contaminated diploid sample, a BAF plot will have three bands, one206
centered at 0.5 for AB genotypes, and two bands at 0 and 1 for the AA and BB207
homozygous genotypes, respectively.208
Figure 3a presents the BAF plots for chromosome 1 of three cancer cell lines209
(MCF-7, K-562 and SK-N-SH) and one non-cancer cell line obtained from normal210
blood (GM12878). As expected, the BAF plot of GM12878 is characterized by a211
typical diploid pattern of BAF values distributed around 0, 0.5 and 1 corresponding212
to the diploid genotypes AA, BB and AB. The few small deviations from these values213
can be attributed to germ line-based copy-number aberrations [28, 29]. The BAF214
plot for the SK-N-SH cell line, a glioblastoma cancer cell line known to have a near215
diploid karyotyope [30], is also relatively stable. In contrast, the MCF-7 and K-562216
cancer cell lines show more complex BAF plots due to a variety of copy-number217
aberrations.218
To evaluate the effects of allele-specific copy-number alterations on the ChIP-seq219
read densities, we compared for all four cell lines the BAF scores with ChIP-seq220
allelic ratios obtained at heterozygous SNPs (hetSNPs) (Figure 3b). We report221
ChIP-seq allelic ratios (AR) as the proportion of total reads carrying the reference222
de Santiago et al. Page 10 of 57
allele. We observed a clear correlation between the BAF scores and the imbalance223
in the ChIP-seq allelic ratios in MCF-7 and K-562 cancer cell lines (Spearman224
ρ = 0.704 and 0.525, respectively), while this correlation is not observed in the225
near-diploid SK-N-SH (Spearman ρ = 0.001) and the normal GM12878 cell lines226
(Spearman ρ = 0.036).227
The same effect can be observed in all other cell lines (Figure S6a) and is partic-228
ularly striking in cancer cells with the highest proportion of extreme BAF scores229
(< 0.4 or > 0.6; Figure S6b). Taken together, these results demonstrate the need230
to adjust for the background genomic abundance of the alleles when attempting to231
identify putative cis-acting regulatory SNPs from ChIP-seq data. These data also232
support our assumption that for allelic changes that do not change TF affinity, the233
allele frequency in ChIP-seq data is proportional to the presence of these alleles in234
the genomic DNA.235
Allele-specific amplification events explain most of the imbalances observed in236
ChIP-seq data in cancer cell lines237
Having obtained a reliable set of heterozygous SNPs per cell line, we next applied the238
Bayesian framework implemented in BaalChIP to each of the 14 cell lines individ-239
ually. To assess the importance of adjusting for copy-number biases, we performed240
the analysis with and without adjusting for the RAF bias. In a diploid sample, the241
allelic ratios (AR) are expected to be distributed around the 0.5 average, assuming242
that only a small proportion of sites carry an imbalance towards one of the alleles.243
However, as shown in Figure 3 copy-number alterations can affect read densities244
which will have an effect on ARs, in particular in cancer cells. Figure S7a shows245
the observed AR values in ChIP-seq data compared to the values before and after246
RAF correction for all analyzed cell lines. We observed that after RAF copy-number247
correction the ’corrected allelic ratios’ become more evenly distributed around an248
de Santiago et al. Page 11 of 57
average of 0.5. This effect is particularly notable in data obtained from cancer cell249
lines (Figure S7a).250
Overall, we found 2,438 ASB sites across all cell lines (Table S4), with an average251
of 3.1% SNPs displaying allele-specific binding using our chosen cutoffs. Figure S7b252
demonstrates that the copy-number correction has a strong effect in cancer cell253
lines. In the most extreme cases the number of allele-specific imbalances is reduced254
by 4-fold. In normal cells, and in cancer cells with near diploid genomes, such as255
HL-60 and SK-N-SH [30, 31], the total number of identified ASB sites due to copy-256
number effects is much lower (Figure S7b). This is consistent with the fact that257
non-cancer cell lines carry far fewer copy number aberrations. Because ASB is more258
easily detected in regions of high sequence coverage due to increases in statistical259
power (Figure S8), we repeated the same analysis after selecting only SNPs with260
30x-40x read coverage to control for read depth biases, and demonstrate that the261
same effect is verified (Figure S7c). Importantly, the data shown here demonstrates262
that adjusting for RAF is able to remove artifacts that are caused by copy number263
alterations in cancer cell lines.264
We observed higher rates of ASB on chromosome X of female cell lines (GM12878,265
GM12892, IMR90, MCF10, MCF-7, SK-N-SH) than in autosomal chromosomes266
(Figure S9, χ2 test p-value < 2.2e-16 for chromosome X versus autosomal identified267
sites comparison for all 6 cell lines). These cases might be explained by the extent268
of X-inactivation. In normal tissue X-inactivation is random. However, in clonal cell269
lines the same X-chromosome will continue to be silenced and most X-linked genes270
are expressed in a mono-allelic fashion [32].271
ASB events are consistent within and between cell lines272
We evaluated the correlation of allelic ratios between pairs of biological replicates,273
between distinct proteins bound at the same site, and at identical ASB sites in274
different cell lines. The allelic ratios of all shared heterozygous SNPs are well corre-275
de Santiago et al. Page 12 of 57
lated across biological replicates (Figure S10). Secondly, different proteins binding276
to the same SNP also display concordant allelic ratios (Figure S11) across cell lines.277
Our observations are consistent with the concordance of allele-specific binding of278
different co-bound TFs described in previous studies [1, 2, 3, 10, 12, 33, 34]. The279
Spearman correlation coefficients of all pairwise comparisons (between replicates280
and within cells between different DNA-binding proteins at the same SNP) are281
plotted as a boxplot in Figure S12a. Positive Spearman correlation coefficients are282
observed in every case, and in the majority of the cases we observe correlation coeffi-283
cients > 0.8. Therefore, overall we find that BaalChIP generates highly reproducible284
results, across separate ChIP-seq experiments.285
In addition, we analysed if heterozygous sites shared by cell lines had allelic ratios286
that were skewed towards the same allele or not. Of the identified ASB sites, only287
a small proportion (149 out of 2,438; 6.1%) were shared between cell lines (Figure288
S12b). Of these ASB sites, 91% (136 out of 149; Table S5) show an agreement in the289
direction of the allelic bias. Discordant cases may be explained by environmental290
or epigenetic factors, by different genomic context in different cells [1, 2], or by low291
sequencing coverage of ChIP-seq data. The high proportion of ASB with the same292
allelic bias further supports the robustness of BaalChIP.293
Functional annotation of ASB sites294
We then examined the genomic distribution of the identified ASB SNPs. A large295
proportion of ASB SNPs overlap introns and intergenic regions, and a considerable296
proportion is found at promoter-proximal regions, mainly reflecting the binding297
patterns of the ChIPed proteins (Figure S13). A large proportion of ASB SNPs298
overlap previously predicted enhancer regions and histone modifications associated299
with active enhancers (H3K4me1, H3K27ac), with an average of 70.2% of ASB SNPs300
occurring within cell-type specific putative enhancer regions (Figure S14). However,301
this enrichment is not significant when compared to non-ASB SNPs (Table S6,302
de Santiago et al. Page 13 of 57
χ2 pvalue > 0.05), and it is possibly mainly reflecting the distribution of binding303
sites (ChIP-seq peak regions) from which the initial set of heterozygous SNPs was304
sampled.305
Finally, to examine the putative functional mechanisms of ASB SNPs, we assessed306
if they modulated transcription factor binding affinity. We used predictions from307
HaploReg to assess if the observed ASB SNPs alter canonical transcription factor308
binding motifs. We found that 88% of ASB SNPs (range 83% to 92%) are motif309
disrupting SNPs but this enrichment was not statistically significant, since 85% of all310
tested SNPs under ChIP peaks were also found to alter motif scores. No significant311
difference in the magnitude of the change in binding affinity was observed between312
motif disrupting ASB SNPs when compared to all tested motif-disrupting SNPs,313
using the Kolmogorov–Smirnov test (Figure S15) To determine which TF-motifs314
are most likely to be disrupted, we grouped SNPs according to the DNA-binding315
proteins as identified by ChIP-seq peaks, and identified the top motifs disrupted316
by ASB SNPs when compared to non-ASB SNPs. In the majority of cases, the top317
disrupted motifs match the DNA-binding proteins used to generate the ChIP-seq318
data (Table S7). Overall, these results provide good evidence that the allele-specific319
binding we identify represents a true biological phenomenon.320
Case study 2: FAIRE-seq data321
To demonstrate the generality of our approach, we applied BaalChIP to targeted322
FAIRE-sequencing data obtained from two breast-cancer cell lines, MDA-MB-134323
and T-47D. FAIRE stands for Formaldehyde-Assisted Isolation of Regulatory Ele-324
ments and is an effective method to identify DNA regions in the genome associated325
with open chromatin [35]. The method is based on the fact that the formaldehyde326
cross-linking is more efficient in nucleosome-bound DNA than it is in nucleosome-327
depleted regions of the genome. Thus, FAIRE-seq identifies regions of open chro-328
de Santiago et al. Page 14 of 57
matin. One advantage of FAIRE-seq over ChIP-seq is that the assayed chromatin329
is not limited to the location of specific DNA-associated proteins.330
We chose to focus on the fraction of the genome that has been previously associ-331
ated to breast-cancer risk. To do so, we selected 75 previously known breast-cancer332
risk regions (Table S8) [36] covering a total of 4.93Mb of the human genome. We333
performed targeted sequencing of 3 replicated FAIRE samples per cell-line and the334
correspondent genomic DNA (gDNA) controls. Targeted sequencing of the gDNA335
samples allowed us to determine with confidence the genotype of a high number of336
sites at the assayed breast cancer risk regions. We identified a total of 3208 and 1624337
heterozygous SNPs in MDA-MB-134 and T-47D cells, respectively. In this dataset,338
the sequenced gDNA samples were used for the RAF correction step, i.e. allelic339
ratios at each SNP position were calculated directly from gDNA samples and used340
for bias correction. We first applied the BaalChIP QC pipeline to eliminate biases.341
We noticed that none of the SNPs in the selected targeted regions overlapped re-342
gions of potential problematic alignments, and only a small proportion of sites were343
eliminated during the BaalChIP QC step (< 0.3 %).344
We observed high correlation between allelic ratios in the gDNA and FAIRE345
samples (Figure 4a), indicating that observed allele-specificity in FAIRE-seq samples346
is primarily due to copy-number alterations and must therefore be corrected for.347
Figure 4b shows the observed allelic ratios (AR) obtained in the FAIRE-seq samples348
compared to the values after correcting for gDNA. After correction the ARs become349
more evenly distributed around an average of 0.5. We found that approximately350
0.65% (MDA-MB-134) and 0.56% (T-47D) of the tested sites in the selected risk351
regions were allele-specific. These correspond to a total of 21 sites in MDA-MB-134352
and 9 sites in T-47D cell lines (Table 1).353
Out of the 21 SNPs identified in the MDA-MB-134 cell line, 11 cluster in the354
10q26.13 region. This cluster includes the rs2981579 SNP in the second intron of355
de Santiago et al. Page 15 of 57
the FGFR2 gene (Table 1), which is the SNP with the strongest association with356
breast cancer in genome-wide analysis [37] In the 10q26.13 region, we identify two357
breast cancer risk SNPs, rs2981579 and rs2981578 with a strong allelic imbalance358
towards their risk alleles, rs2981579-A and rs2981578-C, respectively.359
Comparing BaalChIP to competing methods360
We applied two competing methods, the binomial test and the iASeq [10] method,361
to the ENCODE and FAIRE-seq datasets.362
When applying the iASeq method to the ENCODE dataset, the existence of363
missing data created an error. Missing data occur for those samples that do not364
contain ChIP-seq peaks at the SNP in question, or if the a SNP did not pass the365
previously applied QC step for that sample. To overcome this limitation of iASeq,366
we replaced every missing data point by zero, with the caveat that this ad hoc367
approach may create unknown biases in our results.368
When applying the binomial test, for each cell line, we pooled read count data369
from different experiments and replicates. While this approach maximizes statistical370
power, it may mask heterogeneity in ASB obtained from different experiments.371
Previous studies have shown that at least 20x read coverage at a particular SNP372
position is necessary to reach adequate statistical power with the binomial test [9,373
12]. Therefore, we restricted our analysis to SNPs with a coverage of at least 20374
reads in the pooled data. We included two different ways of performing the binomial375
test, either by setting the probability of success to 0.5 or by weighting the binomial376
null with the RAF scores. In real data, unlike simulations, we are not able to access377
the true-positive and false-positive rates and compute ROC curves. For this reason,378
we mainly focused on testing whether BaalChIP is comparable to existing methods379
while improving the biased ASB detection in high copy-number regions (Figure 5a)380
or the problem of overdispersion in the data (Figure 5b).381
de Santiago et al. Page 16 of 57
First, for consistency of the analysis, we obtained the set of heterozygous SNPs to382
test using the BaalChIP QC step. We then compared BaalChIP to the binomial test383
method with (p=RAF) and without (p=0.5) copy-number correction based on how384
many of the identified ASB SNPs were in CNA regions. For the comparison to the385
iASeq method, we computed the estimated posterior probability obtained for SNPs386
different regions of altered copy-numbers. Figure 5a shows that the binomial test387
(without RAF correction) and the iASeq methods show a substantial bias towards388
the increased detection of ASB events in regions altered copy-number, as expected.389
Both BaalChIP and iASeq employ a beta-binomial distribution to model read390
count data, which accounts for extra-binomial variability (overdispersion). To check391
the effect of overdispersion in the ENCODE and FAIRE-seq data we considered only392
SNPs at regions of normal or close to normal copy numbers (RAF betwen 0.4-0.6)393
and computed the percentage of detected ASB sites in bins of increasing depth of394
coverage. The overdispersion is often more accentuated for sites of higher coverage395
(> 200 pooled total counts), which will create a particularly sensitive scenario for396
the detection of ASB sites using the binomial distribution. Figure 5b demonstrates397
clearly the effect of overdispersion in the FAIRE-seq data, particularly visible at398
sites of high coverage, which shows the advantage of BaalChIP and iASeq over the399
binomial model. This effect is less pronounced in the ENCODE data, where depth400
of coverage is overall lower (Figure S16).401
These results suggest that for higher read counts random fluctuations in allelic402
ratios should be modelled with a beta-binomial distribution, rather than a binomial403
distribution, to reduce the number of false-positive results.404
Discussion405
ASB analysis is an important method for identifying putative regulatory SNPs406
that might have an effect on transcription factor binding and gene expression and407
might be associated with disease phenotypes. Our method for Bayesian Analysis408
de Santiago et al. Page 17 of 57
of Allelic imbalances from ChIP-seq data, called BaalChIP, has been motivated by409
the need to address the issue of detecting allelic-specific imbalances from ChIP-seq410
data obtained specifically from cancer cell lines, which frequently carry copy-number411
alterations.412
While allele-specific copy number alterations can be associated with changes in413
transcript abundance [38] and are possibly implicated in cancer phenotype, they can414
confound the identification of allele-specific binding at potentially regulatory sites.415
Traditional methods to identify ASB have not been able to distinguish between416
effects that are caused by allele specific amplification of binding sites versus the417
effects caused by allele-specific binding by specific TFs. Our method now allows to418
distinguish these and will therefore help to better delineate the causal mechanisms419
of disease.420
BaalChIP is a general framework applicable to a wide range of assays. We have421
applied it to ENCODE ChIP-seq and our own FAIRE-seq samples, and demon-422
strated the utility of BaalChIP to identify ASB events from cancer genomes. Each423
of these chromatin assays has its own advantages for the detection of allele-specific424
binding. ChIP-seq precisely determines the location of specific DNA-associated pro-425
teins, while FAIRE-seq identifies broader regions of open chromatin which might be426
less informative in terms of pinpointing the functional regulatory elements of the427
genome. On the other hand, the ChIP assay is limited to the availability of high-428
quality antibodies and only one factor can be tested per experiment. It is still an429
open question which chromatin assay will be the most informative for understanding430
allele-specific binding differences.431
Cancer cell lines carry frequent copy number alterations. Therefore an important432
issue is the determination of the background biases in allele frequency observed in433
regions of altered copy-number. For the ENCODE dataset, we have demonstrated434
that the use of BAF scores obtained from microarrays could be used to correct for435
de Santiago et al. Page 18 of 57
these effects in the allelic ratios, however it is also noteworthy that the microarray436
technology used by the ENCODE study did not generate BAF scores comprehen-437
sively for all possible heterozygous sites, limiting the number of SNPs at which the438
allele-specificity of each could be assayed. Therefore, for the majority of the samples439
in the ENCODE case study, it is likely that the regulatory potential of true causal440
SNPs was not directly assayed. As demonstrated by the analysis of FAIRE-seq data,441
this issue can be overcome by including genomic DNA-sequencing as control sam-442
ples to detect allele-specific biases, provided that there is sufficient read coverage at443
each site.444
The identified ASB sites may form the basis for future functional analysis of445
the genome. Of particular interest, within the FGFR2 gene, rs2981578 has been446
previously suggested to be the key causal variant. Indeed this SNP displays the447
highest allelic imbalance in a breast cancer cell line (Table 1).448
BaalChIP has been designed for data from cell lines, not tumor samples [39].449
Normal contamination and the existence of different clonal subpopulations in tumor450
samples pose additional challenges and can distort the expected distribution of451
heterozygous variant allele fractions. In the future, these challenges can potentially452
be overcome by combining the ideas implemented in BaalChIP with probabilistic453
methods developed for dissecting genetic heterogeneity and cancer evolution [40].454
Conclusions455
In summary, BaalChIP is a rigorous probabilistic method to detect allelic imbalance456
by correcting for the effect of background allele frequency on the observed ChIP-seq457
read counts. BaalChIP implements stringent filtering and preprocessing steps and458
allows the joint analysis of multiple ChIP-seq samples across a single variant. In459
simulations, BaalChIP outperformed competing approaches, and in an application460
to 548 ENCODE ChIP-seq and 6 targeted FAIRE-seq samples BaalChIP effectively461
de Santiago et al. Page 19 of 57
corrected allele-specific analysis for copy number changes and increased the power462
to detect putative cis-acting regulatory variants in cancer genomes.463
Methods464
BaalChIP Quality Control and filtering465
The QC and filtering steps implemented in the BaalChIP R package are:466
1 Keep only reads with MAPQ > 15 and with base quality > 10467
2 Within each cell line only consider heterozygous SNPs overlapping transcrip-468
tion factor binding sites identified by ChIP-seq peak calling;469
3 Exclude sites susceptible to allelic mapping bias in regions of known problem-470
atic read alignment [22, 23, 24];471
4 Apply simulation-based filtering to exclude SNPs with intrinsic bias to one of472
the alleles [6, 17];473
5 Consider only SNPs that are represented in all replicated samples after ap-474
plying all previous filters;475
6 Exclude possibly homozygous SNPs where only one allele was observed after476
pooling ChIP-seq reads from all examined samples [4, 26].477
Generating allele counts per SNP478
For each SNP, the number of reads carrying the reference and alternative alleles479
are computed using the pileup function and the PileupParam constructor of the480
Rsamtools package [41]. For each BAM file, BaalChIP only considers heterozygous481
SNPs overlapping the genomic regions in the corresponding BED files (peaks). Two482
arguments of the PileupParam constructor can be manipulated by the user: the483
min_mapq which refers to the minimum "MAPQ" value for an alignment to be484
included in pileup (reads with mapping quality score lower than this threshold are485
ignored; default is 15); and the min_base_quality which refers to the minimum486
quality value for each nucleotide in an alignment (bases at a particular location487
with base quality lower than this threshold are ignored; default is 10).488
de Santiago et al. Page 20 of 57
Data sources for regions of problematic alignments489
BaalChIP considers three sets of regions known to be of problematic read align-490
ment, by default: (1) blacklisted regions downloaded from the UCSC Genome491
Browser (mappability track; hg19, wgEncodeDacMapabilityConsensusExcludable492
and wgEncodeDukeMapabilityRegionsExcludable tables), (2) non-unique regions493
selected from DUKE uniqueness mappability track of the UCSC genome browser494
(hg19, wgEncodeCrgMapabilityAlign50mer table), and (3) collapsed repeat regions495
downloaded from [24] at the 0.1 % threshold. Sets of filtering regions used in this496
filtering step are fully customized and additional sets can be added by the user as497
GenomicRanges objects [42].498
Simulations to identify SNPs with intrinsic biases499
For each heterozygous site, BaalChIP simulates every possible read overlapping the500
site in four possible combinations - reads carrying the reference allele (plus and501
minus strand), and reads carrying the alternative allele (plus and minus strand).502
The simulated reads are constructed based on published methodology [6] (scripts503
shared by Degner et al. upon request) without taking into consideration different504
qualities at each base in the read or different read depth of coverage. As described505
by Degner et al. these parameters were sufficient to predict the SNPs that show an506
inherent bias [6].507
Reads are then aligned to the reference genome and BAM files are generated508
using Bowtie version 1.1.1 [43] and Picard tool version 1.47. The pipeline used509
to generate and align simulated reads can be fully customized with other aligners510
(e.g. BWA) and it is available under the file name "run_simulations.sh" found511
in the folder "extra" of the BaalChIP R package. Simulated allelic read counts512
are computed using the pileup function and the PileupParam constructor of the513
Rsamtools package [41]. For each SNP, the correct number of reads that should map514
to the reference and non-reference alleles is known (corresponds to twice the read515
de Santiago et al. Page 21 of 57
length for each allele); sites with incorrect number of read alignments are discarded516
from the analysis.517
Estimating reference mapping bias518
The reference mapping bias is calculated as described in [4, 26]. First, reads are519
combined across all heterozygous sites that pass the previous QC steps. The ex-520
pected allelic ratio (REF/TOTAL) for each cell line is then calculated separately521
for each allele combination. A minimum of 200 sites is required for each category;522
if there were less, a global estimate was used for that category. For each SNP this523
computation results in an allelic ratio µ ∈ (0, 1). These allelic ratios are used as524
part of a prior distribution in the Bayesian model described in the next section.525
BaalChIP Bayesian model description526
BaalChIP is designed to infer allelic imbalance from read counts data while inte-527
grating copy number and technical bias information. We assume that we are given528
a set of N datasets covering a SNP position. We jointly infer allelic imbalance from529
signals in all datasets covering the same SNP.530
For the nth dataset let dn ∈ N denote the total number of reads covering a SNP
position and an ∈ N the number of reads reporting the reference allele. BaalChIP
models an with a beta-binomial distribution, i.e. the binomial distribution in which
the probability of success is integrated out given that it follows the beta distribution:
p(an | dn, α, β) = BetaBin(an | dn, α, β). (1)
The beta-binomial distribution controls for overdispersion, i.e. the fact that the
increased variance in NGS data cannot be captured in the standard binomial
model [18]. To gain a more intuitive interpretation of the parameters, we re-
parametrize α and β in terms of the precision of the beta-binomial distribution,
de Santiago et al. Page 22 of 57
Λ, and the mean probability that a reference read is observed, θ:
α = θΛ (2)
β = (1− θ)Λ (3)
Including reference mapping bias In the next level of our model, we place a beta
distribution over θ, which allows us to naturally include the technical reference
mapping bias:
p(θ | α0, β0) = Beta (θ | α0, β0) , (4)
where, as before, we re-parametrise the shape parameters α0 and β0 by the mean
and variance of the distribution. Denoting the variance by λ and using as the mean
the reference mapping bias µ ∈ (0, 1) as discussed above, we get:
α0 =
((1− µ)
λ− 1
µ
)µ2 (5)
β0 =
((1− µ)
λ− 1
µ
)(µ− µ2
)(6)
As a consequence of this re-parametrisation, the reference mapping bias has a very531
intuitive interpretation as the a priori mean probability that a read reporting the532
reference allele is observed.533
Including reference allele frequency Next, we reparameterize θ as a function of
allelic balance ratio η and reference allele frequency ρ:
θ ≡ f(η, ρ) =ηρ
ηρ+ (1− η)(1− ρ). (7)
de Santiago et al. Page 23 of 57
Intuitively, this parameterization implements the Bayes’ rule with the following
definitions:
θ = Prob( allele = Ref | binding = Yes ),
η = Prob( binding = Yes | allele = Ref ),
1− η ≈ Prob( binding = Yes | allele = Alt ),
ρ = Prob( allele = Ref ),
1− ρ = Prob( allele = Alt ).
The auxiliary condition on binding status in the definition of θ reflects the fact534
that we only analysed peaks and thus assume that observed ChIP-seq reads are535
obtained after TF binding. Eq. 7 then formulates a ChIP-seq experiment as a process536
to obtain the posterior probability of a TF binding to the reference allele. In this537
model, the allelic balance ratio acts as the likelihood for TF binding to the reference538
allele and the RAF as the prior.539
To better understand how this model works, we will discuss two illustrative sce-540
narios. Example 1: Assuming no copy number alterations, i.e. RAF ρ = 0.5, and541
no allelic imbalance, i.e. η = 0.5, then θ = 0.5×0.50.5×0.5+0.5×0.5 = 0.5, which shows that542
BaalChIP contains the assumption of previous approaches as a special case. Exam-543
ple 2: In the event of LOH on the alternative allele, RAF ρ = 1, then irrespective544
of the allelic balance ratio, all observed reads will report the reference allele, which545
is reflected by θ = η×1η×1+(1−η)×0 = 1.546
Posterior distribution of allelic imbalance To rigorously represent the uncertainty
in the data, we adopt a full Bayesian approach targeting the posterior distribution
of allelic balance ratio η. We use the fact that the change of variables in Eq. 7 and
de Santiago et al. Page 24 of 57
the prior on θ directly define a prior over η, namely
p(η|ρ, µ, λ) = p(f(η, ρ)|µ, λ)∂f(η, ρ)
∂η, (8)
and write the posterior of η as follows:
p(η | {an}N1 , {bn}N1 , ρ,Λ, µ, λ
)(9)
∝N∏n=1
p(an|dn, f(η, ρ),Λ) p(f(η, ρ)|µ, λ)∂f(η, ρ)
∂η,
where
∂f(η, ρ)
∂η=
ρ
ηρ+ (1− η)(1− ρ)− ηρ(2ρ− 1)
(ηρ+ (1− η)(1− ρ))2. (10)
Inference of allelic imbalance We use the Metropolis-Hastings algorithm with ran-547
dom walk proposal to approach this distribution. In practice, we fix the Λ = 1000548
and λ = 0.05, which in our experience results in robust performance across a wide549
range of simulated and real datasets. Allelic imbalance calling is based on the high-550
est posterior density (HPD) interval, which is constructed from the MCMC trace551
of η as the shortest interval containing 95% of sampled values. Allelic imbalance is552
called, if the HPD does not contain the value 0.5, which is a rigorous way to decide553
that the data cannot be explained well by balanced alleles.554
In silico validations555
To thoroughly test the performance of allelic imbalance calling performance of556
BaalChIP, read counts data are generated using a wide range of parameter set-557
tings. Specifically, we varied the number N of data sets from 1 to 45, ρ from 0.1 to558
0.9 in steps of size 0.1, d from 1 to 100 in steps of size 7, and 1000 values of η are559
de Santiago et al. Page 25 of 57
evenly sampled from 0.1 to 0.9. These settings result in a set of 6, 075, 000 SNPs.560
The detailed simulation protocol is as follows:561
For each n ∈ [1, ..., N ] and ρ ∈ [0.1, ..., 0.9] and η ∈ [0.1, ..., 0.9]:562
1. Draw the number of available reference alleles:
tn ∼ Binomial(dn, ρ)
2. Draw the read counts reporting reference allele:
if η ≤ 0.5 :
an ∼ Binomial(tn, 2η)
if η > 0.5 :
bn ∼ Binomial(dn − tn, 2(1− η))
an = dn − bn
The performance is measured by ROC curves. The true allelic imbalance is deter-563
mine if the true η is outside the interval [0.45, 0.55].564
Baseline methods565
We compared BaalChIP against two baseline approaches: binomial test and566
iASeq [10]. Two-sided binomial tests are performed with the R function binom.test.567
To correct for reference mapping bias, the null hypothesis on the probability of suc-568
cess is set to be the previously estimated reference mapping bias (RAF), instead of569
0.5. We pooled data from all ChIP-seq samples for each analyzed cell line to max-570
imize power. For the analysis of real data, binomial test p-values were corrected571
for multiple testing using the false discovery rate (FDR) threshold of 0.01. FDR572
was calculated with the p.adjust function in R. For the simulated data, the ROC573
curves for binomial test are constructed based on the distance between reference574
de Santiago et al. Page 26 of 57
mapping bias and the mean of 95% confidence intervals for the probability of suc-575
cess. All iASseq analyses are performed with the function iASeqmotif where the576
number of non-null motif is set to be a vector [1, 2, ..., 5], the maximun number of577
iterations is 300 and the tolerance level of error is 0.001. The ROC curves for iASseq578
are constructed with bestmotif$p.post which is the posterior probability for each579
SNP being allele-specific event.580
Cell culture581
MDA-MB-134 and T-47D human breast cancer cells were cultured in RPMI (Invit-582
rogen) supplemented with 10% FBS and antibiotics. All cells were maintained at583
37C, 5% CO2. All cell lines were from the CRUK Cambridge Institute biorepository584
collection. Cell lines were authenticated by short tandem repeat genotyping using585
the GenePrint 10 (Promega) system and confirmed to be mycoplasma free.586
FAIRE and gDNA purification587
FAIRE stands for ’Formaldehyde assisted isolation of regulatory elements’ and was588
performed as previously described [35] with minor adaptations. Briefly, cells were589
fixed for 10 mins in 1% formaldehyde in FCS-free medium, washed and frozen. Nu-590
clei from MDA-MB-134 (3× 107 cells/tube) and T-47D (1.5× 107 cells/tube) were591
isolated and sonicated using 300µl volumes in 1.5ml eppendorfs, using a Diagenode592
Bioruptor. Sonication was performed for 20 cycles of 30s on/off at the "high" set-593
ting. The sonicate was subjected to three consecutive phenol–chloroform–isoamyl594
alcohol (25:24:1) extractions and reverse cross-linked overnight. DNA was purified595
by ethanol precipitation and quantified by Quanti-iT. Genomic DNA was isolated596
using Qiagen DNeasy Blood and tissue Kit according to protocol.597
Agilent SureSelect amplification, library preparation and sequencing598
DNA fragments were prepared for sequencing using the recommended protocol avail-599
able for SureSelect-Illumina sequencing. 69 known breast cancer risk tagging SNPs600
de Santiago et al. Page 27 of 57
were retrieved from [36]. SNAP Proxy Search was used to find SNPs correlated601
with the 69 tagging SNPs (r2 > 0.6 for 59 of the risk loci and r2 > 0.8 for 8 of the602
risk loci, within a distance of 500bp) using 1000 Genomes pilot 1 data. Genomic603
intervals were defined by the left-most to the right-most SNP in each LD block with604
an additional 400bp of flanking regions. An additional set of 7 SNPs with 500bp605
of flanking sequences were added manually to include the CCND1 (rs75915166,606
rs494406), MAP3K1 (rs10461612, rs112497245, rs17432750) and TERT gene regions607
(rs2736107, rs7705526). In total, we targeted 69 non-overlapping regions compris-608
ing 4.93Mb using SureSelect method (Table S8). DNA obtained from SureSelect609
solution-based sequence capture was subjected to Illumina HiSeq paired-end se-610
quencing (Illumina). Paired-end Sequencing was performed according to manufac-611
turer’s protocols.612
Pre-processing ENCODE samples613
We used publicly available ENCODE ChIP-seq and genotype datasets for a to-614
tal of 548 samples representing 271 different experiments. We included 8 cancer615
and 6 non-cancer cell lines representing different tissues. Table S3 shows a sum-616
mary of all tissues, cell lines and number of experiments included in this study.617
The ChIP-seq data was downloaded as reads mapped to the hg19 genome (BAM618
files) and correspondent peak calling files (BED files). Accession numbers of all619
public ChIP-seq datasets used in this study are provided in Table S2. Duplicated620
reads were marked using Picard tool version 1.47. ChIP-seq peak files were merged621
between replicates using HOMER version 5.4 mergePeaks with the default op-622
tion -d given to only consider peak ranges that overlapped for all replicates. Het-623
erozygous SNPs and BAF tracks were retrieved from the UCSC Genome browser624
(hg19, wgEncodeHaibGenotype track, wgEncodeHaibGenotypeBalleleSnp2015-03-625
04.tsv and wgEncodeHaibGenotypeGtypeSnp2014-09-15.tsv files). The initial num-626
ber of genotyped SNPs in the ENCODE files is 1.2 million SNPs. We only considered627
de Santiago et al. Page 28 of 57
SNPs listed in dbSNP (version 137 [44]). Homozygous SNPs (i.e., BAF > 0.9 or628
< 0.1) and SNPs with missing BAF scores were removed from the BAF tracks.629
BAF scores were converted to Reference Allele Frequency (RAF) scores using the630
information of A and B alleles in the Manifest file for the Illumina Human1M-Duo631
BeadChip (v3.0) downloaded from http://support.illumina.com/array/array_632
kits/human1m-duo_dna_analysis_kit/downloads.html.633
Pre-processing FAIRE-seq samples634
For sequence data from all FAIRE-seq and control samples, sequences were aligned635
to the human reference genome (GRCh37) using BWA version 0.7.12 [45]. Du-636
plicates were removed using Picard Tool version 1.47 and overlapping reads were637
clipped using clipOverlap tool from bamUtil repository version 1.0.14 with default638
parameters. SNPs were identified from gDNA sampes using the Genome Analysis639
ToolKit 3.4-46 software [46] across all gDNA samples simultaneously. As per GATK640
Best Practices recommendations [47, 48], duplicated reads were removed and local641
realignment and base quality recalibration were employed prior to SNP calling.642
Called SNPs were filtered using a hard filtering criteria (QD < 2.0, FS > 60.0, MQ643
< 30.0, MQRankSum < -12.5, ReadPosRankSum < -8.0).644
Applying BaalChIP to ENCODE and FAIRE-seq samples645
To ensure a reliable set of heterozygous SNPs we applied the BaalChIP (version646
0.1.9) quality control step and considered only uniquely mapping reads with MAQ647
> 15 and base call quality > 10. For the ENCODE dataset we removed from the648
analysis sets of SNPs based on the default six QC filters implemented within the649
BaalChIP pipeline. The FAIRE-seq samples contained paired-end sequenced reads650
of 125 bp. Since longer and paired-end reads reduce uncertainty of read alignment we651
did not consider two of the QC filters that are more relevant for shorter read lengths652
(of less than 50bp): the unique mappability and intrinsic bias filters. BaalChIP653
de Santiago et al. Page 29 of 57
(version 0.1.9) ASB Bayesian analysis was performed with the default parameters654
and options [21].655
Consistency of allelic ratios656
Consistency of the allelic ratios observed at ASB SNPs was analysed between (i)657
between replicates: pairs of replicated samples, (ii) within cell lines: pairs of ChIP-658
seq datasets from different proteins (pooled replicate data) in the same cell line (iii)659
between cell lines: pairs of cells (pooled data for each cell line). Correlations were660
calculated with Spearman correlation and required at least 15 shared ASB sites.661
SNP annotations relative to genes662
To annotate ASB SNPs with respect to gene annotations - 5’ UTR, 3’ UTR, pro-663
moter, spliceSite, coding, intron or intergenic - we used the VariantAnnotaion (ver-664
sion 1.8.13) package in R. To obtain the list of known genes and coordinates,665
we used the UCSC genome browser known gene annotation obtained from the666
TxDb.Hsapiens.UCSC.hg19.knownGene (version 2.6.2) library in R [20].667
Overlap with predicted enhancers668
To determine if ASB SNPs were enriched in any of the putative enhancer regions,669
we calculated the overlap of intergenic ASB SNPs in putative enhancer regions.670
The significance of the observed overlap was determined by a χ2 test by compar-671
ing the fraction of ASB SNPs in putative enhancer regions with the fraction of672
non-ASB heterozygous SNPs in putative enhancer regions. Putative enhancer re-673
gions were retrieved for 6 cell lines: GM12878, H1hESC, HeLa, HepG2, K562, A549674
based on predicted weak and strong enhancer sites and/or H3K27ac and H3K4me1675
chromatin marks. Enhancer site predictions were retrieved from human segmen-676
tations previously generated based on ENCODE data using Segway (downloaded677
from wgEncodeAwgSegmentation UCSC genome browser track).678
de Santiago et al. Page 30 of 57
TF motif disruption analysis679
In order to annotate the potential regulatory effects of the tested SNPs on tran-680
scription factor binding site (TFBS) motifs, the publically available HaploReg681
v2 database (accessible at http://www.broadinstitute.org/mammals/haploreg/682
haploreg_v2.php) was used. Haploreg calculates allele-specific changes in the log-683
odds (LOD) scores for position weight matrices (PWMs) of a regulatory motif684
based on a library of PWMs constructed from TRANSFAC, JASPAR, and and685
protein-binding microarray (PBM) experiments [49]. The magnitude of the change686
in binding affinity was calculated as the absolute difference (DELTA) of LOD scores687
(DELTA = LOD(ref) - LOD(alt)). The Kolmogorov-Smirnov test was used to com-688
pare the distributions of the absolute DELTA scores of motif disrupting ASB SNPs689
and all tested motif disrupting SNPs. To determine which TFBS-motifs ASB SNPs690
were more likely to disrupt, SNPs were grouped according to DNA-binding proteins691
as identified by ChIP-seq peaks. For each group, a Fisher exact test was used to692
identify TFBS-motifs that were more significantly disrupted by ASB SNPs when693
compared to non-ASB SNPs.694
Availability of data and materials695
ENCODE datasets used in this study are publicly available in the curated Gene Expression Omnibus (GEO)696
database. Accession numbers of all public ChIP-seq datasets used in this study are provided in Table S2.697
FAIRE-seq datasets used in this study are included in the GEO database (GEO accession number, GSE85261)698
Competing interests699
The authors declare that they have no competing interests.700
Author’s contributions701
IS, WL and KY designed and developed the Baal-ChIP statistical framework. IS and WL implemented702
BaalChIP package in R. IS analyzed the data. WL preformed simulation studies. KM and MOR designed and703
preformed FAIRE-seq experiments. BAJP, FM and KBM supervised the analysis. IS, WL, KY, KBM and FM704
wrote the manuscript. FM and KBM are co-corresponding authors.705
Funding706
We would like to acknowledge the support of The University of Cambridge, Cancer Research UK and Hutchison707
Whampoa Limited. Parts of this work were funded by CRUK core grant C14303/A17197 and A19274 and the708
Breast Cancer Research Foundation.709
de Santiago et al. Page 31 of 57
Acknowledgments710
We thank Thomas Carroll, Gordon Brown and Jing Su (CRUK Cambridge Institute, University of Cambridge)711
for suggestions and advice on ChIP-seq data analysis.712
References713
1. McDaniell, R., Lee, B.-K., Song, L., Liu, Z., Boyle, A.P., Erdos, M.R., Scott, L.J., Morken, M.A., Kucera,714
K.S., Battenhouse, A., et al.: Heritable individual-specific and allele-specific chromatin signatures in715
humans. Science 328(5975), 235–239 (2010)716
2. Reddy, T.E., Gertz, J., Pauli, F., Kucera, K.S., Varley, K.E., Newberry, K.M., Marinov, G.K., Mortazavi,717
A., Williams, B.A., Song, L., et al.: Effects of sequence variation on differential allelic transcription factor718
occupancy and gene expression. Genome research 22(5), 860–869 (2012)719
3. Kasowski, M., Kyriazopoulou-Panagiotopoulou, S., Grubert, F., Zaugg, J.B., Kundaje, A., Liu, Y., Boyle,720
A.P., Zhang, Q.C., Zakharia, F., Spacek, D.V., et al.: Extensive variation in chromatin states across721
humans. Science 342(6159), 750–752 (2013)722
4. Kilpinen, H., Waszak, S.M., Gschwind, A.R., Raghav, S.K., Witwicki, R.M., Orioli, A., Migliavacca, E.,723
Wiederkehr, M., Gutierrez-Arcelus, M., Panousis, N.I., et al.: Coordinated effects of sequence variation on724
dna binding, chromatin structure, and transcription. Science 342(6159), 744–747 (2013)725
5. McVicker, G., van de Geijn, B., Degner, J.F., Cain, C.E., Banovich, N.E., Raj, A., Lewellen, N., Myrthil,726
M., Gilad, Y., Pritchard, J.K.: Identification of genetic variants that affect histone modifications in human727
cells. Science 342(6159), 747–749 (2013)728
6. Degner, J.F., Marioni, J.C., Pai, A.A., Pickrell, J.K., Nkadori, E., Gilad, Y., Pritchard, J.K.: Effect of729
read-mapping biases on detecting allele-specific expression from rna-sequencing data. Bioinformatics730
25(24), 3207–3212 (2009)731
7. Rozowsky, J., Abyzov, A., Wang, J., Alves, P., Raha, D., Harmanci, A., Leng, J., Bjornson, R., Kong, Y.,732
Kitabayashi, N., et al.: Alleleseq: analysis of allele-specific expression and binding in a network framework.733
Molecular systems biology 7(1), 522 (2011)734
8. Satya, R.V., Zavaljevski, N., Reifman, J.: A new strategy to reduce allelic bias in rna-seq readmapping.735
Nucleic acids research, 425 (2012)736
9. Skelly, D.A., Johansson, M., Madeoy, J., Wakefield, J., Akey, J.M.: A powerful and flexible statistical737
framework for testing hypotheses of allele-specific gene expression from rna-seq data. Genome research738
21(10), 1728–1737 (2011)739
10. Wei, Y., Li, X., Wang, Q.-f., Ji, H.: iaseq: integrative analysis of allele-specificity of protein-dna740
interactions in multiple chip-seq datasets. BMC genomics 13(1), 681 (2012)741
11. Mayba, O., Gilbert, H.N., Liu, J., Haverty, P.M., Jhunjhunwala, S., Jiang, Z., Watanabe, C., Zhang, Z.:742
Mbased: allele-specific expression detection in cancer tissues and cell lines. Genome Biol 15(8), 405 (2014)743
12. Li, G., Bahn, J.H., Lee, J.-H., Peng, G., Chen, Z., Nelson, S.F., Xiao, X.: Identification of allele-specific744
alternative mrna processing via transcriptome sequencing. Nucleic acids research, 280 (2012)745
13. Chen, J., Rozowsky, J., Galeev, T.R., Harmanci, A., Kitchen, R., Bedford, J., Abyzov, A., Kong, Y.,746
Regan, L., Gerstein, M.: A uniform survey of allele-specific binding and expression over747
1000-genomes-project individuals. Nature communications 7 (2016)748
14. Almlöf, J.C., Lundmark, P., Lundmark, A., Ge, B., Pastinen, T., Goodall, A.H., Cambien, F., Deloukas, P.,749
Ouwehand, W.H., Syvänen, A.-C., et al.: Single nucleotide polymorphisms with cis-regulatory effects on750
long non-coding transcripts in human primary monocytes (2014)751
15. Bailey, S.D., Virtanen, C., Haibe-Kains, B., Lupien, M.: Abc: a tool to identify snvs causing allele-specific752
de Santiago et al. Page 32 of 57
transcription factor binding from chip-seq experiments. Bioinformatics, 321 (2015)753
16. Liu, Z., Gui, T., Wang, Z., Li, H., Fu, Y., Dong, X., Li, Y.: cisase: a likelihood-based method for detecting754
putative cis-regulated allele-specific expression in rna sequencing data. Bioinformatics (Oxford, England)755
32, 3291–3297 (2016). doi:10.1093/bioinformatics/btw416756
17. Pickrell, J.K., Marioni, J.C., Pai, A.A., Degner, J.F., Engelhardt, B.E., Nkadori, E., Veyrieras, J.-B.,757
Stephens, M., Gilad, Y., Pritchard, J.K.: Understanding mechanisms underlying human gene expression758
variation with rna sequencing. Nature 464(7289), 768–772 (2010)759
18. Roth, A., Khattra, J., Yap, D., Wan, A., Laks, E., Biele, J., Ha, G., Aparicio, S., Bouchard-Côté, A., Shah,760
S.P.: Pyclone: statistical inference of clonal population structure in cancer. Nature methods 11(4),761
396–398 (2014)762
19. Consortium, E.P.: An integrated encyclopedia of dna elements in the human genome. Nature 489(7414),763
57–74 (2012)764
20. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical765
Computing, Vienna, Austria (2014). R Foundation for Statistical Computing. http://www.R-project.org/766
21. de Santiago, I., Wei, L., Yuang, K., Markowetz, F.: Baal-ChIP: Bayesian Analysis of Allele-specific Binding767
Events from ChIP-seq Data. (2015). R package version 0.1.9.768
https://github.com/InesdeSantiago/BaalChIP769
22. Castel, S.E., Levy-Moonshine, A., Mohammadi, P., Banks, E., Lappalainen, T.: Tools and best practices770
for data processing in allelic expression analysis. Genome biology 16(1), 1 (2015)771
23. Fujita, P.A., Rhead, B., Zweig, A.S., Hinrichs, A.S., Karolchik, D., Cline, M.S., Goldman, M., Barber,772
G.P., Clawson, H., Coelho, A., et al.: The ucsc genome browser database: update 2011. Nucleic acids773
research, 963 (2010)774
24. Pickrell, J.K., Gaffney, D.J., Gilad, Y., Pritchard, J.K.: False positive peaks in chip-seq and other775
sequencing-based functional assays caused by unannotated high copy number regions. Bioinformatics776
27(15), 2144–2146 (2011)777
25. Carroll, T.S., Liang, Z., Salama, R., Stark, R., de Santiago, I.: Impact of artifact removal on chip quality778
metrics in chip-seq and chip-exo data. Front Genet 5, 75 (2014)779
26. Lappalainen, T., Sammeth, M., Friedländer, M.R., Aoen, P., Monlong, J., Rivas, M.A., Gonzàlez-Porta,780
M., Kurbatova, N., Griebel, T., Ferreira, P.G., et al.: Transcriptome and genome sequencing uncovers781
functional variation in humans. Nature 501(7468), 506–511 (2013)782
27. Peiffer, D.A., Le, J.M., Steemers, F.J., Chang, W., Jenniges, T., Garcia, F., Haden, K., Li, J., Shaw, C.A.,783
Belmont, J., et al.: High-resolution genomic profiling of chromosomal aberrations using infinium784
whole-genome genotyping. Genome research 16(9), 1136–1148 (2006)785
28. Iafrate, A.J., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W., Lee, C.:786
Detection of large-scale variation in the human genome. Nature genetics 36(9), 949–951 (2004)787
29. Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Månér, S., Massa, H., Walker, M.,788
Chi, M., et al.: Large-scale copy number polymorphism in the human genome. Science 305(5683), 525–528789
(2004)790
30. Biedler, J.L., Helson, L., Spengler, B.A.: Morphology and growth, tumorigenicity, and cytogenetics of791
human neuroblastoma cells in continuous culture. Cancer research 33(11), 2643–2652 (1973)792
31. Liang, J.C., Ning, Y., Wang, R.-y., Padilla-Nash, H.M., Schröck, E., Soenksen, D., Nagarajan, L., Ried, T.:793
Spectral karyotypic study of the hl-60 cell line: detection of complex rearrangements involving794
chromosomes 5, 7, and 16 and delineation of critical region of deletion on 5q31. 1. Cancer genetics and795
de Santiago et al. Page 33 of 57
cytogenetics 113(2), 105–109 (1999)796
32. Gimelbrant, A., Hutchinson, J.N., Thompson, B.R., Chess, A.: Widespread monoallelic expression on797
human autosomes. Science 318(5853), 1136–1140 (2007)798
33. Tang, F., Barbacioru, C., Nordman, E., Bao, S., Lee, C., Wang, X., Tuch, B.B., Heard, E., Lao, K., Surani,799
M.A.: Deterministic and stochastic allele specific gene expression in single mouse blastomeres. PLoS One800
6(6), 21208 (2011)801
34. Ni, Y., Hall, A.W., Battenhouse, A., Iyer, V.R.: Simultaneous snp identification and assessment of802
allele-specific bias from chip-seq data. BMC genetics 13(1), 46 (2012)803
35. Giresi, P.G., Kim, J., McDaniell, R.M., Iyer, V.R., Lieb, J.D.: Faire (formaldehyde-assisted isolation of804
regulatory elements) isolates active regulatory elements from human chromatin. Genome research 17(6),805
877–885 (2007)806
36. Michailidou, K., Hall, P., Gonzalez-Neira, A., Ghoussaini, M., Dennis, J., Milne, R.L., Schmidt, M.K.,807
Chang-Claude, J., Bojesen, S.E., Bolla, M.K., et al.: Large-scale genotyping identifies 41 new loci808
associated with breast cancer risk. Nature genetics 45(4), 353–361 (2013)809
37. Turnbull, C., Ahmed, S., Morrison, J., Pernet, D., Renwick, A., Maranian, M., Seal, S., Ghoussaini, M.,810
Hines, S., Healey, C.S., et al.: Genome-wide association study identifies five new breast cancer811
susceptibility loci. Nature genetics 42(6), 504–507 (2010)812
38. Tuch, B.B., Laborde, R.R., Xu, X., Gu, J., Chung, C.B., Monighetti, C.K., Stanley, S.J., Olsen, K.D.,813
Kasperbauer, J.L., Moore, E.J., et al.: Tumor transcriptome sequencing reveals allelic expression814
imbalances associated with copy number alterations. PLoS One 5(2), 9317 (2010)815
39. Ross-Innes, C.S., Stark, R., Teschendorff, A.E., Holmes, K.A., Ali, H.R., Dunning, M.J., Brown, G.D.,816
Gojis, O., Ellis, I.O., Green, A.R., et al.: Differential oestrogen receptor binding is associated with clinical817
outcome in breast cancer. Nature 481(7381), 389–393 (2012)818
40. Beerenwinkel, N., Schwarz, R.F., Gerstung, M., Markowetz, F.: Cancer evolution: mathematical models819
and computational inference. Systematic biology 64, 1–25 (2015).820
41. Morgan, M., Pagès, H., Obenchain, V., Hayden, N.: Rsamtools: Binary alignment (BAM), FASTA, variant821
call (BCF), and tabix file import. R package version 1.18. 2822
42. Lawrence, M., Huber, W., Pages, H., Aboyoun, P., Carlson, M., Gentleman, R., Morgan, M.T., Carey,823
V.J.: Software for computing and annotating genomic ranges. PLoS Comput Biol 9(8), 1003118 (2013)824
43. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L., et al.: Ultrafast and memory-efficient alignment of825
short dna sequences to the human genome. Genome biol 10(3), 25 (2009)826
44. Sherry, S.T., Ward, M.-H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., Sirotkin, K.: dbsnp: the827
ncbi database of genetic variation. Nucleic acids research 29(1), 308–311 (2001)828
45. Li, H., Durbin, R.: Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics829
25(14), 1754–1760 (2009)830
46. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler,831
D., Gabriel, S., Daly, M., et al.: The genome analysis toolkit: a mapreduce framework for analyzing832
next-generation dna sequencing data. Genome research 20(9), 1297–1303 (2010)833
47. DePristo, M.A., Banks, E., Poplin, R., Garimella, K.V., Maguire, J.R., Hartl, C., Philippakis, A.A.,834
Del Angel, G., Rivas, M.A., Hanna, M., et al.: A framework for variation discovery and genotyping using835
next-generation dna sequencing data. Nature genetics 43(5), 491–498 (2011)836
48. Auwera, G.A., Carneiro, M.O., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., Jordan, T., Shakir,837
K., Roazen, D., Thibault, J., et al.: From fastq data to high-confidence variant calls: the genome analysis838
de Santiago et al. Page 34 of 57
toolkit best practices pipeline. Current protocols in bioinformatics, 11–10 (2013)839
49. Ward, L.D., Kellis, M.: Haploreg: a resource for exploring chromatin states, conservation, and regulatory840
motif alterations within sets of genetically linked variants. Nucleic acids research 40(D1), 930–934 (2012)841
de Santiago et al. Page 35 of 57
Figures842
0
2
4
6
0.00 0.25 0.50 0.75 1.00
SNP ID RM Bias RAF
rs2070942 0.51 0.34
rs4951389 0.51 0.69
rs10153729 0.50 0.59
Peak Ref Alt Ref Alt Peak Ref Alt
1 0 1 1 0 0 NA NA
1 10 12 4 7 1 3 4
1 1 0 0 1 0 NA NA
Rep1 + Rep2
TF1
...
TF2
SNP ID Lower Upper Allele Bias
rs2070942 0.57 0.95 Yes
rs4951389 0.13 0.40 Yes
rs10153729 0.07 0.87 No
• (2a) Blacklist regions
• (2a) non-unique regions
• (2a) Collapsed repeats
• (2b) Intrinsic bias
• (2d) Present in all reps• (2c) Only one allele
Bayesian Model
RM Biasbeta distribution
RAF,Ref & Total reads
beta binomial
TFn
x
Prior Likelihood
Posterior Distribution
Het. SNPs
gDNA BAM filesOr
pre-computed RAF scores
1
2
(a) Input data
(b) BaalChIP QC SNP filtering
(c) BaalChIP ASB detection
3
4
5
(d) Output
TFn
ChIP-seqBAM and BED files
Reads & peaks VariantsAllele frequency(for correction)
MAPQ<15base_qual<10
Figure 1 Description of BaalChIP model frame work. (a) The basic inputs for Baal are theChIP-seq raw read counts in a standard BAM alignment format, a BED file with the genomicregions of interest (such as ChIP-seq peaks) and a set of heterozygous SNPs in a tab-delimitedtext file. Optionally, genomic DNA (gDNA) BAM files can be specified for RAF computation,alternatively, the user can specify the pre-computed RAF scores for each variant. (b) The firstmodule of BaalChIP consists of (1) computing allelic read counts for each heterozygous SNP inpeak regions and (2) a round of filters to exclude heterozygous SNPs that are susceptible togenerating artifactual allele-specific binding effects. (3) the Reference Mapping (RM) bias and theReference Allele Frequency (RAF) are computed internally and the output consists of a datamatrix where RM and RAF scores are included alongside the information of allele counts for eachheterozygous SNP. The column ’Peak’ contains binary data used to state the called peaks. (c)The second module of BaalChIP consists of calling ASB binding events. (4) BaalChIP uses abeta-binomial Bayesian model to consider RM and RAF bias when detecting the ASB events. (d)The output from BaalChIP is a posterior distribution for each SNP. A threshold to identify SNPswith allelic bias is specified by the user (default value is a 95% interval). (5) The output ofBaalChIP is a credible interval (’Lower’ and ’Upper’) calculated based on the posteriordistribution. This interval corresponds to the true allelic ratio in read counts (i.e. after correctingfor RM and RAF biases). An ASB event is called if the lower and upper limits of the interval areoutside the 0.4-0.6 interval.
de Santiago et al. Page 36 of 57
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
(a) TF = 3, RAF = 0.5, Reads per TF = 1
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
(b) TF = 3, RAF = 0.3, Reads per TF = 1
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
(c) TF = 3, RAF = 0.1, Reads per TF = 1
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
(d) TF = 5, RAF = 0.5, Reads per TF = 8
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
(e) TF = 5, RAF = 0.3, Reads per TF = 8
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
(f) TF = 5, RAF = 0.1, Reads per TF = 8
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
(g) TF = 15, RAF = 0.5, Reads per TF = 15
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
(h) TF = 15, RAF = 0.3, Reads per TF = 15
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
(i) TF = 15, RAF = 0.1, Reads per TF = 15
Baal−ChIP Binomial Test iASeq
Figure 2 The ROC curve comparison between BaalChIP and other allele-specific SNP finding
methods: binomial test and iASeq, using a simulated data set. The BaalChIP result is shown by
solid red line. Binomial test and iASeq are shown in dashed blue and black lines. The number of
TFs able to bind at a given SNP and reads per TF are increasing from 3 to 15 and 1 to 15,
respectively. RAF is decreasing from 0.5 to 0.1.
de Santiago et al. Page 37 of 57
MCF7
0 125 250
00.
51
cor=0.704
0 0.5 1
0.2
0.5
0.8
chr1 position (M)
AR (ChIP−seq)
B A
llele
Fre
qR
AF
a
b
K562
0 125 250
00.
51
cor=0.525
0 0.5 1
0.2
0.5
0.8
chr1 position (M)
AR (ChIP−seq)
B A
llele
Fre
qR
AF
a
b
SKNSH
0 125 250
00.
51
cor=0.001
0 0.5 1
0.2
0.5
0.8
chr1 position (M)
AR (ChIP−seq)
B A
llele
Fre
qR
AF
a
b
GM12878
0 125 250
00.
51
cor=−0.036
0 0.5 1
0.2
0.5
0.8
chr1 position (M)
AR (ChIP−seq)
B A
llele
Fre
qR
AF
a
b
Figure 3 Example cancer and non-cancer cell lines SNP and ChIP-seq ENCODE data. (a) B allele
frequencies (BAF) for chromosome 1 for three cancer cell lines (MCF-7, K562, SK-N-SH) and one
non-cancer cell line (GM12878). Individual SNPs are colored according to genotype values:
homozygous AA or BB (blue) and heterozygous AB (orange). (b) Correlations between the BAF
values and the ChIP-seq allelic ratios (AR) of heterozygous SNPs. RAF corresponds to the BAF
value with respect to the reference allele (RAF is equal to BAF if the reference allele corresponds
to the B allele; RAF is equal to 1-BAF if the reference allele corresponds to the A allele). The
fitted linear model (blue line) and the Spearman correlation coefficient (cor) show the relationship
between BAF and ChIP-seq allelic ratios at heterozygous sites.
de Santiago et al. Page 38 of 57
0.0 0.2 0.4 0.6 0.8 1.0
0.2
0.4
0.6
0.8
1.0
AR (FAIRE−seq)
AR
(gD
NA
)
cor=0.422
MDAMB134
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0 cor=0.706
T47Da
b
Allelic Ratio (AR)
densi
ty
0
3
6
9
0 0.25 0.50 0.75 10 0.25 0.50 0.75 1
Corrected ARFAIRE-seq AR
MDAMB134 T47D
Figure 4 ASB detection from FAIRE targeted sequencing data. (a) Correlations between the
allelic ratios obtained from genomic DNA and FAIRE-seq data. (b) Density plots showing the
distribution of allelic ratios before (green) and after (orange) BaalChIP correction. The adjusted
AR values were estimated by BaalChIP model after taking into account the RAF scores computed
directly from the control genomic DNA samples.
de Santiago et al. Page 39 of 57
ENCODE FAIRE-seq
[0.4
7 -
0.5
3]
[0.4
5-0
.47
)(0
.53
- 0
.55
]
[0.4
-0.4
5)
(0.5
5 -
0.6
]
[0.3
5-0
.4)
(0.6
- 0
.65
]
[0.3
-0.3
5)
(0.6
5 -
0.7
]
[0-0
.3)
(0.7
- 1
]
[0.4
7 -
0.5
3]
[0.4
5-0
.47
)(0
.53
- 0
.55
]
[0.4
-0.4
5)
(0.5
5 -
0.6
]
[0.3
5-0
.4)
(0.6
- 0
.65
]
[0.3
-0.3
5)
(0.6
5 -
0.7
]
[0-0
.3)
(0.7
- 1
]
Relative allele frequency
ENCODE FAIRE-seq
Depth of Coverage
a
b>
60
0
> 1
.5e+
03
Figure 5 Comparison of BaalChIP with other available methods. Left y-axis corresponds to the
frequency of ASB events called by BaalChIP or the binomial tests (red, blue and purple lines).
Right y-axis corresponds to the mean of the posterial probability given by the iASeq method
(black line). Small numbers at the top of the plot show the total number of tested heterozygous
sites in each bin. (a) SNPs were grouped in bins of different RAF intervals. The RAF intervals are
increasing in terms of distance to the diploid (RAF=0.5) value. The bionomial test (without RAF
correction; purple line) and the iASeq methods (black line) are biased towards the detection of
ASB events in regions of altered copy numbers. (b) SNPs were grouped in bins of different depth
of coverage. SNPs in regions RAF < 0.4 or RAF > 0.6 were excluded from this analysis. When
applying the binomial test (purple and blue lines), the frequency of ASB detection increases for
higher covered sites, while the same is not true when applying BaalChIP or the iASeq methods.
The effect is particularly visible for the FAIRE-seq dataset.
Tables843
de Santiago et al. Page 40 of 57
AC
C ARead
cou
nts
SNP
CC
C
AAAAAA
A
C
SNP
ChIP-seq read out:
Aneuploid region:
AAAAAA
C
AC
TFRegulatory variant in a diploid region:
Allele 1:Allele 2:
Allele 2:
Allele 1(amplicon)
Figure S1 Example illustration of a ChIP-seq read out at a DNA-binding site when a true
regulatory difference exists between two alleles (A allele depicted in orange and C allele depicted
in blue), and when there is no regulatory difference between the two alleles. In a diploid sample,
the protein (yellow circle) binds preferentially to the A allele. Here the regulatory effect is
observed as an imbalance in the allelic ratios obtained from ChIP-seq read counts, with a higher
number of reads carrying the A allele. In the presence of allele-specific copy number aberrations
such as an amplicon affecting the A allele, the direct ChIP-seq readout may reflect the relative
presence of the alleles, rather than the regulatory effect of the single nucleotide variant.
Consequently, in the presence of copy number changes, ChIP-seq allelic ratios are not sufficient to
uncover true cis-regulatory effects.
de Santiago et al. Page 41 of 57
Table 1 Heterozygous SNPs identified as allele-specific and the correspondent cell-lines and alleles inthe FAIRE-seq dataset. CHROM, POS correspond to the coordinates in the hg19 human referencegenome.
ID CHROM POS REF ALT ASB allele Allelic Imbalance GeneName Location
rs2373062 2 218322137 G C T47D-G 0.76 DIRC3 intron
rs9682063 3 4737842 A G T47D-G 0.34 ITPR1 intron
rs9292122 5 56087910 G A T47D-G 0.81 intergenic
rs201314815 7 144115683 C T T47D-T 0.28 intergenic
rs201949580 7 144115698 T G T47D-G 0.30 intergenic
rs200312388 7 144115704 T G T47D-G 0.29 intergenic
rs200496470 8 128323987 A G T47D-G 0.15 CASC8 intron
rs111929748 11 69379287 G T T47D-T 0.32 intergenic
rs12939887 17 53162672 A G T47D-A 0.67 STXBP4 intron
rs72650670 1 114394681 G A MDAMB134-G 0.76 PTPN22 coding
rs116782319 5 158260122 A T MDAMB134-T 0.33 EBF1 intron
rs12670370 7 144118594 C A MDAMB134-C 0.79 intergenic
rs2392922 8 129166840 A G MDAMB134-A 0.79 intergenic
rs17793170 8 129205679 G A MDAMB134-A 0.33 intergenic
rs2666763 10 22101608 G A MDAMB134-A 0.29 DNAJC1 intron
rs2941733 10 22241656 T C MDAMB134-T 0.74 DNAJC1 intron
rs4752570 10 123337066 T C MDAMB134-C 0.32 FGFR2 intron
rs2912780 10 123337117 C T MDAMB134-C 0.69 FGFR2 intron
rs2912779 10 123337182 T C MDAMB134-T 0.72 FGFR2 intron
rs2981579 10 123337335 A G MDAMB134-A 0.71 FGFR2 intron
rs2912778 10 123338654 G A MDAMB134-G 0.73 FGFR2 intron
rs11200017 10 123339685 G A MDAMB134-A 0.26 FGFR2 intron
rs2981578 10 123340311 C T MDAMB134-C 0.88 FGFR2 intron
rs11599804 10 123340664 G A MDAMB134-A 0.16 FGFR2 intron
rs34032268 10 123341525 A C MDAMB134-C 0.19 FGFR2 intron
rs2936870 10 123348902 T C MDAMB134-T 0.69 FGFR2 intron
rs3135718 10 123353869 C T MDAMB134-C 0.67 FGFR2 intron
rs112536831 19 17424329 G A MDAMB134-G 0.75 DDA1 intron
rs113207944 19 17438923 C T MDAMB134-C 0.74 ANO8 intron
rs10416361 19 44294617 A G MDAMB134-A 0.75 intergenic
de Santiago et al. Page 42 of 57
�µ
⌘ ⇢
⇤ an dn
N
Figure S2 Graphical model of BaalChIP. The gray circles represent given variables and white
circles are unknown variables. The direction of arrows indicates statistical dependency in the
model. The observed reference allele read count is an, total read counts is dn. n is the index of
given factor (e.g. transcription factor) binding to the region of the SNP of interest. The
rectangular plane and the capital N indicates that there are N factors binding to the SNP. The
primary variable of interest is η representing allelic balance ratio. Λ is the precision parameter.
The reference allele frequency at the SNP is ρ. dn, η, ρ,Λ are the parameters of a beta-binomial
distribution that governs the uncertainty in an. Finally, the reference mapping bias µ and λ are
the mean and variance of a beta distribution which serves as a prior over η.
de Santiago et al. Page 43 of 57
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
(a) RAF = 0.5 & Reads per TF = 15
TF = 1TF = 5TF = 10TF = 20TF = 30
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
False Positive Rate
True
Pos
itive
Rat
e
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
(b) RAF = 0.5 & TF = 1
reads = 3reads = 30reads = 250reads = 520reads = 1000
Figure S3 BaalChIP simulation performance with varying number of transcription factors (a)
and read depth (b).
de Santiago et al. Page 44 of 57
ATF
3B
CLA
F1
CB
X3
CE
BP
BC
EB
PD
CR
EB
1C
TC
FC
TC
FL
E2F
6E
GR
1E
LF1
ET
S1
FO
SL1
GA
BPA
GAT
A2
MA
XM
EF
2AN
R2F
2N
RS
FP
ML
PO
L2P
U1
RA
D21
SIN
3AK
20S
IX5
SP
1S
P2
SR
FS
TAT
5ATA
F1
TAF
7T
EA
D4
TH
AP
1T
RIM
28U
SF
1Y
Y1
ZB
TB
33Z
BT
B7A
BA
F15
5B
AF
170
BR
CA
1cJ
UN
CH
D2
ELK
1G
CN
5G
TF
2F1
HC
FC
1IN
I1IR
F3
JUN
DM
AF
KM
AZ
MX
I1N
RF
1P
300
PR
DM
1C
OR
ES
TR
FX
5S
MC
3S
PT
20S
TAT
3T
BP
US
F2
ZK
SC
AN
1Z
NF
143
BC
L3F
OS
L2F
OX
A2
TC
F12
FO
XM
1G
ATA
3H
DA
C2
FO
XA
1B
HLH
E40
HN
F4A
HN
F4G
MB
D4
MY
BL2
NF
ICR
XR
AZ
EB
1P
BX
3cF
OS
cMY
CAT
F2
BC
L11A
NA
NO
GP
OU
5F1
SP
4B
ATF
EB
F1
IRF
4M
EF
2CM
TA3
NFA
TC
1PA
X5
PO
U2F
2R
UN
X3
TC
F3
CH
D1
K562
HeLa
A549
MCF7
T47D
HepG2
SKNSH
HL60
MCF10
H1hESC
GM12878
GM12891
GM12892
IMR90
DNA−binding proteins
Figure S4 Data matrix summarizing the ENCODE samples used in this study for allele-specific
binding analyses. More detailed description including assession numbers can be found in Table S2
and Table S3.
de Santiago et al. Page 45 of 57
75
80
85
90
95
30 35 40 45 50read length
Rightcalls(%)
0.00
0.25
0.50
0.75
1.00
A549
GM12878
GM12891
GM12892
H1hESC
HeLa
HepG2
HL60
IMR90
K562
MCF10
MCF7
SKNSH
T47D
cell line
Filtered/Total(%)
keyblacklist
highcov
mappability
intbias
replicates_merged
Only1Allele
pass
pass
Only1Allele
replicates_merged
intbias
mappabilityhighcovblacklista
b
c
Figure S5 BaalChIP SNP filtering. (a) Percentage of SNPs with the correct number of aligned
reads based on simulations of reads of different read lengths (28 to 50 mer). The percentage of
correct calls increased with read length. (b) Proportion of SNPs that were filtered out in each
filtering step. (b) The average proportion of SNPs that were filtered out in each filtering step.
Average was calculated accross all 14 ENCODE cell lines.
de Santiago et al. Page 46 of 57
cancer, A549 cancer, HeLa cancer, HepG2 cancer, HL60
cancer, K562 cancer, MCF7 cancer, SKNSH cancer, T47D
non−cancer, GM12878 non−cancer, GM12891 non−cancer, GM12892 non−cancer, H1hESC
non−cancer, IMR90 non−cancer, MCF10
0.2
0.4
0.6
0.8
0.2
0.4
0.6
0.8
0.2
0.4
0.6
0.8
0.2
0.4
0.6
0.8
0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00AR (ChIP-seq)
RA
F
0.0
0.2
0.4
0.6
0.0 0.2 0.4 0.6Correlation ChIP−seq AR vs. RAF scores
Pro
port
ion
ofsi
tes
inC
NA
regi
ons
celltype
cancer
non−cancer
a
b
Figure S6 Correlation between the BAF values and the ChIP-seq allelic ratios (AR). (a) Scatter
plots showing the correlation between RAF and AR at each SNP, for all considered cell lines of the
ENCODE dataset. RAF corresponds to the BAF value with respect to the reference allele (RAF is
equal to BAF if the reference allele corresponds to the B allele; RAF is equal to 1-BAF if the
reference allele corresponds to the A allele). The blue line shows the fitted linear regression line.
(b) Positive relationship between the Spearman correlation coefficient (obtained from the
correlation of AR with RAF scores) and the proportion of sites in copy-number altered (CNA)
regions. Each dot in the scatter plot corresponds to a cell line. Sites in copy number changed
regions correspond to heterozygous SNPs with BAF score < 0.4 or BAF > 0.6.
de Santiago et al. Page 47 of 57
●●●
●
●●
●
●
●
●●
●
●
●
●
●
0.025
0.050
0.075
0.100
0.125
0.0 0.2 0.4 0.6
●●
●
●
●
●●
●
●
●●
●
●
●
●
●0.05
0.10
0.15
0.20
0.0 0.2 0.4 0.6
Without RAF correction
With RAF correction
Cancer cell line
Non-cancer cell line
Pro
port
ion
of
ASB
site
sProportion of testedsites in CNA regions
Allelic Ratio (AR)
dens
ity
ChIP-seq AR (not corrected)
Corrected AR
K562 HeLa A549
MCF7 T47D HepG2
SKNSH HL60 MCF10
H1hESC GM12878 GM12891
GM12892 IMR90
0
1
2
3
0
1
2
3
0
1
2
3
0
1
2
3
0
1
2
3
0
1
2
3
4
0
1
2
3
0
1
2
3
0
1
2
3
0
1
2
3
4
0
1
2
3
4
0
1
2
3
0
1
2
3
0
1
2
3
4
0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00
0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00
0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00
0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00
0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00
b
C
a
Pro
port
ion
of
ASB
site
s
Figure S7 Effects of BaalChIP adjustment of the allelic ratios (AR) after correcting for the
background Reference Allele Frequency (RAF) bias. (a) Density plots showing the distribution of
allelic ratios before (green) and after (orange) BaalChIP correction. The ARs before correction
were calculated directly from the ChIP-seq data and refer to the number of reads in the reference
allele divided by the total number of reads. The corrected AR values were estimated by BaalChIP
model after taking into account the RAF bias. The adjustment of AR is more dramatic in cancer
cell lines. (b) Proportion of ASB sites detected with (orange) and without (green) RAF correction
as a function of the proportion of sites within copy-number altered (CNA) regions. Each dot refers
to a cell line. Sites in regions of copy number change refer to any SNP with a BAF score higher
than 0.6 or lower than 0.4. Cancer cell lines have a higher proportion of SNPs in CNA regions and
benefit more from BaalChIP correction. (c) Same as (b) but after selecting SNPs with 30X-40X
coverage.
de Santiago et al. Page 48 of 57
0.00
0.04
0.08
0.12
20 30 40 50 20 30 40 50
Average coverage at tested sites
Pro
port
ion
of
AS
B s
ites
With RAF correction Without RAF correction
cell type
cancernon-cancer
Figure S8 Proportion of identified ASB SNPs per cell line. Correlation between the proportion of
identified ASB SNPs and the average coverage at all assayed heterozygous SNPs. The general
correlation is expected due to the higher power to detect ASB when coverage is high.
de Santiago et al. Page 49 of 57
0.0
0.2
0.4
0.0
0.1
0.2
0.000.010.020.030.04
0.00
0.05
0.10
0.15
0.0
0.1
0.2
0.3
0.0
0.1
0.2
0.3
GM
12878G
M12892
IMR
90M
CF
10M
CF
7S
KN
SH
chr1
chr1
0ch
r11
chr1
2ch
r13
chr1
4ch
r15
chr1
6ch
r17
chr1
8ch
r19
chr2
chr2
0ch
r21
chr2
2ch
r3ch
r4ch
r5ch
r6ch
r7ch
r8ch
r9ch
rX
Pro
port
ion
ofA
SB
site
s
cancernon−cancer
Figure S9 Higher rates of ASB on chromosome X of female cell lines than in autosomal
chromosomes.
de Santiago et al. Page 50 of 57
A549, FOSL2 A549, FOXA2 A549, GABPA A549, J UND A549, NRSF A549, P300 A549, POL2 A549, SIX5 A549, TCF12 GM12878, ATF2 GM12878, BATF GM12878, BCL11A GM12878, BCL3 GM12878, CREB1
GM12878, EBF1 GM12878, EGR1 GM12878, ELF1 GM12878, FOXM1 GM12878, GABPA GM12878, IRF4 GM12878, MEF2A GM12878, MTA3 GM12878, NFIC GM12878, NRSF GM12878, PAX5 GM12878, PBX3 GM12878, POL2 GM12878, POU2F2
GM12878, PU1 GM12878, RAD21 GM12878, RUNX3 GM12878, SIX5 GM12878, SP1 GM12878, SRF GM12878, STAT5A GM12878, TAF1 GM12878, TCF12 GM12878, TCF3 GM12878, YY1 GM12891, POL2 GM12891, POU2F2 GM12891, PU1
GM12891, TAF1 GM12891, YY1 GM12892, POL2 GM12892, YY1 H1hESC, CREB1 H1hESC, CTCF H1hESC, E2F6 H1hESC, EGR1 H1hESC, MAX H1hESC, NRSF H1hESC, POL2 H1hESC, RAD21 H1hESC, SIN3AK20 H1hESC, SP1
H1hESC, SRF H1hESC, TAF1H1hESC, TCF12 H1hESC, TEAD4 H1hESC, USF1 H1hESC, YY1 HeLa, BRCA1 HeLa, CEBPB HeLa, CHD2 HeLa, cJ UN HeLa, GTF2F1 HeLa, HCFC1 HeLa, J UND HeLa, MAFK
HeLa, MAX HeLa, MXI1 HeLa, P300 HeLa, POL2 HeLa, PRDM1 HeLa, RAD21 HeLa, RFX5 HeLa, SMC3 HeLa, STAT3 HeLa, TBP HeLa, USF2 HepG2, CEBPB HepG2, CREB1 HepG2, CTCF
HepG2, ELF1 HepG2, FOSL2 HepG2, FOXA1 HepG2, FOXA2 HepG2, GABPA HepG2, HNF4A HepG2, HNF4G HepG2, J UND HepG2, MAX HepG2, NR2F2 HepG2, P300 HepG2, POL2 HepG2, RAD21 HepG2, RXRA
HepG2, SRF HepG2, TAF1 HepG2, USF1 HepG2, YY1 HL60, GABPA HL60, NRSF HL60, POL2 HL60, PU1 IMR90, CEBPB IMR90, CTCF IMR90, MAFK IMR90, MXI1 IMR90, POL2 IMR90, RAD21
K562, ATF3 K562, CEBPB K562, CEBPD K562, CREB1 K562, CTCF K562, E2F6 K562, EGR1 K562, ELF1 K562, ETS1 K562, FOSL1 K562, GABPA K562, GATA2 K562, MAX K562, MEF2A
K562, NR2F2 K562, NRSF K562, PML K562, POL2 K562, PU1 K562, RAD21 K562, SP1 K562, SRF K562, STAT5A K562, TAF1 K562, TEAD4 K562, TRIM28 K562, USF1 K562, ZBTB7A
MCF10, cFOS MCF10, cMYC MCF10, POL2 MCF10, STAT3 MCF7, CEBPB MCF7, CTCF MCF7, EGR1 MCF7, ELF1 MCF7, FOSL2 MCF7, FOXM1 MCF7, GABPA MCF7, GATA3 MCF7, HDAC2 MCF7, J UND
MCF7, MAX MCF7, NR2F2 MCF7, NRSF MCF7, P300 MCF7, PML MCF7, RAD21 MCF7, SIN3AK20 MCF7, SRF MCF7, TAF1 MCF7, TCF12 MCF7, TEAD4 SKNSH, ELF1 SKNSH, FOSL2 SKNSH, FOXM1
SKNSH, GABPA SKNSH, GATA3 SKNSH, J UND SKNSH, MAX SKNSH, NFIC SKNSH, NRSF SKNSH, P300 SKNSH, PBX3 SKNSH, POL2 SKNSH, RXRA SKNSH, TCF12 SKNSH, TEAD4 SKNSH, USF1 SKNSH, YY1
SKNSH, ZBTB33 T47D, CTCF T47D, FOXA1 T47D, GATA3 T47D, P300
0
0.5
1
0 0.5 1Allelic Ratio (Replicate 1)
Alle
lic R
ati
o (
Replic
ate
2)
0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1
0
0.5
1
0
0.5
0
0.5
1
0
0.5
1
0
0.5
1
1
0
0.5
1
0
0.5
1
0
0.5
0
0.5
1
0
0.5
1
0
0.5
1
1
0
0.5
1
0
0.5
1
Figure S10 Consistency of ASB across biological replicates. Each dot in the scatter plot
represents the allelic ratio at a single SNP obtained for two replicated samples from the same
ChIP-seq experiment.
de Santiago et al. Page 51 of 57
K562 HeLa A549 MCF7
T47D HepG2 SKNSH HL60
MCF10 H1hESC GM12878 GM12891
GM12892 IMR90
0.0
0.5
1.0
0.0
0.5
1.0
0.0
0.5
1.0
0.0
0.5
1.0
0.0 0.5 1.00.0 0.5 1.0Allelic Ratio (Protein X)
Alle
lic R
atio
(P
rote
in Y
)
Figure S11 Consistency of ASB across pairs of different assayed proteins (Protein "X" and "Y"),
within the same cell line. Each dot in the scatter plot represents the allelic ratio at a single SNP
obtained for distinct ChIP-seq experiments in the same cell line (i.e distinct ChIPed proteins
bound at the same site).
de Santiago et al. Page 52 of 57
0.4
0.6
0.8
Rep sameCell
Spearman
correlationcoefficient
(124)
(18)
(5)(1) (1)
0
50
100
2 3 4 5 6Number of cell lines
Num
berofshared
ASBsites
a b
Figure S12 (a) Distribution of Spearman correlation coefficients obtained for the correlation of
allelic ratios obtained for pairs of biological replicates (Rep), and between distinct proteins bound
at the same site (sameCell). The allelic ratios are highly correlated in both scenarios. (b) Number
of shared ASB SNPs across cell lines. Numbers in parenthesis show the number of ASB SNPs
shared between 2 to 6 cell lines. 149 ASB SNPs (124+18+5+1+1) were concomitantly detected
in two or more cell lines.
de Santiago et al. Page 53 of 57
0
10
20
30
40
50 ASB
AssayedPe
rcenta
ge o
f O
verl
ap
coding
5' UTR
intergenicintro
n
promoter
3' UTR
Figure S13 Percentage of ASB SNPs overlapping gene annotations (5’ UTR, 3’ UTR, coding,
intergenic, intron or promoter). The largest proportion of ASB SNPs was found in introns and
intergenic regions. A similar distribution pattern is observed when considering all assayed
heterozygous SNPs.
de Santiago et al. Page 54 of 57
0
25
50
75K27ac
k4me1 E
K27ac WE E
K27ac WE E
K27ac WE E
K27ac WE E
K27ac WE
Cell−specific chromatin features
Percentageofoverlap
A549 GM12878 H1hESC HeLa HepG2 K562
*
0
25
50
75
A549
GM12878
H1hESC
HeLa
HepG2
K562
Cell−specific chromatin features (merged ranges)
Percentageofoverlap
observed expectedb
a
Figure S14 Percentage of ASB SNPs overlapping putative enhancer regions from the matched
cell line. (a) Putative enhancer regions were retrieved from publicly available ENCODE datasets
on H3K27ac and H3K4me1 modified regions, and previously predicted weak enhancers (WE) and
strong enhancer (E) sites obtained from Segway chromatin state annotations of the ENCODE
data. (b) Comparison of the observed and expected number of ASB SNPs that map to enhancer
regions. A significant difference (*) was only detected in HeLa cells.
de Santiago et al. Page 55 of 57
SKNSH T47D
IMR90 K562 MCF10 MCF7
H1hESC HeLa HepG2 HL60
A549 GM12878 GM12891 GM12892
ASB Assayed ASB Assayed
ASB Assayed ASB Assayed
0
5
10
15
20
25
0
5
10
15
20
25
0
5
10
15
20
25
0
5
10
15
20
25
abs(LOD(alt)−LOD(ref))
group
ASB
Assayed
p=0.63
p=0.81
p=0.56
p=0.85
p=0.84
p=0.99
p=0.97
p=0.63
p=0.94
p=0.89
p=0.81
p=0.6
p=0.98
p=0.03
SKNSH T47D
IMR90 K562 MCF10 MCF7
H1hESC HeLa HepG2 HL60
A549 GM12878 GM12891 GM12892
ASB Assayed ASB Assayed
ASB Assayed ASB Assayed
0
5
10
15
20
25
0
5
10
15
20
25
0
5
10
15
20
25
0
5
10
15
20
25
abs(LOD(alt)−LOD(ref))
group
ASB
Assayed
a
b
Figure S15 Distribution of absolute DELTA scores (deltaLOD) of motif-disrupting SNPs with
(a) deltaLOD > 0 and (b) deltaLOD > 5. The DELTA score corresponds to the change in the
PWM score between SNP alleles (DELTA = LOD(ref) - LOD(alt)). P-values correspond to the
Kolmogorov-Smirnov test when comparing the distribution of DELTA scores between
motif-disrupting ASB SNPs with all motif-disrupting SNPs.
de Santiago et al. Page 56 of 57
ENCODE (ChIP−seq)
10100
1000
K562
HeLa
A549
MCF7
T47D
HepG2
SKNSH
HL60
MCF10
H1hESC
GM12878
GM12891
GM12892
IMR90
Depth
FAIRE−seq gDNA
10100
1000
MDA134
T47D
MDA134
T47D
Depth
a b
Figure S16 Distribution of depth of coverage at tested heterozygous SNPs in (a) ENCODE
ChIP-seq datasets (pooled data across all ChIP-seq samples in each cell line) and (b) targeted
sequencing FAIRE-seq datasets (pooled data for three replicates).
de Santiago et al. Page 57 of 57
Additional Files844
Additional file 1 — supplementary_tables.xlsx845
Supplementary tables S1 - S8.846