Evaluation of Paired-end Sequencing Strategies for Detection of … · 2017. 1. 9. · Paired-end sequencing is emerging as a key technique for assessing genome rearrangements and

Evaluation of Paired-end Sequencing Strategies for Detection of

Genome Rearrangements in Cancer

Ali Bashir∗ Stanislav Volik† Colin Collins† Vineet Bafna‡ Benjamin J. Raphael§

June 1, 2008

Abstract

Paired-end sequencing is emerging as a key technique for assessing genome rearrangements and structuralvariation on a genome-wide scale. This technique is particularly useful for detecting copy neutral rearrange-ments such as inversions and translocations, which are common in cancer, and can produce novel fusiongenes. We address the question of how much sequencing is required to detect rearrangement breakpointsand to localize them precisely using both theoretical models and simulation. We derive a formula for theprobability of a fusion gene given a set of observed paired-end sequences and use this formula to compute fu-sion gene probabilities in several breast cancer samples. We find that we are able to accurately predict fusiongenes in these samples with a relatively small number of fragments of large sizes. We further demonstratehow the ability to detect fusion genes depends on the distribution of lengths of human genes, and evaluatehow different parameters of a sequencing strategy impact breakpoint detection, breakpoint localization, andfusion gene detection, even in the presence of errors that suggest false rearrangements. These results will beuseful in calibrating future cancer sequencing efforts, particularly large-scale studies of many cancer genomesthat are enabled by new massively parallel sequencing technologies.

Introduction

Cancer is a disease driven by selection for somatic mutations. These mutations range from single nucleotidechanges to large-scale chromosomal aberrations such as deletion, duplications, inversions and translocations.While many such mutations have been cataloged in cancer cells via cytogenetics, gene resequencing, and array-based techniques (i.e. comparative genomic hybridization) there is now great interest in using genome sequencingto provide a comprehensive understanding of mutations in tumor genomes. One such sequencing initiative, TheCancer Genome Atlas (http://cancergenome.nih.gov/index.asp) focuses sequencing efforts in the pilot phaseon point mutations in coding regions. However, the pilot phase of this initiative largely ignores copy neutralgenome rearrangements including translocations and inversions. Such rearrangements can create novel fusiongenes, as observed in leukemias, lymphomas, and sarcomas[1, 2, 3]. The canonical example of a fusion gene isBCR-ABL that results from a characteristic translocation (termed the “Philadelphia chromosome”) in manypatients with chronic myelogenous leukemia (CML)[3]. The advent of Gleevec, a drug targeted to the BCR-ABLfusion gene, has proven successful in treatment of CML patients [4], invigorating the search for other fusiongenes which might provide tumor-specific biomarkers or drug targets.

Until recently, it is was generally believed that translocations and resulting gene fusions were solely recurrentin hematological disorders and sarcomas, with few suggesting that such recurrent events were prevalent acrossall tumor types, including solid tumors[5, 6]. This view has been increasingly challenged with the discovery of afusion between the TMPRSS2 gene and several members of the ERG protein family in prostate cancer[7] andthe EML4-ALK fusion in lung cancer[8].

These studies raise the question of what other recurrent rearrangements remain to be discovered and em-phasize the importance of developing a systematic, genome-wide approach for detecting fusion genes and other∗Univ. of California, San Diego, Bioinformatics Graduate Program†Univ. of California, San Francisco, Comprehensive Cancer Center.‡Univ. of California, San Diego, Dept. of Computer Science and Engineering§Brown Univ., Dept. of Computer Science and Center for Computational Molecular Biology

1

large scale rearrangements. One strategy for high-resolution identification of rearrangements involves paired-endsequencing of clones, or other fragments of genomic DNA, from tumor samples. The resulting end-sequencepairs, or paired reads, are mapped back to the reference human genome sequence. If the mapped locations ofthe ends of a clone are “invalid” (i.e. have abnormal distance or orientation) then a genomic rearrangementis suggested (See Figure 1 and Methods). This strategy was initially described in the End Sequence Profil-ing approach[9] and is currently also used to assess genetic structural variation[10]. An innovative approachutilizing SAGE-like sequencing of concatenated short paired-end tags successfully identified fusion transcriptsin cDNA libraries[11]. Present and forthcoming next-generation DNA sequencers hold promise for extremelyhigh-throughput sequencing of paired-end reads. For example, the Illumina Genome Analyzer will soon be ableto produce millions of paired reads of approximately 30bp from fragments of length 500-1000bp[12], while theSOLiD system from Applied Biosystems promises 25bp reads from each end of size selected DNA fragment ofmany sizes[13]. Similar strategies coupling the generation of paired-end tags with 454 sequencing have also beendescribed[14, 15].

Whole genome paired-end sequencing approaches allow for a genome-wide survey of all potential fusiongenes and other rearrangements in a tumor. This approach holds several advantages over transcript or proteinprofiling in cancer studies. First, discovery of fusion genes using mRNA expression[7], cDNA sequencing, or massspectrometry[16] depends on the fusion genes being transcribed under the specific cellular conditions present inthe sample at the time of the assay. These conditions might be different than those experienced by the cellsduring tumor development. Second, measurement of fusions at the DNA sequence level focuses on gene fusionsdue to genomic rearrangements and thus is less impeded by splicing artifacts or trans splicing[17]. Finally, genomesequencing can identify more subtle regulatory fusions that result when the promoter of one gene is fused tothe coding region of another gene, as in the case with with the c-Myc oncogene fusion with the immunoglobingene promoter in Burkitt’s lymphoma[18].

In this paper, we address a number of theoretical and practical considerations for assessing tumor genomeorganization using paired-end sequencing approaches. We are largely concerned with detecting rearrangementbreakpoints, where a pair of non-adjacent coordinates in the reference genome are adjacent (i.e. fused) inthe tumor genome. In particular, we extend this idea of a breakpoint to examine the ability to detect fusiongenes. If a rearrangement in the tumor genome is identified by detecting a clone1 with end sequences mappingto distant locations, does this rearrangement lead to formation of a fusion gene? Obviously, sequencing theclone will answer this question, but this requires additional effort/cost and (in the case of most next-generationsequencing technologies which do not “archive” the genome in a clone library) may be problematic. We showhow to use evidence from multiple clones and the empirical distribution of clone sizes to derive a formula forthe probability of fusion between a pair of genomic regions (e.g. genes) given the set of all mapped clones. Suchestimates are useful for prioritizing follow-up experiments to validate fusion genes. In a test experiment on theMCF7 cell-line, 3,201 pairs of genes were found near clone with aberrantly mapping end-sequences. However,our analysis revealed only 18 pairs of genes with a high probability2 of fusion, of which six were tested and fiveexperimentally confirmed.

The advent of high throughput sequencing strategies raise important questions with respect to experimentaldesign for understanding tumor genome organization. Obviously, sequencing more clones improves the prob-ability of detecting fusion genes and breakpoints. However, even with the latest sequencing technologies, itwould not be practical nor cost effective to shotgun sequence and assemble the genomes of thousands of tumorsamples. Thus, it is important to optimize to maximize the probability of detecting fusion genes with theleast amount of sequencing. This probability depends on multiple factors including the number and length ofend-sequenced clones, the length of genes that are fused, and possible errors in breakpoint localization. Here,we derive (theoretically and empirically) many formulae which elucidate the trade-offs in experimental design ofboth current and next-generation sequencing technologies. Using our probability calculations and simulations,we find that even with current paired-end technology we can obtain extremely high probability of breakpointdetection with a very low number of reads. For example, more than 90% of all breakpoints can be detectedwith paired-end sequencing of less than 100,000 clones (Table 2). Additionally, next-generation sequencers canpotentially detect rearrangements with a greater than 99% probability and localize the breakpoints of theserearrangements to less than 300bp in a single run of the machine (Table 2).

1A number of next-generation sequencing technologies do not use traditional cloning vectors to amplify DNA. For the sake ofsimplicity we will use the term “clone”to refer to any contiguous fragment that is sequenced from both ends.

2Probability greater than .5

2

(a)

Tumor Genome

L= (a-xC) + (b-yC)

yCa bxC

Gene VGene U

Normal Genome

Fused Gene Pair

(b) Genomic Position

Ge

no

mic

Po

sitio

n

Clone size = 0

Clone size = LMin

Clone size = LMax

xC

yC

a

bG

en

e V

Gene U

Figure 1: Schematic of Breakpoint Calculation. (a) The endpoints of a clone C from the tumor genome map tolocations xC and yC (joined by an arc) on the reference genome that are inconsistent with C being a contiguous pieceof the reference genome. This configurations indicates the presence of a breakpoint (a, b) that fuse at ζ in the tumorgenome. (b) The coordinates (a, b) of the breakpoint are unknown but lie with the trapezoid described by equation (1).The observed length of the clone is given by LC = (a− xC) + (b− yC). The rectangle U × V describes the breakpointsthat lead to a fusion between genes U and V .

3

Results

Computing probability of a fusion gene

We consider the tumor genome as a rearranged version of the reference human genome and assume that thereexists a mapping between coordinates of the two genomes. The reference genome is described by a single intervalof length G; i.e. we concatenate multiple chromosomes into a single coordinate system. Define a breakpoint(a, b) as a pair of non-adjacent coordinates a and b in the reference genome that are adjacent in the tumorgenome. Correspondingly, we define the fusion point3 as the coordinate ζ in the tumor genome such that thepoint a maps to ζ and the point b maps to ζ+1. Consider a clone C containing ζ. If the breakpoints a and b arefar apart (e.g. on different chromosomes) then the endpoints of C will map to two locations xC and yC on thereference genome that are inconsistent with C being a contiguous fragment of the reference genome (Figure 1a).In this case, we say that (xC , yC) is an invalid pair [20]. Observing an invalid pair (xC , yC) does not identify thebreakpoint (a, b) exactly. However, if we know that the length of the clone C lies within the range [Lmin, Lmax],and we assume that: (i) only a single breakpoint is contained in a clone; and (ii) a > xC and b > yC (withoutloss of generality: See Methods); then breakpoint (a, b) that are consistent with (xC , yC) must satisfy

Lmin ≤ (a− xC) + (b− yC) ≤ Lmax. (1)

If we plot an invalid pair (xC , yC) as a point in the two dimensional space G × G then the breakpoints (a, b)satisfying the above equation define a trapezoid4 (Figure 1).

If multiple clones contain the same fusion point ζ, then the corresponding breakpoint (a, b) lies within theintersection I of the trapezoids corresponding to the clones. Conversely, we will assume that if the trapezoidsdefined by several invalid pairs intersect, then they share a common breakpoint. We call a set of clones whosetrapezoids have non-empty intersection a cluster. Figure 2 displays a cluster of six clones from the MCF7 cellline. As the number of clones that are end-sequenced increases, more clones will contain the same fusion pointand more clusters will be formed. Thus, the area of I will decrease, and correspondingly the uncertainty in thelocation of the fusion point decreases.

Given a set of clones, we are interested in computing the probability that the cancer genome contains afusion gene, i.e. a fusion of two different genes from the reference genome. A gene defines an interval U = [s, t]where s is the 5′ transcription start site and t is the 3′ transcription termination site. Consider two genes withintervals U and V . The two genes are fused if there exists a breakpoint (u, v) that lies in U×V . This breakpointis detected if (u, v) lies in I. An approximate probability for a fusion gene is the fraction of I that lies withinthe rectangle U × V .

We obtain a better estimate of the probability of fusion by considering the empirical distribution of clonelengths and provide a (See Methods). The exact probability of the gene fusion is given by the probability massthat lies within the intersection of I and the rectangle U×V defined by the pair of genes. An efficient algorithmfor computing these probabilities is given in Methods.

Fusion Predictions in Breast Cancer

We made predictions of fusion genes for the MCF7, BT474, SKBR3 breast cancer cell lines as well as two primarytumors using data from end sequence profiling of these samples[21, 22]. Approximately, 71Mb of end-sequencewas derived from these 5 samples, ∼ 29Mb (corresponding to .47 clonal coverage) coming from the MCF7 cellline. Across all samples, a total of 1,141 invalid pairs were obtained. These formed 919 clusters, 95 of whichcontained more than one clone.

The empirical distribution of clone lengths for the MCF7 library library is shown in Supp. Fig I. We usethe distribution in each clone library to compute the probabilities shown in Tables 1. Table 1 shows the resultsof our predictions for fully sequenced BACs across multiple breast cancer cell lines and primary tumors, sortedaccording to fusion probability. We have successfully validated a number of these highest ranked predictions(Table 1). Experimental validation in this context involves sequencing of an entire clone, thus exactly confirmingthe breakpoint (and point of gene fusion) if present (See Methods). In some cases multiple breakpoints werefound in a clone, we ensure that the breakpoint associated to each gene in the fusion disrupts the corresponding

3In the genome rearrangement literature, a fusion point is also called a breakpoint[19].4A triangle when Lmin = 0

4

1.074 1.075 1.076 1.077 1.078x 108

5.195

5.2

5.205

5.21

5.215

5.22

5.225

5.23 x 107

Genomic Position (Chromosome 1)

Gen

omic

Pos

ition

(Chr

omos

ome

20)

NTNG1

BC

AS1

Figure 2: Prediction of a Fusion between the NTNG1 and BCAS1 genes. The rectangle indicates the possiblelocations of a breakpoint on chromosomes 1 and 20 that would result in a fusion between NTNG1 and BCAS1. Eachtrapezoid indicates possible locations for a breakpoint consistent with an invalid pair. Assuming that all clones containthe same breakpoint, this breakpoint must lie in the intersection of the trapezoids (shaded region). Approximately69% of this shaded region intersects (darkly shaded region) the fusion gene rectangle, giving a probability of fusionof approximately 0.69. The empirical distribution of clone sizes reveals that not all clone sizes are equally likely (e.g.extremely long or short clones are rare). Using this additional information, our improved estimate for the probability offusion is > 0.99.

5

(a)

102

104

106

108

1010

1012

1014

0

0.2

0.4

0.6

0.8

1

Product of Gene Lengths (log scale)

Pro

babi

lity

of F

usio

n

(b)

102

104

106

108

1010

1012

1014

0

20

40

Product of Gene Lengths (log scale)

Cou

nt in

Chi

mer

DB

Figure 3: (a) Probability of Fusion vs. product of gene lengths involved in the fusion event indicates higherfusion probabilities for larger gene pairs. Larger circles indicate gene pairs experimentally validated by furthersequencing. Darkly shaded circles indicate fusions confirmed by clone sequencing; yellow circles indicate predicted fusionswith negative sequencing results. Smaller points indicate untested predictions with blue circles indicating cluster sizesof 2 or more clones and red circles indicating singletons. (b) The number of fusion genes in chimerDB[23] plotted as afunction of the product of gene lengths in the fusion.

6

Table 1: Fusion Probability Predictions and Sequencing Results for Clusters in Breast CancerStart End Fusion Cluster Sequencing Supporting Cell Line /Gene Gene Probability Size Fusion Primary TumorASTN2 PTPRG 1 2 Yes† MCF7BCAS4 BCAS3 1 20 Yes† MCF7KCND3 PPM1E 0.99 12 Yes MCF7NTNG1 BCAS1 0.99 6 Yes MCF7BCAS3 ATXN7 0.83 8 Yes† MCF7ZFP64 PHACTR3 0.6322 2 No BT474CT012 HUMAN UBE2G2 0.0880 1 No BreastVAPB ZNFN1A3 0.0842* 3 Yes BT474BMP7 EYA2 0.0324 4 No† MCF7KCNH7 TDGF1 0.0215 1 No BreastSULF2 TBX4 0.00656 2 No MCF7NACAL NCOA3 0.0057 2 No MCF7MRPL45 TBC1D3C 0.0005 1 No BT474U1 NP 060028.2 0.0005 1 No BreastRBBP9 ITGB2 0.0005 1 No BreastY SYNPR < 0.0001 4 No MCF7PRR11 TMEM49 < 0.0001 9 No MCF7BMP7 Q96TB < 0.0001 3 No MCF7

Ranked List of Fusion Genes Predictions in Breast Cancer Cell Lines and Primary Tumors Thegene order shown indicates “start” and “end” positions with respect to transcription. Elements labeled as “NotTested” are, as yet, not sequenced. Though additional clones have been sequenced they did not overlap anyputative fusion region. It should be noted that those clones were also negative for gene fusion. VAPB/ZNFNA13has low probability, however there are many pairs of genes with low probability of fusion in this region. Theprobability that any one of these gene pairs fuse is > .30. All clones in a cluster are non-redundant (the sameclones do not reappear multiple times in a cluster). † indicates that a single clone contained more than twochromosomal segments, i.e. the clone is not a simple fusion of two genomic loci.

gene. Such multiple rearranged regions have been observed to still form fusion transcripts as in the case ofBCAS4/BCAS3[11, 24]. An example, and explanation, of a high-scoring prediction (NTNG1/BCAS1) can beseen in Figure 2. The strong correspondence between fusion probability prediction and subsequent sequencingvalidation of the breakpoint in Table 1 illustrates the predicative power of our method.

Table 1 also indicates the power of the technique in predicting negative results. In addition to gene pairswith highly predcited fusion probabilities it shows results on all remaining sequenced clones for which fusioninformation was available. Only one of these clones (below 50% fusion probability) resulted in positive resultsfor fusion genes (VAPB/ZNFN1A3). The negative results mostly follow the computational predictions, as onlyone of the clones had a probability above 50% for gene fusion. A notable result from sequencing, as notedearlier, is the observation that certain clones undergo multiple rearrangements (see Table 1). This results ingreater than two chromosomal segments, with respect to the reference genome, present within a single clone.

Detecting fusion points

Consider an idealized model in which R clones, each of fixed length L are picked uniformly at random froma tumor genome 5 of length G, end-sequenced, and ends mapped to the reference genome. The fraction ofclones uniquely mapped, f , provides one with a revised set of clones N = fR. The fraction of clones uniquelymapped varies significantly between technologies to the next. Our ESP study had 90% of reads mappable with58% uniquely mapped[22]. 454 sequencing of structural variants in the human genome reported 63% mappingof sequences with recognizable linker sequences, corresponding to mapping of 41% mapping of all reads [15]. Itshould be noted that the 454 reads are of significantly higher sequenced end lengths (average 109bp)[15] andquality compared to other next gene technologies (average 20− 30bp)[13, 12], so lower mapping fidelity would beexpected for such technologies.

A fusion point, ζ, on the tumor genome is detected if a uniquely mapped clone contains it (See Figure 4).

5For these calculations the tumor genome size, G, will correspond to the diploid genome size, ∼ 6× 106bp for human.

7

Cl Cr

Genome

Figure 4: Schematic of a Breakpoint Region. A fusion point ζ on the tumor genome contained in multiple clones.The leftmost and rightmost clones determine the breakpoint region Θζ in which the fusion point can occur.

Using the Clarke-Carbon formula[25, 26](See Methods), the probability Pζ of detection is given by

Pζ ≈ 1− e−c (2)

where c = NL/G is the clone coverage. A single clone localizes a fusion point to within L bp, while multipleclones containing a fusion point will localize the fusion point more precisely. Define the breakpoint region, Θζ ,as the interval determined by the intersection of all clones that contain ζ. Therefore |Θζ | defines the localizationof ζ, or the uncertainty in mapping ζ. To localize a fusion point to within L, we only need a single cloneoverlapping ζ. We show (See Methods) that

Pr(|Θζ | = L) ≈ Le−c(1− e−NG ). (3)

Furthermore, we show that for s < L,

Pr[(|Θζ | = s)] ≈ se−NsG (1− e−N

G )2 (4)

These equations allow us to estimate the the expected length of Θζ , conditioned on ζ being covered 6 (SeeMethods for full derivation and closed form solution, equation 24) as

E(|Θζ | | ζ is covered) ≈

(1− e−N

G

1− e−c

)(L2e−c +

L−1∑s=1

s2e−NsG (1− e−N

G )

)(5)

We evaluated the error in this approximation by simulation (See Supplemental Methods for descriptions of allsimulations). Supp. II shows that approximation (5) very closely models the average obsered |Θζ |. The relativeerror between the average observed length of the breakpoint region and approximation (5) was 0.02.

We also assess the effect of different clone sizes, L, and number of clones, N , on the expected length of thebreakpoint region, E(|Θζ |), around a specific fusion point, ζ. In Figure 5 we see the obvious correlation that anincrease in the number of reads, N , decreases the uncertainty in localization (|Θζ |). Interestingly, we note that40kb clones are most advantageous for detection when |Θζ | = 40kb. A similar effect is observed for 150kb and2kb. This implies that the choice of clone-lengths impacts one’s ability to detect fusions of a specific size.

Evaluation of Sequencing Strategies

Formulas (2) and (5) provide a framework for examining a variety of sequencing parameters (L,N, c). In Table 2and graphically in Supp. III and IV, we assess the effect of using different clone sizes and varying numbers ofpaired reads on the ability to detect and localize a fusion point (also see Supp. V). One can see that a distincttrade-off exists between detection, in which larger clones hold a distinct advantage, and localization, in whichcase smaller clones are advantageous. Though larger size clones (e.g. BACs of 150kb) appear more pragmaticfor sequencing projects using a smaller number of paired reads, the advent of low cost, highly parallel sequencingof small clones could soon yield extremely high resolution (smaller |Θζ |) and coverage of fusion points.

6Otherwise, Θζ is not defined

8

0 0.5 1 1.5 2x 105

0

0.2

0.4

0.6

0.8

1

Θζ

Prob

abilit

y

150kb Clones, 10k Reads40kb Clones, 10k Reads2kb Clones, 10k Reads150kb Clones, 100k Reads40kb Clones, 100k Reads2kb Clones, 100k Reads150kb Clones, 1M Reads40kb Clones, 1M Reads2kb Clones, 1M Reads

Figure 5: Probability of localizing a fusion point to an interval of a given length. A fusion point is localized to length sif the corresponding breakpoint point region has length s or less. Note that when s exceeds the clone length L, only asingle clone contains the fusion point; thus, the probability of localization is the fixed probability of a clone containingthe fusion point at a given clonal coverage. Note that a fixed clone length is assumed here. Using a distribution of clonelengths would create a less abrupt transition.

Table 2: Breakpoint Detection and Localization for Different Sequencing Strategies

Clone Size(L) Paired Reads(N) Clone Coverage(c) E(|Θζ |) Pζ E(|Θζ∗ |) Pζ∗

1kb 40× 106 13.3X 295 > .99 289 .991kb 1× 106 .33X 972 .15 658 .0122kb 20× 106 13.3X 593 > .99 581 .992kb 1× 106 .66X 1889 .28 1296 .04410kb 5× 106 16.7X 2393 > .919 2378 > .9910kb 1× 106 3.3X 7342 .81 5657 .5040kb 2× 106 26.7X 5998 > .99 5997 > .9940kb .1× 106 1.33X 35587 .49 25124 .14150kb .5× 106 25X 23997 > .99 76807 .71150kb .1× 106 5X 93169 .92 72022 .80150kb .012× 106 .6X 142510 .26 97457 0.037

The probability Pζ of detecting a fusion point and the expected length E(|Θζ |) of a breakpoint region under various

clone sizes (L) and number of end-sequenced clones (N). Pζ∗ , and E(|Θζ∗ |), correspond to the probability for, and

expected size of a breakpoint region in the case of, two clones spanning ζ. The small clone sizes (1kb, 2kb) and large

number of reads represent what one might achieve in a single run with new technologies (assuming perfect mapping of

end sequences). The last value for the 150kb clones represent our current status on the MCF7 cell line.

9

1 5 10 20 50 100 200 5000

0.05

0.1

0.15

0.2

0.25

0.3

0.35

Gene Length in kb (Log Scale)

Perc

enta

ge o

f Tot

al G

enes

in S

etAll Genes (39,368)

Kinases (620)

TranscriptionFactors (562)

ChimerDB (300)

Random FusionGenes (2000)

Figure 6: Distribution of gene sizes for different sets of genes. All genes: The “known genes” track in the UCSCGenome Browser[27]. Kinases: Selected from the KinBase database[28]. Transcription factors: Selected from theAmiGO database according to the GO term “transcription factor activity”[29]. ChimerDB: Fusion genes in cancerextracted from the chimerDB database[23]. Random Fusion Genes: A set of 2000 genes involved in 1000 randomfusion events. Random Fusion events were formed by inducing random breakpoints, and selecting such events if theyformed a fusion gene. Note that the gene sizes are on a log scale, and the number of genes from each set used to deriveeach distribution is shown in the legend.

Fusion gene probabilities

Figure 5 indicates that to achieve a desired breakpoint localization with a fixed number of reads, some clonesizes are more effective than others. Thus, Figure 5 implies that our ability to accurately assess fusion genesdepends not only on the sequencing parameters but also on the sizes of genes involved in the fusion.

There is considerable variation in sizes of human genes, as seen in Figure 6. When considering all knowntranscripts[27], the median gene size is approximately 20kb and the mean is approximately 40kb. However, thisdoes not correspond to the sizes of known fusion genes in cancer. Examination of chimerDB [23], a database offusion sequences derived from mRNA, EST, literature and database searches, shows a clear bias towards largergenes (Figure 6), with a median gene size of 40kb and a mean gene size of 90kb. It is tempting to speculate onthe reasons for this bias. There is the potential for ascertainment bias as larger genes would be easier to localizevia cytogenetic techniques. Additionally, random breakage would seem to favor larger genes as the probabilityof a breakpoint disrupting a large gene would be greater than for a small gene. To assess this possibility, wesimulated random fusion events by inducing random breakpoints in the genome. If a breakpoint formed a fusiongene we recorded the size of each fusion partner as seen in Figure 6. It is interesting to note that these randomfusion events results in much larger genes than observed in the normal genome or chimerDB (median and meangene sizes of 155kb and 284kb, respectively). Though further investigations are needed, this could indicate thatthe observed fusion genes are selected for functional reasons. We also examined the distribution of transcriptionfactors and kinases which are members of multiple fusion genes, (Figure 6). Interestingly, kinases correspondmuch more closely to the chimerDB distribution whereas transcription factors correspond better with gene sizesobserved across the entire genome.

The variation in gene sizes for different classes of genes (Figure 6) suggests that one consider a wide range ofgene sizes when assessing our ability to detect fusion genes. Figure 7 shows the number of clones necessary toachieve at least a .5 fusion probability for a random gene pair of specified size, across a variety of different clonesizes. It should be noted that the breakpoint could exist at any position within either gene. Smaller clone sizesclearly hold a distinct advantage in the probability of detecting fusion genes when compared across equal clonalcoverage while large clones tend to perform better when comparing across equivalent reads (Figure 7a). Thisis not surprising, as a significantly higher number of paired reads is required to achieve the same coverage withsmaller clones. For 2kb clones 75 times the number of paired reads is needed to yield equivalent clone coverage

10

(a)

0 2 4 5 9 10 1510

5

106

107

108

Fusion Gene Size

Pai

red

Rad

s

Fusion Probability >0.5

10001000040000150000

(a)

500 1000 2000 10000 40000 15000010

5

106

107

108

Clone Size

Num

ber

of C

lone

s

Fusion Probability > .5

20kb fusion40kb fusion100kb fusion150kb fusion

Figure 7: (a) The number of paired reads necessary to detect fusion genes with fusion probability greater than .5 as afunction of gene size for different clone lengths. The vertical lines indicate median (20kb) and mean (40kb) sizes for allknown genes as well as the median (40kb) and mean (90kb) sizes for chimerDB genes. (b) The number of paired readsnecessary to detect fusion genes with fusion probability greater than .5 as a function of clone size for different fusiongenes sizes (log scale in both axes). Each point in these plots is the average over 100 different fusion genes and and 100different simulations of clones from the genome. Thus, each datapoint represents the average value of 104 simulations.In each simulation, a pair of genes was chosen such that area of the resulting gene rectangle (U × V ) was equal to thesquare of the indicated fusion gene size. A breakpoint was selected for the gene pair uniformly in the rectangle U × V ).

with respect to 150kb clones. It is important to note, however, the next-generation sequencing technologies canpotentially generate a high number of paired reads for small clone sizes at very low cost (See Discussion).

Less obvious is the relationship between fusion gene size and detectability of fusion events (Figure 7b). Notethat larger clones create larger trapezoids. This increases the probability that the trapezoid intersects with thegene rectangle (better detection), but may, thus, reduce the calculated fusion probability since only a smallfraction of the larger trapezoid would overlap the gene rectangle (akin to creating a larger breakpoint regionaround a breakpoint). Notice that there appears to be a distinct correlation between the size of genes involvedin a fusion and the size of the clone being used to sequence the genome. When the resulting fused gene is 40kbin length, the average fusion probability is significantly greater when using the same number of 40kb clonescompared to 2kb clones, largely due to the greater genomic coverage provided by the larger clones. However,40kb clones also perform nearly as well as 150kb clones (Figure 7b), because of their improved breakpointlocalization (Figure 5). Once the fusion gene size is increased to 150kb, once again, a significant benefit is seenby moving to the larger clone size. In addition to supporting bias for identification of larger gene fusion eventsby large clones, these results suggest that when investigating specific fusion events, designing one’s clone libraryto reflect the size of the event could be beneficial in detecting it with high probability. However, it is importantto note that the 150kb clones consistently show less variance in predicted probabilities (Supp. VI). This makesthem a more reliable source for performing whole genome assessments and comparisons across multiple tumorsamples, especially with a small number of paired reads.

Effects of Errors

There are numerous sources of error in paired-end sequencing approaches. The method of computing fusionprobabilities from end-sequences makes it possible that a pair of genes near a breakpoint may exhibit a highprobability of fusion but not actually fuse. This is more likely in the case of large clone sizes and low clonecoverage.

A number of other factors, including experimental artifacts, genome assembly errors or mismapping of endsequences, can affect the prediction of fusion genes. In particular, experimental artifacts can lead to falseidentification of fusion genes. A major source of experimental artifacts is chimeric clones produced when twonon-contiguous regions of DNA are joined together during the cloning procedure. Approximately 1 − 2% ofclones in modern BAC libraries are chimeric[21, 15], and rates for other vectors are roughly similar. The type andrate of experimental artifacts for new genome amplification and sequencing strategies is still an open question.

In order to assess the rate of false positive predictions of fusion genes in the presence of errors, we simulated

11

(a)0 2 4 6 8

0

0.5

1

1.5

2

2.5

FP

TP

2kb Clones40kb Clones150kb Clones

(b)0 10 20 30 40 50 60

0

5

10

15

FP

TP


(c)0 5 10 15 20

0

5

10

15

20

FP

TP


(d)0 0.5 1 1.5

0

5

10

15

FPT

P


Figure 8: Sensitivity and Specificity of Fusion Gene Predictions (a) Number of false positive (FP) and truepositive (TP) fusion gene predictions for a simulated genome with 100 translocations and 10,000 paired reads. Eachcurve represents the average of 50 simulations with clones of a fixed size (2kb, 40kb, 150kb clones). The minimum fusionprobability threshold for indicating that a fusion gene was predicted was decreased from > .95 (leftmost point) to > 0(rightmost point) in increments .05. and the number of true and false predictions was determined. For all figures 19 truefusion genes were present in the rearranged genome. These 19 events were not selected for, rather they were a result ofthe random rearrangment of the genome. (b) 100,000 paired reads. (c) 1,000,000 paired reads. (d) 10,000,000 pairedreads.

100 random rearrangement events with the percent of chimeric clones equal to 1% (Figure 8). For each curve,the points represent “True Positives” (fusion pairs correctly labeled as such) vs. “False Positives” (non-fusingpairs incorrectly labeled as fusing) for various fusion probability thresholds. As expected, with low numbers ofpaired reads 150kb clones obtain the most “True Positives” (Figure 8a,b), while with high number of pairedreads smaller clone sizes are better (40kb clones in Figure 8c). Very high coverages are needed before very smallclones, 2kb, can become effective (Figure 8d). However, they show almost no false positives at appropriatethresholds, with little (if any) increase in true positives when reducing the probability threshold (Figure 8d).

Finally, we examined the effect of chimeric clones on our ability to identify breakpoints from invalid clusters.Obviously, when only a single isolated invalid pair exists we cannot determine whether it arose via a chimericevent or through a true rearrangement. However, a cluster of invalid pairs is highly unlikely to arise from chimericclones[20] as seen in Figure 9). In most cases, no clusters formed from chimeric clones are ever observed. Evenunder high coverage (10X) and a very high percentage of chimeric clones (5% of all paired reads) in 80% ofstudies no chimeric clusters would be observed. This result increases the confidence in the assumption thatclusters of two or more invalid pairs indicate true rearrangement events. When considering an equal numberof chimeric clones for different clone lengths, the probability of observing a chimeric cluster is much lower forsmaller clone sizes.

12

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.0510

−6

10−5

10−4

10−3

10−2

10−1

100

Percent Chimeric ClonesPro

babi

lity

of O

bser

ving

a C

him

eric

Clu

ster

1X Clonal Coverage2X Clonal Coverage5X Clonal Coverage10X Clonal Coverage

Figure 9: Probability of observing at least one chimeric cluster vs. the percent of chimeric clones. These probabilitieswere computed using Equation 27, with clone size L = 150kb and confirmed by simulation. Other clone sizes yieldvirtually identical probabilities at the same clonal coverage.

Discussion

We provided a computational framework to evaluate paired-end sequencing strategies for detection of rearrange-ments in cancer genomes via. Using our probability calculations and simulations, we find that even with currentpaired-end technology we can obtain extremely high probability of breakpoint detection with a very low numberof reads. For example, more than 90% of all breakpoints can be detected with paired-end sequencing of lessthan 100,000 clones (Table 2). Additionally, next-generation sequencers can potentially detect rearrangementswith a greater than 99% probability and localize the breakpoints of these rearrangements to less than 300bp ina single run of the machine (Table 2). Even when only a fraction (f ' 0.5) of the reads map uniquely, similardetection levels can be achieved by simply doubling the amount of sequencing.

We derived formulae that provide estimates of the probability of detecting rearrangement breakpoints andlocalizing them precisely. For a genome of length G with N mapped paired reads from clones of size L, thedetection probability is a function of the of clonal coverage (c = NL

G ). Thus, increasing L means that fewerclones are needed to maintain the same probability of detecting a fusion. Traditionally, clone size L was dictatedby efficiency considerations with available cloning vectors (e.g. plasmids ≈ 2kb, fosmids ≈ 40kb, and BACs≈ 150kb). However, paired-end tag approaches coupled with new sequencing technolgies permit paired-endsequencing from a larger range of “clone” sizes. On the other hand, the distribution of breakpoint localizationdepends separately on “clone” size L, number of mapped reads N , and genome size G.

The natural question for the practioner is: what sequencing strategy maximizes information about rearrange-ments in the cancer genome for minimum cost? Providing a definitive answer to the question is complicatedby three major issues. First, the goal of “maximizing information about rearrangments in cancer genomes”requires further specification. If one is interested in detecting as many rearrangement breakpoints (and thusfusion genes) as possible with the minimum amount of sequencing (i.e. maximizing the coverage c), then fora fixed number of paired reads, larger clones give higher probabilities of detection than smaller clones, a con-sequence that is follows immediately from the breakpoint detection probability (Equation 2). However, if oneis interested in precise localization of breakpoint regions, then smaller clones will provide better localizationwhen the breakpoint is detected (Figure 8, V). If one wants to validate the breakpoint via PCR, then clearlylocalization must be less than the length amplified via PCR, typically less than a few kilobases. Our resultsshow the less obvious correlation between clone size and the probability of detecting breakpoint regions of aspecific size and, thus, fusion genes involving partner genes of a specific size. Figure 5 shows that at equalnumber of paired reads different clone sizes obtain better probabilities for resolving fusion points to an intervalof fixed length. Moreover, Figure 7b shows that these results readily translate to the probability of detectingfusion genes with partner genes of a given size. Such trade-offs permeate nearly all aspects of sequence strategyevaluation. Suppose that paired-end sequences could be obtained for any clone length. Then the choice ofoptimal clone size depends on the length of fusion genes that the researcher wants to have the greatest abilityto localize, which might depend on a prior belief about the model of rearrangement in cancer. For example, ifone wants to be able to localize fusion genes whose size is approximately the length of an average human gene

13

(40kb), then the optimal clone size is 40kb. However, under the hypothesis that the breaks in the genome thatlead to fusion genes are distributed unifomly on the genome, larger fusion genes would be expected and thuslarger clone sizes would be optimal.

Note that in many cases, one is interested not in breakpoints in a single tumor, but recurrent breakpointsacross multiple samples. For this reason, approaches like Primer Approximation Multiplex PCR (PAMP) thatassay for variable genomic lesions (such as deletions) in a patient population[30] are appropriate, and preciselocalization of breakpoints is unnecessary. Moreover, if rearrangement breakpoints are in highly repetitiveregions, it might be difficult to map ends too close to the breakpoints, and thus larger clone sizes are appropriate.On the other hand, if there are multiple rearrangements clustered in a small genomic interval, large clones wouldmiss some of these rearrangements, as observed with the multiple breakpoints found in some of our BACs.

The second major difficulty is that the sequencing parameters cannot be set arbitrarily, but are limitedby the technology chosen. Some of these technologies are not yet commercially available, and the informationabout their constraints is unknown. Moreover, the field is developing rapidly and any claims stated about readlengths, sequencing error rates, etc. are undergoing continual revision.

Finally, the complexity of cancer genomes at the sequence level is unknown. Cytogenetics has demonstratedextensive polyploidy in tumor cells, and more recent ESP studies[21, 31] and shotgun sequencing will naturallysample more from amplified regions. Although we did not explicity simulate amplifications, it is clear that theprobability of detecting amplified translocations is directly proportional to their amplification status. However,it is difficult to calibrate the genome length parameter in our simulations. For this reason, pilot studies will beneeded to assess the extent of amplification and heterogeneity in these samples.

Mapping Reads

Our analysis focused on several key parameters including number of paired reads, clone length, and percent ofchimeric clones. Some of these parameters are adjustable while others (e.g. error rate) are fixed by the choice of agiven technology. However, our model and simulations did not describe all factors that might influence the choiceof a given technology. In particular, we avoided the issue of the mapping of reads to the reference genome. Ourresults give some indication of the promise of these new technologies, but at present there are many unknownsin the number and size of mapped paired-end sequences (See below) that one can expect. Clearly, pilot studiesare needed to address this issue. As next-generation technologies mature they will inevitably provide highlyreliable and precise mappings of fusion points and detection of fusion genes, allowing for the systematic analysisof genomic events of many cancer samples. Different sequencing technologies produce reads of varying lengthand quality which can have a dramatic effect on the ability to map paired reads. On one extreme, conventionalpaired-end sequencing of cloned genomic fragments employed by current ESP studies[9, 21], yields high qualityreads exceeding 500 bp. This enables the majority of reads outside of repeats and segmental duplications tobe uniquely and accurately mapped to the reference genome. For example, 11492 out of 19831 clones (58%)in the MCF7 study, using 500bp from each end, mapped uniquely[22], while 41% of paired reads using 100bpof pyrosequencing from each end, mapped uniquely[15]. Newer sequencing technologies such as Illumina andABI have even shorter reads (20 to 30 bp) and higher error rates[13, 12], and the ditag approach sequences only18 − 20 base pairs from each end of the genomic fragment[11]. These shorter reads will be much more difficultto map, particulary when analyzing rearrangements. Unlike resequencing studies when one can use additionalinformation that the end sequences are close together, detection of rearrangements specifically requires theaccurate mapping of end sequences from distant locations on the genome. It would be informative to model theeffect of different read accuracies and lengths on our ability to accurately resolve breakpoints.

Organization of Cancer Genomes

An additional limitation is that the simulation makes certain simplifying assumptions about the character ofcancer genomes. Most notably, we assume that the size of the cancer genome (equal to the parameter G above)is known. Since many cancer genomes, particularly solid tumors, have extensive aneuploidy the actual size of agiven genome might differ greatly from normal cells[32]. Additionally, certain regions are highly amplified andcan have complex structures due to different mechanisms of duplication[33, 21]. Finally, genomic heterogeneity,particularly in primary tumor samples, reduces the effective coverage and thus the probability of detectingrearrangement breakpoints. Even a genomic lesion that is an important checkpoint in a cancer progression,

14

might be difficult to detect in an admixed sample containing normal cells and cells from earlier developmentalstages of the tumor. It is nearly impossible to determine how these factors will affect cancer sequencing withoutfurther pilot studies. Such studies promise to provide lots of new information about the extent of ploidy changesand heterogeneity.

Extensions and Applications

Our formula for the probability of a fusion genes is readily extended to fusions of any genomic feature. For ex-ample, regulatory fusions. that result from the fusion of the promoter of one gene to the coding region of anothergene can be assessed. Similarly, it should be noted that predicting the probability of detection for amplifiedbreakpoints, Pζ , (under the assumption of constant genome size) naturally follows from the current model. Asamplification, a, increases the probability of detection corresponding increases, following (1− (1− Pζ)a).

Additionally, a number of techniques based on the accurate resolution of genomic lesions can be used incombination with, or extensions of, this approach. Array CGH, can provide high resolution of breakpointsinvolved in deletion and duplication events at average resolutions of less than 10kb [34, 35]. When this informationoverlaps paired-end sequencing data (such as the case with an amplified translocation like BCAS4/BCAS3) onecan potentially improve the resolution of the breakpoint interval defined by a paired-end sequencing approach.Another example, Primer Approximation Multiplex PCR (PAMP) attempts to assay for variable genomic lesions(such as deletions) in a patient population[30]. The success of this technique relies on establishing the variance ingenomic boundaries of a genomic event, so that appropriate primers spanning the region can be designed. Ourapproach provides an effective method by which to derive rough approximations for boundaries of such events.Combining this approach with such experimental techniques makes it a powerful tool for both designing clinicaldiagnostics and identifying therapeutic targets.

Methods

Mapping and clustering of end sequences

We assume that each clone C is end sequenced and the ends are mapped uniquely to the reference humangenome sequence. Thus, each clone C corresponds to a pair (xC , yC) of locations in the human genome wherethe end sequences map. In addition, an end sequence may map to either DNA strand, and so each mappedend has a sign (+ or −) indicating the mapped strand. We call such a signed pair an end sequence pair (ESpair). If we know that the length LC of the clone C lies within the range [Lmin, Lmax], then for most clones thedistance between elements of the corresponding ES pair will lie in this range and the ends will have opposite,convergent orientations: i.e. an ES pair of the form (+x,−(x + LC)). Following[20] we call such ES pairs validpairs because these indicate no rearrangement in the tumor genome. We can use the distribution of distance|y| − |x| between the ends of valid pairs to define an empirical distribution for the clone lengths (cf. Supp. I).

If a pair (xC , yC) has ends with non-convergent orientation or whose distance |y| − |x| is greater than Lmax

or smaller than Lmin, we say that (xC , yC) is an invalid pair. The set of breakpoints (a, b) that are consistentwith the invalid pair (xC , yC) is determined by the inequalities[36]

Lmin ≤ sign(xC)(a− xC) + sign(yC)(b− yC) ≤ Lmax. (6)

Throughout the paper, we assume (without loss of generality) that sign(xC) = sign(yC) = + so that a ≥ xCand b ≥ yC .

Validating fusion predictions by sequencing

Clones contained predicted fusion genes were draft sequenced (1X coverage) by subcloning into 3kb plasmids asdescribed in[22]. Assembly of these sequences and alignment to the reference human genome identified either theprecise fusion point, or identified a plasmid containing the fusion point thus localizing the breakpoint to 3kb.

15

Computing fusion probability

Define C(a,b) as the event that a clone C from the tumor with corresponding invalid pair (xC , yC) overlaps abreakpoint (a, b) of a reference genome. Assume w.l.o.g. that the invalid pair (xC , yC) is oriented such thata ≥ xC and b ≥ yC . Given the breakpoint (a, b), the length LC of the clone is fixed to be

lC(a, b) = (a− xC) + (b− yC) = (a+ b)− (xC + yC) (7)

Therefore, the event C(a,b) implies the event that LC = lC(a, b), allowing us to express the probability ofoccurrence of breakpoint (a, b) in terms of the distribution on the lengths of clones. Denote NC [s] as thenumber of discrete breakpoints (a, b) such that a ≥ xC , b ≥ yC , and a+ b = s. Then

Pr(C(a,b)) = Pr(C(a,b) ∩ (LC = lC(a, b))) (8)= Pr(C(a,b) | LC = lC(a, b)) · Pr(LC = lC(a, b)) (9)

=1

NC [a+ b]Pr(LC = lC(a, b)), (10)

where the last equality follows from Equation 7 and the assumption that all breakpoints are equally likely.Consider a pair of genes spanning genomic intervals U, V . If exactly one of the genes is on the “+” strand andthe other is on the “-” strand, then the probability of genes fusing, given C is

Pr(∪(a,b)∈U×V C(a,b)) =∑

(a,b)∈U×V

Pr(LC = lC(a, b))NC [a+ b]

. (11)

Otherwise, if the genes are both on the same strand then an in-frame fusion transcript is impossible7 and theprobability is 0. In the simple case, assume that the clone lengths were uniformly distributed over the range[Lmin, Lmax], so that

Pr(LC = lC(a, b)) =

{1

Lmax−Lminif Lmin ≤ lC(a, b) ≤ Lmax,

0 otherwise.

In this case, Equation 11 refers to the fraction of the trapezoid (Equation 1) that intersects with U × V .A more realistic distribution of the clone lengths is empirically estimated (Supp. I), and used to computePr(LC = lC(a, b)).

Next, we extend the equations to include the case when a set {C(1), C(2), . . .} of multiple clones overlap thebreakpoint (a, b). Define C to be the event that all clones overlaps the same breakpoint. Then

C = ∪(a,b)C(a,b)where

C(a,b) = ∩jC(j)(a,b)

is the event that all clones C(j) overlap the breakpoint (a, b). Thus, the probability of (a, b) being the breakpointgiven that all clones overlap it is given by

Pr(C(a,b)|C) =Pr(C(a,b) ∩ C)

Pr(C)(12)

=Pr(C(a,b))

Pr(C)(13)

=

∏j Pr(C(j)

(a,b))∑(a,b)

∏j Pr(C(j)

(a,b))(14)

Here, equation 13 follows from the fact that C(a,b) implies C, and Equation 14 follows from the independence ofclones. This allows us to compute the probability of genes U, V fusing as

Pr(∪(a,b)∈U×V C(a,b)|C) =

∑(a,b)∈U×V

∏j Pr(C(j)

(a,b))∑(a,b)

∏j Pr(C(j)

(a,b))(15)

7The analysis of the orientation of genes is analogous in the cases of invalid pairs with other signs.

16

Algorithms for efficient probability computation

The naive approach to computing Pr(∪(a,b)∈U×V C(a,b)|C) in Equation 15 involves the computation of Pr(C(j)(a,b))

over all (a, b), and all clones C(j), which can be expensive. We make the computation more efficient by exploitingredundancies in computation. First, note that we only need to do the computation over a range of (a, b) valuesdefined by the intersection of all the trapezoids.

Note from equation 10 that

lC(a, b) = lC(a′, b′)⇒ Pr(C(j)(a,b)) = Pr(C(j)

(a′,b′))

Furthermore, since lc(a, b) = (a+ b)− (xC − yC) all points (a, b) with equal values of lC(a, b) lie on a line withslope −1 (antidiagonal). This provides a methodology for rapidly computing the probability of fusion.

Define diagonal s as the diagonal line with a + b = s. Let Ds denote the set of breakpoints that lie ondiagonal s, and are overlapped by all clones. Then

Ds = {(a, b) : a ≥ xC(j) and b ≥ yC(j)∀j, (a+ b) = s}

Then, D = ∪sDs is the set of breakpoints overlapped by all clones. Also, define the diagonal probability as aproduct of the probabilities of these clone lengths

Ps =∏j

Pr[|C(j)| = (s − xC(j) − yC(j))

]NC(j) [s]

Then, we have

Pr(∪(a,b)∈U×V C(a,b)|C) =

∑(a,b)∈U×V ∩D

∏j Pr(C(j)

(a,b))∑(a,b)∈D

∏j Pr(C(j)

(a,b))

=

∑s

∑(a,b)∈U×V ∩Ds

∏j Pr(C(j)

(a,b))∑s

∑(a,b)∈Ds

∏j Pr(C(j)

(a,b))

=∑

s |Ds ∩ U × V | · Ps∑s |Ds | · Ps

The algorithm itself is straightforward. Compute |Ds |, |Ds ∩ U × V | , Ps , for all values of s. It is efficient, asthere are relatively few diagonals with Ps > 0 and |Ds | > 0.

Expected number of fusion points

We seek to compute the probability of detecting a fusion point and the expected number of fusion points thatcan be detected for a specific amount of end sequencing. Assume that N clones, each of length L, are endsequenced from a tumor genome of size G. We assume that the left endpoint of each clone is selected uniformlyat random from the tumor genome. Then a fusion point ζ is detected if a clone contains it. Thus, the probabilityPζ of detection is given by[25, 26]

Pζ = 1−(

1− L

G

)N≈ 1− e

−NLG = 1− e−c, (16)

where c = NLG is the clonal coverage. Suppose there are M fusion points in the tumor genome, and define the

random variables X1, . . . , XM by Xi = 1 if the i-th fusion point is covered and Xi = 0, otherwise. Then

E(Xi) = 1− e−c.

The expected number of fusion points detected is given by

E(X) =M∑i=1

E(Xi) = M(1− e−c).

17

Using the Poisson approximation with λ = M(1− e−c)

Pr[m fusion points detected] ≈ e−λλm

m!

Given m observed fusion points, the maximum likelihood estimator M̂ of the total number of fusion points is

M̂ =m

1− e−c. (17)

Localization of Rearrangement Fusion Points

When multiple clones contain a fusion point ζ, the localization of ζ (the length of the interval that the clonesindicate ζ must lie within) is improved. Define Θζ as the intersection of all clones that cover ζ (Figure 4). Wewish to compute the probability distribution on the length of Θζ . Following Lander-Waterman[26], we assumethat the left endpoints of clones follow a Poisson process with intensity c = NL

G on the interval G. Θζ isdetermined by the left endpoint of the right-most clone that contains ζ, and the right endpoint of the left-mostclone that contains ζ. Define Aj , (0 ≤ j ≤ L−1) as the event in which the right-most clone has its left endpointj nucleotides to the left of ζ. Correspondingly, let Bj , 1 ≤ j ≤ L be the event that the left-most clone has itsright endpoint j nucleotides to the right of ζ. The event Aj occurs when there is a clone with left endpoint atζ − j and no clones with left endpoints in the interval j nucleotides to the right of ζ − j, and similarly for Bj .Therefore,

Pr(Aj) = Pr(Bj) = e−jNG (1− e−N

G ) (18)

The events Aj (likewise, all Bj) are mutually exclusive. Note that this allows us to redefine Pζ as

Pζ = Pr(∪L−1j=0 Aj) = (1− e−N

G )L−1∑j=0

e−jNG = (1− e−NL

G ) = 1− e−c (19)

Note that if s < L, then As−j and Bj are independent for all j. To compute the probability distribution on|Θζ |, we have two cases. For s < L,

Pr(|Θζ | = s) = Pr(∪sj=0(As−j ∩Bj)) =s∑j=0

Pr(As−j ∩Bj)

=s∑j=0

Pr(As−j)Pr(Bj)

= se−sNG (1− e−N

G )2 (20)

The event |Θζ | = L requires all clones covering ζ to have the same left endpoint. Therefore

Pr(|Θζ | = L) = Le−c(1− e−NG ) (21)

We can compute the expected length of Θζ , conditioned on ζ being covered by a clone; otherwise Θζ is undefined.Since the event |Θζ | ≤ L occurs only when ζ is covered, we have

Pr(|Θζ | = s |ζ is covered) =Pr(|Θζ | = s)

Pr(ζ is covered)=

Pr(|Θζ | = s)1− e−c

(22)

and combining (20), (21), and (22) obtains

E(|Θζ | | ζ is covered) =

(1− e−N

G

1− e−c

)(L2e−c +

L−1∑s=0


G )

)(23)

18

We note that the sum in the above formula has a closed form solution:

E(|Θζ | | ζ is covered) =

(1− e−N

G

1− e−c

)[L2e−c − 1

(1− e−NG )2(

e−NG (1 + e−

NG )− ec

(L2(1− e−N

G ) + (L− 1)2e−NG (1 + e−

NG ))) ]

.

(24)

Because of the presence of chimeric reads, it might be useful to consider only clusters of invalid pairs, where2 or more paired reads span a breakpoint. In this case

Pζ ≈ 1− e−c − NL

Ge

(N−1)LG (25)

and

E(|Θζ | | ζ is covered by multiple clones) =

1− e−NG

1−(e−c + NL

G e−(N−1)L

G

)(L−1∑

s=0


G )

)(26)

It is also informative to obtain the probability that two or more chimeric paired reads form a cluster. LetN is the total number of paired reads as before and q be the probability that a paired read is chimeric. If weassume that the distribution of clone lengths has mean L and is variable over a range of size L′, then

P (at least one pair of chimeric clones overlap) ≈ 1−(

1− LL′

G

)Nq(Nq−1)2

≈ 1− eNq(Nq−1)LL′

2G2 . (27)

Acknowledgements

BJR is supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. AMB issupported in part by a Graduate Research Fellowship from the National Science Foundation, and by a grantfrom the Nano Cancer Instiute ( 5 U54 CA119335-02). The work in the C.C. laboratory was supported by thegrant R33 CA103068 from NIH/NCI, Breast Cancer Research Program Grant 8WB-0054, and BCTR0601011from the Susan G. Komen for the Cure Foundation, and by the grant CA58207 from Bay Area Breast CancerTranslational Research Program.

19

References

[1] S.W. Morris, M.N. Kirstein, M.B. Valentine, K.G. Dittmer, D.N. Shapiro, D.L. Saltman, and A.T. Look. Fusion ofa kinase gene, ALK, to a nucleolar protein gene, NPM, in non-Hodgkin’s lymphoma. Science, 263(5151):1281–1284,1994.

[2] W.A. May, M.L. Gishizky, S.L. Lessnick, L.B. Lunsford, B.C. Lewis, O. Delattre, J. Zucman, G. Thomas, andC.T. Denny. Ewing Sarcoma 11; 22 Translocation Produces a Chimeric Transcription Factor that Requires theDNA-Binding Domain Encoded by FLI1 for Transformation. Proc Natl Acad Sci U S A, 90(12):5752–5756, 1993.

[3] R. Kurzrock and M. Talpaz. The molecular pathology of chronic myelogenous leukaemia. British journal of haema-tology, (79):34–7, 1991.

[4] B.J. Druker. STI571 (Gleevec) as a paradigm for cancer therapy. Trends Mol Med, 8(4 Suppl):S14–S18, 2002.

[5] F. Mitelman, B. Johansson, and F. Mertens. Fusion genes and rearranged genes as a linear function of chromosomeaberrations in cancer. Nature Genetics, 36:331–334, 2004.

[6] F. Mitelman, B. Johansson, and F. Mertens. The impact of translocations and gene fusions on cancer causation.Nat Rev Cancer 2007b, 7:233–45.

[7] S.A. Tomlins, D.R. Rhodes, S. Perner, S.M. Dhanasekaran, R. Mehra, X.W. Sun, S. Varambally, X. Cao, J. Tchinda,R. Kuefer, et al. Recurrent Fusion of TMPRSS2 and ETS Transcription Factor Genes in Prostate Cancer. Science,310(5748):644–648, 2005.

[8] M. Soda, Y.L. Choi, M. Enomoto, S. Takada, Y. Yamashita, S. Ishikawa, S. Fujiwara, H. Watanabe, K. Kurashina,H. Hatanaka, et al. Identification of the transforming EML4–ALK fusion gene in non-small-cell lung cancer. Nature,2007.

[9] S. Volik, S. Zhao, K. Chin, J.H. Brebner, D.R. Herndon, Q. Tao, D. Kowbel, G. Huang, A. Lapuk, W.L. Kuo, et al.End-sequence profiling: Sequence-based analysis of aberrant genomes. Proc Natl Acad Sci U S A, 100(13):7696–7701,2003.

[10] E. Tuzun, A.J. Sharp, J.A. Bailey, R. Kaul, V.A. Morrison, L.M. Pertz, E. Haugen, H. Hayden, D. Albertson,D. Pinkel, , M.V. Olson, and E.E. Eichler. Fine-scale structural variation of the human genome. Nat Genet,37(7):727–732, Jul 2005.

[11] Y. Ruan, H.S. Ooi, S.W. Choo, K.P. Chiu, X.D. Zhao, KG Srinivasan, F. Yao, C.Y. Choo, J. Liu, P. Ariyaratne, et al.Fusion transcripts and transcribed retrotransposed loci discovered through comprehensive transcriptome analysisusing Paired-End diTags (PETs). Genome Res, 17(6):828–838, 2007.

[12] D.R. Bentley. Whole-genome re-sequencing. Curr. Opin. Genet. Dev, 16:545–552, 2006.

[13] http://marketing.appliedbiosystems.com/images/Product/Solid Knowledge/SOLiD Chemistry Presentation 1019.pdf.

[14] P. Ng, J.J.S. Tan, H.S. Ooi, Y.L. Lee, K.P. Chiu, M.J. Fullwood, K.G. Srinivasan, C. Perbost, L. Du, W.K. Sung,et al. Multiplex sequencing of paired-end ditags (MS-PET): a strategy for the ultra-high-throughput analysis oftranscriptomes and genomes. Nucleic Acids Research, 34(12):e84, 2006.

[15] Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L,Taillon BE, Chen Z, Tanzer A, Saunders AC, Chi J, Yang F, Carter NP, Hurles ME, Weissman SM, Harkins TT,Gerstein MB, Egholm M, and Snyder M. Paired-End Mapping Reveals Extensive Structural Variation in the HumanGenome. Science, 2007.

[16] K.S.J. Elenitoba-Johnson, D.K. Crockett, J.A. Schumacher, S.D. Jenson, C.M. Coffin, A.L. Rockwood, and M.S.Lim. Proteomic identification of oncogenic chromosomal translocation partners encoding chimeric anaplastic lym-phoma kinase fusion proteins. Proc Natl Acad Sci U S A, 103(19):7402–7407, 2006.

[17] C. Meyer, T. Burmeister, S. Strehl, B. Schneider, D. Hubert, O. Zach, O. Haas, T. Klingebiel, T. Dingermann, andR. Marschalek. Spliced MLL fusions: a novel mechanism to generate functional chimeric MLL-MLLT1 transcriptsin t(11;19)(q23;p13.3) leukemia. Leukemia, 21(3):588–590, 2007.

[18] C.M. Croce, J. Erikson, F.G. Haluska, L.R. Finger, and Y. Showe, L.C.and Tsujimoto. Molecular genetics of humanB- and T-cell neoplasia. In Cold Spring Harb Symp Quant Biol., volume 51, pages 891–8, 1986.

[19] P. Pevzner. Computational Molecular Biology: An Algorithmic Approach. MIT press, 2000.

[20] B.J. Raphael, C. Volik, S.and Collins, and P.A. Pevzner. Reconstructing tumor genome architectures. Bioinformat-ics, 19 Suppl 2:II162–II171, 2003.

[21] S. Volik, B.J. Raphael, G. Huang, M.R. Stratton, G. Bignel, J. Murnane, J.H. Brebner, K. Bajsarowicz, P.L.Paris, Q. Tao, et al. Decoding the fine-scale structure of a breast cancer genome and transcriptome. Genome Res,16(3):394–404, 2006.

20

[22] B.J. Raphael, S. Volik, P. Yu, C. Wu, G. Huang, J. Waldman, F. Costello, K. Pienta, G. Mills, K. Bajsarowicz,Y. Kobayashi, S. Sridharan, P. Paris, Q Tao, S. Aerni, R.P. Brown, A. Bashir, J.W. Gray, J-F. Cheng, P. Jong,M. Nefedov, H.M. Padilla-Nash, and C.C. Collins. A sequence based survey of the complex structural organizationof tumor genomes. In Submission.

[23] N. Kim, P. Kim, S. Nam, S. Shin, S. Lee, and O. Journals. ChimerDB–a knowledgebase for fusion sequences. NucleicAcids Res, 34(Database issue):D21–D24, 2006.

[24] M. Barlund, O. Monni, J.D. Weaver, P. Kauraniemi, G. Sauter, M. Heiskanen, O.P. Kallioniemi, and A. Kallioniemi.Cloning of BCAS3 (17q23) and BCAS4 (20q13) genes that undergo amplification, overexpression, and fusion in breastcancer. Genes Chromosomes Cancer, 35(4):311–317, 2002.

[25] L. Clarke and J. Carbon. A colony bank containing synthetic Col El hybrid plasmids representative of the entire E.coli genome. Cell, 9(1):91–99, 1976.

[26] E. S. Lander and M. S. Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis.Genomics, 2(3):231–239, 1988.

[27] D. Karolchik, R. Baertsch, M. Diekhans, T. S. Furey, A. Hinrichs, Y. T. Lu, K. M. Roskin, M. Schwartz, C. W.Sugnet, D. J. Thomas, R. J. Weber, D. Haussler, W. J. Kent, and University of California Santa Cruz. The UCSCGenome Browser Database. Nucleic Acids Res, 31(1):51–54, 2003.

[28] G. Manning, D. B. Whyte, R. Martinez, T. Hunter, and S. Sudarsanam. The protein kinase complement of thehuman genome. Science, 298(5600):1912–1934, 2002.

[29] M. Ashburner, CA Ball, JA Blake, D. Botstein, H. Butler, JM Cherry, AP Davis, K. Dolinski, SS Dwight, JT Eppig,et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25–29,2000.

[30] Y.T. Liu and D.A. Carson. A novel approach for determining cancer genomic breakpoints in the presence of normalDNA. PLoS ONE, 2(4):e380, 2007.

[31] Graham R Bignell, Thomas Santarius, Jessica C M Pole, Adam P Butler, Janet Perry, Erin Pleasance, Chris Green-man, Andrew Menzies, Sheila Taylor, Sarah Edkins, Peter Campbell, Michael Quail, Bob Plumb, Lucy Matthews,Kirsten McLay, Paul A W Edwards, Jane Rogers, Richard Wooster, P Andrew Futreal, and Michael R Stratton.Architectures of somatic genomic rearrangement in human cancer amplicons at sequence-level resolution. GenomeRes, 17(9):1296–1303, 2007.

[32] D. Pinkel and D.G. Albertson. Array comparative genomic hybridization and its applications in cancer. Nat Genet,37 Suppl:S11–S17, Jun 2005.

[33] B.J. Raphael and P.A. Pevzner. Reconstructing tumor amplisomes. Bioinformatics, 20 Suppl 1:I265–I273, 2004.

[34] P.L. Paris, S. Sridharan, A. Scheffer, A. Tsalenko, L. Bruhn, and C. Collins. High resolution oligonucleotide CGHusing DNA from archived prostate tissue. Prostate, 67(13):1447–1455, 2007.

[35] M.T. Barrett, A. Scheffer, A. Ben-Dor, N. Sampas, D. Lipson, R. Kincaid, P. Tsang, B. Curry, K. Baird, P.S.Meltzer, et al. Comparative genomic hybridization using oligonucleotide microarrays and total genomic DNA. ProcNatl Acad Sci U S A, 101(51):17765–17770, 2004.

[36] B. Raphael, S. Volik, and C. Collins. Analysis of genomic alterations in cancer. In H. Tang, S. Kim, and E. Mardis,editors, Genome Sequencing Technology and Algorithms. Arctech House, In Press.

21

0 0.5 1 1.5 2 2.5 3

x 105

0

100

200

300

400

500

600

Size in Basepairs

Cou

nt

Distribution of MCF7 BAC Lengths

Figure I: Distribution of MCF7 Clone Lengths. The mean for this distribution is 122 kb, and the standard deviationis 24 kb. Fusion Probabilities in Table 1 are computed using this distribution and the putative fusion regions for eachgene pair (see Methods).

0 10 20 30 40 500

2

4

6

8

10

12

14x 10

4

Clonal Coverage

Bre

akpo

int R

egio

n S

ize

(Θ)

Predicted vs Simulated Θ (150kb Clones)

PredictedSimulated

Figure II: Length of a Breakpoint Region (BPR) for varying amounts of clonal coverage. The blue curveshows the expected length (Equation 5), while red curve is the average observed length over 50 simulations.

22

(a)

0

1

2

x 105051015

x 104

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Clone SizeΘζ

Pζ

(b)

05

1015

x 104

051015

x 104

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Clone SizeΘ

ζ

Pζ

5k Reads10k Reads20k Reads50k Reads100k Reads500k Reads1M Reads2M Reads10M Reads50M Reads100M Reads

Figure III: Clone Size vs. Pζ vs. |Θζ | for varying N . A clear tradeoff can be observed. Larger clone sizes yield higherPζ (detection probability), compared to smaller clone sizes, which have the advantage of better localization (smaller|Θζ |). Different lines originating from 0 refer to different number of reads. As the number of reads grows, the trade-offconverges to high detection, and better localization. (a) shows values in a mesh graph, while (b) shows raw values.

(a)

101

102

103

104

105

106

103

104

105

106

107

108

0

0.2

0.4

0.6

0.8

1

Paired Reads

Clone Siize

Pζ

(b)

10110

210310

410510

6 102

104

106

108

101

102

103

104

105

106

Paired ReadsClone Size

Θζ

Figure IV: The effect of clone size and number of paired reads on Pζ and |Θζ |. (a) Pζ increases as the number of pairedreads N or clone length L increases, but is constant as a function of NL. (b) |Θζ | decreases as the number of pairedreads increases or the clonse length decreases. Note that all axes are log values (with the exception of Pζ in (a).

23

(a)

0 2 4 6 8 100

0.2

0.4

0.6

0.8

1

Reads(Millions)

ζ

1kb2kb10kb40kb150kb

P

(b)

0 2 4 6 8 100

5

10

15 x 104

Reads(Millions)

| Θ |

ζ

1kb2kb10kb40kb150kb

Figure V: (a) The probability of detecting a fusion point, Pζ , for different clone sizes and varying number of mappedpaired reads. (b) The expected length of a breakpoint region, |Θζ |, around a fusion point (assuming that the fusion pointis contained in a clone).

0 0.5 1 1.5 2

x 106

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

1.2

Reads

Ave

rage

Fus

ion

Pro

babi

lity

200kb Genes


Figure VI: Average fusion probability vs. number of mapped reads. The average fusion probability with meanand standard deviations as a function of N , the number of mapped paired reads. The x-axis represents the number ofclones sequenced, N . The simulated fusion genes were 200kb.

24

Documents

Evaluation of Paired-end Sequencing Strategies for Detection of … · 2017. 1. 9. · Paired-end sequencing is emerging as a key technique for assessing genome rearrangements and