

Optimal Seed Solver: Optimizing Seed Selection in Read Mapping

Hongyi Xin 1∗, Sunny Nahar 1, Richard Zhu 1, John Emmons 5, Gennady Pekhimenko 1, Carl Kingsford 3, Can Alkan 4∗ and Onur Mutlu 2∗

1 Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
2 Dept. of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
3 Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
4 Dept. of Computer Engineering, Bilkent University, Bilkent, Ankara, 06800, Turkey.
5 Department of Computer Science and Engineering, Washington University, St. Louis, MO 63130, USA.

ABSTRACT

Motivation: Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines the sensitivity of the mapper, while the total frequency of all selected seeds determines the speed of the mapper. Modern seed-and-extend mappers usually select seeds with either an equal and fixed-length scheme or with an inflexible placement scheme, both of which limit the potential of the mapper to select less frequent seeds to speed up the mapping process. Therefore, it is crucial to develop a new algorithm that can adjust both the individual seed length and the seed placement, as well as derive less frequent seeds.

Results: We present the Optimal Seed Solver (OSS), a dynamic programming algorithm that discovers the least frequently-occurring set of x seeds in an L-bp read in O(x × L) operations on average and in O(x × L^2) operations in the worst case, while generating O(L^2) maximal seed frequency lookups. We compared OSS against four state-of-the-art seed selection schemes and observed that OSS provides a 3-fold reduction of average seed frequency over the best previous seed selection optimizations.

Availability: We provide an implementation of the Optimal Seed Solver in C++ at: https://github.com/CMU-SAFARI/Optimal-Seed-Solver

Contact: [email protected], [email protected], [email protected]

1 INTRODUCTION

The invention of high-throughput sequencing (HTS) platforms during the past decade triggered a revolution in the field of genomics. These platforms enable scientists to sequence mammalian-sized genomes in a matter of days, which has created new opportunities for biological research. For example, it is now possible to investigate human genome diversity between populations (1000 Genomes Project Consortium, 2010, 2012), find

∗ To whom correspondence should be addressed.

genomic variants likely to cause disease (Ng et al., 2010; Flannick et al., 2014), and study the genomes of ape species (Marques-Bonet et al., 2009; Scally et al., 2012; Ventura et al., 2011; Prado-Martinez et al., 2013) and ancient hominids (Green et al., 2010; Reich et al., 2010; Meyer et al., 2012) to better understand human evolution.

However, these new sequencing platforms drastically increase the computational burden of genome data analysis. First, billions of short DNA segments (called reads) are aligned to a long reference genome. Each read is aligned to one or more sites in the reference based on similarity, with a process called read mapping (Flicek and Birney, 2009). Reads are matched to locations in the genome with a certain allowed number of errors: insertions, deletions, and substitutions (which usually constitute less than 5% of the read's length). Matching strings approximately with a certain number of allowed errors is a difficult problem. As a result, read mapping constitutes a significant portion of the time spent during the analysis of genomic data.

Pigeonhole principle based seed-and-extend mappers are one kind of popular mappers that have been widely used to aid many biological applications (Green et al., 2010; Navin et al., 2011; Van Vlierberghe et al., 2010). In pigeonhole based seed-and-extend read mappers such as mrFAST (Alkan et al., 2009; Xin et al., 2013), RazerS3 (Weese et al., 2012), GEM (Marco-Sola et al., 2012), SHRiMP (Rumble et al., 2009) and Hobbes (Ahmadi et al., 2011), reads are partitioned into several short segments called seeds. Here we define seeds as substrings of a read. This definition is different from the "spaced seeds" definition (which can be a subsequence, rather than a substring), a concept we explain in the Related Works section. Seeds are used as indices into the reference genome to reduce the search space and speed up the mapping process. Since a seed is a substring of the read that contains it, every correct mapping of the read in the reference genome will also be mapped by the seed (assuming no errors in the seed). Therefore, the mapping locations of the seeds generate a pool of potential mappings of the read. Mapping locations of seeds in the reference genome are pre-computed and stored in a seed database (usually


Associate Editor: Prof. Gunnar Rätsch

© The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Bioinformatics Advance Access published November 14, 2015


implemented as a hash table or the Burrows-Wheeler transformation (BWT) (Burrows et al., 1994) with FM-indexing (Ferragina and Manzini, 2000)) and can be quickly retrieved through a database lookup.
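As a toy illustration of such a seed database (our own minimal sketch, not the actual hash-table or FM-index layout of any particular mapper), every k-mer of a reference string can be mapped to the list of positions where it occurs; a seed's frequency is then the size of that list:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Build a toy seed database: every k-mer of the reference is mapped to the
// positions where it occurs. Real mappers index the full genome with a hash
// table or an FM-index; this sketch only illustrates the lookup semantics.
std::unordered_map<std::string, std::vector<int>>
build_seed_db(const std::string& ref, int k) {
    std::unordered_map<std::string, std::vector<int>> db;
    for (int i = 0; i + k <= (int)ref.size(); ++i)
        db[ref.substr(i, k)].push_back(i);
    return db;
}

// Frequency of a seed = number of reference locations it maps to.
int seed_frequency(const std::unordered_map<std::string, std::vector<int>>& db,
                   const std::string& seed) {
    auto it = db.find(seed);
    return it == db.end() ? 0 : (int)it->second.size();
}
```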

When there are errors in a read, the read can still be correctly mapped as long as there exists one seed of the read that is error free. The error-free seed can be obtained by breaking the read into many non-overlapping seeds; in general, to tolerate e errors, a read is divided into e + 1 seeds, and based on the pigeonhole principle, at least one seed will be error free.
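The pigeonhole division described above can be sketched as follows (a minimal helper of our own; the near-equal split policy is one common choice, not mandated by the paper):

```cpp
#include <string>
#include <vector>

// Pigeonhole seed extraction: to tolerate e errors, cut the read into
// e + 1 non-overlapping seeds of (nearly) equal length. With at most e
// errors in the read, at least one of the e + 1 seeds must be error free.
std::vector<std::string> pigeonhole_seeds(const std::string& read, int e) {
    int n = e + 1;
    int len = (int)read.size() / n, rem = (int)read.size() % n;
    std::vector<std::string> seeds;
    int pos = 0;
    for (int i = 0; i < n; ++i) {
        int l = len + (i < rem ? 1 : 0);  // spread the remainder base-pairs
        seeds.push_back(read.substr(pos, l));
        pos += l;
    }
    return seeds;
}
```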

Alternatively, a mapper can also choose overlapping seeds and follow the q-gram approach (Rasmussen et al., 2006) to achieve full mapping sensitivity (finding all valid mappings that have fewer errors than permitted), or simply select overlapping seeds without guaranteeing full mapping sensitivity under the given error threshold (e.g., bowtie2 (Langmead and Salzberg, 2012), BWA-MEM (Li, 2013)). Compared to the pigeonhole principle (a special case of the q-gram approach), selecting overlapping seeds following the q-gram approach could generate longer, less frequent seeds. However, in order to guarantee full mapping sensitivity, it also selects a larger number of seeds, which may increase the total number of potential mappings. In this work, we focus on seed selection mechanisms based on the pigeonhole principle that provide full mapping sensitivity by selecting non-overlapping seeds.

For each selected non-overlapping seed, its locations are further verified using weighted edit-distance calculation mechanisms (such as the Smith-Waterman (Smith and Waterman, 1981) and Needleman-Wunsch (Needleman and Wunsch, 1970) algorithms) to examine the similarity between the read and the reference at each potential mapping site. Locations that pass this final verification step (i.e., contain fewer than e substitutions, insertions, and deletions) are valid mappings and are recorded by the mapper for use in later stages of genomic analysis.

Computing the edit-distance is an expensive operation and is the primary computation performed by most read mappers. In fact, speeding up this computation is the subject of many other works in this area of research, such as Shifted Hamming Distance (Xin et al., 2015), Gene Myers' bit-vector algorithm (Myers, 1999) and SIMD implementations of edit-distance algorithms (Szalkowski et al., 2008; Rognes, 2011). To allow edits, mappers must divide reads into multiple seeds. Each seed increases the number of locations that must be verified. Furthermore, to divide a read into more seeds, the lengths of the seeds must be reduced to make space for the increased number of seeds; shorter seeds occur more frequently in the genome, which requires the mapper to verify even more potential mappings.

Therefore, the key to building a fast yet error tolerant mapper with high sensitivity is to select many seeds (to provide greater tolerance) while minimizing their frequency of occurrence (or simply frequency) in the genome to ensure fast operation. Our goal in this work is to lay a theoretically-solid groundwork to enable techniques for optimal seed selection in current and future seed-and-extend mappers.

Selecting the optimal set of non-overlapping seeds (i.e., the least frequent set of seeds) from a read is difficult, primarily because the associated search space (all valid choices of seeds) is large and it grows exponentially as the number of seeds increases. A seed can be selected at any position in the read with any length, as long as it does not overlap other seeds. We observe that there is a significant advantage to selecting seeds with unequal lengths, as possible seeds of equal lengths can have drastically different frequencies.

Our goal in this paper is to develop an inexpensive algorithm for seed-and-extend mappers based on the pigeonhole principle that derives the optimal placement and length of each seed in a read such that the overall sum of frequencies of all seeds is minimized.

This paper makes the following contributions:

• Examines the frequency distribution of seeds in the seed database and counts how often seeds of different frequency levels are selected using a naïve seed selection scheme. We confirm the discovery of prior works (Kiełbasa et al., 2011) that frequencies are not evenly distributed among seeds and that frequent seeds are selected more often under a naïve seed selection scheme. We further show that this phenomenon persists even when using longer seeds.

• Provides an implementation of an optimal seed finding algorithm, the Optimal Seed Solver, which uses dynamic programming to efficiently find the least frequent non-overlapping seeds of a given read. We prove that this algorithm always provides the least frequently-occurring set of seeds in a read.

• Gives a comparison of the Optimal Seed Solver and existing seed selection optimizations, including the Adaptive Seeds Filter in the GEM mapper (Marco-Sola et al., 2012), Cheap K-mer Selection in FastHASH (Xin et al., 2013), Optimal Prefix Selection in the Hobbes mapper (Ahmadi et al., 2011) and spaced seeds in PatternHunter (Ma et al., 2002). We compare the complexity, memory traffic, and average frequency of selected seeds of the Optimal Seed Solver with the above four state-of-the-art seed selection mechanisms. We show that the Optimal Seed Solver provides the least frequent set of seeds among all existing seed selection optimizations, at reasonable complexity and memory traffic.

2 MOTIVATION

To build a fast yet error tolerant mapper with high mapping coverage, reads need to be divided into many infrequently occurring seeds. In this way, mappers will be able to find all correct mappings of the read (mappings with small edit-distances) while minimizing the number of edit-distance calculations that need to be performed. To achieve this goal, we have to overcome two major challenges: (1) seeds are short in general, and therefore frequent in the genome; and (2) the frequencies of seeds vary significantly. Below we discuss each challenge in greater detail.

Assume a read has a length of L base-pairs (bp) and x% of it is erroneous (e.g., L = 80 and x% = 5% implies that there are 4 edits). To tolerate x% × L errors in the read, we need to select x% × L + 1 seeds, which renders a seed L ÷ (x% × L + 1)-bp long on average. Given that the desired error rates for many mainstream mappers have been as large as 0.05, typically the average seed length of a hash-table based mapper is not greater than 16-bp (Alkan et al., 2009; Weese et al., 2012; Rumble et al., 2009; Marco-Sola et al., 2012; Ahmadi et al., 2011).
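This seed-count arithmetic can be captured in a small helper (our own sketch; the function name is ours):

```cpp
#include <cmath>

// Average seed length when tolerating a given error rate over an L-bp read:
// e = x% * L errors require e + 1 non-overlapping seeds, so each seed gets
// L / (e + 1) base-pairs on average. E.g., L = 80 at 5% gives 4 edits,
// 5 seeds, and a 16-bp average seed length.
double avg_seed_length(int L, double error_rate) {
    int e = (int)std::floor(error_rate * L);  // number of edits to tolerate
    return (double)L / (e + 1);               // base-pairs per seed on average
}
```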

Seeds have two important properties: (1) the frequency of a seed is monotonically non-increasing with greater seed lengths, and (2) the frequencies between seeds typically differ (sometimes significantly) (Kiełbasa et al., 2011). Figure 1 shows the static distribution of frequencies of 10-bp to 15-bp fixed-length seeds from the human reference genome (GRCh37). This figure shows that the average seed frequency decreases with the increase in the seed length. With longer seeds, there are more patterns to index the reference genome; thus each pattern on average is less frequent.

Fig. 1. Frequency distribution of seeds in fixed-length seed (k-mers at varying k's) databases of human reference genome version 37. Each plot shows how many unique seeds there are at each frequency level. Notice that the trend in each plot is not a continuous line but is made of many discrete data points, each denoting the number of unique seeds at a given frequency level.

From Figure 1, we can also observe that the frequencies of seeds are not evenly distributed: for seeds with lengths between 10-bp and 15-bp, many seeds have frequencies below 100; as the figure shows, seed frequencies below 100 have high seed counts, often over 10^3. However, there are also a few seeds with frequencies greater than 100K, even though the seed counts at such frequencies are very low, usually just 1. This explains why most plots in Figure 1 follow a bimodal distribution, except for 10-bp seeds and perhaps 11-bp seeds, where the frequency of seeds peaks at around 100. Although ultra-frequent seeds (seeds that appear more than 10^4 times) are few among all seeds, they are ubiquitous in the genome. As a result, for a randomly selected read, there is a high chance that the read contains one or more such frequent seeds. This effect is best illustrated in Figure 2, which presents the frequencies of consecutively selected seeds when we map over 4 million randomly selected 101-bp reads from the 1000 Genomes Project (1000 Genomes Project Consortium, 2010) to the human reference genome.

Fig. 2. Frequency distribution of seeds at runtime when selecting 15-bp seeds consecutively from reads while mapping 4,031,354 101-bp reads from a real read set, ERR240726 from the 1000 Genomes Project, to human reference version 37, under different numbers of required seeds. Unlike Figure 1, which only counts the population of unique seeds at different frequency levels in the human reference, this figure records the overall numbers of encounters of seeds at different frequency levels while mapping a real read set. In this figure, repetitive seeds are counted multiple times, once at each encounter.

Unlike in Figure 1, in which the average frequency of 15-bp seeds is 5.25, the average frequencies of seeds in Figure 2 are all greater than 2.7K. Furthermore, from Figure 2, we can observe that the ultra-frequent seeds are selected far more often than some of the less frequent seeds, as the seed count increases with seed frequency beyond 10^4 (as opposed to Figure 1, where seed frequencies over 10^4 usually have seed counts below 10). This observation suggests that the ultra-frequent seeds are surprisingly numerous in reads, especially considering how few ultra-frequent seed patterns there are in total in the seed database (and the plots in Figure 2 no longer follow a bimodal distribution as in Figure 1). We call this phenomenon the frequent seed phenomenon; it is explained in previous works (Kiełbasa et al., 2011). To summarize, highly frequent seed patterns are ubiquitous in the genome; therefore, they appear more often in randomly sampled reads, such as reads sampled from shotgun sequencing.


Frequency distributions of other seed lengths are provided in the Supplementary Materials.

The key takeaway from Figure 1 and Figure 2 is that although longer seeds on average are less frequent than shorter seeds, some seeds are still much more frequent than others and are more prevalent in the reads. Therefore, with a naïve seed selection mechanism (e.g., selecting seeds consecutively from a read), a mapper still selects many frequent seeds, which increases the number of calls to the computationally expensive verification process.

To reduce the total frequency of selected seeds, we need an intelligent seed selection mechanism that avoids using frequent patterns as seeds. More importantly, as there is a limited number of base-pairs in a read, we need to carefully choose the length of each seed. Extending an infrequent seed does not necessarily provide much reduction in the total frequency of all seeds, but it will "consume" base-pairs that could have been used to extend other, more frequent seeds. Besides determining individual seed lengths, we should also intelligently select the position of each seed. If multiple seeds are selected from a small region of the read, then, as they are closely packed together, the seeds are forced to keep short lengths, which potentially increases their frequencies.

Based on the above observations, our goal in this paper is to develop an algorithm that can calculate both the length and the placement of each seed in the read such that the total frequency of all seeds is minimized. We call such a set of seeds the optimal seeds of the read, as they produce the minimum number of potential mappings to be verified while maintaining the sensitivity of the mapper. We call the sum of frequencies of the optimal seeds the optimal frequency of the read.

3 METHODS

The biggest challenge in deriving the optimal seeds of a read is the large search space. If we allow a seed to be selected from an arbitrary location in the read with an arbitrary length, then from a read of length L, there can be L × (L + 1) / 2 possibilities to extract a single seed. When there are multiple seeds, the search space grows exponentially, since the position and length of each newly selected seed depend on the positions and lengths of all previously selected seeds. For x seeds, there can be as many as O(L^(2x) / x!) seed selection schemes.

Below we propose the Optimal Seed Solver (OSS), a dynamic programming algorithm that finds the optimal set of x seeds of a read in O(x × L) operations on average and in O(x × L^2) operations in the worst case.

Although in theory a seed can have any length, in OSS we assume the length of a seed is bounded by a range [Smin, Smax]. This bound is based on our observation that, in practice, neither very short seeds nor very long seeds are commonly selected in optimal seeds. Ultra-short seeds (< 8-bp) are too frequent: most seeds shorter than 8-bp have frequencies over 1000. Ultra-long seeds "consume" too many base-pairs from the read, which shortens the lengths of other seeds and increases their frequencies. Furthermore, long seeds (e.g., 40-bp) are mostly either unique or non-existent in the reference genome (a seed of 0 frequency is still useful in read mapping, as it confirms there exists at least one error in it). Extending a unique or non-existent seed provides little benefit while "consuming" extra base-pairs from the read.

Bounding seed lengths reduces the search space of optimal seeds. However, it is not essential to OSS. OSS can still work without seed length limitations (to lift the limitations, one can simply set Smin = 1 and Smax = L), although OSS will perform extra computation.

Below we describe our Optimal Seed Solver algorithm in three sections. First, we introduce the core algorithm of OSS. Then we improve the algorithm with four optimizations: optimal divider cascading, early divider termination, divider sprinting, and optimal solution forwarding. Finally, we explain the overall algorithm and provide the pseudo-code.

3.1 The core algorithm

A naïve brute-force solution to find the optimal seeds of a read would be to systematically iterate through all possible combinations of seeds. We start with selecting the first seed by instantiating all possible positions and lengths of the seed. On top of each position and length of the first seed, we instantiate all possible positions and lengths of the second seed, which is sampled after (to the right-hand side of) the first seed. We repeat this process for the rest of the seeds until we have sampled all seeds. For each combination of seeds, we calculate the total seed frequency and find the minimum total seed frequency among all combinations.
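The brute-force enumeration above can be sketched as follows; `freq(start, len)` is a hypothetical stand-in for a seed-database frequency lookup, and the [smin, smax] seed-length bound from earlier in this section is applied. The recursion is exponential in the number of seeds and is only usable as a correctness reference on tiny inputs:

```cpp
#include <algorithm>
#include <functional>

// Enumerate every placement of x non-overlapping seeds with lengths in
// [smin, smax] and return the minimum total frequency. freq(start, len)
// stands in for a lookup of read[start, start+len) in the seed database.
long long brute_force_min_freq(int read_len, int x, int smin, int smax,
                               const std::function<long long(int, int)>& freq) {
    const long long INF = 1LL << 60;
    std::function<long long(int, int)> go = [&](int pos, int left) -> long long {
        if (left == 0) return 0;                       // all seeds placed
        long long best = INF;
        for (int s = pos; s + smin <= read_len; ++s)   // next seed's start
            for (int l = smin; l <= smax && s + l <= read_len; ++l) {
                long long rest = go(s + l, left - 1);  // remaining seeds go right
                if (rest < INF)
                    best = std::min(best, freq(s, l) + rest);
            }
        return best;
    };
    return go(0, x);
}
```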

The key problem in the brute-force solution above is that it examines many obviously suboptimal combinations. For example, in Figure 3, there are two m-seed combinations, SA and SB, extracted from the same read, R. Both combinations end at the same position, p, in R. Among them, SB has a higher total seed frequency than SA. For any number of seeds greater than m, we know that in the final optimal solution of R, the seeds before position p will not be exactly like SB, since any seeds that are appended after SB (e.g., S′B in Figure 3) can also be appended after SA (e.g., S′A in Figure 3) to produce a smaller total seed frequency. In other words, compared to SB, only SA has the potential to be part of the optimal solution and is worth appending more seeds after. In general, among two combinations that have equal numbers of seeds and end at the same position in the read, only the combination with the smaller total seed frequency has the potential of becoming part of a bigger (more seeds) optimal solution. Therefore, for a partial read and all combinations of a subset of seeds in this partial read, only the optimal subset of seeds of this partial read (with regard to different numbers of seeds) might be relevant to the optimal solution of the entire read. Any other suboptimal solution of this partial read will not lead to the optimal solution and should be pruned.

The above observation suggests that by summarizing the optimal solutions of partial reads under smaller numbers of seeds, we can prune the search space of the optimal solution. Specifically, given m (with m < x) seeds and a substring U, only the optimal m seeds of U could be part of the optimal solution of the read. Any other suboptimal combination of m seeds of U should be pruned.

Storing the optimal solutions of partial reads under smaller numbers of seeds also helps speed up the computation for larger numbers of seeds. Assuming we have already calculated and stored the optimal frequency of m seeds of all substrings of R, to calculate the optimal (m + 1)-seed solution of a substring, we can iterate through a series of divisions of this substring. In each division, we


Fig. 3. Two 2-seed combinations, SA and SB, that end at the same position, p, in the read. The total seed frequency of SA is smaller than that of SB. Both combinations can be extended by adding a third seed, making them S′A and S′B respectively. For any third seed, the total seed frequency of S′B must be greater than that of S′A. Hence, SB cannot be part of any optimal solution.

divide the substring into two parts: we extract m seeds from the first part and 1 seed from the second part. The minimum total seed frequency of this division (or simply the "optimal frequency of the division") is simply the sum of the optimal m-seed frequency of the first part and the optimal 1-seed frequency of the second part. As we already have both the optimal m-seed frequency of the first part and the 1-seed frequency of the second part calculated and stored, the optimal frequency of this division can be computed with one addition and two lookups.

The optimal (m + 1)-seed solution of this substring is simply the division that yields the minimum total frequency. Given that each seed requires at least Smin base-pairs, for a substring of length L′, there are in total L′ − (m + 1) × Smin possible divisions to be examined. This relationship can be summarized as a recurrence function in Equation 1, in which Opt(U, m) denotes the optimal m-seed frequency of substring U and u denotes the length of U.

Opt(U, m + 1) = min_i [ Opt(U[1 : i − 1], m) + Opt(U[i : u], 1) ]    (1)

We can apply the same strategy to the entire read: to obtain the optimal x + 1 seeds from read R, we first examine all possible 2-part divisions of the read, which divide the read into a prefix and a suffix. For each division, we extract x seeds from the prefix and 1 seed from the suffix. The optimal (x + 1)-seed solution of the read is simply the division that provides the lowest total seed frequency. As we have discussed above, for a division to be optimal, its x-seed prefix and 1-seed suffix must also be optimal (provide the minimum total seed frequency). By the same logic, to obtain the optimal x-seed solution of a prefix, we can further divide the prefix into an optimal (x − 1)-seed prefix and an optimal 1-seed substring (no longer a suffix of the read). We can keep applying this prefix-division process until we reach 1-seed prefixes. In other words, by progressively calculating the optimal solutions of all prefixes from 1 to x seeds, we can find the optimal (x + 1)-seed solution of the read.

OSS implements the above strategy using a dynamic programming algorithm: to calculate the optimal (x + 1)-seed solution of a read, R, OSS computes and stores optimal solutions of prefixes with fewer seeds through x iterations. In each iteration, OSS computes optimal solutions of prefixes with regard to a specific number of seeds. In the mth iteration (m ≤ x), OSS computes the optimal m-seed solutions of all prefixes of R by re-using optimal solutions computed in the previous, (m−1)th iteration. For each prefix, OSS performs a series of divisions and finds the division that provides the minimum total frequency of m seeds. For each division, OSS computes the optimal m-seed frequency by summing up the optimal (m−1)-seed frequency of the first part and the 1-seed frequency of the second part. Both frequencies can be obtained from previous iterations. Overall, OSS starts from 1 seed and iterates to x seeds. Finally, OSS computes the optimal (x + 1)-seed solution of R by finding the optimal division of R, reusing results from the xth iteration.
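To make the recurrence concrete, the basic dynamic program (before the optimizations of Section 3.2) can be sketched as follows. This is an illustrative sketch, not the released implementation: `toySeedFreq` is a made-up stand-in for the seed-database lookup (a hash table or FM-index in a real mapper), and 0-based indexing is used.

```cpp
#include <algorithm>
#include <climits>
#include <string>
#include <vector>

// Toy stand-in for the seed database: a deterministic hash of the seed text.
// (Assumption for illustration; real mappers query a precomputed index.)
static long toySeedFreq(const std::string& seed) {
    long h = 0;
    for (char c : seed) h = (h * 31 + c) % 97;
    return h + 1;  // frequencies are at least 1
}

// Optimal 1-seed frequency of read[i..j]: the least frequent seed of
// length >= Smin fully contained in the substring.
static long optimalFreq1(const std::string& read, int i, int j, int Smin) {
    long best = LONG_MAX;
    for (int a = i; a + Smin - 1 <= j; ++a)
        for (int b = a + Smin - 1; b <= j; ++b)
            best = std::min(best, toySeedFreq(read.substr(a, b - a + 1)));
    return best;
}

// Basic O(x * L^2) dynamic program of Section 3.1 (without the Section 3.2
// optimizations): opt[m][l] is the optimal m-seed frequency of read[0..l).
long optimalSeedFreq(const std::string& read, int x, int Smin) {
    int L = (int)read.size();
    std::vector<std::vector<long>> opt(x + 1, std::vector<long>(L + 1, LONG_MAX));
    for (int l = Smin; l <= L; ++l)           // 1-seed solutions of all prefixes
        opt[1][l] = optimalFreq1(read, 0, l - 1, Smin);
    for (int m = 2; m <= x; ++m)              // m-seed solutions from (m-1)-seed ones
        for (int l = m * Smin; l <= L; ++l)
            for (int div = (m - 1) * Smin; div + Smin <= l; ++div)  // divider position
                if (opt[m - 1][div] != LONG_MAX)
                    opt[m][l] = std::min(opt[m][l],
                        opt[m - 1][div] + optimalFreq1(read, div, l - 1, Smin));
    return opt[x][L];
}
```

The triple loop makes the O(x × L²) cost visible: x iterations, O(L) prefixes per iteration, and O(L) dividers per prefix.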

3.2 Further optimizations

With the dynamic programming algorithm, OSS can find the optimal (x + 1) seeds of an L-bp read in O(x × L²) operations: in each iteration, OSS examines O(L) prefixes (to be exact, L − (x + 1) × Smin prefixes per iteration), and for each prefix OSS inspects O(L) divisions (to be exact, L′ − i × Smin divisions of an L′-bp prefix in the ith iteration). In total, there are O(L²) divisions to be verified in an iteration.

To further speed up OSS and reduce the average complexity of processing each iteration, we propose four optimizations to OSS: optimal divider cascading, early divider termination, divider sprinting, and optimal solution forwarding. With all four optimizations, we empirically reduce the average complexity of processing an iteration to O(L). Below we describe the four optimizations in detail.

3.2.1 Optimal divider cascading Until this point, our assumption is that optimal solutions of prefixes within an iteration are independent from each other: the optimal division (the division that provides the optimal frequency) of one prefix is independent from the optimal division of another prefix, and thus they must be derived independently.

We observe that this assumption is not necessarily true, as there exists a relationship between two prefixes of different lengths in the same iteration (under the same seed number): the first optimal divider (the optimal divider that is the closest towards the beginning of the read, if there exist multiple optimal divisions with the same total frequency) of the shorter prefix must be at the same or a closer position towards the beginning of the read, compared to the first optimal divider of the longer prefix. We call this phenomenon the optimal divider cascading, and it is depicted in Figure 4. The proof that the optimal divider cascading always holds is provided in the Supplementary Materials.

Based on the optimal divider cascading phenomenon, we know that for two prefixes in the same iteration, the first optimal divider of the shorter prefix must be no further than the first optimal divider of the longer prefix. With this relationship, we can reduce the search space of optimal dividers in each prefix by processing prefixes within an iteration from the longest to the shortest.

In each iteration, we start with the longest prefix of the read, which is the read itself. We examine all divisions of the read and find its first optimal divider. Then, we move to the next prefix, of length L − 1. In this prefix, we only need to check dividers that are at the same or a prior position compared to the first optimal divider of the read. After processing the length L − 1 prefix, we move to the length L − 2 prefix, whose search space is further reduced to positions that are at the same or a closer position to the beginning of the read than the first optimal divider of the length L − 1 prefix. This procedure is repeated until the shortest prefix in this iteration is processed.

[Downloaded from http://bioinformatics.oxfordjournals.org/ at Washington University School of Medicine Library on November 23, 2015.]

Fig. 4. All prefixes and their first optimal dividers in an iteration. We observe that the first optimal divider of a longer prefix is never more towards the beginning of the read than the first optimal divider of a shorter prefix.

3.2.2 Early divider termination With optimal divider cascading, we are able to reduce the search space of the first optimal divider of a prefix and exclude positions that come after the first optimal divider of the previous, 1-bp longer prefix. However, the search space is still large, since any divider prior to the first optimal divider of the previous prefix could be the optimal divider. To further reduce the search space of dividers in a prefix, we propose the second optimization – early divider termination.

The key idea of early divider termination is simple: the optimal frequency of a substring is monotonically non-increasing as the substring extends longer in the read (see Lemma 1 in the Supplementary Materials for the proof of this fact).

Based on the optimal divider cascading, we start at the position of the first optimal divider in the previous prefix. Then, we gradually move the divider towards the beginning (or simply move backward) and check the total seed frequency of the division after each move. During this process, the first part of the division gradually shrinks while the second part gradually grows, as we show in Figure 5. According to Lemma 1 in the Supplementary Materials, the optimal frequency of the first part must be monotonically non-decreasing while the optimal frequency of the second part must be monotonically non-increasing.

Fig. 5. Propagating dividers in a prefix. The divider starts at the position of the previous prefix's first optimal divider, then gradually moves towards the beginning of the prefix, until it reaches the termination position, T.

For each position of the divider, let FREQ_P2 denote the frequency of the second part (P2) and ∆FREQ_P1 denote the change in the frequency of the first part (P1) between the current and the next move. Early divider termination states that the divider should stop moving backward whenever |∆FREQ_P1| > |FREQ_P2|. All dividers that are prior to this position are guaranteed to have greater total seed frequencies. We call this stopping position the termination position, and the division at this position the termination division, denoted as T. We name the first and the second part of T as T1 and T2, respectively.

For any divider that comes prior to the termination position, compared to the termination division, its first part is shorter and its second part is longer. Hence the optimal frequency of its first part is greater and the optimal frequency of its second part is smaller. Let |∆FREQ_P′1(T)| denote the increase of the optimal frequency of the first part between the current division and the termination division, and |∆FREQ_P′2(T)| denote the decrease of the optimal frequency of the second part. Based on Lemma 1, we have |∆FREQ_P′1(T)| ≥ |∆FREQ_T1|. Since the frequency of a seed can be no smaller than 0, we also have |FREQ_T2| ≥ |∆FREQ_P′2(T)|. Combining these with the termination condition |∆FREQ_T1| > |FREQ_T2|, we have |∆FREQ_P′1(T)| > |∆FREQ_P′2(T)|. This means that, compared to the termination division, the frequency increase of the first part must be greater than the frequency reduction of the second part. Hence, the overall optimal frequency of such a division must be greater than the optimal frequency of the termination division. Therefore, divisions prior to the termination position cannot be optimal.

Using early divider termination, we can further reduce the search space of dividers within a prefix and exclude all positions that are prior to the termination position. Since the second part of the prefix hosts only one seed, and the frequencies of most seeds decrease to 1 after extending them to a length of over 20 bp, we observe that the termination position of a prefix is reached fairly quickly, after only a few moves. With both optimal divider cascading and early divider termination, from our experiments, we observe that we only need to verify 5.4 divisions on average for each prefix (this data is obtained from mapping ERR240726 to human genome v37, under an error threshold of 5). To conclude, with both optimizations, we have reduced the average complexity of the Optimal Seed Solver to O(x × L).

3.2.3 Divider sprinting According to optimal divider cascading and early divider termination, for each prefix, after inheriting the starting divider from the previous prefix, we gradually move the divider towards the beginning of the prefix, one base-pair at a time, until early divider termination is triggered. In each move, we check the optimal frequency of the two parts in the current division, as well as the frequency increase of the first part compared to the previous division. We stop moving the divider when the frequency increase of the first part is greater than the optimal frequency of the second part.

We observe that it is unnecessary to always move the divider a single base-pair at a time and check frequencies after each move. In the early divider termination method, the move terminates only when the frequency increase of the first part is greater than the optimal seed frequency of the second part. This suggests that when the frequency of the first part remains unchanged between moves, which produces no increase in frequency, we do not need to check the frequency of the second part, as it will not trigger early termination. When multiple dividers in an interval share the same first-part frequency, we only need to verify the last divider of this interval and can skip all other dividers in the middle (an example is provided in the Supplementary Materials). The last divider always


provides the least total seed frequency among all dividers in this interval, since it has the longest second part compared to the other dividers (a longer substring always provides a less or equally frequent optimal seed) while keeping its first-part frequency the same. We call this method divider sprinting.
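As a small sketch of the sprinting step (with hypothetical names: `freq1[d]` is the optimal first-part frequency at divider position `d`, and `lo` is the lowest divider position still in the search space), a run of dividers with equal first-part frequency can be skipped in one go:

```cpp
#include <vector>

// Divider sprinting (sketch): while moving the divider backward (decreasing d),
// skip every divider whose first-part frequency equals the current one; only
// the last divider of the equal-frequency run is worth evaluating, since it
// has the longest second part while its first-part frequency is the same.
int sprintDivider(const std::vector<long>& freq1, int d, int lo) {
    while (d - 1 >= lo && freq1[d - 1] == freq1[d])
        --d;  // same first-part frequency: keep sprinting backward
    return d;
}
```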

3.2.4 Optimal solution forwarding With optimal divider cascading, early divider termination, and divider sprinting, we have reduced the average number of divisions per prefix from the previous 5.4 to 3.7. Nevertheless, for each prefix, we still need to examine at least two divisions (one for the inherited optimal division of the previous prefix and at least one more for early divider termination). We observe that some prefixes can also inherit the optimal solution of the previous prefix without verifying any divisions, as they share the same optimal divider with the previous prefix. Within an iteration, we recognize that there exist many prefixes that share the same second-part frequency with the previous prefix when divided by the previous prefix's optimal divider. We conclude that such prefixes must also share the same optimal divider, as well as the same optimal seed frequency, with the previous prefix (a detailed proof is provided in the Supplementary Materials). We call this optimal solution forwarding.

With optimal solution forwarding, for each incoming prefix, after inheriting the optimal divider from the previous prefix, we first test if the second-part frequency of the new prefix equals the second-part frequency of the previous prefix. If they are equal, then we can assert that the optimal divider of the previous prefix must also be the optimal divider of the new prefix, and move on to the next prefix without examining any divisions.

With optimal solution forwarding, we observe that we further reduce the average number of division verifications per prefix to 0.95 (this data is obtained from mapping ERR240726 to human genome v37, under an error threshold of 5), providing a 5.68x potential speedup.

3.3 The full algorithm

Below we present the full algorithm of the Optimal Seed Solver. Before calculating the optimal x-seed frequency of the read, R, we assume that we already have the optimal 1-seed frequency of any substring of R, and that it can be retrieved in an O(1)-time lookup via the optimalFreq(substring) function¹. It requires at most O(L²) lookups to the seed database for all possible substrings of the read.

Let firstOptDivider(prefix) be the function that calculates the first optimal divider of a prefix. Then the optimal set of seeds can be calculated by filling a 2-D array, opt_data, of size (x−1) × L. In this array, each element stores two fields: an optimal seed frequency and a first optimal divider. The element at the ith row and jth column stores the optimal i-seed frequency of the prefix R[1...j] as well as the first optimal divider of that prefix.

¹ This assumption is valid only if seeds are stored in a large hash table. For seeds that are pre-processed by the Burrows-Wheeler transformation, OSS requires O(s) total steps in FM-indexing to obtain the frequency of a seed, where s is the length of the seed. In total, it requires O(L³) total steps to index all possible seeds in the read, which could generate O(L³) memory accesses and O(L³) cache misses in the worst case. In the Supplementary Materials, we propose lock-step BWT, a mechanism that reduces the average total number of cache misses per read to O(L), by imposing a minimum seed length requirement and by traversing all prefixes of the read in a lock-step fashion. Specifically, lock-step BWT organizes all prefixes together such that they extend by the same base-pair in the read at the same time. Please refer to the Supplementary Materials for further details.

Algorithm 1: optimalSeedSolver
Input: the read, R
Output: the optimal x-seed frequency of R, opt_freq, and the first x-seed optimal divider of R, opt_div
Global data structure: the 2-D data array opt_data[][]
Functions:
  firstOptDivider: computes the first optimal divider of a prefix
  optimalFreq: retrieves the optimal 1-seed frequency of a substring
Pseudocode:
  // Fill the first iteration
  for l = L to Smin do
      prefix = R[1...l]
      opt_data[1][l].freq = optimalFreq(prefix)
  // From iteration 2 to x−1
  for iter = 2 to x−1 do
      // Previous optimal divider
      prev_div = L − Smin + 1
      for l = L to iter × Smin do
          prefix = R[1...l]
          div = firstOptDivider(prefix, iter, prev_div)
          1st_part = R[1...div−1]
          2nd_part = R[div...l]
          1st_freq = opt_data[iter−1][div−1].freq
          2nd_freq = optimalFreq(2nd_part)
          opt_data[iter][l].div = div
          opt_data[iter][l].freq = 1st_freq + 2nd_freq
          // optimal divider cascading
          prev_div = div
  // The final, xth iteration only needs the full read
  prev_div = L − Smin + 1
  opt_div = firstOptDivider(R, x, prev_div)
  1st_part = R[1...opt_div−1]
  2nd_part = R[opt_div...L]
  1st_freq = opt_data[x−1][opt_div−1].freq
  2nd_freq = optimalFreq(2nd_part)
  opt_freq = 1st_freq + 2nd_freq
  return opt_freq, opt_div

Algorithm 1 provides the pseudo-code of optimalSeedSolver and Algorithm 2 provides the pseudo-code of firstOptDivider.

To retrieve the starting and ending positions of each optimal seed, we can backtrack through the 2-D array and backward-induce the optimal prefixes and their optimal dividers in each iteration. The pseudo-code of the backtracking process is provided in the Supplementary Materials.
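As an illustration of the idea (the actual pseudo-code is in the Supplementary Materials), a minimal backtracking sketch might look as follows, assuming a hypothetical table `opt_div[m][l]` that stores the first optimal divider of the m-seed solution of the length-l prefix, with 0-based positions:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sketch of backtracking over the filled DP table: recover, for each of the
// x seeds, the substring of the read that hosts it. opt_div[m][l] is the
// stored first optimal divider of the m-seed solution of the length-l prefix.
std::vector<std::pair<int, int>> backtrackSeeds(
        const std::vector<std::vector<int>>& opt_div, int x, int L) {
    std::vector<std::pair<int, int>> parts;  // [begin, end) substring per seed
    int l = L;
    for (int m = x; m >= 2; --m) {
        int d = opt_div[m][l];   // the m-th seed lives in [d, l)
        parts.push_back({d, l});
        l = d;                   // recurse into the (m-1)-seed prefix
    }
    parts.push_back({0, l});     // the remaining prefix hosts the first seed
    std::reverse(parts.begin(), parts.end());
    return parts;
}
```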

For better understanding, we provide a real example in the Supplementary Materials to show how the Optimal Seed Solver operates.


Algorithm 2: firstOptDivider
Input: the prefix; the iteration count, iter; the previous prefix's divider, prev_div
Output: the first optimal divider of the prefix, first_div
Global data structure: the 2-D data array opt_data[][]; the optimal 2nd-part frequency of the previous prefix, opt_2nd_freq
Functions:
  optimalFreq: retrieves the optimal 1-seed frequency of a substring
Pseudocode:
  // optimal solution forwarding
  2nd_part = prefix[prev_div...end]
  2nd_freq = optimalFreq(2nd_part)
  if opt_2nd_freq = 2nd_freq then
      return prev_div
  first_div = prev_div
  min_freq = MAX_INT
  prev_1st_freq = MAX_INT
  prev_2nd_freq = MAX_INT
  for div = prev_div to (iter−1) × Smin do
      1st_part = prefix[1...div−1]
      2nd_part = prefix[div...end]
      1st_freq = opt_data[iter−1][div−1].freq
      next_1st_freq = opt_data[iter−1][div−2].freq
      // divider sprinting
      if next_1st_freq = 1st_freq then
          continue
      2nd_freq = optimalFreq(2nd_part)
      // early divider termination
      if (1st_freq − prev_1st_freq) > prev_2nd_freq then
          break
      freq = 1st_freq + 2nd_freq
      // update the optimal divider
      if freq ≤ min_freq then
          min_freq = freq
          first_div = div
          opt_2nd_freq = 2nd_freq
      prev_1st_freq = 1st_freq
      prev_2nd_freq = 2nd_freq
  return first_div

4 RELATED WORKS

The primary contribution of this work is a dynamic programming algorithm that derives the optimal non-overlapping seeds of a read in O(x × L) operations on average. To our knowledge, this is the first work that calculates the optimal seeds and the optimal frequency of a read. The most related prior works are optimizations to the seed selection mechanism which reduce the sum of seed frequencies of a read using greedy algorithms.

Furthermore, we distinguish OSS from other methods (Kucherov et al., 2014; Langmead and Salzberg, 2012; Li, 2013) which solve similar yet unrelated problems. These previous works either determine the number and length of erroneous seeds such that the total number of branches in backtracking is minimized for each seed (Kucherov et al., 2014), or simply select seeds and their locations through probabilistic methods without providing error tolerance guarantees (e.g., bowtie2 (Langmead and Salzberg, 2012) and BWA-MEM (Li, 2013)). By contrast, OSS finds the number and lengths of non-overlapping seeds such that the total frequency of all seeds is minimized. The former mechanisms are not designed for seed-and-extend based mappers that rely on non-overlapping seeds following the pigeonhole principle. In this paper, we only compare seed selection mechanisms that follow the pigeonhole principle.

Existing seed selection optimizations can be classified into three categories: (1) extending seed length, (2) avoiding frequent seeds, and (3) rebalancing frequencies among seeds. Optimizations in the first category extend frequent seeds longer in order to reduce their frequencies. Optimizations in the second category sample seed positions in the read and reject positions that generate frequent seeds. Optimizations in the third category rebalance frequencies among seeds such that the average seed frequency at runtime is more consistent with the static average seed frequency of the seed table.

In the remainder of this section, we qualitatively compare the Optimal Seed Solver (OSS) to four representative prior works selected from the above three categories: Cheap K-mer Selection (CKS) in FastHASH (Xin et al., 2013), Optimal Prefix Selection (OPS) in Hobbes (Ahmadi et al., 2011), the Adaptive Seeds Filter (ASF) in the GEM mapper (Marco-Sola et al., 2012), and spaced seeds in PatternHunter (Ma et al., 2002). Among the four prior works, ASF represents works from the first category; CKS and OPS represent works from the second category; and spaced seeds represents works from the third category. Below we elaborate on each of them in greater detail.

The Adaptive Seeds Filter (ASF) seeks to reduce the frequency of seeds by extending the lengths of the seeds. For a read, ASF starts the first seed at the very beginning of the read and keeps extending the seed until the seed frequency is below a pre-determined threshold, t. For each subsequent seed, ASF starts it from where the previous seed left off in the read, and repeats the extension process until the last seed is found. In this way, ASF guarantees that all seeds have a frequency below t.
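A minimal sketch of this extension policy, assuming a toy frequency function in place of the real seed database and hypothetical parameter names:

```cpp
#include <string>
#include <vector>

// Toy stand-in for a seed-frequency lookup (assumption for illustration;
// real mappers query a seed database).
static long toyFreq(const std::string& seed) {
    long h = 0;
    for (char c : seed) h = (h * 31 + c) % 997;
    return h + 1;
}

// Adaptive Seeds Filter (sketch): each seed starts where the previous one
// ended and is extended until its frequency falls below the threshold t,
// subject to the minimum and maximum seed lengths Smin and Smax.
std::vector<std::string> adaptiveSeeds(const std::string& read,
                                       long t, int Smin, int Smax) {
    std::vector<std::string> seeds;
    int pos = 0, L = (int)read.size();
    while (pos + Smin <= L) {
        int len = Smin;
        while (pos + len < L && len < Smax &&
               toyFreq(read.substr(pos, len)) >= t)
            ++len;  // still too frequent: extend the seed by one base-pair
        seeds.push_back(read.substr(pos, len));
        pos += len;  // the next seed starts where this one left off
    }
    return seeds;
}
```

Note how a single global threshold `t` drives every extension decision; this is exactly the rigidity discussed next.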

Compared to OSS, ASF has two major drawbacks. First, ASF assumes the least frequent set of seeds in a read has similar frequencies; hence, all seeds share a common frequency threshold t. We observe that this is not always true. The optimal set of seeds often has very different frequencies. This is because some seeds do not provide much frequency reduction despite long extensions, while other seeds yield significant frequency reductions only at certain extension lengths (the frequency reduction looks like a step function). By regulating all seeds with the same frequency threshold, ASF inefficiently distributes base-pairs among seeds. Second, ASF sets a fixed frequency threshold t for all reads. This often leads to under-utilization of base-pairs in reads. From our observation, different reads usually have different optimal thresholds (the threshold that provides the least frequent set of seeds under ASF for the read). For reads that contain frequent seeds, the optimal thresholds are usually large (e.g., t > 1000), while for reads without frequent seeds, the optimal thresholds are usually small (e.g., t < 100). Unfortunately, ASF can only apply a single threshold to all reads. If t is set to a large threshold to accommodate reads with frequent seeds, then for other reads, ASF will only extract short seeds even if there are many unused base-pairs. Otherwise, if t is set to a small threshold, then frequent seeds will consume many base-pairs and reads with frequent seeds will have insufficient base-pairs


to construct enough seeds to tolerate all errors. OSS, however, finds the least frequent set of seeds individually for each read, which can contain highly variable seed frequencies.

Selecting seeds consecutively starting at the beginning of a read does not always produce infrequent seeds. Although most seeds that are longer than 20 bp are either unique or non-existent in the reference, there are a few seeds that are still more frequent than 100 occurrences even at 40 bp (e.g., all "A"s). With a small Smax (e.g., Smax ≤ 40) and a small t (t ≤ 50), ASF cannot guarantee that all selected seeds are less frequent than t. This is because ASF cannot extend a seed beyond Smax bp, even if its frequency is still greater than t. If a seed starts at a position that yields a long and frequent seed, ASF will extend the seed to Smax and accept a seed frequency that is still greater than t.

Setting a static t for all reads further worsens the problem. Reads are drastically different. Some reads do not include any frequent short patterns (e.g., 10-bp patterns) while other reads have one to many highly frequent short patterns. Reads without frequent short patterns do not produce frequent seeds in ASF, unless t is set to be very large (e.g., ≥ 10,000), and as a result the selected seeds are very short (e.g., ≤ 8 bp). Reads with many frequent short patterns have a high possibility of producing longer seeds under medium-sized or small t's (e.g., ≤ 100). For a batch of reads, if the global t is set to a small number, reads with many frequent short patterns will have a high chance of producing many long seeds that the read does not have enough length to support. If t is set to a large number, reads without any frequent short patterns will produce many short but still frequent seeds, as ASF will stop extending a seed as soon as it is less frequent than t, even though the read affords longer and less frequent seeds.

Cheap K-mer Selection (CKS) aims to reduce seed frequencies by selecting seeds from a wider potential seed pool. For a fixed seed length k, CKS samples ⌊L/k⌋ seed positions consecutively in a read, with each position apart from another by k bp. Among the ⌊L/k⌋ positions, it selects the x seed positions that yield the least frequent seeds (assuming the mapper needs x seeds). In this way, it avoids using positions that generate frequent seeds.

CKS has low overhead. In total, CKS only needs ⌊L/k⌋ lookups for seed frequencies, followed by a sort of the ⌊L/k⌋ seed frequencies. Although fast, CKS can only provide limited seed frequency reduction as it has a very limited pool to select seeds from. For instance, in a common mapping setting where the read length L is 100 bp and the seed length k is 12, the read can be divided into at most ⌊100/12⌋ = 8 positions. With only 8 potential positions to select from, CKS is forced to gradually select more frequent seeds under greater seed demands. To tolerate 5 errors in this read, CKS has to select 6 seeds out of 8 potential seed positions. This implies that CKS will select the 3rd most frequent seed out of the 8 potential seeds. As we have shown in Figure 1, 12-bp seeds on average have a frequency over 172, and selecting the 3rd most frequent position out of 8 potential seeds renders a high possibility of selecting a frequent seed which has a higher frequency than average.
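A minimal sketch of the CKS sampling policy, again with a toy frequency function standing in for the seed-database lookup:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a seed-frequency lookup (assumption for illustration).
static long toyFreq(const std::string& seed) {
    long h = 0;
    for (char c : seed) h = (h * 31 + c) % 997;
    return h + 1;
}

// Cheap K-mer Selection (sketch): sample floor(L/k) consecutive k-mer
// positions, k bp apart, and keep the x positions whose seeds are the
// least frequent.
std::vector<int> cheapKmerPositions(const std::string& read, int k, int x) {
    int n = (int)read.size() / k;            // floor(L/k) candidate positions
    std::vector<std::pair<long, int>> cand;  // (frequency, start position)
    for (int i = 0; i < n; ++i)
        cand.push_back({toyFreq(read.substr(i * k, k)), i * k});
    std::sort(cand.begin(), cand.end());     // least frequent first
    std::vector<int> picked;
    for (int i = 0; i < x && i < (int)cand.size(); ++i)
        picked.push_back(cand[i].second);
    return picked;
}
```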

Similar to CKS, Optimal Prefix Selection (OPS) also uses fixed-length seeds. However, it allows greater freedom in choosing seed positions. Unlike CKS, which only selects seeds at positions that are a multiple of the seed length k, OPS allows seeds to be selected from any position in the read, as long as they do not overlap.

Resembling our optimal seed finding algorithm, the basis of OPS is also a dynamic programming algorithm, which implements a simpler recurrence function. The major difference between OPS and OSS is that OPS does not need to derive the optimal length of each seed, as the seed length is fixed to k bp. This reduces the search space of optimal seeds to a single dimension, which is only the seed placements. The worst case and average complexity of OPS is O(L × x).

Compared to CKS, OPS is more complex and requires more seed frequency lookups. In return, OPS finds less frequent seeds, especially under large seed numbers. However, with a fixed seed length, OPS cannot further reduce the seed frequencies by extending the seeds longer.
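Because the seed length is fixed at k, an OPS-style recurrence only has to decide placements. The following is a sketch of such a fixed-length dynamic program in O(x × L); it is our own illustrative formulation with a toy frequency function, not Hobbes' actual code:

```cpp
#include <algorithm>
#include <climits>
#include <string>
#include <vector>

// Toy stand-in for a seed-frequency lookup (assumption for illustration).
static long toyFreq(const std::string& seed) {
    long h = 0;
    for (char c : seed) h = (h * 31 + c) % 997;
    return h + 1;
}

// Fixed-length placement DP (sketch): opt[m][l] is the least total frequency
// of m non-overlapping k-mers placed anywhere within read[0..l).
long opsFreq(const std::string& read, int x, int k) {
    int L = (int)read.size();
    std::vector<std::vector<long>> opt(x + 1, std::vector<long>(L + 1, LONG_MAX));
    for (int l = 0; l <= L; ++l) opt[0][l] = 0;     // zero seeds cost nothing
    for (int m = 1; m <= x; ++m)
        for (int l = m * k; l <= L; ++l) {
            opt[m][l] = opt[m][l - 1];              // no seed ends at position l
            if (opt[m - 1][l - k] != LONG_MAX)      // or the m-th seed is read[l-k..l)
                opt[m][l] = std::min(opt[m][l],
                    opt[m - 1][l - k] + toyFreq(read.substr(l - k, k)));
        }
    return opt[x][L];
}
```

With the seed length fixed, only the placement dimension remains, which is why this runs in O(x × L) rather than requiring the extra length dimension OSS has to explore.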

Spaced seeds aim at rebalancing frequencies among patterns in the seed database (to improve the sensitivity/selectivity ratio of seeds in read mapping (Egidi and Manzini, 2015)) by introducing a hash function that is guided by a user-defined bit-mask. Different patterns that are hashed into the same hash value are considered a single "rebalanced seed". By carefully designing the hashing function, which extracts base-pairs only at selected positions from a longer (e.g., 18-bp) pattern, spaced seeds can group frequent long patterns with infrequent long patterns and merge them into new "rebalanced seeds", which have smaller frequency variations. At runtime, long raw seeds are selected consecutively in the reads and are then processed by the rebalancing hash function.
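The rebalancing hash can be illustrated with PatternHunter's default mask (quoted in Section 5): base-pairs at '1' positions are kept, so an 18-bp window collapses to an 11-bp signature. A minimal sketch:

```cpp
#include <string>

// Spaced-seed hashing (sketch): keep only the base-pairs at the '1' positions
// of the bit-mask, collapsing a long pattern into a shorter signature.
// Different patterns mapping to the same signature form one "rebalanced seed".
std::string spacedSignature(const std::string& window, const std::string& mask) {
    std::string sig;
    for (std::size_t i = 0; i < mask.size() && i < window.size(); ++i)
        if (mask[i] == '1')
            sig += window[i];  // sampled position contributes to the signature
    return sig;
}
```

Two windows that differ only at '0' (unsampled) positions hash to the same signature, which is what merges frequent and infrequent long patterns into one rebalanced seed.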

Compared to OSS, spaced seeds have two disadvantages. First, the hash function cannot perfectly balance frequencies among all "rebalanced seeds". After rebalancing, there is still a large disparity in frequency among seeds. Second, seed placement in spaced seeds is static and does not accommodate high-frequency seeds. Therefore, positions that generate frequent seeds are not avoided, which still gives rise to the frequent-seeds phenomenon (frequent seeds are encountered more often in the reads if seeds are selected consecutively).

5 RESULTS

In this section, we compare the average case complexity, memory traffic and effectiveness of OSS against the four prior studies, ASF, CKS, OPS and spaced seeds, as well as the naïve mechanism, which selects fixed seeds consecutively. Memory traffic is measured by the number of required seed frequency lookups to map a single read. The effectiveness of a seed selection scheme is measured by the average seed frequency of mapping 4,031,354 101-bp reads from a real read set, ERR240726 from the 1000 Genomes Project, under different numbers of seeds.

We do not measure the execution time of each mechanism because different seed selection optimizations are combined with different seed database implementations. CKS, OPS and spaced seeds use hash tables for short, fixed-length seeds, while ASF and OSS employ the slower but more memory-efficient BWT and FM-index for longer, variable-length seeds. However, this combination is interchangeable: CKS and OPS can also work well with BWT and FM-index, and ASF and OSS can also be combined with a large hash table, given sufficient memory space. Besides, different existing implementations have their own unique seed database optimizations, which introduces further variation in execution time. For the above reasons, we only compare the complexity and memory traffic of each seed selection scheme, without measuring runtime performance.


                                   OSS        ASF    CKS               OPS       Spaced seeds  naïve
Empirical average case complexity  O(x × L)   O(x)   O(x × log(L/k))   O(x × L)  O(x)          O(x)
Number of lookups                  O(L²)      O(x)   O(L/k)            O(L)      O(x)          O(x)

Table 1. Average case complexity and memory traffic (measured by the number of seed-frequency lookups) of seed selection optimizations, including Optimal Seed Solver (OSS), Adaptive Seeds Filter (ASF), Cheap K-mer Selection (CKS), Optimal Prefix Selection (OPS), spaced seeds and naïve (selecting fixed-length seeds consecutively). Note that only OSS has different empirical average case and worst case complexities; for the other optimizations the average case and worst case complexities are equal.

We benchmark each seed optimization scheme with multiple configurations. We benchmark ASF with multiple frequency thresholds: 5, 10, 100, 500 and 1000. If a read fails to provide enough seeds in ASF, due to having many long seeds under small thresholds, the read is processed again in CKS with a fixed seed length of 12 bp. We benchmark CKS, OPS and naïve under three fixed seed lengths: 12, 13 and 14. We benchmark spaced seeds with the default bit-mask provided in the PatternHunter paper (Ma et al., 2002), "110100110010101111", which hashes 18-bp long seeds into 11-bp long signatures.

All seed selection mechanisms are benchmarked using an in-house seed database, which supports variable seed lengths between Smin = 10 and Smax = 30.

Table 1 summarizes the average complexity and memory traffic of each seed selection optimization. From the table, we can observe that OSS requires the most seed frequency lookups (O(L²)) and has the worst average case complexity (O(x × L)), tied with OPS. Nonetheless, OSS is the most effective seed selection scheme, as Figure 6 shows. Among all seed selection optimizations, OSS provides the largest frequency reduction of seeds on average, achieving a 3x larger frequency reduction compared to the second best seed selection scheme, OPS.

As shown in Figure 6, the average seed frequencies of OSS, CKS and OPS increase with more seeds. This is expected, as there is less flexibility in seed placement with more seeds. For OSS, more seeds also means a shorter average seed length, which also contributes to greater average seed frequencies. For ASF, average seed frequencies remain similar for three or fewer seeds. When there are more than three seeds, the average seed frequencies increase with more seeds. This is because up until three seeds, all reads have enough base-pairs to accommodate all seeds, since the maximum seed length is Smax = 30. However, beyond three seeds, reads start to fail in ASF (having insufficient base-pairs to accommodate all seeds) and the failed reads are passed to CKS instead. Therefore the increase after three seeds is mainly due to the increase in CKS. For t = 10 with six seeds, we observe from our experiment that 66.4% of all reads fail ASF and are processed in CKS instead.

For CKS and OPS, the average seed frequency decreases with increasing seed length when the number of seeds is small (e.g., fewer than 4). When the number of seeds is large (e.g., 6), it is not obvious whether greater seed lengths provide smaller average seed frequencies. In fact, for 6 seeds, the average seed frequency of OPS rises slightly when we increase the seed length from 13-bp to 14-bp. This is because, for small numbers of seeds, the read has plenty of space to arrange and accommodate the slightly longer seeds; in this case, longer seeds reduce the average seed frequency. However, for large numbers of seeds, even a small increase in seed length significantly decreases the flexibility in seed arrangement. In this case, the frequency reduction from longer seeds is outweighed by the frequency increase from reduced flexibility in seed arrangement. Moreover, the benefit of longer seeds diminishes with greater seed lengths: many seeds are already infrequent at 12-bp, and extending infrequent seeds further yields little reduction in the total seed frequency. This result underscores the need for flexibility in both individual seed lengths and seed placement.

Overall, OSS provides the least frequent seeds on average, achieving a 3x larger frequency reduction than the second best seed selection scheme, OPS.

6 DISCUSSION

As shown in the Results section, OSS requires O(L²) seed-frequency lookups to derive the optimal solution of a read. For a non-trivial seed database implementation, such as BWT with FM-index, this can be a time-consuming process. For reads that generate equally frequent seeds under OSS and other seed selection mechanisms, OSS could be less beneficial, as it issues more seed-frequency queries to the seed database without reducing the total seed frequency. When such reads are prevalent (although this is very unlikely), OSS might not be the ideal seeding mechanism. One workaround in this event is to combine OSS with greedy seed selection algorithms (e.g., CKS, OPS). In such a configuration, OSS is only invoked when the greedy seed selection algorithms fail to deliver infrequent seeds. However, how to combine different seeding mechanisms is beyond the scope of this paper and will be explored in our future research.
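One possible shape of such a combination, sketched under the assumption that both selectors return a (seeds, total_frequency) pair; the names and interfaces here are illustrative, not from an actual implementation:

```python
def hybrid_select(read, x, freq_budget, greedy_select, optimal_seed_solver):
    """Invoke the expensive O(L^2)-lookup OSS only when the cheap greedy
    scheme fails to find sufficiently infrequent seeds."""
    seeds, total_freq = greedy_select(read, x)
    if total_freq <= freq_budget:
        return seeds  # greedy result is already infrequent enough
    # Greedy selection delivered frequent seeds: pay for OSS on this read.
    return optimal_seed_solver(read, x)[0]
```

A mapper could tune `freq_budget` so that the O(L²) lookup cost of OSS is only paid on the reads where greedy selection performs poorly.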

The Optimal Seed Solver also reveals that there is still great potential in designing better greedy seed selection optimizations. From our experiment, we observe that the most effective greedy seed selection optimization still provides seeds that are, on average, 3x more frequent than optimal. Better greedy algorithms that provide less frequent seeds without a large number of database lookups will also be part of our future research.

Downloaded from http://bioinformatics.oxfordjournals.org/ at Washington University School of Medicine Library on November 23, 2015.

Fig. 6. Average seed frequency comparison among the Optimal Seed Solver (OSS), Adaptive Seeds Filter (ASF), Cheap K-mer Selection (CKS), Optimal Prefix Selection (OPS), spaced seeds and naïve (selecting fixed-length seeds consecutively). The results are gathered by mapping 4,031,354 101-bp reads from the read set ERR240726_1 from the 1000 Genomes Project under different numbers of seeds (for better accuracy and error tolerance). In each figure, a smaller average seed frequency indicates a more effective seed selection mechanism.

7 CONCLUSION

Optimizing seed selection is an important problem in read mapping. The number of selected non-overlapping seeds defines the error tolerance of a mapper, while the total frequency of all selected seeds determines the performance of the mapper. To build a fast yet error-tolerant mapper, it is essential to select a large number of non-overlapping seeds while keeping each seed as infrequent as possible. In this paper, we confirmed the frequent seed phenomenon discovered in previous works (Kiełbasa et al., 2011), which suggests that under a naïve seed selection scheme, mappers tend to select frequent seeds from reads, even when using long seeds. To solve this problem, we proposed the Optimal Seed Solver (OSS), a dynamic-programming algorithm that derives the set of seeds from a read with the minimum total frequency. We further improved OSS with four optimizations: optimal divider cascading, early divider termination, divider sprinting, and optimal solution forwarding. Using all four optimizations, we reduced the average-case complexity of OSS to O(x × L) and achieved an O(x × L²) worst-case complexity. We compared OSS to four prior schemes, the Adaptive Seeds Filter, Cheap K-mer Selection, Optimal Prefix Selection and spaced seeds, and showed that OSS provides a 3-fold seed frequency reduction over the best previous seed selection scheme, Optimal Prefix Selection.

Funding: This study is supported by two NIH grants (HG006004 to C. Alkan and O. Mutlu; and HG007104 to C. Kingsford) and a Marie Curie Career Integration Grant (PCIG-2011-303772) to C. Alkan under the Seventh Framework Programme. C. Alkan also acknowledges support from The Science Academy of Turkey, under the BAGEP program.

REFERENCES

1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature, 467, 1061–1073.

1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422), 56–65.

Ahmadi, A., Behm, A., Honnalli, N., Li, C., Weng, L., and Xie, X. (2011). Hobbes: optimized gram-based methods for efficient read alignment. Nucleic Acids Research, 40, e41.

Alkan, C., Kidd, J. M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., Kitzman, J. O., Baker, C., Malig, M., Mutlu, O., Sahinalp, S. C., Gibbs, R. A., and Eichler, E. E. (2009). Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet, 41, 1061–1067.

Burrows, M. and Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm. Technical Report 124, Digital Systems Research Center.

Egidi, L. and Manzini, G. (2015). Multiple seeds sensitivity using a single seed with threshold. Journal of Bioinformatics and Computational Biology, 13(04), 1550011. PMID: 25747382.

Ferragina, P. and Manzini, G. (2000). Opportunistic data structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, FOCS '00, pages 390–, Washington, DC, USA. IEEE Computer Society.

Flannick, J., Thorleifsson, G., Beer, N. L., Jacobs, S. B. R., Grarup, N., Burtt, N. P., Mahajan, A., Fuchsberger, C., Atzmon, G., Benediktsson, R., Blangero, J., Bowden, D. W., Brandslund, I., Brosnan, J., Burslem, F., Chambers, J., Cho, Y. S., Christensen, C., Douglas, D. A., Duggirala, R., Dymek, Z., Farjoun, Y., Fennell, T., Fontanillas, P., Forsen, T., Gabriel, S., Glaser, B., Gudbjartsson, D. F., Hanis, C., Hansen, T., Hreidarsson, A. B., Hveem, K., Ingelsson, E., Isomaa, B., Johansson, S., Jørgensen, T., Jørgensen, M. E., Kathiresan, S., Kong, A., Kooner, J., Kravic, J., Laakso, M., Lee, J.-Y., Lind, L., Lindgren, C. M., Linneberg, A., Masson, G., Meitinger, T., Mohlke, K. L., Molven, A., Morris, A. P., Potluri, S., Rauramaa, R., Ribel-Madsen, R., Richard, A.-M., Rolph, T., Salomaa, V., Segre, A. V., Skarstrand, H., Steinthorsdottir, V., Stringham, H. M., Sulem, P., Tai, E. S., Teo, Y. Y., Teslovich, T., Thorsteinsdottir, U., Trimmer, J. K., Tuomi, T., Tuomilehto, J., Vaziri-Sani, F., Voight, B. F., Wilson, J. G., Boehnke, M., McCarthy, M. I., Njølstad, P. R., Pedersen, O., Go-T2D Consortium, T2D-GENES Consortium, Groop, L., Cox, D. R., Stefansson, K., and Altshuler, D. (2014). Loss-of-function mutations in SLC30A8 protect against type 2 diabetes. Nature Genetics, 46(4), 357–363.


Flicek, P. and Birney, E. (2009). Sense from sequence reads: methods for alignment and assembly. Nature Methods, 6(11 Suppl), S6–S12.

Green, R. E., Krause, J., Briggs, A. W., Maricic, T., Stenzel, U., Kircher, M., Patterson, N., Li, H., Zhai, W., Fritz, M. H.-Y., Hansen, N. F., Durand, E. Y., Malaspinas, A.-S., Jensen, J. D., Marques-Bonet, T., Alkan, C., Prüfer, K., Meyer, M., Burbano, H. A., Good, J. M., Schultz, R., Aximu-Petri, A., Butthof, A., Höber, B., Höffner, B., Siegemund, M., Weihmann, A., Nusbaum, C., Lander, E. S., Russ, C., et al. (2010). A draft sequence of the Neandertal genome. Science, 328, 710–722.

Kiełbasa, S., Wan, R., Sato, K., Horton, P., and Frith, M. C. (2011). Adaptive seeds tame genomic sequence comparison. Genome Research, 21(3), 487–493.

Kucherov, G., Salikhov, K., and Tsur, D. (2014). Approximate string matching using a bidirectional index. In A. Kulikov, S. Kuznetsov, and P. Pevzner, editors, Combinatorial Pattern Matching, volume 8486 of Lecture Notes in Computer Science, pages 222–231. Springer International Publishing.

Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9, 357–359.

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997.

Ma, B., Tromp, J., and Li, M. (2002). PatternHunter: faster and more sensitive homology search. Bioinformatics, 18, 440–445.

Marco-Sola, S., Sammeth, M., Guigó, R., and Ribeca, P. (2012). The GEM mapper: fast, accurate and versatile alignment by filtration. Nat Methods, 9(12), 1185–1188.

Marques-Bonet, T., Kidd, J. M., Ventura, M., Graves, T. A., Cheng, Z., Hillier, L. W., Jiang, Z., Baker, C., Malfavon-Borja, R., Fulton, L. A., Alkan, C., Aksay, G., Girirajan, S., Siswara, P., Chen, L., Cardone, M. F., Navarro, A., Mardis, E. R., Wilson, R. K., and Eichler, E. E. (2009). A burst of segmental duplications in the genome of the African great ape ancestor. Nature, 457, 877–881.

Meyer, M., Kircher, M., Gansauge, M.-T., Li, H., Racimo, F., Mallick, S., Schraiber, J. G., Jay, F., Prüfer, K., de Filippo, C., Sudmant, P. H., Alkan, C., Fu, Q., Do, R., Rohland, N., Tandon, A., Siebauer, M., Green, R. E., Bryc, K., Briggs, A. W., Stenzel, U., Dabney, J., Shendure, J., Kitzman, J., Hammer, M. F., Shunkov, M. V., Derevianko, A. P., Patterson, N., Andrés, A. M., Eichler, E. E., Slatkin, M., Reich, D., Kelso, J., and Pääbo, S. (2012). A high-coverage genome sequence from an archaic Denisovan individual. Science, 338(6104), 222–226.

Myers, G. (1999). A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM, 46(3), 395–415.

Navin, N., Kendall, J., Troge, J., Andrews, P., Rodgers, L., McIndoo, J., Cook, K., Stepansky, A., Levy, D., Esposito, D., et al. (2011). Tumour evolution inferred by single-cell sequencing. Nature, 472(7341), 90–94.

Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.

Ng, S. B., Bigham, A. W., Buckingham, K. J., Hannibal, M. C., McMillin, M. J., Gildersleeve, H. I., Beck, A. E., Tabor, H. K., Cooper, G. M., Mefford, H. C., Lee, C., Turner, E. H., Smith, J. D., Rieder, M. J., Yoshiura, K.-I., Matsumoto, N., Ohta, T., Niikawa, N., Nickerson, D. A., Bamshad, M. J., and Shendure, J. (2010). Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet, 42(9), 790–793.

Prado-Martinez, J., Sudmant, P. H., Kidd, J. M., Li, H., Kelley, J. L., Lorente-Galdos, B., Veeramah, K. R., Woerner, A. E., O'Connor, T. D., Santpere, G., Cagan, A., Theunert, C., Casals, F., Laayouni, H., Munch, K., Hobolth, A., Halager, A. E., Malig, M., Hernandez-Rodriguez, J., Hernando-Herraez, I., Prüfer, K., Pybus, M., Johnstone, L., Lachmann, M., Alkan, C., Twigg, D., Petit, N., Baker, C., Hormozdiari, F., Fernandez-Callejo, M., Dabad, M., Wilson, M. L., Stevison, L., Camprubí, C., Carvalho, T., Ruiz-Herrera, A., Vives, L., Mele, M., Abello, T., Kondova, I., Bontrop, R. E., Pusey, A., Lankester, F., Kiyang, J. A., Bergl, R. A., Lonsdorf, E., Myers, S., Ventura, M., Gagneux, P., Comas, D., Siegismund, H., Blanc, J., Agueda-Calpena, L., Gut, M., Fulton, L., Tishkoff, S. A., Mullikin, J. C., Wilson, R. K., Gut, I. G., Gonder, M. K., Ryder, O. A., Hahn, B. H., Navarro, A., Akey, J. M., Bertranpetit, J., Reich, D., Mailund, T., Schierup, M. H., Hvilsom, C., Andres, A. M., Wall, J. D., Bustamante, C. D., Hammer, M. F., Eichler, E. E., and Marques-Bonet, T. (2013). Great ape genetic diversity and population history. Nature, 499(7459), 471–475.

Rasmussen, K. R., Stoye, J., and Myers, E. W. (2006). Efficient q-gram filters for finding all ε-matches over a given length. Journal of Computational Biology, 13.

Reich, D., Green, R. E., Kircher, M., Krause, J., Patterson, N., Durand, E. Y., Viola, B., Briggs, A. W., Stenzel, U., Johnson, P. L. F., Maricic, T., Good, J. M., Marques-Bonet, T., Alkan, C., Fu, Q., Mallick, S., Li, H., Meyer, M., Eichler, E. E., Stoneking, M., Richards, M., Talamo, S., Shunkov, M. V., Derevianko, A. P., Hublin, J.-J., Kelso, J., Slatkin, M., and Pääbo, S. (2010). Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature, 468, 1053–1060.

Rognes, T. (2011). Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics, 12, 221.

Rumble, S. M., Lacroute, P., Dalca, A. V., Fiume, M., Sidow, A., and Brudno, M. (2009). SHRiMP: Accurate mapping of short color-space reads. PLoS Comput Biol, 5, e1000386.

Scally, A., Dutheil, J. Y., Hillier, L. W., Jordan, G. E., Goodhead, I., Herrero, J., Hobolth, A., Lappalainen, T., Mailund, T., Marques-Bonet, T., McCarthy, S., Montgomery, S. H., Schwalie, P. C., Tang, Y. A., Ward, M. C., Xue, Y., Yngvadottir, B., Alkan, C., Andersen, L. N., Ayub, Q., Ball, E. V., Beal, K., Bradley, B. J., Chen, Y., Clee, C. M., Fitzgerald, S., Graves, T. A., Gu, Y., Heath, P., Heger, A., et al. (2012). Insights into hominid evolution from the gorilla genome sequence. Nature, 483, 169–175.

Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147, 195–197.

Szalkowski, A., Ledergerber, C., Krähenbühl, P., and Dessimoz, C. (2008). SWPS3 - fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and x86/SSE2. BMC Research Notes, 1(1), 107+.

Van Vlierberghe, P., Palomero, T., Khiabanian, H., Van der Meulen, J., Castillo, M., Van Roy, N., De Moerloose, B., Philippe, J., González-García, S., Toribio, M. L., et al. (2010). PHF6 mutations in T-cell acute lymphoblastic leukemia. Nature Genetics, 42(4), 338–342.


Ventura, M., Catacchio, C. R., Alkan, C., Marques-Bonet, T., Sajjadian, S., Graves, T. A., Hormozdiari, F., Navarro, A., Malig, M., Baker, C., Lee, C., Turner, E. H., Chen, L., Kidd, J. M., Archidiacono, N., Shendure, J., Wilson, R. K., and Eichler, E. E. (2011). Gorilla genome structural variation reveals evolutionary parallelisms with chimpanzee. Genome Res, 21, 1640–1649.

Weese, D., Holtgrewe, M., and Reinert, K. (2012). RazerS 3: Faster, fully sensitive read mapping. Bioinformatics, 28(20), 2592–2599.

Xin, H., Lee, D., Hormozdiari, F., Yedkar, S., Mutlu, O., and Alkan, C. (2013). Accelerating read mapping with FastHASH. BMC Genomics, 14(Suppl 1), S13.

Xin, H., Greth, J., Emmons, J., Pekhimenko, G., Kingsford, C., Alkan, C., and Mutlu, O. (2015). Shifted Hamming distance: A fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping. Bioinformatics.
