Designing Multiple Simultaneous Seeds for DNA Similarity Search Yanni Sun, Jeremy Buhler Washington...

Designing Multiple Simultaneous Seeds for DNA Similarity Search

Yanni Sun, Jeremy Buhler Washington University in Saint Louis

WashU. Laboratory for Computational Genomics

Outline

Problem of multi-seed design

Methods: greedy covering algorithm; compute conditional match probabilities

Experiments and results

Conclusion and future work

Sequence Alignment

Functional regions conserved despite DNA mutations over time

Conserved region can be aligned with high score

Exact solution: dynamic programming (DP); time complexity: O(MN) for sequences of lengths M and N

Fast but heuristic solution: seeded alignment algorithm

Seeded Alignment Algorithm

BLAST is the most popular tool.

Step 1: find a word match.

Step 2: extend the match to find the high-similarity pair.

TAGGACCTAACC

GACCACCTTTT

TAGGACCTAACC

GACCACCTTTT
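The two steps can be sketched on the example sequences from the slide. This is a minimal illustration, not BLAST's implementation: the function names `word_matches` and `extend_exact` are hypothetical, the word length of 4 is chosen only to suit these short sequences, and the extension stops at the first mismatch, whereas real tools extend with a scoring scheme that tolerates mismatches.

```python
from collections import defaultdict


def word_matches(a, b, w):
    """Step 1: return (i, j) pairs where a[i:i+w] == b[j:j+w]."""
    index = defaultdict(list)
    for i in range(len(a) - w + 1):
        index[a[i:i + w]].append(i)
    return [(i, j)
            for j in range(len(b) - w + 1)
            for i in index.get(b[j:j + w], [])]


def extend_exact(a, b, i, j, w):
    """Step 2 (simplified): grow the word match in both directions
    along its diagonal while the bases stay identical."""
    lo_i, lo_j = i, j
    while lo_i > 0 and lo_j > 0 and a[lo_i - 1] == b[lo_j - 1]:
        lo_i, lo_j = lo_i - 1, lo_j - 1
    hi_i, hi_j = i + w, j + w
    while hi_i < len(a) and hi_j < len(b) and a[hi_i] == b[hi_j]:
        hi_i, hi_j = hi_i + 1, hi_j + 1
    return a[lo_i:hi_i]


a, b = "TAGGACCTAACC", "GACCACCTTTT"
for i, j in word_matches(a, b, 4):
    print(i, j, extend_exact(a, b, i, j, 4))
```

Indexing the words of one sequence keeps the word-match step close to linear in the sequence lengths, which is what makes the heuristic faster than full DP.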

Seed and Similarity

Example of a similarity and a single seed

tgcagaaatgcagaggca
| || |    | ||||
tacacaggcaccgaggag

Similarity: 101101000010111100

Seed: 11*1, weight = 3, span = 4

The seed detects/matches this similarity.
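The definition can be checked mechanically. The sketch below assumes the usual reading of the notation: a `1` in the seed requires a matching position and a `*` is a don't-care. The helper names `similarity_string` and `seed_detects` are illustrative, not from the original work.

```python
def similarity_string(a, b):
    """Encode an ungapped alignment as 1 (match) / 0 (mismatch) per column."""
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


def seed_detects(seed, similarity):
    """True if the seed fits at some offset: every '1' in the seed
    must line up with a '1' in the similarity; '*' matches anything."""
    span = len(seed)
    for start in range(len(similarity) - span + 1):
        if all(c == "*" or similarity[start + k] == "1"
               for k, c in enumerate(seed)):
            return True
    return False


seed = "11*1"
sim = similarity_string("tgcagaaatgcagaggca", "tacacaggcaccgaggag")
print(sim)                            # 101101000010111100
print(seed.count("1"), len(seed))     # weight and span: 3 4
print(seed_detects(seed, sim))        # True
```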

Seed Choice is Important

Figure: a significant alignment and a seed match.

Seed Design: Previous Work

Traditional seed: word (e.g. 11111111111)

Discontiguous patterns of matching bases: [CR1993]; [MTL’02] {111010010100110111}

Our work on single discontiguous seed: [BKS’03]

Multiple Simultaneous Seeds

Multiple simultaneous seeds are defined as a set of seeds: ∏ = {seed1, seed2, …, seedi, …, seedn}

∏ detects a similarity if at least one of the component seeds detects the similarity

Example: simultaneous seeds {11*1, 1*11} detect similarities 100110100001, 1000010110001, 1101001011001
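The set-detection rule can be sketched directly from the definition. The function names are illustrative; the seeds and similarities are the ones in the example.

```python
def seed_detects(seed, similarity):
    """True if every '1' of the seed aligns with a '1' at some offset."""
    return any(
        all(c == "*" or similarity[start + k] == "1"
            for k, c in enumerate(seed))
        for start in range(len(similarity) - len(seed) + 1)
    )


def seed_set_detects(seeds, similarity):
    """A seed set detects a similarity if at least one member does."""
    return any(seed_detects(s, similarity) for s in seeds)


seeds = ["11*1", "1*11"]
for sim in ("100110100001", "1000010110001", "1101001011001"):
    hitting = [s for s in seeds if seed_detects(s, sim)]
    print(sim, seed_set_detects(seeds, sim), hitting)
```

Listing which seed fires for each similarity shows why the set helps: the first similarity is found only by 11*1 and the second only by 1*11.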

Multi-seed Design – Balance Sensitivity with Specificity

Sensitivity = A / biologically meaningful alignments

Specificity = A / seed matches

Increase sensitivity: decrease the weight of a single seed, or use multiple seeds

Both methods hurt specificity

Hypothesis: a set of multiple seeds has a better tradeoff of sensitivity vs. specificity than a single seed

Figure: diagram relating biologically meaningful alignments and seed matches.
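A back-of-the-envelope calculation shows why both options cost specificity. It rests on an assumption that is not stated in the source: unrelated DNA is modeled as independent, uniformly distributed bases, so a seed of weight w matches a given pair of positions by chance with probability (1/4)^w, and the expected chance matches of several seeds add up. The function name is hypothetical.

```python
def expected_random_match_rate(weight, n_seeds=1, alphabet_size=4):
    """Expected chance seed matches per pair of positions, assuming
    i.i.d. uniform bases (an illustrative background model)."""
    return n_seeds * (1.0 / alphabet_size) ** weight


baseline = expected_random_match_rate(11)
lighter = expected_random_match_rate(10)
three_seeds = expected_random_match_rate(11, n_seeds=3)

print(lighter / baseline)      # one weight-10 seed: 4x the chance matches
print(three_seeds / baseline)  # three weight-11 seeds: 3x the chance matches
```

Under this model, dropping one unit of weight costs more chance matches than adding two extra seeds of the original weight, which is the intuition behind the hypothesis.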

Our Work – Design Multiple Simultaneous Seeds Efficiently

Use a new local search method to optimize the seed set

Design an efficient algorithm to calculate conditional match probability

Empirical verification that multiple simultaneous seeds have a better tradeoff of sensitivity vs. specificity

Multi-seed Design Problem

Input:

Ungapped alignments sampled from two genomic DNA sequences

Resource constraints of seeds: weight, span, number

Goal: find a set of seeds ∏ to maximize the detection probability Pr[∏ detects S]

Pr[∏ detects S] = Pr[(seed1 detects S) or (seed2 detects S) … or (seedn detects S)]

Outline

Problem of multi-seed design

Methods

Computing Match Probability for Specified Seeds [BKS ’03]

Learn a kth-order Markov model from similarities.

Build a DFA that only accepts strings containing the given seeds

Compute the probability that the DFA accepts a string chosen randomly from model M by DP.
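The DP can be sketched in a simplified setting. Two assumptions here are mine, not the source's: the similarity model is 0th-order (each position matches independently with probability `p_match`) rather than a learned kth-order Markov model, and instead of an explicit DFA the state is the most recent span-1 bits, with detection treated as an absorbing state. The function names are illustrative.

```python
def seed_hits(seed, window):
    """True if the seed matches the last len(seed) symbols of window."""
    if len(window) < len(seed):
        return False
    tail = window[len(window) - len(seed):]
    return all(c == "*" or b == "1" for c, b in zip(seed, tail))


def detection_probability(seeds, length, p_match):
    """Pr[at least one seed detects a random similarity of this length]."""
    max_span = max(len(s) for s in seeds)
    # Probability mass of prefixes not yet detected, keyed by recent history.
    undetected = {"": 1.0}
    for _ in range(length):
        nxt = {}
        for history, mass in undetected.items():
            for bit, pr in (("1", p_match), ("0", 1.0 - p_match)):
                window = history + bit
                if any(seed_hits(s, window) for s in seeds):
                    continue  # detected: this mass leaves the table for good
                if len(window) == max_span:
                    window = window[1:]  # keep only the last span-1 bits
                nxt[window] = nxt.get(window, 0.0) + mass * pr
        undetected = nxt
    return 1.0 - sum(undetected.values())


print(detection_probability(["11*1"], 64, 0.7))
print(detection_probability(["11*1", "1*11"], 64, 0.7))
```

The table never holds more than 2^(span-1) states, so the cost grows linearly with the similarity length rather than with the number of possible similarities.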

Seek the Locally Optimal Set of Seeds

Original local search

Greedy covering algorithm – a faster local search strategy

Efficient computation of conditional match probability

Find Optimal Set of Seeds by Original Local Search

Seed space with span <= 8, weight = 3

{1*1***1, 1*****11}: Pr = 0.70

{1**1**1, 1*****11}: Pr = 0.67

{1***1*1, 1*****11}: Pr = 0.75

{1****11, 1*****11}: Pr = 0.71
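A hill-climbing search over seed sets can be sketched as follows. The slide does not spell out the neighborhood or the scoring details, so several things here are assumptions: a neighbor is obtained by moving a single 1 of one seed (as the example seed sets suggest), the score is the fraction of sampled similarities detected rather than a model-based probability, and the sample is random toy data. All names are hypothetical.

```python
import random


def seed_detects(seed, sim):
    return any(all(c == "*" or sim[i + k] == "1" for k, c in enumerate(seed))
               for i in range(len(sim) - len(seed) + 1))


def score(seeds, sims):
    """Fraction of sampled similarities detected by the seed set."""
    return sum(any(seed_detects(s, x) for s in seeds) for x in sims) / len(sims)


def neighbors(seed, max_span):
    """Seeds reachable by moving one '1' to a free position (weight kept)."""
    ones = [k for k, c in enumerate(seed) if c == "1"]
    out = set()
    for old in ones:
        for new in range(max_span):
            if new in ones:
                continue
            moved = sorted((set(ones) - {old}) | {new})
            start, span = moved[0], moved[-1] - moved[0] + 1
            out.add("".join("1" if k + start in moved else "*"
                            for k in range(span)))
    out.discard(seed)
    return out


def local_search(seeds, sims, max_span):
    """Accept any single-seed move that strictly improves the score."""
    seeds, best, improved = list(seeds), score(seeds, sims), True
    while improved:
        improved = False
        for idx in range(len(seeds)):
            for cand in sorted(neighbors(seeds[idx], max_span)):
                trial = seeds[:idx] + [cand] + seeds[idx + 1:]
                trial_score = score(trial, sims)
                if trial_score > best:
                    seeds, best, improved = trial, trial_score, True
    return seeds, best


random.seed(0)
sample = ["".join(random.choice("011") for _ in range(20)) for _ in range(200)]
print(local_search(["111", "1*11"], sample, max_span=8))
```

Every step re-scores the whole seed set, which is why this style of search becomes expensive as the number of seeds grows.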

Greedy Covering Algorithm

Design 3 simultaneous seeds: {s1, s2, s3}

s1 = argmax_x Pr(x)

s2 = argmax_x Pr(x | ~s1)

s3 = argmax_x Pr(x | ~{s1, s2})

Figure: the similarity space, with the region of similarities detected by s1.
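The greedy rule can be sketched on a sample of similarities. This is an empirical stand-in, not the authors' implementation: instead of model-based probabilities, each round picks the seed that detects the most similarities still missed by the seeds chosen so far, which maximizes the sample estimate of the conditional probability because the denominator is the same for every candidate. The candidate enumeration and all names are assumptions.

```python
from itertools import combinations


def seed_detects(seed, sim):
    return any(all(c == "*" or sim[i + k] == "1" for k, c in enumerate(seed))
               for i in range(len(sim) - len(seed) + 1))


def candidate_seeds(weight, max_span):
    """All seeds of the given weight (>= 2) and span <= max_span
    that start and end with a '1'."""
    seeds = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            ones = {0, span - 1, *inner}
            seeds.append("".join("1" if k in ones else "*"
                                 for k in range(span)))
    return seeds


def greedy_cover(similarities, weight, max_span, n_seeds):
    """Pick seeds one at a time; each maximizes detections among the
    similarities that all previously chosen seeds miss."""
    candidates = candidate_seeds(weight, max_span)
    remaining, chosen = list(similarities), []
    for _ in range(n_seeds):
        best = max(candidates,
                   key=lambda s: sum(seed_detects(s, x) for x in remaining))
        chosen.append(best)
        remaining = [x for x in remaining if not seed_detects(best, x)]
    return chosen


sample = ["0111000", "1110000", "0011100", "1001001"]
print(greedy_cover(sample, weight=3, max_span=8, n_seeds=2))
```

Each round is a single-seed optimization, so the work grows roughly linearly with the number of seeds instead of requiring a joint search over whole seed sets.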

Calculate Conditional Match Probabilities

Challenge: how to calculate the conditional probability efficiently?

Seeds with small span: exact computation via DFAs

Seeds with large span: Monte Carlo

Calculate Conditional Match Probability via DFA

Pr(x | ~s1) = Pr(x and ~s1) / Pr(~s1), where ~s1 means that the previously chosen seed s1 does not detect the similarity

Build the DFA corresponding to (x and ~s1) by using the cross product and complementation of DFAs

Efficiency: in the process of local search to find the optimal single seed x, Pr(~s1) can be precomputed
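The conditional probability of a candidate seed, given that the already chosen seeds miss, can be checked by brute force on short similarities. The sketch below is a sanity check rather than the DFA construction: it enumerates every 0/1 string of a small length under an assumed independent-match model with probability `p_match`, and all names are hypothetical. It does mirror the efficiency remark, since the denominator depends only on the chosen seeds and can be computed once per round.

```python
from itertools import product


def seed_detects(seed, sim):
    return any(all(c == "*" or sim[i + k] == "1" for k, c in enumerate(seed))
               for i in range(len(sim) - len(seed) + 1))


def probability(event, length, p_match):
    """Total probability of the 0/1 strings for which event(s) is true."""
    total = 0.0
    for bits in product("01", repeat=length):
        s = "".join(bits)
        if event(s):
            ones = s.count("1")
            total += p_match ** ones * (1.0 - p_match) ** (length - ones)
    return total


def conditional_match_probability(x, chosen, length, p_match):
    """Pr[x detects | no chosen seed detects] = joint / Pr[all chosen miss]."""
    def missed(s):
        return not any(seed_detects(c, s) for c in chosen)

    pr_missed = probability(missed, length, p_match)  # reusable across x
    pr_joint = probability(lambda s: seed_detects(x, s) and missed(s),
                           length, p_match)
    return pr_joint / pr_missed


print(conditional_match_probability("1*11", ["11*1"], 12, 0.7))
```

Enumeration costs 2^length evaluations, so it is only useful for validating a faster method on small cases.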

Outline

Problem of multi-seed design

Methods

Greedy Covering vs. Original Local Search

Figure: detection probability of greedy covering vs. original local search.

Greedy Covering is Much Faster

When n = 5, on the same hardware platform (P4):

Greedy covering needs 20 minutes

The original local search needs 2.4

Experimental Setup

The ungapped alignments are sampled uniformly from human and mouse syntenies

For a specified seed set:

Sensitivity: the number of significant gapped alignments found by our BLAST-like alignment tool

False positive rate: approximated by the number of seed matches

Results: Verify the Hypothesis on Noncoding Sequences

Seed weight | Number of seeds | Gapped alignments found (sensitivity) | Sensitivity improvement (%) | Total seed matches (approximation of false positives)
11 | 1 | 251941 | ---- | 1.57 x 10^9
10 | 1 | 273831 | 8.7 | 5.88 x 10^9
11 | 3 | 292093 | 15.9 | 4.56 x 10^9
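The tradeoff in these results can be made explicit with a little arithmetic. The numbers below are copied from the results; the dictionary layout and variable names are only for illustration.

```python
# (seed weight, number of seeds): (gapped alignments found, total seed matches)
results = {
    (11, 1): (251941, 1.57e9),
    (10, 1): (273831, 5.88e9),
    (11, 3): (292093, 4.56e9),
}

base_alignments, base_matches = results[(11, 1)]
for (weight, n_seeds), (alignments, matches) in results.items():
    gain = 100.0 * (alignments - base_alignments) / base_alignments
    cost = matches / base_matches
    print(f"weight {weight}, {n_seeds} seed(s): "
          f"+{gain:.1f}% alignments, {cost:.2f}x seed matches")
```

Relative to the single weight-11 seed, the three-seed set gains 15.9% in alignments for about 2.9 times the seed matches, while the single weight-10 seed gains 8.7% for about 3.7 times the seed matches.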

Summary of Contributions

Efficient algorithms to design multiple simultaneous seeds at reasonable cost

Empirical verification: multiple simultaneous seeds have a better tradeoff between sensitivity and specificity

Future Work

Design a better evaluation platform for different seeds

Investigate utility of seeds in multiple sequence alignment

Acknowledgements

Dr. Jeremy Buhler (advisor), Ben Westover, Rachel Nordgren, Joseph Lancaster and Christopher Swope

Laboratory for Computational Genomics at Washington University in Saint Louis

http://www.cse.wustl.edu/~jbuhler/mandala
