Designing Multiple Simultaneous Seeds for DNA Similarity Search

Yanni Sun, Jeremy Buhler
Washington University in Saint Louis


WashU. Laboratory for Computational Genomics


Outline

Problem of multi-seed design

Methods
  Greedy covering algorithm
  Compute conditional match probabilities

Experiments and results

Conclusion and future work


Sequence Alignment

Functional regions conserved despite DNA mutations over time

Conserved region can be aligned with high score

Exact solution: dynamic programming (DP); time complexity: O(MN) for sequences of lengths M and N

Fast but heuristic solution: seeded alignment algorithm


Seeded Alignment Algorithm

BLAST is the most popular tool.

Step 1: word match

Step 2: extend the match to find the high-similarity pair

TAGGACCTAACC

GACCACCTTTT
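The two steps can be sketched in Python on the example sequences above. This is a minimal illustration, not BLAST's actual implementation: the function names, the word length of 4, the +1/-1 scoring, and the `drop` threshold are all assumptions chosen for the example.

```python
from collections import defaultdict

def word_hits(a, b, w):
    """Step 1: find all positions (i, j) where a and b share an exact word of length w."""
    index = defaultdict(list)
    for j in range(len(b) - w + 1):
        index[b[j:j + w]].append(j)
    return [(i, j) for i in range(len(a) - w + 1) for j in index.get(a[i:i + w], [])]

def extend_right(a, b, i, j, drop=2):
    """Step 2 (simplified): ungapped extension to the right from (i, j).

    Scores +1 per match and -1 per mismatch, and stops once the running
    score falls `drop` below the best score seen. Returns (best score, length).
    """
    score = best = best_len = 0
    k = 0
    while i + k < len(a) and j + k < len(b) and best - score < drop:
        score += 1 if a[i + k] == b[j + k] else -1
        k += 1
        if score > best:
            best, best_len = score, k
    return best, best_len

a, b = "TAGGACCTAACC", "GACCACCTTTT"
hits = word_hits(a, b, 4)
print(hits)
print([extend_right(a, b, i, j) for i, j in hits])
```

A real tool would also extend to the left and follow the ungapped extension with a gapped alignment; those parts are omitted here.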


Seed and Similarity

Example of a similarity and a single seed:

  tgcagaaatgcagaggca
  | || |    | ||||
  tacacaggcaccgaggag

Similarity: 101101000010111100

Seed: 11*1, weight = 3, span = 4

The seed detects (matches) this similarity.
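The example can be checked with a few lines of Python. This is a small sketch under the slide's definitions: a similarity is the 0/1 string of matching positions, and a seed detects it if every '1' of the seed lies over a '1' of the similarity at some offset. The function names are illustrative only.

```python
def similarity(a, b):
    """Return the 0/1 string marking matching positions of two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return "".join("1" if x == y else "0" for x, y in zip(a, b))

def detects(seed, sim):
    """True if, at some offset, every '1' in the seed sits over a '1' in sim ('*' is don't-care)."""
    return any(
        all(sim[i + j] == "1" for j, c in enumerate(seed) if c == "1")
        for i in range(len(sim) - len(seed) + 1)
    )

seed = "11*1"
sim = similarity("tgcagaaatgcagaggca", "tacacaggcaccgaggag")
print(sim)                          # 101101000010111100
print(seed.count("1"), len(seed))   # weight 3, span 4
print(detects(seed, sim))           # True
```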


Seed Choice is Important

[Figure: significant alignments compared with seed matches; labels "Significant alignment" and "Seed match"]


Seed Design: Previous Work

Traditional seed: word (e.g. 11111111111)

Discontiguous patterns of matching bases: [CR1993]; [MTL’02] {111010010100110111}

Our work on single discontiguous seed: [BKS’03]


Multiple Simultaneous Seeds

Multiple simultaneous seeds are defined as a set of seeds: ∏ = {seed1, seed2, …, seedi, …, seedn}

∏ detects a similarity if at least one of the component seeds detects the similarity.

Example: simultaneous seeds {11*1, 1*11} detect similarities 100110100001, 1000010110001, 1101001011001
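As a quick check of this example, the sketch below reports which component seed detects each of the three similarities. The helper names are illustrative and not taken from the authors' software.

```python
def detects(seed, sim):
    """True if, at some offset, every '1' in the seed sits over a '1' in sim."""
    return any(
        all(sim[i + j] == "1" for j, c in enumerate(seed) if c == "1")
        for i in range(len(sim) - len(seed) + 1)
    )

def set_detects(seeds, sim):
    """A seed set detects a similarity if at least one component seed does."""
    return any(detects(s, sim) for s in seeds)

seeds = ["11*1", "1*11"]
examples = ["100110100001", "1000010110001", "1101001011001"]
detected_by = {sim: [s for s in seeds if detects(s, sim)] for sim in examples}
for sim, which in detected_by.items():
    print(sim, which)
```

The first similarity is detected only by 11*1, the second only by 1*11, and the third by both, so neither seed alone covers all three.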


Multi-seed Design: Balance Sensitivity with Specificity

Sensitivity = A / biologically meaningful alignments

Specificity = A / seed matches

Here A is the overlap of the two sets in the slide's diagram: seed matches that are also biologically meaningful alignments.

Increase sensitivity:
  Decrease weight of single seed
  Use multiple seeds

Both methods hurt specificity.

Hypothesis: a set of multiple seeds offers a better tradeoff between sensitivity and specificity than a single seed.


Our Work: Design Multiple Simultaneous Seeds Efficiently

Use a new local search method to optimize the seed set

Design an efficient algorithm to calculate conditional match probability

Empirical verification that multiple simultaneous seeds have a better tradeoff of sensitivity vs. specificity


Multi-seed Design Problem

Input:
  Ungapped alignments sampled from two genomic DNA sequences
  Resource constraints of seeds: weight, span, number

Goal: find a set of seeds ∏ to maximize the detection probability Pr[∏ detects S], where

Pr[∏ detects S] = Pr[(seed1 detects S) or (seed2 detects S) or … or (seedn detects S)]




Computing Match Probability for Specified Seeds [BKS ’03]

Learn a kth-order Markov model M from similarities.

Build a DFA that accepts only strings containing the given seeds.

Compute, by DP, the probability that the DFA accepts a string chosen randomly from model M.
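The DP step can be illustrated with a simplified sketch. Two assumptions depart from the slide: each position of the similarity is an independent match with probability `p` (a 0th-order model standing in for the learned kth-order Markov model), and the automaton is represented implicitly by the window of recent bits rather than by an explicitly built DFA. The function names are hypothetical.

```python
def hits_at_end(seed, window):
    """True if the seed matches the last len(seed) bits of the window."""
    if len(window) < len(seed):
        return False
    tail = window[len(window) - len(seed):]
    return all(w == "1" for s, w in zip(seed, tail) if s == "1")

def match_probability(seeds, length, p):
    """Pr[at least one seed detects a random similarity of the given length].

    DP over states: each state is the window of the most recent bits for
    strings in which no seed has matched so far. Probability mass that
    reaches a match is moved to the absorbing 'accepted' total.
    """
    span = max(len(s) for s in seeds)
    states = {"": 1.0}
    accepted = 0.0
    for _ in range(length):
        nxt = {}
        for hist, prob in states.items():
            for bit, q in (("1", p), ("0", 1.0 - p)):
                window = (hist + bit)[-span:]
                if any(hits_at_end(s, window) for s in seeds):
                    accepted += prob * q
                else:
                    nxt[window] = nxt.get(window, 0.0) + prob * q
        states = nxt
    return accepted

print(match_probability(["11*1"], 64, 0.7))
print(match_probability(["11*1", "1*11"], 64, 0.7))
```

The number of states grows as 2 to the power of the largest span, which is one reason an explicit, minimized DFA is attractive for small spans.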


Seek the Locally Optimal Set of Seeds

Original local search

Greedy covering algorithm: a faster local search strategy
  Efficient computation of conditional match probability


Find Optimal Set of Seeds by Original Local Search

Seed space with span <= 8, weight = 3

Seed sets shown in the slide's figure, with their detection probabilities:

{1*1***1, 1*****11}   Pr = 0.70
{1**1**1, 1*****11}   Pr = 0.67
{1***1*1, 1*****11}   Pr = 0.75
{1****11, 1*****11}   Pr = 0.71
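A local search over seed sets can be sketched as simple hill climbing. The slides do not define the neighborhood or the scoring used, so both are assumptions here: a neighbor replaces one seed of the set with another candidate, and a set is scored by the fraction of a sample of similarities it detects rather than by a model-based probability.

```python
def detects(seed, sim):
    """True if, at some offset, every '1' in the seed sits over a '1' in sim."""
    return any(
        all(sim[i + j] == "1" for j, c in enumerate(seed) if c == "1")
        for i in range(len(sim) - len(seed) + 1)
    )

def coverage(seeds, sample):
    """Fraction of sampled similarities detected by at least one seed."""
    return sum(any(detects(s, sim) for s in seeds) for sim in sample) / len(sample)

def local_search(start, candidates, sample):
    """Hill climbing: accept any single-seed replacement that improves coverage."""
    current, best = list(start), coverage(start, sample)
    improved = True
    while improved:
        improved = False
        for k in range(len(current)):
            for cand in candidates:
                trial = current[:k] + [cand] + current[k + 1:]
                score = coverage(trial, sample)
                if score > best:
                    current, best, improved = trial, score, True
    return current, best

sample = ["100110100001", "1000010110001", "1101001011001"]
print(local_search(["11*1", "11*1"], ["11*1", "1*11"], sample))
```

Because every accepted move strictly increases the score, the loop terminates at a local optimum, which need not be the global one.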


Greedy Covering Algorithm

Design 3 simultaneous seeds: {s1, s2, s3}

s1 = argmax_x Pr(x)
s2 = argmax_x Pr(x | ~s1)
s3 = argmax_x Pr(x | ~{s1, s2})

[Figure: the similarity space, with the regions of similarities detected by s1, s2, and s3]
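The greedy covering idea can be sketched on a sample of similarities. This version makes two simplifying assumptions that differ from the slides: each argmax is taken by exhaustive enumeration of a small seed space instead of by local search, and Pr(x | ~chosen) is replaced by the count of still-undetected sample similarities that x detects (which has the same argmax as the empirical conditional frequency). All names are illustrative.

```python
from itertools import combinations

def detects(seed, sim):
    """True if, at some offset, every '1' in the seed sits over a '1' in sim."""
    return any(
        all(sim[i + j] == "1" for j, c in enumerate(seed) if c == "1")
        for i in range(len(sim) - len(seed) + 1)
    )

def seed_space(weight, max_span):
    """All seeds of the given weight (>= 2) and span <= max_span; both ends are '1'."""
    seeds = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            chars = ["*"] * span
            for i in (0, span - 1, *inner):
                chars[i] = "1"
            seeds.append("".join(chars))
    return seeds

def greedy_cover(sample, candidates, n):
    """Pick n seeds; each maximizes detections among similarities not yet covered."""
    chosen, remaining = [], list(sample)
    for _ in range(n):
        best = max(candidates, key=lambda x: sum(detects(x, s) for s in remaining))
        chosen.append(best)
        remaining = [s for s in remaining if not detects(best, s)]
    return chosen

sample = ["100110100001", "1000010110001", "1101001011001"]
print(len(seed_space(3, 8)))
print(greedy_cover(sample, ["11*1", "1*11"], 2))
```

Each seed is optimized once against what the earlier seeds missed, so the cost grows linearly with the number of seeds rather than with the size of the joint search space.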


Calculate Conditional Match Probabilities

Challenge: how to calculate the conditional probability efficiently?

Seeds with small span: exact computation via DFAs

Seeds with large span: Monte Carlo
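For the Monte Carlo case, a conditional match probability can be estimated by sampling. The sketch below assumes that similarities are drawn with independent positions that match with probability `p`; the slides instead use a model learned from real alignments, and the exact sampling scheme is not given. It uses simple rejection: samples detected by the already chosen seeds are discarded.

```python
import random

def detects(seed, sim):
    """True if, at some offset, every '1' in the seed sits over a '1' in sim."""
    return any(
        all(sim[i + j] == "1" for j, c in enumerate(seed) if c == "1")
        for i in range(len(sim) - len(seed) + 1)
    )

def monte_carlo_conditional(x, chosen, length, p, trials, rng):
    """Estimate Pr(x detects S | no seed in `chosen` detects S) by rejection sampling."""
    kept = hit = 0
    for _ in range(trials):
        sim = "".join("1" if rng.random() < p else "0" for _ in range(length))
        if any(detects(s, sim) for s in chosen):
            continue  # condition on "not detected by the chosen seeds"
        kept += 1
        hit += detects(x, sim)
    if kept == 0:
        raise ValueError("no sampled similarity escaped the chosen seeds")
    return hit / kept

rng = random.Random(0)  # fixed seed so the estimate is reproducible
print(monte_carlo_conditional("1*11", ["11*1"], 16, 0.5, 5000, rng))
```

Rejection sampling becomes wasteful when the chosen seeds already detect most similarities, because few samples survive the conditioning.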


Calculate Conditional Match Probability via DFA

Pr(x | ~∏) = Pr(x and ~∏) / Pr(~∏), where ∏ is the set of seeds already chosen

Build the DFA corresponding to (x and ~∏) by using cross product and complementation of DFAs

Efficiency: in the process of local search to find the optimal single seed x, Pr(~∏) can be precomputed
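The product construction can be imitated with a DP whose state tracks, side by side, whether the new seed x has matched and whether any previously chosen seed has matched. This is a sketch under assumptions, not the authors' DFA code: positions are independent matches with probability `p`, and the states are windows of recent bits plus two flags rather than states of explicitly built automata.

```python
def hits_at_end(seed, window):
    """True if the seed matches the last len(seed) bits of the window."""
    if len(window) < len(seed):
        return False
    tail = window[len(window) - len(seed):]
    return all(w == "1" for s, w in zip(seed, tail) if s == "1")

def conditional_match_probability(x, chosen, length, p):
    """Pr(x detects S | no seed in `chosen` detects S) for a random S of the given length."""
    span = max(len(s) for s in [x, *chosen])
    # State: (recent bits, has x matched?, has any chosen seed matched?)
    states = {("", False, False): 1.0}
    for _ in range(length):
        nxt = {}
        for (hist, x_hit, pi_hit), prob in states.items():
            for bit, q in (("1", p), ("0", 1.0 - p)):
                window = (hist + bit)[-span:]
                key = (
                    window,
                    x_hit or hits_at_end(x, window),
                    pi_hit or any(hits_at_end(s, window) for s in chosen),
                )
                nxt[key] = nxt.get(key, 0.0) + prob * q
        states = nxt
    # Denominator: depends only on the chosen seeds, so it could be computed once.
    pr_not_pi = sum(v for (_, _, pi_hit), v in states.items() if not pi_hit)
    pr_x_and_not_pi = sum(
        v for (_, x_hit, pi_hit), v in states.items() if x_hit and not pi_hit
    )
    return pr_x_and_not_pi / pr_not_pi

print(conditional_match_probability("1*11", ["11*1"], 64, 0.7))
```

The denominator does not depend on x, which mirrors the efficiency remark above: while many candidate seeds x are tried, only the numerator needs to be recomputed.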




Greedy Covering vs. Original Local Search

[Figure: detection probability of greedy covering vs. original local search]


Greedy Covering is Much Faster

When n = 5, on the same hardware platform (P4):
  Greedy covering needs 20 minutes
  The original local search needs 2.4 hours


Experimental Setup

The ungapped alignments are sampled uniformly from human and mouse syntenies

For a specified seed set:
  Sensitivity: the number of significant gapped alignments found by our BLAST-like alignment tool
  False positive rate: approximated by the number of seed matches


Results: Verify the Hypothesis on Noncoding Sequences

seed weight   number of seeds   gapped alignments found (sensitivity)   % improvement in sensitivity   total seed matches (approximation of false positives)
11            1                 251941                                  ----                           1.57 x 10^9
10            1                 273831                                  8.7                            5.88 x 10^9
11            3                 292093                                  15.9                           4.56 x 10^9
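The improvement column can be reproduced from the alignment counts, taking the single weight-11 seed as the baseline. The snippet below is only an arithmetic check on the numbers in the table; the variable and function names are illustrative.

```python
BASELINE_ALIGNMENTS = 251941   # weight 11, 1 seed
BASELINE_MATCHES = 1.57e9

def improvement(alignments, baseline=BASELINE_ALIGNMENTS):
    """Percent increase in gapped alignments found, relative to the baseline."""
    return 100.0 * (alignments - baseline) / baseline

rows = {
    "weight 10, 1 seed": (273831, 5.88e9),
    "weight 11, 3 seeds": (292093, 4.56e9),
}
for name, (alignments, matches) in rows.items():
    ratio = matches / BASELINE_MATCHES
    print(f"{name}: +{improvement(alignments):.1f}% alignments, {ratio:.2f}x seed matches")
```

By these numbers, three weight-11 seeds find more alignments than one weight-10 seed (15.9% vs. 8.7% over the baseline) while producing fewer seed matches (roughly 2.9 times the baseline rather than roughly 3.7 times).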


Summary of Contributions

Efficient algorithms to design multiple simultaneous seeds at reasonable cost

Empirical verification: multiple simultaneous seeds have a better tradeoff between sensitivity and specificity


Future Work

Design a better evaluation platform for different seeds

Investigate utility of seeds in multiple sequence alignment


Acknowledgements

Dr. Jeremy Buhler (advisor), Ben Westover, Rachel Nordgren, Joseph Lancaster and Christopher Swope

Laboratory for Computational Genomics at Washington University in Saint Louis

http://www.cse.wustl.edu/~jbuhler/mandala