Designing Multiple Simultaneous Seeds for DNA Similarity Search
Yanni Sun, Jeremy Buhler
Washington University in Saint Louis
WashU. Laboratory for Computational Genomics
Outline
Problem of multi-seed design
Methods
  Greedy covering algorithm
  Compute conditional match probabilities
Experiments and results
Conclusion and future work
Sequence Alignment
Functional regions conserved despite DNA mutations over time
Conserved region can be aligned with high score
Exact solution: DP; time complexity: O(MN)
Fast but heuristic solution: seeded alignment algorithm
Seeded Alignment Algorithm
BLAST is the most popular tool.
Step 1: find a word match
Step 2: extend the match to find the high-similarity pair
[Figure: the two sequences below, shown once for each step]
TAGGACCTAACC
GACCACCTTTT
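The two steps can be sketched in a few lines of Python. This is only an illustration of the seed-and-extend idea, not BLAST's actual implementation: the function names, the match/mismatch scores, and the `drop` cutoff are all assumptions made for the example.

```python
def word_matches(a, b, k):
    """Step 1: return (i, j) pairs where a[i:i+k] == b[j:j+k]."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], []).append(j)
    hits = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], []):
            hits.append((i, j))
    return hits


def extend(a, b, i, j, k, match=1, mismatch=-1, drop=2):
    """Step 2: extend a word match in both directions without gaps.

    Extension in one direction stops once the running score falls more
    than `drop` below the best score seen (hypothetical scoring values).
    Returns (score, (start_in_a, start_in_b, length)).
    """
    def walk(step, pa, pb):
        score = best = 0
        best_len = n = 0
        while 0 <= pa < len(a) and 0 <= pb < len(b):
            score += match if a[pa] == b[pb] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > drop:
                break
            pa += step
            pb += step
        return best, best_len

    right_score, right_len = walk(+1, i + k, j + k)
    left_score, left_len = walk(-1, i - 1, j - 1)
    score = k * match + right_score + left_score
    return score, (i - left_len, j - left_len, k + left_len + right_len)
```

With the two sequences on this slide and a word length of 4, step 1 finds the shared words GACC and ACCT; neither can be extended further under these illustrative scores.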
Seed and Similarity
Example of a similarity and a single seed
tgcagaaatgcagaggca
| || |    | ||||
tacacaggcaccgaggag
Similarity: 101101000010111100
Seed: 11*1, weight = 3, span = 4
The seed detects/matches this similarity.
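The example on this slide can be checked mechanically. The sketch below (helper names are illustrative, not from the talk) turns an ungapped alignment into its 0/1 similarity string and lists the offsets where a seed matches, treating `1` as "must match" and `*` as "don't care".

```python
def similarity_string(seq1, seq2):
    """1 where the aligned bases agree, 0 where they differ."""
    return "".join("1" if x == y else "0" for x, y in zip(seq1, seq2))


def seed_hits(seed, similarity):
    """0-based offsets at which every '1' of the seed sits on a '1'."""
    required = [k for k, c in enumerate(seed) if c == "1"]
    return [
        i
        for i in range(len(similarity) - len(seed) + 1)
        if all(similarity[i + k] == "1" for k in required)
    ]


sim = similarity_string("tgcagaaatgcagaggca", "tacacaggcaccgaggag")
weight = "11*1".count("1")   # number of required matches
span = len("11*1")           # total length of the pattern
hits = seed_hits("11*1", sim)  # offsets 2 and 12 (0-based)
```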
Seed Choice is Important
[Figure: significant alignments vs. seed matches, illustrated with two contiguous patterns, 11111111111 (11 ones) and 1111111111111111 (16 ones)]
Seed Design: Previous Work
Traditional seed: word (e.g. 11111111111)
Discontiguous patterns of matching bases: [CR1993]; [MTL’02] {111010010100110111}
Our work on single discontiguous seed: [BKS’03]
Multiple Simultaneous Seeds
Multiple simultaneous seeds are defined as a set of seeds: ∏ = {seed1, seed2, …, seedi, …, seedn}
∏ detects a similarity if at least one of the component seeds detects the similarity
Example
Simultaneous seeds {11*1, 1*11} detect similarities 100110100001, 1000010110001, 1101001011001
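A short sketch of the "at least one component seed" rule, using the three similarities from the example. The function names are illustrative only.

```python
def detects(seed, similarity):
    """True if the seed matches the similarity at some offset."""
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[i + k] == "1" for k in required)
        for i in range(len(similarity) - len(seed) + 1)
    )


def set_detects(seeds, similarity):
    """A seed set detects a similarity if any one of its seeds does."""
    return any(detects(s, similarity) for s in seeds)


seeds = ["11*1", "1*11"]
examples = ["100110100001", "1000010110001", "1101001011001"]
# For each similarity, which component seeds detect it.
detected_by = {sim: [s for s in seeds if detects(s, sim)] for sim in examples}
```

In this example the first similarity is detected only by 11*1, the second only by 1*11, and the third by both, which is why the pair covers more than either seed alone.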
Multi-seed Design – Balance Sensitivity with Specificity
A = biologically meaningful alignments that are also seed matches
Sensitivity = A / biologically meaningful alignments
Specificity = A / seed matches
Increase sensitivity:
  Decrease weight of single seed
  Use multiple seeds
Both methods hurt specificity
Hypothesis: a set of multiple seeds has a better tradeoff of sensitivity vs. specificity than a single seed
[Figure: Venn diagram of biologically meaningful alignments and seed matches, with overlap A]
Our Work – Design Multiple Simultaneous Seeds Efficiently
Use a new local search method to optimize the seed set
Design an efficient algorithm to calculate conditional match probability
Empirical verification that multiple simultaneous seeds have a better tradeoff of sensitivity vs. specificity
Multi-seed Design Problem
Input:
  Ungapped alignments sampled from two genomic DNA sequences
  Resource constraints of seeds: weight, span, number
Goal: find a set of seeds ∏ to maximize the detection probability Pr[∏ detects S].
Pr[∏ detects S] = Pr[(seed1 detects S) or (seed2 detects S) … or (seedn detects S)]
Outline
Problem of multi-seed design
Methods
  Greedy covering algorithm
  Compute conditional match probabilities
Experiments and results
Conclusion and future work
Computing Match Probability for Specified Seeds [BKS ’03]
Learn a kth-order Markov model M from similarities.
Build a DFA that accepts only strings containing a match to one of the given seeds.
Compute, by DP, the probability that the DFA accepts a string chosen randomly from model M.
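The automaton-plus-DP idea can be sketched as follows, under two loud simplifications that are not from the talk: the model is 0th-order (every position is a match independently with probability `p`, instead of a learned kth-order Markov model), and the automaton state is simply the window of the most recent bits rather than a minimized DFA. A brute-force enumeration is included only as a cross-check for short lengths.

```python
from itertools import product


def _required(seed):
    return [k for k, c in enumerate(seed) if c == "1"]


def detection_probability(seeds, length, p):
    """Pr[some seed matches a random 0/1 string of the given length]."""
    span = max(len(s) for s in seeds)
    required = [_required(s) for s in seeds]

    def window_hits(window):
        # Does any seed match ending exactly at the newest position?
        for s, req in zip(seeds, required):
            if len(window) >= len(s):
                tail = window[len(window) - len(s):]
                if all(tail[k] == 1 for k in req):
                    return True
        return False

    states = {(): 1.0}  # window of recent bits -> mass not yet accepted
    accepted = 0.0      # mass already in the absorbing accept state
    for _ in range(length):
        nxt = {}
        for window, mass in states.items():
            for bit, prob in ((1, p), (0, 1.0 - p)):
                w = (window + (bit,))[-span:]
                if window_hits(w):
                    accepted += mass * prob
                else:
                    nxt[w] = nxt.get(w, 0.0) + mass * prob
        states = nxt
    return accepted


def brute_force_probability(seeds, length, p):
    """Same quantity by enumerating all 2**length strings (check only)."""
    total = 0.0
    for bits in product((0, 1), repeat=length):
        hit = any(
            all(bits[i + k] == 1 for k in _required(s))
            for s in seeds
            for i in range(length - len(s) + 1)
        )
        if hit:
            ones = sum(bits)
            total += p ** ones * (1.0 - p) ** (length - ones)
    return total
```

The DP touches at most 2^span states per position, whereas the brute force grows as 2^length, which is the reason an automaton is worth building even in this toy setting.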
Seek the Locally Optimal Set of Seeds
Original local search
Greedy covering algorithm – a faster local search strategy
Efficient computation of conditional match probability
Find Optimal Set of Seeds by Original Local Search
Seed space with span <= 8, weight = 3
{1*1***1, 1*****11}  Pr = 0.70
{1**1**1, 1*****11}  Pr = 0.67
{1***1*1, 1*****11}  Pr = 0.75
{1****11, 1*****11}  Pr = 0.71
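One plausible reading of this picture is a hill climb in which neighboring seed sets differ in a single seed. The sketch below follows that reading and is an assumption, not the authors' code: it scores a seed set by the fraction of sampled similarities it detects (instead of a model-based probability) and accepts any single-seed swap that improves the score.

```python
from itertools import combinations


def detects(seed, similarity):
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[i + k] == "1" for k in required)
        for i in range(len(similarity) - len(seed) + 1)
    )


def sensitivity(seeds, sims):
    """Fraction of sampled similarities detected by the seed set."""
    return sum(any(detects(s, sim) for s in seeds) for sim in sims) / len(sims)


def all_seeds(weight, max_span):
    """Every seed of the given weight (>= 2) that starts and ends with 1."""
    seeds = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            chars = ["*"] * span
            chars[0] = chars[-1] = "1"
            for k in inner:
                chars[k] = "1"
            seeds.append("".join(chars))
    return seeds


def local_search(start, candidates, sims):
    """Swap one seed at a time while the score strictly improves."""
    current = list(start)
    best = sensitivity(current, sims)
    improved = True
    while improved:
        improved = False
        for idx in range(len(current)):
            for cand in candidates:
                if cand in current:
                    continue
                trial = current[:idx] + [cand] + current[idx + 1:]
                score = sensitivity(trial, sims)
                if score > best:
                    current, best, improved = trial, score, True
    return current, best
```

Every accepted swap re-scores the whole set, which hints at why this style of search becomes expensive as the number of seeds grows.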
Greedy Covering Algorithm
Design 3 simultaneous seeds: {s1, s2, s3}
s1 = argmax_x Pr(x)
s2 = argmax_x Pr(x | ~s1)
s3 = argmax_x Pr(x | ~{s1, s2})
[Figure: similarity space, showing the similarities detected by s1, s2, and s3]
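The greedy choice can be illustrated on a sample of similarities. This is a sample-based sketch, assuming the conditional probability Pr(x | ~chosen) is estimated by counting among the similarities that the seeds chosen so far miss; the talk computes these probabilities from a model instead. It also assumes n does not exceed the number of candidate seeds.

```python
def detects(seed, similarity):
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[i + k] == "1" for k in required)
        for i in range(len(similarity) - len(seed) + 1)
    )


def greedy_cover(candidates, sims, n):
    """Pick n seeds, each maximizing coverage of what is still undetected."""
    chosen = []
    remaining = list(sims)  # similarities missed by every chosen seed
    for _ in range(n):
        pool = [c for c in candidates if c not in chosen]
        # The denominator (len(remaining)) is the same for every candidate,
        # so maximizing the count maximizes the conditional frequency.
        best = max(pool, key=lambda c: sum(detects(c, s) for s in remaining))
        chosen.append(best)
        remaining = [s for s in remaining if not detects(best, s)]
    return chosen
```

Each seed is chosen once and never revisited, so the number of candidate evaluations grows linearly with the number of seeds.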
Calculate Conditional Match Probabilities
Challenge: how to calculate the conditional probability efficiently?
Seeds with small span: exact computation via DFAs
Seeds with large span: Monte Carlo
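For the Monte Carlo case, a minimal rejection-sampling sketch looks like this. It assumes an independent-positions model with match probability `p` in place of the learned Markov model, and the function and parameter names are illustrative.

```python
import random


def detects(seed, similarity):
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[i + k] == "1" for k in required)
        for i in range(len(similarity) - len(seed) + 1)
    )


def monte_carlo_conditional(x, chosen, length, p, trials, rng):
    """Estimate Pr(x detects S | no seed in `chosen` detects S)."""
    kept = hits = 0
    for _ in range(trials):
        sim = "".join("1" if rng.random() < p else "0" for _ in range(length))
        if any(detects(s, sim) for s in chosen):
            continue  # reject: sample violates the conditioning event
        kept += 1
        hits += detects(x, sim)
    return hits / kept if kept else float("nan")


rng = random.Random(0)  # fixed seed so repeated runs give the same estimate
estimate = monte_carlo_conditional("1*11", ["11*1"], 64, 0.7, 2000, rng)
```

The estimate gets noisier as the chosen seeds become more sensitive, because more samples are rejected and fewer remain to count.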
Calculate Conditional Match Probability via DFA
Pr(x | ~∏) = Pr(x and ~∏) / Pr(~∏), where ∏ is the set of seeds already chosen
Build the DFA corresponding to (x and ~∏) by using cross product and complementation of DFAs
Efficiency: in the process of local search to find the optimal single seed x, Pr(~∏) can be precomputed
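A short derivation shows why the denominator can be shared across candidates. Here \Pi stands for the set of seeds already chosen and x for the candidate seed; the second equality is a standard identity, offered as a way to read the slide rather than as the authors' stated procedure.

```latex
\begin{align*}
\Pr(x \mid \lnot\Pi)
  &= \frac{\Pr(x \land \lnot\Pi)}{\Pr(\lnot\Pi)} \\
  &= \frac{\Pr(\Pi \cup \{x\}) - \Pr(\Pi)}{1 - \Pr(\Pi)}
\end{align*}
```

The second line holds because the event "\Pi \cup \{x\} detects S" is the disjoint union of "\Pi detects S" and "x detects S but \Pi does not". The denominator does not depend on x, so it is computed once per greedy step and only the numerator changes from one candidate to the next.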
Outline
Problem of multi-seed design
Methods
  Greedy covering algorithm
  Compute conditional match probabilities
Experiments and results
Conclusion and future work
Greedy Covering vs. Original Local Search
[Figure: detection probability of greedy covering vs. original local search]
Greedy Covering is Much Faster
When n=5, on the same hardware platform (P4):
  Greedy covering needs 20 minutes
  The original local search needs 2.4 hours
Experimental Setup
The ungapped alignments are sampled uniformly from human and mouse syntenies
For a specified seed set:
  Sensitivity: the number of significant gapped alignments found by our BLAST-like alignment tool
  False positive rate: approximated by the number of seed matches
Results: Verify the Hypothesis on Noncoding Sequences
seed weight | number of seeds | # gapped alignments found (sensitivity) | % improvement of sensitivity | total seed matches (approximation of f.p.)
11 | 1 | 251941 | ---- | 1.57 x 10^9
10 | 1 | 273831 | 8.7 | 5.88 x 10^9
11 | 3 | 292093 | 15.9 | 4.56 x 10^9
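The percentages in the table can be rechecked from the raw counts, and the same arithmetic gives the relative growth in seed matches. The variable names below are illustrative; the numbers are the ones reported on this slide.

```python
# (seed weight, number of seeds, gapped alignments found, total seed matches)
rows = [
    (11, 1, 251941, 1.57e9),  # baseline: one seed of weight 11
    (10, 1, 273831, 5.88e9),  # one seed of lower weight
    (11, 3, 292093, 4.56e9),  # three simultaneous seeds of weight 11
]
base_found, base_matches = rows[0][2], rows[0][3]


def improvement(found):
    """Percent gain in alignments found relative to the baseline row."""
    return 100.0 * (found - base_found) / base_found


def match_ratio(matches):
    """Seed matches relative to the baseline row."""
    return matches / base_matches


summary = [
    (w, n, round(improvement(found), 1), round(match_ratio(m), 2))
    for w, n, found, m in rows[1:]
]
```

Read this way, three weight-11 seeds find more alignments than a single weight-10 seed while producing fewer seed matches, which is the tradeoff the hypothesis predicts.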
Summary of Contributions
Efficient algorithms to design multiple simultaneous seeds at reasonable cost
Empirical verification: multiple simultaneous seeds have a better tradeoff between sensitivity and specificity
Future Work
Design a better evaluation platform for different seeds
Investigate utility of seeds in multiple sequence alignment
Acknowledgements
Dr. Jeremy Buhler (advisor), Ben Westover, Rachel Nordgren, Joseph Lancaster and Christopher Swope
Laboratory for Computational Genomics at Washington University in Saint Louis
http://www.cse.wustl.edu/~jbuhler/mandala