Efficient Algorithms for Motif Search

04/20/23 1

Sudha Balla

Sanguthevar Rajasekaran

University of Connecticut

Problem 1 Definition

Input: n sequences of length m each, and integers l and d such that l << m and d < l. Each input sequence contains an occurrence of a motif M of length l, and this occurrence is at a Hamming distance of d from M.

Output: M.

The above problem is known as the Planted (l, d) Motif Problem.

Problem 2 Definition

Input: a database DB of n sequences, and integers l, d, and q.

Output: all the patterns in DB such that each pattern is of length l and occurs in at least q of the n sequences.

A pattern u is considered an occurrence of another pattern v as long as the edit distance between u and v is at most d.
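
The occurrence test in Problem 2 can be illustrated with a small sketch. It assumes unit-cost insertions, deletions, and substitutions (the slides do not state the cost model), and the function names edit_distance and is_occurrence are hypothetical.

```python
def edit_distance(u: str, v: str) -> int:
    """Unit-cost edit (Levenshtein) distance via a two-row dynamic program."""
    prev = list(range(len(v) + 1))  # distances from the empty prefix of u
    for i, a in enumerate(u, start=1):
        curr = [i]  # distance from u[:i] to the empty prefix of v
        for j, b in enumerate(v, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from u
                curr[j - 1] + 1,         # insert b into u
                prev[j - 1] + (a != b),  # substitute a by b (free if equal)
            ))
        prev = curr
    return prev[-1]


def is_occurrence(u: str, v: str, d: int) -> bool:
    """True if u counts as an occurrence of v, i.e. edit distance at most d."""
    return edit_distance(u, v) <= d
```

For example, is_occurrence("ACGT", "ACT", 1) holds because deleting G turns one pattern into the other.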

Problem 1: State of the Art

Two kinds of algorithms are known: approximate and exact.

WINNOWER (Pevzner and Sze [2000]) and PROJECTION (Buhler and Tompa [2001]) are approximate algorithms.

MITRA (Eskin and Pevzner [2002]) is an exact algorithm.

A Probabilistic Analysis

Problem 1 is complicated by the fact that, for a given value of l, the higher the value of d, the higher the expected number of motifs that occur by random chance. The expected number is also very sensitive to l: when n = 20, m = 600, l = 9, d = 2, the expected number of spurious motifs is 1.6, whereas for n = 20, m = 600, l = 10, d = 2, it is only $6.1 \times 10^{-8}$.
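
The figures above can be approximated with a short calculation. The sketch below assumes a uniform i.i.d. background over a 4-letter alphabet and treats the m - l + 1 windows of a sequence as independent; that this is the estimate behind the quoted numbers is an assumption, and the function name is hypothetical.

```python
from math import comb


def expected_spurious_motifs(n: int, m: int, l: int, d: int, sigma: int = 4) -> float:
    """Estimate the expected number of l-mers that occur, within Hamming
    distance d, in all n random sequences of length m (assumed i.i.d. uniform)."""
    # Probability that one random l-mer lies within distance d of a fixed l-mer.
    p = sum(comb(l, i) * (sigma - 1) ** i for i in range(d + 1)) / sigma ** l
    # Probability that at least one window of a sequence matches
    # (windows treated as independent, which is an approximation).
    per_sequence = 1 - (1 - p) ** (m - l + 1)
    # Sum over all sigma**l candidate motifs, each needing a hit in every sequence.
    return sigma ** l * per_sequence ** n
```

Under these assumptions, the call with n = 20, m = 600, l = 9, d = 2 evaluates to roughly 1.6, and changing l to 10 gives a value on the order of 6e-8.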

WINNOWER

Generate all l-mers from the input sequences. The number of such l-mers is O(nm).

Generate a graph G(V,E). Each l-mer is a node in G. Two nodes are connected if the Hamming distance between them is at most 2d.

Find all cliques in the graph. Process these cliques to identify M.
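
The graph construction step can be sketched as follows. This is a minimal illustration, not the WINNOWER implementation: it assumes edges are only drawn between l-mers of different sequences (which makes the graph multipartite), uses a quadratic pairwise scan, and omits clique finding entirely. The names hamming and build_winnower_graph are hypothetical.

```python
from itertools import combinations


def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def build_winnower_graph(seqs, l, d):
    """Adjacency sets over nodes (sequence index, start position)."""
    nodes = [(i, p) for i, s in enumerate(seqs) for p in range(len(s) - l + 1)]
    adj = {node: set() for node in nodes}
    for (i, p), (j, q) in combinations(nodes, 2):
        if i == j:
            continue  # assumption: no edges inside one sequence
        # Two variants of M, each at distance d from M, differ in at most 2d positions.
        if hamming(seqs[i][p:p + l], seqs[j][q:q + l]) <= 2 * d:
            adj[(i, p)].add((j, q))
            adj[(j, q)].add((i, p))
    return adj
```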

WINNOWER Details

Pevzner and Sze observe that the graph G constructed above is 'almost random' and is multipartite.

They use the notion of an extendable clique. If Q is any clique, node u is called a neighbor of Q if the nodes in Q and u also form a clique.

A clique is called extendable if it has at least one neighbor in every part of the multipartite graph G. The algorithm WINNOWER is based on the observation that every edge in a maximal n-clique belongs to at least $\binom{n-2}{k-2}$ extendable cliques of size k. This observation is used to eliminate edges.

PROJECTION

Let C be the collection of all l-mers in the input.

Project these l-mers along k randomly chosen columns (k is typically 7).

Group the k-mers such that equal k-mers are in the same group. If a group is of size greater than a threshold s (s is typically 3), then M is likely to have this k-mer.

The rest of M is computed using maximum likelihood estimates.
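
The projection and grouping steps can be sketched as below. This covers a single random projection only; the maximum likelihood refinement is not shown. The function name project_and_bucket and the optional rng parameter are hypothetical.

```python
import random
from collections import defaultdict


def project_and_bucket(seqs, l, k, s, rng=None):
    """Project every l-mer onto k random columns and keep the large groups."""
    rng = rng or random.Random()
    cols = sorted(rng.sample(range(l), k))  # the k chosen column indices
    buckets = defaultdict(list)
    for i, seq in enumerate(seqs):
        for p in range(len(seq) - l + 1):
            lmer = seq[p:p + l]
            key = "".join(lmer[c] for c in cols)  # projected k-mer
            buckets[key].append((i, p))
    # Groups with more than s members are candidates for containing M.
    heavy = {key: occ for key, occ in buckets.items() if len(occ) > s}
    return cols, heavy
```

Each surviving group records where its l-mers came from, so a later refinement step could start from those positions.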

MITRA

MITRA is based on WINNOWER; it uses pairwise similarity information.

MITRA uses a mismatch tree data structure and splits the space of all possible patterns into disjoint subspaces that start with a given prefix.

Pruning is applied in each subspace.

Pattern Branching

One way of solving the planted motif search problem is to start from each l-mer in the input, search the neighbors of this l-mer, score them appropriately and output the best scoring neighbor. Pattern Branching only examines a selected subset of neighbors of any l-mer u of the input and hence is more efficient.

For any l-mer u, let Di(u) stand for the set of neighbors of u that are at a Hamming distance of i. For any input sequence Sj let d(u,Sj) denote the minimum Hamming distance between u and any l-mer of Sj. Let $d(u,S) = \sum_{j=1}^{n} d(u,S_j)$.

Pattern Branching Contd…

For any l-mer u in the input let BestNeighbor(u) stand for the neighbor v in D1(u) whose distance d(v,S) is minimum from among all the elements of D1(u).

The PatternBranching algorithm starts from an l-mer u, identifies u1 = BestNeighbor(u), then identifies u2 = BestNeighbor(u1), and so on, until it reaches ud. The best ud from among all possible starting l-mers u is output.
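
The search described above can be sketched as follows. This is a minimal, unoptimized illustration under the definitions given on these slides; the helper names (dist_to_seq, total_dist, best_neighbor, pattern_branching) and the DNA alphabet default are assumptions.

```python
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def dist_to_seq(u, seq):
    """d(u, Sj): minimum Hamming distance between u and any l-mer of seq."""
    l = len(u)
    return min(hamming(u, seq[p:p + l]) for p in range(len(seq) - l + 1))


def total_dist(u, seqs):
    """d(u, S): sum of d(u, Sj) over all input sequences."""
    return sum(dist_to_seq(u, s) for s in seqs)


def best_neighbor(u, seqs, alphabet="ACGT"):
    """Element of D1(u) with the smallest total distance d(v, S)."""
    best, best_score = None, None
    for i, a in enumerate(u):
        for b in alphabet:
            if b == a:
                continue
            v = u[:i] + b + u[i + 1:]
            score = total_dist(v, seqs)
            if best_score is None or score < best_score:
                best, best_score = v, score
    return best


def pattern_branching(seqs, l, d, alphabet="ACGT"):
    """Branch d times from every input l-mer; return the best end point."""
    best, best_score = None, None
    for seq in seqs:
        for p in range(len(seq) - l + 1):
            u = seq[p:p + l]
            for _ in range(d):
                u = best_neighbor(u, seqs, alphabet)
            score = total_dist(u, seqs)
            if best_score is None or score < best_score:
                best, best_score = u, score
    return best, best_score
```

Because only one neighbor is followed at each step, the result is a heuristic answer and is not guaranteed to be M.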

A Simple Algorithm

1) Form all possible l-mers from the input sequences. Let C be this collection. Let C’ be the collection of l-mers in the first input sequence.

2) For every u in C’ generate all l-mers that are at a Hamming distance of d from u. Let C’’ be the collection of these l-mers. Note that C’’ contains M.

3) For every pair of l-mers (u, v) with u in C and v in C’’ compute the Hamming distance between u and v. Output the l-mer of C’’ that has a neighbor (i.e., an l-mer at a Hamming distance of d) in each one of the n input sequences.

A Simple Algorithm Contd…

The run time of the above algorithm is $O\left(n m^2 l \binom{l}{d} |\Sigma|^d\right)$.

PMS1

1) Generate all possible l-mers from each of the n input sequences. Let Ci be the collection of l-mers from the i-th sequence.

2) For each Ci and each u in Ci do: generate all l-mers v such that u and v are at a Hamming distance of d. Let Ci’ be the collection of these neighbors of Ci.

3) Sort all the l-mers in every Ci’. Let Li be the sorted list corresponding to Ci’.

4) Merge all the Li’s and output the generated (in step 2) l-mers that occur in all the Li’s.
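
The four steps can be sketched as follows. Note the substitution: this illustration uses Python sets and a set intersection in place of the radix sort and merge of sorted lists, so it shows the logic of PMS1 rather than its stated run time. The function names and the DNA alphabet default are assumptions.

```python
from itertools import combinations, product


def neighbors_at_distance(u, d, alphabet="ACGT"):
    """All strings at Hamming distance exactly d from u."""
    result = set()
    for positions in combinations(range(len(u)), d):
        # At each chosen position, any symbol other than the original one.
        choices = [[b for b in alphabet if b != u[p]] for p in positions]
        for repl in product(*choices):
            v = list(u)
            for p, b in zip(positions, repl):
                v[p] = b
            result.add("".join(v))
    return result


def pms1(seqs, l, d, alphabet="ACGT"):
    """Return every l-mer that has a distance-d neighbor in each sequence."""
    candidate_sets = []
    for seq in seqs:
        c_i = {seq[p:p + l] for p in range(len(seq) - l + 1)}  # step 1
        c_i_prime = set()
        for u in c_i:                                          # step 2
            c_i_prime |= neighbors_at_distance(u, d, alphabet)
        candidate_sets.append(c_i_prime)
    # Steps 3 and 4: intersection stands in for sort-and-merge.
    return sorted(set.intersection(*candidate_sets))
```

On a toy input where each sequence contains a variant of ACGT with one mismatch, pms1(seqs, 4, 1) includes "ACGT" among its outputs, possibly together with spurious candidates.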

PMS1 Contd…

The run time of PMS1 is $O\left(n m \binom{l}{d} |\Sigma|^d \frac{l}{w}\right)$. Here w is the word length of the computer. Radix sort is used.

PMS2

Note that if M occurs in every input sequence at a Hamming distance of at most d, then every substring of M also occurs in every input sequence at a Hamming distance of at most d.

In particular, let Q be the collection of k-mers that can be formed out of M (for d <= k <= l). There are l - k + 1 k-mers in Q, and each one of them is present in each input sequence at a Hamming distance of at most d.

PMS3

This algorithm enables one to handle large values of d. Let d’ = ⌊d/2⌋. Let M be the motif of interest with |M| = l = 2l’ for some integer l’. Let M’ refer to the first half of M and M’’ to the second half. We know that M occurs in every input sequence. Let S be an arbitrary input sequence and let p be the occurrence of M in S.

If p’ and p’’ are the two halves of p, then either (1) the Hamming distance between M’ and p’ is at most d’ or (2) the Hamming distance between M’’ and p’’ is at most d’.
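
The claim about the two halves follows from a short pigeonhole argument. The derivation below is a sketch that takes d’ to be the floor of d/2 and writes $d_H$ for Hamming distance; both are notational assumptions introduced here.

```latex
\begin{align*}
d_H(M, p) &= d_H(M', p') + d_H(M'', p'') \le d. \\
\text{If } d_H(M', p') > d' \text{ and } d_H(M'', p'') > d', \text{ then }
d_H(M, p) &\ge 2(d' + 1) = 2\lfloor d/2 \rfloor + 2 > d,
\end{align*}
```

which contradicts the first line, so at least one of the two halves is within distance d’.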

PMS3 Contd…

Also, note that in every input sequence either M’ occurs with a Hamming distance of at most d’ or M’’ occurs with a Hamming distance of at most d’.

As a result, at least one of M’ and M’’ occurs with a Hamming distance of at most d’ in at least n/2 of the input sequences. PMS3 exploits these observations.

Experimental Data

l    d    T (s)       l    d    T (s)       l    d    T (s)
9    2    1.44
10   2    0.84
11   2    0.78        11   3    19.84
12   2    0.84        12   3    15.53
13   2    0.70        13   3    20.98       13   4    228.94
14   2    1.05        14   3    20.38       14   4    226.83
15   2    1.33        15   3    20.53       15   4    217.34
16   2    2.61        16   3    21.20       16   4    216.92

A Comparison with MITRA

For l=11 and d=2, MITRA takes one minute whereas PMS2 takes around a second.

For l=12 and d=3, two versions of MITRA take one minute and four minutes, respectively. PMS2 takes 15.53 seconds.

For l=14 and d=4, two versions of MITRA take 4 minutes and 10 minutes, respectively. PMS2 takes 226.83 seconds.

Known Algorithms for Problem 2

Sagot [1998]’s algorithm runs in time $O(n^2 m l^d |\Sigma|^d)$ and is based on generalized suffix trees. The space used is $O(n^2 m / w)$, where w is the word length of the computer.

This algorithm builds a suffix tree on the given sequences in O(nm) time using O(nm) space. If u is any l-mer present in the input, there are $O(l^d (|\Sigma|-1)^d)$ possible neighbors for u. Any of these neighbors could potentially be a motif of interest. Since there are O(nm) l-mers in the input, the number of such neighbors is $O(n m l^d (|\Sigma|-1)^d)$.

Sagot’s Algorithm Contd…

This algorithm, for each such neighbor v, walks through the tree to check if v is a possible answer. This walking step is referred to as 'spelling'. The spelling operation takes a total of $O(n^2 m l^d (|\Sigma|-1)^d)$ time using an additional O(nm) space. When employed for solving Problem 2, the same algorithm takes $O(n^2 m l^d |\Sigma|^d)$ time.

The algorithm of Adebiyi and Kaufmann [2002] takes an expected $O(nm + d (nm)^{1.9} \log nm)$ time.

An Algorithm Similar to PMS1

The basic idea behind the algorithm is: We generate all possible l-mers in the database. There are at most mn such l-mers and these are the patterns of interest. For each such l-mer we want to determine if it occurs in at least q of the input sequences. Let u be one of the above l-mers. If v is a string such that the edit distance between u and v is at most d, then we say v is a neighbor of u.

We generate all the neighbors of u. For each neighbor v of u we determine a list of input sequences in which v is present. These lists (over all possible neighbors of u) are then merged to obtain a list of input sequences in which u occurs (within an edit distance of d).
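
The idea can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' array-based implementation: it assumes unit-cost edits, enumerates neighbors by repeated single edits, and uses Python substring search and sets in place of sorted lists. The names edit_neighbors and frequent_patterns are hypothetical.

```python
def edit_neighbors(u, d, alphabet="ACGT"):
    """All strings within edit distance d of u (unit-cost edits assumed)."""
    seen = {u}
    frontier = {u}
    for _ in range(d):
        nxt = set()
        for w in frontier:
            for i in range(len(w) + 1):
                for b in alphabet:
                    nxt.add(w[:i] + b + w[i:])            # insertion
                if i < len(w):
                    nxt.add(w[:i] + w[i + 1:])            # deletion
                    for b in alphabet:
                        nxt.add(w[:i] + b + w[i + 1:])    # substitution
        frontier = nxt - seen
        seen |= frontier
    return seen


def frequent_patterns(db, l, d, q, alphabet="ACGT"):
    """l-mers of db that occur, within edit distance d, in at least q sequences."""
    patterns = {s[p:p + l] for s in db for p in range(len(s) - l + 1)}
    result = []
    for u in sorted(patterns):
        hits = set()  # indices of sequences containing some neighbor of u
        for v in edit_neighbors(u, d, alphabet):
            if not v:
                continue  # skip the empty string, which matches everywhere
            hits |= {i for i, s in enumerate(db) if v in s}
        if len(hits) >= q:
            result.append(u)
    return result
```

The union of per-neighbor hit sets plays the role of merging the per-neighbor lists of input sequences.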

New Algorithm Contd…

The above algorithm runs in time $O(n^2 m l^d |\Sigma|^d)$. The space used is $O(nmd + l^d |\Sigma|^d)$.

The space used is less than that of prior algorithms. Only arrays are used in the new algorithm. The underlying constant is small, and hence the algorithm will potentially perform better in practice than Sagot’s algorithm.

Thank You.
