scott-hunter
Suffix tree (Example)
Let s = abab. A suffix tree of s contains all the suffixes of s$ = abab$:
{ $ b$ ab$ bab$ abab$ }
[Figure: the suffix tree of abab$.]
Trivial algorithm to build a Suffix tree
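Let s = abab$
Put the longest suffix, abab$, in
Put the suffix bab$ in
Continue with each shorter suffix in turn, splitting an edge wherever the new suffix diverges from a path already in the tree
[Figure: the tree after inserting abab$, bab$ and ab$.]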
We will also label each leaf with the starting position of the corresponding suffix.
[Figure: the suffix tree of abab$ with its leaves labeled 1 to 5.]
{ abab$ (1), bab$ (2), ab$ (3), b$ (4), $ (5) }
What can we do with it ?
7.1. APL1: Exact string matching
Given a text T, |T| = n, preprocess it so that when a pattern P, |P| = m, arrives you can quickly decide whether it occurs in T.
We may also want to find all occurrences of P in T.
Exact string matching
In preprocessing we just build a suffix tree in O(n) time.
[Figure: the suffix tree of abab$ with its leaves labeled 1 to 5.]
Given a pattern P = ab we traverse the tree according to the pattern.
T = abab$ (positions 1 2 3 4 5)
[Figure: the suffix tree of abab$ with its leaves labeled 1 to 5.]
If we did not get stuck traversing the pattern then the pattern occurs in the text.
Each leaf in the subtree below the node we reach corresponds to an occurrence.
By traversing this subtree we get all k occurrences in O(m+k) time: O(m) to follow the pattern and O(k) to visit the leaves below it.
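The traversal described above can be sketched in a few lines of Python. This is a minimal sketch, not the linear-time construction: it assumes an uncompressed suffix trie built by the trivial insert-every-suffix method, so it needs O(n^2) time and space. The names `build_suffix_trie` and `find_occurrences` and the `"leaf"` key are made up for this illustration.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a nested-dict trie.

    Assumption: an uncompressed trie stands in for the suffix tree,
    so construction is quadratic, not O(n).
    """
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # Label the leaf with the 1-based starting position of the suffix.
        node["leaf"] = start + 1
    return root


def find_occurrences(root, pattern):
    """Return the sorted 1-based positions at which pattern occurs."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # got stuck: the pattern does not occur
        node = node[ch]
    # Every leaf below the node we reached is one occurrence.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abab")
print(find_occurrences(trie, "ab"))  # [1, 3]
```

For T = abab and P = ab the walk ends at the node whose subtree holds leaves 1 and 3, which are exactly the two occurrences.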
Generalized suffix tree
Given a set of strings S, a generalized suffix tree of S contains all suffixes of every s ∈ S.
To make these suffixes prefix-free we add a special char, say $, at the end of each s.
To associate each suffix with a unique string in S, add a different special char to each s.
Generalized suffix tree (Example)
Let s1 = abab and s2 = aab. Here is a generalized suffix tree for s1 and s2:
{ $ # b$ b# ab$ ab# bab$ aab# abab$ }
[Figure: the generalized suffix tree for abab$ and aab#; each leaf is labeled with the starting position of its suffix in its own string.]
S = s1$s2# = abab$aab#
What can we do with it ?
7.5. APL5: Recognizing DNA Contamination
• We isolate, purify, clone, copy, maintain, probe, or sequence DNA strings
• Often unwanted DNA is inserted into the desired DNA sequence
• A DNA sequence extracted from a dinosaur was more similar to mammal (human) DNA than to bird or crocodilian DNA
7.5 DNA Contamination Problem
• Let string S1 be the newly isolated/sequenced DNA
• We know S2, the possible contaminating DNA
• Problem: find all substrings of S2 that occur in S1 and that are longer than some given length l
• Solution: build a generalized suffix tree of S1 and S2 and report the nodes with string depth greater than l that have leaf descendants from both strings
7.4: Longest common substring (of two strings)
Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring, and vice versa.
[Figure: the generalized suffix tree for abab$ and aab#.]
Find such a node with the largest "string depth".
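The rule above (a node with leaf descendants from both strings, of largest string depth) can be sketched as follows. This is an illustration under stated assumptions: it uses a quadratic generalized suffix trie instead of a linear-time generalized suffix tree, it assumes that neither input contains `$` or `#`, and the function name `longest_common_substring` is made up here.

```python
def longest_common_substring(s1, s2):
    """Return one longest common substring of s1 and s2.

    Assumption: a generalized suffix *trie* (O(n^2) space) stands in
    for the generalized suffix tree; '$' and '#' must not occur in
    the inputs.
    """
    root = {"kids": {}, "ids": set()}
    for string_id, text in ((1, s1 + "$"), (2, s2 + "#")):
        for start in range(len(text)):
            node = root
            node["ids"].add(string_id)
            for ch in text[start:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "ids": set()})
                # Record which string has a suffix passing through here.
                node["ids"].add(string_id)

    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path  # deeper node shared by both strings
        for ch, child in node["kids"].items():
            if child["ids"] == {1, 2}:
                stack.append((child, path + ch))
    return best


print(longest_common_substring("abab", "aab"))  # ab
```

The same marking also addresses the contamination problem: instead of keeping only the deepest shared node, one would report every shared node whose string depth exceeds the threshold l.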
7.11.1 Repetitive Structures in Biological Strings
• One of the most striking features of DNA is that there are many repeated substrings
• This is especially true for eukaryotes
  – Most of the Y chromosome consists of repeated substrings
  – One third of the human genome comes from repeated families
• Prokaryotes have fewer repeats
7.11.1 Repetitive Structures in Biological Strings
• Three types of repeats
  – Local: small-scale repeated strings whose function or origin is at least partially understood
  – Simple repeats: both local and interspersed, whose function is less clear
  – Complex interspersed repeats: whose function is even more in doubt
7.11.1 Repetitive Structures in Biological Strings
• A palindrome is a string that reads the same backwards as forwards
  – xyaayx is a palindrome
  – "Was it a cat I saw" is a palindrome, ignoring spaces and capitalization
7.11.1 Repetitive Structures in Biological Strings
• Complemented palindrome of DNA or RNA: a string that equals its own reverse after every character is replaced by its complement
  – A – T/U and C – G are complements in DNA/RNA
  – AGCTCGCGAGCT is a complemented palindrome
• Complemented palindromes in both DNA and RNA regulate DNA transcription
  – The strand folds to form a "hairpin loop"
• Many more functionalities
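A complemented palindrome can be checked directly from the pairing rule above. The sketch below handles DNA only (A–T, C–G); the names `COMPLEMENT`, `reverse_complement` and `is_complemented_palindrome` are chosen for this illustration.

```python
# DNA base pairing as stated on the slide (RNA would pair A with U instead).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Complement every base and read the result backwards."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def is_complemented_palindrome(dna):
    """True if dna reads the same as its reverse complement."""
    return dna == reverse_complement(dna)


print(is_complemented_palindrome("AGCTCGCGAGCT"))  # True
print(is_complemented_palindrome("GATTACA"))       # False
```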
7.11.1 Repetitive Structures in Biological Strings
• A restriction enzyme recognizes a specific substring in the DNA of both prokaryotes and eukaryotes and cuts the DNA at every place where the pattern occurs
• Restriction enzyme cutting sites are interesting examples of repeats because they tend to be complemented palindromic repeats
  – For example, EcoRI (a restriction enzyme) recognizes GAATTC and cuts between the G and the adjoining A
  – BglI recognizes GCCNNNNNGGC, where N stands for any nucleotide. The enzyme cuts between the last two Ns.
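As a small illustration of the two recognition sites above, the sketch below locates every site in a DNA string with Python's `re` module and reports where the cut falls. The function name `cut_sites`, the `cut_offset` parameter (the number of bases of the site that lie before the cut) and the test sequences are assumptions made for this example, not part of any library.

```python
import re


def cut_sites(dna, site, cut_offset):
    """Return the 0-based indices at which the enzyme cuts dna.

    site may contain 'N' for "any nucleotide"; cut_offset is the
    number of bases of the site that lie before the cut. A result i
    means the cut falls between dna[i-1] and dna[i].
    """
    # A lookahead lets overlapping sites be reported as well.
    pattern = "(?=" + site.replace("N", "[ACGT]") + ")"
    return [match.start() + cut_offset for match in re.finditer(pattern, dna)]


# EcoRI: GAATTC, cut between the G and the adjoining A (1 base before the cut).
print(cut_sites("TTGAATTCAAGAATTCG", "GAATTC", 1))     # [3, 11]
# BglI: GCCNNNNNGGC, cut between the last two Ns (7 bases before the cut).
print(cut_sites("AAGCCATGCAGGCTT", "GCCNNNNNGGC", 7))  # [9]
```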
Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.
Why? The LCA of two leaves represents the longest common prefix (LCP) of these two suffixes.
[Figure: the generalized suffix tree for abab$ and aab#.]
Finding maximal palindromes
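• A palindrome: caabaac, cbaabc
• Want to find all maximal palindromes in a string s
Let s = cbaaba
The maximal even-length palindrome with center between i-1 and i extends to each side by the length of the LCP of the suffix at position i of s and the suffix at position m-i+1 of sr, where sr is s reversed, positions are 1-based, and m counts the end marker (m = 7 for s = cbaaba$)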
Maximal palindromes algorithm
Prepare a generalized suffix tree for s = cbaaba$ and sr = abaabc#
For every i, find the LCA of suffix i of s and suffix m-i+1 of sr
[Figure: the generalized suffix tree for cbaaba$ and abaabc#.]
Let s = cbaaba$ then sr = abaabc#
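The pairing of suffix i of s with suffix m-i+1 of sr can be tried out with the sketch below. It is an illustration under stated assumptions: a direct character-by-character LCP replaces the constant-time LCA query on the generalized suffix tree, m is taken to count the end marker (m = 7 for cbaaba$), only even-length palindromes are reported, and the names `lcp_length` and `maximal_even_palindromes` are made up here.

```python
def lcp_length(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def maximal_even_palindromes(s):
    """Map each 1-based center i (between i-1 and i) to its maximal
    even-length palindrome; centers with no palindrome are omitted.

    Assumption: a naive LCP scan stands in for the LCA query.
    """
    t, tr = s + "$", s[::-1] + "#"
    m = len(t)  # counts the end marker
    result = {}
    for i in range(2, m):
        # Suffix i of s and suffix m-i+1 of sr, converted to 0-based slices.
        k = lcp_length(t[i - 1:], tr[m - i:])
        if k > 0:
            result[i] = s[i - 1 - k:i - 1 + k]
    return result


print(maximal_even_palindromes("cbaaba"))  # {4: 'baab'}
```

For s = cbaaba and i = 4 the two suffixes are aba$ and abc#, whose LCP ab has length 2, giving the palindrome baab. Odd-length palindromes centered at position i can be handled the same way by pairing suffix i+1 of s with suffix m-i+1 of sr.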
Drawbacks
• Suffix trees consume a lot of space
• The space is O(n), but the constant is quite big
• Notice that if we indeed want to traverse an edge in O(1) time, then we need an array of pointers of size |Σ| in each node
7.14.1 Suffix array
• We lose some of the functionality but we save space.
Let s = abab
Sort the suffixes lexicographically: ab, abab, b, bab
The suffix array gives the indices of the suffixes in lexically sorted order
3 1 4 2
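The definition can be checked with a direct sort of the suffixes. This is a minimal sketch for small inputs: comparing whole suffixes makes it far slower than a linear-time construction, and the name `suffix_array` is chosen for this illustration.

```python
def suffix_array(s):
    """Return the 1-based starting positions of the suffixes of s
    in lexicographic order (naive: sorts the suffixes themselves)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


print(suffix_array("abab"))  # [3, 1, 4, 2]
```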
Example
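Let T = mississippi

 i   Pos(i)  suffix
 1   11      i
 2    8      ippi
 3    5      issippi
 4    2      ississippi
 5    1      mississippi
 6   10      pi
 7    9      ppi
 8    7      sippi
 9    4      sissippi
10    6      ssippi
11    3      ssissippi

L, M and R mark the left end, the midpoint and the right end of the current search interval.
Let P = issa
1. The suffix starting at position Pos(1) of T is the lexically smallest suffix
2. In general, suffix Pos(i) of T is lexically smaller than suffix Pos(i+1)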
How do we build it ?
• Build a suffix tree
• Traverse the tree in DFS, picking the edges outgoing from each node in lexicographic order, and fill the suffix array
• O(n) time
[Figure: the generalized suffix tree for abab$ and aab#.]
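The build-then-traverse recipe can be sketched as follows. It assumes a quadratic suffix trie in place of a linear-time suffix tree, so it shows the traversal order and not the O(n) bound; the recursion also limits it to short strings. The name `suffix_array_via_trie` and the `"leaf"` key are made up for this illustration.

```python
def suffix_array_via_trie(s):
    """Fill the suffix array by a lexicographic DFS over a suffix trie.

    Assumption: an uncompressed trie (O(n^2)) stands in for the
    suffix tree; recursion depth grows with len(s).
    """
    text = s + "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based starting position

    order = []

    def dfs(node):
        if "leaf" in node:
            order.append(node["leaf"])
        # Visit the outgoing edges in lexicographic order.
        for ch in sorted(key for key in node if key != "leaf"):
            dfs(node[ch])

    dfs(root)
    # Drop the suffix that consists of the end marker alone.
    return [pos for pos in order if pos != len(text)]


print(suffix_array_via_trie("abab"))  # [3, 1, 4, 2]
```

Because `$` sorts before the letters, a suffix that is a prefix of another is visited first, which matches the usual lexicographic order.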
How do we search for a pattern ?
• If P occurs in T then all its occurrences are consecutive in the suffix array.
• Do a binary search on the suffix array
• Takes O(m log n) time: O(log n) steps, each comparing P with one suffix in O(m) time
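The binary search can be sketched as two boundary searches over the suffix array, one for each end of the consecutive block of occurrences. This is a minimal sketch: the suffix array is built by naive sorting, and the names `suffix_array` and `find_occurrences` are chosen for this illustration.

```python
def suffix_array(s):
    """1-based suffix array by naive sorting (for illustration only)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def find_occurrences(text, sa, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(index):
        # First m characters of the suffix stored at sa[index].
        start = sa[index] - 1
        return text[start:start + m]

    # Leftmost suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Leftmost suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "mississippi"
sa = suffix_array(text)
print(find_occurrences(text, sa, "issi"))  # [2, 5]
print(find_occurrences(text, sa, "issa"))  # []
```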
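Example
Let S = mississippi
[Figure: the sorted suffixes of mississippi (i, ippi, issippi, ississippi, mississippi, pi, ppi, sippi, sissippi, ssippi, ssissippi) with starting positions 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3; L, M and R mark the ends and the midpoint of the search interval.]
Let P = issi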
How do we accelerate the search ?
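Maintain l = LCP(P, L)
Maintain r = LCP(P, R)
If l = r then start comparing M to P at l + 1
[Figure: the interval from L to R with midpoint M; l and r are the lengths of the prefixes of P matched at L and R.]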
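The idea of maintaining l and r can be sketched as below. It relies on one extra observation that the slide does not state: every suffix between L and R shares its first min(l, r) characters with P, so the comparison at M may start at min(l, r) + 1 even when l and r differ. The sketch only decides whether P occurs; the suffix array is built by naive sorting, and the names `suffix_array`, `extend_match` and `occurs` are made up for this illustration.

```python
def suffix_array(s):
    """1-based suffix array by naive sorting (for illustration only)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def extend_match(text, start, pattern, k):
    """Given that pattern[:k] matches text[start:start+k], extend the
    match and return the new length of the common prefix."""
    while k < len(pattern) and start + k < len(text) and text[start + k] == pattern[k]:
        k += 1
    return k


def occurs(text, sa, pattern):
    """Binary search that skips the first min(l, r) characters at M."""
    if not sa:
        return pattern == ""
    m = len(pattern)
    lo, hi = 0, len(sa) - 1
    l = extend_match(text, sa[lo] - 1, pattern, 0)  # l = LCP(P, L)
    r = extend_match(text, sa[hi] - 1, pattern, 0)  # r = LCP(P, R)
    if l == m or r == m:
        return True
    while hi - lo > 1:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        k = extend_match(text, start, pattern, min(l, r))
        if k == m:
            return True
        if start + k < len(text) and text[start + k] > pattern[k]:
            hi, r = mid, k  # suffix at M is larger than P: go left
        else:
            lo, l = mid, k  # suffix at M is smaller than P: go right
    return False


text = "mississippi"
sa = suffix_array(text)
print(occurs(text, sa, "issi"))  # True
print(occurs(text, sa, "issa"))  # False
```

This simple variant speeds up typical searches but keeps the O(m log n) worst case; the stronger bound needs the precomputed LCP(L, M) values.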
How do we accelerate the search ?
Suppose we know LCP(L, M)
If l > r then
  If LCP(L, M) < l we go left
  If LCP(L, M) > l we go right
  If LCP(L, M) = l we start comparing at l + 1
[Figure: the interval from L to R with midpoint M, in the case l > r.]
Analysis of the acceleration
If we do more than a single comparison in an iteration, then max(l, r) grows by 1 for each comparison, so the whole search takes O(log n + m) time.