02.01.15
1
Text Algorithms
Jaak Vilo 2014 fall
MTAT.03.190 Text Algorithms Jaak Vilo
Exact pattern matching
• S = s1 s2 … sn (text), |S| = n (length)
• P = p1 p2 … pm (pattern), |P| = m
• Σ – alphabet, |Σ| = c
• Does S contain P?
 – Does S = S' P S'' for some strings S' and S''?
 – Usually m << n, and n can be (very) large
Find occurrences in text
[figure: pattern P slides along text S]
Algorithms
One-pattern • Brute force • Knuth-Morris-Pratt • Karp-Rabin • Shift-OR, Shift-AND • Boyer-Moore • Factor searches
Multi-pattern • Aho-Corasick • Commentz-Walter
Indexing • Trie (and suffix trie) • Suffix tree
Animations • http://www-igm.univ-mlv.fr/~lecroq/string/
• EXACT STRING MATCHING ALGORITHMS: Animation in Java
• Christian Charras - Thierry Lecroq, Laboratoire d'Informatique de Rouen, Université de Rouen, Faculté des Sciences et des Techniques, 76821 Mont-Saint-Aignan Cedex, FRANCE
• e-mails: {Christian.Charras, Thierry.Lecroq}@laposte.net
Brute force: BAB in text?
A B A C A B A B B A B B B A
B A B
Brute Force
[figure: pattern P aligned at text position i; comparing P[j] with S[i+j-1]]
Identify the first mismatch!
Question:
§ Problems of this method?
§ Ideas to improve the search?
Brute force
Algorithm Naive
Input: Text S[1..n] and pattern P[1..m]
Output: All positions i where P occurs in S
for( i=1 ; i <= n-m+1 ; i++ )
    for ( j=1 ; j <= m ; j++ )
        if( S[i+j-1] != P[j] ) break ;
    if ( j > m ) print i ;
attempt 1: gcatcgcagagagtatacagtacg
           GCAg....
attempt 2: gcatcgcagagagtatacagtacg
            g.......
attempt 3: gcatcgcagagagtatacagtacg
             g.......
attempt 4: gcatcgcagagagtatacagtacg
              g.......
attempt 5: gcatcgcagagagtatacagtacg
               g.......
attempt 6: gcatcgcagagagtatacagtacg
                GCAGAGAG
attempt 7: gcatcGCAGAGAGtatacagtacg
                 g.......
Brute force or NaiveSearch
1 function NaiveSearch(string s[1..n], string sub[1..m])
2     for i from 1 to n-m+1
3         for j from 1 to m
4             if s[i+j-1] ≠ sub[j]
5                 jump to next iteration of outer loop
6         return i
7     return not found
C code
int bf_2( char* pat, char* text, int n )   /* n = textlen */
{
    int m, i, j ;
    int count = 0 ;
    m = strlen(pat);
    for ( i=0 ; i + m <= n ; i++ ) {
        for( j=0 ; j < m && pat[j] == text[i+j] ; j++ ) ;
        if( j == m ) count++ ;
    }
    return(count);
}
C code
int bf_1( char* pat, char* text )
{
    int m ;
    int count = 0 ;
    char *tp ;
    m = strlen(pat);
    for( tp=text ; *tp ; tp++ ) {
        if( strncmp( pat, tp, m ) == 0 )
            count++ ;
    }
    return( count );
}
Main problem of Naive
• For the next possible location of P, the same positions of S are checked again
[figure: after a mismatch at S[i+j-1], the naive method restarts at i+1 and re-examines positions it has already seen]
Goals
• Make sure only a constant nr of comparisons/operations is made for each position in S
 – Move (only) from left to right in S
 – How?
 – After a test S[i] <> P[j], what do we know?
Knuth-Morris-Pratt
• Make sure that no comparisons are “wasted”
• After such a mismatch we already know exactly the values of the green area in S!
D. Knuth, J. Morris, V. Pratt: Fast Pattern Matching in Strings. SIAM Journal on Computing 6:323-350, 1977.
[figure: mismatch x ≠ y; the already-matched region of S is known]
Knuth-Morris-Pratt
• Make sure that no comparisons are “wasted”
• For every prefix of P: the longest suffix of that prefix that is also a prefix of the pattern
• Example: ABCABD
D. Knuth, J. Morris, V. Pratt: Fast Pattern Matching in Strings. SIAM Journal on Computing 6:323-350, 1977.
[figure: after a mismatch, shift P so that a prefix of P aligns with the matching suffix of the scanned text]
Automaton for ABCABD
[figure: states 1–7 with forward transitions A, B, C, A, B, D; state 1 loops on “not A”]
Pattern:  A B C A B D
Position: 1 2 3 4 5 6
Fail:     0 1 1 1 2 3 1
KMP matching
Input: Text S[1..n] and pattern P[1..m]
Output: First occurrence of P in S (if exists)
i=1; j=1; initfail(P)   // Prepare fail links
repeat
    if j==0 or S[i] == P[j] then i++ , j++   // advance in text and in pattern
    else j = fail[j]                         // use fail link
until j>m or i>n
if j>m then report match at i-m
Initialization of fail links
Algorithm: KMP_Initfail
Input: Pattern P[1..m]
Output: fail[] for pattern P
i=1, j=0, fail[1]=0
repeat
    if j==0 or P[i] == P[j] then i++ , j++ , fail[i] = j
    else j = fail[j]
until i>=m
Initialization of fail links – trace on P = ABCABD:
i=1, j=0, fail[1]=0
repeat
    if j==0 or P[i] == P[j] then i++ , j++ , fail[i] = j
    else j = fail[j]
until i>=m
Fail grows step by step: 0 → 0 1 → 0 1 1 1 → 0 1 1 1 2 → …
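The pseudocode above is 1-based; the same algorithm fits into a compact C sketch using the equivalent 0-based border-array convention (fail[j] = length of the longest proper border of P[0..j]), so the stored values differ by one from the slides' fail[] but the behaviour is identical. Function names are ours.

```c
#include <stdio.h>
#include <string.h>

/* Border array: fail[j] = length of the longest proper prefix of
   P[0..j] that is also a suffix of it (0-based KMP convention). */
static void initfail(const char *p, int m, int *fail) {
    fail[0] = 0;
    for (int j = 1, k = 0; j < m; j++) {
        while (k > 0 && p[j] != p[k]) k = fail[k - 1];
        if (p[j] == p[k]) k++;
        fail[j] = k;
    }
}

/* Return the 0-based start of the first occurrence of p in s, or -1. */
int kmp_search(const char *s, const char *p) {
    int n = strlen(s), m = strlen(p);
    if (m == 0) return 0;
    int fail[m];                      /* C99 VLA, fine for short patterns */
    initfail(p, m, fail);
    for (int i = 0, k = 0; i < n; i++) {
        while (k > 0 && s[i] != p[k]) k = fail[k - 1];  /* follow fail links */
        if (s[i] == p[k]) k++;                          /* advance in pattern */
        if (k == m) return i - m + 1;                   /* full match */
    }
    return -1;
}
```

Each text character is inspected once, and k can only decrease by as much as it was previously incremented – the amortised O(n) argument of the analysis slide.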
Time complexity of KMP matching?
Input: Text S[1..n] and pattern P[1..m]
Output: First occurrence of P in S (if exists)
i=1; j=1; initfail(P)   // Prepare fail links
repeat
    if j==0 or S[i] == P[j] then i++ , j++   // advance in text and in pattern
    else j = fail[j]                         // use fail link
until j>m or i>n
if j>m then report match at i-m
Analysis of time complexity
• At every cycle either i and j increase by 1,
• or j decreases (j=fail[j])
• i can increase n (or m) times
• Q: How often can j decrease?
 – A: not more than the nr of increases of i
• Amortised analysis: O(n), preprocessing O(m)
Karp-Rabin
• Compare in O(1) a hash of P and of S[i..i+m-1]
• Goal: O(n). Update f( h(T[i..i+m-1]) -> h(T[i+1..i+m]) ) in O(1)
R. Karp and M. Rabin: Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31 (1987), 249-260.
[figure: window T[i..i+m-1]; compare h(T[i..i+m-1]) with h(P), then slide to h(T[i+1..i+m])]
Hash
• “Remove” the effect of T[i] and “introduce” the effect of T[i+m] – in O(1)
• Use base-|Σ| arithmetic and treat characters as numbers
• In case of a hash match – check all m positions
• Hash collisions => worst case O(nm)
Let’s use numbers
• T = 57125677
• P = 125 (and for simplicity, h = 125)
• H(T[1]) = 571
• H(T[2]) = (571 - 5*100)*10 + 2 = 712
• In general: H(T[i+1]) = (H(T[i]) - ord(T[i])*10^(m-1))*10 + ord(T[i+m])
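The base-10 update above can be checked mechanically. A tiny helper (the function `roll` is ours, fixed to the slide's m = 3):

```c
/* Slide's worked example: 3-digit base-10 window over T = 57125677.
   Drop the leading digit, shift, append the next digit. */
static int roll(int h, int out_digit, int in_digit) {
    return (h - out_digit * 100) * 10 + in_digit;   /* 100 = 10^(m-1) */
}
```

roll(571, 5, 2) yields 712, and rolling once more, roll(712, 7, 5), yields 125 – equal to h(P), so the window T[3..5] = 125 is a match.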
hash
• c – size of alphabet
• HS_i = H( S[i..i+m-1] )
• H( S[i+1..i+m] ) = ( HS_i - ord(S[i])*c^(m-1) ) * c + ord( S[i+m] )
• Modulo arithmetic – to fit the value in a machine word!
• hash(w[0..m-1]) = ( w[0]*2^(m-1) + w[1]*2^(m-2) + ··· + w[m-1]*2^0 ) mod q
Karp-Rabin
Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
1. c=20 /* Size of the alphabet, say nr. of aminoacids */
2. q = 33554393 /* q is a prime */
3. cm = c^(m-1) mod q
4. hp = 0 ; hs = 0
5. for i = 1..m do hp = ( hp*c + ord(p[i]) ) mod q   // H(P)
6. for i = 1..m do hs = ( hs*c + ord(s[i]) ) mod q   // H(S[1..m])
7. if hp == hs and P == S[1..m] report match at position 1
8. for i = 2..n-m+1
9.     hs = ( (hs - ord(s[i-1])*cm) * c + ord(s[i+m-1]) ) mod q
10.    if hp == hs and P == S[i..i+m-1]
11.        report match at position i
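A self-contained C sketch of this pseudocode. It uses raw byte values as digits (base c = 256) instead of the slides' c = 20, keeps the slide's prime q, and verifies every hash hit character by character; the function name is ours.

```c
#include <string.h>

/* Rabin-Karp: count occurrences of p in s. */
int rk_count(const char *s, const char *p) {
    const unsigned long long q = 33554393ULL;   /* prime, as on the slide */
    const unsigned long long c = 256;           /* base = byte alphabet */
    int n = strlen(s), m = strlen(p);
    if (m == 0 || m > n) return 0;
    unsigned long long cm = 1, hp = 0, hs = 0;
    for (int i = 0; i < m - 1; i++) cm = cm * c % q;     /* c^(m-1) mod q */
    for (int i = 0; i < m; i++) {
        hp = (hp * c + (unsigned char)p[i]) % q;
        hs = (hs * c + (unsigned char)s[i]) % q;
    }
    int count = 0;
    for (int i = 0; ; i++) {
        if (hp == hs && memcmp(s + i, p, m) == 0) count++; /* verify hit */
        if (i + m >= n) break;
        unsigned long long drop = (unsigned char)s[i] * cm % q;
        hs = ((hs + q - drop) % q * c + (unsigned char)s[i + m]) % q;
    }
    return count;
}
```

The `(hs + q - drop) % q` keeps the subtraction non-negative in modular arithmetic; collisions only cost an extra memcmp, giving the O(nm) worst case but O(n) expected time.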
More ways to ensure O( n ) ?
Shift-AND / Shift-OR
• Ricardo Baeza-Yates, Gaston H. Gonnet: A new approach to text searching. Communications of the ACM, October 1992, Volume 35, Issue 10. [ACM Digital Library: http://doi.acm.org/10.1145/135239.135243] [DOI]
Bit-operations
• Maintain the set of all prefixes that have so far had a perfect match
• On the next character of the text, update all these positions to a new set
• One bit vector for every possible character
State: which prefixes match? (one bit per prefix, e.g. 0 1 0 0 1)
Move to next: track positions of prefix matches
0 1 0 1 0 1   state
1 0 1 0 1 1   shift left <<
1 0 0 0 1 1   mask on char T[i]
1 0 0 0 1 1   bitwise AND = new state
Vectors for every char in Σ
• P = aste
char:   a s t e b c d .. z
pos 1:  1 0 0 0 0 ...
pos 2:  0 1 0 0 0 ...
pos 3:  0 0 1 0 0 ...
pos 4:  0 0 0 1 0 ...
• T = lasteaed (P = aste; columns = state bits after each text character)
text:   l a s t e a (e d)
bit 1:  0 1 0 0 0 1
bit 2:  0 0 1 0 0 0
bit 3:  0 0 0 1 0 0
bit 4:  0 0 0 0 1 0   ← bit 4 (= |P|) set: “aste” ends at position 5
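The whole method is only a few lines of C when the state fits one machine word. A sketch assuming m ≤ 64 (names ours):

```c
#include <stdint.h>
#include <string.h>

/* Shift-AND: bit j of state is 1 iff P[0..j] matches the text
   ending at the current position. Assumes m <= 64. */
int shift_and_count(const char *s, const char *p) {
    int n = strlen(s), m = strlen(p);
    uint64_t mask[256] = {0};
    for (int j = 0; j < m; j++)
        mask[(unsigned char)p[j]] |= (uint64_t)1 << j;   /* char vectors */
    uint64_t state = 0, accept = (uint64_t)1 << (m - 1);
    int count = 0;
    for (int i = 0; i < n; i++) {
        state = ((state << 1) | 1) & mask[(unsigned char)s[i]];
        if (state & accept) count++;                     /* full match */
    }
    return count;
}
```

The `| 1` in the update injects a fresh potential match starting at the current character, exactly the role of the leftmost column in the tables above.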
Summary
Algorithm            Worst case   Ave. case            Preprocess
Brute force          O(mn)        O(n·(1+1/|Σ|+..))
Knuth-Morris-Pratt   O(n)         O(n)                 O(m)
Rabin-Karp           O(mn)        O(n)                 O(m)
Boyer-Moore          ?            O(n/m)
BM Horspool
Factor search
Shift-OR             O(n)         O(n)                 O( m|Σ| )
• R. Boyer, S. Moore: A fast string searching algorithm. CACM 20 (1977), 762-772 [PDF]
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf
Find occurrences in text
• Have we missed anything?
[figure: S, P]
Find occurrences in text
• What have we learned if we test for a potential match from the end?
[figure: S, P]
ABCDE
BBCDE
Find occurrences in text
[figure: S, P; comparing characters A and B]
Bad character heuristics: maximal shift on S[i]
[figure: mismatch at character X = S[i]; shift P so that the rightmost occurrence of X in P (first from the end) aligns under S[i]]
delta1( S[i] ) =
 – m, if the pattern does not contain S[i]
 – patlen - j, where j is the max j so that P[j] == S[i]
void bmInitocc() {
    int a; int j;    /* int, not char: a signed char index can misbehave */
    for( a=0; a<alphabetsize; a++ )
        occ[a] = -1;
    for( j=0; j<m; j++ ) {
        a = p[j];
        occ[a] = j;   /* rightmost occurrence wins */
    }
}
Good suffix heuristics
[figure: matched suffix µ of P against S; mismatch just before it]
delta2( j ) – minimal shift so that either
1. the matched region µ is covered again by another occurrence of µ in P, or
2. a suffix of the match is also a prefix of P
Boyer-Moore algorithm
Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
preprocess_BM()   // delta1 and delta2
i=m
while i <= n
    for( j=m; j>0 and P[j]==S[i-m+j]; j-- ) ;
    if j==0 report match at position i-m+1
    i = i + max( delta1[ S[i] ], delta2[ j ] )
• http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf
• Animation: http://www-igm.univ-mlv.fr/~lecroq/string/
Simplifications of BM
• There are many variants of Boyer-Moore, and many scientific papers.
• On average the time complexity is sublinear.
• The algorithm can be made faster and the code simpler at the same time.
• It is useful to use the last-character heuristic (Horspool (1980), Baeza-Yates (1989), Hume and Sunday (1991)).
Algorithm BMH (Boyer-Moore-Horspool)
• R. N. Horspool: Practical Fast Searching in Strings. Software – Practice and Experience, 10(6):501-506, 1980
Input: Text S[1..n] and pattern P[1..m]
Output: occurrences of P in S
1. for a in Σ do delta[a] = m
2. for j=1..m-1 do delta[P[j]] = m-j
3. i=m
4. while i <= n
5.     if S[i] == P[m]
6.         j = m-1
7.         while ( j>0 and P[j]==S[i-m+j] ) j = j-1 ;
8.         if j==0 report match at i-m+1
9.     i = i + delta[ S[i] ]
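A direct C translation of the BMH pseudocode above, with 0-based indices and counting all matches (function name ours):

```c
#include <string.h>

/* Boyer-Moore-Horspool: count occurrences of p in s. */
int bmh_count(const char *s, const char *p) {
    int n = strlen(s), m = strlen(p);
    if (m == 0 || m > n) return 0;
    int delta[256];
    for (int a = 0; a < 256; a++) delta[a] = m;
    for (int j = 0; j < m - 1; j++)
        delta[(unsigned char)p[j]] = m - 1 - j;   /* shift on last window char */
    int count = 0;
    int i = m - 1;                                /* last char of the window */
    while (i < n) {
        int j = m - 1, k = i;
        while (j >= 0 && p[j] == s[k]) { j--; k--; }   /* compare right-to-left */
        if (j < 0) count++;
        i += delta[(unsigned char)s[i]];          /* shift by the last char */
    }
    return count;
}
```

Note that delta is computed over P[1..m-1] only, so the shift on a full match is still positive.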
String Matching: Horspool algorithm
Pattern against text, from right to left: suffix search
• Which is the next position of the window?
• How is the comparison made?
The shift depends on where the last letter of the text window, say ‘a’, appears in the pattern.
Hence a preprocessing step is needed that determines the length of the shift.
[figure: possible alignments of ‘a’ within the pattern]
Algorithm Boyer-Moore-Horspool-Hume-Sunday (BMHHS)
• Use delta in a tight loop
• On a match (delta==0), check and then apply the original delta d
Input: Text S[1..n] and pattern P[1..m]
Output: occurrences of P in S
1. for a in Σ do delta[a] = m
2. for j=1..m-1 do delta[P[j]] = m-j
3. d = delta[ P[m] ]    // memorize d on P[m]
4. delta[ P[m] ] = 0    // ensure delta on match of last char is 0
5. for ( i=m ; i<=n ; i=i+d )
6.     repeat   // skip loop
7.         t = delta[ S[i] ] ; i = i + t
8.     until t==0
9.     for( j=m-1 ; j>0 and P[j]==S[i-m+j] ; j=j-1 ) ;
10.    if j==0 report match at i-m+1
BMHHS requires that the text is padded by P: S[n+1..n+m] = P (so that the skip loop terminates – there is always at least one occurrence!).
• Daniel M. Sunday: A very fast substring search algorithm [PDF]. Communications of the ACM, August 1990, Volume 33, Issue 8
• Loop unrolling:
• Avoid too many loop iterations (each iteration requires tests) by repeating code within the loop.
• Line 7 in the previous algorithm can be replaced by:
7. i += delta[ S[i] ]; i += delta[ S[i] ]; i += (t = delta[ S[i] ]) ;
Forward-Fast-Search: Another Fast Variant of the Boyer-Moore String Matching Algorithm
• The Prague Stringology Conference '03
• Domenico Cantone and Simone Faro
• Abstract: We present a variation of the Fast-Search string matching algorithm, a recent member of the large family of Boyer-Moore-like algorithms, and we compare it with some of the most effective string matching algorithms, such as Horspool, Quick Search, Tuned Boyer-Moore, Reverse Factor, Berry-Ravindran, and Fast-Search itself. All algorithms are compared in terms of run-time efficiency, number of text character inspections, and number of character comparisons. It turns out that our new proposed variant, though not linear, achieves very good results especially in the case of very short patterns or small alphabets.
• http://cs.felk.cvut.cz/psc/event/2003/p2.html
• PS.gz (local copy)
Factor based approach
• Optimal average-case algorithms
 – Assuming independent characters with equal probabilities
• Factor – a substring of a pattern
 – Any substring – (how many?)
Factor searches
Do not compare single characters, but find the longest match to any subregion (factor) of the pattern.
[figure: window over S; scanned part u matched backwards against the factors of P]
Examples
• Backward DAWG Matching (BDM) – Crochemore et al. 1994
• Backward Nondeterministic DAWG Matching (BNDM) – Navarro, Raffinot 2000
• Backward Oracle Matching (BOM) – Allauzen, Crochemore, Raffinot 2001
Backward DAWG Matching (BDM)
Do not compare single characters, but find the longest match to any subregion of the pattern.
A suffix automaton recognises all factors (and suffixes) in O(n).
BNDM – simulate the automaton using bit-parallelism:
bits show where the factors have occurred so far.
BNDM matches an NFA on the suffixes of ‘announce’.
[figure: nondeterministic automaton for the suffixes of ‘announce’]
Deterministic version of the same: the Backward Factor Oracle.
BNDM – Backward Non-Deterministic DAWG Matching
BOM – Backward Oracle Matching
String matching of one pattern
CTACTACTACGTCTATACTGATCGTAGC
TACTACGGTATGACTAA
• Factor search
• Prefix search
• Suffix search
[figure: the three scanning strategies (1, 2, 3) within the window]
Multiple patterns
[figure: text S and a set of patterns {P}]
Why?
• Multiple patterns
• Highlight multiple different search words on a page
• Virus detection – filter for virus signatures
• Spam filters
• A scanner in a compiler needs to search for multiple keywords
• Filter out stop words or disallowed words
• Intrusion detection software
• Next-generation sequencing produces huge amounts (many millions) of short reads (20-100 bp) that need to be mapped to a genome!
• …
Algorithms
• Aho-Corasick (search for multiple words)
 – Generalization of Knuth-Morris-Pratt
• Commentz-Walter
 – Generalization of Boyer-Moore & AC
• Wu and Manber
 – Improvement over C-W
• Additional methods, tricks and techniques
Aho-Corasick (AC)
• Alfred V. Aho and Margaret J. Corasick (Bell Labs, Murray Hill, NJ): Efficient string matching. An aid to bibliographic search. Communications of the ACM, Volume 18, Issue 6, p. 333-340 (June 1975)
• ACM:DOI PDF
• ABSTRACT This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.
• Generalization of KMP for many patterns
• Text S as before.
• Set of patterns P = { P1, ..., Pk }
• Total length |P| = m = Σ_{i=1..k} m_i
• Problem: find all occurrences of any of the Pi ∈ P in S
Idea
1. Create an automaton from all patterns
2. Match the automaton
• Use the PATRICIA trie for creating the main structure of the automaton
PATRICIA trie
• D. R. Morrison, “PATRICIA: Practical Algorithm To Retrieve Information Coded In Alphanumeric”, Journal of the ACM 15 (1968) 514-534.
• Abstract: PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, which is economical of index space and of reindexing time. It does not require rearrangement of text or index as new material is added. It requires a minimum restriction of format of text and of keys; it is extremely flexible in the variety of keys it will respond to. It retrieves information in response to keys furnished by the user with a quantity of computation which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several variations as FORTRAN programs for the CDC-3600, utilizing disk file storage of text. It has been applied to several large information-retrieval problems and will be applied to others.
• ACM:DOI PDF
• Word trie – a good data structure to represent a set of words (e.g. a dictionary).
• trie (data structure)
• Definition: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
• See also digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree.
• Note: The name comes from retrieval and is pronounced “tree”.
• To test for a word p, only O(|p|) time is used, no matter how many words are in the dictionary...
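A minimal word-trie sketch in C over the alphabet 'a'..'z'; the types and names are ours, for illustration only:

```c
#include <stdlib.h>

/* One node per common prefix; a flag marks complete words. */
typedef struct Trie {
    struct Trie *child[26];
    int is_word;
} Trie;

static Trie *trie_new(void) { return calloc(1, sizeof(Trie)); }

static void trie_insert(Trie *t, const char *w) {
    for (; *w; w++) {
        int c = *w - 'a';
        if (!t->child[c]) t->child[c] = trie_new();
        t = t->child[c];
    }
    t->is_word = 1;
}

/* O(|p|) lookup, independent of the number of stored words. */
static int trie_contains(const Trie *t, const char *p) {
    for (; *p; p++) {
        t = t->child[*p - 'a'];
        if (!t) return 0;
    }
    return t->is_word;
}
```

With the example set {he, she, his, hers}: "he" is found, "hi" is only a prefix (no word flag), and "sheila" falls off the trie after "she".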
Trie for P={he, she, his, hers}
[figure, built up incrementally; state 0 is the root:
 0 –h→ 1 –e→ 2  (he)
 0 –s→ 3 –h→ 4 –e→ 5  (she)
 1 –i→ 6 –s→ 7  (his)
 2 –r→ 8 –s→ 9  (hers)]
How do we search for words like he, sheila, hi? Do these occur in the trie?
Aho-Corasick
1. Create an automaton M_P for a set of strings P.
2. Finite state machine: read a character from the text, and change the state of the automaton based on the state transitions...
3. Main links: goto[j,c] – read a character c from the text and go from state j to state goto[j,c].
4. If there is no goto[j,c] link on character c from state j, use fail[j].
5. Report the output: report all words that have been found in state j.
AC automaton (vs KMP)
[figure: the trie above (states 0–9) with fail links added]
goto[1,i] = 6 ; fail[7] = 3, fail[8] = 0, fail[5] = 2
Output table:
state  output[j]
2      he
5      she, he
7      his
9      hers
(the root loops back to itself on every character NOT in { h, s })
AC - matching
Input: Text S[1..n] and an AC automaton M for pattern set P
Output: Occurrences of patterns from P in S (last positions)
1. state = 0
2. for i = 1..n do
3.     while (goto[state,S[i]] == ∅) and (fail[state] != state) do
4.         state = fail[state]
5.     state = goto[state,S[i]]
6.     if ( output[state] not empty )
7.         then report matches output[state] at position i
Algorithm Aho-Corasick preprocessing I (TRIE)
Input: P = { P1, ... , Pk }
Output: goto[] and partial output[]
Assume: output(s) is empty when a state s is created; goto[s,a] is not defined.
procedure enter(a1, ... , am)   /* Pi = a1 ... am */
begin
1. s=0; j=1;
2. while goto[s,aj] ≠ ∅ do    // follow existing path
3.     s = goto[s,aj];
4.     j = j+1;
5. for p=j to m do            // add new path (states)
6.     news = news+1;
7.     goto[s,ap] = news;
8.     s = news;
9. output[s] = a1, ... , am
end
begin
10. news = 0
11. for i=1 to k do enter( Pi )
12. for a ∈ Σ do
13.     if goto[0,a] = ∅ then goto[0,a] = 0;
end
Preprocessing II for AC (FAIL)
queue = ∅
for a ∈ Σ do
    if goto[0,a] ≠ 0 then
        enqueue( queue, goto[0,a] )
        fail[ goto[0,a] ] = 0
while queue ≠ ∅
    r = take( queue )
    for a ∈ Σ do
        if goto[r,a] ≠ ∅ then
            s = goto[ r, a ]
            enqueue( queue, s )     // breadth-first search
            state = fail[r]
            while goto[state,a] = ∅ do state = fail[state]
            fail[s] = goto[state,a]
            output[s] = output[s] + output[ fail[s] ]
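Both preprocessing passes and the matcher fit into a short array-based C sketch. This version also precomputes goto[] completely during the BFS (the "Advanced AC" idea mentioned later), so the matcher never follows fail links at search time. The sizes, names, lowercase-only alphabet, and the bitmask output are our own simplifications:

```c
#include <string.h>

#define MAXS 1000     /* max number of trie states (assumption) */
#define SIGMA 26      /* alphabet 'a'..'z' (assumption) */
static int go[MAXS][SIGMA], fail_[MAXS], out[MAXS], nstates;

static void ac_init(void) {
    nstates = 1;                     /* state 0 is the root */
    memset(go, 0, sizeof go);        /* 0 also doubles as "no edge" */
    memset(out, 0, sizeof out);
}
static void ac_add(const char *p, int id) {        /* preprocessing I */
    int s = 0;
    for (; *p; p++) {
        int c = *p - 'a';
        if (!go[s][c]) go[s][c] = nstates++;
        s = go[s][c];
    }
    out[s] |= 1 << id;               /* bitmask of patterns ending here */
}
static void ac_build(void) {                       /* preprocessing II */
    int queue[MAXS], head = 0, tail = 0;
    for (int c = 0; c < SIGMA; c++)
        if (go[0][c]) { fail_[go[0][c]] = 0; queue[tail++] = go[0][c]; }
    while (head < tail) {            /* breadth-first over the trie */
        int r = queue[head++];
        for (int c = 0; c < SIGMA; c++) {
            int s = go[r][c];
            if (!s) { go[r][c] = go[fail_[r]][c]; continue; } /* full goto */
            fail_[s] = go[fail_[r]][c];
            out[s] |= out[fail_[s]]; /* inherit outputs via fail link */
            queue[tail++] = s;
        }
    }
}
/* Returns the bitmask of all pattern ids that occur in text. */
static int ac_search(const char *text) {
    int s = 0, seen = 0;
    for (; *text; text++) {
        s = go[s][*text - 'a'];
        seen |= out[s];
    }
    return seen;
}
```

Because go[] is completed during the BFS, matching is exactly one table lookup per text character – the O(n) bound with the |M_P|·|Σ| space trade-off discussed below.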
Correctness
• Let string t “point” from the initial state to state j.
• Must show that fail[j] points to the longest suffix of t that is also a prefix of some word in P.
• Look at the article...
AC matching time complexity
• Theorem: For matching M_P against text S, |S|=n, fewer than 2n transitions within M are made.
• Proof: Compare to KMP.
• There are at most n goto steps.
• There cannot be more than n fail steps.
• In total – there can be fewer than 2n transitions in M.
Individual node (goto)
• Full table
• List
• Binary search tree(?)
• Some other index?
AC thoughts
• Scales to many strings simultaneously.
• For very many patterns the search time (of grep) improves(??)
 – See the Wu-Manber article
• When k grows, more fail[] transitions are made (why?)
• But always fewer than n.
• If all goto[j,a] are indexed in an array, then the size is |M_P|·|Σ|, and the running time of AC is O(n).
• When k and c are big, one can use lists or trees for storing the transition functions.
• Then O( n log(min(k,c)) ).
Advanced AC
• Precalculate the next state transition correctly for every possible character of the alphabet
• Can be good for short patterns
Problems of AC?
• Needs to be rebuilt on adding / removing patterns
• Details of branching in each node(?)
Commentz-Walter
• Generalization of Boyer-Moore for multiple-sequence search
• Beate Commentz-Walter: A String Matching Algorithm Fast on the Average. Proceedings of the 6th Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, Vol. 71, 1979, pp. 118-132, Springer-Verlag
• http://www.hs-albsig.de/win/personen/professoren.php?RID=36
• You can download here my algorithm StringMatchingFastOnTheAverage (PDF, ~17.2 MB) or here StringMatchingFastOnTheAverage (extended abstract) (PDF, ~3 MB)
C-W description
• Aho and Corasick [AC75] presented a linear-time algorithm for this problem, based on an automata approach. This algorithm serves as the basis for the UNIX tool fgrep. A linear-time algorithm is optimal in the worst case, but as the regular string-searching algorithm by Boyer and Moore [BM77] demonstrated, it is possible to actually skip a large portion of the text while searching, leading to faster-than-linear algorithms in the average case.
Commentz-Walter [CW79]
• Commentz-Walter [CW79] presented an algorithm for the multi-pattern matching problem that combines the Boyer-Moore technique with the Aho-Corasick algorithm. The Commentz-Walter algorithm is substantially faster than the Aho-Corasick algorithm in practice. Hume [Hu91] designed a tool called gre based on this algorithm, and version 2.0 of fgrep by the GNU project [Ha93] is using it.
• Baeza-Yates [Ba89] also gave an algorithm that combines the Boyer-Moore-Horspool algorithm [Ho80] (which is a slight variation of the classical Boyer-Moore algorithm) with the Aho-Corasick algorithm.
Idea of C-W
• Build a backward trie of all keywords
• Match from the end until a mismatch...
• Determine the shift based on a combination of heuristics
Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG
1. Build the trie of the inverted patterns
[figure: trie of the reversed patterns GTATGTA, GTAT, TAATA, GTGTA]
2. lmin = 4
3. Table of shifts: A 1, C 4 (lmin), G 2, T 1
4. Start the search
The text: ACATGCTATGTGACA…
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
The search then scans the text ACATGCTATGTGACA… window by window, shifting by the table values.
Short shifts!
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
What are the possible limitations of C-W?
• Many patterns, small alphabet – minimal skips
• What can be done differently?
Wu-Manber
• Wu S., and U. Manber, “A Fast Algorithm for Multi-Pattern Searching,” Technical Report TR-94-17, Department of Computer Science, University of Arizona (May 1993).
• Citeseer: http://citeseer.ist.psu.edu/wu94fast.html [Postscript]
• We present a different approach that also uses the ideas of Boyer and Moore. Our algorithm is quite simple, and the main engine of it is given later in the paper. An earlier version of this algorithm was part of the second version of agrep [WM92a, WM92b], although the algorithm has not been discussed in [WM92b] and only briefly in [WM92a]. The current version is used in glimpse [MW94]. The design of the algorithm concentrates on typical searches rather than on worst-case behavior. This allows us to make some engineering decisions that we believe are crucial to making the algorithm significantly faster than other algorithms in practice.
Key idea
• The main problem with Boyer-Moore and many patterns: the more patterns there are, the shorter the possible shifts become...
• Wu and Manber: check several characters simultaneously, i.e. increase the alphabet.
• Instead of looking at text characters one by one, consider them in blocks of size B.
• In theory B = log_c(2M); in practice, use either B = 2 or B = 3.
• The SHIFT table plays the same role as in the regular Boyer-Moore algorithm, except that it determines the shift based on the last B characters rather than just one character.
Horspool to Wu-Manber: how can we increase the length of the shifts?
With a shift table of l-mers for the patterns ATGTATG, TATG, ATAAT, ATGTG:
1 symbol:  A 1, C 4 (lmin), G 2, T 1
2 symbols: AA 1, AT 1, GT 1, TA 2, TG 2; default for all other pairs (AC, AG, CA, CC, CG, …): 3 (lmin-B+1)
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
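The 2-symbol SHIFT table can be computed as sketched below. Conventions differ slightly between presentations; here shift = lmin − (end position of the block within the first lmin pattern characters), so a block that ends exactly at position lmin gets shift 0 and triggers a verification step. The numbers therefore need not coincide exactly with the table above. Constants and names are ours:

```c
#include <string.h>

#define LMIN 4   /* shortest pattern length for this example */
#define B 2      /* block size */

/* SHIFT value for one B-character block, over the first LMIN
   characters of each pattern; default = LMIN - B + 1. */
static int wm_shift(const char **pats, int k, const char *block) {
    int shift = LMIN - B + 1;
    for (int i = 0; i < k; i++)
        for (int j = 0; j + B <= LMIN; j++)
            if (strncmp(pats[i] + j, block, B) == 0) {
                int s = LMIN - (j + B);   /* distance of block end to lmin */
                if (s < shift) shift = s;
            }
    return shift;
}
```

With the example patterns, "GT" ends a length-lmin prefix (ATGT), so its shift is 0 (candidate match, verify); "AT" allows a shift of 1; an absent block like "CC" gets the default 3.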
Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
[figure: trie of the reversed patterns]
into the text: ACATGCTATGTGACATAATA
2-symbol shifts: AA 1, AT 1, GT 1, TA 2, TG 2
Experimental block length: B ≈ log_|Σ|(2·lmin·r)
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
Backward Oracle
• Set Backward Oracle Matching: SBDM, SBOM
• Pages 68-72
String matching of many patterns
[figure: experimental comparison of Wu-Manber vs SBOM for 5, 10, 100 and 1000 patterns; x-axis: lmin = 5…45, y-axis: |Σ| = 2, 4, 8]
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
Factor Oracle: safe shift
Factor Oracle: shift to match a prefix of P2?
[figure: factor oracle]
Construction of the factor oracle
• Allauzen, C., Crochemore, M., and Raffinot, M. 1999. Factor Oracle: A New Structure for Pattern Matching. In Proceedings of the 26th Conference on Current Trends in Theory and Practice of Informatics (November 27 - December 04, 1999). J. Pavelka, G. Tel, and M. Bartosek, Eds. Lecture Notes in Computer Science, vol. 1725. Springer-Verlag, London, 295-310.
• http://portal.acm.org/citation.cfm?id=647009.712672
• http://www-igm.univ-mlv.fr/~allauzen/work/sofsem.ps
So far
• Generalised KMP -> Aho-Corasick
• Generalised Horspool -> Commentz-Walter, Wu-Manber
• BDM, BOM -> Set Backward Oracle Matching…
• Other generalisations?
Multiple Shift-AND
• P={P1, P2, P3, P4}. Generalize Shift-AND.
• Bits = the bit vectors of P1, P2, P3, P4 concatenated into one word
• Start = 1 at the first bit position of each pattern (1 1 1 1)
• Match = 1 at the last bit position of each pattern (1 1 1 1)
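A C sketch of this generalization, packing all patterns into one 64-bit word (it assumes the total pattern length is ≤ 64; the function name is ours):

```c
#include <stdint.h>
#include <string.h>

/* Multi-pattern Shift-AND: Start has a 1 at each pattern's first bit,
   Match at each pattern's last bit. Returns 1 if any pattern occurs. */
int multi_shift_and(const char *s, const char **pats, int k) {
    uint64_t mask[256] = {0}, start = 0, match = 0, state = 0;
    int bit = 0, found = 0;
    for (int i = 0; i < k; i++) {            /* concatenate bit vectors */
        int m = strlen(pats[i]);
        start |= (uint64_t)1 << bit;
        match |= (uint64_t)1 << (bit + m - 1);
        for (int j = 0; j < m; j++)
            mask[(unsigned char)pats[i][j]] |= (uint64_t)1 << (bit + j);
        bit += m;
    }
    for (; *s; s++) {
        state = ((state << 1) | start) & mask[(unsigned char)*s];
        if (state & match) found = 1;        /* some pattern ends here */
    }
    return found;
}
```

A carry from one pattern's last bit into the next pattern's first bit is harmless: that first bit is OR-ed in via Start at every step anyway.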
Text Algorithms (6EAP)
Full text indexing
Jaak Vilo 2010 fall
Problem
• Given P and S – find all exact or approximate occurrences of P in S
• You are allowed to preprocess S (and P, of course)
• Goal: to speed up the searches
E.g. dictionary problem
• Does P belong to a dictionary D={d1,…,dn}?
 – Build a binary search tree of D
 – B-tree of D
 – Hashing
 – Sorting + binary search
• Build a keyword trie: search in O(|P|)
 – Assuming the alphabet has up to a constant size c
 – See the Aho-Corasick algorithm, trie construction
Sorted array and binary search
1..13: global, happy, he, head, header, hers, his, index, info, informal, search, show, stop
O( |P| log n )
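The O(|P| log n) lookup is just a binary search over the sorted word array; in C it is a thin wrapper around bsearch (names ours):

```c
#include <stdlib.h>
#include <string.h>

/* Comparator over array elements of type const char *. */
static int cmp(const void *a, const void *b) {
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* Dictionary membership via binary search: O(|P| log n) char comparisons. */
int dict_contains(const char **dict, size_t n, const char *word) {
    return bsearch(&word, dict, n, sizeof *dict, cmp) != NULL;
}
```

Each of the log n probes compares up to |P| characters, which the trie below avoids.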
Trie for D={ he, hers, his, she }
[figure: trie with states 0–9, as in the Aho-Corasick example]
O( |P| )
S != set of words
• S of length n
• How to index?
• Index from every position of the text
• The prefix of every possible suffix is important
Trie(babaab)
[figure: suffix trie of babaab; its suffixes are babaab, abaab, baab, aab, ab, b]
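The idea "index every suffix" can be prototyped without any tree: sort the suffix start positions and binary-search the pattern as a prefix of some suffix. This is the naive O(n² log n)-to-build suffix array; the suffix trees and suffix arrays below achieve the same queries with far better construction bounds. A sketch (names ours):

```c
#include <stdlib.h>
#include <string.h>

static const char *text_g;   /* text shared with the qsort comparator */
static int suf_cmp(const void *a, const void *b) {
    return strcmp(text_g + *(const int *)a, text_g + *(const int *)b);
}

/* Build a suffix array naively, then binary-search pat as a
   prefix of a suffix. Returns some occurrence position, or -1. */
int index_and_find(const char *text, const char *pat) {
    int n = strlen(text), m = strlen(pat);
    int *sa = malloc(n * sizeof *sa);
    for (int i = 0; i < n; i++) sa[i] = i;
    text_g = text;
    qsort(sa, n, sizeof *sa, suf_cmp);       /* sort all suffixes */
    int lo = 0, hi = n - 1, found = -1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = strncmp(text + sa[mid], pat, m);
        if (c == 0) { found = sa[mid]; break; }
        if (c < 0) lo = mid + 1; else hi = mid - 1;
    }
    free(sa);
    return found;
}
```

For T = babaab the sorted suffix order is aab, ab, abaab, b, baab, babaab, and searching "aab" lands on position 3.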
Suffix tree
• Definition: A compact representation of a trie corresponding to the suffixes of a given string, where all nodes with one child are merged with their parents.
• Definition (suffix tree). A suffix tree T for a string S (with n = |S|) is a rooted, labeled tree with a leaf for each non-empty suffix of S. Furthermore, a suffix tree satisfies the following properties:
• Each internal node, other than the root, has at least two children;
• Each edge leaving a particular node is labeled with a non-empty substring of S, of which the first symbol is unique among all first symbols of the edge labels of the edges leaving this particular node;
• For any leaf in the tree, the concatenation of the edge labels on the path from the root to this leaf exactly spells out a non-empty suffix of S.
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr; ISBN: 0521585198.
Literature on suffix trees
• http://en.wikipedia.org/wiki/Suffix_tree
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr; ISBN: 0521585198. (pages 89-208)
• E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-60, 1995. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751
• Ching-Fung Cheung, Jeffrey Xu Yu, Hongjun Lu. “Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, pp. 90-105, January 2005. http://www2.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.3
• CPM articles archive: http://www.cs.ucr.edu/~stelo/cpm/
• Mark Nelson. Fast String Searching With Suffix Trees. Dr. Dobb’s Journal, August 1996. http://www.dogma.net/markn/articles/suffixt/suffixt.htm
• STXXL: Standard Template Library for Extra Large Data Sets. http://stxxl.sourceforge.net/
• ANSI C implementation of a Suffix Tree: http://yeda.cs.technion.ac.il/~yona/suffix_tree/
These are old links – check for newer…
http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english
The suffix tree Tree(T) of T
• the data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T
• linear size: |Tree(T)| = O(|T|)
• can be constructed in linear time O(|T|)
• has myriad virtues (A. Apostolico)
• is well-known: 366 000 Google hits
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
Suffix trie and suffix tree
Trie(abaab)  vs  Tree(abaab)
[figure: the suffix trie of abaab and its compacted suffix tree; suffixes: abaab, baab, aab, ab, b]
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
Suffix tree and suffix array techniques for pattern analysis in strings
Esko Ukkonen, Univ Helsinki
Erice School, 30 Oct 2005
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
Algorithms for combinatorial string matching?
• deep beauty? +-
• shallow beauty? +
• applications? ++
• intensive algorithmic miniatures
• sources of new problems: text processing, DNA, music, …
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
High-‐throughput genome-‐scale sequence analysis and mapping
using compressed data structures Veli Mäkinen
Department of Computer Science University of Helsinki
146
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
147
Analysis of a string of symbols
• T = h a t t i v a t t i   'text'
• P = a t t   'pattern'
• Find the occurrences of P in T: h a t t i v a t t i
• Pattern synthesis: #(t) = 4, #(att) = 2, #(t****t) = 2
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 148
Solution: backtracking with suffix tree
...ACACATTATCACAGGCATCGGCATTAGCGATCGAGTCG.....
149
Pattern finding & synthesis problems
• T = t1t2 … tn, P = p1p2 … pm, strings of symbols in a finite alphabet
• Indexing problem: Preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast – static text, any given pattern P
• Pattern synthesis problem: Learn from T new patterns that occur surprisingly often
• What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, … 150
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
151
The suffix tree Tree(T) of T
• the data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T
• linear size: |Tree(T)| = O(|T|)
• can be constructed in linear time O(|T|)
• has myriad virtues (A. Apostolico)
• is well-known: 366 000 Google hits
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
152
Suffix trie and suffix tree
[Figure: the suffix trie Trie(abaab) and the suffix tree Tree(abaab); suffixes: abaab, baab, aab, ab, b]
Trie(abaab) Tree(abaab)
153
Trie(T) can be large
• |Trie(T)| = O(|T|^2) • bad example: T = a^n b^n
• Trie(T) can be seen as a DFA: language accepted = the suffixes of T
• minimize the DFA => directed acyclic word graph ('DAWG')
154
Tree(T) is of linear size
• only the internal branching nodes and the leaves are represented explicitly
• edges labeled by substrings of T • v = node(α) if the path from root to v spells α • one-‐to-‐one correspondence of leaves and suffixes
• |T| leaves, hence < |T| internal nodes • |Tree(T)| = O(|T| + size(edge labels))
155
Tree(hattivatti)
[Figure: suffix tree of hattivatti; root-to-leaf paths spell the suffixes hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i]
156
Tree(hattivatti)
[Figure: the same suffix tree of hattivatti, drawn again]
substring labels of edges represented as pairs of pointers
157
Tree(hattivatti)
[Figure: suffix tree of hattivatti with edge labels stored as pointer pairs into the text, e.g. (6,10) = vatti, (2,5) = atti, (4,5) = ti, (3,3) = t; leaves numbered 1-10 by suffix start position]
158
Tree(T) is a full text index
[Figure: pattern P traced from the root of Tree(T); the leaves below it give the occurrences, e.g. at locations 8, 31, …]
P occurs in T ⇔ P is a prefix of some suffix of T ⇔ a path for P exists in Tree(T)
All occurrences of P in time O(|P| + #occ)
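As an illustration of this O(|P| + #occ) query, a minimal Python sketch using a plain suffix trie (the function names are ad hoc; the trie takes quadratic space, whereas a real suffix tree keeps edge labels as pointer pairs to stay linear):

```python
# Sketch: full-text index as a suffix trie (O(|T|^2) space, for illustration only).

def build_suffix_trie(text):
    root = {}
    for i in range(len(text)):              # insert every suffix text[i:]
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
            node.setdefault('$occ', []).append(i + 1)   # 1-based start positions
    return root

def occurrences(trie, pattern):
    node = trie
    for ch in pattern:                      # walk down: O(|P|)
        if ch not in node:
            return []
        node = node[ch]
    return sorted(node['$occ'])             # suffix starts below this node

trie = build_suffix_trie("hattivatti")
print(occurrences(trie, "att"))             # -> [2, 7]
```

The query itself is exactly the slide's idea: walk the pattern down from the root, then report every suffix start stored below the node reached.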
159
Find att from Tree(hattivatti)
[Figure: the path a-t-t traced from the root; the leaves below it, numbered 2 and 7, are the occurrences of att]
160
Linear time construction of Tree(T)
hattivatti
attivatti
ttivatti
tivatti
ivatti
vatti
atti
tti
ti
i
Weiner (1973),
’algorithm of the year’
McCreight (1976)
’on-line’ algorithm (Ukkonen 1992)
161
On-line construction of Trie(T)
• T = t1t2 … tn$
• Pi = t1t2 … ti, the i-th prefix of T
• on-line idea: update Trie(Pi) to Trie(Pi+1)
• => very simple construction
162
Trie(abaab)
[Figure: construction stages Trie(a), Trie(ab), Trie(aba); a chain of suffix links connects the end points of the current suffixes]
163
Trie(abaab)
[Figure: stage Trie(abaa)]
164
Trie(abaab)
[Figure: stage Trie(abaa); add next symbol = b]
165
Trie(abaab)
[Figure: stage Trie(abaa); adding b stops where a b-arc already exists]
166
Trie(abaab)
[Figure: the completed Trie(abaab)]
167
What happens in Trie(Pi) => Trie(Pi+1) ?
[Figure: Trie(Pi) before and after adding symbol ai: new nodes and new suffix links are created along the suffix-link chain; from the first node that already has an ai-arc onwards nothing changes => stop updating there]
168
What happens in Trie(Pi) => Trie(Pi+1) ?
• time: O(size of Trie(T))
• suffix links:
slink(node(aα)) = node(α)
169
On-‐line procedure for suffix trie
1. Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix links slink(v) := root and slink(root) := root
2. for i := 2 to n do begin
3.   vi-1 := leaf of Trie(t1…ti-1) for string t1…ti-1 (i.e., the deepest leaf)
4.   v := vi-1; v´ := 0
5.   while node v has no outgoing arc for ti do begin
6.     Create a new node v´´ and an arc son(v, ti) = v´´
7.     if v´ ≠ 0 then slink(v´) := v´´
8.     v := slink(v); v´ := v´´ end
9.   for the node v´´ such that v´´ = son(v, ti) do: if v´´ = v´ then slink(v´) := root else slink(v´) := v´´
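A minimal Python transcription of the procedure above (class and variable names are ad hoc; the pseudocode's v, v´, v´´ roles become the variables v, prev, w, and the deepest leaf is carried over between rounds):

```python
# Sketch of the on-line suffix trie construction with suffix links.

class Node:
    def __init__(self):
        self.son = {}       # symbol -> child node
        self.slink = None   # suffix link

def online_suffix_trie(t):
    root = Node()
    root.slink = root
    deepest = root                      # leaf for the whole current prefix
    for c in t:
        v, prev, first = deepest, None, None
        while c not in v.son:           # follow the suffix-link chain
            w = Node()                  # create a new node and a c-arc
            v.son[c] = w
            if first is None:
                first = w               # becomes the new deepest leaf
            if prev is not None:
                prev.slink = w          # link the previously created node
            prev, v = w, v.slink
        w = v.son[c]                    # first node that already had a c-arc
        if prev is not None:
            prev.slink = root if w is prev else w
        deepest = first
    return root

def node_count(v):
    return 1 + sum(node_count(c) for c in v.son.values())

trie = online_suffix_trie("abaab")
print(node_count(trie))   # -> 12 (root + the 11 distinct substrings of abaab)
```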
170
Suffix trees on-‐line
• 'compacted version' of the on-line trie construction: simulate the construction on the linear-size tree instead of the trie => time O(|T|)
• all trie nodes are conceptually still needed => implicit and real nodes
171
Implicit and real nodes
• Pair (v,α) is an implicit node in Tree(T) if v is a node of Tree and α is a (proper) prefix of the label of some arc from v. If α is the empty string then (v, α) is a ’real’ node (= v).
• Let v = node(α´) in Tree(T). Then implicit node (v, α) represents node(α´α) of Trie(T)
172
Implicit node
[Figure: implicit node (v, α): v = node(α´) is a real node, and α is a prefix of the label of an arc leaving v]
173
Suffix links and open arcs
[Figure: suffix link from v = node(aα) to node(α); leaf arcs are kept "open": label [i,*] instead of [i,j], where j is the scanned position of T]
174
Big picture
[Figure: big picture of the construction: the suffix link path traversed costs total work O(n); the new arcs and nodes created cost total work O(size(Tree(T)))]
02.01.15
30
175
On-‐line procedure for suffix tree
Input: string T = t1t2 … tn$
Output: Tree(T)
Notation: son(v,α) = w iff there is an arc from v to w with label α
son(v,ε) = v
Function Canonize(v, α):
  while son(v, α´) ≠ 0, where α = α´α´´ and |α´| > 0, do
    v := son(v, α´); α := α´´
  return (v, α)
176
Suffix-‐tree on-‐line: main procedure
Create Tree(t1); slink(root) := root
(v, α) := (root, ε)   /* (v, α) is the start node */
for i := 2 to n+1 do
  v´ := 0
  while there is no arc from v with label prefix αti do
    if α ≠ ε then   /* divide the arc w = son(v, αη) into two */
      son(v, α) := v´´; son(v´´, ti) := v´´´; son(v´´, η) := w
    else
      son(v, ti) := v´´´; v´´ := v
    if v´ ≠ 0 then slink(v´) := v´´
    v´ := v´´; v := slink(v); (v, α) := Canonize(v, α)
  if v´ ≠ 0 then slink(v´) := v
  (v, α) := Canonize(v, αti)   /* (v, α) = start node of the next round */
http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english
178
The actual time and space
• |Tree(T)| is about 20|T| in practice
• brute-force construction is O(|T| log |T|) for random strings, as the average depth of internal nodes is O(log |T|)
• the difference between linear and brute-force constructions is not necessarily large (Giegerich & Kurtz)
• truncated suffix trees: a k-symbol prefix of each suffix is represented (Na et al. 2003)
• alphabet-independent linear time (Farach 1997)
Applications of Suffix Trees
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. - book
• APL1: Exact string matching. Search for P in text S. Solution 1: build STree(S) - one achieves the same O(n+m) as Knuth-Morris-Pratt, for example!
• Search from the suffix tree is O(|P|)
• APL2: Exact set matching. Search for a set of patterns P
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 186
Back to backtracking
[Figure: a suffix tree over the alphabet {A,C,T} with leaves 4 2 1 5 3 6; searching for ACA with 1 mismatch backtracks through the branches of the tree]
The same idea can be used for many other forms of approximate search, such as Smith-Waterman, position-restricted scoring matrices, regular expression search, etc.
Applications of Suffix Trees
• APL3: Substring problem for a database of patterns. Given a set of strings S = S1, ..., Sn (a database), find all Si that have P as a substring
• A generalized suffix tree contains all suffixes of all Si
• Query in time O(|P|), and can identify the LONGEST common prefix of P in all Si
Applications of Suffix Trees
• APL4: Longest common substring of two strings
• Find the longest common substring of S and T.
• Overall there are potentially O(n^2) such substrings, if n is the length of the shorter of S and T
• Donald Knuth once (1970) conjectured that a linear-time algorithm is impossible.
• Solution: construct STree(S+T) and find the deepest node in the tree that has suffixes from both S and T among its subtree leaves.
• Ex: S = superiorcalifornialives and T = sealiver have both a substring xxxxxx.
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 189
Simple analysis task: LCSS
• Let LCSS(A,B) denote the longest common substring of two sequences A and B. E.g.: – LCSS(AGATCTATCT, CGCCTCTATG) = TCTAT.
• A good solution is to build a suffix tree for the shorter sequence and make a descending suffix walk with the other sequence.
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 190
Suffix link
[Figure: suffix link from node(aX) to node(X)]
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 191
Descending suffix walk
suffix tree of A. Read B left-to-right, always going down in the tree when possible. If the next symbol of B does not match any edge label at the current position, take the suffix link and try again. (The suffix link from the root to itself consumes a symbol.) The node v encountered with the largest string depth is the solution.
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 192
Another common tool: Generalized suffix tree
ACCTTA....ACCT#CACATT..CAT#TGTCGT...GTA#TCACCACC...C$
[Figure: a node of the generalized suffix tree, with node info: subtree size 47813871, sequence count 87]
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 193
Generalized suffix tree application
...ACC..#...ACC...#...ACC...ACC..ACC..#..ACC..ACC...#...ACC...#...
...#....#...#...#...ACC...#...#...#...#...#...#..#..ACC..ACC...#......#...
[Figure: the node for ACC in the generalized suffix tree, with node info: subtree size 4398, blue sequences 12/15, red sequences 2/62, …]
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 194
Case study continued
[Figure: genome with regions of ChIP-seq matches; in the suffix tree of the genome, a node with 5 blue and 1 red occurrences below the path TAC..........T: a motif?]
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 195
Properties of suffix tree
• A suffix tree has n leaves and at most n-1 internal nodes, where n is the total length of all sequences indexed.
• Each node requires a constant number of integers (pointers to first child, sibling, parent, text range of the incoming edge, statistics counters, etc.).
• Can be constructed in linear time.
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 196
Properties of suffix tree... in practice
• Huge overhead due to the pointer structure:
– A standard implementation of the suffix tree for the human genome requires over 200 GB of memory!
– A careful implementation (using log n-bit fields for each value and an array layout for the tree) still requires over 40 GB.
– The human genome itself takes less than 1 GB using 2 bits per bp.
Applications of Suffix Trees
• APL4: Longest common substring of two strings
• Find the longest common substring of S and T.
• Overall there are potentially O(n^2) such substrings, if n is the length of the shorter of S and T
• Donald Knuth once (1970) conjectured that a linear-time algorithm is impossible.
• Solution: construct STree(S+T) and find the deepest node in the tree that has suffixes from both S and T among its subtree leaves.
• Ex: S = superiorcalifornialives and T = sealiver have both the substring alive.
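The same answer can be computed with sorted suffixes instead of STree(S+T): the maximal cross-string LCP is always attained by two suffixes adjacent in sorted order. A hedged sketch (the separator '#' is assumed not to occur in either string; naive slicing, so this is quadratic, for illustration only):

```python
# Sketch: longest common substring via sorted suffixes of S + '#' + T.

def lcs(s, t):
    sep = '#'                            # assumed absent from s and t
    u = s + sep + t
    suffixes = sorted(range(len(u)), key=lambda i: u[i:])
    best = ''
    for a, b in zip(suffixes, suffixes[1:]):
        # keep only adjacent pairs where one suffix starts in s, the other in t
        if (a < len(s)) == (b < len(s)):
            continue
        k = 0                            # longest common prefix of the pair
        while (a + k < len(u) and b + k < len(u)
               and u[a + k] == u[b + k] and u[a + k] != sep):
            k += 1
        if k > len(best):
            best = u[a:a + k]
    return best

print(lcs("superiorcalifornialives", "sealiver"))   # -> alive
```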
Applications of Suffix Trees
• APL5: Recognizing DNA contamination. Related to DNA sequencing: search for the longest strings (longer than a threshold) that are present in a DB of sequences of other genomes.
• APL6: Common substrings of more than two strings. A generalization of APL4; can be done in time linear in the total length of all strings
Applications of Suffix Trees
• APL7: Building a directed graph for exact matching: the suffix graph - a directed acyclic word graph (DAWG), the smallest finite state automaton recognizing all suffixes of a string S. This automaton can recognize membership, but not tell which suffix was matched.
• Construction: merge isomorphic subtrees.
• Subtrees of the suffix tree are isomorphic when a suffix link path exists between them and the subtrees have an equal number of leaves.
Applications of Suffix Trees
• APL8: A reverse role for suffix trees, and major space reduction. Index the pattern, not the text...
• Matching statistics.
• APL10: All-pairs suffix-prefix matching. For all pairs Si, Sj, find the longest matching suffix-prefix pair. Used in shortest common superstring generation (e.g. DNA sequence assembly), EST alignment, etc.
Applications of Suffix Trees
• APL11: Finding all maximal repetitive structures in linear time
• APL12: Circular string linearization. E.g. for circular chemical molecules in a database, one wants to linearize them in a canonical way...
• APL13: Suffix arrays - more space reduction; we will touch on that separately
Applications of Suffix Trees
• APL14: Suffix trees in genome-‐scale projects • APL15: A Boyer-‐Moore approach to exact set matching
• APL16: Ziv-‐Lempel data compression • APL17: Minimum length encoding of DNA
Applications of Suffix Trees
• Additional applications. Mostly exercises...
• Extra feature: CONSTANT-time lowest common ancestor (LCA) retrieval
  A data structure that allows finding the lowest common ancestor of two nodes in constant time (this corresponds to the longest common prefix!) can be built in linear time.
• APL: Longest common extension: a bridge to inexact matching
• APL: Finding all maximal palindromes in linear time
  A palindrome reads the same from the central position to the left and to the right. E.g.: kirik, saippuakivikauppias.
• Build the suffix tree of S and reversed S (aabcbad => aabcbad#dabcbaa) and, using LCA, one can ask for any position pair (i, 2i-1) the longest common prefix in constant time.
• The whole problem can be solved in O(n).
Applications of Suffix Trees
• APL: Exact matching with wild cards
• APL: The k-mismatch problem
• Approximate palindromes and repeats
• Faster methods for tandem repeats
• A linear-time solution to the multiple common substring problem
• And many, many more ...
205
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs
206
Suffixes -‐ sorted
• Sort all suffixes. This allows binary search!
suffixes: hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i ε
sorted: ε atti attivatti hattivatti i ivatti ti tivatti tti ttivatti vatti
207
Suffix array: example
• suffix array = lexicographic order of the suffixes
suffixes: hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i ε
sorted: ε atti attivatti hattivatti i ivatti ti tivatti tti ttivatti vatti
i:  1  2  3  4  5  6  7  8  9  10  11
SA: 11  7  2  1  10  5  9  4  8  3  6
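The sorting view can be written directly in Python (a naive O(n^2 log n) construction; the linear-time algorithms come later in these slides; the empty suffix is included, as on the slide):

```python
# Sketch: suffix array by plain sorting, 1-based positions as in the example.

def suffix_array(t):
    return [i + 1 for i in sorted(range(len(t) + 1), key=lambda i: t[i:])]

print(suffix_array("hattivatti"))   # -> [11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6]
```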
208
Suffix array construction: sort!
• suffix array = lexicographic order of the suffixes
suffixes: hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i ε
SA: 11 7 2 1 10 5 9 4 8 3 6
209
Suffix array
• suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
• space requirement: 5|T|
• practitioners like suffix arrays (simplicity, space efficiency)
• theoreticians like suffix trees (explicit structure)
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 210
Reducing space: suffix array
[Figure: the suffix tree of the text C A T A C T (positions 1-6) with edge labels written as text ranges, e.g. =[3,3], =[2,2], =[4,6], =[2,6], =[3,6], =[5,6]; reading the leaves left to right gives the suffix array]
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 211
Suffix array
• Many algorithms on the suffix tree can be simulated using the suffix array...
– ... and a couple of additional arrays...
– ... forming the so-called enhanced suffix array...
– ... leading to a space requirement similar to a careful implementation of the suffix tree
• Not a satisfactory solution to the space issue.
212
Pattern search from suffix array
suffixes sorted: ε atti attivatti hattivatti i ivatti ti tivatti tti ttivatti vatti
SA: 11 7 2 1 10 5 9 4 8 3 6
binary search for att finds the contiguous SA interval of suffixes that start with att
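A hedged sketch of that binary search (naive suffix slicing for clarity; `left` and `right` bound the SA interval of suffixes starting with p, and the function name is ad hoc):

```python
# Sketch: all occurrences of pattern p via binary search on the suffix array
# (SA holds 1-based start positions, with the empty suffix included).

def sa_interval(t, sa, p):
    # lower bound: first suffix >= p
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid] - 1:] < p:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # upper bound: first suffix (at or after left) not starting with p
    lo, hi = left, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid] - 1:].startswith(p):
            lo = mid + 1
        else:
            hi = mid
    return left, lo

t = "hattivatti"
sa = [11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6]
left, right = sa_interval(t, sa, "att")
print(sorted(sa[left:right]))   # -> [2, 7]
```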
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 213
What will we learn today?
• It is possible to replace suffix trees with compressed suffix trees that take 8.8 GB for the human genome.
• Backtracking can be done using compressed suffix arrays requiring only 2.1 GB for the human genome.
• Discovering interesting motif seeds from the human genome takes 40 hours and requires 9.3 GB of space.
214
Recent suffix array constructions
• Manber & Myers (1990): O(|T| log |T|)
• linear time via the suffix tree
• January / June 2003: direct linear-time construction of the suffix array
– Kim, Sim, Park, Park (CPM03)
– Kärkkäinen & Sanders (ICALP03)
– Ko & Aluru (CPM03)
215
Kärkkäinen-‐Sanders algorithm
1. Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.
2. Construct the suffix array of the remaining suffixes using the result of the first step.
3. Merge the two suffix arrays into one.
216
Notation
• string T = T[0,n) = t0t1 … tn-1
• suffix Si = T[i,n) = ti ti+1 … tn-1
• for C ⊆ [0,n]: SC = {Si | i ∈ C}
• the suffix array SA[0,n] of T is a permutation of [0,n] satisfying SSA[0] < SSA[1] < … < SSA[n]
217
Running example
• T[0,n) = y a b b a d a b b a d o   (positions 0 1 2 3 4 5 6 7 8 9 10 11), padded with 0 0 …
• SA = (12,1,6,4,9,3,8,2,7,5,10,11,0)
218
Step 0: Construct a sample
• for k = 0,1,2: Bk = {i ∈ [0,n] | i mod 3 = k}
• C = B1 ∪ B2   sample positions
• SC   sample suffixes
• Example: B1 = {1,4,7,10}, B2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}
219
Step 1: Sort sample suffixes
• for k = 1,2, construct Rk = [tk tk+1 tk+2][tk+3 tk+4 tk+5] … [tmaxBk tmaxBk+1 tmaxBk+2]
• R = R1 ^ R2, the concatenation of R1 and R2. The suffixes of R correspond to SC: suffix [ti ti+1 ti+2]… corresponds to Si; the correspondence is order preserving.
• Sort the suffixes of R: radix sort the characters and rename them with ranks to obtain R´. If all characters are different, their order directly gives the order of the suffixes. Otherwise, sort the suffixes of R´ recursively with Kärkkäinen-Sanders. Note: |R´| = 2n/3.
220
Step 1 (cont.)
• once the sample suffixes are sorted, assign a rank to each: rank(Si) = the rank of Si in SC; rank(Sn+1) = rank(Sn+2) = 0
• Example: R = [abb][ada][bba][do0][bba][dab][bad][o00]
  R´ = (1,2,4,6,4,5,3,7)
  SAR´ = (8,0,1,6,4,2,5,3,7)
  rank(Si): - 1 4 - 2 6 - 5 3 - 7 8 - 0 0
221
Step 2: Sort nonsample suffixes
• for each non-sample Si ∈ SB0 (note that rank(Si+1) is always defined for i ∈ B0):
  Si ≤ Sj ⇔ (ti, rank(Si+1)) ≤ (tj, rank(Sj+1))
• radix sort the pairs (ti, rank(Si+1)).
• Example: S12 < S6 < S9 < S3 < S0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
222
Step 3: Merge
• merge the two sorted sets of suffixes using standard comparison-based merging:
• to compare Si ∈ SC with Sj ∈ SB0, distinguish two cases:
• i ∈ B1: Si ≤ Sj ⇔ (ti, rank(Si+1)) ≤ (tj, rank(Sj+1))
• i ∈ B2: Si ≤ Sj ⇔ (ti, ti+1, rank(Si+2)) ≤ (tj, tj+1, rank(Sj+2))
• note that the ranks are defined in all cases!
• S1 < S6 as (a,4) < (a,5), and S3 < S8 as (b,a,6) < (b,a,7)
223
Running time O(n)
• excluding the recursive call, everything can be done in linear time
• the recursion is on a string of length 2n/3
• thus the time is given by the recurrence T(n) = T(2n/3) + O(n)
• hence T(n) = O(n)
224
Implementahon
• about 50 lines of C++ • code available e.g. via Juha Kärkkäinen’s home page
225
LCP table
• Longest Common Prefix of successive elements of the suffix array:
• LCP[i] = length of the longest common prefix of suffixes SSA[i] and SSA[i+1]
• build the inverse array SA⁻¹ from SA in linear time
• then the LCP table from SA⁻¹ in linear time (Kasai et al., CPM 2001)
• Oxford English Dictionary http://www.oed.com/
• Example - Word of the Day, "Fourth":
  http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/L7_SuffixTrees/wotd_fourth.html
  http://www.oed.com/cgi/display/wotd
• PAT index - by Gaston Gonnet (also one of the creators of the Maple software, and later a developer of molecular biology software packages)
• The PAT index is essentially a suffix array. To save space, only the first character of every word was indexed
• XML tagging (or SGML, at that time!) was also indexed
• To mark certain fields of XML, bit vectors were used.
• Main concern: improve the speed of search on the CD - minimize random accesses.
• For a slow medium even 15-20 accesses is too slow...
• G. H. Gonnet, R. A. Baeza-Yates, and T. Snider, Lexicographical indices for text: Inverted files vs. PAT trees, Technical Report OED-91-01, Centre for the New OED, University of Waterloo, 1991.
227
Suffix tree vs suffix array
• suffix tree ⇔ suffix array + LCP table
228
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs
229
Substring motifs of string T
• string T = t1 … tn in alphabet A.
• Problem: what are the frequently occurring (ungapped) substrings of T? The longest substring that occurs at least q times?
• Thm: The suffix tree Tree(T) gives complete occurrence counts of all substring motifs of T in O(n) time (although T may have O(n^2) substrings!)
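The same statistic can also be read off the suffix array plus LCP values: any q suffixes adjacent in sorted order share a prefix whose length is the minimum of the q-1 LCPs between them. A sketch (naive LCP computation for brevity; the function name is ad hoc):

```python
# Sketch: longest substring occurring at least q times, via suffix array + LCPs.

def longest_repeat(t, q=2):
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])
    lcp = []                                   # adjacent LCPs, computed naively
    for a, b in zip(sa, sa[1:]):
        k = 0
        while a + k < n and b + k < n and t[a + k] == t[b + k]:
            k += 1
        lcp.append(k)
    best, pos = 0, -1
    for i in range(len(lcp) - (q - 2)):
        m = min(lcp[i:i + q - 1])              # q suffixes sa[i..i+q-1] share m chars
        if m > best:
            best, pos = m, sa[i]
    return t[pos:pos + best] if best else ""

print(longest_repeat("hattivatti", 2))         # -> atti
```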
230
Counting the substring motifs
• internal nodes of Tree(T) ↔ repeating substrings of T
• the number of leaves in the subtree of the node for string P = the number of occurrences of P in T
231
Substring motifs of hattivatti
[Figure: the suffix tree of hattivatti annotated with occurrence counts at internal nodes: t occurs 4 times; atti, tti, ti and i occur 2 times each]
Counts for the O(n) maximal motifs shown 232
Finding repeats in DNA
• human chromosome 3 • the first 48 999 930 bases • 31 min cpu hme (8 processors, 4 GB)
• Human genome: 3x109 bases • Tree(HumanGenome) feasible
233
Longest repeat?
Occurrences at: 28395980, 28401554r Length: 2559 ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaatgctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcatagtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatcaccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgccattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggcttttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgtaagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaagggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagactttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttct
agatatacaatcatgtcatctgcaaacagggacaatttgacttcctcttttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagttagg 234
Ten occurrences?
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt
Length: 277
Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125 In the reversed complement at: 17858493, 41463059, 42431718, 42580925
235
Using suffix trees: plagiarism
• find longest common substring of strings X and Y
• build Tree(X$Y) and find the deepest node which has a leaf pointing into X and another pointing into Y
236
Using suffix trees: approximate matching
• edit distance: insertions, deletions, changes
• STOCKHOLM vs TUKHOLMA
237
String distance/similarity functions
STOCKHOLM vs TUKHOLMA

S T O C K H O L M _
_ T U _ K H O L M A

=> 2 deletions, 1 insertion, 1 change
238
Approximate string matching
• A: S T O C K H O L M
  B: T U K H O L M A
• minimum number of 'mutation' steps: a -> b, a -> ε, ε -> b, …
• dID(A,B) = 5, dLevenshtein(A,B) = 4
• mutation costs => probabilistic modeling
• evaluation by dynamic programming => alignment
239
Dynamic programming

d(i,j) = min( if ai = bj then d(i-1,j-1) else ∞,
              d(i-1,j) + 1,
              d(i,j-1) + 1 )

= distance between the i-prefix of A and the j-prefix of B (substitution excluded)

[Figure: the m×n table d; cell d(i,j) is computed from d(i-1,j-1), d(i-1,j) and d(i,j-1), the latter two adding +1; the answer is d(m,n)]
240
A\B      s  t  o  c  k  h  o  l  m
      0  1  2  3  4  5  6  7  8  9
t     1  2  1  2  3  4  5  6  7  8
u     2  3  2  3  4  5  6  7  8  9
k     3  4  3  4  5  4  5  6  7  8
h     4  5  4  5  6  5  4  5  6  7
o     5  6  5  4  5  6  5  4  5  6
l     6  7  6  5  6  7  6  5  4  5
m     7  8  7  6  7  8  7  6  5  4
a     8  9  8  7  8  9  8  7  6  5

d(i,j) = min( if ai = bj then d(i-1,j-1) else ∞, d(i-1,j) + 1, d(i,j-1) + 1 )

dID(A,B) = 5 (the bottom-right cell); the optimal alignment is recovered by trace-back
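The recurrence and the table can be checked with a short program (d_id is an ad hoc name for the slide's insertion/deletion distance: a match is free, substitution is excluded, insert and delete cost 1):

```python
# Sketch of the dID dynamic program from the table above.

def d_id(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(d[i - 1][j] + 1, d[i][j - 1] + 1)
            if a[i - 1] == b[j - 1]:      # match: free diagonal step
                best = min(best, d[i - 1][j - 1])
            d[i][j] = best
    return d[m][n]

print(d_id("tukholma", "stockholm"))      # -> 5, as in the table
```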
241
Search problem
• find approximate occurrences of pattern P in text T: substrings P' of T such that d(P,P') is small
• dynamic programming with a small modification: O(mn)
• lots of (practical) improvement tricks
[Figure: pattern P aligned against an approximate occurrence P' inside text T]
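The "small modification" is that row 0 of the table is all zeros, so a match may start at any position of T. A sketch with unit-cost Levenshtein operations and threshold k (approx_find is an ad hoc name; it reports 1-based end positions of approximate occurrences):

```python
# Sketch: O(mn) approximate pattern search by dynamic programming.

def approx_find(p, t, k):
    m, n = len(p), len(t)
    prev = [0] * (n + 1)                 # D[0][j] = 0: a match may start anywhere
    for i in range(1, m + 1):
        cur = [i] + [0] * n              # D[i][0] = i
        for j in range(1, n + 1):
            cur[j] = min(prev[j - 1] + (p[i - 1] != t[j - 1]),  # match / change
                         prev[j] + 1,                           # symbol missing in T
                         cur[j - 1] + 1)                        # extra symbol in T
        prev = cur
    return [j for j in range(1, n + 1) if prev[j] <= k]

print(approx_find("att", "hattivatti", 0))   # -> [4, 9] (exact ends)
```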
242
Index for approximate searching?
• dynamic programming: P × Tree(T), with backtracking
[Figure: pattern P matched against the paths of Tree(T)]