02.01.15
1
Text Algorithms
Jaak Vilo 2014 fall
MTAT.03.190 Text Algorithms Jaak Vilo
Exact pattern matching
• S = s1 s2 … sn (text), |S| = n (length)
• P = p1 p2 … pm (pattern), |P| = m
• Σ – alphabet, |Σ| = c
• Does S contain P?
 – Does S = S' P S'' for some strings S' and S''?
 – Usually m << n, and n can be (very) large
Find occurrences in text
[figure: pattern P slides along text S]
Algorithms
One-pattern • Brute force • Knuth-Morris-Pratt • Karp-Rabin • Shift-OR, Shift-AND • Boyer-Moore • Factor searches
Multi-pattern • Aho-Corasick • Commentz-Walter
Indexing • Trie (and suffix trie) • Suffix tree
Animations • http://www-igm.univ-mlv.fr/~lecroq/string/
• EXACT STRING MATCHING ALGORITHMS: Animation in Java
• Christian Charras - Thierry Lecroq, Laboratoire d'Informatique de Rouen, Université de Rouen, Faculté des Sciences et des Techniques, 76821 Mont-Saint-Aignan Cedex, FRANCE
• e-mails: {Christian.Charras, Thierry.Lecroq}@laposte.net
Brute force: BAB in text?
A B A C A B A B B A B B B A
B A B
Brute Force
[figure: pattern P aligned at text position i; comparing P[j] with S[i+j-1]]
Identify the first mismatch!
Question:
§ Problems of this method?
§ Ideas to improve the search?
Brute force
Algorithm Naive
Input: Text S[1..n] and pattern P[1..m]
Output: All positions i where P occurs in S
for( i=1 ; i <= n-m+1 ; i++ )
    for ( j=1 ; j <= m ; j++ )
        if( S[i+j-1] != P[j] ) break ;
    if ( j > m ) print i ;
attempt 1: gcatcgcagagagtatacagtacg
           GCAg....
attempt 2: gcatcgcagagagtatacagtacg
            g.......
attempt 3: gcatcgcagagagtatacagtacg
             g.......
attempt 4: gcatcgcagagagtatacagtacg
              g.......
attempt 5: gcatcgcagagagtatacagtacg
               g.......
attempt 6: gcatcgcagagagtatacagtacg
                GCAGAGAG
attempt 7: gcatcGCAGAGAGtatacagtacg
                 g.......
Brute force or NaiveSearch
1 function NaiveSearch(string s[1..n], string sub[1..m])
2     for i from 1 to n-m+1
3         for j from 1 to m
4             if s[i+j-1] ≠ sub[j]
5                 jump to next iteration of outer loop
6         return i
7     return not found
C code
int bf_2( char* pat, char* text, int n )   /* n = textlen */
{
    int m, i, j ;
    int count = 0 ;
    m = strlen(pat);
    for ( i=0 ; i + m <= n ; i++ ) {
        for( j=0 ; j < m && pat[j] == text[i+j] ; j++ ) ;
        if( j == m ) count++ ;
    }
    return(count);
}
C code
int bf_1( char* pat, char* text )
{
    int m ;
    int count = 0 ;
    char *tp ;
    m = strlen(pat);
    for( tp=text ; *tp ; tp++ ) {
        if( strncmp( pat, tp, m ) == 0 )
            count++ ;
    }
    return( count );
}
Main problem of Naive
• For the next possible location of P, the same positions of S are checked again
[figure: after a mismatch at S[i+j-1], the naive method restarts at i+1 and re-examines positions it has already seen]
Goals
• Make sure only a constant nr of comparisons/operations is made for each position in S
 – Move (only) from left to right in S
 – How?
 – After a test S[i] <> P[j], what do we know?
Knuth-Morris-Pratt
• Make sure that no comparisons are “wasted”
• After such a mismatch we already know exactly the values of the green area in S!
D. Knuth, J. Morris, V. Pratt: Fast Pattern Matching in Strings. SIAM Journal on Computing 6:323-350, 1977.
[figure: mismatch x ≠ y; the already-matched region of S is known]
Knuth-Morris-Pratt
• Make sure that no comparisons are “wasted”
• For every prefix of P: the longest suffix of that prefix that is also a prefix of the pattern
• Example: ABCABD
D. Knuth, J. Morris, V. Pratt: Fast Pattern Matching in Strings. SIAM Journal on Computing 6:323-350, 1977.
[figure: after a mismatch, shift P so that a prefix of P aligns with the matching suffix of the scanned text]
Automaton for ABCABD
[figure: states 1–7 with forward transitions A, B, C, A, B, D; state 1 loops on “not A”]
Pattern:  A B C A B D
Position: 1 2 3 4 5 6
Fail:     0 1 1 1 2 3 1
KMP matching
Input: Text S[1..n] and pattern P[1..m]
Output: First occurrence of P in S (if exists)
i=1; j=1; initfail(P)   // Prepare fail links
repeat
    if j==0 or S[i] == P[j] then i++ , j++   // advance in text and in pattern
    else j = fail[j]                         // use fail link
until j>m or i>n
if j>m then report match at i-m
Initialization of fail links
Algorithm: KMP_Initfail
Input: Pattern P[1..m]
Output: fail[] for pattern P
i=1, j=0, fail[1]=0
repeat
    if j==0 or P[i] == P[j] then i++ , j++ , fail[i] = j
    else j = fail[j]
until i>=m
Initialization of fail links – trace on P = ABCABD:
i=1, j=0, fail[1]=0
repeat
    if j==0 or P[i] == P[j] then i++ , j++ , fail[i] = j
    else j = fail[j]
until i>=m
Fail grows step by step: 0 → 0 1 → 0 1 1 1 → 0 1 1 1 2 → …
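The pseudocode above is 1-based; the same algorithm fits into a compact C sketch using the equivalent 0-based border-array convention (fail[j] = length of the longest proper border of P[0..j]), so the stored values differ by one from the slides' fail[] but the behaviour is identical. Function names are ours.

```c
#include <stdio.h>
#include <string.h>

/* Border array: fail[j] = length of the longest proper prefix of
   P[0..j] that is also a suffix of it (0-based KMP convention). */
static void initfail(const char *p, int m, int *fail) {
    fail[0] = 0;
    for (int j = 1, k = 0; j < m; j++) {
        while (k > 0 && p[j] != p[k]) k = fail[k - 1];
        if (p[j] == p[k]) k++;
        fail[j] = k;
    }
}

/* Return the 0-based start of the first occurrence of p in s, or -1. */
int kmp_search(const char *s, const char *p) {
    int n = strlen(s), m = strlen(p);
    if (m == 0) return 0;
    int fail[m];                      /* C99 VLA, fine for short patterns */
    initfail(p, m, fail);
    for (int i = 0, k = 0; i < n; i++) {
        while (k > 0 && s[i] != p[k]) k = fail[k - 1];  /* follow fail links */
        if (s[i] == p[k]) k++;                          /* advance in pattern */
        if (k == m) return i - m + 1;                   /* full match */
    }
    return -1;
}
```

Each text character is inspected once, and k can only decrease by as much as it was previously incremented – the amortised O(n) argument of the analysis slide.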
Time complexity of KMP matching?
Input: Text S[1..n] and pattern P[1..m]
Output: First occurrence of P in S (if exists)
i=1; j=1; initfail(P)   // Prepare fail links
repeat
    if j==0 or S[i] == P[j] then i++ , j++   // advance in text and in pattern
    else j = fail[j]                         // use fail link
until j>m or i>n
if j>m then report match at i-m
Analysis of time complexity
• At every cycle either i and j increase by 1,
• or j decreases (j=fail[j])
• i can increase n (or m) times
• Q: How often can j decrease?
 – A: not more than the nr of increases of i
• Amortised analysis: O(n), preprocessing O(m)
Karp-Rabin
• Compare in O(1) a hash of P and of S[i..i+m-1]
• Goal: O(n). Update f( h(T[i..i+m-1]) -> h(T[i+1..i+m]) ) in O(1)
R. Karp and M. Rabin: Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31 (1987), 249-260.
[figure: window T[i..i+m-1]; compare h(T[i..i+m-1]) with h(P), then slide to h(T[i+1..i+m])]
Hash
• “Remove” the effect of T[i] and “introduce” the effect of T[i+m] – in O(1)
• Use base-|Σ| arithmetic and treat characters as numbers
• In case of a hash match – check all m positions
• Hash collisions => worst case O(nm)
Let’s use numbers
• T = 57125677
• P = 125 (and for simplicity, h = 125)
• H(T[1]) = 571
• H(T[2]) = (571 - 5*100)*10 + 2 = 712
• In general: H(T[i+1]) = (H(T[i]) - ord(T[i])*10^(m-1))*10 + ord(T[i+m])
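The base-10 update above can be checked mechanically. A tiny helper (the function `roll` is ours, fixed to the slide's m = 3):

```c
/* Slide's worked example: 3-digit base-10 window over T = 57125677.
   Drop the leading digit, shift, append the next digit. */
static int roll(int h, int out_digit, int in_digit) {
    return (h - out_digit * 100) * 10 + in_digit;   /* 100 = 10^(m-1) */
}
```

roll(571, 5, 2) yields 712, and rolling once more, roll(712, 7, 5), yields 125 – equal to h(P), so the window T[3..5] = 125 is a match.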
hash
• c – size of alphabet
• HS_i = H( S[i..i+m-1] )
• H( S[i+1..i+m] ) = ( HS_i - ord(S[i])*c^(m-1) ) * c + ord( S[i+m] )
• Modulo arithmetic – to fit the value in a machine word!
• hash(w[0..m-1]) = ( w[0]*2^(m-1) + w[1]*2^(m-2) + ··· + w[m-1]*2^0 ) mod q
Karp-Rabin
Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
1. c=20 /* Size of the alphabet, say nr. of aminoacids */
2. q = 33554393 /* q is a prime */
3. cm = c^(m-1) mod q
4. hp = 0 ; hs = 0
5. for i = 1..m do hp = ( hp*c + ord(p[i]) ) mod q   // H(P)
6. for i = 1..m do hs = ( hs*c + ord(s[i]) ) mod q   // H(S[1..m])
7. if hp == hs and P == S[1..m] report match at position 1
8. for i = 2..n-m+1
9.     hs = ( (hs - ord(s[i-1])*cm) * c + ord(s[i+m-1]) ) mod q
10.    if hp == hs and P == S[i..i+m-1]
11.        report match at position i
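A self-contained C sketch of this pseudocode. It uses raw byte values as digits (base c = 256) instead of the slides' c = 20, keeps the slide's prime q, and verifies every hash hit character by character; the function name is ours.

```c
#include <string.h>

/* Rabin-Karp: count occurrences of p in s. */
int rk_count(const char *s, const char *p) {
    const unsigned long long q = 33554393ULL;   /* prime, as on the slide */
    const unsigned long long c = 256;           /* base = byte alphabet */
    int n = strlen(s), m = strlen(p);
    if (m == 0 || m > n) return 0;
    unsigned long long cm = 1, hp = 0, hs = 0;
    for (int i = 0; i < m - 1; i++) cm = cm * c % q;     /* c^(m-1) mod q */
    for (int i = 0; i < m; i++) {
        hp = (hp * c + (unsigned char)p[i]) % q;
        hs = (hs * c + (unsigned char)s[i]) % q;
    }
    int count = 0;
    for (int i = 0; ; i++) {
        if (hp == hs && memcmp(s + i, p, m) == 0) count++; /* verify hit */
        if (i + m >= n) break;
        unsigned long long drop = (unsigned char)s[i] * cm % q;
        hs = ((hs + q - drop) % q * c + (unsigned char)s[i + m]) % q;
    }
    return count;
}
```

The `(hs + q - drop) % q` keeps the subtraction non-negative in modular arithmetic; collisions only cost an extra memcmp, giving the O(nm) worst case but O(n) expected time.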
More ways to ensure O( n ) ?
Shift-AND / Shift-OR
• Ricardo Baeza-Yates, Gaston H. Gonnet: A new approach to text searching. Communications of the ACM, October 1992, Volume 35, Issue 10. [ACM Digital Library: http://doi.acm.org/10.1145/135239.135243] [DOI]
Bit-operations
• Maintain the set of all prefixes that have so far had a perfect match
• On the next character of the text, update all these positions to a new set
• One bit vector for every possible character
State: which prefixes match? (one bit per prefix, e.g. 0 1 0 0 1)
Move to next: track positions of prefix matches
0 1 0 1 0 1   state
1 0 1 0 1 1   shift left <<
1 0 0 0 1 1   mask on char T[i]
1 0 0 0 1 1   bitwise AND = new state
Vectors for every char in Σ
• P = aste
char:   a s t e b c d .. z
pos 1:  1 0 0 0 0 ...
pos 2:  0 1 0 0 0 ...
pos 3:  0 0 1 0 0 ...
pos 4:  0 0 0 1 0 ...
• T = lasteaed (P = aste; columns = state bits after each text character)
text:   l a s t e a (e d)
bit 1:  0 1 0 0 0 1
bit 2:  0 0 1 0 0 0
bit 3:  0 0 0 1 0 0
bit 4:  0 0 0 0 1 0   ← bit 4 (= |P|) set: “aste” ends at position 5
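The whole method is only a few lines of C when the state fits one machine word. A sketch assuming m ≤ 64 (names ours):

```c
#include <stdint.h>
#include <string.h>

/* Shift-AND: bit j of state is 1 iff P[0..j] matches the text
   ending at the current position. Assumes m <= 64. */
int shift_and_count(const char *s, const char *p) {
    int n = strlen(s), m = strlen(p);
    uint64_t mask[256] = {0};
    for (int j = 0; j < m; j++)
        mask[(unsigned char)p[j]] |= (uint64_t)1 << j;   /* char vectors */
    uint64_t state = 0, accept = (uint64_t)1 << (m - 1);
    int count = 0;
    for (int i = 0; i < n; i++) {
        state = ((state << 1) | 1) & mask[(unsigned char)s[i]];
        if (state & accept) count++;                     /* full match */
    }
    return count;
}
```

The `| 1` in the update injects a fresh potential match starting at the current character, exactly the role of the leftmost column in the tables above.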
Summary
Algorithm            Worst case   Ave. case            Preprocess
Brute force          O(mn)        O(n·(1+1/|Σ|+..))
Knuth-Morris-Pratt   O(n)         O(n)                 O(m)
Rabin-Karp           O(mn)        O(n)                 O(m)
Boyer-Moore          ?            O(n/m)
BM Horspool
Factor search
Shift-OR             O(n)         O(n)                 O( m|Σ| )
• R. Boyer, S. Moore: A fast string searching algorithm. CACM 20 (1977), 762-772 [PDF]
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf
Find occurrences in text
• Have we missed anything?
[figure: S, P]
Find occurrences in text
• What have we learned if we test for a potential match from the end?
[figure: S, P]
ABCDE
BBCDE
Find occurrences in text
[figure: S, P; comparing characters A and B]
Bad character heuristics: maximal shift on S[i]
[figure: mismatch at character X = S[i]; shift P so that the rightmost occurrence of X in P (first from the end) aligns under S[i]]
delta1( S[i] ) =
 – m, if the pattern does not contain S[i]
 – patlen - j, where j is the max j so that P[j] == S[i]
void bmInitocc() {
    int a; int j;    /* int, not char: a signed char index can misbehave */
    for( a=0; a<alphabetsize; a++ )
        occ[a] = -1;
    for( j=0; j<m; j++ ) {
        a = p[j];
        occ[a] = j;   /* rightmost occurrence wins */
    }
}
Good suffix heuristics
[figure: matched suffix µ of P against S; mismatch just before it]
delta2( j ) – minimal shift so that either
1. the matched region µ is covered again by another occurrence of µ in P, or
2. a suffix of the match is also a prefix of P
Boyer-Moore algorithm
Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
preprocess_BM()   // delta1 and delta2
i=m
while i <= n
    for( j=m; j>0 and P[j]==S[i-m+j]; j-- ) ;
    if j==0 report match at position i-m+1
    i = i + max( delta1[ S[i] ], delta2[ j ] )
• http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf
• Animation: http://www-igm.univ-mlv.fr/~lecroq/string/
Simplifications of BM
• There are many variants of Boyer-Moore, and many scientific papers.
• On average the time complexity is sublinear.
• The algorithm can be made faster and the code simpler at the same time.
• It is useful to use the last-character heuristic (Horspool (1980), Baeza-Yates (1989), Hume and Sunday (1991)).
Algorithm BMH (Boyer-Moore-Horspool)
• R. N. Horspool: Practical Fast Searching in Strings. Software – Practice and Experience, 10(6):501-506, 1980
Input: Text S[1..n] and pattern P[1..m]
Output: occurrences of P in S
1. for a in Σ do delta[a] = m
2. for j=1..m-1 do delta[P[j]] = m-j
3. i=m
4. while i <= n
5.     if S[i] == P[m]
6.         j = m-1
7.         while ( j>0 and P[j]==S[i-m+j] ) j = j-1 ;
8.         if j==0 report match at i-m+1
9.     i = i + delta[ S[i] ]
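A direct C translation of the BMH pseudocode above, with 0-based indices and counting all matches (function name ours):

```c
#include <string.h>

/* Boyer-Moore-Horspool: count occurrences of p in s. */
int bmh_count(const char *s, const char *p) {
    int n = strlen(s), m = strlen(p);
    if (m == 0 || m > n) return 0;
    int delta[256];
    for (int a = 0; a < 256; a++) delta[a] = m;
    for (int j = 0; j < m - 1; j++)
        delta[(unsigned char)p[j]] = m - 1 - j;   /* shift on last window char */
    int count = 0;
    int i = m - 1;                                /* last char of the window */
    while (i < n) {
        int j = m - 1, k = i;
        while (j >= 0 && p[j] == s[k]) { j--; k--; }   /* compare right-to-left */
        if (j < 0) count++;
        i += delta[(unsigned char)s[i]];          /* shift by the last char */
    }
    return count;
}
```

Note that delta is computed over P[1..m-1] only, so the shift on a full match is still positive.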
String Matching: Horspool algorithm
Pattern against text, from right to left: suffix search
• Which is the next position of the window?
• How is the comparison made?
The shift depends on where the last letter of the text window, say ‘a’, appears in the pattern.
Hence a preprocessing step is needed that determines the length of the shift.
[figure: possible alignments of ‘a’ within the pattern]
Algorithm Boyer-Moore-Horspool-Hume-Sunday (BMHHS)
• Use delta in a tight loop
• On a match (delta==0), check and then apply the original delta d
Input: Text S[1..n] and pattern P[1..m]
Output: occurrences of P in S
1. for a in Σ do delta[a] = m
2. for j=1..m-1 do delta[P[j]] = m-j
3. d = delta[ P[m] ]    // memorize d on P[m]
4. delta[ P[m] ] = 0    // ensure delta on match of last char is 0
5. for ( i=m ; i<=n ; i=i+d )
6.     repeat   // skip loop
7.         t = delta[ S[i] ] ; i = i + t
8.     until t==0
9.     for( j=m-1 ; j>0 and P[j]==S[i-m+j] ; j=j-1 ) ;
10.    if j==0 report match at i-m+1
BMHHS requires that the text is padded by P: S[n+1..n+m] = P (so that the skip loop terminates – there is always at least one occurrence!).
• Daniel M. Sunday: A very fast substring search algorithm [PDF]. Communications of the ACM, August 1990, Volume 33, Issue 8
• Loop unrolling:
• Avoid too many loop iterations (each iteration requires tests) by repeating code within the loop.
• Line 7 in the previous algorithm can be replaced by:
7. i += delta[ S[i] ]; i += delta[ S[i] ]; i += (t = delta[ S[i] ]) ;
Forward-Fast-Search: Another Fast Variant of the Boyer-Moore String Matching Algorithm
• The Prague Stringology Conference '03
• Domenico Cantone and Simone Faro
• Abstract: We present a variation of the Fast-Search string matching algorithm, a recent member of the large family of Boyer-Moore-like algorithms, and we compare it with some of the most effective string matching algorithms, such as Horspool, Quick Search, Tuned Boyer-Moore, Reverse Factor, Berry-Ravindran, and Fast-Search itself. All algorithms are compared in terms of run-time efficiency, number of text character inspections, and number of character comparisons. It turns out that our new proposed variant, though not linear, achieves very good results especially in the case of very short patterns or small alphabets.
• http://cs.felk.cvut.cz/psc/event/2003/p2.html
• PS.gz (local copy)
Factor based approach
• Optimal average-case algorithms
 – Assuming independent characters with equal probabilities
• Factor – a substring of a pattern
 – Any substring – (how many?)
Factor searches
Do not compare single characters, but find the longest match to any subregion (factor) of the pattern.
[figure: window over S; scanned part u matched backwards against the factors of P]
Examples
• Backward DAWG Matching (BDM) – Crochemore et al. 1994
• Backward Nondeterministic DAWG Matching (BNDM) – Navarro, Raffinot 2000
• Backward Oracle Matching (BOM) – Allauzen, Crochemore, Raffinot 2001
Backward DAWG Matching (BDM)
Do not compare single characters, but find the longest match to any subregion of the pattern.
A suffix automaton recognises all factors (and suffixes) in O(n).
BNDM – simulate the automaton using bit-parallelism:
bits show where the factors have occurred so far.
BNDM matches an NFA on the suffixes of ‘announce’.
[figure: nondeterministic automaton for the suffixes of ‘announce’]
Deterministic version of the same: the Backward Factor Oracle.
BNDM – Backward Non-Deterministic DAWG Matching
BOM – Backward Oracle Matching
String matching of one pattern
CTACTACTACGTCTATACTGATCGTAGC
TACTACGGTATGACTAA
• Factor search
• Prefix search
• Suffix search
[figure: the three scanning strategies (1, 2, 3) within the window]
Multiple patterns
[figure: text S and a set of patterns {P}]
Why?
• Multiple patterns
• Highlight multiple different search words on a page
• Virus detection – filter for virus signatures
• Spam filters
• A scanner in a compiler needs to search for multiple keywords
• Filter out stop words or disallowed words
• Intrusion detection software
• Next-generation sequencing produces huge amounts (many millions) of short reads (20-100 bp) that need to be mapped to a genome!
• …
Algorithms
• Aho-Corasick (search for multiple words)
 – Generalization of Knuth-Morris-Pratt
• Commentz-Walter
 – Generalization of Boyer-Moore & AC
• Wu and Manber
 – Improvement over C-W
• Additional methods, tricks and techniques
Aho-Corasick (AC)
• Alfred V. Aho and Margaret J. Corasick (Bell Labs, Murray Hill, NJ): Efficient string matching. An aid to bibliographic search. Communications of the ACM, Volume 18, Issue 6, p. 333-340 (June 1975)
• ACM:DOI PDF
• ABSTRACT This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.
• Generalization of KMP for many patterns
• Text S as before.
• Set of patterns P = { P1, ..., Pk }
• Total length |P| = m = Σ_{i=1..k} m_i
• Problem: find all occurrences of any of the Pi ∈ P in S
Idea
1. Create an automaton from all patterns
2. Match the automaton
• Use the PATRICIA trie for creating the main structure of the automaton
PATRICIA trie
• D. R. Morrison, “PATRICIA: Practical Algorithm To Retrieve Information Coded In Alphanumeric”, Journal of the ACM 15 (1968) 514-534.
• Abstract: PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, which is economical of index space and of reindexing time. It does not require rearrangement of text or index as new material is added. It requires a minimum restriction of format of text and of keys; it is extremely flexible in the variety of keys it will respond to. It retrieves information in response to keys furnished by the user with a quantity of computation which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several variations as FORTRAN programs for the CDC-3600, utilizing disk file storage of text. It has been applied to several large information-retrieval problems and will be applied to others.
• ACM:DOI PDF
• Word trie – a good data structure to represent a set of words (e.g. a dictionary).
• trie (data structure)
• Definition: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
• See also digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree.
• Note: The name comes from retrieval and is pronounced “tree”.
• To test for a word p, only O(|p|) time is used, no matter how many words are in the dictionary...
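A minimal word-trie sketch in C over the alphabet 'a'..'z'; the types and names are ours, for illustration only:

```c
#include <stdlib.h>

/* One node per common prefix; a flag marks complete words. */
typedef struct Trie {
    struct Trie *child[26];
    int is_word;
} Trie;

static Trie *trie_new(void) { return calloc(1, sizeof(Trie)); }

static void trie_insert(Trie *t, const char *w) {
    for (; *w; w++) {
        int c = *w - 'a';
        if (!t->child[c]) t->child[c] = trie_new();
        t = t->child[c];
    }
    t->is_word = 1;
}

/* O(|p|) lookup, independent of the number of stored words. */
static int trie_contains(const Trie *t, const char *p) {
    for (; *p; p++) {
        t = t->child[*p - 'a'];
        if (!t) return 0;
    }
    return t->is_word;
}
```

With the example set {he, she, his, hers}: "he" is found, "hi" is only a prefix (no word flag), and "sheila" falls off the trie after "she".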
Trie for P={he, she, his, hers}
[figure, built up incrementally; state 0 is the root:
 0 –h→ 1 –e→ 2  (he)
 0 –s→ 3 –h→ 4 –e→ 5  (she)
 1 –i→ 6 –s→ 7  (his)
 2 –r→ 8 –s→ 9  (hers)]
How do we search for words like he, sheila, hi? Do these occur in the trie?
Aho-Corasick
1. Create an automaton M_P for a set of strings P.
2. Finite state machine: read a character from the text, and change the state of the automaton based on the state transitions...
3. Main links: goto[j,c] – read a character c from the text and go from state j to state goto[j,c].
4. If there is no goto[j,c] link on character c from state j, use fail[j].
5. Report the output: report all words that have been found in state j.
AC automaton (vs KMP)
[figure: the trie above (states 0–9) with fail links added]
goto[1,i] = 6 ; fail[7] = 3, fail[8] = 0, fail[5] = 2
Output table:
state  output[j]
2      he
5      she, he
7      his
9      hers
(the root loops back to itself on every character NOT in { h, s })
AC - matching
Input: Text S[1..n] and an AC automaton M for pattern set P
Output: Occurrences of patterns from P in S (last positions)
1. state = 0
2. for i = 1..n do
3.     while (goto[state,S[i]] == ∅) and (fail[state] != state) do
4.         state = fail[state]
5.     state = goto[state,S[i]]
6.     if ( output[state] not empty )
7.         then report matches output[state] at position i
Algorithm Aho-Corasick preprocessing I (TRIE)
Input: P = { P1, ... , Pk }
Output: goto[] and partial output[]
Assume: output(s) is empty when a state s is created; goto[s,a] is not defined.
procedure enter(a1, ... , am)   /* Pi = a1 ... am */
begin
1. s=0; j=1;
2. while goto[s,aj] ≠ ∅ do    // follow existing path
3.     s = goto[s,aj];
4.     j = j+1;
5. for p=j to m do            // add new path (states)
6.     news = news+1;
7.     goto[s,ap] = news;
8.     s = news;
9. output[s] = a1, ... , am
end
begin
10. news = 0
11. for i=1 to k do enter( Pi )
12. for a ∈ Σ do
13.     if goto[0,a] = ∅ then goto[0,a] = 0;
end
Preprocessing II for AC (FAIL)
queue = ∅
for a ∈ Σ do
    if goto[0,a] ≠ 0 then
        enqueue( queue, goto[0,a] )
        fail[ goto[0,a] ] = 0
while queue ≠ ∅
    r = take( queue )
    for a ∈ Σ do
        if goto[r,a] ≠ ∅ then
            s = goto[ r, a ]
            enqueue( queue, s )     // breadth-first search
            state = fail[r]
            while goto[state,a] = ∅ do state = fail[state]
            fail[s] = goto[state,a]
            output[s] = output[s] + output[ fail[s] ]
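Both preprocessing passes and the matcher fit into a short array-based C sketch. This version also precomputes goto[] completely during the BFS (the "Advanced AC" idea mentioned later), so the matcher never follows fail links at search time. The sizes, names, lowercase-only alphabet, and the bitmask output are our own simplifications:

```c
#include <string.h>

#define MAXS 1000     /* max number of trie states (assumption) */
#define SIGMA 26      /* alphabet 'a'..'z' (assumption) */
static int go[MAXS][SIGMA], fail_[MAXS], out[MAXS], nstates;

static void ac_init(void) {
    nstates = 1;                     /* state 0 is the root */
    memset(go, 0, sizeof go);        /* 0 also doubles as "no edge" */
    memset(out, 0, sizeof out);
}
static void ac_add(const char *p, int id) {        /* preprocessing I */
    int s = 0;
    for (; *p; p++) {
        int c = *p - 'a';
        if (!go[s][c]) go[s][c] = nstates++;
        s = go[s][c];
    }
    out[s] |= 1 << id;               /* bitmask of patterns ending here */
}
static void ac_build(void) {                       /* preprocessing II */
    int queue[MAXS], head = 0, tail = 0;
    for (int c = 0; c < SIGMA; c++)
        if (go[0][c]) { fail_[go[0][c]] = 0; queue[tail++] = go[0][c]; }
    while (head < tail) {            /* breadth-first over the trie */
        int r = queue[head++];
        for (int c = 0; c < SIGMA; c++) {
            int s = go[r][c];
            if (!s) { go[r][c] = go[fail_[r]][c]; continue; } /* full goto */
            fail_[s] = go[fail_[r]][c];
            out[s] |= out[fail_[s]]; /* inherit outputs via fail link */
            queue[tail++] = s;
        }
    }
}
/* Returns the bitmask of all pattern ids that occur in text. */
static int ac_search(const char *text) {
    int s = 0, seen = 0;
    for (; *text; text++) {
        s = go[s][*text - 'a'];
        seen |= out[s];
    }
    return seen;
}
```

Because go[] is completed during the BFS, matching is exactly one table lookup per text character – the O(n) bound with the |M_P|·|Σ| space trade-off discussed below.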
Correctness
• Let string t “point” from the initial state to state j.
• Must show that fail[j] points to the longest suffix of t that is also a prefix of some word in P.
• Look at the article...
AC matching time complexity
• Theorem: For matching M_P against text S, |S|=n, fewer than 2n transitions within M are made.
• Proof: Compare to KMP.
• There are at most n goto steps.
• There cannot be more than n fail steps.
• In total – there can be fewer than 2n transitions in M.
Individual node (goto)
• Full table
• List
• Binary search tree(?)
• Some other index?
AC thoughts
• Scales to many strings simultaneously.
• For very many patterns the search time (of grep) improves(??)
 – See the Wu-Manber article
• When k grows, more fail[] transitions are made (why?)
• But always fewer than n.
• If all goto[j,a] are indexed in an array, then the size is |M_P|·|Σ|, and the running time of AC is O(n).
• When k and c are big, one can use lists or trees for storing the transition functions.
• Then O( n log(min(k,c)) ).
Advanced AC
• Precalculate the next state transition correctly for every possible character of the alphabet
• Can be good for short patterns
Problems of AC?
• Needs to be rebuilt on adding / removing patterns
• Details of branching in each node(?)
Commentz-Walter
• Generalization of Boyer-Moore for multiple-sequence search
• Beate Commentz-Walter: A String Matching Algorithm Fast on the Average. Proceedings of the 6th Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, Vol. 71, 1979, pp. 118-132, Springer-Verlag
• http://www.hs-albsig.de/win/personen/professoren.php?RID=36
• You can download here my algorithm StringMatchingFastOnTheAverage (PDF, ~17.2 MB) or here StringMatchingFastOnTheAverage (extended abstract) (PDF, ~3 MB)
C-W description
• Aho and Corasick [AC75] presented a linear-time algorithm for this problem, based on an automata approach. This algorithm serves as the basis for the UNIX tool fgrep. A linear-time algorithm is optimal in the worst case, but as the regular string-searching algorithm by Boyer and Moore [BM77] demonstrated, it is possible to actually skip a large portion of the text while searching, leading to faster-than-linear algorithms in the average case.
Commentz-Walter [CW79]
• Commentz-Walter [CW79] presented an algorithm for the multi-pattern matching problem that combines the Boyer-Moore technique with the Aho-Corasick algorithm. The Commentz-Walter algorithm is substantially faster than the Aho-Corasick algorithm in practice. Hume [Hu91] designed a tool called gre based on this algorithm, and version 2.0 of fgrep by the GNU project [Ha93] is using it.
• Baeza-Yates [Ba89] also gave an algorithm that combines the Boyer-Moore-Horspool algorithm [Ho80] (which is a slight variation of the classical Boyer-Moore algorithm) with the Aho-Corasick algorithm.
Idea of C-W
• Build a backward trie of all keywords
• Match from the end until a mismatch...
• Determine the shift based on a combination of heuristics
Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG
1. Build the trie of the inverted patterns
[figure: trie of the reversed patterns GTATGTA, GTAT, TAATA, GTGTA]
2. lmin = 4
3. Table of shifts: A 1, C 4 (lmin), G 2, T 1
4. Start the search
The text: ACATGCTATGTGACA…
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
The search then scans the text ACATGCTATGTGACA… window by window, shifting by the table values.
Short shifts!
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
What are the possible limitations of C-W?
• Many patterns, small alphabet – minimal skips
• What can be done differently?
Wu-Manber
• Wu S., and U. Manber, “A Fast Algorithm for Multi-Pattern Searching,” Technical Report TR-94-17, Department of Computer Science, University of Arizona (May 1993).
• Citeseer: http://citeseer.ist.psu.edu/wu94fast.html [Postscript]
• We present a different approach that also uses the ideas of Boyer and Moore. Our algorithm is quite simple, and the main engine of it is given later in the paper. An earlier version of this algorithm was part of the second version of agrep [WM92a, WM92b], although the algorithm has not been discussed in [WM92b] and only briefly in [WM92a]. The current version is used in glimpse [MW94]. The design of the algorithm concentrates on typical searches rather than on worst-case behavior. This allows us to make some engineering decisions that we believe are crucial to making the algorithm significantly faster than other algorithms in practice.
Key idea
• The main problem with Boyer-Moore and many patterns: the more patterns there are, the shorter the possible shifts become...
• Wu and Manber: check several characters simultaneously, i.e. increase the alphabet.
• Instead of looking at text characters one by one, consider them in blocks of size B.
• In theory B = log_c(2M); in practice, use either B = 2 or B = 3.
• The SHIFT table plays the same role as in the regular Boyer-Moore algorithm, except that it determines the shift based on the last B characters rather than just one character.
Horspool to Wu-Manber: how can we increase the length of the shifts?
With a shift table of l-mers for the patterns ATGTATG, TATG, ATAAT, ATGTG:
1 symbol:  A 1, C 4 (lmin), G 2, T 1
2 symbols: AA 1, AT 1, GT 1, TA 2, TG 2; default for all other pairs (AC, AG, CA, CC, CG, …): 3 (lmin-B+1)
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
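The 2-symbol SHIFT table can be computed as sketched below. Conventions differ slightly between presentations; here shift = lmin − (end position of the block within the first lmin pattern characters), so a block that ends exactly at position lmin gets shift 0 and triggers a verification step. The numbers therefore need not coincide exactly with the table above. Constants and names are ours:

```c
#include <string.h>

#define LMIN 4   /* shortest pattern length for this example */
#define B 2      /* block size */

/* SHIFT value for one B-character block, over the first LMIN
   characters of each pattern; default = LMIN - B + 1. */
static int wm_shift(const char **pats, int k, const char *block) {
    int shift = LMIN - B + 1;
    for (int i = 0; i < k; i++)
        for (int j = 0; j + B <= LMIN; j++)
            if (strncmp(pats[i] + j, block, B) == 0) {
                int s = LMIN - (j + B);   /* distance of block end to lmin */
                if (s < shift) shift = s;
            }
    return shift;
}
```

With the example patterns, "GT" ends a length-lmin prefix (ATGT), so its shift is 0 (candidate match, verify); "AT" allows a shift of 1; an absent block like "CC" gets the default 3.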
Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
[figure: trie of the reversed patterns]
into the text: ACATGCTATGTGACATAATA
2-symbol shifts: AA 1, AT 1, GT 1, TA 2, TG 2
Experimental block length: B ≈ log_|Σ|(2·lmin·r)
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
Backward Oracle
• Set Backward Oracle Matching: SBDM, SBOM
• Pages 68-72
String matching of many patterns
[figure: experimental comparison of Wu-Manber vs SBOM for 5, 10, 100 and 1000 patterns; x-axis: lmin = 5…45, y-axis: |Σ| = 2, 4, 8]
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
Factor Oracle: safe shift
Factor Oracle: shift to match a prefix of P2?
[figure: factor oracle]
Construction of the factor oracle
• Allauzen, C., Crochemore, M., and Raffinot, M. 1999. Factor Oracle: A New Structure for Pattern Matching. In Proceedings of the 26th Conference on Current Trends in Theory and Practice of Informatics (November 27 - December 04, 1999). J. Pavelka, G. Tel, and M. Bartosek, Eds. Lecture Notes in Computer Science, vol. 1725. Springer-Verlag, London, 295-310.
• http://portal.acm.org/citation.cfm?id=647009.712672
• http://www-igm.univ-mlv.fr/~allauzen/work/sofsem.ps
So far
• Generalised KMP -> Aho-Corasick
• Generalised Horspool -> Commentz-Walter, Wu-Manber
• BDM, BOM -> Set Backward Oracle Matching…
• Other generalisations?
Multiple Shift-AND
• P={P1, P2, P3, P4}. Generalize Shift-AND.
• Bits = the bit vectors of P1, P2, P3, P4 concatenated into one word
• Start = 1 at the first bit position of each pattern (1 1 1 1)
• Match = 1 at the last bit position of each pattern (1 1 1 1)
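A C sketch of this generalization, packing all patterns into one 64-bit word (it assumes the total pattern length is ≤ 64; the function name is ours):

```c
#include <stdint.h>
#include <string.h>

/* Multi-pattern Shift-AND: Start has a 1 at each pattern's first bit,
   Match at each pattern's last bit. Returns 1 if any pattern occurs. */
int multi_shift_and(const char *s, const char **pats, int k) {
    uint64_t mask[256] = {0}, start = 0, match = 0, state = 0;
    int bit = 0, found = 0;
    for (int i = 0; i < k; i++) {            /* concatenate bit vectors */
        int m = strlen(pats[i]);
        start |= (uint64_t)1 << bit;
        match |= (uint64_t)1 << (bit + m - 1);
        for (int j = 0; j < m; j++)
            mask[(unsigned char)pats[i][j]] |= (uint64_t)1 << (bit + j);
        bit += m;
    }
    for (; *s; s++) {
        state = ((state << 1) | start) & mask[(unsigned char)*s];
        if (state & match) found = 1;        /* some pattern ends here */
    }
    return found;
}
```

A carry from one pattern's last bit into the next pattern's first bit is harmless: that first bit is OR-ed in via Start at every step anyway.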
Text Algorithms (6EAP)
Full text indexing
Jaak Vilo 2010 fall
Problem
• Given P and S – find all exact or approximate occurrences of P in S
• You are allowed to preprocess S (and P, of course)
• Goal: to speed up the searches
E.g. dictionary problem
• Does P belong to a dictionary D={d1,…,dn}?
 – Build a binary search tree of D
 – B-tree of D
 – Hashing
 – Sorting + binary search
• Build a keyword trie: search in O(|P|)
 – Assuming the alphabet has up to a constant size c
 – See the Aho-Corasick algorithm, trie construction
Sorted array and binary search
1..13: global, happy, he, head, header, hers, his, index, info, informal, search, show, stop
O( |P| log n )
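The O(|P| log n) lookup is just a binary search over the sorted word array; in C it is a thin wrapper around bsearch (names ours):

```c
#include <stdlib.h>
#include <string.h>

/* Comparator over array elements of type const char *. */
static int cmp(const void *a, const void *b) {
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* Dictionary membership via binary search: O(|P| log n) char comparisons. */
int dict_contains(const char **dict, size_t n, const char *word) {
    return bsearch(&word, dict, n, sizeof *dict, cmp) != NULL;
}
```

Each of the log n probes compares up to |P| characters, which the trie below avoids.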
Trie for D={ he, hers, his, she }
[figure: trie with states 0–9, as in the Aho-Corasick example]
O( |P| )
S != set of words
• S of length n
• How to index?
• Index from every position of the text
• The prefix of every possible suffix is important
Trie(babaab)
[figure: suffix trie of babaab; its suffixes are babaab, abaab, baab, aab, ab, b]
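The idea "index every suffix" can be prototyped without any tree: sort the suffix start positions and binary-search the pattern as a prefix of some suffix. This is the naive O(n² log n)-to-build suffix array; the suffix trees and suffix arrays below achieve the same queries with far better construction bounds. A sketch (names ours):

```c
#include <stdlib.h>
#include <string.h>

static const char *text_g;   /* text shared with the qsort comparator */
static int suf_cmp(const void *a, const void *b) {
    return strcmp(text_g + *(const int *)a, text_g + *(const int *)b);
}

/* Build a suffix array naively, then binary-search pat as a
   prefix of a suffix. Returns some occurrence position, or -1. */
int index_and_find(const char *text, const char *pat) {
    int n = strlen(text), m = strlen(pat);
    int *sa = malloc(n * sizeof *sa);
    for (int i = 0; i < n; i++) sa[i] = i;
    text_g = text;
    qsort(sa, n, sizeof *sa, suf_cmp);       /* sort all suffixes */
    int lo = 0, hi = n - 1, found = -1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = strncmp(text + sa[mid], pat, m);
        if (c == 0) { found = sa[mid]; break; }
        if (c < 0) lo = mid + 1; else hi = mid - 1;
    }
    free(sa);
    return found;
}
```

For T = babaab the sorted suffix order is aab, ab, abaab, b, baab, babaab, and searching "aab" lands on position 3.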
Suffix tree
• Definition: A compact representation of a trie corresponding to the suffixes of a given string, where all nodes with one child are merged with their parents.
• Definition (suffix tree). A suffix tree T for a string S (with n = |S|) is a rooted, labeled tree with a leaf for each non-empty suffix of S. Furthermore, a suffix tree satisfies the following properties:
• Each internal node, other than the root, has at least two children;
• Each edge leaving a particular node is labeled with a non-empty substring of S, of which the first symbol is unique among all first symbols of the edge labels of the edges leaving this particular node;
• For any leaf in the tree, the concatenation of the edge labels on the path from the root to this leaf exactly spells out a non-empty suffix of S.
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr; ISBN: 0521585198.
Literature on suffix trees
• http://en.wikipedia.org/wiki/Suffix_tree
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr; ISBN: 0521585198. (pages 89-208)
• E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-60, 1995. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751
• Ching-Fung Cheung, Jeffrey Xu Yu, Hongjun Lu. “Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, pp. 90-105, January 2005. http://www2.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.3
• CPM articles archive: http://www.cs.ucr.edu/~stelo/cpm/
• Mark Nelson. Fast String Searching With Suffix Trees. Dr. Dobb’s Journal, August 1996. http://www.dogma.net/markn/articles/suffixt/suffixt.htm
• STXXL: Standard Template Library for Extra Large Data Sets. http://stxxl.sourceforge.net/
• ANSI C implementation of a Suffix Tree: http://yeda.cs.technion.ac.il/~yona/suffix_tree/
These are old links – check for newer…
http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english
The suffix tree Tree(T) of T
• the data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T
• linear size: |Tree(T)| = O(|T|)
• can be constructed in linear time O(|T|)
• has myriad virtues (A. Apostolico)
• is well-known: 366 000 Google hits
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
Suffix trie and suffix tree
Trie(abaab)  vs  Tree(abaab)
[figure: the suffix trie of abaab and its compacted suffix tree; suffixes: abaab, baab, aab, ab, b]
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
Suffix tree and suffix array techniques for pattern analysis in strings
Esko Ukkonen, Univ Helsinki
Erice School, 30 Oct 2005
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
Algorithms for combinatorial string matching?
• deep beauty? +-
• shallow beauty? +
• applications? ++
• intensive algorithmic miniatures
• sources of new problems: text processing, DNA, music, …
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
High-‐throughput genome-‐scale sequence analysis and mapping
using compressed data structures Veli Mäkinen
Department of Computer Science University of Helsinki
146
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
147
Analysis of a string of symbols
• T = h a t t i v a t t i   'text'
• P = a t t   'pattern'
• Find the occurrences of P in T: h a t t i v a t t i
• Pattern synthesis: #(t) = 4, #(att) = 2, #(t****t) = 2
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 148
Solution: backtracking with suffix tree
...ACACATTATCACAGGCATCGGCATTAGCGATCGAGTCG.....
149
Pattern finding & synthesis problems
• T = t1t2 … tn, P = p1p2 … pm, strings of symbols in a finite alphabet
• Indexing problem: Preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast – static text, any given pattern P
• Pattern synthesis problem: Learn from T new patterns that occur surprisingly often
• What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, … 150
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
151
The suffix tree Tree(T) of T
• the data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T
• linear size: |Tree(T)| = O(|T|)
• can be constructed in linear time O(|T|)
• has myriad virtues (A. Apostolico)
• is well-known: 366 000 Google hits
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
152
Suffix trie and suffix tree
[Figure: the suffix trie Trie(abaab) and the suffix tree Tree(abaab); suffixes: abaab, baab, aab, ab, b]
Trie(abaab) Tree(abaab)
153
Trie(T) can be large
• |Trie(T)| = O(|T|^2) • bad example: T = a^n b^n
• Trie(T) can be seen as a DFA: language accepted = the suffixes of T
• minimize the DFA => directed acyclic word graph ('DAWG')
154
Tree(T) is of linear size
• only the internal branching nodes and the leaves are represented explicitly
• edges labeled by substrings of T • v = node(α) if the path from root to v spells α • one-‐to-‐one correspondence of leaves and suffixes
• |T| leaves, hence < |T| internal nodes • |Tree(T)| = O(|T| + size(edge labels))
155
Tree(hattivatti)
[Figure: suffix tree of hattivatti; root-to-leaf paths spell the suffixes hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i]
156
Tree(hattivatti)
[Figure: the same suffix tree of hattivatti, drawn again]
substring labels of edges represented as pairs of pointers
157
Tree(hattivatti)
[Figure: suffix tree of hattivatti with edge labels stored as pointer pairs into the text, e.g. (6,10) = vatti, (2,5) = atti, (4,5) = ti, (3,3) = t; leaves numbered 1-10 by suffix start position]
158
Tree(T) is a full text index
[Figure: pattern P traced from the root of Tree(T); the leaves below it give the occurrences, e.g. at locations 8, 31, …]
P occurs in T ⇔ P is a prefix of some suffix of T ⇔ a path for P exists in Tree(T)
All occurrences of P in time O(|P| + #occ)
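As an illustration of this O(|P| + #occ) query, a minimal Python sketch using a plain suffix trie (the function names are ad hoc; the trie takes quadratic space, whereas a real suffix tree keeps edge labels as pointer pairs to stay linear):

```python
# Sketch: full-text index as a suffix trie (O(|T|^2) space, for illustration only).

def build_suffix_trie(text):
    root = {}
    for i in range(len(text)):              # insert every suffix text[i:]
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
            node.setdefault('$occ', []).append(i + 1)   # 1-based start positions
    return root

def occurrences(trie, pattern):
    node = trie
    for ch in pattern:                      # walk down: O(|P|)
        if ch not in node:
            return []
        node = node[ch]
    return sorted(node['$occ'])             # suffix starts below this node

trie = build_suffix_trie("hattivatti")
print(occurrences(trie, "att"))             # -> [2, 7]
```

The query itself is exactly the slide's idea: walk the pattern down from the root, then report every suffix start stored below the node reached.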
159
Find att from Tree(hattivatti)
[Figure: the path a-t-t traced from the root; the leaves below it, numbered 2 and 7, are the occurrences of att]
160
Linear time construction of Tree(T)
hattivatti
attivatti
ttivatti
tivatti
ivatti
vatti
atti
tti
ti
i
Weiner (1973),
’algorithm of the year’
McCreight (1976)
’on-line’ algorithm (Ukkonen 1992)
161
On-line construction of Trie(T)
• T = t1t2 … tn$
• Pi = t1t2 … ti, the i-th prefix of T
• on-line idea: update Trie(Pi) to Trie(Pi+1)
• => very simple construction
162
Trie(abaab)
[Figure: construction stages Trie(a), Trie(ab), Trie(aba); a chain of suffix links connects the end points of the current suffixes]
163
Trie(abaab)
[Figure: stage Trie(abaa)]
164
Trie(abaab)
[Figure: stage Trie(abaa); add next symbol = b]
165
Trie(abaab)
[Figure: stage Trie(abaa); adding b stops where a b-arc already exists]
166
Trie(abaab)
[Figure: the completed Trie(abaab)]
167
What happens in Trie(Pi) => Trie(Pi+1) ?
[Figure: Trie(Pi) before and after adding symbol ai: new nodes and new suffix links are created along the suffix-link chain; from the first node that already has an ai-arc onwards nothing changes => stop updating there]
168
What happens in Trie(Pi) => Trie(Pi+1) ?
• time: O(size of Trie(T))
• suffix links:
slink(node(aα)) = node(α)
169
On-‐line procedure for suffix trie
1. Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix links slink(v) := root and slink(root) := root
2. for i := 2 to n do begin
3.   vi-1 := leaf of Trie(t1…ti-1) for string t1…ti-1 (i.e., the deepest leaf)
4.   v := vi-1; v´ := 0
5.   while node v has no outgoing arc for ti do begin
6.     Create a new node v´´ and an arc son(v, ti) = v´´
7.     if v´ ≠ 0 then slink(v´) := v´´
8.     v := slink(v); v´ := v´´ end
9.   for the node v´´ such that v´´ = son(v, ti) do: if v´´ = v´ then slink(v´) := root else slink(v´) := v´´
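A minimal Python transcription of the procedure above (class and variable names are ad hoc; the pseudocode's v, v´, v´´ roles become the variables v, prev, w, and the deepest leaf is carried over between rounds):

```python
# Sketch of the on-line suffix trie construction with suffix links.

class Node:
    def __init__(self):
        self.son = {}       # symbol -> child node
        self.slink = None   # suffix link

def online_suffix_trie(t):
    root = Node()
    root.slink = root
    deepest = root                      # leaf for the whole current prefix
    for c in t:
        v, prev, first = deepest, None, None
        while c not in v.son:           # follow the suffix-link chain
            w = Node()                  # create a new node and a c-arc
            v.son[c] = w
            if first is None:
                first = w               # becomes the new deepest leaf
            if prev is not None:
                prev.slink = w          # link the previously created node
            prev, v = w, v.slink
        w = v.son[c]                    # first node that already had a c-arc
        if prev is not None:
            prev.slink = root if w is prev else w
        deepest = first
    return root

def node_count(v):
    return 1 + sum(node_count(c) for c in v.son.values())

trie = online_suffix_trie("abaab")
print(node_count(trie))   # -> 12 (root + the 11 distinct substrings of abaab)
```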
170
Suffix trees on-‐line
• 'compacted version' of the on-line trie construction: simulate the construction on the linear-size tree instead of the trie => time O(|T|)
• all trie nodes are conceptually still needed => implicit and real nodes
171
Implicit and real nodes
• Pair (v,α) is an implicit node in Tree(T) if v is a node of Tree and α is a (proper) prefix of the label of some arc from v. If α is the empty string then (v, α) is a ’real’ node (= v).
• Let v = node(α´) in Tree(T). Then implicit node (v, α) represents node(α´α) of Trie(T)
172
Implicit node
[Figure: implicit node (v, α): v = node(α´) is a real node, and α is a prefix of the label of an arc leaving v]
173
Suffix links and open arcs
[Figure: suffix link from v = node(aα) to node(α); leaf arcs are kept "open": label [i,*] instead of [i,j], where j is the scanned position of T]
174
Big picture
[Figure: big picture of the construction: the suffix link path traversed costs total work O(n); the new arcs and nodes created cost total work O(size(Tree(T)))]
02.01.15
30
175
On-‐line procedure for suffix tree
Input: string T = t1t2 … tn$
Output: Tree(T)
Notation: son(v,α) = w iff there is an arc from v to w with label α
son(v,ε) = v
Function Canonize(v, α):
  while son(v, α´) ≠ 0, where α = α´α´´ and |α´| > 0, do
    v := son(v, α´); α := α´´
  return (v, α)
176
Suffix-‐tree on-‐line: main procedure
Create Tree(t1); slink(root) := root
(v, α) := (root, ε)   /* (v, α) is the start node */
for i := 2 to n+1 do
  v´ := 0
  while there is no arc from v with label prefix αti do
    if α ≠ ε then   /* divide the arc w = son(v, αη) into two */
      son(v, α) := v´´; son(v´´, ti) := v´´´; son(v´´, η) := w
    else
      son(v, ti) := v´´´; v´´ := v
    if v´ ≠ 0 then slink(v´) := v´´
    v´ := v´´; v := slink(v); (v, α) := Canonize(v, α)
  if v´ ≠ 0 then slink(v´) := v
  (v, α) := Canonize(v, αti)   /* (v, α) = start node of the next round */
http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english
178
The actual time and space
• |Tree(T)| is about 20|T| in practice
• brute-force construction is O(|T| log |T|) for random strings, as the average depth of internal nodes is O(log |T|)
• the difference between linear and brute-force constructions is not necessarily large (Giegerich & Kurtz)
• truncated suffix trees: a k-symbol prefix of each suffix is represented (Na et al. 2003)
• alphabet-independent linear time (Farach 1997)
Applications of Suffix Trees
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. - book
• APL1: Exact string matching. Search for P in text S. Solution 1: build STree(S) - one achieves the same O(n+m) as Knuth-Morris-Pratt, for example!
• Search from the suffix tree is O(|P|)
• APL2: Exact set matching. Search for a set of patterns P
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 186
Back to backtracking
[Figure: a suffix tree over the alphabet {A,C,T} with leaves 4 2 1 5 3 6; searching for ACA with 1 mismatch backtracks through the branches of the tree]
The same idea can be used for many other forms of approximate search, such as Smith-Waterman, position-restricted scoring matrices, regular expression search, etc.
Applications of Suffix Trees
• APL3: Substring problem for a database of patterns. Given a set of strings S = S1, ..., Sn (a database), find all Si that have P as a substring
• A generalized suffix tree contains all suffixes of all Si
• Query in time O(|P|), and can identify the LONGEST common prefix of P in all Si
Applications of Suffix Trees
• APL4: Longest common substring of two strings
• Find the longest common substring of S and T.
• Overall there are potentially O(n^2) such substrings, if n is the length of the shorter of S and T
• Donald Knuth once (1970) conjectured that a linear-time algorithm is impossible.
• Solution: construct STree(S+T) and find the deepest node in the tree that has suffixes from both S and T among its subtree leaves.
• Ex: S = superiorcalifornialives and T = sealiver have both a substring xxxxxx.
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 189
Simple analysis task: LCSS
• Let LCSS(A,B) denote the longest common substring of two sequences A and B. E.g.: – LCSS(AGATCTATCT, CGCCTCTATG) = TCTAT.
• A good solution is to build a suffix tree for the shorter sequence and make a descending suffix walk with the other sequence.
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 190
Suffix link
[Figure: suffix link from node(aX) to node(X)]
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 191
Descending suffix walk
suffix tree of A. Read B left-to-right, always going down in the tree when possible. If the next symbol of B does not match any edge label at the current position, take the suffix link and try again. (The suffix link from the root to itself consumes a symbol.) The node v encountered with the largest string depth is the solution.
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 192
Another common tool: Generalized suffix tree
ACCTTA....ACCT#CACATT..CAT#TGTCGT...GTA#TCACCACC...C$
[Figure: a node of the generalized suffix tree, with node info: subtree size 47813871, sequence count 87]
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 193
Generalized suffix tree application
...ACC..#...ACC...#...ACC...ACC..ACC..#..ACC..ACC...#...ACC...#...
...#....#...#...#...ACC...#...#...#...#...#...#..#..ACC..ACC...#......#...
[Figure: the node for ACC in the generalized suffix tree, with node info: subtree size 4398, blue sequences 12/15, red sequences 2/62, …]
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 194
Case study continued
[Figure: genome with regions of ChIP-seq matches; in the suffix tree of the genome, a node with 5 blue and 1 red occurrences below the path TAC..........T: a motif?]
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 195
Properties of suffix tree
• A suffix tree has n leaves and at most n-1 internal nodes, where n is the total length of all sequences indexed.
• Each node requires a constant number of integers (pointers to first child, sibling, parent, text range of the incoming edge, statistics counters, etc.).
• Can be constructed in linear time.
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 196
Properties of suffix tree... in practice
• Huge overhead due to the pointer structure:
– A standard implementation of the suffix tree for the human genome requires over 200 GB of memory!
– A careful implementation (using log n-bit fields for each value and an array layout for the tree) still requires over 40 GB.
– The human genome itself takes less than 1 GB using 2 bits per bp.
Applications of Suffix Trees
• APL4: Longest common substring of two strings
• Find the longest common substring of S and T.
• Overall there are potentially O(n^2) such substrings, if n is the length of the shorter of S and T
• Donald Knuth once (1970) conjectured that a linear-time algorithm is impossible.
• Solution: construct STree(S+T) and find the deepest node in the tree that has suffixes from both S and T among its subtree leaves.
• Ex: S = superiorcalifornialives and T = sealiver have both the substring alive.
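The same answer can be computed with sorted suffixes instead of STree(S+T): the maximal cross-string LCP is always attained by two suffixes adjacent in sorted order. A hedged sketch (the separator '#' is assumed not to occur in either string; naive slicing, so this is quadratic, for illustration only):

```python
# Sketch: longest common substring via sorted suffixes of S + '#' + T.

def lcs(s, t):
    sep = '#'                            # assumed absent from s and t
    u = s + sep + t
    suffixes = sorted(range(len(u)), key=lambda i: u[i:])
    best = ''
    for a, b in zip(suffixes, suffixes[1:]):
        # keep only adjacent pairs where one suffix starts in s, the other in t
        if (a < len(s)) == (b < len(s)):
            continue
        k = 0                            # longest common prefix of the pair
        while (a + k < len(u) and b + k < len(u)
               and u[a + k] == u[b + k] and u[a + k] != sep):
            k += 1
        if k > len(best):
            best = u[a:a + k]
    return best

print(lcs("superiorcalifornialives", "sealiver"))   # -> alive
```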
Applications of Suffix Trees
• APL5: Recognizing DNA contamination. Related to DNA sequencing: search for the longest strings (longer than a threshold) that are present in a DB of sequences of other genomes.
• APL6: Common substrings of more than two strings. A generalization of APL4; can be done in time linear in the total length of all strings
Applications of Suffix Trees
• APL7: Building a directed graph for exact matching: the suffix graph - a directed acyclic word graph (DAWG), the smallest finite state automaton recognizing all suffixes of a string S. This automaton can recognize membership, but not tell which suffix was matched.
• Construction: merge isomorphic subtrees.
• Subtrees of the suffix tree are isomorphic when a suffix link path exists between them and the subtrees have an equal number of leaves.
Applications of Suffix Trees
• APL8: A reverse role for suffix trees, and major space reduction. Index the pattern, not the text...
• Matching statistics.
• APL10: All-pairs suffix-prefix matching. For all pairs Si, Sj, find the longest matching suffix-prefix pair. Used in shortest common superstring generation (e.g. DNA sequence assembly), EST alignment, etc.
Applications of Suffix Trees
• APL11: Finding all maximal repetitive structures in linear time
• APL12: Circular string linearization. E.g. for circular chemical molecules in a database, one wants to linearize them in a canonical way...
• APL13: Suffix arrays - more space reduction; we will touch on that separately
Applications of Suffix Trees
• APL14: Suffix trees in genome-‐scale projects • APL15: A Boyer-‐Moore approach to exact set matching
• APL16: Ziv-‐Lempel data compression • APL17: Minimum length encoding of DNA
Applications of Suffix Trees
• Additional applications. Mostly exercises...
• Extra feature: CONSTANT-time lowest common ancestor (LCA) retrieval
  A data structure that allows finding the lowest common ancestor of two nodes in constant time (this corresponds to the longest common prefix!) can be built in linear time.
• APL: Longest common extension: a bridge to inexact matching
• APL: Finding all maximal palindromes in linear time
  A palindrome reads the same from the central position to the left and to the right. E.g.: kirik, saippuakivikauppias.
• Build the suffix tree of S and reversed S (aabcbad => aabcbad#dabcbaa) and, using LCA, one can ask for any position pair (i, 2i-1) the longest common prefix in constant time.
• The whole problem can be solved in O(n).
Applications of Suffix Trees
• APL: Exact matching with wild cards
• APL: The k-mismatch problem
• Approximate palindromes and repeats
• Faster methods for tandem repeats
• A linear-time solution to the multiple common substring problem
• And many, many more ...
205
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs
206
Suffixes -‐ sorted
• Sort all suffixes. This allows binary search!
suffixes: hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i ε
sorted: ε atti attivatti hattivatti i ivatti ti tivatti tti ttivatti vatti
207
Suffix array: example
• suffix array = lexicographic order of the suffixes
suffixes: hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i ε
sorted: ε atti attivatti hattivatti i ivatti ti tivatti tti ttivatti vatti
i:  1  2  3  4  5  6  7  8  9  10  11
SA: 11  7  2  1  10  5  9  4  8  3  6
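The sorting view can be written directly in Python (a naive O(n^2 log n) construction; the linear-time algorithms come later in these slides; the empty suffix is included, as on the slide):

```python
# Sketch: suffix array by plain sorting, 1-based positions as in the example.

def suffix_array(t):
    return [i + 1 for i in sorted(range(len(t) + 1), key=lambda i: t[i:])]

print(suffix_array("hattivatti"))   # -> [11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6]
```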
208
Suffix array construction: sort!
• suffix array = lexicographic order of the suffixes
suffixes: hattivatti attivatti ttivatti tivatti ivatti vatti atti tti ti i ε
SA: 11 7 2 1 10 5 9 4 8 3 6
209
Suffix array
• suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
• space requirement: 5|T|
• practitioners like suffix arrays (simplicity, space efficiency)
• theoreticians like suffix trees (explicit structure)
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 210
Reducing space: suffix array
[Figure: the suffix tree of the text C A T A C T (positions 1-6) with edge labels written as text ranges, e.g. =[3,3], =[2,2], =[4,6], =[2,6], =[3,6], =[5,6]; reading the leaves left to right gives the suffix array]
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 211
Suffix array
• Many algorithms on the suffix tree can be simulated using the suffix array...
– ... and a couple of additional arrays...
– ... forming the so-called enhanced suffix array...
– ... leading to a space requirement similar to a careful implementation of the suffix tree
• Not a satisfactory solution to the space issue.
212
Pattern search from suffix array
suffixes sorted: ε atti attivatti hattivatti i ivatti ti tivatti tti ttivatti vatti
SA: 11 7 2 1 10 5 9 4 8 3 6
binary search for att finds the contiguous SA interval of suffixes that start with att
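A hedged sketch of that binary search (naive suffix slicing for clarity; `left` and `right` bound the SA interval of suffixes starting with p, and the function name is ad hoc):

```python
# Sketch: all occurrences of pattern p via binary search on the suffix array
# (SA holds 1-based start positions, with the empty suffix included).

def sa_interval(t, sa, p):
    # lower bound: first suffix >= p
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid] - 1:] < p:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # upper bound: first suffix (at or after left) not starting with p
    lo, hi = left, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid] - 1:].startswith(p):
            lo = mid + 1
        else:
            hi = mid
    return left, lo

t = "hattivatti"
sa = [11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6]
left, right = sa_interval(t, sa, "att")
print(sorted(sa[left:right]))   # -> [2, 7]
```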
ISMB 2009 Tutorial Veli Mäkinen: "...analysis and mapping..." 213
What will we learn today?
• It is possible to replace suffix trees with compressed suffix trees that take 8.8 GB for the human genome.
• Backtracking can be done using compressed suffix arrays requiring only 2.1 GB for the human genome.
• Discovering interesting motif seeds from the human genome takes 40 hours and requires 9.3 GB of space.
214
Recent suffix array constructions
• Manber & Myers (1990): O(|T| log |T|)
• linear time via the suffix tree
• January / June 2003: direct linear-time construction of the suffix array
– Kim, Sim, Park, Park (CPM03)
– Kärkkäinen & Sanders (ICALP03)
– Ko & Aluru (CPM03)
215
Kärkkäinen-‐Sanders algorithm
1. Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.
2. Construct the suffix array of the remaining suffixes using the result of the first step.
3. Merge the two suffix arrays into one.
216
Notation
• string T = T[0,n) = t0t1 … tn-1
• suffix Si = T[i,n) = ti ti+1 … tn-1
• for C ⊆ [0,n]: SC = {Si | i ∈ C}
• the suffix array SA[0,n] of T is a permutation of [0,n] satisfying SSA[0] < SSA[1] < … < SSA[n]
217
Running example
• T[0,n) = y a b b a d a b b a d o   (positions 0 1 2 3 4 5 6 7 8 9 10 11), padded with 0 0 …
• SA = (12,1,6,4,9,3,8,2,7,5,10,11,0)
218
Step 0: Construct a sample
• for k = 0,1,2: Bk = {i ∈ [0,n] | i mod 3 = k}
• C = B1 ∪ B2   sample positions
• SC   sample suffixes
• Example: B1 = {1,4,7,10}, B2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}
219
Step 1: Sort sample suffixes
• for k = 1,2, construct Rk = [tk tk+1 tk+2][tk+3 tk+4 tk+5] … [tmaxBk tmaxBk+1 tmaxBk+2]
• R = R1 ^ R2, the concatenation of R1 and R2. The suffixes of R correspond to SC: suffix [ti ti+1 ti+2]… corresponds to Si; the correspondence is order preserving.
• Sort the suffixes of R: radix sort the characters and rename them with ranks to obtain R´. If all characters are different, their order directly gives the order of the suffixes. Otherwise, sort the suffixes of R´ recursively with Kärkkäinen-Sanders. Note: |R´| = 2n/3.
220
Step 1 (cont.)
• once the sample suffixes are sorted, assign a rank to each: rank(Si) = the rank of Si in SC; rank(Sn+1) = rank(Sn+2) = 0
• Example: R = [abb][ada][bba][do0][bba][dab][bad][o00]
  R´ = (1,2,4,6,4,5,3,7)
  SAR´ = (8,0,1,6,4,2,5,3,7)
  rank(Si): - 1 4 - 2 6 - 5 3 - 7 8 - 0 0
221
Step 2: Sort nonsample suffixes
• for each non-sample Si ∈ SB0 (note that rank(Si+1) is always defined for i ∈ B0):
  Si ≤ Sj ⇔ (ti, rank(Si+1)) ≤ (tj, rank(Sj+1))
• radix sort the pairs (ti, rank(Si+1)).
• Example: S12 < S6 < S9 < S3 < S0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
222
Step 3: Merge
• merge the two sorted sets of suffixes using standard comparison-based merging:
• to compare Si ∈ SC with Sj ∈ SB0, distinguish two cases:
• i ∈ B1: Si ≤ Sj ⇔ (ti, rank(Si+1)) ≤ (tj, rank(Sj+1))
• i ∈ B2: Si ≤ Sj ⇔ (ti, ti+1, rank(Si+2)) ≤ (tj, tj+1, rank(Sj+2))
• note that the ranks are defined in all cases!
• S1 < S6 as (a,4) < (a,5), and S3 < S8 as (b,a,6) < (b,a,7)
223
Running time O(n)
• excluding the recursive call, everything can be done in linear time
• the recursion is on a string of length 2n/3
• thus the time is given by the recurrence T(n) = T(2n/3) + O(n)
• hence T(n) = O(n)
224
Implementahon
• about 50 lines of C++ • code available e.g. via Juha Kärkkäinen’s home page
225
LCP table
• Longest Common Prefix of successive elements of the suffix array:
• LCP[i] = length of the longest common prefix of suffixes SSA[i] and SSA[i+1]
• build the inverse array SA⁻¹ from SA in linear time
• then the LCP table from SA⁻¹ in linear time (Kasai et al., CPM 2001)
• Oxford English Dictionary http://www.oed.com/
• Example - Word of the Day, "Fourth":
  http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/L7_SuffixTrees/wotd_fourth.html
  http://www.oed.com/cgi/display/wotd
• PAT index - by Gaston Gonnet (also one of the creators of the Maple software, and later a developer of molecular biology software packages)
• The PAT index is essentially a suffix array. To save space, only the first character of every word was indexed
• XML tagging (or SGML, at that time!) was also indexed
• To mark certain fields of XML, bit vectors were used.
• Main concern: improve the speed of search on the CD - minimize random accesses.
• For a slow medium even 15-20 accesses is too slow...
• G. H. Gonnet, R. A. Baeza-Yates, and T. Snider, Lexicographical indices for text: Inverted files vs. PAT trees, Technical Report OED-91-01, Centre for the New OED, University of Waterloo, 1991.
227
Suffix tree vs suffix array
• suffix tree ⇔ suffix array + LCP table
228
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs
229
Substring motifs of string T
• string T = t1 … tn in alphabet A.
• Problem: what are the frequently occurring (ungapped) substrings of T? The longest substring that occurs at least q times?
• Thm: The suffix tree Tree(T) gives complete occurrence counts of all substring motifs of T in O(n) time (although T may have O(n^2) substrings!)
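The same statistic can also be read off the suffix array plus LCP values: any q suffixes adjacent in sorted order share a prefix whose length is the minimum of the q-1 LCPs between them. A sketch (naive LCP computation for brevity; the function name is ad hoc):

```python
# Sketch: longest substring occurring at least q times, via suffix array + LCPs.

def longest_repeat(t, q=2):
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])
    lcp = []                                   # adjacent LCPs, computed naively
    for a, b in zip(sa, sa[1:]):
        k = 0
        while a + k < n and b + k < n and t[a + k] == t[b + k]:
            k += 1
        lcp.append(k)
    best, pos = 0, -1
    for i in range(len(lcp) - (q - 2)):
        m = min(lcp[i:i + q - 1])              # q suffixes sa[i..i+q-1] share m chars
        if m > best:
            best, pos = m, sa[i]
    return t[pos:pos + best] if best else ""

print(longest_repeat("hattivatti", 2))         # -> atti
```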
230
Counting the substring motifs
• internal nodes of Tree(T) ↔ repeating substrings of T
• the number of leaves in the subtree of the node for string P = the number of occurrences of P in T
231
Substring motifs of hattivatti
[Figure: the suffix tree of hattivatti annotated with occurrence counts at internal nodes: t occurs 4 times; atti, tti, ti and i occur 2 times each]
Counts for the O(n) maximal motifs shown 232
Finding repeats in DNA
• human chromosome 3 • the first 48 999 930 bases • 31 min cpu hme (8 processors, 4 GB)
• Human genome: 3x109 bases • Tree(HumanGenome) feasible
233
Longest repeat?
Occurrences at: 28395980, 28401554r Length: 2559 ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaatgctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcatagtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatcaccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgccattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggcttttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgtaagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaagggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagactttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttct
agatatacaatcatgtcatctgcaaacagggacaatttgacttcctcttttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagttagg 234
Ten occurrences?
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt
Length: 277
Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125 In the reversed complement at: 17858493, 41463059, 42431718, 42580925
235
Using suffix trees: plagiarism
• find longest common substring of strings X and Y
• build Tree(X$Y) and find the deepest node which has a leaf pointing into X and another pointing into Y
236
Using suffix trees: approximate matching
• edit distance: insertions, deletions, changes
• STOCKHOLM vs TUKHOLMA
237
String distance/similarity functions
STOCKHOLM vs TUKHOLMA

S T O C K H O L M _
_ T U _ K H O L M A

=> 2 deletions, 1 insertion, 1 change
238
Approximate string matching
• A: S T O C K H O L M
  B: T U K H O L M A
• minimum number of 'mutation' steps: a -> b, a -> ε, ε -> b, …
• dID(A,B) = 5, dLevenshtein(A,B) = 4
• mutation costs => probabilistic modeling
• evaluation by dynamic programming => alignment
239
Dynamic programming

d(i,j) = min( if ai = bj then d(i-1,j-1) else ∞,
              d(i-1,j) + 1,
              d(i,j-1) + 1 )

= distance between the i-prefix of A and the j-prefix of B (substitution excluded)

[Figure: the m×n table d; cell d(i,j) is computed from d(i-1,j-1), d(i-1,j) and d(i,j-1), the latter two adding +1; the answer is d(m,n)]
240
A\B      s  t  o  c  k  h  o  l  m
      0  1  2  3  4  5  6  7  8  9
t     1  2  1  2  3  4  5  6  7  8
u     2  3  2  3  4  5  6  7  8  9
k     3  4  3  4  5  4  5  6  7  8
h     4  5  4  5  6  5  4  5  6  7
o     5  6  5  4  5  6  5  4  5  6
l     6  7  6  5  6  7  6  5  4  5
m     7  8  7  6  7  8  7  6  5  4
a     8  9  8  7  8  9  8  7  6  5

d(i,j) = min( if ai = bj then d(i-1,j-1) else ∞, d(i-1,j) + 1, d(i,j-1) + 1 )

dID(A,B) = 5 (the bottom-right cell); the optimal alignment is recovered by trace-back
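The recurrence and the table can be checked with a short program (d_id is an ad hoc name for the slide's insertion/deletion distance: a match is free, substitution is excluded, insert and delete cost 1):

```python
# Sketch of the dID dynamic program from the table above.

def d_id(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(d[i - 1][j] + 1, d[i][j - 1] + 1)
            if a[i - 1] == b[j - 1]:      # match: free diagonal step
                best = min(best, d[i - 1][j - 1])
            d[i][j] = best
    return d[m][n]

print(d_id("tukholma", "stockholm"))      # -> 5, as in the table
```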
241
Search problem
• find approximate occurrences of pattern P in text T: substrings P' of T such that d(P,P') is small
• dynamic programming with a small modification: O(mn)
• lots of (practical) improvement tricks
[Figure: pattern P aligned against an approximate occurrence P' inside text T]
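The "small modification" is that row 0 of the table is all zeros, so a match may start at any position of T. A sketch with unit-cost Levenshtein operations and threshold k (approx_find is an ad hoc name; it reports 1-based end positions of approximate occurrences):

```python
# Sketch: O(mn) approximate pattern search by dynamic programming.

def approx_find(p, t, k):
    m, n = len(p), len(t)
    prev = [0] * (n + 1)                 # D[0][j] = 0: a match may start anywhere
    for i in range(1, m + 1):
        cur = [i] + [0] * n              # D[i][0] = i
        for j in range(1, n + 1):
            cur[j] = min(prev[j - 1] + (p[i - 1] != t[j - 1]),  # match / change
                         prev[j] + 1,                           # symbol missing in T
                         cur[j - 1] + 1)                        # extra symbol in T
        prev = cur
    return [j for j in range(1, n + 1) if prev[j] <= k]

print(approx_find("att", "hattivatti", 0))   # -> [4, 9] (exact ends)
```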
242
Index for approximate searching?
• dynamic programming: P × Tree(T), with backtracking
[Figure: pattern P matched against the paths of Tree(T)]