23.11.16
Text Algorithms
Jaak Vilo, 2016 fall
MTAT.03.190 Text Algorithms, Jaak Vilo
Topics
• Exact matching of one pattern (string)
• Exact matching of multiple patterns
• Suffix trie and tree indexes
  – Applications
• Suffix arrays
• Inverted index
• Approximate matching
Algorithms
One-pattern
• Brute force
• Knuth-Morris-Pratt
• Karp-Rabin
• Shift-OR, Shift-AND
• Boyer-Moore
• Factor searches
• Regular expressions (?)
• Weight matrices (?)
Multi-pattern
• Aho-Corasick
• Commentz-Walter
Indexing
• Trie (and suffix trie)
• Suffix tree
Exact pattern matching
• S = s1 s2 … sn (text), |S| = n (length)
• P = p1 p2 … pm (pattern), |P| = m
• Σ – alphabet, |Σ| = c
• Does S contain P?
  – Does S = S'PS" for some strings S' and S"?
  – Usually m << n, and n can be (very) large
Find occurrences in text
[Figure: pattern P aligned under text S]
Animations
• http://www-igm.univ-mlv.fr/~lecroq/string/
• Exact String Matching Algorithms: Animation in Java
• Christian Charras, Thierry Lecroq, Laboratoire d'Informatique de Rouen, Université de Rouen, Faculté des Sciences et des Techniques, 76821 Mont-Saint-Aignan Cedex, France
• e-mails: {Christian.Charras, Thierry.Lecroq}@laposte.net
Brute force: BAB in text?
A B A C A B A B B A B B B A B A B
Brute force
[Figure: P aligned at position i of S; P[j] is compared with S[i+j-1]]
Identify the first mismatch!
Question:
• Problems of this method?
• Ideas to improve the search?
Brute force
Algorithm Naive
Input: Text S[1..n] and pattern P[1..m]
Output: All positions i where P occurs in S
for (i = 1; i <= n-m+1; i++) {
    for (j = 1; j <= m; j++)
        if (S[i+j-1] != P[j]) break;
    if (j > m) print i;
}
attempt 1: gcatcgcagagagtatacagtacg
           GCAg....
attempt 2: gcatcgcagagagtatacagtacg
            g.......
attempt 3: gcatcgcagagagtatacagtacg
             g.......
attempt 4: gcatcgcagagagtatacagtacg
              g.......
attempt 5: gcatcgcagagagtatacagtacg
               g.......
attempt 6: gcatcgcagagagtatacagtacg
                GCAGAGAG
attempt 7: gcatcGCAGAGAGtatacagtacg
                 g.......
Brute force or Naive Search
function NaiveSearch(string s[1..n], string sub[1..m])
    for i from 1 to n-m+1
        for j from 1 to m
            if s[i+j-1] ≠ sub[j]
                jump to next iteration of outer loop
        return i
    return not found
C code
#include <string.h>

int bf_2(char *pat, char *text, int n)    /* n = text length */
{
    int m, i, j;
    int count = 0;

    m = (int)strlen(pat);
    for (i = 0; i + m <= n; i++) {
        /* advance j while the pattern agrees with the text */
        for (j = 0; j < m && pat[j] == text[i+j]; j++)
            ;
        if (j == m)
            count++;                      /* full match at offset i */
    }
    return count;
}
C code
#include <string.h>

int bf_1(char *pat, char *text)
{
    int m;
    int count = 0;
    char *tp;

    m = (int)strlen(pat);
    tp = text;
    for (; *tp; tp++) {
        /* strncmp stops at the end of the text, so short tails never match */
        if (strncmp(pat, tp, m) == 0) {
            count++;
        }
    }
    return count;
}
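Main problem of Naive
• For the next possible location of P, the same positions of S are checked again
[Figure: P shifted by one position under S; the positions i..i+j-1 of S are compared a second time]
Goals
• Make sure only a constant number of comparisons/operations is made for each position in S
  – Move (only) from left to right in S
  – How?
  – After a test of S[i] <> P[j], what do we do now?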
Knuth-Morris-Pratt
• Make sure that no comparisons are “wasted”
• After such a mismatch we already know exactly the values of the green area in S!
D. Knuth, J. Morris, V. Pratt: Fast Pattern Matching in Strings. SIAM Journal on Computing 6:323-350, 1977.
[Figure: matched (green) area of S above P, followed by the mismatch x ≠ y]
Knuth-Morris-Pratt
• Make sure that no comparisons are “wasted”
• P – longest suffix of any prefix that is also a prefix of the pattern
• Example: ABCABD
[Figure: a matched prefix followed by the mismatch x ≠ y; the pattern is shifted to its shorter prefix p, followed by z. Example pattern: ABCABD]
Automaton for ABCABD
[Figure: states 1 to 7 connected by the transitions A, B, C, A, B, D; a "NOT A" loop on the initial state]
State:   1 2 3 4 5 6 7
Pattern: A B C A B D
Fail:    0 1 1 1 2 3 1
KMP matching
Input: Text S[1..n] and pattern P[1..m]
Output: First occurrence of P in S (if exists)
i = 1; j = 1; initfail(P)          // prepare fail links
repeat
    if j == 0 or S[i] == P[j]
        then i++, j++              // advance in text and in pattern
        else j = fail[j]           // use fail link
until j > m or i > n
if j > m then report match at i-m
Initialization of fail links
Algorithm: KMP_Initfail
Input: Pattern P[1..m]
Output: fail[] for pattern P
i = 1, j = 0, fail[1] = 0
repeat
    if j == 0 or P[i] == P[j]
        then i++, j++, fail[i] = j
        else j = fail[j]
until i >= m
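The two KMP procedures can be sketched in C roughly as follows. This is a minimal sketch, not the lecture's reference code: the function names kmp_initfail and kmp_search are made up here, the fail array keeps the 1-based indexing of the slides (C strings are 0-based, hence the "- 1" on every character access), and the repeat/until loops are written as while loops so that a one-character pattern does not write past the array.

```c
#include <stdlib.h>
#include <string.h>

/* Fill fail[1..m] as in KMP_Initfail (1-based, fail[1] = 0).
   The caller provides room for at least m+1 ints. */
void kmp_initfail(const char *p, int m, int *fail)
{
    int i = 1, j = 0;

    fail[1] = 0;
    while (i < m) {
        if (j == 0 || p[i - 1] == p[j - 1]) {
            i++; j++;
            fail[i] = j;
        } else {
            j = fail[j];
        }
    }
}

/* Return the 1-based position of the first occurrence of p in s,
   or 0 if there is none (or if the fail table cannot be allocated). */
int kmp_search(const char *s, const char *p)
{
    int n = (int)strlen(s), m = (int)strlen(p);
    int i = 1, j = 1, pos = 0;
    int *fail;

    if (m == 0 || m > n)
        return 0;
    fail = malloc((size_t)(m + 1) * sizeof *fail);
    if (fail == NULL)
        return 0;
    kmp_initfail(p, m, fail);

    while (j <= m && i <= n) {
        if (j == 0 || s[i - 1] == p[j - 1]) {
            i++; j++;              /* advance in text and in pattern */
        } else {
            j = fail[j];           /* use fail link, i stays in place */
        }
    }
    if (j > m)
        pos = i - m;
    free(fail);
    return pos;
}
```

For the pattern ABCABD this fills fail[1..6] with 0 1 1 1 2 3, the first six values of the Fail row of the automaton slide.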
Initialization of fail links: trace on ABCABD
[Figure: i and j moving over ABCABD while the fail row is filled in]
Fail: 0
Fail: 0 1
Fail: 0 1 1 1
Fail: 0 1 1 1 2
Time complexity of KMP matching?
Analysis of time complexity
• At every cycle either i and j both increase by 1
• Or j decreases (j = fail[j])
• i can increase n (or m) times
• Q: How often can j decrease?
  – A: not more than the number of increases of i
• Amortised analysis: O(n), preprocessing O(m)
Karp-Rabin
• Compare in O(1) a hash of P and a hash of S[i..i+m-1]
• Goal: O(n).
• f( h(T[i..i+m-1]) -> h(T[i+1..i+m]) ) = O(1)
R. Karp and M. Rabin: Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31 (1987), 249-260.
[Figure: h(P) of P[1..m] compared with h(T[i..i+m-1]) of the window i..(i+m-1), then with h(T[i+1..i+m]) of the next window]
Hash
• “Remove” the effect of T[i] and “introduce” the effect of T[i+m] – in O(1)
• Use base-|Σ| arithmetic and treat characters as numbers
• In case of a hash match – check all m positions
• Hash collisions => worst case O(nm)
Let’s use numbers
• T = 57125677
• P = 125 (and for simplicity, h = 125)
• H(T[1]) = 571
• H(T[2]) = (571 - 5*100)*10 + 2 = 712
• H(T[3]) = (H(T[2]) – ord(T[2])*10^(m-1))*10 + ord(T[3+m-1]) = (712 - 7*100)*10 + 5 = 125
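The sliding step of this decimal example can be written as one small C function. It is only an illustration of the arithmetic above; the name roll_decimal and its parameters are made up here, and high stands for 10^(m-1), i.e. 100 for m = 3.

```c
/* Slide a decimal window one position to the right:
   drop the leading digit `out`, append the new digit `in`.
   `high` is 10^(m-1) for a window of m digits. */
long roll_decimal(long h, int out, int in, long high)
{
    return (h - out * high) * 10 + in;
}
```

With T = 57125677 the calls roll_decimal(571, 5, 2, 100) and roll_decimal(712, 7, 5, 100) give 712 and 125, so the third window equals h(P).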
Hash
• c – size of alphabet
• HS_i = H(S[i..i+m-1])
• H(S[i+1..i+m]) = (HS_i – ord(S[i])*c^(m-1))*c + ord(S[i+m])
• Modulo arithmetic – to fit the value in a word!
• hash(w[0..m-1]) = (w[0]*2^(m-1) + w[1]*2^(m-2) + ··· + w[m-1]*2^0) mod q
Karp-Rabin
Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
c = 20                   /* size of the alphabet, say number of amino acids */
q = 33554393             /* q is a prime */
cm = c^(m-1) mod q
hp = 0; hs = 0
for i = 1 .. m do hp = (hp*c + ord(p[i])) mod q      // H(P)
for i = 1 .. m do hs = (hs*c + ord(s[i])) mod q      // H(S[1..m])
if hp == hs and P == S[1..m] report match at position 1
for i = 2 .. n-m+1
    hs = ((hs - ord(s[i-1])*cm) * c + ord(s[i+m-1])) mod q
    if hp == hs and P == S[i..i+m-1]
        report match at position i
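A possible C rendering of the Karp-Rabin procedure is sketched below. Several details are assumptions rather than part of the slides: the function name kr_count, counting occurrences instead of reporting them, a byte alphabet (c = 256 instead of 20), and adding q before the subtraction so that the unsigned value never goes below zero.

```c
#include <string.h>

/* Count the occurrences of p in s with a rolling hash.
   Every hash hit is verified with memcmp, so collisions are harmless. */
int kr_count(const char *s, const char *p)
{
    const unsigned long long c = 256;        /* assumption: byte alphabet */
    const unsigned long long q = 33554393;   /* the prime from the slide */
    size_t n = strlen(s), m = strlen(p), i;
    unsigned long long cm = 1, hp = 0, hs = 0, out;
    int count = 0;

    if (m == 0 || m > n)
        return 0;
    for (i = 1; i < m; i++)
        cm = (cm * c) % q;                               /* c^(m-1) mod q */
    for (i = 0; i < m; i++) {
        hp = (hp * c + (unsigned char)p[i]) % q;         /* H(P) */
        hs = (hs * c + (unsigned char)s[i]) % q;         /* H(S[1..m]) */
    }
    for (i = 0; ; i++) {
        if (hp == hs && memcmp(p, s + i, m) == 0)
            count++;
        if (i + m >= n)
            break;                                       /* last window done */
        out = ((unsigned char)s[i] * cm) % q;            /* effect of s[i] */
        hs = (hs + q - out) % q;                         /* remove it */
        hs = (hs * c + (unsigned char)s[i + m]) % q;     /* add s[i+m] */
    }
    return count;
}
```

All intermediate values stay below q*c + c, which fits comfortably in an unsigned long long.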
More ways to ensure O(n)? Shift-AND / Shift-OR
• Ricardo Baeza-Yates, Gaston H. Gonnet: A new approach to text searching. Communications of the ACM, October 1992, Volume 35, Issue 10 [ACM Digital Library: http://doi.acm.org/10.1145/135239.135243] [DOI]
Bit-operations
• Maintain a set of all prefixes that have so far had a perfect match
• On the next character in the text, update all previous pointers to a new set
• Bit vector: for every possible character
State: which prefixes match?
[Figure: state bit vector, one bit per pattern prefix]
Move to next: shift 1, introduce 1, bitwise AND
[Figure: shifted state vector & Pattern[S[i]] = new state vector]
Track positions of prefix matches
0 1 0 1 0 1    state
1 0 1 0 1 1    shift left << (and introduce 1)
1 0 0 0 1 1    mask on char T[i]
1 0 0 0 1 1    result of bitwise AND
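The update rule "shift, introduce 1, AND with the mask of the text character" can be sketched in C as below. The function name shift_and_count and the return value -1 for unsupported patterns are choices made for this sketch; it assumes that the pattern fits into one 64-bit word and stores prefix length j+1 in bit j.

```c
#include <string.h>

/* Count the occurrences of p in s with the Shift-AND method.
   Assumption: 1 <= strlen(p) <= 64, otherwise -1 is returned. */
int shift_and_count(const char *s, const char *p)
{
    unsigned long long mask[256] = {0};
    unsigned long long state = 0, accept;
    size_t m = strlen(p), j;
    int count = 0;

    if (m == 0 || m > 64)
        return -1;
    for (j = 0; j < m; j++)
        mask[(unsigned char)p[j]] |= 1ULL << j;   /* bit j: p[j] is this char */
    accept = 1ULL << (m - 1);                     /* bit of the full pattern */

    for (; *s; s++) {
        /* extend every matching prefix by one char and start a new one */
        state = ((state << 1) | 1ULL) & mask[(unsigned char)*s];
        if (state & accept)
            count++;
    }
    return count;
}
```

Each text character costs a constant number of word operations, which is where the O(n) bound for short patterns comes from.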
Vectors for every char in Σ
• P = aste
      a s t e b c d .. z
a     1 0 0 0 0 ...
s     0 1 0 0 0 ...
t     0 0 1 0 0 ...
e     0 0 0 1 0 ...
• T = lasteaed
      l a s t e a e d
a     0 1 0 0 0 1
s     0 0 1 0 0 0
t     0 0 0 1 0 0
e     0 0 0 0 1 0
(State columns are shown for the first six characters of T; the 1 in the last row marks the occurrence of aste that ends at the first e.)
http://www-igm.univ-mlv.fr/~lecroq/string/node6.html
[A]11010101
Summary
Algorithm             Worst case   Ave. case              Preprocess
Brute force           O(mn)        O(n*(1+1/|Σ|+..))
Knuth-Morris-Pratt    O(n)         O(n)                   O(m)
Rabin-Karp            O(mn)        O(n)                   O(m)
Boyer-Moore                        O(n/m)?
BM Horspool
Factor search
Shift-OR              O(n)         O(n)                   O(m|Σ|)
• R. Boyer, S. Moore: A fast string searching algorithm. CACM 20 (1977), 762-772 [PDF]
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf
Find occurrences in text
• Have we missed anything?
[Figure: pattern P aligned under text S]
Find occurrences in text
• What have we learned if we test for a potential match from the end?
[Figure: P compared with S starting from its right end; label ABCDEBBCDE]
Bad character heuristics: maximal shift on S[i]
[Figure: text character X at S[i] aligned with the first X in the pattern, counted from the end]
delta1( S[i] ) =
  m, if the pattern does not contain S[i]
  patlen-j, for the max j such that P[j] == S[i]
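/* occ[], p, m and alphabetsize are globals declared elsewhere */
void bmInitocc() {
    int a, j;                            /* int: a char counter could never reach 256 */
    for (a = 0; a < alphabetsize; a++)
        occ[a] = -1;                     /* character does not occur in p */
    for (j = 0; j < m; j++)
        occ[(unsigned char)p[j]] = j;    /* rightmost position of p[j] */
}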
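Good suffix heuristics
[Figure: matched suffix µ of P under S; case 1: another occurrence of µ inside P; case 2: a suffix µ’ of µ that is a prefix of P]
delta2( S[i] ) – minimal shift so that
1. the matched region is fully covered, or
2. a suffix of the match is also a prefix of P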
Boyer-Moore algorithm
Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
preprocess_BM() // delta1 and delta2
i=m
while i <= n
for( j=m; j>0 and P[j]==S[i-m+j]; j-- ) ;
if j==0 report match at position i-m+1
i = i+ max( delta1[ S[i] ], delta2[ j ] )
• http://www.iti.fh-flensburg.de/lang/algorithmen/pattern/bmen.htm
• http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/Articles/Exact/Boyer-Moore-original-p762-boyer.pdf
• Animation: http://www-igm.univ-mlv.fr/~lecroq/string/
Simplifications of BM
• There are many variants of Boyer-Moore, and many scientific papers.
• On average the time complexity is sublinear
• The algorithm can be made faster while the code becomes simpler.
• It is useful to use the last character heuristics (Horspool (1980), Baeza-Yates (1989), Hume and Sunday (1991)).
Algorithm BMH (Boyer-Moore-Horspool)
• R. N. Horspool: Practical Fast Searching in Strings. Software - Practice and Experience, 10(6):501-506, 1980
Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
for a in Σ do delta[a] = m
for j = 1..m-1 do delta[P[j]] = m-j
i = m
while i <= n
    if S[i] == P[m]
        j = m-1
        while (j > 0 and P[j] == S[i-m+j]) j = j-1
        if j == 0 report match at i-m+1
    i = i + delta[S[i]]
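The BMH pseudocode translates to C fairly directly. The sketch below uses 0-based indexing, so delta[P[j]] = m-j for j = 1..m-1 becomes m-1-j for j = 0..m-2; the name bmh_count and the choice to count occurrences rather than report them are assumptions of this example.

```c
#include <string.h>

/* Count the occurrences of p in s with Boyer-Moore-Horspool. */
int bmh_count(const char *s, const char *p)
{
    size_t n = strlen(s), m = strlen(p), delta[256], i, j;
    int count = 0, a;

    if (m == 0 || m > n)
        return 0;
    for (a = 0; a < 256; a++)
        delta[a] = m;                          /* char not in p[0..m-2] */
    for (j = 0; j + 1 < m; j++)
        delta[(unsigned char)p[j]] = m - 1 - j;

    /* i is the text position under the last pattern character */
    for (i = m - 1; i < n; i += delta[(unsigned char)s[i]]) {
        /* compare the window s[i-m+1..i] with p from right to left */
        for (j = 0; j < m && p[m - 1 - j] == s[i - j]; j++)
            ;
        if (j == m)
            count++;
    }
    return count;
}
```

The shift always depends on the text character under the last pattern position, whether or not the window matched, which is what keeps the inner loop so small.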
String Matching: Horspool algorithm
• How is the comparison made? From right to left: suffix search.
• Which is the next position of the window? It depends on where the last letter of the text window, say 'a', appears in the pattern.
[Figure: text window ending in 'a'; the pattern is shifted so that one of its other occurrences of 'a' lines up with it]
A preprocessing step is therefore needed to determine the length of the shift.
Algorithm Boyer-Moore-Horspool-Hume-Sunday (BMHHS)
• Use delta in a tight loop
• If match (delta == 0) then check and apply original delta d
Input: Text S[1..n] and pattern P[1..m]
Output: Occurrences of P in S
1. for a in Σ do delta[a] = m
2. for j = 1..m-1 do delta[P[j]] = m-j
3. d = delta[P[m]]                 // memorize d on P[m]
4. delta[P[m]] = 0                 // ensure delta on match of last char is 0
5. for (i = m; i <= n; i = i+d)
6.     repeat                      // skip loop
7.         t = delta[S[i]]; i = i + t
8.     until t == 0
9.     for (j = m-1; j > 0 and P[j] == S[i-m+j]; j = j-1) ;
10.    if j == 0 report match at i-m+1
BMHHS requires that the text is padded by P: S[n+1]..S[n+m] = P (in order for the algorithm to finish correctly – at least one occurrence!).
• Daniel M. Sunday: A very fast substring search algorithm [PDF]. Communications of the ACM, August 1990, Volume 33, Issue 8
• Loop unrolling:
• Avoid too many loops (each loop requires tests) by just repeating code within the loop.
• Line 7 in the previous algorithm can be replaced by:
7. i += delta[ S[i] ]; i += delta[ S[i] ]; i += (t = delta[ S[i] ]);
Forward-Fast-Search: Another Fast Variant of the Boyer-Moore String Matching Algorithm
• The Prague Stringology Conference '03
• Domenico Cantone and Simone Faro
• Abstract: We present a variation of the Fast-Search string matching algorithm, a recent member of the large family of Boyer-Moore-like algorithms, and we compare it with some of the most effective string matching algorithms, such as Horspool, Quick Search, Tuned Boyer-Moore, Reverse Factor, Berry-Ravindran, and Fast-Search itself. All algorithms are compared in terms of run-time efficiency, number of text character inspections, and number of character comparisons. It turns out that our new proposed variant, though not linear, achieves very good results especially in the case of very short patterns or small alphabets.
• http://cs.felk.cvut.cz/psc/event/2003/p2.html
• PS.gz (local copy)
Factor based approach
• Optimal average-case algorithms
  – Assuming independent characters, same probability
• Factor – a substring of a pattern
  – Any substring
  – (how many?)
Factor searches
Do not compare characters, but find the longest match to any subregion of the pattern.
[Figure: factor u of P found in S, preceded by a character X]
Examples
• Backward DAWG Matching (BDM) – Crochemore et al. 1994
• Backward Nondeterministic DAWG Matching (BNDM) – Navarro, Raffinot 2000
• Backward Oracle Matching (BOM) – Allauzen, Crochemore, Raffinot 2001
Backward DAWG Matching (BDM)
Do not compare characters, but find the longest match to any subregion of the pattern.
Suffix automaton recognises all factors (and suffixes) in O(n)
BNDM – simulate using bit parallelism
Bits – show where the factors have occurred so far
BNDM matches an NDA
NDA on the suffixes of ‘announce’
Deterministic version of the same
Backward Factor Oracle
BNDM – Backward Non-Deterministic DAWG Matching
BOM – Backward Oracle Matching
String Matching of one pattern
CTACTACTACGTCTATACTGATCGTAGCTACTACGGTATGACTAA
• Factor search
• Prefix search
• Suffix search
Multiple patterns
[Figure: a set of patterns {P} searched in text S]
Why?
• Multiple patterns
• Highlight multiple different search words on the page
• Virus detection – filter for virus signatures
• Spam filters
• Scanner in compiler needs to search for multiple keywords
• Filter out stop words or disallowed words
• Intrusion detection software
• Next-generation sequencing produces huge amounts (many millions) of short reads (20-100 bp) that need to be mapped to genome!
• …
Algorithms
• Aho-Corasick (search for multiple words)
  – Generalization of Knuth-Morris-Pratt
• Commentz-Walter
  – Generalization of Boyer-Moore & AC
• Wu and Manber
  – Improvement over C-W
• Additional methods, tricks and techniques
Aho-Corasick (AC)
• Alfred V. Aho and Margaret J. Corasick (Bell Labs, Murray Hill, NJ): Efficient string matching. An aid to bibliographic search. Communications of the ACM, Volume 18, Issue 6, p. 333-340 (June 1975)
• ACM: DOI, PDF
• ABSTRACT This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.
• Generalization of KMP for many patterns
• Text S like before.
• Set of patterns P = {P1, .., Pk}
• Total length |P| = m = Σi=1..k mi
• Problem: find all occurrences of any of the Pi ∈ P from S
Idea
1. Create an automaton from all patterns
2. Match the automaton
• Use the PATRICIA trie for creating the main structure of the automaton
PATRICIA trie
• D. R. Morrison, "PATRICIA: Practical Algorithm To Retrieve Information Coded In Alphanumeric", Journal of the ACM 15 (1968) 514-534.
• Abstract: PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, which is economical of index space and of reindexing time. It does not require rearrangement of text or index as new material is added. It requires a minimum restriction of format of text and of keys; it is extremely flexible in the variety of keys it will respond to. It retrieves information in response to keys furnished by the user with a quantity of computation which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several variations as FORTRAN programs for the CDC-3600, utilizing disk file storage of text. It has been applied to several large information-retrieval problems and will be applied to others.
• ACM: DOI, PDF
• Word trie - a good data structure to represent a set of words (e.g. a dictionary).
• trie (data structure)
• Definition: A tree for storing strings in which there is one node for every common prefix. The strings are stored in extra leaf nodes.
• See also digital tree, digital search tree, directed acyclic word graph, compact DAWG, Patricia tree, suffix tree.
• Note: The name comes from retrieval and is pronounced "tree."
• To test for a word p, only O(|p|) time is used no matter how many words are in the dictionary...
Trie for P = {he, she, his, hers}
[Figure: the trie built word by word. Edges: 0 -h-> 1 -e-> 2 -r-> 8 -s-> 9; 1 -i-> 6 -s-> 7; 0 -s-> 3 -h-> 4 -e-> 5]
How to search for words like he, sheila, hi? Do these occur in the trie?
[Figure: the trie for P = {he, she, his, hers} with states 0 to 9]
Aho-Corasick
1. Create an automaton MP for a set of strings P.
2. Finite state machine: read a character from the text, and change the state of the automaton based on the state transitions...
3. Main links: goto[j,c] - read a character c from the text and go from state j to state goto[j,c].
4. If there are no goto[j,c] links on character c from state j, use fail[j].
5. Report the output. Report all words that have been found in state j.
AC Automaton (vs KMP)
[Figure: the trie for {he, she, his, hers} with fail links; the initial state loops on NOT { h, s }]
goto[1,i] = 6
fail[7] = 3, fail[8] = 0, fail[5] = 2
Output table
state   output[j]
2       he
5       she, he
7       his
9       hers
AC matching
Input: Text S[1..n] and an AC automaton M for pattern set P
Output: Occurrences of patterns from P in S (last position)
state = 0
for i = 1..n do
    while (goto[state, S[i]] == ∅) and (fail[state] != state) do
        state = fail[state]
    state = goto[state, S[i]]
    if (output[state] not empty)
        then report matches output[state] at position i
Algorithm Aho-Corasick preprocessing I (TRIE)
Input: P = {P1, ..., Pk}
Output: goto[] and partial output[]
Assume: output(s) is empty when a state s is created; goto[s,a] is not defined.

procedure enter(a1, ..., am)       /* Pi = a1, ..., am */
begin
    s = 0; j = 1;
    while goto[s, aj] ≠ ∅ do       // follow existing path
        s = goto[s, aj];
        j = j+1;
    for p = j to m do              // add new path (states)
        news = news+1;
        goto[s, ap] = news;
        s = news;
    output[s] = a1, ..., am
end

begin
    news = 0
    for i = 1 to k do enter(Pi)
    for a ∈ Σ do
        if goto[0,a] = ∅ then goto[0,a] = 0;
end
Preprocessing II for AC (FAIL)
queue = ∅
for a ∈ Σ do
    if goto[0,a] ≠ 0 then
        enqueue(queue, goto[0,a])
        fail[goto[0,a]] = 0
while queue ≠ ∅
    r = take(queue)
    for a ∈ Σ do
        if goto[r,a] ≠ ∅ then
            s = goto[r,a]
            enqueue(queue, s)              // breadth first search
            state = fail[r]
            while goto[state,a] = ∅ do state = fail[state]
            fail[s] = goto[state,a]
            output[s] = output[s] + output[fail[s]]
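Both preprocessing phases and the matching loop fit into a compact C sketch. The limits and names here are assumptions made for the example, not part of the slides: at most MAXS = 256 states, at most 32 patterns (output sets are stored as bit masks), a byte alphabet, and the global arrays ac_goto, ac_fail and ac_out with -1 standing for an undefined goto.

```c
#include <string.h>

#define MAXS  256                  /* assumption: at most 256 states */
#define SIGMA 256                  /* byte alphabet */

int ac_goto[MAXS][SIGMA];          /* -1 = undefined */
int ac_fail[MAXS];
unsigned long ac_out[MAXS];        /* bit i set: pattern i ends in this state */

/* Build goto, fail and output for k <= 32 patterns; 0 on success, -1 on overflow. */
int ac_build(const char *pats[], int k)
{
    int queue[MAXS], head = 0, tail = 0, news = 0, i, a, s, r, st;

    if (k > 32)
        return -1;
    memset(ac_goto, -1, sizeof ac_goto);
    memset(ac_out, 0, sizeof ac_out);
    for (i = 0; i < k; i++) {                       /* phase I: trie */
        const unsigned char *p = (const unsigned char *)pats[i];
        for (s = 0; *p; p++) {
            if (ac_goto[s][*p] < 0) {
                if (news + 1 == MAXS)
                    return -1;
                ac_goto[s][*p] = ++news;            /* add new state */
            }
            s = ac_goto[s][*p];
        }
        ac_out[s] |= 1UL << i;
    }
    for (a = 0; a < SIGMA; a++) {                   /* loops on state 0 */
        if (ac_goto[0][a] < 0) {
            ac_goto[0][a] = 0;
        } else {
            ac_fail[ac_goto[0][a]] = 0;
            queue[tail++] = ac_goto[0][a];
        }
    }
    while (head < tail) {                           /* phase II: breadth first */
        r = queue[head++];
        for (a = 0; a < SIGMA; a++) {
            if ((s = ac_goto[r][a]) < 0)
                continue;
            queue[tail++] = s;
            st = ac_fail[r];
            while (ac_goto[st][a] < 0)
                st = ac_fail[st];
            ac_fail[s] = ac_goto[st][a];
            ac_out[s] |= ac_out[ac_fail[s]];
        }
    }
    return 0;
}

/* Scan the text; return the number of (pattern, end position) pairs found. */
int ac_count(const char *text)
{
    int state = 0, count = 0;
    unsigned long o;

    for (; *text; text++) {
        unsigned char ch = (unsigned char)*text;
        while (ac_goto[state][ch] < 0)
            state = ac_fail[state];
        state = ac_goto[state][ch];
        for (o = ac_out[state]; o; o &= o - 1)      /* one per pattern bit */
            count++;
    }
    return count;
}
```

Built from he, she, his, hers in that order, the states get the same numbers as on the automaton slide, so ac_fail[7] is 3, ac_fail[8] is 0 and ac_fail[5] is 2; scanning "ushers" then finds she, he and hers.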
Correctness
• Let string t "point" from the initial state to state j.
• Must show that fail[j] points to the longest suffix that is also a prefix of some word in P.
• Look at the article...
AC matching time complexity
• Theorem: For matching the MP on text S, |S| = n, less than 2n transitions within M are made.
• Proof: Compare to KMP.
• There are at most n goto steps.
• There cannot be more than n fail steps.
• In total, there can be less than 2n transitions in M.
Individual node (goto)
• Full table
• List
• Binary search tree (?)
• Some other index?
AC thoughts
• Scales for many strings simultaneously.
• For very many patterns – search time (of grep) improves (??)
  – See Wu-Manber article
• When k grows, then more fail[] transitions are made (why?)
• But always less than n.
• If all goto[j,a] are indexed in an array, then the size is |MP|*|Σ|, and the running time of AC is O(n).
• When k and c are big, one can use lists or trees for storing transition functions.
• Then, O(n log(min(k,c))).
Advanced AC
• Precalculate the next state transition correctly for every possible character in the alphabet
• Can be good for short patterns
Problems of AC?
• Need to rebuild on adding/removing patterns
• Details of branching on each node (?)
Commentz-Walter
• Generalization of Boyer-Moore for multiple sequence search
• Beate Commentz-Walter: A String Matching Algorithm Fast on the Average. Proceedings of the 6th Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, Vol. 71, 1979, pp. 118-132, Springer-Verlag
• http://www.fh-albsig.de/win/personen/professoren.php?RID=36
• You can download here my algorithm String Matching Fast On The Average (PDF, ~17,2 MB) or here String Matching Fast On The Average (extended abstract) (PDF, ~3 MB)
C-W description
• Aho and Corasick [AC75] presented a linear-time algorithm for this problem, based on an automata approach. This algorithm serves as the basis for the UNIX tool fgrep. A linear-time algorithm is optimal in the worst case, but as the regular string-searching algorithm by Boyer and Moore [BM77] demonstrated, it is possible to actually skip a large portion of the text while searching, leading to faster than linear algorithms in the average case.
Commentz-Walter [CW79]
• Commentz-Walter [CW79] presented an algorithm for the multi-pattern matching problem that combines the Boyer-Moore technique with the Aho-Corasick algorithm. The Commentz-Walter algorithm is substantially faster than the Aho-Corasick algorithm in practice. Hume [Hu91] designed a tool called gre based on this algorithm, and version 2.0 of fgrep by the GNU project [Ha93] is using it.
• Baeza-Yates [Ba89] also gave an algorithm that combines the Boyer-Moore-Horspool algorithm [Ho80] (which is a slight variation of the classical Boyer-Moore algorithm) with the Aho-Corasick algorithm.
Idea of C-W
• Build a backward trie of all keywords
• Match from the end until mismatch...
• Determine the shift based on the combination of heuristics
Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG
1. Build the trie of the inverted patterns
[Figure: trie of the reversed patterns]
2. lmin = 4
3. Table of shifts
   A 1
   C 4 (lmin)
   G 2
   T 1
4. Start the search
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
The text: ACATGCTATGTGACA…
[Figure: the window is moved along the text using the table of shifts; the animation frames show the same trie and table]
Short Shifts!
What are the possible limitations for C-W?
• Many patterns, small alphabet – minimal skips
• What can be done differently?
Wu-Manber
• Wu S., and U. Manber, "A Fast Algorithm for Multi-Pattern Searching," Technical Report TR-94-17, Department of Computer Science, University of Arizona (May 1993).
• Citeseer: http://citeseer.ist.psu.edu/wu94fast.html [Postscript]
• We present a different approach that also uses the ideas of Boyer and Moore. Our algorithm is quite simple, and the main engine of it is given later in the paper. An earlier version of this algorithm was part of the second version of agrep [WM92a, WM92b], although the algorithm has not been discussed in [WM92b] and only briefly in [WM92a]. The current version is used in glimpse [MW94]. The design of the algorithm concentrates on typical searches rather than on worst-case behavior. This allows us to make some engineering decisions that we believe are crucial to making the algorithm significantly faster than other algorithms in practice.
Key idea
• The main problem with Boyer-Moore and many patterns is that the more patterns there are, the shorter the possible shifts become...
• Wu and Manber: check several characters simultaneously, i.e. increase the alphabet.
• Instead of looking at characters from the text one by one, we consider them in blocks of size B.
• A good value of B is in the order of log_c 2M; in practice, we use either B = 2 or B = 3.
• The SHIFT table plays the same role as in the regular Boyer-Moore algorithm, except that it determines the shift based on the last B characters rather than just one character.
Horspool to Wu-Manber
How can we increase the length of the shifts?
With a shift table of l-mers, for the patterns ATGTATG, TATG, ATAAT, ATGTG
1 symbol:
A 1
C 4 (lmin)
G 2
T 1
2 symbols:
AA 1
AC 3 (LMIN-L+1)
AG 3
AT 1
CA 3
CC 3
CG 3
…
Entries below the default: AA 1, AT 1, GT 1, TA 2, TG 2
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
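Both shift tables above can be reproduced with a few lines of C. The rule used below is inferred from the values on the slide and is an assumption of this sketch: every block of B symbols that ends at position q < |P| of some pattern P allows a shift of at most |P| - q, and all other blocks get the default lmin-B+1. The Wu-Manber paper itself organises its SHIFT table somewhat differently (it also allows shift 0 followed by a hash lookup), so this follows the slide's convention only. The names build_shift and block_code are made up, and B is limited to 1 or 2 so that a block fits a 65536-entry table.

```c
#include <string.h>

/* Code the block of B (1 or 2) bytes that ends at s[end] as a table index. */
static unsigned block_code(const char *s, size_t end, int B)
{
    unsigned code = (unsigned char)s[end];

    if (B == 2)
        code |= (unsigned)(unsigned char)s[end - 1] << 8;
    return code;
}

/* Fill shift[0..65535] for k >= 1 patterns, each at least B long. */
void build_shift(const char *pats[], int k, int B, int *shift)
{
    size_t lmin = strlen(pats[0]), len, q;
    long i;
    int j;

    for (j = 1; j < k; j++)
        if (strlen(pats[j]) < lmin)
            lmin = strlen(pats[j]);
    for (i = 0; i < 65536L; i++)
        shift[i] = (int)lmin - B + 1;               /* default shift */
    for (j = 0; j < k; j++) {
        len = strlen(pats[j]);
        for (q = (size_t)B; q < len; q++) {         /* q: 1-based block end */
            unsigned c = block_code(pats[j], q - 1, B);
            int d = (int)(len - q);                 /* distance to pattern end */
            if (d < shift[c])
                shift[c] = d;
        }
    }
}
```

For the four patterns of the slide, B = 1 gives A 1, C 4, G 2, T 1 and B = 2 gives AA 1, AT 1, GT 1, TA 2, TG 2 with default 3, where the entry for a 2-symbol block such as AT is stored at index ('A' << 8) | 'T'.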
Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG in the text: ACATGCTATGTGACATAATA
[Figure: trie of the inverted patterns and the 2-symbol shift table AA 1, AT 1, GT 1, TA 2, TG 2]
Experimental length: log_|Σ|(2*lmin*r)
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
Backward Oracle
• Set Backwards Oracle: SBDM, SBOM
• Pages 68-72
String matching of many patterns
[Figure: maps of the fastest algorithm (Wu-Manber or SBOM) for sets of 5, 10, 100 and 1000 patterns; x axis: Lmin from 5 to 45; y axis: |Σ| with ticks 2, 4, 8]
Slides courtesy of: Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
[Figures: results for 5 strings, 10 strings, 100 strings, 1000 strings]
Factor Oracle
Factor Oracle: safe shift
Factor Oracle: shift to match prefix of P2?
Factor oracle
Construction of factor Oracle
Factor oracle
• Allauzen, C., Crochemore, M., and Raffinot, M. 1999. Factor Oracle: A New Structure for Pattern Matching. In Proceedings of the 26th Conference on Current Trends in Theory and Practice of Informatics on Theory and Practice of Informatics (November 27 - December 04, 1999). J. Pavelka, G. Tel, and M. Bartosek, Eds. Lecture Notes in Computer Science, vol. 1725. Springer-Verlag, London, 295-310.
• http://portal.acm.org/citation.cfm?id=647009.712672&coll=GUIDE&dl=GUIDE&CFID=31549541&CFTOKEN=61811641#
• http://www-igm.univ-mlv.fr/~allauzen/work/sofsem.ps
So far
• Generalised KMP -> Aho-Corasick
• Generalised Horspool -> Commentz-Walter, Wu-Manber
• BDM, BOM -> Set Backward Oracle Matching…
• Other generalisations?
Multiple Shift-AND
• P = {P1, P2, P3, P4}. Generalize Shift-AND
• Bits = P1 P2 P3 P4
• Start = 1 1 1 1
• Match = 1 1 1 1