Special lecture on Information Knowledge Network

Special lecture on Information Knowledge Network: Information retrieval and pattern matching

Lecture 4: Approximate string matching

Takuya Kida
IKN Laboratory, Division of Computer Science and Information Technology

2018/11/22

Today’s contents

What is approximate pattern matching?

Dynamic programming approach

NFA-based approach
  Bit-parallel simulation (BPR, BPD)

Filtering approach
  Pattern division method (PEX)
  NFA method (ABNDM)

Let's talk with an intelligent computer!

User: Who was that painter in the Baroque era? He drew a famous picture called "The Nightwatch"… Er…, certainly, was he … Rembbright?

Computer: It's Rembrandt.

User: Yes! Rembrandt! I remember! …And who was that painter? He drew many genre paintings in the same era. Very popular in Japan. Er…, not Wermer…

Computer: …Perhaps, Vermeer?

User: That's right! Vermeer! Both of them were from the Netherlands, weren't they? By the way, did you know Tera-sgees, who was in the same era?

Computer: Velázquez! ヽ(`Д´)ノ #








What is approximate pattern matching?

It is the problem of finding the positions of substrings in a given text whose edit distance to a given pattern is less than or equal to $k$, where $0 < k < m$.

The edit distance $\mathrm{ed}(x, y)$ is defined as the minimum cost $d$ of transforming string $x$ into string $y$ with the character edit operations insertion, deletion, and substitution.

Example: pattern MARRIAGE, $k = 2$
  CARRIAGE: $d = 1$ → OK
  MASSAGE: $d = 3$ → Bad

ed(MARRIAGE, MASSAGE) = 3 (two substitutions and one deletion):
  MARRIAGE
  MASS AGE

Edit distance: how similar are two strings?

similarity ⇔ edit distance between strings (dissimilarity)

Variations of edit distance (from Heikki Hyyrö [SOFSEM2005]):
  Levenshtein distance: the costs of all operations are equal to 1.
  Hamming distance: only substitution is allowed.
  Weighted-cost edit distance: the cost of each operation may differ.
  Unrestricted-cost edit distance: the cost may differ for each character pair.
  Damerau distance: character transposition is also permitted.
  Indel distance: substitution is not allowed (insertion + deletion = indel).

Hereafter, we mainly deal with the Levenshtein distance.

Application examples

Calculating the similarity between DNA sequences

Spell checker / searching with ambiguity
  Orthographic variation: Carpaccio ⇔ Caravaggio
  (Searching with a thesaurus is another related topic)

Retrieval of similar sentences
  An advanced retrieval can be realized by combining with natural language processing
  A sentence = a sequence of morphemes ≒ a string
  (Consider each morpheme as a meta character)

Similar music retrieval
  Finding a similar phrase in MIDI data
  Retrieval with humming recognition
  (Using Dynamic Time Warping (DTW))

Search on OCR data
  OCRed data often contains mistakes

Applications to real data mining
  Web mining using approximate string matching algorithms (T. Nakato, Kyushu Univ.)

Dynamic programming approach

The method of calculating the edit distance by dynamic programming (DP) has been known since the 1960s. However, the well-known algorithm for pattern matching was presented by Sellers in 1980.

P. H. Sellers. The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms, 1(4):359-373, 1980.
H. Sakoe and S. Chiba. A Dynamic Programming Algorithm Optimization for Spoken Word Recognition. IEEE Trans. on Acoust., Speech and Signal Proc., Vol. ASSP-26, No. 1, pp. 43-49, 1978.

How to calculate $\mathrm{ed}(x, y)$: let $M_{i,j} = \mathrm{ed}(x[1:i], y[1:j])$. Then,

$$M_{0,0} \leftarrow 0, \qquad M_{i,j} \leftarrow \min\bigl(M_{i-1,j-1} + \delta(x[i], y[j]),\; M_{i-1,j} + 1,\; M_{i,j-1} + 1\bigr),$$

where $\delta(a, b)$ is defined as 0 if $a = b$, and 1 otherwise.

Efficient recursive formulas for the same calculation are:

$$M_{i,0} \leftarrow i, \qquad M_{0,j} \leftarrow j,$$

$$M_{i,j} \leftarrow \begin{cases} M_{i-1,j-1} & \text{if } x[i] = y[j] \\ 1 + \min(M_{i-1,j-1},\, M_{i-1,j},\, M_{i,j-1}) & \text{otherwise} \end{cases}$$

That is, $M_{|x|,|y|} = \mathrm{ed}(x, y)$.
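The recurrence above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `edit_distance` and the use of a full $(|x|+1) \times (|y|+1)$ table are assumptions made here for readability, not part of the lecture.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance using the recurrence for M[i][j] given above."""
    m, n = len(x), len(y)
    # M[i][j] = ed(x[1:i], y[1:j]); row 0 and column 0 are the base cases.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i
    for j in range(n + 1):
        M[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                M[i][j] = M[i - 1][j - 1]
            else:
                M[i][j] = 1 + min(M[i - 1][j - 1], M[i - 1][j], M[i][j - 1])
    return M[m][n]


print(edit_distance("annual", "annealing"))   # 4
print(edit_distance("MARRIAGE", "MASSAGE"))   # 3
```

The full table is kept only to mirror the formula; a single column would be enough to obtain the final value.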

Why is this calculation correct?

Prove it by induction. Let $M_{0,0} = 0$ for two empty strings. Now we want to obtain $\mathrm{ed}(x[1:i], y[1:j]) = M_{i,j}$. Assume that we already have $\mathrm{ed}(x[1:i'], y[1:j'])$ for all $i' \le i$ and $j' \le j$ with $(i', j') \ne (i, j)$. Then, we consider the cost of transforming $x[1:i]$ into $y[1:j]$.

If $x[i] = y[j]$, then we can simply transform $x[1:i-1]$ into $y[1:j-1]$ with the minimum cost $M_{i-1,j-1}$. In this case, $M_{i,j} = M_{i-1,j-1}$ holds.

If $x[i] \ne y[j]$, then we have three cases, each adding cost 1:
  substitute $x[i]$ with $y[j]$, and transform $x[1:i-1]$ into $y[1:j-1]$ with cost $M_{i-1,j-1}$;
  delete $x[i]$, and transform $x[1:i-1]$ into $y[1:j]$ with cost $M_{i-1,j}$;
  insert $y[j]$ at the end of $x[1:i]$, and transform $x[1:i]$ into $y[1:j-1]$ with cost $M_{i,j-1}$.

We choose the minimum among the above.

[Figure: the three cases. Deletion: $M_{i-1,j} + 1$. Substitution: $M_{i-1,j-1} + 1$. Insertion: $M_{i,j-1} + 1$. The cell $M_{i,j}$ is computed from $M_{i-1,j-1} + \delta(x[i], y[j])$, $M_{i-1,j} + 1$, and $M_{i,j-1} + 1$.]

How to detect the pattern occurrences

$M_{i,j}$ for ed(annual, annealing):

        a  n  n  e  a  l  i  n  g
     0  1  2  3  4  5  6  7  8  9
  a  1  0  1  2  3  4  5  6  7  8
  n  2  1  0  1  2  3  4  5  6  7
  n  3  2  1  0  1  2  3  4  5  6
  u  4  3  2  1  1  2  3  4  5  6
  a  5  4  3  2  2  1  2  3  4  5
  l  6  5  4  3  3  2  1  2  3  4

$M_{|x|,|y|}$ = ed(annual, annealing) = 4

Approximate string matching for $P$ = annual, $T$ = annealing, $k = 2$:

        a  n  n  e  a  l  i  n  g
     0  0  0  0  0  0  0  0  0  0
  a  1  0  1  1  1  0  1  1  1  1
  n  2  1  0  1  2  1  1  2  1  2
  n  3  2  1  0  1  2  2  2  2  2
  u  4  3  2  1  1  2  3  3  3  3
  a  5  4  3  2  2  1  2  3  4  4
  l  6  5  4  3  3  2  1  2  3  4

For every $j = 0 \dots n$, all we have to do is set $M_{0,j} = 0$.

This means that the empty string $\varepsilon$ matches anywhere in the given text with 0 errors. An occurrence ends at every text position $j$ where the bottom row satisfies $M_{m,j} \le k$ (here, $j = 5, 6, 7$).

$O(mn)$ time and $O(m)$ space
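The search variant with $M_{0,j} = 0$ can be sketched as follows, keeping only one column so that the space is $O(m)$. The function name `sellers_search` and the convention of reporting 1-based end positions are assumptions chosen for this illustration.

```python
def sellers_search(pattern: str, text: str, k: int) -> list[int]:
    """Return the 1-based end positions j with M[m][j] <= k (row 0 is all zeros)."""
    m = len(pattern)
    col = list(range(m + 1))          # column for j = 0: M[i][0] = i
    ends = []
    for j, c in enumerate(text, start=1):
        diag = col[0]                 # M[0][j-1] = 0
        # col[0] stays 0: the empty pattern matches everywhere with 0 errors.
        for i in range(1, m + 1):
            left = col[i]             # M[i][j-1]
            if pattern[i - 1] == c:
                col[i] = diag
            else:
                col[i] = 1 + min(diag, left, col[i - 1])
            diag = left
        if col[m] <= k:
            ends.append(j)
    return ends


print(sellers_search("annual", "annealing", 2))   # [5, 6, 7]
```

Each text character costs one pass over the column, which gives the $O(mn)$ time stated above.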

Improving the average time complexity

The given pattern seldom occurs in the text!
  During the calculation of each column, the values usually reach $k+1$ before the bottom row is reached (that is, a mismatch occurs at the current position).
  A cell whose value is larger than $k$ does not affect the final result.
  If the value of a cell is less than or equal to $k$, we call it active. The average time complexity can be reduced to $O(kn)$ by calculating only the active cells. (This improved algorithm is called DP.)

E. Ukkonen. Finding approximate patterns in strings. Journal of Algorithms, 6(1-3):132-137, 1985.

[Table: DP calculation for $P$ = annual, $T$ = annealing, $k = 2$; the values are the same as in the table with $M_{0,j} = 0$.]

$O(mn)$ time in the worst case
$O(kn)$ time on average

Pseudocode of the DP algorithm

DP (P = p_1 p_2 … p_m, T = t_1 t_2 … t_n, k)
  Preprocessing:
    For i ∈ 0…m Do C_i ← i
    lact ← k + 1                      /* last active cell */
  Searching:
    For pos ∈ 1…n Do
      pC ← 0, nC ← 0
      For i ∈ 1…lact Do
        If p_i = t_pos Then nC ← pC
        Else
          If pC < nC Then nC ← pC
          If C_i < nC Then nC ← C_i
          nC ← nC + 1
        End of if
        pC ← C_i, C_i ← nC
      End of for
      While C_lact > k Do lact ← lact - 1
      If lact = m Then report an occurrence at pos
      Else lact ← lact + 1
    End of for
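As a rough sketch, the pseudocode above translates almost line by line into Python. The function name `dp_cutoff_search` is hypothetical, and the sketch assumes $0 < k < m$ as on the definition slide, so that the initial `lact = k + 1` is a valid row index.

```python
def dp_cutoff_search(pattern: str, text: str, k: int) -> list[int]:
    """Column-wise DP that only computes cells up to the last active one.

    Assumes 0 < k < len(pattern). Returns 1-based end positions.
    """
    m = len(pattern)
    C = list(range(m + 1))    # current column, C[i] = i initially
    lact = k + 1              # last active cell
    hits = []
    for pos, ch in enumerate(text, start=1):
        pC = nC = 0           # pC: diagonal cell, nC: cell just computed
        for i in range(1, lact + 1):
            if pattern[i - 1] == ch:
                nC = pC
            else:
                if pC < nC:
                    nC = pC
                if C[i] < nC:
                    nC = C[i]
                nC += 1
            pC, C[i] = C[i], nC
        # Shrink to the last cell whose value is still <= k (C[0] = 0 stops it).
        while C[lact] > k:
            lact -= 1
        if lact == m:
            hits.append(pos)
        else:
            lact += 1
    return hits


print(dp_cutoff_search("annual", "annealing", 2))   # [5, 6, 7]
```

Cells below `lact` may hold stale values, but those values are always larger than $k$, so they cannot produce a spurious cell of value at most $k$.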

NFA-based approach

Perform pattern matching by simulating the NFA shown on this slide, after translating it into a DFA.
This was originally proposed by Ukkonen [1985], and several improvements have been proposed since.
Translating the NFA into the corresponding DFA increases the number of states to $O(\min(3^m, m(2m|\Sigma|)^k))$.
Therefore, it is not practical when $m$ is large.

[Figure: an NFA that accepts $P$ = annual allowing 2 errors. It has three rows of states (no error, 1 error, 2 errors), each spelling a n n u a l, with a Σ self-loop (any $a \in \Sigma$) on the initial state. Transition types: match (horizontal, labeled with a pattern character), ins (Σ), sub (Σ), del (ε). The figure marks the active states after reading $T$ = anneal.]

E. Ukkonen. Finding approximate patterns in strings. Journal of Algorithms, 6(1-3):132-137, 1985.

Row-wise bit-parallel for the NFA (BPR)

Pack the states of each row into one bit vector (1 indicates active, 0 indicates inactive) and simulate the moves of the whole NFA with a bit-parallel technique.
It needs $k + 1$ bit masks, each $m$ bits long.
The formulas to update the $i$-th row state $R_i$ into the new $R'_i$ when reading $t_j$:

$$R'_0 \leftarrow ((R_0 \ll 1) \mid 0^{m-1}1) \mathbin{\&} B[t_j]$$

$$R'_i \leftarrow ((R_i \ll 1) \mathbin{\&} B[t_j]) \mid R_{i-1} \mid (R_{i-1} \ll 1) \mid (R'_{i-1} \ll 1) \mid 0^{m-1}1$$

[Figure: the NFA that accepts $P$ = annual allowing 2 errors, with the active states after reading $T$ = anneal. Row bit vectors: no error = 000000, 1 error = 100011, 2 errors = 110111. Note that the bit order is reversed.]

$O(k \lceil m/w \rceil n)$ time
$O(kn)$ time if $m \le w$

Multiple rows can be packed into one vector!

S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.

Pseudocode of BPR

BPR (P = p_1 p_2 … p_m, T = t_1 t_2 … t_n, k)
  Preprocessing:
    For c ∈ Σ Do B[c] ← 0^m
    For j ∈ 1…m Do B[p_j] ← B[p_j] | 0^{m-j} 1 0^{j-1}
  Searching:
    For i ∈ 0…k Do R_i ← 0^{m-i} 1^i
    For pos ∈ 1…n Do
      oldR ← R_0
      newR ← ((oldR << 1) | 0^{m-1}1) & B[t_pos]
      R_0 ← newR
      For i ∈ 1…k Do
        newR ← ((R_i << 1) & B[t_pos]) | oldR | ((oldR | newR) << 1) | 0^{m-1}1
        oldR ← R_i, R_i ← newR
      End of for
      If newR & 1 0^{m-1} ≠ 0^m Then report an occurrence at pos
    End of for
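The BPR pseudocode above can be sketched with Python integers standing in for $m$-bit machine words. The function name `bpr_search` and the explicit `full` mask (needed because Python integers are unbounded) are assumptions of this sketch, not part of the original algorithm description.

```python
def bpr_search(pattern: str, text: str, k: int) -> list[int]:
    """Row-wise bit-parallel simulation of the k-error NFA (1-based end positions)."""
    m = len(pattern)
    full = (1 << m) - 1                 # keeps every vector within m bits
    accept = 1 << (m - 1)               # bit of the final state in each row
    # B[c] has bit j-1 set when p_j = c (bit 0 is the leftmost NFA state).
    B: dict[str, int] = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << j)
    # Row i starts with its first i states active (reachable by deletions).
    R = [(1 << i) - 1 for i in range(k + 1)]
    hits = []
    for pos, c in enumerate(text, start=1):
        b = B.get(c, 0)
        old = R[0]
        new = ((old << 1) | 1) & b
        R[0] = new
        for i in range(1, k + 1):
            new = (((R[i] << 1) & b)      # match
                   | old                  # insertion
                   | ((old | new) << 1)   # substitution | deletion
                   | 1) & full
            old, R[i] = R[i], new
        if new & accept:
            hits.append(pos)
    return hits


print(bpr_search("annual", "annealing", 2))   # [5, 6, 7]
```

With a fixed word size $w$ and $m \le w$, every row update is a constant number of word operations, which is where the $O(kn)$ bound comes from.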

Diagonal-wise bit-parallel for the NFA (BPD)

Pack the states diagonally by representing the depth of the active states in unary (needing $k+1$ bits), and combine them into one bit vector.
Since it needs '0's to mark the boundaries, the total length of the vector becomes $(m-k)(k+2)$ bits.
The formulas to update when reading the $j$-th character $t_j$:

$$D'_i \leftarrow \min(D_i + 1,\; D_{i+1} + 1,\; g(i-1, t_j))$$

$$g(i, c) = \min(\{k+1\} \cup \{r \mid r \ge D_i \text{ and } p_{i+1+r} = c\})$$

The 1st term is for substitution, the 2nd term is for insertion, and the 3rd term is for match.
$D_i = k+1$ if there is no active state on the diagonal (for $k = 2$, $D_i = 3$ = [111]).
Bit masks like those of Shift-Or are used.

[Figure: the NFA that accepts $P$ = annual allowing 2 errors (rows: no error, 1 error, 2 errors), with its diagonals labeled $D_0 \dots D_4$. Bit-vector layout with $k+1$ bits per diagonal: D = 0 001 0 111 0 011 0 011, where the four blocks are $D_1, D_2, D_3, D_4$.]

R. A. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2):127-158, 1999.

Pseudocode of BPD

BPD (P = p_1 p_2 … p_m, T = t_1 t_2 … t_n, k)
  Preprocessing:
    For c ∈ Σ Do B[c] ← 1^m
    For j ∈ 1…m Do B[p_j] ← B[p_j] & 1^{m-j} 0 1^{j-1}
    For c ∈ Σ Do
      BB[c] ← 0 s_{k+1}(B[c], 0) 0 s_{k+1}(B[c], 1) … 0 s_{k+1}(B[c], m-k-1)
    End of for
  Searching:
    D ← (0 1^{k+1})^{m-k}
    For pos ∈ 1…n Do
      x ← (D >> (k+2)) | BB[t_pos]
      D ← ((D << 1) | (0^{k+1} 1)^{m-k})
          & ((D << (k+3)) | (0^{k+1} 1)^{m-k-1} 0 1^{k+1})
          & (((x + (0^{k+1} 1)^{m-k}) ∧ x) >> 1) & (0 1^{k+1})^{m-k}
      If D & 0^{(m-k-1)(k+2)} 0 1 0^k = 0^{(m-k)(k+2)} Then
        Report an occurrence at pos
        D ← D | 0^{(m-k-1)(k+2)} 0 1^{k+1}
      End of If
    End of for

Slide annotations on the update of D, in order: $D_i + 1$, $D_{i+1} + 1$, $g(i-1, t_{pos})$, clean up.

Filtering approach: Pattern division method

Idea of the filtering approach:
  It is easier to say "there is no occurrence here" than "there is an occurrence here"
  → Find the candidates rapidly, then examine them in detail!
  This improves the average complexity.
  In practice, it works well when the error rate ($\alpha = k/m$) is small.

Pattern division method:
  Divide the given pattern into $k+1$ pieces. If the pattern occurs with at most $k$ errors, at least one piece must occur without any error.
  Then, find each piece using a fast multiple pattern matching algorithm (Multiple Shift-And or Set Horspool).
  When a piece is found, run an ordinary approximate string matching algorithm (such as DP) over the neighborhood of the occurrence to check whether the pattern matches.

Example for $k = 2$:
  Pattern: TAAATCACGGCATACT
  $k + 1$ pieces: TAAAT, CACGG, CATACT
  Text: ACCCTGTTTAGATCACGGCACTACTGTAAAC

S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.
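The pattern division idea can be sketched as below. This is a simplified illustration under stated assumptions: Python's `str.find` stands in for a real multiple pattern matching algorithm, the verification uses a plain column-wise DP, and the names `dp_ends` and `pattern_division_search` are made up for this sketch. It assumes $0 < k < m$ so that every piece is non-empty.

```python
def dp_ends(pattern: str, text: str, k: int) -> list[int]:
    """Plain DP verifier: 1-based end positions in text with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        diag = col[0]
        for i in range(1, m + 1):
            left = col[i]
            col[i] = diag if pattern[i - 1] == c else 1 + min(diag, left, col[i - 1])
            diag = left
        if col[m] <= k:
            ends.append(j)
    return ends


def pattern_division_search(pattern: str, text: str, k: int) -> list[int]:
    """Filter with k+1 exact pieces, then verify only around each piece hit."""
    m = len(pattern)
    size = m // (k + 1)                       # the last piece takes the remainder
    hits = set()
    for idx in range(k + 1):
        off = idx * size
        piece = pattern[off:] if idx == k else pattern[off:off + size]
        start = text.find(piece)              # stand-in for multipattern search
        while start != -1:
            # Any occurrence containing this piece lies inside this window.
            lo = max(0, start - off - k)
            hi = min(len(text), start - off + m + k)
            for e in dp_ends(pattern, text[lo:hi], k):
                hits.add(lo + e)
            start = text.find(piece, start + 1)
    return sorted(hits)


text = "ACCCTGTTTAGATCACGGCACTACTGTAAAC"
print(pattern_division_search("TAAATCACGGCATACT", text, 2))   # [25]
```

In the example only the piece CACGG occurs exactly in the text, so the DP runs over a single window of about $m + 2k$ characters instead of the whole text.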

Speeding up by hierarchical verification (PEX)

Checking candidates hierarchically can reduce the processing time.
Assume that the number of pieces is $j = k + 1 = 2^r$. Halve the given pattern, allowing $\lfloor k/2 \rfloor$ errors for each half, and repeat the division recursively until each piece allows 0 errors.
Find the pieces using a multiple pattern matching algorithm, and then check the candidates hierarchically.

CreateTree (p = p_i p_{i+1} … p_j, k, myParent, idx, plen)
  Create new node
  from(node) ← i
  to(node) ← j
  left ← ⌈(k+1)/2⌉
  parent(node) ← myParent
  err(node) ← k
  If k = 0 Then leaf_idx ← node
  Else
    CreateTree(p_{i … i+left·plen-1}, ⌊(left·k)/(k+1)⌋, node, idx, plen)
    CreateTree(p_{i+left·plen … j}, ⌊((k+1-left)·k)/(k+1)⌋, node, idx+left, plen)
  End of If

G. Navarro and R. Baeza-Yates. Very fast and simple approximate string matching. Information Processing Letters, 72:65-70, 1999.

[Figure: hierarchical division of the pattern aaabbbcccddd.
  aaabbbcccddd              ($k = 3$ errors)
  aaabbb | cccddd           ($k = 1$ error each)
  aaa | bbb | ccc | ddd     ($k = 0$ errors each)]

Pad the pattern when the number of pieces is not of the form $2^r$.
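The tree construction can be sketched in Python as follows. The `Node` class, the 0-based half-open ranges, and the explicit `leaves` list are assumptions made for this illustration; the rounding uses integer division, matching the example with $k = 3 \to 1 \to 0$.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    start: int                 # first pattern index covered (0-based, inclusive)
    end: int                   # one past the last pattern index covered
    err: int                   # number of errors allowed for this piece
    parent: Optional["Node"]


def create_tree(start: int, end: int, k: int, parent: Optional[Node],
                plen: int, leaves: list) -> Node:
    """Recursively halve pattern[start:end]; leaves are collected left to right."""
    node = Node(start, end, k, parent)
    if k == 0:
        leaves.append(node)
    else:
        left = (k + 2) // 2                      # ceil((k + 1) / 2) pieces on the left
        mid = start + left * plen
        create_tree(start, mid, (left * k) // (k + 1), node, plen, leaves)
        create_tree(mid, end, ((k + 1 - left) * k) // (k + 1), node, plen, leaves)
    return node


pattern, k = "aaabbbcccddd", 3
leaves: list = []
root = create_tree(0, len(pattern), k, None, len(pattern) // (k + 1), leaves)
print([pattern[n.start:n.end] for n in leaves])   # ['aaa', 'bbb', 'ccc', 'ddd']
print(root.err, leaves[0].parent.err)             # 3 1
```

During the search, a hit on a leaf is verified against its parent's piece first, and the check climbs towards the root only while each level still matches.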

Pseudocode of PEX

PEX (P = p_1 p_2 … p_m, T = t_1 t_2 … t_n, k)
  Preprocessing:
    CreateTree(p, k, θ, 0, m/(k+1))
    Preprocess multipattern search for
      {p_from(node) … p_to(node) | node = leaf_i, i ∈ {0…k}}
  Searching:
    For (pos, i) ∈ output of multipattern search Do
      node ← leaf_i
      in ← from(node)
      node ← parent(node)
      cand ← TRUE
      While cand = TRUE and node ≠ θ Do
        p1 ← pos - (in - from(node)) - err(node)
        p2 ← pos + (to(node) - in + 1) + err(node)
        Verify text area T_{p1…p2} for pattern piece p_{from(node)…to(node)}
          allowing err(node) errors
        If pattern piece was not found Then cand ← FALSE
        Else node ← parent(node)
      End of while
      If cand = TRUE Then
        Report the positions where the whole p was found
      End of If
    End of for

Filtering approach: BNDM method (ABNDM)

Construct an NFA that accepts any factor of $P^R$ for a given pattern $P$, allowing $k$ errors → an extension of BNDM

The NFA can tell whether the input is a prefix of $P^R$ with $k$ errors.
BNDM runs faster than BM when the alphabet size is small enough.
We can quickly extract candidate positions with this NFA.
It can skip several text positions, like BNDM.

For texts whose alphabet size is small, such as DNA sequences, ABNDM runs faster than PEX.

[Figure: the NFA for $P$ = annual with rows for no error, 1 error, and 2 errors, extended with additional ε-transitions.]

G. Navarro and R. Baeza-Yates. Very fast and simple approximate string matching. Information Processing Letters, 72:65-70, 1999.

Pseudocode of ABNDM

ABNDM (P = p_1 p_2 … p_m, T = t_1 t_2 … t_n, k)
  Preprocessing:
    For c ∈ Σ Do B[c] ← 0^m
    For j ∈ 1…m Do B[p_j] ← B[p_j] | 0^{m-j} 1 0^{j-1}
  Searching:
    pos ← 0
    While pos ≦ n - (m - k) Do
      j ← m - k - 1, last ← m - k - 1
      R_0 ← B[t_{pos+m-k}]
      newR ← 1^m
      For i ∈ 1…k Do R_i ← newR
      While newR ≠ 0^m and j ≠ 0 Do
        oldR ← R_0
        newR ← (oldR << 1) & B[t_{pos+j}]
        R_0 ← newR
        For i ∈ 1…k Do
          newR ← ((R_i << 1) & B[t_{pos+j}]) | oldR | ((oldR | newR) << 1)
          oldR ← R_i, R_i ← newR
        End of for
        j ← j - 1
        If newR & 1 0^{m-1} ≠ 0^m Then      /* prefix recognized */
          If j > 0 Then last ← j
          Else check a possible occurrence starting at pos+1
        End of if
      End of while
      pos ← pos + last
    End of while

Summary

What is approximate string matching?
  It is the problem of finding substrings that match $P$ within edit distance $k$.

Dynamic programming approach:
  $O(mn)$ time and $O(m)$ space → can be improved to $O(kn)$ on average (DP)

NFA approach:
  Construct an NFA that accepts $P$ with $k$ errors → translate it into the corresponding DFA and then simulate it
  Bit-parallel simulation of the NFA:
    Row-wise (BPR): $O(k \lceil m/w \rceil n)$ time
    Diagonal-wise (BPD): $O(\lceil k(m-k)/w \rceil n)$ time

Filtering approach:
  It finds occurrences without checking most of the text in detail
  Pattern division method (PEX), BNDM method (ABNDM)

The next theme:
  Regular expression matching: for flexible and convenient keyword searching
