Special lecture on Information Knowledge Network - Information retrieval and pattern matching - The 4th: Approximate string matching. Takuya Kida, IKN Laboratory, Division of Computer Science and Information Technology. 2018/11/22


Page 1: Special lecture on Information Knowledge Network

Special lecture on Information Knowledge Network
- Information retrieval and pattern matching -

The 4th: Approximate string matching

Takuya Kida
IKN Laboratory, Division of Computer Science and Information Technology

2018/11/22

Page 2: Special lecture on Information Knowledge Network

Today's contents

What is approximate pattern matching?

Dynamic programming approach

NFA-based approach
  Bit-parallel simulation (BPR, BPD)

Filtering approach
  Pattern division method (PEX)
  NFA method (ABNDM)

Page 3: Special lecture on Information Knowledge Network

Let's talk with an intelligent computer!

Who was that painter in the Baroque era? He painted a famous picture called "The Night Watch"... Er..., surely, was he... Rembbright?

It's Rembrandt.

Yes! Rembrandt! I remember! ...And who was that other painter? He painted many genre paintings in the same era. Very popular in Japan. Er..., not Wermer...

...Perhaps Vermeer?

That's right! Vermeer! Both of them were from the Netherlands, weren't they? By the way, did you know Tera-sgees, who was in the same era?

Velázquez! ヽ(`Д´)ノ #


Page 6: Special lecture on Information Knowledge Network

What is approximate pattern matching?

It is the problem of finding the positions of substrings of a given text whose edit distance to a given pattern is less than or equal to $k$.

The edit distance $\mathrm{ed}(x, y)$ is defined as the minimum cost $d$ of transforming string $x$ into string $y$ with character edit operations: insertion, deletion, and substitution.

Example with $k = 2$ (in general $0 < k < m$, where $m$ is the pattern length), comparing against MARRIAGE:

  CARRIAGE: $d = 1$ → OK
  MASSAGE:  $d = 3$ → Bad

  MARRIAGE
  MASS AGE   (substitute R with S twice, delete I)

  ed(MARRIAGE, MASSAGE) = 3

Page 7: Special lecture on Information Knowledge Network

Edit distance
How much do two strings look alike?

similarity ⇔ edit distance between strings (dissimilarity)

Variations of edit distance:
  Levenshtein distance: the costs of all operations are equal to 1.
  Hamming distance: only substitution is allowed.
  Weighted-cost edit distance: the cost of each operation may differ.
  Unrestricted-cost edit distance: the cost is different for each character pair.
  Damerau distance: character transposition is also permitted.
  Indel distance: substitution is not allowed (insertion + deletion = indel).

(from Heikki Hyyrö [SOFSEM2005])

Hereafter, we mainly deal with the Levenshtein distance.
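
To make two of the variations above concrete, here is a minimal Python sketch of the Hamming distance and the indel distance. The function names `hamming` and `indel` are chosen for illustration only and do not come from the lecture.

```python
def hamming(x: str, y: str) -> int:
    """Hamming distance: only substitutions, so both strings must have equal length."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def indel(x: str, y: str) -> int:
    """Indel distance: only insertions and deletions are allowed."""
    prev = list(range(len(y) + 1))          # distances from the empty prefix of x
    for i, a in enumerate(x, start=1):
        cur = [i]                           # deleting i characters of x
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1])     # characters agree: no cost
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))  # delete a or insert b
        prev = cur
    return prev[-1]
```

For example, `hamming("MARRIAGE", "CARRIAGE")` is 1, while `indel("MARRIAGE", "CARRIAGE")` is 2, because the single substitution has to be simulated by one deletion and one insertion.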

Page 8: Special lecture on Information Knowledge Network

Application examples

Calculating the similarity between DNA sequences

Spell checker / Searching with ambiguity
  Orthographic variation: Carpaccio ⇔ Caravaggio
  (Searching with a thesaurus is another related topic.)

Retrieval of similar sentences
  An advanced retrieval can be realized by combining with natural language processing.
  A sentence = a sequence of morphemes ≈ a string
  (Consider each morpheme as a meta character.)

Similar music retrieval
  Finding a similar phrase in MIDI data
  Retrieval with humming recognition
  (Using Dynamic Time Warping (DTW).)

Search on OCR data
  OCRed data often contains mistakes.

Applications to real data mining
  Web mining using approximate string matching algorithms (T. Nakato, Kyushu Univ.)

Page 9: Special lecture on Information Knowledge Network

Dynamic programming approach

The way of calculating the edit distance by dynamic programming (DP) has been known since the 1960s. However, the well-known algorithm for pattern matching was presented by Sellers in 1980.

P. H. Sellers, The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms, 1(4):359-373, 1980.
H. Sakoe and S. Chiba, A Dynamic Programming Algorithm Optimization for Spoken Word Recognition, IEEE Trans. on Acoust., Speech and Signal Proc., Vol. ASSP-26, No. 1, pp. 43-49, 1978.

How to calculate $\mathrm{ed}(x, y)$:
Let $M_{i,j} = \mathrm{ed}(x[1:i], y[1:j])$, i.e., $M_{|x|,|y|} = \mathrm{ed}(x, y)$. Then,

$$M_{0,0} \leftarrow 0, \qquad M_{i,j} \leftarrow \min\bigl(M_{i-1,j-1} + \delta(x[i], y[j]),\; M_{i-1,j} + 1,\; M_{i,j-1} + 1\bigr),$$

where $\delta(a, b)$ is defined as 0 if $a = b$, otherwise 1.

Efficient recursive formulas for doing the same calculations are:

$$M_{i,0} \leftarrow i, \qquad M_{0,j} \leftarrow j,$$
$$M_{i,j} \leftarrow \begin{cases} M_{i-1,j-1} & (\text{if } x[i] = y[j]) \\ 1 + \min(M_{i-1,j-1},\, M_{i-1,j},\, M_{i,j-1}) & (\text{otherwise}) \end{cases}$$
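
The recurrence can be written down almost literally. The following Python sketch fills the whole table $M$ and returns its bottom-right cell; the function name `edit_distance` is an illustrative choice, and keeping the full table (instead of one column) is a simplification for readability.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance ed(x, y) using the table M[i][j] = ed(x[1:i], y[1:j])."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i                      # delete i characters
    for j in range(1, n + 1):
        M[0][j] = j                      # insert j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:     # Python strings are 0-indexed
                M[i][j] = M[i - 1][j - 1]
            else:
                M[i][j] = 1 + min(M[i - 1][j - 1], M[i - 1][j], M[i][j - 1])
    return M[m][n]
```

With this sketch, `edit_distance("MARRIAGE", "MASSAGE")` gives 3 and `edit_distance("annual", "annealing")` gives 4.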

Page 10: Special lecture on Information Knowledge Network

Why is the calculation correct?

We prove it by induction. Let $M_{0,0} = 0$ for two empty strings. Now we want to obtain $\mathrm{ed}(x[1:i], y[1:j]) = M_{i,j}$. Assume that we have $\mathrm{ed}(x[1:i'], y[1:j'])$ for all $i' \le i$ and $j' \le j$ other than $(i', j') = (i, j)$. Then, we consider the cost of transforming $x[1:i]$ into $y[1:j]$.

If $x[i] = y[j]$, then we can simply transform $x[1:i-1]$ into $y[1:j-1]$ with the minimum cost $M_{i-1,j-1}$. In this case, $M_{i,j} = M_{i-1,j-1}$ holds.

If $x[i] \ne y[j]$, then we have three cases:
  substituting $x[i]$ with $y[j]$, and changing $x[1:i-1]$ to $y[1:j-1]$ with cost $M_{i-1,j-1}$;
  deleting $x[i]$, and changing $x[1:i-1]$ to $y[1:j]$ with cost $M_{i-1,j}$;
  inserting $y[j]$ at the end of $x[1:i]$, and changing $x[1:i]$ to $y[1:j-1]$ with cost $M_{i,j-1}$.

We choose the minimum among the above and add 1 for the edit operation itself.

[Figure: the three cases for $x[i] \ne y[j]$. Deletion: $M_{i-1,j} + 1$. Substitution: $M_{i-1,j-1} + 1$. Insertion: $M_{i,j-1} + 1$. In the table, $M_{i,j}$ is computed from $M_{i-1,j-1}$ (adding $\delta(x[i], y[j])$), from $M_{i-1,j}$ (adding 1), and from $M_{i,j-1}$ (adding 1).]

Page 11: Special lecture on Information Knowledge Network

How to detect the pattern occurrences

$M_{i,j}$ for ed(annual, annealing):

       a  n  n  e  a  l  i  n  g
    0  1  2  3  4  5  6  7  8  9
 a  1  0  1  2  3  4  5  6  7  8
 n  2  1  0  1  2  3  4  5  6  7
 n  3  2  1  0  1  2  3  4  5  6
 u  4  3  2  1  1  2  3  4  5  6
 a  5  4  3  2  2  1  2  3  4  5
 l  6  5  4  3  3  2  1  2  3  4

$M_{|x|,|y|}$ = ed(annual, annealing) = 4

Approximate string matching for $P$ = annual, $T$ = annealing, $k = 2$:

       a  n  n  e  a  l  i  n  g
    0  0  0  0  0  0  0  0  0  0
 a  1  0  1  1  1  0  1  1  1  1
 n  2  1  0  1  2  1  1  2  1  2
 n  3  2  1  0  1  2  2  2  2  2
 u  4  3  2  1  1  2  3  3  3  3
 a  5  4  3  2  2  1  2  3  4  4
 l  6  5  4  3  3  2  1  2  3  4

For any $j = 0 \ldots n$, all that we have to do is to set $M_{0,j} = 0$. This means that the empty string $\varepsilon$ matches anywhere in a given text with 0 errors. An occurrence ends at text position $j$ whenever $M_{m,j} \le k$ (here at $j = 5, 6, 7$).

$O(mn)$ time and $O(m)$ space
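
The search variant differs from the plain distance computation only in the top row being all zeros, and one column at a time is enough. Below is a minimal Python sketch under that reading; `sellers_search` is a name chosen here for illustration, and positions are reported 1-based as end positions in the text.

```python
def sellers_search(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions where pattern matches text with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))            # column j = 0: M[i][0] = i
    ends = []
    for pos, c in enumerate(text, start=1):
        diag = col[0]                   # M[i-1][j-1], starting with M[0][j-1] = 0
        # col[0] stays 0: the empty pattern prefix matches everywhere
        for i in range(1, m + 1):
            left = col[i]               # M[i][j-1], about to be overwritten
            if pattern[i - 1] == c:
                col[i] = diag
            else:
                col[i] = 1 + min(diag, left, col[i - 1])
            diag = left
        if col[m] <= k:
            ends.append(pos)
    return ends
```

For $P$ = annual, $T$ = annealing, and $k = 2$, the sketch returns `[5, 6, 7]`, the columns whose bottom cell is at most 2 in the table. It keeps a single column of $m + 1$ cells, which is where the $O(m)$ space bound comes from.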

Page 12: Special lecture on Information Knowledge Network

Improving the average time complexity

The given pattern seldom occurs in the text!
During the calculation of each column, the values usually reach $k+1$ before reaching the bottom (that is, a mismatch occurs at the current position).
A cell whose value is $k+1$ or larger does not affect the final result.
If the value of a cell is less than or equal to $k$, we call it active. The average time complexity can be reduced to $O(kn)$ by calculating only active cells. (This improved algorithm is called DP.)

E. Ukkonen. Finding approximate patterns in strings. Journal of Algorithms, 6(1-3):132-137, 1985.

DP calculation for $P$ = annual, $T$ = annealing, $k = 2$:

       a  n  n  e  a  l  i  n  g
    0  0  0  0  0  0  0  0  0  0
 a  1  0  1  1  1  0  1  1  1  1
 n  2  1  0  1  2  1  1  2  1  2
 n  3  2  1  0  1  2  2  2  2  2
 u  4  3  2  1  1  2  3  3  3  3
 a  5  4  3  2  2  1  2  3  4  4
 l  6  5  4  3  3  2  1  2  3  4

$O(mn)$ time in the worst case
$O(kn)$ time on average

Page 13: Special lecture on Information Knowledge Network

Pseudo code of DP algorithm

DP (P = p1p2…pm, T = t1t2…tn, k)
  Preprocessing:
    For i ∈ 0…m Do Ci ← i
    lact ← k + 1   /* last active cell */
  Searching:
    For pos ∈ 1…n Do
      pC ← 0, nC ← 0
      For i ∈ 1…lact Do
        If pi = tpos Then nC ← pC
        Else
          If pC < nC Then nC ← pC
          If Ci < nC Then nC ← Ci
          nC ← nC + 1
        End of if
        pC ← Ci, Ci ← nC
      End of for
      While Clact > k Do lact ← lact - 1
      If lact = m Then report an occurrence at pos
      Else lact ← lact + 1
    End of for
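
As a runnable counterpart, here is a Python sketch that follows the pseudocode step by step, keeping the names `C`, `lact`, `pC`, and `nC`. The function name `dp_cutoff_search` is an illustrative choice, and the sketch assumes $0 < k < m$ as stated earlier in the lecture.

```python
def dp_cutoff_search(pattern: str, text: str, k: int) -> list[int]:
    """Column-wise DP that only updates cells up to the last active one (value <= k)."""
    m = len(pattern)
    C = list(range(m + 1))              # current column, C[i] = i initially
    lact = k + 1                        # last active cell (assumes k < m)
    ends = []
    for pos, c in enumerate(text, start=1):
        pC, nC = 0, 0                   # diagonal value and freshly computed value
        for i in range(1, lact + 1):
            if pattern[i - 1] == c:
                nC = pC
            else:
                nC = 1 + min(pC, nC, C[i])
            pC, C[i] = C[i], nC
        while C[lact] > k:              # C[0] = 0 <= k, so this always stops
            lact -= 1
        if lact == m:
            ends.append(pos)            # bottom cell is active: occurrence ends here
        else:
            lact += 1                   # the next column may activate one more cell
    return ends
```

On $P$ = annual, $T$ = annealing, $k = 2$ it reports `[5, 6, 7]`, the same positions as the full table, while touching only the cells at or above `lact` in each column.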

Page 14: Special lecture on Information Knowledge Network

NFA-based approach

Pattern matching is done by simulating this NFA after translating it to a DFA.
This was originally proposed by Ukkonen [1985], and several improvements have been proposed since.
Translating to a corresponding DFA increases the number of states to $O(\min(3^m, m(2m|\Sigma|)^k))$. Therefore, it is not practical when $m$ is large.

[Figure: an NFA that accepts $P$ = annual allowing 2 errors. It has three rows of states (no error, 1 error, 2 errors); each row reads a, n, n, u, a, l, and the initial state has a self-loop on any $a \in \Sigma$. Legend: match (pattern character), ins ($\Sigma$), sub ($\Sigma$), del ($\varepsilon$). The active states are shown after reading $T$ = anneal.]

E. Ukkonen. Finding approximate patterns in strings. Journal of Algorithms, 6(1-3):132-137, 1985.
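
Before turning to bit-parallelism, the NFA can be simulated directly with a set of active states. The Python sketch below is only an illustration of the automaton, not the DFA construction: a state `(i, e)` means that `i` pattern characters have been consumed with `e` errors, and the name `nfa_search` is hypothetical.

```python
def nfa_search(pattern: str, text: str, k: int) -> list[int]:
    """Simulate the k-error NFA with an explicit set of active states (i, e)."""
    m = len(pattern)

    def closure(states):
        # Follow epsilon transitions (deletion of a pattern character).
        seen, stack = set(states), list(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    active = closure({(0, 0)})
    ends = []
    for pos, c in enumerate(text, start=1):
        nxt = {(0, 0)}                          # self-loop on the initial state
        for i, e in active:
            if i < m and pattern[i] == c:
                nxt.add((i + 1, e))             # match
            if e < k:
                nxt.add((i, e + 1))             # insertion of a text character
                if i < m:
                    nxt.add((i + 1, e + 1))     # substitution
        active = closure(nxt)
        if any(i == m for i, _ in active):      # some final state is active
            ends.append(pos)
    return ends
```

For $P$ = annual, $T$ = annealing, $k = 2$ this reports `[5, 6, 7]`. The set-based simulation is slow, which is what motivates packing the states into machine words.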

Page 15: Special lecture on Information Knowledge Network

Row-wise bit-parallel for the NFA (BPR)

Pack the states of each row into one bit vector (1 indicates active, 0 indicates non-active) and simulate the move of the whole NFA by a bit-parallel technique.
It needs $k+1$ bit masks, each $m$ bits long.
The formulas to update the $i$-th row state $R_i$ into the new $R'_i$:

$$R'_0 \leftarrow ((R_0 \ll 1) \mid 0^{m-1}1) \;\&\; B[t_j]$$
$$R'_i \leftarrow ((R_i \ll 1) \;\&\; B[t_j]) \mid R_{i-1} \mid (R_{i-1} \ll 1) \mid (R'_{i-1} \ll 1) \mid 0^{m-1}1$$

[Figure: the NFA that accepts $P$ = annual allowing 2 errors, with the active states after reading $T$ = anneal. The row vectors are: no error 000000, 1 error 100011, 2 errors 110111. Note that the bit order is reversed.]

$O(k \lceil m/w \rceil n)$ time; $O(kn)$ time if $m \le w$, where $w$ is the machine word length.

Multiple rows can be packed into one vector!

S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.

Page 16: Special lecture on Information Knowledge Network

Pseudo code of BPR

BPR (P = p1p2…pm, T = t1t2…tn, k)
  Preprocessing:
    For c ∈ Σ Do B[c] ← 0^m
    For j ∈ 1…m Do B[pj] ← B[pj] | 0^{m-j} 1 0^{j-1}
  Searching:
    For i ∈ 0…k Do Ri ← 0^{m-i} 1^i
    For pos ∈ 1…n Do
      oldR ← R0
      newR ← ((oldR << 1) | 0^{m-1} 1) & B[tpos]
      R0 ← newR
      For i ∈ 1…k Do
        newR ← ((Ri << 1) & B[tpos]) | oldR | ((oldR | newR) << 1) | 0^{m-1} 1
        oldR ← Ri, Ri ← newR
      End of for
      If newR & 1 0^{m-1} ≠ 0^m Then report an occurrence at pos
    End of for
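
The pseudocode maps closely onto Python integers used as bit vectors. The sketch below is one possible transcription: the name `bpr_search` and the choice to also return the final row vectors are assumptions made for illustration, and an explicit mask keeps every row at $m$ bits because Python integers are unbounded.

```python
def bpr_search(pattern: str, text: str, k: int):
    """Row-wise bit-parallel simulation; returns (end positions, final rows R_0..R_k)."""
    m = len(pattern)
    mask = (1 << m) - 1                      # keep vectors m bits wide
    B = {}
    for j, ch in enumerate(pattern):         # bit j is set iff pattern[j] == ch
        B[ch] = B.get(ch, 0) | (1 << j)
    R = [(1 << i) - 1 for i in range(k + 1)] # R_i = 0^(m-i) 1^i
    accept = 1 << (m - 1)                    # bit of the final state
    ends = []
    for pos, c in enumerate(text, start=1):
        b = B.get(c, 0)
        old = R[0]
        new = ((old << 1) | 1) & b
        R[0] = new
        for i in range(1, k + 1):
            new = (((R[i] << 1) & b) | old | ((old | new) << 1) | 1) & mask
            old, R[i] = R[i], new
        if new & accept:
            ends.append(pos)
    return ends, R
```

After reading `anneal` with $P$ = annual and $k = 2$, the returned rows are `0b000000`, `0b100011`, and `0b110111`, which agrees with the vectors shown on the BPR slide. On the full text `annealing` the end positions are `[5, 6, 7]`.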

Page 17: Special lecture on Information Knowledge Network

Diagonal-wise bit-parallel for the NFA (BPD)

Pack the states diagonally by representing the depth of the active state in each diagonal in unary (needing $k+1$ bits), and combine them into one bit vector.
It needs '0's for representing the boundaries, so the total length of the vector becomes $(m-k)(k+2)$ bits.
The formulas to update when reading the $j$-th character $t_j$:

$$D'_i \leftarrow \min(D_i + 1,\; D_{i+1} + 1,\; g(i-1, t_j))$$
$$g(i, c) = \min(\{k+1\} \cup \{r \mid r \ge D_i \text{ and } p_{i+1+r} = c\})$$

The 1st item is for sub, the 2nd item is for ins, and the 3rd item is for match.
$D_i = 3 = [111]$ if there is no active state (here $k = 2$).
Bit masks like those of Shift-Or are used.

[Figure: the NFA that accepts $P$ = annual allowing 2 errors, with rows no error, 1 error, 2 errors and diagonals labeled D0, D1, D2, D3, D4. Example vector, with $k+1$ bits per diagonal separated by '0's:]

  D = 0 001 0 111 0 011 0 011
        D1    D2    D3    D4

R. A. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2):127-158, 1999.

Page 18: Special lecture on Information Knowledge Network

Pseudo code of BPD

BPD (P = p1p2…pm, T = t1t2…tn, k)
  Preprocessing:
    For c ∈ Σ Do B[c] ← 1^m
    For j ∈ 1…m Do B[pj] ← B[pj] & 1^{m-j} 0 1^{j-1}
    For c ∈ Σ Do
      BB[c] ← 0 s_{k+1}(B[c], 0) 0 s_{k+1}(B[c], 1) … 0 s_{k+1}(B[c], m-k-1)
    End of for
  Searching:
    D ← (0 1^{k+1})^{m-k}
    For pos ∈ 1…n Do
      x ← (D >> (k+2)) | BB[tpos]
      D ← ((D << 1) | (0^{k+1} 1)^{m-k})                        /* D_i + 1 */
          & ((D << (k+3)) | (0^{k+1} 1)^{m-k-1} 0 1^{k+1})       /* D_{i+1} + 1 */
          & (((x + (0^{k+1} 1)^{m-k}) ∧ x) >> 1)                 /* g(i-1, tpos) */
          & (0 1^{k+1})^{m-k}                                    /* clean up */
      If D & 0^{(m-k-1)(k+2)} 0 1 0^k = 0^{(m-k)(k+2)} Then
        Report an occurrence at pos
        D ← D | 0^{(m-k-1)(k+2)} 0 1^{k+1}
      End of If
    End of for

Page 19: Special lecture on Information Knowledge Network

Filtering approach: Pattern division method

Idea of the filtering approach:
  It is easier to say "Here is not an occurrence" than "Here is an occurrence".
  → Find the candidates rapidly, then check them in detail!
  This improves the average complexity.
  Actually, it works well when the error rate ($\alpha = k/m$) is small.

Pattern division method:
  Divide a given pattern into $k+1$ pieces. With at most $k$ errors, at least one of the pieces must occur without error.
  Then, find each piece using a fast multiple pattern matching algorithm (Multiple Shift-And or Set Horspool).
  When a piece is found, run an ordinary approximate string matching algorithm (such as DP) over the neighborhood of the occurrence to check if the pattern matches.

Example for $k = 2$:
  Pattern: TAAATCACGGCATACT
  $k+1$ pieces: TAAAT, CACGG, CATACT
  Text: ACCCTGTTTAGATCACGGCACTACTGTAAAC

S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.
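
The pattern division method can be sketched in a few lines of Python. This is a simplified illustration, not the lecture's implementation: it uses the built-in `str.find` for each piece instead of a real multiple pattern matching algorithm, it verifies each candidate window with the column-wise DP, and the names `verify_ends`, `split_pattern`, and `partition_search` are hypothetical. It assumes $0 < k < m$ so that no piece is empty.

```python
def verify_ends(pattern: str, text: str, k: int) -> list[int]:
    """Column-wise DP: 1-based end positions in text with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for pos, c in enumerate(text, start=1):
        diag = col[0]
        for i in range(1, m + 1):
            left = col[i]
            col[i] = diag if pattern[i - 1] == c else 1 + min(diag, left, col[i - 1])
            diag = left
        if col[m] <= k:
            ends.append(pos)
    return ends


def split_pattern(pattern: str, pieces: int):
    """Split pattern into consecutive pieces; returns (offset, piece) pairs."""
    m = len(pattern)
    bounds = [m * i // pieces for i in range(pieces + 1)]
    return [(bounds[i], pattern[bounds[i]:bounds[i + 1]]) for i in range(pieces)]


def partition_search(pattern: str, text: str, k: int) -> list[int]:
    """Filter with exact piece matches, then verify only the neighborhoods."""
    m, n = len(pattern), len(text)
    found = set()
    for off, piece in split_pattern(pattern, k + 1):
        start = text.find(piece)
        while start != -1:
            # An occurrence containing this piece lies within k of the aligned window.
            lo = max(0, start - off - k)
            hi = min(n, start + (m - off) + k)
            for end in verify_ends(pattern, text[lo:hi], k):
                found.add(lo + end)
            start = text.find(piece, start + 1)
    return sorted(found)
```

On the slide's example, `split_pattern("TAAATCACGGCATACT", 3)` yields the pieces TAAAT, CACGG, and CATACT. Only CACGG occurs exactly in the text, so a single window is verified, and the reported end positions include 25 (the occurrence TAGATCACGGCACTACT, with one substitution and one extra character).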

Page 20: Special lecture on Information Knowledge Network

Speeding-up by hierarchical verification (PEX)

Checking candidates hierarchically can reduce the processing time.
Assume that $j = k + 1 = 2^r$. Halve a given pattern, allowing $\lfloor k/2 \rfloor$ errors for each half, and repeat the division recursively until each piece allows 0 errors.
Find the pieces using a multiple pattern matching algorithm, and then check the candidates hierarchically.

CreateTree (p = pi pi+1…pj, k, myParent, idx, plen)
  Create new node
  from(node) ← i
  to(node) ← j
  left ← ⌈(k+1)/2⌉
  parent(node) ← myParent
  err(node) ← k
  If k = 0 Then leaf_idx ← node
  Else
    CreateTree(p_{i…i+left·plen-1}, ⌊(left·k)/(k+1)⌋, node, idx, plen)
    CreateTree(p_{i+left·plen…j}, ⌊((k+1-left)·k)/(k+1)⌋, node, idx+left, plen)
  End of If

[Figure: hierarchical division of the pattern aaabbbcccddd. $k = 3$ errors: aaabbbcccddd. $k = 1$ error: aaabbb, cccddd. $k = 0$ errors: aaa, bbb, ccc, ddd. Add padding when $k+1$ is not of the form $2^r$.]

G. Navarro and R. Baeza-Yates. Very fast and simple approximate string matching. Information Processing Letters, 72:65-70, 1999.

Page 21: Special lecture on Information Knowledge Network

Pseudo code of PEX

PEX (P = p1p2…pm, T = t1t2…tn, k)
  Preprocessing:
    CreateTree(p, k, θ, 0, ⌊m/(k+1)⌋)
    Preprocess multipattern search for
      {p_{from(node)}…p_{to(node)} | node = leaf_i, i ∈ {0…k}}
  Searching:
    For (pos, i) ∈ output of multipattern search Do
      node ← leaf_i
      in ← from(node)
      node ← parent(node)
      cand ← TRUE
      While cand = TRUE and node ≠ θ Do
        p1 ← pos - (in - from(node)) - err(node)
        p2 ← pos + (to(node) - in + 1) + err(node)
        Verify text area T_{p1…p2} for pattern piece p_{from(node)…to(node)}
          allowing err(node) errors
        If pattern piece was not found Then cand ← FALSE
        Else node ← parent(node)
      End of while
      If cand = TRUE Then
        Report the positions where the whole p was found
      End of If
    End of for

Page 22: Special lecture on Information Knowledge Network

Filtering approach: BNDM method (ABNDM)

Construct an NFA that accepts any factor of $P^R$ for a given pattern $P$, allowing $k$ errors → an extension of BNDM

The NFA can tell if the input is a prefix of $P^R$ with $k$ errors.
BNDM runs faster than BM when the alphabet size is small enough.
We can quickly extract candidate positions with this NFA.
It can skip several text positions like BNDM.

For texts whose alphabet size is small, such as DNA sequences, ABNDM runs faster than PEX.

[Figure: the NFA used by ABNDM for $P$ = annual, with three rows of states (no error, 1 error, 2 errors), $\Sigma$ and $\varepsilon$ transitions between the rows, and additional $\varepsilon$-transitions from the initial state.]

G. Navarro and R. Baeza-Yates. Very fast and simple approximate string matching. Information Processing Letters, 72:65-70, 1999.

Page 23: Special lecture on Information Knowledge Network

Pseudo code of ABNDM

ABNDM (P = p1p2…pm, T = t1t2…tn, k)
  Preprocessing:
    For c ∈ Σ Do B[c] ← 0^m
    For j ∈ 1…m Do B[pj] ← B[pj] | 0^{m-j} 1 0^{j-1}
  Searching:
    pos ← 0
    While pos ≤ n - (m - k) Do
      j ← m - k - 1, last ← m - k - 1
      R0 ← B[t_{pos+m-k}]
      newR ← 1^m
      For i ∈ 1…k Do Ri ← newR
      While newR ≠ 0^m and j ≠ 0 Do
        oldR ← R0
        newR ← (oldR << 1) & B[t_{pos+j}]
        R0 ← newR
        For i ∈ 1…k Do
          newR ← ((Ri << 1) & B[t_{pos+j}]) | oldR | ((oldR | newR) << 1)
          oldR ← Ri, Ri ← newR
        End of for
        j ← j - 1
        If newR & 1 0^{m-1} ≠ 0^m Then   /* prefix recognized */
          If j > 0 Then last ← j
          Else check a possible occurrence starting at pos+1
        End of if
      End of while
      pos ← pos + last
    End of while

Page 24: Special lecture on Information Knowledge Network

Summary

What is approximate string matching?
  It is the problem of finding substrings that match $P$ within edit distance $k$.

Dynamic programming approach:
  $O(mn)$ time and $O(m)$ space → can be improved to $O(kn)$ on average (DP)

NFA approach:
  It constructs an NFA that accepts $P$ with $k$ errors → translate it to a corresponding DFA and then simulate it
  Bit-parallel simulation of the NFA:
    Row-wise (BPR): $O(k \lceil m/w \rceil n)$ time
    Diagonal-wise (BPD): $O(\lceil k(m-k)/w \rceil n)$ time

Filtering approach:
  It finds occurrences without checking most of the text in detail
  Pattern division method (PEX), BNDM method (ABNDM)

The next theme:
  Regular expression matching: for flexible and convenient keyword searching