Download pdf - Special Lecture on IKN

Transcript
Page 1: Special Lecture on IKN

Special lecture on Information Knowledge Network-Information retrieval and pattern matching-

The 1st. Preliminary: terms and definitions

Takuya kidaIKN Laboratory,

Division of Computer Science and Information Technology

2018/11/22Special lecture on IKN

Page 2: Special Lecture on IKN

Today’s contents

What is the pattern matching problem?

Sequential search vs. indexing

Basic terms of text algorithms

Finite automatons

2

Page 3: Special Lecture on IKN

What is the pattern matching problem?

Given a pattern 𝑃𝑃 and a text 𝑇𝑇, the problem is to find occurrences of 𝑃𝑃 in 𝑇𝑇.

3

We introduce a general framework which is suitable tocapture an essence of compressed pattern matchingaccording to various dictionary based compressions. Thegoal is to find all occurrences of a pattern in a textwithout decompression, which is one of the most activetopics in string matching. Our framework includes suchcompression methods as Lempel-Ziv family, (LZ77, LZSS,LZ78, LZW), byte-pair encoding, and the static dictionarybased method. Technically, our pattern matching algorithmextremely extends that for LZW compressed text presentedby Amir, Benson and Farach [Amir94]..

text 𝑇𝑇:

pattern 𝑃𝑃: compress

Well-known algorithmsοΌšβ€’ KMP method (Knuth&Morris&Pratt[1974])β€’ BM method (Boyer&Moore[1977])β€’ Karp-Rabin method (Karp&Rabin[1987])

Page 4: Special Lecture on IKN

text 𝑇𝑇 : pururunpurunfamifamifapattern 𝑃𝑃: famifa

Existence problem:

text 𝑇𝑇 : pururunpurunfamifamifapattern 𝑃𝑃: famifa

All-occurrences problem:

When we retrieve a document from a set of documents, it is enough to solve the existence problem. However, β€œpattern matching problem” often means solving the all-occurrences problem. 4

Existence problem and all-occurrences problem

Yes!

8 11

Page 5: Special Lecture on IKN

Two approaches to the problem

5

Retrieve using index structuresadvantages: Very fast High scalability

disadvantages: Preprocessing for indexing Less flexible for data updating Large memory consumption

Retrieve using sequential search algorithmsadvantages: No extra data structures Flexible for data updating

disadvantages: Slow Low scalability

For small scale group of documents(e.g. grep of UNIX)

For large scale DB(e.g. namazu, sufary, Google, etc.)

𝑂𝑂(𝑛𝑛)

𝑂𝑂(π‘šπ‘š log𝑛𝑛)

Main topic of this classIndex structures

(inverted file, suffix tree/array, etc.)

Page 6: Special Lecture on IKN

Computational complexity

To evaluate an algorithm, it is necessary to clearly determine the computational complexity of the algorithm

– How much time or how many memory space do we need to calculate for the input data of length 𝑛𝑛?

big-𝑂𝑂 notation:– Definition: let 𝑓𝑓 and 𝑔𝑔 be functions from an integer to an integer. For

some constant 𝐢𝐢 and 𝑁𝑁, if 𝑓𝑓 𝑛𝑛 < 𝐢𝐢・𝑔𝑔(𝑛𝑛) holds for any 𝑛𝑛 > 𝑁𝑁, then we denote 𝑓𝑓 𝑛𝑛 = 𝑂𝑂(𝑔𝑔(𝑛𝑛)). (we say that 𝑓𝑓 is order of 𝑔𝑔.)

– If 𝑓𝑓 and 𝑔𝑔 are the same order, that is, both 𝑓𝑓(𝑛𝑛) = 𝑂𝑂(𝑔𝑔(𝑛𝑛)) and 𝑔𝑔(𝑛𝑛) = 𝑂𝑂(𝑓𝑓(𝑛𝑛)) hold, then we denote 𝑓𝑓 = Θ(𝑔𝑔).

– Let 𝑇𝑇(𝑛𝑛) be a function of the calculation time for the input of length 𝑛𝑛, and assume that 𝑇𝑇(𝑛𝑛) = 𝑂𝑂(𝑔𝑔(𝑛𝑛)). This means that β€œIt asymptotically takes only the time proportional to 𝑔𝑔 𝑛𝑛 at most.”

6e.g. 𝑂𝑂 𝑛𝑛 , 𝑂𝑂(𝑛𝑛・log𝑛𝑛)

indicates the upper bound of asymptotic complexity(There is 𝛺𝛺 notation to indicate the lower bound)

Page 7: Special Lecture on IKN

Basic terms for text algorithms

Ξ£ : a finite set of non-empty characters (letters, symbols), called (finite) alphabet. Ex) Ξ£ = a, b, c, … , z , Ξ£ = 0,1 , Ξ£ = {0x00, 0x01, … , 0xFF}

π‘₯π‘₯ ∈ Ξ£βˆ— : a string (word, text), which is a sequence of characters.|π‘₯π‘₯| : length of string π‘₯π‘₯. Ex) aba = 3πœ€πœ€ : the empty string, which is the string whose length is equal to 0,

namely πœ€πœ€ = 0.π‘₯π‘₯[𝑖𝑖] : the 𝑖𝑖-th character of string π‘₯π‘₯.π‘₯π‘₯ 𝑖𝑖: 𝑗𝑗 : the consecutive sequence of characters from the 𝑖𝑖-th to

the 𝑗𝑗-th of string π‘₯π‘₯, which is called the factor or substring of π‘₯π‘₯. If 𝑖𝑖 > 𝑗𝑗, we assume π‘₯π‘₯[𝑖𝑖: 𝑗𝑗] = πœ€πœ€ for convenience. It is often also denoted as π‘₯π‘₯[𝑖𝑖. . 𝑗𝑗] or π‘₯π‘₯[𝑖𝑖, 𝑗𝑗].

π‘₯π‘₯𝑅𝑅 : the reversed string π‘Žπ‘Žπ‘˜π‘˜π‘Žπ‘Žπ‘˜π‘˜βˆ’1 β€¦π‘Žπ‘Ž1 for π‘₯π‘₯ = π‘Žπ‘Ž1 π‘Žπ‘Ž2 … π‘Žπ‘Žπ‘˜π‘˜ ∈ Ξ£βˆ—.7

Page 8: Special Lecture on IKN

Prefix, factor, and suffix

Factor π‘₯π‘₯[1: 𝑖𝑖] is especially called a prefix of π‘₯π‘₯, and π‘₯π‘₯[𝑖𝑖: |π‘₯π‘₯|] is especially called a suffix of π‘₯π‘₯.

8

Note that the difference with a factor!

factor

ccococcoco

aoa

coaocoao

ococo

prefix suffix

w = cocoa

πœ€πœ€cocoa

For strings π‘₯π‘₯ and 𝑦𝑦, we say π‘₯π‘₯ is a subsequence of 𝑦𝑦 if the string obtained by removing 0 or more characters from 𝑦𝑦 is identical with π‘₯π‘₯.

e.g.: π‘₯π‘₯ = abba is a subsequence of 𝑦𝑦 = aaababaab.

Page 9: Special Lecture on IKN

Finite Automaton

It is automatic machine (automata) that determines if the input string is acceptable or not. It repeatedly changes its internal state reading characters from the input one by one. There are several variations according to the definition of the state transitions and the use of auxiliary spaces. (e.g. non-deterministic automaton, sequential machine, pushdown automaton, etc.)

9

References:"Automaton and computability," S. Arikawa and S. Miyano, BAIFU-KAN (written in Japanese)"Formal language and automaton," Written by E. Moriya, SAIENSU-SHA (written in Japanese)

fundamental concept on computer science!

It deeply relates to the pattern matching

It has the ability to define a (formal) language; it is often used for lexical analysis of strings.

π‘žπ‘ž

input tape:

head:(keep the internal state)

π‘Žπ‘Ž1 π‘Žπ‘Ž2 π‘Žπ‘Ž3 π‘Žπ‘Ž4 π‘Žπ‘Ž5 π‘Žπ‘Ž6 π‘Žπ‘Ž7 π‘Žπ‘Ž8 ・・・

Page 10: Special Lecture on IKN

Deterministic finite automaton (DFA)

10

DFA 𝑀𝑀 is a quintuple (5-tuple) (𝐾𝐾, Ξ£, π‘žπ‘ž0, 𝛿𝛿,𝐹𝐹):𝐾𝐾 : a set of internal states {π‘žπ‘ž0,π‘žπ‘ž1, … , π‘žπ‘žπ‘›π‘›} of 𝑀𝑀. Ξ£ : an alphabet (a set of characters) {π‘Žπ‘Ž0,π‘Žπ‘Ž1, … ,π‘Žπ‘Žπ‘˜π‘˜}.π‘žπ‘ž0 : the initial state of 𝑀𝑀.𝛿𝛿 : a function 𝐾𝐾 Γ— Ξ£ β†’ 𝐾𝐾 that returns the next state when reading a

character from the input. It is called a transition function. 𝐹𝐹 : a set of accept states (or final state). 𝐹𝐹 βŠ† 𝐾𝐾.

We extend the domain of 𝛿𝛿 from 𝐾𝐾 Γ— Ξ£ to 𝐾𝐾 Γ— Ξ£βˆ— as follows:𝛿𝛿 π‘žπ‘ž, πœ€πœ€ = π‘žπ‘ž π‘žπ‘ž ∈ 𝐾𝐾 ,𝛿𝛿 π‘žπ‘ž,π‘Žπ‘Žπ‘₯π‘₯ = 𝛿𝛿 𝛿𝛿 π‘žπ‘ž,π‘Žπ‘Ž , π‘₯π‘₯ . (π‘žπ‘ž ∈ 𝐾𝐾,π‘Žπ‘Ž ∈ Ξ£, π‘₯π‘₯ ∈ Ξ£βˆ—)

Then, 𝛿𝛿 π‘žπ‘ž0,𝑀𝑀 indicates the state when string 𝑀𝑀 is input.We say that 𝑀𝑀 accepts 𝑀𝑀 if 𝛿𝛿 π‘žπ‘ž0,𝑀𝑀 = 𝑝𝑝 ∈ 𝐹𝐹.

The set 𝐿𝐿 𝑀𝑀 = 𝑀𝑀 𝛿𝛿 π‘žπ‘ž0,𝑀𝑀 ∈ 𝐹𝐹 of all strings that 𝑀𝑀 accepts is called the language accepted by finite automaton 𝑀𝑀.

Page 11: Special Lecture on IKN

a

b

b a

a, b

π‘žπ‘ž0 π‘žπ‘ž1

π‘žπ‘ž2

𝛿𝛿 π‘žπ‘ž0, aba = 𝛿𝛿 𝛿𝛿 π‘žπ‘ž0, a , ba= 𝛿𝛿 π‘žπ‘ž1, ba= 𝛿𝛿 𝛿𝛿 π‘žπ‘ž1, b , a= 𝛿𝛿 π‘žπ‘ž0, a= π‘žπ‘ž1 ∈ 𝐹𝐹

State transition diagramState transition

𝐿𝐿 𝑀𝑀 = a ba 𝑛𝑛 𝑛𝑛 β‰₯ 0}

11

Language that 𝑀𝑀 accepts

π‘žπ‘ž0

input tape:

head:

a b a

An example of DFA

Page 12: Special Lecture on IKN

Nondeterministic finite automaton (NFA)

NFA 𝑀𝑀 is a quintuple (𝐾𝐾, Ξ£,𝑄𝑄0,𝛿𝛿,𝐹𝐹):𝐾𝐾 : a set of internal states {π‘žπ‘ž0,π‘žπ‘ž1, … , π‘žπ‘žπ‘›π‘›} of 𝑀𝑀.Ξ£ : an alphabet (a set of characters) {π‘Žπ‘Ž0,π‘Žπ‘Ž1, … ,π‘Žπ‘Žπ‘˜π‘˜}.𝑄𝑄0 : the initial state of 𝑀𝑀. 𝑄𝑄0 βŠ‚ 𝐾𝐾.𝛿𝛿 : a transition function 𝐾𝐾 Γ— Ξ£ β†’ 2𝐾𝐾 of 𝑀𝑀.𝐹𝐹 : a set of accept states. 𝐹𝐹 βŠ† 𝐾𝐾.

We extend the domain of 𝛿𝛿 from 𝐾𝐾 Γ— Ξ£ to 𝐾𝐾 Γ— Ξ£βˆ— as follows:𝛿𝛿 π‘žπ‘ž, πœ€πœ€ = π‘žπ‘ž ,𝛿𝛿 π‘žπ‘ž,π‘Žπ‘Žπ‘₯π‘₯ =βˆͺπ‘π‘βˆˆπ›Ώπ›Ώ π‘žπ‘ž,π‘Žπ‘Ž 𝛿𝛿 𝑝𝑝, π‘₯π‘₯ . π‘žπ‘ž ∈ 𝐾𝐾,π‘Žπ‘Ž ∈ Ξ£, π‘₯π‘₯ ∈ Ξ£βˆ—

Moreover we extends it to 2𝐾𝐾 Γ— Ξ£βˆ—:𝛿𝛿 𝑆𝑆, π‘₯π‘₯ =βˆͺπ‘žπ‘žβˆˆπ‘†π‘† 𝛿𝛿 π‘žπ‘ž, π‘₯π‘₯ .

For π‘₯π‘₯ ∈ Ξ£βˆ—, we say that 𝑀𝑀 accepts π‘₯π‘₯ if 𝛿𝛿 𝑄𝑄0, π‘₯π‘₯ ∩ 𝐹𝐹 β‰  πœ™πœ™.

12

Some states can be active.

The destination of each transition isn’t unique.

Page 13: Special Lecture on IKN

b π‘žπ‘ž2π‘žπ‘ž1π‘žπ‘ž0b

a, b

For string abb, 𝛿𝛿 π‘žπ‘ž0, abb = 𝛿𝛿(π‘žπ‘ž0, bb)= 𝛿𝛿 π‘žπ‘ž0, b βˆͺ 𝛿𝛿(π‘žπ‘ž1, b)= {π‘žπ‘ž0} βˆͺ {π‘žπ‘ž1} βˆͺ {π‘žπ‘ž2}= {π‘žπ‘ž0, π‘žπ‘ž1, π‘žπ‘ž2}

Then, abb ∈ 𝐿𝐿(𝑀𝑀) since π‘žπ‘ž2 ∈ 𝐹𝐹.

State diagram of NFA

13

An example of NFA

Page 14: Special Lecture on IKN

Sequential machine

14

Sequential machine 𝑀𝑀, a variation of DFA that sequentially outputs to an input, is a sextuple (6-tuple) (𝐾𝐾, Ξ£,Ξ”, π‘žπ‘ž0,𝛿𝛿, πœ†πœ†):𝐾𝐾 : a set of internal states {π‘žπ‘ž0,π‘žπ‘ž1, … , π‘žπ‘žπ‘›π‘›}Ξ£ : an alphabet of the input

{π‘Žπ‘Ž0,π‘Žπ‘Ž1, … ,π‘Žπ‘Žπ‘˜π‘˜}Ξ” : an alphabet of the output

{𝑏𝑏0,𝑏𝑏1, … , 𝑏𝑏ℓ}π‘žπ‘ž0 : the initial state.𝛿𝛿 : the transition function 𝐾𝐾 Γ— Ξ£ β†’ 𝐾𝐾.πœ†πœ† : the output function 𝐾𝐾 Γ— Ξ£ β†’ Ξ”.

We extend the domain of 𝛿𝛿 to 𝐾𝐾 Γ— Ξ£βˆ— in a similar way of DFA. Moreover we extends the domain of πœ†πœ† to that from 𝐾𝐾 Γ— Ξ£βˆ— to Ξ”βˆ— as follows:

πœ†πœ† π‘žπ‘ž, πœ€πœ€ = π‘žπ‘ž, π‘žπ‘ž ∈ πΎπΎπœ†πœ† π‘žπ‘ž,π‘Žπ‘Žπ‘₯π‘₯ = πœ†πœ† π‘žπ‘ž, π‘Žπ‘Ž πœ†πœ† 𝛿𝛿 π‘žπ‘ž,π‘Žπ‘Ž , π‘₯π‘₯ . (π‘žπ‘ž ∈ 𝐾𝐾,π‘Žπ‘Ž ∈ Ξ£, π‘₯π‘₯ ∈ Ξ£βˆ—)

input tape

output tape

head:

π‘Žπ‘Ž1 π‘Žπ‘Ž2 π‘Žπ‘Ž3 π‘Žπ‘Ž4 π‘Žπ‘Ž5 π‘Žπ‘Ž6 π‘Žπ‘Ž7 π‘Žπ‘Ž8 ・・・

𝑏𝑏1 𝑏𝑏2 𝑏𝑏3 𝑏𝑏4 𝑏𝑏5 𝑏𝑏6 𝑏𝑏7 𝑏𝑏8 ・・・

π‘žπ‘ž

read

write

Page 15: Special Lecture on IKN

An example of sequential machine

15

π‘žπ‘ž0

π‘žπ‘ž3

π‘žπ‘ž5

π‘žπ‘ž1 π‘žπ‘ž2

π‘žπ‘ž4

a/0b/1

b/1a/0

a/0a/0

a/0a/0

b/0

b/0b/0

b/1

πœ†πœ†(π‘žπ‘ž0, abb)

= πœ†πœ†(π‘žπ‘ž0, a)πœ†πœ†(𝛿𝛿(π‘žπ‘ž0, a), bb)

= 0πœ†πœ†(π‘žπ‘ž4, bb)

= 0πœ†πœ†(π‘žπ‘ž4, b)πœ†πœ†(𝛿𝛿(π‘žπ‘ž4, b), b)

= 01πœ†πœ†(π‘žπ‘ž5, b)

= 010

it’s a kind of translator!state diagram

state transition and outputs

Page 16: Special Lecture on IKN

Summary

What is the pattern matching problem?The problem of finding the occurrences of pattern 𝑃𝑃 in text 𝑇𝑇.There are the existence problem and all-occurrence problem.

The difference between sequential search and indexingThe former is usually slower than the latter. However, the former doesn’t need any auxiliary index structure.

Basic terms and notationsComputational complexity: big-𝑂𝑂 notationAlphabet, string, prefix, factor, and suffix

Finite automatonDFA defines a language.NFA has multiple current states.Sequential machine outputs for each input character.

16