Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Special lecture on Information Knowledge Network-Information retrieval and pattern matching-
The 1st. Preliminary: terms and definitions
Takuya kidaIKN Laboratory,
Division of Computer Science and Information Technology
2018/11/22Special lecture on IKN
Todayβs contents
What is the pattern matching problem?
Sequential search vs. indexing
Basic terms of text algorithms
Finite automatons
2
What is the pattern matching problem?
Given a pattern ππ and a text ππ, the problem is to find occurrences of ππ in ππ.
3
We introduce a general framework which is suitable tocapture an essence of compressed pattern matchingaccording to various dictionary based compressions. Thegoal is to find all occurrences of a pattern in a textwithout decompression, which is one of the most activetopics in string matching. Our framework includes suchcompression methods as Lempel-Ziv family, (LZ77, LZSS,LZ78, LZW), byte-pair encoding, and the static dictionarybased method. Technically, our pattern matching algorithmextremely extends that for LZW compressed text presentedby Amir, Benson and Farach [Amir94]..
text ππ:
pattern ππ: compress
Well-known algorithmsοΌβ’ KMP method (Knuth&Morris&Pratt[1974])β’ BM method (Boyer&Moore[1977])β’ Karp-Rabin method (Karp&Rabin[1987])
text ππ : pururunpurunfamifamifapattern ππ: famifa
Existence problem:
text ππ : pururunpurunfamifamifapattern ππ: famifa
All-occurrences problem:
When we retrieve a document from a set of documents, it is enough to solve the existence problem. However, βpattern matching problemβ often means solving the all-occurrences problem. 4
Existence problem and all-occurrences problem
Yes!
8 11
Two approaches to the problem
5
Retrieve using index structuresadvantages: Very fast High scalability
disadvantages: Preprocessing for indexing Less flexible for data updating Large memory consumption
Retrieve using sequential search algorithmsadvantages: No extra data structures Flexible for data updating
disadvantages: Slow Low scalability
For small scale group of documents(e.g. grep of UNIX)
For large scale DB(e.g. namazu, sufary, Google, etc.)
ππ(ππ)
ππ(ππ logππ)
Main topic of this classIndex structures
(inverted file, suffix tree/array, etc.)
Computational complexity
To evaluate an algorithm, it is necessary to clearly determine the computational complexity of the algorithm
β How much time or how many memory space do we need to calculate for the input data of length ππ?
big-ππ notation:β Definition: let ππ and ππ be functions from an integer to an integer. For
some constant πΆπΆ and ππ, if ππ ππ < πΆπΆγ»ππ(ππ) holds for any ππ > ππ, then we denote ππ ππ = ππ(ππ(ππ)). (we say that ππ is order of ππ.)
β If ππ and ππ are the same order, that is, both ππ(ππ) = ππ(ππ(ππ)) and ππ(ππ) = ππ(ππ(ππ)) hold, then we denote ππ = Ξ(ππ).
β Let ππ(ππ) be a function of the calculation time for the input of length ππ, and assume that ππ(ππ) = ππ(ππ(ππ)). This means that βIt asymptotically takes only the time proportional to ππ ππ at most.β
6e.g. ππ ππ , ππ(ππγ»logππ)
indicates the upper bound of asymptotic complexity(There is πΊπΊ notation to indicate the lower bound)
Basic terms for text algorithms
Ξ£ : a finite set of non-empty characters (letters, symbols), called (finite) alphabet. Ex) Ξ£ = a, b, c, β¦ , z , Ξ£ = 0,1 , Ξ£ = {0x00, 0x01, β¦ , 0xFF}
π₯π₯ β Ξ£β : a string (word, text), which is a sequence of characters.|π₯π₯| : length of string π₯π₯. Ex) aba = 3ππ : the empty string, which is the string whose length is equal to 0,
namely ππ = 0.π₯π₯[ππ] : the ππ-th character of string π₯π₯.π₯π₯ ππ: ππ : the consecutive sequence of characters from the ππ-th to
the ππ-th of string π₯π₯, which is called the factor or substring of π₯π₯. If ππ > ππ, we assume π₯π₯[ππ: ππ] = ππ for convenience. It is often also denoted as π₯π₯[ππ. . ππ] or π₯π₯[ππ, ππ].
π₯π₯π π : the reversed string ππππππππβ1 β¦ππ1 for π₯π₯ = ππ1 ππ2 β¦ ππππ β Ξ£β.7
Prefix, factor, and suffix
Factor π₯π₯[1: ππ] is especially called a prefix of π₯π₯, and π₯π₯[ππ: |π₯π₯|] is especially called a suffix of π₯π₯.
8
Note that the difference with a factor!
factor
ccococcoco
aoa
coaocoao
ococo
prefix suffix
w = cocoa
ππcocoa
For strings π₯π₯ and π¦π¦, we say π₯π₯ is a subsequence of π¦π¦ if the string obtained by removing 0 or more characters from π¦π¦ is identical with π₯π₯.
e.g.: π₯π₯ = abba is a subsequence of π¦π¦ = aaababaab.
Finite Automaton
It is automatic machine (automata) that determines if the input string is acceptable or not. It repeatedly changes its internal state reading characters from the input one by one. There are several variations according to the definition of the state transitions and the use of auxiliary spaces. (e.g. non-deterministic automaton, sequential machine, pushdown automaton, etc.)
9
References:"Automaton and computability," S. Arikawa and S. Miyano, BAIFU-KAN (written in Japanese)"Formal language and automaton," Written by E. Moriya, SAIENSU-SHA (written in Japanese)
fundamental concept on computer science!
It deeply relates to the pattern matching
It has the ability to define a (formal) language; it is often used for lexical analysis of strings.
ππ
input tapeοΌ
headοΌ(keep the internal state)
ππ1 ππ2 ππ3 ππ4 ππ5 ππ6 ππ7 ππ8 γ»γ»γ»
Deterministic finite automaton (DFA)
10
DFA ππ is a quintuple (5-tuple) (πΎπΎ, Ξ£, ππ0, πΏπΏ,πΉπΉ):πΎπΎ : a set of internal states {ππ0,ππ1, β¦ , ππππ} of ππ. Ξ£ : an alphabet (a set of characters) {ππ0,ππ1, β¦ ,ππππ}.ππ0 : the initial state of ππ.πΏπΏ : a function πΎπΎ Γ Ξ£ β πΎπΎ that returns the next state when reading a
character from the input. It is called a transition function. πΉπΉ : a set of accept states (or final state). πΉπΉ β πΎπΎ.
We extend the domain of πΏπΏ from πΎπΎ Γ Ξ£ to πΎπΎ Γ Ξ£β as follows:πΏπΏ ππ, ππ = ππ ππ β πΎπΎ ,πΏπΏ ππ,πππ₯π₯ = πΏπΏ πΏπΏ ππ,ππ , π₯π₯ . (ππ β πΎπΎ,ππ β Ξ£, π₯π₯ β Ξ£β)
Then, πΏπΏ ππ0,π€π€ indicates the state when string π€π€ is input.We say that ππ accepts π€π€ if πΏπΏ ππ0,π€π€ = ππ β πΉπΉ.
The set πΏπΏ ππ = π€π€ πΏπΏ ππ0,π€π€ β πΉπΉ of all strings that ππ accepts is called the language accepted by finite automaton ππ.
a
b
b a
a, b
ππ0 ππ1
ππ2
πΏπΏ ππ0, aba = πΏπΏ πΏπΏ ππ0, a , ba= πΏπΏ ππ1, ba= πΏπΏ πΏπΏ ππ1, b , a= πΏπΏ ππ0, a= ππ1 β πΉπΉ
State transition diagramState transition
πΏπΏ ππ = a ba ππ ππ β₯ 0}
11
Language that ππ accepts
ππ0
input tapeοΌ
headοΌ
a b a
An example of DFA
Nondeterministic finite automaton (NFA)
NFA ππ is a quintuple (πΎπΎ, Ξ£,ππ0,πΏπΏ,πΉπΉ):πΎπΎ : a set of internal states {ππ0,ππ1, β¦ , ππππ} of ππ.Ξ£ : an alphabet (a set of characters) {ππ0,ππ1, β¦ ,ππππ}.ππ0 : the initial state of ππ. ππ0 β πΎπΎ.πΏπΏ : a transition function πΎπΎ Γ Ξ£ β 2πΎπΎ of ππ.πΉπΉ : a set of accept states. πΉπΉ β πΎπΎ.
We extend the domain of πΏπΏ from πΎπΎ Γ Ξ£ to πΎπΎ Γ Ξ£β as follows:πΏπΏ ππ, ππ = ππ ,πΏπΏ ππ,πππ₯π₯ =βͺππβπΏπΏ ππ,ππ πΏπΏ ππ, π₯π₯ . ππ β πΎπΎ,ππ β Ξ£, π₯π₯ β Ξ£β
Moreover we extends it to 2πΎπΎ Γ Ξ£β:πΏπΏ ππ, π₯π₯ =βͺππβππ πΏπΏ ππ, π₯π₯ .
For π₯π₯ β Ξ£β, we say that ππ accepts π₯π₯ if πΏπΏ ππ0, π₯π₯ β© πΉπΉ β ππ.
12
Some states can be active.
The destination of each transition isnβt unique.
b ππ2ππ1ππ0b
a, b
For string abb, πΏπΏ ππ0, abb = πΏπΏ(ππ0, bb)= πΏπΏ ππ0, b βͺ πΏπΏ(ππ1, b)= {ππ0} βͺ {ππ1} βͺ {ππ2}= {ππ0, ππ1, ππ2}
Then, abb β πΏπΏ(ππ) since ππ2 β πΉπΉ.
State diagram of NFA
13
An example of NFA
Sequential machine
14
Sequential machine ππ, a variation of DFA that sequentially outputs to an input, is a sextuple (6-tuple) (πΎπΎ, Ξ£,Ξ, ππ0,πΏπΏ, ππ):πΎπΎ : a set of internal states {ππ0,ππ1, β¦ , ππππ}Ξ£ : an alphabet of the input
{ππ0,ππ1, β¦ ,ππππ}Ξ : an alphabet of the output
{ππ0,ππ1, β¦ , ππβ}ππ0 : the initial state.πΏπΏ : the transition function πΎπΎ Γ Ξ£ β πΎπΎ.ππ : the output function πΎπΎ Γ Ξ£ β Ξ.
We extend the domain of πΏπΏ to πΎπΎ Γ Ξ£β in a similar way of DFA. Moreover we extends the domain of ππ to that from πΎπΎ Γ Ξ£β to Ξβ as follows:
ππ ππ, ππ = ππ, ππ β πΎπΎππ ππ,πππ₯π₯ = ππ ππ, ππ ππ πΏπΏ ππ,ππ , π₯π₯ . (ππ β πΎπΎ,ππ β Ξ£, π₯π₯ β Ξ£β)
input tape
output tape
headοΌ
ππ1 ππ2 ππ3 ππ4 ππ5 ππ6 ππ7 ππ8 γ»γ»γ»
ππ1 ππ2 ππ3 ππ4 ππ5 ππ6 ππ7 ππ8 γ»γ»γ»
ππ
read
write
An example of sequential machine
15
ππ0
ππ3
ππ5
ππ1 ππ2
ππ4
a/0b/1
b/1a/0
a/0a/0
a/0a/0
b/0
b/0b/0
b/1
ππ(ππ0, abb)
= ππ(ππ0, a)ππ(πΏπΏ(ππ0, a), bb)
= 0ππ(ππ4, bb)
= 0ππ(ππ4, b)ππ(πΏπΏ(ππ4, b), b)
= 01ππ(ππ5, b)
= 010
itβs a kind of translator!state diagram
state transition and outputs
Summary
What is the pattern matching problem?The problem of finding the occurrences of pattern ππ in text ππ.There are the existence problem and all-occurrence problem.
The difference between sequential search and indexingThe former is usually slower than the latter. However, the former doesnβt need any auxiliary index structure.
Basic terms and notationsComputational complexity: big-ππ notationAlphabet, string, prefix, factor, and suffix
Finite automatonDFA defines a language.NFA has multiple current states.Sequential machine outputs for each input character.
16