Special lecture on Information Knowledge Network
- Information retrieval and pattern matching -
The 1st. Preliminary: terms and definitions
Takuya Kida
IKN Laboratory, Division of Computer Science and Information Technology
2018/11/22

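Today's contents
  • What is the pattern matching problem?
  • Existence problem and all-occurrences problem
  • Two approaches to the problem: sequential search vs. indexing
  • Computational complexity
  • Basic terms for text algorithms: prefix, factor, and suffix
  • Finite automata
  • Summary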
What is the pattern matching problem?
Given a pattern P and a text T, the problem is to find occurrences of P in T.
Example:
text T : We introduce a general framework which is suitable to capture an essence of compressed pattern matching according to various dictionary based compressions. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as Lempel-Ziv family, (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary based method. Technically, our pattern matching algorithm extremely extends that for LZW compressed text presented by Amir, Benson and Farach [Amir94].
pattern P : compress
Well-known algorithms
  • KMP method (Knuth & Morris & Pratt [1974])
  • BM method (Boyer & Moore [1977])
  • Karp-Rabin method (Karp & Rabin [1987])
Existence problem and all-occurrences problem
text T : pururunpurunfamifamifa
pattern P : famifa
Existence problem: does P occur in T? Yes!
All-occurrences problem: report every position of T at which P occurs.
When we retrieve a document from a set of documents, it is enough to solve the existence problem. However, the “pattern matching problem” often means solving the all-occurrences problem.
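The two problem variants can be sketched in a few lines of Python. This is only a naive illustration, not one of the algorithms named above (KMP, BM, Karp-Rabin). The function names `exists` and `find_all` are made up for this sketch, and positions are reported 1-based.

```python
def exists(text: str, pattern: str) -> bool:
    """Existence problem: does the pattern occur anywhere in the text?"""
    return pattern in text


def find_all(text: str, pattern: str) -> list[int]:
    """All-occurrences problem: 1-based start positions of every occurrence.

    Naive method: try every alignment and compare the window with the pattern.
    """
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):          # i is the 0-based start of the window
        if text[i:i + m] == pattern:
            positions.append(i + 1)     # report 1-based positions
    return positions


text = "pururunpurunfamifamifa"
pattern = "famifa"
print(exists(text, pattern))    # True
print(find_all(text, pattern))  # [13, 17]
```

Note that the two occurrences overlap (positions 13 to 18 and 17 to 22), which is why the all-occurrences problem cannot simply skip ahead by the pattern length after a match.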
Two approaches to the problem
Retrieve using sequential search algorithms
  advantages: • No extra data structures • Flexible for data updating
  disadvantages: • Slow • Low scalability
  For small-scale groups of documents (e.g. grep of UNIX)
Retrieve using index data structures (inverted file, suffix tree/array, etc.)
  disadvantages: • Preprocessing for indexing • Less flexible for data updating • Large memory consumption
  For large-scale DBs (e.g. namazu, sufary, Google, etc.)
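The trade-off between the two approaches can be sketched as follows. The tiny document set, the whitespace tokenization, and all function names are assumptions made for illustration; real systems such as the ones named above are far more elaborate.

```python
from collections import defaultdict

# Hypothetical toy document collection: document id -> text.
docs = {
    1: "pattern matching on compressed text",
    2: "suffix array and suffix tree",
    3: "text compression",
}


def sequential_search(docs: dict[int, str], word: str) -> list[int]:
    """Scan every document at query time; needs no extra data structure."""
    return sorted(doc_id for doc_id, body in docs.items() if word in body.split())


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Preprocessing step: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, body in docs.items():
        for word in body.split():
            index[word].add(doc_id)
    return index


def indexed_search(index: dict[str, set[int]], word: str) -> list[int]:
    """Answer a query by a single lookup in the prebuilt index."""
    return sorted(index.get(word, set()))


index = build_inverted_index(docs)
print(sequential_search(docs, "text"))  # [1, 3]
print(indexed_search(index, "text"))    # [1, 3]
```

The index must be rebuilt or patched whenever a document changes and it occupies extra memory, which corresponds to the disadvantages listed for indexing; the sequential scan pays instead at every query.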
Computational complexity
To evaluate an algorithm, it is necessary to clearly determine the computational complexity of the algorithm.
– How much time or how much memory space do we need to process input data of length n?
big-O notation:
– Definition: let f and g be functions from integers to integers. If there exist constants C and N such that f(n) < C·g(n) holds for any n > N, then we write f(n) = O(g(n)). (We say that f is of order g.)
– If f and g are of the same order, that is, both f(n) = O(g(n)) and g(n) = O(f(n)) hold, then we write f = Θ(g).
– Let T(n) be the running time for an input of length n, and assume that T(n) = O(g(n)). This means that “asymptotically, it takes time at most proportional to g(n).”
e.g. O(n), O(n log n)
O indicates an upper bound of the asymptotic complexity. (There is also a notation, Ω, to indicate a lower bound.)
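As a small worked instance of the definition, take f(n) = 3n + 5 and g(n) = n. With C = 4 and N = 5 we get 3n + 5 < 4n exactly when n > 5, so f(n) = O(n). The Python sketch below only spot-checks the inequality on a finite range (the helper name `holds_up_to` and the chosen functions are assumptions for illustration); the algebra above is the actual argument.

```python
def f(n: int) -> int:
    return 3 * n + 5


def g(n: int) -> int:
    return n


def holds_up_to(f, g, C: int, N: int, limit: int) -> bool:
    """Check f(n) < C * g(n) for every n with N < n < limit (a finite spot check)."""
    return all(f(n) < C * g(n) for n in range(N + 1, limit))


print(holds_up_to(f, g, C=4, N=5, limit=10_000))  # True: C = 4, N = 5 work
print(holds_up_to(f, g, C=3, N=5, limit=10_000))  # False: 3n + 5 < 3n never holds
```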
Basic terms for text algorithms
Σ : a finite non-empty set of characters (letters, symbols), called a (finite) alphabet. Ex) Σ = {a, b, c, …, z}, Σ = {0, 1}, Σ = {0x00, 0x01, …, 0xFF}
w ∈ Σ∗ : a string (word, text), which is a sequence of characters.
|w| : the length of string w. Ex) |aba| = 3
ε : the empty string, which is the string whose length is equal to 0, namely |ε| = 0.
w[i] : the i-th character of string w.
w[i : j] : the consecutive sequence of characters from the i-th to the j-th of string w, which is called a factor or substring of w. If i > j, we assume w[i : j] = ε for convenience. It is often also denoted as w[i..j] or w[i, j].
wᴿ : the reversed string aₖ aₖ₋₁ … a₁ for w = a₁ a₂ … aₖ ∈ Σ∗.
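The notation above is 1-based and inclusive at both ends, whereas Python slices are 0-based and exclude the right end. The following sketch maps one onto the other; the helper names `char_at`, `factor`, and `reverse` are invented for this illustration.

```python
def char_at(w: str, i: int) -> str:
    """The i-th character of w, counting from 1."""
    return w[i - 1]


def factor(w: str, i: int, j: int) -> str:
    """The factor from the i-th to the j-th character (1-based, inclusive).

    By the convention above, the result is the empty string when i > j.
    """
    if i > j:
        return ""
    return w[i - 1:j]


def reverse(w: str) -> str:
    """The reversed string of w."""
    return w[::-1]


print(len("aba"))              # 3
print(char_at("cocoa", 5))     # a
print(factor("cocoa", 2, 4))   # oco
print(reverse("cocoa"))        # aococ
```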
Prefix, factor, and suffix
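Factor w[1 : i] is especially called a prefix of w, and w[i : |w|] is especially called a suffix of w.
[Figure: prefixes, factors, and suffixes of w = cocoa. For example, oc and oco are factors; a, oa, coa, and ocoa are suffixes; cocoa itself is a prefix, a factor, and a suffix.]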
For strings x and y, we say x is a subsequence of y if the string obtained by removing 0 or more characters from y is identical with x.
e.g.: x = abba is a subsequence of y = aaababaab.
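The definitions of prefix, suffix, factor, and subsequence can be enumerated directly. This is a minimal sketch with made-up function names; it lists only the non-empty prefixes, suffixes, and factors, although the empty string also qualifies for each.

```python
def prefixes(w: str) -> list[str]:
    """Non-empty prefixes w[1 : i], shortest first."""
    return [w[:i] for i in range(1, len(w) + 1)]


def suffixes(w: str) -> list[str]:
    """Non-empty suffixes w[i : |w|], longest first."""
    return [w[i:] for i in range(len(w))]


def factors(w: str) -> set[str]:
    """All distinct non-empty factors (substrings) of w."""
    return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}


def is_subsequence(x: str, y: str) -> bool:
    """True if x can be obtained from y by deleting zero or more characters."""
    it = iter(y)
    # `c in it` advances the iterator, so characters of x are matched in order.
    return all(c in it for c in x)


print(prefixes("cocoa"))                     # ['c', 'co', 'coc', 'coco', 'cocoa']
print(suffixes("cocoa"))                     # ['cocoa', 'ocoa', 'coa', 'oa', 'a']
print("oco" in factors("cocoa"))             # True
print(is_subsequence("abba", "aaababaab"))   # True
```

A factor must be contiguous while a subsequence need not be: abba is a subsequence of aaababaab but not a factor of it.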
Finite Automaton
A finite automaton is an automatic machine that determines whether an input string is acceptable or not. It repeatedly changes its internal state while reading characters from the input one by one. There are several variations according to the definition of the state transitions and the use of auxiliary storage (e.g. nondeterministic automaton, sequential machine, pushdown automaton, etc.).
  • It is a fundamental concept in computer science!
  • It is deeply related to pattern matching.
  • It has the ability to define a (formal) language; it is often used for lexical analysis of strings.
References:
  "Automaton and computability," S. Arikawa and S. Miyano, BAIFU-KAN (written in Japanese)
  "Formal language and automaton," E. Moriya, SAIENSU-SHA (written in Japanese)

Deterministic finite automaton (DFA)
A DFA is a quintuple (5-tuple) M = (Q, Σ, q₀, δ, F):
  Q : a set of internal states {q₀, q₁, …, qₙ} of M.
  Σ : an alphabet (a set of characters) {a₀, a₁, …, aₘ}.
  q₀ : the initial state of M.
  δ : a function Q × Σ → Q that returns the next state when reading a character from the input. It is called the transition function.
  F : a set of accept states (or final states). F ⊆ Q.
We extend the domain of δ from Q × Σ to Q × Σ∗ as follows:
  δ(q, ε) = q,
  δ(q, ax) = δ(δ(q, a), x).   (q ∈ Q, a ∈ Σ, x ∈ Σ∗)
Then, δ(q₀, w) indicates the state reached when string w is input. We say that M accepts w if δ(q₀, w) = p ∈ F.
The set L(M) = {w | δ(q₀, w) ∈ F} of all strings that M accepts is called the language accepted by finite automaton M.
An example of DFA
[Figure: state transition diagram and state transition table of the example DFA, with transitions labeled a and b]
δ(q₀, aba) = δ(δ(q₀, a), ba) = δ(q₁, ba) = δ(δ(q₁, b), a) = δ(q₀, a) = q₁ ∈ F
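A DFA can be simulated by applying the transition function once per input character. The sketch below is a guess at the example machine: only the transitions δ(q₀, a) = q₁ and δ(q₁, b) = q₀ and the fact q₁ ∈ F are recoverable from the worked computation for aba; the remaining transitions, which lead to a non-accepting state q₂, are assumptions added so that δ is total. States are written as the integers 0, 1, 2.

```python
# Transition function as a table: (state, character) -> next state.
# Entries leading to state 2 are assumed, not taken from the slide.
DELTA = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 2, (1, "b"): 0,
    (2, "a"): 2, (2, "b"): 2,
}
INITIAL = 0
ACCEPT = {1}


def run_dfa(delta: dict, q: int, w: str) -> int:
    """Extended transition function: the state reached from q after reading w."""
    for a in w:
        q = delta[(q, a)]
    return q


def accepts(w: str) -> bool:
    """M accepts w if the state reached from the initial state is an accept state."""
    return run_dfa(DELTA, INITIAL, w) in ACCEPT


print(run_dfa(DELTA, INITIAL, "aba"))  # 1
print(accepts("aba"))                  # True
print(accepts("ab"))                   # False
```

Under these assumptions the accepted language is a, aba, ababa, and so on.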
Nondeterministic finite automaton (NFA)
An NFA is a quintuple M = (Q, Σ, q₀, δ, F):
  Q : a set of internal states {q₀, q₁, …, qₙ} of M.
  Σ : an alphabet (a set of characters) {a₀, a₁, …, aₘ}.
  q₀ : the initial state of M. q₀ ∈ Q.
  δ : a transition function Q × Σ → 2^Q of M.
  F : a set of accept states. F ⊆ Q.
We extend the domain of δ from Q × Σ to Q × Σ∗ as follows:
  δ(q, ε) = {q},
  δ(q, ax) = ∪_{p ∈ δ(q, a)} δ(p, x).   (q ∈ Q, a ∈ Σ, x ∈ Σ∗)
Moreover, we extend it to 2^Q × Σ∗:
  δ(S, x) = ∪_{q ∈ S} δ(q, x).   (S ⊆ Q)
For w ∈ Σ∗, we say that M accepts w if δ(q₀, w) ∩ F ≠ ∅.
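In an NFA, the destination of each transition isn’t necessarily unique.
State diagram of NFA
[Figure: state diagram of an NFA with states q₀, q₁, q₂ and transitions labeled a, b and b]
For string abb,
  δ(q₀, abb) = δ(q₀, bb) = δ(q₀, b) ∪ δ(q₁, b) = {q₀} ∪ {q₁} ∪ {q₂} = {q₀, q₁, q₂}
Then, abb ∈ L(M) since q₂ ∈ F.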
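An NFA can be simulated by tracking the set of states it may currently be in, which is exactly the extension of δ to sets of states. The transition table below is an assumption reconstructed from the worked example for abb (state q₀ loops on a and b, q₀ moves to q₁ on b, q₁ moves to q₂ on b, and q₂ is accepting); the original diagram may differ in detail. States are written as the integers 0, 1, 2.

```python
# Transition function: (state, character) -> set of possible next states.
# Missing entries mean "no transition" (the empty set). Assumed reconstruction.
NFA_DELTA = {
    (0, "a"): {0},
    (0, "b"): {0, 1},
    (1, "b"): {2},
}
NFA_INITIAL = 0
NFA_ACCEPT = {2}


def step(delta: dict, states: set, a: str) -> set:
    """One step on a set of states: the union of delta(q, a) over all q in the set."""
    nxt = set()
    for q in states:
        nxt |= delta.get((q, a), set())
    return nxt


def run_nfa(delta: dict, q0: int, w: str) -> set:
    """Extended transition function: all states reachable from q0 by reading w."""
    states = {q0}
    for a in w:
        states = step(delta, states, a)
    return states


def nfa_accepts(w: str) -> bool:
    """M accepts w if some reachable state is an accept state."""
    return bool(run_nfa(NFA_DELTA, NFA_INITIAL, w) & NFA_ACCEPT)


print(run_nfa(NFA_DELTA, NFA_INITIAL, "abb"))  # {0, 1, 2}
print(nfa_accepts("abb"))                      # True
print(nfa_accepts("ab"))                       # False
```

Under these assumptions the machine accepts exactly the strings over {a, b} that end in bb.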
Sequential machine
A sequential machine M, a variation of DFA that sequentially produces output for its input, is a sextuple (6-tuple) M = (Q, Σ, Δ, q₀, δ, λ):
  Q : a set of internal states {q₀, q₁, …, qₙ}
  Σ : an alphabet of the input {a₀, a₁, …, aₘ}
  Δ : an alphabet of the output {b₀, b₁, …, bₖ}
  q₀ : the initial state.
  δ : the transition function Q × Σ → Q.
  λ : the output function Q × Σ → Δ.
We extend the domain of δ to Q × Σ∗ in the same way as for DFA. Moreover, we extend λ to a function from Q × Σ∗ to Δ∗ as follows:
  λ(q, ε) = ε,
  λ(q, ax) = λ(q, a) λ(δ(q, a), x).   (q ∈ Q, a ∈ Σ, x ∈ Σ∗)
[Figure: a sequential machine reads from an input tape and writes to an output tape; state transitions and outputs]
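The extended output function can be sketched as a loop that emits one output character per input character and then moves to the next state. The machine below is a hypothetical example, not the one in the figure: it reads a binary string and outputs, after each character, whether the number of 1s read so far is even (0) or odd (1).

```python
# Hypothetical sequential machine over input alphabet {0, 1} and output alphabet {0, 1}.
SM_DELTA = {                      # transition function: (state, input) -> next state
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}
SM_LAMBDA = {                     # output function: (state, input) -> output character
    ("even", "0"): "0", ("even", "1"): "1",
    ("odd", "0"): "1",  ("odd", "1"): "0",
}
SM_INITIAL = "even"


def run_machine(delta: dict, lam: dict, q: str, w: str) -> tuple[str, str]:
    """Return (output string, final state) after reading w from state q."""
    out = []
    for a in w:
        out.append(lam[(q, a)])   # emit the output for (current state, input)
        q = delta[(q, a)]         # then move to the next state
    return "".join(out), q


print(run_machine(SM_DELTA, SM_LAMBDA, SM_INITIAL, "1101"))  # ('1001', 'odd')
```

The output string always has the same length as the input, matching the extension of the output function to strings given above.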
Summary
  • What is the pattern matching problem?
    – The problem of finding the occurrences of pattern P in text T.
    – There are the existence problem and the all-occurrences problem.
  • The difference between sequential search and indexing
    – The former is usually slower than the latter. However, the former doesn’t need any auxiliary index structure.
  • Basic terms and notations
    – Computational complexity: big-O notation
    – Alphabet, string, prefix, factor, and suffix
  • Finite automaton
    – DFA defines a language.
    – NFA has multiple current states.
    – Sequential machine outputs for each input character.