Special lecture on Information Knowledge Network - Information retrieval and pattern matching - The 1st. Preliminary: terms and definitions Takuya kida IKN Laboratory, Division of Computer Science and Information Technology 2018/11/22 Special lecture on IKN
Special Lecture on IKNThe 1st. Preliminary: terms and
definitions
Takuya kida IKN Laboratory,
2018/11/22Special lecture on IKN
Sequential search vs. indexing
Finite automatons
What is the pattern matching problem?
Given a pattern and a text , the problem is to find occurrences of
in .
3
We introduce a general framework which is suitable to capture an
essence of compressed pattern matching according to various
dictionary based compressions. The goal is to find all occurrences
of a pattern in a text without decompression, which is one of the
most active topics in string matching. Our framework includes such
compression methods as Lempel-Ziv family, (LZ77, LZSS, LZ78, LZW),
byte-pair encoding, and the static dictionary based method.
Technically, our pattern matching algorithm extremely extends that
for LZW compressed text presented by Amir, Benson and Farach
[Amir94]..
text :
pattern : compress
Well-known algorithms • KMP method
(Knuth&Morris&Pratt[1974]) • BM method
(Boyer&Moore[1977]) • Karp-Rabin method
(Karp&Rabin[1987])
text : pururunpurunfamifamifa pattern : famifa
Existence problem:
All-occurrences problem:
When we retrieve a document from a set of documents, it is enough
to solve the existence problem. However, “pattern matching problem”
often means solving the all-occurrences problem. 4
Existence problem and all-occurrences problem
Yes!
5
disadvantages: Preprocessing for indexing Less flexible for data
updating Large memory consumption
Retrieve using sequential search algorithms advantages: No extra
data structures Flexible for data updating
disadvantages: Slow Low scalability
For small scale group of documents (e.g. grep of UNIX)
For large scale DB (e.g. namazu, sufary, Google, etc.)
()
(inverted file, suffix tree/array, etc.)
Computational complexity
To evaluate an algorithm, it is necessary to clearly determine the
computational complexity of the algorithm
– How much time or how many memory space do we need to calculate
for the input data of length ?
big- notation: – Definition: let and be functions from an integer
to an integer. For
some constant and , if < () holds for any > , then we denote
= (()). (we say that is order of .)
– If and are the same order, that is, both () = (()) and () = (())
hold, then we denote = Θ().
– Let () be a function of the calculation time for the input of
length , and assume that () = (()). This means that “It
asymptotically takes only the time proportional to at most.”
6e.g. , (log)
indicates the upper bound of asymptotic complexity (There is
notation to indicate the lower bound)
Basic terms for text algorithms
Σ : a finite set of non-empty characters (letters, symbols), called
(finite) alphabet. Ex) Σ = a, b, c, … , z , Σ = 0,1 , Σ = {0x00,
0x01, … , 0xFF}
∈ Σ∗ : a string (word, text), which is a sequence of characters. ||
: length of string . Ex) aba = 3 : the empty string, which is the
string whose length is equal to 0,
namely = 0. [] : the -th character of string . : : the consecutive
sequence of characters from the -th to
the -th of string , which is called the factor or substring of . If
> , we assume [: ] = for convenience. It is often also denoted
as [. . ] or [, ].
: the reversed string −1 …1 for = 1 2 … ∈ Σ∗. 7
Prefix, factor, and suffix
Factor [1: ] is especially called a prefix of , and [: ||] is
especially called a suffix of .
8
factor
a oa
coa ocoao
oc oco
prefix suffix
w = cocoa
cocoa
For strings and , we say is a subsequence of if the string obtained
by removing 0 or more characters from is identical with .
e.g.: = abba is a subsequence of = aaababaab.
Finite Automaton
It is automatic machine (automata) that determines if the input
string is acceptable or not. It repeatedly changes its internal
state reading characters from the input one by one. There are
several variations according to the definition of the state
transitions and the use of auxiliary spaces. (e.g.
non-deterministic automaton, sequential machine, pushdown
automaton, etc.)
9
References:"Automaton and computability," S. Arikawa and S. Miyano,
BAIFU-KAN (written in Japanese) "Formal language and automaton,"
Written by E. Moriya, SAIENSU-SHA (written in Japanese)
fundamental concept on computer science!
It deeply relates to the pattern matching
It has the ability to define a (formal) language; it is often used
for lexical analysis of strings.
1 2 3 4 5 6 7 8
Deterministic finite automaton (DFA)
10
DFA is a quintuple (5-tuple) (, Σ, 0, ,): : a set of internal
states {0,1, … , } of . Σ : an alphabet (a set of characters) {0,1,
… ,}. 0 : the initial state of . : a function × Σ → that returns
the next state when reading a
character from the input. It is called a transition function. : a
set of accept states (or final state). ⊆ .
We extend the domain of from × Σ to × Σ∗ as follows: , = ∈ , , = ,
, . ( ∈ , ∈ Σ, ∈ Σ∗)
Then, 0, indicates the state when string is input. We say that
accepts if 0, = ∈ .
The set = 0, ∈ of all strings that accepts is called the language
accepted by finite automaton .
a
b
2
0, aba = 0, a , ba = 1, ba = 1, b , a = 0, a = 1 ∈
State transition diagramState transition
11
An example of DFA
Nondeterministic finite automaton (NFA)
NFA is a quintuple (, Σ,0,,): : a set of internal states {0,1, … ,
} of . Σ : an alphabet (a set of characters) {0,1, … ,}. 0 : the
initial state of . 0 ⊂ . : a transition function × Σ → 2 of . : a
set of accept states. ⊆ .
We extend the domain of from × Σ to × Σ∗ as follows: , = , , =∪∈ ,
, . ∈ , ∈ Σ, ∈ Σ∗
Moreover we extends it to 2 × Σ∗: , =∪∈ , .
For ∈ Σ∗, we say that accepts if 0, ∩ ≠ .
12
The destination of each transition isn’t unique.
b 210 b
a, b
For string abb, 0, abb = (0, bb) = 0, b ∪ (1, b) = {0} ∪ {1} ∪ {2}
= {0, 1, 2}
Then, abb ∈ () since 2 ∈ .
State diagram of NFA
14
Sequential machine , a variation of DFA that sequentially outputs
to an input, is a sextuple (6-tuple) (, Σ,Δ, 0,, ): : a set of
internal states {0,1, … , } Σ : an alphabet of the input
{0,1, … ,} Δ : an alphabet of the output
{0,1, … , } 0 : the initial state. : the transition function × Σ →
. : the output function × Σ → Δ.
We extend the domain of to × Σ∗ in a similar way of DFA. Moreover
we extends the domain of to that from × Σ∗ to Δ∗ as follows:
, = , ∈ , = , , , . ( ∈ , ∈ Σ, ∈ Σ∗)
input tape
output tape
read
write
15
0
3
5
state transition and outputs
Summary
What is the pattern matching problem? The problem of finding the
occurrences of pattern in text . There are the existence problem
and all-occurrence problem.
The difference between sequential search and indexing The former is
usually slower than the latter. However, the former doesn’t need
any auxiliary index structure.
Basic terms and notations Computational complexity: big- notation
Alphabet, string, prefix, factor, and suffix
Finite automaton DFA defines a language. NFA has multiple current
states. Sequential machine outputs for each input character.
16
Today’s contents
Existence problem and all-occurrences problem
Two approaches to the problem
Computational complexity
Prefix, factor, and suffix
Summary