January 5, 2016 1 comp_dep_educ@yahoo.com
Contents
Introduction
Context Free Grammar
Sequitur Principles
Context-Free Grammar Example
Sequitur (also called the Nevill-Manning algorithm) is a recursive algorithm,
developed by Craig Nevill-Manning and Ian H. Witten in 1997, that infers a
hierarchical structure (a context-free grammar) from a sequence of discrete
symbols. The algorithm operates in linear space and time, and it can be used
in data compression applications.
Sequitur is based on the concept of context-free grammars, so we start
with a short review of this field.
Introduction
It reads the input symbol by symbol and uses repeated phrases in the
input data to build a set of context-free production rules.
Sequitur (from the Latin for “it follows”) is based on the concept of
context-free grammars.
It considers the input stream a valid sequence in some formal language.
A (natural) language starts with a small number of building blocks (letters
and punctuation marks) and uses them to construct words and sentences.
A sentence is a finite sequence (a string) of symbols that obeys certain
grammar rules.
Similarly, a formal language uses a small number of symbols (called
terminal symbols) from which valid sequences can be constructed.
The rules can be used to construct valid sequences and also to
determine whether a given sequence is valid.
A production rule consists of a nonterminal symbol on the left and a
string of terminal and nonterminal symbols on the right.
– terminals: b, e
– non-terminals: S, A
– Production Rules:
– S is the start symbol
The nonterminal symbol on the left becomes the name of the string on
the right.
In general, the right-hand side may contain several alternative strings,
but the rules generated by sequitur have just a single string.
The BNF notation, used to describe the syntax of programming
languages, is based on the concept of production rules.
We use lowercase letters to denote terminal symbols and uppercase
letters for the non-terminals.
BNF is an acronym for “Backus-Naur Form”.
Suppose that the following production rules are given:
A → ab, B → Ac, C → BdA.
Now verify that the string abcdab is valid: C → BdA → AcdA → abcdab.
It is clear that the production rules reduce the redundancy of the original
sequence, so they can serve as the basis of a compression method.
Using these rules we can generate the valid strings ab (an application of
the nonterminal A), abc (an application of B), abcdab (an application of C),
as well as many others.
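The derivations above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the names `RULES` and `expand` are illustrative assumptions. It expands each nonterminal of the example grammar into the terminal string it names.

```python
# Production rules from the example. Uppercase keys are nonterminals and
# lowercase characters are terminals, following the convention in the text.
RULES = {"A": "ab", "B": "Ac", "C": "BdA"}

def expand(symbol):
    """Recursively replace nonterminals by their right-hand sides."""
    if symbol not in RULES:          # terminal symbol: nothing to expand
        return symbol
    return "".join(expand(s) for s in RULES[symbol])

for name in RULES:
    print(name, "->", expand(name))
# A -> ab
# B -> abc
# C -> abcdab
```

Because each rule here has a single right-hand side, every nonterminal names exactly one string, which is the property the sequitur rules rely on.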
Context Free Grammar
Each repetition results in a rule and is replaced by the name of the rule (a
nonterminal symbol), thereby resulting in a shorter representation.
Generally, a set of production rules can be used to generate many valid
sequences, but the production rules produced by sequitur are not general.
They can be used only to reconstruct the original data.
The production rules themselves are not much smaller than the original
data, so sequitur has to go through one more step, where it compresses
the production rules.
The compressed production rules become the compressed stream, and
the sequitur decoder uses the rules (after decompressing them) to
reconstruct the original data.
If the input is a typical text in a natural language, the top-level rule
becomes very long, typically 10–20% of the size of the input, and the
other rules are short, with typically 2–3 symbols each.
Sequitur Principles
• Digram Uniqueness:
– no pair of adjacent symbols (digram) appears more than once in the
grammar.
• Rule Utility:
– Every production rule is used more than once.
• These two principles are maintained as an invariant while inferring a
grammar for the input string.
Sequitur constructs its grammars by observing two principles (or enforcing
two constraints) that we denote by p1 and p2.
Constraint p1 says: No pair of adjacent symbols will appear more than once in
the grammar (this can be rephrased as: Every digram in the grammar is
unique).
Constraint p2 says: Every rule should be used more than once.
This ensures that rules are useful. A rule that occurs just once is useless
and should be deleted.
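The two constraints can be stated as small checks on a grammar. The sketch below is an illustration under assumptions, not the authors' code: a grammar is a dictionary from rule name to right-hand side, uppercase letters are nonterminals, and the function names are hypothetical. Overlapping occurrences of a digram (as in aaa) are not counted as a repetition.

```python
def digrams(body):
    """All pairs of adjacent symbols in one right-hand side."""
    return [(body[i], body[i + 1]) for i in range(len(body) - 1)]

def violates_p1(grammar):
    """True if some digram occurs twice without overlapping (p1 is broken)."""
    seen = {}  # digram -> (rule name, index) of its first occurrence
    for name, body in grammar.items():
        for i, d in enumerate(digrams(body)):
            if d not in seen:
                seen[d] = (name, i)
            elif seen[d][0] != name or i - seen[d][1] >= 2:
                return True
    return False

def violates_p2(grammar, start="S"):
    """True if some rule other than the start rule is used fewer than twice."""
    uses = {name: 0 for name in grammar if name != start}
    for body in grammar.values():
        for sym in body:
            if sym in uses:
                uses[sym] += 1
    return any(n < 2 for n in uses.values())

print(violates_p1({"S": "abcdbc"}), violates_p2({"S": "abcdbc"}))
print(violates_p1({"S": "aAdA", "A": "bc"}), violates_p2({"S": "aAdA", "A": "bc"}))
```

The first grammar repeats the digram bc, so it breaks p1; the second replaces both occurrences with A and satisfies both constraints.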
The result is a two-rule grammar, where the first rule is the input
sequence with its redundancy removed, and the second rule is short,
replacing the digram bc with the single nonterminal symbol A.
The input S is considered a one-rule grammar. It has redundancy, so each
occurrence of abcdbc is replaced with A. Rule A still has redundancy
because of a repetition of the phrase bc, which justifies the introduction of
a second rule B.
The figure above shows how the two constraints can be violated. The first
grammar in the figure contains two occurrences of bc, thereby violating p1.
The second grammar contains rule B, which is used just once. It is easy to
see how removing B reduces the size of the grammar. The resulting,
shorter grammar is shown in the following figure. It is one rule and one symbol
shorter.
The sequitur encoder constructs the grammar rules while enforcing the
two constraints at all times.
If constraint p1 is violated, the encoder generates a new production rule.
When p2 is violated, the useless rule is deleted.
The encoder starts by setting rule S to the first input symbol. It then goes
into a loop where new symbols are input and appended to S.
Each time a new symbol is appended to S, the symbol and its
predecessor become the current digram.
If the current digram already occurs in the grammar, then p1 has been
violated, and the encoder generates a new rule with the current digram
on the right-hand side and with a new nonterminal symbol on the left.
The two occurrences of the digram are replaced by this nonterminal.
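The encoder loop described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published implementation: it rescans the whole grammar after every input symbol, so it does not run in linear time (the real algorithm keeps an index of digrams and linked lists of symbols). Rule S is stored as rule 0, other rules are numbered 1, 2, ..., and all function names are hypothetical. When a repeated digram is already the whole right-hand side of an existing rule, the sketch reuses that rule instead of creating a new one.

```python
from collections import Counter

START = 0  # rule 0 plays the role of S; terminals are str, nonterminals are int

def repeated_digram(rules):
    """Return two non-overlapping positions (rule, index) of one digram, or None."""
    seen = {}
    for name, body in rules.items():
        for i in range(len(body) - 1):
            d = (body[i], body[i + 1])
            if d not in seen:
                seen[d] = (name, i)
            elif seen[d][0] != name or i - seen[d][1] >= 2:
                return seen[d], (name, i)
    return None

def is_whole_rule(rules, name):
    """True if a digram found in `name` is that rule's entire right-hand side."""
    return name != START and len(rules[name]) == 2

def sequitur(text):
    rules = {START: []}
    next_name = 1
    for ch in text:
        rules[START].append(ch)                 # append the new symbol to S
        while True:
            hit = repeated_digram(rules)
            if hit:                             # p1 violated
                (n1, i1), (n2, i2) = hit
                if is_whole_rule(rules, n1):    # digram already has a rule
                    rules[n2][i2:i2 + 2] = [n1]
                elif is_whole_rule(rules, n2):
                    rules[n1][i1:i1 + 2] = [n2]
                else:                           # generate a new rule
                    new, next_name = next_name, next_name + 1
                    digram = rules[n1][i1:i1 + 2]
                    rules[n2][i2:i2 + 2] = [new]    # later occurrence first
                    rules[n1][i1:i1 + 2] = [new]
                    rules[new] = digram
                continue
            uses = Counter(s for body in rules.values() for s in body
                           if isinstance(s, int))
            once = [r for r in rules if r != START and uses[r] < 2]
            if not once:
                break                           # both constraints hold
            body = rules.pop(once[0])           # p2 violated: inline and delete
            for other in rules.values():
                if once[0] in other:
                    i = other.index(once[0])
                    other[i:i + 1] = body
                    break
    return rules

def expand(rules, sym=START):
    """Reconstruct the text named by a symbol."""
    if isinstance(sym, str):
        return sym
    return "".join(expand(rules, s) for s in rules[sym])

grammar = sequitur("abcdbcabcdbc")
print(grammar)
# {0: [4, 4], 1: ['b', 'c'], 4: ['a', 1, 'd', 1]}
```

Up to the rule names, the result for abcdbcabcdbc has the shape S → AA, A → aBdB, B → bc, and `expand(grammar)` returns the original input.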
Notice that generating rule C has made rule B underused (i.e., used just
once), which is why it was removed in the previous Figure.
One more detail, namely rule utilization, still needs to be discussed.
When a new rule X is generated, the encoder also generates a counter
associated with X, and initializes the counter to the number of times X is
used (a new rule is normally used twice when it is first generated). Each
time X is used in another rule Y, the encoder increments X’s counter by
1. When Y is deleted, the counter for X is decremented by 1. If X’s
counter reaches 1, rule X is deleted.
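A small sketch of the rule-utility bookkeeping follows. It is illustrative only: instead of maintaining counters incrementally as described above, it recounts the uses of every rule on demand, and the example grammar is a hypothetical one chosen so that rule B is used just once. Function names are assumptions.

```python
def use_counts(grammar, start="S"):
    """Count how often each non-start rule name appears on a right-hand side."""
    counts = {name: 0 for name in grammar if name != start}
    for body in grammar.values():
        for sym in body:
            if sym in counts:
                counts[sym] += 1
    return counts

def remove_underused(grammar, start="S"):
    """Inline and delete rules that are used exactly once, until none remain."""
    grammar = {name: list(body) for name, body in grammar.items()}
    while True:
        underused = [n for n, c in use_counts(grammar, start).items() if c == 1]
        if not underused:
            return grammar
        name = underused[0]
        body = grammar.pop(name)              # delete the useless rule
        for other in grammar.values():
            if name in other:                 # its single use: splice the body in
                i = other.index(name)
                other[i:i + 1] = body
                break

# Hypothetical grammar: generating C has left B with a single use.
before = {"S": "CAC", "A": "bc", "B": "aA", "C": "Bd"}
print(use_counts(before))
# {'A': 2, 'B': 1, 'C': 2}
after = remove_underused(before)
print({name: "".join(body) for name, body in after.items()})
# {'S': 'CAC', 'A': 'bc', 'C': 'aAd'}
```

Inlining B into C leaves the number of uses of A unchanged, because the occurrence of A inside B simply moves into C.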
As an example, we show the information sent to the decoder for the input
string abcdbcabcdbc (see the figure above). Rule S consists of two copies of rule
A. The first time rule A is encountered, its contents aBdB are sent. This
involves sending rule B twice. The first time rule B is sent, its contents bc
are sent (and the decoder does not know that the string bc it is receiving is
the contents of a rule). The second time rule B is sent, the pair (1, 2) is sent
(offset 1, count 2).
The decoder identifies the pair and uses it to set up the rule 1 → bc.
Sending the first copy of rule A therefore amounts to sending abcd(1, 2).
The second copy of rule A is sent as the pair (0, 4) since A starts at offset
0 in S and its length is 4. The decoder identifies this pair and uses it to set
up the rule 2 → a 1 d 1 . The final result is therefore abcd(1, 2)(0, 4).
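The decoding of this stream can be sketched as follows. This is a minimal illustration that assumes one particular reading of the pairs: a pair (offset, count) means that the count symbols starting at that offset in rule S become a new numbered rule, which is used once at that position and once more at the current end of S. It handles only terminals and pairs, as in this example; the function names are hypothetical.

```python
def decode(stream):
    """Rebuild rule S and the numbered rules from terminals and (offset, count) pairs."""
    s = []       # right-hand side of rule S as received so far
    rules = {}   # numbered rules set up from pairs
    for item in stream:
        if isinstance(item, tuple):
            offset, count = item
            name = len(rules) + 1
            rules[name] = s[offset:offset + count]
            s[offset:offset + count] = [name]   # first use of the new rule
            s.append(name)                      # second use, at the current end
        else:
            s.append(item)                      # plain terminal symbol
    return s, rules

def expand(sym, rules):
    """Reconstruct the text named by a symbol."""
    if isinstance(sym, str):
        return sym
    return "".join(expand(x, rules) for x in rules[sym])

s, rules = decode(["a", "b", "c", "d", (1, 2), (0, 4)])
print(s, rules)
# [2, 2] {1: ['b', 'c'], 2: ['a', 1, 'd', 1]}
print("".join(expand(x, rules) for x in s))
# abcdbcabcdbc
```

The rules recovered here are 1 → bc and 2 → a 1 d 1, matching the description above.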
Context-Free Grammar Example
Arithmetic Expressions
Sequitur Example
The Hierarchy