Lecture 7 sequitur


January 5, 2016 1 comp_dep_educ@yahoo.com


Contents

• Introduction

• Context Free Grammar

• Sequitur Principles

• Context-Free Grammar Example


Sequitur (or the Nevill-Manning algorithm) is a recursive algorithm developed by Craig Nevill-Manning and Ian H. Witten in 1997 that infers a hierarchical structure (a context-free grammar) from a sequence of discrete symbols. The algorithm operates in linear space and time. It can be used in data compression software applications.

Sequitur is based on the concept of context-free grammars, so we start

with a short review of this field.

Introduction


It reads the input symbol by symbol and uses repeated phrases in the

input data to build a set of context-free production rules.

Sequitur (from the Latin for “it follows”) is based on the concept of context-free grammars.

It considers the input stream a valid sequence in some formal language.


A (natural) language starts with a small number of building blocks (letters

and punctuation marks) and uses them to construct words and sentences.

A sentence is a finite sequence (a string) of symbols that obeys certain

grammar rules.

Similarly, a formal language uses a small number of symbols (called

terminal symbols) from which valid sequences can be constructed.

The rules can be used to construct valid sequences and also to

determine whether a given sequence is valid.

A production rule consists of a nonterminal symbol on the left and a

string of terminal and nonterminal symbols on the right.


– terminals: b, e

– non-terminals: S, A

– Production Rules:

– S is the start symbol


The nonterminal symbol on the left becomes the name of the string on

the right.

In general, the right-hand side may contain several alternative strings,

but the rules generated by sequitur have just a single string.

The BNF notation, used to describe the syntax of programming

languages, is based on the concept of production rules.

We use lowercase letters to denote terminal symbols and uppercase

letters for the non-terminals.

BNF is an acronym for “Backus-Naur Form”.


Suppose that the following production rules are given:

A → ab, B → Ac, C → BdA.

Now verify that the string abcdab is valid.

It is clear that the production rules reduce the redundancy of the original

sequence, so they can serve as the basis of a compression method.

Using these rules we can generate the valid strings ab (an application of

the nonterminal A), abc (an application of B), abcdab (an application of C),

as well as many others.
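The derivation of abcdab from the three rules above can be checked mechanically. The following is a minimal Python sketch; the dictionary `RULES` and the helper `expand` are illustrative names, not part of the lecture.

```python
# The three production rules from the slide, keyed by nonterminal.
RULES = {"A": "ab", "B": "Ac", "C": "BdA"}


def expand(symbol, rules=RULES):
    """Rewrite nonterminals (keys of `rules`) until only terminals remain."""
    if symbol not in rules:
        return symbol  # a terminal symbol stands for itself
    return "".join(expand(s, rules) for s in rules[symbol])


print(expand("A"), expand("B"), expand("C"))  # ab abc abcdab
```

Expanding C gives B d A, then A c d A, and finally abcdab, which confirms that the string is valid under these rules.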

Context Free Grammar


Each repetition results in a rule, and the repeated phrase is replaced by the name of the rule (a nonterminal symbol), thereby resulting in a shorter representation.

Generally, a set of production rules can be used to generate many valid

sequences, but the production rules produced by sequitur are not general.

They can be used only to reconstruct the original data.

The production rules themselves are not much smaller than the original

data, so sequitur has to go through one more step, where it compresses

the production rules.

The compressed production rules become the compressed stream, and

the sequitur decoder uses the rules (after decompressing them) to

reconstruct the original data.


If the input is a typical text in a natural language, the top-level rule

becomes very long, typically 10–20% of the size of the input, and the

other rules are short, with typically 2–3 symbols each.


Sequitur Principles

• Digram Uniqueness:

– no pair of adjacent symbols (digram) appears more than once in the

grammar.

• Rule Utility:

– Every production rule is used more than once.

• These two principles are maintained as an invariant while inferring a

grammar for the input string.

Sequitur constructs its grammars by observing two principles (or enforcing

two constraints) that we denote by p1 and p2.

Constraint p1: no pair of adjacent symbols appears more than once in the grammar (this can be rephrased as: every digram in the grammar is unique).

Constraint p2: every rule is used more than once.

This ensures that rules are useful. A rule that occurs just once is useless

and should be deleted.
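The two constraints can be tested on any candidate grammar. The following is a small Python sketch under stated assumptions: a grammar is a dictionary mapping each nonterminal to a list of symbols, `check_constraints` is a hypothetical helper name, and the three example grammars are made up for illustration. For simplicity it counts overlapping digrams (as in aaa) as repeats, which a full implementation would not.

```python
from collections import Counter


def check_constraints(rules, start="S"):
    """Return (p1_ok, p2_ok) for a grammar {nonterminal: [symbols]}."""
    digrams = Counter()
    uses = Counter()
    for body in rules.values():
        digrams.update(zip(body, body[1:]))          # all adjacent pairs
        uses.update(s for s in body if s in rules)   # nonterminal uses
    p1_ok = all(n == 1 for n in digrams.values())    # digram uniqueness
    p2_ok = all(uses[r] >= 2 for r in rules if r != start)  # rule utility
    return p1_ok, p2_ok


# Illustrative grammars (assumptions, not taken from the slides):
print(check_constraints({"S": list("abcdbc")}))                    # (False, True)
print(check_constraints({"S": ["a", "A", "d", "A"], "A": ["b", "c"]}))  # (True, True)
print(check_constraints({"S": ["a", "A", "d"], "A": ["b", "c"]}))  # (True, False)
```

The first grammar repeats the digram bc, the second satisfies both constraints, and the third contains a rule A that is used only once.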


The result is a two-rule grammar, where the first rule is the input

sequence with its redundancy removed, and the second rule is short,

replacing the digram bc with the single nonterminal symbol A.


The input S is considered a one-rule grammar. It has redundancy, so each

occurrence of abcdbc is replaced with A. Rule A still has redundancy

because of a repetition of the phrase bc, which justifies the introduction of

a second rule B.


The figure above shows how the two constraints can be violated. The first grammar in the figure contains two occurrences of bc, thereby violating p1. The second grammar contains rule B, which is used just once, thereby violating p2. It is easy to see how removing B reduces the size of the grammar. The resulting, shorter grammar is shown in the following figure. It is one rule and one symbol shorter.


The sequitur encoder constructs the grammar rules while enforcing the

two constraints at all times.

If constraint p1 is violated, the encoder generates a new production rule.

When p2 is violated, the useless rule is deleted.

The encoder starts by setting rule S to the first input symbol. It then goes

into a loop where new symbols are input and appended to S.

Each time a new symbol is appended to S, the symbol and its

predecessor become the current digram.

If the current digram already occurs in the grammar, then p1 has been

violated, and the encoder generates a new rule with the current digram

on the right-hand side and with a new nonterminal symbol on the left.

The two occurrences of the digram are replaced by this nonterminal.
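The encoder loop described above can be sketched in Python as follows. This is a deliberately naive version, not the authors' implementation: it rescans the whole grammar after every change instead of keeping an index of digrams, so it does not run in linear time. The names `sequitur`, `find_repeat`, and `expand`, and the convention that rule 0 stands for S while other nonterminals are numbered 1, 2, ..., are assumptions made for this example.

```python
from collections import Counter
from itertools import count

START = 0  # rule 0 plays the role of S; other nonterminals are ints 1, 2, ...


def find_repeat(rules):
    """Return (digram, first, second) for a repeated digram, else None."""
    seen = {}
    for name, body in rules.items():
        for i in range(len(body) - 1):
            digram = (body[i], body[i + 1])
            if digram not in seen:
                seen[digram] = (name, i)
            elif seen[digram] != (name, i - 1):  # skip overlaps such as "aaa"
                return digram, seen[digram], (name, i)
    return None


def sequitur(text):
    rules = {START: []}
    fresh = count(1)
    for ch in text:
        rules[START].append(ch)  # append the next input symbol to S
        while True:
            hit = find_repeat(rules)
            if hit:  # p1 violated
                digram, first, second = hit
                owner = next((n for n, b in rules.items()
                              if n != START and tuple(b) == digram), None)
                if owner is None:  # new rule; replace both occurrences
                    owner = next(fresh)
                    for name, i in (second, first):  # later occurrence first
                        rules[name][i:i + 2] = [owner]
                    rules[owner] = list(digram)
                else:  # the digram is already the body of a rule: reuse it
                    name, i = first if second[0] == owner else second
                    rules[name][i:i + 2] = [owner]
                continue
            uses = Counter(s for b in rules.values()
                           for s in b if isinstance(s, int))
            lonely = next((n for n in rules
                           if n != START and uses[n] < 2), None)
            if lonely is None:
                break  # both constraints hold
            body = rules.pop(lonely)  # p2 violated: inline the rule, delete it
            for b in rules.values():
                if lonely in b:
                    i = b.index(lonely)
                    b[i:i + 1] = body
    return rules


def expand(rules, symbol=START):
    """Reconstruct the original string from the grammar."""
    if not isinstance(symbol, int):
        return symbol
    return "".join(expand(rules, s) for s in rules[symbol])


grammar = sequitur("abcdbcabcdbc")
print(grammar)  # {0: [4, 4], 1: ['b', 'c'], 4: ['a', 1, 'd', 1]}
```

For the input abcdbcabcdbc the sketch ends with three rules whose shape is S → AA, A → aBdB, B → bc (here numbered 0, 4, and 1). Rules 2 and 3 were created along the way and then deleted when they became underused.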


Notice that generating rule C has made rule B underused (i.e., used just once), which is why it was removed in the previous figure.

One more detail, namely rule utilization, still needs to be discussed.

When a new rule X is generated, the encoder also generates a counter

associated with X, and initializes the counter to the number of times X is

used (a new rule is normally used twice when it is first generated). Each

time X is used in another rule Y, the encoder increments X’s counter by

1. When Y is deleted, the counter for X is decremented by 1. If X’s

counter reaches 1, rule X is deleted.


As an example, we show the information sent to the decoder for the input

string abcdbcabcdbc (see the figure above). Rule S consists of two copies of rule

A. The first time rule A is encountered, its contents aBdB are sent. This

involves sending rule B twice. The first time rule B is sent, its contents bc

are sent (and the decoder does not know that the string bc it is receiving is

the contents of a rule). The second time rule B is sent, the pair (1, 2) is sent

(offset 1, count 2).

The decoder identifies the pair and uses it to set up the rule 1 → bc.

Sending the first copy of rule A therefore amounts to sending abcd(1, 2).

The second copy of rule A is sent as the pair (0, 4) since A starts at offset

0 in S and its length is 4. The decoder identifies this pair and uses it to set

up the rule 2 → a 1 d 1. The final result is therefore abcd(1, 2)(0, 4).
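The decoding steps just described can be traced with a short Python sketch. It covers only the two cases that occur in this example (a literal symbol, and an (offset, length) pair that points into the partially built rule S and defines a new rule). The function name `decode` and the representation of the stream as a list of strings and tuples are assumptions made for illustration.

```python
def decode(stream):
    """Rebuild the rules and the original text from symbols and (offset, length) pairs."""
    s, rules = [], {}
    for item in stream:
        if isinstance(item, tuple):
            offset, length = item
            name = len(rules) + 1                  # rules are numbered 1, 2, ...
            rules[name] = s[offset:offset + length]
            s[offset:offset + length] = [name]     # first occurrence becomes the rule
            s.append(name)                         # the pair itself is the second use
        else:
            s.append(item)                         # literal terminal symbol

    def expand(sym):
        if isinstance(sym, str):
            return sym
        return "".join(expand(x) for x in rules[sym])

    return "".join(expand(x) for x in s), rules


text, rules = decode(["a", "b", "c", "d", (1, 2), (0, 4)])
print(text)   # abcdbcabcdbc
print(rules)  # {1: ['b', 'c'], 2: ['a', 1, 'd', 1]}
```

After the pair (1, 2) the partial rule S is a 1 d 1, and after the pair (0, 4) it is 2 2, which expands back to the original input.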


Context-Free Grammar Example


Arithmetic Expressions


Sequitur Example


The Hierarchy
