Chapter 9: Text Processing, Pattern Matching, Data Compression


Page 1: Chapter 9: Text Processing, Pattern Matching, Data Compression

Page 2: Outline and Reading

- Strings (§9.1.1)
- Pattern matching algorithms
  - Brute-force algorithm (§9.1.2)
  - Knuth-Morris-Pratt algorithm (§9.1.4)
- Regular expressions and finite automata
- Data compression
  - Huffman coding
  - Lempel-Ziv compression

Page 3: Motivation: Bioinformatics

Bioinformatics is the application of computer science techniques to genetic data.
- See the Gene-Finding notes
- Many interesting algorithmic problems
- Many interesting ethical issues!

Page 4: Strings

A string is a sequence of characters. Examples of strings:
- Java program
- HTML document
- DNA sequence
- Digitized image

An alphabet is the set of possible characters for a family of strings. Examples of alphabets:
- ASCII
- Unicode
- {0, 1}
- {A, C, G, T}

Let P be a string of size m.
- A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j.
- A prefix of P is a substring of the form P[0 .. i].
- A suffix of P is a substring of the form P[i .. m − 1].

Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P. Applications:
- Text editors
- Regular expressions
- Search engines
- Biological research

Page 5: Brute-Force Algorithm

The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
- a match is found, or
- all placements of the pattern have been tried.

Brute-force pattern matching runs in time O(nm). Example of worst case:
- T = aaa ... ah
- P = aaah
- may occur in images and DNA sequences, but is unlikely in English text

Algorithm BruteForceMatch(T, P)
  Input: text T of size n and pattern P of size m
  Output: starting index of a substring of T equal to P, or −1 if no such substring exists
  for i ← 0 to n − m
    { test shift i of the pattern }
    j ← 0
    while j < m and T[i + j] = P[j]
      j ← j + 1
    if j = m
      return i  { match at i }
    else
      break while loop  { mismatch }
  return −1  { no match anywhere }
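
For concreteness, here is a direct Python transcription of this pseudocode (a sketch; the function name and test string are ours, not from the slides):

    def brute_force_match(T, P):
        """Return the starting index of the first substring of T equal
        to P, or -1 if no such substring exists.  O(nm) worst case."""
        n, m = len(T), len(P)
        for i in range(n - m + 1):      # test shift i of the pattern
            j = 0
            while j < m and T[i + j] == P[j]:
                j += 1
            if j == m:
                return i                # match at i
        return -1                       # no match anywhere

    print(brute_force_match("aaaaaaah", "aaah"))   # -> 4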

Page 6: The KMP Algorithm - Motivation

Knuth-Morris-Pratt’s algorithm compares the pattern to the text left to right, but shifts the pattern more intelligently than the brute-force algorithm.

When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?

Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]

[Figure: the pattern abaaba aligned with the text ...abaab...; a mismatch occurs at text character x against P[j]. After the shift, the pattern is realigned so that its prefix ab overlaps the already-matched suffix ab. No need to repeat those comparisons; resume comparing here, at the mismatched character.]

Page 7: KMP Failure Function

Knuth-Morris-Pratt’s algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.

The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].

Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i], we set j ← F(j − 1).

j      0  1  2  3  4  5
P[j]   a  b  a  a  b  a
F(j)   0  0  1  1  2  3

[Figure: the same alignment as on the previous slide; after the mismatch at text character x, the pattern abaaba is shifted so that comparison resumes with P[F(j − 1)] against the mismatched text character.]

Page 8: The KMP Algorithm

The failure function can be represented by an array and can be computed in O(m) time.

At each iteration of the while-loop, either
- i increases by one, or
- the shift amount i − j increases by at least one (observe that F(j − 1) < j).

Hence, there are no more than 2n iterations of the while-loop. Thus, KMP’s algorithm runs in optimal time O(m + n).

Algorithm KMPMatch(T, P)
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n
    if T[i] = P[j]
      if j = m − 1
        return i − j  { match }
      else
        i ← i + 1
        j ← j + 1
    else
      if j > 0
        j ← F[j − 1]
      else
        i ← i + 1
  return −1  { no match }

Page 9: Computing the Failure Function

The failure function can be represented by an array and can be computed in O(m) time. The construction is similar to the KMP algorithm itself.

At each iteration of the while-loop, either
- i increases by one, or
- the shift amount i − j increases by at least one (observe that F(j − 1) < j).

Hence, there are no more than 2m iterations of the while-loop.

Algorithm failureFunction(P)
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m
    if P[i] = P[j]
      { we have matched j + 1 characters }
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0 then
      { use failure function to shift P }
      j ← F[j − 1]
    else
      F[i] ← 0  { no match }
      i ← i + 1
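
Putting this slide and the previous one together, a runnable Python version (a sketch following the pseudocode; the function names are ours):

    def failure_function(P):
        """F[j] = size of the largest prefix of P[0..j] that is also
        a suffix of P[1..j].  Computed in O(m) time."""
        m = len(P)
        F = [0] * m
        i, j = 1, 0
        while i < m:
            if P[i] == P[j]:        # we have matched j + 1 characters
                F[i] = j + 1
                i, j = i + 1, j + 1
            elif j > 0:             # use failure function to shift P
                j = F[j - 1]
            else:
                F[i] = 0            # no match
                i += 1
        return F

    def kmp_match(T, P):
        """Return the starting index of the first match of P in T,
        or -1.  Runs in O(m + n) time."""
        n, m = len(T), len(P)
        F = failure_function(P)
        i = j = 0
        while i < n:
            if T[i] == P[j]:
                if j == m - 1:
                    return i - j    # match
                i, j = i + 1, j + 1
            elif j > 0:
                j = F[j - 1]
            else:
                i += 1
        return -1                   # no match

    print(failure_function("abacab"))                    # -> [0, 0, 1, 0, 1, 2]
    print(kmp_match("abacaabaccabacabaabb", "abacab"))   # -> 10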

Page 10: Example

[Figure: a trace of KMP matching the pattern abacab against the text abacaabaccabacabaabb. The 19 character comparisons are numbered across five successive placements of the pattern; the match is found at index 10.]

j      0  1  2  3  4  5
P[j]   a  b  a  c  a  b
F(j)   0  0  1  0  1  2

Page 11: More Complex Patterns

Suppose you want to find repeated ATs followed by a G in GAGATATATATCATATG.
- How do you express the pattern to search for?
- How can you find it efficiently?
- What if the strings were billions of characters long?

Page 12: Finite Automata and Regular Expressions

How do I match Perl-like regular expressions against text? This is an important topic: regular expressions and finite automata.
- To the theoretician: regular expressions are grammars that define regular languages.
- To the programmer: they are compact patterns for matching and replacing.

Page 13: Regular Expressions

A regular expression is one of:
- a literal character
- a (regular expression) in parentheses
- a concatenation of two REs
- the alternation (“or”) of two REs, denoted + in formal notation
- the closure of an RE, denoted * (i.e. 0 or more occurrences)
- possibly additional syntactic sugar

Examples:
- abracadabra
- abra(cadabra)* = {abra, abracadabra, abracadabracadabra, ...}
- (a*b + ac)d
- (a(a+b)b*)*
- t(w+o)?o  [? means 0 or 1 occurrence in Perl]
- aa+rdvark  [+ means 1 or more occurrences in Perl]

Page 14: Finite Automata

Regular language: any language defined by an RE. Finite automata: machines that recognize regular languages.

Deterministic Finite Automaton (DFA):
- a set of states, including a start state and one or more accepting states
- a transition function: given the current state and an input letter, what is the new state?

Non-deterministic Finite Automaton (NDFA): like a DFA, but
- there may be more than one transition out of a state on the same letter (pick the right one non-deterministically, i.e. via lucky guess!)
- epsilon-transitions, i.e. optional transitions on no input letter

Page 15: DFA for (AT)+C

Note that a DFA can be represented as a 2D array: DFA[state][inputLetter] = newstate.

DFA:
state  letter      newstate
0      A           1
0      T, C, G     0
1      T           2
1      A, C, G     0
2      C           4  [accept]
2      G, T        0
2      A           3
3      T           2
3      A, G, C     0
4      A, G, C, T  0
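
A table-driven simulation of this DFA in Python (a sketch; the dictionary encoding is ours):

    # Transition table for the DFA recognizing (AT)+C,
    # over the alphabet {A, C, G, T}.
    DFA = {
        0: {'A': 1, 'T': 0, 'C': 0, 'G': 0},
        1: {'T': 2, 'A': 0, 'C': 0, 'G': 0},
        2: {'C': 4, 'A': 3, 'G': 0, 'T': 0},
        3: {'T': 2, 'A': 0, 'G': 0, 'C': 0},
        4: {'A': 0, 'G': 0, 'C': 0, 'T': 0},
    }
    ACCEPTING = {4}

    def accepts(s):
        """Run the DFA over s; O(n) for n input letters."""
        state = 0
        for ch in s:
            state = DFA[state][ch]
        return state in ACCEPTING

    print(accepts("ATC"))     # -> True
    print(accepts("ATATC"))   # -> True
    print(accepts("AC"))      # -> False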

Page 16: RE → NDFA

Given a regular expression, how can I build a DFA? Work bottom up, with an NDFA fragment for each case:
- Letter: [diagram]
- Concatenation: [diagram]
- Or: [diagram]
- Closure: [diagram]

Page 17: RE → NDFA Example

Construct an NDFA for the RE (A*B + AC)D, building up the subexpressions in stages:

A,  A*,  A*B,  A*B + AC,  (A*B + AC)D

[Figure: the NDFA fragment for each stage]

Page 18: NDFA → DFA

Keep track of the set of states you could be in. On each new input letter, compute the new set of states you could be in.

The set of states for the DFA is the power set of the NDFA states, i.e. up to 2^n states, where the NDFA had n.
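
A small Python sketch of the state-set idea, simulating an NDFA directly rather than converting it (the example machine, which accepts strings over {a, b} ending in ab, is ours):

    # NDFA transitions: (state, letter) -> set of possible next states.
    NFA = {
        (0, 'a'): {0, 1},   # guess: either stay, or start matching "ab"
        (0, 'b'): {0},
        (1, 'b'): {2},
    }
    START, ACCEPTING = {0}, {2}

    def nfa_accepts(s):
        """Track the set of states we could be in after each letter."""
        states = set(START)
        for ch in s:
            states = set().union(*(NFA.get((q, ch), set()) for q in states))
        return bool(states & ACCEPTING)

    print(nfa_accepts("aab"))   # -> True
    print(nfa_accepts("aba"))   # -> False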

Page 19: Recognizing Regular Languages

Suppose your language is given by a DFA. How do you recognize it?
- Build a table: one row for every (state, input letter) pair, giving the resulting state.
- For each letter of the input string, compute the new state.
- When done, check whether the last state is an accepting state.

Runtime? O(n), where n is the number of input letters.

Another approach: use a C program to simulate the NDFA with backtracking. Less space, more time. (egrep vs. fgrep?)

Page 20: Examples

Unix grep REs; Perl and other languages:

$input =~ s/t[wo]?o/2/;

$input =~ s|<link[^>]*>\s*||gs;

$input =~ s|\s*\@font-face\s*{.*?}||gs;

$input =~ s|\s*mso-[^>"]*"|"|gis;

$input =~ s/([^ ]+) +([^ ]+)/$2 $1/;

$input =~ m/^[0-9]+\.?[0-9]*|\.[0-9]+$/;

($word1,$word2,$rest) =

($foo =~ m/^ *([^ ]+) +([^ ]+) +(.*)$/);

$input=~s|<span[^>]*>\s*<br\s+clear="?all[^>]*>\s*</span>|<br clear="all"/>|gis;
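
For instance, the first substitution above rewrites to, two, or too as the digit 2; a Python re equivalent (the sample string is ours):

    import re
    s = "I have to go too"
    print(re.sub(r"t[wo]?o", "2", s))   # -> "I have 2 go 2"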

Page 21: Data Compression: Intro

Suppose you have a text, abracadabra, and want to compress it. How many bits are required?
- At 3 bits per letter, 33 bits.
- Can we do better? How about variable-length codes?
- In order to be able to decode the file again, we need a prefix code: no code is the prefix of another.
- How do we make a prefix code that compresses the text?

Page 22: Huffman Coding

Note: put the letters at the leaves of a binary tree, with left = 0 and right = 1. Voila! A prefix code.

Huffman coding produces an optimal prefix code. Algorithm: use a priority queue.

  insert all letters according to frequency
  while there is more than one tree left:
    a ← deleteMin(); b ← deleteMin()
    make tree t out of a and b, with weight a.weight() + b.weight()
    insert(t)
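
A compact Python sketch of this loop, using the standard heapq module as the priority queue (the tree representation and names are ours):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        """Build an optimal prefix code; return {letter: bit string}."""
        # Heap entries are (weight, tiebreak, tree); a tree is either a
        # letter or a (left, right) pair.  The tiebreak counter keeps the
        # heap from ever comparing two trees directly.
        heap = [(w, i, ch) for i, (ch, w) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:                    # merge the two lightest trees
            wa, _, a = heapq.heappop(heap)
            wb, _, b = heapq.heappop(heap)
            heapq.heappush(heap, (wa + wb, count, (a, b)))
            count += 1
        codes = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")     # left = 0
                walk(tree[1], prefix + "1")     # right = 1
            else:
                codes[tree] = prefix or "0"     # lone-letter corner case
        _, _, root = heap[0]
        walk(root, "")
        return codes

    codes = huffman_codes("abracadabra")
    print(sum(len(codes[ch]) for ch in "abracadabra"))   # -> 23 bits

The exact bit strings depend on how ties are broken, but any run yields an optimal code: 23 total bits for abracadabra.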

Page 23: Huffman Coding Example

abracadabra frequencies: a: 5, b: 2, c: 1, d: 1, r: 2

Huffman code: a: 0, b: 100, c: 1010, d: 1011, r: 11
Bits: 5*1 + 2*3 + 1*4 + 1*4 + 2*2 = 23 (vs. 33 with a fixed 3-bit code)

Decoding: follow the tree, O(n). Time to encode?
- Compute frequencies: O(n)
- Build heap: O(1), assuming the alphabet has constant size
- Encode: O(n)

Page 24: Huffman Coding Summary

Huffman coding is very frequently used. (You use it every time you watch HDTV or listen to mp3, for example.) Text files often compress to 60% of original size (depending on entropy). In real life, Huffman coding is usually used in conjunction with a modeling algorithm...

Page 25: Data Compression Overview

Two stages: modeling and entropy coding.
- Modeling: break up the input into tokens or chunks (the bigger, the better).
- Entropy coding: use shorter bit strings to represent more frequent tokens. If P is the probability of a code element, the optimal number of bits is −lg(P); for example, a token with probability 1/8 ideally gets −lg(1/8) = 3 bits (checked numerically below).
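
A quick numeric check of this formula against the Huffman example (the computation is ours):

    from math import log2
    from collections import Counter

    text = "abracadabra"
    n = len(text)
    # Total ideal bits: each occurrence of a letter costs -lg(P(letter)).
    ideal = sum(-log2(count / n) * count for count in Counter(text).values())
    print(round(ideal, 2))   # -> 22.44, vs. 23 bits for the Huffman code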

Page 26: Lempel-Ziv Modeling

Consider compressing text. Certain byte strings are more frequent than others: the, and, tion, es, etc. Model these with single tokens: build a dictionary of the byte strings you see; the second time you see a byte string, use its dictionary entry.

Page 27: Lempel-Ziv Compression

Start with a dictionary of 256 entries for the first 256 characters. At each step:
- Output the code of the longest dictionary match and delete those characters from the input.
- Add the last two tokens as a new dictionary entry, with codes 256, 257, 258, ...

Note that code lengths grow by one bit as the dictionary reaches size 512, 1024, 2048, etc.
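
A minimal Python sketch of this loop in its standard LZW form, where each new entry is the previous match plus the next character (one reading of "last two tokens"; it may emit slightly different tokens than the hand trace on the next page, since it uses new entries as soon as they are added):

    def lzw_compress(data):
        """Return a list of integer codes for the input string."""
        dictionary = {chr(i): i for i in range(256)}   # first 256 characters
        next_code = 256
        result, match = [], ""
        for ch in data:
            if match + ch in dictionary:     # keep extending the match
                match += ch
            else:
                result.append(dictionary[match])      # longest match's code
                dictionary[match + ch] = next_code    # new dictionary entry
                next_code += 1
                match = ch
        if match:
            result.append(dictionary[match])
        return result

    print(lzw_compress("COCOA AND BANANAS"))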

Page 28: Lempel-Ziv Example

Compressing COCOA_AND_BANANAS, writing _ for the space; #(x) denotes the dictionary code for the string x:

Output   Add to Dict
#(C)     -
#(O)     CO
#(CO)    OC
#(A)     COA
#(_)     A_
#(A)     _A
#(N)     AN
#(D)     ND
#(_)     D_
#(B)     _B
#(AN)    BA
#(AN)    ANA
#(A)     ANA
#(S)     AS

Page 29: Lempel-Ziv Variations

Compression programs like zip and gzip use variations on Lempel-Ziv. Possible variations:
- Fixed-length vs. variable-length codes, or adaptive Huffman or arithmetic coding
- Don't add duplicate entries to the dictionary
- Limit the number of codes, or switch to larger ones as needed
- Delete less frequent dictionary entries, or give frequent entries shorter codes

Page 30: How About This Approach?

Repeat:
- For each letter pair occurring in the text, tentatively replace the pair with a single new token and measure the total entropy (Huffman-compressed size) of the file.
- If that letter pair resulted in the greatest reduction in entropy so far, remember it.
- Permanently substitute a new token for the pair that caused the greatest reduction in entropy.
...until no more reductions in entropy are possible. (A sketch of the substitution step follows.)

Results: compression to about 25% for big books, better than gzip or zip (but not as good as bzip!).
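
This is close in spirit to byte-pair encoding. A simplified Python sketch that greedily merges the most frequent adjacent pair, using raw pair frequency as a cheap stand-in for the entropy measurement described above (all names are ours):

    from collections import Counter

    def pair_substitute(text, rounds=10):
        """Repeatedly replace the most frequent adjacent token pair
        with a single merged token."""
        tokens = list(text)
        for _ in range(rounds):
            pairs = Counter(zip(tokens, tokens[1:]))
            if not pairs:
                break
            (a, b), freq = pairs.most_common(1)[0]
            if freq < 2:                   # no pair worth replacing
                break
            out, i = [], 0
            while i < len(tokens):         # substitute the chosen pair
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            tokens = out
        return tokens

    print(pair_substitute("abracadabra"))  # -> ['abra', 'c', 'a', 'd', 'abra']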

Page 31: Compressing Other Data

Modeling for audio?

Modeling for images?

Page 32: Modeling for Images?

[Image from Wikipedia]

Page 33: JPEG, etc.

Modeling: convert to the frequency domain with the DCT.
- Throw away some high-frequency components
- Throw away imperceptible components
- Quantize coefficients
- Encode the remaining coefficients with Huffman coding

Results: up to 20:1 compression with good results, 100:1 with recognizable results. How the DCT changed the world...

Page 34: Data Compression Results

The best algorithms compress text to 25% of original size, but humans can compress it to 10%. Humans have far better modeling algorithms because they have better pattern recognition and higher-level patterns to recognize.

Intelligence ≈ pattern recognition ≈ data compression?

Going further: Data-Compression.com