January 5, 2016 1 [email protected]

Lecture 8: Burrows-Wheeler Transform

Introduction

Example

Example

Reverse Transform

Encoding Example

Decoding Example

Move to Front

Compression L

Contents

• Burrows-Wheeler, 1994

• The BW Transform creates a representation of the data which has a small working set.

• The transformed data is compressed with move-to-front compression.

• The decoder is quite different from the encoder.

• The algorithm requires processing the entire string at once (it is not on-line).

• It is a remarkably good compression method.

Introduction

The Burrows-Wheeler Transform (BWT) is a way of permuting the characters of a string T into another string BWT(T).

This permutation is reversible; one procedure exists for turning T into BWT(T) and another exists for turning BWT(T) back into T.

The BWT has two main applications: compression and indexing.

T denotes the string we would like to transform.

m = |T| (the length of T)

The BWT prepares a string of data for later compression. The compression itself is done with the move-to-front method, perhaps in combination with run-length encoding (RLE).

The Burrows-Wheeler method works in block mode, where the input stream is read block by block and each block is encoded separately as one string.

The BW method is general purpose: it works well on images, sound, and text, and can achieve very high compression ratios.

Take T = abaaba$, that is, the string abaaba followed by the end marker $.

First, we write down the rotations of T: the distinct strings we can make from T by repeatedly taking a character from one end and sticking it on the other:

abaaba$
baaba$a
aaba$ab
aba$aba
ba$abaa
a$abaab
$abaaba

Example

By writing them stacked vertically, we've created an m × m matrix. Now we sort the rows of the matrix lexicographically (i.e. alphabetically), with $ sorting before every letter:

$abaaba
a$abaab
aaba$ab
aba$aba
abaaba$
ba$abaa
baaba$a

This is the Burrows-Wheeler Matrix (BWM(T)). The final column of BWM(T), read from top to bottom, is BWT(T). So for T = abaaba$, BWT(T) = abba$aa.
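The construction above can be sketched in a few lines of Python. This is only a minimal illustration, not an efficient implementation: the function name `bwt_via_rotations` is made up for this example, and it relies on the assumption that `$` compares lower than every letter, which holds for Python's default string ordering.

```python
def bwt_via_rotations(t: str) -> str:
    """Return BWT(t): the last column of the sorted rotation matrix of t."""
    # Build all m rotations of t (the rows of the m x m matrix).
    rotations = [t[i:] + t[:i] for i in range(len(t))]
    # Sort the rows lexicographically to obtain BWM(t).
    matrix = sorted(rotations)
    # Read the final column from top to bottom.
    return "".join(row[-1] for row in matrix)


print(bwt_via_rotations("abaaba$"))  # prints abba$aa
```

Building every rotation explicitly costs O(m^2) memory, which is fine for a slide-sized example but not for large blocks.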

Read in the following block: "this is a test."

N = 15

C0 = 't'

C1 = 'h'

…

C13 = 't'

C14 = '.'

The next step is to think of the block as a cyclic buffer. N strings (rotations) S0 … SN-1 may be constructed such that:

S0 = C0, …, CN-1

S1 = C1, …, CN-1, C0

S2 = C2, …, CN-1, C0, C1

…

SN-1 = CN-1, C0, …, CN-2

Example

"this is a test." yields the following rotations:

S0 = "this is a test."

S1 = "his is a test.t"

S2 = "is is a test.th"

S3 = "s is a test.thi"

S4 = " is a test.this"

S5 = "is a test.this "

S6 = "s a test.this i"

S7 = " a test.this is"

S8 = "a test.this is "

S9 = " test.this is a"

S10 = "test.this is a "

S11 = "est.this is a t"

S12 = "st.this is a te"

S13 = "t.this is a tes"

S14 = ".this is a test"

The third step of BWT is to lexicographically sort S0 … SN-1.

"this is a test." yields the following sorted rotations:

S7 = " a test.this is"

S4 = " is a test.this"

S9 = " test.this is a"

S14 = ".this is a test"

S8 = "a test.this is "

S11 = "est.this is a t"

S1 = "his is a test.t"

S5 = "is a test.this "

S2 = "is is a test.th"

S6 = "s a test.this i"

S3 = "s is a test.thi"

S12 = "st.this is a te"

S13 = "t.this is a tes"

S10 = "test.this is a "

S0 = "this is a test."

The final step in the transform is to output a string L, consisting of the last character of each rotation in sorted order, along with I, the index of the sorted row that contains S0.

"this is a test." yields the following output:

L = "ssat tt hiies .", I = 14
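The block-mode steps (rotate, sort, output L and I) might be sketched as follows. The name `bwt_block` is hypothetical, and the sketch assumes plain character-code ordering, in which a space sorts before '.' and both sort before the letters, matching the sorted list shown earlier.

```python
def bwt_block(block: str) -> tuple[str, int]:
    """Return (L, I) for one block, using 0-based row numbers."""
    n = len(block)
    # order[k] is the start offset of the rotation that lands in sorted row k.
    order = sorted(range(n), key=lambda i: block[i:] + block[:i])
    # The last character of the rotation starting at offset i is block[i - 1].
    last = "".join(block[(i - 1) % n] for i in order)
    # I is the sorted row that holds S0, the unrotated block.
    return last, order.index(0)


L, I = bwt_block("this is a test.")
print(repr(L), I)  # prints 'ssat tt hiies .' 14
```

Sorting offsets instead of materialized strings keeps only one copy of each rotation alive at a time, although each comparison key is still built in full.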

Reversing BWT is a little more complicated than the initial transform.

The reversal process starts with a string L composed of the last characters of the sorted rotations (S0 … SN-1) and I, the position in L of the character contributed by S0.

The reversal process must yield S0, the original block.

It turns out there are a few ways to reverse the transform. The method discussed here is the one that I ended up implementing.

If L is composed of the symbols V0 … VN-1, the transformed string may be parsed to determine the following pieces of additional information:

1. For each position i, the number of symbols in the substring V0 … Vi-1 that are identical to Vi.

2. For each unique symbol, Vi, in L, the number of symbols in L that are lexicographically less than that symbol.

Reverse Transform

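L = "ssat tt hiies ." produces the following:

Table 1: for each position i, the number of earlier symbols identical to Vi

i    Vi    count
0    's'   0
1    's'   1
2    'a'   0
3    't'   0
4    ' '   0
5    't'   1
6    't'   2
7    ' '   1
8    'h'   0
9    'i'   0
10   'i'   1
11   'e'   0
12   's'   2
13   ' '   2
14   '.'   0

Table 2: for each unique symbol, the number of symbols in L that are lexicographically less than it

symbol   count
' '      0
'.'      3
'a'      4
'e'      5
'h'      6
'i'      7
's'      9
't'      12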
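As a rough sketch, both tables can be computed from L in a single pass plus one sweep over the sorted symbols. The function name `build_tables` and the variable names are illustrative only and do not come from the slides.

```python
from collections import Counter


def build_tables(last: str):
    """Return (table1, table2) for the transformed string `last`."""
    seen = Counter()
    table1 = []
    for ch in last:
        # Table 1: how many earlier symbols are identical to this one.
        table1.append(seen[ch])
        seen[ch] += 1
    table2 = {}
    total = 0
    for ch in sorted(seen):
        # Table 2: how many symbols in `last` are smaller than ch.
        table2[ch] = total
        total += seen[ch]
    return table1, table2


table1, table2 = build_tables("ssat tt hiies .")
print(table1[12], table2["s"])  # prints 2 9
```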

Using Tables 1 and 2, reverse the BWT where L = "ssat tt hiies ." and I = 14.

We start with:

S0 = ???????????????

Row I = 14 is S0, so its last character gives C14 = V14 = '.'.

S0 = ??????????????.

Table 1 tells us that there are 0 other '.' before V14 and Table 2 tells us that there are 3 characters < '.', so C13 must be V(0 + 3) = V3 = 't'.

S0 = ?????????????t.

Table 1 tells us that there are 0 other 't' before V3 and Table 2 tells us that there are 12 characters < 't', so C12 must be V(0 + 12) = V12 = 's'.

S0 = ????????????st.

Table 1 tells us that there are 2 other 's' before V12 and Table 2 tells us that there are 9 characters < 's', so C11 must be V(2 + 9) = V11 = 'e'.

S0 = ???????????est.

Table 1 tells us that there are 0 other 'e' before V11 and Table 2 tells us that there are 5 characters < 'e', so C10 must be V(0 + 5) = V5 = 't'.

S0 = ??????????test.

Table 1 tells us that there is 1 other 't' before V5 and Table 2 tells us that there are 12 characters < 't', so C9 must be V(1 + 12) = V13 = ' '.

S0 = ????????? test.

Table 1 tells us that there are 2 other ' ' before V13 and Table 2 tells us that there are 0 characters < ' ', so C8 must be V(2 + 0) = V2 = 'a'.

S0 = ????????a test.
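Continuing the same rule until every position is filled gives the whole block. The sketch below is one way to write that loop; `inverse_bwt` is a hypothetical name, and the two tables are rebuilt inline so that the example stands on its own.

```python
from collections import Counter


def inverse_bwt(last: str, index: int) -> str:
    """Rebuild the original block from L (`last`) and I (`index`)."""
    # Table 1: for each position, the count of earlier identical symbols.
    seen = Counter()
    table1 = []
    for ch in last:
        table1.append(seen[ch])
        seen[ch] += 1
    # Table 2: for each symbol, the count of smaller symbols in `last`.
    table2, total = {}, 0
    for ch in sorted(seen):
        table2[ch] = total
        total += seen[ch]
    # Fill S0 from its last character back to its first.
    out = [""] * len(last)
    row = index
    for pos in range(len(last) - 1, -1, -1):
        ch = last[row]
        out[pos] = ch
        # The next character to the left sits at row Table1 + Table2.
        row = table1[row] + table2[ch]
    return "".join(out)


print(inverse_bwt("ssat tt hiies .", 14))  # prints this is a test.
```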

• abracadabra

1. Create all cyclic shifts of the string.

0 abracadabra

1 bracadabraa

2 racadabraab

3 acadabraabr

4 cadabraabra

5 adabraabrac

6 dabraabraca

7 abraabracad

8 braabracada

9 raabracadab

10 aabracadabr

Encoding Example

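2. Sort the strings alphabetically into array A

0 aabracadabr

1 abraabracad

2 abracadabra

3 acadabraabr

4 adabraabrac

5 braabracada

6 bracadabraa

7 cadabraabra

8 dabraabraca

9 raabracadab

10 racadabraab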

3. L = the last column of A (for abracadabra, L = rdarcaaaabb)

4. Transmit X, the index of the input in A, together with L (encoded using move-to-front coding).
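Steps 1 to 3, plus the index X from step 4, can be sketched as below. The function name `encode_example` is illustrative, and the move-to-front stage is left out of this sketch.

```python
def encode_example(s: str):
    """Return (A, L, X) for the input string s."""
    # Step 1: all cyclic shifts of the input.
    shifts = [s[i:] + s[:i] for i in range(len(s))]
    # Step 2: sort them alphabetically into array A.
    a = sorted(shifts)
    # Step 3: L is the last column of A.
    l = "".join(row[-1] for row in a)
    # Step 4: X is the index of the input string in A.
    x = a.index(s)
    return a, l, x


A, L, X = encode_example("abracadabra")
print(L, X)  # prints rdarcaaaabb 2
```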

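• To decode, we first assume that some extra information is known. We then show how to compute that information.

• Let As be A with every row cyclically shifted by 1, so that the last character of each row moves to the front.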

Decoding Example

• Assume we know the mapping T, where T[i] is the index in As of the string A[i].

• T = [2 5 6 7 8 9 10 4 1 0 3]

• Let F be the first column of A; it is just L sorted.

• Starting at index X, follow the pointers in T, reading the symbol of F at each index visited, to recover the input.
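A small sketch of the pointer-following step, using the T given for abracadabra. The values F = "aaaaabbcdrr" and X = 2 are not stated on the slides; they are worked out here from the abracadabra example (F is the sorted letters of the input, X is the row of the input in A). The name `follow_pointers` is made up for this illustration.

```python
def follow_pointers(f: str, t: list, x: int) -> str:
    """Recover the input by walking T from index X and reading F."""
    out = []
    i = x
    for _ in range(len(f)):
        # F[i] is the first symbol of A[i]; T[i] leads to the next symbol.
        out.append(f[i])
        i = t[i]
    return "".join(out)


F = "aaaaabbcdrr"  # assumed: the sorted letters of "abracadabra"
T = [2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3]
X = 2  # assumed: row of the input in the sorted array A
print(follow_pointers(F, T, X))  # prints abracadabra
```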

Decoding Example

• Why does this work?

• The first symbol of A[T[i]] is the second symbol of A[i] because As[T[i]] = A[i].

Decoding Example

• How do we compute F and T from L and X?

F is just L sorted.

Note that L is the first column of As, and As is in the same order as A.

If i is the position of the k-th occurrence of a symbol x in F, then T[i] is the position of the k-th occurrence of x in L.
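The k-th-occurrence rule is exactly what a stable sort of the positions of L produces, so one possible sketch of the full decoder is shown below. It relies on Python's `sorted` being stable; the name `decode_from_l` is hypothetical, and L = "rdarcaaaabb" with X = 2 is worked out from the abracadabra example rather than quoted from the slides.

```python
def decode_from_l(l: str, x: int):
    """Return (F, T, text) computed from L and the index X."""
    # Stable sort of positions by symbol: the k-th x in F maps to the
    # k-th x in L, so the sorted position list is T itself.
    t = sorted(range(len(l)), key=lambda j: l[j])
    # F is L sorted, read off in the same order.
    f = "".join(l[j] for j in t)
    out = []
    i = x
    for _ in range(len(l)):
        out.append(f[i])
        i = t[i]
    return f, t, "".join(out)


F, T, text = decode_from_l("rdarcaaaabb", 2)
print(T)     # prints [2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3]
print(text)  # prints abracadabra
```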

1. Initialize A to a list containing the symbols of our alphabet.

2. For i = 0, …, n − 1, encode symbol Li as the number of symbols preceding it in A, and then move symbol Li to the beginning of A.

3. Combine the codes of step 2 in a list C, which will be further compressed using Huffman or arithmetic coding.
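Steps 1 and 2 might look like the following sketch, with a matching decoder added for completeness. The names `mtf_encode` and `mtf_decode` are illustrative, the alphabet "abcdr" and the input "rdarcaaaabb" are assumptions taken from the abracadabra example, and the Huffman or arithmetic coding of step 3 is not shown.

```python
def mtf_encode(text: str, alphabet: str) -> list:
    """Encode each symbol as its current position in the list."""
    symbols = list(alphabet)  # step 1: start from the alphabet
    codes = []
    for ch in text:
        idx = symbols.index(ch)  # symbols preceding ch in the list
        codes.append(idx)
        symbols.insert(0, symbols.pop(idx))  # move ch to the front
    return codes


def mtf_decode(codes: list, alphabet: str) -> str:
    """Invert mtf_encode by replaying the same list updates."""
    symbols = list(alphabet)
    out = []
    for idx in codes:
        ch = symbols.pop(idx)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)


codes = mtf_encode("rdarcaaaabb", "abcdr")
print(codes)  # prints [4, 4, 2, 2, 4, 2, 0, 0, 0, 4, 0]
print(mtf_decode(codes, "abcdr"))  # prints rdarcaaaabb
```

Repeated symbols in L turn into zeros, which is what makes the list C a good input for an entropy coder.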

Compression L

Move to Front

The basic idea of this method [Bentley 86] is to maintain the alphabet A of symbols as a list where frequently occurring symbols are located near the front.

NOTE.

The last column, L, of the sorted matrix contains concentrations of identical characters, which is why L is easy to compress. However, the first column, F, of the same matrix is even easier to compress, since it contains runs, not just concentrations, of identical characters. Why select column L and not column F? Answer. Because the original string S can be reconstructed from L but not from F.
