Lecture 8: Burrows-Wheeler Transform

January 5, 2016 1 [email protected]


Introduction

Example

Example

Reverse Transform

Encoding Example

Decoding Example

Move to Front

Compression L

Contents


• Burrows-Wheeler, 1994

• BW Transform creates a representation of the data which has a small

working set.

• The transformed data is then compressed with move-to-front coding.

• The decoder is quite different from the encoder.

• The algorithm requires processing the entire string at once (it is not

online).

• It is a remarkably good compression method.

Introduction


The Burrows-Wheeler Transform (BWT) is a way of permuting the

characters of a string T into another string BWT(T).

This permutation is reversible; one procedure exists for turning T into

BWT(T) and another exists for turning BWT(T) back into T.

The BWT has two main applications: compression and indexing.

T denotes the string we would like to transform

m = |T| (the length of T)


The BW transform prepares a string of data for later compression. The compression itself is

done with the move-to-front method, perhaps in combination with RLE.

The Burrows-Wheeler method works in block mode, where the input stream is

read block by block and each block is encoded separately as one string.

The BW method is general purpose: it works well on images, sound,

and text, and can achieve very high compression ratios.


Take T = abaaba$

First, we write down the rotations of T, that is,

the distinct strings we can make from T by repeatedly taking a character

from one end and sticking it on the other:

Read the string abaaba

Example


By writing them stacked vertically, we've created an m x m matrix. Now we

sort the rows of the matrix lexicographically (i.e. alphabetically):

This is the Burrows-Wheeler Matrix (BWM(T)). The final column of BWM(T),

read from top to bottom, is BWT(T). So for T = abaaba$, BWT(T) = abba$aa.
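The construction above can be sketched directly in Python. This is a minimal illustration, not an efficient implementation: it builds all m rotations explicitly, and the function name `bwt_via_matrix` is made up for this example. It assumes T ends with a sentinel such as `$` that occurs nowhere else and sorts before every other character.

```python
def bwt_via_matrix(t: str) -> str:
    """Return BWT(t) by building and sorting all rotations of t.

    Assumes t ends with a unique sentinel (e.g. '$') that sorts
    before every other character in t.
    """
    m = len(t)
    # Row i is t rotated left by i characters.
    rotations = [t[i:] + t[:i] for i in range(m)]
    # Sorting the rows lexicographically gives the matrix BWM(t).
    bwm = sorted(rotations)
    # BWT(t) is the last column of BWM(t), read top to bottom.
    return "".join(row[-1] for row in bwm)


print(bwt_via_matrix("abaaba$"))  # abba$aa
```

Building the full matrix costs O(m^2) space, which is fine for slide-sized examples but not for large blocks.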


Read in the following block: this is a test.

N = 15

C0 = 't'

C1 = 'h'

…

C13 = 't'

C14 = '.'

The next step is to think of the block as a cyclic buffer. N strings

(rotations) S0 … SN-1 may be constructed such that:

S0 = C0, …, CN-1

S1 = C1, …, CN-1, C0

S2 = C2, …, CN-1, C0, C1

…

SN-1 = CN-1, C0, …, CN-2

Example


"this is a test." yields the following rotations:

S0 = "this is a test."

S1 = "his is a test.t"

S2 = "is is a test.th"

S3 = "s is a test.thi"

S4 = " is a test.this"

S5 = "is a test.this "

S6 = "s a test.this i"

S7 = " a test.this is"

S8 = "a test.this is "

S9 = " test.this is a"

S10 = "test.this is a "

S11 = "est.this is a t"

S12 = "st.this is a te"

S13 = "t.this is a tes"

S14 = ".this is a test"


The third step of BWT is to lexicographically sort S0 … SN-1.

"this is a test." yields the following sorted rotations:

S7 = " a test.this is"

S4 = " is a test.this"

S9 = " test.this is a"

S14 = ".this is a test"

S8 = "a test.this is "

S11 = "est.this is a t"

S1 = "his is a test.t"

S5 = "is a test.this "

S2 = "is is a test.th"

S6 = "s a test.this i"

S3 = "s is a test.thi"

S12 = "st.this is a te"

S13 = "t.this is a tes"

S10 = "test.this is a "

S0 = "this is a test."


The final step in the transform is to output a string L, consisting of the

last character in each of the rotations in their sorted order, along with

I, the index of the sorted row containing S0.

"this is a test." yields the following output:

L = "ssat tt hiies .", I = 14
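The four steps for the block form of the transform (no sentinel, output L plus the index I) might be sketched as follows. The name `bwt_encode` is hypothetical, and sorting by the full rotation string is the simplest possible approach rather than what a production compressor would do.

```python
from typing import Tuple


def bwt_encode(block: str) -> Tuple[str, int]:
    """Return (L, I) for a block, following the steps on the slides."""
    n = len(block)
    # Sort the rotation start positions 0..N-1 by the rotation each one begins.
    order = sorted(range(n), key=lambda i: block[i:] + block[:i])
    # Rotation S_i ends with C_(i-1); for i = 0 the index -1 wraps to C_(N-1).
    last = "".join(block[i - 1] for i in order)
    # I is the sorted row that holds S_0, the original block.
    return last, order.index(0)


print(bwt_encode("this is a test."))  # ('ssat tt hiies .', 14)
```

The same function applied to "abracadabra" gives ("rdarcaaaabb", 2).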


Reversing BWT is a little more complicated than the initial transform.

The reversal process starts with a string L composed of the last characters of

the sorted rotations (S0 … SN-1) and I, the position in L of the character

contributed by S0.

The reversal process must yield S0, the original block.

It turns out there are a few ways to reverse the transform. The method

discussed here is the one that I ended up implementing.

If L is composed of the symbols V0 … VN-1, the transformed string may

be parsed to determine the following pieces of additional information:

1. The number of symbols in the substring V0 … Vi-1 that are identical to Vi.

2. For each unique symbol, Vi, in L, the number of symbols in L that are

lexicographically less than that symbol.

Reverse Transform


L = "ssat tt hiies ." produces the following:

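Table 1: for each position i, the number of earlier symbols in L identical to Vi

i      0   1   2   3   4   5   6   7   8   9   10  11  12  13  14
Vi     s   s   a   t   ' ' t   t   ' ' h   i   i   e   s   ' ' .
count  0   1   0   0   0   1   2   1   0   0   1   0   2   2   0

Table 2: for each unique symbol, the number of symbols in L that are less than it

symbol  ' '  .    a    e    h    i    s    t
less    0    3    4    5    6    7    9    12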


Using Tables 1 and 2, reverse the BWT where L = "ssat tt hiies ." and I = 14.

We start with:

S0 = ???????????????

We're given that C14 is V14 = '.'.

S0 = ??????????????.

Table 1 tells us that there are 0 other '.' before V14 and Table 2 tells us that there are 3 characters < '.',

so C13 must be V(0+3) = V3 = 't'.

S0 = ?????????????t.

Table 1 tells us that there are 0 other 't' before V3 and Table 2 tells us that there are 12 characters < 't',

so C12 must be V(0+12) = V12 = 's'.

S0 = ????????????st.

Table 1 tells us that there are 2 other 's' before V12 and Table 2 tells us that there are 9 characters < 's',

so C11 must be V(2+9) = V11 = 'e'.

S0 = ???????????est.


Table 1 tells us that there are 0 other 'e' before V11 and Table 2 tells us that there are 5 characters < 'e',

so C10 must be V(0+5) = V5 = 't'.

S0 = ??????????test.

Table 1 tells us that there is 1 other 't' before V5 and Table 2 tells us that there are 12 characters < 't', so

C9 must be V(1+12) = V13 = ' '.

S0 = ????????? test.

Table 1 tells us that there are 2 other ' ' before V13 and Table 2 tells us that there are 0 characters < ' ', so

C8 must be V(2+0) = V2 = 'a'.

S0 = ????????a test.
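The walk-through can be continued mechanically until every '?' is filled in. The sketch below builds the two tables from L and then repeats the "Table 1 count + Table 2 count" step N times. The names `bwt_decode`, `prior` and `less` are illustrative choices, not taken from the slides.

```python
from collections import Counter


def bwt_decode(last: str, index: int) -> str:
    """Rebuild S0 from L (last) and I (index) using the two tables."""
    n = len(last)
    # Table 1: for each position i, how many earlier symbols equal last[i].
    seen = Counter()
    prior = []
    for ch in last:
        prior.append(seen[ch])
        seen[ch] += 1
    # Table 2: for each distinct symbol, how many symbols in L are smaller.
    less = {}
    total = 0
    for ch in sorted(seen):
        less[ch] = total
        total += seen[ch]
    # Fill S0 from its last character backwards, as in the walk-through.
    out = [""] * n
    i = index
    for pos in range(n - 1, -1, -1):
        out[pos] = last[i]
        i = prior[i] + less[last[i]]
    return "".join(out)


print(bwt_decode("ssat tt hiies .", 14))  # this is a test.
```

Both tables are built in one pass over L plus a sort of the distinct symbols, so decoding needs no rotation matrix at all.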


• abracadabra

1. Create all cyclic shifts of the string.

0 abracadabra

1 bracadabraa

2 racadabraab

3 acadabraabr

4 cadabraabra

5 adabraabrac

6 dabraabraca

7 abraabracad

8 braabracada

9 raabracadab

10 aabracadabr

Encoding Example


2. Sort the strings alphabetically into array A


3. L = the last column


4. Transmit X, the index of the input in A, and L (using move-to-front coding).


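• To decode, we first assume some information is known. We then show how

to compute that information.

• Let As be A with every row cyclically shifted right by 1, so that the last column of A becomes the first column of As.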

Decoding Example


• Assume we know the mapping T, where T[i] is the index in As of the i-th string of A.

• T = [2 5 6 7 8 9 10 4 1 0 3]


• Let F be the first column of A; it is just L sorted.

• Follow the pointers in T in F to recover the input starting with X.

Decoding Example





• Why does this work?

• The first symbol of A[T[i]] is the second symbol of A[i]

because As[T[i]] = A[i].

Decoding Example


• How do we compute F and T from L and X?

F is just L sorted.

Note that L is the first column of As, and As is in the same order as A.

If F[i] is the k-th occurrence of symbol x in F, then T[i] is the position of the k-th occurrence of x in L.
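A short sketch of this decoder: a stable sort of the positions of L by symbol yields T directly, because it pairs the k-th occurrence of each symbol in F with the k-th occurrence in L. The function names are hypothetical, and the code relies on Python's `sorted` being stable.

```python
from typing import List, Tuple


def build_f_and_t(last: str) -> Tuple[str, List[int]]:
    """Compute F (L sorted) and the pointer array T from L alone."""
    # Stable sort of positions by symbol: equal symbols keep their order in L,
    # so the k-th x in F points at the k-th x in L.
    t = sorted(range(len(last)), key=lambda j: last[j])
    f = "".join(last[j] for j in t)
    return f, t


def decode_with_pointers(last: str, x: int) -> str:
    """Recover the input by following T through F, starting at row X."""
    f, t = build_f_and_t(last)
    out = []
    i = x
    for _ in range(len(last)):
        out.append(f[i])  # first symbol of row i of A
        i = t[i]          # row of A that starts with the next symbol
    return "".join(out)


print(build_f_and_t("rdarcaaaabb")[1])         # [2, 5, 6, 7, 8, 9, 10, 4, 1, 0, 3]
print(decode_with_pointers("rdarcaaaabb", 2))  # abracadabra
```

Unlike the table-based reversal, this variant emits the input from first character to last.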







Move to Front

The basic idea of this method [Bentley 86] is to maintain the alphabet A of

symbols as a list where frequently occurring symbols are located near the

front.

1. Initialize A to a list containing the symbols of our alphabet.

2. For i = 0, . . . , n − 1, encode symbol Li as the number of symbols

preceding it in A, and then move symbol Li to the beginning of A.

3. Combine the codes of step 2 in a list C, which will be further

compressed using Huffman or arithmetic coding.

Compression L
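Steps 1 and 2 of move-to-front coding, together with the matching decoder, can be sketched as below. The initial alphabet order "abcdr" is an assumption made for this example; encoder and decoder only need to agree on it. The entropy-coding stage of step 3 is left out.

```python
from typing import List


def mtf_encode(text: str, alphabet: str) -> List[int]:
    """Encode each symbol as its current position in the list, then move it to the front."""
    symbols = list(alphabet)
    codes = []
    for ch in text:
        k = symbols.index(ch)             # number of symbols preceding ch
        codes.append(k)
        symbols.insert(0, symbols.pop(k)) # move ch to the beginning
    return codes


def mtf_decode(codes: List[int], alphabet: str) -> str:
    """Invert mtf_encode by replaying the same list updates."""
    symbols = list(alphabet)
    out = []
    for k in codes:
        ch = symbols.pop(k)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)


codes = mtf_encode("rdarcaaaabb", "abcdr")
print(codes)                       # [4, 4, 2, 2, 4, 2, 0, 0, 0, 4, 0]
print(mtf_decode(codes, "abcdr"))  # rdarcaaaabb
```

Runs of identical symbols in L turn into runs of zeros, which is what makes the output easy for a Huffman or arithmetic coder to compress.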


NOTE.

The last column, L, of the sorted matrix contains concentrations of identical

characters, which is why L is easy to compress. However, the first column,

F, of the same matrix is even easier to compress, since it contains runs, not

just concentrations, of identical characters. Why select column L and not

column F? Answer. Because the original string S can be reconstructed from

L but not from F.

