Wavelet Trees Ankur Gupta Butler University

Text Dictionary Problem

• The input is a text T drawn from an alphabet Σ. We want to support the following queries:
  – char(i): returns the symbol at position i
  – rankc(i): the number of c's in T up to position i
  – selectc(i): the position of the ith occurrence of symbol c in T
• Text T can be compressed to nH0 space, answering queries in
  – O(log |Σ|) time using the wavelet tree [GGV03]
  – O(log log |Σ|) time using [GMR06], but the space is larger
• When |Σ| = polylog(n), queries can be answered in O(1) time [FMMN04]
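
To pin down what the three queries mean, here is a minimal reference sketch in Python. It scans the text in linear time and is only meant as a specification to test faster structures against. The function names and the 1-indexed convention are assumptions chosen to match the examples on the following slides.

```python
# Reference (linear-time) versions of the three queries, 1-indexed.

def char(T, i):
    """Symbol at position i."""
    return T[i - 1]

def rank(T, c, i):
    """Number of occurrences of c in T[1..i]."""
    return T[:i].count(c)

def select(T, c, i):
    """Position of the i-th occurrence of c in T, or None if there are fewer."""
    seen = 0
    for pos, sym in enumerate(T, start=1):
        if sym == c:
            seen += 1
            if seen == i:
                return pos
    return None

T = "preparedpeppers"
print(char(T, 7), rank(T, "r", 10), select(T, "r", 2))  # e 2 6
```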

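Wavelet tree for T = preparedpeppers (each node shows its string and its bitvector; branch 0 goes left, branch 1 goes right)

preparedpeppers  110101001011011
  0: eaedee  101111
    0: a  1
    1: eedee  11011
      0: d  1
      1: eeee  1111
  1: prprppprs  010100011
    0: ppppp  11111
    1: rrrs  0001
      0: rrr  111
      1: s  1

Compute rankr(10) (answer is 2)

• At the root, actually compute rank1(10) = 5
• At prprppprs, actually compute rank1(5) = 2
• At rrrs, actually compute rank0(2) = 2
• At rrr, actually compute rank1(2) = 2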

(Same wavelet tree for T = preparedpeppers as in the rank example.)

Compute selectr(2) (answer is 6)

Work from the leaf back up to the root:

• At rrr, actually compute select1(2) = 2
• At rrrs, actually compute select0(2) = 2 (r is encoded as 0 in this node)
• At prprppprs, actually compute select1(2) = 4
• At the root, actually compute select1(4) = 6

(Same wavelet tree for T = preparedpeppers as in the rank example.)

Compute char(7) (answer is e)

At each node, read the bit at the current position, then use rank of that bit to find the position in the child:

• At the root, actually compute char(7) = 0, rank0(7) = 3
• At eaedee, actually compute char(3) = 1, rank1(3) = 2
• At eedee, actually compute char(2) = 1, rank1(2) = 2
• At eeee, actually compute char(2) = 1, rank1(2) = 2
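
The three traversals above can be sketched as a small pointer-based wavelet tree. This is an illustrative sketch, not the structure from [GGV03]: the class name, the method names, and the choice to split the sorted alphabet in half at each node are assumptions (that split happens to reproduce the tree in the example). Bit-level rank and select are linear scans here, whereas a real implementation would use succinct rank/select directories.

```python
class WaveletTree:
    """Pointer-based wavelet tree; bitvectors are plain Python lists."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.left = self.right = None
        self.bits = []
        if len(self.alphabet) <= 1:
            return  # leaf: the "all 1s" nodes of the slides need no bits
        mid = len(self.alphabet) // 2
        self.zero_side = set(self.alphabet[:mid])
        self.bits = [0 if ch in self.zero_side else 1 for ch in text]
        self.left = WaveletTree([ch for ch in text if ch in self.zero_side],
                                self.alphabet[:mid])
        self.right = WaveletTree([ch for ch in text if ch not in self.zero_side],
                                 self.alphabet[mid:])

    def _rank_bit(self, b, i):
        """Number of bits equal to b among the first i bits."""
        return sum(1 for x in self.bits[:i] if x == b)

    def _select_bit(self, b, j):
        """1-indexed position of the j-th bit equal to b."""
        seen = 0
        for pos, x in enumerate(self.bits, start=1):
            if x == b:
                seen += 1
                if seen == j:
                    return pos
        raise ValueError("fewer than j matching bits")

    def rank(self, c, i):
        """Occurrences of c in positions 1..i (top-down walk)."""
        if c not in self.alphabet:
            return 0
        node = self
        while node.left is not None:
            b = 0 if c in node.zero_side else 1
            i = node._rank_bit(b, i)
            node = node.left if b == 0 else node.right
        return i

    def select(self, c, j):
        """Position of the j-th occurrence of c (bottom-up walk)."""
        path, node = [], self
        while node.left is not None:
            b = 0 if c in node.zero_side else 1
            path.append((node, b))
            node = node.left if b == 0 else node.right
        for node, b in reversed(path):
            j = node._select_bit(b, j)
        return j

    def access(self, i):
        """Symbol at position i, i.e. char(i)."""
        node = self
        while node.left is not None:
            b = node.bits[i - 1]
            i = node._rank_bit(b, i)
            node = node.left if b == 0 else node.right
        return node.alphabet[0]


wt = WaveletTree("preparedpeppers")
print(wt.rank("r", 10), wt.select("r", 2), wt.access(7))  # 2 6 e
```

Each query touches one node per level, which is where the O(log |Σ|) bound comes from once the bit-level operations take constant time.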

Some comments

• Don’t have to store any of the “all 1s” nodes
  – That’s just to help for the example.
• What does the wavelet tree imply?
  – Converts representation of a finite string on an alphabet to representation of many bitvectors.
  – Useful to achieve, ultimately, high-order compression.
  – Easy to implement: very simple structure and query pattern

Shapin’ Up To Something Special

• What about the shape of a wavelet tree?
  – Does it affect space? No. (You will see why in a bit.)
  – Time? Yes.
• Good news! Reorganize it to optimize query time. . .
  – Use a Huffman orientation based on query access.
  – If you choose symbol frequency, you can now search in O(H0) time on average instead of O(log |Σ|).
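
A quick way to see the effect of a Huffman shape is to compare the average leaf depth, weighted by symbol frequency, against H0. The sketch below builds Huffman depths with the standard library heap. The function names are illustrative, and it only measures depths; it does not build the wavelet tree itself.

```python
import heapq
from collections import Counter
from math import log2

def huffman_depths(text):
    """Return {symbol: leaf depth} for a Huffman tree on symbol frequencies."""
    freq = Counter(text)
    if len(freq) == 1:
        return {next(iter(freq)): 0}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(w, k, [s]) for k, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    depth = dict.fromkeys(freq, 0)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every merged symbol moves one level deeper
            depth[s] += 1
        heapq.heappush(heap, (w1 + w2, next_id, s1 + s2))
        next_id += 1
    return depth

def h0(text):
    """Zero-order empirical entropy in bits per symbol."""
    n = len(text)
    return sum(c / n * log2(n / c) for c in Counter(text).values())

T = "preparedpeppers"
depth = huffman_depths(T)
avg_depth = sum(depth[ch] for ch in T) / len(T)
print(f"H0 = {h0(T):.3f} bits, average Huffman depth = {avg_depth:.3f}")
```

The average depth is the expected number of nodes visited when queried symbols follow the text's own frequencies, and Huffman coding guarantees it lies between H0 and H0 + 1.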

Wavelet Tree Space/Time

• Simple bitvectors
  – n bits per level and log |Σ| levels
    • n log |Σ| overall bits
    • O(n log log n / log n) extra bits for rank/select [J89]
  – Same space as original text but can now support rankc/selectc/char in O(log |Σ|) time. (RAM)
• Fancy
  – [RRR02] Gets O(nH0) + O(n log log n / log n) bits of space with O(log |Σ|) query time

Even Skewed Is a Shape

• Consider a totally skewed wavelet “tree”
  – i.e. set symbol a to 0 and all others to 1
  – The tree will look like a line, and will take this much space [RRR02]. . .
  – This telescopes into the multinomial coefficient, regardless of the dependency list
    • Simple exercise to check this fact
• Thus, shape doesn’t affect the space.
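
The "simple exercise" can be checked numerically. The sketch below assumes the usual accounting for [RRR02], in which a bitvector of length n with k ones costs about log2 C(n, k) bits plus lower-order terms. In a skewed tree, each level peels off one symbol, so the leading terms multiply to C(n, n_1) · C(n − n_1, n_2) · ..., which should equal the multinomial coefficient n! / (n_1! n_2! ...). The function names are illustrative.

```python
from collections import Counter
from math import comb, factorial, log2, prod

def skewed_tree_product(text, order=None):
    """Product over the levels of a skewed tree of C(remaining, count)."""
    counts = Counter(text)
    remaining = len(text)
    product = 1
    for symbol in (order if order is not None else sorted(counts)):
        product *= comb(remaining, counts[symbol])  # one level, one symbol peeled off
        remaining -= counts[symbol]
    return product

def multinomial(text):
    """n! / (n_1! * n_2! * ...) for the symbol counts of text."""
    counts = Counter(text)
    return factorial(len(text)) // prod(factorial(c) for c in counts.values())

T = "preparedpeppers"
assert skewed_tree_product(T) == multinomial(T)
assert skewed_tree_product(T, order="srpeda") == multinomial(T)  # any peel order
print(f"log2(multinomial) = {log2(multinomial(T)):.2f} bits")
```

Because the product is the same for every order in which symbols are peeled off, the leading space term does not depend on the shape.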

Empirical Entropy

• Text T of n symbols drawn from alphabet Σ (n lg |Σ| bits)
• Entropy: measure to assess compression size based on text
• Higher order entropy Hh (of order h)
  – Considers context x of neighboring h symbols
  – Each Prob[y|x] term is thus conditioned on context x of h symbols
  – Note that Hh(T) ≤ lg |Σ|
  – Now the text takes nHh ≤ n lg |Σ| bits of space to encode
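
As a concrete illustration, the sketch below computes empirical entropy under one common definition: group each symbol by the h symbols that precede it, take the zero-order entropy of each group, and average by group size. Using the preceding symbols as the context is an assumption (the slides only say "neighboring"), and the function names are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

def h0(symbols):
    """Zero-order empirical entropy in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(symbols).values())

def hh(text, h):
    """Order-h empirical entropy, with the h preceding symbols as context."""
    if h == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(h, len(text)):
        followers[text[i - h:i]].append(text[i])   # symbol seen after context x
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)

T = "preparedpeppers"
sigma = len(set(T))
for order in range(3):
    print(f"H{order}(T) = {hh(T, order):.3f} bits  (lg |Σ| = {log2(sigma):.3f})")
```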

One Text Indexing Result
Because Frankly, There Are Lots

• Main Results (using CSA [GGV03])
  – Space usage: nHh + o(n log |Σ|) bits
  – Search time: O(m log |Σ| + polylog(n)) time
  – Can improve to o(m) time with constant factor more space
• When the text T is highly compressible (i.e. nHh = o(n)), we achieve the first index sublinear in both space and search time
• Second-order terms represent the space needed for
  – Fast indexing
  – Storing count statistics for the text
• Obtain nearly tight bounds on the encoding length of the Burrows-Wheeler Transform (BWT)

Tell Me More!
How Do You Do It?

[Figure: the suffix array SA0 over the text positions, its Rankred row, the half-size suffix array SA1, and the neighbor function Φ0.]

• For this example, suppose we know SA1.
• Neighbor function Φ0 tells the position in the suffix array of the next suffix (in text order).
• For even index (an entry of SA0 holding an even text position), use SA1. Example: SA0[5] = 2·SA1[Rankred(5)] = 8.
• For odd index, use neighbor function Φ0. Example: SA0[2] = SA0[Φ0(2)] - 1 = SA0[5] - 1 = 7.
• Perform these steps recursively for a compressed suffix array.
• It turns out that the neighbor function Φ is the primary bottleneck for space.
  – Encode increasing subsequences together to get zero-order entropy.
  – Subdivide subsequences and encode to get high-order entropy.
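
One level of this recursion can be sketched as follows. The code is a toy illustration under stated assumptions: it builds SA0 by sorting suffixes directly, requires an even text length so that every odd entry has a successor, and keeps Φ in a dictionary rather than in compressed form. The names suffix_array, build_level, and lookup are hypothetical.

```python
def suffix_array(text):
    """1-indexed suffix array by direct sorting (fine for tiny inputs)."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])

def build_level(sa0):
    """Derive what one level keeps: even-entry ranks, SA1, and Phi for odd entries."""
    position_of = {value: idx for idx, value in enumerate(sa0, start=1)}
    rank_even, seen = [], 0
    for value in sa0:
        seen += (value % 2 == 0)
        rank_even.append(seen)           # number of even entries in SA0[1..idx]
    sa1 = [value // 2 for value in sa0 if value % 2 == 0]
    phi = {idx: position_of[value + 1]   # where the next suffix in text order sits
           for idx, value in enumerate(sa0, start=1) if value % 2 == 1}
    return rank_even, sa1, phi

def lookup(i, rank_even, sa1, phi):
    """Recover SA0[i] (1-indexed) from rank_even, SA1, and Phi only."""
    if i in phi:                         # odd entry: hop to its neighbor, subtract 1
        j = phi[i]
        return 2 * sa1[rank_even[j - 1] - 1] - 1
    return 2 * sa1[rank_even[i - 1] - 1]  # even entry: read it from SA1

text = "mississippi$"                    # even length, '$' sorts first
sa0 = suffix_array(text)
rank_even, sa1, phi = build_level(sa0)
assert all(lookup(i, rank_even, sa1, phi) == sa0[i - 1]
           for i in range(1, len(sa0) + 1))
```

In the full scheme SA1 is itself replaced by the same three pieces one level down, so only the last level stores an explicit array.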

Burrows-Wheeler Transform (BWT) and the Neighbor Φ function

• The Φ function has a strong relationship to the Burrows-Wheeler Transform (BWT)
• The BWT has had a profound impact on a myriad of fields.
  – Simply put, it pre-processes an input text T by a reversible transform.
  – The result is easily compressible using simple methods.
• The BWT (and the Φ function) are at the heart of many text compression and indexing techniques, such as bzip2.
• We also call the Φ function the FL mapping from the BWT.
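
To make the transform and the FL mapping concrete, here is a small sketch that computes the BWT from sorted rotations and inverts it. It assumes the text ends with a unique sentinel `$` that sorts before every other symbol; the quadratic rotation sort and the function names are for illustration only.

```python
def bwt(text):
    """Last column of the sorted rotations of text (text ends with '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fl_mapping(last):
    """fl[f] = row in the last column L holding the same symbol occurrence as F[f]."""
    # A stable sort of L's positions by symbol lists them in first-column order.
    return sorted(range(len(last)), key=lambda i: last[i])

def inverse_bwt(last):
    """Rebuild the text by walking backwards with the LF mapping (inverse of FL)."""
    fl = fl_mapping(last)
    lf = [0] * len(last)
    for f_row, l_row in enumerate(fl):
        lf[l_row] = f_row
    out, row = [], 0                 # row 0 is the rotation that starts with '$'
    for _ in range(len(last) - 1):
        out.append(last[row])        # the symbol preceding the current suffix
        row = lf[row]
    return "".join(reversed(out)) + "$"

print(bwt("banana$"))                # annb$aa
assert inverse_bwt(bwt("preparedpeppers$")) == "preparedpeppers$"
```

Here `fl_mapping` plays the role of Φ: entry f gives the row whose rotation starts one text position after the rotation in row f.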

Burrows-Wheeler Transform (BWT)

A Shifty Little BWT

[Figure: BWT example with its per-context lists, including list i and list s.]

Where Oh Where Is My Wavelet Tree?

• For each list from the previous slide, we store a wavelet tree to achieve 0th order entropy
  – The collection of 0th order compressors gives high-order entropy based on the context (not shown in this talk).
• Technical point: number of alphabet symbols cannot be more than text length
  – We “rerank” symbols to match this requirement (negligible extra cost in space, O(1) time)
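
The reranking idea can be sketched as a mapping from the symbols that actually occur in a list to the dense range 0..k-1, so the effective alphabet is never larger than the list. This sketch uses a dictionary for clarity; it is an assumption-laden illustration, not the O(1)-time succinct mechanism the slide refers to, and the function name is hypothetical.

```python
def rerank(symbols):
    """Map the distinct symbols of a list to 0..k-1, preserving their order."""
    distinct = sorted(set(symbols))               # the k symbols that occur
    code = {s: r for r, s in enumerate(distinct)}
    return [code[s] for s in symbols], distinct

# A list of 4 symbols drawn from a large alphabet uses only 3 distinct ranks.
ranks, distinct = rerank([900, 7, 900, 42])
print(ranks, distinct)                            # [2, 0, 2, 1] [7, 42, 900]
original = [distinct[r] for r in ranks]           # decoding is a table lookup
```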

Any questions?