Wavelet Trees
Ankur Gupta, Butler University
Text Dictionary Problem
• The input is a text T drawn from an alphabet Σ. We want to support the following queries:
– char(i) – returns the symbol at position i
– rankc(i) – the number of c’s in T up to position i
– selectc(i) – the position of the ith occurrence of symbol c in T
• Text T can be compressed to nH0 space, answering queries in
– O(log |Σ|) time using the wavelet tree [GGV03]
– O(log log |Σ|) time using [GMR06], but space is more
• When |Σ| = polylog(n), queries can be answered in O(1) time [FMMN04]
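To pin down what the three queries mean before any compression is involved, here is a minimal reference sketch on a plain Python string. It assumes positions are 1-indexed and that rank counts up to and including position i, which is how the worked examples in this deck read; the function names simply mirror the slide.

```python
def char(T, i):
    """Symbol at 1-indexed position i."""
    return T[i - 1]

def rank(T, c, i):
    """Number of occurrences of c in T[1..i], inclusive."""
    return T[:i].count(c)

def select(T, c, i):
    """1-indexed position of the i-th occurrence of c in T."""
    pos = -1
    for _ in range(i):
        # str.index raises ValueError if c occurs fewer than i times
        pos = T.index(c, pos + 1)
    return pos + 1
```

These take time linear in the text length; the structures cited above answer the same queries on compressed text.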
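Wavelet tree for T = preparedpeppers (each node: its symbols, then its bitvector)
• Root: preparedpeppers 110101001011011 (a, d, e → 0; p, r, s → 1)
– 0-child: eaedee 101111 (a → 0; d, e → 1)
– 1-child: prprppprs 010100011 (p → 0; r, s → 1)
• Children of eaedee: a 1 and eedee 11011 (d → 0; e → 1)
• Children of eedee: d 1 and eeee 1111
• Children of prprppprs: ppppp 11111 and rrrs 0001 (r → 0; s → 1)
• Children of rrrs: rrr 111 and s 1
Compute rankr(10) (answer is 2)
• At the root, actually compute rank1(10) = 5
• At prprppprs, actually compute rank1(5) = 2
• At rrrs, actually compute rank0(2) = 2
• At rrr, actually compute rank1(2) = 2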
Compute selectr(2) on the same wavelet tree for preparedpeppers (answer is 6)
• At rrr, actually compute select1(2) = 2
• At rrrs, actually compute select0(2) = 2
• At prprppprs, actually compute select1(2) = 4
• At the root, actually compute select1(4) = 6
Compute char(7) on the same wavelet tree for preparedpeppers (answer is e)
• At the root, actually compute char(7) = 0 and rank0(7) = 3
• At eaedee, actually compute char(3) = 1 and rank1(3) = 2
• At eedee, actually compute char(2) = 1 and rank1(2) = 2
• At eeee, actually compute char(2) = 1 and rank1(2) = 2
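The three walkthroughs (rank going down the tree, select coming back up, char following the stored bits) can be sketched in a few lines. This is an illustrative sketch, not the speaker's implementation. It assumes the sorted alphabet is split in half at each node, which happens to reproduce the tree drawn for preparedpeppers. It uses linear scans where a real structure would use constant-time binary rank/select, and it assumes the queried symbol occurs in the text. The dictionary layout and helper names are made up for the example.

```python
def build(text, alphabet=None):
    """Build a wavelet tree node; single-symbol subtrees are stored as None."""
    if alphabet is None:
        alphabet = sorted(set(text))
    if len(alphabet) <= 1:
        return None  # the "all 1s" leaves are implicit
    mid = len(alphabet) // 2
    halves = (alphabet[:mid], alphabet[mid:])  # symbols mapped to bit 0 / bit 1
    bits = [0 if c in halves[0] else 1 for c in text]
    kids = tuple(build([c for c in text if c in half], half) for half in halves)
    return {"bits": bits, "syms": halves, "kids": kids}

def rank_bit(bits, b, i):
    """Binary rank: how many of the first i bits equal b (naive scan)."""
    return sum(1 for x in bits[:i] if x == b)

def select_bit(bits, b, k):
    """Binary select: 1-indexed position of the k-th bit equal to b (naive scan)."""
    seen = 0
    for pos, x in enumerate(bits, start=1):
        if x == b:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k matching bits")

def rank(node, c, i):
    """rank_c(i): walk down, translating i into the child's coordinates."""
    while node is not None:
        b = 0 if c in node["syms"][0] else 1
        i = rank_bit(node["bits"], b, i)
        node = node["kids"][b]
    return i

def select(node, c, k):
    """select_c(k): resolve the position in the child first, then map it up."""
    if node is None:
        return k
    b = 0 if c in node["syms"][0] else 1
    k = select(node["kids"][b], c, k)
    return select_bit(node["bits"], b, k)

def char(node, i):
    """char(i): read the bit at i, follow it, and stop at an implicit leaf."""
    while True:
        b = node["bits"][i - 1]
        i = rank_bit(node["bits"], b, i)
        if node["kids"][b] is None:
            return node["syms"][b][0]
        node = node["kids"][b]

tree = build("preparedpeppers")
print(rank(tree, "r", 10), select(tree, "r", 2), char(tree, 7))
```

Each query touches one node per level, so swapping the scans for constant-time binary rank/select gives the O(log |Σ|) bound quoted earlier.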
Some comments
• Don’t have to store any of the “all 1s” nodes
– That’s just to help for the example.
• What does the wavelet tree imply?
– Converts representation of a finite string on an alphabet to representation of many bitvectors.
– Useful to achieve, ultimately, high-order compression.
– Easy to implement – very simple structure and query pattern
Shapin’ Up To Something Special
• What about the shape of a wavelet tree?
– Does it affect space? No. (You will see why in a bit.)
– Time? Yes.
• Good news!
• Reorganize it to optimize query time...
– Use a Huffman orientation based on query access.
– If you choose symbol frequency, you now can search in O(H0) time on average instead of O(log |Σ|).
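To see why a Huffman shape helps, the sketch below computes the depth each symbol would get in a Huffman-shaped tree weighted by symbol frequency, then compares the frequency-weighted average depth with H0. It only computes depths and does not build the wavelet tree itself; the helper names are assumptions for the example.

```python
import heapq
import math
from collections import Counter

def huffman_depths(text):
    """Depth of each symbol in a Huffman tree weighted by symbol frequency."""
    freq = Counter(text)
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far})
    heap = [(f, i, {c: 0}) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {c: d + 1 for c, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def average_depth(text):
    """Average number of levels visited when symbols are queried by frequency."""
    freq = Counter(text)
    depths = huffman_depths(text)
    return sum(freq[c] * depths[c] for c in freq) / len(text)

def H0(text):
    """Zero-order empirical entropy in bits per symbol."""
    n = len(text)
    return sum(f / n * math.log2(n / f) for f in Counter(text).values())

T = "preparedpeppers"
print(H0(T), average_depth(T), math.log2(len(set(T))))
```

The standard Huffman bound puts the average depth between H0 and H0 + 1, which is where the average-case improvement over a balanced shape comes from.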
Wavelet Tree Space/Time
• Simple bitvectors
– n bits per level and log |Σ| levels
• n log |Σ| overall bits
• O(n log log n / log n) extra bits for rank/select [J89]
– Same space as original text but can now support rankc/selectc/char in O(log |Σ|) time. (RAM)
• Fancy
– [RRR02] Gets O(nH0) + O(n log log n / log n) bits of space with O(log |Σ|) query time
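The idea behind the extra rank bits can be illustrated with a one-level directory: store the number of 1s before each block, then scan at most one block per query. This is a deliberate simplification and not the actual [J89] or [RRR02] structure, which use multiple levels and lookup tables to reach the stated bounds. The class name and block size are assumptions.

```python
class RankBitvector:
    """Plain bitvector plus a table of 1-counts at block boundaries."""

    def __init__(self, bits, block=8):
        self.bits = list(bits)
        self.block = block
        self.counts = [0]  # counts[q] = number of 1s in the first q blocks
        for start in range(0, len(self.bits), block):
            self.counts.append(self.counts[-1] + sum(self.bits[start:start + block]))

    def rank1(self, i):
        """Number of 1s among the first i bits."""
        q, r = divmod(i, self.block)
        start = q * self.block
        return self.counts[q] + sum(self.bits[start:start + r])

    def rank0(self, i):
        """Number of 0s among the first i bits."""
        return i - self.rank1(i)

root = RankBitvector(int(b) for b in "110101001011011")
print(root.rank1(10), root.rank0(7))
```

A query costs one table lookup plus a scan of fewer than `block` bits, which is the basic trade between extra space and query time.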
Even Skewed Is a Shape
• Consider a totally skewed wavelet “tree”
– i.e. set symbol a to 0 and all others to 1
– The tree will look like a line, and will take this much space [RRR02]. . .
– This telescopes into the multinomial coefficient, regardless of the dependency list
• Simple exercise to check this fact
• Thus, shape doesn’t affect the space.
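The slide's own formula did not survive extraction, so the following is only a sketch of the "simple exercise" under stated assumptions: each level of the skewed tree is stored in about lg C(m, k) bits for a bitvector of length m with k zeros (ignoring ceilings and lower-order terms), n_i is the number of occurrences of the i-th symbol, σ = |Σ|, and m_i = n − (n_1 + … + n_{i−1}) is the length of the bitvector at level i.

```latex
% Assumptions: m_1 = n, m_{i+1} = m_i - n_i, and m_\sigma = n_\sigma.
\begin{align*}
\sum_{i=1}^{\sigma-1} \lg \binom{m_i}{n_i}
  &= \lg \prod_{i=1}^{\sigma-1} \frac{m_i!}{n_i!\, m_{i+1}!} \\
  &= \lg \frac{m_1!}{n_1!\, n_2! \cdots n_{\sigma-1}!\; m_\sigma!} \\
  &= \lg \frac{n!}{n_1!\, n_2! \cdots n_\sigma!}
\end{align*}
```

The factors m_{i+1}! cancel between consecutive levels, leaving the multinomial coefficient, whose logarithm is at most nH0.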
Empirical Entropy
• Text T of n symbols drawn from alphabet Σ (n lg |Σ| bits)
• Entropy: measure to assess compression size based on text
• Higher order entropy Hh (of order h)
– Considers context x of neighboring h symbols
– Each Prob[y|x] term is thus conditioned on context x of h symbols
– Note that Hh(T) ≤ lg |Σ|
– Now the text takes nHh ≤ n lg |Σ| bits of space to encode
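A small sketch of how these quantities can be computed for a concrete text. It assumes the context of a symbol is the h symbols immediately preceding it (the slide only says "neighboring") and uses the common definition in which Hh is the length-weighted sum of the zero-order entropies of the symbols following each context, divided by n. The function names are illustrative.

```python
import math
from collections import Counter, defaultdict

def H0(seq):
    """Zero-order empirical entropy in bits per symbol."""
    n = len(seq)
    return sum(f / n * math.log2(n / f) for f in Counter(seq).values())

def Hh(text, h):
    """Order-h empirical entropy, using the h preceding symbols as context."""
    if h == 0:
        return H0(text)
    followers = defaultdict(list)  # context x -> symbols y that follow x
    for i in range(h, len(text)):
        followers[text[i - h:i]].append(text[i])
    return sum(len(ys) * H0(ys) for ys in followers.values()) / len(text)

T = "preparedpeppers"
print(Hh(T, 2), Hh(T, 1), H0(T), math.log2(len(set(T))))
```

On any text the printed values should be non-decreasing from left to right, matching Hh(T) ≤ lg |Σ|.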
One Text Indexing Result, Because Frankly, There Are Lots
• Main Results (using CSA [GGV03])
– Space usage: nHh + o(n log |Σ|) bits
– Search time: O(m log |Σ| + polylog(n)) time
– Can improve to o(m) time with constant factor more space
• When the text T is highly compressible (i.e. nHh = o(n)), we achieve the first index sublinear in both space and search time
• Second-order terms represent the space needed for
– Fast indexing
– Storing count statistics for the text
• Obtain nearly tight bounds on the encoding length of the Burrows-Wheeler Transform (BWT)
Tell Me More! How Do You Do It?
[Figure: suffix array SA0 over the text positions, with rows for Φ0, Rankred, and SA1]
• Rankred = 1 1 1 1 2 3 4 4
• SA1 = 2 4 1 3
• For this example, suppose we know SA1.
• For even index, use SA1. Example: SA0[5] = 2·SA1[Rankred(5)] = 8.
• For odd index, use neighbor function Φ0. Example: SA0[2] = SA0[Φ0(2)] – 1 = SA0[5] – 1 = 7.
• Perform these steps recursively for a compressed suffix array.
• Neighbor function Φ0 tells the position in the suffix array of the next suffix (in text order).
• Encode increasing subsequences together to get zero-order entropy.
• Subdivide subsequences and encode to get high-order entropy.
• It turns out that the neighbor function Φ is the primary bottleneck for space.
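One level of this recursion can be sketched as follows. The example text `abracadabra$` is an assumption chosen for illustration and is not the text behind the slide's figure. Array positions in the code are 0-indexed while suffix start positions are 1-indexed, the suffix array is built naively, and the text length is assumed even so that every odd position has an even successor.

```python
def suffix_array(text):
    """Naive suffix array holding 1-indexed start positions."""
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]

def one_level(sa0):
    """Derive the even markers, even-rank table, SA1, and neighbor function."""
    assert len(sa0) % 2 == 0, "sketch assumes an even text length"
    where = {v: i for i, v in enumerate(sa0)}  # array position of each suffix
    even = [v % 2 == 0 for v in sa0]
    rank_even, seen = [], 0
    for is_even in even:
        seen += is_even
        rank_even.append(seen)  # evens among sa0[0..i], inclusive
    sa1 = [v // 2 for v in sa0 if v % 2 == 0]
    # Neighbor function: an odd entry points at the entry for the next suffix.
    phi0 = [where[v + 1] if v % 2 else i for i, v in enumerate(sa0)]
    return even, rank_even, sa1, phi0

def lookup(i, even, rank_even, sa1, phi0):
    """Recover SA0[i] without storing SA0."""
    if even[i]:
        return 2 * sa1[rank_even[i] - 1]
    return lookup(phi0[i], even, rank_even, sa1, phi0) - 1

sa0 = suffix_array("abracadabra$")
parts = one_level(sa0)
print([lookup(i, *parts) for i in range(len(sa0))] == sa0)
```

In the full structure SA1 is itself compressed the same way, and most of the space goes to encoding Φ, as noted above.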
Burrows-Wheeler Transform (BWT) and the Neighbor Φ function
• The Φ function has a strong relationship to the Burrows-Wheeler Transform (BWT)
• The BWT has had a profound impact on a myriad of fields.
– Simply put, it pre-processes an input text T by a reversible transform.
– The result is easily compressible using simple methods.
• The BWT (and the Φ function) are at the heart of many text compression and indexing techniques, such as bzip2.
• We also call the Φ function the FL mapping from the BWT.
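A minimal sketch of the transform and its inverse, to make the reversibility and the FL mapping concrete. It assumes a sentinel `$` that does not occur in the text and sorts before every text symbol, and it builds all rotations explicitly, which is fine for an example but far from how practical tools compute the BWT.

```python
def bwt(text, sentinel="$"):
    """Last column of the sorted rotations of text + sentinel."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last, sentinel="$"):
    """Rebuild the text by following the FL mapping from the row of the text."""
    # A stable sort of positions by symbol pairs the k-th c in the first
    # column F with the k-th c in the last column L: fl[i] is where F[i] sits in L.
    fl = sorted(range(len(last)), key=lambda i: last[i])
    first = sorted(last)
    row = last.index(sentinel)  # the rotation equal to text + sentinel
    out = []
    for _ in range(len(last) - 1):
        out.append(first[row])
        row = fl[row]  # jump to the rotation starting one symbol later
    return "".join(out)

L = bwt("banana")
print(L, inverse_bwt(L))
```

Each step of the inverse moves to the row of the next suffix in text order, which is the role the deck gives to Φ.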
Burrows-Wheeler Transform (BWT)
A Shifty Little BWT
[Figure: BWT example with its lists, e.g. list i and list s]
Where Oh Where Is My Wavelet Tree?
• For each list from the previous slide, we store a wavelet tree to achieve 0th order entropy
– The collection of 0th order compressors gives high-order entropy based on the context (not shown in this talk).
• Technical point: number of alphabet symbols cannot be more than text length
– We “rerank” symbols to match this requirement (negligible extra cost in space, O(1) time)
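The deck does not say how the reranking is done. One plausible reading, sketched here as an assumption, is to replace each symbol in a list by its rank among the symbols that actually occur in that list, so the effective alphabet is never larger than the list, and to keep a small table for mapping back.

```python
def rerank(symbols):
    """Map symbols to ranks 1..k, where k is the number of distinct symbols used."""
    used = sorted(set(symbols))                       # table for decoding
    code = {c: r for r, c in enumerate(used, start=1)}
    return [code[c] for c in symbols], used

def unrank(ranks, used):
    """Invert rerank with one table lookup per symbol."""
    return [used[r - 1] for r in ranks]

ranks, used = rerank("peppers")
print(ranks, used)
```

Both directions are single lookups per symbol, consistent with the O(1) time mentioned above.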
Any questions?