Compressing and Indexing Strings and (labeled) Trees

  • View
    35

  • Download
    14

Embed Size (px)

DESCRIPTION

Compressing and Indexing Strings and (labeled) Trees. Paolo Ferragina Dipartimento di Informatica, Universit di Pisa. book. chapter. chapter. section. section. section. section. Two types of data. String = raw sequence of symbols from an alphabet Texts DNA sequences - PowerPoint PPT Presentation

Transcript

  • Compressing and Indexing Strings and (labeled) TreesPaolo FerraginaDipartimento di Informatica, Universit di Pisa

    Paolo Ferragina, Universit di Pisa

  • Two types of data String = raw sequence of symbols from an alphabet Texts DNA sequences Executables Audio files ... Labeled tree = tree of arbitrary shape and depth whose nodes are labeled with strings drawn from an alphabet XML files Parse trees Tries and Suffix Trees Compiler intermediate representations Execution traces ...

    Paolo Ferragina, Universit di Pisa

  • What do we mean by Indexing ? Word-based indexes, here a notion of word must be devised ! Inverted files, Signature files, Bitmaps. Full-text indexes, no constraint on text and queries ! Suffix Array, Suffix tree, ... Path indexes that also support navigational operations ! see next...Subset of XPath [W3C]

    Paolo Ferragina, Universit di Pisa

  • What do we mean by Compression ?Data compression has two positive effects:Space saving (or, enlarge memory at the same cost)Performance improvementBetter use of memory levels closer to CPUIncreased network, disk and memory bandwidthReduced (mechanical) seek time

    Paolo Ferragina, Universit di Pisa

  • Paolo Ferragina, Universit di Pisa

  • Paolo Ferragina, Universit di Pisa

  • Paolo Ferragina, Universit di Pisa

  • Study the interplay of Compression and Indexing Do we witness a paradoxical situation ?An index injects redundant data, in order to speed up the pattern searchesCompression removes redundancy, in order to squeeze the space occupancyNO, new results proved a mutual reinforcement behaviour !Better indexes can be designed by exploiting compression techniques Better compressors can be designed by exploiting indexing techniques

    More surprisingly, strings and labeled trees are closer than expected !Labeled-tree compression can be reduced to string compressionLabeled-tree indexing can be reduced to special string indexing problems

    Paolo Ferragina, Universit di Pisa

  • Our journey over string dataSuffix Array87 and 90Index design (Weiner 73)Compressor design (Shannon 48)Burrows-Wheeler Transform(1994)Improved indexes and compressors for strings[Ferragina-Manzini-Makinen-Navarro, 04]And many other papers of many other authors...Wavelet Tree[Grossi-Gupta-Vitter, Soda 03]

    Paolo Ferragina, Universit di Pisa

  • The Suffix Array [BaezaYates-Gonnet, 87 and Manber-Myers, 90]P=siT = mississippi#Suffix permutation cannot be any of {1,...,N}# binary texts = 2N N! = # permutations on {1, 2, ..., N} (N) bits is the worst-case lower bound (N H(T)) bits for compressible texts Several papers on characterizing the SAs permutation[Duval et al, 02; Bannai et al, 03; Munro et al, 05; Stoye et al, 05]

    Paolo Ferragina, Universit di Pisa

  • Can we compress the Suffix Array ? The FM-index is a data structure that mixes the best of:Suffix array data structureBurrows-Wheeler TransformThe corollary is that:The Suffix Array is compressible It is a self-indexThe theoretical result:Query complexity: O(p + occ loge N) timeSpace occupancy: O( N Hk(T)) + o(N) bits o(N) if T compressibleIndex does not depend on k Bound holds for all k, simultaneously[Ferragina-Manzini, JACM 05]New concept: The FM-index is an opportunistic data structure that takes advantage of repetitiveness in the input data to achieve compressed space occupancy, and still efficient query performance.[Ferragina-Manzini, Focs 00]

    Paolo Ferragina, Universit di Pisa

  • The Burrows-Wheeler Transform (1994)Let us given a text T = mississippi#mississippi#ississippi#mssissippi#mi sissippi#missippi#missisippi#mississppi#mississipi#mississipi#mississipp#mississippissippi#missiissippi#missFL

    Paolo Ferragina, Universit di Pisa

  • Why L is so interesting for compression ?Bzip vs. Gzip: 20% vs. 33% compression ratio ! [Some theory behind: Manzini, JACM 01]i #mississip p# mississipp ii ppi#missis s FLA key observation:L is locally homogeneousunknownBuilding the BWT SA constructionInverting the BWT array visit...overall (N) time, but slower than gzip...

    Paolo Ferragina, Universit di Pisa

  • L is helpful for full-text searching ?#mississippi#mississipippi#missisissippi#misississippi#mississippipi#mississippi#mississsippi#missisissippi#missippi#missssissippi#mmississippi

    Paolo Ferragina, Universit di Pisa

  • A useful tool: L F mapping# mississipp ii #mississip pi ppi#missis s FLunknownTo implement the LF-mapping we need an oracleocc( c , j ) = Rank of char c in L[1,j]

    Paolo Ferragina, Universit di Pisa

  • Substring search in T (Count the pattern occurrences)Find the first c in L[fr, lr]Find the last c in L[fr, lr]L-to-F mapping of these charsssunknownOcc() oracle is enough(ie. Rank/Select primitives over L)

    Paolo Ferragina, Universit di Pisa

  • Many details are missing... What about a large Wavelet Tree and variations [Grossi et al, Soda 03; F.M.-Makinen-Navarro, Spire 04]New approaches to Rank/Select primitives [Munro et al. Soda 06]Efficient and succinct index construction [Hon et al., Focs 03]In practice, Lightweight Algorithms: (5+)N bytes of space [see Manzini-Ferragina, Algorithmica 04]

    Paolo Ferragina, Universit di Pisa

  • Five years of history...FM-index (Ferragina-Manzini, Focs 00)Compact Suffix Array (Grossi-Vitter, Stoc 00)Space: 5 N Hk(T) + o(N) bits, for any kSearch: O( p + occ loge N ) Space: (N) bits [+ text]Search: O(p + polylog(N) + occ loge N ) [o(p) time with Patricia Tree, O(occ) for short P ] q-gram index [Krkkinen-Ukkonen, 96]Succinct Suffix Tree: N log N + (N) bits [Munro et al., 97ss]LZ-index: (N) bits and fast occ retrieval [Navarro, 03]Variations over CSA and FM-index [Navarro, Makinen]Wavelet TreeWT variantLook at the survey by Gonzalo Navarro and Veli Makinen

    Paolo Ferragina, Universit di Pisa

  • Whats next ?

    Paolo Ferragina, Universit di Pisa

  • What about their practicality ? [December 2003][January 2005]

    Paolo Ferragina, Universit di Pisa

  • Paolo Ferragina, Universit di Pisa

  • Is this a technological breakthrough ?

    Paolo Ferragina, Universit di Pisa

  • Paolo Ferragina, Universit di Pisa

  • Where we are...Data typeIndexingCompressedIndexingLabeled Trees ?

    Paolo Ferragina, Universit di Pisa

  • Why we care about labeled trees ?

    Paolo Ferragina, Universit di Pisa

  • An XML excerpt

    Donald E. Knuth The TeXbook Addison-Wesley 1986 Donald E. Knuth Ronald W. Moore An Analysis of Alpha-Beta Pruning 293-326 1975 6 Artificial Intelligence ...

    Paolo Ferragina, Universit di Pisa

  • A tree interpretation...XML document exploration Tree navigationXML document search Labeled subpath searchesSubset of XPath [W3C]

    Paolo Ferragina, Universit di Pisa

  • Our problemConsider a rooted, ordered, static tree T of arbitrary shape, whose t nodes are labeled with symbols from an alphabet S.

    We wish to devise a succinct representation for T that efficiently supports some operations over Ts structure:Navigational operations: parent(u), child(u, i), child(u, i, c)Subpath searches over a sequence of k labelsSeminal work by Jacobson [Focs 90] dealt with binary unlabeled trees, achieving O(1) time per navigational operation and 2t + o(t) bits.Munro-Raman [Focs 97], then many others, extended to unlabeled trees of arbitrary degree and a richer set of navigational ops: subtree size, ancestor,...Yet, subpath searches are unexplored Geary et al [Soda 04] were the first to deal with labeled trees and navigational operations, but the space is Q(t |S|) bits.

    Paolo Ferragina, Universit di Pisa

  • Our journey over labeled trees [Ferragina et al, Focs 05]We propose the XBW-transform that mimics on trees the nice structural properties of the BW-trasform on strings.The XBW-transform linearizes the tree T in such a way that: the indexing of T reduces to implement simple rank/select operations over a string of symbols from S.the compression of T reduces to use any k-th order entropy compressor (gzip, bzip,...) over a string of symbols from S.

    Paolo Ferragina, Universit di Pisa

  • The XBW-TransformCBDcacAb aDcBDbaSae CB CD B CD B CB CCA CA CA CD A CCB CD B CB C SpStep 1.Visit the tree in pre-order. For each node, write down its label and the labels on its upward path

    Paolo Ferragina, Universit di Pisa

  • The XBW-TransformCbaDDc DaBABccabSae A CA CA CB CB CB CB C CCCD A CD B CD B CD B CSpStep 2.Stably sort according to Sp

    Paolo Ferragina, Universit di Pisa

  • The XBW-TransformXBW takes optimal t log |S| + 2t bits1001010 10011011CbaDDc DaBABccabSae A CA CA CB CB CB CB C CCCD A CD B CD B CD B CSpStep 3.Add a binary array Slast marking therows corresponding to last childrenSlastXBW can be built and inverted in optimal O(t) timeKey factsNodes correspond to items in Node numbering has usefulproperties for compression and indexing

    Paolo Ferragina, Universit di Pisa

  • The XBW-Transform is highly compressible1001010 10011011CbaDDc DaBABccabSae A CA CA CB CB CB CB C CCCD A CD B CD B CD B CSpSlastXBW is highly compressible: Sa is locally homogeneous (like BWT for strings) Slast has some structure (because of Ts structure)

    Paolo Ferragina, Universit di Pisa

  • XML Compression: XBW + PPMdi !String compressors are not so bad !?!

    Paolo Ferragina, Universit di Pisa

  • Structural properties of XBW1001010 10011011CbaDDc DaBABccabSae A CA CA CB CB CB CB C CCCD A CD B CD B CD B CSpSlastProperties: Relative order among nodes having same leading path reflects th