Compressing and Indexing Strings and (labeled) Trees

  • Published on
    10-Jan-2016

  • View
    31

  • Download
    14

DESCRIPTION

Compressing and Indexing Strings and (labeled) Trees. Paolo Ferragina Dipartimento di Informatica, Universit di Pisa. book. chapter. chapter. section. section. section. section. Two types of data. String = raw sequence of symbols from an alphabet Texts DNA sequences - PowerPoint PPT Presentation

Transcript

Compressing and Indexing Strings and (labeled) TreesPaolo FerraginaDipartimento di Informatica, Universit di PisaPaolo Ferragina, Universit di PisaTwo types of data String = raw sequence of symbols from an alphabet Texts DNA sequences Executables Audio files ... Labeled tree = tree of arbitrary shape and depth whose nodes are labeled with strings drawn from an alphabet XML files Parse trees Tries and Suffix Trees Compiler intermediate representations Execution traces ...Paolo Ferragina, Universit di PisaWhat do we mean by Indexing ? Word-based indexes, here a notion of word must be devised ! Inverted files, Signature files, Bitmaps. Full-text indexes, no constraint on text and queries ! Suffix Array, Suffix tree, ... Path indexes that also support navigational operations ! see next...Subset of XPath [W3C]Paolo Ferragina, Universit di PisaWhat do we mean by Compression ?Data compression has two positive effects:Space saving (or, enlarge memory at the same cost)Performance improvementBetter use of memory levels closer to CPUIncreased network, disk and memory bandwidthReduced (mechanical) seek timePaolo Ferragina, Universit di PisaPaolo Ferragina, Universit di PisaPaolo Ferragina, Universit di PisaPaolo Ferragina, Universit di PisaStudy the interplay of Compression and Indexing Do we witness a paradoxical situation ?An index injects redundant data, in order to speed up the pattern searchesCompression removes redundancy, in order to squeeze the space occupancyNO, new results proved a mutual reinforcement behaviour !Better indexes can be designed by exploiting compression techniques Better compressors can be designed by exploiting indexing techniquesMore surprisingly, strings and labeled trees are closer than expected !Labeled-tree compression can be reduced to string compressionLabeled-tree indexing can be reduced to special string indexing problemsPaolo Ferragina, Universit di PisaOur journey over string dataSuffix Array87 and 90Index design (Weiner 73)Compressor design (Shannon 48)Burrows-Wheeler Transform(1994)Improved indexes and compressors for strings[Ferragina-Manzini-Makinen-Navarro, 04]And many other papers of many other authors...Wavelet Tree[Grossi-Gupta-Vitter, Soda 03]Paolo Ferragina, Universit di PisaThe Suffix Array [BaezaYates-Gonnet, 87 and Manber-Myers, 90]P=siT = mississippi#Suffix permutation cannot be any of {1,...,N}# binary texts = 2N N! = # permutations on {1, 2, ..., N} (N) bits is the worst-case lower bound (N H(T)) bits for compressible texts Several papers on characterizing the SAs permutation[Duval et al, 02; Bannai et al, 03; Munro et al, 05; Stoye et al, 05] Paolo Ferragina, Universit di PisaCan we compress the Suffix Array ? The FM-index is a data structure that mixes the best of:Suffix array data structureBurrows-Wheeler TransformThe corollary is that:The Suffix Array is compressible It is a self-indexThe theoretical result:Query complexity: O(p + occ loge N) timeSpace occupancy: O( N Hk(T)) + o(N) bits o(N) if T compressibleIndex does not depend on k Bound holds for all k, simultaneously[Ferragina-Manzini, JACM 05]New concept: The FM-index is an opportunistic data structure that takes advantage of repetitiveness in the input data to achieve compressed space occupancy, and still efficient query performance.[Ferragina-Manzini, Focs 00]Paolo Ferragina, Universit di PisaThe Burrows-Wheeler Transform (1994)Let us given a text T = mississippi#mississippi#ississippi#mssissippi#mi sissippi#missippi#missisippi#mississppi#mississipi#mississipi#mississipp#mississippissippi#missiissippi#missFLPaolo Ferragina, Universit di PisaWhy L is so interesting for compression ?Bzip vs. Gzip: 20% vs. 33% compression ratio ! [Some theory behind: Manzini, JACM 01]i #mississip p# mississipp ii ppi#missis s FLA key observation:L is locally homogeneousunknownBuilding the BWT SA constructionInverting the BWT array visit...overall (N) time, but slower than gzip...Paolo Ferragina, Universit di PisaL is helpful for full-text searching ?#mississippi#mississipippi#missisissippi#misississippi#mississippipi#mississippi#mississsippi#missisissippi#missippi#missssissippi#mmississippiPaolo Ferragina, Universit di PisaA useful tool: L F mapping# mississipp ii #mississip pi ppi#missis s FLunknownTo implement the LF-mapping we need an oracleocc( c , j ) = Rank of char c in L[1,j]Paolo Ferragina, Universit di PisaSubstring search in T (Count the pattern occurrences)Find the first c in L[fr, lr]Find the last c in L[fr, lr]L-to-F mapping of these charsssunknownOcc() oracle is enough(ie. Rank/Select primitives over L)Paolo Ferragina, Universit di PisaMany details are missing... What about a large Wavelet Tree and variations [Grossi et al, Soda 03; F.M.-Makinen-Navarro, Spire 04]New approaches to Rank/Select primitives [Munro et al. Soda 06]Efficient and succinct index construction [Hon et al., Focs 03]In practice, Lightweight Algorithms: (5+)N bytes of space [see Manzini-Ferragina, Algorithmica 04]Paolo Ferragina, Universit di PisaFive years of history...FM-index (Ferragina-Manzini, Focs 00)Compact Suffix Array (Grossi-Vitter, Stoc 00)Space: 5 N Hk(T) + o(N) bits, for any kSearch: O( p + occ loge N ) Space: (N) bits [+ text]Search: O(p + polylog(N) + occ loge N ) [o(p) time with Patricia Tree, O(occ) for short P ] q-gram index [Krkkinen-Ukkonen, 96]Succinct Suffix Tree: N log N + (N) bits [Munro et al., 97ss]LZ-index: (N) bits and fast occ retrieval [Navarro, 03]Variations over CSA and FM-index [Navarro, Makinen]Wavelet TreeWT variantLook at the survey by Gonzalo Navarro and Veli MakinenPaolo Ferragina, Universit di PisaWhats next ? Paolo Ferragina, Universit di PisaWhat about their practicality ? [December 2003][January 2005]Paolo Ferragina, Universit di PisaPaolo Ferragina, Universit di PisaIs this a technological breakthrough ? Paolo Ferragina, Universit di PisaPaolo Ferragina, Universit di PisaWhere we are...Data typeIndexingCompressedIndexingLabeled Trees ?Paolo Ferragina, Universit di PisaWhy we care about labeled trees ? Paolo Ferragina, Universit di PisaAn XML excerpt Donald E. Knuth The TeXbook Addison-Wesley 1986 Donald E. Knuth Ronald W. Moore An Analysis of Alpha-Beta Pruning 293-326 1975 6 Artificial Intelligence ...Paolo Ferragina, Universit di PisaA tree interpretation...XML document exploration Tree navigationXML document search Labeled subpath searchesSubset of XPath [W3C]Paolo Ferragina, Universit di PisaOur problemConsider a rooted, ordered, static tree T of arbitrary shape, whose t nodes are labeled with symbols from an alphabet S. We wish to devise a succinct representation for T that efficiently supports some operations over Ts structure:Navigational operations: parent(u), child(u, i), child(u, i, c)Subpath searches over a sequence of k labelsSeminal work by Jacobson [Focs 90] dealt with binary unlabeled trees, achieving O(1) time per navigational operation and 2t + o(t) bits.Munro-Raman [Focs 97], then many others, extended to unlabeled trees of arbitrary degree and a richer set of navigational ops: subtree size, ancestor,...Yet, subpath searches are unexplored Geary et al [Soda 04] were the first to deal with labeled trees and navigational operations, but the space is Q(t |S|) bits.Paolo Ferragina, Universit di PisaOur journey over labeled trees [Ferragina et al, Focs 05]We propose the XBW-transform that mimics on trees the nice structural properties of the BW-trasform on strings.The XBW-transform linearizes the tree T in such a way that: the indexing of T reduces to implement simple rank/select operations over a string of symbols from S.the compression of T reduces to use any k-th order entropy compressor (gzip, bzip,...) over a string of symbols from S.Paolo Ferragina, Universit di PisaThe XBW-TransformCBDcacAb aDcBDbaSae CB CD B CD B CB CCA CA CA CD A CCB CD B CB C SpStep 1.Visit the tree in pre-order. For each node, write down its label and the labels on its upward pathPaolo Ferragina, Universit di PisaThe XBW-TransformCbaDDc DaBABccabSae A CA CA CB CB CB CB C CCCD A CD B CD B CD B CSpStep 2.Stably sort according to SpPaolo Ferragina, Universit di PisaThe XBW-TransformXBW takes optimal t log |S| + 2t bits1001010 10011011CbaDDc DaBABccabSae A CA CA CB CB CB CB C CCCD A CD B CD B CD B CSpStep 3.Add a binary array Slast marking therows corresponding to last childrenSlastXBW can be built and inverted in optimal O(t) timeKey factsNodes correspond to items in Node numbering has usefulproperties for compression and indexingPaolo Ferragina, Universit di PisaThe XBW-Transform is highly compressible1001010 10011011CbaDDc DaBABccabSae A CA CA CB CB CB CB C CCCD A CD B CD B CD B CSpSlastXBW is highly compressible: Sa is locally homogeneous (like BWT for strings) Slast has some structure (because of Ts structure)Paolo Ferragina, Universit di PisaXML Compression: XBW + PPMdi !String compressors are not so bad !?!Paolo Ferragina, Universit di PisaStructural properties of XBW1001010 10011011CbaDDc DaBABccabSae A CA CA CB CB CB CB C CCCD A CD B CD B CD B CSpSlastProperties: Relative order among nodes having same leading path reflects the pre-order visit of T Children are contiguous in XBW (delimited by 1s) Children reflect the order of their parentsPaolo Ferragina, Universit di PisaThe XBW is searchable1001010 10011011CbaDDc DaBABccabSae A CA CA CB CB CB CB C CCCD A CD B CD B CD B CSpSlast0100100 01001000SSABCDXBW indexing [reduction to string indexing]: Store succinct and efficient Rank and Select data structures over these three arraysPaolo Ferragina, Universit di PisaSubpath search in XBW1001010 10011011CbaDDc DaBABccabSae A CA CA CB CB CB CB C CCCD A CD B CD B CD B CSpSlast0100100 01001000SSInductive step: Pick the next char in P[i+1], i.e. D Search for the first and last D in Sa[fr,lr] Jump to their childrenP = B DTheir childrenhave upwardpath = D BPaolo Ferragina, Universit di PisaSubpath search in XBW1001010 10011011CbaDDc DaBABccabSae A CA CA CB CB CB CB C CCCD A CD B CD B CD B CSpSlast0100100 01001000SSInductive step: Pick the next char in P[i+1], i.e. D Search for the first and last D in Sa[fr,lr] Jump to their childrenLook at Slast to find the 2 and 3 groupof childrenTwo occurrencesbecause of two 1sP = B DTheir childrenhave upwardpath = D BPaolo Ferragina, Universit di PisaXML Compressed IndexingWhat about XPress and XGrind ?XPress 30% (dblp 50%), XGrind 50% no software runningPaolo Ferragina, Universit di PisaIn summary [Ferragina et al, Focs 05]The XBW-transform takes optimal space: 2t + t log |S|, and can be computed in optimal linear time.We can compress and index the XBW-transform so that: its space occupancy is the optimal t H0(T) + 2t + o(t) bitsnavigational operations take O(log |S|) timesubpath searches take O(p log |S|) timeIf |S|=polylog(t), no log|S|-factor(loglog |S| for general S [Munro et al, Soda 06])New bread forRank/Select people !!It is possible to extend these ideas to other XPath queries, like: //path[text()=substring] //path1//path2...Paolo Ferragina, Universit di PisaThe overall picture on Compressed Indexing...Data typeIndexingCompressedIndexingStrong connection[Kosaraju, Focs 89]Paolo Ferragina, Universit di PisaWe investigated the reinforcement relation:Compression ideas Index designLets now turn to the other directionIndexing ideas Compressor designMutual reinforcement relationship...BoosterPaolo Ferragina, Universit di PisaCompression Boosting for strings [Ferragina et al., J.ACM 2005]Qualitatively, the booster offers various properties: The more compressible is s, the shorter is c wrt c It deploys compressor A as a black-box, hence no change to As structure is needed No loss in time efficiency, actually it is optimal Its performance holds for any string s, it results better than Gzip and Bzip It is fully combinatorial, hence it does not require any parameter estimationsPaolo Ferragina, Universit di PisaAn interesting compression paradigmGiven a string S, compute a permutation P(S)Partition P(S) into substringsCompress each substring, and concatenate the resultsProblem 1. Fix a permutation P. Find a partitioning strategy and acompressor that minimize the number of compressed bits. If P=Id, this is classic data compression !PPC paradigm (Permutation, Partition, Compression)Problem 2. Fix a compressor C. Find a permutation P and partitioning strategy that minimize the number of compressed bits. Taking P=Id, PPC cannot be worse than compressor C alone. Our booster showed that a good P can make PPC far better. Other contexts: Tables [AT&T people], Graphs [Bondi-Vigna, WWW 04]Theory is missing, here!Paolo Ferragina, Universit di PisaCompression of labeled trees [Ferragina et al., Focs 05]Extend the definition of Hk to labeled trees by taking as k-context of a node its leading path of k-length(related to Markov random fields over trees)XBW(T)A new paradigm for compressing the tree TPaolo Ferragina, Universit di PisaThanks !!Paolo Ferragina, Universit di PisaWe investigated the reinforcement relation:Compression ideas Index designLets now turn to the other directionIndexing ideas Compressor designWhere we are ...BoosterPaolo Ferragina, Universit di PisaWhat do we mean by boosting ?A memoryless compressor is poor in that it assigns codewords to symbols according only to their frequencies (e.g. Huffman)It incurs in some obvious limitations: T = anbn(highly compressible)T= random string of n as and n bs (uncompressible)Paolo Ferragina, Universit di PisaThe empirical entropy Hk Hk(T)= (1/|T|) |w|=k | T[w] | H0(T[w])Example: Given T = mississippi, we have T[w] = string of symbols that precede w in TProblems with this approach: How to go from all T[w] back to the string T ? How do we choose efficiently the best k ? Paolo Ferragina, Universit di PisaUse BWT to approximate HkBwt(T)unknown compress pieces of bwt(T) up to H0 Remember that...Paolo Ferragina, Universit di PisaFinding the best pieces to compress...Goal: find the best BWT-partition induced by a Leaf Cover !!Some leaf covers are related to Hk !!!121195211097463Leaf cover ?L1L2unknownH1(T)H2(T)Paolo Ferragina, Universit di PisaA compression booster [Ferragina et al., JACM 05]Let Compr be the compressor we wish to boostLet LC1, , LCr be the partition of BWT(T) induced by a leaf cover LC, and let us define cost of LC as cost(LC, Compr)=j |Compr(LCj)|Goal: Find the leaf cover LC* of minimum cost It suffices a post-order visit of the suffix tree (suffix array), optimal time We have: Cost(LC*, Compr) Cost(Hk, Compr) Hk(T), k0kkThis is purely combinatorial. We do not need any knowledge of the statistical properties of the source, no parameter estimation, no training,...Paolo Ferragina, Universit di PisaPaolo Ferragina, Universit di Pisa2001Paolo Ferragina, Universit di PisaLocate the pattern occurrences in TT = mississippi#From ss position we get 4 + 3 = 7, ok !!4 Paolo Ferragina, Universit di PisaWhat about their practicality ? We have a library that currently offers: The FM-index: build, search, display,... The Suffix Array: construction in space (5+) n bytes The LCP Array: construction in space (6+) n bytesPaolo Ferragina, Universit di PisaWhat about word-based searches ?T = bzipbzip2unbzip2unbzip The FM-index can be adapted to support word-based searches:Preprocess T and transform it into a digested text DTP=bzip...the post-processing phase can be time consuming !Use the FM-index over the digested DTWord-search in T Substring-search in DTPaolo Ferragina, Universit di PisaThe WFM-indexDigested-T derived from a Huffman variant: [Moura et al, 98] Symbols of the huffman tree are the words of T The Huffman tree has fan-out 128 Codewords are byte-aligned and taggedAny wordP= bzip1. Dictionary of words3. FM-index built on DTPaolo Ferragina, Universit di PisaA historical perspectiveShannon showed a narrower result for a stationary ergodic SIdea: Compress groups of k chars in the string TResult: Compress ratio the entropy of S, for k Various limitationsIt works for a source SIt must modify As structure, because of the alphabet changeFor a given string T, the best k is found by trying k=0,1,,|T|W(|T|2) time slowdown k is eventually fixed and this is not an optimal choice !Any string sBlack-boxO(|s|) timeVariable lengthcontextsTwo Key Components: Burrows-Wheeler Transform and Suffix TreePaolo Ferragina, Universit di PisaHow do we find the best partition (i.e. k) Approximate via MTF [Burrows-Wheeler, 94] MTF is efficient in practice [bzip2] Theory and practice showed that we can aim for more !Use Dynamic Programming [Giancarlo-Sciortino, CPM 03] It finds the optimal partition Very slow, the time complexity is cubic in |T|Surprisingly, full-text indexes help in findingthe optimal partition in optimal linear time !!Paolo Ferragina, Universit di PisaExample: not one kxs= ynzn > yxs= yn-1 , zxs= zn-1Paolo Ferragina, Universit di Pisa