29
Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz, S. Srinivasa Rao, Rajeev Raman, Venkatesh Raman, Adam Storm … How do we encode a large tree or other combinatorial object of specialized information … even a static one in a small amount of space and still perform queries in constant time ???

Succinct Data Structures Ian Munro University of Waterloo Joint work with David Benoit, Andrej Brodnik, D, Clark, F. Fich, M. He, J. Horton, A. López-Ortiz,

Embed Size (px)

Citation preview

Succinct Data StructuresSuccinct Data Structures

Ian MunroUniversity of WaterlooJoint work with David Benoit, Andrej Brodnik, D, Clark,

F. Fich, M. He, J. Horton, A. López-Ortiz, S. Srinivasa Rao, Rajeev Raman, Venkatesh Raman, Adam Storm …

How do we encode a large tree or other combinatorial object of specialized information

… even a static one in a small amount of spaceand still perform queries in constant time ???

Example of a Succinct Data Structure: The (Static) Bounded Subset

Example of a Succinct Data Structure: The (Static) Bounded SubsetGiven: Universe of n elements [0,...n-1]and m arbitrary elements from this universeCreate: a static structure to support search in

constant time (lg n bit word and usual operations)

Using: Essentially minimum possible # bits ...Operation: Member query in O(1) time(Brodnik & M.)

))mnlg((

Focus on TreesFocus on Trees

.. Because Computer Science is .. Arbophilic

- Directories (Unix, all the rest)- Search trees (B-trees, binary search trees,

digital trees or tries)- Graph structures (we do a tree based search)- Search indices for text (including DNA)

A Big Patricia Trie / Suffix TrieA Big Patricia Trie / Suffix Trie

Given a large text file; treat it as bit vector Construct a trie with leaves pointing to unique locations

in text that “match” path in trie (paths must start at character boundaries)

Skip the nodes where there is no branching ( n-1 internal nodes)

1 0 0 0 1 1

0 1

0

1

Space for Trees

Abstract data type: binary treeSize: n-1 internal nodes, n leavesOperations: child, parent, subtree size, leaf

dataMotivation: “Obvious” representation of an n

node tree takes about 6 n lg n words (up, left, right, size, memory manager, leaf reference)

i.e. full suffix tree takes about 5 or 6 times the space of suffix array (i.e. leaf references only)

Succinct Representations of Trees

Start with Jacobson, then others:There are about 4n/(πn)3/2 ordered rooted

trees, and same number of binary treesLower bound on specifying is about 2n bitsWhat are the natural representations?

Arbitrary Ordered Trees

Use parenthesis notation Represent the tree

As the binary string (((())())((())()())): traverse tree as “(“ for node, then subtrees, then “)”

Each node takes 2 bits

Heap-like Notation for a Binary Tree

Add external nodesEnumerate level by level

Store vector 11110111001000000 length2n+1(Here don’t know size of subtrees; can be overcome.

Could use isomorphism to flip between notations)

1

1 1

1 1 1

11

0 0

0

0

0

0

0 0

0

How do we Navigate?How do we Navigate?

Jacobson’s key suggestion:Operations on a bit vector

rank(x) = # 1’s up to & including xselect(x) = position of xth 1

So in the binary tree

leftchild(x) = 2 rank(x)rightchild(x) = 2 rank(x) + 1parent(x) = select(x/2)

Rank & SelectRank & Select

Rank -Auxiliary storage ~ 2nlglg n / lg n bits

#1’s up to each (lg n)2 rd bit#1’s within these too each lg nth bitTable lookup after that

Select -more complicated but similar notionsKey issue: Rank & Select take O(1) time with

lg n bit word (M. et al)Aside: Interesting data type by itself

Other Combinatorial Objects

Planar Graphs (Lu et al)Permutations [n]→ [n]Or more generally

Functions [n] → [n]But what are the operations?Clearly π(i), but also π

-1(i) And then π

k(i) and π -k(i)

Suffix Arrays (special permutations) in linear space

Permutations: a Shortcut Notation

Let P be a simple array giving π; P[i] = π[i]Also have B[i] be a pointer t positions back in

(the cycle of) the permutation; B[i]= π-t[i] .. But only define B for every tth position in cycle. (t is a constant; ignore cycle length “round-off”)

So array representationP = [8 4 12 5 13 x x 3 x 2 x 10 1]

1 2 3 4 5 6 7 8 9 10 11 12 13

2 4 5 13 1 8 3 12 10

Representing Shortcuts

In a cycle there is a B every t positions …But these positions can be in arbitrary order

Which i’s have a B, and how do we store it?Keep a vector of all positions

0 indicates no B 1 indicates a BRank gives the position of B[“i”] in B arraySo: π(i) and π

-1(i) in O(1) time & (1+ε)n lg n bitsTheorem: Under a pointer machine model with

space (1+ ε) n references, we need time 1/ε to answer π and π

-1 queries; i.e. this is as good as it gets.

Getting n lg n Bits: an Aside

This is the best we can do for O(1) operationsBut using Benes networks:1-Benes network is a 2 input/2 output switchr+1-Benes network … join tops to tops

1

2

3

4

5

6

7

8

3

5

7

8

1

6

4

2

R-Benes Network

R-Benes Network

A Benes Network

Realizing the permutation(3 5 7 8 1 6 4 2)

1

2

3

4

5

6

7

8

3

5

7

8

1

6

4

2

What can we do with it?

Divide into blocks of lg lg n gates … encode their actions in a word. Taking advantage of regularity of address mechanism

and alsoModify approach to avoid power of 2 issueCan trace a path in time O(lg n/(lg lg n)This is the best time we are able get for π and π-1

in minimum space.Observe: This method “violates” the pointer

machine lower bound by using “micropointers”.

Back to the main track: Powers of π

Consider the cycles of π( 2 6 8)( 3 5 9 10)( 4 1 7)Keep a bit vector to indicate the start of

each cycle( 2 6 8 3 5 9 10 4 1 7)Ignoring parentheses, view as new

permutation, ψ.Note: ψ-1(i) is position containing i …So we have ψ and ψ-1 as beforeUse ψ-1(i) to find i, then bit vector (rank,

select) to find πk or π-k

FunctionsNow consider arbitrary functions [n]→[n]“A function is just a hairy permutation”All tree edges lead to a cycle

Challenges here

Essentially write down the components in a convenient order and use the n lg n bits to describe the mapping (as per permutations)

To get fk(i):Find the level ancestor (k levels up) in a treeOrGo up to root and apply f the remaining

number of steps around a cycle

Level Ancestors

There are several level ancestor techniques using

O(1) time and O(n) WORDS.Adapt Bender & Farach-Colton to work in

O(n) bits

But going the other way …

f-k is a setMoving Down the tree requires

caref-3( ) = ( )The trick:Report all nodes on a given

level of a tree in time proportional to the number of nodes, and

Don’t waste time on trees with no answers

Final Function Result

Given an arbitrary function f: [n]→[n]With an n lg n + O(n) bit representation we

can compute fk(i) in O(1) time and f-k(i) in time O(1 + size of answer).

Back to Text … And Suffix ArraysText T[1..n] over (a,b)*# (a<#<b)There are 2n-1 such texts, which of the n! suffix arrays are valid?

1 2 3 4 5 6 7 8

SA= 4 7 5 1 8 3 6 2 is

a b b a a b a #SA-1= 4 8 6 1 3 7 2 5

M= 4 7 1 5 8 2 3 6 isn’t ..why?

Ascending to MaxM is a permutation so M-1 is its inverse i.e. M-1[i] says where i is in M

Ascending-to-Max: 1 i n-2i)M-1[i] < M-1[n] and M-1[i+1] < M-1[n] M-

1[i] < M-1[i+1] ii)M-1[i] > M-1[n] and M-1[i+1] > M-1[n] M-

1[i] > M-1[i+1]

4 7 5 1 8 3 6 2 OK4 7 1 5 8 2 3 6 NO

Non-NestingNon-Nesting: 1 i,j n-1 and M-1[i]<M-1[j] i)M-1[i] < M-1[i+1] and M-1[j] < M-1[j+1]

M-1[i+1] < M-1[j+1] ii)M-1[i] > M-1[i+1] and M-1[j] > M-1[j+1]

M-1[i+1] < M-1[j+1]

4 7 5 1 8 3 6 2 OK4 7 1 5 8 2 3 6 NO

Characterization Theorem for Suffix Arrays on Binary Texts

Theorem: Ascending to Max & Non-nesting Suffix Array

Corollary: Clean method of breaking SA into segments

Corollary: Linear time algorithm to check whether SA is valid

Cardinality QueriesT= a b a a a b b a a a b a a b b #Remember lengths longest run of a’s and of b’sSA (broken by runs, but not stored explicitly)

8 3 | 9 4 12| 1 10 5 13|16 | 7 2 11 15|6 14Ba, bit vector .. If SA-1[i-1] in an “a” section store 1 in Ba,[SA-1[i]], else 0

Ba 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1 1

Create rank structure on Ba, and similarly Bb, (Note these are reversed except at #)

Algorithm Count(T,P)s ← 1; e ←n; i ← m;while i>0 and se do

if P[i[=a then

s← rank1(Ba,s-1)+1; e←rank1(Ba,e)else

s← na + 2 + rank1(Bb,s-1); e←na + 1 +rank1(Bb,e)i ← i-1Return max(e-s+1,0)

Time: O(length of query)

Listing QueriesComplex methodsKey idea: for queries of length at least d,

index every dth position .. For T and forT(reversed)

So we have matches for T[i..n] and T[1,i-1]View these as points in 2 space (Ferragina &

Manzini and Grossi & Vitter)Do a range query (Alstrup et al)Variety of results follow

General Conclusion

Interesting, and useful, combinatorial objects can be:

Stored succinctly … O(lower bound) +o()So that Natural queries are performed in O(1) time

(or at least very close)

This can make the difference between using them and not …