Tries and Succinct Data Structures - Persone - …pages.di.unipi.it/.../uploads/sites/7/2015/09/Lez3-4-5.pdfTries and Succinct Data Structures Auto-completion as our target application

Tries and Succinct Data Structures

Auto-completion as our target application

Rossano [email protected]

mailto:[email protected]

Dataset?

Dataset? All the past queries

Dataset?Searches?

All the past queries

Dataset?Prefix searchSearches?All the past queries


Data structure?


Data structure? Trie


Data structure? TrieHow to find top-k efficiently?



Project: implement a autocompletion system

Problem

Given a dictionary D of n strings of total length m drawn from an alphabet of size σ, build a data structures to answer the following queries:

Problem


• Lookup(P): say 1 if P belongs to D, 0 otherwise;

Problem


• Lookup(P): say 1 if P belongs to D, 0 otherwise;• WeakPrefix(P): return the consecutive range of strings in D

that are prefixed by P, -1 if there is no such interval;

Problem



that are prefixed by P, -1 if there is no such interval;• StrongPrefix(P): return the consecutive range of strings in D

sharing the longest common prefix with P;

Problem




sharing the longest common prefix with P;• Insert(P): insert s in D;

Problem




sharing the longest common prefix with P;• Insert(P): insert s in D;• Delete(P): delete s from D;

D = { ab, bab, bca, cab, cac, cbac, cbba }

n = |D|, m = total length of strings in D, σ = alphabet size

Problem




sharing the longest common prefix with P;• Insert(P): insert s in D;• Delete(P): delete s from D;



3 4

a b

a c

c

a b

bab c

Trie

0

b

1

b

2

a

5

c

6

a



3 4

a b

a c

c

a b

bab c

Prefix(c)

Trie

0

b

1

b

2

a

5

c

6

a



3 4

a b

a c

c

a b

bab c

Prefix(c)

Trie

0

b

1

b

2

a

5

c

6

a



3 4

a b

a c

c

a b

bab c

Prefix(c)

Trie

0

b

1

b

2

a

5

c

6

a



3 4

a b

a c

c

a b

bab c

Prefix(c)

Trie

0

b

1

b

2

a

5

c

6

a

Find all the strings prefixed by any pattern P in O(|P| log σ) time



3 4

a b

a c

c

a b

bab c

Prefix(c)

Trie

0

b

1

b

2

a

5

c

6

a

Find all the strings prefixed by any pattern P in O(|P| log σ) time Insert? Delete?



3 4

a b

a c

c

a b

bab c

Prefix(c)

Trie

0

b

1

b

2

a

5

c

6

aO(m) nodes

O(m log m + m log σ) bits of space

Find all the strings prefixed by any pattern P in O(|P| log σ) time Insert? Delete?



3 4

a b

a c

c

a b

bab c

Prefix(c)

Trie

0

b

1

b

2

a

5

c

6

aO(m) nodes

O(m log m + m log σ) bits of space


Save space?

Insert? Delete?



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie

O(n) nodes O(n log m + mlog σ) bits of space



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie





0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



Variable length labels on the edges 😱



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie





0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



<a,1,0>



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



<a,1,0>

label len



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



<a,1,0>

label len string id



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



<a,1,0>

label len string id



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



<a,1,0>

<b,0,1>

label len string id



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



<a,1,0>

<b,0,1>

label len string id



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



<a,1,0>

<b,0,1>

<a,1,1>

label len string id



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



<a,1,0>

<b,0,1>

<a,1,1>

label len string id

<c,1,2> <a,0,3>

<b,0,3><c,0,4><a,1,5><b,1,6>

<c,1,3>

<b,1,5>



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



<a,1,0>

<b,0,1>

<a,1,1>

Search: every time we traverse an edge we have to access a

dictionary string!

label len string id

<c,1,2> <a,0,3>

<b,0,3><c,0,4><a,1,5><b,1,6>

<c,1,3>

<b,1,5>



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



<a,1,0>

<b,0,1>

<a,1,1>


dictionary string!

label len string id

<c,1,2> <a,0,3>

<b,0,3><c,0,4><a,1,5><b,1,6>

<c,1,3>

<b,1,5>

Probably a cache miss at each edge 😱. Can we do better?



0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

(Compact) Trie



<a,1,0>

<b,0,1>

<a,1,1>


dictionary string!

label len string id

<c,1,2> <a,0,3>

<b,0,3><c,0,4><a,1,5><b,1,6>

<c,1,3>

<b,1,5>

Probably a cache miss at each edge 😱. Can we do better?

Patricia trie & Blind search every time we traverse an

edge, we match the only symbol and skip “label len” symbols in the

pattern

0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1

(b) Patricia trie of S

1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i

(d) Tuples inducedby string abcc

Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.

5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.

6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.

7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.

8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.

15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9

(c) Front Coding of S

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i


Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.

16

Patricia Trie (and blind search)

0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1


1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i







15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9


h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i



16


Blind search every time we traverse an

edge, we match the only symbol and skip “label len” symbols in the

pattern

0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1


1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i







15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9


h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i



16


P= 000 0 110Blind search

every time we traverse an edge, we match the only symbol

and skip “label len” symbols in the pattern

0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1


1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i







15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9


h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i



16





0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1


1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i







15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9


h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i



16





0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1


1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i







15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9


h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i



16





0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1


1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i







15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9


h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i



16





0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1


1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i







15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9


h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i



16





0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1


1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i







15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9


h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i



16





What’s the property of the node we identify?

0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1


1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i







15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9


h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i



16






Any substring in its subtree has the same longest common prefix with P

as the “correct node”

0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1


1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i







15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9


h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i



16








=> the correct node is an ancestor of the identified node.

0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1


1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i







15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9


h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i



16









0000000000S1

0000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

1

2

3

4

S1 S2

0 1

5

S3 S4

0 1

0[00] 1[11]

6

7

S5 S6

0 1

S7

0 1

0 1[01]

S8

0 1

S9

0[00] 1


1 2 3 4 5

1 0 0 1 ⇥

0 0 1 3 ⇥

⇡ 3 2 4 1 5

(c) , , and ⇡

h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i







15

000000000S1

000000001S2

000001110S3

000001111S4

000010100S5

000010101S6

00001011S7

0001S8

1S9

(a) S

0

S91

S82

6

S77

S6S5

0 1

0 1

3

5

S4S3

0 1

4

S2S1

0 1

0[00] 1[11]

0 1[01]

0 1

0[00] 1


0,000000000S1

8,1S2

5,1110S3

5,1S4

4,10100S5

8,1S6

7,1S7

3,1S8

0,1S9


h0, 0, a, 3, 5i

h1, 1, b, 2, 3i

h1, 2, c, 1, 3i

h1, 3, c, 0, 0i



16









Insert? Delete?



3 4

a b

a c

c

a b

bab c

Implement a Trie

0

b

1

b

2

a

5

c

6

a



3 4

a b

a c

c

a b

bab c

Implement a Trie

0

b

1

b

2

a

5

c

6

a

A node is a struct with- a variable length vector of (sorted)

symbols and pointers to the children



3 4

a b

a c

c

a b

bab c

Implement a Trie

0

b

1

b

2

a

5

c

6

a



Binary search on symbols to decide which is the edge to follow



3 4

a b

a c

c

a b

bab c

Implement a Trie

0

b

1

b

2

a

5

c

6

a




Symbols and pointers stored far from the node 😱



3 4

a b

a c

c

a b

bab c

Implement a Trie

0

b

1

b

2

a

5

c

6

a




Symbols and pointers stored far from the node 😱 Seach time is O(|P| log σ)



3 4

a b

a c

c

a b

bab c

Implement a Trie

0

b

1

b

2

a

5

c

6

a





Can you solve both these issues?



3 4

a b

a c

c

a b

bab c

Implement a Trie

0

b

1

b

2

a

5

c

6

a






A node is a struct with- a binary vector saying which

symbol is represented and a variable length vector of pointers

to the children



3 4

a b

a c

c

a b

bab c

Implement a Trie

0

b

1

b

2

a

5

c

6

a








to the children

but the space grows to at least σ bits per node!



3 4

a b

a c

c

a b

bab c

Implement a Trie

0

b

1

b

2

a

5

c

6

a








to the children

but the space grows to at least σ bits per node!

Obtaining the best of both with a easy-to-implement

data structure!

Ternary Search Tree

Ternary Search TreeEvery node has (at most) tree

children


children

<c =c >croot


children

Strings that start with c

Strings that start with a symbol <c

Strings that start with a symbol >c

<c =c >croot


children

Bulid the TST recursively removing the

c




<c =c >croot


children

Bulid the TST recursively removing the

c




Bulid the TST recursively

Bulid the TST recursively

<c =c >croot

Ternary Search Tree

Ternary Search Tree

Ternary Search Tree

D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad

caw cue fee tap ago tar jam dug and }

Ternary Search Tree



Ternary Search Tree



Ternary Search Tree



Ternary Search Tree



Ternary Search Tree



Ternary Search Tree



Ternary Search Tree



Node have fixed size, independent of σ

Ternary Search Tree




All information is on the node

Ternary Search Tree





What about query time?

Ternary Search Tree






Search time is O(|P| log σ) if c is always the symbol in

the “middle”…

Ternary Search Tree







the “middle”…

How can we choose c to have O(|P| + log n) query

time?

Ternary Search Tree







the “middle”…


time?

Query time is

Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #

Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #

O(|P|)

Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #

O(|P|) Select c to pay O(log n)

Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #


~ (3way)Quicksort and pivot

Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #



Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #



c<c >c

Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #



c<c >c

<n/2 <n/2

Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #



c<c >c

<n/2 <n/2

This way every time we go left or right we (at least) halve the size of the remaining strings

Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #



c<c >c

<n/2 <n/2


Insert? Delete?

Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #



c<c >c

<n/2 <n/2


Insert? Delete?

Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #



c<c >c

<n/2 <n/2


Insert? Delete?

Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #



c<c >c

<n/2 <n/2


Insert? Delete?

Ohhhh it is simple even in C

Ternary Search Tree







the “middle”…


time?

Query time is

O( # )+ # + #



c<c >c

<n/2 <n/2


Insert? Delete?

Project: Is it possible to increase nodes’ fanout to reduce cache-misses?

Ohhhh it is simple even in C



D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }


7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

P = cFinding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

P = cFinding Top-1

Score



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

P = cHow to find Top-1?

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c


Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c


Scan to find the maximum!

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



O(n) query time :-(

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



O(n) query time :-(

Finding Top-1

Better ideas?



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c


Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c


Augment each node with the max (and string id )

within its subtree!

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



within its subtree!

4,3

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



within its subtree!

4,3 6,5

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



within its subtree!

4,3 6,5

6,5

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



within its subtree!

4,3 6,5

6,52,1

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



within its subtree!

4,3 6,5

6,52,1

7,0

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



within its subtree!

4,3 6,5

6,52,1

7,0

Preprocessing time: O(n)

Extra space: O(n log n) bits

Query time: O(1)

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



within its subtree!

4,3 6,5

6,52,1

7,0



Query time: O(1)

Solving Top-k?

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



within its subtree!

4,3 6,5

6,52,1

7,0



Query time: O(1)

Solving Top-k?

Finding Top-1

Solving Top-k?

- Extra space: O(k*n*log n) bits :-( - You must know k at building time! :-(


7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c


Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c


7 2 1 4 1 6 2S

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c


Assume you have a Data Structure on top of S answering in O(1) by using O(n) bits

RMQ(i,j) = position of the maximum in the range S[i,j] 7 2 1 4 1 6 2S

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c




Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c




RMQ(3,6) = 5

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c




RMQ(3,6) = 5Can you solve Top-2?

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



RMQ(i,j) = position of the maximum in the range S[i,j] 7 2 1 4 1 6 2SCan you solve Top-2?

Finding Top-1



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c



RMQ(i,j) = position of the maximum in the range S[i,j] 7 2 1 4 1 6 2SCan you solve Top-2?

Finding Top-1


Finding Top-k

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

Cartesian Tree

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10Cartesian Tree

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7

Cartesian Tree

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

Cartesian Tree

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

5

3

6

1

Cartesian Tree

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian TreeIt can be built top-down

with RMQ

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?

Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.

Extract (and report) the maximum from the heap and visit its children.

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=41

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

Results 1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

Results 1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

10Results 1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

Results 1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

Results

10

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

Results

10

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

97

Results

10

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

97

Results

107

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

97

Results

1079

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

97

Results

1079

87

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

97

Results

1079

87

7

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

97

Results

1079

87

8

7

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

97

Results

1079

87

8

7

77

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

97

Results

1079

87

8

7

777

7

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

97

Results

1079

87

8

7

777

7

Claim: we “touch” at most 2k nodes. ⇒ Query time O(k log k)

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

97

Results

1079

87

8

7

777

7


Important: the cartesian tree is not built!

1

Finding Top-k

3 5 1 7 1 6 10S 9 8 7 1 4 2… …

10

7 9

85

3

6

1 7

4

1 2

Cartesian Tree

How to find Top-k?



max-Heap

k=4

97

Results

1079

87

8

7

777

7


Important: the cartesian tree is not built!

1


RMQ(i,j) = position of the maximum in the range S[i,j]

Range Maximum Query (1)

3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

Space: O(n2 log n) bits Query time: O(1)


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11


Precompute the answer to any possible query.

There are O(n2) distinct queries!


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11


M[i,j] = RMQ(i,j)




3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11


M 0 1 2 3 4 5 6 7 8 9 10 1101

5

234

6789

1011

M[i,j] = RMQ(i,j)




3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11


M 0 1 2 3 4 5 6 7 8 9 10 1101

5

234

6789

1011

M[i,j] = RMQ(i,j)



3


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

Space: O(n log2 n) bits Query time: O(1)


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11


Maximum in a interval is the max between the maxima of any its subintervals


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11


Precompute the answer to every interval of size a power of 2.

There are O(log n) possible intervals starting at any position i.



3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11


M[i,j] = RMQ(i,i+2j)



M 0 1 2 3 401

5

234

6789

1011



3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11





M 0 1 2 3 401

5

234

6789

1011

?



3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11





M 0 1 2 3 401

5

234

6789

1011

?


9=1+23


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11





M 0 1 2 3 401

5

234

6789

1011

?


9=1+23

6


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11





M 0 1 2 3 401

5

234

6789

1011

Maximum of a interval is the max between the maxima of any its subintervals


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11





M 0 1 2 3 401

5

234

6789

1011


RMQ(1,7) =


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11





M 0 1 2 3 401

5

234

6789

1011


argmax(S[M[1,1+22]], S[M[7-22,7]]) = 6RMQ(1,7) =


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11





M 0 1 2 3 401

5

234

6789

1011


argmax(S[M[1,1+22]], S[M[7-22,7]]) = 6RMQ(1,7) =


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11





M 0 1 2 3 401

5

234

6789

1011

3


argmax(S[M[1,1+22]], S[M[7-22,7]]) = 6RMQ(1,7) =


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11





M 0 1 2 3 401

5

234

6789

1011

3


argmax(S[M[1,1+22]], S[M[7-22,7]]) = 6RMQ(1,7) =


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11





M 0 1 2 3 401

5

234

6789

1011

3


argmax(S[M[1,1+22]], S[M[7-22,7]]) = 6RMQ(1,7) =

6


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11





M 0 1 2 3 401

5

234

6789

1011

3


argmax(S[M[1,1+22]], S[M[7-22,7]]) = 6RMQ(1,7) =

where len =⎣log (j-i+1)⎦RMQ(i,j) = argmax(S[M[i,i+2len]], S[M[j-2len,j]])

6


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)

3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log n


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log n


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7

Use the previous solution on R!

Space: ? bits Query time: O(1)


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7


Space: ? bits Query time: O(1) Space: O(n log n) bits Query time: O(1)


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7



RMQ(1,10) = ?


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7



RMQ(1,10) = ?


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7



RMQ(1,10) = ?


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7



RMQ(1,10) = ?


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7



RMQ(1,10) = ?

O(1) time


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7



RMQ(1,10) = ?

O(1) time

O(log n) time O(log n) time


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7



RMQ(1,10) = ?

O(1) time


Space: O(n log n) bits Query time: O(1)


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7



RMQ(1,10) = ?

O(1) time



O(1) time O(1) time


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7



RMQ(1,10) = ?

O(1) time



O(1) time O(1) time


3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11

log nR 5 7 10 7



RMQ(1,10) = ?

O(1) time



O(1) time O(1) time

Puzzle: Use RMQ to compute LCA(u,v) queries on a tree


n = |D|, m total length of strings in D

7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary

Find the node “prefixed” by P



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary

Find the node “prefixed” by P O(|P| + log n) time



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary

Find the node “prefixed” by P O(|P| + log n) time O(m log σ + n log m) bits



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary


Compute the top-k strings



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary


Compute the top-k strings O(k log k) time



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary


Compute the top-k strings O(k log k) time O(n) bits



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary





7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary



3 months query log at Yahoo!

≈600 million of distinct (and clean) queries



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary





Trie requires ≈50 Gbytes!



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary






We will see how to reduce to ≈5 Gbytes!

0

1 2

3 4 5 6

ab b

ab ca

c

a b

baacb c

Patricia trie



Patricia trie

0

1 2

3 4 5 6

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1



Patricia trie

0

1 2

3 4 5 6

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1



First symbol and length of the label

Patricia trie

0

1 2

3 4 5 6

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1



Patricia trie

0

1 2

3 4 5 6

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1



P = cba

Patricia trie

0

1 2

3 4 5 6

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1



P = cba

Patricia trie

0

1 2

3 4 5 6

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1



P = cba

Patricia trie

0

1 2

3 4 5 6

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1



P = cba

Patricia trie

0

1 2

3 4 5 6

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1



P = cba

Patricia trie

0

1 2

3 4 5 6

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1



P = cba

Blind search. We can skip symbols and

check at the end.

Patricia trie

0

1 2

3 4 5 6

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1



P = cba

Blind search. We can skip symbols and

check at the end.

O(n log m + m log σ) bitsO(|P| + log m) time with TST structure

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

Rank0(j) = # of 0 in B[0,j]

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11


Rank0(7) =

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11


Rank0(7) = 4

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11



Rank0(7) = 4

Rank1(7) = 8 - Rank0(7) = 4

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11



Rank0(7) = 4

Rank1(7) = 8 - Rank0(7) = 4

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11



Rank0(7) = 4

Select0(j) = position of the j-th 0 in B

Rank1(7) = 8 - Rank0(7) = 4

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11



Rank0(7) = 4



Rank1(7) = 8 - Rank0(7) = 4

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11



Rank0(7) = 4



Select1(4) =

Rank1(7) = 8 - Rank0(7) = 4

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11



Rank0(7) = 4



Select1(4) = 6

Rank1(7) = 8 - Rank0(7) = 4

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11



Rank0(7) = 4



Select1(4) = 6 Space: n + O(n log log n/log n) bits Query time: O(1)

Rank1(7) = 8 - Rank0(7) = 4

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11



Rank0(7) = 4



Select1(4) = 6 Space: n + O(n log log n/log n) bits Query time: O(1)

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11



Space: n + O(n log log n/log n) bits Query time: O(1)

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) =

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) =

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) =

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) = 3+

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) = 3+

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) = 3+

1/2 log n bits fit in a word. O(1) with popcount op!

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) = 3+

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) = 3+

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) = 3+

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) = 3+

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1

1 = 4

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) = 3+

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1

1 = 4

How much space?

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) = 3+

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1

1 = 4

How much space?

O(21/2 log n log n) = O(√n log n) cells,

each uses O(log log n) bits

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) = 3+

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1

1 = 4

How much space?



How much space?

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) = 3+

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1

1 = 4

How much space?



How much space?

O(n/log n) entries, each uses O(log n) bits

⇒ O(n) bits :-(

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11

1/2 log n




B’ 0 2 3 4

Rank0(7) = 3+

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1

1 = 4

How much space?



How much space?

O(n/log n) entries, each uses O(log n) bits

⇒ O(n) bits :-(

Space: O(n) + o(n) bits Query time: O(1)

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

log n

Groups into superblocks!

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

log n

B’’ 0 4

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

log n

B’’ 0 4Store the # of 0s up to the beginning of its superblock

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

log n


0

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

log n


0

Rank0(j) is split into 3 parts:

- # of 0s up to the superblock of j - # of 0s up to the block of j - # of 0s within the block of j

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

log n

B’’ 0 4

0

How much space?

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

log n

B’’ 0 4

0

How much space?

O(n/log2 n) entries, each uses O(log n) bits ⇒ O(n/log n) bits :-)

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

log n

B’’ 0 4

0

How much space?

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

log n

B’’ 0 4

0

How much space?

O(n/log n) entries, each uses O(log log n) bits ⇒ O(n log log n/log n) bits :-)

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

log n

B’’ 0 4

0

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

log n

B’’ 0 4

0

How to support Select in O(log n) time?

Rank/Select queries

0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11




B’ 0 2 3 4

M1 2 31 2 3

1 2 2000001

0 1 1101

1 1 20101 1 1011

0 1 2100

0 0 11100 0 0111

1/2 log n

log n

B’’ 0 4

0

How to support Select in O(log n) time?

Select can be solved in O(1) time with a more difficult

approach

Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m


Space: n log (m/n) + O(n) bits Access(i) in O(1)



2 3 5S1 2 3

74

115

136

147

248



2 3 5S1 2 3

74

115

136

147

248

Trivial representation requires O(n log m) bits :-( Can we do better?



2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

0

log

m =

log

24



2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24



2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)



2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)



2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n



2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

Max number of buckets 2log n = n



2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H


Write buckets’ cardinalities in unary



2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H





2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3





2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3





2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06





2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06





2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8





2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8





2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011





2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

012

013





2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

012

013





2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

012

013

1 014 15





2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

012

013

1 014 15


A 1 for each value, at most a 0 for each bucket. <= 2n bits!



2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

L

012

013

1 014 15





2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

L

012

013

1 014 15


n log (m/n) bits




2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

L

012

013

1 014 15


Access(6)



2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

L

012

013

1 014 15


Access(6)1. Select1(6)-6 =3



2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

L

012

013

1 014 15


Access(6)1. Select1(6)-6 =3 011 in binary!



2 3 5S1 2 3

74

115

136

147

248

0

0

0

1

1

0

0

1

0

1

0

0

1

1

1

0

1

0

1

1

0

1

1

0

1

0

1

1

1

0

1

1

0

0

0

0

0

0

1

0

log

m =

log

24

log (m/n)

log n

H 1 1 01 2 3

1 14 5

06

1 07 8

19

110

011

L

012

013

1 014 15


Access(6)1. Select1(6)-6 =3 011 in binary!

2. Access L in position 6*log (m/n) = 13

Elias-Fano representationGiven a sequence of n (positive) integers summing up to m


2 2 4 3S1 2 3 4


2 2 4 3S

Trivial representation requires O(n log m) bits :-( Can we do better?

1 2 3 4


2 2 4 3S

B

Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros

1 2 3 4

Elias-Fano representation

1 00 1

Given a sequence of n (positive) integers summing up to m

2 2 4 3S

B


1 2 3 4


1 00 1

1 02 3


2 2 4 3S

B


1 2 3 4


1 00 1

1 02 3

1 1 1 04 5 6 7


2 2 4 3S

B


1 2 3 4


1 00 1

1 02 3

1 1 1 04 5 6 7

1 1 08 9 10


2 2 4 3S

B


1 2 3 4


1 00 1

1 02 3

1 1 1 04 5 6 7

1 1 08 9 10

The i-th value of S is Select0(i)-Select0(i-1)


2 2 4 3S

B


1 2 3 4


1 00 1

1 02 3

1 1 1 04 5 6 7

1 1 08 9 10



2 2 4 3S

B


1 2 3 4


1 00 1

1 02 3

1 1 1 04 5 6 7

1 1 08 9 10



2 2 4 3S

B

Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros Select0(3)=7Select0(2)=3

1 2 3 4


1 00 1

1 02 3

1 1 1 04 5 6 7

1 1 08 9 10



2 2 4 3S

B

Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros Select0(3)=7Select0(2)=3

Space: n log (m/n) + O(n) bits Select0 in O(1)

1 2 3 4

Unicorn: A System for Searching the Social Graph

Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko,

Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin,

Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, Ning Zhang

Facebook, Inc.

ABSTRACTUnicorn is an online, in-memory social graph-aware index-ing system designed to search trillions of edges between tensof billions of users and entities on thousands of commodityservers. Unicorn is based on standard concepts in informa-tion retrieval, but it includes features to promote resultswith good social proximity. It also supports queries that re-quire multiple round-trips to leaves in order to retrieve ob-jects that are more than one edge away from source nodes.Unicorn is designed to answer billions of queries per day atlatencies in the hundreds of milliseconds, and it serves as aninfrastructural building block for Facebook’s Graph Searchproduct. In this paper, we describe the data model andquery language supported by Unicorn. We also describe itsevolution as it became the primary backend for Facebook’ssearch o↵erings.

1. INTRODUCTIONOver the past three years we have built and deployed a

search system called Unicorn1. Unicorn was designed withthe goal of being able to quickly and scalably search all basicstructured information on the social graph and to performcomplex set operations on the results. Unicorn resemblestraditional search indexing systems [14, 21, 22] and servesits index from memory, but it di↵ers in significant ways be-cause it was built to support social graph retrieval and socialranking from its inception. Unicorn helps products to spliceinteresting views of the social graph online for new user ex-periences.

Unicorn is the primary backend system for Facebook GraphSearch and is designed to serve billions of queries per daywith response latencies less than a few hundred milliseconds.As the product has grown and organically added more fea-tures, Unicorn has been modified to suit the product’s re-quirements. This paper is intended to serve as both a nar-

1The name was chosen because engineers joked that—muchlike the mythical quadruped—this system would solve all ofour problems and heal our woes if only it existed.

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for profit or commercial advantage and that copies

bear this notice and the full citation on the first page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior specific

permission and/or a fee. Articles from this volume were invited to present

their results at The 39th International Conference on Very Large Data Bases,

August 26th - 30th 2013, Riva del Garda, Trento, Italy.

Proceedings of the VLDB Endowment, Vol. 6, No. 11Copyright 2013 VLDB Endowment 2150-8097/13/09... $ 10.00.

rative of the evolution of Unicorn’s architecture, as well asdocumentation for the major features and components ofthe system.To the best of our knowledge, no other online graph re-

trieval system has ever been built with the scale of Unicornin terms of both data volume and query volume. The sys-tem serves tens of billions of nodes and trillions of edgesat scale while accounting for per-edge privacy, and it mustalso support realtime updates for all edges and nodes whileserving billions of daily queries at low latencies.This paper includes three main contributions:

• We describe how we applied common information re-trieval architectural concepts to the domain of the so-cial graph.

• We discuss key features for promoting socially relevantsearch results.

• We discuss two operators, apply and extract, whichallow rich semantic graph queries.

This paper is divided into four major parts. In Sections 2–5, we discuss the motivation for building unicorn, its design,and basic API. In Section 6, we describe how Unicorn wasadapted to serve as the backend for Facebook’s typeaheadsearch. We also discuss how to promote and rank sociallyrelevant results. In Sections 7–8, we build on the imple-mentation of typeahead to construct a new kind of searchengine. By performing multi-stage queries that traverse aseries of edges, the system is able to return complex, user-customized views of the social graph. Finally, in Sections8–10, we talk about privacy, scaling, and the system’s per-formance characteristics for typical queries.

2. THE SOCIAL GRAPHFacebook maintains a database of the inter-relationships

between the people and things in the real world, which itcalls the social graph. Like any other directed graph, itconsists of nodes signifying people and things; and edgesrepresenting a relationship between two nodes. In the re-mainder of this paper, we will use the terms node and entityinterchangeably.Facebook’s primary storage and production serving ar-

chitectures are described in [30]. Entities can be fetchedby their primary key, which is a 64-bit identifier (id). Wealso store the edges between entities. Some edges are di-rectional while others are symmetric, and there are manythousands of edge-types. The most well known edge-type





Facebook, Inc.




























Facebook, Inc.




























Facebook, Inc.
























101 102 103 104 1050

20

40

60

Inner Truncation Limit

RelativeDeviation

(%)

The first two plots above show that increasing the in-ner truncation limit leads to higher latency and cost, withlatency passing 100ms at approximately a limit of 5000.Query cost increases sub-linearly, but there will likely alwaysbe queries whose inner result sets will need to be truncatedto prevent latency from exceeding a reasonable threshold(say, 100ms).

This means that—barring architectural changes—we willalways have queries that require inner truncation. As men-tioned in section 7.1.2, good inner query ranking is typicallythe most useful tool. However, it is always possible to con-struct “needle-in-a-haystack” queries that elicit bad perfor-mance. Query planning and selective denormalization canhelp in many of these cases.

The final plot shows that relative deviation increases grad-ually as the truncation limit increases. Larger outer querieshave higher variance per shard as perceived by the rack ag-gregator. This is likely because network queuing delays be-come more likely as query size increases.

11. RELATED WORKIn the last few years, sustained progress has been made in

scaling graph search via the SPARQL language, and someof these systems focus, like Unicorn, on real-time responseto ad-hoc queries [24, 12]. Where Unicorn seeks to handle afinite number of well understood edges and scale to trillionsof edges, SPARQL engines intend to handle arbitrary graphstructure and complex queries, and scale to tens of millionsof edges [24]. That said, it is interesting to note that thecurrent state-of-the-art in performance is based on variantsof a structure from [12] in which data is vertically partitionedand stored in a column store. This data structure uses aclustered B+ tree of (subject-id, value) for each propertyand emphasizes merge-joins, and thus seems to be evolvingtoward a posting-list-style architecture with fast intersectionand union as supported by Unicorn.

Recently, work has been done on adding keyword searchto SPARQL-style queries [28, 15], leading to the integrationof posting lists retrieval with structured indices. This workis currently at much smaller scale than Unicorn. Startingwith XML data graphs, work has been done to search forsubgraphs based on keywords (see, e.g. [16, 20]). The focusof this work is returning a subgraph, while Unicorn returnsan ordered list of entities.

In some work [13, 18], the term ’social search’ in fact refersto a system that supports question-answering, and the socialgraph is used to predict which person can answer a question.

While some similar ranking features may be used, Unicornsupports queries about the social graph rather than via thegraph, a fundamentally di↵erent application.

12. CONCLUSIONIn this paper, we have described the evolution of a graph-

based indexing system and how we added features to make ituseful for a consumer product that receives billions of queriesper week. Our main contributions are showing how manyinformation retrieval concepts can be put to work for serv-ing graph queries, and we described a simple yet practicalmultiple round-trip algorithm for serving even more com-plex queries where edges are not denormalized and insteadmust be traversed sequentially.

13. ACKNOWLEDGMENTSWe would like to acknowledge the product engineers who

use Unicorn as clients every day, and we also want to thankour production engineers who deal with our bugs and helpextinguish our fires. Also, thanks to Cameron Marlow forcontributing graphics and Dhruba Borthakur for conversa-tions about Facebook’s storage architecture. Several Face-book engineers contributed helpful feedback to this paperincluding Philip Bohannon and Chaitanya Mishra. BretTaylor was instrumental in coming up with many of theoriginal ideas for Unicorn and its design. We would alsolike to acknowledge the following individuals for their con-tributions to Unicorn: Spencer Ahrens, Neil Blakey-Milner,Udeepta Bordoloi, Chris Bray, Jordan DeLong, Shuai Ding,Jeremy Lilley, Jim Norris, Feng Qian, Nathan Schrenk, San-jeev Singh, Ryan Stout, Evan Stratford, Scott Straw, andSherman Ye.

Open SourceAll Unicorn index server and aggregator code is written inC++. Unicorn relies extensively on modules in Facebook’s“Folly” Open Source Library [5]. As part of the e↵ort ofreleasing Graph Search, we have open-sourced a C++ im-plementation of the Elias-Fano index representation [31] aspart of Folly.

14. REFERENCES[1] Apache Hadoop. http://hadoop.apache.org/.[2] Apache Thrift. http://thrift.apache.org/.[3] Description of HHVM (PHP Virtual machine).

https://www.facebook.com/note.php?note_id=

10150415177928920.[4] Facebook Graph Search.

https://www.facebook.com/about/graphsearch.[5] Folly GitHub Repository.

http://github.com/facebook/folly.[6] HPHP for PHP GitHub Repository.

http://github.com/facebook/hiphop-php.[7] Open Compute Project.

http://www.opencompute.org/.[8] Scribe Facebook Blog Post. http://www.

facebook.com/note.php?note_id=32008268919.[9] Scribe GitHub Repository.

http://www.github.com/facebook/scribe.[10] The Life of a Typeahead Query. http:

//www.facebook.com/notes/facebook-engineering/

the-life-of-a-typeahead-query/389105248919.





Facebook, Inc.
























101 102 103 104 1050

20

40

60


RelativeDeviation

(%)





























Facebook, Inc.
























101 102 103 104 1050

20

40

60


RelativeDeviation

(%)

























Succinct representation of trees (1)[LOUDS - Level-order unary degree sequence]

Succinct representation of trees (1)

Trivial: O(n log n) bits

[LOUDS - Level-order unary degree sequence]


Trivial: O(n log n) bits Best: 2n bits




3




3

0 2 2

2 20 0

0 0 0 0




Write the degree sequence in level order

3

0 2 2

2 20 0

0 0 0 0





3

0 2 2

2 20 0

0 0 0 0

D 3





3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2





3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2





3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0





3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0A tree is uniquely determined by the

degree sequence





3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0A tree is uniquely determined by the

degree sequence

How reconstruct the tree?





3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0

It still requires O(n log n) bits :-(





3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0


Solution: write them in unary





3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0



B





3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0



B 1110





3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0



B 1110 0 110 110 0 0 110 110 0 0 0 0





3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0



B 1110 0 110 110 0 0 110 110 0 0 0 0

B takes 2n - 1 bits! For each node we have a 0 and a 1

(but the root)





3

0 2 2

2 20 0

0 0 0 0

D 3 0 2 2 0 0 2 2 0 0 0 0



B 1110 0 110 110 0 0 110 110 0 0 0 0

B takes 2n - 1 bits! For each node we have a 0 and a 1

(but the root)

Can we navigate the tree?



B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0



B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1


B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42


B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6


B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

1


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

1 2 3 4 5 6 7 8 9 10 11 12


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) =

1 2 3 4 5 6 7 8 9 10 11 12

pos(5)


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12

pos(5)


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12

firstChild(x) = ?


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12

firstChild(x) = ? y = Select0(x)+1// start of x’s children in B


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12


firstChild(3)


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12


y= Select0(3)+1=8

firstChild(3)


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12


if B[y] == 0return -1 // is a leaf

y= Select0(3)+1=8

firstChild(3)


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12



return y-x // Rank1(y)

else

y= Select0(3)+1=8

firstChild(3)


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12




else

y= Select0(3)+1=8

firstChild(3) = 8-3 = 5


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12




else

y= Select0(3)+1=8

degree(x) = ?


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12




else

y= Select0(3)+1=8

degree(x) = ? Select0(x+1) - (Select0(x) + 1)


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12




else

y= Select0(3)+1=8


degree(3)


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12




else

y= Select0(3)+1=8


degree(3) = Select0(4) - (Select0(3) + 1)


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12




else

y= Select0(3)+1=8


Select0(4)=10

degree(3) = Select0(4) - (Select0(3) + 1)


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12




else


parent(x) =


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12




else


parent(x) = Rank0(pos(x))


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12




else


parent(x) = Rank0(pos(x))All these operations in O(1) time!


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12




else



subtreeSize(x) = ?


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12




else



subtreeSize(x) = ?

Not efficient! Nodes of the subtree are

spread in B


1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0


1

3 42

7 85 6

9 10 11 12

pos(x) = Select1(x)

1 2 3 4 5 6 7 8 9 10 11 12




else



subtreeSize(x) = ?

Succinct representation of trees (2)[BP - Balanced parenthesis]



1


1


1

2


1

2


1

32


1

32


1

32


1

32


1

32 6

7 104 5

8 9 11 12


B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )

[BP - Balanced parenthesis]

1

32 6

7 104 5

8 9 11 12


B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12


B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12

A tree is uniquely determined by the vector B

How reconstruct the tree?


B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12

subtree of 6


B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12

subtree of 6( and ) balance themselves


B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12

pos(x) =


B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12

pos(x) = Select ( (x)


B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12

pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (

findClose(7)


B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12

pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (

enclose(10)

= position of the parent of x in B


B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12


enclose(10)


They can be implemented in O(1) time.


B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12


enclose(10)

= position of the parent of x in Bparent(x) =



B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12


enclose(10)

= position of the parent of x in Bparent(x) = Rank ( (enclose(x))



B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12



firstChild(x) =parent(x) = Rank ( (enclose(x))



B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12




y = pos(x)+1if B[y] == )

return -1 // is a leaf

return Rank ( (y)else



B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12




y = pos(x)+1if B[y] == )



sibling(x) =



B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12




y = pos(x)+1if B[y] == )



sibling(x) = Rank ( (findClose(x)+1) (if any)



B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12




y = pos(x)+1if B[y] == )



sibling(x) = Rank ( (findClose(x)+1) (if any)subtreeSize(x) =



B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12




y = pos(x)+1if B[y] == )




(findClose(x) - pos(x) + 1) / 2



B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12




y = pos(x)+1if B[y] == )




(findClose(x) - pos(x) + 1) / 2depth(x) =



B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12




y = pos(x)+1if B[y] == )





Rank ( (pos(x)) - Rank ) (pos(x))



B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12




y = pos(x)+1if B[y] == )





Rank ( (pos(x)) - Rank ) (pos(x))


All these operations in O(1) time!


B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12




y = pos(x)+1if B[y] == )





degree(x) = ?

depth(x) =Rank ( (pos(x)) - Rank ) (pos(x))



B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12




y = pos(x)+1if B[y] == )





degree(x) = ?

Quite inefficient! Solved by repeatedly calling sibling to scan x’s children.

depth(x) =Rank ( (pos(x)) - Rank ) (pos(x))


Succinct representation of trees (3)[DFUDS - Depth First Unary Degree Sequence

1

32 6

7 104 5

8 9 11 12


1

32 6

7 104 5

8 9 11 12


1

32 6

7 104 5

8 9 11 12


1

32 6

7 104 5

8 9 11 12


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

[DFUDS - Depth First Unary Degree Sequence

1

32 6

7 104 5

8 9 11 12

1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12

subtree of 6


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12

children of 1


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12

pos(6)

pos(x) = Select ) (x) // closing )

id(pos) = Rank ) (pos)

⇋


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12

degree(x) =



⇋children of 6


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12

degree(x) = Select ) (Rank ) (pos(x))) - x



⇋children of 6


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )


1

32 6

7 104 5

8 9 11 12

1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12

child(x,i), parent(x), subtreeSize(x), …

degree(x) = Select ) (Rank ) (pos(x))) - x



⇋

All these operations in O(1) time!

Patricia trie with DFUDS

B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

* a b cac ba cb baS


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

c?* a b cac ba cb baS


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

b?* a b cac ba cb baS


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

a?* a b cac ba cb baS


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS

Search P in O(|P|) time!


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS


* 2 1 221 11 11 22L


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS


* 2 1 221 11 11 22LElias-Fano representation: n log(m/n) + O(n) bits and

O(1) time access.


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS


S, L and B use m log σ + n log (m/n)+ O(n) bits

* 2 1 221 11 11 22L


B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )

1

32 6

7 104 5

8 9 11 12

1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12

a,2 b,1

a,2 c,2

c,1

a,1 b,1

b,2a,2b,1 c,1

P = cba

* a b cac ba cb baS


S, L and B use m log σ + n log (m/n)+ O(n) bits

* 2 1 221 11 11 22L

Project: implement a compact TST



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary

Find the node “prefixed” by P O(|P|) time








7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary

Find the node “prefixed” by P O(|P|) time m log σ + n log (m/n) + O(n) bits








7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary







to be compared with O(m log σ + n log m) bits



7

2 1

4 1 6 2

ab b

ab ca

c

a b

baacb c

Summary







to be compared with O(m log σ + n log m) bits

(n=) 1 billion of strings of average length 64

m=64*109 symbols and m/n = 64

Documents

Tries and Succinct Data Structures - Persone - …pages.di.unipi.it/.../uploads/sites/7/2015/09/Lez3-4-5.pdfTries and Succinct Data Structures Auto-completion as our target application