Upload
phamdieu
View
218
Download
0
Embed Size (px)
Citation preview
Tries and Succinct Data Structures
Auto-completion as our target application
Rossano [email protected]
Dataset?
Dataset? All the past queries
Dataset?Searches?
All the past queries
Dataset?Prefix searchSearches?All the past queries
Dataset?Prefix searchSearches?All the past queries
Data structure?
Dataset?Prefix searchSearches?All the past queries
Data structure? Trie
Dataset?Prefix searchSearches?All the past queries
Data structure? TrieHow to find top-k efficiently?
Dataset?Prefix searchSearches?All the past queries
Data structure? TrieHow to find top-k efficiently?
Project: implement a autocompletion system
Problem
Given a dictionary D of n strings of total length m drawn from an alphabet of size σ, build a data structures to answer the following queries:
Problem
Given a dictionary D of n strings of total length m drawn from an alphabet of size σ, build a data structures to answer the following queries:
• Lookup(P): say 1 if P belongs to D, 0 otherwise;
Problem
Given a dictionary D of n strings of total length m drawn from an alphabet of size σ, build a data structures to answer the following queries:
• Lookup(P): say 1 if P belongs to D, 0 otherwise;• WeakPrefix(P): return the consecutive range of strings in D
that are prefixed by P, -1 if there is no such interval;
Problem
Given a dictionary D of n strings of total length m drawn from an alphabet of size σ, build a data structures to answer the following queries:
• Lookup(P): say 1 if P belongs to D, 0 otherwise;• WeakPrefix(P): return the consecutive range of strings in D
that are prefixed by P, -1 if there is no such interval;• StrongPrefix(P): return the consecutive range of strings in D
sharing the longest common prefix with P;
Problem
Given a dictionary D of n strings of total length m drawn from an alphabet of size σ, build a data structures to answer the following queries:
• Lookup(P): say 1 if P belongs to D, 0 otherwise;• WeakPrefix(P): return the consecutive range of strings in D
that are prefixed by P, -1 if there is no such interval;• StrongPrefix(P): return the consecutive range of strings in D
sharing the longest common prefix with P;• Insert(P): insert s in D;
Problem
Given a dictionary D of n strings of total length m drawn from an alphabet of size σ, build a data structures to answer the following queries:
• Lookup(P): say 1 if P belongs to D, 0 otherwise;• WeakPrefix(P): return the consecutive range of strings in D
that are prefixed by P, -1 if there is no such interval;• StrongPrefix(P): return the consecutive range of strings in D
sharing the longest common prefix with P;• Insert(P): insert s in D;• Delete(P): delete s from D;
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
Problem
Given a dictionary D of n strings of total length m drawn from an alphabet of size σ, build a data structures to answer the following queries:
• Lookup(P): say 1 if P belongs to D, 0 otherwise;• WeakPrefix(P): return the consecutive range of strings in D
that are prefixed by P, -1 if there is no such interval;• StrongPrefix(P): return the consecutive range of strings in D
sharing the longest common prefix with P;• Insert(P): insert s in D;• Delete(P): delete s from D;
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Trie
0
b
1
b
2
a
5
c
6
a
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Prefix(c)
Trie
0
b
1
b
2
a
5
c
6
a
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Prefix(c)
Trie
0
b
1
b
2
a
5
c
6
a
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Prefix(c)
Trie
0
b
1
b
2
a
5
c
6
a
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Prefix(c)
Trie
0
b
1
b
2
a
5
c
6
a
Find all the strings prefixed by any pattern P in O(|P| log σ) time
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Prefix(c)
Trie
0
b
1
b
2
a
5
c
6
a
Find all the strings prefixed by any pattern P in O(|P| log σ) time Insert? Delete?
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Prefix(c)
Trie
0
b
1
b
2
a
5
c
6
aO(m) nodes
O(m log m + m log σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time Insert? Delete?
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Prefix(c)
Trie
0
b
1
b
2
a
5
c
6
aO(m) nodes
O(m log m + m log σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
Save space?
Insert? Delete?
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
Variable length labels on the edges 😱
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
<a,1,0>
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
<a,1,0>
label len
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
<a,1,0>
label len string id
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
<a,1,0>
label len string id
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
<a,1,0>
<b,0,1>
label len string id
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
<a,1,0>
<b,0,1>
label len string id
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
<a,1,0>
<b,0,1>
<a,1,1>
label len string id
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
<a,1,0>
<b,0,1>
<a,1,1>
label len string id
<c,1,2> <a,0,3>
<b,0,3><c,0,4><a,1,5><b,1,6>
<c,1,3>
<b,1,5>
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
<a,1,0>
<b,0,1>
<a,1,1>
Search: every time we traverse an edge we have to access a
dictionary string!
label len string id
<c,1,2> <a,0,3>
<b,0,3><c,0,4><a,1,5><b,1,6>
<c,1,3>
<b,1,5>
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
<a,1,0>
<b,0,1>
<a,1,1>
Search: every time we traverse an edge we have to access a
dictionary string!
label len string id
<c,1,2> <a,0,3>
<b,0,3><c,0,4><a,1,5><b,1,6>
<c,1,3>
<b,1,5>
Probably a cache miss at each edge 😱. Can we do better?
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
(Compact) Trie
O(n) nodes O(n log m + mlog σ) bits of space
Find all the strings prefixed by any pattern P in O(|P| log σ) time
<a,1,0>
<b,0,1>
<a,1,1>
Search: every time we traverse an edge we have to access a
dictionary string!
label len string id
<c,1,2> <a,0,3>
<b,0,3><c,0,4><a,1,5><b,1,6>
<c,1,3>
<b,1,5>
Probably a cache miss at each edge 😱. Can we do better?
Patricia trie & Blind search every time we traverse an
edge, we match the only symbol and skip “label len” symbols in the
pattern
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
Blind search every time we traverse an
edge, we match the only symbol and skip “label len” symbols in the
pattern
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
P= 000 0 110Blind search
every time we traverse an edge, we match the only symbol
and skip “label len” symbols in the pattern
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
P= 000 0 110Blind search
every time we traverse an edge, we match the only symbol
and skip “label len” symbols in the pattern
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
P= 000 0 110Blind search
every time we traverse an edge, we match the only symbol
and skip “label len” symbols in the pattern
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
P= 000 0 110Blind search
every time we traverse an edge, we match the only symbol
and skip “label len” symbols in the pattern
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
P= 000 0 110Blind search
every time we traverse an edge, we match the only symbol
and skip “label len” symbols in the pattern
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
P= 000 0 110Blind search
every time we traverse an edge, we match the only symbol
and skip “label len” symbols in the pattern
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
P= 000 0 110Blind search
every time we traverse an edge, we match the only symbol
and skip “label len” symbols in the pattern
What’s the property of the node we identify?
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
P= 000 0 110Blind search
every time we traverse an edge, we match the only symbol
and skip “label len” symbols in the pattern
What’s the property of the node we identify?
Any substring in its subtree has the same longest common prefix with P
as the “correct node”
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
P= 000 0 110Blind search
every time we traverse an edge, we match the only symbol
and skip “label len” symbols in the pattern
What’s the property of the node we identify?
Any substring in its subtree has the same longest common prefix with P
as the “correct node”
=> the correct node is an ancestor of the identified node.
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
P= 000 0 110Blind search
every time we traverse an edge, we match the only symbol
and skip “label len” symbols in the pattern
What’s the property of the node we identify?
Any substring in its subtree has the same longest common prefix with P
as the “correct node”
=> the correct node is an ancestor of the identified node.
0000000000S1
0000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
1
2
3
4
S1 S2
0 1
5
S3 S4
0 1
0[00] 1[11]
6
7
S5 S6
0 1
S7
0 1
0 1[01]
S8
0 1
S9
0[00] 1
(b) Patricia trie of S
1 2 3 4 5
1 0 0 1 ⇥
0 0 1 3 ⇥
⇡ 3 2 4 1 5
(c) , , and ⇡
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings ={00000000, 00000001, 0000001, 00000100, cbcc}.
5. A. Brodnik and J. I. Munro. Membership in constant time and almost-minimumspace. SIAM J. Comput., 28(5):1627–1640, 1999.
6. E. D. Demaine, J. Iacono, and S. Langerman. Worst-case optimal tree layout in amemory hierarchy. CoRR, cs.DS/0410048, 2004.
7. P. Elias. E�cient storage and retrieval by content and address of static files. J.ACM, 21(2):246–260, 1974.
8. P. Ferragina. On the weak prefix-search problem. In CPM, LNCS vol. 6661, pages261–272. Springer, 2011.
15
000000000S1
000000001S2
000001110S3
000001111S4
000010100S5
000010101S6
00001011S7
0001S8
1S9
(a) S
0
S91
S82
6
S77
S6S5
0 1
0 1
3
5
S4S3
0 1
4
S2S1
0 1
0[00] 1[11]
0 1[01]
0 1
0[00] 1
(b) Patricia trie of S
0,000000000S1
8,1S2
5,1110S3
5,1S4
4,10100S5
8,1S6
7,1S7
3,1S8
0,1S9
(c) Front Coding of S
h0, 0, a, 3, 5i
h1, 1, b, 2, 3i
h1, 2, c, 1, 3i
h1, 3, c, 0, 0i
(d) Tuples inducedby string abcc
Figure 1. The picture shows a running example for the set of strings S ={000000000, 000000001, 000001110, 000001111, 000010100, 000010101, 00001011, 0001, 1}.
16
Patricia Trie (and blind search)
P= 000 0 110Blind search
every time we traverse an edge, we match the only symbol
and skip “label len” symbols in the pattern
What’s the property of the node we identify?
Any substring in its subtree has the same longest common prefix with P
as the “correct node”
=> the correct node is an ancestor of the identified node.
Insert? Delete?
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Implement a Trie
0
b
1
b
2
a
5
c
6
a
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Implement a Trie
0
b
1
b
2
a
5
c
6
a
A node is a struct with- a variable length vector of (sorted)
symbols and pointers to the children
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Implement a Trie
0
b
1
b
2
a
5
c
6
a
A node is a struct with- a variable length vector of (sorted)
symbols and pointers to the children
Binary search on symbols to decide which is the edge to follow
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Implement a Trie
0
b
1
b
2
a
5
c
6
a
A node is a struct with- a variable length vector of (sorted)
symbols and pointers to the children
Binary search on symbols to decide which is the edge to follow
Symbols and pointers stored far from the node 😱
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Implement a Trie
0
b
1
b
2
a
5
c
6
a
A node is a struct with- a variable length vector of (sorted)
symbols and pointers to the children
Binary search on symbols to decide which is the edge to follow
Symbols and pointers stored far from the node 😱 Seach time is O(|P| log σ)
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Implement a Trie
0
b
1
b
2
a
5
c
6
a
A node is a struct with- a variable length vector of (sorted)
symbols and pointers to the children
Binary search on symbols to decide which is the edge to follow
Symbols and pointers stored far from the node 😱 Seach time is O(|P| log σ)
Can you solve both these issues?
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Implement a Trie
0
b
1
b
2
a
5
c
6
a
A node is a struct with- a variable length vector of (sorted)
symbols and pointers to the children
Binary search on symbols to decide which is the edge to follow
Symbols and pointers stored far from the node 😱 Seach time is O(|P| log σ)
Can you solve both these issues?
A node is a struct with- a binary vector saying which
symbol is represented and a variable length vector of pointers
to the children
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Implement a Trie
0
b
1
b
2
a
5
c
6
a
A node is a struct with- a variable length vector of (sorted)
symbols and pointers to the children
Binary search on symbols to decide which is the edge to follow
Symbols and pointers stored far from the node 😱 Seach time is O(|P| log σ)
Can you solve both these issues?
A node is a struct with- a binary vector saying which
symbol is represented and a variable length vector of pointers
to the children
but the space grows to at least σ bits per node!
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m = total length of strings in D, σ = alphabet size
3 4
a b
a c
c
a b
bab c
Implement a Trie
0
b
1
b
2
a
5
c
6
a
A node is a struct with- a variable length vector of (sorted)
symbols and pointers to the children
Binary search on symbols to decide which is the edge to follow
Symbols and pointers stored far from the node 😱 Seach time is O(|P| log σ)
Can you solve both these issues?
A node is a struct with- a binary vector saying which
symbol is represented and a variable length vector of pointers
to the children
but the space grows to at least σ bits per node!
Obtaining the best of both with a easy-to-implement
data structure!
Ternary Search Tree
Ternary Search TreeEvery node has (at most) tree
children
Ternary Search TreeEvery node has (at most) tree
children
<c =c >croot
Ternary Search TreeEvery node has (at most) tree
children
Strings that start with c
Strings that start with a symbol <c
Strings that start with a symbol >c
<c =c >croot
Ternary Search TreeEvery node has (at most) tree
children
Bulid the TST recursively removing the
c
Strings that start with c
Strings that start with a symbol <c
Strings that start with a symbol >c
<c =c >croot
Ternary Search TreeEvery node has (at most) tree
children
Bulid the TST recursively removing the
c
Strings that start with c
Strings that start with a symbol <c
Strings that start with a symbol >c
Bulid the TST recursively
Bulid the TST recursively
<c =c >croot
Ternary Search Tree
Ternary Search Tree
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
O(|P|)
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
O(|P|) Select c to pay O(log n)
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
O(|P|) Select c to pay O(log n)
~ (3way)Quicksort and pivot
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
O(|P|) Select c to pay O(log n)
~ (3way)Quicksort and pivot
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
O(|P|) Select c to pay O(log n)
~ (3way)Quicksort and pivot
c<c >c
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
O(|P|) Select c to pay O(log n)
~ (3way)Quicksort and pivot
c<c >c
<n/2 <n/2
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
O(|P|) Select c to pay O(log n)
~ (3way)Quicksort and pivot
c<c >c
<n/2 <n/2
This way every time we go left or right we (at least) halve the size of the remaining strings
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
O(|P|) Select c to pay O(log n)
~ (3way)Quicksort and pivot
c<c >c
<n/2 <n/2
This way every time we go left or right we (at least) halve the size of the remaining strings
Insert? Delete?
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
O(|P|) Select c to pay O(log n)
~ (3way)Quicksort and pivot
c<c >c
<n/2 <n/2
This way every time we go left or right we (at least) halve the size of the remaining strings
Insert? Delete?
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
O(|P|) Select c to pay O(log n)
~ (3way)Quicksort and pivot
c<c >c
<n/2 <n/2
This way every time we go left or right we (at least) halve the size of the remaining strings
Insert? Delete?
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
O(|P|) Select c to pay O(log n)
~ (3way)Quicksort and pivot
c<c >c
<n/2 <n/2
This way every time we go left or right we (at least) halve the size of the remaining strings
Insert? Delete?
Ohhhh it is simple even in C
Ternary Search Tree
D= { now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad
caw cue fee tap ago tar jam dug and }
Node have fixed size, independent of σ
All information is on the node
What about query time?
Search time is O(|P| log σ) if c is always the symbol in
the “middle”…
How can we choose c to have O(|P| + log n) query
time?
Query time is
O( # )+ # + #
O(|P|) Select c to pay O(log n)
~ (3way)Quicksort and pivot
c<c >c
<n/2 <n/2
This way every time we go left or right we (at least) halve the size of the remaining strings
Insert? Delete?
Project: Is it possible to increase nodes’ fanout to reduce cache-misses?
Ohhhh it is simple even in C
Dataset?Prefix searchSearches?All the past queries
Data structure? TrieHow to find top-k efficiently?
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cFinding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cFinding Top-1
Score
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Scan to find the maximum!
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Scan to find the maximum!
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Scan to find the maximum!
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Scan to find the maximum!
O(n) query time :-(
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Scan to find the maximum!
O(n) query time :-(
Finding Top-1
Better ideas?
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Augment each node with the max (and string id )
within its subtree!
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Augment each node with the max (and string id )
within its subtree!
4,3
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Augment each node with the max (and string id )
within its subtree!
4,3 6,5
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Augment each node with the max (and string id )
within its subtree!
4,3 6,5
6,5
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Augment each node with the max (and string id )
within its subtree!
4,3 6,5
6,52,1
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Augment each node with the max (and string id )
within its subtree!
4,3 6,5
6,52,1
7,0
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Augment each node with the max (and string id )
within its subtree!
4,3 6,5
6,52,1
7,0
Preprocessing time: O(n)
Extra space: O(n log n) bits
Query time: O(1)
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Augment each node with the max (and string id )
within its subtree!
4,3 6,5
6,52,1
7,0
Preprocessing time: O(n)
Extra space: O(n log n) bits
Query time: O(1)
Solving Top-k?
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Augment each node with the max (and string id )
within its subtree!
4,3 6,5
6,52,1
7,0
Preprocessing time: O(n)
Extra space: O(n log n) bits
Query time: O(1)
Solving Top-k?
Finding Top-1
Solving Top-k?
- Extra space: O(k*n*log n) bits :-( - You must know k at building time! :-(
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
7 2 1 4 1 6 2S
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Assume you have a Data Structure on top of S answering in O(1) by using O(n) bits
RMQ(i,j) = position of the maximum in the range S[i,j] 7 2 1 4 1 6 2S
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Assume you have a Data Structure on top of S answering in O(1) by using O(n) bits
RMQ(i,j) = position of the maximum in the range S[i,j] 7 2 1 4 1 6 2S
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Assume you have a Data Structure on top of S answering in O(1) by using O(n) bits
RMQ(i,j) = position of the maximum in the range S[i,j] 7 2 1 4 1 6 2S
RMQ(3,6) = 5
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Assume you have a Data Structure on top of S answering in O(1) by using O(n) bits
RMQ(i,j) = position of the maximum in the range S[i,j] 7 2 1 4 1 6 2S
RMQ(3,6) = 5Can you solve Top-2?
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Assume you have a Data Structure on top of S answering in O(1) by using O(n) bits
RMQ(i,j) = position of the maximum in the range S[i,j] 7 2 1 4 1 6 2SCan you solve Top-2?
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m = total length of strings in D, σ = alphabet size
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
P = cHow to find Top-1?
Assume you have a Data Structure on top of S answering in O(1) by using O(n) bits
RMQ(i,j) = position of the maximum in the range S[i,j] 7 2 1 4 1 6 2SCan you solve Top-2?
Finding Top-1
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
Finding Top-k
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
Cartesian Tree
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10Cartesian Tree
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7
Cartesian Tree
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
Cartesian Tree
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
5
3
6
1
Cartesian Tree
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian TreeIt can be built top-down
with RMQ
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=41
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
Results 1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
Results 1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
10Results 1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
Results 1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
Results
10
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
Results
10
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
97
Results
10
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
97
Results
107
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
97
Results
1079
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
97
Results
1079
87
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
97
Results
1079
87
7
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
97
Results
1079
87
8
7
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
97
Results
1079
87
8
7
77
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
97
Results
1079
87
8
7
777
7
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
97
Results
1079
87
8
7
777
7
Claim: we “touch” at most 2k nodes. ⇒ Query time O(k log k)
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
97
Results
1079
87
8
7
777
7
Claim: we “touch” at most 2k nodes. ⇒ Query time O(k log k)
Important: the cartesian tree is not built!
1
Finding Top-k
3 5 1 7 1 6 10S 9 8 7 1 4 2… …
10
7 9
85
3
6
1 7
4
1 2
Cartesian Tree
How to find Top-k?
Visit the node starting from the root and try to insert each visited node in a max-Heap storing at most k elements.
Extract (and report) the maximum from the heap and visit its children.
max-Heap
k=4
97
Results
1079
87
8
7
777
7
Claim: we “touch” at most 2k nodes. ⇒ Query time O(k log k)
Important: the cartesian tree is not built!
1
Assume you have a Data Structure on top of S answering in O(1) by using O(n) bits
RMQ(i,j) = position of the maximum in the range S[i,j]
Range Maximum Query (1)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Range Maximum Query (1)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n2 log n) bits Query time: O(1)
Range Maximum Query (1)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n2 log n) bits Query time: O(1)
Precompute the answer to any possible query.
There are O(n2) distinct queries!
Range Maximum Query (1)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n2 log n) bits Query time: O(1)
M[i,j] = RMQ(i,j)
Precompute the answer to any possible query.
There are O(n2) distinct queries!
Range Maximum Query (1)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n2 log n) bits Query time: O(1)
M 0 1 2 3 4 5 6 7 8 9 10 1101
5
234
6789
1011
M[i,j] = RMQ(i,j)
Precompute the answer to any possible query.
There are O(n2) distinct queries!
Range Maximum Query (1)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n2 log n) bits Query time: O(1)
M 0 1 2 3 4 5 6 7 8 9 10 1101
5
234
6789
1011
M[i,j] = RMQ(i,j)
Precompute the answer to any possible query.
There are O(n2) distinct queries!
3
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
Maximum in a interval is the max between the maxima of any its subintervals
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
Maximum in a interval is the max between the maxima of any its subintervals
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
M[i,j] = RMQ(i,i+2j)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
M 0 1 2 3 401
5
234
6789
1011
Maximum in a interval is the max between the maxima of any its subintervals
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
M[i,j] = RMQ(i,i+2j)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
M 0 1 2 3 401
5
234
6789
1011
?
Maximum in a interval is the max between the maxima of any its subintervals
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
M[i,j] = RMQ(i,i+2j)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
M 0 1 2 3 401
5
234
6789
1011
?
Maximum in a interval is the max between the maxima of any its subintervals
9=1+23
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
M[i,j] = RMQ(i,i+2j)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
M 0 1 2 3 401
5
234
6789
1011
?
Maximum in a interval is the max between the maxima of any its subintervals
9=1+23
6
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
M[i,j] = RMQ(i,i+2j)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
M 0 1 2 3 401
5
234
6789
1011
Maximum of a interval is the max between the maxima of any its subintervals
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
M[i,j] = RMQ(i,i+2j)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
M 0 1 2 3 401
5
234
6789
1011
Maximum of a interval is the max between the maxima of any its subintervals
RMQ(1,7) =
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
M[i,j] = RMQ(i,i+2j)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
M 0 1 2 3 401
5
234
6789
1011
Maximum of a interval is the max between the maxima of any its subintervals
argmax(S[M[1,1+22]], S[M[7-22,7]]) = 6RMQ(1,7) =
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
M[i,j] = RMQ(i,i+2j)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
M 0 1 2 3 401
5
234
6789
1011
Maximum of a interval is the max between the maxima of any its subintervals
argmax(S[M[1,1+22]], S[M[7-22,7]]) = 6RMQ(1,7) =
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
M[i,j] = RMQ(i,i+2j)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
M 0 1 2 3 401
5
234
6789
1011
3
Maximum of a interval is the max between the maxima of any its subintervals
argmax(S[M[1,1+22]], S[M[7-22,7]]) = 6RMQ(1,7) =
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
M[i,j] = RMQ(i,i+2j)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
M 0 1 2 3 401
5
234
6789
1011
3
Maximum of a interval is the max between the maxima of any its subintervals
argmax(S[M[1,1+22]], S[M[7-22,7]]) = 6RMQ(1,7) =
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
M[i,j] = RMQ(i,i+2j)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
M 0 1 2 3 401
5
234
6789
1011
3
Maximum of a interval is the max between the maxima of any its subintervals
argmax(S[M[1,1+22]], S[M[7-22,7]]) = 6RMQ(1,7) =
6
Range Maximum Query (2)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Space: O(n log2 n) bits Query time: O(1)
M[i,j] = RMQ(i,i+2j)
Precompute the answer to every interval of size a power of 2.
There are O(log n) possible intervals starting at any position i.
M 0 1 2 3 401
5
234
6789
1011
3
Maximum of a interval is the max between the maxima of any its subintervals
argmax(S[M[1,1+22]], S[M[7-22,7]]) = 6RMQ(1,7) =
where len =⎣log (j-i+1)⎦RMQ(i,j) = argmax(S[M[i,i+2len]], S[M[j-2len,j]])
6
Range Maximum Query (3)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log n
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log n
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Use the previous solution on R!
Space: ? bits Query time: O(1)
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Use the previous solution on R!
Space: ? bits Query time: O(1) Space: O(n log n) bits Query time: O(1)
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Use the previous solution on R!
Space: ? bits Query time: O(1) Space: O(n log n) bits Query time: O(1)
RMQ(1,10) = ?
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Use the previous solution on R!
Space: ? bits Query time: O(1) Space: O(n log n) bits Query time: O(1)
RMQ(1,10) = ?
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Use the previous solution on R!
Space: ? bits Query time: O(1) Space: O(n log n) bits Query time: O(1)
RMQ(1,10) = ?
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Use the previous solution on R!
Space: ? bits Query time: O(1) Space: O(n log n) bits Query time: O(1)
RMQ(1,10) = ?
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Use the previous solution on R!
Space: ? bits Query time: O(1) Space: O(n log n) bits Query time: O(1)
RMQ(1,10) = ?
O(1) time
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Use the previous solution on R!
Space: ? bits Query time: O(1) Space: O(n log n) bits Query time: O(1)
RMQ(1,10) = ?
O(1) time
O(log n) time O(log n) time
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Use the previous solution on R!
Space: ? bits Query time: O(1) Space: O(n log n) bits Query time: O(1)
RMQ(1,10) = ?
O(1) time
O(log n) time O(log n) time
Space: O(n log n) bits Query time: O(1)
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Use the previous solution on R!
Space: ? bits Query time: O(1) Space: O(n log n) bits Query time: O(1)
RMQ(1,10) = ?
O(1) time
O(log n) time O(log n) time
Space: O(n log n) bits Query time: O(1)
O(1) time O(1) time
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Use the previous solution on R!
Space: ? bits Query time: O(1) Space: O(n log n) bits Query time: O(1)
RMQ(1,10) = ?
O(1) time
O(log n) time O(log n) time
Space: O(n log n) bits Query time: O(1)
O(1) time O(1) time
Range Maximum Query (3) Space: O(n log n) bits Query time: O(log n)
3 5 1 7 1 6 10S 9 8 7 1 40 1 2 3 4 5 6 7 8 9 10 11
log nR 5 7 10 7
Use the previous solution on R!
Space: ? bits Query time: O(1) Space: O(n log n) bits Query time: O(1)
RMQ(1,10) = ?
O(1) time
O(log n) time O(log n) time
Space: O(n log n) bits Query time: O(1)
O(1) time O(1) time
Puzzle: Use RMQ to compute LCA(u,v) queries on a tree
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P| + log n) time
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P| + log n) time O(m log σ + n log m) bits
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P| + log n) time O(m log σ + n log m) bits
Compute the top-k strings
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P| + log n) time O(m log σ + n log m) bits
Compute the top-k strings O(k log k) time
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P| + log n) time O(m log σ + n log m) bits
Compute the top-k strings O(k log k) time O(n) bits
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P| + log n) time O(m log σ + n log m) bits
Compute the top-k strings O(k log k) time O(n) bits
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P| + log n) time O(m log σ + n log m) bits
Compute the top-k strings O(k log k) time O(n) bits
3 months query log at Yahoo!
≈600 million of distinct (and clean) queries
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P| + log n) time O(m log σ + n log m) bits
Compute the top-k strings O(k log k) time O(n) bits
3 months query log at Yahoo!
≈600 million of distinct (and clean) queries
Trie requires ≈50 Gbytes!
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P| + log n) time O(m log σ + n log m) bits
Compute the top-k strings O(k log k) time O(n) bits
3 months query log at Yahoo!
≈600 million of distinct (and clean) queries
Trie requires ≈50 Gbytes!
We will see how to reduce to ≈5 Gbytes!
0
1 2
3 4 5 6
ab b
ab ca
c
a b
baacb c
Patricia trie
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m total length of strings in D
Patricia trie
0
1 2
3 4 5 6
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m total length of strings in D
Patricia trie
0
1 2
3 4 5 6
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m total length of strings in D
First symbol and length of the label
Patricia trie
0
1 2
3 4 5 6
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m total length of strings in D
Patricia trie
0
1 2
3 4 5 6
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m total length of strings in D
P = cba
Patricia trie
0
1 2
3 4 5 6
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m total length of strings in D
P = cba
Patricia trie
0
1 2
3 4 5 6
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m total length of strings in D
P = cba
Patricia trie
0
1 2
3 4 5 6
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m total length of strings in D
P = cba
Patricia trie
0
1 2
3 4 5 6
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m total length of strings in D
P = cba
Patricia trie
0
1 2
3 4 5 6
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m total length of strings in D
P = cba
Blind search. We can skip symbols and
check at the end.
Patricia trie
0
1 2
3 4 5 6
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
D = { ab, bab, bca, cab, cac, cbac, cbba }
n = |D|, m total length of strings in D
P = cba
Blind search. We can skip symbols and
check at the end.
O(n log m + m log σ) bitsO(|P| + log m) time with TST structure
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank0(7) =
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank0(7) = 4
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Rank0(7) = 4
Rank1(7) = 8 - Rank0(7) = 4
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Rank0(7) = 4
Rank1(7) = 8 - Rank0(7) = 4
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Rank0(7) = 4
Select0(j) = position of the j-th 0 in B
Rank1(7) = 8 - Rank0(7) = 4
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Rank0(7) = 4
Select0(j) = position of the j-th 0 in B
Select1(j) = position of the j-th 0 in B
Rank1(7) = 8 - Rank0(7) = 4
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Rank0(7) = 4
Select0(j) = position of the j-th 0 in B
Select1(j) = position of the j-th 0 in B
Select1(4) =
Rank1(7) = 8 - Rank0(7) = 4
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Rank0(7) = 4
Select0(j) = position of the j-th 0 in B
Select1(j) = position of the j-th 0 in B
Select1(4) = 6
Rank1(7) = 8 - Rank0(7) = 4
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Rank0(7) = 4
Select0(j) = position of the j-th 0 in B
Select1(j) = position of the j-th 0 in B
Select1(4) = 6 Space: n + O(n log log n/log n) bits Query time: O(1)
Rank1(7) = 8 - Rank0(7) = 4
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Rank0(7) = 4
Select0(j) = position of the j-th 0 in B
Select1(j) = position of the j-th 0 in B
Select1(4) = 6 Space: n + O(n log log n/log n) bits Query time: O(1)
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) =
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) =
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) =
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) = 3+
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) = 3+
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) = 3+
1/2 log n bits fit in a word. O(1) with popcount op!
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) = 3+
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) = 3+
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) = 3+
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) = 3+
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1
1 = 4
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) = 3+
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1
1 = 4
How much space?
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) = 3+
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1
1 = 4
How much space?
O(21/2 log n log n) = O(√n log n) cells,
each uses O(log log n) bits
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) = 3+
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1
1 = 4
How much space?
O(21/2 log n log n) = O(√n log n) cells,
each uses O(log log n) bits
How much space?
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) = 3+
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1
1 = 4
How much space?
O(21/2 log n log n) = O(√n log n) cells,
each uses O(log log n) bits
How much space?
O(n/log n) entries, each uses O(log n) bits
⇒ O(n) bits :-(
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
1/2 log n
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
Rank0(7) = 3+
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1
1 = 4
How much space?
O(21/2 log n log n) = O(√n log n) cells,
each uses O(log log n) bits
How much space?
O(n/log n) entries, each uses O(log n) bits
⇒ O(n) bits :-(
Space: O(n) + o(n) bits Query time: O(1)
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
log n
Groups into superblocks!
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
log n
B’’ 0 4
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
log n
B’’ 0 4Store the # of 0s up to the beginning of its superblock
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
log n
B’’ 0 4Store the # of 0s up to the beginning of its superblock
0
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
log n
B’’ 0 4Store the # of 0s up to the beginning of its superblock
0
Rank0(j) is split into 3 parts:
- # of 0s up to the superblock of j - # of 0s up to the block of j - # of 0s within the block of j
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
log n
B’’ 0 4
0
How much space?
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
log n
B’’ 0 4
0
How much space?
O(n/log2 n) entries, each uses O(log n) bits ⇒ O(n/log n) bits :-)
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
log n
B’’ 0 4
0
How much space?
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
log n
B’’ 0 4
0
How much space?
O(n/log n) entries, each uses O(log log n) bits ⇒ O(n log log n/log n) bits :-)
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
log n
B’’ 0 4
0
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
log n
B’’ 0 4
0
How to support Select in O(log n) time?
Rank/Select queries
0 0 1 1 1 0 1B 0 1 1 0 10 1 2 3 4 5 6 7 8 9 10 11
Rank0(j) = # of 0 in B[0,j]
Rank1(j) = # of 1 in B[0,j]
Space: n + O(n log log n/log n) bits Query time: O(1)
B’ 0 2 3 4
M1 2 31 2 3
1 2 2000001
0 1 1101
1 1 20101 1 1011
0 1 2100
0 0 11100 0 0111
1/2 log n
log n
B’’ 0 4
0
How to support Select in O(log n) time?
Select can be solved in O(1) time with a more difficult
approach
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
Trivial representation requires O(n log m) bits :-( Can we do better?
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
0
log
m =
log
24
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
Max number of buckets 2log n = n
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H
Max number of buckets 2log n = n
Write buckets’ cardinalities in unary
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H
Max number of buckets 2log n = n
Write buckets’ cardinalities in unary
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
Max number of buckets 2log n = n
Write buckets’ cardinalities in unary
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
Max number of buckets 2log n = n
Write buckets’ cardinalities in unary
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
Max number of buckets 2log n = n
Write buckets’ cardinalities in unary
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
Max number of buckets 2log n = n
Write buckets’ cardinalities in unary
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
Max number of buckets 2log n = n
Write buckets’ cardinalities in unary
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
Max number of buckets 2log n = n
Write buckets’ cardinalities in unary
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
19
110
011
Max number of buckets 2log n = n
Write buckets’ cardinalities in unary
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
19
110
011
012
013
Max number of buckets 2log n = n
Write buckets’ cardinalities in unary
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
19
110
011
012
013
Max number of buckets 2log n = n
Write buckets’ cardinalities in unary
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
19
110
011
012
013
1 014 15
Max number of buckets 2log n = n
Write buckets’ cardinalities in unary
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
19
110
011
012
013
1 014 15
Max number of buckets 2log n = n
A 1 for each value, at most a 0 for each bucket. <= 2n bits!
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
19
110
011
L
012
013
1 014 15
Max number of buckets 2log n = n
A 1 for each value, at most a 0 for each bucket. <= 2n bits!
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
19
110
011
L
012
013
1 014 15
Max number of buckets 2log n = n
n log (m/n) bits
A 1 for each value, at most a 0 for each bucket. <= 2n bits!
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
19
110
011
L
012
013
1 014 15
Max number of buckets 2log n = n
Access(6)
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
19
110
011
L
012
013
1 014 15
Max number of buckets 2log n = n
Access(6)1. Select1(6)-6 =3
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
19
110
011
L
012
013
1 014 15
Max number of buckets 2log n = n
Access(6)1. Select1(6)-6 =3 011 in binary!
Elias-Fano representationGiven a sequence S of n (positive) increasing integers up to m
Space: n log (m/n) + O(n) bits Access(i) in O(1)
2 3 5S1 2 3
74
115
136
147
248
0
0
0
1
1
0
0
1
0
1
0
0
1
1
1
0
1
0
1
1
0
1
1
0
1
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
log
m =
log
24
log (m/n)
log n
H 1 1 01 2 3
1 14 5
06
1 07 8
19
110
011
L
012
013
1 014 15
Max number of buckets 2log n = n
Access(6)1. Select1(6)-6 =3 011 in binary!
2. Access L in position 6*log (m/n) = 13
Elias-Fano representationGiven a sequence of n (positive) integers summing up to m
Elias-Fano representationGiven a sequence of n (positive) integers summing up to m
2 2 4 3S1 2 3 4
Elias-Fano representationGiven a sequence of n (positive) integers summing up to m
2 2 4 3S
Trivial representation requires O(n log m) bits :-( Can we do better?
1 2 3 4
Elias-Fano representationGiven a sequence of n (positive) integers summing up to m
2 2 4 3S
B
Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros
1 2 3 4
Elias-Fano representation
1 00 1
Given a sequence of n (positive) integers summing up to m
2 2 4 3S
B
Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros
1 2 3 4
Elias-Fano representation
1 00 1
1 02 3
Given a sequence of n (positive) integers summing up to m
2 2 4 3S
B
Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros
1 2 3 4
Elias-Fano representation
1 00 1
1 02 3
1 1 1 04 5 6 7
Given a sequence of n (positive) integers summing up to m
2 2 4 3S
B
Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros
1 2 3 4
Elias-Fano representation
1 00 1
1 02 3
1 1 1 04 5 6 7
1 1 08 9 10
Given a sequence of n (positive) integers summing up to m
2 2 4 3S
B
Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros
1 2 3 4
Elias-Fano representation
1 00 1
1 02 3
1 1 1 04 5 6 7
1 1 08 9 10
The i-th value of S is Select0(i)-Select0(i-1)
Given a sequence of n (positive) integers summing up to m
2 2 4 3S
B
Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros
1 2 3 4
Elias-Fano representation
1 00 1
1 02 3
1 1 1 04 5 6 7
1 1 08 9 10
The i-th value of S is Select0(i)-Select0(i-1)
Given a sequence of n (positive) integers summing up to m
2 2 4 3S
B
Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros
1 2 3 4
Elias-Fano representation
1 00 1
1 02 3
1 1 1 04 5 6 7
1 1 08 9 10
The i-th value of S is Select0(i)-Select0(i-1)
Given a sequence of n (positive) integers summing up to m
2 2 4 3S
B
Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros Select0(3)=7Select0(2)=3
1 2 3 4
Elias-Fano representation
1 00 1
1 02 3
1 1 1 04 5 6 7
1 1 08 9 10
The i-th value of S is Select0(i)-Select0(i-1)
Given a sequence of n (positive) integers summing up to m
2 2 4 3S
B
Represent the integer x by writing x-1 in unary to obtain B of m bits with n zeros Select0(3)=7Select0(2)=3
Space: n log (m/n) + O(n) bits Select0 in O(1)
1 2 3 4
Unicorn: A System for Searching the Social Graph
Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko,
Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin,
Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, Ning Zhang
Facebook, Inc.
ABSTRACTUnicorn is an online, in-memory social graph-aware index-ing system designed to search trillions of edges between tensof billions of users and entities on thousands of commodityservers. Unicorn is based on standard concepts in informa-tion retrieval, but it includes features to promote resultswith good social proximity. It also supports queries that re-quire multiple round-trips to leaves in order to retrieve ob-jects that are more than one edge away from source nodes.Unicorn is designed to answer billions of queries per day atlatencies in the hundreds of milliseconds, and it serves as aninfrastructural building block for Facebook’s Graph Searchproduct. In this paper, we describe the data model andquery language supported by Unicorn. We also describe itsevolution as it became the primary backend for Facebook’ssearch o↵erings.
1. INTRODUCTIONOver the past three years we have built and deployed a
search system called Unicorn1. Unicorn was designed withthe goal of being able to quickly and scalably search all basicstructured information on the social graph and to performcomplex set operations on the results. Unicorn resemblestraditional search indexing systems [14, 21, 22] and servesits index from memory, but it di↵ers in significant ways be-cause it was built to support social graph retrieval and socialranking from its inception. Unicorn helps products to spliceinteresting views of the social graph online for new user ex-periences.
Unicorn is the primary backend system for Facebook GraphSearch and is designed to serve billions of queries per daywith response latencies less than a few hundred milliseconds.As the product has grown and organically added more fea-tures, Unicorn has been modified to suit the product’s re-quirements. This paper is intended to serve as both a nar-
1The name was chosen because engineers joked that—muchlike the mythical quadruped—this system would solve all ofour problems and heal our woes if only it existed.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 11Copyright 2013 VLDB Endowment 2150-8097/13/09... $ 10.00.
rative of the evolution of Unicorn’s architecture, as well asdocumentation for the major features and components ofthe system.To the best of our knowledge, no other online graph re-
trieval system has ever been built with the scale of Unicornin terms of both data volume and query volume. The sys-tem serves tens of billions of nodes and trillions of edgesat scale while accounting for per-edge privacy, and it mustalso support realtime updates for all edges and nodes whileserving billions of daily queries at low latencies.This paper includes three main contributions:
• We describe how we applied common information re-trieval architectural concepts to the domain of the so-cial graph.
• We discuss key features for promoting socially relevantsearch results.
• We discuss two operators, apply and extract, whichallow rich semantic graph queries.
This paper is divided into four major parts. In Sections 2–5, we discuss the motivation for building unicorn, its design,and basic API. In Section 6, we describe how Unicorn wasadapted to serve as the backend for Facebook’s typeaheadsearch. We also discuss how to promote and rank sociallyrelevant results. In Sections 7–8, we build on the imple-mentation of typeahead to construct a new kind of searchengine. By performing multi-stage queries that traverse aseries of edges, the system is able to return complex, user-customized views of the social graph. Finally, in Sections8–10, we talk about privacy, scaling, and the system’s per-formance characteristics for typical queries.
2. THE SOCIAL GRAPHFacebook maintains a database of the inter-relationships
between the people and things in the real world, which itcalls the social graph. Like any other directed graph, itconsists of nodes signifying people and things; and edgesrepresenting a relationship between two nodes. In the re-mainder of this paper, we will use the terms node and entityinterchangeably.Facebook’s primary storage and production serving ar-
chitectures are described in [30]. Entities can be fetchedby their primary key, which is a 64-bit identifier (id). Wealso store the edges between entities. Some edges are di-rectional while others are symmetric, and there are manythousands of edge-types. The most well known edge-type
Unicorn: A System for Searching the Social Graph
Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko,
Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin,
Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, Ning Zhang
Facebook, Inc.
ABSTRACTUnicorn is an online, in-memory social graph-aware index-ing system designed to search trillions of edges between tensof billions of users and entities on thousands of commodityservers. Unicorn is based on standard concepts in informa-tion retrieval, but it includes features to promote resultswith good social proximity. It also supports queries that re-quire multiple round-trips to leaves in order to retrieve ob-jects that are more than one edge away from source nodes.Unicorn is designed to answer billions of queries per day atlatencies in the hundreds of milliseconds, and it serves as aninfrastructural building block for Facebook’s Graph Searchproduct. In this paper, we describe the data model andquery language supported by Unicorn. We also describe itsevolution as it became the primary backend for Facebook’ssearch o↵erings.
1. INTRODUCTIONOver the past three years we have built and deployed a
search system called Unicorn1. Unicorn was designed withthe goal of being able to quickly and scalably search all basicstructured information on the social graph and to performcomplex set operations on the results. Unicorn resemblestraditional search indexing systems [14, 21, 22] and servesits index from memory, but it di↵ers in significant ways be-cause it was built to support social graph retrieval and socialranking from its inception. Unicorn helps products to spliceinteresting views of the social graph online for new user ex-periences.
Unicorn is the primary backend system for Facebook GraphSearch and is designed to serve billions of queries per daywith response latencies less than a few hundred milliseconds.As the product has grown and organically added more fea-tures, Unicorn has been modified to suit the product’s re-quirements. This paper is intended to serve as both a nar-
1The name was chosen because engineers joked that—muchlike the mythical quadruped—this system would solve all ofour problems and heal our woes if only it existed.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 11Copyright 2013 VLDB Endowment 2150-8097/13/09... $ 10.00.
rative of the evolution of Unicorn’s architecture, as well asdocumentation for the major features and components ofthe system.To the best of our knowledge, no other online graph re-
trieval system has ever been built with the scale of Unicornin terms of both data volume and query volume. The sys-tem serves tens of billions of nodes and trillions of edgesat scale while accounting for per-edge privacy, and it mustalso support realtime updates for all edges and nodes whileserving billions of daily queries at low latencies.This paper includes three main contributions:
• We describe how we applied common information re-trieval architectural concepts to the domain of the so-cial graph.
• We discuss key features for promoting socially relevantsearch results.
• We discuss two operators, apply and extract, whichallow rich semantic graph queries.
This paper is divided into four major parts. In Sections 2–5, we discuss the motivation for building unicorn, its design,and basic API. In Section 6, we describe how Unicorn wasadapted to serve as the backend for Facebook’s typeaheadsearch. We also discuss how to promote and rank sociallyrelevant results. In Sections 7–8, we build on the imple-mentation of typeahead to construct a new kind of searchengine. By performing multi-stage queries that traverse aseries of edges, the system is able to return complex, user-customized views of the social graph. Finally, in Sections8–10, we talk about privacy, scaling, and the system’s per-formance characteristics for typical queries.
2. THE SOCIAL GRAPHFacebook maintains a database of the inter-relationships
between the people and things in the real world, which itcalls the social graph. Like any other directed graph, itconsists of nodes signifying people and things; and edgesrepresenting a relationship between two nodes. In the re-mainder of this paper, we will use the terms node and entityinterchangeably.Facebook’s primary storage and production serving ar-
chitectures are described in [30]. Entities can be fetchedby their primary key, which is a 64-bit identifier (id). Wealso store the edges between entities. Some edges are di-rectional while others are symmetric, and there are manythousands of edge-types. The most well known edge-type
Unicorn: A System for Searching the Social Graph
Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko,
Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin,
Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, Ning Zhang
Facebook, Inc.
ABSTRACTUnicorn is an online, in-memory social graph-aware index-ing system designed to search trillions of edges between tensof billions of users and entities on thousands of commodityservers. Unicorn is based on standard concepts in informa-tion retrieval, but it includes features to promote resultswith good social proximity. It also supports queries that re-quire multiple round-trips to leaves in order to retrieve ob-jects that are more than one edge away from source nodes.Unicorn is designed to answer billions of queries per day atlatencies in the hundreds of milliseconds, and it serves as aninfrastructural building block for Facebook’s Graph Searchproduct. In this paper, we describe the data model andquery language supported by Unicorn. We also describe itsevolution as it became the primary backend for Facebook’ssearch o↵erings.
1. INTRODUCTIONOver the past three years we have built and deployed a
search system called Unicorn1. Unicorn was designed withthe goal of being able to quickly and scalably search all basicstructured information on the social graph and to performcomplex set operations on the results. Unicorn resemblestraditional search indexing systems [14, 21, 22] and servesits index from memory, but it di↵ers in significant ways be-cause it was built to support social graph retrieval and socialranking from its inception. Unicorn helps products to spliceinteresting views of the social graph online for new user ex-periences.
Unicorn is the primary backend system for Facebook GraphSearch and is designed to serve billions of queries per daywith response latencies less than a few hundred milliseconds.As the product has grown and organically added more fea-tures, Unicorn has been modified to suit the product’s re-quirements. This paper is intended to serve as both a nar-
1The name was chosen because engineers joked that—muchlike the mythical quadruped—this system would solve all ofour problems and heal our woes if only it existed.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 11Copyright 2013 VLDB Endowment 2150-8097/13/09... $ 10.00.
rative of the evolution of Unicorn’s architecture, as well asdocumentation for the major features and components ofthe system.To the best of our knowledge, no other online graph re-
trieval system has ever been built with the scale of Unicornin terms of both data volume and query volume. The sys-tem serves tens of billions of nodes and trillions of edgesat scale while accounting for per-edge privacy, and it mustalso support realtime updates for all edges and nodes whileserving billions of daily queries at low latencies.This paper includes three main contributions:
• We describe how we applied common information re-trieval architectural concepts to the domain of the so-cial graph.
• We discuss key features for promoting socially relevantsearch results.
• We discuss two operators, apply and extract, whichallow rich semantic graph queries.
This paper is divided into four major parts. In Sections 2–5, we discuss the motivation for building unicorn, its design,and basic API. In Section 6, we describe how Unicorn wasadapted to serve as the backend for Facebook’s typeaheadsearch. We also discuss how to promote and rank sociallyrelevant results. In Sections 7–8, we build on the imple-mentation of typeahead to construct a new kind of searchengine. By performing multi-stage queries that traverse aseries of edges, the system is able to return complex, user-customized views of the social graph. Finally, in Sections8–10, we talk about privacy, scaling, and the system’s per-formance characteristics for typical queries.
2. THE SOCIAL GRAPHFacebook maintains a database of the inter-relationships
between the people and things in the real world, which itcalls the social graph. Like any other directed graph, itconsists of nodes signifying people and things; and edgesrepresenting a relationship between two nodes. In the re-mainder of this paper, we will use the terms node and entityinterchangeably.Facebook’s primary storage and production serving ar-
chitectures are described in [30]. Entities can be fetchedby their primary key, which is a 64-bit identifier (id). Wealso store the edges between entities. Some edges are di-rectional while others are symmetric, and there are manythousands of edge-types. The most well known edge-type
Unicorn: A System for Searching the Social Graph
Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko,
Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin,
Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, Ning Zhang
Facebook, Inc.
ABSTRACTUnicorn is an online, in-memory social graph-aware index-ing system designed to search trillions of edges between tensof billions of users and entities on thousands of commodityservers. Unicorn is based on standard concepts in informa-tion retrieval, but it includes features to promote resultswith good social proximity. It also supports queries that re-quire multiple round-trips to leaves in order to retrieve ob-jects that are more than one edge away from source nodes.Unicorn is designed to answer billions of queries per day atlatencies in the hundreds of milliseconds, and it serves as aninfrastructural building block for Facebook’s Graph Searchproduct. In this paper, we describe the data model andquery language supported by Unicorn. We also describe itsevolution as it became the primary backend for Facebook’ssearch o↵erings.
1. INTRODUCTIONOver the past three years we have built and deployed a
search system called Unicorn1. Unicorn was designed withthe goal of being able to quickly and scalably search all basicstructured information on the social graph and to performcomplex set operations on the results. Unicorn resemblestraditional search indexing systems [14, 21, 22] and servesits index from memory, but it di↵ers in significant ways be-cause it was built to support social graph retrieval and socialranking from its inception. Unicorn helps products to spliceinteresting views of the social graph online for new user ex-periences.
Unicorn is the primary backend system for Facebook GraphSearch and is designed to serve billions of queries per daywith response latencies less than a few hundred milliseconds.As the product has grown and organically added more fea-tures, Unicorn has been modified to suit the product’s re-quirements. This paper is intended to serve as both a nar-
1The name was chosen because engineers joked that—muchlike the mythical quadruped—this system would solve all ofour problems and heal our woes if only it existed.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 11Copyright 2013 VLDB Endowment 2150-8097/13/09... $ 10.00.
rative of the evolution of Unicorn’s architecture, as well asdocumentation for the major features and components ofthe system.To the best of our knowledge, no other online graph re-
trieval system has ever been built with the scale of Unicornin terms of both data volume and query volume. The sys-tem serves tens of billions of nodes and trillions of edgesat scale while accounting for per-edge privacy, and it mustalso support realtime updates for all edges and nodes whileserving billions of daily queries at low latencies.This paper includes three main contributions:
• We describe how we applied common information re-trieval architectural concepts to the domain of the so-cial graph.
• We discuss key features for promoting socially relevantsearch results.
• We discuss two operators, apply and extract, whichallow rich semantic graph queries.
This paper is divided into four major parts. In Sections 2–5, we discuss the motivation for building unicorn, its design,and basic API. In Section 6, we describe how Unicorn wasadapted to serve as the backend for Facebook’s typeaheadsearch. We also discuss how to promote and rank sociallyrelevant results. In Sections 7–8, we build on the imple-mentation of typeahead to construct a new kind of searchengine. By performing multi-stage queries that traverse aseries of edges, the system is able to return complex, user-customized views of the social graph. Finally, in Sections8–10, we talk about privacy, scaling, and the system’s per-formance characteristics for typical queries.
2. THE SOCIAL GRAPHFacebook maintains a database of the inter-relationships
between the people and things in the real world, which itcalls the social graph. Like any other directed graph, itconsists of nodes signifying people and things; and edgesrepresenting a relationship between two nodes. In the re-mainder of this paper, we will use the terms node and entityinterchangeably.Facebook’s primary storage and production serving ar-
chitectures are described in [30]. Entities can be fetchedby their primary key, which is a 64-bit identifier (id). Wealso store the edges between entities. Some edges are di-rectional while others are symmetric, and there are manythousands of edge-types. The most well known edge-type
101 102 103 104 1050
20
40
60
Inner Truncation Limit
RelativeDeviation
(%)
The first two plots above show that increasing the in-ner truncation limit leads to higher latency and cost, withlatency passing 100ms at approximately a limit of 5000.Query cost increases sub-linearly, but there will likely alwaysbe queries whose inner result sets will need to be truncatedto prevent latency from exceeding a reasonable threshold(say, 100ms).
This means that—barring architectural changes—we willalways have queries that require inner truncation. As men-tioned in section 7.1.2, good inner query ranking is typicallythe most useful tool. However, it is always possible to con-struct “needle-in-a-haystack” queries that elicit bad perfor-mance. Query planning and selective denormalization canhelp in many of these cases.
The final plot shows that relative deviation increases grad-ually as the truncation limit increases. Larger outer querieshave higher variance per shard as perceived by the rack ag-gregator. This is likely because network queuing delays be-come more likely as query size increases.
11. RELATED WORKIn the last few years, sustained progress has been made in
scaling graph search via the SPARQL language, and someof these systems focus, like Unicorn, on real-time responseto ad-hoc queries [24, 12]. Where Unicorn seeks to handle afinite number of well understood edges and scale to trillionsof edges, SPARQL engines intend to handle arbitrary graphstructure and complex queries, and scale to tens of millionsof edges [24]. That said, it is interesting to note that thecurrent state-of-the-art in performance is based on variantsof a structure from [12] in which data is vertically partitionedand stored in a column store. This data structure uses aclustered B+ tree of (subject-id, value) for each propertyand emphasizes merge-joins, and thus seems to be evolvingtoward a posting-list-style architecture with fast intersectionand union as supported by Unicorn.
Recently, work has been done on adding keyword searchto SPARQL-style queries [28, 15], leading to the integrationof posting lists retrieval with structured indices. This workis currently at much smaller scale than Unicorn. Startingwith XML data graphs, work has been done to search forsubgraphs based on keywords (see, e.g. [16, 20]). The focusof this work is returning a subgraph, while Unicorn returnsan ordered list of entities.
In some work [13, 18], the term ’social search’ in fact refersto a system that supports question-answering, and the socialgraph is used to predict which person can answer a question.
While some similar ranking features may be used, Unicornsupports queries about the social graph rather than via thegraph, a fundamentally di↵erent application.
12. CONCLUSIONIn this paper, we have described the evolution of a graph-
based indexing system and how we added features to make ituseful for a consumer product that receives billions of queriesper week. Our main contributions are showing how manyinformation retrieval concepts can be put to work for serv-ing graph queries, and we described a simple yet practicalmultiple round-trip algorithm for serving even more com-plex queries where edges are not denormalized and insteadmust be traversed sequentially.
13. ACKNOWLEDGMENTSWe would like to acknowledge the product engineers who
use Unicorn as clients every day, and we also want to thankour production engineers who deal with our bugs and helpextinguish our fires. Also, thanks to Cameron Marlow forcontributing graphics and Dhruba Borthakur for conversa-tions about Facebook’s storage architecture. Several Face-book engineers contributed helpful feedback to this paperincluding Philip Bohannon and Chaitanya Mishra. BretTaylor was instrumental in coming up with many of theoriginal ideas for Unicorn and its design. We would alsolike to acknowledge the following individuals for their con-tributions to Unicorn: Spencer Ahrens, Neil Blakey-Milner,Udeepta Bordoloi, Chris Bray, Jordan DeLong, Shuai Ding,Jeremy Lilley, Jim Norris, Feng Qian, Nathan Schrenk, San-jeev Singh, Ryan Stout, Evan Stratford, Scott Straw, andSherman Ye.
Open SourceAll Unicorn index server and aggregator code is written inC++. Unicorn relies extensively on modules in Facebook’s“Folly” Open Source Library [5]. As part of the e↵ort ofreleasing Graph Search, we have open-sourced a C++ im-plementation of the Elias-Fano index representation [31] aspart of Folly.
14. REFERENCES[1] Apache Hadoop. http://hadoop.apache.org/.[2] Apache Thrift. http://thrift.apache.org/.[3] Description of HHVM (PHP Virtual machine).
https://www.facebook.com/note.php?note_id=
10150415177928920.[4] Facebook Graph Search.
https://www.facebook.com/about/graphsearch.[5] Folly GitHub Repository.
http://github.com/facebook/folly.[6] HPHP for PHP GitHub Repository.
http://github.com/facebook/hiphop-php.[7] Open Compute Project.
http://www.opencompute.org/.[8] Scribe Facebook Blog Post. http://www.
facebook.com/note.php?note_id=32008268919.[9] Scribe GitHub Repository.
http://www.github.com/facebook/scribe.[10] The Life of a Typeahead Query. http:
//www.facebook.com/notes/facebook-engineering/
the-life-of-a-typeahead-query/389105248919.
Unicorn: A System for Searching the Social Graph
Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko,
Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin,
Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, Ning Zhang
Facebook, Inc.
ABSTRACTUnicorn is an online, in-memory social graph-aware index-ing system designed to search trillions of edges between tensof billions of users and entities on thousands of commodityservers. Unicorn is based on standard concepts in informa-tion retrieval, but it includes features to promote resultswith good social proximity. It also supports queries that re-quire multiple round-trips to leaves in order to retrieve ob-jects that are more than one edge away from source nodes.Unicorn is designed to answer billions of queries per day atlatencies in the hundreds of milliseconds, and it serves as aninfrastructural building block for Facebook’s Graph Searchproduct. In this paper, we describe the data model andquery language supported by Unicorn. We also describe itsevolution as it became the primary backend for Facebook’ssearch o↵erings.
1. INTRODUCTIONOver the past three years we have built and deployed a
search system called Unicorn1. Unicorn was designed withthe goal of being able to quickly and scalably search all basicstructured information on the social graph and to performcomplex set operations on the results. Unicorn resemblestraditional search indexing systems [14, 21, 22] and servesits index from memory, but it di↵ers in significant ways be-cause it was built to support social graph retrieval and socialranking from its inception. Unicorn helps products to spliceinteresting views of the social graph online for new user ex-periences.
Unicorn is the primary backend system for Facebook GraphSearch and is designed to serve billions of queries per daywith response latencies less than a few hundred milliseconds.As the product has grown and organically added more fea-tures, Unicorn has been modified to suit the product’s re-quirements. This paper is intended to serve as both a nar-
1The name was chosen because engineers joked that—muchlike the mythical quadruped—this system would solve all ofour problems and heal our woes if only it existed.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 11Copyright 2013 VLDB Endowment 2150-8097/13/09... $ 10.00.
rative of the evolution of Unicorn’s architecture, as well asdocumentation for the major features and components ofthe system.To the best of our knowledge, no other online graph re-
trieval system has ever been built with the scale of Unicornin terms of both data volume and query volume. The sys-tem serves tens of billions of nodes and trillions of edgesat scale while accounting for per-edge privacy, and it mustalso support realtime updates for all edges and nodes whileserving billions of daily queries at low latencies.This paper includes three main contributions:
• We describe how we applied common information re-trieval architectural concepts to the domain of the so-cial graph.
• We discuss key features for promoting socially relevantsearch results.
• We discuss two operators, apply and extract, whichallow rich semantic graph queries.
This paper is divided into four major parts. In Sections 2–5, we discuss the motivation for building unicorn, its design,and basic API. In Section 6, we describe how Unicorn wasadapted to serve as the backend for Facebook’s typeaheadsearch. We also discuss how to promote and rank sociallyrelevant results. In Sections 7–8, we build on the imple-mentation of typeahead to construct a new kind of searchengine. By performing multi-stage queries that traverse aseries of edges, the system is able to return complex, user-customized views of the social graph. Finally, in Sections8–10, we talk about privacy, scaling, and the system’s per-formance characteristics for typical queries.
2. THE SOCIAL GRAPHFacebook maintains a database of the inter-relationships
between the people and things in the real world, which itcalls the social graph. Like any other directed graph, itconsists of nodes signifying people and things; and edgesrepresenting a relationship between two nodes. In the re-mainder of this paper, we will use the terms node and entityinterchangeably.Facebook’s primary storage and production serving ar-
chitectures are described in [30]. Entities can be fetchedby their primary key, which is a 64-bit identifier (id). Wealso store the edges between entities. Some edges are di-rectional while others are symmetric, and there are manythousands of edge-types. The most well known edge-type
101 102 103 104 1050
20
40
60
Inner Truncation Limit
RelativeDeviation
(%)
The first two plots above show that increasing the in-ner truncation limit leads to higher latency and cost, withlatency passing 100ms at approximately a limit of 5000.Query cost increases sub-linearly, but there will likely alwaysbe queries whose inner result sets will need to be truncatedto prevent latency from exceeding a reasonable threshold(say, 100ms).
This means that—barring architectural changes—we willalways have queries that require inner truncation. As men-tioned in section 7.1.2, good inner query ranking is typicallythe most useful tool. However, it is always possible to con-struct “needle-in-a-haystack” queries that elicit bad perfor-mance. Query planning and selective denormalization canhelp in many of these cases.
The final plot shows that relative deviation increases grad-ually as the truncation limit increases. Larger outer querieshave higher variance per shard as perceived by the rack ag-gregator. This is likely because network queuing delays be-come more likely as query size increases.
11. RELATED WORKIn the last few years, sustained progress has been made in
scaling graph search via the SPARQL language, and someof these systems focus, like Unicorn, on real-time responseto ad-hoc queries [24, 12]. Where Unicorn seeks to handle afinite number of well understood edges and scale to trillionsof edges, SPARQL engines intend to handle arbitrary graphstructure and complex queries, and scale to tens of millionsof edges [24]. That said, it is interesting to note that thecurrent state-of-the-art in performance is based on variantsof a structure from [12] in which data is vertically partitionedand stored in a column store. This data structure uses aclustered B+ tree of (subject-id, value) for each propertyand emphasizes merge-joins, and thus seems to be evolvingtoward a posting-list-style architecture with fast intersectionand union as supported by Unicorn.
Recently, work has been done on adding keyword searchto SPARQL-style queries [28, 15], leading to the integrationof posting lists retrieval with structured indices. This workis currently at much smaller scale than Unicorn. Startingwith XML data graphs, work has been done to search forsubgraphs based on keywords (see, e.g. [16, 20]). The focusof this work is returning a subgraph, while Unicorn returnsan ordered list of entities.
In some work [13, 18], the term ’social search’ in fact refersto a system that supports question-answering, and the socialgraph is used to predict which person can answer a question.
While some similar ranking features may be used, Unicornsupports queries about the social graph rather than via thegraph, a fundamentally di↵erent application.
12. CONCLUSIONIn this paper, we have described the evolution of a graph-
based indexing system and how we added features to make ituseful for a consumer product that receives billions of queriesper week. Our main contributions are showing how manyinformation retrieval concepts can be put to work for serv-ing graph queries, and we described a simple yet practicalmultiple round-trip algorithm for serving even more com-plex queries where edges are not denormalized and insteadmust be traversed sequentially.
13. ACKNOWLEDGMENTSWe would like to acknowledge the product engineers who
use Unicorn as clients every day, and we also want to thankour production engineers who deal with our bugs and helpextinguish our fires. Also, thanks to Cameron Marlow forcontributing graphics and Dhruba Borthakur for conversa-tions about Facebook’s storage architecture. Several Face-book engineers contributed helpful feedback to this paperincluding Philip Bohannon and Chaitanya Mishra. BretTaylor was instrumental in coming up with many of theoriginal ideas for Unicorn and its design. We would alsolike to acknowledge the following individuals for their con-tributions to Unicorn: Spencer Ahrens, Neil Blakey-Milner,Udeepta Bordoloi, Chris Bray, Jordan DeLong, Shuai Ding,Jeremy Lilley, Jim Norris, Feng Qian, Nathan Schrenk, San-jeev Singh, Ryan Stout, Evan Stratford, Scott Straw, andSherman Ye.
Open SourceAll Unicorn index server and aggregator code is written inC++. Unicorn relies extensively on modules in Facebook’s“Folly” Open Source Library [5]. As part of the e↵ort ofreleasing Graph Search, we have open-sourced a C++ im-plementation of the Elias-Fano index representation [31] aspart of Folly.
14. REFERENCES[1] Apache Hadoop. http://hadoop.apache.org/.[2] Apache Thrift. http://thrift.apache.org/.[3] Description of HHVM (PHP Virtual machine).
https://www.facebook.com/note.php?note_id=
10150415177928920.[4] Facebook Graph Search.
https://www.facebook.com/about/graphsearch.[5] Folly GitHub Repository.
http://github.com/facebook/folly.[6] HPHP for PHP GitHub Repository.
http://github.com/facebook/hiphop-php.[7] Open Compute Project.
http://www.opencompute.org/.[8] Scribe Facebook Blog Post. http://www.
facebook.com/note.php?note_id=32008268919.[9] Scribe GitHub Repository.
http://www.github.com/facebook/scribe.[10] The Life of a Typeahead Query. http:
//www.facebook.com/notes/facebook-engineering/
the-life-of-a-typeahead-query/389105248919.
Unicorn: A System for Searching the Social Graph
Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko,
Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin,
Sriram Sankar, Guanghao Shen, Gintaras Woss, Chao Yang, Ning Zhang
Facebook, Inc.
ABSTRACTUnicorn is an online, in-memory social graph-aware index-ing system designed to search trillions of edges between tensof billions of users and entities on thousands of commodityservers. Unicorn is based on standard concepts in informa-tion retrieval, but it includes features to promote resultswith good social proximity. It also supports queries that re-quire multiple round-trips to leaves in order to retrieve ob-jects that are more than one edge away from source nodes.Unicorn is designed to answer billions of queries per day atlatencies in the hundreds of milliseconds, and it serves as aninfrastructural building block for Facebook’s Graph Searchproduct. In this paper, we describe the data model andquery language supported by Unicorn. We also describe itsevolution as it became the primary backend for Facebook’ssearch o↵erings.
1. INTRODUCTIONOver the past three years we have built and deployed a
search system called Unicorn1. Unicorn was designed withthe goal of being able to quickly and scalably search all basicstructured information on the social graph and to performcomplex set operations on the results. Unicorn resemblestraditional search indexing systems [14, 21, 22] and servesits index from memory, but it di↵ers in significant ways be-cause it was built to support social graph retrieval and socialranking from its inception. Unicorn helps products to spliceinteresting views of the social graph online for new user ex-periences.
Unicorn is the primary backend system for Facebook GraphSearch and is designed to serve billions of queries per daywith response latencies less than a few hundred milliseconds.As the product has grown and organically added more fea-tures, Unicorn has been modified to suit the product’s re-quirements. This paper is intended to serve as both a nar-
1The name was chosen because engineers joked that—muchlike the mythical quadruped—this system would solve all ofour problems and heal our woes if only it existed.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 11Copyright 2013 VLDB Endowment 2150-8097/13/09... $ 10.00.
rative of the evolution of Unicorn’s architecture, as well asdocumentation for the major features and components ofthe system.To the best of our knowledge, no other online graph re-
trieval system has ever been built with the scale of Unicornin terms of both data volume and query volume. The sys-tem serves tens of billions of nodes and trillions of edgesat scale while accounting for per-edge privacy, and it mustalso support realtime updates for all edges and nodes whileserving billions of daily queries at low latencies.This paper includes three main contributions:
• We describe how we applied common information re-trieval architectural concepts to the domain of the so-cial graph.
• We discuss key features for promoting socially relevantsearch results.
• We discuss two operators, apply and extract, whichallow rich semantic graph queries.
This paper is divided into four major parts. In Sections 2–5, we discuss the motivation for building unicorn, its design,and basic API. In Section 6, we describe how Unicorn wasadapted to serve as the backend for Facebook’s typeaheadsearch. We also discuss how to promote and rank sociallyrelevant results. In Sections 7–8, we build on the imple-mentation of typeahead to construct a new kind of searchengine. By performing multi-stage queries that traverse aseries of edges, the system is able to return complex, user-customized views of the social graph. Finally, in Sections8–10, we talk about privacy, scaling, and the system’s per-formance characteristics for typical queries.
2. THE SOCIAL GRAPHFacebook maintains a database of the inter-relationships
between the people and things in the real world, which itcalls the social graph. Like any other directed graph, itconsists of nodes signifying people and things; and edgesrepresenting a relationship between two nodes. In the re-mainder of this paper, we will use the terms node and entityinterchangeably.Facebook’s primary storage and production serving ar-
chitectures are described in [30]. Entities can be fetchedby their primary key, which is a 64-bit identifier (id). Wealso store the edges between entities. Some edges are di-rectional while others are symmetric, and there are manythousands of edge-types. The most well known edge-type
101 102 103 104 1050
20
40
60
Inner Truncation Limit
RelativeDeviation
(%)
The first two plots above show that increasing the in-ner truncation limit leads to higher latency and cost, withlatency passing 100ms at approximately a limit of 5000.Query cost increases sub-linearly, but there will likely alwaysbe queries whose inner result sets will need to be truncatedto prevent latency from exceeding a reasonable threshold(say, 100ms).
This means that—barring architectural changes—we willalways have queries that require inner truncation. As men-tioned in section 7.1.2, good inner query ranking is typicallythe most useful tool. However, it is always possible to con-struct “needle-in-a-haystack” queries that elicit bad perfor-mance. Query planning and selective denormalization canhelp in many of these cases.
The final plot shows that relative deviation increases grad-ually as the truncation limit increases. Larger outer querieshave higher variance per shard as perceived by the rack ag-gregator. This is likely because network queuing delays be-come more likely as query size increases.
11. RELATED WORKIn the last few years, sustained progress has been made in
scaling graph search via the SPARQL language, and someof these systems focus, like Unicorn, on real-time responseto ad-hoc queries [24, 12]. Where Unicorn seeks to handle afinite number of well understood edges and scale to trillionsof edges, SPARQL engines intend to handle arbitrary graphstructure and complex queries, and scale to tens of millionsof edges [24]. That said, it is interesting to note that thecurrent state-of-the-art in performance is based on variantsof a structure from [12] in which data is vertically partitionedand stored in a column store. This data structure uses aclustered B+ tree of (subject-id, value) for each propertyand emphasizes merge-joins, and thus seems to be evolvingtoward a posting-list-style architecture with fast intersectionand union as supported by Unicorn.
Recently, work has been done on adding keyword searchto SPARQL-style queries [28, 15], leading to the integrationof posting lists retrieval with structured indices. This workis currently at much smaller scale than Unicorn. Startingwith XML data graphs, work has been done to search forsubgraphs based on keywords (see, e.g. [16, 20]). The focusof this work is returning a subgraph, while Unicorn returnsan ordered list of entities.
In some work [13, 18], the term ’social search’ in fact refersto a system that supports question-answering, and the socialgraph is used to predict which person can answer a question.
While some similar ranking features may be used, Unicornsupports queries about the social graph rather than via thegraph, a fundamentally di↵erent application.
12. CONCLUSIONIn this paper, we have described the evolution of a graph-
based indexing system and how we added features to make ituseful for a consumer product that receives billions of queriesper week. Our main contributions are showing how manyinformation retrieval concepts can be put to work for serv-ing graph queries, and we described a simple yet practicalmultiple round-trip algorithm for serving even more com-plex queries where edges are not denormalized and insteadmust be traversed sequentially.
13. ACKNOWLEDGMENTSWe would like to acknowledge the product engineers who
use Unicorn as clients every day, and we also want to thankour production engineers who deal with our bugs and helpextinguish our fires. Also, thanks to Cameron Marlow forcontributing graphics and Dhruba Borthakur for conversa-tions about Facebook’s storage architecture. Several Face-book engineers contributed helpful feedback to this paperincluding Philip Bohannon and Chaitanya Mishra. BretTaylor was instrumental in coming up with many of theoriginal ideas for Unicorn and its design. We would alsolike to acknowledge the following individuals for their con-tributions to Unicorn: Spencer Ahrens, Neil Blakey-Milner,Udeepta Bordoloi, Chris Bray, Jordan DeLong, Shuai Ding,Jeremy Lilley, Jim Norris, Feng Qian, Nathan Schrenk, San-jeev Singh, Ryan Stout, Evan Stratford, Scott Straw, andSherman Ye.
Open SourceAll Unicorn index server and aggregator code is written inC++. Unicorn relies extensively on modules in Facebook’s“Folly” Open Source Library [5]. As part of the e↵ort ofreleasing Graph Search, we have open-sourced a C++ im-plementation of the Elias-Fano index representation [31] aspart of Folly.
14. REFERENCES[1] Apache Hadoop. http://hadoop.apache.org/.[2] Apache Thrift. http://thrift.apache.org/.[3] Description of HHVM (PHP Virtual machine).
https://www.facebook.com/note.php?note_id=
10150415177928920.[4] Facebook Graph Search.
https://www.facebook.com/about/graphsearch.[5] Folly GitHub Repository.
http://github.com/facebook/folly.[6] HPHP for PHP GitHub Repository.
http://github.com/facebook/hiphop-php.[7] Open Compute Project.
http://www.opencompute.org/.[8] Scribe Facebook Blog Post. http://www.
facebook.com/note.php?note_id=32008268919.[9] Scribe GitHub Repository.
http://www.github.com/facebook/scribe.[10] The Life of a Typeahead Query. http:
//www.facebook.com/notes/facebook-engineering/
the-life-of-a-typeahead-query/389105248919.
Succinct representation of trees (1)[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
3
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
3
0 2 2
2 20 0
0 0 0 0
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3 0 2 2
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3 0 2 2 0 0 2 2
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3 0 2 2 0 0 2 2 0 0 0 0
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3 0 2 2 0 0 2 2 0 0 0 0A tree is uniquely determined by the
degree sequence
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3 0 2 2 0 0 2 2 0 0 0 0A tree is uniquely determined by the
degree sequence
How reconstruct the tree?
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3 0 2 2 0 0 2 2 0 0 0 0
It still requires O(n log n) bits :-(
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3 0 2 2 0 0 2 2 0 0 0 0
It still requires O(n log n) bits :-(
Solution: write them in unary
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3 0 2 2 0 0 2 2 0 0 0 0
It still requires O(n log n) bits :-(
Solution: write them in unary
B
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3 0 2 2 0 0 2 2 0 0 0 0
It still requires O(n log n) bits :-(
Solution: write them in unary
B 1110
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3 0 2 2 0 0 2 2 0 0 0 0
It still requires O(n log n) bits :-(
Solution: write them in unary
B 1110 0 110 110 0 0 110 110 0 0 0 0
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3 0 2 2 0 0 2 2 0 0 0 0
It still requires O(n log n) bits :-(
Solution: write them in unary
B 1110 0 110 110 0 0 110 110 0 0 0 0
B takes 2n - 1 bits! For each node we have a 0 and a 1
(but the root)
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
Trivial: O(n log n) bits Best: 2n bits
Write the degree sequence in level order
3
0 2 2
2 20 0
0 0 0 0
D 3 0 2 2 0 0 2 2 0 0 0 0
It still requires O(n log n) bits :-(
Solution: write them in unary
B 1110 0 110 110 0 0 110 110 0 0 0 0
B takes 2n - 1 bits! For each node we have a 0 and a 1
(but the root)
Can we navigate the tree?
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
Succinct representation of trees (1)
B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
Succinct representation of trees (1)
B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
Succinct representation of trees (1)
B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
Succinct representation of trees (1)
B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
1
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
1 2 3 4 5 6 7 8 9 10 11 12
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) =
1 2 3 4 5 6 7 8 9 10 11 12
pos(5)
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
pos(5)
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ?
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
firstChild(3)
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
y= Select0(3)+1=8
firstChild(3)
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
y= Select0(3)+1=8
firstChild(3)
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
y= Select0(3)+1=8
firstChild(3)
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
y= Select0(3)+1=8
firstChild(3) = 8-3 = 5
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
y= Select0(3)+1=8
degree(x) = ?
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
y= Select0(3)+1=8
degree(x) = ? Select0(x+1) - (Select0(x) + 1)
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
y= Select0(3)+1=8
degree(x) = ? Select0(x+1) - (Select0(x) + 1)
degree(3)
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
y= Select0(3)+1=8
degree(x) = ? Select0(x+1) - (Select0(x) + 1)
degree(3) = Select0(4) - (Select0(3) + 1)
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
y= Select0(3)+1=8
degree(x) = ? Select0(x+1) - (Select0(x) + 1)
Select0(4)=10
degree(3) = Select0(4) - (Select0(3) + 1)
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
degree(x) = ? Select0(x+1) - (Select0(x) + 1)
parent(x) =
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
degree(x) = ? Select0(x+1) - (Select0(x) + 1)
parent(x) = Rank0(pos(x))
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
degree(x) = ? Select0(x+1) - (Select0(x) + 1)
parent(x) = Rank0(pos(x))All these operations in O(1) time!
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
degree(x) = ? Select0(x+1) - (Select0(x) + 1)
parent(x) = Rank0(pos(x))
subtreeSize(x) = ?
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
degree(x) = ? Select0(x+1) - (Select0(x) + 1)
parent(x) = Rank0(pos(x))
subtreeSize(x) = ?
Not efficient! Nodes of the subtree are
spread in B
Succinct representation of trees (1)
1 0B 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0
[LOUDS - Level-order unary degree sequence]
1
3 42
7 85 6
9 10 11 12
pos(x) = Select1(x)
1 2 3 4 5 6 7 8 9 10 11 12
firstChild(x) = ? y = Select0(x)+1// start of x’s children in B
if B[y] == 0return -1 // is a leaf
return y-x // Rank1(y)
else
degree(x) = ? Select0(x+1) - (Select0(x) + 1)
parent(x) = Rank0(pos(x))
subtreeSize(x) = ?
Succinct representation of trees (2)[BP - Balanced parenthesis]
Succinct representation of trees (2)[BP - Balanced parenthesis]
Succinct representation of trees (2)[BP - Balanced parenthesis]
1
Succinct representation of trees (2)[BP - Balanced parenthesis]
1
Succinct representation of trees (2)[BP - Balanced parenthesis]
1
2
Succinct representation of trees (2)[BP - Balanced parenthesis]
1
2
Succinct representation of trees (2)[BP - Balanced parenthesis]
1
32
Succinct representation of trees (2)[BP - Balanced parenthesis]
1
32
Succinct representation of trees (2)[BP - Balanced parenthesis]
1
32
Succinct representation of trees (2)[BP - Balanced parenthesis]
1
32
Succinct representation of trees (2)[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
A tree is uniquely determined by the vector B
How reconstruct the tree?
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
subtree of 6
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
subtree of 6( and ) balance themselves
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) =
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (
findClose(7)
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
enclose(10)
= position of the parent of x in B
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
enclose(10)
= position of the parent of x in B
They can be implemented in O(1) time.
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
enclose(10)
= position of the parent of x in Bparent(x) =
They can be implemented in O(1) time.
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
enclose(10)
= position of the parent of x in Bparent(x) = Rank ( (enclose(x))
They can be implemented in O(1) time.
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
= position of the parent of x in B
firstChild(x) =parent(x) = Rank ( (enclose(x))
They can be implemented in O(1) time.
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
= position of the parent of x in B
firstChild(x) =parent(x) = Rank ( (enclose(x))
y = pos(x)+1if B[y] == )
return -1 // is a leaf
return Rank ( (y)else
They can be implemented in O(1) time.
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
= position of the parent of x in B
firstChild(x) =parent(x) = Rank ( (enclose(x))
y = pos(x)+1if B[y] == )
return -1 // is a leaf
return Rank ( (y)else
sibling(x) =
They can be implemented in O(1) time.
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
= position of the parent of x in B
firstChild(x) =parent(x) = Rank ( (enclose(x))
y = pos(x)+1if B[y] == )
return -1 // is a leaf
return Rank ( (y)else
sibling(x) = Rank ( (findClose(x)+1) (if any)
They can be implemented in O(1) time.
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
= position of the parent of x in B
firstChild(x) =parent(x) = Rank ( (enclose(x))
y = pos(x)+1if B[y] == )
return -1 // is a leaf
return Rank ( (y)else
sibling(x) = Rank ( (findClose(x)+1) (if any)subtreeSize(x) =
They can be implemented in O(1) time.
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
= position of the parent of x in B
firstChild(x) =parent(x) = Rank ( (enclose(x))
y = pos(x)+1if B[y] == )
return -1 // is a leaf
return Rank ( (y)else
sibling(x) = Rank ( (findClose(x)+1) (if any)subtreeSize(x) =
(findClose(x) - pos(x) + 1) / 2
They can be implemented in O(1) time.
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
= position of the parent of x in B
firstChild(x) =parent(x) = Rank ( (enclose(x))
y = pos(x)+1if B[y] == )
return -1 // is a leaf
return Rank ( (y)else
sibling(x) = Rank ( (findClose(x)+1) (if any)subtreeSize(x) =
(findClose(x) - pos(x) + 1) / 2depth(x) =
They can be implemented in O(1) time.
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
= position of the parent of x in B
firstChild(x) =parent(x) = Rank ( (enclose(x))
y = pos(x)+1if B[y] == )
return -1 // is a leaf
return Rank ( (y)else
sibling(x) = Rank ( (findClose(x)+1) (if any)subtreeSize(x) =
(findClose(x) - pos(x) + 1) / 2depth(x) =
Rank ( (pos(x)) - Rank ) (pos(x))
They can be implemented in O(1) time.
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
= position of the parent of x in B
firstChild(x) =parent(x) = Rank ( (enclose(x))
y = pos(x)+1if B[y] == )
return -1 // is a leaf
return Rank ( (y)else
sibling(x) = Rank ( (findClose(x)+1) (if any)subtreeSize(x) =
(findClose(x) - pos(x) + 1) / 2depth(x) =
Rank ( (pos(x)) - Rank ) (pos(x))
They can be implemented in O(1) time.
All these operations in O(1) time!
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
= position of the parent of x in B
firstChild(x) =parent(x) = Rank ( (enclose(x))
y = pos(x)+1if B[y] == )
return -1 // is a leaf
return Rank ( (y)else
sibling(x) = Rank ( (findClose(x)+1) (if any)subtreeSize(x) =
(findClose(x) - pos(x) + 1) / 2
degree(x) = ?
depth(x) =Rank ( (pos(x)) - Rank ) (pos(x))
They can be implemented in O(1) time.
Succinct representation of trees (2)
B ( ( ) ( ( ) ( ) ) ( ( ( ) ( ) ) ( ( ) ( ) ) ) )
[BP - Balanced parenthesis]
1
32 6
7 104 5
8 9 11 12
1 2 3 4 5 6 7 8 9 10 11 12 12 34 5 678 9 1011 12
pos(x) = Select ( (x)findClose(x) = returns the position of ) matching x-th (enclose(x) = returns the position of ( enclosing x-th (
= position of the parent of x in B
firstChild(x) =parent(x) = Rank ( (enclose(x))
y = pos(x)+1if B[y] == )
return -1 // is a leaf
return Rank ( (y)else
sibling(x) = Rank ( (findClose(x)+1) (if any)subtreeSize(x) =
(findClose(x) - pos(x) + 1) / 2
degree(x) = ?
Quite inefficient! Solved by repeatedly calling sibling to scan x’s children.
depth(x) =Rank ( (pos(x)) - Rank ) (pos(x))
They can be implemented in O(1) time.
Succinct representation of trees (3)[DFUDS - Depth First Unary Degree Sequence
1
32 6
7 104 5
8 9 11 12
Succinct representation of trees (3)[DFUDS - Depth First Unary Degree Sequence
1
32 6
7 104 5
8 9 11 12
Succinct representation of trees (3)[DFUDS - Depth First Unary Degree Sequence
1
32 6
7 104 5
8 9 11 12
Succinct representation of trees (3)[DFUDS - Depth First Unary Degree Sequence
1
32 6
7 104 5
8 9 11 12
Succinct representation of trees (3)
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
[DFUDS - Depth First Unary Degree Sequence
1
32 6
7 104 5
8 9 11 12
1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12
Succinct representation of trees (3)
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
[DFUDS - Depth First Unary Degree Sequence
1
32 6
7 104 5
8 9 11 12
1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12
subtree of 6
Succinct representation of trees (3)
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
[DFUDS - Depth First Unary Degree Sequence
1
32 6
7 104 5
8 9 11 12
1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12
children of 1
Succinct representation of trees (3)
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
[DFUDS - Depth First Unary Degree Sequence
1
32 6
7 104 5
8 9 11 12
1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12
pos(6)
pos(x) = Select ) (x) // closing )
id(pos) = Rank ) (pos)
⇋
Succinct representation of trees (3)
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
[DFUDS - Depth First Unary Degree Sequence
1
32 6
7 104 5
8 9 11 12
1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12
degree(x) =
pos(x) = Select ) (x) // closing )
id(pos) = Rank ) (pos)
⇋children of 6
Succinct representation of trees (3)
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
[DFUDS - Depth First Unary Degree Sequence
1
32 6
7 104 5
8 9 11 12
1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12
degree(x) = Select ) (Rank ) (pos(x))) - x
pos(x) = Select ) (x) // closing )
id(pos) = Rank ) (pos)
⇋children of 6
Succinct representation of trees (3)
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
[DFUDS - Depth First Unary Degree Sequence
1
32 6
7 104 5
8 9 11 12
1 2 3 546 107 98 11 1291 32 4 875 6 10 11 12
child(x,i), parent(x), subtreeSize(x), …
degree(x) = Select ) (Rank ) (pos(x))) - x
pos(x) = Select ) (x) // closing )
id(pos) = Rank ) (pos)
⇋
All these operations in O(1) time!
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
* a b cac ba cb baS
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
c?* a b cac ba cb baS
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
b?* a b cac ba cb baS
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
a?* a b cac ba cb baS
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Search P in O(|P|) time!
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Search P in O(|P|) time!
* 2 1 221 11 11 22L
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Search P in O(|P|) time!
* 2 1 221 11 11 22LElias-Fano representation: n log(m/n) + O(n) bits and
O(1) time access.
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Search P in O(|P|) time!
S, L and B use m log σ + n log (m/n)+ O(n) bits
* 2 1 221 11 11 22L
Patricia trie with DFUDS
B ( ( ( ( ) ) ( ( ) ) ) ( ( ) ( ( ) ) ) ( ( ) ) )
1
32 6
7 104 5
8 9 11 12
1 2 3 546 87 910 11 1291 32 4 875 6 10 11 12
a,2 b,1
a,2 c,2
c,1
a,1 b,1
b,2a,2b,1 c,1
P = cba
* a b cac ba cb baS
Search P in O(|P|) time!
S, L and B use m log σ + n log (m/n)+ O(n) bits
* 2 1 221 11 11 22L
Project: implement a compact TST
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P|) time
Compute the top-k strings O(k log k) time O(n) bits
3 months query log at Yahoo!
≈600 million of distinct (and clean) queries
Trie requires ≈50 Gbytes!
We will see how to reduce to ≈5 Gbytes!
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P|) time m log σ + n log (m/n) + O(n) bits
Compute the top-k strings O(k log k) time O(n) bits
3 months query log at Yahoo!
≈600 million of distinct (and clean) queries
Trie requires ≈50 Gbytes!
We will see how to reduce to ≈5 Gbytes!
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P|) time m log σ + n log (m/n) + O(n) bits
Compute the top-k strings O(k log k) time O(n) bits
3 months query log at Yahoo!
≈600 million of distinct (and clean) queries
Trie requires ≈50 Gbytes!
We will see how to reduce to ≈5 Gbytes!
to be compared with O(m log σ + n log m) bits
D = { ab (7), bab (2), bca (1), cab (4), cac (1), cbac (6), cbba (2) }
n = |D|, m total length of strings in D
7
2 1
4 1 6 2
ab b
ab ca
c
a b
baacb c
Summary
Find the node “prefixed” by P O(|P|) time m log σ + n log (m/n) + O(n) bits
Compute the top-k strings O(k log k) time O(n) bits
3 months query log at Yahoo!
≈600 million of distinct (and clean) queries
Trie requires ≈50 Gbytes!
We will see how to reduce to ≈5 Gbytes!
to be compared with O(m log σ + n log m) bits
(n=) 1 billion of strings of average length 64
m=64*109 symbols and m/n = 64