Data structures: time, I/Os, entropy, joules
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
Our driving moral...
Big steps come from theory
... but do NOT forget practice ;-)
Strings... why?
Ubiquitous: any datum is a sequence of bits, hence a string
Spur new problems in many areas:
Geometry: String-similarity search ↔ points in high-dim space and NN-search
Geometry: Lower/upper bounds to indexing via reductions to geo-problems
Graphs: Doc-doc similarity graph, ubiquitous in Text/Web mining
Graphs: Query-log graphs, edge iff 2 queries clicked on the same result page
Data compression: Shortest paths on char-based weighted graphs [Ferragina et al, SODA 09, ESA 09]
(String-)Dictionary Problem
Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix-searches for a pattern P.
Exact search ⇒ Hashing [Mitzenmacher, ESA invited ’09]
(Compacted) Trie
[Figure: compacted trie built on the dictionary below, with nodes labeled by string depth and edges labeled by substrings]
systile syzygetic syzygial syzygy szaibelyite szczecin szomo
[Fredkin, CACM 1960]
Performance:
• Search ≈ O(|P|) time
• Space ≈ O(K + N)
Dominated the string-matching scene in the ’80s-’90s with its suffix-version: the Suffix Tree
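A minimal sketch of the prefix search that a trie supports. For brevity it assumes an uncompacted trie stored as nested Python dicts; the names `trie_insert`, `prefix_search` and `END` are illustrative and do not come from the slides. A compacted trie would additionally merge unary paths into single edges.

```python
END = ""  # hypothetical end-of-string marker; never used as a real character key

def trie_insert(root, s):
    """Insert string s, creating one node per character."""
    node = root
    for ch in s:
        node = node.setdefault(ch, {})
    node[END] = True

def prefix_search(root, p):
    """Return all stored strings having p as a prefix, in sorted order."""
    node = root
    for ch in p:                      # O(|P|) descent
        if ch not in node:
            return []
        node = node[ch]
    out, stack = [], [(node, p)]
    while stack:                      # collect every string below the reached node
        cur, prefix = stack.pop()
        for key, child in cur.items():
            if key == END:
                out.append(prefix)
            else:
                stack.append((child, prefix + key))
    return sorted(out)

WORDS = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
root = {}
for w in WORDS:
    trie_insert(root, w)
print(prefix_search(root, "syzyg"))
```

The descent costs one dictionary lookup per pattern character, which is where the "random memory accesses" of a pointer-based trie come from.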
Timeline: theory and practice...
’60: Trie
’70-’80: Suffix Tree
’90
What about Software Engineers??
... But in practice…
• Search: random memory accesses
• Space: len + pointers + strings
What did systems implement?
They used the compacted trie, of course, but with 2 additional concerns raised by large data.
1° issue: space concern
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
3345%
0 http://checkmate.com/All/Natural/Washcloth.html...
…. systile syzygetic syzygial syzygy …. (shared-prefix lengths: 2 5 5)
Front Coding
[Bender et al., PODS 2006] [Ferragina et al., PODS 2008]
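Front coding as illustrated above can be sketched as follows. This is a minimal sketch, assuming the strings are already sorted and that the first string of every bucket is stored in full, as in the bucketed example of the slides; `front_encode`, `front_decode` and `bucket_size` are illustrative names.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def front_encode(sorted_strings, bucket_size):
    """Encode each string as (shared prefix length with predecessor, remaining suffix)."""
    out = []
    for i, s in enumerate(sorted_strings):
        if i % bucket_size == 0:
            out.append((0, s))            # bucket head: stored uncompressed
        else:
            k = lcp_len(sorted_strings[i - 1], s)
            out.append((k, s[k:]))
    return out

def front_decode(pairs):
    """Rebuild the strings by copying k characters from the previous string."""
    out, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        out.append(prev)
    return out

WORDS = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
encoded = front_encode(WORDS, bucket_size=4)
print(encoded)
```

Restarting the encoding at every bucket head lets a bucket be decoded without touching the buckets before it.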
2° issue: hierarchical memory
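[Figure: two-level memory model with CPU, internal memory of size M, and a hard disk (HD) accessed by track in blocks of size B]
Count I/Os
Spatial locality or temporal locality: fewer and faster I/Os
Caching: fewer I/Os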
2-level indexing
Internal memory: CT (compacted trie) on a sample: systile, szaibelyite
Disk: …. 0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo ….
2 advantages:
• Search ≈ typically 1 I/O
• Space ≈ Front-coding over buckets
2 limitations:
• Sampling rate ≈ lengths of sampled strings
• Trade-off ≈ speed vs space (because of bucket size)
(Prefix) B-tree
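A minimal sketch of 2-level indexing, under simplifying assumptions: the in-memory sample is a sorted list searched by binary search (the slides use a compacted trie on the sample), buckets are plain lists rather than front-coded blocks, and each bucket read is counted as one I/O. The class name `TwoLevelIndex` and its fields are illustrative.

```python
from bisect import bisect_left

class TwoLevelIndex:
    def __init__(self, sorted_strings, bucket_size):
        # "Disk": consecutive buckets of the sorted dictionary (assumed non-empty).
        self.buckets = [sorted_strings[i:i + bucket_size]
                        for i in range(0, len(sorted_strings), bucket_size)]
        # "Internal memory": the first string of every bucket.
        self.sample = [b[0] for b in self.buckets]
        self.ios = 0

    def prefix_search(self, p):
        """Return all strings with prefix p, counting one I/O per bucket read."""
        # The last bucket whose head is smaller than p may still hold matches.
        i = max(bisect_left(self.sample, p) - 1, 0)
        out = []
        while True:
            self.ios += 1                                   # read bucket i
            out += [s for s in self.buckets[i] if s.startswith(p)]
            i += 1
            # Matches are contiguous: continue only if the next head also matches.
            if i == len(self.buckets) or not self.sample[i].startswith(p):
                return out

WORDS = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
idx = TwoLevelIndex(WORDS, bucket_size=4)
hits = idx.prefix_search("syzyg")
print(hits, idx.ios)
```

A larger bucket shrinks the in-memory sample but makes each bucket scan longer, which is the speed vs space trade-off listed among the limitations.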
Timeline: theory and practice...
’60: Trie
’70-’80: Suffix Tree
’90: 2-level indexing (space + hierarchical memory)
Do we need to trade space for I/Os?
1995: String B-tree
An old idea: Patricia Trie
[Figure: compacted trie over the same dictionary, with node depths and edge labels]
[Morrison, J.ACM 1968]
A new search
…. systile syzygetic syzygial syzygy szaibelyite szczecin szomo ….
[Figure: Patricia trie over these strings; each node keeps its string depth and each edge keeps only its first character]
Three-phase search, P = syzyyea
Search(P):
• Phase 1: tree navigation
• Phase 2: compute LCP
• Phase 3: tree navigation
The mismatch g < y gives P's position. Only 1 string is checked.
Trie space ≈ #strings, NOT their length
[Ferragina-Grossi, J.ACM 1999]
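The three phases can be sketched as below. This is a minimal sketch and not the authors' implementation: it assumes the dictionary is sorted and prefix-free (no string is a prefix of another), and it returns the lexicographic position of P, i.e. the number of dictionary strings smaller than P. The names `Node`, `build` and `blind_rank` are illustrative.

```python
class Node:
    """Patricia node: string depth, leaf range [lo, hi], first chars of the edges."""
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.children = {}

def lcp(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def build(strings, lo, hi):
    if lo == hi:
        return Node(lo, hi, len(strings[lo]))
    # In a sorted range, the LCP of first and last is the LCP of the whole range.
    node = Node(lo, hi, lcp(strings[lo], strings[hi]))
    start = lo
    for i in range(lo + 1, hi + 2):
        if i > hi or strings[i][node.depth] != strings[start][node.depth]:
            node.children[strings[start][node.depth]] = build(strings, start, i - 1)
            start = i
    return node

def blind_rank(root, strings, p):
    # Phase 1: blind descent, comparing only the branching characters.
    node = root
    while node.children and node.depth < len(p) and p[node.depth] in node.children:
        node = node.children[p[node.depth]]
    s = strings[node.lo]              # the single string that gets checked
    # Phase 2: LCP between P and that string.
    l = lcp(p, s)
    # Phase 3: walk down again to the first node of depth >= l.
    node = root
    while node.depth < l:
        node = node.children[p[node.depth]]
    if l == len(p):                   # P is a prefix of every string below node
        return node.lo
    if node.depth > l:                # mismatch inside an edge: all go left or right
        return node.lo if p[l] < s[l] else node.hi + 1
    if not node.children:             # the checked string is a proper prefix of P
        return node.hi + 1
    # Mismatch exactly at a branching node: skip the children with smaller chars.
    return node.lo + sum(c.hi - c.lo + 1
                         for ch, c in node.children.items() if ch < p[l])

WORDS = ["systile", "syzygetic", "syzygial", "syzygy",
         "szaibelyite", "szczecin", "szomo"]
root = build(WORDS, 0, len(WORDS) - 1)
print(blind_rank(root, WORDS, "syzyyea"))
```

For P = syzyyea the descent ends at syzygetic, the LCP is 4, and since g < y the pattern falls after the whole subtree, that is, between syzygy and szaibelyite.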
The String B-tree
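[Figure: String B-tree with three levels of string pointers, each node indexed by a Patricia trie (PT)]
Leaf level: 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23
Middle level: 29 2 26 13 20 25 6 18 3 14 21 23
Root level: 29 13 20 18 3 23
Search(P):
• O((p/B) log_B K) I/Os to find the lexicographic position of P: O(log_B K) levels, with 1 string checked per level at O(p/B) I/Os
• + O(occ/B) I/Os
It is dynamic...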
[Ferragina-Grossi, J.ACM 1999]
> 15 US patents cite it!!
[Handbook of Comp. Biology, 2009]
Knuth, vol 3°, pag. 489: “elegant”
I/O-aware algorithms & data structures
[CACM 1988]
[2006]
Huge literature !!
I/Os were the main concern
Timeline: theory and practice...
’60: Trie
’70-’80: Suffix Tree
’90: 2-level indexing
1995: String B-tree
1999
[Figure: memory hierarchy: CPU registers, L1 and L2 cache, RAM, HD, net]
Parameter-free solutions: anywhere, anytime, anyway... I/O-optimal!!
Cache-oblivious Algo. and Data Str.
See chap. by Arge, Brodal, Fagerberg
Not just 2 memory levels
Space
Some precious achievements...
• Cache-oblivious trie: static dictionary of strings [Brodal et al, SODA 2006]
• Cache-oblivious String B-tree: dynamic dictionary of strings [Bender et al, PODS 2006]
• Cache-oblivious tree mapping: Split-and-Refine, which applies to any B-fixed tree partitioning [Alstrup et al, manuscript 2003]; worst-case solution [Demaine et al, manuscript 2004]
Patricia Trie
Timeline: theory and practice...
’60: Trie
’70-’80: Suffix Tree
’90: 2-level indexing
1995: String B-tree
1999: Cache-oblivious data structures (not just 2 memory levels)
Compressed data structures (space)
Can we “automate” and “guarantee” the process ?
A challenging question [Ken Church, AT&T 1995]
Software engineers use many “squeezing heuristics” that compress data and still support fast access to them.
Aka: Compressed self-indexes
Opportunistic Data Structures with Applications
P. Ferragina, G. Manzini
Space for text + (full-text) index ≈ space of the compressed text (Hk)
Query/decompression time ≈ theoretically (quasi-)optimal
...now, J.ACM 2005
The big (unconscious) step...
[Burrows-Wheeler, 1994]
Let us be given a text T = mississippi#

Rotations of T:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (first character, middle part, last character):
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i
Can we compress it ?
T = mississippi#, bwt(T) = ipssm#pissii (the last column of the sorted rows)
bzip2 = BWT + other simple compressors
Suffix Array
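The link between the sorted rows, the suffix array and bwt(T) can be sketched as follows. This is a minimal, quadratic-time sketch that assumes T ends with a unique terminator `#` smaller than every other character, so that sorting suffixes and sorting rotations give the same order; the function names are illustrative.

```python
def suffix_array(t):
    """Starting positions of the suffixes of t in lexicographic order (naive sort)."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def bwt(t):
    """Last column of the sorted rotations: the character preceding each sorted suffix."""
    return "".join(t[i - 1] for i in suffix_array(t))

def inverse_bwt(last, end="#"):
    """Rebuild t by repeatedly prepending the last column and re-sorting the rows."""
    rows = [""] * len(last)
    for _ in range(len(last)):
        rows = sorted(last[i] + rows[i] for i in range(len(last)))
    return next(r for r in rows if r.endswith(end))

T = "mississippi#"
print(suffix_array(T))
print(bwt(T))
print(inverse_bwt(bwt(T)))
```

Practical tools use linear-time suffix sorting and an LF-mapping based inversion instead of these naive sorts.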
From practice to theory...
FM-index = BWT is searchable
...or Suffix Array is compressible
• Space = |T| Hk + o(|T|) bits
• Search(P) = O(p + occ * polylog(|T|))
Nowadays tons of papers: theory & experiments [Navarro-Makinen, ACM Comp. Surveys 2007]
[Ferragina-Manzini, Focs ‘00]
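"BWT is searchable" can be illustrated with backward search, which counts the occurrences of P using only bwt(T) and character counts. This is a minimal sketch, not the compressed FM-index itself: the rank queries are answered by scanning the BWT string, whereas the real index answers them on a compressed representation. `C`, `bwt` and `count_occurrences` are illustrative names, and T is assumed to end with a unique smallest terminator `#`.

```python
from collections import Counter

def bwt(t):
    """BWT via a naive suffix sort (assumes a unique smallest terminator)."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    return "".join(t[i - 1] for i in sa)

def count_occurrences(last, p):
    """Number of occurrences of p in the text whose BWT is `last`."""
    # C[c] = number of text characters strictly smaller than c.
    C, total = {}, 0
    for c, n in sorted(Counter(last).items()):
        C[c] = total
        total += n
    lo, hi = 0, len(last)             # half-open range of sorted rows
    for c in reversed(p):             # process the pattern right to left
        if c not in C:
            return 0
        lo = C[c] + last[:lo].count(c)    # rank of c before row lo
        hi = C[c] + last[:hi].count(c)    # rank of c before row hi
        if lo >= hi:
            return 0
    return hi - lo

T = "mississippi#"
L = bwt(T)
print(count_occurrences(L, "ssi"))
```

Each pattern character triggers two rank queries, so counting takes O(p) rank operations regardless of the text length.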
Compressed & Searchable data formats
Integer Sets: SODA 2002 … FOCS 2008, STACS 2009
Texts: FOCS 2000, SODA 2003, 04, SODA 2007, SPIRE 2007, CPM 2008, CPM 2010, ICALP 2010
Trees: SODA 2002, SODA 2007, ICALP 2007, SWAT 2008, ICALP 2009, SODA 2010
Graphs: DCC 2001, WWW 2004, ISAAC 2007, ESA 2008, FOCS 2009
Labeled Trees: SODA 2002, FOCS 2005, WWW 2006, SODA 2007, ICDE 2010
Functions: ICALP 2003, 04, SODA 2004, ICALP 2008, ESA 2009, LATIN 2010
Point Sets: SODA 2003, TALG 2007, WADS 2009, SODA 2009
Images: DCC 2008
[December 2003] [January 2005]
ACM J. on Experimental Algorithmics, 2009
> 10^3 times faster than Smith-W.
What about the Web? [Ferragina-Manzini, ACM WSDM 2010]
> 10^2 times faster than SOAP & Maq
Where we are nowadays
’60: Trie
’70-’80: Suffix Tree
’90: 2-level indexing
1995: String B-tree
1999: Cache-oblivious data structures
Compressed data structures
Something is known... yet very preliminary [PODS ’08, Navarro, Vitter, ...] [Belazzougui et al, this ESA]
What else...
Solid-state disks: no mechanical parts ... very fast reads, but slow writes & wear leveling [E. Gal, S. Toledo, ACM Comp. Surv., 2005] [Ajwani et al, WEA 2009]
Self-adjusting or weighted design: the time of the operations depends on some (un/known) distribution
Challenging: no pointers, self-adjusting (performance) vs compression (space)
A bigger challenge: from micro to macro !
IEEE Computer, 2007
Approach #1 (engineering oriented)
News: Proper system components + specific algorithms
Sanders & Meyer’s groups, IEEE Conf. on Green Comp. 2010 [SSDisks + Atom + Sort]
Approach #2 (Manage resources)
Goal: Develop on-line algorithms that dynamically manage power by trading off performance, energy and reliability
Susanne Albers, Comm. ACM 2010
Approach #3 (models and algorithms)
“Algorithmics offers benefits that extend far beyond TCS into the design of systems.”
IEEE Computer, 2009
Workshop in IEEE Conf. on Green Comp. 2010
Sometimes energy is a primary resource!
Energy-aware Algo + DS?
Locality pays off
Memory-level impacts
I/Os and compression are obviously important, BUT here there is a new twist
MIPS per Watt? Battery life!!
Who cares whether your application:
1. is y% slower than optimal, but more energy efficient?
2. occupies x% more space than optimal, but decompression is faster?
Approach in a principled way
MIPS per Watt? Battery life!!
Stay tuned: Algorithm Library for Mobile Phones
Idea: multi-objective optimization in data-structure design
[Figure: systems and services: HyperTable, Cassandra, BigTable (2006), HBase - Hadoop, Cosmos; real-time search, Q&A social search, knowledge search]
Many ingredients
• Items are graphs, vectors, strings, …
• Number and size are VERY large
• Involve many resources to be optimized:
  Time (speed/patience)
  Space (#disks/management costs)
  Bandwidth (speed/€)
  Energy (€)
Multi-objective optimization in data-structure design!
That’s all !
Look at my paper in the proceedings