Data structures: time, I/Os, entropy, joules Paolo Ferragina Dipartimento di Informatica Università di Pisa


Page 1: Data structures: time, I/Os, entropy,  joules

Data structures: time, I/Os, entropy, joules

Paolo Ferragina

Dipartimento di Informatica, Università di Pisa

Page 2: Data structures: time, I/Os, entropy,  joules

Our driving moral...

Big steps come from theory

... but do NOT forget practice ;-)

Page 3: Data structures: time, I/Os, entropy,  joules

Strings... why?

Ubiquitous: any datum is a sequence of bits, hence a string

Spur new problems in many areas:

• Geometry: string-similarity search as points in high-dim space and NN-search; lower/upper bounds to indexing via reductions to geometric problems

• Graphs: doc-doc similarity graph, ubiquitous in Text/Web mining; query-log graphs, with an edge iff 2 queries clicked on the same result page

• Data compression: shortest paths on char-based weighted graphs [Ferragina et al., SODA 09, ESA 09]

Page 4: Data structures: time, I/Os, entropy,  joules

(String-)Dictionary Problem

Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix-searches for a pattern P.

Exact search: Hashing [Mitzenmacher, ESA invited '09]
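As a baseline for the problem statement above, a minimal sketch of prefix search over a plain sorted array. This is not a structure from the slides: the name `prefix_search` and the use of binary search are assumptions made for illustration. It costs O(|P| log K) character comparisons plus the size of the output.

```python
from bisect import bisect_left

# The example dictionary used throughout the slides, in lexicographic order.
D = ["systile", "syzygetic", "syzygial", "syzygy",
     "szaibelyite", "szczecin", "szomo"]

def prefix_search(sorted_dict, P):
    """Return all strings of sorted_dict that have P as a prefix."""
    # The strings prefixed by P are contiguous and start at the
    # lexicographic position of P.
    i = bisect_left(sorted_dict, P)
    out = []
    while i < len(sorted_dict) and sorted_dict[i].startswith(P):
        out.append(sorted_dict[i])
        i += 1
    return out

print(prefix_search(D, "syzyg"))
```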

Page 5: Data structures: time, I/Os, entropy,  joules

(Compacted) Trie

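[Figure: compacted trie over the seven strings below, with edges labeled by substrings (e.g. stile, zyg, etic, aibelyite, czecin, omo)]

systile syzygetic syzygial syzygy szaibelyite szczecin szomo

(2; 3,5): an edge label stored as a pointer, here characters 3 to 5 of string 2 (syzygetic), i.e. "zyg"

[Fredkin, CACM 1960]

Performance:

• Search ≈ O(|P|) time

• Space ≈ O(K + N)

Dominated the string-matching scene in the '80s-'90s with its suffix-version: the Suffix Tree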

Page 6: Data structures: time, I/Os, entropy,  joules

Timeline: theory and practice...

• '60: Trie

• '70-'80: Suffix Tree

• '90

What about Software Engineers ??

Page 7: Data structures: time, I/Os, entropy,  joules

(Compacted) Trie

[Figure: the compacted trie over the seven strings below, as on Page 5]

systile syzygetic syzygial syzygy szaibelyite szczecin szomo

(2; 3,5)

[Fredkin, CACM 1960]

Performance:

• Search ≈ O(|P|) time

• Space ≈ O(K + N)

... But in practice…

• Search: random memory accesses

• Space: len + pointers + strings

Dominated the string-matching scene in the '80s-'90s with its suffix-version: the Suffix Tree

Page 8: Data structures: time, I/Os, entropy,  joules

What did systems implement?

Used the Compacted trie, of course, but with 2 other concerns because of large data

Page 9: Data structures: time, I/Os, entropy,  joules

1° issue: space concern

Sorted URLs:

http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
...

Front-coded (length of the prefix shared with the previous string, then the remaining suffix):

0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html

3345%

0 http://checkmate.com/All/Natural/Washcloth.html
...

…. systile syzygetic syzygial syzygy …. share 2, 5, 5 characters with their predecessors

FrontCoding [Bender et al., PODS 2006; Ferragina et al., PODS 2008]
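Front coding as shown above can be sketched in a few lines. This is a minimal illustration, not the implementation of the cited papers; the function names `front_encode` and `front_decode` are assumptions. Each string is stored as the length of the prefix it shares with its predecessor, followed by the remaining suffix.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def front_encode(sorted_strings):
    """Encode each string as (shared prefix length, remaining suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        k = lcp(prev, s)
        out.append((k, s[k:]))
        prev = s
    return out

def front_decode(pairs):
    """Rebuild the strings by copying k characters from the previous one."""
    out, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        out.append(prev)
    return out

words = ["systile", "syzygetic", "syzygial", "syzygy"]
print(front_encode(words))
```

Decoding is sequential: a string can only be rebuilt after its predecessor, which is why practical layouts restart the encoding (shared length 0) at the beginning of every bucket.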

Page 10: Data structures: time, I/Os, entropy,  joules

2° issue: hierarchical memory

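[Figure: two-level memory model: CPU with an internal memory of size M, and a hard disk (HD) read one track of size B at a time]

• Spatial locality or temporal locality: fewer and faster I/Os

• Caching: fewer I/Os

• Count I/Os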

Page 11: Data structures: time, I/Os, entropy,  joules

….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo….

systile szaibelyite (the sampled strings: the first string of each bucket)

2-level indexing:

• Internal memory: CT (compacted trie) on a sample

• Disk: front-coded buckets

2 advantages:

• Search ≈ typically 1 I/O

• Space ≈ Front-coding over buckets

2 limitations:

• Sampling rate ≈ lengths of sampled strings

• Trade-off ≈ speed vs space (because of bucket size)

(Prefix) B-tree
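The 2-level scheme above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `TwoLevelIndex` and the counter `ios` are hypothetical, and the in-memory level is a sorted list searched with `bisect` instead of the compacted trie used on the slide. Every bucket that gets decoded counts as one I/O.

```python
from bisect import bisect_right

def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def front_encode(strings):
    out, prev = [], ""
    for s in strings:
        k = lcp(prev, s)
        out.append((k, s[k:]))
        prev = s
    return out

def front_decode(pairs):
    out, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        out.append(prev)
    return out

class TwoLevelIndex:
    def __init__(self, sorted_strings, bucket_size):
        chunks = [sorted_strings[i:i + bucket_size]
                  for i in range(0, len(sorted_strings), bucket_size)]
        self.samples = [c[0] for c in chunks]            # "internal memory"
        self.buckets = [front_encode(c) for c in chunks]  # "disk"
        self.ios = 0                                      # buckets fetched so far

    def prefix_search(self, P):
        # Last bucket whose first string is <= P: matches cannot start earlier.
        b = max(bisect_right(self.samples, P) - 1, 0)
        out = []
        while b < len(self.buckets):
            self.ios += 1
            block = front_decode(self.buckets[b])
            out.extend(s for s in block if s.startswith(P))
            last = block[-1]
            if last > P and not last.startswith(P):
                break  # we are past the range of strings prefixed by P
            b += 1
        return out

D = ["systile", "syzygetic", "syzygial", "syzygy",
     "szaibelyite", "szczecin", "szomo"]
index = TwoLevelIndex(D, bucket_size=4)
print(index.prefix_search("szczecin"), index.ios)
```

With bucket size 4 the two buckets are exactly those of the slide. An exact lookup touches one bucket, while a prefix search whose answers reach a bucket boundary has to read the next bucket as well.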

Page 12: Data structures: time, I/Os, entropy,  joules

Timeline: theory and practice...

• '60: Trie

• '70-'80: Suffix Tree

• '90: 2-level indexing (Space + Hierarchical Memory)

• 1995: String B-tree

Do we need to trade space by I/Os ?

Page 13: Data structures: time, I/Os, entropy,  joules

An old idea: Patricia Trie

[Figure: Patricia trie over the dictionary systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo]

[Morrison, J.ACM 1968]

Page 14: Data structures: time, I/Os, entropy,  joules

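A new search

…. systile syzygetic syzygial syzygy szaibelyite szczecin szomo ….

[Figure: Patricia trie over the strings above; each node stores only its string depth (0, 1, 2, 2, 5) and each edge only its first character (s, y, z, e, i, y, a, c, o)]

Search(P):

• Phase 1: tree navigation

• Phase 2: compute LCP

• Phase 3: tree navigation

Three-phase search: P = syzyyea

• Nodes visited have depths 0, 1, 2, 5

• Only 1 string is checked; the mismatch is g < y

• Output: P's position

Trie Space ≈ #strings, NOT their length

[Ferragina-Grossi, J.ACM 1999]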
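The three phases above can be sketched as follows. This is a simplified illustration, not the code of the cited paper: the names `Node`, `build` and `blind_search` are assumptions, the dictionary is assumed sorted and without duplicates, and the root here starts directly at depth 1 because all strings share the first character. The trie keeps only depths and branching characters; the strings themselves are touched once, in Phase 2.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class Node:
    def __init__(self, lo, hi, depth):
        self.lo, self.hi = lo, hi  # the node spans the strings S[lo:hi]
        self.depth = depth         # length of the prefix they all share
        self.children = {}         # branching character -> child node

def build(S, lo, hi):
    """Patricia trie over the sorted, distinct strings S[lo:hi]."""
    if hi - lo == 1:
        return Node(lo, hi, len(S[lo]))
    node = Node(lo, hi, lcp(S[lo], S[hi - 1]))  # sorted => LCP of the range
    d, i = node.depth, lo
    while i < hi:
        c, j = S[i][d:d + 1], i  # '' marks a string ending exactly at depth d
        while j < hi and S[j][d:d + 1] == c:
            j += 1
        node.children[c] = build(S, i, j)
        i = j
    return node

def blind_search(root, S, P):
    """Lexicographic position of P in S, reading a single string of S."""
    # Phase 1: navigate using only depths and branching characters.
    node = root
    while node.children:
        child = node.children.get(P[node.depth:node.depth + 1])
        if child is None:
            break
        node = child
    s = S[node.lo]  # the only string that is read
    # Phase 2: compute the LCP between P and that string.
    l = lcp(P, s)
    # Phase 3: walk down again, stopping where the mismatch lies.
    node = root
    while node.children and node.depth < l:
        node = node.children[s[node.depth:node.depth + 1]]
    c = P[l:l + 1]
    if node.children and node.depth == l:
        # Mismatch on the branching character: place P among the children.
        for ch in sorted(node.children):
            if ch > c:
                return node.children[ch].lo
        return node.hi
    # Mismatch inside an edge: P falls before or after the whole subtree.
    return node.lo if c <= s[l:l + 1] else node.hi

S = ["systile", "syzygetic", "syzygial", "syzygy",
     "szaibelyite", "szczecin", "szomo"]
root = build(S, 0, len(S))
print(blind_search(root, S, "syzyyea"))
```

For P = syzyyea, Phase 1 ends at syzygetic, Phase 2 finds an LCP of 4 with mismatch g < y, and Phase 3 places P after the whole syzyg-subtree, i.e. between syzygy and szaibelyite.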

Page 15: Data structures: time, I/Os, entropy,  joules

The String B-tree

[Figure: String B-tree over string pointers; every node is indexed by a Patricia trie (PT)]

Leaves: 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23

Middle level: 29 2 26 13 20 25 6 18 3 14 21 23

Root: 29 13 20 18 3 23

Search(P):

• O((p/B) log_B K) I/Os

• O(occ/B) I/Os

1 string checked: O(p/B)

+

O(log_B K) levels

Output: lexicographic position of P

It is dynamic...

[Ferragina-Grossi, J.ACM 1999]

> 15 US-patents cite it !! [Handbook of Comp. Biology, 2009]

Knuth, vol. 3, p. 489: "elegant"

Page 16: Data structures: time, I/Os, entropy,  joules

I/O-aware algorithms & data structures

[CACM 1988]

[2006]

Huge literature !!

I/Os were the main concern

Page 17: Data structures: time, I/Os, entropy,  joules

Timeline: theory and practice...

• '60: Trie

• '70-'80: Suffix Tree

• '90: 2-level indexing

• 1995: String B-tree

• 1999: Cache-oblivious Algo. and Data Str.

Not just 2 memory levels

[Figure: memory hierarchy: CPU registers, L1, L2 cache, RAM, HD, net]

Parameter-free solutions: anywhere, anytime, anyway... I/O-optimal !!

See chap. by Arge, Brodal, Fagerberg

Space

Page 18: Data structures: time, I/Os, entropy,  joules

Some precious achievements...

• Cache-oblivious trie: static dictionary of strings [Brodal et al, SODA 2006]

• Cache-oblivious String B-tree: dynamic dictionary of strings [Bender et al, PODS 2006]

• Cache-oblivious tree mapping: Split-and-Refine that applies to any B-fixed tree partitioning [Alstrup et al, manuscript 2003]; worst-case solution [Demaine et al, manuscript 2004]

Patricia Trie

Page 19: Data structures: time, I/Os, entropy,  joules

Timeline: theory and practice...

• '60: Trie

• '70-'80: Suffix Tree

• '90: 2-level indexing

• 1995: String B-tree

• 1999: Cache-oblivious data structures (not just 2 memory levels)

• Compressed data structures (space)

Page 20: Data structures: time, I/Os, entropy,  joules

Can we “automate” and “guarantee” the process ?

A challenging question [Ken Church, AT&T 1995]

Software engineers use many “squeezing heuristics” that compress data and still support fast access to them

Page 21: Data structures: time, I/Os, entropy,  joules

Aka: Compressed self-indexes

Opportunistic Data Structures with Applications

P. Ferragina, G. Manzini

Space for text + (full-text) index ≈ space of the compressed text (H_k)

Query/Decompression time: theoretically (quasi-)optimal

...now, J.ACM 2005

Page 22: Data structures: time, I/Os, entropy,  joules

The big (unconscious) step...

[Burrows-Wheeler, 1994]

We are given a text T = mississippi#

Rotations of T:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows (first character, middle part, last character):

# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i

Can we compress it ?

Page 23: Data structures: time, I/Os, entropy,  joules

The big (unconscious) step...

[Burrows-Wheeler, 1994]

T = mississippi#

[Figure: the rotations of T and the sorted matrix, as on Page 22; the last column of the sorted matrix is bwt(T) = ipssm#pissii]

bzip2 = BWT + other simple compressors

Page 24: Data structures: time, I/Os, entropy,  joules

The big (unconscious) step...

[Burrows-Wheeler, 1994]

T = mississippi#

[Figure: the rotations of T and the sorted matrix, as on Page 22; the last column of the sorted matrix is bwt(T) = ipssm#pissii]

bzip2 = BWT + other simple compressors

Suffix Array: the order of the sorted rows is the order of the sorted suffixes of T
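The link between the sorted matrix and the Suffix Array can be checked directly. A minimal sketch: the function names are assumptions, and both constructions are deliberately naive (they sort whole strings), whereas real tools build the suffix array with specialized algorithms. It relies on the terminator # being unique and smaller than every other character.

```python
def bwt_by_rotations(T):
    """Last column of the lexicographically sorted rotations of T."""
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)

def suffix_array(T):
    """Starting positions of the suffixes of T, in lexicographic order."""
    return sorted(range(len(T)), key=lambda i: T[i:])

def bwt_from_suffix_array(T, sa):
    """bwt[i] is the character preceding the i-th smallest suffix."""
    # For position 0, T[-1] wraps around to the terminator.
    return "".join(T[i - 1] for i in sa)

T = "mississippi#"
sa = suffix_array(T)
print(bwt_by_rotations(T))
print(sa)
print(bwt_from_suffix_array(T, sa))
```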

Page 25: Data structures: time, I/Os, entropy,  joules

From practice to theory...

FM-index = BWT is searchable

...or Suffix Array is compressible

• Space = |T| H_k + o(|T|) bits

• Search(P) = O(p + occ * polylog(|T|))

[Ferragina-Manzini, FOCS '00]

Nowadays tons of papers: theory & experiments [Navarro-Mäkinen, ACM Comp. Surveys 2007]

[Figure: sorted rotation matrix of T = mississippi#, whose last column is bwt(T) = ipssm#pissii]
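Why "BWT is searchable" can be illustrated with backward search, which counts the occurrences of P using only bwt(T). This is a minimal sketch under stated assumptions: the function names are hypothetical, and the rank queries are answered by naive scans with `str.count`, while an actual FM-index answers them with compressed rank structures to reach the bounds above.

```python
def bwt(T):
    """Last column of the sorted rotations of T (T ends with a unique '#')."""
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)

def count_occurrences(L, P):
    """Number of occurrences of P in the text whose BWT is L."""
    # C[c] = number of characters of the text that are smaller than c.
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    lo, hi = 0, len(L)  # rows [lo, hi) are prefixed by the suffix of P read so far
    for c in reversed(P):
        if c not in C:
            return 0
        lo = C[c] + L[:lo].count(c)  # rank of c in L[0:lo]
        hi = C[c] + L[:hi].count(c)  # rank of c in L[0:hi]
        if lo >= hi:
            return 0
    return hi - lo

L = bwt("mississippi#")
print(count_occurrences(L, "ssi"))
```

Each step extends the matched suffix of P by one character to the left, so the loop runs p times, independently of the number of occurrences.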

Page 26: Data structures: time, I/Os, entropy,  joules

Compressed & Searchable data formats

• Integer Sets: SODA 2002, …, FOCS 2008, STACS 2009

• Texts: FOCS 2000, SODA 2003, 04, SODA 2007, SPIRE 2007, CPM 2008, CPM 2010, ICALP 2010

• Trees: SODA 2002, SODA 2007, ICALP 2007, SWAT 2008, ICALP 2009, SODA 2010

• Graphs: DCC 2001, WWW 2004, ISAAC 2007, ESA 2008, FOCS 2009

• Labeled Trees: SODA 2002, FOCS 2005, WWW 2006, SODA 2007, ICDE 2010

• Functions: ICALP 2003, 04, SODA 2004, ICALP 2008, ESA 2009, LATIN 2010

• Point Sets: SODA 2003, TALG 2007, WADS 2009, SODA 2009

• Images: DCC 2008

Page 27: Data structures: time, I/Os, entropy,  joules

[December 2003] [January 2005]

Page 28: Data structures: time, I/Os, entropy,  joules

ACM J. on Experimental Algorithmics, 2009

Page 29: Data structures: time, I/Os, entropy,  joules

> 10^3 faster than Smith-W.

> 10^2 faster than SOAP & Maq

What about the Web ? [Ferragina-Manzini, ACM WSDM 2010]

Page 30: Data structures: time, I/Os, entropy,  joules

Where we are nowadays

• '60: Trie

• '70-'80: Suffix Tree

• '90: 2-level indexing

• 1995: String B-tree

• 1999: Cache-oblivious data structures

• Compressed data structures

Something is known... yet very preliminary [PODS '08, Navarro, Vitter, ...]

Belazzougui et al, this ESA

Page 31: Data structures: time, I/Os, entropy,  joules

What else...

• Solid-state disks: no mechanical parts ... very fast reads, but slow writes & wear leveling [E. Gal, S. Toledo, ACM Comp. Surv., 2005] [Ajwani et al, WEA 2009]

• Self-adjusting or weighted design: the time of the operations depends on some (un/known) distribution

Challenging: no pointers, self-adjust (perf) vs compression (space)

Page 32: Data structures: time, I/Os, entropy,  joules

A bigger challenge: from micro to macro !

IEEE Computer, 2007

Page 33: Data structures: time, I/Os, entropy,  joules

Approach #1 (engineering oriented)

News: Proper system components + specific algorithms

Sanders & Meyer’s groups, IEEE Conf. on Green Comp. 2010 [SSDisks + Atom + Sort]

Page 34: Data structures: time, I/Os, entropy,  joules

Approach #2 (Manage resources)

Goal: Develop on-line algorithms that dynamically manage power by trading off performance, energy and reliability

Susanne Albers, Comm. ACM 2010

Page 35: Data structures: time, I/Os, entropy,  joules

Approach #3 (models and algorithms)

“Algorithmics offers benefits that extend far beyond TCS into the design of systems.”

IEEE Computer, 2009

Workshop in IEEE Conf. on Green Comp. 2010

Page 36: Data structures: time, I/Os, entropy,  joules

Sometimes energy is a primary resource!

Page 37: Data structures: time, I/Os, entropy,  joules
Page 38: Data structures: time, I/Os, entropy,  joules

Energy-aware Algo+Ds ?

Locality pays off

Memory-level impacts

I/Os and compression are obviously important

BUT here there is a new twist

Page 39: Data structures: time, I/Os, entropy,  joules

MIPS per Watt ? Battery life !!

Who cares whether your application:

1. is y% slower than optimal, but it is more energy efficient ?

2. occupies x% more space than optimal, but decompression is faster ?

Approach in a principled way

Page 40: Data structures: time, I/Os, entropy,  joules

MIPS per Watt ? Battery life !!

Stay tuned: Algorithm Library for Mobile Phones

Idea: Multi-objective optimization in data-structure design

Page 41: Data structures: time, I/Os, entropy,  joules

[Figure: systems and applications: HyperTable, Cassandra, BigTable (2006), HBase - Hadoop, Cosmos; real-time search, Q&A social search, knowledge search]

Page 42: Data structures: time, I/Os, entropy,  joules

Many ingredients

• Items are graphs, vectors, strings, …

• Number and size are VERY large

• Involve many resources to be optimized:

Time (speed/patience)

Space (#disks/management costs)

Bandwidth (speed/€)

Energy (€)

Multi-objective optimization in data-structure design!

Page 43: Data structures: time, I/Os, entropy,  joules

That’s all !

Look at my paper in the proceedings