Page 1: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

1

Scalable Distributed Data Structures

Part 2

Witold Litwin

Witold.Litwin@dauphine.fr

Page 2: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

2

Lecture Plan

LH* for Relational Databases
High-availability SDDS schemes
Range Partitioning SDDS Schemes
– RP* Family
  » Precursor of Google's BigTable
SDDS Prototypes

Page 3: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

3

Lecture Plan

Algebraic Signatures for an SDDS
– Backup
– Updates
– Pattern Matching
SDDSs for P2P systems
– Chord & other DHT-based schemes
SDDSs for Grids & Clouds
Conclusion

Page 4: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

4

Relational Database Queries over LH* Tables

We talk about applying SDDS files to a relational database implementation

In other words, we talk about a relational database using SDDS files instead of more traditional ones

We examine the processing of typical SQL queries
– Using the operations over SDDS files
  » Key-based & scans

Page 5: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

5

Relational Database Queries over LH* Tables

For most queries, an LH*-based implementation appears easily feasible

The analysis applies to some extent to other potential applications – e.g., Data Mining

Page 6: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

6

Relational Queries over LH*

All the theory of parallel database processing applies to our analysis
– E.g., the classical work by the DeWitt team (U. Wisconsin-Madison)
With a distinctive advantage
– The size of tables matters less
  » The partitioned tables were basically static
  » See specs of SQL Server, DB2, Oracle…
  » Now they are scalable
– This especially concerns the size of the output table
  » Often hard to predict

Page 7: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

7

How Useful Is This Material?

From: [email protected] [mailto:[email protected]] on behalf of David DeWitt
Sent: Monday, December 29, 2008, 22:15
To: [email protected]
Subject: [Dbworld] Job openings at Microsoft Jim Gray Systems Lab

 

The Microsoft Jim Gray Systems Lab (GSL) in Madison, Wisconsin has several positions open at the Ph.D. or M.S. levels for exceptionally well-qualified individuals with significant experience in the design and implementation of database management systems. Organizationally, GSL is part of the Microsoft SQL Server Division. Located on the edge of the UW-Madison campus, the GSL staff collaborates closely with the faculty and graduate students of the database group at the University of Wisconsin. We currently have several on-going projects in the areas of parallel database systems, advanced storage technologies for database systems, energy-aware database systems, and the design and implementation of database systems for multicore CPUs. ….

 

Page 8: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

8

How Useful Is This Material?

From: [email protected] [mailto:[email protected]] on behalf of Gary Worrell
Sent: Saturday, February 14, 2009, 00:36
To: [email protected]
Subject: [Dbworld] Senior Researcher - Scientific Data Management

The Computational Science and Mathematics division of the Pacific Northwest National Laboratory is looking for a senior researcher in Scientific Data Management to develop and pursue new opportunities. Our research is aimed at creating new, state-of-the-art computational capabilities using extreme-scale simulation and peta-scale data analytics that enable scientific breakthroughs. We are looking for someone with a demonstrated ability to provide scientific leadership in this challenging discipline and to work closely with the existing staff, including the SDM technical group manager.

 

Page 9: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

9

How Useful Is This Material?

Page 10: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

10

How Useful Is This Material?

Page 11: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

11

How Useful Is This Material?

Page 12: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

12

Relational Queries over LH* Tables

We illustrate the point using the well-known Supplier Part (S-P) database

S (S#, Sname, Status, City)
P (P#, Pname, Color, Weight, City)
SP (S#, P#, Qty)

See my database classes on SQL

– At the Website

Page 13: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

13

Relational Queries over LH* Tables

We consider the typical relational implementation

– One relational tuple is one LH* record

–The key attributes form the LH* record key

–The other attributes form the non-key field of the record

– Relational pages are LH* buckets
  » Whether in RAM or on disk

Page 14: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

14

Relational Queries over LH* Tables

Single primary-key based search
Select * From S Where S# = S1
Translates to a simple key-based LH* search
– Assuming naturally that S# becomes the primary key of the LH* file with the tuples of S
  (S1 : Smith, 100, London) (S2 : …)

Page 15: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

15

Relational Queries over LH* Tables

Select * From S Where S# = S1 OR S# = S2
– A series of primary-key based searches
Non-key-based restriction
– …Where City = 'Paris' or City = 'London'
– Deterministic scan with local restrictions
  » Results are perhaps inserted into a temporary LH* file
  » LH* is especially useful since the result size is often unknown

Page 16: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

16

Relational Queries Over LH* Tables

Key-based Insert
INSERT INTO P VALUES ('P8', 'nut', 'pink', 15, 'Nice');
– Process as usual for LH*
– Or use SD-SQL Server
  » If no access "under the cover" of the DBMS
Key-based Update, Delete
– Idem

Page 17: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

17

Non-key projection
Select S.Sname, S.City from S
– Deterministic scan with local projections
  » Results are perhaps inserted into a temporary LH* file

Non-key projection and restriction
Select S.Sname, S.City from S Where City = 'Paris' or City = 'London'
– Idem

Relational Queries over LH* Tables

Page 18: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

18

Non-key Distinct
Select Distinct City from P
– Scan with local or upward-propagated aggregation towards bucket 0
– Process Distinct locally if you do not have any son
– Otherwise wait for input from all your sons
– Process Distinct together
– Send the result to the father if any, or to the client, or to the output table

Relational Queries over LH* Tables

Page 19: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

19

Non-key Count or Sum
Select Count(S#), Sum(Qty) from SP
– Scan with local or upward-propagated aggregation
– Final post-processing on the client (a minimal sketch follows below)
Non-key Avg, Var, StDev…
– Your proposal here

Relational Queries over LH* Tables
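A minimal sketch of the scan-with-local-aggregation idea above, assuming each scanned bucket returns its partial (count, sum) to the client; the function and data names are illustrative, not part of any LH* API.

```python
# Scan with local aggregation and client-side post-processing (illustrative sketch).

def local_aggregate(bucket):
    """Runs on each server: partial COUNT(S#) and SUM(Qty) over one SP bucket."""
    cnt, total = 0, 0
    for record in bucket:          # record = one SP tuple as a dict of attributes
        cnt += 1
        total += record['Qty']
    return cnt, total

def client_count_sum(partial_results):
    """Runs on the client: combines the partial results returned by the scanned buckets."""
    count = sum(c for c, _ in partial_results)
    qty_sum = sum(s for _, s in partial_results)
    return count, qty_sum

# Example: three buckets of the SP file
buckets = [
    [{'S#': 'S1', 'Qty': 300}, {'S#': 'S1', 'Qty': 200}],
    [{'S#': 'S2', 'Qty': 400}],
    [{'S#': 'S3', 'Qty': 200}, {'S#': 'S4', 'Qty': 100}],
]
partials = [local_aggregate(b) for b in buckets]   # done in parallel on the servers
print(client_count_sum(partials))                  # (5, 1200)
```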

Page 20: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

20

Non-key Group By, Histograms…
Select Sum(Qty) from SP Group By S#
– Scan with local Group By at each server
– Upward propagation
– Or post-processing at the client
– Or the result goes directly into the output table
  » Of a priori unknown size
  » Which, with SDDS technology, does not need to be estimated upfront

Relational Queries over LH* Tables

Page 21: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

21

Equijoin
Select * From S, SP where S.S# = SP.S#
– Scans at S and at SP send all tuples to a temp LH* table T1 with S# as the key
– A scan at T1 merges all couples (r1, r2) of records with the same S#, where r1 comes from S and r2 comes from SP
– The result goes to the client or to a temp table T2
All of the above is an SD generalization of Grace hash join (see the sketch below)

Relational Queries over LH* Tables
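A minimal sketch of the SD Grace-hash-join idea just described; plain hash partitioning stands in for LH* addressing, and all names are illustrative.

```python
# Distributed equijoin sketch: rehash S and SP into T1 on S#, then merge per bucket.
from collections import defaultdict

def rehash_into_T1(s_buckets, sp_buckets, n_t1_buckets):
    """Scans of S and SP send every tuple to the T1 bucket chosen by hashing S#."""
    t1 = [defaultdict(lambda: ([], [])) for _ in range(n_t1_buckets)]
    for bucket in s_buckets:
        for t in bucket:                              # t = (S#, rest of the S tuple)
            t1[hash(t[0]) % n_t1_buckets][t[0]][0].append(t)
    for bucket in sp_buckets:
        for t in bucket:                              # t = (S#, P#, Qty)
            t1[hash(t[0]) % n_t1_buckets][t[0]][1].append(t)
    return t1

def merge_T1(t1):
    """Each T1 bucket merges, locally, the couples sharing the same S#."""
    result = []
    for bucket in t1:
        for s_side, sp_side in bucket.values():
            result += [(r1, r2) for r1 in s_side for r2 in sp_side]
    return result

s_buckets = [[('S1', 'Smith', 'London')], [('S2', 'Jones', 'Paris')]]
sp_buckets = [[('S1', 'P1', 300), ('S2', 'P1', 200)], [('S1', 'P2', 400)]]
print(merge_T1(rehash_into_T1(s_buckets, sp_buckets, n_t1_buckets=2)))
```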

Page 22: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

22

Equijoin & Projections & Restrictions & Group By & Aggregate &…
– Combine the above
– Into a nice SD execution plan
Your Thesis here

Relational Queries over LH* Tables

Page 23: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

23

Equijoin & θ-join
Select * From S as S1, S where S.City = S1.City and S.S# < S1.S#
– Process the equijoin into T1
– Scan for a parallel restriction over T1, with the final result going to the client or (rather) to T2
Order By and Top K
– Use RP* as the output table

Relational Queries over LH* Tables

Page 24: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

24

Having

Select Sum(Qty) from SP Group By S# Having Sum(Qty) > 100

Here we have to process the result of the aggregation

One approach: post-processing on client or temp table with results of Group By

Relational Queries over LH* Tables

Page 25: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

25

Relational Operations Over LH* Tables

Subqueries
– In Where, Select, or From clauses
– With Exists or Not Exists or Aggregates…
– Non-correlated or correlated
Non-correlated subquery
Select S# from S where status = (Select Max(X.status) from S as X)
– Scan for the subquery, then scan for the outer query

Page 26: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

26

Correlated Subqueries

Select S# from S where not exists (Select * from SP where S.S# = SP.S#)

Your Proposal here

Relational Queries over LH* Tables

Page 27: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

27

Like (…)
– Scan with pattern matching or a regular expression
– Result delivered to the client or to the output table

Your Thesis here

Relational Queries over LH* Tables

Page 28: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

28

Relational Operations Over LH* Tables

Cartesian Product & Projection & Restriction…

Select Status, Qty From S, SP Where City = “Paris”

–Scan S and SP for local restrictions and projections

– Put the result for S into temp T1
– Put the result for SP into temp T2

Page 29: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

29

Relational Operations Over LH* Tables

Scan T1, delivering every tuple towards every bucket of T2
– Preferably using multicast
– Details are not that simple, since some flow control is necessary
Deliver the result of the tuple merge over every couple to T3

Page 30: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

30

New or Non-standard Aggregate Functions
– Covariance
– Correlation
– Moving Average
– Cube
– Rollup
– -Cube
– Skyline
– … (see my class on advanced SQL)

Your Thesis here

Relational Queries over LH* Tables

Page 31: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

31

Indexes
Create Index SX on S (Sname);
Create, e.g., an LH* file with records (Sname, (S#1, S#2, …))
Where each S#i is the key of a tuple with that Sname
Notice that an SDDS index is not affected by location changes due to splits
– A potentially huge advantage

Relational Queries over LH* Tables

Page 32: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

32

For an ordered index use
– an RP* scheme
  » To come later in this course
– or Baton
  » See Course Doc
– …

Relational Queries over LH* Tables

Page 33: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

33

For a k-d index use
– k-RP*
  » Developed with M-A Neimat (Oracle) & Donovan Schneider (Yahoo Research?)
  » See Course Doc
– or SD-Rtree
  » ICDE 07, with C. du Mouza (CNAM) & Ph. Rigaux (Dauphine)
– …

Relational Queries over LH* Tables

Page 34: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

34

LH* seems well suited to coming relational database needs
– Over clouds, grids, P2P…
In fact, other SDDSs as well
The industry looks for competence in the area
– See again the job ads in this course

Conclusion

Page 35: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

35

The framework of the analysis is instructive for other apps over massive data collections
– Already discussed in this course

A lot of room for valuable research

Conclusion

Page 36: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

36

Page 37: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

37

High-availability SDDS schemes

Data remain available despite:
– any single-server failure & most two-server failures
– or any failure of up to n servers
– and some catastrophic failures
The n factor should scale with the file
– To offset the reliability decline which would otherwise occur

Page 38: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

38

High-availability SDDS schemes

Three principles for high-availability SDDS schemes:
– mirroring (LH*m)
– striping (LH*s)
– grouping (LH*g, LH*sa, LH*rs)
They realize different performance trade-offs

Page 39: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

39

High-availability SDDS schemes

Mirroring
– Allows an instant switch to the backup copy
– Costs the most in storage overhead
  » k * 100 %
– Hardly applicable for more than 2 copies per site

Page 40: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

40

High-availability SDDS schemes

Striping
– Storage overhead of O(k / m)
– m times higher messaging cost for a record search
– m = number of stripes
– At least m + k times higher record search cost while a segment is unavailable
  » Or while a bucket is being recovered

Page 41: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

41

High-availability SDDS schemes

Grouping
– Storage overhead of O(k / m)
– m = number of records in a record (bucket) group
– No messaging overhead for a record search
– At least m + k times higher record search cost while a segment is unavailable
  » Or while a bucket is being recovered

Page 42: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

42

High-availability SDDS schemes

Grouping appears the most practical
How to do it in practice? One answer: LH*RS
Thesis of Rim Moussa (Dauphine)
– Published in ACM TODS
– An exceptional result for French university DB research
  » Only a couple of papers accepted over 20 years

Page 43: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

43

LH*RS: Record Groups

LH*RS records
– LH* data records & parity records

Records with the same rank r in the bucket group form a record group

Each record group gets n parity records

Page 44: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

44

LH*RS: Record Groups

Each parity record contains
– r as the primary key
– The keys c of the records with rank r in the group
  » The record group
– A parity code computed over the record group

Page 45: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

45

LH*RS: Record Groups

The parity value is computed using Reed-Solomon erasure-correcting codes
– Additions and multiplications in Galois Fields
– See the ACM-TODS LH*RS paper on the Web site for details
r is the common key of these records
Each group supports unavailability of up to n of its members

Page 46: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

46

LH*RS Record Groups

[Figure: (a) structure of a data record (key c, non-key data) and of a parity record (rank r, the keys c1 … cl of the group members, parity bits); (b) a bucket group with its data records and parity records.]

Page 47: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

47

LH*RS Scalable Availability

Create 1 parity bucket per group until M = 2^i1 buckets
Then, at each split,
– add a 2nd parity bucket to each existing group
– create 2 parity buckets for new groups, until 2^i2 buckets
etc. (a small sketch follows below)
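A minimal sketch of this rule; the threshold exponents below are assumed, illustrative file-design parameters, not values prescribed by LH*RS.

```python
# The availability level k grows each time the file size M (in buckets)
# crosses one of the chosen thresholds 2**i1, 2**i2, ...

THRESHOLDS = [2**4, 2**8, 2**12]   # i1=4, i2=8, i3=12 -- assumed design parameters

def availability_level(M, thresholds=THRESHOLDS):
    """Returns k, the number of parity buckets per group, for a file of M data buckets."""
    k = 1                          # start with 1 parity bucket per group
    for t in thresholds:
        if M > t:                  # once the file has outgrown a threshold,
            k += 1                 # every group carries one more parity bucket
    return k

for M in (8, 16, 17, 300, 5000):
    print(M, availability_level(M))   # -> 1, 1, 2, 3, 4
```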

Page 48: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

48

LH*RS Scalable Availability

Page 49: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

49

LH*RS Scalable Availability

Page 50: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

50

LH*RS Scalable Availability

Page 51: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

51

LH*RS Scalable Availability

Page 52: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

52

LH*RS Scalable Availability

Page 53: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

53

LH*RS: Galois Fields

A finite set with an algebraic structure
– We only deal with GF(N) where N = 2^f; f = 4, 8, 16
  » Elements (symbols) are 4-bit nibbles, bytes, and 2-byte words
Contains elements 0 and 1
Addition with the usual properties
– In general implemented as XOR: a + b = a XOR b

Page 54: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

54

LH*RS: Galois Fields

Multiplication and division
– Usually implemented as log / antilog calculus
  » With respect to some primitive element α
  » Using log / antilog tables
a * b = antilog[(log a + log b) mod (N – 1)] (sketched below)
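A minimal sketch of this log/antilog multiplication for GF(16); it assumes the primitive polynomial x^4 + x + 1 with α = 2, which reproduces the GF(16) log table shown a couple of slides below.

```python
# GF(2^4) multiplication via log/antilog tables.

def build_tables(f=4, prim_poly=0b10011):        # x^4 + x + 1
    N = 1 << f
    antilog = [0] * (N - 1)
    log = {}
    x = 1
    for i in range(N - 1):                       # successive powers of alpha = 2
        antilog[i] = x
        log[x] = i
        x <<= 1
        if x & N:                                # reduce modulo the primitive polynomial
            x ^= prim_poly
    return log, antilog

LOG, ANTILOG = build_tables()

def gf_add(a, b):
    return a ^ b                                 # addition = XOR

def gf_mul(a, b, N=16):
    if a == 0 or b == 0:
        return 0
    return ANTILOG[(LOG[a] + LOG[b]) % (N - 1)]  # antilog[(log a + log b) mod (N-1)]

print(gf_mul(0x7, 0xB))      # e.g. 7 * 11 in GF(16)
print(LOG[0xF])              # log 15 = 12, matching the GF(16) table below
```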

Page 55: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

55

Example: GF(4)

Multiplication, logarithm and antilogarithm tables for GF(4):

  *  | 00  10  01  11
  00 | 00  00  00  00        log: 00 → –,  10 → 0,  01 → 1,  11 → 2
  10 | 00  10  01  11        antilog: 0 → 10,  1 → 01,  2 → 11
  01 | 00  01  11  10
  11 | 00  11  10  01

Addition: XOR
Multiplication: direct table, or primitive-element-based log / antilog tables
Log tables are more efficient for a large GF
α = 01; 10 = 1; 00 = 0
α^0 = 10, α^1 = 01, α^2 = 11, α^3 = 10

Page 56: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

56

Example: GF(16)

Elements & logs (α = 2); a direct multiplication table would need 256 entries; addition: XOR

String  int  hex  log
0000     0    0    –
0001     1    1    0
0010     2    2    1
0011     3    3    4
0100     4    4    2
0101     5    5    8
0110     6    6    5
0111     7    7   10
1000     8    8    3
1001     9    9   14
1010    10    A    9
1011    11    B    7
1100    12    C    6
1101    13    D   13
1110    14    E   11
1111    15    F   12

Page 57: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

57

LH*RS Parity Encoding

Create the m x n generator matrix G
– using elementary transformations of an extended Vandermonde matrix of GF elements
– m is the record group size
– n = 2^l is the max segment size (data and parity records)
– G = [I | P]
– I denotes the identity matrix

Page 58: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

58

LH*RS Parity Encoding

The m symbols with the same offset in the records of a group become the (horizontal) information vector U

The matrix multiplication U × G provides the codeword vector; its last n – m symbols are the parity symbols (a small sketch follows below)
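A minimal sketch of the U × G encoding step for one symbol offset over GF(16). The single parity column of 1s relies on the property, stated later, that the first parity column contains only 1s; the real LH*RS generator matrix has more parity columns.

```python
# Systematic encoding sketch: codeword = U * [I | P] over GF(16).

def build_tables(f=4, prim_poly=0b10011):
    N = 1 << f
    antilog, log = [0] * (N - 1), {}
    x = 1
    for i in range(N - 1):
        antilog[i], log[x] = x, i
        x <<= 1
        if x & N:
            x ^= prim_poly
    return log, antilog

LOG, ANTILOG = build_tables()

def gf_mul(a, b):
    return 0 if 0 in (a, b) else ANTILOG[(LOG[a] + LOG[b]) % 15]

# P part of G = [I | P]; one column of 1s (assumed minimal example)
P = [[0x1],
     [0x1]]

def encode(U):
    """U = one symbol per data record (same offset in each record of the group)."""
    data_part = list(U)                              # I * U keeps the data symbols as is
    parity_part = [0] * len(P[0])
    for j in range(len(P[0])):
        for i, u in enumerate(U):
            parity_part[j] ^= gf_mul(u, P[i][j])     # GF addition is XOR
    return data_part + parity_part                   # the codeword vector

print(encode([0x4, 0xD]))     # parity column of 1s -> parity symbol = XOR of the data
```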

Page 59: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

59

LH*RS: GF(16) Parity Encoding

Records: "En arche ...", "Dans le ...", "Am Anfang ...", "In the beginning"
(hex) 45 6E 20 41 72 …, 41 6D 20 41 6E …, 44 61 6E 73 20 …, 49 6E 20 70 74 …

[Figure: the generator matrix G = [I | P] over GF(16), to be applied to the four data records.]

Page 60: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

60

LH*RS GF(16) Parity Encoding

Records: "En arche ...", "Dans le ...", "Am Anfang ...", "In the beginning"
(hex) 45 6E 20 41 72 …, 41 6D 20 41 6E …, 44 61 6E 73 20 …, 49 6E 20 70 74 …

[Figure: the generator matrix G applied to the records; the first line of the parity computation is shown.]

Page 61: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

61

LH*RS GF(16) Parity Encoding

Records: "En arche ...", "Dans le ...", "Am Anfang ...", "In the beginning"
(hex) 45 6E 20 41 72 …, 41 6D 20 41 6E …, 44 61 6E 73 20 …, 49 6E 20 70 74 …

[Figure: the generator matrix G applied to the records; further lines of the parity computation are shown.]

Page 62: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

62

LH*RS GF(16) Parity Encoding

Records: "En arche ...", "Dans le ...", "Am Anfang ...", "In the beginning"

[Figure: the generator matrix G applied to the records; all the parity symbols of the record group are now computed.]

Page 63: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

63

LH*RS: Actual Parity Management

An insert of a data record with rank r creates or, usually, updates the parity records r
An update of a data record with rank r updates the parity records r
A split recreates parity records
– Data records usually change their rank after the split

Page 64: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

64

LH*RS: Actual Parity Encoding

Performed at every insert, delete and update of a record
– One data record at a time
Each updated data bucket produces a Δ-record that is sent to each parity bucket
– The Δ-record is the difference between the old and the new value of the manipulated data record
  » For an insert, the old record is dummy
  » For a delete, the new record is dummy

Page 65: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

65

LH*RS: Actual Parity Encoding

The i-th parity bucket of a group contains only the i-th column of G
– Not the entire G, as one could expect
The calculus of the i-th parity record happens only at the i-th parity bucket
– No messages to other data or parity buckets

Page 66: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

66

LH*RS: Actual RS Code

Encoding uses GF(2^16)
– Encoding / decoding typically faster than for our earlier GF(2^8)
  » Experimental analysis by Ph.D. student Rim Moussa
– Possibility of very large record groups with a very high availability level k

Page 67: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

67

LH*RS: Actual RS Code

Still a reasonable size for the log / antilog multiplication tables
– Our (and well-known) GF multiplication method
The calculus uses the logarithmic parity matrix
– About 8 % faster than with the traditional parity matrix

Page 68: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

68

LH*RS: Actual RS Code

The 1st parity record calculus uses only XORing
– The 1st column of the parity matrix contains 1's only
– Like, e.g., RAID systems
– Unlike our earlier code published in the SIGMOD 2000 paper

Page 69: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

69

LH*RS: Actual RS Code

The parity calculus for the 1st data record uses only XORing
– The 1st line of the parity matrix contains 1's only
It is at present, for our purpose, the best erasure-correcting code around

Page 70: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

70

LH*RS: Actual RS Code

Logarithmic Parity Matrix
0000 0000 0000 …
0000 5ab5 e267 …
0000 e267 0dce …
0000 784d 2b66 …
 …    …    …

Parity Matrix
0001 0001 0001 …
0001 eb9b 2284 …
0001 2284 9e74 …
0001 9e44 d7f1 …
 …    …    …

All things considered, we believe our code is the most suitable erasure-correcting code for high-availability SDDS files at present

Page 71: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

71

LH*RS: Actual RS Code

Systematic: data values are stored as is
Linear:
– We can use Δ-records for updates
  » No need to access other record group members
– Adding a parity record to a group does not require access to the existing parity records

Page 72: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

72

LH*RS: Actual RS Code

MDS (Maximum Distance Separable)
– Minimal possible overhead for all practical record and record group sizes
  » For records of at least one symbol in the non-key field
  » We use 2-byte symbols of GF(2^16)
More on codes
– http://fr.wikipedia.org/wiki/Code_parfait
See also Intel's paper in the course Doc

Page 73: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

73

LH*RS Record/Bucket Recovery

Performed when at most k = n – m buckets are unavailable in a segment:

Choose m available buckets of the segment

Form submatrix H of G from the corresponding columns

Page 74: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

74

LH*RS Record/Bucket Recovery

Invert this matrix into the matrix H^-1

Multiply the horizontal vector S of available symbols with the same offset by H^-1

The result contains the recovered data and/or parity symbols
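A minimal sketch of the simplest case: a single unavailable data bucket recovered through the XOR-only first parity column mentioned earlier; the general case inverts the sub-matrix H over the Galois field as described above. Values are illustrative.

```python
# Single-erasure recovery via the first parity column (all 1s): the parity symbol
# is the XOR of the group's data symbols, so the missing symbol is recovered by XOR.

def recover_missing(available_data_symbols, parity_symbol):
    """Recovers the symbol of the single unavailable data record at a given offset."""
    missing = parity_symbol
    for s in available_data_symbols:
        missing ^= s                  # GF addition/subtraction is XOR
    return missing

# Symbols with the same offset in a group of m = 4 records (illustrative values)
group = [0x4, 0xD, 0x4, 0x9]
parity = 0
for s in group:
    parity ^= s                       # first parity symbol = XOR of the group

# Suppose the bucket holding the 2nd record is unavailable:
print(hex(recover_missing([0x4, 0x4, 0x9], parity)))   # -> 0xd
```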

Page 75: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

75

Example: Data Buckets

"En arche ...", "Dans le ...", "Am Anfang ...", "In the beginning"
(hex) 45 6E 20 41 72 …, 41 6D 20 41 6E …, 44 61 6E 73 20 …, 49 6E 20 70 74 …

Page 76: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

76

Example: Available Buckets

"In the beginning" plus the parity buckets
(hex) 49 6E 20 70 74 …; parity: 4F 63 6E E4 …, 48 6E DC EE …, 4A 66 49 DD …

Page 77: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

77

Example

[Figure: from the generator matrix G, the sub-matrix H is formed from the columns corresponding to the available buckets.]

Available buckets: "In the beginning" plus the parity buckets

Page 78: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

78

Example

[Figure: the sub-matrix H is inverted, e.g. by Gaussian elimination over the Galois field, giving H^-1.]

Available buckets: "In the beginning" plus the parity buckets

Page 79: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

79

Example

[Figure: multiplying the vector of available symbols by H^-1 yields the recovered symbols, i.e., the content of the unavailable buckets.]

Available buckets: "In the beginning" plus the parity buckets

Page 80: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

80

Performance

Data bucket load factor: 70 %
Parity overhead: k / m
– m is a file parameter, m = 4, 8, 16…
– larger m increases the recovery cost
Key search time
– Individual: 0.2419 ms
– Bulk: 0.0563 ms
File creation rate
– 0.33 MB/s for k = 0
– 0.25 MB/s for k = 1
– 0.23 MB/s for k = 2
Record insert time (100 B)
– Individual: 0.29 ms for k = 0, 0.33 ms for k = 1, 0.36 ms for k = 2
– Bulk: 0.04 ms
Record recovery time
– About 1.3 ms
Bucket recovery rate (m = 4)
– 5.89 MB/s from 1-unavailability
– 7.43 MB/s from 2-unavailability
– 8.21 MB/s from 3-unavailability
(Wintel P4 1.8 GHz, 1 Gb/s Ethernet)

Page 81: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

81

Parity Overhead (Performance)

About the smallest possible
– A consequence of the MDS property of RS codes
Storage overhead (in additional buckets)
– Typically k / m
Insert, update, delete overhead
– Typically k messages
Record recovery cost
– Typically 1 + 2m messages
Bucket recovery cost
– Typically 0.7b (m + x – 1)
Key search and parallel scan performance are unaffected
– LH* performance

Page 82: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

82

Reliability

• Probability P that all the data are available
• The inverse of the probability of a catastrophic k'-bucket failure, k' > k
• Increases for
  • higher reliability p of a single node
  • greater k, at the expense of higher overhead
• But it must decrease, regardless of any fixed k, when the file scales
• k should scale with the file
• How??

Performance

Page 83: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

83

Uncontrolled Availability

[Figure: plots of the availability P versus the file size M, for m = 4 and k = 4, with node reliability p = 0.1 and p = 0.15; with a fixed k, P declines as the file scales.]

Page 84: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

84

RP* Schemes

Produce 1-d ordered files
– for range search
Use m-ary trees
– like a B-tree
Efficiently support range queries
– LH* also supports range queries
  » but less efficiently
A family of three schemes
– RP*N, RP*C and RP*S

Page 85: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

85

Current PDBMS technology (Pioneer: NonStop SQL, Tandem & J. Gray)

Static Range Partitioning
– SQL Server, DB2, Oracle...
– Done manually by the DBA
– Requires good skills
– Not scalable

Page 86: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

86

RP* Schemes

Fig. 1: RP* design trade-offs
– RP*N: no index, all multicast
– RP*C: + client index, limited multicast
– RP*S: + server index, optional multicast

Page 87: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

87

RP* File Expansion

[Figure: RP* file expansion. The key space consists of common English words ("the", "of", "and", "to", "a", "in", "that", "is", "for", "it", …); buckets 0, 1, 2, 3 are created by successive splits, each bucket holding a contiguous key range.]

Page 88: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

88

RP* Range Query

Searches for all records in the query range Q
– Q = [c1, c2] or Q = ]c1, c2] etc.
The client sends Q
– either by multicast to all the buckets
  » RP*N especially
– or by unicast to the relevant buckets in its image
  » those may forward Q to children unknown to the client

Page 89: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

89

RP* Range Query Termination

Time-out, or deterministic:
– Each server addressed by Q sends back at least its current range
– The client computes the union U of all the returned ranges
– It terminates when U covers Q (a minimal sketch follows below)
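A minimal sketch of this deterministic termination test; ranges are treated as half-open [low, high) intervals over string keys, and all names are illustrative.

```python
# The client keeps the union of the ranges returned by the servers and stops
# once that union covers the query range Q.

def covers(returned_ranges, q_low, q_high):
    """True when the union of the returned bucket ranges covers [q_low, q_high)."""
    point = q_low
    for low, high in sorted(returned_ranges):
        if low > point:          # a gap: some addressed bucket has not answered yet
            return False
        point = max(point, high)
        if point >= q_high:
            return True
    return point >= q_high

responses = []                                    # (range) replies as they arrive
for rng in [("in", "of"), ("a", "in"), ("of", "zzz")]:
    responses.append(rng)
    print(rng, "-> done" if covers(responses, "a", "to") else "-> keep waiting")
```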

Page 90: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

90

RP*C Client Image

[Figure: evolution of the RP*C client image after searches for the keys "it", "that", "in". Starting from the single-bucket image, each IAM refines it: (0, for, *, in, 2, of, *) → (0, for, *, in, 2, of, 1) → (0, for, 3, in, 2, of, 1).]

Page 91: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

91

RP*S

[Figure: an RP*S file with (a) a 2-level kernel and (b) a 3-level kernel. The distributed index consists of a root and index pages spread over servers; an IAM returns the traversed index pages to the client.]

Page 92: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

92

RP*S / BigTable

BigTable: a Google File System (GFS) abstraction
– See the OSDI 06 paper in the course Doc
– Also somehow present in Hadoop
A Scalable Distributed Table
– An operational structure for Google; the data everyone uses are within
RP*S is the precursor of BigTable

Page 93: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

93

RP*S / BigTable

RP* bucket = BigTable tablet
Record = row
Coordinator = master
RP*S = 3-level hierarchy of metatables
Besides, there are many BigTable specifics
– Column structure of the non-key fields in the rows
– A specific concurrency manager
  » Called Chubby
– …

Page 94: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

97

b RP*C RP*S LH*

50 2867 22.9 8.9

100 1438 11.4 8.2

250 543 5.9 6.8

500 258 3.1 6.4

1000 127 1.5 5.7

2000 63 1.0 5.2

Number of IAMs until image convergence

Page 95: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

98

RP* Bucket Structure

Header
– Bucket range
– Address of the index root
– Bucket size…
Index
– A kind of B+-tree
– Additional links
  » for efficient index splitting during RP* bucket splits
Data
– Linked leaves with the data

[Figure: a bucket with its header, the B+-tree index (root, leaf headers) and the data as a linked list of index leaves holding the records.]

Page 96: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

99

SDDS-2004 Menu Screen

Page 97: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

100

SDDS-2000: Server Architecture

[Figure: a main-memory SDDS server holding several RP* buckets, with a request queue, an ack queue, a listen thread, a SendAck thread and N work threads, communicating with clients over the network (TCP/IP, UDP); RP* functions: Insert, Search, Update, Delete, Forward, Split.]

Several buckets of different SDDS files
Multithreaded architecture
Synchronization queues
Listen thread for incoming requests
SendAck thread for flow control
Work threads for
– request processing
– response sendout
– request forwarding
UDP for shorter messages (< 64 KB)
TCP/IP for longer data exchanges

Page 98: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

101

SDDS-2000: Client Architecture

[Figure: the client sits between the applications (1…N) and the network; an SDDS application interface with send and receive modules, a requests journal, client images (key → server IP address), and a flow-control manager.]

2 modules
– Send module
– Receive module
Multithreaded architecture
– SendRequest
– ReceiveRequest
– AnalyzeResponse (1..4)
– GetRequest
– ReturnResponse
Synchronization queues
Client images
Flow control

Page 99: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

102

Performance Analysis

Experimental environment
– Six Pentium III 700 MHz machines
  » Windows 2000
  » 128 MB of RAM
  » 100 Mb/s Ethernet
Messages
– 180 bytes: 80 for the header, 100 for the record
– Keys are random integers within some interval
– Flow control: sliding window of 10 messages
Index
– Capacity of an internal node: 80 index elements
– Capacity of a leaf: 100 records

Page 100: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

103

Performance Analysis: File Creation

Bucket capacity: 50,000 records
150,000 random inserts by a single client
With flow control (FC) or without

[Figure: file creation time and average insert time versus the number of records, for RP*C and RP*N, with and without flow control.]

Page 101: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

104

Discussion

Creation time is almost linearly scalable
Flow control is quite expensive
– Losses without it were negligible
Both schemes perform almost equally well
– RP*C slightly better
  » As one could expect
Insert time is about 30 times faster than for a disk file
Insert time appears bound by the client speed

Page 102: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

105

Performance Analysis: File Creation

File created by 120,000 random inserts by 2 clients
Without flow control

[Figure: file creation by two clients: total time and time per insert versus the number of records; comparative file creation time by one or two clients.]

Page 103: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

106

Discussion

Performance improves
Insert times appear bound by the server speed
More clients would not improve the performance of a server

Page 104: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

107

Performance Analysis: Split Time

Split times for different bucket capacities:

b        Split time (ms)   Time / record (ms)
10000         1372              0.137
20000         1763              0.088
30000         1952              0.065
40000         2294              0.057
50000         2594              0.052
60000         2824              0.047
70000         3165              0.045
80000         3465              0.043
90000         3595              0.040
100000        3666              0.037

Page 105: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

108

Discussion

About linear scalability as a function of the bucket size
Larger buckets are more efficient
Splitting is very efficient
– Reaching as little as 40 µs per record

Page 106: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

109

Performance Analysis: Insert Without Splits

Up to 100,000 inserts into k buckets; k = 1…5
Either with an empty client image adjusted by IAMs, or with a correct image

Insert performance (total time and time per insert, in ms):

     RP*C                                              RP*N
     Without FC     With FC,          With FC,         With FC        Without FC
                    empty image       correct image
k
1    35511  0.355   27480  0.275      27480  0.275     35872  0.359   27540  0.275
2    27767  0.258   14440  0.144      13652  0.137     28350  0.284   18357  0.184
3    23514  0.235   11176  0.112      10632  0.106     25426  0.254   15312  0.153
4    22332  0.223    9213  0.092       9048  0.090     23745  0.237    9824  0.098
5    22101  0.221    9224  0.092       8902  0.089     22911  0.229    9532  0.095

Page 107: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

110

Performance Analysis: Insert Without Splits

• 100,000 inserts into up to k buckets; k = 1…5
• Client image initially empty

[Figure: total insert time and per-record insert time versus the number of servers, for RP*C and RP*N, with and without flow control.]

Page 108: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

111

Discussion

The cost of IAMs is negligible
Insert throughput is about 110 times faster than for a disk file
– 90 µs per insert
RP*N appears surprisingly efficient for more buckets, closing on RP*C
– No explanation at present

Page 109: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

112

Performance Analysis: Key Search

A single client sends 100,000 successful random search requests
Flow control here means that the client sends at most 10 requests without a reply

Search time (ms):

     RP*C                               RP*N
     With FC         Without FC         With FC         Without FC
k    Total    Avg    Total    Avg       Total    Avg    Total    Avg
1    34019    0.340  32086    0.321     34620    0.346  32466    0.325
2    25767    0.258  17686    0.177     27550    0.276  20850    0.209
3    21431    0.214  16002    0.160     23594    0.236  17105    0.171
4    20389    0.204  15312    0.153     20720    0.207  15432    0.154
5    19987    0.200  14256    0.143     20542    0.205  14521    0.145

Page 110: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

113

Performance Analysis: Key Search

[Figure: total search time and per-record search time versus the number of servers, for RP*C and RP*N, with and without flow control.]

Page 111: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

114

Discussion

Single search time is about 30 times faster than for a disk file
– 350 µs per search
Search throughput is more than 65 times faster than that of a disk file
– 145 µs per search
RP*N again appears surprisingly efficient with respect to RP*C for more buckets

Page 112: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

115

Performance Analysis: Range Query

Deterministic termination
Parallel scan of the entire file, with all 100,000 records sent to the client

[Figure: range query total time and per-record time versus the number of servers.]

Page 113: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

116

Discussion

Range search also appears very efficient
– Reaching 100 µs per record delivered
More servers should further improve the efficiency
– The curves do not become flat yet

Page 114: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

117

Scalability Analysis

The largest file at the current configuration:
- 64 MB buckets with b = 640 K
- 448,000 records per bucket, loaded at 70 % on average
- 2,240,000 records in total
- 320 MB of distributed RAM (5 servers)

Page 115: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

118

Scalability Analysis

The largest file at the current configuration (cont.)
• 264 s creation time by a single RP*N client
• 257 s creation time by a single RP*C client
• A record could reach 300 B
• Note: each server now has 256 MB of RAM

Page 116: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

119

Scalability Analysis

If the example file with b = 50,000 had scaled to 10,000,000 records
- It would span 286 buckets (servers)
- There are many more machines at Paris 9
- Creation time by random inserts would be
  - 1235 s for RP*N
  - 1205 s for RP*C

Page 117: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

120

Scalability Analysis

If the example file with b = 50,000 had scaled to 10,000,000 records (cont.)

• 285 splits would last 285 s in total

• Inserts alone would last

• 950 s for RP*N

• 920 s for RP*C

Page 118: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

121

Actual Results for a Big File

Bucket capacity: 751K records, 196 MB
Number of inserts: 3M
Flow control (FC) is necessary to limit the input queue at each server

[Figure: file creation time by a single client versus the number of records (up to 3,000,000), for RP*C and RP*N with flow control.]

Page 119: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

122

Actual Results for a Big File

Bucket capacity: 751K records, 196 MB
Number of inserts: 3M
GA: Global Average; MA: Moving Average

[Figure: insert time by a single client versus the number of records (up to 3,000,000), for RP*C and RP*N with flow control, shown as global and moving averages.]

Page 120: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

123

Related Works: Comparative Analysis

        RP*N Impl.          RP*C Impl.          LH* Impl.   RP*N Theor.
        With FC   No FC     With FC   No FC
tc      51000     40250     69209     47798     67838       45032
ts      0.350     0.186     0.205     0.145     0.200       0.143
ti,c    0.340     0.268     0.461     0.319     0.452       0.279
ti      0.330     0.161     0.229     0.095     0.221       0.086
tm      0.16      0.161     0.037     0.037     0.037       0.037
tr      0.005     0.010     0.010     0.010     0.010

tc: time to create the file
ts: time per key search (throughput)
ti: time per random insert (throughput)
ti,c: time per random insert (throughput) during the file creation
tm: time per record for splitting
tr: time per record for a range query

Page 121: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

124

Discussion

The 1994 theoretical performance predictions for RP* were quite accurate

RP* schemes on SDDS-2000 appear globally more efficient than LH*
– No explanation at present

Page 122: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

125

Conclusion

SDDS-2000x: a prototype SDDS manager for a Windows multicomputer
- Various SDDSs
- Several variants of RP*

Page 123: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

126

Conclusion

Performance of the RP* schemes appears in line with expectations
- Access times in the range of a fraction of a millisecond

-About 30 to 100 times faster than a disk file access performance

- About ideal (linear) scalability

Page 124: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

127

Conclusion

Results prove the overall efficiency of SDDS-2000 architecture

Page 125: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

128

SDDS Prototypes

Page 126: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

129

Prototypes

LH*RS Storage (VLDB 04)
SDDS-2006 (several papers)
– RP* Range Partitioning
– Disk back-up (algebraic-signature based, ICDE 04)
– Parallel string search (algebraic-signature based, ICDE 04)
– Search over encoded content
  » Makes any involuntary discovery of the stored data's actual content impossible
  » Pattern matching several times faster than Boyer-Moore
– Available at our Web site

Page 127: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

130

Prototypes

SD-SQL Server (CIDR 07 & BNCOD 06)

–Scalable distributed tables & views

SD-AMOS and AMOS-SDDS

Page 128: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

131

SDDS-2006 Menu Screen

Page 129: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

132

LH*RS Prototype

Presented at VLDB 2004
Video demo at the CERIA site
Integrates our scalable-availability RS-based parity calculus with LH*
Provides actual performance measures
– Search, insert, update operations
– Recovery times
See the CERIA site for papers
– SIGMOD 2000, WDAS Workshops, Res. Reps., VLDB 2004

Page 130: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

133

LH*RS Prototype: Menu Screen

Page 131: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

134

SD-SQL Server: Server Node

The storage manager is a full-scale SQL Server DBMS
The SD-SQL Server layer at the server node provides the scalable distributed table management
– SD Range Partitioning
It uses SQL Server to perform the splits, through SQL triggers and queries
– But, unlike an SDDS server, SD-SQL Server does not perform query forwarding
– We do not have access to the query execution plan

Page 132: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

135

Manages a client view of a scalable table
– A scalable distributed partitioned view
  » A distributed partitioned updatable view of SQL Server
Triggers specific image-adjustment SQL queries
– checking image correctness
  » Against the actual number of segments
  » Using SD-SQL Server meta-tables (SQL Server tables)
– An incorrect view definition is adjusted
– The application query is executed
The whole system generalizes the PDBMS technology
– which offers static partitioning only

SD-SQL Server: Client Node

Page 133: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

136

SD-SQL Server: Gross Architecture

[Figure: applications access SD-DBS managers (the SDDS layer) on top of SQL Server instances (the SQL Server layer) holding the node databases D1, D2, …, D999.]

Page 134: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

137

SD-DBS Architecture: Server Side

[Figure: each SQL Server node database (DB_1, DB_2, …) holds a segment and the meta-tables SD_C and SD_RP; splits move tuples between segments.]

• Each segment has a check constraint on the partitioning attribute
• Check constraints partition the key space
• Each split adjusts the constraint

Page 135: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

138

Single Segment Split: Single Tuple Insert

[Figure: an insert overflows segment S (b+1 tuples); the split keeps b+1–p tuples in S and moves p tuples into a new segment S1.]

p = INT(b/2)
C(S)  = { c : c < h = c(b+1–p) }
C(S1) = { c : c ≥ c(b+1–p) }

The tuples are moved by queries of the form:
SELECT TOP Pi * INTO Ni.Si FROM S ORDER BY C ASC
SELECT TOP Pi * WITH TIES INTO Ni.S1 FROM S ORDER BY C ASC

Page 136: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

139

Single Segment Split: Bulk Insert

p = INT(b/2)
C(S)  = { c : l < c < h }  becomes  { c : l ≤ c < h' = c(b+t–Np) }
C(S1) = { c : c(b+t–p) < c < h }
…
C(SN) = { c : c(b+t–Np) ≤ c < c(b+t–(N–1)p) }

[Figure: a bulk insert of t tuples overflows segment S to b+t tuples; the split creates N new segments S1 … SN of p tuples each, leaving b+t–Np tuples in S.]

Page 137: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

140

Multi-Segment Split: Bulk Insert

[Figure: a bulk insert overflows several segments S, S1, …, Sk at once; each overflowing segment splits, creating new segments (S1,n1, …, Sk,nk) of p tuples each.]

Page 138: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

141

Split with SDB Expansion

[Figure: a scalable table T in scalable database SDB DB1 spans the nodes N1 … N4, each holding a node database (NDB) DB1; when a split needs a new node Ni, sd_create_node and sd_create_node_database extend the SDB before sd_insert proceeds.]

Page 139: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

142

SD-DBS Architecture: Client View

[Figure: the client's distributed partitioned UNION ALL view over Db_1.Segment1, Db_2.Segment1, …]

• The client view may happen to be outdated
• i.e., not include all the existing segments

Page 140: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

143

Scalable (Distributed) Table

Internally, every image is a specific SQL Server view of the segments: a distributed partitioned union view

CREATE VIEW T AS
SELECT * FROM N2.DB1.SD._N1_T
UNION ALL SELECT * FROM N3.DB1.SD._N1_T
UNION ALL SELECT * FROM N4.DB1.SD._N1_T

Updatable
• Through the check constraints
With or without Lazy Schema Validation

Page 141: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

144

SD-SQL Server Gross Architecture: Application Query Processing

[Figure: an application query against a scalable table reaches the SD-DBS manager layer, which checks and, if needed, adjusts the client image before the query executes at the SQL Server layer over the node databases D1, D2, …, D999.]

Page 142: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

145

Scalable Queries Management

USE SkyServer /* SQL Server command */

Scalable update queries
sd_insert 'INTO PhotoObj SELECT * FROM Ceria5.Skyserver-S.PhotoObj'

Scalable search queries
sd_select '* FROM PhotoObj'
sd_select 'TOP 5000 * INTO PhotoObj1 FROM PhotoObj', 500

Page 143: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

146

Concurrency

SD-SQL Server processes every command as an SQL distributed transaction at the Repeatable Read isolation level
– Tuple-level locks
– Shared locks
– Exclusive 2PL locks
– Much less blocking than at the Serializable level

Page 144: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

147

Concurrency

Splits use exclusive locks on segments and on tuples in the RP meta-table
– Shared locks on the other meta-tables: Primary, NDB meta-tables
Scalable queries basically use shared locks on the meta-tables and on any other table involved
All the concurrent executions can be shown serializable

Page 145: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

148

Image Adjustment

(Q) sd_select 'COUNT (*) FROM PhotoObj'

[Chart: execution time (sec) of query (Q) versus PhotoObj capacity (39 500, 79 000, 158 000 tuples), comparing image adjustment and image checking on a peer and on a client against plain SQL Server on a peer and on a client.]

Page 146: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

149

SD-SQL Server / SQL Server

(Q): sd_select 'COUNT (*) FROM PhotoObj'

[Chart: execution time (sec) of (Q) on SQL Server and SD-SQL Server versus the number of segments (1 to 5), comparing centralized SQL Server, distributed SQL Server, SD-SQL Server, and SD-SQL Server with Lazy Schema Validation.]

Page 147: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

150

• Will SD-SQL Server be useful ?
• Here is a non-MS hint from practical folks who knew nothing about it
• A book found in the Redmond Town Square Borders café

Page 148: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

151

Algebraic Signatures for SDDS

A small string (signature) characterizes an SDDS record.
The signature of a bucket is calculated from the record signatures.
– Determine from the signature whether a record / bucket has changed.
» Bucket backup
» Record updates
» Weak, optimistic concurrency scheme
» Scans

Page 149: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

152

Signatures

A small bit string calculated from an object.
– Different signatures ⇒ different objects.
– Different objects ⇒ with high probability, different signatures.
» A.k.a. hash, checksum.
» Cryptographically secure: computationally infeasible to find another object with the same signature.

Page 150: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

153

Uses of Signatures

Detect discrepancies among replicas. Identify objects.
– CRC signatures.
– SHA-1, MD5, … (cryptographically secure).
– Karp-Rabin fingerprints.
– Tripwire.

Page 151: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

154

Properties of Signatures

Cryptographically secure signatures:
– Cannot produce an object with a given signature.
– Cannot substitute objects without changing the signature.

Page 152: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

155

Properties of Signatures

Algebraic signatures:
– Small changes to the object change the signature for sure,
» up to the signature length (in symbols).
– One can calculate the new signature from the old one and the change.
Both:
– Collision probability 2^(-f) (f = signature length in bits).

Page 153: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

156

Definition of Algebraic Signature: Page Signature

Page P = (p_0, p_1, …, p_(l-1)).
– Component signature: sig_α(P) = Σ_{i=0..l-1} p_i · α^i
– n-symbol page signature: sig_ᾱ(P) = (sig_α(P), sig_{α^2}(P), …, sig_{α^n}(P))
– ᾱ = (α, α^2, α^3, α^4, …, α^n); α_i = α^i
» α is a primitive element of the Galois field, e.g., α = 2.
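A minimal, self-contained Python sketch of this definition (my illustration, not the SDDS-2004 code): it builds GF(2^8) log/antilog tables for a primitive α = 2 and computes the n-symbol signature of a page of byte symbols. The primitive polynomial 0x11d and all names are my choices.

PRIM_POLY = 0x11d                 # x^8 + x^4 + x^3 + x^2 + 1
EXP = [0] * 512                   # antilog table, doubled to skip a modulo
LOG = [0] * 256

_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM_POLY
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]

def gf_mul(a, b):
    """Multiplication in GF(2^8) via the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def sig(page, alpha):
    """Component signature sig_alpha(P) = sum_i p_i * alpha^i (GF addition is XOR)."""
    s, power = 0, 1
    for p in page:                # page: bytes, the symbols p_0 .. p_(l-1)
        s ^= gf_mul(p, power)
        power = gf_mul(power, alpha)
    return s

def sig_vector(page, n, alpha=2):
    """n-symbol page signature (sig_alpha(P), sig_alpha^2(P), ..., sig_alpha^n(P))."""
    out, a = [], 1
    for _ in range(n):
        a = gf_mul(a, alpha)      # a runs over alpha, alpha^2, ..., alpha^n
        out.append(sig(page, a))
    return tuple(out)

# Example: sig_vector(b"qwerty", 4) returns a 4-symbol signature of the page.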

Page 154: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

157

Algebraic Signature Properties

Page length < 2^f - 1: detects all changes of up to n symbols.
Otherwise, the collision probability is 2^(-nf).
For a change Δ starting at symbol r:
sig_α(P') = sig_α(P) + α^r · sig_α(Δ)
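This delta property can be checked with the helpers from the sketch above (again my illustration; Delta is the symbol-wise XOR of the old and new symbols over the changed span):

def sig_after_change(old_sig, delta, r, alpha=2):
    """sig_alpha(P') = sig_alpha(P) + alpha^r * sig_alpha(Delta) in GF(2^8),
    where delta holds the XOR of old and new symbols starting at position r."""
    alpha_r = EXP[(LOG[alpha] * r) % 255]      # alpha^r
    return old_sig ^ gf_mul(alpha_r, sig(delta, alpha))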

Page 155: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

158

Algebraic Signature Properties

Signature tree: speeds up the comparison of signatures.

Page 156: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

159

Uses for Algebraic Signatures in SDDS

Bucket backup
Record updates
Weak, optimistic concurrency scheme
Protection of stored data against involuntary disclosure
Efficient scans
– Prefix match
– Pattern match (see VLDB 07)
– Longest common substring match
– …
Application-issued checking of stored record integrity

Page 157: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

160

Signatures for File Backup

Back up an SDDS bucket on disk.
The bucket consists of large pages.
Maintain the signatures of the pages on disk.
Only back up pages whose signature has changed.
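A minimal sketch of such a backup manager (my illustration; the function and parameter names are hypothetical, and any page-signature function, such as sig_vector above, can be plugged in):

def backup_changed_pages(pages, stored_sigs, signature, write_page):
    """pages: list of page contents; stored_sigs: dict page_no -> last backed-up
    signature; write_page(no, page) writes one page to the disk backup."""
    for no, page in enumerate(pages):
        s = signature(page)
        if stored_sigs.get(no) != s:      # page changed since the last backup
            write_page(no, page)
            stored_sigs[no] = s

# Usage sketch: backup_changed_pages(bucket_pages, sigs_on_disk,
#                                    lambda p: sig_vector(p, 4), disk_write)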

Page 158: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

161

Signatures for File Backup

[Figure: a RAM bucket of pages 1-7, their backup copies and signatures sig 1 to sig 7 on disk, and the backup manager. The application accesses page 2 without changing it and changes page 3; only page 3 gets a new signature sig 3, so the backup manager backs up page 3 alone.]

Page 159: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

162

Record Update with Signatures

The application requests record R.
The client delivers record R and stores its signature sig_before(R).
The application updates record R and hands it back to the client.
The client compares sig_after(R) with sig_before(R):
– it sends the update only if they differ.
This prevents the messaging of pseudo-updates.
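A small sketch of this client-side check (my illustration; the client API and all names are hypothetical):

def update_record(client, key, app_update, signature):
    """Read R, let the application rework it, and write it back only if the
    signature shows it really changed (pseudo-updates stay local)."""
    record = client.read(key)
    sig_before = signature(record)
    new_record = app_update(record)
    if signature(new_record) != sig_before:
        client.write(key, new_record)
    return new_record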

Page 160: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

163

Scans with Signatures

Scan = pattern matching in the non-key field.
Send the signature of the pattern
– from the SDDS client.
Apply a Karp-Rabin-like calculus at all the SDDS servers.
– See the paper for details.
Return the hits to the SDDS client.
Filter out the false positives
– at the client.

Page 161: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

164

Scans with Signatures

Client: look for "sdfg".
– Calculate the signature of "sdfg".
Server: the field is "qwertyuiopasdfghjklzxcvbnm".
– Compare with the signature of "qwer"
– Compare with the signature of "wert"
– Compare with the signature of "erty"
– Compare with the signature of "rtyu"
– Compare with the signature of "tyui"
– Compare with the signature of "uiop"
– Compare with the signature of "iopa"
– …
– Compare with the signature of "sdfg"  →  HIT
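A simple sketch of the server-side scan (my illustration, with hypothetical names; it recomputes each window signature naively rather than using the incremental Karp-Rabin-style or Boyer-Moore-style updates of the papers):

def scan_field(field, pattern, signature):
    """Return the positions whose window signature matches the pattern signature;
    hits may include false positives, filtered later at the client."""
    m = len(pattern)
    target = signature(pattern)
    return [i for i in range(len(field) - m + 1)
            if signature(field[i:i + m]) == target]

# With sig_vector from the earlier sketch, for instance:
# scan_field(b"qwertyuiopasdfghjklzxcvbnm", b"sdfg", lambda p: sig_vector(p, 2))
# returns [11]: the window starting at symbol 11 is "sdfg".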

Page 162: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

165

Record Update

SDDS updates only change the non-key field.
Many applications rewrite a record with the same value.
Record update in an SDDS:
– The application requests the record.
– The SDDS client reads record Rb.
– The application requests the update.
– The SDDS client writes record Ra.

Page 163: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

166

Record Update with Signatures

Weak, optimistic concurrency protocol:
– Read-calculation phase:
» the transaction reads records, calculates, reads more records;
» the transaction stores the signatures of the records it read.
– Verify phase: check the signatures of the read records; abort if any signature has changed.
– Write phase: commit the record changes.
Read-Committed isolation (ANSI SQL).
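A sketch of the three phases (my illustration; the client API, the names, and the per-key granularity are assumptions):

def optimistic_txn(client, keys, compute, signature):
    # Read-calculation phase: read the records and remember their signatures.
    records = {k: client.read(k) for k in keys}
    sigs = {k: signature(v) for k, v in records.items()}
    updates = compute(records)                    # the application's calculation
    # Verify phase: abort if any read record changed in the meantime.
    for k in keys:
        if signature(client.read(k)) != sigs[k]:
            return False                          # abort
    # Write phase: commit the changes.
    for k, v in updates.items():
        client.write(k, v)
    return True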

Page 164: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

167

Performance Results

1.8 GHz P4 machines on a 100 Mb/s Ethernet.
Records of 100 B with 4 B keys.
Signature size 4 B
– one backup collision every 135 years at 1 backup per second.

Page 165: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

168

Performance Results: Backups

Signature calculation: 20-30 msec per 1 MB.
Rather independent of the details of the signature scheme.
GF(2^16) slightly faster than GF(2^8).
The biggest performance issue is caching.
Compare to SHA-1 at 50 msec/MB.

Page 166: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

169

Performance Results: Updates

Run on a modified SDDS-2000.
Signature calculation
– 5 µsec/KB on a P4
– 158 µsec/KB on a P3
– caching is the bottleneck.
Updates
– normal update: 0.614 msec per 1 KB record
– pseudo-update: 0.043 msec per 1 KB record

Page 167: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

170

More on Algebraic Signatures

Page P: a string of l < 2^f - 1 symbols p_i, i = 0…l-1.
n-symbol signature base ᾱ:
– a vector ᾱ = (α_1 … α_n) of different non-zero elements of the Galois field GF(2^f).
The (n-symbol) signature of P based on ᾱ: the vector
sig_ᾱ(P) = (sig_{α_1}(P), sig_{α_2}(P), …, sig_{α_n}(P))
• where for each α_i:
sig_{α_i}(P) = Σ_{j=0..l-1} p_j · α_i^j

Page 168: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

171

The sig_α,n and sig_2,n Schemes

sig_α,n:
ᾱ = (α, α^2, α^3, …, α^n) with n << ord(α) = 2^f - 1.
• The collision probability is 2^(-nf) at best.

sig_2,n:
ᾱ = (α, α^2, α^4, α^8, …, α^(2^(n-1)))

Page 169: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

172

The sig_α,n and sig_2,n Schemes

• The randomization is possibly better for more than 2-symbol signatures, since all the α_i are primitive.
• In SDDS-2002 we use sig_α,n.
• It is computed in fact over p' = antilog p
• to speed up the multiplication.

Page 170: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

173

The sig_α,n Algebraic Signature

Consider that P1 and P2
– differ by at most n symbols,
– have no more than 2^f - 1 symbols.
The probability of collision is then 0.
A new property, at present unique to sig_α,n,
due to its algebraic nature.

Page 171: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

174

The sig_α,n Algebraic Signature

If P1 and P2 differ by more than n symbols, then the probability of collision reaches 2^(-nf).
Good behavior for cut/paste,
but not the best possible.
See our IEEE ICDE-04 paper for other properties.

Page 172: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

175

The sig_α,n Algebraic Signature: Application in SDDS-2004

Disk backup
– The RAM bucket is divided into pages,
– 4 KB at present.
– The Store command saves only the pages whose signature differs from the stored one.
– Restore does the inverse.

Page 173: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

176

The sig_α,n Algebraic Signature: Application in SDDS-2004

Updates
– Only effective updates leave the client,
» e.g. blind updates of a surveillance camera image.
– Only an update whose before-signature is that of the record at the server gets accepted,
» avoiding lost updates.

Page 174: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

177

The sig_α,n Algebraic Signature: Application in SDDS-2004

Non-key distributed scans
– The client sends to all the servers the signature S of the data to find, using:
– Total match
» the whole non-key field F matches S:
– S_F = S

Page 175: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

178

The sig_α,n Algebraic Signature: Application in SDDS-2004

Partial match
– S is equal to the signature S_f of a sub-field f of F.
– We first used a Karp-Rabin-like computation of S_f.
– Next, we used a much faster Boyer-Moore-like computation.
– The algorithm is among the fastest known
» VLDB-07 paper with R. Mokadem, Ph. Rigaux & Th. Schwarz.

Page 176: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

179

SDDS & P2P

A P2P architecture as support for an SDDS
– A node is typically both a client and a server.
– The coordinator is a super-peer.
– Client & server modules are Windows active services,
» running transparently for the user,
» referred to in the Start Up directory.
See:
– the PlanetLab project literature at UC Berkeley,
– J. Hellerstein's tutorial, VLDB 2004.

Page 177: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

180

SDDS & P2P

P2P node availability (churn)
– Much lower than traditional availability, for a variety of reasons
» (Kubiatowicz & al., OceanStore project papers).
A node can leave at any time
– letting its data be transferred to a spare,
– or taking the data with it.
LH*RS parity management seems a good basis to deal with all this.

Page 178: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

181

LH*RS P2P

Each node is a peer
– client and server.
A peer can be
– a (data) server peer: hosting a data bucket;
– a parity (server) peer: hosting a parity bucket
» LH*RS only;
– a candidate peer: willing to host.

Page 179: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

182

LH*RS P2P

A candidate node wishing to become a peer
– contacts the coordinator,
– gets an IAM message from some peer becoming its tutor,
» with the level j of the tutor and its bucket number a,
» and all the physical addresses known to the tutor.

Page 180: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

183

LH*RS P2P

– It adjusts its image,
– starts working as a client,
– and remains available for the "call for server duty"
» by multicast or unicast.

Page 181: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

184

LH*RS P2P

The coordinator chooses the tutor by LH over the candidate's address
– good balancing of the tutors' load.
A tutor notifies all its pupils, and its own client part, at its every split
– sending its new bucket level j.
The recipients adjust their images.
A candidate peer notifies its tutor when it becomes a server or parity peer.

Page 182: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

185

LH*RS P2P

End result
– Every key search needs at most one forwarding to reach the correct bucket,
» assuming the availability of the buckets concerned.
– The fastest search possible for any SDDS:
» otherwise, every split would need to be synchronously posted to all the client peers,
» contrary to the SDDS axioms.

Page 183: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

186

Churn in LH*RS P2P

A candidate peer may leave at any time without any notice
– the coordinator and the tutor will assume so if it does not reply to their messages,
– deleting the peer from their notification tables.
A server peer may leave in two ways
– with early notice to its group's parity servers
» the stored data move to a spare;
– without notice
» the stored data are recovered as usual for LH*RS.

Page 184: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

187

Churn in LH*RS P2P

Other peers learn that the data of a peer moved when they attempt to access the node of the former peer
– no reply, or another bucket found.
They address the query to any other peer in the recovery group.
This one resends it to the parity server of the group.
– An IAM comes back to the sender.

Page 185: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

188

Churn in LH*RS P2P

Special case
– A server peer S1 is cut off for a while; its bucket gets recovered at server S2 while S1 comes back to service.
– Another peer may still address a query to S1,
– getting perhaps outdated data.
The case existed for LH*RS, but may now be more frequent.
Solution?

Page 186: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

189

Churn in LH*RS P2P

Sure read
– The server A receiving the query contacts its availability-group manager
» one of the parity data managers;
» all these addresses may be outdated at A as well;
» then A contacts its group members.
The manager knows for sure
– whether A is an actual server,
– where the actual server A' is.

Page 187: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

190

Churn in LH*RS P2P

If A' ≠ A, then the manager
– forwards the query to A',
– informs A about its outdated status.
A' processes the query.
The correct server informs the client with an IAM.

Page 188: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

191

SDDS & P2P

SDDSs within P2P applications
– Directories for structured P2P systems
» LH* especially, versus DHT tables
– CHORD
– P-Trees
– Distributed backup and unlimited storage
» companies with local nets
» community networks
– Wi-Fi especially
– MS experiments in Seattle
Other suggestions?

Page 189: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

192

Popular DHT: Chord (from J. Hellerstein's VLDB 04 tutorial)

Consistent hashing + DHT.
Assume n = 2^m nodes for a moment
– a "complete" Chord ring.
Key c and node ID N are integers obtained by hashing into 0, …, 2^4 - 1
– 4 bits here.
Every key c should be at the first node N ≥ c
– modulo 2^m.
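A small sketch of this placement rule (my illustration; node IDs and keys are already hashed m-bit integers):

import bisect

def chord_successor(node_ids, key, m=4):
    """Return the node storing the key: the first node ID >= key, modulo 2^m."""
    ring = sorted(node_ids)
    i = bisect.bisect_left(ring, key % (2 ** m))
    return ring[i] if i < len(ring) else ring[0]    # wrap around the ring

# chord_successor([0, 3, 6, 9, 12], 7)  -> 9
# chord_successor([0, 3, 6, 9, 12], 14) -> 0   (wraps past the largest ID)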

Page 190: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

193

Popular DHT: Chord

Full finger DHT table at node 0.
Used for faster search.

Page 191: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

194

Popular DHT: Chord

Full finger DHT table at node 0.
Used for faster search.
Keys 3 and 7, for instance, searched from node 0.

Page 192: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

195

Popular DHT: Chord

Full finger DHT tables at all nodes.
O(log n) search cost
– in number of forwarding messages.
Compare to LH*.
See also P-trees
– VLDB-05 tutorial by K. Aberer
» in our course documentation.

Page 193: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

196

Churn in Chord

Node join in an incomplete ring
– The new node N' enters the ring between its (immediate) successor N and its (immediate) predecessor.
– It gets from N every key c ≤ N'.
– It sets up its finger table
» with the help of its neighbors.

Page 194: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

197

Churn in Chord

Node leave
– the inverse of node join.
To facilitate the process, every node also keeps a pointer to its predecessor.
Compare these operations to LH*.
Compare Chord to LH*.
High availability in Chord?
– A good question.

Page 195: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

198

DHT: Historical Notice

Invented by Bob Devine
– published in 1993 at FODO.
The source is almost never cited.
The concept was also used by S. Gribble
– for Internet-scale SDDSs,
– at about the same time.

Page 196: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

199

DHT: Historical Notice

Most folks incorrectly believe that DHTs were invented by Chord
– which initially cited neither Devine
– nor our SIGMOD & TODS SDDS papers.
– The reason?
» Ask the Chord folks.

Page 197: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

200

SDDS & Grid & Clouds…

What is a Grid?
– Ask I. Foster (University of Chicago).
What is a Cloud?
– Ask MS, IBM…
– Nevertheless, clouds are THE FUTURE of computing.

Page 198: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

201

SDDS & Grid & Clouds…

The world is supposed to benefit from power grids and from data grids & clouds & SaaS.
Differences between a grid et al. and a P2P net?
– Local autonomy?
– Computational power of the servers?
– Number of available nodes?
– Data availability & security?

Page 199: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

202

SDDS & Grids & Clouds…

Maui High Performance Computing Center
– Grids?
» Tempest: a network of large SP2 supercomputers
» a 512-node bi-processor Linux cluster
SDDS storage is an infrastructure for data in grids & clouds
– perhaps even easier to apply than to P2P
» less server autonomy: better for stored data security
– necessary for modern applications.

Page 200: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

203

SDDS & Grids & Clouds…

Some dedicated applications we have been looking at:
– Google with BigTable
– SkyServer (J. Gray & Co)
– the Virtual Telescope
– streams of particles (CERN)
– biocomputing (genes, image analysis…)
– HadoopDB

Page 201: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

204

SDDS & Grids & Clouds…

Cloud computing
– Azure with Tables and SQL Services
– Non-SQL databases
» e.g. Facebook's Cassandra over Hadoop
– Amazon
» custom modified consistent hashing (?) and a SaaS interface

Page 202: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

205

SDDS & Grids & Clouds…

IBM BlueCloud
Cloudera with Hadoop
Google Appliances with BigTable
…
See elsewhere in the course and on the Web sites.

Page 203: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

206

Conclusion

SDDS in 2010
- Research has demonstrated the initial objectives,
- including Jim Gray's expectations:
- distributed RAM-based access can be up to 100 times faster than access to a local disk;
- the response time may go down, e.g.,
- from 2 hours to 1 min.

Page 204: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

207

Conclusion

SDDS in 2010
- The data collection can be almost arbitrarily large.
- It can support various types of queries:
- key-based, range, k-dim, k-NN…
- various types of string search (pattern matching),
- SQL.
- The collection can be k-available.
- It can be secure.
- …

Page 205: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

208

Conclusion

SDDS in 2010
- Several variants of LH* and RP*.
- New schemes:
- SD-Rtree
- IEEE Data Eng. 07
- VLDB Journal (to appear)
- with C. du Mouza, Ph. Rigaux, Th. Schwarz
- CTH*, IH, BATON, VBI
- see the ACM Portal for references.

Page 206: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

209

Conclusion

SDDS in 2010
- Several variants of LH* and RP*.
- Database schemes: SD-SQL Server.
- 20 000+ estimated references on Google.
- Scalable high availability.
- Signature-based backups and updates.
- P2P management, including churn.
- Cloud & Grid ready.

Page 207: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

210

Conclusion

SDDS in 2010
- Pattern matching using algebraic signatures
- over encoded stored data
- using non-indexed n-grams
- see VLDB 08
- with R. Mokadem, C. du Mouza, Ph. Rigaux, Th. Schwarz
- over indexed n-grams
- AS-Index (published in CIKM 2010)
- with C. du Mouza, Ph. Rigaux, Th. Schwarz

Page 208: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

211

Conclusion

The proofs of the SDDS properties are of the "proof of concept" type
- as usual for research.
The domain is ready for industrial portage
and for industrial-strength applications.

Page 209: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

212

Recent Research at Dauphine & al.

AS-Index
– with Santa Clara U., CA
SD-Rtree
– with CNAM
LH*RS P2P
– thesis by Y. Hanafi
– with Santa Clara U., CA
HDRAM
– 2009 Ph.D. by D. Cieslicki
– with Santa Clara University

Page 210: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

213

Recent Research at Dauphine & al.

Data deduplication
– with SSRC
» UC Santa Cruz
LH*RE
– with CSIS, George Mason U., VA
– with Santa Clara U., CA

Page 211: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

214

Credits: Research

LH*RS: Rim Moussa (Ph.D. thesis to be defended in Oct. 2004)
SDDS 200X design & implementation (CERIA)
» J. Karlson (U. Linköping, 1st LH* impl., Ph.D., now Google Mountain View)
» F. Bennour (LH* on Windows, Ph.D.)
» A. Wan Diene (CERIA, U. Dakar: SDDS-2000, RP*, Ph.D.)
» Y. Ndiaye (CERIA, U. Dakar: AMOS-SDDS & SD-AMOS, Ph.D.)
» M. Ljungström (U. Linköping, 1st LH*RS impl., Master's thesis)
» R. Moussa (CERIA: LH*RS, Ph.D.)
» R. Mokadem (CERIA: SDDS-2002, algebraic signatures & their apps, Ph.D., now U. Paul Sabatier, Toulouse)
» B. Hamadi (CERIA: SDDS-2002, updates, research internship)
» See also the CERIA Web page at ceria.dauphine.fr
SD-SQL Server
– Soror Sahri (CERIA, Ph.D.)

Page 212: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

215

Credits: Funding

– CEE eGov-Bus project
– Microsoft Research
– CEE ICONS project
– IBM Research (Almaden)
– HP Labs (Palo Alto)

Page 213: 1 Scalable Distributed Data Structures Part 2 Witold Litwin Witold.Litwin@dauphine.fr

216

END

Thank you for your attention

Witold Litwin
Witold.Litwin@dauphine.fr
