33
Scalaris: Reliable Transactional P2P Key/Value Store Thorsten Schütt 1 Thorsten Schütt, Florian Schintke , Alexander Reinefeld Zuse Institute Berlin (ZIB) onScale solutions

Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Scalaris: Reliable Transactional

P2P Key/Value Store

Thorsten Schütt 1

Thorsten Schütt, Florian Schintke , Alexander Reinefeld

Zuse Institute Berlin (ZIB)

onScale solutions

Page 2: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Background

• CSR – Parallel and Distributed Computing

• P2P Computing, Structured Overlay Networks

• Need Simulators

• Experiments (empirical evaluation)

Thorsten Schütt 2

• Experiments (empirical evaluation)

• Abstract from certain details

• Need Implementations (EU projects)

• „production-quality“ code

Page 3: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Scalaris vs. Mnesia

• Distributed database

• Partitioning

• Distributed database

• Partitioning (table

fragmentation)

Thorsten Schütt 3

• Transactions

• Key-Value Store

• More scalable

• Transactions

• Complex Data Model

Page 4: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Our Approach

Clients

Key/Value Store

(simple DBMS)

Thorsten Schütt 4

P2P OverlayP2P Overlay

Transactions Transactions

ReplicationReplication

ACID

Page 5: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Scalaris

Thorsten Schütt 5

Scalaris

P2P Overlay

Page 6: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Chord#

• A dictionary has 3 ops:

• insert(key, value)

• delete(key)

• lookup(key)

• Chord# implements a distributed dictionary

Thorsten Schütt 6

• Chord implements a distributed dictionary

Key Value

Clarke 2007

Allen 2006

Bachman 1973

Thompson 1983

Knuth 1974

Codd 1981

node A

node D

node B

node C

Each node only

stores part of the

data

Page 7: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Backus

Codd

Dijkstra

FlyodRivest

Wirth

Yao

Chord#

• Key Space is an arbitrary set

with a total order, e.g.

strings

• Every node is assigned a

Thorsten Schütt 7

Gray

Karp

Knuth

Milner

Minsky

Naur

Nygaard

Perlis

Ritchie

• Every node is assigned a

random key

• Nodes form logical ring

over the key space

Page 8: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Backus

Distributed Dictionary

• Items are stored on their

successor, i.e. the first node

encountered in clockwise

direction

Thorsten Schütt 8

Backus

Codd

Ritchie

Rivest

Wirth

Yao

(Thompson, 1983)

(Clarke, 2007)

(Allen, 2006)

direction

– Thomson on Wirth

– Allen on Backus

– Bachman on Backus

– Clarke on Codd

– …

(Bachman, 1973)

Page 9: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

• Each node puts the successor and log2N exponentially spaced fingers in its

routing table

With each routing step using greedy routing,

Distributed Dictionary

Backus

CoddYao

Thorsten Schütt 9

• With each routing step using greedy routing,

the distance between the currrent node

and the destination is halved.

=> Yields O(log2N) hops.

Dijkstra

Flyod

Gray

Karp

Knuth

Milner

Minsky

Naur

Nygaard

Perlis

Ritchie

Rivest

Wirth

Page 10: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Summary: Chord#

• DHT is a fully decentralized data structure

• DHTs self-organize as nodes

• Only one hop for calculating a routing pointer, Chord needs log(N) hops.

DHTs/Chord Chord#

Thorsten Schütt 10

• DHTs self-organize as nodes join, leave, and fail

• All operations only require local knowledge

• max. log (N) hops, while Chord guarantees this only with “high probability”.

• Can adapt to imbalances in the query load; Chord can’t.

• Supports range queries.

Page 11: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Scalaris

Thorsten Schütt 11

Scalaris

Transactions

Page 12: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Transactions + Replicas

START

debit (a, 100);

START

debit (a1, 100);

debit (a2, 100);

debit (a3, 100);

Thorsten Schütt 12

deposit (b, 100);

COMMIT

3

deposit (b1, 100);

deposit (b2, 100);

deposit (b3, 100);

COMMIT

Page 13: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Adapted Paxos Commit

• Optimistic CC with fallback

• Write

• 3 rounds

• non-blocking (fallback)

• Read even faster

Leader

replicated

Transaction

Managers

(TMs)

Items at

Transaction

Participants

(TPs)

1. Step

2. Step

1. Step:

O(log N) hops

1. Step:

O(log N) hops

Thorsten Schütt 13

• Read even faster

• reads majority of replicas

• just 1 round

• succeeds when

>f/2 nodes alive

2. Step

3. Step

4. Step

5. Step

6. Step

After majorityAfter majority

2.-6. Step:

O(1) hops

2.-6. Step:

O(1) hops

Page 14: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Erlang

Thorsten Schütt 14

Erlang

Page 15: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Motivation for Using Erlang

• Fit our model

• Processes + Message Passing

• Allows simple testing/simulating

• Light-Weight Processes

Thorsten Schütt 15

• Light-Weight Processes

• 100s of nodes on a laptop

• Same code runs:

• locally and

• on a cluster

• Simulator => Application

Page 16: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Distributed Erlang

Too Powerfull:

• We want to implement our own Failure Detectors

• Part of the Algorithms

Too weak:

Thorsten Schütt 16

Too weak:

• SSL + X509?

• Scalability?

⇒TCP-based „reimplementation“

⇒Message statistics for free

Page 17: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Transactions in mnesia

raise(Eno, Raise) ->

F = fun() ->

[E] = mnesia:read(employee, Eno, write),

Thorsten Schütt 17

[E] = mnesia:read(employee, Eno, write),

Salary = E#employee.salary + Raise,

New = E#employee{salary = Salary},

mnesia:write(New)

end,

mnesia:transaction(F).

Page 18: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Transactions in Scalaris

F = fun (TransLog) ->{X, TL1} = read(TransLog, "Account A"),{Y, TL2} = read(TL1, "Account B"),if X > 100 ->

Thorsten Schütt 18

X > 100 ->TL3 = write(TL2, "Account A", X - 100),TL4 = write(TL3, "Account B", Y + 100){ok, TL4};

true ->{ok, TL2};

endend,transaction:do_transaction(F, ...).

Page 19: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Supervisor Tree

• Components

– Chord#

– Load-Balancing

– Transaction Framework

• OTP behaviours

Thorsten Schütt 19

• OTP behaviours

– supervisor

– gen_server

Page 20: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Erlang Quirks

• Flat Namespace for Processes

• U. Wiger. „Extended Process Registry for Erlang.“ Erlang WS 2007.

• Error Messages (Line Number?)

• I can translate gcc to english (took me 2 years)

Thorsten Schütt 20

• I can translate gcc to english (took me 2 years)

• Packages, e.g. transstore.transaction_api:do/1

• Sub-Directories

o Cover?

o Dialyzer?

Page 21: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Wikipedia

Thorsten Schütt 21

Wikipedia

Page 22: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Wikipedia

Top 10 Web sites

1. Yahoo!

2. Google

3. YouTube

4. Windows Live

5. Facebook

6. MSN

7. Myspace

50.000 requests/sec

– 95% are answered by squid proxies

– only 2,000 req./sec hit the backend

Thorsten Schütt 22

7. Myspace

8. Wikipedia

9. Blogger.com

10. Yahoo!カテゴリ

Page 23: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

The Wikipedia System Architecture

other

Thorsten Schütt 23

web servers

search servers

NFS

Page 24: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Data Model

Wikipedia

– SQL DB

Chord#

– Key-Value Store

Thorsten Schütt 24

Map Relations to Key-Value Pairs

– (Title, List of Versions)

– (CategoryName, List of Titles)

– (Title, List of Titles) //Backlinks

CREATE TABLE /*$wgDBprefix*/page (

page_id int unsigned NOT

NULL auto_increment,

page_namespace int NOT NULL,

...

Page 25: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Data Model

void updatePage(string title, int oldVersion, string newText)

{

//new transaction

Transaction t = new Transaction();

//read old version

Page p = t.read(title);

//check for concurrent update

if(p.currentVersion != oldVersion)

Thorsten Schütt 25

t.abort();

else{

//write new text

t.write(p.add(newText));

//update categories

foreach(Category c in p)

t.write(t.read(c.name).add(title));

//commit

t.commit();

}

}

Page 26: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Wiki

Database:

• Chord#

• Mapping

– Wiki -> Key-Value Store

Thorsten Schütt 26

Renderer:

• Java

– Tomcat

– Plog4u

• Jinterface

– Interface to Erlang

Page 27: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only
Page 28: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

IEEE Scale Challenge 2008

Live Demos

• Bavarian

• Simple English

• Browsing

• Deployments

– Planet-Lab

• 20 nodes

– Cluster

• 320 nodes in Berlin

Thorsten Schütt 28

• Browsing

• Editing with full History

• Category-Pages

• 320 nodes in Berlin

Page 29: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Summary

• Previously, P2P was mainly used for file sharing (read only).

• We support consistent, distributed write operations.

DHT + Transactions = Scalable, Reliable, Efficient Key/Value Store

Thorsten Schütt 29

• We support consistent, distributed write operations.

• Numerous applications

• Internet databases, transactional online-services, …

Page 30: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Team

• Thorsten Schütt

• Florian Schintke

• Monika Moser

• Stefan Plantikow

• Alexander Reinefeld

Thorsten Schütt 30

• Alexander Reinefeld

• Nico Kruber

• Christian von Prollius

• Seif Haridi (SICS)

• Ali Ghodsi (SICS)

• Tallat Shafaat (SICS)

Page 31: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

PublicationsChord#

T. Schütt, F. Schintke, A. Reinefeld.

A Structured Overlay for Multi-dimensional

Range Queries. Euro-Par, August 2007.

T. Schütt, F. Schintke, A. Reinefeld.

Structured Overlay without Consistent

Hashing: Empirical Results. GP2PC, May 2006.

Talks / Demos

IEEE Scale Challenge, May 2008

1st price (live demo)

Thorsten Schütt 31

TransactionsM. Moser, S. Haridi.

Atomic Commitment in Transactional DHTs.

1st CoreGRID Symposium, August 2007.

T. Shafaat, M. Moser, A. Ghodsi , S. Haridi,

T. Schütt, A. Reinefeld. Key-Based Consistency

and Availability in Structured Overlay Networks.

Infoscale, June 2008.

WikiS. Plantikow, A. Reinefeld, F. Schintke.

Transactions for Distributed Wikis on Structured Overlays.

DSOM, October 2007. http://code.google.com/p/scalaris

Page 32: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Questions

Thorsten Schütt 32

Questions

Page 33: Scalaris: Reliable Transactional P2P Key/Value Storeerlang.org/workshop/2008/Sess22.pdf · 2008. 10. 5. · Myspace 50.000 requests/sec – 95% are answered by squid proxies – only

Multi Data-Center Scenario

• Multi-Data-Center Scenarios– Optimize for Latency

– Increase Availability

• Prefix Articles with– Language

Thorsten Schütt 33

– Language

– Replicas Number

• E.g. „de:Main Page“– 5 replicas

– 2 in Germany

– 1 in UK

– 1 in USA

– 1 in Asia

„0de:Main Page“, „1de:Main Page“, „2de:Main Page“, …