
Preventive Replication in Database Cluster

Esther Pacitti, Cedric Coulon, Patrick Valduriez, M. Tamer Özsu* LINA / INRIA – Atlas Group

University of Nantes - France; *University of Waterloo - Canada

July 2005


Outline

- Motivations
- Cluster Architecture
- Preventive Replication
- Multi-Master Partially Replicated Configurations
- Replication Manager Architecture
- Optimizations
- RepDB* Prototype
- Experiments
- Conclusions, Current and Future Work


Motivations

- Applications and data are asynchronously replicated across a set of cluster nodes connected by a fast, reliable network, to improve the response time of user requests.
- Lazy preventive replication is used to enforce data consistency.

[Figure: a cluster of n PC nodes receiving external user requests]


Cluster Architecture

[Figure: cluster system architecture. Nodes 1 to n are connected by a fast network. Components shown: Request Router, Application Manager, Transaction Load Balancer, Replication Manager, DBMS, Current Load Monitor, Global User Directory, Global Data Placement.]


Preventive Replication (1)

Properties:
- Strong consistency
- Non-blocking
- Scale-up and speed-up
- High data availability


Preventive Replication (2)

Assumptions:
- The network interface provides reliable FIFO multicast.
- Max is the upper bound on the time needed to multicast a message from a node i until it is received at a node j.
- Clocks are ε-synchronized.
- Each transaction has a timestamp C (its arrival time).


Preventive Replication (3)

Consistency criterion, total order enforcement: transactions are delivered in the same order at all involved nodes, and this order corresponds to the execution order.

To enforce total order, transactions are ordered chronologically at each node by their delivery_time value:

delivery_time = C + Max + ε

[Figure: T is received at node i and multicast to node j; T waits until its delivery_time before it is processed]
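The ordering rule can be sketched in Java, the language of the RepDB* prototype. This is a minimal sketch, not RepDB* code: the class names `Txn` and `TotalOrder`, the millisecond units, and the tie-break on the origin node id for equal timestamps are assumptions. Because Max + ε is the same constant at every node, sorting by delivery_time gives the same order as sorting by C.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical transaction record: origin node and timestamp C (arrival time).
final class Txn {
    final String id;
    final int originNode;
    final long c; // timestamp C in ms, from the origin node's clock

    Txn(String id, int originNode, long c) {
        this.id = id;
        this.originNode = originNode;
        this.c = c;
    }

    // delivery_time = C + Max + epsilon
    long deliveryTime(long max, long epsilon) {
        return c + max + epsilon;
    }
}

final class TotalOrder {
    // Chronological order on C; equal timestamps are broken by origin node id
    // (an assumption, the slides do not say how ties are resolved).
    static final Comparator<Txn> CHRONOLOGICAL =
            Comparator.comparingLong((Txn t) -> t.c).thenComparingInt(t -> t.originNode);

    // Returns transaction ids in the order every node processes them,
    // whatever order they were received in.
    static List<String> order(List<Txn> received) {
        List<Txn> sorted = new ArrayList<>(received);
        sorted.sort(CHRONOLOGICAL);
        List<String> ids = new ArrayList<>();
        for (Txn t : sorted) {
            ids.add(t.id);
        }
        return ids;
    }
}
```

Two nodes that receive the same three transactions in different orders compute the same sequence, which is the property the delivery_time rule is meant to provide.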


Preventive Replication (4)

Whenever a node i receives a transaction T:
- Propagation: node i multicasts T to all nodes, including itself.
- Scheduling: at each node, T is selected when its delivery_time expires, that is, when it is the oldest pending transaction.
- Execution: when T's delivery_time expires, T is executed in its entirety.
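A single node's scheduler can be sketched as follows, assuming the slide's bounds hold (every multicast arrives within Max and clocks are ε-synchronized). `PreventiveScheduler`, `Pending` and `releaseReady` are hypothetical names, and the tie-break on origin node id is an assumption. The point of the sketch is that two nodes receiving the same transactions in different orders still release them in the same order once their delivery_time has expired.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical single-node scheduler for preventive replication.
final class PreventiveScheduler {
    static final class Pending {
        final String id;
        final long c;          // timestamp C assigned at the origin node
        final int originNode;  // tie-breaker for equal C (assumption)

        Pending(String id, long c, int originNode) {
            this.id = id;
            this.c = c;
            this.originNode = originNode;
        }
    }

    private final long max;
    private final long epsilon;
    // Pending transactions, oldest first.
    private final PriorityQueue<Pending> queue = new PriorityQueue<>(
            Comparator.comparingLong((Pending p) -> p.c).thenComparingInt(p -> p.originNode));

    PreventiveScheduler(long max, long epsilon) {
        this.max = max;
        this.epsilon = epsilon;
    }

    // Propagation: called on every node, including the origin, when the multicast arrives.
    void receive(Pending t) {
        queue.add(t);
    }

    // Scheduling: at local time 'now', release every transaction whose
    // delivery_time (C + Max + epsilon) has expired, oldest first.
    // Each released transaction would then be executed entirely.
    List<String> releaseReady(long now) {
        List<String> ready = new ArrayList<>();
        while (!queue.isEmpty() && queue.peek().c + max + epsilon <= now) {
            ready.add(queue.poll().id);
        }
        return ready;
    }
}
```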


Partial Architecture


Preventive Replication (4)

- PRIMARY copies (R): can be updated only on the master node.
- Secondary copies (r): read-only.
- MULTIMASTER copies (R1): can be updated on more than one node.

[Figure: example configurations. Bowtie: primary copies R and S with secondary copies r', s' and r'', s''. Fully replicated: R1, S1 to R4, S4. Partially replicated: R1, S1; S2; R2. Partially replicated: R1, S; R2, s'; R3; s''.]
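The three copy types can be modeled as a per-table check of which nodes may update the table. `CopyType`, `CopyRules` and the consistency rule (a table with a primary copy has exactly one updater, its master node) are an illustrative reading of the slide, not RepDB* code.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the copy types listed on the slide.
enum CopyType { PRIMARY, SECONDARY, MULTIMASTER }

final class CopyRules {
    // Nodes allowed to update a table, given the copy type each node holds.
    // Secondary copies are read-only; primary and multi-master copies are updatable.
    static Set<String> updaters(Map<String, CopyType> copiesByNode) {
        Set<String> nodes = new TreeSet<>();
        for (Map.Entry<String, CopyType> e : copiesByNode.entrySet()) {
            if (e.getValue() != CopyType.SECONDARY) {
                nodes.add(e.getKey());
            }
        }
        return nodes;
    }

    // Assumed sanity rule: a primary copy means a single master node,
    // so no other node may also update the table.
    static boolean isConsistent(Map<String, CopyType> copiesByNode) {
        boolean hasPrimary = copiesByNode.containsValue(CopyType.PRIMARY);
        return !hasPrimary || updaters(copiesByNode).size() == 1;
    }
}
```

In the Bowtie configuration, R has one PRIMARY copy and two SECONDARY copies (r', r''), so only the master node is an updater; a MULTIMASTER table such as R1/R2 has several.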


Preventive Replication (5)

- Introduces a Max + ε delay: negligible in cluster networks, but critical under bursty workloads.
- Data placement restrictions: only lazy-master and fully replicated configurations.
- In fully replicated configurations, message exchanges add overhead, and not all nodes may have enough space to store all replicas.

=> Free data placement


Partially Replicated Configurations (1)

When data are not fully replicated, some transactions cannot be executed on all target nodes.

Example: UPDATE R SET c1 = <value> WHERE c2 IN (SELECT c3 FROM S);

[Figure: transaction T1(R, S) submitted to nodes N1, N2 and N3, which hold the copies R1, S1; S2; and R2]
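A node can run a transaction itself only if it holds every table the transaction reads and writes. The sketch below, with hypothetical names (`TargetCheck`, `Role`), classifies each node for a transaction given its read and write sets; treating "holds a table in the write set" as the definition of a target node is an assumption.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical classification of nodes for one transaction under partial replication.
final class TargetCheck {
    enum Role { EXECUTE, WAIT_FOR_REFRESH, NOT_INVOLVED }

    static Map<String, Role> classify(Map<String, Set<String>> tablesByNode,
                                      Set<String> readSet, Set<String> writeSet) {
        Map<String, Role> roles = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : tablesByNode.entrySet()) {
            Set<String> held = e.getValue();
            // A target node holds a copy of at least one table the transaction writes.
            boolean isTarget = writeSet.stream().anyMatch(held::contains);
            Role role;
            if (!isTarget) {
                role = Role.NOT_INVOLVED;
            } else if (held.containsAll(readSet) && held.containsAll(writeSet)) {
                role = Role.EXECUTE;           // all needed tables are local
            } else {
                role = Role.WAIT_FOR_REFRESH;  // must wait for the refresh transaction
            }
            roles.put(e.getKey(), role);
        }
        return roles;
    }
}
```

Under an assumed placement where N1 holds R and S, N2 holds only R and N3 holds only S, T1 (reads S, writes R) classifies N1 as EXECUTE, N2 as WAIT_FOR_REFRESH and N3 as NOT_INVOLVED.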


Partially Replicated Configurations (2)

- On target nodes, T1 waits after it has been selected (Step 3).
- At the end of T1's execution on the origin node, a refresh transaction (RT1) is multicast to the target nodes (Step 4).
- RT1 is executed to update the replicated data.

[Figure, Steps 1 to 5, on nodes N1 (R1, S1), N2 and N3 (R2, S2): a client submits T1(read S, write R); target nodes stand by (Step 3); RT1(write R) is multicast (Step 4) and performed (Step 5); the client receives the answer to T1.]
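The refresh transaction can be sketched as the set of after-images that T1 wrote, keyed by primary key, which a target node applies without needing S. `RefreshDemo`, `RefreshTxn` and the simplified T1 (mark every row of R whose key also appears in S) are hypothetical; rows are modeled as strings in an in-memory map rather than DBMS tables.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class RefreshDemo {
    // Hypothetical refresh transaction: after-images of the rows written, by primary key.
    static final class RefreshTxn {
        final String table;
        final Map<Integer, String> afterImages;

        RefreshTxn(String table, Map<Integer, String> afterImages) {
            this.table = table;
            this.afterImages = Map.copyOf(afterImages);
        }

        // Target node: install the after-images; no read of S is required.
        void applyTo(Map<Integer, String> localCopy) {
            localCopy.putAll(afterImages);
        }
    }

    // Origin node: run a simplified T1 (reads S, writes R) and capture its writes on R.
    static RefreshTxn executeT1(Map<Integer, String> r, Set<Integer> sKeys) {
        Map<Integer, String> writes = new HashMap<>();
        for (Integer pk : r.keySet()) {
            if (sKeys.contains(pk)) {
                writes.put(pk, "updated");
            }
        }
        r.putAll(writes);
        return new RefreshTxn("R", writes);
    }
}
```

Shipping after-images keyed by primary key is one way to make the refresh independent of S; it also relies on every table having a primary key.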


Data Placement

- Tables must have a primary key.
- A node i cannot hold primary copies of tables that have foreign keys referencing tables not held by node i.

[Figure: copies of ITEM, ORDER and order placed on nodes N1, N2 and N3. Placing ORDER on N3 without ITEM is invalid: on N3, an order could be placed for an item that does not exist.]
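The foreign-key rule lends itself to a static check over a proposed placement. `PlacementCheck` is a hypothetical helper, not part of RepDB*: it reports each primary copy on a node whose foreign-key target table is not held on that node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PlacementCheck {
    // foreignKeys maps a table to the tables it references (hypothetical input format).
    static List<String> violations(String node, Set<String> primaryTables,
                                   Set<String> heldTables, Map<String, Set<String>> foreignKeys) {
        List<String> out = new ArrayList<>();
        for (String table : primaryTables) {
            for (String referenced : foreignKeys.getOrDefault(table, Set.of())) {
                if (!heldTables.contains(referenced)) {
                    out.add(node + ": primary copy of " + table
                            + " references " + referenced + ", which is not held");
                }
            }
        }
        return out;
    }
}
```

With a foreign key from ORDER to ITEM, a node holding the primary copy of ORDER but not ITEM yields one violation, which is the N3 case in the figure; a node holding both yields none.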


Replication Manager Architecture


Optimization: Eliminating delay times (1)

In a cluster network, messages usually arrive in total order, so:
- Schedule a transaction in parallel with its execution: submit it for execution as soon as it is received.
- Schedule the commit order of transactions: a transaction can be committed only after Max + ε.
- When a transaction is received out of order, abort and re-execute all younger transactions.
- Execute non-conflicting transactions concurrently.
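The optimized scheduling can be sketched as follows, under simplifying assumptions: timestamps C are unique and serve as keys, "execution" is only recorded in a trace, and every younger running transaction is aborted on an out-of-order arrival, without the conflict test that would let non-conflicting transactions proceed. `OptimisticScheduler` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class OptimisticScheduler {
    private final long maxPlusEpsilon;
    // Transactions submitted to execution but not yet committed, keyed by timestamp C.
    private final TreeMap<Long, String> running = new TreeMap<>();
    private final List<String> log = new ArrayList<>(); // trace of actions, for illustration

    OptimisticScheduler(long maxPlusEpsilon) {
        this.maxPlusEpsilon = maxPlusEpsilon;
    }

    // Submit T for execution as soon as it is received. Younger running transactions
    // mean T arrived out of order: abort them, run T, then re-execute them.
    void receive(long c, String id) {
        List<String> younger = new ArrayList<>(running.tailMap(c, false).values());
        for (String y : younger) {
            log.add("abort " + y);
        }
        running.put(c, id);
        log.add("execute " + id);
        for (String y : younger) {
            log.add("re-execute " + y);
        }
    }

    // Commit, in timestamp order, every transaction whose C + Max + epsilon has passed.
    List<String> commitReady(long now) {
        List<String> committed = new ArrayList<>();
        while (!running.isEmpty() && running.firstKey() + maxPlusEpsilon <= now) {
            Map.Entry<Long, String> oldest = running.pollFirstEntry();
            committed.add(oldest.getValue());
            log.add("commit " + oldest.getValue());
        }
        return committed;
    }

    List<String> log() {
        return log;
    }
}
```

If T2 (C = 105) arrives before T1 (C = 100) with Max + ε = 12, the trace is: execute T2, abort T2, execute T1, re-execute T2; T1 can then commit at time 112 and T2 at 117.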


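Optimization: Eliminating delay times (2)

[Figure: timelines of scheduling, execution and validation of T. Preventive replication: the steps run one after another. Optimized preventive replication: scheduling and execution of T run in parallel, followed by validation, or abort.]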


Optimisation Example (3)


Optimization: Eliminating delay times (4)

Without the optimization, the refreshment time of a transaction T is always Max + ε + t.

With the optimization, the refreshment time of T is max(Max + ε, t),

where t is the time spent executing T.
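A worked comparison of the two formulas, using hypothetical values in milliseconds. `RefreshTime` is an illustrative helper, not RepDB* code.

```java
final class RefreshTime {
    // Without the optimization: wait Max + epsilon, then execute T for t.
    static long withoutOptimization(long maxPlusEpsilon, long t) {
        return maxPlusEpsilon + t;
    }

    // With the optimization: execution overlaps the wait, and commit needs both to finish.
    static long withOptimization(long maxPlusEpsilon, long t) {
        return Math.max(maxPlusEpsilon, t);
    }
}
```

With Max + ε = 10 and t = 4, the refreshment time drops from 14 to 10; with t = 25 it drops from 35 to 25. In general the saving is min(Max + ε, t).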


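RepDB* Prototype: Architecture

[Figure: RepDB* architecture. DBMS clients connect through the Replica Interface (JDBC server); other components shown: Log Monitor (DBMS specific), Propagator, Receiver, Refresher and Deliver, connected to the network; RepDB* accesses the DBMS through JDBC.]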


RepDB* Prototype: Implementation

- Java (around 10,000 lines)
- The DBMS is treated as a black box
- JDBC interface (RMI-JDBC)
- The Spread toolkit (Center for Networking and Distributed Systems, CNDS) manages the network
- Simulation version (SimJava)
- http://www.sciences.univnantes.fr/ATLAS/RepDB


Replicas definition (1)

A file contains the replica placement specification:

<NODE name='node1'>
  <MASTER>R</MASTER>
  <MASTER>S</MASTER>
  <SLAVE>T</SLAVE>
</NODE>
<NODE name='node2'>
  <MASTER>R</MASTER>
  <SLAVE>S</SLAVE>
</NODE>
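The placement file can be read with the JDK's standard DOM parser. Two assumptions apply: the excerpt above has no root element, so the sketch expects the NODE elements inside a hypothetical `<PLACEMENT>` root; and MASTER is read as an updatable copy and SLAVE as a read-only copy. `PlacementFile.parse` is a hypothetical helper that returns, for each node name, a map from table name to MASTER or SLAVE.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class PlacementFile {
    static Map<String, Map<String, String>> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, Map<String, String>> placement = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("NODE");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element node = (Element) nodes.item(i);
                Map<String, String> roles = new LinkedHashMap<>();
                NodeList children = node.getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // Element name is MASTER or SLAVE; its text is the table name.
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        roles.put(child.getTextContent().trim(), child.getNodeName());
                    }
                }
                placement.put(node.getAttribute("name"), roles);
            }
            return placement;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("unreadable placement file", e);
        }
    }
}
```

For the example above, R is MASTER on both node1 and node2 (a multi-master table), while S is MASTER on node1 and SLAVE on node2.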


Interface: Applications / RepDB* (2)

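import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Load the RepDB* JDBC driver and connect to a cluster node
Class.forName("org.atlas.repdb.jdbc.Driver");
Connection c = DriverManager.getConnection("jdbc:repdb://node0:4444/", "login", "password");
Statement s = c.createStatement();
// The prefix declares the tables the transaction writes (R, S) and reads (T)
s.executeUpdate(
    "<WRITE>R, S</WRITE><READ>T</READ>"
    + "UPDATE R SET att2 = 1 WHERE att1 IN (SELECT att3 FROM T); "
    + "UPDATE S SET att2 = 1 WHERE att1 NOT IN (SELECT att3 FROM T);");
s.close();
c.close();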


Experiments (1): TPC-C benchmark

- 1 / 5 / 10 warehouses
- 10 clients per warehouse
- Transaction arrival interval: 1 s / 200 ms / 100 ms
- 4 types of transactions:
  - New-order: read-write, high frequency (45%)
  - Payment: read-write, high frequency (45%)
  - Order-status: read-only, low frequency (5%)
  - Stock-level: read-only, low frequency (5%)
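The transaction mix above can be reproduced by a client driver that draws a type for each request. `TpccMix` is a hypothetical helper, not part of the benchmark driver used in the experiments; it maps a uniform draw in [0, 100) onto the 45/45/5/5 split.

```java
import java.util.Random;

final class TpccMix {
    enum TxType { NEW_ORDER, PAYMENT, ORDER_STATUS, STOCK_LEVEL }

    // Map a draw p in [0, 100) onto the 45/45/5/5 mix.
    static TxType fromPercent(int p) {
        if (p < 0 || p >= 100) {
            throw new IllegalArgumentException("p must be in [0, 100)");
        }
        if (p < 45) return TxType.NEW_ORDER;
        if (p < 90) return TxType.PAYMENT;
        if (p < 95) return TxType.ORDER_STATUS;
        return TxType.STOCK_LEVEL;
    }

    // Draw the type of the next client request.
    static TxType next(Random rnd) {
        return fromPercent(rnd.nextInt(100));
    }

    // Order-status and Stock-level are the read-only types.
    static boolean isReadOnly(TxType t) {
        return t == TxType.ORDER_STATUS || t == TxType.STOCK_LEVEL;
    }
}
```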


Experiments (2)

- Cluster of 64 nodes
- PostgreSQL 7.3.2
- 1 Gb/s network
- 2 configurations:
  - Fully Replicated (FR)
  - Partially Replicated (PR): each type of TPC-C transaction runs on ¼ of the nodes


Experiments (3): Scale-up

[Figure: a) Fully Replicated (FR); b) Partially Replicated (PR)]


Experiments (4): Speed-up

In addition, 128 clients submit Order-status (read-only) transactions.

[Figure: a) Fully Replicated (FR); b) Partially Replicated (PR)]


Experiments (5): Unordered messages

[Figure: a) Fully Replicated (FR); b) Partially Replicated (PR)]


Experiments (6): Delay vs. transaction size


Conclusions

Preventive replication:
- Strong consistency
- Prevents conflicts for partially replicated databases
- Full node autonomy
- Scale-up and speed-up

Experiments show that the configuration and placement of copies should be tuned to the selected types of transactions.


Current and Future Work

- Preventive replication for P2P systems:
  - Small and dynamic multi-master groups
  - Max computed dynamically
  - Small and dynamic slave groups
- Optimistic replication: distributed semantic reconciliation


Thanks !

Merci !

Obrigado !

Questions ?