Replication and Query Processing in the APPA Data Management System

1/27


Reza AKBARINIA, Vidal MARTINS, Esther PACITTI, Patrick VALDURIEZ

2/27

Motivation

- Advanced applications
  - They must deal with semantically rich data
  - They use a high-level SQL-like query language
  - Applications
    - Epidemiological study
    - Astronomical data sharing
- Little work on managing data replication in the presence of updates
  - Gnutella and Kazaa: static files (no updates)
  - Freenet: update propagation downward to closely connected peers
  - ActiveXML: on demand (web services)
  - P-Grid: rumor spreading (probabilistic guarantees for consistency)

3/27

Motivation

- Replication in distributed systems
  - Synchronous replication (ROWA)
  - Asynchronous replication
    - Preventive replication
    - Optimistic replication
    - Rumor spreading
- We propose a new P2P system to address
  - Data replication in the context of advanced applications
  - Query processing in the presence of advanced replication capabilities

4/27

Outline

- Motivation
- APPA Architecture
- Data Replication
- Query Processing
- Validation
- Conclusion


5/27

APPA Architecture

[Figure: APPA layered architecture, from top to bottom]

- Advanced Services: Replication, Query Processing, Caching, ...
- Basic Services: P2P Data Management, Consensus, ...
- P2P Network: Key-based Storage and Retrieval, Peer ID Assignment, Peer Linking, Peer Communication
- Internet

6/27



7/27

Data Replication

- Replication Model
  - Assumptions
    - Frequent and unpredictable network changes
    - Small world
  - Based on lazy multi-master scheme
  - Log-based reconciliation to solve replica divergence
  - Schema management
    - r-lsd: local schema description of relation r
    - r-csd: common schema description of relation r
    - Each peer defines mapping functions between r-lsd and r-csd
  - Data storage
    - Each peer stores tuples using r-lsd and r-csd schemas
    - Updates in one schema are mapped to the other
  - Multi-master groups
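
The mapping between r-lsd and r-csd can be pictured as a pair of functions that translate a tuple from one schema to the other. The following is a minimal sketch, assuming the mapping only renames attributes; the attribute names and the helpers `to_csd` and `to_lsd` are hypothetical and are not taken from APPA.

```python
# Hypothetical attribute mapping for one relation r at one peer.
# Keys are r-lsd attribute names, values are r-csd attribute names.
LSD_TO_CSD = {"patient_name": "name", "dob": "birth_date"}
CSD_TO_LSD = {csd: lsd for lsd, csd in LSD_TO_CSD.items()}


def to_csd(lsd_tuple: dict) -> dict:
    """Translate a tuple stored under r-lsd into its r-csd form."""
    return {LSD_TO_CSD[attr]: value for attr, value in lsd_tuple.items()}


def to_lsd(csd_tuple: dict) -> dict:
    """Translate a tuple stored under r-csd back into its r-lsd form."""
    return {CSD_TO_LSD[attr]: value for attr, value in csd_tuple.items()}


# An update made in the local schema is mapped to the common schema.
local_update = {"patient_name": "Ana", "dob": "1970-01-01"}
common_update = to_csd(local_update)
```

Real mappings may also convert units or split and merge attributes; the renaming case is just the simplest one to show that an update in one schema can be carried over to the other.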

8/27

Data Replication

- Reconciliation Properties
  - Eventual consistency: when all clients stop submission of update actions, all replicas eventually achieve the same values
  - Mergeability: it is possible to schedule any arbitrary collection of log operations respecting constraints
  - Eventual decision: a decision is taken for each submitted action
  - Eventual propagation: actions and constraints known at peer “p” at time “t” are eventually known by an arbitrary peer of the group
  - Safe decisions: peers may not make conflicting decisions

9/27

Data Replication

- Reconciliation Solutions
  - IceCube (Microsoft Research)
    - Centralized conflict detection and resolution
    - Resolution based on application semantics
    - Non-deterministic resolution
  - APPA
    - Distributed conflict detection and resolution
    - Resolution based on application semantics
    - Deterministic resolution (enables parallelism)
    - Considers dynamic connections and disconnections

10/27

Data Replication

- APPA Distributed Reconciliation
  - Foundation
    - Use a common action log (P2P data)
    - All tentative actions are stored in the action log
    - Action log actions are grouped by time interval (log unit)
    - The (deterministic) resolution is made on demand, covers a log unit, and produces a schedule
    - The schedules are available (P2P data) to all peers
  - Parallelism and distribution: scalability
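
Grouping the action log by time interval can be sketched as follows. This is an illustrative example only: the `Action` record, the interval length `LOG_UNIT_SECONDS`, and the helper names are assumptions and do not come from the APPA implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Action(NamedTuple):
    timestamp: float  # submission time in seconds (assumed representation)
    peer: str         # peer that submitted the tentative action
    op: str           # textual description of the update


LOG_UNIT_SECONDS = 10.0  # assumed length of one log unit


def log_unit_id(action: Action) -> int:
    """Identify the log unit (time interval) an action falls into."""
    return int(action.timestamp // LOG_UNIT_SECONDS)


def group_into_log_units(actions: Iterable[Action]) -> Dict[int, List[Action]]:
    """Group tentative actions from the action log by log unit."""
    units: Dict[int, List[Action]] = defaultdict(list)
    for action in actions:
        units[log_unit_id(action)].append(action)
    return dict(units)


actions = [
    Action(3.0, "p1", "update x"),
    Action(12.5, "p2", "update y"),
    Action(7.2, "p3", "delete x"),
]
units = group_into_log_units(actions)
```

Because each log unit is a self-contained batch, different peers can reconcile different log units at the same time, which is where the parallelism comes from.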

11/27

Data Replication

Distributed Reconciliation

12/27

Data Replication

- Distributed Reconciliation
  - Log units ensure a single view over unordered actions
  - Log unit life cycle must be managed
  - Decision factor eliminates non-determinism
  - Several peers can reconcile the same log unit concurrently
  - A peer can reuse the reconciliation made by another one
  - A peer can finish the reconciliation started by another one
  - Reconciliation properties are ensured
  - Multi-master replication in a P2P environment is achieved
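
Why determinism lets several peers reconcile the same log unit concurrently can be shown with a small sketch. The slides do not define the decision factor, so the ordering rule below (timestamp first, then a content digest as tie-break) is a stand-in chosen for illustration, not APPA's actual rule; the `Action` tuple layout is also hypothetical.

```python
import hashlib
from typing import List, Tuple

Action = Tuple[float, str, str]  # (timestamp, peer, operation), assumed layout


def tie_break(action: Action) -> str:
    """Stable digest used only to order actions with equal timestamps."""
    return hashlib.sha256(repr(action).encode("utf-8")).hexdigest()


def reconcile(log_unit: List[Action]) -> List[Action]:
    """Produce a schedule that depends only on the set of actions,
    never on the order in which a peer happened to receive them."""
    return sorted(log_unit, key=lambda a: (a[0], tie_break(a)))


# Two peers see the same log unit in different arrival orders.
seen_by_p1 = [(5.0, "p2", "set x=1"), (5.0, "p3", "set x=2"), (1.0, "p1", "insert x")]
seen_by_p2 = list(reversed(seen_by_p1))

schedule_p1 = reconcile(seen_by_p1)
schedule_p2 = reconcile(seen_by_p2)
```

Both peers end up with the same schedule, so either one can reuse or complete the other's work without risking conflicting decisions.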

13/27

Data Replication

Service Architecture

[Figure: service architecture of APPA replication. Components shown: Application; Replication Service; Reconciler; Log Unit Manager; Local Persistent Data; Local Log; Basic Services (P2P Data Manager, Consensus); P2P Network (KSR, Peer Communication); Internet.]

14/27



15/27

Query Processing

- Problem definition
  - Consider that
    - Each peer has a local schema to describe its data
    - Peers agree on a Common Schema Description (CSD)
    - Each peer maps its local schema to the CSD
  - Given a user query on a peer schema, the problem is
    - To find the minimum set of peers that should answer the query
    - To execute the query in these peers and return a list of (ranked) answers to the user
  - Assumption
    - A query answer includes data from several multi-master groups (all of those that store relevant data)

16/27

Query Processing

Proposed Solution

17/27

Query Processing

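Proposed Solution

- Query reformulation
  - Peer schema: p:r(A,B,D)
  - Common schema: csd:r1(A,B,C), csd:r2(C,D,E)
  - Query on the peer schema: select A,D from r where B=b
  - Reformulated query on the CSD: select A,D from r1,r2 where B=b and r1.C=r2.C
- Query matching
  - P: set of peers in the P2P system
  - ps(p,r): peer schema of peer “p” involves relation “r”
  - Problem: to find $P' \subseteq P$ where each p in P' has relevant data
  - Result: $P' = \{\, p \mid p \in P \wedge \exists r \in R : ps(p,r) \,\}$ (R denotes the relations referenced by the query)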
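
Query matching can be sketched as a simple filter over the peers. The example below reads the result set as "p is relevant if its schema involves at least one relation of the query"; the catalogue `peer_schemas` and the function `match` are hypothetical names introduced for illustration.

```python
from typing import Dict, Iterable, Set

# Hypothetical catalogue: for each peer, the relations its schema involves.
peer_schemas: Dict[str, Set[str]] = {
    "p1": {"r"},
    "p2": {"r", "s"},
    "p3": {"t", "u"},
    "p4": {"v"},
}


def ps(peer: str, relation: str) -> bool:
    """True if the schema of `peer` involves `relation`."""
    return relation in peer_schemas[peer]


def match(peers: Iterable[str], query_relations: Set[str]) -> Set[str]:
    """Return P': the peers holding at least one relation used by the query."""
    return {p for p in peers if any(ps(p, r) for r in query_relations)}


# Q = join(r, s, v): peer p3 only holds t and u, so it is not relevant.
relevant = match(peer_schemas.keys(), {"r", "s", "v"})
```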

18/27

Query Processing

Proposed Solution

[Figure: query matching for Q = join(r, s, v), from the set of all peers P to the set of relevant peers P'. Peers are labeled with the replicas they hold (for example r1,s1 or t2,u2). Replica suffixes: 1 = European data, 2 = American data, 3 = African data.]

19/27

Query Processing

Proposed Solution

- Query optimization
  - Consider P' a set of relevant peers
  - Goal: obtain $P'' \subseteq P'$ such that
    - For any two peers in P'', their relevant data are not replicated
    - The relevant data of peers in P'' are equal to that in P'
    - The cost of query execution by peers in P'' is minimum
  - Cost function
    - A function of communication, computing power, etc.
  - Phases of optimization
    - Determining relevant replicas for Q's relations and their peers
    - Determining best peer per replica

20/27

Query Processing

Proposed Solution

[Figure: query optimization for Q = join(r, s, v), from the set of relevant peers P' to the selected peers P''. The peers shown in P'' hold r1,s1; r2; s2; r3,s3; and v. Replica suffixes: 1 = European data, 2 = American data, 3 = African data.]

21/27

Query Processing

Proposed Solution

- Algorithms
  - Cost parameters
    - tcom(r,p): time to send the results of Q concerning replica r from a peer p to the query originator
    - tresp(r,p): time that p needs to execute the part of Q concerning replica r and start sending the results to the query originator
    - tdjoin(S): time to join the set of replicas S in a distributed way
  - Example (replicas r and s; peers p1, p2, p3)
    - tcom(r,p1) + tresp(r,p1) = 4
    - tcom(r,p2) + tresp(r,p2) = 6
    - tcom(s,p2) + tresp(s,p2) = 7
    - tcom(s,p3) + tresp(s,p3) = 5
    - tdjoin({r,s}) = 6
    - r from p1 and s from p3 (distributed join needed): Total Cost = 4 + 5 + 6 = 15
    - r and s both from p2: Total Cost = 6 + 7 = 13
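
The two totals in the example can be reproduced with a few lines of Python. This sketch assumes, based only on the two totals shown, that tdjoin is added when the chosen replicas sit on more than one peer and is skipped when a single peer serves them all; `plan_cost` and the dictionary layout are illustrative names.

```python
from typing import Dict, Tuple

# tcom(r,p) + tresp(r,p) for each (replica, peer) pair, from the example.
transfer_cost: Dict[Tuple[str, str], int] = {
    ("r", "p1"): 4,
    ("r", "p2"): 6,
    ("s", "p2"): 7,
    ("s", "p3"): 5,
}
TDJOIN_R_S = 6  # tdjoin({r, s}) from the example


def plan_cost(plan: Dict[str, str]) -> int:
    """Cost of a plan that maps each replica to the peer serving it."""
    cost = sum(transfer_cost[(replica, peer)] for replica, peer in plan.items())
    # Assumption: a distributed join is only needed across different peers.
    if len(set(plan.values())) > 1:
        cost += TDJOIN_R_S
    return cost


split_plan = plan_cost({"r": "p1", "s": "p3"})   # 4 + 5 + 6
single_peer = plan_cost({"r": "p2", "s": "p2"})  # 6 + 7
```

The example illustrates why picking the cheapest peer for each replica independently is not always optimal: the per-replica choices p1 and p3 are individually cheaper, yet the plan that keeps both replicas on p2 wins once the distributed join is counted.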

22/27

Query Processing

Proposed Solution

A nonlinear programming model

Minimize

Complexity

23/27

Query Processing

Proposed Solution

- Algorithms
  - Branch and bound
    - Optimal selection of peers
    - Complexity (worst case): O( )
  - A heuristic solution
    - While there is an edge in the graph
      - Select the edge with minimum label
      - Set the peer p as selected peer for the replica r
      - Update the label edges of other peers that hold the replica r
      - Remove the replica r and its edges from the graph
    - Complexity: $O((ma)^2)$
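
The greedy loop of the heuristic can be sketched as below. Several details are assumptions made for illustration: the graph is taken to be a set of (replica, peer) edges labeled with a cost, and because the slides do not spell out how labels are updated, that step is left as an optional caller-supplied hook (`update_labels`, a hypothetical name) that does nothing by default.

```python
from typing import Callable, Dict, Optional, Tuple

Edge = Tuple[str, str]  # (replica, peer), assumed graph representation
UpdateHook = Callable[[Dict[Edge, float], str, str], None]


def greedy_select(labels: Dict[Edge, float],
                  update_labels: Optional[UpdateHook] = None) -> Dict[str, str]:
    """Pick one peer per replica by repeatedly taking the cheapest edge."""
    graph = dict(labels)  # work on a copy so the caller's labels are untouched
    selected: Dict[str, str] = {}
    while graph:  # while there is an edge in the graph
        # Select the edge with minimum label (ties broken by edge name).
        (replica, peer), _ = min(graph.items(), key=lambda kv: (kv[1], kv[0]))
        selected[replica] = peer
        # Label update step: unspecified in the slides, so delegated to a hook.
        if update_labels is not None:
            update_labels(graph, replica, peer)
        # Remove the replica and all of its edges from the graph.
        graph = {e: w for e, w in graph.items() if e[0] != replica}
    return selected


choice = greedy_select({
    ("r", "p1"): 4, ("r", "p2"): 6,
    ("s", "p2"): 7, ("s", "p3"): 5,
})
```

With the default no-op hook the loop reduces to choosing the cheapest peer for each replica, which is fast but ignores interactions between choices; a meaningful update rule is what distinguishes the heuristic from that naive baseline.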

kak

m

24/27


25/27

Implementation

[Figure: JXTA architecture alongside the APPA layers]

- JXTA
  - JXTA Applications: JXTA Community Applications, Sun JXTA Applications
  - JXTA Shell, Peer Commands
  - JXTA Services: JXTA Community Services, Sun JXTA Services (Indexing, Discover, Search, Membership)
  - JXTA Core: Peer Groups, Peer Pipes, Peer Monitoring, Peer Advertisements, Peer IDs, Security
  - Any Connected Device
- APPA
  - Advanced Services: Replication, Query Processing, Caching, ...
  - Basic Services: P2P Data Management, Consensus
  - P2P Network: Peer ID Assignment, Peer Linking, Peer Communication
  - GISP

26/27

Simulation

[Figure: simulation setup, showing the JXTA architecture and APPA layers next to the simulated stack]

- JXTA
  - JXTA Applications: JXTA Community Applications, Sun JXTA Applications
  - JXTA Shell, Peer Commands
  - JXTA Services: JXTA Community Services, Sun JXTA Services (Indexing, Discover, Search, Membership)
  - JXTA Core: Peer Groups, Peer Pipes, Peer Monitoring, Peer Advertisements, Peer IDs, Security
  - Any Connected Device
- APPA
  - Advanced Services: Replication, Query Processing, Caching, ...
  - Basic Services: P2P Data Management, Consensus
  - P2P Network: Peer ID Assignment, Peer Linking, Peer Communication
  - GISP
- Simulated stack
  - APPA Simulation
  - P2P Simulation: Chord Simulator (P2P Network: Peer ID Assignment, Peer Linking, Peer Communication)
  - Internet Simulation: GT/ITM

27/27

Conclusion

- Summary
  - Advanced cooperative applications (multi-master replication)
  - A new P2P network-independent data management system
  - A distributed optimistic multi-master replication solution
  - Eventual consistency guarantee
  - A query processing solution based on replication
  - Validation
- Future work
  - Consider secondary copies
  - Consider replica quality in query optimization
  - Data caching
  - Implementation over other P2P architectures (e.g., flooding)
