ADAPT IST-2001-37126

Middle-R: A Middleware for Dynamically Adaptive Database Replication

R. Jiménez-Peris, M. Patiño-Martínez, Jesús Milán

Distributed Systems Laboratory (LSD), Universidad Politécnica de Madrid (UPM)

2nd Adapt workshop, 13-14th Feb. 2003, Bologna (Italy)

Symmetric vs. Asymmetric Processing

• Transactions in a replicated system can be processed either:
– Symmetrically, meaning that all replicas process the whole transaction.
• This approach can only scale by introducing queries in the workload.
– Asymmetrically, meaning that one replica processes the transaction and the other replicas just apply the resulting updates.
• This approach can scale depending on the ratio between the cost of executing the whole transaction and the cost of just applying the updates.


Scalability of Symmetric Systems

[Figure: scale-out (0 to 15) vs. number of sites (1 to 15), with one curve for each of w = 0, 0.25, 0.5, 0.75 and 1.]


Scalability of Asymmetric Systems

Asymmetric System

• The transaction is fully executed at its master site.
• Non-master sites only apply the updates.
• This approach leaves some spare computing power that enables the scalability.


Comparing the Scalability

[Figure: two plots of scale-out vs. number of sites (1 to 15), each with curves for w = 0, 0.5 and 1. Panel titles: "Scalability of our middleware using asymmetric processing" and "Potential scalability of a symmetric system".]


Taxonomy of Eager Database Replication

– White box. Modifying the database engine (Bettina Kemme’s Postgres-R [VLDB’00, TODS’00]).
• It can use either symmetric or asymmetric processing.
– Black box. At the middleware level without assuming anything from the database (Yair Amir [ICDCS’02]).
• Inherently symmetric approach.
• Transactions are executed sequentially by all replicas.
– Gray box. At the middleware level based on the get/set updates services (our approach [ICDCS’02]).
• It can use symmetric processing.
• It can also use asymmetric processing, provided the database offers two services to get/set the updates of a transaction. This is the approach we have taken.


Assumptions in Middle-R

• Each site has the entire database (no partial replication).
• Read one, write all available.
• We work on a LAN.
• Virtually synchronous group communication is available.
• The underlying database provides two basic services (similar to the Corba ones):
– get state: returns a list of the physical updates performed by a transaction,
– set state: applies the physical updates of a transaction at a site.
• Our approach exploits the application semantics; we assume that the database is partitioned in some arbitrary way and that it is known which data partitions are going to be accessed by a transaction.
– This allows us to execute transactions from different partitions in parallel. Transactions spanning several partitions are also considered.
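
As a minimal sketch of the partitioning assumption (not the Middle-R implementation), the following Python shows how knowing each transaction's partitions in advance lets a scheduler decide which transactions may run in parallel. The names `Txn` and `schedule_batches` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    name: str
    partitions: frozenset  # partitions the transaction declares in advance


def schedule_batches(txns):
    """Group transactions into batches whose members touch pairwise
    disjoint partitions, so each batch could run in parallel.
    Conflicting transactions keep their submission order."""
    batches = []
    for txn in txns:
        # The transaction must run after the last batch it conflicts with.
        target = 0
        for i, batch in enumerate(batches):
            if any(txn.partitions & other.partitions for other in batch):
                target = i + 1
        if target == len(batches):
            batches.append([])
        batches[target].append(txn)
    return batches
```

A transaction spanning several partitions (such as one declaring both `A` and `B`) simply conflicts with every transaction on any of them and is ordered after those submitted earlier.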


Protocol Overview [Disc’00]

[Figure: four replicas (Replica 1 to Replica 4), each with a replica manager in the middleware layer on top of a DB in the database layer. A client submits a transaction; the diagram shows GetState at one replica and, through update propagation, SetState at the other three.]


Integrating the Middleware with the Application Server

• JBoss accesses databases through JDBC.
• In order to integrate the middleware with JBoss it will be necessary to develop a JDBC driver.
• This JDBC driver will access the middleware by multicasting requests to the middleware instances at each site.


Integrating the Middleware with the Application Server

[Figure: three JBoss instances, each with a JDBC driver, connected through a group communication bus to four Middle-R instances, each on top of its own DB.]


Integrating the Middleware with the Application Server

• If JBoss is replicated, some issues have to be tackled:
– Independently of the kind of replication in JBoss, duplicated requests might reach the replicated database.
• Active replication provokes the duplication of every request.
• Other kinds of replication strategies might generate duplicate requests upon fail-over (e.g., requests done by the failed primary might be resubmitted by the new primary).
– The middleware requires that duplicates of a request be identified identically.
– The middleware, provided the above guarantee, will enforce the removal of duplicate requests.
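
The duplicate removal described above can be sketched as follows. This is an illustration under assumptions, not the Middle-R code: the class `DuplicateFilter`, the request identifier format, and the choice to answer a duplicate with the remembered result are all hypothetical.

```python
class DuplicateFilter:
    """Remembers the outcome of each request identifier so that a
    resubmitted request is not executed a second time."""

    def __init__(self):
        self._results = {}

    def submit(self, request_id, execute):
        """Run `execute` only the first time `request_id` is seen."""
        if request_id in self._results:
            # Duplicate (e.g. resubmitted after fail-over): do not re-execute.
            return self._results[request_id]
        result = execute()
        self._results[request_id] = result
        return result
```

The sketch only works because every copy of a request carries the same identifier, which is exactly the guarantee the middleware asks of the replicated application server.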


Automatic DB partitioning

• Middle-R exploits application semantics, that is, it requires the DB to be partitioned in some arbitrary way and to know in advance which partitions each transaction is going to access.
• In our previous work, this partitioning was performed by the programmer.
– For each stored procedure accessing the DB, a function was provided that, given the parameters of the invocation, determined the partitions that would be accessed by the stored procedure invocation.
• This is a limitation of the previous approach that has to be overcome in Adapt.
– The DB partitioning should be transparent to users and therefore performed automatically, on (at least) a partition-per-table basis.


Automatic DB Partitioning

• The second issue is how to know in advance which partitions a particular transaction is going to access.

• Our new approach will analyze the submitted SQL statements on the fly to determine which partitions the transaction will access.
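
To make the idea concrete, here is a deliberately naive sketch that assumes one partition per table and extracts table names with a regular expression. The function `partitions_accessed` is hypothetical; a real analysis would need a proper SQL parser, since this pattern ignores comma-separated table lists, subquery aliases, quoted identifiers and clauses such as `FOR UPDATE OF`.

```python
import re

# Keywords directly followed by a table name in simple SQL statements.
_TABLE_PATTERN = re.compile(
    r"\b(?:FROM|JOIN|INTO|UPDATE)\s+([A-Za-z_][A-Za-z0-9_]*)",
    re.IGNORECASE,
)


def partitions_accessed(sql):
    """Return the set of tables (one partition per table) named in `sql`."""
    return {name.lower() for name in _TABLE_PATTERN.findall(sql)}
```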


DB Interaction Model

• Our previous work assumed that each transaction was submitted in a single message to the middleware.
– This model was suitable for stored procedures.
• However, this interaction model does not match the one adopted by JDBC.
– Under JDBC a transaction might span an arbitrary number of requests.
– Under JDBC a transaction might be distributed, so the XA interface should be supported for distributed atomic commit.
• For this reason, we are extending the underlying replication protocol to deal with transactions spanning multiple messages.


Dynamic Adaptability

• The following dynamic adaptability properties are considered:
– Online recovery. Whilst a new (or failed) replica is being recovered, the system continues its regular processing without disruption (the [SRDS’02] approach, which extends ideas from [DSN’01] to the middleware context).
– Load balancing. The masters of the different partitions are reassigned to balance the load dynamically.
– Admission control. Depending on the workload, the optimal number of transactions active in the system changes. The limit on active transactions is dynamically adapted to reach the maximum throughput for each workload.


Dynamic Adaptability: Online Recovery [SRDS’02]

• Recovery is performed on a per-partition basis.
• Recovery is not performed during the state transfer associated to the view change to prevent the blocking of regular requests.
• Once a partition is recovered at a recovering replica, it can start processing requests on that partition although the other partitions are not recovered yet.
• Recovery is flexible to enable load balancing policies to take into account the load of recovery:
– The recovery can use one or more recoverers.
– Each recoverer can recover one or more partitions.


Dynamic Adaptability: Online Recovery

• Replicas might recover in a cascading fashion.
• The online recovery protocol deals efficiently with cascading recoveries.
• Basically, it prevents redundancies in the recovery process as follows:
– A replica that starts recovery whilst the recovery of another replica is underway is not delayed till the whole recovery completes.
– Nor is a new recovery started in parallel (which would yield redundant recoveries).
– Instead, this replica joins the recovery process with the next partition to be recovered.
– In this way, cascading recovering replicas share the recovery of common partitions.
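
The cascading behaviour can be illustrated with a small simulation. It is a sketch under assumptions, not the [SRDS’02] protocol: a single recoverer transfers one partition per step in round-robin order, a late replica starts receiving at the step it joins, and it picks up the partitions it missed when the recoverer wraps around. The function `recovery_plan` and its parameters are hypothetical.

```python
def recovery_plan(num_partitions, join_steps):
    """Simulate a recoverer that transfers one partition per step.

    `join_steps` maps each recovering replica to the step at which it
    starts receiving partitions. Returns a list of
    (partition, replicas_served) pairs, one per transfer.
    """
    missing = {r: set(range(num_partitions)) for r in join_steps}
    plan = []
    step = 0
    while any(missing.values()):
        partition = step % num_partitions
        # Every joined replica that still lacks this partition shares the transfer.
        served = sorted(
            r for r, joined in join_steps.items()
            if joined <= step and partition in missing[r]
        )
        for r in served:
            missing[r].discard(partition)
        if served:
            plan.append((partition, served))
        step += 1
    return plan
```

With four partitions and a second replica joining at step 2, the plan needs six transfers rather than the eight that two independent recoveries would take, because partitions 2 and 3 are sent once for both replicas.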


Dynamic Adaptability: Load Balancing

• The middleware approach has the advantage that every replica knows the load of every other replica without any additional information.
• This makes it possible to achieve load balancing with very little overhead.
• One of the main difficulties of load balancing is to determine the current load of each replica.
• We are currently modeling the behavior of the DB to be able to determine dynamically the current load of each replica.
• These models will enable the middleware to determine which replicas become saturated, so that their load can be redistributed.
• The load is redistributed by reducing the number of partitions that are mastered by an overloaded replica.
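
A minimal sketch of redistributing load by reassigning partition masters follows. The greedy policy, the function `rebalance` and the per-partition load estimates are assumptions made for illustration; the slides leave the actual load model as ongoing work.

```python
def rebalance(masters, load):
    """Move one partition from the most loaded replica to the least loaded.

    `masters` maps replica -> list of partitions it masters (modified in place).
    `load` maps partition -> load estimate (for instance tps).
    Returns the (partition, source, target) move, or None if no move helps.
    """
    def replica_load(replica):
        return sum(load[p] for p in masters[replica])

    source = max(masters, key=replica_load)
    target = min(masters, key=replica_load)
    gap = replica_load(source) - replica_load(target)
    # Moving a partition of load l changes the gap to |gap - 2*l|,
    # so only partitions with 0 < l < gap reduce the imbalance.
    candidates = [p for p in masters[source] if 0 < load[p] < gap]
    if not candidates:
        return None
    partition = min(candidates, key=lambda p: abs(gap - 2 * load[p]))
    masters[source].remove(partition)
    masters[target].append(partition)
    return partition, source, target
```

Calling `rebalance` repeatedly until it returns `None` gives a simple balancing loop; each call moves the mastership of at most one partition.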


Dynamic Adaptability: Load Balancing during Online Recovery

• The load balancing will also control the online recovery to adapt it to the load conditions.
• When the system load is low, it will increase the resources devoted to recovery, accelerating it by taking advantage of the spare computing resources.
• When the system load increases, it will dynamically decrease the resources devoted to recovery to cope with the new load.


Dynamic Adaptability: Admission Control

• The maximum throughput for a workload is reached with a given number of concurrent transactions in the system.
• Once this threshold is exceeded the DB begins to thrash.
• This threshold is different for each workload, so it needs to be dynamically adapted to achieve the maximum throughput for the changing workload.
• The middleware has a pool of connections with the DB, and it can control the transaction admission to attain the optimal degree of concurrency.
• We are developing behavior models that will enable us to find the thrashing point dynamically and adapt the admission control threshold accordingly.
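
One simple way to adapt such a threshold is hill climbing on the measured throughput. The sketch below swaps in that generic technique as a stand-in for the behavior models, which the slides describe as still under development; `AdmissionController` and its parameters are hypothetical.

```python
class AdmissionController:
    """Hill climbing on the limit of concurrently active transactions."""

    def __init__(self, limit=4, step=1, min_limit=1):
        self.limit = limit
        self.step = step
        self.min_limit = min_limit
        self._direction = +1
        self._last_throughput = None

    def admit(self, active):
        """True if a new transaction may start with `active` already running."""
        return active < self.limit

    def report(self, throughput):
        """Feed the throughput measured under the current limit and adapt it."""
        if self._last_throughput is not None and throughput < self._last_throughput:
            # The last change made things worse: reverse direction.
            self._direction = -self._direction
        self._last_throughput = throughput
        self.limit = max(self.min_limit, self.limit + self._direction * self.step)
        return self.limit
```

Against a throughput curve with a single peak, the limit climbs to the peak and then oscillates within one step of it, so it keeps tracking the thrashing point if the workload shifts.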


Wide Area Replication

• The underlying protocols in the middleware are amenable to use in a WAN.
• We are currently studying which new requirements a WAN imposes, to find problems that might require changes in the protocols.
• Replication across a WAN helps to survive catastrophic failures, and it is also needed by many multinational companies with branches spanning different countries.
– For the former scenario we contemplate a replica at each geographic location.
– For the latter scenario we contemplate a cluster at each geographic location.


Partial Replication

• Scalability in the middleware, although good, is limited due to the overhead induced by propagating the updates to all the replicas (see [SRDS’01] for an analytical model determining the precise scalability of the approach).
• This limitation can be overcome by means of partial replication.
• In this way, each partition can be dynamically replicated to the optimal level.
• However, partial replication introduces new complications, such as queries spanning multiple partitions, which cannot be performed on a replica that does not hold a copy of all the accessed partitions.


Conclusions

• Extensions to our previous work and the JDBC driver will enable the use of our middleware approach to provide dynamically adaptable DB replication for JBoss.

• The flexibility of the middleware approach enables us to contribute to different issues regarding dynamic adaptability, such as online recovery, dynamic admission control, dynamic load balancing, changing dynamically the degree of partial replication, etc.


Optimistic Delivery [KPAS99]

[Figure: two timelines showing the latency for a transaction. a) Replication protocol with non-optimistic total ordered multicast: total order multicast of the transaction, followed by execution of the transaction. b) Replication protocol with optimistic total ordered multicast: execution of the transaction starts at opt-delivery and overlaps with the total order multicast, which ends with the totally ordered delivery.]


Advantages of optimistic delivery

• For the optimism to create problems two things must happen at the same time:
– Messages get out of order (unlikely in a LAN).
– The corresponding transactions conflict.

• The resulting probability is very low and we can make it even lower (transaction reordering at the primary).

• Cost of group communication minimized.
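
The two conditions above can be expressed as a small check. This is a sketch under assumptions rather than the protocol of [KPAS99]: the function `must_abort`, the conflict predicate and the policy of undoing the transaction that ran too early are hypothetical.

```python
def must_abort(tentative, definitive, conflicts):
    """Return the transactions whose optimistic execution must be undone.

    `tentative`  is the order in which messages were opt-delivered.
    `definitive` is the order established by the total order multicast.
    `conflicts`  is a function (a, b) -> bool, True if a and b conflict.
    """
    position = {txn: i for i, txn in enumerate(definitive)}
    aborted = set()
    for i, first in enumerate(tentative):
        for second in tentative[i + 1:]:
            # `first` ran ahead of `second` optimistically. That is only a
            # problem if the definitive order reverses them AND they conflict.
            if position[first] > position[second] and conflicts(first, second):
                aborted.add(first)
    return aborted
```

Reordered messages whose transactions touch different data cost nothing, which is why the combined probability of a harmful misordering is low.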


Experimental set up

• Database: PostgreSQL.

• Group communication: Ensemble.

• Network: 100 Mbit Ethernet.

• 15 database sites (each SUN Ultra 5, Solaris).

• Two kinds of transactions were used in the workload:
– Queries (only reads).
– Pure updates (only writes).


Experiments

• #1: using replication does not make the system worse.
• #2: adding more replicas increases the throughput of the system.
• #3: the increase in throughput does not affect the response time.
• #4: acceptable overhead in worst case scenarios.

The dangers of replication: none of these statements is true in conventional eager replication protocols.


1. Comparison with Distributed Locking

[Figure: Standard Distributed Locking vs. Scalable Replication. Response time (ms, 0 to 800) vs. number of sites (1 to 15) at a load of 5 tps, for standard distributed locking (5 upd) and scalable replication (5 upd). Annotations: distributed locking degrades very fast with an increasing number of replicas; our middleware response time is stable.]


2. Throughput Scalability

[Figure: Throughput Scale-out. Scale-out (1 to 15) vs. number of sites (1 to 15) for 0% upd, 50% upd and 100% upd. Annotations: "Scalability of 1/2 the nominal capacity", "Scalability of 2/3 the nominal capacity", "15 tps", "225 tps".]


3. Response Time Analysis

[Figures: three workload analysis plots, the first titled "Workload Analysis (100% upd)". Each shows response time (ms, 0 to 600) vs. workload (10 to 110 tps) for 5, 10 and 15 replicas.]


4. Coordination overhead

[Figure: Small Transactions. Response time (ms, 0 to 600) vs. workload (20 to 260 tps) for 5, 10 and 15 replicas.]


Conclusions

• Consistent replication can be implemented at the middleware level.

• Achieving efficiency requires understanding the dangers of replication:
– Only one message per transaction

– Asymmetric system

– Reduce communication latency

– Reduce abort rates

• Our system demonstrates different ways to address all of these problems.


Ongoing work

• We are using the middleware to implement replication in object containers (e.g. J2EE, Corba).

• Tests are underway to use the system to implement replication across the Internet.

• Porting the system to Spread [Amir et al.].
• Load balancing for web servers based on replicated databases.
• Online recovery and dynamic system reconfiguration:
– DSN 2001 [Kemme, Bartoli, Babaoglu].
– SRDS 2002 [Jimenez, Patiño, Alonso].


Analytical vs. Empirical Measures

[Figure: empirical and theoretical curves for 50% upd and 100% upd, with values from 1 to 9 plotted against the number of sites (1 to 15).]


How can the middleware perform with faster databases?

• The 1 upd transaction took 10 ms to be executed, whilst an 8 upd transaction took 55 ms.
• This means that, with a faster database, we can obtain similar scalabilities for transactions lasting within these ranges (until some bottleneck is reached, most likely group communication).
• The determinant factor of scalability is the ratio between the cost of applying the updates of a transaction and the cost of executing it fully. Although this factor can be reduced, it will always be significant (in Postgres it was 0.16 for 8 upd transactions and 0.2 for 1 upd transactions).


Background

• Replication has been used for two different and exclusive purposes in transactional systems:
– To increase availability (eager replication) by providing redundancy at the cost of throughput and scalability.
– To increase throughput and scalability by distributing the work among replicas (lazy replication) at the cost of consistency.
• We want both availability and performance.
• However, Gray in “The Dangers of replication” SIGMOD’96 stated that eager replication could not scale.


Motivation

• Postgres-R [KA00] showed how to combine database replication with group communication to implement a scalable solution within a database.

• We extended this work [PJKA00] by exploring how to implement replication outside the database:
– Protocol is provably correct.
– Could be implemented as middleware.
– It scales (e.g. adding more sites increases the capacity).
• In this talk we discuss the performance of this protocol as implemented on a cluster of computers connected through a LAN and show that it can be used in a wide range of applications.


Eager Data Replication

• There is a copy of the database at each site.
• Every replica can perform update transactions (update everywhere).
• Transaction updates must be propagated to the rest of the replicas.
• Queries (read only transactions) are executed at a single replica.


Understanding the Scalability of Data Replication

Each transaction executed by a site induces a load of one transaction on each other site.

Assume sites with a processing capacity of 4 tps.

Symmetric System

The capacity of the system is at most the capacity of a single site: 4 tps.


Asymmetric Systems

• In an asymmetric system the work performed by a replica consists of:
– Local transactions, i.e., transactions submitted to the replica.
– Remote transactions, i.e., update transactions submitted to other replicas.
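
The split into local and remote work suggests a simple capacity model, sketched here under stated assumptions (it is not claimed to be the analytical model of [SRDS’01]). Suppose each of n identical sites executes L local transactions per second, a fraction w of which are updates, and applying a remote update costs a fraction wo of a full execution. A site with capacity C then satisfies L + (n - 1) * w * wo * L = C, so the system throughput n * L relative to C is n / (1 + w * wo * (n - 1)). The function name `scale_out` and the symbol `wo` are introduced for this illustration.

```python
def scale_out(n, w, wo):
    """Throughput of n replicas relative to a single site.

    n  - number of replicas
    w  - fraction of update transactions in the workload (0 to 1)
    wo - cost of applying a remote transaction's updates relative to
         executing it fully (1.0 models symmetric processing)
    """
    return n / (1 + w * wo * (n - 1))
```

With `wo = 1.0` and an update-only workload the model gives a scale-out of 1 for any number of sites, matching the observation that a symmetric system has at most the capacity of a single site. With a ratio such as the 0.16 reported in these slides for 8 upd transactions in Postgres, 15 sites would give roughly 4.6 under this model.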


A Middleware Replication Layer

[Figure: two replica managers (X and Y), each comprising a queue manager, a communication manager and a connection manager on top of a PostgreSQL instance. Clients A and B connect to replica manager X, clients C and D to replica manager Y; the replica managers are linked by group communication.]


A Middleware Replication Layer

• The replication system has been implemented as a middleware layer that runs on top of off-the-shelf non-distributed databases or other data stores (e.g., an object container like Corba).

• This layer only requires two simple services from the underlying data repository:
– get state: returns a list of the physical updates performed by a transaction,
– set state: applies the physical updates of a transaction at a replica.
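
A toy sketch of how these two services support asymmetric processing is shown below. It assumes an in-memory key-value store in place of a database and a plain loop in place of the multicast; the class `Replica`, the function `run_asymmetric` and their signatures are hypothetical, not the interface of Middle-R or of any database.

```python
class Replica:
    """Toy data repository exposing the two services the layer relies on."""

    def __init__(self, data=None):
        self.data = dict(data or {})
        self._writes = {}  # transaction id -> physical updates it performed

    def execute(self, txn_id, logic):
        """Fully execute `logic`, a function mapping the current data to
        a dict of the values it writes."""
        writes = logic(self.data)
        self.data.update(writes)
        self._writes[txn_id] = writes

    def get_state(self, txn_id):
        """Return the physical updates performed by transaction `txn_id`."""
        return list(self._writes[txn_id].items())

    def set_state(self, updates):
        """Apply a transaction's physical updates without re-executing it."""
        self.data.update(dict(updates))


def run_asymmetric(master, others, txn_id, logic):
    """Execute at the master only; the other replicas just apply the updates."""
    master.execute(txn_id, logic)
    updates = master.get_state(txn_id)
    for replica in others:  # stands in for the group communication multicast
        replica.set_state(updates)
```

The transaction logic runs once, at the master; the remaining replicas pay only the cost of `set_state`, which is where the spare capacity of an asymmetric system comes from.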


Exp. 1: Comparison with Distributed Locking

• In this experiment we compared our system with a commercial database using distributed locking and eager replication to guarantee full consistency of the replicas.

• A small load of 5 transactions per second was used for this experiment.


Response Time Analysis

• The goal of this experiment is to show that transaction latency remains stable for loads within the scalability interval.

• For each configuration and update rate, the load is increased until the response time degenerates.


Exp. 2: Throughput Scalability

• This experiment tested how the throughput of the system varies for an increasing number of replicas.

• In particular, we wanted to know the power of the cluster relative to a single site.


Measuring the Overhead

• The latency of short transactions is extremely sensitive to any overhead.

• The goal of this experiment is to measure how the response time is affected by the overhead introduced by the middleware layer.

• In this experiment the shortest update transaction was used: a transaction with a single update.


Motivation and background

• Eager replication is the textbook approach to achieve availability …

• Yet, very few database products provide consistent replication.

• The reasons were explained by Gray in “The Dangers of replication” SIGMOD’96.

• Postgres-R [KA00] showed how to avoid these dangers and implement eager replication within a DB.
– Combines transaction processing and group communication.
– Uses asymmetric processing.
– Showed how to embed these techniques in a real database engine.


Motivation and Background

• A subsequent approach explored scalable eager DB replication outside the DB, at the middleware level [Disc00,ICDCS’02].

• Experiments showed that it was possible to achieve replication at the middleware level with a scalability close to the one achieved within the database.


Two Crucial Issues

• Processing should be asymmetric
– Otherwise it does not scale …
– … but difficult to do outside the database
• Avoid the latency introduced by group communication (especially for large groups)
– Otherwise the response time suffers …
– … but we need the group communication semantics
