CSS434 Replication1 CSS434 Distributed Transactions and Replication Textbook Ch 14 - 15 Professor: Munehiro Fukuda

CSS434 Replication 1

CSS434 Distributed Transactions and CSS434 Distributed Transactions and ReplicationReplication

Textbook Ch 14 - 15Textbook Ch 14 - 15

Professor: Munehiro Fukuda


Outline Distributed transaction

Two-phase commitment protocol File replication

Group communication revisited Primary copy replication Active replication

Read-any-write-all protocol Available copy protocol Quorum-based protocol


Distributed TransactionExample: Banking Transaction

..

BranchZ

BranchX

participant

participant

C

D

Client

BranchY

B

A

participant join

join

join

T

a.withdraw(4);

c.deposit(4);

b.withdraw(3);

d.deposit(3);

openTransaction

b.withdraw(T, 3);

closeTransaction

T = openTransaction a.withdraw(4); c.deposit(4); b.withdraw(3); d.deposit(3);

closeTransaction

Note: the coordinator is in one of the servers, e.g. BranchX


Transaction Commitment How can all participant servers either commit

a transaction or abort it? One-phase atomic commit protocol

The coordinator keep requesting all participants to commit until they return an acknowledgment.

No chance of a participant to initiate an abort. Two-phase commit protocol

Phase 1: calls for participants’ vote. Phase 2: Complete a commit or an abort according to out

put of vet.


Two-Phase Commit Protocol

operations

canCommit?(trans)-> Yes / NoCall from coordinator to participant to ask whether it can commit a transaction. Participant replies with its vote.

doCommit(trans) Call from coordinator to participant to tell participant to commit its part of a transaction.

doAbort(trans) Call from coordinator to participant to tell participant to abort its part of a transaction.

haveCommitted(trans, participant) Call from participant to coordinator to confirm that it has committed the transaction.

getDecision(trans) -> Yes / NoCall from participant to coordinator to ask for the decision on a transaction after it has voted Yes but has still had no reply after some delay. Used to recover from server crash or delayed messages.


Two-Phase Commit ProtocolCommunication

canCommit?

Yes

doCommit

haveCommitted

Coordinator

1

3

(waiting for votes)

committed

done

prepared to commit

step

Participant

2

4

(uncertain)

prepared to commit

committed

statusstepstatus


Two-Phase Commit ProtocolState Transition

INIT

READY

ABORT COMMIT

CanCommit?Vote-Yes

doCommitAck

doAbortAck

CanCommit?Vote-No

Worker 2INIT

WAIT

ABORT COMMIT

Client_wants_to_commitCanCommit?

Vote-NodoAbort

Vote-YesdoCommit

Coordinator

INIT

READY

ABORT COMMIT

CanCommit?Vote-Yes

doCommitAck

doAbortAck

CanCommit?Vote-No

Worker 1

Another possible cases:The coordinator didn’t receive all vote-Yes. → Time out and send a doAbort.A worker didn’t receive a CanCommit?. → All workers eventually receive a doAbort.A worker didn’t receive a doCommit. → Time out and check the other work’s status.


File ReplicationConcepts

Difference between replication and caching A replica is associated with a server, whereas a cache with

client. A replicate focuses on availability, while a cache on locality A replicate is more persistent than a cache is A cache is contingent upon a replica

Advantages Increased availability/reliability Performance enhancement (response time and network traffic) Scalability and autonomous operation

Requirements Naming: no need to be aware of multiple replicas. Consistency: data consistency among replicated files. Replication control: explicit v.s. implicit/lazy replication ACID: Atomicity, Consistency, Isolation, and Durability


File ReplicationBasic Architectural Model

1. Request: send a client request to a server.

2. Coordination: deliver the request to each replica manger in some order.

3. Execution: process a client request but not permanently commit it.

4. Agreement: agree if the execution will be committed (ex. Two-phase commit protocol)

5. Response: respond to the front end

Client

Client

ReplicaManger

ReplicaManger

ReplicaManger

FrontEnd

FrontEnd

Ex: DNS Web server


Review: Group Communication

Group membership service Create and destroy a group. Add or withdraw a replica

manager to/from a group. Detect a failure. Notify members of group

membership changes. Provide clients with a group

address. Message delivery

Absolute ordering Consistent ordering

ReplicaManger

ClientReplicaManger

ReplicaManger

ReplicaManger

group



Example: ISIS

Group view

p1

p2

p3

p4

Joins the group

multicast

multicast

crashed

Multicast toavailable processes

multicastrejoins

In ISIS, if P4 receives this partially multicast message at the same time whenit knows p3 has been crashed, it forwards it to all the others and immediatelysends a flush message. In other words, P1, P2, and P4 receive this multicastmessage as if P3 was still alive.

Deleted or delivered?


Review: Group CommunicationAbsolute Ordering - Linearizability

Rule: Mi must be delivered before mj if Ti < Tj

Implementation: A clock synchronized among machines A sliding time window used to commit

message delivery whose timestamp is in this window.

Example: Distributed simulation

Drawback Too strict constraint No absolute synchronized clock No guarantee to catch all tardy

messages

mi

mi

mj

mj

Tj

Ti

Ti < Tj



Total Ordering - Sequential Consistency

Rule: Messages received in the same

order (regardless of their timestamp).

Implementation: A message sent to a sequencer,

assigned a sequence number, and finally multicast to receivers

A message retrieved in incremental order at a receiver

Example: Replicated database update

Drawback: A centralized algorithm

mi

mi

mj

mj

Tj

Ti

Ti < Tj


Multi-copy Update Problem Keep in mind the basic architecture and

group communication models, how can we update multiple copies over replica servers? Read-only replication

Allow the replication of only immutable files. Primary backup replication

Designate one copy as the primary copy and all the others as secondary copies.

Active backup replication Access any or all of replicas

Read-any-write-all protocol Available-copies protocol Quorum-based consensus


Primary-Copy Replication

Client

Client

ReplicaManger

ReplicaManger

ReplicaManger

FrontEnd

FrontEnd

PrimaryBackup

Backup

1. Request: The front end sends a request to the primary replica.

2. Coordination:. The primary takes the request atomically.

3. Execution: The primary executes and stores the results.

4. Agreement: The primary sends the updates to all the backups and receives an ask from them.

5. Response: reply to the front end. Advantage: an easy implementation, line

arizable, coping with n-1 crashes. Disadvantage: large overhead especially

if the failing primary must be replaced with a backup. Ex: Sun NIS (Yellow Page)


Active Replication1. Request: The front end multicasts to

all replicas.2. Coordination:. All replica take the req

uest in the sequential order.3. Execution: Every replica executes the

request.4. Agreement: No agreement needed.5. Response: Each replies to the front. Advantage: achieve sequential consi

stency, cope with (n/2 – 1) byzantine failures

Disadvantage: no more linearizable

Client

Client

ReplicaManger

ReplicaManger

ReplicaManger

FrontEnd

FrontEnd


Read-Any-Write-All Protocol

Read Lock any one of replicas for

a read Write

Lock all of replicas for a write

Sequential consistency Intolerable for even 1 failing

replica upon a write.

Client ReplicaManger

ReplicaManger

ReplicaManger

FrontEnd

Read from any one of them

Write to all of them


FrontEnd


Available-Copies Protocol Read

Lock any one of replicas for a read

Write Lock all available replicas

for a write Recovering replica

Bring itself up to date by coping from other servers before accepting any user request.

Better availability Cannot cope with network

partition. (Inconsistency in two sub-divided network groups)


ReplicaManger

ReplicaManger

FrontEnd

Read from any one of them

Write to all available replicats

XClient Replica

Manger

FrontEnd


Available Copies ProtocolExample 1: Gossip

RMk

RMj(Tj)

RMi(Ti)

FE(Tf)

Client

FE

Client

Query

Query, Tf Value, Ti

Value

Update, Tf

Update

Update id

Gossip

If (Tf < Ti) return valueelse { waits for RMi to be updated or query RMj/RMk}

If (Tf > Tj) update RMjelse { update Client or ignore and update RMj}

If (Tj > Tk) update RMkelse discard the gossip message

Categorized in lazy availablecopies protocol

Tardy messages are ignored


To make a tentative update committed:

Perform a dependency check

Check conflicts Check priority

Merge Procedure Cancel tentative updates Change tentative updates

T0 Tn+1TnT3T2T1CNC2C1C0

Committed Tentative

Primary

RM RM

FE

Client

FE

Client

TnT3

Secretary and other employees: book 3pm

Executive: book 3pm

Sent first

Sent later

FE

Client

T1

FE

Client

T0

Available Copies ProtocolExample 2: Bayou

Categorized in lazy availablecopies protocol

Tardy messages are reorderedor merged.


Network PartitionsWell-known Solution: Quorum-Based Protocols

Client

ReplicaManger

ReplicaManger

FrontEnd

Client

ReplicaManger

FrontEnd

ReplicaManger

ReplicaManger

ReplicaManger

ReplicaManger

ReplicaManger

Read quorum

Write quorum

#replicas in read quorum + #replicas in write quorum > n Read

Retrieve the read quorum Select the one with the

latest version. Perform a read on it

Write Retrieve the write quorum. Find the latest version and

increment it. Perform a write on the

entire write quorum. If a sufficient number of

replicas from read/write quorum, the operation must be aborted.

Read-any-write-all: r = 1, w = n


Network PartitionsSystem example: Coda

Server 3 Server 2 Server 1

1. Normal case:• Read-any, write-all protocol• Whenever a client writes back its file, it increments the file version at each server.

Version[1,1,1] Version[1,1,1] Version[1,1,1]

W

Version[2,2,2] Version[2,2,2] Version[2,2,2]

2. Network disconnection:• A client writes back its file to only available servers.• Version conflicts are detected and resolved automatically when network is reconnected

W WVersion[2,2,3] Version[3,3,2] Version[3,3,2]

3. Client disconnection:• A client caches as many files as possible (in hoard walking).• A client works in local if disconnected (in emulation mode).• A client writes back updated files to servers (in reintegration mode).

hoard

reintegration

emulation


Paper Review by Students ISIS System Gossip Architecture Bayou System Coda Discussions

What if a message is lost in ISIS group communication? What if another crash occurs when unstable/flush messages are exchanged?

What performance drawbacks does Gossip have? What problems remain to users in Bayou? Why doesn’t Coda use read/write quorum?


Non-Turn-In Exercises1. The following state transition diagram describes the two-phase commitment protoco

l. Let’s assume that worker1 crashed when a coordinate sent a commit message. Trace this diagram. To be specific, make appropriate dashed arrows “thick and solid arrows” with your pen or pencil.

INIT

READY

ABORT COMMIT

CanCommit?Vote-Yes

doCommitAck

doAbortAck

CanCommit?Vote-No

Worker 2INIT

WAIT

ABORT COMMIT

Client_wants_to_commitCanCommit?

Vote-NodoAbort

Vote-YesdoCommit

Coordinator

INIT

READY

ABORT COMMIT

CanCommit?Vote-Yes

doCommitAck

doAbortAck

CanCommit?Vote-No

Worker 1


Non-Turn-In Exercises2. Textbook p762, Q17.1: In a decentralized variant of the two-phase

commit protocol, the participants communicate directly with one another instead of indirectly via the coordinator. In phase 1, the coordinator sends its vote to all the participants. In phase 2, if the coordinator’s vote is No, the participants just abort the transaction; if it is Yes, each participant sends its vote to the coordinator and the other participants, each of which decides on the outcome according to the vote and carries it out. Calculate the number of messages and the number of rounds it takes. What are its advantages or disadvantages in comparison with the centralized variant?

3. Textbook p816, Q18.10: Explain why allowing backups to process read operations directly, (i.e., without contacting a primary), leads to sequentially consistent rather than linearizable executions in a primary-copy replication.

4. Textbook p816, Q18.11: Could the gossip architecture be used for a distributed computer game as describe below?

5. The players move figures around a common scene. The state of the game is replicated at the players’ workstations and at a server, which contains services controlling the game overall, such as collision detection. Updates are multicast to all replicas.

6. The quorum-based replication protocol can address network partition problems. Why didn’t Coda use this protocol? Explain the reason.

7. What if a message is lost in ISIS group communication? Describe a solution.

Documents

CSS434 Replication1 CSS434 Distributed Transactions and Replication Textbook Ch 14 - 15 Professor: Munehiro Fukuda