Chapter 5
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
wenbing@ieee.org
Building Dependable Distributed Systems

Building Dependable Distributed Systems, Copyright Wenbing Zhao

Outline
- Group communication systems
  - Reliable, ordered multicast
  - Types of total ordering
  - GCS services
  - How to implement GCS

Group Communication System
- Services provided by the GCS
  - Membership service: who is up and who is down
    - Deals with failure detection and more
  - Reliable, ordered multicast service
    - FIFO, causal, total
  - Virtual synchrony service
    - Virtual synchrony synchronizes membership changes with multicasts
- GCS makes the implementation of state machine replication much easier

Reliable Multicast
- Reliable multicast: the message is targeted to multiple receivers, and all receivers receive the message reliably
  - Positive or negative acknowledgement
  - Need to avoid ack/nack implosion
- Distinguish receiving from delivery!
[Figure: the middleware receives a message from the network and later delivers it to the application]

Ordered Reliable Multicast
- Ordered reliable multicast: if many messages are multicast by many senders, in what order are the messages delivered at the receivers?
  - First in first out (FIFO)
  - Causal: the causal relationship among msgs is preserved
  - Total: all msgs are delivered at all receivers in the same order
- Event ordering (Section 4.3.2)
  - Happens-before relationship: given two events E and E' at processes i and j, respectively, we say E happens before E', denoted as E -> E', provided that
    - i=j and E occurred ahead of E', or
    - i≠j and E is the sending of a message and E' is the receiving of that msg
  - The relationship is transitive: if E -> E'' and E'' -> E', then E -> E'
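The happens-before definition can be made concrete with a small sketch. It builds the relation from the two rules on the slide (same-process order, and send before receive) and then takes the transitive closure, which is part of the standard definition of happens-before. The input format (`local_order`, `messages`) and the function name are assumptions made for illustration only.

```python
from itertools import product

def happens_before(local_order, messages):
    """Return the set of (E, E2) pairs such that E -> E2.

    local_order: dict mapping a process id to its events, in local order
                 (hypothetical input format).
    messages:    list of (send_event, receive_event) pairs.
    """
    edges = set(messages)                      # send(m) -> receive(m)
    for events in local_order.values():        # same process: earlier -> later
        edges.update(zip(events, events[1:]))
    closure = set(edges)
    changed = True
    while changed:                             # transitive closure
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure
```

Two events that appear in neither order in the returned set are concurrent.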

FIFO Ordered Multicast
- FIFO or sender-ordered multicast: messages from any single sender are delivered in the order they were sent
[Figure: timelines of processes p, q, r, s with multicast messages a, b, c, d, e]
- Delivery of c to p is delayed until after b is delivered
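One common way to obtain FIFO delivery is a per-sender sequence number plus a hold-back buffer at the receiver. The sketch below is an illustration under that assumption, not the mechanism of any particular GCS. The class and field names are hypothetical. It also shows the difference between receiving a message and delivering it.

```python
class FifoReceiver:
    """Holds back out-of-order messages and delivers them in per-sender order."""

    def __init__(self):
        self.expected = {}    # sender -> next sequence number to deliver
        self.held = {}        # (sender, seq) -> payload received but not delivered
        self.delivered = []   # payloads handed to the application, in order

    def receive(self, sender, seq, payload):
        nxt = self.expected.get(sender, 0)
        if seq < nxt:
            return            # duplicate of an already delivered message
        self.held[(sender, seq)] = payload
        # Deliver as many consecutive messages from this sender as possible.
        while (sender, nxt) in self.held:
            self.delivered.append(self.held.pop((sender, nxt)))
            nxt += 1
        self.expected[sender] = nxt
```

If c (seq 1) arrives before b (seq 0) from the same sender, c is received but held back until b has been delivered.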

Causally Ordered Multicast
- Causal or happens-before ordering: if send(a) -> send(b), then deliver(a) occurs before deliver(b) at common destinations
[Figure: timelines of processes p, q, r, s with multicast messages a, b, c, d, e]
- Delivery of c to p is delayed until after b is delivered
- e is sent (causally) after b
- Delivery of e to r is delayed until after b and c are delivered
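A well-known way to enforce causal delivery is to piggyback a vector timestamp on each multicast and hold a message back until everything it causally depends on has been delivered. The sketch below assumes that technique, with processes numbered 0..n-1. The slides do not prescribe it for causal ordering, and all names are hypothetical.

```python
class CausalReceiver:
    """Delivers multicasts in causal order using vector timestamps."""

    def __init__(self, n):
        self.vc = [0] * n     # vc[k] = number of msgs from process k delivered here
        self.held = []        # received but not yet deliverable
        self.delivered = []

    def _deliverable(self, sender, ts):
        # Next msg from this sender, and no missing causal predecessors.
        return ts[sender] == self.vc[sender] + 1 and all(
            ts[k] <= self.vc[k] for k in range(len(ts)) if k != sender)

    def receive(self, sender, ts, payload):
        self.held.append((sender, ts, payload))
        progress = True
        while progress:       # one delivery may unblock held messages
            progress = False
            for msg in list(self.held):
                s, t, p = msg
                if self._deliverable(s, t):
                    self.held.remove(msg)
                    self.vc[s] += 1
                    self.delivered.append(p)
                    progress = True
```

Here `ts` is the sender's vector at send time, counting the message itself in the sender's own entry.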

Totally Ordered Multicast
- Total ordering: messages are delivered in the same order to all recipients (including the sender)
[Figure: timelines of processes p, q, r, s with multicast messages a, b, c, d, e]
- All deliver a, b, c, d, then e

Two Types of Total Ordering
- Uniform total ordering
  - Given any msg that is broadcast, if it is delivered by a node according to some total order, it is delivered at every other node in the same total order unless that node has failed
- Nonuniform total ordering
  - Given a set of messages that have been broadcast and totally ordered, no node delivers any of them out of the total order
  - However, there is no guarantee that if a node delivers a message, then all other nodes deliver the same message

Example

Implementing Total Ordering
- Use a sequencer to order all multicasts
  - Sequencer determines the order for the message
  - Each sender can take a turn serving as the sequencer (rotating sequencer)
- Use a token that moves around
  - Token has a sequence number
  - Sender determines the total order: when you hold the token, you can send the next burst of multicasts
- Use vector clocks
  - Each process maintains a vector clock
  - Each msg is piggybacked with a vector timestamp
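The token approach can be sketched in a few lines: the token carries the next global sequence number, and only the current holder may stamp and send a burst. This is a simplified illustration with a hypothetical function name. It leaves out token loss, retransmission and flow control.

```python
def send_burst(token_seq, payloads):
    """Stamp a burst of multicasts while holding the token.

    token_seq: sequence number carried by the token on arrival.
    Returns the stamped messages and the sequence number to pass on
    with the token to the next holder.
    """
    stamped = [(token_seq + k, p) for k, p in enumerate(payloads)]
    return stamped, token_seq + len(payloads)
```

Because only one node holds the token at a time, the stamps of successive holders never overlap, so sorting by stamp gives the same total order at every receiver.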

Sequencer Based GCS
- First practical solution for GCS
- A system is structured into a combination of two subsystems
  - Multiple senders with a single receiver
  - A single sender with multiple receivers
  - The single receiver and single sender are collocated at the same node => all msgs are funneled through this node, i.e., the sequencer

Sequencer Based GCS
- The sequencer is responsible for assigning a global sequence number to each message funneled through it
- Each node delivers a msg if it has received and delivered all msgs with smaller sequence numbers
- Sequencer: a single point of failure
- Rotating sequencer: overcoming the single point of failure
  - Assume up to f nodes could fail; total number of nodes N > 2f
  - Each node takes a turn acting as the sequencer (e.g., one msg at a time)
  - A node does not deliver a msg until it receives f+1 sequencing msgs
  - Achieves fault tolerance as well as uniform total ordering
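The fixed-sequencer idea can be sketched as follows: one node stamps every message with the next global sequence number, and each receiver delivers strictly in stamp order. The class names are hypothetical. Failure handling and the f+1 rule of the rotating variant are left out.

```python
class Sequencer:
    """Assigns consecutive global sequence numbers to funneled messages."""

    def __init__(self):
        self.next_seq = 0

    def order(self, payload):
        s = self.next_seq
        self.next_seq += 1
        return s, payload


class OrderedReceiver:
    """Delivers a msg only after all msgs with smaller sequence numbers."""

    def __init__(self):
        self.next_to_deliver = 0
        self.pending = {}     # seq -> payload, received but not yet delivered
        self.delivered = []

    def receive(self, seq, payload):
        self.pending[seq] = payload
        while self.next_to_deliver in self.pending:
            self.delivered.append(self.pending.pop(self.next_to_deliver))
            self.next_to_deliver += 1
```

Two receivers that see the stamped messages in different network orders still deliver them in the same total order.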

Rotating Sequencer: Data Structure
- View number v, list of node ids in the current view
  - Each node has a rank: it knows when it should take over as the next sequencer
- A local sequence number vector M[], each element representing the expected local seq# for the corresponding node: for reliable delivery
  - M[i] refers to the expected local seq# carried by the next msg sent by node i
  - Init each element to 0
- Expected global seq# s carried in the next sequencing msg sent by the sequencer node: for total ordering
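The per-node state listed above could be represented roughly as follows. The field names follow the slide where it gives them (`v`, `M`, `s`). Everything else, including the use of a node's position in the member list as its rank, is an assumption of this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class NodeState:
    node_id: int
    v: int                                   # current view number
    members: list                            # node ids in the current view
    M: dict = field(default_factory=dict)    # M[i]: expected local seq# from node i
    s: int = 0                               # expected global seq# of the next SEQ msg

    def __post_init__(self):
        # Every expected local sequence number starts at 0.
        self.M = {i: 0 for i in self.members}

    def rank(self):
        # Assumption: rank is the position in the view's member list.
        return self.members.index(self.node_id)
```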

Rotating Sequencer: Normal Operation
- Transmitting phase
  - A node i broadcasts a msg, B(v,i,n), to all nodes
    - n: local seq#, initially 0, incremented for each msg broadcast => reliable broadcast
    - Waits for a sequencing msg for the broadcast msg
  - A node j accepts a msg B(v,i,n) if it is in the same view, and buffers it
- Sequencing phase
- Committing phase

Rotating Sequencer: Normal Operation
- Sequencing phase
  - When the sequencer receives a broadcast msg B(v,i,n)
    - It verifies that it is the next expected msg from node i, M[i] = n
    - Assigns the current global seq# s to B(v,i,n)
    - Broadcasts a sequencing msg: SEQ(s,v,[i,n])
  - When a node j receives SEQ(s,v,[i,n]), it accepts it provided
    - s is the expected global seq#
    - It has B(v,i,n) in its buffer; otherwise, it requests retransmission
  - Updates its data structures:
    - Increment expected global seq#
    - Increment expected local seq#
  - SEQ(s,v,[i,n]) also serves as a positive ack for broadcast msg B(v,i,n)
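The transmitting and sequencing steps can be sketched as message handlers on a node. This is a simplified, single-threaded illustration: the method names and string return values are hypothetical, networking is omitted, and the sequencer is assumed to process its own SEQ msg like every other node.

```python
class Node:
    def __init__(self, v, members):
        self.v = v                            # current view number
        self.M = {i: 0 for i in members}      # expected local seq# per sender
        self.s = 0                            # expected global seq#
        self.buffer = {}                      # (i, n) -> payload of B(v,i,n)
        self.ordered = []                     # accepted (s, i, n) triples

    def on_broadcast(self, v, i, n, payload):
        # Accept and buffer B(v,i,n) only if it belongs to the current view.
        if v == self.v:
            self.buffer[(i, n)] = payload

    def sequence(self, i, n):
        # Sequencer side: order B(v,i,n) only if it is the next msg from node i.
        if (i, n) not in self.buffer or self.M[i] != n:
            return None
        return ("SEQ", self.s, self.v, (i, n))

    def on_seq(self, s, v, i, n):
        if v != self.v or s != self.s:
            return "reject"                   # not the expected global seq#
        if (i, n) not in self.buffer:
            return "retransmit"               # B(v,i,n) missing: ask for it again
        self.s += 1                           # next expected global seq#
        self.M[i] += 1                        # next expected local seq# from i
        self.ordered.append((s, i, n))
        return "accept"
```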

Rotating Sequencer: Normal Operation
- Committing phase
  - A node does not deliver a broadcast msg B(v,i,n) until it receives SEQ(s,v,[i,n]) and f subsequent SEQ msgs
  - Ensuring uniform total ordering
    - Even if f nodes failed, at least one node would have received both B(v,i,n) and SEQ(s,v,[i,n])
    - This node ensures that the message is delivered at other nodes in the same total order
- How to transfer the sequencer role
  - The transfer of the sequencer role can be achieved implicitly by the sending of a new sequencing message
  - The next node i assumes the sequencer role when it receives a SEQ(s) msg and the following conditions are met
    - (s+1)%N = i
    - It has received all previous SEQ msgs and B msgs
  - What if no one broadcasts B msgs? The sequencer sends null SEQ msgs
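The commit rule and the implicit role transfer reduce to two small checks. The sketch assumes global sequence numbers start at 0 and that a node has received SEQ msgs without gaps. The function names are hypothetical.

```python
def deliverable(seq_received, f):
    """Global seq#s whose B msg may be delivered.

    seq_received: number of consecutive SEQ msgs received so far,
                  i.e. SEQ 0 .. seq_received-1 (assumed gap-free).
    A msg ordered at s needs SEQ s plus f subsequent SEQ msgs.
    """
    return list(range(max(0, seq_received - f)))


def takes_over(i, s, n_nodes, has_all_previous):
    """True if node i becomes the sequencer after receiving SEQ(s)."""
    return (s + 1) % n_nodes == i and has_all_previous
```

With f=1, a node that has received SEQ 0, 1 and 2 may deliver the msgs ordered at 0 and 1, but not yet the one ordered at 2. A node therefore cannot deliver B as soon as its own SEQ msg arrives.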

Normal Operation: Example
- N=5, f=1
- Can a node deliver B as soon as it receives the corresponding SEQ msg?
- When will B(v,4,20) be delivered?

Rotating Sequencer: Membership Change
- A membership change is triggered by
  - The detection of a failure: a node fails to receive the corresponding SEQ msg for its B msg => sequencer failed
  - The recovery of a failed node: when a node recovers from a failure, it tries to rejoin the membership
- Objective of membership change protocol
  - Only one valid membership view can be formed by the system
  - If a B msg is committed at some nodes in a view, then all nodes in the new view must commit B in the same total order

Rotating Sequencer: Membership Change
- Operates in three phases
- Phase I:
  - The node that detected a failure (originator) sets new view# = v+1 and broadcasts an invitation msg
    - Invitation msg carries the new view#
  - A node accepts the invitation and acks it provided that
    - It has not accepted an invitation for a competing view
    - Note: a node joins at most one membership view at a time
    - The ack carries the node's current view# and the next expected global seq#
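Phase I at the invited node can be sketched as a single handler. The tuple message shapes and names are hypothetical. The check that the proposed view number must exceed the node's current one is an assumption of this sketch and is not stated on the slide.

```python
class Member:
    def __init__(self, view, expected_s):
        self.view = view              # current view#
        self.expected_s = expected_s  # next expected global seq#
        self.pending = None           # (new_view, originator) already accepted

    def on_invitation(self, new_view, originator):
        # Assumption: a stale proposal (not newer than our view) is refused.
        if new_view <= self.view:
            return ("NACK",)
        # A node joins at most one membership view at a time.
        if self.pending is not None and self.pending != (new_view, originator):
            return ("NACK",)
        self.pending = (new_view, originator)
        # The ack carries the current view# and next expected global seq#.
        return ("ACK", self.view, self.expected_s)
```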

Rotating Sequencer: Membership Change, Phase II
- The originator keeps collecting acks until
  - Either it has received an ack from every node in the new membership, or
  - It has collected at least N-f acks and a timeout occurred (for liveness)
- If all acks are positive, the originator proceeds to build a node list for the new view and broadcasts it
  - The originator also learns the msg ordering history of the previous view
    - Highest global seq#: smax
    - Originator's expected global seq#: s0
    - If smax > s0, the originator is missing msgs
      - smax ≥ that of the last msg committed in the previous view
      - Request retransmission
      - Use smax as the starting global seq# for the new view provided that it can receive all missing msgs; otherwise, use the largest s with the corresponding B received
- If negative responses are received, abort and retry
  - A node aborts when (1) it receives an abort msg from the originator, or (2) it times out the membership change
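The originator's stopping condition and its check for missing msgs can be sketched as below. The sketch assumes `acks` maps each responding node id to the next expected global seq# carried in its ack, so smax is the largest of those values and the originator lacks seq#s s0 .. smax-1. Function and parameter names are hypothetical.

```python
def done_collecting(acks, invited, n, f, timed_out):
    """True when the originator may stop waiting for acks."""
    if set(invited) <= set(acks):
        return True                           # every invited node answered
    return timed_out and len(acks) >= n - f   # enough acks after a timeout


def missing_seqs(s0, acks):
    """Global seq#s the originator must ask to have retransmitted."""
    smax = max([s0, *acks.values()])
    return list(range(s0, smax))
```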

Rotating Sequencer: Membership Change, Phase III
- The originator collects responses to its new membership view msg
  - If it receives positive responses from every node in the new view, it commits to the new view
  - Otherwise, it aborts, waits for a random amount of time, and retries with a larger view number

Rotating Sequencer: Membership Change Examples
- Competing originators

Rotating Sequencer: Membership Change Examples
- Premature timeout

Rotating Sequencer: Membership Change Examples
- Network partitioning

Token Based GCS: Totem
- Totem consists of:
  - Total ordering protocol
  - Membership protocol
  - Recovery protocol
  - Flow control mechanisms
- Total ordering msg delivery types
  - Safe delivery: a message is delivered only when all correct processes have received it => uniform total ordering
  - Agreed delivery: a message is delivered as long as it is the next message in total order => nonuniform total ordering
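The difference between the two delivery types can be illustrated with two predicates. This is not Totem's actual interface. The names are hypothetical, and `received_up_to` is an assumed map from each correct process to the highest sequence number it is known to have received without gaps.

```python
def agreed_ready(seq, next_to_deliver):
    # Agreed delivery: being next in the total order is enough.
    return seq == next_to_deliver


def safe_ready(seq, next_to_deliver, received_up_to):
    # Safe delivery: next in the total order AND known to have been
    # received by every correct process.
    return seq == next_to_deliver and all(
        high >= seq for high in received_up_to.values())
```

Safe delivery waits longer, since the node must first learn what the others have received, but it is what gives uniform total ordering.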