TOTEM: A FAULT-TOLERANT MULTICAST GROUP COMMUNICATION SYSTEM

L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, B. K. Budhia, C. A. Lingley-Papadopoulos

University of California, Santa Barbara

INTRODUCTION

• Totem provides reliable totally-ordered multicasting of messages over LANs

• Intended for complex applications with critical requirements for:
– fault tolerance
– real-time performance

• Exploits hardware broadcast of most LANs

TOTEM SERVICES

• Built as a hierarchy of protocols:

Application layer

Process group interface

Multiple-ring protocol

Single-ring protocol

Physical medium

Single-ring protocol

• Built on top of a best-effort multicast service, using UDP to exploit the hardware broadcasts of the LAN

• Converts these multicasts into the service of reliable totally ordered delivery of messages on a single LAN

• Also provides fault detection, recovery and configuration change services

Multiple-ring protocol

• Uses information from the process group interface above it

• Provides total ordering of messages as well as network topology maintenance services

Process group interface

• Delivers messages to the application processes in the appropriate process groups

• Provides process group membership services.

Services provided by Totem

• Two reliable totally ordered message delivery services:
– Agreed delivery
– Safe delivery

• Both services deliver messages in a single system-wide total order that respects Lamport’s causal order

Agreed Delivery

• Guarantees that a processor will not deliver a message before it has delivered all prior messages that:
– Have been issued by processors in the current configuration and
– Have timestamps within the duration of that configuration

• All processes receive all messages in the order they were sent

Safe Delivery

• Further guarantees that a processor will not deliver a message unless all processors in its configuration have received it (everyone or nobody).

• All processes receive all messages in the same order at the same time

Why Lamport’s causal order?

• Otherwise processes that belong to two or more groups could receive messages from different groups in different orders
– A and B are both in groups G and H
– A receives m from group G, then m’ from group H and finally m’’ from group G
– B could receive m’ from group H, then m from group G and finally m’’ from group G

Example

Group G sends messages m and m’’ to A and B

Group H sends message m’ to A and B

Both A and B will receive m and m’’ in the same order

Without total ordering, A could receive m’ before m’’ and B could receive m’’ before m’

Delivery guarantees

• Extended virtual synchrony ensures that these guarantees are honored within every configuration
– When a fault occurs, Totem forms a transitional configuration with a reduced membership
– Message order is guaranteed even in the presence of network partitions

Extended virtual synchrony (I)

• We want to ensure that
– Messages are received in the same order by all processes
– All processes share the same view of the process group to which they belong

Extended virtual synchrony (II)

• Virtual synchrony model (K. Birman, ISIS) orders group membership changes along with the regular messages

• Ensures that failures do not result in
– Incomplete delivery of multicast messages
– Holes in the causal delivery order

• Problems remain if network can partition

Extended virtual synchrony (III)

• Extended virtual synchrony model (Totem) extends the virtual synchrony model to systems in which
– Processes can fail and recover
– The network can partition and remerge

• Guarantees that a message sent to processes in two or more components of a partitioned network is ordered consistently in all of these components

Ordering of messages

• Messages are born-ordered:
– Each message includes a timestamp
– The relative order of messages is determined by the messages themselves, as created by their senders
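The born-ordered idea can be illustrated with a minimal Python sketch. It assumes Lamport-style logical clocks and breaks timestamp ties by sender name; the `Message` and `Sender` classes and the tie-breaking rule are illustrative assumptions, not Totem's actual message format.

```python
from dataclasses import dataclass, field


@dataclass(order=True, frozen=True)
class Message:
    # Fields are compared in declaration order, so sorting uses
    # (timestamp, sender) and ignores the payload.
    timestamp: int
    sender: str
    payload: str = field(compare=False)


class Sender:
    """Hypothetical sender that stamps its messages with a Lamport clock."""

    def __init__(self, name):
        self.name = name
        self.clock = 0

    def send(self, payload):
        self.clock += 1
        return Message(self.clock, self.name, payload)

    def receive(self, msg):
        # Lamport's rule: never fall behind a timestamp already seen.
        self.clock = max(self.clock, msg.timestamp)


a, b = Sender("A"), Sender("B")
m1 = a.send("m")      # timestamp 1
b.receive(m1)
m2 = b.send("m'")     # timestamp 2, causally after m
m3 = a.send("m''")    # timestamp 2, concurrent with m'

# Two receivers see different arrival orders but derive the same total order.
arrival_at_x = [m3, m1, m2]
arrival_at_y = [m2, m3, m1]
assert sorted(arrival_at_x) == sorted(arrival_at_y)
print([m.payload for m in sorted(arrival_at_x)])
```

Because the order is carried by the messages themselves, receivers need no further coordination to agree on it: both lists sort to m, m'', m'.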

The single-ring protocol (I)

• Uses a circulating token containing, among other fields:
– A seq field with the sequence number of the last message that was sent
– An aru field with the sequence number of the last message that has been received by all processors

• Only the processor that holds the token can send a message

The single-ring protocol (II)

• aru field used to implement safe delivery:
– Tells processors which messages have been received by every processor in the ring

• Token also provides information about the aggregate message backlog of the processors on the ring
– Results in a fairer bandwidth allocation among processors than FDDI
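The roles of the seq and aru fields can be sketched with a toy simulation in Python. This is a simplified model under stated assumptions, not the Totem algorithm: the `rotate` function, the `lost` parameter and the rule "aru is the lowest gap-free sequence number seen during one rotation" are all invented for illustration, and retransmission requests are left out.

```python
from dataclasses import dataclass


@dataclass
class Token:
    seq: int = 0  # sequence number of the last message sent on the ring
    aru: int = 0  # highest sequence number known to be received by everyone


class Processor:
    def __init__(self, name):
        self.name = name
        self.outbox = []    # payloads waiting for the token
        self.received = {}  # seq -> payload

    def my_aru(self):
        # Highest sequence number up to which this processor has no gaps.
        n = 0
        while n + 1 in self.received:
            n += 1
        return n


def rotate(token, ring, lost=frozenset()):
    """Pass the token once around the ring (hypothetical helper).

    `lost` holds (seq, receiver_name) pairs that the simulated broadcast drops.
    """
    low = token.seq  # conservative starting point for this rotation
    for holder in ring:
        # Only the token holder may broadcast.
        while holder.outbox:
            token.seq += 1
            payload = holder.outbox.pop(0)
            for p in ring:
                if (token.seq, p.name) not in lost:
                    p.received[token.seq] = payload
        low = min(low, holder.my_aru())
    token.aru = low
    return token


ring = [Processor("A"), Processor("B"), Processor("C")]
ring[0].outbox = ["m1", "m2"]
ring[1].outbox = ["m3"]
token = Token()
rotate(token, ring, lost={(2, "C")})  # C misses message 2
rotate(token, ring)                   # a second rotation reveals the gap
print(token.seq, token.aru)
```

In this run the token ends with seq 3 and aru 1, so only message 1 could be delivered as safe. Once C obtains message 2, a further rotation raises aru to 3.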

Local membership protocol (I)

• Part of the single-ring protocol
• Allows
– Inclusion of new or recovering processors
– Deletion of faulty processors

Local membership protocol (II)

• Ensures:
– Consensus among all members of a configuration about the configuration membership
– Termination: each configuration will be installed on every processor within a bounded time or not at all

The multiple-ring protocol (I)

• Operates over several LANs linked by gateways

– Each LAN is organized as a virtual token ring and managed by the single-ring protocol

• Offers same services and same guarantees as single-ring protocol

[Figure: three rings, labeled Ring A, Ring B and Ring C]

The multiple-ring protocol (II)

• Uses Lamport’s timestamps and delivers messages in timestamp order

• When a gateway forwards a message from one ring to another, it gives the message a new sequence number for the new ring

• Processor faults and network partitions are detected by the single-ring protocol
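The forwarding step can be sketched in Python as follows. This is a hedged illustration: the `RingMessage` fields and the `Gateway` class are hypothetical, and the per-ring counter kept inside the gateway is an assumption made to keep the example self-contained (on a real ring, sequence numbers are tied to the token). The point shown is that the timestamp travels unchanged while the sequence number is local to each ring.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RingMessage:
    timestamp: int    # Lamport timestamp, fixed by the original sender
    origin_ring: str  # ring on which the message was first sent
    ring: str         # ring on which this copy is travelling
    seq: int          # sequence number, meaningful only within `ring`
    payload: str


class Gateway:
    """Hypothetical gateway that re-sequences messages it forwards."""

    def __init__(self):
        self.last_seq = {}  # ring name -> last sequence number used here

    def forward(self, msg, to_ring):
        seq = self.last_seq.get(to_ring, 0) + 1
        self.last_seq[to_ring] = seq
        # Only the per-ring fields change; timestamp and origin are kept,
        # so delivery in timestamp order is unaffected by the hop.
        return replace(msg, ring=to_ring, seq=seq)


gw = Gateway()
original = RingMessage(timestamp=12, origin_ring="A", ring="A", seq=40, payload="x")
copy_on_b = gw.forward(original, "B")
print(copy_on_b)
```

Keeping the timestamp immutable is what lets every processor, on any ring, place the message at the same position in the global order.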

Message delivery (I)

• Each processor maintains one recv_msgs list of messages received but not yet delivered for each ring from which it can receive messages

Ring    Messages
A       mA1, mA2, mA3
B       mB1, mB2
C       mC1

Message delivery (II)

• A processor will deliver a message as an agreed message as soon as
– The message has the lowest timestamp of all the messages in its recv_msgs lists
– None of these lists is empty

• Were it not for guaranteed vector messages, a processor could wait forever for rings that have no messages to send

Example (I)

• Consider the following recv_msgs lists, where the numbers in parentheses indicate the message timestamps.

Ring    Messages (with timestamps)
A       mA1(8), mA2(12), mA3(16)
B       mB1(9), mB2(13)
C       mC1(10)

Example (II)

• We can deliver messages mA1, mB1 and mC1

Ring    Messages (with timestamps)
A       mA1(8), mA2(12), mA3(16)
B       mB1(9), mB2(13)
C       mC1(10)

Example (III)

• We cannot deliver more messages because we might miss a message mCn from ring C with an earlier timestamp, say ts = 11.

Ring    Messages (with timestamps)
A       mA2(12), mA3(16)
B       mB2(13)
C       ---
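The delivery rule from the example can be sketched in a few lines of Python. This is a minimal illustration, assuming each ring's list is already in timestamp order; the function name `deliver_agreed` and the (name, timestamp) tuple layout are invented for the sketch, and guaranteed vector messages are not modeled.

```python
from collections import deque


def deliver_agreed(recv_msgs):
    """Deliver messages in timestamp order until some ring's list is empty.

    `recv_msgs` maps a ring name to a deque of (name, timestamp) pairs,
    each deque already sorted by timestamp.
    """
    delivered = []
    # Stop as soon as any list is empty: that ring might still send a
    # message with an earlier timestamp than anything currently held.
    while recv_msgs and all(recv_msgs.values()):
        ring = min(recv_msgs, key=lambda r: recv_msgs[r][0][1])
        delivered.append(recv_msgs[ring].popleft())
    return delivered


recv_msgs = {
    "A": deque([("mA1", 8), ("mA2", 12), ("mA3", 16)]),
    "B": deque([("mB1", 9), ("mB2", 13)]),
    "C": deque([("mC1", 10)]),
}
delivered = deliver_agreed(recv_msgs)
print(delivered)  # [('mA1', 8), ('mB1', 9), ('mC1', 10)]
```

After the call, ring C's list is empty while mA2, mA3 and mB2 remain queued, which matches the state shown in Example (III).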

Related issues

• From time to time, Totem sends guaranteed vector messages
– They specify, among other things, which rings have sent messages

• Processor faults and network partitions are detected by the single-ring protocol

• Configuration and topology change messages have timestamps and are delivered in strict timestamp order