

Bimodal Multicast

KENNETH P. BIRMAN, Cornell University
MARK HAYDEN, Digital Equipment Corporation/Compaq
OZNUR OZKASAP and ZHEN XIAO, Cornell University
MIHAI BUDIU, Carnegie Mellon University
and
YARON MINSKY, Cornell University

There are many methods for making a multicast protocol "reliable." At one end of the spectrum, a reliable multicast protocol might offer atomicity guarantees, such as all-or-nothing delivery, delivery ordering, and perhaps additional properties such as virtually synchronous addressing. At the other are protocols that use local repair to overcome transient packet loss in the network, offering "best effort" reliability. Yet none of this prior work has treated stability of multicast delivery as a basic reliability property, such as might be needed in an internet radio, television, or conferencing application. This article looks at reliability with a new goal: development of a multicast protocol which is reliable in a sense that can be rigorously quantified and includes throughput stability guarantees. We characterize this new protocol as a "bimodal multicast" in reference to its reliability model, which corresponds to a family of bimodal probability distributions. Here, we introduce the protocol, provide a theoretical analysis of its behavior, review experimental results, and discuss some candidate applications. These confirm that bimodal multicast is reliable, scalable, and that the protocol provides remarkably stable delivery throughput.

Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design; C.2.2 [Computer-Communication Networks]: Network Protocols; C.2.4 [Computer-Communication Networks]: Distributed Systems; D.4.1 [Operating Systems]: Process Management; D.4.4 [Operating Systems]: Communications Management; D.4.7 [Operating Systems]: Organization and Design

This work was supported by DARPA/ONR contracts N0014-96-1-10014 and ARPA/RADC F30602-96-1-0317, the Cornell Theory Center, and the Turkish Research Foundation.
Authors' addresses: K. P. Birman, Department of Computer Science, Cornell University, 4126 Upson Hall, Ithaca, NY 14853; email: [email protected]; M. Hayden, Systems Research Center, Digital Equipment Corporation/Compaq, 130 Lytton Avenue, Palo Alto, CA 94301; email: [email protected]; O. Ozkasap and Z. Xiao, Department of Computer Science, Cornell University, 4126 Upson Hall, Ithaca, NY 14853; email: [email protected]; [email protected]; M. Budiu, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213; email: [email protected]; Y. Minsky, Department of Computer Science, Cornell University, 4126 Upson Hall, Ithaca, NY 14853.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 1999 ACM 0734-2071/99/0500–0041 $5.00

ACM Transactions on Computer Systems, Vol. 17, No. 2, May 1999, Pages 41–88.


Additional Key Words and Phrases: Bimodal Multicast, internet media transmission, isochronous protocols, probabilistic multicast reliability, scalable group communications, Scalable Reliable Multicast

PREFACE

Encamped on the hilltops overlooking the enemy fortress, the commanding General prepared for the final battle of the campaign. Given the information he was gathering about enemy positions, his forces could prevail. Indeed, if most of his observations could be communicated to most of his forces the battle could be won even if some reports reached none or very few of his troops. But if many reports failed to get through, or reached many but not most of his commanders, their attack would be uncoordinated and the battle lost, for only he was within direct sight of the enemy, and in the coming battle strategy would depend critically upon the quality of the information at hand.

Although the General had anticipated such a possibility, his situation was delicate. As the night wore on, he dispatched wave upon wave of updates on the enemy troop placements. Some couriers perished in the dark, wet forests separating the camps. Worse still, some of his camps were beset by the disease that had ravaged the allies since the start of the campaign. They could not be relied upon, as chaos and death ruled there.

With the approach of dawn, the General sat sipping coffee—rotgut stuff—reflectively. In the night, couriers came and went, following secret protocols worked out during the summer. At the appointed hour, he rose to lead the attack. The General was not one to shirk a calculated risk.

1. INTRODUCTION

Although many communication systems provide software support for reliable multicast communication, the meaning given to "reliability" splits them into two broad classes. One class of definitions corresponds to "strong" reliability properties. These typically include atomicity, which is the guarantee that if a multicast is delivered to any destination that remains operational it will eventually be delivered to all operational destinations. An atomic multicast may also provide message delivery ordering, support for virtual synchrony (an execution model used by many group communication systems), security properties, real-time guarantees, or special behavior if a network partitioning occurs [Birman 1997]. A criticism is that to obtain these strong reliability properties one employs costly protocols, accepts the possibility of unstable or unpredictable performance under stress, and tolerates limited scalability [Cheriton and Skeen 1993] (but see also Birman [1994], Cooper [1994], and van Renesse [1994]).


As we will see shortly, transient performance problems can cause these protocols to exhibit degraded throughput. Even with a very stable network, it is hard to scale these protocols beyond several hundred participants [Piantoni and Stancescu 1997].

Protocols belonging to the second class of "reliable multicast" solutions focus upon best-effort reliability in very large-scale settings. Examples include the Internet MUSE protocol (for network news distribution) [Lidl et al. 1994], the Scalable Reliable Multicast protocol (SRM) [Floyd et al. 1995], the XPress Transfer Protocol [XTP Forum 1995], and the Reliable Message Transport Protocol (RMTP) [Lin and Paul 1997; Paul et al. 1997]. These systems include scalable multicast protocols which overcome message loss or failures, but are not provided with an "end to end" reliability guarantee. Indeed, as these protocols are implemented, "end to end" reliability may not be a well-defined concept. There is no core system to track membership in the group of participants; hence it may not be clear what processes belong to the destination set for a multicast, or even whether the set is small or large. Typically, processes join anonymously by linking themselves to a multicast forwarding tree, and subsequently interact only with their immediate neighbors. Similarly, a member may drop out or fail without first informing its neighbors.

The reliability of such protocols is usually expressed in "best effort" terminology: if a participating process discovers a failure, a reasonable effort is made to overcome it. But it may not always be possible to do so. For example, in SRM (the most carefully studied among the protocols in this class) a router overload may disrupt the forwarding of multicast messages to processes downstream from the router. If this overload also prevents negative acknowledgments and retransmissions from getting through for long enough, gaps in the message delivery sequence may not be repaired. Liu and Lucas report conditions under which SRM can behave pathologically, remulticasting each message a number of times that rises with system scale [Liu 1997; Lucas 1998]. Here, we present additional data of a similar nature. (Liu also suggests a technique to improve SRM so as to partially overcome the problem.) The problematic behavior is triggered by low levels of systemwide noise or by transient elevated rates of message loss, phenomena known to be common in Internet protocols [Labovitz et al. 1997; Paxson 1997]. Yet SRM and similar protocols do scale beyond the limits of the virtual synchrony protocols, and when message loss is sufficiently uncommon, they can give a very high degree of reliability.

In effect, the developer of a critical application is forced to choose between reduced scalability but stronger notions of reliability in the first class of reliable multicast protocol, and weaker guarantees but better normal-case scalability afforded by the second class. For critical uses, the former option may be unacceptable because of the risk of a throughput collapse under unusual but not "exceptionally rare" conditions. Yet the latter option may be equally unacceptable, because it is impossible to reason about the behavior of the system when things go wrong.


The present article introduces a new option: a bimodal multicast protocol that scales well, and provides predictable reliability even under highly perturbed conditions. For example, the reliability and throughput of our new protocol remain steady even as the network packet loss rate rises to 20% and even when 25% of the participating processes experience transient performance failures. We also present data showing that the LAN implementation of our protocol overcomes bursts of packet loss with minimal disruption of throughput.

The sections that follow start by presenting the protocol itself and some of the results of an analytic study (the details of the analysis are included as an Appendix). We show that the behavior of our protocol can be predicted given simple information about how processes and the network behave most of the time, and that the reliability prediction is strong enough to support a development methodology that would make sense in critical settings. Next, we present a variety of data comparing our new protocol with prior protocols, notably a virtually synchronous reliable multicast protocol, also developed by our group, and the SRM protocol. In each case we use implementations believed to be the best available in terms of performance and tuned to match the environment. Our studies include a mixture of experimental work on an SP2, simulation, and experiments with a bimodal multicast implementation for LANs (possibly connected by WAN gateways). Under conditions that cause other reliable protocols to exhibit unstable throughput, bimodal multicast remains stable. Moreover, we will show that although our model makes simplifying assumptions, it still makes accurate predictions about real-world behavior. Finally, the article examines some critical reliable multicast applications, identifying roles for protocols with strong properties and roles for bimodal multicast.

Bimodal multicast is not a panacea: the protocol offers a new approach to reliability, but uses a model that is weaker in some ways than virtual synchrony, despite its stronger throughput guarantees. We see it as a tool to offer side by side with other reliability tools, but not as a solution that "competes" with previous protocols.

2. MULTICAST THROUGHPUT STABILITY

In our work with reliable multicast we participated in the development of communication infrastructures for applications such as stock markets (the New York and Swiss Stock Exchanges) [Birman 1999; Piantoni and Stancescu 1997] and air traffic control (the French console replication system called PHIDIAS [1]). The critical nature of such applications means that developers will have to know exactly how their systems behave under expected operational conditions, and to do so they need detailed information about how the reliable communication primitives they use will behave. These applications also demand high performance and scalability. In particular, they often have a data transport subsystem that will produce a sustained, fairly high volume of data considered critical for safe operation.

[1] http://www.stna.dgac.fr/projects/phidias/


In the past, such subsystems were often identified as critical real-time applications, but today's computers and networks are so fast that the real need is for stable throughput.

Our new protocol permits designers of such applications to factor out the soft real-time data stream. Bimodal multicast will handle this high-volume workload, leaving a less demanding lower-volume residual communication task for protocols like the virtual synchrony ones, which work well in less stressful settings. Because the communication demands of bimodal multicast can be predicted from the multicast rate and a small set of parameters, a designer can anticipate that bimodal multicast will consume a fixed percentage of available bandwidth and memory resources, and configure the system with adequate time for the virtual synchrony mechanisms.

Bimodal multicast is a good choice for this purpose, for several reasons. First, as just noted, the load associated with the protocol is predictable and largely independent of scale. The protocol can be shown to have a bimodal delivery guarantee: given information about the environment—information that we believe is reasonable for typical networks running standard Internet protocols—our protocol can be configured to have a very small probability of delivering to a small number of destinations (counting failed ones), an insignificant risk of delivering to "many" but not "most" destinations, and a very high probability of delivering the message to all or almost all destinations. Our model lets us tune the actual probabilities to match the intended use. And we will show how to use the model to evaluate the safety of applications, such as the ones mentioned above.

Secondly, our protocol has stable throughput. Traditional reliable multicast protocols—atomic broadcast in its various incarnations—suffer from a form of interference between flow control and reliability mechanisms. This can trigger unstable throughput when the network is scaled up, and some application programs exhibit erratic behavior. We are able to demonstrate the stability of our new protocol both theoretically and experimentally. For the types of applications that motivate our work, this stability guarantee is extremely important: one needs to know that the basic data stream is delivered at a steady rate and that it is delivered reliably.

To give some sense of where the article is headed, consider Figure 1. Here we measured throughput at a healthy process in virtually synchronous multicast groups of various sizes: 32, 64, and 96 members. One of these members attempts to inject 7KB multicast messages at a rate of 200 per second. Ideally, 200 messages per second emerge. But the graph shows that as we "perturb" even a single group member by causing it to sleep for the percentage of each second shown on the x-axis, throughput collapses for the unperturbed group members. The problem becomes worse as the group grows larger (it would also be worse if we increase the percentage of perturbed members). In the experimental sections of this article, we will see that the bimodal multicast achieves the "ideal" output rate of 200 messages per second under the same conditions, even with 25% of the members perturbed. Details of the experiment used to produce Figure 1 appear in the experimental section of this article.

As mentioned earlier, studies of SRM have identified similar problems. In the case of SRM, networkwide noise and routing problems represent the worst case. For example, Lucas, in his doctoral dissertation [Lucas 1998], shows that even low levels of network noise can cause SRM to broadcast high rates of retransmission requests and retransmitted data messages, so that each multicast triggers multiple messages on the wire. Lucas finds that the rate of retransmissions rises in proportion to the SRM group size. Liu, studying other problems (but of a similar nature) proposes a number of changes to SRM that might improve its behavior in noisy networks. Our own simulations, included in the experimental section of this article, confirm these problems and make it clear that as SRM is scaled up, the protocol will eventually collapse, much as does the virtually synchronous protocol shown in Figure 1.

What causes these problems? In the case of the virtually synchronous protocols, a perturbed process is particularly difficult to accommodate. On the one hand, the process is not considered to have failed, since it is sending and receiving messages. Yet it is slow to acknowledge messages and may experience high loss rates, particularly if operating system buffers fill up. The sender and healthy receivers keep copies of unacknowledged messages until they get through, exhausting available buffering space and causing flow control to kick in. One could imagine setting failure detection parameters more aggressively (this is what Piantoni and Stancescu [1997] recommend), but now the risk of an erroneous failure classification will rise roughly as the square of the group size.

[Figure 1. Throughput as one member of a multicast group is "perturbed" by forcing it to sleep for varying amounts of time. The plot, titled "Virtually synchronous Ensemble multicast protocols," shows average throughput on nonperturbed members (0–250 messages per second) against the perturb rate (0–0.9), with curves for group sizes 32, 64, and 96.]


The problem is that all group members can be understood as monitoring one another; hence, the more aggressive the failure detector, the more likely that a paging or scheduling delay will be interpreted as a crash. Thus, as one scales these protocols beyond a group size of about 50–100 members, the tension between throughput stability and failure detection accuracy becomes a significant problem. Not surprisingly, most successes with virtual synchrony use fairly small groups, sometimes structured hierarchically. The largest systems are typically ones where performance demands are limited to short bursts of multicasts, far from the rates seen in Figure 1 [Birman 1999].

Turning to SRM, one can understand the problem as emerging from a form of stochastic attack on the probabilistic assumptions built into the protocol. Readers familiar with SRM will know that the protocol includes many such assumptions: they are used to prevent duplicated multicasts of requests for retransmission and duplicated retransmissions of data, and to estimate the appropriate time-to-live (TTL) value to use for each multicast. Such assumptions have a small probability of being incorrect, and in the case of SRM, as the size of the system rises, the absolute likelihood of mistakes rises, causing the background overhead to rise. Eventually, these forms of overhead interfere with normal system function, causing throughput to become unstable. For sufficiently large configurations or loads, they can trigger a form of "meltdown."

We believe that our article is among the first to focus on stable throughput in reliable multicast settings. Historically, reliable multicast split early into two "camps." One camp focused on performance and scalability, emphasizing peak performance under ideal situations. The other camp focused on providing rigorous definitions for reliability and protocols that could be proved to implement their reliability specifications. These protocols tended to be fairly heavyweight, but performance studies also emphasized their best-case performance. Neither body of work viewed stable scalable throughput as a fundamental reliability goal, and as we have seen, stable throughput is not so easily achieved.

The properties of the bimodal multicast protocol seem to be ideal for many of the applications where virtual synchrony encounters its limits. These include internet media distribution (such as radio and television broadcasts, or conferencing), distribution of stock prices and trade information on the floor of all-electronic stock exchanges, distribution of flight telemetry data in air traffic control systems and military theater of operations systems, replication of medical telemetry data in critical care systems, and communication within factory automation settings. Each of these is a setting within which the highest-volume data sources have an associated notion of "freshness," and the importance of delivering individual messages decreases with time. Yet several are also "safety critical" applications for which every component must be amenable to analysis.


3. A BIMODAL MULTICAST PROTOCOL

Our protocol is an outgrowth of work which originated in studies of gossip protocols done at Xerox [Demers et al. 1988], the Internet MUSE protocol [Lidl et al. 1994], the SRM protocol of Floyd et al. [1995], the NAK-only protocols used in XTP [XTP Forum 1995], and the lazy transactional replication method of Ladin et al. [1992]. Our protocol can be understood as offering a form of weak real-time guarantee; relevant prior work includes Cristian et al. [1985; 1990] and Baldoni et al. [1996a; 1996b].

The idea underlying gossip protocols dates back to the original USENET news protocol, NNTP, developed in the early 1980s. In this protocol, a communications graph is superimposed on a set of processes, and neighbors gossip to diffuse news postings in a reliable manner over the links. For example, if process A receives a news posting and then establishes communication with process B, A would offer B a copy of that news message, and B solicits the copy if it has not already seen the message.

The Xerox work considered gossip communication in the context of a project developing wide-area database systems [Demers et al. 1988]. They showed how gossip communication is related to the mathematics underlying the propagation of epidemics, and developed a family of gossip-based multicast protocols. However, the frequency of database updates was low (at most, a few per second); hence, the question of stable throughput did not arise. The model itself considered communication failures but not process failures. Our work addresses both aspects.

In this article, we actually report on multiple implementations of our new protocol. The version we study most carefully was implemented within Cornell University's Ensemble system [Hayden 1998], which offers a modular plug-and-play framework that includes some of the standard reliable multicast protocols, and can easily be extended with new ones. This plug-and-play architecture was important both because it lets our new work sit next to other protocols, and because it facilitated controlled experiments. Ensemble supports group-communication protocol stacks that are constructed by composing microprotocols, an idea that originated in the Horus project [Birman 1997; van Renesse et al. 1996].

Because we implemented this version of our protocol within Ensemble, the system model is greatly simplified. An Ensemble process group executes as a series of execution periods during each of which group membership is static, and known to all members (see Figure 2, where time advances from left to right, and each time-line corresponds to the execution of an individual process). One execution period ends and a new one begins when membership changes by the addition of new processes, or the departure (failure) of old ones. Below, we will discuss our new protocol for just a single execution period; this restriction is especially important in the formal analysis we present. The mechanisms whereby Ensemble switches from one period to the next depend only on knowledge of when the currently active set of multicast instantiations has terminated, i.e., stabilized.


In our new protocol, this occurs at a given group member when that member garbage-collects the multicasts known to it.

[Figure 2. Multicast execution periods in Ensemble. Initially, the group consists of p and q; multicasts sent by p are delivered to q (and vice versa). R then joins and state is transferred to it. After a period of additional multicasting, q fails; s later joins and receives an additional state transfer. The period between two consecutive membership lists is denoted an "execution period."]

The second version of the protocol is more recent, and while we include some experimental data obtained using it, we will not discuss it separately. This implementation is the basis of a new system we are developing called Spinglass, and operates as a free-standing solution. Unlike the Ensemble solution, which we use primarily for controlled studies on an SP2 computer, the newer implementation runs on a conventional LAN and is being extended to also run in a WAN. The protocols are not identical, but the differences are not of great importance in the present setting, and we will treat them as a single implementation. Spinglass uses gossip to track membership as well as to do communication [van Renesse et al. 1996], but the behavior of the bimodal protocol is unaffected (formal analysis of the combined gossip mechanisms is, however, beyond our current ability).

In the remainder of this article, we will refer to our new protocol as pbcast ("probabilistic broadcast"), since "bimodal multicast" and the obvious contractions seem awkward. A pbcast to a process group satisfies the following properties:

—Atomicity: The protocol provides a bimodal delivery guarantee, under which there is a high probability that each multicast will reach almost all processes, a low probability that a multicast will reach just a very small set of processes, and a vanishingly small probability that it will reach some intermediate number of processes. The traditional "all or nothing" guarantee thus becomes "almost all or almost none."

—Throughput stability: The expected variation in throughput can be characterized and, for the settings of interest to us, is low in comparison to typical multicast rates.

—Ordering: Messages are delivered in FIFO order on a per-sender basis. Stronger orderings can be layered over the protocol, as discussed in Birman [1997]. For example, Hayden and Birman [1996] include a protocol similar to pbcast with total ordering layered over it.

—Multicast stability: The protocol detects stability of messages, meaning that the bimodal delivery guarantee has been achieved. A stable message can be safely garbage-collected, and if desired, the application layer will also be informed. Protocols such as SRM generally lack stability detection, whereas virtual synchrony protocols include such mechanisms.



—Detection of lost messages: Although unlikely at a healthy process, our bimodal delivery property does admit a small possibility that some multicasts will not reach some processes, and message loss is common at faulty processes. Should such an event occur, processes that did not receive a message are informed via an upcall. Section 5 discusses recovery from message loss.

—Scalability: Costs are constant or grow slowly as a function of the network size. We will see that most pbcast overheads are constant as a function of group size and that throughput variation grows slowly (with the log of the group size).

For purposes of analysis, our work assumes that the protocol operates in a network for which throughput and reliability can be characterized for about 75% of messages sent, and where network errors are iid. We assume that a correctly functioning process will respond to incoming messages within a known, bounded delay. Again, this assumption needs to hold only for about 75% of processes in the network. We also assume that bounds on the delays of network links are known. However, this last assumption is subtle, because pbcast is normally configured to communicate preferentially over low-latency links, as elaborated in Section 4.

Traditionally, the systems community has distinguished two types of failures. Hard failures include crash failures or network partitionings. Soft failures include the failure to receive a message that was correctly delivered (normally, because of buffer overflow), failure to respect the bounds for handling incoming messages, and transient network conditions that cause the network to locally violate its normal throughput and reliability properties. Unlike most protocols, which only tolerate hard failures, the goal of our protocol is to also overcome bounded numbers of soft failures with minimal impact on the throughput of multicasts sent by a correct process to other correct processes. Moreover, although this is not a "guarantee" of the protocol, a process that experiences a soft failure will (obviously) experience a transient deviation from normal throughput properties, but can then catch up with the others. Again, this behavior holds up to a bounded number of soft-failure events. We do not consider Byzantine failures.

Although our assumptions may seem atypical of local-area networks, the designer of a critical application such as an air traffic control system would often be able to satisfy them. For example, a developer could work with LAN and WAN links isolated from uncontrolled traffic, high-speed interconnects of the sort used in cluster-style scalable computers, or a virtual private network with quality of service guarantees.

But our work appears to be more broadly applicable, even in conventional networks. Here, we report on experiments with the Spinglass implementation of pbcast in normal networks, where we observed the protocol's behavior during bursts of packet loss of the sort triggered by transient network overloads.


Although such events violate the assumptions of the model, the protocol continues to behave reliably and to provide steady throughput. One open question concerns the expected behavior of pbcast when running over WAN gateways spanning the public Internet, where loss rates and throughput fluctuate chaotically [Labovitz et al. 1997; Paxson 1997]. Based on our work up to the present, it seems that Spinglass can do fairly well by tunneling over TCP links between gateway processes, provided that enough such links are available. We conjecture that although the formal study of such configurations may be intractable, the idealized network model is considerably more robust than the assumptions underlying it would suggest.

4. DETAILS OF THE PBCAST PROTOCOL

Pbcast is composed of two subprotocols structured roughly as in the Internet MUSE protocol [Lidl et al. 1994]. The first is an unreliable, hierarchical broadcast that makes a best-effort attempt to efficiently deliver each message to its destinations. Where IP multicast is available, it can play this role. The second is a two-phase anti-entropy protocol that operates in a series of unsynchronized rounds. During each round, the first phase detects message loss; the second phase corrects such losses and runs only if needed. This section describes the protocol; pseudocode is included as Appendix B. We begin with a basic discussion but then introduce a series of important optimizations.

4.1 Optimistic Dissemination Protocol

The first stage of our protocol multicasts each message using an unreliable multicast primitive. This can be done using IP multicast or, if IP multicast is not available, using a randomized dissemination protocol. In the latter case, we assume full connectivity and superimpose "virtual" multicast spanning trees upon the set of participants. Each process has a variety of such pseudorandomly generated spanning trees for broadcasting messages to the entire group; these are generated in a deterministic manner from the group membership using an inexpensive algorithm. When a member broadcasts a message, it is sent using a randomly selected spanning tree. [2] This is done by attaching a tree identifier (a small integer) to the message and sending it to the sender's neighbors in the tree. Upon receipt, those members deliver the message and then forward it using the tree identifier. This dissemination protocol can be tuned with respect to the number of random trees that are used and the degree of nodes in the tree. Because all of the messages are sent unreliably, the choice of tree should be understood purely as an optimization to quickly deliver the message to many members. If some members do not receive the message, the anti-entropy protocol still ensures probabilistically reliable delivery.

[2] Although the tree that will be used for a given multicast execution period is not predictable prior to when that period begins, the same tree will be used throughout the duration of the execution period.
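To make the dissemination stage concrete, here is a minimal sketch of how such deterministic pseudorandom trees might be generated. The construction (a shuffled d-ary tree seeded by hashing the membership list) and all names below are our own illustration; the paper does not spell out the inexpensive algorithm it uses.

```python
import hashlib
import random

def spanning_tree(members, tree_id, degree=3):
    """Deterministically build virtual spanning tree `tree_id` over the
    current membership. Every process computes the same tree because the
    PRNG is seeded from the sorted membership list plus the tree id.
    Returns an adjacency map: member -> set of tree neighbors."""
    seed = hashlib.sha256(
        (",".join(sorted(members)) + "#%d" % tree_id).encode()).digest()
    order = sorted(members)
    random.Random(seed).shuffle(order)
    adj = {m: set() for m in members}
    for i in range(1, len(order)):            # link node i under its parent
        parent = order[(i - 1) // degree]
        adj[parent].add(order[i])
        adj[order[i]].add(parent)
    return adj

def broadcast(self_id, msg, members, send_unreliably, num_trees=4):
    """Originate a multicast: pick one of the pseudorandom trees at random
    and send to our neighbors in it, tagged with the tree id."""
    tree_id = random.randrange(num_trees)
    for nbr in spanning_tree(members, tree_id)[self_id]:
        send_unreliably(nbr, (self_id, tree_id, msg))

def on_receive(self_id, from_id, tree_id, msg, members, send_unreliably, deliver):
    """Deliver, then forward along the same tree, skipping the link the
    message arrived on. All sends are best effort: losses are repaired
    later by the anti-entropy protocol."""
    deliver(msg)
    for nbr in spanning_tree(members, tree_id)[self_id] - {from_id}:
        send_unreliably(nbr, (self_id, tree_id, msg))
```

Because every member derives the identical tree from the shared view and the tree id carried on the message, no coordination is needed to agree on forwarding paths.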



In the Ensemble implementation of pbcast, the tree-dissemination protocol uses Ensemble's group membership manager to track membership, but this also limits scalability. Ensemble's group membership system works well up to a hundred members or so, and can probably be scaled to a few hundred. As for the Spinglass version of pbcast, we currently use a hand-configured multicast architecture, represented as a multicast routing table used by the protocol; again, this makes sense for a few hundred machines but probably not for larger networks. We see management of the multicast dissemination routes for pbcast as a topic for which additional study will be required.

4.2 Two-Phase Anti-Entropy Protocol

The important properties of pbcast stem from its gossip-based anti-entropy protocol. The term anti-entropy is from Demers et al. [1988] and refers to protocols that detect and correct inconsistencies in a system by continuous gossiping. Our anti-entropy protocol progresses through rounds in which members randomly choose other members, send a summary of their message histories to the selected process, and then solicit copies of any messages they discover themselves to be lacking, to converge toward identical histories. This is illustrated in Figure 3: after a period during which messages are multicast unreliably (process Q fails to receive a copy of M0, and process S fails to receive M1, denoted by the dashed arrows), the anti-entropy protocol executes (gray region). At this time Q discovers that it has missed M0 and requests a retransmission from P, which forwards it. S does not detect and repair its own loss until the subsequent anti-entropy round.

The figure oversimplifies by suggesting that the protocol alternates between multicasting and running anti-entropy rounds; in practice, the two modes are concurrent. Also, the figure makes the anti-entropy communication look regular; in practice, it is quite random. Thus, Q receives an anti-entropy message from P, but could have obtained it from R or S. Moreover, as a side effect of randomness, a process may not receive an anti-entropy message at all in a given round of gossip: here, S receives none, while Q receives two.

[Figure 3. Illustration of the pbcast anti-entropy protocol. Messages M0–M4 are multicast among processes P, Q, R, and S, with anti-entropy rounds interleaved.]


Our protocol differs from most prior gossip protocols in that pbcast emphasizes achieving a common suffix of message histories rather than a common prefix. In other words, our protocol prioritizes recovery of recent messages, and when a message becomes old enough the protocol gives up entirely and marks the message as lost. The advantage of this structure is that the protocol avoids scenarios where processes suffer transient failures and are subsequently unable to catch up with the rest of the system. In traditional gossip protocols, such a situation can cause the other processes' message buffers to fill and the overall system to slow down. Our protocol avoids this behavior by eventually giving up on old messages, and instead emphasizing delivery of recent messages. However, even though messages may eventually be marked as "lost," a probabilistic analysis of the protocol shows that—when properly configured—this loss is unlikely to happen except at failed processes, or with messages sent by processes that failed as they sent them. Section 5 discusses our handling of such cases.

Thus, in our example, if process R had experienced a soft failure and missed messages M0 through M4, it would learn about them in a subsequent anti-entropy message and would request retransmissions in reverse order: M4 first, then M3, and so forth. While pulling these messages from other processes, R would participate normally in new multicasts and new rounds of the anti-entropy algorithm.

The anti-entropy protocol is run by all processes in the system, and proceeds through a sequence of rounds. The length of each round must be larger than the typical round-trip time for an RPC over the communications links used by the protocol; in practice, we used much longer rounds which are a substantial fraction of a second in duration (for example, our experiments start a round every 100ms). Clocks need not be synchronized, but we will initially act as if they were for simplicity of the exposition. At the beginning of each round, every member randomly chooses another member with which it will conduct the anti-entropy protocol and sends it a digest (summary) of its message histories. The message is called a "gossip message." The member that receives this message compares the digest with the messages in its own buffers. If the digest contains messages that this member does not have, then it sends a message back to the original sender to request some messages to be retransmitted. This message is called the "solicitation" and causes the receiver to retransmit some of the messages.
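The round structure can be sketched in a few lines. This is a simplified skeleton under naming of our own (`transport`, `history`, and so on), not the Ensemble or Spinglass code; it shows the digest/solicitation exchange just described, with a fanout of one gossip target per round (per footnote [3] below).

```python
import random

class AntiEntropy:
    def __init__(self, self_id, view, transport):
        self.self_id = self_id
        self.view = view                 # known group members
        self.transport = transport       # assumed to offer send(dest, msg)
        self.history = {}                # (sender, seqno) -> message body
        self.round_no = 0

    def start_round(self):
        """Phase one: gossip a digest of our message history to one
        randomly chosen peer."""
        self.round_no += 1
        peers = [m for m in self.view if m != self.self_id]
        if not peers:
            return
        digest = set(self.history)       # summary: just the message ids
        self.transport.send(random.choice(peers),
                            ("gossip", self.self_id, self.round_no, digest))

    def on_gossip(self, sender, round_no, digest):
        """Compare the digest against our own buffers and solicit copies
        of messages the gossiping process has but we lack."""
        missing = digest - set(self.history)
        if missing:
            self.transport.send(sender,
                                ("solicit", self.self_id, round_no, missing))

    def on_solicit(self, sender, round_no, missing):
        """Phase two: retransmit requested messages (the optimizations in
        this section bound and order this retransmission)."""
        for msg_id in missing:
            if msg_id in self.history:
                self.transport.send(sender,
                                    ("retransmit", msg_id, self.history[msg_id]))
```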

Processes maintain buffers of messages that have been received from other members in the group. Every message is either delivered to the application, or if a message cannot be recovered during the retransmission protocol, an upcall is used to notify the application of missing messages. These events occur in FIFO order from each sender. When messages are received, they are inserted in the appropriate location in a message buffer.

Upon receiving a message, a process tags the message with the round in which the message was received. Any undelivered messages that are now in order are delivered. (In some situations, it might make sense to delay delivery until one or more rounds of gossip have been completed. Doing so would clearly reduce the risk that a very small number of processes deliver the pbcast, but we have not explored the details of such a change.) The process will continue to gossip about the message until a fixed number of rounds after its initial reception, after which the message is garbage-collected. This number of rounds and the number of processes to which a process gossips in each round [3] are parameters to the protocol. The product of these parameters, called the fanout, can be tuned using the theory we develop in Appendix A (summarized in Section 6). If a process has been unable to recover a missing message for long enough to deduce that other processes will have garbage-collected it, it gives up on that message and reports a gap (lost message) to the application layer.
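A compact sketch of the receiver-side bookkeeping described in the last two paragraphs, with hypothetical names and upcall signatures of our own: out-of-order messages are buffered and delivered in per-sender FIFO order, and a message the process has given up recovering is reported as a gap so that delivery can move past it.

```python
class FifoDelivery:
    """Buffer out-of-order messages, deliver in per-sender FIFO order,
    and report unrecoverable messages as gaps via an upcall."""

    def __init__(self, deliver, report_lost):
        self.deliver = deliver            # upcall: next in-order message
        self.report_lost = report_lost    # upcall: message declared lost
        self.next_seq = {}                # sender -> next expected seqno
        self.pending = {}                 # (sender, seqno) -> message

    def on_receive(self, sender, seq, msg):
        self.pending[(sender, seq)] = msg
        self._drain(sender)

    def on_give_up(self, sender, seq):
        # Recovery failed for long enough that peers have garbage-collected
        # the message: report the gap, skip past it, and resume delivery.
        if seq == self.next_seq.get(sender, 0):
            self.report_lost(sender, seq)
            self.next_seq[sender] = seq + 1
            self._drain(sender)

    def _drain(self, sender):
        nxt = self.next_seq.setdefault(sender, 0)
        while (sender, nxt) in self.pending:
            self.deliver(self.pending.pop((sender, nxt)))
            nxt += 1
        self.next_seq[sender] = nxt
```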

There are several optimizations to the anti-entropy protocol that act to limit its costs under failure scenarios. Without these additions, a normal anti-entropy protocol is liable to enter fluctuating communication patterns whereby poorly performing processes or a noisy network can affect healthy processes, by swamping them with retransmission requests. Here, we summarize seven important optimizations. In Section 7 we present experimental evidence that they achieve the desired outcome.

Optimization 1: Soft-Failure Detection. Retransmission requests are only serviced if they are received in the same round for which the original solicitation was sent. If the response to a solicitation takes longer than a round (which is normally more than enough time), then the response is dropped. The failure of a process to respond to a solicitation within a round is an indication that the process or the network is unhealthy, and hence that a retransmission is unlikely to succeed. This also protects against cases where a process responds to many solicitations at once and causes the network to become flooded with redundant retransmissions.

Optimization 2: Round Retransmission Limit. The maximum amount of data (in bytes) that a process will retransmit in one round is also limited. If more data are requested than this, then the process stops sending when it reaches this limit. This prevents processes that have fallen far behind the group from trying to catch up all at once. Instead, the retransmission can be carried out over several rounds with several different processes, spreading the overhead in space and time.

Optimization 3: Cyclic Retransmissions. Processes responding to retransmission requests cycle through their undelivered messages, taking into account the messages that were requested in the previous rounds. If the request from the previous round was successful, but the messages might still be in transit, the response will include different messages, avoiding redundant retransmissions.

[3] In our experiments with the protocol, we find that the average load associated with the protocol is minimized if a process gossips to just one other process in each round, but one can imagine settings for which this would not be the case.


Optimization 4: Most-Recent-First Retransmission. Messages are retransmitted in the order of most recent first. Oldest-first retransmission requests can induce a behavior in which a temporarily faulty process tries to catch up and recover from its problem, but is left permanently lagging behind the rest of the group.
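Optimizations 1, 2, and 4 all act on the retransmission path and can be sketched together. The byte budget value and helper names below are illustrative assumptions, not parameters taken from the paper: stale solicitations are dropped, newest messages go first, and the reply is cut off at a per-round byte limit.

```python
def answer_solicitation(history, solicitation_round, current_round,
                        requested_ids, byte_budget=64 * 1024):
    """Select which solicited messages to retransmit this round.

    history       : dict mapping (sender, seqno) -> message bytes
    requested_ids : message ids the soliciting process is missing
    Returns the list of messages to send, or [] if the solicitation
    is stale (Optimization 1: the responding process or the network
    is suspected unhealthy, so the retransmission is skipped).
    """
    if solicitation_round != current_round:
        return []

    # Optimization 4: most-recent-first, so a lagging process recovers
    # fresh messages before old ones. Here mid = (sender, seqno).
    newest_first = sorted(
        (mid for mid in requested_ids if mid in history),
        key=lambda mid: mid[1], reverse=True)

    # Optimization 2: stop once this round's retransmission budget is
    # spent; the rest can be recovered in later rounds from other peers.
    reply, spent = [], 0
    for mid in newest_first:
        size = len(history[mid])
        if spent + size > byte_budget:
            break
        reply.append((mid, history[mid]))
        spent += size
    return reply
```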

Optimization 5: Independent Numbering of Rounds. It may seem as if processes should advance synchronously through rounds, but the protocol actually allows each process to maintain its own round numbers and to run them asynchronously, which is how our implementation actually works. The insight is that the number of rounds that have elapsed is used to determine when to deliver or garbage-collect a message, but this is entirely a local decision. The round number is also included in a gossip message and any subsequent solicitation to retransmit, but the solicitation can simply copy the round number used by the sender of the gossip message.

Optimization 6: Random Graphs for Scalability. If one assumes that large groups would use IP multicast for the unreliable multicast, the basic protocol presented above is highly scalable except in two dimensions. First, as stated, it would appear that each participating process needs a full membership list for the multicast group, since this information is used in the anti-entropy stages of the protocol. Such an approach implies a potentially high traffic of membership updates to process group members, and the list itself could become large. Second, in a wide-area use of the protocol, anti-entropy will often involve communication over high-latency communication paths. In a very large network the buffering requirements and round-length used in the protocol could then grow as a function of worst-case network latency.

Both problems can be avoided. A WAN is typically structured as a collection of LANs interconnected (redundantly) by TCP tunnels or gateways. In such an architecture, typical participants would only need to know about other processes within the same LAN component; only processes holding TCP endpoints would perform WAN gossip.

Generalizing, we can ask about the behavior of pbcast if each participant only knows about, and gossips to, a subset of the other participants—perhaps, in a network of 10,000 processes, each participant might gossip within a set of 100 others. Research on randomized networks has demonstrated that randomized protocols operate correctly on randomly generated graphs with much less than full connectivity [Feige et al. 1990]. Drawing upon this theory, we can conclude that pbcast should retain its properties when such a subset scheme is employed. Moreover, the subset to which a given member gossips can be picked to minimize latency, thereby bounding the round-trip times and hence round-lengths to a reasonable level. We are currently developing a membership service for our Spinglass implementation of pbcast, which will manage membership on behalf of the system, select these subsets, and inform each pbcast participant of the list of processes to which it should gossip.
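As a rough illustration of such a subset scheme (this is our guess at what a membership service might compute, not the Spinglass design, which was still under development): give each participant a local view consisting mostly of low-latency neighbors plus a few random long-range contacts to keep the gossip graph well connected.

```python
import random

def assign_local_views(members, latency, k_near=90, k_random=10, rng=None):
    """Assign each member a partial view of the group.

    latency : assumed callable (a, b) -> estimated round-trip time.
    Returns a dict: member -> list of gossip targets (its local view),
    mostly nearby peers plus a few random ones for connectivity.
    """
    rng = rng or random.Random(0)
    views = {}
    for m in members:
        others = [p for p in members if p != m]
        near = sorted(others, key=lambda p: latency(m, p))[:k_near]
        remainder = [p for p in others if p not in near]
        far = rng.sample(remainder, min(k_random, len(remainder)))
        views[m] = near + far
    return views
```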


Extended in this manner, the protocol overcomes the scalability concerns just identified, leaving us with a protocol having entirely local costs, much like SRM, RMTP, XTP, or MUSE. A pbcast message can be visualized as a sort of frontier advancing through the network over a spanning tree. Each process first learns of the pbcast either during the initial multicast, or in rounds of gossip that occur over low-latency links between processes and their neighbors. Irrespective of the size of the network, safety and stability would rapidly be reached. Moreover, only the membership service needs the full membership of the multicast group. Typical pbcast participants would know only of the processes to which they gossip, would gossip mostly to neighbors, and the list of gossip destinations would be updated only when that set changes, not on each membership change of the overall group.

Optimization 7: Multicast for Some Retransmissions. In certain situations, our protocol employs multicast to retransmit a message, although we do this rather carefully to avoid triggering the sort of unscalable growth in overhead seen in the SRM protocol. At present, our protocol uses multicast if the same process is solicited twice to retransmit the same message: the probability of this happening is very low unless a large number of processes have dropped the message. Additionally, suppose that we define distance in terms of common IP address prefixes: processes in the same subnet are "close" to one another, and processes on different subnetworks are "remote." Then when a process solicits a (point-to-point) retransmission from a process remote from it, upon receiving that message it immediately remulticasts it using a "regional" setting for the multicast TTL field. The idea is that since Optimization 6 ensures that most gossip will be between processes close to one another, it is unlikely that a retransmission would be needed from a remote source unless the message in question was dropped within the region of the soliciting process. Accordingly, the best strategy is to remulticast that message immediately upon receipt, within the receiver's region.
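The TTL-scoping step maps directly onto the standard sockets API. In the sketch below, the group address, port, TTL value, and /24 prefix length are all assumptions chosen for illustration.

```python
import ipaddress
import socket

REGIONAL_TTL = 4                    # hypothetical "regional" TTL scope
GROUP = ("239.255.0.1", 9000)       # hypothetical multicast group/port

def remulticast_regionally(payload: bytes) -> None:
    """Remulticast a recovered message with a restricted TTL so it
    propagates only within the local region (Optimization 7)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, REGIONAL_TTL)
    sock.sendto(payload, GROUP)
    sock.close()

def same_region(addr_a: str, addr_b: str, prefix_len: int = 24) -> bool:
    """Crude distance test from the text: processes sharing an IP address
    prefix (here an assumed /24) are considered "close"."""
    net = lambda a: ipaddress.ip_network(f"{a}/{prefix_len}", strict=False)
    return net(addr_a) == net(addr_b)

def on_recovered(payload: bytes, source_addr: str, self_addr: str) -> None:
    # A point-to-point retransmission that had to come from a remote
    # region suggests the loss occurred near us: remulticast it locally.
    if not same_region(source_addr, self_addr):
        remulticast_regionally(payload)
```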

5. INTEGRATION WITH ENSEMBLE'S FLOW CONTROL AND STATE TRANSFER TOOLS

5.1 Flow Control

Our model implicitly requires that the rate of pbcast messages be limited. Should this rate be exceeded, the network load would threaten the independent failure and latency assumptions of the model, and the guarantees of the protocol would start to degrade. In normal use, some form of application-level rate control is needed to limit the rate of multicasts. For example, the application might simply be designed to produce multicasts at a constant, predetermined rate, calculated to ensure that the risk of overloading the network is acceptably low.
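A constant-rate sender of this kind is a few lines of code. The sketch below is our own, with a hypothetical `pbcast_send` callback; it paces injection at a fixed rate such as the 200 messages per second figure used in the experiments.

```python
import time

def rate_limited_send(messages, pbcast_send, rate_per_sec=200):
    """Application-level rate control: inject multicasts at a constant,
    predetermined rate so the offered load stays within what the
    protocol was configured to handle."""
    interval = 1.0 / rate_per_sec
    next_due = time.monotonic()
    for msg in messages:
        now = time.monotonic()
        if now < next_due:
            time.sleep(next_due - now)   # wait until the next send slot
        pbcast_send(msg)
        next_due += interval
```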

Pbcast can also be combined with a form of flow control tied to the number of buffered messages active within the protocol itself. In this approach, when a sender presents a new multicast to the communication subsystem, the message would be delayed if the subsystem is currently buffering more than some threshold level of active multicasts, from the same or other sources. As pbcast messages age and are garbage-collected, new multicasts would be admitted.


Ensemble, the multicast framework within which we implemented one of our two versions of pbcast, supports a flow control mechanism that works in this manner. However, for the experiments reported here, we employed application-level rate limitations. We believe that for the class of applications most likely to benefit from a bimodal reliable multicast, the rate of data generation will be predictable, and used to parameterize the protocol. In such cases, the addition of an unpredictable internal flow-control mechanism would reduce the determinism of the protocol, while bringing no real benefits.
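For contrast with application-level rate control, here is a minimal sketch of the buffering-threshold form of flow control described above. It is not Ensemble's mechanism, just an illustration with an assumed threshold value.

```python
import threading

class BufferedFlowControl:
    """Admission gate: block new multicasts while more than `threshold`
    not-yet-garbage-collected messages are buffered in the subsystem."""

    def __init__(self, threshold=128):
        self.active = 0
        self.threshold = threshold
        self.cv = threading.Condition()

    def submit(self, msg, pbcast_send):
        with self.cv:
            while self.active >= self.threshold:
                self.cv.wait()            # delay the sender
            self.active += 1
        pbcast_send(msg)

    def on_garbage_collected(self, msg_id):
        with self.cv:
            self.active -= 1
            self.cv.notify()              # admit a waiting multicast
```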

5.2 Recovery from Delivery Failures

Recall from Figure 1 that in a conventional form of reliable group communication, a single lagging process can impact throughput and latencies throughout the entire group. Our protocol overcomes this phenomenon but suffers from a complementary problem, which is that if a process lags far enough behind the other group members, those processes may garbage-collect their message histories, effectively partitioning the slow process away from the remainder of the group. The slow process will detect this condition when normal communication is restored, but has no opportunity to catch up within the basic protocol. Notice that this problem is experienced by a faulty process, not a healthy one, and hence cannot be addressed simply by adjusting protocol parameters.

We see two responses to this problem. In Spinglass, we are exploring the possibility of varying the amount of buffering used by each pbcast participant. Most processes would have small buffers, but some might have very large buffers—in the limit, some could spool copies of all messages sent in the system. These would then serve as repositories of last resort, from which a recovering process could request the entire sequence of messages which were lost during a transient outage.

When pbcast is used together with Ensemble, a second option arises. Ensemble includes tools for process join and leave, membership tracking, and traditional reliable multicast within a process group. Included with these is a state transfer feature. The mechanism permits a process joining a process group to receive state from any process or set of processes already present in the group. Such a joining process can also offer its own state to the existing members; hence, the protocol supports state merge, although in normal usage we prefer to transfer state from a "primary component" of the partitioned group to a "minority component." A pbcast participant that falls behind could thus use the Ensemble state transfer as a recovery mechanism.

6. GRAPHING THE COMPUTATIONAL RESULTS

In Appendix A, we show how pbcast can be analyzed under the assumptions of our model. This analysis yields a computational model for the protocol, which we used to generate the graphs in Figure 4.


These graphs were produced under the assumption that the initial unreliable multicast failed (only the original sender initially has a copy), that the probability of message loss is 5%, and that the probability that a process will experience a crash failure during a run of the protocol is 0.1%. All of these assumptions are very conservative; hence, these graphs are quite conservative. Recall from Section 4 that the fanout measures the number of processes to which the holder of a multicast will gossip before garbage-collecting the message.

On the upper left is a graph illustrating pbcast's bimodal delivery distribution, which motivates the title of this article. As the General of our little fable recognized, the likelihood that a very small number of processes will receive a multicast is quite low. The likelihood that almost all receive the multicast is very high. And the intermediary outcomes are of vanishingly low probability.

To understand this graph, notice that the y-axis is expressed as a predicate over the state of the system after some amount of time. Intuitively, the graph imagines that we run the protocol for a period of time and then look at the state S achieved by the protocol. Pbcast guarantees that the probability that S is such that almost none or almost all processes have delivered the pbcast is high, and the probability that S is such that about a half of the participants have delivered the pbcast is extremely low.

[Figure 4. Analytical results (unless otherwise indicated, for 50 processes).]


When one considers that the y-axis is on a logarithmic scale, it becomes clear that pbcast is overwhelmingly likely to deliver to almost all processes if the sender remains healthy and connected to the network.
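The qualitative shape of this distribution can be reproduced with a small simulation. The sketch below is a Monte Carlo approximation of a push-gossip epidemic under the stated assumptions (50 processes, 5% message loss, initial multicast failed), not the analytical recursion of Appendix A; the default rounds and fanout are values we assume for illustration. Over many runs, nearly all of the probability mass lands on "almost all delivered," with intermediate outcomes vanishingly rare.

```python
import random
from collections import Counter

def simulate_final_infected(n=50, rounds=10, fanout=1, loss=0.05, rng=random):
    """One run: start with only the sender holding the message (the
    initial multicast failed), then gossip for a fixed number of rounds.
    Returns how many processes ended up with the message."""
    infected = {0}
    for _ in range(rounds):
        newly = set()
        for p in infected:
            for _ in range(fanout):
                q = rng.randrange(n)      # random gossip target
                # gossip reaches q unless the exchange is lost in transit
                if q not in infected and rng.random() > loss:
                    newly.add(q)
        infected |= newly
    return len(infected)

def delivery_distribution(trials=10_000, **kw):
    """Estimate P(k processes deliver) over many independent runs."""
    counts = Counter(simulate_final_infected(**kw) for _ in range(trials))
    return {k: c / trials for k, c in sorted(counts.items())}

if __name__ == "__main__":
    for k, p in delivery_distribution().items():
        print(f"{k:3d} processes infected: {p:.5f}")
```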

When the initial unreliable multicast is successful, the situation is quite different; so much so that we did not produce a graph for this case. Suppose that 5% of these initial messages are not delivered. The initial state will now be one in which 47 processes are already infected, and if our protocol runs for even a single round, it becomes overwhelmingly probable that the pbcast will reach all 50 processes, limited only by process failures. For this section the worst-case outcomes are more relevant; hence, it makes sense to assume that the initial unreliable multicast fails.

The remaining graphs superimpose the behavior of pbcast with respect to two predicates that define exemplary undesired outcomes; in keeping with the idea that these show risk of failure, we produced them under the pessimistic assumption that the initial IP multicast fails. The first predicate is the one our General might have used: it considers a run of the protocol to be a failure (in his case, a gross understatement!) if the multicast reaches more than 10% of the processes in the system but less than 90%. The second predicate is one that arises when pbcast is used in a system that replicates data in a manner having properties similar to those of virtual synchrony, using an algorithm we describe elsewhere [Hayden and Birman 1996]. In this protocol, updates have a two-phase behavior, which is implemented using pbcast. An undesired outcome arises if pbcast delivers to roughly half the processes in the system, and crash failures make it impossible to determine whether or not a majority was reached, forcing the update to abort (roll-back) and be restarted. The idea of these three graphs is to employ our model to explore the likelihood that pbcast could be used successfully in applications with these sorts of definitions of failure. Both predicates are formalized in the appendix.

The applications of pbcast discussed in the introduction could also be reduced to failure predicates. For example, in an air traffic application that uses pbcast to replicate updates to the tracks associated with current flights, the system would typically operate safely unless several updates in a row were lost for the same track. By starting with the controller’s ability to tolerate missing track updates or inconsistency between the data displayed on different consoles, one can compute a predicate encoding the resulting risk threshold. At the level of our model, each pbcast is treated as an independent event; hence, any condition expressed over a run of multicasts can be reexpressed as a condition on an individual outcome.
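As an illustration of this reduction (our example; the numbers are hypothetical, not taken from any real requirements document), a run-level condition such as “k consecutive updates lost for the same track” translates into a bound over individual pbcast outcomes via a simple union bound:

def run_failure_risk(per_pbcast_risk: float, k: int, updates_per_run: int) -> float:
    """Upper-bound the chance that some window of k consecutive
    multicasts all fail, treating each pbcast as an independent
    event (as the model assumes) and union-bounding over windows."""
    windows = max(updates_per_run - k + 1, 0)
    return windows * per_pbcast_risk ** k

# Example: a per-pbcast failure risk of 1e-8 (cf. the fanout graphs),
# a track considered unsafe only if 3 updates in a row are missed,
# over an hour of one-update-per-second traffic:
print(run_failure_risk(1e-8, k=3, updates_per_run=3600))   # about 3.6e-21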

The graph on the upper right in Figure 4 shows how the risk of a “failed” pbcast drops with the size of the system. The lower graphs look at the relation between the expected “fanout” from each participant during the gossip stage of the protocol and the risk that the run will fail in the sense defined earlier. The graphs were based on the parameters used in our experimental work and can be used in setting parameters such as the fanout and the number of rounds, so that pbcast will achieve a desired reliability level, or to explore the likely behavior of pbcast with a particular parameterization in a setting of interest. Notice also that predicate I yields a much lower reliability than predicate II. This should not surprise us: predicate I counts a pbcast as faulty if, for example, 30% of the processes in the system fail to receive it. Predicate II would treat such an outcome as a success, since 70% is still a clear majority. Note also that the graph on the lower right compares the fanout required to obtain 1E-8 reliability using predicate I with that required to obtain 1E-12 for predicate II. This was to get the curves onto the same scale; in practice, 1E-8 is probably adequate for the applications discussed later. If the IP multicast were successful, the risks of failure in all three graphs would be reduced by several orders of magnitude.

[Figure 5 here: # susceptible processes versus # gossip rounds, starting from 1000 susceptible processes (left) and from 100 (right).]

Fig. 5. Number of susceptible processes versus number of gossip rounds when the initial multicast fails (left) and when it reaches 90% of processes (right; note scale). Both runs assume 1000 processes.

Since throughput stability lies at the heart of our work, we also set out to analyze the expected variance in throughput rates using formal methods. We first considered the expected situation for a single pbcast where the initial unreliable multicast fails. Our approach was to use the analysis to obtain a series of predictions showing how the number of processes which have yet to receive a copy decreases over time (Figure 5), and then to use this data to compute the expected number of rounds before a selected correct participant receives a multicast (Figure 6). The resulting curves peak at roughly the log of the group size and have variance that also grows with the log of the group size.

Next, we considered the situation when the initial IP multicast or tree-based multicast is successful. In this case, a typical process receives a multicast after it has been “relayed” through some number of intermediary processes, and the length of the relay chain will grow with $\log_b(N)$, where the base b is the average branching factor of the forwarding tree used by the initial multicast (2 in the case of the tree-based scheme used in our experimental work, but potentially much larger for IP multicast). Suppose that this chain is of length c. Each of the relaying processes can be viewed as an independent “filter” that delays messages by some mean amount with some associated variance. If we treat this as a normal distribution, the transit through the entire tree will also have a normal distribution, with its mean equal to c times the mean forwarding delay, and standard deviation equal to $\sqrt{c}\,\sigma$, where $\sigma$ is the standard deviation of the forwarding delay distribution.⁴

[Figure 6 here: probability (%) of receipt versus # rounds, for N = 16, N = 128, and N = 1024.]

Fig. 6. The probability for a correct process to receive a pbcast in a particular round for groups of various sizes. The distributions are essentially normal, with means centered at $\log(N)$.

From this information, we can make predictions about the average throughput and the variance in throughput that might be observed over various time scales. Consider a period of time during which a series of pbcast messages are injected into the system, and assume that the messages are independent of one another (that is, that the rate is sufficiently low so that there are no interference effects to worry about). Based on the analysis above, the expected standard deviation in the time to receive the sequence will be $\sqrt{2c}\,\sigma$. Thus we would expect the throughput variance to grow slowly, as the square root of the log of the system size.
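To illustrate the growth rate (our numbers, chosen purely for illustration, assuming i.i.d. normal per-hop delays with standard deviation $\sigma$):

\begin{align*}
c &= \log_2 1024 = 10
  &&\Rightarrow\quad \sqrt{2c}\,\sigma = \sqrt{20}\,\sigma \approx 4.5\,\sigma
  \quad (N = 1024)\\
c &= \log_2 4096 = 12
  &&\Rightarrow\quad \sqrt{2c}\,\sigma = \sqrt{24}\,\sigma \approx 4.9\,\sigma
  \quad (N = 4096)
\end{align*}

Quadrupling the group thus widens the spread by only about 10%, which is the slow, square-root-of-log growth claimed above.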

But now we face a problem: the two cases have very different expected delivery latency and variance. If the initial multicast has erratic average reliability, throughput will fluctuate between the two modes.

Our experimental data, reported in Section 7, is for a setting in which the initial tree-based multicast is quite reliable, and we indeed observe very stable throughput with variance that grows slowly and in line with predictions. But suppose that we were to use pbcast in dedicated Internet settings or even over the public Internet. In these cases, the unreliable multicast might actually fail some significant percentage of the time. Compensating for this is optimization 7, which was not treated in our theoretical analysis, but in practice would cause processes to remulticast messages rapidly in such a situation. Experimentally, optimization 7 does seem to sharpen the delivery distribution dramatically.⁵

⁴ This is because the distributions are identical normal ones with standard deviation σ, and the standard deviation of a sum of independent variables grows as the square root of the sum of squares.

⁵ One way to visualize this is to think of the network as the surface of a soccer ball, having regions with high connectivity (surface patches) connected by tunnels (borders between the patches). Pbcast operates like a firework, bursting through the network to infect each region, then bursting again regionally to infect most processes in each patch, and finally fading away with a few last sparks as the remaining processes are infected by unicast gossip.

Under the assumption that the delivery distribution will be reasonably tight, but still have two modes, one option would be to introduce a buffering delay to “hide” the resulting variations in throughput, using the experimental and analytic results to predict the amount of buffering needed. For example, suppose that we assume clock synchronization and that we delay each multicast, delivering it k times the typical round-length after the time at which it was sent. Here, k could be derived from Figure 6 so that 99% of all messages will have been received in this amount of time (14 to 16 gossip rounds in the case of N = 128, for example, or about 1.5 seconds if rounds last for 100ms). At the cost of buffering the delayed messages for this amount of time, we could now smooth almost all variance in throughput rate. Such a method lies at the core of the well-known Δ-T atomic multicast [Cristian et al. 1985]. A version of pbcast that incorporates such a delay (although for a different reason) is discussed in Hayden and Birman [1996] and Birman [1997].
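A minimal sketch of such a delay buffer (ours; the names k and round_len are hypothetical parameters, and sender/receiver clock synchronization is assumed, as in the text):

import heapq
import time
from typing import List, Optional, Tuple

class SmoothingBuffer:
    """Hold each received message until k round-lengths after its send
    timestamp, then release in send order, hiding delivery jitter."""

    def __init__(self, k: int = 16, round_len: float = 0.100):
        self.delay = k * round_len     # e.g., 16 rounds of 100ms = 1.6s
        self._heap: List[Tuple[float, bytes]] = []

    def on_receive(self, send_time: float, msg: bytes) -> None:
        heapq.heappush(self._heap, (send_time, msg))

    def pop_ready(self, now: Optional[float] = None) -> List[bytes]:
        now = time.time() if now is None else now
        ready = []
        while self._heap and self._heap[0][0] + self.delay <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready

The choice of k trades latency for smoothness: a larger k hides more of the bimodal tail at the cost of a longer fixed delivery delay.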

But worst-case delay may exaggerate the actual need for buffering. Our experimental work, which reflects the impact of optimization 7, suggests that even a very small amount of buffering could have a dramatic impact. This is in contrast to the situation in Cristian et al. [1985], where a deterministic worst-case analysis leads to the somewhat pessimistic conclusion that very substantial amounts of buffering may be needed, and very long delays before delivery. In our setting, the goals are probabilistic, and the analysis can focus on the expected situation, not the worst case.

To summarize, formal analysis gives us powerful tools and significant predictive options. The tools permit pbcast to be parameterized for a particular setting, and they show us how to bridge the gap between the pbcast primitive itself and the application-level reliability objectives. The predictions concern the distribution of expected outcomes for the protocol and the degree to which throughput will have the stability properties we seek. In this second respect, we find that pbcast, used in settings where the initial unreliable multicast is likely to be successful (reaching most destinations most of the time), would indeed exhibit stable and steady throughput in a scalable manner. Confirming this, our experiments with Spinglass, reported in the next section, show that even with very small buffers and without any artificial delay at all, the received data rate remains steady when we alternate between a mode in which most multicasts are successful and one in which multicasts reach very few destinations.

7. PERFORMANCE AND SCALABILITY OF AN IMPLEMENTATION

In this section, we present experimental results concerning the performance, throughput stability, and scalability of pbcast using runs of the actual protocol. We include several types of experimental work. We start with a study of the Ensemble virtual synchrony protocols, which we run side by side with the Ensemble implementation of pbcast. The Ensemble protocols we selected perform extremely well and have been heavily tuned; hence, we believe it fair to characterize them as “typical” of protocols in this class. Obviously, however, one must use care in extrapolating these results to other implementations of virtually synchronous multicast. The experiments reported here were conducted using an SP2 parallel computer, which we treated as a network of workstations. The idea was to start by isolating our software (the real software, which can run without changes on a normal Internet LAN or WAN) on a very clean network and then to inject noise.

Next, we compare pbcast with SRM, using the NS-2 simulator [Feige et al. 1990], and the two SRM implementations available for that environment. We used NS-2 to construct a simulation of pbcast, and then examined pbcast and SRM side by side under various network topologies and conditions, using the SRM parameter settings recommended by the designers of that protocol. The simulation allowed us to scale both protocols into very large networks.


Finally, we looked at the Spinglass implementation of pbcast on a network of workstations running over a 10Mbit ethernet in a setting where hardware multicast is available. This last study was less ambitious, because the number of machines available to us was small, but still provides evidence that what we see in simulation and on the SP2 actually does predict behavior of the protocol in more realistic networks.

Accordingly, we start by looking at pbcast next to virtual synchrony on network configurations of various sizes, running on the SP2. We emulate network load by randomly dropping or delaying packets, and emulate ill-behaved applications and overloaded computers by forcing participating processes to sleep with varied probabilities. With this approach we studied the behavior of groups containing as many as 128 processes.
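For concreteness, the perturbation can be emulated roughly as follows (our reconstruction; the 100ms interval matches the text, while the socket-based receive loop is an assumption about the harness):

import random
import socket
import time

def perturbed_receive_loop(sock: socket.socket, handle,
                           sleep_prob: float = 0.25,
                           interval: float = 0.100) -> None:
    """Emulate an ill-behaved group member: in each 100ms interval,
    sleep (ignoring all traffic) with probability sleep_prob;
    otherwise, receive and process messages for the interval."""
    sock.settimeout(0.01)
    while True:
        if random.random() < sleep_prob:
            time.sleep(interval)            # perturbed: do nothing
            continue
        deadline = time.time() + interval   # healthy: serve traffic
        while time.time() < deadline:
            try:
                handle(sock.recv(65536))
            except socket.timeout:
                pass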

Figure 7 shows the interarrival message spacing for a traditional virtual synchrony protocol, running in Ensemble,⁶ side by side with the Ensemble implementation of pbcast. These were produced in groups of eight processes in which one process was perturbed by forcing it to sleep during 100ms intervals with the probability shown. The data rate was 75 7KB multicasts per second: relatively light for the Ensemble protocol (which can reach some 300 multicasts per second in this configuration), and well below the limit for pbcast (about 250 per second). Both graphs were produced on the SP2.

The interarrival times for the traditional Ensemble protocols spread with even modest perturbation, reflecting bursty delivery. Pbcast maintains steady throughput even at high perturbation rates.

The first figure in this article, Figure 1, illustrated the same problem at various scales. In the experiment used to produce that figure, we measured the throughput in traditional Ensemble virtual synchrony groups of various sizes as we perturbed a single member. We see clearly that, although Ensemble can sustain very high rates of throughput (200 7KB messages per second is close to the limit for the SP2 used in this manner, since the machine lacks hardware multicast), as the group becomes larger it also becomes more and more sensitive to perturbation. In Figure 8, we examined the same phenomenon in more detail for a small group of eight processes. Interestingly, even the perturbed process receives a higher rate of messages with pbcast.

⁶ Although Ensemble supports a scalable protocol stack, for our experiments that stack was found to behave identically to the normal virtual synchrony stack. Accordingly, the data reproduced here are for a normal Ensemble stack, providing FIFO ordering and virtual synchrony.

Figures 1, 7, and 8 are not intended as an attack upon virtual synchrony, since the model has typically been used in smaller groups of computers, and with applications that generate data in a more bursty manner, rarely maintaining sustained, high data rates [Birman 1999]. Under these less extreme conditions, the protocols work well and are very stable. The problems cited here arise from a combination of factors: large scale, high and sustained data rates, and a type of perturbation designed to disrupt throughput without triggering the failure detector.

Figure 9 was derived from the same experiment using pbcast; now, because throughput was so steady, we included error bars. These show that, as we scale a process group, throughput can be maintained even if we perturb members, but the variance (computed over 500ms intervals) grows. On the bottom right is a graph of pbcast throughput variance as a function of group size. Although the scale of our experiments was inadequate to test the log-growth predictions of Section 6, the data at least seem consistent with those predictions.

[Figure 7 here: two histograms of probability of occurrence versus inter-arrival spacing (sec), one for Ensemble’s FIFO virtual synchrony protocol and one for pbcast, each at sleep probabilities .05 and .45.]

Fig. 7. Histograms of the interarrival spacing of multicasts in an 8-process group when using a traditional virtual synchrony protocol (left) and pbcast (right), at 75 8KB messages per second. The tighter distribution of pbcast supports our throughput stability claims.


[Figure 8 here: low-bandwidth and high-bandwidth comparisons of average throughput versus perturb rate, for the traditional protocol and pbcast, measured at unperturbed and perturbed hosts.]

Fig. 8. Ensemble (“traditional”) and pbcast, side by side, in an experiment similar to the one used to produce Figure 1. For a group of eight processes, we perturbed one and looked at the delivery rate at a healthy group member and at the perturbed process, at 100 messages per second and 150 per second. With the virtual synchrony protocol, data rates to the healthy and perturbed process are identical. With pbcast, the perturbed process starts to drop messages (signified by a lower data rate than the injection rate), but the healthy processes are not affected.

In Figure 10 we looked at the consequences of injecting noise into the system. Here, systemwide packet loss rates were emulated by causing the SP2 to randomly drop the designated percentage of messages. As the packet loss rate grows to exceed 10% of all packets, pbcast becomes lossy at the highest message rate we tested (200 per second); the protocol remains reliable even at a 20% packet loss rate when we run it at only 100 messages per second. Increasing the fanout did not help at the high packet injection rate, apparently because we are close to the bandwidth limits of the SP2 interconnect.

Figure 10 thus illustrates the dark side of our protocol. As we see here, with a mixture of high data bandwidths and high loss rates, pbcast is quite capable of reporting gaps to healthy processes. This can be understood as a “feature” of the protocol; presumably, if we computed the bimodal curve for this case, it would be considerably less “sharp” than Figure 4 suggests. In general, if the gossip fanout is held fixed, the expected reliability of pbcast drops as the network data loss rate rises or if the network becomes saturated. In the same situation, a virtual synchrony protocol would refuse to accept new multicasts. With this one exception, our experiments provoked no data loss at all for healthy pbcast receivers.

[Figure 9 here: mean and standard deviation of pbcast throughput (msgs/sec) versus perturb rate for 16-, 96-, and 128-member groups, plus standard deviation of pbcast throughput versus process group size.]

Fig. 9. Pbcast throughput is extremely stable under the same conditions that provoke degraded throughput for traditional Ensemble protocols, but variance does grow as a function of group size. For these experiments 25% of group members were perturbed, and throughput was instrumented 100 messages at a time. Behavior remains the same as the perturbation rate is increased to 1.0, but for clarity of the graphs, we show only the interval [0, 0.5]. Although our experiments have not scaled up sufficiently to permit very general conclusions to be drawn, the variance in throughput is clearly small compared to the throughput rate and is growing slowly.

Figure 11 shows the background overhead associated with these sorts of tests. Here we see that as the perturbation rate rises, the overhead also rises: for example, in a 16-member group with 25% of processes perturbed 25% of the time, 8% of messages must be retransmitted by a typical participant; this rises to 22% in a 128-member group. Although our analysis shows that overhead is bounded, it would seem that the tests undertaken here did not push to the limits until the perturbation rate was very high.

[Figure 10 here: average throughput of receivers versus system-wide drop rate, at high and low bandwidth, for groups of 8, 32, 64, and 96 processes.]

Fig. 10. Impact of packet loss on pbcast reliability. At high data rates (200 messages per second) noise triggers packet loss in large groups; at lower rates even significant noise can be tolerated.

[Figure 11 here: retransmitted messages (%) versus perturb rate, for 8, 16, 64, and 128 nodes.]

Fig. 11. The number of retransmission solicitations received by a healthy process as a function of group size and perturbation rate.

Next, we turned to a simulation, performed using NS-2. In the interest of brevity, we include just a small amount of the data we obtained through simulation; elsewhere, we discuss our simulation findings in much more detail [Ozkasap et al. 1999]. Figure 12 shows data collected using a network structured as a four-level balanced tree, within which we studied link utilization for pbcast with and without optimization 7 (we treated this separately because our theoretical results did not consider this optimization) and for the two versions of SRM available to us, the adaptive and nonadaptive protocols, with parameters configured as recommended by the developers. The two graphs show utilization for a link out of and into a sender. The group size is the same as the network size, and there is a single sender generating 100 messages of 2¹⁰ bytes per second.

What we see here is that as the group grows larger (experiencing a large number of dropped packets, since all links are lossy), both protocols place growing load on the network links. Much of the pbcast traffic consists of unicasts (gossip and retransmissions), while the SRM costs are all associated with multicast and rise faster than those for pbcast.

Figure 13 looks specifically at overheads associated with the two protocols, measuring the rate of retransmission requests and copies of messages received by a typical participant when 100 messages per second are transmitted in a network with 0.1% packet loss on each link. The findings are similar: SRM overheads grow much more rapidly than pbcast overheads as we scale the system up, at least for this data loss pattern. Readers interested in more detail are referred to Ozkasap et al. [1999].

Finally, we include some preliminary data from the Spinglass implementation of pbcast, which runs on local-area networks. These graphs, shown in Figure 14, explore the impact of optimizations 2 and 7. Recall that optimization 2 introduced a limit on the amount of data pbcast will retransmit in any single round, while optimization 7 involves the selective use of multicast when retransmitting data. To create these graphs, we configured a group of 35 processes with a single sender, transmitting 100 1KB messages per second on a 10Mbit LAN. For the top two graphs, every 20 seconds, we triggered a burst of packet loss by arranging that 30% of the processes will simultaneously discard 50 consecutive messages; we then graphed the impact on throughput at a healthy process. Recall that our throughput analysis has trouble with such cases, because they impact the overall throughput curve of the protocol by increasing the expected mean latency.

What we see here is that without optimization 2 (top left), the perturbation causes a big fluctuation in throughput. But with the optimization, in this case limiting each process to retransmit a maximum of 10KB per 100ms gossip round, throughput is fairly steady even when packet loss occurs. Obviously, this optimization reflects a trade-off, since a perturbed process will have more trouble catching up, but the benefit is much more stable system-wide throughput.
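A minimal sketch of such a per-round retransmission budget (ours; the 10KB-per-100ms-round figure comes from the text, while the interface is assumed):

class RetransmitBudget:
    """Cap the bytes a process retransmits in one gossip round
    (optimization 2), e.g., at most 10KB per 100ms round."""

    def __init__(self, limit_bytes: int = 10 * 1024):
        self.limit = limit_bytes
        self.spent = 0

    def new_round(self) -> None:
        self.spent = 0                     # the budget refreshes each round

    def try_send(self, msg: bytes, send) -> bool:
        if self.spent + len(msg) > self.limit:
            return False                   # defer to a later round
        self.spent += len(msg)
        send(msg)
        return True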

The lower graphs examine the case where an outage causes the initial multicast to fail, along the lines of the conservative analysis presented in Section 6 of this article. The emulation operates by intercepting 10 consecutive multicasts and, in each case, allowing the message to reach just a single randomly selected destination (hence, two processes are initially infected; we comment, however, that the graphs look almost identical if the initial multicast is entirely discarded, so that it initially infects only the sender). Here, even with optimization 2, throughput is dramatically impacted each time the outage occurs. With optimization 7, however, Spinglass remulticasts the affected messages, and throughput is again very smooth. We used a 256KB pbcast buffer for these tests, but in fact quite a bit less memory was actively used. Since this smooth delivery was obtained even without delaying received messages (except to put them in FIFO order), the data supports our contention that at most a very brief delay is needed to ensure an extremely smooth delivery rate when optimization 7 is in use.

[Figure 12 here: link utilization on an outgoing link from the sender and on an incoming link to the sender, versus group size, for Pbcast, Pbcast-IPMC, SRM, and Adaptive SRM.]

Fig. 12. SRM and pbcast. Here we compare link-utilization levels; small numbers are better. With constant noise, both protocols exhibit some growth in overheads, but the scalability of pbcast is better. Here, each link had an independent packet-loss probability of 0.1%.

The steadiness evident in these Spinglass performance graphs is in part due to the data collection period we used. These graphs show average throughput during one-second intervals. Throughput would be less steady for shorter intervals, and steadier for larger ones. An application designer, knowing the degree to which the application is sensitive to throughput variations, would translate this to a throughput stability goal. Using the theory and experimental data, one can then tune⁷ pbcast to match the desired behavior.

⁷ The relevant parameter is the length of the gossip round. With shorter rounds the overhead rises, but the protocol more rapidly discovers and retransmits lost packets.

[Figure 13 here: requests/sec received and repairs/sec received versus group size for SRM, adaptive SRM, Pbcast, and Pbcast-IPMC, on the tree topology (top) and the star topology (bottom).]

Fig. 13. Comparison of the rate of overhead messages received, per second, by typical members of a process group when using SRM and pbcast to send 100 messages per second with a 0.1% message-loss rate on each link. Here, we look at two topologies: the same balanced four-level tree as in Figure 12 and a star topology. In some situations, SRM can be provoked into sending multiple retransmissions (repair messages) for each request; pbcast generates lower overheads.

8. PROGRAMMING WITH PROBABILISTIC COMMUNICATION TOOLS

Although a probabilistic protocol can be used like other types of reliable group communication and multicast tools, the weaker nature of the guarantees provided has important application-level implications. For example, Ensemble’s virtually synchronous multicast protocols guarantee that all nonfaulty members of a process group will receive any multicast sent to that group, even if this requires delaying the entire group while waiting for a balky process to catch up. In contrast, probabilistic protocols could violate traditional atomicity guarantees. The likelihood of such an event is known to be low if the network is behaving itself and if the soft-failure limits are appropriate ones, but a transient problem could certainly trigger these sorts of problems.

[Figure 14 here: throughput (# msgs per second) over 100-second runs, without and with the round retransmission limit (top), and without and with multicast retransmission (bottom).]

Fig. 14. Representative data for Spinglass, used in a group of 35 members on a 10Mbit local area network.

These considerations mean that if data are replicated using our protocol, the application should be one that is insensitive to small inconsistencies, such as the following:

—Applications that send media, such as radio, television, or teleconferencing data over the internet. The quality predictions provided by pbcast can be used to tune the application for a minimal rate of dropouts. When using conventional multicast mechanisms, such applications must be tuned under very pessimistic assumptions.

—In a stock market or equity-trading environment [Piantoni and Stancescu 1997], actively traded securities are quoted repeatedly. The infrequent loss of a quote would not normally pose a problem as long as the events are rare enough and randomly distributed over messages generated within the system, a property pbcast can guarantee.

—In an air traffic control setting (such as in PHIDIAS), many forms of data (such as the periodic updates to radar images and flight tracks⁸) age rapidly and are repeatedly refreshed. Dropping updates of these sorts infrequently would not create a safety threat. It is appealing to use a scalable reliable protocol in this setting, yet one needs to demonstrate to the customer that doing so does not compromise safety. The ability to selectively send the more time-critical but less safety-critical information down a probabilistic protocol stack that guarantees stable throughput and latency would be very desirable. In this setting, there are also problems for which stronger guarantees of a virtually synchronous nature are needed. For example, the PHIDIAS system replicates flight plan updates within small clusters of 3-5 workstations, using state machine replication. Each event relevant to the shared state is reliably multicast to the cluster participants, which superimpose a terse representation of the flight plan on a background of radar image and track data. The rate of updates to the foreground data (the flight tracks) may be as low as one or two events per second. A virtually synchronous multicast is far more appropriate for this second class of uses [Birman 1999].

—In a health care setting, many forms of patient telemetry are refreshed frequently on displays close to the bed, at the nursing station, in the physician’s office, etc. Data of this sort can be transmitted using pbcast. On the one hand, the underlying signal frequently contains noise, so infrequent data loss is intrinsically tolerable; on the other, the hospital seeks to make it likely that alarms will be triggered promptly, and that health care workers will use fresh data when making decisions. In contrast, a medication change order would probably be replicated using a protocol with end-to-end guarantees. The doctor’s computer (at one end of the dosage-changing operation) needs the guarantee that the systems displaying medication orders (at the other end) will reflect the changed dosage. Otherwise the doctor should be warned.

⁸ A flight track plots the observed position and trajectory of a flight, as measured by radar and telemetry. The flight plan is a record of the pilot’s intentions and the instructions given by the controller.

Each of these examples is best viewed as a superposition of two (or more) uses of process groups. Notice, however, that the different uses are independent: virtual synchrony is not used to overcome the limitations of pbcast. Rather, a pbcast-based application (such as the one used to update the radar images on the background of a controller’s screen) coexists with a virtually synchronous one (the application used to keep track of flight plans and instructions which the controller has issued to each flight). The application decomposes cleanly into two applications, one of which is solved with pbcast, and the other with the traditional form of reliable multicast.

Traditional forms of reliable multicast can and should be used where individual data items have critical significance for the correctness of the application. Examples include security keys employed for access to a stock exchange system, flight plan data replicated at the databases associated with multiple air traffic control centers, or the medication dosage instructions from the health care example. But as just seen, other kinds of data may be well matched to the pbcast properties. Interestingly, in the above examples, frequent message traffic would often have the properties needed for pbcast to be used safely, while infrequent traffic would be typical for objects such as medical records, which are updated rarely, by hand. Thus pbcast would relieve the more reliable protocols of the sort of load that they have problems sustaining in a steady manner. These examples are representative of a class of systems with mixed reliability requirements.

A second way to program with pbcast is to develop algorithms that make explicit use of the probabilistic reliability distribution of the protocol, as was done in the data replication algorithm mentioned in Section 6 [Hayden and Birman 1996]. One can imagine algorithms that use pbcast to replicate data with probabilistic reliability, and then employ decision-theoretic methods to overcome this uncertainty when basing decisions on the replicated data. For example, suppose that pbcast was used to replicate all forms of air traffic control data, an idea which might be worth pursuing, since the protocols are lightweight, easy to analyze, and very predictable. The quality of each flight plan will now be probabilistic. Under what conditions is it safe to make a safety-critical decision, and under what conditions should we gossip longer (to sharpen the quality of the decision)? In the future, we hope to explore these issues more fully.

9. COMPARISON WITH PRIOR WORK

As noted earlier, our protocol builds upon a considerable body of prior work. Among this, the strongest connections are to the epidemic algorithms studied by Demers et al. in the Xerox work on data replication. However, the Xerox work looked at systems under light load, and did not develop the idea of probabilistic reliability as a property one might present to the application developer. Our work extends the Xerox work by considering runs of the protocol, and by using IP-multicast. In addition to the work reported here, our group at Cornell also explored other uses of gossip, such as gossip-based membership tracking [van Renesse et al. 1996] and gossip-based stability detection [Guo 1998].

Our protocol can also be seen as a “soft” real-time protocol, with connections to such work as the Δ-T protocol developed by Cristian et al. [1985], and Baldoni et al.’s Δ-causal protocol [Baldoni et al. 1996a; 1996b]. None of this prior work investigated the issue of steady load and steady data delivery during failures, nor does the prior work scale particularly well. For example, the Δ-T protocol involves delaying messages for a period of time proportional to the worst-case delay in the system and to estimates of the numbers of messages that might be lost and processes that might crash in a worst-case failure pattern. In the environments of interest to us, these delays would be enormous and would rise without limit as a function of system size. Similar concerns could be expressed with regard to the Δ-causal protocol, which guarantees causal order for delivered messages while discarding any that are excessively delayed. It may be possible to extend these protocols into ones with steady throughput and good scalability, but additional work would be needed.

10. CONCLUSION

Although many reliable multicast protocols have been developed, reliability can be defined in more than one way, and the corresponding tools match different classes of applications. Reliable protocols that guarantee delivery can be expensive, and may lack the stable throughput needed in soft real-time applications, where data are produced very regularly and where delivery must keep up. Best-effort delivery is inexpensive and scalable, but lacks end-to-end guarantees that may be important when developing mission-critical applications. We see these two observations as representing the core of a debate about the virtues of reliable multicast primitives in building distributed systems.

In this article, we introduced a new region in the spectrum, one that can be understood as falling between the two previous endpoints. Specifically, we showed that a multicast protocol with bimodal delivery guarantees can be built in many realistic network environments (although not all of them, at least for the present) and that the protocol is also scalable and gives stable throughput. We believe that this new design point responds to important requirements not adequately addressed by any previous option in the reliable multicast protocol space.

Epilogue

As the military guard led him away in shackles, the Baron suddenly turned. “General,” he snarled, “When the Emperor learns of your recklessness, you’ll join me in his dungeons. It is impossible to reliably coordinate an attack under these conditions.” “Not so,” replied the General. “The bimodal guarantee was entirely sufficient.”

APPENDIX

A. FORMAL ANALYSIS OF THE PROTOCOL

In this appendix, we provide an analysis of the pbcast protocol. The analysis is true to the protocol implementation, except with respect to three simplifications. Note that the experimental results suggest that the actual protocol behaves according to predictions even in environments which deviate from our assumptions: in effect, the model is surprisingly robust.

The first of these concerns the initial unreliable multicast. When the process that initiates a pbcast does not crash and remains connected to the network, and the initial multicast is successful, the protocol provides very strong delivery guarantees, because the state of the system after the multicast involves widespread knowledge of the message. If this initial multicast fails, however, one could be faced with a pbcast run in which there is just a single process with a copy of the message at the outset. Below, we focus on this conservative assumption: only the initiator initially has a copy of the message. However, we point out the step at which a more realistic assumption could have been made.

A second simplification relates to the model. In the protocol as developed above, each process receives a message and then gossips about that message in subsequent rounds of the protocol. But recall that these rounds are asynchronous and that message loss is independent for each message send event. Accordingly, our protocol is equivalent to one in which a process gossips, all at once, to randomly selected processes in the first round after it hears of a message and then ceases to gossip about that message (such a solution might not be as scalable because load would be more bursty, but this does not enter into the analysis that follows). This transformation simplifies the analysis, and we employ it below.

Finally, the analysis omits solicitations and retransmissions, collapsing these into the single “gossip” message. As will become clear shortly, this simplification is justifiable for the purposes of our analysis, although there are certainly questions one could ask about pbcast for which it would not be appropriate.

In what follows, R is the number of rounds during which the protocol runs; P is the set of processes in the system; and β·|P| is the expected fanout for gossip. The first round (like all others) is gossip based. Using this notation, the abstract protocol that we analyze is illustrated in Figure 15. We give pseudocode in Appendix B; there, the additional parameter G is used to denote the number of rounds after which a participant should garbage-collect a message.

Before undertaking the analysis, we should also comment briefly as to the nature of the guarantees provided by the protocol. With a traditional reliable multicast protocol, any run has an all-or-nothing outcome: either all correct destinations receive a copy of a multicast, or none do so. This section will demonstrate that pbcast has a bimodal delivery distribution: with very high probability, all, or almost all, correct destinations receive a copy. With rather low probability, a small number of processes (some or all of which may be faulty) receive a copy. And the probability of intermediate outcomes, for example in which half the processes receive a copy, is so small as to be negligible.⁹

⁹ Notice, however, that when using the protocol, these extremely unlikely outcomes must still be recognized as possibilities. The examples cited in the body of the article share the property that, for application-specific reasons, such outcomes could be tolerated if they are sufficiently unlikely. An application that requires stronger guarantees would need to use a reliable multicast protocol such as Ensemble’s virtually synchronous multicast, tuning it carefully to ensure steady throughput.

A.1 System Model

The system model in which we analyze pbcast is a static set of processes communicating synchronously over a fully connected, point-to-point network. The processes have unique, totally ordered identifiers and can toss weighted, independent random coins. Runs of the system proceed in a sequence of rounds in which messages sent in the current round are delivered in the next. There are two types of failures, both probabilistic in nature. The first are process failures: there is an independent, per-process probability of at most τ that a process crashes during a run. The second are message omission failures: there is an independent, per-message probability of at most ε that a message between nonfaulty processes is lost in the network. Message failure events and process failure events are mutually independent. There are no malicious faults, spurious messages, or corruption of messages. We expect that both ε and τ are small probabilities. (For example, the values used to compute Figure 4 are ε = 0.05 and τ = 0.001.)

(* State kept per pbcast: have I received a *)
(* message regarding this pbcast yet?       *)
let received_already = false

(* Initiate a pbcast. *)
to pbcast(msg):
    deliver_and_gossip(msg, R)

(* Handle message receipt. *)
on receive Gossip(msg, round):
    deliver_and_gossip(msg, round)

(* Auxiliary function. *)
to deliver_and_gossip(msg, round):
    (* Do nothing if already received it. *)
    if received_already then return

    (* Mark the message as seen and deliver it. *)
    received_already := true
    deliver(msg)

    (* If this is the last round, don't gossip. *)
    if round = 0 then return

    (* Otherwise, gossip to a random subset of the processes. *)
    let S be a randomly selected subset of P, |S| = |P|*β
    foreach p in S: sendto p Gossip(msg, round-1)

Fig. 15. Abstract version of pbcast used for the analysis. The abstract version considers just a single multicast message and differs from the protocol as described earlier in ways that simplify the discussion without changing the analytic results. Pseudocode for the true protocol appears in Appendix B.
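For readers who want to experiment with the abstract protocol, the following sketch (ours; parameter values are illustrative, and process crashes are omitted) simulates single runs under the gossip-once transformation described above and samples the delivery count:

import random

def simulate_pbcast(N=50, rounds=10, beta=0.1, eps=0.05, rng=None):
    """One run of the abstract pbcast of Figure 15. Process 0 starts
    infected (the initial unreliable multicast is assumed to have
    reached no one else); each newly infected process gossips once,
    to each other process independently with probability beta, and
    each gossip message is lost independently with probability eps.
    Returns the number of processes that delivered the message."""
    rng = rng or random.Random()
    infected = {0}
    gossipers = {0}
    for _ in range(rounds):
        fresh = set()
        for p in gossipers:
            for q in range(N):
                if q != p and rng.random() < beta and rng.random() >= eps:
                    if q not in infected:
                        fresh.add(q)
        infected |= fresh
        gossipers = fresh      # only newly infected processes gossip next
    return len(infected)

# Monte Carlo estimate of the two modes:
counts = [simulate_pbcast() for _ in range(1000)]
print(sum(c <= 5 for c in counts) / 1000,      # almost-none mode
      sum(c >= 45 for c in counts) / 1000)     # almost-all mode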


The impact of the failure model above can be described in terms of an adversary attempting to cause a protocol to fail by manipulating the system within the bounds of the model. Such an adversary has these capabilities and restrictions:

—An adversary cannot use knowledge of future probabilistic outcomes, interfere with random coin tosses made by processes, cause correlated (nonindependent) failures to occur, or do anything not enumerated below.

—The adversary has complete knowledge of the history of the current run.

—At the beginning of a run of the protocol, it has the ability to individually set process failure rates, within the bounds $[0..\tau]$.

—For messages, it has the ability to individually set message failure probabilities within the bounds of $[0..\varepsilon]$, and it can arbitrarily select the “point” at which messages are lost.

Note that, although probabilities may be manipulated by the adversary, it may only make the system “more reliable” than the bounds ε and τ.

Over this system model, we layer protocols with strong probabilistic convergence properties. The probabilistic analysis of these properties is, necessarily, only valid in runs of the protocol in which the system obeys the model. The independence properties of the system model are quite strong and are not likely to be continuously realizable in the actual system. For example, partition failures are correlated communication failures and do not occur in this model. Partitions can be “simulated” by the independent failures of several processes, but are of vanishingly low probability. Similarly, the model gives little insight into how a system might behave during and after a brief networkwide communication outage. Both types of failures are realistic threats, which is why we resorted to experiments to explore their impact on the protocol.

A.2 Pbcast Protocol

The version of the protocol used in our analysis is simplified, as follows. We will assume that a run of the pbcast protocol consists of a fixed number of rounds, after which a multicast vanishes from the system because the corresponding message is garbage-collected. A process initiates a pbcast by unreliably multicasting the message, and it is received by a random subset of the processes. These gossip about the message, causing it to reach processes that did not previously have a copy, which gossip about it in turn. For our analysis, we consider just a single multicast event, and we adopt the view that a process gossips about a multicast message only during the round in which it first receives a copy of that message. Processes choose the destinations for their gossip by tossing a weighted random coin for each other process to determine whether to send a gossip message to that process. Thus, the parameters of the protocol studied in the analysis are

—P: the set of processes in the system. N = |P|.


—R: the number of rounds of gossip to run.

—β: the probability that a process gossips to each other process (the weighting of the coin mentioned above). We define the fanout of the protocol to be β·N: this is the expected number of processes to which a participant gossips.

Described in this manner, the behavior of the gossip protocol mirrors a class of disease epidemics which nearly always infect either almost all of a population or almost none of it. The pbcast bimodal delivery distribution, mentioned earlier, will stem from the “epidemic” behavior of the gossip protocol. The normal case for the protocol is one in which gossip floods the network in a random but exponential fashion.

A.3 Pbcast Analysis

Our analysis will show how to calculate the bimodal pbcast delivery distribution for a given setting, and how to bound the probability of a pbcast “failure” using a definition of failure provided by the application designer in the form of a predicate on the final system state. It would be preferable to present a closed-form solution; however, doing so for nontrivial epidemics of the kind seen here is an open problem in epidemic theory. In the absence of closed-form bounds, the approach of this analysis will be to derive a recurrence relation between successive rounds of the protocol, which will then be used to calculate an upper bound on the chance of a failed pbcast run.

A.4 Notation and Probability Background

The following analysis uses standard probability theory. We use three types of random variables. Lowercase variables, such as f, r, and s, are integral random variables; uppercase variables, such as X, are binary random variables (they take values from {0,1}); and uppercase bold variables, such as $\mathbf{X}$, are integral random variables corresponding to sums of binary variables of the same letter: $\mathbf{X} = \sum_i X_i$.

$P\{v = k\}$ refers to the probability of the random variable v having the value k. For binary variables, $P\{X\} = P\{X = 1\}$. With lowercase integral random variables, in $P\{r\}$ the variable serves both to specify a random variable and as a binding occurrence for a variable of the same name.

The distributions of sums of independent, identically distributed binary variables are called binomial distributions. If $\forall\, 0 \le i < n : P\{X_i\} = p$, then

\[ P\{\mathbf{X} = k\} = \binom{n}{k} p^k (1-p)^{n-k}. \]
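As a quick numerical check of this formula (our helper functions, written with Python's math.comb; similar tail sums appear when expanding the recurrence below):

from math import comb

def binom_pmf(n: int, k: int, p: float) -> float:
    """P{X = k} for a sum of n i.i.d. binary variables with P{X_i} = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(n: int, k: int, p: float) -> float:
    """P{X >= k}, the tail sum used in bounds like relation (1)."""
    return sum(binom_pmf(n, i, p) for i in range(k, n + 1))

# Sanity check: the pmf sums to 1.
assert abs(sum(binom_pmf(10, k, 0.3) for k in range(11)) - 1.0) < 1e-12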

We use relations among random variables to derive bounds on the distributions of the weighted and unweighted sums of the variables. Let $X_i$, $Y_i$, and $Z_i$ form finite sets of random variables, and let $g(i)$ be a nonnegative real-valued function defined over the integers. If

\[ \forall\, 0 \le i < n : \; P\{X_i\} \le P\{Y_i\} \le P\{Z_i\} \]

then

\[ P\{\mathbf{Y} = k\} \;\le\; P\{\mathbf{Z} \ge k\} - P\{\mathbf{X} \ge k+1\} \tag{1} \]

\[ \sum_{0 \le i < n} P\{\mathbf{X} = i\}\, g(i) \;\le\; \sum_{0 \le i < n} P\{\mathbf{Y} = i\} \max_{0 \le j \le i} g(j) \tag{2} \]

These equations will be applied later in the analysis.

A.5 A Recurrence Relation

The first step is to derive a recurrence relation that bounds the probability of protocol state transitions between successive rounds. We describe the state of a round using three integral random variables: $s_t$ is the number of processes that may gossip in round t (or in epidemic terminology, the infectious processes); $r_t$ is the number of processes in round t that have not received a gossip message yet (the susceptible processes); and $f_t$ is the number of infectious processes in the current round which are faulty.

Recall from the outset of this appendix that our analysis is pessimistic, assuming that the initial unreliable broadcast fails and reaches none of the destinations, leaving an initial state in which a single process has a copy of the message while all others are susceptible:

\[ s_0 = 1, \qquad r_0 = N - 1, \qquad f_0 = 0 \]

\[ r_{t+1} + s_{t+1} = r_t \]

\[ \sum_{0 \le t \le R} s_t + r_R = N \]

The recurrence relation we derive, $R(s_t, r_t, f_t, s_{t+1})$, is a bound on the conditional probability, given the current state described by $(s_t, r_t, f_t)$, that $s_{t+1}$ of the $r_t$ susceptible processes receive a gossip message from this round. Expressed as a conditional probability, this is $P\{s_{t+1} \mid s_t, r_t, f_t\}$.

For each of the $r_t$ processes, we introduce a binary random variable, $X_i$, corresponding to whether a particular susceptible process receives gossip this round. $s_{t+1}$ is equal to the sum of these variables, $\sum_i X_i$, or equivalently $\mathbf{X}$. In order to calculate $R(s_t, r_t, f_t, s_{t+1})$, we will derive bounds on the distribution of $\mathbf{X}$. Our derivation will be in four steps. First we consider $P\{X_i\}$ in the absence of faulty processes and with fixed message failures. Then we introduce, separately, generalized message failures and faulty processes, and finally we combine both failures. Then we derive bounds on $P\{\mathbf{X} = k\}$ for the most general case.


A.5.1 Fixed Message Failures. The analysis begins by assuming (1) that there are no faulty processes and (2) that message delay failures occur with exactly ε probability, no more and no less. This assumption limits the system from behaving with a more reliable message failure rate. In the absence of these sorts of failures, the behavior of the system is the same as a well-known (in epidemic theory) epidemic model, called the chain-binomial epidemic. The literature on epidemics provides a simple method for calculating the behavior of these epidemics when there are an unlimited number of rounds and no notion of failures [Bailey 1975]. We introduce constants $p = \beta(1 - \varepsilon)$ and $q = 1 - p$. Here p is the probability that an infectious process both gossips to a particular susceptible process and that the message does not experience a send omission failure, under the assumption of fixed message failures. (Note that this use of p is unrelated to the reliability parameter p employed elsewhere in the article; the distinction is clear from context.)

For each of the $r_t$ susceptible processes and corresponding variable $X_i$, we consider the probability that at least one of the $s_t$ infectious processes sends a gossip message which gets through. Expressed differently, this is the probability that not all infectious processes fail to send a message to a particular susceptible process:

\[ P\{X_i\} = 1 - (1 - p)^{s_t} = 1 - q^{s_t} \]

A.5.2 Generalized Message Failures. A potential risk in the analysis ofpbcast is to assume, as may be done for many other protocols, that theworst case occurs when message loss is maximized. Pbcast’s failure modeoccurs when there is a partial delivery of a pbcast. A pessimistic analysismust consider the case where local increases in the message deliveryprobability decrease the reliability of the overall pbcast protocol. We extendthe previous analysis to get bounds on P$Xi%, but where the message failurerate may be anywhere in the range of @0..«#

Consider every process $i$ that gossips, and every process $j$ that $i$ sends a gossip message to. With generalized message failures, there is a probability $\varepsilon_{ij}$ that the message experiences a send omission failure, such that $0 \le \varepsilon_{ij} \le \varepsilon$. This gives bounds $[p_{lo}..p_{hi}]$ on $p_{ij}$, the probability that process $i$ gossips to process $j$ and that the message is delivered:

$$b(1-\varepsilon) = p_{lo} \le b(1-\varepsilon_{ij}) = p_{ij} \le p_{hi} = b$$

(we also have $q_{lo} = 1 - p_{lo}$ and $q_{hi} = 1 - p_{hi}$). This in turn gives bounds on the probability of each of the $r_t$ processes being gossiped to, expressed using the variables $X_{hi}$ and $X_{lo}$, which correspond to fixed message failure rate models:

$$1 - q_{lo}^{s_t} = P\{X_{lo}\} \le P\{X_j\} \le P\{X_{hi}\} = 1 - q_{hi}^{s_t}$$

A.5.3 Process Failures. Introducing process failures into the analysis is done in a similar fashion to that of generalized message failures. For simplicity in the following discussion, we again fix the probability of message failure to $\varepsilon$.

We assume that $f_t$ of the $s_t$ infectious processes gossiping in the current round are faulty. For the purposes of analyzing pbcast, there are three ways in which processes can fail: they can crash before, during, or after the gossip stage of the pbcast protocol. Regardless of which case applies, a process always sends a subset of the messages it would have sent had it not been faulty: a faulty process never introduces spurious messages. If all $f_t$ processes crash before sending their gossip messages, then the probability of one of the susceptible processes receiving a gossip message, $P\{X_i\}$, will be as though there were exactly $s_t - f_t$ correct processes gossiping in the current round. If all crash after gossiping, then the probability will be as though all $s_t$ processes gossiped and none of the $f_t$ processes had failed. All other cases cause the random variables $X_i$ to behave with some probability in between:

$$1 - q^{s_t - f_t} = P\{X_{lo}\} \le P\{X_i\} \le P\{X_{hi}\} = 1 - q^{s_t}$$

A.5.4 Combined Failures. The bounds from the two previous sections are "combined" to arrive at

$$1 - q_{lo}^{s_t - f_t} = P\{X_{lo}\} \le P\{X_i\} \le P\{X_{hi}\} = 1 - q_{hi}^{s_t}$$

Then we apply Eq. (1) to get bounds on $P\{\sum X_j = k\}$, or $P\{X = k\}$:

$$P\{X = k\} \le P\{X_{hi} \ge k\} - P\{X_{lo} \ge k + 1\}$$

Expanding terms, we get the full recurrence relation:

$$P\{s_{t+1} \mid s_t, r_t, f_t\} \le \sum_{s_{t+1} \le i \le N} \binom{r_t}{i}\,(1 - q_{hi}^{s_t})^i\,(q_{hi}^{s_t})^{r_t - i} \;-\; \sum_{s_{t+1}+1 \le i \le N} \binom{r_t}{i}\,(1 - q_{lo}^{s_t - f_t})^i\,(q_{lo}^{s_t - f_t})^{r_t - i} \qquad (3)$$

We define the right-hand side of relation (3) to be $R(s_t, r_t, f_t, s_{t+1})$: "an upper bound on the probability that, with $s_t$ gossiping processes of which $f_t$ are faulty, and with $r_t$ processes that have not yet received the gossip, $s_{t+1}$ processes will receive the gossip this round."
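Relation (3) is straightforward to evaluate directly. The sketch below (ours, in Python; names such as binom_tail and the parameter list are illustrative assumptions) computes $R(s_t, r_t, f_t, s_{t+1})$ as the difference of two binomial tail probabilities, using $p_{lo} = b(1-\varepsilon)$ and $p_{hi} = b$ as derived above:

from math import comb

def binom_tail(n: int, p: float, k: int) -> float:
    """P{Binomial(n, p) >= k}, computed exactly with math.comb."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))

def R(s_t: int, r_t: int, f_t: int, s_next: int, b: float, eps: float) -> float:
    """Upper bound (relation (3)) on P{s_{t+1} = s_next | s_t, r_t, f_t}."""
    p_lo, p_hi = b * (1.0 - eps), b        # bounds on per-pair delivery probability
    q_lo, q_hi = 1.0 - p_lo, 1.0 - p_hi
    x_hi = 1.0 - q_hi ** s_t               # P{X_hi}: all s_t senders, minimal loss
    x_lo = 1.0 - q_lo ** (s_t - f_t)       # P{X_lo}: f_t senders lost, maximal loss
    return binom_tail(r_t, x_hi, s_next) - binom_tail(r_t, x_lo, s_next + 1)

Note that the two sums in relation (3) formally run up to $N$, but every term with $i > r_t$ vanishes because $\binom{r_t}{i} = 0$, so summing to $r_t$, as the code does, is equivalent.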

A.6 Predicting Latency to Delivery

Still working within the same model,$^{10}$ we can compute the distribution of latency between when a message is sent and when it is delivered. For the case where the initial multicast is successful, this latency will be determined by the multicast transport protocol: IP multicast or the tree-based multicast introduced earlier. Both protocols can be approximated as simple packet-forwarding algorithms operating over forwarding trees. If the typical per-round fanout for a node is $b$, then a typical message will take $\log_b(N)$ hops from sender to destination. Given some information about the distribution of response times for forwarding nodes, we could then calculate a distribution of latency to delivery and an associated variance. Our experience suggests that both mean latency and variance will grow as $\log_b(N)$.

$^{10}$Actually, we differ in one respect: the analysis of this subsection explicitly treats gossip to $b$ processes during each round. The previous analysis treated all gossip as occurring in the first round.

When the initial multicast does not reach some$^{11}$ destinations, the analysis is quite another matter. Suppose the initial multicast infects $(N-1)\cdot(1-\varepsilon_0)$ processes, for some constant $\varepsilon_0$; i.e., $s_0 = 1 + (N-1)\cdot(1-\varepsilon_0)$ (the sender always has a copy of the message). If we denote by $r_t$ the number of correct processes that have not yet received a copy of the message by time $t$, then $r_0 = (N-1)\cdot\varepsilon_0$. Given $s_t$ and $r_t$, we now derive a recurrence relation for $s_{t+1}$ and $r_{t+1}$.

As before, we introduce constants $p = b(1-\varepsilon)$ and $q = 1 - p$. First, we assume that processes do not crash. For a susceptible process, the probability that at least one of the $s_t$ infectious processes sends a gossip message that gets through is $1 - q^{s_t}$. Let $k$ be the expected number of newly infected processes. We then have $k = r_t \cdot (1 - q^{s_t})$, $s_{t+1} = s_t + k$, and $r_{t+1} = r_t - k$.

Now we can introduce process failures into the analysis. There are three ways that a process can fail: it can crash before, during, or after the gossip stage of the protocol. Here we are investigating the relationship between the number of susceptible (hence, correct) processes and the number of gossip rounds. The worst case occurs when all faulty processes fail before the gossip stage (similarly, we can relax the message failure rate, but the worst case occurs when the loss rate is $\varepsilon$). We now have $s_t = s_t \cdot (1-\tau)$, $r_t = r_t \cdot (1-\tau)$; $k = r_t \cdot (1 - q^{s_t})$; $s_{t+1} = s_t + k$; $r_{t+1} = r_t - k$. From these relations we can produce graphs such as Figure 6 in Section 6, which shows the number of susceptible processes as a function of the number of gossip rounds with $N = 1000$, a gossip fanout of 1, $\tau = 0.001$, $\varepsilon = 0.05$, and $\varepsilon_0 = 1.0$ (the initial multicast fails).

Now, define $v_t$ to be the probability that a susceptible process gets infected in any round prior to round $t$, and define $w_t$ to be the probability that a susceptible process gets infected in round $t$. Observe that, in any round, all the currently susceptible processes have equal probability of getting infected. We have $v_t = 1 - r_t / N$ and $w_t = v_t - v_{t-1}$. From this we are able to produce Figure 7 in Section 6, showing the probability for a correct process to receive a message in a certain round. The figure superimposes curves for various values of $N$: 10, 128, and 1024. Notice that the curve has roughly the same shape in each case and that the peak falls close to $\log_{fanout}(N)$.

$^{11}$The case where the initial multicast reaches no processes corresponds to $\varepsilon_0 = 1$, $s_0 = 1$, and $r_0 = N - 1$.
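These recurrences can be iterated directly. The sketch below (ours, in Python) traces the round-by-round quantities behind Figures 6 and 7, under two assumptions that are our reading rather than the article's: the per-pair gossip probability is taken as $b = \mathrm{fanout}/N$, and the $(1-\tau)$ failure scaling is applied each round, mirroring the order of the update equations above.

def epidemic_rounds(N=1000, fanout=1, eps=0.05, tau=0.001, eps0=1.0, rounds=25):
    """Iterate the A.6 recurrence; returns a list of (t, s_t, r_t, v_t, w_t)."""
    b = fanout / N                     # per-pair gossip probability (our reading)
    q = 1.0 - b * (1.0 - eps)
    s = 1.0 + (N - 1) * (1.0 - eps0)   # infected after the initial multicast
    r = (N - 1) * eps0                 # still-susceptible correct processes
    v_prev = 1.0 - r / N
    rows = []
    for t in range(1, rounds + 1):
        s, r = s * (1.0 - tau), r * (1.0 - tau)  # worst case: failures hit first
        k = r * (1.0 - q ** s)         # expected newly infected this round
        s, r = s + k, r - k
        v = 1.0 - r / N                # P{infected in some round <= t}
        rows.append((t, s, r, v, v - v_prev))
        v_prev = v
    return rows

Calling epidemic_rounds() with the Figure 6 parameters traces the susceptible count per round; the last column, $w_t$, gives the per-round delivery probability of the kind plotted in Figure 7.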

A.7 Failure Predicates

In this section we show how to calculate a bound on the probability of a pbcast, in a particular round and state, ending in an undesired ("failed") state on round $R$:

$$F_t(s_t, r_t, \bar{f}_t)$$

($\bar{f}_t$ is the total number of faulty processes that have failed prior to time $t$: $\bar{f}_t = \sum_{0 \le i < t} f_i$). Given $F$, the reliability of pbcast can be found by examining the value of $F$ for the initial state of the protocol, or $F_0(1, N-1, 0)$. (Making a more optimistic assumption, we could compute $F_0(N\cdot(1-\varepsilon),\, N\cdot\varepsilon,\, 0)$, giving the expected outcome if the initial multicast reaches all but $N\cdot\varepsilon$ processes.) This computation would then yield values for the pbcast parameters which will give a sufficiently high reliability for the desired use.

Values of $F$ are calculated in the context of a predicate that defines whether a run of the protocol failed or not, according to its final state. Failure states correspond to outcomes we wish to avoid. The predicate, $P(S, F)$, is defined over the total number of infected processes ($S$), possibly including some faulty processes, and the total number of faulty processes ($F$). This predicate can be defined differently, depending on the use of pbcast.

To illustrate this, we now give two predicates and use them to explore the predicted reliability of pbcast in the environment of interest to us. The first, predicate I, defines a failed pbcast to be one that reaches more than $\sigma N$ processes in a system, but less than $(1-\sigma)N$ processes. For a value of $\sigma = 0.1$ this captures the notion of failure that might arise in the case of the General of the introduction:

$$P(S, F) = (S \ge \sigma N) \wedge (S \le (1-\sigma)N) \qquad (\mathrm{I})$$

Predicate II would make sense if pbcast were used as the basis of a quorum replication algorithm (a topic discussed in Birman [1997]). For such applications, the worst possible situation is one in which pbcast reaches about half the processes in the system: neither a clear majority nor a clear minority, with failed processes representing a possible "swing vote." To capture this, the predicate counts failed processes twice: it pessimistically totals all of the processes that may have been infected, so that a failed pbcast is one in which the percentage of infected processes is less than a majority (with faulty processes counting as uninfected) while the

$$P(S, F) = \left(S - F < \frac{N+1}{2}\right) \wedge \left(S + F \ge \frac{N+1}{2}\right) \qquad (\mathrm{II})$$
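Stated as executable checks, the two predicates look as follows (a minimal sketch, ours; the function names and the sigma default are illustrative):

def predicate_I(S: int, F: int, N: int, sigma: float = 0.1) -> bool:
    """Failed run: delivery landed in the region [sigma*N, (1-sigma)*N]."""
    return sigma * N <= S <= (1.0 - sigma) * N

def predicate_II(S: int, F: int, N: int) -> bool:
    """Failed run: no clear majority either way, with faulty processes
    counted pessimistically on both sides of the vote."""
    return S - F < (N + 1) / 2 <= S + F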

The calculation works backward from the last round. For each round, we sum over the possible number of failures in this round and the number of infectious processes in the next round. This is done using the calculations for the next round and the recurrence relation, $R$, in order to get the two following equations. The first equation calculates bounds on the probabilities for round $R$; the second equation calculates bounds for the previous rounds (here we take $P(S, F) = 1$ if true and 0 if false):

$$F_R(s_R, r_R, \bar{f}_R) \le \sum_{0 \le f_R \le s_R + r_R} P\{f_R\}\, P(N - r_R,\, \bar{f}_R + f_R)$$

$$F_t(s_t, r_t, \bar{f}_t) \le \sum_{0 \le f_t \le s_t} \left( P\{f_t\} \sum_{0 \le s_{t+1} \le r_t} R(s_t, r_t, f_t, s_{t+1})\, F_{t+1}(s_{t+1},\, r_t - s_{t+1},\, \bar{f}_t + f_t) \right) \qquad (4)$$

We do not know the exact distribution of $P\{f_t\}$ because individual processes can fail with probabilities anywhere in $[0..\tau]$. However, we can apply Eq. (2) to get bounds on the two equations above. For example, the bound for Eq. (4) is

$$F_t(s_t, r_t, \bar{f}_t) \le \sum_{0 \le f_t \le s_t} \left( \binom{s_t}{f_t} \tau^{f_t} (1-\tau)^{s_t - f_t} \max_{0 \le i \le f_t} \sum_{0 \le s_{t+1} \le r_t} R(s_t, r_t, i, s_{t+1})\, F_{t+1}(s_{t+1},\, r_t - s_{t+1},\, \bar{f}_t + i) \right).$$

Given the parameters of the system and a predicate defining failed final states of the protocol, we can now compute bounds on the probability of pbcast ending up in a failed state. This was done to obtain the graphs presented in Section 6 of the article.
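The backward computation is naturally expressed as a memoized recursion over the protocol state. The sketch below (ours, in Python) follows the structure of Eq. (4) and the bound above, reusing R from the earlier sketch and one of the predicates defined previously. The terminal-round weighting with a binomial in $\tau$ is our analogue of the Eq. (2) bound, which the text does not spell out, and the computation is tractable only for very small $N$:

from functools import lru_cache
from math import comb

def failure_bound(N, rounds, b, eps, tau, predicate):
    """Bound F_0(1, N-1, 0): probability that the run ends in a failed state."""
    @lru_cache(maxsize=None)
    def F(t, s, r, f_bar):
        if t == rounds:
            # Last round: weight each possible failure count, test the predicate
            # (True counts as 1, False as 0, matching the text's convention).
            return sum(comb(s + r, f) * tau**f * (1.0 - tau)**(s + r - f)
                       * predicate(N - r, f_bar + f, N)
                       for f in range(s + r + 1))
        total = 0.0
        for f in range(s + 1):
            weight = comb(s, f) * tau**f * (1.0 - tau)**(s - f)
            worst = max(sum(R(s, r, i, s_next, b, eps)
                            * F(t + 1, s_next, r - s_next, f_bar + i)
                            for s_next in range(r + 1))
                        for i in range(f + 1))
            total += weight * worst
        return min(total, 1.0)    # clamp: relation (3) is a bound, not a distribution
    return F(0, 1, N - 1, 0)

# Example call on a tiny system (parameters illustrative):
# failure_bound(N=20, rounds=5, b=0.25, eps=0.05, tau=0.001, predicate=predicate_I)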

B. PSEUDOCODE FOR THE PROTOCOL

The following code is executed, concurrently, by all processes in the system. Notice that, per optimization 5, the "rounds" need not be synchronous. Although round numbers arise in the protocol, they are used in a manner that does not require processes to be in the same round at the same time. For example, if process $p$ is in round $n$ when it sends a gossip message to process $q$, process $q$'s round number is not relevant. Instead, if $q$ solicits a retransmission from $p$, it does so using $p$'s round number from the gossip message.


pbcast(msg):
    add_to_msg_buffer(msg);
    unreliably_multicast(msg);

first_reception(msg):
    add_to_msg_buffer(msg);
    deliver messages that are now in order; report gaps after suitable delay;

add_to_msg_buffer(msg):
    slot := free_slot;
    msg_buffer[slot].msg := msg;
    msg_buffer[slot].gossip_count := 0;

gossip_round:    (* Runs every 100ms in our implementation *)
    my_round_number := my_round_number + 1;
    gossip_msg := <my_round_number, digest(msg_buffer)>;
    for (i = 0; i < b*N/R; i := i + 1)
        { dest := randomly_selected_member; send gossip_msg to dest; }
    foreach slot
        msg_buffer[slot].gossip_count := msg_buffer[slot].gossip_count + 1;
    discard messages for which gossip_count exceeds G, the garbage-collection limit;

rcv_gossip_msg(round_number, gmsg):
    compare with contents of local message buffer;
    foreach missing message, most recent first
        if this solicitation won't exceed limit on retransmissions per round
            send solicit_retransmission(round_number, msg.id) to gmsg.sender;

rcv_solicit_retransmission(msg):
    if I am no longer in msg.round, or have exceeded limits for this round
        ignore
    else
        send make_copy(msg.solicited_msgid) to msg.sender;

ACKNOWLEDGMENTS

Matt Lucas was extremely helpful in providing insight and data documenting conditions under which SRM throughput becomes unstable. Eli Upfal pointed us to work on random graphs used in the scalability analysis. Srinivasan Keshav, Fred Schneider, and Robbert van Renesse all made extensive suggestions after reading an early version of this article, and the anonymous reviewers made further suggestions. Their help is gratefully recognized. Werner Vogels and Michael Kalantar also made useful suggestions. Tom Coleman and Jay Blaire helped with access to the Cornell Theory Center SP2, the platform used in our experimental work. The Cornell Theory Center itself was also very helpful, particularly in prioritizing our experiments so that they would be scheduled without long delays.

REFERENCES

BAILEY, N. 1975. The Mathematical Theory of Infectious Diseases, 2nd ed. Charles Griffin and Company, London, England, United Kingdom.


BAJAJ, S., BRESLAU, L., ESTRIN, D., FALL, K., FLOYD, S., HALDAR, P., HANDLEY, M., HELMY, A., HEIDEMANN, J., HUANG, P., KUMAR, S., MCCANNE, S., REJAIE, R., SHARMA, P., VARADHAN, K., XU, Y., YU, H., AND ZAPPALA, D. 1999. Improving simulation for network research. Tech. Rep. 99-702. Computer Science Department, University of Southern California, Los Angeles, CA.

BALDONI, R., MOSTEFAOUI, A., AND RAYNAL, M. 1996a. Causal delivery of messages with real-time data in unreliable networks. Real-Time Syst. 10, 3, 245–262.

BALDONI, R., PRAKASH, R., RAYNAL, M., AND SINGHAL, M. 1996b. Broadcast with time and causality constraints for multimedia applications. INRIA, Rennes, France.

BIRMAN, K. 1994. A response to Cheriton and Skeen's criticism of causal and totally ordered communication. ACM SIGOPS Oper. Syst. Rev. 28, 1 (Jan. 1994), 11–21.

BIRMAN, K. P. 1997. Building Secure and Reliable Network Applications. Manning Publications Co., Greenwich, CT. http://www.browsebooks.com/Birman/index.html.

BIRMAN, K. P. 1999. A review of experiences with reliable multicast. Softw. Pract. Exper. 29, 9.

BIRMAN, K., FRIEDMAN, R., HAYDEN, M., AND RHEE, I. 1998. Middleware support for distributed multimedia and collaborative computing. In Proceedings of ACM Multimedia and Networking (MMCN '98, San Jose, CA). ACM, New York, NY.

CHERITON, D. R. AND SKEEN, D. 1993. Understanding the limitations of causally and totally ordered communication. ACM SIGOPS Oper. Syst. Rev. 27, 5 (Dec. 1993), 44–57.

COOPER, R. 1994. Experience with causally and totally ordered communication support: A cautionary tale. ACM SIGOPS Oper. Syst. Rev. 28, 1 (Jan. 1994), 28–31.

CRISTIAN, F., AGHILI, H., STRONG, R., AND DOLEV, D. 1985. Atomic broadcast: From simple message diffusion to Byzantine agreement. In Proceedings of the 15th IEEE International Symposium on Fault-Tolerant Computing (FTCS '85, Ann Arbor, MI, June). IEEE Press, Piscataway, NJ, 200–206.

CRISTIAN, F., DOLEV, D., STRONG, R., AND AGHILI, H. 1990. Atomic broadcast in a real-time environment. In Fault-Tolerant Distributed Computing, B. Simons and A. Spector, Eds. Springer Lecture Notes in Computer Science. Springer-Verlag, Berlin, Germany, 51–71.

DEMERS, A., GREENE, D., HAUSER, C., IRISH, W., AND LARSON, J. 1987. Epidemic algorithms for replicated database maintenance. In Proceedings of the 6th Annual ACM Symposium on Principles of Distributed Computing (PODC '87, Vancouver, BC, Aug. 10–12), F. B. Schneider, Ed. ACM Press, New York, NY, 1–12.

FEIGE, U., PELEG, D., RAGHAVAN, P., AND UPFAL, E. 1990. Randomized broadcast in networks. Random Struct. Alg. 1, 4, 447–460.

FLOYD, S., JACOBSON, V., MCCANNE, S., LIU, C.-G., AND ZHANG, L. 1995. A reliable multicast framework for light-weight sessions and application level framing. SIGCOMM Comput. Commun. Rev. 25, 4 (Oct.), 342–356. For more information see http://www.aciri.org/floyd/srm.html.

GOLDING, R. AND TAYLOR, K. 1992. Group membership in the epidemic style. Tech. Rep. UCSC-CRL-92-13. Computer and Information Sciences Department, University of California at Santa Cruz, Santa Cruz, CA.

GUO, K. 1998. Scalable membership detection protocols. Ph.D. Dissertation. Department of Computer Science, Cornell University, Ithaca, NY. Available as Tech. Rep. 98-1684.

HAYDEN, M. 1998. The Ensemble system. Ph.D. Dissertation. Department of Computer Science, Cornell University, Ithaca, NY. Available as Tech. Rep. TR 98-1662.

HAYDEN, M. G. AND BIRMAN, K. P. 1996. Probabilistic broadcast. Tech. Rep. TR96-1606. Department of Computer Science, Cornell University, Ithaca, NY.

LABOVITZ, C., MALAN, G., AND JAHANIAN, F. 1997. Internet routing instability. In Proceedings of the ACM Conference on Communications, Architecture, and Protocols (SIGCOMM '97, Oct.). ACM Press, New York, NY.

LADIN, R., LISKOV, B., SHRIRA, L., AND GHEMAWAT, S. 1992. Providing high availability using lazy replication. ACM Trans. Comput. Syst. 10, 4 (Nov. 1992), 360–391.

LIDL, K., OSBORNE, J., AND MALCOME, J. 1994. Drinking from the firehose: Multicast USENET news. In Proceedings of USENIX Winter (Jan.). USENIX Assoc., Berkeley, CA, 33–45.


LIN, J. C. AND PAUL, S. 1996. A reliable multicast transport protocol. In Proceedings of the IEEE Conference on Computers and Communication (INFOCOM '96). IEEE Press, Piscataway, NJ, 1414–1424. See also http://www.bell-labs.com/project/rmtp.

LIU, C.-G. 1997. Error recovery in scalable reliable multicast. Ph.D. Dissertation. Computer Science Department, University of Southern California, Los Angeles, CA.

LUCAS, M. 1998. Efficient data distribution in large-scale multicast networks. Ph.D. Dissertation. Department of Computer Science, University of Virginia, Charlottesville, VA.

OZKASAP, O., XIAO, Z., AND BIRMAN, K. P. 1999. Scalability of two reliable multicast protocols. Tech. Rep. TR 99-1748. Department of Computer Science, Cornell University, Ithaca, NY.

PAXSON, V. 1997. End-to-end Internet packet dynamics. In Proceedings of the ACM Conference on Communications, Architecture, and Protocols (SIGCOMM '97, Oct.). ACM Press, New York, NY, 139–154.

PIANTONI, R. AND STANCESCU, C. 1997. Implementing the Swiss Exchange Trading System. In Proceedings of the Conference on Fault Tolerant Computing Systems (FTCS '97, June). IEEE Press, Piscataway, NJ, 309–313.

PAUL, S., SABNANI, K. K., LIN, J. C., AND BHATTACHARYYA, S. 1997. Reliable Multicast Transport Protocol. IEEE J. Sel. Areas Commun. 15, 3 (Apr.).

VAN RENESSE, R. 1994. Why bother with CATOCS? ACM SIGOPS Oper. Syst. Rev. 28, 1 (Jan. 1994), 22–27.

VAN RENESSE, R., BIRMAN, K. P., AND MAFFEIS, S. 1996. Horus: A flexible group communication system. Commun. ACM 39, 4, 76–83.

VAN RENESSE, R., MINSKY, Y., AND HAYDEN, M. 1998. Gossip-based failure detection service. In Proceedings of Middleware '98.

XTP FORUM. 1995. Xpress Transfer Protocol specification. XTP Rev. 4.0, 95-20.

Received: May 1998; revised: May 1999; accepted: May 1999
