BFT Protocols Under Fire
Atul Singh, Tathagata Das*, Petros Maniatis, Peter Druschel, Timothy Roscoe
MPI-SWS, Rice University, *IIT Kharagpur, Intel Research Berkeley, ETH Zürich
Much recent work on Byzantine state machine replication focuses on protocols with improved performance under benign conditions (LANs, homogeneous replicas, limited crash faults), with relatively little evaluation under typical, practical conditions (WAN delays, packet loss, transient disconnection, shared resources). This makes it difficult for system designers to choose the appropriate protocol for a real target deployment. Moreover, most protocol implementations differ in their choice of runtime environment, crypto library, and transport, hindering direct protocol comparisons even under similar conditions.
We present a simulation environment for such protocols that combines a declarative networking system with a robust network simulator. Protocols can be rapidly implemented from pseudocode in the high-level declarative language of the former, while network conditions and (measured) costs of communication packages and crypto primitives can be plugged into the latter. We show that the resulting simulator faithfully predicts the performance of native protocol implementations, both as published and as measured in our local network.
We use the simulator to compare representative protocols under identical conditions and rapidly explore the effects of changes in the costs of crypto operations, workloads, network conditions and faults. For example, we show that Zyzzyva outperforms protocols like PBFT and Q/U under most but not all conditions, indicating that one-size-fits-all protocols may be hard if not impossible to design in practice.
Byzantine Fault-Tolerant (BFT) protocols for replicated systems have received considerable attention in the systems research community [3, 7, 9], for applications including replicated file systems, backup, and block stores. Such systems are progressively becoming more mature, as evidenced by recent designs sufficiently fine-tuned and optimized to approach the performance of centralized or crash-fault-only systems in some settings.
Much of the attraction of such systems stems from the combination of a simple programming interface with provable correctness properties under a strong adversarial model. All a programmer need do is write her server application as a sequential, misbehavior-oblivious state machine; available BFT protocols can replicate such application state machines across a population of replica servers, guaranteeing safety and liveness even in the face of a bounded number of arbitrarily faulty (Byzantine) replicas among them. The safety property (linearizability) ensures that requests are executed sequentially under a single schedule consistent with the order seen by clients. The liveness property ensures that all requests from correct clients are eventually executed.
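The programming model described above can be illustrated with a minimal sketch. It is not taken from any of the protocols studied here: the `KeyValueStateMachine` class, the request tuples, and the `replicate` helper are hypothetical names chosen for illustration, and the agreed schedule is simply passed in rather than produced by a BFT protocol. The sketch only shows why a deterministic, sequential application yields identical state and replies on every correct replica once a single schedule has been fixed.

```python
from dataclasses import dataclass, field


@dataclass
class KeyValueStateMachine:
    """Hypothetical sequential application; it knows nothing about replication."""
    store: dict = field(default_factory=dict)

    def execute(self, request):
        # Requests are tuples: ("put", key, value) or ("get", key).
        op, key, *rest = request
        if op == "put":
            self.store[key] = rest[0]
            return "ok"
        if op == "get":
            return self.store.get(key)
        raise ValueError(f"unknown operation: {op}")


def replicate(schedule, num_replicas):
    """Apply one already-agreed schedule, in order, on every replica.

    Assumption: ordering is given. A real BFT protocol's job is to
    establish this schedule despite faulty replicas.
    """
    replicas = [KeyValueStateMachine() for _ in range(num_replicas)]
    replies = [[r.execute(req) for req in schedule] for r in replicas]
    return replicas, replies


schedule = [("put", "x", 1), ("put", "x", 2), ("get", "x")]
replicas, replies = replicate(schedule, num_replicas=4)

# Determinism plus a single schedule gives identical state and replies.
assert all(r.store == replicas[0].store for r in replicas)
assert all(rep == replies[0] for rep in replies)
```

Because the application is deterministic, a client can compare replies from different replicas; how many matching replies it needs is protocol-specific and is not modeled here.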
Though these protocols carefully address such correctness properties, their authors spend less time and effort evaluating BFT protocols under severe, yet benign, failures. In fact, they often optimize under the assumption that such failures do not occur. For example, Zyzzyva obtains a great performance boost under the assumption that all replica servers have good, predictable latency to their clients, whereas Q/U significantly improves its performance over its precursors assuming no service object is being updated by more than one client at a time.
Unfortunately, even in the absence of malice, deviations from expected behavior can wreak havoc with complex protocols. As an example from the non-Byzantine world, Junqueira et al. have shown that though the fast version of Paxos consensus operates in fewer rounds than the classic version of Paxos (presumably resulting in lower request latency), it is nevertheless more vulnerable to variability in replica connectivity. Because fast Paxos requires more replicas (two-thirds of
NSDI '08: 5th USENIX Symposium on Networked Systems Design and Implementation, USENIX Association, 189
the population) to participate in a round, it is as slow as the slowest of the fastest two-thirds of the population; in contrast, classic Paxos is only as slow as the median of the replicas. As a result, under particularly skewed replica connectivity distributions, the two rounds of fast Paxos can be slower than the three rounds of classic Paxos. This is the flavor of understanding we seek in this paper for BFT protocols. We wish to shed light on the behavior of BFT replication protocols under adverse, yet benign, conditions that do not affect correctness, but may affect tangible performance metrics such as latency, throughput, and configuration stability.
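The Paxos comparison can be made concrete with a small back-of-the-envelope model. This is a simplified sketch, not the analysis of Junqueira et al.: it assumes each replica has one fixed per-round latency, that a round completes as soon as a quorum has responded, and that total latency is the number of rounds times that per-round cost. The function names and the latency values are hypothetical.

```python
def quorum_round_latency(latencies, quorum_size):
    """A round ends when the quorum_size-th fastest replica has responded."""
    return sorted(latencies)[quorum_size - 1]


def classic_paxos_latency(latencies, rounds=3):
    """Three rounds, each waiting for a majority (the median replica)."""
    majority = len(latencies) // 2 + 1
    return rounds * quorum_round_latency(latencies, majority)


def fast_paxos_latency(latencies, rounds=2):
    """Two rounds, each waiting for two-thirds of the population."""
    n = len(latencies)
    two_thirds = (2 * n + 2) // 3  # integer ceiling of 2n/3
    return rounds * quorum_round_latency(latencies, two_thirds)


# Homogeneous replicas (illustrative values, in ms): fewer rounds win.
uniform = [10] * 9
print(classic_paxos_latency(uniform), fast_paxos_latency(uniform))

# Skewed connectivity: five fast replicas, four slow ones. The majority
# quorum still fits within the fast replicas; the two-thirds quorum does not.
skewed = [10] * 5 + [100] * 4
print(classic_paxos_latency(skewed), fast_paxos_latency(skewed))
```

With nine replicas the majority quorum is five and the two-thirds quorum is six, so a single additional slow replica in the quorum is enough to erase the benefit of the saved round in this model.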
As we approach this objective, we rely on simulation. We present BFTSim, a simulation framework that couples a high-level protocol specification language and execution system based on P2 with a computation-aware network simulator built atop ns-2 (Section 3). P2's declarative networking language (OverLog) allows us to capture the salient points of each protocol without drowning in the details of particular thread packages, cryptographic primitive implementations, and messaging modules. ns-2's network simulation enables us to explore a range of network conditions that typical testbeds cannot easily address. Using this platform, we implemented from scratch three protocols: the original PBFT, Q/U, and Zyzzyva. We validate our simulated protocols against published results under corresponding network conditions. Though welcome, this agreement is surprising, given that all three systems depend on different types of runtime libraries and thread packages, and leads us to suspect that a protocol's performance characteristics are primarily inherent in its high-level design, not the particulars of its implementation.
Armed with our simulator, we make an apples-to-apples comparison of several BFT protocols under identical conditions. Then, we expose the protocols to benign conditions that push them outside their comfort zone (and outside the parameter space typically exercised in the literature), but well within the realm of possibility in real-world deployment scenarios. Specifically, we explore latency and bandwidth heterogeneity between clients and replicas, and among replicas themselves, packet loss, and timeout misconfiguration (Section 4). Our primary goal is to test conventional (or published) wisdom with regard to which protocol or protocol type is better than which; it is rare that one size fits all in any engineering discipline, so understanding the envelope of network conditions under which a clear winner emerges can be invaluable.
While we have only begun to explore the potential of our methodology, our study has already led to some interesting discoveries. Among those, perhaps the broadest statement we can make is that though agreement protocols offer hands down the best throughput, quorum-based protocols tend to offer lower latency in wide-area settings. Zyzzyva, the current state-of-the-art agreement-based protocol, provides almost universally the best throughput in our experiments, except in a few cases. First, Zyzzyva is dependent on timeout settings at its clients that are closely tied to client-replica latencies; when those latencies are not uniform, Zyzzyva tends to fall back to behavior similar to a two-phase quorum protocol like HQ, as long as there is no write contention. Second, with large request sizes, Zyzzyva's throughput drops and falls slightly below Q/U's and PBFT's with batching, since its primary is required to send full requests to all the backup replicas. Lastly, under high loss rates, Zyzzyva tends to compensate quickly and expensively, causing its response time to exceed that of the more mellow Q/U.
Section 2 provides some background on BFT replicated state machines. In Section 3, we explain our experimental methodology, describe our simulation environment, and validate it by comparing its predictions with published performance results on several existing BFT protocols we have implemented in BFTSim. Section 4 presents results of a comparative evaluation of BFT protocols under a wide range of conditions. We discuss related work in Section 5 and close with future work and conclusions in Section 6.
In this section, we discuss the work on which this paper is based: BFT replicated state machines. Specifically, we outline the basic machinery of the protocols we study in the rest of this paper: PBFT by Castro and Liskov, Q/U by Abd-El-Malek et al., and Zyzzyva and Zyzzyva5 by Kotla et al.
At a high level, all such protocols share the basic objective of assigning each client request a unique order in the global service history, and executing it in that order. Agreement-based protocols such as PBFT first have the replicas communicate with each other to agree on the sequence number of a new request and, once they have agreed, execute that request after they have executed all preceding requests in that order. PBFT has a three-phase agreement protocol among replicas before it executes a request. Quorum protocols, like Q/U, instead restrict their communication to be only between clients and replicas, as opposed to among replicas; each replica assigns a sequence number to a request and executes it as long as the submitting client appears to have a current picture of the whole replica population, and otherwise uses conflict resolution to bring enough replicas up to speed. Q/U has a one-phase protocol in the fault-free case, but when faults occur or clients contend to write the same object, the protocol has more phases. Zyzzyva is a