Building Dependable Distributed Applications Using AQUA¹


    Jennifer Ren, Michel Cukier, Paul Rubel, and William H. Sanders
    Center for Reliable and High-Performance Computing
    Coordinated Science Laboratory and Department of Electrical and Computer Engineering
    University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
    {ren, cukier, rubel, whs}

    David E. Bakken and David A. Karr
    BBN Technologies
    Cambridge, Massachusetts 02138
    {dbakken, dkarr}

    ¹ This research has been supported by DARPA Contracts F30602-96-C-0315 and F30602-97-C-0276.

    Abstract

    Building dependable distributed systems using ad hoc methods is a challenging task. Without proper support, an application programmer must face the daunting requirement of having to provide fault tolerance at the application level, in addition to dealing with the complexities of the distributed application itself. This approach requires a deep knowledge of fault tolerance on the part of the application designer, and has a high implementation cost. What is needed is a systematic approach to providing dependability to distributed applications. Proteus, part of the AQuA architecture, fills this need, and provides facilities to make a standard distributed CORBA application dependable, with minimal changes to an application. Furthermore, it permits applications to specify, either directly or via the Quality Objects (QuO) infrastructure, the level of dependability they expect of a remote object, and will attempt to configure the system to achieve the requested dependability level. Our previous papers have focused on the architecture and implementation of Proteus. This paper describes how to construct dependable applications using the AQuA architecture, by describing the interface that a programmer is presented with and the graphical monitoring facilities that it provides.

    1. Introduction

    Middleware support for building dependable distributed systems has the potential to ease the burden on application programmers, and increase the dependability of standard applications, by providing an easy way to make an application more dependable. In order to be useful, the middleware must be easy to add to an existing distributed application, must run on standard commercial off-the-shelf hardware, and must interfere as little as possible with applications at runtime. In particular, it should 1) provide a simple interface in which application objects can specify desires about the dependability of remote objects they use, 2) provide automatic and transparent detection of and recovery from failures, and 3) manage a pool of resources in a manner consistent with the desires of multiple objects that require dependable remote objects. While these goals are clearly desirable, building a software infrastructure that achieves them is not an easy task.

    The AQuA architecture [Cuk98] is one approach to building dependable distributed objects that attempts to meet these goals. In particular, AQuA aims to allow distributed applications to request and obtain a desired level of dependability using Proteus [Sab99]. Proteus dynamically manages the replication of distributed objects in order to make them dependable. More specifically, Proteus takes requests regarding the dependability of remote objects used by an application object and decides how to provide fault tolerance. The choice of how to provide fault tolerance involves choosing the style of replication, the type of faults to tolerate, and the location of the replicas, among other things. Once a decision is made, the system is configured to try to achieve the dependability requested by one or more application objects. Reconfiguration of the system can occur if faults occur, or if the requested dependability of one or more application objects changes.

    Several projects focus on building dependable distributed objects. The Eternal system [Nar97] adds fault tolerance to applications by object replication. However, Eternal does not support dynamic system configuration changes in response to changing application requirements. Electra [Maf95] provides fault tolerance to CORBA by building a specialized ORB. However, since Electra uses a non-standard ORB to provide group communication services, it is incompatible with other ORBs if the fault-tolerant features are used. The OpenDREAMS research project [Fel96] focuses on the design and implementation of an Object Group Service (OGS), which provides facilities for CORBA object group communication. This approach has the potential to provide group services to CORBA objects; however, it requires that the application developers be aware of and explicitly make use of the OGS. For a broader comparison with other projects, see [Cuk98].

    In this paper, we describe how to build dependable distributed applications using the AQuA architecture. In particular, we explain the remote method calls one can use to make a quality of service (QoS) request regarding dependability, and the callbacks that occur when a dependability request can no longer be met. We also describe how an application object can obtain information on hosts and make suggestions about the set of hosts that may be used. We then explain how to request more detailed information from Proteus regarding actions it takes and decisions it makes. Finally, we describe the graphical interface Proteus provides on its manager, and on each node that hosts replicas. These programming and monitoring facilities provide an easy-to-use environment for building dependable distributed CORBA applications. To illustrate this, we present an example in which a simple CORBA application was made dependable through the use of Proteus and the AQuA architecture.
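The request-and-callback pattern outlined in the introduction can be pictured with a small, self-contained sketch. It is not the Proteus or QuO interface: `DependabilityRequest`, `ToyManager`, `submit`, `host_failed`, and `on_unmet` are hypothetical names chosen for illustration, and the only rule modeled is the standard one that tolerating f crash failures needs at least f + 1 replicas. Replica placement here is simply "first available host", which is an assumption, not the policy Proteus uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical names throughout; this is not the Proteus or QuO API.


@dataclass
class DependabilityRequest:
    crash_faults: int                  # simultaneous crash failures to tolerate
    on_unmet: Callable[[str], None]    # callback fired when the request cannot be met


@dataclass
class ToyManager:
    hosts: List[str]                   # hosts currently believed to be alive
    request: Optional[DependabilityRequest] = None
    replicas: List[str] = field(default_factory=list)

    def submit(self, request: DependabilityRequest) -> bool:
        """Record the request and try to configure the system for it."""
        self.request = request
        return self._configure()

    def host_failed(self, host: str) -> bool:
        """Drop a failed host and reconfigure if a request is active."""
        if host in self.hosts:
            self.hosts.remove(host)
        return self._configure() if self.request else True

    def _configure(self) -> bool:
        # Tolerating f crash failures requires at least f + 1 replicas.
        needed = self.request.crash_faults + 1
        if len(self.hosts) < needed:
            # Best effort: run on whatever is left and notify the requester.
            self.replicas = list(self.hosts)
            self.request.on_unmet(
                f"need {needed} hosts, only {len(self.hosts)} available"
            )
            return False
        # Keep surviving replicas in place, then fill from unused hosts.
        kept = [h for h in self.replicas if h in self.hosts]
        spare = [h for h in self.hosts if h not in kept]
        self.replicas = (kept + spare)[:needed]
        return True


notices: List[str] = []
manager = ToyManager(hosts=["host-a", "host-b", "host-c"])
manager.submit(DependabilityRequest(crash_faults=1, on_unmet=notices.append))
manager.host_failed("host-a")   # a spare host takes over
manager.host_failed("host-b")   # too few hosts remain, so the callback fires
```

The sketch keeps the decision and the notification in one place so that a fault report and a changed request both lead to the same reconfiguration step, mirroring the two triggers for reconfiguration named above.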

    2. AQuA Overview

    Before describing how a distributed application interacts with Proteus, we briefly review the AQuA architecture. Figure 1 shows the different components of the AQuA architecture in one particular configuration. These components can be assigned to hosts in many different ways, depending on the dependability level that objects desire of remote objects they use.

    The AQuA system uses the Maestro/Ensemble group communication system [Hay98, Vay98] to provide reliable multicast to a dynamically changing group of processes, to ensure atomic delivery of multicasts to groups with changing membership, and to detect and exclude from the group any members that fail by crashing. The Ensemble protocol stack used in AQuA provides inter-process communication based on the virtual synchrony model [Bir96]. Maestro [Vay97] provides an object-oriented interface (in C++) to Ensemble.
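To make the group communication guarantees above more concrete, the following is a single-process toy model of a group with numbered membership views. It is only a sketch of the idea behind virtual synchrony: `ToyGroup` and its methods are hypothetical names, not part of Ensemble or Maestro, and a real system must achieve the same effect across a network with failure detection and agreement protocols that are not modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyGroup:
    """Toy model of a process group with numbered membership views."""
    members: List[str]
    view_id: int = 0
    # Per-member delivery log of (view_id, message) pairs, in delivery order.
    delivered: Dict[str, List[Tuple[int, str]]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for member in self.members:
            self.delivered.setdefault(member, [])

    def multicast(self, sender: str, message: str) -> None:
        """Deliver a message to every member of the current view."""
        if sender not in self.members:
            raise ValueError(f"{sender} is not in the current view")
        for member in self.members:
            self.delivered[member].append((self.view_id, message))

    def crash(self, member: str) -> None:
        """Exclude a crashed member and install the next view."""
        if member in self.members:
            self.members = [m for m in self.members if m != member]
            self.view_id += 1


group = ToyGroup(members=["p", "q", "r"])
group.multicast("p", "update-1")   # delivered to p, q and r in view 0
group.crash("r")                   # r is excluded; view 1 is installed
group.multicast("q", "update-2")   # delivered to p and q only, in view 1
```

In this model the members that remain in the group end up with identical delivery logs, and each message is tagged with the view in which it was delivered, which is the property replicated objects rely on to stay consistent.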

    Proteus, implemented on top of Maestro/Ensemble, is a flexible infrastructure for providing adaptive fault tolerance. Proteus makes remote objects dependable by using 1) a replicated dependability manager to make decisions regarding reconfigurations and to coordinate changes in system configurations, 2) object factories to kill and start objects and provide information to the dependability manager regarding a host, and 3) gateways that implement particular voting and replication schemes.

    The Proteus dependability manager makes decisions regarding reconfiguration based on reported faults and dependability requests from QuO, and, together with the gateways, implements the chosen fault tolerance approach. Depending on the choices made by the dependability manager, Proteus can tolerate and recover from crash failures, time faults, and value faults in application objects and the QuO runtime. Note that we do not aim to tolerate Byzantine faults, value faults in the gateway, or faults in the group communication system itself. If tolerance of more complex fault types is required, one could substitute a more secure group communication protocol (e.g., [Kih98, Rei95]) for Ensemble within the AQuA architecture.

    Object factories are used to kill and start replicated applications, depending on decisions made by the dependability manager, and to provide information regarding the host to the dependability manager.

    CORBA provides application developers with a standard interface for building distributed object-oriented applications, but does not provide a simple approach that allows applications to be fault-tolerant. The gateway provides a standard CORBA interface by translating between process-level communication, as supported by Ensemble, and IIOP messages, which are understood by Object Request Brokers (ORBs) in CORBA. In this way, CORBA-based distributed applications written for the AQuA architecture can use standard, commercially available ORBs.
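The text above says that gateways implement voting schemes and that value faults in application objects can be tolerated, but it does not spell out a particular voter. As an illustration only, the sketch below shows strict majority voting over replica replies, one common way to mask a value fault under active replication. The function name `majority_vote` is hypothetical, and the choice of majority voting is an assumption rather than a description of the voters in the AQuA gateway.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(replies: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replies, else None."""
    if not replies:
        return None
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter(replies).most_common(1)[0]
    # A strict majority means more than half of all replies agree.
    return value if 2 * count > len(replies) else None


# Three replicas answer the same request; one returns a corrupted value.
result = majority_vote([42, 42, 7])
```

With three replicas a single incorrect reply is outvoted, while a tie or a three-way disagreement produces `None`, which a caller could treat as a detected but unmasked fault.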
    In addition to providing basic reliable communication services for application objects and the QuO runtime, the gateway also provides fault tolerance using different voters and replication protocols. These services are located in the gateway handlers. Both active and passive replication of AQuA objects can be supported.

    AQuA objects are the basic units of replication in the AQuA architecture. Each one consists of a gateway, an application object, and a QuO runtime, if QuO is being used to manage the desires of the application object. In this context, the application object can be part of the distributed application itself, or part of the AQuA architecture that uses the services of a gateway (such as the dependability manager and the object factories).

    Figure 1. AQuA Architecture

    In order to provide a simple way for application objects to specify the level o