
Extending the Limits of DMAS Survivability: The UltraLog Project



SEPTEMBER/OCTOBER 2004, 1541-1672/04/$20.00 © 2004 IEEE, p. 53. Published by the IEEE Computer Society.

Dependable Agent Systems

Extending the Limits of DMAS Survivability: The UltraLog Project

Marshall Brinn, Jeff Berliner, Aaron Helsinger, and Todd Wright, BBN Technologies

Mike Dyson, Schafer Corporation

Sue Rho, Cougaar Software

David Wells, Object Services and Consulting

Multiagent systems, being distributed and autonomous, provide the potential for more efficient, component-based development approaches. The massive parallelism intrinsic to these systems promises greatly enhanced performance. However, the price of these potential benefits is often paid in survivability. Distributed multiagent systems have typically failed to provide reliable, assured performance even in well-controlled environments. Their distributed nature fills them with potential soft spots, ripe for attack and failure; their autonomous nature results in chaotic emergent behavior that is difficult to predict, let alone control.1,2 This shortcoming is compounded by unanticipated stresses and attacks in real-world situations. For this reason, many organizations are reluctant to embrace DMAS approaches.

To address these difficulties, DARPA presented a twofold challenge to the UltraLog program (www.ultralog.net), a DARPA-sponsored research project to which 15 to 20 companies and universities contributed. In 2000, DARPA's Advanced Logistics Project (ALP) completed a sophisticated, large-scale, complex DMAS for developing military logistics plans. First, the UltraLog team needed to adapt this DMAS to military use, which required hardening it to survive in stressful, even wartime, environments.

Next, DARPA challenged the team to achieve this survivability in a way that would convince other potential customers that they could also make their DMAS applications survivable. In response, we developed and applied a generic methodology for survivable DMASs.

Survivability concepts and definitions

Before we describe the UltraLog survivability approach, application, and architecture in detail, some definitions are in order. We assume that a DMAS presents some function with measurable qualities, or measures of performance (MOPs). These MOPs can vary and include performance (such as timeliness, completeness, correctness, or precision), usability (such as availability and responsiveness), and integrity (such as accountability and confidentiality). The MOPs' significance, ranging from crucial to meaningless, depends on the applications and contexts.3

A roll-up utility function of these MOPs, a scalar representing how well the system provides its aggregate system function, captures a system's utility. We call this the multiattribute utility (MAU), or the application system utility.

The application's operating environment provides many challenges that can undermine its ability to maximize system utility. We categorize such challenges, or stresses, as load (significant work demanded of the system relative to its capabilities), resource degradation (attacks or failures that degrade the system's computational or network resources), or information attacks (directed attacks on system utility, particularly to degrade its integrity and confidentiality).
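To make the roll-up concrete, here is a minimal sketch of a swing-weighted MAU computation in Java. The linear weighted-sum form and the example scores are illustrative assumptions on our part; UltraLog's actual MAU curves and swing weights were defined by an expert panel (see the "UltraLog MAUs and MOPs Overview" sidebar).

```java
// Sketch of a swing-weighted multiattribute utility (MAU) roll-up.
// The weighted-sum form is an illustrative assumption, not UltraLog's
// actual utility algebra.
class MauRollup {
    // mopScores: each MOP's utility in [0, 1], taken from its MAU curve.
    // swingWeights: the relative importance assigned to each MOP.
    // Returns the normalized weighted sum, a scalar in [0, 1].
    static double rollUp(double[] mopScores, double[] swingWeights) {
        if (mopScores.length != swingWeights.length)
            throw new IllegalArgumentException("mismatched lengths");
        double weighted = 0.0, total = 0.0;
        for (int i = 0; i < mopScores.length; i++) {
            weighted += mopScores[i] * swingWeights[i];
            total += swingWeights[i];
        }
        return total == 0.0 ? 0.0 : weighted / total;
    }
}
```

With this form, a system scoring 1.0 on a weight-0.6 MOP and 0.5 on a weight-0.4 MOP rolls up to an application system utility of 0.8.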

A system's survivability is the degree to which it maintains utility in the face of environmental stresses. A system is robust if applying stress doesn't result in permanent system utility degradation. That is, after stress, a robust system will eventually resume an acceptable level of performance. By predictable, we mean that a system will demonstrate a reliable level of system utility in a given stress environment. Assured survivability characterizes a system that is both robust and predictable.

Widespread, confident deployment of distributed multiagent systems requires assuring their survivability in stressful environments. The DARPA UltraLog program presents a methodology for making a given distributed military logistics application survivable, with general applicability for broad classes of DMASs.

The UltraLog experience

The application that we worked with in the UltraLog program plans and simulates military logistics operations. The DMAS consists of hundreds of autonomous agents (537 in 2003 and 1,092 in 2004), each representing a combat or logistics organization. As the system receives operations plans, combat organizations compute demands for particular quantities of particular goods, and the logistics organizations determine how they can best satisfy these demands. The resulting logistics plan (time-phased force and deployment data, or TPFDD) contains 250,000 individual plan elements representing demand and transportation for more than 34,000 entities of more than 200 types of major end items. The demand covers ammunition, fuel, and food; the transportation covers end-to-end movement schedules for personnel, unit equipment, and ammunition resupply. Once the operation is planned, we simulate its execution, replanning in response to changes in requirements and deviations between expected and observed behaviors.

DARPA's ALP built the application in the Cougaar agent architecture. (For more information, see the "Cougaar Essentials" sidebar.) Cougaar's component-based nature enables a successful distributed development environment. The UltraLog program manages a large test and integration facility containing hundreds of computers and virtual-private-network access to support this large distributed team.

UltraLog's stated goal is to "operate [the application] with up to 45 percent information infrastructure loss with no more than 20 percent capabilities loss or 30 percent performance loss in an environment characterized by wartime loads and directed adversary attack." Here, wartime loads represent large quantities and frequencies of tasks, plan perturbations, and queries that the system must respond to. Information infrastructure loss represents a combination of loss and degradation of nodes (processing platforms) or of connectivity. Directed adversary attacks represent attempts to undermine system confidentiality or integrity by attempting to corrupt or read (without authorization) system code, messages, or internal or persisted state.

For this goal, we measure system function degradation in terms of quality, completeness, correctness, and timeliness of planning and execution functions. (See the "UltraLog MAUs and MOPs Overview" sidebar for more details on our composed approach to quantifying system function.)

To improve system survivability, we iteratively identify and integrate two types of features: survivability enabling, which protects or recovers processing, and survivability enhancing, which improves exhibited system function in a given environment.

Survivability-enabling features

The early versions of the UltraLog prototype were simply not survivable. They couldn't provide any system function under basic categories of kinetic and information attacks. So, we developed survivability-enabling robustness and security defenses focused on:

• Persistence and rehydration. UltraLog agents support state persistence and rehydration. The infrastructure further supports reconciliation of restarting agents and preexisting agents to let agents reenter an existing society on the fly.

• Community-based monitoring and restart. We established a service in which agent or node failure is detected by missed-heartbeat detection, and agents are restarted on the same or different platforms. This mechanism relies on a set of low-level infrastructure services supporting sets of agents forming a community that commonly promotes and protects particular attributes (liveliness, in this case).

Sidebar: Cougaar Essentials

Cougaar is an open-source, distributed multiagent systems (DMAS) framework implemented in Java (see www.cougaar.org for the source code and active user community information), developed under the DARPA Advanced Logistics and UltraLog programs from 1998 to the present.

Cougaar provides a modular infrastructure, in which entities consist of cooperative subcomponents. The essential autonomous entity is the agent, which consists of many subcomponents including plugins. Multiple agents co-reside on a single node (Java virtual machine), which manages common platform resources and message traffic. A fully running DMAS society consists of many agents, typically organized in subsocieties or communities of agents.

Cougaar has been engineered for scalability and flexibility, presenting a modular interface to support configuring and composing Cougaar systems with as many agents as desired or as few services as required. The resulting architecture supports running up to 10,000 lightweight agents on a single host or a single, fully featured agent per host, with each component encapsulated by binder components that monitor, control, or restrict access to system resources. Components within an agent communicate via the blackboard for efficient asynchronous data exchange when components require heavy interaction. Figure A provides a high-level Cougaar component schematic.

Interagent messaging in Cougaar is asynchronous and handled by a full-featured, quality-of-service-enabled, componentized message transport service. Every agent can provide lightweight servlet interfaces through an embedded Tomcat Web server. Finally, agents locate each other using a DNS-style white pages and locate services using a hierarchical yellow-pages directory service for dynamic relationships. Together these features provide a scalable, configurable, robust, and transparent infrastructure for developing DMASs.

Figure A. A high-level Cougaar components schematic. (The schematic shows a node hosting agents composed of plugins, binders, and a blackboard, alongside the message transport service, servlet interface, community services, and yellow- and white-pages directory services.)

• Distributed, robust yellow pages and white pages. UltraLog uses a white- and yellow-pages infrastructure to register and look up agent and community services by name and attribute. These registries are distributed and redundant so that the service is robust to the loss of significant pieces of the underlying infrastructure.
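The community-based monitoring above hinges on missed-heartbeat detection. The following is a simplified, hypothetical sketch of that timeout logic; Cougaar's actual service is distributed across a community of agents, whereas this single-class version (with invented names) only illustrates how a stale heartbeat marks an agent for restart.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of missed-heartbeat failure detection. Class, method,
// and agent names are invented for illustration.
class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastBeat = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    // Called whenever an agent's heartbeat message arrives.
    void beat(String agentName, long nowMillis) {
        lastBeat.put(agentName, nowMillis);
    }

    // An agent whose last heartbeat is older than the timeout (or that has
    // never reported) is presumed dead and should be restarted on the same
    // or a different platform.
    boolean presumedDead(String agentName, long nowMillis) {
        Long last = lastBeat.get(agentName);
        return last == null || nowMillis - last > timeoutMillis;
    }
}
```

In a real deployment the restart decision would be taken jointly by the community, not by one monitor, so that the monitor itself is not a single point of failure.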

To prepare for the scoped set of information attacks that UltraLog will be subjected to, we established a set of security services that let the system adapt to constantly evolving cyber attacks:

• Secure execution environment. We protect the integrity of loaded code and static data with a secure bootstrapping class loader, a Cougaar-specific Java security manager, and a jar file verifier.

• Certificate authority. We established a hierarchy of redundant authorities to ensure trusted identification of users and agents within the system, including coordinated certificate update and revocation.

• Preventative measures. We implemented a series of protections against malicious information attacks, including user and component access control to UltraLog services and encryption-based protection of messages, blackboards, and persistent data.

• Policy management. We use the Knowledgeable Agent-Oriented System (KAoS) policy management system4 to establish fixed policy-based constraints on component behaviors, including which messages may be transmitted between which entities. Policies can be changed by users or adaptivity rules indicating a change in defensive posture. To enforce these policies, we leverage the Cougaar component model to establish binders on the blackboard, messaging, and user access.

• Monitoring and response service. We collect evidence of potential attacks, faults, and errors, taking corrective action in response. Gathering IDMEF (intrusion detection message exchange format) data from various components, we invoke rules to determine what response to take, ranging from logging and alerts to adaptive policy changes, including changing the encryption level.

These services give the administrators a real-time understanding of the operational threat situation, automating rapid reconfiguration of existing countermeasures and providing the ability to deploy new countermeasures.

Survivability-enhancing features

Having provided UltraLog the necessary basic enabling features, we focused in later iterations on raising the level of survivability possible under stress. Enhancing UltraLog's measurable utility required adding generic features in the infrastructure as well as specific features in the logistics application.

Infrastructure enhancements. UltraLog encourages components to provide multiple implementations of the same interface, providing different trade-offs of resource consumption versus utility. These different implementations, operational modes (OpModes), are available for real-time adaptivity. Each agent has an adaptivity engine that runs a set of adaptivity rules called playbooks to support adaptive selection of the desired OpModes in particular environmental conditions.5

Sidebar: UltraLog MAUs and MOPs Overview

The UltraLog program takes a unique approach to measuring system survivability. We encountered unique problems in determining the accuracy of a logistics plan of the magnitude of a major military operation. No one had ever constructed a plan to that scale (180 days of deployment) or fidelity (detailed down to the individual box and truck), nor was there a plan to use as a baseline for comparison.

Our solution was to build a hierarchical measurement framework based on multiattribute utility (MAU) functions.1 Using a panel of logistics experts, we built a survivability hierarchy from two attributes: capability and performance (see Table A). Capability represents how complete, correct, and secure the logistics plan was. Performance represents the time it took to create or revise the plan and present it to an operator. The functional expert panel further subdivided the hierarchy to account for attributes logistics experts deemed essential, such as the completeness and correctness of supply and transportation tasks as well as confidentiality and accountability. At the lowest hierarchy level, we developed MAU curves to establish each measurement's utility to the logistics operator.

The panel then considered trade-offs between measures and assigned a swing weight to each to roll up into an overall score. By this process of expert-driven measures-of-performance (MOPs) definition and utility assignment, we extracted a measurable, objective definition from extremely subjective notions of defining and measuring system utility.

Reference

1. J. Ulvila, J. Gaffney, and J. Boone, A Preliminary Draft Multiattribute Utility Analysis of UltraLog Metrics, Decision Science Associates, 2001.

Table A. Summary of high-level UltraLog MOPs and associated swing weights, with UltraLog survivability at a swing weight of 1.0.

Measures of performance | Swing weight
1. Planning and replanning | 0.41
  1-1. Plan completeness | 0.41
  1-2. Plan correctness | 0.39
  1-3. Presentation completeness | 0.10
  1-4. Presentation correctness | 0.10
2. Confidentiality and accountability | 0.17
  2-1. Memory data available | 0.16
  2-2. Disk data available | 0.16
  2-3. Transmission data available | 0.31
  2-4. User actions counter to policy | 0.21
  2-5. User actions recorded | 0.04
  2-6. User violations recorded | 0.12
3. Performance | 0.42
  3-1. Time to compute plan or replan | 0.80
  3-2. Time to present | 0.20
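A playbook rule of the kind described above can be sketched as a mapping from sensed conditions to operational modes. The mode names and thresholds below are invented for illustration; UltraLog's playbooks are richer rule sets run by each agent's adaptivity engine.

```java
// Illustrative single "play": degrade processing fidelity as free memory
// shrinks, trading utility for resource consumption. Mode names and
// thresholds are hypothetical.
class AdaptivityEngine {
    enum OpMode { FULL_FIDELITY, REDUCED_FIDELITY, MINIMAL }

    // freeMemoryFraction: sensed fraction of free memory in [0, 1].
    static OpMode selectMode(double freeMemoryFraction) {
        if (freeMemoryFraction > 0.5) return OpMode.FULL_FIDELITY;
        if (freeMemoryFraction > 0.2) return OpMode.REDUCED_FIDELITY;
        return OpMode.MINIMAL;
    }
}
```

A real playbook would combine many such rules over many sensed conditions (load, connectivity, threat posture) and arbitrate among them rather than switch on a single measurement.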

UltraLog also has various defenses that sense system state and attempt to repair or proactively defend the society, thus letting a society survive in the presence of numerous physical and cyber threats. Unfortunately, a given situation might cause multiple symptoms. These might in turn automatically trigger one or more defenses, which can cause conflicts due to resource limitations or destructive interaction. The defense coordinator oversees and orchestrates correlating and organizing sets of defensive actions that might pertain in a given situation. To achieve the desired flexibility, the coordinator relies on models of asset types (for example, agents, hosts, and networks), assets, sensors, actuators, and threats from the environment.6

Application enhancements. An example of an attribute enhancing system survivability is opportunistic planning: taking advantage of (and making do with) available resources to improve performance. For example, in UltraLog logistics planning, we develop a quick plan that requires relatively few resources to compute and store. As resources become available, we compute a more detailed plan incrementally in the background. On demand, we can provide a complete plan with as much detail as possible, without having to explicitly determine how good a plan we can support currently.

One area leading to considerable unstable and unpredictable behavior is ringing, or feedback loops among agents within the application. Although these interactions might reflect actual interactions in the modeled domain, we want to identify and eliminate them where possible. In certain cases, we were able to streamline protocols between agents to a single deterministic transaction, allowing for more reliable and efficient behavior.

We identified several agents whose processing was beginning to represent a memory and processing bottleneck. These bottlenecks reflected the agents' centrality in the application workflow. By refactoring the application-level roles and relationships, we broke up these agents into subagents and distributed their functions across other processing platforms, thus providing greater utility from the same hardware baseline.

The application provides many different predictors, which are components providing low-level models of downstream processing. Using these, the system can continue processing and provide some degree of function with rule-of-thumb estimates when connectivity to other service-provider agents is temporarily unavailable.
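The predictor pattern can be sketched as a local fallback wrapped around a remote provider. The interface, names, and fuel-demand figures here are hypothetical; the point is that a rule-of-thumb model answers when the service-provider agent is unreachable, so processing continues at reduced fidelity.

```java
// Sketch of the predictor pattern: fall back to a rule-of-thumb model
// when the service-provider agent is unreachable. All names and rates
// are invented for illustration.
class FuelPredictor {
    interface Provider { double dailyFuelDemand(int vehicles); } // remote agent

    private final Provider provider;        // may be unreachable
    private final double gallonsPerVehicle; // rule-of-thumb rate

    FuelPredictor(Provider provider, double gallonsPerVehicle) {
        this.provider = provider;
        this.gallonsPerVehicle = gallonsPerVehicle;
    }

    double estimate(int vehicles) {
        try {
            return provider.dailyFuelDemand(vehicles); // authoritative answer
        } catch (RuntimeException unreachable) {
            return vehicles * gallonsPerVehicle;       // low-level local model
        }
    }
}
```

When connectivity returns, downstream consumers can reconcile the rough estimate against the provider's authoritative answer.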

Integration and assessment

The UltraLog program operates on an annual cycle of design, build, integrate, and assess. Each year, we take the outcome of the previous year's assessment and attempt to improve the system's survivability on the basis of the most critical vulnerabilities and inefficiencies we find.

We can't critically assess an application's survivability without providing a stress environment of known characteristics so we can measure system utility in a reliable, efficient manner. We built an elaborate test and evaluation framework and lab environment for this purpose. (See the "UltraLog Test and Assessment Environment" sidebar for more details.)

We ran a formal assessment of the UltraLog prototype to determine its survivability and to identify vulnerable areas. We have conducted this assessment process annually on the evolving prototype, letting us measure progress and determine what areas to develop for the subsequent year. The assessment consists of three independent efforts:

• A red team assessment to identify absolute vulnerabilities in security and robustness

• A functional assessment to determine how UltraLog functionality (accuracy, adequacy of planning, and usability of interfaces) degraded under attack

• A quantitative assessment to determine levels of survivable system function under stress

Assessment results

We collected baseline measurements for comparison using the UltraLog system in an unstressed environment. As we applied stresses, we measured each parameter and calculated a survivability score using the algebra we described earlier. Figure 1 shows results of over 200 experiments run in early 2004. Stresses spanned a range of computer and network failures that destroyed computers and nodes, cut communications, or degraded bandwidth, CPU, and memory.
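The reported score is the stressed measurement relative to the unstressed baseline, as in Figure 1. A minimal sketch follows; the simple ratio form and the cap at 1.0 are our illustrative assumptions, not UltraLog's full scoring algebra.

```java
// Sketch of baseline-relative scoring: stressed MAU as a fraction of the
// unstressed baseline MAU. The ratio form and the 1.0 cap are assumptions
// made for illustration.
class SurvivabilityScore {
    static double relative(double stressedMau, double baselineMau) {
        if (baselineMau <= 0.0)
            throw new IllegalArgumentException("baseline must be positive");
        return Math.min(1.0, stressedMau / baselineMau);
    }
}
```

For example, a stressed utility of 0.72 against a baseline of 0.90 yields a survivability score of 0.8.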

Table 1 shows our annual technical progress under the UltraLog program toward our goal of a survivable DMAS under harsh environments of wartime loads and directed kinetic and information attacks (see www.ultralog.net for detailed assessment results).

Sidebar: UltraLog Test and Assessment Environment

After developing and integrating distributed multiagent system (DMAS) software components, measuring how well a large system of distributed agents performs is itself a challenge. Our work in testing and evaluating systems wouldn't be possible without extensive automation support for running numerous repeatable experiments in a controlled environment. This lets us statistically evaluate the experimental results. To this end, we developed the Automated Configuration Management Environment (ACME).1

ACME is a distributed agent system that runs in parallel to the UltraLog DMAS and manages the UltraLog system's computers and networks. It manages the initialization of the UltraLog DMAS, applies survivability stresses, and collects measurements of the DMAS under test. Each computer used for UltraLog testing runs an ACME agent that is populated with plugins providing the necessary testing and measurement services. These services include Cougaar agent initialization, log file management, and network bandwidth shaping. A single script directs the ACME agents participating in a given experiment, generating events and issuing commands at the appropriate times.

We implemented ACME agents and scripts using a lightweight language, letting testers, developers, and managers understand the experiment without the ambiguity common in large-system testing. Furthermore, we developed an UltraLog project dashboard that continuously shows the status of all software components in terms of compilation status, code size, static code quality, and design quality metrics (see http://pmd.sourceforge.net).

Reference

1. A. Helsinger et al., "Tools and Techniques for Performance Measurement of Large Distributed Multiagent Systems," Proc. 2nd Int'l Conf. Autonomous Agents and Multiagent Systems (AAMAS), ACM Press, 2003, pp. 843–850; http://portal.acm.org/citation.cfm?id=860711.

A survivable methodology

From our experiences in UltraLog, we abstracted approaches and lessons learned into a methodology (see Figure 2) for iteratively assessing and improving DMAS survivability. At a high level, it forms a simple iterative loop: identify the problem (environment and metrics), then iteratively identify improvements and assess their impact. As we'll show, considerable complexity underlies the simplicity of this iterative loop. Our experience tells us that, in general, you can't engineer and assure DMAS survivability in a single design pass. Rather, we support an approach that allows for incrementally identifying inevitable shortcomings and integrating potential corresponding enhancements.

Figure 2 also illustrates two higher-level iterative loops (in dashed arrows). Defining measurements is an iterative process in that capturing the appropriate high-level utility from low-level observables is often imprecise and requires repeated revisiting. Defining the stress environment is also an iterative process because the ability to characterize stresses, add new stress types, or change expected frequencies and magnitudes of expected stresses requires revisiting the entire survivability strategy.

Defining the stress environment

We characterize the environment in which an application must operate in terms of system load and the resources available to accomplish the work. We can express load in terms of frequency and magnitude of tasks, queries to answer, and so forth. Resources represent available computation and communication capabilities either at a low level (memory, bandwidth, and CPU) or at a more specialized high level (such as server access).

Stresses represent changes to the required workload or available resources relative to some assumed unstressed baseline. For our purposes, we don't need to distinguish between benign stresses and those due to adversary action; that is, normal upsurges during wartime or occasional hardware failure versus denial-of-service-motivated, false-tasking, or kinetic attacks on network resources. From a survivability point of view, we need to understand how these changes in load and resource availability impact system function in the short and long term.

Furthermore, we must characterize these stresses' distribution frequencies and magnitudes. Configuring DMASs in a survivable manner requires defining some expectations on the bounds on these stresses so that we can deploy hardware resources and redundancy accordingly. We can't expect a DMAS configuration to survive an arbitrarily large set of stresses (such as a stress that kills all its resources at once or provides an infinite queue of tasks to perform). Appropriately characterizing the stresses should suffice to fully simulate the stress environment, including instantaneous magnitudes and average bounds.

Defining MOPs, MAUs, and goals

Perhaps the most important step toward assuring survivability is defining it properly. We must adequately characterize the nature of the function that we wish to protect in the face of stress so that we can assess our progress and the contribution of different enhancements toward that survivability. We must set these definitions with great care and revisit them periodically because they determine all future efforts to achieve survivability. Notably, only those attributes that we measure and that contribute to the overall system utility are relevant for the purposes of assuring or improving survivability.

We must define MOPs that characterize how well the system is doing in some dimension that leads directly to user utility. Beyond describing the measure (for example, the time to complete a particular task), we must detail a procedure for computing this measure from available, observable data within the running DMAS. This can be a key task because the observable data might not provide adequate insight into the intended behavior, and we might need to adjust either the MOP or system visibility to ensure that the procedure actually captures the spirit of the MOP.
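As an illustration of such a procedure, a macro MOP can be computed directly from observable event data; for example, the fraction of tasks in a processing stream completed within a deadline. The MOP itself and its names are hypothetical, not one of UltraLog's defined measures.

```java
// Hypothetical macro MOP: fraction of observed tasks in a processing
// stream completed within a deadline, computed from per-task durations
// that the running DMAS can observe directly.
class CompletionRateMop {
    static double onTimeFraction(long[] durationsMillis, long deadlineMillis) {
        if (durationsMillis.length == 0) return 0.0; // no observations yet
        int onTime = 0;
        for (long d : durationsMillis)
            if (d <= deadlineMillis) onTime++;
        return (double) onTime / durationsMillis.length;
    }
}
```

Because it aggregates over the whole stream rather than scoring any single task, this measure stays stable from run to run, which is exactly the property the lessons below argue for.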

Here are some lessons learned from our experience in defining successful (and unsuccessful) MOPs:

• MOPs should be, as much as possible, measurable and visible to the system in real time (as opposed to requiring offline postprocessing). In this way, we can support real-time adaptivity.

• MOPs should capture macro or aggregate behaviors as opposed to micro or individual behaviors. The nature of DMASs is such that individual task or message processing might vary dramatically within a run or, for a given task, from run to run. MOPs that attempt to capture how well a particular task was processed might suffer from nondeterminism. Macro MOPs that capture a processing stream's behavior, however, will tend to be stable and repeatable.

• MAU curves should avoid stepwise functions. In such cases, no additional utility is awarded for achieving additional MOPs, making incremental improvements through a sensitivity-analysis approach difficult.

Figure 1. Over a range of computer and network failures of over 45 percent, UltraLog demonstrates continuous performance: (a) greater than 80 percent capability and (b) greater than 70 percent performance (time to plan). In both panels the x-axis is infrastructure loss (percent); the y-axis is capability score in (a) and time to plan (minutes) in (b). The dotted lines represent the target/requirement values, and the solid line represents the stressed score relative to the unstressed baseline.

Identifying and developing enhancements




Figure 2. Iterative survivable DMAS methodology schematic (dotted arrows denote higher-level iterative loops). The schematic's stages: define stress environment; define measures of performance, multiattribute utility, and goals; identify and develop enhancements (robustness, utility, assessibility) to the infrastructure and application; and integrate and assess.

Table 1. UltraLog technical progress.

Category    | 2000 | 2001 | 2002 | 2003 | Stress                     | Technology approach
Scalability | Fail | Okay | Okay | Pass | Wartime loads 1            | Variable-fidelity processing
            | Fail | Fail | Okay | Okay | Wartime loads 2            | Load balancing
            | Fail | Fail | Okay | Pass | Wartime loads 3            | Dynamic reconfiguration and problem
            | Fail | Fail | Okay | Okay | Thrashing                  | Stability guards, distributed control
            | Fail | Okay | Okay | Okay | Scaling of nodes           | Distributed-adaptivity engine
            | Fail | Fail | Okay | Okay | Scaling logistics problem  | Service discovery
Security    | Okay | Pass | Pass | Pass | Fraudulent, untrusted code | Secure Java class loader, signed class
            | Fail | Pass | Pass | Pass | Untrusted communications   | Prohibition policies, Public Key Infrastructure
            | Fail | Pass | Pass | Pass | Insecure/dangerous         | Java security manager, monitor and response (M&R) subsystem
            | Fail | Fail | Okay | Pass | Corruption of persisted    | Encryption, certificate infrastructure
            | Fail | Okay | Okay | Okay | Unauthorized processing    | Access control binders, policies, rovers
            | Okay | Okay | Okay | Pass | Unexpected plug-in         | Technical specification binder protection
            | Okay | Pass | Pass | Pass | Component masquerade       | Certificate-based authentication
            | Fail | Okay | Okay | Pass | Compromised agents 1       | Component isolation
            | Fail | Okay | Okay | Pass | Compromised agents 2       | Identity and certificate leasing
            | Fail | Okay | Okay | Okay | Intrusion                  | Certification revocation, component isolation, M&R
            | Fail | Fail | Okay | Okay | Compromised                | Random routing and active port
            | Fail | Fail | Okay | Okay | Snooping                   | Traffic masking, trust models, user authentication
            | Okay | Pass | Pass | Pass | Message intercept          | Encryption, authentication, PKI
Robustness  | Fail | Okay | Pass | Pass | Processing failure 1       | Liveliness checking, persistence
            | Fail | Okay | Okay | Pass | Processing failure 2       | Role-based restoration
            | Fail | Okay | Pass | Pass | Network failure 1          | Predictors
            | Fail | Okay | Pass | Pass | Network failure 2          | Multiple message transports
            | Fail | Okay | Okay | Pass | Processing contention      | Thread and resource management
            | Fail | Okay | Okay | Pass | DOS attack                 | Compression, multiple communication


Having defined the problem, the next step is to iterate through a development cycle: develop, test, enhance, and repeat. In each iteration, we identify the most important factors requiring additional enhancements. The factors underlying the DMAS's survivability fall into three categories: robustness, utility, and assessibility.

Robustness. Vulnerabilities limit a given DMAS's robustness: they are holes or failure modes that prevent the system from remaining robust. Establishing a system's survivability requires that we develop and deploy a series of defenses. These defenses must maintain an environment in which the application can operate under stress. They tend to fall into the following overall categories:

• Avoid. Take action to prevent a given stress or attack from having any negative impact on system utility.

• Detect. Determine that a given stress or attack is under way.

• Contain. Dampen the impact or spread of the stress or attack across the DMAS society.

• Recover. Restore the computation environment to the application.

A robust system, of course, must avoid any single points of failure. To that end, UltraLog has emphasized that multiple ways of achieving a given goal should exist, either through multiple distributed implementations of a given capability or different implementations of a similar capability. Our service discovery and leasing capability allows for continually finding live services without dependency on particular fixed servers. Furthermore, our community structure lets the system dynamically manipulate configurations in a manner that is transparent to the rest of the society.

Asynchronous messaging, as in Cougaar, allows for reliable communications in the face of agent loss or periodic communications failures. We interpose predictors at time-critical communication points so that the system always has access to low-fidelity solutions in case more detailed answers are temporarily unavailable.
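A rough sketch of such a predictor interposition might look like the following, where a query falls back to a cheap local estimate if the detailed answer doesn't arrive within a deadline. The interface and names are hypothetical, not the actual UltraLog predictor API.

```java
// Hypothetical sketch of a predictor interposed at a time-critical
// communication point; names are illustrative, not the UltraLog API.
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class PredictorGateway {
    interface Predictor { double estimate(); } // cheap, local, low fidelity

    private final ExecutorService exec = Executors.newSingleThreadExecutor();

    // Ask for the detailed (possibly remote) answer, but fall back to the
    // predictor's low-fidelity estimate if it misses the deadline.
    double query(Callable<Double> detailedCall, Predictor predictor, long timeoutMs) {
        Future<Double> f = exec.submit(detailedCall);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS); // detailed answer
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            f.cancel(true);              // stop waiting on the detailed answer
            return predictor.estimate(); // low fidelity, but always available
        }
    }

    void shutdown() { exec.shutdownNow(); }
}
```

The caller always gets some answer within the deadline, which is the property the text describes: degraded fidelity rather than a stalled computation.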

Utility. By identifying opportunities to marginally improve utility, we can raise the overall system survivability. These enhancements include the following categories:

• Efficiency. Take advantage of the resources available to provide measurable utility. Furthermore, establish mechanisms to avoid wasted effort due to interdefense conflicts or mechanisms that don't contribute to a MOP.

• Adaptivity. Add new options for resource and utility trade-offs, and add new capabilities to manage these trade-offs in different circumstances.

• Resources. Alleviate points wherein the application is hardware-bound, and augment where possible.

Having many ways to do the same thing is a key enabler of adaptivity. Where possible, multiple implementations of the same interface (within the same component or multiple components) allow for different trade-offs of resources consumed for a function produced in different circumstances. Of course, the infrastructure must be able to efficiently manage these multiple candidates for adaptivity, selecting and invoking the right mode in the right circumstance.
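As an illustrative strategy-style sketch (not Cougaar's actual component model), the code below shows multiple implementations of one interface with a trivial "adaptivity engine" choosing among them by resource state. The thresholds and names are arbitrary assumptions.

```java
// Illustrative sketch, not Cougaar code: multiple implementations of one
// interface, selected at runtime by a trivial resource-aware policy.
class AdaptivePlanner {
    interface PlanStep { String run(); }

    static final PlanStep HIGH_FIDELITY = () -> "detailed plan"; // costly, precise
    static final PlanStep LOW_FIDELITY  = () -> "coarse plan";   // cheap, approximate

    // A minimal "adaptivity engine": invoke the right mode for the
    // circumstance, trading answer fidelity for resource headroom.
    static PlanStep select(double cpuLoad, boolean networkUp) {
        return (cpuLoad < 0.7 && networkUp) ? HIGH_FIDELITY : LOW_FIDELITY;
    }
}
```

The callers depend only on the `PlanStep` interface, so the selection policy can grow more sophisticated without touching them.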

Assessibility. To encourage assessibility, we established the following requirements for DMASs:

• Reliability. The system must support repeatability to let us assess the system function in a given environment over a long set of runs.

• Visibility. The system must provide defined and required MOPs and other key internal states.

• Divisibility. The system must allow for subdivision into subsystems and subproblems for further analysis.

The unstable interactions of multiple queuing systems make reproducible, predictable behavior difficult to achieve. We address this in several ways. First, we attempt to measure macrolevel MOPs, which tend to have greater stability. Next, we acknowledge the existence of low-level chaotic behaviors but attempt to minimize the impact of this chaos on overall system performance. We seek to eliminate component behaviors that act as amplifiers of small variability into large system fluctuations. We've also tried to limit timer-based batching of task flows.

In the area of visibility, we seek to assure metric-gathering scalability by limiting the number and scope of aggregate (cross-agent) MOPs. Furthermore, we seek to aggregate macro MOPs within an agent prior to cross-agent aggregation. We make available various low-level system performance and state metrics to all component levels in support of different forms of real-time adaptivity.1
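The two-level aggregation might be sketched as follows: each agent summarizes its own raw samples locally, and only per-agent summaries cross the network for the society-level rollup. This is a hedged illustration with invented names; the real metrics service differs.

```java
// Hedged sketch of two-level MOP aggregation: summarize within an agent
// first, then roll per-agent summaries up across agents, so only one
// number per agent crosses the network. Not the real metrics service.
import java.util.List;
import java.util.Map;

class MopAggregation {
    // Local, per-agent aggregation over raw samples.
    static double agentMean(List<Double> samples) {
        double sum = 0.0;
        for (double s : samples) sum += s;
        return samples.isEmpty() ? 0.0 : sum / samples.size();
    }

    // Society-level rollup over per-agent means, weighted by sample count
    // so the combined mean equals the mean over all raw samples.
    static double societyMean(Map<String, Double> agentMeans, Map<String, Integer> counts) {
        double sum = 0.0;
        long n = 0;
        for (Map.Entry<String, Double> e : agentMeans.entrySet()) {
            int c = counts.get(e.getKey());
            sum += e.getValue() * c;
            n += c;
        }
        return n == 0 ? 0.0 : sum / n;
    }
}
```

Weighting by sample count keeps the rollup exact for means; for other macro MOPs (medians, percentiles) the per-agent summary must carry more structure, which is one reason to limit the number of cross-agent MOPs.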

In the area of decomposition for scalability, such substructures can and should exist in multiple dimensions. The application often presents multiple threads representing particular workflows or domain roles and relationships (for example, the ammunition distribution thread within UltraLog). The infrastructure presents different threads of support services (such as the yellow pages or policy subsystems). Furthermore, Cougaar's component model imposes a natural hierarchical system breakdown into communities, nodes, agents, and ultimately components.

Integrating and assessing

Once we have a candidate enhancement or set of candidate enhancements to the application or infrastructure, we must integrate them into the full DMAS application, deploy them to the distributed-resource environment, and assess them. To assure reliable performance, the assessment must let us conduct multiple runs of the DMAS in a representative stressful environment. We must be able to capture MOP and MAU rollups to assess overall system survivability in that environment. We must also capture additional logged information pertaining to failures, resource consumption, or system performance for postanalysis of the integrated system's performance and allocation of credit and blame to subsystems and components. With these results in hand, we identify the vulnerabilities and bottlenecks.

Our experience with the ACME testing framework (see the "UltraLog Test and Assessment Environment" sidebar) showed that performing multiple runs with the same parameters was invaluable in characterizing DMAS behaviors. This let us refine our MOP definitions and measurement procedures by determining which behaviors were and which were not stable and reliable across many runs. Furthermore, we were able to perform sensitivity analysis by modifying stress values and configuration parameters to measure corresponding survivability behaviors, thus pointing to areas for further analysis and likely performance enhancement.

Beyond UltraLog

Having described this methodology, we can return to an earlier question: What other DMASs will this methodology work for? A high-level review might help illustrate the key attributes required for successfully applying the methodology. We contend that this methodology relies on essential elements of the agent paradigm, namely goal orientation and autonomy. Other aspects of the methodology require attributes you might not normally associate with agents.

We impose a survivability goal on the DMAS and enlist the agents to help ensure that goal. We instill perseverance across the infrastructure, from message delivery to failure detection and repair. Agents seek to reconcile state with one another upon restart. The MOP and MAU infrastructure, coupled with the proactive, reactive, and adaptive capabilities, forms a virtual control loop that strives to maintain and improve system utility at all times.

Agent autonomy is essential to this behavior, removing dependencies on fixed hardware (agents are mobile to any available platform) or other agents. Agents are anonymous and identified by the service descriptions they advertise, rather than their identities or locations. Relationships among agents are discovered dynamically, and all interactions are based on a loosely coupled, asynchronous message-based protocol. Agents don't expose their internal state to other agents. Any agent acts as a service provider to any other, masking potentially different implementations.

Our methodology is, essentially, a DMAS realization of the standard debugging and tuning process familiar to developers of single-threaded applications. This process depends on the ability to establish reliable expectations, compare those with system observations, and subdivide the problem into smaller subproblems for further analysis and repair of failure or inefficiency.

We characterized these attributes earlier as reliability, visibility, and divisibility. Without some minimal level of these attributes, such analysis is impossible. Beyond that minimum, the greater the reliability, visibility, and divisibility a DMAS exhibits, the more readily and efficiently it can participate in this process. Requiring these attributes is easier said than done, to be sure. Our experience in



The Authors

Marshall Brinn is a principal engineer at BBN Technologies. He is the survivability architect on the DARPA UltraLog program. His research interests include the control and modeling of emergent behaviors. He received his MS in applied mathematics from Harvard University. He is a member of the ACM. Contact him at [email protected].

Jeff Berliner is a division scientist at BBN Technologies, where he is the technical director of the Distributed Systems and Logistics Department. His research interests include modeling and predicting emergent behaviors of individuals and organizations, and building applications that help manage and control these behaviors. He received his PhD in electrical engineering from the Massachusetts Institute of Technology. Contact him at [email protected].

Aaron Helsinger is an engineer at BBN Technologies, focusing for the past four years on developing the Cougaar agent architecture. His interests include management and configuration of complex systems, distributed-system security, and data systems integration and scalability. He holds a BA in physics from Harvard University. Contact him at [email protected].

Todd Wright is a software engineer in the Distributed Systems and Logistics Department at BBN Technologies. He is a key infrastructure contributor to the Cougaar peer-to-peer multiagent system and UltraLog distributed-logistics application. He holds an MS in computer science from the University of Massachusetts, Amherst. Contact him at [email protected].

Mike Dyson is a program manager at the Schafer Corporation. He supports development and execution of research projects within DARPA. He is supporting the UltraLog program in the assessment and testing of survivable distributed-agent software architectures for military logistics. He holds a BS in electrical engineering from the University of Maryland. Contact him at [email protected].

Sue Rho is a director at Cougaar Software. She is also a principal investigator for the UltraLog program, responsible for incorporating adaptive security into the UltraLog system. Her primary interest is in providing adaptive security for highly distributed systems. She received her MA in mathematics from the University of Southern California. Contact her at [email protected].

David Wells is a founder of Object Services and Consulting. His research interests include cryptography, computer security, database systems, agents, and self-healing systems. He has received four US patents and was the architect of the Open Object-Oriented Database system. He holds a DEng from the University of Wisconsin, Milwaukee. Contact him at [email protected].


making UltraLog survivable shows that it is possible, by augmenting the traditional set of agent attributes.

Our future efforts under the remainder of the UltraLog program will focus on developing quantitative predictive models with which we can determine how survivable a given DMAS configuration can become in a given stress environment. In addition, we're exploring the use of logical models to identify vulnerabilities in robustness and security protocols that might lead to failure modes. We also have a final iteration of improvements and assessment slated for our prototype, from which we expect to draw another round of lessons learned for broader DMAS survivability.

We did a substantial amount of our survivability work at the Cougaar layer, leaving relatively well-defined tasks at the application layer to establish its survivability. We therefore recommend Cougaar and its UltraLog extensions and tools as a solid platform on which to build applications and strive for enhanced survivability.

Acknowledgments

DARPA sponsored this work under contracts MDA972-01-C-0025, MDA972-01-C-0028, NBCHC010011, and N00174-02-D-0015.

References

1. M. Brinn and M. Greaves, "Leveraging Agent Properties to Assure Survivability of Distributed Multi-Agent Systems," Proc. 2nd Int'l Conf. Autonomous Agents and Multiagent Systems (AAMAS), ACM Press, 2003, pp. 946-947; http://portal.acm.org/citation.cfm?id=860738.

2. J. Stafford and J. McGregor, "Issues in Predicting the Reliability of Composed Components," Proc. 5th ICSE Workshop Component-Based Software Eng., 2004; www.sei.cmu.edu/pacc/CBSE5/StaffordMcGregor-cbse5.pdf.

3. S. Dietrich and P. Ryan, "The Survivability of Survivability," Proc. 4th Information Survivability Workshop (ISW 2001/2002), 2002; www.cert.org/research/isw/isw2001/papers/.

4. J. Bradshaw et al., "KAoS Policy and Domain Services: Towards a Description Logic Approach to Policy Representation, Deconfliction and Enforcement," Proc. IEEE 4th Int'l Workshop Policies for Distributed Systems and Networks, 2003; http://ontology.coginst.uwf.edu/Papers/KAoS_Policy_Service_2003.pdf.

5. K. Kleinmann, R. Lazarus, and R. Tomlinson, "An Infrastructure for Adaptive Control in Multi-Agent Systems," Proc. IEEE Conf. Knowledge-Intensive Multi-Agent Systems (KIMAS), 2003.

6. D. Wells et al., "Adaptive Defense Coordination for Multi-Agent Systems," Proc. 2004 IEEE Multi-Agent Security and Survivability Conf., 2004; www.cs.drexel.edu/mass2004.
