
Distributed Computer Architectures for Combat Systems

Mark E. Schmid and Douglas G. Crowe

Advances in computer hardware, networks, and support software have brought distributed computing to the forefront of today's Navy combat systems. An effort known as the High Performance Distributed Computing (HiPer-D) Program was established in 1991 to pursue advantageous applications of this technology to the Navy's premier Aegis cruiser and destroyer combatants. This article provides an overview of the program accomplishments and presents a brief discussion of some of the intercomputer communications issues that were paramount to the program's success. Most significant was the credibility built for distributed computing within the Aegis community. This was accomplished through many demonstrations of combat system functions that exhibited capacity, timeliness, and reliability attributes necessary to meet stringent Aegis system requirements.

INTRODUCTION

In the 1950s, the Navy embarked on its course of developing standardized computers for operational use. The AN/UYK-7, AN/UYK-20, AN/UYK-43, and AN/UYK-44 are still the primary combat system computers deployed today (Fig. 1). They function at the heart of the Navy's premier surface ships, merging information from sensor systems, providing displays and warnings to the crew, and controlling the use of weapons. These computers are wholly Navy developed. The processing units, interfaces, operating system, and support software (e.g., compilers) were all built by the Navy and its contractors. The AN/UYK-43 is the last in the line of Navy standard computers (oddly enough going into production after its smaller sibling, the AN/UYK-44). Described in 1981 as "the new large scale computer . . . expected to be in service for the next 25 years,"1 it will actually see a service life beyond that.

The Navy Standard Computer Program was successful at reducing logistics, maintenance, and a host of other costs related to the proliferation of computer variants; however, in the 1980s, the Navy faced a substantial increase in the cost of developing its own machines. Pushed by leaps in the commercial computing industry, everything associated with Navy Standard Computer development—from instruction set design to programming language capabilities—was becoming more complex and more expensive.

By 1990, it was clear that a fundamental shift was occurring. The shrinking cost and dramatically growing performance of computers were diminishing the significance of the computer itself. The transformation of centralized mainframe computing to more local and responsive minicomputers was bringing about the proliferation of even lower-cost desktop computers. Microprocessors had indeed achieved the performance of their mainframe predecessors just 10 years earlier, but this was only half of the shift! Spurred by new processing capabilities, system designers were freed to tackle increasingly complex problems with correspondingly more intricate solutions. As a new generation of complex software took shape, the balance of computer hardware and software investments in Navy systems cascaded rapidly to the software side.

Harnessing the collective power of inexpensive mainstream commercial computers for large, complex systems was clearly becoming prominent in Navy combat system design. The highly successful demonstration in 1990 of the Cooperative Engagement Capability (CEC) prototype, built with over 20 Motorola commercial microprocessors, was a milestone in distributed computing for Navy combat system applications. Developments in both Aegis and non-Aegis combat systems focused thereafter on applying the capabilities of distributed computing to virtually all of the Navy's prototype efforts. Distributed computing offered the potential for providing substantially more powerful and extensible computing, the ability to replicate critical functions for reliability, and the simplification of software development and testing through the isolation of functions in their own computing environments. Indeed, when the Navy's Aegis Baseline 7 and Ship Self-Defense System (SSDS) Mk 2 Mod 1 are delivered (expected in 2002), a commercial distributed computing infrastructure will have been established for all new builds and upgrades of the Navy's primary surface combatants (carriers, cruisers, and destroyers).

Distributed computing for combat systems, however, is not without its difficulties. Requirements for systems such as Aegis are stringent (response times, capacities, and reliability) and voluminous (dozens of independent sensor systems, weapon systems, operator interfaces, and networks to be managed), and the prior large body of design knowledge has been focused on Navy Standard Computer implementation. (As an example, the Aegis Combat System had a small number of AN/UYK-7 [later AN/UYK-43] computers. Each represented an Aegis "element" such as the Weapon Control System, Radar Control System, or Command and Decision System. Within an element, interactions among subfunctions were carried out primarily via shared memory segments. These do not map well to a distributed computing environment because the distributed nodes do not share any physical memory.)

Among the most fundamental issues facing this new generation of distributed computing design was the pervasive need for "track" data to be available "everywhere"—at all the computing nodes. Tracks comprise the combat system's best assessment of the locations, movements, and characteristics of the objects visible to its sensors. They are the focal point of both automated and manual decision processes as well as the subject of interactions with other members of the battle group's networks. Ship sensors provide new information on each track periodically, sometimes at high rate. While the reported information is relatively small (e.g., a new position and velocity), the frequency of the reports and the large number of objects can combine to yield a demanding load. The number of track data consumers can easily amount to dozens and place the overall system communication burden in the region of 50 to 100,000 messages per second. (Figure 2 shows a scan with many tracks feeding messages forward.)
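
To make the scale concrete, a back-of-the-envelope calculation shows how track count, per-track report rate, and consumer fan-out multiply; the numbers below are illustrative assumptions, not HiPer-D measurements.

    # Illustrative fan-out arithmetic (assumed numbers, not measured HiPer-D values).
    tracks = 500            # objects currently held as tracks
    reports_per_track = 2   # sensor updates per track per second
    consumers = 25          # downstream clients that receive each update

    sensor_msgs_per_sec = tracks * reports_per_track           # 1,000 inbound reports/s
    delivered_msgs_per_sec = sensor_msgs_per_sec * consumers   # 25,000 outbound messages/s
    print(sensor_msgs_per_sec, delivered_msgs_per_sec)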

Although commercial machines and networks have advanced admirably in their ability to communicate data across networks, the combat system applications have some unusual characteristics that make their needs substantially different from the commercial mainstream. The track information that comprises the bulk of the combat system communications consists of many small (a few hundred bytes) messages. These messages can be independently significant to the downstream consumers, with varying requirements to arrive "quickly" at particular destinations. Commercial systems are not tuned to handle large numbers of small messages; they focus on handling the needs of the mass market, which are typically less frequent and larger sets of data.

HIGH PERFORMANCE DISTRIBUTED COMPUTING

The HiPer-D Program originated as a joint Defense Advanced Research Projects Agency (DARPA)/Aegis (PMS-400) endeavor, with three participating technical organizations: APL, Naval Surface Warfare Center Dahlgren Division (NSWCDD), and Lockheed Martin Corporation. It was inspired by a future shipboard computing vision2 and the potential for synergy between it and a wave of emerging DARPA research products in distributed computing. The program commenced in June 1991 and has been an ongoing research and development effort designed to reduce risk due to the introduction of distributed computing technologies, methods, and tools into the Aegis Weapon System.

Figure 1. Navy family computers.

Many of the significant developments of HiPer-D have been showcased in demonstrations. Since 1994, when the first major HiPer-D demonstration was held, there have been seven demonstrations that provided exposure of concepts and techniques to Aegis and other interested communities. The first HiPer-D integrated demonstration (Fig. 3) focused primarily on the potential for applying massively parallel processor (MPP) technology to combat system applications. It brought together a collection of over 50 programs running on 34 processors (including 16 processors on the Intel Paragon MPP) as a cooperative effort of APL and NSWCDD. The functionality included a complete sensor-to-weapon path through the Aegis Weapon System, with doctrine-controlled automatic and semi-automatic system responses and an operator interface that provided essential tactical displays and decision interfaces. The major anomaly relative to the actual Aegis system was the use of core elements from CEC (rather than Aegis) to provide the primary sensor integration and tracking.

By many measures, this first demonstration was a significant success. Major portions of weapon system functionality had been repartitioned to execute on a large collection of processors, with many functions constructed to take advantage of replication for reliability and scaling. The Paragon MPP, however, suffered severe performance problems connecting with computers outside its MPP communication mesh. Key external network functions were on the order of 20 times slower than on Sun workstations of the same era. The resulting impact on track capacity (less than 100) drove later exploration away from the MPP approach.

Figure 2. Information originates from sensor views of each object. Each report is small, but the quantity can be very large (because of high sensor report rates and potentially large numbers of objects). The reports flow throughout to be integrated and modified, and to enable rapid reaction to change.

Figure 3. First HiPer-D demonstration physical/block diagram.

The next-generation demonstration had two significant differences: (1) it focused on the use of networked workstations to achieve the desired computing capacity, and (2) it strove for a stronger grounding in the existing Aegis requirements set. The primary impact of these differences was to switch from a CEC-based ship tracking approach to an Aegis-derived tracking approach. Additional fidelity was added to the doctrine processing, and the rapid self-defense response capability referred to as "auto-special" was added to provide a context in which to evaluate stringent end-to-end timing requirements. The 1995 HiPer-D demonstration3–5 represented a significant milestone in assessing the feasibility of moving commercial distributed computing capabilities into the Aegis mission-critical systems' domain. Track capacity was improved an order of magnitude over the original demonstration; a variety of replication techniques providing fault tolerance and scalability were exhibited; and critical Aegis end-to-end timelines were met.

Subsequent demonstrations in 1996 to 1998 built upon the success of the 1995 demonstration, adding increased functionality; extending system capacity; exploring new network technologies, processors, and operating systems; and providing an environment in which to examine design approaches that better matched the new distributed paradigm. Indeed, a fairly complete software architecture named "Amalthea"4 was created to explore the feasibility of application-transparent scalable, fault-tolerant distributed systems. It was inspired by the Delta-4 architecture done in the late 1980s by ESPRIT (European Strategic Programme for Research and Development in Information Technology)5 and was used to construct a set of experimental, reliable applications. In addition to architectural exploration, the application domain itself was addressed. Even though the top-level system specification continued to convey a valid view of system performance requirements, the decomposition of those top-level requirements into the Aegis element program performance specifications had been crafted 20 years earlier with a computing capacity and architecture vastly different from the current distributed environment.

The command and decision element, from which much of the HiPer-D functionality originated, seemed like a particularly fertile area for significant change. An example of this was the tendency of requirements to be specified in terms of a periodic review. In the case of doctrine-based review for automatic actions, the real need was to ensure that a track qualifying for action was identified within a certain amount of time, not necessarily that it be done periodically. In the HiPer-D architecture, it is simple to establish a client (receiver of track data) that will examine track reports for the doctrine-specified action upon data arrival. From a responsiveness standpoint, this far exceeds the performance of the periodic review specified in the element requirements and yet is a very straightforward approach for the high-capacity and independent computing environment of the distributed system.
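
The difference between the two review styles can be sketched as follows. This is an illustrative Python fragment only; the doctrine rule, the TrackReport fields, and the 2-s review period are invented for the example rather than taken from Aegis doctrine processing.

    import time
    from dataclasses import dataclass

    @dataclass
    class TrackReport:
        track_id: int
        range_km: float
        closing: bool

    def evaluate_doctrine(report: TrackReport) -> bool:
        # Hypothetical doctrine rule: flag close, closing tracks for automatic action.
        return report.closing and report.range_km < 20.0

    # Periodic-review style: worst-case reaction time is the full review period.
    def periodic_review(track_table: dict, period_s: float = 2.0):
        while True:
            for report in track_table.values():
                if evaluate_doctrine(report):
                    print(f"action on track {report.track_id}")
            time.sleep(period_s)

    # Arrival-driven style (the HiPer-D client model): each report is checked as it
    # arrives, so a qualifying track is identified without waiting for the next cycle.
    def on_track_arrival(report: TrackReport):
        if evaluate_doctrine(report):
            print(f"action on track {report.track_id}")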

The 1999 and 2000 demonstrations6,7 continued the pattern, bringing in new functionality and a significant new paradigm for scalable fault-tolerant servers that use new network switch technology (see next section). This phase of development also addressed a need for instrumenting the distributed system, which had been identified early in the program but met with a less-than-satisfactory solution.

Distributed systems apply the concept of "work in parallel," where each working unit has its own independent resources (processor and memory). Although this eases the contention between functions running in parallel, it elevates the complexity of test and analysis to a new level. Collection and integration of the information required to verify performance or diagnose problems require sophisticated data gathering daemons to be resident on each processor, and analysis capabilities that can combine the individual processor-extracted data streams must be created. Early in the HiPer-D effort, a German research product called "JEWEL"8 was found to provide such capabilities with displays that could be viewed in real time. The real-time monitoring/analysis capability of HiPer-D was not only one of its most prominent demonstration features, but some would claim also one of the most significant reasons that the annual integration of the complex systems was consistently completed on time. JEWEL, however, never achieved status as a supported commercial product and was somewhat cumbersome from the perspective of large system development. For example, it had an inflexible interface for reporting data, allowing only a limited number of integers to be reported on each call. Data other than integers could be reported, but only by convention with the ultimate data interpreter (graphing program).

Beginning in 1999, a real-time instrumentation toolkit called Java Enhanced Distributed System Instrumentation (JEDSI) was developed and employed as a replacement for JEWEL. It followed the JEWEL model of providing very low overhead extraction of data from the processes of interest but eliminated the interface issues that made it awkward for large-scale systems. It also capitalized on the use of commercially available Java-based graphics packages and communications support to provide an extensive library of real-time analysis and performance visualization tools.
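
The low-overhead extraction model common to JEWEL and JEDSI can be illustrated with a generic sketch. This is not the JEDSI API; the Probe class, collector address, and JSON encoding are assumptions made purely for the example.

    import json, queue, socket, threading, time

    class Probe:
        """Minimal per-process metrics probe: the instrumented code path only
        enqueues a sample; a background daemon thread ships batches to a collector."""
        def __init__(self, collector=("127.0.0.1", 9999)):
            self._q = queue.Queue()
            self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            self._collector = collector
            threading.Thread(target=self._drain, daemon=True).start()

        def report(self, name, **values):      # cheap call made by application code
            self._q.put((time.time(), name, values))

        def _drain(self):
            while True:
                ts, name, values = self._q.get()
                msg = json.dumps({"t": ts, "name": name, "values": values}).encode()
                self._sock.sendto(msg, self._collector)

    # Usage: probe = Probe(); probe.report("rts_latency", ms=12.5, track=42)

Unlike JEWEL's integer-only reporting call, nothing here restricts the reported values to integers; that interface rigidity is exactly the kind of issue JEDSI was built to remove.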

Current and planned initiatives in HiPer-D are moving toward multi-unit operations. Functionality for integrating data from Navy tactical data links has been added to the system, and the integration of real CEC components is planned for late 2001. Such a capability will provide the necessary foundation to extend the exploration of distributed capabilities beyond the individual unit.

EVOLUTION OF COMMUNICATIONS IN HiPer-D

The first phase of HiPer-D, culminating in the system shown in Fig. 3, had a target computing environment that consisted of an MPP as the primary computing resource, surrounded by networked computers that modeled the sensor inputs and workstations that provided graphical operator interfaces. The MPP was an Intel Paragon (Fig. 4). It had a novel internal interconnect and an operating system (Mach) that had message passing as an integral function. Mach was the first widely used (and, for a brief period, commercially supported) operating system that was built as a "distributed operating system"—intended from the outset to efficiently support the communication and coordination requirements of distributed computing.

Despite these impressive capabilities, an important principle had been established within the DARPA community that rendered even these capabilities incomplete for the objective of replication for fault tolerance and scalability. Consider the computing arrangement in Fig. 5. A simple service—the allocation of track numbers to clients who need to uniquely label tracks—is replicated for reliability. The replica listens to the requests and performs the same simple assignment algorithm, but does not reply to the requester as the primary server does. This simple arrangement can be problematic if the communications do not behave in specific ways. If one of the clients' requests is not seen by the backup (i.e., the message is not delivered), the backup copy will be offset 1 from the primary. If the order of multiple clients' requests is interchanged between the primary and backup servers, the backup server will have an erroneous record of numbers assigned to clients. Furthermore, at such time as the primary copy is declared to have failed, the backup will not know which client requests have been serviced and which have not.

Communication attributes required to achieve the desired behavior in this server example include reliable delivery, ordered delivery, and atomic membership change. Reliable delivery assures that a message gets to all destinations (or none at all) and addresses the problem of the servers getting "out of synch." Ordered delivery means that, within a defined group (specifying both members and messages), all messages defined as part of the group are received in the same order. This eliminates the problem that can occur when the primary and replica see messages in a different order. Atomic membership change means that, within a defined group, any member entering or leaving the group (including failure) is seen to do so at the same point of the message stream among all group members. In our simple server example above, this allows a new replica to be established with a proper understanding of how much "history" of previous primary activity must be supplied, and how much can be picked up from the current client request stream.

Figure 4. Paragon.

A convenient model for providing these communication attributes is "process group communications." In this model, the attributes above are guaranteed, i.e., all messages are delivered to all members or none, all messages are delivered to all group members in the same order, and any members that leave or join the group are seen to do so at the same point within the group's message stream. Early HiPer-D work used a toolkit called the Isis Distributed Toolkit9 that served as the foundation for process group communications. Initially developed at Cornell University, it was introduced as a commercial product by a small company named Isis Distributed Systems.
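
To illustrate why these guarantees matter for the track number service of Fig. 5, the following sketch layers the primary/shadow pair on a hypothetical ordered, reliable group channel. The GroupChannel and TrackNumberServer classes are stand-ins of our own devising, not the Isis (or Spread) API.

    class GroupChannel:
        """Stand-in for an ordered, reliable process-group transport: every message
        is delivered to every member, and all members see the same order."""
        def __init__(self):
            self.members = []

        def join(self, member):
            self.members.append(member)

        def multicast(self, msg):
            for m in self.members:          # same delivery order for every member
                m.deliver(msg)

    class TrackNumberServer:
        def __init__(self, group, is_primary):
            self.next_number = 1
            self.is_primary = is_primary
            group.join(self)

        def deliver(self, msg):
            client, request_id = msg
            assigned = self.next_number     # both copies apply requests in the same
            self.next_number += 1           # order, so their counters never diverge
            if self.is_primary:             # only the primary answers the client
                client.receive(request_id, assigned)

    class Client:
        def receive(self, request_id, number):
            print(f"request {request_id} assigned track number {number}")

    group = GroupChannel()
    primary = TrackNumberServer(group, is_primary=True)
    shadow = TrackNumberServer(group, is_primary=False)
    group.multicast((Client(), "req-1"))    # both servers advance; only the primary replies

Because both copies consume the same request stream in the same order, the "offset by 1" and reordering errors described above cannot arise; they reappear as soon as delivery is allowed to drop or reorder messages.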


Figure 5. Simple replicated track number service operations.


The cornerstone element of the HiPer-D communications chain was an application called the Radar Track Server (RTS). Its task was to service a large set of (potentially replicated) clients, providing them with track data at their desired rate and staleness. Fed from the stream of track data emerging from the primary radar sensor and its subsequent track filter, the RTS allows each consumer of the track information to specify desired update rates for the tracks and further specify latency or staleness associated with the delivery of track information. This was a significant, new architectural strategy. In prior designs, track reports were either delivered en masse at the rate produced by the sensor or delivered on a periodic basis. (Usually, the same period was applied to all tracks and all users within the system, which meant a compromise of some sort.) These new server capabilities brought the ability to selectively apply varying performance characteristics (update rate and latency) based on each client's critical evaluation of each individual track.

The RTS matches a client's requested rate for a track by filtering out unnecessary reports that occur during the target report intervals (Fig. 6). For instance, if the sensor provides an update every second but a client requests updates every 2 s, the RTS will send only alternate updates. Of course, this is client- and track-specific. In the previous example, should a track of more extreme interest appear, the client could direct the RTS to report that specific track at the full 1-s report rate. (There was also successful experimentation in use of the RTS to feed back consumer needs to the sensor to enable optimization of sensor resources.) If a consumer's desired report rate is not an integral multiple of the sensor rate, the RTS maintains an "achieved report rate" and decides whether to send an update based on whether its delivery will make the achieved report rate closer or further away from the requested report rate.
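
One way to realize that decision rule is sketched below, assuming the straightforward reading that a report is forwarded only if doing so moves the achieved rate closer to the requested rate; the class and method names are ours, not those of the RTS source.

    class RateMatcher:
        """Forward a sensor report only if doing so moves the achieved report rate
        closer to the client's requested rate (reports per second)."""
        def __init__(self, requested_rate_hz: float, start_time: float):
            self.requested = requested_rate_hz
            self.start = start_time
            self.sent = 0

        def should_send(self, now: float) -> bool:
            elapsed = max(now - self.start, 1e-6)
            rate_if_sent = (self.sent + 1) / elapsed
            rate_if_skipped = self.sent / elapsed
            send = abs(rate_if_sent - self.requested) < abs(rate_if_skipped - self.requested)
            if send:
                self.sent += 1
            return send

    # Example: a 1-Hz sensor and a client that wants 0.5 Hz ends up receiving
    # roughly every other report.
    matcher = RateMatcher(requested_rate_hz=0.5, start_time=0.0)
    decisions = [matcher.should_send(t) for t in range(1, 9)]
    print(decisions)   # alternating False/True pattern, i.e., every other report forwarded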

One lesson learned very early in the HiPer-D effort was that, in network communications, the capacity to deliver messages benefits immensely from buffering and delivery of collections of messages rather than individual transmission of each message. Of course, this introduces latency to the delivery of the messages that are forced to wait for others to fill up the buffer. The RTS uses a combination of buffering and timing mechanisms that allows the buffer to fill for a period not to exceed the client's latency specification. Upon reaching a full buffer condition or the specified maximum latency, the buffer—even if it is only a single message—is then delivered to the client.
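
A minimal sketch of that buffer-or-deadline policy follows; the capacity, the 100-ms latency bound, and the flush callback are illustrative assumptions rather than RTS parameters.

    import time

    class LatencyBoundedBuffer:
        """Batches outgoing messages, flushing when the buffer is full or when the
        oldest buffered message has waited as long as the client's latency bound."""
        def __init__(self, flush, capacity=32, max_latency_s=0.100):
            self.flush = flush                  # callable that actually sends a batch
            self.capacity = capacity
            self.max_latency = max_latency_s
            self.items = []
            self.oldest = None

        def add(self, msg, now=None):
            now = time.monotonic() if now is None else now
            if not self.items:
                self.oldest = now
            self.items.append(msg)
            if len(self.items) >= self.capacity:
                self._flush()

        def poll(self, now=None):
            # Called periodically; flushes even a single message once the deadline hits.
            now = time.monotonic() if now is None else now
            if self.items and now - self.oldest >= self.max_latency:
                self._flush()

        def _flush(self):
            self.flush(self.items)
            self.items = []
            self.oldest = None

    # Usage: buf = LatencyBoundedBuffer(flush=lambda batch: print(len(batch), "msgs sent"))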


Figure 6. Radar track server.

The criticality of the RTS to the system and the anticipated load from the large number of clients led to its immediate identification as a prime candidate for applying replication to ensure reliability and scalability to high load levels. All these factors made the RTS a rich subject for exploration of process group communications. Figure 7 depicts the basic set of process group communications–based interactions among a set of cooperating servers providing track data to a set of clients. The process group communications delivery properties (combined with the particular server design) allowed the server function to be replicated for both improved capacity and reliability. It also allowed clients to be replicated for either or both purposes.

The first Paragon-centered demonstration included the use of the process group communications paradigm and the RTS. Even though the internal Paragon communications mesh proved to be very capable, the system as a whole suffered from a severe bottleneck in moving track data into and out of the Paragon. The 16-node Paragon mesh was shown capable of 8 pairs of simultaneous communications streams of 1000 messages a second (with no buffering!). Compared to the roughly 300 messages per second being achieved on desktops of that period, the combination of Mach messaging features and Paragon mesh was impressive. However, as described earlier, the Paragon's Ethernet-based external connectivity was flawed. Apparently, in the niche of Paragon customers, Ethernet interface performance was not of substantial concern. Serving the "supercomputer market," the Paragon was more attuned to high-performance parallel input devices (primarily disk) and high-end graphics engines that had HiPPi's (High-Performance Parallel Interfaces).

The migration from Paragon MPP hardware, with its mesh-based communications, to the network-based architecture of subsequent demonstrations proved to be smooth for process group communications. The elimination of poor-performance Paragon Ethernet input/output allowed the network-based architecture to achieve a realistic capacity of 600 tracks while meeting subsecond timelines for high-profile, time-critical functions. The RTS scalable design was able to accommodate 9 distinct clients (with different track interests and report characteristics), some of them replicated, for a 14-client total.

Although the network architecture and process group communications were major successes, the Isis Distributed Toolkit mentioned earlier was encountering real-world business practices. When Isis Distributed Systems was acquired by Stratus Computer, the event was initially hailed as a positive maturation of the technology into a solid product that could be even better marketed and supported. However, in 1997, Stratus pulled Isis from the market and terminated support.

This was unsettling. Even though there was no need to panic (the existing Isis capabilities and licenses would continue to serve the immediate needs), the demise of Isis signaled the indifference of the commercial world to the type of distributed computing that Isis supported (and was found to be so useful). APL eventually employed another research product, "Spread," developed by Dr. Yair Amir of The Johns Hopkins University.10 It provides much of the same process group communications support present in Isis, is still used by HiPer-D today, and has just recently been "productized" by Spread Concepts LLC, Bethesda, MD. The absence of any large market for commercial process group communications, however, is an important reminder that, although systems that maximize the use of commercial products are pursued, the requirements of the Navy's deployed combat systems and those of the commercial world will not always coincide.

In 1998, an interesting new capability in network switches, combined with some ingenuity, provided a new approach for developing fault-tolerant and scalable server applications. Recent advances in network switch technology allowed for fast, application-aware switching (OSI Level 4). With an ability to sense whether a server application was "alive," a smart switch could intelligently balance incoming client connections among multiple servers and route new connections away from failed ones.

The RTS was chosen as a proving ground for this new technology. Because of its complexity, the RTS was the one remaining HiPer-D function that employed Isis. It also had the potential to benefit substantially (performance-wise) by eliminating the burden of inter-RTS coordination. The inter-RTS coordination used features of Isis to provide a mechanism for exchange of state information (e.g., clients and their requested track characteristics) between server replicas. This coordination added complexity to the RTS and the underlying communications and consumed processing time, network bandwidth, and memory resources.

The switch-based RTS continues to support themajorfunctionalityoftheoriginalRTSbutdoessowitha much simpler design and smaller source code base.TheguidingprincipleinthedesignofthenewRTSwassimplicity.Notonlywould thisproveeasier to imple-ment, it would also provide fewer places to introducelatency,thusimprovingperformance.EachRTSreplicaoperates independentlyofallothers. In fact, theserv-ers do not share any state information among themandareoblivioustotheexistenceofanyotherservers.TheimplementationofadesignthatdoesnotrequireknowledgeofotherserverswasadramaticchangefromtheoriginalRTSdesign,whichhadrequiredextensivecoordinationamongtheserverstoensurethatallclientswerebeingproperlyserved,particularlyundertheaber-rantconditionsofaclientorserverfailure.

Figure 7. Basic RTS operations (different groups and message content).

Figure 8. Switch-based RTS block diagram.

Clients of the switch-based RTS server set view it as a single virtual entity (Fig. 8). Each time a new client requests service from the "virtual RTS," the Level 4 switch selects one of the RTS servers to handle the request. The selection is currently based on a rotating assignment policy but could be based on other loading measures that are fed to the switch. Once a client is established with a particular server, that server handles all its needs. In the event of server failure, a client-side library function establishes connection with a new server and restores all the previously established services. In addition to replication of the server, the Level 4 switch itself can be replicated to provide redundant paths to the servers. This accommodates failure of the switch itself as well as severance of a network cable.
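
The client-side recovery path described above might look roughly like the following sketch; the virtual address, port, wire format, and retry count are invented for illustration and are not the HiPer-D client library.

    import socket, time

    class VirtualRtsClient:
        """Connects to the single virtual RTS address exposed by the Level 4 switch.
        On a broken connection it reconnects (landing on whichever replica the switch
        picks) and replays its previously requested track subscriptions."""
        def __init__(self, virtual_addr=("rts.virtual", 4000)):
            self.virtual_addr = virtual_addr    # one address for the whole server set
            self.subscriptions = []             # (track_set, rate_s, latency_s) tuples
            self.sock = None

        def subscribe(self, track_set, rate_s, latency_s):
            self._send(f"SUBSCRIBE {track_set} {rate_s} {latency_s}")
            self.subscriptions.append((track_set, rate_s, latency_s))

        def _connect(self):
            self.sock = socket.create_connection(self.virtual_addr, timeout=5)
            # Restore previously established services on the newly assigned replica.
            for track_set, rate_s, latency_s in self.subscriptions:
                self.sock.sendall(f"SUBSCRIBE {track_set} {rate_s} {latency_s}\n".encode())

        def _send(self, line):
            for _ in range(10):
                try:
                    if self.sock is None:
                        self._connect()         # switch routes us to a live replica
                    self.sock.sendall((line + "\n").encode())
                    return
                except OSError:
                    self.sock = None            # server (or path) failed; retry
                    time.sleep(0.5)
            raise ConnectionError("virtual RTS unreachable")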

Figure 9a is a time plot of a typical end-to-end response for the combat system demonstration application that uses the RTS. From sensor origination through track filtering and RTS delivery to the client, average latencies are approximately 80 ms. The 80-ms latency is a nominal level of performance that is intended for tracks and processing purposes where low latency is not of high concern. By tuning the processing functions and clients, latencies in the 20- to 30-ms range can be achieved but at the expense of capacity. This is a clear engineering trade-off situation, where the proper balance can be adjusted for a particular application (using command line parameters).

Figure 9. Nominal performance with recovery (a) and performance with cable pull (b).

A server software failure was forced in the center of the run shown in Fig. 9a, with little perceptible impact. Figure 9b shows the spikes in maximum latency that occur as a result of physical disconnection of the server from the switch. Physical reconfigurations, however, take 2 to 4 s to complete.

The switch-based RTS is a significant simplification, reducing the source code from 12,000 to 5,000 lines; however, two of its performance attributes are of potential concern to clients:

1. Delivery of all data to a client is not strictly guaranteed. When a server fails, messages that transpire between the server failure and the client's reestablishment with a new server are not eligible for delivery to that client. However, when the client reconnects, it will immediately receive the freshly arriving track updates. In the case of periodically reported radar data, this is sufficient for most needs.

2. Failures of the switch itself, the cabling, or server hardware incur a relatively large delay to recovery. This is a more serious potential problem for clients because the reestablishment of service takes multiple seconds. Initially, much of this time was used to detect the fault. A "heartbeat" message was added to the client-server protocol to support subsecond fault detection (a minimal sketch of such a monitor follows this list), but the mechanisms in the Level 4 switch for dismantling the failed connection and establishing a new one could not be accelerated without greater vendor assistance than has been available. This seemed to be a classic case of not representing a strong enough market to capture the vendor's attention. Had this been a higher-profile effort, with more potential sales on the line, it is our perception that this problem could have been addressed more satisfactorily.
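
A heartbeat monitor of the kind described in item 2 can be sketched as follows; the 200-ms period and three-miss threshold are assumed values chosen only to show how subsecond detection is obtained.

    import time

    class HeartbeatMonitor:
        """Client-side detector: the server is expected to send a small heartbeat
        message at a fixed period; missing several in a row declares the connection
        failed well before a TCP timeout would."""
        def __init__(self, period_s=0.2, missed_limit=3):
            self.period = period_s
            self.missed_limit = missed_limit
            self.last_beat = time.monotonic()

        def on_heartbeat(self):
            self.last_beat = time.monotonic()

        def is_failed(self):
            # Subsecond detection: three missed 200-ms beats -> declared dead in ~0.6 s.
            return time.monotonic() - self.last_beat > self.period * self.missed_limit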

The use of Level 4 network switch technology provides two major advantages. First, server implementation is much simpler. The new RTS architecture is similar in design to a replicated commercial Internet server. The complex server-to-server coordination protocols required to manage server and client entries, exits, and failures have been eliminated. Second, the communications layer used between the server and its clients is the same "socket" service used by major Internet components like browsers and Web servers. The use of standard commercial communications greatly reduces the dependency of the software system on specific products. Therefore, porting the software to new (both processor and network) environments will be straightforward, lowering the life-cycle and hardware upgrade costs.

CONCLUSION

Foremost among the accomplishments of the HiPer-D Program is the long list of proof-of-concept demonstrations to the Aegis sponsor and design agent, Lockheed Martin:

• The sufficiency of distributed computing for Aegis requirements
• The ability to replicate services for reliability and scalability
• The ability to employ heterogeneous computing environments (e.g., Sun and PC)
• The feasibility and advantages of a track data server
• A single track numbering scheme and server for allocating/maintaining track numbers
• The feasibility of standard network approaches like Ethernet, FDDI (Fiber-Distributed Data Interface), and ATM (asynchronous transfer mode)
• Strategies for control of the complex distributed system and reconfiguration for reliability or scaling
• The value of real-time instrumentation

The relationship between these accomplishments and the production Aegis baselines is primarily indirect. While it would be improper to claim "responsibility" for any of the advances that the Lockheed Martin engineering team has been able to infuse into ongoing Aegis developments, it is clear that the HiPer-D efforts have emboldened the program to advance rapidly. Currently, three major Aegis baselines are under development, each with increasingly higher reliance on distributed computing and technology. The most recent, referred to as Aegis Open Architecture, will fully capitalize on its distributed architecture, involving a complete re-architecting of the system for a commercial-off-the-shelf distributed computing infrastructure.

In the specific area of communications, HiPer-D has continued to follow the advance of technology in its pursuit of services that can effectively meet Aegis tactical requirements. Robust solutions with satisfyingly high levels of performance have been demonstrated. The problem remains, however, that the life of commercial products is somewhat fragile. The bottom-line need for profit in commercial offerings constrains the set of research products that reach the commercial market and puts their longevity in question. The commercial world, particularly the computing industry, is quite dynamic. Significant products with apparently promising futures and industry support can arrive and disappear in a very short time. An interesting example of this is FDDI network technology. Arriving in the early 1990s as an alternative to Ethernet that elevated capacity to the "next level" (100 Mbit), and garnering the support of virtually all computer manufacturers, it has all but disappeared. However, it is this same fast-paced change that is enabling new system concepts and architectural approaches to be realized.

The challenge, well recognized by Lockheed Martin and integrated into its strategy for Aegis Open Architecture, is to build the complex application system on a structured foundation that isolates the application from the dynamic computing and communication environment.

REFERENCES

1. Wander, K. R., and Sleight, T. P., "Definition of New Navy Large Scale Computer," in JHU/APL Developments in Science and Technology, DST-9, Laurel, MD (1981).
2. Zitzman, L. H., Falatko, S. M., and Papach, J. L., "Combat System Architecture Concepts for Future Combat Systems," Naval Eng. J. 102(3), 43–62 (May 1990).
3. HiPer-D Testbed 1 Experiment Demonstration, F2D-95-3-27, JHU/APL, Laurel, MD (16 May 1995).
4. Amalthea Software Architecture, F2D-93-3-064, JHU/APL, Laurel, MD (10 Nov 1993).
5. Powell, D. (ed.), Delta-4: A Generic Architecture for Dependable Distributed Computing, ESPRIT Project 818/2252, Springer-Verlag (1991).
6. High Performance Distributed Computing Program 1999 Demonstration Report, ADS-01-007, JHU/APL, Laurel, MD (Mar 2001).
7. High Performance Distributed Computing Program (HiPer-D) 2000 Demonstration Report, ADS-01-020, JHU/APL, Laurel, MD (Feb 2001).
8. Gergeleit, M., Klemm, L., et al., Jewel CDRS (Data Collection and Reduction System) Architectural Design, Arbeitspapiere der GMD, No. 707 (Nov 1992).
9. Birman, K. P., and Van Renesse, R. (eds.), Reliable Distributed Computing with the Isis Toolkit, IEEE Computer Society Press, Los Alamitos, CA (Jun 1994).
10. Stanton, J. R., A Users Guide to Spread Version 0.1, Center for Networking and Distributed Systems, The Johns Hopkins University, Baltimore, MD (Nov 2000).


THE AUTHORS

MARK E. SCHMID received a B.S. from the University of Rochester in 1978 and an M.S. from the University of Maryland in 1984, both in electrical engineering. He joined APL in 1978 in ADSD's Advanced Systems Development Group, where he is currently an Assistant Group Supervisor. Mr. Schmid's technical pursuits have focused on the application of emergent computer hardware and software technology to the complex environment of surface Navy real-time systems. In addition to the HiPer-D effort, his work includes pursuit of solutions to Naval and Joint Force system interoperability problems. His e-mail address is mark.schmid@jhuapl.edu.

DOUGLAS G. CROWE is a member of the APL Senior Professional Staff and Supervisor of the A2D-003 Section in ADSD's Advanced Systems Development Group. He has B.S. degrees in computer science and electrical engineering, both from The Pennsylvania State University, as well as an M.A.S. from The Johns Hopkins University. Mr. Crowe has been the lead for groups working on the development of automatic test equipment for airborne avionics, training simulators, and air defense C3I systems. He joined APL in 1996 and is currently the lead for APL's HiPer-D efforts. His e-mail address is [email protected].