
Cloning Parallel Simulations

MARIA HYBINETTE
The University of Georgia
and
RICHARD M. FUJIMOTO
Georgia Institute of Technology

We present a cloning mechanism that enables the evaluation of multiple simulated futures. Performance of the mechanism is analyzed and evaluated experimentally on a shared memory multiprocessor. A running parallel discrete event simulation is dynamically cloned at decision points to explore different execution paths concurrently. In this way, what-if and alternative scenario analysis can be performed in applications such as gaming or tactical and strategic battle management. A construct called virtual logical processes avoids repeating common computations among clones and improves efficiency. The advantages of cloning are preserved regardless of the number of clones (or execution paths). Our performance results with a commercial air traffic control simulation demonstrate that cloning can significantly reduce the time required to compute multiple alternate futures.

Categories and Subject Descriptors: I.6.1 [Simulation and Modeling]: Simulation Theory; I.6.8 [Simulation and Modeling]: Types of Simulation—Discrete event; parallel; distributed; D.4.1 [Operating Systems]: Process Management—Concurrency; scheduling; synchronization; D.1.3 [Programming Techniques]: Concurrent Programming—Distributed programming

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Cloning, multiprocessors, parallel algorithms, parallel simulation, pruning

1. OVERVIEW

Simulation is becoming an important tool for decision makers in time-constrained environments. High-fidelity simulations demand parallel implementations to support the decision process at real-time rates. Typical decision making applications in which parallel simulations can or do play a role include air traffic control, gaming strategy and battle management among others. The goal of this research is to provide a simulation-based decision aid for managers faced with complex planning tasks (e.g., real-time air traffic management). To realize effective decision making we propose a technology called simulation cloning that enables more efficient exploration of different possible future outcomes based on policy decisions made at well defined decision points.

Fig. 1. Envisioned system: A faster-than-real-time simulation is integrated in a real-time interactive environment, enabling what-if evaluation of alternative courses of action.

This work was supported in part by U.S. Army Contract DASG60-95-C-0103 funded by the Ballistic Missile Defense Organization and managed through the Space and Strategic Defense Command, a MITRE Corporation research grant, and a grant from Sun Microsystems.
Authors’ addresses: Maria Hybinette, Computer Science Department, The University of Georgia, Athens, GA 30602-7404; email: [email protected]; Richard M. Fujimoto, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332-0280; email: [email protected]
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 2001 ACM 1049-3301/01/1000-0378 $5.00

ACM Transactions on Modeling and Computer Simulation, Vol. 11, No. 4, October 2001, Pages 378–407.

The envisioned decision aid includes three components: (1) a situation database built from live data feeds that characterizes the current state of the system under investigation, (2) a faster-than-real-time forecasting simulator and (3) an interactive environment that interfaces with both the database and the forecasting tool for what-if evaluation of alternative courses of action. The overall system is depicted in Figure 1 in the context of an air traffic control application.

A powerful simulation-based decision tool should provide capabilities such as monitoring and steering. Monitoring provides the ability to query or sample simulation variables. These variables may be sampled continuously, displayed when a pre-defined condition is satisfied or presented on demand.

Simulation steering controls the execution of an application through operations such as pause, execution path modification and rollback. Pausing stops the simulation and enables the analyst to fully investigate the state of the simulation to determine whether to create new execution paths or to modify variables of the current execution path. Execution path modifications change the state of the simulation and are added to the simulation at decision points. Rollback is the retraction to an earlier simulated time and provides for the re-evaluation of old decision points and variables.

Such a tool could be used, for example, to determine which of several air traffic control decisions offers minimal delay. If thunderstorms are approaching, a controller may want to evaluate alternative air traffic restrictions to minimize travelers’ delays while maintaining safety in the airspace. This can be done by dynamically cloning (replicating) several faster-than-real-time simulations with different courses of action. The clones utilize both the current situation and forecasting parameters such as the start time of the restriction. The clones are inserted as decision points and each of the new “clones” proceeds in parallel, sharing common computations to improve performance (sharing a computation means that the computation is performed only once among the clones). For example, the clones will share all computations before the start time of the restriction. The effect of each policy within each clone is monitored to determine which policy offers the most effective solution. This research focuses on the details of the parallel evaluation of multiple alternatives.

Parallel discrete event simulation techniques are used to speed up the execution of the simulator. The parallel simulator consists of distinct components called logical processes or LPs. Each LP models some part of the system under investigation and maintains its own state. For example, in an air traffic application each airport may be modeled by an LP. The logical processes are often mapped to different processors and interact by exchanging time-stamped event messages. The simulation progresses as LPs exchange messages that model changes of the system state at discrete points in time. In the air traffic application, for example, the airports may interact by sending messages modeling aircraft arrival and departure events.

In order to maintain consistency, events are processed in time-stamp order according to a consistency protocol. The two prevailing consistency protocols are termed conservative and optimistic. The conservative protocol enforces consistency by each LP processing its events in time-stamp order. The optimistic protocol, in contrast, allows out-of-order event processing, but provides a mechanism such as rollback to recover. Our cloning mechanism is applicable to both conservative and optimistic simulation protocols.
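The time-stamp ordering requirement can be illustrated with a minimal sketch. It is a sequential, single-queue toy, not the parallel simulator described in this paper; the `Event` class, the `run` function and the airport codes are hypothetical names chosen for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    # Only the time stamp takes part in ordering; the other fields are payload.
    timestamp: float
    dest: str = field(compare=False)      # name of the receiving LP (assumed)
    payload: str = field(compare=False)   # e.g. "arrival" or "departure"


def run(events):
    """Process events in non-decreasing time-stamp order and return that order."""
    queue = list(events)
    heapq.heapify(queue)
    processed = []
    while queue:
        ev = heapq.heappop(queue)  # always the smallest remaining time stamp
        processed.append((ev.timestamp, ev.dest, ev.payload))
    return processed


order = run([
    Event(12.0, "IAD", "arrival"),
    Event(3.5, "LAX", "departure"),
    Event(7.0, "ATL", "arrival"),
])
```

A conservative protocol guarantees this order at every LP before an event is processed, whereas an optimistic protocol may process events out of order and repair the result by rollback; neither mechanism is modeled here.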

The rest of this paper is organized as follows: Background and related work is discussed in Section 2. Section 3 introduces dynamic cloning. In Section 4 we present the virtual logical-process paradigm. This is followed by a discussion of the cloning mechanism in Section 5. We then describe our implementation of the cloning mechanism in Section 6. Next, performance measurements are presented and analyzed. We conclude with a summary and a discussion of future directions.

2. RELATED WORK

Related work in interactive parallel discrete event simulation is described in Steinman [1991] and in Franks et al. [1997]. In Steinman, a message type called an external event enables interaction with the simulator. These events can steer, sample and query the simulation in progress. The approach of Franks et al. allows for the testing of what-if scenarios provided they are interjected before a deadline. Alternatives are examined sequentially by using a rollback mechanism to erase exploration of one path in order to begin exploration of another. In contrast to these methods, our algorithm dynamically creates and evaluates multiple alternatives to allow concurrent exploration of multiple paths. New versions of an executing simulation can be dynamically cloned to evaluate alternative paths. Because the cloned simulation proceeds in parallel with the original, an increased number of alternatives can be evaluated.

Cloning was suggested by von Neumann [1956] more than 40 years ago to provide fault-tolerance. Fault-tolerance is ensured by providing identical processes, identical transactions, duplicate data, or redundant services [Schneider 1990]. Cloning can also improve throughput by placing process replicas in the proximity of where a service is needed [Goldberg 1992; Schneider 1990]. Cloning has also been suggested as a solution for concurrency control in real-time databases [Bestavros 1994] and as a way to improve the accuracy of simulation results by running multiple independent replications and averaging their results at the end of the runs. For example, in Heidelberger [1991] and Glynn and Heidelberger [1991], research focuses on how to achieve statistical efficiency of replications spread on different machines. In Vakili [1992], replications are synchronized via a shared clock. In this way the same event occurs at the same time in all replicas.

Cloning is also used in sequential simulations to propagate faults in digital circuits [Lentz et al. 1999], to model flexible manufacturing systems [Davis 1998] or to fork transactions in simulation languages [Henriksen 1997]. Sequential simulation languages typically clone dummies (“shadows”) to do time-based waiting [Schriber and Brunner 1997].

The incremental update schemes of process migration algorithms such as Zayas [1987] are similar in philosophy to our virtual logical process scheme (covered in a later section). The common goal is to reduce the cost of copying the virtual address space between clones. The process migration algorithms differ from ours in that they only support one active clone, while we allow for multiple clones. Our main motivation is to develop a parallel model that supports an efficient, simple, and effective way to explore and compare alternate scenarios. This is decidedly different from the studies reported above.

3. SIMULATION CLONING

Simulation cloning enables the creation of several new copies of a running simulation, each modeling a different course of action. The clones are inserted into the on-going simulation as decision points and each of the new clones proceeds in parallel. Here, a decision point defines the point in simulation time where the simulation model is cloned so the original and cloned simulations may execute along different execution paths.

Simulation cloning improves simulation performance in two ways: First, simulations that are interactively cloned at a decision point are able to share the same execution path before the decision point, so the computations before the decision point need only be executed once. Second, after cloning, multiple versions of the simulation may be able to share a single execution of certain computations that are common between them (a mechanism for achieving this will be described in the next section). Additionally, clones can be “pruned” (deleted) when it is apparent that the course of action they are evaluating will not be competitive with respect to others.



These capabilities provide a significant improvement over the traditional paradigm where several alternative trials of a simulation are completely replicated with each replication evaluating a single course of action. One disadvantage of this approach is that it may be difficult to predict ahead of time which alternate courses of action or simulation parameter alternatives need to be considered. In our approach, however, one can interactively clone new simulations from a running simulation as needed.

As an example, consider an air traffic control application where the goal is to minimize flight delays and aircraft “circling” while meeting the air traffic demand and safety requirements. Towards this end, air traffic controllers can adjust the spacing between aircraft, the miles in trail (MIT), to control the flow of traffic in the air space. Further, controllers can impose “ground delay” programs to keep aircraft from taking off rather than have them circle near their destination, which is both costly and less safe. Simulation can be used to evaluate the effects of imposing different MIT restrictions and/or ground delay programs.

Suppose, for example, that congestion is expected to develop at Dulles (IAD) because of a planned runway closure that afternoon. A controller might consider reducing the rate of aircraft entering the Dulles airspace. The controller might wish to consider MIT values of 10, 20 or 30 miles in the traffic corridors near Dulles to assess their impact on flight delays of inbound aircraft, and the effect of these changes on the entire national air space. Using replicated trials, the controller would have to launch four separate simulations, each using a different restriction imposed that afternoon prior to the runway closure. Then, after analyzing the result of each simulation, the controller would impose the restriction that promises the least flight delay within the overall system. This scenario is depicted in the top image of Figure 2 where each horizontal line represents the execution of one of the replications.

The middle image in Figure 2 illustrates how cloning can be applied to this scenario. A decision point is set at the time when the restrictions would be imposed. Rather than repeating the execution of the simulation prior to the decision point four times, that portion of the computation is only performed once. When the simulation time reaches the decision point, three new clones are created, and each completes the execution using different MIT restrictions. It is clear that this will be much more efficient than the replicated approach.
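The saving can be made concrete with a back-of-the-envelope cost model. The sketch below is an illustration only: it assumes that cost is proportional to simulated time, that nothing is shared after the decision point, and that cloning itself is free. The function names and the numbers (100 time units, decision point at 60) are hypothetical and are not measurements from this paper.

```python
def replicated_cost(total_time, n_alternatives):
    """Cost of running every alternative from the start (replicated trials)."""
    return n_alternatives * total_time


def cloned_cost(total_time, decision_time, n_alternatives):
    """Cost when the prefix before the decision point is executed only once.

    Assumes no sharing after the decision point and zero cloning overhead.
    """
    if not 0 <= decision_time <= total_time:
        raise ValueError("decision_time must lie within the simulated interval")
    return decision_time + n_alternatives * (total_time - decision_time)


# Four alternatives, 100 units of simulated time, decision point at time 60.
rep = replicated_cost(100, 4)   # 4 * 100
clo = cloned_cost(100, 60, 4)   # 60 + 4 * 40
```

Under these assumptions the later the decision point, the larger the shared prefix and the smaller the total cost; sharing after the decision point, described in the next section, would reduce the cloned cost further.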

The bottom image of Figure 2 shows another scenario, where a controller is attempting to determine the proper time at which to enforce a restriction. In this case the decision points are set at different simulation times corresponding to candidate times for imposing the restriction. These two examples show that cloning may increase the number of alternate simulations that can be run by reducing the amount of computational resources required.

Another benefit of dynamic cloning is that clones can be created interactively. For example, as soon as the analyst sees something interesting in the ongoing simulation he may interactively create a new clone, or pause the simulation, roll back and then create a new clone. In contrast to replicated trials, alternative simulation(s) do not need to re-run from the beginning and can be spawned as soon as it is realized a new alternative is viable.



Fig. 2. Simulations that are cloned at a decision point share the same execution path before cloning. The top image shows the current approach where four independent simulations are run separately to evaluate alternatives.

Computational resources can be even more efficiently utilized if unproductive simulation clones are dynamically pruned. A controller may realize that some Miles-In-Trail (MIT) restrictions will not lead to desirable outcomes long before the simulation completes. In this case the cloned simulation can be simply terminated prior to completion, releasing the resources it was using for use by other clones.

4. VIRTUAL LOGICAL PROCESSES

As previously mentioned, simulation cloning avoids the cost of simulating multiple futures by sharing common computations (i.e. computation that is common to more than one clone need be performed only once). In the previous section we illustrated how clones share computations before the branching point by defining decision points. In this section we introduce the construct virtual logical processes, which enables sharing of computations after the decision point.

Cloning can be implemented by simply replicating the entire simulation at each decision point. For large simulations, consisting of thousands or millions of LPs, this approach may be wasteful, especially when the differences among the clones are confined to only a few LPs. In this case, as the simulation progresses, many computations and messages in the different clones will be the same. It would clearly be desirable to perform these duplicate computations only once. For example, consider a decision point introducing an MIT restriction near Dulles. The new restriction will initially only affect the air space in the Washington, D.C. area. Simulation computations corresponding to more distant airports, for example Los Angeles, will be the same in the different clones until the effects of the MIT restrictions at Dulles propagate across the national airspace. Thus the computations corresponding to LPs representing airports in the Los Angeles area need only be performed once, and their results can be shared among the different clones.

Fig. 3. An air traffic control simulation is divided into two layers: A physical layer and a virtual layer. The incremental update scheme allows sharing of computational resources until interaction.

Our approach to avoiding duplicate computations is to replicate the state space of the simulation incrementally (i.e. LPs can be replicated incrementally) rather than all at once. To explain this approach, we use concepts analogous to those used in virtual memory systems. Each simulation clone is referred to as a version of the simulation, and consists of a set of virtual logical processes that communicate by exchanging virtual messages. As implied by their name, virtual LPs and messages do not have an actual, physical realization. Instead, each virtual LP maps to (exactly) one physical logical process that does have a physical realization in the parallel simulation system. Similarly, each virtual message maps to a single physical message. Common computations among two or more clones can be shared by mapping virtual LPs among the different clones to a single physical LP that performs the actual (shared) computation. Similarly, virtual messages among different clones can be mapped to a single physical message to avoid duplication.

When a new version (clone) of the simulation is created, the entire set of virtual logical processes making up the version is replicated. Replicating a virtual LP requires very little computation, however, because one need only update a few internal data structures. One need only replicate the physical LPs that are different in the two clones. Replicating a physical LP requires replication of the state variables included in the LP. Virtual LPs that are identical in the two clones map to the same physical LP, while those that are different in the clones map to different physical LPs.

To illustrate these concepts, again consider an air traffic control simulation. Suppose there are three aircraft approaching Dulles (IAD) airport. This scenario is depicted on the left of Figure 3. The tower represents the control center at the Dulles airport. The virtual LPs are depicted in the section of the diagram labeled the virtual layer, and the physical LPs are in the area labeled the physical layer. Initially (the leftmost diagram in Figure 3) there is a single version of the simulation, and each virtual logical process maps to a single physical logical process.

Suppose a clone of the simulation is created halfway through the execution that introduces a new MIT restriction at Dulles. The original version of the simulation does not include this change. This is shown in the middle image of Figure 3. There are now two versions of the simulation in the virtual layer. Notice that there are two physical LPs modeling the tower. One simulates actions at the tower with the new restrictions and the other represents the original tower with no new restrictions. The virtual logical processes modeling the tower at Dulles are mapped to different physical logical processes. The simulation models of the three aircraft are identical in both versions (for now) because they are not yet affected by the restriction, so they share the same physical simulation. The two virtual logical processes for each aircraft are mapped to the same physical logical process. In this way, common computations among the two versions need only be performed once.

Common message send and receive computations are also shared among clones. In general, when a physical LP (e.g. the aircraft LPs in Figure 3) sends a message, this corresponds to a set of message sends in the virtual layer. In the example (Figure 3), a message sent from the physical logical process corresponding to the top aircraft, to the physical logical process corresponding to the bottom aircraft, results in two virtual messages sent from their corresponding virtual logical processes; one message sent for each version of the simulation. This illustrates how the virtual/physical abstraction allows common communication among different clones to be shared.

To continue with our example, suppose the topmost aircraft calls the tower. In the simulation, this corresponds to a message send from an aircraft LP to the tower LP. The same communication occurs in the original execution as well as the clone. Both versions of the sender of the message (the airplane) are represented by the same physical process but there are two distinct physical processes representing the receiver (the tower). Thus, two copies of the message must be sent, one to each physical LP. To handle this situation, the cloning mechanism must duplicate, or clone, the message.

Our cloning mechanism increases efficiency by delaying the cloning of physical LPs until it is absolutely necessary. As an example, eventually, as the aircraft draw closer to Dulles, the restriction impacts the aircraft. At that time, the physical layer clones the physical LP corresponding to the delayed aircraft. This is illustrated in the rightmost image of Figure 3. There are now two versions of the aircraft in the physical layer and two versions of the aircraft in the virtual layer. Each copy of the aircraft in the virtual layer maps to one version in the physical layer. Note that the remaining aircraft in the virtual layer continue to share the same physical LPs in the physical layer.

5. THE CLONING MECHANISM

As mentioned earlier, LPs can only affect other LPs in a parallel simulation by exchanging messages. Thus, the reception of a message is the point at which a logical process must consider message forwarding and process cloning.



Message sends and receives are implemented in the physical layer. The cloning mechanism first determines which physical LPs should receive the message. Next it considers whether the generation, or cloning, of a new physical process is required. Whether a message is forwarded or a process is created depends on the set of virtual processes mapped to the sending physical process and on the set of virtual processes mapped to the receiving physical process.

Two data structures (represented as sets) are utilized by the cloning mechanism to determine where a message is delivered, and whether additional cloning is necessary:

—VSendSet: the version numbers of the virtual processes mapped to the sending physical LP; and

—VRcvSet: the version numbers of the virtual processes mapped to the receiving physical LP.

The sender piggy-backs additional bits onto the message so that the receiver can construct VSendSet. Before examining this process in more detail, we introduce the notation that will be used.
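One plausible way to carry VSendSet in a few extra bits is a bitmask with one bit per version. The text above says only that additional bits are piggy-backed, so the encoding below is an assumption made for illustration, and `encode_versions` and `decode_versions` are hypothetical helper names.

```python
def encode_versions(versions):
    """Pack a set of version numbers (1-based) into an integer bitmask."""
    mask = 0
    for v in versions:
        if v < 1:
            raise ValueError("version numbers start at 1")
        mask |= 1 << (v - 1)  # bit 0 stands for version 1, bit 1 for version 2, ...
    return mask


def decode_versions(mask):
    """Recover the set of version numbers from a bitmask."""
    versions = set()
    v = 1
    while mask:
        if mask & 1:
            versions.add(v)
        mask >>= 1
        v += 1
    return versions


# A sender shared by versions 1 and 2 would attach this mask to its message.
mask = encode_versions({1, 2})
v_send_set = decode_versions(mask)
```

With such an encoding the per-message overhead grows by one bit per version that has been created.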

5.1 Notation

The original simulation is referred to as version 1 (the version number is incremented for each clone that is created). $V$ denotes a virtual logical process and $P$ denotes a physical logical process. $V^i_A$ is version $i$ of virtual logical process $A$. Similarly, $P^i_A$ refers to version $i$ of physical logical process $A$. It can be shown that the version number of a physical logical process is the same as the lowest version of the virtual logical processes mapped to that physical LP.

A cloned simulation passes through three distinct phases: (1) before cloning, (2) diffusion (before the simulation is fully replicated) and (3) fully replicated. Before a simulation is cloned there is one set of $V$s and one set of $P$s. At this stage $V^1_j$ maps to $P^1_j$ for all $j$. As an example, consider a simulation composed of three logical processes (before cloning):

—$V^1_A \rightarrow P^1_A$

—$V^1_B \rightarrow P^1_B$

—$V^1_C \rightarrow P^1_C$

where $\rightarrow$ means “maps to.” This scenario is illustrated in Figure 4. The virtual logical processes are at the top, and the physical logical processes are at the bottom of the figure.

5.2 Decision Point

A simulation is cloned at a decision point. The decision point defines when and how the new version differs from the original version. The semantics of inserting a decision point (creating a new simulation) are (1) to define the set of virtual logical processes (referred to as the clone set) that immediately differ in the new version compared to the original, and (2) to define how the LPs in the clone set of the new version differ from the original version. The clone set is denoted $\mathrm{CloneSet}_i$, where $i$ refers to the new cloned version. All virtual logical processes in a CloneSet must belong to the same version (i.e. the decision point cannot span multiple versions).

Fig. 4. A snapshot of a simulation consisting of three logical processes before the instantiation of a clone; the mapping of the virtual processes to physical processes A, B and C.

Fig. 5. A snapshot of a simulation that has been cloned; the mapping of the virtual processes to physical processes A, B and C.

When the simulation is cloned, a new complete set of virtual processes is created. One set represents the original simulation and the other represents the new version. Cloning also creates a new set of physical logical processes; one physical logical process is created for each virtual logical process in the CloneSet. The state of these new physical logical processes differs from the original according to the definition of the new version of the simulation.

Fig. 6. Transparent send and receive between virtual logical processes.

Immediately after the clone is created, $V^1_j$ maps to $P^1_j$ for all $j$, $V^2_k$ maps to $P^2_k$ for all $k$ that are part of the clone set, and $V^2_l$ maps to $P^1_l$ for all other virtual LPs. Figure 5 shows the mapping between virtual and physical logical processes immediately after the simulation is cloned, where the clone set consists of a single virtual LP, $V^1_A$. The mapping between virtual and physical processes for the new version of the simulation is as follows:

—$V^2_A \rightarrow P^2_A$

—$V^2_B \rightarrow P^1_B$

—$V^2_C \rightarrow P^1_C$

Because $V^1_A$ is in the clone set and it maps to $P^1_A$, $P^1_A$ is cloned to create $P^2_A$. For virtual processes B and C, versions 1 and 2 share the same corresponding physical process.
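The mapping produced at a decision point can be sketched with a dictionary from virtual LPs to physical LPs. This is a minimal illustration under assumed data structures: LPs are represented as hypothetical `(version, name)` tuples, `clone_simulation` is an invented helper, and the copying and modification of the state of a cloned physical LP is not modeled.

```python
def clone_simulation(mapping, source_version, new_version, clone_set):
    """Return a new virtual-to-physical mapping after cloning one version.

    mapping:   dict {(version, name): (physical_version, name)}
    clone_set: names of the LPs that differ immediately in the new version
    """
    new_mapping = dict(mapping)
    for (version, name), physical in mapping.items():
        if version != source_version:
            continue
        if name in clone_set:
            # LPs in the clone set get their own new physical LP.
            new_mapping[(new_version, name)] = (new_version, name)
        else:
            # All other virtual LPs keep sharing the existing physical LP.
            new_mapping[(new_version, name)] = physical
    return new_mapping


before = {(1, "A"): (1, "A"), (1, "B"): (1, "B"), (1, "C"): (1, "C")}
after = clone_simulation(before, source_version=1, new_version=2, clone_set={"A"})
```

In this sketch only the entry for A acquires a new physical LP; the entries for B and C in version 2 still point at the version 1 physical LPs, which is the sharing described in the text.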

5.3 Computational Sharing

Simulation computation and message sends and receives are carried out in the physical layer. A physical message send may correspond to a set of sends in the virtual layer. In the example (Figure 6), a message sent from physical process B to physical process C implements two virtual message sends from $V^1_B$ and $V^2_B$ to $V^1_C$ and $V^2_C$ respectively. Similarly, simulation computations in $P^1_B$ implement the computations performed by $V^1_B$ and $V^2_B$. In this way common computations among clones are only performed once.
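The correspondence between one physical send and several virtual sends can be sketched as a lookup over the virtual-to-physical mapping. The `(version, name)` tuple representation and the `virtual_sends` helper are hypothetical and serve only to illustrate the idea.

```python
def virtual_sends(mapping, physical_sender, receiver_name):
    """List the virtual sends implemented by one physical send.

    mapping: dict {(version, name): (physical_version, name)}
    Returns (virtual sender, virtual receiver) pairs, one per version
    whose virtual sender is mapped to physical_sender.
    """
    sends = []
    for (version, name), physical in sorted(mapping.items()):
        if physical == physical_sender:
            sends.append(((version, name), (version, receiver_name)))
    return sends


# Mapping right after cloning with clone set {A}: only A has two physical LPs.
mapping = {
    (1, "A"): (1, "A"), (1, "B"): (1, "B"), (1, "C"): (1, "C"),
    (2, "A"): (2, "A"), (2, "B"): (1, "B"), (2, "C"): (1, "C"),
}
sends = virtual_sends(mapping, physical_sender=(1, "B"), receiver_name="C")
```

Here a single physical message from B to C stands for one virtual send in each of the two versions.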

5.4 Message Cloning

We have already seen that cloned versions start to diverge immediately after the clone set has been defined. As cloned versions of a parallel simulation progress, physical processes may need to be cloned, or messages originating from an un-cloned physical process may need to be forwarded (cloned). This section examines the process of determining whether or not a message must be cloned.

A message is cloned when the same physical sender is shared among two or more virtual senders, but the virtual receivers do not share the same physical receiver. To continue the above example, consider the case where process B now sends a message to process A (see Figure 7). Because both versions of the sender of the message are represented by the same physical process but there are two different physical processes representing the receiver, two copies of the message must be sent, one to each physical receiver. The physical receiver determines this by inspecting VSendSet and VRcvSet.

Fig. 7. Cloning a message.

In the example, when $P^1_B$ sends a message to $P^1_A$, VSendSet = {1, 2}, since $V^1_B$ and $V^2_B$ map to the sender $P^1_B$, and VRcvSet = {1}, since only $V^1_A$ maps to $P^1_A$. Each virtual process in the send set should have a corresponding virtual receiver. If the receiving physical process does not have a corresponding virtual receiver in the receive set then a copy of the message is forwarded to the appropriate physical logical process. In the example (see Figure 7), since version 2 is in the send set but not in the receive set, $P^1_A$ forwards the message to $P^2_A$.

The algorithm that determines which versions of the physical LP are to receive a cloned copy of the message is outlined below:

(1) get VSendSet from message
(2) get VRcvSet from local state
(3) forwardSet = $\mathit{VSendSet} \cap \overline{\mathit{VRcvSet}}$
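The forwarding rule can be sketched with Python sets. Following the Figure 7 example, the forward set is read here as the versions that are in the send set but not in the receive set; the function name `forward_set` is hypothetical.

```python
def forward_set(v_send_set, v_rcv_set):
    """Versions that need a forwarded (cloned) copy of the message.

    These are the versions represented by the sender that have no
    virtual receiver mapped to the receiving physical LP.
    """
    return set(v_send_set) - set(v_rcv_set)


# Figure 7 example: the sender is shared by versions 1 and 2,
# but the receiving physical LP only represents version 1.
to_forward = forward_set({1, 2}, {1})

# Computational sharing: both ends are shared, so nothing is forwarded.
nothing = forward_set({1, 2}, {1, 2})
```

An empty result means the message is delivered once and no copy is needed.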

5.5 Process Cloning

A physical logical process is cloned if (1) there is a virtual logical process in the receive set that should not be influenced by the incoming message or if (2) the destination address of the message is a physical logical process that has not yet been created.

Consider our running example of three logical processes. $P^1_B$ clones itself after receiving a message from $P^1_A$ (see Figure 8), since version 2 in the receive set should not be influenced by a message that is coming from a simulation that has been cloned ($P^1_A$ has cloned $P^2_A$). This corresponds to the sending of a physical message from $P^1_A$ to $P^1_B$. Because $V^2_B$ should not receive the message, version 2 of the virtual receiver should be prevented from being influenced by this message.

To implement this, the process is cloned before the event is processed so that the state of the physical process is not influenced by the incoming message. The new “version 2” of physical process B avoids the misconception that A processed the message. In this scenario the send set is {1} because the set contains version 1 of virtual senders mapped to physical process $P^1_A$. The receive set is {1, 2} since both version 1 and version 2 map to the physical receiver $P^1_B$.

Fig. 8. Creation of a physical logical process (Case 1).

Cloning occurs when the physical receiver maps to a virtual version not included in the virtual send set. Note that this case also includes the sending of a message to a receiver that has not yet been cloned (i.e. it is sent from a version number that is higher than the physical receiver). This case is illustrated in Figure 9. $P^2_A$ sends a message to $P^2_B$, but $P^2_B$ does not exist. So, $P^2_B$ is cloned from the process that contains the virtual logical processes mapped to the receiver (i.e. $P^1_B$).

Fig. 9. Creation of a physical logical process (Case 2).

Timely cloning of logical processes prevents receivers from being influenced by messages from versions not contained in the receive set. This is achieved by physically cloning processes and changing the mapping between virtual and physical processes. The physical logical process that receives the message computes a set called VPsMoveToClone to determine if it needs to be cloned. This set contains the virtual logical processes that should be prevented from being affected by the message. If the set is non-empty, a new physical logical process is cloned. The new physical process is assigned the same version number as the lowest virtual LP in VPsMoveToClone and these virtual LPs are moved to the new cloned process. To accommodate the second case of cloning, a set called VPsRequested is defined. This set points to the virtual LPs that should be influenced by the message. If the version number of the physical receiver is not in this set, then these LPs are assigned to the VPsMoveToClone set. The algorithm used to determine process cloning is listed below:

(1) get VSendSet from message
(2) get VRcvSet from local state
(3) VPsUnaffected = complement(VSendSet) ∩ VRcvSet
(4) VPsMoveToClone = VPsUnaffected
(5) VPsRequested = VSendSet ∩ VRcvSet
(6) if ( ! (PhysicalReceiver ∩ VPsRequested) ) then VPsMoveToClone = VPsRequested
(7) if ( VPsMoveToClone ≠ ∅ ) then Cloned Physical LP ← lowest( VPsMoveToClone )
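The listed steps can be sketched in Python as follows. This is a minimal illustration, not the Clone-Sim code: it assumes that step (3) intersects the complement of VSendSet with VRcvSet, which is what the prose definition of VPsMoveToClone implies (the receive-set versions that the sender does not cover). The function name, the use of integer version numbers, and the return convention are illustrative assumptions.

```python
def decide_process_cloning(v_send_set, v_rcv_set, receiver_version):
    """Return (clone_version, vps_to_move); clone_version is None if no clone is needed."""
    # Steps 3-4: receive-set versions not covered by the sender must not see the message.
    # (Assumption: step 3 uses the complement of VSendSet.)
    vps_move_to_clone = v_rcv_set - v_send_set
    # Step 5: versions that should be influenced by the message.
    vps_requested = v_send_set & v_rcv_set
    # Step 6: if the physical receiver's own version is not requested,
    # the requested versions are the ones moved to the clone instead.
    if receiver_version not in vps_requested:
        vps_move_to_clone = vps_requested
    # Step 7: a non-empty set triggers cloning; the clone takes the lowest version.
    if vps_move_to_clone:
        return min(vps_move_to_clone), vps_move_to_clone
    return None, set()


# Case 1 (Figure 8): send set {1}, receive set {1, 2}, receiver is version 1.
case1 = decide_process_cloning({1}, {1, 2}, 1)
# Case 2 (Figure 9): version 2 sends to a receiver that exists only as version 1.
case2 = decide_process_cloning({2}, {1, 2}, 1)
# Shared computation: sender and receiver cover the same versions.
shared = decide_process_cloning({1, 2}, {1, 2}, 1)
```

Under these assumptions both figure cases move version 2 to a new physical process with version number 2, and no clone is created when the send and receive sets coincide.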

5.6 The Four Cases of Cloning

To summarize, four outcomes are possible when a physical LP receives a message. The LP must be prepared to handle each of these four cases based on the mapping of the sender and receiver before receiving the message:

—Computational Sharing: The same physical sender and receiver are shared between virtual logical processes (e.g., Figure 6, virtual processes $V^1_B$ and $V^2_B$ both map to $P^1_B$ and both $V^1_C$ and $V^2_C$ map to $P^1_C$). No message or process cloning is required.

—Message Cloning: The same physical sender is shared between virtual senders but the virtual receivers do not share the same physical receiver (e.g., Figure 7, $V^1_B$ and $V^2_B$ both map to the sender $P^1_B$, but $V^1_A$ and $V^2_A$ map to different physical processes). The message must be replicated and sent to both physical destinations.

—Process Cloning: The physical receiver maps to a virtual version that is not mapped by the sender (e.g., Figure 8, virtual receiver $V^2_B$, which is mapped to physical receiver $P^1_B$, should not see the message from virtual sender $V^2_A$). Note that this case also includes the sending of a message to a receiver that has not yet been cloned. The physical process that does not have a corresponding virtual sender must be cloned.

—Replicated: Virtual logical processes are independent of each other and do not share physical logical processes. No process or message cloning is required.
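One way to read the four cases is as a classification over the sender's virtual send set and the current virtual-to-physical mapping of the receiver. The Python sketch below is an interpretation under stated assumptions, not the authors' code: physical LPs are identified by version number, `receiver_map` is a hypothetical dictionary from virtual version to the hosting physical receiver, the send set is non-empty, and the order of the checks is a choice made here. A message could in principle need both message and process cloning; the sketch reports only the first match.

```python
def classify_delivery(send_set, receiver_map):
    """Classify a message sent on behalf of the virtual versions in send_set.

    receiver_map maps each virtual version to the version number of the
    physical LP that currently hosts the receiving virtual LP.
    """
    physical_receivers = {receiver_map[v] for v in send_set}
    if len(physical_receivers) > 1:
        # The virtual receivers live on different physical LPs.
        return "message cloning"
    (receiver,) = physical_receivers
    receive_set = {v for v, p in receiver_map.items() if p == receiver}
    if not receive_set.issubset(send_set):
        # The receiver hosts a version that the sender does not cover.
        return "process cloning"
    if len(send_set) > 1:
        return "computational sharing"
    return "replicated"


figure6 = classify_delivery({1, 2}, {1: 1, 2: 1})   # both versions share one receiver
figure7 = classify_delivery({1, 2}, {1: 1, 2: 2})   # receivers already split
figure8 = classify_delivery({1}, {1: 1, 2: 1})      # receiver hosts an uncovered version
separate = classify_delivery({1}, {1: 1, 2: 2})     # nothing is shared
```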



Fig. 10. Views of simulations: Traditional parallel discrete event simulation is shown on the left; the monitoring layer called Interactive-Sim in relation to the simulation executive and the simulation application is shown on the right.

Before proceeding to the discussion on the performance of the cloning algorithm, we discuss how cloning is implemented.

6. CLONING IMPLEMENTATION

We define a simulation as being composed of a simulation application (provided by the user) that interacts with a simulation executive. The simulation executive provides primitives that allow simulation programmers to define their own applications, and it implements the necessary underlying synchronization protocols. This is a layered software system, with the operating system at the bottom, the simulation executive in the middle and the simulation application at the top (see Figure 10).

Cloning is implemented in a software library named Clone-Sim that provides an application programmer interface (API) to the system. The library supports on-demand “cloning” of parallel and distributed discrete event simulations. Efficiency is accomplished by intercepting communication primitives (function calls) between the user application and the simulation executive. The approach will work for interactive as well as noninteractive environments, and both optimistic and conservative simulators can be supported. The results reported here were obtained with a Clone-Sim implementation using GTW, Georgia Tech’s Time Warp simulation executive [Das et al. 1994], an optimistic simulator.

The architecture of Clone-Sim is shown on the right of Figure 10. Interactive-Sim exists as a software layer between the simulation executive and the user’s application simulation program. Interactive-Sim is decomposed into submodules, where each is implemented for a particular simulation executive. The submodules are “pluggable” in that the appropriate submodule is plugged in for a specified simulation executive. New submodules can be implemented using a specified application interface. The general idea behind Interactive-Sim is that it is transparent to the simulation program, as well as to the programmer utilizing the cloning primitives.

Clone-Sim is composed of two primary modules: Interactive-Sim, which is layered between the simulation executive and the application simulation, and the Clone-DB database. Clone-DB is independent of the synchronization primitives of the simulation executive. Interactive-Sim (1) intercepts message sends and (2) processes events. Accordingly, Interactive-Sim emulates the simulation executive’s message send and message receive function prototypes to intercept the invocation to process events or to forward copies of a message to cloned LPs. After interception, Interactive-Sim queries Clone-DB to determine whether message or process cloning is necessary (via respective inquire functions). Eventually Clone-Sim calls the simulation executive’s actual message send and receive functions to accomplish the simulation. From the point of view of the simulation executive, Interactive-Sim is a simulator application. Details of the application programmer interface of cloning are beyond the scope of this paper but are described elsewhere; please see Hybinette and Fujimoto [2002].
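The interception idea on the send path can be illustrated with a small Python sketch. All class and method names here (`StubExecutive`, `CloneDB.inquire_message_cloning`, `InteractiveSim.schedule_event`) are hypothetical stand-ins and not the Clone-Sim or GTW API. The send set is passed explicitly for brevity; in a layer that emulates the executive's prototypes exactly, it would presumably be taken from the sender's state.

```python
class StubExecutive:
    """Stand-in for a simulation executive; it only records scheduled events."""

    def __init__(self):
        self.scheduled = []

    def schedule_event(self, dest, timestamp, payload):
        self.scheduled.append((dest, timestamp, payload))


class CloneDB:
    """Hypothetical store mapping (virtual LP name, version) to a physical LP id."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)

    def inquire_message_cloning(self, lp_name, send_set):
        # One physical destination per distinct mapping; more than one
        # destination means the message has to be replicated.
        return sorted({self.mapping[(lp_name, v)] for v in send_set})


class InteractiveSim:
    """Layer that sits between the application and the executive."""

    def __init__(self, executive, clone_db):
        self.executive = executive
        self.clone_db = clone_db

    def schedule_event(self, lp_name, timestamp, payload, send_set):
        for dest in self.clone_db.inquire_message_cloning(lp_name, send_set):
            # Forward one copy per physical destination through the real
            # executive, tagged with the send set for the receiver's use.
            self.executive.schedule_event(
                dest, timestamp, (frozenset(send_set), payload)
            )


executive = StubExecutive()
db = CloneDB({("A", 1): "P1A", ("A", 2): "P2A"})
layer = InteractiveSim(executive, db)
layer.schedule_event("A", 12.5, "hello", {1, 2})
```

In this example versions 1 and 2 of LP A map to different physical processes, so the single application-level send results in two executive-level sends.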

6.1 Assumptions

The relationship between the user application and the simulation executive is illustrated on the left of Figure 10. Here, the user application defines the events and the simulation executive handles synchronization issues. It is assumed that the simulation executive provides ScheduleEvent (send and schedule) and ProcessEvent (receive) primitives. If the application uses GTW as a simulation executive, it must define ProcessEvent and tell GTW when to schedule the event. GTW then schedules a call to ProcessEvent at the appropriate time.

We also assume:

—that one can schedule events conservatively, that is, the event can be scheduled with the assumption that it will never roll back (this is trivial if the simulation executive is conservative);

—that the simulation executive supports dynamic LP creation and allocation, initialization and copying of LPs;

—that simultaneous events are addressed by the underlying simulation engine (e.g., by prioritization).

Note that Clone-Sim ensures that simultaneous events do not interfere with cloning. This is done by prioritizing cloning events (such as instantiations of new simulations and physical processes) above all other events. The next section describes how cloning has been implemented using an optimistic simulator.

6.2 Implementation of the Optimistic Protocol

Clone-Sim has been implemented using GTW, Georgia Tech’s Time Warp simulation executive [Das et al. 1994]. GTW is an optimistic simulator. Recall that events in an optimistic simulation may roll back. As mentioned above, the event that instantiates cloning must be a conservative event (guaranteed to never roll back). However, subsequent LP cloning events (within the same version of the simulation) may roll back.

A key challenge in the optimistic implementation is handling the rollback of events that may have triggered the cloning of a new physical LP. To address this, we use a special rollback function to keep LP mapping information consistent. When a message is received by an LP and the appropriate recipient does not yet exist, the parent accepts the event, clones itself, then sends a copy of the message to activate the new child LP. As a result, the original destination field in the event is overwritten with the parent’s address. Before the destination is overwritten, however, it is saved in a separate field. During rollback, the original destination and mapping information is restored via the rollback function.
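The save-and-restore step can be sketched as follows. This is an illustrative Python model only; the `Event` fields, the `mapping` dictionary, and both function names are assumptions introduced here and do not reflect GTW's event layout or the actual rollback function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    dest: str                          # physical LP the event is addressed to
    timestamp: float
    saved_dest: Optional[str] = None   # original destination, kept for rollback


def accept_for_missing_child(event, parent, mapping, version):
    """Parent accepts an event addressed to a child LP that does not exist yet."""
    event.saved_dest = event.dest          # save before overwriting
    event.dest = parent                    # the parent handles the event
    mapping[version] = event.saved_dest    # the version now maps to the new child


def rollback(event, parent, mapping, version):
    """Undo the acceptance: restore the destination and the old mapping."""
    event.dest = event.saved_dest
    event.saved_dest = None
    mapping[version] = parent              # the version maps back to the parent


event = Event(dest="P2B", timestamp=40.0)
mapping = {1: "P1B", 2: "P1B"}             # both versions still share P1B
accept_for_missing_child(event, "P1B", mapping, 2)
after_accept = (event.dest, event.saved_dest, dict(mapping))
rollback(event, "P1B", mapping, 2)
after_rollback = (event.dest, event.saved_dest, dict(mapping))
```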

Two main data structures are used by the cloning mechanism: a relationship table and a mapping table. The relationship table keeps track of the shape of the version tree: which version cloned which other versions. This structure is modified when a simulation is cloned (not each time an LP is cloned). The data structure is read by LPs, upon the receipt of a message, to determine which other LPs are their children.

The mapping table is maintained by active LPs. The mapping table is part of the local state of the physical logical processes. This table is exclusive to the owner of the state. However, during instantiation of a clone, the parent initializes the index that belongs to its child. Thereafter, the child itself maintains its own information.

This mapping table contains (1) mapping information, (2) “birth” timestamps and (3) a bit indicating whether the corresponding LP is active or not. The mapping table is referenced to determine whom to clone and to whom messages should be forwarded. The VRcvSet discussed in Section 5 is constructed from the mapping table. It is also referenced to determine which LP to roll back or to determine which LP should handle a message when it is addressed to an LP that did not exist before the timestamp of the message.
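A minimal Python sketch of such a mapping table and of deriving VRcvSet from it is given below. The field names, the choice of a dictionary keyed by virtual version, and the rule that only active entries contribute to the receive set are assumptions made for illustration; the source specifies only the three kinds of information stored.

```python
from dataclasses import dataclass


@dataclass
class MappingEntry:
    physical_version: int  # (1) physical LP that hosts this virtual version
    birth_time: float      # (2) "birth" timestamp
    active: bool           # (3) whether the corresponding LP is active


def v_rcv_set(table, physical_version):
    """Virtual versions currently mapped to the given physical LP."""
    return {
        version
        for version, entry in table.items()
        if entry.active and entry.physical_version == physical_version
    }


# Versions 1 and 2 still share physical LP 1; version 3 has its own physical LP.
table = {
    1: MappingEntry(physical_version=1, birth_time=0.0, active=True),
    2: MappingEntry(physical_version=1, birth_time=1200.0, active=True),
    3: MappingEntry(physical_version=3, birth_time=1200.0, active=True),
}
shared_versions = v_rcv_set(table, 1)
```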

7. CLONING EVALUATION

In order to establish the utility of the cloning mechanism, we evaluated the system in three ways. First, we presented performance results from a real world task and examined typical usage scenarios. Second, we looked at the performance improvement of cloning while controlling the rate at which physical LPs must be replicated. Finally, we examined the scalability of the approach both analytically and empirically.

Evaluation of cloning was conducted on two machines: an SGI Origin 2000 with sixteen 195 MHz MIPS R10000 processors and an SGI Power Challenge with twelve 75 MHz MIPS R8000 processors. On the Origin, the first level instruction and data caches are each 32 KB. The unified secondary cache is 4 MB. The main memory size is 4 GB. The first level instruction and data caches on the Power Challenge are each 16 KB, and the unified secondary cache is 4 MB. The main memory size of the Power Challenge is 3 GB. Each data point represents at least 5 runs for the Power Challenge and at least 15 runs for the Origin. The experiments use 4 processors on the Power Challenge unless otherwise indicated.

7.1 Air Traffic Control Simulation

To investigate the utility of cloning for real world tasks we implemented the cloning mechanism and examined typical usage scenarios for a commercial air traffic control simulation called the Detailed Policy Assessment Tool (DPAT). DPAT simulates air traffic patterns and computes congestion related delays and throughputs. The software can simulate both international airspace (Asia and Taiwan) and the continental US (CONUS). For more detail on DPAT please see Wieland [1998].

Fig. 11. DPAT usage scenarios. Left: several clones generated on several MIT values at the same simulation time. Right: one MIT value is evaluated at several initiation times. Times of the decision points are shown as dashed vertical lines.

The system models the behavior of each airport and the airspace between them. The airspace is divided into sectors. Each sector can enforce restrictions such as the number of aircraft occupying a sector at a specific time. In these experiments, two control restrictions affect the flow of aircraft: ground-delay programs and miles-in-trail (MIT).

Departing aircraft can be regulated by ground-delay programs. These restrictions prevent aircraft from departing an airport in order to avoid circling above the destination airport (which wastes fuel and can increase the risk of accidents).

En-route traffic is controlled by miles-in-trail restrictions that set a minimum distance between aircraft. The allowed minimum distance differs depending on weather conditions. In good weather the minimum distance allowed is typically 2 miles, while in bad weather (decreased visibility), the minimum distance is typically increased to 5 miles for greater safety.

The FAA can evaluate the effect of MIT restrictions and ground-delay programs on traffic flow using a simulation like DPAT. For example, an air traffic controller may wish to increase safety by increasing MIT restrictions provided this does not adversely affect the traffic flow rate.

Using dynamic cloning, we anticipate that more alternative MIT restrictions can be evaluated to manage the air space than if the simulation were simply replicated. To investigate this hypothesis, we evaluated typical usage scenarios of the DPAT software.

Two questions controllers often face regarding MIT restrictions are: What MIT value is the most beneficial, and when should a new MIT value be enforced? These questions suggest two scenarios (depicted in Figure 11).

On the left, alternative values of MIT restrictions are imposed at the same time in the future. The figure depicts scenarios where a controller considers MIT values between 10 and 14. Each of the restrictions is activated 1,200 minutes from “now” (i.e., 20 hours from the current time). This point in time is labeled the “cloning point.” We refer to this as the “forked” scenario.

On the right of Figure 11, different clones evaluate different points in time when a particular MIT restriction is imposed. Here, different simulations evaluate the same MIT restriction (11) enforced at various times in the future. For reference, the original MIT restriction of 10 is also evaluated. The time to enforce the restriction varies between 600 minutes from the current time to 730 minutes (the evaluations are initiated at 30 minute intervals). This is referred to as the “staggered” scenario.

The experiments using DPAT simulate 50 hours (3,000 minutes) of air traffic over the Continental US. This configuration consists of 738 sectors and 520 airports, or a total of 1,258 logical processes. The total number of flights scheduled is 53,860. Typically, this scenario induces 349 delays of more than 15 minutes. Cloned simulations are instantiated on a sector near Chicago’s O’Hare airport, sector “ZLC1N.”

To stress the cloning mechanism, the experiments use an initial MIT restriction of 70. This is much larger than values typically used by the FAA, but larger numbers cause more congestion and thus lead to faster spreading of replicated LPs, making it a more challenging test case of the cloning software. Smaller MIT values (such as those shown in the usage scenarios) lead to significantly better cloning performance. Cloning is evaluated by instantiating clones that increase the MIT value by a specific percentage. In the forked scenario each successive clone increases the value by 10%. In the staggered scenario each clone evaluates the same MIT value. The time difference between successive clones in the staggered scenario is 30 to 60 minutes.

DPAT is implemented on GTW (an optimistic parallel simulator) and is linked to the cloning library Clone-Sim. The DPAT source code includes 20,133 lines of C. Additional instructions are inserted into the sector code to detect preset cloning times and to identify the proper logical process (i.e., it ensures that ZLC1N induces cloning in the cloning set). At the proper time, a conservative event schedules a cloning event, which in turn clones the process and sets the MIT value. The instructions added to DPAT to detect and initiate cloning were fewer than 100 lines of code.

7.2 Performance

We conducted a number of experiments to investigate the impact of dynamic cloning on DPAT simulation performance. For comparison, a batch-processing scheme, where each simulation path is run separately, one after the other, is compared with the dynamic cloning scheme. Performance is evaluated as speedup of the run time of cloning versus batch processing. Speedup is the total running time of the batch processing approach divided by the running time using dynamic cloning.

The independent variables of this experiment are the inception time of the cloning point (in simulated time) and the number of simulations. Both the “forked” and “staggered” scenarios are evaluated.

The plot on the left of Figure 12 shows the performance of DPAT on an Origin when several clones are generated on different MIT values at the same simulation time (the forked scenario). The plot on the right shows the performance of DPAT when one MIT value is evaluated at several different initiation times (for reference, the performance of a single simulation, where no clones are generated, is shown as well).

Fig. 12. Simulation cloning significantly improves the performance of DPAT over batch processing. Left: speedup of cloning over batch processing for DPAT when several clones are generated on different MIT values at the same simulation time. Right: speedup of DPAT when one MIT value is evaluated at several initiation times. Larger numbers indicate better performance.

The results indicate that dynamic cloning always outperforms the traditional batch scheme when running multiple simulations. As one would expect, simulation cloning becomes more advantageous as the number of alternatives increases or the later the decision point is inserted. The plots show that dynamic cloning achieves speedups of up to about 2.4 over the batch scheme (e.g., for simulating different MIT values: running 5 different MIT values at 2,500 simulated minutes takes 75.38 seconds for the dynamic scheme versus 183.60 seconds for the batch scheme).
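Using the definition of speedup given at the start of this section (batch running time divided by cloning running time), the example timings quoted above can be checked directly. The function name is illustrative.

```python
def speedup(batch_seconds, cloning_seconds):
    """Speedup of dynamic cloning over batch processing."""
    return batch_seconds / cloning_seconds


# DPAT, 5 different MIT values cloned at 2,500 simulated minutes.
dpat_speedup = speedup(183.60, 75.38)
print(f"{dpat_speedup:.2f}")  # prints 2.44
```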

7.3 Message Spreading Rate

The benefit of the incremental update scheme is evaluated by varying the rate at which one clone causes the creation of additional physical LPs. Our incremental update scheme avoids replicating the entire state space upon cloning the simulation. To quantify this effect we use a benchmark called P-Hold. P-Hold provides synthetic workloads using a fixed message population [Fujimoto 1990]. Parameters of the application allow us to control the spreading rate.

P-Hold works as follows: each LP is instantiated by an event. Upon instantiation, the LP schedules a new event with a specified time-stamp increment and destination. The destination LP is specified within the message.

To control the spreading rate of P-Hold, we adjusted the selection of the destination LP of a message (logical processes are instantiated by receiving messages). In the first case the spreading is slowed by having an LP send messages only to itself or to its neighbor (the local distribution). In the second case the spreading is not constrained and the destination is selected from a uniform distribution among all logical processes in the simulated system. The timestamp increment to schedule a new event is 1.0 in the local case and selected from an exponential distribution between 0.0 and 1.0 in the uniform case. When multiple simulations are instantiated, each new clone is instantiated with the same time stamp. The experimental P-Hold simulations use a message population of 8096 and 1024 logical processes (for more detail on P-Hold please see Fujimoto [1990]).
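The two destination policies can be sketched in Python as follows. This is a rough illustration rather than the benchmark's code. Two details are assumptions made here: the neighbor of LP i is taken to be LP i + 1 (wrapping around), and the exponential increment uses a mean of 1.0, since the text does not state the distribution's parameter.

```python
import random


def next_event(lp, now, n_lps, mode, rng):
    """Return (destination LP, timestamp) of the event that `lp` schedules."""
    if mode == "local":
        # Slow spreading: send to self or to one neighbor.
        # (Assumption: the neighbor is lp + 1, wrapping around.)
        dest = rng.choice([lp, (lp + 1) % n_lps])
        increment = 1.0
    elif mode == "uniform":
        # Fast spreading: any LP in the system may be the destination.
        dest = rng.randrange(n_lps)
        # Assumption: exponential increment with mean 1.0.
        increment = rng.expovariate(1.0)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return dest, now + increment


rng = random.Random(0)
local_dest, local_time = next_event(lp=5, now=10.0, n_lps=1024, mode="local", rng=rng)
uniform_dest, uniform_time = next_event(lp=5, now=10.0, n_lps=1024, mode="uniform", rng=rng)
```

In the local policy a cloned LP can only affect itself and one neighbor per event, which is why replicated LPs spread slowly compared with the uniform policy.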



Fig. 13. Incremental updating improves the performance of P-Hold over batch processing. Several clones are generated at the same simulation time. Left: Speedup where destination LP is either self or nearest neighbor, the local case (slower spreading). Right: Speedup of cloning over batch processing where destination LP is selected from a uniform distribution of all LPs in the simulation (fast spreading). Larger numbers indicate better performance.

7.4 Performance

The independent variables for this experiment are the inception time of the cloning point (in simulated time) and the number of simulations. The cloning time varies between 0 and 600 simulated seconds in the local case and between 0 and 180 simulated seconds in the uniform case. There are zero to five new clones created for each case of P-Hold, making it a total of 1 to 6 simulations run simultaneously.

The plot on the left of Figure 13 shows the performance of P-Hold in the local case versus batch processing. The plot on the right of Figure 13 shows the performance of P-Hold in the uniform case versus batch processing. For reference, the results of a single simulation, where no clones are generated, are shown as well.

In the local case, dynamic cloning always outperforms the traditional batch scheme approach. In the uniform case, however, the batch scheme performs better if cloning is initiated early and when evaluating more than 3 alternatives. This is because the spreading is quicker, and as a result very few computations are shared between the clones. If the cloning event occurs later, more computations are shared and again dynamic cloning outperforms batch processing.

As in the DPAT experiments, the performance advantage of cloning becomes more dramatic the later clones are spawned or the larger the number of alternatives. For the uniform case, dynamic cloning achieves speedups of up to 3.13, and for the local case up to about 4.5 (e.g., instantiating 6 clones at virtual time 550 takes 53.04 seconds for the dynamic scheme versus 239.27 seconds for the batch processing scheme). As one would expect, the plots indicate that cloning is more advantageous for simulations using local interactions.

The dramatic improvement over batch processing is due to the incremental updating scheme. The performance gain of the local case highlights that incremental updating further improves the benefit of dynamic cloning when the behavior of the cloned simulations diverges slowly.



8. SCALABILITY

We now examine the scalability of cloning. Specifically, we examine how cloning impacts the performance as we increase the number of clones. For a replicated simulation application with N execution paths, each LP is duplicated N times. In many cases, messages are duplicated N times as well. On average, we expect a replicated simulation to consume N times the resources (in terms of space and time) as a simulation with a single execution path.

In the best case our cloning approach is significantly more efficient than replication (as much as N times faster). In the worst case, cloning is only slightly less efficient than replication. In this section we show that this efficiency is scalable with respect to the number of clones. By that we mean that cloning maintains its advantage over replication regardless of the number of execution paths.

When evaluating the performance and scalability of a cloned simulation it is important to keep in mind that there are three key phases of an application’s “life cycle” as follows:

—Before decision point: in this phase the simulation has not been cloned, and there is a single execution path. It is during this phase that cloning offers a tremendous advantage over replicated simulations. All events processed in this phase run once, whereas in the replicated case the events must be processed redundantly in each copy of the simulation.

—Spreading: our approach uses an incremental method that only duplicates (clones) LPs as necessary. At the decision point only a small number of LPs must be replicated. In this phase a large proportion of computations can be shared, and the approach retains a significant advantage over replicated simulations. However, as the replicated LPs interact with other LPs it becomes necessary to clone the affected LPs as well.

—Spreading complete: eventually all LPs will have been cloned. From this point on, the fully cloned simulation must process the same number of events as a replicated simulation. The additional overhead of cloning in this case is a single comparison per event. Cloned simulations in this phase are only slightly more expensive than replicated simulations.

The performance improvement of a cloned simulation in comparison to replicated simulations depends on the relative length of each of these phases. For instance, if the decision point occurs late in the simulation, the cloned simulation would have a tremendous performance advantage over replicated simulations. On the other hand, the worst case for cloned simulations is an early decision point combined with rapid spreading. In this case the cloned simulation’s performance will be slightly worse than a replicated simulation (an extra comparison for every event).

Before proceeding to empirical results, we consider how the number of clones and the number of processing elements (PEs) may impact the scalability of the cloning algorithm.



8.1 Scalability: Number of Clones

Before the first decision point, every message applies to every possible clone and the work of processing a message is automatically shared between the future cloned execution paths. In comparison to a single copy of a replicated simulation, our cloning algorithm requires an additional operation to determine if it is in this phase. Because the results of events processed during this phase are automatically shared between the future clones, we realize up to N times higher performance with respect to replicated simulations.

After spreading is complete, each message applies to only one clone, so additional processing to determine whether it should be forwarded is not required. The overhead of cloning in this phase is just one comparison per received message: the LP must determine whether the simulation has reached the “fully replicated” phase of the simulation. Because of the additional comparison per event, performance of the cloning approach is slightly worse than replication in this phase of the simulation.

Analysis of performance during the spreading phase is a bit more complicated. As the number of clones is increased, performance is impacted during the spreading phase in three ways. As the number of clones increases: (1) each LP must maintain additional state information to keep track of the set of clones to which it belongs, (2) the amount of information appended to messages increases because the sender must communicate which clones it belongs to (see Section 5 for additional details), and (3) the time required for an LP to process each incoming message increases.

Items (1) and (2) are information storage, or space, costs. Additional information is needed at the LPs and in the messages during the spreading phase to enable a receiving LP to compute whether it will need to clone itself or perhaps forward the message to a cloned version of itself.

We can estimate the storage requirements for (1) and (2) using information theory as follows. Assume there are N clones. In addition to the content of the message used by the simulation application, the sender must also communicate which subset of the N clones it belongs to; there are $2^N - 1$ possible subsets, or $2^N - 1$ possibilities for this part of the message. According to Shannon, if all of the approximately $2^N$ possible subsets are equally likely, the information content of this portion of the message is $\log_2(2^N - 1) \approx N$ bits [Shannon and Weaver 1949]. Essentially, one bit is required for each clone. This is the worst case. If we know a priori that certain subsets are more likely than others, advanced coding techniques may be employed to reduce this space overhead (e.g., Huffman coding, see for example Cover and Thomas [1991]).

In terms of information storage requirements, the overhead scales linearly with respect to the number of clones. In comparison, N replicated simulations would require the same message to be sent N times. In a cloned simulation we can send one message with N additional bits appended. We are, in effect, compressing each additional message into one bit.

During the spreading phase, the time to process a message once an LP receives it increases linearly with respect to the number of clones. Processing a message involves at most: (1) four set intersections, (2) three set inverses, and (3) forwarding the message to the corresponding cloned LPs (see Section 5 for more details on the computation). It is possible on some hardware for (1) and (2) to be computed in constant time (otherwise, it is at worst O(N / sizeof(largest bit vector on hardware))). If parallel multicast is available, (3) can be accomplished in constant time as well. Empirically, the cost for these operations is insignificant in comparison to the other costs of processing a received message. In other words, cloning overhead is much less expensive than duplicating the entire processing of a message as would occur in a replicated simulation.
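The one-bit-per-clone encoding and the bit-vector form of the set operations can be illustrated with Python integers used as bit masks. This is a sketch under assumptions: clone numbers are taken to be 1-based, and the helper names are invented here. On real hardware each operation covers one machine word, which is where the O(N / word size) bound comes from; Python integers simply hide the word boundaries.

```python
def to_mask(versions):
    """Encode a set of clone numbers (1-based) as a bit vector, one bit per clone."""
    mask = 0
    for v in versions:
        mask |= 1 << (v - 1)
    return mask


def from_mask(mask):
    """Decode a bit vector back into a set of clone numbers."""
    return {i + 1 for i in range(mask.bit_length()) if (mask >> i) & 1}


def unaffected(v_send_mask, v_rcv_mask, n_clones):
    """Receive-set clones outside the send set: one inverse and one intersection."""
    all_clones = (1 << n_clones) - 1          # mask with the N valid bits set
    return (~v_send_mask & all_clones) & v_rcv_mask


send = to_mask({1})
receive = to_mask({1, 2})
result = from_mask(unaffected(send, receive, n_clones=4))
```

With four clones the send set {1} is the mask 0b0001 and the receive set {1, 2} is 0b0011; the inverse and intersection leave 0b0010, that is, clone 2.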

To summarize scalability with respect to the number of clones: Before the first decision point, the overhead of cloning is minimal and performance in comparison to a replicated simulation is almost N times better. After spreading is complete, the overhead, in comparison to a fully replicated simulation, is one comparison operation per event, so performance is slightly worse than replication. Finally, during the spreading phase, required resources (space and time) increase linearly with the number of clones but are always less than the cost of full replication.

8.2 Scalability: Number of PEs

For many parallel algorithms, as the number of PEs is increased, speedup becomes sub-linear because synchronization and communication overheads begin to dominate. Our cloning algorithm, however, only requires synchronization at one point: the initialization of a new clone. No other inter-processor synchronization is necessary. Therefore the cloning algorithm scales with the number of PEs to the same extent as the underlying parallel simulation. The cloning overhead does not increase as the number of processors increases, although of course applications may be subject to inefficiencies in the underlying simulation executive. Our implementation uses the GTW executive, which has been shown to be efficient empirically [Das et al. 1994].

8.3 Scalability: Performance

In this section we evaluate scalability experimentally with regard to (1) the performance as the number of clones increases, (2) the performance improvement as more processors are available to run the simulation, (3) the impact on run time performance as the number of clones (and LPs) increases and (4) the impact on performance for a given number of clones as the message population increases.

To evaluate how the number of clones impacts performance, a suite of experiments was run using the P-Hold benchmark application. The number of clones was varied from 1 to 32, while the number of PEs was varied from 1 to 16. The number of LPs was varied from 4 to 1024, and the message population was varied from 64 to 8192. All simulations ran for 300 simulated seconds. For comparison, we also ran a replicated simulation under the same conditions. In these experiments, the decision point for cloning was set early (at 10 simulated seconds) in the simulation to create a worst-case situation for cloning.

A plot showing representative performance is presented in Figure 14.Performance is run-time excluding initialization and wrap-up. This plot depicts



Fig. 14. Empirical performance of our method scales with the number of clones. Performance ismeasured as the run time of the simulation, excluding initialization and wrap-up. Smaller numbersindicate better performance.

a series of P-Hold runs on 4 processors using 256 logical processes and a mes-sage population of 128. Each data point is the average of 5 runs. This plotshows that as the number of clones increases, run time increases linearly. Italso shows that in all cases, cloning is more efficient than replication. The ad-vantages of cloning over replication scale with respect to the number of clones.Similar results using the DPAT and again the P-Hold application are providedin Section 7.2 and 7.4.

To assess potential sequential bottlenecks we examined the impact on per-formance as the number of processors and the number of clones is varied. Per-formance is measured as the run time of the simulation, excluding initializationand wrap-up (shorter run times indicate better performance).

The plot in Figure 15 illustrates the impact on run time as the number of processors is varied from 1 to 16 and the number of simulations is varied from 4 to 18. These simulations use 256 logical processes and a message population of 1024. All simulations run for 1000 simulated seconds and are cloned midway, at 500 simulated seconds. The run time and number of processors axes are plotted on a log scale. The planar shape of the resulting plotted data shows that speedup is linear with regard to the number of PEs.

8.4 Summary of Scalability Results

In this section we have argued that cloning is almost always more efficient than simulation replication. In some phases of a simulation, we can provide up to N times higher performance (where N is the number of execution paths).

The situation where replicated simulation has an advantage occurs when a decision point is placed early in simulated time and spreading is rapid. This forces the cloned system to fully replicate all LPs; the system must provide a separate execution path for every clone. However, in this worst case, cloning is only slightly less efficient than replication; only one additional comparison is required to process each event.

Fig. 15. Cloning scales with the number of processors. Performance is measured as the run time of the simulation, excluding initialization and wrap-up. Smaller numbers indicate better performance.

Our analysis shows that the advantages of cloning are preserved regardless of the number of clones (or execution paths). This analysis is confirmed empirically through a number of experiments in which multiple parameters of the simulation are varied.

9. PRUNING EVALUATION

Just as it is advantageous to add clones to a running simulation, it may also be useful to enable a user to terminate a simulation that is no longer needed. This releases computational resources so they can be used by the remaining clones. Simulation pruning may be interactive or activated via a trigger function defined at the insertion of the cloning point.

Performance improvements due to pruning can be observed by measuring the progress of virtual time as a function of real time. In these experiments, 2 days of air traffic delays, or 3,000 simulated minutes, are evaluated. Cloning occurs at simulated time 804, and pruning is initiated by a trigger that is installed at the insertion of the decision point. The trigger is set to be activated at simulated time 1,116. The trigger then checks if subsequent events will activate a cloning point. The trigger is a conditional statement that inserts a decision point when executing an event that has a time stamp later than 1,116. In these experiments an event with a time stamp of 1,166.70 inserts a cloning point at the current simulated time (i.e., the cloning point is inserted at 1,166.70).
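
The trigger described above can be pictured as a small conditional that stays dormant until an event with a late enough time stamp is executed. The following is a minimal sketch only, not the interface of the simulator: the class name `PruneTrigger`, its fields, and the method `on_event` are hypothetical, and every event time stamp except 804 and 1166.70 is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PruneTrigger:
    """Hypothetical trigger installed when the decision point is inserted."""
    activation_time: float            # simulated time after which the trigger is armed
    fired_at: Optional[float] = None  # simulated time at which the point was inserted

    def on_event(self, timestamp: float) -> Optional[float]:
        """Return the simulated time of the inserted point, or None if nothing happens."""
        if self.fired_at is None and timestamp > self.activation_time:
            # The point is inserted at the time stamp of the triggering event.
            self.fired_at = timestamp
            return timestamp
        return None


trigger = PruneTrigger(activation_time=1116.0)
# 804 and 1166.70 come from the experiment; the other time stamps are made up.
event_timestamps = [804.0, 1100.25, 1166.70, 1200.0]
fired = [trigger.on_event(ts) for ts in event_timestamps]
print(fired)
```

The trigger fires once, on the first event whose time stamp exceeds the activation time, which is why the point lands at 1166.70 rather than at 1,116.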

The plot on the right of Figure 16 shows three curves: the curve on the left depicts a single simulation, shown for reference; the middle and right curves show runs of two simulations each, where cloning occurs at simulated time 804. The simulation depicted by the right curve is never pruned, while the clone in the simulation plotted in the middle curve is pruned at simulated time 1,166.70.



Fig. 16. Pruning of DPAT.

The acceleration of simulated time due to pruning is noticeable at the activation of the pruning trigger as a substantially increased slope (at simulated time 1,166.70). The pruned curve immediately conforms to the slope of the single simulation curve.

The performance of a cloned simulation that is pruned can be estimated by factoring the single-simulation execution time into the proportion of time spent in the different phases of cloning:

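$$
\begin{aligned}
\text{Estimated Time} ={} & \text{single simulation proportion} \times \text{single time} \\
& + \text{diffusion simulation proportion} \times \text{diffusion savings factor} \times \text{single time} \\
& + \text{replicated simulations proportion before pruning} \times \text{\# prunes} \times \text{single time} \\
& + \text{replicated simulations proportion after pruning} \times \text{\# prunes} \times \text{single time}
\end{aligned}
$$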

The “diffusion savings factor” depends on how much of the processes’ state is shared between clones. In this estimate, pruning is assumed to occur after diffusion¹ (see the illustration on the left in Figure 16). Since a pruned simulation is compared with an unpruned simulation, the first three terms are eliminated in the subtraction.
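
The cancellation can be written out explicitly. The symbols below are shorthand introduced here, not notation from the experiments: $T_s$ is the single-simulation time, $p_1, \dots, p_4$ are the four phase proportions in the order they appear in the estimate, $f$ is the diffusion savings factor, and $m$ and $m'$ stand for the multiplier on the final term in the unpruned and pruned runs respectively. Assuming the two runs are identical up to the pruning time, a sketch of the subtraction is:

$$
\begin{aligned}
T_{\text{unpruned}} &= \left(p_1 + p_2 f + p_3 m + p_4 m\right) T_s \\
T_{\text{pruned}} &= \left(p_1 + p_2 f + p_3 m + p_4 m'\right) T_s \\
T_{\text{unpruned}} - T_{\text{pruned}} &= p_4 \,(m - m')\, T_s
\end{aligned}
$$

When two simulations are running and one of them is pruned, $m - m' = 1$, so the estimated benefit reduces to $p_4 T_s$: the proportion of simulated time remaining after the prune, multiplied by the single-simulation run time.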

As an example, consider the performance curves described above. Here, the simulation completes at simulated time 3,000. Each run completes in real time at 15.93 seconds for the single simulation run, 20.33 seconds for the cloned and pruned run, and 30.30 seconds for the cloned run (note that pruning improves the performance by 33 percent over not pruning the simulation). In this example the pruning heuristic predicts that the performance improvement should be 9.73 seconds:

$$
\text{Estimated Time Benefit} = \frac{3{,}000 - 1{,}166.70}{3{,}000} \times 15.93 = 9.73 \text{ seconds}
$$

¹ Diffusion is the spreading phase of cloning; this is before the simulation is fully replicated.



The actual performance benefit is:

$$
\text{Actual Time Benefit} = 30.30 - 20.33 = 9.97 \text{ seconds}
$$

In this case the estimate differs by 2.4% from the actual benefit. The above methodology can be used for multiple prunes with the same pruning time. For clones with different pruning times, the heuristic may be applied to each clone separately, and the time benefits of the pruned simulations are then added.
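
The arithmetic of this example can be re-checked in a few lines. The sketch below uses only the figures reported above; the variable names are ours, and the choice to express the discrepancy relative to the actual benefit is an assumption about how the 2.4% figure was obtained.

```python
# Figures reported for the DPAT pruning experiment.
END_TIME = 3000.0      # simulated time at which the run completes
PRUNE_TIME = 1166.70   # simulated time at which the clone is pruned
SINGLE_RUN = 15.93     # real seconds, single simulation
PRUNED_RUN = 20.33     # real seconds, cloned and pruned
UNPRUNED_RUN = 30.30   # real seconds, cloned and never pruned

# Heuristic: fraction of simulated time remaining after the prune,
# multiplied by the single-simulation run time.
estimated_benefit = (END_TIME - PRUNE_TIME) / END_TIME * SINGLE_RUN

# Measured benefit: unpruned run time minus pruned run time.
actual_benefit = UNPRUNED_RUN - PRUNED_RUN

# Discrepancy relative to the measured benefit (assumed convention).
relative_error = abs(actual_benefit - estimated_benefit) / actual_benefit

# Improvement of pruning over not pruning.
improvement = actual_benefit / UNPRUNED_RUN

print(f"estimated benefit: {estimated_benefit:.2f} s")
print(f"actual benefit:    {actual_benefit:.2f} s")
print(f"relative error:    {relative_error:.1%}")
print(f"improvement:       {improvement:.0%}")
```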

Note that the implementation allows pruning to occur during the diffusion phase, even though our evaluation heuristic assumes otherwise.

10. CONCLUSIONS AND FUTURE WORK

Simulation cloning enables the exploration of multiple possible futures in interactive parallel simulation environments. The mechanism may be applied to simulations using either conservative or optimistic synchronization protocols.

Simulation cloning improves simulation performance in two ways. First, simulations that are interactively cloned at a decision point are able to share the same execution path before the decision point, so the computations before the decision point only need to be executed once. Second, after cloning, multiple versions of the simulation are able to share computations and messages via a construct called virtual logical processes. Additionally, clones can be created “on the fly” and “pruned” when it can be determined that they will not provide useful results.

The benefit of execution path sharing is demonstrated by evaluating a commercial air traffic control simulation. Our performance results show that cloning can significantly reduce the time required to compute multiple alternate futures. These results also show that dynamic cloning becomes more advantageous as the number of alternatives increases and as the decision point is inserted later.

We also demonstrate that virtual logical processes are especially efficient when the influence of a message introduced by a logical process does not spread to other processes in the simulation at a high rate. This is shown by using a benchmark that varies the rate of cloning other LPs. The advantages of cloning are preserved regardless of the number of clones (or execution paths).

Efficient implementation of the cloning mechanism is achieved by intercepting the communication primitives of the simulator executive. By monitoring send and receive primitives, cloning can determine when to clone a logical process and when to forward a message (clone a message). To accomplish this, the LPs maintain a data structure used to determine mappings to virtual logical processes. Additional information is piggy-backed by the sender on messages so the receiver can determine whether to clone a process or to forward a message.
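
To make the clone-or-forward decision concrete, the sketch below shows one possible bookkeeping scheme. It is a simplified illustration under our own assumptions, not the data structure used in the implementation: the mapping `owner`, the helper `clones_served`, and the function `intercept_send` are hypothetical names. Each (clone, LP) pair maps to the clone whose physical LP actually executes it, and a message is treated as valid for exactly the clones its sender serves, which plays the role of the piggy-backed information.

```python
from typing import Dict, List, Set, Tuple

# (clone id, LP id) -> id of the clone whose physical LP executes that LP.
Owner = Dict[Tuple[int, int], int]


def clones_served(owner: Owner, physical: int, lp: int) -> Set[int]:
    """Clones whose view of `lp` is executed by the physical LP held by `physical`."""
    return {c for (c, l), p in owner.items() if l == lp and p == physical}


def intercept_send(owner: Owner, sender: int, sender_lp: int,
                   receiver_lp: int) -> List[Tuple[str, int]]:
    """Decide, per physical receiver, whether to forward the message or clone the LP."""
    # Piggy-backed information: the clones for which this message is valid.
    valid_for = clones_served(owner, sender, sender_lp)
    actions: List[Tuple[str, int]] = []
    for p in sorted({owner[(c, receiver_lp)] for c in valid_for}):
        served = clones_served(owner, p, receiver_lp)
        if served <= valid_for:
            # Every clone sharing this receiver should see the message.
            actions.append(("forward", p))
            continue
        # Only some sharers should see it: split the shared receiver in two.
        targets = served & valid_for
        split_off = served - valid_for if p in targets else targets
        new_physical = min(split_off)
        for c in split_off:
            owner[(c, receiver_lp)] = new_physical
        actions.append(("clone", p if p in targets else new_physical))
    return actions


# Two clones, three LPs. Clone 1 diverged at LP 0; LPs 1 and 2 are still shared.
owner: Owner = {(0, 0): 0, (0, 1): 0, (0, 2): 0,
                (1, 0): 1, (1, 1): 0, (1, 2): 0}

first = intercept_send(owner, sender=1, sender_lp=0, receiver_lp=1)
second = intercept_send(owner, sender=0, sender_lp=2, receiver_lp=1)
print(first, second)
```

In this example the first send comes from the diverged LP and reaches a shared receiver, so the receiver is cloned for clone 1. The second send comes from an LP that is still shared, so the message is forwarded to both physical copies of the receiver.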

Our work focuses on eliminating inefficiencies arising from the use of replicated trials for testing the impact of different policies on performance (as might be the case in an air traffic control application). However, researchers often use replicated trials with different random number seeds to determine statistical measures such as the mean and confidence interval of a performance variable.



The best use of cloning in stochastic simulation remains an open question. If the random numbers differ at the start (by far the most common scenario), the replications may have little computation in common, and simulation cloning will not be advantageous. However, one can use cloning in conjunction with traditional replication: cloning is used to evaluate alternate parameter settings, and a set of clones is replicated to address the random number issue.

Consider an example where a researcher would like to test five different policies and would like to test each one five times (for statistical significance). Suppose, also, that the different policies are applied mid-way through the simulation. Ordinarily one would have to run 25 replicated simulations to test each of the five policies five times. Cloning can be employed in this example as follows. The simulation is cloned five times at the start; each clone is initialized with a different random number seed. Mid-way through the simulation, each clone is cloned five times (one for each policy). The final result is 25 different execution paths: each policy is tested five times with different random number seeds. This approach is efficient because computations can be shared at each stage of the simulation.
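
A rough count of the work involved shows where the saving comes from. The sketch below is a back-of-the-envelope model under stated assumptions, not a measurement: the function `simulated_time_executed` and its parameters are hypothetical, work is counted in units of simulated time, and the only sharing credited to cloning is the common prefix before the policies are applied. Sharing through virtual logical processes after the decision point would lower the cloned figure further.

```python
def simulated_time_executed(seeds: int, policies: int, run_length: float,
                            policy_time: float, cloned: bool) -> float:
    """Total simulated time that must be executed across all execution paths."""
    if cloned:
        # One shared prefix per seed, then one suffix per (seed, policy) pair.
        prefix = seeds * policy_time
        suffix = seeds * policies * (run_length - policy_time)
        return prefix + suffix
    # Replication: every (seed, policy) pair runs from start to finish.
    return seeds * policies * run_length


# Five seeds, five policies, policies applied mid-way through a unit-length run.
replicated = simulated_time_executed(5, 5, 1.0, 0.5, cloned=False)
with_cloning = simulated_time_executed(5, 5, 1.0, 0.5, cloned=True)
print(replicated, with_cloning)
```

Under these assumptions the 25 replicated runs execute 25 run lengths of simulated time, while the cloned arrangement executes at most 15.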

Currently, it is assumed that a user can interject decision points into the simulator. Defining a mechanism for automating the interjection of decision points is a topic for future research. This may be accomplished by balancing the “running ahead” and “branching” depending on the probability of particular branches against available computational resources and real-time constraints.

ACKNOWLEDGMENTS

The D-PAT application was written by Frederick Wieland of the MITRE Corporation, and we are also grateful for his helpful comments on this work.


Received April 2000; revised March 2001 and January 2002; accepted February 2002

