Storyboard: Optimistic Deterministic Multithreading

Storyboard: Optimistic Deterministic Multithreading

Rudiger Kapitza, Matthias Schunter, and Christian CachinIBM Research - Zurich

{rka,mts,cca}@zurich.ibm.comKlaus Stengel and Tobias Distler

Friedrich-Alexander University Erlangen-Nuremberg{stengel,distler}@cs.fau.de

AbstractState-machine replication is a general approach to ad-

dress the increasing importance of network-based ser-vices by improving their availability and reliability viareplicated execution. If a service is deterministic, multi-ple replicas will produce the same results, and faults canbe tolerated by means of agreement protocols.

Unfortunately, real-life services are often not deter-ministic. One major source of non-determinism is multi-threaded execution with shared data access in which thethread execution order is determined by the run-time sys-tem and the outcome may depend on which thread ac-cesses data first.

We present Storyboard, an approach that ensures de-terministic execution of multi-threaded programs. Sto-ryboard achieves this by utilizing application-specificknowledge to minimize costly inter-replica coordinationand to exploit concurrency in a similar way as non-deterministic execution. This is accomplished by mak-ing a forecast for a likely execution path, provided as anordered sequence of locks that protect critical sections.If this forecast is correct, a request is executed in parallelto other running requests without further actions. Only incase of an incorrect forecast will an alternative executionpath be resolved by inter-replica coordination.

1 Introduction

Network-based services have constantly increased in im-portance for our every-day life. Accordingly, there is atrend to offer these services 24/7 by making them toler-ate hard- and software faults ranging from plain crashesto arbitrary faults. Besides providing a well-functioningservice, this requires additional measures such as state-machine replication [14] to accomplish the demandeddegree of availability and reliability.

While in theory a service is a deterministic state ma-chine, most real-world service implementations are not.

This is especially the case for modern services that areimplemented using concurrent execution (i. e., threads)to use multiple cores for improving throughput and scal-ability. In such systems, the execution environment of aservice (e. g., the operating system or a managed run-time environment) determines the execution order ofthreads independently of the service logic. Each re-quest processed by such a service is executed by a ded-icated thread. If different requests share data, locksto prevent inconsistencies guard this data. While datais well protected this way, the order in which requestsget processed and consequently modify data is not pre-determined. Thus, the execution order is likely to differbetween multiple instances once replication is applied,ultimately leading to state inconsistencies and divergentclient replies, as exemplified by a simple counter exam-ple shown in Figure 1.

Replica 2

lock()

unlock()

counter=k

counter=k+1

counter=0unlock()

lock()

Replica 1

T1clear()

lock()

unlock()

counter=k

counter=1

counter=0

unlock()

lock()

non‐criticalsectioncriticalsection

T2inc()

T1clear()

T2inc()

Figure 1: A minimal example for non-determinismcaused by concurrency. On replica 1, thread T1 firstclears a lock-protected counter. Then, thread T2 incre-ments the counter by one. On replica 2, the operationsare performed in the opposite order. The final counterstate on replica 1 is 1, whereas on replica 2 it is 0.

In sum, state-machine replication and unrestrictedconcurrent execution of state-sharing threads lead to seri-ous problems. The latter has been recently demonstratedin the course of a software migration testing infrastruc-ture running an old and a new software instance in paral-lel. It detected a high number of diverging replies that

1

turned out to be false positives that could be blamedon unrestricted concurrency and other causes for non-determinism [7].

A pragmatic approach is to enforce sequential exe-cution by processing only one request at a time [4, 5],which however leads to poor performance and can evencause deadlocks [13]. Therefore, more sophisticated ap-proaches are needed; for example, using some form oflock-step replication [3, 6]. However, this is costly interms of inter-replica coordination and usually necessi-tates hardware support. Other approaches settle for re-stricted forms of parallel execution, but are less demand-ing in terms of coordination and hardware requirements[1, 2, 10, 12, 13]. While some of them enforce the or-der of execution of critical sections purely based on therequest order as imposed by the replication infrastruc-ture [1, 13], others employ a leader-follower approach inwhich one leader process communicates the lock orderthat the other replicas should follow [2, 11, 12]. Noneof the solutions that handle multithreading comes closeto the performance of a non-deterministically–executedvariant in a distributed setting, either because of limitedparallelism or because of extended communication over-head [8].

Kotla and Dahlin [10] avoid the problem of coordi-nating shared-data access at the lock level altogether bypre-processing requests via a special module that makesuse of application knowledge and restricts parallel exe-cution to non-sharing requests. Domaschka et al. [9] aimat relaxing this restriction by using static code analysis,enabling reachability analysis of locks at runtime basedon the current execution path. However, the associatedruntime overhead cannot be neglected, and the approachresorts to pessimistic algorithms where static analysisreaches its limits.

In this paper, we present Storyboard, an approach thatutilizes application knowledge to heuristically predict theexecution path of a request. In case of a correct forecast,this enables parallel execution at a similar degree as inan unreplicated scenario. If the predicted execution pathturns out to be wrong, a replica establishes a determinis-tic execution order in cooperation with other replicas. Aspredictions do not have to be correct, Storyboard can op-timistically bet on the most likely outcome in those caseswhere the execution path heavily depends on the servicestate or is too complex for a correct prediction. In sum,Storyboard utilizes application knowledge to minimizecostly inter-replica coordination and exploit concurrencyat a similar level as offered by a non-deterministic exe-cution. This makes it superior to approaches that restrictconcurrency in favor of avoiding inter-replica coordina-tion as well as approaches that enable high concurrencybut heavily rely on coordination.

Our approach integrates well with a standard repli-

cation architecture that consists of an agreement stageand an execution stage. The agreement stage orders therequests and forwards them to a predictor component,which in turn makes a forecast on a per-request basis;such a forecast comprises an ordered sequence of locksthat are assumed to be taken during request execution.This prediction is based on application knowledge thatcan be either static or dynamic, but does not need to betotally correct. Before a request is executed, the predic-tion is handed over to our Storyboard component thatis part of the execution stage of each replica; this com-ponent monitors the order in which locks are acquiredduring request execution. If the request advances as pre-dicted, it can be executed straight to completion, and Sto-ryboard enforces that concurrent requests are executed ina deterministic order across all replicas. If a predictionturns out to be wrong, this is detected by Storyboard andresolved by inter-replica coordination, thereby avoidingcostly rollback mechanisms.

In the remainder of the paper, we first give a briefoverview of the system model. Second, we outline howto predict the execution path of a request and proceed bydetailing the subsequent processing. Third, we outlinehow to handle wrong execution-path predictions. Fourth,we present initial results and finally draw conclusions.

2 System Model

Replicas have to implement a deterministic state ma-chine [14] to ensure consistency. The state machine con-sists of a set of variables encoding its state and a set ofcommands operating on them. The execution of a com-mand leads to an arbitrary number of variables being readand/or modified and, as a result of the execution, an out-put being provided to the environment. For the same se-quence of inputs, every non-faulty replica must producethe same sequence of outputs.

To model today’s multithreaded high-performance ser-vice implementations, we assume a concurrent state ma-chine that is able to execute multiple commands in paral-lel. Each command is processed in an associated thread.Commands are allowed to access overlapping sets ofvariables. Such shared variable sets are protected bylocks that guarantee that only one command at a timeoperates on them; other commands that try to access thevariable set are blocked temporarily. Depending on theinterleaving of threads, there is a set of valid results for agiven command. To preserve determinism, the same or-der has to be imposed across all replicas. For simplicity,we assume that each command is executed by a singlethread that is not allowed to spawn further threads, andthat the use of non-blocking or wait-free synchronizationis not permitted.

2

C3 C5

C4

C2

C1 !"#$%&'(

Predictor Predictor

)*$+,-.%(

Predictor

/01$$2$%&(

L1

L2

L3

C1

C1

C3

C2

C2

Storyboard

C1 C3 C2 ...

)*$+,-.%(

L1

L2

L3

C1

C1

C3

C2

C2

Storyboard

)*$+,-.%(

L1

L2

L3

C1

C1

C3

C2

C2

Storyboard

C1

C3 C2 C1

C3 C2 C1 C3

C2

Fore

cast

Fore

cast

Fore

cast

Figure 2: System architecture of Storyboard

3 General Approach

Figure 2 outlines a general replicated state-machine ar-chitecture which comprises an agreement stage that en-forces an order on the commands issued by one or moreclients and an execution stage that represents the servicereplicas.1 It is enhanced by Storyboard as follows: In be-tween the agreement stage and the execution stage, thereis a predictor component that executes a predict()function on every ordered command. This function isdeterministic and utilizes knowledge about the serviceand its state. Depending on a given command, it out-puts a forecast, that is, an ordered sequence of locks thatthe command presumably acquires to access and modifysubsets of the service state during its execution. The fore-cast is forwarded to our Storyboard component via a sep-arate channel. Based on the forecast, Storyboard ensuresa deterministic execution order among multiple concur-rently executed commands by monitoring and managingtheir access to locks.

The Storyboard component controls the processing ofcommands in such a way that two commands sharingstate-variable sets never overtake each other regardingthe order in which they acquire shared locks. More for-mally, for every pair of commands Ci and Cj with i < j(i. e., according to the total order imposed by the agree-ment stage, Ci is executed before Cj), every lock thatis shared between Ci and Cj must first be acquired byCi and then by Cj . Apart from this constraint, an arbi-trary number of commands can be executed in parallel,each command at its own speed. In Sections 5 and 6 wediscuss some limited relaxations that allow commands toovertake each other in a way that preserves determinism.

In the optimal case, the forecast made by the pre-dictor exactly matches the execution path of a com-

1For now, we assume a crash-stop fault model but will discuss howto relax this assumption in Section 7.

mand. In order to always produce a correct forecast, thepredict() function has to be optimal; for example,for the shared counter presented in Figure 1, an optimalpredict() function predicts the lock that protects thecounter for all commands accessing the counter. Thiscounter is a simple example, but we expect optimal fore-casts to be feasible also for much more complex services.This can be achieved, for example, by making a carefulcode analysis to build the predict() function.

However, as the lock sequence of a command mayheavily depend on the internal service state, a rather com-plex logic might be needed to correctly predict the exe-cution. On the other hand, predict() should be assimple and fast as possible to compute, as otherwise thebenefits of concurrent execution would be degraded. Ac-cordingly, simple heuristic predict() functions thatmight fail to provide a correct forecast for rare cor-ner cases are an attractive alternative; for example, locktraces of common workloads may help to identify themost likely lock sequence for each command type.

In the remainder of the paper we will therefore refrainfrom a perfect forecast. In fact, Storyboard can handlemispredictions that are completely wrong. The Story-board component detects a misprediction when a com-mand unexpectedly aims to acquire a lock Lunex thatis not the next lock according to the forecast or when acommand finishes prematurely (i. e., the command doesnot acquire all locks included in the forecast); in the lat-ter case, no further actions besides removing the forecastfrom the system are required.

In general, upon the detection of a misprediction, theStoryboard component does not immediately grant thelock in question to the command that unexpectedly de-mands it: as there is no specified lock order, such a pro-cedure could introduce inconsistencies. Instead, Story-board blocks the execution of the command and instructsthe predictor to repredict the command’s execution path.The predictor reacts by sending a repredict com-mand through the agreement stage in order to deter-mine a consistent point in the execution order acrossall replicas at which the reprediction can be safely per-formed2. Upon receiving the command, a repredictionfor the blocked command is performed; this repredictionwill at least contain the lock which the command cur-rently seeks to acquire, but possibly also further locks.Finally, Storyboard components on all replicas continuethe execution of the command according to the new fore-cast. In case of further mispredictions, the cycle of exe-cution and reprediction repeats until the command finallycompletes.

2Depending on the implementation, there might be either one dedi-cated predictor component that is responsible for reprediction, typicallyco-located with the leader of the agreement layer, or all predictor com-ponents can initiate repredictions and duplicates have to be suppressed.

3

4 Normal Operation

For every command, the predictor forwards a forecast tothe Storyboard component, which in turn updates a fore-cast store. This store is a list3 that has a slot for everylock protecting parts of the service state; each slot ofthe list contains a FIFO-ordered list. For every predictedlock, Storyboard inserts a tuple comprising the commandidentifier and the forecast index into the lock’s FIFO list(see Figure 2). Note that updating the forecast store isdone sequentially to implement the command order im-posed by the agreement stage.

After the forecast store update, the command is exe-cuted. At this point, we can relinquish sequential exe-cution, and all commands are processed concurrently atthe speed of their associated thread. Thereby, we assumethat the thread running on behalf of a command can beidentified. Storyboard associates a counter to each com-mand, which is increased every time the thread takes alock while executing the command.

When a thread wants to enter a critical section, it firsthas to acquire the associated lock. This procedure is in-tercepted by the Storyboard component and the forecaststore is consulted. In particular, the slot assigned to thelock requested is selected, and it is checked whether theidentity of the command matches the first element of thelist and whether the counter of the command matchesthe position in the forecast history. If this is the case,the thread is allowed to enter the critical section. Whenthe thread has finished the critical section, the commandidentifier is removed from the FIFO list of the lock andthe counter is increased by one.

If the command is not in the first position of the FIFOlist of the lock, but somewhere else, the thread will besuspended until the command is in the first position. Inthis way, we enforce the order of execution as determinedby the agreement stage. If a lock was not foreseen tobe taken by a command, the output of the predict()function or the command lock counter do not match theposition in the forecast history, the predictor guessedwrong. In this case, we have an out-of-order executionpath which requires special actions as outlined next.

5 Handling Mispredictions

There might be cases in which the predict() functionreturns an execution path, but the path taken by a threadwill differ, for example, because of an internal servicestate influencing the execution. In this case, a thread de-mands to acquire a lock, and likely in the future a se-quence of locks, that diverge from what is stored in theforecast store.

3For simplicity, we assume a static set of locks, but locks could beadded at runtime to the forecast store to match more dynamic scenarios.

We cannot let the thread acquire the lock directly, asthis would introduce the same inconsistencies as if wenever interfered. Thus, we re-schedule/repredict the re-quest execution path based on the new input in a deter-ministic way. To do so, we send a repredict com-mand via the agreement stage. Once this message hasbeen received, we have a deterministic point in the ex-ecution order across all replicas from which we can re-plan the further processing of the original command.

Independent Locks If a mispredicted lock is indepen-dent of all other locks and simply protects some subset ofthe service state from concurrent access, we can treat theordered repredict command as if it were an entirelynew command. Thus, we remove the missed forecastfrom the forecast store and then feed all the informationinto the predict() function, including the previouslyrequested lock. Depending on the input and the lock re-quested, predict() offers a new forecast that at leastincludes this lock in the first position.

Nested Locks Handling of nested locks requires fur-ther actions. Assume a command Ci tries to acquirea lock Lunex not included in its forecast while holdinglocks. Imagine there is another command Cj that needsa subset of these locks held by the previous request andlock Lunex follows in Cj’s forecast. Next, the repredic-tion is performed, and the new forecast for Ci, includ-ing Lunex, is added to the forecast store. The result is adeadlock. Command Ci cannot proceed with Lunex asit was first predicted for Cj , whereas Cj cannot continueexecution as it has to wait for locks held by Ci.

We solve this order inversion problem by carefully en-queuing repredicted locks in the forecast store so cross-ing of nested lock chains of dependent commands isavoided. This is achieved by extending the informationstored in the forecast store by a list of nested locks heldby a command while acquiring a next lock and a list oflocks that will be acquired while holding this lock. Interms of nested locks, we are now able to search in thepast and in the future for a certain command at a certainpoint in its execution path. In case of a misprediction, thereprediction request of a command will be attributed bythe list of already acquired locks (past). This informationtogether with the new prediction enable a safe way to in-sert the command into the forecast store; prior to that, weremove the missed forecast from the forecast store.

For every lock in a prediction, we insert the commandinformation into the list associated to the lock as before.However, in this case, we cannot simply stick with theFIFO order, but have to insert the command at a certainposition in the order in which a lock is assumed to beaccessed. To find the right spot, we start at the begin-ning of the list with the logically oldest command request

4

and work consecutively to the end. For each element inthe list, we check whether a command shares locks ofits prediction (future) with the history of the repredictedcommand (past). If this is the case, we have to placethe repredicted command before the command it is com-pared with. If the repredicted command does not sharelocks with any of the commands in the list, it is placed atthe end of the list.

6 Condition Variables

Condition variables per se need no special treatment aslocks protect them. However, condition variables areused for coordination amongst threads, and this can leadto problems. Imagine that a thread executing on behalf ofa command Ci checks some condition that is not fulfilledand waits until the condition is true. The command Ci,however, has a prediction for a set of forthcoming locks.Further, we assume there is a command Cj that at somepoint in its execution will enable the condition Ci is wait-ing for and then notify command Ci. In a normal system,this is no problem. However, in the context of Story-board, Ci might share some forthcoming locks in its pre-diction with command Cj that have to be acquired by Cj

before enabling the condition. In this case, we have adeadlock, because Cj is not allowed to overtake Ci, andCi waits for notification by Cj .

The problem can be addressed by restricting the pre-diction of a command to the point where it locks the crit-ical section of the condition variable even if we knowmore about the command. In this case, after the condi-tion has become true, the command will continue exe-cution and acquire a lock that was not foreseen by theprediction, and consequently a reprediciton is necessary.However, this is only necessary in the special cases out-lined. Furthermore, there might be time-bound conditionvariables where a service only waits for a timeout to ex-pire before execution is continued regardless of whetherthe condition is fulfilled or not. Because of different ex-ecution speeds, there might be replicas in which a con-dition becomes true before the timeout expires and viceversa. We handle such a non-deterministic timeout bymasking it as a reprediction request that is received byall replicas in the same order.

7 Arbitrary Faults

So far, we only considered systems that are subject tocrash stop failures. However, Storyboard can be ex-tended to tolerate arbitrary faults using a Byzantine fault-tolerant agreement protocol [4] and some limited exten-sions. The actual processing of commands does not needto be changed, and the same applies to handling forecasts

provided by the predict() function. However, ex-tensions are necessary when it comes to mispredictions.In this case, it is not enough to wait for the first indi-cation of a misprediction of a replica, as this might beprovided by a malicious node, but until at least f + 1nodes have signaled it, f being the maximum number offaults to tolerate. This ensures at least one correct indi-cation of a misprediction. If this requirement is fulfilled,the workflow outlined for reprediction can be performed.Due to the higher protocol overhead of Byzantine fault-tolerant agreement compared to fail-stop protocols, mis-predictions should be avoided; that is, the predict()function needs to provide almost optimal results in orderto gain benefits from using Storyboard.

8 Preliminary Results

As an initial evaluation, we investigated whether the lockorder can be predicted for a realistic network-based ser-vice. Therefore, we analyzed the management of shareddata in CBASE-FS [10], which implements a networkfile system (NFS) service. For NFS, command types arelimited and every command type can be determined byits header. These are prerequisites for implementing alean and fast predictor component. We ran some initialbenchmarks in which we traced the lock access and thevariability of lock order for the various command types.

In the context of the PostMark benchmark, we de-tected at most three different orders of lock access forthe same command type. The number of different locksaccessed varied between one and eight, some of themwere accessed multiple times. After careful examinationof the code, it turned out that the number of locks thathave to be considered by the predictor component can bereduced to between one and three, as some locks onlyprotect system-internal buffers that do not directly con-tribute to the visible service state and therefore can beignored. This confirms the results gained by Pool et al.for other services [12] and narrows down the number ofcommand types with a variable lock history to only onecommand type (i. e., lookup). Here, we traced exactlytwo possible execution paths that are taken depending onwhether a queried file exists or not. Taking these factsinto account, a predictor can be implemented that is al-most optimal for CBASE-FS that builds a conformancewrapper for standard NFS implementations.

We implement Storyboard as a customized POSIXThreads (pthreads) library that can be pre-loaded to anypthreads-based service and are in the process of build-ing the predictor component for CBASE-FS, which thenwill be integrated with an agreement stage. Thereby, weaim at evaluating Storyboard in the context of fail-stopas well as Byzantine faults by using, for example, Spreadand SmartBFT.

5

9 Conclusion and Future Work

We outlined Storyboard, an approach that supports deter-ministic multithreaded execution. In contrast to relatedapproaches, it utilizes application-specific knowledge toimprove concurrency and avoids inter-replica coordina-tion if possible. Thereby, we chose a predictive approachthat does not need to be exact because mispredictionswill be resolved at runtime.

Our evaluation results indicate that prediction can havea high success rate for moderate complex services like aremote file system. Accordingly, we are confident thatStoryboard will increase throughput of replicated mul-tithreaded services in the context of fail-stop as well asByzantine fault-tolerance.

10 Acknowledgments

This work was partially supported by the German Re-search Council (DFG) under grant no. KA 3171/1 andby the MASTER research project funded by the Euro-pean Commission’s FP7 programme.

References[1] BASILE, C., KALBARCZYK, Z., AND IYER, R. A preemptive

deterministic scheduling algorithm for multithreaded replicas. InProceedings of the 2003 Int. Conf. on Dependable Systems andNetworks (DSN ’03) (2003), pp. 149–158.

[2] BASILE, C., WHISNANT, K., KALBARCZYK, Z., AND IYER, R.Loose synchronization of multithreaded replicas. In Proceedingsof the 21st IEEE Symp. on Reliable Distributed Systems (SRDS’02) (2002), pp. 250–255.

[3] BRESSOUD, T. C., AND SCHNEIDER, F. B. Hypervisor-basedfault tolerance. ACM Transactions on Computer Systems (TOCS)14, 1 (1996), 80–107.

[4] CASTRO, M., AND LISKOV, B. Practical Byzantine fault tol-erance. In Proceedings of the 3rd Symp. on Operating SystemsDesign and Implementation (OSDI ’99) (1999), pp. 173–186.

[5] CLEMENT, A., KAPRITSOS, M., LEE, S., WANG, Y., ALVISI,L., DAHLIN, M., AND RICHE, T. UpRight cluster services. InProceedings of the 22nd Symp. on Operating Systems Principles(SOSP ’09) (2009), pp. 277–290.

[6] CULLY, B., LEFEBVRE, G., MEYER, D., FEELEY, M.,HUTCHINSON, N., AND WARFIELD, A. Remus: high availabil-ity via asynchronous virtual machine replication. In Proceedingsof the 5th USENIX Symp. on Networked Systems Design and Im-plementation (NSDI ’08) (2008), pp. 161–174.

[7] DING, X., HUANG, H., RUAN, Y., SHAIKH, A., PETERSON,B., AND ZHANG, X. Splitter: a proxy-based approach for post-migration testing of web applications. In Proceedings of the5th European Conf. on Computer Systems (EuroSys ’10) (2010),pp. 97–110.

[8] DOMASCHKA, J., BESTFLEISCH, T., HAUCK, F. J., REISER,H. P., AND KAPITZA, R. Multithreading strategies for repli-cated objects. In Proceedings of the ACM/IFIP/USENIX 9th Int.Middleware Conf. (Middleware ’08) (2008), pp. 104–123.

[9] DOMASCHKA, J., SCHMIED, A. I., REISER, H. P., ANDHAUCK, F. J. Revisiting deterministic multithreading strategies.In Proceedings of the IEEE Int. Parallel and Distributed Process-ing Symp. (IPDPS ’07) (2007), pp. 225–232.

[10] KOTLA, R., AND DAHLIN, M. High throughput Byzantine faulttolerance. In Proceedings of 2004 Int. Conf. on Dependable Sys-tems and Networks (DSN ’04) (2004), pp. 575–584.

[11] NAPPER, J., ALVISI, L., AND VIN, H. A fault-tolerant Javavirtual machine. In Proceedings of the 2003 Int. Conf. on De-pendable Systems and Networks (DSN ’03) (2003), pp. 425–434.

[12] POOL, J., WONG, I. S. K., AND LIE, D. Relaxed determinism:making redundant execution on multiprocessors practical. In Pro-ceedings of the 11th USENIX Work. on Hot Topics in OperatingSystems (HOTOS ’07) (2007), pp. 1–6.

[13] REISER, H. P., HAUCK, F. J., DOMASCHKA, J., KAPITZA,R., AND SCHRODER-PREIKSCHAT, W. Consistent replicationof multithreaded distributed objects. In Proceedings of the 25thIEEE Symp. on Reliable Distributed Systems (SRDS’06) (2006),pp. 257–266.

[14] SCHNEIDER, F. B. Implementing fault-tolerant services usingthe state machine approach: a tutorial. ACM Computing Survey22, 4 (1990), 299–319.

6

Documents

Storyboard: Optimistic Deterministic Multithreading