

SOFTWARE—PRACTICE AND EXPERIENCE
Softw. Pract. Exper. 2008; 00:1–7

Automated Dynamic Detection of Busy-wait Synchronizations

Chen Tian∗, Vijay Nagarajan∗, Rajiv Gupta∗, Sriraman Tallam†

∗University of California at Riverside, CSE Department, Riverside, CA 92521
†Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043
E-mail: {tianc,vijay,gupta}@cs.ucr.edu, [email protected]

SUMMARY

With the advent of multicores, multithreaded programming has acquired increased importance. In order to obtain good performance, the synchronization constructs in multithreaded programs need to be carefully implemented. These implementations can be broadly classified into two categories: busy-wait and schedule-based. For shared memory architectures, busy-wait synchronizations are preferred over schedule-based synchronizations because they can achieve lower wakeup latency, especially when the expected wait time is much shorter than the scheduling time [23].

While busy-wait synchronizations can improve the performance of multithreaded programs running on multicore machines, they create a challenge in program debugging, especially in detecting and identifying the causes of data races. Although significant research has been done on data race detection, prior works rely on one important assumption – the debuggers are aware of all the synchronization operations performed during a program run. This assumption is a significant limitation, as multithreaded programs, including the popular SPLASH-2 benchmarks [43], have busy-wait synchronizations such as barriers and flag synchronizations implemented in the user code. We show that the lack of knowledge of these synchronization operations leads to unnecessary reporting of numerous races. To tackle this problem, we propose a dynamic technique for identifying user defined synchronizations that are performed during a program run. Both software and hardware implementations are presented. Furthermore, our technique can be easily exploited by a record/replay system to significantly speed up the replay. It can also be leveraged by a transactional memory (TM) system to effectively resolve a livelock situation.

Our evaluation confirms that our synchronization detector is highly accurate with no false negatives and very few false positives. We further observe that the knowledge of synchronization operations results in a 23% reduction in replay time. Finally, we show that using synchronization knowledge, livelocks can be efficiently avoided during runtime monitoring of programs.

key words: Data Race; Synchronization; Record/Replay System; Transactional Memory; Livelocks

Contract/grant sponsor: NSF; contract/grant number: CNS-0810906, CNS-0751961, CCF-0753470, CNS-0751949

Received; Revised 3 February 2009
Copyright © 2008 John Wiley & Sons, Ltd.


1. INTRODUCTION

With the advent of multicores, multithreaded programming has acquired increased importance. In order to obtain good performance, the synchronization constructs in multithreaded programs need to be carefully implemented. Implementations of these constructs can be broadly classified into two categories. One category employs schedule-based implementations where all waiting processes are descheduled by the OS. The other category employs busy-wait implementations in which all processes repeatedly inspect the values of shared variables in order to determine when they can proceed. Since busy-wait synchronizations do not require the involvement of the OS, they are usually programmed in user code. For shared memory architectures, busy-wait synchronizations are preferred over schedule-based synchronizations because they can achieve lower wakeup latency, especially when the expected wait time is much shorter than the scheduling time [23].

While busy-wait synchronizations improve the performance of multithreaded programs running on multicore machines, they create a challenge in program debugging, especially in detecting and identifying the causes of data races. Although significant research on data race detection has been done, prior works [27, 37, 34, 16, 13] suffer from one serious limitation: an assumption is made that the data race detectors are aware of all the synchronization operations that are performed by the program. In general, this is not a good assumption to make because multithreaded programs, including the popular SPLASH-2 benchmarks [43], have busy-wait synchronizations such as barriers and flag synchronizations implemented in the user code. It is unreasonable to assume that the data race detector is aware of such user defined synchronization operations.

Knowledge of busy-wait synchronization operations is crucial to data race detection. This is due to two reasons. First, busy-wait synchronization operations themselves cause races in the program. These races, known as synchronization races [34], arise due to the implementations of the synchronization operations. Any synchronization-unaware data race detector is bound to report these as data races. Unfortunately, these are not the races that the user is interested in, as these synchronization races are benign. In our experiments with SPLASH-2 programs, we found that user defined synchronization operations, including barriers and flag synchronizations, are used across all programs in the suite; this results in 1 to 19 distinct racing code segments per program, which give rise to the reporting of numerous synchronization races. Second, a data race detector that is not aware of synchronization operations is liable to report races that are infeasible [34] and consequently cause more false positives. This is because shared memory accesses that are protected by synchronization operations are not real data races; if the race detector is unaware of the synchronization operations, it will report these protected accesses as races. In our experiments, we found that this caused an additional 11 to 107 distinct code segments per program to be reported as infeasible races, i.e. false positives.

Unfortunately, identifying busy-wait synchronization operations in a program is not trivial. Synchronization may be performed via a simple flag synchronization, through a complex barrier synchronization, or even using spin locks. Often these synchronization operations are implemented in the program source code itself and not in libraries, and there are several different algorithms for accomplishing synchronization. Hence identifying synchronization operations, at best, is a tedious process requiring manual source code inspection. In this paper, we propose a dynamic technique to identify busy-wait synchronization operations implemented by users. Our technique is based on the observation that a spinning read is an essential part of each busy-wait synchronization construct and is the major cause of synchronization races. Both software and hardware implementations of detecting spinning reads are presented in this paper. Our experiments confirm that our dynamic technique is able to identify the user defined busy-wait synchronization operations with no false negatives and very few false positives.

There has also been significant research on record/replay systems [6, 26, 44, 25] for the purpose of deterministic replay debugging. We propose a scheme where our synchronization detection technique is used to optimize replay. This is based on the observation that it is not necessary to execute the synchronization operations exactly during replay; it suffices if we just enforce the dependencies during replay in the same manner that they were enforced by the synchronization operations. Our experiments on the SPLASH-2 benchmark suite confirm that synchronization-aware replay, on a uniprocessor, is 23% faster.

Besides data race detection and optimizing replay, our technique can also be used to break livelocks, which may occur when software monitoring programs are combined with a transactional memory system. One common type of monitoring program is the dynamic binary translation (DBT) tool [1, 19]. In fact, there has been a lot of research on performing runtime monitoring of programs using DBT tools. However, such tools currently handle only sequential programs efficiently. When handling multithreaded programs, such tools often encounter racing problems and require serialization of threads for correctness. The races arise when application data and the corresponding meta data stored in DBT tools are updated concurrently. To address this problem, a transactional memory (TM) system was recently proposed to enforce atomicity of updates to application data and their corresponding meta data [10]. However, enclosing legacy synchronization operations within transactions can cause livelocks, and thus degrade performance. In this paper, we propose synchronization-aware conflict resolution to effectively break livelocks at runtime.

The remainder of the paper is organized as follows. Section 2 gives a discussion of benign races and closely related work. In Section 3 we present our approach for dynamic detection of busy-wait synchronization operations. In Section 4 we describe how this information is used in data race detection, efficient replay, and breaking livelocks. Section 5 presents results of our experiments. Section 6 contains a discussion of related work. We conclude in Section 7.

2. Data Races and Synchronization

A data race occurs when two or more different threads access a shared memory location without any synchronization, and at least one of these accesses is a write. Data races are considered harmful, and are known as concurrency bugs, if they lead to unpredictable results or cause the behavior of programs to differ from users' expectations. There has been significant research [13, 16, 27, 34, 37] on helping users find such data races, and thus fix concurrency bugs. But not all races reported by such tools are actually concurrency bugs. In this section, we examine how busy-wait synchronization operations cause data race detectors to report false positives. Specifically, we classify the false positives reported by the race detection tools into two categories: intentional races due to synchronization and infeasible races due to missed synchronizations. The first category refers to harmless races that are intentionally programmed to implement synchronization. The second category refers to shared memory accesses which are actually protected by synchronization, but erroneously considered to be races by the race detector, since the race detector is unaware of the synchronization.

Figure 1. Flag and Barrier Synchronizations in SPLASH-2.

Intentional races in busy-wait synchronization algorithms. In some situations, a data race is intentionally programmed to introduce non-determinism into the program. For instance, implementations of busy-wait synchronization operations often introduce data races to enable processors to compete to enter a critical section, lock a semaphore, etc. Let us consider the flag synchronization shown in Figure 1(a), coming from one of the SPLASH-2 benchmarks, barnes. When thread 2 starts executing the statement at line 407, it spins on variable Done(r), which can only be modified by thread 1 at line 396. Therefore thread 2 cannot proceed until the shared variable is marked as true by thread 1. Consequently, the executions of the write operation at line 396 of thread 1 and the read operation at line 407 of thread 2 form many dynamic data races. However, the purpose of these races is only to ensure execution order and thus these races do not constitute a concurrency bug. Races that are intentionally programmed to implement synchronization constructs are also known as synchronization races [34]. Figure 1(b) shows another example of a synchronization race, due to a barrier implementation. Here, the while loop (line 227, statement 4) keeps spinning until all processors have reached the barrier, and statements 2 and 4 of the barrier implementation race with each other.

There are several other situations in which the programmer intentionally introduces data races that are benign. In [27], Narayanasamy et al. provide a detailed categorization of these benign synchronization races.

Infeasible races due to missed synchronizations. Data race detectors, due to inherent limitations of the race detection algorithms [17, 37], sometimes report races which are not actually real races. For example, let us consider Figure 1(b), where two threads execute the same barrier code to synchronize their executions. We can see that lines 188 and 256 respectively write to and read from the same location when the value of j is 0. Although each of these operations accesses the same shared memory location, they do not represent actual data races, as they are protected by the barrier synchronization. However, a race detector that is unaware of the barrier synchronization will consider these as data races. Thus, shared memory accesses that are actually protected by synchronization and are erroneously considered to be races by the race detector, since the race detector is unaware of the synchronization, constitute a major class of false races [27]. These false races are also known as infeasible races [34].

Replay analysis to classify reported races. In recent work, Narayanasamy et al. [27] describe a technique to automatically classify the races that are reported by a data race detector into harmful and harmless races. Their technique works by replaying the execution twice for a reported data race, once for each possible execution order between the conflicting memory operations. It is possible for this technique to identify the synchronization races as harmless, but it involves significant offline processing. To see why, let us again consider the flag synchronization shown in Figure 1(a). Let us assume that in the actual execution of the program, the while loop (line 407) was executed n times before line 396 was executed on another processor. This results in n dynamic races between lines 396 and 407. To confirm that this race is benign, the execution order of each of these racing instances has to be inverted and the program has to be replayed each time. If these synchronization races are identified online, as we do in the current work, then these races need not even be reported to the user. Thus our work is complementary to the offline replay analysis of Narayanasamy et al., as far as synchronization races are concerned.

On the other hand, the replay analysis cannot, in general, classify the infeasible races due to missed synchronizations as harmless. To see why, let us consider the barrier synchronization example shown in Figure 1(b). Recall that the read and write operations at lines 256 and 188 respectively constitute the infeasible race when the value of the loop iterator j is 0. The replay analysis works by inverting the order of the memory accesses – line 256 will not read the value it is supposed to read, i.e. the value coming from line 188. This will, in general, cause the program to misbehave. Thus, if the replayer is not aware of synchronization operations, replay analysis [27] cannot be used to correctly identify these infeasible races. In contrast, the approach we present next effectively handles infeasible races.

3. Dynamic Detection of Busy-wait Synchronizations

As discussed above, the detection of busy-wait synchronization operations is the key to preventing benign races and false races from being reported to the user. In this section, we study the synchronization races that occur in various busy-wait algorithms for implementing widely used synchronization operations like flag synchronizations, locks, and barriers, and formulate a generalized online algorithm for identifying these operations. Then we present both software and hardware implementations of this online algorithm.


3.1. Common Patterns in Busy-wait Synchronizations

We first examine the various busy-wait implementations of barriers, locks, and flag synchronizations to see if there is a common pattern among them, so that we can exploit that common pattern in our algorithm for identifying the synchronizations.

3.1.1. Data Races in Flag Synchronizations

Flag synchronization is the simplest mechanism for synchronizing two threads, as it does not need any special instructions such as Test-and-Set, Compare-and-Swap, etc. Instead, its implementation only needs one shared variable, called the flag. When a flag synchronization is encountered in a multithreaded program, one thread executes a while loop waiting for the value of the flag to be changed by another thread. Once the value has been changed, the waiting thread is able to leave the while loop and proceed. From Figure 1(a), we can clearly see the pattern of flag synchronization, where thread 2 performs a spinning read (line 407) and thread 1 performs a remote write (line 396) on the same shared location; these two instructions are the ones that cause the synchronization races.
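To make the pattern concrete, here is a minimal sketch of a flag synchronization in the style of Figure 1(a); the names done, shared_result, producer and consumer are ours, not from barnes, and the spin loop deliberately contains the benign synchronization race under discussion.

#include <pthread.h>
#include <stdio.h>

volatile int done = 0;        /* the shared flag */
int shared_result = 0;

void *producer(void *arg) {
    shared_result = 42;       /* work whose completion the flag announces */
    done = 1;                 /* remote write that ends the spin (cf. line 396) */
    return NULL;
}

void *consumer(void *arg) {
    while (!done)             /* spinning read (cf. line 407) */
        ;                     /* busy-wait until the flag changes */
    printf("%d\n", shared_result);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_create(&t1, NULL, producer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}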

3.1.2. Data Races in Lock Implementations

We consider different lock implementations, including the test and test-and-set lock, which is frequently used in several thread libraries to implement a spin lock, and a CLH queuing based lock. We intend to find a common pattern that spans these lock implementations.

A classic Test and Test-and-Set algorithm, which is used in the pthread library (pthread_spinlock), is shown in Figure 2(a). To acquire the lock, each thread executes an atomic Test-and-Set instruction (line 3), which reads and saves the value of lock and then sets the lock to true. If the lock is available, then the Test-and-Set instruction returns false, and a winner enters the critical section. Other threads have to spin on the lock (line 4) until there is a possibility that the Test-and-Set instruction can succeed. The reason for the spinning on line 4 is to avoid repeatedly executing the Test-and-Set instruction, whose cache invalidations generate significant cache coherence traffic. From this implementation, we can see that line 4 is a spinning read and line 9 is the remote write, which race with each other. Also observe that the atomic instruction in line 3 simultaneously reads and writes the lock variable, and consequently races with lines 4, 7, and itself.
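The following sketch captures the test and test-and-set structure just described, using the GCC __sync builtins as the atomic Test-and-Set; it illustrates the pattern and is not the pthread library's actual code, and the line numbers above refer to Figure 2(a), not to this sketch.

typedef volatile int spinlock_t;

void spin_acquire(spinlock_t *lock) {
    for (;;) {
        /* atomic Test-and-Set: returns the old value and sets *lock to 1 */
        if (__sync_lock_test_and_set(lock, 1) == 0)
            return;               /* the winner enters the critical section */
        while (*lock)             /* spinning read on the (cached) lock word */
            ;                     /* wait until a release is observed */
    }
}

void spin_release(spinlock_t *lock) {
    __sync_lock_release(lock);    /* remote write that ends the spin */
}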

The CLH lock [21] is another well-studied spin lock, which is a variant of the popular MCS [23] lock. The main idea of this lock is that each processor that wants to acquire the lock is put into a queue of waiting processors and is made to poll on the flag of its predecessor, which is in turn set when the latter releases the lock. As we can see from Figure 2(b), it also has a similar pattern: a spinning read (line 8) and a remote write (line 11) to the same shared variable succ_must_wait; it also includes an atomic instruction (line 7) to handle the case when multiple processors want to enter the queue at the same time.
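A sketch of the CLH idea follows; the structure mirrors the description above, but the function and field names (including succ_must_wait) are our rendering, not the code of [21].

#include <stdlib.h>

typedef struct clh_node {
    volatile int succ_must_wait;         /* polled by the successor */
} clh_node_t;

typedef struct {
    clh_node_t *volatile tail;           /* queue tail, NULL when free */
} clh_lock_t;

clh_node_t *clh_acquire(clh_lock_t *lock) {
    clh_node_t *me = malloc(sizeof *me);
    me->succ_must_wait = 1;
    /* atomic swap enqueues this node and returns the predecessor */
    clh_node_t *pred = __sync_lock_test_and_set(&lock->tail, me);
    if (pred != NULL) {
        while (pred->succ_must_wait)     /* spinning read on the predecessor's flag */
            ;
        free(pred);                      /* safe: the releaser is done with it */
    }
    return me;                           /* pass this node to clh_release later */
}

void clh_release(clh_node_t *me) {
    me->succ_must_wait = 0;              /* remote write that releases the successor */
}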

Thus the synchronization races due to lock synchronization follow this pattern: a spinning read with its corresponding remote write, and an atomic instruction in the vicinity of the spinning read.


Figure 2. Locks.

3.1.3. Data Races in Barrier Implementations

In this section, we consider different barrier implementations, including the simple centralized barrier (which is still used in the source code of several SPLASH-2 benchmarks), the sense-reversing barrier, and the arrival tree barrier. Here again, we find that the spinning read along with its corresponding spin ending write is the cause of synchronization races.

In Figure 1(b), we have shown the centralized barrier, where all threads except the last one are delayed by a spinning read on variable counter (line 227, statement 4). In this implementation, every thread also increments variable counter (line 227, statement 2), which is the remote write with respect to all earlier-arrived threads.
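A one-shot sketch of this centralized barrier follows; the mutex guarding the increment stands in for whatever atomic increment the benchmark uses, and the statement numbers in the text refer to the SPLASH-2 source, not to this code.

#include <pthread.h>

volatile int counter = 0;
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void central_barrier(int nthreads) {
    pthread_mutex_lock(&counter_lock);
    counter++;                     /* the remote write, as seen by earlier arrivals */
    pthread_mutex_unlock(&counter_lock);
    while (counter < nthreads)     /* spinning read until all threads arrive */
        ;
}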

To make the centralized counter barrier reusable, a sense-reversing centralized barrier, described in [23], is shown in Figure 3(a). Each arriving processor decrements count by exclusively executing line 7 and then waits for the value of variable sense to be changed by the last arriving processor (line 10). Similar to the simple counter barrier, line 13 is a spinning read and line 10 is a write on variable sense, which is the cause of the synchronization races produced by this barrier.
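The following is a sketch of the sense-reversing barrier as described in [23]; the line numbers above refer to Figure 3(a). The constant P, the fetch-and-subtract builtin, and the local_sense parameter (initially 0 in each thread) are our choices for illustration.

#define P 4                              /* number of participating threads */

volatile int count = P;
volatile int sense = 0;

void sr_barrier(int *local_sense) {
    *local_sense = !*local_sense;        /* each thread flips its private sense */
    if (__sync_fetch_and_sub(&count, 1) == 1) {
        count = P;                       /* the last arriver resets the count ... */
        sense = *local_sense;            /* ... and performs the remote write */
    } else {
        while (sense != *local_sense)    /* spinning read on sense */
            ;
    }
}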

Another efficient barrier algorithm is the arrival tree barrier, which is described in prior work [14, 23]. Every processor is assigned a unique tree node to form two trees, an arrival tree and a wakeup tree. In the arrival tree, the arrival information is propagated from the leaves up to the root. In the wakeup tree, the wakeup information is propagated in the opposite direction, from the root to the leaves. To obtain the best performance, the degree of the arrival tree is set to 4 and that of the wakeup tree to 2. Figure 3(b) shows the source code for this barrier. In the arrival tree, each processor waits for the arrival of its four children by spinning on variable childnotready. When all children have arrived, it informs its parent by updating a variable in its parent's node pointed to by parentpointer. Thus lines 12 and 14 form the spinning-read and remote-write pattern in the arrival tree. Similarly, line 20 is a spinning read and lines 22, 23 are the remote writes in the wakeup tree.

Figure 3. Barriers.

Having studied the different implementations of various busy-wait synchronization operations, we find that the spinning read and its corresponding remote write form a common pattern among the synchronization operations.

3.2. Algorithm to Detect the Pattern

In the previous section, we found that the spinning read and its corresponding remote write form the common pattern across different implementations of various busy-wait synchronization primitives. It is worth noting that it is difficult to find this pair by statically examining the code. Even if we are able to reduce the candidate set of spinning reads, it is not clear how the remote write can be statically identified. Hence, we explore a dynamic technique to identify this pattern. By examining the dynamic values and the addresses accessed by a load instruction, we decide whether the load is a spinning read. We identify the corresponding remote write by identifying the store instruction from which the last iteration of the spinning read obtained its value.

Figure 4 shows the data structures used in the algorithm. We first introduce a load table, which stores the information of the 3 most recent load instructions for each thread. The information includes the pc, the previous address addr accessed by the load instruction, the previous value val at addr, and a variable counter, which maintains the current count of the spin loop. We set the size of the load table to 3 because it is sufficient to have as many entries as the maximum number of static loads in a spinning loop; in our experiments, we found this number to be less than 3. This is a reasonable limit, as a spin loop typically contains no more than two loads: a load instruction that loads the shared memory value and possibly another load that loads the address of the shared memory location.


For each entry in load table:
  pc: program counter of a load
  addr: address accessed by a load
  val: value loaded by a load
  counter: spinning count of a load
  possible spin: indicates if the counter reached the threshold

For each load or store instruction ins:
  addr: address accessed by ins
  val: value loaded by a load ins
  cur tid: id of the current thread that executes ins

For each shadow memory location:
  writetid: thread id of the last store
  writepc: program counter of the last store

For each entry in syn table:
  readpc: program counter of a load
  writepc: program counter of a store

Figure 4. Data Structure for Dynamically Detecting Synchronization.


For every memory location accessed by a store instruction, we maintain the PC of the last store in writepc and the id of the thread that performed the last store in writetid. We also maintain a synchronization table, syn table, which stores pairs of instructions: the pc of a spinning read as readpc and the pc of the corresponding remote write as writepc.
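One way to render the Figure 4 structures in C is sketched below; the field names follow the figure, while the type names and the LOAD_TABLE_SIZE constant are ours.

#include <stdint.h>

#define LOAD_TABLE_SIZE 3         /* at most 3 static loads per spin loop */

typedef struct {                  /* one entry of a thread's load table */
    uintptr_t pc;                 /* program counter of the load */
    uintptr_t addr;               /* address it last accessed */
    uintptr_t val;                /* value it last loaded */
    unsigned  counter;            /* current spin count */
    int       possible_spin;      /* set once counter reaches the threshold */
} load_entry_t;

typedef struct {                  /* shadow memory, one per tracked location */
    uintptr_t writepc;            /* PC of the last store to the location */
    int       writetid;           /* id of the thread that performed it */
} shadow_t;

typedef struct {                  /* one detected synchronization pair */
    uintptr_t readpc;             /* PC of the spinning read */
    uintptr_t writepc;            /* PC of the corresponding remote write */
} syn_entry_t;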

With the above data structures, we can now use our algorithm to dynamically identify the synchronization pattern. The general idea of our algorithm is as follows. For each load instruction, we examine whether it has been loading the same value from the same address a threshold number of times in one thread, until the value of this location is changed by another thread. It is worth noting that the threshold is a heuristic that gives importance to the process of spinning and thus distinguishes it from other potential situations that are not spin loops.

The detailed algorithm following our approach is shown in Figure 5. On every load instruction that has not been determined to be a spinning read in a thread, we first examine whether the information of the load has been stored in load table by searching for the matching PC. If not, we need to find a location in load table for this load instruction. The location can be either an empty one (line 21) or the one that holds the oldest entry (lines 22-24). Then we store the information of the load into this location (lines 25-29).


On executing each load ins that is not in syn table:
1:  IF the location loc for ins's PC is found in load table
2:    IF load table[loc].addr != addr
3:      GOTO Reset;
4:    ENDIF
5:    IF load table[loc].val = val
6:      IF load table[loc].possible spin = 1
7:        RETURN;
8:      ENDIF
9:      load table[loc].counter++;
10:     IF load table[loc].counter = THRESHOLD
11:       load table[loc].possible spin := 1;
12:     ENDIF
13:   ELSE
14:     IF shadow mem[addr].writetid = cur tid OR
15:        load table[loc].possible spin != 1
16:       GOTO Reset;
17:     ENDIF
18:     add (ins's PC, shadow mem[addr].writepc) into syn table and RETURN;
19:   ENDIF
20: ELSE
21:   loc := find an empty entry in load table;
22:   IF there is no empty entry
23:     loc := find the oldest entry in load table;
24:   ENDIF
Reset:
25:   load table[loc].pc := PC of ins;
26:   load table[loc].addr := addr;
27:   load table[loc].val := val;
28:   load table[loc].counter := 1;
29:   load table[loc].possible spin := 0;
30: ENDIF

On executing each store ins that is not in syn table:
31: shadow mem[addr].writepc := PC of ins;
32: shadow mem[addr].writetid := cur tid;

Figure 5. Dynamic Detection of Synchronization Pattern.


Figure 6. Hardware Implementation.

If we find an entry in load table that matches the current load (line 1), we first check whether the current load accesses the same address as before (lines 2-4). If so, we then compare the current value with the previous value at this address to determine whether the variable counter, which reflects the number of executions of a spinning read, should be incremented by 1. The flag possible spin is set to 1, indicating that the current load is a possible spinning read, if counter reaches the threshold number (lines 5-12). Recall that to determine a synchronization pattern, we also need to ensure that the value of this address has been changed by another thread. Therefore we check this condition by comparing the id of the current thread with the id of the thread that performed the most recent write to this address. If they are the same, or the flag possible spin has not been updated to 1, we reset the information about this load, expecting that the pattern can be determined subsequently (lines 14-17). Otherwise, a synchronization pattern has been recognized. Thus, we store the load PC and store PC into syn table and then return (line 18).

For every store instruction that has not been stored into the synchronization table, we simply record its PC and thread id in the shadow memory corresponding to the location accessed by this store (lines 31-32).

Finally, recall that atomic instructions were responsible for creating synchronization races in some lock implementations. We consider the atomic instructions that appear in the vicinity of a spinning read to be potential synchronization races. Specifically, we capture the atomic instructions whose PC values are within a range of 20 instructions from the spinning read instruction.

3.3. Hardware Implementation

While the algorithm discussed in the previous section can be entirely implemented in software, it can also be implemented in hardware by leveraging the cache coherence protocol. In particular, the hardware implementation does not require shared memory to store the PC and thread id for each store. Instead, the store PC can be put into a coherence message, and synchronizations can be reported when this message is received by the processor where a spinning read has been discovered.

Figure 6 shows the design details, where all new components are in gray. First, the load table, which stores the information of each load instruction, is implemented as an on-chip buffer at each processor. Second, a hardware detector that implements the detection algorithm is added at each processor. The detector can access the load table as well as the cache coherence messages. An extra word is also added to the cache invalidate message. When a store experiences a write hit and dirties a cache block, its PC will be put into the invalidate message. If the detector has found a spinning read, it then monitors invalidate messages to identify the PC of the remote write to the same memory location. When the pattern is detected, the load PC and store PC are stored into the synchronization table, which is implemented as a shared synchronization buffer.

4. Exploiting Dynamic Information

In this section, we discuss how synchronization information is exploited to filter out harmless data races, namely synchronization races and infeasible races. Then we present a scheme where the knowledge of synchronizations is used to speed up the replay process. We also present a solution to the livelock problem that arises when dynamic binary translation tools and a transactional memory system are used together.

4.1. Race Detection

A significant amount of research [13, 16, 34, 37] has focused on tools for data race detection. However, if these tools cannot recognize all synchronization operations in a program execution, they will report many synchronization races and infeasible races. Since we can now dynamically recognize synchronization operations with the detection technique described in the previous section, we can easily prevent the race detector from reporting harmless races.

To filter out infeasible races, an existing race detector does not have to monitor any calls to the library functions of busy-wait synchronizations. Instead, it only needs to look up our synchronization report syn table to get the information about synchronizations, especially those that are directly programmed by users. Using this information, the race detector can function as it did before, for example, computing the happens-before partial order to discover harmful data races. Since our synchronization table contains more accurate synchronization knowledge, the infeasible races will be eliminated. To filter out synchronization races, an existing race detector only needs to compare the races discovered with the synchronization operations stored in syn table. If there is a match, then we do not report the data race, because the pairs of instructions we stored in syn table are actually synchronization races. Thus, our synchronization detection technique can easily be used to filter out benign races. No significant changes to an existing race detector are required to make use of our approach.
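The filtering step itself is little more than a table lookup, as the following sketch shows; syn_entry_t and the function name are illustrative, not part of an existing detector's API.

#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t readpc, writepc; } syn_entry_t;

/* Return nonzero when a reported race matches a detected synchronization
 * pair and should therefore be suppressed as benign. */
int is_synchronization_race(const syn_entry_t *syn_table, size_t n,
                            uintptr_t read_pc, uintptr_t write_pc) {
    for (size_t i = 0; i < n; i++)
        if (syn_table[i].readpc == read_pc && syn_table[i].writepc == write_pc)
            return 1;     /* intentionally programmed synchronization */
    return 0;             /* report as a potential data race */
}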


Figure 7. Synchronization-aware Record/Replay.

4.2. Synchronization-aware Record and Replay

Recently there has been research on providing software support [6, 36] and hardware support [5, 44, 26, 25] for recording a program's execution. The key idea of record/replay systems is to record the outcomes of all non-deterministic events, including memory races, so that the program can be replayed accurately. In other words, recording systems record all the memory dependencies exercised in the program so that they can be enforced during replay.

A benefit of having the knowledge of synchronization races is that it can lead to optimized replay, especially if the replay happens on a uniprocessor. Here, we make a key observation: if the recorder is aware of all the synchronization races, it is possible for the replayer to replay the original program without the execution of synchronization operations. This is because the main purpose of synchronization operations in a multithreaded program is to enforce a set of memory dependencies. If we know a priori the dependencies that the synchronization operations are trying to enforce, then we can modify the replayer to enforce these dependencies directly and, consequently, there is no need to replay the synchronization operations.

Consider the example in Figure 7, which shows barrier synchronization between two processors. Processor 1 reaches the barrier first (time t1) and spins until processor 2 also reaches the barrier (time t2). The spinning reads (denoted by Rs1 ... Rsn) race with the write (Ws) from processor 2 when it eventually reaches the barrier. Now let us consider the dependence (W, R), which is one of the dependencies that the barrier is actually trying to enforce. Clearly, if we are able to enforce this dependency, then we can safely remove the execution of the spinning reads from the replay. To enable this replay without synchronization, the recording system must be slightly modified. As far as the barrier is concerned, many recorders will record the last dependency (Ws, Rsn) along with the respective processors' instruction counts, since this is the last read and it is the one that is obtained from the coherence reply (all other reads are local reads). When the (W, R) dependency is encountered, it is optimized away using Netzer's transitive optimization [29], as it is implied by the (Ws, Rsn) dependency.


In our synchronization-aware recording scheme, by the time the last read (Rsn) is executed, we would have inferred that this is a spinning read and hence we do not record this dependency. At the same time, we decrement the instruction count of the spinning processor by the number of times the processor spins, to enable replay without execution of the spinning reads. When the (W, R) dependency is now encountered, it is recorded and not optimized away, as we did not record the (Ws, Rsn) dependence. Synchronization-aware replay happens as usual, except that we do not execute the reads that are identified as spinning reads due to synchronization.

Thus, with only small changes to the record and replay mechanism, the knowledge of synchronization races helps us avoid execution of the synchronization operations during replay.
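A sketch of the recorder-side change is given below; syn_table_contains, record_dependence and instruction_count are hypothetical hooks standing in for the recorder's internals, not an actual record/replay API.

#include <stdint.h>

extern int  syn_table_contains(uintptr_t read_pc);                /* detector lookup */
extern void record_dependence(uintptr_t src_pc, uintptr_t dst_pc);
extern long instruction_count[];                                  /* per-thread counts */

void on_dependence(uintptr_t src_pc, uintptr_t dst_pc,
                   int dst_tid, unsigned spin_iterations) {
    if (syn_table_contains(dst_pc)) {
        /* (Ws, Rsn): skip recording and pretend the spin loop never ran,
         * so the replayer can elide the spinning reads entirely */
        instruction_count[dst_tid] -= spin_iterations;
        return;
    }
    /* ordinary dependence, e.g. (W, R): record it; with (Ws, Rsn) gone it
     * is no longer removed by Netzer's transitive optimization */
    record_dependence(src_pc, dst_pc);
}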

4.3. Synchronization-aware Conflict Resolution for HTM System

Dynamic binary translation (DBT) tools like valgrind [1] and pin [19] are widely used for online monitoring of executing programs for the purposes of program profiling, debugging, and security. However, such software monitoring frameworks currently handle only sequential programs efficiently. When handling multithreaded programs, such tools encounter racing problems and require serialization of threads for correctness. The races arise when application data and corresponding meta data stored in DBT tools are updated concurrently. To address this problem, a transactional memory (TM) system was recently proposed to enforce atomicity of updates to application data and their corresponding meta data [10].

Transactional memory systems [15] enable atomic execution of blocks of code. TM systems can be implemented completely in software (STM) [2], in software with limited hardware support [11], or completely in hardware (HTM) [15]. For the remainder of the discussion, we focus on HTMs. The three major functions of an HTM are conflict detection, version management, and conflict resolution. Previously proposed HTM systems can be separated into three different categories [7]: LL - lazy conflict detection, lazy version management, and committer wins; EL - eager conflict detection, lazy version management, and requester wins; and EE - eager conflict detection, eager version management, and requester loses.

An important parameter that directly affects the efficiency of performing monitoring using HTM is the length of the transactions [10]. This is because of the expensive bookkeeping tasks involved in starting and ending transactions. Thus, for example, it would be inefficient to put an individual basic block of original code and its accompanying instrumentation within a single transaction. This is especially the case if a basic block is present in a hot loop, since this would entail the creation and committing of a transaction for every iteration of the loop. To alleviate this problem, Chung et al. proposed two techniques: creating transactions at the trace (i.e. sequences of hot basic blocks) level, and dynamically merging small transactions [10]. However, these two techniques increase the chance of putting synchronization code from the original program into these transactions, and hence the chance of introducing livelocks.

Figure 8 illustrates a simple code sequence containing a flag synchronization. The transaction start and transaction end instructions of a transaction T1 are outside the spinning loop, and the write that sets the flag is part of transaction T2. Let us assume that processor 1 reaches the while loop first, where it spins until the value of flag becomes 1, which is set subsequently by processor 2. Furthermore, we assume that the HTM system follows eager conflict detection, eager version management, and a requester loses policy similar to the one followed in the logTM system [24].

Figure 8. A Livelock Caused by Flag Synchronization.

Now let us analyze the sequence of events in the execution. First, T1 is created on processor 1 and is not committed until flag is eventually set. By the time processor 2 tries to set flag, T2 on processor 2 will have been started. Clearly T2 will conflict with the ongoing T1, since it tries to write to flag, which has already been read by T1. Since we follow the policy of the requester aborting, T2, which is the requester, is aborted. When it is restarted, it still remains the requester, as T1 has not committed yet, and so it ends up getting aborted repeatedly. This causes a livelock situation.
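In code, the scenario looks roughly like the sketch below; txn_begin and txn_end are hypothetical markers for the transaction boundaries inserted by the DBT tool.

extern void txn_begin(void), txn_end(void);   /* hypothetical DBT markers */

volatile int flag = 0;

void processor1(void) {           /* transaction T1 */
    txn_begin();
    while (flag != 1)             /* spinning read: puts flag in T1's read set */
        ;
    /* ... */
    txn_end();                    /* unreachable while T2 keeps aborting */
}

void processor2(void) {           /* transaction T2 */
    txn_begin();
    flag = 1;                     /* conflicts with T1; requester loses, so T2 aborts */
    txn_end();
}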

Similarly, if a DBT tool creates a bigger transaction that encloses a spin lock or a barrier, a livelock will also appear, which then degrades the performance rather than improving it.

It is worth noting that if the HTM follows the requester wins policy in the above example, the livelock can be avoided after several aborts, especially when T2 is very small. This is because a smaller T2 is very likely to be committed before the conflict on flag is detected. Further, if lazy conflict detection is used, we will never get into a livelock situation, because no conflict will be reported while T2 is being executed. So we can see that, in addition to the transaction size, HTM policies can also influence the occurrence of livelocks. Table I summarizes various livelock scenarios for different TM policies and for various transaction sizes.

Although DBT tools create the possibility of livelocks when optimizing the overhead of transaction creation and termination, we can use our hardware synchronization detector to help the HTM system resolve livelocks at runtime. The key idea is to make the HTM system commit the synchronization write as soon as the write instruction is detected, and to ensure that the spinning read from the other processor sees the updated shared variable quickly. In order to do that, we add the following two rules into the synchronization detector and the HTM system respectively:

• If a store instruction in a transaction is determined to be a synchronization write, the detector signals the HTM system to commit this transaction immediately and start a new transaction. The detector also notifies the HTM system of the transaction containing the spinning read instruction so that it can be aborted.


• If the HTM system uses eager detection and detects a conflict, the synchronization table is checked to see if the conflict is caused by synchronization. If so, the transaction that contains the synchronization write will be committed, and a new transaction will be started for this thread; the transaction that contains the spinning read will be aborted. If lazy detection is used, the HTM system checks the synchronization table for each store. The same action is taken if there is a match.

Figure 9. Avoiding a Livelock by Committing a Transaction Containing the Remote Write.

The first rule ensures that if a livelock due to synchronization has already occurred or will occur, it can be broken or avoided. The second rule ensures that a discovered synchronization can never cause any livelocks in the future. Intuitively, these two rules dynamically split a transaction containing a synchronization write into two smaller transactions, and give priority to the transaction containing the synchronization write. Note that to ensure atomicity, any instrumentation code for the synchronization write is executed before the transaction is committed.
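The two rules can be summarized in the following sketch; the tm_* hooks and transaction ids are hypothetical, not logTM interfaces.

#include <stdint.h>

extern int  syn_table_has_write(uintptr_t store_pc);    /* detector lookup */
extern void tm_commit_and_restart(int writer_txn);      /* split the transaction */
extern void tm_abort(int spinner_txn);

/* Invoked on an eager-detection conflict (or, under lazy detection,
 * checked for every store) between a writer and a spinning reader. */
void resolve_sync_conflict(uintptr_t store_pc, int writer_txn, int spinner_txn) {
    if (syn_table_has_write(store_pc)) {
        tm_commit_and_restart(writer_txn);  /* publish the synchronization write */
        tm_abort(spinner_txn);              /* let the spin loop re-read the flag */
    }
    /* otherwise fall back to the HTM's normal policy (requester loses) */
}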

To see how this new resolution works, let us again consider the example shown in Figure 8. We assume eager detection is used, so a livelock will occur because T2 keeps being aborted. However, our detector can quickly find the flag synchronization. Thus, the detector can signal the HTM system to immediately commit T2, which performs the remote write to the variable flag, and abort T1. Therefore, in its next trial, T1 is able to see the updated value of flag and then go ahead (Figure 9).

5. Experimental Evaluation

In this section, we first introduce the experimental environment and examine the number of synchronization races and infeasible races present in SPLASH-2 programs. Next we evaluate the effectiveness of our online synchronization detection algorithm in filtering out synchronization and infeasible races. We also evaluate the runtime overhead of dynamic synchronization detection. Next, we evaluate the benefits of synchronization detection in performing execution replay. Finally, we compare our livelock breaking solution to a time-out based solution in terms of performance overhead.

5.1. Experimental Setup

Software Implementation. Our software based detector is implemented using the Pin [19] dynamic instrumentation framework to instrument the program as it executes. A load table is implemented as a struct variable for each thread to store the information regarding each potential spinning read. The information includes the value, address, and execution count of each load, and a flag. We also use shadow memory support [28] for maintaining information about each store instruction. Specifically, we store the PC of the store instruction and the thread id in the shadow memory corresponding to the memory location of the store. To determine if a read is a spinning read, we examine whether the value loaded and the address remain unchanged for a threshold number of times; and when the value changes, we ensure that the change was caused by a store coming from a different thread. This store, incidentally, is also the store that races with the spinning read. If the above conditions are satisfied, we can conclude that the load PC is actually a spinning read. Note that the accesses to the shadow memory have to be done atomically, so we use a Pin Lock to prevent any possible violations. Since the Pin Lock is provided by the instrumentation tool, it does not affect the detection of synchronizations in the original program. All our experiments were conducted under Fedora 4 OS running on a dual quad-core Xeon machine with 16GB memory. Each core runs at 3.0 GHz.

Hardware Implementation. Our hardware based detector is implemented on Virtutech Simics [20], which is a full-system functional simulator. We configure Simics with four 3GHz x86 processors, a 64KB 4-way private L1 cache, and a 2MB 8-way unified L2 cache. We assume the MESI snoopy cache coherence protocol. All the caches are assumed to be write back and write allocate caches.

Monitoring Application. As mentioned in Section 4.3, livelocks occur when we combine DBT tools and an HTM system. Therefore, we choose Pin again as the DBT tool. We considered dynamic information flow tracking (DIFT) as our monitoring application [33] and implemented it in Pin for our experiments. However, our results are equally applicable to other monitoring applications.

HTM System. For our transactional memory experiments, we consider an HTM system that uses an EE policy similar to the logTM system [24]. We used emulation in Pin to estimate the overhead of transactional memory. We used a nop instruction to indicate the start of a transaction and the mfence instruction to indicate the end of a transaction. A memory fence instruction enforces memory ordering of instructions before and after it, and since this functionality is required for Transaction-end, we use the mfence instruction.
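As a minimal sketch, such markers can be expressed as inline helpers like the following; the helper names are illustrative, and in our setup the actual instructions are injected by Pin rather than compiled in.

/* Transaction boundary markers for the emulation: a nop flags
 * Transaction-begin; an mfence flags Transaction-end and also provides
 * the memory ordering that a commit requires. */
static inline void txn_begin(void) { __asm__ volatile("nop"    ::: "memory"); }
static inline void txn_end(void)   { __asm__ volatile("mfence" ::: "memory"); }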

Benchmarks. In our experiments, we choose the SPLASH-2 benchmark suite [43], as it is widely used to facilitate the study of shared memory multiprocessors. Table II shows the names, number of lines, inputs, and brief descriptions of the programs used in our experiments. These programs have different types of synchronizations (flag synchronizations, locks, and barriers), most of which are defined by PARMACS macro constructs [4]. These constructs are different from library routines because, during compilation, the implementations of the constructs are inlined into the application code, unless the implementations are library routines.


Figure 10. Synchronization Races and Infeasible Races.

Using PARMACS constructs in source code gives programmers the flexibility to choose different implementations of synchronizations according to their needs, such as performance and portability. For example, the programmer may use an available implementation of a synchronization operation from a library routine (e.g., the pthread library) or may develop his/her own implementation. For instance, in some SPLASH-2 benchmarks, we observe that programmers have implemented their own flag synchronizations in the user code (Figure 1(a)) rather than using a PARMACS macro construct. Even for macro constructs, library routines may not provide the implementations that programmers want. For example, a counter based barrier using a spin lock is not available in the pthread library. Therefore, programmers have to implement their own algorithms for such constructs (Figure 1(b)).

Data Race Detection. We studied the benchmarks by identifying occurrences of synchronization and infeasible races in these benchmarks. To carry out this study we used a race detection tool that computes the happens-before partial order based on the knowledge of synchronization operations in the libraries. All synchronizations implemented in the user code are ignored. The results of this study depend upon which synchronization operations are used from libraries. While flag synchronization must always be expressed in user code, in general, the LOCK/UNLOCK and BARRIER operations could be either expressed in user code or used from an available library. For the SPLASH-2 benchmarks we consider three different scenarios:

(Realistic) Given the availability of the pthread library on Fedora 4, a spin lock implementation is available in the form of two library routines, pthread_spin_lock and pthread_spin_unlock. However, the BARRIER operation must be implemented in the user code;

(Optimistic) We assume that the LOCK/UNLOCK and BARRIER operations are available as library routines. To study this scenario we compiled our own implementation of BARRIER into a library file; and

(Pessimistic) We assume that the LOCK/UNLOCK and BARRIER operations are implemented in user code.
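The following is a minimal sketch of the realistic scenario's locking, assuming POSIX spin locks; the surrounding counter code is purely illustrative.

    // Realistic scenario: locking comes from the pthread spin-lock routines,
    // while BARRIER must still live in user code. The counter is illustrative.
    #include <pthread.h>

    pthread_spinlock_t lock;   // initialize once with pthread_spin_init(&lock, 0)

    void critical_update(long *counter) {
        pthread_spin_lock(&lock);     // library-provided busy-wait acquire
        ++*counter;                   // critical section
        pthread_spin_unlock(&lock);   // release
    }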


Figure 10 shows the number of synchronization races and infeasible races found in the various scenarios. Note that we create 4 threads for each benchmark in our experiments. For each scenario, 4 columns give the number of distinct synchronization races, dynamic synchronization race instances, distinct infeasible races, and dynamic infeasible race instances.

For the realistic scenario (columns 2-5), barriers and flag synchronizations contribute the synchronization races. The numbers of observed distinct synchronization races and their corresponding dynamic instances are shown in columns 2 and 3; they vary from 1-19 and from 15.1K to 17.3M, respectively. Since these synchronizations cannot be captured by the race detector, 11-107 distinct infeasible races are reported, contributing thousands of dynamic infeasible races (columns 4-5). Thus we can see that user defined synchronization operations cause a significant number of distinct false positives (12-121) to be reported.

For the pessimistic scenario (columns 10-13) these numbers are even higher. As we can see in Figure 10, 55-275 distinct synchronization races contribute millions of dynamic synchronization race instances. In addition, millions of distinct and dynamic infeasible races are also reported. To perform this experiment we implemented a test-and-test-and-set lock using an atomic decrement x86 instruction for the LOCK/UNLOCK macro constructs, and a counter-based sense-reversing barrier for the BARRIER macro constructs.
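For concreteness, the following is a hedged sketch of a counter-based sense-reversing barrier of the kind just described; it follows the standard algorithm and uses a GCC atomic builtin, so it should be read as a sketch rather than our exact implementation.

    // Sketch of a counter-based sense-reversing barrier (standard algorithm;
    // names and builtin choice are ours).
    struct Barrier {
        volatile int count;   // threads yet to arrive in this episode
        volatile int sense;   // global sense, flipped once per episode
        int total;            // number of participating threads
    };

    void barrier_wait(Barrier *b, int *local_sense) {
        *local_sense = !*local_sense;                    // flip private sense
        if (__sync_fetch_and_sub(&b->count, 1) == 1) {   // last thread to arrive
            b->count = b->total;                         // reset for next episode
            b->sense = *local_sense;                     // remote write: releases waiters
        } else {
            while (b->sense != *local_sense) { }         // spinning read
        }
    }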

Finally, as expected, for the optimistic scenario (columns 6-9), the number of distinct synchronization and infeasible races is small, and these gave rise to only a few thousand dynamic races. It should be noted that, to conduct this experiment, we had to disassemble our own library code and hard-code the instruction addresses into the race detector. Thus, while this approach gives good results, it is not practical for the programmer.

5.2. Filtering Data Races

To evaluate the effectiveness of our synchronization detection algorithm, we added our software implementation to the happens-before race detector and repeated the experiments for all three scenarios described above. This experiment yielded the following key results.

Robustness. First, the results were identical for all three scenarios. In other words, our synchronization-detection-based approach is highly robust: it is equally effective in filtering out synchronization and infeasible races across varying scenarios.

Filtering effectiveness. Second, nearly all synchronization and infeasible races were successfully filtered out using the dynamically detected synchronization operations. Let us examine the data in Table III in more detail. With the exception of FMM, no synchronization races or infeasible races are reported in any of the benchmarks. In FMM, we found 4 synchronization races and 4 infeasible races; the dynamic numbers are 0.4K and 0.6K, respectively. The reason we report these races is that our algorithm missed 4 flag synchronizations: 2 in the function VListInteration and 2 in the function WListInteraction in the file interaction.C. We investigated why we missed these synchronizations and found the reason to be the following: in each of the missed flag synchronizations, the spinning read was executed exactly once. In other words, the spinning read did not actually experience any spin. We also measured the effect of missing these synchronizations and found that it caused 4 additional distinct infeasible races to be reported.


False positives and negatives. Finally, it is worth noting that if our synchronization detector does not miss any synchronization operation, the false positives caused by synchronization are eliminated. At the same time, if our synchronization detector falsely considered some other program operation to be a synchronization operation, this could potentially cause a real race to be treated as a synchronization race and hence produce false negatives. However, the latter situation does not arise. This is because the pattern we capture, namely a spinning read and a corresponding write, is the essence of synchronization; accesses to shared memory that is not used for synchronization do not experience any spin.

Selection of Threshold Value. In the above experiment the threshold value used by the algorithm was set to 10. We also varied the threshold value to study its impact on the effectiveness of our approach. In Table IV, we show the number of distinct synchronizations reported by our algorithm under the realistic scenario with threshold values of 10, 100, and 500. Since the threshold is the heuristic our algorithm uses to quantify the number of spins, the higher the threshold, the higher the chance that we miss synchronization races, and vice versa. The actual number of synchronizations is also presented for comparison in the column "Actual". From this table, we can see that setting the threshold to 500 causes some synchronization operations to be missed. When we reduced the threshold to 10, we were able to find most of the synchronizations, missing only the 4 in FMM discussed earlier. We did not observe an increase in the detected synchronizations when we lowered the threshold any further. Thus, the synchronization detector works well with the threshold set to 10. To evaluate sensitivity, we also considered a threshold of 100 and found the results to be the same as with a threshold of 10.
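The following is a hedged sketch of how such a threshold could be applied, assuming the detector counts consecutive re-reads of the same address that return the same value; the record type and function names are ours, not the algorithm's literal interface.

    // Sketch of applying the spin threshold (structure assumed; names ours).
    // A load is flagged as a likely spinning read once it has re-read the
    // same address and obtained the same value THRESHOLD times in a row.
    const int THRESHOLD = 10;   // value used in our experiments

    struct SpinRecord {
        const void *addr;   // address last read by this static load
        long value;         // value last observed
        int count;          // consecutive identical re-reads
    };

    bool UpdateSpinCount(SpinRecord &r, const void *addr, long value) {
        if (r.addr == addr && r.value == value) {
            return ++r.count >= THRESHOLD;   // likely a spinning read
        }
        r.addr = addr; r.value = value; r.count = 1;
        return false;
    }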

In conclusion, from the above experiment we observe the following. First, there was no situation in which we falsely considered some other program operation to be a synchronization operation; hence our synchronization detector did not cause any false negatives in race detection. Second, the number of missed synchronizations depended on the threshold, with more synchronizations missed when the threshold was set higher. Third, even with a low threshold value, we can still miss synchronizations if no spin is experienced.

Overhead of Synchronization Detection. We studied the overhead of our technique. Figure 11 shows the performance overhead of our software implementations; the realistic scenario is used in this experiment. Note that in our Baseline implementation we instrument every load and store instruction, while in our optimized version (Opt.) we instrument only the specific loads and stores that are likely to be spinning reads and writes. Specifically, we instrument only those loads that are within a spin loop and those stores that do not operate on the stack. We identify potential spin loops by first identifying branch instructions that branch backwards over code containing only loads and compare instructions. To identify stores that access non-stack data, we exploit the INS_IsStackWrite API provided by Pin. As we can see from Figure 11, our average Baseline overhead is a slowdown by a factor of 45, while our optimization significantly reduces this to a slowdown factor of 9.
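A hedged sketch of this selective instrumentation as a Pin tool follows. The Pin calls used (INS_IsMemoryRead, INS_IsMemoryWrite, INS_IsStackWrite, INS_InsertCall) are real Pin APIs, while LooksLikeSpinLoopLoad stands in for the backward-branch loop test described above and the analysis routines are stubs.

    // Sketch of the optimized instrumentation as a Pin tool. Only loads that
    // may be spinning reads and stores that are not to the stack receive
    // analysis calls. LooksLikeSpinLoopLoad is a placeholder for the
    // backward-branch/loads-and-compares test, not a Pin API.
    #include "pin.H"

    VOID OnLoad(ADDRINT ea, THREADID tid)  { /* record potential spinning read */ }
    VOID OnStore(ADDRINT ea, THREADID tid) { /* record potential remote write  */ }

    static bool LooksLikeSpinLoopLoad(INS ins) { return true; /* placeholder */ }

    VOID Instruction(INS ins, VOID *v) {
        if (INS_IsMemoryRead(ins) && LooksLikeSpinLoopLoad(ins))
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)OnLoad,
                           IARG_MEMORYREAD_EA, IARG_THREAD_ID, IARG_END);
        if (INS_IsMemoryWrite(ins) && !INS_IsStackWrite(ins))
            INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)OnStore,
                           IARG_MEMORYWRITE_EA, IARG_THREAD_ID, IARG_END);
    }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();   // never returns
        return 0;
    }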

Experimental Results for Hardware Implementation. We also conducted an experiment to verify the accuracy of our hardware synchronization detector implemented in Simics. We observe that the result is identical to that of the software technique: the hardware detector finds all synchronizations in the SPLASH-2 benchmarks, except for the 4 flag synchronizations, mentioned earlier, that do not experience any spin.


[Bar chart omitted: normalized execution time (y-axis) of the Baseline and Opt. implementations across the SPLASH-2 benchmarks (BARNES, FMM, OCEAN-1, OCEAN-2, RADIOSITY, RAYTRACE, VOLREND, WATER-1, WATER-2) and their mean.]

Figure 11. Overhead of Synchronization Detection.

[Stacked bar chart omitted: execution time distribution (0%-100%) split between time spent on synchronization and time spent on other computation, for each SPLASH-2 benchmark and their mean.]

Figure 12. Savings during Replay.

As for the overhead, the hardware implementation imposes no overhead on the original execution, because all the detection work is done by the hardware.

5.3. Synchronization-aware Replay

In this experiment, we wanted to measure the savings from synchronization-aware replay when the replay is made to run on a uniprocessor. Recall that if we are aware of synchronization operations, we do not need to faithfully re-execute the synchronization events during replay; it suffices to enforce the appropriate dependencies.


[Bar chart omitted: HTM overhead comparison (0%-50%) under STT and SCR for transactions of 4, 8, and 12 basic blocks, for each SPLASH-2 benchmark.]

Figure 13. HTM Overhead.

We have not actually implemented a replay system, but we measured the time each program spends on synchronization operations. As this is a measure of the time that can be saved during replay, the percentage is a good indicator of the speedup achievable during replay. As we can see from Figure 12, the savings vary from 7% to 48%, and the average saving is 23%. Note that we consider the realistic scenario for synchronization implementations in the SPLASH-2 programs.
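To illustrate the idea (we have not built a replay system; all names here are hypothetical), a replayer could replace each recorded spin episode with a direct wait on the logged write, along these lines:

    // Hypothetical replay-time handling of a recorded synchronization:
    // instead of re-executing the spin loop, enforce the logged
    // write -> read dependence directly. All names are illustrative.
    struct SyncDependence {
        int writer_thread;    // thread that performed the releasing write
        long write_ts;        // logical timestamp of that write
    };

    void WaitUntil(int thread, long ts);         // provided by the replayer (assumed)

    void ReplaySynchronization(const SyncDependence &d) {
        WaitUntil(d.writer_thread, d.write_ts);  // block until the write is replayed
        // the reader resumes immediately; no cycles are burned spinning
    }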

5.4. Synchronization-aware Conflict Resolution

In this section, we compare two strategies for dealing with livelocks: STT (Splitting Transactions on Timeout) and SCR (Synchronization-aware Conflict Resolution). In the first strategy, a livelock is detected via timeouts, and upon detection transactions are broken down so that there is only one transaction per basic block [10]. In the second strategy, we use the knowledge of synchronization to perform conflict resolution as discussed in Section 4.3. In this experiment, we consider the realistic scenario for synchronization implementations in the SPLASH-2 programs.
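A hedged sketch of SCR's resolution rule follows; the types and helper predicates are illustrative stand-ins for the mechanism of Section 4.3, not its literal interface.

    // Sketch of synchronization-aware conflict resolution (SCR). When a
    // conflict involves a detected spinning read, the spinning transaction
    // is aborted so the releasing write can commit, breaking the livelock.
    struct Txn;                                   // transaction descriptor (assumed)
    typedef unsigned long Addr;
    bool IsDetectedSyncAddress(Addr a);           // from the sync detector (assumed)
    bool IsSpinningOn(const Txn &t, Addr a);      // is t spinning on a? (assumed)

    enum Resolution { ABORT_READER, DEFAULT_POLICY };

    Resolution ResolveConflict(const Txn &reader, const Txn &writer, Addr a) {
        if (IsDetectedSyncAddress(a) && IsSpinningOn(reader, a))
            return ABORT_READER;    // let the remote write win
        return DEFAULT_POLICY;      // otherwise fall back to the EE policy
    }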

Recall that the transaction size influences the number of livelock scenarios, so we conduct our experiments with transactions of different sizes, namely 4, 8, and 12 basic blocks per transaction. The results are shown in Figure 13. For each benchmark (at the same transaction size), we find that SCR performs better than STT, with a saving of about 3% on average. This is because SCR never needs to break the transactions into one basic block per transaction.


6. Related Work

The synchronization detection algorithm proposed in this work is related to the spin detection proposed in [18, 41]. In [18], Li et al. proposed a hardware technique to detect spinning; the technique requires bookkeeping of register and memory state. Wells et al. also proposed a hardware technique [41] to detect spinning in the OS: they identify a kernel spin by checking whether the number of unique stores executed within 1024 committed instructions is less than some threshold value. Although both of these prior solutions and our detection algorithm can dynamically detect spinning in applications, our algorithm differs from theirs in that we also detect the remote write in addition to the spinning read. Therefore, our technique can be exploited to improve race detection, speed up replay systems, and resolve livelock situations caused by combining DBT tools and TM systems.

Race Detection. Dynamic race detection techniques can be broadly divided into three categories: those using the happens-before algorithm, those using the lockset algorithm, and those combining the two. The happens-before based techniques use the execution order and the knowledge of synchronization events to compute the happens-before partial order, proposed by Lamport [17]. If two accesses from different threads to a shared location are not ordered by the happens-before relation, a data race is reported. The race detection techniques used in [29, 8, 3, 9, 34, 22] fall into this category.

Race detectors using the lockset algorithm focus primarily on lock-based multithreaded programs. The idea, first proposed by Savage et al. [37], is to verify that every shared-memory access is associated with correct locking behavior. To avoid false positives due to common programming practices, such as using read-only shared variables after they are initialized, many improved lockset algorithms, which track the state of each memory location, have been proposed and used in recent work [37, 16, 13, 40].

Both the happens-before and lockset algorithms have their own drawbacks for race detection. The happens-before algorithm is hard to implement in software and is more likely to miss potential races, resulting in false negatives. The lockset method, on the other hand, is efficient to implement but usually gives too many false alarms. Therefore, many works attempt to combine the two algorithms in order to overcome their respective drawbacks; such hybrid techniques are discussed in [12, 30, 45, 32].

To report data races as accurately as possible, all of the above approaches need exact synchronization information, regardless of whether the synchronizations are implemented in libraries or in user code. Unfortunately, when monitoring a program, none of these approaches attempts to recognize every synchronization event, especially busy-wait synchronizations implemented in user code: they either assume that all synchronizations reside in libraries or ignore user-defined synchronizations. In reality, however, programmers may use different synchronization implementations according to their needs, as the SPLASH-2 benchmark suite shows, rather than relying on library implementations. Thus, when those detectors are applied, many synchronization races and infeasible races are falsely reported to debuggers, which may consume vast amounts of time. In contrast, our technique is effective in avoiding the reporting of benign or false races by automatically identifying synchronizations no matter how they are implemented. Hence, our current work complements prior work.


Record/Replay Systems. Record/replay is an attractive technique that was first invented to facilitate debugging of parallel and distributed programs [31, 42] and was then applied to general application debugging [35, 39]. Considerable research has been done on reducing its costs [38, 26, 44]. However, none of these works considered synchronization overhead. In fact, for multithreaded programs such as the SPLASH-2 benchmarks, busy-wait synchronization operations can cause significant overhead, as shown in our experiments. Hence, our synchronization detection technique is very useful in reducing this overhead in record/replay systems.

Software Monitoring Tools and TM Systems. There has been much recent work using software runtime monitoring techniques. Compared to hardware-based solutions, software-based monitoring schemes do not require any hardware changes. However, one serious inefficiency of current software-based monitoring schemes is that, for multithreaded programs running on multiprocessors, they require serialization of threads [1].

Recently, TM has been proposed to enable efficient software-based monitoring of multithreaded programs. However, the length of a transaction is an important issue that directly affects the efficiency of the system: while larger transactions incur less transactional bookkeeping overhead, they can lead to livelocks. Prior work [10] observes that a livelock is possible when conditional wait constructs are placed inside one transaction and proposes splitting transactions upon detecting a livelock. Since schedule-based locks were used in that work, livelocks were rarely observed. In this work, we ran the same benchmarks with busy-wait synchronizations, since these are frequently used in an SMP/multicore setting; consequently, livelocks became much more frequent. We evaluated two strategies: STT, based on the idea proposed in [10], and SCR, based on our idea of synchronization detection. Livelock scenarios in TM systems are also discussed in [7]. However, the solution proposed in [7] is not directly applicable, since knowledge of synchronization is required for conflict resolution.

7. Conclusion

In this paper we first discussed how lack of knowledge of user defined synchronizations can lead to numerous false positives in race detection tools. We then proposed a technique to dynamically identify the synchronizations that take place in a program run. This information was demonstrated to be highly effective in filtering out synchronization and infeasible races. Furthermore, our technique can be easily exploited by a record/replay system to significantly speed up replay, and we proposed a scheme that uses the knowledge of synchronizations to optimize replay. We also showed that the technique can be leveraged by a TM system to effectively resolve livelock situations.

Our evaluation confirms that our synchronization detector is highly accurate, with no false negatives and very few false positives. We also show that, on average, our optimized software implementation causes a 9-fold slowdown in program execution, and that the knowledge of synchronization operations results in about a 23% reduction in replay time. Finally, we show that using synchronization knowledge, livelocks can be efficiently avoided during the monitoring of the SPLASH-2 benchmarks.


REFERENCES

1. http://valgrind.org.
2. A. R. Adl-Tabatabai, B. T. Lewis, V. Menon, B. R. Murphy, B. Saha, and T. Shpeisman. Compiler and runtime support for efficient software transactional memory. In PLDI '06: Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation, pages 26–37, New York, NY, USA, 2006. ACM.
3. S. V. Adve, M. D. Hill, B. P. Miller, and R. H. B. Netzer. Detecting data races on weak memory systems. In Proceedings of the 18th International Symposium on Computer Architecture (ISCA), volume 19, pages 234–243, New York, NY, 1991. ACM Press.
4. E. Artiaga, N. Navarro, X. Martorell, Y. Becerra, M. Gil, and A. Serra. Experiences on the implementation of PARMACS macros using different multiprocessor operating system interfaces.
5. D. F. Bacon and S. C. Goldstein. Hardware-assisted replay of multiprocessor programs. In PADD '91: Proceedings of the 1991 ACM/ONR workshop on Parallel and distributed debugging, pages 194–206, New York, NY, USA, 1991. ACM Press.
6. S. Bhansali, W.-K. Chen, S. de Jong, A. Edwards, R. Murray, M. Drinic, D. Mihocka, and J. Chau. Framework for instruction-level tracing and analysis of program executions. In VEE '06: Proceedings of the 2nd international conference on Virtual execution environments, pages 154–163, New York, NY, USA, 2006. ACM.
7. J. Bobba, K. E. Moore, H. Volos, L. Yen, M. D. Hill, M. M. Swift, and D. A. Wood. Performance pathologies in hardware transactional memory. In ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture, pages 81–91, New York, NY, USA, 2007. ACM.
8. J. D. Choi, B. P. Miller, and R. H. B. Netzer. Techniques for debugging parallel programs with flowback analysis. ACM Trans. Program. Lang. Syst., 13(4):491–530, 1991.
9. M. Christiaens and K. D. Bosschere. TRaDe, a topological approach to on-the-fly race detection in Java programs. In JVM '01: Proceedings of the Java Virtual Machine Research and Technology Symposium, pages 15–15, Berkeley, CA, USA, 2001. USENIX Association.
10. J. Chung, M. Dalton, H. Kannan, and C. Kozyrakis. Thread-safe binary translation using transactional memory. In HPCA '08: Proceedings of the 14th International Symposium on High-Performance Computer Architecture, 2008.
11. P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum. Hybrid transactional memory. In ASPLOS-XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, pages 336–346, New York, NY, USA, 2006. ACM.
12. A. Dinning and E. Schonberg. Detecting access anomalies in programs with critical sections. In PADD '91: Proceedings of the 1991 ACM/ONR workshop on Parallel and distributed debugging, pages 85–96, New York, NY, USA, 1991. ACM Press.
13. C. Flanagan and S. N. Freund. Atomizer: a dynamic atomicity checker for multithreaded programs. SIGPLAN Not., 39(1):256–267, 2004.
14. R. Gupta. The fuzzy barrier: a mechanism for high speed synchronization of processors. In ASPLOS-III: Proceedings of the third international conference on Architectural support for programming languages and operating systems, pages 54–63, New York, NY, USA, 1989. ACM Press.
15. M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. In Proceedings of the Twentieth Annual International Symposium on Computer Architecture, 1993.
16. B. Krena, Z. Letko, R. Tzoref, S. Ur, and T. Vojnar. Healing data races on-the-fly. In PADTAD '07: Proceedings of the 2007 ACM workshop on Parallel and distributed systems: testing and debugging, pages 54–64, New York, NY, USA, 2007. ACM Press.
17. L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, 1978.
18. T. Li, A. R. Lebeck, and D. J. Sorin. Spin detection hardware for improved management of multithreaded systems. IEEE Trans. Parallel Distrib. Syst., 17(6):508–521, 2006.
19. C. K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. SIGPLAN Not., 40(6):190–200, 2005.
20. P. S. Magnusson, F. Dahlgren, H. Grahn, M. Karlsson, F. Larsson, F. Lundholm, A. Moestedt, J. Nilsson, P. Stenstrom, and B. Werner. Simics/sun4m: a virtual workstation. In ATEC '98: Proceedings of the Annual Technical Conference on USENIX Annual Technical Conference, pages 10–10, Berkeley, CA, USA, 1998. USENIX Association.
21. P. S. Magnusson, A. Landin, and E. Hagersten. Queue locks on cache coherent multiprocessors. In IPPS, pages 165–171, 1994.
22. J. Mellor-Crummey. On-the-fly detection of data races for programs with nested fork-join parallelism. In Supercomputing '91: Proceedings of the 1991 ACM/IEEE conference on Supercomputing, pages 24–33, New York, NY, USA, 1991. ACM.
23. J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9(1):21–65, 1991.
24. M. J. Moravan, J. Bobba, K. E. Moore, L. Yen, M. D. Hill, B. Liblit, M. M. Swift, and D. A. Wood. Supporting nested transactional memory in LogTM. SIGOPS Oper. Syst. Rev., 40(5):359–370, 2006.
25. S. Narayanasamy, C. Pereira, and B. Calder. Recording shared memory dependencies using Strata. In ASPLOS-XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, pages 229–240, New York, NY, USA, 2006. ACM Press.
26. S. Narayanasamy, G. Pokam, and B. Calder. BugNet: continuously recording program execution for deterministic replay debugging. In ISCA '05: Proceedings of the 32nd annual international symposium on Computer Architecture, pages 284–295, Washington, DC, USA, 2005. IEEE Computer Society.
27. S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, and B. Calder. Automatically classifying benign and harmful data races using replay analysis. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, pages 22–31, New York, NY, USA, 2007. ACM Press.
28. N. Nethercote and J. Seward. How to shadow every byte of memory used by a program. In VEE '07: Proceedings of the 3rd international conference on Virtual execution environments, pages 65–74, New York, NY, USA, 2007. ACM Press.
29. R. H. B. Netzer. Optimal tracing and replay for debugging shared-memory parallel programs. In PADD '93: Proceedings of the 1993 ACM/ONR workshop on Parallel and distributed debugging, pages 1–11, New York, NY, USA, 1993. ACM Press.
30. R. O'Callahan and J.-D. Choi. Hybrid dynamic data race detection. In PPoPP '03: Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 167–178, New York, NY, USA, 2003. ACM Press.
31. D. Z. Pan and M. A. Linton. Supporting reverse execution for parallel programs. In PADD '88: Proceedings of the 1988 ACM SIGPLAN and SIGOPS workshop on Parallel and distributed debugging, pages 124–129, New York, NY, USA, 1988. ACM.
32. Valgrind Project. Helgrind, a data race detector. http://valgrind.org/docs/manual/hg-manual.html, 2003.
33. F. Qin, C. Wang, Z. Li, H.-S. Kim, Y. Zhou, and Y. Wu. LIFT: a low-overhead practical information flow tracking system for detecting security attacks. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 135–148, Washington, DC, USA, 2006. IEEE Computer Society.
34. M. Ronsse and K. D. Bosschere. RecPlay: a fully integrated practical record/replay system. ACM Trans. Comput. Syst., 17(2):133–152, 1999.
35. M. Ronsse, K. D. Bosschere, M. Christiaens, J. C. de Kergommeaux, and D. Kranzlmuller. Record/replay for nondeterministic program executions. Commun. ACM, 46(9):62–67, 2003.
36. Y. Saito. Jockey: a user-space library for record-replay debugging. In AADEBUG '05: Proceedings of the sixth international symposium on Automated analysis-driven debugging, pages 69–76, New York, NY, USA, 2005. ACM.
37. S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: a dynamic data race detector for multi-threaded programs. In SOSP '97: Proceedings of the sixteenth ACM symposium on Operating systems principles, pages 27–37, New York, NY, USA, 1997. ACM Press.
38. S. M. Srinivasan, S. Kandula, C. R. Andrews, and Y. Zhou. Flashback: a lightweight extension for rollback and deterministic replay for software debugging. In USENIX Annual Technical Conference, General Track, pages 29–44, 2004.
39. S. Tallam, C. Tian, R. Gupta, and X. Zhang. Enabling tracing of long-running multithreaded programs via dynamic execution reduction. In ISSTA '07: Proceedings of the 2007 international symposium on Software testing and analysis, pages 207–218, New York, NY, USA, 2007. ACM.
40. C. von Praun and T. R. Gross. Object race detection. In OOPSLA '01: Proceedings of the 16th ACM SIGPLAN conference on Object oriented programming, systems, languages, and applications, pages 70–82, New York, NY, USA, 2001. ACM.
41. P. M. Wells, K. Chakraborty, and G. S. Sohi. Hardware support for spin management in overcommitted virtual machines. In PACT '06: Proceedings of the 15th international conference on Parallel architectures and compilation techniques, pages 124–133, New York, NY, USA, 2006. ACM.
42. L. D. Wittie. Debugging distributed C programs by real time replay. In PADD '88: Proceedings of the 1988 ACM SIGPLAN and SIGOPS workshop on Parallel and distributed debugging, pages 57–67, New York, NY, USA, 1988. ACM.
43. S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: characterization and methodological considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 24–36, Santa Margherita Ligure, Italy, 1995.
44. M. Xu, R. Bodik, and M. D. Hill. A "flight data recorder" for enabling full-system multiprocessor deterministic replay. In ISCA '03: Proceedings of the 30th annual international symposium on Computer architecture, pages 122–135, New York, NY, USA, 2003. ACM Press.
45. Y. Yu, T. Rodeheffer, and W. Chen. RaceTrack: efficient detection of data race conditions via adaptive tracking. In SOSP '05: Proceedings of the twentieth ACM symposium on Operating systems principles, pages 221–234, New York, NY, USA, 2005. ACM Press.


TM Policy    Barrier    Spinning-Read    Basic Block
EE           Yes        Yes              Not Probable
EL           Yes        Possible         Not Probable
LL           Yes        No               No

Table I. Livelock Scenarios for Different TM Policies and Transaction Sizes.


Programs     LOC     Input      Description
BARNES       2.0K    8192       Barnes-Hut algorithm
FMM          3.2K    256        fast multipole algorithm
OCEAN-1      2.6K    258×258    non-contiguous
OCEAN-2      4.0K    258×258    contiguous
RADIOSITY    8.2K    batch      diffuse radiosity algorithm
RAYTRACE     6.1K    tea        ray tracing algorithm
VOLREND      2.5K    head -a    ray casting algorithm
WATER-1      1.2K    512        nsquared
WATER-2      1.6K    512        spatial

Table II. SPLASH-2 Benchmarks Description.


Programs     Sync. Races             Infeasible Races
             Distinct    Dynamic     Distinct    Dynamic
BARNES       0           0           0           0
FMM          4           0.4K        4           0.6K
OCEAN-1      0           0           0           0
OCEAN-2      0           0           0           0
RADIOSITY    0           0           0           0
RAYTRACE     0           0           0           0
VOLREND      0           0           0           0
WATER-1      0           0           0           0
WATER-2      0           0           0           0

Table III. Synchronization Races and Infeasible Races with Synchronization Detection.


Programs     Actual    T = 500    T = 10    T = 100
BARNES       7         4          7         7
FMM          14        7          10        10
OCEAN-1      19        19         19        19
OCEAN-2      18        16         18        18
RADIOSITY    3         2          3         3
RAYTRACE     1         1          1         1
VOLREND      7         5          7         7
WATER-1      7         7          7         7
WATER-2      9         7          9         9

Table IV. Number of User Defined Synchronizations in SPLASH-2 Benchmarks.
