
Mutation-driven Generation of Unit Tests and Oracles

Gordon Fraser, Member, IEEE, Andreas Zeller, Member, IEEE

Abstract—To assess the quality of test suites, mutation analysis seeds artificial defects (mutations) into programs; a non-detected mutation indicates a weakness in the test suite. We present an automated approach to generate unit tests that detect these mutations for object-oriented classes. This has two advantages: First, the resulting test suite is optimized towards finding defects modeled by mutation operators rather than covering code. Second, the state change caused by mutations induces oracles that precisely detect the mutants. Evaluated on 10 open source libraries, our µTEST prototype generates test suites that find significantly more seeded defects than the original manually written test suites.

Index Terms—Mutation analysis, test case generation, unit testing, test oracles, assertions, search based testing

1 INTRODUCTION

How good are my test cases? This question can be answered by applying mutation analysis: Artificial defects (mutants) are injected into software and test cases are executed on these fault-injected versions. A mutant that is not detected shows a deficiency in the test suite and indicates in most cases that either a new test case should be added, or that an existing test case needs a better test oracle.

Improving test cases after mutation analysis usually means that the tester has to go back to the drawing board and design new test cases, taking the feedback gained from the mutation analysis into account. This process requires a deep understanding of the source code and is a non-trivial task. Automated test generation can help in covering code (and thus hopefully detecting mutants); but even then, the tester still needs to assess the results of the generated executions—and has to write hundreds or even thousands of oracles.

In this paper, we present µTEST, an approach that automatically generates unit tests for object-oriented classes based on mutation analysis. By using mutations rather than structural properties as coverage criterion, we not only get guidance in where to test, but also what to test for. This allows us to generate effective test oracles, a feature raising automation to a level not offered by traditional tools.

As an example, consider Figure 1, showing a piece of code from the open source library Joda Time. This constructor sets iLocalMillis to the milliseconds represented by day, month, and year in the values array and 0 milliseconds. It is called by 12 out of 143 test cases for LocalDate—branches and statements are all perfectly covered, and each

• G. Fraser and A. Zeller are with the Chair of Software Engineering, Saarland University, Saarbrücken, Germany. E-mail: fraser,[email protected]

 1 public class LocalDate {
 2   // The local milliseconds from 1970-01-01T00:00:00
 3   private long iLocalMillis;
 4   ...
 5   // Construct a LocalDate instance
 6   public LocalDate(Object instant, Chronology c) {
 7     ... // convert instant to array values
 8     ... // get iChronology based on c and instant
 9     iLocalMillis = iChronology.getDateTimeMillis(
10       values[0], values[1], values[2], 0); ⇐ Change 0 to 1
11   }
12 }

Fig. 1. Mutating the initialization of iLocalMillis is not detected by the Joda Time test suite.

1 LocalDate var0 = new org.joda.time.LocalDate();
2 DateTime var1 = var0.toDateTimeAtCurrentTime();
3 LocalDate var2 = new org.joda.time.LocalDate(var1);
4 assertTrue(var2.equals(var0));

Fig. 2. Test generated by µTEST. The call in Line 3 triggers the mutation; the final assertion detects it.

of these test cases also covers several definition-use pairs of iLocalMillis. These test cases, however, only check whether the day, month, and year are set correctly, which misses the fact that comparison between LocalDate objects compares the actual value of iLocalMillis. Consequently, if we mutate the last argument of getDateTimeMillis from 0 to 1, thus increasing the value of iLocalMillis by 1, this seeded defect is not caught.

µTEST creates a test case with an oracle that catches this very mutation, shown in Figure 2. LocalDate object var0 is initialized to a fixed value (0, in our case). Line 2 generates a DateTime object with the same day, month, and year as var0 and the local time (fixed to 0, again). The constructor call for var2 implicitly calls the constructor from Figure 1, and


therefore var0 and var2 have identical day, month, and year, but differ by 1 millisecond on the mutant, which the assertion detects.

Fig. 3. The µTEST process: Mutation analysis, unit partitioning, test generation, oracle generation.

By generating oracles in addition to tests, µTEST allows the tester to check whether the generated assertions reflect the expected behavior, rather than coming up with assertions and sequences that are related to these assertions. If a suggested assertion does not reflect the expected behavior, then usually a bug has been found.

Summarizing, the contributions of this paper are:

Mutant-based oracle generation: By comparing executions of a test case on a program and its mutants, we generate a reduced set of assertions that is able to distinguish between a program and its mutants.

Mutant-based unit test case generation: µTEST uses a genetic algorithm to breed method/constructor call sequences that are effective in detecting mutants.

Impact-driven test case generation: To minimize assessment effort, µTEST optimizes test cases and oracles towards detecting mutations with maximal impact—that is, changes to program state all across the execution. Intuitively, greater impact is easier to observe and assess for the tester, and more important for a test suite.

Mutant-based test case minimization: Intuitively, we expect that shorter test cases are easier to understand. We minimize test cases by first preferring shorter sequences during test case generation, and then we remove all irrelevant statements in the final test case.

Figure 3 shows the overall µTEST process: The first step is mutation analysis (Section 2) and a partitioning of the software under test into test tasks, consisting of a unit under test with its methods and constructors as well as all classes and class members relevant for test case generation. For each unit a genetic algorithm breeds sequences of method and constructor calls until each mutant of that unit, where possible, is covered by a test case such that its impact is maximized (Section 3). We minimize unit tests by removing all statements that are not relevant for the mutation or affected by it; finally, we generate and minimize assertions by comparing the behavior of the test case on the original software and its mutants (Section 4).

We have implemented µTEST as an extension to the Javalanche [39] mutation system (Section 5). This paper extends an earlier version presented at ISSTA 2010 [19], in which we used a first prototype to do preliminary analysis. In this paper, we use a mature version of µTEST to extend the empirical analysis to a total of 10 open source libraries, making it one of the largest empirical studies of evolutionary testing of classes to date. The results show that our approach can be used to extend (and theoretically, replace) the manually crafted unit tests (Section 6).

2 BACKGROUND

2.1 Mutation Analysis

Mutation analysis was introduced in the 1970s [15] as a method to evaluate a test suite in terms of how good it is at detecting faults, and to give insight into how and where a test suite needs improvement. The idea of mutation analysis is to seed artificial faults based on what is thought to be real errors commonly made by programmers. The test cases of an existing test suite are executed on a program version (mutant) containing one such fault at a time in order to see if any of the test cases can detect that there is a fault. A mutant that is detected as such is considered dead, and of no further use. A live mutant, however, shows a case where the test suite potentially fails to detect an error and therefore needs improvement.
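As a brief illustration (our own example, not taken from the paper's subjects), a typical mutation operator replaces a relational operator; only a test that makes the original and the mutant behave observably differently kills the mutant:

// Original: a customer qualifies for a discount above a threshold
public static boolean qualifies(int age) {
  return age > 65;
}

// Mutant: relational operator ">" replaced by ">="
public static boolean qualifiesMutant(int age) {
  return age >= 65;
}

// Only a test using the boundary value kills this mutant:
//   assertFalse(qualifies(65));  // passes on the original, fails on the mutant
// A test suite that never exercises age == 65 leaves the mutant alive.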

There are two main problems of mutation analysis: First, the sheer number of mutants and the effort of checking all tests against all mutants can cause significant computational costs. Second, equivalent mutants do not observably change the program behavior or are even semantically identical, and so there is no way to detect them by testing. Since the introduction of mutation analysis, a number of optimizations have been proposed to overcome possible performance problems, and heuristics can identify a small fraction of equivalent mutants—at the end of the day, however, detecting equivalent mutants is still a job to be done manually. A recent survey paper [25] concisely summarizes all applications and optimizations that have been proposed over the years.

Javalanche [39] is a tool that incorporates many of the proposed optimizations and made it possible to apply mutation analysis to software of previously unthinkable size. In addition, Javalanche alleviates the equivalent mutant problem by ranking mutants by their impact: A mutant with high impact is less likely to be equivalent, which allows the tester to focus on those mutants that can really help to improve a test suite. The impact can be measured, for example, in terms of violated invariants or effects on code coverage.

2.2 Test Case Generation

Research on automated test case generation has resulted in a great number of different approaches,


deriving test cases from models or source code, using different test objectives such as coverage criteria, and using many different underlying techniques and algorithms. In this paper, we only consider white-box techniques that require no specifications; naturally, an existing specification can help in generating test cases and can also serve as a test oracle.

The majority of systematic white-box testing approaches consider the control flow of the program, and try to cover as many aspects as possible (e.g., all statements or all branches). Generating test data by solving path constraints generated with symbolic execution is a popular approach (e.g., PathCrawler [49]), and dynamic symbolic execution as an extension can overcome a number of problems by combining concrete executions with symbolic execution. This idea has been implemented in tools like DART [20] and CUTE [41], and is also applied in Microsoft's parameterized unit testing tool PEX [45].

Meta-heuristic search techniques have been used as an alternative to symbolic execution based approaches and can be applied to stateless and stateful programs [1], [29] as well as object-oriented container classes [8], [46], [48]. The test generation approach presented in this paper is in principle similar to the technique presented by Tonella [46], but refines the search with different search operators and a fitness function targeting mutations and their impact, which potentially leads to very different test cases. A promising avenue seems to be the combination of evolutionary methods with dynamic symbolic execution (e.g., [24]), alleviating some of the problems both approaches have.

While systematic approaches in general have problems with scalability, random testing is an approach that scales without problems to programs of any size. Besides random input generation, a recent trend is the generation of random unit tests, as for example implemented in Randoop [35], JCrasher [14], AutoTest [13], or RUTE-J [4]. Although there is no guarantee of reaching certain paths, random testing can achieve relatively good coverage with low computational needs.

2.3 Test Case Generation for Mutation Testing

Surprisingly, only a little work has been done on generating test cases dedicated to killing mutants, even though this is the natural extension of mutation analysis. DeMillo and Offutt [16] have adapted constraint based testing to derive test data that kills mutants. The idea is similar to test generation using symbolic execution and constraint solving, but in addition to the path constraints (called reachability condition by DeMillo and Offutt), each mutation adds a condition that needs to be true (necessity condition) such that the mutant affects the state. Test data is again derived using constraint solving techniques. Offutt et al. [31] also described an advanced approach to generate test data that overcomes some of the limitations of the original constraint based approach. Recently, a constraint based approach to derive test data for mutants has also been integrated in the dynamic symbolic execution based tool Pex [52].

Jones et al. [26] proposed the use of a genetic algorithm to find mutants in branch predicates, and Bottaci [11] proposed a fitness function for genetic algorithms based on the constraints defined by DeMillo and Offutt [16]; recently this has been used for experiments using genetic algorithms and ant colony optimization to derive test data that kills mutants [9]. µTEST also uses a genetic algorithm to detect mutants but differs from these approaches as it addresses object-oriented software and adds assertions as oracles.

Differential test case generation (e.g., [17], [33], [44]) shares similarities with test case generation based on mutation analysis in that these techniques aim to generate test cases that show the difference between two versions of a program. Mutation testing, however, does not require the existence of two different versions to be checked and is therefore not restricted in its applicability to regression testing. In addition, the differences in the form of simple mutations are precisely controllable and can therefore be exploited for test case generation.

2.4 Oracle Generation

Automated synthesis of assertions is a natural extension of test case generation. Randoop [35] allows annotation of the source code to identify observer methods to be used for assertion generation. Orstra [50] generates assertions based on observed return values and object states and adds assertions to check future runs against these observations. A similar approach has also been adopted in commercial tools such as Agitar Agitator [10]. While such approaches can be used to derive efficient oracles, they do not serve to identify which of these assertions are actually useful, and such techniques are therefore only found in regression testing. The approach presented in this paper generates assertions in a similar manner, but uses mutation testing to select an effective subset.

Eclat [34] can generate assertions based on a model learned from assumed correct executions; in contrast, µTEST does not require any existing executions.

Evans and Savoia [17] generate assertions from runs of two different versions of a software system and DiffGen [44] extends the Orstra approach to generate assertions from runs on two different program versions. This is similar to our µTEST approach, although we do not assume two existing different versions of the same class but generate many different versions by mutation, and are therefore not restricted to a regression testing scenario.


3 UNIT TESTS FROM MUTANTS

To generate test cases for mutants, we adopt an evolutionary approach in line with previous work on testing classes [8], [46]: In general, genetic algorithms evolve a population of chromosomes using genetics-inspired operations such as selection, crossover, and mutation, and each chromosome represents a possible problem solution.

To generate unit tests with a genetic algorithm, the first ingredient is a genetic representation of test cases. A unit test generally is a sequence of method calls on an object instance; therefore, the main components are method and constructor calls. These methods and constructors take parameters which can be of primitive or complex type. This gives us four main types of statements:

Constructor statements generate a new instance of the class under test or any other class needed as a parameter for another statement:

DateTime var0 = new DateTime()

Method statements invoke methods on instances of any existing objects (or static methods):

int var1 = var0.getSecondOfMinute()

Field statements access public fields of objects:

DurationField var2 = MillisDurationField.INSTANCE

Primitive statements represent numeric data types:

int var3 = 54

Each statement defines a new variable (except void method calls), and a chromosome is a sequence of such statements. The parameters of method and constructor calls may only refer to variables occurring earlier in the same chromosome. Constructors, methods, and fields are not restricted to the members of the class under test because complex sequences might be necessary to create suitable parameter objects and reach required object states.
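As an illustration of this representation, a chromosome and its statement kinds could be modeled along the following lines; all class names are ours and do not reflect µTEST's internal model.

import java.lang.reflect.*;
import java.util.*;

// A statement defines exactly one new variable of a given type.
abstract class Statement {
  final Class<?> returnType;
  Statement(Class<?> returnType) { this.returnType = returnType; }
}

// Constructor statement: DateTime var0 = new DateTime()
class ConstructorStatement extends Statement {
  final Constructor<?> constructor;
  final List<Integer> parameterVars; // indices of earlier statements
  ConstructorStatement(Constructor<?> c, List<Integer> params) {
    super(c.getDeclaringClass());
    constructor = c;
    parameterVars = params;
  }
}

// Method statement: int var1 = var0.getSecondOfMinute()
class MethodStatement extends Statement {
  final Method method;
  final Integer calleeVar;           // null for static methods
  final List<Integer> parameterVars;
  MethodStatement(Method m, Integer callee, List<Integer> params) {
    super(m.getReturnType());
    method = m;
    calleeVar = callee;
    parameterVars = params;
  }
}

// Field statement: DurationField var2 = MillisDurationField.INSTANCE
class FieldStatement extends Statement {
  final Field field;
  FieldStatement(Field f) { super(f.getType()); field = f; }
}

// Primitive statement: int var3 = 54
class PrimitiveStatement extends Statement {
  final Object value;
  PrimitiveStatement(Class<?> type, Object value) { super(type); this.value = value; }
}

// A chromosome is a sequence of statements; statement i defines var_i and
// may only reference variables defined at smaller indices.
class TestChromosome {
  final List<Statement> statements = new ArrayList<>();
}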

Let parameters(M) be a function that returns a list containing the classes of the parameters of method or constructor M, including the callee in the case of non-static methods. Furthermore, let classes(t) be a function that returns the set of classes for which there are instances in test case t. A method is a generator of class C if it returns a value of type C, and any constructor of class C is also a generator of C. Let generators(M, C) return the set of generators of class C out of the set of methods and constructors M.

The initial population of test cases is generated randomly as shown in Algorithm 1: Before generating test cases, we statically determine the set of all method and constructor calls that either contain the target mutation or directly or indirectly call the method or constructor containing the mutation.

Algorithm 1 Random generation of test cases.

Require: M: Set of all methods and constructors
Require: C: Set of all methods and constructors directly/indirectly calling mutation; C ⊆ M
Require: l: Desired length of test case
Ensure: t: Randomly generated test

procedure GENTEST(C, l)
  t ← 〈〉
  s ← randomly select an element from C
  for c in parameters(s) do
    t ← GENOBJECT(c, {}, M, t)
  end for
  t ← t.s
  while |t| < l do
    c ← randomly select class in classes(t)
    M′ ← {m | m ∈ M ∧ c ∈ parameters(m)}
    s ← randomly select method from M′
    for p in parameters(s) do
      if ¬(p ∈ classes(t)) then
        t ← GENOBJECT(p, {}, M, t)
      end if
    end for
    s ← set parameters of s to values from t
    t ← t.s
  end while
  return t
end procedure

We randomly select one of these calls, and try to generate all parameters including its callee, if applicable. For this, we recursively add new calls that yield the necessary objects (Algorithm 2) if the test case does not already contain instances of the required types. During this search, we keep track of classes we are already trying to initialize to avoid getting stuck in recursive loops. In Algorithm 1 and Algorithm 2 we only generate new objects if they do not already exist; in practice, we reuse existing objects only with a certain probability (90% in our experiments) in order to increase diversity within the test cases. If random generation turns out to be difficult for a particular class, Algorithm 2 can be turned into an exhaustive search by adding backtracking and keeping track of explored choices. Even though the search space can be large depending on the size of the software under test, this is not problematic as one can retain useful sequences that lead to creation of certain objects and reuse them later. Test cases are created in this way until the initial population has reached the required size.

Evolution of this population is performed by repeatedly selecting individuals from the current generation (for example, using tournament or rank selection) and applying crossover and mutation according to a certain probability. Figure 4 illustrates single point crossover for unit tests:


Algorithm 2 Recursive generation of objects.

Require: c: Class of desired object
Require: G: Set of classes already attempting to generate
Require: M: Set of all methods and constructors
Require: t: Test case
Ensure: t: Test case extended with an instance of c

procedure GENOBJECT(c, G, M, t)
  M′ ← generators(M, c)
  s ← randomly select element from M′
  for all c ∈ parameters(s) do
    if ¬(c ∈ classes(t)) then
      t ← GENOBJECT(c, G, M, t)
    end if
  end for
  s ← set parameters of s to values from t
  t ← t.s
  return t
end procedure

Fig. 4. Crossover between two test cases.

A random position in each of the two selected parents is selected, and an offspring is generated by trying to merge the sub-sequences. As statements can refer to all variables created earlier in the sequence, this means simply attaching a cut-off sub-sequence will not work. Instead we add the statements and try to satisfy parameters with the existing variables, or possibly create new objects to satisfy all parameters, similar to the initial generation (Algorithm 1). By choosing different points of crossover in the two parent chromosomes the crossover will result in a variation of the length of test cases.
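The following sketch outlines this single-point crossover, reusing the illustrative TestChromosome and Statement types from the sketch above; the repair of parameter references against the offspring's existing variables is only indicated, not implemented.

import java.util.Random;

class TestCrossover {
  // Offspring = first part of parent 1 + second part of parent 2.
  static TestChromosome crossover(TestChromosome p1, TestChromosome p2, Random rnd) {
    int cut1 = rnd.nextInt(p1.statements.size() + 1);
    int cut2 = rnd.nextInt(p2.statements.size() + 1);
    TestChromosome offspring = new TestChromosome();
    offspring.statements.addAll(p1.statements.subList(0, cut1));
    for (Statement s : p2.statements.subList(cut2, p2.statements.size())) {
      // The real operator re-binds the parameters of s to variables already
      // present in the offspring, or generates missing objects as in
      // Algorithm 1; this repair is only indicated here.
      offspring.statements.add(repairParameters(s, offspring));
    }
    return offspring;
  }

  static Statement repairParameters(Statement s, TestChromosome context) {
    return s; // placeholder for the repair step described in the text
  }
}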

After selection and crossover the offspring is mutated with a given probability. There are numerous ways to mutate chromosomes:

Delete a statement: Drop one random statement in the test case. The return value of this statement must not be used as a parameter or callee in another statement, unless it can be replaced with another variable.

Insert a method call: Add a random method call for one of the objects of the test case, or create a new object of a type already used in the test case.

Modify an existing statement: Apply one of the following changes on one of the existing statements:

  – Change callee: For a method call or field reference, change the source object.
  – Change parameters: Change one parameter of a method or constructor call to a different value or create a new value.
  – Change method/constructor: Replace a call with a different one with an identical return type.
  – Change field: Change a field reference to a different field of the same type on the same class.
  – Change primitive: Replace a primitive value with a different primitive value of the same type.

When mutating a chromosome, with a certain probability each statement is changed or deleted, or a new statement is inserted before it.
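A sketch of this mutation step, again on the illustrative TestChromosome type, is shown below; the concrete delete, modify, and insert operations from the list above are only indicated.

import java.util.Random;

class TestMutation {
  static void mutate(TestChromosome t, Random rnd) {
    int n = Math.max(1, t.statements.size());
    double perStatement = 1.0 / n;       // each statement mutated with probability 1/n
    for (int i = t.statements.size() - 1; i >= 0; i--) {
      if (rnd.nextDouble() >= perStatement) continue;
      switch (rnd.nextInt(3)) {
        case 0: // delete (the real operator first checks that no later
                // statement uses this variable, or replaces such uses)
          t.statements.remove(i);
          break;
        case 1: // modify: change callee, parameter, call, field, or primitive
          break; // placeholder for the changes listed above
        default: // insert a new method call before this statement
          break; // placeholder for random call insertion
      }
    }
  }
}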

In order to guide the selection of parents for offspring generation, all individuals of a population are evaluated with regard to their fitness. The fitness of a test case with regard to a mutant is measured with respect to (1) how close it comes to executing the mutated method, (2) how close it comes to the mutated statement in this method, and (3) how significant the impact of the mutant on the remaining execution is. Consequently, the fitness function is a combination of these three components:

Distance to calling function

If the method/constructor that contains the mutation is not executed, then the fitness estimates the distance toward making this call possible. To quantify how close we are to enabling a given method call, we count for how many of the parameters of the method there are no existing objects of that type in the test case (i.e., how many of its parameters are unsatisfied). If a method is not static, we also count the target object (callee) of the method call as a parameter. A mutation can be executed by directly calling the method it is contained in, or indirectly by calling a method that may lead to a call of the method with the mutation. The overall distance is the minimum of the distance values for all methods that directly or indirectly execute the mutation. Generally, we want the distance to be as small as possible. This value can be determined without executing the test case, and is 1 if the mutant is executed. We define this distance as a function of a test case t:

    d(c, t) := 1 + #Unsatisfied parameters                              (1)

    Df(t) := min{d(c, t) | calls c related to mutation}                 (2)
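A sketch of how this call distance could be computed via reflection is shown below; the helper names are illustrative and not part of µTEST's API.

import java.lang.reflect.*;
import java.util.*;

class CallDistance {
  // Eq. (1): 1 + number of parameter types (and callee, for instance
  // methods) for which the test case contains no instance yet.
  static int d(Executable call, Set<Class<?>> classesInTest) {
    int unsatisfied = 0;
    for (Class<?> p : call.getParameterTypes()) {
      if (!classesInTest.contains(p)) unsatisfied++;
    }
    if (call instanceof Method && !Modifier.isStatic(call.getModifiers())
        && !classesInTest.contains(call.getDeclaringClass())) {
      unsatisfied++; // missing callee object
    }
    return 1 + unsatisfied; // equals 1 if the call is already enabled
  }

  // Eq. (2): minimum over all calls that directly or indirectly reach the mutation.
  static int df(Collection<Executable> callsToMutation, Set<Class<?>> classesInTest) {
    int min = Integer.MAX_VALUE;
    for (Executable c : callsToMutation) {
      min = Math.min(min, d(c, classesInTest));
    }
    return min;
  }
}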

Distance to mutation

If the mutant method/constructor is executed but the mutated statement itself is not, then the fitness specifies the distance of the test case to executing the mutation; again, we want to minimize this distance. This basically is the prevailing approach level and branch distance measurement applied in search-based test data generation [29]. The approach level describes how far a test case was from the target in the control flow graph when it deviated course. This is usually measured as the number of unsatisfied control dependencies between the point of deviation and the target, and is 0 if all control dependent branches are reached. The branch distance estimates how far the branch at which execution diverged from reaching the mutation is from evaluating to the necessary outcome. In addition, one can use the necessity conditions [16] definable for different mutation operators to estimate the distance to an execution of the mutation that infects the state (necessity distance [11]). To determine these values, the test case has to be executed once on the unmodified software. If the mutation is executed, then approach level and branch distance are 0, while the necessity distance is 0 only if a state infection occurs. As usual, the branch distance should not dominate the approach level, and therefore it is normalized in the range [0, 1], for example using the following normalization function initially proposed by Arcuri [5]:

    α(x) = x / (x + 1)                                                  (3)

We also normalize the necessity distance in the range [0, 1] with the same formula, but have to make sure that the resulting value is 1 in the case that the mutation is not executed:

    β(x) = 1                  if the mutation is not executed
    β(x) = x / (x + 1)        if the mutation is executed               (4)

    Dm(t) := Approach Level + α(Branch Distance) + β(Necessity Distance)    (5)
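The normalization and the resulting mutation distance of Equations (3) to (5) can be sketched as follows; the names are illustrative, not the µTEST implementation.

class MutationDistance {
  static double alpha(double x) {                 // Eq. (3)
    return x / (x + 1.0);
  }

  static double beta(double x, boolean mutationExecuted) { // Eq. (4)
    return mutationExecuted ? x / (x + 1.0) : 1.0;
  }

  // Eq. (5): approach level plus normalized branch and necessity distances.
  static double dm(int approachLevel, double branchDistance,
                   double necessityDistance, boolean mutationExecuted) {
    return approachLevel
         + alpha(branchDistance)
         + beta(necessityDistance, mutationExecuted);
  }
}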

Mutation impact

If the mutation is executed, then we want the test case to propagate any changes induced by the mutation such that they can be observed. Traditionally, this consists of two conditions: First, the mutation needs to infect the state (necessity condition). This condition can be formalized and integrated into the fitness function (see above). In addition, however, the mutation needs to propagate to an observable state (sufficiency condition)—it is difficult to formalize this condition [16]. Therefore, we measure the impact of the mutation on the execution; we want this impact to be as large as possible.

Quantification of the impact of mutations was initially proposed using dynamic invariants [38]: The more invariants of the original program a mutant program violates, the less likely it is to be equivalent. A variation of the impact measurement uses the number of methods with changed coverage or return values [40] instead of invariants. We take a slightly different view on the impact, as strictly speaking it is not just a function of the mutant itself, but also of the test case. Consequently, we use the impact as part of the fitness function, and quantify the impact of a mutation as the unique number of methods for which the coverage changed and the number of observable differences. An observable difference is, for example, a changed return value or any other traceable artifact (cf. Section 4). Using the unique number of changed methods avoids ending up with long sequences of the same method call.

    Im(t) = c × |C| + r × |A|                                           (6)

Here, C is the set of called statements with changed coverage, and A is the set of observable differences between a run on the original and the mutant program; c and r are constants that allow testers to put more weight on observable changes. To determine the impact, a test case has to be executed on the original unit and on the mutant that is currently considered.

Overall fitness

The overall fitness function is a combination of these three factors; the two distances have to be minimized, while the impact should be maximized. As long as Df(t) > 1 and Dm(t) > 0, Im(t) will be 0 per definition. Consequently, a possible combination as a fitness function that should be maximized is as follows:

    fitness(t) = 1 / (Df(t) + Dm(t)) + Im(t)                            (7)

The fitness is less than 1 if the mutant has not been executed, equal to 1 if the mutant is executed but has no impact, and greater than 1 if there is impact or any observable difference.
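Putting the pieces together, the impact of Equation (6) and the fitness of Equation (7) could be computed as sketched below; the weights c and r are free parameters, and all names are illustrative rather than µTEST's actual API.

class MutationFitness {
  static final double C_WEIGHT = 1.0; // weight c for changed coverage
  static final double R_WEIGHT = 1.0; // weight r for observable differences

  // Eq. (6): impact as weighted count of methods with changed coverage (|C|)
  // and observable differences between original and mutant run (|A|).
  static double impact(int changedCoverage, int observableDifferences) {
    return C_WEIGHT * changedCoverage + R_WEIGHT * observableDifferences;
  }

  // Eq. (7): maximize 1/(Df + Dm) + Im.
  static double fitness(double df, double dm, double im) {
    return 1.0 / (df + dm) + im;
  }
}

With Df(t) = 1 and Dm(t) = 0 (mutant executed but without impact), this yields exactly 1, matching the interpretation given above; any impact pushes the value above 1.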

Individuals with dynamic length suffer from a problem called bloat, which describes a disproportional growth of the individuals of a population during the search. Test cases that are much too long are problematic with respect to their execution time and memory consumption, and may thus inhibit the search. However, longer test cases can be useful as they make it easier to reach difficult search objectives [6], thus it is important to allow the search to dive into longer sequences during the evolution, while still preventing it from becoming excessively long. On the other hand, from a user perspective we expect that test cases should be as short as possible to allow easier understanding. In our initial prototype [19] we integrated the length directly into the fitness function. This, however, might have negative side-effects on achieving the optimization goal, as the search might converge prematurely. Consequently, following the insights of our recent extensive experimentation [18] we do not include the length explicitly in the fitness function, but implicitly in terms of the rank selection: If two individuals have the same fitness, the shorter one will be ranked before the longer individual. In addition, we apply bloat control techniques such as a fixed upper bound on the length of individuals during the search, or relative position crossover which avoids an increase in length [18].

When generating test cases for structural coverage criteria the stopping criterion is easy: If the entity to be reached (e.g., a branch) is actually reached, then one is done. In our scenario this is not so easy: It might not be sufficient to stop once a mutant has resulted in an observable difference, as we are also optimizing with respect to the test case length and the impact. Consequently, one may choose to let the genetic algorithm continue until a maximum time or number of generations has been reached and return the test case with the highest fitness.

As determining the fitness value of a test case requires it to be executed, a nice side-effect is that memory violations and uncaught exceptions may also be detected during test case generation. If such exceptions are not declared to occur, then they are likely to point to defects (see Section 5).

4 GENERATING ASSERTIONS TO KILL MUTANTS

The job of generating test cases for mutants is not finished once a mutation is executed: A mutant is only detected if there is an oracle that can identify the misbehavior that distinguishes the mutant from the original program. Consequently, mutation-based unit tests need test oracles such that the mutants are detected.

A common type of oracles in the case of unit tests are assertions, which are supported by most unit testing frameworks. Some types of assertions (e.g., assertions on primitive return values) can be easily generated, as demonstrated by existing testing tools (cf. Section 2.4). This, however, is not so easy for all types of assertions, and the number of possible assertions in a test case often exceeds the number of statements in the test case by far. Mutation testing helps in precisely this matter by suggesting not only where to test, but also what to check for. This is a very important aspect of mutation testing, and has hitherto not received attention for automated testing.

After the test case generation process we run each test case on the unmodified software and all mutants that are covered by the test case, while recording traces with information necessary to derive assertions. In the following we list different types of assertions, all illustrated with examples of actually generated test cases:

Primitive assertions make assumptions on primitive (i.e., numeric or Boolean) return values:

DurationField var0 = MillisDurationField.INSTANCE;
long var1 = 43;
long var2 = var0.subtract(var1, var1);
assertEquals(var2, 0);

Comparison assertions compare objects of the same class with each other. Comparisons of objects across different execution traces are not safe because they might be influenced by different memory addresses, which in turn would influence assertions; therefore we compare objects within the same test case. For classes implementing the Java Comparable interface we call compareTo; otherwise, we apply comparison to all objects of compatible type in terms of the Java equals method:

DateTime var0 = new DateTime();
Chronology var1 = Chronology.getCopticUTC();
DateTime var2 = var0.toDateTime(var1);
assertFalse(var2.equals(var0));

Inspector assertions call inspector methods on objects to identify their states. An inspector method is a method that takes no parameters, has no side-effects, and returns a primitive data type. Inspectors can, for example, be identified by purity analysis [37].

long var0 = 38;
Instant var1 = new Instant(var0);
Instant var2 = var1.plus(var0);
assertEquals(var2.getMillis(), 76);

Field assertions are a variant of inspector assertions and compare the public primitive fields of classes among each other directly. (As Joda Time has no classes with public fields, there is no example from this library.)

String assertions compare the string representation of objects by calling the toString method. For example, the following assertion checks the ISO 8601 representation of a period in Joda Time:

int var0 = 7;
Period var1 = Period.weeks(var0);
assertEquals("P7W", var1.toString());

String assertions are not useful in all cases: First, they are only usable if the class implements this method itself, as the java.lang.Object.toString method inherited by all classes includes the memory location of the reference—which does not serve as a valid oracle. Second, the toString method is often used to produce debug output which accesses many internals that might not otherwise be observable via the public interface. Oracles depending on internal details are not resilient to changes in the code, and we therefore only use these assertions if no memory locations are included in the string and if no other assertions can be generated.


Null assertions compare object references with the special value null:

int var0 = 7;
Period var1 = Period.weeks(var0);
assertNotNull(var1);

Exception assertions check which exceptions are raised during execution of a test case:

@Test(expected = IllegalArgumentException.class)
int var0 = -20;
DateTime var1 = new DateTime(var0, var0, var0); // should raise exception

To generate assertions for a test case we run it against the original program and all mutants using observers to record the necessary information: An observer for primitive values records all observed return and field values, while an inspector observer calls all inspector methods on existing objects and stores the outcome, and a comparison observer compares all existing objects of equal type and again stores the outcome. After the execution the traces generated by the observers are analyzed for differences between the runs on the original program and its mutants, and for each difference an assertion is added. At the end of this process, the number of assertions is minimized by tracing for each assertion which mutation it kills, and then finding a subset for each test case that is sufficient to detect all mutations that can be detected with this test case. This is an instance of the NP-hard minimum set covering problem, and we therefore use a simple greedy heuristic [12]. The heuristic starts by selecting the best assertion, and then repeatedly adds the assertion that detects the most undetected mutants.
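The greedy heuristic for this minimum set covering instance can be sketched as follows; the generic types are illustrative, and the real implementation works on µTEST's recorded traces.

import java.util.*;

class AssertionMinimizer {
  // kills maps each candidate assertion to the set of mutants it detects.
  static <A, M> List<A> minimize(Map<A, Set<M>> kills) {
    Set<M> undetected = new HashSet<>();
    kills.values().forEach(undetected::addAll);
    List<A> selected = new ArrayList<>();
    while (!undetected.isEmpty()) {
      A best = null;
      int bestCount = 0;
      for (Map.Entry<A, Set<M>> e : kills.entrySet()) {
        Set<M> newlyKilled = new HashSet<>(e.getValue());
        newlyKilled.retainAll(undetected);
        if (newlyKilled.size() > bestCount) {
          best = e.getKey();
          bestCount = newlyKilled.size();
        }
      }
      if (best == null) break; // no assertion detects the remaining mutants
      selected.add(best);
      undetected.removeAll(kills.get(best));
    }
    return selected;
  }
}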

5 GENERATING JAVA UNIT TEST SUITES

Being able to generate unit tests and assertions gives us the tools necessary to create or extend entire test suites. We have implemented the described µTEST approach as an extension to the Javalanche [39] mutation system.

5.1 Mutation Analysis

The first step of a mutation-based approach is to perform classic mutation analysis, which Javalanche does efficiently: Javalanche instruments Java bytecode with mutation code and runs JUnit test cases against the instrumented (mutated) version. The result of this process is (a) a classification of mutants into dead or live with respect to the JUnit test suite, as well as (b) an impact analysis on live mutants that have been executed but not detected.

5.2 Setup

The second step on the way to a test suite is to extract testable units with all necessary information:

  • Mutations of the unit under test (UUT)
  • Testable constructors and methods of the UUT, together with their control flow and control dependency graphs for fitness calculation
  • Classes and their members necessary to execute all methods and constructors of the UUT (i.e., parameter classes)
  • Inspector methods of the UUT and all other considered classes

During test case generation, µTEST accesses the UUT via Java reflection, but at the end of the test case generation process the test cases are output as regular JUnit test cases. Therefore, an important factor for the testability of a Java class is its accessibility. This means that all public classes can be tested, including public member classes. Private or anonymous classes, on the other hand, cannot be explicitly tested but only indirectly via other classes that access them. Of the accessible classes, all constructors, methods, and fields declared public can be accessed. As abstract classes cannot be instantiated directly, we consider all derived subclasses when deriving information about testable members for a class.

A basic requirement to test methods and constructors is the ability to create objects of all parameter types; therefore each parameter type is analyzed to determine whether it offers all necessary generators to be applicable to Algorithm 2. Alternatively, the user can add helper functions that create usable objects, or theoretically one could also keep a pool of created objects [13]. A further requirement for a method to be testable is that it can be executed repeatedly during the search, and that calling the same method with identical parameters always returns the same result (i.e., it is deterministic).
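As an illustration of this analysis, the generators of a class (its public constructors, plus public methods of other classes returning an instance of it) can be enumerated via Java reflection roughly as follows; this is a simplified sketch, not the µTEST implementation.

import java.lang.reflect.*;
import java.util.*;

class GeneratorAnalysis {
  static List<Executable> generators(Class<?> c, Collection<Class<?>> candidateClasses) {
    List<Executable> result = new ArrayList<>();
    result.addAll(Arrays.asList(c.getConstructors())); // public constructors of c
    for (Class<?> candidate : candidateClasses) {
      for (Method m : candidate.getMethods()) {        // public methods
        if (c.isAssignableFrom(m.getReturnType())) {
          result.add(m);                               // returns a c (or subtype)
        }
      }
    }
    return result;
  }
}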

For each unit, we only consider the mutants derived at this level in the type hierarchy, i.e., for an abstract class only mutants of that class are used although members of subclasses are necessary to derive test cases. The intuition for this is that it will be easier for the tester to understand test cases and assertions if she or he only has to focus on a single source file at a time.

In addition to the test methods, each unit is analyzed with respect to its inspector methods. An inspector method is a method that does not change the system state but only reports some information about it, and is therefore very useful for assertion generation. For Java, one has to resort to purity analysis [37] or manual input to identify inspectors.

5.3 Test Case Generation

Once the setup is finished, the actual test case generation can be started, possibly in parallel for different units.

Fig. 5. The process of generating test cases.

µTEST selects a target mutant and generates a test case for it (see Figure 5). To measure the fitness of individuals, µTEST executes test cases using Java reflection. The resulting test case is checked against all remaining live mutants to see if it also kills other mutants. For all killed mutants, we derive a minimized version of the test case, enhanced with assertions, and add it to the test suite. To minimize test cases we use a rigorous approach and try to remove each statement from the test case and check the test case's validity and fitness afterward. Although this is basically a costly approach, it is not problematic in practice, as the rank selection favors short test cases, and they are therefore usually short to begin with. Alternatively, one could for example apply delta debugging [27] or a combination of slicing and delta debugging [28] to speed up minimization. This process is repeated until all mutants are killed or at least test case generation has been attempted on each of them.
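The rigorous minimization described above can be sketched as follows; TestCase and Mutant are illustrative placeholder interfaces rather than µTEST's actual types.

interface Mutant {}

interface TestCase {
  int size();
  TestCase withoutStatement(int index); // drops the statement and its dependents
  boolean isValid();
  boolean kills(Mutant m);              // execute on original and mutant, compare
}

class TestMinimizer {
  static TestCase minimize(TestCase test, Mutant target) {
    boolean changed = true;
    while (changed) {
      changed = false;
      for (int i = test.size() - 1; i >= 0; i--) {
        TestCase candidate = test.withoutStatement(i);
        if (candidate.isValid() && candidate.kills(target)) {
          test = candidate; // the statement was irrelevant for this mutant
          changed = true;
          break;            // indices shifted, restart the scan
        }
      }
    }
    return test;
  }
}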

5.4 Output

At the end of the process, the tester receives a JUnit test file for each unit, containing test cases with assertions and information about which test case is related to which mutant. Unless the test cases are used for regression testing, the assertions need to be analyzed and confirmed by the tester, and any assertion that is not valid reveals a bug in the software, or an invalid test input due to an implicit constraint that needs to be declared. To aid the developer, the number of assertions is minimized such that each test case only includes sufficiently many assertions to kill all mutants that the test case can detect.

Besides test cases, the developer also receives information about the mutants considered during test case generation: If one or more assertions could be generated, the mutant is considered to be killed. If the mutant was executed but no assertions could be found, then the impact is returned, such that mutations can be ranked according to their probability of being inequivalent. Some mutants cannot be tested (such as mutations in private methods that are not used in public methods), and some mutants are not supported by the tool itself (for example, mutations in static class constructors, as this would require unloading the class at each test run). Finally, a mutant might simply not have been executed because the tool failed to find a suitable sequence that reaches the mutated statement.

TABLE 1
Statistics on case study objects

                          Classes            Mutants
Case Study            Testable  Total   Testable    Total
Commons CLI                 13     20       652     1,276
Commons Codec               20     29     3,075     4,041
Commons Collections        193    273    13,188    21,705
Commons Logging             10     14       286     1,317
Commons Math               243    388    25,226    43,273
Commons Primitives         154    238     2,385     5,734
Google Collections          83    131     4,054    11,339
JGraphT                    112    167     2,341     5,139
Joda Time                  123    154    11,163    23,145
NanoXML                      1      2       614       954
Σ                          952  1,416    62,984   117,913

6 EVALUATION

To learn about the applicability and feasibility of the presented approach, we applied µTEST to a set of 10 open source libraries. In our experiments, we focus on evaluating how well µTEST performs at producing test cases that detect mutants of a program; to demonstrate that the approach could also lead to detection of defects in the current version of the program we would additionally need data about real faults.

6.1 Case Study Objects

In selecting case study objects for a thorough evaluation of the approach we tried to not only use container classes, which are often used in the literature [42]. There are, however, limitations to automated test case generation, and we therefore had to select open source projects that do not rely on networking, I/O, multi-threading, or GUIs.

Table 1 summarizes statistics of the case study objects. The number of classes listed in this table represents only top-level classes (i.e., no member classes). Not all classes are suitable for test generation, as to produce executable JUnit test cases the classes need to be publicly accessible and Javalanche needs to be able to create mutants; we also excluded exception classes.

Commons CLI (CLI)

The Apache Commons CLI library (http://commons.apache.org/cli/) provides an API for parsing command line options.

Commons Codec (CDC)

Commons Codec (http://commons.apache.org/codec/) provides implementations of common encoders and decoders such as Base64, Hex, Phonetic and URLs.


Commons Collections (COL)

Commons Collections (http://commons.apache.org/collections/) is a collection of data structures that seek to improve over Java standard library classes. Commons Collections makes heavy use of Java Generics, which make it possible to parametrize classes and methods with types. Unfortunately, this type information is removed from instances of generic classes by the compiler and therefore not accessible at runtime by reflection, and so in most cases a type parameter is just as difficult to treat as a parameter of type Object. For this case study object, we used Integer objects whenever a method has an Object in its signature, as integer numbers are easy to generate and have a defined ordering.

Commons Logging (LOG)

The Commons Logging (http://commons.apache.org/logging/) package bridges between different popular logging implementations, which can also be changed at runtime.

Commons Math (MTH)

Commons Math (http://commons.apache.org/math/) is a library of lightweight, self-contained mathematics and statistics components addressing the most common problems not available in the Java programming language.

Commons Primitives (PRI)

The Commons Primitives (http://commons.apache.org/primitives/) library contains a set of data structures and utilities that are optimized for primitive datatypes.

Google Collections (GCO)

The Google Collections (http://code.google.com/p/google-collections/) library is a set of new collection types and implementations.

JGraphT (JGT)

JGraphT (http://www.jgrapht.org/) is a graph library that provides mathematical graph-theory objects and algorithms.

Joda Time (JOT)

Joda Time (http://joda-time.sourceforge.net/) provides replacements for the Java date and time classes. Joda Time is known for its comprehensive set of developer tests, providing assurance of the library's quality, which makes it well suited as a benchmark to compare automatically generated tests with. To make sure that test cases and generated assertions are deterministic, we configured the global SystemMillisProvider to return a constant value.

NanoXML (XML)

NanoXML (http://devkix.com/nanoxml.php) is a small XML parser for Java. As the parser framework relies heavily on I/O to read, transform, and write XML files we only tested the NanoXML Lite version.

6.2 Experimental Setup

Test case generation only requires Java bytecode, and we used the µTEST prototype “out of the box” as far as possible, i.e., without writing test helper functions or modifying the software under test. Minor adaptations were only necessary to avoid calling methods with nondeterministic results like random number generators; these methods are easily detectable as generated assertions will not always hold on the original program either.

We configured µTEST to use a steady-state genetic algorithm with a population size of 100, and an elitism size of 1 individual. The maximum test case length was set to 60 statements, and evolution was performed for a maximum of 1,000,000 executed statements for each class. Before starting the evolution, we ran 100 random test cases on each class to identify trivial mutants, to focus the search on the difficult objectives. For the remaining mutants, the budget of statements was equally divided. For each mutant, the search was ended as soon as an observable difference was found, i.e., we did not optimize the impact further. We used all types of assertions listed in Section 4 except for String assertions (i.e., checking the output of the toString method) and exception assertions. We used a timeout of 5 seconds for individual tests, and if a mutant timed out where the same test case run on the original program did not time out we assume that the mutant causes an infinite loop and end the search for this particular mutant at this point.

Crossover probability was set to 75%, and for every offspring of length n each statement had a mutation probability of 1/n. Mutation was performed with probability 1/3 as insertion, 1/3 as deletion, and 1/3 as change. As µTEST uses randomized algorithms, each experiment was repeated 30 times with different random seeds.

6.3 Effectiveness

One of our main objectives is to produce test suites that are good at detecting mutants. Figure 6 summarizes the achieved mutation scores for each of the case study objects. For each object, the plot shows the mutation score averaged over all classes of the object as well as the 30 runs per class; each box depicts median as well as upper and lower quartiles; the whiskers show the minimum and maximum along with outliers. The mutation score is the ratio of killed mutants to mutants in total (including equivalent mutants, as we do not know which mutants are equivalent). NanoXML poses problems for µTEST, as it requires XML input which is unlikely to be produced automatically. Similarly, Commons Logging hit the boundaries of what is possible with automated test generation, and in particular caused problems for µTEST with its external dependencies and the fact that parts of the tested code are used within µTEST. For the other projects, µTEST was able to produce test suites with high mutation scores.

Fig. 6. Mutation score per class (box plots per case study object; y-axis: mutation score, 0.0 to 1.0): For the majority of case study objects and classes, µTEST achieves high mutation scores.

In our experiments, µTEST generates test cases that kill 75% of all mutants on average.

6.4 Test Case and Test Suite Sizes

Figure 7 illustrates that the number of test cases resulting per class is small on average; this is not surprising, as each test case not only kills the target mutation for which it was created, but usually many other mutants as well.

We define the length of a test case as the number of statements it consists of, excluding assertions. The test suite length is the sum of the lengths of its constituent test cases. Figure 8 lists the test suite lengths, confirming that the resulting test suites are small and the minimization is effective at producing smaller test suites.

In general, test suites are often minimized to reduce the overall costs of test execution [23]. The size of individuals may also be of concern during the search with a genetic algorithm: bloat is a phenomenon in evolutionary search where individuals grow disproportionally large, thus inhibiting the search (cf. Section 3). In addition to these considerations, our motivation to produce small test suites is that in a non-regression testing scenario the test cases need to be presented to a developer, who needs to confirm the generated oracles. We expect that the shorter a test case, the easier it will be to understand. The results of our experiments confirm that our approach is effective at generating small test suites. However, whether our intuition on the influence of the length of tests on their understandability is true or not would require human experiments, which we plan to do in the future.

Fig. 7. Number of test cases in a test suite for a single class (box plots per case study object; y-axis: test suite size, 0 to 60): Usually, few test cases are necessary to kill all mutants of a class.

There is also a drawback to using short test cases, as longer test cases make it easier to reach coverage goals [6], and may have better chances at exposing interaction faults. To exploit the fact that long test cases are good at exploring the state space, we allow the search to employ longer test sequences, and then reduce their length after the search. Furthermore, many short test cases can be disadvantageous compared to fewer long test cases if the setup costs for individual test cases are high. These considerations are outside the scope of this paper, but the analysis of the effects of test case length is the focus of recent research [3], [6], [18].

In general, the minimization of test cases in our scenario does not affect the potential to detect mutants; however, any minimization technique might affect the residual fault detection ability [36]. This means that faults that are not represented by the mutants but detected by a long test case might no longer be detected after minimization.

6.5 Assertion Minimization

µTEST minimizes the number of assertions with the aim of covering as many mutants as possible with as few assertions as possible. The intuition behind this optimization is that the fewer assertions there are, the fewer assertions need to be validated by the tester. To see whether µTEST is successful in this respect, Figure 9 summarizes statistics on the total number of possible assertions in the generated test cases vs. the assertions selected by µTEST. The figure shows that the number of possible assertions in a test case can be very large (the maximum we observed was 122 assertions for a test case in Commons Primitives, reduced to a single assertion by µTEST). The reduction is largest for the Commons Primitives, Commons Math, Joda Time, and NanoXML objects, but in all cases there is a significant reduction.

Fig. 8. Total length of a test suite for a single class: Short test cases are important to keep test cases understandable.

Fig. 9. Statistics on the number of possible assertions per generated test ("All"), and assertions selected by µTEST ("Minimized").

In our experiments, µTEST selects only 32% of all possible assertions on average.

This result shows that µTEST is effective at finding small sets of assertions. However, determining whether it is actually easier for a tester to confirm a suggested assertion than to come up with a new assertion herself would again require human experiments, which are out of the scope of this paper. Furthermore, this evaluation does not reveal how assertions compare to other types of oracles, such as an oracle based on the output of a test case. Our choice of assertions as oracles in the context of unit testing is based on the observation that test cases written by developers usually also use such assertions.

Minimizing the assertions with respect to mutations by construction does not affect the set of mutants that a test case can detect. However, as with the test case minimization discussed above and any test minimization technique in general, the residual fault detection ability might be affected. This means that, in theory, removing assertions might lead to faults no longer being detected, even though the mutation score stays the same. An analysis of this effect would require experiments with real faults and is outside the scope of this paper.
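The selection step described in this subsection can be viewed as a set-cover problem: choose as few assertions as possible such that every mutant detected by the test is still detected by at least one selected assertion. The greedy heuristic below is only an illustrative sketch of such a selection, not necessarily the exact algorithm implemented in µTEST; the Assertion type and the integer mutant identifiers are placeholders.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class AssertionSelectionSketch {
    interface Assertion { }   // placeholder for a generated assertion

    // killedBy maps each candidate assertion to the IDs of the mutants it detects.
    static List<Assertion> selectGreedily(Map<Assertion, Set<Integer>> killedBy) {
        Set<Integer> uncovered = new HashSet<Integer>();
        for (Set<Integer> mutants : killedBy.values()) {
            uncovered.addAll(mutants);
        }
        List<Assertion> selected = new ArrayList<Assertion>();
        while (!uncovered.isEmpty()) {
            Assertion best = null;
            int bestGain = 0;
            for (Map.Entry<Assertion, Set<Integer>> entry : killedBy.entrySet()) {
                Set<Integer> gain = new HashSet<Integer>(entry.getValue());
                gain.retainAll(uncovered);               // mutants newly covered
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = entry.getKey();
                }
            }
            if (best == null) {
                break;                                   // no assertion adds coverage
            }
            selected.add(best);
            uncovered.removeAll(killedBy.get(best));
        }
        return selected;
    }
}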

6.6 Comparison to Manually Written Tests

In order to see how test cases generated with µTEST compare to manually written test cases, Table 2 gives more detailed statistics on the handcrafted test suites written by the developers of the case study objects and included in their distributions, as well as on those generated by µTEST. For the handcrafted tests, we derived these numbers by counting the non-commenting source code statements using JavaNCSS (see footnote 11), and we counted the number of assertions using the command line tools grep and wc. The number of statements in Table 2 is derived as the number of non-commenting source code statements excluding assertions. As some of the objects use parametrized tests, where one test method results in several unit tests, we averaged these numbers over the number of methods as reported by JavaNCSS. NanoXML Lite only comes with two test cases, but these tests parse and act upon complex XML files.

In contrast to our initial results [19], the automatically generated test cases are on average shorter than manually written test cases. This demonstrates the effectiveness of both the genetic algorithm in finding short sequences and the minimization applied in removing all unnecessary statements. The difference from the initial results can be attributed to the fact that µTEST has matured significantly, as witnessed by its applicability to different open source libraries. In particular, the original version of µTEST included the length in the fitness function explicitly, while now µTEST ranks individuals with the same fitness according to their length during selection, and it employs different bloat reduction strategies (see Section 3).

In only three out of 10 cases did µTEST produce more test cases than the manual test suites contain. To some extent, this reduction in the number of test cases is because we only considered the automatically testable subset of classes, while the manually written test cases address all classes. However, another explanation is that the automatically generated test cases are simply more effective at finding mutants than manually written test cases. To verify this explanation, we analyzed and compared the test suites for the testable classes in detail.

11. http://www.kclee.de/clemens/java/javancss


TABLE 2
Total number of test cases and average statistics per test case: Manually handcrafted vs. generated

                                 Manual                                    Generated
Case Study             Tests  Statements/Test  Assertions/Test    Tests  Statements/Test  Assertions/Test
Commons CLI              187       7.45             2.80         137.39       4.91             2.57
Commons Codec            284       6.67             3.16         236.28       4.50             1.20
Commons Collections   12,954       6.28             2.10        1955.67       4.65             2.24
Commons Logging           26       6.90             1.03          77.86       6.08             2.00
Commons Math          14,693       6.93             3.41        1797.79       4.49             1.91
Commons Primitives     3,397       4.05             0.86        1145.67       5.88             1.54
Google Collections    33,485       4.52             1.25         781.79       3.88             1.81
JGraphT                  118       9.10             1.65         484.96       4.56             1.52
Joda Time              3,493       4.89             4.55        1553.36       6.10             1.89
NanoXML                    2      12.67             0.67          35.47       6.22             1.13

TABLE 3
A12 measure values in the mutation score comparisons: A12 < 0.5 means µTEST achieved lower, A12 = 0.5 equal, and A12 > 0.5 higher mutation scores than the manually written test suites.

Case Study            #A12 < 0.5   #A12 = 0.5   #A12 > 0.5
Commons CLI                    7            1            5
Commons Codec                  9            2            9
Commons Collections           46           24          123
Commons Logging                1            2            1
Commons Math                  74           10          159
Commons Primitives             7           96           51
Google Collections            36            9           38
JGraphT                       30           16           66
Joda Time                     42            3           78
NanoXML                        1            0            0
Σ                            253          163          530

As randomized algorithms can produce different results in different runs, we follow the guidelines on statistical analysis described by Arcuri and Briand [7].

Statistical difference has been measured with the Mann-Whitney U test. To quantify the improvement in a standardized way, we used the Vargha-Delaney A12 effect size [47]. In our context, the A12 is an estimation of the probability that, if we run µTEST, we obtain a higher mutation score than with the manually crafted tests. When the two types of tests are equivalent, then A12 = 0.5. A high value A12 = 1 would mean that µTEST obtained a higher mutation score in all cases.
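For illustration, the A12 statistic can be computed directly from its pairwise definition; the sketch below compares two samples of mutation scores (method and parameter names are ours).

class VarghaDelaney {
    // A12 = ( #{(x, y) : x > y} + 0.5 * #{(x, y) : x = y} ) / (m * n),
    // i.e., the probability that a randomly chosen µTEST result is higher
    // than a randomly chosen result of the manually written suite.
    static double a12(double[] muTestScores, double[] manualScores) {
        double wins = 0.0;
        for (double x : muTestScores) {
            for (double y : manualScores) {
                if (x > y) {
                    wins += 1.0;
                } else if (x == y) {   // ties count half
                    wins += 0.5;
                }
            }
        }
        return wins / ((double) muTestScores.length * manualScores.length);
    }
}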

Table 3 shows how many times we obtained A12 values lower than, equal to, and higher than 0.5. We obtained p-values lower than 0.05 in 709 out of 946 comparisons. Figure 10 shows a box-plot of the A12 values, grouped by case study object. For NanoXML, µTEST achieves clearly worse results than the manually written test cases, which has to be attributed to difficulties both in generating test cases and in finding suitable assertions: the manually written test cases read XML samples and compare XML output as oracles, which µTEST currently cannot do. Although µTEST has fewer problems generating test cases for Commons CLI, the API exposes only few possibilities for generating assertions, and so µTEST achieves higher mutation scores only for five out of 13 classes. For Commons Codec, Commons Logging, and Google Collections, µTEST achieves mutation scores similar to those of the manually written test suites. The mutation scores for Commons Codec are close to 100%, which means there is little room for improvement. As mentioned previously, Commons Logging posed difficulties for µTEST, which led to a low mutation score. Google Collections is very well tested (more than 33,000 test cases!), so achieving the same mutation score in 9 and a higher mutation score in 38 out of 83 cases can be seen as a success. Finally, for Commons Collections, Commons Math, Commons Primitives, JGraphT, and Joda Time, µTEST achieves clearly higher mutation scores than the manually written test suites. However, as Figure 10 shows, there is significant variation in these results. This is mainly because each of these libraries contains several classes that are difficult for automated test generation. For example, Commons Math contains many classes where using random numbers (as is the case with a genetic algorithm) can lead to very long computation times and thus timeouts. Joda Time has a few classes that heavily depend on text input in specific date/time formats, which µTEST cannot produce. Similarly, JGraphT has a few classes that pose difficulties for test generation. In summary, we conclude that µTEST achieves higher mutation scores than manually written test cases; improvements in the test generation algorithm would lead to further increases, provided the API is rich enough to support assertion generation (i.e., not in the case of NanoXML).

µTEST generates test suites and oracles that find significantly more seeded defects than manually written test suites.

Fig. 10. A12 for mutation scores: µTEST has a high probability of achieving a higher mutation score than manually written test cases.

6.7 Overfitting to Mutations

µTEST optimizes test suites to detect mutants; a basic assumption of mutation testing is that mutants are based on actual fault models and are representative of real faults, and experiments have confirmed that generated mutants are indeed similar to real faults for the purpose of evaluating testing techniques [2]. In addition, the competent programmer hypothesis [30] states that developers produce programs that are close to being correct, and the coupling effect hypothesis [30] states that test cases that are sufficient to detect simple faults (i.e., mutants) are also capable of detecting complex faults (i.e., higher order mutants or real faults). Given these assumptions, µTEST produces test suites that are likely to cover most real faults.

Mutation testing does not require that all types of mutations are considered, as it has been extensively determined experimentally [32], [43] that there are sufficient sets of mutation operators: If a test suite kills all mutants derived with such a sufficient set, then it will detect all other mutants with very high probability. Based on previous work [2], Javalanche by default only uses the mutation operators statement deletion, jump negation, arithmetic operator replacement, and comparison operator replacement. According to other experiments on sufficient sets of mutation operators [32], variable replacement (VRO) should not be subsumed by these operators. The Javalanche developers kindly added this operator to their set, so that we could reveal the effect of overfitting to the given mutation operators by evaluating the test suites produced by µTEST with the Javalanche default operators with respect to their effectiveness at finding variable replacement faults. Table 4 shows that, on the variable replacement operator (VRO), the manually written test cases achieve results very similar to those on the other operators. In contrast, µTEST's test cases show significantly lower mutation scores for this operator, as expected (interestingly, with the exception of NanoXML).

TABLE 4
Mutation scores on the variable replacement operator, which is not in the default set of mutation operators of Javalanche and was therefore not used by µTEST when generating tests.

Case Study             Manual             Generated
Commons CLI            81.85%             32.45%
Commons Codec          90.80%             80.67%
Commons Collections    62.44%             47.50%
Commons Logging        85.71% (fn. 12)    10.00%
Commons Math           53.53%             16.24%
Commons Primitives     76.41%             54.27%
Google Collections     57.96%             45.04%
JGraphT                65.22%             23.52%
Joda Time              67.90%             48.94%
NanoXML                19.14%             45.62%
Average                66.10%             40.43%

µTEST’s choice of assertions is determinedby the choice of mutation operators.

In general, this shows that test generation based on mutation testing cannot guarantee that all faults are detected. However, according to the findings of mutation testing research, using a sufficient set of operators (e.g., including VRO in the Javalanche set) should result in a good test suite; as usual, an evaluation on real faults would be required to confirm this.

6.8 Threats to Validity

The results of our experiments are subject to the following threats to validity:

The 952 classes we investigated come from 10 projects; while this is a large case study with respect to the state of the art and the literature, we cannot claim that the results of our experimental evaluation are generalizable. The evaluation should be seen as investigating the potential of the technique rather than providing a statement of general effectiveness.

The unit tests against which we compared might not be representative of all types of tests (in particular, we did not compare against automatically derived tests, as other tools provide no oracles or only regression oracles), but we chose projects that come with test suites of non-trivial sizes. The units investigated had to be testable automatically, as discussed in Sections 5.2 and 6.1. The generalization to programming languages other than the one used in our experiments (Java) depends on testability; for example, a search based approach as used by µTEST would be quite hopeless on an untyped language such as Python, unless a major effort were put into type analysis. The results should, however, generalize to comparable languages such as C# or C++.

12. Not all tests of Commons Logging are executable within Javalanche; the mutation score therefore only represents the score of those mutants for which the test cases executed successfully.


Another possible threat is that the tools we have used or implemented could be defective. To counter this threat, we have run several manual assessments and counter-checks.

We evaluated the quality of unit tests in terms of their mutation score (Table 2), as well as their size and the ability to select small sets of assertions. In practice, other factors such as understandability or maintainability have to be considered as well. To this end, we try to maximize impact and minimize the length of test cases; however, longer test cases might be preferable in terms of their fault detection ability [3], [6].

In our experiments, we showed that µTEST can generate test cases that detect mutants of a program; however, µTEST is also conceived to identify faults in the current version of the software, as assertions influenced by faults are expected to be identified as invalid by the developer. Evaluating this aspect would require experiments with real faults, as using the same mutation tool to seed defects and to generate tests would lead to experiments no different from those we have already performed.

7 CONCLUSIONS

Mutation analysis is known to be effective in evaluating existing test suites. In this paper, we have shown that mutation analysis can also drive automated test generation. The main difference between using structural coverage and mutation analysis to guide test generation is that a mutation does not only show where to test, but also helps in identifying what should be checked for. In our experiments, this results in test suites that are significantly better at finding defects than the (already high-quality) manually written ones.

The advent of automatic generation of effective test suites has an impact on the entire unit testing process: Instead of laboriously thinking of sequences that lead to observable features and creating oracles to check these observations, the tester lets a tool create unit tests automatically, and receives two test sets: one revealing general faults detectable by random testing, the other consisting of regular unit tests. In the long run, finding bugs could thus be reduced to the task of checking whether the generated assertions match the intended behavior.

Although our µTEST experiences are already very promising, there is ample opportunity to improve the results further: For example, previously generated test cases, manual unit tests, or test cases satisfying a coverage criterion could serve as a better starting point for the genetic algorithm [51]. The search based algorithm can be optimized much further, for example by applying testability transformation [21] or by improving the fitness function. If a mutated method is executed but the mutant itself is not executed, then a local optimization on the parameters of that method call could possibly lead to a result much quicker than the global search [22] (for example, if the input parameters are numeric). It is even conceivable to use a hybrid approach with dynamic symbolic execution [24].

To learn more about our work on mutation testing, visit our Web site:

http://www.st.cs.uni-saarland.de/mutation/

ACKNOWLEDGMENTS

We thank David Schuler for his work on Javalanche, and Andrea Arcuri, Valentin Dallmeier, Andrzej Wasylkowski, Jeff Offutt, Eva May, and the anonymous reviewers for comments on earlier versions of this paper. This project has been funded by DFG grant Ze509/5-1, and by a Google Focused Research Award on "Test Amplification".

REFERENCES

[1] S. Ali, L. C. Briand, H. Hemmati, and R. K. Panesar-Walawege. A systematic review of the application and empirical investigation of search-based test-case generation. IEEE Transactions on Software Engineering, 99(PrePrints), 2009.

[2] J. H. Andrews, L. C. Briand, and Y. Labiche. Is mutation an appropriate tool for testing experiments? In ICSE '05: Proceedings of the 27th International Conference on Software Engineering, pages 402–411, New York, NY, USA, 2005. ACM.

[3] J. H. Andrews, A. Groce, M. Weston, and R. G. Xu. Random test run length and effectiveness. In ASE '08: Proceedings of the 23rd IEEE/ACM International Conference on Automated Software Engineering, pages 19–28, Washington, DC, USA, 2008. IEEE Computer Society.

[4] J. H. Andrews, S. Haldar, Y. Lei, and F. C. H. Li. Tool support for randomized unit testing. In RT '06: Proceedings of the 1st International Workshop on Random Testing, pages 36–45, New York, NY, USA, 2006. ACM.

[5] A. Arcuri. It does matter how you normalise the branch distance in search based software testing. In ICST '10: Proceedings of the 3rd International Conference on Software Testing, Verification and Validation, pages 205–214. IEEE Computer Society, 2010.

[6] A. Arcuri. Longer is better: On the role of test sequence length in software testing. In ICST '10: Proceedings of the 3rd International Conference on Software Testing, Verification and Validation, pages 469–478. IEEE Computer Society, 2010.

[7] A. Arcuri and L. Briand. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In IEEE International Conference on Software Engineering (ICSE), pages 1–10, New York, NY, USA, 2011. ACM.

[8] A. Arcuri and X. Yao. Search based software testing of object-oriented containers. Information Sciences, 178(15):3075–3095, 2008.

[9] K. Ayari, S. Bouktif, and G. Antoniol. Automatic mutation test input data generation via ant colony. In GECCO '07: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pages 1074–1081, New York, NY, USA, 2007. ACM.

[10] M. Boshernitsan, R. Doong, and A. Savoia. From Daikon to Agitator: lessons and challenges in building a commercial tool for developer testing. In ISSTA '06: Proceedings of the 2006 International Symposium on Software Testing and Analysis, pages 169–180, New York, NY, USA, 2006. ACM.

[11] L. Bottaci. A genetic algorithm fitness function for mutation testing. In SEMINAL 2001: International Workshop on Software Engineering using Metaheuristic Innovative Algorithms, a workshop at the 23rd Int. Conference on Software Engineering, pages 3–7, 2001.

[12] V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3), 1979.

[13] I. Ciupa, A. Leitner, M. Oriol, and B. Meyer. Artoo: adaptive random testing for object-oriented software. In ICSE '08: Proceedings of the 30th International Conference on Software Engineering, pages 71–80, New York, NY, USA, 2008. ACM.


[14] C. Csallner and Y. Smaragdakis. JCrasher: an automatic robustness tester for Java. Software Practice and Experience, 34(11):1025–1050, 2004.

[15] R. A. DeMillo, R. J. Lipton, and F. G. Sayward. Hints on test data selection: Help for the practicing programmer. Computer, 11(4):34–41, April 1978.

[16] R. A. DeMillo and A. J. Offutt. Constraint-based automatic test data generation. IEEE Transactions on Software Engineering, 17(9):900–910, September 1991.

[17] R. B. Evans and A. Savoia. Differential testing: a new approach to change detection. In ESEC-FSE '07: Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, pages 549–552, New York, NY, USA, 2007. ACM.

[18] G. Fraser and A. Arcuri. It is not the length that matters, it is how you control it. In ICST '11: Proceedings of the 4th International Conference on Software Testing, Verification and Validation. IEEE Computer Society, 2011.

[19] G. Fraser and A. Zeller. Mutation-driven generation of unit tests and oracles. In ISSTA '10: Proceedings of the 19th International Symposium on Software Testing and Analysis, pages 147–158, New York, NY, USA, 2010. ACM.

[20] P. Godefroid, N. Klarlund, and K. Sen. DART: directed automated random testing. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 213–223, New York, NY, USA, 2005. ACM.

[21] M. Harman, L. Hu, R. Hierons, J. Wegener, H. Sthamer, A. Baresel, and M. Roper. Testability transformation. IEEE Transactions on Software Engineering, 30(1):3–16, 2004.

[22] M. Harman and P. McMinn. A theoretical and empirical study of search-based testing: Local, global, and hybrid search. IEEE Trans. Softw. Eng., 36:226–247, March 2010.

[23] M. J. Harrold, R. Gupta, and M. L. Soffa. A methodology for controlling the size of a test suite. ACM Trans. Softw. Eng. Methodol., 2:270–285, July 1993.

[24] K. Inkumsah and T. Xie. Improving structural testing of object-oriented programs via integrating evolutionary testing and symbolic execution. In ASE 2008: Proceedings of the 23rd IEEE/ACM International Conference on Automated Software Engineering, pages 297–306, 2008.

[25] Y. Jia and M. Harman. An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering, to appear, 2010.

[26] B. F. Jones, D. E. Eyres, and H.-H. Sthamer. A strategy for using genetic algorithms to automate branch and fault-based testing. The Computer Journal, 41(2):98–107, 1998.

[27] Y. Lei and J. H. Andrews. Minimization of randomized unit test cases. In ISSRE '05: Proceedings of the 16th IEEE International Symposium on Software Reliability Engineering, pages 267–276, Washington, DC, USA, 2005. IEEE Computer Society.

[28] A. Leitner, M. Oriol, A. Zeller, I. Ciupa, and B. Meyer. Efficient unit test case minimization. In ASE '07: Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, pages 417–420, New York, NY, USA, 2007. ACM.

[29] P. McMinn. Search-based software test data generation: a survey. Software Testing, Verification and Reliability, 14(2):105–156, 2004.

[30] A. J. Offutt. Investigations of the software testing coupling effect. ACM Trans. Softw. Eng. Methodol., 1:5–20, January 1992.

[31] A. J. Offutt, Z. Jin, and J. Pan. The dynamic domain reduction procedure for test data generation. Softw. Pract. Exper., 29:167–193, February 1999.

[32] A. J. Offutt, A. Lee, G. Rothermel, R. H. Untch, and C. Zapf. An experimental determination of sufficient mutant operators. ACM Trans. Softw. Eng. Methodol., 5:99–118, April 1996.

[33] A. Orso and T. Xie. BERT: BEhavioral Regression Testing. In WODA 2008: Proceedings of the International Workshop on Dynamic Analysis, pages 36–42, July 2008.

[34] C. Pacheco and M. D. Ernst. Eclat: Automatic generation and classification of test inputs. In ECOOP 2005 — Object-Oriented Programming, 19th European Conference, pages 504–527, Glasgow, Scotland, July 27–29, 2005.

[35] C. Pacheco and M. D. Ernst. Randoop: feedback-directed random testing for Java. In OOPSLA '07: Companion to the 22nd ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications, pages 815–816, New York, NY, USA, 2007. ACM.

[36] G. Rothermel, M. J. Harrold, J. Ostrin, and C. Hong. An empirical study of the effects of minimization on the fault detection capabilities of test suites. In ICSM '98: Proceedings of the International Conference on Software Maintenance, pages 34–, Washington, DC, USA, 1998. IEEE Computer Society.

[37] A. Salcianu and M. C. Rinard. Purity and side effect analysis for Java programs. In VMCAI 2005: Proceedings of the 6th International Conference on Verification, Model Checking, and Abstract Interpretation, volume 3385 of Lecture Notes in Computer Science, pages 199–215. Springer, 2005.

[38] D. Schuler, V. Dallmeier, and A. Zeller. Efficient mutation testing by checking invariant violations. In ISSTA '09: Proceedings of the 18th International Symposium on Software Testing and Analysis, pages 69–80, New York, NY, USA, 2009. ACM.

[39] D. Schuler and A. Zeller. Javalanche: efficient mutation testing for Java. In ESEC/FSE '09: Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, pages 297–298, New York, NY, USA, 2009. ACM.

[40] D. Schuler and A. Zeller. (Un-)Covering equivalent mutants. In ICST '10: Proceedings of the 3rd International Conference on Software Testing, Verification, and Validation, pages 45–54, Washington, DC, USA, 2010. IEEE Computer Society.

[41] K. Sen, D. Marinov, and G. Agha. CUTE: a concolic unit testing engine for C. In ESEC/FSE-13: Proceedings of the 10th European Software Engineering Conference held jointly with the 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 263–272, New York, NY, USA, 2005. ACM.

[42] R. Sharma, M. Gligoric, A. Arcuri, G. Fraser, and D. Marinov. Testing container classes: Random or systematic? In D. Giannakopoulou and F. Orejas, editors, Fundamental Approaches to Software Engineering, volume 6603 of Lecture Notes in Computer Science, pages 262–277. Springer Berlin / Heidelberg, 2011.

[43] A. Siami Namin, J. H. Andrews, and D. J. Murdoch. Sufficient mutation operators for measuring test effectiveness. In ICSE '08: Proceedings of the 30th International Conference on Software Engineering, pages 351–360, New York, NY, USA, 2008. ACM.

[44] K. Taneja and T. Xie. DiffGen: Automated regression unit-test generation. In ASE 2008: Proceedings of the 23rd IEEE/ACM International Conference on Automated Software Engineering, pages 407–410, 2008.

[45] N. Tillmann and J. N. de Halleux. Pex — white box test generation for .NET. In TAP 2008: International Conference on Tests and Proofs, volume 4966 of LNCS, pages 134–253. Springer, 2008.

[46] P. Tonella. Evolutionary testing of classes. In ISSTA '04: Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 119–128, New York, NY, USA, 2004. ACM.

[47] A. Vargha and H. D. Delaney. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2):101–132, 2000.

[48] S. Wappler and F. Lammermann. Using evolutionary algorithms for the unit testing of object-oriented software. In GECCO '05: Proceedings of the 2005 Conference on Genetic and Evolutionary Computation, pages 1053–1060, New York, NY, USA, 2005. ACM.

[49] N. Williams, B. Marre, P. Mouy, and M. Roger. PathCrawler: automatic generation of path tests by combining static and dynamic analysis. In EDCC 2005: Proceedings of the 5th European Dependable Computing Conference, volume 3463 of LNCS, pages 281–292. Springer, 2005.

[50] T. Xie. Augmenting automatically generated unit-test suites with regression oracle checking. In ECOOP 2006: Proceedings of the 20th European Conference on Object-Oriented Programming, pages 380–403, July 2006.

[51] Z. Xu, M. B. Cohen, and G. Rothermel. Factors affecting the use of genetic algorithms in test suite augmentation. In GECCO '10: Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, pages 1365–1372, New York, NY, USA, 2010. ACM.

[52] L. Zhang, T. Xie, L. Zhang, N. Tillmann, J. de Halleux, and H. Mei. Test generation via dynamic symbolic execution for mutation testing. In Proc. 26th IEEE International Conference on Software Maintenance (ICSM 2010), September 2010.


Gordon Fraser is a post-doc researcher at Saarland University, and a member of the Chair of Software Engineering led by Prof. Andreas Zeller. The central theme of his research is improving software quality, and his recent research concerns the prevention, detection, and removal of defects in software. More specifically, he develops techniques to generate test cases automatically, and to guide the tester in validating the output of tests by producing test oracles and specifications. Because of the necessity to involve humans in the testing process, he is investigating how to make these artifacts readable and understandable. At the same time, he strives to develop techniques that are both scalable to complex real-world software and practically usable, yet preserve their strong formal foundation.

Andreas Zeller is a computer science professor at Saarland University, Saarbrücken, Germany. Zeller's research increases programmer productivity. He researches large programs and their history, and develops methods to predict, isolate, and prevent causes of program failures, on open-source programs as well as in industrial contexts at IBM, Microsoft, SAP, and others. Zeller's contributions to computer science include the GNU DDD debugger, automated debugging, mining software archives, and scalable mutation testing. In 2009, his work on delta debugging obtained the ACM SIGSOFT 10-year impact paper award.