
Comparing Pairwise Testing in Software Product Lines: A Framework Assessment

Roberto E. Lopez-Herrejon1, Javier Ferrer2, Evelyn Nicole Haslinger1, Francisco Chicano2, Alexander Egyed1, and Enrique Alba2

1 Systems Engineering and Automation, Johannes Kepler University, Linz, Austria
{roberto.lopez, evelyn.haslinger, alexander.egyed}@jku.at
2 University of Malaga, Malaga, Spain
{ferrer, chicano, eat}@lcc.uma.es

Abstract. Software Product Lines (SPLs) are families of related software systems, which provide different feature combinations. In the past few years many SPL testing approaches have been proposed, among them those that support pairwise testing. Recently a comparison framework for pairwise testing in SPLs has been proposed; however, it has only been applied to two algorithms, none of them search based, and only on a limited number of case studies. In this paper we address these two limitations by comparing two search based approaches with a leading greedy alternative on 181 case studies taken from two different sources. We make a critical assessment of this framework based on an analysis of how useful it is to select the best algorithm among the existing approaches and an examination of the correlation we found between its metrics.

Keywords: Feature Models, Genetic Algorithms, Simulated Annealing, Software Product Lines, Pairwise Testing

1 Introduction

A Software Product Line (SPL) is a family of related software systems, which provide different feature combinations [1]. The effective management and realization of variability – the capacity of software artifacts to vary [2] – can lead to substantial benefits such as increased software reuse, faster product customization, and reduced time to market.

Because of several technological and marketing trends, systems are increasingly being built as SPLs rather than as individual products. This fact has created an increasing need for testing approaches that are capable of coping with the large numbers of feature combinations that characterize SPLs. Many testing alternatives have been put forward [3–6]. Salient among them are those that support pairwise testing [7–13]. With all these pairwise testing approaches available, the question now is: how do they compare? To answer this question, Perrouin et al. have recently proposed a comparison framework [14]. However,


this framework has only been applied to two algorithms, none of them search based, and only on a limited number of case studies. In this paper we address these two limitations by comparing two search based approaches with a leading greedy alternative on 181 case studies taken from two different sources.

Our interest (and our contribution) is in answering these two questions:

– RQ1: Is there any correlation between the framework’s metrics?
– RQ2: Is this framework useful to select the best one among the testing algorithms?

Based on the obtained answers, we make a critical assessment of the framework and suggest open avenues for further research.

The organization of the paper is as follows. In Section 2 we provide some background on the Feature Models used in Software Product Lines. Section 3 describes Combinatorial Interaction Testing and how it is applied to test SPLs, and Section 4 presents the algorithms used in our experimental section. In Section 5 the comparison framework of Perrouin et al. is presented, and it is evaluated in Section 6. Some related work is highlighted in Section 7, and we conclude the paper and outline future work in Section 8.

2 Feature Models and Running Example

Feature models have become a de facto standard for modelling the common and variable features of an SPL and their relationships, which collectively form a tree-like structure. The nodes of the tree are the features, which are depicted as labelled boxes, and the edges represent the relationships among them. Thus, a feature model denotes the set of feature combinations that the products of an SPL can have [15].

Figure 1 shows the feature model of our running example, the Graph Product Line (GPL), a standard SPL of basic graph algorithms that has been extensively used as a case study in the product line community [16]. A product has feature GPL (the root of the feature model), which contains its core functionality, a driver program (Driver) that sets up the graph examples (Benchmark) to which a combination of graph algorithms (Algorithms) is applied. The types of graphs (GraphType) can be either directed (Directed) or undirected (Undirected), and can optionally have weights (Weight). Two graph traversal algorithms (Search) are available: either Depth First Search (DFS) or Breadth First Search (BFS). A product must provide at least one of the following algorithms: numbering of nodes in the traversal order (Num), connected components (CC), strongly connected components (SCC), cycle checking (Cycle), shortest path (Shortest), minimum spanning trees with Prim's algorithm (Prim) or Kruskal's algorithm (Kruskal).

In a feature model, each feature (except the root) has one parent feature and can have a set of child features. Notice here that a child feature can only be included in a feature combination of a valid product if its parent is included as well. The root feature is always included. There are four kinds of feature relationships:


Fig. 1. Graph Product Line Feature Model

– Mandatory features are depicted with a filled circle. A mandatory feature is selected whenever its respective parent feature is selected. For example, features Driver and GraphType.

– Optional features are depicted with an empty circle. An optional feature may or may not be selected if its respective parent feature is selected. An example is feature Weight.

– Exclusive-or relations are depicted as empty arcs crossing over a set of lines connecting a parent feature with its child features. They indicate that exactly one of the features in the exclusive-or group must be selected whenever the parent feature is selected. For example, if feature Search is selected, then either feature DFS or feature BFS must be selected.

– Inclusive-or relations are depicted as filled arcs crossing over a set of lines connecting a parent feature with its child features. They indicate that at least one of the features in the inclusive-or group must be selected if the parent is selected. If, for instance, feature Algorithms is selected, then at least one of the features Num, CC, SCC, Cycle, Shortest, Prim, and Kruskal must be selected.

Besides the parent-child relations, features can also relate across different branches of the feature model with the so-called Cross-Tree Constraints (CTC). Figure 1 shows some of the CTCs of our feature model3. For instance, Cycle requires DFS means that whenever feature Cycle is selected, feature DFS must also be selected. As another example, Prim excludes Kruskal means that both features cannot be selected at the same time in any product. These constraints, as well as those implied by the hierarchical relations between features, are usually expressed and checked using propositional logic; for further details refer to [18].
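As an illustration of how such constraints can be checked mechanically, the following minimal Python sketch (our own encoding, not the tooling used by the algorithms compared later) expresses a handful of the GPL rules as Boolean predicates over a set of selected feature names:

def valid_gpl(sel):
    """Check a candidate selection (a set of feature names) against a few of
    the GPL constraints; illustrative only, not the complete feature model."""
    algorithms = {"Num", "CC", "SCC", "Cycle", "Shortest", "Prim", "Kruskal"}
    rules = [
        "GPL" in sel,                                            # root always selected
        "Driver" in sel and "GraphType" in sel,                  # mandatory children of the root
        ("GraphType" not in sel)
            or (("Directed" in sel) != ("Undirected" in sel)),   # exclusive-or group
        ("Algorithms" not in sel) or bool(sel & algorithms),     # inclusive-or group
        ("Cycle" not in sel) or ("DFS" in sel),                  # CTC: Cycle requires DFS
        not ({"Prim", "Kruskal"} <= sel),                        # CTC: Prim excludes Kruskal
    ]
    return all(rules)

# The feature set fs0 of Table 1 satisfies these constraints:
fs0_selected = {"GPL", "Driver", "GraphType", "Weight", "Algorithms",
                "Benchmark", "Undirected", "Prim"}
print(valid_gpl(fs0_selected))  # True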

Definition 1. Feature List (FL) is the list of features in a feature model.

The FL for the GPL feature model is [GPL, Driver, Benchmark, GraphType, Directed, Undirected, Weight, Search, DFS, BFS, Algorithms, Num, CC, SCC, Cycle, Shortest, Prim, Kruskal].

3 In total, the feature model has 13 CTCs; for further details please refer to [16] and [17].


Table 1. Sample Feature Sets of GPL. The table lists eight valid feature sets, fs0 to fs7; selected features are ticked (X) and unselected features are left empty. Column abbreviations: Driver (Dri), GraphType (Gtp), Weight (W), Search (Se), Algorithms (Alg), Benchmark (B), Directed (D), Undirected (U), Num (N), Cycle (Cyc), Shortest (Sh), Kruskal (Kru).

Definition 2. A feature set, also called product in SPL, is a 2-tuple [sel, sel̄] where sel and sel̄ are respectively the sets of selected and not-selected features of a member product4. Let FL be a feature list; thus sel, sel̄ ⊆ FL, sel ∩ sel̄ = ∅, and sel ∪ sel̄ = FL. The terms p.sel and p.sel̄ respectively refer to the sets of selected and not-selected features of product p.

Definition 3. A feature set fs is valid in feature model fm, i.e. valid(fs, fm) holds, iff fs does not contradict any of the constraints introduced by fm. We denote with FS the set of valid feature sets of a feature model (the feature model is omitted to simplify the notation).

Definition 4. A feature f is a core feature if it is selected in all the valid feature sets of a feature model fm, and it is a variant feature if it is selected only in some of them.

For example, the feature set fs0=[{GPL, Driver, GraphType, Weight, Algorithms, Benchmark, Undirected, Prim}, {Search, Directed, DFS, BFS, Num, CC, SCC, Cycle, Shortest, Kruskal}] is valid. As another example, a feature set with features DFS and BFS would not be valid because it violates the constraint of the exclusive-or relation, which establishes that these two features cannot appear selected together in the same feature set. In our running example, the feature model denotes 73 valid feature sets. Some of them are depicted in Table 1, where selected features are ticked (X) and unselected features are left empty. For the sake of space, in this table we use shortened names as column labels for some features (e.g. Dri for feature Driver). Notice that features GPL, Driver, Benchmark, GraphType and Algorithms are core features and the remaining features are variant features.

4 Definition based on [18].


3 Combinatorial Interaction Testing in SPLs

Combinatorial Interaction Testing (CIT) is a testing approach that constructs samples to drive the systematic testing of software system configurations [19]. When applied to SPL testing, the idea is to select a representative subset of products where interaction errors are more likely to occur, rather than testing the complete product family [19]. In this section we provide the basic terminology of CIT within the context of SPLs.

Definition 5. A t-set ts is a 2-tuple [sel, sel̄] representing a partially configured product, defining the selection of t features of the feature list FL, i.e. ts.sel ∪ ts.sel̄ ⊆ FL ∧ ts.sel ∩ ts.sel̄ = ∅ ∧ |ts.sel ∪ ts.sel̄| = t. We say t-set ts is covered by feature set fs iff ts.sel ⊆ fs.sel ∧ ts.sel̄ ⊆ fs.sel̄.

Definition 6. A t-set ts is valid in a feature model fm if there exists a valid feature set fs that covers ts. The set of all valid t-sets of a feature model is denoted with TS (again, we omit the feature model for the sake of clarity).

Definition 7. A t-wise covering array tCA for a feature model fm is a set of valid feature sets that covers all valid t-sets denoted by fm.5 We also use the term test suite in this paper to refer to a covering array.
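To make Definitions 5–7 concrete, the following sketch (using our own encoding of feature sets and t-sets as pairs of selected/not-selected feature-name sets, not the data structures of any of the tools discussed later) checks whether a t-set is covered by a feature set and whether a candidate suite is a covering array:

def covers(tset, fset):
    """Definition 5: a t-set is covered by a feature set iff its selected
    features are selected and its not-selected features are not selected."""
    return tset[0] <= fset[0] and tset[1] <= fset[1]

def is_covering_array(suite, valid_tsets):
    """Definition 7: every valid t-set is covered by at least one member."""
    return all(any(covers(ts, fs) for fs in suite) for ts in valid_tsets)

# Feature set fs0 from Table 1 as a (selected, not-selected) pair.
fs0 = (frozenset({"GPL", "Driver", "GraphType", "Weight", "Algorithms",
                  "Benchmark", "Undirected", "Prim"}),
       frozenset({"Search", "Directed", "DFS", "BFS", "Num", "CC", "SCC",
                  "Cycle", "Shortest", "Kruskal"}))

print(covers((frozenset(), frozenset({"SCC", "CC"})), fs0))       # True
print(covers((frozenset({"Driver"}), frozenset({"Prim"})), fs0))  # False: fs0 selects Prim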

Let us illustrate these concepts for pairwise testing, meaning t=2. From the feature model in Figure 1, a valid 2-set is [{Driver},{Prim}]. It is valid because the selection of feature Driver and the non-selection of feature Prim do not violate any constraints. As another example, the 2-set [{Kruskal,DFS}, ∅] is valid because there is at least one feature set, for instance fs1 in Table 1, where both features are selected. The 2-set [∅, {SCC,CC}] is also valid because there are valid feature sets that do not have any of these features selected, for instance feature sets fs0, fs1, and fs3. Notice however that the 2-set [∅,{Directed,Undirected}] is not valid. This is because feature GraphType is present in all the feature sets (it is a mandatory child of the root), so either Directed or Undirected must be selected. In total, our running example has 418 valid 2-sets.

Based on Table 1, the three valid 2-sets just mentioned are covered as follows. The 2-set [{Driver},{Prim}] is covered by feature sets fs1, fs2, fs3, fs4, fs6, and fs7. Similarly, the 2-set [{Kruskal,DFS}, ∅] is covered by feature set fs1, and [∅, {SCC,CC}] is covered by feature sets fs0, fs2, and fs3.

4 Algorithms Overview

In this section we briefly sketch the three testing algorithms we analyse in this paper: CASA, PGS, and ICPL.

5 Definition based on [20].


4.1 CASA Algorithm

CASA is a simulated annealing algorithm that was designed to generate n-wise covering arrays for SPLs [9]. CASA relies on three nested search strategies. The outermost search performs one-sided narrowing, pruning the potential size of the test suite to be generated by only decreasing the upper bound. The mid-level search performs a binary search for the test suite size. The innermost search strategy is the actual simulated annealing procedure, which tries to find a pairwise test suite of size N for feature model FM. For more details on CASA please refer to [9].

4.2 Prioritized Genetic Solver

The Prioritized Genetic Solver (PGS) is an evolutionary approach proposed by Ferrer et al. [21] that constructs a test suite taking into account priorities during the generation. PGS is a constructive genetic algorithm that adds one new product to the partial solution in each iteration until all pairwise combinations are covered. In each iteration the algorithm tries to find the product that adds the most coverage to the partial solution. This paper extends and adapts PGS for SPL testing. PGS has been implemented using jMetal [22], a Java framework aimed at the development, experimentation, and study of metaheuristics for solving optimization problems. For further details on PGS, please refer to [21].
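The constructive idea can be sketched as a fitness function that rewards the coverage a candidate product adds in the current iteration; this is only an illustration of the principle (reusing the covers() helper sketched in Section 3), not the authors' jMetal implementation:

def coverage_gain(candidate, uncovered_pairs):
    """Number of still-uncovered 2-sets that the candidate product covers;
    a PGS-style constructive step maximizes this value in each iteration."""
    return sum(1 for ts in uncovered_pairs if covers(ts, candidate))

def best_of_population(population, uncovered_pairs):
    """The product a GA-based step would add in this iteration: the valid
    candidate with the largest coverage gain."""
    return max(population, key=lambda p: coverage_gain(p, uncovered_pairs))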

4.3 ICPL

ICPL is a greedy approach to generate n-wise test suites for SPLs, which was introduced by Johansen et al. [20]6. It is basically an adaptation of Chvatal's algorithm to solve the set cover problem. First, the set TS of all valid t-sets that need to be covered is generated. Next, the first feature set (product) fs is produced by greedily selecting a subset of t-sets in TS that constitute a valid product in the input feature model, and it is added to the (initially empty) test suite tCA. Henceforth all t-sets that are covered by product fs are removed from TS. ICPL then proceeds to generate products and add them to the test suite tCA until TS is empty, i.e. until all valid t-sets are covered by at least one product. To increase ICPL's performance Johansen et al. made several enhancements to the algorithm; for instance, they parallelized the data-independent processing steps. For further details on ICPL please refer to [20].
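A rough sketch of this greedy outer loop follows. It is simplified: the real ICPL assembles each product by greedily merging compatible t-sets and parallelizes several steps; build_product below is a hypothetical placeholder for that construction, and covers() is the helper from the Section 3 sketch.

def greedy_covering_array(valid_tsets, build_product):
    """Repeatedly build a valid product for the uncovered t-sets, add it to
    the suite, and drop the t-sets it covers, until none remain. Assumes
    build_product always returns a product covering at least one t-set."""
    suite, remaining = [], set(valid_tsets)
    while remaining:
        product = build_product(remaining)   # greedily assemble a valid product
        suite.append(product)
        remaining = {ts for ts in remaining if not covers(ts, product)}
    return suite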

5 Comparison Framework

In this section we present the four metrics that constitute the framework proposed by Perrouin et al. for the comparison of pairwise testing approaches for SPLs [14]. We define them based on the terminology presented in Section 2 and Section 3.

6 ICPL stands for “ICPL Covering array generation algorithm for Product Lines”.


For the following metric definitions, let tCA be a t-wise covering array of feature model fm.

Metric 1. Test Suite Size is the number of feature sets selected in a covering array for a feature model, namely n = |tCA|.

Metric 2. Performance is the time required for an algorithm to compute a test suite (covering array).

Metric 3. Test Suite Similarity. This metric is defined based on Jaccard's similarity index and applied to variable features. Let FM be the set of all possible feature models, fs and gs be two feature sets in FS, and var : FS × FM → FL be an auxiliary function that returns, from the selected features of a feature set, those features that are variable according to a feature model. The similarity index of two feature sets is thus defined as follows:

\[
Sim(fs, gs, fm) =
\begin{cases}
\dfrac{|var(fs,fm) \cap var(gs,fm)|}{|var(fs,fm) \cup var(gs,fm)|} & \text{if } var(fs,fm) \cup var(gs,fm) \neq \emptyset \\
0 & \text{otherwise}
\end{cases}
\quad (1)
\]

And the similarity value for the entire covering array is defined as:

\[
TestSuiteSim(tCA, fm) = \frac{\sum_{fs_i \in tCA} \sum_{fs_j \in tCA} Sim(fs_i, fs_j, fm)}{|tCA|^2}
\quad (2)
\]

It should be noted here that the second case of the similarity index, when there are no variable features in either feature set, is not part of the originally proposed comparison framework [14]. We added it because we found several feature sets formed only with core features in the feature models analysed in Section 6.3.
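A direct transcription of Equations (1) and (2) as a sketch, assuming the (selected, not-selected) pair encoding of the earlier sketches and a precomputed set of the model's variant features:

def sim(fs, gs, variant_features):
    """Equation (1): Jaccard index over the selected *variable* features,
    with the added 0 case when neither product selects a variable feature."""
    a, b = fs[0] & variant_features, gs[0] & variant_features
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def test_suite_sim(suite, variant_features):
    """Equation (2): average similarity over all ordered pairs of the suite."""
    total = sum(sim(f, g, variant_features) for f in suite for g in suite)
    return total / (len(suite) ** 2)

# For example, sim(fs0, fs2, gpl_variant_features) yields 0/8 = 0.0, as in the text below.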

Metric 4. Tuple Frequency. Let occurrence : TS × 2^FS → N be an auxiliary function that counts the occurrences of a t-set in the feature sets of a covering array of a single feature model. The metric is thus defined as:

\[
TupleFrequency(ts, tCA) = \frac{occurrence(ts, tCA)}{|tCA|} \quad (3)
\]
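The corresponding sketch of Equation (3), over the same encoding and reusing the covers() helper from the Section 3 sketch:

def tuple_frequency(ts, suite):
    """Equation (3): fraction of the suite's feature sets that cover the t-set.
    For [{Driver},{Prim}] over the eight feature sets of Table 1 this gives 6/8."""
    return sum(1 for fs in suite if covers(ts, fs)) / len(suite)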

The first two metrics are the standard measurements used for comparison between different testing algorithms, not only within the SPL domain. To the best of our understanding, the intuition behind Test Suite Similarity is that the more dissimilar the feature sets are (the closer the value is to 0), the higher the chances of detecting faulty behaviour when the corresponding t-wise tests are instrumented and performed. Along the same lines, the rationale behind Tuple Frequency is that the lower this number, the higher the chances of avoiding repeated execution of t-wise tests.

Let us provide some examples for the latter two metrics. Consider for instance feature sets fs0, fs1, fs2 and fs7 from Table 1. The variable features in those feature sets are:

var(fs0, gpl) = {Undirected, Weight, Prim}
var(fs1, gpl) = {Undirected, Weight, Search, DFS, Connected, Kruskal}
var(fs2, gpl) = {Directed, Search, DFS, Number, Cycle}
var(fs7, gpl) = {Undirected, Weight, Search, DFS, Connected, Number, Cycle}

An example is the similarity value between feature sets fs0 and fs2, that is, Sim(fs0, fs2, gpl) = 0/8 = 0.0. The value is zero because those two feature sets do not have any selected variable features in common. Now consider Sim(fs1, fs7, gpl) = 5/8 = 0.625, which yields a high value because those feature sets have the majority of their selected features in common.

To illustrate the Tuple Frequency metric, let us assume that the set of feature sets in Table 1 is a 2-wise covering array of GPL denoted as tCAgpl.7 For example, the 2-set ts0 = [{Driver},{Prim}] is covered by feature sets fs1, fs2, fs3, fs4, fs6, and fs7. Thus, its frequency is equal to occurrence(ts0, tCAgpl)/8 = 6/8 = 0.75. As another example, the 2-set ts1 = [{Kruskal,DFS}, ∅] is covered by feature set fs1. Thus its frequency is equal to occurrence(ts1, tCAgpl)/8 = 1/8 = 0.125.

6 Evaluation

For the comparison of CASA, PGS and ICPL we created two groups of feature models based on their provenance. In this section, we describe the salient characteristics of both groups, their evaluation with the comparison framework of Section 5, and a thorough comparison analysis of the three algorithms. All the data and code used in our analysis are available at [17].

6.1 Feature Model Selection

Open Source Projects. The first group of feature models comes from existing SPLs whose feature models and code implementations are readily available as open source projects. Some of them were reverse engineered from single systems and some were developed as SPLs from the outset. We selected the feature models for this group from the SPL Conqueror project [23]. The goal of this project is to provide reliable estimates of measurable non-functional properties such as performance, main memory consumption, and footprint. Table 2 summarizes the 13 feature models of this group. The rationale behind the creation of this group is that they represent real case studies, used in both academic and industrial environments.

Feature Models Repository. The second group of selected feature models comes from the SPLOT repository [24]. This repository contains a large number of feature models that are not necessarily associated with open source projects and that have been made available to the product line community carrying out research on the formal analysis of and reasoning on feature models. From this source, we selected 168 feature models with: i) a number of features ranging from 10 to 67, with a median of 16.5 and an interquartile range of 10, and ii) a number of products between 14 and 73,728, with a median of 131.5 and an interquartile range of 535.25. The rationale behind the creation of this group was to assess the framework on a larger and more diverse set of feature models.

7 There are 24 2-wise pairs, out of the 418 pairs that GPL contains, which are not covered.

SPL Name        NF   NP      Description
Apache          10   256     web server
BerkeleyDBF      9   256     database
BerkeleyDBM      9   3840    database
BerkeleyDBP     27   1440    database
Curl            14   1024    command line
LinkedList      26   1440    data structures
LLVM            12   1024    compiler tool
PKJab           12   72      messenger
Prevayler        6   32      persistence
SensorNetwork   27   16704   simulation
Wget            17   8192    file retrieval
x264            17   2048    video encoding
ZipMe            8   64      compression library
NF: Number of Features, NP: Number of Products, Description: SPL domain

Table 2. Open SPL Projects Characteristics.

Figure 2 depicts the feature models selected, based on their number of features and products. Notice that the filled circles correspond to the open source projects.

Fig. 2. Summary of analysed feature models: number of products versus number of features, for the SPLOT and SPL Conqueror examples.

6.2 Experimental Set Up

The three algorithms, CASA, PGS and ICPL, are non-deterministic. For this reason we performed 30 independent runs for a meaningful statistical analysis. All the executions were run in a cluster of 16 machines with Intel Core2 Quad processors Q9400 (4 cores per processor) at 2.66 GHz and 4 GB memory, running Ubuntu 12.04.1 LTS and managed by the HT Condor 7.8.4 cluster manager. Since we have 3 algorithms and 181 feature models, the total number of independent runs is 3 · 181 · 30 = 16,290. Once we obtained the resulting test suites, we applied the metrics defined in Section 5 and report summary statistics of these metrics. In order to check whether the differences between the algorithms are statistically significant or just a matter of chance, we applied the Wilcoxon rank-sum test [25] and highlight in the tables the differences that are statistically significant.
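A minimal sketch of such a significance check with SciPy's Wilcoxon rank-sum test; the two samples below are invented stand-ins for 30 per-run test-suite sizes of two algorithms on one model, not real study data:

from scipy.stats import ranksums

# Hypothetical per-run test-suite sizes (30 runs each); illustrative only.
casa_sizes = [6, 6, 7, 6, 6, 7, 6, 6, 6, 7] * 3
icpl_sizes = [8, 8, 8, 8, 8, 7, 8, 8, 8, 8] * 3

statistic, p_value = ranksums(casa_sizes, icpl_sizes)
print(f"W = {statistic:.2f}, p = {p_value:.4f}, significant = {p_value < 0.05}")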

6.3 Analysis

The first thing we noticed is the inadequacy of the metric Tuple Frequency for our comparison purposes. Its definition, as shown in Equation (3), applies on a per-tuple basis. In other words, given a test suite, we should compute this metric for each tuple and provide the histogram of the tuple frequencies. This is what the authors of [14] do. This means we would have to show 16,290 histograms, one per test suite we generated. Obviously, this is not viable, so we considered alternatives to summarize all this information. The first idea, also present in [14], is to compute the average of the tuple frequencies in a test suite taking into account all the tuples. Unfortunately, we found that this average says nothing about the test suite: we can prove, using counting arguments, that it depends only on the number of features and the number of valid tuples of the model, so the average is the same for all the test suites associated with a feature model. The reader can find the formal proof in the paper's additional resources [17]. In the following we omit any information related to the Tuple Frequency metric and defer to future work the issue of how to summarize it. We computed the remaining three metrics on all the independent runs of the three algorithms.

In order to answer RQ1 we calculated Spearman's rank correlation coefficient for each pair of metrics. Table 3 shows the results obtained, plus the correlation values with the number of products and the number of features of the feature models (first two columns and rows of the table).

              Products   Features   Size     Performance   Similarity
Products       1          0.591      0.593    0.229         -0.127
Features       0.591      1          0.545    0.283         -0.159
Size           0.593      0.545      1        0.278         -0.371
Performance    0.229      0.283      0.278    1              0.168
Similarity    -0.127     -0.159     -0.371    0.168          1

Table 3. Spearman's correlation coefficients of all models and algorithms.
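The rank correlations of Table 3 can be computed with SciPy's spearmanr; the arrays below are illustrative placeholders for the per-run metric values, not the study data:

from scipy.stats import spearmanr

# Illustrative placeholders: one entry per (model, algorithm, run) combination.
test_suite_sizes = [6, 8, 8, 30, 31, 31, 10, 12, 13]
similarities     = [0.36, 0.30, 0.35, 0.35, 0.27, 0.37, 0.37, 0.36, 0.38]

rho, p_value = spearmanr(test_suite_sizes, similarities)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.4f})")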

We can observe a positive and relatively high correlation among the number of products, the number of features, and the test suite size. This is somewhat expected because the number of valid products is expected to increase when more features are added to a feature model. In the same sense, more features usually imply more combinations of features that must be covered by the test suite, and this usually means that more test cases must be added.

Regarding performance, we expect the algorithms to take more time to generate the test suites for larger models. The positive correlation between the performance and the three previous size-related measures (products, features and size) supports this idea. However, the value is too low (less than 0.3) to clearly claim that larger models require more computation time.

The correlation coefficient between the similarity metric and the rest of the metrics is low and changes sign, which means that there is no correlation. We would expect larger test suites to contain more similar products (a positive correlation with size), but this is not the case in general. In our opinion, the justification for this somewhat unexpected result lies in the definition of the similarity metric. In Equation (1) the similarity between two products does not take into account the features that are not present in either product, which we would argue is also a source of similarity. In other words, two products are similar not only due to their shared selected features, but also due to their shared unselected features. Thus, our conclusion regarding this metric is that it is not well defined, and the intuition on how it should behave does not hold, as evidenced in Table 3.
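To make this critique concrete, one possible variant (our own sketch, not part of the framework) also credits agreement on unselected variable features, which amounts to a simple matching coefficient over the variant features:

def sim_with_unselected(fs, gs, variant_features):
    """Fraction of variant features on which two products agree, counting both
    shared selected and shared unselected features; a sketch of the alternative
    suggested by the discussion above."""
    agree = sum(1 for f in variant_features if (f in fs[0]) == (f in gs[0]))
    return agree / len(variant_features) if variant_features else 1.0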

Let us now analyse the metrics results grouped by algorithm to address RQ2. In Table 4 we show the results for the models in the SPLOT repository and in Table 5 we show the results for the 13 open source models. In these two tables we highlight with dark gray the results that are better, with a statistically significant difference, than those of the other two algorithms, and with light gray the results that are better than those of only one of the other algorithms.

Algorithm   Size     Performance   Similarity
CASA         9.31      3617.26       0.3581
ICPL        11.06       304.11       0.3309
PGS         10.88     77245.21       0.3635

Table 4. Average of the metrics computed on the test suites generated by the three algorithms for the models in SPLOT.

The results for the SPLOT models reveal that CASA is the best algorithm regarding the size of the test suite (with a statistically significant difference) and PGS is the second best algorithm. If we focus on computation time, ICPL is the clear winner, followed by CASA. PGS is outperformed by CASA in both test suite size and computation time. Regarding the similarity metric, ICPL is the algorithm providing the most dissimilar products and CASA is the second one. This is somewhat counterintuitive: ICPL is the algorithm providing the largest test suites, and we would expect a solution with more products to have more similar products because the number of features in a feature model is finite. Thus, we would expect higher similarity between the products of larger test suites. The results of ICPL, however, contradict this idea. In our opinion the reason is, again, that the similarity measure is not well defined, because it only considers the features that appear in a product and not the ones that do not appear in both products, even though the t-tuples with unselected features must all be covered as well.

Model           Algorithm   Size     Performance   Similarity
Apache          CASA         6.00       550.00       0.3614
                ICPL         8.00       217.70       0.3044
                PGS          8.07     20139.17       0.3482
BerkeleyDBF     CASA         6.00       583.33       0.3660
                ICPL         7.00       169.63       0.3232
                PGS          8.13     20007.07       0.3485
BerkeleyDBM     CASA        30.00      7600.00       0.3456
                ICPL        31.00       511.70       0.2684
                PGS         30.87    188908.70       0.3731
BerkeleyDBP     CASA         9.80      3500.00       0.3904
                ICPL        10.00       365.83       0.3576
                PGS         11.37     51854.07       0.3796
SensorNetwork   CASA        10.67      1650.00       0.3748
                ICPL        13.00       431.43       0.3630
                PGS         13.00     69371.53       0.3806
Prevayler       CASA         6.00       533.33       0.3518
                ICPL         8.00       153.10       0.2677
                PGS          6.70     20138.13       0.3690
LinkedList      CASA        12.03      2533.33       0.4055
                ICPL        14.00       451.43       0.3996
                PGS         14.87     75813.10       0.4081
Curl            CASA         8.00       933.33       0.3527
                ICPL        12.00       276.37       0.2634
                PGS         12.07     42065.53       0.3541
LLVM            CASA         6.00       583.33       0.3620
                ICPL         9.00       228.77       0.2320
                PGS          8.80     26453.40       0.3622
PKJab           CASA         6.00       616.67       0.3726
                ICPL         7.00       195.80       0.3439
                PGS          7.80     24700.60       0.3675
Wget            CASA         9.00       733.33       0.3560
                ICPL        12.00       343.33       0.2685
                PGS         12.40     51234.40       0.3499
x264            CASA        16.00      2966.67       0.3510
                ICPL        17.00       343.20       0.2568
                PGS         16.90     70243.93       0.3668
ZipMe           CASA         6.00       533.33       0.3654
                ICPL         7.00       160.90       0.3428
                PGS          7.33     23574.37       0.3781

Table 5. Average of the metrics computed on the test suites generated by the three algorithms for the 13 open source models.

Let us finally analyse the results of the algorithms for the open source models. Regarding the test suite size, we again observe that CASA obtains the best results (smaller test suites). The second best in this case is usually ICPL (with the only exception of Prevayler). Regarding the computation time, ICPL is the fastest algorithm, followed by CASA, in all the models. PGS is the slowest algorithm. If we observe the similarity metric the trend is also clear: ICPL obtains the lowest similarity measures in all the models. In summary, the results obtained for the open source models support the same conclusions obtained for the SPLOT models: CASA obtains the smallest test suites, ICPL is the fastest algorithm and it also obtains the most dissimilar products.

7 Related Work

In the area of Search-Based Software Engineering a major research focus has been software testing [26,27]. A recent overview by McMinn highlights the major achievements made in the area and some of the open questions and challenges that remain [28]. There exists substantial literature on SPL testing [3–6]; however, except for a few efforts [9,13], the application of search based techniques to SPLs remains largely unexplored. A related work on computing covering arrays with a simulated annealing approach is presented by Torres-Jimenez et al. [29]. They propose improvements over the state-of-the-art techniques for Constrained Combinatorial Interaction Testing (CCIT), an extension of CIT that also considers constraints on the inputs.

In the search based literature, there exists a plethora of articles that compare testing algorithms using different metrics. For example, Mansour et al. [30] compare five algorithms for regression testing using eight different metrics (including quantitative and qualitative criteria). Similarly, Uyar et al. [31] compare different metrics implemented as fitness functions to solve the problem of test input generation in software testing. To the best of our knowledge, in the literature on test case generation there is no well-known comparison framework for the research and practitioner community to use. Researchers usually apply their methods to open source programs and directly compute some metrics such as the success rate, the number of test cases, and performance. The closest to a common comparison framework can be traced back to the work of Rothermel and Harrold [32], where they propose a framework for regression testing.

8 Conclusions and Future Work

In this paper, we make an assessment of the comparison framework for pairwise testing of SPLs proposed by Perrouin et al. on two search based algorithms (CASA and PGS) and one greedy approach (ICPL). We used a total of 181 feature models from two different provenance sources. We identified two shortcomings of the two novel metrics of this framework: similarity does not consider features that are not selected in a product, and tuple frequency is applicable on a per-tuple basis only. Overall, CASA obtains the smallest test suites, ICPL is the fastest algorithm and it also obtains the most dissimilar products.

We plan to expand our work by considering other testing algorithms and larger feature models, but, most importantly, by researching other metrics that could provide more useful and discerning criteria for algorithm selection. For this latter goal we will follow the guidelines for metrics selection suggested in [33]. In addition, we want to study how to enhance the effectiveness of covering arrays by considering information from the associated source code, along the lines of Shi et al. [34].

Acknowledgments

This research is partially funded by the Austrian Science Fund (FWF) project P21321-N15 and Lise Meitner Fellowship M1421-N15, and by the Spanish Ministry of Economy and Competitiveness and FEDER under contract TIN2011-28194 (roadME project).


References

1. Pohl, K., Bockle, G., van der Linden, F.J.: Software Product Line Engineering: Foundations, Principles and Techniques. Springer (2005)
2. Svahnberg, M., van Gurp, J., Bosch, J.: A taxonomy of variability realization techniques. Softw., Pract. Exper. 35(8) (2005) 705–754
3. Engstrom, E., Runeson, P.: Software product line testing - a systematic mapping study. Information & Software Technology 53(1) (2011) 2–13
4. da Mota Silveira Neto, P.A., do Carmo Machado, I., McGregor, J.D., de Almeida, E.S., de Lemos Meira, S.R.: A systematic mapping study of software product lines testing. Information & Software Technology 53(5) (2011) 407–423
5. Lee, J., Kang, S., Lee, D.: A survey on software product line testing. In: 16th International Software Product Line Conference. (2012) 31–40
6. do Carmo Machado, I., McGregor, J.D., de Almeida, E.S.: Strategies for testing products in software product lines. ACM SIGSOFT Software Engineering Notes 37(6) (2012) 1–8
7. Perrouin, G., Sen, S., Klein, J., Baudry, B., Traon, Y.L.: Automated and scalable t-wise test case generation strategies for software product lines. In: ICST, IEEE Computer Society (2010) 459–468
8. Oster, S., Markert, F., Ritter, P.: Automated incremental pairwise testing of software product lines. In Bosch, J., Lee, J., eds.: SPLC. Volume 6287 of Lecture Notes in Computer Science., Springer (2010) 196–210
9. Garvin, B.J., Cohen, M.B., Dwyer, M.B.: Evaluating improvements to a metaheuristic search for constrained interaction testing. Empirical Software Engineering 16(1) (2011) 61–102
10. Hervieu, A., Baudry, B., Gotlieb, A.: Pacogen: Automatic generation of pairwise test configurations from feature models. In Dohi, T., Cukic, B., eds.: ISSRE, IEEE (2011) 120–129
11. Lochau, M., Oster, S., Goltz, U., Schurr, A.: Model-based pairwise testing for feature interaction coverage in software product line engineering. Software Quality Journal 20(3-4) (2012) 567–604
12. Cichos, H., Oster, S., Lochau, M., Schurr, A.: Model-based coverage-driven test suite generation for software product lines. In: Proceedings of Model Driven Engineering Languages and Systems. (2011) 425–439
13. Ensan, F., Bagheri, E., Gasevic, D.: Evolutionary search-based test generation for software product line feature models. In Ralyte, J., Franch, X., Brinkkemper, S., Wrycza, S., eds.: CAiSE. Volume 7328 of Lecture Notes in Computer Science., Springer (2012) 613–628
14. Perrouin, G., Oster, S., Sen, S., Klein, J., Baudry, B., Traon, Y.L.: Pairwise testing for software product lines: comparison of two approaches. Software Quality Journal 20(3-4) (2012) 605–643
15. Kang, K., Cohen, S., Hess, J., Novak, W., Peterson, A.: Feature-Oriented Domain Analysis (FODA) Feasibility Study. Technical Report CMU/SEI-90-TR-21, Software Engineering Institute, Carnegie Mellon University (1990)
16. Lopez-Herrejon, R.E., Batory, D.S.: A standard problem for evaluating product-line methodologies. In Bosch, J., ed.: GCSE. Volume 2186 of Lecture Notes in Computer Science., Springer (2001) 10–24
17. Lopez-Herrejon, R.E., Ferrer, J., Haslinger, E.N., Chicano, F., Egyed, A., Alba, E.: Paper code and data repository (2013) http://neo.lcc.uma.es/staff/javi/resources.html


18. Benavides, D., Segura, S., Cortes, A.R.: Automated analysis of feature models 20 years later: A literature review. Inf. Syst. 35(6) (2010) 615–636
19. Cohen, M.B., Dwyer, M.B., Shi, J.: Constructing interaction test suites for highly-configurable systems in the presence of constraints: A greedy approach. IEEE Trans. Software Eng. 34(5) (2008) 633–650
20. Johansen, M.F., Haugen, Ø., Fleurey, F.: An algorithm for generating t-wise covering arrays from large feature models. In: 16th International Software Product Line Conference. (2012) 46–55
21. Ferrer, J., Kruse, P.M., Chicano, J.F., Alba, E.: Evolutionary algorithm for prioritized pairwise test data generation. In Soule, T., Moore, J.H., eds.: GECCO, ACM (2012) 1213–1220
22. Durillo, J.J., Nebro, A.J.: jMetal: A Java framework for multi-objective optimization. Advances in Engineering Software 42(10) (2011) 760–771
23. Siegmund, N., Rosenmuller, M., Kastner, C., Giarrusso, P.G., Apel, S., Kolesnikov, S.S.: Scalable prediction of non-functional properties in software product lines: Footprint and memory consumption. Information & Software Technology 55(3) (2013) 491–507
24. SPLOT: Software Product Line Online Tools (2013) http://www.splot-research.org
25. Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. 4th edition, Chapman & Hall/CRC (2007)
26. Harman, M., Mansouri, S.A., Zhang, Y.: Search-based software engineering: Trends, techniques and applications. ACM Comput. Surv. 45(1) (2012) 11
27. de Freitas, F.G., de Souza, J.T.: Ten years of search based software engineering: A bibliometric analysis. In Cohen, M.B., Cinneide, M.O., eds.: SSBSE. Volume 6956 of Lecture Notes in Computer Science., Springer (2011) 18–32
28. McMinn, P.: Search-based software testing: Past, present and future. In: ICST Workshops, IEEE Computer Society (2011) 153–163
29. Torres-Jimenez, J., Rodriguez-Tello, E.: New bounds for binary covering arrays using simulated annealing. Information Sciences 185 (2012) 137–152
30. Mansour, N., Bahsoon, R., Baradhi, G.: Empirical comparison of regression test selection algorithms. Journal of Systems and Software 57 (2001) 79–90
31. Uyar, H.T., Uyar, A.S., Harmanci, E.: Pairwise sequence comparison for fitness evaluation in evolutionary structural software testing. In: Genetic and Evolutionary Computation Conference. (2006) 1959–1960
32. Rothermel, G., Harrold, M.J.: A framework for evaluating regression test selection techniques. In: Proceedings of the 16th International Conference on Software Engineering. (1994) 201–210
33. Meneely, A., Smith, B.H., Williams, L.: Validating software metrics: A spectrum of philosophies. ACM Trans. Softw. Eng. Methodol. 21(4) (2012) 24
34. Shi, J., Cohen, M.B., Dwyer, M.B.: Integration testing of software product lines using compositional symbolic execution. In de Lara, J., Zisman, A., eds.: FASE. Volume 7212 of Lecture Notes in Computer Science., Springer (2012) 270–284
35. de Almeida, E.S., Schwanninger, C., Benavides, D., eds.: 16th International Software Product Line Conference, SPLC '12, Salvador, Brazil, September 2-7, 2012, Volume 1. ACM (2012)
36. Whittle, J., Clark, T., Kuhne, T., eds.: Model Driven Engineering Languages and Systems, 14th International Conference, MODELS 2011, Wellington, New Zealand, October 16-21, 2011, Proceedings. Volume 6981 of Lecture Notes in Computer Science, Springer (2011)