Testing software to detect and reduce risk

Phyllis G. Frankl a,1, Elaine J. Weyuker b,*

a Computer Science Department, Polytechnic University, 6 Metrotech Center, Brooklyn, NY 11201, USA
b AT&T Labs – Research, Room E237, 180 Park Ave., Florham Park, NJ 07932, USA

Received 1 December 1999; accepted 1 December 1999

Abstract

The risk of a piece of software is defined to be the expected cost due to operational failures of the software. Notions of the risk detected by a testing technique and the risk reduction due to a technique are introduced and are used to analytically compare the effectiveness of testing techniques. It is proved that if a certain relation holds between testing techniques A and B, then A is guaranteed to be at least as good as B at detecting and reducing risk, regardless of the particular faults in the program under test or their costs. These results can help practitioners choose an appropriate technique for testing software when risk reduction is the goal. © 2000 Elsevier Science Inc. All rights reserved.

Keywords: Fault detection; Program testing; Software testing; Software risk

1. Introduction

Software risk is usually defined to be the expected loss attributable to failures in a given piece of software (Boehm, 1989; Gutjahr, 1995; Hall, 1998; Leveson, 1995; Sherer, 1992). It is typically defined to be the product of the probability of failures occurring and the expected loss attributable to such failures. An interesting question to consider is: "What is the role of a test data selection or adequacy criterion in the assessment of risk?" Should we be able to predict the risk associated with a software system based on having information about how the system was tested? Intuitively speaking, if a system has been comprehensively tested, there should be less risk associated with its use than if it has been only lightly tested. Thus, we would like to be able to compare testing strategies in a way that allows us to say that if a system has been tested using criterion $C_1$, it is likely to have less risk associated with its use than if it has been tested using criterion $C_2$.

Most previous evaluations of the effectiveness of software testing techniques have considered all failures to be equivalent to one another. They have employed such measures of test effectiveness as the likelihood of discovering at least one fault (i.e., the likelihood of at least one failure occurring), the expected number of failures that occur during test, the number of seeded faults discovered during test, and the mean time until the first failure, or between failures. In this context, a failure is defined to be any deviation between the actual output and the specified output for a given input.

In practice, some failures may represent inconsequential deviations from the specification, while others are more severe, and some may even be catastrophic. Therefore, in evaluating the risk associated with a program, one must distinguish between different failures in accordance with their importance. To do so, we associate a cost with each failure.

Previous work by Weyuker (1996), Tsoukalas et al. (1993), Ntafos (1997) and Gutjahr (1995) has incorporated cost into the evaluation of testing techniques. Weyuker used cost or consequence of failure as the basis for an automatic test case generation algorithm, and to assess the reliability of the software that had been tested using this algorithm. Tsoukalas et al. and Ntafos analytically compared random testing and partition testing strategies when cost was taken into account, while Gutjahr derived a test distribution that would result in minimum variance for an unbiased estimator of risk.²

The Journal of Systems and Software 53 (2000) 275–286
www.elsevier.com/locate/jss

* Corresponding author. E-mail addresses: [email protected] (P.G. Frankl), [email protected] (E.J. Weyuker).
1 Supported in part by NSF Grant CCR-9870270.
2 Gutjahr allowed the cost associated with a subdomain to be a random variable whose distribution was determined a priori; he also considered the special case for which each subdomain consists of a single element, as well as the more general partition testing situation.



In the work of Gutjahr, Tsoukalas et al. and Ntafos, the input domain is divided, using some partition testing strategy, and a cost $c_i$ is associated a priori with each subdomain. A failure of any element of the i-th subdomain is assumed to cost $c_i$. It may be reasonable under some circumstances to associate costs of failures with subdomains, for example, if subdomains correspond to distinct functions that the software can perform and costs can be ascribed to the failure of each of these functions, and each input causes the execution of exactly one of the functions. However, there are many other situations in which this is not a realistic approximation of reality. In general, for the subdomains induced by the testing strategies commonly studied and in use, the consequences of failure for two different elements of the same subdomain may be very different. Furthermore, most subdomain testing strategies involve subdomains that intersect. For example, each input typically exercises many of the functions that have been identified in functional testing. In the above scheme, any subdomains that overlapped would have to have the same cost, or the input space would have to be further subdivided to insure that this is the case.

In this paper we explore testing to evaluate software risk in a broader setting. In general, a piece of software will have some (possibly large) finite number of "failure modes" which may or may not be known in advance. These failure modes represent deviations from the specification that the system analysts or users view as being in some sense "equivalent" to one another. For example, any failure that results from misspelling a word in the output might be considered equivalent, while a failure that results in the outputting of the wrong numerical value could be considered to have considerably more severe consequences. Even here, differences in the magnitude of the numerical discrepancy might lead to different failure modes.

It may or may not be possible to associate, a priori, a fixed cost or a cost as a function of input with each failure mode. For example, the cost of outputting the wrong numerical value might be fixed or might depend on how far the incorrect value is from the correct one. Weiss and Weyuker (1998) introduced a domain-based definition of software reliability that incorporated the discrepancy between specified and computed values of outputs. This work provided the motivation for the later work incorporating cost described by Gutjahr (1995) and Weyuker (1996).

Even if one could identify all of the failure modes of interest a priori and assign costs to them, this information could not be easily used to assign costs to input elements. To do so, it would be necessary to derive the set of inputs that result in a given failure mode. Since this is equivalent to being able to determine the sets of inputs that compute a given value, and therefore can be reduced from the question of whether or not a given input ever causes the program to halt, this question is undecidable in the sense that there can be no algorithm to make this determination (Davis et al., 1994).

In spite of the lack of prior knowledge of failure modes, their costs, or which inputs correspond to which failure modes, it is nevertheless possible to learn something about risk from testing. In this paper, we will assume only that when a failure is observed during testing, the cost of that failure can be determined. This is a generalization of the well-known oracle assumption, which assumes there is a way to determine whether or not a test output agrees with the specified value. Here, we assume further that the oracle is able to determine the cost of the particular kind of failure. This might be a realistic assumption if, for example, a human was conducting the testing process and could provide this information. Note that we are assuming not only that the program is deterministic, so that a given input always succeeds or always fails, but also that whenever a given input causes a failure, it does so in the same way, with the same cost.³

Different goals of testing to assess risk can be distinguished:

· Testing to detect risk: In addition to counting the number of failures that occur during testing, one keeps track of the cost of those failures. Testing technique A will be considered more effective than testing technique B if the (expected) total cost of failures detected during test is higher for A than for B.

· Testing and debugging to reduce risk: It is further assumed that each failure that occurs leads to the correction of the fault that caused that failure, thereby reducing the risk associated with the corrected software. Testing technique A will be considered more effective than testing technique B if A reduces the risk more than B does, thus resulting in less risky software. The distinction between this goal and the above goal is discussed in more detail in Section 4.

· Testing to estimate risk: In estimating software reliability, it is assumed that some faults will remain in the software. The goal is to estimate the probability that the software will fail after deployment (during some specified time). Previous work by Gutjahr and Tsoukalas et al. generalized this to the estimation of the expected cost due to failures after deployment, i.e. estimating risk. Here, we will say that testing technique A is better than testing technique B (for a given technique of estimating risk) if A provides more accurate estimates of risk than B.

3 This is not always a realistic assumption, as the cost of a particular failure mode may depend on the circumstances under which it occurs; for example, system down-time might be more acceptable at 2 a.m. on Sunday than during peak business hours since the failure cost might involve the number of users impacted. In addition, it may depend on the types and frequencies of failures that precede it.


Risk is computed relative to a particular piece of software, its specification, its usage profile (which helps determine the probability of failure), and its operational environment (which affects the consequence of a particular failure). We are interested in comparing testing criteria according to their ability to detect and/or reduce risk and would like to draw conclusions about the relative merits of the criteria that hold for all software, specifications, environments and usage profiles. Consequently, we should not expect to be able to say things of a numerical nature, such as, "Testing with criterion $C_1$ will result in 50% more risk reduction than testing with criterion $C_2$". Instead, we must be content with conclusions that allow us to make statements of a relative nature, such as "Testing with $C_1$ will reduce risk at least as much as testing with $C_2$". We focus on this type of comparison of testing criteria in this paper.

In particular, this paper compares various subdomain testing techniques to one another according to how effective they are at risk detection and risk reduction. We show that if a certain relation holds between criterion A and criterion B, then criterion A is guaranteed to be at least as good as criterion B at both detecting and reducing risk.

2. Background

In this section, we will provide needed concepts and terminology. Many of the definitions were introduced in Frankl and Weyuker (1993a) or Frankl and Weyuker (1993b), in which we investigated ways of comparing the fault-detecting ability of software test data adequacy criteria so that one can say in a concrete way that one testing method is better than another. We now continue this study by investigating how software risk can be incorporated into this comparison. An important motivation of our earlier work was to improve upon the small amount of analytical work that had been done to assess the relative efficacy of different proposed testing methods. Prior to these papers, most comparisons were based on subsumption, where criterion $C_1$ is said to subsume criterion $C_2$ if for every program P, every test suite satisfying $C_1$ also satisfies $C_2$.

As pointed out by several research groups including Frankl and Weyuker (1993a), Frankl and Weyuker (1993b), Hamlet (1989), Weiss (1989), and Weyuker et al. (1991), subsumption is not necessarily the ideal way to compare testing strategies. One problem with subsumption is that it is sometimes possible to construct a test suite that satisfies $C_2$ and detects the presence of a fault while another test suite that satisfies $C_1$ does not detect it, even when criterion $C_1$ subsumes criterion $C_2$. The problem is that there are usually many different test suites satisfying a given test data adequacy criterion, and generally no real guidance on how to select a "good" one. Therefore, rather than considering whether it is possible for an adequate test suite to select a test case that fails, in Frankl and Weyuker (1993a) and Frankl and Weyuker (1993b) we explored whether test suites generated to satisfy a given criterion are likely to include test cases that fail. For this reason, we used probabilistic ways of comparing test data adequacy criteria that assessed criteria based on the likelihood of selecting at least one test case that fails or the expected number of failures that will occur during testing, and in that way considered one criterion to be at least as good as another. That analysis was based on investigating how software testing criteria divide a program's input domain into subsets, or subdomains.

More precisely, in Frankl and Weyuker (1993a) we introduced the properly covers relation between criteria and showed it to be "stronger than" subsumption. We proved that if $C_1$ properly covers $C_2$, then when one test case is independently randomly selected from each subdomain using a uniform distribution, the probability that $C_1$ contains at least one test case that fails is guaranteed to be greater than or equal to the probability that $C_2$ contains at least one test case that fails. We then proved in Frankl and Weyuker (1993b) that if $C_1$ properly covers $C_2$, then when one test case is independently randomly selected from each subdomain using a uniform distribution, the expected number of failures detected by $C_1$ is guaranteed to be greater than or equal to the expected number of failures detected by $C_2$. Given that these results were proved using a model of testing that is a reasonable approximation of reality, these are powerful results allowing us to make concrete what we mean when we say that criterion $C_1$ is at least as good as criterion $C_2$. We will provide formal definitions of relevant relations in Section 2.3, as well as a formal statement of our primary earlier results. This will allow us to build upon these results in our study of relationships between testing strategies and software risk assessment.

We used these results as a way of guaranteeing that a criterion $C_1$ was at least as good as criterion $C_2$ for testing any program in a large class of programs. This was done independent of the particular faults occurring in the program. In addition, we showed that if $C_1$ did not properly cover $C_2$ for some program, then even if $C_2$ was subsumed by $C_1$, it was possible for $C_2$ to be more likely than $C_1$ to expose a fault in the program. In Frankl and Weyuker (1993b), we used the above results to investigate the relative failure-exposing ability of several well-known testing techniques.

In this paper we will extend this investigation by considering what we can say precisely about the risk detected when a program is tested using criterion $C_1$ compared to the risk if it is tested using criterion $C_2$, given that $C_1$ properly covers $C_2$.


2.1. Terminology

A multi-set is a collection of objects. In contrast to a set, a multi-set may contain duplicates. Multi-sets will be delimited by curly braces, and set-theoretic operator symbols will be used to denote the corresponding multi-set operators. A multi-set $S_1$ is a sub-multi-set of multi-set $S_2$ provided there are at least as many copies of each element of $S_1$ in $S_2$ as there are in $S_1$.

The set of possible inputs to a program is known as its input domain. We assume that the program computes a partial function on its domain, i.e. that on a given input, the program will always produce the same output or will always fail to terminate. This assumption does not hold for programs whose results depend on the state of their environment, but it can be made to hold by considering all relevant aspects of the environment to be components of the input domain. Although we place no bound on the input domain size, we do restrict attention to programs with finite input domains. We do not consider this to be an unrealistic restriction since all programs run on machines with finite word sizes and with finite amounts of memory. A test suite is a multi-set of test cases, each of which is an element of the input domain. We often speak of test suites rather than test sets because it is sometimes pragmatically useful to permit some duplication of test cases.

A test data adequacy criterion is a relation $C \subseteq \text{Programs} \times \text{Specifications} \times \text{Test Suites}$ used for determining whether program P has been "thoroughly" tested by test suite T relative to specification S. If $C(P,S,T)$ holds, we say that T is adequate for testing P with respect to S according to C, or that T is C-adequate for P and S.

There are two primary uses for adequacy criteria: as a test suite evaluation method and as the basis for test case selection strategies. In the first case, the adequacy criterion is used to determine whether or not a test suite is sufficiently robust to consider the testing phase to be complete once all of the test cases in the suite have been run. In this case, the way the test suite is constructed is irrelevant and may be independent of the adequacy criterion, and all test case selection is completed before the adequacy criterion is applied. If the test suite does not satisfy the adequacy criterion, test case selection resumes and, after some additional test cases have been added, the adequacy criterion is again applied. In the second case, $n_i$ test cases are selected to satisfy the i-th requirement determined by the adequacy criterion. (Usually $n_i = 1$.) In this case the adequacy criterion is explicitly being used to help construct the test suite.

Consider, for example, the statement coverage adequacy criterion, which requires that each statement in the program be executed by some test case. In the first case, one would select test cases by some independent means, then check that every statement has been executed, adding more test cases, if necessary. In the second approach, the inputs that exercise each statement would be determined and one (or several) test case would be selected for each statement. The results in this paper are based on the second approach to using an adequacy criterion. In practice, some hybrid of the two approaches is often used.

We will focus on subdomain-based testing approaches, namely those that divide the input domain into subsets called subdomains, and then require the selection of one or more elements from each subdomain. The multi-set of subdomains for criterion C, program P and specification S will be denoted by $SD_C(P,S)$. The input domain may be subdivided based on the program structure of the software under test (program-based testing), the structure or semantics of the specification (specification-based testing), or some combination of the two. Such strategies have sometimes been referred to as partition testing strategies, but since in practice such strategies often divide the input domain into overlapping subdomains, they do not form true partitions of the input domain in the mathematical sense.

The operational profile or operational distribution is a probability distribution that associates with each element of the input domain the probability of its occurrence when the software is operational in the field. Thus, if Q is the operational distribution of a program, $Q(t)$ is the probability of input t occurring in a single execution of the program in its operational environment.

An input t is said to be failure-causing for a given program P and specification S if the output produced by P on input t does not agree with the output specified by S. We will sometimes speak of a test suite T detecting a fault in a program. We will mean by this that there is at least one failure-causing input in T.

2.2. The model

In Frankl and Weyuker (1993a), our goal was to compare the fault-detecting ability of criteria, and hence we needed a well-defined and realistic model that was not biased in any one criterion's favor. We assumed that test suites were selected to satisfy a subdomain-based criterion C by first dividing the domain based on $SD_C(P,S)$, and then for each subdomain $D_i \in SD_C(P,S)$ randomly selecting an element of $D_i$. We let $d_i = |D_i|$ denote the size of subdomain $D_i$, let $m_i$ be the number of failure-causing inputs in $D_i$, and let

$$M(C,P,S) = 1 - \prod_{i=1}^{n}\left(1 - \frac{m_i}{d_i}\right).$$

If one test case is independently randomly selected from each subdomain according to a uniform distribution, M is the probability that a test suite chosen using this test selection strategy will cause at least one failure to occur.


M has been widely used by a variety of researchers as the basis for the comparison of testing strategies (Duran and Ntafos, 1984; Frankl and Weyuker, 1993a; Hamlet and Taylor, 1990; Weyuker and Jeng, 1991).

We also defined a different measure of a criterion's fault-detecting ability in Frankl and Weyuker (1993b). We again let $SD_C(P,S) = \{D_1, \ldots, D_n\}$, and assumed that one test case was independently randomly selected from each subdomain, based on a uniform distribution, and defined the expected number of failures detected to be:

$$E(C,P,S) = \sum_{i=1}^{n} \frac{m_i}{d_i}.$$
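To make these two measures concrete, the following minimal Python sketch (ours, not the paper's) computes M and E from the subdomain sizes $d_i$ and failure counts $m_i$; the toy numbers are hypothetical:

```python
from math import prod

def m_measure(subdomains):
    """Probability that one uniform draw per subdomain causes at least
    one failure; subdomains is a list of (m_i, d_i) pairs."""
    return 1.0 - prod(1.0 - m / d for m, d in subdomains)

def e_measure(subdomains):
    """Expected number of failing test cases over the same draws."""
    return sum(m / d for m, d in subdomains)

# Three subdomains with 1/10, 0/5 and 2/8 failure-causing inputs.
subs = [(1, 10), (0, 5), (2, 8)]
print(m_measure(subs))  # 1 - 0.9 * 1.0 * 0.75 = 0.325
print(e_measure(subs))  # 0.1 + 0.0 + 0.25 = 0.35
```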

2.3. Testing criteria relations

In our earlier work, we introduced several relations R that could hold among subdomain-based testing criteria, and asked whether $R(C_1,C_2)$ necessarily implies that $M(C_1,P,S) \geq M(C_2,P,S)$ or that $E(C_1,P,S) \geq E(C_2,P,S)$. We first showed that it was possible for $M(C_1,P,S)$ to be less than $M(C_2,P,S)$, even though $C_1$ subsumes $C_2$. This result led to the introduction of a stronger comparison relation that we showed had the desired properties relative to both $M(C,P,S)$ and $E(C,P,S)$.

The following definition appeared in Frankl and Weyuker (1993a):

Definition 1. Let $C_1$ and $C_2$ be criteria. $C_1$ covers $C_2$ for (P,S) if for every subdomain $D \in SD_{C_2}(P,S)$ there is a collection $\{D_1, \ldots, D_n\}$ of subdomains belonging to $SD_{C_1}(P,S)$ such that $D_1 \cup \cdots \cup D_n = D$.

It was shown in Frankl and Weyuker (1993a) that although several well-known testing criteria are related by the covers relation, it was possible for $M(C_1,P,S) < M(C_2,P,S)$ even though $C_1$ covers $C_2$ for $(P,S)$. The reason that this could happen was that one subdomain of $C_1$ could be used to cover two or more subdomains of $C_2$. This led to the introduction of the properly covers relation.

Definition 2. Let $SD_{C_1}(P,S) = \{D_1^1, \ldots, D_m^1\}$, and let $SD_{C_2}(P,S) = \{D_1^2, \ldots, D_n^2\}$. $C_1$ properly covers $C_2$ for $(P,S)$ if there is a multi-set

$$\mathcal{M} = \{D_{1,1}^1, \ldots, D_{1,k_1}^1, \ldots, D_{n,1}^1, \ldots, D_{n,k_n}^1\}$$

such that $\mathcal{M}$ is a sub-multi-set of $SD_{C_1}(P,S)$ and

$$D_1^2 = D_{1,1}^1 \cup \cdots \cup D_{1,k_1}^1, \quad \ldots, \quad D_n^2 = D_{n,1}^1 \cup \cdots \cup D_{n,k_n}^1.$$

Informally, this says that each of $C_2$'s subdomains is "covered" by the union of some of the subdomains of $C_1$, and that no $C_1$ subdomain is used more often in covering the $C_2$ subdomains than its number of occurrences in $SD_{C_1}(P,S)$. Note that it is not the number of subdomains, alone, that determines whether or not one criterion properly covers another. In Frankl and Weyuker (1993a) we proved that:

Theorem 1. If $C_1$ properly covers $C_2$ for program P and specification S, then $M(C_1,P,S) \geq M(C_2,P,S)$.

and in Frankl and Weyuker (1993b) we proved an analogous result:

Theorem 2. If $C_1$ properly covers $C_2$ for program P and specification S, then $E(C_1,P,S) \geq E(C_2,P,S)$.

The thrust of these two theorems was that one could guarantee that a given criterion was better at uncovering failures than another by showing that they were related by the properly covers relation.

2.4. Measuring detected risk

The risk of program P is the expected cost of failure of P in the field on a single input. Let $c(t)$ denote the cost due to deviation (if any) between the output produced by program P on input t, $P(t)$, and the specified output $S(t)$. Cost is actually a function of the program, its specification, and numerous environmental and social factors, as well as the input, but for brevity, we denote it by $c(t)$. Although, as argued above, it may be unrealistic to determine $c(t)$ for all inputs t a priori, it is much more reasonable to assume that these costs can be determined for each test case after executing the test case and observing the failure mode (if any) that results. Our results do not depend on prior knowledge of the values of $c(t)$.

Let $Q(t)$ denote the operational distribution, i.e., the probability that input t is selected when P is operational in the field. Then

$$R(P,S) = \sum_{t \in D} Q(t)\, c(t)$$

is the risk of program P. Equivalent definitions of risk have been used by other authors including Gutjahr (1995), Tsoukalas et al. (1993), and Sherer (1992), and by Weyuker (1996) to weight the operational distribution when selecting test cases depending on both the frequency of occurrence and consequence of failure. Note that this notion of risk is defined relative to a program and how it is to be used in the field, and is independent of the test selection method used. If P is equivalent to S, i.e. if P is correct, then $c(t)$ is 0 for all elements of the domain, and hence the risk $R(P,S) = 0$. This is consistent with one's intuition.
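As a concrete illustration of this definition (a minimal sketch of ours; the dictionaries Q and c are hypothetical toy data), risk is simply the Q-weighted sum of failure costs over the domain:

```python
def risk(domain, Q, c):
    """R(P,S): expected cost of failure on a single run in the field,
    i.e. the sum over the domain of Q(t) * c(t)."""
    return sum(Q[t] * c[t] for t in domain)

# Toy domain 0..3: input 2 fails at cost 10, input 3 at cost 1.
domain = [0, 1, 2, 3]
Q = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}   # operational distribution
c = {0: 0.0, 1: 0.0, 2: 10.0, 3: 1.0}  # failure costs (0 where correct)
print(risk(domain, Q, c))  # 0.2 * 10 + 0.1 * 1 = 2.1
```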


We next define the risk detected by a test suite T for program P, relative to specification S, to be:

$$DR(P,S,T) = \sum_{t \in T} c(t).$$

Here we have defined a notion that is independent of how the program will actually be used, but dependent on how it was tested. Of course, if test cases are selected based on the operational distribution, then the risk detected will give some sort of picture of the risk associated with the program.

The testing techniques we consider are probabilistic in nature. Consequently, we will compare techniques by comparing the expected value of $DR(P,S,T)$, where T ranges over the test suites that could be selected by the given technique:

$$EDR(C,P,S) = E[DR(P,S,T)] = E\left[\sum_{t \in T} c(t)\right] = \sum_{T} \mathrm{Prob}(T) \sum_{t \in T} c(t),$$

where $\mathrm{Prob}(T)$ is the probability that test suite T will be selected by test selection technique C, relative to program P and specification S.
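The sketch below (ours) computes this expectation by brute force, enumerating every test suite that a subdomain-based technique could select, one test case per subdomain; all data are hypothetical:

```python
from itertools import product

def edr_bruteforce(subdomains, dists, c):
    """EDR as defined above: sum over all selectable test suites T of
    Prob(T) * sum of c(t) for t in T, where a suite draws one test
    case per subdomain, independently, according to dists[i]."""
    total = 0.0
    for suite in product(*subdomains):
        prob = 1.0
        for i, t in enumerate(suite):
            prob *= dists[i][t]
        total += prob * sum(c[t] for t in suite)
    return total

subdomains = [[0, 1, 2], [2, 3]]                       # overlapping subdomains
dists = [{0: 0.5, 1: 0.25, 2: 0.25}, {2: 0.5, 3: 0.5}]
c = {0: 0.0, 1: 4.0, 2: 10.0, 3: 1.0}                  # hypothetical costs
print(edr_bruteforce(subdomains, dists, c))  # 0.25*4 + 0.75*10 + 0.5*1 = 9.0
```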

3. Comparing the risk detection ability of testing criteria

Theorems 1 and 2 give conditions under which one testing criterion is guaranteed to be more effective than another according to certain measures of effectiveness. In the remainder of the paper, analogous results for measures of effectiveness that are related to risk are proved. This is done in a somewhat more general setting, loosening the restrictions on how the subdomains are used to guide test data selection.

Theorems 1 and 2 assume that test cases are drawn from each subdomain according to a uniform distribution on that subdomain. Since risk is defined in terms of the operational distribution, the use of a uniform distribution when testing either to detect or reduce risk may be misleading. Surely, if operational distribution data is available, the computed risk will be less accurate than it could be when a uniform distribution is used in lieu of the historical usage data. If there are inputs that lead to high cost failures and that are more likely than average to occur in the field, testing with a uniform distribution may lull the tester into a false sense of security. Conversely, if there are inputs causing costly failures that are very unlikely to occur in the field, testing with a uniform distribution may lead the tester to think the software is more risky than it actually is.

Just as the traditional goal of subdomain testing is to uncover failures and eliminate the underlying faults that caused them, rather than to estimate reliability, the goal of subdomain testing in this context is to detect (and then reduce) risk, rather than to estimate risk. Nevertheless, it may be more informative to perform subdomain testing with a non-uniform distribution on each subdomain, in order to more closely approximate the operational distribution, and thereby detect and remove an amount of risk that is more closely related to the actual risk of the software. Similarly, if prior information is available about which inputs are likely to cause high-consequence failures, one might choose a distribution that gives higher weight to those inputs. Consequently, we consider subdomain testing with distributions on subdomains that are not necessarily uniform.

Let P be a program, S be a specification, D denote the input domain of P relative to S, and F denote the set of failure-causing inputs. Let C be a subdomain-based criterion, and let $SD_C(P,S) = \{D_1, \ldots, D_n\}$ be the corresponding multi-set of subdomains. Let $Pr_1, \ldots, Pr_n$ be probability distributions on $D_1, \ldots, D_n$, respectively. One can select a C-adequate test suite by independently randomly selecting one test case from each $D_i$ according to $Pr_i$. It then follows that

$$h_i = \sum_{t \in D_i \cap F} Pr_i(t)$$

is the probability that a test case selected from subdomain $D_i$ will be a failure-causing input, and hence

$$M(C,P,S,Pr_1,\ldots,Pr_n) = 1 - \prod_{i=1}^{n}(1 - h_i)$$

is the probability that a test suite selected according to this strategy will detect a fault by causing P to fail.
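A direct transcription of these two formulas into Python (our sketch; the subdomains, distributions and failure set are toy assumptions):

```python
from math import prod

def m_general(subdomains, dists, failures):
    """Probability of at least one failure when one test case is drawn
    from each subdomain D_i according to dists[i]:
    h_i = sum of Pr_i(t) over failure-causing t in D_i."""
    h = [sum(dists[i][t] for t in D if t in failures)
         for i, D in enumerate(subdomains)]
    return 1.0 - prod(1.0 - hi for hi in h)

subdomains = [[0, 1, 2], [2, 3]]
dists = [{0: 0.5, 1: 0.25, 2: 0.25}, {2: 0.5, 3: 0.5}]
failures = {2}                 # input 2 is the only failure-causing input
print(m_general(subdomains, dists, failures))  # 1 - 0.75 * 0.5 = 0.625
```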

We begin by noting that Theorem 1 does not generalize to arbitrary test selection strategies of this nature.

Observation 1. Let P be an incorrect program for specification S and let $C_1$ and $C_2$ be subdomain-based criteria. Assume that $0 < M(C_2,P,S)$ and that $M(C_1,P,S) < 1$ (i.e., assume that at least one subdomain induced by $C_2$ includes a failure-causing input and that every subdomain induced by $C_1$ includes at least one non-failure-causing input). Then there exist probability distributions $Pr_1^1, \ldots, Pr_n^1$ and $Pr_1^2, \ldots, Pr_m^2$ such that $M(C_1,P,S,Pr_1^1,\ldots,Pr_n^1) = 0$ and $M(C_2,P,S,Pr_1^2,\ldots,Pr_m^2) = 1$.

To see this, simply let $D_i$ be a $C_2$ subdomain that contains a failure-causing input t, and let $Pr_i^2$ select t with probability 1; any test suite selected according to this strategy is guaranteed to detect a fault. For each i, let $Pr_i^1$ select a non-failure-causing input with probability 1; no test suite selected according to this strategy will detect a fault. Thus, if neither $C$ nor $C'$ is guaranteed to detect a fault and neither is guaranteed to never detect a fault, there are distributions for which $C$ performs better than $C'$ and distributions for which $C'$ performs better than $C$. Analogous problems arise for the expected number of failures, expected risk detection, and expected risk reduction.

The problem occurs because when t belongs to subdomain $D_i^1$ of $C_1$ and to subdomain $D_j^2$ of $C_2$, there is not necessarily any relation between the probability that t is selected to represent $D_i^1$ and the probability that t is selected to represent $D_j^2$. However, as we shall see below, in certain situations that arise naturally in practice, such a relation does exist.

Definition 3. Let Pr be a probability distribution on the input domain of program P, with specification S, let $SD_C(P,S) = \{D_1, \ldots, D_n\}$, and let

$$a_i = \sum_{t \in D_i} Pr(t).$$

Let

$$Pr_{D_i}(t) = \frac{1}{a_i}\, Pr(t).$$

It is easy to verify that $Pr_{D_i}$ is a probability distribution on $D_i$. We will call this distribution the inherited distribution of Pr on $D_i$. Note that $Pr_{D_i}(t)$ is the conditional probability that test case t is selected, given that some test case in $D_i$ is selected.
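A direct implementation of Definition 3 (our sketch; the whole-domain distribution Pr is represented as a dictionary):

```python
def inherited_distribution(subdomain, Pr):
    """Pr_D(t) = Pr(t) / a_i, where a_i is the probability mass
    that Pr places on the subdomain."""
    a = sum(Pr[t] for t in subdomain)
    return {t: Pr[t] / a for t in subdomain}

Pr = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}  # e.g. an operational distribution
print(inherited_distribution([2, 3], Pr))  # {2: 0.666..., 3: 0.333...}
```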

There are several approaches to selecting test cases according to an inherited distribution. If the structure of the subdomains is not too complicated, it may be possible to explicitly compute the probability densities of the subdomains (the $a_i$'s) and use them to explicitly compute the distributions on each subdomain. Alternatively, if it is difficult to determine the $a_i$'s a priori, one can select test cases according to the distribution on the entire domain, determine which subdomain(s) each test case lies in (by executing the program or by other means), associate each test case with one subdomain that contains it and that has not already been "killed", and discard remaining test cases. In order to assure that a part of a subdomain that intersects some other subdomain is not underrepresented, subdomains can be grouped into batches of disjoint subdomains, and each batch can be treated separately. This approach results in the generation of extra test cases, but may be cost-effective in situations in which checking test results is more expensive than generating and executing tests. Since these and similar approaches are reasonably close to the way testing criteria are used in practice, we believe inherited distributions offer a suitable generalization of previous studies limited to the uniform distribution. We shall prove a theorem analogous to Theorem 1 for expected detected risk when tests are selected using an inherited distribution.
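For a batch of disjoint subdomains, the second approach can be realized by a simple discard scheme, sketched below under our own simplifying assumptions (disjoint subdomains, each with positive probability mass); this is an illustration, not the paper's procedure verbatim:

```python
import random

def select_suite(subdomains, Pr, rng=None):
    """Draw from the whole-domain distribution Pr, keeping the first
    test case that lands in each not-yet-covered subdomain and
    discarding the rest; for a batch of disjoint subdomains, each
    kept test case follows the inherited distribution on its subdomain."""
    rng = rng or random.Random(0)
    items, weights = zip(*Pr.items())
    suite = {}
    while len(suite) < len(subdomains):
        t = rng.choices(items, weights=weights)[0]
        for i, D in enumerate(subdomains):
            if i not in suite and t in D:
                suite[i] = t
    return [suite[i] for i in range(len(subdomains))]

Pr = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
print(select_suite([{0, 1}, {2, 3}], Pr))  # one test case per subdomain
```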

The following is proved in Appendix A:

Lemma 1. If test suites are selected by independently randomly selecting one test case from each subdomain, using distribution $Pr_i$ to select from subdomain $D_i$, then the expected risk detected during testing is

$$EDR(C,P,S,Pr_1,\ldots,Pr_n) = \sum_{i=1}^{n} \sum_{t \in D_i} Pr_i(t)\, c(t).$$

Corollary 1. Let the expected number of failures be denoted by $E(C,P,S,Pr_1,\ldots,Pr_n)$. Then:

$$E(C,P,S,Pr_1,\ldots,Pr_n) = \sum_{i=1}^{n} h_i.$$

Proof. Treat all failure-causing inputs as having a cost of 1 and all non-failure-causing inputs as having a cost of 0. Then the expected number of failures is the expected cost:

$$\sum_{i=1}^{n} \sum_{t \in D_i} Pr_i(t)\, c(t) = \sum_{i=1}^{n} \sum_{t \in D_i \cap F} Pr_i(t) = \sum_{i=1}^{n} h_i. \qquad \square$$

We now prove a result analogous to Theorem 1 for expected risk detection under inherited distributions.

For testing with distributions inherited from Pr, denote $EDR(C,P,S,Pr_{D_1},\ldots,Pr_{D_n})$ by $EDR(C,P,S,Pr)$. Note that

$$EDR(C,P,S,Pr) = \sum_{i=1}^{n} \sum_{t \in D_i} \frac{Pr(t)}{a_i}\, c(t).$$

In order to assess risk using subdomain testing strategies, we will be particularly interested in distributions inherited from the operational distribution.

Theorem 3. Let $C_1$ properly cover $C_2$ for program P, with specification S, and let Pr be any probability distribution on the input domain D. Then $EDR(C_1,P,S,Pr) \geq EDR(C_2,P,S,Pr)$.

Proof. Let $SD_{C_1}(P,S) = \{D_1^1, \ldots, D_m^1\}$, let $SD_{C_2}(P,S) = \{D_1^2, \ldots, D_n^2\}$, and let

$$\mathcal{M} = \{D_{1,1}^1, \ldots, D_{1,k_1}^1, \ldots, D_{n,1}^1, \ldots, D_{n,k_n}^1\}$$

be a multi-set such that $\mathcal{M}$ is a sub-multi-set of $SD_{C_1}(P,S)$ and

$$D_1^2 = D_{1,1}^1 \cup \cdots \cup D_{1,k_1}^1, \quad \ldots, \quad D_n^2 = D_{n,1}^1 \cup \cdots \cup D_{n,k_n}^1.$$

This is possible since $C_1$ is assumed to properly cover $C_2$ for program P and specification S. Let $a_i^1$ denote $\sum_{t \in D_i^1} Pr(t)$, let $a_i$ denote $\sum_{t \in D_i^2} Pr(t)$, and let $a_{i,j}$ denote $\sum_{t \in D_{i,j}^1} Pr(t)$. Then


$$\begin{aligned}
EDR(C_1,P,S,Pr) &= \sum_{i=1}^{m} \sum_{t \in D_i^1} \frac{Pr(t)}{a_i^1}\, c(t) &\quad (1)\\
&\geq \sum_{i=1}^{n} \sum_{j=1}^{k_i} \sum_{t \in D_{i,j}^1} \frac{Pr(t)}{a_{i,j}}\, c(t) &\quad (2)\\
&\geq \sum_{i=1}^{n} \sum_{j=1}^{k_i} \sum_{t \in D_{i,j}^1} \frac{Pr(t)}{a_i}\, c(t) &\quad (3)\\
&\geq \sum_{i=1}^{n} \sum_{t \in D_i^2} \frac{Pr(t)}{a_i}\, c(t) &\quad (4)\\
&= EDR(C_2,P,S,Pr). &\quad (5)
\end{aligned}$$

The inequality in line (2) follows from the facts that the sum in line (1) is over all the subdomains in $SD_{C_1}(P,S)$, the sum in line (2) is over all the subdomains in $\mathcal{M}$, and $\mathcal{M}$ is a sub-multi-set of $SD_{C_1}(P,S)$. The inequality in line (3) holds because for each $i,j$, $D_{i,j}^1 \subseteq D_i^2$ and therefore $a_i \geq a_{i,j}$. The inequality in line (4) holds because the sum in line (3) represents $k_i$ selections from each subdomain $D_i^2$, while the sum in line (4) represents one selection from each $D_i^2$. The equalities in lines (1) and (5) follow from the lemma. $\square$

Thus if $C_1$ properly covers $C_2$, it follows that testing with $C_1$ (using the test selection strategy described above) is guaranteed to detect at least as much risk as testing with $C_2$. This is important since it shows that the properly covers relation can be used in a natural way to compare testing strategies with respect to risk reduction, especially since it does not assume that test cases are selected from within subdomains using a uniform distribution.
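To make the theorem concrete, here is a small worked check (entirely our own toy construction): $C_2$ has a single subdomain that $C_1$ splits in two, so $C_1$ properly covers $C_2$, and computing the closed form of Lemma 1 under the inherited distributions shows $EDR(C_1) \geq EDR(C_2)$:

```python
def inherited(D, Pr):
    a = sum(Pr[t] for t in D)
    return {t: Pr[t] / a for t in D}

def edr(criterion, Pr, c):
    # Lemma 1 closed form, each subdomain using its inherited distribution.
    return sum(p * c[t]
               for D in criterion
               for t, p in inherited(D, Pr).items())

Pr = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}  # whole-domain distribution
c = {0: 0.0, 1: 4.0, 2: 10.0, 3: 1.0}  # hypothetical failure costs
C2 = [[0, 1, 2, 3]]                    # one subdomain: the whole domain
C1 = [[0, 1], [2, 3]]                  # C1 properly covers C2
print(edr(C1, Pr, c))  # about 8.71
print(edr(C2, Pr, c))  # 3.3
```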

In Frankl and Weyuker (1993b), we showed that the properly covers relation holds between various well-known testing criteria for a large class of programs. Each of these criteria was introduced and investigated in a number of earlier papers, and formal definitions can be found in Frankl and Weyuker (1993b). Applying the result of Theorem 3 to these pairs of criteria yields:

Corollary 2. For any program (in a particular large class of programs), any specification, and for any probability distribution Pr on D:

$$EDR(\textit{required-k-tuples+},P,S,Pr) \geq EDR(\textit{all-uses},P,S,Pr) \geq EDR(\textit{all-p-uses},P,S,Pr) \geq EDR(\textit{decision-coverage},P,S,Pr);$$

$$EDR(\textit{ordered-context-coverage},P,S,Pr) \geq EDR(\textit{context-coverage},P,S,Pr) \geq EDR(\textit{decision-coverage},P,S,Pr);$$

$$EDR(\textit{multiple-condition-coverage},P,S,Pr) \geq EDR(\textit{decision-coverage},P,S,Pr);$$

and

$$EDR(\textit{decision-condition-coverage},P,S,Pr) \geq EDR(\textit{decision-coverage},P,S,Pr).$$

4. Comparing the risk reduction ability of testing criteria

As mentioned above, when testing to uncover the presence of faults, the ultimate goal is not only to detect the faults, but to remove them, so as to reduce the likelihood that the software will fail in the field. Similarly, when testing to detect risk, the ultimate goal is to remove high consequence faults that are likely to occur, so as to reduce the risk of the software.

The measure investigated above, risk detected by a test suite, is related, but not identical, to the amount of risk reduction that will result when all the faults detected are fixed. There are several reasons why the detected risk is not necessarily equal to the risk reduction:

· The probability of detecting a given fault (by selecting a test case that will cause a failure during testing) may differ from the probability that that fault will result in failure in the field. This might happen due to differences between the probability distribution used in testing and the operational distribution.

· Several test cases may detect the same fault. This means that the cost of that fault will be counted several times in calculating detected risk, but its removal will only contribute once to reducing risk.

· The detected risk is calculated using n test cases, where n is the test suite size, whereas the risk reduction is the reduction in the expected cost due to failure on a single run of the program.

· In attempting to remove a fault, the programmer may introduce other faults, in which case the risk will not decrease as much as expected and may actually increase.

Although the risk reduction resulting from a test set is not directly measurable, under certain assumptions, we can still say something about which testing criteria are better at reducing risk.

Frankl et al. (1998) introduced the notion of a "failure region", a subset of the set of failure-causing inputs consisting of inputs that are related in the sense that the change made in response to detecting a failure of one of the test cases in the region will fix all of the test cases in the region. This intuition is predicated on the fact that a software change generally does not cause only the specific test case that caused the change to be made to behave differently. Typically many other elements of the domain will also be affected, hopefully by causing them to produce the correct output rather than an incorrect one. All of the inputs whose behavior is corrected by the code change to correct the "fault" together will be thought of as a failure region. It is assumed that if any element in the failure region had been selected as a test case, it would have caused the person to make the same changes to the software.

As in Frankl et al. (1998), we assume that the set of failure-causing inputs can be divided into disjoint failure regions, $F_1, \ldots, F_p$, having the property that whenever a test case from $F_i$ is executed, the person debugging the program makes a change that exactly removes $F_i$. This implies that this change causes all of the inputs in $F_i$ to now execute correctly. This is a strong assumption, as the change made upon observing a failure may depend on many factors, including exactly which test case failed, how the tester selected that test case, and the prior experience or whims of the programmer debugging the program at that moment. We recognize that this is not necessarily true in reality, and that it is even possible that at different times the same person might make different changes to the software in response to the same test case failing. Nevertheless, we believe it is a useful assumption, approximating reality closely enough to give useful insights into the relative strengths and weaknesses of various testing techniques. In particular, since the assumption is applied uniformly in our analyses of various testing criteria, it does not bias the results in favor of or against a particular criterion. Thus it seems reasonable for our present purposes (and those in Frankl et al. (1998)) of comparing the effectiveness of various testing techniques. We will further assume that each failure region has a fixed cost. This will allow us to model the risk reduction obtained through testing.

As above, we assume we can associate a cost of failure $c(t)$ with each input t, but that these costs are not necessarily known a priori. The cost of a failure region $F_\ell$ is

$$c(F_\ell) = \sum_{t \in F_\ell} c(t).$$

Let Q represent the operational distribution of the program. The probability that a failure region $F_\ell$ will be encountered on a single execution of the software in the field is

$$Q(F_\ell) = \sum_{t \in F_\ell} Q(t).$$

Thus the risk attributed to failure region $F_\ell$ is

$$R(F_\ell) = \sum_{t \in F_\ell} Q(t)\, c(t).$$
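In code, these three per-region quantities are one-liners (our sketch; a region is any set of inputs):

```python
def region_cost(F, c):
    return sum(c[t] for t in F)            # c(F)

def region_prob(F, Q):
    return sum(Q[t] for t in F)            # Q(F)

def region_risk(F, Q, c):
    return sum(Q[t] * c[t] for t in F)     # R(F)
```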

The risk reduction due to a test suite T is the difference between the risk of the original program and that of the program obtained by removing all of the failure regions discovered by T. This leads us to another useful measure of the effectiveness of a testing technique: the expected risk reduction when a test suite is selected using that technique. As above, since the testing techniques we are investigating are probabilistic, we consider the expected value over possible test suites selected using the technique.

The expected risk reduction can be obtained by summing the risks due to the failure regions, weighted by the probability that the failure regions are detected (and hence removed) during the testing/debugging process.

Consider a subdomain testing technique with subdomains $\{D_1, \ldots, D_n\}$, where one test case is selected independently from each using probability distribution $Pr_i$ on $D_i$. Consider failure region $F_\ell$. The probability that $F_\ell$ is detected is equal to the probability that it is detected by a test case from at least one subdomain, which is 1 minus the probability that it is not detected by the tests selected from any of the subdomains:

$$1 - \prod_{i=1}^{n}\left(1 - \sum_{t \in D_i \cap F_\ell} Pr_i(t)\right).$$

Hence, the expected risk reduction for criterion C (with the above testing technique) for program P, specification S is

$$RR(C,P,S,Pr_1,\ldots,Pr_n) = \sum_{\ell=1}^{p}\left[1 - \prod_{i=1}^{n}\left(1 - \sum_{t \in D_i \cap F_\ell} Pr_i(t)\right)\right] c(F_\ell)\, Q(F_\ell).$$

As above, we will restrict attention to probability distributions on the subdomains that are inherited from a probability distribution on the entire domain. We will denote the expected risk reduction in such cases by $RR(C,P,S,Pr)$.
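A direct transcription of the RR formula (our sketch; the failure regions, distributions and costs are hypothetical toy data):

```python
from math import prod

def risk_reduction(subdomains, dists, regions, Q, c):
    """RR: each failure region's c(F) * Q(F), weighted by the probability
    that at least one selected test case lands in the region."""
    total = 0.0
    for F in regions:
        p_detect = 1.0 - prod(
            1.0 - sum(dists[i][t] for t in D if t in F)
            for i, D in enumerate(subdomains))
        total += p_detect * sum(c[t] for t in F) * sum(Q[t] for t in F)
    return total

subdomains = [[0, 1], [2, 3]]
dists = [{0: 0.5, 1: 0.5}, {2: 0.5, 3: 0.5}]  # e.g. inherited distributions
regions = [{1, 2}]                            # one failure region
Q = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
c = {0: 0.0, 1: 4.0, 2: 10.0, 3: 0.0}
print(risk_reduction(subdomains, dists, regions, Q, c))
# detection prob = 1 - 0.5*0.5 = 0.75; c(F) = 14, Q(F) = 0.5; RR = 5.25
```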

In the remainder of this section, we show that a result analogous to Theorem 1 holds for the expected risk reduction, and present a corollary analogous to Corollary 2.

Theorem 4. Let $C_1$ properly cover $C_2$ for program P, with specification S, and let Pr be any probability distribution on the input domain D. Then $RR(C_1,P,S,Pr) \geq RR(C_2,P,S,Pr)$.

Proof. Assume $C_1$ properly covers $C_2$ for program P and specification S. Let $SD_{C_2}(P,S) = \{D_1^2, \ldots, D_n^2\}$, let $SD_{C_1}(P,S) = \{D_1^1, \ldots, D_m^1\}$, and let

$$\mathcal{M} = \{D_{1,1}^1, \ldots, D_{1,k_1}^1, \ldots, D_{n,1}^1, \ldots, D_{n,k_n}^1\}$$

be a multi-set such that $\mathcal{M}$ is a sub-multi-set of $SD_{C_1}(P,S)$ and

$$D_1^2 = D_{1,1}^1 \cup \cdots \cup D_{1,k_1}^1, \quad \ldots, \quad D_n^2 = D_{n,1}^1 \cup \cdots \cup D_{n,k_n}^1.$$

Let

$$f_i^\ell = \sum_{t \in D_i^2 \cap F_\ell} Pr(t)/a_i$$

and let

$$f_{i,j}^\ell = \sum_{t \in D_{i,j}^1 \cap F_\ell} Pr(t)/a_{i,j}.$$

That is, $f_i^\ell$ denotes the probability that a test case selected from $D_i^2$ detects failure region $F_\ell$, and $f_{i,j}^\ell$ denotes the analogous probability for $D_{i,j}^1$. We want to show that

$$\sum_{\ell=1}^{p}\left(1 - \prod_{i=1}^{n}\prod_{j=1}^{k_i}\left(1 - f_{i,j}^\ell\right)\right) Q(F_\ell)\, c(F_\ell) \;\geq\; \sum_{\ell=1}^{p}\left(1 - \prod_{i=1}^{n}\left(1 - f_i^\ell\right)\right) Q(F_\ell)\, c(F_\ell),$$

i.e., that the risk reduction due to those $C_1$ subdomains involved in the covering exceeds the risk reduction due to all of the $C_2$ subdomains. The remaining $C_1$ subdomains, if any, will only further increase the risk reduction ability of $C_1$.

It suffices to show that for all $i, \ell$,

$$\prod_{j=1}^{k_i}\left(1 - f_{i,j}^\ell\right) \leq 1 - f_i^\ell.$$

In other words, we need to show that the probability of not detecting a given failure region with one test case drawn from $D_i^2$ is greater than or equal to the probability of not detecting that failure region with one test case from each of the $C_1$ subdomains used to cover $D_i^2$. That is, we can consider the terms from each failure region separately; then within each such term, we can consider separately the relative contributions to the product on the left- and right-hand sides from each $D_i^2$ and its corresponding covering $C_1$ subdomains.

Note that the expressions in this inequality are similar in form to those occurring in the M measure. However, there are some important differences to be noted. Whereas M considers the probability that at least one test case will detect any failure region, these expressions consider the probability that at least one test case will detect a particular failure region. Furthermore, whereas M assumed test cases were chosen according to the uniform distribution, here we assume they are chosen according to any inherited distribution. We can massage these expressions into exactly the form of the analogous expressions arising in the proof of Theorem 1 in Frankl and Weyuker (1993b).

To simplify the notation, let D denote $D_i^2$ and let $D_j$ denote $D_{i,j}^1$. Let $Pr(t_i) = x_i/y_i$, where $x_i, y_i$ are integers.⁴ Let d be the least common multiple of the $y_i$'s. We can transform the input domain into a space with d points and emulate Pr with a uniform distribution on this space as follows. Set

$$d_j = d \sum_{t_i \in D_j} \frac{x_i}{y_i}, \qquad m_j = d \sum_{t_i \in D_j \cap F_\ell} \frac{x_i}{y_i}, \qquad m = d \sum_{t_i \in D \cap F_\ell} \frac{x_i}{y_i}.$$

⁴ If any of the probabilities are irrational, they can be approximated closely enough by rationals so as to make the inequality hold.

Then $m/d$ is the probability that a test case selected from D will detect failure region $F_\ell$, and $m_j/d_j$ is the probability that a test case selected from $D_j$ will detect it. It suffices to show that

$$1 - \frac{m}{d} \geq \prod_{j}\left(1 - \frac{m_j}{d_j}\right),$$

which follows from the proof of Theorem 1 in Frankl and Weyuker (1993b). $\square$

We can again particularize this result to several well-known testing criteria.

Corollary 3. For any program (in a particular large class of programs), any specification, and for any probability distribution Pr on D:

$$RR(\textit{required-k-tuples+},P,S,Pr) \geq RR(\textit{all-uses},P,S,Pr) \geq RR(\textit{all-p-uses},P,S,Pr) \geq RR(\textit{decision-coverage},P,S,Pr);$$

$$RR(\textit{ordered-context-coverage},P,S,Pr) \geq RR(\textit{context-coverage},P,S,Pr) \geq RR(\textit{decision-coverage},P,S,Pr);$$

$$RR(\textit{multiple-condition-coverage},P,S,Pr) \geq RR(\textit{decision-coverage},P,S,Pr);$$

and

$$RR(\textit{decision-condition-coverage},P,S,Pr) \geq RR(\textit{decision-coverage},P,S,Pr).$$

5. Conclusion

We have extended the results of Frankl and Weyuker (1993a) and Frankl and Weyuker (1993b), which investigated concrete ways of comparing testing methods based on their fault-detecting ability. We have extended these results in two directions: considering more general test selection strategies and considering measures of test effectiveness that are more directly related to the risk of the software under test than those used in the earlier work.



We began by generalizing the test selection strategies considered. In Frankl and Weyuker (1993a) and Frankl and Weyuker (1993b), all strategies were assumed to be subdomain-based, with test cases independently, randomly selected from each subdomain using a uniform distribution on the subdomain. In this paper, we relaxed the requirement that all test case selection be based on a uniform distribution, and allowed the selection of test suites by independently randomly selecting one test case from each subdomain using a different probability distribution for each subdomain, but required that these distributions be related in the sense that they were all "inherited" from a common distribution on the whole domain.

We introduced two measures of software testing effectiveness related to software risk, expected detected risk and expected risk reduction, and investigated whether one could guarantee that one testing technique is better than another according to these measures.

We showed that if $C_1$ properly covers $C_2$ for program P and specification S, and if test suites are selected by independently selecting one test case from each subdomain according to distributions inherited from a common distribution, then $C_1$ is guaranteed to perform at least as well as $C_2$ according to these risk-related measures. Note that the fact that $C_1$ selects larger test suites than $C_2$ is not enough to guarantee this, nor is the fact that $C_1$ subsumes $C_2$. We expect this to be a very useful result for determining which subdomain-based testing strategy to use when minimization of risk is a primary consideration for a project.

Acknowledgements

We are grateful to Sandro Morasca for making several interesting suggestions.

Appendix A

Lemma A.1. If test suites are selected by independently randomly selecting one test case from each subdomain, using distribution $Pr_i$ to select from subdomain $D_i$, then the expected risk detected during testing is

$$EDR(C,P,S,Pr_1,\ldots,Pr_n) = \sum_{i=1}^{n} \sum_{t \in D_i} Pr_i(t)\, c(t).$$

Proof. Recall that the expected detected risk is

$$E[DR(P,S,T)] = \sum_{T} \mathrm{Prob}(T) \sum_{t \in T} c(t).$$

Let $s_i$ denote the size of $D_i$, and let $\{t_{i,1}, \ldots, t_{i,s_i}\}$ denote the elements of $D_i$. With this test selection strategy, the test suites are of the form $\{t_{1,j_1}, \ldots, t_{n,j_n}\}$, where $t_{i,j_i}$ is drawn from subdomain $D_i$ with probability $Pr_i(t_{i,j_i})$. Since the selections from the subdomains are independent, the probability of selecting a given test suite is

$$\mathrm{Prob}(\{t_{1,j_1}, \ldots, t_{n,j_n}\}) = \prod_{i=1}^{n} Pr_i(t_{i,j_i}).$$

The collection of all test suites is obtained by considering all possible combinations of one test case from each subdomain, so

$$E[DR(P,S,T)] = \sum_{j_1=1}^{s_1}\cdots\sum_{j_n=1}^{s_n}\left(\prod_{i=1}^{n} Pr_i(t_{i,j_i})\right)\left(\sum_{k=1}^{n} c(t_{k,j_k})\right) = \sum_{j_1=1}^{s_1}\cdots\sum_{j_n=1}^{s_n}\sum_{k=1}^{n} c(t_{k,j_k})\prod_{i=1}^{n} Pr_i(t_{i,j_i}).$$

We will now consider each particular test case separately, and its "contribution" to the expected detected risk, beginning with $t_{1,1}$, selected from subdomain $D_1$. This involves setting $j_1 = 1$ and $k = 1$. Collecting all the terms involving $t_{1,1}$ yields

$$X_{1,1} = \sum_{j_2=1}^{s_2}\cdots\sum_{j_n=1}^{s_n} c(t_{1,1})\, Pr_1(t_{1,1}) \prod_{i=2}^{n} Pr_i(t_{i,j_i}) = Pr_1(t_{1,1})\, c(t_{1,1}) \sum_{j_2=1}^{s_2}\cdots\sum_{j_n=1}^{s_n} \prod_{i=2}^{n} Pr_i(t_{i,j_i}) = Pr_1(t_{1,1})\, c(t_{1,1}),$$

since the last sum of products represents the probability of selecting one test case from each of the remaining subdomains, which is equal to one.

Each test case from each subdomain contributes a term analogous to $X_{1,1}$, so

$$E[DR(P,S,T)] = \sum_{i=1}^{n}\sum_{j_i=1}^{s_i} X_{i,j_i} = \sum_{i=1}^{n}\sum_{j_i=1}^{s_i} Pr_i(t_{i,j_i})\, c(t_{i,j_i}) = \sum_{i=1}^{n}\sum_{t \in D_i} Pr_i(t)\, c(t). \qquad \square$$

References

Boehm, B., 1989. Software risk management. In: Proceedings ESEC, Warwick, UK, September 1989, pp. 1–19.

Davis, M.D., Sigal, R., Weyuker, E.J., 1994. Computability, Complexity and Languages, second ed. Academic Press, New York.

Duran, J.W., Ntafos, S.C., 1984. An evaluation of random testing. IEEE Trans. Software Eng. SE-10 (7), 438–444.

Frankl, P.G., Hamlet, D., Littlewood, B., Strigini, L., 1998. Evaluating testing methods by delivered reliability. IEEE Trans. Software Eng. 24 (10), 586–601.

Frankl, P.G., Weyuker, E.J., 1993a. A formal analysis of the fault detecting ability of testing methods. IEEE Trans. Software Eng., 202–213.

Frankl, P.G., Weyuker, E.J., 1993b. Provable improvements on branch testing. IEEE Trans. Software Eng. 19 (10), 962–975.

Gutjahr, W.J., 1995. Optimal test distributions for software failure cost estimation. IEEE Trans. Software Eng. 19 (10), 962–975.

Hall, E.M., 1998. Managing Software Systems Risk. Addison-Wesley, New York.

Hamlet, D., 1989. Theoretical comparison of testing methods. In: Proceedings ACM SIGSOFT Third Symposium on Software Testing, Analysis, and Verification. ACM Press, pp. 28–37.

Hamlet, D., Taylor, R., 1990. Partition testing does not inspire confidence. IEEE Trans. Software Eng. 16 (12), 1402–1411.

Leveson, N.G., 1995. Safeware: System Safety and Computers. Addison-Wesley, New York.

Ntafos, S.C., 1997. The cost of software failures. In: Proceedings IASTED Software Engineering Conference, pp. 53–57.

Sherer, S.A., 1992. Software Failure Risk. Plenum Press, New York.

Tsoukalas, M.Z., Duran, J.W., Ntafos, S.C., 1993. On some reliability estimation problems in random and partition testing. IEEE Trans. Software Eng. 19 (7), 687–697.

Weiss, S.N., 1989. Comparing test data adequacy criteria. Software Eng. Notes 14 (6), 42–49.

Weiss, S.N., Weyuker, E., 1998. An extended domain-based model of software reliability. IEEE Trans. Software Eng. SE-14 (10), 1512–1524.

Weyuker, E.J., 1996. Using failure cost information for testing and reliability assessment. ACM Trans. Software Eng. Meth. 5 (2), 87–98.

Weyuker, E.J., Jeng, B., 1991. Analyzing partition testing strategies. IEEE Trans. Software Eng. 17 (7), 703–711.

Weyuker, E.J., Weiss, S.N., Hamlet, D., 1991. Comparison of program testing strategies. In: Proceedings Fourth Symposium on Software Testing, Analysis, and Verification. ACM Press, pp. 1–10.
