
J. SYSTEMS SOFTWARE 1991; 15:133-138

An Overview of Automated Software Testing

Antonia Bertolino Istituto di Elaborazione della Informazione del CNR, Pisa, Italy

The paper provides a brief, yet comprehensive, introduction to the practice of dynamic testing of computer software and to the currently available automated testing tools.

We first illustrate the basic concepts of testing, outlining its properties and its limitations, and then propose a control flow diagram for the testing process. Following this diagram step by step, we describe all the activities involved and the related tools.

It is understood that testing must be prepared through an accurate and well-documented specification phase.

1. INTRODUCTION

In the last decade, a considerable part of the investigations within the area of software engineering has dealt with software testing and, in particular, much effort has been devoted to the automation of the testing process. As a result, there are now many commercial or prototype automated testing tools in existence.

Properly exploited, automated testing tools can improve not only testing productivity but also its efficacy; in fact, they take over the mass of clerical, repetitive, and error-prone activities which are necessary for the systematic testing of software systems.

At the same time, they can never fully replace human operators; indeed, we are convinced that human competence and ingenuity remain the sine qua non condition for the success of the testing process.

In this paper, we give a general overview of automated software testing. Our goal is not to provide a survey of existing tools; there are already useful works fulfilling this purpose [3]. Indeed, as the field is in rapid expansion, it is hard to obtain an up-to-date picture.

Neither is it our aim to debate the extent to which the testing process can be automated: this is certainly a stimulating argument, but it is of little relevance (at least in the short term) to the performance of testing.

Address correspondence to Antonia Bertolino, Istituto di Elaborazione della Informazione del CNR, via S. Maria, 46, I-56126, Pisa, Italy.

Our objective is, instead, to give a description of the support offered by the currently available automated testing tools and, therefore, also to indicate those expectations which can (and those which cannot) be legitimately fulfilled by automated testing today.

Before discussing automated tools, we present an overview of the basic concepts of software testing, in order to make the paper easily accessible to testing novices. In fact, the paper is mainly addressed to these readers, as our principal aim is to provide a comprehensive introduction to the topic treated.

For this reason, we do not just present an unproductive tool classification, but we conduct a walk-through of the testing process, progressively introducing concepts and tools as they are applied.

The paper is thus organized in the following way: the basic concepts behind software testing are presented in Section 2 and a control-flow diagram for the testing process is given in Section 3. In this diagram, we identify five fundamental steps; for each step, Sections 4 to 8 discuss the basic concepts and the support given by the available tools. Concluding remarks are made in Section 9.

2. TESTING CONCEPTS

Testing is essentially an a posteriori activity; in fact, it is conducted on a finished product to check its validity or, in a word, to validate it.

The notion of validity must be understood, of course, as a relative concept: in fact, to decide on the validity of a product, we must evaluate it against a model which represents the intentions of the designers.

In the field of software engineering, the test item is a program P, which transforms a set of input data, i.e., the Input Domain (ID), to a set of corresponding output data, i.e., the Output Domain (OD).



The program P is developed according to a function F, modeling its behavior. In practice, F can be expressed in different ways: as a formal model, or by natural language specifications; in the worst case, F is merely an idea in the mind of the programmer. Whatever the status of F, in order to validate program P, we must be able to compare it with F.

In software testing, this comparison is performed by experiments, i.e., by executing P and, for each run, comparing its output with the expected output, i.e., that obtained according to F. When the output of P diverges from the expected output, we say that we have discovered an error.

In the literature, the activity of validating P by experiments is most often referred to as dynamic testing.

For the sake of completeness, we should mention here the two other complementary approaches commonly used for program validation: in program proving [9], a program is shown to be correct, i.e., equivalent to its specification, using theorem proving techniques; in static testing, the program is essentially inspected [6], i.e., a list of predetermined properties or structures is checked against P. These two approaches are outside the scope of this paper: henceforth, our subject will be dynamic testing. Furthermore, in order to ensure cohesion of the arguments treated, we shall not deal with problems which are specific to the dynamic testing of real-time software; Glass [7] provides a comprehensive survey of these problems.

To be certain of the validity of P through dynamic testing, ideally we should try the program on the entire ID. In fact, due to the intrinsically discontinuous nature of software, given an observation on any point of ID, we cannot infer any property for the other points in ID. However, excluding trivial cases, ID is too large, practically infinite, for exhaustive testing. Consequently, to contain test costs, when we test a program, we only check the behavior of P for a subset of ID, the Test Input Domain (TID). From this limited number of test runs, we then infer the behavior of P for the entire ID, even if there is no definitive justification for this.

Thus, testing corresponds to sampling a certain number of executions of a program P among all its possible executions, by sampling a number of input data within ID.

Ultimately, software development, like any other production, must obey market laws and the testing phase is generally adjusted to reach a reasonable compromise between the need to maximize the dimensions of TID, in order to increase our confidence in the validity of the program, and the opposing need to minimize its dimensions, in order to reduce testing costs.


Of course, we should like to sample ID systematically so that we can increase our chances of discovering possible errors. The identification of a suitable testing/sampling strategy is known as the test data selection problem. An influential paper by Goodenough and Gerhart [8] has introduced a theory of test data selection, which is now generally accepted.

3. THE TESTING PROCESS

On the basis of the concepts illustrated above, we can now identify the following four tasks as constituting the testing process:

1) Select a TID (subset of ID) according to a suitable testing strategy.

2) Derive for TID the expected output data to form the Expected Output Domain (EOD). In the testing literature, this task is known as the oracle problem.

3) Obtain the effective output data by executing P on TID; the effective output data form the Test Output Domain (TOD).

4) Compare the sets TOD and EOD, in order to validate P.

A Data Flow Diagram (DFD) depicting software testing is shown in Figure 1; the bubbles in the diagram correspond to the tasks identified above.

In practice, the first attempt at identifying an "adequate"¹ TID is never successful. Thus, the testing process becomes a trial and error activity, in which task 1 is iterated a number of times, each time augmenting TID, until we are satisfied, on the basis of a more or less clearly defined criterion (unfortunately, this criterion is often the budget available).

We can now derive, in Figure 2, a diagram of the testing process, which reflects the DFD of Figure 1, but also takes the above observation into account. In addition, we introduce the notion of control flow between the tasks involved, in accordance with usual practice.

Figure 1. A DFD for software testing.

¹The concept of adequacy for test input data has been formally defined in the literature [see, for example, 14]; we use the term here rather more informally.


[Figure 2 flowchart: begin; i = 0; select TID_i ⊂ ID; derive EOD_i; derive TOD_i by executing P; compare TOD_i and EOD_i; if the result is satisfactory, end; otherwise increment i and repeat.]

Figure 2. The software testing process.

With reference to this control flow diagram, we can identify five fundamental steps for the testing process, corresponding to the four previously listed tasks and also to the task which determines the stopping of the test activity, according to a specified exit criterion. In Sections 4-8, we shall discuss the concepts behind each step, and the tools which are available.
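
To make the iteration of Figure 2 concrete, the following minimal sketch (not taken from the paper; all function names are illustrative assumptions) expresses the five steps in Python, assuming the tester supplies routines for test data selection, oracle derivation, execution of P, and the exit criterion.

```python
# Minimal sketch of the testing process of Figure 2 (illustrative only).
# select_tid, derive_eod, program_under_test, and exit_criterion are
# assumed to be supplied by the tester or by supporting tools.
def test_process(select_tid, derive_eod, program_under_test, exit_criterion):
    tid = []                                        # Test Input Domain, grown at each iteration
    i = 0
    while True:
        tid = select_tid(tid, i)                    # step 1: select (augment) TID_i
        eod = [derive_eod(x) for x in tid]          # step 2: expected outputs (the oracle)
        tod = [program_under_test(x) for x in tid]  # step 3: effective outputs, by executing P
        failures = [x for x, expected, actual in zip(tid, eod, tod)
                    if expected != actual]          # step 4: compare TOD_i and EOD_i
        if exit_criterion(tid, failures, i):        # step 5: exit criterion
            return failures
        i += 1
```

The loop grows TID at each iteration and stops only when the supplied exit criterion is satisfied, mirroring the trial and error nature of the process.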

4. THE SELECTION OF TID

The selection of a suitable TID is the central problem in testing. The typical tools for this task are the test data generators.

Obviously, given ID, we could easily conceive an automatic tool which extracts a subset from it anyhow. But, of course, this is not, or is not only, our problem. In fact, for testing to be effective, TID must be selected systematically, according to a testing strategy.

Therefore, before examining the typical functions performed by test data generators, let us briefly introduce the fundamental strategies followed when selecting the test input data.

As we have said in Section 2, testing a program P consists of observing a representative sample of all its possible executions, by trying P on a subset TID of ID. Thus, we select TID by first partitioning ID into a number of classes, such that P behaves equivalently for all data within a class. Consequently, we can check just a few executions for each class, for example only on an internal and a border point.*

Of course, there is no unique ID partition. In fact, the equivalence of P executions within one input class is always dependent on our point of observation, or, in other words, on what we want to validate.

Thus, for example, if we want to validate P against its specified I/O relations, we shall partition ID so that one I/O relation corresponds to each class (this falls into functional testing, see below). Instead, if we want to check the execution of P on each segment of the source code, we shall choose a different ID partition, such that each class corresponds to the execution of one segment in the code (this falls into structural testing, see below).

There are three basic testing strategies:

(1) functional testing: essentially, TID is selected, i.e., ID is partitioned, according to the reference model F.

(2) random testing: essentially, TID is selected, i.e., ID is partitioned, according to the way in which P is operated (in fact, ID is generally partitioned according to the expected use of P).

(3) structural testing: essentially, TID is selected, i.e., ID is partitioned, according to the structure of P.

In the literature, the first two strategies are both referred to as black-box testing, and the third as white-box testing. A variety of testing methodologies [1] fall within each of the above basic strategies. For example, structural testing can be conducted on the data structure or on the control structure.

Of course, test data generators provide different kinds of assistance, depending on the strategy for which they have been designed.

Test data generators for random testing function in the simplest way: they pick random data from ID, according to a chosen distribution. Very often, the operational distribution is assumed: very simply, the more frequently used input data are tried more frequently.
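
As an illustration, a random test data generator driven by an assumed operational distribution might be sketched as follows; the input classes, their usage frequencies, and the per-class samplers are hypothetical.

```python
import random

# Illustrative random test data generator: the input classes of ID, their
# relative usage frequencies (the operational distribution), and the per-class
# samplers are assumptions supplied by the tester.
def generate_random_tid(classes, weights, samplers, n):
    """Pick n test inputs, sampling the classes according to the given weights."""
    chosen = random.choices(classes, weights=weights, k=n)
    return [samplers[c]() for c in chosen]        # draw one concrete input per pick

# Hypothetical usage: small inputs dominate the expected use of P.
samplers = {
    "small": lambda: random.randint(0, 100),
    "large": lambda: random.randint(101, 10**6),
    "negative": lambda: random.randint(-10**6, -1),
}
tid = generate_random_tid(["small", "large", "negative"], [0.7, 0.2, 0.1], samplers, 50)
```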

*The weak point in this procedure, i.e., the reason why testing can only increase confidence in the validity of P, but never guarantee it, is, of course, the assumption that P behaves equivalently on all the data of a class, given the inherently discontinuous nature of digital computation.


The opposite approach is stress testing, a methodology for functional testing in which the special conditions, i.e., the strangest cases, are insisted upon.

In functional testing, it is the operator's task to analyze the reference model F and to select TID accordingly. However, tools have been developed to assist the operator in the clerical activities connected with this task.

There are some test data generators which provide the operator with a language to specify the test data and then derive the test instances automatically. These tools are typically used, for example, to generate test programs for compiler validation.

Usually, it is not easy to specify an algorithm to derive TID, e.g., when F consists of the functional specification document. In these cases, the assistance provided is a support to organize the test data, already derived by the operator, into a suitable test plan; e.g., TID can be structured as a tree. Then, during the test execution, the tool automatically picks the data, as specified, and feeds them to P. This support is usually included in the test driver, which is an automated testing tool that launches P and controls its execution: test drivers are further discussed in Section 6.

Finally, there are a number of support tools to generate test data for structural testing: these tools essentially provide the operator with information about the structure of P, and this information is then exploited by the operator to select TID.

This means that the structure of P is analyzed in order to identify the program paths to be followed and ID is then partitioned so that each class corresponds to a different path: in fact, these tools are called more specifically pathwise test data generators.

Path selection is performed after static analysis of the code. The selection can be made on the basis of the data structure or of the control structure. In both cases, automated tools are available to perform this analysis stage; these are, respectively, data flow analyzers and coverage analyzers.

In data flow analysis, the paths are typically associated with the way in which the program variables are used [10]. A criterion for selecting TID could be, for example, to exercise P so that all definitions, i.e., all assignments, are referenced, i.e., used, at least once.

Instead, in coverage analysis [11], the paths are associated with the track followed by the control flow during execution: a criterion for selecting TID could be, for example, to exercise P so that all the segments are executed at least once.

After the path selection stage, symbolic evaluators [2] can be used, which simulate P's execution by taking symbolic values as input, in order to derive path constraints, i.e., the conditions under which a path is executed. Since path constraints result as expressions of the input variables, the operator can derive from them the test data to exercise a preselected path.
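
To illustrate the idea, consider the hypothetical routine below (not taken from the paper): a symbolic evaluator would propagate symbolic inputs X and Y through it and report one constraint per path, from which the tester can derive concrete test data.

```python
# Hypothetical routine, used only to illustrate path constraints.
def classify(x, y):
    if x > y:        # path 1 constraint: X > Y
        return x - y
    elif x == y:     # path 2 constraint: not (X > Y) and X == Y, i.e., X == Y
        return 0
    else:            # path 3 constraint: X < Y
        return y - x

# From the constraint "X > Y" the tester can derive, say, (x, y) = (5, 2)
# as a test input that forces execution of path 1.
```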

During the test execution, the effective level of path coverage is measured by dynamic analysis, i.e., by monitoring P's execution. The tools which perform this measurement are called program instrumenters. They insert some probes (essentially, some procedure calls) into the source code during a preprocessing stage and then, during program execution, monitor the passage of the control flow to these probes in order to measure test coverage according to the criterion specified. For example, to measure the coverage of segments, a probe is inserted where the control flow passes to each branch.
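
A minimal sketch of this probe mechanism, assuming the branches have already been numbered by the preprocessing stage, could look like the following; all names are illustrative.

```python
# Illustrative coverage probes (all names are hypothetical): the instrumenter
# inserts a call to probe() at the head of every branch during preprocessing.
covered = set()
TOTAL_BRANCHES = 2           # assumed to be known from the static analysis stage

def probe(branch_id):
    covered.add(branch_id)   # record that the control flow reached this branch

# Example of code after instrumentation (the probe calls are added by the tool):
def absolute_value(x):
    if x < 0:
        probe(1)
        return -x
    else:
        probe(2)
        return x

def branch_coverage():
    return len(covered) / TOTAL_BRANCHES   # fraction of branches exercised so far
```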

Before closing this section on test data selection strategies, we want to emphasize that there is no one best strategy; a strategy can be adopted on the basis of the tester's background, on program development practice, and on the resources available. Of course, integrating diverse strategies always enhances testing effectiveness; some experimental studies have been conducted [13] which confirm this intuitive concept.

5. THE DERIVATION OF EOD

In principle, there is no difficulty in automating this step: in fact, the same program P under test, when valid, can itself be seen as an automated tool which transforms TID to EOD. Of course, this is a paradox, as we need to specify EOD precisely in order to check P's validity.

The derivation of EOD can be made through an automated tool, called the expected results generator, by specifying F in an executable language.
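
As a sketch of this idea, an executable form of F can be run over TID to produce EOD automatically; the specification below (an integer square root written in a deliberately simple, obviously correct but slow style) is purely illustrative.

```python
# Illustrative expected results generator: F expressed in an executable form.
def spec_isqrt(n):
    """Executable specification F: the largest integer r with r*r <= n."""
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r

def generate_eod(tid):
    return [spec_isqrt(x) for x in tid]    # EOD derived automatically from F

eod = generate_eod([0, 1, 15, 16, 1000])   # expected outputs for a sample TID
```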

But, in practice, developing a running version of F is often considered too expensive and EOD is derived manually. Of course, even this latter approach can be quite expensive, especially when a large mass of test cases must be executed, as is normally the case, for example, with random testing.

Therefore, EOD is sometimes not derived at all and the effective test outputs are inspected by the operator: this may be a dangerous testing practice, since the testing evaluation becomes dependent on the competence and the fairness of the operator.

6. THE DERIVATION OF TOD

The task of deriving TOD is performed fully automatically by running the program under the control of a testing tool, known as the test driver (also testbed or test harness).

Test drivers conduct all the clerical activities required during the test: as stated in Section 4, they feed the selected test input data to the program under test.

Page 5: An overview of automated software testing

An Overview of Automated Software Testing

They also perform the following tasks:

• they provide a test environment, also supplying drivers and stubs, i.e., simulators respectively for the calling and the called procedures;
• they launch the tests automatically, iterating on all the data in TID;
• they record the test outputs;
• they evaluate the test outcome when an automatic comparison with EOD is feasible (see below);
• they often supply statistics on test performance.
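
A minimal test driver along these lines might be sketched as follows; all names are illustrative, and a real tool would also manage stubs, the test environment, and reporting in far greater detail.

```python
# Illustrative test driver sketch: it iterates over TID, runs P, records TOD,
# compares with EOD when an EOD is available, and reports simple statistics.
def run_tests(program_under_test, tid, eod=None):
    tod = []
    passed = failed = 0
    for i, test_input in enumerate(tid):
        output = program_under_test(test_input)    # launch one test run
        tod.append(output)                         # record the test output
        if eod is not None:                        # automatic comparison, if feasible
            if output == eod[i]:
                passed += 1
            else:
                failed += 1
                print(f"test {i}: input={test_input!r} "
                      f"expected={eod[i]!r} got={output!r}")
    print(f"{len(tid)} tests run, {passed} passed, {failed} failed")
    return tod                                     # the Test Output Domain
```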

Test drivers are essential for economic regression testing. Regression testing consists of retrying previously passed test cases, after having modified P, to check that the changes have not also introduced unwanted effects.

7. THE COMPARISON BETWEEN EOD AND TOD

When EOD has been specified exactly, the comparison between TOD and EOD can be performed by an automated tool called the comparator.

A comparator is a simple utility that compares the contents of two files, like the command DIFF provided by the UNIX system. Some comparators also provide the capability to mask a part of EOD, thus allowing some unimportant differences to be tolerated.

Comparators are often included in test drivers. If EOD has not been specified in advance, as hinted in Section 5, the comparison must be performed by the operator through the inspection of TOD.
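
A comparator with a simple masking capability could be sketched as follows; the masking convention, a list of regular expressions identifying fields to ignore (e.g., timestamps), is an assumption made for illustration.

```python
import re

# Illustrative comparator: it compares the TOD and EOD files line by line,
# first blanking out any portion that matches one of the mask patterns
# (for example a timestamp or another machine-dependent field).
def compare_files(tod_path, eod_path, mask_patterns=()):
    def normalize(line):
        for pattern in mask_patterns:
            line = re.sub(pattern, "<masked>", line)
        return line.rstrip("\n")

    with open(tod_path) as f1, open(eod_path) as f2:
        tod_lines = [normalize(line) for line in f1]
        eod_lines = [normalize(line) for line in f2]

    differences = [(i + 1, t, e)
                   for i, (t, e) in enumerate(zip(tod_lines, eod_lines))
                   if t != e]
    if len(tod_lines) != len(eod_lines):
        differences.append(("file length", len(tod_lines), len(eod_lines)))
    return differences                      # an empty list means the outputs match
```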

8. THE EXIT CRITERION

Unfortunately, in practice, the stopping of testing is often decided arbitrarily on the basis of management constraints, e.g., the budget available or the time schedule, whereas the right approach is that of establishing a criterion on which to evaluate testing extensiveness and eventually to decide when to stop testing.

Typically, this exit criterion can be based on: 1) the achieved coverage value on the basis of a path selection criterion; 2) a reliability measure [12]; 3) the error rate for a set of intentional errors.

The path coverage value is evaluated by means of program instrumenters, as already outlined in Section 4. When the coverage measure reaches a given (high) threshold, testing is stopped. Consider that 100% coverage for a given criterion is not always feasible since portions of dead code may be present; in addition, when a substantial path coverage has already been reached, raising it further becomes increasingly expensive.


The exit criterion can be based on a reliability measure when TID has been selected randomly, according to the operational distribution. Very briefly, the errors discovered are related to the execution time at which they have been observed. When the failure intensity, i.e., the rate of failure detection, reaches a desired (low) threshold, testing is stopped.

For this test evaluation methodology, the operator has to define the operational distribution for P, which is then used by a random test data generator to select TID; the operator also establishes the desired failure intensity value. A tool for reliability measurement can be used to record the time intervals between failures and to calculate the failure intensity.
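
A very simple sketch of such a measurement is given below; the sliding window and the threshold are illustrative assumptions, and a real reliability measurement tool would typically fit a software reliability growth model rather than this crude estimate.

```python
# Illustrative reliability measurement: failure times (in hours of execution)
# are recorded, and the current failure intensity is estimated as failures per
# unit of execution time over a sliding window of the last k failures.
def failure_intensity(failure_times, k=5):
    if len(failure_times) < k:
        return None                          # not enough observations yet
    window = failure_times[-k:]
    elapsed = window[-1] - window[0]
    return (k - 1) / elapsed if elapsed > 0 else float("inf")

def may_stop_testing(failure_times, threshold=0.01, k=5):
    intensity = failure_intensity(failure_times, k)
    return intensity is not None and intensity <= threshold
```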

Path coverage and failure intensity are adopted to evaluate testing extensiveness in structural testing and in random testing, respectively; it is more difficult to define a (measurable) criterion for functional testing beyond the obvious recommendation that every requirement stated for F must be tested. This is typically used as the criterion for acceptance testing, i.e., to verify that a commissioned piece of software complies with the specifications in its contract, in this case assumed as the reference model.

Another approach which permits a more objective evaluation of testing extensiveness could be to integrate functional and structural testing (see also the observation at the end of Section 4) as follows: first, TID is selected according to F and the achieved path coverage is evaluated; then TID is augmented in an attempt to cover additional paths, until a sufficient coverage is reached.

Finally, a methodology has been proposed to introduce intentional errors into the program and then to evaluate testing effectiveness on the rate of detection of these known errors. When a chosen (high) threshold in the rate of discovered errors against the total number of errors introduced is reached, testing is stopped. The errors can be put all together into the program under test (error seeding [5]) or they can be distributed among many similar versions of the program, called mutants (mutation testing [4]). Mutation tools are essentially test drivers which also compute the mutation rate, in brief, the percentage of erroneous versions detected. They also provide support to specify the mutation algorithm and then derive the mutant programs automatically.
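
The exit decision for these methodologies reduces to a simple ratio, sketched below; the threshold and the counts are illustrative.

```python
# Illustrative exit check for error seeding / mutation testing: stop when the
# fraction of intentionally introduced errors (or mutants) detected by the
# current test set exceeds a chosen threshold.
def seeded_error_exit(detected_seeded, total_seeded, threshold=0.95):
    detection_rate = detected_seeded / total_seeded
    return detection_rate >= threshold       # True: testing may be stopped

# Hypothetical usage: 47 of 50 seeded errors (or killed mutants) detected.
print(seeded_error_exit(47, 50))             # 0.94 < 0.95, so keep testing
```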

9. CONCLUSIONS

We have described the testing process and the associated tools. We have seen that:

• clerical activities, such as launching the tests, recording the outputs, comparing files, reporting, etc., are automatically performed by test drivers.

We can observe that these activities can not only be automated at a low cost, but also that, given the increasing complexity of current software products, it is not thinkable to operate a testing activity without the support of a test driver.

• On the contrary, some tasks will always require the intervention of a competent operator, particularly the selection of test cases, even if some assistance is provided by test data generators.

To summarize, testing consists of a planning part and an operative part. While the latter part can gain much from the support of automated tools, the planning stage is mostly based on human ingenuity and competence.

We cannot conclude this overview on software testing without pointing out how the effectiveness of the testing phase is strictly dependent on the previous phases in the life cycle. In fact, we stress again that a program is always validated against a model of its correct functioning. Thus, before entering the implementation and testing phases, we must specify precisely and document P's expected behavior.

REFERENCES

1. W. R. Adrion, M. A. Branstad, and J. C. Cherniavsky, Validation, Verification and Testing of Computer Software, ACM Comp. Surveys, 14, 159-192 (1982).
2. L. A. Clarke and D. J. Richardson, Applications of Symbolic Evaluation, J. of Systems and Software, 5, 15-35 (1985).
3. R. A. DeMillo et al., Software Testing and Evaluation, The Benjamin/Cummings Publishing Company, Inc., 1987.
4. R. A. DeMillo, R. J. Lipton, and F. G. Sayward, Hints on Test Data Selection: Help for the Practicing Programmer, Computer, 11, 34-41 (1978).
5. J. W. Duran and J. J. Wiorkowski, Capture-Recapture Sampling for Estimating Software Error Content, IEEE Tr. on Software Eng., SE-7, 147-148 (1981).
6. M. E. Fagan, Design and Code Inspections to Reduce Errors in Program Development, IBM Systems Journal, 15, 219-248 (1976).
7. R. L. Glass, Real-Time: The "Lost World" of Software Debugging and Testing, Comm. ACM, 23, 264-271 (1980).
8. J. B. Goodenough and S. L. Gerhart, Toward a Theory of Test Data Selection, IEEE Tr. on Software Eng., SE-1, 156-173 (1975).
9. S. L. Hantler and J. C. King, An Introduction to Proving the Correctness of Programs, ACM Comp. Surveys, 8, 331-353 (1976).
10. J. Laski, On Data Flow Guided Program Testing, SIGPLAN Notices, 17, 62-71 (1982).
11. E. F. Miller, Software Testing Technology: An Overview, in Handbook of Software Engineering (C. R. Vick and C. V. Ramamoorthy, eds.), Van Nostrand Reinhold Company, 1984.
12. J. D. Musa and A. F. Ackerman, Quantifying Software Validation: When to Stop Testing?, IEEE Software, 6, 19-27 (1989).
13. R. W. Selby, Combining Software Testing Strategies: An Empirical Evaluation, Proceedings, Workshop on Software Testing, July 1986, pp. 82-90.
14. E. J. Weyuker, The Evaluation of Program-Based Software Test Data Adequacy Criteria, Comm. ACM, 31, 668-675 (1988).