
Confidence intervals for capture–recapture estimations in software inspections



Thomas Thelin*, Håkan Petersson, Per Runeson

Department of Communication Systems, Lund University, P.O. Box 118, SE-221 00 Lund, Sweden

Received 23 January 2002; revised 29 May 2002; accepted 31 May 2002

Abstract

Software inspections are an efficient method to detect faults in software artefacts. In order to estimate the fault content remaining after

inspections, a method called capture–recapture has been introduced. Most research published in fault content estimations for software

inspections has focused upon point estimations. However, confidence intervals provide more information about the estimation results and are

thus preferable. This paper replicates a capture–recapture study and investigates confidence intervals for capture–recapture

estimators using data sets from two recently conducted software inspection experiments. Furthermore, a discussion of practical

application of capture–recapture with confidence intervals is provided. In capture–recapture, used for software inspection, most research papers

have reported Mh-JK to be the best estimator, but only one study has investigated its subestimators. In addition, confidence intervals based on the

log-normal distribution have not been evaluated before with software inspection data. These two investigations together with a discussion provide

the main contribution of this paper. The result confirms the conclusions of the replicated study and shows when confidence intervals for capture–

recapture estimators can be trusted. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Capture–recapture models; Confidence interval; Empirical study; Replication; Software inspection

1. Introduction

Inspections have been established as important contri-

butors to improved software quality. They may contribute to

the quality improvement in different ways: firstly, by faults

being removed through the inspection effort; secondly, by being an

efficient means for communication within a project group,

thereby reducing the risk of errors; and thirdly, by enabling

quantitative quality management, i.e. decisions on release or

further inspections can be made based on an estimate of the

remaining number of faults in the inspected artefact.

The research in the field of software inspections has

addressed all the three aspects. Different reading

techniques have been developed and evaluated, such as

checklist-based reading (CBR) [18], defect-based reading

[29], perspective-based reading [1] and usage-based

reading (UBR) [35]. The use and effect of inspection

meetings, which were very much focused on in the initial

inspection work by Fagan [17], have been evaluated [22,

38]. Finally, the estimation of remaining faults after

inspections has been developed using statistical tech-

niques primarily from the biology field. This began with

Eick et al. [13,14], who started using capture–recapture

techniques which originally were used for estimating

populations [11].

Several studies have been performed to evaluate and

improve the estimates [3–5,16,20,27,31,34,40]. The gener-

alized outcome of the studies is: for an inspection team of

four or larger, the model assuming the same probabilities for

all reviewers varying over different faults (model Mh),

combined with the Jackknife estimator, performs best for

estimation of remaining faults [28]. A few studies point out

other models that work better for specific data sets, but most

studies are in favour of the Mh-JK model.

Furthermore, Miller [24] is even more specific. He points

out one of the subestimators of Mh-JK to perform better.

The Jackknife estimator produces estimators of different

orders, and Miller finds that the model of first order

performs better than the other estimators. He does not

compare the orders with the full estimator. However, he

argues that the subestimators might be better due to different

preconditions for estimates in the software engineering field

compared to the biology domain.

0950-5849/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.

PII: S0950-5849(02)00095-2

Information and Software Technology 44 (2002) 683–702

www.elsevier.com/locate/infsof

* Corresponding author. Tel.: +46-46-222-3863; fax: +46-46-14-5823.

E-mail addresses: [email protected] (T. Thelin), hakan.

[email protected] (H. Petersson), [email protected]

(P. Runeson).

Despite all this effort being spent on estimation models,

little work is performed on confidence intervals for the

estimators [2,3,37]. By estimating the confidence interval

for the number of remaining faults, the risk taken in a

decision based on the estimate is under better control,

compared to when only a single point estimate is available.

The result of a capture–recapture estimator using a confi-

dence interval is a point estimate with a lower and an upper

bound. The probability that the point estimate is correct is

very small and unknown. Instead, by using a confidence

interval, the probability that the correct value is within the

range is much larger. If the nominal confidence value is

chosen to be 95%, the probability is equal to 0.95 that the

correct value is within its range. Hence, a confidence interval

provides valuable information for a software

project manager during the inspection process.

This paper focuses on point and confidence interval

estimations of remaining faults, using capture–recapture

techniques. Firstly, the study by Miller is replicated, using

data from two recently conducted inspection experiments.

Secondly, the confidence intervals for the estimates are

investigated, to better control the risks when making

decisions based on the estimates.

The replication result confirms Miller’s argumentation

that the subestimators of Mh-JK perform better than the full

estimator. However, in this evaluation, order 2 of Mh-JK is

the best subestimator considering bias, variance and the

confidence interval investigation. The confidence interval

recommended is based on the log-normal distribution.

Furthermore, in order to aid future research and prac-

titioners using capture–recapture, an extensive discussion

of the results is provided.

In Section 2, the basics of fault content estimations using

capture–recapture are briefly presented. Section 3 gives an

overview of the origin of the inspection data used. Section 4

presents the research questions, analysis method for this

study, a summary of the replicated study and the confidence

intervals. The analysis itself is presented in Section 5,

followed by a discussion of the results in Section 6

combined with related work. Finally, conclusions are

drawn in Section 7.

2. Fault content estimation

In order to describe and explain the different ways of

making fault content estimations, the following terminology

will be used in this paper. An estimator is a formula used to

predict the number of faults remaining in an artefact. A model

is the umbrella term for a number of estimators assuming the

same prerequisites. A method is a type of approach to make an

estimation. A method may contain a number of models which

in turn may contain many estimators.

Fault content estimation models originate mainly from

ecological studies of animal population. These capture–

recapture models [11] all stem from biologists seeking to

estimate the population of certain species in an area. By (a)

capturing animals, (b) marking them and then (c) releasing

them to perhaps be recaptured at a later time, it is possible

through statistical models to estimate the total number of

animals in the area. The statistical models that biostatisti-

cians use include models for both open and closed

populations, i.e. populations which change or do not

change between trapping occasions, respectively. When

using the capture–recapture models with software inspec-

tions, it is natural to assume closed populations. This is

because the number of faults in an inspected artefact is

likely to be constant as the ‘trapping occasions’ occur at the

same time on the same version of the inspected artefact.

The first introduction of using capture–recapture within

the field of software engineering was made by Mills in 1972

[25]. Mills seeded faults into a system and applied capture–

recapture on the information on faults found during testing

to attain an estimate of the number of remaining faults. The

first application of capture–recapture to software inspec-

tions was made by Eick et al. in 1992 [13].

Not only capture–recapture methods have been used to

make fault content estimations. Another method that utilizes

the same information as the capture–recapture method is

the curve fitting method [41]. The curve fitting method sorts

and plots the fault data from an inspection. Then, it

estimates by fitting a mathematical function to the data.

Another fault content estimation method is the use of

subjective estimates [2,15]. Subjective estimates mean that

the reviewers themselves produce an estimate based on their

experiences and feelings of the inspected artefact. This type

of method is not included in this study.

2.1. The capture–recapture method

The capture–recapture models in biology that are used

with software inspections have some restrictions [26]. The

most restrictive models make the following assumptions.

1. The population is closed,

2. animals do not lose their marks during the experiment,

3. all marks are correctly noted and recorded at each

trapping occasion, and

4. each animal has a constant and equal probability of

capture on each trapping occasion. This also implies

that capture and marking do not affect the catchability

of the animal.

These restrictions were translated to a software inspec-

tion context by Miller [24].

1. Once the document is issued for inspection, it must not be

changed; and the performance of the reviewers should be

constant, i.e. given the same document the reviewers

should find the same faults.

2. Reviewers must not reveal their proposed faults to other

reviewers.


3. Reviewers must ensure that they accurately record and

document every fault they find. Additionally, the

inspection process, for example, at the collection meet-

ing, must not discard any correct faults.

4. All reviewers must be provided with identical infor-

mation, in terms of source materials, standards, inspec-

tion aids, etc. and this material must be available to them

at all times.

Assumption 4 also implies equality between reviewer

abilities and the complexity of finding different faults.

Depending on the degree of freedom allowed for the detection ability of the

reviewers and the detection probability of the faults, four

basic models are formed:

† M0, all faults have equal detection probability, all

reviewers have equal detection ability.

† Mt, all faults have equal detection probability, reviewers

may have different detection abilities.

† Mh, faults may have different detection probabilities, all

reviewers have equal detection ability.

† Mth, faults may have different detection probabilities,

reviewers may have different detection abilities.

There is also a model called Mb, which in biology is

connected to an animal changing its probability when being

trapped once. This model can be combined with the others

creating models Mtb, Mhb and Mthb. However, these

models have not been used for software inspections.
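To make the four model assumptions concrete, the following sketch (not from the paper; Python with numpy, and with arbitrary illustrative probability values) simulates a binary fault-by-reviewer detection matrix under each assumption, where p[i, j] is the probability that reviewer j detects fault i.

```python
import numpy as np

def simulate_detections(p, seed=0):
    """Draw a binary detection matrix (faults x reviewers) from a
    probability matrix p, where p[i, j] is the probability that
    reviewer j detects fault i."""
    rng = np.random.default_rng(seed)
    return (rng.random(p.shape) < p).astype(int)

n_faults, n_reviewers = 37, 5

# M0: one common detection probability for all faults and reviewers.
p_m0 = np.full((n_faults, n_reviewers), 0.3)

# Mt: probability varies per reviewer (columns), constant over faults.
p_mt = np.tile(np.linspace(0.2, 0.4, n_reviewers), (n_faults, 1))

# Mh: probability varies per fault (rows), constant over reviewers.
p_mh = np.tile(np.linspace(0.05, 0.6, n_faults)[:, None], (1, n_reviewers))

# Mth: probability varies over both faults and reviewers.
p_mth = p_mh * np.linspace(0.8, 1.2, n_reviewers)

detections = simulate_detections(p_mh)
```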

Connected to each model there are a number of

estimators (Table 1). The estimators Mh-JK and Mth-Ch

can both be calculated with different orders of subestima-

tors. In the case of Mh-JK, a procedure of what order to

select is suggested by Burnham and Overton [6]. Miller

argued for not using this selection algorithm in an inspection

context and therefore evaluated the different orders

separately. He found that order 1 of Mh-JK performed

best of the five investigated subestimators [24]. In the study

by Miller and in this study, the Mh-JK’s and Mth-Ch’s

orders are all treated separately. Table 2 shows the formulae

of the five first subestimators of Mh-JK.

These estimators are used by Miller and in this paper.

These are also used in the selection procedure suggested by

Burnham and Overton [6].

2.2. The curve fitting method

Wohlin and Runeson introduced two kinds of curve

fitting methods [41]. The detection profile method (DPM)

and the cumulative method. Of these two models, DPM has been

shown to perform best. In DPM, the faults are sorted

decreasingly in terms of the number of reviewers that found

each fault, and an exponential curve is fitted to the data. The

estimate is then attained by extrapolating the curve to a

predetermined value. The predetermined value is the limit

value at the y-axis, where the number of remaining faults is

estimated by reading off the x-value. The predetermined

value was set to 0.5 in the initial study by Wohlin and

Runeson.
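As an illustration of the DPM procedure just described, the following sketch sorts the per-fault detection counts, fits an exponential curve by least squares on the log scale, and reads off the x-value where the fitted curve reaches the predetermined value 0.5. This is a minimal reading of the method; the function name, the fitting details and the example data are ours, and the original formulation in [41] may differ.

```python
import numpy as np

def dpm_estimate(detections, threshold=0.5):
    """Detection profile method sketch: sort found faults by how many
    reviewers detected each one, fit a decaying exponential to that
    profile, and return the x-value where the curve drops to the
    threshold, i.e. the estimated total number of faults."""
    counts = detections.sum(axis=1)              # reviewers per found fault
    counts = np.sort(counts[counts > 0])[::-1]   # decreasing profile
    x = np.arange(1, len(counts) + 1)
    # Fit y = a * exp(b * x) via least squares on log(y).
    b, log_a = np.polyfit(x, np.log(counts), 1)
    a = np.exp(log_a)
    # Solve a * exp(b * x) = threshold for x.
    return np.log(threshold / a) / b

# Example: rows are found faults, columns are reviewers (1 = detected).
data = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0]])
print(dpm_estimate(data))
```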

DPM is included in this study since it has performed well

in comparison to other capture–recapture estimators, and is

regarded as the best curve fitting method for software

inspections [33].

3. Experiment description

In order to evaluate the capture–recapture estimators and

the confidence intervals for the estimators, data from two

experiments are used. The experiments were conducted in

two university environments in Sweden, where third or

fourth year students inspected software artefacts belonging

to a taxi management system. The system was developed

using a real taxi driver and a taxi dispatcher as stake-

holders, although the system is limited in scope compared to

their existing system. This section gives an overview of the

two experiments. A detailed description of the experiments’

subjects, variables, design, hypotheses, threats to validity,

operation, and results is provided in Refs. [35,36],

respectively.

3.1. First experiment

In the first experiment [35], two variants of a new reading

technique, UBR were compared. The subjects were third

year students at the software engineering Bachelor’s

programme at Lund University. The design specification

for the taxi management system was inspected, and the

reading was guided by a requirements specification

primarily based on use cases [21]. The design specification

comprised nine pages (2300 words) and contained 37 faults.

One subject group (14 subjects) got the use cases ranked

according to the importance for the user (UBRord). The other

subject group (13 subjects) got randomly ordered use cases

(UBRrand).

It is concluded from the experiment that the group using

the ordered set of use cases is more efficient and effective in

finding the faults that are most important from a user

perspective. Efficiency is defined as number of faults found

per time unit and effectiveness is defined as the fraction of

the number of faults found out of the total number of faults

in the document. The average capture probability is a

measure of how likely it is that a fault is detected by

the reviewers. The average capture probability was 0.31 for

the group using ordered use cases and 0.23 for the group

using randomized use cases. In Fig. 1, the capture

probability of each fault is shown and in Appendix B, the

raw data together with capture probabilities for each fault

and reviewer are provided.
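For raw data such as that in Appendix B, these measures can be computed directly from a faults-by-reviewers detection matrix. The sketch below is our own illustration in Python/numpy; the exact operationalization used in the experiments may differ in detail.

```python
import numpy as np

def inspection_measures(detections, total_faults):
    """Effectiveness and average capture probability from a binary
    faults-found x reviewers matrix for one inspection."""
    unique_found = int((detections.sum(axis=1) > 0).sum())
    effectiveness = unique_found / total_faults
    # Average capture probability: mean per-reviewer detection rate over
    # all faults in the document (undetected faults count as zeros).
    n_reviewers = detections.shape[1]
    avg_capture = detections.sum() / (total_faults * n_reviewers)
    return effectiveness, avg_capture

# Example with made-up data: 4 found faults, 3 reviewers, 37 faults in total.
data = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 0]])
print(inspection_measures(data, total_faults=37))
```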


3.2. Second experiment

In the second experiment [36], the same design

specification was inspected, but in this case, the UBR with

ordered use cases was compared to CBR. The students

participating as reviewers in the study were fourth year

software engineering Master students at Blekinge Institute

of Technology in Sweden. In order to give as similar

circumstances for the two groups as possible, they were

given a textual requirements specification for the system.

The UBR group (11 subjects) was given a set of use cases in

addition to the requirements, and the CBR group (12

subjects) was given a checklist.

The experiment shows that the effectiveness and

efficiency are higher for the UBRord group compared to

the CBR group. The average capture probability was 0.31

for the UBRord group and 0.26 for the CBR group. In Fig. 2,

the capture probability of each fault is shown and in

Appendix B, the raw data together with capture probabilities

for each fault and reviewer are provided.

3.3. Data

The data from the above described experiments are used

in this paper to evaluate the performance of capture–

recapture estimators. They represent different reading

techniques, and also, which is of interest to this study,

different capture probabilities.

UBR guides reviewers to focus their inspection effort to

detect the most important faults from a user perspective.

Hence, some faults have higher probability to be found than

others and this is not necessarily true for an inspection

performed in a software organization. Therefore, the data

may not represent an inspection where ad hoc or CBR is

used. However, studies of perspective-based reading [1],

which also focus on subsets of the faults, have shown that

the capture–recapture estimators are robust for PBR data

[19,20,32]. Furthermore, one of the data sets is from CBR,

which does not have this restriction. Consequently, the CBR

data set is primarily used in the replication part of the paper,

and the other data sets are used as complements. For the

confidence interval investigation, we have argued that the

use of UBR will not affect the results. However, in order to

extend the knowledge, this part needs to be replicated.

4. Method

In Sections 4.1–4.4, the research questions, the analysis strategy, a summary of the

replicated study and the confidence intervals are presented.

The outline of the study is presented in Fig. 3. Firstly, the

study conducted by Miller is replicated. Secondly, confi-

dence intervals for the best estimator are investigated. These

Table 1

The models and estimators in capture–recapture. More capture–recapture

models exist, but have not been used for software inspections

Model Estimators

M0 M0-ML—maximum likelihood [26]

Mt Mt-ML—maximum likelihood [26]

Mt-Ch—Chao’s estimator [9]

Mh Mh-JK—Jackknife [6]

Mh-Ch—Chao’s estimator [8]

Mth Mth-Ch—Chao’s estimator [10]

Table 2

The formulae of the five first subestimators (orders) of Mh-JK [6]

Order 1: $N_{J1} = D + \frac{k-1}{k} f_1$

Order 2: $N_{J2} = D + \frac{2k-3}{k} f_1 - \frac{(k-2)^2}{k(k-1)} f_2$

Order 3: $N_{J3} = D + \frac{3k-6}{k} f_1 - \frac{3k^2-15k+19}{k(k-1)} f_2 + \frac{(k-3)^3}{k(k-1)(k-2)} f_3$

Order 4: $N_{J4} = D + \frac{4k-10}{k} f_1 - \frac{6k^2-36k+55}{k(k-1)} f_2 + \frac{4k^3-42k^2+148k-175}{k(k-1)(k-2)} f_3 - \frac{(k-4)^4}{k(k-1)(k-2)(k-3)} f_4$

Order 5: $N_{J5} = D + \frac{5k-15}{k} f_1 - \frac{10k^2-70k+125}{k(k-1)} f_2 + \frac{10k^3-120k^2+485k-660}{k(k-1)(k-2)} f_3 - \frac{(k-4)^5-(k-5)^5}{k(k-1)(k-2)(k-3)} f_4 + \frac{(k-5)^5}{k(k-1)(k-2)(k-3)(k-4)} f_5$

NJm is the estimated number of faults for order m using Jackknife, D is the unique number of faults found, k is the number of reviewers, and fi is the number

of faults found by i reviewers.
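The lowest orders can be computed directly from the quantities defined above. The sketch below (Python/numpy; the wrapper, variable names and example data are ours) derives D, k and the fi from a binary detection matrix of the found faults and evaluates NJ1 and NJ2 from Table 2.

```python
import numpy as np

def jackknife_orders(detections):
    """First- and second-order Mh-JK subestimators (Table 2), computed
    from a faults-found x reviewers binary matrix."""
    k = detections.shape[1]                  # number of reviewers
    per_fault = detections.sum(axis=1)       # reviewers detecting each fault
    D = int((per_fault > 0).sum())           # unique number of faults found
    f = lambda i: int((per_fault == i).sum())  # faults found by exactly i reviewers
    nj1 = D + (k - 1) / k * f(1)
    nj2 = D + (2 * k - 3) / k * f(1) - (k - 2) ** 2 / (k * (k - 1)) * f(2)
    return nj1, nj2

# Example usage with a small made-up matrix (rows = faults, cols = reviewers).
data = np.array([[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0]])
print(jackknife_orders(data))
```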


two evaluations use data from the two experiments

described in Section 3.

4.1. Research questions

Two main research questions are addressed in this

investigation.

1. Replication.

Is the result in the study by Miller valid for other data

sets?—A replication of the study performed by Miller

[24] is carried out. The aim of the replication is to

select the best estimators to be used for further

analysis.

2. Confidence interval.

(a) Which capture–recapture estimator has the best

confidence interval coverage and range?—Confi-

dence intervals are evaluated for the selected

estimators in research question 1. The aim is to

select a distribution and the best estimator to be

used for fault content estimations.

(b) Which nominal confidence value is best to use?—

Variation of nominal values is investigated for the

best estimator in research questions 1 and 2a. 95

and 99% are normally used as nominal confidence

values. The aim is to investigate whether other

values provide more correct results, considering

the coverage and range of the confidence

intervals.

4.2. Measures and method

The analysis procedure is divided into three steps. Each

procedure is connected to one research question (1, 2a and

2b).

Research question 1 (Replication). The measures used

for comparing the estimators are the relative error (RE), the

standard deviation and the root mean square error (RMSE)

[12]. The RE is a measure of bias and the standard deviation

is a measure of dispersion. The RE is defined as:

$\mathrm{RE} = \dfrac{\text{Estimated number of faults} - \text{True number of faults}}{\text{True number of faults}}$

RMSE is a measure of accuracy, which takes into account

both standard deviation and bias of an estimator. The RMSE

value is defined as:

$\mathrm{RMSE} = \sqrt{E[(\hat{Q} - Q)^2]} = \sqrt{\mathrm{StdDev}^2 + \mathrm{Bias}^2}$

where $\hat{Q}$ is the estimated and $Q$ the true number of faults, StdDev is the standard deviation and Bias is the

Fig. 1. The capture probability for each fault (first experiment).

Fig. 2. The capture probability for each fault (second experiment).


difference between the number of true faults and the

number of estimated faults. These measures are evaluated

for teams of two to nine reviewers.
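A minimal sketch of these two measures over a set of virtual-team estimates (Python/numpy; the example values are made up):

```python
import numpy as np

def relative_error(estimates, true_faults):
    """RE of each virtual-team estimate, per the definition above."""
    estimates = np.asarray(estimates, dtype=float)
    return (estimates - true_faults) / true_faults

def rmse(estimates, true_faults):
    """Root mean square error over the virtual-team estimates."""
    estimates = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((estimates - true_faults) ** 2))

# Example: estimates from three virtual teams against 37 true faults.
print(relative_error([30, 35, 41], 37))
print(rmse([30, 35, 41], 37))
```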

Research question 2a (coverage and range of confidence

intervals). The confidence intervals are evaluated by

measuring the percentage of the correct number of faults

within the nominal confidence interval (coverage). This is

carried out using 95% nominal confidence interval for two

to nine reviewers, for four different data sets with different

capture probabilities. Furthermore, the range of the

confidence intervals is investigated and presented in box

plots.

Research question 2b (variation of nominal confidence

value). Nominal confidence values ranging from 60 to 99%

are presented for the estimator that is best with respect to

accurate estimations, confidence interval coverage and

range. The nominal confidence value is used as input to

the calculation of confidence intervals. For example, if the

nominal confidence value is decided to be 95%, the

confidence interval should ideally include the correct

number of faults 95 of 100 times.
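Given the confidence intervals produced for the virtual teams, coverage and range can be summarized as in the following sketch (our own illustration; the interval values are made up):

```python
import numpy as np

def coverage_and_range(intervals, true_faults):
    """Coverage: fraction of intervals that contain the true fault count.
    Range: interval width, summarized here by its median."""
    intervals = np.asarray(intervals, dtype=float)
    lower, upper = intervals[:, 0], intervals[:, 1]
    coverage = np.mean((lower <= true_faults) & (true_faults <= upper))
    median_range = np.median(upper - lower)
    return coverage, median_range

# Example: three intervals from different virtual teams, 37 true faults.
print(coverage_and_range([(28, 40), (25, 33), (30, 45)], 37))
```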

An alternative evaluation measure for capture–recapture

has been proposed by El Emam et al. [15], called decision

accuracy (DA). Instead of measuring the accuracy of the

estimators in terms of bias, DA measures the number of

correct decisions taken by an estimator. This can, for

example, be compared to the number of correct decisions

taken, when reinspections are never made, resulting in the

relative decision accuracy (RDA). However, there are some

limitations of RDA, and one of these is that a threshold

needs to be chosen beforehand, which should be based on

historical data [33]. Due to the restrictions of RDA and the

fact that this paper replicates a study where the DA is not

used, this measure is not utilized in this study.

A measure that is important for some estimators is the

failure rate, i.e. the number of times an estimator cannot

provide an estimate, for example, when no overlap exists.

Some of the estimators, for example, Mh-JK can always

provide an estimation result. This is further discussed in

Section 5.1.

In order to investigate the impact of the number of

reviewers on the estimation results, virtual inspections are

used. This means that all possible combinations of

reviewers are analysed for each group size. For each

group size, every reviewer is included several times. The

number of times depends on the number of reviewers to

combine. For example, if 11 reviewers are used in an

experiment and the team size is three, the number of

combinations is $\binom{11}{3} = 165$, which equals the number of virtual teams. Virtual inspec-

tions are commonly used for capture–recapture investi-

gations using software inspection data to evaluate different

sizes of groups when the reviewers have inspected

individually [4,5,27,34].
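Enumerating the virtual teams amounts to taking all column combinations of the individual detection data, as in the following sketch (Python; the function name and the random example data are ours):

```python
from itertools import combinations

import numpy as np

def virtual_teams(detections, team_size):
    """Yield the detection matrix restricted to every possible team of the
    given size, i.e. all column combinations of the individual reviewers."""
    n_reviewers = detections.shape[1]
    for cols in combinations(range(n_reviewers), team_size):
        yield detections[:, list(cols)]

# Example: 11 reviewers and team size 3 give C(11, 3) = 165 virtual teams.
data = np.random.default_rng(0).integers(0, 2, size=(37, 11))
print(sum(1 for _ in virtual_teams(data, 3)))  # 165
```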

4.3. Summary of the replicated study

Miller [24] evaluates all capture–recapture estimators

described in Section 2, except DPM. Instead of using the full

estimator of Mh-JK and Mth-Ch, he uses all subestimators

of Mh-JK (1–5) and Mth-Ch (1–3). Miller argues that the

procedure used to select among the subestimators of Mh-

JK and Mth-Ch, as implemented in the software program

CAPTURE [30], is not applicable for software inspections.

The result may be that one of the subestimators might

produce a more accurate result than the full estimator.

Therefore, he uses all orders separately in the evaluation.

Fig. 3. The outline of the investigation presented in this study. The figures refer to section numbers in the paper.


The data come from two previously conducted experiments,

where the purposes were to evaluate tool support for

inspections and defect-based reading, respectively. The

investigation uses RE and box plots to evaluate the

estimators for three to six reviewers. The results are (a)

the estimators underestimate, (b) Mh-JK order 1 is the best

estimator and can be used for three to six reviewers and (c)

Mth-Ch (order 1 and 3) might be appropriate to use if many

reviewers inspect.

Miller’s result that most estimators underestimate when

using software inspection data is consistent with most of the

previously conducted research, which is summarized and

discussed by Petersson et al. [28]. The second and third of

Miller’s results (see (b) and (c) earlier) have not been

replicated before. In addition, Mh-JK’s subestimators have

not been compared to the full estimator using software

inspection data.

Hence, as a replication, this paper provides further

knowledge of the estimators used for capture–recapture

estimations.

4.4. Confidence intervals

The confidence intervals used in this study are based on

the normal and the log-normal distributions. The assump-

tion of normal distribution is only fulfilled for large samples

[26] (many reviewers and faults), which makes it less useful

for software inspections. The other distribution used

assumes that the faults not detected are log-normally

distributed [7]. In the software program CAPTURE [30],

this distribution is used for all estimators implemented. In

this study, an evaluation for both distributions is carried out

using software inspection data from two experiments

described in Section 3 [35,36].

A 95% nominal confidence interval means that 95% of

the correct values should be within the interval in the ideal

case. The 95% confidence interval, assuming normal

distributions is calculated as [26]:

$\left[ N - 1.96\sqrt{\mathrm{Var}(N)},\; N + 1.96\sqrt{\mathrm{Var}(N)} \right]$

where N is the estimated number of faults. The 95%

confidence interval using the log-normal distribution is

calculated as [7]:

$\left[ D + \dfrac{N - D}{C},\; D + C(N - D) \right], \qquad C = \exp\!\left( 1.96\sqrt{\log\!\left( 1 + \dfrac{\mathrm{Var}(N)}{(N - D)^2} \right)} \right)$

where D is the unique number of faults found. The variance

of the estimated number of faults uses different formulae

depending on the estimator used.
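For a given estimate N, unique fault count D and estimator-specific variance, the two intervals above can be computed as in the sketch below (Python/numpy; our own wrapper, with made-up example numbers). For nominal confidence values other than 95%, 1.96 is replaced by the corresponding standard normal quantile.

```python
import numpy as np

def lognormal_interval(n_hat, d, var_n, z=1.96):
    """Log-normal confidence interval for the estimated number of faults,
    following the formula above: n_hat is the estimate N, d the unique
    number of faults found D, var_n the estimator-specific Var(N), and z
    the normal quantile (1.96 for 95% nominal confidence)."""
    c = np.exp(z * np.sqrt(np.log(1.0 + var_n / (n_hat - d) ** 2)))
    return d + (n_hat - d) / c, d + c * (n_hat - d)

def normal_interval(n_hat, var_n, z=1.96):
    """Normal-approximation interval, for comparison."""
    half = z * np.sqrt(var_n)
    return n_hat - half, n_hat + half

# Example: estimate of 32 faults, 25 unique faults found, variance 20.
print(lognormal_interval(32, 25, 20))
print(normal_interval(32, 20))
```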

Three studies have evaluated confidence intervals for

capture–recapture estimators in software inspections [2,3,

37]. Vander Wiel and Votta [37] carried out a simulation

and investigated Mh-JK and Mt-ML for five reviewers with

different capture probabilities. They investigate three

confidence intervals, two for Mt-ML and one for Mh-JK.

The findings in the study are that Mt-ML performs best of

the estimators, which is contradicted by most later results

using software inspection data [2,4,5,24,28]. The confidence

intervals for Mt-ML are based on the normal distribution

(Wald) as well as on likelihood-ratio statistics (likelihood-

ratio confidence). The confidence interval for Mh-JK is the

one suggested by Burnham and Overton when they

developed Mh-JK [6]. This is also based on the normal

distribution. Vander Wiel and Votta found that the

confidence intervals have either too large (conservative) or

too small ranges for five reviewers. When a too large range

is achieved, the confidence intervals are often too broad to

be useful. However, this was not investigated.

Biffl [2] and Biffl and Grossman [3] use confidence

intervals when comparing capture–recapture to subjective

estimations and for reinspections, respectively. These

studies use data from the same experiment and use teams

of four to six reviewers. The conclusion is that their

approach of confidence intervals does not work well. In Ref.

[2], the largest coverage was 74%, with a nominal

confidence value of 95%.

The larger the nominal confidence value is, the broader is

the confidence interval range. Therefore, the nominal

confidence values are varied to examine how the coverage

and range change and how well the confidence intervals

work. This is evaluated for the nominal confidence values

60–95% in steps of 5% together with 99%.

5. Analysis

The analysis is divided into three subsections. The result

of the replication is presented in Section 5.1 (research

question 1). In Section 5.2 (research question 2a) and

Section 5.3 (research question 2b), the confidence interval

investigation is presented.

5.1. Replication

In Table 3, the RMSE values are shown for the 14

estimators investigated. This is shown for the CBR data set;

RMSE, bias and standard deviation values of the estimators

are presented in Appendix A for all of the sets. All

estimators, except Mh-JK (CAPTURE) and DPM are

evaluated by Miller [24]. Mh-JK (CAPTURE) is the

estimator used in the software program CAPTURE [30],

which selects and interpolates between the five subestima-

tors of Mh-JK. The bold figures show the two estimators

with the best RMSE for each number of reviewers. Note that

for some cases, Mh-JK’s subestimators become the same

formula depending on the number of reviewers. For

example, for two reviewers, Mh-JK orders 1 and 2 become


the same formula, and hence the same estimation results are

obtained.

For two reviewers, DPM shows the best results. Since

DPM is based on fitting an exponential curve to the

inspection data, and fitting a curve to two levels is more or

less random, the interpretation is that no estimator works

well for two reviewers.

For three to nine reviewers, one of the Mh-JK’s

subestimators estimates most accurately. Similar results

are obtained for the other three data sets (Appendix A). One

of the orders 1 and 2 of Mh-JK is in most cases one of the

two best estimators and is in several cases the best estimator.

As expected, increasing the number of reviewers

improves the estimation results in most cases. This is because

more input data gives more reliable estimation

results.

All estimators have median RE below zero, which means

that, in general, they underestimate. This is not expressed by

the RMSE value, but is shown in Fig. 4 for the five best

estimators using three to six reviewers (see Appendix A for

all data sets). Underestimations confirm the results of Miller

[24] and most other previously conducted software

capture–recapture research evaluating the estimators, see

e.g. [2,4,5]. In these investigations, it is concluded that Mh-

JK is the best estimator for software inspections. However,

only the full estimator is used, except by Miller [24]. Miller

concludes that Mh-JK order 1 is the best. Since the result in

this paper confirms previously conducted research, Mh-JK

is regarded as the best estimator. Therefore, Mh-JK

(subestimators (1–5) and full estimator) is used in the

confidence interval investigation.

Since Mh-JK is the best estimator, the failure rate is not

discussed. Mh-JK never fails to estimate, and is thus not

affected negatively in such an evaluation. The failure rate

cannot change the result of the replication.

5.2. Confidence intervals

Both the normal distribution and the log-normal

distribution were used in the confidence interval evaluation.

The analysis of coverage and range showed that the normal

distribution is less accurate than the log-normal distribution.

Thus, only the log-normal distribution is used in the further

analysis and in the plots. Four data sets are used. However,

only three of them are represented in the plots, as two of

them show similar results. The reason that two of the data

sets show similar results is probably because both of them

have average capture probabilities equal to 0.31. The data

with equal capture probabilities come from the groups using

ordered use cases (UBRord) in both experiments (Section 3).

Hence, only one of these data sets is used in the further

analysis.

In Fig. 5, the 95% nominal confidence interval is shown

for Mh-JK orders 1–3 and the full estimator. Orders 4 and 5

are also used in the analysis but are less accurate than the

others, and are therefore not shown.

Table 3
Root mean square error (RMSE) of the estimators for 2–9 reviewers. The two best estimators for each number of reviewers are shown in bold. The data set is the one where the reviewers used CBR, see Section 3.2. Columns: number of reviewers; 1. M0-ML; 2. Mt-ML; 3. Mt-Ch; 4. Mh-JK (1); 5. Mh-JK (2); 6. Mh-JK (3); 7. Mh-JK (4); 8. Mh-JK (5); 9. Mh-JK (CAPTURE); 10. Mh-Ch; 11. Mth-Ch (1); 12. Mth-Ch (2); 13. Mth-Ch (3); 14. DPM. [Numeric table body not reproduced.]

Analysis of all three data sets shows that Mh-JK order 2

is in general the best one. For the data set with the average

capture probability equal to 0.31, the confidence intervals

work well for three and more reviewers using order 2 and

the full estimator. Order 1 has good coverage for four and

more reviewers. Order 3 and higher is less accurate for this

data set.

When the average capture probabilities decrease, more

reviewers are needed in order to achieve good results.

This is applicable for all orders. For the data set with

average capture probability equal to 0.23, order 2 needs

seven or more reviewers as input to produce a coverage

of over 90%.

None of the cases shows accurate results for two

reviewers. Higher average capture probability is probably

needed in order for the estimators to show good results for

two reviewers, although this is not investigated in this

study.

Consequently, the lower the average capture probability, the

more reviewers are needed, and the larger the average capture

probability (the higher the rate of faults found), the better the

coverage. For example, Mh-JK order 2 gives a coverage of

more than 90% for three reviewers for the data set with

capture probability of 0.31, but seven reviewers are

necessary to reach above 90% when the capture probability

is 0.23.

In Fig. 6, the ranges of the confidence intervals are shown

for three to six reviewers. The same data sets as in Fig. 5 are

used. The range is more affected by the number of reviewers

than by the average capture probability. This means that for

a fixed number of reviewers (e.g. three) and for a specific

estimator (e.g. Mh-JK order 1), the range is similar for all

three data sets.

The range itself does not give enough information in

order to select a good estimator. It has to be combined with

the coverage analysis. Since order 2 and the full estimator

have the best coverage, these are most important in the range

analysis.

Both these estimators have small ranges for three to six

reviewers. Order 2 has smaller variation in the range. The

full estimator is based on interpolation of the five

subestimators, which sometimes leads to broad ranges.

Combining the results from the coverage and range

analysis shows that Mh-JK order 2 is the most appropriate

estimator to use.

5.3. Variation of the nominal confidence

The nominal confidence values are used in this section to

investigate how the coverage and range vary with the

nominal confidence. If smaller nominal confidence values

are used, the result will be higher coverage of the correct

number of faults.

In Figs. 7 and 8, the coverage and the range of the

confidence intervals are shown, when the nominal confi-

dences are between 60 and 99%. These are shown for Mh-

JK order 2.

Similar patterns as in Section 5.2 are observed. When the

average capture probability is high, the confidence intervals

are trustworthy. However, when the average capture

probability decreases, less trustworthy results are obtained.

Furthermore, the more reviewers, the better is the

coverage. However, six reviewers are not sufficient when

Fig. 4. Box plot of the five most accurate estimators for three to six

reviewers. The numbers below the box plot relate to the estimators in

Table 3.

Fig. 5. The coverage of the confidence intervals using data sets that have average capture probabilities equal to 0.31, 0.26 and 0.23, from left to right.


the average capture probability equals 0.23. In that case, the

confidence intervals cannot be trusted. For nominal

confidence of 99%, three to six reviewers give a good

coverage, but then the range is large. However, to trust the

results, the 99% confidence interval may be preferable.

As expected, as the nominal confidence value increases,

the range of the confidence interval becomes broader.

However, the coverage then becomes better. For six

reviewers and a 99% confidence interval, the median

number of faults is 37, which is equal to the true number

of faults in that data set. For the other cases, the median is

less than 22 faults and the maximum less than 30, which

gives a good range. But then the coverage is more dependent

on the average capture probability than the number of

reviewers (at least for three to six reviewers).

6. Discussion

The results presented in Section 5 raise important

questions to be discussed. This section discusses the results

of the replication (research question 1), the confidence

intervals (research question 2), application of the results, the

findings in biostatistics regarding the estimators and future

work in the area of fault content estimation in software

inspections.

6.1. Replication

Miller [24] investigates 12 different estimators by

measuring bias and variance. His finding is that Mh-JK

order 1 is the best estimator to use. In the first part of this

paper, a replication of Miller’s evaluation was performed

using new data sets. In addition to the 12 estimators, the full

estimator of Mh-JK (implemented in CAPTURE [30]) and

the curve fitting method named DPM [41] were evaluated.

This investigation confirms Miller’s result in the sense

that the subestimators of Mh-JK are better to use for

software inspections than the full estimator. However,

instead of Mh-JK order 1, order 2 is shown to be more

appropriate to use when confidence intervals are

considered.

Miller also points out that Mh-JK order 1 can be used for

three to six reviewers. This result can, however, not be

confirmed, as it may depend on the data set used for

estimations. As shown, the higher the detection probability

Fig. 6. The ranges of the confidence intervals using data sets that have average capture probabilities equal to 0.31, 0.26 and 0.23, from left to right.

Fig. 7. The coverage of the confidence intervals using data sets that have average capture probabilities equal to 0.31, 0.26 and 0.23, from left to right. The

nominal confidence values are between 60 and 99%.


is, the more faults are found, which leads to better

estimation results. When many faults are found, capture–

recapture may be used for three reviewers. However, when

few faults are found, at least six reviewers are needed to

obtain accurate estimation results. In other words, the

estimators need sufficient amount of data as input in order to

estimate well. This is a general problem in estimations, i.e.

the more input data the more accurate output. To increase

the amount of data, different factors can be increased, for

example, the number of reviewers and the capture

probability. This is also confirmed in biostatistics, which

is discussed below.

Furthermore, Miller and most other evaluations of

capture–recapture estimators in software inspections have

pointed out that the estimators underestimate [28]. This is

confirmed in this investigation. All estimators underestimated

in general. This may be because the software artefacts used

for inspection contain faults that are very difficult to find.

In the study by Miller, data sets with higher average capture

probabilities than in this study were used. The average capture

probabilities in the study by Miller are 0.39, 0.48, 0.75 and

0.84 (per reviewer). Comparing the REs in the box plots

reveals that smaller underestimates are obtained. However, Miller

does not separate the data sets and hence does not

investigate the impact of average capture probability. A

mix of all the data sets is used, which results in a data set

with higher average capture probability than the ones used

in this study. Consequently, the estimates are closer to the

correct number of faults, i.e. smaller underestimations.

The subestimators of Mh-JK do not use all available

information if the number of reviewers is more than the

orders used. The information Mh-JK order 1 uses is the

number of faults that has been found exactly by one

reviewer. Mh-JK order 2 uses the number of faults that have

been found by one or two reviewers, in its estimation

formula. If the number of reviewers is more than three, all

available information is not used in the subestimators of

orders 1 and 2 (if some fault has been found by three or more reviewers). In

addition, the estimators have an upper estimation limit. The

larger order, the larger is the limit and this affects the

variance. Hence, if the bias is small for all subestimators,

order 1 of Mh-JK will probably estimate most accurately. So

in Miller’s case when the capture probability is large, the

bias is small, and then Mh-JK estimates well. In this

evaluation, the capture probability was smaller, resulting in

a larger bias, and thus a higher order is needed to estimate

well (in this case Mh-JK order 2).

To avoid this dilemma in biology, a procedure is

implemented that selects two subestimators and interpolates

between their estimation values. This procedure is used in

the full estimator (Mh-JK (CAPTURE)). However, this

study shows that the procedure is not better than using either

of Mh-JK order 1–3 for software inspection data.

6.2. Confidence intervals

The confidence intervals investigated are based on the

normal and log-normal distributions. The log-normal

distribution is most appropriate to use for software

inspection data. However, it cannot be trusted for all data

sets for all numbers of reviewers.

Four data sets were used in this investigation. They have

different average capture probabilities, which affect the

estimation results. Confidence intervals should cover the

nominal confidence value but not be too broad. Therefore,

the average capture probability, the number of reviewers

and the confidence values were varied in order to evaluate

the intervals for Mh-JK and its subestimators.

For the data sets with average capture probability equal

to 0.31, the confidence intervals reflect the nominal

confidence value well for three to nine reviewers using

Mh-JK order 2. This estimator estimates most accurately in

general considering bias, variance and confidence interval

coverage. When decreasing the average capture probability,

more reviewers are needed to achieve the nominal

confidence interval coverage. For the data set with average

capture probability equal to 0.23, seven reviewers are

needed. In this case, Mh-JK order 3 estimates better and

Fig. 8. The ranges of the confidence intervals using data sets that have average capture probabilities equal to 0.31, 0.26 and 0.23, from left to right. The nominal

confidence values are between 60 and 99%.


only five reviewers are needed to achieve the decided

nominal confidence value.

Hence, the larger the capture probability, the more accurate

point estimations and confidence intervals are achieved. A

project manager using the estimation result must be aware

of this. However, it is still better to use a confidence interval

to base one's decisions on than on only a point estimation. The

application of capture–recapture for project managers is

further discussed in Section 6.3.

6.3. Practical application of capture–recapture

For a software organization, we recommend using Mh-

JK order 2 with the log-normal confidence interval if the

number of reviewers is four or more. If the number of

reviewers is less than four, subjective estimations are more

preferable. In order to evaluate whether capture–recapture

estimations work in a specific organization, historical data

should be collected and evaluated [27]. This paper

contributes with a general investigation of capture–

recapture estimators and points out the best estimator to

start with for a software organization.

Capture–recapture estimations can be calculated before

the inspection meeting (using the individual inspection

record) or after the meeting (using the meeting record with

identified overlap). When a project manager receives an

estimate after the individual inspection, the manager has

several decision alternatives to take [28]. The estimation

result can, for example, show that there are several faults

left in the document. Then, the decision should be to rework

the document. If, on the other hand, few faults remain, the

meeting does not need to be held. Other decisions could be

to hold an inspection meeting or only to add an additional

reviewer and re-estimate. In all cases, the work of compiling

the faults into one record is beneficial. The compiled record

is given to the author of the software artefact if rework is to

be done. If a meeting is to be held, the document can serve

as an agenda of what faults need to be discussed, which will

lead to a more efficient meeting. Note that the above

decisions can also be taken after the meeting.

However, a project manager cannot rely only on a single

estimate when decisions are to be taken. Factors concerning

the project and the product developed also have to be taken

into account. Therefore, an estimate of, for example, 20–40

faults left in the document is sometimes good enough if,

for example, time-to-market is important. However, if

the quality is the important factor in the project, the

estimation of 20–40 faults will probably lead to rework

of the product, followed by a reinspection. Another

important factor that affects the decisions is the severity of

the remaining faults.

Hence, the whole context, and not only an estimate,

needs to be considered when making decisions on which

actions to take. We provide a valuable tool for software

project managers to use and give a recommendation of what

estimator should be used. The problem for a project

manager is to know when to trust the estimation result of

capture–recapture since it depends on a number of factors.

Therefore, the recommendation is to combine Mh-JK (and

the log-normal confidence interval) with subjective esti-

mations, especially for few reviewers (four or less). If more

reviewers are used, most investigations have shown that

capture–recapture estimations can be trusted, and thus also

the confidence interval provides valuable information.

The recommendation is, hence, to start using capture–

recapture (in combination with subjective estimations) and

collect data during a period to evaluate the estimates. The

data to collect are the faults detected and the estimates. The

faults should be classified in what phase they were injected

and detected. After the development, the estimates can be

compared to the true number of faults that was present in the

artefacts inspected. Then, the average capture probability

and the REs of the estimates can be calculated. Furthermore,

software inspection experiments can also be conducted in

order to calculate these measures.

Software organizations that want to start using capture–

recapture need to use a tool where the subestimators of Mh-

JK and the confidence intervals are implemented. The

formulae for Mh-JK are shown in Section 2.1 and the

confidence interval formula is described in Section 4.4. In

this study, Mh-JK and the confidence interval have been

implemented in MATLAB® [23].

6.4. Estimators

In biostatistics, many investigations of capture–recap-

ture estimators have been carried out. Although most of

them use simulations and not software inspection data, the

results provide basic knowledge of the estimators. Otis et al.

[26] and White et al. [39] have made extensive evaluations.

Some of the results are applicable to the investigation in this

paper and are discussed in this section. Since they use

capture–recapture for population estimation, they discuss in

terms of number of trapping occasions, population size and

so forth. Here, these terms are translated to number of

reviewers and number of faults to more easily follow the

discussion.

A result in this study is that most of the estimators

underestimate. A discussion of this is provided by White et al.

[39]. They argue that if many faults are difficult to find, i.e.

have low detection probability, Mh-JK as well as the other

estimators will underestimate. This may be a reason why the

estimators underestimate when used for real data in software

inspections. Overestimations have mostly occurred when non-

software inspection data or simulations are used, see e.g. [34,

37].

Furthermore, White et al. [39] and Otis et al. [26] discuss

some recommendations for the capture–recapture estima-

tors used in this paper. They state that the estimators need

to have sufficient amount of data in order to estimate

well. Some recommendations and findings important for

capture–recapture in software inspections are listed below.


• Mh-JK is fairly robust for a number of specific assumptions, and is the most robust estimator (Mt-Ch, Mh-Ch, Mth-Ch and DPM are not used in their investigations).
• If many faults are difficult to find, i.e. have low detection probability, all estimators will underestimate.
• The more faults that are found, the more statistically precise the estimation result will be.
• They recommend using at least five reviewers, but seven to ten are better.
• Extremely wide confidence intervals tend to reveal poor experiment conditions, i.e. low capture probabilities.
• White et al. state that when fewer than 10 reviewers are used, (a) the number of faults should not be less than 20, and (b) the average capture probability should not be less than 0.30 when the number of faults is less than 100.
• Otis et al. state that when the number of reviewers is less than 10, the number of faults should not be less than 25 or the average capture probability not less than 0.10.

These recommendations can provide guidelines for software inspections. In summary, the estimators are affected by (a) the capture probabilities, (b) the number of reviewers and (c) the number of faults in the documents.
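The guidelines above can be turned into a simple pre-check before a capture–recapture estimate is trusted. The following Python sketch is only an illustration (it is not part of the study's MATLAB implementation); the thresholds come directly from the White et al. and Otis et al. recommendations quoted above, while the function itself and its names are assumptions. The output should be read as warnings rather than a verdict, since the guidelines are heuristics.

def precheck(n_reviewers, n_faults, avg_capture_prob):
    # n_faults: number of faults in the document; in practice the number
    # of faults found so far can serve as a lower bound.
    warnings = []
    if n_reviewers < 5:
        warnings.append("fewer than five reviewers (seven to ten are preferred)")
    if n_reviewers < 10:
        if n_faults < 20:
            warnings.append("fewer than 20 faults (White et al.)")
        if n_faults < 100 and avg_capture_prob < 0.30:
            warnings.append("average capture probability below 0.30 (White et al.)")
        if n_faults < 25 and avg_capture_prob < 0.10:
            warnings.append("fewer than 25 faults and capture probability below 0.10 (Otis et al.)")
    return warnings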

In software inspections, reading techniques, like per-

spective-based reading [1], defect-based reading [29] and

UBR [35] aim at increasing the capture probability by

aiding reviewers with information on how to inspect. The

number of faults is more difficult to affect. By seeding faults

according to Mills [25], two different capture–recapture

estimations can be made. This is, however, time-consuming

and it is difficult to know what kind of faults to introduce. It

may be easier to increase the number of reviewers if more

accurate results need to be achieved. Whether the number of

reviewers should be increased is a trade-off between cost

and quality in a software project. Hence, educating

reviewers and improving the reading techniques may be

the most cost-effective activities in order to improve

software inspections as well as the estimation results.

6.5. Further work

Petersson et al. [28] summarize 10 years of work on capture–recapture for software inspections and pinpoint important future research. Capture–recapture estimation for two reviewers is one of these areas. This is an important area, since two or three reviewers are commonly used in software organizations. An extensive simulation carried out by El Emam and Laitenberger [16] shows that Mh-Ch is the most accurate estimator and can be used for two reviewers. However, this study shows that using only two reviewers for capture–recapture estimation leads to high uncertainty in the estimation results. Additional information is needed in order to trust the results. Such information could be, for example, subjective estimates, which have been shown to be as accurate as, or more accurate than, capture–recapture estimates in research investigations by El Emam et al. [15] and by Biffl [2].

A recommendation based on the research conducted so far is that subjective estimations should be used when samples are small, and capture–recapture should be used when samples are large (e.g. many reviewers or many faults found). Since the data for both estimation methods are easy to collect, it is advisable to estimate with both methods and then discuss the results with the reviewers.

As pointed out by Petersson et al. [28], the main future research for fault content estimation methods like capture–recapture and subjective methods is to apply the results in real environments in software organizations. Collaboration between software organizations and researchers will help to formulate relevant research questions.

Most papers on capture–recapture have pointed out Mh-JK to be the best estimator [28]. This study, together with Miller's, shows that a subestimator of Mh-JK is preferable to the full estimator. For a project manager, it is important to know which of the subestimators of Mh-JK to choose for a specific data set. The selection algorithm implemented in the program CAPTURE [30] is not reliable for software inspection data. Hence, important future research is to design a selection algorithm that chooses between the subestimators of Mh-JK and works for software inspection data.

Future research also has to look into how to increase the average capture probability. This can, for example, be achieved by using and improving reading techniques. Another parameter affecting the estimation result, which is not investigated in this paper, is the number of faults. Only one paper on software inspections has studied the effect of varying the number of faults in software documents. Briand et al. [5] varied the number of faults between 6 and 24, and found that increasing the number of faults reduces the variance, but does not affect the RE significantly. More studies are needed to confirm their results over a larger range of fault counts.

Nevertheless, the easiest, but not necessarily the most

cost-effective, way to increase the reliability of the

estimation results is to increase the number of reviewers.

7. Summary

This paper has reported on a replication and on a confidence interval investigation for capture–recapture estimators in software inspections. The replication confirms the results of the original study, in the sense that Mh-JK's subestimators may be better to use for software artefacts than the full estimator. However, this investigation points out the second-order subestimator as the best one, considering bias, standard deviation and confidence intervals.

The confidence intervals based on the log-normal distribution performed better than those based on the normal distribution. This is true for all cases investigated.

The results in this study are important for researchers as


well as for practitioners in the software engineering area.

The recommendation for practitioners is to use Mh-JK order 2 together with the confidence interval based on the log-normal distribution. A practitioner using capture–recapture should focus on using an effective reading technique and a sufficient number of reviewers. The more important the document is, the more reviewers should be used, and as a consequence the estimates will be more accurate. If reading techniques are used, a side-effect of finding more faults is that the estimation accuracy increases. In addition to capture–recapture, subjective estimations have shown good results [2,15]. Since the data are easy to collect, we recommend using both capture–recapture estimations and subjective estimations and then discussing the results before decisions are taken, at least until an evaluation of capture–recapture has been conducted within the software organization.

For researchers, a discussion is provided of which factors are important for achieving accurate estimation results. These factors are the average capture probability and the number of reviewers. The number of faults probably also affects the result. Besides the discussion, this paper has provided important knowledge gained from the replication and the confidence interval investigation. In particular, the confidence intervals need further attention in capture–recapture research for software inspections.

Future work for researchers as well as for practitioners is to carry out case studies, surveys and technology transfer of the estimation techniques to software organizations. Furthermore, more research is needed in order to know how to efficiently utilize capture–recapture together with subjective estimations.

Acknowledgments

The authors would like to thank the students participating in the experiments, and Prof. Claes Wohlin, Dr Björn Regnell, Thomas Olsson and Johan Natt och Dag for help during experiment planning and analysis. This

work was partly funded by The Swedish Agency for

Innovation Systems (VINNOVA), under a grant for the

Center for Applied Software Research at Lund Univer-

sity (LUCAS).

Appendix A. Estimation data

Tables A1–A4.

Table A1

RMSE, bias and standard deviation for the UBRran data (experiment 1), see Section 3.1. Note that in some cases Mh-JK's subestimators become the same formula, for example, orders 1 and 2 for two reviewers. A dash indicates that no value is given for that number of reviewers.

Estimator             Metric    Number of reviewers
                                   2      3      4      5      6      7      8      9
1. M0-ML              RMSE      21.4   20.6   19.8   19.1   18.3   17.6   16.9   16.4
                      Bias     -20.0  -20.4  -19.7  -19.0  -18.3  -17.6  -16.9  -16.4
                      Std dev    7.6    3.0    2.2    1.8    1.5    1.3    1.1    1.0
2. Mt-ML              RMSE      22.3   21.1   20.1   19.2   18.4   17.6   16.9   16.4
                      Bias     -21.2  -20.9  -20.0  -19.1  -18.3  -17.6  -16.9  -16.4
                      Std dev    6.7    2.8    2.1    1.8    1.5    1.3    1.1    1.0
3. Mt-Ch              RMSE      21.8   19.3   17.6   16.4   15.4   14.8   14.3   14.2
                      Bias     -21.3  -18.9  -17.1  -15.9  -15.0  -14.4  -14.0  -14.0
                      Std dev    4.6    3.9    4.1    3.9    3.7    3.4    3.1    2.5
4. Mh-JK (1)          RMSE      20.8   17.7   15.9   14.7   13.7   13.0   12.4   12.1
                      Bias     -20.6  -17.5  -15.7  -14.4  -13.5  -12.7  -12.2  -11.9
                      Std dev    2.6    2.8    2.8    2.7    2.6    2.4    2.1    1.9
5. Mh-JK (2)          RMSE      20.8   16.3   14.4   13.1   12.2   11.6   11.3   11.3
                      Bias     -20.6  -16.0  -13.9  -12.5  -11.5  -11.0  -10.7  -10.8
                      Std dev    2.6    3.5    3.9    4.0    3.9    3.7    3.5    3.3
6. Mh-JK (3)          RMSE        –    16.3   13.9   12.5   11.7   11.4   11.3   11.7
                      Bias        –   -16.0  -13.1  -11.5  -10.5  -10.1  -10.1  -10.7
                      Std dev     –     3.5    4.5    4.9    5.1    5.2    5.1    4.9
7. Mh-JK (4)          RMSE        –     –    13.9   12.4   11.6   11.5   11.7   12.2
                      Bias        –     –   -13.1  -11.1   -9.9   -9.5   -9.7  -10.6
                      Std dev     –     –     4.5    5.5    6.1    6.6    6.5    6.1
8. Mh-JK (5)          RMSE        –     –     –    12.4   11.7   11.8   12.1   12.6
                      Bias        –     –     –   -11.1   -9.6   -9.0   -9.3  -10.4
                      Std dev     –     –     –     5.5    6.7    7.6    7.7    7.2
9. Mh-JK (CAPTURE)    RMSE      22.0   18.1   16.6   15.6   14.8   14.3   13.9   13.6
                      Bias     -21.8  -17.7  -16.0  -14.8  -13.7  -13.2  -13.1  -13.3
                      Std dev    2.8    3.7    4.5    4.9    5.5    5.4    4.8    3.0
10. Mh-Ch             RMSE      20.2   18.2   16.4   15.1   14.2   13.7   13.4   13.2
                      Bias     -18.3  -17.3  -15.0  -13.8  -12.7  -12.2  -12.1  -12.6
                      Std dev    8.6    5.7    6.6    6.1    6.4    6.2    5.6    4.2
11. Mth-Ch (1)        RMSE      39.0   14.2   14.5   14.2   13.8   13.4   13.0   12.8
                      Bias       7.6  -11.8  -13.5  -13.6  -13.3  -13.0  -12.8  -12.6
                      Std dev   38.3    7.9    5.1    4.2    3.5    3.0    2.6    2.1
12. Mth-Ch (2)        RMSE      22.7   19.5   17.2   15.8   14.8   14.1   13.6   13.2
                      Bias     -21.0  -18.8  -16.6  -15.3  -14.4  -13.7  -13.3  -13.1
                      Std dev    8.6    5.3    4.4    3.9    3.4    2.9    2.5    2.1
13. Mth-Ch (3)        RMSE        –   132.3   13.9   14.5   14.2   13.8   13.4   13.2
                      Bias        –    29.6  -12.3  -13.8  -13.8  -13.4  -13.2  -13.0
                      Std dev     –   129.0    6.4    4.3    3.5    3.0    2.6    2.2
14. DPM               RMSE      18.7   17.4   15.7   14.3   13.0   11.8   10.8    9.9
                      Bias     -18.0  -17.1  -15.5  -14.1  -12.8  -11.7  -10.7   -9.8
                      Std dev    5.0    3.4    2.7    2.3    2.0    1.8    1.6    1.4

Table A2

RMSE, bias and standard deviation for the UBRord data (experiment 1), see Section 3.1. Note that in some cases Mh-JK's subestimators become the same formula, for example, orders 1 and 2 for two reviewers. A dash indicates that no value is given for that number of reviewers.

Estimator             Metric    Number of reviewers
                                   2      3      4      5      6      7      8      9
1. M0-ML              RMSE      14.7   10.6    9.4    8.8    8.3    7.9    7.5    6.9
                      Bias      -7.3   -8.9   -8.9   -8.6   -8.2   -7.8   -7.4   -6.9
                      Std dev   12.7    5.7    2.9    2.0    1.6    1.3    1.1    0.9
2. Mt-ML              RMSE      15.0   11.1    9.9    9.1    8.5    8.0    7.5    6.9
                      Bias      -9.0   -9.7   -9.4   -8.9   -8.4   -7.9   -7.4   -6.9
                      Std dev   12.0    5.5    2.8    2.0    1.6    1.3    1.1    0.9
3. Mt-Ch              RMSE      15.8   10.2    8.5    7.5    6.8    6.1    5.6    5.2
                      Bias      -8.9   -8.7   -7.6   -6.7   -6.0   -5.4   -4.8   -4.4
                      Std dev   13.0    5.2    3.8    3.4    3.2    2.9    2.8    2.7
4. Mh-JK (1)          RMSE      13.0    7.6    5.3    4.3    3.8    3.5    3.2    2.9
                      Bias     -12.3   -6.8   -4.4   -3.4   -2.8   -2.5   -2.2   -2.0
                      Std dev    4.1    3.4    3.0    2.7    2.6    2.4    2.2    2.1
5. Mh-JK (2)          RMSE      13.0    5.8    4.8    4.8    4.7    4.5    4.2    4.0
                      Bias     -12.3   -4.0   -2.1   -1.8   -1.7   -1.6   -1.5   -1.3
                      Std dev    4.1    4.2    4.3    4.5    4.4    4.2    4.0    3.8
6. Mh-JK (3)          RMSE        –     5.8    5.4    6.0    6.2    6.0    5.8    5.6
                      Bias        –    -4.0   -1.4   -1.4   -1.4   -1.1   -0.8   -0.6
                      Std dev     –     4.2    5.2    5.9    6.0    5.9    5.7    5.6
7. Mh-JK (4)          RMSE        –     –     5.4    6.7    7.2    7.4    7.5    7.6
                      Bias        –     –    -1.4   -1.3   -1.0   -0.4    0.3    0.5
                      Std dev     –     –     5.2    6.6    7.2    7.4    7.5    7.6
8. Mh-JK (5)          RMSE        –     –     –     6.7    7.8    8.6    9.4    9.9
                      Bias        –     –     –    -1.3   -0.8    0.4    1.4    1.8
                      Std dev     –     –     –     6.6    7.8    8.6    9.3    9.7
9. Mh-JK (CAPTURE)    RMSE      14.2    7.3    6.7    6.4    6.5    6.2    6.0    5.9
                      Bias     -13.6   -5.8   -4.7   -4.6   -4.4   -4.2   -4.0   -3.5
                      Std dev    4.1    4.5    4.7    4.5    4.7    4.6    4.5    4.7
10. Mh-Ch             RMSE      15.2    9.6    7.9    7.2    6.6    6.0    5.7    5.4
                      Bias      -5.1   -6.7   -6.3   -5.7   -5.0   -4.4   -3.8   -3.4
                      Std dev   14.3    6.8    4.8    4.4    4.3    4.1    4.3    4.3
11. Mth-Ch (1)        RMSE      77.5   15.3    6.4    5.3    5.1    4.9    4.7    4.5
                      Bias      47.5    6.0   -1.1   -3.2   -4.0   -4.2   -4.2   -4.1
                      Std dev   61.3   14.1    6.3    4.2    3.2    2.6    2.2    1.9
12. Mth-Ch (2)        RMSE      17.7   11.7    9.1    7.6    6.7    6.0    5.4    5.0
                      Bias      -9.9   -9.1   -7.8   -6.7   -6.0   -5.4   -5.0   -4.6
                      Std dev   14.6    7.4    4.6    3.6    3.0    2.5    2.1    1.9
13. Mth-Ch (3)        RMSE        –    35.2    7.8    6.2    5.7    5.4    5.1    4.8
                      Bias        –     9.2   -2.4   -4.2   -4.7   -4.7   -4.6   -4.4
                      Std dev     –    34.0    7.4    4.5    3.3    2.6    2.2    1.9
14. DPM               RMSE      12.2    9.7    6.9    5.0    5.0    6.4    8.2   10.1
                      Bias     -10.4   -8.2   -4.6   -1.1    2.0    4.7    7.2    9.5
                      Std dev    6.4    5.1    5.1    4.9    4.6    4.3    3.9    3.5


Table A3

RMSE, bias and standard deviation for the CBR data (experiment 2), see Section 3.2. Note that in some cases Mh-JK's subestimators become the same formula, for example, orders 1 and 2 for two reviewers. A dash indicates that no value is given for that number of reviewers.

Estimator             Metric    Number of reviewers
                                   2      3      4      5      6      7      8      9
1. M0-ML              RMSE      17.4   13.5   12.4   11.5   10.7   10.1    9.4    8.8
                      Bias      -9.0  -11.2  -11.0  -10.6  -10.1   -9.6   -9.1   -8.6
                      Std dev   14.8    7.6    5.7    4.5    3.6    3.0    2.4    1.9
2. Mt-ML              RMSE      17.6   14.8   13.4   12.3   11.3   10.5    9.6    8.9
                      Bias     -12.9  -13.2  -12.4  -11.5  -10.8  -10.1   -9.3   -8.7
                      Std dev   12.1    6.6    5.2    4.2    3.5    2.9    2.3    1.9
3. Mt-Ch              RMSE      16.7   14.4   13.0   11.3   10.0    8.9    7.9    6.9
                      Bias     -13.9  -12.3  -10.9   -9.7   -8.7   -7.7   -6.9   -6.0
                      Std dev    9.2    7.4    7.1    5.8    5.0    4.4    3.8    3.3
4. Mh-JK (1)          RMSE      16.9   12.0    9.3    7.7    6.4    5.5    4.6    3.8
                      Bias     -15.1   -9.6   -7.0   -5.5   -4.6   -3.8   -3.2   -2.6
                      Std dev    7.5    7.1    6.2    5.3    4.5    3.9    3.3    2.7
5. Mh-JK (2)          RMSE      16.9   10.6    8.6    7.5    6.6    5.7    4.8    4.1
                      Bias     -15.1   -6.9   -4.3   -3.4   -2.7   -2.0   -1.3   -0.6
                      Std dev    7.5    8.1    7.5    6.7    6.0    5.4    4.7    4.0
6. Mh-JK (3)          RMSE        –    10.6    8.9    8.5    7.9    7.3    6.5    6.2
                      Bias        –    -6.9   -3.5   -2.7   -2.0   -1.0    0.0    0.6
                      Std dev     –     8.1    8.2    8.1    7.6    7.2    6.5    6.1
7. Mh-JK (4)          RMSE        –     –     8.9    9.1    9.0    8.8    8.6    8.9
                      Bias        –     –    -3.5   -2.4   -1.4    0.0    1.3    1.9
                      Std dev     –     –     8.2    8.8    8.9    8.8    8.5    8.7
8. Mh-JK (5)          RMSE        –     –     –     9.1    9.6   10.2   10.7   11.9
                      Bias        –     –     –    -2.4   -1.0    1.0    2.6    3.1
                      Std dev     –     –     –     8.8    9.5   10.1   10.3   11.5
9. Mh-JK (CAPTURE)    RMSE      18.1   11.9   10.6    9.7    9.4    8.8    8.1    8.3
                      Bias     -16.4   -8.5   -6.4   -5.6   -5.0   -4.6   -4.1   -2.9
                      Std dev    7.5    8.3    8.4    8.0    8.0    7.5    6.9    7.8
10. Mh-Ch             RMSE      17.4   13.1   12.9   11.4    9.9    8.4    7.2    6.5
                      Bias      -6.5   -8.0   -7.1   -6.2   -5.3   -4.6   -3.9   -3.1
                      Std dev   16.1   10.4   10.8    9.6    8.4    7.1    6.0    5.7
11. Mth-Ch (1)        RMSE      90.3   19.7   11.7    8.9    7.6    6.7    6.0    5.3
                      Bias      55.9    6.7   -1.0   -3.5   -4.3   -4.5   -4.4   -4.2
                      Std dev   70.9   18.6   11.6    8.2    6.3    5.0    4.0    3.2
12. Mth-Ch (2)        RMSE      19.6   14.7   12.5   10.3    8.8    7.6    6.6    5.7
                      Bias      -9.3   -9.5   -8.3   -7.3   -6.5   -5.9   -5.3   -4.8
                      Std dev   17.2   11.2    9.4    7.3    5.8    4.8    3.9    3.1
13. Mth-Ch (3)        RMSE        –    75.8   14.0    9.5    8.0    7.1    6.3    5.6
                      Bias        –    14.3   -2.3   -4.6   -5.1   -5.1   -4.9   -4.6
                      Std dev     –    74.5   13.8    8.3    6.2    4.9    3.9    3.1
14. DPM               RMSE      15.9   13.9   11.2    8.8    7.0    5.8    5.3    5.5
                      Bias     -13.1  -12.0   -8.9   -6.0   -3.3   -1.0    1.1    3.0
                      Std dev    9.0    7.0    6.8    6.4    6.1    5.7    5.2    4.6


Table A4

RMSE, bias and standard deviation for the UBROrd data (experiment 2), see Section 3.2. Note that in some cases Mh-JK's subestimators become the same formula, for example, orders 1 and 2 for two reviewers. A dash indicates that no value is given for that number of reviewers.

Estimator             Metric    Number of reviewers
                                   2      3      4      5      6      7      8      9
1. M0-ML              RMSE      13.2    9.4    8.5    7.6    6.9    6.3    5.7    5.1
                      Bias      -6.5   -8.0   -7.7   -7.2   -6.5   -6.0   -5.5   -5.0
                      Std dev   11.5    5.0    3.5    2.7    2.2    1.8    1.4    1.0
2. Mt-ML              RMSE      13.8   10.0    8.8    7.9    7.1    6.4    5.8    5.1
                      Bias      -8.1   -8.7   -8.1   -7.4   -6.8   -6.2   -5.7   -5.0
                      Std dev   11.1    4.9    3.4    2.7    2.2    1.8    1.3    1.0
3. Mt-Ch              RMSE      12.1    9.5    7.9    6.6    5.3    4.5    3.8    3.4
                      Bias      -9.1   -7.5   -6.2   -5.0   -4.1   -3.3   -3.0   -2.9
                      Std dev    8.0    5.9    4.9    4.3    3.4    3.0    2.3    1.7
4. Mh-JK (1)          RMSE      12.1    6.9    4.8    3.8    3.2    2.7    2.3    1.8
                      Bias     -11.2   -5.4   -2.8   -1.4   -0.5    0.1    0.5    0.6
                      Std dev    4.5    4.3    4.0    3.6    3.1    2.7    2.2    1.7
5. Mh-JK (2)          RMSE      12.1    5.8    5.4    5.2    4.8    4.4    3.9    3.0
                      Bias     -11.2   -2.4   -0.2    0.7    1.2    1.6    1.5    1.1
                      Std dev    4.5    5.2    5.4    5.1    4.6    4.1    3.5    2.8
6. Mh-JK (3)          RMSE        –     5.8    6.3    6.6    6.4    6.2    5.5    4.3
                      Bias        –    -2.4    0.6    1.3    1.9    2.1    1.5    0.6
                      Std dev     –     5.2    6.3    6.4    6.1    5.8    5.3    4.3
7. Mh-JK (4)          RMSE        –     –     6.3    7.3    7.6    7.8    7.0    5.7
                      Bias        –     –     0.6    1.5    2.4    2.6    1.6    0.0
                      Std dev     –     –     6.3    7.2    7.2    7.3    6.8    5.7
8. Mh-JK (5)          RMSE        –     –     –     7.3    8.2    9.1    8.2    7.1
                      Bias        –     –     –     1.5    2.7    3.1    1.8    0.1
                      Std dev     –     –     –     7.2    7.8    8.5    8.0    7.1
9. Mh-JK (CAPTURE)    RMSE      13.3    7.1    6.7    6.2    5.4    5.2    3.5    3.1
                      Bias     -12.5   -4.2   -2.7   -2.4   -2.1   -1.9   -2.2   -2.4
                      Std dev    4.5    5.7    6.1    5.7    5.0    4.8    2.8    2.0
10. Mh-Ch             RMSE      13.4    8.9    7.5    6.4    5.0    4.4    3.4    2.8
                      Bias      -4.4   -5.5   -4.7   -3.6   -2.8   -2.0   -1.8   -1.9
                      Std dev   12.7    7.0    5.9    5.3    4.1    3.9    2.8    2.1
11. Mth-Ch (1)        RMSE      70.3   14.9    7.6    5.4    4.2    3.4    2.8    2.3
                      Bias      47.8    7.5    0.8   -1.0   -1.6   -1.6   -1.6   -1.6
                      Std dev   51.6   12.9    7.5    5.2    3.9    3.0    2.3    1.7
12. Mth-Ch (2)        RMSE      15.7   10.9    8.5    6.6    5.1    4.1    3.4    2.8
                      Bias      -9.6   -8.0   -6.2   -4.8   -3.7   -3.0   -2.5   -2.3
                      Std dev   12.4    7.4    5.8    4.5    3.6    2.9    2.2    1.6
13. Mth-Ch (3)        RMSE        –    32.4    8.8    5.7    4.5    3.8    3.2    2.6
                      Bias        –    11.0   -0.2   -1.9   -2.3   -2.3   -2.2   -2.1
                      Std dev     –    30.5    8.8    5.3    3.8    3.0    2.3    1.6
14. DPM               RMSE      11.4    8.8    5.8    4.2    4.9    6.8    8.8   10.9
                      Bias      -9.8   -7.4   -3.7   -0.2    2.9    5.7    8.2   10.6
                      Std dev    5.8    4.8    4.4    4.2    3.9    3.6    3.1    2.6


Appendix B. Reviewer data from the experiments

Tables B1 and B2.

Table B1

The data for experiment 1. Reviewers 1–13 are from UBRran and 14–27 from UBRord, see Section 3.1. 37 faults were present in the document under inspection. A fault found is marked with a 1 in the matrix

Fault number Reviewer number Probability

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.52

2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.70

3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.74

4 1 1 1 0.11

5 1 1 1 1 1 1 1 1 1 1 1 1 0.44

6 1 1 1 1 1 1 1 1 1 1 1 0.41

7 1 1 1 1 1 1 1 0.26

8 1 1 1 1 1 1 1 1 1 0.33

9 1 1 1 1 1 1 1 1 1 1 1 0.41

10 1 1 1 1 1 1 1 1 1 0.33

11 1 1 1 1 1 1 1 1 1 1 1 0.41

12 1 1 1 1 1 1 1 0.26

13 0.00

14 1 0.04

15 0.00

16 1 0.04

17 0.00

18 1 1 1 1 1 1 1 1 1 1 0.37

19 1 1 1 1 1 1 1 1 1 1 1 0.41

20 1 1 1 1 1 1 0.22

21 1 1 1 1 1 1 1 1 0.30

22 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.59

23 1 1 1 1 1 1 1 1 1 1 1 1 0.44

24 0.00

25 1 1 1 1 1 0.19

26 1 1 1 1 1 1 1 1 1 0.33

27 1 1 1 0.11

28 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.59

29 1 1 0.07

30 1 1 1 1 1 0.19

31 1 1 1 0.11

32 1 1 1 1 1 1 1 1 1 0.33

33 1 1 1 0.11

34 1 1 1 1 1 0.19

35 1 1 1 1 1 1 1 0.26

36 1 1 1 1 0.15

37 1 0.04

Effectiveness 0.19 0.30 0.11 0.24 0.27 0.30 0.24 0.30 0.24 0.14 0.11 0.30 0.24 0.22 0.32 0.16 0.32 0.24 0.30 0.41 0.27 0.22 0.27 0.51 0.46 0.24 0.38

Mean 0.23 0.31

Standard deviation 0.07 0.10


Table B2

The data for experiment 2. Reviewers 1–12 are from CBR and 13–23 from UBROrd, see Section 3.2. 38 faults were present in the document under inspection. A fault found is marked with a 1 in the matrix

Fault number Reviewer number Probability

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

1 1 1 1 1 1 1 1 1 0.35

2 1 1 1 1 1 1 0.26

3 1 1 0.09

4 1 1 1 1 1 1 1 1 1 0.39

5 1 1 1 1 1 1 1 0.30

6 1 1 1 1 1 1 1 1 1 1 0.43

7 0.00

8 1 1 1 1 1 1 1 0.30

9 1 1 1 0.13

10 1 1 1 1 1 1 0.26

11 0.00

12 1 1 1 1 0.17

13 1 1 1 1 1 1 1 0.30

14 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.61

15 1 1 1 1 1 0.22

16 1 1 1 1 1 1 1 1 1 1 1 1 0.52

17 1 1 1 1 1 1 1 0.30

18 1 1 1 1 1 1 1 1 0.35

19 1 1 1 1 1 1 1 1 1 0.39

20 1 1 1 1 1 1 1 1 1 1 1 0.48

21 1 1 1 1 1 1 1 0.30

22 1 0.04

23 1 1 1 1 1 1 1 1 1 1 1 0.48

24 0.00

25 1 1 1 1 1 1 1 1 1 0.39

26 1 1 1 1 1 1 1 0.30

27 1 1 1 1 1 1 1 1 0.35

28 1 1 1 1 1 1 1 1 1 0.39

29 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.65

30 1 1 1 1 1 1 0.26

31 1 1 1 1 1 1 1 1 1 0.39

32 1 1 1 1 1 1 0.26

33 1 1 1 0.13

34 1 1 1 1 1 1 0.26

35 1 1 1 0.13

36 1 1 1 1 0.17

37 1 0.04

38 1 1 1 1 1 1 1 1 1 0.39

Effectiveness 0.32 0.21 0.16 0.13 0.29 0.21 0.18 0.50 0.13 0.26 0.16 0.55 0.26 0.50 0.37 0.32 0.29 0.21 0.39 0.26 0.34 0.16 0.34

Mean 0.26 0.31

Standard deviation 0.14 0.09



References

[1] V.R. Basili, S. Green, O. Laitenberger, F. Lanubile, F. Shull, S.

Sørumgard, M.V. Zelkowitz, The empirical investigation of perspective-

based reading, Empirical Software Engineering 1 (2) (1996) 133–164.

[2] S. Biffl, Using inspection data for defect estimation, IEEE Software 17

(6) (2000) 36–43.

[3] S. Biffl, W. Grossman, Evaluating the accuracy of defect estimation

models based on inspection data from two inspection cycles,

Proceedings of the 23rd International Conference on Software

Engineering, 2001, pp. 145–154.

[4] L. Briand, K. El Emam, B. Freimut, O. Laitenberger, Quantitative

evaluation of capture–recapture models to control software inspec-

tions, Proceedings of the Eighth International Symposium on

Software Reliability Engineering, 1997, pp. 234–244.

[5] L. Briand, K. El Emam, B. Freimut, O. Laitenberger, A comprehen-

sive evaluation of capture–recapture models for estimating software

defect content, IEEE Transactions on Software Engineering 26 (6)

(2000) 518–540.

[6] K.P. Burnham, W.S. Overton, Estimation of the size of a closed

population when capture probabilities vary among ani-

mals, Biometrika 65 (1978) 625–633.

[7] K.P. Burnham, D.R. Anderson, G.C. White, C. Brownie, K.H. Pollock,

Design and analysis of fish survival experiments based on release–recapture data, American Fisheries Society Monograph 5 (1987) 1–437.

[8] A. Chao, Estimating the population size for capture–recapture data

with unequal catchability, Biometrics 43 (1987) 783–791.

[9] A. Chao, Estimating population size for sparse data in capture–

recapture experiments, Biometrics 45 (1989) 427–438.

[10] A. Chao, S.M. Lee, S.L. Jeng, Estimating population size for capture–

recapture data when capture probabilities vary by time and individual

animal, Biometrics 48 (1992) 201–216.

[11] A. Chao, Capture–Recapture Models, in: P. Armitage, T. Colton

(Eds.), Encyclopaedia of Biostatistics, Wiley, USA, 1998.

[12] B. Efron, R.J. Tibshirani, An introduction to the bootstrap,

Monographs on Statistics and Applied Probability, 57, Chapman

and Hall, UK, 1993.

[13] S.G. Eick, C.R. Loader, M.D. Long, L.G. Votta, S.A. Vander Wiel,

Estimating software fault content before coding, Proceedings of the 14th

International Conference on Software Engineering, 1992, pp. 59–65.

[14] S.G. Eick, C.R. Loader, S.A. Vander Wiel, L.G. Votta, How many errors

remain in a software design document after inspection? Proceedings

of the 25th Symposium on the Interface, 1993, pp. 195–202.

[15] K. El Emam, O. Laitenberger, T. Harbich, The application of

subjective estimates of effectiveness to controlling software inspec-

tions, Journal of Systems and Software 54 (2) (2000) 119–136.

[16] K. El Emam, O. Laitenberger, Evaluating capture–recapture models

with two inspectors, IEEE Transactions on Software Engineering 27

(9) (2001) 851–864.

[17] M.E. Fagan, Design and code inspections to reduce errors in program

development, IBM Systems Journal 15 (3) (1976) 182–211.

[18] M.E. Fagan, Advances in software inspections, IEEE Transactions on

Software Engineering 12 (7) (1986) 744–751.

[19] B. Freimut, Capture–recapture models to estimate software fault

content, Diploma Thesis, University of Kaiserslautern, Germany, 1997.

[20] B. Freimut, O. Laitenberger, S. Biffl, Investigating the impact of

reading techniques on the accuracy of different defect content

estimation techniques, Proceedings of the Seventh International

Software Metrics Symposium, 2001, pp. 51–62.

[21] I. Jacobson, M. Christerson, P. Jonsson, G. Overgaard, Object-

Oriented Software Engineering: A Use Case Driven Approach,

Addison-Wesley, USA, 1992.

[22] P.M. Johnson, D. Tjahjono, Does every inspection really need a

meeting? Empirical Software Engineering 3 (1) (1998) 9–35.

[23] Math Works, http://www.mathworks.com.

[24] J. Miller, Estimating the number of remaining defects after inspection,

Software Testing, Verification and Reliability 9 (4) (1999) 167–189.

[25] H.D. Mills, On the statistical validation of computer programs, Technical

Report, FSC-72-6015, IBM Federal Systems Division, 1972.

[26] D.L. Otis, K.P. Burnham, G.C. White, D.R. Anderson, Statistical

inference from capture data on closed animal populations, Wildlife

Monographs 62 (1978).

[27] H. Petersson, C. Wohlin, An empirical study of experience-based

software defect content estimation methods, Proceedings of the 10th

International Symposium on Software Reliability Engineering, 1999,

pp. 126–135.

[28] H. Petersson, T. Thelin, P. Runeson, C. Wohlin, Capture–recapture in

software inspections after 10 years research—theory, evaluation and

application, CODEN:LUTEDX (TETS-7184)/1-16/2002 & local 14,

Department of Communication Systems, Lund University, 2002.

http://www.telecom.lth.se/Personal/thomast/reports/

Crc10Years_TechnicalReport.pdf.

[29] A. Porter, L. Votta, V.R. Basili, Comparing detection methods for

software requirements inspection: a replicated experiment, IEEE

Transactions on Software Engineering 21 (6) (1995) 563–575.

[30] E. Rexstad, K. P. Burnham, User’s guide for interactive program

CAPTURE, Colorado Cooperative Fish and Wildlife Research Unit,

Colorado State University, Fort Collins, CO 80523, USA, 1991.

[31] P. Runeson, C. Wohlin, An experimental evaluation of an experience-

based capture–recapture method in software code inspections,

Empirical Software Engineering 3 (4) (1998) 381–406.

[32] T. Thelin, P. Runeson, Capture–recapture estimations for perspec-

tive-based reading—a simulated experiment, Proceedings of the

International Conference on Product Focused Software Process

Improvement, 1999, pp. 182–200.

[33] T. Thelin, P. Runeson, Fault content estimations using extended curve

fitting models and model selection, Proceedings of the Fourth

International Conference on Empirical Assessment and Evaluation

in Software Engineering, 2000.

[34] T. Thelin, P. Runeson, Robust estimations of fault content with

capture–recapture and detection profile estimators, Journal of

Systems and Software 52 (2–3) (2000) 139–148.

[35] T. Thelin, P. Runeson, B. Regnell, Usage-based reading—an

experiment to guide reviewers with use cases, Information and

Software Technology 43 (15) (2001) 925–938.

[36] T. Thelin, P. Runeson, C. Wohlin, An experimental comparison of usage-

based and checklist-based reading, Proceedings of the First International

Workshop on Inspection in Software Engineering, 2001, pp. 81–91.

[37] S.A. Vander Wiel, L.G. Votta, Assessing software design using

capture–recapture methods, IEEE Transactions on Software Engin-

eering 19 (11) (1993) 1045–1054.

[38] L.G. Votta, Does every inspection need a meeting? Proceedings of the

First ACM SIGSOFT Symposium on Foundations of Software

Engineering, ACM Software Engineering Notes 18 (5) (1993) 107–114.

[39] G.C. White, D.R. Anderson, K.P. Burnham, D.L. Otis, Capture–

recapture and removal methods for sampling closed populations,

Technical Report, Los Alamos National Laboratory, 1982.

[40] C. Wohlin, P. Runeson, J. Brantestam, An experimental evaluation of

capture–recapture in software inspections, Software Testing, Ver-

ification and Reliability 5 (4) (1995) 213–232.

[41] C. Wohlin, P. Runeson, Defect content estimation from review data,

Proceedings of the 20th International Conference on Software

Engineering, 1998, pp. 400–409.
