Confidence intervals for capture–recapture estimations in software
inspections
Thomas Thelin*, Håkan Petersson, Per Runeson
Department of Communication Systems, Lund University, P.O. Box 118, SE-221 00 Lund, Sweden
Received 23 January 2002; revised 29 May 2002; accepted 31 May 2002
Abstract
Software inspections are an efficient method to detect faults in software artefacts. In order to estimate the fault content remaining after
inspections, a method called capture–recapture has been introduced. Most research published on fault content estimation for software
inspections has focused on point estimates. However, confidence intervals provide more information about the estimation results and are
thus preferable. This paper replicates a capture–recapture study and investigates confidence intervals for capture–recapture
estimators using data sets from two recently conducted software inspection experiments. Furthermore, a discussion of the practical
application of capture–recapture with confidence intervals is provided. In capture–recapture research for software inspections, most papers
have reported Mh-JK to be the best estimator, but only one study has investigated its subestimators. In addition, confidence intervals based on the
log-normal distribution have not been evaluated before with software inspection data. These two investigations together with a discussion provide
the main contribution of this paper. The result confirms the conclusions of the replicated study and shows when confidence intervals for capture–
recapture estimators can be trusted. © 2002 Elsevier Science B.V. All rights reserved.
Keywords: Capture–recapture models; Confidence interval; Empirical study; Replication; Software inspection
1. Introduction
Inspections have been established as important contri-
butors to improved software quality. They may contribute to
quality improvement in different ways: firstly, by removing
faults through the inspection effort; secondly, by being an
efficient means of communication within a project group,
thereby reducing the risk of errors; and thirdly, by enabling
quantitative quality management, i.e. decisions on release or
further inspections can be made based on an estimate of the
remaining number of faults in the inspected artefact.
The research in the field of software inspections has
addressed all the three aspects. Different reading
techniques have been developed and evaluated, such as
checklist-based reading (CBR) [18], defect-based reading
[29], perspective-based reading [1] and usage-based
reading (UBR) [35]. The use and effect of inspection
meetings, which were very much focused on in the initial
inspection work by Fagan [17], have been evaluated [22,
38]. Finally, the estimation of remaining faults after
inspections has been developed using statistical techniques
primarily from the biology field. This began with
Eick et al. [13,14], who started using capture–recapture
techniques, which were originally used for estimating
animal populations [11].
Several studies have been performed to evaluate and
improve the estimates [3–5,16,20,27,31,34,40]. The generalized
outcome of the studies is: for an inspection team of
four or larger, the model assuming the same detection
probability for all reviewers but probabilities varying over
different faults (model Mh),
combined with the Jackknife estimator, performs best for
estimation of remaining faults [28]. A few studies point out
other models that work better for specific data sets, but most
studies are in favour of the Mh-JK model.
Furthermore, Miller [24] is even more specific. He points
out that one of the subestimators of Mh-JK performs better.
The Jackknife estimator produces subestimators of different
orders, and Miller finds that the first-order subestimator
performs better than the other estimators. He does not
compare the orders with the full estimator. However, he
argues that the subestimators might be better due to different
preconditions for estimates in the software engineering field
compared to the biology domain.
* Corresponding author. Tel.: +46-46-222-3863; fax: +46-46-14-5823.
E-mail addresses: [email protected] (T. Thelin), hakan.
[email protected] (H. Petersson), [email protected]
(P. Runeson).
Despite all this effort being spent on estimation models,
little work has been done on confidence intervals for the
estimators [2,3,37]. By estimating the confidence interval
for the number of remaining faults, the risk taken in a
decision based on the estimate is under better control,
compared to when only a single point estimate is available.
The result of a capture–recapture estimator with a confidence
interval is a point estimate with a lower and an upper
bound. The probability that the point estimate is exactly
correct is very small and unknown. By using a confidence
interval instead, the probability that the correct value is within the
range is much larger; if the nominal confidence value is
chosen to be 95%, the probability is 0.95 that the
correct value is within the range. Hence, a confidence interval
provides valuable information for a software
project manager during the inspection process.
This paper focuses on point and confidence interval
estimations of remaining faults, using capture–recapture
techniques. Firstly, the study by Miller is replicated, using
data from two recently conducted inspection experiments.
Secondly, the confidence intervals for the estimates are
investigated, to better control the risks when making
decisions based on the estimates.
The replication result confirms Miller's argument
that the subestimators of Mh-JK perform better than the full
estimator. However, in this evaluation, order 2 of Mh-JK is
the best subestimator considering bias, variance and the
confidence interval investigation. The confidence interval
recommended is based on the log-normal distribution.
Furthermore, in order to aid future research and prac-
titioners using capture–recapture, an extensive discussion
of the results is provided.
In Section 2, the basics of fault content estimations using
capture–recapture are briefly presented. Section 3 gives an
overview of the origin of the inspection data used. Section 4
presents the research questions, analysis method for this
study, a summary of the replicated study and the confidence
intervals. The analysis itself is presented in Section 5,
followed by a discussion of the results in Section 6
combined with related work. Finally, conclusions are
drawn in Section 7.
2. Fault content estimation
In order to describe and explain the different ways of
making fault content estimations, the following terminology
will be used in this paper. An estimator is a formula used to
predict the number of faults remaining in an artefact. A model
is the umbrella term for a number of estimators assuming the
same prerequisites. A method is a type of approach to make an
estimation. A method may contain a number of models which
in turn may contain many estimators.
Fault content estimation models originate mainly from
ecological studies of animal population. These capture–
recapture models [11] all stem from biologists seeking to
estimate the population of certain species in an area. By (a)
capturing animals, (b) marking them and then (c) releasing
them to perhaps be recaptured at a later time, it is possible
through statistical models to estimate the total number of
animals in the area. The statistical models that biostatisti-
cians use include models for both open and closed
populations, i.e. populations which change or do not
change between trapping occasions, respectively. When
using the capture–recapture models with software inspec-
tions, it is natural to assume closed populations. This is
because the number of faults in an inspected artefact is
likely to be constant as the ‘trapping occasions’ occur at the
same time on the same version of the inspected artefact.
The first introduction of using capture–recapture within
the field of software engineering was made by Mills in 1972
[25]. Mills seeded faults into a system and applied capture–
recapture on the information on faults found during testing
to attain an estimate of the number of remaining faults. The
first application of capture–recapture to software inspec-
tions was made by Eick et al. in 1992 [13].
Capture–recapture is not the only method that has been used to
make fault content estimations. Another method that utilizes
the same information as the capture–recapture method is
the curve fitting method [41]. The curve fitting method sorts
and plots the fault data from an inspection. Then, it
estimates by fitting a mathematical function to the data.
Another fault content estimation method is the use of
subjective estimates [2,15]. Subjective estimates mean that
the reviewers themselves produce an estimate based on their
experience and impressions of the inspected artefact. This type
of method is not included in this study.
2.1. The capture–recapture method
The capture–recapture models in biology that are used
with software inspections have some restrictions [26]. The
most restrictive models make the following assumptions.
1. The population is closed,
2. animals do not lose their marks during the experiment,
3. all marks are correctly noted and recorded at each
trapping occasion, and
4. each animal has a constant and equal probability of
capture on each trapping occasion. This also implies
that capture and marking do not affect the catchability
of the animal.
These restrictions were translated to a software inspec-
tion context by Miller [24].
1. Once the document is issued for inspection, it must not be
changed; and the performance of the reviewers should be
constant, i.e. given the same document the reviewers
should find the same faults.
2. Reviewers must not reveal their proposed faults to other
reviewers.
3. Reviewers must ensure that they accurately record and
document every fault they find. Additionally, the
inspection process, for example, at the collection meet-
ing, must not discard any correct faults.
4. All reviewers must be provided with identical infor-
mation, in terms of source materials, standards, inspec-
tion aids, etc. and this material must be available to them
at all times.
Assumption 4 also implies equality between reviewer
abilities and the complexity of finding different faults.
Depending on whether the reviewers' detection ability and
the faults' detection probability are allowed to vary, four
basic models are formed:
† M0, all faults have equal detection probability, all
reviewers have equal detection ability.
† Mt, all faults have equal detection probability, reviewers
may have different detection abilities.
† Mh, faults may have different detection probabilities, all
reviewers have equal detection ability.
† Mth, faults may have different detection probabilities,
reviewers may have different detection abilities.
There is also a model called Mb, which in biology
corresponds to an animal changing its capture probability after being
trapped once. This model can be combined with the others
creating models Mtb, Mhb and Mthb. However, these
models have not been used for software inspections.
Connected to each model there are a number of
estimators (Table 1). The estimators Mh-JK and Mth-Ch
can both be calculated with different orders of subestima-
tors. In the case of Mh-JK, a procedure for selecting the
order is suggested by Burnham and Overton [6]. Miller
argued for not using this selection algorithm in an inspection
context and therefore evaluated the different orders
separately. He found that order 1 of Mh-JK performed
best of the five investigated subestimators [24]. In the study
by Miller and in this study, the orders of Mh-JK and
Mth-Ch are all treated separately. Table 2 shows the formulae
of the five first subestimators of Mh-JK.
These estimators are used by Miller and in this paper.
These are also used in the selection procedure suggested by
Burnham and Overton [6].
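To make the formulae of Table 2 concrete, the following sketch implements the first three subestimators in Python. It assumes the inspection outcome is available as a 0/1 fault-by-reviewer detection matrix with one row per detected fault; it illustrates the formulae and is not the CAPTURE implementation (orders 4 and 5 follow the same pattern).

```python
import numpy as np

def jackknife_mh(detection_matrix, order):
    """Mh-JK subestimators of orders 1-3 (Table 2); an illustrative sketch.

    detection_matrix: 0/1 array of shape (detected faults, reviewers),
    where entry (i, j) is 1 if reviewer j found fault i.
    """
    X = np.asarray(detection_matrix)
    D, k = X.shape                   # D = unique faults found, k = reviewers
    counts = X.sum(axis=1)           # how many reviewers found each fault
    # f[i] = number of faults found by exactly i reviewers
    f = np.array([(counts == i).sum() for i in range(k + 1)])

    if order == 1:
        return D + (k - 1) / k * f[1]
    if order == 2:
        return (D + (2 * k - 3) / k * f[1]
                - (k - 2) ** 2 / (k * (k - 1)) * f[2])
    if order == 3:
        return (D + (3 * k - 6) / k * f[1]
                - (3 * k ** 2 - 15 * k + 19) / (k * (k - 1)) * f[2]
                + (k - 3) ** 3 / (k * (k - 1) * (k - 2)) * f[3])
    raise ValueError("orders 4 and 5 follow the same pattern, see Table 2")
```

Note that for two reviewers the order 1 and order 2 expressions coincide, since the f2 coefficient vanishes; this matches the observation in Section 5.1 that some subestimators become the same formula for few reviewers.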
2.2. The curve fitting method
Wohlin and Runeson introduced two kinds of curve
fitting methods [41]: the detection profile method (DPM)
and the cumulative method. Of these two, DPM has been
shown to perform best. In DPM, the faults are sorted
decreasingly in terms of the number of reviewers that found
each fault, and an exponential curve is fitted to the data. The
estimate is then attained by extrapolating the curve to a
predetermined value. The predetermined value is the limit
value at the y-axis, where the number of remaining faults is
estimated by reading off the x-value. The predetermined
value was set to 0.5 in the initial study by Wohlin and
Runeson.
DPM is included in this study since it has performed well
in comparison to other capture–recapture estimators, and is
regarded as the best curve fitting method for software
inspections [33].
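A minimal sketch of DPM under the description above: the exponential curve is fitted by least squares on the log scale, and the estimate is read off where the curve crosses the predetermined value 0.5. The exact fitting procedure of Wohlin and Runeson may differ in detail.

```python
import numpy as np

def dpm_estimate(detection_matrix, limit=0.5):
    """Detection profile method (DPM): sort faults decreasingly by the
    number of reviewers that found them, fit an exponential curve, and
    extrapolate to the predetermined limit value."""
    X = np.asarray(detection_matrix)
    counts = np.sort(X.sum(axis=1))[::-1]      # decreasing detection profile
    x = np.arange(1, len(counts) + 1)          # fault index 1..D
    # Fit counts ~ exp(intercept + slope * x) on the log scale.
    slope, intercept = np.polyfit(x, np.log(counts), 1)
    # The estimated total is the x where the fitted curve crosses `limit`.
    return (np.log(limit) - intercept) / slope
```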
3. Experiment description
In order to evaluate the capture–recapture estimators and
the confidence intervals for the estimators, data from two
experiments are used. The experiments were conducted in
two university environments in Sweden, where third or
fourth year students inspected software artefacts belonging
to a taxi management system. The system was developed
using a real taxi driver and a taxi dispatcher as stake-
holders, although the system is limited in scope compared to
their existing system. This section gives an overview of the
two experiments. A detailed description of the experiments’
subjects, variables, design, hypotheses, threats to validity,
operation, and results is provided in Refs. [35,36],
respectively.
3.1. First experiment
In the first experiment [35], two variants of a new reading
technique, UBR, were compared. The subjects were third
year students at the software engineering Bachelor’s
programme at Lund University. The design specification
for the taxi management system was inspected, and the
reading was guided by a requirements specification
primarily based on use cases [21]. The design specification
comprised nine pages (2300 words) and contained 37 faults.
One subject group (14 subjects) got the use cases ranked
according to their importance for the user (UBRord). The other
subject group (13 subjects) got randomly ordered use cases
(UBRrand).
It is concluded from the experiment that the group using
the ordered set of use cases is more efficient and effective in
finding the faults that are most important from a user
perspective. Efficiency is defined as number of faults found
per time unit and effectiveness is defined as the fraction of
the number of faults found out of the total number of faults
in the document. The average capture probability is a
measure of how likely a fault is to be detected by a
reviewer. The average capture probability was 0.31 for
the group using ordered use cases and 0.23 for the group
using randomized use cases. In Fig. 1, the capture
probability of each fault is shown and in Appendix B, the
raw data together with capture probabilities for each fault
and reviewer are provided.
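Interpreted this way, the capture probabilities can be computed directly from a detection matrix; a sketch, assuming the matrix covers all known faults in the document (rows of zeros for faults nobody found):

```python
import numpy as np

def capture_probabilities(detection_matrix):
    """Per-fault detection rates (the quantities plotted in Figs. 1 and 2)
    and their average, from a 0/1 faults-by-reviewers matrix."""
    X = np.asarray(detection_matrix)
    per_fault = X.mean(axis=1)    # fraction of reviewers finding each fault
    return per_fault, float(X.mean())
```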
3.2. Second experiment
In the second experiment [36], the same design
specification was inspected, but in this case, the UBR with
ordered use cases was compared to CBR. The students
participating as reviewers in the study were fourth year
software engineering Master students at Blekinge Institute
of Technology in Sweden. In order to give the two groups
as similar circumstances as possible, they were
given a textual requirements specification for the system.
The UBR group (11 subjects) was given a set of use cases in
addition to the requirements, and the CBR group (12
subjects) was given a checklist.
The experiment shows that the effectiveness and
efficiency are higher for the UBRord group compared to
the CBR group. The average capture probability was 0.31
for the UBRord group and 0.26 for the CBR group. In Fig. 2,
the capture probability of each fault is shown and in
Appendix B, the raw data together with capture probabilities
for each fault and reviewer are provided.
3.3. Data
The data from the above described experiments are used
in this paper to evaluate the performance of capture–
recapture estimators. They represent different reading
techniques, and also, which is of interest to this study,
different capture probabilities.
UBR guides reviewers to focus their inspection effort to
detect the most important faults from a user perspective.
Hence, some faults have a higher probability of being found than
others, which is not necessarily true for an inspection
performed in a software organization. Therefore, the data
may not represent an inspection where ad hoc or CBR is
used. However, studies of perspective-based reading [1],
which also focus on subsets of the faults, have shown that
the capture–recapture estimators are robust for PBR data
[19,20,32]. Furthermore, one of the data sets is from CBR,
which does not have this restriction. Consequently, the CBR
data set is primarily used in the replication part of the paper,
and the other data sets are used as complements. For the
confidence interval investigation, we have argued that the
use of UBR will not affect the results. However, in order to
extend the knowledge, this part needs to be replicated.
4. Method
In Sections 4.1–4.4, analysis strategy, a summary of the
replicated study and the confidence intervals are presented.
The outline of the study is presented in Fig. 3. Firstly, the
study conducted by Miller is replicated. Secondly, confi-
dence intervals for the best estimator are investigated. These
Table 1
The models and estimators in capture–recapture. More capture–recapture models exist, but have not been used for software inspections.

Model   Estimators
M0      M0-ML, maximum likelihood [26]
Mt      Mt-ML, maximum likelihood [26]; Mt-Ch, Chao's estimator [9]
Mh      Mh-JK, Jackknife [6]; Mh-Ch, Chao's estimator [8]
Mth     Mth-Ch, Chao's estimator [10]
Table 2
The formulae of the five first subestimators (orders) of Mh-JK [6]:

$$N_{J1} = D + \frac{k-1}{k}\, f_1$$

$$N_{J2} = D + \frac{2k-3}{k}\, f_1 - \frac{(k-2)^2}{k(k-1)}\, f_2$$

$$N_{J3} = D + \frac{3k-6}{k}\, f_1 - \frac{3k^2-15k+19}{k(k-1)}\, f_2 + \frac{(k-3)^3}{k(k-1)(k-2)}\, f_3$$

$$N_{J4} = D + \frac{4k-10}{k}\, f_1 - \frac{6k^2-36k+55}{k(k-1)}\, f_2 + \frac{4k^3-42k^2+148k-175}{k(k-1)(k-2)}\, f_3 - \frac{(k-4)^4}{k(k-1)(k-2)(k-3)}\, f_4$$

$$N_{J5} = D + \frac{5k-15}{k}\, f_1 - \frac{10k^2-70k+125}{k(k-1)}\, f_2 + \frac{10k^3-120k^2+485k-660}{k(k-1)(k-2)}\, f_3 - \frac{(k-4)^5-(k-5)^5}{k(k-1)(k-2)(k-3)}\, f_4 + \frac{(k-5)^5}{k(k-1)(k-2)(k-3)(k-4)}\, f_5$$

$N_{Jm}$ is the estimated number of faults for order $m$ using Jackknife, $D$ is the unique number of faults found, $k$ is the number of reviewers, and $f_i$ is the number of faults found by exactly $i$ reviewers.
two evaluations use data from the two experiments
described in Section 3.
4.1. Research questions
Two main research questions are addressed in this
investigation.
1. Replication.
Is the result in the study by Miller valid for other data
sets?—A replication of the study performed by Miller
[24] is carried out. The aim of the replication is to
select the best estimators to be used for further
analysis.
2. Confidence interval.
(a) Which capture–recapture estimator has the best
confidence interval coverage and range?—Confi-
dence intervals are evaluated for the selected
estimators in research question 1. The aim is to
select a distribution and the best estimator to be
used for fault content estimations.
(b) Which nominal confidence value is best to use?—
Variation of nominal values is investigated for the
best estimator in research questions 1 and 2a. 95
and 99% are normally used as nominal confidence
values. The aim is to investigate whether other
values provide more correct results, considering
the coverage and range of the confidence
intervals.
4.2. Measures and method
The analysis procedure is divided into three steps. Each
step is connected to one research question (1, 2a and
2b).
Research question 1 (Replication). The measures used
for comparing the estimators are the relative error (RE), the
standard deviation and the root mean square error (RMSE)
[12]. The RE is a measure of bias and the standard deviation
is a measure of dispersion. The RE is defined as:
$$\mathrm{RE} = \frac{\text{Estimated number of faults} - \text{True number of faults}}{\text{True number of faults}}$$
RMSE is a measure of accuracy, which takes into account
both standard deviation and bias of an estimator. The RMSE
value is defined as:
$$\mathrm{RMSE} = \sqrt{E[(\hat{Q} - Q)^2]} = \sqrt{\mathrm{StdDev}^2 + \mathrm{Bias}^2}$$

where StdDev is the standard deviation and Bias is the
Fig. 1. The capture probability for each fault (first experiment).
Fig. 2. The capture probability for each fault (second experiment).
difference between the estimated and the true number of
faults. These measures are evaluated
for teams of two to nine reviewers.
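Both measures translate directly into code; a sketch over the estimates produced by all virtual teams of a given size, where true_n is the known number of faults in the experiment:

```python
import numpy as np

def relative_error(estimates, true_n):
    """RE of each estimate; negative values indicate underestimation."""
    est = np.asarray(estimates, dtype=float)
    return (est - true_n) / true_n

def rmse(estimates, true_n):
    """Root mean square error; equals sqrt(StdDev^2 + Bias^2)."""
    est = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((est - true_n) ** 2)))
```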
Research question 2a (coverage and range of confidence
intervals). The confidence intervals are evaluated by
measuring the percentage of cases in which the correct number
of faults lies within the nominal confidence interval (coverage). This
carried out using 95% nominal confidence interval for two
to nine reviewers, for four different data sets with different
capture probabilities. Furthermore, the range of the
confidence intervals is investigated and presented in box
plots.
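Coverage itself is a simple proportion; a sketch, assuming one (lower, upper) interval per virtual team:

```python
def coverage(intervals, true_n):
    """Fraction of confidence intervals containing the true fault count."""
    hits = sum(1 for low, high in intervals if low <= true_n <= high)
    return hits / len(intervals)
```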
Research question 2b (variation of nominal confidence
value). Nominal confidence values ranging from 60 to 99%
are presented for the estimator that is best with respect to
accurate estimations, confidence interval coverage and
range. The nominal confidence value is used as input to
the calculation of confidence intervals. For example, if the
nominal confidence value is decided to be 95%, the
confidence interval should ideally include the correct
number of faults 95 of 100 times.
An alternative evaluation measure for capture–recapture
has been proposed by El Emam et al. [15], called decision
accuracy (DA). Instead of measuring the accuracy of the
estimators in terms of bias, DA measures the number of
correct decisions taken by an estimator. This can, for
example, be compared to the number of correct decisions
taken, when reinspections are never made, resulting in the
relative decision accuracy (RDA). However, there are some
limitations of RDA, and one of these is that a threshold
needs to be chosen beforehand, which should be based on
historical data [33]. Due to the restrictions of RDA and the
fact that this paper replicates a study where the DA is not
used, this measure is not utilized in this study.
A measure that is important for some estimators is the
failure rate, i.e. the number of times an estimator cannot
provide an estimate, for example, when no overlap exists.
Some of the estimators, for example Mh-JK, can always
provide an estimation result. This is further discussed in
Section 5.1.
In order to investigate the impact of the number of
reviewers on the estimation results, virtual inspections are
used. This means that all possible combinations of
reviewers are analysed for each group size. For each
group size, every reviewer is included several times. The
number of times depends on the number of reviewers to
combine. For example, if 11 reviewers are used in an
experiment and the team size is three, the number of
combinations is $\binom{11}{3} = 165$, which equals the number of virtual teams. Virtual inspec-
tions are commonly used for capture–recapture investi-
gations using software inspection data to evaluate different
sizes of groups when the reviewers have inspected
individually [4,5,27,34].
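Virtual inspections can be generated with standard combinatorics; the sketch below yields, for each reviewer combination, the detection matrix restricted to the faults that the virtual team found:

```python
from itertools import combinations

import numpy as np

def virtual_teams(detection_matrix, team_size):
    """Yield one detection matrix per reviewer combination; for 11
    reviewers and a team size of three this gives C(11, 3) = 165 teams."""
    X = np.asarray(detection_matrix)
    for cols in combinations(range(X.shape[1]), team_size):
        team = X[:, cols]
        yield team[team.sum(axis=1) > 0]   # only faults the team detected
```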
4.3. Summary of the replicated study
Miller [24] evaluates all capture–recapture estimators
described in Section 2, except DPM. Instead of using the full
estimator of Mh-JK and Mth-Ch, he uses all subestimators
of Mh-JK (1–5) and Mth-Ch (1–3). Miller argues that the
procedure used to select among the subestimators of Mh-
JK and Mth-Ch, as implemented in the software program
CAPTURE [30], is not applicable to software inspections.
The consequence may be that one of the subestimators
produces a more accurate result than the full estimator.
Therefore, he uses all orders separately in the evaluation.
Fig. 3. The outline of the investigation presented in this study. The figures refer to section numbers in the paper.
The data come from two previously conducted experiments,
where the purposes were to evaluate tool support for
inspections and defect-based reading, respectively. The
investigation uses RE and box plots to evaluate the
estimators for three to six reviewers. The results are (a)
the estimators underestimate, (b) Mh-JK order 1 is the best
estimator and can be used for three to six reviewers and (c)
Mth-Ch (orders 1 and 3) might be appropriate to use if many
reviewers inspect.
Miller’s result that most estimators underestimate when
using software inspection data is consistent with most of the
previously conducted research, which is summarized and
discussed by Petersson et al. [28]. The second and third of
Miller’s results (see (b) and (c) earlier) have not been
replicated before. In addition, Mh-JK’s subestimators have
not been compared to the full estimator using software
inspection data.
Hence, as a replication, this paper provides further
knowledge of the estimators used for capture–recapture
estimations.
4.4. Confidence intervals
The confidence intervals used in this study are based on
the normal and the log-normal distributions. The assump-
tion of normal distribution is only fulfilled for large samples
[26] (many reviewers and faults), which makes it less useful
for software inspections. The other distribution used
assumes that the faults not detected are log-normally
distributed [7]. In the software program CAPTURE [30],
this distribution is used for all estimators implemented. In
this study, an evaluation for both distributions is carried out
using software inspection data from two experiments
described in Section 3 [35,36].
A 95% nominal confidence interval means that, in the ideal
case, the interval contains the correct value in 95% of the
cases. The 95% confidence interval, assuming a normal
distribution, is calculated as [26]:
$$\left[\, N - 1.96\sqrt{\mathrm{Var}(N)},\;\; N + 1.96\sqrt{\mathrm{Var}(N)} \,\right]$$

where N is the estimated number of faults. The 95%
confidence interval using the log-normal distribution is
calculated as [7]:

$$\left[\, D + \frac{N - D}{C},\;\; D + C(N - D) \,\right], \qquad C = \exp\left( 1.96\sqrt{\log\left( 1 + \frac{\mathrm{Var}(N)}{(N - D)^2} \right)} \right)$$
where D is the unique number of faults found. The variance
of the estimated number of faults uses different formulae
depending on the estimator used.
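A sketch of the log-normal interval in Python, assuming the point estimate, its estimated variance (from the estimator-specific formula) and the number of unique faults found are given; NormalDist supplies the quantile, 1.96 for a 95% interval:

```python
import math
from statistics import NormalDist

def lognormal_ci(n_hat, var_n, d, nominal=0.95):
    """Log-normal confidence interval [7]; requires n_hat > d, i.e. the
    estimator must predict at least one remaining fault."""
    z = NormalDist().inv_cdf(1 - (1 - nominal) / 2)   # e.g. 1.96 for 95%
    c = math.exp(z * math.sqrt(math.log(1 + var_n / (n_hat - d) ** 2)))
    return d + (n_hat - d) / c, d + c * (n_hat - d)
```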
Three studies have evaluated confidence intervals for
capture–recapture estimators in software inspections [2,3,
37]. Vander Wiel and Votta [37] carried out a simulation
and investigated Mh-JK and Mt-ML for five reviewers with
different capture probabilities. They investigate three
confidence intervals, two for Mt-ML and one for Mh-JK.
The findings in the study are that Mt-ML performs best of
the estimators, which is contradicted by most later results
using software inspection data [2,4,5,24,28]. The confidence
intervals for Mt-ML are based on the normal distribution
(Wald) as well as on likelihood-ratio statistics (likelihood-
ratio confidence). The confidence interval for Mh-JK is the
one suggested by Burnham and Overton when they
developed Mh-JK [6]. This is also based on the normal
distribution. Vander Wiel and Votta found that the
confidence intervals have either too large (conservative) or
too small ranges for five reviewers. When the range is too
large, the confidence intervals are often too broad to
be useful. However, this was not investigated.
Biffl [2] and Biffl and Grossman [3] use confidence
intervals when comparing capture–recapture to subjective
estimations and for reinspections, respectively. These
studies use data from the same experiment and use teams
of four to six reviewers. The conclusion is that their
approach to confidence intervals does not work well. In Ref.
[2], the largest coverage was 74%, with a nominal
confidence value of 95%.
The larger the nominal confidence value is, the broader is
the confidence interval range. Therefore, the nominal
confidence values are varied to examine how the coverage
and range change and how well the confidence intervals
work. This is evaluated for the nominal confidence values
60–95% in steps of 5% together with 99%.
5. Analysis
The analysis is divided into three subsections. The result
of the replication is presented in Section 5.1 (research
question 1). In Section 5.2 (research question 2a) and
Section 5.3 (research question 2b), the confidence interval
investigation is presented.
5.1. Replication
In Table 3, the RMSE values are shown for the 14
estimators investigated. This is shown for the CBR data set;
RMSE, bias and standard deviation values of the estimators
are presented in Appendix A for all of the sets. All
estimators, except Mh-JK (CAPTURE) and DPM are
evaluated by Miller [24]. Mh-JK (CAPTURE) is the
estimator used in the software program CAPTURE [30],
which selects and interpolates between the five subestima-
tors of Mh-JK. The two estimators with the best (lowest)
RMSE for each number of reviewers can be read from each column of Table 3. Note that
for some cases, Mh-JK’s subestimators become the same
formula depending on the number of reviewers. For
example, for two reviewers, Mh-JK orders 1 and 2 become
the same formula, and hence the same estimation results are
obtained.
For two reviewers, DPM shows the best results. Since
DPM is based on fitting an exponential curve to the
inspection data, and fitting a curve to two levels is more or
less random, the interpretation is that no estimator works
well for two reviewers.
For three to nine reviewers, one of the Mh-JK’s
subestimators estimates most accurately. Similar results
are obtained for the other three data sets (Appendix A). One
of the orders 1 and 2 of Mh-JK is in most cases one of the
two best estimators and is in several cases the best estimator.
As expected, increasing the number of reviewers
improves the estimation results in most cases, since more
input data give more reliable estimation results.
All estimators have median RE below zero, which means
that, in general, they underestimate. This is not expressed by
the RMSE value, but is shown in Fig. 4 for the five best
estimators using three to six reviewers (see Appendix A for
all data sets). Underestimations confirm the results of Miller
[24] and most other previously conducted software
capture–recapture research evaluating the estimators, see
e.g. [2,4,5]. In these investigations, it is concluded that Mh-
JK is the best estimator for software inspections. However,
only the full estimator is used, except by Miller [24]. Miller
concludes that Mh-JK order 1 is the best. Since the result in
this paper confirms previously conducted research, Mh-JK
is regarded as the best estimator. Therefore, Mh-JK
(subestimators (1–5) and full estimator) is used in the
confidence interval investigation.
Since Mh-JK is the best estimator, the failure rate is not
discussed. Mh-JK never fails to estimate, and is thus not
affected negatively in such an evaluation. The failure rate
cannot change the result of the replication.
5.2. Confidence intervals
Both the normal distribution and the log-normal
distribution were used in the confidence interval evaluation.
The analysis of coverage and range showed that the normal
distribution is less accurate than the log-normal distribution.
Thus, only the log-normal distribution is used in the further
analysis and in the plots. Four data sets are used. However,
only three of them are represented in the plots, as two of
them show similar results. The reason that two of the data
sets show similar results is probably because both of them
have average capture probabilities equal to 0.31. The data
with equal capture probabilities come from the groups using
ordered use cases (UBRord) in both experiments (Section 3).
Hence, only one of these data sets is used in the further
analysis.
In Fig. 5, the 95% nominal confidence interval is shown
for Mh-JK orders 1–3 and the full estimator. Orders 4 and 5
are also used in the analysis but are less accurate than the
others, and are therefore not shown.

Table 3
Root mean square error (RMSE) of the estimators for 2–9 reviewers; lower values are better, and the lowest values in each column identify the two best estimators for that number of reviewers. The data set is the one where the reviewers used CBR, see Section 3.2. A dash marks a number of reviewers for which the subestimator is not defined; where two adjacent orders of Mh-JK coincide (e.g. orders 1 and 2 for two reviewers), the shared value is shown for both.

Estimator              2     3     4     5     6     7     8     9
1.  M0-ML             17.4  13.5  12.4  11.5  10.7  10.1   9.4   8.8
2.  Mt-ML             17.6  14.8  13.4  12.3  11.3  10.5   9.6   8.9
3.  Mt-Ch             16.7  14.4  13.0  11.3  10.0   8.9   7.9   6.9
4.  Mh-JK (1)         16.9  12.0   9.3   7.7   6.4   5.5   4.6   3.8
5.  Mh-JK (2)         16.9  10.6   8.6   7.5   6.6   5.7   4.8   4.1
6.  Mh-JK (3)          –    10.6   8.9   8.5   7.9   7.3   6.5   6.2
7.  Mh-JK (4)          –     –     8.9   9.1   9.0   8.8   8.6   8.9
8.  Mh-JK (5)          –     –     –     9.1   9.6  10.2  10.7  11.9
9.  Mh-JK (CAPTURE)   18.1  11.9  10.6   9.7   9.4   8.8   8.1   8.3
10. Mh-Ch             17.4  13.1  12.9  11.4   9.9   8.4   7.2   6.5
11. Mth-Ch (1)        90.3  19.7  11.7   8.9   7.6   6.7   6.0   5.3
12. Mth-Ch (2)        19.6  14.7  12.5  10.3   8.8   7.6   6.6   5.7
13. Mth-Ch (3)         –    75.8  14.0   9.5   8.0   7.1   6.3   5.6
14. DPM               15.9  13.9  11.2   8.8   7.0   5.8   5.3   5.5

Columns give the number of reviewers.
Analysis of all three data sets shows that Mh-JK order 2
is in general the best one. For the data set with the average
capture probability equal to 0.31, the confidence intervals
work well for three and more reviewers using order 2 and
the full estimator. Order 1 has good coverage for four and
more reviewers. Orders 3 and higher are less accurate for this
data set.
When the average capture probabilities decrease, more
reviewers are needed in order to achieve good results.
This is applicable for all orders. For the data set with
average capture probability equal to 0.23, order 2 needs
seven or more reviewers as input to produce a coverage
of over 90%.
None of the cases shows accurate results for two
reviewers. A higher average capture probability is probably
needed for the estimators to show good results for
two reviewers, although this is not investigated in this
study.
Consequently, the lower the average capture probability, the
more reviewers are needed; and the larger the average capture
probability (the higher the rate of faults found), the better
the coverage. For example, Mh-JK order 2 gives a coverage of
more than 90% for three reviewers for the data set with
capture probability of 0.31, but seven reviewers are
necessary to reach above 90% when the capture probability
is 0.23.
In Fig. 6, the ranges of the confidence intervals are shown
for three to six reviewers. The same data sets as in Fig. 5 are
used. The range is more affected by the number of reviewers
than by the average capture probability. This means that for
a fixed number of reviewers (e.g. three) and for a specific
estimator (e.g. Mh-JK order 1), the range is similar for all
three data sets.
The range itself does not give enough information in
order to select a good estimator. It has to be combined with
the coverage analysis. Since order 2 and the full estimator
have best coverage, these are most important in the range
analysis.
Both these estimators have small ranges for three to six
reviewers. Order 2 has smaller variation in the range. The
full estimator is based on interpolation of the five
subestimators, which sometimes leads to broad ranges.
Combining the results from the coverage and range
analysis shows that Mh-JK order 2 is the most appropriate
estimator to use.
5.3. Variation of the nominal confidence
The nominal confidence values are used in this section to
investigate how the coverage and range vary with the
nominal confidence. If larger nominal confidence values
are used, the result will be higher coverage of the correct
number of faults.
In Figs. 7 and 8, the coverage and the range of the
confidence intervals are shown, when the nominal confi-
dences are between 60 and 99%. These are shown for Mh-
JK order 2.
Similar patterns as in Section 5.2 are observed. When the
average capture probability is high, the confidence intervals
are trustworthy. However, when the average capture
probability decreases, less trustworthy results are obtained.
Furthermore, the more reviewers, the better the
coverage. However, six reviewers are not sufficient when
Fig. 4. Box plot of the five most accurate estimators for three to six
reviewers. The numbers below the box plot relate to the estimators in
Table 3.
Fig. 5. The coverage of the confidence intervals using data sets that have average capture probabilities equal to 0.31, 0.26 and 0.23, from left to right.
the average capture probability equals 0.23. In that case, the
confidence intervals cannot be trusted. For nominal
confidence of 99%, three to six reviewers give a good
coverage, but then the range is large. However, to trust the
results, the 99% confidence interval may be preferable.
As expected, as the nominal confidence value increases,
the range of the confidence interval becomes broader, but
the coverage improves. For six
reviewers and a 99% confidence interval, the median
number of faults is 37, which is equal to the true number
of faults in that data set. For the other cases, the median is
less than 22 faults and the maximum less than 30, which
gives a good range. But then the coverage is more dependent
on the average capture probability than the number of
reviewers (at least for three to six reviewers).
6. Discussion
The results presented in Section 5 raise important
questions to be discussed. This section discusses the results
of the replication (research question 1), the confidence
intervals (research question 2), application of the results, the
findings in biostatistics regarding the estimators and future
work in the area of fault content estimation in software
inspections.
6.1. Replication
Miller [24] investigates 12 different estimators by
measuring bias and variance. His finding is that Mh-JK
order 1 is the best estimator to use. In the first part of this
paper, a replication of Miller’s evaluation was performed
using new data sets. In addition to the 12 estimators, the full
estimator of Mh-JK (implemented in CAPTURE [30]) and
the curve fitting method named DPM [41] were evaluated.
This investigation confirms Miller’s result in the sense
that the subestimators of Mh-JK are better to use for
software inspections than the full estimator. However,
instead of Mh-JK order 1, order 2 is shown to be more
appropriate to use when confidence intervals are
considered.
Miller also points out that Mh-JK order 1 can be used for
three to six reviewers. This result can, however, not be
confirmed, as it may depend on the data set used for
estimations. As shown, the higher the detection probability
Fig. 6. The ranges of the confidence intervals using data sets that have average capture probabilities equal to 0.31, 0.26 and 0.23, from left to right.
Fig. 7. The coverage of the confidence intervals using data sets that have average capture probabilities equal to 0.31, 0.26 and 0.23, from left to right. The
nominal confidence values are between 60 and 99%.
is, the more faults are found, which leads to better
estimation results. When many faults are found, capture–
recapture may be used for three reviewers. However, when
few faults are found, at least six reviewers are needed to
obtain accurate estimation results. In other words, the
estimators need sufficient amount of data as input in order to
estimate well. This is a general problem in estimations, i.e.
the more input data the more accurate output. To increase
the amount of data, different factors can be increased, for
example, the number of reviewers and the capture
probability. This is also confirmed in biostatistics, which
is discussed below.
Furthermore, Miller and most other evaluations of
capture–recapture estimators in software inspections have
pointed out that the estimators underestimate [28]. This is
confirmed in this investigation. All estimators underestimated
in general. This may be because the software artefacts used
for inspection contain faults that are very difficult to find.
In the study by Miller, data sets with higher average capture
probabilities than in this study were used. The average capture
probabilities in the study by Miller are 0.39, 0.48, 0.75 and
0.84 (per reviewer). Comparing the REs in the box plots
reveals that smaller underestimates are obtained. However, Miller
does not separate the data sets and hence does not
investigate the impact of average capture probability. A
mix of all the data sets is used, which results in a data set
with higher average capture probability than the ones used
in this study. Consequently, the estimates are closer to the
correct number of faults, i.e. smaller underestimations.
The subestimators of Mh-JK do not use all available
information if the number of reviewers is more than the
orders used. The information Mh-JK order 1 uses is the
number of faults that have been found by exactly one
reviewer. Mh-JK order 2 uses the number of faults that have
been found by one or two reviewers in its estimation
formula. If the number of reviewers is more than three, the
subestimators of orders 1 and 2 do not use all available
information (if some fault has been found by three or more reviewers). In
addition, the estimators have an upper estimation limit. The
larger the order, the larger the limit, and this affects the
variance. Hence, if the bias is small for all subestimators,
order 1 of Mh-JK will probably estimate most accurately. So
in Miller's case, where the capture probability is large, the
bias is small, and Mh-JK order 1 estimates well. In this
evaluation, the capture probability was smaller, resulting in
a larger bias, and thus a higher order is needed to estimate
well (in this case Mh-JK order 2).
To avoid this dilemma in biology, a procedure is
implemented that selects two subestimators and interpolates
between their estimation values. This procedure is used in
the full estimator (Mh-JK (CAPTURE)). However, this
study shows that the procedure is not better than using either
of Mh-JK order 1–3 for software inspection data.
6.2. Confidence intervals
The confidence intervals investigated are based on the
normal and log-normal distributions. The log-normal
distribution is most appropriate to use for software
inspection data. However, it cannot be trusted for all data
sets for all numbers of reviewers.
Four data sets were used in this investigation. They have
different average capture probabilities, which affect the
estimation results. Confidence intervals should cover the
nominal confidence value but not be too broad. Therefore,
the average capture probability, the number of reviewers
and the confidence values were varied in order to evaluate
the intervals for Mh-JK and its subestimators.
For the data sets with average capture probability equal
to 0.31, the confidence intervals reflect the nominal
confidence value well for three to nine reviewers using
Mh-JK order 2. This estimator estimates most accurately in
general considering bias, variance and confidence interval
coverage. When decreasing the average capture probability,
more reviewers are needed to achieve the nominal
confidence interval coverage. For the data set with average
capture probability equal to 0.23, seven reviewers are
needed. In this case, Mh-JK order 3 estimates better and
Fig. 8. The ranges of the confidence intervals using data sets that have average capture probabilities equal to 0.31, 0.26 and 0.23, from left to right. The nominal
confidence values are between 60 and 99%.
only five reviewers are needed to achieve the decided
nominal confidence value.
Hence, the larger the capture probability, the more accurate
the point estimations and confidence intervals. A
project manager using the estimation result must be aware
of this. However, it is still better to base one's decisions on
a confidence interval than on only a point estimation. The
application of capture–recapture for project managers is
further discussed in Section 6.3.
6.3. Practical application of capture–recapture
For a software organization, we recommend using Mh-
JK order 2 with the log-normal confidence interval if the
number of reviewers is four or more. If the number of
reviewers is less than four, subjective estimations are
preferable.
estimations work in a specific organization, historical data
should be collected and evaluated [27]. This paper
contributes with a general investigation of capture–
recapture estimators and points out the best estimator to
start with for a software organization.
Capture–recapture estimations can be calculated before
the inspection meeting (using the individual inspection
record) or after the meeting (using the meeting record with
identified overlap). When a project manager receives an
estimate after the individual inspection, the manager has
several decision alternatives to take [28]. The estimation
result can, for example, show that there are several faults
left in the document. Then, the decision should be to rework
the document. If, on the other hand, few faults remain, the
meeting does not need to be held. Other decisions could be
to hold an inspection meeting or only to add an additional
reviewer and re-estimate. In all cases, the work of compiling
the faults into one record is beneficial. The compiled record
is given to the author of the software artefact if rework is to
be done. If a meeting is to be held, the document can serve
as an agenda of what faults need to be discussed, which will
lead to a more efficient meeting. Note that the above
decisions can also be taken after the meeting.
However, a project manager cannot rely only on a single
estimate when decisions are to be taken. Factors concerning
the project and the product developed also have to be taken
into account. Therefore, an estimate of, for example, 20–40
faults left in the document is sometimes good enough if,
for example, time-to-market is important. However, if
the quality is the important factor in the project, the
estimation of 20–40 faults will probably lead to rework
of the product, followed by a reinspection. Another
important factor that affects the decisions is the severity of
the remaining faults.
Hence, the whole context, and not only an estimate,
needs to be considered when making decisions on which
actions to take. We provide a valuable tool for software
project managers to use and give a recommendation of what
estimator should be used. The problem for a project
manager is to know when to trust the estimation result of
capture–recapture since it depends on a number of factors.
Therefore, the recommendation is to combine Mh-JK (and
the log-normal confidence interval) with subjective esti-
mations, especially for few reviewers (four or less). If more
reviewers are used, most investigations have shown that
capture–recapture estimations can be trusted, and thus also
the confidence interval provides valuable information.
The recommendation is, hence, to start using capture–
recapture (in combination with subjective estimations) and
collect data during a period to evaluate the estimates. The
data to collect are the faults detected and the estimates. The
faults should be classified according to the phase in which
they were injected and detected. After the development, the estimates can be
compared to the true number of faults that was present in the
artefacts inspected. Then, the average capture probability
and the REs of the estimates can be calculated. Furthermore,
software inspection experiments can also be conducted in
order to calculate these measures.
Software organizations that want to start using capture–
recapture need to use a tool where the subestimators of Mh-
JK and the confidence intervals are implemented. The
formulae for Mh-JK are shown in Section 2.1 and the
confidence interval formula is described in Section 4.4. In
this study, Mh-JK and the confidence interval have been
implemented in MATLAB® [23].
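As an illustration of such a tool, the sketches from earlier sections combine into a small end-to-end example. The detection matrix below is synthetic, and the variance value is a placeholder for the estimator-specific variance formula [6]:

```python
import numpy as np

# Hypothetical inspection: 4 reviewers, per-fault detection probability ~0.35.
rng = np.random.default_rng(1)
X = (rng.random((25, 4)) < 0.35).astype(int)
X = X[X.sum(axis=1) > 0]              # only detected faults are observed

n_hat = jackknife_mh(X, order=2)      # point estimate (sketch in Section 2.1)
var_n = 9.0                           # placeholder; use the Mh-JK variance
                                      # formula [6] in a real implementation
low, high = lognormal_ci(n_hat, var_n, d=X.shape[0])
print(f"Estimated faults: {n_hat:.1f}, 95% CI: [{low:.1f}, {high:.1f}]")
```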
6.4. Estimators
In biostatistics, many investigations of capture–recap-
ture estimators have been carried out. Although most of
them use simulations and not software inspection data, the
results provide basic knowledge of the estimators. Otis et al.
[26] and White et al. [39] have made extensive evaluations.
Some of the results are applicable to the investigation in this
paper and are discussed in this section. Since they use
capture–recapture for population estimation, they discuss in
terms of number of trapping occasions, population size and
so forth. Here, these terms are translated to number of
reviewers and number of faults to more easily follow the
discussion.
A result in this study is that most of the estimators
underestimate. A discussion of this is provided by White et al.
[39]. They argue that if many faults are difficult to find, i.e.
have low detection probability, Mh-JK as well as the other
estimators will underestimate. This may be a reason why the
estimators underestimate when used for real data in software
inspections. Overestimations have mostly occurred when non-
software inspection data or simulations are used, see e.g. [34,
37].
Furthermore, White et al. [39] and Otis et al. [26] discuss
some recommendations for the capture–recapture estima-
tors used in this paper. They state that the estimators need
a sufficient amount of data in order to estimate
well. Some recommendations and findings important for
capture–recapture in software inspections are listed below.
† Mh-JK is fairly robust for a number of specific
assumptions, and is the most robust estimator (Mt-Ch,
Mh-Ch, Mth-Ch and DPM are not used in their
investigations).
† If many faults are difficult to find, i.e. have low detection
probability, all estimators will underestimate.
† The more faults that are found, the more statistically
precise will the estimation result be.
† They recommend using at least five reviewers, but seven
to 10 are better.
† Extremely wide confidence intervals tend to reveal poor
experiment conditions, i.e. low capture probabilities.
† White et al. state that when using less than 10 reviewers
(a) the number of faults should not be less than 20; (b) the
average capture probabilities should not be less than 0.30
when the number of faults is less than 100.
† Otis et al. state that when the number of reviewers is
less than 10, the number of faults should not be less than
25 or the average capture probability not less than 0.10.
These recommendations can provide guidelines for
software inspections. In summary, the estimators are
affected by the (a) capture probabilities, (b) the number of
reviewers and (c) the number of faults in the documents.
In software inspections, reading techniques, like per-
spective-based reading [1], defect-based reading [29] and
UBR [35] aim at increasing the capture probability by
aiding reviewers with information on how to inspect. The
number of faults is more difficult to affect. By seeding faults
according to Mills [25], two different capture–recapture
estimations can be made. This is, however, time-consuming
and it is difficult to know what kind of faults to introduce. It
may be easier to increase the number of reviewers if more
accurate results need to be achieved. Whether the number of
reviewers should be increased is a trade-off between cost
and quality in a software project. Hence, educating
reviewers and improving the reading techniques may be
the most cost-effective activities in order to improve
software inspections as well as the estimation results.
6.5. Further work
Petersson et al. [28] summarize 10 years of work on
capture–recapture for software inspections and pinpoint
important future research. Capture–recapture estimations
for two reviewers is one of these areas. This is an important
area, since two or three reviewers are commonly used in
software organizations. An extensive simulation carried out
by El Emam and Laitenberger [16] show that Mh-Ch is the
most accurate and can be used for two reviewers. However,
this study shows that using only two reviewers for capture–
recapture estimations leads to high uncertainty in the
estimation results. Additional information is needed in
order to trust the results. Such information could be, for
example, subjective estimates, which have been shown to be as
accurate as or more accurate than capture–recapture estimates in research
investigations by El Emam et al. [15] and by Biffl [2].
A recommendation based on conducted research is that
subjective estimations should be used when small samples
are used; and capture–recapture should be used when the
samples are large (e.g. many reviewers or many faults
found). Since the data for both estimation methods are easy
to collect, estimate with both methods and then discuss the
results with the reviewers.
As pointed out by Petersson et al. [28], the main future
research for fault content estimations like capture–recap-
ture and subjective methods is to apply the results in a real
environment in software organizations. Collaboration
between software organizations and researchers will help
the research to come up with relevant research questions.
Most papers in capture–recapture have pointed out Mh-
JK to be the best estimator [28]. This study, together with
Miller's, shows that a subestimator of Mh-JK is even more
preferable. For a project manager, it is important to know
which of the subestimators of Mh-JK to choose for a specific
data set. The selection algorithm implemented in the
program CAPTURE [30] is not reliable to use for software
inspections. Hence, important future research is to design a
selection algorithm that chooses between the subestimators
of Mh-JK and works for software inspection data.
Future research also has to look into how to increase the
average capture probability. This can, for example, be
achieved by using and improving the reading techniques.
Another parameter affecting the estimation result, which is
not investigated in this paper, is the number of faults. Only
one paper on software inspections has studied the effect of
different numbers of faults in software documents. Briand
et al. [5] varied the number of faults between 6 and 24, and
found that increasing the number of faults reduces the
variance, but does not affect the RE significantly. More
studies are needed to confirm their results with larger
variability of faults.
Nevertheless, the easiest, but not necessarily the most
cost-effective, way to increase the reliability of the
estimation results is to increase the number of reviewers.
7. Summary
This paper has reported on a replication and on a
confidence interval investigation for capture–recapture
estimators in software inspections. The replication confirms
the results, in the sense that Mh-JK’s subestimators may be
better to use with software artefacts than the full estimator.
However, this investigation points out the second subestimator
as the best one, considering bias, standard
deviation and confidence intervals.
The confidence intervals based on the log-normal
distribution performed better than the one based on the
normal distribution. This is true for all cases investigated.
The results in this study are important for researchers as
well as for practitioners in the software engineering area.
The recommendation for practitioners is to use Mh-JK order
2 together with the confidence interval based on the log-
normal distribution. A practitioner using capture–recapture
should focus on using an effective reading technique and
a sufficient number of reviewers. The more important the
document is, the more reviewers should be used, and as a
consequence the estimates will be more
accurate. If reading techniques are used, the side-effect of
finding more faults is that it increases the estimation
accuracy. In addition to capture–recapture, subjective
estimations have shown good results [2,15]. Since the data
are easy to collect, we recommend using both capture–
recapture estimations and subjective estimations and then
discussing the result before decisions are taken. At least
until an evaluation of capture–recapture has been con-
ducted within the software organization.
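As a sketch of how such an interval is obtained, the function below (ours, not the paper's implementation) applies the log-normal construction used in the capture–recapture literature (e.g. [7,8]): the interval is placed around the estimated number of undetected faults, so the lower bound never falls below the number of faults already found. The variance estimate is assumed to come from the chosen estimator's own variance formula.

```python
# Sketch of a log-normal confidence interval around a capture-recapture
# point estimate; z = 1.96 gives an approximate 95% interval.
import math

def lognormal_ci(s: int, n_hat: float, var_hat: float, z: float = 1.96):
    """s: faults found; n_hat: point estimate; var_hat: its estimated variance."""
    f0 = n_hat - s                    # estimated number of remaining faults
    if f0 <= 0:
        return float(s), float(s)     # degenerate case: nothing estimated to remain
    c = math.exp(z * math.sqrt(math.log(1.0 + var_hat / f0 ** 2)))
    return s + f0 / c, s + f0 * c     # lower bound is always at least s

# Hypothetical numbers: 25 faults found, estimate 33, estimated variance 30.
low, high = lognormal_ci(25, 33.0, 30.0)   # roughly (27.4, 52.0)
```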
For researchers, a discussion is provided of which factors
are important for achieving accurate estimation results.
These factors are the average capture probability and the
number of reviewers; the number of faults probably also
affects the result. Besides the discussion, this paper has
provided knowledge gained from the replication
and the confidence interval investigation. Especially the
confidence intervals need further attention in capture–
recapture research for software inspections.
Future work for researchers as well as for practitioners is
to carry out case studies, surveys and technology transfer of
the estimation techniques to software organizations. Fur-
thermore, more research is needed on how to efficiently
utilize capture–recapture together with subjective estimations.
Table A1
RMSE, bias and standard deviation for the UBRran data (experiment 1), see Section 3.1. The two best estimators for each number of reviewers are shown in
bold. Note that Mh-JK's subestimators sometimes reduce to the same formula, for example, orders 1 and 2 for two reviewers
Number of reviewers: 2 3 4 5 6 7 8 9
1. M0-ML RMSE 21.4 20.6 19.8 19.1 18.3 17.6 16.9 16.4
Bias -20.0 -20.4 -19.7 -19.0 -18.3 -17.6 -16.9 -16.4
Standard deviation 7.6 3.0 2.2 1.8 1.5 1.3 1.1 1.0
2. Mt-ML RMSE 22.3 21.1 20.1 19.2 18.4 17.6 16.9 16.4
Bias -21.2 -20.9 -20.0 -19.1 -18.3 -17.6 -16.9 -16.4
Standard deviation 6.7 2.8 2.1 1.8 1.5 1.3 1.1 1.0
3. Mt-Ch RMSE 21.8 19.3 17.6 16.4 15.4 14.8 14.3 14.2
Bias -21.3 -18.9 -17.1 -15.9 -15.0 -14.4 -14.0 -14.0
Standard deviation 4.6 3.9 4.1 3.9 3.7 3.4 3.1 2.5
4. Mh-JK (1) RMSE 20.8 17.7 15.9 14.7 13.7 13.0 12.4 12.1
Bias -20.6 -17.5 -15.7 -14.4 -13.5 -12.7 -12.2 -11.9
Standard deviation 2.6 2.8 2.8 2.7 2.6 2.4 2.1 1.9
5. Mh-JK (2) RMSE 20.8 16.3 14.4 13.1 12.2 11.6 11.3 11.3
Bias -20.6 -16.0 -13.9 -12.5 -11.5 -11.0 -10.7 -10.8
Standard deviation 2.6 3.5 3.9 4.0 3.9 3.7 3.5 3.3
6. Mh-JK (3) RMSE – 16.3 13.9 12.5 11.7 11.4 11.3 11.7
Bias – -16.0 -13.1 -11.5 -10.5 -10.1 -10.1 -10.7
Standard deviation – 3.5 4.5 4.9 5.1 5.2 5.1 4.9
7. Mh-JK (4) RMSE – – 13.9 12.4 11.6 11.5 11.7 12.2
Bias – – -13.1 -11.1 -9.9 -9.5 -9.7 -10.6
Standard deviation – – 4.5 5.5 6.1 6.6 6.5 6.1
8. Mh-JK (5) RMSE – – – 12.4 11.7 11.8 12.1 12.6
Bias – – – -11.1 -9.6 -9.0 -9.3 -10.4
Standard deviation – – – 5.5 6.7 7.6 7.7 7.2
9. Mh-JK (CAPTURE) RMSE 22.0 18.1 16.6 15.6 14.8 14.3 13.9 13.6
Bias -21.8 -17.7 -16.0 -14.8 -13.7 -13.2 -13.1 -13.3
Standard deviation 2.8 3.7 4.5 4.9 5.5 5.4 4.8 3.0
10. Mh-Ch RMSE 20.2 18.2 16.4 15.1 14.2 13.7 13.4 13.2
Bias -18.3 -17.3 -15.0 -13.8 -12.7 -12.2 -12.1 -12.6
Standard deviation 8.6 5.7 6.6 6.1 6.4 6.2 5.6 4.2
11. Mth-Ch (1) RMSE 39.0 14.2 14.5 14.2 13.8 13.4 13.0 12.8
Bias 7.6 -11.8 -13.5 -13.6 -13.3 -13.0 -12.8 -12.6
Standard deviation 38.3 7.9 5.1 4.2 3.5 3.0 2.6 2.1
12. Mth-Ch (2) RMSE 22.7 19.5 17.2 15.8 14.8 14.1 13.6 13.2
Bias -21.0 -18.8 -16.6 -15.3 -14.4 -13.7 -13.3 -13.1
Standard deviation 8.6 5.3 4.4 3.9 3.4 2.9 2.5 2.1
13. Mth-Ch (3) RMSE – 132.3 13.9 14.5 14.2 13.8 13.4 13.2
Bias – 29.6 -12.3 -13.8 -13.8 -13.4 -13.2 -13.0
Standard deviation – 129.0 6.4 4.3 3.5 3.0 2.6 2.2
14. DPM RMSE 18.7 17.4 15.7 14.3 13.0 11.8 10.8 9.9
Bias -18.0 -17.1 -15.5 -14.1 -12.8 -11.7 -10.7 -9.8
Standard deviation 5.0 3.4 2.7 2.3 2.0 1.8 1.6 1.4
Acknowledgments
The authors would like to thank the students
participating in the experiments, as well as Prof. Claes Wohlin,
Dr Björn Regnell, Thomas Olsson and Johan Natt och Dag
for help during experiment planning and analysis. This
work was partly funded by the Swedish Agency for
Innovation Systems (VINNOVA), under a grant for the
Center for Applied Software Research at Lund Univer-
sity (LUCAS).
Appendix A. Estimation data
Tables A1–A4.
Table A2
RMSE, bias and standard deviation for the UBRord data (experiment 1), see Section 3.1. The two best estimators for each number of reviewers are shown in
bold. Note that Mh-JK's subestimators sometimes reduce to the same formula, for example, orders 1 and 2 for two reviewers
Number of reviewers: 2 3 4 5 6 7 8 9
1. M0-ML RMSE 14.7 10.6 9.4 8.8 8.3 7.9 7.5 6.9
Bias -7.3 -8.9 -8.9 -8.6 -8.2 -7.8 -7.4 -6.9
Standard deviation 12.7 5.7 2.9 2.0 1.6 1.3 1.1 0.9
2. Mt-ML RMSE 15.0 11.1 9.9 9.1 8.5 8.0 7.5 6.9
Bias -9.0 -9.7 -9.4 -8.9 -8.4 -7.9 -7.4 -6.9
Standard deviation 12.0 5.5 2.8 2.0 1.6 1.3 1.1 0.9
3. Mt-Ch RMSE 15.8 10.2 8.5 7.5 6.8 6.1 5.6 5.2
Bias -8.9 -8.7 -7.6 -6.7 -6.0 -5.4 -4.8 -4.4
Standard deviation 13.0 5.2 3.8 3.4 3.2 2.9 2.8 2.7
4. Mh-JK (1) RMSE 13.0 7.6 5.3 4.3 3.8 3.5 3.2 2.9
Bias -12.3 -6.8 -4.4 -3.4 -2.8 -2.5 -2.2 -2.0
Standard deviation 4.1 3.4 3.0 2.7 2.6 2.4 2.2 2.1
5. Mh-JK (2) RMSE 13.0 5.8 4.8 4.8 4.7 4.5 4.2 4.0
Bias -12.3 -4.0 -2.1 -1.8 -1.7 -1.6 -1.5 -1.3
Standard deviation 4.1 4.2 4.3 4.5 4.4 4.2 4.0 3.8
6. Mh-JK (3) RMSE – 5.8 5.4 6.0 6.2 6.0 5.8 5.6
Bias – -4.0 -1.4 -1.4 -1.4 -1.1 -0.8 -0.6
Standard deviation – 4.2 5.2 5.9 6.0 5.9 5.7 5.6
7. Mh-JK (4) RMSE – – 5.4 6.7 7.2 7.4 7.5 7.6
Bias – – -1.4 -1.3 -1.0 -0.4 0.3 0.5
Standard deviation – – 5.2 6.6 7.2 7.4 7.5 7.6
8. Mh-JK (5) RMSE – – – 6.7 7.8 8.6 9.4 9.9
Bias – – – -1.3 -0.8 0.4 1.4 1.8
Standard deviation – – – 6.6 7.8 8.6 9.3 9.7
9. Mh-JK (CAPTURE) RMSE 14.2 7.3 6.7 6.4 6.5 6.2 6.0 5.9
Bias -13.6 -5.8 -4.7 -4.6 -4.4 -4.2 -4.0 -3.5
Standard deviation 4.1 4.5 4.7 4.5 4.7 4.6 4.5 4.7
10. Mh-Ch RMSE 15.2 9.6 7.9 7.2 6.6 6.0 5.7 5.4
Bias -5.1 -6.7 -6.3 -5.7 -5.0 -4.4 -3.8 -3.4
Standard deviation 14.3 6.8 4.8 4.4 4.3 4.1 4.3 4.3
11. Mth-Ch (1) RMSE 77.5 15.3 6.4 5.3 5.1 4.9 4.7 4.5
Bias 47.5 6.0 -1.1 -3.2 -4.0 -4.2 -4.2 -4.1
Standard deviation 61.3 14.1 6.3 4.2 3.2 2.6 2.2 1.9
12. Mth-Ch (2) RMSE 17.7 11.7 9.1 7.6 6.7 6.0 5.4 5.0
Bias -9.9 -9.1 -7.8 -6.7 -6.0 -5.4 -5.0 -4.6
Standard deviation 14.6 7.4 4.6 3.6 3.0 2.5 2.1 1.9
13. Mth-Ch (3) RMSE – 35.2 7.8 6.2 5.7 5.4 5.1 4.8
Bias – 9.2 -2.4 -4.2 -4.7 -4.7 -4.6 -4.4
Standard deviation – 34.0 7.4 4.5 3.3 2.6 2.2 1.9
14. DPM RMSE 12.2 9.7 6.9 5.0 5.0 6.4 8.2 10.1
Bias -10.4 -8.2 -4.6 -1.1 2.0 4.7 7.2 9.5
Standard deviation 6.4 5.1 5.1 4.9 4.6 4.3 3.9 3.5
Table A3
RMSE, bias and standard deviation for the CBR data (experiment 2), see Section 3.2. The two best estimators for each number of reviewers are shown in bold.
Note that Mh-JK's subestimators sometimes reduce to the same formula, for example, orders 1 and 2 for two reviewers
Number of reviewers: 2 3 4 5 6 7 8 9
1. M0-ML RMSE 17.4 13.5 12.4 11.5 10.7 10.1 9.4 8.8
Bias -9.0 -11.2 -11.0 -10.6 -10.1 -9.6 -9.1 -8.6
Standard deviation 14.8 7.6 5.7 4.5 3.6 3.0 2.4 1.9
2. Mt-ML RMSE 17.6 14.8 13.4 12.3 11.3 10.5 9.6 8.9
Bias -12.9 -13.2 -12.4 -11.5 -10.8 -10.1 -9.3 -8.7
Standard deviation 12.1 6.6 5.2 4.2 3.5 2.9 2.3 1.9
3. Mt-Ch RMSE 16.7 14.4 13.0 11.3 10.0 8.9 7.9 6.9
Bias -13.9 -12.3 -10.9 -9.7 -8.7 -7.7 -6.9 -6.0
Standard deviation 9.2 7.4 7.1 5.8 5.0 4.4 3.8 3.3
4. Mh-JK (1) RMSE 16.9 12.0 9.3 7.7 6.4 5.5 4.6 3.8
Bias -15.1 -9.6 -7.0 -5.5 -4.6 -3.8 -3.2 -2.6
Standard deviation 7.5 7.1 6.2 5.3 4.5 3.9 3.3 2.7
5. Mh-JK (2) RMSE 16.9 10.6 8.6 7.5 6.6 5.7 4.8 4.1
Bias -15.1 -6.9 -4.3 -3.4 -2.7 -2.0 -1.3 -0.6
Standard deviation 7.5 8.1 7.5 6.7 6.0 5.4 4.7 4.0
6. Mh-JK (3) RMSE – 10.6 8.9 8.5 7.9 7.3 6.5 6.2
Bias – -6.9 -3.5 -2.7 -2.0 -1.0 0.0 0.6
Standard deviation – 8.1 8.2 8.1 7.6 7.2 6.5 6.1
7. Mh-JK (4) RMSE – – 8.9 9.1 9.0 8.8 8.6 8.9
Bias – – -3.5 -2.4 -1.4 0.0 1.3 1.9
Standard deviation – – 8.2 8.8 8.9 8.8 8.5 8.7
8. Mh-JK (5) RMSE – – – 9.1 9.6 10.2 10.7 11.9
Bias – – – -2.4 -1.0 1.0 2.6 3.1
Standard deviation – – – 8.8 9.5 10.1 10.3 11.5
9. Mh-JK (CAPTURE) RMSE 18.1 11.9 10.6 9.7 9.4 8.8 8.1 8.3
Bias -16.4 -8.5 -6.4 -5.6 -5.0 -4.6 -4.1 -2.9
Standard deviation 7.5 8.3 8.4 8.0 8.0 7.5 6.9 7.8
10. Mh-Ch RMSE 17.4 13.1 12.9 11.4 9.9 8.4 7.2 6.5
Bias -6.5 -8.0 -7.1 -6.2 -5.3 -4.6 -3.9 -3.1
Standard deviation 16.1 10.4 10.8 9.6 8.4 7.1 6.0 5.7
11. Mth-Ch (1) RMSE 90.3 19.7 11.7 8.9 7.6 6.7 6.0 5.3
Bias 55.9 6.7 -1.0 -3.5 -4.3 -4.5 -4.4 -4.2
Standard deviation 70.9 18.6 11.6 8.2 6.3 5.0 4.0 3.2
12. Mth-Ch (2) RMSE 19.6 14.7 12.5 10.3 8.8 7.6 6.6 5.7
Bias -9.3 -9.5 -8.3 -7.3 -6.5 -5.9 -5.3 -4.8
Standard deviation 17.2 11.2 9.4 7.3 5.8 4.8 3.9 3.1
13. Mth-Ch (3) RMSE – 75.8 14.0 9.5 8.0 7.1 6.3 5.6
Bias – 14.3 -2.3 -4.6 -5.1 -5.1 -4.9 -4.6
Standard deviation – 74.5 13.8 8.3 6.2 4.9 3.9 3.1
14. DPM RMSE 15.9 13.9 11.2 8.8 7.0 5.8 5.3 5.5
Bias -13.1 -12.0 -8.9 -6.0 -3.3 -1.0 1.1 3.0
Standard deviation 9.0 7.0 6.8 6.4 6.1 5.7 5.2 4.6
Table A4
RMSE, bias and standard deviation for the UBROrd data (experiment 2), see Section 3.2. The two best estimators for each number of reviewers are shown in
bold. Note that Mh-JK's subestimators sometimes reduce to the same formula, for example, orders 1 and 2 for two reviewers
Number of reviewers: 2 3 4 5 6 7 8 9
1. M0-ML RMSE 13.2 9.4 8.5 7.6 6.9 6.3 5.7 5.1
Bias -6.5 -8.0 -7.7 -7.2 -6.5 -6.0 -5.5 -5.0
Standard deviation 11.5 5.0 3.5 2.7 2.2 1.8 1.4 1.0
2. Mt-ML RMSE 13.8 10.0 8.8 7.9 7.1 6.4 5.8 5.1
Bias -8.1 -8.7 -8.1 -7.4 -6.8 -6.2 -5.7 -5.0
Standard deviation 11.1 4.9 3.4 2.7 2.2 1.8 1.3 1.0
3. Mt-Ch RMSE 12.1 9.5 7.9 6.6 5.3 4.5 3.8 3.4
Bias -9.1 -7.5 -6.2 -5.0 -4.1 -3.3 -3.0 -2.9
Standard deviation 8.0 5.9 4.9 4.3 3.4 3.0 2.3 1.7
4. Mh-JK (1) RMSE 12.1 6.9 4.8 3.8 3.2 2.7 2.3 1.8
Bias -11.2 -5.4 -2.8 -1.4 -0.5 0.1 0.5 0.6
Standard deviation 4.5 4.3 4.0 3.6 3.1 2.7 2.2 1.7
5. Mh-JK (2) RMSE 12.1 5.8 5.4 5.2 4.8 4.4 3.9 3.0
Bias -11.2 -2.4 -0.2 0.7 1.2 1.6 1.5 1.1
Standard deviation 4.5 5.2 5.4 5.1 4.6 4.1 3.5 2.8
6. Mh-JK (3) RMSE – 5.8 6.3 6.6 6.4 6.2 5.5 4.3
Bias – -2.4 0.6 1.3 1.9 2.1 1.5 0.6
Standard deviation – 5.2 6.3 6.4 6.1 5.8 5.3 4.3
7. Mh-JK (4) RMSE – – 6.3 7.3 7.6 7.8 7.0 5.7
Bias – – 0.6 1.5 2.4 2.6 1.6 0.0
Standard deviation – – 6.3 7.2 7.2 7.3 6.8 5.7
8. Mh-JK (5) RMSE – – – 7.3 8.2 9.1 8.2 7.1
Bias – – – 1.5 2.7 3.1 1.8 0.1
Standard deviation – – – 7.2 7.8 8.5 8.0 7.1
9. Mh-JK (CAPTURE) RMSE 13.3 7.1 6.7 6.2 5.4 5.2 3.5 3.1
Bias -12.5 -4.2 -2.7 -2.4 -2.1 -1.9 -2.2 -2.4
Standard deviation 4.5 5.7 6.1 5.7 5.0 4.8 2.8 2.0
10. Mh-Ch RMSE 13.4 8.9 7.5 6.4 5.0 4.4 3.4 2.8
Bias -4.4 -5.5 -4.7 -3.6 -2.8 -2.0 -1.8 -1.9
Standard deviation 12.7 7.0 5.9 5.3 4.1 3.9 2.8 2.1
11. Mth-Ch (1) RMSE 70.3 14.9 7.6 5.4 4.2 3.4 2.8 2.3
Bias 47.8 7.5 0.8 -1.0 -1.6 -1.6 -1.6 -1.6
Standard deviation 51.6 12.9 7.5 5.2 3.9 3.0 2.3 1.7
12. Mth-Ch (2) RMSE 15.7 10.9 8.5 6.6 5.1 4.1 3.4 2.8
Bias -9.6 -8.0 -6.2 -4.8 -3.7 -3.0 -2.5 -2.3
Standard deviation 12.4 7.4 5.8 4.5 3.6 2.9 2.2 1.6
13. Mth-Ch (3) RMSE – 32.4 8.8 5.7 4.5 3.8 3.2 2.6
Bias – 11.0 -0.2 -1.9 -2.3 -2.3 -2.2 -2.1
Standard deviation – 30.5 8.8 5.3 3.8 3.0 2.3 1.6
14. DPM RMSE 11.4 8.8 5.8 4.2 4.9 6.8 8.8 10.9
Bias -9.8 -7.4 -3.7 -0.2 2.9 5.7 8.2 10.6
Standard deviation 5.8 4.8 4.4 4.2 3.9 3.6 3.1 2.6
Table B1
The data for experiment 1. Reviewers 1–13 are from UBRRan and 14–27 from UBROrd, see Section 3.1. 37 faults were present in the document under inspection. A fault found is marked with 1 in the matrix
Fault number Reviewer number Probability
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.52
2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.70
3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.74
4 1 1 1 0.11
5 1 1 1 1 1 1 1 1 1 1 1 1 0.44
6 1 1 1 1 1 1 1 1 1 1 1 0.41
7 1 1 1 1 1 1 1 0.26
8 1 1 1 1 1 1 1 1 1 0.33
9 1 1 1 1 1 1 1 1 1 1 1 0.41
10 1 1 1 1 1 1 1 1 1 0.33
11 1 1 1 1 1 1 1 1 1 1 1 0.41
12 1 1 1 1 1 1 1 0.26
13 0.00
14 1 0.04
15 0.00
16 1 0.04
17 0.00
18 1 1 1 1 1 1 1 1 1 1 0.37
19 1 1 1 1 1 1 1 1 1 1 1 0.41
20 1 1 1 1 1 1 0.22
21 1 1 1 1 1 1 1 1 0.30
22 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.59
23 1 1 1 1 1 1 1 1 1 1 1 1 0.44
24 0.00
25 1 1 1 1 1 0.19
26 1 1 1 1 1 1 1 1 1 0.33
27 1 1 1 0.11
28 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.59
29 1 1 0.07
30 1 1 1 1 1 0.19
31 1 1 1 0.11
32 1 1 1 1 1 1 1 1 1 0.33
33 1 1 1 0.11
34 1 1 1 1 1 0.19
35 1 1 1 1 1 1 1 0.26
36 1 1 1 1 0.15
37 1 0.04
Effectiveness 0.19 0.30 0.11 0.24 0.27 0.30 0.24 0.30 0.24 0.14 0.11 0.30 0.24 0.22 0.32 0.16 0.32 0.24 0.30 0.41 0.27 0.22 0.27 0.51 0.46 0.24 0.38
Mean (reviewers 1–13 / 14–27) 0.23 0.31
Standard deviation (reviewers 1–13 / 14–27) 0.07 0.10
Table B2
The data for experiment 2. Reviewers 1–12 are from CBR and 13–23 from UBROrd, see Section 3.2. 38 faults were present in the document under inspection. A fault found is marked with 1 in the matrix
Fault number Reviewer number Probability
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
1 1 1 1 1 1 1 1 1 0.35
2 1 1 1 1 1 1 0.26
3 1 1 0.09
4 1 1 1 1 1 1 1 1 1 0.39
5 1 1 1 1 1 1 1 0.30
6 1 1 1 1 1 1 1 1 1 1 0.43
7 0.00
8 1 1 1 1 1 1 1 0.30
9 1 1 1 0.13
10 1 1 1 1 1 1 0.26
11 0.00
12 1 1 1 1 0.17
13 1 1 1 1 1 1 1 0.30
14 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.61
15 1 1 1 1 1 0.22
16 1 1 1 1 1 1 1 1 1 1 1 1 0.52
17 1 1 1 1 1 1 1 0.30
18 1 1 1 1 1 1 1 1 0.35
19 1 1 1 1 1 1 1 1 1 0.39
20 1 1 1 1 1 1 1 1 1 1 1 0.48
21 1 1 1 1 1 1 1 0.30
22 1 0.04
23 1 1 1 1 1 1 1 1 1 1 1 0.48
24 0.00
25 1 1 1 1 1 1 1 1 1 0.39
26 1 1 1 1 1 1 1 0.30
27 1 1 1 1 1 1 1 1 0.35
28 1 1 1 1 1 1 1 1 1 0.39
29 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.65
30 1 1 1 1 1 1 0.26
31 1 1 1 1 1 1 1 1 1 0.39
32 1 1 1 1 1 1 0.26
33 1 1 1 0.13
34 1 1 1 1 1 1 0.26
35 1 1 1 0.13
36 1 1 1 1 0.17
37 1 0.04
38 1 1 1 1 1 1 1 1 1 0.39
Effectiveness 0.32 0.21 0.16 0.13 0.29 0.21 0.18 0.50 0.13 0.26 0.16 0.55 0.26 0.50 0.37 0.32 0.29 0.21 0.39 0.26 0.34 0.16 0.34
Mean (reviewers 1–12 / 13–23) 0.26 0.31
Standard deviation (reviewers 1–12 / 13–23) 0.14 0.09
Appendix B. Reviewer data from the experiments
Tables B1 and B2.
References
[1] V.R. Basili, S. Green, O. Laitenberger, F. Lanubile, F. Shull, S.
Sørumgård, M.V. Zelkowitz, The empirical investigation of perspective-
based reading, Empirical Software Engineering 1 (2) (1996) 133–164.
[2] S. Biffl, Using inspection data for defect estimation, IEEE Software 17
(6) (2000) 36–43.
[3] S. Biffl, W. Grossman, Evaluating the accuracy of defect estimation
models based on inspection data from two inspection cycles,
Proceedings of the 23rd International Conference on Software
Engineering, 2001, pp. 145–154.
[4] L. Briand, K. El Emam, B. Freimut, O. Laitenberger, Quantitative
evaluation of capture–recapture models to control software inspec-
tions, Proceedings of the Eighth International Symposium on
Software Reliability Engineering, 1997, pp. 234–244.
[5] L. Briand, K. El Emam, B. Freimut, O. Laitenberger, A comprehen-
sive evaluation of capture–recapture models for estimating software
defect content, IEEE Transactions on Software Engineering 26 (6)
(2000) 518–540.
[6] K.P. Burnham, W.S. Overton, Estimation of the size of a closed
population when capture–recapture probabilities vary among ani-
mals, Biometrika 65 (1978) 625–633.
[7] K.P. Burnham, D.R. Anderson, G.C. White, C. Brownie, K.H. Pollock,
Design and analysis of fish survival experiments based on release–recapture
data, American Fisheries Society Monograph 5 (1987) 1–437.
[8] A. Chao, Estimating the population size for capture–recapture data
with unequal catchability, Biometrics 43 (1987) 783–791.
[9] A. Chao, Estimating population size for sparse data in capture–
recapture experiments, Biometrics 45 (1989) 427–438.
[10] A. Chao, S.M. Lee, S.L. Jeng, Estimating population size for capture–
recapture data when capture probabilities vary by time and individual
animal, Biometrics 48 (1992) 201–216.
[11] A. Chao, Capture–Recapture Models, in: P. Armitage, T. Colton
(Eds.), Encyclopaedia of Biostatistics, Wiley, USA, 1998.
[12] B. Efron, R.J. Tibshirani, An introduction to the bootstrap,
Monographs on Statistics and Applied Probability, 57, Chapman
and Hall, UK, 1993.
[13] S.G. Eick, C.R. Loader, M.D. Long, L.G. Votta, S.A. Vander Wiel,
Estimating software fault content before coding, Proceedings of the 14th
International Conference on Software Engineering, 1992, pp. 59–65.
[14] S.G. Eick, C.R. Loader, S.A. Vander Wiel, L.G. Votta, How many errors
remain in a software design document after inspection? Proceedings
of the 25th Symposium on the Interface, 1993, pp. 195–202.
[15] K. El Emam, O. Laitenberger, T. Harbich, The application of
subjective estimates of effectiveness to controlling software inspec-
tions, Journal of Systems and Software 54 (2) (2000) 119–136.
[16] K. El Emam, O. Laitenberger, Evaluating capture–recapture models
with two inspectors, IEEE Transactions on Software Engineering 27
(9) (2001) 851–864.
[17] M.E. Fagan, Design and code inspections to reduce errors in program
development, IBM Systems Journal 15 (3) (1976) 182–211.
[18] M.E. Fagan, Advances in software inspections, IEEE Transactions on
Software Engineering 12 (7) (1986) 744–751.
[19] B. Freimut, Capture–recapture models to estimate software fault
content, Diploma Thesis, University of Kaiserslautern, Germany, 1997.
[20] B. Freimut, O. Laitenberger, S. Biffl, Investigating the impact of
reading techniques on the accuracy of different defect content
estimation techniques, Proceedings of the Seventh International
Software Metrics Symposium, 2001, pp. 51–62.
[21] I. Jacobson, M. Christerson, P. Jonsson, G. Overgaard, Object-
Oriented Software Engineering: A Use Case Driven Approach,
Addison-Wesley, USA, 1992.
[22] P.M. Johnson, D. Tjahjono, Does every inspection really need a
meeting? Empirical Software Engineering 3 (1) (1998) 9–35.
[23] Math Works, http://www.mathworks.com.
[24] J. Miller, Estimating the number of remaining defects after inspection,
Software Testing, Verification and Reliability 9 (4) (1999) 167–189.
[25] H.D. Mills, On the statistical validation of computer programs, Technical
Report FSC-72-6015, IBM Federal Systems Division, 1972.
[26] D.L. Otis, K.P. Burnham, G.C. White, D.R. Anderson, Statistical
inference from capture data on closed animal populations, Wildlife
Monographs 62 (1978).
[27] H. Petersson, C. Wohlin, An empirical study of experience-based
software defect content estimation methods, Proceedings of the 10th
International Symposium on Software Reliability Engineering, 1999,
pp. 126–135.
[28] H. Petersson, T. Thelin, P. Runeson, C. Wohlin, Capture–recapture in
software inspections after 10 years research—theory, evaluation and
application, CODEN:LUTEDX (TETS-7184)/1-16/2002 & local 14,
Department of Communication Systems, Lund University, 2002.
http://www.telecom.lth.se/Personal/thomast/reports/
Crc10Years_TechnicalReport.pdf.
[29] A. Porter, L. Votta, V.R. Basili, Comparing detection methods for
software requirements inspection: a replicated experiment, IEEE
Transactions on Software Engineering 21 (6) (1995) 563–575.
[30] E. Rexstad, K.P. Burnham, User's guide for interactive program
CAPTURE, Colorado Cooperative Fish and Wildlife Research Unit,
Colorado State University, Fort Collins, CO 80523, USA, 1991.
[31] P. Runeson, C. Wohlin, An experimental evaluation of an experience-
based capture–recapture method in software code inspections,
Empirical Software Engineering 3 (4) (1998) 381–406.
[32] T. Thelin, P. Runeson, Capture–recapture estimations for perspec-
tive-based reading—a simulated experiment, Proceedings of the
International Conference on Product Focused Software Process
Improvement, 1999, pp. 182–200.
[33] T. Thelin, P. Runeson, Fault content estimations using extended curve
fitting models and model selection, Proceedings of the Fourth
International Conference on Empirical Assessment and Evaluation
in Software Engineering, 2000.
[34] T. Thelin, P. Runeson, Robust estimations of fault content with
capture–recapture and detection profile estimators, Journal of
Systems and Software 52 (2–3) (2000) 139–148.
[35] T. Thelin, P. Runeson, B. Regnell, Usage-based reading—an
experiment to guide reviewers with use cases, Information and
Software Technology 43 (15) (2001) 925–938.
[36] T. Thelin, P. Runeson, C. Wohlin, An experimental comparison of usage-
based and checklist-based reading, Proceedings of the First International
Workshop on Inspection in Software Engineering, 2001, pp. 81–91.
[37] S.A. Vander Wiel, L.G. Votta, Assessing software design using
capture–recapture methods, IEEE Transactions on Software Engin-
eering 19 (11) (1993) 1045–1054.
[38] L.G. Votta, Does every inspection need a meeting? Proceedings of the
First ACM SIGSOFT Symposium on Foundations of Software
Engineering, ACM Software Engineering Notes 18 (5) (1993) 107–114.
[39] G.C. White, D.R. Anderson, K.P. Burnham, D.L. Otis, Capture–
recapture and removal methods for sampling closed populations,
Technical Report, Los Alamos National Laboratory, 1982.
[40] C. Wohlin, P. Runeson, J. Brantestam, An experimental evaluation of
capture–recapture in software inspections, Software Testing, Ver-
ification and Reliability 5 (4) (1995) 213–232.
[41] C. Wohlin, P. Runeson, Defect content estimation from review data,
Proceedings of the 20th International Conference on Software
Engineering, 1998, pp. 400–409.