
Evaluating Defect Detection Techniques for Software Requirements Inspections

Filippo Lanubile and Giuseppe Visaggio

University of Bari, Dipartimento di Informatica

Via Orabona, 4 - 70126 Bari, Italy
+39 080 544 3270

{lanubile, visaggio}@di.uniba.it

Abstract

Perspective-Based Reading (PBR) is a family of defect detection techniques which have been proposed to improve the effectiveness of software requirements inspections. PBR drives individual document reading by means of perspective-dependent procedural scenarios, which are different for each inspector in the team.

Based on the former PBR experiments, we present a controlled experiment with more than one hundred undergraduate students who conducted software inspections as part of a university course. We tested an enhanced procedural version of PBR by comparing it with ad hoc reading and checklist-based reading, and analyzing data both at the individual and team level. The experimental results do not support previous findings that PBR improves defect detection effectiveness with respect to nonsystematic reading techniques. We also investigated team composition issues by assessing the effects of combining different or identical perspectives at inspection meetings. The empirical test did not meet the assumption that different perspectives in an inspection team achieve a higher coverage of a document.

Process conformance has been explicitly considered and used to check whether readers departed from the assigned guidelines and then to filter the raw data. We discovered that only one fifth of PBR reviewers had actually followed their scenarios, and even many checklist reviewers had reverted to an ad hoc reading style. From the debriefing questionnaires, we measured the reviewers' self-assessment of the assigned reading techniques. The results revealed no relationship with the type of reading technique, but the subjective evaluation was significantly related to trust in process conformance. This experiment provides evidence that process conformance issues play a critical role in the successful application of reading techniques and, more generally, software process tools.

Keywords: perspective-based reading, software reading techniques, software inspection, requirements document, process conformance.

1 Introduction

Software inspection is one of the industry best practices for delivering high-quality software (Wheeler et al., 1996). The main benefit of software inspections derives from applying inspections early during software development, thus preventing the exponential growth of defect repair cost (Boehm, 1981).

Software inspection is a structured process for the static verification of software documents, including requirements specifications, design documents, and source code. From the seminal work of Fagan (Fagan, 1976; Fagan, 1986) to its variants (Humphrey, 1989; Gilb and Graham, 1993), the software inspection process is essentially made up of four consecutive steps: planning, preparation, meeting, and rework. During planning, a member of the inspection team sends the inspection material to the rest of the team and makes a schedule for the next steps. During preparation, each inspector in the team individually understands and reviews the document to find defects. During the meeting, all the inspectors meet to collect and discuss the defects from the individual reviews and further review the document to find new defects. Finally, during rework, the author revises the document to fix the defects. The main change from Fagan's original inspection has been a shift of primary goals for the preparation and meeting steps. The main goal for preparation has changed from pure understanding to defect detection, and so inspectors have to individually take notes of defects. Consequently, the main goal of the inspection meeting has been reduced from defect discovery to defect collection, including the discussion of defects individually found during preparation.

Among the many sources of variation in software inspections, Porter et al. (1998) have shown that changes in the inspection process structure can cut inspection cost and shorten the inspection interval but do not improve inspection effectiveness (basically measured as the number or density of defects found). Rather, reading techniques for analyzing documents are the key to improving inspection effectiveness.

1.1 Reading Techniques for Defect Detection

In a software inspection, document reading is first performed by inspectors working alone during preparation. Many inspectors read documents for defect detection by using ad hoc or checklist techniques. Ad hoc reading for defect detection is a very nonsystematic technique, which leaves inspectors free to rely on their own review guidelines to find defects. Checklist reading for defect detection requires inspectors to read the document while answering a list of yes/no questions, based on past knowledge of typical defects. Checklist reading can also be considered a nonsystematic technique, although less so than ad hoc reading, because it does not provide a guideline on how to answer the questions.

Scenario-based reading techniques have been proposed to support inspectors throughout the reading process in the form of operational scenarios (Basili, 1997). A scenario consists of a set of activities aimed at building a model plus a set of questions tied to that model. While building the model and answering the questions, the reader should write down the defects he finds within the document. Each reader in the inspection team gets a different and specific scenario in order to minimize the overlap of discovered defects among team members and thus increase the inspection effectiveness after defect collection at the meeting. Two families of scenario-based reading techniques have been generated for defect detection in software requirements documents.

1.1.1 Defect-Based Reading

The first family of scenario-based reading techniques, Defect-Based Reading (DBR), was defined for detecting defects in requirements documents written in a state machine notation for event-driven process control systems (Porter et al., 1995). Each DBR scenario is based on a different class of requirements defects and requires a different model to be built before answering specific questions. In order to empirically validate their proposal, controlled experiments were first run with graduate students at the University of Maryland (Porter et al., 1995) and then with professional software developers from Lucent Technologies (Porter and Votta, 1998). Both experiments showed that DBR was significantly more effective than ad hoc and checklist reading. Replications were performed by other researchers (Fusaro et al., 1997; Miller et al., 1998; Sandahl et al., 1998) who reused the same experimental material and slightly changed the experimental design. External replications did not measure any improvement in inspection effectiveness. Since all the external replications were conducted with undergraduate students, the main hypothesis for interpreting the difference in results is the lower familiarity with review activities, the requirements specification language, and the software domains.

1.1.2 Perspective-Based Reading

Perspective-Based Reading (PBR) is another family of scenario-based reading techniques which have been proposed to improve the inspection effectiveness for requirements documents expressed in natural language (Basili et al., 1996; Shull et al., 2000). The idea behind PBR is that the various customers of a product (here, the requirements document) should read the document from a particular point of view. For PBR, the different roles are those within the software development process, e.g., analyst, tester, user. To support the inspector throughout the reading process, operational descriptions, i.e., scenarios, are provided for each role. PBR has three basic properties:

• It is systematic, because it provides a procedure for how to read a document.

• It is specific, because the reader is only responsible for his role and the defects he can find from his particular point of view.

• It is distinct, because readers in a team have different roles and there is the assumption that the overlap of defects found between readers with different roles is kept to a minimum.

PBR was first empirically validated with software developers of the NASA/Goddard Space Flight Center (Basili et al., 1996). The results showed that nominal1 teams using PBR achieved better coverage of documents and were more effective at detecting defects in requirements documents than teams which did not use PBR. These results were confirmed in a replicated experiment conducted by other researchers with undergraduate students and real team meetings (Ciolkowski et al., 1997). PBR has also been tailored and applied for reviewing source code documents in an industrial setting (Laitenberger and DeBaud, 1997). Results from an empirical comparison with checklist-based inspection show that PBR is more effective and less costly than using checklists (Laitenberger et al., 1999).

From the empirical studies of PBR, an unexpected effect was that individuals applying PBR, rather than just teams, were more effective at defect detection than individuals applying less systematic reading techniques. This effect seems to provide evidence that the first property of PBR, being systematic, might be a sufficient cause for improved inspection effectiveness.

A more procedurally oriented version of PBR was applied in a related empirical investigation2 (Lanubile et al., 1998). To increase the specificity of the PBR techniques, more detailed guidelines for each perspective were provided to inspectors, with specific questions distributed at key points in the guidelines.

1.2 Research Questions and Hypotheses

We were interested in further assessing the effects of systematic reading techniques on defect detection. Our main research question is the following:

• Are there differences in defect detection effectiveness between reviewing requirements documents using systematic reading techniques and reviewing requirements documents using nonsystematic reading techniques?

Focusing on the enhanced procedural version of PBR as the systematic reading technique, and based on findings from previous studies, our hypotheses are the following:

1. Inspection teams applying PBR find a higher percentage of defects than inspection teams applying nonsystematic reading techniques, such as ad hoc reading and checklists.

2. Individual reviewers applying PBR find a higher percentage of defects than individual reviewers applying nonsystematic reading techniques, such as ad hoc reading and checklists.

As a by-product of the main research question we are able to compare the performance of the different PBR scenarios. Our secondary research question is the following:

• Are there differences in defect detection effectiveness between reviewing requirements documents using different PBR scenarios?

If the scenarios have been fairly developed, the hypothesis is the following:

3. Individual reviewers applying PBR find the same percentage of defects with any of the assigned scenarios.

However, we are also interested in assessing the effects of having distinct roles when composing inspection teams. Our research question is the following:

• Are there differences in defect detection effectiveness between reviewing requirements documents having unique roles in an inspection team and reviewing requirements documents having identical roles in an inspection team?

Based on the theory of scenario-based reading, which states that the coordination of distinct and specific scenarios achieves a higher coverage of documents, our hypothesis is the following:

4. Inspection teams composed of unique PBR perspectives are more effective at detecting defects than inspection teams composed of identical PBR perspectives.

We have investigated these research questions and tested the research hypotheses by means of a controlled experiment in a classroom environment with more than one hundred undergraduate students.

1 There were no real team meetings in the experiments. Meetings were simulated by pooling defects found during individual preparation into nominal team defect logs.
2 The goal of the experiment was to understand the effect of abstracting errors from faults in requirements documents rather than to compare PBR with other reading techniques. Because of the experimental design, differences between reading techniques are confounded with other factors.


1.3 Paper Outline

The remainder of this paper is organized as follows. Section 2 describes the experiment, including the variables, design, threats to validity, instrumentation, and execution. Section 3 presents the results from data analysis. The final section summarizes and discusses our findings.

2 The Experiment


The experiment was conducted as part of a two-semester software-engineering course at the University of Bari.

Since software requirements specification and software inspections were included in the first half of the course syllabus, the experiment was run as a midterm exam and the reviewers' performance was therefore subject to grading. Subjects were third-year undergraduate students of the computer science department. Because midterm exams are optional in Italian academic courses, participation was on a volunteer basis. However, the "premium" grades made most of the students participate seriously in the experiment.

The experiment simulated in a classroom environment the preparation and meeting steps of an inspection process. We conducted two runs of the experiment. All subjects, with few exceptions, participated in both runs of the experiment, each corresponding to a different software inspection. Some differences between runs were planned in advance and will be described later in the experimental design section, while some changes were introduced after the first run was over, based on subjects' feedback, and will be described in the execution section.

2.1 Variables

The independent variables are the variables whose values (or levels) determine the different experimental conditions to which a subject may be assigned. We manipulated the following independent variables:

• The reading technique. Subjects and then teams can apply a systematic reading technique (PBR) or a nonsystematic reading technique (Ad Hoc or Checklist).

• The team coordination: PBR teams are further divided into teams made up of different perspectives (multi-perspective PBR) and teams made up of identical perspectives (mono-perspective PBR). Only multi-perspective PBR teams are consistent with the concept of coordinated teams with focused and distinct responsibilities, while mono-perspective PBR teams have focused but identical responsibilities. Teams applying an Ad Hoc or Checklist reading technique have unfocused and identical responsibilities.

• The perspective. Within PBR, a subject may use only one scenario based on one of the assumed perspectives. For this experiment we used the following three perspectives:

- Use Case Analyst (UCA)

- Structured Analyst (SA)

- Object-Oriented Analyst (OOA)

In the former PBR experiments, only the first two perspectives had been used and the third one was the perspective of a tester. However, since the experiment had to be run as a midterm exam, we could not have taught and exercised the students in testing. Because the first part of the course focuses on requirements specification and analysis methods, all the perspectives had to center around system analysis activities.

The subjects who do not follow a scenario (i.e., those with the Ad Hoc and Checklist reading techniques) do not have any associated perspective.

We measured the following dependent variables:

• The individual defect detection rate: the number of true defects reported by individual inspectors divided by the total number of known defects in the document

• The team defect detection rate: the number of true defects reported at the inspection meeting divided by the total number of known defects in the document (both detection rates are illustrated in the sketch following this list)

• Preparation time: time spent, in minutes, by an inspector for the individual preparation

• Meeting time: time spent, in minutes, by a team for the inspection meeting
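
As a concrete illustration of the dependent variables defined above, the following minimal sketch (our own, not part of the original study; the names and the toy values are hypothetical) computes an individual and a team detection rate against a benchmark of known defects, and a time measure in minutes.

```python
# Hypothetical sketch: computing the dependent variables defined above.

def detection_rate(reported_true_defects, known_defects):
    """Number of true defects reported divided by the total number of known defects."""
    return len(set(reported_true_defects) & set(known_defects)) / len(known_defects)

def minutes(start, finish):
    """Preparation or meeting time in minutes, given datetime start/finish times."""
    return (finish - start).total_seconds() / 60.0

# Toy example for the ATM document, which contains 29 known defects (IDs 1..29).
known_atm = set(range(1, 30))
individual_rate = detection_rate({2, 5, 11, 17, 23}, known_atm)           # 5/29, about 0.17
team_rate = detection_rate({2, 3, 5, 9, 11, 14, 17, 23, 28}, known_atm)   # 9/29, about 0.31
```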


We also collected additional data from two debriefing questionnaires, one for the individual preparation and another for the inspection meeting. Most of the questions were in closed form and were useful for measuring conformance to the assigned guidelines, self-confidence in the inspection results, understanding of the detection techniques, and satisfaction with the inspection process.

2.2 Design

The main goals of this experiment are to compare systematic reading techniques (PBR) vs. nonsystematic reading techniques (Ad Hoc and Checklist), and to compare distinct vs. overlapping roles with scenario-based reading techniques. With respect to the former PBR experiment, we introduced a new defect detection technique, checklist, and a new organization of inspection teams, composed of identical PBR perspectives (mono-PBR). Hence, we could not reuse the experiment plan from the previous experiments and we decided on an entirely new design.

We first decomposed the experiment into two runs, the second run a week after the first. Each run required subjects to inspect a requirements document, starting with an individual preparation and finishing with a team meeting. The runs had the following differences:

- The document to be inspected: ATM in the first run and PG in the second run (see the Instrumentation section for a brief description of these documents)

- The nonsystematic reading technique used: Ad hoc in the first run and Checklist in the second run.

- The inspection goal: "find as many defects as you can with the help of a defect detection technique" in the first run, and "find as many defects as you can while following a defect detection technique" in the second run (see the Execution section for the rationale behind the goal shift).

The same subjects participated in both runs using the same technique. However, reviewers assigned to a nonsystematicreading technique applied ad hoc reading in the first run and checklist reading in the second run of the experiment.

In the individual preparation, the experimental plan consists of one independent variable (the reading technique) with two main levels: the systematic reading technique (PBR) and the nonsystematic reading technique (Ad Hoc or Checklist). Nested in the PBR level, there are three perspectives (UCA, SA, and OOA). The reading technique and the perspective variables vary between subjects because no subjects are exposed to more than one experimental condition. Subjects were randomly assigned to the experimental conditions.

In the team meeting, the experimental plan consists of one independent variable (the reading technique) with two main levels: the systematic reading technique (PBR) and the nonsystematic reading technique (Ad Hoc or Checklist). Nested in the PBR level, there are two levels of the team coordination variable: the multi-perspective PBR and the mono-perspective PBR. The reading technique and the team coordination variables vary between teams, which are the units of analysis. Subjects were randomly assigned to the inspection teams and teams were randomly assigned to the experimental conditions. Teams had to be composed of three persons, but in some cases we had to create four-person teams because of spare people to accommodate.
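
A rough sketch of this kind of random assignment follows. It is our own reconstruction, not the authors' procedure; in particular it deals subjects round-robin into conditions and does not distinguish multi- from mono-perspective teams, so group sizes will differ from those in Table 1.

```python
import random

def assign_subjects(subjects, conditions, seed=42):
    """Randomly deal subjects into experimental conditions (shuffle, then round-robin)."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {c: [] for c in conditions}
    for i, s in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(s)
    return groups

def form_teams(subjects, team_size=3, seed=42):
    """Split subjects into teams; spare subjects join earlier teams as a fourth member."""
    rng = random.Random(seed)
    pool = list(subjects)
    rng.shuffle(pool)
    teams = [pool[i:i + team_size] for i in range(0, len(pool), team_size)]
    if len(teams) > 1 and len(teams[-1]) < team_size:
        for j, spare in enumerate(teams.pop()):
            teams[j].append(spare)
    return teams

# Toy usage: 114 anonymous subject IDs dealt into the four first-run conditions,
# then the PBR subjects grouped into three-person inspection teams.
groups = assign_subjects(range(114), ["ADHOC", "UCA", "SA", "OOA"])
teams = form_teams(groups["UCA"] + groups["SA"] + groups["OOA"])
```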

Table 1 shows the experimental design for each of the two experimental runs. Differences in the number of subjects and teams between the two runs are due to subject withdrawals (see the Execution section for more details).

2.3 Threats to Validity

This section discusses the threats to validity that are relevant for our experiment. To rule out the threats we could not overcome or mitigate, other experiments may use different experimental settings, with other threats to validity of their own. Basili et al. (1999) discuss how processes, products, and context models have an impact on experimental designs in the software engineering domain.

2.3.1 Threats to Internal Validity

Threats to internal validity are rival explanations of the experimental findings that make the cause-effect relationship between independent and dependent variables more difficult to believe. We identified the following threats to internal validity:


First Run - ATM doc

Individual Preparation - week 1 - day 1
  Reading Technique          Perspective                      #subjects
  Nonsystematic (Ad Hoc)     no perspective (NONE)            37
  Systematic (PBR)           use case analyst (UCA)           25
                             structured analyst (SA)          26
                             object-oriented analyst (OOA)    26

Inspection Meeting - week 1 - day 2
  Reading Technique          Team Coordination                                               #teams
  Nonsystematic (Ad Hoc)     unspecific / identical responsibilities (Ad Hoc)                12
  Systematic (PBR)           specific / distinct responsibilities (multi-perspective PBR)    14
                             specific / identical responsibilities (mono-perspective PBR):
                               UCA only                                                      4
                               SA only                                                       4
                               OOA only                                                      4

Second Run - PG doc

Individual Preparation - week 2 - day 1
  Reading Technique          Perspective                      #subjects
  Nonsystematic (Checklist)  no perspective (NONE)            34
  Systematic (PBR)           use case analyst (UCA)           24
                             structured analyst (SA)          26
                             object-oriented analyst (OOA)    25

Inspection Meeting - week 2 - day 2
  Reading Technique          Team Coordination                                               #teams
  Nonsystematic (Checklist)  unspecific / identical responsibilities (Checklist)             11
  Systematic (PBR)           specific / distinct responsibilities (multi-perspective PBR)    12
                             specific / identical responsibilities (mono-perspective PBR):
                               UCA only                                                      4
                               SA only                                                       4
                               OOA only                                                      4

Table 1. Experimental Plan. The experiment consists of two experimental runs. In the first run subjects reviewed the ATM requirements document and in the second run the PG requirements document. In each run, on the first day subjects performed an individual preparation and on the second day an inspection meeting. The same subjects participated in both runs using the same technique.


History. The history threat refers to specific events that might occur between successive measurements and influence the dependent variables in addition to the experimental variable. In our experiment there were four different points of measurement. Because the experimental tasks were part of a midterm exam, the highest risk event is plagiarism, with subjects exchanging information in the intervals between tasks. This might be the case for the two one-day intervals between individual preparations and team meetings. To reduce this risk, we told students that only individual tasks were subject to grading. Furthermore, the individual defect lists were collected after individual preparation and returned to subjects just before the team meeting. Plagiarism could not occur between the two experimental runs because the requirements documents were different.

Maturation. The maturation threat refers to changes within the subjects that occur over time and influence the dependent variables in addition to the experimental variable. Possible changes might occur due to boredom, tiredness, or learning. Boredom might have affected the second run of the experiment, because subjects had to perform a second complete inspection using the same technique. However, because the inspections were assessed as midterm exams, we believe that the concern for grading was stronger than some initial boredom. Tiredness occurs when too much effort is required of subjects. In our experiment, four hours were allocated for each experimental task, each inspection activity was performed on a distinct day, and the two complete inspections were conducted in different weeks. While boredom and tiredness tend to degrade performance, learning tends to amplify it. Although we minimized the learning effect by teaching requirements analysis and review and having a training session before the experiment itself, we cannot exclude that learning was still in progress during the experiment. However, the learning effect should be symmetric across the values of the independent variables, because all the subjects were novices with respect to any defect detection technique.

Instrumentation. The instrumentation threat refers to changes in the measuring instrument or changes in the observers or scores used. In our experiment, two different requirements documents were used, one for each inspection run. Although the specifications have approximately the same structure, size, and number of defects, we cannot exclude that the difference in problem domain might have an effect on inspection effectiveness. However, the requirements to be reviewed change symmetrically with respect to the independent variables, and in the former PBR studies no interaction effects were observed between documents and reading techniques.

Selection. The selection threat refers to natural differences in human performance. In our experiment, we reduced the selection effect by randomly assigning subjects to defect detection techniques, and we chose individuals at random to form inspection teams. We had a large enough number of subjects to be confident that a few talented people could not mask differences in reading technique performance.

Mortality. The mortality threat refers to differences in the loss of subjects from the comparison groups. Because our subjects were highly motivated by grading, we did not expect many subject drop-outs. We decided that we would use only data points of subjects and teams who completed an entire inspection.

Process conformance. The process conformance threat refers to changes that the subjects autonomously apply to the process they are supposed to follow. In our experiment, we discovered from the answers to the questionnaires of the first run that many subjects were not following the assigned systematic technique, thus reverting to a nonsystematic technique. Subjects were just concentrating on successfully accomplishing the inspection goal: find as many defects as you can. Although we knew that we could not strictly enforce the application of the assigned technique, we wanted subjects to really try it. Therefore, before the second inspection, we told subjects that the reading techniques had to be actually followed in order to be positively graded, and not just considered an option. The subjects took the announcement seriously because they had to return the analysis models and had to cross-reference the defects with the questions in their procedure. However, the effect of this change is confounded with maturation and instrumentation changes, and so we cannot assess it separately. We finally decided to perform distinct analyses of the two experiment runs, and then draw conclusions from each run.

2.3.2 Threats to External Validity

Threats to external validity are factors that limit the generalization of the experimental results to the context of interest, here the industrial practice of software inspections. For our experiment, we can identify the following threats to external validity:

Representative subjects. Our students may not be representative of the population of software professionals. However, a former PBR experiment with NASA developers (Basili et al., 1996) failed to reveal a significant relationship between PBR inspection effectiveness and reviewers' experience. Probably, being a software professional does not imply that one's experience matches the skills that are relevant to the object of study.


Representative artifacts. The requirements documents inspected in this experiment may not be representative of industrial requirements documents. Our documents are smaller and simpler than industrial ones, although in industrial practice long and complex artifacts are inspected in separate pieces.

Representative processes. The inspection process in this experiment may not be representative of industrial practice. Although there are many variants of the inspection process in the literature and industry, we conducted inspections on the basis of a widespread inspection process (Gilb and Graham, 1993). However, our inspections differ from the industrial practice of inspections because individual preparations were not performed at subjects' own desks with possible interruptions, and inspection meetings did not include the document's author.

All these threats are inherent to running classroom experiments and can only be overcome by conducting replications with people, products, and processes from an industrial context.

2.4 Instrumentation

The experiment reused most of the material from a previous PBR experiment (Lanubile et al., 1998). The material is available as a lab package on the web (Shull, 1998), but we had to translate everything from English to Italian, otherwise many students would not have been comfortable reading and using it.

The material includes the requirements documents, instructions and aids for each defect detection technique, defect report forms to be used both for the individual preparation and the team meeting, and debriefing questionnaires.

2.4.1 Requirements Documents

The software requirements specifications were written in natural language and adhered to the IEEE format for SRS (IEEE, 1984). The requirements documents used for the experiment were:

• Automated Teller Machine (ATM), 17 pages long and containing 29 defects

• Parking Garage control system (PG), 16 pages long and containing 27 defects

2.4.2 Defect Detection Techniques

Defect detection was supported by means of instruction documents, which were specific for each reading technique: ad hoc, checklist, and PBR. The PBR instructions were composed of three distinct scenarios, one for each perspective.

Ad hoc reviewers received a defect taxonomy including definitions for the main classes of requirements faults: missing information, ambiguous information, inconsistent information, incorrect fact, and extraneous information.

Checklist reviewers received a single checklist derived from the defect taxonomy, with 17 questions covering all the defect classes. The checklist was not present in the lab package, so we created the questions by detailing the defect class definitions.

PBR reviewers received one of three scenarios corresponding to the UCA, SA, and OOA perspectives. Each scenario contains a stepwise guide for creating an abstract model from the requirements document and model-specific questions distributed at key points in the guidelines. The models to be built, and thus the guidelines and questions, are different for each scenario:

• UCA: the scenario requires creating a use case diagram, including use case descriptions, and answering 12 questions.

• SA: the scenario requires creating a hierarchy of data flow diagrams, and answering 9 questions.

• OOA: the scenario requires creating a class diagram, including attributes and operations, and answering 11 questions.

All the scenarios were reused with modifications from the lab package. However, the OOA scenario had never been applied in any former PBR experiment.

PBR reviewers also received a scenario-specific model skeleton to be used for model drawing. The diagrams had to be returned together with the list of defects found, in order to check whether the reviewers had really built the required abstraction while reviewing the document.


2.4.3 Defect Report Forms

Defect report forms had to be filled out by individuals after the inspection preparation and by a team recorder after the inspection meeting.

A defect report form contains a header and an entry for each defect reported. The header includes various identifiers such as the SRS name, reviewer name, team name, date, and initial and finish times. A defect entry includes a progressive defect number, the defect location (requirement identifier and page number), a textual description, and the question in the reading technique that was helpful for defect discovery. This last field was not applicable to ad hoc reviewers and was optional for the others. It was included as a traceability mechanism to understand whether inspectors had actually tried to answer the questions, and thus to check for process conformance.
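
As an illustration only (the field names below are ours, not those printed on the actual forms), the information recorded on a defect report form could be modeled as follows.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DefectEntry:
    """One row of a defect report form, as described above."""
    number: int                         # progressive defect number
    requirement_id: str                 # defect location: requirement identifier
    page: int                           # defect location: page number
    description: str                    # textual description of the defect
    question_id: Optional[str] = None   # question that helped discovery (n/a for ad hoc, optional otherwise)

@dataclass
class DefectReport:
    """Header plus entries, filled in during preparation or at the meeting."""
    srs_name: str
    reviewer_or_team: str
    date: str
    start_time: str
    finish_time: str
    entries: List[DefectEntry] = field(default_factory=list)
```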

2.5 Training

All subjects taking a course in software engineering for undergraduates were prepared with a set of lectures on requirements specifications, software inspections, and analysis model building.

We gave a 2-hour lecture on the IEEE standard for SRS and taught the requirements defect taxonomy. A requirements document for a course scheduler system was presented and an assignment was given for finding defects. The results were discussed in class and a list of known defects was written out according to the schema of the defect report forms.

Another 2-hour lecture was given on software inspections, explaining the goals and the specific process to be used in this study. We then introduced a new requirements document for a video rental system, which was available in the experiment lab package for training purposes. As a trial inspection, students were asked to individually read the document and record defects on the defect report forms to be used in this experiment. We then created teams, assigned roles inside the teams (moderator, reader, and recorder), and a trial inspection meeting was conducted. After the trial inspection we discussed with the students the list of known defects and what defects they had found.

Afterwards, we gave a set of lectures on requirements analysis in which we taught use case analysis, structured analysis, and object-oriented analysis. For each analysis method, we presented and discussed with the students the analysis models for the course scheduler system. Students were given three assignments (use case model building, data flow model building, and class model building) in which we asked them to build analysis models for the video rental system. The assignments were started in class, to allow students to ask questions, and then completed at home. The results from each assignment were discussed in class and a solution was presented and commented on together with the students.

Finally, we spent one lecture presenting the defect detection techniques and the experiment organization. We also communicated the outcomes of randomly assigning subjects to the experimental conditions. Teams were left free to choose the team roles of moderator, reader, and recorder.

2.6 Execution

The experiment was run as a midterm exam. Each experiment run, corresponding to a separate inspection (ATM document first and then PG document), took two consecutive days, one for individual preparation and one for the team meeting. The second run was scheduled one week after the first run.

Subjects always worked in two big rooms with enough space to avoid plagiarism and confusion. We were always present to answer questions and prevent unwanted communication. Each experimental task was limited to four hours, and before leaving subjects were asked to complete a debriefing questionnaire.

Before each individual preparation step, subjects were given a package containing the requirements document, specific instructions for the assigned reading technique, and blank defect report forms. After each individual preparation step, we collected all the material. This material was returned to subjects before the inspection meeting together with new blank defect report forms. At the inspection meeting, the reader paraphrased each requirement and the team discussed defects found during preparation or any new defect. The moderator was responsible for managing discussions and the recorder for filling out the team's defect report forms.

Immediately after the first inspection, a preliminary analysis of the questionnaires was performed and the results were fed back to the students before the second inspection. From the questionnaire answers and discussions with students we realized that many subjects were concentrating on finding as many defects as they could without applying the assigned technique. There was an uncontrolled migration of subjects from the systematic reading technique, PBR, to the nonsystematic reading technique, ad hoc reading. PBR reviewers were complaining about having to build a model and follow a procedure. On the other hand, ad hoc reviewers were complaining about not having a guide. We then made two major changes for the second run of the experiment. First, we told subjects that they would be graded also with respect to their ability to follow the assigned process. We could check for process conformance by assessing the model developed by following a reading scenario and by counting the percentage of defects which were cross-referenced with the questions in the reading technique. The second change was to replace ad hoc reading with checklist-based reading. With this change, reviewers applying a nonsystematic reading technique would also have a guide, albeit not a procedural one. Five subjects, three from the same team, did not participate in the second inspection and we had to cancel two teams.

2.7 Data Collection

We collected data through individual defect report forms, team defect report forms, and questionnaires. We validated the reported defects by comparing location and description information with those in the master defect list from the first PBR experiment. All the reported defects that could be matched to some known defect were original true defects. The other reported defects could be classified as other true defects, duplicates, false positives, or don't cares. The number of known defects in the original PBR experiment was 29 for ATM and 27 for PG. After the first PBR study, other defects were discovered, including some found by our reviewers. After adding these other true defects to the original ones, the total number of true defects amounts to 32 for ATM and 36 for PG. However, to make the data analysis results directly comparable to the figures in former PBR experiments, we will consider for our analysis only the original true defects, which will be treated as a benchmark of seeded defects.
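
The validation step can be pictured with a small sketch (a hypothetical reconstruction; in the study the matching against the master defect list was done by the experimenters, and unmatched reports were classified by hand as other true defects, duplicates, false positives, or don't cares).

```python
# Hypothetical sketch of defect validation against the master (seeded) defect list.
# A reported defect is matched on its location (requirement id, page); unmatched
# reports are set aside for manual classification.

def validate(reported, master):
    """reported: list of dicts with 'req_id' and 'page'; master: dict location -> defect id."""
    matched, unmatched = set(), []
    for d in reported:
        key = (d["req_id"], d["page"])
        if key in master:
            matched.add(master[key])     # original true defect (counted once)
        else:
            unmatched.append(d)          # left for manual classification
    return matched, unmatched

master_atm = {("R3.1", 5): 7, ("R4.2", 8): 12}     # toy benchmark: location -> defect id
reported = [{"req_id": "R3.1", "page": 5}, {"req_id": "R9.9", "page": 16}]
matched, unmatched = validate(reported, master_atm)
detection_rate = len(matched) / 29                 # 29 known (seeded) defects in ATM
```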

3 Results


Because of the major changes we made between the two runs of the experiment, we conduct separate analyses for each run. In the following, we first test the stated hypotheses by performing an analysis of defect detection effectiveness at the team and individual levels. Next, we analyze the individual performance with respect to process conformance. We then analyze the subjective evaluation of the reading techniques, looking at the answers in the debriefing questionnaires. Finally, we analyze the relationship between time and defect detection effectiveness.

3.1 Analysis of Team Performance

For the team analysis we have a between-groups nested design. The first factor is the reading technique (RTECH) with two levels:

1. nonsystematic (ADHOC in the first run and CHKL in the second run)

2. systematic (PBR).

The second factor is the team coordination (TCOORDIN) with three levels:

1. unspecific/identical responsibilities (ADHOC in the first run and CHKL in the second run)

2. specific/identical responsibilities, i.e., mono-perspective PBR (MONOPBR).

3. specific/distinct responsibilities, i.e., multi-perspective PBR (MULTIPBR)

The first TCOORDIN level only appears within the first RTECH level (in fact they share the same name, ADHOC in the first run and CHKL in the second run), while the other two TCOORDIN levels occur within the PBR level of the RTECH factor.

The dependent variable is the team defect detection rate (TDEFRATE), which is defined as the number of true defects reported at the inspection meeting divided by the total number of known defects in the document.

Figure 1 presents the distribution of the team defect detection rate for both experimental runs using boxplots. Boxplots allow one to visualize and quickly assess the strength of the relationship between the grouping and dependent variables. Boxplots are also used to view those values which deviate from the central tendencies of their respective groups. As can be seen, there are two outliers and one extreme value for the MULTIPBR group in the first run and for the CHKL group in the second run. For each group, we applied the Shapiro-Wilk W test of normality and we verified the equality of variances assumption with the Brown-Forsythe test. The statistics were not significant at the 0.05 level, so we could proceed to perform a parametric analysis of variance (ANOVA) to test for significant differences between means.
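
For readers who want to reproduce this kind of preliminary check, the sketch below (not the authors' code; the group values are made up) applies the Shapiro-Wilk test to each group and the Brown-Forsythe test across groups using SciPy.

```python
import numpy as np
from scipy import stats

def check_assumptions(groups, alpha=0.05):
    """groups: dict mapping group name -> array of defect detection rates."""
    for name, values in groups.items():
        w, p = stats.shapiro(values)                 # Shapiro-Wilk normality test
        print(f"{name}: W = {w:.3f}, p = {p:.3f}", "(normality rejected)" if p < alpha else "")
    # Brown-Forsythe test = Levene's test with the group median as center
    f, p = stats.levene(*groups.values(), center="median")
    print(f"Brown-Forsythe: F = {f:.3f}, p = {p:.3f}")

# Toy example with three team-coordination cells (values are made up).
check_assumptions({
    "ADHOC":    np.array([0.38, 0.45, 0.41, 0.52, 0.36, 0.43]),
    "MULTIPBR": np.array([0.40, 0.43, 0.39, 0.48, 0.37, 0.42]),
    "MONOPBR":  np.array([0.44, 0.42, 0.47, 0.35, 0.41, 0.46]),
})
```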


The two-way ANOVA for the hierarchically nested design allows us to test two null hypotheses concerning the effects of the two factors on the team defect detection rate. The null and alternative hypotheses for the first factor may be stated as follows:

H10: there is no difference between teams using a nonsystematic reading technique and teams using a systematic reading technique with respect to defect detection rate.

H1a: there is a difference between teams using a nonsystematic reading technique and teams using a systematic reading technique with respect to defect detection rate.

The null and alternative hypotheses for the second factor may be stated in a similar fashion:

H20: within the group of teams with specific responsibilities, there is no difference between teams with identical responsibilities and teams with distinct responsibilities with respect to defect detection rate.

H2a: within the group of teams with specific responsibilities, there is a difference between teams with identical responsibilities and teams with distinct responsibilities with respect to defect detection rate.

Because of the hierarchical design, interactions of the nested factor (TCOORDIN) with the factor in which it is nested (RTECH) cannot be evaluated. Thus, in the present study we cannot test the hypothesis that the reading technique property of being systematic and the particular team coordination interact in their effect on the defect detection rate.
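
Because team coordination is nested within the PBR level only, the two 1-df tests reported in Table 2 can be viewed as planned contrasts evaluated against the pooled within-cell mean square. The sketch below is our own approximation of such an analysis; it is not the authors' code and may differ in detail, for example in how the nonsystematic-vs-PBR contrast weights the two PBR cells.

```python
import numpy as np
from scipy import stats

def nested_contrasts(groups):
    """groups: dict cell name -> array of team defect detection rates
    (cells: 'ADHOC' or 'CHKL', 'MULTIPBR', 'MONOPBR')."""
    # Pooled within-cell (error) variance, as in a one-way ANOVA on the three cells.
    ss_error = sum(float(((v - v.mean()) ** 2).sum()) for v in groups.values())
    df_error = sum(len(v) - 1 for v in groups.values())
    ms_error = ss_error / df_error

    def contrast(names, weights):
        means = np.array([groups[n].mean() for n in names])
        ns = np.array([len(groups[n]) for n in names], dtype=float)
        w = np.array(weights, dtype=float)
        ss = (w @ means) ** 2 / np.sum(w ** 2 / ns)   # 1-df contrast sum of squares
        f = ss / ms_error
        return f, stats.f.sf(f, 1, df_error)          # F and p against F(1, df_error)

    nonsys = "ADHOC" if "ADHOC" in groups else "CHKL"
    return {
        # RTECH: nonsystematic cell vs. the average of the two PBR cells
        "RTECH": contrast([nonsys, "MULTIPBR", "MONOPBR"], [1.0, -0.5, -0.5]),
        # TCOORDIN (within PBR): multi- vs. mono-perspective teams
        "TCOORDIN": contrast(["MULTIPBR", "MONOPBR"], [1.0, -1.0]),
    }

# Example with made-up team rates for the first run's three cells.
results = nested_contrasts({
    "ADHOC":    np.array([0.41, 0.45, 0.38, 0.52, 0.36, 0.40]),
    "MULTIPBR": np.array([0.43, 0.39, 0.48, 0.37, 0.44, 0.41]),
    "MONOPBR":  np.array([0.42, 0.47, 0.35, 0.44, 0.40, 0.46]),
})
for effect, (f, p) in results.items():
    print(f"{effect}: F = {f:.3f}, p = {p:.3f}")
```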

The results, summarized in Table 2, revealed no significant effects for the type of reading technique and team coordination with respect to defect detection rate. Table 3 reports the mean scores of the defect detection rate for the groups that define the effects in the analysis. As can be seen, all groups demonstrated similar scores on defect detection rate.

[Figure 1 (boxplots) not reproduced. Panels: First Run (ATM doc) and Second Run (PG doc); y-axis: TDEFRATE from 0.0 to 1.0; groups: ADHOC/CHKL vs. PBR, and ADHOC/CHKL vs. MULTIPBR vs. MONOPBR.]

Figure 1. Boxplots of Team Defect Detection Rate on the two experimental runs. On the left side, values are plotted for groups of reading techniques; on the right side, values are plotted for groups of team coordination.


Comparing these scores with those achieved in experiments using the same documents we can note that:

• Inspection teams in our experiment found fewer defects than simulated teams in the Basili et al. (1996) experiment:
  - in the pilot study the average defect detection rate was 44.3 for the nonsystematic technique and 52.9 for the systematic technique;
  - in the 1995 run the average defect detection rate was 48.2 for the nonsystematic technique and 79.3 for the systematic technique.

• Inspection teams in our experiment found fewer defects than simulated teams in the Ciolkowski et al. (1997) experiment using a systematic technique:
  - in the 95/96 run the average defect detection rate was 52.0 with the ATM document and 53.5 with the PG document;
  - in the 96/97 run the average defect detection rate was 46.5 with the ATM document and 48.9 with the PG document.

• Inspection teams in our experiment found more defects than simulated teams in the Ciolkowski et al. (1997) experiment using a nonsystematic technique, with the exception of the ATM document in the 95/96 run:
  - in the 95/96 run the average defect detection rate was 47.7 with the ATM document and 38.1 with the PG document;
  - in the 96/97 run the average defect detection rate was 30.8 with the ATM document and 33.5 with the PG document.

First Run (ATM doc)
Source        df Effect   MS Effect   df Error   MS Error   F         p
{1}RTECH      1           .000155     35         .008138    .019014   .891116
{2}TCOORDIN   1           .000826     35         .008138    .101560   .751859

Second Run (PG doc)
Source        df Effect   MS Effect   df Error   MS Error   F         p
{1}RTECH      1           .000623     32         .010345    .060262   .807651
{2}TCOORDIN   1           .004267     32         .010345    .412457   .525300

Table 2. ANOVA results from testing hypotheses concerning the effects of reading technique and team composition on the team defect detection rate.

                       First Run (ATM doc)                 Second Run (PG doc)
Effect         Level of Factor   N    Mean TDEFRATE   Level of Factor   N    Mean TDEFRATE
Total                            38   .421842                           35   .387143
{1}RTECH       ADHOC             12   .419167         CHKL              11   .380909
{1}RTECH       PBR               26   .423077         PBR               24   .390000
{2}TCOORDIN    ADHOC             12   .419167         CHKL              11   .380909
{2}TCOORDIN    MULTIPBR          14   .417857         MULTIPBR          12   .376667
{2}TCOORDIN    MONOPBR           12   .429167         MONOPBR           12   .403333

Table 3. Mean scores of the team defect detection rate for the reading technique and team coordination groups.


However, these two previous experiments report scores computed by pooling defects logged in the individual preparation and applying permutation tests on hypothetical teams, while in our experiment the average defect detection rates are computed from defects that were actually logged during real team meetings.

3.2 Analysis of Individual Performance

Analogously to the team analysis, for the individual analysis we have a between-groups nested design. The first factor is the reading technique (RTECH) with two levels:

1. nonsystematic (ADHOC in the first run and CHKL in the second run)

2. systematic (PBR).

The second factor is the perspective (PERSP) with four levels:

1. no perspective (ADHOC in the first run and CHKL in the second run)

2. use case analyst (UCA)

3. structured analyst (SA)

4. object-oriented analyst (OOA).

The first PERSP level only appears within the first RTECH level (in fact there is not any perspective with the Ad Hoc or Checklist reading techniques), while the other three PERSP levels occur within the PBR level of the RTECH factor.

The dependent variable is the individual defect detection rate (IDEFRATE), which is defined as the number of true defects reported by individual inspectors divided by the total number of known defects in the document.

Figure 2 presents the boxplots of the individual defect detection rate grouped according to the reading technique and perspective groups.

As can be seen, in the first run there is one outlier for the ADHOC group (or NONE group), and in the second run there are two outliers for the PBR group and one outlier for the SA and OOA groups. For each group, we applied the Shapiro-Wilk W test of normality and we verified the equality of variances assumption with the Brown-Forsythe test. The statistics were not significant at the 0.05 level except for the W statistic of the CHKL group (or NONE group) in the second run, so the hypothesis that the distribution is normal should be rejected for this group. Nevertheless, we performed a parametric analysis of variance (ANOVA) to test for significant differences between means because the ANOVA test is robust against moderate departures from normality when the group size is greater than thirty.

The two-way ANOVA for the hierarchically nested design allows us to test two null hypotheses concerning the effects of the two factors on the individual defect detection rate. The null and alternative hypotheses for the first factor may be stated as follows:

H30: there is no difference between subjects using a nonsystematic reading technique and subjects using a systematic reading technique with respect to defect detection rate.

H3a: there is a difference between subjects using a nonsystematic reading technique and subjects using a systematic reading technique with respect to defect detection rate.

The null and alternative hypotheses for the second factor may be stated in a similar way:

H40: within the group of subjects following a scenario, there is no difference between subjects with a UCA perspective, subjects with an SA perspective, and subjects with an OOA perspective with respect to defect detection rate.

H4a: within the group of subjects following a scenario, there is a difference between at least two of the following groups with respect to defect detection rate: subjects with a UCA perspective, subjects with an SA perspective, and subjects with an OOA perspective.

As in the team analysis, there are no interaction effects to test because of the hierarchical design.


[Figure 2 (boxplots) not reproduced. Panels: First Run (ATM doc) and Second Run (PG doc); y-axis: IDEFRATE from 0.0 to 1.0; groups: ADHOC/CHKL vs. PBR, and NONE vs. UCA vs. SA vs. OOA.]

Figure 2. Boxplots of Individual Defect Detection Rate on the two experimental runs. On the left side, values are plotted for groups of reading techniques; on the right side, values are plotted for groups of perspective.

The results, summarized in Table 4, revealed a significant effect, with respect to defect detection rate, only for the type of reading technique in the first experimental run. Table 5 reports the mean scores of the defect detection rate for the groups that define the effects in the analysis. As can be seen, while in the second run all groups demonstrated similar scores on defect detection rate, in the first run subjects using an Ad Hoc reading technique detected more defects than subjects using PBR.

We then tested for differences between the absence of perspective and each of the perspectives using the Spjotvoll & Stoline test, which is a generalization of the Tukey HSD test to the case of unequal sample sizes (Winer et al., 1991). The post-hoc comparison of means failed to reveal significant differences, although the p-value for the comparison between subjects using no perspective and subjects in the OOA perspective is only slightly higher than the 0.05 level (p = 0.052186).
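
The Spjotvoll & Stoline test is not available in common Python statistics libraries; as a rough stand-in for the same kind of post-hoc pairwise comparison with unequal sample sizes, the Tukey-Kramer procedure from statsmodels could be used, as sketched below with hypothetical data.

```python
# Post-hoc pairwise comparison of the perspective groups; the Tukey-Kramer procedure
# (via statsmodels) is used here as a stand-in for the Spjotvoll & Stoline test,
# which is not available in common Python libraries. The data below are hypothetical.
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
data = pd.DataFrame({
    "PERSP": ["NONE"] * 37 + ["UCA"] * 25 + ["SA"] * 26 + ["OOA"] * 26,
    "IDEFRATE": rng.uniform(0.0, 0.6, 114),
})

result = pairwise_tukeyhsd(endog=data["IDEFRATE"], groups=data["PERSP"], alpha=0.05)
print(result.summary())  # one row per pair of groups, with adjusted p-values
```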

Comparing the mean scores of the individual defect detection rate with those achieved in experiments using the same documents, we can note that:

• In the nonsystematic reading technique group, subjects in our experiment found more defects than subjects in the Basili et al. (1996) experiment: the average defect detection rate was 20.58 in the pilot study and 24.64 in the 1995 run.

• In the systematic reading technique group, subjects in our experiment found fewer defects than subjects in the Basili et al. (1996) experiment: the average defect detection rate was 24.92 in the pilot study and 32.14 in the 1995 run.

• In the nonsystematic reading technique group, subjects in our experiment found more defects than subjects in the Ciolkowski et al. (1997) experiment: the average defect detection rate was 23.08 with the ATM document and 19.58 with the PG document.

• In the systematic reading technique group, subjects in our experiment found fewer defects in the ATM document than subjects in the Ciolkowski et al. (1997) experiment, where the average defect detection rate was 25.93, and more defects in the PG document, where the average defect detection rate was 25.93.


First Run (ATM doc)

Source      df Effect   MS Effect   df Error   MS Error   F          p
{1}RTECH    1           .104446     110        .012019    8.689919   .003908
{2}PERSP    2           .005703     110        .012019    .474481    .623475

Second Run (PG doc)

Source      df Effect   MS Effect   df Error   MS Error   F          p
{1}RTECH    1           .001860     105        .009523    .195322    .659432
{2}PERSP    2           .003082     105        .009523    .323596    .724261

Table 4. ANOVA results from testing hypotheses concerning the effects of reading technique and perspective on the individual defect detection rate.

            First Run (ATM doc)                      Second Run (PG doc)
Effect      Level of Factor  N    Mean IDEFRATE      Level of Factor  N    Mean IDEFRATE
Total                        114  .240789                             109  .282202
{1}RTECH    ADHOC            37   .284595            CHKL             34   .288235
{1}RTECH    PBR              77   .219740            PBR              75   .279467
{2}PERSP    NONE             37   .284595            NONE             34   .288235
{2}PERSP    UCA              25   .235600            UCA              24   .279167
{2}PERSP    SA               26   .218462            SA               26   .290385
{2}PERSP    OOA              26   .205769            OOA              25   .268400

Table 5. Mean scores of the individual defect detection rate for the reading technique and perspective groups.

3.3 Analysis of Individual Performance based on Process Conformance

So far, we have considered the inspection performance of all the individuals based on the assumption that they were actually following the assigned reading technique. While this assumption is certainly true for the Ad Hoc group in the first experimental run, because it implies the absence of any given reading technique, we cannot know a priori whether the PBR reviewers actually followed the prescribed procedure.

In order to check a posteriori for process conformance, we used the debriefing questionnaire of the first experimental run as a source of data. Each PBR reviewer was asked two closed-ended questions related to process conformance: one about the extent to which the proposed technique was followed and the other about the extent to which the reviewer had focused on the questions for defect detection. Figure 3 details these questions. The first response category (“not at all” for question Q1A_4 and “not careful” for question Q1A_5) is equivalent to a rejection of the assigned reading technique, while the second response category (“partially” for question Q1A_4 and “little careful” for question Q1A_5) means that reviewers were more likely to fall back to an ad hoc reading style. Only the third response category (“fully” for question Q1A_4 and “very careful” for question Q1A_5) implies that a reviewer has closely followed his assigned PBR scenario.


Q1A_4. Did you follow the assigned reading technique completely?
    (0) not at all (I have completely ignored it)
    (1) partially (I tried but I have not followed it all the time)
    (2) fully (very carefully, step by step)

Q1A_5. How carefully did you focus on the questions as a help for defect detection?
    (0) not careful (I have completely ignored the questions)
    (1) little careful (I have read the questions several times and taken them into account during reading)
    (2) very careful (I tried to answer questions when I encountered them in the procedure)

Figure 3. Questions regarding process conformance in the first experimental run.

We analyzed the answers collected between the two experimental runs. Table 6 and Table 7 summarize the answers to question Q1A_4 and question Q1A_5, respectively. Results show that only 5 PBR reviewers (two with a UCA perspective, three with an SA perspective, and none with an OOA perspective) followed the assigned reading technique completely, and just 6 PBR reviewers focused carefully on the questions as a help for defect detection. Furthermore, only one PBR reviewer answered positively to both questions.

Did you follow the assigned reading technique completely?

First Run (ATM doc)
                                  not at all        partially         fully             Missing
Effect      Level of Factor  N    Count   Pct       Count   Pct       Count   Pct       Count   Pct
Total                        77   6       7.8%      63      81.8%     5       6.5%      3       3.9%
{2}PERSP    UCA              25   2       8.0%      21      84.0%     2       8.0%      0       0.0%
{2}PERSP    SA               26   1       3.9%      19      73.1%     3       11.5%     3       11.5%
{2}PERSP    OOA              26   3       11.5%     23      88.5%     0       0.0%      0       0.0%

Table 6. Summary of answers to question Q1A_4 after the first experimental run.

How carefully did you focus on the questions as a help for defect detection?

First Run (ATM doc)
                                  not careful       little careful    very careful      Missing
Effect      Level of Factor  N    Count   Pct       Count   Pct       Count   Pct       Count   Pct
Total                        77   14      18.2%     51      66.2%     6       7.8%      6       7.8%
{2}PERSP    UCA              25   6       24.0%     15      60.0%     3       12.0%     1       4.0%
{2}PERSP    SA               26   3       11.5%     17      65.4%     1       3.9%      5       19.2%
{2}PERSP    OOA              26   5       19.2%     19      73.1%     2       7.7%      0       0.0%

Table 7. Summary of answers to question Q1A_5 after the first experimental run.


From the analysis of the debriefing questionnaires, we understood that in the first experimental run PBR reviewers had not appreciated the opportunity to have a procedure as a tool for defect detection. Thus, before the second experimental run, we told them that their goal was going to change from “find as many defects as possible with the help of the assigned reading technique” to “follow the assigned reading technique and find as many defects as possible”. To be fair, we replaced the Ad Hoc reading technique with the Checklist reading technique, which is also nonsystematic but uses the checklist as a tool for defect detection, just as PBR uses the perspective-based scenario.

Having made the reading technique mandatory in the second experimental run, we had to look for some proof of process conformance. For this purpose, we verified the analysis models built during the PBR scenarios and checked to what extent the defect entries in the preparation logs were explicitly and reasonably mapped to the questions of the assigned reading technique (Checklist or PBR).

We measured the results of this post-hoc verification activity with the variable EVIDCONF, whose categories are ordered in terms of evidence for process conformance:

0 = no evidence (the analysis models built by PBR reviewers are just sketched and few defect entries are linked to questions)

1 = weak evidence (the analysis models built by PBR reviewers are just sketched or few defect entries are linked to questions)

2 = strong evidence (the analysis models built by PBR reviewers are sufficiently developed and most defect entries are linked to questions)

Based on this scale, only the third category (“2 = strong evidence”) gives us enough confidence that a reviewer has closely followed his assigned reading technique.
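
One way to picture how such an ordinal score could be assigned is sketched below; the two boolean inputs are our own abstraction of the verification criteria described above and are not part of the original experimental apparatus.

```python
# Sketch of the EVIDCONF ordinal scoring scheme (our abstraction of the criteria above).
def evidconf(models_sufficiently_developed: bool, most_defects_linked_to_questions: bool) -> int:
    """Return 0 (no evidence), 1 (weak evidence), or 2 (strong evidence)."""
    # The score is the number of conformance criteria that a reviewer satisfies.
    return int(models_sufficiently_developed) + int(most_defects_linked_to_questions)

print(evidconf(False, False))  # 0 = no evidence
print(evidconf(True, False))   # 1 = weak evidence
print(evidconf(True, True))    # 2 = strong evidence
```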

Table 8 shows the frequencies of the EVIDCONF variable grouped according to the two independent variables, the reading technique (RTECH) and the perspective (PERSP). Results show that for 32 reviewers using the checklist and 37 PBR reviewers (12 with a UCA perspective, 16 with an SA perspective, and 9 with an OOA perspective) there was strong evidence that they had followed the assigned reading technique completely.

However, since reviewers were warned in advance that the verification was part of the academic evaluation, we thought that they could first detect defects on an informal basis and then work on the deliverables that were expected as the output of the reading technique. Thus, in the debriefing questionnaire at the end of the second experimental run, we asked reviewers one more closed-ended question, related to process conformance, about the extent to which the reviewer had focused on the questions for defect detection. Figure 4 details this question. The first two response categories (“not careful” and “little careful”) mean that reviewers were not actually following any reading technique, because both the checklist and the PBR scenarios require that the questions be used to detect defects and not the other way around. Only the third response category (“very careful”) implies that a reviewer has closely followed his assigned reading technique.

Evidence for process conformance

Second Run (PG doc)
                                  no evidence       weak evidence     strong evidence   Missing
Effect      Level of Factor  N    Count   Pct       Count   Pct       Count   Pct       Count   Pct
Total                        109  15      13.8%     25      22.9%     69      63.3%     0       0.0%
{1}RTECH    CHKL             34   0       0.0%      2       5.9%      32      94.1%     0       0.0%
{1}RTECH    PBR              75   15      20.0%     23      30.7%     37      49.3%     0       0.0%
{2}PERSP    NONE             34   0       0.0%      2       5.9%      32      94.1%     0       0.0%
{2}PERSP    UCA              24   3       12.5%     9       37.5%     12      50.0%     0       0.0%
{2}PERSP    SA               26   5       19.2%     5       19.2%     16      61.6%     0       0.0%
{2}PERSP    OOA              25   7       28.0%     9       36.0%     9       36.0%     0       0.0%

Table 8. Summary of scores of the evidence for process conformance in the second experimental run.


Q1B_3. How carefully did you focus on the questions as a help for defect detection?
    (0) not careful (I first discovered defects and then I looked for questions to match)
    (1) little careful (when I found a defect I looked for a question to match)
    (2) very careful (I tried to answer questions when I encountered them in the procedure)

Figure 4. Question regarding process conformance in the second experimental run.

After the second experimental run we analyzed the answers to question Q1B_3 (the results are summarized in Table 9). Results show that 11 reviewers using the checklist and 25 PBR reviewers (9 with a UCA perspective, 9 with an SA perspective, and 7 with an OOA perspective) focused carefully on the questions for defect detection.

With respect to process conformance, we can only trust those reviewers who provided the highest evidence for process conformance (“strong evidence”) and answered positively (“very careful”) to the question related to process conformance in the debriefing questionnaire after the second experimental run. Table 10 shows how the “trustable” reviewers are distributed according to the grouping variables reading technique (RTECH) and perspective (PERSP). Less than one third of the checklist reviewers can be trusted to have used the checklist, and one fifth of the PBR reviewers can be trusted to have followed the assigned scenario. The OOA scenario was the least followed PBR scenario (only 3 reviewers out of 25). The OOA scenario was introduced by us for this experiment and thus had never been tested in former experiments.

If we consider the restricted dataset made up of only the defect logs of trustable reviewers, we might test the effects of the two factors on the individual defect detection rate by repeating the two-way ANOVA for the hierarchically nested design as in the previous section. However, the cell size of the PERSP factor is too small, especially for the OOA level (only three observations). Thus, we can only evaluate the differences in means between the two groups (CHKL and PBR) of the RTECH variable. Figure 5 presents the boxplots of the individual defect detection rate (IDEFRATE) grouped by reading technique.

How carefully did you focus on the questions as a help for defect detection?

Second Run (PG doc)
                                  not careful       little careful    very careful      Missing
Effect      Level of Factor  N    Count   Pct       Count   Pct       Count   Pct       Count   Pct
Total                        109  17      15.6%     53      48.6%     36      33.0%     3       2.8%
{1}RTECH    CHKL             34   2       5.9%      21      61.8%     11      32.3%     0       0.0%
{1}RTECH    PBR              75   15      20.0%     32      42.7%     25      33.3%     3       4.0%
{2}PERSP    NONE             34   2       5.9%      21      61.8%     11      32.3%     0       0.0%
{2}PERSP    UCA              24   6       25.0%     9       37.5%     9       37.5%     0       0.0%
{2}PERSP    SA               26   3       11.5%     14      53.9%     9       34.6%     0       0.0%
{2}PERSP    OOA              25   6       24.0%     9       36.0%     7       28.0%     3       12.0%

Table 9. Summary of answers to question Q1B_3 after the second experimental run.


Second Run (PG doc)               Reviewers trustable with respect to process conformance
Effect      Level of Factor  N    Count   Pct
Total                        109  25      22.9%
{1}RTECH    CHKL             34   10      29.4%
{1}RTECH    PBR              75   15      20.0%
{2}PERSP    NONE             34   10      29.4%
{2}PERSP    UCA              24   6       25.0%
{2}PERSP    SA               26   6       23.1%
{2}PERSP    OOA              25   3       12.0%

Table 10. Summary of "trustable" reviewers with respect to process conformance.

[Figure 5 appears here: boxplots (median, 25%-75% quartiles, and non-outlier range) of the individual defect detection rate (IDEFRATE) for the CHKL and PBR groups of “trustable” reviewers.]

Figure 5. Boxplots of Individual Defect Detection Rate for “trustable” reviewers with respect to process conformance on the second experimental run.


Since the sample sizes are small (10 observations in the CHKL group and 15 in the PBR group) and the IDEFRATE variable is not normally distributed within the CHKL group (the p-value of the Shapiro-Wilk W statistic is 0.0566 and the Lilliefors probability is less than 0.05), we applied the nonparametric Mann-Whitney U test. This test assumes that the dependent variable (IDEFRATE) was measured on at least an ordinal scale. The interpretation of the test is analogous to the interpretation of the t-test for independent samples, except that the U test is computed from rank sums rather than means. For small to moderately sized samples, the Mann-Whitney U test may offer even greater power to reject the null hypothesis than the t-test.

The null and alternative hypotheses for the restricted dataset of trustworthy reviewers may be stated as follows:

H50: there is no difference between trusted subjects using a nonsystematic reading technique and trusted subjects using a systematic reading technique with respect to defect detection rate.

H5a: there is a difference between trusted subjects using a nonsystematic reading technique and trusted subjects using a systematic reading technique with respect to defect detection rate.

Although the mean score of the defect detection rate for the PBR group (0.286) is higher than the mean score for the CHKL group (0.277), the analysis failed to reveal a significant difference between the two groups (U = 66.5, p = 0.637).
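
The same comparison can be reproduced with standard libraries; a minimal sketch using SciPy is shown below, with hypothetical detection-rate vectors standing in for the actual restricted dataset.

```python
# Mann-Whitney U test comparing two small independent groups (hypothetical values).
from scipy import stats

chkl_rates = [0.30, 0.25, 0.28, 0.22, 0.35, 0.27, 0.24, 0.31, 0.26, 0.29]   # 10 "trusted" CHKL reviewers
pbr_rates = [0.33, 0.20, 0.29, 0.31, 0.25, 0.28, 0.36, 0.22, 0.30, 0.27,
             0.34, 0.26, 0.32, 0.24, 0.29]                                   # 15 "trusted" PBR reviewers

u_stat, p_value = stats.mannwhitneyu(chkl_rates, pbr_rates, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
```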

3.4 Analysis of the Subjective Evaluation of the Reading Technique

In the debriefing questionnaire at the end of the second experimental run, we asked reviewers to give a self-evaluation of the reading technique they had just applied for finding defects. Figure 6 details this closed-ended question. The first response category (“harmful”) means that the reviewer judged the reading technique negatively because it was considered an obstacle to the task of defect detection. The second response category (“no help”) means that the reviewer was neutral with respect to the technique applied because it was considered just a waste of time. Only the third response category (“helpful”) implies a positive judgment with respect to the assigned reading technique.

Table 11 shows how the percentages of answers related to the subjective evaluation of the reading techniques are distributed according to the grouping variables reading technique (RTECH) and perspective (PERSP). The percentages are presented separately with respect to the criterion of process conformance; that is, “trusted” reviewers are kept apart from “untrusted” reviewers. We recall that the former group is much smaller than the latter (as already shown in Table 10).

As can be seen in Table 11, the answers are similarly distributed with respect to the reading technique and perspective variables. On the other hand, it seems that “trusted” reviewers are more positive in their evaluation than “untrusted” reviewers. In order to test whether the two groups significantly differ with respect to subjective evaluation, we applied the Chi-square test to the resulting 2 x 3 contingency table. This is the most common test for significance of the relationship between categorical variables. It is based on the fact that we can compute the expected frequencies in a contingency table (i.e., the frequencies that we would expect if there were no relationship between the variables).

The null and alternative hypotheses may be stated as follows:

H60: there is no relationship between trust in process conformance and subjective evaluation of the reading technique.

H6a: there is a relationship between trust in process conformance and subjective evaluation of the reading technique.

The analysis yielded a Chi-square value of 4.911523 with a p-value of 0.0267. We may therefore reject the null hypothesis, and tentatively conclude that trust in process conformance is related to the subjective evaluation of the reading technique.
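
A chi-square test of independence on a 2 x 3 contingency table can be computed as sketched below; the cell counts are hypothetical placeholders, since Table 11 reports only percentages.

```python
# Chi-square test of independence on a 2 x 3 contingency table (hypothetical counts).
import numpy as np
from scipy import stats

# Rows: "untrusted" vs. "trusted" reviewers; columns: harmful, no help, helpful.
observed = np.array([
    [7, 42, 32],   # untrusted (hypothetical counts)
    [2, 7, 16],    # trusted (hypothetical counts)
])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-square = {chi2:.3f}, df = {dof}, p = {p:.4f}")
```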

Q1B_5. Did the reading technique help you identify defects in the requirements?
    (0) harmful (it was an obstacle; I would do better without it)
    (1) no help (I think I would have found the same defects even without it)
    (2) helpful (I have discovered defects that otherwise I could not find)

Figure 6. Question regarding self-evaluation of the reading technique in the second experimental run.


Did the reading technique help you identify defects in the requirements?

Second Run (PG doc)
                             "Untrusted" reviewers                       "Trusted" reviewers
Effect      Level of Factor  harmful  no help  helpful  Missing          harmful  no help  helpful  Missing
Total                        8.3%     50.0%    38.1%    3.6%             8.0%     28.0%    64.0%    0.0%
{1}RTECH    CHKL             4.2%     45.8%    45.8%    4.2%             0.0%     30.0%    70.0%    0.0%
{1}RTECH    PBR              10.0%    51.7%    35.0%    3.3%             13.3%    26.7%    70.0%    0.0%
{2}PERSP    NONE             4.2%     45.8%    45.8%    4.2%             0.0%     30.0%    70.0%    0.0%
{2}PERSP    UCA              11.1%    50.0%    38.9%    0.0%             0.0%     33.3%    66.7%    0.0%
{2}PERSP    SA               10.0%    45.0%    45.0%    0.0%             16.7%    33.3%    70.0%    0.0%
{2}PERSP    OOA              9.1%     59.1%    22.7%    9.1%             33.3%    0.0%     66.7%    0.0%

Table 11. Summary of answers to question Q1B_5 after the second experimental run.

3.5 Analysis of Relationship between Time and Detection Effectiveness

We wanted to verify whether the amount of time available for preparation and meeting might have influenced the inspection performance. Table 12 shows, for both experimental runs, the correlation coefficients between time (MTNGTIME for the inspection meeting and PREPTIME for the individual preparation) and the dependent variables related to detection effectiveness (TDEFRATE at the team level and IDEFRATE at the individual level). The last row includes only those individuals who can be trusted with respect to process conformance (using the same filtering procedure applied in Section 3.3).

Since the time variables are not normally distributed, we use a nonparametric correlation coefficient, Spearman's R, which only assumes that the variables under consideration were measured on at least an ordinal scale. As can be seen, there is no correlation between time and either team or individual performance, and thus we can rule out that the time spent performing the review task influenced the result of the inspection.
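
The correlations in Table 12 can be obtained with a nonparametric routine such as the one sketched below; the time and detection-rate vectors are hypothetical stand-ins for the experimental measures.

```python
# Spearman rank correlation between preparation time and individual detection rate
# (hypothetical vectors standing in for PREPTIME and IDEFRATE).
from scipy import stats

prep_time_minutes = [95, 120, 80, 150, 110, 60, 130, 100, 90, 140]
idefrate = [0.22, 0.30, 0.18, 0.27, 0.25, 0.15, 0.29, 0.24, 0.20, 0.26]

rho, p_value = stats.spearmanr(prep_time_minutes, idefrate)
print(f"Spearman R = {rho:.3f}, p = {p_value:.3f}")
```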

Dataset                               Pair of Variables       Valid N   Spearman R   p-level
First Run (teams)                     TDEFRATE & MTNGTIME     38        -.298876     .068339
Second Run (teams)                    TDEFRATE & MTNGTIME     35        .098098      .575047
First Run (individuals)               IDEFRATE & PREPTIME     114       .125819      .182242
Second Run (individuals)              IDEFRATE & PREPTIME     109       -.020920     .829053
Second Run ("trusted" individuals)    IDEFRATE & PREPTIME     25        -.174370     .404490

Table 12. Correlation between time and detection effectiveness variables.


4 Summary and Conclusions

Reading is considered a key technical activity for analyzing a document and then improving its quality when applied for defect detection during software inspections. Past studies, such as (Basili et al., 1996) and (Ciolkowski et al., 1997), have shown that Perspective-Based Reading (PBR) improves inspection effectiveness on requirements documents with respect to nonsystematic reading techniques, such as Ad Hoc or Checklist.

We tested the effectiveness of PBR in two runs of a controlled experiment with more than one hundred undergraduate students. The subjects performed both the preparation and the inspection meeting phases on the same requirements documents that had been used in the previous studies. The subjects reviewed the documents either applying a nonsystematic reading technique (Ad Hoc in the first run and Checklist in the second run) or a systematic reading technique (PBR). Each PBR reviewer was assigned one of three scenarios based on different perspectives: use case analyst, structured analyst, and object-oriented analyst. The two experimental runs used distinct requirements documents: ATM in the first run and Parking Garage in the second run. While in the first run the reviewers were invited to use the assigned reading technique as a help for finding defects, in the second run they were required to strictly use their reading technique for defect detection.

The main research question was: “Are there differences in defect detection effectiveness between reviewing requirements documents using systematic reading techniques and reviewing requirements documents using nonsystematic reading techniques?” The findings from our experiment are the following:

• No difference was found between inspection teams applying PBR and inspection teams applying Ad Hoc or Checklist reading with respect to the percentage of discovered defects (H10).

This finding does not support the expected hypothesis, based on past studies, that inspection teams applying PBR find a higher percentage of defects than inspection teams applying nonsystematic reading techniques, such as ad hoc reading and checklists. However, the analysis in past studies was performed on simulated inspection teams rather than on real team meetings, as in our case.

• Individual reviewers applying PBR found a smaller percentage of defects than Ad Hoc reviewers (H3a for the first run), but no difference was found between reviewers applying PBR and Checklist reviewers (H30 for the second run).

This finding does not support the expected hypothesis, based on past studies, that individual reviewers applying PBR find a higher percentage of defects than individual reviewers applying nonsystematic reading techniques, such as ad hoc reading and checklists.

A by-product of the main research question was: “Are there differences in defect detection effectiveness between reviewing requirements documents using different PBR scenarios?” Our finding is the following:

• Individual reviewers applying PBR found the same percentage of defects with any of the assigned scenarios (H40).

Thus, we can consider the three scenarios equivalent with respect to defect detection effectiveness.

We also investigated the effects of having distinct roles when composing inspection teams. The related research question was: “Are there differences in defect detection effectiveness between reviewing requirements documents having unique roles in an inspection team and reviewing requirements documents having identical roles in an inspection team?” The finding is the following:

• There was no difference between teams with identical responsibilities and teams with distinct responsibilities with respect to the percentage of detected defects (H20).

This finding does not support the theory of scenario-based reading, which states that the coordination of distinct and specific scenarios achieves a higher coverage of documents.

Furthermore, we verified whether the amount of time available for preparation and meeting could have influenced the inspection performance, and found that the time spent to perform the review task did not influence the result of the inspection.

We went further in our analysis to find an explanation for these contradictory findings. We looked at the debriefing questionnaires in order to check the process conformance assumption, i.e., whether subjects had actually followed the assigned reading techniques. We found that in the first experimental run only one PBR reviewer had declared both to have fully followed the reading technique and to have carefully focused on the questions in the assigned scenario. This was the main reason for making the use of the reading technique mandatory in the second experimental run, and for checking process conformance through the deliverables of the inspection preparation. We measured the results of this post-hoc verification activity and asked again in the debriefing questionnaire to what extent the reviewer had focused on the questions for defect detection. The result was that less than one third of the Checklist reviewers could be trusted to have used the checklist and one fifth of the PBR reviewers could be trusted to have followed the assigned scenario. We tested the main research question again but, this time, we considered only the restricted subgroup of reviewers who could be trusted with respect to process conformance. The result is the following:

• There was no difference between “trusted” reviewers using a nonsystematic reading technique and “trusted” reviewers using a systematic reading technique with respect to the percentage of defects found (H50).

However, this time the PBR group scored better than the Checklist group, although the difference was not significant. The analysis could not be repeated at the team level because there were no inspection teams exclusively made up of trusted reviewers.

We also investigated the subjective evaluation of the reading techniques by asking reviewers to self-evaluate their reading technique. The answers were similarly distributed with respect to the reading technique and the perspective, and thus we can conclude that the subjective evaluation of the reading technique does not depend on the type of reading technique. In contrast, the distribution was different with respect to trust in process conformance: trusted reviewers were more positive in their evaluation than untrusted reviewers. We tested the significance of this relationship and the result was the following:

• Trust in process conformance was related to the subjective evaluation of the reading technique (H6a).

However, the relationship between the two variables does not provide evidence of a cause-and-effect relationship. One might argue that there were reviewers who followed the reading technique as it was written because they were positively impressed by the assigned reading technique. On the other hand, one might also argue that there were reviewers who gave a positive evaluation of their reading technique because they had actually followed it, and thus had the opportunity to appreciate the technique as it was conceived. This latter explanation might imply that some reviewers have an attitude that predisposes them to be driven by a technique while performing a task.

We need to better understand how a process tool, such as a reading technique, is perceived by users and what process tool characteristics are compatible with the users' attitudes. The disposition of users towards a process tool might be influenced by many factors, including sociological and psychological characteristics. We may, for instance, believe that being an undergraduate student makes a subject less prone to follow instructions than graduate students or professional software engineers. Nevertheless, we may also believe that some individuals are more self-disciplined than others because of the education they received or their personal character. Further work on reading techniques, and more generally on software processes, should look deeply into the theories and experimental investigations of human behavior, as social scientists do in their disciplines.

Acknowledgments

We gratefully acknowledge the collaboration of Nicola Barile in the execution and data collection phases of the experiment. Our thanks also go to all the students of the SE class for their hard work.

References

V. Basili, S. Green, O. Laitenberger, F. Lanubile, F. Shull, S. Sorumgard, and M. Zelkowitz, "The Empirical Investigation of Perspective-based Reading", Empirical Software Engineering, 1, 133–164, 1996.

V. R. Basili, "Evolving and packaging reading technologies", Journal of Systems and Software, 38(1): 3–12, July 1997.

V. R. Basili, F. Shull, and F. Lanubile, "Building Knowledge through Families of Experiments", IEEE Transactions on Software Engineering, 25(4), July/August 1999.

B. W. Boehm, Software Engineering Economics, Prentice Hall, Englewood Cliffs: NJ, 1981.

M. Ciolkowski, C. Differding, O. Laitenberger, and J. Munch, "Empirical Investigation of Perspective-based Reading: A Replicated Experiment", ISERN Report 97-13, 1997.

M. E. Fagan, "Design and Code Inspections to Reduce Errors in Program Development", IBM Systems Journal, 15(3): 182–211, 1976.

M. E. Fagan, "Advances in Software Inspections", IEEE Transactions on Software Engineering, 12(7): 744–751, July 1986.

P. Fusaro, F. Lanubile, and G. Visaggio, "A Replicated Experiment to Assess Requirements Inspection Techniques", Empirical Software Engineering, 2, 39–57, 1997.

T. Gilb and D. Graham, Software Inspection, Addison-Wesley Publishing Company, 1993.

W. S. Humphrey, Managing the Software Process, Addison-Wesley Publishing Company, 1989.

IEEE, IEEE Guide to Software Requirements Specifications, IEEE Std. 830, Soft. Eng. Tech. Comm. of the IEEE Computer Society, 1984.

O. Laitenberger and J. M. DeBaud, "Perspective-based Reading of Code Documents at Robert Bosch GmbH", Information and Software Technology, 39: 781–791, 1997.

O. Laitenberger, K. El Emam, and T. Harbich, "An Internally Replicated Quasi-Experimental Comparison of Checklist and Perspective-based Reading of Code Documents", Technical Report, International Software Engineering Research Network, ISERN-99-01, 1999.

F. Lanubile, F. Shull, and V. Basili, "Experimenting with Error Abstraction in Requirements Documents", in Proc. of METRICS '98, 1998.

J. Miller, M. Wood, and M. Roper, "Further Experiences with Scenarios and Checklists", Empirical Software Engineering, 3, 37–64, 1998.

A. Porter, L. G. Votta, and V. R. Basili, "Comparing Detection Methods for Software Requirements Inspections: A Replicated Experiment", IEEE Transactions on Software Engineering, 21(6): 563–575, June 1995.

A. Porter and L. Votta, "Comparing Detection Methods for Software Requirements Specification: A Replication Using Professional Subjects", Empirical Software Engineering, 3, 355–379, 1998.

A. Porter, H. Siy, A. Mockus, and L. Votta, "Understanding the Sources of Variation in Software Inspections", ACM Transactions on Software Engineering and Methodology, 7(1): 41–79, January 1998.

K. Sandahl, O. Blomkvist, J. Karlsson, C. Krysander, M. Lindvall, and N. Ohlsson, "An Extended Replication of an Experiment for Assessing Methods for Software Requirements Inspections", Empirical Software Engineering, 3, 327–354, 1998.

F. Shull, "Procedural Techniques for Perspective-Based Reading and Error Abstraction", http://www.cs.umd.edu/projects/SoftEng/ESEG/manual/error_abstraction/manual.html, 1998.

F. Shull, I. Rus, and V. Basili, "How Perspective-Based Reading Can Improve Requirements Inspections", Computer, 33(7): 73–79, July 2000.

D. A. Wheeler, B. Brykczynski, and R. N. Meeson, Jr. (Eds.), Software Inspection: An Industry Best Practice, IEEE Computer Society Press, 1996.

B. J. Winer, D. R. Brown, and K. M. Michels, Statistical Principles in Experimental Design, 3rd edition, McGraw-Hill, New York, 1991.