Undefined 1 (2019) 1–5
IOS Press

Towards implementing defect prediction in the software development process
Marian Jureczko a,*, Ngoc Trung Nguyen a, Marcin Szymczyk a and Olgierd Unold a

a Department of Computer Engineering, Wroclaw University of Science and Technology, Poland, E-mail: marian.jureczko,[email protected]

Abstract. Defect prediction is a method of identifying possible locations of software defects without testing. Software tests can be laborious and costly, thus one may expect defect prediction to be a first class citizen in software engineering. Nonetheless, the industry apparently does not see it that way, as the level of practical usage is limited. The study describes the possible reasons for the low adoption and suggests a number of improvements for defect prediction, including a confusion matrix-based model for assessing the costs and gains. The improvements are designed to increase the level of practitioners' acceptance of defect prediction by removing the implementation obstacles recognized by the authors. The obtained predictors showed acceptable performance. The results were processed through the suggested model for assessing the costs and gains and showed the potential of significant benefits, i.e. up to 90% of the overall cost of the considered test activities.

Keywords: software metrics, software development process, defect prediction, re–open prediction, predicting feature defectiveness, defect prediction economy

1. Introduction

Defect prediction is a technique used to predict the faultiness of a given software artefact expressed in terms of defects. It could be the number of defects or binary information indicating whether the artefact is defect free or not. Such a technique seems to be very useful in the software development process. If one knew which artefacts are defect free, one could save the effort related to testing them. Software tests are often estimated at up to 50% of the costs of the whole software development (Kettunen et al. [1]). Therefore, such savings should be substantial, but surprisingly there is little evidence of defect prediction implementations in real life software development processes, and all of the few exceptions most likely were conducted in close collaboration with defect prediction researchers ([2,3,4]), i.e. the companies were not able to implement the prediction on their own. Moreover, there are reports from research departments of software vendors with complaints regarding challenges related to convincing project managers to start using the defect prediction models (Weyuker and Ostrand [5]).

* Corresponding author. E-mail: [email protected].

One may wonder why, as one of the earliest defect prediction models was suggested in the 1970s (Halstead [6]) and since then it has been intensively investigated by many researchers (Hall et al. [7] or Jureczko and Madeyski [8]) and the prediction results are encouraging. According to Hall et al. [7] the model precision is usually close to 0.65 and after calibration it can be much better (e.g. up to 0.97 after performing feature selection and choosing an optimal classifier, as reported by Shivaji et al. [9]). We considered a survey aiming at characterizing the reasons for the low adoption of the technique. Unfortunately, preliminary interviews with people deciding about the shape of the development process show very limited understanding of the concept of defect prediction (there was no case where defect prediction was even considered). Thus a survey might be of low value. This made us think about the possibility of launching a defect prediction program in our own environments, and we faced challenges related to discrepancies between defect prediction assumptions and real life software development processes. We identified challenges that

0000-0000/19/$00.00 © 2019 – IOS Press and the authors. All rights reserved


in our opinion should be solved before recommending defect prediction. Let us present the issues and our concerns.

Issue 1: the mapping of defects to the prediction unit of analysis. According to Hall et al. [7] most of the defect prediction models are focused on files and classes. The files and classes pointed out by the model are recommended for code review or special care during testing (Weyuker et al. [10]); in the case of the aforementioned artefacts that would be unit tests. Let us first consider what is meant by the fact that a class or file is defective in the context of defect prediction. Most of the researchers (e.g. [11,12]) identify the defects in files and classes using issue tracking systems or commit messages. The procedure is usually as follows. Defect reports are extracted from the issue tracking system and subsequently from each defect report its identifier is extracted. Then the comments in the project's version control system are scanned for occurrences of the aforementioned identifier. When there is a match, all classes or files that were modified in the identified version control commit are marked as defective. In consequence, there can be a defect report regarding a malfunction on a system business level, visible to the end users, and a number of classes that were changed during defect fixing. When the defect root cause comes from a technical flaw of the implementation (e.g. an error in a sorting algorithm) there is nothing wrong with that. Unfortunately, there is always a fraction of defects with the root cause outside the source code (e.g. misinterpretation of requirements). Such defects typically result in a number of changes done in coupled classes. Some of the changes address the defect more or less directly, whereas others are side effect adjustments in the coupled code. In order to better illustrate this case let us present an example. One of the commonly investigated projects with respect to defect prediction is JEdit, e.g. in [13,14]. Among others, there is a defect report with number 2739: 'Indentation of Javascript embedded in HTML is broken'1. It has been fixed in revision 8093 by introducing changes to 5 files. One of them is JEditTextArea.java, where one line was changed, that is:

indent = buffer.isElectricKey(ch);
to
indent = buffer.isElectricKey(ch, caretLine);

An additional parameter is passed to the isElectricKey method. Does it mean that the method was

1 http://sourceforge.net/p/jedit/bugs/2739/ (last access on 29/11/2018)

defective before the change? The answer is not clear, specifically because the 'buffer' class (i.e. JEditBuffer) was changed in the same commit. We ended up with a real defect mapped into changes in the source code that really solve it, which is not exactly the same as a set of defective classes. Nonetheless, the prediction model is trained to predict them as defective, whereas in fact we are not sure about their real status. Furthermore, it is hardly possible that a unit test or code review of the JEditTextArea.java file will result in a corrective action regarding the isElectricKey method, as without the broader context that comes from the defect report, it is not clear that there is something wrong with the method call. Defect prediction output pointing at JEditTextArea.java would be useless for a practitioner despite being correct. Furthermore, automatic identification of technical issues is the goal of static code analysis, which already has good tool support, e.g. FindBugsTM2, SonarQubeTM3. Not all defects are similar to the one described above, but neither is it an exceptional corner case.
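The mapping procedure described above can be summarized with a short sketch. The listing below is a minimal illustration in Python with hypothetical issue-tracker and version-control data structures (the identifier format and the function name are ours, not the tooling used in the cited studies):

import re

def map_defects_to_files(defect_ids, commits):
    """Mark files as defective when a commit message references a defect id.

    defect_ids: iterable of issue identifiers, e.g. {"2739", ...}
    commits: iterable of (message, modified_files) tuples taken from the
             version control history.
    Returns a dict: defect id -> set of files changed while fixing it.
    """
    defective = {d: set() for d in defect_ids}
    for message, files in commits:
        for defect_id in defect_ids:
            # naive matching on the identifier occurring in the commit message
            if re.search(re.escape(defect_id), message):
                defective[defect_id].update(files)
    return defective

# Example: the whole set of changed files is labelled defective,
# including the side-effect adjustments in coupled classes.
commits = [
    ("Fix #2739: indentation of Javascript embedded in HTML",
     ["JEditTextArea.java", "JEditBuffer.java"]),
]
print(map_defects_to_files({"2739"}, commits))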

Issue 2: no clear definition of the defect prediction data collection program. Another possible issue related to the practical application of defect prediction regards the defects distribution, or more specifically, the defect reports distribution. The researchers do not report whether their data result from uniform tests of the whole system. Possibly, the opposite is true, since it is recommended to focus test efforts on new and changed functionalities. Namely, it is possible that the 'prediction' shows which parts of the system were tested. In other words, the following scenario is plausible. Before the release, some of the features (the new or changed ones) are intensively tested and in consequence a number of defects are reported, whereas the rest of the system goes only through light regression tests that do not result in many additional defect reports. The identified defects are used to train the prediction model. However, the tests were conducted in such a way that the vast majority of defects were identified in certain parts of the system. In consequence, the model is adjusted to the design flaws and development history of the tested parts of the system instead of to the defects. Whereas the theory behind software metrics based defect prediction suggests that some design flaws (e.g. high coupling) increase the possibility of committing a defect, and thus the prediction model should

2 http://www.findbugs.sourceforge.net (last access on 29/11/2018)
3 http://www.sonarqube.org (last access on 29/11/2018)


recognize those design flaws. A similar scenario is also possible for post–release defects, as the customer may follow the same principles during validation (i.e. focus on new and changed functionalities) or simply train users in only what is new in a given release. The scope of conducted tests should not be ignored in the evaluation of defect prediction efficiency. Claiming that defect prediction works for entire systems when only half of the system is considered for the quality assurance activities may misleadingly increase prediction efficiency. What is the value of a prediction precision equal to 0.97 when half of the system was not tested at all and the predictor just learned the rule behind excluding certain software artefacts from testing?

Issue 3: no estimation of return on investment. Defect prediction requires a software measurement program and thus does not come for free. Since some investments must be made, it is crucial to assess the future benefits that may come from using defect prediction and confront them with the costs. Fortunately, the costs of metrics collection have already been analyzed, e.g. [15], but what the benefits look like is not so clear. There are two concepts of employing defect prediction in software development. Weyuker et al. [10] suggested to use the model output as a trigger for additional tests and code reviews, which should result in improving software quality and hence reduce the number of post–release defects. Briand et al. [16] pointed out the possibility of savings regarding not testing artefacts that were predicted to be defect free. This work is focused on the latter, as detailed empirical data (which unfortunately is unavailable to us) is needed for the evaluation of the first one ([17]). We suggest a method of assessing the benefits (and costs) of not testing software artefacts that presumably are defect free.

Issue 4: poor tool support. Developing tools for industry is not something that can be done as a part of a research paper. Nonetheless, it is a very important obstacle that may not be ignored. Yang et al. [18] reported that industrial practitioners are more interested in the interpretations and follow-up that occur after prediction than in just the mining itself. Following the same reasoning, this work suggests a number of changes to the design of defect prediction in order to address as many of the aforementioned issues as possible.

The usefulness of the redefined defect prediction is tested on two open–source projects using the suggested model for assessing costs and gains. The rest of this paper is organized as follows. The suggested approach to defect prediction, as well as the method of assessing prediction gains and costs, is described

in the next section. Section 3 contains the detailed description of our empirical investigation aimed at predicting defect re–opens. Threats to validity are discussed in Section 4. Related studies concerning re–opens prediction and the evaluation of costs and gains of defect prediction are presented in Section 5. The conclusions, contributions and plans for future research are discussed in Section 6.

2. Proposed approach for implementing and assessing defect prediction

The goal of this work is to redesign defect prediction by suggesting approaches that avoid or resolve the issues mentioned in the introduction. We conducted an empirical experiment that employs defect prediction in such a way that makes the practical application of prediction outcomes straightforward and minimizes the possibility of a misleading evaluation of prediction effects. Additionally, we suggest a model for the assessment of defect prediction application costs and benefits.

2.1. The place of defect prediction in software development

The defect prediction models that use larger artefacts as the unit of analysis seem to be more handy, as it is easier to map them to system level requirements and, due to their size, they come with a broader context. Hall et al. [7] reported modules, which unfortunately are vaguely defined. We believe that the usefulness of such models depends on what in fact the module is and what the software development process looks like. Nonetheless, in our opinion the larger artefacts are a promising direction and therefore constitute our primary concern.

Let us consider two scenarios that commonly emerge in software development, i.e. implementation of a new feature and correction of a defect. Both of them can be very complex, but for our considerations we can limit them to several crucial steps. In the case of the new feature, there is the writing of the source code; afterwards the tests should be conducted and then, according to the test results, the feature is released or goes back to the development team for further improvements and tests, which creates a loop in the process. Eventually, the feature is released and in consequence the end users start to use it. Unfortunately, the tests do not guarantee correctness and it is possible that after


Fig. 1. Software implementation: new features & bugfixing.

the release some hidden defect will emerge. Such situations create losses for the software vendor regarding the effort contributed to defect fixing, as well as ruining the brand image.

The second scenario, i.e. defect correction (commonly called bug fixing), is in fact very similar. When a new defect report appears, some source code must be written or changed, subsequently the defect correction should be tested and, according to the test result, released or re–implemented. Due to the similarities, a single activity diagram can be used to render both of them, that is Fig. 1.

According to Fig. 1 there is a saving possibility. By knowing upfront that something is implemented correctly we can skip the tests without harm to the system quality (in some cases such simplification may be impossible due to regulations, e.g. development of medical devices). Unfortunately, this is knowledge the practitioners do not possess. It is only possible to predict the correctness of the implementation, and this is what the defect prediction models are about. When considering the usage of defect prediction, the software implementation diagram should be extended as presented in Fig. 2. The defect prediction is used to decide whether tests should be conducted. Software vendors are not eager to plan the tests according to prediction results, i.e. to choose upfront not to test some of the artefacts. However, when there are not enough resources to meet the deadline, some hard decisions must be made. When postponing the deadline is not an option, the only alternative is to skip some of the tests. The number of skipped tests will presumably depend on the available resources, whereas defect prediction

will only be used to prioritize them. Nonetheless, the available resources are not a feature of defect prediction and thus should not be considered as a factor in the defect prediction costs and gains assessment. This is a simplification that deviates from the real resource allocation, but without significantly worsening the estimation quality of the expected costs and gains, which is introduced in the next subsection. The estimation is acceptable since after testing all artefacts marked as defective we can do more tests, i.e. trade some of the saved resources for a lower risk of releasing a defect.
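To make this use of the prediction output concrete, the following sketch (a simplified illustration, not the authors' tooling; the names and the budget parameter are hypothetical) prioritizes artefacts by predicted defect probability and tests only as many as the available resources allow:

def plan_tests(artefacts, predicted_defect_prob, test_cost, budget):
    """Select artefacts to test, most defect-prone first, within a budget.

    artefacts: list of artefact identifiers (features or defect fixes)
    predicted_defect_prob: dict artefact -> probability of being defective
    test_cost: dict artefact -> estimated cost of testing it (e.g. hours)
    budget: total testing resources available (same unit as test_cost)
    """
    ranked = sorted(artefacts, key=lambda a: predicted_defect_prob[a], reverse=True)
    to_test, spent = [], 0.0
    for artefact in ranked:
        if spent + test_cost[artefact] > budget:
            break
        to_test.append(artefact)
        spent += test_cost[artefact]
    return to_test  # the remaining artefacts are released without manual tests

# Example: with a budget of 4 hours only the two riskiest fixes are tested.
print(plan_tests(["fix-1", "fix-2", "fix-3"],
                 {"fix-1": 0.9, "fix-2": 0.4, "fix-3": 0.1},
                 {"fix-1": 2, "fix-2": 2, "fix-3": 2},
                 budget=4))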

There are two different defect prediction units of analysis that are compatible with the flow presented in Fig. 2: features and defect fixes, and in order to have an easy commercial implementation we suggest not to use other ones. When the unit of analysis is smaller (e.g. the commonly investigated files or classes) it is challenging to define a set of actions that should be executed in response to the prediction results. The file or class level artefacts are covered by unit tests, which are usually automated and created not only to detect defects but also to prevent regression, enable safe refactorings and sometimes serve as low level documentation; hence optimizing the number of unit tests according to the prediction outcome may have unexpected consequences and makes the impact of defect prediction application hardly possible to evaluate. Furthermore, even the two suggested units of analysis may create complications. A defect fix, which is the object of re–opens prediction, is usually small and thus we do not expect significant deviations from the presented flows. However, features differ in size and are not always autonomous. When a project is developed in phases and each of the phases spans a number of features, possibly tightly coupled with each other, it might not be reasonable to analyze each feature in separation, as there is considerable risk of introducing a defect in feature B when developing feature A. We can consider a feature autonomous for the sake of defect prediction only when regression in other features can be guarded without manual tests and thus there is no significant test effort in the other features during development of the autonomous feature. Otherwise the scope of tests executed during verification of a feature exceeds the boundaries of the feature, which makes it reasonable to release features in bulks to reduce the number of repetitions of the same test scenarios and dramatically changes the flow. As a consequence, the solution suggested in this work is applicable to iterative processes that are driven by features (e.g. Scrum) and take care of modularity (e.g. micro services


Fig. 2. Software implementation with defect prediction. In brackets there are the probabilities and efforts used in the model discussed in the next subsection.

architecture). Application in other processes (e.g. waterfall) or designs (e.g. monolith) may not work as expected.

2.2. Model for defect prediction expected costs and gains assessment

Defect predictions are not perfect. Sometimes the prediction is wrong and the low quality source code that would be reported during tests may be released, and hence there is a loss for the software vendor. Therefore, it is not obvious that the process depicted in Fig. 2 is more cost–effective. Let us consider four different scenarios in order to evaluate the overall effect of employing a defect prediction model. The intention of the evaluation is to show the gain (or loss) in comparison to a baseline, which is testing all software artefacts.

– Scenario faulty&faulty (implementation: faulty, prediction: faulty). The artefacts are tested and thus there is no deviation from the baseline.

– Scenario faulty&ok (implementation: faulty, prediction: OK). According to the prediction results the software artefacts are not tested, but in fact they are incorrect and thus should be tested and corrected. Not all defects are discovered during tests, thus the testing effectiveness should be considered in this scenario. In consequence, there is a loss related to releasing the system with defects.

– Scenario ok&faulty (implementation: OK, prediction: faulty). The artefacts are tested and thus there is no deviation from the baseline.

– Scenario ok&ok (implementation: OK, prediction: OK). The artefacts are correct and are not tested. Thus, there are savings that come from skipping the tests.

In order to render the evaluation in a more formal way, let us define:

– If – implementation not correct (faulty),
– Iok – implementation correct,
– Cpr – post–release cost connected with releasing a defective software artefact,
– Ct – cost of testing a software artefact,
– Prok – prediction result: implementation is OK,
– e – testing effectiveness, where 0 ≤ e ≤ 1 and e = 1 represents detecting all defects in the inspected software artefacts,
– G – expected gain (or loss when G is below 0) of prediction model application for a certain artefact, i.e. the gain that comes from employing the model in software development with respect to a certain artefact,
– TG – total gain of prediction model application, i.e. the sum of the values of G calculated for all artefacts,
– P(x) – probability of x; conditional probabilities are also used,
– x̄ – average value of x,
– n – number of software artefacts.

G = P(Iok) ∗ P(Prok|Iok) ∗ Ct − P(If) ∗ P(Prok|If) ∗ (e ∗ Cpr − Ct)    (1)

The value of G equals the difference between the gain that comes from scenario ok&ok and the loss from scenario faulty&ok, and is expressed in the same unit as Cpr and Ct, which can be man–days as well as money. The scenario ok&ok is the probability of the occurrence of two random events, i.e. correct implementation and a defect prediction outcome stating that the implementation is correct, multiplied by the cost of testing. In other words, it is the expected saving from not testing the correctly implemented features. The scenario faulty&ok is the probability of two random events, i.e. faulty implementation and prediction claiming that the very same functionality is defect free, multiplied by the expected cost of releasing defective functionality. The cost of releasing defective functionality is reduced by the cost of the tests, since according to the prediction outcome tests are not conducted. In other words, the scenario represents the expected loss being a result of skipping tests of defective functionality because of encouraging prediction results.

The equation can be reformulated in a more compact form when assuming that P(gain) = P(Iok) ∗ P(Prok|Iok) and P(loss) = P(If) ∗ P(Prok|If):

G = P(gain) ∗ Ct − P(loss) ∗ (e ∗ Cpr − Ct)    (2)

Unfortunately, even the compact form is not handy when calculating TG, since it would be necessary to


know the value of G for each of the artefacts that goes through the defect prediction process. And this is knowledge we cannot expect in a real life software development process. In order to make it applicable in practice, we need a simplifying assumption, i.e. that the testing cost is correlated with neither the actual implementation status (correct, not correct) nor the defect prediction results (the two latter ones should be correlated with each other). With the aforementioned assumption, TG can be calculated using the following formula:

TG = n ∗ (P(gain) ∗ Ct − P(loss) ∗ (e ∗ Cpr − Ct))    (3)

Please note that P(gain) and P(loss) can be estimated using the confusion matrix:

P(gain) = (artefacts in scenario ok&ok) / n    (4)

P(loss) = (items in scenario faulty&ok) / n    (5)

The Cpr, Ct, as well as the testing effectiveness, can be company or project specific. Hence, we advise to estimate them using company historical data or expert opinion. We recommend to overestimate the test effectiveness and the ratio of Cpr to Ct when using Equation 3 in making a decision about using defect prediction. Such estimations will lead to underestimating the TG and, as a consequence, the defect prediction results should not be worse than expected. We followed this recommendation in the experiments presented in the next section. Test effectiveness represents the proportion of defects identified during tests. When it is very low, the majority of defects pass the tests and hence testing is not economically justified regardless of the quality of the prediction. When Cpr is much bigger than Ct, we set high expectations for the defect prediction, as releasing a defect becomes relatively expensive while tests are cheap.
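Equations (3)–(5) translate directly into a few lines of code. The sketch below (a minimal illustration; the function and variable names are ours, not part of the paper) computes TG from a confusion matrix, where TN corresponds to scenario ok&ok and FN to scenario faulty&ok; the example numbers correspond to the J48graft row of Tab. 1 with the cost assumptions used in Section 3:

def total_gain(tn, fn, n, c_t, c_pr, e=1.0):
    """Expected total gain (Eq. 3) of applying defect prediction.

    tn   -- artefacts in scenario ok&ok (correct, predicted correct)
    fn   -- artefacts in scenario faulty&ok (faulty, predicted correct)
    n    -- total number of artefacts
    c_t  -- average cost of testing one artefact
    c_pr -- average post-release cost of releasing a faulty artefact
    e    -- testing effectiveness, 0 <= e <= 1 (1 = tests find every defect)
    """
    p_gain = tn / n   # Eq. 4
    p_loss = fn / n   # Eq. 5
    return n * (p_gain * c_t - p_loss * (e * c_pr - c_t))

# Ct = 2 h, Cpr = 20 h (1:10:100 rule), e = 1.
print(total_gain(tn=1006, fn=3, n=1086, c_t=2, c_pr=20))  # 1958.0 hours saved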

3. Experiment

We see two different defect prediction units of analysis that correspond well with the considerations presented in the previous section, those are features and defect "re–opens". In this section we apply defect prediction to the latter. The purpose of the experiment was to predict whether and in what conditions a bug which was resolved can be "re–opened" in the future. The experiment is based on data collected from two open–source projects: Flume and Oozie.

3.1. Experiment design

In order to perform the classification we created an application using WEKA, which is a powerful tool for data mining. It gives the possibility to classify data using various classification methods, perform feature selection, visualize the obtained classification results (e.g. decision trees), etc.

To validate the performance of a prediction we used the 10–fold cross–validation (CV) method. After splitting the data into training and test sets, in the next step a feature selection on the training set was performed for each fold. Thanks to that we could select a subset of features which are the most appropriate for predicting whether a bug can reappear in the future. In the feature selection process we used the Best–First Search algorithm. This algorithm combines the advantages of breadth–first search (it prevents the situation of getting stuck in a 'dead end') and depth–first search (finding a solution without examining every 'competitive' resolution). Generally, the algorithm is used for searching a graph, but it is suitable in our situation too. The Best–First Search algorithm chooses at every step the option for which the estimate of the evaluation function is the best. In our case Best–First searches the space of feature subsets by greedy hill–climbing augmented with a backtracking facility. As the evaluation function we chose the 'ClassifierSubsetEval' function. It evaluates attribute subsets on training data and uses a classifier to estimate the 'merit' of a set of attributes. We used the J48 classifier as an evaluator.

The next step was data classification. The number of instances belonging to the specific classes, "Bad" or "Good", in predicting whether a bug can be re–opened ("Bad" means that the bug can be re–opened and "Good" that it will probably not happen) was imbalanced. In the initial dataset, the number of instances with the "Good" value of the "Re–opened" attribute was much higher than the number of those with the "Bad" value in both the Flume and Oozie projects. To solve this problem we used SMOTE (Synthetic Minority Over–sampling Technique) [19].

After including SMOTE, a classification method was chosen. According to [7], among the most frequently used in defect prediction are Decision Trees, Regression–based and Bayesian approaches. Hence, we decided to employ two from each of those categories: J48graft, BFTree, Logistic, ClassificationViaRegression, BayesNet and NaiveBayesSimple. We used WEKA with the default classifier parameter values, only the threshold value has been raised to 0.85 to reflect the loss function, which in our case is asymmetric. We recommend using a value that is related to the ratio


of the cost of releasing a defect (e ∗ Cpr − Ct) to the cost of testing (Ct).

To evaluate the obtained results we used Recall (the probability of correct classification of a faulty artefact) and Precision (the proportion of correctly predicted faulty modules amongst all modules classified as faulty).
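For readers who want to reproduce the general shape of this setup, the sketch below builds an analogous pipeline in Python with scikit-learn and imbalanced-learn rather than WEKA. The synthetic dataset, the decision tree classifier, the sequential feature selector and the interpretation of the 0.85 threshold are stand-ins for the components named above (they are not the authors' exact configuration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for the 17 issue-tracking attributes; y == 1 ~ "re-opened".
X, y = make_classification(n_samples=1086, n_features=17, weights=[0.93],
                           random_state=0)

threshold = 0.85          # require high confidence before skipping the tests
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
recalls, precisions = [], []

for train_idx, test_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Per-fold feature selection on the training data only.
    selector = SequentialFeatureSelector(DecisionTreeClassifier(random_state=0),
                                         n_features_to_select=8)
    selector.fit(X_tr, y_tr)
    X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X[test_idx])
    # Balance the training fold with SMOTE.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr_sel, y_tr)
    clf = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)
    # Classify a fix as "Good" only when the model is at least 85% confident;
    # otherwise treat it as potentially re-opened (asymmetric loss).
    prob_good = clf.predict_proba(X_te_sel)[:, 0]
    y_pred = (prob_good < threshold).astype(int)
    recalls.append(recall_score(y[test_idx], y_pred))
    precisions.append(precision_score(y[test_idx], y_pred, zero_division=0))

print(np.mean(recalls), np.mean(precisions))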

3.2. Data collection

We investigated two open-source projects, Oozie and Flume, with respect to predicting defect re–opens. Oozie is a server–based workflow scheduling and coordination system to manage data processing jobs for Apache Hadoop. Oozie has been developed since August 2011 and by April 2017 it had 191,960 lines of code (the aforementioned dates provide the data collection time interval). More information about this project is available on http://oozie.apache.org/. The second project, Flume, is a distributed, reliable, and highly available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. Flume is robust and fault tolerant with tuneable reliability mechanisms. It uses a simple extensible data model that allows for online analytic applications (http://flume.apache.org/). The project has been developed since June 2010 and had 93,429 lines of code (April 2017). Information about issues related to both projects can be found at https://issues.apache.org/jira/.

The two projects were chosen due to their rich history of defects, including re–opens, i.e. Flume has 1,087 bugs, 77 of which have (or had) the status "Re–opened", and Oozie has 1,484 bugs, 429 of which have a "Re–opened" status. To collect data about defects we used an application developed at Wroclaw University of Science and Technology – QualitySpy [20]. It is a tool for collecting data about the history of issues in software projects stored in issue tracking systems (e.g. Atlassian JIRA) and other software development systems. More information about QualitySpy can be found on https://opendce.atlassian.net/wiki/spaces/QS.

Using this tool we collected 18 attributes – 17 independent variables and the object of prediction, i.e. whether a defect has been re–opened:

– created – date of issue creation,
– resolved – date of issue first resolution,
– fixing duration – time from issue creation to first resolution (in days),
– resolution day of week – day of week of the issue first resolution,
– resolution day of month – day of month of the issue first resolution,
– resolution day of year – day of year of the issue first resolution,
– resolution month – month of the issue first resolution,
– resolution hour – hour of day of the issue first resolution,
– component in which the bug was found,
– environment in which the bug occurred (operating system, Java version, etc.),
– initial priority – one of the following: trivial, minor, major, critical, blocker,
– advancement of project – days between the creation of the first issue in the project and the current issue creation,
– number of defects reported by reporter – number of defects reported by the reporter of the current issue before the current issue,
– number of defects resolved by assignee – number of defects resolved by the person assigned to the current issue before the current issue resolution,
– number of changes registered in the issue tracking system before the current issue,
– number of comments in the issue tracking system in the current issue,
– number of changes of the assignee before the issue resolution date,
– re–opened – whether the issue was re–opened after the first resolution.

Most of the selected attributes were suggested by [21]. We investigated attributes which can be divided into 4 categories: work habits (created, resolved, attributes regarding resolution time), bug report (component, environment, advancement of project, initial priority), bug fix (fixing duration, re–open, number of changes, number of comments), people (reported by reporter, resolved by assignee, assignee changes). Attributes from these categories can be used to check what has the biggest influence on predicting whether a bug can be re–opened in the future. The collected data were published online4.

4 https://www.researchgate.net/profile/Marian_Jureczko/publication/271735532_Oozie_Flume_-_collected_data/data/596108f8aca2728c11cf1067/flume-input-data.arff and https://www.researchgate.net/profile/Marian_Jureczko/publication/271735532_Oozie_Flume_-_collected_data/data/596109250f7e9b81943f66f7/oozie-input-data.arff
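A few of these attributes are derived rather than read directly from the tracker. The sketch below (an illustration with a made-up minimal issue record, not QualitySpy code) shows how some of the derived ones can be computed:

from datetime import datetime

def derive_attributes(issue, all_issues):
    """Compute several derived predictor values for a single issue.

    issue: dict with at least 'created', 'resolved' (datetime) and 'reporter'.
    all_issues: list of such dicts for the whole project.
    """
    first_created = min(i["created"] for i in all_issues)
    return {
        "fixing_duration_days": (issue["resolved"] - issue["created"]).days,
        "resolution_day_of_week": issue["resolved"].weekday(),
        "resolution_hour": issue["resolved"].hour,
        "advancement_of_project_days": (issue["created"] - first_created).days,
        "defects_reported_by_reporter_before": sum(
            1 for i in all_issues
            if i["reporter"] == issue["reporter"] and i["created"] < issue["created"]),
    }

issues = [
    {"created": datetime(2012, 1, 1, 9), "resolved": datetime(2012, 1, 3, 17),
     "reporter": "alice"},
    {"created": datetime(2012, 2, 1, 10), "resolved": datetime(2012, 2, 5, 11),
     "reporter": "alice"},
]
print(derive_attributes(issues[1], issues))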


3.2.1. Experiment results
In this section we present the results of our research

on predicting whether a bug can be re–opened in the future. The results obtained for the Flume project are presented in Tab. 1 and for Oozie in Tab. 2.

Ten–fold cross validation with the SMOTE technique for data balancing was used and resulted in satisfactory outcomes, i.e. Recall up to 0.982 and Precision up to 0.997 in the Flume project and Recall up to 0.946 and Precision up to 0.999 in the Oozie project. Nonetheless, the confusion matrix is the most interesting outcome with respect to the defect prediction costs and gains assessment. In the case of the Flume project it shows that there were 1009 correctly resolved defects (nok = 1009) and 77 re–opened ones (nf = 77). The further calculations are done using Equation 3. To get the final gain (or cost) of defect prediction, two additional variables must be evaluated, i.e. the average cost of testing a bugfix (Ct) and the average post–release cost of releasing an invalid bugfix (Cpr). The Ct was assumed to be equal to 2 working hours 5

whilst Cpr was evaluated using the 1:10:100 rule 6 andthus assumed to be equal to 20 working hours.

According to Tab. 1 the best prediction model (i.e. BayesNet) can save 1845 working hours in the testing process. The tests' effectiveness was assumed to be equal to 1 for the calculations, which represents perfect effectiveness. It is a defensive approach as it minimizes the expected gain produced by defect prediction. Please note that the gain shall be considered as a difference in costs in comparison with an approach where each of the software artefacts is tested, which is 85% of the overall bugfixing testing cost (the cost was estimated using the aforementioned value of Ct).
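As a worked example of Equation 3 under these assumptions (Ct = 2 h, Cpr = 20 h, e = 1), take the J48graft row of Tab. 1, with 1006 artefacts in scenario ok&ok and 3 in scenario faulty&ok out of n = 1086: TG = 1006 ∗ 2 − 3 ∗ (1 ∗ 20 − 2) = 2012 − 54 = 1958 working hours, which relative to the baseline cost of testing every bugfix (1086 ∗ 2 = 2172 hours) gives the 90.1% of savings reported in the table.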

The same evaluation procedure and the same values of Ct as well as of Cpr were used for the Oozie project. Moreover, there were 1055 correctly resolved defects (nok = 1055) and 429 re–opened ones (nf = 429). In the case of the Oozie project, the fraction of incorrectly

5 It is a rough estimate based on a figure given by Hryszko and Madeyski [22], who reported that fixing and testing a defect costs 3 hours on average.

6 The rule recently became very popular among practitioners, e.g. www.dqglobal.com/why-data-should-be-a-business-asset-the-1-10-100-rule (last access on 29/11/2018). We are using the rule to cover not only the costs of bugfix implementation, which are usually smaller, but also the costs of reverting defect side effects, e.g. data migrations cleaning broken data or adjustments to changes in a published API.

resolved defects is greater, the data is more balanced, and in consequence there are fewer cases with the possibility of making savings during tests. Nonetheless, the number of man hours that can be saved is considerable and according to Tab. 2, in the case of the best classifier (BayesNet), is equal to 1176, which is 39.6% of the overall bugfixing testing cost. In the worst case TG is negative, which represents a loss. We used the 1:10:100 rule, thus the penalty for releasing a faulty artefact is significantly greater than the gain from not testing a defect free one. As a consequence, there are savings only when the classification results are of high quality.

3.3. Other studies

The experiments cover a very limited scope, while there are similar studies conducted by other researchers. This section applies the suggested model for defect prediction costs and gains assessment to publicly available defect re–opens prediction results. We went through all the studies mentioned in Section 5 and selected those that provide all the data required by the aforementioned model, which in fact is the confusion matrix. The confusion matrix is not published frequently, but there are methods to restore it, suggested by Bowes et al. [23].

Experiments regarding defect re–opens prediction were conducted by An et al. [24], Jureczko [25], Shihab et al. [21], Xia et al. [26] and Xia et al. [27]. An et al. [24] focused on supplementary bug fixes, which unfortunately changes the context and makes our model inapplicable. Thus, we excluded this study. Jureczko [25] and Xia et al. [27] did not publish enough data to restore the confusion matrix. Xia et al. [26] investigated one of the projects analyzed by Shihab et al. [21], hence we chose only the latter one (the confusion matrix was restored using the equations suggested by Bowes et al. [23]). As a consequence, there were only two studies that can contribute to the external validity of the suggested model for defect prediction costs and gains assessment, i.e. [21] and [25]. [21] is an experiment conducted on three open-source projects: Eclipse, Apache HTTP and OpenOffice. The authors investigated prediction performance with respect to selected subsets of dependent variables. For the sake of simplicity we consider only the results obtained for all dependent variables. Three different classification algorithms are considered. The fourth one analyzed by Shihab et al. [21], i.e. Zero–R, is excluded. The Zero–R algorithm predicted the majority class, which was not re-opened, and hence it did not detect re–opens. Restoring
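One simple way to restore a confusion matrix from published performance figures, when the class sizes are known, follows directly from the definitions of recall and precision. The sketch below is our own illustration and not necessarily the exact procedure of Bowes et al. [23]; the class counts in the example are the totals implied by Tab. 3:

def restore_confusion_matrix(recall, precision, n_positive, n_total):
    """Recover TP, FN, FP, TN from recall, precision and class counts.

    n_positive -- number of truly faulty (re-opened) artefacts
    n_total    -- total number of artefacts
    """
    tp = round(recall * n_positive)               # recall = TP / (TP + FN)
    fn = n_positive - tp
    fp = round(tp * (1 - precision) / precision)  # precision = TP / (TP + FP)
    tn = n_total - tp - fn - fp
    return tp, fn, fp, tn

# Example with the Eclipse / Decision tree row of Tab. 3
# (recall .705, precision .521, 257 re-opened reports out of 1530).
print(restore_confusion_matrix(0.705, 0.521, 257, 1530))  # (181, 76, 166, 1107)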


Table 1
Classification results for "Re–opens" prediction in the Flume project

Classifier              TP   FN   FP   TN    Recall  Precision  TG [h]  % of savings
J48graft                44    3   33   1006  .936    .571        1958    90.1
BFTree                  44    8   33   1001  .846    .571        1858    85.5
Logistic                58  154   19    855  .274    .753       -1062   -48.9
ClassificationViaRegr.  25    9   52   1000  .735    .325        1838    84.6
BayesNet                58    3   19   1006  .958    .782        1959    89.3
NaiveBayesSimple        63  245   14    764  .178    .791       -2882   -133.9

Table 2
Classification results for "Re–opens" prediction in the Oozie project

Classifier              TP   FN   FP   TN    Recall  Precision  TG [h]  % of savings
J48graft               177    3  252   1052  .983    .413        2050    69.0
BFTree                 252    9  177   1046  .966    .587        1930    65.0
Logistic               375  195   54    860  .658    .874       -1790   -60.0
ClassificationViaRegr.  67    1  362   1054  .985    .156        2090    70.4
BayesNet               369    8   60   1047  .979    .860        1950    65.7
NaiveBayesSimple       375  195   54    860  .658    .874       -1790   -60.0

the confusion matrix for the Zero–R algorithm would be challenging and the results might not be representative.

The results are presented in Table 3. We used the restored confusion matrices to calculate costs and gains. The results are very encouraging, as the possibility of savings varies between 4% and 81.9% of the estimated cost of defect testing, which in our simulation is the approximation of the cost of testing all bugfixes. It is also noteworthy that this is in line with the results obtained for the Flume and Oozie projects.

4. Threats to validity

This section evaluates the trustworthiness of the obtained results and conclusions, and reports all aspects of this study that are possibly affected by the authors' subjective point of view.

4.1. Construct validity

The threats to construct validity refer to the extent to which the employed measures accurately represent the theoretical concepts they are intended to measure. The data about defects were collected directly from an issue tracking system. The employed data mostly come from issue tracking systems, which is a data source connected with a well known threat.

According to Antoniol et al. [28], a fraction of the reports is not connected with corrective actions. We had no reliable means to identify and correct the aforementioned threat, thus we decided to take the data as is and not introduce changes.

A method of defect prediction evaluation was suggested and assessed on five projects. The assessment is in fact a simulation conducted on real projects. Each of the investigated software artefacts comes from the real world, but the application of defect prediction and its results are the authors' expectation of the most plausible scenario. Therefore, an empirical evaluation conducted on a real project is a natural direction of further development that will allow us to mitigate the aforementioned threat to validity.

4.2. External validity

A limited number of projects was investigated. To some extent we improved the external validity by using prediction results published by other researchers. Nonetheless, it is hard to justify whether the experiment's results can be extrapolated and make sense for other projects. There is no basis for claiming that a general rule has been discovered. It is rather a proof of concept – it has been shown that there exist projects for which the suggested approach to defect prediction is beneficial. We believe that the results are applicable to a wide scope of projects, however, providing


Table 3
Classification results for projects investigated by Shihab et al. [21]

Project       Algorithm      TN      FP     FN    TP    Recall  Precision  TG [h]  % of savings
Eclipse       Decision tree   1107    166     76   181   .705    .521         854   27.9
Eclipse       Naive Bayes     1091    191     65   183   .739    .490        1017   33.2
Eclipse       Logistic reg    1077    203     82   168   .672    .453         680   22.2
Apache HTTP   Decision tree  12614    806     55   884   .941    .523       24230   84.4
Apache HTTP   Naive Bayes    12672    751    283   653   .698    .465       20259   70.5
Apache HTTP   Logistic reg   12577    845    203   734   .783    .465       21491   74.8
OpenOffice    Decision tree  27051   2566   1130  9426   .893    .786       33771   42.0
OpenOffice    Naive Bayes    19438  10120   1008  9607   .905    .487       20725   25.8
OpenOffice    Logistic reg   25471   4084   1179  9439   .893    .786       29727   36.9

evidence requires further empirical investigation. It must also be stated that it was not our intention to discover a silver bullet, an approach that is useful in every single project. In consequence, we believe that the following project characteristics may make the suggested approach inapplicable:

– not following the work–flow presented in Fig. 1;
– using automated tests instead of manual ones (for automated tests the gains and costs should be calculated differently);
– an unacceptable possibility of regression that exceeds the boundaries of the object of prediction but is caused by its development;
– regulations forbidding test optimization, e.g. skipping some of the tests.

4.3. Internal validity

The threats to internal validity refer to the misinterpretation of the true cause of the obtained results.

The conducted experiments are based on a limited number of data sources, i.e. the issue tracking system. There is a considerable possibility that employing additional data sources would result in better prediction efficiency. Nonetheless, we decided not to pursue efficiency related goals but rather to keep track of the primary objective, which is removing obstacles related to the practical application of defect prediction. The selection of the data source for predicting re–opens corresponds with the findings of other researchers, e.g. [21].

4.4. Reliability

The study design is driven by the reasons for the low applicability of defect prediction in industry identified by the authors. Each of the authors has industrial experience, ranging from 5 to 20 years. Thereby, one may expect

that the reasoning is not detached from software development reality and corresponds with what the industry copes with. Nevertheless, it is a subjective point of view and it is possible that some of the assumptions are false or some important factors were overlooked.

5. Related work

This work touches several different concepts. The main driver and contribution regard the suggested method of implementation and the costs–benefits assessment of defect prediction. Therefore, this section is mostly focused on such research. Additionally, we conducted an empirical experiment and thereby we also mention studies investigating defect prediction in the context of similar experiments.

5.1. Costs and benefits of defect prediction

According to Arora et al. [29] most of the work regarding defect prediction has been done considering its ease of use and very few studies have focused on its economical position, but determining when and how much benefit it brings is very important. Among the few exceptions are the studies conducted by Khoshgoftaar et al. [30] and [31], where the costs of misclassification were investigated, or by Jiang et al. [32], which suggested cost curves as a complementary prediction technique.

Weyuker et al. [10] examined six large industrial systems. The evidence they collected was used to build a defect prediction model which later became the core of an automated defect prediction system. In other words, the authors created and reported a tool (they declared that it is a prototype) that requires no advanced knowledge regarding data mining or


software engineering but allows one to generate defect predictions for the files in a release that is about to enter the system testing phase. The tool is considered helpful in scheduling test execution and assigning test resources. The tool output can also point to places (i.e. files) for additional code inspections. In consequence, this should lead, according to the authors, to faster defect identification and therefore testing cost reduction (the authors recommend against skipping quality related activities in files predicted to be defect free). The prototype had been presented to practitioners and was recognized as interesting and useful. Nonetheless, the authors admitted that it is difficult to get customers and thus they were not able to report field data regarding the costs and benefits of installing the model in a real software development process.

Bell et al. [17] discussed the empirical assessment of the impact of incorporating defect predictions into the software development process. They considered several experiment designs, including a randomized controlled trial, a randomized controlled trial with crossover design and considering individual files as subjects. The conclusion was that the most suitable is a hybrid approach where a large system with a number of independent subsystems is examined. In that case, subsystems are assigned at random to groups of testers and developers with and without feedback from the prediction model. The impact assessment is considered to be challenging since, according to the authors, there are no reports of industrial application and the assessment method has not been investigated and presented in the literature. It also must be noted that we have narrowed the assessment issue in this paper, and the scenarios we are considering were identified by Bell et al. [17] as using the prediction model 'to speed up the testing process, and advance the release date of a release', which should result in cost savings and is consistent with our interpretation.

In contrast to the aforementioned works, Briand et al. [16] suggested a formal method of evaluating the results of using a defect prediction model. The one presented by Briand et al. [16] is in many aspects similar to ours, thus let us focus on the differences. Other objects of prediction were considered, which is crucial with respect to this study's goals. We used defect re–opens and feature defectiveness, whereas Briand et al. [16] assumed that it is the software class, and thus the main driver of their model design was class size (expressed using the number of lines), which is inadequate in our case. Another relevant difference regards the baseline. We compared the defect predictions against

a scenario where all artefacts are tested. Briand et al. [16] referred to a prediction model with outcomes being proportional to the size of the considered software class. This difference had a significant impact on the final costs and gains model equation, as using another baseline allowed other equation transformations.

The gap between defect prediction research and commercial applications was also recognized by Hryszko and Madeyski [22]. The authors addressed it mostly by providing tool support, but also by analyzing the benefit–cost ratio, which was based on Boehm's Law about defect fix cost. Namely, it was assumed that in a waterfall project the cost is higher when a defect is identified in a later project phase and that some savings can be made by moving well targeted quality assurance activities to earlier phases. The suggested solution differs from ours as it operates on class level predictions and is not focused on iterative projects.

The evaluation methods of defect prediction models with the class as the object of prediction were analyzed by Arisholm et al. [33]. The authors noticed that test efforts for a single class are likely to be proportional to its size and hence suggested a surrogate measure of cost–effectiveness that is defined in terms of the number of lines of source code that should be visited according to the model output. Classes are ordered from high to low defect probability. Accordingly, a curve is defined. The curve represents the actual defect percentage given the percentage of lines of code of the classes selected by the prediction model. The overall cost–effectiveness is the surface area between the aforementioned curve and a baseline that one would obtain on average if classes were selected randomly. The authors clearly stated that the suggested cost–effectiveness measure is a surrogate one and the evaluation of a real return on investment requires field data. Nonetheless, a pilot study was conducted and the obtained results were very promising. The authors have not considered the evaluation of defect prediction models with other objects of prediction, thus their considerations do not overlap with our work. Measuring test effort using the number of lines of code was also considered by other researchers (e.g. Mende and Koschke [34], Kamei et al. [35] or Zhang et al. [12]) in the form of so called effort aware defect prediction models. Using the number of lines corresponds well with a small prediction unit of analysis (e.g. classes or methods), which is not in line with our solution. Thus we do not discuss this approach further.

Taipale et al. [3] analyzed how to communicate defect prediction outcomes to practitioners effectively. Three presentation methods were considered and two


of them directly refer to the prediction results, those are:

– Commit hotness ranking – a ranking of file changes sorted by their probability of causing an error.

– Error probability mapping to source – a copy of the source of the project with an estimate of the error probability for each line of code.

The first one is in line with our approach, as a defect fix can be delivered in a single commit (it is also possible in the case of a feature, but less likely), thus the findings regarding its visualisation are relevant for us as well. Taipale et al. [3] reported feedback suggesting that the presentation should be more proactive, and specifically that there should be an instant response when committing source code, which leads towards the available solutions for static code analysis. The concept is also similar to just–in–time defect prediction ([36,37]), where not only is a single commit used as the unit of analysis, but a quick feedback loop is also recommended. The just-in-time approach is focused on making the unit of analysis small in order to provide practitioners with well targeted predictions, whereas in our approach we recommend a relatively big unit of analysis to simplify installing defect prediction in a software development process. Nonetheless, the difference is smaller than one may think. Both a defect fix and a feature implementation can be delivered in a single commit; the trunk based development approach encourages that to some extent. However, there are also other approaches, like feature branches, where a fraction of commits represents work in progress and cannot be considered as an object of code review or testing. That makes the expected costs and gains assessment challenging (one of our objectives), especially when we take into consideration the popularity of distributed version control systems, where two different operations can be recognized, i.e. committing changes to a local repository and pushing a set of commits to a remote repository (and on top of that there is the rebasing operation, which enables altering the history of commits).

5.2. Predicting re–opens

The concept of predicting re–opens was recognized a couple of years ago ([38,25]). Since then a number of studies on this topic have been conducted: [39,40,21,26,24,27]. Among them we would especially like to note Shihab et al. [21], as it employs a metrics set very similar to the one used in our study. Please note that the goal of our work does not regard improving the quality of predicting re–opens, and thus the similarities with the works listed above are very limited.

6. Discussion and conclusions

The study was motivated by the limited usage of defect prediction in industry. The authors identified four issues that may create an obstacle to using defect prediction, i.e. the mapping of defects to the prediction unit of analysis, the lack of a clear definition of the defect prediction data collection program, the lack of a method of estimating return on investment, and poor tool support. The issues were addressed with propositions of changes in the method and application of defect prediction. It was shown how to implement defect prediction in a software development process, and objects of prediction other than the commonly used ones were suggested, i.e. defect re–opens and feature defectiveness.

The new units of analysis are free of the challenges related to defect mapping (Issue 1). When we predict re–opens, there is a clear indicator (the output of defect prediction) whether the defect fix shall be tested or can be released right away. The situation is similar in the case of features: when the defect prediction indicates that a feature is defective, it is obvious what should be tested. The defect prediction data collection program is much clearer in the suggested approach (Issue 2). It is safe to assume that releasing a new feature, as well as fixing a defect, is always processed in the same way. In Section 2 we presented something that can be considered a common denominator of many software development processes. There are projects that operate differently, but we defined a clear border of applicability. On top of that, a method of assessing the gains and costs of using defect prediction was proposed (Issue 3). It is challenging to develop a production ready tool as part of research, but we managed to create a prototype (Issue 4). Two Jira plug–ins that allow further evaluation of the suggested defect prediction models have been developed. The plug–in for re–opens prediction is available on–line7. The changes suggested in this work may seem insignificant as they are not far removed from ideas suggested by others.

7 https://www.researchgate.net/publication/271735579_Reopens_prediction_Jira_AddOn_-_prototype, whereas the one for predicting feature defectiveness can be downloaded from https://www.researchgate.net/publication/272175560_Predicting_defectiveness_Jira_AddOn_-_prototype.
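As a rough illustration of how a confusion matrix can feed such an assessment (Issue 3), consider the Python sketch below. The cost parameters and the simple accounting (testing every artefact as the baseline, a fixed cost of testing an artefact, a fixed cost of an escaped defect) are our own simplifying assumptions for the example, not the exact equations proposed in this study.

```python
def cost_and_gain(tp, fp, tn, fn, test_cost, escaped_defect_cost):
    """Compare testing only the artefacts predicted defective against the
    baseline of testing every artefact.

    tp/fp/tn/fn         -- confusion matrix of the defect predictor
    test_cost           -- average cost of testing one artefact (e.g. man-hours)
    escaped_defect_cost -- average cost of a defect released because its
                           artefact was (wrongly) predicted clean
    """
    baseline = (tp + fp + tn + fn) * test_cost            # test everything
    with_prediction = (tp + fp) * test_cost + fn * escaped_defect_cost
    return baseline - with_prediction                     # positive value = savings

# Hypothetical predictor: 40 true positives, 10 false positives,
# 140 true negatives, 10 false negatives; testing an artefact costs
# 8 man-hours and an escaped defect costs 40 man-hours to handle.
print(cost_and_gain(40, 10, 140, 10, test_cost=8, escaped_defect_cost=40))  # 800
```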


However, when comparing them against other defect prediction studies, almost none of them fulfills the new requirements and thus they do not have a clear way towards commercial implementation, which is not surprising, as we significantly narrowed the scope of application (only two acceptable units of analysis and some project constraints). We do not prove that such studies are inapplicable, but we also do not see a scenario or recipe for implementing them in industry. Researchers' recommendations usually end up with abstract advice like moving quality assurance efforts to other, selected areas or phases. But when applying such advice by telling a software developer or tester what exactly should be done, or when considering the exact causes of the defect prediction implementation, we are at a loss. A very interesting solution was suggested by Taipale et al. [3], which is not in line with our recommendations. Taipale et al. [3] conducted a study where defect prediction was developed in a feedback loop with the practitioners. The results are impressive, but we consider such a solution a very challenging one and hence we investigated a simpler alternative. We believe that the feedback driven development was going towards static code analysis, which has good tool support and thus is already well known by practitioners and presumably significantly affects their expectations. Static code analysis is very handy in software development as it not only points to possible defect locations but also explains what presumably is wrong. With such a tool the developers might get reports similar to: "In line 176 of class Xyz '=' was used instead of '=='". It is obvious what actions should be executed in response. The developer can go to the line and change '=' to '==' or mark the report as a false positive. It is hardly possible to beat such reports with defect prediction models as they operate in a different way. Static code analysis is about interpreting the source code and matching it against well known and documented patterns, i.e. it is to some extent about looking for the defect itself, whereas defect prediction is about identifying factors that increase the possibility of committing a defect. It could be reasoning like "if a highly coupled code is often changed, there is a considerable possibility of having a defect", and when using it in place of a static code analysis tool, the developer or tester is presented with a report like "Class Xyz may contain a defect since it was changed 3 times by 2 different developers and its CBO is greater than 10". Such advice is less precise and does not indicate an exact corrective action. The only thing the developer or tester can do is to review or unit test the class and hope that it will be enough to decide whether there is a defect to fix or it is just a false positive. Please recall the defect report we discussed in Section 1; it is really challenging for a tester or developer to decide about the corresponding corrective action.

We recommend against using a small unit of prediction (classes or methods) for two reasons. Classes and methods are the subject of unit tests, which are usually automated and responsible not only for detecting defects but also for allowing safe refactorings and preventing regression in a continuous manner. Driving the unit tests with defect prediction results would ignore those factors and hence might have unpredictable consequences. And there are difficulties in applying the results to other types of tests. Functional, acceptance, integration or contract tests (the list may go on) aim at artifacts that consist of a number of classes and methods. In the case of some test scenarios, e.g. when testing a feature, a class may be only partly involved. As a consequence, it is not obvious how such tests should be prioritized to correspond well with the outcomes of class level defect prediction. When the defect prediction is trained using such data, its outcomes are not very usable for practitioners. Obstacles regarding non-unit tests can be overcome, as many companies collect traceability data and are at least able to identify relations between classes, features and test scenarios. There is room for interpretation in mapping class or method level prediction results onto larger artifacts, thus we consider that an interesting direction for further research. We did not investigate it in this work, as we do not see a solution other than manual inspection to the issue of low quality class level data, which would make industrial application challenging.
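For illustration only, the Python sketch below shows one possible way of mapping class level prediction outputs onto a larger artifact such as a feature, assuming a traceability mapping from features to classes is available. The aggregation rule (treating class level probabilities as independent) and all names and data are our own assumptions, not a method proposed in this study.

```python
from typing import Dict, List

def feature_defect_probability(class_probs: Dict[str, float],
                               feature_to_classes: Dict[str, List[str]]) -> Dict[str, float]:
    """Aggregate class level defect probabilities to feature level.

    Assumes a feature is defective if at least one of its classes is and,
    naively, that class level predictions are independent:
    P(feature defective) = 1 - prod(1 - p_class).
    """
    result = {}
    for feature, classes in feature_to_classes.items():
        clean = 1.0
        for cls in classes:
            clean *= 1.0 - class_probs.get(cls, 0.0)  # unknown classes treated as defect free
        result[feature] = 1.0 - clean
    return result

# Hypothetical traceability data: feature F1 touches two classes, F2 one.
probs = {"Xyz": 0.4, "Abc": 0.2, "Def": 0.1}
links = {"F1": ["Xyz", "Abc"], "F2": ["Def"]}
print(feature_defect_probability(probs, links))  # roughly {'F1': 0.52, 'F2': 0.1}
```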

The second reason regards the quality of training data. We discussed a defect report in Section 1 that represents a real system malfunction but did not make much sense on the class level. It could be questioned whether that defect report is representative. It is very challenging to assess the quality of file or class level training data. We believe that it can be reliably done only by manual analysis of each file marked as defective by experts with project domain knowledge, that is, by involving the developers that developed it. Since that is not feasible and the question of the quality of input data is critical, we decided to investigate a sample of files by ourselves. We do not possess the domain knowledge, but we have access to the actual changes intended to fix the defect. We investigated 10 subsequent defect reports of the JEdit project that followed the issue detailed in Section 1, were in status fixed and could be mapped to the source code files changed during the fix. Then we carefully analyzed each of the changed files and classified it as one of the following:

– detectable_defect – the change removed a defect from the changed file,

– non_detectable_defect – the change removed a defect from the changed file, but the issue was undetectable without broader context exceeding the content of the changed file,

– not_a_defect – the change was a side effect of fixing a defect in another file, i.e. before the change the file was as good as after it,

– unclassifiable – we were not able to classify the file into one of the aforementioned categories.

The results of the classification are as follows: detectable_defect = 5, non_detectable_defect = 4, not_a_defect = 7, unclassifiable = 4; they are available online with comments explaining our reasoning8. The reliability of those results is questionable, as the sample is small and did not involve people with the knowledge required to grasp project insights. Nonetheless, we cannot ignore the fact that less than half of the files (9 out of 20) contained a defect, and only a quarter (5 out of 20) contained a defect considered to be detectable. If those results are close to reality, defect prediction significantly suffers from the 'garbage in, garbage out' anti–pattern, which makes a significant point against using a fine-grained unit of analysis in defect prediction studies. Please note that, besides the discrepancy between files changed within a defect fix and actually defective files, there are well known challenges with issue report misclassification (Antoniol et al. [28] or Herzig et al. [41]), i.e. enhancements are labelled as defects and vice versa, which according to Bird et al. [42] can be related to a bias. There are also minor issues regarding the quality of defect data, like the inaccuracies in severity ratings reported by Ostrand et al. [43].

The improvements suggested in this work were tested in a simulation conducted on real world, open–source projects. The experiments showed that the usage of defect prediction can produce substantial savings. When applying the prediction according to the recommendations presented in this work, test efforts can be significantly reduced, i.e. by more than a thousand man–hours in the case of predicting re–opens (up to 90% of the overall bugfixing testing effort; please note that such impressive results may be obtained only for projects where the majority of bugfixes is correct). Savings were obtained for almost all projects and classifiers, but the results have noticeable variance. There are differences between projects and between classifiers within a specific project. Thus we believe that it is not challenging to get re–opens predictions leading to savings, but there are no guarantees. There is some risk of having a low quality prediction that results in releasing defects. This risk should be taken into consideration when deciding whether defect prediction is the right choice, as the study results suggest that there will be savings on average, but not in every case.

8 https://drive.google.com/open?id=14XAb_XHQNgN_NYSamKpJSgKKHBx4ip1GjBptUoUN2qc

The study was conducted on a limited number of projects. Specifically, we did not simulate the effect of predicting feature defectiveness. Predicting defect re–opens is relevant only for bugfixing activities, whereas implementing new features can happen in every project iteration. Unfortunately, collecting feature level data from open–source projects is challenging. The development of a feature may be extended over a long period of time, and tracing feature dependent artifacts without comprehensive project specific knowledge is hardly possible. We increased the number of projects investigated with respect to predicting defect re–opens by reusing the data from the experiments of Shihab et al. [21] in our model for assessing defect prediction costs and gains. We obtained encouraging results that are in line with our own experiments regarding the Flume and Oozie projects. Nonetheless, further research regarding additional projects and different types of artifacts (i.e. features) should be considered. Particularly interesting are proprietary projects that employ manual tests, as this is the primary target of the suggested approach.

Prototypes that can be installed in an issue tracking system were developed. Nonetheless, we consider the study only as a promising proof of concept and foundation for further research. We are planning a tool for software engineers, but before designing it, we are going to at least improve external validity and identify a reasonable set of useful independent variables.

References

[1] V. Kettunen, J. Kasurinen, O. Taipale and K. Smolander, A study on agility and testing processes in software organizations, in: Proceedings of the 19th International Symposium on Software Testing and Analysis, ACM, 2010, pp. 231–240.

[2] A.T. Misirli, A. Bener and R. Kale, AI-based software defect predictors: Applications and benefits in a case study, AI Magazine 32(2) (2011), 57–68.


[3] T. Taipale, M. Qvist and B. Turhan, Constructing Defect Predictors and Communicating the Outcomes to Practitioners, in: Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium on, IEEE, 2013, pp. 357–362.

[4] C. Tantithamthavorn and A.E. Hassan, An Experience Report on Defect Modelling in Practice: Pitfalls and Challenges, in: International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP'18), 2018, pp. 71–73.

[5] E. Weyuker and T.J. Ostrand, An Automated Fault Prediction System, in: Making Software: What Really Works, and Why We Believe It, O'Reilly Media, Inc., 2010, pp. 145–160.

[6] M.H. Halstead, Elements of Software Science (Operating and Programming Systems Series), Elsevier Science Inc., New York, NY, USA, 1977.

[7] T. Hall, S. Beecham, D. Bowes, D. Gray and S. Counsell, A Systematic Literature Review on Fault Prediction Performance in Software Engineering, IEEE Transactions on Software Engineering 38(6) (2012), 1276–1304.

[8] M. Jureczko and L. Madeyski, A review of process metrics in defect prediction studies, Metody Informatyki Stosowanej 30(5) (2011), 133–145.

[9] S. Shivaji, E.J. Whitehead, R. Akella and S. Kim, Reducing features to improve code change-based bug prediction, Software Engineering, IEEE Transactions on 39(4) (2013), 552–569.

[10] E.J. Weyuker, T.J. Ostrand and R.M. Bell, Programmer-based Fault Prediction, in: PROMISE '10: Proceedings of the Sixth International Conference on Predictor Models in Software Engineering, ACM, 2010, pp. 19:1–19:10.

[11] T. Zimmermann, R. Premraj and A. Zeller, Predicting Defects for Eclipse, in: PROMISE '07: Proceedings of the Third International Workshop on Predictor Models in Software Engineering, IEEE Computer Society, Washington, DC, USA, 2007, p. 9.

[12] F. Zhang, A.E. Hassan, S. McIntosh and Y. Zou, The use of summation to aggregate software metrics hinders the performance of defect prediction models, IEEE Transactions on Software Engineering 43(5) (2017), 476–491.

[13] S. Kim, E.J. Whitehead and Y. Zhang, Classifying software changes: Clean or buggy?, Software Engineering, IEEE Transactions on 34(2) (2008), 181–196.

[14] L. Madeyski and M. Jureczko, Which process metrics can significantly improve defect prediction models? An empirical study, Software Quality Journal 23(3) (2015), 393–422, ISSN 0963-9314.

[15] N.E. Fenton and M. Neil, A Critique of Software Defect Prediction Models, IEEE Transactions on Software Engineering 25 (1999), 675–689.

[16] L.C. Briand, W.L. Melo and J. Wust, Assessing the applicability of fault-proneness models across object-oriented software projects, Software Engineering, IEEE Transactions on 28(7) (2002), 706–720.

[17] R.M. Bell, E.J. Weyuker and T.J. Ostrand, Assessing the impact of using fault prediction in industry, in: Software Testing, Verification and Validation Workshops (ICSTW), 2011 IEEE Fourth International Conference on, IEEE, 2011, pp. 561–565.

[18] Y. Yang, D. Falessi, T. Menzies and J. Hihn, Actionable Analytics for Software Engineering, IEEE Software 35(1) (2018), 51–53.

[19] N.V. Chawla, K.W. Bowyer, L.O. Hall and W.P. Kegelmeyer, SMOTE: Synthetic Minority Over-sampling Technique, Journal of Artificial Intelligence Research 16(5) (2002), 321–357.

[20] M. Jureczko and J. Magott, QualitySpy: a framework for monitoring software development processes, Journal of Theoretical and Applied Computer Science 6(1) (2012), 35–45.

[21] E. Shihab, A. Ihara, Y. Kamei, W.M. Ibrahim, M. Ohira, B. Adams, A.E. Hassan and K.-i. Matsumoto, Studying re-opened bugs in open source software, Empirical Software Engineering 18(5) (2013), 1005–1042.

[22] J. Hryszko and L. Madeyski, Cost Effectiveness of Software Defect Prediction in an Industrial Project, Foundations of Computing and Decision Sciences 43(1) (2018), 7–35.

[23] D. Bowes, T. Hall and D. Gray, DConfusion: a technique to allow cross study performance evaluation of fault prediction studies, Automated Software Engineering 21(2) (2014), 287–313.

[24] L. An, F. Khomh and B. Adams, Supplementary bug fixes vs. re-opened bugs, in: Source Code Analysis and Manipulation (SCAM), 2014 IEEE 14th International Working Conference on, IEEE, 2014, pp. 205–214.

[25] M. Jureczko, Predicting re–opens - practical aspects of defect prediction models, in: Zastosowania metod statystycznych w badaniach naukowych V, J. Jakubowski, ed., StatSoft Polska, Krakow, 2012, pp. 59–66, Chap. 8.

[26] X. Xia, D. Lo, X. Wang, X. Yang, S. Li and J. Sun, A comparative study of supervised learning algorithms for re-opened bug prediction, in: Software Maintenance and Reengineering (CSMR), 2013 17th European Conference on, IEEE, 2013, pp. 331–334.

[27] X. Xia, D. Lo, E. Shihab, X. Wang and B. Zhou, Automatic, high accuracy prediction of reopened bugs, Automated Software Engineering 22(1) (2015), 75–109.

[28] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh and Y.-G. Guéhéneuc, Is it a bug or an enhancement?: a text-based approach to classify change requests, in: Proceedings of the 2008 conference of the center for advanced studies on collaborative research: meeting of minds, ACM, 2008, p. 23.

[29] I. Arora, V. Tetarwal and A. Saha, Open issues in software defect prediction, Procedia Computer Science 46 (2015), 906–912.

[30] T.M. Khoshgoftaar and E.B. Allen, Classification of fault-prone software modules: Prior probabilities, costs, and model evaluation, Empirical Software Engineering 3(3) (1998), 275–298.

[31] T. Khoshgoftaar and E. Allen, Logistic regression modeling of software quality, International Journal of Reliability, Quality and Safety Engineering 6(04) (1999), 303–317.

[32] Y. Jiang, B. Cukic and T. Menzies, Cost curve evaluation of fault prediction models, in: Software Reliability Engineering, 2008. ISSRE 2008. 19th International Symposium on, IEEE, 2008, pp. 197–206.

[33] E. Arisholm, L.C. Briand and E.B. Johannessen, A systematic and comprehensive investigation of methods to build and evaluate fault prediction models, Journal of Systems and Software 83(1) (2010), 2–17.

[34] T. Mende and R. Koschke, Effort-aware defect prediction models, in: Software Maintenance and Reengineering (CSMR), 2010 14th European Conference on, IEEE, 2010, pp. 107–116.

[35] Y. Kamei, S. Matsumoto, A. Monden, K.-i. Matsumoto, B. Adams and A.E. Hassan, Revisiting common bug prediction findings using effort-aware models, in: Software Maintenance (ICSM), 2010 IEEE International Conference on, IEEE, 2010, pp. 1–10.

[36] Y. Kamei, E. Shihab, B. Adams, A.E. Hassan, A. Mockus, A. Sinha and N. Ubayashi, A large-scale empirical study of just-in-time quality assurance, IEEE Transactions on Software Engineering 39(6) (2013), 757–773.

[37] Y. Kamei, T. Fukushima, S. McIntosh, K. Yamashita, N. Ubayashi and A.E. Hassan, Studying just-in-time defect prediction using cross-project models, Empirical Software Engineering 21(5) (2016), 2072–2106.

[38] E. Shihab, Z.M. Jiang, W.M. Ibrahim, B. Adams and A.E. Hassan, Understanding the impact of code and process metrics on post-release defects: a case study on the Eclipse project, in: ESEM '10: Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ACM, New York, NY, USA, 2010, pp. 1–10.

[39] B. Caglayan, A.T. Misirli, A. Miranskyy, B. Turhan and A. Bener, Factors characterizing reopened issues: a case study, in: Proceedings of the 8th International Conference on Predictive Models in Software Engineering, ACM, 2012, pp. 1–10.

[40] T. Zimmermann, N. Nagappan, P.J. Guo and B. Murphy, Characterizing and predicting which bugs get reopened, in: Software Engineering (ICSE), 2012 34th International Conference on, IEEE, 2012, pp. 1074–1083.

[41] K. Herzig, S. Just and A. Zeller, It's not a bug, it's a feature: how misclassification impacts bug prediction, in: Proceedings of the 2013 International Conference on Software Engineering, IEEE Press, 2013, pp. 392–401.

[42] C. Bird, A. Bachmann, E. Aune, J. Duffy, A. Bernstein, V. Filkov and P. Devanbu, Fair and balanced?: bias in bug-fix datasets, in: Proceedings of the 7th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ACM, 2009, pp. 121–130.

[43] T.J. Ostrand, E.J. Weyuker and R.M. Bell, Where the bugs are, in: ISSTA '04: Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, ACM, New York, NY, USA, 2004, pp. 86–96.