DESCRIPTION
Software prediction leveraging repositories has received a tremendous amount of attention within the software engineering community, including PROMISE. In this talk, I will first present great achievements in defect prediction research including new defect prediction features, promising algorithms, and interesting analysis results. However, there are still many challenges in defect prediction. I will talk about them and discuss potential solutions for them leveraging prediction 2.0.
Sung Kim
The Hong Kong University of Science and Technology
Defect, Defect, Defect
Keynote
Program Analysis and Mining (PAM) Group
The First Bug: September 9, 1947
More Bugs
Finding Bugs
Verification
Testing
Prediction
Defect Prediction
[Diagram: Program → Tool → Future defects (e.g., 42, 24, 14)]
Why Prediction?
Defect Prediction Model
D = 4.86 + 0.018 L
F. Akiyama, “An Example of Software System Debugging,” Information Processing, vol. 71, 1971
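As a worked example with made-up module sizes (my numbers, not from the talk), Akiyama's linear size model predicts roughly 23 defects for a 1,000-line module:

# Akiyama's size-based model: D = 4.86 + 0.018 * L, with L in lines of code.
def akiyama_defects(loc):
    return 4.86 + 0.018 * loc

# Hypothetical module sizes, chosen only for illustration.
for loc in (100, 1000, 5000):
    print(f"{loc:>5} LOC -> ~{akiyama_defects(loc):.1f} predicted defects")
# e.g., 1000 LOC -> ~22.9 predicted defects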
Defect Prediction
Identifying New Metrics
Developing New Algorithms
Various Granularities
Complex Files
Ostrand and Weyuker, Basili et al., TSE 1996, Ohlsson and Alberg, TSE 1996, Menzies et al., TSE 2007
a complex file
a simple file
Changes
Bell et al. PROMISE 2011, Moser et al., ICSE 2008, Nagappan et al., ICSE 2006, Hassan et al., ICSM 2005
Lee et al., FSE 2011
View/Edit Patterns
Slide by Mik Kersten. “Mylyn – The task-focused interface” (December 2007, http://live.eclipse.org)
With Mylyn
Tasks are integrated
See only what you are working on
Slide by Mik Kersten. “Mylyn – The task-focused interface” (December 2007, http://live.eclipse.org)
* Eclipse plug-in storing and recovering task contexts
<InteractionEvent … Kind=“ ” … StartDate=“ ” EndDate=“ ” … StructureHandle=“ ” … Interest=“ ” … >
Burst Edits/Views
Lee et al., FSE 2011
Change Entropy
Hassan, “Predicting Faults Using the Complexity of Code Changes,” ICSE 2009
[The number of changes in a period (e.g., a week) per file, F1-F10.
Low Entropy: changes concentrated in one file (11, 1, 1, 1, 1 across F1-F5).
High Entropy: changes spread evenly (3, 3, 3, 3, 3 across F1-F5).]
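A minimal sketch of the entropy computation (my own code, not Hassan's implementation): the skewed change distribution above yields a lower normalized Shannon entropy than the uniform one.

import math

def change_entropy(changes_per_file):
    """Shannon entropy of the change distribution across files,
    normalized to [0, 1] by dividing by log2(number of files)."""
    total = sum(changes_per_file)
    probs = [c / total for c in changes_per_file if c > 0]
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(len(changes_per_file))

print(change_entropy([11, 1, 1, 1, 1]))  # skewed  -> lower entropy (~0.59)
print(change_entropy([3, 3, 3, 3, 3]))   # uniform -> highest entropy (1.0)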
Previous Fixes
Hassan et al., ICSM 2005, Kim et al., ICSE 2007
Network
Zimmermann and Nagappan, “Predicting Defects using Network Analysis on Dependency Graphs,” ICSE 2008
More Metrics
Complexity (Size), CK, McCabe, OO, Halstead, Process metrics, Developer metrics, Count metrics, Change metrics, Entropy of changes (Change Complexity), Churn (source code metrics), # of changes to the file, Previous defects, Network measures, Calling structure attributes, Entropy (source code metrics)
[Bar chart: # of publications (last 7 years), 0-25, per metric]
Defect Prediction
Identifying New Metrics
Developing New Algorithms
Various Granularities
Classification
training instances (metrics + labels) → Learner → prediction (classification) for a new instance (?)
complexity metrics, historical metrics, ...
Regression
training instances (metrics + values) → Learner → prediction (values) for a new instance (?)
complexity metrics, historical metrics, ...
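A minimal sketch of both setups, assuming scikit-learn as the learner (my choice for illustration; the slides do not prescribe one) and made-up metric values: each training instance is a vector of metrics paired with a buggy/clean label for classification or a defect count for regression.

# pip install scikit-learn  (assumed; not mentioned in the talk)
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Hypothetical training instances: [LOC, cyclomatic complexity, past changes]
X_train = [[120, 4, 2], [900, 31, 25], [300, 9, 5], [1500, 44, 40]]
labels  = [0, 1, 0, 1]          # classification: clean = 0, buggy = 1
counts  = [0, 7, 1, 12]         # regression: number of defects

clf = RandomForestClassifier(random_state=0).fit(X_train, labels)
reg = RandomForestRegressor(random_state=0).fit(X_train, counts)

new_instance = [[800, 27, 18]]  # metrics of an unseen module
print("predicted class:", clf.predict(new_instance)[0])   # buggy or clean
print("predicted count:", reg.predict(new_instance)[0])   # expected defects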
Active Learning
Lo et al., “Active Refinement of Clone Anomaly Reports,” ICSE 2012, Lu et al., PROMISE 2012
clones should be inherently similar to each other, and inconsistent changes to the clones themselves or their surrounding code (which are called contexts) may indicate unintentional changes, bad programming styles, and bugs.
The technique in [12] is summarized as follows:
1) It uses a code clone detection tool, DECKARD [5], to detect code clones in programs. The output of this step is a set of clone groups, where each clone group is a set of code pieces that are syntactically similar to each other (a.k.a. clones);
2) Then, it locates the locations of every clone in the source code and generates parse (sub)trees for them;
3) Next, it detects inconsistencies among the parse trees of the clones and their contexts, e.g., whether the clones contain different numbers of unique identifiers, and how the language constructs of the contexts are different. The inconsistencies are then ranked heuristically based on their potential relationship with bugs. Inconsistent clones unlikely to be buggy are also filtered out.
4) Finally, it outputs a list of anomaly reports, each of which indicates the location of a potential bug in the source code, for developers to inspect.
It has been reported that this technique has high false positive rates, even though it can find true bugs of diverse characteristics that are difficult to detect by other techniques. For example, among more than 800 reported bugs for the Linux Kernel, only 41 are true bugs and another 17 are bad programming styles; among more than 400 reported bugs for Eclipse, only 21 are true bugs and 17 are issues with bad programming styles [12].
IV. OVERALL REFINEMENT FRAMEWORK
A typical clone-based anomaly detection system performs a single batch analysis where a static set of anomaly or bug reports (ordered or unordered) are produced. It requires no or little user intervention (e.g., setting some parameters), but may produce many false positives. To alleviate this problem, we propose an active learning approach that can dynamically and continually refine anomaly reports based on incremental user feedbacks; each feedback is immediately incorporated by our approach into the ordering of anomaly reports to move possible true positive reports up in the list while moving likely false positives towards the end of the list.
Our proposed active refinement process supporting user feedbacks is shown in Figure 4. It is composed of five parts corresponding to the boxes in the figure.2 Let us refer to them as Block 1 to 5 (counter-clockwise from left to right). Block 1 represents a typical batch-mode clone-based anomaly detection system. Given a program, the system identifies parts of the program that are different from the norm, where the norm corresponds to the common
2 A square, a trapeze, and a parallelogram represent a process, a manual operation, and data, respectively.
[Figure 4. Active Refinement Process: (1) Anomaly Detection System → (2) Sorted Bug Reports → (3) First Few Bug Reports → (4) User Feedback → (5) Refinement Engine, closing the <<Refinement Loop>> back to (2).]
characteristics in a clone group. Then, the set of anomalies or bugs (i.e., Block 2) is presented for manual user inspection.
We extend such typical clone-based anomaly detection systems by incorporating incremental user feedbacks through the feedback and refinement loop starting at Block 2 followed by Blocks 3, 4, and 5, and back to Block 2. At Blocks 3 and 4, a user is presented with a few bug reports and is asked to provide feedbacks on whether the reports he or she sees are false or true positives. These feedbacks are then fed into our refinement engine (i.e., Block 5) to update the original or intermediate lists of bug reports.
With user feedbacks, the refinement engine analyzes the characteristics of both false positives and true positives labeled by users so far and hypothesizes about other false positives and true positives in the list based on various classification and machine learning techniques. This hypothesis is then used to rearrange the remaining bug reports. It is possible that a true positive, that is originally ranked low, is moved up the list; a false positive, that is originally ranked high, is "downgraded" or pushed down the list.
The active refinement process repeats and users are asked for more feedbacks. With more iterations, more feedbacks are received, and a better hypothesis can be made for the remaining unlabeled reports.
The ultimate goal of our refinement process is to produce a better ordering of bug reports so that true positive reports are listed first ahead of false positives, which we refer to as the bug report ordering problem. With better ordering, true positives can be identified earlier without the need to investigate the entire report list. With fewer false positives earlier in the list, a debugger can be encouraged to continue investigating the rest of the reports and find more bugs in a fixed period of time. If all (or most) of the true positives can appear early, a debugger may stop analyzing the anomaly reports once he or she finds many false positives.
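A rough sketch of that feedback loop (Blocks 2-5) under my own simplifying assumptions: each anomaly report is reduced to a numeric feature vector, the user labels a few reports per round, and a logistic-regression model (my choice; the paper allows any classifier) retrained on the feedback re-ranks the remaining reports by predicted probability of being a true positive.

import numpy as np
from sklearn.linear_model import LogisticRegression

def refine(reports, features, ask_user, batch=5):
    """reports: list of report ids; features: array of shape (n_reports, n_features);
    ask_user(report_id) -> True/False (true positive or not)."""
    features = np.asarray(features, float)
    labeled, labels = [], []
    remaining = list(range(len(reports)))
    while remaining:
        # Blocks 3/4: show the first few reports, collect feedback.
        shown, remaining = remaining[:batch], remaining[batch:]
        for i in shown:
            labeled.append(i)
            labels.append(ask_user(reports[i]))
        if not remaining or len(set(labels)) < 2:
            continue
        # Block 5: hypothesize about the rest and re-rank them.
        model = LogisticRegression().fit(features[labeled], labels)
        p_true = model.predict_proba(features[remaining])[:, 1]
        remaining = [i for _, i in sorted(zip(-p_true, remaining))]
    return [reports[i] for i in labeled], labels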
V. REFINEMENT ENGINE
This section elaborates our refinement engine further. Our refinement engine takes in a list of anomaly reports and refines it by reordering the reports. Each anomaly report is a set of code clones (i.e., a clone group) which contain inconsistencies among the clones. Given a list of anomaly reports, ordered either arbitrarily or with some ad-hoc criteria, and user-provided labels (i.e., true positives or false
Bug Cache
10% of files (most bug-prone) cached out of all files
Load on miss, pre-fetch, replacement
Nearby: co-changes
Kim et al., “Predicting Faults from Cached History,” ICSE 2007
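A toy sketch of the cache idea under my own simplifying assumptions: a fixed-size cache (e.g., 10% of files) holds the entities predicted to be most bug-prone, a file is loaded on a miss, co-changed "nearby" files are pre-fetched, and an LRU policy handles replacement (the real BugCache also pre-fetches new and recently changed entities, which is omitted here).

from collections import OrderedDict

class BugCache:
    """Toy fixed-size cache of bug-prone files with LRU replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()           # file -> None, ordered by recency

    def _put(self, f):
        self.cache[f] = None
        self.cache.move_to_end(f)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used

    def on_bug_fix(self, fixed_file, co_changed_files=()):
        hit = fixed_file in self.cache       # was this fault "predicted"?
        self._put(fixed_file)                # load on miss / refresh on hit
        for f in co_changed_files:           # pre-fetch nearby (co-changed) files
            self._put(f)
        return hit

cache = BugCache(capacity=2)
print(cache.on_bug_fix("Foo.java", co_changed_files=["Bar.java"]))  # False: miss
print(cache.on_bug_fix("Foo.java"))                                 # True: hit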
Algorithms
[Bar chart: # of publications (recent 7 years), 0-25, by algorithm type: Classification, Regression, Both, Etc.; values 4, 4, 18, 21]
Defect Prediction
Identifying New Metrics
Developing New Algorithms
Various Granularities
Module/Binary/Package Level
File Level
Method Level
void foo () {...
}
Hata et al., “Bug Prediction Based on Fine-Grained Module Histories,” ICSE 2012
Change Level
Did I just introduce a bug?
[Development history of a file: Rev 1 → change → Rev 2 → change → Rev 3 → change → Rev 4]
Kim et al., "Classifying Software Changes: Clean or Buggy?" TSE 2009
More Granularities
Project/Release/SubSystem
Component/Module
Package
File
Class
Function/Method
Change/Hunk level
[Bar chart: # of publications (recent 7 years) per granularity: 1, 2, 8, 19, 3, 8, 3, respectively]
Defect Prediction Summary
Identifying New Metrics
Developing New Algorithms
Various Granularities
Performance
[Hall et al., Figure 2 (data used in models) and Figure 3 (the size of the data sets used for Eclipse): systems include Apache, ArgoUML, Eclipse, an embedded system, a healthcare system, Microsoft, and Mozilla.]
Hall et al., "A Systematic Review of Fault Prediction Performance in Software Engineering," TSE 2011 (Figure 2)
Performance
*For example plug-ins, binaries
Figure 6. The granularity of the results
in the on-line appendix shows how independent variables as expressed by individual studies have been categorised in relation to the labels used in Figure 7. It shows that there is variation in performance between models using different independent variables. Models using a wide combination of metrics seem to be performing well. For example, models using a combination of static code metrics (scm), process metrics and source code text seem to be performing best overall (e.g. Shivaji et al. [164]). Similarly Bird et al's study [18] which uses a wide combination of socio-technical metrics (code dependency data together with change data and developer data) also performs well (though the results from Bird et al's study [18] are reported at a high level of granularity). Process metrics (i.e. metrics based on changes logged in repositories) have not performed as well as expected. OO metrics seem to have been used in studies which perform better than studies based only on other static code metrics (e.g. complexity based metrics). Models using only LOC data seem to have performed competitively compared to models using other independent variables. Indeed of these models using only metrics based on static features of the code (OO or SCM), LOC seems as good as any other metric to use. The use of source code text seems related to good performance. Mizuno et al.'s studies [116], [117] have used only source code text within a novel spam filtering approach to relatively good effect.
5.4 Performance in relation to modelling technique
Figure 8 shows model performance in relation to the modelling techniques used. Models based on Naïve Bayes seem to be performing well overall. Naïve Bayes is a well understood technique that is in common use. Similarly models using Logistic Regression also seem to be performing well. Models using Linear Regression perform not so well, though this technique assumes that there is a linear relationship between the variables. Studies using Random Forests have not performed as well as might be expected (many studies using NASA data use Random Forests and report good performances [97]). Figure 8 also shows that SVM (Support Vector Machine) techniques do not seem to be related to models performing well. Furthermore, there is a wide range of low performances using SVMs. This may be because SVMs are difficult to tune and the default Weka settings are not optimal. The performance of models using the C4.5 technique is fairly average. However, Arisholm et al.'s models [8], [9] used the C4.5 technique (as previously explained these are not shown as their relatively poor results skew the data presented). C4.5 is thought to struggle with imbalanced data [16] and [17] and this may explain the performance of Arisholm et al.'s models.
6 SYNTHESIS OF RESULTS
This section answers our research questions by synthe-
Class, File, Module, Binary/plug-in
Hall et al., "A Systematic Review of Fault Prediction Performance in Software Engineering," TSE 2011 (Figure 6)
Defect prediction totally works!
Done? Then why are developers not using it?
Detailed To-Fix List vs. Buggy Modules
This is what developers want!
Defect Prediction 2.0
Finer Granularity
New Customers
Noise Handling
FindBugs
http://findbugs.sourceforge.net/
Performance of Bug Detection Tools
Tools' priority-1 warnings
[Bar chart: precision (%) of warnings, 0-20, for FindBugs, jLint, and PMD]
Kim and Ernst, “Which Warnings Should I Fix First?” FSE 2007
RQ1: How Many False Negatives?
• Defects missed, partially, or fully captured
• Warnings from a tool should also correctly explain in detail why a flagged line may be faulty
• How many one-line defects are captured and explained reasonably well (so-called "strictly captured")?
Very high miss rates!
Thung et al., “To What Extent Could We Detect Field Defects?” ASE 2012
Line Level Defect Prediction
We have seen this bug in revision 100
Bug Fix Memories
Extract patterns in bug fix change history
……
Bug fix changes in revision 1 .. n-1
Memory
Kim et al., “"Memories of bug fixes",” FSE 2006
Extract patterns in bug fix change history
……
Search for patterns in Memory
Bug fix changes in revision 1 .. n-1
Memory
Code to examine
Bug Fix Memories
Kim et al., “"Memories of bug fixes",” FSE 2006
Fix Wizard
Nguyen et al., “Recurring Bug Fixes in Object-Oriented Programs,” ICSE 2010
public void setColspan(int colspan) throws WrongValueException{if (colspan <= 0) throw new WrongValueException(...);if ( colspan != colspan) {
colspan = colspan;
final Execution exec = Executions.getCurrent();
if (exec != null && exec.isExplorer()) invalidate() ;
smartUpdate(”colspan”, Integer.toString( colspan));...
public void setRowspan(int rowspan) throws WrongValueException{if (rowspan <= 0) throw new WrongValueException(...);if ( rowspan != rowspan) {
rowspan = rowspan;
final Execution exec = Executions.getCurrent();
if (exec != null && exec.isExplorer()) invalidate();
smartUpdate(”rowspan”, Integer.toString( rowspan));...
Figure 1: Bug Fixes at v5088-v5089 in ZK
[Figure 2: Graph-based Object Usages for Figure 1. Usage graphs for methods colSpan and rowSpan, each with nodes IF, WrongValueException.<init>, Executions.getCurrent, Execution.isExplorer, Auxheader.invalidate, and Auxheader.smartUpdate; the boxed sub-graph marks the usage in the changed code.]
velopers tend to copy-and-paste the implementation code, thus creating similar code fragments. Therefore, we hypothesize that code peers, i.e. classes/methods having similar functions/interactions, tend to have similar implementation code, similar naming schemes, inherit from the same class, or implement the same interface (H2).
2.2 Manual Analysis of Recurring Fixes
We conducted a manual analysis of recurring bug fixes in a two-phase experiment. First, a group of experienced programmers examined all fixing changes of the subject systems and manually identified the similar ones. Then, we analyzed their reports to characterize such recurring fixes and their enclosing code units in order to verify the main hypothesis H1: similar fixes tend to occur on code units having similar roles, i.e. providing similar functions and/or participating in similar interactions, in terms of object usages.
We represented object usages in such code units by a graph-based object usage model, a technique in our previous work GrouMiner [21]. In general, each usage scenario is modeled as a labeled, directed, acyclic graph, called a groum, in which nodes represent method invocations/field accesses of objects, as well as control structures (e.g. if, while) and edges represent the usage orders and data dependencies among them.
Table 1 shows subject systems used in our study. Two of them were also used by Kim et al. [13] in previous research on bug fixes. The fixes are considered at the method level, i.e. all fixing changes to a method at a revision of a system are considered as an atomic fix. Seven Ph.D. students in Software Engineering at Iowa State University with the average of 5-year experience in Java manually examined all those fixes and identified the groups of recurring bug fixes
public class UMLOperationsListModel extendsUMLModelElementCachedListModel{
public void add( int index){Object target=getTarget();if (target instanceof MClassifier) {MClassifier classifier=(MClassifier)target;Collection oldFeatures=classifier.getFeatures();MOperation newOp=MMUtil.SINGLETON.buildOperation(classifier);classifier.setFeatures(addElement(oldFeatures,index,newOp,
operations.isEmpty()?null: operations.get(index)));
public class UMLAttributesListModel extendsUMLModelElementCachedListModel{
public void add( int index){Object target=getTarget();if (target instanceof MClassifier) {MClassifier classifier=(MClassifier)target;Collection oldFeatures=classifier.getFeatures();MAttribute newAt=MMUtil.SINGLETON.buildAttribute(classifier);classifier.setFeatures(addElement(oldFeatures,index,newAt,
attributes.isEmpty()?null: attributes.get(index)));
Figure 3: Bug Fixes at v0459-v0460 in ArgoUML
[Figure 4: Graph-based Object Usages for Figure 3. Usage graphs for UMLOperationsListModel.addElement and UMLAttributesListModel.addElement, each with nodes IF, MClassifier.getFeatures, MMUtil.buildOperation (or MMUtil.buildAttribute), MClassifier.setFeatures, addElement, List.isEmpty, and List.get.]
(RBFs). Conflicting identifications were resolved by the majority vote among them. There were only 2 disputed groups.
Table 2 shows the collective reports. Columns RBF and Percentage show the total numbers and the percentage of recurring bug fixes in all fixing ones. We can see that RBFs are between 17-45% of all fixing changes. This is consistent with the previous report [13]. While many RBFs (85%-97%) occur at the same revisions on different code units (column In Space), fewer RBFs occur in different revisions (column In Time). Analyzing such recurring fixes, we found that most of them (95%-98%) involve object usages (e.g. method calls and field accesses). This is understandable because the study is focused on object-oriented programs.
2.3 Representative Examples
Example 1. Figure 1 shows two recurring fixes taken from the ZK system with added code shown in boxes. Two methods setColspan and setRowspan are very similar in structure and function, thus, are considered as cloned code. When their functions need to be changed, they are changed in the same way. Figure 2 shows the object usage models of those two methods with the changed parts shown in the boxes. The nodes such as Executions.getCurrent and Auxheader.smartUpdate represent the invocations of the corresponding methods. An edge such as the one from Executions.getCurrent to Execution.isExplorer shows the usage order, i.e. the former is called before the latter. As we could see, both methods are implemented with the same object usage. Then, they are also modified in the same way as shown in the boxes.
Word Level Defect Prediction
Fix suggestion...
Defect Prediction 2.0
Finer Granularity
New Customers
Noise Handling
[Diagram: Bug Database and Source Repository. All bugs B ⊇ fixed bugs Bf ⊇ linked fixed bugs Bfl; all commits C ⊇ bug fixes Cf ⊇ linked fixes Cfl. Bfl and Cfl are linked via log messages; many bug fixes are related, but not linked: Noise!]
Bird et al., “Fair and Balanced? Bias in Bug-Fix Datasets,” FSE 2009
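A minimal sketch of the log-message linking that produces Bfl/Cfl (the regular expression is illustrative, not the exact heuristic from any study): scan each commit message for bug identifiers and keep links that point at known fixed bugs. Fixes whose messages never mention a bug id stay unlinked, which is exactly the bias the slide calls noise.

import re

# Illustrative patterns: an explicit bug/issue id or a fix keyword plus a number.
BUG_ID = re.compile(r'\b(?:bug|issue|fix(?:ed|es)?)\s*(?:for)?\s*#?(\d+)', re.I)

def link_commits(commits, fixed_bug_ids):
    """commits: list of (commit_id, log_message); returns {commit_id: bug_id}
    for commits whose log message mentions a known fixed bug."""
    links = {}
    for cid, msg in commits:
        for m in BUG_ID.finditer(msg):
            bug = int(m.group(1))
            if bug in fixed_bug_ids:
                links[cid] = bug
    return links

print(link_commits([("c1", "fix for bug 28434"), ("c2", "refactor parser")],
                   {28434}))   # -> {'c1': 28434}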
How resistant is a defect prediction model to noise?
[Chart: buggy F-measure (0-1) vs. training-set false negative (FN) & false positive (FP) rate (0-0.6) for SWT, Debug, Columba, Eclipse, and Scarab, with the 20% noise level marked.]
Kim et al., “Dealing with Noise in Defect Prediction,” ICSE 2011
Closest List Noise Identification
Aj will be returned as the identified noise set. The empirical study found that the algorithm performs best when its two thresholds are set to 3 and 0.99.
CLNI Algorithm: for each iteration j
  for each instance Insti
    for each instance Instk
      if Instk is already in Aj, continue;
      else add EuclideanDistance(Insti, Instk) to Listi;
    calculate, as the noise score of Insti, the fraction of the top 10 instances in Listi whose label differs from that of Insti;
    if the score reaches the threshold, Aj = Aj ∪ {Insti};
  if N(Aj ∩ Aj-1)/N(Aj) and N(Aj ∩ Aj-1)/N(Aj-1) both reach the stopping threshold, break;
return Aj
Figure 9. The pseudo-code of the CLNI algorithm
Figure 10. An illustration of the CLNI algorithm
The high level idea of CLNI can be illustrated as in Figure 10. The blue points represent clean instances and the white points represent buggy instances. When checking if an instance A is noisy, CLNI first lists all instances that are close to A (the points included in the circle). CLNI then calculates the ratio of instances in the list that have a class label different from that of A (the number of orange points over the total number of points in the circle). If the ratio is over a specific threshold, we consider instance A to have a high probability of being a noisy instance.
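A sketch of that loop in my own code (k and theta are illustrative values, not the tuned thresholds reported in the paper): each instance's closest list is built from Euclidean distances, instances already flagged as noise are skipped, and an instance is flagged when most of its nearest neighbours carry the opposite label.

import numpy as np

def clni(X, y, k=10, theta=0.6, max_iter=10):
    """X: (n, d) metric vectors; y: 0/1 labels. Returns indices of instances
    flagged as likely label noise."""
    X, y = np.asarray(X, float), np.asarray(y)
    n = len(y)
    # Pairwise Euclidean distances; an instance is never its own neighbour.
    d_all = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d_all, np.inf)
    noise = set()
    for _ in range(max_iter):
        new_noise = set()
        for i in range(n):
            d = d_all[i].copy()
            if noise:
                d[list(noise)] = np.inf   # skip instances already flagged as noise
            nearest = np.argsort(d)[:k]   # the "closest list" of instance i
            ratio = np.mean(y[nearest] != y[i])
            if ratio >= theta:            # mostly opposite-class neighbours
                new_noise.add(i)
        if new_noise == noise:            # the noise set has stabilized
            break
        noise = new_noise
    return sorted(noise)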
6.2 Evaluation
We evaluate our noise detection method using data from the Eclipse 3.4 SWT and Debug projects as described in Section 5.2. These two datasets are considered as the "golden sets" as most of their bugs are linked bugs. Following the method described in Section 4.2, we create the noisy datasets for these two projects by selecting random n% of instances and artificially changing their labels (from buggy to clean and from clean to buggy). We then apply the CLNI algorithm to detect noisy instances that we have just injected. We use Precision, Recall and F-measures to evaluate the performance in identifying the noisy instances.
Table 3 shows the results when the noise rate is 20%. The Precisions are above 60%, Recalls are above 83% and F-measures are above 0.71. These promising results confirm that the proposed CLNI algorithm is capable of identifying noisy instances.
Table 3. The performance of CLNI in identifying noisy instances
Precision Recall F-measure
Debug 0.681 0.871 0.764
SWT 0.624 0.830 0.712
Figure 11 also shows the performance of CLNI under different noise levels for the SWT component. When the noise rate is below 25%, F-measures increase with the increase of the noise rates. When the noise rate is above 35%, CLNI will have bias toward incorrect instances, causing F-measures to decrease.
Figure 11. Performance of CLNI with different noise rates
After identifying the noises in the noisy Eclipse 3.4 SWT and Debug datasets using CLNI, we eliminate these noises by flipping their class labels. We then evaluate if the noise-removed training set improves prediction accuracy.
The results for the SWT component before and after removing FN and FP noises are shown in Table 4. In general, after removing the noises, the prediction performance improves for all learners, especially for those that do not have strong noise resistance ability. For example, for the SVM learner, when 30% FN&FP noises were injected into the SWT dataset the F-measure was 0.339. After identifying and removing the noises, the F-measure jumped to 0.706. These results confirm that the proposed CLNI algorithm can improve defect prediction performance for noisy datasets.
Table 4. The defect prediction performance after identifying and removing noisy instances (SWT)

Remove Noises?  Noise Rate  Bayes Net  Naïve Bayes  SVM    Bagging
No              15%         0.781      0.305        0.594  0.841
No              30%         0.777      0.308        0.339  0.781
No              45%         0.249      0.374        0.353  0.350
Yes             15%         0.793      0.429        0.797  0.838
Yes             30%         0.802      0.364        0.706  0.803
Yes             45%         0.762      0.418        0.235  0.505
Kim et al., “Dealing with Noise in Defect Prediction,” ICSE 2011
Noise detection performance
Precision Recall F-measure
Debug 0.681 0.871 0.764
SWT 0.624 0.830 0.712
(noise level = 20%)
Kim et al., “Dealing with Noise in Defect Prediction,” ICSE 2011
Bug prediction using cleaned data
[Chart: SWT F-measure (%) vs. noise level (0%, 15%, 30%, 45%), comparing noisy and cleaned training data; with cleaned data, a 76% F-measure is reached even at 45% noise.]
ReLink
Links
Bug database
Unknown links
Features
Source code
repository
Traditional heuristics (link miner)
Recovering links using
feature
Links
Links
Combine
Wu et al., “ReLink: Recovering Links between Bugs and Changes,” FSE 2011
ReLink Performance
Wu et al., “ReLink: Recovering Links between Bugs and Changes,” FSE 2011
[Bar chart: F-measure (0-100) for the projects ZXing, OpenIntents, and Apache, comparing Traditional heuristics and ReLink]
Label Historical Changes
[Development history of a file: Rev 1 → ... → Rev 100 → change → Rev 101 (with BUG) → change → Rev 102 (no BUG). The change message at Rev 102, “fix for bug 28434”, marks the bug introduced in Rev 101 as fixed.]
Fischer et al., “Populating a Release History Database from Version Control and Bug Tracking Systems,” ICSM 2003
Atomic Change
...insertTab()......
Rev 102 (no BUG)
...setText(“\t”)......
Rev 101 (with BUG)
fixed
Change message: “fix for bug 28434”
Fischer et al., “Populating a Release History Database from Version Control and Bug Tracking Systems,” ICSM 2003
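A much-simplified sketch of this labelling step (not the full SZZ algorithm; the fix-message pattern and the "previous change to the same file" rule are my own simplifications): a revision whose message looks like a bug fix marks the previous change to the same file as buggy.

import re

FIX_RE = re.compile(r'\bfix(?:ed|es)?\b.*\bbug\b\s*#?\d+', re.I)

def label_changes(history):
    """history: list of (rev, file, message) in chronological order.
    Returns (rev, file) -> "buggy" / "clean"."""
    labels = {}
    last_change = {}                  # file -> revision of its latest change
    for rev, f, msg in history:
        labels[(rev, f)] = "clean"
        if FIX_RE.search(msg) and f in last_change:
            labels[(last_change[f], f)] = "buggy"   # the change being fixed
        last_change[f] = rev
    return labels

hist = [(101, "Tab.java", "add tab handling"),
        (102, "Tab.java", "fix for bug 28434")]
print(label_changes(hist))   # {(101, 'Tab.java'): 'buggy', (102, 'Tab.java'): 'clean'}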
Composite Change
does meet such expectation. In particular, it potentially offers the following benefits:
• Decomposition helps developers distinguish different concerns within a composite change.
Figure 5 shows the diff for JFreeChart revision 1083, which touches one source code file and modifies six lines of code. According to its commit log, this revision fixes bug 1864222 by modifying the method createCopy to handle an empty range. Our approach decomposed this revision into three change-slices. One change-slice contains only line 944, which is the exact fix intended for this change. The second change-slice, line 677, is a minor fix irrelevant to the intended fix at line 944. The third change-slice contains lines 973, 974, 978 and 979, which are trivial formatting improvements. This accurate decomposition could help developers effectively distinguish between sub-changes with different concerns.
• Decomposition draws the developer's attention to an inconspicuous change, which might actually be an important one.
When reviewing a change with tens of changed lines or more, a developer might quickly skim through the change to get a general idea instead of staying focused on every single changed line. Now, if one or two lines in this change in fact address different issues, the developer is likely to overlook them and thus miss these additional intentions. We find that when decomposition is applied, such inconspicuous changes can be easily isolated as one independent change-slice. For example, Xerces revisions 730796, 819653 and 944964 were decomposed into 2, 10 and 2 change-slices, respectively. Figure 6 shows one change-slice from each of them that contains only one line of code change. Although these one-line changes are likely minor fixes or just for perfective purposes, the developer might still want to be aware of them during her change review.
Sometimes an inconspicuous change is not necessarily a minor one. Figure 7 shows one of the three change-slices for JFreeChart revision 1366, where a field is set to true instead of false. Figure 7 also shows one of the two change-slices for JFreeChart revision 1801. The one-line change here fixes a subtle bug in an incorrect if-condition where a parenthesis for the "||" operation is missing. Although both fixes are small, they could have significant impact on the program behavior. Like in the second case, a missing parenthesis messes up the operation precedence and flips the if-condition value, which could cause the program to end up with wrong results or even crash. However, due to its small size, such a fix might slip through the developer's eyes while she is busy reviewing other noticeable changed blocks. Our approach isolates such small but critical fixes from the composite change, and presents them to the developer as an individual change-slice. Hence, the developer can be fully aware of such important changes.
• Decomposition reveals questionable changes.
Our approach decomposed Commons Math revision 943068 into three change-slices. While the first two change-slices faithfully did what the commit log mentioned to do, the third change-slice caught our attention (Figure 8). In this change-slice, the parameter in the method setForce(boolean) is renamed from force to forceOverwrite. Then this method is directly called by a newly added method setOverwrite(boolean). The intention here is probably adding a setter for the field forceOverwrite, but the change seems to fail on this purpose: the renamed parameter does not affect the assignment at (*). We suspected that force at the right hand side of the assignment should also be changed to forceOverwrite.
We then searched the subsequent revisions for more evidence of our speculation. As expected, the author of this change did fix this bug later in revision 943070, along with a commit log saying "wrong assignment after I renamed the parameter. Unfortunately there doesn't seem to be a testcase that
[Figure 5. JFreeChart revision 1083: side-by-side diff of TimeSeries.addOrUpdate (line 677), createCopy (lines 944-946), and equals (lines 973-981).]
[Figure 6. Examples of inconspicuous and minor changes: one change-slice each from Xerces revisions 730796, 819653, and 944964.]
[Figure 7. Examples of inconspicuous yet important changes: one change-slice each from JFreeChart revisions 1366 and 1801.]
[Figure 8. The third change-slice from Commons Math revision 943068 reveals a questionable change: the parameter of setForce(boolean) is renamed from force to forceOverwrite while the assignment this.forceOverwrite = force is left unchanged (*), and a new setOverwrite(boolean forceOverwrite) simply calls setForce(forceOverwrite).]
JFree revision 1083
Tao et al, “"How Do Software Engineers Understand Code Changes?” FSE 2012
hunk 1
hunk 2
hunk 3
hunk 4
Defect Prediction 2.0
Finer Granularity
New Customers
Noise Handling
Warning Developers
“Safe” Files(Predicted as not buggy)
“Risky” Files(Predicted as buggy)
Change Classification
[Development history of a file: Rev 1 → change → Rev 2 → change → Rev 3 → change → Rev 4. Each change is classified into “Safe” files (predicted as not buggy) or “Risky” files (predicted as buggy).]
Kim et al., "Classifying Software Changes: Clean or Buggy?" TSE 2009
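A minimal sketch of change classification with bag-of-words features, assuming scikit-learn; the change texts and labels below are invented purely so the snippet runs and are not taken from any study (the real approach combines terms from added/deleted code, the log message, and change metadata).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labelled changes; each string stands in for a change's features.
changes = ["fix npe null check getColspan", "add javadoc comment",
           "rename variable cleanup", "fix off by one loop index"]
labels  = [1, 0, 0, 1]          # 1 = buggy (bug-introducing), 0 = clean

vec = CountVectorizer()
X = vec.fit_transform(changes)            # bag-of-words per change
model = MultinomialNB().fit(X, labels)

new_change = ["null pointer check in setRowspan"]
print(model.predict(vec.transform(new_change))[0])   # 1 -> flagged as risky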
Defect prediction based Change Classification
Debug UI
JDT
JEdit
PDE
POI
Team UI
[Bar chart: F-measure (0-0.80) per project, comparing CC and Cached CC]
Warning Developers
“Safe” Location(Predicted as not buggy)
“Risky” Location(Predicted as buggy)
Test-case Selection
Executing test cases
Runeson and Ljung, “Improving Regression Testing Transparency and Efficiency with History-Based Prioritization,” ICST 2011
[Chart: APFD (0-1.00) per release (R1.0-R1.5) for Baseline, History1, and History2]
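A small sketch of history-based prioritization under my own assumptions (the exponential weighting is one simple choice, not the exact scheme evaluated in the paper): test cases that failed recently are run first.

def prioritize(tests, history, decay=0.5):
    """Rank test cases by an exponentially decayed count of past failures.
    history: {test: [True/False per past release, oldest first]}."""
    def score(t):
        fails = history.get(t, [])
        return sum(decay ** age for age, failed
                   in enumerate(reversed(fails)) if failed)
    return sorted(tests, key=score, reverse=True)

hist = {"t1": [False, False, True],   # failed in the latest release
        "t2": [True, False, False],   # failed long ago
        "t3": [False, False, False]}  # never failed
print(prioritize(["t1", "t2", "t3"], hist))   # ['t1', 't2', 't3']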
Warning Prioritization
[Chart: precision (%) (0-18) vs. warning instances by priority (0-100), comparing History-based and Tool priority]
Kim and Ernst, “Which Warnings Should I Fix First?” FSE 2007
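A small sketch of the history-based idea (a simplification of the FSE 2007 approach, with invented warning categories): categories whose past warnings were more often eliminated by bug fixes are ranked first.

from collections import Counter

def prioritize_categories(past_warnings):
    """past_warnings: list of (category, was_fixed). Rank categories by the
    fraction of their historical warnings that were removed by bug fixes."""
    seen, fixed = Counter(), Counter()
    for cat, was_fixed in past_warnings:
        seen[cat] += 1
        fixed[cat] += was_fixed
    return sorted(seen, key=lambda c: fixed[c] / seen[c], reverse=True)

history = [("NULL_DEREF", True), ("NULL_DEREF", True), ("NULL_DEREF", False),
           ("STYLE", False), ("STYLE", False), ("UNUSED", True)]
print(prioritize_categories(history))   # ['UNUSED', 'NULL_DEREF', 'STYLE']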
Other Topics
• Explanation: why has it been predicted as defect-prone?
• Cross-project prediction
• Cost effectiveness measures
• Active Learning/Refinement
Defect Prediction 2.0
1.0: New metrics, Algorithms, Coarse granularity
2.0: Finer granularity, New customers, Noise Handling
2013
MSR 2013: Back to roots
Massimiliano Di Penta and Sung Kim, Program co-chairs
Tom Zimmermann, General chair
Alberto Bacchelli, Mining Challenge Chair
February 15
Some slides/data are borrowed with thanks from
• Tom Zimmermann, Chris Bird
• Andreas Zeller
• Ahmed Hassan
• David Lo
• Jaechang Nam, Yida Tao
• Tien Nguyen
• Steve Counsell, David Bowes, Tracy Hall and David Gray
• Wen Zhang