Evaluating the presence and impact of bias in bug-fix datasets


DESCRIPTION

Empirical Software Engineering relies on reusable datasets to make it easier to replicate empirical studies and therefore build theories on top of those empirical results. An area where these reusable datasets are particularly useful is defect prediction. In this area, the goal is to predict which entities will be more error prone, so that managers can take preventive actions to improve the quality of the delivered system. These reusable datasets contain information about source code files and their history, bug reports, and the bugs fixed in each of the files. However, some of the most used datasets in the Empirical Software Engineering community have been shown to be biased: many links between files and fixed bugs are missing. Research work has already shown that this bias may affect the performance of defect prediction models. In this talk we will show how to use statistical techniques to evaluate the bias in datasets, and to estimate its impact on defect prediction.


Page 1

Evaluating the presence and impact of bias in bug-fix datasets

Israel Herraiz, UPM
http://mat.caminos.upm.es/~iht

Talk at University of California, Davis
April 11, 2012

This presentation is available at http://www.slideshare.net/herraiz/evaluating-the-presence-and-impact-of-bias-in-bugfix-datasets

Page 2

Outline

1. Who am I and what do I do

2. The problem

3. Preliminary results

4. The road ahead

5. Take away and discussion

Page 3

1. Who am I and what do I do

Page 4

About me

• PhD in Computer Science from Universidad Rey Juan Carlos (Madrid)
  • “A statistical examination of the evolution and properties of libre software”
  • http://herraiz.org/phd.html
• Assistant Professor at the Technical University of Madrid
  • http://mat.caminos.upm.es/~iht
• Visiting UC Davis from April to July, hosted by Prof. Devanbu
  • Kindly funded by a MECD “José Castillejo” grant (JC2011-0093)

Page 5

What do I do?

Page 6

2. The problem

Page 7

Replication in Empirical Software Engineering

Empirical Software Engineering studies are hard to replicate.

Verification and replication are crucial features of an empirical research discipline.

Reusable datasets lower the barrier to replication.

Page 8

Reusable datasets

FLOSSMole

Page 9

The case of the Eclipse dataset

http://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/

Defect data for all packages in releases 2.0, 2.1 and 3.0

Size and complexity metrics for all files

Page 10

Bug-fix datasets

• The Eclipse data is a bug-fix dataset
• To cross-correlate bugs with files, classes or packages, the data is extracted from
  • Bug tracking systems (fixed bug reports)
  • Version control systems (commits)
  • Heuristics to detect relationships between bug-fix reports and commits (see the sketch below)
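
The linking heuristic is typically a regular expression scan of commit messages for bug report IDs. A minimal Python sketch of such a heuristic, assuming bug IDs appear in messages as "bug 4210", "fixes #4210" or a bare "#4210"; the pattern, names and data below are hypothetical, not the Eclipse dataset's actual tooling:

    import re

    # Hypothetical SZZ-style linking pattern: bug IDs are assumed to show
    # up in commit messages as "bug 4210", "fixes #4210" or "#4210".
    BUG_ID = re.compile(r"(?:bugs?\s*#?|fix(?:es|ed)?\s*#?|#)(\d+)", re.IGNORECASE)

    def link_commits_to_bugs(commits, fixed_bug_ids):
        """Map bug id -> commit hashes whose message mentions a FIXED bug."""
        links = {}
        for commit_hash, message in commits:
            for match in BUG_ID.finditer(message):
                bug_id = int(match.group(1))
                if bug_id in fixed_bug_ids:
                    links.setdefault(bug_id, []).append(commit_hash)
        return links

    commits = [("a1b2c3", "Fix bug 4210: NPE in the parser"),
               ("d4e5f6", "Refactor build scripts")]
    print(link_commits_to_bugs(commits, {4210}))  # {4210: ['a1b2c3']}

Any fix whose commit message never mentions the bug ID escapes such patterns, which is exactly how the missing links (the bias) arise. Tools like ReLink additionally exploit features such as the time interval between the bug report and the change, and the textual similarity between them.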

Page 11

A study using the Eclipse dataset

Page 12

The distribution of software faults

• The distribution of software faults (over packages) is a Weibull distribution
• This study can be easily replicated thanks to the reusable Eclipse bug-fix dataset (a replication sketch follows)
• If the same data is obtained for other case studies, it can also be easily verified and extended
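
Replicating the distributional claim takes only a few lines of SciPy; a hedged sketch with made-up per-package bug counts, not the study's data or code:

    import numpy as np
    from scipy import stats

    # Made-up bug counts per package (only packages with at least one bug).
    bug_counts = np.array([1, 1, 2, 2, 3, 5, 8, 13, 21, 34], dtype=float)

    # Fit a two-parameter Weibull (location pinned at 0, counts are positive).
    shape, loc, scale = stats.weibull_min.fit(bug_counts, floc=0)

    # Kolmogorov-Smirnov goodness of fit against the fitted distribution;
    # a large p-value means the Weibull hypothesis is not rejected.
    ks_stat, p_value = stats.kstest(bug_counts, "weibull_min",
                                    args=(shape, loc, scale))
    print(f"shape={shape:.2f} scale={scale:.2f} p={p_value:.3f}")

Strictly, the p-value is optimistic because the parameters are estimated from the same sample, but it suffices as a first replication check.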

Page 13

But…

Page 14

What’s the difference between the two conflicting studies?

• According to the authors, there are methodological differences
  • Zhang uses Alberg diagrams
  • Concas et al. use CCDF plots to fit different distributions, and reason about the generative process as a model for software maintenance
• What I suspect is the crucial difference
  • Zhang reused the Eclipse bug-fix dataset
  • Concas et al. gathered the data themselves
  • So the bias in the two datasets will be different

Page 15

What’s wrong with the Eclipse bug-fix dataset?

Page 16

Bug feature bias

There are other kinds of bias (commit feature bias), but in the case of the two Eclipse papers, the distribution is over package features, not bug or commit features.

RQ1: Does this kind of bias hold for package / class / file features?

RQ2: What is the impact on defect prediction?

Page 17

Impact on prediction

Page 18

Impact on prediction

J48 tree to classify files as defective or not
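
J48 is Weka's implementation of the C4.5 decision tree. A rough Python stand-in using scikit-learn's CART tree with entropy splitting (an approximation of C4.5, not the talk's actual Weka pipeline; the file metrics and labels are invented):

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # One row per file: [lines of code, cyclomatic complexity, past changes]
    X = [[120, 4, 2], [800, 25, 14], [45, 1, 0], [300, 12, 7]]
    y = [0, 1, 0, 1]  # 1 = defective, 0 = not defective

    clf = DecisionTreeClassifier(criterion="entropy")  # info gain, as in C4.5
    print(cross_val_score(clf, X, y, cv=2, scoring="f1"))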

Page 19

Conclusions so far

• Developers only mark a subset of the bug-fix pairs, and so heuristics-based recovery methods only find a subset of the overall bug-fix pairs
• The bias appears as a difference in the distributions of bug and commit features
• The conflict between the two studies about the distribution of bugs in Eclipse is likely due to differences in the distributions caused by bias
• The bias has a great impact on the accuracy of prediction models

Page 20

3. Preliminary results

Page 21

The distribution of bugs over files

• Number of bugs per file for the case of ZXing

Page 22

The distribution of bugs over files

• Number of bugs per file for the case of Eclipse

Page 23

The distribution of bugs over files

• Comparison between the ReLink and the biased bug-fix sets (results of the χ² test, p-values; sketched below)
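
A minimal sketch of that comparison, assuming files are binned by their number of linked bug fixes; the counts below are made up:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Files with 0, 1, 2 and 3+ linked bug fixes in each dataset (made up).
    biased = np.array([900, 60, 25, 15])
    relink = np.array([850, 90, 35, 25])

    chi2, p_value, dof, expected = chi2_contingency(np.vstack([biased, relink]))
    print(f"chi2={chi2:.2f} p={p_value:.4f}")  # large p: no significant difference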

Page 24

The distribution of bugs over files

• Comparison between the ReLink and the biased bug-fix sets (results of the χ² test, p-values)

RQ1: Does this kind of bias hold for package / class / file features?

Not supported by these examples

Page 25

Time over!

• So is there no difference between the biased and non-biased datasets?
• Then how come the ReLink paper (and others) report improved accuracy when using the non-biased datasets?
• What could explain these differences?

Page 26

Impact on prediction accuracy

• What is the prediction accuracy using different (biased and non-biased) datasets?
• Three datasets
  • Biased dataset recovered using heuristics
  • “Golden” dataset manually recovered
    • By Sung Kim et al., not me!
  • Non-biased dataset obtained using the ReLink tool
• J48 tree classifier, 10-fold cross-validation
• Test datasets always extracted from the golden dataset

Page 27

F-measure values

• Procedure (see the sketch below)
  • Extract 100 subsamples of the same size for both datasets
  • Calculate the F-measure using 10-fold cross-validation
    • The test set is always extracted from the “golden” set
  • Repeat for several subsample sizes
• Only results for the case of OpenIntents so far
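
A hedged sketch of the procedure, simplified to a random golden-set holdout in place of the full 10-fold cross-validation, with a scikit-learn decision tree standing in for Weka's J48; all variables are placeholders (NumPy arrays of file metrics and defect labels):

    import numpy as np
    from sklearn.metrics import f1_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def f_measures(X_pool, y_pool, X_gold, y_gold, size, n_subsamples=100):
        """F1 scores for `n_subsamples` random subsamples of one training
        pool (biased, golden or ReLink), always tested on golden-set files."""
        scores = []
        for _ in range(n_subsamples):
            train = rng.choice(len(X_pool), size=size, replace=False)
            test = rng.choice(len(X_gold), size=len(X_gold) // 2, replace=False)
            clf = DecisionTreeClassifier().fit(X_pool[train], y_pool[train])
            scores.append(f1_score(y_gold[test], clf.predict(X_gold[test])))
        return scores  # one boxplot per (dataset, subsample size) pair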

Page 28

Page 29

RQ2: Impact on prediction

Not clear whether there is any impact

Page 30

RQ2: Impact on prediction

Not clear whether there is any impact

Little warning!

The size is not exactly the same for the three cases in each boxplot; the biased dataset is always the smallest of the three. I have to repeat this using exactly the same size for the three datasets.

Page 31

Preliminary conclusions

• The biased dataset does not provide the worst accuracy when predicting fault proneness for a set of (supposedly) unbiased bug fixes and files
  • Contrary to what is reported in previous work
• What is the cause of the reported differences in accuracy?
  • By definition, the so-called biased dataset will always be smaller
  • Dataset size does have an impact on the F-measure

Page 32

4. The road ahead

Page 33

My workplan at UC Davis

• Discuss the ideas shown here
  • Is bias really a problem for defect prediction?
• Extend the study to more cases
  • Do you have a dataset of files, bugs, commits, metrics? Please let me know!
• Improve the study
  • What happens if we break down the data into more coherent subgroups?
  • Do the results change at different levels of granularity?

Page 34

5. Take away and conclusions

Page 35

Systematic difference in bug fixes collected by heuristics

No observable difference in the statistical properties of the so-called biased dataset

Impact on prediction accuracy not clear

Ecological inference: what happens at other scales? With other subgroups?