
Page 1: Datamingse

Data Mining for Software Engineering

Tao Xie
North Carolina State University
www.csc.ncsu.edu/faculty/xie
[email protected]

Jian Pei
Simon Fraser University
www.cs.sfu.ca/~jpei
[email protected]

An up-to-date version of this tutorial is available at http://ase.csc.ncsu.edu/dmse/dmse.pdf

Attendees of the tutorial are kindly suggested to download the latest version 2-3 days before the tutorial.

Page 2: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 2

Outline

• Introduction
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Case studies
• Conclusions

Page 3: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 3

Introduction

• A large amount of data is produced in software development
  – Data from software repositories
  – Data from program executions
• Data mining techniques can be used to analyze software engineering data
  – Understand software artifacts or processes
  – Assist software engineering tasks

Page 4: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 4

Examples

• Data in software development
  – Programming: versions of programs
  – Testing: execution traces
  – Deployment: error/bug reports
  – Reuse: open source packages
• Software development needs data analysis
  – How should I use this class?
  – Where are the bugs?
  – How to implement a typical functionality?

Page 5: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 5

Overview

software engineering data: code bases, change history, program states, structural entities, bug reports

software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance

data mining techniques: classification, association/patterns, clustering, …

Page 6: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 6

Software Engineering Tasks

• Programming
• Static defect detection
• Testing
• Debugging
• Maintenance

Page 7: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 7

Software Categorization – Why?

• SourceForge hosts 70,000+ software systems
  – How can one find the software needed?
  – How can developers collaborate effectively?
• Why software categorization?
  – SourceForge categorizes software according to their primary function (editors, databases, etc.)
• Software foundries – related software
  – Keep developers informed about related software
    • Learn the “best practice”
    • Promote software reuse

[Kawaguchi et al. 04]

Page 8: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 8

Software Categorization – What?

• Organize software systems into categories
  – Software systems in each category share somewhat the same theme
  – A software system may belong to one or multiple categories
• What are the categories?
  – Defined by domain experts manually
  – Discovered automatically
• Example system: MUDABlue [Kawaguchi et al. 04]

Page 9: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 9

Version Comparison and Search

• What does the current code segment look like in previous versions?
  – How has it been changed over versions?
• Using standard search tools, e.g., grep?
  – Source code may not be well documented
  – The code may be changed
• Can we have some source-code-friendly search engines?
  – E.g., www.koders.com, corp.krugle.com, demo.spars.info

Page 10: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 10

Software Library Reuse

• Issues in reusing software libraries
  – Which components should I use?
  – What is the right way to use them?
  – Multiple components may often be used in combination, e.g., Smalltalk’s Model/View/Controller
• Frequent patterns help
  – Specifically, inheritance information is important
  – Example: most application classes inheriting from library class Widget tend to override its member function paint(); most application classes instantiating library class Painter and calling its member function begin() also call end()

[Michail 99/00]

Page 11: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 11

API Usage

• How should an API be used correctly?
  – An API may serve multiple functionalities
  – Different styles of API usage
• “I know what type of object I need, but I don’t know how to write the code to get the object” [Mandelin et al. 05]
  – Can we synthesize jungloid code fragments automatically?
  – Given a simple query describing the desired code in terms of input and output types, return a code segment
• “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie & Pei 06]

Page 12: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 12

How Can Data Mining Help?

• Identify characteristic usage of the library automatically

• Understand the reuse of library classes from real-life applications instead of toy programs

• Keep reuse patterns up to date w.r.t. the most recent version of the library and applications

• General patterns may cover inheritance cases

Page 13: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 13

Software Engineering Tasks

• Programming
• Static defect detection
• Testing
• Debugging
• Maintenance

Page 14: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 14

Locating Matching Method Calls

• Many bugs are due to unmatched method calls
  – E.g., failing to call free() to deallocate a data structure
  – One-line code changes: many bugs can be fixed by changing only one line in source code
• Problem: how to find highly correlated pairs of method calls
  – E.g., <fopen, fclose>, <malloc, free>

[Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06]

Page 15: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 15

Inferring Errors from Source Code

• A system must follow some correctness rules
  – Unfortunately, the rules are documented or specified in an ad hoc manner
• Deriving the rules requires a lot of a priori knowledge
• Can we detect some errors without knowing the rules, by data mining? [Engler et al. 01]

Page 16: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 16

Inference in Large Systems

• Execution traces → inferred properties → static checker
• Inference algorithms need to be scalable with the size of the programs and the input traces
• Only imperfect traces are available in industrial environments; how can those imperfect traces be used?
• Many inferred properties may be uninteresting; it is hard for a developer to review those properties thoroughly for large programs

[Yang et al. 06]

Page 17: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 17

Detecting Copy-Paste and Bugs

• Copy-pasted code is common in large systems
  – Code reuse
• Prone to bugs
  – E.g., identifiers are not changed consistently
• How to detect copy-pasted code?
  – How to scale up to large software?
  – How to handle minor modifications?

[Li et al. 04]

Page 18: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 18

Software Engineering Tasks

• Programming
• Static defect detection
• Testing
• Debugging
• Maintenance

Page 19: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 19

Inspecting Test Behavior

• Automatically generated tests or field executions lack test oracles
  – Sample/summarize behavior for inspection
• Examples:
  – Select tests (executions/outputs) for inspection
    • E.g., clustering path/branch profiles [Podgurski et al. 01, Bowring et al. 04]
  – Summarize object behavior [Xie&Notkin 04, Dallmeier et al. 06]

Page 20: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 20

Mining Object Behavior

• Can we find the undocumented behavior of classes? It may not be observable from the program source code directly.

Behavior model for the Java Vector class. Picture from “Mining object behavior with ADABU” [Dallmeier et al. WODA 06]

Page 21: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 21

Mining Specifications

• Specifications are very useful for testing
  – test generation + test oracles
• Major obstacle: protocol specifications are often unavailable
  – Example: what is the right way to use the socket API?
• How can data mining help?
  – If a protocol holds in well-tested programs (i.e., their executions), the protocol is likely valid

[Ammons et al. 02]

Page 22: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 22

Specification Helps

Does the code shown on the slide follow the correct socket API protocol? [Ammons et al. 02]

Page 23: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 23

Software Engineering Tasks

• Programming
• Static defect detection
• Testing
• Debugging
• Maintenance

Page 24: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 24

Fault Localization

• Running tests produces execution traces
  – Some tests fail and the other tests pass
• Given many execution traces generated by tests, can we suggest likely faulty statements? [Liblit et al. 03/05, Liu et al. 05]
  – Some traces may lead to program failures
  – It would be better if we can even suggest the likelihood of a statement being faulty
• For large programs, how can we collect traces effectively?
• What if there are multiple faults?

Page 25: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 25

Analyzing Bug Repositories

• Most open source software development projects have bug repositories
  – Report and track problems and potential enhancements
  – Valuable information for both developers and users
• Bug repositories are often messy
  – Duplicate error reports; related errors
• Challenge: how to analyze effectively?
  – Who is reporting, and at what rate?
  – How are reports resolved, and by whom?
• Automatic bug report assignment & duplicate detection [Anvik et al. 06]

Page 26: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 26

Stabilizing Buggy Applications

• Users may report bugs in a program; can those bug reports be used to prevent the program from crashing?
  – When a user attempts an action that led to some errors before, a warning should be issued
• Given a program state S and an event e, predict whether e likely results in a bug
  – Positive samples: past bugs
  – Negative samples: “not a bug” reports

[Michail&Xie 05]

Page 27: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 27

Software Engineering Tasks

• Programming
• Static defect detection
• Testing
• Debugging
• Maintenance

Page 28: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 28

Guiding Software Changes

• Programmers start changing some locations
  – Suggest locations that other programmers have changed together with this location
  – E.g., “Programmers who changed this function also changed …”
• Mine association rules from change histories
  – Coarse-granular entities: directories, modules, files
  – Fine-granular entities: methods, variables, sections

[Zimmermann et al. 04, Ying et al. 04]

Page 29: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 29

Aspect Mining

• Discover crosscutting concerns that can potentially be refactored into one place (an aspect in aspect-oriented programs)
  – E.g., logging, timing, communication
• Mine recurring execution patterns
  – Event traces [Breu&Krinke 04, Tonella&Ceccato 04]
  – Source code [Shepherd et al. 05]

Page 30: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 30

Software Engineering Data

• Static code bases
• Software change history
• Profiled program states
• Profiled structural entities
• Bug reports

Page 31: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 31

Code Entities

• Identifiers within a system [Kawaguchi et al. 04]
  – E.g., variable names, function names
• Statement sequences within a basic block [Li et al. 04]
  – E.g., variables, operators, constants, functions, keywords
• Element sets within a function [Li&Zhou 05]
  – E.g., functions, variables, data types
• Call sites within a function [Xie&Pei 05]
• API signatures [Mandelin et al. 05]

[Mandelin et al. 05] http://snobol.cs.berkeley.edu/prospector/index.jsp

Page 32: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 32

Relationships btw Code Entities

• Membership relationships
  – A class contains member functions
• Reuse relationships
  – Class inheritance
  – Class instantiation
  – Function invocations
  – Function overriding

[Michail 99/00] http://codeweb.sourceforge.net/ for C++

Page 33: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 33

Software Engineering Data

• Static code bases
• Software change history
• Profiled program states
• Profiled structural entities
• Bug reports

Page 34: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 34

Concurrent Versions System (CVS) Comments

[Chen et al. 01] http://cvssearch.sourceforge.net/

Page 35: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 35

CVS Comments

• cvs log – displays all revisions and their comments for each file:

  RCS file: /repository/file.h,v
  Working file: file.h
  head: 1.5
  ...
  description:
  ----------------------------
  Revision 1.5
  Date: ...
  cvs comment ...
  ----------------------------
  ...

• cvs diff – shows differences between different versions of a file:

  ...
  RCS file: /repository/file.h,v
  ...
  9c9,10
  < old line
  ---
  > new line
  > another new line

[Chen et al. 01] http://cvssearch.sourceforge.net/

Page 36: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 36

Code Version Histories

• CVS provides file versioning
  – Group individual per-file changes into individual transactions (atomic change sets): checked in by the same author, with the same check-in comment, close in time
• CVS manages only files and line numbers
  – Associate syntactic entities with line ranges
• Filter out long transactions not corresponding to meaningful atomic changes
  – E.g., feature requests, bug fixes, branch merging

[Ying et al. 04][Zimmermann et al. 04] http://www.st.cs.uni-sb.de/softevo/erose/
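The grouping heuristic above can be sketched as follows. The change records and the 300-second window are invented for illustration; this is a simplification of what the cited tools actually do.

```python
def group_transactions(changes, window=300):
    """Group per-file CVS changes into atomic change sets: same author,
    same check-in comment, timestamps within `window` seconds of the
    previous change in the group."""
    txns = []
    for change in sorted(changes,
                         key=lambda c: (c["author"], c["comment"], c["time"])):
        last = txns[-1] if txns else None
        if (last
                and last["author"] == change["author"]
                and last["comment"] == change["comment"]
                and change["time"] - last["end"] <= window):
            last["files"].append(change["file"])
            last["end"] = change["time"]
        else:
            txns.append({"author": change["author"],
                         "comment": change["comment"],
                         "end": change["time"],
                         "files": [change["file"]]})
    return txns

# Hypothetical per-file check-ins (times in seconds)
changes = [
    {"file": "a.c", "author": "kim", "comment": "fix #42", "time": 0},
    {"file": "a.h", "author": "kim", "comment": "fix #42", "time": 60},
    {"file": "b.c", "author": "kim", "comment": "fix #42", "time": 4000},
    {"file": "c.c", "author": "lee", "comment": "docs", "time": 30},
]
txns = group_transactions(changes)
```

Here a.c and a.h fall into one transaction, while the much later b.c check-in and lee's unrelated change each form their own.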

Page 37: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 37

Software Engineering Data

• Static code bases
• Software change history
• Profiled program states
• Profiled structural entities
• Bug reports

Page 38: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 38

Method-Entry/Exit States

• State of an object– Values of transitively reachable fields

• Method-entry state– Receiver-object state, method argument values

• Method-exit state– Receiver-object state, updated method

argument values, method return value[Ernst et al. 02] http://pag.csail.mit.edu/daikon/

[Dallmeier et al. 06] http://www.st.cs.uni-sb.de/models/[Henkel&Diwan 03][Xie&Notkin 04/05]

Page 39: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 39

Other Profiled Program States

• Values of variables at certain code locations [Hangal&Lam 02]
  – Object/static field reads/writes
  – Method-call arguments
  – Method returns
• Sampled predicates on values of variables [Liblit et al. 03/05]

[Hangal&Lam 02] http://diduce.sourceforge.net/
[Liblit et al. 03/05] http://www.cs.wisc.edu/cbi/

Page 40: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 40

Software Engineering Data

• Static code bases
• Software change history
• Profiled program states
• Profiled structural entities
• Bug reports

Page 41: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 41

Executed Structural Entities

• Executed branches/paths, def-use pairs
• Executed function/method calls
  – Group methods invoked on the same object
• Profiling options
  – Execution hit vs. count
  – Execution order (sequences)

[Dallmeier et al. 05] http://www.st.cs.uni-sb.de/ample/
More related tools: http://www.csc.ncsu.edu/faculty/xie/research.htm#related

Page 42: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 42

Software Engineering Data

• Static code bases
• Software change history
• Profiled program states
• Profiled structural entities
• Bug reports

Page 43: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 43

Processing Bug Reports

A user files a bug report into the bug repository. A triager then processes each report: it is either assigned to a developer or resolved as Duplicate, Works For Me, Invalid, or Won’t Fix.

Adapted from Anvik et al.’s slides

Page 44: Datamingse

Sample Bugzilla Bug Report

T. Xie and J. Pei: Data Mining for Software Engineering 44

• Bug report image (slide figure) overlaid with the triage questions: Duplicate? Reproducible? Assigned to whom?

Bugzilla: open source bug tracking tool http://www.bugzilla.org/
[Anvik et al. 06] http://www.cs.ubc.ca/labs/spl/projects/bugTriage.html
Adapted from Anvik et al.’s slides

Page 45: Datamingse

Eclipse Bug Data

T. Xie and J. Pei: Data Mining for Software Engineering 45

[Schröter et al. 06] http://www.st.cs.uni-sb.de/softevo/bug-data/eclipse/

• Defect counts are listed at the plug-in, package, and compilation-unit levels.
• The value field contains the actual number of pre-release ("pre") and post-release ("post") defects.
• The average ("avg") and maximum ("max") values refer to the defects found in the compilation units ("compilationunits").

Page 46: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 46

Data Mining Techniques in SE

• Association rules and frequent patterns
• Classification
• Clustering
• Misc.

Page 47: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 47

Frequent Itemsets

• Itemset: a set of items
  – E.g., acm = {a, c, m}
• Support of itemsets
  – Sup(acm) = 3
• Given min_sup = 3, acm is a frequent pattern
• Frequent pattern mining: find all frequent patterns in a database

Transaction database TDB

  TID   Items bought
  100   f, a, c, d, g, i, m, p
  200   a, b, c, f, l, m, o
  300   b, f, h, j, o
  400   b, c, k, s, p
  500   a, f, c, e, l, p, m, n
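A minimal sketch of support counting over the TDB above (naive enumeration for illustration only; real miners such as Apriori or FP-growth avoid enumerating all candidate itemsets):

```python
from itertools import combinations

# Transaction database TDB from the slide
tdb = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def support(itemset):
    """Number of transactions containing every item in the itemset."""
    s = set(itemset)
    return sum(1 for items in tdb.values() if s <= items)

def frequent_itemsets(min_sup, max_size=3):
    """Naive enumeration of all frequent itemsets up to max_size items."""
    all_items = sorted(set().union(*tdb.values()))
    result = {}
    for k in range(1, max_size + 1):
        for combo in combinations(all_items, k):
            sup = support(combo)
            if sup >= min_sup:
                result[combo] = sup
    return result
```

With min_sup = 3 this reproduces the slide's example: acm appears in transactions 100, 200, and 500, so Sup(acm) = 3 and acm is frequent.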

Page 48: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 48

Association Rules

• (Time ∈ {Fri, Sat}) ∧ buys(X, diaper) ⇒ buys(X, beer)
  – Dads taking care of babies on weekends drink beer
• Itemsets should be frequent
  – So the rule can be applied extensively
• Rules should be confident
  – With strong prediction capability

Page 49: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 49

A Road Map

• Boolean vs. quantitative associations
  – buys(x, “SQLServer”) ∧ buys(x, “DMBook”) ⇒ buys(x, “DM Software”) [0.2%, 60%]
  – age(x, “30..39”) ∧ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
• Single-dimensional vs. multi-dimensional associations
• Single-level vs. multiple-level analysis
  – What brands of beers are associated with what brands of diapers?

Page 50: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 50

Frequent Pattern Mining Methods

• Apriori and its variations/improvements
• Mining frequent patterns without candidate generation
• Mining max-patterns and closed itemsets
• Mining multi-dimensional, multi-level frequent patterns with flexible support constraints
• Interestingness: correlation and causality

Page 51: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 51

A Simple Case

• Finding highly correlated method-call pairs
• Confidence of pairs helps
  – Conf(<a,b>) = support(<a,b>) / support(<a,a>)
• Check the revisions (fixes to bugs) and find the pairs of method calls whose confidences are improved dramatically by frequently added fixes
  – Those are the matching method-call pairs that may often be violated by programmers

[Livshits&Zimmermann 05]
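The confidence computation can be sketched as follows. The per-revision call sets are invented for illustration and do not come from the cited work; each "transaction" is the set of calls touched in one revision.

```python
# Hypothetical per-revision method-call data (illustration only)
revisions = [
    {"fopen", "fclose", "malloc", "free"},
    {"fopen", "fclose"},
    {"fopen"},            # fclose missing: a candidate for a bug fix
    {"malloc", "free"},
]

def support(calls, revs):
    """Number of revisions containing all the given calls."""
    return sum(1 for r in revs if calls <= r)

def confidence(a, b, revs):
    """Conf(<a,b>) = support({a,b}) / support({a})."""
    sup_a = support({a}, revs)
    return support({a, b}, revs) / sup_a if sup_a else 0.0

conf_open_close = confidence("fopen", "fclose", revisions)
```

Here fopen appears in three revisions but fclose accompanies it in only two, so Conf(<fopen, fclose>) = 2/3; a fix adding the missing fclose would push the confidence up, which is the signal the approach looks for.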

Page 52: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 52

Conflicting Patterns

• 999 out of 1000 times, spin_unlock follows spin_lock
  – The single time that spin_unlock does not follow may likely be an error
• We can detect an error without knowing the correctness rule

[Li&Zhou 05, Livshits&Zimmermann 05, Yang et al. 06]

Page 53: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 53

Frequent Library Reuse Patterns

• Items: classes, member functions, reuse relationships (e.g., inheritance, overriding, instantiation)
• Transactions: for every application class A, the set of all items that are involved in a reuse relationship with A
• Pruning
  – Uninteresting rules, e.g., a rule that holds for every class
  – Misleading rules, e.g., xy ⇒ z (conf: 60%) is pruned if y ⇒ z (conf: 80%)
  – Statistically insignificant rules: prune rules with a high p-value
• Constrained rules
  – Rules involving a particular class
  – Rules that are violated in a particular application

[Michail 99/00]

Page 54: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 54

MAPO: Mining Frequent API Patterns

[Xie&Pei 06]

Page 55: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 55

Sequential Pattern Mining in MAPO

• Use BIDE [Wang&Han 04] to mine closed sequential patterns from the preprocessed method-call sequences
• Postprocessing in MAPO
  – Remove frequent sequences that do not contain the entities interesting to the user
  – Compress consecutive calls of the same method into one
  – Remove duplicate frequent sequences after the compression
  – Remove frequent sequences that are subsequences of some other frequent sequences

[Xie&Pei 06]
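The postprocessing steps can be sketched as follows. This is a simplified illustration with invented call names, not MAPO's actual implementation.

```python
def compress(seq):
    """Collapse consecutive calls of the same method into one."""
    out = []
    for call in seq:
        if not out or out[-1] != call:
            out.append(call)
    return out

def is_subsequence(small, big):
    """True if `small` appears in `big` in order (gaps allowed)."""
    it = iter(big)
    return all(c in it for c in small)

def postprocess(freq_seqs, interesting):
    """Keep sequences mentioning the entity of interest, compress
    repeats, drop duplicates and subsequences of other survivors."""
    seqs = [compress(s) for s in freq_seqs if interesting in s]
    uniq = []
    for s in seqs:              # drop duplicates, preserving order
        if s not in uniq:
            uniq.append(s)
    return [s for s in uniq
            if not any(s != t and is_subsequence(s, t) for t in uniq)]
```

For example, mining results [["a", "a", "b", "c"], ["a", "b"], ["x", "y"], ["a", "b", "c"]] with "b" as the interesting entity reduce to the single pattern ["a", "b", "c"].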

Page 56: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 56

Detecting Copy-Paste Code

• Apply closed sequential pattern mining techniques
• Customizing the techniques
  – A copy-pasted segment typically does not have big gaps: use a maximum gap threshold to control this
  – Output the instances of patterns (i.e., the copy-pasted code segments) instead of the patterns
  – Use small copy-pasted segments to form larger ones
  – Prune false positives: tiny segments, unmappable segments, overlapping segments, and segments with large gaps

[Li et al. 04]

Page 57: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 57

Find Bugs in Copy-Pasted Segments

• For two copy-pasted segments, are the modifications consistent?
  – Identifier a in segment S1 is changed to b in segment S2 three times but remains unchanged once: likely a bug
  – The heuristic may not be right all the time
• The lower the unchanged rate of an identifier, the more likely there is a bug

[Li et al. 04]
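The unchanged-rate heuristic can be sketched as follows; the identifier mappings are hypothetical.

```python
def unchanged_rate(mappings):
    """Given observed identifier mappings between two copy-pasted
    segments as (source, destination) pairs, return the fraction of
    occurrences that were left unchanged."""
    unchanged = sum(1 for src, dst in mappings if src == dst)
    return unchanged / len(mappings)

# Identifier a renamed to b three times, left unchanged once:
# a low unchanged rate, so the lone unchanged occurrence is suspect.
rate = unchanged_rate([("a", "b"), ("a", "b"), ("a", "b"), ("a", "a")])
```

A rate near 0 or 1 suggests the developer had a consistent intent; values in between flag the minority occurrences for review.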

Page 58: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 58

Approximate Patterns for Inferences

• Use an alternating template to find interesting properties
  – Example: template (PS)*; an instance: loc.acq loc.rel
• Handling imperfect traces
  – Instead of requiring perfect matches, check the ratio of matching
  – Explore the contexts of matches

[Yang et al. 06]
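The ratio-of-matching idea can be sketched as follows. This is a loose approximation of the published matching, not its actual algorithm; the event names follow the slide's loc.acq/loc.rel instance and the trace is invented.

```python
def alternation_ratio(trace, p, s):
    """Fraction of p/s events in the trace that fit the alternating
    template (PS)*, scanning left to right."""
    relevant = [e for e in trace if e in (p, s)]
    if not relevant:
        return 0.0
    matched = 0
    expect = p                      # the template starts with P
    for e in relevant:
        if e == expect:
            matched += 1
            expect = s if expect == p else p
    return matched / len(relevant)

# One stray release at the end breaks perfect alternation
ratio = alternation_ratio(
    ["loc.acq", "x", "loc.rel", "loc.acq", "loc.rel", "loc.rel"],
    "loc.acq", "loc.rel")
```

A perfect trace scores 1.0; here four of the five relevant events fit the template, so the ratio is 0.8, which an imperfect-trace miner could still accept above some threshold.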

Page 59: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 59

Context Handling

Figure from “Perracotta: mining temporal API rules from imperfect traces”, in [Yang et al. ICSE’06]

Page 60: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 60

Cross-Checking of Execution Traces

• Mine association rules or sequential patterns S ⇒ F, where S is a statement and F is the status of program failure
• The higher the confidence, the more likely S is faulty or related to a fault
• Using only one statement on the left side of the rule can be misleading, since a fault may be caused by a combination of statements
  – Frequent patterns can be used to improve this

[Denmat et al. 05]

Page 61: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 61

Emerging Patterns of Traces

• A method executed only in failing runs is likely to point to the defect
  – Comparing the coverage of passing and failing program runs helps
• Mine patterns frequent in failing program runs but infrequent in passing program runs
  – Sequential patterns may be used

[Dallmeier et al. 05, Denmat et al. 05, Yang et al. 06]
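The simplest form of this coverage comparison can be sketched as follows; the per-run method sets are invented for illustration.

```python
def suspicious_methods(failing_runs, passing_runs):
    """Methods executed in some failing run but in no passing run:
    the crudest emerging-pattern signal for defect localization."""
    failed = set().union(*failing_runs) if failing_runs else set()
    passed = set().union(*passing_runs) if passing_runs else set()
    return failed - passed

# Hypothetical coverage: each run is the set of methods it executed
suspects = suspicious_methods(
    failing_runs=[{"parse", "emit", "panic"}, {"parse", "panic"}],
    passing_runs=[{"parse", "emit"}])
```

Only "panic" survives the subtraction, so it would be reported first; mining frequent sequential patterns instead of plain sets refines the same idea.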

Page 62: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 62

Learning Object Behavior

• Extracting models
  – A static analysis identifies all side-effect-free methods in the program
  – Some side-effect-free methods are selected as inspectors
  – The program is executed, and the inspectors are called to extract information about an object’s state: a vector of inspector values
• Merge the models of all objects in a program

[Dallmeier et al. 06]

Page 63: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 63

Data Mining Techniques in SE

• Association rules and frequent patterns
• Classification
• Clustering
• Misc.

Page 64: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 64

Classification: A 2-step Process

• Model construction: describe a set of predetermined classes
  – Training dataset: tuples for model construction
    • Each tuple/sample belongs to a predefined class
  – Models take the form of classification rules, decision trees, or math formulae
• Model application: classify unseen objects
  – Estimate the accuracy of the model using an independent test set
  – If the accuracy is acceptable, apply the model to classify tuples with unknown class labels

Page 65: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 65

Model Construction

Training Data → Classification Algorithms → Classifier (Model)

Example classifier learned from the training data:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Training data:

  Name  Rank        Years  Tenured
  Mike  Ass. Prof   3      No
  Mary  Ass. Prof   7      Yes
  Bill  Prof        2      Yes
  Jim   Asso. Prof  7      Yes
  Dave  Ass. Prof   6      No
  Anne  Asso. Prof  3      No
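The learned rule can be applied programmatically; treating both 'Prof' and 'Professor' as matching the rule's 'professor' is an interpretation of the slide's abbreviated ranks.

```python
def tenured(rank, years):
    """The rule learned on the slide: a professor, or anyone with
    more than six years, is predicted tenured."""
    return rank in ("Prof", "Professor") or years > 6

# The slide's training data as (name, rank, years, tenured) tuples
training = [
    ("Mike", "Ass. Prof", 3, False), ("Mary", "Ass. Prof", 7, True),
    ("Bill", "Prof", 2, True), ("Jim", "Asso. Prof", 7, True),
    ("Dave", "Ass. Prof", 6, False), ("Anne", "Asso. Prof", 3, False),
]
accuracy = sum(tenured(r, y) == t for _, r, y, t in training) / len(training)
```

The rule fits every training tuple, and applying it to the later unseen example (Jeff, Professor, 4) predicts tenured = yes.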

Page 66: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 66

Model Application

Testing data is used to estimate the classifier’s accuracy; the classifier is then applied to unseen data, e.g., (Jeff, Professor, 4) → Tenured?

Testing data:

  Name     Rank        Years  Tenured
  Tom      Ass. Prof   2      No
  Merlisa  Asso. Prof  7      No
  George   Prof        5      Yes
  Joseph   Ass. Prof   7      Yes

Page 67: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 67

Supervised vs. Unsupervised Learning

• Supervised learning (classification)
  – Supervision: objects in the training data set have labels
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of training data are unknown
  – Given a set of measurements, observations, etc., aim to establish the existence of classes or clusters in the data

Page 68: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 68

GUI-Application Stabilizer

• Given a program state S and an event e, predict whether e likely results in a bug
  – Positive samples: past bugs
  – Negative samples: “not a bug” reports
• A k-NN based approach
  – Consider the k closest cases reported before
  – Compare Σ 1/d for bug cases and not-bug cases, where d is the distance between the current state and a reported state
  – If the current state is more similar to bugs, predict a bug

[Michail&Xie 05]
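The k-NN scoring can be sketched as follows. Here d is treated as Euclidean distance over hypothetical two-dimensional state vectors, which is one plausible reading of the slide; the case base is invented.

```python
import math

def predict_bug(current, cases, k=3):
    """k-NN sketch of the stabilizer idea: cases are (state_vector,
    is_bug) pairs; compare summed inverse distances of the k nearest
    bug vs. not-bug cases."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(cases, key=lambda c: dist(current, c[0]))[:k]
    bug_score = sum(1 / (dist(current, s) + 1e-9) for s, b in nearest if b)
    ok_score = sum(1 / (dist(current, s) + 1e-9) for s, b in nearest if not b)
    return bug_score > ok_score

# Hypothetical reported cases: two past bugs, two "not a bug" reports
cases = [((0, 0), True), ((0, 1), True), ((5, 5), False), ((6, 5), False)]
```

A state near the past bug states scores much higher on the bug side, so a warning would be issued before the event is carried out.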

Page 69: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 69

Data Mining Techniques in SE

• Association rules and frequent patterns
• Classification
• Clustering
• Misc.

Page 70: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 70

What Is Clustering?

• Group data into clusters
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
  – Unsupervised learning: no predefined classes

(Slide figure: a scatter plot showing Cluster 1, Cluster 2, and a few outliers.)

Page 71: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 71

Categories of Clustering Approaches (1)

• Partitioning algorithms
  – Partition the objects into k clusters
  – Iteratively reallocate objects to improve the clustering
• Hierarchical algorithms
  – Agglomerative: each object starts as its own cluster; merge clusters to form larger ones
  – Divisive: all objects start in one cluster; split it into smaller clusters

Page 72: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 72

Categories of Clustering Approaches (2)

• Density-based methods
  – Based on connectivity and density functions
  – Filter out noise, find clusters of arbitrary shape
• Grid-based methods
  – Quantize the object space into a grid structure
• Model-based methods
  – Use a model to find the best fit of the data

Page 73: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 73

K-Means: Example

(Slide figure: a sequence of scatter plots on a 10×10 grid illustrating k-means with K=2. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects and update the means again until the clusters stabilize.)
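The loop in the figure can be sketched as plain k-means; the six sample points are invented.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 2-D points: pick k initial centers, assign
    each point to its nearest center, recompute the means, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign p to the center with the smallest squared distance
            i = min(range(k), key=lambda j: (p[0] - centers[j][0]) ** 2
                                            + (p[1] - centers[j][1]) ** 2)
            clusters[i].append(p)
        # update each center to the mean of its cluster
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two obvious groups of points
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, 2)
```

Regardless of which two points are picked initially, the centers converge to the means of the two groups, (4/3, 4/3) and (25/3, 25/3).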

Page 74: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 74

Clustering and Categorization

• Software categorization
  – Partitioning software systems into categories
• Categories predefined: a classification problem
• Categories discovered automatically: a clustering problem

Page 75: Datamingse

Software Categorization - MUDABlue

T. Xie and J. Pei: Data Mining for Software Engineering 75

• Understanding source code
  – Use latent semantic analysis (LSA) to find similarity between software systems
  – Use identifiers (e.g., variable names, function names) as features
    • “gtk_window” represents some window
    • The source code near “gtk_window” contains some GUI operation on the window
• Extracting categories using frequent identifiers
  – “gtk_window”, “gtk_main”, and “gpointer” ⇒ a GTK-related software system
  – Use LSA to find relationships between identifiers

[Kawaguchi et al. 04]

Page 76: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 76

Overview of MUDABlue

• Extract identifiers
• Create an identifier-by-software matrix
• Remove useless identifiers
• Apply LSA and retrieve categories
• Make software clusters from identifier clusters
• Title the software clusters

[Kawaguchi et al. 04]

Page 77: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 77

Data Mining Techniques in SE

• Association rules and frequent patterns
• Classification
• Clustering
• Misc.

Page 78: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 78

Searching Source Code/Comments

• CVSSearch: searching using CVS comments

• Comments are often more stable than code segments
  – Describe a segment of code
  – May hold for many future versions
• Compare differences of successive versions
  – For two versions, associate a comment with the corresponding changes
  – Propagate changes over versions

[Chen et al. 01]

Page 79: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 79

Jungloid Mining

• Given a query describing the input and output types, synthesize code fragments automatically
• Prospector: uses API method signatures and jungloids mined from a corpus of sample client programs
• Elementary jungloids
  – Field access
  – Static method or constructor invocation
  – Instance method invocation
  – Widening reference conversion
  – Downcast (narrowing reference conversion)

[Mandelin et al. 05]

Page 80: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 80

Finding Jungloids

Example query: parsing a Java source code file in an IFile object using the Eclipse IDE framework

• Use signatures of elementary jungloids and APIs to form a signature graph
• Represent a solution as a path in the graph matching the constraints
• Rank the paths by their lengths: short paths are preferred
• Learn downcasts from sample programs

[Mandelin et al. 05]

Page 81: Datamingse

T. Xie and J. Pei: Data Mining for Software Engineering 81

Sampling Programs

• During the execution of a program, each execution of a statement has some probability of being sampled
  – Sampling large programs becomes feasible
  – Many traces can be collected
• Bug isolation by analyzing samples
  – Correlation between some specific statements or function calls and program errors/crashes

[Liblit et al. 03/05]
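The per-statement sampling semantics can be sketched as a Bernoulli draw at each instrumented site; the site names and rate below are made up, and the real infrastructure uses a more efficient countdown scheme rather than a draw per statement:

```python
import random

def run_sampled(events, rate, rng):
    """Record each executed instrumentation site with probability `rate`.
    This direct per-event draw is just the semantics; a production
    sampler amortizes the cost so the unsampled fast path stays cheap."""
    sample = []
    for site in events:
        if rng.random() < rate:
            sample.append(site)
    return sample

rng = random.Random(42)
trace = ["p1", "p2", "p3"] * 1000   # hypothetical predicate sites
sample = run_sampled(trace, 0.01, rng)   # ~1% of 3000 events survive
```

Aggregating such sparse samples over many runs is what makes correlating sites with crashes statistically feasible without full tracing.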

Page 82

Outline

• Introduction
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Case studies
• Conclusions

Page 83

Case Studies

• MAPO: mining API usages from open source repositories [Xie&Pei 06]
– code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining software revision histories [Livshits&Zimmermann 05]
– change history → association rules → defect detection
• BugTriage: who should fix this bug? [Anvik et al. 06]
– bug reports → classification → debugging

Page 84

Motivation

• APIs in class libraries or frameworks are widely reused in software development.

• An example programming task: “instrument the bytecode of a Java class by adding an extra method to the class”
– org.apache.bcel.generic.ClassGen
  public void addMethod(Method m)

Page 85

First Try: ClassGen Java API Doc

addMethod

public void addMethod(Method m)
Add a method to this class.
Parameters: m - method to add

Page 86

Second Try: Code Search Engine

Page 87

MAPO Approach

• Analyze code segments returned from code search engines and disclose the inherent usage patterns
– Input: an API characterized by a method, class, or package; code bases: open source repositories or proprietary source repositories
– Output: a short list of frequent API usage patterns related to the API

Page 88

Sample Tool Output
InstructionList.<init>()
InstructionFactory.createLoad(Type, int)
InstructionList.append(Instruction)
InstructionFactory.createReturn(Type)
InstructionList.append(Instruction)
MethodGen.setMaxStack()
MethodGen.setMaxLocals()
MethodGen.getMethod()
ClassGen.addMethod(Method)
InstructionList.dispose()

• Mined from 36 Java source files, 1,087 method sequences

Page 89

Tool Architecture

Page 90

Results
A tool that integrates various components
• Relevant code extractor
– download returns from code search engine (koders.com)
• Code analyzer
– implemented a lightweight tool for Java programs
• Sequence preprocessor
– employed various heuristics
• Frequent sequence miner
– reused BIDE [Wang&Han ICDE 2004]
• Frequent sequence postprocessor
– employed various heuristics
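The mining step finds call subsequences shared by many client methods. A brute-force counter below illustrates what is being computed; BIDE itself mines closed patterns far more efficiently, and the sequences here are toy data with hypothetical API names:

```python
from itertools import combinations

def frequent_subsequences(seqs, min_sup, max_len=3):
    """Count every order-preserving (not necessarily contiguous)
    subsequence up to max_len and keep those meeting min_sup.
    Exponential; real miners like BIDE prune the search space."""
    counts = {}
    for seq in seqs:
        subs = set()  # count each subsequence once per sequence
        for n in range(1, max_len + 1):
            for idx in combinations(range(len(seq)), n):
                subs.add(tuple(seq[i] for i in idx))
        for s in subs:
            counts[s] = counts.get(s, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_sup}

# Hypothetical API call sequences from three client methods
seqs = [
    ["init", "append", "dispose"],
    ["init", "append", "append", "dispose"],
    ["init", "log", "dispose"],
]
patterns = frequent_subsequences(seqs, min_sup=3)
```

With min_sup=3, only subsequences occurring in all three client methods survive, e.g. ("init", "dispose").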

Page 91

Case Studies

• MAPO: mining API usages from open source repositories [Xie&Pei 06]
– code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining software revision histories [Livshits&Zimmermann 05]
– change history → association rules → defect detection
• BugTriage: who should fix this bug? [Anvik et al. 06]
– bug reports → classification → debugging

Page 92

Co-Change Pattern

• Things that are frequently changed together often form a pattern (a.k.a. co-change)

• E.g., co-added method calls

Adapted from Livshits et al.’s slides

public void createPartControl(Composite parent) {
  ...
  // add listener for editor page activation
  getSite().getPage().addPartListener(partListener);    // co-added
}

public void dispose() {
  ...
  getSite().getPage().removePartListener(partListener); // co-added
}

Page 93

DynaMine

[Figure: DynaMine workflow. Revision history mining: mine CVS histories, then rank and filter the mined patterns. Dynamic analysis: instrument relevant method calls, run the application, and post-process the results into usage patterns, error patterns, and unlikely patterns. Reporting: report patterns and bugs.]

Adapted from Livshits et al.’s slides

Page 94

Mining Patterns

[Figure: DynaMine workflow. Revision history mining: mine CVS histories, then rank and filter the mined patterns. Dynamic analysis: instrument relevant method calls, run the application, and post-process the results into usage patterns, error patterns, and unlikely patterns. Reporting: report patterns and bugs.]

Adapted from Livshits et al.’s slides

Page 95

Mining Method Calls
Foo.java 1.12: o1.addListener(), o1.removeListener()
Bar.java 1.47: o2.addListener(), o2.removeListener(), System.out.println()
Baz.java 1.23: o3.addListener(), o3.removeListener(), list.iterator(), iter.hasNext(), iter.next()
Qux.java 1.41: o4.addListener(), System.out.println()
Qux.java 1.42: o4.removeListener()

Adapted from Livshits et al.’s slides

Page 96

Finding Pairs
Foo.java 1.12: o1.addListener(), o1.removeListener() (1 pair)
Bar.java 1.47: o2.addListener(), o2.removeListener(), System.out.println() (1 pair)
Baz.java 1.23: o3.addListener(), o3.removeListener(), list.iterator(), iter.hasNext(), iter.next() (2 pairs)
Qux.java 1.41: o4.addListener(), System.out.println() (0 pairs)
Qux.java 1.42: o4.removeListener() (0 pairs)

Adapted from Livshits et al.’s slides

Page 97

Mining Method Calls
Co-added calls often represent a usage pattern
Foo.java 1.12: o1.addListener(), o1.removeListener()
Bar.java 1.47: o2.addListener(), o2.removeListener(), System.out.println()
Baz.java 1.23: o3.addListener(), o3.removeListener(), list.iterator(), iter.hasNext(), iter.next()
Qux.java 1.41: o4.addListener(), System.out.println()
Qux.java 1.42: o4.removeListener()

Adapted from Livshits et al.’s slides

Page 98

Finding Patterns
Transactions of co-added calls:
• {o.enterAlignment(), o.exitAlignment(), o.redoAlignment(), iter.hasNext(), iter.next()}
• {o.enterAlignment(), o.exitAlignment(), o.redoAlignment(), iter.hasNext(), iter.next()}
• {o.enterAlignment(), o.exitAlignment(), o.redoAlignment(), iter.hasNext(), iter.next()}
• {o.enterAlignment(), o.exitAlignment(), o.redoAlignment(), iter.hasNext(), iter.next()}

Find “frequent itemsets” (with Apriori):
{enterAlignment(), exitAlignment(), redoAlignment()}

Adapted from Livshits et al.’s slides
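The frequent-itemset step can be sketched with a minimal level-wise Apriori; the transactions below are illustrative, mirroring the slide, not the real Eclipse change sets:

```python
def apriori(transactions, min_sup):
    """Level-wise frequent itemset mining: candidates of size k+1 are
    joined from frequent k-itemsets, then counted against transactions."""
    items = {frozenset([i]) for t in transactions for i in t}
    freq, level = {}, sorted(items, key=sorted)
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        level = [c for c, n in counts.items() if n >= min_sup]
        freq.update((c, counts[c]) for c in level)
        # candidate generation: union pairs of frequent k-itemsets
        k = len(level[0]) + 1 if level else 0
        level = sorted({a | b for a in level for b in level
                        if len(a | b) == k}, key=sorted)
    return freq

txns = [frozenset(t) for t in (
    {"enterAlignment", "exitAlignment", "redoAlignment", "hasNext", "next"},
    {"enterAlignment", "exitAlignment", "redoAlignment", "hasNext", "next"},
    {"enterAlignment", "exitAlignment", "redoAlignment"},
    {"hasNext", "next"},
)]
freq = apriori(txns, min_sup=3)
```

With min_sup=3, {enterAlignment, exitAlignment, redoAlignment} and {hasNext, next} come out frequent while cross-combinations do not.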

Page 99

Ranking Patterns
Support count = # occurrences of a pattern
Confidence = strength of a pattern, P(A|B)
Foo.java 1.12: o1.addListener(), o1.removeListener()
Bar.java 1.47: o2.addListener(), o2.removeListener(), System.out.println()
Baz.java 1.23: o3.addListener(), o3.removeListener(), list.iterator(), iter.hasNext(), iter.next()
Qux.java 1.41: o4.addListener(), System.out.println()
Qux.java 1.42: o4.removeListener()

Adapted from Livshits et al.’s slides
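On toy co-change transactions shaped like the example above, support count and confidence reduce to a few lines (method names shortened for readability):

```python
def support(transactions, itemset):
    # support count = number of transactions containing the pattern
    return sum(1 for t in transactions if itemset <= t)

def confidence(transactions, antecedent, consequent):
    # confidence of A => B is P(B | A) = sup(A ∪ B) / sup(A)
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

txns = [
    {"addListener", "removeListener"},                               # Foo 1.12
    {"addListener", "removeListener", "println"},                    # Bar 1.47
    {"addListener", "removeListener", "iterator", "hasNext", "next"},# Baz 1.23
    {"addListener", "println"},                                      # Qux 1.41
]
sup = support(txns, {"addListener", "removeListener"})
conf = confidence(txns, {"addListener"}, {"removeListener"})
```

Here the pair occurs in 3 of 4 transactions, so the rule addListener => removeListener has support count 3 and confidence 0.75.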

Page 100

Ranking Patterns
Foo.java 1.12: o1.addListener(), o1.removeListener()
Bar.java 1.47: o2.addListener(), o2.removeListener(), System.out.println()
Baz.java 1.23: o3.addListener(), o3.removeListener(), list.iterator(), iter.hasNext(), iter.next()
Qux.java 1.41: o4.addListener(), System.out.println()
Qux.java 1.42: o4.removeListener()

This is a fix! Rank removeListener() patterns higher

Adapted from Livshits et al.’s slides

Page 101

Dynamic Validation

[Figure: DynaMine workflow. Revision history mining: mine CVS histories, then rank and filter the mined patterns. Dynamic analysis: instrument relevant method calls, run the application, and post-process the results into usage patterns, error patterns, and unlikely patterns. Reporting: report patterns and bugs.]

Adapted from Livshits et al.’s slides

Page 102

Matches and Mismatches

Find and count matches and mismatches.
Example trace: an o.register(d) followed by o.deregister(d) is a match; an o.register(d) or o.deregister(d) with no partner is a mismatch.
Static vs. dynamic counts.

Adapted from Livshits et al.’s slides
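Match/mismatch counting for a register/deregister pair over one object's event trace can be sketched with a pending counter; this simple logic is our illustration, not DynaMine's exact matcher:

```python
def count_matches(trace, open_call, close_call):
    """Scan an event trace: each close_call consumes one pending
    open_call (a match); leftovers on either side are mismatches."""
    pending = matches = mismatches = 0
    for event in trace:
        if event == open_call:
            pending += 1
        elif event == close_call:
            if pending:
                pending -= 1
                matches += 1
            else:
                mismatches += 1   # close with no matching open
    # unmatched opens at the end are mismatches too
    return matches, mismatches + pending

m, mm = count_matches(
    ["register", "deregister", "deregister", "register"],
    "register", "deregister")
```

In this trace the first register/deregister pair matches; the stray deregister and the trailing register are both mismatches.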

Page 103

Pattern classification

Post-process: v validations, e violations
• usage patterns: e < v/10
• error patterns: v/10 <= e <= 2v
• unlikely patterns: otherwise

Adapted from Livshits et al.’s slides
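The thresholds above translate directly into a small classifier (v validations, e violations):

```python
def classify_pattern(v, e):
    """Post-processing rule: usage if violations are rare relative to
    validations, error if they are comparable, unlikely otherwise."""
    if e < v / 10:
        return "usage"
    elif v / 10 <= e <= 2 * v:
        return "error"
    else:
        return "unlikely"

labels = [classify_pattern(100, 3),   # 3 < 10            -> usage
          classify_pattern(100, 50),  # 10 <= 50 <= 200   -> error
          classify_pattern(10, 40)]   # 40 > 20           -> unlikely
```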

Page 104

Experiments

total 56 patterns

               jEdit      Eclipse
since          2000       2001
developers     92         112
lines of code  700,000    2,900,000
revisions      40,000     400,000

Adapted from Livshits et al.’s slides

Page 105

Case Studies

• MAPO: mining API usages from open source repositories [Xie&Pei 06]
– code bases → sequence analysis → programming
• DynaMine: finding common error patterns by mining software revision histories [Livshits&Zimmermann 05]
– change history → association rules → defect detection
• BugTriage: who should fix this bug? [Anvik et al. 06]
– bug reports → classification → debugging

Page 106

Assigning a Bug

• Many considerations
– who has the expertise?
– who is available?
– how quickly does this have to be fixed?

• Not always an obvious or correct assignment
– multiple developers may be suitable
– difficult to know what the bug is about
– bug fixes get delayed

• triage and fix rate indicates ‘liveness’ of OSS projects

Adapted from Anvik et al.’s slides

Page 107

Assigning a Bug Today

[email protected]

Adapted from Anvik et al.’s slides

Page 108

Recommending assignment

[email protected]@[email protected]

Adapted from Anvik et al.’s slides

Page 109

Overview of approach


Approach tuned using Eclipse and Firefox

[Figure: resolved bug reports are fed to a machine learning algorithm, which produces an assignment recommender that suggests developers for new reports]

Adapted from Anvik et al.’s slides

Page 110

Steps to the approach

1. Characterize the reports

2. Label the reports

3. Select the reports

4. Use a machine learning algorithm

Adapted from Anvik et al.’s slides

Page 111

Step 1: Characterizing a report

• Based on two fields
– textual summary
– description

• Use a text categorization approach
– represent each report with a word vector
– remove stop words
– weight by intra- and inter-document frequency

Adapted from Anvik et al.’s slides
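A hand-rolled sketch of that characterization: tf-idf weighting after stop-word removal. The stop-word list, example reports, and exact weighting formula are illustrative; the paper specifies them only at this level of detail:

```python
import math

STOP_WORDS = {"the", "a", "is", "in", "on", "when"}  # tiny illustrative list

def vectorize(reports):
    """Turn each report's summary+description text into a word vector
    weighted by term frequency (intra-document) times inverse document
    frequency (inter-document), after stop-word removal."""
    docs = [[w for w in text.lower().split() if w not in STOP_WORDS]
            for text in reports]
    df = {}  # document frequency per word
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    n = len(docs)
    return [{w: doc.count(w) * math.log(n / df[w]) for w in set(doc)}
            for doc in docs]

vecs = vectorize([
    "crash when saving the editor state",
    "editor toolbar icon is missing",
    "crash on shutdown",
])
```

Words shared by many reports (like "editor" or "crash") get small weights; report-specific words dominate the vector.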

Page 112

Step 2: Labeling a report

• Must determine who really fixed it
– “Assigned-to” field is not accurate

• Project-specific heuristics

Adapted from Anvik et al.’s slides

Page 113

Step 2: Labeling a report

• Must determine who really fixed it
– “Assigned-to” field is not accurate

• Project-specific heuristics
– simple

If a report is FIXED, label with who marked it as fixed. (Eclipse)

If a report is DUPLICATE, use the label of the report it duplicates. (Eclipse and Firefox)

Adapted from Anvik et al.’s slides
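The simple heuristics might be sketched as follows; the field names and report records are illustrative, not Bugzilla's actual schema:

```python
def label_report(report, reports_by_id):
    """Simple labeling heuristics: a FIXED report is labeled with the
    person who marked it fixed; a DUPLICATE inherits the label of the
    report it duplicates. Other resolutions need complex heuristics
    or stay unlabeled."""
    if report["resolution"] == "FIXED":
        return report["marked_fixed_by"]
    if report["resolution"] == "DUPLICATE":
        return label_report(reports_by_id[report["dup_of"]], reports_by_id)
    return None  # unclassifiable at this level

reports = {
    1: {"resolution": "FIXED", "marked_fixed_by": "dev_a"},
    2: {"resolution": "DUPLICATE", "dup_of": 1},
    3: {"resolution": "WONTFIX"},
}
labels = {rid: label_report(r, reports) for rid, r in reports.items()}
```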

Page 114

Step 2: Labeling a report

• Must determine who really fixed it
– “Assigned-to” field is not accurate

• Project-specific heuristics
– simple
– complex

If the report is FIXED and has attachments that are approved by a reviewer, then:
– If there is one submitter of patches, use their name.
– If there is more than one submitter, choose the name of whoever submitted the most patches.
– If the submitters cannot be determined, label with the person assigned to the report.

(Firefox)

Adapted from Anvik et al.’s slides

Page 115

Step 2: Labeling a report

• Must determine who really fixed it
– “Assigned-to” field is not accurate

• Project-specific heuristics
– simple
– complex
– unclassifiable

Reports marked as WONTFIX are often resolved after discussion and developers reaching a consensus.
– Unknown who would have fixed the bug
– Report is labeled unclassifiable

(Firefox)

Adapted from Anvik et al.’s slides

Page 116

Step 2: Labeling a report

• Must determine who really fixed it
– “Assigned-to” field is not accurate

• Project-specific heuristics
– simple
– complex
– unclassifiable

                Eclipse  Firefox
Simple          5        4
Complex         2        1
Unclassifiable  1        4

Adapted from Anvik et al.’s slides

Page 117

Step 3: Selecting the reports

• Exclude those with no label

• Include those of active developers
– developer profiles

[Figure: two charts of a developer's monthly resolved-report counts, Sep-04 to Apr-05, on a 0-40 scale]

Adapted from Anvik et al.’s slides

Page 118

Step 3: Selecting the reports

[Figure: a developer's monthly resolved-report counts, Sep-04 to Apr-05, against a threshold of 3 reports / month]

Adapted from Anvik et al.’s slides

Page 119

Step 4: Use a ML algorithm

• Supervised Algorithms
– Naïve Bayes
– C4.5
– Support Vector Machines

• Unsupervised Algorithms
– Expectation Maximization

• Incremental Algorithms
– Naïve Bayes

Adapted from Anvik et al.’s slides
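A minimal multinomial Naïve Bayes over bag-of-words reports shows the supervised setup; the reports and developer labels below are made up, and this hand-rolled classifier merely stands in for the off-the-shelf implementations such a study would use:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words bug reports:
    P(dev | words) ∝ P(dev) * Π P(word | dev), with add-one smoothing."""
    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, y in zip(texts, labels):
            self.word_counts[y].update(text.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        def score(y):  # log prior + log likelihood of the words
            c, total = self.word_counts[y], sum(self.word_counts[y].values())
            prior = math.log(self.class_counts[y] / sum(self.class_counts.values()))
            return prior + sum(math.log((c[w] + 1) / (total + len(self.vocab)))
                               for w in text.split())
        return max(self.class_counts, key=score)

nb = NaiveBayes().fit(
    ["editor crash on save", "editor toolbar broken",
     "gc pause too long", "heap leak in gc"],
    ["ui_dev", "ui_dev", "vm_dev", "vm_dev"])
pred = nb.predict("crash in editor")
```

A new report mentioning "crash" and "editor" is scored higher for the developer whose past fixes used those words.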

Page 120

Evaluating Recommenders

Precision = # of relevant recommendations / # of recommendations made

Recall = # of relevant recommendations / # of possibly relevant developers

How do we find this?
Adapted from Anvik et al.’s slides
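For a single report, both metrics reduce to set arithmetic (the developer names are made up):

```python
def precision_recall(recommended, relevant):
    """Precision: fraction of the recommendations that are relevant.
    Recall: fraction of the possibly relevant developers recommended."""
    hits = set(recommended) & set(relevant)
    return len(hits) / len(recommended), len(hits) / len(relevant)

# Recommender suggests 3 developers; 2 of the 4 possibly relevant
# developers are among them
p, r = precision_recall(["dev_a", "dev_b", "dev_c"],
                        ["dev_a", "dev_c", "dev_d", "dev_e"])
```

Here precision is 2/3 and recall is 2/4; the hard part, as the slide asks, is determining the relevant set in the first place.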

Page 121

Determining Possibly Relevant Developers

[Figure: from a fixed bug report, find the modules touched by the fix in the CVS repository, collect the CVS usernames that changed those modules, and map them to bug repository usernames (email addresses)]

Adapted from Anvik et al.’s slides

Page 122

Still not Straightforward (e.g., Firefox)

[Figure: a bug report maps to a module list in the CVS repository; each module has its own set of patch submitters, which must be reconciled with the CVS usernames]

Adapted from Anvik et al.’s slides

Page 123

Precision vs. Recall

A small set of “right” developers (precision) is more important than the set of all possible developers (recall)

[Figure: two bar charts (0-100% scale) comparing the recall and precision of multinomial Naïve Bayes, C4.5, and SVM recommenders on Eclipse, Firefox, and gcc]

Adapted from Anvik et al.’s slides

Page 124

Overview

• Software engineering data: code bases, change history, program states, structural entities, bug reports
• Software engineering tasks: programming, defect detection, testing, debugging, maintenance
• Data mining techniques: classification, association/patterns, clustering, etc.

Page 125

Conclusions

• Software development generates a large amount of different types of data

• Data mining and data analysis can help software engineering substantially

• Successful cases
– What software engineering data can be mined?
– What software engineering task can be helped?
– How to conduct the mining?

Page 126

Challenges

• Complexity in software development
– Specific data mining techniques are needed

• Software development and maintenance are dynamic and user-centered
– Interactive data mining
– Visual data mining and analysis
– Online, incremental mining

Page 127

Questions?

Mining Software Engineering Data Bibliography
http://ase.csc.ncsu.edu/dmse/
• What software engineering tasks can be helped by data mining?
• What kinds of software engineering data can be mined?
• How are data mining techniques used in software engineering?
• Resources