SPIRAL: Software Process Improvement and Reliability Assurance Lab.

Transfer Defect Learning*

2013. 05. 22

Presenter: Duksan Ryu

*Nam, Jaechang, Sinno Jialin Pan, and Sunghun Kim. "Transfer Defect Learning."

Transfer Defect Learning - KAISTse.kaist.ac.kr/.../2013/06/Transfer-Defect-Learning.pdf · 2013-06-12 · Software Process Improvement and Reliability Assurance Lab. Introduction


Content

Introduction

Related work

Approach

Experimental setup

Results

Threats to validity

Conclusion


Introduction

Software Defect Prediction
  Goal: to predict the number of defects in software.
  Main approach
    1) Prepare data sets from software repositories.
    2) Employ machine learning classifiers to build a prediction model from the data sets.
    3) Identify software defects by using the trained model.

Within-Project Defect Prediction (WPDP)
  A prediction model is built from one part of a project.
  The model is evaluated on the remainder of the project.

Cross-Project Defect Prediction (CPDP)
  New projects do not have enough defect data to build a prediction model.
  Data from other projects is used to build the prediction model instead.
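The two settings differ only in where the training and test instances come from. A minimal sketch, assuming each project is a plain list of instances; the helper names, the 50/50 split, and the toy data are illustrative assumptions and do not come from the presentation:

```python
import random


def wpdp_split(project, train_fraction=0.5, seed=0):
    """Within-project: train on one part of a project, test on the rest."""
    rows = list(project)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def cpdp_split(source_project, target_project):
    """Cross-project: train on another project, test on the new one."""
    return list(source_project), list(target_project)


# Toy instances: (file name, is_buggy). Purely hypothetical data.
project_a = [("A%d.java" % i, i % 2 == 0) for i in range(10)]
project_b = [("B%d.java" % i, i % 3 == 0) for i in range(6)]

wp_train, wp_test = wpdp_split(project_a)
cp_train, cp_test = cpdp_split(project_a, project_b)
```

In the cross-project case no instance of the target project is used for training, which is why a distribution mismatch between the two projects matters.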


Defect prediction process
  1) Labeling: each file is labeled based on its number of post-release defects.
  2) Feature extraction: defect prediction metrics, such as complexity metrics, are used as features.
  3) Preprocessing: the data set is preprocessed.
  4) Training: prediction models are trained using machine learning classifiers.
  5) Prediction: the trained model predicts whether new instances are buggy or clean.
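Steps 1 and 2 of the process can be sketched as follows. The labeling threshold (at least one post-release defect means buggy) is an assumption, since the slide only says that labels are based on the defect count; the record layout and metric values are hypothetical as well:

```python
def label_file(post_release_defects):
    """Assumed rule: a file with at least one post-release defect is buggy."""
    return "buggy" if post_release_defects > 0 else "clean"


def build_instances(files):
    """Turn file records into a feature matrix and a label list."""
    features = [f["metrics"] for f in files]
    labels = [label_file(f["post_release_defects"]) for f in files]
    return features, labels


# Hypothetical records: two complexity-style metrics per file.
files = [
    {"name": "Parser.java", "metrics": [120, 14], "post_release_defects": 2},
    {"name": "Util.java", "metrics": [40, 3], "post_release_defects": 0},
]

X, y = build_instances(files)
```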


Issue of Cross-Project Prediction

Cross-project prediction performance is poor.
  The data distributions of the source and target projects are different.

Machine learning classifiers assume that training and test data have
  the same feature space
  the same data distribution


Outline

[Diagram, labels only recovered]
  Transfer Learning (data mining domain): TCA
  Defect Prediction (SW engineering domain)
  Transfer Defect Learning: TCA+ (automatic normalization selection)


Related Work

Defect Prediction
  Within-project prediction
    New defect prediction algorithms and new metrics to predict defects effectively [ICSE 2006 – 2013]
  Cross-project prediction: three types of studies
    Selecting the best data sets from SW repositories for CPDP [Zimmermann et al. ESEC/FSE 2009]
    Filtering training data from a data set for CPDP [Turhan et al. Empir. SE 2009]
    Transfer learning [Ma et al. IST 2012]


Related Work

Transfer Learning
  Widely studied to address cross-domain problems
  Extracts common knowledge from one task domain and transfers it to another
  The transferred knowledge is used to train a prediction model
  Methods
    TPLSA: Topic-bridged probabilistic latent semantic analysis (2008)
    MMDE: Maximum mean discrepancy embedding (2008)
    TCA: Transfer Component Analysis (2009)
  Applications
    Text classification
    Natural language processing
    WiFi-based indoor localization
    Computer vision


Approach

Transfer defect learning overview

Notation and problem definition

Transfer component analysis

Normalization for data preprocessing

TCA+


Transfer defect learning overview

Assumption

The source and target projects have the same set of features.

Their feature distributions may differ.

Objective

To make the feature distributions of the source and target projects similar


Notation and problem definition

The given source project data set: $D_S = \{(x_{S_i}, y_{S_i})\}_{i=1}^{n_1}$
  $x_{S_i} \in \mathbb{R}^{m \times 1}$: the input file (a vector of metrics)
  $y_{S_i}$: the corresponding defect info (clean or buggy)
The given target project data set: $D_T = \{x_{T_i}\}_{i=1}^{n_2}$
Assumption: $x_{S_i}$ and $x_{T_i}$ are described by the same set of metrics
$n_1$, $n_2$: the numbers of files in the source and target projects

The objective of TCA
  To learn a transformation mapping $\phi$ that maps the data of both the source and target projects onto a latent feature space, such that
    the difference between the data distributions of $\phi(X_S)$ and $\phi(X_T)$ becomes small, and
    a standard model f trained on $\phi(X_S)$ and $Y_S$ can achieve precise predictions on $\phi(X_T)$.
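One way to make "the difference between data distributions" concrete is the squared distance between the feature means of the two projects. This is what the maximum mean discrepancy, the criterion TCA builds on, reduces to when no kernel is applied. A rough sketch under that simplifying assumption; the function names and toy data are illustrative and not taken from the slides:

```python
def feature_means(rows):
    """Per-feature mean of a data set given as a list of metric vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_discrepancy(source_rows, target_rows):
    """Squared Euclidean distance between source and target feature means."""
    mu_s = feature_means(source_rows)
    mu_t = feature_means(target_rows)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Toy data: two files per project, two metrics per file.
X_S = [[1.0, 10.0], [3.0, 30.0]]
X_T = [[2.0, 20.0], [4.0, 60.0]]

gap = mean_discrepancy(X_S, X_T)
```

A transformation that brings this quantity down for the mapped data is doing, in a very reduced form, what the objective above asks for.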


Transfer component analysis

A feature extraction technique for transfer learning

Motivation

Common latent factors may exist between the source and target domains, even though the observed features of the domains are different.

To reveal the latent factors, project the domains onto a new space called the latent space.

The domain difference can be reduced while the original data structures are preserved.

The latent space is used as a bridge for cross-domain classification tasks.

Example: Internet Explorer 8 and Firefox

Both represented by the same set of metrics

Different metric values since development processes are different.

As web browsers, they have some commonality in coding, even though the commonality may be hidden.

If the hidden commonality can be discovered and used to represent the data of the two projects, then the cross-project difference may be reduced.


Transfer Component Analysis

$\phi(x) = W^\top x$, where $W \in \mathbb{R}^{m \times d}$ is a matrix that maps m-dimensional feature vectors to d-dimensional ones.

1) Learn the transformation $\phi$ by using TCA.
2) Map $X_S \to \phi(X_S)$ and $X_T \to \phi(X_T)$.
3) Train a classifier f on $\phi(X_S)$ and the corresponding labels $Y_S$.
4) Use the model f to make predictions on the target domain test data: $f(\phi(X_T))$.
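The mapping step (step 2) is a plain linear projection once the matrix is known. The sketch below covers only that step and assumes the matrix has already been learned; the 2x3 matrix shown is a hypothetical stand-in, stored as d rows of length m, and is not the output of TCA:

```python
def project(matrix_rows, x):
    """Map an m-dimensional vector to d dimensions (one output per row)."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix_rows]


def project_all(matrix_rows, X):
    """Apply the same linear map to every instance of a data set."""
    return [project(matrix_rows, x) for x in X]


# Hypothetical map from m = 3 metrics to d = 2 latent features.
matrix_rows = [
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]

X_S = [[2.0, 4.0, 1.0], [6.0, 2.0, 3.0]]  # toy source data
X_T = [[1.0, 1.0, 5.0]]                   # toy target data

phi_X_S = project_all(matrix_rows, X_S)
phi_X_T = project_all(matrix_rows, X_T)
```

The same matrix is applied to both projects, so a classifier trained on the mapped source data can be applied directly to the mapped target data.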


Normalization for data preprocessing

Normalization
  A data preprocessing technique
  Gives all features of a data set an equal weight
  Can improve the prediction performance of classification models

Normalization options
  NoN: no normalization is applied.
  N1: min-max normalization, with values ranging from zero to one.
  N2: z-score normalization, which makes the mean zero and the standard deviation (std) one.
  N3: z-score normalization where the mean and std are computed only on the source project data but applied to the data of both projects.
  N4: z-score normalization where the mean and std are computed only on the target project data but applied to the data of both projects.
  N3 and N4 address the case where a data set is too small to estimate its data distribution.

Min-max: $x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}$
Z-score: $x_i' = \frac{x_i - \operatorname{mean}(x)}{\operatorname{std}(x)}$

[Table not recovered: preliminary cross-project prediction results (F-measures)]
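The normalization options can be sketched in a few lines. This is an illustration only: the use of the population standard deviation, the handling of constant features (mapped to 0.0), and the function names are assumptions that the slides do not specify:

```python
import statistics


def columns(rows):
    """Transpose a list of instances into a list of feature columns."""
    return [list(col) for col in zip(*rows)]


def min_max(rows):
    """N1: rescale every feature to the range [0, 1]."""
    scaled_cols = []
    for col in columns(rows):
        lo, span = min(col), max(col) - min(col)
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_cols)]


def z_score(rows, reference=None):
    """N2 when reference is None; N3 or N4 when reference is the source or
    target data whose mean and std should be applied to rows."""
    ref_cols = columns(rows if reference is None else reference)
    means = [statistics.mean(c) for c in ref_cols]
    stds = [statistics.pstdev(c) for c in ref_cols]  # population std (assumed)
    return [
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


source = [[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]]  # toy source data
target = [[2.0, 30.0]]                            # toy target data

n1_source = min_max(source)
n2_source = z_score(source)
n3_target = z_score(target, reference=source)  # source statistics on target
```

For N3, both projects would be passed through `z_score(..., reference=source)`; for N4, the reference would be the target data instead.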


TCA+

Prediction performance varies with the normalization option selected.
TCA+ proposes an algorithm that selects an appropriate normalization option for a given cross-prediction pair.

Normalization selection
  Identify the similarity of data set characteristics between the source and target projects.
  Each data set is summarized by a data set characteristic vector (DCV).
  $DIST = \{ d_{ij} : \forall i, j,\ 1 \le i < n,\ 1 < j \le n,\ i < j \}$, where $d_{ij}$ is the Euclidean distance between instances i and j.
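The distance set and a characteristic vector built from it can be sketched as below. Only dist_min and dist_max are named in these slides; the other four element names (mean, median, and std of the distances, plus the number of instances) are an assumption about what the six elements are, as is the use of the population standard deviation:

```python
import math
import statistics
from itertools import combinations


def pairwise_distances(rows):
    """DIST: Euclidean distance d_ij for every pair of instances with i < j."""
    return [math.dist(a, b) for a, b in combinations(rows, 2)]


def characteristic_vector(rows):
    """Summarize a data set by statistics of its pairwise distances."""
    dist = pairwise_distances(rows)
    return {
        "dist_mean": statistics.mean(dist),
        "dist_median": statistics.median(dist),
        "dist_min": min(dist),
        "dist_max": max(dist),
        "dist_std": statistics.pstdev(dist),  # population std (assumed)
        "num_instances": len(rows),
    }


# Toy data set with three instances and two metrics each.
rows = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
dcv = characteristic_vector(rows)
```

Computing one such vector for the source project and one for the target project gives the two summaries whose elements are then compared.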


TCA+

Six elements of a characteristic vector

Conditions for assignment


TCA+

Rule1 -> No normalization
  The data sets are similar enough, so do not normalize the data.

Rule2 -> Min-max
  A large gap in dist_min or dist_max may indicate different distributions.

Rule3 -> Z-score using source data
  Target data are sparse or dense: very little statistical information.

Rule4 -> Z-score using target data
  Source data are sparse or dense: very little statistical information.

Rule5 -> Z-score
  No other rule applies, so use z-score normalization.
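The rule cascade can be sketched as a first-match dispatch. The exact conditions behind each rule are not recoverable from these slides, so the sketch takes them as already-evaluated booleans. Two further assumptions: the five normalization labels on the slide are read as matching Rule1 to Rule5 in order, and the rules are tried in that order:

```python
def select_normalization(similar, large_min_max_gap,
                         target_sparse_or_dense, source_sparse_or_dense):
    """Return the normalization option chosen by the first rule that fires."""
    if similar:                 # Rule1: data sets are similar enough
        return "NoN"
    if large_min_max_gap:       # Rule2: large gap in dist_min or dist_max
        return "N1"
    if target_sparse_or_dense:  # Rule3: rely on source statistics
        return "N3"
    if source_sparse_or_dense:  # Rule4: rely on target statistics
        return "N4"
    return "N2"                 # Rule5: default z-score normalization


choice = select_normalization(
    similar=False,
    large_min_max_gap=False,
    target_sparse_or_dense=True,
    source_sparse_or_dense=False,
)
```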


Experimental setup

Benchmark Sets

ReLink: Defect benchmark data set with 26 complexity metrics


Experimental setup

Experimental Design

Within-project Prediction

Cross-project prediction without TCA

Cross-project prediction with TCA

Cross-project prediction with TCA+


Experimental setup

Machine learning classifier

Logistic regression

Linear regression: $Y = b_0 + b_1 X$

Logistic regression: $f(t) = \frac{1}{1 + e^{-t}}$, where $t = b_0 + b_1 X$
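The logistic function squashes the linear score t into a value between 0 and 1, which can be read as the probability that an instance is buggy. A small sketch for the single-feature case; the coefficient values are made up for illustration and the function names are not from the slides:

```python
import math


def logistic(t):
    """f(t) = 1 / (1 + e^(-t)), written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def predict_buggy_probability(x, b0, b1):
    """Logistic regression with one feature: f(b0 + b1 * x)."""
    return logistic(b0 + b1 * x)


# Hypothetical coefficients: the score is zero at x = 2, so p = 0.5 there.
p = predict_buggy_probability(2.0, b0=-1.0, b1=0.5)
```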


Experimental setup

Evaluation Measures

Buggy precision: $P(b) = \frac{N_{b \to b}}{N_{b \to b} + N_{c \to b}} = \frac{TP}{TP + FP}$

Buggy recall: $R(b) = \frac{N_{b \to b}}{N_{b \to b} + N_{b \to c}} = \frac{TP}{TP + FN}$

Buggy F-measure: $F(b) = \frac{2 \times P(b) \times R(b)}{P(b) + R(b)}$

Cost matrix

                              Predicted class
                              Class = Yes            Class = No
Actual class   Class = Yes    True Positive (TP)     False Negative (FN)
               Class = No     False Positive (FP)    True Negative (TN)
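The three measures follow directly from the counts in the matrix. A short sketch, treating "buggy" as the positive class; returning 0.0 when a denominator is zero is a convention chosen here, not something the slides state:

```python
def buggy_measures(actual, predicted, positive="buggy"):
    """Return (precision, recall, F-measure) for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy predictions: TP = 2, FP = 1, FN = 1, TN = 1.
actual = ["buggy", "buggy", "buggy", "clean", "clean"]
predicted = ["buggy", "buggy", "clean", "buggy", "clean"]

precision, recall, f_measure = buggy_measures(actual, predicted)
```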


Results

TCA with Different Normalization Options

TCA with N2, N3, or N4 significantly improves cross-project defect prediction results on the ReLink data sets in terms of average F-measure.

However, the results of some cross-prediction combinations are not improved by TCA.


Results

TCA+ outperforms Baseline for all combinations.

Performance of TCA+ is comparable to within-project prediction performance.


Threats to validity

Systems are open-source projects.

Experimental results might not be generalizable.

Decision rules in TCA+ might not be generalizable.


Conclusion

Applied transfer learning approaches for cross-project defect prediction.

The first application of TCA for defect prediction.

TCA+ is a new approach to select suitable normalization options for TCA.


Q and A

Thank you
