S P I R A L Software Process Improvement and Reliability Assurance Lab.
Transfer Defect Learning*
2013. 05. 22
Presenter: Duksan Ryu
*Nam, Jaechang, Sinno Jialin Pan, and Sunghun Kim. "Transfer Defect Learning." Proceedings of the 35th International Conference on Software Engineering (ICSE 2013).
Contents
Introduction
Related work
Approach
Experimental setup
Results
Threats to validity
Conclusion
Introduction: Software Defect Prediction
Goal: to predict the number of defects in software
Main approach
1) Prepare data sets from software repositories
2) Employ machine learning classifiers to build a prediction model from the data sets
3) Identify software defects by using the trained model
Within-Project Defect Prediction (WPDP)
- A prediction model is built from one part of a project
- The model is evaluated on the remainder of the same project
Cross-Project Defect Prediction (CPDP)
- New projects have too little defect data to build a prediction model
- Data from other projects are used to build a prediction model instead
Defect prediction process
1) The labeling task is based on the number of post-release defects for each file.
2) Defect prediction metrics, such as complexity metrics, are used as features.
3) The data set is preprocessed.
4) Prediction models are trained using machine learning classifiers.
5) The trained model predicts whether new instances are buggy or clean.
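As a rough illustration of this five-step pipeline, here is a minimal sketch in Python with scikit-learn. The metric values and labels are invented for illustration; logistic regression matches the classifier used later in this talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 1)-2) Toy data: rows are files, columns are complexity metrics
#        (e.g., LOC and cyclomatic complexity); labels come from
#        counting post-release defects (1 = buggy, 0 = clean).
X_train = np.array([[120, 7], [300, 15], [45, 2], [500, 22]], dtype=float)
y_train = np.array([0, 1, 0, 1])  # hypothetical labels

# 3) Preprocess: z-score normalization (mean 0, std 1 per metric)
scaler = StandardScaler().fit(X_train)

# 4) Train a prediction model
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# 5) Predict whether a new file is buggy or clean
X_new = np.array([[250, 12]], dtype=float)
print(model.predict(scaler.transform(X_new)))  # expected: [1] (buggy)
```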
Issue of Cross-Project Prediction
Cross-project prediction performance is poor
- The data distributions of the source and target projects are different
- Machine learning classifiers assume that the training and test data share
  - the same feature space
  - the same data distribution
Outline
[Diagram: TCA, a transfer learning technique from the data mining domain, is combined with defect prediction from the SW engineering domain and automatic normalization selection to form Transfer Defect Learning (TCA+)]
Related Work: Defect Prediction
Within-project prediction
- New defect prediction algorithms and new metrics to predict defects effectively [ICSE 2006–2013]
Cross-project prediction: three types of studies
- Selecting the best data sets from SW repositories for CPDP [Zimmermann et al., ESEC/FSE 2009]
- Filtering training data from the data set for CPDP [Turhan et al., Empir. SE 2009]
- Transfer learning [Ma et al., IST 2012]
Related Work: Transfer Learning
Widely studied to address cross-domain problems
- Extracts common knowledge from one task domain and transfers it to another
- The transferred knowledge is used to train a prediction model
Methods
- TPLSA: Topic-bridged probabilistic latent semantic analysis (2008)
- MMDE: Maximum mean discrepancy embedding (2008)
- TCA: Transfer Component Analysis (2009)
Applications
- Text classification
- Natural language processing
- WiFi-based indoor localization
- Computer vision
Approach
Transfer defect learning overview
Notation and problem definition
Transfer component analysis
Normalization for data preprocessing
TCA+
Transfer defect learning overview
Assumption
The source and target projects have the same set of features.
Their feature distributions may differ.
Objective
To make the feature distributions of the source and target projects similar
Notation and problem definition
- The given source project data set: $D_S = \{(x_{S_i}, y_{S_i})\}_{i=1}^{n_1}$
  - $x_{S_i} \in \mathbb{R}^m$: an input file (a vector of m metrics)
  - $y_{S_i}$: the corresponding defect information (clean or buggy)
- The given target project data set: $D_T = \{x_{T_i}\}_{i=1}^{n_2}$
- Assumption: both projects use the same set of m metrics
- $n_1$ and $n_2$: the numbers of files in the source and target projects
- The objective of TCA
  - Learn a transformation mapping $\phi$ that maps the data of both the source and target projects onto a latent feature space, so that the difference between the data distributions of $\phi(X_S)$ and $\phi(X_T)$ becomes small
  - A standard model f trained on $\phi(X_S)$ and $Y_S$ can then achieve precise predictions on $\phi(X_T)$
Transfer Component Analysis
A feature extraction technique for transfer learning
Motivation
Common latent factors may exist between the source and target domains, even though the observed features of the domains are different.
To reveal the latent factors, project the domains onto a new space called the latent space.
The domain difference can be reduced while the original data structures can be preserved.
The latent space is used as a bridge for cross-domain classification tasks
Example: Internet Explorer 8 and Firefox
Both represented by the same set of metrics
Different metric values since development processes are different.
As web browsers, they have some commonality in coding, even though the commonality may be hidden.
If the hidden commonality can be discovered and used to represent the data of the two projects, then the cross-project difference may be reduced.
Transfer Component Analysis
$\phi(x) = W^\top x$, where $W \in \mathbb{R}^{m \times d}$ is a matrix that maps m-dimensional feature vectors to d-dimensional ones
1) Learn the transformation $\phi$ by using TCA
2) Map $X_S \rightarrow \phi(X_S)$ and $X_T \rightarrow \phi(X_T)$
3) Train a classifier f on $\phi(X_S)$ and the corresponding labels $Y_S$
4) Use the model f to make predictions $f(\phi(X_T))$ on the target domain test data
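A compact numpy sketch may help make steps 1) and 2) concrete. It follows the standard TCA formulation (top eigenvectors of $(KLK + \mu I)^{-1}KHK$) under the assumption of a linear kernel; `dim` and `mu` are illustrative hyperparameters, not values from the paper.

```python
import numpy as np

def tca(X_src, X_tgt, dim=5, mu=1.0):
    """Sketch of Transfer Component Analysis with a linear kernel.

    Returns the source and target data mapped onto a shared
    dim-dimensional latent space where their distributions are closer.
    """
    n1, n2 = len(X_src), len(X_tgt)
    n = n1 + n2
    X = np.vstack([X_src, X_tgt])

    K = X @ X.T                                # linear kernel matrix (n x n)

    # L encodes the MMD between the source and target means in kernel space
    e = np.vstack([np.full((n1, 1), 1.0 / n1),
                   np.full((n2, 1), -1.0 / n2)])
    L = e @ e.T

    H = np.eye(n) - np.full((n, n), 1.0 / n)   # centering matrix

    # Transfer components: top eigenvectors of (KLK + mu*I)^{-1} KHK
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(A)
    W = vecs[:, np.argsort(-vals.real)[:dim]].real

    Z = K @ W                                  # mapped data (n x dim)
    return Z[:n1], Z[n1:]
```

A classifier would then be trained on the mapped source rows and their labels, and applied to the mapped target rows (steps 3 and 4).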
Normalization for Data Preprocessing
Normalization
- A data preprocessing technique
- Gives all features of a data set an equal weight
- Can improve the prediction performance of classification models
Normalization options
- NoN: no normalization is applied
- N1: min-max normalization, with a value range from zero to one: $\tilde{x}_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$
- N2: z-score normalization, which makes the mean zero and the standard deviation (std) one: $\tilde{x}_i = \frac{x_i - \mathrm{mean}(x)}{\mathrm{std}(x)}$
- N3: mean and std are computed only on the source project data but applied to both projects' data
- N4: mean and std are computed only on the target project data but applied to both projects' data
- Rationale for N3 and N4: the size of each data set may be too small to estimate the data distribution reliably
Preliminary cross-project prediction results (F-measures) [table not reproduced in this transcript]
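These options translate directly into code. A small sketch, with the function name and the (files x metrics) array layout assumed for illustration:

```python
import numpy as np

def normalize(src, tgt, option="N2"):
    """Apply one of the normalization options NoN/N1/N2/N3/N4.

    src, tgt: 2-D arrays (files x metrics) of the source and target
    projects; statistics are computed per metric (column).
    """
    if option == "NoN":            # no normalization
        return src, tgt
    if option == "N1":             # min-max per project, range [0, 1]
        mm = lambda X: (X - X.min(0)) / (X.max(0) - X.min(0))
        return mm(src), mm(tgt)
    if option == "N2":             # z-score per project
        z = lambda X: (X - X.mean(0)) / X.std(0)
        return z(src), z(tgt)
    if option == "N3":             # z-score with source statistics only
        m, s = src.mean(0), src.std(0)
        return (src - m) / s, (tgt - m) / s
    if option == "N4":             # z-score with target statistics only
        m, s = tgt.mean(0), tgt.std(0)
        return (src - m) / s, (tgt - m) / s
    raise ValueError(f"unknown option: {option}")
```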
TCA+
- Prediction performance varies according to the normalization option selected
- Propose an algorithm to select an appropriate normalization option for a given cross-prediction pair
Normalization selection
- Identify the similarity of data set characteristics between the source and target projects
- A Data set Characteristic Vector (DCV) is built from the set of pairwise instance distances $DIST = \{d_{ij} : 1 \le i \le n,\ 1 \le j \le n,\ i \ne j\}$, using the Euclidean distance
TCA+
Six elements of a characteristic vector
Conditions for assignment
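A sketch of how such a DCV could be computed. The six statistics below (summary statistics of the pairwise distances plus the instance count) are assumed from the slide's description, since the table itself is not reproduced:

```python
import numpy as np
from scipy.spatial.distance import pdist

def dcv(X):
    """Build a data set characteristic vector (DCV) from the pairwise
    Euclidean distances d_ij between a project's instances."""
    d = pdist(X, metric="euclidean")     # all d_ij with i != j
    return {"mean": d.mean(), "median": np.median(d),
            "min": d.min(), "max": d.max(),
            "std": d.std(), "n": len(X)}
```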
TCA+ decision rules
- Rule 1: the data sets are similar enough → do not normalize the data (NoN)
- Rule 2: a large gap in dist_min or dist_max may indicate different distributions → min-max normalization (N1)
- Rule 3: the target data are sparse or dense, giving very little statistical information → z-score normalization using source data (N3)
- Rule 4: the source data are sparse or dense, giving very little statistical information → z-score normalization using target data (N4)
- Rule 5: when no other rule applies → z-score normalization (N2)
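The rules read as a simple decision procedure over the two DCVs. The sketch below mirrors that structure; the threshold predicates (much_different, similar, sparse_or_dense) and their constants are hypothetical stand-ins, since the paper's exact conditions are in the table omitted above.

```python
# Hypothetical similarity predicates; the paper's actual thresholds come
# from its "conditions for assignment" table, not reproduced here.
def much_different(a, b, factor=1.6):
    return max(a, b) > factor * min(a, b) if min(a, b) > 0 else a != b

def similar(d1, d2, keys=("mean", "median", "min", "max", "std", "n")):
    return all(not much_different(d1[k], d2[k]) for k in keys)

def sparse_or_dense(d, lo=0.1, hi=10.0):         # assumed density bounds
    return d["mean"] < lo or d["mean"] > hi

def select_normalization(dcv_src, dcv_tgt):
    """Sketch of the TCA+ normalization-selection rules."""
    if similar(dcv_src, dcv_tgt):                              # Rule 1
        return "NoN"
    if much_different(dcv_src["min"], dcv_tgt["min"]) or \
       much_different(dcv_src["max"], dcv_tgt["max"]):         # Rule 2
        return "N1"
    if sparse_or_dense(dcv_tgt):                               # Rule 3
        return "N3"   # z-score using source statistics
    if sparse_or_dense(dcv_src):                               # Rule 4
        return "N4"   # z-score using target statistics
    return "N2"                                                # Rule 5
```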
Experimental setup
Benchmark Sets
ReLink: Defect benchmark data set with 26 complexity metrics
Experimental setup
Experimental Design
Within-project prediction
Cross-project prediction without TCA
Cross-project prediction with TCA
Cross-project prediction with TCA+
Experimental setup
Machine learning classifier
Logistic regression
- Linear regression: $Y = b_0 + b_1 X$
- Logistic regression: $f(t) = \frac{1}{1 + e^{-t}}$, where $t = b_0 + b_1 X$
Experimental setup
Evaluation Measures
Buggy precision: $P(b) = \frac{N_{b \to b}}{N_{b \to b} + N_{c \to b}} = \frac{TP}{TP + FP}$
Buggy recall: $R(b) = \frac{N_{b \to b}}{N_{b \to b} + N_{b \to c}} = \frac{TP}{TP + FN}$
Buggy F-measure: $F(b) = \frac{2 \times P(b) \times R(b)}{P(b) + R(b)}$
Confusion matrix:
                     Predicted: Yes          Predicted: No
Actual: Yes          True Positive (TP)      False Negative (FN)
Actual: No           False Positive (FP)     True Negative (TN)
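For instance, these measures can be computed directly from actual and predicted labels (a plain illustration, not code from the paper):

```python
def buggy_metrics(y_true, y_pred):
    """Compute buggy precision, recall, and F-measure (buggy = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Example: 2 TP, 1 FP, 1 FN -> P = R = F = 0.666...
print(buggy_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```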
Results: TCA with Different Normalization Options
- TCA with N2, N3, or N4 significantly improves the cross-project defect prediction results on the ReLink data sets in terms of average F-measure.
- However, the results of some cross-prediction combinations are not improved by TCA.
Results
TCA+ outperforms the baseline for all combinations.
Performance of TCA+ is comparable to within-project prediction performance.
Threats to validity
Systems are open-source projects.
Experimental results might not be generalizable.
Decision rules in TCA+ might not be generalizable.
Conclusion
Applied transfer learning approaches for cross-project defect prediction.
The first application of TCA for defect prediction.
TCA+ is a new approach to select suitable normalization options for TCA.
Q and A
Thank you