S P I R A L Software Process Improvement and Reliability Assurance Lab.
Transfer Defect Learning*
2013. 05. 22
Presenter: Duksan Ryu
*Nam, Jaechang, Sinno Jialin Pan, and Sunghun Kim. "Transfer Defect Learning." Proceedings of the 35th International Conference on Software Engineering (ICSE 2013).
Contents
Introduction
Related work
Approach
Experimental setup
Results
Threats to validity
Conclusion
Introduction: Software Defect Prediction
Goal: to predict the number of defects in software
Main approach
1) Prepare data sets from software repositories
2) Employ machine learning classifiers to build a prediction model from the data sets
3) Identify software defects by using the trained model
Within-Project Defect Prediction (WPDP)
- A prediction model is built from one part of a project
- The model is evaluated on the remainder of the same project
Cross-Project Defect Prediction (CPDP)
- New projects have too little defect data to build a prediction model
- Data from other projects are used to build a prediction model instead
Defect prediction process
1) The labeling task is based on the number of post-release defects for each file.
2) Defect prediction metrics, such as complexity metrics, are used as features.
3) The data set is preprocessed.
4) Prediction models are trained using machine learning classifiers.
5) The trained model predicts whether new instances are buggy or clean.
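As a rough illustration of this five-step pipeline, here is a minimal sketch in Python with scikit-learn. The metric values and labels are invented for illustration; logistic regression matches the classifier used later in this talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 1)-2) Toy data: rows are files, columns are complexity metrics
#        (e.g., LOC and cyclomatic complexity); labels come from
#        counting post-release defects (1 = buggy, 0 = clean).
X_train = np.array([[120, 7], [300, 15], [45, 2], [500, 22]], dtype=float)
y_train = np.array([0, 1, 0, 1])  # hypothetical labels

# 3) Preprocess: z-score normalization (mean 0, std 1 per metric)
scaler = StandardScaler().fit(X_train)

# 4) Train a prediction model
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# 5) Predict whether a new file is buggy or clean
X_new = np.array([[250, 12]], dtype=float)
print(model.predict(scaler.transform(X_new)))  # expected: [1] (buggy)
```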
Issue of Cross-Project Prediction
Cross-project prediction performance is poor
- The data distributions of the source and target projects are different
- Machine learning classifiers assume that the training and test data share
  - the same feature space
  - the same data distribution
Outline
[Diagram: TCA, a transfer learning technique from the data mining domain, is combined with defect prediction from the SW engineering domain and automatic normalization selection to form Transfer Defect Learning (TCA+)]
Related Work: Defect Prediction
Within-project prediction
- New defect prediction algorithms and new metrics to predict defects effectively [ICSE 2006–2013]
Cross-project prediction: three types of studies
- Selecting the best data sets from SW repositories for CPDP [Zimmermann et al., ESEC/FSE 2009]
- Filtering training data from the data set for CPDP [Turhan et al., Empir. SE 2009]
- Transfer learning [Ma et al., IST 2012]
Related Work: Transfer Learning
Widely studied to address cross-domain problems
- Extracts common knowledge from one task domain and transfers it to another
- The transferred knowledge is used to train a prediction model
Methods
- TPLSA: Topic-bridged probabilistic latent semantic analysis (2008)
- MMDE: Maximum mean discrepancy embedding (2008)
- TCA: Transfer Component Analysis (2009)
Applications
- Text classification
- Natural language processing
- WiFi-based indoor localization
- Computer vision
Approach
Transfer defect learning overview
Notation and problem definition
Transfer component analysis
Normalization for data preprocessing
TCA+
Transfer defect learning overview
Assumption
The source and target projects have the same set of features.
Their feature distributions may differ.
Objective
To make the feature distributions of the source and target projects similar
Notation and problem definition
- The given source project data set: $D_S = \{(x_{S_i}, y_{S_i})\}_{i=1}^{n_1}$
  - $x_{S_i} \in \mathbb{R}^m$: an input file (a vector of m metrics)
  - $y_{S_i}$: the corresponding defect information (clean or buggy)
- The given target project data set: $D_T = \{x_{T_i}\}_{i=1}^{n_2}$
- Assumption: both projects use the same set of m metrics
- $n_1$ and $n_2$: the numbers of files in the source and target projects
- The objective of TCA
  - Learn a transformation mapping $\phi$ that maps the data of both the source and target projects onto a latent feature space, so that the difference between the data distributions of $\phi(X_S)$ and $\phi(X_T)$ becomes small
  - A standard model f trained on $\phi(X_S)$ and $Y_S$ can then achieve precise predictions on $\phi(X_T)$
Transfer Component Analysis
A feature extraction technique for transfer learning
Motivation
Common latent factors may exist between the source and target domains, even though the observed features of the domains are different.
To reveal the latent factors, project the domains onto a new space called the latent space.
The domain difference can be reduced while the original data structures can be preserved.
The latent space is used as a bridge for cross-domain classification tasks
Example: Internet Explorer 8 and Firefox
Both represented by the same set of metrics
Different metric values since development processes are different.
As web browsers, they have some commonality in coding, even though the commonality may be hidden.
If the hidden commonality can be discovered and used to represent the data of the two projects, then the cross-project difference may be reduced.
Transfer Component Analysis
$\phi(x) = W^\top x$, where $W \in \mathbb{R}^{m \times d}$ is a matrix that maps m-dimensional feature vectors to d-dimensional ones
1) Learn the transformation $\phi$ by using TCA
2) Map $X_S \rightarrow \phi(X_S)$ and $X_T \rightarrow \phi(X_T)$
3) Train a classifier f on $\phi(X_S)$ and the corresponding labels $Y_S$
4) Use the model f to make predictions $f(\phi(X_T))$ on the target domain test data
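A compact numpy sketch may help make steps 1) and 2) concrete. It follows the standard TCA formulation (top eigenvectors of $(KLK + \mu I)^{-1}KHK$) under the assumption of a linear kernel; `dim` and `mu` are illustrative hyperparameters, not values from the paper.

```python
import numpy as np

def tca(X_src, X_tgt, dim=5, mu=1.0):
    """Sketch of Transfer Component Analysis with a linear kernel.

    Returns the source and target data mapped onto a shared
    dim-dimensional latent space where their distributions are closer.
    """
    n1, n2 = len(X_src), len(X_tgt)
    n = n1 + n2
    X = np.vstack([X_src, X_tgt])

    K = X @ X.T                                # linear kernel matrix (n x n)

    # L encodes the MMD between the source and target means in kernel space
    e = np.vstack([np.full((n1, 1), 1.0 / n1),
                   np.full((n2, 1), -1.0 / n2)])
    L = e @ e.T

    H = np.eye(n) - np.full((n, n), 1.0 / n)   # centering matrix

    # Transfer components: top eigenvectors of (KLK + mu*I)^{-1} KHK
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(A)
    W = vecs[:, np.argsort(-vals.real)[:dim]].real

    Z = K @ W                                  # mapped data (n x dim)
    return Z[:n1], Z[n1:]
```

A classifier would then be trained on the mapped source rows and their labels, and applied to the mapped target rows (steps 3 and 4).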
Normalization for Data Preprocessing
Normalization
- A data preprocessing technique
- Gives all features of a data set an equal weight
- Can improve the prediction performance of classification models
Normalization options
- NoN: no normalization is applied
- N1: min-max normalization, with a value range from zero to one: $\tilde{x}_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$
- N2: z-score normalization, which makes the mean zero and the standard deviation (std) one: $\tilde{x}_i = \frac{x_i - \mathrm{mean}(x)}{\mathrm{std}(x)}$
- N3: mean and std are computed only on the source project data but applied to both projects' data
- N4: mean and std are computed only on the target project data but applied to both projects' data
- Rationale for N3 and N4: the size of each data set may be too small to estimate the data distribution reliably
Preliminary cross-project prediction results (F-measures) [table not reproduced in this transcript]
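These options translate directly into code. A small sketch, with the function name and the (files x metrics) array layout assumed for illustration:

```python
import numpy as np

def normalize(src, tgt, option="N2"):
    """Apply one of the normalization options NoN/N1/N2/N3/N4.

    src, tgt: 2-D arrays (files x metrics) of the source and target
    projects; statistics are computed per metric (column).
    """
    if option == "NoN":            # no normalization
        return src, tgt
    if option == "N1":             # min-max per project, range [0, 1]
        mm = lambda X: (X - X.min(0)) / (X.max(0) - X.min(0))
        return mm(src), mm(tgt)
    if option == "N2":             # z-score per project
        z = lambda X: (X - X.mean(0)) / X.std(0)
        return z(src), z(tgt)
    if option == "N3":             # z-score with source statistics only
        m, s = src.mean(0), src.std(0)
        return (src - m) / s, (tgt - m) / s
    if option == "N4":             # z-score with target statistics only
        m, s = tgt.mean(0), tgt.std(0)
        return (src - m) / s, (tgt - m) / s
    raise ValueError(f"unknown option: {option}")
```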
TCA+
- Prediction performance varies according to the normalization option selected
- Propose an algorithm to select an appropriate normalization option for a given cross-prediction pair
Normalization selection
- Identify the similarity of data set characteristics between the source and target projects
- A Data set Characteristic Vector (DCV) is built from the set of pairwise instance distances $DIST = \{d_{ij} : 1 \le i \le n,\ 1 \le j \le n,\ i \ne j\}$, using the Euclidean distance
TCA+
Six elements of a characteristic vector
Conditions for assignment
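A sketch of how such a DCV could be computed. The six statistics below (summary statistics of the pairwise distances plus the instance count) are assumed from the slide's description, since the table itself is not reproduced:

```python
import numpy as np
from scipy.spatial.distance import pdist

def dcv(X):
    """Build a data set characteristic vector (DCV) from the pairwise
    Euclidean distances d_ij between a project's instances."""
    d = pdist(X, metric="euclidean")     # all d_ij with i != j
    return {"mean": d.mean(), "median": np.median(d),
            "min": d.min(), "max": d.max(),
            "std": d.std(), "n": len(X)}
```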
TCA+ decision rules
- Rule 1: the data sets are similar enough → do not normalize the data (NoN)
- Rule 2: a large gap in dist_min or dist_max may indicate different distributions → min-max normalization (N1)
- Rule 3: the target data are sparse or dense, giving very little statistical information → z-score normalization using source data (N3)
- Rule 4: the source data are sparse or dense, giving very little statistical information → z-score normalization using target data (N4)
- Rule 5: when no other rule applies → z-score normalization (N2)
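The rules read as a simple decision procedure over the two DCVs. The sketch below mirrors that structure; the threshold predicates (much_different, similar, sparse_or_dense) and their constants are hypothetical stand-ins, since the paper's exact conditions are in the table omitted above.

```python
# Hypothetical similarity predicates; the paper's actual thresholds come
# from its "conditions for assignment" table, not reproduced here.
def much_different(a, b, factor=1.6):
    return max(a, b) > factor * min(a, b) if min(a, b) > 0 else a != b

def similar(d1, d2, keys=("mean", "median", "min", "max", "std", "n")):
    return all(not much_different(d1[k], d2[k]) for k in keys)

def sparse_or_dense(d, lo=0.1, hi=10.0):         # assumed density bounds
    return d["mean"] < lo or d["mean"] > hi

def select_normalization(dcv_src, dcv_tgt):
    """Sketch of the TCA+ normalization-selection rules."""
    if similar(dcv_src, dcv_tgt):                              # Rule 1
        return "NoN"
    if much_different(dcv_src["min"], dcv_tgt["min"]) or \
       much_different(dcv_src["max"], dcv_tgt["max"]):         # Rule 2
        return "N1"
    if sparse_or_dense(dcv_tgt):                               # Rule 3
        return "N3"   # z-score using source statistics
    if sparse_or_dense(dcv_src):                               # Rule 4
        return "N4"   # z-score using target statistics
    return "N2"                                                # Rule 5
```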
Experimental setup
Benchmark Sets
ReLink: Defect benchmark data set with 26 complexity metrics
Experimental setup
Experimental Design
Within-project prediction
Cross-project prediction without TCA
Cross-project prediction with TCA
Cross-project prediction with TCA+
Experimental setup
Machine learning classifier
Logistic regression
- Linear regression: $Y = b_0 + b_1 X$
- Logistic regression: $f(t) = \frac{1}{1 + e^{-t}}$, where $t = b_0 + b_1 X$
Experimental setup
Evaluation Measures
Buggy precision: $P(b) = \frac{N_{b \to b}}{N_{b \to b} + N_{c \to b}} = \frac{TP}{TP + FP}$
Buggy recall: $R(b) = \frac{N_{b \to b}}{N_{b \to b} + N_{b \to c}} = \frac{TP}{TP + FN}$
Buggy F-measure: $F(b) = \frac{2 \times P(b) \times R(b)}{P(b) + R(b)}$
Confusion matrix:
                     Predicted: Yes          Predicted: No
Actual: Yes          True Positive (TP)      False Negative (FN)
Actual: No           False Positive (FP)     True Negative (TN)
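For instance, these measures can be computed directly from actual and predicted labels (a plain illustration, not code from the paper):

```python
def buggy_metrics(y_true, y_pred):
    """Compute buggy precision, recall, and F-measure (buggy = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Example: 2 TP, 1 FP, 1 FN -> P = R = F = 0.666...
print(buggy_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```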
Results: TCA with Different Normalization Options
- TCA with N2, N3, or N4 significantly improves the cross-project defect prediction results on the ReLink data sets in terms of average F-measure.
- However, the results of some cross-prediction combinations are not improved by TCA.
Results
TCA+ outperforms the baseline for all combinations.
Performance of TCA+ is comparable to within-project prediction performance.
Threats to validity
Systems are open-source projects.
Experimental results might not be generalizable.
Decision rules in TCA+ might not be generalizable.
Conclusion
Applied transfer learning approaches for cross-project defect prediction.
The first application of TCA for defect prediction.
TCA+ is a new approach to select suitable normalization options for TCA.
Q and A
Thank you