Talk given at the 2012 Working Conference on Mining Software Repositories (MSR'12) in Zürich, Switzerland.
Think Locally, Act Globally: Improving Defect and Effort Prediction Models
Nicolas Bettenburg, Meiyappan Nagappan, Ahmed E. Hassan
Queen's University, Kingston, ON, Canada
Software Analysis & Intelligence Lab
Data Modelling in Empirical SE
Observations: measured from project data
Model: describe observations mathematically
Understanding: guide decision making
Prediction: guide process optimizations and future research
Model Building Today
Whole Dataset: split into Training Data and Testing Data
Training Data -> Learned Model (M)
Testing Data + Learned Model -> Predictions (Y)
Predictions vs. actual values -> Compare
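This pipeline maps directly onto a few lines of R. Below is a minimal sketch, assuming a data frame `dataset` with a numeric response column `bugs`; all names are illustrative, not taken from the original study.

```r
# Minimal sketch of the global model building pipeline above; `dataset`
# and its response column `bugs` are illustrative assumptions.
set.seed(42)
idx      <- sample(seq_len(nrow(dataset)), size = 0.9 * nrow(dataset))
training <- dataset[idx, ]    # Training Data
testing  <- dataset[-idx, ]   # Testing Data

model       <- lm(bugs ~ ., data = training)     # Learned Model (M)
predictions <- predict(model, newdata = testing) # Predictions (Y)

# Compare: rank correlation between predicted and actual values
cor(predictions, testing$bugs, method = "spearman")
```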
Much research effort on new metrics and new models!
Maybe we need to look more at the data part.
In the Field
Tom Zimmermann: "We ran 622 cross-project predictions and found that only 3.4% actually worked."
Tim Menzies: "Rather than focus on generalities, empirical SE should focus more on context-specific principles."
Taking local properties of the data into consideration leads to better models!
Using Locality in Statistical Models
1. Does this principle work for statistical models?
2. Does it work for prediction?
3. Can we do better?
Building Local Models
Whole Dataset: split into Training Data and Testing Data
Training Data -> Cluster Data -> Learn Multiple Models (M1, M2, M3)
Testing Data -> Predict Individually (one set of predictions Y per cluster)
Predictions vs. actual values -> Compare
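A minimal sketch of this local pipeline, reusing the `training`/`testing` split from the previous sketch and the mclust package (an R implementation of the MCLUST clusterer named later in the paper excerpt). Routing test points to clusters via predict() is our assumption of a reasonable step, not necessarily the study's exact procedure.

```r
# Sketch of the local model pipeline: cluster the training data, learn
# one model per cluster, route each test point to its cluster's model.
# `training`, `testing`, and column names are illustrative assumptions.
library(mclust)

metrics <- setdiff(names(training), "bugs")
clu     <- Mclust(training[, metrics])                 # Cluster Data

models <- lapply(seq_len(clu$G), function(k)           # M1, M2, M3, ...
  lm(bugs ~ ., data = training[clu$classification == k, ]))

# Assign each test observation to a cluster, then Predict Individually
assignment  <- predict(clu, newdata = testing[, metrics])$classification
predictions <- sapply(seq_len(nrow(testing)), function(i)
  predict(models[[assignment[i]]], newdata = testing[i, ]))

cor(predictions, testing$bugs, method = "spearman")    # Compare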
Global Statistical Model

[Figure: a linear spline function with knots at a = 1, b = 3, c = 5, from "Chapter 2: General Aspects of Fitting Regression Models". The model is $C(Y|X) = f(X) = X\beta$, where $X\beta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$, and $X_1 = X$, $X_2 = (X - a)_+$, $X_3 = (X - b)_+$, $X_4 = (X - c)_+$. Overall linearity in $X$ can be tested by testing $H_0: \beta_2 = \beta_3 = \beta_4 = 0$.]

Model fit leaves much room for improvement!
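Such a spline model can be fit directly with ordinary least squares. A small sketch with toy data (x and y are illustrative, not from the study):

```r
# Linear spline in x with knots at a = 1, b = 3, c = 5, using the
# (x - k)+ truncated basis from the formula above. Toy data only.
set.seed(1)
x <- runif(200, 0, 6)
y <- 2 + 0.5 * x - 1.5 * pmax(x - 1, 0) + 2 * pmax(x - 3, 0) -
     1.8 * pmax(x - 5, 0) + rnorm(200, sd = 0.3)

pos <- function(u) pmax(u, 0)                 # the (.)+ operator
fit <- lm(y ~ x + pos(x - 1) + pos(x - 3) + pos(x - 5))

# Test overall linearity, H0: beta2 = beta3 = beta4 = 0, by comparing
# against the plain linear model
anova(lm(y ~ x), fit)
```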
Local Statistical Model

[The same linear spline figure, with the data split into two regions that are fit separately: Model 1 and Model 2.]

Improved fit!
How can we use this approach to get an even better fit?
Be Even More Local!

[The same figure, now partitioned into many small regions, each with its own model.]

Great fit! BUT: risk of overfitting the data!
Clustering is independent of the fit.
[Series of spline figures illustrating the trade-off between local fit and global overfit.]

Optimize local fit with respect to minimizing global overfit.

Multivariate Adaptive Regression Splines (MARS): create local knowledge that optimizes the process globally.
Case Study
Xalan 2.6, Lucene 2.4: post-release defects per class (20 CK metrics)
CHINA: total development effort in hours (14 FP metrics)
NasaCoc: development length in months (24 COCOMO-II metrics)
Results: Goodness of Fit
Rank correlation (0 = worst fit, 1 = optimal fit):

Dataset     | Global | Local (Clustered) | MARS
Xalan 2.6   | 0.33   | 0.52              | 0.69
Lucene 2.4  | 0.32   | 0.60              | 0.83
CHINA       | 0.83   | 0.89              | 0.89
NasaCOC     | 0.93   | 0.97              | 0.99

Up to 2.5x better fit when using data locality!
[Figure 3: Number of clusters generated by MCLUST in each run of the 10-fold cross validation (between 2 and 8 clusters per fold; datasets: CHINA, Lucene 2.4, NasaCoc, Xalan 2.6).]
[The Bayesian Information Criterion adds a penalty] term for each additional prediction variable entering the regression model [23].

For practical purposes, we use a publicly available implementation of BIC-based model selection, contained in the R package BMA. The input to the BMA implementation is the dataset itself, as well as a list of all dependent and independent variables that should be considered. In our case study, we always supply a list of all independent variables that were left after VIF analysis. The output of the BMA implementation is a selection of independent variables, such that the linear regression model built with these variables will produce the best fit to the data, while avoiding overfitting.
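As a rough sketch of that selection step, reusing the illustrative `training` data frame and `metrics` names from the earlier sketches (the exact BMA call options used in the study are not shown on the slides):

```r
# BIC-based variable selection with the BMA package, as described above.
library(BMA)

x <- training[, metrics]  # independent variables left after VIF analysis
y <- training$bugs        # dependent variable

sel <- bicreg(x, y)       # BIC-based model selection
summary(sel)
# Variables with posterior inclusion probability above 50%
# (probne0 is reported in percent)
sel$namesx[sel$probne0 > 50]
```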
D. Multivariate Adaptive Regression Splines

We use Multivariate Adaptive Regression Splines [11], or MARS, models as an example of a global prediction model that takes local considerations of the dataset into account. MARS models have become increasingly popular in the medical and social sciences, as well as in economics, where they are used with great success [3], [21], [28]. A MARS model has the form $Y = \epsilon_0 + c_1 H(X_1) + \cdots + c_n H(X_n)$, with $Y$ the dependent variable (that is to be predicted), $c_i$ the i-th hinge coefficient, and $H(X_i)$ the i-th "hinge function". Hinge functions are an integral part of MARS models, as they allow describing non-linear relationships in the data. In particular, they partition the data into disjoint regions that can then be described separately (our notion of local considerations). In general, hinge functions used in MARS models take the form of either $H(X_i) = \max(0, X_i - c)$ or $H(X_i) = \max(0, c - X_i)$, with $c$ some constant real value and $X_i$ an independent (predictor) variable.

A MARS model is built in two separate phases. In the forward phase, MARS starts with a model consisting of just the intercept term (the mean of the dependent variable). It then repeatedly adds hinge functions in pairs to the model. At each step, it finds the pair of functions that gives the maximum reduction in residual error. This process of adding terms continues until the change in residual error is too small to continue, or until a maximum number of terms is reached. In our case study, the maximum number of terms is automatically determined by the implementation, based on the number of independent variables we give as input. For MARS models, we use all independent variables in a dataset after VIF analysis.

The first phase often builds a model that suffers from overfitting. As a result, the second phase, called the backward phase, prunes the model to increase its generalization ability. The backward phase removes individual terms, deleting the least effective term at each step, until it finds the best submodel. Model subsets are compared using a performance criterion specific to MARS models, and the best model is selected and returned as the final prediction model.

MARS models have two advantages. First, a model selection phase is already built in by design, so we do not need to carry out a model selection step similar to BIC, as we do with global and local models. Second, the hinge functions in MARS models already model disjoint regions of the dataset separately, so there is no need for prior partitioning of the dataset with a clustering algorithm. Both advantages make this type of prediction model very easy to use in practice.
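A minimal sketch of fitting such a model with the earth package, the common R implementation of MARS (and the one visible later in the Figure 6 caption, earth(formula = f, data = training1)); data names are carried over from the earlier sketches:

```r
library(earth)

# Forward pass (adding hinge-function pairs) and backward pruning both
# happen inside this one call.
mars <- earth(bugs ~ ., data = training)
summary(mars)                                    # selected hinge terms
predictions <- predict(mars, newdata = testing)
cor(as.vector(predictions), testing$bugs, method = "spearman")
```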
E. Cross Validation

For better generalizability of our results, and to counter random observation bias, all experiments described in our case study are repeated 10 times on random stratified subsamples of the data into training (90% of the data) and testing (10% of the data) sets. Stratification is carried out on the measure that is to be predicted. We evaluate all our findings based on the average over the 10 repetitions. This practice of evaluating results is a common approach in machine learning, and is often referred to as "10-fold cross validation" [27].
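A sketch of that protocol; createFolds from the caret package stratifies a numeric response by quantile groups, which approximates the stratification described above (an assumption, not the study's exact code):

```r
library(caret)  # createFolds: stratified fold creation
library(earth)
set.seed(1)

folds  <- createFolds(dataset$bugs, k = 10)  # stratified on the response
scores <- sapply(folds, function(test_idx) {
  fit <- earth(bugs ~ ., data = dataset[-test_idx, ])
  cor(as.vector(predict(fit, dataset[test_idx, ])),
      dataset$bugs[test_idx], method = "spearman")
})
mean(scores)  # findings are evaluated as the average over the repetitions
```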
IV. RESULTS

In this section, we present the results of our case study. The presentation is carried out in three steps that follow our initial three research questions. For each part, we first
Results: Prediction Error

[Bar charts of prediction error per dataset; legend: Global, Local, MARS. Values shown: Xalan 2.6: 0.40, 0.52, 0.64; Lucene 2.4: 0.94, 1.15, 1.15; CHINA: 234.43, 552.85, 765; NasaCoC: 1.63, 2.14, 3.26.]

Up to 4x lower prediction error with local models!
Model Interpretation?
Model Interpretation

Traditional global model: general trends, one response curve per metric.

[Figure 6: Global models report general trends, while global models with local considerations give insights into different regions of the data. The Y-axis describes the response (here: bugs) while all other prediction variables are kept at their median values. (a) Part of a global model, lm(formula = f, data = training1), learned on the Xalan 2.6 dataset; panels: avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, max_cc, mfa, moa, noc, npm. (b) Part of a global model with local considerations, earth(formula = f, data = training1), learned on the Xalan 2.6 dataset; panels: cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, mfa, moa, noc, npm.]

[Figure 7: Example of contradicting trends in local models (Xalan 2.6, Cluster 1 and Cluster 6 in Fold 9); panels: ic, npm, mfa per cluster.]
The hinge functions of the MARS model already partition the data into regions with individual properties. For example, we observe that an increase of ic (measuring the inheritance coupling through parent classes) is predicted to only have a negative effect on bug-proneness when it attains values larger than 1. Thus a practitioner might decide on a different course of action than he or she would have based on the trends outlined by a global model.
V. CONCLUSIONS

In this study, we investigated the differences between three types of prediction models. Global models are built on software engineering datasets as-is, while for local models we first subdivide the datasets into subsets of data with similar observations, before building individual models on each subset. In addition, we also studied a third approach: multivariate adaptive regression splines as a global model with local considerations. MARS by design takes local considerations of individual regions of the data into account, and can thus be considered a hybrid between global and local models.

A. Think Locally

We evaluated each of the three modelling strategies in a case study on four different datasets, which have been used in prior research on the WHICH machine-learning algorithm [14]. The results of our case study demonstrate that clustering a dataset into regions with similar properties, and using the individual regions for building prediction models, leads to an improved fit of these models. Our findings thus confirm the results of Menzies et al., who observed a similar effect of data localization on their WHICH machine-learning algorithm. These increased fits have practical implications for researchers concerned with using regression models for understanding: local models are more insightful than global models, which report only general trends across the whole dataset, whereas we have demonstrated that such general trends may not hold true for particular parts of the dataset. For example, we have seen in the Xalan 2.6 defect prediction dataset that particular sets of classes are influenced differently by attributes such as inheritance, cohesion, and complexity. Our findings reinforce the recommendations of Menzies et al. against the use of a "one-size-fits-all" approach, such as a global model, when trying to account for such localized effects.

B. Act Globally

When the goal is carrying out actual predictions, rather than understanding, local models show only small improvements over global models with respect to prediction error and ranking. In particular, building local models involves a significant overhead due to clustering of the data. Even though clustering algorithms such as the one presented in the work by Menzies et al. [14] might run in linear time, we still have to learn a multitude of models, one for each cluster. One particular point that we have not addressed in our study is whether the choice of clustering algorithm influences the final performance of the local models. While our choice was to use a state-of-the-art model-based clustering algorithm that partitions data along dimensions of highest variability, future research may want to look deeper into the effect that different clustering approaches have on the performance of local models.

Surprisingly, we found that the relatively small increase in prediction performance of local models is offset by an increased error variance. While predictions from local models are close to the actual values most of the time, we observed occasional very high errors. In other words, while global models are not as accurate as local models, their worst-case scenarios are not as bad as we observe with local models. We want to note, however, that this finding stands in conflict with the findings of Menzies et al., who observed the opposite: their clustering algorithm decreases
Model Interpretation (continued)

Local (clustered) model: many, many, many trends! One full set of response curves per cluster (e.g., Cluster 1 and Cluster 6), as in Figure 7 above.
20
Local (Clustered) Model: Many, many, many Trends!
0 5 10 15 20−2.5
−1.5
−0.5
0.5
1 avg_cc
0 50 100 150
0.50
0.60
0.70
0.80
2 ca
0.0 0.2 0.4 0.6 0.8 1.0
0.44
0.48
0.52
3 cam
0 5 10 15 20 25 30
0.5
0.7
0.9
1.1
4 cbm
0 10 20 30 40 50
0.50
0.54
0.58
0.62
5 ce
0.0 0.2 0.4 0.6 0.8 1.0
0.35
0.45
6 dam
1 2 3 4 5 6 7 8
0.3
0.4
0.5
0.6
7 dit
0 1 2 3 4 5
0.50
0.55
0.60
0.65
8 ic
0 1000 3000 50000.6
1.0
1.4
1.8 9 lcom
0.0 0.5 1.0 1.5 2.0
0.3
0.4
0.5
0.6
0.7
10 lcom3
0 1000 2000 3000 4000
0.5
1.0
1.5
2.0
11 loc
0 20 40 60 80 120
12
34
12 max_cc
0.0 0.2 0.4 0.6 0.8 1.0
0.45
0.47
0.49
0.51
13 mfa
0 5 10 15
0.50
0.54
0.58
14 moa
0 5 10 15 20 25 30
0.42
0.46
0.50
15 noc
0 20 40 60 80 100 120
0.50
0.60
0.70
16 npm
lm(formula=f,data=training1)
(a) Part of a global Model learned on the Xalan 2.6 dataset
0.0 0.2 0.4 0.6 0.8 1.0
0.8
1.2
1.6
1 cam
0 5 10 15 20 25 30
1.0
1.2
1.4
1.6
1.8
2 cbm
0 10 20 30 40 50
0.2
0.4
0.6
0.8
1.0
3 ce
0.0 0.2 0.4 0.6 0.8 1.0
0.65
0.75
0.85
4 dam
1 2 3 4 5 6 7 8
0.2
0.4
0.6
0.8
5 dit
0 1 2 3 4 5
0.9
1.1
1.3
1.5
6 ic
0 1000 3000 5000
1.0
1.5
2.0
2.5
3.0 7 lcom
0.0 0.5 1.0 1.5 2.00.55
0.65
0.75
0.85
8 lcom3
0 1000 2000 3000 4000
12
34
56
9 loc
0.0 0.2 0.4 0.6 0.8 1.0
0.8
1.0
1.2
1.4
10 mfa
0 5 10 15
0.0
0.2
0.4
0.6
0.8
11 moa
0 5 10 15 20 25 30
0.0
0.5
1.0
1.5
12 noc
0 20 40 60 80 100 120
−1.0
0.0
0.5
1.0
13 npm
bug earth(formula=f,data=training1)
(b) Part of a Global model with local considerations learned on the Xalan2.6 dataset
Figure 6: Global models report general trends, while global models with local considerations give insights into different regions of the data. The Y-Axisdescribes the response (in this case bugs) while keeping all other prediction variables at their median values.
0 2 4 6 8 10
��
���
���
��� ��� ��� ��� ��� ���
��
��
��� ��� ��� ���
��
���
���
0 10 20 30 40 �� 60
��
���
���
���
��� ��� ��� ��� ��� ���0
12
30 1 2 3 4
��
��
���
ic
npm
npm
ic mfa
mfa
Fold 9, Cluster 1
Fold 9, Cluster 6
Figure 7: Example of contradicting trends in local models (Xalan 2.6,Cluster 1 and Cluster 6 in Fold 9).
model already partition the data into regions with individualproperties. For example, we observe that an increase of ic(measuring the inheritance coupling through parent classes)is predicted to only have a negative effect on bug-pronenesswhen it attains values larger than 1. Thus a practitioner mightdecide a different course of action than he or she would havedone based on the trends outlined by a global model.
V. CONCLUSIONS
In this study, we investigated the difference of threedifferent types of prediction models. Global models arebuilt on software engineering datasets as-is, while forlocal models we first subdivide the datasets into subsets ofdata with similar observations, before building individualmodels on each subset. In addition, we also studied athird approach: multivariate adaptive regression splines asa global model with local considerations. MARS by designtakes local considerations of individual regions of the datainto account, and can thus be considered a hybrid betweenglobal and local models.
A. Think LocallyWe evaluated each of the three modelling strategies in acase study on four different datasets, which have beenused in prior research on the WHICH machine-learningalgorithm [14]. The results of our case study demonstratethat clustering of a dataset into regions with similarproperties and using the individual regions for building of
prediction models leads to an improved fit of these models.Our findings thus confirm the results of Menzies et al.,who observed a similar effect of data localization on theirWHICH machine-learning algorithm. These increased fitshave practical implications for researchers concerned inusing regression models for understanding: local modelsare more insightful than global models, which report onlygeneral trends across the whole dataset, whereas we havedemonstrated that such general trends may not hold true forparticular parts of the dataset. For example, we have seenin the Xalan 2.6 defect prediction dataset that particularsets of classes are influenced differently by attributes suchas inheritance, cohesion and complexity. Our findingsreinforce the recommendations of Menzies et al. againstthe use of a “one-size-fits-all” approach, such as a globalmodel, when trying to account for such localized effects.
B. Act GloballyWhen the goal is carrying out actual predictions, rather thanunderstanding, local models show only small improvementsover global models, with respect to prediction error andranking. In particular, building local models involves asignificant overhead due to clustering of the data. Eventhough clustering algorithms such as the one presented inthe work by Menzies et al. [14] might run in linear time, westill have to learn a multitude of models, one for each cluster.One particular point that we have not addressed in our studyis whether the choice of clustering algorithm influences thefinal performance of the local models. While our choice wasto use a state-of-the-art model-based clustering algorithmthat partitions data along dimensions of highest variability,future research may want to look deeper into the effect thatdifferent clustering approaches have on the performance oflocal models.
Surprisingly, we found that the relatively small increase in prediction performance of local models is offset by an increased error variance. While predictions from local models are close to the actual values most of the time, we observed occasional very high errors. In other words, while global models are not as accurate as local models, their worst-case scenarios are not as bad as those we observe with local models. We want to note, however, that this finding stands in conflict with the findings of Menzies et al., who observed the opposite: their clustering algorithm decreases the variance of prediction errors.
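This trade-off can be made concrete by comparing the two error distributions on a held-out fold; a minimal sketch under the same assumptions, where testing1 is a hypothetical hold-out set shaped like training1:

err_global <- testing1$bug - predict(global, newdata = testing1)
err_local  <- testing1$bug - predict_local(local, cl, testing1)

c(var(err_global), var(err_local))            # local models: higher error variance
c(max(abs(err_global)), max(abs(err_local)))  # ...and worse worst-case errors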
[Figure 6(a) plots: per-metric response curves from lm(formula = f, data = training1), one panel each for avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, max_cc, mfa, moa, noc, and npm]
(a) Part of a global model learned on the Xalan 2.6 dataset
[Figure 6(b) plots: per-metric response curves from earth(formula = f, data = training1), one panel each for cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, mfa, moa, noc, and npm]
(b) Part of a global model with local considerations learned on the Xalan 2.6 dataset
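Panels of this kind, where one predictor varies and all others are held at their median values, can be drawn directly from the fitted models. A sketch assuming the plotmo package; the figures above do not name the plotting tool, so this is one plausible way to reproduce them:

library(plotmo)  # response-curve plots; non-varying predictors held at medians

plotmo(global)   # Figure 6(a): straight-line trends from the global lm fit
plotmo(hybrid)   # Figure 6(b): piecewise trends with MARS hinge knots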
Model Interpretation
Local (Clustered) Model: Many, many, many Trends!
Cluster 1
Cluster 6
...
Sometimes even contradict
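One way to surface such contradictions is to compare the fitted coefficient of the same metric across the local models; a sketch under the earlier assumptions:

# sign flips across clusters indicate contradicting local trends (e.g. for ic)
sapply(local, function(m) coef(m)["ic"])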
Regression Splines: Local Trends in a Single Curve
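The MARS fit summarizes those per-region trends as hinge terms inside a single model, which can be read directly from the fitted object; a sketch under the earlier assumptions:

summary(hybrid)  # prints coefficients on hinge terms such as h(ic - 1),
                 # each knot marking where a local trend begins or ends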
Combines the best of both worlds!
Using Locality in Data to build better Statistical Models.
Global vs. Local = Two Extremes
Build Local Models, globally Optimized:
• combines the best of both worlds
• outperforms global and clustered local models
• summarizes local trends in a single curve