Talk given at the 2012 Working Conference on Mining Software Repositories (MSR'12) in Zürich, Switzerland.
Think Locally, Act Globally: Improving Defect and Effort Prediction Models
Nicolas Bettenburg, Meiyappan Nagappan, Ahmed E. Hassan
Queen's University, Kingston, ON, Canada
Software Analysis & Intelligence Lab
Data Modelling in Empirical SE
Observations: measured from project data
Model: describe observations mathematically
Understanding: guide decision making
Prediction: guide process optimizations and future research
Model Building Today
Whole Dataset: split into Training Data and Testing Data
Training Data -> Learned Model (M)
Testing Data + Learned Model -> Predictions (Y)
Predictions vs. actual values -> Compare
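This pipeline maps directly onto a few lines of R. Below is a minimal sketch, assuming a data frame `dataset` with a numeric response column `bugs`; all names are illustrative, not taken from the original study.

```r
# Minimal sketch of the global model building pipeline above; `dataset`
# and its response column `bugs` are illustrative assumptions.
set.seed(42)
idx      <- sample(seq_len(nrow(dataset)), size = 0.9 * nrow(dataset))
training <- dataset[idx, ]    # Training Data
testing  <- dataset[-idx, ]   # Testing Data

model       <- lm(bugs ~ ., data = training)     # Learned Model (M)
predictions <- predict(model, newdata = testing) # Predictions (Y)

# Compare: rank correlation between predicted and actual values
cor(predictions, testing$bugs, method = "spearman")
```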
Much research effort on new metrics and new models!
Maybe we need to look more at the data part.
In the Field
Tom Zimmermann: "We ran 622 cross-project predictions and found that only 3.4% actually worked."
Tim Menzies: "Rather than focus on generalities, empirical SE should focus more on context-specific principles."
Taking local properties of the data into consideration leads to better models!
Using Locality in Statistical Models
1. Does this principle work for statistical models?
2. Does it work for prediction?
3. Can we do better?
Building Local Models
Whole Dataset: split into Training Data and Testing Data
Training Data -> Cluster Data -> Learn Multiple Models (M1, M2, M3)
Testing Data -> Predict Individually (one set of predictions Y per cluster)
Predictions vs. actual values -> Compare
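A minimal sketch of this local pipeline, reusing the `training`/`testing` split from the previous sketch and the mclust package (an R implementation of the MCLUST clusterer named later in the paper excerpt). Routing test points to clusters via predict() is our assumption of a reasonable step, not necessarily the study's exact procedure.

```r
# Sketch of the local model pipeline: cluster the training data, learn
# one model per cluster, route each test point to its cluster's model.
# `training`, `testing`, and column names are illustrative assumptions.
library(mclust)

metrics <- setdiff(names(training), "bugs")
clu     <- Mclust(training[, metrics])                 # Cluster Data

models <- lapply(seq_len(clu$G), function(k)           # M1, M2, M3, ...
  lm(bugs ~ ., data = training[clu$classification == k, ]))

# Assign each test observation to a cluster, then Predict Individually
assignment  <- predict(clu, newdata = testing[, metrics])$classification
predictions <- sapply(seq_len(nrow(testing)), function(i)
  predict(models[[assignment[i]]], newdata = testing[i, ]))

cor(predictions, testing$bugs, method = "spearman")    # Compare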
Global Statistical Model

[Figure: a linear spline function with knots at a = 1, b = 3, c = 5, from "Chapter 2: General Aspects of Fitting Regression Models". The model is $C(Y|X) = f(X) = X\beta$, where $X\beta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$, and $X_1 = X$, $X_2 = (X - a)_+$, $X_3 = (X - b)_+$, $X_4 = (X - c)_+$. Overall linearity in $X$ can be tested by testing $H_0: \beta_2 = \beta_3 = \beta_4 = 0$.]

Model fit leaves much room for improvement!
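Such a spline model can be fit directly with ordinary least squares. A small sketch with toy data (x and y are illustrative, not from the study):

```r
# Linear spline in x with knots at a = 1, b = 3, c = 5, using the
# (x - k)+ truncated basis from the formula above. Toy data only.
set.seed(1)
x <- runif(200, 0, 6)
y <- 2 + 0.5 * x - 1.5 * pmax(x - 1, 0) + 2 * pmax(x - 3, 0) -
     1.8 * pmax(x - 5, 0) + rnorm(200, sd = 0.3)

pos <- function(u) pmax(u, 0)                 # the (.)+ operator
fit <- lm(y ~ x + pos(x - 1) + pos(x - 3) + pos(x - 5))

# Test overall linearity, H0: beta2 = beta3 = beta4 = 0, by comparing
# against the plain linear model
anova(lm(y ~ x), fit)
```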
Local Statistical Model

[The same linear spline figure, with the data split into two regions that are fit separately: Model 1 and Model 2.]

Improved fit!
How can we use this approach to get an even better fit?
Be Even More Local!

[The same figure, now partitioned into many small regions, each with its own model.]

Great fit! BUT: risk of overfitting the data!
Clustering is independent of the fit.
[Series of spline figures illustrating the trade-off between local fit and global overfit.]

Optimize local fit with respect to minimizing global overfit.

Multivariate Adaptive Regression Splines (MARS): create local knowledge that optimizes the process globally.
Case Study
Xalan 2.6, Lucene 2.4: post-release defects per class (20 CK metrics)
CHINA: total development effort in hours (14 FP metrics)
NasaCoc: development length in months (24 COCOMO-II metrics)
Results: Goodness of Fit
Rank correlation (0 = worst fit, 1 = optimal fit):

Dataset     | Global | Local (Clustered) | MARS
Xalan 2.6   | 0.33   | 0.52              | 0.69
Lucene 2.4  | 0.32   | 0.60              | 0.83
CHINA       | 0.83   | 0.89              | 0.89
NasaCOC     | 0.93   | 0.97              | 0.99

Up to 2.5x better fit when using data locality!
[Figure 3: Number of clusters generated by MCLUST in each run of the 10-fold cross validation (between 2 and 8 clusters per fold; datasets: CHINA, Lucene 2.4, NasaCoc, Xalan 2.6).]
[The Bayesian Information Criterion adds a penalty] term for each additional prediction variable entering the regression model [23].

For practical purposes, we use a publicly available implementation of BIC-based model selection, contained in the R package BMA. The input to the BMA implementation is the dataset itself, as well as a list of all dependent and independent variables that should be considered. In our case study, we always supply a list of all independent variables that were left after VIF analysis. The output of the BMA implementation is a selection of independent variables, such that the linear regression model built with these variables will produce the best fit to the data, while avoiding overfitting.
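As a rough sketch of that selection step, reusing the illustrative `training` data frame and `metrics` names from the earlier sketches (the exact BMA call options used in the study are not shown on the slides):

```r
# BIC-based variable selection with the BMA package, as described above.
library(BMA)

x <- training[, metrics]  # independent variables left after VIF analysis
y <- training$bugs        # dependent variable

sel <- bicreg(x, y)       # BIC-based model selection
summary(sel)
# Variables with posterior inclusion probability above 50%
# (probne0 is reported in percent)
sel$namesx[sel$probne0 > 50]
```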
D. Multivariate Adaptive Regression Splines

We use Multivariate Adaptive Regression Splines [11], or MARS, models as an example of a global prediction model that takes local considerations of the dataset into account. MARS models have become increasingly popular in the medical and social sciences, as well as in economics, where they are used with great success [3], [21], [28]. A MARS model has the form $Y = \epsilon_0 + c_1 H(X_1) + \cdots + c_n H(X_n)$, with $Y$ the dependent variable (that is to be predicted), $c_i$ the i-th hinge coefficient, and $H(X_i)$ the i-th "hinge function". Hinge functions are an integral part of MARS models, as they allow describing non-linear relationships in the data. In particular, they partition the data into disjoint regions that can then be described separately (our notion of local considerations). In general, hinge functions used in MARS models take the form of either $H(X_i) = \max(0, X_i - c)$ or $H(X_i) = \max(0, c - X_i)$, with $c$ some constant real value and $X_i$ an independent (predictor) variable.

A MARS model is built in two separate phases. In the forward phase, MARS starts with a model consisting of just the intercept term (the mean of the dependent variable). It then repeatedly adds hinge functions in pairs to the model. At each step, it finds the pair of functions that gives the maximum reduction in residual error. This process of adding terms continues until the change in residual error is too small to continue, or until a maximum number of terms is reached. In our case study, the maximum number of terms is automatically determined by the implementation, based on the number of independent variables we give as input. For MARS models, we use all independent variables in a dataset after VIF analysis.

The first phase often builds a model that suffers from overfitting. As a result, the second phase, called the backward phase, prunes the model to increase its generalization ability. The backward phase removes individual terms, deleting the least effective term at each step, until it finds the best submodel. Model subsets are compared using a performance criterion specific to MARS models, and the best model is selected and returned as the final prediction model.

MARS models have two advantages. First, a model selection phase is already built in by design, so we do not need to carry out a model selection step similar to BIC, as we do with global and local models. Second, the hinge functions in MARS models already model disjoint regions of the dataset separately, so there is no need for prior partitioning of the dataset with a clustering algorithm. Both advantages make this type of prediction model very easy to use in practice.
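A minimal sketch of fitting such a model with the earth package, the common R implementation of MARS (and the one visible later in the Figure 6 caption, earth(formula = f, data = training1)); data names are carried over from the earlier sketches:

```r
library(earth)

# Forward pass (adding hinge-function pairs) and backward pruning both
# happen inside this one call.
mars <- earth(bugs ~ ., data = training)
summary(mars)                                    # selected hinge terms
predictions <- predict(mars, newdata = testing)
cor(as.vector(predictions), testing$bugs, method = "spearman")
```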
E. Cross Validation

For better generalizability of our results, and to counter random observation bias, all experiments described in our case study are repeated 10 times on random stratified subsamples of the data into training (90% of the data) and testing (10% of the data) sets. Stratification is carried out on the measure that is to be predicted. We evaluate all our findings based on the average over the 10 repetitions. This practice of evaluating results is a common approach in machine learning, and is often referred to as "10-fold cross validation" [27].
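A sketch of that protocol; createFolds from the caret package stratifies a numeric response by quantile groups, which approximates the stratification described above (an assumption, not the study's exact code):

```r
library(caret)  # createFolds: stratified fold creation
library(earth)
set.seed(1)

folds  <- createFolds(dataset$bugs, k = 10)  # stratified on the response
scores <- sapply(folds, function(test_idx) {
  fit <- earth(bugs ~ ., data = dataset[-test_idx, ])
  cor(as.vector(predict(fit, dataset[test_idx, ])),
      dataset$bugs[test_idx], method = "spearman")
})
mean(scores)  # findings are evaluated as the average over the repetitions
```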
IV. RESULTS

In this section, we present the results of our case study. The presentation is carried out in three steps that follow our initial three research questions. For each part, we first
Results: Prediction Error

[Bar charts of prediction error per dataset; legend: Global, Local, MARS. Values shown: Xalan 2.6: 0.40, 0.52, 0.64; Lucene 2.4: 0.94, 1.15, 1.15; CHINA: 234.43, 552.85, 765; NasaCoC: 1.63, 2.14, 3.26.]

Up to 4x lower prediction error with local models!
Model Interpretation?
Model Interpretation

Traditional global model: general trends, one response curve per metric.

[Figure 6: Global models report general trends, while global models with local considerations give insights into different regions of the data. The Y-axis describes the response (here: bugs) while all other prediction variables are kept at their median values. (a) Part of a global model, lm(formula = f, data = training1), learned on the Xalan 2.6 dataset; panels: avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, max_cc, mfa, moa, noc, npm. (b) Part of a global model with local considerations, earth(formula = f, data = training1), learned on the Xalan 2.6 dataset; panels: cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, mfa, moa, noc, npm.]

[Figure 7: Example of contradicting trends in local models (Xalan 2.6, Cluster 1 and Cluster 6 in Fold 9); panels: ic, npm, mfa per cluster.]
The hinge functions of the MARS model already partition the data into regions with individual properties. For example, we observe that an increase of ic (measuring the inheritance coupling through parent classes) is predicted to only have a negative effect on bug-proneness when it attains values larger than 1. Thus a practitioner might decide on a different course of action than he or she would have based on the trends outlined by a global model.
V. CONCLUSIONS

In this study, we investigated the differences between three types of prediction models. Global models are built on software engineering datasets as-is, while for local models we first subdivide the datasets into subsets of data with similar observations, before building individual models on each subset. In addition, we also studied a third approach: multivariate adaptive regression splines as a global model with local considerations. MARS by design takes local considerations of individual regions of the data into account, and can thus be considered a hybrid between global and local models.

A. Think Locally

We evaluated each of the three modelling strategies in a case study on four different datasets, which have been used in prior research on the WHICH machine-learning algorithm [14]. The results of our case study demonstrate that clustering a dataset into regions with similar properties, and using the individual regions for building prediction models, leads to an improved fit of these models. Our findings thus confirm the results of Menzies et al., who observed a similar effect of data localization on their WHICH machine-learning algorithm. These increased fits have practical implications for researchers concerned with using regression models for understanding: local models are more insightful than global models, which report only general trends across the whole dataset, whereas we have demonstrated that such general trends may not hold true for particular parts of the dataset. For example, we have seen in the Xalan 2.6 defect prediction dataset that particular sets of classes are influenced differently by attributes such as inheritance, cohesion, and complexity. Our findings reinforce the recommendations of Menzies et al. against the use of a "one-size-fits-all" approach, such as a global model, when trying to account for such localized effects.

B. Act Globally

When the goal is carrying out actual predictions, rather than understanding, local models show only small improvements over global models with respect to prediction error and ranking. In particular, building local models involves a significant overhead due to clustering of the data. Even though clustering algorithms such as the one presented in the work by Menzies et al. [14] might run in linear time, we still have to learn a multitude of models, one for each cluster. One particular point that we have not addressed in our study is whether the choice of clustering algorithm influences the final performance of the local models. While our choice was to use a state-of-the-art model-based clustering algorithm that partitions data along dimensions of highest variability, future research may want to look deeper into the effect that different clustering approaches have on the performance of local models.

Surprisingly, we found that the relatively small increase in prediction performance of local models is offset by an increased error variance. While predictions from local models are close to the actual values most of the time, we observed occasional very high errors. In other words, while global models are not as accurate as local models, their worst-case scenarios are not as bad as we observe with local models. We want to note, however, that this finding stands in conflict with the findings of Menzies et al., who observed the opposite: their clustering algorithm decreases
Model Interpretation (continued)

Local (clustered) model: many, many, many trends! One full set of response curves per cluster (e.g., Cluster 1 and Cluster 6), as in Figure 7 above.
20
Local (Clustered) Model: Many, many, many Trends!
0 5 10 15 20−2.5
−1.5
−0.5
0.5
1 avg_cc
0 50 100 150
0.50
0.60
0.70
0.80
2 ca
0.0 0.2 0.4 0.6 0.8 1.0
0.44
0.48
0.52
3 cam
0 5 10 15 20 25 30
0.5
0.7
0.9
1.1
4 cbm
0 10 20 30 40 50
0.50
0.54
0.58
0.62
5 ce
0.0 0.2 0.4 0.6 0.8 1.0
0.35
0.45
6 dam
1 2 3 4 5 6 7 8
0.3
0.4
0.5
0.6
7 dit
0 1 2 3 4 5
0.50
0.55
0.60
0.65
8 ic
0 1000 3000 50000.6
1.0
1.4
1.8 9 lcom
0.0 0.5 1.0 1.5 2.0
0.3
0.4
0.5
0.6
0.7
10 lcom3
0 1000 2000 3000 4000
0.5
1.0
1.5
2.0
11 loc
0 20 40 60 80 120
12
34
12 max_cc
0.0 0.2 0.4 0.6 0.8 1.0
0.45
0.47
0.49
0.51
13 mfa
0 5 10 15
0.50
0.54
0.58
14 moa
0 5 10 15 20 25 30
0.42
0.46
0.50
15 noc
0 20 40 60 80 100 120
0.50
0.60
0.70
16 npm
lm(formula=f,data=training1)
(a) Part of a global Model learned on the Xalan 2.6 dataset
0.0 0.2 0.4 0.6 0.8 1.0
0.8
1.2
1.6
1 cam
0 5 10 15 20 25 30
1.0
1.2
1.4
1.6
1.8
2 cbm
0 10 20 30 40 50
0.2
0.4
0.6
0.8
1.0
3 ce
0.0 0.2 0.4 0.6 0.8 1.0
0.65
0.75
0.85
4 dam
1 2 3 4 5 6 7 8
0.2
0.4
0.6
0.8
5 dit
0 1 2 3 4 5
0.9
1.1
1.3
1.5
6 ic
0 1000 3000 5000
1.0
1.5
2.0
2.5
3.0 7 lcom
0.0 0.5 1.0 1.5 2.00.55
0.65
0.75
0.85
8 lcom3
0 1000 2000 3000 4000
12
34
56
9 loc
0.0 0.2 0.4 0.6 0.8 1.0
0.8
1.0
1.2
1.4
10 mfa
0 5 10 15
0.0
0.2
0.4
0.6
0.8
11 moa
0 5 10 15 20 25 30
0.0
0.5
1.0
1.5
12 noc
0 20 40 60 80 100 120
−1.0
0.0
0.5
1.0
13 npm
bug earth(formula=f,data=training1)
(b) Part of a Global model with local considerations learned on the Xalan2.6 dataset
Figure 6: Global models report general trends, while global models with local considerations give insights into different regions of the data. The Y-Axisdescribes the response (in this case bugs) while keeping all other prediction variables at their median values.
0 2 4 6 8 10
��
���
���
��� ��� ��� ��� ��� ���
��
��
��� ��� ��� ���
��
���
���
0 10 20 30 40 �� 60
��
���
���
���
��� ��� ��� ��� ��� ���0
12
30 1 2 3 4
��
��
���
ic
npm
npm
ic mfa
mfa
Fold 9, Cluster 1
Fold 9, Cluster 6
Figure 7: Example of contradicting trends in local models (Xalan 2.6,Cluster 1 and Cluster 6 in Fold 9).
model already partition the data into regions with individualproperties. For example, we observe that an increase of ic(measuring the inheritance coupling through parent classes)is predicted to only have a negative effect on bug-pronenesswhen it attains values larger than 1. Thus a practitioner mightdecide a different course of action than he or she would havedone based on the trends outlined by a global model.
V. CONCLUSIONS
In this study, we investigated the difference of threedifferent types of prediction models. Global models arebuilt on software engineering datasets as-is, while forlocal models we first subdivide the datasets into subsets ofdata with similar observations, before building individualmodels on each subset. In addition, we also studied athird approach: multivariate adaptive regression splines asa global model with local considerations. MARS by designtakes local considerations of individual regions of the datainto account, and can thus be considered a hybrid betweenglobal and local models.
A. Think LocallyWe evaluated each of the three modelling strategies in acase study on four different datasets, which have beenused in prior research on the WHICH machine-learningalgorithm [14]. The results of our case study demonstratethat clustering of a dataset into regions with similarproperties and using the individual regions for building of
prediction models leads to an improved fit of these models.Our findings thus confirm the results of Menzies et al.,who observed a similar effect of data localization on theirWHICH machine-learning algorithm. These increased fitshave practical implications for researchers concerned inusing regression models for understanding: local modelsare more insightful than global models, which report onlygeneral trends across the whole dataset, whereas we havedemonstrated that such general trends may not hold true forparticular parts of the dataset. For example, we have seenin the Xalan 2.6 defect prediction dataset that particularsets of classes are influenced differently by attributes suchas inheritance, cohesion and complexity. Our findingsreinforce the recommendations of Menzies et al. againstthe use of a “one-size-fits-all” approach, such as a globalmodel, when trying to account for such localized effects.
B. Act GloballyWhen the goal is carrying out actual predictions, rather thanunderstanding, local models show only small improvementsover global models, with respect to prediction error andranking. In particular, building local models involves asignificant overhead due to clustering of the data. Eventhough clustering algorithms such as the one presented inthe work by Menzies et al. [14] might run in linear time, westill have to learn a multitude of models, one for each cluster.One particular point that we have not addressed in our studyis whether the choice of clustering algorithm influences thefinal performance of the local models. While our choice wasto use a state-of-the-art model-based clustering algorithmthat partitions data along dimensions of highest variability,future research may want to look deeper into the effect thatdifferent clustering approaches have on the performance oflocal models.
Surprisingly, we found that the relatively small increase in prediction performance of local models is offset by an increased error variance. While predictions from local models are close to the actual values most of the time, we observed occasional very high errors. In other words, while global models are not as accurate as local models, their worst-case scenarios are not as bad as those we observe with local models. We want to note, however, that this finding stands in conflict with the findings of Menzies et al., who observed the opposite: their clustering algorithm decreases the variance of prediction errors.
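This trade-off can be made concrete by comparing the two error distributions on a held-out fold; a minimal sketch under the same assumptions, where testing1 is a hypothetical hold-out set shaped like training1:

err_global <- testing1$bug - predict(global, newdata = testing1)
err_local  <- testing1$bug - predict_local(local, cl, testing1)

c(var(err_global), var(err_local))            # local models: higher error variance
c(max(abs(err_global)), max(abs(err_local)))  # ...and worse worst-case errors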
[Figure 6(a) plots: per-metric response curves from lm(formula = f, data = training1), one panel each for avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, max_cc, mfa, moa, noc, and npm]
(a) Part of a global model learned on the Xalan 2.6 dataset
[Figure 6(b) plots: per-metric response curves from earth(formula = f, data = training1), one panel each for cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, mfa, moa, noc, and npm]
(b) Part of a global model with local considerations learned on the Xalan 2.6 dataset
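Panels of this kind, where one predictor varies and all others are held at their median values, can be drawn directly from the fitted models. A sketch assuming the plotmo package; the figures above do not name the plotting tool, so this is one plausible way to reproduce them:

library(plotmo)  # response-curve plots; non-varying predictors held at medians

plotmo(global)   # Figure 6(a): straight-line trends from the global lm fit
plotmo(hybrid)   # Figure 6(b): piecewise trends with MARS hinge knots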
Model Interpretation
Local (Clustered) Model: Many, many, many Trends!
Cluster 1
Cluster 6
...
Sometimes even contradict
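One way to surface such contradictions is to compare the fitted coefficient of the same metric across the local models; a sketch under the earlier assumptions:

# sign flips across clusters indicate contradicting local trends (e.g. for ic)
sapply(local, function(m) coef(m)["ic"])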
Regression Splines: Local Trends in a Single Curve
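The MARS fit summarizes those per-region trends as hinge terms inside a single model, which can be read directly from the fitted object; a sketch under the earlier assumptions:

summary(hybrid)  # prints coefficients on hinge terms such as h(ic - 1),
                 # each knot marking where a local trend begins or ends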
Combines the best of both worlds!
Using Locality in Data to build better Statistical Models.
Global vs. Local = Two Extremes
Build Local Models, globally Optimized:
• combines the best of both worlds
• outperforms global and clustered local models
• summarizes local trends in a single curve