
Promise 2011: "A Principled Evaluation of Ensembles of Learning Machines for Software Effort Estimation"


A Principled Evaluation of Ensembles of Learning Machines for Software Effort Estimation

Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk

CERCIA, School of Computer Science, The University of Birmingham


Outline

Introduction (Background and Motivation)

Research Questions (Aims)

Experiments (Method and Results)

Answers to Research Questions (Conclusions)

Future Work


Introduction

Software cost estimation:

Set of techniques and procedures that an organisation uses to arrive at an estimate.

The major contributing factor is effort (in person-hours, person-months, etc.).

Overestimation vs. underestimation.

Several software cost/effort estimation models have been proposed.

ML models have been receiving increased attention:

They make no or minimal assumptions about the data and the function being modelled.


Introduction

Ensembles of Learning Machines are groups of learning machines trained to perform the same task and combined with the aim of improving predictive performance.

Studies comparing ensembles against single learners in software effort estimation are contradictory:

Braga et al. (IJCNN'07) claim that Bagging slightly improves the effort estimates produced by single learners.

Kultur et al. (KBS'09) claim that an adapted Bagging provides large improvements.

Kocaguneli et al. (ISSRE'09) claim that combining different learners does not improve effort estimates.

These studies either omit statistical tests or do not report the parameter choices. None of them analyses the reasons for the results achieved.


Research Questions

Question 1

Do readily available ensemble methods generally improve the effort estimates given by single learners? Which of them would be more useful?

The existing studies are contradictory.

They either do not perform statistical comparisons or do not explain the parameter choices.

It is worth investigating the use of different ensemble approaches.

We build upon the existing work by considering these points.

Question 2

If a particular method is singled out, what insight into how to improve effort estimates can we gain by analysing its behaviour and the reasons for its better performance?

Principled experiments, not just intuition or speculation.

Question 3

How can one determine which model to use for a particular data set?

Our study complements previous work; parameter choice is important.


Data Sets and Preprocessing

Data sets: cocomo81, nasa93, nasa, cocomo2, desharnais, and 7 ISBSG organisation-type subsets.

They cover a wide range of features. In particular, the ISBSG subsets' productivity rates are statistically different from each other.

Attributes: COCOMO attributes for the PROMISE data; functional size, development type and language type for ISBSG.

Missing values: deleted for PROMISE; k-NN imputation for ISBSG.

Outliers: K-means-based detection and elimination.
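The slides do not prescribe a particular implementation of these preprocessing steps; the following is a minimal sketch of k-NN imputation and K-means-based outlier removal using scikit-learn. The number of neighbours, the number of clusters and the distance threshold are illustrative assumptions rather than values from the study.

```python
# Illustrative preprocessing sketch (not the authors' code).
# k, n_clusters and max_dist_factor are assumptions for demonstration only.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.cluster import KMeans

def impute_missing(X, k=5):
    """Fill missing attribute values using the k most similar projects."""
    return KNNImputer(n_neighbors=k).fit_transform(X)

def remove_kmeans_outliers(X, y, n_clusters=3, max_dist_factor=2.0):
    """Treat projects far from their cluster centroid as outliers and drop them."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    keep = dist <= max_dist_factor * dist.mean()
    return X[keep], y[keep]
```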


Experimental Framework – Step 1: choice of learning machines

Single learners:

MultiLayer Perceptrons (MLPs) – universal approximators;
Radial Basis Function networks (RBFs) – local learning; and
Regression Trees (RTs) – simple and comprehensible.

Ensemble learners:

Bagging with MLPs, with RBFs and with RTs – widely and successfully used;
Random ensembles (Rand) with MLPs – each learner uses the full training set; and
Negative Correlation Learning (NCL) with MLPs – designed for regression.
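The slides do not say which library was used to implement these learners. As a rough illustration only, the sketch below builds a single MLP, a regression tree and Bagging ensembles of MLPs and RTs with scikit-learn; RBF networks and NCL are omitted because scikit-learn has no off-the-shelf implementation of them, and all hyperparameter values are placeholders to be tuned as in Step 3.

```python
# Illustrative learner construction (hyperparameters are placeholders only).
from sklearn.ensemble import BaggingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

def make_learners(random_state=0):
    mlp = MLPRegressor(hidden_layer_sizes=(5,), max_iter=2000,
                       random_state=random_state)          # single MLP
    rt = DecisionTreeRegressor(random_state=random_state)  # regression tree (RT)
    # Bagging trains each member on a bootstrap sample of the training set.
    # Note: scikit-learn < 1.2 uses base_estimator= instead of estimator=.
    bag_mlp = BaggingRegressor(estimator=MLPRegressor(hidden_layer_sizes=(5,),
                                                      max_iter=2000),
                               n_estimators=25, random_state=random_state)
    bag_rt = BaggingRegressor(estimator=DecisionTreeRegressor(),
                              n_estimators=25, random_state=random_state)
    return {"MLP": mlp, "RT": rt, "Bag+MLP": bag_mlp, "Bag+RT": bag_rt}
```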


Experimental Framework – Step 2: choice of evaluation method

Executions were done in 30 rounds, with 10 projects held out for testing and the remaining projects used for training, as suggested by Menzies et al. (TSE'06).

Evaluation was done in two steps:

1. Menzies et al.'s (TSE'06) survival rejection rules: if the MMREs are significantly different according to a paired t-test with 95% confidence, the best model is the one with the lowest average MMRE; if not, the best method is the one with the best (1) correlation, (2) standard deviation, (3) PRED(N), (4) number of attributes.

2. Wilcoxon tests with 95% confidence to compare the two methods most often among the best in terms of MMRE and PRED(25).
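A minimal sketch of this protocol, assuming NumPy arrays and models with scikit-learn-style fit/predict; the statistical tests use SciPy. It follows the description above but is illustrative rather than the authors' implementation.

```python
# Sketch of the 30-round hold-out evaluation and the statistical comparisons.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

def run_rounds(model_factory, X, y, rounds=30, test_size=10, seed=0):
    """Return the MMRE of each round (10 random projects held out for testing)."""
    rng = np.random.default_rng(seed)
    mmres = []
    for _ in range(rounds):
        idx = rng.permutation(len(y))
        test, train = idx[:test_size], idx[test_size:]
        model = model_factory()
        model.fit(X[train], y[train])
        mre = np.abs(model.predict(X[test]) - y[test]) / y[test]
        mmres.append(mre.mean())
    return np.array(mmres)

def compare(mmre_a, mmre_b, alpha=0.05):
    """Paired t-test (survival rules) and Wilcoxon test on per-round MMREs."""
    return {"t_test_p": ttest_rel(mmre_a, mmre_b).pvalue,
            "wilcoxon_p": wilcoxon(mmre_a, mmre_b).pvalue,
            "alpha": alpha}
```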


Experimental Framework – Step 2: choice of evaluation method

Mean Magnitude of the Relative Error:
$MMRE = \frac{1}{T}\sum_{i=1}^{T} MRE_i$, where $MRE_i = \frac{|predicted_i - actual_i|}{actual_i}$.

Percentage of estimations within N% of the actual values:
$PRED(N) = \frac{1}{T}\sum_{i=1}^{T}\begin{cases}1, & \text{if } MRE_i \le \frac{N}{100}\\0, & \text{otherwise}\end{cases}$

Correlation between estimated and actual effort:
$CORR = \frac{S_{pa}}{\sqrt{S_p S_a}}$, where

$S_{pa} = \frac{\sum_{i=1}^{T}(predicted_i - \bar{p})(actual_i - \bar{a})}{T-1}$,

$S_p = \frac{\sum_{i=1}^{T}(predicted_i - \bar{p})^2}{T-1}$, $S_a = \frac{\sum_{i=1}^{T}(actual_i - \bar{a})^2}{T-1}$,

$\bar{p} = \frac{\sum_{i=1}^{T}predicted_i}{T}$, $\bar{a} = \frac{\sum_{i=1}^{T}actual_i}{T}$.
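These measures translate directly into code; the sketch below computes MMRE, PRED(N) and CORR for NumPy arrays of predicted and actual efforts.

```python
# Error measures defined above, computed for NumPy arrays of efforts.
import numpy as np

def mmre(predicted, actual):
    return np.mean(np.abs(predicted - actual) / actual)

def pred_n(predicted, actual, n=25):
    mre = np.abs(predicted - actual) / actual
    return np.mean(mre <= n / 100.0)

def corr(predicted, actual):
    t = len(predicted)
    p_bar, a_bar = predicted.mean(), actual.mean()
    s_pa = np.sum((predicted - p_bar) * (actual - a_bar)) / (t - 1)
    s_p = np.sum((predicted - p_bar) ** 2) / (t - 1)
    s_a = np.sum((actual - a_bar) ** 2) / (t - 1)
    return s_pa / np.sqrt(s_p * s_a)
```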


Experimental Framework – Step 3: choice of parameters

Preliminary experiments using 5 runs.

Each approach was run with all combinations of 3 or 5 candidate values for each parameter.

The parameter values with the lowest MMRE were chosen for the subsequent 30 runs.

Base learners will not necessarily have the same parameters as single learners.
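A minimal sketch of this parameter-selection step, reusing the run_rounds helper from the evaluation sketch above; the factory interface and candidate grids are illustrative assumptions, not the values or code used in the study.

```python
# Illustrative grid search over candidate parameter values (5 preliminary runs).
from itertools import product

def select_parameters(factory, grid, X, y):
    """factory(**params) builds a model; grid maps a parameter name to its candidate values."""
    best_params, best_mmre = None, float("inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        mmres = run_rounds(lambda: factory(**params), X, y, rounds=5)  # preliminary runs
        if mmres.mean() < best_mmre:
            best_params, best_mmre = params, mmres.mean()
    return best_params  # this combination is then evaluated over the full 30 runs
```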


Comparison of Learning Machines – Menzies et al. TSE'06's survival rejection rules

Table: Number of Data Sets in which Each Method Survived. Methods that never survived are omitted.

PROMISE Data        ISBSG Data          All Data
RT: 2               MLP: 2              RT: 3
Bag + MLP: 1        Bag + RT: 2         Bag + MLP: 2
NCL + MLP: 1        Bag + MLP: 1        NCL + MLP: 2
Rand + MLP: 1       RT: 1               Bag + RT: 2
                    Bag + RBF: 1        MLP: 2
                    NCL + MLP: 1        Rand + MLP: 1
                                        Bag + RBF: 1

No approach is consistently the best, even considering ensembles!


Comparison of Learning Machines

What methods are usually among the best?

Table: Number of Data Sets in which Each Method Was Ranked First or Second According to MMRE and PRED(25). Methods never among the first and second are omitted.

(a) According to MMRE

PROMISE Data        ISBSG Data          All Data
RT: 4               RT: 5               RT: 9
Bag + MLP: 3        Bag + MLP: 5        Bag + MLP: 8
Bag + RT: 2         Bag + RBF: 3        Bag + RBF: 3
MLP: 1              MLP: 1              MLP: 2
                    Rand + MLP: 1       Bag + RT: 2
                    NCL + MLP: 1        Rand + MLP: 1
                                        NCL + MLP: 1

(b) According to PRED(25)

PROMISE Data        ISBSG Data          All Data
Bag + MLP: 3        RT: 5               RT: 6
Rand + MLP: 3       Rand + MLP: 3       Rand + MLP: 6
Bag + RT: 2         Bag + MLP: 2        Bag + MLP: 5
RT: 1               MLP: 2              Bag + RT: 3
MLP: 1              RBF: 2              MLP: 3
                    Bag + RBF: 1        RBF: 2
                    Bag + RT: 1         Bag + RBF: 1

RTs and bag+MLPs are more frequently among the best considering MMRE than considering PRED(25).

The first-ranked method's MMRE is statistically different from the others in 35.16% of the cases.

The second-ranked method's MMRE is statistically different from the lower-ranked methods in 16.67% of the cases.

RTs and bag+MLPs are usually statistically equal in terms of MMRE and PRED(25).


Research Questions – Revisited

Question 1

Do readily available ensemble methods generally improve the effort estimates given by single learners? Which of them would be more useful?

Even though bag+MLPs is frequently among the best methods, it is statistically similar to RTs.

RTs are more comprehensible and have faster training.

Bag+MLPs seem to have more potential for improvement.


Why Were RTs Singled Out?

Hypothesis: as RTs split based on information gain, they may give more importance to more relevant attributes.

A further study using correlation-based feature selection revealed that RTs usually place the features ranked higher by the feature selection method in higher-level splits of the tree.

Feature selection by itself was not always able to improve accuracy.

It may be important to give weights to features when using ML approaches.
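As a rough illustration of this analysis (a simplification, not the correlation-based feature selection algorithm used in the study), the sketch below ranks attributes by their absolute correlation with effort and reports the depth at which an attribute first appears as a split in a fitted scikit-learn regression tree.

```python
# Simplified sketch: correlation-based attribute ranking plus the tree level
# at which each attribute first splits a fitted regression tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def correlation_ranking(X, y, names):
    """Rank attributes by |Pearson correlation| with the effort values."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(zip(names, scores), key=lambda t: -t[1])

def first_split_level(fitted_tree, feature_index):
    """Depth at which feature_index is first used for a split, or None."""
    t = fitted_tree.tree_
    def walk(node, depth):
        if t.children_left[node] == -1:          # leaf node
            return None
        if t.feature[node] == feature_index:
            return depth
        levels = [walk(t.children_left[node], depth + 1),
                  walk(t.children_right[node], depth + 1)]
        levels = [d for d in levels if d is not None]
        return min(levels) if levels else None
    return walk(0, 0)

# Usage: rt = DecisionTreeRegressor().fit(X, y); first_split_level(rt, 0)
```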


Why Were RTs Singled Out?

Table: Correlation-Based Feature Selection and RT Attributes' Relative Importance for Cocomo81.

Attributes (ranking order) | First tree level in which the attribute appears in more than 50% of the trees | Percentage of trees
LOC | Level 0 | 100.00%
Development mode, Required software reliability | Level 1 | 90.00%
Modern programming practices, Time constraint for cpu | Level 2 | 73.33%
Data base size | Level 2 | 83.34%
Main memory constraint, Turnaround time, Programmers capability, Analysts capability, Language experience, Virtual machine experience, Schedule constraint, Application experience | Level 2 | 66.67%
Use of software tools, Machine volatility | – | –


Why Were Bag+MLPs Singled Out?

Hypothesis: bag+MLPs may have led to a more adequate level of diversity.

If we use correlation as the diversity measure, we can see that bag+MLPs usually had more moderate values when it was the 1st or 2nd ranked MMRE method.

However, the correlation between diversity and MMRE was usually quite low.

Table: Correlation Considering Data Sets in which Bag+MLPs Were Ranked 1st or 2nd.

Approach      Correlation interval across different data sets
Bag+MLP       0.74-0.92
Bag+RBF       0.40-0.83
Bag+RT        0.51-0.81
NCL+MLP       0.59-1.00
Rand+MLP      0.93-1.00

Table: Correlation Considering All Data Sets.

Approach      Correlation interval across different data sets
Bag+MLP       0.47-0.98
Bag+RBF       0.40-0.83
Bag+RT        0.37-0.88
NCL+MLP       0.59-1.00
Rand+MLP      0.93-1.00
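The diversity measure referred to above is the correlation between the ensemble members' predictions (lower correlation means higher diversity). The sketch below computes the average pairwise correlation for a fitted bagging ensemble that exposes its members via an estimators_ attribute, as scikit-learn's BaggingRegressor does; it is an illustration rather than the exact measurement procedure of the study.

```python
# Average pairwise correlation between ensemble members' predictions
# (used here as the diversity measure; lower = more diverse).
import numpy as np
from itertools import combinations

def ensemble_prediction_correlation(ensemble, X_test):
    preds = np.array([m.predict(X_test) for m in ensemble.estimators_])
    corrs = [np.corrcoef(preds[i], preds[j])[0, 1]
             for i, j in combinations(range(len(preds)), 2)]
    return float(np.mean(corrs))
```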


Taking a Closer Look...

Table: Correlations between ensemble covariance (diversity) and train/test MMRE for the data sets in which bag+MLP obtained the best MMREs and was ranked 1st or 2nd, against the data sets in which it obtained the worst MMREs.

                            Cov. vs Test MMRE    Cov. vs Train MMRE
Best MMRE (desharnais)      0.24                 0.14
2nd best MMRE (org2)        0.70                 0.38
2nd worst MMRE (org7)       -0.42                -0.37
Worst MMRE (cocomo2)        -0.99                -0.99


Diversity is not only affected by the ensemble method, but also by the data set:

Software effort estimation data sets are very different from each other.


Correlation between diversity and performance on the test set follows the tendency on the train set.

Why do we have a negative correlation in the worst cases?

Could a method that self-adapts diversity help to improve estimations? How?


Research Questions – Revisited

Question 2

If a particular method is singled out, what insight into how to improve effort estimates can we gain by analysing its behaviour and the reasons for its better performance?

RTs give more importance to more important features. Weighting attributes may be helpful when using ML for software effort estimation.

Ensembles seem to have more room for improvement for software effort estimation.

A method to self-adapt diversity might help to improve estimations.


Research Questions – Revisited

Question 3

How can one determine which model to use for a particular data set?

Effort estimation data sets dramatically affect the behaviour and performance of different learning machines, even when considering ensembles.

So, it would be necessary to run experiments (parameter choice is important) using existing data from a particular company to determine which method is likely to be the best.

If the software manager does not have enough knowledge of the models, RTs are a good choice.


Risk Analysis

The learning machines singled out (RTs and bagging+MLPs) were further tested using the outlier projects.

MMRE was similar or lower (better), usually better than for the outlier-free data sets.

PRED(25) was similar or lower (worse), usually lower.

Even though outliers are the projects for which the learning machines have more difficulty in predicting within 25% of the actual effort, they are not the projects for which they give the worst estimates.


Conclusions and Future Work

RQ1 – readily available ensembles do not generally provide better effort estimations.

Principled experiments (parameters, statistical analysis, several data sets, more ensemble approaches) to deal with validity issues.

RQ2 – RTs + weighting features; bagging with MLPs + self-adapting diversity.

Insight based on experiments, not just intuition or speculation.

RQ3 – principled experiments to choose the model; RTs if no resources.

No universally good model, even when using ensembles; parameter choice is part of the framework.

Future work:

Learning feature weights in ML for effort estimation.

Can we use self-tuning diversity in ensembles of learning machines to improve estimations?


Acknowledgements

Search Based Software Engineering (SEBASE) research group.

Dr. Rami Bahsoon.

This work was funded by EPSRC grant No. EP/D052785/1.
