
Promise 2011: "A Principled Evaluation of Ensembles of Learning Machines for Software Effort Estimation"


A Principled Evaluation of Ensembles of Learning Machines for Software Effort Estimation

Leandro Minku, Xin Yao {L.L.Minku,X.Yao}@cs.bham.ac.uk

CERCIA, School of Computer Science, The University of Birmingham


Outline

Introduction (Background and Motivation)

Research Questions (Aims)

Experiments (Method and Results)

Answers to Research Questions (Conclusions)

Future Work


Introduction

Software cost estimation:

Set of techniques and procedures that an organisation uses to arrive at an estimate.

The major contributing factor is effort (in person-hours, person-months, etc.).

Overestimation vs. underestimation.

Several software cost/effort estimation models have been proposed.

ML models have been receiving increased attention:

They make no or minimal assumptions about the data and the function being modelled.


Introduction

Ensembles of Learning Machines are groups of learning machines trained to perform the same task and combined with the aim of improving predictive performance.

Studies comparing ensembles against single learners in software effort estimation are contradictory:

Braga et al. (IJCNN'07) claim that Bagging slightly improves the effort estimates produced by single learners.

Kultur et al. (KBS'09) claim that an adapted Bagging provides large improvements.

Kocaguneli et al. (ISSRE'09) claim that combining different learners does not improve effort estimates.

These studies either omit statistical tests or do not report the parameter choices. None of them analyses the reasons for the results achieved.


Research Questions

Question 1

Do readily available ensemble methods generally improve the effort estimates given by single learners? Which of them would be more useful?

The existing studies are contradictory.

They either do not perform statistical comparisons or do not explain the parameter choices.

It is worth investigating the use of different ensemble approaches.

We build upon the existing work by considering these points.

Question 2

If a particular method is singled out, what insight into how to improve effort estimates can we gain by analysing its behaviour and the reasons for its better performance?

Principled experiments, not just intuition or speculation.

Question 3

How can one determine which model to use for a particular data set?

Our study complements previous work; parameter choice is important.


Data Sets and Preprocessing

Data sets: cocomo81, nasa93, nasa, cocomo2, desharnais, and 7 ISBSG organisation-type subsets.

They cover a wide range of features. In particular, the ISBSG subsets' productivity rates are statistically different from each other.

Attributes: COCOMO attributes for the PROMISE data; functional size, development type and language type for ISBSG.

Missing values: deleted for PROMISE; k-NN imputation for ISBSG.

Outliers: K-means-based detection and elimination.
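The slides do not prescribe a particular implementation of these preprocessing steps; the following is a minimal sketch of k-NN imputation and K-means-based outlier removal using scikit-learn. The number of neighbours, the number of clusters and the distance threshold are illustrative assumptions rather than values from the study.

```python
# Illustrative preprocessing sketch (not the authors' code).
# k, n_clusters and max_dist_factor are assumptions for demonstration only.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.cluster import KMeans

def impute_missing(X, k=5):
    """Fill missing attribute values using the k most similar projects."""
    return KNNImputer(n_neighbors=k).fit_transform(X)

def remove_kmeans_outliers(X, y, n_clusters=3, max_dist_factor=2.0):
    """Treat projects far from their cluster centroid as outliers and drop them."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    keep = dist <= max_dist_factor * dist.mean()
    return X[keep], y[keep]
```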


Experimental Framework – Step 1: choice of learning machines

Single learners:

MultiLayer Perceptrons (MLPs) – universal approximators;
Radial Basis Function networks (RBFs) – local learning; and
Regression Trees (RTs) – simple and comprehensible.

Ensemble learners:

Bagging with MLPs, with RBFs and with RTs – widely and successfully used;
Random ensembles (Rand) with MLPs – each learner uses the full training set; and
Negative Correlation Learning (NCL) with MLPs – designed for regression.
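The slides do not say which library was used to implement these learners. As a rough illustration only, the sketch below builds a single MLP, a regression tree and Bagging ensembles of MLPs and RTs with scikit-learn; RBF networks and NCL are omitted because scikit-learn has no off-the-shelf implementation of them, and all hyperparameter values are placeholders to be tuned as in Step 3.

```python
# Illustrative learner construction (hyperparameters are placeholders only).
from sklearn.ensemble import BaggingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

def make_learners(random_state=0):
    mlp = MLPRegressor(hidden_layer_sizes=(5,), max_iter=2000,
                       random_state=random_state)          # single MLP
    rt = DecisionTreeRegressor(random_state=random_state)  # regression tree (RT)
    # Bagging trains each member on a bootstrap sample of the training set.
    # Note: scikit-learn < 1.2 uses base_estimator= instead of estimator=.
    bag_mlp = BaggingRegressor(estimator=MLPRegressor(hidden_layer_sizes=(5,),
                                                      max_iter=2000),
                               n_estimators=25, random_state=random_state)
    bag_rt = BaggingRegressor(estimator=DecisionTreeRegressor(),
                              n_estimators=25, random_state=random_state)
    return {"MLP": mlp, "RT": rt, "Bag+MLP": bag_mlp, "Bag+RT": bag_rt}
```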


Experimental Framework – Step 2: choice of evaluation method

Executions were done in 30 rounds, with 10 projects held out for testing and the remaining projects used for training, as suggested by Menzies et al. (TSE'06).

Evaluation was done in two steps:

1. Menzies et al.'s (TSE'06) survival rejection rules: if the MMREs are significantly different according to a paired t-test with 95% confidence, the best model is the one with the lowest average MMRE; if not, the best method is the one with the best (1) correlation, (2) standard deviation, (3) PRED(N), (4) number of attributes.

2. Wilcoxon tests with 95% confidence to compare the two methods most often among the best in terms of MMRE and PRED(25).
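A minimal sketch of this protocol, assuming NumPy arrays and models with scikit-learn-style fit/predict; the statistical tests use SciPy. It follows the description above but is illustrative rather than the authors' implementation.

```python
# Sketch of the 30-round hold-out evaluation and the statistical comparisons.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

def run_rounds(model_factory, X, y, rounds=30, test_size=10, seed=0):
    """Return the MMRE of each round (10 random projects held out for testing)."""
    rng = np.random.default_rng(seed)
    mmres = []
    for _ in range(rounds):
        idx = rng.permutation(len(y))
        test, train = idx[:test_size], idx[test_size:]
        model = model_factory()
        model.fit(X[train], y[train])
        mre = np.abs(model.predict(X[test]) - y[test]) / y[test]
        mmres.append(mre.mean())
    return np.array(mmres)

def compare(mmre_a, mmre_b, alpha=0.05):
    """Paired t-test (survival rules) and Wilcoxon test on per-round MMREs."""
    return {"t_test_p": ttest_rel(mmre_a, mmre_b).pvalue,
            "wilcoxon_p": wilcoxon(mmre_a, mmre_b).pvalue,
            "alpha": alpha}
```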


Experimental Framework – Step 2: choice of evaluation method

Mean Magnitude of the Relative Error:
$MMRE = \frac{1}{T}\sum_{i=1}^{T} MRE_i$, where $MRE_i = \frac{|predicted_i - actual_i|}{actual_i}$.

Percentage of estimations within N% of the actual values:
$PRED(N) = \frac{1}{T}\sum_{i=1}^{T}\begin{cases}1, & \text{if } MRE_i \le \frac{N}{100}\\0, & \text{otherwise}\end{cases}$

Correlation between estimated and actual effort:
$CORR = \frac{S_{pa}}{\sqrt{S_p S_a}}$, where

$S_{pa} = \frac{\sum_{i=1}^{T}(predicted_i - \bar{p})(actual_i - \bar{a})}{T-1}$,

$S_p = \frac{\sum_{i=1}^{T}(predicted_i - \bar{p})^2}{T-1}$, $S_a = \frac{\sum_{i=1}^{T}(actual_i - \bar{a})^2}{T-1}$,

$\bar{p} = \frac{\sum_{i=1}^{T}predicted_i}{T}$, $\bar{a} = \frac{\sum_{i=1}^{T}actual_i}{T}$.
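These measures translate directly into code; the sketch below computes MMRE, PRED(N) and CORR for NumPy arrays of predicted and actual efforts.

```python
# Error measures defined above, computed for NumPy arrays of efforts.
import numpy as np

def mmre(predicted, actual):
    return np.mean(np.abs(predicted - actual) / actual)

def pred_n(predicted, actual, n=25):
    mre = np.abs(predicted - actual) / actual
    return np.mean(mre <= n / 100.0)

def corr(predicted, actual):
    t = len(predicted)
    p_bar, a_bar = predicted.mean(), actual.mean()
    s_pa = np.sum((predicted - p_bar) * (actual - a_bar)) / (t - 1)
    s_p = np.sum((predicted - p_bar) ** 2) / (t - 1)
    s_a = np.sum((actual - a_bar) ** 2) / (t - 1)
    return s_pa / np.sqrt(s_p * s_a)
```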


Experimental Framework – Step 3: choice of parameters

Preliminary experiments using 5 runs.

Each approach was run with all combinations of 3 or 5 candidate values for each parameter.

The parameter values with the lowest MMRE were chosen for the subsequent 30 runs.

Base learners will not necessarily have the same parameters as single learners.
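A minimal sketch of this parameter-selection step, reusing the run_rounds helper from the evaluation sketch above; the factory interface and candidate grids are illustrative assumptions, not the values or code used in the study.

```python
# Illustrative grid search over candidate parameter values (5 preliminary runs).
from itertools import product

def select_parameters(factory, grid, X, y):
    """factory(**params) builds a model; grid maps a parameter name to its candidate values."""
    best_params, best_mmre = None, float("inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        mmres = run_rounds(lambda: factory(**params), X, y, rounds=5)  # preliminary runs
        if mmres.mean() < best_mmre:
            best_params, best_mmre = params, mmres.mean()
    return best_params  # this combination is then evaluated over the full 30 runs
```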


Comparison of Learning Machines – Menzies et al. TSE'06's survival rejection rules

Table: Number of Data Sets in which Each Method Survived. Methods that never survived are omitted.

PROMISE Data        ISBSG Data          All Data
RT: 2               MLP: 2              RT: 3
Bag + MLP: 1        Bag + RT: 2         Bag + MLP: 2
NCL + MLP: 1        Bag + MLP: 1        NCL + MLP: 2
Rand + MLP: 1       RT: 1               Bag + RT: 2
                    Bag + RBF: 1        MLP: 2
                    NCL + MLP: 1        Rand + MLP: 1
                                        Bag + RBF: 1

No approach is consistently the best, even considering ensembles!


Comparison of Learning Machines

What methods are usually among the best?

Table: Number of Data Sets in which Each Method Was Ranked First or Second According to MMRE and PRED(25). Methods never among the first and second are omitted.

(a) According to MMRE

PROMISE Data        ISBSG Data          All Data
RT: 4               RT: 5               RT: 9
Bag + MLP: 3        Bag + MLP: 5        Bag + MLP: 8
Bag + RT: 2         Bag + RBF: 3        Bag + RBF: 3
MLP: 1              MLP: 1              MLP: 2
                    Rand + MLP: 1       Bag + RT: 2
                    NCL + MLP: 1        Rand + MLP: 1
                                        NCL + MLP: 1

(b) According to PRED(25)

PROMISE Data        ISBSG Data          All Data
Bag + MLP: 3        RT: 5               RT: 6
Rand + MLP: 3       Rand + MLP: 3       Rand + MLP: 6
Bag + RT: 2         Bag + MLP: 2        Bag + MLP: 5
RT: 1               MLP: 2              Bag + RT: 3
MLP: 1              RBF: 2              MLP: 3
                    Bag + RBF: 1        RBF: 2
                    Bag + RT: 1         Bag + RBF: 1

RTs and bag+MLPs are more frequently among the best considering MMRE than considering PRED(25).

The first-ranked method's MMRE is statistically different from the others in 35.16% of the cases.

The second-ranked method's MMRE is statistically different from the lower-ranked methods in 16.67% of the cases.

RTs and bag+MLPs are usually statistically equal in terms of MMRE and PRED(25).


Research Questions – Revisited

Question 1

Do readily available ensemble methods generally improve the effort estimates given by single learners? Which of them would be more useful?

Even though bag+MLPs is frequently among the best methods, it is statistically similar to RTs.

RTs are more comprehensible and have faster training.

Bag+MLPs seem to have more potential for improvement.


Why Were RTs Singled Out?

Hypothesis: as RTs split based on information gain, they may give more importance to more relevant attributes.

A further study using correlation-based feature selection revealed that RTs usually place the features ranked higher by the feature selection method in higher-level splits of the tree.

Feature selection by itself was not always able to improve accuracy.

It may be important to give weights to features when using ML approaches.
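As a rough illustration of this analysis (a simplification, not the correlation-based feature selection algorithm used in the study), the sketch below ranks attributes by their absolute correlation with effort and reports the depth at which an attribute first appears as a split in a fitted scikit-learn regression tree.

```python
# Simplified sketch: correlation-based attribute ranking plus the tree level
# at which each attribute first splits a fitted regression tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def correlation_ranking(X, y, names):
    """Rank attributes by |Pearson correlation| with the effort values."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(zip(names, scores), key=lambda t: -t[1])

def first_split_level(fitted_tree, feature_index):
    """Depth at which feature_index is first used for a split, or None."""
    t = fitted_tree.tree_
    def walk(node, depth):
        if t.children_left[node] == -1:          # leaf node
            return None
        if t.feature[node] == feature_index:
            return depth
        levels = [walk(t.children_left[node], depth + 1),
                  walk(t.children_right[node], depth + 1)]
        levels = [d for d in levels if d is not None]
        return min(levels) if levels else None
    return walk(0, 0)

# Usage: rt = DecisionTreeRegressor().fit(X, y); first_split_level(rt, 0)
```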


Why Were RTs Singled Out?

Table: Correlation-Based Feature Selection and RT Attributes' Relative Importance for Cocomo81.

Attributes (ranking order) | First tree level in which the attribute appears in more than 50% of the trees | Percentage of trees
LOC | Level 0 | 100.00%
Development mode, Required software reliability | Level 1 | 90.00%
Modern programming practices, Time constraint for cpu | Level 2 | 73.33%
Data base size | Level 2 | 83.34%
Main memory constraint, Turnaround time, Programmers capability, Analysts capability, Language experience, Virtual machine experience, Schedule constraint, Application experience | Level 2 | 66.67%
Use of software tools, Machine volatility | – | –


Why Were Bag+MLPs Singled Out?

Hypothesis: bag+MLPs may have led to a more adequate level of diversity.

If we use correlation as the diversity measure, we can see that bag+MLPs usually had more moderate values when it was the 1st or 2nd ranked MMRE method.

However, the correlation between diversity and MMRE was usually quite low.

Table: Correlation Considering Data Sets in which Bag+MLPs Were Ranked 1st or 2nd.

Approach      Correlation interval across different data sets
Bag+MLP       0.74-0.92
Bag+RBF       0.40-0.83
Bag+RT        0.51-0.81
NCL+MLP       0.59-1.00
Rand+MLP      0.93-1.00

Table: Correlation Considering All Data Sets.

Approach      Correlation interval across different data sets
Bag+MLP       0.47-0.98
Bag+RBF       0.40-0.83
Bag+RT        0.37-0.88
NCL+MLP       0.59-1.00
Rand+MLP      0.93-1.00
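The diversity measure referred to above is the correlation between the ensemble members' predictions (lower correlation means higher diversity). The sketch below computes the average pairwise correlation for a fitted bagging ensemble that exposes its members via an estimators_ attribute, as scikit-learn's BaggingRegressor does; it is an illustration rather than the exact measurement procedure of the study.

```python
# Average pairwise correlation between ensemble members' predictions
# (used here as the diversity measure; lower = more diverse).
import numpy as np
from itertools import combinations

def ensemble_prediction_correlation(ensemble, X_test):
    preds = np.array([m.predict(X_test) for m in ensemble.estimators_])
    corrs = [np.corrcoef(preds[i], preds[j])[0, 1]
             for i, j in combinations(range(len(preds)), 2)]
    return float(np.mean(corrs))
```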


Taking a Closer Look...

Table: Correlations between ensemble covariance (diversity) and train/test MMRE for the data sets in which bag+MLP obtained the best MMREs and was ranked 1st or 2nd, against the data sets in which it obtained the worst MMREs.

                            Cov. vs Test MMRE    Cov. vs Train MMRE
Best MMRE (desharnais)      0.24                 0.14
2nd best MMRE (org2)        0.70                 0.38
2nd worst MMRE (org7)       -0.42                -0.37
Worst MMRE (cocomo2)        -0.99                -0.99


Diversity is not only affected by the ensemble method, but also by the data set:

Software effort estimation data sets are very different from each other.


Correlation between diversity and performance on the test set follows the tendency on the train set.

Why do we have a negative correlation in the worst cases?

Could a method that self-adapts diversity help to improve estimations? How?


Research Questions – Revisited

Question 2

If a particular method is singled out, what insight into how to improve effort estimates can we gain by analysing its behaviour and the reasons for its better performance?

RTs give more importance to more important features. Weighting attributes may be helpful when using ML for software effort estimation.

Ensembles seem to have more room for improvement for software effort estimation.

A method to self-adapt diversity might help to improve estimations.


Research Questions – Revisited

Question 3

How can one determine which model to use for a particular data set?

Effort estimation data sets dramatically affect the behaviour and performance of different learning machines, even when considering ensembles.

So, it would be necessary to run experiments (parameter choice is important) using existing data from a particular company to determine which method is likely to be the best.

If the software manager does not have enough knowledge of the models, RTs are a good choice.


Risk Analysis

The learning machines singled out (RTs and bagging+MLPs) were further tested using the outlier projects.

MMRE was similar or lower (better), usually better than for the outlier-free data sets.

PRED(25) was similar or lower (worse), usually lower.

Even though outliers are the projects for which the learning machines have more difficulty in predicting within 25% of the actual effort, they are not the projects for which they give the worst estimates.


Conclusions and Future Work

RQ1 – readily available ensembles do not generally provide better effort estimations.

Principled experiments (parameters, statistical analysis, several data sets, more ensemble approaches) to deal with validity issues.

RQ2 – RTs + weighting features; bagging with MLPs + self-adapting diversity.

Insight based on experiments, not just intuition or speculation.

RQ3 – principled experiments to choose the model; RTs if no resources.

No universally good model, even when using ensembles; parameter choice is part of the framework.

Future work:

Learning feature weights in ML for effort estimation.

Can we use self-tuning diversity in ensembles of learning machines to improve estimations?


Acknowledgements

Search Based Software Engineering (SEBASE) research group.

Dr. Rami Bahsoon.

This work was funded by EPSRC grant No. EP/D052785/1.
