
JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION: RESEARCH AND PRACTICE
J. Softw. Maint. Evol.: Res. Pract. 2002; 14:161–179 (DOI: 10.1002/smr.250)

Research

Predicting project delivery rates using the Naive–Bayes classifier

B. Stewart∗,†

School of Computing and Information Technology, University of Western Sydney, Australia

SUMMARY

The importance of accurate estimation of software development effort is well recognized in software engineering. In recent years, machine learning approaches have been studied as possible alternatives to more traditional software cost estimation methods. The objective of this paper is to investigate the utility of the machine learning algorithm known as the Naive–Bayes classifier for estimating software project effort. We present empirical experiments with the Benchmark 6 data set from the International Software Benchmarking Standards Group to estimate project delivery rates and compare the performance of the Naive–Bayes approach to two other machine learning methods—model trees and neural networks. A project delivery rate is defined as the number of effort hours per function point. The approach described is general and can be used to analyse not only software development data but also data on software maintenance and other types of software engineering. The paper demonstrates that the Naive–Bayes classifier has a potential to be used as an alternative machine learning tool for software development effort estimation. Copyright 2002 John Wiley & Sons, Ltd.

KEY WORDS: software effort estimation; Bayesian networks; machine learning; model trees; neural networks

1. INTRODUCTION

Accurate estimation of software development effort is crucial to the success of software development projects. A project's budget, planning, control, and management throughout the entire software development lifecycle depend on reliable cost estimates. During the past 30 years estimation of software development effort has received a significant amount of attention in software engineering research. Many different estimation models have been developed, ranging from heuristic rule-of-thumb approaches to formal mathematical models. The formal models can be grouped into two broad categories: parametric models and machine learning models. Parametric models represent the development effort as a parametrized function of predetermined cost factors, also referred to as metrics, attributes, or cost drivers. The development effort of a new project is estimated by substituting the actual project values for the cost factors. Model parameters are determined by calibration to historical data on past projects. Some of the best known parametric models are COCOMO (Constructive Cost Model) developed by Boehm [1,2], Albrecht's function points [3], and the SLIM model developed by Putnam [4].

∗Correspondence to: Dr B. Stewart, School of Computing and Information Technology, University of Western Sydney, Campbelltown Campus, Locked Bag 1797, Penrith South DC, NSW 1797, Australia.
†E-mail: [email protected]

Received 17 September 2001; Revised 29 January 2002

During the past decade a number of research studies have been published in the literature on the use of machine learning techniques for estimating software development effort. The methods used include decision trees [5–8], neural networks [7–9], and reasoning by analogy [10]. Machine learning methods construct a model from a database of past projects which is then used to predict the software development effort for new projects. In this paper we examine the use of another machine learning algorithm for software development effort estimation—the Naive–Bayes classifier—and compare its performance to two other methods—model trees and artificial neural networks. The Naive–Bayes classifier is a well-known machine learning algorithm that has proved to have excellent classification performance on small data sets. A brief overview of this algorithm and how it can be used for software development effort estimation is given in Section 2. More detailed information on Naive–Bayes and other Bayesian classifiers can be found in [11].

Software engineering data sets often contain many variables, some of which are only weakly related to the variable of interest such as software effort. It is usually necessary to pre-process the data set in various ways and select a subset of variables that show strong relationships to the variable for which predictions are to be made. In our experimental work we used the mutual information measure to select subsets of variables to include in the Naive–Bayes classifier. The mutual information measure is a statistical measure that indicates the strength of association between a pair of random variables. This measure has been used for finding significant relationships in data by several other researchers [11–13].

We carried out empirical experiments using the data set Benchmark Release 6 from the International Software Benchmarking Standards Group (ISBSG) [14]. The projects in the data set are sized in terms of function points rather than lines of code. The data set is provided with a report describing the variables and presenting a statistical analysis of the cost factors affecting project delivery rates. A project delivery rate is defined as the number of hours per function point. Due to the nature of the data set, we have focused on estimating project delivery rates rather than total project effort.

The paper makes two main contributions: (1) it demonstrates that the Naive–Bayes classifier has the potential to be used as an additional technique for the prediction of software development and maintenance effort; and (2) it shows that the mutual information measure is a useful measure for selecting significant variables for the construction of Naive–Bayes classifiers. The approach described in the paper is general and could be used to estimate values of any variables of interest provided that sufficient historical data are available.

The remainder of the paper is organized as follows: Section 2 describes general Bayesian network classifiers and the Naive–Bayes classifier as a special case of the Bayesian network classifier; Section 3 overviews model trees as an alternative method for project delivery rate estimation; Section 4 introduces neural networks; Section 5 introduces the mutual information measure and its use for selecting significant variables; Section 6 describes our experimental work; and Section 7 concludes the paper and outlines our future work.


2. BAYESIAN NETWORK CLASSIFIERS

A fundamental problem in machine learning, data analysis, and pattern recognition is the classification of observed instances into predetermined categories or classes. For example, software projects could be classified into categories according to their project delivery rate values. In this case the categories would be subranges (intervals) of project delivery rate values. Classification would determine the interval for a new project on the basis of the observed values of its remaining characteristics.

In machine learning, classification algorithms are grouped into two broad groups—supervised and unsupervised classifiers. In supervised classifiers the categories into which the cases are to be assigned must have been established prior to classification. In unsupervised classifiers no predetermined categories are used; the necessary categories are determined by the algorithm itself. In this paper we use the term classification to mean supervised classification, unless stated otherwise.

A classification task requires a database of cases from which a classification model, such as the Naive–Bayes model, is constructed. This database is referred to as a training set and contains measurement data on cases observed in the past together with their actual categories. The variable whose values are the case categories (classes) is referred to as the class variable. The classification model derived from the training data is then used to predict the categories of instances whose category is unknown. The classification problem has been widely studied in statistics and artificial intelligence (AI) and a variety of different classification methods have been developed. Some of the most popular approaches used in AI include decision trees [15,16], neural networks [17], nearest-neighbour classifiers [18], genetic algorithms [19] and, more recently, Bayesian network classifiers [11].

2.1. Bayesian networks

General Bayesian network classifiers are known as Bayesian networks, belief networks or causal probabilistic networks. The theoretical foundations of Bayesian networks were developed by Judea Pearl in the 1980s and are described in his pioneering book Probabilistic Reasoning in Intelligent Systems [13]. During the past decade Bayesian networks have gained popularity in AI as a means of representing and reasoning with uncertain knowledge. Examples of practical applications include decision support, safety and risk evaluation, control systems, and data mining [20]. In the software engineering field, Bayesian networks have been used by Fenton [21] for software quality prediction. A wealth of articles on this area of research can be found on the Agena Web site [21].

State-of-the-art research papers on Bayesian networks are published in the proceedings of the annual Conference on Uncertainty in AI [22]. Theoretical principles of Bayesian networks are described in several books, for example [13,23–26].

A Bayesian network consists of two components: (1) a directed acyclic graph representing the structure of an application domain, and (2) conditional probability distributions associated with the vertices in the graph.

The vertices of the graph represent the domain variables, and the directed edges the relationships between the variables. With every vertex is associated a table of conditional probabilities of the vertex given each state of its parents. We denote a conditional probability table using the notation P(x_i | par(X_i)), where lower-case x_i denotes values of the corresponding random variable X_i and par(X_i) denotes a state of the parents of X_i. The graph together with the conditional probability tables defines the joint probability distribution contained in the data. Using the probabilistic chain rule, the joint distribution can be written in the product form

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{par}(X_i))$$

where n is the number of vertices in the graph. An example of a simple Bayesian network is given in Figure 1. The corresponding joint probability distribution can be written in the form P(a, b, c, d) = P(a)P(b|a)P(c|a,b)P(d|b,c).

[Figure 1. A Bayesian network over the four vertices A, B, C and D.]
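To make the factorization concrete, the following minimal Python sketch computes the joint probability for the Figure 1 network as the product of its per-vertex factors. The conditional probability tables are invented for illustration (the uniform 0.5 entries are placeholders), not taken from the paper.

```python
# Hypothetical CPTs for the Figure 1 network; each factor P(x_i | par(X_i))
# is a dictionary lookup keyed on the variable's value followed by its parents.
p_a = {0: 0.6, 1: 0.4}                                    # P(a)
p_b_given_a = {(0, 0): 0.7, (1, 0): 0.3,                  # P(b | a), key (b, a)
               (0, 1): 0.2, (1, 1): 0.8}
p_c_given_ab = {(c, a, b): 0.5                            # P(c | a, b), uniform
                for c in (0, 1) for a in (0, 1) for b in (0, 1)}
p_d_given_bc = {(d, b, c): 0.5                            # P(d | b, c), uniform
                for d in (0, 1) for b in (0, 1) for c in (0, 1)}

def joint(a, b, c, d):
    """P(a, b, c, d) as the product of the per-vertex factors."""
    return (p_a[a] * p_b_given_a[(b, a)]
            * p_c_given_ab[(c, a, b)] * p_d_given_bc[(d, b, c)])

# Sanity check: the joint sums to 1 over all 16 assignments.
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
print(total)  # 1.0 (up to floating-point rounding)
```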

General unrestricted Bayesian networks may be regarded as unsupervised classifiers in the sense that there is no specific variable designated as the class variable. In a Bayesian network all variables are treated in the same way and any one can be regarded as the class variable. Classification in a Bayesian network classifier involves performing probabilistic inference on the Bayesian network using one of the available probabilistic inference algorithms, for example the algorithm of Lauritzen and Spiegelhalter [27]. Probabilistic inference computes for each vertex in the graph the posterior probability distribution P(x_i | evidence), where x_i represents the values of the variable X_i and evidence represents a set of observed values of the remaining variables.

2.2. Naive–Bayes classifier

The Naive–Bayes classifier is a special case of the Bayesian network classifier, derived by assuming that the variables are conditionally independent given the class variable. Unlike the general Bayesian network classifier, Naive–Bayes is a supervised classifier because one of the variables must be designated as the class variable. The graphical structure of Naive–Bayes is represented by a tree in which the class variable is the root and the remaining variables are the leaves. Directed edges connect the root to the leaves. It is assumed that each variable is conditionally independent of the remaining variables given the class variable. Classification in Naive–Bayes computes the posterior probability distribution of the class variable given observed values of the remaining variables, P(c | x_1, x_2, ..., x_n). Unlike in a general Bayesian network, in Naive–Bayes the posterior probability distribution P(c | x_1, x_2, ..., x_n) can be computed efficiently from Bayes theorem:

$$P(c \mid x_1, x_2, \ldots, x_n) = \frac{P(c, x_1, x_2, \ldots, x_n)}{P(x_1, x_2, \ldots, x_n)} \quad (1)$$

where

$$P(c, x_1, x_2, \ldots, x_n) = P(c)\,P(x_1 \mid c)\,P(x_2 \mid c) \cdots P(x_n \mid c) \quad (2)$$

and

$$P(x_1, x_2, \ldots, x_n) = \sum_{c} P(c, x_1, x_2, \ldots, x_n) \quad (3)$$

[Figure 2. A Naive–Bayes network with class variable C and leaf variables X and Y.]

The Naive–Bayes classifier is relatively simple to implement, efficient, robust with respect to noisy or missing data, and performs surprisingly well in many domains. For small data sets it frequently outperforms even more sophisticated state-of-the-art decision tree classifiers [16]. Some comparative empirical studies are reported in [11].

Example. We illustrate the computations performed by the Naive–Bayes algorithm by means of a simple example illustrated in Figure 2.

The graph in Figure 2 shows the structure of the Naive–Bayes classifier in which the variable C is the class variable. Using the chain rule, the joint probability distribution corresponding to the graph can be written in the form P(c, x, y) = P(c)P(x|c)P(y|c).

For simplicity, we assume that the variables C, X, and Y are binary and take on the values 0 and 1. We also assume that the probability distributions P(c), P(x|c), and P(y|c) have been computed from the training data and have the values given in Tables I–III.

The Naive–Bayes classifier computes the conditional probability distribution P(c | x = x_0, y = y_0) for some observed values x_0 and y_0 of the variables X and Y, respectively. Suppose that the observed values are x_0 = 0 and y_0 = 1. Then the Naive–Bayes algorithm computes the conditional probabilities P(c = 0 | x = 0, y = 1) and P(c = 1 | x = 0, y = 1) as follows:

$$P(c = 0 \mid x = 0, y = 1) = \frac{P(c = 0)\,P(x = 0 \mid c = 0)\,P(y = 1 \mid c = 0)}{\mathit{sum}} \quad (4)$$

$$P(c = 1 \mid x = 0, y = 1) = \frac{P(c = 1)\,P(x = 0 \mid c = 1)\,P(y = 1 \mid c = 1)}{\mathit{sum}} \quad (5)$$

where

$$\mathit{sum} = P(c = 0)\,P(x = 0 \mid c = 0)\,P(y = 1 \mid c = 0) + P(c = 1)\,P(x = 0 \mid c = 1)\,P(y = 1 \mid c = 1) \quad (6)$$

Substituting the values from Tables I–III yields the distribution in Table IV. The observed sample (x_0 = 0, y_0 = 1) will be classified into the category corresponding to the larger probability value, in this case c = 1.


Table I. The probability distribution of the variable C.

  c    P(c)
  0    0.2
  1    0.8

Table II. The conditional probability distribution P(x|c).

  x    c    P(x|c)
  0    0    0.1
  0    1    0.3
  1    0    0.9
  1    1    0.7

Table III. The conditional probability distribution P(y|c).

  y    c    P(y|c)
  0    0    0.2
  0    1    0.6
  1    0    0.8
  1    1    0.4

Table IV. The resulting conditional distribution P(c | x = 0, y = 1).

  c    x    y    P(c|x,y)
  0    0    1    0.143
  1    0    1    0.857
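As an illustration, the short Python sketch below reproduces the computation of Table IV from equations (1)–(3); the probability tables are copied from Tables I–III above.

```python
# Probability tables from the worked example.
p_c = {0: 0.2, 1: 0.8}                       # P(c), Table I
p_x_given_c = {(0, 0): 0.1, (0, 1): 0.3,     # P(x|c), Table II, key (x, c)
               (1, 0): 0.9, (1, 1): 0.7}
p_y_given_c = {(0, 0): 0.2, (0, 1): 0.6,     # P(y|c), Table III, key (y, c)
               (1, 0): 0.8, (1, 1): 0.4}

def posterior(x0, y0):
    """Return P(c | x=x0, y=y0) for both classes using Bayes' theorem."""
    joint = {c: p_c[c] * p_x_given_c[(x0, c)] * p_y_given_c[(y0, c)]
             for c in p_c}                   # equation (2)
    total = sum(joint.values())              # equation (3)
    return {c: joint[c] / total for c in joint}   # equation (1)

print(posterior(0, 1))   # {0: 0.142..., 1: 0.857...}, matching Table IV
```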

2.3. Estimating project delivery rates using the Naive–Bayes classifier

The estimation of the project delivery rate value of a new project involves a prediction of the project delivery rate on the basis of the observed values of the remaining project characteristics (variables). The Naive–Bayes algorithm is a classification algorithm which computes the conditional probability distribution of the class variable given the observed values of the remaining variables. We assume that the class variable has discrete values, referred to as categories or classes. Naive–Bayes gives a probability value for each category of the class variable indicating how likely it is that the observed instance belongs to that category. For example, assuming that the project delivery rate variable has been discretized into several intervals, Naive–Bayes can compute the probability that a given project's project delivery rate falls within each of these intervals. However, it cannot predict the actual value of the project delivery rate.

To use Naive–Bayes for estimating project delivery rates, we have to adapt the algorithm for prediction tasks. The goal is to predict the value of the class variable for a given observed instance, rather than to classify that instance into one of several possible categories.

The value of the class variable (project delivery rate) may be approximated by its expected value using the formula below:

$$E(\mathit{pdr} \mid \mathit{instance}) = \sum_{i=1}^{n} \mathit{mid}_i \cdot P(\mathit{pdr}_i \mid \mathit{instance}) \quad (7)$$

The formula assumes that the project delivery rate variable was discretized into n intervals; mid_i denotes the midpoint of the ith interval, computed as mid_i = (lower_i + upper_i)/2, and P(pdr_i | instance) denotes the conditional probability of the ith interval given the observed instance.

The midpoint of each interval approximates the interval's project delivery rate value. The expected value of the project delivery rate is computed as a weighted sum of the project delivery rates of the individual intervals. The weights in the formula are the conditional probabilities P(pdr_i | instance) computed by Naive–Bayes.
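A minimal sketch of equation (7) follows. The interval bounds and posterior probabilities are hypothetical, standing in for the discretization and the Naive–Bayes output.

```python
def expected_pdr(intervals, probs):
    """Equation (7): probability-weighted sum of interval midpoints.

    intervals: list of (lower, upper) PDR bins;
    probs: the conditional probabilities P(pdr_i | instance)."""
    mids = [(lo + hi) / 2 for lo, hi in intervals]
    return sum(m * p for m, p in zip(mids, probs))

bins = [(0, 5), (5, 10), (10, 15)]     # hypothetical discretization
posterior = [0.1, 0.7, 0.2]            # hypothetical Naive-Bayes output
print(expected_pdr(bins, posterior))   # 0.1*2.5 + 0.7*7.5 + 0.2*12.5 = 8.0
```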

The limitation of this approach is that the resulting expected value may be affected significantly by the width of the intervals. If too few intervals are chosen, the mid_i values may be too far apart and may poorly approximate the actual project delivery rate values. If too many intervals are chosen and the training sample size is too small, the computed conditional probabilities P(pdr_i | instance) may be inaccurate. This problem could be reduced by increasing the size of the training data set.

To assess the predictive performance of Naive–Bayes relative to other methods, we used two other prediction methods to estimate project delivery rates—model trees and neural networks.

3. ESTIMATING PROJECT DELIVERY RATES USING MODEL TREES

Model trees are a type of decision tree designed for the prediction of numeric quantities rather than for classification. There are two basic types of decision trees—classification trees and regression trees. Classification trees are used for predicting classes (categories of the class variable) of test instances, while regression trees are used for predicting numeric values of the class variable. In classification trees the class variable must have discrete values, while in regression trees it must have numeric values. Both classification and regression trees are constructed in the same way by recursive partitioning of the training data set. Many different tree building algorithms have been developed, two of the most widely used systems being CART [15] and C4.5 [16]. The main difference between alternative tree learning algorithms is in the way in which they determine the splitting variables on which to partition the data.

Regression trees were first introduced by Breiman et al. [15]. In such trees the leaf nodes contain a predicted value of the class variable which is computed by averaging all the training set values that reach that leaf. To use a regression tree to predict the value of the class variable for a given test instance, the tree is traversed from the root towards the leaves until some leaf node has been reached. With each internal node in the tree is associated a condition that determines which branch is to be followed next. When a node is visited the condition is evaluated using the values of the test instance. Depending on the outcome of the condition a branch is selected to the next node in the next lower level of the tree. This process continues until a leaf node has been reached. The value at the leaf is the predicted value for the given test instance.

Model trees were introduced by Quinlan [28] in 1992. They extend the idea of regression trees by combining regression trees with regression equations: the leaf nodes in a model tree contain linear regression models rather than single numeric values. To use a model tree to predict the project delivery rate for a given project, the tree is traversed from the root to a leaf in the normal way and, when a leaf is reached, the regression model is evaluated to obtain a raw predicted value. Rather than using the raw value directly, a smoothing formula is applied to it to compute a smoothed predicted value. Smoothing yields more accurate predictions.
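The following sketch illustrates prediction with a model tree of this kind, assuming each node (internal as well as leaf) carries a linear model fitted during tree construction. The smoothing form p' = (np + kq)/(n + k), with k = 15, follows the description of M5/M5′ in [30]; the data structure itself is our own illustration, not the M5′ implementation.

```python
class Node:
    """A model tree node; leaves have split=None and only a linear model."""
    def __init__(self, model, split=None, left=None, right=None, n=1):
        self.model = model        # callable: instance (dict) -> raw value
        self.split = split        # (variable, threshold), or None at a leaf
        self.left, self.right = left, right
        self.n = n                # number of training cases reaching the node

K = 15  # smoothing constant, as described in [30]

def predict(node, instance):
    """Traverse to a leaf, then smooth the raw value back up the tree."""
    if node.split is None:
        return node.model(instance)                  # raw leaf prediction
    var, threshold = node.split
    child = node.left if instance[var] <= threshold else node.right
    p = predict(child, instance)                     # value passed up
    q = node.model(instance)                         # this node's own model
    return (child.n * p + K * q) / (child.n + K)     # smoothed prediction
```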

For our experiments with model trees we used the M5′ algorithm from the Weka machine learning package developed at the University of Waikato [29]. We adapted the system to also compute the mean magnitude of relative error and the PRED(x) measures. An excellent description of this package is given in [30], together with the theoretical details of decision trees and many other machine learning techniques.

4. ESTIMATING PROJECT DELIVERY RATES USING NEURAL NETWORKS

Neural networks are powerful pattern recognition tools that can be applied to solving prediction, classification and clustering problems. Some of the basic ideas of neural networks originated in research in neurophysiology in the 1940s. The most significant developments, however, occurred during the 1980s, following John Hopfield's influential work on neural networks in 1982 and the subsequent popularization of the back-propagation algorithm for training neural networks. Since then they have become one of the most important tools for pattern recognition and machine learning. A detailed description of neural network concepts is beyond the scope of this article; we give only a brief outline of the feed forward neural network architecture that we used in our experiments.

A feed forward neural network consists of nodes organized in layers and links connecting the nodes in adjacent layers. The most common configuration comprises an input layer, one or more hidden layers, and an output layer. The nodes in the input layer correspond to network inputs, and the nodes in the output layer to network outputs. The number of hidden layers and the number of hidden nodes are chosen arbitrarily. In general, increasing the number of hidden nodes increases the power of the network to recognize patterns but may also lead to over-fitting. Over-fitting occurs when the weights become too specialized to the training data and as a result the network's ability to predict new samples is reduced. To prevent over-fitting, for small data sets the number of hidden nodes should be relatively small. The nodes in each layer are connected to the nodes in each adjacent layer by links. Figure 3 illustrates the basic architecture of the feed forward neural networks that we used in our experiments. The network contains an input layer, a single hidden layer, and an output layer containing a single node. The nodes in the input layer correspond to the variables in the data set, the hidden layer contains hidden nodes, and the single node in the output layer represents the project delivery rate variable. In our empirical experiments we used five hidden nodes, and the number of input nodes varied in different experiments.

[Figure 3. A feed forward neural network with an input layer, a hidden layer, and a single-node output layer.]

With each link i–j connecting the nodes i and j is associated a numerical weight w_ij, and with each node in the hidden and output layers is associated an activation function which computes the node's output. The most widely used activation function is a sigmoid function of the form

$$\frac{1}{1 + \exp(-x)} \quad (8)$$

where $x = \sum_i w_{ij}\,\mathit{Inp}_{ij}$ is the weighted sum of the inputs into a node j.

All network inputs must be numeric values. If a data set contains non-numeric variables, these have to be transformed into a numeric form. To achieve better performance all the variables are usually scaled between 0 and 1.

Using a neural network for predicting project delivery rates involves two steps: (1) training the network using a training data set; and (2) applying the network to predict the value of the project delivery rate for a given project. During the training stage the weights of the links are learned from the training data by means of the back-propagation algorithm. The weights are initially set to small randomly generated values and are then iteratively updated. The data set is repeatedly traversed, and during each traversal the back-propagation algorithm updates the weights so that the total prediction error is minimized. Mathematical details of the back-propagation algorithm are beyond the scope of this article; we refer the reader to [17] for details.

After the network has been trained it can be used for prediction. To predict the project delivery rate for a given project, the values of the project's variables are entered as inputs into the nodes in the input layer. Input layer nodes only copy their input values to their output values. The output values from the input nodes become inputs into the hidden layer nodes. Each node in the hidden layer is connected to every input node in the input layer. Hidden nodes compute their output by multiplying the outputs from the input layer nodes by the corresponding link weights, summing the products, and applying the sigmoid activation function to produce output. The single output node in the output layer is connected to each hidden node in the hidden layer. The output from the output node is computed by multiplying the outputs from the hidden nodes by the corresponding weights, adding up the products, and applying the sigmoid activation function. The value returned by the output node is the resulting project delivery rate.
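A compact sketch of this forward pass, assuming a single hidden layer and one output node as in Figure 3; the weights and layer sizes are illustrative, and bias terms are omitted for brevity.

```python
import math

def sigmoid(x):
    """The activation function of equation (8)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_output):
    """One forward pass: inputs -> hidden layer -> single output node.

    w_hidden[j]: weights from every input to hidden node j;
    w_output[j]: weight from hidden node j to the output node."""
    hidden = [sigmoid(sum(w * x for w, x in zip(w_j, inputs)))
              for w_j in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_output, hidden)))

# Example: three inputs, five hidden nodes (as in the experiments), one output.
w_hidden = [[0.1, -0.2, 0.3]] * 5      # placeholder weights
w_output = [0.2, -0.1, 0.4, 0.3, -0.2]
print(forward([0.5, 0.1, 0.9], w_hidden, w_output))
```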

5. MUTUAL INFORMATION MEASURE

A database of empirical data may contain a large number of variables, some of which may not be relevant to the given classification task. The presence of redundant variables may reduce the performance of some classification algorithms in terms of execution speed and accuracy. The execution speed of most classification algorithms is inversely related to the number of variables in the model: the more variables there are, the slower the speed. Furthermore, redundant variables may affect classification or prediction accuracy by causing over-fitting. The model will contain too many parameters relative to the amount of training data available, and as a consequence will compute parameter values that are too specialized to the training data and have poor generalization capabilities. Decision tree models resolve this problem by pruning the decision trees [16].

In the experiments reported in this paper we also investigated the effects of reducing the number of variables on the performance of the Naive–Bayes classifier in the context of predicting project delivery rates. We performed comparative experiments with the full set of variables and two smaller subsets of the original variables. To select a subset of variables that have a significant effect on the class variable we used the mutual information measure. A variety of other approaches have been proposed in statistics and AI for variable selection [31]. Our choice was motivated by the fact that the mutual information measure has been used successfully by several researchers in AI for identifying strong relationships in data [11] and that it can be computed relatively efficiently, in O(n^2) time, where n is the number of variables in the data set. The mutual information measure is given by the formula

$$I(X_i, X_j) = \sum_{x_i, x_j} P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i)\,P(x_j)} \quad (9)$$

where X_i and X_j are random variables, P(x_i, x_j) denotes the joint probability distribution of X_i and X_j, and P(x_i), P(x_j) denote the marginal distributions.
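A direct transcription of equation (9) for two discrete variables, estimating the joint and marginal probabilities from paired samples; zero-probability cells contribute nothing and are skipped automatically.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X, Y) from paired samples of two discrete variables."""
    n = len(xs)
    joint = Counter(zip(xs, ys))          # counts of (x, y) pairs
    marg_x, marg_y = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((marg_x[x] / n) * (marg_y[y] / n)))
    return mi

# Identical variables give I(X, X) = H(X) > 0; independent ones give ~0.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # = H(X) = log 2
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # = 0.0
```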

6. EXPERIMENTAL WORK

The goal of our experimental work was to assess the feasibility of the Naive–Bayes classifier for the prediction of project delivery rates. We carried out experiments using the data disk 'The Benchmark Release 6' purchased from ISBSG [14]. The data disk includes a report presenting a statistical analysis of the factors affecting project delivery rates (PDR), where a project delivery rate is defined as the number of hours per function point. Due to the nature of this database, we focused on estimating project delivery rates rather than summary effort values.

To compare the predictive performance of Naive–Bayes to other prediction algorithms, we carried out the same experiments using two alternative methods—model trees and artificial neural networks. The results of these experiments are presented in Tables V and VI.


6.1. Data description

The data set contains data on 789 projects from 20 countries, drawn from many different industries and business areas. Most of the projects are less than five years old. The data set contains 55 variables, including discrete and numeric variables. Some projects do not provide values for all the variables.

6.2. Data preparation

The format of the data on the data disk differed from the format required by our software and hence pre-processing was required. This included the transformation of some variables into several simpler variables, the removal of some variables, and the addition of the project delivery rate variable. Project delivery rate values were computed by dividing the summary effort in hours by the number of function points.

Due to significant differences in project delivery rates between industry types, we extracted from the data set the data for seven types of organizations for which there was an adequate number of projects and performed experiments with these smaller data sets. We used the following organization types: (1) Banking; (2) Communication; (3) Electricity/gas/water; (4) Financial/property/business; (5) Insurance; (6) Manufacturing; and (7) Public administration.

For each data set it was necessary to transform the numeric variables into discrete variables, since our current implementation of the Naive–Bayes classifier is designed to handle only discrete-valued variables. We used equal-interval discretization for this purpose. All numeric variables were discretized into 10 intervals. The pre-processing operations resulted in data sets of 65 variables (64 input variables plus the class variable). Additionally, for each data set two smaller subsets of variables (16 and 8, respectively) were selected using a combination of the mutual information measure and our own judgment.
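A sketch of this equal-interval (equal-width) discretization; the binning scheme below is the standard one, though the paper does not give implementation details.

```python
def discretize(values, n_bins=10):
    """Replace each numeric value by the index of its equal-width bin.

    Assumes max(values) > min(values); the maximum is clamped into the
    top bin so that exactly n_bins bins are produced."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins

    def bin_of(v):
        return min(int((v - lo) / width), n_bins - 1)

    return [bin_of(v) for v in values]

# Example: PDR values spread over [2, 22] fall into 10 bins of width 2.
print(discretize([2.0, 5.5, 13.0, 22.0]))  # [0, 1, 5, 9]
```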

For the experiments with model trees and neural networks we used the same data sets as for Naive–Bayes, except that the project delivery rate variable was not discretized. For the neural network experiments, all the symbolic variables were transformed to numeric variables and all the variables were scaled between 0 and 1.

6.3. Accuracy measures

To assess the accuracy of the predicted project delivery rate values, we used the mean magnitude of relative error (MMRE) and the PRED(x) measures that are commonly used in software cost estimation research. The MMRE measure is defined as the average of the magnitudes of relative errors (MRE) of the n samples selected from the data set:

$$\mathrm{MRE}_i = \left| \frac{\mathit{actual}_i - \mathit{predicted}_i}{\mathit{actual}_i} \right| \quad (10)$$

$$\mathrm{MMRE} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MRE}_i \quad (11)$$

A problem with the MMRE is that its value can be strongly influenced by a few very large MRE values. For this reason, accuracy is also usually measured in terms of the PRED(x) measure, defined as the fraction or percentage of the samples with MRE ≤ x. For example, PRED(0.25) is the fraction of the samples with MRE ≤ 0.25.


Table V. The results of experiments using 10-fold cross-validation.

                                 PRED(0.25)        PRED(0.50)        PRED(0.75)           MMRE
Experiment                      NB    MT    NN    NB    MT    NN    NB    MT    NN    NB    MT    NN
Banking 1                     0.41  0.40  0.29  0.68  0.66  0.53  0.82  0.75  0.69  0.52  0.67  0.96
Banking 2                     0.52  0.41  0.32  0.72  0.63  0.58  0.87  0.78  0.73  0.41  0.66  0.62
Banking 3                     0.46  0.41  0.36  0.71  0.67  0.64  0.86  0.78  0.77  0.42  0.63  0.65
Communication 1               0.35  0.29  0.38  0.55  0.59  0.53  0.75  0.68  0.55  0.62  0.91  0.92
Communication 2               0.28  0.29  0.23  0.48  0.44  0.53  0.70  0.66  0.65  0.84  0.99  0.94
Communication 3               0.35  0.32  0.23  0.53  0.49  0.48  0.82  0.66  0.60  0.55  0.99  0.90
Electricity/gas/water 1       0.33  0.19  0.15  0.60  0.36  0.30  0.77  0.62  0.65  1.01  1.34  1.98
Electricity/gas/water 2       0.33  0.30  0.15  0.58  0.53  0.25  0.73  0.60  0.48  1.19  1.43  3.32
Electricity/gas/water 3       0.35  0.28  0.20  0.65  0.40  0.43  0.78  0.53  0.50  1.12  1.83  2.27
Financial/property/business 1 0.17  0.32  0.19  0.43  0.54  0.34  0.66  0.71  0.49  1.36  1.22  1.86
Financial/property/business 2 0.14  0.24  0.21  0.34  0.50  0.39  0.57  0.62  0.63  1.87  1.34  1.22
Financial/property/business 3 0.17  0.27  0.27  0.36  0.49  0.43  0.56  0.63  0.66  1.57  1.33  1.05
Insurance 1                   0.34  0.31  0.19  0.61  0.58  0.44  0.80  0.73  0.61  0.61  0.77  0.89
Insurance 2                   0.41  0.35  0.28  0.63  0.58  0.58  0.76  0.72  0.74  0.64  0.75  0.69
Insurance 3                   0.35  0.33  0.34  0.65  0.61  0.56  0.78  0.74  0.68  0.65  0.76  0.78
Manufacturing 1               0.38  0.43  0.38  0.63  0.71  0.62  0.80  0.80  0.80  0.65  0.63  0.64
Manufacturing 2               0.38  0.32  0.45  0.60  0.64  0.60  0.80  0.77  0.73  0.63  0.68  0.59
Manufacturing 3               0.30  0.25  0.47  0.55  0.52  0.70  0.73  0.71  0.90  0.76  0.79  0.36
Public administration 1       0.16  0.18  0.22  0.40  0.36  0.38  0.66  0.53  0.65  1.14  1.87  1.46
Public administration 2       0.20  0.15  0.15  0.44  0.35  0.30  0.67  0.55  0.52  1.30  2.53  1.98
Public administration 3       0.27  0.17  0.12  0.50  0.36  0.33  0.60  0.53  0.57  1.88  2.36  2.94


Table VI. The results of experiments using training data.

                                 PRED(0.25)        PRED(0.50)        PRED(0.75)           MMRE
Experiment                      NB    MT    NN    NB    MT    NN    NB    MT    NN    NB    MT    NN
Banking 1                     0.89  0.51  0.96  0.96  0.76  0.98  0.98  0.85  0.99  0.13  0.49  0.09
Banking 2                     0.73  0.56  0.77  0.89  0.75  0.92  0.93  0.81  0.95  0.23  0.46  0.20
Banking 3                     0.68  0.46  0.60  0.85  0.76  0.87  0.93  0.84  0.92  0.29  0.51  0.29
Communication 1               0.78  0.32  0.93  0.87  0.76  0.99  0.97  0.85  1.00  0.20  0.58  0.07
Communication 2               0.67  0.37  0.74  0.79  0.61  0.88  0.91  0.66  0.91  0.29  0.79  0.29
Communication 3               0.62  0.54  0.60  0.76  0.76  0.79  0.90  0.85  0.86  0.32  0.59  0.42
Electricity/gas/water 1       0.76  0.30  0.76  0.89  0.64  0.84  0.89  0.77  0.90  0.42  0.87  0.25
Electricity/gas/water 2       0.64  0.45  0.68  0.77  0.77  0.81  0.78  0.83  0.87  0.67  0.70  0.32
Electricity/gas/water 3       0.59  0.45  0.54  0.78  0.75  0.76  0.83  0.83  0.84  0.68  0.78  0.62
Financial/property/business 1 0.67  0.32  0.85  0.84  0.65  0.93  0.88  0.73  0.96  0.44  0.74  0.16
Financial/property/business 2 0.46  0.28  0.52  0.64  0.60  0.76  0.70  0.72  0.86  1.21  0.96  0.52
Financial/property/business 3 0.41  0.28  0.43  0.61  0.60  0.69  0.71  0.72  0.81  1.15  0.96  0.61
Insurance 1                   0.72  0.47  0.81  0.84  0.68  0.93  0.89  0.80  0.95  0.30  0.56  0.17
Insurance 2                   0.49  0.42  0.59  0.68  0.72  0.79  0.78  0.79  0.84  0.53  0.60  0.43
Insurance 3                   0.50  0.43  0.51  0.69  0.69  0.75  0.78  0.78  0.83  0.53  0.66  0.52
Manufacturing 1               0.93  0.75  0.96  0.96  0.89  0.99  0.99  0.93  1.00  0.09  0.22  0.06
Manufacturing 2               0.79  0.55  0.89  0.93  0.82  0.96  0.96  0.89  0.98  0.18  0.35  0.12
Manufacturing 3               0.82  0.48  0.86  0.91  0.73  0.95  0.94  0.86  0.98  0.19  0.40  0.14
Public administration 1       0.62  0.35  0.84  0.81  0.59  0.92  0.86  0.71  0.95  0.54  1.12  0.15
Public administration 2       0.50  0.27  0.71  0.71  0.62  0.82  0.79  0.70  0.89  0.86  1.43  0.33
Public administration 3       0.56  0.36  0.43  0.72  0.61  0.68  0.76  0.70  0.78  0.89  1.38  0.79


To provide more detailed information, in this paper we also present the values for PRED(0.50) and PRED(0.75).
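Equations (10) and (11) and the PRED(x) measure translate directly into code; the following sketch assumes paired lists of actual and predicted values.

```python
def mre(actual, predicted):
    """Equation (10): magnitude of relative error for one sample."""
    return abs(actual - predicted) / actual

def mmre(actuals, predictions):
    """Equation (11): mean magnitude of relative error over n samples."""
    errors = [mre(a, p) for a, p in zip(actuals, predictions)]
    return sum(errors) / len(errors)

def pred(x, actuals, predictions):
    """PRED(x): the fraction of samples with MRE <= x."""
    errors = [mre(a, p) for a, p in zip(actuals, predictions)]
    return sum(1 for e in errors if e <= x) / len(errors)

# Example with hypothetical PDR values (hours per function point):
actual, predicted = [10.0, 8.0, 20.0], [11.0, 12.0, 19.0]
print(mmre(actual, predicted), pred(0.25, actual, predicted))
```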

According to the software engineering literature [32], a formal cost model is considered to function well if its PRED(0.25) is greater than 0.75. Some researchers consider an MMRE of less than 0.25 to be good, while Boehm [1] suggested that the MMRE should be 0.10 or less. Most algorithmic and machine learning models reported in the literature have been unable to satisfy these criteria for a variety of projects and usually require careful calibration for specific organizations.

6.4. Cross-validation

In machine learning a popular method for evaluating the accuracy of classification algorithms is cross-validation. The basic idea of this method is to divide the original data set into a predetermined number v of subsets of as nearly equal size as possible and then repeatedly train and test the classifier. In each iteration one of the subsets is selected as the test set and the remaining subsets are included in the training set. Hence testing is performed on previously unseen data. For example, in 10-fold cross-validation the data set is split into 10 nearly equal subsets and the train/test cycle is repeated 10 times, each time using a different subset as the test set and the remainder of the data as the training set. The prediction accuracy is computed by averaging the accuracies over the 10 iterations.
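A minimal sketch of the procedure, with `train` and `evaluate` standing in for the model-specific code (Naive–Bayes, model tree, or neural network).

```python
def cross_validate(data, v, train, evaluate):
    """v-fold cross-validation: each fold serves once as the test set,
    and the accuracies are averaged over the v train/test cycles."""
    folds = [data[i::v] for i in range(v)]      # v nearly equal subsets
    scores = []
    for i, test_set in enumerate(folds):
        train_set = [row for j, fold in enumerate(folds)
                     if j != i for row in fold]
        model = train(train_set)
        scores.append(evaluate(model, test_set))
    return sum(scores) / v                      # average over the v runs
```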

6.5. Summary of our approach

The major steps of our approach to estimating project delivery rates using the Naive–Bayes classifier are summarized below.

Step 1. Pre-process the data. This may involve deletion, addition, transformation, and discretization of variables. The project delivery rate variable must be discretized.

Step 2. Create training data sets and test data sets. For cross-validation this is done by splitting the original data set into a predefined number of subsets of equal size as described in Section 6.4. For experiments with training data the whole data set is used as both training set and test set.

Step 3. Construct the Naive–Bayes model for this data set.

Step 4. Train the Naive–Bayes model using a training data set. This step computes the required conditional probabilities.

Step 5. Use the model to predict the project delivery rate for the projects in the associated test set as explained in Section 2.3, and compute the accuracy measures.

Step 6. For cross-validation, repeat Steps 4 and 5 until all the training sets have been processed.

Step 7. Compute the resulting accuracy measures by averaging the values of the measures over all test sets.

6.6. Experimental results

This section describes the experimental results obtained using the data sets created from the Benchmark Release 6 data disk. For each data set we performed three experiments using 10-fold cross-validation as described in Section 6.4, and three experiments in which the models were tested on the training data from which they were derived. The experiments labelled 1, 2, and 3 differed in the number of input variables included in the model; they used 64, 15, and seven input variables respectively, plus the class variable.


Table VII. The data sets used for the experiments.

Data set                      Size  The seven variables used in experiment 3
Banking                         91  (1) Primary programming language, (2) Project scope, (3) How methodology acquired, (4) Used methodology, (5) Reference table approach, (6) Upper CASE tool used, (7) Application type
Communication                   41  (1) Primary programming language, (2) Application type, (3) FP standards, (4) Business area type, (5) Recording method, (6) Upper CASE tool used, (7) Development platform
Electricity/gas/water           47  (1) Primary programming language, (2) FP standards, (3) Maximum team size, (4) How methodology acquired, (5) Reference table approach, (6) Business area type, (7) Resource level
Financial/property/business     78  (1) Primary programming language, (2) Maximum team size, (3) FP standards, (4) Application type, (5) Business area type, (6) Development platform, (7) Language type
Insurance                       81  (1) Primary programming language, (2) Application type, (3) Language type, (4) Business area type, (5) User base locations, (6) Lower CASE tool with code generally used, (7) FP standards
Manufacturing                   44  (1) Primary programming language, (2) FP standards, (3) Business area type, (4) Application type, (5) Lower CASE tool with code generally used, (6) Upper CASE tool used, (7) Language type
Public administration           70  (1) Primary programming language, (2) Business area type, (3) Application type, (4) FP standards, (5) User base locations, (6) Maximum team size, (7) User base business units

A brief description of the data sets is given in Table VII. The results of the cross-validation experiments are presented in Table V and the results of the experiments with training data in Table VI.

In experiments 2 and 3 with each data set we used a reduced number of variables, excluding those variables whose values would be difficult to estimate at the start of a project and including only the variables whose values would be known at the beginning of a project. We selected the subsets of variables using a combination of the mutual information measure and our own judgment. For each variable X_i we computed the value of the mutual information measure with the project delivery rate variable, I(X_i, PDR), and then selected a subset of variables from those with the largest I(X_i, PDR) values. Selecting only the variables with the largest values of the mutual information measure could also include in the model variables for which values could not be obtained in the early stages of the project or which are strongly correlated with project delivery rate. For example, summary effort and project elapsed time were used to compute the project delivery rate variable and hence are strongly correlated with it. Variables such as the work effort for planning, testing, and implementation would be difficult to estimate at the start of a project and were therefore not included in the models. In experiment 2 we used 15 variables and in experiment 3 seven variables, plus the project delivery rate variable. For illustration, the seven variables used in experiment 3 for each data set are listed in Table VII.
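A sketch of this ranking step, assuming the mutual_information function from the sketch in Section 5 is in scope; the data layout (a mapping from variable names to value lists, plus the discretized PDR column) is hypothetical.

```python
def top_k_variables(data, pdr, k):
    """Rank candidate variables by I(X_i, PDR) and return the k highest.

    data: dict mapping variable name -> list of discrete values;
    pdr:  the discretized project delivery rate column (same length)."""
    scores = {name: mutual_information(values, pdr)
              for name, values in data.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

In the paper this ranking was combined with the authors' own judgment, so the top-k list is a starting point rather than the final variable subset.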

Copyright 2002 John Wiley & Sons, Ltd. J. Softw. Maint. Evol.: Res. Pract. 2002; 14:161–179

Page 16: Predicting project delivery rates using the Naive–Bayes classifier

176 B. STEWART

Table VIII. Best performance in 21 10-fold cross-validation experiments.

Measure       Naive–Bayes  Model tree  Neural network
PRED(0.25)             12           5               5
PRED(0.50)             13           6               2
PRED(0.75)             17           2               4
MMRE                   15           2               4

Table IX. Best performance in 21 experiments on training data.

Measure       Naive–Bayes  Model tree  Neural network
PRED(0.25)              5           0              17
PRED(0.50)              3           0              18
PRED(0.75)              3           0              19
MMRE                    3           0              20

The results in Table V were obtained using 10-fold cross-validation as described in Section 6.4, and hence come from unseen data that was not used to train the models.

For each of the three models, Naive–Bayes (NB), model tree (MT), and neural network (NN), the results show a wide variation in predictive accuracy, depending on the data set used. In terms of the MMRE values, the best predictions were obtained for the banking, insurance, communication, and manufacturing data sets. The results for the electricity/gas/water, financial/property/business, and public administration data sets were much poorer. Table VIII shows the number of times each of the three models achieved the best results on the accuracy measures in the 21 experiments.

The results in Table VIII show that the Naive–Bayes models produced superior results to model trees and neural networks on most cross-validation experiments. It can also be seen from Table V that Naive–Bayes gave the best overall results on five data sets out of seven.

The results in Table VI were obtained by classifying the training data sets. They indicate how well the three types of models can predict project delivery rates for the projects from which they were constructed. As in Table V, the results for the electricity/gas/water, financial/property/business, and public administration data sets are significantly worse than for the remaining data sets. Table IX summarizes how many times each model achieved the best performance.

From Table IX, as well as from Table VI, it can be seen that neural networks achieved superior performance in almost all experiments on training data. We used five hidden nodes in each neural network model. Neural networks are prone to over-fitting, and very high accuracy on training data may indicate over-fitting. In some cases this can be alleviated by reducing the number of hidden nodes. We conducted experiments, not reported here, with smaller numbers of hidden units (three and four), but in general five hidden units gave better results on cross-validation. Model trees gave results that were only slightly better than for the cross-validation experiments. The reason for this could be that the M5′ algorithm uses pruning to eliminate redundant variables and constructs a pruned tree containing only the most significant variables. This prevents over-fitting.

Comparing the results of experiments 1, 2, and 3 for each data set in Table V shows that the reduction of the number of variables did not greatly affect the values of the accuracy measures. In some experiments the values were better and in some slightly worse. The results indicate that the mutual information measure is a useful technique for reducing the number of variables in data sets.

One of the reasons for the wide variation in performance of all three models on different data sets may be the composition of the individual data sets. The projects in the banking data set were predominantly from the banking business area and were mostly management information systems. The insurance, communication, and manufacturing data sets comprised projects from a wider range of business areas and application types. The three worst performing data sets contained projects from the broadest range of organizations. Poorer results for the more heterogeneous data sets may be partly due to wide differences in the software development practices of different organization types and to the variations in resource requirements of different types of applications. None of the three approaches was able to capture these differences in the models constructed from relatively small data sets.

Another reason for the relatively poor performance of all three machine learning approaches may be that the training sample sizes were too small. Although the sample sizes were not widely different for the different organization types, more heterogeneous data sets would require larger training sets to enable the models to encode the necessary relationships. It is also possible that some of the important variables affecting project delivery rates in such organizations were not included in the Benchmark Release 6 data disk. The performance of all machine learning algorithms is affected by the size of the training set. In general, the accuracy tends to improve with increasing size of the training set.

7. CONCLUSION AND FUTURE WORK

In this paper we investigated the feasibility of the machine learning algorithm called the Naive–Bayes classifier for estimating project delivery rates. The basic idea of this approach is to construct a Naive–Bayes classifier from a historical database of past projects and then use it to predict project delivery rates for new projects. To compare the performance of Naive–Bayes to other machine learning methods we conducted the same experiments with two alternative methods—model trees and artificial neural networks. Experimental results obtained with data from the Benchmark Release 6 data disk from ISBSG [14] are presented in Tables V and VI. The results of the cross-validation experiments in Table V show that, of the three approaches studied, Naive–Bayes produced the best values of the MMRE and PRED(x) measures on most of the cross-validation experiments. The results in Table VI show that the neural network approach gave the best results on the training data experiments. The results of cross-validation experiments indicate how well each model can predict project delivery rates for unseen projects and hence are of greater practical significance than the results on training data. However, none of the models achieved cross-validation results that would be acceptable for practical estimation of project delivery rates. As mentioned in Section 6.3, for satisfactory performance PRED(0.25) should be greater than 0.75 and MMRE less than 0.25.


Although the predictive performance achieved in our experiments is far from satisfactory for practical applications, it needs to be stressed that the data sets used were relatively small and contained a wide variety of projects from different countries, business areas, and application types. It is likely that if the data were local to a particular organization or a group of similar organizations the results would be significantly better. So far, none of the machine learning approaches to the software estimation problem reported in the literature have achieved completely satisfactory results [6–8,10].

It is possible that in the future, when data collection becomes a more widely practiced activity in software organizations and large databases of past cases have been accumulated, machine learning algorithms will yield better results and will play an important part in software cost estimation.

Until now the most popular machine learning algorithms used in software engineering research have been decision tree learners and neural networks. To our knowledge, no studies have been reported on using the Naive–Bayes classifier; our paper attempts to fill this gap. When compared to other machine learning approaches, Naive–Bayes has several advantages. For example, unlike decision trees, Naive–Bayes is robust to noise in the data. This is an important consideration in software engineering, where databases often contain imprecise and incomplete information. Naive–Bayes is also less affected by the problem of over-fitting than neural networks. Over-fitting occurs when the model parameters become highly specialized to the training data and have poor generalization capabilities for new samples. The main disadvantage of our current implementation is the necessity to discretize numeric variables; the prediction results are sensitive to the number of intervals chosen. In the future we plan to investigate the use of Naive–Bayes implementations that can handle continuous variables.

Our empirical experiments have shown that the Naive–Bayes classifier is a valuable tool for the analysis of software engineering data which can be used as an alternative to other, more widely used approaches such as decision trees and neural networks. The results of our experiments have also shown that selecting model variables on the basis of the mutual information measure is a useful technique that can improve the efficiency of the model without reducing its accuracy.

Our future research will focus on applying the Naive–Bayes classifier to a variety of software engineering databases and comparing its performance to other machine learning approaches. We will also investigate the feasibility of using more complex Bayesian network classifiers for the analysis of software engineering data.

REFERENCES

1. Boehm B. Software Engineering Economics. Prentice-Hall: Englewood Cliffs NJ, 1981.
2. COCOMO II. http://sunset.usc.edu/research/COCOMOII/index.html.
3. Albrecht AJ, Gaffney JE. Software function, source lines of code, and development effort prediction. IEEE Transactions on Software Engineering 1983; 9(6):639–648.
4. Putnam LH. A general empirical solution to the macro software sizing and estimating problem. IEEE Transactions on Software Engineering 1978; 4(4):345–361.
5. Briand L, Basili V, Thomas W. A pattern recognition approach for software engineering data analysis. IEEE Transactions on Software Engineering 1992; 18(11):931–942.
6. Briand LC, El Emam K, Surmann D, Wieczorek I, Maxwell KD. An assessment and comparison of common software cost estimation modelling techniques. Proceedings of the 1999 International Conference on Software Engineering (ICSE'99). ACM Press: New York NY, 1999; 313–322.
7. Jorgensen M. Experience with the accuracy of software maintenance task effort prediction models. IEEE Transactions on Software Engineering 1995; 21(8):674–681.
8. Srinivasan K, Fisher D. Machine learning approaches to estimating software development effort. IEEE Transactions on Software Engineering 1995; 21(2):126–137.
9. Shin M, Goel AL. Empirical data modelling in software engineering using radial basis functions. IEEE Transactions on Software Engineering 2000; 26(6):567–576.
10. Shepperd M, Schofield C. Estimating software project effort using analogies. IEEE Transactions on Software Engineering 1997; 23(12):736–743.
11. Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning 1997; 29(2–3):131–163.
12. Chow CK, Liu CN. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory 1968; 14:462–467.
13. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann: San Mateo CA, 1988.
14. International Software Benchmarking Standards Group, 2002. http://www.isbsg.org.au [27 March 2002].
15. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Wadsworth: Belmont CA, 1984.
16. Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann: San Mateo CA, 1993.
17. Bishop CM. Neural Networks for Pattern Recognition. Oxford University Press: New York NY, 1995.
18. Cover TM, Hart PE. Nearest neighbour pattern classification. IEEE Transactions on Information Theory 1967; 13(1):21–27.
19. Goldberg DE. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley: Reading MA, 1989.
20. Hugin Expert, Aalborg, Denmark, 2001. http://www.hugin.com/cases/ [27 March 2002].
21. Agena, London, UK, 2001. http://www.agena.co.uk/ [27 March 2002].
22. Association for Uncertainty in Artificial Intelligence, 2001. http://www.auai.org/ [27 March 2002].
23. Neapolitan RE. Probabilistic Reasoning in Expert Systems: Theory and Algorithms. John Wiley: New York NY, 1990.
24. Castillo E, Gutierrez JM, Hadi AS. Expert Systems and Probabilistic Network Models. Springer: New York NY, 1997.
25. Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ. Probabilistic Networks and Expert Systems. Springer: New York NY, 1999.
26. Jensen FV. An Introduction to Bayesian Networks. UCL Press: London, 1996.
27. Lauritzen SL, Spiegelhalter DJ. Local computations with probabilities on graphical structures and their applications to expert systems (with discussion). Journal of the Royal Statistical Society, Series B 1988; 50:157–224.
28. Quinlan JR. Learning with continuous classes. Proceedings of the Fifth Australian Joint Conference on Artificial Intelligence, Adams N, Sterling L (eds.). World Scientific: Singapore, 1992; 343–348.
29. WEKA, The University of Waikato, New Zealand, 2001. http://www.cs.waikato.ac.nz/ml/weka [27 March 2002].
30. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann: San Francisco CA, 2000.
31. Motoda H, Liu H. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer: Dordrecht, 1998.
32. Pfleeger SL. Software Engineering: Theory and Practice. Prentice-Hall: Upper Saddle River NJ, 1998.

AUTHOR’S BIOGRAPHY

Bozena Stewart is a Lecturer in Computing in the School of Computing and Information Technology at the University of Western Sydney, Australia. Her research interests are primarily in the fields of artificial intelligence and software engineering. Her current research includes Bayesian networks, machine learning, software cost estimation, and data mining. She teaches object-oriented programming, data structures and algorithms, software engineering, and artificial intelligence. She holds a PhD degree in computer science from the University of Technology, Sydney.
