MSc Software Maintenance / MS Viðhald Hugbúnaðar, Lectures 43 and 44: Estimating Effort for Corrective Software Maintenance. Dr Andy Brooks.


Page 1:

MSc Software Maintenance / MS Viðhald Hugbúnaðar

Lectures 43 and 44 (Fyrirlestrar 43 og 44)

Estimating Effort for Corrective Software Maintenance

Page 2:

Case Study / Dæmisaga

Reference: Andrea De Lucia, Eugenio Pompella, and Silvio Stefanucci, “Effort Estimation for Corrective Software Maintenance”, Proceedings of the Fourteenth International Conference on Software Engineering and Knowledge Engineering (SEKE’02), pp. 409-416, 2002. ©ACM

Page 3:

1. Introduction

• Effort estimation helps managers:
– plan resource and staff allocation
– prepare less risky bids for external contracts
– make maintain versus buy decisions

• Effort estimation is complicated by:
– the different types of software maintenance
• corrective, adaptive, perfective, preventive
– the scope of software maintenance work
• simple method fixes through to full reengineering

Page 4:

1. Introduction

• Effort estimation requires the use of quantitative metrics.

• Software maintenance costs are mainly human resource costs.
– the person-days needed

• A linear or non-linear relationship between complexity/size and effort is “commonly assumed”.

Page 5:

Estimation by analogy: a simple fictitious example by Andy

• The following historical data is available:
– Project A involved 100 maintenance requests for 110,000 LOC and took 25 person-days.
– Project B involved 105 maintenance requests for 111,000 LOC and took 28 person-days.
– Project C involved 20 maintenance requests for 2,000 LOC and took 2 person-days.

• Project D will involve 85 maintenance requests on 91,000 LOC, so how much effort is required?

• Project A is the closest match, so the effort expended for Project A can be used as an estimate for Project D: 25 person-days.
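A minimal Python sketch of this closest-match lookup, using only the fictitious numbers above; no standardisation of the variables is attempted in something this small:

```python
import math

# Historical projects from the fictitious example: (maintenance requests, LOC, person-days).
history = {
    "A": (100, 110_000, 25),
    "B": (105, 111_000, 28),
    "C": (20, 2_000, 2),
}
project_d = (85, 91_000)  # requests and LOC for Project D; effort unknown

def distance(p, q):
    """Euclidean distance over (requests, LOC)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

closest = min(history, key=lambda name: distance(history[name][:2], project_d))
estimate = history[closest][2]
print(f"Closest analogy: Project {closest}, estimated effort {estimate} person-days")
# -> Closest analogy: Project A, estimated effort 25 person-days
```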

2. RELATED WORK

Page 6:

Shepperd, M., Schofield, C., and Kitchenham, B. Effort Estimation Using Analogy. Proceedings of the International Conference on Software Engineering (ICSE´96), ©IEEE, 1996, 170-178.

• The first step is deciding on the variables used to describe projects.
– “all datasets had at least one variable that was in some sense size related”

• The second step is deciding on how to determine similarity.
– “Analogies are found by measuring Euclidean distance in n-dimensional space where each dimension corresponds to a variable. Values are standardised so that each dimension contributes equal weight to the process of finding analogies.”

2. RELATED WORK

ArchANGEL tool here: http://dec.bournemouth.ac.uk/ESERG/ANGEL/

Page 7:

Shepperd, M., Schofield, C., and Kitchenham, B. Effort Estimation Using Analogy. Proceedings of the International Conference on Software Engineering (ICSE´96), ©IEEE, 1996, 170-178.

• “In N dimensions, the Euclidean distance between two points p and q is √(∑_{i=1}^{N} (p_i − q_i)²), where p_i (or q_i) is the coordinate of p (or q) in dimension i.”
– http://www.nist.gov/dads/


2. RELATED WORK

Figure: two points (x1, y1) and (x2, y2) in the plane.
Euclidean distance: √((x1 − x2)² + (y1 − y2)²)
Manhattan distance: |x1 − x2| + |y1 − y2|
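For completeness, the two distance measures as plain Python functions; the sample points are arbitrary:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points in n-dimensional space."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of the absolute differences per dimension."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```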

Page 8: 12/07/2015Dr Andy Brooks1 MSc Software Maintenance MS Viðhald Hugbúnaðar Fyrirlestrar 43 og 44 Estimating Effort for Corrective Software Maintenance

Shepperd, M., Schofield, C., and Kitchenham, B. Effort Estimation Using Analogy. Proceedings of the International Conference on Software Engineering (ICSE´96), ©IEEE, 1996, 170-178.

• The third step is deciding how to use known effort data to derive an effort estimate for the new project.
– just use the effort for the closest project?
– average the effort for the X closest projects?
– average the effort for the X closest projects, weighting by closeness of matching?

• Shepperd et al. used X = 2 and an unweighted average.
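A hedged sketch of this third step, reusing the fictitious Project A, B, C numbers from the earlier example. The unweighted average of the two closest analogies mirrors Shepperd et al.'s choice; the inverse-distance weighting is just one common alternative, not something taken from their paper:

```python
import math

# Historical projects: (requests, LOC) paired with effort in person-days (fictitious data).
history = [((100, 110_000), 25), ((105, 111_000), 28), ((20, 2_000), 2)]
target = (85, 91_000)  # the new project

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Rank the historical projects by closeness to the new project and keep the X = 2 nearest.
ranked = sorted(history, key=lambda item: distance(item[0], target))
nearest = ranked[:2]

# Unweighted average of the two closest analogies.
unweighted = sum(effort for _, effort in nearest) / len(nearest)

# Alternative: weight each analogy by the inverse of its distance to the new project.
weights = [1.0 / distance(features, target) for features, _ in nearest]
weighted = sum(w * effort for w, (_, effort) in zip(weights, nearest)) / sum(weights)

print(f"Unweighted k=2 estimate: {unweighted:.1f} person-days")
print(f"Distance-weighted k=2 estimate: {weighted:.1f} person-days")
```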

2. RELATED WORK

Page 9:

Shepperd, M., Schofield, C., and Kitchenham, B. Effort Estimation Using Analogy. Proceedings of the International Conference on Software Engineering (ICSE´96), ©IEEE, 1996, 170-178.

• Effort estimation using analogy was found to outperform traditional algorithmic methods for six different datasets.
– later studies, however, did not support this finding

• Shepperd et al. suggest it is better to use more than one estimation technique, to assess the degree of risk associated with a prediction.
– if effort estimates from regression analysis and analogy strongly disagree, then perhaps any estimate is unsafe
– Andy says: in industrial projects, it is unlikely that resources are available to apply more than one technique

2. RELATED WORK

Page 10:

3. Experimental Setting

• Multiple linear regression analysis was applied to real data from five corrective maintenance projects from different companies.
– All five corrective maintenance projects were outsourced to one supplier company whose maintenance process closely followed the IEEE Standard for Software Maintenance.

• The data set comprised 144 observations corresponding to monthly maintenance periods.
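To make the mechanics concrete, here is a minimal ordinary-least-squares sketch in NumPy. The variable names (NA, NB, NC, size) follow the paper's metrics, but the numbers are invented toy data, not the 144 observations, and the fitted coefficients are not the paper's Models A, B or C:

```python
import numpy as np

# Hypothetical monthly observations: columns are [NA, NB, NC, size]; targets are effort figures.
X = np.array([
    [5, 40, 30, 500.0],
    [3, 55, 25, 500.0],
    [8, 20, 60, 750.0],
    [2, 10, 15, 300.0],
    [6, 35, 45, 750.0],
    [4, 25, 20, 300.0],
    [7, 30, 50, 750.0],
    [1, 12, 10, 300.0],
], dtype=float)
y = np.array([400.0, 450.0, 520.0, 150.0, 480.0, 260.0, 500.0, 120.0])

# Add an intercept column and fit ordinary least squares (minimises the sum of squared errors).
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# R^2 on the fitting data only; it says nothing about performance on unseen months.
y_hat = A @ coef
r_squared = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print("intercept and coefficients:", np.round(coef, 3))
print("R^2 on the fitting data:", round(r_squared, 3))
```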

Page 11:

Treatment of missing values

• If a value is missing, one approach is simply to exclude the entire observation [effort, size, NA, NB, NC] from the model-building process.
– the safest approach

• Another approach is to substitute the mean or the median value calculated from the other observations.

• Yet another approach is to find the most similar observation and use the value found there.
– best analogy found by calculating Euclidean distances

• Fortunately, the data set did not contain missing values.
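A small sketch of all three treatments on invented data; the real data set needed none of them:

```python
import numpy as np

# Toy observations [effort, size, NA, NB, NC]; np.nan marks a missing value (illustrative only).
data = np.array([
    [400.0, 500.0, 5.0, 40.0, 30.0],
    [450.0, 500.0, 3.0, np.nan, 25.0],
    [150.0, 300.0, 2.0, 10.0, 15.0],
    [480.0, 750.0, 6.0, 35.0, 45.0],
])

# 1. Listwise deletion: drop any observation containing a missing value (the safest approach).
complete = data[~np.isnan(data).any(axis=1)]

# 2. Imputation: replace each missing value with its column mean (np.nanmedian for the median).
imputed = data.copy()
col_means = np.nanmean(data, axis=0)
rows, cols = np.where(np.isnan(imputed))
imputed[rows, cols] = col_means[cols]

# 3. Best analogy: copy the value from the most similar complete observation,
#    similarity measured by Euclidean distance over the columns that are present.
def nearest_donor(row, donors):
    present = ~np.isnan(row)
    dists = np.sqrt(((donors[:, present] - row[present]) ** 2).sum(axis=1))
    return donors[np.argmin(dists)]

analogised = data.copy()
for row in analogised:
    missing = np.isnan(row)
    if missing.any():
        row[missing] = nearest_donor(row, complete)[missing]

print(complete.shape, imputed[1], analogised[1])
```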

Missing Data Techniques

Page 12:

Data available

• Size of the system.
• Effort spent in the maintenance period.
• Number of maintenance tasks by type:
– type A: source code modification
– type B: fixing data misalignments through database queries
• data cleansing
– type C (not A or B): user disoperation, problems out of contract, etc.

• Other metrics, such as software complexity, were not available in full across all the projects.

3. Experimental Setting

Page 13:

Table 1: Collected Metrics ©ACM

3. Experimental Setting

Page 14:

Table 2: Descriptive statistics ©ACM


3. Experimental Setting

144 observations, monthly maintenance periods; 1960/(35 hrs × 4 wks) = 14 person-months

Page 15:

4. Building Effort Estimation Models

• Multiple linear regression analysis minimizes the sum of the squared errors.

• Regression analysis is said to be “as good as or better than many competing modeling techniques”.
– see references [7] and [18] of the case study article, which showed estimation by analogy was not better

• Incorporating the size of a maintenance task would be useful, but this metric was not available.

• Analysis of residuals from the regression analyses revealed no non-linearity or other trends.

Page 16:

Dealing with outliers

• If a value is deemed to be an outlier, one approach is to exclude the entire observation.
– outliers can be caused by transcription errors

• In a box plot, the box contains the middle 50% of the data set.
– the box spans the interquartile range (IQR)

• More than 1.5 × IQR away from the box, a value is a suspected outlier.

• More than 3.0 × IQR away from the box, a value is deemed an outlier.


http://www.physics.csbsju.edu/stats/box2.html

There were no obvious outliers in the data set.

outlier/enfari
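A sketch of the box-plot rule in Python; the effort figures are made up, with one deliberately implausible value standing in for a transcription error:

```python
import numpy as np

def iqr_outliers(values):
    """Flag suspected (beyond 1.5*IQR) and deemed (beyond 3.0*IQR) outliers, box-plot style."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    suspected = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    deemed = (values < q1 - 3.0 * iqr) | (values > q3 + 3.0 * iqr)
    return suspected, deemed

# Toy monthly effort figures; the 9999 entry mimics a transcription error.
effort = [320, 340, 310, 360, 300, 330, 9999]
suspected, deemed = iqr_outliers(effort)
print(suspected)
print(deemed)
```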

Page 17:

Table 3: Metrics correlation matrix ©ACM

• There are no strong correlations between the independent variables used to build the regression models.

• N (total number) correlates less well with NA possibly because NA is much smaller than NB and NC.

• No explanation is given for the correlation r = 0.6458.

“strong” usually means r > 0.7

correlation matrix/fylgnifylki

Page 18:

Critical commentary from Andy

• Regression models are built assuming that model variables are independent.
– so it is important to carry out checks, e.g. examine correlations

• We do not know the nature of the correlation coefficient used. Pearson is applied to normally distributed data and Spearman to non-normally distributed data.
– sometimes researchers compute both to be sure

• There are some large differences between means and medians in Table 2, which suggests non-normality.
– Spearman correlation coefficients should have been calculated

• The correlation of 0.6458 suggests a real linkage between NA and NC, i.e. they may not be independent.
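Computing both coefficients is a one-liner each with SciPy; the task counts below are invented and only show how a single skewed value can pull the two measures apart:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Invented task counts for two metrics (say NA and NC); the last NC value is deliberately skewed.
na = np.array([2, 3, 5, 4, 8, 6, 9, 12])
nc = np.array([15, 20, 24, 22, 35, 30, 40, 90])

r_pearson, p_pearson = pearsonr(na, nc)        # assumes roughly normal data
rho_spearman, p_spearman = spearmanr(na, nc)   # rank-based, robust to non-normality

print(f"Pearson r = {r_pearson:.3f} (p = {p_pearson:.3f})")
print(f"Spearman rho = {rho_spearman:.3f} (p = {p_spearman:.3f})")
```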

Page 19:

Some plots illustrating correlations of various sizes

http://www.jerrydallal.com/

Page 20:

Effort estimation models A, B, C

• NBC is the sum of NB and NC


4. Building Effort Estimation Models

recall

Page 21:

4.1 Evaluating Model Performances

• The coefficient of determination R² represents the percentage of variation in the dependent variable explained by the independent variables of the model.

• Having a high R² does not guarantee the quality of future predictions.
– R² does not represent the performance of the model on a different data set, only the data set upon which the model was built.

Page 22:

Table 4: Model parameters ©ACM

• All model variables are statistically significant (p < 0.05).
• Model C explains 90% of the variation in effort.

Page 23:

Assessing the quality of future predictions:
PRESS (PREdiction Sum of Squares)

• ŷ (y-hat) means predicted value.

• The residual represents the difference between the ith value in the data set and the value predicted from a regression analysis using all data points except the ith.

• In a data set of size n, n separate regression equations are calculated.

• Smaller PRESS scores are better.

• PRESS is also known as “leave-one-out cross validation”.
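In symbols, PRESS = ∑ (y_i − ŷ_(i))², where ŷ_(i) is the prediction for the ith observation from a model fitted without that observation. A compact sketch of that procedure on toy data follows; it also returns the SPR of the next slide (the sum of absolute rather than squared PRESS residuals):

```python
import numpy as np

def press_and_spr(X, y):
    """Leave-one-out cross validation: refit the regression n times, each time predicting
    the held-out observation, and accumulate the PRESS and SPR scores."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])  # add an intercept
    y = np.asarray(y, dtype=float)
    residuals = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        residuals.append(y[i] - X[i] @ coef)  # PRESS residual for the held-out observation
    residuals = np.array(residuals)
    return np.sum(residuals ** 2), np.sum(np.abs(residuals))  # (PRESS, SPR)

# Toy data: [NA, NB, NC] task counts per month paired with effort; illustrative only.
X = [[5, 40, 30], [3, 55, 25], [8, 20, 60], [2, 10, 15], [6, 35, 45], [4, 25, 20], [7, 30, 50]]
y = [400, 450, 520, 150, 480, 260, 500]
press, spr = press_and_spr(X, y)
print(f"PRESS = {press:.1f}, SPR = {spr:.1f}")
```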

4.1 Evaluating Model Performances

Page 24:

Assessing the quality of future predictions:
SPR

• ŷ (y-hat) means predicted value.

• The residual represents the difference between the ith value in the data set and the value predicted from a regression analysis using all data points except the ith.

• SPR is the sum of the absolute values rather than the squares of the PRESS residuals.

• SPR is used when a few large PRESS residuals can inflate the PRESS score unreasonably.

4.1 Evaluating Model Performances

Page 25:

Assessing the quality of future predictions:
MMRE (Mean Magnitude of Relative Error)

• MREi is the magnitude of the relative error.

• ŷ (y-hat) means predicted value.

• The residual represents the difference between the ith value in the data set and the value predicted from a regression analysis using all data points except the ith.

• MMRE is the mean magnitude.

• MdMRE is the median magnitude. MMRE might be dominated by a few MREs with very high values.

4.1 Evaluating Model Performances

Page 26:

Assessing the quality of future predictions:
PRED

• RE is the relative error.

• “We believe that maintenance managers may, in most cases and specially for small maintenance tasks, accept a relative error between the actual and predicted effort of about 50%.”

• According to reference [36] (1991) of the case study article, an average error of 100% can be considered “good” and an average error of 32% “outstanding”.
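Taken together, the accuracy measures are easy to compute from paired actual and predicted efforts: MRE_i = |y_i − ŷ_i| / y_i, MMRE is its mean, MdMRE its median, and PRED(l) is the proportion of cases with MRE_i ≤ l. The values below are invented leave-one-out predictions, not the paper's:

```python
import numpy as np

def mre_summary(actual, predicted, pred_levels=(0.25, 0.50)):
    """MMRE, MdMRE and PRED(l) from paired actual/predicted effort values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mre = np.abs(actual - predicted) / actual          # magnitude of relative error per case
    summary = {"MMRE": float(mre.mean()), "MdMRE": float(np.median(mre))}
    for level in pred_levels:
        # PRED(l): proportion of cases whose relative error is at most l.
        summary[f"PRED{int(level * 100)}"] = float(np.mean(mre <= level))
    return summary

# Invented leave-one-out predictions (person-hours), purely to exercise the measures.
actual    = [400, 450, 520, 150, 480, 260, 500, 310]
predicted = [420, 400, 700, 160, 430, 390, 510, 300]
print(mre_summary(actual, predicted))
```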

4.1 Evaluating Model Performances

Page 27:

Table 5: Leave-one-out cross validation ©ACM

• Model C is clearly better.
– Almost 50% of cases have a relative error of less than 25%.
– Almost 83% of cases have a relative error of less than 50%.

4.1 Evaluating Model Performances

Page 28:

Leave More Out Cross Validation for Model C

• The data set is randomly partitioned into a training data set and a test set.

• The training data set is used to build the model.

• The test data set is used to assess the quality of the model’s prediction.

• Lx means the training (learning) data set is composed of x% of the observations.

• T100-x means the test data set is composed of 100-x% of the observations.


extending the evaluation of Model C
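One way to realise the Lx/T(100-x) procedure is sketched below. The repeat count is a guess, since the paper does not state how many random partitions were averaged, and the data are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(X, coef):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def leave_more_out(X, y, learn_fraction, repeats=10):
    """Randomly split into an Lx learning set and a T(100-x) test set, fit on the learning
    set, and average the test-set MMRE over `repeats` random partitions."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    mmres = []
    for _ in range(repeats):
        idx = rng.permutation(n)
        n_learn = int(round(learn_fraction * n))
        learn, test = idx[:n_learn], idx[n_learn:]
        coef = fit(X[learn], y[learn])
        mre = np.abs(y[test] - predict(X[test], coef)) / y[test]
        mmres.append(mre.mean())
    return float(np.mean(mmres))

# Toy monthly observations ([NA, NB, NC] and effort); illustrative only.
X = [[5, 40, 30], [3, 55, 25], [8, 20, 60], [2, 10, 15], [6, 35, 45],
     [4, 25, 20], [7, 30, 50], [9, 45, 55], [1, 12, 10], [6, 28, 35]]
y = [400, 450, 520, 150, 480, 260, 500, 610, 120, 430]
for x_pct in (90, 75, 50):
    print(f"L{x_pct}-T{100 - x_pct}: average MMRE = {leave_more_out(X, y, x_pct / 100):.2f}")
```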

Page 29:

Table 6: Leave more out cross validation with random partitions

• As the size of the learning set decreases, so does the quality of prediction, as expected.

Model C

Page 30:

Critical commentary from Andy

• It is not stated how many partitions were used to establish each of the average values in Table 6.
– a minimum sample size of 10 is usually required to compute an average with reasonable accuracy

• The trend in Table 6 makes sense, but it is difficult to believe the PRED values in Table 6 for L90-T10.

• PRED50 = 100%, yet PRED50 is only 82.64% when all the data except one observation is used for training.
– The authors should have addressed what appears to be an anomalous result.

Model C

Page 31:

Table 7: Cross validation at a project level

• Column P1 represents training with P2, P3, P4, and P5 and the results of testing on project P1.
– the regression analysis has “no knowledge” of project P1


Model C, 5 projects
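A sketch of this leave-one-project-out protocol; project labels, predictor columns, and effort figures are all invented:

```python
import numpy as np

def fit(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def project_level_cv(X, y, project_ids):
    """For each project, train on the other projects' observations and report the MMRE
    obtained when predicting the held-out project's monthly efforts."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    project_ids = np.asarray(project_ids)
    results = {}
    for p in np.unique(project_ids):
        train, test = project_ids != p, project_ids == p
        coef = fit(X[train], y[train])
        y_hat = np.column_stack([np.ones(test.sum()), X[test]]) @ coef
        results[str(p)] = float(np.mean(np.abs(y[test] - y_hat) / y[test]))
    return results

# Toy monthly observations tagged with their project; illustrative only.
X = [[5, 40], [3, 55], [8, 20], [2, 10], [6, 35], [4, 25], [7, 30], [9, 45]]
y = [400, 450, 520, 150, 480, 260, 500, 610]
projects = ["P1", "P1", "P2", "P2", "P3", "P3", "P4", "P4"]
print(project_level_cv(X, y, projects))
```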

Page 32:

Cross validation at the project level

• Predictive performance is poor when using projects P1 and P3 as test sets.

• Project P1 had no maintenance tasks of type B and this might explain the poor predictive performance of a model which actually has NB as a predictor variable.

• No explanation is provided for the poor predictive performance using P3 as a test set.

Model C

Page 33:

5. Conclusion

• Previously, the supplier company (i.e. the company doing the maintenance) had used a prediction model which did not distinguish between different types of maintenance task.

• PRED values for this earlier prediction model were not very satisfactory.
– PRED25 = 33.33%
– PRED50 = 53.47%

• The authors believed that modelling different types of maintenance task (A, B, and C) would improve prediction, which it did, especially for Model C.
– PRED25 = 49.31%
– PRED50 = 82.64%

leave-one-out PREDs

Page 34:

5. Conclusion

• More complicated prediction models could be built, but the authors chose not to, so that the models could be easily calculated by working engineers and managers.

• Effort estimation can also involve estimating values for the independent variables.
– estimating the number and type of maintenance tasks “ex ante” in a forthcoming maintenance period can be done reasonably accurately from historical data
• more complicated models involve more variables to estimate, making it more difficult to predict forthcoming effort

Page 35:

5. Conclusion

• The greatest weakness of using regression models for effort estimation is that they only apply to the “analyzed domain and technological environment”.
– i.e. the prediction models are company-specific and you cannot apply the values determined for the model coefficients in another company setting.

• Andy says: this is a likely explanation for the “cross validation at the project level” results. Projects were from different companies, so it is perhaps not surprising that trying to predict for one company using data from other companies sometimes did not work well.
– P1 and P3 in Table 7

Page 36:

5. Conclusion

• The models presented were adopted by the supplier company providing maintenance services.
