2cee: A 21st Century Effort Estimation Methodology
Tim Menzies, Dan Baker (firstname.lastname@example.org, email@example.com)
Jairus Hihn, Karen Lum (firstname.lastname@example.org, email@example.com)
22nd International Forum on COCOMO and Systems/Software Cost Modeling (2007)
2cee2 Our Journey
It quickly became apparent in the early stages of our research task that there was a major disconnect between the techniques used by estimation practitioners and the numerous ideas being addressed in the research community.
It also became clear that many fundamental estimation questions were not being addressed:
What is a model's real estimation uncertainty?
How many records are required to calibrate? Answers have varied from 10-20 just for intercept and slope.
If we do not have enough data, what is the impact on model uncertainty?
Data is expensive to collect and maintain, so we want to keep cost drivers and effort multipliers as few as possible. But what are the right ones?
When should we build domain-specific models?
What are the best functional forms?
What are the best ways to tune/calibrate a model?
2cee3 Our Journey Continued
Data mining techniques provided us with the rigorous tool set we needed to explore, in a repeatable manner, the many dimensions of the problem we were addressing:
Use different calibration and validation datasets
Analyze standard and non-standard models
Perform exhaustive searches over all parameters and records in order to guide data pruning of rows (stratification) and columns (variable reduction)
Measure model performance by multiple measures
We have even been able to determine which performance measures are best.
2cee4 Some Things We Learned Along the Way
2cee5 Local Calibration Does Not Always Improve Performance
Effort models were learned via either standard LC or COSEEKMO.
The top plot shows the number of projects in 27 subsets of our two data sources.
The middle and bottom plots show the standard deviation and mean of the performance error.
Data subsets are sorted by the standard deviation of the error.
For the NASA data set, Local Calibration (LC), i.e. re-estimating only a and b, does not produce the best model. A more thorough analysis is required, including reducing the number of variables.
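To make "re-estimating only a and b" concrete, here is a minimal sketch of one common way to do local calibration. It assumes the usual COCOMO form effort = a * KSLOC^b * (product of effort multipliers) and fits a and b by least squares in log space. The function name, the tuple layout, and the synthetic data are illustrative assumptions, not the procedure used in COSEEKMO or 2cee.

```python
import math

def local_calibration(projects):
    """Fit a and b in effort = a * ksloc**b * em_product.

    projects: list of (ksloc, em_product, actual_effort) tuples, where
    em_product is the product of a project's effort multipliers
    (hypothetical input layout). Uses least squares on
    log(effort / em_product) = log(a) + b * log(ksloc).
    """
    xs = [math.log(ksloc) for ksloc, _, _ in projects]
    ys = [math.log(effort / em) for _, em, effort in projects]
    n = len(projects)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log space
    a = math.exp(mean_y - b * mean_x)  # intercept, mapped back from log space
    return a, b

# Synthetic projects generated from a = 3.0, b = 1.12 (made-up values).
projects = [
    (k, em, 3.0 * k ** 1.12 * em)
    for k, em in [(10, 1.0), (50, 1.2), (120, 0.9), (300, 1.1)]
]
a, b = local_calibration(projects)
```

Because the synthetic data has no noise, the fit recovers the generating a and b. With real project data the two fitted values are only as stable as the records behind them, and the cost drivers themselves are left untouched.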
2cee6 Stratification Does Not Always Improve Performance
The plots show mean performance error (i.e. |predicted - actual| / actual) based on 30 experiments with each subset.
The dashed horizontal lines show the error rate of models learned from all data from the two sources.
The crosses show the mean error performance seen in models learned from subsets of that data.
Crosses below/above the lines indicate models performing better/worse (respectively) than models built from all the data.
Stratification does not always improve model performance. Results show it is 50-50.
The main implication is that one must really know one's data, as there is no single solution for determining the best approach to model calibration.
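The error measure and the "cross versus dashed line" comparison can be sketched as follows. This is an illustration only: the (predicted, actual) pairs are made-up numbers, and the helper names are hypothetical.

```python
from statistics import mean

def relative_error(predicted, actual):
    """Magnitude of relative error: |predicted - actual| / actual."""
    return abs(predicted - actual) / actual

def mean_error(pairs):
    """Mean relative error over (predicted, actual) pairs."""
    return mean(relative_error(p, a) for p, a in pairs)

# Made-up (predicted, actual) effort pairs, for illustration only.
all_data = [(110.0, 100.0), (45.0, 50.0), (260.0, 200.0), (30.0, 40.0)]
subset = [(110.0, 100.0), (45.0, 50.0)]

baseline = mean_error(all_data)      # plays the role of the dashed line
subset_error = mean_error(subset)    # plays the role of one cross
subset_is_better = subset_error < baseline
```

In the slides each cross is the mean over 30 experiments with a model learned from that subset. The sketch skips the repeated experiments and only shows the comparison itself.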
2cee7 Cost Driver Instability
The bottom line is that we have way too many cost drivers in our models!
Furthermore, which smaller set is best varies across different domains and stratifications.
The cost drivers that are unlikely to improve model performance are pcap, vexp, lexp, modp, tool, and sced.
For more contemporary data, stor and time are expected to drop out because there are fewer computer constraints these days, and modp may become more significant.
2cee8 Some Good News
Physical SLOC always loads as significant, with no language adjustment.
The standard functional form (shown on the original slide) is virtually always selected, as indicated by the non-standard model M5P being selected only once.
The out-of-the-box version of COCOMO 81 is almost always the best model on the original COCOMO81 data. View this as a sanity check on our methodology.
However, for the NASA93 data:
sometimes one can use the model right out of the box;
sometimes local calibration is sufficient;
sometimes a full regression analysis needs to be performed to obtain optimal results.
2cee9 Key Research Findings
The same approach is never best, but some combination of the following always wins: LC, column pruning, nearest neighbor. Which is best is determined case by case.
Our models have too many inputs. Measures of RE go up with over-specified models.
Median measures of error, not mean or Pred, should be used to compare models.
There is an instability issue due to the small data sets with significant outliers, which makes it difficult to determine which estimation model or calibration is best. Mann-Whitney U test.
Manual stratification does not lead to the best model. E.g. a combination of flight SW and Class B ground produces a better model than just selecting all your flight records and doing LC.
Nearest neighbor searches for analogous records based on your current project model inputs.
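A nearest neighbor search over historical records might look like the sketch below. It assumes Euclidean distance on min-max normalized inputs. The record layout, the two input columns (size in KSLOC and one effort multiplier), and the data are hypothetical, and the slides do not say which distance measure 2cee uses.

```python
import math

def nearest_neighbors(history, target, k):
    """Return the k historical records whose inputs are closest to target.

    history: list of dicts with a numeric 'inputs' list (hypothetical layout).
    Each input column is min-max normalized so that a large-valued column
    such as size does not dominate the distance.
    """
    columns = list(zip(*(record["inputs"] for record in history)))
    lows = [min(column) for column in columns]
    spans = [(max(column) - min(column)) or 1.0 for column in columns]

    def scale(values):
        return [(v - lo) / span for v, lo, span in zip(values, lows, spans)]

    scaled_target = scale(target)

    def distance(record):
        return math.dist(scale(record["inputs"]), scaled_target)

    return sorted(history, key=distance)[:k]

# Made-up records: inputs are [KSLOC, one effort multiplier].
history = [
    {"name": "flight-1", "inputs": [20.0, 1.15]},
    {"name": "flight-2", "inputs": [200.0, 1.30]},
    {"name": "ground-1", "inputs": [25.0, 1.10]},
    {"name": "ground-2", "inputs": [400.0, 0.90]},
]
analogs = nearest_neighbors(history, target=[22.0, 1.12], k=2)
names = [record["name"] for record in analogs]
```

Here the two closest analogs are one ground record and one flight record, which mirrors the point that the best calibration set can cut across manual strata.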
2cee10 2cee: 21st Century Estimation Environment
Just born: released October 2, 2007.
Result of four years of research using machine learning techniques to study model calibration and validation techniques.
Probabilistic.
Key Features:
Dynamic calibration using variable reduction and nearest neighbor search
Can be used as a model analysis tool, a calibration tool, and/or an estimation tool
Can estimate with partial inputs
Uses N-fold cross validation (leave-one-out cross validation when N equals the number of records)
Uses median, not mean, to evaluate model performance
Runs in Windows, coded in Visual Basic
We will be running it in parallel with core tools over the next year.
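The combination of leave-one-out cross validation and a median error score can be sketched as below. The effort-per-KSLOC model is a deliberately simple stand-in, not the COCOMO calibration 2cee performs, and the record layout and data are made up.

```python
from statistics import median

def leave_one_out_median_mre(records, fit, predict):
    """Hold out each record in turn, fit on the rest, and score the estimate.

    Returns the median of |estimate - actual| / actual over all held-out
    records. fit and predict are supplied by the caller.
    """
    errors = []
    for i, held_out in enumerate(records):
        training = records[:i] + records[i + 1:]
        model = fit(training)
        estimate = predict(model, held_out)
        errors.append(abs(estimate - held_out["effort"]) / held_out["effort"])
    return median(errors)

def fit_ratio(training):
    """Stand-in model (assumption): average effort per KSLOC."""
    return sum(r["effort"] for r in training) / sum(r["ksloc"] for r in training)

def predict_ratio(rate, record):
    return rate * record["ksloc"]

# Made-up records; the last one is an outlier.
records = [
    {"ksloc": 10, "effort": 30},
    {"ksloc": 20, "effort": 60},
    {"ksloc": 40, "effort": 120},
    {"ksloc": 50, "effort": 400},
]
score = leave_one_out_median_mre(records, fit_ratio, predict_ratio)
```

Passing fit and predict as arguments keeps the validation loop independent of the model, so the same loop could score an out-of-the-box model, a locally calibrated one, or a column-pruned one.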
2CEE
Load Historical Data
Calibration options: Use Predefined COCOMO Coefficients; Bootstrapped Local Calibration; Full Local Calibration; Nearest Neighbour Local Calibration
Optionally Use Manual or Automatic Feature Selection
Optionally Use Manual Stratification
Define Project Ranges
Monte Carlo Project Instances
Produce Range of COCOMO Estimates
2CEE Steps: Define Model Calibration; Evaluate with Cross Validation; Define Project Ranges; Monte Carlo Estimates
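The last two steps, defining project ranges and producing Monte Carlo estimates, can be sketched as follows. This assumes uniform sampling within each range and the usual COCOMO form effort = a * KSLOC^b * (product of effort multipliers). The coefficient values, ranges, and function name are illustrative assumptions, and the slides do not state which sampling distribution 2cee uses.

```python
import random

def monte_carlo_estimates(a, b, ksloc_range, em_ranges, trials=1000, seed=1):
    """Sample project instances from ranges and return sorted effort estimates.

    ksloc_range: (low, high) size range in KSLOC.
    em_ranges: list of (low, high) ranges, one per effort multiplier.
    Uniform sampling is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        ksloc = rng.uniform(*ksloc_range)
        em_product = 1.0
        for low, high in em_ranges:
            em_product *= rng.uniform(low, high)
        estimates.append(a * ksloc ** b * em_product)
    estimates.sort()
    return estimates

# Illustrative coefficients and ranges (made-up values).
estimates = monte_carlo_estimates(
    a=3.0, b=1.12,
    ksloc_range=(40, 60),
    em_ranges=[(0.9, 1.1), (1.0, 1.3)],
)
# Approximate 5th, 50th, and 95th percentile estimates.
low, mid, high = estimates[50], estimates[500], estimates[949]
```

Reporting a low, middle, and high estimate rather than a single number is what makes the output a range of COCOMO estimates.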
2cee14 2cee Provides Insight into Model Performance and Tuning
E.g. officially, COCOMO's tuning parameters vary 2.5