Building Cost Estimation Models Using Homogeneous Data. Rahul Premraj, Thomas Zimmermann. Saarland University, Germany; University of Calgary, Canada.

Building Cost Estimation Models using Homogeneous Data


Slides presented at ESEM 2007, Madrid.


Page 1: Building Cost Estimation Models using Homogeneous Data

Building Cost Estimation Models

Using Homogeneous Data

Rahul Premraj

Thomas Zimmermann

Saarland University, Germany

University of Calgary, Canada

Page 2: Building Cost Estimation Models using Homogeneous Data

software engineering

data

Page 3: Building Cost Estimation Models using Homogeneous Data

Cross versus Within-Company Cost Estimation Studies: A Systematic Review

Barbara A. Kitchenham, Member, IEEE Computer Society, Emilia Mendes, and Guilherme H. Travassos

Abstract: The objective of this paper is to determine under what circumstances individual organizations would be able to rely on cross-company-based estimation models. We performed a systematic review of studies that compared predictions from cross-company models with predictions from within-company models based on analysis of project data. Ten papers compared cross-company and within-company estimation models; however, only seven presented independent results. Of those seven, three found that cross-company models were not significantly different from within-company models, and four found that cross-company models were significantly worse than within-company models. Experimental procedures used by the studies differed, making it impossible to undertake formal meta-analysis of the results. The main trend distinguishing study results was that studies with small within-company data sets (i.e., < 20 projects) that used leave-one-out cross validation all found that the within-company model was significantly different (better) from the cross-company model. The results of this review are inconclusive. It is clear that some organizations would be ill-served by cross-company models whereas others would benefit. Further studies are needed, but they must be independent (i.e., based on different data bases or at least different single company data sets) and should address specific hypotheses concerning the conditions that would favor cross-company or within-company models. In addition, experimenters need to standardize their experimental procedures to enable formal meta-analysis, and recommendations are made in Section 3.

Index Terms: Cost estimation, management, systematic review, software engineering.

1 INTRODUCTION

Early studies of cost estimation models (e.g., [12], [8]) suggested that general-purpose models such as COCOMO [1] and SLIM [24] needed to be calibrated to specific companies before they could be used effectively. Taking this result further and following the proposals made by DeMarco [4], Kok et al. [14] suggested that cost estimation models should be developed only from single-company data. However, three main problems can occur when relying on within-company data sets [3], [2]:

1. The time required to accumulate enough data on past projects from a single company may be prohibitive.

2. By the time the data set is large enough to be of use, technologies used by the company may have changed, and older projects may no longer be representative of current practices.

3. Care is necessary as data needs to be collected in a consistent manner.

These problems motivated the use of cross-company models (models built using cross-company data sets, which are data sets containing data from several companies) for effort estimation and productivity benchmarking, and, subsequently, several studies compared the prediction accuracy between cross-company and within-company models. In 1999, Maxwell et al. [18] analyzed a cross-company benchmarking database by comparing the accuracy of a within-company cost model with the accuracy of a cross-company cost model. They claimed that the within-company model was more accurate than the cross-company model, based on the same holdout sample. In the same year, Briand et al. [2] found that cross-company models could be as accurate as within-company models. The following year, Briand et al. [3] reanalyzed the data set employed by Maxwell et al. [18] and concluded that cross-company models were as good as within-company models. Two years later, Wieczorek and Ruhe [26] confirmed this same trend using the same data set employed by [2]. Three years later, Mendes et al. [20] also confirmed the same trend using yet another data set.

These results seemed to contradict the results of the earlier studies and pave the way for improved estimation methods for companies that did not have their own project data. However, other researchers found less encouraging results. Jeffery et al. undertook two studies, both of which suggested that within-company models were superior to cross-company models [6], [7]. Two years later, Lefley and Shepperd claimed that the within-company model was more accurate than the cross-company model, using the same data set employed by Wieczorek and Ruhe [26] and Briand et al. [2]. Finally, a year later Kitchenham and

316 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 33, NO. 5, MAY 2007

B.A. Kitchenham is with the School of Computing and Mathematics, University of Keele, Keele Village, Staffordshire, ST5 5BG, UK. E-mail: [email protected].

E. Mendes is with the Computer Science Department, University of Auckland, Private Bag 92019, Auckland, New Zealand. E-mail: [email protected].

G.H. Travassos is with UFRJ/COPPE, Systems Engineering and Computer Science Program, PO Box 68511, 21941-972 Rio de Janeiro, RJ, Brazil. E-mail: [email protected].

Manuscript received 6 June 2006; revised 27 Nov. 2006; accepted 2 Jan. 2007; published online 20 Feb. 2007. Recommended for acceptance by A. Mockus. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TSE-0129-0606. Digital Object Identifier no. 10.1109/TSE.2007.1001.

0098-5589/07/$25.00 © 2007 IEEE. Published by the IEEE Computer Society.

Systematic Review

May, 2007

Page 4: Building Cost Estimation Models using Homogeneous Data


Systematic Review

Barbara Kitchenham

Emilia Mendes

Guilherme Travassos

May, 2007

Page 5: Building Cost Estimation Models using Homogeneous Data

Company-Specific Models

Cross-Company Models

No Trend

Page 6: Building Cost Estimation Models using Homogeneous Data

Company-Specific Models (four studies)

Cross-Company Models (four studies)

No Trend (two studies)

Page 7: Building Cost Estimation Models using Homogeneous Data

Barbara Kitchenham, Emilia Mendes, Katrina Maxwell

Lionel Briand, Martin Shepperd, Isabella Wieczorek

Company-Specific Models (four studies)

Cross-Company Models (four studies)

No Trend (two studies)

Page 8: Building Cost Estimation Models using Homogeneous Data

Barbara Kitchenham, Emilia Mendes, Katrina Maxwell

Lionel Briand, Martin Shepperd, Isabella Wieczorek

Company-Specific Models (four studies)

Cross-Company Models (four studies)

No Trend (two studies)

[Figure: diagram annotated with the numbers 2, 2, 2.]

Page 9: Building Cost Estimation Models using Homogeneous Data

Barbara Kitchenham, Emilia Mendes, Katrina Maxwell

Lionel Briand, Martin Shepperd, Isabella Wieczorek

Company-Specific Models (four studies)

Cross-Company Models (four studies)

No Trend (two studies)

[Figure: diagram annotated with the numbers 2, 3, 2, 2, 2, 1, 1.]

Page 10: Building Cost Estimation Models using Homogeneous Data

Barbara Kitchenham, Emilia Mendes, Katrina Maxwell

Lionel Briand, Martin Shepperd, Isabella Wieczorek

Company-Specific Models (four studies)

Cross-Company Models (four studies)

No Trend (two studies)

[Figure: diagram annotated with the numbers 2, 3, 2, 2, 2, 1, 1, 1, 1.]

Page 11: Building Cost Estimation Models using Homogeneous Data

meet Erica

Page 12: Building Cost Estimation Models using Homogeneous Data

she works here

Page 13: Building Cost Estimation Models using Homogeneous Data

she is a metrics

consultant

Page 14: Building Cost Estimation Models using Homogeneous Data

her job

Page 15: Building Cost Estimation Models using Homogeneous Data

Erica’s Boss has a

new project for her

Page 16: Building Cost Estimation Models using Homogeneous Data

what are my

options?

Page 17: Building Cost Estimation Models using Homogeneous Data

Company-Specific Models

Cross-Company Models

Page 18: Building Cost Estimation Models using Homogeneous Data

Company-Specific Models

Cross-Company Models

Business-Specific Models

Page 19: Building Cost Estimation Models using Homogeneous Data
Page 20: Building Cost Estimation Models using Homogeneous Data

why

Business Sector?

Page 21: Building Cost Estimation Models using Homogeneous Data

An Empirical Analysis of Software Productivity Over Time

Rahul Premraj, Bournemouth University, UK, [email protected]

Martin Shepperd, Brunel University, UK, [email protected]

Barbara Kitchenham*, National ICT, [email protected]

Pekka Forselius, STTF Oy, Finland, [email protected]

Abstract

OBJECTIVE - the aim is to investigate how software project productivity has changed over time. Within this overall goal we also compare productivity between different business sectors and seek to identify major drivers.

METHOD - we analysed a data set of more than 600 projects that have been collected from a number of Finnish companies since 1978.

RESULTS - overall, we observed a quite pronounced improvement in productivity over the entire time period, though this improvement is less marked since the 1990s. However, the trend is not smooth. We also observed productivity variability between company and business sector.

CONCLUSIONS - whilst this data set is not a random sample so generalisation is somewhat problematic, we hope that it contributes to an overall body of knowledge about software productivity and thereby facilitates the construction of a bigger picture.

Keywords: project management, projects, software productivity, trend analysis, empirical analysis.

1. Introduction

Given the importance and size of the software industry it is no surprise that there is a great deal of interest in productivity trends and in particular whether the industry, as a whole, is improving over time. Obviously this is a complex question for at least three reasons.

First, productivity is difficult to measure because the traditional definition, i.e. the ratio of outputs to inputs, requires that we have objective methods of measuring both commodities. Unfortunately, for software the notion of output is not straightforward. Lines of code are problematic due to issues of layout, differing language and the fact that most software engineering activity does not directly involve code. An alternative is Function Points (FPs), in its various flavours, which although subject to some criticism [?] are in quite widespread use and so in a sense represent the least bad alternative. In our analysis the output (or size) measure collected is Experience Points 2.0 [?], a variant of FPs.

Second, productivity is impacted by a very large number of factors, many of which are inherently difficult to assess, e.g. task difficulty, skill of the project team, ease of interaction with the customer/client and the level of non-functional requirements imposed such as dependability and performance.

Third, there are clear interactions between many of these factors so, for instance, it is easier to be productive if quality can be disregarded.

Despite these caveats, this paper seeks to analyse software project productivity trends from 1978-2003 from a data set of more than 600 projects from Finland. The projects are varied in size (6 - 5000+ FPs), business sector (e.g. Retail) and type (New Development or Maintenance). However, we believe there are sufficient data to draw some preliminary conclusions.

The remainder of the paper is organised as follows. The next section very briefly reviews some related work including a similar, earlier study by Maxwell and Forselius [?]. Next we describe the data set used for our analysis. We then give the results of our analysis, first overall and then after splitting the data set into groups of more closely related projects. We conclude with a discussion of the significance of the results and some comments on the actual process of analysing the data.

* Barbara Kitchenham is also with Keele University, UK [email protected]

METRICS, 2005

Page 22: Building Cost Estimation Models using Homogeneous Data
Page 23: Building Cost Estimation Models using Homogeneous Data

Business Specific Models

Page 24: Building Cost Estimation Models using Homogeneous Data

Finnish data set

[Figure: bar chart of project counts on a 0 to 800 axis. All data: 788 projects. Cleaned data: 395 projects.]

Page 25: Building Cost Estimation Models using Homogeneous Data

Regression model

$$\text{Effort} = \alpha \cdot \text{Size}^{\beta}$$

[Figure: plot of Effort (y-axis) against Size (x-axis).]
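A power-law model of effort against size is commonly fitted by ordinary least squares after taking logarithms of both variables. The sketch below illustrates that approach only; the slides do not say how the coefficients were estimated, and the coefficient names `alpha` and `beta`, the function names, and the sample projects are all assumptions made up for illustration.

```python
import math


def fit_power_law(sizes, efforts):
    """Fit effort = alpha * size**beta by least squares on log-log data.

    Assumption: log-log ordinary least squares; the slides do not state
    the estimation procedure that was actually used.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in efforts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                           # slope in log space
    alpha = math.exp(mean_y - beta * mean_x)   # intercept, back-transformed
    return alpha, beta


def predict_effort(alpha, beta, size):
    """Predicted effort for a project of the given size."""
    return alpha * size ** beta


# Hypothetical projects generated from effort = 3 * size**1.1 (not real data).
sizes = [50, 120, 300, 800, 2000]
efforts = [3 * s ** 1.1 for s in sizes]
alpha, beta = fit_power_law(sizes, efforts)
```

Because the sample data follow the power law exactly, the fit recovers the generating coefficients up to floating-point error; real project data would scatter around the fitted curve.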

Page 26: Building Cost Estimation Models using Homogeneous Data

Test Sets

Page 27: Building Cost Estimation Models using Homogeneous Data

Test Sets

companies

Page 28: Building Cost Estimation Models using Homogeneous Data

Test Sets

companies

A B C D E

Page 29: Building Cost Estimation Models using Homogeneous Data

Research Objectives

I. To develop company-specific cost models for comparisons against other models.

Training Data Testing Data

Page 30: Building Cost Estimation Models using Homogeneous Data

Research Objectives

II. To develop cross-company cost models to compare against company-specific cost models.

Training Data Testing Data

Page 31: Building Cost Estimation Models using Homogeneous Data

Research Objectives

III. To develop business-specific models to compare their accuracy against company-specific and cross-company cost models.

Training Data Testing Data

Page 32: Building Cost Estimation Models using Homogeneous Data

Research Objectives

IV. To develop business-specific cost models to determine if they can be used by companies from other business sectors.

Training Data Testing Data
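The four objectives differ mainly in which projects are allowed into the training data for a given test company. The sketch below shows one plausible way to partition a project list along those lines. It is an assumption, not the procedure from the slides: the `Project` record, the `training_sets` function, and the sector names other than Retail are hypothetical, and the slides do not state how company-specific models were held out for testing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    company: str
    sector: str
    size: float
    effort: float


def training_sets(projects, company, sector):
    """Candidate training pools for one test company (hypothetical scheme)."""
    own = [p for p in projects if p.company == company]
    others = [p for p in projects if p.company != company]
    other_sectors = sorted({p.sector for p in others if p.sector != sector})
    return {
        # I: the company's own projects (a holdout split is still needed).
        "company_specific": own,
        # II: projects from every other company.
        "cross_company": others,
        # III: other companies in the same business sector.
        "business_specific": [p for p in others if p.sector == sector],
        # IV: one pool per foreign sector, to test cross-sector use.
        "cross_business": {
            s: [p for p in others if p.sector == s] for s in other_sectors
        },
    }


# Made-up records for illustration only.
projects = [
    Project("A", "Banking", 100, 900),
    Project("A", "Banking", 250, 2100),
    Project("B", "Banking", 400, 3500),
    Project("C", "Retail", 150, 1000),
    Project("D", "Insurance", 300, 2800),
]
sets = training_sets(projects, company="A", sector="Banking")
```

Excluding the test company's own projects from the cross-company and business-specific pools keeps the comparison fair, since otherwise those models would partly be company-specific models.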

Page 33: Building Cost Estimation Models using Homogeneous Data
Page 34: Building Cost Estimation Models using Homogeneous Data

Accuracy measures: Pred(25) and Pred(50)

[Figure: Pred(50) and Pred(25) on a scale from 1 to 100; higher is better.]

Page 35: Building Cost Estimation Models using Homogeneous Data

Accuracy measures: Pred(25), Pred(50), MdMRE and MMRE

[Figure: two scales from 1 to 100. Pred(50) and Pred(25): higher is better. MdMRE and MMRE: lower is better.]

for comparability
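These four measures are all derived from the magnitude of relative error (MRE) of each prediction, |actual - predicted| / actual. MMRE and MdMRE are the mean and median MRE, and Pred(l) is the percentage of projects whose MRE is at most l percent. A minimal sketch using the standard definitions follows; the function names and the sample values are assumptions for illustration, not results from the study.

```python
from statistics import mean, median


def mre(actual, predicted):
    """Magnitude of relative error for one project."""
    return abs(actual - predicted) / actual


def accuracy_summary(actuals, predictions):
    """Return MMRE, MdMRE, Pred(25) and Pred(50), all as percentages."""
    errors = [mre(a, p) for a, p in zip(actuals, predictions)]

    def pred(level):
        # Share of projects whose MRE is within `level` percent.
        hits = sum(1 for e in errors if e <= level / 100.0)
        return 100.0 * hits / len(errors)

    return {
        "MMRE": 100.0 * mean(errors),      # lower is better
        "MdMRE": 100.0 * median(errors),   # lower is better
        "Pred(25)": pred(25),              # higher is better
        "Pred(50)": pred(50),              # higher is better
    }


# Hypothetical actual and predicted efforts (not data from the study).
actuals = [100, 200, 400, 800]
predictions = [110, 160, 560, 800]
summary = accuracy_summary(actuals, predictions)
```

The median-based MdMRE is less sensitive to a few badly mispredicted projects than MMRE, which is one reason the two are usually reported together.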

Page 36: Building Cost Estimation Models using Homogeneous Data

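Company-Specific Cost Models

[Figure: results for test companies A to E, with a Training/Testing legend. First panel: Pred50 and Pred25 on a 0 to 100 axis (higher is better). Second panel: MdMRE and MMRE on a 0 to 100 axis (lower is better).]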

Page 37: Building Cost Estimation Models using Homogeneous Data

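Cross-Company Cost Models

[Figure: results for test companies A to E, with a Training/Testing legend. First panel: Pred50 and Pred25 on a 0 to 100 axis (higher is better). Second panel: MdMRE and MMRE on a 0 to 100 axis (lower is better).]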

Page 38: Building Cost Estimation Models using Homogeneous Data

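Business-Specific Cost Models

[Figure: results for test companies A to E, with a Training/Testing legend. First panel: Pred50 and Pred25 on a 0 to 100 axis (higher is better). Second panel: MdMRE and MMRE on a 0 to 100 axis (lower is better).]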

Page 39: Building Cost Estimation Models using Homogeneous Data

Cross-Business Cost Models

• Projects from some sectors could be used to predict for projects from other sectors.

• For example, models built on Retail sector projects predicted with high accuracy (Pred50 > 50%).

• But projects from a sector are best used to predict for that same sector.

Page 40: Building Cost Estimation Models using Homogeneous Data

Picture: Mike, Delfini Group

Threats to Validity

Page 41: Building Cost Estimation Models using Homogeneous Data

Threats to Validity

Page 42: Building Cost Estimation Models using Homogeneous Data

Threats to Validity

external • Projects originated from Finland only.

Page 43: Building Cost Estimation Models using Homogeneous Data

Threats to Validity

external • Projects originated from Finland only.

internal • Data cleaning removed nearly half the projects.

• Only Size was used as the independent variable.

Page 44: Building Cost Estimation Models using Homogeneous Data

Conclusions

Page 45: Building Cost Estimation Models using Homogeneous Data

Conclusions

Barbara Kitchenham, Emilia Mendes, Katrina Maxwell

Lionel Briand, Martin Shepperd, Isabella Wieczorek

Company-Specific Models (four studies)

Cross-Company Models (four studies)

No Trend (two studies)

[Figure: diagram annotated with the numbers 2, 3, 2, 2, 2, 1, 1, 1, 1.]

Page 46: Building Cost Estimation Models using Homogeneous Data

Conclusions

Barbara Kitchenham, Emilia Mendes, Katrina Maxwell

Lionel Briand, Martin Shepperd, Isabella Wieczorek

Company-Specific Models (four studies)

Cross-Company Models (four studies)

No Trend (two studies)

[Figure: diagram annotated with the numbers 2, 3, 2, 2, 2, 1, 1, 1, 1.]

what are my

options?

Page 47: Building Cost Estimation Models using Homogeneous Data

Conclusions

Barbara Kitchenham, Emilia Mendes, Katrina Maxwell

Lionel Briand, Martin Shepperd, Isabella Wieczorek

Company-Specific Models (four studies)

Cross-Company Models (four studies)

No Trend (two studies)

[Figure: diagram annotated with the numbers 2, 3, 2, 2, 2, 1, 1, 1, 1.]

what are my options?

Company-Specific Models

Cross-Company Models

Business-Specific Models

Page 48: Building Cost Estimation Models using Homogeneous Data

Conclusions

Page 49: Building Cost Estimation Models using Homogeneous Data

Conclusions

• No model performed consistently well across all experiments.

Page 50: Building Cost Estimation Models using Homogeneous Data

Conclusions

• No model performed consistently well across all experiments.

• Business-specific models performed comparably to company-specific models.

Page 51: Building Cost Estimation Models using Homogeneous Data

Conclusions

• No model performed consistently well across all experiments.

• Business-specific models performed comparably to company-specific models.

• Business-specific models performed better than cross-company models.

Page 52: Building Cost Estimation Models using Homogeneous Data

Conclusions

• No model performed consistently well across all experiments.

• Business-specific models performed comparably to company-specific models.

• Business-specific models performed better than cross-company models.

• Reducing heterogeneity in data may increase their applicability to problems.

Page 53: Building Cost Estimation Models using Homogeneous Data

Conclusions

• No model performed consistently well across all experiments.

• Business-specific models performed comparably to company-specific models.

• Business-specific models performed better than cross-company models.

• Reducing heterogeneity in data may increase their applicability to problems.

• ... and lead to better prediction models.

Page 54: Building Cost Estimation Models using Homogeneous Data

Open Questions

Page 55: Building Cost Estimation Models using Homogeneous Data

Open Questions

• Can we use other algorithms such as decision trees and statistical clustering?

Page 56: Building Cost Estimation Models using Homogeneous Data

Open Questions

• Can we use other algorithms such as decision trees and statistical clustering?

• What are the commonalities amongst projects?

Page 57: Building Cost Estimation Models using Homogeneous Data

Open Questions

• Can we use other algorithms such as decision trees and statistical clustering?

• What are the commonalities amongst projects?

• Does heterogeneity in data sets impact other software engineering areas?

Page 58: Building Cost Estimation Models using Homogeneous Data

Open Questions

• Can we use other algorithms such as decision trees and statistical clustering?

• What are the commonalities amongst projects?

• Does heterogeneity in data sets impact other software engineering areas?

Thank you!