
Page 1: Sarcia idoese08


An Approach to Improving Parametric Estimation Models in case of Violation of Assumptions

S. Alessandro Sarcià¹,²
[email protected]
Author

Giovanni Cantone¹, Victor R. Basili²,³
Advisors

¹Dept. of Informatica, Sistemi e Produzione, University of Rome "Tor Vergata"
²Dept. of Computer Science, University of Maryland
³Fraunhofer Center for ESE, Maryland

Page 2: Sarcia idoese08


Outline

Motivation (Why)
Objectives (What)
Roadmap (How)
The problem
The solution
The application
A case study
Conclusion & Benefits
Questions & Feedback

Page 3: Sarcia idoese08


MOTIVATION

Page 4: Sarcia idoese08


Predicting software engineering variables accurately is the basis for the success of mature organizations. This is still an unsolved problem. Our point of view:

Prediction is about estimating values based on mathematical and statistical approaches (no guessing), e.g., regression functions.

Variables are cost, effort, size, defects, fault proneness, number of test cases, and so forth.

Success refers to delivering software systems on time, on budget, and with the quality initially required. In software estimation, success means providing estimates as close to the actual values as possible (the error is less than a stated threshold). Focus: we take a wider view of success as keeping prediction uncertainty within acceptable thresholds (risk analysis on the estimation model).

The organizations we refer to are learning organizations that aim at improving their success over time.

Page 5: Sarcia idoese08


OBJECTIVES

Page 6: Sarcia idoese08


Objectives

- Analyze the estimation risk (uncertainty) of the estimation model, i.e., the behavior of the EM with respect to the estimation error over its history (Is it too risky to use the chosen model? What is the model's reliability?)

- State a strategy for mitigating the risk of estimation failures (we cannot remove the error completely)

- State a strategy for improving the estimation model over time, rather than finding the best model (novelty)

EM = Estimation Model

Page 7: Sarcia idoese08


ROADMAP

Page 8: Sarcia idoese08


An overview of the approach

To reach our objectives:

1. We removed assumptions on the regression functions and dealt with the consequences (to analyze the uncertainty: The Problem)

2. We tailored the Quality Improvement Paradigm (QIP) into an Estimation Improvement Process (EIP) specific to prediction

3. We defined a particular kind of Artificial Neural Network (ANN) and a strategy for analyzing the estimation risk in case of violations of assumptions (to implement our solution: The Solution)

4. We used this ANN for mitigating the estimation risk (prediction) and improving the model (to apply our solution: The Application)

Page 9: Sarcia idoese08


THE PROBLEM

Page 10: Sarcia idoese08


Error taxonomy

Page 11: Sarcia idoese08


Regression functions

EM:

y = f(x, β) + ε, with E(ε) = 0 and cov(ε) = σ²I

y: dependent variable (e.g., effort)
x: independent variables (e.g., size, complexity)
ε: random error (unknown)
β: parameters of the model
E(ε): expected value of ε
I: identity matrix
Var(ε) = σ²

f may be linear, non-linear, or even a generalized model

ŷ = f(x, B), with B an estimate of β and ŷ an estimate of y; residuals r = (y − ŷ)

e.g., Least Squares estimates
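As a concrete reference for the definitions above, a minimal sketch of fitting B by Least Squares and computing the residuals r; the data values are illustrative, not from the deck.

import numpy as np

# Illustrative data: x = size (e.g., KSLOC), y = effort.
x = np.array([10.0, 23.0, 4.0, 55.0, 18.0, 31.0])
y = np.array([120.0, 310.0, 40.0, 690.0, 200.0, 380.0])

# Design matrix for a model linear in the parameters: y = b0 + b1*x + error.
X = np.column_stack([np.ones_like(x), x])

# Least Squares estimate B of the parameters beta.
B, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ B   # fitted values: the estimates of y
r = y - y_hat   # residuals: the observable counterpart of the random error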

Page 12: Sarcia idoese08


Regression assumptions

1. The random error ε is not correlated with x

2. The variance of the random error is constant (homoscedasticity)

3. ε is not auto-correlated

4. The probability density of the error is Gaussian

Very often, to have a closed-form solution for B:

- The model is assumed linear in the parameters (linear or linearized), e.g., polynomials of any degree, log-linear models. Generalized models require iterative procedures for calculating B. (A diagnostics sketch follows below.)
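A minimal sketch of checking assumptions 1-4 on the residuals of a fitted model; scipy and statsmodels are assumed available, and the split-variance check for homoscedasticity is an illustrative shortcut, not the only possible test.

import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

def check_assumptions(x, r):
    # 1. Error not correlated with x: Pearson correlation should be near zero.
    corr, _ = stats.pearsonr(x, r)
    # 2. Homoscedasticity: compare residual variance in the lower and
    #    upper halves of x (a ratio far from 1 suggests heteroscedasticity).
    order = np.argsort(x)
    half = len(x) // 2
    var_ratio = np.var(r[order][half:]) / np.var(r[order][:half])
    # 3. No auto-correlation: Durbin-Watson statistic should be near 2.
    dw = durbin_watson(r)
    # 4. Normality: Shapiro-Wilk test (small p-value rejects normality).
    _, p_norm = stats.shapiro(r)
    return {"corr_with_x": corr, "var_ratio": var_ratio,
            "durbin_watson": dw, "shapiro_p": p_norm}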

Page 13: Sarcia idoese08


Violation of regression assumptions

In case of violations, when we estimate the uncertainty of the next estimate, the prediction interval may be unreliable (type I and type II errors).

Estimate Prediction Interval:

[ŷ_DOWN, ŷ_UP] = ŷ(x₀) ± t_(1−α/2)(N − Q − 1) · S · sqrt(1 + x₀ᵀ(XᵀX)⁻¹x₀)

Under violations:
- ŷ(x₀): the estimate itself may be correct
- t_(1−α/2): if normality does not hold, we cannot use Student's t percentiles
- S: the variance is no longer constant, so S is not the standard error
- sqrt(1 + x₀ᵀ(XᵀX)⁻¹x₀): this is not the spread
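For reference, the classical interval above in code; it is only trustworthy when assumptions 1-4 hold, which is exactly the point of this slide. Function and variable names are illustrative.

import numpy as np
from scipy import stats

def t_prediction_interval(X, y, x0, alpha=0.05):
    # X: N x (Q+1) design matrix including the intercept column,
    # so the degrees of freedom are N - Q - 1.
    N, cols = X.shape
    dof = N - cols
    B, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ B
    S = np.sqrt(r @ r / dof)                 # residual standard error
    t = stats.t.ppf(1.0 - alpha / 2.0, dof)  # Student's t percentile
    spread = np.sqrt(1.0 + x0 @ np.linalg.inv(X.T @ X) @ x0)
    y0 = x0 @ B
    return y0 - t * S * spread, y0 + t * S * spread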

Page 14: Sarcia idoese08


Violation of Regression assumptions

Page 15: Sarcia idoese08


THE SOLUTION

Page 16: Sarcia idoese08


The mathematical solution

We have to build prediction intervals correctly:

- Based on an empirical approach (observations, without any assumptions)
- Using a Bayesian approach (including prior and posterior information at the same time)

In particular, to estimate prediction intervals, we build a feedforward multilayer Artificial Neural Network for discrimination problems.

We call such a network a Bayesian Discrimination Function (BDF).
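As a stand-in for the network used in the paper, here is a minimal numpy sketch of a feedforward network with one hidden layer and a sigmoid output, trained by gradient descent on the cross-entropy loss, so the output approximates a posterior class probability. The architecture and hyperparameters are illustrative assumptions, not the authors' exact configuration.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bdf(X, t, hidden=8, lr=0.1, epochs=5000):
    # X: N x 2 inputs (e.g., KSLOC and RE); t: N labels in {0, 1}.
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, hidden);               b2 = 0.0
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)             # hidden layer
        p = sigmoid(H @ W2 + b2)             # output: posterior probability
        g = (p - t) / len(t)                 # cross-entropy gradient w.r.t. logits
        gH = np.outer(g, W2) * (1.0 - H**2)  # backprop through tanh
        W2 -= lr * (H.T @ g);  b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gH); b1 -= lr * gH.sum(axis=0)
    return lambda x: sigmoid(np.tanh(x @ W1 + b1) @ W2 + b2)

Under a cross-entropy loss, the trained output approximates the posterior probability of class membership, which is what the BDF needs.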

Page 17: Sarcia idoese08


The Quality Improvement Paradigm

Page 18: Sarcia idoese08


The Estimation Improvement Process

Page 19: Sarcia idoese08


The framework

Page 20: Sarcia idoese08


Building the BDF

[Figure: observed relative errors (RE) plotted against KSLOC. A non-linear, x-dependent median splits the observations into Class A and Class B. Fixing RE values (e.g., RE(P1), RE(P2)) gives a family of curves; the BDF maps KSLOC and RE to a posterior probability between 0 and 1, equal to 0.5 on the median.]
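One plausible reading of this construction, as a sketch: label each observed relative error by whether it lies above or below the x-dependent median, then train the network on those labels. The rolling-window median used here is an illustrative stand-in for whatever median estimator the authors use.

import numpy as np

def label_by_local_median(kslo, re, window=7):
    # Class A (1) above the x-dependent median, Class B (0) below;
    # which class is which is only a labeling convention here.
    order = np.argsort(kslo)
    labels = np.empty(len(re), dtype=int)
    for rank, i in enumerate(order):
        lo = max(0, rank - window // 2)
        neighbours = re[order[lo:lo + window]]
        labels[i] = int(re[i] > np.median(neighbours))
    return labels

# bdf = train_bdf(np.column_stack([kslo, re]), label_by_local_median(kslo, re))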

Page 21: Sarcia idoese08


Inverting the BDF (the sigmoid is smooth and monotonic)

[Figure: Inv(BDF) at a fixed KSLOC. Fixing a 95% credibility range, i.e., posterior probabilities 0.025 and 0.975, and inverting the BDF yields the bounds Me_DOWN and Me_UP on RE: a (Bayesian) Error Prediction Interval.]
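Because the output is smooth and monotonic in RE at a fixed KSLOC, the inversion can be done numerically. A sketch using bisection; the search bounds on RE are illustrative assumptions.

import numpy as np

def invert_bdf(bdf, kslo, target_prob, re_lo=-5.0, re_hi=1.0, tol=1e-6):
    # Find the RE at which bdf([kslo, RE]) = target_prob, assuming the
    # output is monotonically increasing in RE at this KSLOC.
    lo, hi = re_lo, re_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bdf(np.array([kslo, mid])) < target_prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# 95% credibility range for the error at a fixed size:
# me_down = invert_bdf(bdf, 0.55, 0.025)
# me_up   = invert_bdf(bdf, 0.55, 0.975)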

Page 22: Sarcia idoese08


Analyzing the model behavior

[Figure: BDF curves for KSLOC = 0.11, 0.32, 0.55, and 0.95, compared around RE = 0. The curves range from steeper to flatter, and from biased to unbiased, depending on the project size.]

Page 23: Sarcia idoese08


Estimate Prediction Interval (M. Jørgensen)

RE = (Act − Est)/Act

To obtain the Estimate Prediction Interval from the Error Prediction Interval, we substitute the interval bounds for RE and invert the formula:

[Me_DOWN, Me_UP] = (Act − Est)/Act

O_(N+1),DOWN = Act_DOWN = Est/(1 − Me_DOWN)
O_(N+1),UP = Act_UP = Est/(1 − Me_UP)
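The substitution above is a one-liner in code; a sketch (the numeric example in the comment is illustrative):

def estimate_prediction_interval(est, me_down, me_up):
    # RE = (Act - Est)/Act  =>  Act = Est / (1 - RE), applied to both bounds.
    return est / (1.0 - me_down), est / (1.0 - me_up)

# e.g., Est = 100, [Me_DOWN, Me_UP] = [-0.4, 0.3]
# -> [Act_DOWN, Act_UP] = [71.4, 142.9]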

Page 24: Sarcia idoese08


THE APPLICATION

Page 25: Sarcia idoese08


Scope Error (similarity analysis with estimated data)

Page 26: Sarcia idoese08


Assumption Error (estimated data)

Page 27: Sarcia idoese08


Improving the model (actual data): scope extension

Page 28: Sarcia idoese08


Improving the model (actual data): error magnitude and bias

What we need to worry about is the relative error magnitude, not the bias.

Page 29: Sarcia idoese08


Improving the model (actual data)

To shrink the magnitude of the relative error we can:

- Find and try new variables
- Remove irrelevant variables (PCA, CCA, stepwise selection)
- Consider dummy variables (different populations)
- Improve the flexibility of the model (generalized models)
- Select the right complexity of the model (cross-validation; see the sketch after this list)
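For the last item, a minimal sketch of choosing model complexity by k-fold cross-validation, here selecting a polynomial degree; the candidate degrees and fold count are illustrative.

import numpy as np

def cv_best_degree(x, y, degrees=(1, 2, 3), k=5, seed=0):
    # Pick the polynomial degree with the lowest k-fold validation MSE.
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    scores = {}
    for d in degrees:
        errs = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            coeffs = np.polyfit(x[train], y[train], d)
            errs.append(np.mean((y[fold] - np.polyval(coeffs, x[fold])) ** 2))
        scores[d] = np.mean(errs)
    return min(scores, key=scores.get)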

Page 30: Sarcia idoese08


A CASE STUDY

Page 31: Sarcia idoese08


The NASA COCOMO data set [PROMISE]

[Bar chart: Relative Error, RE = (Act − Est)/Act, for the 16 estimated projects (x-axis: projects 1 to 16; y-axis: RE from −3.0 to 1.0). Values range from −2.7 to 0.7, most of them negative; bars are labeled UB, BS, and EXT.]

77 historical projects (before 1985), 16 projects being estimated (from 1985 to 1987)

Page 32: Sarcia idoese08


CONCLUSION & BENEFITS

Page 33: Sarcia idoese08


Benefits of using this approach

- Continue using parametric estimation models
- Correct the limitations of parametric models by dealing with the consequences of the violations
- The approach is systematic (framework and process) and can support learning organizations and improvement paradigms
- Evaluate the estimation model's reliability before using it (early risk evaluation)
- The approach is traceable and repeatable (EIP + framework)
- The approach can be completely implemented as a software tool that reduces human interaction
- The approach produces experience packages (e.g., ANNs) that are easier and faster to store and deliver
- The approach is general, even though we have shown its application only to parametric models

Page 34: Sarcia idoese08


QUESTIONS & FEEDBACK

Page 35: Sarcia idoese08


An Approach to Improving Parametric Estimation Models in case of Violation of Assumptions

S. Alessandro Sarcià¹,²
[email protected]
Author

Giovanni Cantone¹, Victor R. Basili²,³
Advisors

¹Dept. of Informatica, Sistemi e Produzione, University of Rome "Tor Vergata"
²Dept. of Computer Science, University of Maryland
³Fraunhofer Center for ESE, Maryland