
Page 1: Sarcia idoese08


An Approach to Improving Parametric Estimation Models in case of Violation of Assumptions

S. Alessandro Sarcià¹,²
[email protected]
Author

Giovanni Cantone¹, Victor R. Basili²,³
Advisors

¹Dept. of Informatica, Sistemi e Produzione, University of Rome "Tor Vergata"
²Dept. of Computer Science, University of Maryland
³Fraunhofer Center for ESE, Maryland

Page 2: Sarcia idoese08


Outline

Motivation (Why)
Objectives (What)
Roadmap (How)
The problem
The solution
The application
A case study
Conclusion & Benefits
Questions & Feedback

Page 3: Sarcia idoese08


MOTIVATION

Page 4: Sarcia idoese08


Predicting software engineering variables accurately is the basis for the success of mature organizations. This is still an unsolved problem. Our point of view:

Prediction is about estimating values based on mathematical and statistical approaches (no guessing), e.g., regression functions.

Variables are cost, effort, size, defects, fault proneness, number of test cases, and so forth.

Success refers to delivering software systems on time, on budget, and with the quality initially required. In software estimation, success means providing estimates as close to the actual values as possible (the error is less than a stated threshold). Focus: we take a wider view of success as keeping prediction uncertainty within acceptable thresholds (risk analysis on the estimation model).

The organizations we refer to are learning organizations that aim at improving their success over time.

Page 5: Sarcia idoese08


OBJECTIVES

Page 6: Sarcia idoese08


Objectives

- Analyze the estimation risk (uncertainty) of the estimation model, i.e., the behavior of the EM with respect to the estimation error over its history (Is it too risky to use the chosen model? What is the model's reliability?)

- State a strategy for mitigating the risk of estimation failures (we cannot remove the error completely)

- State a strategy for improving the estimation model over time, rather than finding the best model (novelty)

EM = Estimation Model

Page 7: Sarcia idoese08


ROADMAP

Page 8: Sarcia idoese08


An overview of the approach

To reach our objectives:

1. We removed assumptions on the regression functions and dealt with the consequences (to analyze the uncertainty: The Problem)

2. We tailored the Quality Improvement Paradigm (QIP) into an Estimation Improvement Process (EIP) specific to prediction

3. We defined a particular kind of Artificial Neural Network (ANN) and a strategy for analyzing the estimation risk in case of violations of assumptions (to implement our solution: The Solution)

4. We used this ANN for mitigating the estimation risk (prediction) and improving the model (to apply our solution: The Application)

Page 9: Sarcia idoese08


THE PROBLEM

Page 10: Sarcia idoese08


Error taxonomy

Page 11: Sarcia idoese08


Regression functions

EM:

y = f(x, β) + ε, with E(ε) = 0 and cov(ε) = σ²I

y: dependent variable (e.g., effort)
x: independent variables (e.g., size, complexity)
ε: random error (unknown)
β: parameters of the model
E(ε): expected value of ε
I: identity matrix
Var(ε) = σ²

f may be linear, non-linear, or even a generalized model

ŷ = f(x, B), with B an estimate of β and ŷ an estimate of y; residuals r = (y − ŷ)

e.g., Least Squares estimates
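As a concrete reference for the definitions above, a minimal sketch of fitting B by Least Squares and computing the residuals r; the data values are illustrative, not from the deck.

import numpy as np

# Illustrative data: x = size (e.g., KSLOC), y = effort.
x = np.array([10.0, 23.0, 4.0, 55.0, 18.0, 31.0])
y = np.array([120.0, 310.0, 40.0, 690.0, 200.0, 380.0])

# Design matrix for a model linear in the parameters: y = b0 + b1*x + error.
X = np.column_stack([np.ones_like(x), x])

# Least Squares estimate B of the parameters beta.
B, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ B   # fitted values: the estimates of y
r = y - y_hat   # residuals: the observable counterpart of the random error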

Page 12: Sarcia idoese08


Regression assumptions

1. The random error ε is not correlated with x

2. The variance of the random error is constant (homoscedasticity)

3. ε is not auto-correlated

4. The probability density of the error is Gaussian

Very often, to have a closed-form solution for B:

- The model is assumed linear in the parameters (linear or linearized), e.g., polynomials of any degree, log-linear models. Generalized models require iterative procedures for calculating B. (A diagnostics sketch follows below.)
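A minimal sketch of checking assumptions 1-4 on the residuals of a fitted model; scipy and statsmodels are assumed available, and the split-variance check for homoscedasticity is an illustrative shortcut, not the only possible test.

import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

def check_assumptions(x, r):
    # 1. Error not correlated with x: Pearson correlation should be near zero.
    corr, _ = stats.pearsonr(x, r)
    # 2. Homoscedasticity: compare residual variance in the lower and
    #    upper halves of x (a ratio far from 1 suggests heteroscedasticity).
    order = np.argsort(x)
    half = len(x) // 2
    var_ratio = np.var(r[order][half:]) / np.var(r[order][:half])
    # 3. No auto-correlation: Durbin-Watson statistic should be near 2.
    dw = durbin_watson(r)
    # 4. Normality: Shapiro-Wilk test (small p-value rejects normality).
    _, p_norm = stats.shapiro(r)
    return {"corr_with_x": corr, "var_ratio": var_ratio,
            "durbin_watson": dw, "shapiro_p": p_norm}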

Page 13: Sarcia idoese08


Violation of regression assumptions

In case of violations, when we estimate the uncertainty of the next estimate, the prediction interval may be unreliable (type I and type II errors).

Estimate Prediction Interval:

[ŷ_DOWN, ŷ_UP] = ŷ(x₀) ± t_(1−α/2)(N − Q − 1) · S · sqrt(1 + x₀ᵀ(XᵀX)⁻¹x₀)

Under violations:
- ŷ(x₀): the estimate itself may be correct
- t_(1−α/2): if normality does not hold, we cannot use Student's t percentiles
- S: the variance is no longer constant, so S is not the standard error
- sqrt(1 + x₀ᵀ(XᵀX)⁻¹x₀): this is not the spread
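For reference, the classical interval above in code; it is only trustworthy when assumptions 1-4 hold, which is exactly the point of this slide. Function and variable names are illustrative.

import numpy as np
from scipy import stats

def t_prediction_interval(X, y, x0, alpha=0.05):
    # X: N x (Q+1) design matrix including the intercept column,
    # so the degrees of freedom are N - Q - 1.
    N, cols = X.shape
    dof = N - cols
    B, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ B
    S = np.sqrt(r @ r / dof)                 # residual standard error
    t = stats.t.ppf(1.0 - alpha / 2.0, dof)  # Student's t percentile
    spread = np.sqrt(1.0 + x0 @ np.linalg.inv(X.T @ X) @ x0)
    y0 = x0 @ B
    return y0 - t * S * spread, y0 + t * S * spread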

Page 14: Sarcia idoese08


Violation of Regression assumptions

Page 15: Sarcia idoese08


THE SOLUTION

Page 16: Sarcia idoese08


The mathematical solution

We have to build prediction intervals correctly:

- Based on an empirical approach (observations, without any assumptions)
- Using a Bayesian approach (including prior and posterior information at the same time)

In particular, to estimate prediction intervals, we build a feedforward multilayer Artificial Neural Network for discrimination problems.

We call such a network a Bayesian Discrimination Function (BDF).
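As a stand-in for the network used in the paper, here is a minimal numpy sketch of a feedforward network with one hidden layer and a sigmoid output, trained by gradient descent on the cross-entropy loss, so the output approximates a posterior class probability. The architecture and hyperparameters are illustrative assumptions, not the authors' exact configuration.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bdf(X, t, hidden=8, lr=0.1, epochs=5000):
    # X: N x 2 inputs (e.g., KSLOC and RE); t: N labels in {0, 1}.
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, hidden);               b2 = 0.0
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)             # hidden layer
        p = sigmoid(H @ W2 + b2)             # output: posterior probability
        g = (p - t) / len(t)                 # cross-entropy gradient w.r.t. logits
        gH = np.outer(g, W2) * (1.0 - H**2)  # backprop through tanh
        W2 -= lr * (H.T @ g);  b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gH); b1 -= lr * gH.sum(axis=0)
    return lambda x: sigmoid(np.tanh(x @ W1 + b1) @ W2 + b2)

Under a cross-entropy loss, the trained output approximates the posterior probability of class membership, which is what the BDF needs.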

Page 17: Sarcia idoese08


The Quality Improvement Paradigm

Page 18: Sarcia idoese08


The Estimation Improvement Process

Page 19: Sarcia idoese08


The framework

Page 20: Sarcia idoese08


Building the BDF

[Figure: observed relative errors (RE) plotted against KSLOC. A non-linear, x-dependent median splits the observations into Class A and Class B. Fixing RE values (e.g., RE(P1), RE(P2)) gives a family of curves; the BDF maps KSLOC and RE to a posterior probability between 0 and 1, equal to 0.5 on the median.]
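One plausible reading of this construction, as a sketch: label each observed relative error by whether it lies above or below the x-dependent median, then train the network on those labels. The rolling-window median used here is an illustrative stand-in for whatever median estimator the authors use.

import numpy as np

def label_by_local_median(kslo, re, window=7):
    # Class A (1) above the x-dependent median, Class B (0) below;
    # which class is which is only a labeling convention here.
    order = np.argsort(kslo)
    labels = np.empty(len(re), dtype=int)
    for rank, i in enumerate(order):
        lo = max(0, rank - window // 2)
        neighbours = re[order[lo:lo + window]]
        labels[i] = int(re[i] > np.median(neighbours))
    return labels

# bdf = train_bdf(np.column_stack([kslo, re]), label_by_local_median(kslo, re))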

Page 21: Sarcia idoese08


Inverting the BDF (the sigmoid is smooth and monotonic)

[Figure: Inv(BDF) at a fixed KSLOC. Fixing a 95% credibility range, i.e., posterior probabilities 0.025 and 0.975, and inverting the BDF yields the bounds Me_DOWN and Me_UP on RE: a (Bayesian) Error Prediction Interval.]
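Because the output is smooth and monotonic in RE at a fixed KSLOC, the inversion can be done numerically. A sketch using bisection; the search bounds on RE are illustrative assumptions.

import numpy as np

def invert_bdf(bdf, kslo, target_prob, re_lo=-5.0, re_hi=1.0, tol=1e-6):
    # Find the RE at which bdf([kslo, RE]) = target_prob, assuming the
    # output is monotonically increasing in RE at this KSLOC.
    lo, hi = re_lo, re_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bdf(np.array([kslo, mid])) < target_prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# 95% credibility range for the error at a fixed size:
# me_down = invert_bdf(bdf, 0.55, 0.025)
# me_up   = invert_bdf(bdf, 0.55, 0.975)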

Page 22: Sarcia idoese08


Analyzing the model behavior

[Figure: BDF curves for KSLOC = 0.11, 0.32, 0.55, and 0.95, compared around RE = 0. The curves range from steeper to flatter, and from biased to unbiased, depending on the project size.]

Page 23: Sarcia idoese08


Estimate Prediction Interval (M. Jørgensen)

RE = (Act − Est)/Act

To obtain the Estimate Prediction Interval from the Error Prediction Interval, we substitute the interval bounds for RE and invert the formula:

[Me_DOWN, Me_UP] = (Act − Est)/Act

O_(N+1),DOWN = Act_DOWN = Est/(1 − Me_DOWN)
O_(N+1),UP = Act_UP = Est/(1 − Me_UP)
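The substitution above is a one-liner in code; a sketch (the numeric example in the comment is illustrative):

def estimate_prediction_interval(est, me_down, me_up):
    # RE = (Act - Est)/Act  =>  Act = Est / (1 - RE), applied to both bounds.
    return est / (1.0 - me_down), est / (1.0 - me_up)

# e.g., Est = 100, [Me_DOWN, Me_UP] = [-0.4, 0.3]
# -> [Act_DOWN, Act_UP] = [71.4, 142.9]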

Page 24: Sarcia idoese08


THE APPLICATION

Page 25: Sarcia idoese08


Scope Error (similarity analysis with estimated data)

Page 26: Sarcia idoese08


Assumption Error (estimated data)

Page 27: Sarcia idoese08


Improving the model (actual data): scope extension

Page 28: Sarcia idoese08


Improving the model (actual data): error magnitude and bias

What we need to worry about is the relative error magnitude, not the bias.

Page 29: Sarcia idoese08


Improving the model (actual data)

To shrink the magnitude of the relative error we can:

- Find and try new variables
- Remove irrelevant variables (PCA, CCA, stepwise selection)
- Consider dummy variables (different populations)
- Improve the flexibility of the model (generalized models)
- Select the right complexity of the model (cross-validation; see the sketch after this list)
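For the last item, a minimal sketch of choosing model complexity by k-fold cross-validation, here selecting a polynomial degree; the candidate degrees and fold count are illustrative.

import numpy as np

def cv_best_degree(x, y, degrees=(1, 2, 3), k=5, seed=0):
    # Pick the polynomial degree with the lowest k-fold validation MSE.
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    scores = {}
    for d in degrees:
        errs = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            coeffs = np.polyfit(x[train], y[train], d)
            errs.append(np.mean((y[fold] - np.polyval(coeffs, x[fold])) ** 2))
        scores[d] = np.mean(errs)
    return min(scores, key=scores.get)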

Page 30: Sarcia idoese08


A CASE STUDY

Page 31: Sarcia idoese08


The NASA COCOMO data set [PROMISE]

[Bar chart: Relative Error, RE = (Act − Est)/Act, for the 16 estimated projects (x-axis: projects 1 to 16; y-axis: RE from −3.0 to 1.0). Values range from −2.7 to 0.7, most of them negative; bars are labeled UB, BS, and EXT.]

77 historical projects (before 1985), 16 projects being estimated (from 1985 to 1987)

Page 32: Sarcia idoese08


CONCLUSION & BENEFITS

Page 33: Sarcia idoese08


Benefits of using this approach

- Continue using parametric estimation models
- Correct the limitations of parametric models by dealing with the consequences of the violations
- The approach is systematic (framework and process) and can support learning organizations and improvement paradigms
- Evaluate the estimation model's reliability before using it (early risk evaluation)
- The approach is traceable and repeatable (EIP + framework)
- The approach can be completely implemented as a software tool that reduces human interaction
- The approach produces experience packages (e.g., ANNs) that are easier and faster to store and deliver
- The approach is general, even though we have shown its application only to parametric models

Page 34: Sarcia idoese08


QUESTIONS & FEEDBACK

Page 35: Sarcia idoese08


An Approach to Improving Parametric Estimation Models in case of Violation of Assumptions

S. Alessandro Sarcià¹,²
[email protected]
Author

Giovanni Cantone¹, Victor R. Basili²,³
Advisors

¹Dept. of Informatica, Sistemi e Produzione, University of Rome "Tor Vergata"
²Dept. of Computer Science, University of Maryland
³Fraunhofer Center for ESE, Maryland