Statistical Distribution of Metrics

Statistical distributions of software metrics: do they matter?

Israel Herraiz

Technical University of Madrid

[email protected]

Grab these slides from

http://slideshare.net/herraiz/statistical-distributions-of-metrics

Israel Herraiz, UPM Statistical distributions of software metrics: do they matter? 1/17


Outline

1 Some background

2 Statistical properties of software metrics

3 Evidence of impact on quality

4 Summary of findings and further work


1 Some background





A (not so) long time ago...

Statistical distribution of software metrics

Software size follows a double Pareto distribution ("Towards a theoretical model for software growth", MSR 2007)

More recently

Not only size, but some OO metrics too, and some complexity metrics ("On the Statistical Distribution of Object-Oriented System Properties", WETSoM 2012)


OK, but what is that double Pareto thing?

[Figure: P[X > x] against SLOC on logarithmic axes. Series: Data, Double Pareto, Lognormal.]
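The quantity on the vertical axis, P[X > x], is the fraction of files whose size exceeds x. A minimal sketch of how such an empirical curve could be computed from a list of file sizes is shown here; the function name `empirical_ccdf` and the sample sizes are hypothetical and are not taken from the slides.

```python
from bisect import bisect_right


def empirical_ccdf(values):
    """Return (x, P[X > x]) pairs for each distinct observed value x."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one observation")
    # bisect_right counts the observations <= x, so n minus that count
    # is the number of observations strictly greater than x.
    return [(x, (n - bisect_right(data, x)) / n) for x in sorted(set(data))]


if __name__ == "__main__":
    # Hypothetical file sizes in SLOC, for illustration only.
    sloc = [12, 40, 40, 95, 300, 1200, 15000]
    for x, p in empirical_ccdf(sloc):
        print(f"{x:>6} {p:.3f}")
```

Plotted on log-log axes, a power-law tail appears as a straight line, while a lognormal curve keeps bending downward; that difference is what the comparison between the two fitted curves relies on.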


But does it matter?

Most of the files are on the lognormal side

[Figure: bar chart of % Files by language (C, C++, Java, Python, Lisp); vertical axis from 0 to 35.]


But the power law minority matters a lot

[Figure: bar chart of % SLOC by language (C, C++, Java, Python, Lisp); vertical axis from 0 to 40.]
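The contrast between the share of files and the share of SLOC can be checked on any project once a threshold xmin is known. The sketch below is one way to do it; the function name `tail_shares`, the choice to count files with SLOC >= xmin as the power-law side, and the sample data are all assumptions made for illustration.

```python
def tail_shares(sloc_per_file, xmin):
    """Return (share of files, share of SLOC) for files with SLOC >= xmin."""
    if not sloc_per_file:
        raise ValueError("need at least one file")
    total_sloc = sum(sloc_per_file)
    if total_sloc <= 0:
        raise ValueError("total SLOC must be positive")
    # Files at or above the threshold are treated as the power-law side.
    tail = [s for s in sloc_per_file if s >= xmin]
    return len(tail) / len(sloc_per_file), sum(tail) / total_sloc


if __name__ == "__main__":
    # Hypothetical project: nine small files and one very large one.
    sizes = [10] * 9 + [910]
    file_share, sloc_share = tail_shares(sizes, xmin=500)
    print(f"{file_share:.0%} of files hold {sloc_share:.0%} of SLOC")
```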


Large files have a large impact

Size estimation models

Some software size estimation models are based on the log-normality of size metrics. These models systematically underestimate the size of software.

[Figure: four panels (C, C++, Java, Python), each plotting RE against SLOC.]

"On the distribution of source code file sizes", ICSOFT 2011


2 Statistical properties of software metrics





Parameters of the statistical distribution

Power law parameters: λ and xmin

Transition from lognormal to power law

[Figure: P[X > x] against SLOC on logarithmic axes. Series: Data, Double Pareto, Lognormal.]
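Given a threshold xmin, the exponent of a continuous power-law tail can be estimated by maximum likelihood as α = 1 + n / Σ ln(x_i / xmin), summing over the n observations with x_i ≥ xmin. The sketch below applies that standard estimator. It is not necessarily the fitting procedure used for these slides, and the λ named on the slide may be parameterized differently from α (for instance as the slope of P[X > x], which is α − 1). The function name and the sample data are hypothetical.

```python
import math


def powerlaw_exponent_mle(values, xmin):
    """Estimate alpha for a density proportional to x**(-alpha) on x >= xmin."""
    if xmin <= 0:
        raise ValueError("xmin must be positive")
    # Only observations in the tail (x >= xmin) enter the estimate.
    tail = [x for x in values if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("need tail observations strictly greater than xmin")
    return 1.0 + len(tail) / log_sum


if __name__ == "__main__":
    # Hypothetical file sizes in SLOC and a hypothetical threshold.
    sizes = [120, 800, 2500, 4000, 9000, 30000]
    print(powerlaw_exponent_mle(sizes, xmin=2000))
```

This sketch takes xmin as given. Choosing xmin itself is a separate step, usually done by comparing the fitted tail against the data for several candidate thresholds.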


3 Evidence of impact on quality





Probability of finding defects


We have seen that files above xmin account for 40% of total size, while being only about 1% of the files.

What about defects? Probability of finding defects in three software projects (using CYCLO as metric)

Project       Below xmin   Above xmin
Apache        .4178        .7708
OpenIntents   .2500        .7500
Zxing         .2143        .4161

* Data extracted from "ReLink: Recovering Links between Bugs and Changes", FSE 2011.
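One way to obtain figures like those in the table is to take, on each side of xmin, the fraction of files that have at least one linked defect. That definition is an assumption, since the slides do not spell it out; the function name `defect_probabilities` and the toy records are hypothetical.

```python
def defect_probabilities(files, xmin):
    """Return (P[defect | metric < xmin], P[defect | metric >= xmin]).

    `files` is an iterable of (metric_value, has_defect) pairs.
    """
    below, above = [], []
    for metric, has_defect in files:
        (above if metric >= xmin else below).append(bool(has_defect))

    def fraction(flags):
        # An empty side has no defined probability.
        return sum(flags) / len(flags) if flags else float("nan")

    return fraction(below), fraction(above)


if __name__ == "__main__":
    # Hypothetical (CYCLO, has_defect) records and a hypothetical threshold.
    records = [(1, False), (2, True), (3, False), (4, False),
               (50, True), (60, True), (70, False), (80, True)]
    p_below, p_above = defect_probabilities(records, xmin=10)
    print(p_below, p_above, p_above / p_below)
```

Dividing the two probabilities gives a quick measure of how much more likely a defect is above the threshold. For the reported OpenIntents values, .7500 / .2500 is a factor of 3.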



Probability of finding defects (normalized metrics)

Using CYCLO / WMC as metric (cyclomatic complex. per LOC)

Project       Below xmin   Above xmin
Apache        .4159        .6296
OpenIntents   .2813        .5417
Zxing         .3181        .2389



Defects density (only pre-release defects)

Using Number of Methods and number of pre-release defects per LOC

[Figure: two panels, "Below xmin" and "Above xmin".]

             Below xmin   Above xmin
Avg. dens.   .2685        .4565

* Data obtained from "Predicting Defects for Eclipse", PROMISE 2007
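The average densities can be read as the mean, over the files on each side of xmin, of defects divided by LOC. Whether the slides average per-file densities or divide total defects by total LOC is not stated, so the per-file mean used below is an assumption, as are the function name `average_defect_density` and the sample records.

```python
def average_defect_density(files, xmin):
    """Return (mean density below xmin, mean density at or above xmin).

    `files` is an iterable of (metric_value, defects, loc) triples;
    the density of a file is defects / loc.
    """
    below, above = [], []
    for metric, defects, loc in files:
        if loc <= 0:
            continue  # skip empty files, whose density is undefined
        (above if metric >= xmin else below).append(defects / loc)

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(below), mean(above)


if __name__ == "__main__":
    # Hypothetical (number of methods, defects, LOC) records and threshold.
    records = [(5, 1, 100), (5, 0, 100), (50, 4, 100)]
    print(average_defect_density(records, xmin=10))
```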



Defects density (only post-release defects)

Using Number of Methods and number of post-release defects per LOC


[Figure: two panels, "Below xmin" and "Above xmin".]

             Below xmin   Above xmin
Avg. dens.   .1437        .2690



Defects density (pre + post-release defects)

Using CYCLO/SLOC and number of total defects per LOC

[Figure: two panels on logarithmic axes; the labelled panel shows Pr(X ≥ x) against x.]

Avg. dens. = .3335 (>9000 files)
Avg. dens. = .7747 (364 files)

4 Summary of findings and further work





Summary and further work

Summary of preliminary findings

Some metrics have a transition from lognormal to power law

Clear relation between normalized metrics and defects density

Although the threshold might not be perfect (e.g., you might find a high defects density in a file below the threshold), it greatly reduces the search space for potentially problematic files

Further work

Verify in more projects

Do you have defects data at the file level?

Find explanation for the transition and its influence on quality

How do the statistical parameters change over time? Do defects evolve accordingly?

