Presentation for the Seminar on Open Source Evolution 2013 http://informatique.umons.ac.be/genlog/SOS-Evol/SOS-Evol2013.html
Statistical distributions of software metrics: do they matter?
Israel Herraiz
Technical University of Madrid
Grab these slides from
http://slideshare.net/herraiz/statistical-distributions-of-metrics
Israel Herraiz, UPM Statistical distributions of software metrics: do they matter? 1/17
Outline
1 Some background
2 Statistical properties of software metrics
3 Evidence of impact on quality
4 Summary of findings and further work
1 Some background
A (not so) long time ago...
Statistical distribution of software metrics
Software size follows a double Pareto distribution
Towards a theoretical model for software growth, MSR 2007
More recently
Not only size, but some OO metrics too (and some complexity metrics)
On the Statistical Distribution of Object-Oriented System Properties, WETSoM 2012
OK, but what is that double Pareto thing?
[Figure: P[X > x] against SLOC on log-log axes, with series Data, Double Pareto and Lognormal]
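The quantity on the vertical axis of such a plot, P[X > x], is the complementary cumulative distribution function (CCDF). As a minimal sketch, it could be computed from a list of file sizes as follows. The function name `empirical_ccdf` and the sample sizes are illustrative assumptions, not data from the slides.

```python
from bisect import bisect_right


def empirical_ccdf(sizes):
    """Return (x, P[X > x]) pairs for each distinct value in sizes."""
    ordered = sorted(sizes)
    n = len(ordered)
    points = []
    for x in sorted(set(ordered)):
        # bisect_right gives the number of observations that are <= x
        greater = n - bisect_right(ordered, x)
        points.append((x, greater / n))
    return points


# Hypothetical file sizes in SLOC (made up for illustration)
sample_sloc = [10, 20, 20, 50, 100, 400, 5000, 20000]
ccdf = empirical_ccdf(sample_sloc)
```

Plotted on log-log axes, a power-law tail shows up as a roughly straight line, whereas a lognormal curve keeps bending downward.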
But does it matter?
Most of the files are on the lognormal side
[Figure: bar chart of % Files for C, C++, Java, Python and Lisp]
But the power law minority matters a lot
[Figure: bar chart of % SLOC for C, C++, Java, Python and Lisp]
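The two charts compare the share of files with the share of SLOC on each side of the threshold. A minimal sketch of that computation is shown below. The function name `tail_shares`, the choice of treating "above" as greater than or equal to the threshold, and the sample sizes are all assumptions made for illustration.

```python
def tail_shares(sizes, threshold):
    """Return (% of files, % of total SLOC) held by files at or above threshold."""
    # Assumption: "above" includes files exactly at the threshold
    tail = [s for s in sizes if s >= threshold]
    pct_files = 100.0 * len(tail) / len(sizes)
    pct_sloc = 100.0 * sum(tail) / sum(sizes)
    return pct_files, pct_sloc


# Hypothetical project: 99 small files and a single very large one
sample_sloc = [50] * 99 + [5000]
pct_files, pct_sloc = tail_shares(sample_sloc, threshold=1000)
```

In this made-up sample, a single file out of a hundred holds about half of the code, which is the kind of imbalance the slide points at.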
Large files have a large impact
Size estimation models
Some software size estimation models are based on the log-normality of size metrics. These models systematically underestimate the size of software.
[Figure: four panels (C, C++, Java, Python) plotting RE against SLOC]
On the distribution of source code file sizes ICSOFT 2011
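To make the comparison concrete, here is a rough sketch of how a lognormal-based size estimate could be checked against the actual total. It rests on two assumptions that the slides do not state: the estimate is taken as the number of files times the lognormal mean, exp(mu + sigma^2 / 2), and the error is taken as a signed relative error in percent. The models in the cited paper may differ; the function names are hypothetical.

```python
import math
import statistics


def lognormal_total_estimate(sizes):
    """Estimate total size as n * exp(mu + sigma^2 / 2), fitted on log-sizes."""
    logs = [math.log(s) for s in sizes]
    mu = statistics.mean(logs)
    sigma2 = statistics.pvariance(logs)
    return len(sizes) * math.exp(mu + sigma2 / 2.0)


def relative_error_pct(estimated, actual):
    """Signed relative error in percent (assumed convention, not from the slides)."""
    return 100.0 * (estimated - actual) / actual


# Hypothetical file sizes in SLOC (made up for illustration)
sample_sloc = [12, 30, 45, 80, 150, 300, 900, 15000]
estimate = lognormal_total_estimate(sample_sloc)
error_pct = relative_error_pct(estimate, sum(sample_sloc))
```

Under this sign convention, a negative value means the model underestimates the actual size.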
2 Statistical properties of software metrics
Parameters of the statistical distribution
Power law parameters: λ and xmin
Transition from lognormal to power law
[Figure: P[X > x] against SLOC on log-log axes, with series Data, Double Pareto and Lognormal]
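The slides do not say how the power law parameters were fitted. As an illustration only, assuming the tail at or above xmin follows a continuous power law with density proportional to x^(-alpha), the standard maximum-likelihood estimate of the exponent is alpha = 1 + n / sum(ln(x_i / xmin)). Whether the λ named above corresponds to this alpha is an assumption, xmin is taken as given here, and the function name is hypothetical.

```python
import math


def powerlaw_alpha_mle(values, xmin):
    """Maximum-likelihood exponent of a continuous power law for values >= xmin."""
    tail = [v for v in values if v >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(v / xmin) for v in tail)
    if log_sum == 0:
        raise ValueError("all tail values equal xmin; exponent is undefined")
    return 1.0 + len(tail) / log_sum


# Hypothetical metric values; the two smallest fall below xmin and are ignored
sample = [1, 2] + [10 * math.e] * 4
alpha = powerlaw_alpha_mle(sample, xmin=10)
```

Choosing xmin itself is a separate problem; a common approach is to try candidate values and keep the one whose fitted tail best matches the data.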
3 Evidence of impact on quality
Probability of finding defects
We have seen that files above xmin account for 40% of total size while being only about 1% of the files.
What about defects? Probability of finding defects in three software projects (using CYCLO as metric)
Project       Below xmin   Above xmin
Apache        .4178        .7708
OpenIntents   .2500        .7500
Zxing         .2143        .4161
* Data extracted from “ReLink: Recovering Links between Bugs and Changes” FSE 2011.
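A table like this could be produced from file-level data along the following lines. This is a sketch under assumptions: "probability of finding defects" is read as the fraction of files on each side of xmin that have at least one defect, files exactly at xmin count as above, and the function name and sample data are hypothetical.

```python
def defect_probability_by_side(files, xmin):
    """files: iterable of (metric_value, has_defect) pairs.

    Returns the fraction of defective files below and at-or-above xmin.
    """
    counts = {"below": [0, 0], "above": [0, 0]}  # [defective files, total files]
    for metric, has_defect in files:
        side = "above" if metric >= xmin else "below"
        counts[side][1] += 1
        if has_defect:
            counts[side][0] += 1
    # None marks a side with no files, to avoid dividing by zero
    return {side: (d / t if t else None) for side, (d, t) in counts.items()}


# Hypothetical (CYCLO, has_defect) pairs, made up for illustration
sample = [(3, False), (5, True), (8, False), (12, False),
          (40, True), (55, True), (90, False), (120, True)]
probs = defect_probability_by_side(sample, xmin=30)
```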
Probability of finding defects (normalized metrics)
Using CYCLO / WMC as metric (cyclomatic complexity per LOC)
Project       Below xmin   Above xmin
Apache        .4159        .6296
OpenIntents   .2813        .5417
Zxing         .3181        .2389
Defects density (only pre-release defects)
Using Number of Methods and number of pre-release defects per LOC
[Figure: two histograms, one panel for files below xmin and one for files above xmin]
Below xmin: Avg. Dens. = .2685
Above xmin: Avg. Dens. = .4565
* Data obtained from “Predicting Defects for Eclipse” PROMISE 2007
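The averages reported for each side could be computed as in the sketch below. It assumes that the average density is the mean of per-file densities (defects divided by LOC) rather than total defects over total LOC, which the slides do not specify. The function name and sample data are hypothetical.

```python
def average_defect_density(files, xmin):
    """files: iterable of (metric_value, defects, loc) triples.

    Returns the mean per-file defect density below and at-or-above xmin.
    """
    densities = {"below": [], "above": []}
    for metric, defects, loc in files:
        if loc <= 0:
            continue  # skip empty files, whose density is undefined
        side = "above" if metric >= xmin else "below"
        densities[side].append(defects / loc)
    # None marks a side with no usable files
    return {side: (sum(v) / len(v) if v else None) for side, v in densities.items()}


# Hypothetical (number of methods, defects, LOC) triples, made up for illustration
sample = [(2, 1, 100), (4, 0, 50), (30, 4, 200), (45, 6, 100)]
avg_density = average_defect_density(sample, xmin=20)
```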
Defects density (only post-release defects)
Using Number of Methods and number of post-release defects per LOC
[Figure: two histograms, one panel for files below xmin and one for files above xmin]
Below xmin: Avg. Dens. = .1437
Above xmin: Avg. Dens. = .2690
Defects density (pre + post-release defects)
Using CYCLO/SLOC and number of total defects per LOC
[Figure: two log-log plots, one for files below xmin and one for files above xmin; the recoverable axis labels are Pr(X ≥ x) and x]
Below xmin: Avg. Dens. = .3335 (>9000 files)
Above xmin: Avg. Dens. = .7747 (364 files)
4 Summary of findings and further work
Summary of preliminary findings
Some metrics have a transition from lognormal to power law
Clear relation between normalized metrics and defects density
Although the threshold might not be perfect (e.g., you might find a high defect density in a file on the lower side), it greatly reduces the search space for potentially problematic files
Further work
Verify in more projects
Do you have defects data at the file level?
Find explanation for the transition and its influence on quality
How do the statistical parameters change over time? Do defects evolve accordingly?