
Limitations of traditional scientific approaches and new paradigms - Examples from statistics, physics, and implications for Software Engineering Research


This talk is part of the Oberseminar at TU München on the use of scientific methodologies for software engineering.


Page 1

Technische Universität München

Dr. Antonio Vetrò

With feedback and some material from: Dr. Daniel Méndez Fernández, Prof. Dr. Dr. h.c. Manfred Broy, Prof. Dr. Angelo Vulpiani, Prof. Dr. Francesco Sylos Labini

Limitations of traditional scientific approaches and new paradigms - Examples from statistics, physics, and implications for Software Engineering Research

Page 2

Outline

• Limitations on the usage of the p-value (Perspective from Statistics)
– And implications for software engineering research

• Limitations on predictions based on past data (Perspective from Physics)
– And implications for software engineering research

• Actions

Page 3

Outline

• Limitations on the usage of the p-value (Perspective from Statistics)
– And implications for software engineering research

• Limitations on predictions based on past data (Perspective from Physics)
– And implications for software engineering research

• Actions

Page 4

Hypothesis Testing

• Research hypothesis
– A statement of what the researcher believes will be the outcome of an experiment or a study.

• Statistical hypotheses
– A more formal structure derived from the research hypothesis: a null hypothesis is set up and rejected in favor of the alternative one if not supported by the data.

• Memento: we can reject hypotheses but not confirm them!

Definitions from: Marco Torchiano, Empirical Software Engineering Course for PhD students, Politecnico di Torino. Example: own.

Example:
• Research hypothesis:
– Java classes containing code smells are more bug prone

• Statistical hypothesis:
Given
– X = number of bugs counted in a sample of classes with code smells
– Y = number of bugs counted in a sample of classes without code smells

H0: μx ≤ μy versus HA: μx > μy (one-tailed)

Page 5

Hypothesis testing in our example

Example:
• Research hypothesis:
– Java classes containing code smells are more bug prone

• Statistical hypothesis we aim to reject:
– Nr of bugs in classes with code smells ≤ nr of bugs in classes without code smells

(Figure: distributions of the nr of bugs in classes with code smells (123) and in classes without code smells (250); 373 data points in total.)

Data: Hadoop v. 0.14.0, from: Antonio Vetrò, “Hadoop data”, DOI 10.13140/2.1.2413.1204

Page 6

The P-value: recalling the basics (1/2)

• The p-value is the probability that we would have seen our data just by chance if the null hypothesis were true.

• In our example:
– Suppose H0: μx ≤ μy is TRUE (classes with code smells have no more bugs than classes without)
– Rationale: we want to know the probability of observing μx > μy just by chance
– E.g., p-value < 0.001 means: P(empirical data | null hypothesis) < 0.001

• In hypothesis testing, a null hypothesis (H0) is rejected in favor of the alternative one (HA) when the p-value is lower than a predefined threshold for making an error.

Figure: http://en.wikipedia.org/wiki/One-_and_two-tailed_tests
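In symbols, restating the definition above for our one-tailed test (v is the observed value of the test statistic):

```latex
% p-value: probability, under H0, of a statistic at least as extreme as observed
p = P\left( T \ge v \mid H_0 \right)
```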

Page 7

Error and Power

Type I error (also known as α):
– Rejecting the null when the effect isn't real.
– The probability of finding an effect that isn't real (false positive).
– If we require p-value < .05 for statistical significance, this means that 1 in 20 times we will find a positive result just by chance.

Type II error (also known as β):
– Failing to reject the null when the effect is real.
– The probability of missing an effect (false negative).

POWER (the flip side of the type II error: 1 − β):
– The probability of seeing a true effect if one exists.
– The probability of not making a type II error.
– When we design studies, we typically aim for a power of 80% (allowing a false negative rate, or type II error rate, of 20%).

Your statistical decision, against the true state of the null hypothesis:

True state of the null hypothesis:
– H0 true (example: classes with code smells are not more bug prone)
– H0 false (example: classes with code smells are more bug prone)

Decision:
– Reject H0 (ex: we conclude classes with code smells are more bug prone): Type I error (α) if H0 is true; correct if H0 is false.
– Do not reject H0 (ex: we conclude classes with code smells are not more bug prone): correct if H0 is true; Type II error (β) if H0 is false.
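As a minimal sketch of such a power calculation, assuming the statsmodels package is available (the effect size below is illustrative, not taken from the Hadoop data):

```python
# Minimal sketch: sample size needed for 80% power in a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,  # hypothesized standardized mean difference (illustrative)
    alpha=0.05,       # type I error rate
    power=0.80,       # 1 - beta, i.e. a 20% type II error rate
    ratio=1.0,        # equal group sizes
)
print(f"Required sample size per group: {n_per_group:.0f}")
```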

Page 8

The P-value: recalling the basics (2/2)

Compute the p-value in our example*:

• Test statistic: T = (X̄ − Ȳ) / (Sp √(1/n + 1/m)), where Sp² is the pooled sample variance; under H0, T follows a t distribution with n + m − 2 degrees of freedom.

• If the value of the test statistic T is v, then the p-value is P(T(n+m−2) ≥ v).

Source: Probability and Statistics for Engineers and Scientists, Sheldon Ross. * Normal distributions and unknown variances; n and m are the sample sizes of X and Y; 95% confidence level.

Page 9

The P-value: recalling the basics (2/2)

(Same computation as the previous slide, now with the values obtained on the Hadoop data:)

X̄ = 1.187, Ȳ = 0.208, Sp² = 1.622; confidence level = 0.95, t(0.95, 371) = 1.649; test statistic v = 6.98 > 1.649; p-value = 6.891e-12

H0 rejected; but this tells us nothing about causality. We need explanations.
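As a minimal sketch of this test in code, assuming scipy ≥ 1.6 is available (the arrays below are illustrative stand-ins, not the actual Hadoop bug counts):

```python
# One-tailed two-sample t-test: H0: mu_x <= mu_y versus HA: mu_x > mu_y.
import numpy as np
from scipy import stats

bugs_with_smells = np.array([3, 0, 1, 2, 0, 4, 1, 0, 2, 1])     # X (illustrative)
bugs_without_smells = np.array([0, 1, 0, 0, 2, 0, 1, 0, 0, 0])  # Y (illustrative)

# equal_var=True matches the pooled-variance statistic above;
# alternative="greater" makes the test one-tailed (HA: mu_x > mu_y).
t_stat, p_value = stats.ttest_ind(
    bugs_with_smells, bugs_without_smells,
    equal_var=True, alternative="greater",
)
print(f"t = {t_stat:.3f}, one-sided p-value = {p_value:.3g}")
```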

Page 10

The P-value and significance of the test

By convention, p-values of <0.05 are often accepted as “statistically significant” in the scientific literature; but this is an arbitrary cut-off.

A cut-off of p<0.05 means that in about 5 of 100 experiments, a result would appear significant just by chance (“Type I error”).

The “correct” level of significance to use in a given situation depends on the individual circumstances involved in that situation.

For instance, if rejecting a null hypothesis H0 would result in large costs that would be wasted if H0 were indeed true, then we might elect to be more conservative and choose a significance level of 0.05 or 0.01.

Also, if we initially feel strongly that H0 is correct, then we would require very stringent data evidence to the contrary to reject H0. (That is, we would set a very low significance level in this situation.)

For exploratory studies, we might be less conservative and choose level 0.10.

Page 11

Low p-values: alternative actions

In practice, a low p-value leads us to reject the hypothesis as false.

In fact, we have two more options:
– Reject the observation as an outlier
– Accept that we have made a rare observation, which is still possible given the hypothesis

The next slide explains why.

Page 12

The p-value: a general limitation

“It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place.”

Probabilities after the experiment are computed with the Bayes factor.
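As a short formal sketch of that “other piece of information”: the Bayes factor turns the prior odds of the hypotheses into posterior odds:

```latex
% Posterior odds = Bayes factor x prior odds
\frac{P(H_A \mid \text{data})}{P(H_0 \mid \text{data})}
  = \underbrace{\frac{P(\text{data} \mid H_A)}{P(\text{data} \mid H_0)}}_{\text{Bayes factor}}
  \times \frac{P(H_A)}{P(H_0)}
```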

Page 13

Multiple hypotheses in the era of Big Data

The large availability of data permits testing multiple hypotheses at once. With N0 hypotheses actually null, N1 actually non-null, and m = N0 + N1 in total, the outcomes can be tabulated as:

                                  # rejected    # not rejected    Total
Hypotheses actually null              a             N0 − a          N0
Hypotheses actually non-null        R − a        N1 − (R − a)       N1
Total                                 R              m − R           m

Here R is the total number of rejected hypotheses, and a is the number of rejections that were actually null (the false discoveries).

Page 14

Single case testing situation

(Figure: distributions of the test statistic under the null and the alternative hypothesis for a single test, showing the Type I error α and the power 1 − β.)

Page 15

Multiple case testing situation

False discovery rate: FDR = a / R, the fraction of rejections that are false discoveries.

Goal: control the FDR.

Page 16

Why it is important to control FDR (1/2)

Source: E. J. Candès.

(Figure: an example with 1000 hypotheses and 100 potential discoveries; two cases, A and B, are compared.)

Page 17

Why is it important to control FDR (2/2)

Control the per-comparison type I error (PCER):
– a.k.a. “uncorrected testing”; many type I errors
– P(FDi > 0) ≤ α marginally for all 1 ≤ i ≤ N

Control the familywise type I error (FWER):
– E.g., Bonferroni: use per-comparison significance level α/m
– Guarantees P(FD > 0) ≤ α
– Very stringent; many type II errors

Control the false discovery rate (FDR):
– First defined by Benjamini & Hochberg (BH, 1995, 2000) – see the algorithm on the next slide
– Guarantees FDR ≡ E(FD / D) ≤ α

Source: Christopher R. Genovese, Dept. of Statistics Carnegie Mellon University

Page 18

Benjamini & Hochberg in a nutshell

Modified from: http://www.unc.edu/courses/2007spring/biol/145/001/docs/lectures/Nov12.html

Sort the p-values in increasing order: p(1) < p(2) < p(3) < … < p(m).

k:           1       2        3       …    m
p-value:     p(1)    p(2)     p(3)    …    p(m)
threshold:   α*/m    2α*/m    3α*/m   …    mα*/m

Compare each p-value p(k) against its corresponding threshold kα*/m, and let k̂ be the largest k such that p(k) ≤ kα*/m.

Decision rule: if k̂ ≥ 1, then reject the hypotheses that correspond to p(1), …, p(k̂) and fail to reject the hypotheses that correspond to the rest.
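A minimal sketch of the procedure in Python (the function name and signature are our own):

```python
# Minimal sketch of the Benjamini-Hochberg step-up procedure.
import numpy as np

def benjamini_hochberg(p_values, alpha_star=0.10):
    """Return a boolean array: True where the hypothesis is rejected."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                      # indices that sort the p-values
    thresholds = alpha_star * np.arange(1, m + 1) / m
    below = p[order] <= thresholds             # p(k) <= k * alpha* / m ?
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_hat = np.max(np.nonzero(below)[0])   # largest k with p(k) below threshold
        reject[order[: k_hat + 1]] = True      # reject hypotheses for p(1)..p(k_hat)
    return reject
```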

Page 19

Fix and control FDR: an example. Let's assume we have multiple (independent) studies investigating the bug proneness of classes with code smells, with the same constructs.

We simulate the obtained p-values by extracting 20 random values from a uniform distribution with min = 0.001 and max = 0.025. We set our desired α* = 0.10.

k     p(k)     kα*/m    p(k) ≤ kα*/m
1     0.013    0.005    0
2     0.016    0.010    0
3     0.017    0.015    0
4     0.019    0.020    1
5     0.021    0.025    1
6     0.037    0.030    0
7     0.041    0.035    0
8     0.045    0.040    0
9     0.048    0.045    0
10    0.052    0.050    0
11    0.060    0.055    0
12    0.068    0.060    0
13    0.087    0.065    0
14    0.102    0.070    0
15    0.106    0.075    0
16    0.109    0.080    0
17    0.118    0.085    0
18    0.136    0.090    0
19    0.148    0.095    0
20    0.149    0.100    0

The largest k with p(k) ≤ kα*/m is k̂ = 5, so we reject the hypotheses corresponding to p(1), …, p(5).

Publication bias may still affect results!
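Applying the sketch from the previous slide to these 20 p-values reproduces the same outcome:

```python
p_vals = [0.013, 0.016, 0.017, 0.019, 0.021, 0.037, 0.041, 0.045, 0.048, 0.052,
          0.060, 0.068, 0.087, 0.102, 0.106, 0.109, 0.118, 0.136, 0.148, 0.149]
rejected = benjamini_hochberg(p_vals, alpha_star=0.10)
print(rejected.sum())  # 5: the hypotheses with the five smallest p-values
```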

Page 20

Observations / food for thought

Controlling the FDR increases power while maintaining control over the error.

It is a useful technique for putting together the findings of multiple studies; in software engineering it is especially convenient for simulations, but…

What about experimental and observational research? In software engineering, replicability of studies is often difficult: it is very hard to obtain the same conditions and run multiple tests.

Page 21

Other pitfalls of the p-value and control actions (I)

1. Unimportant effects may be statistically significant if a study is large (and therefore has a small standard error and extreme precision).

Pay attention to effect size and confidence intervals.

2. Statistical significance does not imply a cause-effect relationship.

Interpret results in the context of the study design.

3. A significance level of 0.05 means that your false positive rate for one test is 5%: if you run more than one test, your false positive rate will be higher than 5%, namely 1 − 0.95^n_tests.

Control the study-wide type I error by planning a limited number of tests. Distinguish between planned and exploratory tests in the results. Correct for multiple comparisons (see the correction methods in the previous slides).
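A quick numerical check of pitfall 3 (the numbers of tests below are illustrative):

```python
# False positive rate across n independent tests, each at alpha = 0.05.
for n_tests in (1, 5, 10, 20):
    fwer = 1 - 0.95 ** n_tests
    print(f"{n_tests:2d} tests -> probability of at least one false positive = {fwer:.2f}")
# 1 -> 0.05, 5 -> 0.23, 10 -> 0.40, 20 -> 0.64
```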

Page 22

Other pitfalls of the p-value and control actions (II)

4. Results that are not statistically significant should not be interpreted as “evidence of no effect,” but as “no evidence of effect”: studies may miss effects if they are insufficiently powered (lack precision).

Design adequately powered studies, and interpret null results in the context of the study's power.

5. “The effect was significant in the treatment group, but not significant in the control group” does not imply that the groups differ significantly.

Use the proper statistical tests for within-group and between-group differences.

Page 23

Food for thought (see the last slide for actions)

The root cause of our problem is a philosophy of scientific inference that is supported by the statistical methodology in dominant use. This philosophy might best be described as a form of “naïve inductivism,” a belief that all scientists seeing the same data should come to the same conclusions.

Goodman, S. N. Epidemiology 12, 295–297 (2001).

Page 24

Outline

• Limitations on the usage of the p-value (Perspective from Statistics)
– And implications for software engineering research

• Limitations on predictions based on past data (Perspective from Physics)
– And implications for software engineering research

• Actions

Page 25

Summary of the 1st part of the discussion

Statistical tests are a powerful method to give scientific foundations to theories and to provide evidence for them from the observations it is possible to get.

However, they have some limitations:
– not all assumptions always hold (e.g., normality in parametric tests)
– often we don't have enough data to draw proper conclusions

Nowadays the large availability of data allows the application of better techniques to reduce the error in drawing conclusions from statistical tests:
– New paradigm: control the false discovery rate rather than the type I error
– However, this is not always applicable to software engineering, due to the intrinsic complexity of the phenomena involved and the multiple confounding factors

Examples from physics can help us understand why.

Page 26

Inductive approach in the era of Big Data

Paradigm: “Collect data first, ask questions later” (is that science?). It translates into “inference from data” with no a priori questions:
– E.g.: regression, patterns, and hypothesis building from historical data, often applied for prediction purposes

Important questions for prediction:
– Which are the relevant variables?
– What kind of laws regulate the system?
– What kind of perspective do we take: deterministic or probabilistic?

Different situations:
A. Evolution laws of the system exist and are known
B. Evolution laws of the system exist and are not known
C. We don't know whether the system obeys any laws

Page 27

Deterministic approach: the problem of chaos (Poincaré's work)

Page 28

The problem of chaos (Poincaré's work): an example from physics

In a deterministic chaotic system, nearby trajectories separate exponentially fast (Lyapunov exponent λ > 0): |δx(t)| ≃ |δx(0)| e^(λt).

That means: the system can be predicted within a tolerance Δ only up to a certain time, which depends on λ: Tp ∼ (1/λ) ln(Δ/|δx(0)|).

Let's take a simple prediction function, the logistic map x(t+1) = 4 x(t)(1 − x(t)), whose Lyapunov exponent is λ = ln 2 ≃ 0.693: its prediction error doubles at every step.

Example with the logistic map: despite the very similar initial conditions (|x(0) − x′(0)| = 4 × 10⁻⁶), after t = 16 the two trajectories are completely different (“butterfly effect”).

(Figure: a stable system shows high predictability; a chaotic one, low predictability. Source: Sylos Labini)
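A minimal sketch of that butterfly effect on the logistic map (the starting point x(0) = 0.3 is illustrative):

```python
# Two logistic-map trajectories, initially 4e-6 apart, diverge completely.
def logistic(x):
    return 4.0 * x * (1.0 - x)

x, x_prime = 0.3, 0.3 + 4e-6   # very similar initial conditions
for t in range(1, 17):
    x, x_prime = logistic(x), logistic(x_prime)
    print(f"t={t:2d}  |x - x'| = {abs(x - x_prime):.3e}")
# The gap roughly doubles at each step (lambda = ln 2) and is O(1) by t = 16.
```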

Page 29

Prediction window

Predicting eclipses and tides is easier because those phenomena are less chaotic, i.e. the Lyapunov exponent λ is lower (and so the predictability window is large). That is why ancient populations (such as the Maya) could understand the periodicity of the planets' movements without having a physical reference model.

The atmosphere is a much more chaotic system, and its predictability window is quite short (i.e. λ is higher); see Lorenz's efforts.

Source: Sylos Labini

Page 30

Problems, and their relations to software engineering:
– The equations of the phenomena are not always known (do they even exist?)
– Often we don't even have a set of variables that describes the phenomenon

However, the large available data on the past can be used to predict the future with a certain probability, or to discover patterns in the data.

Page 31

Probabilistic approach: the method of analogs

Predicting the future from the past: An old problem from a modern perspective. Cecconi, F., Cencini, M., Falcioni, M., and Vulpiani, A., American Journal of Physics, 80, 1001-1008 (2012), DOI: http://dx.doi.org/10.1119/1.4746070

• Most prediction algorithms work under the following basic idea:
– We know the past, i.e. a series (x1, x2, ..., xM) where xj = x(jΔt)
– We want to forecast the future, i.e. xM+t
– We look back in the past to find a situation similar to the present (time M), i.e. a vector xk with k < M and |xk − xM| < ε
– We predict the state at time M + t as x̂M+t = xk+t
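A minimal sketch of the method of analogs on a scalar series (function name, test series, and ε are illustrative):

```python
# Predict x[M+t] by finding the closest past analog of the present state x[M].
import numpy as np

def predict_by_analog(series, t, eps):
    """series: past observations x_1..x_M; forecast t steps after the end."""
    x = np.asarray(series, dtype=float)
    M = len(x) - 1                        # index of the present state
    candidates = np.arange(0, M - t)      # analogs must have t recorded successors
    distances = np.abs(x[candidates] - x[M])
    k = candidates[np.argmin(distances)]  # past state closest to the present
    if distances.min() > eps:
        return None                       # no analog within precision eps
    return x[k + t]                       # its successor is the forecast

# Illustrative usage on a chaotic logistic-map series:
state, series = 0.3, []
for _ in range(5000):
    series.append(state)
    state = 4.0 * state * (1.0 - state)
print(predict_by_analog(series, t=1, eps=0.01))
```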

Page 32

Considerations

In an ergodic system, the average return time of a set A is proportional to the system's characteristic time τ0 and inversely proportional to the probability of A (Kac's lemma): ⟨τA⟩ ≃ τ0 / P(A).

For a set of linear dimension O(ε), P(A) ∼ ε^D, where D is the number of variables involved*.

Good news: to find an analog in the past with precision ε, we must go back in time by ⟨τ⟩ ∼ τ0 ε^(−D).

Bad news: to find such an analog, the length of the series should be of the same order, which grows exponentially with the number of variables (e.g., precision 5%: t = 6x10^7).

Which are the relevant variables? What kind of laws regulate our systems? Do we take a deterministic or probabilistic perspective?

* For details see Cecconi, F., Cencini, M., Falcioni, M., and Vulpiani, A., DOI: http://dx.doi.org/10.1119/1.4746070

Page 33

Considerations (continued)

In very complex systems (e.g. earthquakes) the state vector is not known a priori, and the data are not enough to reach appreciable precision in predictions (because it is almost impossible to find an analog back in the past).

Example: the parable of Google Flu.

The Parable of Google Flu: Traps in Big Data Analysis. David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani. SCIENCE, Vol. 343, 14 March 2014.

ILI: influenza-like illness. CDC: Centers for Disease Control and Prevention, which bases its estimates on surveillance reports from laboratories across the United States.

Page 34

Implications for Software Engineering Research

In software engineering the space of variables is very large and not known a priori.

This relates back to the initial important questions:
• Which are the relevant variables?
• What kind of laws regulate the system?
• What kind of perspective do we take: deterministic or probabilistic?

It is extremely difficult to find analogs in the past: software projects are barely comparable to each other.

Even projects with the same people and the same objectives would always follow different processes, obey different psychological factors, etc.

As a consequence, experiments are difficult to reproduce or replicate.

Page 35

Reproducibility vs replicability (for experiments)

Reproducibility (requires change) vs. replicability (avoids change: ceteris paribus) – a “poor substitute for reproducibility”?

Nature initiative:
– no space limitations on Methods sections
– statisticians help review papers and measures
– raw data online is encouraged
– checklist for life-science submissions

Other ongoing initiatives:
– The Recomputation Manifesto
– ARRIVE – Animal Research: Reporting of In Vivo Experiments
– National Institutes of Health of the United States (NIH)

Page 36

Outline

• Limitations on the usage of the p-value (Perspective from Statistics)
– And implications for software engineering research

• Limitations on predictions based on past data (Perspective from Physics)
– And implications for software engineering research

• Actions

Page 37

Actions (proposals): let’s discuss

a) Stop aiming at absolute generalization; focus on specific studies:
– Focus on systems' underlying mechanisms
– Provide “engineering solutions” which solve very specific problems

b) Don't give up on generalization and, being aware of the limitations of the different approaches, stress:
– rigor and transparency in the methodology of data collection and analysis (aim at reproducibility first rather than replicability)
– a universal language, i.e. a commonly agreed set of variables to represent the phenomena under study
– details on (standardized) context information

c) Develop theories first (see the Oberseminar on the Role of Mathematics and Logical Theories):
– theories also deliver explanations of the underlying mechanisms
– empirically test them or provide sound evidence

Page 38

Open questions / food for thoughts

Which software engineering phenomena can we study by applying empirical methods?

In which circumstances is it useful to study those phenomena with empirical methods?

What else should we check (methodology, relevance, etc.) before starting an empirical evaluation?

Page 39

References and sources

Perspective from Statistics:
– R. Foygel Barber and E. J. Candès. Controlling the false discovery rate via knockoffs.
– Sheldon M. Ross, Introduction to Probability and Statistics for Engineers and Scientists, Elsevier.
– Ronald E. Walpole, Raymond H. Myers, Sharon L. Myers, Keying Ye, Probability & Statistics for Engineers & Scientists.
– Bradley Efron, Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, ISBN 9781107619678, Jan 2013.
– Goodman, S. N. Epidemiology 12, 295–297 (2001).

Perspective from Physics:
– Cecconi, F., Cencini, M., Falcioni, M., and Vulpiani, A., Predicting the future from the past: An old problem from a modern perspective, American Journal of Physics, 80, 1001-1008 (2012), DOI: http://dx.doi.org/10.1119/1.4746070
– Francesco Sylos Labini, Big Data, Complexity and Scientific Method.
– Chris Anderson, The End of Theory.
– L.F. Richardson, Weather Prediction by Numerical Process (Cambridge University Press, 1922).

Replicability:
– Nature, Reproducibility initiative (checklist here).
– Chris Drummond, Replicability is not Reproducibility: Nor is it Good Science.
– The Recomputation Manifesto.