57
1 Het begint met een idee Statistical tests and effect size Ivano Malavolta

The Green Lab - [09 A] Statistical tests and effect size

Embed Size (px)

Citation preview

Page 1: The Green Lab - [09 A] Statistical tests and effect size

1 Het begint met een idee

Statistical tests and effect size

Ivano Malavolta

Page 2: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

2 Ivano Malavolta / S2 group / Statistical tests and effect size

Roadmap

● Warm up● Check for normality● Main statistical tests● p-value corrections● Effect size

Page 3: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

3 Ivano Malavolta / S2 group / Statistical tests and effect size

Context and assumptions

● We focus on quantitative variables only

○ nominal

○ ordinal

○ interval

○ ratio

● Factors are nominal or ordinal

○ dependent variables can be any

● Our statistical tests detect differences between the means of the dependent variable

● Factor alternatives are fixed a priori

Page 4: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

4 Ivano Malavolta / S2 group / Statistical tests and effect size

Tasks for data analysis

1. Descriptive statistics○ for understanding the “shape” of collected data

2. Select statistical test○ according to collected metrics and data distribution

3. Hypothesis testing○ for providing evidence about your findings

i. statistical significance

4. Effect size calculation○ for understanding if your (statistically significant) results are actually

relevant in practice

5. Power analysis○ for knowing if your results are emerging only by luck [1]

Page 5: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

5 Ivano Malavolta / S2 group / Statistical tests and effect size

What is a statistical test?

● Calculation of a sample statistic assuming that the null hypothesis is true

● The calculated value of the statistic has a certain probability given that the null hypothesis is true (p-value)

Page 6: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

6 Ivano Malavolta / S2 group / Statistical tests and effect size

First choice: parametric VS non-parametric tests

● Parametric tests assume a specific distribution of the data○ typically, normal distribution○ more powerful

→ smaller sample sizes are needed

● Non-parametric tests do not make any assumption about data distribution○ more general○ less powerful

→ larger samples are needed

Page 7: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

7 Ivano Malavolta / S2 group / Statistical tests and effect size

How to choose?

● Applicability○ type of variable

■ nominal, ordinal scale → non-parametric○ distribution of the values

■ normality○ other assumptions

● Power

If you can apply it, always prefer a parametric test

In your report explain in details why you choose a specific test

Page 8: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

8 Ivano Malavolta / S2 group / Statistical tests and effect size

How to choose?

Non-parametric methods Parametric methods

Model restrictions

(eg, normality) satisfied?

Nominal or ordinal scale?

YES

NO

YES

NO

Page 9: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

9

Check for normality

Ivano Malavolta / S2 group / Statistical tests and effect size

Page 10: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

10 Ivano Malavolta / S2 group / Statistical tests and effect size

Graphical check (Q-Q plot)

Page 11: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

11 Ivano Malavolta / S2 group / Statistical tests and effect size

Normality tests

● Normality tests○ H

0: sample is drawn from a normal distribution

● Shapiro-Wilk test (AKA Shapiro-Wilk’s W)

● If p-value <α for a given sample, we can conclude data is NOT normally distributed

Page 12: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

12 Ivano Malavolta / S2 group / Statistical tests and effect size

Shapiro-Wilk test

Page 13: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

13 Ivano Malavolta / S2 group / Statistical tests and effect size

Shapiro-Wilk test

Page 14: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

14

Shapiro-Wilk test

● Warning: Shapiro-Wilk is not robust for small samples!○ Additional verification (e.g. via Q-Q plot) is always needed

Page 15: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

15

Main statistical tests

Ivano Malavolta / S2 group / Statistical tests and effect size

Page 16: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

16 Ivano Malavolta / S2 group / Statistical tests and effect size

Statistical tests VS experiment design

Page 17: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

17 Ivano Malavolta / S2 group / Statistical tests and effect size

One factor - 2 treatments - random design

Page 18: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

18 Ivano Malavolta / S2 group / Statistical tests and effect size

t-Test

Goal: compare independent samples

○ Values of the dependent variable obtained with different treatments○ For each treatment you are measuring different subjects

Parametric

Hypotheses:

● Two-tailed○ H

0: μ

2 = μ

1 H

a: μ

2 = μ

1

● One-tailed (alternative: greater)○ H

0: μ

2 = μ

1 H

a: μ

2 > μ

1

● One-tailed (alternative: less) ○ H

0: μ

2 = μ

1 H

a: μ

2 < μ

1

● More powerful● Cannot say anything in the

opposite direction

Page 19: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

19 Ivano Malavolta / S2 group / Statistical tests and effect size

t-Test in R

Page 20: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

20 Ivano Malavolta / S2 group / Statistical tests and effect size

t-Test: example

Page 21: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

21 Ivano Malavolta / S2 group / Statistical tests and effect size

t-Test: example 2

Page 22: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

22 Ivano Malavolta / S2 group / Statistical tests and effect size

Mann-Whitney test

Goal: compare independent samples

● It can be used instead of the t-test when data is not normal

Non-parametric

Hypotheses:

● Two-tailed○ H

0: μ

2 = μ

1 H

a: μ

2 = μ

1

● One-tailed (alternative: greater)○ H

0: μ

2 = μ

1 H

a: μ

2 > μ

1

● One-tailed (alternative: less) ○ H

0: μ

2 = μ

1 H

a: μ

2 < μ

1

● Same hypotheses as the t-test

Page 23: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

23

Mann-Whitney test in R

Page 24: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

24 Ivano Malavolta / S2 group / Statistical tests and effect size

Mann-Whitney test: example

Page 25: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

25 Ivano Malavolta / S2 group / Statistical tests and effect size

One factor - 2 treatments - paired design

Page 26: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

26 Ivano Malavolta / S2 group / Statistical tests and effect size

Paired t-Test

Goal: compare independent samples from repeated measures

○ Each subject receives different treatments○ We focus on the differences exhibited by subjects with different treatments○ Samples must be equals in size

Parametric

Hypotheses:

● Two-tailed○ H

0: μ

d = 0 H

a: μ

d = 0

● One-tailed (alternative: greater)○ H

0: μ

d = 0 H

a: μ

d > 0

● One-tailed (alternative: less) ○ H

0: μ

d = 0 H

a: μ

d < 0

Page 27: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

27 Ivano Malavolta / S2 group / Statistical tests and effect size

Paired t-Test: example

Page 28: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

28 Ivano Malavolta / S2 group / Statistical tests and effect size

Wilcoxon signed-rank test

Goal: compare independent samples from repeated measures

● It can be used instead of the paired t-test in case of not normal data

Non-parametric

● Same hypotheses as the paired t-test

Hypotheses:

● Two-tailed○ H

0: μ

d = 0 H

a: μ

d = 0

● One-tailed (alternative: greater)○ H

0: μ

d = 0 H

a: μ

d > 0

● One-tailed (alternative: less) ○ H

0: μ

d = 0 H

a: μ

d < 0

Page 29: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

29 Ivano Malavolta / S2 group / Statistical tests and effect size

Wilcoxon signed-rank test: example

Page 30: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

30 Ivano Malavolta / S2 group / Statistical tests and effect size

>=1 factors - >2 treatments

Page 31: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

31 Ivano Malavolta / S2 group / Statistical tests and effect size

ANOVA (ANalysis Of VAriance)

Goal: understand how much of the total variance is due to differences within factors, and how much is due to differences across factors

○ Many types of ANOVA tests○ Works for many experiment designs

Parametric

Hypotheses:

H0

: μ1

= μ2

= μ3

Ha: μ

1 = μ

2 V μ

1 = μ

3 V μ

2 = μ

3

Page 32: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

32

F-statistic

F = Variation among sample means / variation within the samples

when H0

→ F follows a known F-distribution

● the mean of the F-distribution tends to be 1

https://goo.gl/jHWo8A

Page 33: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

33

Significance

F tends to be larger if H0

is false

→ the more F deviates from 1, the stronger the evidence for unequal population variances

● Methods to determine significance level:

○ textbook: compare F against a table of critical values (according to DF and α). If F > F

critical, reject H

0

○ computer-based: compute the p-value of finding F greater than the observed value. If p < α, reject H

0

Ivano Malavolta / S2 group / Statistical tests and effect size

Page 34: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

34

Types of ANOVA

Ivano Malavolta / S2 group / Statistical tests and effect size

● One-way ANOVA

○ one factor, >2 treatments

○ if 2 treatments: equivalent to t-test (almost never used)

Page 35: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

35

Types of ANOVA

Ivano Malavolta / S2 group / Statistical tests and effect size

● Factorial ANOVA

○ 2 (two-way) or more factors

○ any number of treatments

○ also computes interactions

Page 36: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

36

How to know which treatments really differ?

Tukey’s test

Page 37: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

37

Types of ANOVA

● Repeated Measures ANOVA (rANOVA)

○ computes changes of a treatment over time

○ used when applying each treatment to the same subjects (within-subject comparison)

Page 38: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

38

Types of ANOVA

● Multivariate ANOVA (mANOVA)

○ more than one outcome (dependent variable)

○ simultaneous analysis

Individual variables Combined

Page 39: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

39

ANOVA assumptions

Ivano Malavolta / S2 group / Statistical tests and effect size

● The dependent variable should be continuous● Samples must be independent ● Normal distribution of the dependent variable between

the groups (approximately)● Errors (aka residuals) should be normally distributed

○ qqPlot(residuals(myData.aov))

● Homoscedasticity○ variance between groups should be the same

■ bartlett.test(x ~ y, data=myData)

■ leveneTest(x ~ y, data=myData)

Assumptions violated → non-parametric alternative

Page 40: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

40

ANOVA: non-parametric alternative

Ivano Malavolta / S2 group / Statistical tests and effect size

● Kruskal-Wallis: one-way non-parametric ANOVA

○ one factor, multiple treatments

○ no estimate of the treatment effect (due to ranking)

Non-parametric

Page 41: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

41 Ivano Malavolta / S2 group / Statistical tests and effect size

Main statistical tests

You are measuring different subjectsYou are measuring the same subject against different treatments

Use this in case the values of your dep. var are not normally distributed

Page 42: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

42

p-value corrections

Ivano Malavolta / S2 group / Statistical tests and effect size

Page 43: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

43

Example

Ivano Malavolta / S2 group / Statistical tests and effect size

Dependent variable = energy consumption of the appIndependent variables =

● A: Image encoding algorithm: {png, jpeg}● B: Mobile device type: {high-end, low-end}● C: Network conditions: {wifi, 3G}

You perform 3 tests:● t.test(A, B)● t.test(A, C)● t.test(B, C)

P(at least one significant result) = 1 − P(no significant results)= 1 − (1 − 0.05)3

≈ 0.15→ 15% chance of seeing relevant results, when there may be none

Page 44: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

44

The problem

Ivano Malavolta / S2 group / Statistical tests and effect size

Multiple tests → higher probability of getting (statistically

significant ) results by chance

→ you have to adjust your α (it was 0.05)

Two main correction techniques:

● Bonferroni● Holm

Page 45: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

45

Bonferroni correction

Supposing we are doing N tests,

we can reject H0

if the p-values of those tests are below α/N

We can reject the H0

if all tests provide a p-value < 0.05/3=0.016

→ 0.016 is our new significance threshold!

Page 46: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

Less stringent that Bonferroni’s

Procedure:

● rank your p-values from the smallest to the largest● multiply the first by N, the second by N-1, etc.● a p-value is significant if after multiplied is <0.05

P-values of the tests: {0.01, 0.02, 0.03}

Bonferroni:● 0.01 * 3 = 0.03 ● 0.02 * 3 = 0.06 ● 0.03 * 3 = 0.09

46

Holm’s correction

Ivano Malavolta / S2 group / Statistical tests and effect size

Holm:● 0.01 * 3 = 0.03 ● 0.02 * 2 = 0.04 ● 0.03 * 1 = 0.03

Page 47: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

47

Effect size

Ivano Malavolta / S2 group / Statistical tests and effect size

Page 48: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

48

Recall

Page 49: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

49

Effect size measures

Ivano Malavolta / S2 group / Statistical tests and effect size

● Cohen’s d

○ parametric statistics

● Cliff’s delta

○ non-parametric statistics

Page 50: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

50

Cohen’s d

Ivano Malavolta / S2 group / Statistical tests and effect size

The magnitude of a main factor treatment effect on the dependent variable

Where:

● x1

, x2

= the means of the two groups● s = standard deviation

○ Pooled standard deviation for independent samples

Values:0 = full overlap1 = 1-sigma distance between the means…3 = 3-sigma distance → ~no overlap

Page 51: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

51

Cohen’s d in R

Independent samples Dependent samples

Page 52: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

52

Cliff’s delta

Ivano Malavolta / S2 group / Statistical tests and effect size

Represents the degree of overlap between the two distributions of scores

Where:

● xi … x

j = the specific values of each sample

● m, n = the cardinalities of the two groups

Values:0 = full overlap+1 = all the values of one group > all the values of the other one...-1 = the inverse

Page 53: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

53

Intuition

Page 54: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

54

Cliff’s delta in R

Page 55: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

55

What this lecture means to you?

Page 56: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

56 Ivano Malavolta / S2 group / Empirical software engineering

Readings

[1] Dybå, Tore, Vigdis By Kampenes, and Dag IK Sjøberg. "A systematic review of statistical power in software engineering experiments." Information and Software Technology 48.8 (2006): 745-755.

Chapter 10 Part 3 Chapter 6

Page 57: The Green Lab - [09 A] Statistical tests and effect size

Vrije Universiteit Amsterdam

57 Ivano Malavolta / S2 group / Empirical software engineering

Some contents of lecture extracted from: ● Giuseppe Procaccianti’s lectures at VU

Acknowledgements