
p-values: A significant problem in science? - John Carlin


Page 1

P-values: A Significant Problem in Science?

John Carlin
Murdoch Children’s Research Institute & University of Melbourne
Bioinformatics FOAM, 28-Mar-14

Page 2

Page 3

Nature is suddenly concerned about statistics? But wait, that sounds familiar…

Page 4

Outline
• The reproducibility crisis (not just p-values)
• The p-value as the currency of research (‘findings’)
• Tutorial time: what is a p-value anyway?
• A brief history of a significant problem
• Ways forward?

Page 5

The reproducibility crisis

Page 6

The reproducibility crisis
Basic concern: many scientific claims cannot be replicated.
Dominant themes:
• Pressures to publish, pressures to be first/original/novel
  – replication studies have less appeal and are harder to publish
• Significance tests & p-values widely misunderstood and misused (our main topic)
  – “research teams… fall prey to an honest confusion between the sweet signal of a genuine discovery and a freak of the statistical noise” (Economist, 19/10/13)
• Peer review process imperfect (at best)
• Well-documented examples from laboratory science (Begley – Amgen) & psychology

Page 7

Beauty, sex, and power
Gelman (2007), critique of Kanazawa (J. Theor. Biol., 2005–07)
• “Beautiful parents have more daughters”
• “Violent men have more sons”
(& more: “Ten politically incorrect truths about human nature”, Psychology Today, 2007)
• Almost certainly false “findings”, since any reasonable consideration of other studies of similar questions makes even moderate effects highly implausible
• The studies had no power for likely effects, so the findings are almost certainly false positives
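To see how weak the power can be for effects of plausible size, here is a rough two-proportion power calculation. This is a sketch only: the sample size and effect sizes are invented for illustration and are not figures from the studies discussed above.

# Illustrative power calculation for detecting a shift in the proportion of
# daughters, using the normal approximation to the two-sample test of
# proportions. Sample size and effect sizes are assumptions for illustration.
from scipy.stats import norm

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    # Approximate power of a two-sided two-sample z-test for proportions.
    p_bar = (p1 + p2) / 2
    se_null = (2 * p_bar * (1 - p_bar) / n_per_group) ** 0.5
    se_alt = (p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)
    diff = abs(p1 - p2)
    return (norm.cdf((diff - z_crit * se_null) / se_alt)
            + norm.cdf((-diff - z_crit * se_null) / se_alt))

# A "plausible" effect: a 0.3 percentage-point shift in Pr(daughter)
print(power_two_proportions(0.485, 0.488, n_per_group=1500))  # ~0.05: no real power
# The size of effect needed for ~80% power with n = 1500 per group: ~5 points
print(power_two_proportions(0.485, 0.535, n_per_group=1500))  # ~0.78

With power barely above the significance level, essentially any “significant” result for a plausibly small effect is noise.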

Page 8

The p-value as research currency: an everyday example (current issue of Nature)

Page 9

The p-value as research currency: an everyday example (current issue of Nature)

First paragraph of results:
Transcriptional profiling has demonstrated significant changes in the expression of neuronal genes in the prefrontal cortex of ageing humans (refs 9, 10). Analysis of this data set using the Ingenuity Systems IPA platform indicates that the transcription factor most strongly predicted to be activated in the ageing brain is REST (P = 9 × 10⁻¹⁰). Moreover, the 21-base-pair canonical RE1 recognition motif for REST is highly enriched in the age-downregulated gene set (P = 3 × 10⁻⁷) (Fig. 1a).

Page 10

T Lu et al. Nature 000, 1-7 (2014) doi:10.1038/nature13163

Induction of REST in the ageing human prefrontal cortex.

[…] For c and e, values are expressed as fold change relative to the young adult group, and represent the mean ± s.e.m. *P < 0.05, **P < 0.01, ***P < 0.001 by Student’s unpaired t-test…

Page 11

P-value as currency
• If you find a significant p-value, you are more likely to get your research published
  – And in your report you are allowed to say that you “found” X (e.g. “found that Y was higher with drug A than drug B”), implying that this is a factual claim
• If you find non-significant p-values, it is harder to get published…
• Why wouldn’t you try to find significant p-values (if you believe you are “on to something”)?

Page 12

P-hacking
• Also known as data dredging, fishing, etc.
• N.B. p-hacking may be done unconsciously!
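As an illustration of how easily this happens, the simulation below uses an assumed setup (not from the talk): each simulated “study” measures 20 independent, truly null outcomes, and the analyst reports whichever comparison happens to reach P < 0.05.

# Simple p-hacking illustration (assumed setup, not from the talk):
# every outcome is pure noise, yet searching across 20 outcomes per "study"
# produces at least one "significant" result most of the time.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_studies, n_outcomes, n_per_group = 2000, 20, 30

false_positive_studies = 0
for _ in range(n_studies):
    pvals = [ttest_ind(rng.normal(size=n_per_group),
                       rng.normal(size=n_per_group)).pvalue
             for _ in range(n_outcomes)]
    if min(pvals) < 0.05:          # "find" at least one significant outcome
        false_positive_studies += 1

print(false_positive_studies / n_studies)   # ~0.64, close to 1 - 0.95**20

Nothing here requires bad faith: simply analysing many outcomes, subgroups, or model variants and reporting the ones that “work” has the same effect.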

Page 13

The problem of false positives
• The emphasis on “findings” (i.e. rejection of null hypotheses) leads to the plausible claim that a majority of published findings are false (Ioannidis, 2005)
• E.g. we can calculate the frequency of “accepting” & “rejecting” true and false null hypotheses, if:
  – 90% of hypotheses tested are actually true nulls
  – Significance level = 0.05
  – Power = 50%
• Then nearly half of the “significant” findings are false positives…

Page 14

Frequency of “accepting” & “rejecting” true and false null hypotheses
(Sterne & Davey Smith, BMJ 2001)

Result of study     Null hypothesis true           Null hypothesis false        Total
                    (association doesn’t exist)    (association does exist)
“Non-signif”        855                            50                           905
“Significant”       45                             50                           95
Total               900                            100                          1000
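The expected counts in the table follow directly from the three assumptions on the previous slide; here is a minimal sketch of the arithmetic (not part of the original deck):

# Expected counts for 1000 tested hypotheses, reproducing the table above:
# 90% true nulls, significance level 0.05, power 50%.
n_tests, prop_true_null, alpha, power = 1000, 0.90, 0.05, 0.50

true_nulls = n_tests * prop_true_null          # 900
false_nulls = n_tests - true_nulls             # 100

false_positives = true_nulls * alpha           # 45: "significant" but null true
true_positives = false_nulls * power           # 50: "significant" and real

sig_total = false_positives + true_positives   # 95 "significant" results in all
print(false_positives / sig_total)             # ~0.47: nearly half of the
                                               # "significant" findings are false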

Page 15

It can’t happen to me, I understand my P-value!
Let’s see, take an example from a published article (“randomly selected”):
“Occupational Exposure to Extremely Low Frequency Magnetic Fields and Mortality from Cardiovascular Disease”
Håkansson et al., American Journal of Epidemiology (15 Sept 2003)

Quote from the abstract:
“The authors found a low-level increase in AMI [acute myocardial infarction] risk in the highest exposure group (relative risk = 1.3, 95% confidence interval: 0.9, 1.9) and suggestions of an exposure-response relation (p = 0.02).”

Page 16

The results quote a p-value in support of a claim that there may be an exposure-response relationship

• This is actually quite well presented (no mention of significance or implication that they have a “true finding”)

Question

Which of the following is a valid interpretation of the P value?


Page 17

• The probability that the exposure-response relationship is due to chance alone is 0.02.

• The probability that the null hypothesis (i.e. there is no exposure-response relationship) is false is 0.02, i.e. 2%.

• If we did a similar study again, the probability that we would obtain a similar or greater level of association than found in these data, if the null hypothesis (of no exposure-response relationship) is true, is 0.02.

• There is a very low probability (i.e. around 2%) that these results can be explained by chance if there is truly no association.

• The probability that the investigators make a Type I error if they conclude that the association is real is 2%.

• It doesn’t really matter because it’s just a scientific convention that if P < 0.05, then the association is significant.


Page 18

• The probability that the exposure-response relationship is due to chance alone is 0.02.

• The probability that the null hypothesis (i.e. there is no exposure-response relationship) is false is 0.02, i.e. 2%.

• If we did a similar study again, the probability that we would obtain a similar or greater level of association than found in these data, if the null hypothesis (of no exposure-response relationship) is true, is 0.02.

• There is a very low probability (i.e. around 2%) that these results can be explained by chance if there is truly no association.

• The probability that the investigators make a Type I error if they conclude that the association is real is 2%.

• It doesn’t really matter because it’s just a scientific convention that if P < 0.05, then the association is significant.

[The original slide overlays question marks beside some of the options.]

Page 19

Fisher’s interpretation (1920s)

P = Prob(we would obtain a result at least as extreme as the one actually observed, if the null hypothesis were true)

• An index of “surprise”: if P is small, EITHER something surprising has occurred (under the null hypothesis),

• OR the null hypothesis is false, i.e. we should adopt some other theory about the truth.
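One way to make this definition concrete is to approximate a p-value by simulation under the null hypothesis: generate many datasets in which the null holds and count how often the test statistic is at least as extreme as the one observed. A minimal sketch using a label-permutation null; the data values are invented for illustration.

# Monte Carlo illustration of the definition of a p-value: the proportion of
# null-hypothesis datasets whose difference in means is at least as extreme
# as the observed one. The "observed" data below are invented.
import numpy as np

rng = np.random.default_rng(0)
group_a = np.array([5.1, 6.0, 4.8, 5.9, 6.3, 5.5])   # illustrative values only
group_b = np.array([4.2, 4.9, 5.0, 4.4, 4.8, 5.1])
observed_diff = group_a.mean() - group_b.mean()

# Null hypothesis: the group labels are exchangeable (no real difference),
# so shuffle the labels many times and recompute the difference each time.
pooled = np.concatenate([group_a, group_b])
n_sims = 10_000
more_extreme = 0
for _ in range(n_sims):
    rng.shuffle(pooled)
    diff = pooled[:len(group_a)].mean() - pooled[len(group_a):].mean()
    if abs(diff) >= abs(observed_diff):
        more_extreme += 1

print(more_extreme / n_sims)   # approximate two-sided p-value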

Page 20

Neyman & Pearson’s version (1930s)

“…no test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we ensure that, in the long run of experience, we shall not often be wrong.”

Page 21

The historical legacy…
• A mess: widespread misunderstanding and confusion!
• Adoption of “P < 0.05” as a mantra with the word “significant” attached
  – Applying Neyman–Pearson thinking (“accept/reject” the null hypothesis & claim a positive/negative finding) in a context that requires inductive inference (conditional on the data)
  – Leads to over-interpretation in both directions (“significant” and “non-significant”)

Page 22

Genetics: a good-news story?
• Early days of genomics (post-HGP, early 2000s?): “candidate genes” widely sought & “found” to be associated with disease risk
• Many (most?) such findings failed to replicate
• Genome-wide approach
  – Unstructured searching for associations
  – Multiple comparisons on a massive scale
  – Many true null hypotheses
  – Recognition of the need to drastically control “over-calling” (P < 10⁻⁷, not P < 0.05)
  – P-values used for ranking, not for declaring “findings”
  – All apparent associations followed up for replication, pathways, etc.
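The shift to a much stricter threshold is essentially arithmetic on the number of tests. A minimal sketch, assuming a round, illustrative figure of 500,000 independent SNPs (not a number from the talk):

# Why P < 0.05 cannot be used genome-wide: expected number of "significant"
# results among purely null tests at different thresholds.
n_snps = 500_000          # assumed, illustrative number of independent tests

for threshold in (0.05, 1e-7):
    expected_false_positives = n_snps * threshold
    print(f"threshold {threshold:g}: ~{expected_false_positives:g} "
          f"false positives expected if every null is true")

# threshold 0.05 : ~25000 false positives expected
# threshold 1e-07: ~0.05 false positives expected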

Page 23

Ways forward
• Avoid use of “statistically significant”
  – Immediately shifts emphasis away from artificial dichotomisation (whether at 0.05 or anywhere else)
• Change the style of presentation, away from “findings” and towards incremental evidence
  – Think about directions and magnitudes rather than rejecting or accepting
• Embrace more Bayesian inference
• Full disclosure of all data and all data manipulations
• Support and perform pre-registered replication?
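To make the “directions and magnitudes” point concrete, here is a minimal sketch of reporting an estimated difference with a 95% confidence interval, together with a simple Bayesian analogue using a normal prior. All numbers are invented for illustration.

# Reporting an effect as an estimate with uncertainty instead of a verdict.
# All numbers below are invented for illustration.
import numpy as np
from scipy.stats import norm

estimate = 1.8          # estimated difference in means (illustrative)
std_error = 1.0         # its standard error (illustrative)

# Frequentist summary: estimate and 95% confidence interval
lo, hi = estimate + np.array([-1, 1]) * norm.ppf(0.975) * std_error
print(f"difference = {estimate:.1f} (95% CI {lo:.1f} to {hi:.1f})")

# Simple Bayesian analogue: a normal prior centred on no effect and sceptical
# of large effects (prior sd = 2), combined with a normal likelihood.
prior_mean, prior_sd = 0.0, 2.0
post_var = 1 / (1 / prior_sd**2 + 1 / std_error**2)
post_mean = post_var * (prior_mean / prior_sd**2 + estimate / std_error**2)
post_sd = post_var ** 0.5
print(f"posterior mean {post_mean:.2f}, sd {post_sd:.2f}; "
      f"Pr(effect > 0) = {norm.sf(0, loc=post_mean, scale=post_sd):.2f}")

Both summaries keep the magnitude and the uncertainty in view, rather than collapsing the result to “significant” or “non-significant”.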