Page 1: Can I Have a P-value For That, Please? Christopher J. Miller Associate Director, Biostatistics AstraZeneca, LP chris.miller@astrazeneca.com

Can I Have a P-value For That, Please?

Christopher J. Miller

Associate Director, Biostatistics

AstraZeneca, LP

chris.miller@astrazeneca.com

Outline

Definitions
Quiz
Hypothesis testing and Power
    no math
    philosophy
Things that make no sense to me
    testing for differences at baseline
    post-hoc power calculations

Biostatistics

A term which ought to mean “statistics for biology” but is now increasingly reserved for medical statistics.

S. Senn

Biostatistician

One who has neither the intellect for mathematics nor the commitment for medicine, but likes to dabble in both.

S. Senn

Biometrics

An alternative name for statistics, especially if applied to the life sciences. The advantage of the name compared to statistics is that the general public does not understand what it means, whereas with statistics the general public thinks it understands what it means.

S. Senn

Quiz time!

A 95% Confidence Interval of (5 to 11) for the population mean implies:

1. The probability that the true mean is between 5 and 11 is 0.95 (95%).

2. Ninety-five percent of the time (for 95% of samples) the interval will include the true mean. Five to 11 is one such interval.

3. Five to 11 covers 95% of the possible values of the true mean.

A 95% Confidence Interval of (5 to 11) for the population mean implies:

1. The probability that the true mean is between 5 and 11 is 0.95 (95%).

2. 95% of the time (for 95% of samples) the interval will include the true mean. Five to 11 is one such interval.

3. Five to 11 covers 95% of the possible values of the true mean.

Answer: 2.
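Statement 2 in the quiz describes a long-run property of the interval procedure, and that property can be checked by simulation. The sketch below is illustrative only: the true mean, the standard deviation, the sample size, and the function name `ci_coverage` are all assumptions, and the interval uses a known standard deviation so that only the Python standard library is needed.

```python
import random
from statistics import NormalDist

def ci_coverage(true_mean=8.0, sigma=3.0, n=25, level=0.95, trials=5000, seed=1):
    """Fraction of simulated intervals that contain the (assumed) true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)      # about 1.96 for a 95% interval
    half_width = z * sigma / n ** 0.5              # known-sigma interval, for simplicity
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials

coverage = ci_coverage()
print(f"Intervals that include the true mean: {coverage:.1%}")
```

The true mean never moves in this simulation; only the interval does. Any single interval either contains the true mean or it does not, which is why the 95% attaches to the procedure rather than to one interval.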

A p-value < 0.05:

1. Assuming the treatment is not effective, there is less than a 5% chance of obtaining such results.

2. The observed effect from the treatment is so large that there is less than a 5% chance that the treatment truly is no better than placebo.

3. On average, fewer than 5% of placebo-treated patients will do better than active-treated patients.

A p-value < 0.05:

1. Assuming the treatment is not effective, there is less than a 5% chance of obtaining such results.

2. The observed effect from the treatment is so large that there is less than a 5% chance that the treatment truly is no better than placebo.

3. On average, fewer than 5% of placebo-treated patients will do better than active-treated patients.

Answer: 1.
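Statement 1 reads the p-value as the probability of results at least this extreme, computed under the assumption of no effect. A minimal sketch of that reading follows. The observed difference, its standard error, and both function names are hypothetical, and a normal sampling distribution is assumed.

```python
import random
from statistics import NormalDist

def one_sided_p(observed_diff, se):
    """P(difference >= observed_diff) when the true difference is zero."""
    return 1 - NormalDist(0, se).cdf(observed_diff)

def simulated_p(observed_diff, se, trials=100_000, seed=1):
    """The same probability, estimated by replaying the trial under 'no effect'."""
    rng = random.Random(seed)
    as_extreme = sum(rng.gauss(0, se) >= observed_diff for _ in range(trials))
    return as_extreme / trials

# Hypothetical result: active beat placebo by 2.0 units, standard error 1.0.
p_exact = one_sided_p(observed_diff=2.0, se=1.0)
p_sim = simulated_p(observed_diff=2.0, se=1.0)
print(f"analytic p = {p_exact:.4f}, simulated p = {p_sim:.4f}")
```

Both numbers condition on the null hypothesis being true. Neither is the probability that the treatment is no better than placebo.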

Thoughts

How many people got both correct?

P-values and confidence intervals are often misinterpreted.
P-values and confidence intervals do not necessarily answer a relevant question.
Misunderstandings lead us to present analyses that are nonsensical.

Hypothesis Testing

Question: Is the average effect of active treatment better than that of placebo?

Null Hypothesis: Assume that there is no effect.
Ho: μ_A = μ_P, or μ_A - μ_P = 0

Alternative Hypothesis
Ha: μ_A > μ_P, or μ_A - μ_P > 0
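As a rough illustration of testing the null hypothesis against the one-sided alternative above, the sketch below compares two group means with a normal approximation. The data are invented and the function name `one_sided_z_test` is hypothetical. With samples this small a t distribution would usually be preferred, so treat this as a sketch of the logic rather than a recommended analysis.

```python
from statistics import NormalDist, mean, stdev

def one_sided_z_test(active, placebo, alpha=0.05):
    """Test 'no difference in means' against 'active mean > placebo mean'."""
    diff = mean(active) - mean(placebo)
    se = (stdev(active) ** 2 / len(active) + stdev(placebo) ** 2 / len(placebo)) ** 0.5
    z = diff / se
    p = 1 - NormalDist().cdf(z)    # probability of a z this large if the null holds
    return diff, z, p, p < alpha

# Invented symptom-improvement scores, for illustration only.
active = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.5, 6.8, 6.1, 5.7]
placebo = [4.9, 5.2, 4.4, 5.8, 5.0, 5.6, 4.7, 5.3, 5.1, 4.6]

diff, z, p, reject = one_sided_z_test(active, placebo)
print(f"difference = {diff:.2f}, z = {z:.2f}, p = {p:.4f}, reject null: {reject}")
```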

Hypothesis testing (cont’d)

Assume Ho is true (true means equal)
Choose an analysis model and study design
Power study
Run an experiment
Collect data

See if you have enough evidence to reject Ho

Ho not false until proven false
Ho is never proven to be true
“not guilty, until proven guilty”

Hypothesis Testing Essentials

Population

Parameters

Probabilities are related to long-run relative frequency of events in a series of trials

Essentials: Population

“A largely theoretical concept which refers to a (sometimes infinite or undefined) totality of observations of interest.”

Example: All potential patients who might use a new drug.

Essentials: Parameters

Used in conjunction with an underlying population
“A function of the values of this population which define their distribution”
Unobservable and unknowable
    Nature, God, Truth

Example: Population mean or variance
When similar functions are calculated from a sample, they are called “statistics”.

Essentials: Probabilities and decisions

Parameters cannot have a probability
    They are either equal to some value or not

Hypotheses cannot have a probability
    They are either true or false

A decision to accept or reject a hypothesis is made indirectly using the probability of the evidence given the hypothesis, rather than vice versa.

Errors in decisions are controlled, on average, based on an assumed series of results.

A 95% Confidence Interval of (5 to 11) for the population mean implies:

1. The probability that the true mean is between 5 and 11 is 0.95 (95%).

2. 95% of the time (for 95% of samples) the interval will include the true mean. This is one such interval.

3. Five to eleven covers 95% of the possible values of the true mean.

A p-value < 0.05:

1. Assuming the treatment is not effective, there is less than a 5% chance of obtaining such results.

2. The observed effect from the treatment is so large that there is less than a 5% chance that the treatment truly is no better than placebo.

3. On average, only 5% of placebo-treated patients will do better than active-treated patients.

Things that make no sense to me #1

Baseline differences

You’re reporting on a randomized, parallel-group trial. Active versus placebo.

To your dismay, the groups appear to have been “different” at baseline.

Mean (SD): 23 (2.3) versus 32 (2.7)

We need a p-value to tell us “how different” they are!

P<0.05 tells us the study is uninterpretable, right?

The test

What is the “deep structure”?
    Population?
    Parameter of interest?
    Long-term process?
    Decision rule’s meaning?
    Point?

Problem

Test appears to say something about the adequacy of the given allocation, whereas it can only be a test of the allocation procedure.

What are we testing?

Null Hypothesis
    The process of randomization will result in balance across treatment groups.

Population
    All possible random assignments of patients to treatment.

What are we saying when p<.05?

When comparing 2 drugs after treatment…
    …the difference is rather large to be caused by chance alone, therefore chance must not be the whole explanation.
    Infer that the drugs have an effect on outcome.
    Null hypothesis is not true.

When comparing 2 drugs before treatment…
    …the difference is rather large to be caused by chance alone, therefore chance must not be the whole explanation.
    Infer that randomization has not taken place??? …fraud???
    Type I error??? …inadequate sample???

Bottom line

The underlying problem is that randomization is, by definition, a chance mechanism!

So, no matter what the p-value is – unless we are willing to accept tampering as a possibility – we need to conclude that something unusual has happened because of CHANCE alone!
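The point that randomization is itself a chance mechanism can be seen by re-randomizing one fixed pool of patients many times and running a baseline test on each allocation. In the sketch below the patient pool, the group size, and both function names are hypothetical, and the test uses a normal approximation. Every allocation is a genuine randomization, yet roughly 5% of them are flagged at the 0.05 level.

```python
import random
from statistics import NormalDist, mean, stdev

def baseline_p_value(group_a, group_b):
    """Two-sided p-value for a difference in baseline means (normal approximation)."""
    se = (stdev(group_a) ** 2 / len(group_a) + stdev(group_b) ** 2 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def fraction_flagged(n_per_group=100, randomizations=2000, alpha=0.05, seed=1):
    """Share of honest randomizations whose baseline test gives p < alpha."""
    rng = random.Random(seed)
    # One fixed, hypothetical pool of baseline values; only the allocation varies.
    patients = [rng.gauss(28, 5) for _ in range(2 * n_per_group)]
    flagged = 0
    for _ in range(randomizations):
        rng.shuffle(patients)
        group_a, group_b = patients[:n_per_group], patients[n_per_group:]
        if baseline_p_value(group_a, group_b) < alpha:
            flagged += 1
    return flagged / randomizations

flagged = fraction_flagged()
print(f"Randomizations with baseline p < 0.05: {flagged:.1%}")
```

A small p-value here says nothing new about the allocation in hand. The null hypothesis being tested is known to be true whenever randomization was carried out properly.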

Further silliness

Baseline imbalance does not necessarily mean that meaningful treatment inferences cannot be made

P-value for baseline test has no relation to the ability to make valid treatment comparisons at the end of the trial.

Solutions

ANCOVA
    Answers the question: “If both groups had had average overall baseline values, what treatment difference would we have seen?”
    Makes an average allowance for imbalance

Stratification
    Allows valid treatment comparison within each stratum.
    Need to think of this before the trial if you want to do it correctly.
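The ANCOVA adjustment can be sketched with a common-slope model: estimate the pooled within-group slope of outcome on baseline, then subtract the slope times the baseline imbalance from the raw difference in outcome means. Everything below is an illustrative assumption, including the simulated data, the true effect of 3, the slope of 0.8, and the function name. The baseline means and SDs are borrowed from the earlier example (23 (2.3) versus 32 (2.7)).

```python
import random
from statistics import mean

def ancova_adjusted_difference(base_a, out_a, base_p, out_p):
    """Treatment difference adjusted for baseline, assuming a common slope."""
    def centered_sums(x, y):
        mx, my = mean(x), mean(y)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        sxx = sum((xi - mx) ** 2 for xi in x)
        return sxy, sxx

    sxy_a, sxx_a = centered_sums(base_a, out_a)
    sxy_p, sxx_p = centered_sums(base_p, out_p)
    slope = (sxy_a + sxy_p) / (sxx_a + sxx_p)          # pooled within-group slope
    raw = mean(out_a) - mean(out_p)                     # unadjusted difference
    adjusted = raw - slope * (mean(base_a) - mean(base_p))
    return raw, adjusted, slope

# Simulated, hypothetical trial: outcome = 0.8 * baseline + effect + noise.
rng = random.Random(1)
n, true_effect = 200, 3.0
base_a = [rng.gauss(23, 2.3) for _ in range(n)]
base_p = [rng.gauss(32, 2.7) for _ in range(n)]
out_a = [0.8 * b + true_effect + rng.gauss(0, 1) for b in base_a]
out_p = [0.8 * b + rng.gauss(0, 1) for b in base_p]

raw, adjusted, slope = ancova_adjusted_difference(base_a, out_a, base_p, out_p)
print(f"raw difference = {raw:.2f}, adjusted difference = {adjusted:.2f}, slope = {slope:.2f}")
```

In this made-up setting the raw difference has the wrong sign because of the baseline imbalance, while the adjusted difference lands near the effect built into the simulation.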

In short…

The fact that baseline tests are commonly performed without much apparent harm is no more of a defense than saying of the policy of treating viruses with antibiotics that most patients recover.

S. Senn

Power

Power

Systems are subject to random variation
    otherwise, why would we experiment?
    our lives would be simple without it

We try to see through the random variation (noise) and determine the true effect (signal)

Power (cont’d)

How? Well-planned, adequately-powered experiments

Loose definition of power: “The probability that a statistically significant difference will be found when the null hypothesis is false (i.e., when the treatments truly are not equal).”

What Determines Power?

Hypothesis and model

Sample size

Variability among observations

What risk are you willing to take of wrongly rejecting Ho?

How small a difference among treatments do you need to detect?

Calculating Power

Determine variable of primary interest
    mean change from baseline in symptoms

Determine comparison of primary interest and null hypothesis
    assume mean active is the same as placebo

Determine analysis method
    ANCOVA

Calculating Power (cont’d)

Get an estimate of population variability among experimental units (Sigma)
    literature
    pilot/previous trials
    can be a joke

Determine smallest difference between treatments you would like to detect (Delta)
    often a joke

Clinically relevant difference

A somewhat nebulous concept with various conventions used by statisticians in their power calculations and incidentally, therefore, a means by which they drive their medical colleagues to distraction. This is used in the theory of clinical trials, as opposed to the cynically relevant difference, which is used in the practice.

S. Senn

Calculating Power (cont’d)

Determine risk you’re willing to take of wrongly rejecting Ho

Type I error (α)
    Decide there’s an effect when there really isn’t one
    “false conviction”
    set low at 5%, but arbitrary

Calculating Power (cont’d)

Sample size (n) and Power are the only elements left!

[Figure: Power (%), axis from 50 to 100, plotted against Sample Size per Group (n), axis from 50 to 250.]
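Once Sigma, Delta, and the type I error rate are fixed, power becomes a function of sample size alone. A minimal sketch of that relationship for a two-group comparison of means follows, using a normal approximation and a two-sided test. The values Delta = 3 and Sigma = 10 are placeholders rather than figures from the talk, and `power_two_group` is a hypothetical helper.

```python
from statistics import NormalDist

def power_two_group(n_per_group, delta, sigma, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means."""
    std_normal = NormalDist()
    se = sigma * (2 / n_per_group) ** 0.5           # standard error of the difference
    z_crit = std_normal.inv_cdf(1 - alpha / 2)      # about 1.96 for alpha = 0.05
    # Ignores the negligible chance of rejecting in the wrong direction.
    return std_normal.cdf(delta / se - z_crit)

curve = {n: power_two_group(n, delta=3.0, sigma=10.0) for n in range(50, 251, 50)}
for n, pw in curve.items():
    print(f"n = {n:3d} per group -> power {pw:.0%}")
```

Power rises with n but with diminishing returns, which is the general shape a power curve of this kind takes.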

Summary of power

Power is a function of:
    hypothesis being tested
    statistical model
    sample size
    assumed variability of population
    risk you’re willing to take
    minimum “relevant effect size”

No guarantees
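The same ingredients can be turned around to give a sample size for a target power. The sketch below uses the usual normal-approximation formula n = 2 (Sigma (z_alpha + z_power) / Delta)^2 per group. The inputs are hypothetical, and the second call shows how sensitive the answer is to the assumed variability, which is one reason there are no guarantees.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-group comparison of means."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return math.ceil(2 * (sigma * (z_alpha + z_power) / delta) ** 2)

planned = n_per_group(delta=5.0, sigma=10.0)       # assumed Sigma
if_noisier = n_per_group(delta=5.0, sigma=12.0)    # Sigma turns out 20% larger
print(f"planned n per group: {planned}; needed if Sigma is 12: {if_noisier}")
```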

Working definition

Power is the probability of a possible outcome of a potential decision conditional upon an imaginable circumstance given a conceivable value of an algebraic embodiment of an abstract mathematical idea and the strict adherence to an extremely precise rule.

S. Senn

Things that make no sense to me #2

Post-hoc power calculations

Suppose we’ve run a well-designed and adequately-powered study that “fails”

“fails” usually means p>0.05.

We need an excuse.

Post-hoc power calculations

Obviously, the study was underpowered!
    assume that the variability was larger than anticipated
    the sample size was therefore too small
    all other assumptions were fine

What was the “actual power” of this wimpy study?

So, you see, the drug probably does work! …I am just a terrible scientist.

Post-hoc power calculations

How do you pick which assumptions were correct/incorrect when recalculating power?

Arbitrary

Ridiculous to do based on the results of 1 study

A view that I support
“The power of a trial is a useful concept when planning the trial but has little relevance to the interpretation of its results.” (S. Senn)
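The arbitrariness is easy to demonstrate: recomputing power after the fact gives a different “actual power” depending on which planning assumption is revised. The sketch below uses the same normal approximation for a two-sided, two-group comparison of means. The planned and observed values are invented for illustration, and `power_two_group` is a hypothetical helper.

```python
from statistics import NormalDist

def power_two_group(n_per_group, delta, sigma, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means."""
    std_normal = NormalDist()
    se = sigma * (2 / n_per_group) ** 0.5
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(delta / se - z_crit)

n = 63                                  # hypothetical planned sample size per group
planned_delta, planned_sigma = 5.0, 10.0
observed_delta, observed_sigma = 2.0, 14.0

as_planned = power_two_group(n, planned_delta, planned_sigma)
blame_variability = power_two_group(n, planned_delta, observed_sigma)
blame_everything = power_two_group(n, observed_delta, observed_sigma)

print(f"as planned:                   {as_planned:.0%}")
print(f"revise Sigma only:            {blame_variability:.0%}")
print(f"revise both Sigma and Delta:  {blame_everything:.0%}")
```

Three defensible-sounding choices give three different answers from the same single study, and nothing in the data says which assumption to revise.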

Conclusion

…Be careful!

References

Lang T, Secic M. How to Report Statistics in Medicine, 1997.

Senn S. Statistical Issues in Drug Development, 1997.