Can I Have a P-value For That, Please?

Christopher J. Miller

Associate Director, Biostatistics

AstraZeneca, LP

[email protected]

2

Outline

Definitions

Quiz

Hypothesis testing and Power

no math

philosophy

Things that make no sense to me

testing for differences at baseline

post-hoc power calculations

3

Biostatistics

A term which ought to mean “statistics for biology” but is now increasingly reserved for medical statistics.

S. Senn

4

Biostatistician

One who has neither the intellect for mathematics nor the commitment for medicine, but likes to dabble in both.

S. Senn

5

Biometrics

An alternative name for statistics, especially if applied to the life sciences. The advantage of the name compared to statistics is that the general public does not understand what it means, whereas with statistics the general public thinks it understands what it means.

S. Senn

6

Quiz time!

7

A 95% Confidence Interval of (5 to 11) for the population mean implies:

1. The probability that the true mean is between 5 and 11 is 0.95 (95%).

2. Ninety-five percent of the time (for 95% of samples) the interval will include the true mean. Five to 11 is one such interval.

3. Five to 11 covers 95% of the possible values of the true mean.

8



2. 95% of the time (for 95% of samples) the interval will include the true mean. Five to 11 is one such interval.

3. Five to 11 covers 95% of the possible values of the true mean.

9

A p-value < 0.05:

1. Assuming the treatment is not effective, there is less than a 5% chance of obtaining such results.

2. The observed effect from the treatment is so large that there is less than a 5% chance that the treatment truly is no better than placebo.

3. On average, fewer than 5% of placebo-treated patients will do better than active-treated patients.

10

A p-value < 0.05:




3. On average, fewer than 5% of placebo-treated patients will do better than active-treated patients.

11

Thoughts

How many people got both correct?

P-values and confidence intervals are often misinterpreted.

P-values and confidence intervals do not necessarily answer a relevant question.

Misunderstandings lead us to present analyses that are nonsensical.

12

Hypothesis Testing

Question: Is the average effect of active treatment better than that of placebo?

Null Hypothesis: Assume that there is no effect.

Ho: μ_A = μ_P, or μ_A − μ_P = 0

Alternative Hypothesis:

Ha: μ_A > μ_P, or μ_A − μ_P > 0

13

Hypothesis testing (cont’d)

Assume Ho is true (true means equal)

Choose an analysis model and study design

Power study

Run an experiment

Collect data

See if you have enough evidence to reject Ho

Ho not false until proven false

Ho is never proven to be true

“not guilty, until proven guilty”

14

Hypothesis Testing Essentials

Population

Parameters

Probabilities are related to long-run relative frequency of events in a series of trials

15

Essentials: Population

“A largely theoretical concept which refers to a (sometimes infinite or undefined) totality of observations of interest.”

Example: All potential patients who might use a new drug.

16

Essentials: Parameters

Used in conjunction with an underlying population

“A function of the values of this population which define their distribution”

Unobservable and unknowable

Nature, God, Truth

Example: Population mean or variance

When similar functions are calculated from a sample, they are called “statistics”.

17

Essentials: Probabilities and decisions

Parameters cannot have a probability

They are either equal to some value or not

Hypotheses cannot have a probability

They are either true or false

A decision to accept or reject a hypothesis is made indirectly using the probability of the evidence given the hypothesis, rather than vice versa.

Errors in decisions are controlled, on average, based on an assumed series of results.
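The long-run reading of probability on this slide can be made concrete with a small simulation. The sketch below is only an illustration: the true mean, standard deviation, sample size and number of trials are made-up values, and the interval uses a known standard deviation to keep the arithmetic simple. It repeatedly draws samples, builds a 95% interval from each, and counts how often the interval contains the fixed true mean.

```python
import random
import statistics
from statistics import NormalDist


def ci_coverage(true_mean=8.0, sd=3.0, n=30, trials=2000, level=0.95, seed=1):
    """Fraction of simulated samples whose interval contains true_mean.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * sd / n ** 0.5  # known-sd interval keeps the sketch simple
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        # The true mean is fixed; only the interval changes between samples.
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / trials


coverage = ci_coverage()
print(f"Share of intervals covering the true mean: {coverage:.3f}")
```

The 95% describes the procedure across many samples. Any single interval either contains the true mean or it does not.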

18



2. 95% of the time (for 95% of samples) the interval will include the true mean. This is one such interval.

3. Five to eleven covers 95% of the possible values of the true mean.

19

A p-value < 0.05:




3. On average, only 5% of placebo-treated patients will do better than active-treated patients.

Things that make no sense to me #1

21

Baseline differences

You’re reporting on a randomized, parallel-group trial. Active versus placebo.

To your dismay, the groups appear to have been “different” at baseline

Mean (SD): 23 (2.3) versus 32 (2.7)

We need a p-value to tell us “how different” they are!

P<0.05 tells us the study is uninterpretable, right?

22

The test

What is the “deep structure”?

Population?

Parameter of interest?

Long-term process?

Decision rule’s meaning?

Point?

23

Problem

Test appears to say something about the adequacy of the given allocation, whereas it can only be a test of the allocation procedure.

24

What are we testing?

Null Hypothesis: The process of randomization will result in balance across treatment groups.

Population: All possible random assignments of patients to treatment.

25

What are we saying when p<.05?

When comparing 2 drugs after treatment…

…the difference is rather large to be caused by chance alone, therefore chance must not be the whole explanation.

Infer that the drugs have an effect on outcome.

Null hypothesis is not true.

When comparing 2 drugs before treatment…

…the difference is rather large to be caused by chance alone, therefore chance must not be the whole explanation.

Infer that randomization has not taken place??? …fraud???

Type I error??? …inadequate sample???

26

Bottom line

The underlying problem is that randomization is, by definition, a chance mechanism!

So, no matter what the p-value is – unless we are willing to accept tampering as a possibility – we need to conclude that something unusual has happened because of CHANCE alone!
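A short simulation illustrates the point. The sketch below assumes a hypothetical trial with 50 patients per group and made-up baseline values drawn from a single population. Patients are allocated purely at random and a large-sample two-sided z-test is run on the baseline means. No tampering is possible here, so every “significant” baseline difference is a product of chance.

```python
import random
import statistics
from statistics import NormalDist


def baseline_rejection_rate(n_per_group=50, trials=4000, alpha=0.05, seed=7):
    """Share of honestly randomized trials with a 'significant' baseline test.

    Group size, baseline distribution and trial count are hypothetical.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(trials):
        # One population of patients; randomization alone creates the groups.
        patients = [rng.gauss(27.0, 2.5) for _ in range(2 * n_per_group)]
        rng.shuffle(patients)
        active = patients[:n_per_group]
        placebo = patients[n_per_group:]
        diff = statistics.fmean(active) - statistics.fmean(placebo)
        se = (statistics.variance(active) / n_per_group
              + statistics.variance(placebo) / n_per_group) ** 0.5
        if abs(diff / se) > z_crit:
            rejections += 1
    return rejections / trials


rate = baseline_rejection_rate()
print(f"Baseline tests with p < 0.05 under pure randomization: {rate:.3f}")
```

The rejection rate sits near the nominal 5%, which is all the test can report: it describes the allocation procedure, not the adequacy of any one allocation.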

27

Further silliness

Baseline imbalance does not necessarily mean that meaningful treatment inferences cannot be made

P-value for baseline test has no relation to the ability to make valid treatment comparisons at the end of the trial.

28

Solutions

ANCOVA

Answers the question: “If both groups had had average overall baseline values, what treatment difference would we have seen?”

Makes an average allowance for imbalance

Stratification

Allows valid treatment comparison within each stratum.

Need to think of this before the trial if you want to do it correctly.
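The ANCOVA adjustment can be sketched with the classical one-covariate formula: the raw difference in outcome means, minus the pooled within-group slope times the baseline imbalance. The data below are invented so that the arithmetic is exact (baseline means of 23 and 32, a slope of 0.5, and a true treatment effect of 2). The function name and data are hypothetical, not taken from any trial.

```python
import statistics


def ancova_adjusted_difference(base_a, out_a, base_p, out_p):
    """Return (adjusted difference, raw difference, pooled slope).

    Differences are active minus placebo. Assumes one baseline covariate
    and a common within-group slope of outcome on baseline.
    """
    def cross_products(x, y):
        mx, my = statistics.fmean(x), statistics.fmean(y)
        return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

    # Pooled within-group slope of outcome on baseline.
    slope = (cross_products(base_a, out_a) + cross_products(base_p, out_p)) / (
        cross_products(base_a, base_a) + cross_products(base_p, base_p))
    raw = statistics.fmean(out_a) - statistics.fmean(out_p)
    imbalance = statistics.fmean(base_a) - statistics.fmean(base_p)
    return raw - slope * imbalance, raw, slope


# Hypothetical, noise-free data: outcome = 3 + 0.5 * baseline (+ 2 if active).
base_active = [21.0, 22.0, 23.0, 24.0, 25.0]
base_placebo = [30.0, 31.0, 32.0, 33.0, 34.0]
out_active = [3 + 0.5 * b + 2 for b in base_active]
out_placebo = [3 + 0.5 * b for b in base_placebo]

adjusted, raw, slope = ancova_adjusted_difference(
    base_active, out_active, base_placebo, out_placebo)
print(f"raw difference: {raw:.2f}, adjusted difference: {adjusted:.2f}")
```

In this made-up example the raw difference is negative only because the active group started lower. The adjusted difference recovers the effect that was built into the data.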

29

In short…

The fact that baseline tests are commonly performed without much apparent harm is no more of a defense than saying of the policy of treating viruses with antibiotics that most patients recover.

S. Senn

30

Power

31

Power

Systems are subject to random variation

otherwise, why would we experiment?

our lives would be simple without it

We try to see through the random variation (noise) and determine the true effect (signal)

32

Power (cont’d)

How? Well-planned, adequately-powered experiments

Loose definition of power: “The probability that a statistically significant difference will be found when the null hypothesis is false (i.e., when the treatments truly are not equal).”

33

What Determines Power?

Hypothesis and model

Sample size

Variability among observations

What risk are you willing to take of wrongly rejecting Ho?

How small of a difference among treatments do you need to detect?

34

Calculating Power

Determine variable of primary interest

mean change from baseline in symptoms

Determine comparison of primary interest and null hypothesis

assume mean active is the same as placebo

Determine analysis method

ANCOVA

35

Calculating Power (cont’d)

Get an estimate of population variability among experimental units (Sigma)

literature

pilot/previous trials

can be a joke

Determine smallest difference between treatments you would like to detect (Delta)

often a joke

36

Clinically relevant difference

A somewhat nebulous concept with various conventions used by statisticians in their power calculations and incidentally, therefore, a means by which they drive their medical colleagues to distraction. This is used in the theory of clinical trials, as opposed to the cynically relevant difference, which is used in the practice.

S. Senn

37


Determine risk you’re willing to take of wrongly rejecting Ho

Type I error (α)

Decide there’s an effect when there really isn’t one

“false conviction”

set low at 5%, but arbitrary

38

Sample size (n) and Power are the only elements left!

[Figure: power (%) versus sample size per group (n), with n from 50 to 250 and power from 50% to 100%.]

39

Summary of power

Power is a function of:

hypothesis being tested

statistical model

sample size

assumed variability of population

risk you’re willing to take

minimum “relevant effect size”

No guarantees

40

Working definition

Power is the probability of a possible outcome of a potential decision conditional upon an imaginable circumstance given a conceivable value of an algebraic embodiment of an abstract mathematical idea and the strict adherence to an extremely precise rule.

S. Senn

Things that make no sense to me #2

42

Post-hoc power calculations

Suppose we’ve run a well-designed and adequately-powered study that “fails”

“fails” usually means p>0.05.

We need an excuse.

43


Obviously, the study was underpowered!

assume that the variability was larger than anticipated

the sample size was therefore too small

all other assumptions were fine

What was the “actual power” of this wimpy study?

So, you see, the drug probably does work!…I am just a terrible scientist.

44


How do you pick which assumptions were correct/incorrect when recalculating power?

Arbitrary

Ridiculous to do based on the results of 1 study

A view that I support

“The power of a trial is a useful concept when planning the trial but has little relevance to the interpretation of its results.” (S. Senn)
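The arbitrariness is easy to show numerically. The sketch below uses a one-sided two-sample z-test approximation and entirely hypothetical planning values (100 patients per group, Delta = 3.5, Sigma = 10). It recomputes the “actual power” twice, each time blaming a different planning assumption, and the two stories give different answers.

```python
from statistics import NormalDist


def power_one_sided(n_per_group, delta, sigma, alpha=0.05):
    """Approximate power of a one-sided two-sample z-test with known sigma."""
    z = NormalDist()
    se = sigma * (2 / n_per_group) ** 0.5
    return z.cdf(delta / se - z.inv_cdf(1 - alpha))


# Hypothetical planning assumptions for the original design.
planned = power_one_sided(100, delta=3.5, sigma=10.0)

# Story 1: "variability was larger than anticipated" (Sigma revised to 14).
blame_sigma = power_one_sided(100, delta=3.5, sigma=14.0)

# Story 2: "the true effect was smaller than hoped" (Delta revised to 2).
blame_delta = power_one_sided(100, delta=2.0, sigma=10.0)

print(f"planned power:          {100 * planned:.0f}%")
print(f"recalculated (Sigma):   {100 * blame_sigma:.0f}%")
print(f"recalculated (Delta):   {100 * blame_delta:.0f}%")
```

Nothing in a single study’s results says which assumption to revise, so the recalculated figure reflects the analyst’s choice of excuse rather than a property of the trial.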

45

Conclusion

…Be careful!

46

References

Lang T, Secic M. How to Report Statistics in Medicine, 1997.

Senn S. Statistical Issues in Drug Development, 1997.
