Can I Have a P-value For That, Please?
Christopher J. Miller
Associate Director, Biostatistics
AstraZeneca, LP
2
Outline
Definitions
Quiz
Hypothesis testing and Power
  no math
  philosophy
Things that make no sense to me
  testing for differences at baseline
  post-hoc power calculations
3
Biostatistics
A term which ought to mean “statistics for biology” but is now increasingly reserved for medical statistics.
S. Senn
4
Biostatistician
One who has neither the intellect for mathematics nor the commitment for medicine, but likes to dabble in both.
S. Senn
5
Biometrics
An alternative name for statistics, especially if applied to the life sciences. The advantage of the name compared to statistics is that the general public does not understand what it means, whereas with statistics the general public thinks it understands what it means.
S. Senn
6
Quiz time!
7
A 95% Confidence Interval of (5 to 11) for the population mean implies:
1. The probability that the true mean is between 5 and 11 is 0.95 (95%).
2. Ninety-five percent of the time (for 95% of samples) the interval will include the true mean. Five to 11 is one such interval.
3. Five to 11 covers 95% of the possible values of the true mean.
9
A p-value < 0.05:
1. Assuming the treatment is not effective, there is less than a 5% chance of obtaining such results.
2. The observed effect from the treatment is so large that there is less than a 5% chance that the treatment truly is no better than placebo.
3. On average, fewer than 5% of placebo-treated patients will do better than active-treated patients.
11
Thoughts
How many people got both correct?
P-values and confidence intervals are often misinterpreted.
P-values and confidence intervals do not necessarily answer a relevant question.
Misunderstandings lead us to present analyses that are nonsensical.
12
Hypothesis Testing
Question: Is the average effect of active treatment better than that of placebo?
Null Hypothesis: Assume that there is no effect.
  Ho: A = P, or A - P = 0
Alternative Hypothesis:
  Ha: A > P, or A - P > 0
13
Hypothesis testing (cont’d)
Assume Ho is true (true means equal)
Choose an analysis model and study design
Power study
Run an experiment
Collect data
See if you have enough evidence to reject Ho
  Ho not false until proven false
  Ho is never proven to be true
  “not guilty, until proven guilty”
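The steps above can be sketched in a few lines. This is only an illustration: the data are made up, and the large-sample z-test is an assumption chosen to keep the example dependency-free (a real analysis would follow the pre-specified model, e.g. a t-test or ANCOVA). The function name `one_sided_test` is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def one_sided_test(active, placebo, alpha=0.05):
    """Test Ho: A - P = 0 against Ha: A - P > 0 (large-sample z approximation)."""
    diff = mean(active) - mean(placebo)
    # Standard error of the difference in sample means
    se = math.sqrt(stdev(active) ** 2 / len(active)
                   + stdev(placebo) ** 2 / len(placebo))
    z = diff / se
    # Probability of a difference at least this large, assuming Ho is true
    p_value = 1.0 - NormalDist().cdf(z)
    return diff, p_value, p_value < alpha

# Made-up illustrative outcomes (not from any trial)
placebo = [4.1, 5.0, 3.8, 4.6, 5.2, 4.4, 4.9, 4.0]
active = [5.3, 6.1, 4.9, 5.8, 6.4, 5.5, 6.0, 5.2]

diff, p_value, reject = one_sided_test(active, placebo)
print(diff, p_value, reject)
```

Note that the function only ever returns "reject Ho" or "fail to reject Ho"; it never declares Ho proven.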
14
Hypothesis Testing Essentials
Population
Parameters
Probabilities are related to long-run relative frequency of events in a series of trials
15
Essentials: Population
“A largely theoretical concept which refers to a (sometimes infinite or undefined) totality of observations of interest.”
Example: All potential patients who might use a new drug.
16
Essentials: Parameters
Used in conjunction with an underlying population
“A function of the values of this population which define their distribution”
Unobservable and unknowable
  Nature, God, Truth
Example: Population mean or variance
When similar functions are calculated from a sample, they are called “statistics”.
17
Essentials: Probabilities and decisions
Parameters cannot have a probability
  They are either equal to some value or not
Hypotheses cannot have a probability
  They are either true or false
A decision to accept or reject a hypothesis is made indirectly using the probability of the evidence given the hypothesis, rather than vice versa.
Errors in decisions are controlled, on average, based on an assumed series of results.
18
A 95% Confidence Interval of (5 to 11) for the population mean implies:
1. The probability that the true mean is between 5 and 11 is 0.95 (95%).
2. 95% of the time (for 95% of samples) the interval will include the true mean. This is one such interval.
3. Five to eleven covers 95% of the possible values of the true mean.
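The long-run reading of a confidence interval can be checked by simulation. The sketch below assumes a normal population with a hypothetical true mean of 8 and SD of 3, and uses a z-based interval for simplicity; the function name `coverage` is made up for this example.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def coverage(true_mean=8.0, sigma=3.0, n=50, trials=2000, seed=1):
    """Fraction of simulated 95% intervals that contain the fixed true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)  # about 1.96
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m = mean(sample)
        half_width = z * stdev(sample) / math.sqrt(n)
        # The true mean never moves; only the interval changes between samples
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / trials

print(coverage())  # expected to be close to 0.95
```

The probability statement belongs to the procedure that generates intervals, not to any single interval such as (5 to 11).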
19
A p-value < 0.05:
1. Assuming the treatment is not effective, there is less than a 5% chance of obtaining such results.
2. The observed effect from the treatment is so large that there is less than a 5% chance that the treatment truly is no better than placebo.
3. On average, only 5% of placebo-treated patients will do better than active-treated patients.
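A p-value can be approximated directly from its definition: simulate many trials in which the treatment truly has no effect, and count how often the result is at least as extreme as the one observed. The sketch below assumes a known SD and a normally distributed difference in means; all numbers (SD 10, 100 per group, observed difference 2.5) are hypothetical.

```python
import math
import random
from statistics import NormalDist

def simulated_p_value(observed_diff, sigma, n, trials=20000, seed=2):
    """One-sided p-value: share of null-hypothesis trials with a difference
    at least as large as the observed one."""
    rng = random.Random(seed)
    se = sigma * math.sqrt(2.0 / n)  # SE of the difference of two means
    # Under Ho the difference in sample means is Normal(0, se)
    extreme = sum(1 for _ in range(trials) if rng.gauss(0.0, se) >= observed_diff)
    return extreme / trials

sigma, n, observed_diff = 10.0, 100, 2.5
p_sim = simulated_p_value(observed_diff, sigma, n)
p_exact = 1.0 - NormalDist().cdf(observed_diff / (sigma * math.sqrt(2.0 / n)))
print(p_sim, p_exact)
```

Both numbers describe the probability of the evidence given the hypothesis. Neither is the probability that the treatment is no better than placebo.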
Things that make no sense to me #1
21
Baseline differences
You’re reporting on a randomized, parallel-group trial. Active versus placebo.
To your dismay, the groups appear to have been “different” at baseline.
Mean (SD): 23 (2.3) versus 32 (2.7)
We need a p-value to tell us “how different” they are!
P<0.05 tells us the study is uninterpretable, right?
22
The test
What is the “deep structure”?
Population?
Parameter of interest?
Long-term process?
Decision rule’s meaning?
Point?
23
Problem
Test appears to say something about the adequacy of the given allocation, whereas it can only be a test of the allocation procedure.
24
What are we testing?
Null Hypothesis: The process of randomization will result in balance across treatment groups.
Population: All possible random assignments of patients to treatment.
25
What are we saying when p<.05?
When comparing 2 drugs after treatment…
  the difference is rather large to be caused by chance alone, therefore chance must not be the whole explanation.
  Infer that the drugs have an effect on outcome.
  Null hypothesis is not true.
When comparing 2 drugs before treatment…
  the difference is rather large to be caused by chance alone, therefore chance must not be the whole explanation.
  Infer that randomization has not taken place???
  …fraud???
  Type I error???
  …inadequate sample???
26
Bottom line
The underlying problem is that randomization is, by definition, a chance mechanism!
So, no matter what the p-value is – unless we are willing to accept tampering as a possibility – we need to conclude that something unusual has happened because of CHANCE alone!
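The point can be illustrated by re-randomizing a fixed pool of patients many times and testing baseline each time. The sketch below is an assumption-laden toy: the pool of 100 baseline values (mean 27, SD 2.5) is invented, and a two-sided z-test stands in for whatever baseline test one might run.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def baseline_rejection_rate(n_patients=100, trials=2000, alpha=0.05, seed=3):
    """Share of honest randomizations that still give a baseline p < alpha."""
    rng = random.Random(seed)
    norm = NormalDist()
    # One fixed pool of hypothetical baseline values
    baseline = [rng.gauss(27.0, 2.5) for _ in range(n_patients)]
    half = n_patients // 2
    significant = 0
    for _ in range(trials):
        rng.shuffle(baseline)  # a fresh, genuinely random allocation
        active, placebo = baseline[:half], baseline[half:]
        se = math.sqrt(stdev(active) ** 2 / len(active)
                       + stdev(placebo) ** 2 / len(placebo))
        z = (mean(active) - mean(placebo)) / se
        p_value = 2.0 * (1.0 - norm.cdf(abs(z)))
        if p_value < alpha:
            significant += 1
    return significant / trials

print(baseline_rejection_rate())  # expected to be near alpha
```

Roughly one allocation in twenty comes out "significant" even though the randomization was carried out correctly every time, so the test says nothing about any particular allocation.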
27
Further silliness
Baseline imbalance does not necessarily mean that meaningful treatment inferences cannot be made
P-value for baseline test has no relation to the ability to make valid treatment comparisons at the end of the trial.
28
Solutions
ANCOVA
  Answers the question: “If both groups had had average overall baseline values, what treatment difference would we have seen?”
  Makes an average allowance for imbalance
Stratification
  Allows valid treatment comparison within each stratum.
  Need to think of this before the trial if you want to do it correctly.
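For the ANCOVA option, a minimal sketch of the adjustment is given here, assuming a single baseline covariate and a common slope in both groups. The adjusted difference is the raw difference in outcome means minus the pooled within-group slope times the baseline imbalance. The data are contrived (outcome = 2 + 0.5 × baseline, plus 3 for active, with no noise) so the arithmetic can be followed by hand; the function name is hypothetical.

```python
from statistics import mean

def ancova_adjusted_difference(base_a, out_a, base_p, out_p):
    """Return (raw difference, baseline-adjusted difference, pooled slope)."""
    xa, ya = mean(base_a), mean(out_a)
    xp, yp = mean(base_p), mean(out_p)
    # Pooled within-group sums of cross-products and squares
    sxy = (sum((x - xa) * (y - ya) for x, y in zip(base_a, out_a))
           + sum((x - xp) * (y - yp) for x, y in zip(base_p, out_p)))
    sxx = (sum((x - xa) ** 2 for x in base_a)
           + sum((x - xp) ** 2 for x in base_p))
    slope = sxy / sxx
    raw = ya - yp
    # Remove the part of the raw difference explained by baseline imbalance
    adjusted = raw - slope * (xa - xp)
    return raw, adjusted, slope

# Contrived data: active group starts lower at baseline (mean 23 vs 31)
base_a = [20.0, 22.0, 24.0, 26.0]
out_a = [15.0, 16.0, 17.0, 18.0]
base_p = [28.0, 30.0, 32.0, 34.0]
out_p = [16.0, 17.0, 18.0, 19.0]

raw, adjusted, slope = ancova_adjusted_difference(base_a, out_a, base_p, out_p)
print(raw, adjusted, slope)  # raw = -1.0, adjusted = 3.0, slope = 0.5
```

In this toy case the raw comparison makes placebo look better, while the adjusted comparison recovers the built-in treatment effect of 3.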
29
In short…
The fact that baseline tests are commonly performed without much apparent harm is no more of a defense than saying of the policy of treating viruses with antibiotics that most patients recover.
S. Senn
30
Power
31
Power
Systems are subject to random variation
otherwise, why would we experiment?
our lives would be simple without it
We try to see through the random variation (noise) and determine the true effect (signal)
32
Power (cont’d)
How? Well-planned, adequately-powered experiments
Loose definition of power: “The probability that a statistically significant difference will be found when the null hypothesis is false (i.e., when the treatments truly are not equal).”
33
What Determines Power?
Hypothesis and model
Sample size
Variability among observations
What risk are you willing to take of wrongly rejecting Ho?
How small of a difference among treatments do you need to detect?
34
Calculating Power
Determine variable of primary interest
  mean change from baseline in symptoms
Determine comparison of primary interest and null hypothesis
  assume mean active is the same as placebo
Determine analysis method
  ANCOVA
35
Calculating Power (cont’d)
Get an estimate of population variability among experimental units (Sigma)
literature
pilot/previous trials
can be a joke
Determine smallest difference between treatments you would like to detect (Delta)
often a joke
36
Clinically relevant difference
A somewhat nebulous concept with various conventions used by statisticians in their power calculations and incidentally, therefore, a means by which they drive their medical colleagues to distraction. This is used in the theory of clinical trials, as opposed to the cynically relevant difference, which is used in the practice.
S. Senn
37
Calculating Power (cont’d)
Determine risk you’re willing to take of wrongly rejecting Ho
Type I error (α)
Decide there’s an effect when there really isn’t one
“false conviction”
set low at 5%, but arbitrary
38
Sample size (n) and Power are the only elements left!
[Figure: Power (%) plotted against Sample Size per Group (n). Axis ticks: n from 50 to 250; power from 50% to 100%.]
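Once Delta, Sigma, and the Type I error rate are fixed, power depends only on n. The sketch below uses the normal-approximation formula for a one-sided comparison of two means; it is not the calculation behind the original figure, and the values Delta = 3 and Sigma = 10 are hypothetical.

```python
import math
from statistics import NormalDist

def power(n_per_group, delta, sigma, alpha=0.05):
    """Approximate power of a one-sided two-sample z-test with equal groups."""
    norm = NormalDist()
    se = sigma * math.sqrt(2.0 / n_per_group)  # SE of the difference in means
    # Probability that the test statistic clears the critical value
    # when the true difference equals delta
    return norm.cdf(delta / se - norm.inv_cdf(1.0 - alpha))

# Hypothetical planning values: Delta = 3, Sigma = 10
for n in (50, 100, 150, 200, 250):
    print(n, round(100 * power(n, delta=3.0, sigma=10.0), 1))
```

Power rises with n but with diminishing returns, and when Delta is zero the formula returns the Type I error rate itself.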
Calculating Power (cont’d)
39
Summary of power
Power is a function of:
  hypothesis being tested
  statistical model
  sample size
  assumed variability of population
  risk you’re willing to take
  minimum “relevant effect size”
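The same ingredients can be turned around to give a sample size for a target power. This is a sketch under a one-sided, two-sample normal approximation with equal group sizes; the inputs (Delta = 3, Sigma = 10, 80% power) are hypothetical planning values.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate patients per group for a one-sided two-sample z-test."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1.0 - alpha)  # critical value for the Type I error
    z_beta = norm.inv_cdf(power)         # quantile for the desired power
    return math.ceil(2.0 * (sigma * (z_alpha + z_beta) / delta) ** 2)

print(n_per_group(delta=3.0, sigma=10.0))  # 138
print(n_per_group(delta=1.5, sigma=10.0))  # halving Delta roughly quadruples n
```

The result is only as good as the assumed Sigma and Delta, which is why the answer comes with no guarantees.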
No guarantees
40
Working definition
Power is the probability of a possible outcome of a potential decision conditional upon an imaginable circumstance given a conceivable value of an algebraic embodiment of an abstract mathematical idea and the strict adherence to an extremely precise rule.
S. Senn
Things that make no sense to me #2
42
Post-hoc power calculations
Suppose we’ve run a well-designed and adequately-powered study that “fails”
“fails” usually means p>0.05.
We need an excuse.
43
Post-hoc power calculations
Obviously, the study was underpowered!
assume that the variability was larger than anticipated
the sample size was therefore too small
all other assumptions were fine
What was the “actual power” of this wimpy study?
So, you see, the drug probably does work!…I am just a terrible scientist.
44
Post-hoc power calculations
How do you pick which assumptions were correct/incorrect when recalculating power?
Arbitrary
Ridiculous to do based on the results of 1 study
A view that I support:
“The power of a trial is a useful concept when planning the trial but has little relevance to the interpretation of its results.” (S. Senn)
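The arbitrariness is easy to demonstrate. In the sketch below the design is held fixed (175 per group, Delta = 3, one-sided alpha of 5%, all hypothetical) and only the assumed Sigma is changed after the fact. The power formula is a normal approximation and the function names are made up for this example.

```python
import math
from statistics import NormalDist

def power(n_per_group, delta, sigma, alpha=0.05):
    """Approximate power of a one-sided two-sample z-test."""
    norm = NormalDist()
    se = sigma * math.sqrt(2.0 / n_per_group)
    return norm.cdf(delta / se - norm.inv_cdf(1.0 - alpha))

def post_hoc_powers(n_per_group, delta, sigmas):
    """'Actual power' under each after-the-fact guess at the variability."""
    return {s: power(n_per_group, delta, s) for s in sigmas}

# Planned with Sigma = 10; the other values are post-hoc guesses
results = post_hoc_powers(n_per_group=175, delta=3.0, sigmas=[10.0, 12.0, 15.0, 20.0])
for sigma, pw in results.items():
    print(sigma, round(100 * pw, 1))
```

Any "actual power" from comfortable to dismal can be produced by choosing which planning assumption to revise, which is why the recalculation explains nothing about the trial's result.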
45
Conclusion
…Be careful!
46
References
Lang T, Secic M. How to Report Statistics in Medicine, 1997.
Senn S. Statistical Issues in Drug Development, 1997.