Probability & Significance
Everything you always wanted to know about p-values* (*but were afraid to ask)
Evidence Based Chiropractic
April 10, 2003


Causality Criteria (addendum to April 3 lecture)

• Association between A & B does not indicate the presence or direction of causality
  – If a town with high unemployment has a high crime rate:
    • Do the unemployed commit the crimes?
    • Would improved employment result in less crime?
    • “The ecological fallacy”
• Tests for causality:
  – Is the association strong?
  – Is it consistent from study to study?
  – Did the postulated cause precede the postulated effect?
  – Is there a dose-response gradient? (more cause, more effect)
  – Does the association make biological sense?
  – Is the association specific?
  – Are there previously proven analogous causal associations?

Statistical Tests
• Employed in explanatory studies
• Assess the role of chance as an explanation of the pattern observed in the data
• Most commonly assess how 2 groups compare on an outcome:
• Is the pattern most probably not due to chance?
  – The difference is statistically significant
• Is the pattern likely due to chance?
  – The difference is not statistically significant
  – No matter how well the study is performed, either conclusion could be wrong

p-values (p = probability)
• A statistical value that indicates the probability of seeing a pattern at least as strong as the one observed if chance alone were at work
• Indicates how confident we can be in the conclusion
• ‘This result was significant at p<0.05’
  – Statistically speaking, and all other things being equal, we could expect a result like this to occur by chance no more than 5 times in every 100 trials
• Example: Test 100 coins by flipping each one 100 times
  – One coin comes up ‘heads’ 73 times
    • We suspect this is not an ordinary fair coin
    • It is possible for an ordinary coin to get this result by chance
    • We want to know the probability that a fair coin would come up heads 73 times out of 100
    • How confident are we that this is not a fair coin?
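The coin example can be worked through numerically. The sketch below is only an illustration: it assumes a one-sided question (73 or more heads) and independent flips, and the function name `binomial_upper_tail` is made up for this example. It also looks at the chance that at least one of 100 fair coins would do this, since 100 coins were tested.

```python
from math import comb

def binomial_upper_tail(n: int, k: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance that ONE fair coin shows 73 or more heads in 100 flips
p_one_coin = binomial_upper_tail(100, 73)

# Chance that AT LEAST ONE of 100 fair coins does so (assumes independent coins)
p_any_of_100 = 1 - (1 - p_one_coin) ** 100

print(f"one fair coin, >=73 heads: {p_one_coin:.2e}")
print(f"any of 100 fair coins:     {p_any_of_100:.2e}")
```

Both probabilities come out far below 0.05, so under these assumptions chance alone is a poor explanation for the 73-heads coin.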

Erroneous conclusions: Type I & Type II (see handout)
• Type I:
  – Like a false positive
  – A difference is shown when in “truth” there is none
  – >5% chance typically unacceptable in RCTs
• Type II:
  – Like a false negative
  – No difference is shown when in “truth” there is one
  – Acceptable: 10-20%
  – Consider:
    • If sample size is small, &
    • If difference ‘feels’ clinically important
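Both kinds of error can be seen in a small simulation. This is a rough sketch, not a recommended analysis: it assumes normally distributed outcomes with a known standard deviation of 1, uses a simple two-sided z-test, and the group size (10) and true difference (0.5) are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist, mean

def z_test_rejects(a, b, sigma=1.0, alpha=0.05):
    """Two-sided z-test for a difference in means (sigma assumed known)."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

def rejection_rate(true_diff, n_per_group, trials=2000, seed=1):
    """Fraction of simulated trials that declare a significant difference."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        rejections += z_test_rejects(a, b)
    return rejections / trials

# No true difference: every 'significant' result is a Type I error
type_1_rate = rejection_rate(true_diff=0.0, n_per_group=10)

# True difference of 0.5 SD: every 'non-significant' result is a Type II error
type_2_rate = 1 - rejection_rate(true_diff=0.5, n_per_group=10)

print(f"Type I error rate:  {type_1_rate:.3f}")
print(f"Type II error rate: {type_2_rate:.3f}")
```

With only 10 subjects per group the Type I rate stays near the chosen 5%, but the Type II rate is far above the 10-20% usually considered acceptable.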

Determinants of ‘power’:

• Define what constitutes a “true” difference

• Determine acceptable levels of Type I and Type II errors
  – ▲ in one means ▼ in the other (tradeoff)
• Calculate the necessary sample size
  – Recruit, allowing for losses
• This should be thoroughly described in any ‘Methods’ section!
  – Example: Bove, JAMA, 280(18); 1998

• The lower the β (Type II error probability), the higher the power
• The higher the β, the lower the power
• Increased α (acceptable Type I error probability, e.g. from .01 to .05 or .10)
  – Increases the chance of saying there is a difference when there is not (Type I error)
    • Decreases the rigor of the test
  – Decreases the chance of saying there is no difference when there is (Type II error)
    • Increases the power
• Decreased α (e.g. from .05 or .10 to .01)
  – Only willing to risk being wrong 1 in 100 times by saying there’s an effect when there isn’t
  – Limits chances of concluding there’s an effect
    • Lowers the power as well as the Type I error risk
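The tradeoff between the two error types can be put in numbers. The sketch below uses a normal approximation for comparing two group means; the group size (30), true difference (0.5) and standard deviation (1) are assumptions chosen only for illustration, and `approx_power` is a hypothetical helper name.

```python
from statistics import NormalDist

def approx_power(alpha, n_per_group, diff, sigma=1.0):
    """Approximate power of a two-sided z-test comparing two group means."""
    nd = NormalDist()
    se = sigma * (2 / n_per_group) ** 0.5   # standard error of the difference
    z_crit = nd.inv_cdf(1 - alpha / 2)      # cutoff set by the Type I error level
    shift = diff / se                       # true difference in standard-error units
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

for alpha in (0.01, 0.05, 0.10):
    power = approx_power(alpha, n_per_group=30, diff=0.5)
    print(f"Type I level {alpha:.2f}: power {power:.2f}, Type II risk {1 - power:.2f}")
```

Under these assumptions, loosening the Type I level from .01 to .10 raises the power and lowers the Type II risk, with the sample size held fixed.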

The “power” of an RCT
• The probability of correctly concluding that A is not equal to B
• If there is a difference, the probability that you will statistically detect it
• [1 - p(failing to detect a true difference, aka Type II error)]
• Sample size needed to “power” an RCT must be calculated a priori, and depends upon:
  – Expected or clinically important difference
  – Acceptable p-value threshold (Type I error probability)
  – Acceptable power (1 – Type II error probability)

Sample size calculations
• When a small difference between groups is considered clinically important…
  – A larger sample size is needed
• Setting the significance at .01 instead of .05
  – This is increasing the rigor of the study
  – Less willing to accept Type I error
  – A larger sample size is needed
• To increase the odds of recognizing an actual difference (lower the Type II error)
  – This is increasing the power of the study
  – A larger sample size is needed
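The three statements above can be checked with a common textbook approximation for comparing two group means, n per group ≈ 2 × ((z for Type I level + z for power) × SD / difference)². This is a sketch under assumptions (normal outcomes, equal group sizes, two-sided test, SD of 1); real studies typically use dedicated software, and `n_per_group` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(diff, sigma=1.0, alpha=0.05, power=0.80):
    """Approximate subjects needed per group (normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # set by acceptable Type I error
    z_beta = nd.inv_cdf(power)            # set by desired power (1 - Type II error)
    return ceil(2 * ((z_alpha + z_beta) * sigma / diff) ** 2)

baseline = n_per_group(diff=0.5)                  # alpha .05, power .80
smaller_diff = n_per_group(diff=0.25)             # smaller important difference
stricter_alpha = n_per_group(diff=0.5, alpha=0.01)
higher_power = n_per_group(diff=0.5, power=0.90)

print(baseline, smaller_diff, stricter_alpha, higher_power)
```

Each change (smaller difference, stricter significance, higher power) pushes the required sample size above the baseline. The figures do not include any allowance for losses during recruitment and follow-up.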

Sample size is not happenstance!
• To draw conclusions about the effectiveness of treatment (i.e. the difference between 2 groups’ outcomes) the RCT must have the statistical power to detect a real difference
  – Drawing conclusions about a population based upon a sample
  – Study says: A = B
    • Caution - Small numbers increase the chance of a Type II error
  – Study says: A is not = B
    • Caution - With small numbers, a ‘significant’ difference is more likely to be a chance finding (Type I error)

An essential component of the “Methods” section

• If a published study does not disclose the details of how the authors estimated their required sample size, including…
  – Expected or clinically important difference sought
  – Acceptable probability of making a Type I error
  – Desired power to detect a difference if there is one
  – And the statistical package or computer program used to calculate needed sample size based on the above
• Then, the statistical conclusions can be interesting and informative, but not convincing!

Absence of evidence is not evidence of absence*

• RCTs are intended to statistically detect a difference if there is one
• What if p>.05?
  – A ‘negative’ study? Not really
  – Evidence that the treatments are equivalent? No
  – Only: “There is no evidence that the groups are different”

* Altman and Bland, BMJ 1995; 311:485 (19 August)

Once again…

• Sample size affects the probability of detecting a difference between groups if there is one

• Sample size affects the probability that a difference between samples reflects a real difference in the underlying population
  – Not just a random occurrence
