CVI Statistics
Michael LaValley, 1/10/2011
The P-value Police?
- Often researchers see statistics (and statisticians) as barriers to publishing their important work
- However, good statistics (and statisticians) can help you avoid wasting time and money following false leads
Role of Experimental Design
- Statistics can only be as good as the data
- Good data requires thoughtfully designed experiments
- Some failures of animal experiments to translate to human trials have raised the issue of experimental design of animal studies
  - NXY-059 for stroke (Gawrylewski 2007)
  - Fluid resuscitation in bleeding trauma patients (Roberts 2002)
Experimental Design
- A well designed experiment should
  - Produce unbiased comparisons between groups
  - Provide precise estimates
- Well designed experiments require
  - Clear objectives
  - Planning
  - Sample size large enough to achieve the objectives with good power
Experimental Design
- Comparison/Control group
  - Concurrent controls
  - Internal control (before and after treatment)
- Replication
  - Reduce effect of uncontrolled variation
  - Quantify the uncertainty in the results
- Randomization
  - Computer generated
- Blocking or stratification
- Blinding
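Computer-generated randomization with blocking can be sketched in a few lines. This is a minimal illustration, not a procedure from the slides: the function name `block_randomize`, the block size of 4, the seed, and the mouse IDs are all assumptions made for the example.

```python
import random

def block_randomize(animal_ids, block_size=4, seed=2011):
    """Assign IDs to 'treatment' or 'control', balanced within each block."""
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for a 1:1 allocation")
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced
    assignment = {}
    for start in range(0, len(animal_ids), block_size):
        block = animal_ids[start:start + block_size]
        # Equal numbers of each label per block, then shuffle their order.
        labels = ["treatment", "control"] * (block_size // 2)
        rng.shuffle(labels)
        for animal, label in zip(block, labels):
            assignment[animal] = label
    return assignment

# Hypothetical IDs for 12 mice (three blocks of four).
mice = [f"mouse_{i:02d}" for i in range(1, 13)]
allocation = block_randomize(mice)
for animal, group in allocation.items():
    print(animal, group)
```

Because every block contains two of each label, the groups stay balanced even if the experiment stops after any complete block.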
Hypothesis Tests
- Hypothesis tests answer a yes/no question about a population value
- Example: quantitative assay for level of antibodies for a virus in mice
  - Does a vaccine have an effect on the levels of antibodies?
- Null Hypothesis (H0) corresponds to no effect
- Alternative Hypothesis (HA) indicates that there is an effect
Hypothesis Tests
- Example: suppose there are 10 mice available for the experiment
  - Assay the mice for antibodies before and after vaccination
  - Xi is the difference in assay values for mouse number i
  - Is the mean value of the Xi close to 0? If so, that is consistent with no effect
- µ is the population mean difference
  - Null hypothesis H0: µ = 0
  - Alternative hypothesis HA: µ ≠ 0
Hypothesis Tests
- The goal of a hypothesis test is to reject H0
- Rejecting H0 indicates that either
  - H0 is wrong
  - A rare event occurred (type I error)
- We cannot confirm H0 on the basis of a test
  - We may fail to reject H0, but we do not accept H0
Hypothesis Tests
- Each test has an associated test statistic
- For a paired t-test for the mouse vaccine data

  $$T = \frac{\bar{X}}{s / \sqrt{10}}$$

  where $\bar{X}$ is the mean and $s$ the standard deviation of the 10 differences Xi
- We reject H0 when |T| > t*
  - t* is chosen so that Pr(Reject H0 when H0 is true) = α
  - In this case, t* is from a t-distribution with 9 degrees of freedom (number of mice – 1)
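The paired t statistic for the mouse example, T = (mean of the Xi) / (s / √10), can be computed directly. The sketch below uses made-up differences (the slides give no raw data) and hard-codes the two-sided 0.05 critical value for 9 degrees of freedom, which the slides round to 2.26.

```python
import math
import statistics

# Hypothetical before/after assay differences X_i for 10 mice (illustrative only).
differences = [2.1, 3.4, -0.5, 4.0, 1.2, 2.8, 0.3, 3.9, 1.7, 2.6]

n = len(differences)
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)        # sample SD (n - 1 denominator)
t_stat = mean_diff / (sd_diff / math.sqrt(n))  # paired t statistic

T_CRIT_9DF = 2.262  # two-sided 0.05 critical value, t distribution with 9 df
reject_h0 = abs(t_stat) > T_CRIT_9DF

print(f"mean difference = {mean_diff:.2f}, T = {t_stat:.2f}, reject H0: {reject_h0}")
```

With a different number of mice the critical value changes, so `T_CRIT_9DF` would need to be looked up again for n – 1 degrees of freedom.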
Hypothesis Tests
[Figure: values used are from the t distribution with 9 degrees of freedom]
Hypothesis Tests

                 Decision
Truth            Not Reject H0        Reject H0
H0 True          Right                Type I Error (α)
H0 False         Type II Error (β)    Right (Power)

- Unfortunately, with testing comes the possibility of reaching a wrong conclusion and making an error
Hypothesis Tests
- Type I Error – reject H0 when it is true (false positive finding)
  - Hypothesis tests are set up so that the user specifies the Type I Error rate
  - Significance level α, almost always 0.05
- Type II Error – failing to reject H0 when it is false (false negative finding)
  - For a fixed sample size, as the Type I error rate is decreased, the rate of Type II error is increased
Hypothesis Tests
- The significance level is the rate of false positive findings that you are willing to live with
- Power is the probability of rejecting the null hypothesis when it is false (1 - Type II Error rate)
  - Once the significance level is set, the power for a given true difference and variability is determined by the sample size
  - For the alternative shown in the figure, the power is 76%
Hypothesis Tests
- For a 0.05 two-sided t-test with 9 degrees of freedom, we reject the null if T < -2.26 or T > 2.26
- 76% power if true difference is 3.0
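Power can be estimated by simulation: generate many experiments under a chosen alternative and count how often the test rejects. The slides do not state the variability behind the 76% figure, so the sketch below assumes a standard deviation of √10, which makes the standard error of the mean equal to 1 with 10 mice. That value is an assumption chosen for illustration, as are the function name and the seed.

```python
import math
import random

def simulate_power(true_diff, sd, n=10, t_crit=2.262, n_sims=20000, seed=1):
    """Estimate how often a two-sided paired t-test rejects H0: mu = 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # One simulated experiment: n normally distributed differences.
        sample = [rng.gauss(true_diff, sd) for _ in range(n)]
        mean = sum(sample) / n
        variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
        t_stat = mean / math.sqrt(variance / n)
        if abs(t_stat) > t_crit:
            rejections += 1
    return rejections / n_sims

assumed_sd = math.sqrt(10)  # assumption: gives a standard error of 1 when n = 10
power = simulate_power(true_diff=3.0, sd=assumed_sd)
type_i_rate = simulate_power(true_diff=0.0, sd=assumed_sd)
print(f"estimated power: {power:.2f}, estimated type I error rate: {type_i_rate:.3f}")
```

Running the same function with `true_diff=0.0` estimates the type I error rate, which should land near the chosen significance level.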
Hypothesis Tests
- Role of sample size
  - In designing an experiment, one should determine an appropriate sample size for the goals of the experiment
  - Given
    - Expected difference between groups
    - Expected variability of measurements
    - Significance level that will be used
    - Power to be targeted
  - One can determine the sample size to achieve the study goal
Hypothesis Tests
- Role of sample size
  - There are software packages and online power calculators available for determining sample size
  - If the sample size is too small for the study goal, the test result is likely to be negative (underpowered)
  - If the sample size is too large for the study goal, resources will be wasted
  - http://www.stat.uiowa.edu/~rlenth/Power/
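The four inputs listed for a sample size calculation (expected difference, expected variability, significance level, target power) combine in a standard normal-approximation formula, n = ((z for α/2 + z for power) × SD / difference)², doubled per group for two independent groups. The sketch below is a rough planning aid under that approximation, not a replacement for a dedicated power calculator. The function name and the example inputs (difference of 5, SD of 10) are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size(expected_diff, sd, alpha=0.05, power=0.80, two_groups=False):
    """Approximate n for a two-sided test (per group if two_groups is True)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    n = ((z_alpha + z_power) * sd / expected_diff) ** 2
    if two_groups:
        n *= 2  # a difference of two independent means has twice the variance
    return math.ceil(n)

# Hypothetical planning values: detect a difference of 5 units when the SD is 10.
n_paired = sample_size(expected_diff=5, sd=10)
n_per_group = sample_size(expected_diff=5, sd=10, two_groups=True)
print(n_paired, n_per_group)
```

The normal approximation slightly understates the sample size needed for a t-test, so a calculator based on the t distribution will usually ask for one or two more subjects.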
Hypothesis Tests
- P-value
  - Smallest level of significance for which you would reject the Null Hypothesis with your data
  - Probability of obtaining data at least as extreme as what was found if the Null Hypothesis were true
  - Provides a measure of the evidence against the Null Hypothesis
    - Small p-values (close to 0) show strong evidence against the null hypothesis
    - Large p-values (close to 1) show only weak evidence against the null hypothesis
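The definition of a p-value as the probability of data at least as extreme as what was found, if the null hypothesis were true, can be made concrete with a sign-flip randomization test. This is a different technique from the t-test used in the slides: under H0 each paired difference is equally likely to be positive or negative, so the p-value is the fraction of all sign assignments whose total is at least as far from 0 as the observed total. The data are hypothetical.

```python
from itertools import product

def sign_flip_p_value(differences):
    """Exact two-sided p-value from all 2**n sign assignments of the differences."""
    observed = abs(sum(differences))
    as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(differences)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, differences)))
        # Small tolerance guards against floating-point rounding in the sums.
        if flipped >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total

# Hypothetical before/after differences for 10 mice (illustrative only).
differences = [2.1, 3.4, -0.5, 4.0, 1.2, 2.8, 0.3, 3.9, 1.7, 2.6]
p_value = sign_flip_p_value(differences)
print(f"p-value = {p_value:.4f}; reject H0 at 0.05: {p_value <= 0.05}")
```

With 10 differences there are only 1024 sign assignments, so full enumeration is cheap. For larger samples a random subset of assignments is the usual substitute.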
Hypothesis Tests
- If p-value ≤ α then reject H0
- The p-value is determined by
  - How far the data are from the Null Hypothesis
  - The sample size
- For a given observed difference and variability, the larger the sample, the smaller the p-value; larger samples also give greater power
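The link between sample size and p-value shows up in the test statistic itself: for a paired t-test, T = mean / (SD / √n), so the same observed mean and SD give a larger T as n grows. The numbers below are hypothetical.

```python
import math

def t_statistic(mean_diff, sd, n):
    """Paired t statistic for a given observed mean difference, SD and sample size."""
    return mean_diff / (sd / math.sqrt(n))

# Same hypothetical observed difference (1.0) and SD (3.0), increasing sample size.
for n in (10, 40, 160):
    print(n, round(t_statistic(mean_diff=1.0, sd=3.0, n=n), 2))
```

Quadrupling the sample size doubles T, which is why a small and unimportant difference can become statistically significant in a large enough study.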
Hypothesis Test Limitations
- P-values and hypothesis tests give a dichotomous (significant/not significant) view of study results
- Statistically significant means that the observed difference is unlikely to be due to chance
  - Either H0 is not correct or
  - The observed data are a rare event – happening no more than (100*α)% of the time when H0 is true
Hypothesis Test Limitations
- Statistical significance doesn’t mean that the observed difference is important
  - Could find a statistically significant result with a large sample size when the observed difference is small and unimportant
- Could have a large and important difference between groups with a small sample size and not have statistical significance
  - Would especially be the case for an underpowered study
Confidence Intervals
- Confidence intervals show the precision of the sample values as estimates of population values
  - Provide a range of population values that are consistent with the study findings
- Often more informative than the p-values
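A t-based 95% confidence interval for a mean is the sample mean plus or minus t* × SD / √n. The sketch below assumes roughly normal data and uses hypothetical differences for 10 mice, with the 9-degree-of-freedom critical value hard-coded. The function name is an illustration, not something from the slides.

```python
import math
import statistics

def mean_confidence_interval(values, t_crit):
    """Return (lower, upper) limits of a t-based confidence interval for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(n)
    return mean - half_width, mean + half_width

# Hypothetical paired differences for 10 mice (illustrative only).
differences = [2.1, 3.4, -0.5, 4.0, 1.2, 2.8, 0.3, 3.9, 1.7, 2.6]
lower, upper = mean_confidence_interval(differences, t_crit=2.262)  # 9 df, 95%
print(f"95% CI for the mean difference: {lower:.2f} to {upper:.2f}")
```

An interval that excludes 0 corresponds to rejecting H0: µ = 0 at the 0.05 level, and its width shows how precisely the difference has been estimated.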
Test or Interval Limitations
- A significance test/confidence interval doesn’t provide a check of the study design
- Example: in a study of gene expression
  - Cancer tissue samples kept on ice while the normal tissue samples are processed
  - Observed differences in expression may be due to iced/not iced rather than cancer/normal
- A statistical procedure will never indicate that this is the reason for the result
Role of Data Distribution
- Particular tests are tuned for data from the normal (Gaussian) distribution
  - Examples
    - T-test
    - Standard (Pearson) correlation
- Often it is difficult to be sure that the data come from the normal distribution
  - Plot histograms of data – bell-shaped and symmetric?
  - Plot ordered data values against expected normal values – is a straight line obtained? (called QQ plots)
  - Plots require a substantial amount of data to be conclusive
Role of Data Distribution
- Some tests are specifically designed to work reasonably well with data from any distribution
  - Called nonparametric or distribution-free tests
  - Examples
    - Wilcoxon test (alternative to t-test)
    - Spearman correlation (alternative to standard correlation)
  - In some situations these may be less likely to reject the null hypothesis of no difference than tests based on normal data
- May want to see if nonparametric results are similar to those assuming normality
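The Wilcoxon rank-sum test replaces the data values with their ranks, which is why it does not depend on the shape of the distribution. The sketch below is an exact permutation version for two small independent groups: it enumerates every way of choosing which ranks belong to the first group. Statistical packages typically use tables or a normal approximation instead, and the group values here are hypothetical.

```python
from itertools import combinations

def midranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def rank_sum_p_value(group_a, group_b):
    """Exact two-sided p-value for the rank sum of group_a."""
    ranks = midranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    expected = n_a * (len(ranks) + 1) / 2  # mean rank sum under H0
    observed = abs(sum(ranks[:n_a]) - expected)
    deviations = [abs(sum(c) - expected) for c in combinations(ranks, n_a)]
    return sum(d >= observed - 1e-9 for d in deviations) / len(deviations)

# Hypothetical measurements for two small groups (illustrative only).
group_a = [12.1, 9.4, 15.3, 11.0, 13.8]
group_b = [8.2, 10.5, 7.9, 9.9, 6.4, 11.6]
p_value = rank_sum_p_value(group_a, group_b)
print(f"exact rank-sum p-value: {p_value:.3f}")
```

Full enumeration is only practical for small groups, since the number of combinations grows quickly with the sample sizes.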
Example
- Study question: what is the effect of calcium on blood pressure in African-American men?
- Experiment: a randomized comparison
  - Treatment group of 10 men received a calcium supplement for 12 weeks
  - Control group of 11 men received a placebo during the same period
  - Outcome is the difference in the seated systolic blood pressure (BP) over the 12-week period
- Lyle RM, et al., "Blood pressure and metabolic effects of calcium supplementation in normotensive white and black men," JAMA, 257(1987), pp. 1772-1776
Data Distribution
[Figure: histograms by group]
[Figure: QQ plots by group]
Example
- These plots aren’t very useful in determining the data distribution
  - Don’t really suggest normality
  - Aren’t conclusively non-normal either
  - Ambiguity is typical with small numbers
- Should probably look at both t-test and Wilcoxon test
  - If same results – everything is fine
  - If different results – probably trust nonparametric more
Example
- The t-test is not significant at the 0.05 significance level
  - P-value = 0.12
- The Wilcoxon test is not statistically significant at the 0.05 significance level
  - P-value = 0.33
- The test results are consistent in that with either we fail to reject the null hypothesis
- Important difference? Check the confidence intervals
Example

                 Mean Decrease in BP (mm Hg)    95% Confidence Interval
Calcium Group     5.00                          -1.26 to 11.26
Control Group    -0.27                          -4.24 to 3.69
Difference        5.27                          -1.48 to 12.03
Example
- So we found a 5 mm Hg difference between groups…
  - Might be large enough to be important?
  - But can’t rule out that this finding is due to chance (P-value > α)
- If 5 mm Hg is worth pursuing, would need to evaluate this in a larger sample
  - Do the power and sample size calculation!
- If not, pursue more promising therapies
Multiple-Testing
- Another issue to be aware of is the limits of ordinary statistical significance when doing many tests
- When we use a significance level of α=0.05, we allow about 5 out of every 100 tests of true null hypotheses to be false positives
- When 10s or 100s of tests are run, false positive findings are almost guaranteed
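How quickly false positives accumulate can be checked with one line of arithmetic: if all null hypotheses are true and the tests are independent (an assumption made here for simplicity), the chance of at least one false positive among m tests is 1 – (1 – α)^m.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance of at least one false positive among independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests

for m in (1, 10, 100):
    print(m, round(prob_any_false_positive(m), 3))
```

At α = 0.05 this gives about 0.40 for 10 tests and about 0.99 for 100 tests.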
Multiple-Testing
- An fMRI study using a dead salmon for a subject found several voxels with significant signal change after the salmon was shown 15 pictures
  - http://prefrontal.org/files/posters/Bennett-Salmon-2009.pdf
- Why?
  - Out of 8064 voxels, 16 were significant (0.2% of voxels)
Multiple-Testing
- Methods exist (and new ones are being continually developed) to deal with multiple testing issues
  - Bonferroni correction
  - Tukey’s method
  - False discovery rates
- Which method is used is less important than that something is done to account for the number of tests
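Two of the listed approaches are short enough to sketch. The Bonferroni correction compares each p-value with α divided by the number of tests. The Benjamini-Hochberg procedure, one common way of controlling the false discovery rate, compares the sorted p-values with a rising threshold. The p-values below are hypothetical, and the function names are illustrative rather than taken from any package.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, fdr=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p <= fdr * rank / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from six tests (illustrative only).
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]
print("unadjusted:", [p <= 0.05 for p in p_values])
print("Bonferroni:", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On these values the unadjusted 0.05 cutoff flags four tests, Benjamini-Hochberg three, and Bonferroni two, which shows the trade-off between guarding against any false positive and retaining power.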
References
- Triola MM, Triola MF. Biostatistics for the Biological and Health Sciences. Pearson Education, Inc., 2006
- Grafen A, Hails R. Modern Statistics for the Life Sciences. Oxford University Press, 2002
- Broman K. Statistics for Laboratory Scientists I, 2006 (Course Website) http://ocw.jhsph.edu/courses/StatisticsLaboratoryScientistsI/
References
- Festing M. Principles: the need for better experimental design. TRENDS in Pharmacological Sciences, 24:341-5, 2003
- Roberts I, Kwan I, Evans P, Haig S. Does animal experimentation inform human healthcare? Observations from a systematic review of international animal experiments on fluid resuscitation. BMJ, 324:474-6, 2002