Susan Stewart, Ph.D. UC Davis School of Medicine
November, 2014
Intro to sample size determination
Basic concepts
Estimating sample size parameters
Response variables
  ◦ Continuous
  ◦ Categorical
  ◦ Time-to-event
Components of sample size estimation
Why is it a good idea to do a sample size calculation?
Why shouldn’t you just pick a size that’s convenient?
Because the sample might be too small to help you answer your research question,
Or the sample might be much larger than you need.
Primary objective of a clinical trial: to evaluate the efficacy and safety of an intervention.
Efficacy evaluation
  ◦ Compare the average response in the intervention and control groups in the study sample.
  ◦ Decide whether the difference between the groups indicates a true difference between treatments.
Usually the efficacy evaluation is performed in the context of a hypothesis test.
Problem: Determine whether or not the population means of the intervention and control groups truly differ with respect to the outcome of interest.
  ◦ We regard the intervention and control samples as being drawn from the target population.
Solution: Assume that the two groups do not differ, and see if the sample data disagree with this assumption. That is, perform a hypothesis test.
The null hypothesis (H0) assumes that there is no difference in outcome between the two groups.
The alternative hypothesis (HA) assumes that one group has a more favorable outcome than the other.
The research hypothesis is usually the alternative hypothesis.
To do a hypothesis test:
  ◦ Calculate a test statistic from the data.
  ◦ Determine whether the value of the test statistic is likely or unlikely under the null hypothesis.
  ◦ If the value is very unlikely, reject the null hypothesis.
Problem: We might reject the null hypothesis when it is true.
  ◦ That is, we might commit a Type I error.
Solution: Construct the test so that there is only a 5% chance of incorrectly rejecting the null hypothesis.
  ◦ That is, the level of the test (alpha) is 0.05.
Hypothesis tests can be 1-sided or 2-sided.
  ◦ 1-sided: tests for differences in one direction only, e.g., higher response rate in the intervention group than in the control group
  ◦ 2-sided: tests for differences in both directions, e.g., either higher or lower response rate in the intervention group than in the control group
Even if you are primarily interested in one direction, it is customary to do a 2-sided test.
The p-value is the probability, under the null hypothesis, of obtaining data at least as extreme as those observed in the sample.
  ◦ That is, the p-value measures the strength of the evidence against the null hypothesis: the smaller the p-value, the stronger the evidence.
For a level 0.05 test, we reject the null hypothesis if the p-value is 0.05 or less.
Problem: We might fail to reject the null hypothesis when the alternative is true.
  ◦ That is, we might commit a Type II error.
Solution: Select a large enough sample so that there is an 80% chance of rejecting the null hypothesis if the alternative is true.
  ◦ Then the power to detect the alternative is 80%.
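The link between sample size and power can be sketched numerically. The following is a minimal illustration, assuming a 2-sided two-sample z-test with a known common standard deviation; the function name `power_two_sample_z` and the values (a true difference of half a standard deviation) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_z(n, delta, sigma, alpha=0.05):
    """Power of a 2-sided, level-alpha two-sample z-test with n per group.

    delta is the true difference in means and sigma the common standard
    deviation (both hypothetical inputs in this sketch).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standardized true difference between the two group means.
    shift = delta / (sigma * sqrt(2 / n))
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical illustration: power grows with the number per group.
for n in (20, 40, 63, 100):
    print(n, round(power_two_sample_z(n, delta=0.5, sigma=1.0), 3))
```

With a true difference of zero the function returns the level of the test (0.05), which is the Type I error rate; with a nonzero difference it returns one minus the Type II error rate.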
Specify null and alternative hypotheses, type I error rate, and power.
Define the population under study.
Gather information relevant to parameters.
If measuring time to failure, model recruitment process and choose length of follow-up period.
Calculate sample size over range of parameters.
Select sample size to use.
Epidemiol Rev, 2002; 24(1):39-53
Parameters include
  ◦ Variability of the response
  ◦ Level of the response variable in the control group
  ◦ Difference anticipated or judged clinically relevant
May also need to consider
  ◦ Loss to follow-up
  ◦ Noncompliance
Sources of information
  ◦ Pilot studies: external or internal
  ◦ Literature: what others have found in similar studies
When a response variable is normally distributed, the difference between the means of two independent samples is assessed with a 2-sample t-test.
  ◦ The t-test is robust to departures from normality.
  ◦ May need to transform the response variable (e.g., log transform) to obtain approximate normality.
The sample size for a z-test usually can be used to estimate the sample size for a t-test.
  ◦ A z-test assumes that the population standard deviation is known.
$$z = \frac{\bar{x} - \bar{y}}{\sigma\sqrt{2/n}}$$

$\bar{x}$ = intervention group mean
$\bar{y}$ = control group mean
$\sigma^2$ = common variance in each group
$n$ = sample size in each group
Epidemiol Rev, 2002; 24(1):39-53 (eq. 1)
$$n = 2\sigma^2 \left[ \frac{z_{1-\alpha/2} + z_{1-\beta}}{\Delta_A} \right]^2$$

$\sigma^2$ = common variance in each group
$z_{1-\alpha/2}$ = critical value for 2-sided level $\alpha$ test
$z_{1-\beta}$ = value of a standard normal variable with cumulative probability equal to $1-\beta$ (power)
$\Delta_A$ = difference corresponding to alternative hypothesis
Epidemiol Rev, 2002; 24(1):39-53 (eq. 2)
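The two z values in this formula are quantiles of the standard normal distribution. As a minimal sketch, they can be looked up with Python's standard library; the helper name `z_quantile` is an illustrative choice, not something from the slides.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def z_quantile(p):
    """Value of a standard normal variable with cumulative probability p."""
    return std_normal.inv_cdf(p)


# Critical values for 2-sided tests at common levels.
print("level 0.05:", round(z_quantile(1 - 0.05 / 2), 3))
print("level 0.10:", round(z_quantile(1 - 0.10 / 2), 3))

# Quantiles corresponding to common power targets (1 - beta).
print("power 0.80:", round(z_quantile(0.80), 3))
print("power 0.90:", round(z_quantile(0.90), 3))
```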
Randomized, age-matched
Healthy post-menopausal Chinese women within 10 years of menopause onset
Exclusion criteria
  ◦ Regular participation in exercise
  ◦ Hormone replacement therapy or drug treatment affecting bone density
  ◦ Hypo- or hyperparathyroidism, hypo- or hyperthyroidism, renal or liver disease
  ◦ History of fractures
  ◦ BMI over 30
Arch Phys Med Rehabil 2004; 85:717-22
Intervention: Supervised TCC exercise (Yang style) 50 minutes a day, 5 times a week, for 12 months
Control: Retained sedentary lifestyle
Primary outcome: Change in bone mineral density over 12 months
  ◦ Areal BMD at lumbar spine and proximal femur measured by dual x-ray absorptiometry (DXA)
  ◦ Volumetric BMD in distal tibia measured by multislice peripheral quantitative computed tomography (pQCT)
Null hypothesis
  ◦ Rate of bone mineral loss is the same in both study arms.
Alternative hypothesis
  ◦ Rate of bone mineral loss is different (i.e., lower) in the intervention (TCC) group.
Level of the test: 0.05 (2-sided)
Power: 80%
Mean bone loss in control group: 2.8%
  ◦ Average annual trabecular bone loss in previous study in same population
Mean bone loss in intervention group: 1.4%
  ◦ 50% reduction
Standard deviation in each group
  ◦ Based on previous study, ~same as mean: 3.0% in control group, 1.5% in intervention group (say)
  ◦ Compute pooled SD = 2.37%
Dropout: 25% in one year
$\sigma^2$ = common variance in each group = $2.37^2 = 5.62$
$z_{1-\alpha/2}$ = critical value for 2-sided level $\alpha$ test = 1.96
$z_{1-\beta}$ = value of a standard normal variable with cumulative probability equal to $1-\beta$ (power) = 0.842
$\Delta_A$ = difference corresponding to alternative hypothesis = 1.4

$$n = 2\sigma^2 \left[ \frac{z_{1-\alpha/2} + z_{1-\beta}}{\Delta_A} \right]^2 = 2(5.62) \left[ \frac{1.96 + 0.842}{1.4} \right]^2 = 45 \text{ per group}$$

45 = 0.75 × 60, so enroll 60 per group to account for 25% dropout
Actual enrollment: n = 132 total
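The arithmetic of this example can be retraced in a few lines. This is a sketch under the slide's stated inputs; the function name `n_per_group_means` and the variable names are illustrative, and rounding to the nearest integer is used only because it matches the 45 per group reported here (rounding up would give 46).

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group_means(sigma, delta, alpha=0.05, power=0.80):
    """Sample size per group for a 2-sided two-sample z-test (eq. 2)."""
    z = NormalDist().inv_cdf
    return 2 * sigma ** 2 * ((z(1 - alpha / 2) + z(power)) / delta) ** 2


# Inputs taken from the bone mineral density example.
sd_control, sd_intervention = 3.0, 1.5
pooled_sd = sqrt((sd_control ** 2 + sd_intervention ** 2) / 2)
delta = 2.8 - 1.4  # difference in mean annual bone loss (%)
dropout = 0.25

n_exact = n_per_group_means(pooled_sd, delta)
n_completers = round(n_exact)  # nearest integer, as on the slide
n_enrolled = ceil(n_completers / (1 - dropout))  # inflate for dropout

print(f"pooled SD = {pooled_sd:.2f}")
print(f"n per group = {n_completers}, enroll {n_enrolled} per group")
```

Dividing by one minus the dropout rate is the step behind "45 = 0.75 × 60": if a quarter of those enrolled drop out, 60 enrolled leave 45 completers.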
When a response variable is categorical, a chi-square test of independence is often used to compare two groups.
When there are only 2 categories, this is the same as testing for a difference in proportions.
Need to specify the response proportion in the control group and
  ◦ The response proportion in the intervention group, or
  ◦ The odds ratio
$$n = \frac{\left[ z_{1-\alpha/2}\sqrt{2\bar{\pi}(1-\bar{\pi})} + z_{1-\beta}\sqrt{\pi_c(1-\pi_c) + \pi_t(1-\pi_t)} \right]^2}{(\pi_c - \pi_t)^2}$$

$$n' = \frac{n}{4}\left[ 1 + \sqrt{1 + \frac{4}{n\,|\pi_c - \pi_t|}} \right]^2$$

$\pi_c$ = probability of event in control group
$\pi_t$ = probability of event in intervention group
$\bar{\pi}$ = average probability of event
$n'$ = number needed in each group
Epidemiol Rev, 2002; 24(1):39-53 (eq. 7B, 7C)
Study aim: test an outreach and counseling intervention to reduce cervical cancer incidence & mortality in low income women
Setting: Highland General Hospital (HGH)
Time frame: 3 years
Outcome measure: proportion of women who received initial follow-up at Highland within 6 months of an abnormal Pap test
Prev Med 2005; 41: 741-8
Null hypothesis
  ◦ Rate of follow-up of abnormal Pap tests is the same in both study arms.
Alternative hypothesis
  ◦ Rate of follow-up of abnormal Pap tests is different (i.e., greater) in the intervention group.
Assume 60% follow-up in control group based on previous research
Assume 75% follow-up in intervention group, a clinically important difference achieved in similar interventions
To detect this difference at the 0.05 level (2-sided) with 80% power: n=165 per arm
No loss to follow-up—outcome ascertained through medical records
$$n = \frac{\left[ 1.96\sqrt{2(0.675)(0.325)} + 0.842\sqrt{(0.6)(0.4) + (0.75)(0.25)} \right]^2}{(0.60 - 0.75)^2} = 152$$

$$n' = \frac{152}{4}\left[ 1 + \sqrt{1 + \frac{4}{152\,|0.60 - 0.75|}} \right]^2 = 165$$

$\pi_c$ = probability of event in control group = 0.60
$\pi_t$ = probability of event in intervention group = 0.75
$\bar{\pi}$ = average probability of event = 0.675
$n'$ = number needed in each group = 165
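The two-step calculation for proportions can be retraced in code. This is a minimal sketch of eq. 7B and 7C as written in this deck; the function name `n_per_group_proportions` is illustrative, and exact normal quantiles are used in place of the rounded 1.96 and 0.842.

```python
from math import sqrt
from statistics import NormalDist


def n_per_group_proportions(pi_c, pi_t, alpha=0.05, power=0.80):
    """Return (n, n_adjusted) per group for comparing two proportions."""
    z = NormalDist().inv_cdf
    pi_bar = (pi_c + pi_t) / 2  # average probability of event
    numerator = (
        z(1 - alpha / 2) * sqrt(2 * pi_bar * (1 - pi_bar))
        + z(power) * sqrt(pi_c * (1 - pi_c) + pi_t * (1 - pi_t))
    ) ** 2
    n = numerator / (pi_c - pi_t) ** 2  # eq. 7B
    # eq. 7C: adjusted number needed in each group.
    n_adjusted = (n / 4) * (1 + sqrt(1 + 4 / (n * abs(pi_c - pi_t)))) ** 2
    return n, n_adjusted


# Follow-up rates from the abnormal Pap test example.
n, n_adjusted = n_per_group_proportions(pi_c=0.60, pi_t=0.75)
print(round(n), round(n_adjusted))
```

The adjustment in the second step always increases the sample size, and it matters more when the unadjusted n or the difference in proportions is small.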
https://stattools.crab.org/
The log rank test is often used to compare two survival curves.
Most sample size calculations assume an exponential survival distribution.
$$S(t) = e^{-\lambda t}$$

where $t$ = time, $S(t)$ = probability of survival to time $t$, and $\lambda$ = hazard rate = risk of an event per time unit
Hazard rate: number of events per 100 person years
Median survival time = $\log_e(2)$ / (hazard rate)
Hazard rate = $\log_e(2)$ / (median survival time)
Hazard rate = $-\log_e(S(t))/t$, where $S(t)$ = probability of surviving to time $t$ = expected proportion without an event by $t$
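These conversions follow directly from the exponential survival function. A minimal sketch, with illustrative function names and a hypothetical median survival of 2 years:

```python
from math import exp, log


def hazard_from_median(median_time):
    """Hazard rate implied by a median survival time."""
    return log(2) / median_time


def median_from_hazard(hazard):
    """Median survival time implied by a hazard rate."""
    return log(2) / hazard


def hazard_from_survival(surv_prob, t):
    """Hazard rate implied by the probability of surviving to time t."""
    return -log(surv_prob) / t


def survival(t, hazard):
    """Exponential survival function S(t) = exp(-hazard * t)."""
    return exp(-hazard * t)


# Hypothetical illustration: median survival of 2 years.
hazard = hazard_from_median(2.0)
print(f"hazard rate = {hazard:.3f} events per person-year")
print(f"S(2) = {survival(2.0, hazard):.2f}")
print(f"S(5) = {survival(5.0, hazard):.3f}")
```

The time unit of the hazard rate is whatever unit the median is expressed in, so medians given in months should be converted before mixing them with times in years.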
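$$n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \left[ \phi(\lambda_C) + \phi(\lambda_I) \right]}{(\lambda_I - \lambda_C)^2}$$

$$\text{where } \phi(\lambda) = \frac{\lambda^2}{1 - \left[ e^{-\lambda(T - T_0)} - e^{-\lambda T} \right] / (\lambda T_0)}$$

$n$ = number per group
$\lambda_I$ = hazard rate in intervention group
$\lambda_C$ = hazard rate in control group
$T$ = total time of trial (first entry to end of study)
$T_0$ = recruitment time (first entry to last entry)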
$$D = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2}{p(1-p)\left( \ln(\lambda_C/\lambda_I) \right)^2}$$

where
$D$ = number of events required to detect the hazard ratio with power $1-\beta$ at level $\alpha$ (2-sided)
$\lambda_I$ = hazard rate in intervention group
$\lambda_C$ = hazard rate in control group
$p$ = proportion of participants in the control group
Primary research goal: Determine whether performing surgery of the primary tumor followed by systemic therapy improves survival in a certain patient population, compared with systemic therapy only.
Patient population: Patients with synchronous unresectable metastases of colorectal cancer and few or absent symptoms
Primary outcome: Overall survival
Study design: Multi-center randomized phase III trial.
BMC Cancer 2014; 14:741
Null hypothesis
  ◦ Overall survival is not affected by surgery of the primary tumor before systemic therapy in this patient population.
Alternative hypothesis
  ◦ Surgery of the primary tumor improves overall survival in this patient population.
Level of the test: 0.05 (2-sided)
Power: 80%
Median survival in control group: 13 months
Median survival in intervention group: 19 months
  ◦ Minimal difference to justify a surgical procedure
Recruitment period: 30 months
Minimum follow-up: 8 months
Total sample size: 360
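$$n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \left[ \phi(\lambda_C) + \phi(\lambda_I) \right]}{(\lambda_I - \lambda_C)^2}$$

$$\text{where } \phi(\lambda) = \frac{\lambda^2}{1 - \left[ e^{-\lambda(T - T_0)} - e^{-\lambda T} \right] / (\lambda T_0)}$$

$\alpha = 0.05$; $z_{1-\alpha/2} = 1.96$; $\beta = 0.20$; $z_{1-\beta} = 0.842$
$\lambda_I$ = hazard rate in intervention group = $\ln(2)/(19/12) = 0.438$
$\lambda_C$ = hazard rate in control group = $\ln(2)/(13/12) = 0.640$
hazard ratio = 19/13 = 1.46
$T$ = total time of trial (first entry to end of study) = 38/12 = 3.167
$T_0$ = recruitment time (first entry to last entry) = 2.5
$\phi(\lambda_C) = 0.607$; $\phi(\lambda_I) = 0.351$; $2n = 368$; required # of events = 218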
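The figures in this example can be retraced numerically. This is a sketch of the two formulas in this deck under the slide's inputs, with times converted to years; the function and variable names are illustrative, and equal allocation (p = 0.5) is an assumption that the slide does not state explicitly.

```python
from math import exp, log
from statistics import NormalDist


def phi(lam, total_time, recruit_time):
    """phi(lambda) from the sample size formula for exponential survival."""
    # Denominator: probability that a participant's event is observed by the
    # end of the trial, reading the formula as exponential survival with
    # recruitment spread evenly over the recruitment period.
    observed = 1 - (
        exp(-lam * (total_time - recruit_time)) - exp(-lam * total_time)
    ) / (lam * recruit_time)
    return lam ** 2 / observed


def z_sum_squared(alpha, power):
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2


# Inputs from the slide, converted from months to years.
lam_c = log(2) / (13 / 12)  # control hazard rate
lam_i = log(2) / (19 / 12)  # intervention hazard rate
total_time = 38 / 12        # 30 months recruitment + 8 months follow-up
recruit_time = 30 / 12

phi_c = phi(lam_c, total_time, recruit_time)
phi_i = phi(lam_i, total_time, recruit_time)
n_per_group = z_sum_squared(0.05, 0.80) * (phi_c + phi_i) / (lam_i - lam_c) ** 2

p_control = 0.5  # assumption: equal allocation to the two arms
events = z_sum_squared(0.05, 0.80) / (
    p_control * (1 - p_control) * log(lam_c / lam_i) ** 2
)

print(f"phi_C = {phi_c:.3f}, phi_I = {phi_i:.3f}")
print(f"total sample size 2n = {2 * n_per_group:.1f}")
print(f"events required = {events:.1f}")
```

Under these assumptions the results agree with the slide's values to rounding.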
https://stattools.crab.org/
α (level): larger → smaller sample size
1−β (power): larger → larger sample size
Variance: larger → larger sample size
  ◦ Binary variable: π (probability of event) = 0.5 has largest variance
Difference to detect: larger → smaller sample size
Problem: Sometimes the sample size required is too large.
Solutions:
  ◦ Be content to detect with less power (allow more type II error).
  ◦ Increase the level of the test (allow more type I error).
  ◦ Pick a more extreme alternative.
                 % Response in Intervention Group
Level   Power    60%    65%
5%      90%      538    239
5%      80%      407    182
10%     80%      325    146
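The trade-offs in this table can be explored with the two-proportion formula given earlier in the deck (eq. 7B, 7C). The sketch below assumes a control group response of 50%, which the slide does not state; under that assumption, and with 2-sided tests, the formula gives per-group sample sizes matching the table to rounding. All names in the code are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Assumption: the control response is not stated on the slide.
ASSUMED_CONTROL_RESPONSE = 0.50


def n_per_group(pi_c, pi_t, alpha, power):
    """Adjusted per-group sample size for two proportions (eq. 7B, 7C)."""
    z = NormalDist().inv_cdf
    pi_bar = (pi_c + pi_t) / 2
    n = (
        z(1 - alpha / 2) * sqrt(2 * pi_bar * (1 - pi_bar))
        + z(power) * sqrt(pi_c * (1 - pi_c) + pi_t * (1 - pi_t))
    ) ** 2 / (pi_c - pi_t) ** 2
    return (n / 4) * (1 + sqrt(1 + 4 / (n * abs(pi_c - pi_t)))) ** 2


scenarios = [(0.05, 0.90), (0.05, 0.80), (0.10, 0.80)]  # (level, power)
responses = (0.60, 0.65)  # intervention group response

table = {
    (alpha, power): [
        n_per_group(ASSUMED_CONTROL_RESPONSE, pi_t, alpha, power)
        for pi_t in responses
    ]
    for alpha, power in scenarios
}

print("Level  Power   60%   65%")
for (alpha, power), sizes in table.items():
    cells = "  ".join(f"{size:4.0f}" for size in sizes)
    print(f"{alpha:5.0%}  {power:5.0%}  {cells}")
```

Reading across a row shows the effect of a larger difference to detect; reading down shows the effect of lower power and then a higher level.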
Parameters used to estimate sample size are estimates
  ◦ Often based on small studies
Effectiveness of the intervention
  ◦ May be based on a different population
  ◦ May be overestimated
Inclusion and exclusion criteria may change
Control group participants may do better than expected
Mathematical models for sample size calculations are approximate
www.statpages.org
www.swogstat.org/statoolsout.html
https://stattools.crab.org/