Susan Stewart, Ph.D.
UC Davis School of Medicine

November 2014

Intro to sample size determination
Basic concepts
Estimating sample size parameters
Response variables
◦ Continuous
◦ Categorical
◦ Time-to-event

Components of sample size estimation

Why is it a good idea to do a sample size calculation?

Why shouldn’t you just pick a size that’s convenient?

Because the sample might be too small to help you answer your research question,

Or the sample might be much larger than you need.

Primary objective of a clinical trial: to evaluate the efficacy and safety of an intervention.

Efficacy evaluation
◦ Compare the average response in the intervention and control groups in the study sample.
◦ Decide whether the difference between the groups indicates a true difference between treatments.

Usually the efficacy evaluation is performed in the context of a hypothesis test.

Problem: Determine whether or not the population means of the intervention and control groups truly differ with respect to the outcome of interest.
◦ We regard the intervention and control samples as being drawn from the target population.

Solution: Assume that the two groups do not differ, and see if the sample data disagree with this assumption. That is, perform a hypothesis test.

The null hypothesis (H0) assumes that there is no difference in outcome between the two groups.

The alternative hypothesis (HA) assumes that one group has a more favorable outcome than the other.

The research hypothesis is usually the alternative hypothesis.

To do a hypothesis test:
◦ Calculate a test statistic from the data.
◦ Determine whether the value of the test statistic is likely or unlikely under the null hypothesis.
◦ If the value is very unlikely, reject the null hypothesis.

Problem: We might reject the null hypothesis when it is true.
◦ That is, we might commit a Type I error.

Solution: Construct the test so that there is only a 5% chance of incorrectly rejecting the null hypothesis.
◦ That is, the level of the test (alpha) is 0.05.

Hypothesis tests can be 1-sided or 2-sided
◦ 1-sided: tests for differences in one direction only, e.g., higher response rate in the intervention group than in the control group
◦ 2-sided: tests for differences in both directions, e.g., either higher or lower response rate in the intervention group than in the control group

Even if you are primarily interested in one direction, it is customary to do a 2-sided test.

The p-value is the probability under the null hypothesis of obtaining data at least as extreme as that of the sample.
◦ That is, the p-value measures the strength of the evidence against the null hypothesis: the smaller the p-value, the stronger the evidence.

For a level 0.05 test, we reject the null hypothesis if the p-value is 0.05 or less.
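The decision rule can be sketched for a z statistic as follows. This is a minimal illustration using the standard normal distribution; the function name `two_sided_p_value` and the example z values are assumptions for illustration and do not come from the slides.

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for an observed z statistic under the null hypothesis."""
    # Probability of a standard normal value at least as far from 0 as |z|.
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical test statistics, chosen only to show both outcomes.
for z in (1.0, 2.5):
    p = two_sided_p_value(z)
    decision = "reject H0" if p <= 0.05 else "do not reject H0"
    print(f"z = {z:.2f}: p = {p:.4f} -> {decision}")
```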

Problem: We might fail to reject the null hypothesis when the alternative is true.
◦ That is, we might commit a Type II error.

Solution: Select a large enough sample so that there is an 80% chance of rejecting the null hypothesis if the alternative is true.
◦ Then the power to detect the alternative is 80%.
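To see how power grows with sample size, here is a rough sketch for a 2-sided, 2-sample z-test with equal group sizes. The function name `power_two_sample_z`, the difference of 0.5, and the standard deviation of 1.0 are hypothetical values chosen for illustration. The calculation ignores the negligible chance of rejecting in the wrong direction.

```python
from statistics import NormalDist


def power_two_sample_z(n_per_group: int, delta: float, sigma: float,
                       alpha: float = 0.05) -> float:
    """Approximate power of a 2-sided, 2-sample z-test with equal group sizes."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)           # critical value of the test
    std_error = sigma * (2 / n_per_group) ** 0.5  # SE of the difference in means
    return nd.cdf(abs(delta) / std_error - z_crit)


# Hypothetical scenario: true difference 0.5, common SD 1.0.
for n in (20, 40, 64, 100):
    print(f"n = {n:3d} per group: power = {power_two_sample_z(n, 0.5, 1.0):.2f}")
```

Under these assumed values, roughly 64 participants per group are needed to reach about 80% power.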

Specify null and alternative hypotheses, type I error rate, and power.
Define the population under study.
Gather information relevant to parameters.
If measuring time to failure, model recruitment process and choose length of follow-up period.
Calculate sample size over range of parameters.
Select sample size to use.

Epidemiol Rev, 2002; 24(1):39-53

Parameters include
◦ Variability of the response
◦ Level of the response variable in the control group
◦ Difference anticipated or judged clinically relevant

May also need to consider
◦ Loss to follow-up
◦ Noncompliance

Sources of information
◦ Pilot studies: external or internal
◦ Literature: what others have found in similar studies

When a response variable is normally distributed, the difference between the means of two independent samples is assessed with a 2-sample t-test.
◦ The t-test is robust to departures from normality.
◦ May need to transform the response variable (e.g., log transform) to obtain approximate normality.

The sample size for a z-test usually can be used to estimate the sample size for a t-test.
◦ A z-test assumes that the standard deviation is known rather than estimated from the sample.

$$z = \frac{\bar{x} - \bar{y}}{\sqrt{2\sigma^2/n}}$$

$\bar{x}$ = intervention group mean
$\bar{y}$ = control group mean
$\sigma^2$ = common variance in each group
$n$ = sample size in each group

Epidemiol Rev, 2002; 24(1):39-53 (eq. 1)

$$n = \frac{2\sigma^2 \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\Delta_A^2}$$

$\sigma^2$ = common variance in each group
$z_{1-\alpha/2}$ = critical value for 2-sided level $\alpha$ test
$z_{1-\beta}$ = value of a standard normal variable with cumulative probability equal to $1 - \beta$ (power)
$\Delta_A$ = difference corresponding to alternative hypothesis

Epidemiol Rev, 2002; 24(1):39-53 (eq. 2)
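The per-group formula above (eq. 2) translates directly into a few lines of code. This is a sketch only: the function name `n_per_group_continuous` and the inputs (SD 1.0, difference 0.5) are hypothetical, and the result is rounded up to a whole participant.

```python
import math
from statistics import NormalDist


def n_per_group_continuous(sigma: float, delta: float,
                           alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-group sample size for a 2-sided, 2-sample z-test (unrounded)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # z_{1-alpha/2}
    z_beta = nd.inv_cdf(power)           # z_{1-beta}
    return 2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2


# Hypothetical inputs: common SD of 1.0, difference of 0.5 to detect.
n = n_per_group_continuous(sigma=1.0, delta=0.5)
print(f"unrounded n = {n:.2f}; use {math.ceil(n)} per group")
```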

Randomized, age-matched
Healthy post-menopausal Chinese women within 10 years of menopause onset
Exclusion criteria
◦ Regular participation in exercise
◦ Hormone replacement therapy or drug treatment affecting bone density
◦ Hypo- or hyperparathyroidism, hypo- or hyperthyroidism, renal or liver disease
◦ History of fractures
◦ BMI over 30

Arch Phys Med Rehabil 2004; 85:717-22

Intervention: Supervised Tai Chi Chun (TCC) exercise (Yang style), 50 minutes a day, 5 times a week, for 12 months
Control: Retained sedentary lifestyle
Primary outcome: Change in bone mineral density (BMD) over 12 months
◦ Areal BMD at lumbar spine and proximal femur measured by dual x-ray absorptiometry (DXA)
◦ Volumetric BMD in distal tibia measured by multislice peripheral quantitative computed tomography (pQCT)

Null hypothesis
◦ Rate of bone mineral loss is the same in both study arms.
Alternative hypothesis
◦ Rate of bone mineral loss is different (i.e., lower) in the intervention (TCC) group.

Level of the test: 0.05 (2-sided)
Power: 80%
Mean bone loss in control group: 2.8%
◦ Average annual trabecular bone loss in previous study in same population
Mean bone loss in intervention group: 1.4%
◦ 50% reduction
Standard deviation in each group
◦ Based on previous study, ~same as mean: 3.0% in control group, 1.5% in intervention group (say)
◦ Compute pooled SD = 2.37%
Dropout: 25% in one year

$\sigma^2$ = common variance in each group = $2.37^2 = 5.62$
$z_{1-\alpha/2}$ = critical value for 2-sided level $\alpha$ test = 1.96
$z_{1-\beta}$ = value of a standard normal variable with cumulative probability equal to $1 - \beta$ (power) = 0.842
$\Delta_A$ = difference corresponding to alternative hypothesis = 1.4

$$n = \frac{2\sigma^2 \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\Delta_A^2} = \frac{2(5.62)(1.96 + 0.842)^2}{1.4^2} = 45 \text{ per group}$$

Accounting for 25% dropout: 45/0.75 = 60 per group
Actual enrollment: n = 132 total
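The arithmetic in this example can be checked with a short script. It is a sketch under two assumptions that the slides do not state explicitly: the pooled SD is taken as the square root of the average of the two group variances (equal group sizes), and the per-group result is rounded to the nearest whole number, as the slide's 45 suggests. Rounding up instead would give 46. Variable names are illustrative.

```python
import math
from statistics import NormalDist

# Inputs taken from the example (all in % bone loss per year).
sd_control, sd_intervention = 3.0, 1.5
mean_control, mean_intervention = 2.8, 1.4
dropout = 0.25

# Assumed pooling: average the two variances (equal group sizes).
pooled_sd = math.sqrt((sd_control**2 + sd_intervention**2) / 2)
delta = mean_control - mean_intervention

nd = NormalDist()
z_sum = nd.inv_cdf(1 - 0.05 / 2) + nd.inv_cdf(0.80)

n_complete = 2 * pooled_sd**2 * z_sum**2 / delta**2   # completers per group
n_enrolled = round(n_complete) / (1 - dropout)        # inflate for dropout

print(f"pooled SD = {pooled_sd:.2f}%")
print(f"completers needed per group = {n_complete:.1f}")
print(f"enroll per group = {n_enrolled:.0f}")
```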

https://stattools.crab.org/



When a response variable is categorical, a chi-square test of independence is often used to compare two groups.

When there are only 2 categories, this is the same as testing for a difference in proportions.

Need to specify the response proportion in the control group and
◦ The response proportion in the intervention group, or
◦ The odds ratio

$$n = \frac{\left[z_{1-\alpha/2}\sqrt{2\bar{\pi}(1-\bar{\pi})} + z_{1-\beta}\sqrt{\pi_c(1-\pi_c) + \pi_t(1-\pi_t)}\right]^2}{(\pi_c - \pi_t)^2}$$

$$n' = \frac{n}{4}\left[1 + \sqrt{1 + \frac{4}{n\,|\pi_c - \pi_t|}}\right]^2$$

$\pi_c$ = probability of event in control group
$\pi_t$ = probability of event in intervention group
$\bar{\pi}$ = average probability of event
$n'$ = number needed in each group

Epidemiol Rev, 2002; 24(1):39-53 (eq. 7B, 7C)

Study aim: test an outreach and counseling intervention to reduce cervical cancer incidence & mortality in low income women
Setting: Highland General Hospital (HGH)
Time frame: 3 years
Outcome measure: proportion of women who received initial follow-up at Highland within 6 months of an abnormal Pap test

Prev Med 2005; 41: 741-8

Null hypothesis
◦ Rate of follow-up of abnormal Pap tests is the same in both study arms.
Alternative hypothesis
◦ Rate of follow-up of abnormal Pap tests is different (i.e., greater) in the intervention group.

Assume 60% follow-up in control group based on previous research

Assume 75% follow-up in intervention group, a clinically important difference achieved in similar interventions

To detect this difference at the 0.05 level (2-sided) with 80% power: n=165 per arm

No loss to follow-up—outcome ascertained through medical records

$$n = \frac{\left[1.96\sqrt{2(0.675)(0.325)} + 0.842\sqrt{0.6(0.4) + 0.75(0.25)}\right]^2}{(0.60 - 0.75)^2} = 152$$

$$n' = \frac{152}{4}\left[1 + \sqrt{1 + \frac{4}{152\,|0.60 - 0.75|}}\right]^2 = 165$$

$\pi_c$ = probability of event in control group = 0.60
$\pi_t$ = probability of event in intervention group = 0.75
$\bar{\pi}$ = average probability of event = 0.675
$n'$ = number needed in each group = 165
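The two-step calculation for proportions can be reproduced as follows. This is a minimal sketch: the function names are illustrative, and reading the second formula as a continuity correction of the first is an interpretation rather than something the slides state. Results are rounded to the nearest whole number to match the slide.

```python
import math
from statistics import NormalDist


def n_per_group_proportions(pi_c: float, pi_t: float,
                            alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-group sample size for comparing two proportions (first formula, unrounded)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    pi_bar = (pi_c + pi_t) / 2  # average probability of event
    numerator = (z_alpha * math.sqrt(2 * pi_bar * (1 - pi_bar))
                 + z_beta * math.sqrt(pi_c * (1 - pi_c) + pi_t * (1 - pi_t))) ** 2
    return numerator / (pi_c - pi_t) ** 2


def adjusted_n(n: float, pi_c: float, pi_t: float) -> float:
    """Second formula: adjusted per-group sample size n'."""
    return n / 4 * (1 + math.sqrt(1 + 4 / (n * abs(pi_c - pi_t)))) ** 2


# Values from the example: 60% follow-up in control, 75% in intervention.
n = n_per_group_proportions(0.60, 0.75)
n_prime = adjusted_n(n, 0.60, 0.75)
print(f"n = {n:.1f}, n' = {n_prime:.1f}")
```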



The log rank test is often used to compare two survival curves.

Most sample size calculations assume an exponential survival distribution.

$S(t) = e^{-\lambda t}$, where
$t$ = time,
$S(t)$ = probability of survival to time $t$, and
$\lambda$ = hazard rate = risk of an event per time unit

Hazard rate: number of events per 100 person years

Median survival time = $\log_e(2)$/(hazard rate)
Hazard rate = $\log_e(2)$/(median survival time)
Hazard rate = $-\log_e(S(t))/t$, where $S(t)$ = probability of surviving to time $t$ = expected proportion without an event by $t$
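These conversions hold under the exponential survival assumption and can be sketched in a few lines. The function names and the example values (median of 2 years, 50% surviving at 2 years) are hypothetical.

```python
import math


def hazard_from_median(median_survival: float) -> float:
    """Hazard rate implied by a median survival time (exponential model)."""
    return math.log(2) / median_survival


def median_from_hazard(hazard: float) -> float:
    """Median survival time implied by a hazard rate (exponential model)."""
    return math.log(2) / hazard


def hazard_from_survival(surv_prob: float, t: float) -> float:
    """Hazard rate implied by the proportion surviving to time t."""
    return -math.log(surv_prob) / t


# Hypothetical example: a median of 2 years is the same as S(2) = 0.5.
print(f"hazard from median 2 years: {hazard_from_median(2.0):.4f} per year")
print(f"hazard from S(2) = 0.5:     {hazard_from_survival(0.5, 2.0):.4f} per year")
```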

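$$n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[\phi(\lambda_C) + \phi(\lambda_I)\right]}{(\lambda_I - \lambda_C)^2}$$

where

$$\phi(\lambda) = \frac{\lambda^2}{1 - \left[e^{-\lambda(T - T_0)} - e^{-\lambda T}\right]/(\lambda T_0)}$$

$n$ = number per group
$\lambda_I$ = hazard rate in intervention group
$\lambda_C$ = hazard rate in control group
$T$ = total time of trial (first entry to end of study)
$T_0$ = recruitment time (first entry to last entry)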

$$D = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{p(1 - p)\left(\ln(\lambda_C/\lambda_I)\right)^2}$$

where
$D$ = number of events required to detect the hazard ratio with power $1-\beta$ at level $\alpha$ (2-sided)
$\lambda_I$ = hazard rate in intervention group
$\lambda_C$ = hazard rate in control group
$p$ = proportion of participants in the control group
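The events formula depends only on the hazard ratio and the allocation proportion, so it is easy to tabulate. The sketch below uses a hypothetical function name and hypothetical hazard ratios, and assumes equal allocation (p = 0.5) unless stated otherwise.

```python
import math
from statistics import NormalDist


def events_required(hazard_ratio: float, p_control: float = 0.5,
                    alpha: float = 0.05, power: float = 0.80) -> float:
    """Number of events needed to detect a hazard ratio (2-sided test, unrounded)."""
    nd = NormalDist()
    z_sum = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    return z_sum**2 / (p_control * (1 - p_control) * math.log(hazard_ratio) ** 2)


# Hypothetical hazard ratios (control hazard / intervention hazard).
for hr in (1.25, 1.5, 2.0):
    print(f"hazard ratio {hr}: about {math.ceil(events_required(hr))} events")
```

Smaller hazard ratios need many more events, and unequal allocation (p away from 0.5) increases the requirement.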

Primary research goal: Determine whether performing surgery of the primary tumor followed by systemic therapy improves survival in a certain patient population, compared with systemic therapy only.

Patient population: Patients with synchronous unresectable metastases of colorectal cancer and few or absent symptoms

Primary outcome: Overall survival
Study design: Multi-center randomized phase III trial.

BMC Cancer 2014; 14:741

Null hypothesis
◦ Overall survival is not affected by surgery of the primary tumor before systemic therapy in this patient population.
Alternative hypothesis
◦ Surgery of the primary tumor improves overall survival in this patient population.

Level of the test: 0.05 (2-sided)
Power: 80%
Median survival in control group: 13 months
Median survival in intervention group: 19 months
◦ Minimal difference to justify a surgical procedure
Recruitment period: 30 months
Minimum follow-up: 8 months
Total sample size: 360

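$$n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[\phi(\lambda_C) + \phi(\lambda_I)\right]}{(\lambda_I - \lambda_C)^2}$$

where

$$\phi(\lambda) = \frac{\lambda^2}{1 - \left[e^{-\lambda(T - T_0)} - e^{-\lambda T}\right]/(\lambda T_0)}$$

$\alpha = 0.05$; $z_{1-\alpha/2} = 1.96$; $\beta = 0.20$; $z_{1-\beta} = 0.842$
$\lambda_I$ = hazard rate in intervention group = ln(2)/(19/12) = 0.438
$\lambda_C$ = hazard rate in control group = ln(2)/(13/12) = 0.640
hazard ratio = 19/13 = 1.46
$T$ = total time of trial (first entry to end of study) = 38/12 = 3.167
$T_0$ = recruitment time (first entry to last entry) = 2.5
$\phi(\lambda_C) = 0.607$; $\phi(\lambda_I) = 0.351$; $2n = 368$; required # of events = 218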
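The numbers on this slide can be reproduced, up to rounding, with the sketch below. Function and variable names are illustrative. Times are in years, and equal allocation (p = 0.5) is assumed for the events calculation, which the slides do not state explicitly.

```python
import math
from statistics import NormalDist


def phi(lam: float, total_time: float, recruit_time: float) -> float:
    """phi(lambda) from the sample size formula, for exponential survival."""
    bracket = math.exp(-lam * (total_time - recruit_time)) - math.exp(-lam * total_time)
    return lam**2 / (1 - bracket / (lam * recruit_time))


# Design parameters from the example, converted to years.
lam_c = math.log(2) / (13 / 12)  # control hazard rate
lam_i = math.log(2) / (19 / 12)  # intervention hazard rate
T = 38 / 12                      # recruitment plus minimum follow-up
T0 = 30 / 12                     # recruitment period

nd = NormalDist()
z_sum = nd.inv_cdf(1 - 0.05 / 2) + nd.inv_cdf(0.80)

phi_c, phi_i = phi(lam_c, T, T0), phi(lam_i, T, T0)
n_per_group = z_sum**2 * (phi_c + phi_i) / (lam_i - lam_c) ** 2

# Events required, assuming half of the participants are in the control group.
events = z_sum**2 / (0.5 * 0.5 * math.log(lam_c / lam_i) ** 2)

print(f"phi(lambda_C) = {phi_c:.3f}, phi(lambda_I) = {phi_i:.3f}")
print(f"total sample size 2n = {2 * n_per_group:.1f}")
print(f"events required = {events:.1f}")
```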



$\alpha$ (level): larger → smaller sample size
$1-\beta$ (power): larger → larger sample size
Variance: larger → larger sample size
◦ Binary variable: $\pi$ (probability of event) = 0.5 has largest variance
Difference to detect: larger → smaller sample size
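Each of these directions can be checked numerically with the 2-sample z-test formula for a continuous response. The baseline values (SD 1.0, difference 0.5) and the function name `n_per_group` are hypothetical, chosen only to show the direction of each effect.

```python
from statistics import NormalDist


def n_per_group(sigma: float, delta: float,
                alpha: float = 0.05, power: float = 0.80) -> float:
    """Per-group sample size for a 2-sided, 2-sample z-test (unrounded)."""
    nd = NormalDist()
    z_sum = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    return 2 * sigma**2 * z_sum**2 / delta**2


# Change one input at a time from a hypothetical baseline.
scenarios = {
    "baseline (alpha 0.05, power 0.80)": n_per_group(1.0, 0.5),
    "larger alpha (0.10)": n_per_group(1.0, 0.5, alpha=0.10),
    "larger power (0.90)": n_per_group(1.0, 0.5, power=0.90),
    "larger SD (1.5)": n_per_group(1.5, 0.5),
    "larger difference (0.75)": n_per_group(1.0, 0.75),
}
for label, n in scenarios.items():
    print(f"{label:36s} n = {n:6.1f}")

# Variance of a binary variable, pi * (1 - pi), for several event probabilities.
binary_variances = {pi: pi * (1 - pi) for pi in (0.1, 0.3, 0.5, 0.7, 0.9)}
print(binary_variances)
```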

Problem: Sometimes the sample size required is too large.

Solutions:
◦ Be content to detect with less power (allow more type II error).
◦ Increase the level of the test (allow more type I error).
◦ Pick a more extreme alternative.

                 % Response in Intervention Group
Level   Power    60%      65%
5%      90%      538      239
5%      80%      407      182
10%     80%      325      146

Parameters used to estimate sample size are estimates
◦ Often based on small studies
Effectiveness of the intervention
◦ May be based on a different population
◦ May be overestimated
Inclusion and exclusion criteria may change
Control group participants may do better than expected
Mathematical models for sample size calculations are approximate

www.statpages.org
www.swogstat.org/statoolsout.html
https://stattools.crab.org/


