Sample Size
Robert F. Woolson, Ph.D.
Department of Biostatistics, Bioinformatics & Epidemiology
Joint Curriculum
NIH Study Section
• Significance
• Approach
• Innovation
• Environment
• Investigators
Approach
• Feasibility
• Study Design: Controls, Interventions
• Study Size: Sample Size, Power
• Data Analysis
Sample Size
# of Animals
# of Measurement Sites/Animal
# of Replications
Sample Size
• What #s are proposed?
• Are the #s adequate?
• Is there a compelling rationale for the adequacy?
• Do we need more?
• Can we answer the questions with fewer?
Sample Size
A simple question to ask; the answer may involve:
• Assumptions
• Pilot data
• Simplification of the overall aims to a single question
Simplification
• What Is The Question?
• What Is The Primary Outcome Variable?
• What Is The Principal Hypothesis?
Pilot Data
• Relationship To Question
• Relationship To Primary Variable
• Relationship To Hypothesis
Sample Size/Power Freeware on the Web:
• http://www.stat.uiowa.edu/~rlenth/Power/
• http://hedwig.mgh.harvard.edu/sample_size/size.html
• http://www.bio.ri.ccf.org/power.html
• http://www.dartmouth.edu/~chance/
Sample Size
Purchase Software:
• http://www.powerandprecision.com/
• Nquery: www.statsolusa.com
Animal Studies
• Differences usually large
• Variability usually small
• Small sample sizes
• Many groups
• Repeated measures
Sample Size (# Animals Required)
Excerpts from the MUSC Vertebrate Animal Review Application Form:
“A power analysis or other statistical justification is required where appropriate. Where the number of animals required is dictated by other than statistical considerations… justify the number… on this basis.”
Sample Size: Ethical Issues in Animal Studies
Ethical issues:
• A study that is too large implies some animals are needlessly sacrificed
• A study that is too small implies a potential for misleading conclusions and unnecessary experimentation
Mann MD, Crouse DA, Prentice ED. Appropriate animal numbers in biomedical research in light of animal welfare considerations. Laboratory Animal Science, 1991, 41:
Ethical Issues (Continued)
Human studies: the same rationale holds for studies that are too large or too small.
Sample Size: Specifying the Hypothesis
Specifying the hypothesis:
• difference from control?
• differences among groups over time?
• differences among groups at a particular point in time?
A “non-hypothesis”:
• Animals in Group A will do better than animals in Group B
Sample Size: Specifying the Hypothesis
Ho: Mean blood pressure on drug A = mean blood pressure on drug B measured six hours after start of treatment.
Ha: Mean blood pressure on drug A < mean blood pressure on drug B measured six hours after start of treatment.
Example (SHR: spontaneously hypertensive rats)
Animal blood pressures measured at baseline
Animals randomly assigned to placebo or minoxidil
Animals measured 6 hours post treatment
Changes from baseline calculated for each animal
Example (Continued)
• Placebo changes thought to be centered at 0
• Expect minoxidil to lower blood pressure, we think by 10 mm Hg
• Blood pressure changes have a standard deviation of 5 mm Hg
If the two blood-pressure-change distributions:
1. have the same standard deviation (say 5 mm Hg), and
2. have different means, μ1 (−10) and μ2 (0),
then their distributions might look like the following:
Example (Continued)
How many animals/group are needed to have 90% power to detect the 10 mm Hg mean difference?
How would this sample size change if the standard deviation were 10 mm Hg rather than 5 mm Hg?
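Both questions on this slide can be checked with the standard normal-approximation formula, n per group = 2σ²(Zα + Zβ)²/Δ². A minimal sketch (`n_per_group` is an illustrative helper, not from the slides; a two-sided α = 0.05 is assumed):

```python
from math import ceil
from statistics import NormalDist  # standard library (Python 3.8+)

def n_per_group(delta, sigma, alpha=0.05, power=0.90):
    """Illustrative normal-approximation sample size per group for a
    two-sample comparison of means (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 1.28 for 90% power
    return ceil(2 * sigma**2 * (z_alpha + z_beta)**2 / delta**2)

print(n_per_group(delta=10, sigma=5))   # -> 6 animals/group
print(n_per_group(delta=10, sigma=10))  # -> 22 animals/group
```

Doubling the standard deviation roughly quadruples the required sample size, since σ enters the formula as σ².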
Example (Continued)
Suppose we change the endpoint to a binary one: did the animal achieve a reduction in blood pressure of 10 mm Hg or more?
Then 50% of those on minoxidil would be expected to have a reduction of 10 or more.
About 2.5% of those on placebo would have a reduction of 10 or more.
Example (Continued)
How many animals/group are required to have 90% power to detect 50% vs. 2.5%?
Why the difference in sample sizes for the same experiment? Comment on:
• Assumptions
• Endpoint
• Specific hypothesis
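For the binary-endpoint version of this question (50% vs. 2.5% responders), the standard two-proportion normal-approximation formula applies. A sketch (`n_per_group_binary` is an illustrative helper; a two-sided α = 0.05 and a pooled variance under Ho are assumed):

```python
from math import ceil, sqrt
from statistics import NormalDist  # standard library (Python 3.8+)

def n_per_group_binary(p1, p2, alpha=0.05, power=0.90):
    """Illustrative normal-approximation sample size per group for
    comparing two proportions (two-sided test)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2  # pooled proportion under Ho
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

print(n_per_group_binary(0.50, 0.025))  # -> 16 animals/group
```

Dichotomizing the continuous blood-pressure change discards information, so the binary endpoint needs more animals per group than the continuous one; that is part of the answer to “why the difference.”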
Sample Size: Distribution of Response
Nominal/binary (Binomial)
• dead, alive
Ordinal (Non-parametric)
• inflammation (mild, moderate, severe)
Continuous (Normal*)
• blood pressure
*may require transformation
Sample Size: Distribution of Response
Binomial:
• N is a function of the probability of response in controls and the probability of response in treated animals
Normal:
• N is a function of the difference in means and the standard deviation
Sample Size: One-Sample or Two-Sample Test
One sample:
• Change from baseline in one group
• Comparison to a standard (historical controls)
Two sample:
• Two independent study groups
Sample Size: One- or Two-Sided Test
One-sided test:
• Ha: Δ > 0
• Ha: Δ < 0
Two-sided test:
• Ha: Δ ≠ 0
(where Δ is the difference between group means)
Sample Size: Choosing α
α = probability of a Type I error: the probability of rejecting Ho when Ho is true; the significance level, usually 0.01 or 0.05.
• “Calling an innocent person guilty”
• “Concluding two groups are different when they are not”
Sample Size: Choosing α (Continued)
• Multiple testing can lead to errors.
• For pre-specified hypotheses, adjustment may not be needed.
• If all pairwise comparisons are of interest, adjust (α / # of tests).
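The (α / # of tests) rule is the Bonferroni correction. For example, with 4 treatment groups (an assumed, illustrative number) and all pairwise comparisons:

```python
from math import comb

alpha = 0.05
k = 4                       # illustrative number of treatment groups
n_tests = comb(k, 2)        # all pairwise comparisons among k groups
adjusted = alpha / n_tests  # Bonferroni-adjusted per-test significance level
print(n_tests, round(adjusted, 4))  # -> 6 0.0083
```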
Sample Size: Choosing β
β = probability of a Type II error: the probability of failing to reject Ho when a true difference exists.
• “Calling a guilty person innocent”
• “Missing a true difference”
Power = 1 − β. Large clinical trials use power of 0.9 or 0.95; animal studies usually use 0.8 (80% power).
Sample Size: Power
Concluding that groups do not differ when power is low is risky: a true difference may have been missed.
80% power implies a 20% chance of missing a true difference.
40% power implies a 60% chance of missing a true difference.
Sample Size: Calculation
Calculate N:
• specify the difference to be detected
• specify the variability (continuous data)
OR calculate the detectable difference:
• specify N
• specify the variability (continuous) or the control %
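The second calculation, solving for the detectable difference at a fixed N, can be sketched by inverting the usual two-sample formula (`detectable_difference` is a hypothetical helper; a two-sided α = 0.05 and 80% power are assumed):

```python
from math import sqrt
from statistics import NormalDist  # standard library (Python 3.8+)

def detectable_difference(n, sigma, alpha=0.05, power=0.80):
    """Illustrative smallest mean difference detectable with n animals
    per group (normal approximation, two-sided two-sample test)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return (z_a + z_b) * sigma * sqrt(2 / n)

# With 10 animals/group and SD = 5 mm Hg (illustrative values):
print(round(detectable_difference(n=10, sigma=5), 2))  # -> 6.26 mm Hg
```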
Sample Size: Putting it all together
Continuous (Normal) Distribution
Need all but one of: α, β, σ², Δ, N
• Zα = 1.96 (2-sided, α = 0.05)
• Zβ = 1.645 (always one-sided; β = 0.05, i.e., 95% power)
• Δ = difference between the means
• σ² = pooled variance

n (per group) = 2σ²(Zα + Zβ)² / Δ²
[Figure: Power (0.0 to 1.0) as a function of the difference (P1 − P2), for α = 0.05, a one-sided test, N per group = 100, P1 = 0.5]
[Figure: Power (0 to 1) as a function of sample size per group (2 to 90), for α = 0.05, a one-sided test, P1 = 0.5, P2 = 0.3]
Nquery Advisor
• About $700
• Many more options than many other programs
• Available in the student room in our department
Nquery Advisor
Under “File” choose “New”. Choices:
• means
• proportions
• agreement
• survival (time to event)
• regression
• # of groups (1, 2, >2)
• testing, confidence intervals, equivalence
Examples
• Continuous response
• Binary response
Sample Size: More Than One Primary Response
Compute N for each primary response and use the largest sample size.
Sample Size: Food for Thought
• Is the detectable difference biologically meaningful?
• Is the sample size too small to be believable?
• N = 5 is a common “rule of thumb,” but is it valid for the experiment being planned?
Sample Size: Misunderstandings
• “The larger the difference, the smaller the sample size” ignores the contribution of variability.
• Failing to report power for a negative study:
• calculate it based on the hypothesized difference and the observed variability
Sample Size: Keeping It Small
Study a continuous rather than a binary outcome (if variability does not increase):
• change in tumor size instead of recurrence
Study a surrogate outcome where the effect is large:
• cholesterol reduction rather than mortality
Examples Of Surrogate Outcome Measures?
• Bone density
• Quality of life
• Patency
• Pain relief
• Functional status
• Cholesterol
Sample Size: Keeping It Small
Decrease variability:
• change from baseline or analysis of covariance
• training
• equipment
• choice of animal model
Sample Size: Keeping It Small
α = 0.05, 2-sided test; β = 0.2 (power = 0.8, 80%)
Difference between the two means = 1
• Standard deviation = 2: N = 64/group
• Standard deviation = 1: N = 17/group
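As a check, plugging these values into the normal-approximation formula n = 2σ²(Zα + Zβ)²/Δ² gives 63 and 16 per group; the slide's 64 and 17 are the slightly larger values from t-distribution-based calculations, as software such as nQuery reports:

```python
from math import ceil
from statistics import NormalDist  # standard library (Python 3.8+)

z_a = NormalDist().inv_cdf(0.975)  # two-sided alpha = 0.05
z_b = NormalDist().inv_cdf(0.80)   # 80% power
# n per group = 2 * sigma^2 * (z_a + z_b)^2 / delta^2, with delta = 1
n = {sigma: ceil(2 * sigma**2 * (z_a + z_b)**2) for sigma in (2, 1)}
print(n)  # -> {2: 63, 1: 16} (normal approximation; t-based values are 64 and 17)
```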
Sample Size Estimation
• Parameters are estimates
• Estimates of relative effectiveness are based on other populations
• Effectiveness is often overstated
• Patients in trials do better
• Mathematical models are assumed
• Compromise between available resources and objectives
Sample Size: Pilot Studies
• No information on variability
• No information on efficacy
• Use an effect size from similar studies, or gather pilot data for estimation
Additional Comments
Pilot Studies
Complication rate:
• P = 1 − (1 − r)^N, where r = complication rate and N = sample size
• If the desired P and N are known, solve for r
• If the desired r and P are known, solve for N
Example to Work
Want to have 90% probability of detecting at least one complication, given a 25% complication rate. What N do you need?
You are studying 25 people and want an 80% probability of detecting at least one complication. What complication rate would yield this probability?
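Both exercises follow from rearranging P = 1 − (1 − r)^N. A sketch with hypothetical helper names:

```python
from math import ceil, log

def n_needed(P, r):
    """Smallest N giving probability >= P of observing at least one
    complication when the true complication rate is r."""
    return ceil(log(1 - P) / log(1 - r))

def rate_detectable(P, N):
    """Complication rate r giving probability P of observing at least
    one complication among N subjects."""
    return 1 - (1 - P) ** (1 / N)

print(n_needed(0.90, 0.25))                 # -> 9 subjects
print(round(rate_detectable(0.80, 25), 3))  # -> 0.062 (about a 6% rate)
```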
Pilot Studies
• Use a larger α (>0.05, e.g., 0.15 or 0.2) to compute the sample size
• If the null hypothesis is rejected, test it in a future study
• Underlying concept: futility; ensure the new treatment is not worse than the standard
Pilot Studies
Can reformulate the hypothesis:
• Ho: new treatment = placebo
• Ha: new treatment < placebo
• Continue to a larger study if Ho is not rejected
Avoid Data Driven Comparisons
[Figure: outcomes (y-axis 0-80) for Group 1 and Group 2 plotted at months 1-4, with one time point marked “Test here”]
Randomization: Bias Due to Order of Observations
• Learning effect
• Change in laboratory techniques
• Different litters
• Carry-over effects
• underestimate of carry-over
• two treatments in the same animal: give A & B; can only test the effect of B after A
Randomization: Order Effects Continued
System fatigue:
• a rabbit heart’s ability to function after two different treatments
Randomization: Order Effects Continued
Seasonal variability:
• All rats male, same weight, same age; media temperature and other incubation conditions identical; housed in identical conditions
• Outcome: unstimulated renin release from kidneys (in vitro), sampled at 30 minutes
• Outcome: metastasis; winter 16% (n=767) vs. summer 8% (n=142); logistic regression p<0.03 for season