Statistical Power and Sample Size Calculations Drug Development Statistics & Data Management July 2014 Cathryn Lewis Professor of Genetic Epidemiology

Statistical Power and Sample Size Calculations

Drug Development Statistics & Data Management

July 2014

Cathryn Lewis

Professor of Genetic Epidemiology & Statistics

Department of Medical & Molecular Genetics

King’s College London

With thanks to Irene Rebollo Mesa and Frühling Rijsdijk

Outline


1. Concepts of power

2. Power and types of error

3. Software to calculate power

4. Power for continuous outcome

5. Power for proportion, success/failure

6. Quiz!


Planning a Study

Question: What are the study endpoints?

Types of Endpoints:

•Binary clinical outcome: Death from disease.

•Quantitative : Creatinine, cholesterol levels, QOL.

•Time to Event: Time to graft failure, time to death, time to recovery

Good Qualities:

-Clinically meaningful

-Practical and feasible to measure

-Occur frequently enough throughout the duration of the trial


Planning a Study

Question: What is the expected prevalence of the outcome (discrete) or variability of the outcome (continuous)?

•Based on previous studies, pilot study or hospital/NHS report.

•Variability and prevalence are vital for power.

•Both are best at intermediate levels.

Question: What is the expected difference between groups

•in proportion of events (if discrete), or

•in mean measure (if continuous)

•Based on previous studies or pilot study

•Alternatively, minimum difference clinically relevant

•The larger the difference the higher the power


Design: What is your Hypothesis?

1.Superiority

Objective: To determine whether there is evidence of a statistical difference in the comparison of interest between two Tx regimes:

A: Tx of Interest

B: Placebo or active control Tx

H0: The two Txs have equal effect with respect to the mean response: μA = μB

H1: The two Txs are different with respect to the mean response: μA ≠ μB

Statistical Power



Power

• Definition: The expected proportion of samples in which we decide correctly against the null hypothesis

• It depends on:

1. Size of the (treatment) effect in the population (δ)

2. The significance level at which we reject the null (α, e.g. 0.05)

3. Sample size (N)

4. Design of the study: parallel or crossover etc.

5. Endpoint measurement (categorical, ordinal, continuous)

6. The expected dropout rate



Power primer

• We summarise results of a trial in a statistical analysis with a test statistic (e.g. chi-squared, Z score)

• Provide a measure of support for a certain hypothesis

• Pre-determine threshold on test statistic to reject null hypothesis

YES or NO decision-making: significance testing

[Figure: test statistic axis, with the threshold separating NO from YES]

Inevitably leads to two types of mistake:

• false positive (YES instead of NO) (Type I)

• false negative (NO instead of YES) (Type II)


Standard Case

[Figure: sampling distributions of the test statistic T if H0 were true and if HA were true, with the rejection threshold at alpha 0.05. POWER: 1 - β]


• H0 true, rejection of H0: Type I error = α (significance level)

• HA true, rejection of H0: Power = 1 - type II error = 1 - β

• HA true, non-rejection of H0: Type II error = β

Hypothesis testing

• Null hypothesis: no effect

• A ‘significant’ result means that we can reject the null hypothesis

• A ‘non-significant’ result means that we cannot reject the null hypothesis


Statistical significance

• The ‘p-value’

• The probability of a result at least as extreme as the one observed if the null were in fact true

• Typically, we are willing to incorrectly reject the null 5% or 1% of the time (Type I error)



• H0 true: nonsignificant result (1 - α), or Type I error at rate α

• HA true: significant result (1 - β), or Type II error at rate β
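The two error rates can be checked by simulation. The sketch below is illustrative only: it assumes a two-sample z-test with known standard deviation, and the function name `rejection_rate` and the numbers used (difference 5, standard deviation 10, 63 per group) are assumptions chosen for the example, not values from a real trial. With no true difference the rejection rate estimates α; with a true difference it estimates power (1 - β).

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(delta, sigma, n_per_group, alpha=0.05, n_trials=2000, seed=1):
    """Fraction of simulated trials in which a two-sided z-test rejects H0.

    delta is the true difference in means (0 means H0 is true).
    sigma is treated as known, which is a simplifying assumption.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sigma * sqrt(2 / n_per_group)
    rejections = 0
    for _ in range(n_trials):
        group_a = [rng.gauss(delta, sigma) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        z = (fmean(group_a) - fmean(group_b)) / standard_error
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_trials


# H0 true: every rejection is a false positive (Type I error).
type_1_rate = rejection_rate(delta=0, sigma=10, n_per_group=63)
# HA true: every rejection is a correct decision (power).
simulated_power = rejection_rate(delta=5, sigma=10, n_per_group=63)
print(f"Type I error rate: {type_1_rate:.3f}")
print(f"Power:             {simulated_power:.3f}")
```

The Type II error rate is one minus the simulated power.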


Standard Case

[Figure: test statistic T with the rejection threshold at alpha 0.05. POWER: 1 - β]


Increased effect size

[Figure: test statistic T with the rejection threshold at alpha 0.05. POWER: 1 - β ↑]


More conservative α

[Figure: test statistic T with the rejection threshold at alpha 0.01. POWER: 1 - β ↓]


Less conservative α

[Figure: test statistic T with the rejection threshold at alpha 0.1. POWER: 1 - β ↑]


Reduced variation

[Figure: test statistic T with the rejection threshold at alpha 0.05. POWER: 1 - β ↑]
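The direction of each change in power can be reproduced numerically. This is a minimal sketch assuming a two-sided, two-sample z-test with equal group sizes (a normal approximation); the function name `power_two_sample` and the baseline values are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_two_sample(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test (equal group sizes)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true difference is delta.
    shift = delta / (sigma * sqrt(2 / n_per_group))
    # Probability that the statistic falls beyond either rejection threshold.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


# Baseline values are illustrative assumptions.
standard = power_two_sample(delta=2, sigma=10, n_per_group=393)
scenarios = {
    "Standard case": standard,
    "Increased effect size": power_two_sample(delta=3, sigma=10, n_per_group=393),
    "More conservative alpha": power_two_sample(delta=2, sigma=10, n_per_group=393, alpha=0.01),
    "Less conservative alpha": power_two_sample(delta=2, sigma=10, n_per_group=393, alpha=0.10),
    "Reduced variation": power_two_sample(delta=2, sigma=8, n_per_group=393),
}
for name, value in scenarios.items():
    print(f"{name:<25} power = {value:.2f}")
```

Increasing the sample size acts in the same way as reducing variation: it narrows both sampling distributions.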

Determining Sample Size

We need:

– Acceptable type I error rate (α),

• usually 0.05, or 0.025 if one sided

– A meaningful difference in the response: the smallest Tx effect clinically worth detecting / that we wish to detect

– The desirable power (1 - β) to detect this difference, min. 80%

– Ratio of allocation to the groups (equal sample sizes?)

– Whether to use one-sided or two-sided test

In addition,

– The variability common to the two populations for continuous endpoint

– The response (event) rate of the control group for the binary endpoint



Calculating power using software or Web

-PRISM StatMate ($50)

-G*Power 3 (Free)

-Statistical software: SPSS, SAS, Stata, R

-PS Power and Sample size Calculation (free) (Windows)

-Web: Google “Statistical Power Calculation”

-Russell V. Lenth: http://www.stat.uiowa.edu/~rlenth/Power/

-David Schoenfeld: http://hedwig.mgh.harvard.edu/sample_size/size.html

-Perform the calculation with two methods and check that they give similar answers

Russ Lenth’s Power and Sample size page

http://www.stat.uiowa.edu/~rlenth/Power/

http://hedwig.mgh.harvard.edu/sample_size/size.html

Determining Sample Size: Continuous outcome

• Two Anti-Hypertensives: – Testing for superiority

• Endpoint: Difference in Diastolic BP – Continuous variable

• Relevant parameters

– Difference in Diastolic BP between drugs: δ = 2 mm Hg

– Standard deviation of Diastolic BP in each group: σ = 10 mm Hg

– Significance level: 0.05

– Required power: 0.8

– Assume equal sized groups

• Calculate sample size required


393 patients in each group




Power, by difference between two groups

[Figure: power, by difference between two groups]

Continuous outcome:
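The 393 patients per group in the anti-hypertensive example can be reproduced with the usual normal-approximation formula for comparing two means with equal group sizes, n = 2σ²(z(1-α/2) + z(1-β))² / δ². The sketch below assumes that formula; the slides do not say which method the software used, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_continuous(delta, sigma, alpha=0.05, power=0.80):
    """Patients per group for a two-sided comparison of two means.

    Normal approximation with equal group sizes; delta is the difference
    to detect and sigma the common standard deviation.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = std_normal.inv_cdf(power)           # power = 1 - beta
    return ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)


# Diastolic BP example: difference 2 mm Hg, standard deviation 10 mm Hg.
n_bp = n_per_group_continuous(delta=2, sigma=10)
print(n_bp)  # 393
```

A calculation based on the t distribution rather than the normal gives a slightly larger answer, which is one reason two tools can differ by a patient or two.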



Determining Sample Size: Discrete Example

• APT070 perfusion vs. cold storage of kidney

• Testing for superiority

• Endpoint: Delayed Graft Function after transplantation

• Proportion of patients experiencing delayed graft function

• Relevant parameters

• Baseline prevalence: 35%

• Minimum clinically significant difference: 10%

• p1=0.35, p2=0.25 [proportion with delayed graft function in each group]

• Significance level = 0.05

• Power = 80%

• Calculate sample size required

349 patients in each group




http://hedwig.mgh.harvard.edu/sample_size/size.html

With 349 patients on treatment A and 349 patients on treatment B there will be a 80% chance of detecting a significant difference at a two sided 0.05 significance level. This assumes that the response rate of treatment A is 0.35 and the response rate of treatment B is 0.25.


Discrete outcome
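The 349 patients per group in the delayed graft function example can be reproduced with a normal-approximation formula for two proportions plus a continuity correction. That choice of method is an assumption: it matches the number, but the slides do not state what the web calculator does internally. The function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80, continuity_correction=True):
    """Patients per group for a two-sided comparison of two proportions."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    p_bar = (p1 + p2) / 2          # average proportion, used under H0
    diff = abs(p1 - p2)
    n = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2 / diff ** 2
    if continuity_correction:
        # Inflate n to allow for the discreteness of counts.
        n = n / 4 * (1 + sqrt(1 + 4 / (n * diff))) ** 2
    return ceil(n)


n_corrected = n_per_group_proportions(0.35, 0.25)
n_uncorrected = n_per_group_proportions(0.35, 0.25, continuity_correction=False)
print(n_corrected, n_uncorrected)  # 349 329
```

Without the continuity correction the same inputs give a smaller number, so it is worth knowing which variant a tool uses before comparing answers.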


How to use power calculations

• Use power prospectively for planning future studies

– Determine an appropriate sample size

– Evaluate a planned study: will it yield useful information?

• Put science before statistics.

– Use effect sizes that are clinically relevant

– Don’t get distracted by statistical considerations

• Perform a pilot study

– Helps establish procedures, understand and protect against the unexpected

– Gives variance estimates needed in determining sample size



Design: What is your Hypothesis?

1.Superiority

H0: μA = μB

H1: μA ≠ μB

2.Equivalence:

Objective: To demonstrate that two treatments have no clinically meaningful difference

H0: The two Txs' effects are different with respect to the mean response: μA - μB ≤ -d or μA - μB ≥ d

H1: The two Txs are equal with respect to the mean response: -d < μA - μB < d

d = largest difference clinically acceptable


Design: What is your Hypothesis?

3.Non-Inferiority:

Objective: To demonstrate that a given treatment is not clinically inferior to another

H0: A given Tx is inferior with respect to the mean response: μA - μB ≤ -d

H1: A given Tx is non-inferior with respect to the mean response: μA - μB > -d
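One common way to read the three designs is through a confidence interval for the difference in means (A minus B) and the margin d. The sketch below is an illustration under stated assumptions: a larger mean response is taken to be better, the interval is supplied by the analysis at an appropriate confidence level, and the function name `classify_trial` is hypothetical.

```python
def classify_trial(ci_low, ci_high, d):
    """Read each hypothesis off a confidence interval for mean(A) - mean(B).

    d is the largest clinically acceptable difference (must be positive).
    Assumes a larger mean response is better.
    """
    if d <= 0:
        raise ValueError("d must be positive")
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    return {
        # Difference shown: the interval excludes zero.
        "difference": ci_low > 0 or ci_high < 0,
        # Equivalence shown: the whole interval lies inside (-d, d).
        "equivalence": -d < ci_low and ci_high < d,
        # Non-inferiority shown: A is at most d worse than B.
        "non_inferiority": ci_low > -d,
    }


print(classify_trial(ci_low=-1.0, ci_high=1.5, d=2.0))
print(classify_trial(ci_low=-3.0, ci_high=0.5, d=2.0))
```

In the first call the interval sits inside the margin, so equivalence and non-inferiority are both shown; in the second the lower limit crosses -d, so neither is.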


QUIZ

Assume 80% Power, α = 0.05, two-sided

Which study needs largest sample size? How many subjects: (x) more with A, (y) more with B, (z) the same

1. Mortality: Study A 20% vs 10%; Study B 20% vs 15%

2. Mortality: Study A 20% vs 10%; Study B 40% vs 30%

3. Diastolic BP: Study A 80 vs 85 mmHg, St. dev 10; Study B 90 vs 95 mmHg, St. dev 10

4. Diastolic BP: Study A 80 vs 85 mmHg, St. dev 10; Study B 80 vs 85 mmHg, St. dev 8

ANSWERS

1. B: Small difference needs more subjects

2. B: Bigger effect size in A (a two-fold difference in mortality). Smaller effect, larger sample size needed to detect

3. Same: Only the difference and the standard deviation matter, not the mean level

4. A: Bigger standard deviation, more subjects
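The quiz answers can be checked numerically. This sketch assumes normal-approximation formulas with equal group sizes and no continuity correction; the function names are hypothetical, and other tools may give slightly different numbers while agreeing on which study is larger.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def n_means(delta, sigma, alpha=0.05, power=0.80):
    """Per-group n for a two-sided comparison of two means."""
    z_sum = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)
    return ceil(2 * (sigma * z_sum / delta) ** 2)


def n_props(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided comparison of two proportions."""
    p_bar = (p1 + p2) / 2
    numerator = (
        STD_NORMAL.inv_cdf(1 - alpha / 2) * sqrt(2 * p_bar * (1 - p_bar))
        + STD_NORMAL.inv_cdf(power) * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Each entry is (per-group n for Study A, per-group n for Study B).
quiz = {
    1: (n_props(0.20, 0.10), n_props(0.20, 0.15)),
    2: (n_props(0.20, 0.10), n_props(0.40, 0.30)),
    3: (n_means(85 - 80, 10), n_means(95 - 90, 10)),
    4: (n_means(85 - 80, 10), n_means(85 - 80, 8)),
}
for question, (n_a, n_b) in quiz.items():
    if n_a == n_b:
        verdict = "the same"
    elif n_a > n_b:
        verdict = "more with A"
    else:
        verdict = "more with B"
    print(f"Q{question}: A needs {n_a}, B needs {n_b} per group -> {verdict}")
```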

