Enrique F. Schisterman, PhD Epidemiology Branch …jlong/mecd_workshop/schisterman.pdfEnrique F. Schisterman, PhD Epidemiology Branch NICHD, NIH 1 Background Biomarker: A specific

Hybrid Pooled Unpooled DesignsEnrique F. Schisterman, PhD

Epidemiology BranchNICHD, NIH

1

Background

Biomarker: A specific physical trait used to measure or indicate the effects or progress of a disease or condition

Newly developed laboratory methods expand the number of biomarkers on a daily basis

Cost

Measurement Error

Causal Link to Disease

2

q(c)

1 – p(c)0 1

1

ROC: AUC

Area Under the Curve

Overall discriminatory ability

Area under the ROC curve

x

dxdyygxf

YXPAUC

)()(

)(

Motivation

Preliminary analysis of salivary concentrations of cortisol from the LIFE study

P=0.04

Shipment 1 n Mean SDMichigan 85 0.40 0.19Texas 142 0.57 0.79

4

Cortisol by Site & Plate5

02

46

8

Ship1 Ship2 Ship1 Ship2 Ship3

1 2 3 4 5 1 2 3 1 2 3 4 5 1 2 1 2 3 4 5 6 7 8

Michigan Texas

ug/d

L

Graphs by Site

Outline

Background

Limit of Detection Determination

Pooling Biomarkers

Hybrid Design

New Pooling Hybrid and Case Only Designs

Reporting of Biomarker Data

ID Z

1 3.1

2 1.5

3 8.4

4 0.8

5 5.4

6 3.2

7 2.0

8 5.8

9 13.4

10 2.5

11 1.9

12 6.1

•Reporting threshold is equal to 2.2

ID Z

1 3.1

2 ND

3 8.4

4 ND

5 5.4

6 3.2

7 ND

8 5.8

9 13.4

10 2.5

11 ND

12 6.1

Report values < threshold as ‘not detected’

ID Z

1 3.1

2 1.1

3 8.4

4 1.1

5 5.4

6 3.2

7 1.1

8 5.8

9 13.4

10 2.5

11 1.1

12 6.1

Report values < threshold as one half the value of the threshold

Conventional Determination of the Limit of Detection (LOD)

6.15)3( blanksblanksLOD

BLANK SERIES

• 10.0

• 5.0

• 8.1

• 7.1

• 4

• 11.3

• 12.0

• 8.0

• 7.7

• 7.0

Mean = 8.02

Std Dev = 2.53

0 5 10 15 20

Example of LOD left-censored data

0 5 10 15 20 25 30 35

Blanks

“True” biomarker

Better LOD?

Example of LOD left-censored data

0 5 10 15 20 25 30 35

Blanks

“True” biomarkerObserved biomarker (samples)

Outline

Background


Pooling Biomarkers

Hybrid Design

Case Only and Reliability using Pooling Designs

What is pooling?

Physically combining several individual specimens to create a single mixed sample

Pooled samples are the average of the individual specimens

1

2 p

12

Random Sample of Biospecimens

RANDOM SAMPLE

Randomly select 20 samples

FULL DATA

N = 40 Individual Biospecimens

13

Pooling Biospecimens

POOLED DATA

40 samples in groups of 2

FULL DATA

N = 40 Individual Biospecimens

14

q(c)

1 – p(c)0 1

1

ROC: AUC

Area Under the Curve

Overall discriminatory ability

Area under the ROC curve

x

dxdyygxf

YXPAUC

)()(

)(

0

20

40

60

80

100

120

0 2 4 6 8 10

# of Pooled Samples

# o

f A

ssa

ys o

f P

oo

led

Sa

mp

les

20

50

100

200

Number of assays required to reach

equivalency on ROC Curves

Limit of Detection and Pooling

/

i i

i

i

x if x LODZ

N A if x LOD

Unpooled Specimens

Pooled Specimens

( ) ( )

( )

( )

, if LOD

N/A, if LOD

p p

p i i

i p

i

x xZ

x

Un-pooled

0

0.2

0.4

0.6

0.8

1

-4 -3 -2 -1 0 1 2 3 4

Effect of Pooling on Markers Affected by an LOD

Pooled

0

0.2

0.4

0.6

0.8

1

-4 -3 -2 -1 0 1 2 3 4

18

Un-pooled

0

0.2

0.4

0.6

0.8

1

-4 -3 -2 -1 0 1 2 3 4

Pooled

0

0.2

0.4

0.6

0.8

1

-4 -3 -2 -1 0 1 2 3 4

Estimation Based on Pooling

Using Likelihood Function

Differentiating with respect to

To obtain estimates of

Schisterman et al. Bioinformatics 2006

Efficiency of the Mean and Variance

Variance of Estimated Mean

Variance of Estimated Variance

FULL DATA POOLED RANDOMFULL DATA POOLED RANDOM

20

LOD below Mean

LOD below Mean

LOD above Mean

LOD above Mean

Pooling and Random Sampling

Pooling advantages

Reduces the number of assays we need to test

Efficiently estimates the mean

Cost-effective

Random sampling advantages

Reduces the number of assays we need to test

Efficiently estimates the variance

Cost-effective & easy to implement

21

Hybrid Design: Pooled—Unpooled

Creates a sample of both pooled and unpooled samples

Takes advantage of the strengths of both the pooling and random sampling designs

Reduces number of tests to perform

Cuts overall costs

Gains efficiency (by using pooling technique)

Accounts for different types of measurement error without replications

– Pooling error

– Random measurement error

– LOD

22

Outline

Background


Pooling Biomarkers

Hybrid Design


Unpooled: X1,…,X5

Pooled: Z1,…,Z15

Hybrid Sample S: X1,…,X5,Z1,…,Z15

Setup of Hybrid Design

Unpooled: X1,…,X[αn]

Pooled: Z1,…,Z[(1-α)n]

In GeneralHybrid Sample S: X1,…,X[αn],Z1,…,Z[(1-α)n]

α is the proportion of unpooled samples

24

Hybrid Design: Pooled—Unpooled

Create a sample of both pooled and un-pooled samples that takes advantage of the strengths of the pooled and random sample designs

Reduce number of tests to perform

Cut overall costs

Gain efficiency (by using pooling technique)

Accounts for different types of measurement error without replications

Pooling error

Random measurement error

Limit of detection

Mathematical Setup

),(~ 2

xxi NX NiX i 1

),0(~ 2

p

p

i Ne

p

i

ip

pij

ii eXp

Z 1)1(

1

2

2

,~ px

xip

NZ

Individual Samples

Pooled Samples

p is the pooling group size

Average of p individual samples

pooled together

Pooling error

Maximum Likelihood Estimators

22

(1 )

2 2

( )( )

( , , ) ln (1 ) ln2 2

i xi xi ni n

x x p x z

x z

zx

n n

Random Sampling Pooling

22

22

ˆ)1(ˆ

ˆ)1(ˆˆ

xz

xzx

nn

znxn

n

xni

xi

x

2

2

)ˆ(

ˆ

2

2(1 )2

ˆ( )ˆ

ˆ(1 )

i x

i n xp

z

n p

In order to estimate the

variance, α cannot be zero.

Zi* = Zi +ei

M

Add in Measurement Error

),0(~ 2

M

M

i Ne

Individual Samples + ME

Pooled Samples + ME

p is the pooling group size

Measurement error

Xi* = Xi +ei

M* 2 2~ ( , )i x x MX N

22

2

* ,~ Mp

x

xip

NZ

Hybrid Design Example: IL-6

Measured IL-6 on 40 MI cases and 40 controls

Biological specimens were randomly pooled in groups of 2, for the cases and controls separately, and remeasured

We want to evaluate the discriminating ability of this biomarker in terms of AUC

29

Hybrid Design Example: IL-6

n αx αy AÛC Var(AÛC)

Empirical 40 1.00 1.00 0.640 0.0036

Hybrid design: Optimal α 20 0.40 0.35 0.621 0.0049

Random sample: α=1 20 1.00 1.00 0.641 0.0071

Hybrid design reduced the variability of Var(AÛC) by 32% as compared to taking only a random sample

30

Summary—Hybrid Design

Hybrid design is a more efficient way to estimate the mean and variance of a population

Cost-effective

Yields estimate of measurement error without requiring repeated measurements

Here we focus on normally distributed data, but can be applied to other distributions as well

31

Outline

Background


Pooling Biomarkers

Hybrid Design


Study Design

Case-only pooling design to study GxE

Biomarker evaluation reliability studies

34

Case-Only Studies

Used to assess G x E interactions

Assumes independence between gene and environment

Cannot evaluate the independent effects of exposure or genotype

35

2 x 2 Table - Notation

Total Cohort

G+ G-

E+ a1+b1 a0+b0

E- c1+d1 c0+d0

Cases

G+ G-

E+ a1 a0

E- c1 c0

Controls

G+ G-

E+ b1 b0

E- d1 d0

36

Efficiency

*

11

*

11

*

00

*

00

11111111)var(

dcbadcbaORGxE

1100

1111)var(

cacaORGxE

Case-

Control

Case-Only

Case-only designs provide more efficient estimates of G x E

interactions than either case-control or cohort studies

Testing for GxE Interactions37

Public health importance Important in identifying highly susceptible populations

such that modifiable exposures that cause disease can be prevented

Problem Traditional case-control interaction estimator requires

a large n

Cost and biospecimen availability often limit the ability of investigators to test for GxE interactions

A Pooled GxE Estimator38

Pooling biospecimens and measuring allele frequencies

Rather than measuring the genotype for every individual, we propose measuring the allele frequencies of a pool of individuals within disease and environmental strata

Frequency

PCR cycle

ThresholdtC

[Germer et al., Genome Res. 2000]

12

1~

tCDEf

39

2 x 2 Table - Notation

Total Cohort

G+ G-

E+ a1+b1 a0+b0

E- c1+d1 c0+d0

Cases

G+ G-

E+ a1 a0

E- c1 c0

Controls

G+ G-

E+ b1 b0

E- d1 d0

Improvements in Power40

To address potential measurement error, we propose repeatedly measuring each pool

12 pooled measurements does better than 500 individual genotype measurements

Implication:

If an investigator had access to a large number of stored biospecimens, pooling would be very useful in preliminary assessments of GxEinteraction hypotheses

-1.0 -0.5 0.0 0.5 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Power Function

GXE interaction

Po

we

r

Pooled Sample Size 1000, error variance = .05

Pooled Sample Size 1000, error variance = .1

Not Pooled Sample Size 500

Adaptations for Different Estimators

Case-Only Estimator Uses assumptions of GxE independence for a rare disease

to achieve high power by not requiring control subjects

Hybrid Estimators Two Step

Step 1: test case-only estimator assumptions Step 2: decide whether to use the case-control or case-only

estimator

Empirical Bayes Weighted combination of the case-control and case-only

estimators

Study Design

Case-only pooling design to study GxE

Biomarker evaluation reliability studies

New user designs

Biomarker Evaluation

Estimating sources of variation in new biomarkers is an important step in development

Between- and within-subject variation, intraclasscorrelation coefficient (ICC)

Biomarkers with similar biological function with the highest ICC can be considered the most reproducible

May then be selected as markers for future studies

ICC

ICC is useful when the biomarker is an:

Outcome measure

ICC dictates trade-off in efficiency between taking an increased number of individuals or repeated measurements

Exposure measure

ICC aids investigators in correcting for measurement error

Pooling Designs

Pooling Designs

Designs where we pool samples over time on individuals (T) are more efficient than designs where we pool over individuals at the same time point (I)

Conclusion47

The combination of statistical and physical properties yield very efficient design

Thanks for the invitation!48

Thank you to my collaborators

Neil Perkins

Albert Vexler

Paul Albert

Yaakov Malinovski

Emily Mitchell

Sunni Mumford

Mitchell Danaher

Documents

Enrique F. Schisterman, PhD Epidemiology Branch …jlong/mecd_workshop/schisterman.pdfEnrique F. Schisterman, PhD Epidemiology Branch NICHD, NIH 1 Background Biomarker: A specific