P. J. Munson, National Institutes of Health, Nov. 2001

A "Consistency" Test for Determining the Significance of Gene Expression Changes on Replicate Samples

and

Two Convenient Variance-stabilizing Transformations

Peter J. Munson, Ph.D.

Mathematical and Statistical Computing Laboratory

DCB, CIT, NIH

munson@helix.nih.gov


Introduction

• Math. Stat. Comp. Lab. at NIH
• Run Affy LIMS database
  – Started Dec 2000, stores >700 chips
  – Serves 3 core facilities at NIH
• Study 1
  – 2 treatments, 5 time points, 6 subjects, 60 U95A chips, PBMC cells
• Study 2
  – 3 treatments, 5 time points, 5 subjects, 75 Hu6800 chips, human cells in culture
• Study 3
  – 4 doses, 2 time points, 20 subjects, 20 RG U34A chips, blood cells


Outline

• Development of Consistency Test
• Variance-stabilizing transforms
  – Generalized Logarithm, GLog
  – Adaptive transform for Average Diff, TAD
• Normalization
  – Normal quantile + adaptive transform
• Application
• Probe-pair data visualization:
  – Parallel Axis Coordinate Display


Comparing Two Cell Lines

Data from Carlisle et al., Mol. Carcinogen., 2000

• Don’t subtract background
• Ignore background-level points
• Calibrate on median intensity of each cell type
• Over 3-fold change = outside dashed lines
• Are these expression level changes significant? Real?


Duplicate Experiments and "Consistency" Plot

Identifies Real Changes in Expression

[Figure: consistency plot of duplicate experiments; labeled points: Vimentin, Keratin 5]


Replication Permits Calculation of Significance (P-values)

4 false positives out of 5760 spots:

P ≈ 4/5760 = 0.0007


Consistency Plot

• Compare duplicate experiments, Log Ratio scale

• Set Cutoffs for Over-, Under-expression

• Calculate number detected, D

• Assume Independence, calculate expected number, E, above both, below both cutoffs

• Estimate false positive rate, E/D

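[Figure: consistency plot of duplicate log-ratio experiments, L21b**exp45 vs. L12b**exp44, both axes from -1 to 1. Cutoff lines divide the plot into nine regions; each cell below gives the observed count, with the expected count under independence in parentheses.]

Region    Left         Middle           Right        Total
Top       0 (0.3)      22 (45.2)        24 (0.6)     46
Middle    11 (26.1)    4074 (4036.6)    28 (50.4)    4113
Bottom    16 (0.6)     74 (88.4)        0 (1.1)      90
Total     27           4170             52           4249

Above both cutoffs: D = 24, E = 0.6, E/D = 3%
Below both cutoffs: D = 16, E = 0.6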
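The steps listed for the consistency plot can be sketched in a few lines of Python. This is only an illustration under the independence assumption named on the slide: the observed counts are the ones printed on the plot, and the function and variable names (`expected_counts`, `observed`) are made up for this example.

```python
# Observed counts in the 3 x 3 grid of plot regions, as printed on the slide
# (rows: top, middle, bottom; columns: left, middle, right).
observed = [
    [0, 22, 24],
    [11, 4074, 28],
    [16, 74, 0],
]


def expected_counts(table):
    """Expected cell counts if the two experiments are independent:
    (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)

# Above both cutoffs (top right) and below both cutoffs (bottom left).
d_up, e_up = observed[0][2], expected[0][2]
d_dn, e_dn = observed[2][0], expected[2][0]

print(f"over-expressed:  D = {d_up}, E = {e_up:.1f}, E/D = {e_up / d_up:.3f}")
print(f"under-expressed: D = {d_dn}, E = {e_dn:.1f}, E/D = {e_dn / d_dn:.3f}")
```

With these counts both expected values round to 0.6, and the remaining cells agree with the figures on the slide to within about 0.1.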


[Figure: two consistency plots, L21**exp64 vs. L12**exp63 and L21**exp45 vs. L12**exp44; all axes run from -1 to 1]

p53 +/+ cells 6 hrs, replicate reciprocal experiment


Consistency Test on Relative Expression

DEFINE:

$x(g, i)$ = relative expression value for gene $g$ ($= 1, \dots, n$) in experiment $i$ ($= 1, \dots, m$)

$F_i(x)$ = empirical cdf of $x(\cdot, i)$ across genes (spots)

$c = \min_j x(g, j)$, across experiments

THEN, assuming that $\{ x(g, i),\ g = 1, \dots, n \}$ are an independent sample from distribution $F_i$, the probability that $x(g, i)$ is consistently large is:

$p_{up}(g) = \Pr(X_i \ge c \text{ for all } i) = \prod_i \bigl(1 - F_i(c)\bigr)$


Consistency Test on Relative Expression - 2

DEFINE:

$x(g, i)$ = relative expression value for gene $g$ ($= 1, \dots, n$) in experiment $i$ ($= 1, \dots, m$)

$p_{up}(g) = \prod_i \bigl(1 - F_i(\min_j x(g, j))\bigr)$

$p_{dn}(g) = \prod_i F_i(\max_j x(g, j))$

THEN

Expected number of false positives: E(g) = n * p(g)
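A minimal Python sketch of these definitions follows. It is not taken from the slides: the function name `consistency_test` and the toy data are hypothetical, and the "consistently large" probability is estimated as the fraction of genes with a value at or above the cutoff, which is one reasonable reading of Pr(X_i ≥ c) for an empirical distribution.

```python
from bisect import bisect_left, bisect_right


def consistency_test(x):
    """x[g][i] is the relative expression of gene g in experiment i.

    Returns one tuple (p_up, p_dn, e_up, e_dn) per gene, where
    e_* = n * p_* is the expected number of false positives.
    """
    n = len(x)
    m = len(x[0])
    # Sorted values of each experiment, used as its empirical distribution.
    columns = [sorted(row[i] for row in x) for i in range(m)]
    results = []
    for row in x:
        c_up = min(row)  # smallest value of this gene across experiments
        c_dn = max(row)  # largest value of this gene across experiments
        p_up = 1.0
        p_dn = 1.0
        for col in columns:
            # Fraction of genes at or above c_up in this experiment.
            p_up *= (n - bisect_left(col, c_up)) / n
            # Fraction of genes at or below c_dn in this experiment.
            p_dn *= bisect_right(col, c_dn) / n
        results.append((p_up, p_dn, n * p_up, n * p_dn))
    return results


# Toy data (made up): 5 genes, 2 replicate experiments, log-ratio scale.
toy = [
    [2.0, 1.5],    # consistently high
    [0.1, -0.2],
    [-0.3, 0.2],
    [-1.0, -1.2],  # consistently low
    [0.0, 0.1],
]
for gene, (p_up, p_dn, e_up, e_dn) in enumerate(consistency_test(toy)):
    print(gene, round(p_up, 3), round(p_dn, 3), round(e_up, 2), round(e_dn, 2))
```

Sorting each experiment once and using binary search keeps the cost near n log n per experiment, which matters when n is the number of spots on a chip.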


Assumptions of Consistency Test

• Independence between experiments

• “Exchangeability” of genes

• Homogeneity of variance across genes (i.e. across expression intensity)

Does NOT require:

• Identical distribution in separate experiments

But, variance homogeneity violated for Affy Avg. Diff. data


Variance Stabilizing Transformations

• Logarithm

• Box-Cox, power

• Generalized Logarithm, GLog

• Adaptive, TAD


Model Variance as Function of Mean AD


Model Variance as Function of Mean AD

$\mathrm{Var}(y) = a_0$

$\mathrm{Var}(y) = a_0 + a_1 y$

$\mathrm{Var}(y) = a_0 + a_1 y + a_2 y^2$

$\mathrm{Var}(y) = a_2 y^2$ => use logarithms

What about:

$\mathrm{Var}(y) = a_0 + a_2 y^2$
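One way to see why the purely quadratic model points to logarithms is the delta method, a first-order approximation that is not stated on the slides. For a smooth transform $f$:

```latex
\mathrm{Var}\bigl(f(y)\bigr) \approx f'(y)^2 \, \mathrm{Var}(y)
\quad\Longrightarrow\quad
\mathrm{Var}(\ln y) \approx \frac{1}{y^2} \, a_2 y^2 = a_2 .
```

Under the same approximation, the mixed model $\mathrm{Var}(y) = a_0 + a_2 y^2$ calls for a transform whose derivative is proportional to $1 / \sqrt{a_0 + a_2 y^2}$.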


Generalized Log Transform (G-Log)

$\mathrm{Var}(y) = a_0 + a_2 y^2 = a_0 \bigl(1 + (y/c)^2\bigr)$, where $c = \sqrt{a_0 / a_2}$

$c$ = (s.d. at $y = 0$) / CV, e.g. $c = 10 / 0.1 = 100$

$\mathrm{GLog}(y; c) = \mathrm{sign}(y) \, \ln\bigl\{ |y/c| + \sqrt{1 + y^2/c^2} \bigr\}$
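The GLog formula translates directly into Python. The sketch below is illustrative only: the function name `glog` is an assumption, and c = 100 is the example value given on the slide (s.d. of 10 at y = 0 divided by a CV of 0.1).

```python
import math


def glog(y, c):
    """Generalized log: sign(y) * ln(|y/c| + sqrt(1 + (y/c)^2))."""
    ratio = y / c
    magnitude = math.log(abs(ratio) + math.sqrt(1.0 + ratio * ratio))
    return math.copysign(magnitude, y)


c = 100.0  # example value from the slide: 10 / 0.1
for y in (-500.0, -50.0, 0.0, 50.0, 500.0, 50000.0):
    print(f"y = {y:>8.1f}   GLog = {glog(y, c):+.4f}")
```

The expression is the inverse hyperbolic sine of y/c, so `math.asinh(y / c)` gives the same value. Near zero it is approximately linear in y, and for y much larger than c it approaches ln(2y/c), so it behaves like a logarithm at high intensity while remaining defined for zero and negative values.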


Quantile Normalization for AD (before)


Quantile Normalization for AD (after)


Normal Quantile Transform after GLog(AD) (it’s almost linear)
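The slides do not spell out how the normal quantile transform is computed. A common construction, shown here as an assumption rather than as the author's procedure, replaces each value by the standard normal quantile of its rank. The function name and the (rank - 0.5) / n plotting position are choices made for this sketch, and ties are broken by input order.

```python
from statistics import NormalDist


def normal_quantile_transform(values):
    """Map each value to the standard normal quantile of its rank."""
    n = len(values)
    # Indices of the values in increasing order.
    order = sorted(range(n), key=lambda k: values[k])
    std_normal = NormalDist()  # mean 0, s.d. 1
    scores = [0.0] * n
    for rank, k in enumerate(order, start=1):
        # (rank - 0.5) / n keeps the probability strictly inside (0, 1).
        scores[k] = std_normal.inv_cdf((rank - 0.5) / n)
    return scores


# Made-up average-difference values for one chip.
ad = [12.0, 850.0, -30.0, 95.0, 4100.0]
print([round(z, 3) for z in normal_quantile_transform(ad)])
```

Because the output depends only on ranks, every chip transformed this way ends up with the same distribution of values, which is what makes it usable as a normalization step.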


Adaptive Transform of AD (TAD) - 1

Model variance (over many replicates) vs. mean AD

Plot: Log(SD) or Wilson-Hilferty SD^(2/3) transform vs. mean of NQ(AD)

Fit smooth function g, which predicts SD


Adaptive Transform of AD (TAD) - 2

$T(X) = \int_{-\infty}^{X} \frac{1}{g(u)} \, du$
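The integral defining T can be approximated numerically once a fitted g is available. The sketch below is a hedged illustration: `adaptive_transform` is a hypothetical name, the integral starts at the lowest grid point rather than at minus infinity, and the stand-in g is the quadratic variance model from the GLog slide (a0 = 100, a2 = 0.01), chosen only because its integral is known in closed form.

```python
import math


def adaptive_transform(grid, g):
    """Cumulative trapezoid integral of 1/g over an increasing grid.

    Returns T evaluated at each grid point, with T = 0 at grid[0].
    """
    t = [0.0]
    for lo, hi in zip(grid, grid[1:]):
        t.append(t[-1] + 0.5 * (1.0 / g(lo) + 1.0 / g(hi)) * (hi - lo))
    return t


# Stand-in for the fitted smooth function: predicted SD at mean level u.
a0, a2 = 100.0, 0.01


def g(u):
    return math.sqrt(a0 + a2 * u * u)


grid = [float(u) for u in range(0, 2001)]  # mean levels 0, 1, ..., 2000
t = adaptive_transform(grid, g)

# For this particular g the exact integral is asinh(u / c) / sqrt(a2).
c = math.sqrt(a0 / a2)
print(t[-1], math.asinh(grid[-1] / c) / math.sqrt(a2))
```

For this choice of g the adaptive transform is GLog scaled by a constant, so the two transformations agree whenever the fitted SD curve follows the quadratic variance model.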


Adaptive Transform of AD (TAD)


Consistency Test p-values

[Figure: histograms of consistency test p-values (horizontal axis 0 to 1, vertical axis count) for Time 1 vs. Time 0 and Time 2 vs. Time 0, in the Treatment and Sham groups]


Results of Study 1 (5 time points, 2 treatments, 6 subjects)

Table 1. Number of genes detected by consistency test with expected false positives set to 1.0

Group      Any time   1-0   2-0   3-0   4-0
Treated    385        13    340   22    19
Controls   83         21    23    26    24
Both       2          0     1     2     1

Table 3. Number of genes detected by Maximum TAD greater than 1

Group      Any time   1-0   2-0   3-0   4-0
Treated    275        5     264   4     5
Controls   6          1     2     4     4
Both       1          0     0     0     1


Probe Pair Data, Delta TAD = 2

Parallel Axis Coordinate Display


Probe Pair Data, Delta TAD = 0.5


Probe Pair Data, Delta TAD = -1.5


Probe Pair Data, Delta TAD = -0.5


Acknowledgements

Lynn Young, MSCL
Vinay Prabhu, MSCL
Jennifer Barb, MSCL
Howard Shindel, MSCL
Andrew Schwartz, CIT
Steve Bailey, CIT

Robert Danner, CC
Anthony Suffredini, CC
Peter Eichacker, CC
James Shelhamer, CC
Eric Gerstenberger, CC

Sayed Daoud, NCI
Yves Pommier, NCI
John Weinstein, NCI

David Krizman, NCI
Alex Carlisle, NCI

David Rocke, UC Davis
