Computational Diagnostics based on Large Scale Gene Expression Profiles using MCMC

Rainer Spang, Max Planck Institute for Molecular Genetics, Berlin

Harry Zuzan, Carrie Blanchette, Erich Huang, Holly Dressman, Jeff Marks, Joe Nevins, Mike West

Duke Medical Center & Duke University

Estrogen Receptor Status

• 7000 genes

• 49 breast tumors

• 25 ER+

• 24 ER-

Tumor - Chip - 7000 Numbers

Given: 7000 numbers

Wanted: the probability that the tumor is ER+, e.g. 89%

7000 Numbers Are More Numbers Than We Need

Predict ER status based on the expression levels of super-genes

Singular Value Decomposition

$$X = A\,D\,F$$

X: data

A: loadings

D: singular values

F: expression levels of super genes (orthogonal matrix)
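The decomposition above can be sketched with NumPy's thin SVD. This is only an illustration on random placeholder data of the same shape as the study (7000 genes, 49 tumors); the variable names `A`, `d` and `F` follow the slide, while the orientation of the data matrix (genes in rows, tumors in columns) is an assumption.

```python
import numpy as np

# Placeholder data standing in for the real profiles (assumption):
# rows are genes, columns are tumors.
rng = np.random.default_rng(0)
X = rng.normal(size=(7000, 49))

# Thin SVD: X = A @ diag(d) @ F
A, d, F = np.linalg.svd(X, full_matrices=False)

print(A.shape)  # loadings: genes x super genes
print(d.shape)  # singular values: one per super gene
print(F.shape)  # super-gene expression levels: super genes x tumors

# The rows of F are orthonormal, and the three factors reproduce X.
rows_orthonormal = np.allclose(F @ F.T, np.eye(F.shape[0]))
reconstructs_X = np.allclose((A * d) @ F, X)
print(rows_orthonormal, reconstructs_X)
```

With 49 tumors there are at most 49 super genes, so each tumor is described by 49 numbers instead of 7000.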

Probit Model

$$P[Y_i = 1 \mid \beta] = \Phi\Big(\beta_0 + \sum_{\text{all super genes}} \beta_i x_i\Big)$$

$Y_i$: class of tumor i

$\Phi$: distribution function of a standard normal

$\beta_i$: regression weight for super gene i

$x_i$: expression level of super gene i
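As a small worked example of the probit model, the probability can be evaluated with the standard normal distribution function from Python's standard library. The function name `probit_probability` and the numbers in the usage lines are made up for illustration.

```python
from statistics import NormalDist


def probit_probability(beta0, beta, x):
    """Return Phi(beta0 + sum_i beta_i * x_i) for one tumor.

    beta: regression weights, one per super gene
    x:    expression levels of the super genes in this tumor
    """
    linear_predictor = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return NormalDist().cdf(linear_predictor)


# Illustrative values only: weights and expression levels that cancel out
# give a linear predictor of 0, i.e. no preference for either class.
p_neutral = probit_probability(0.0, [1.0, -1.0], [0.5, 0.5])
# A clearly positive linear predictor gives a probability close to 1.
p_positive = probit_probability(0.0, [1.0], [1.96])
print(p_neutral, p_positive)
```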

Overfitting

• Using only a small number of super genes is not robust at all

• When using many (or all) super genes, the linear model is easily saturated, i.e. several models fit the training data perfectly

• Consequence: for a new patient, some of these models predict that she is ER+ and others that she is ER-
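The saturation problem can be illustrated with a plain least-squares score in place of the probit model (a simplification chosen for this sketch, not the method of the talk). With 49 tumors, 49 super genes and an intercept there are 50 weights for 49 equations, so the fit is not unique. The data below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 49

# Placeholder super-gene levels (tumors x super genes) and labels
# coded as +1 (ER+) and -1 (ER-): 25 positive, 24 negative.
F = rng.normal(size=(n, n))
y = np.where(np.arange(n) < 25, 1.0, -1.0)

# Design matrix with an intercept column: 49 equations, 50 unknowns.
W = np.column_stack([np.ones(n), F])

# One perfect fit: the minimum-norm least-squares solution.
w_first, *_ = np.linalg.lstsq(W, y, rcond=None)

# Any multiple of a null-space direction can be added without
# changing the fit on the training tumors.
_, _, Vt = np.linalg.svd(W)
null_direction = Vt[-1]

# A new tumor, and a second perfect fit chosen to flip its prediction.
x_new = np.concatenate([[1.0], rng.normal(size=n)])
score_first = x_new @ w_first
shift = -2.0 * score_first / (x_new @ null_direction)
w_second = w_first + shift * null_direction
score_second = x_new @ w_second

print(np.allclose(W @ w_first, y), np.allclose(W @ w_second, y, atol=1e-6))
print(score_first, score_second)  # opposite signs
```

Both weight vectors reproduce every training label, yet they disagree on the new tumor, which is the situation described in the last bullet.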

Given the Few Profiles With Known Diagnosis:

• The uncertainty about the right model is high

• The variance of the model weights is large

• The likelihood landscape is flat

• We need additional model assumptions to solve the problem

Informative Priors

Likelihood × Prior ∝ Posterior

If the Prior Is Chosen Badly:

• We cannot reproduce the diagnosis of the training profiles any more

• We still cannot identify the model

• The diagnosis is driven mostly by the additional assumptions and not by the data

The Prior Needs to Be Designed in 49 Dimensions

• Shape?

• Center?

• Orientation?

• Not too narrow ... not too wide

Shape

Multidimensional normal, for simplicity

Center

Assumptions on the model correspond to assumptions on the diagnosis

$$P[Y_i = 1 \mid \beta_i]$$

Orientation

Orthogonal super genes!

Not Too Narrow ... Not Too Wide

Auto-adjusting model

Scales are hyper parameters with their own priors

$$p(\beta \mid T) = \prod_{i=1}^{n} N(\beta_i \mid 0,\ \tau_i^2 / d_i^2)$$

Prior given the hyper parameter

Hyper parameter

Independent super genes

Unbiased prior

Rescaling by singular values

A prior for the hyper parameters

$$\tau_i^{-2} \sim \mathrm{Gamma}(k/2,\ k/2)$$

- Conjugate prior

- Flexibility for $\tau_i$

- Symmetric U-shaped prior for $P[Y_i = 1 \mid \beta_i]$

k=2 or k=3
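A minimal sketch of drawing weights from this two-level prior: first a precision for each super gene, then the weight given its scale. It assumes that Gamma(k/2, k/2) is meant in the shape/rate parameterization (so the precision has mean 1) and that the hyper parameter enters as a precision, as conjugacy suggests. The function name `sample_prior_weights` and the values of `d` are hypothetical.

```python
import numpy as np


def sample_prior_weights(d, k=3, n_draws=1, seed=0):
    """Draw (tau, beta) from the hierarchical prior.

    Assumed parameterization: tau_i^{-2} ~ Gamma(shape=k/2, rate=k/2),
    then beta_i | tau_i ~ N(0, tau_i^2 / d_i^2).
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(d, dtype=float)
    # NumPy's gamma takes shape and scale, where scale = 1 / rate.
    precision = rng.gamma(shape=k / 2, scale=2 / k, size=(n_draws, d.size))
    tau = 1.0 / np.sqrt(precision)
    beta = rng.normal(loc=0.0, scale=tau / d)
    return tau, beta


# Illustrative scales d_i (made up): larger d_i means a tighter prior on beta_i.
d_example = np.array([10.0, 5.0, 1.0])
tau_draws, beta_draws = sample_prior_weights(d_example, k=3, n_draws=100_000)
print(tau_draws.shape, beta_draws.shape)
print(np.median(beta_draws, axis=0))  # centered near zero for every super gene
```

Integrating out the scale gives each weight a heavier-tailed distribution than a single normal, which is what lets the model adjust the prior width by itself.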

Latent Variable

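$$h_i = \beta_0 + \sum_{\text{all super genes}} \beta_i x_i + \varepsilon, \qquad \varepsilon \sim N(0, 1)$$

$$Y_i = 1 \iff h_i > 0$$

$$P[Y_i = 1 \mid \beta] = \Phi\Big(\beta_0 + \sum_{\text{all super genes}} \beta_i x_i\Big)$$

Albert & Chib 1993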
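The equivalence between the latent-variable formulation and the probit probability can be checked by simulation. The weights and expression levels below are made-up numbers for a single tumor.

```python
from statistics import NormalDist

import numpy as np

# Illustrative weights and super-gene levels for one tumor (made up).
beta0 = -0.3
beta = np.array([0.8, -0.5, 0.2])
x = np.array([1.0, 0.4, 1.5])
linear_predictor = beta0 + beta @ x

# Simulate the latent variable h = linear predictor + standard normal noise
# and classify the tumor as ER+ whenever h > 0.
rng = np.random.default_rng(0)
h = linear_predictor + rng.normal(size=200_000)
simulated = float(np.mean(h > 0))

# Closed form from the probit model.
exact = NormalDist().cdf(linear_predictor)
print(simulated, exact)  # the two values agree up to Monte Carlo error
```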

MCMC

- Gibbs Sampler

- Sequential updates of conditional distributions

$$p(h \mid \beta, X, T) \sim \text{truncated normal}$$

$$p(T \mid \beta, X, h) \sim \text{gamma}$$

$$p(\beta \mid X, h, T) \sim \text{normal}$$

All conditional posteriors can be calculated analytically

West 2001, Albert & Chib 1993
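The three sequential updates can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the results in this talk: the Gamma prior is taken in shape/rate form on the precisions, the intercept gets a vague N(0, `intercept_var`) prior that is not specified on the slides, and the super-gene levels passed in as `Z` are assumed to be the SVD factors scaled by their singular values. All function and parameter names are hypothetical, and the data are synthetic.

```python
from statistics import NormalDist

import numpy as np

_STD = NormalDist()


def draw_latent(mean, label, rng):
    """Draw h ~ N(mean, 1), truncated to h > 0 if label == 1, else to h <= 0."""
    u = rng.uniform(np.finfo(float).tiny, 1.0)
    if label == 1:
        # h = mean - w, where w ~ N(0, 1) is restricted to w < mean
        p = max(u * _STD.cdf(mean), 1e-300)  # floor guards the extreme tail
        return mean - _STD.inv_cdf(p)
    # h = mean + z, where z ~ N(0, 1) is restricted to z < -mean
    p = max(u * _STD.cdf(-mean), 1e-300)
    return mean + _STD.inv_cdf(p)


def gibbs_probit(Z, y, d, k=3, n_iter=400, burn_in=100,
                 intercept_var=100.0, seed=0):
    """Z: tumors x super genes, y: 0/1 labels, d: scales d_i of the prior."""
    rng = np.random.default_rng(seed)
    n, q = Z.shape
    W = np.column_stack([np.ones(n), Z])  # column of ones for beta_0
    beta = np.zeros(q + 1)
    precision = np.ones(q)                # tau_i^{-2}
    draws = []
    for it in range(n_iter):
        # 1. latent variables given the weights: truncated normal
        means = W @ beta
        h = np.array([draw_latent(float(m), int(c), rng)
                      for m, c in zip(means, y)])
        # 2. weights given latent variables and scales: normal
        prior_prec = np.concatenate(([1.0 / intercept_var], precision * d**2))
        cov = np.linalg.inv(W.T @ W + np.diag(prior_prec))
        cov = (cov + cov.T) / 2           # remove round-off asymmetry
        beta = rng.multivariate_normal(cov @ (W.T @ h), cov)
        # 3. precisions given the weights: gamma (scale = 1 / rate)
        rate = 0.5 * (k + (d * beta[1:]) ** 2)
        precision = rng.gamma(shape=0.5 * (k + 1), scale=1.0 / rate)
        if it >= burn_in:
            draws.append(beta)
    return np.array(draws)


def predict_er_positive(draws, z_new):
    """Average the probit probability over the retained posterior draws."""
    scores = draws[:, 0] + draws[:, 1:] @ z_new
    return float(np.mean([_STD.cdf(s) for s in scores]))


# Synthetic example: 30 genes, 12 tumors, labels tied to the first gene.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 12))
y = (X[0] > 0).astype(int)
A, d, F = np.linalg.svd(X, full_matrices=False)
Z = (d[:, None] * F).T                    # scaled super genes (assumption)
draws = gibbs_probit(Z, y, d)
p_first = predict_er_positive(draws, Z[0])
print(draws.shape, p_first)
```

Averaging the probability over posterior draws, rather than plugging in a single weight vector, is what turns the model uncertainty into a graded prediction for a new tumor.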

What are the additional assumptions that came in by the prior?

• The model cannot be dominated by only a few super genes (genes!)

• The diagnosis is done based on global changes in the expression profiles influenced by many genes

• The assumptions are neutral with respect to the individual diagnosis

Which Genes Have Driven the Prediction?

Gene                           Weight

nuclear factor 3 alpha         0.853

cysteine rich heart protein    0.842

estrogen receptor              0.840

intestinal trefoil factor      0.840

x box binding protein 1        0.835

gata 3                         0.818

ps 2                           0.818

liv1                           0.812

... many many more ...         ...

Thank you!
