1 BGX Sylvia Richardson Centre for Biostatistics Imperial College, London Statistical Analysis of Gene Expression Data In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s) Tim Aitman (Hammersmith) Peter Green (Bristol) BBSRC www.bgx.org.uk Biological Atlas of Insulin Resistance


Sylvia Richardson
Centre for Biostatistics
Imperial College, London

Statistical Analysis of Gene Expression Data

In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s), Tim Aitman (Hammersmith), Peter Green (Bristol)

BBSRC
www.bgx.org.uk

Biological Atlas of Insulin Resistance


Statistical modelling and biology

• Extracting the ‘message’ from microarray data needs statistical as well as biological understanding

• Statistical modelling – in contrast to data analysis – gives a framework for formally organising assumptions about signal and noise

• Our models are structured, reflecting the data generation process:
– Bayesian hierarchical modelling approach
– Inference based on the posterior distribution of quantities of interest


What are gene expression data?

• DNA microarrays are used to measure the relative abundance of mRNA, providing information on gene expression in a particular cell type, under specific conditions

• Gene expression data (e.g. Affymetrix) result from the scanning of arrays where hybridisation between a sample and a large number of probes has taken place:
– gene expression measure for each gene

• The expression levels of tens of thousands of probes are measured on a single microarray:
– gene expression profile

• Typically, gene expression profiles are obtained for several samples, in a single experiment or in related experiments:
– gene expression data matrix


Common characteristics of data sets in transcriptomics

• High dimensional data (tens of thousands of genes) and few samples

• Many sources of variability (low signal/noise ratio):
– condition/treatment
– biological
– array manufacture
– imaging
– technical
– within/between array variation
– gene specific variability of the probes for a gene (e.g. for Affymetrix)


Analysing gene expression data

• Gene expression data can be used in several types of analysis:

-- Comparison of gene expression under different experimental conditions, or in different tissues

-- Building a predictive model for classification or prognosis based on gene expression measurements

-- Exploration of patterns in gene expression matrices

[Figure: gene expression data matrix. Rows: genes (20000); columns: samples; entries: gene expression level]


Common statistical issues

• Pre-processing and data reduction
– account for the uncertainty of the signal?
– making arrays comparable: “normalisation”

• Realistic assessment of uncertainty

• Multiplicity: control of “error rates”

• Need to borrow information

• Importance of including prior biological knowledge

Illustrate how structured statistical modelling can help to tease out signal from noise and strengthen inference in the context of differential expression studies


Outline

• Background

• Modelling uncertainty in the signal

• Bayesian hierarchical models for differential expression experiments
– posterior predictive checks
– use of the posterior distribution of parameters of interest to select genes of interest

• Further structure: mixture models


I – Modelling uncertainty in the signal: a fully Bayesian gene expression index for Affymetrix GeneChip arrays (Anne Mette Hein)

Data: Affymetrix chip

- Each gene g is represented by a probe set, consisting of a number of probe pairs (reporters) j: Perfect Match (PM) and Mismatch (MM)

Aim: formulate a model to combine PM and MM values into a new expression value for the gene -- BGX

- Base the model on biological assumptions

- Combine good features of Li and Wong (dChip) and RMA (Robust Multi-array Average, Irizarry et al)

Use a flexible Bayesian framework that will allow
• to get a measure of uncertainty of the expression
• to integrate further components of the experimental design


Single array model: Motivation

Key observations and the conclusions drawn from them:

• Observation: PMs and MMs both increase with spike-in concentration (MMs more slowly than PMs)
  Conclusion: MMs bind a fraction of the signal

• Observation: the spread of PMs increases with level
  Conclusion: multiplicative (and additive) error; transformation needed

• Observation: considerable variability in PM (and MM) response within a probe set
  Conclusion: varying reliability in gene expression estimation for different genes

• Observation: probe effects are approximately additive on the log scale
  Conclusion: estimate the gene expression measure from PMs and MMs on the log scale


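BGX single array model

Gene and probe specific S and H (g = 1, …, “1000s”; j = 1, …, “tens”):

$PM_{gj} \sim N(S_{gj} + H_{gj}, \tau^2)$
$MM_{gj} \sim N(\Phi S_{gj} + H_{gj}, \tau^2)$

Here $\Phi$ is the fraction (of signal binding to the MM probe) and $\tau^2$ is the additive background noise.

Expression measure for gene g is built from the $S_{gj}$, j = 1, …, J (20):

$\log(S_{gj} + 1) \sim TN(\mu_g, \sigma_g^2)$

$\mu_g$ is the “BGX” expression measure.

Shrinkage (exchangeability):

$\log(\sigma_g^2) \sim N(a, b^2)$, with a and $b^2$ set by “Emp. Bayes”

Non-specific hybridisation, array wide distribution:

$\log(H_{gj} + 1) \sim TN(\lambda, \eta^2)$, j = 1, …, J (20), g = 1, …, G

Remaining priors: “vague”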
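To make the structure of the single array model concrete, the following is a minimal simulation sketch of its generative side only (it is not the BGX software and does no inference). It assumes that TN denotes a normal truncated at zero, and every numerical value (phi, tau, lam, eta, a, b, the range of mu) and every function name is hypothetical, chosen only for illustration.

```python
import numpy as np


def sample_truncated_normal(rng, mean, sd, size):
    """Draw from N(mean, sd^2) truncated to [0, inf) by simple rejection.

    Assumption: TN in the model is a normal truncated at zero.
    """
    mean = np.broadcast_to(mean, size).astype(float)
    sd = np.broadcast_to(sd, size).astype(float)
    out = rng.normal(mean, sd)
    bad = out < 0
    while bad.any():
        out[bad] = rng.normal(mean[bad], sd[bad])
        bad = out < 0
    return out


def simulate_bgx_array(n_genes=200, n_probes=20, phi=0.1, tau=10.0,
                       lam=4.0, eta=0.5, a=-1.0, b=0.5, seed=0):
    """Simulate PM and MM intensities for one array (illustrative values)."""
    rng = np.random.default_rng(seed)
    shape = (n_genes, n_probes)

    # Gene-level signal mu_g; the uniform range is a hypothetical choice.
    mu = rng.uniform(2.0, 8.0, size=n_genes)
    # Exchangeable variances: log(sigma_g^2) ~ N(a, b^2).
    sigma = np.sqrt(np.exp(rng.normal(a, b, size=n_genes)))

    # log(S_gj + 1) ~ TN(mu_g, sigma_g^2), log(H_gj + 1) ~ TN(lam, eta^2).
    log_s = sample_truncated_normal(rng, mu[:, None], sigma[:, None], shape)
    log_h = sample_truncated_normal(rng, lam, eta, shape)
    S = np.exp(log_s) - 1.0
    H = np.exp(log_h) - 1.0

    # PM_gj ~ N(S + H, tau^2), MM_gj ~ N(phi * S + H, tau^2).
    PM = rng.normal(S + H, tau)
    MM = rng.normal(phi * S + H, tau)
    return mu, PM, MM


mu, PM, MM = simulate_bgx_array()
print(PM.shape, MM.shape)
```

In this sketch PM − MM has expectation (1 − Φ) S, which is why the MMs carry information about both the non-specific hybridisation H and a fraction of the signal.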


BGX model: inference (Hein et al, Biostatistics, 2005)

For each gene g: obtain a distribution for the signal $\mu_g$ (log scale)

[Figure: PM and MM data and the resulting BGX gene expression distribution]

• Implemented in WinBUGS and C++ (MCMC)

• All parameters estimated jointly in a full Bayesian framework

• Posterior distributions of parameters (and functions) obtained

The single array model can be extended to estimate signal from several biological replicates, as well as differential signal between conditions


Single array model: examples of posterior distributions of BGX indices

Each curve represents a gene

Examples with data:
o: log(PMgj − MMgj), j = 1, …, J (plotted at 0 if not defined)
Mean ± 1 SD


Comparison with other expression measures

11 genes spiked in at 13 (increasing) concentrations

BGX index μg increases with concentration, except for gene 7 (incorrectly spiked-in?)

Indication of a smooth and sustained increase over a wider range of concentrations


95% credibility intervals for Bayesian gene expression index

11 spike-in genes at 13 different concentrations

Note how the variability is substantially larger for low expression levels

Each colour corresponds to a different spike-in gene
Gene 7: broken red line


II – Modelling differential expression

[Diagram: for Condition 1 and Condition 2, start with given point estimates of expression; each condition has a hierarchical model of replicate variability and array effect; the two conditions are linked by the differential expression parameter, analysed either through its posterior distribution (flat prior) or through mixture modelling for classification]


Data sets and biological question

Biological question

• Understand the mechanisms of insulin resistance

• Using animal models where key genes are knocked out

A) Cd36 knock-out data set (MAS 5): 3 wildtype (“normal”) mice compared with 3 mice with Cd36 knocked out (12000 genes on each array)

B) IRS2 knock-out data set (RMA): 8 wildtype (“normal”) mice compared with 8 mice with the IRS2 gene knocked out (22700 genes on each array)


Exploratory analysis showing array effect

Mouse data set A

[Figure: panels for Condition 1 (3 replicates) and Condition 2 (3 replicates), with spline curves shown]

Needs ‘normalisation’


Bayesian hierarchical model for differential expression (Lewin et al, Biometrics, 2005)

Data: $y_{gcr}$ = log gene expression for gene g, replicate r, condition c

$\alpha_g$ = gene effect

$d_g$ = differential effect for gene g between the 2 conditions

$\beta_{r(g)c}$ = array effect, modelled as a smooth (spline) function of $\alpha_g$

$\sigma_{gc}^2$ = gene specific variance

• 1st level

$y_{g1r} \sim N(\alpha_g - \tfrac{1}{2} d_g + \beta_{r(g)1}, \sigma_{g1}^2)$
$y_{g2r} \sim N(\alpha_g + \tfrac{1}{2} d_g + \beta_{r(g)2}, \sigma_{g2}^2)$

$\sum_r \beta_{r(g)c} = 0$, with $\beta_{r(g)c}$ a function of $\alpha_g$ with parameters {c, d}

• 2nd level

“Flat” priors for $\alpha_g$, $d_g$, {c, d}

$\sigma_{gc}^2 \sim \text{lognormal}(a_c, b_c)$ (exchangeable variances)
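As an illustration of the first level of this model, here is a small simulation sketch. It is not the authors' code: the array effect is a linear function of the gene effect (a stand-in for the spline), $b_c$ is read as the standard deviation of $\log \sigma_{gc}^2$, and all numerical values and names are hypothetical.

```python
import numpy as np


def simulate_de_data(n_genes=1000, n_rep=3, frac_de=0.1,
                     a=(-2.0, -2.0), b=(0.5, 0.5), seed=1):
    """Simulate y[g, c, r] from the first level of the model (sketch)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(7.0, 1.5, size=n_genes)          # gene effects
    is_de = rng.random(n_genes) < frac_de
    d = np.where(is_de, rng.normal(0.0, 1.0, size=n_genes), 0.0)

    y = np.empty((n_genes, 2, n_rep))
    for c, sign in enumerate((-0.5, 0.5)):
        # Exchangeable variances: sigma_gc^2 ~ lognormal(a_c, b_c).
        sigma2 = rng.lognormal(mean=a[c], sigma=b[c], size=n_genes)
        # Array effect: a smooth function of alpha_g for each replicate.
        # A linear function is used here instead of a spline (assumption).
        slope = rng.normal(0.0, 0.05, size=n_rep)
        beta = slope[None, :] * (alpha[:, None] - alpha.mean())
        beta -= beta.mean(axis=1, keepdims=True)        # sum over r is 0
        noise = rng.normal(0.0, np.sqrt(sigma2)[:, None],
                           size=(n_genes, n_rep))
        y[:, c, :] = alpha[:, None] + sign * d[:, None] + beta + noise
    return y, alpha, d


y, alpha, d = simulate_de_data()
# Because the array effects sum to zero over replicates, the difference of
# condition means is centred on d_g.
d_hat = y[:, 1, :].mean(axis=1) - y[:, 0, :].mean(axis=1)
print(np.abs(d_hat - d).mean())
```

The sum-to-zero constraint is what keeps the array effect from being confounded with the gene and differential effects in this sketch.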


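Directed Acyclic Graph for the differential expression model (no array effect represented)

[Figure: DAG linking the hyperparameters (a1, b1) and (a2, b2), the variances $\sigma_{g1}^2$ and $\sigma_{g2}^2$, the gene effect $\alpha_g$ and the differential effect $d_g$ to the data summaries ½(yg1. + yg2.), ½(yg1. − yg2.), $s_{g1}^2$ and $s_{g2}^2$]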


Differential expression model

Joint modelling of array effects and differential expression:

• Performs normalisation simultaneously with estimation

• Gives fewer false positives

How to check some of the modelling assumptions? Posterior predictive checks

How to use the posterior distribution of dg to select genes of interest? Decision rules


Bayesian model checking

• Check assumptions on gene variances, e.g. exchangeable variances, what distribution?

• Predict the sample variance $s_g^{2,\mathrm{new}}$ (a chosen checking function) from the model specification (not using the data for this)

• Compare the predicted $s_g^{2,\mathrm{new}}$ with the observed $s_g^{2,\mathrm{obs}}$:
‘Bayesian p-value’: $\mathrm{Prob}(s_g^{2,\mathrm{new}} > s_g^{2,\mathrm{obs}})$

• Distribution of p-values is approximately Uniform if the model is ‘true’ (Marshall and Spiegelhalter, 2003)

• Easily implemented in an MCMC algorithm
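The check described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the draws of the hyperparameters (a, b) are fabricated stand-ins for MCMC output, b is taken as the standard deviation of $\log \sigma^2$, and the new sample variance is generated as $\sigma^{2,\mathrm{new}} \chi^2_{R-1} / (R-1)$ for R replicates under normal errors.

```python
import numpy as np


def predictive_pvalues(s2_obs, a_draws, b_draws, n_rep, rng):
    """Bayesian p-values Prob(s2_new > s2_obs) for each gene (sketch)."""
    # New variance from the exchangeable prior, one per posterior draw.
    sigma2_new = rng.lognormal(a_draws, b_draws)
    # New sample variance given sigma2_new (normal errors assumed).
    chi2 = rng.chisquare(n_rep - 1, size=a_draws.shape)
    s2_new = sigma2_new * chi2 / (n_rep - 1)
    # Fraction of predictive draws exceeding each observed variance.
    s2_sorted = np.sort(s2_new)
    n_leq = np.searchsorted(s2_sorted, s2_obs, side="right")
    return 1.0 - n_leq / s2_new.size


rng = np.random.default_rng(2)
n_genes, n_rep = 2000, 8

# Toy data simulated from the assumed model, so the check should pass.
sigma2 = rng.lognormal(-2.0, 0.5, size=n_genes)
y = rng.normal(0.0, np.sqrt(sigma2)[:, None], size=(n_genes, n_rep))
s2_obs = y.var(axis=1, ddof=1)

# Stand-in for posterior draws of (a, b); real draws would come from MCMC.
a_draws = rng.normal(-2.0, 0.02, size=5000)
b_draws = np.abs(rng.normal(0.5, 0.02, size=5000))

p = predictive_pvalues(s2_obs, a_draws, b_draws, n_rep, rng)
print(np.histogram(p, bins=10, range=(0.0, 1.0))[0])
```

When the variance model is adequate the histogram counts should be roughly equal; a pile-up near 0 or 1 would point to a misfit of the assumed variance distribution.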


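Bayesian model checking

[Figure: the DAG of the differential expression model, with nodes (a1, b1), (a2, b2), $\sigma_{g1}^2$, $\sigma_{g2}^2$, $\alpha_g$, $d_g$, ½(yg1. + yg2.), ½(yg1. − yg2.), $s_{g1}^2$ and $s_{g2}^2$, extended with a predicted variance $\sigma_{g1}^{2,\mathrm{new}}$ and a predicted sample variance $s_{g1}^{2,\mathrm{new}}$ that is compared with the observed $s_{g1}^{2,\mathrm{obs}}$]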


Mouse data set A


Use of tail probabilities for selecting gene lists

$d_g$: log fold change

$t_g = d_g / (\sigma_{g1}^2 / n_1 + \sigma_{g2}^2 / n_2)^{1/2}$: standardised difference ($n_1$ and $n_2$ are the numbers of replicates in each condition)

-- Obtain the posterior distribution of $d_g$ and/or $t_g$

-- Compute directly the posterior probability of genes satisfying a criterion X of interest, e.g. $d_g$ > threshold or $t_g$ > percentile:

$p_{g,X} = \mathrm{Prob}(g \text{ of “interest”} \mid \text{Criterion } X, \text{data})$

-- Compute the distributions of ranks, …

Interesting statistical issues on the relative merits and properties of different selection rules based on tail probabilities
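Given posterior draws of $d_g$, $\sigma_{g1}^2$ and $\sigma_{g2}^2$, the tail probabilities above are simple Monte Carlo averages. The sketch below assumes arrays of shape (draws, genes); the draws are fabricated for three hypothetical genes, the thresholds are illustrative, and two-sided criteria (absolute values) are used.

```python
import numpy as np


def tail_probabilities(d_draws, s1_draws, s2_draws, n1, n2,
                       d_thresh=1.0, t_thresh=2.0):
    """Posterior tail probabilities per gene from MCMC draws (sketch).

    All arrays have shape (n_draws, n_genes); s1_draws and s2_draws hold
    draws of the variances sigma_g1^2 and sigma_g2^2.
    """
    t_draws = d_draws / np.sqrt(s1_draws / n1 + s2_draws / n2)
    p_d = (np.abs(d_draws) > d_thresh).mean(axis=0)   # Prob(|d_g| > threshold)
    p_t = (np.abs(t_draws) > t_thresh).mean(axis=0)   # Prob(|t_g| > 2)
    return p_d, p_t


rng = np.random.default_rng(3)
n_draws = 4000
# Fabricated posterior draws: gene 0 unchanged, gene 1 up, gene 2 down.
centres = np.array([0.0, 2.0, -2.0])
d_draws = rng.normal(centres, 0.25, size=(n_draws, 3))
s1_draws = rng.lognormal(np.log(0.1), 0.2, size=(n_draws, 3))
s2_draws = rng.lognormal(np.log(0.1), 0.2, size=(n_draws, 3))

p_d, p_t = tail_probabilities(d_draws, s1_draws, s2_draws, n1=3, n2=3)
ranking = np.argsort(-p_t)   # genes ordered by decreasing tail probability
print(p_d, p_t, ranking)
```

Because $t_g$ is computed draw by draw, the uncertainty in the variances is carried through to the tail probability rather than being plugged in as a point estimate.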


Using the posterior distribution of tg (standardised difference) (Natalia Bochkina)

Bayesian T test:

• Compute Probability( | tg | > 2 | data)

• Order genes

• Select genes such that Probability( | tg | > 2 | data) > cut-off (in blue)

By comparison, additional genes selected by a standard T test with p value < 5% are in red

Data set B


Credibility intervals for ranks

100 genes with lowest rank (most under/over expressed)

Low rank, high uncertainty

Low rank, low uncertainty


III – Mixture and Bayesian estimation of False Discovery Rates (FDR)

• Mixture models can be used to perform a model based classification

• Mixture models can be considered at the level of the data (e.g. clustering time profiles) or for the underlying parameters

• Mixture models can be used to detect differentially expressed genes if a model of the alternative is specified

• One benefit is that an estimate of the uncertainty of the classification, the False Discovery Rate, is obtained simultaneously


Mixture framework for differential expression

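$y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \varepsilon_{g1r}$, r = 1, …, R1
$y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \varepsilon_{g2r}$, r = 1, …, R2

(We assume that the data have been pre-normalised)

$\mathrm{Var}(\varepsilon_{gcr}) = \sigma_{gc}^2 \sim IG(a_c, b_c)$

Explicit modelling of the alternative:

$d_g \sim \pi_0 \delta_0 + \pi_1 G(-x \mid 1.5, \eta_1) + \pi_2 G(x \mid 1.5, \eta_2)$

The point mass $\pi_0 \delta_0$ corresponds to H0, the two Gamma components to H1

Dirichlet distribution for $(\pi_0, \pi_1, \pi_2)$

Exp(1) hyper prior for $\eta_1$ and $\eta_2$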
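To see what the three-component prior on $d_g$ looks like, the following sketch draws from it. It assumes that the second Gamma parameter is a rate (the slide does not say whether it is a rate or a scale), and the weights and rates passed in the example are arbitrary illustrative values, not estimates.

```python
import numpy as np


def sample_mixture_prior(n_genes, pi, eta1, eta2, rng):
    """Draw d_g from pi0*delta_0 + pi1*G(-x|1.5, eta1) + pi2*G(x|1.5, eta2).

    Assumption: eta1 and eta2 are rate parameters of the Gamma components.
    Returns the draws and the component labels (0: null, 1: under-, 2: over-
    expressed).
    """
    comp = rng.choice(3, size=n_genes, p=pi)
    d = np.zeros(n_genes)                 # null component: point mass at 0
    neg = comp == 1
    pos = comp == 2
    d[neg] = -rng.gamma(shape=1.5, scale=1.0 / eta1, size=neg.sum())
    d[pos] = rng.gamma(shape=1.5, scale=1.0 / eta2, size=pos.sum())
    return d, comp


rng = np.random.default_rng(4)
d, comp = sample_mixture_prior(2500, [0.8, 0.1, 0.1], eta1=1.0, eta2=1.0,
                               rng=rng)
print((comp == 0).mean(), d[comp == 1].mean(), d[comp == 2].mean())
```

With shape 1.5 the Gamma density is zero at the origin, so the alternative components place little mass on very small differences, which helps separate them from the null point mass.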


Mixture for classification of DE genes

• Calculate the posterior probability for any gene of belonging to the unmodified component: $p_{g0} \mid \text{data}$

• Classify using a cut-off on $p_{g0}$, i.e. declare gene g to be DE if $1 - p_{g0} > p_{cut}$

Bayes rule corresponds to $p_{cut} = 0.5$

• Bayesian estimate of FDR (and FNR) for any list (Newton et al 2003, Broët et al 2004):

$\text{Bayes FDR}(\text{list}) \mid \text{data} = \frac{1}{\mathrm{card}(\text{list})} \sum_{g \in \text{list}} p_{g0}$
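The classification rule and the Bayes FDR estimate are short enough to write out directly. In this sketch the posterior null probabilities are made-up numbers; the FNR line is an assumption by analogy (the average of $1 - p_{g0}$ over the genes not in the list), since the slide names the FNR without giving its formula.

```python
import numpy as np


def bayes_fdr(p0, p_cut=0.5):
    """Classify genes and estimate FDR and FNR from posterior null probs."""
    p0 = np.asarray(p0, dtype=float)
    selected = (1.0 - p0) > p_cut                 # declared DE
    n_sel = int(selected.sum())
    n_not = p0.size - n_sel
    # Bayes FDR: average posterior null probability over the list.
    fdr = float(p0[selected].mean()) if n_sel > 0 else 0.0
    # FNR by analogy (assumption): average of 1 - p_g0 over the rest.
    fnr = float((1.0 - p0[~selected]).mean()) if n_not > 0 else 0.0
    return selected, fdr, fnr


# Hypothetical posterior probabilities p_g0 for five genes.
p0 = np.array([0.01, 0.2, 0.6, 0.95, 0.4])
selected, fdr, fnr = bayes_fdr(p0, p_cut=0.5)
print(selected, round(fdr, 4), round(fnr, 4))
```

Here three genes are selected (those with $p_{g0}$ of 0.01, 0.2 and 0.4), so the estimated FDR is their mean; varying `p_cut` trades the estimated FDR against the estimated FNR.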


Performance of the mixture prior

• Joint estimation of all the mixture parameters (including π0) using MCMC algorithms avoids plugging in values that are influential on the classification

• Estimation of all parameters combines information from biological replicates and between condition contrasts

• Performance has been tested on simulated data sets


[Figure: plot of the true difference in each case. Panels: π0 = 0.8, 500 DE; π0 = 0.9, 250 DE; π0 = 0.99, 25 DE; π0 = 0.95, 125 DE; π0 = 0.80, 500 DE]


Examples of simulated data for each case


Results averaged over 50 replications

[Figure: one panel per simulated case, labelled with the average estimate: Av. $\hat{\pi}_0$ = 0.99, 0.80, 0.90, 0.78 and 0.95]

Good estimates of π0 = Prob(null) for each case


Comparison of estimated (dotted lines) and observed (full) FDR (black) and FNR (red) rates as cut-off for declaring DE is varied

Bayesian mixture:
• good estimates of FDR and FNR
• easy way to choose an efficient classification rule


In summary

Integrated gene expression analysis

• Uses the natural hierarchical structure of the data (e.g. probes within genes within replicate arrays within condition) to synthesize, borrow information and provide realistic quantification of uncertainty

• Posterior distributions can be exploited for inference with few replicates: choice of decision rules

• Framework where biological prior information, e.g. on the structure of the probes or on chromosomal location, can be incorporated

• Model based classification, e.g. through mixtures, provides interpretable output and a structure to deal with multiplicity

General framework for investigating other questions


Many interesting questions in the analysis of gene expression data

-- Comparison of gene expression under different experimental conditions, or in different tissues

-- Integrated gene expression analysis

-- Investigate high dimensional classification rules (prediction with large number of variables) and “large p small n” regression problems (shrinkage or variable selection)

-- Building a predictive model for classification or prognosis based on gene expression measurements, finding “signatures”


Association of gene expression with prognosis

Investigate properties of high dimensional classification rules (prediction with large number of variables) and “large p small n” regression problems (shrinkage or variable selection)

Expression plot of 115 prognostic genes comprising The Ovarian Cancer Prognostic Profile


Other questions …

-- Comparison of gene expression under different experimental conditions, or in different tissues

-- Building a predictive model for classification or prognosis based on gene expression measurements, finding “signatures”

-- Integrated gene expression analysis

-- Investigate high dimensional classification rules (prediction with large number of variables) and “large p small n” regression problems (shrinkage or variable selection)

-- Perform unsupervised model based clustering

-- Estimate graphical models

-- Exploration of patterns and association networks in gene expression matrices


-- Comparison of gene expression under different experimental conditions, or in different tissues

-- Classification of gene expression profiles and association of gene expression with other factors, e.g. prognosis (prediction problem)

Exploration of patterns in gene expression matrices

Perform unsupervised model based clustering (e.g. semi-parametric using basis functions, mixtures or DP processes)

[Figure: gene expression matrix (rows: genes; columns: samples). Development of central nervous systems in rats (9 time points)]


BBSRC Exploiting Genomics grant

Colleagues

Natalia Bochkina, Anne Mette Hein, Alex Lewin (Imperial College)
Peter Green (Bristol University)
Philippe Broët (INSERM, Paris)

Papers and technical reports: www.bgx.org.uk/

Thanks