BGX 1 Sylvia Richardson Centre for Biostatistics Imperial College, London Statistical Analysis of Gene Expression Data In collaboration with Natalia Bochkina,

1BGX

Sylvia RichardsonCentre for Biostatistics

Imperial College, London

Statistical Analysis of Gene Expression Data

In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Mary’s)

Tim Aitman (Hammersmith)Peter Green (Bristol)

BBSRCwww.bgx.org.uk

Biological Atlasof Insulin Resistance

2BGX

Statistical modelling and biology

• Extracting the ‘message’ from microarray data needs statistical as well as biological understanding

• Statistical modelling – in contrast to data analysis – gives a framework for formally organising assumptions about signal and noise

• Our models are structured, reflecting data generation process: – Bayesian hierarchical modelling approach– Inference based on posterior distribution of quantities of

interest

3BGX

What are gene expression data ?

• DNA Microarrays are used to measure the relative abundance of mRNA, providing information on gene expression in a particular cell type, under specific conditions

• Gene expression data (e.g. Affymetrix) results from the scanning of arrays where hybridisation between a sample and a large number of probes has taken place:– gene expression measure for each gene

• The expression level of ten of thousands of probes are measured on a single microarray:– gene expression profile

• Typically, gene expression profiles are obtained for several samples, in a single or related experiments:– gene expression data matrix

* ** *

*

4BGX

Common characteristics of data sets in transcriptomic

• High dimensional data (ten of thousands of genes) and few samples

• Many sources of variability (low signal/noise ratio)

• condition/treatment• biological • array manufacture• imaging• technical

• within/between array variation

• gene specific variability of the probes for a gene (e.g. for Affymetrix)

5BGX

• Gene expression data can be used in several types of analysis:

-- Comparison of gene expression

under different experimental

conditions, or in different tissues

-- Building a predictive model for

classification or prognosis based

on gene expression measurements

-- Exploration of patterns in gene

expression matrices

Analysing gene expression data

Samples

Ge

nes

(2

000

0)Gene expression level

Gene expression data matrix

6BGX

Common statistical issues

• Pre-processing and data reduction– account for the uncertainty of the signal?– making arrays comparable: “normalisation”

• Realistic assessment of uncertainty• Multiplicity: control of “error rates”• Need to borrow information• Importance to include prior biological knowledge

Illustrate how structured statistical modelling can help to tease out signal from noise and strengthen inference in the context of differential expression studies

7BGX

Outline

• Background• Modelling uncertainty in the signal• Bayesian hierarchical models for

differential expression experiments– posterior predictive checks– use of posterior distribution of parameters

of interest to select genes of interest

• Further structure: mixture models

8BGX

Data: Affymetrix chip:

- Each gene g is represented by a probe set,

consisting of a number of probe pairs (reporters) j

Perfect match (PM) and Mismatch (MM)

Aim: Formulate a model to combine PM and MM values into a

new expression value for the gene -- BGX

- Base the model on biological assumptions

- Combine good features of Li and Wong (dChip) and

RMA (Robust Multichip Analysis, Irrizarry et al)

I – Modelling uncertainty in the signal:A fully Bayesian Gene expression index for

Affymetrix Gene Chip arrays (Anne Mette Hein)

Use a flexible Bayesian framework that will allow• to get a measure of uncertainty of the expression• to integrate further components of the experimental design

9BGX

Single array model: Motivation

Key observations: Conclusions:

• PMs and MMs both increase with spike-in concentration (MMs slower than PMs)

MMs bind fraction of signal

• Spread of PMs increase with level

Multiplicative (and additive) error; transformation needed

• Considerable variability in PM (and MM) response within a probe set

Varying reliability in gene expression estimation for different genes

• Probe effects approximately additive on log-scale

Estimate gene expression measure from PMs and MMs on log scale

10BGX

BGX single array model

Remaining priors: “vague”

fraction

log(Hgj+1) TN(λ, η2)

Non-specific hybridisation: array wide distribution:

j=1,…,J (20), g=1,…,G

Shrinkage: exchangeability

log(σg2)N(a,

b2)

“Emp. Bayes”

log(Sgj+1) TN(μg,σg2)

Expression measure for gene g is built from: j=1,…,J (20)

“BGX” expression measure

PMgj N( Sgj + Hgj , τ2)

MMgj N(Φ Sgj + Hgj , τ2) Background noise, additive

Gene and probe specific S and H (g:1,…,”1000s”, j=1,…,”tens”)

11BGX

BGX model: inference Hein et al, Biostatistics, 2005

For each gene g: obtain a distribution for signal (log scale)

g:

BGX: gene expression

PMMM

• Implemented in WinBugs and C++ (MCMC)• All parameters estimated jointly in full Bayesian framework• Posterior distributions of parameters (and functions) obtained

The single array model can be extended to estimate signalfrom several biological replicates, as well as differentialsignal between conditions

12BGX

Single array model:examples of posterior distributions of BGX indices

Each curve represents a gene

Examples with data:

o: log(PMgj-MMgj)

j=1,…,J

(at 0 if not defined)

Mean 1SD

13BGX

Comparison with other expression measures

11 genes spiked in at 13 (increasing) concentrations

BGX index μg increases with concentration …..… except for gene 7 (incorrectly spiked-in??)

Indication of smooth & sustained increase over a wider range ofconcentrations

14BGX

95% credibility intervals for Bayesian gene expression index

11 spike-in genes at 13 different concentrations

Note how the variabilityis substantially larger for low expression level

Each colour corresponds to a different spike-in geneGene 7 : broken red line

15BGX

II – Modelling differential expression

Differential expression parameter

Condition 1 Condition 2

Posterior distribution (flat prior)

Mixture modelling for classification

Hierarchical model of replicatevariability and array effect

Hierarchical model of replicatevariability and array effect

Start with given pointestimates of expression

16BGX

Data Sets and Biological question

Biological Question

• Understand the mechanisms of insulin resistance• Using animal models where key genes are knockout

A) Cd36 Knock out Data set (MAS 5) 3 wildtype (“normal”) mice compared with 3 mice with Cd36 knocked out ( 12000 genes on each array )

B) IRS2 Knock out Data set (RMA) 8 wildtype (“normal”) mice compared with 8 mice with IRS2 gene knocked out ( 22700 genes on each array)

17BGX

Condition 1 (3 replicates)

Condition 2 (3 replicates)

Needs ‘normalisation’

Spline curves shown

Exploratory analysis showing array effect

Mouse dataset A

18BGX

Data: ygcr = log gene expression gene g, replicate r, condition c

g = gene effect

dg = differential effect for gene g between 2 conditions

r(g)c = array effect – modelled as a smooth (spline) function of g

gc2 = gene specific variance

• 1st level yg1r N(g – ½ dg + r(g)1 , g12)

yg2r N(g + ½ dg + r(g)2 , g22)

Σrr(g)c = 0, r(g)c = function of g , parameters {c,d}

• 2nd level “Flat” priors for g , dg, {c,d}

gc2 lognormal (ac, bc)

Bayesian hierarchical model for differential expression (Lewin et al, Biometrics, 2005)

Exchangeablevariances

19BGX

Directed Acyclic Graph for the differential expression model (no array effect represented)

a1, b1

½(yg1.+ yg2.)

dg 2g1 s2

g1

2g2 s2

g2g

a2, b2

½(yg1.- yg2.)

20BGX

Differential expression model

Joint modelling of array effects and differential expression:

• Performs normalisation simultaneously with estimation

• Gives fewer false positives

How to check some of the modelling assumptions?Posterior predictive checks

How to use the posterior distribution of dg to select genes of interest ?

Decision rules

21BGX

• Check assumptions on gene variances, e.g. exchangeable variances, what distribution ?

• Predict sample variance sg2 new (a chosen checking function)

from the model specification (not using the data for this)

• Compare predicted sg2 new with observed sg

2 obs

‘Bayesian p-value’: Prob( sg2 new > sg

2 obs )

• Distribution of p-values approx Uniform if model is ‘true’

(Marshall and Spiegelhalter, 2003)• Easily implemented in MCMC algorithm

Bayesian Model Checking

22BGX

Bayesian model checking

a1, b1

½(yg1.+ yg2.)

dg 2g1 s2

g1

2g2 s2

g2g

a2, b2

½(yg1.- yg2.)

2g1

new

s2g1

new

obs

23BGX

MouseData set A

24BGX

Use of tail probabilities for selecting gene lists

dg : log fold change

tg = dg / (σ2 g1 / n1 + σ2 g2 / n2 )½ standardised difference

(n1 and n2 # replicates in each condition)

-- Obtain the posterior distribution of dg and/or tg

-- Compute directly posterior probability of genes satisfying criterion X of interest, e.g. dg > threshold or tg

> percentile

pg,X = Prob( g of “interest” | Criterion X, data)

-- Compute the distributions of ranks, …. Interesting statistical issues on relative merits and propertiesof different selection rules based on tail probabilities

25BGX

• Compute

Probability ( | tg | > 2 | data)

Bayesian T test • Order genes

• Select genes such that

Using the posterior distribution of tg (standardised difference) (Natalia Bochkina)

Probability ( | tg | > 2 | data) > cut-off ( in blue)

By comparison, additional genes selected by a standard

T test with p value < 5% are in red)

Data set B

26BGX

Credibility intervals for ranks

100 genes with lowest rank (most under/over expressed)

Low rank, high uncertainty

Low rank, low uncertainty

27BGX

III – Mixture and Bayesian estimation of False Discovery Rates (FDR)

• Mixture models can be used to perform a model based classification

• Mixture models can be considered at the level of the data (e.g. clustering time profiles) or for the underlying parameters

• Mixture models can be used to detect differentially expressed genes if a model of the alternative is specified

• One benefit is that an estimate of the uncertainty of the classification: the False Discovery Rate is simultaneously obtained

28BGX

Mixture framework for differential expression

yg1r = g - ½ dg + g1r , r = 1, … R1

yg2r = g + ½ dg + g2r , r = 1, … R2

(We assume that the data has been pre normalised)

Var(gcr ) = σ2gc ~ IG(ac, bc)

dg ~ 0δ0 + 1G (-x|1.5, 1) + 2G (x|1.5, 2)

H0 H1

Dirichlet distribution for (0, 1, 2)

Exp(1) hyper prior for 1 and 2

Explicit modellingof the alternative

29BGX

Mixture for classification of DE genes

• Calculate the posterior probability for any gene of belonging to the unmodified component : pg0 | data

• Classify using a cut-off on pg0 :

i.e. declare gene is DE if 1- pg0 > pcut

Bayes rule corresponds to pcut = 0.5• Bayesian estimate of FDR (and FNR) for any list

(Newton et al 2003, Broët et al 2004) :

Bayes FDR (list) | data = 1/card(list) Σg list pg0

30BGX

Performance of the mixture prior

• Joint estimation of all the mixture parameters (including 0) using MCMC algorithms avoids plugging-in of values that are influential on the classification

• Estimation of all parameters combines information from biological replicates and between condition contrasts

• Performance has been tested on simulated data sets

31BGX

Plot of truedifference ineach case

π0 = 0.8, 500 DE π0 = 0.9, 250 DE

π0 = 0.99, 25 DEπ0 = 0.95, 125 DEπ0 = 0.80, 500 DE

32BGX

Examples ofsimulated datafor each case

33BGX

Results averaged over 50 replications

Av. π0 = 0.99

Av. π0 = 0.80 Av. π0 = 0.90

Av. π0 = 0.78 Av. π0 = 0.95

^ ^

^ ^ ^

Good estimatesof 0 = Prob(null)for each case

34BGX

Comparison of estimated (dotted lines) and observed (full) FDR (black) and FNR (red) rates as cut-off for declaring DE is varied

Bayesian mixture: • good estimates ofFDR and FNR• easy way to choose efficientclassification rule

35BGX

In summary

Integrated gene expression analysis • Uses the natural hierarchical structure of the data: e.g.

probes within genes within replicate arrays within condition to synthesize, borrow information and provide realistic quantification of uncertainty

• Posterior distributions can be exploited for inference with few replicates: choice of decision rules

• Framework where biological prior information, e.g. on the structure of the probes or on chromosomic location, can be incorporated

• Model based classification, e.g. through mixtures, provides interpretable output and a structure to deal with multiplicity

General framework for investigating other questions

36BGX

Many interesting questions in the analysis of gene expression data

-- Comparison of gene expression under different experimental conditions, or in different tissues

-- Integrated gene expression analysis

-- Investigate high dimensional classification rules (prediction with large number of variables) and “large p small n” regression problems (shrinkage or variable selection)

-- Building a predictive model for classification or prognosis based on gene expression measurements, finding “signatures”

37BGX

Association of gene expression with prognosis

Investigate properties of high dimensional classification rules (prediction with large number of variables) and “large p small n” regression problems (shrinkage or variable selection)

Expression plot of 115 prognostic genes comprising The Ovarian Cancer Prognostic Profile

38BGX


-- Building a predictive model for

classification or prognosis based on

gene expression measurements,

finding “signatures

Other questions ….

-- Integrated gene expression analysis

-- Investigate high dimensional classification rules (prediction with large number of variables) and “large p small n” regression problems (shrinkage or variable selection)

-- Perform unsupervisedmodel based clustering-- Estimate graphical models

-- Exploration of patterns and association networks in gene expression matrices

39BGX


-- Classification of gene expression

profiles and association of gene

expression with other factors, e.g.

prognosis (prediction problem)

Exploration of patterns in gene expression matrices

Perform unsupervisedmodel based clustering (e.g. semi-parametric using basis functions, mixtures or DP processes)

Development of centralnervous systems in rats(9 time points)

samples

gen

es

40BGX

BBSRC Exploiting Genomics grant

Colleagues

Natalia Bochkina, Anne Mette Hein, Alex Lewin (Imperial College)Peter Green (Bristol University)Philippe Broët (INSERM, Paris)

Papers and technical reports: www.bgx.org.uk/

Thanks

Documents

BGX 1 Sylvia Richardson Centre for Biostatistics Imperial College, London Statistical Analysis of Gene Expression Data In collaboration with Natalia Bochkina,