
Statistical tests for differential expression in cDNA microarray experiments (2): ANOVA

Xiangqin Cui and Gary A. Churchill, Genome Biology 2003, 4:210

Presented by M. Carme Ruíz de Villa and Alex Sánchez
Departament d’Estadística, U.B.

Introduction

Remember …

We want to measure how gene expression changes under different conditions.

• Only two conditions and an adequate number of replicates: t-tests & extensions
• More than two conditions / more than one factor: several approaches
  • Analysis of Variance (ANOVA) (Churchill et al.)
  • Linear Models (Smyth, Speed, …)

Sources of variation (1)

We want to determine when the variation due to gene expression is significant, but…

There are multiple sources of variation in measurements besides just gene expression.

We want to know when the variation in measurements is caused by varying levels of gene expression versus other factors.

Sources of variation (2)

Some sources of variation in the measurements in microarray experiments are:

• Array effects
• Dye effects
• Variety effects
• Gene effects
• Combinations

Relative expression values

• If there are more than two conditions, we cannot simply compute ratios.
• ANOVA modelling yields estimates of the relative expression for each gene in each sample.

The ANOVA model is not based on log ratios. Rather, it is applied directly to intensity data. However, the difference between two relative expression values can be interpreted as the mean log ratio for comparing two samples.
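A short worked identity may make the last statement concrete. It is a sketch under two assumptions that the slide does not state: intensities are analysed on the log scale, and samples A and B are measured together on the same n arrays. The symbols $I_{Aj}$, $I_{Bj}$, $\hat r_A$ and $\hat r_B$ are introduced here only for illustration.

```latex
% \hat r_A, \hat r_B: estimated relative expression of one gene in samples A and B
% I_{Aj}, I_{Bj}: intensities of that gene for samples A and B on array j
\hat r_A - \hat r_B
  = \frac{1}{n}\sum_{j=1}^{n} \log I_{Aj} - \frac{1}{n}\sum_{j=1}^{n} \log I_{Bj}
  = \frac{1}{n}\sum_{j=1}^{n} \log\frac{I_{Aj}}{I_{Bj}}
```

In this simplified setting the difference of the two estimates is exactly the average of the per-array log ratios.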

Technical & biological replicates

If inference is being made on the basis of biological replicates and there is also technical replication, the technical replicates should be averaged to yield a single value for each independent biological unit.
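The averaging step can be sketched in base R as follows. The data frame, its column names (`condition`, `bio_unit`, `log_int`) and the numbers are all hypothetical; they only illustrate collapsing technical replicates to one value per biological unit before any test is run.

```r
# Hypothetical long-format data: two biological units per condition,
# each measured twice (two technical replicates per unit).
expr <- data.frame(
  condition = c("A", "A", "A", "A", "B", "B", "B", "B"),
  bio_unit  = c("a1", "a1", "a2", "a2", "b1", "b1", "b2", "b2"),
  log_int   = c(8.1, 8.3, 7.9, 8.1, 9.0, 9.4, 9.1, 9.3)
)

# Average the technical replicates so that each biological unit
# contributes exactly one value to the downstream analysis.
bio_means <- aggregate(log_int ~ condition + bio_unit, data = expr, FUN = mean)
print(bio_means)
```

The resulting `bio_means` has one row per biological unit, so the residual degrees of freedom of a later ANOVA reflect biological, not technical, replication.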

Review of ANOVA models

One-way ANOVA

Suppose you have a model for each measurement in your experiment:

$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$

• $y_{ij}$ is the jth measurement for the ith group.
• $\mu$: overall mean effect (constant)
• $\alpha_i$: ith group effect (constant)
• $\varepsilon_{ij}$: experimental error term, $\sim N(0, \sigma^2)$

Therefore, observations from group i are distributed with mean $\mu + \alpha_i$ and variance $\sigma^2$.

Hypothesis Testing

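$$\underbrace{y_{ij} - \bar{y}_{..}}_{\text{overall variability}} = \underbrace{(y_{ij} - \bar{y}_{i.})}_{\text{within group variability}} + \underbrace{(\bar{y}_{i.} - \bar{y}_{..})}_{\text{between group variability}}$$

Intuition: if between group variability is large compared to within group variability, then the differences between the group means are significant.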

Sum of Squares

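$$\text{Total sum of squares} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{..})^2$$

$$\text{Within sum of squares} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2$$

$$\text{Between sum of squares} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\bar{y}_{i.} - \bar{y}_{..})^2$$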
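The three sums of squares are linked by a standard identity, sketched here because the F test relies on it. The abbreviations SST, SSW and SSB (total, within, between) are introduced only for this derivation.

```latex
\begin{aligned}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{n_i}
       \left[(y_{ij}-\bar{y}_{i.}) + (\bar{y}_{i.}-\bar{y}_{..})\right]^2 \\
    &= SSW + SSB
       + 2\sum_{i=1}^{k}(\bar{y}_{i.}-\bar{y}_{..})
          \sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i.}) \\
    &= SSW + SSB
       \qquad\text{because } \sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i.}) = 0
       \text{ for every group } i .
\end{aligned}
```

Since the summand of the between sum of squares does not depend on j, it can equivalently be written as $\sum_{i=1}^{k} n_i (\bar{y}_{i.} - \bar{y}_{..})^2$.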

Mean Sum of Squares

With k groups and n observations in total:

• Between MS = Between SS/(k-1)
• Within MS = Within SS/(n-k)
• F = Between MS / Within MS

It is summarized in the ANOVA table (Example 1).
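As a minimal sketch, the quantities above can be computed by hand in base R. The values are the normalized Example 1 measurements listed later in these slides; the variable names are illustrative.

```r
# Example 1 expression values, grouped by cancer stage.
y <- list(
  I   = c(100, 90, 105, 83, 78),
  II  = c(95, 93, 79, 85, 90),
  III = c(70, 72, 81, 74, 75)
)

k <- length(y)            # number of groups
n <- sum(lengths(y))      # total number of observations
grand_mean  <- mean(unlist(y))
group_means <- sapply(y, mean)

# Between SS: squared deviations of the group means from the grand mean,
# weighted by group size.
ss_between <- sum(lengths(y) * (group_means - grand_mean)^2)

# Within SS: squared deviations of each observation from its own group mean.
ss_within <- sum(sapply(y, function(g) sum((g - mean(g))^2)))

ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (n - k)
f_stat     <- ms_between / ms_within
p_value    <- pf(f_stat, k - 1, n - k, lower.tail = FALSE)

print(round(c(SSB = ss_between, SSW = ss_within, F = f_stat, p = p_value), 3))
```

The rounded values agree with the Example 1 ANOVA table (810.133, 747.200, F = 6.505, Sig. = .012).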

Multiple Factor ANOVA

$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$

The model can be extended by adding:

• More factors ($\alpha$, $\beta$, …)
• Interactions between them ($\alpha\beta$, …)
• Other …

This is used to model the different sources of variation appearing in microarray experiments

Experiment 1: Latin Square

[Figure: 2 x 2 Latin square layout: liver, muscle / muscle, liver]

Random effects models

If the k factor levels can be considered a random sample from a population of levels, we have a random effects ANOVA model:

$$Y_{ij} = \mu + A_i + e_{ij}$$

• $\mu$: overall mean
• $A_i$: a random variable instead of a constant
• $e_{ij}$: experimental error
• $E(A_i) = 0$, $E(e_{ij}) = 0$, $\mathrm{var}(A_i) = \sigma_A^2$, $\mathrm{var}(e_{ij}) = \sigma^2$
• $A_i$ and $e_{ij}$ are independent
• $\mathrm{var}(Y_{ij}) = \sigma_A^2 + \sigma^2$
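The variance of an observation follows in one step from the independence assumption. The covariance line is an added consequence, assuming in addition that the errors are mutually independent.

```latex
\begin{aligned}
\operatorname{var}(Y_{ij})
  &= \operatorname{var}(A_i) + \operatorname{var}(e_{ij})
     + 2\operatorname{cov}(A_i, e_{ij})
   = \sigma_A^2 + \sigma^2 , \\
\operatorname{cov}(Y_{ij}, Y_{ij'})
  &= \operatorname{var}(A_i) = \sigma_A^2
  \qquad (j \neq j') .
\end{aligned}
```

Two measurements that share the same random level are therefore correlated, which is the practical difference from the fixed effects model.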

Where to find more…

• Draghici, S. (2003). ANOVA chapter (7). Data analysis tools for microarrays. Wiley.
• Pavlidis, P. (2003). Using ANOVA for gene selection from microarray studies of the nervous system. http://microarray.cpmc.columbia.edu/pavlidis/doc/reprints/anova-methods.pdf

ANOVA Models for Microarray Data

Kerr & Churchill’s model

• $y_{ijkg}$: expression measurement from the ith array, jth dye, kth variety, and gth gene.
• $\mu$: average expression over all spots.
• $A_i$: effect of the ith array.
• $D_j$: effect of the jth dye.
• $V_k$: effect of the kth variety (= treatment, sample, …).
• $G_g$: effect of the gth gene.
• $(AG)_{ig}$: effect of the ith array and gth gene.
• $(VG)_{kg}$: effect of the kth variety and gth gene.
• $\varepsilon_{ijkg}$: independent and identically distributed error terms.
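The model equation itself did not survive in this transcript. Assuming the model contains exactly the terms defined in the list above, combined additively, it would read as follows. This is a reconstruction from the listed terms, not a copy of the slide.

```latex
y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + \varepsilon_{ijkg}
```

Further interactions, such as the DV and DG terms interpreted in the following slides, would enter the sum in the same way.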

Interpreting main effects

• A: differences in fluorescent signal from array to array (e.g. if arrays are probed under inconsistent conditions that increase or reduce hybridization of labeled cDNA).
• D: differences between the two dye fluorescent labels (one dye may consistently be brighter than the other).
• G: differences in fluorescence for equally expressed genes.
• V: differences in expression level between different varieties (samples, tumour types, …).

Interpreting interactions

• DV: arises if, for a particular variety, labelling is carried out in separate runs of the process. Differences between the runs can produce pools of cDNA of varying concentration or quality.
• AG: (spot effect) spots for a given gene on the different arrays vary in the amount of cDNA available for hybridization.
• DG: arises if there are differences between the dyes that are gene-specific.
• VG: reflects differences in expression for particular variety and gene combinations that are not explained by the average effects of these varieties and genes. THIS IS THE QUANTITY OF INTEREST!

Normalization

The A, D and V terms effectively normalize the data, thus the normalization process is integrated with the data analysis.

This approach has several benefits (?):

• The normalization is based on a clearly stated set of assumptions.
• It systematically estimates normalization parameters based on all the data.
• The model can be generalized to the situation where genes are spotted multiple times on each array rather

Statistically Significant Effects

Array, Dye, Variety & Gene effects
• Goal: to estimate their value.
• There is no need to assess their significance.
• Sometimes they do not appear (gene-level model).

Array x Gene, Variety x Gene effects
• May or may not be present.
• Goal: to assess their significance.
  • Mean effect = 0 if fixed
  • Effect variance = 0 if random

Test statistics: The 3 F’s

Hypothesis testing involves the comparison of two models.

In this setting we consider a null model of no differential expression (all VG = 0) and an alternative model with differential expression among the conditions (some VG are not equal to zero).

F statistics are computed on a gene-by-gene basis based on the residual sums of squares from fitting each of these models.

$$F_1 = \frac{(rss_0 - rss_1)/(df_0 - df_1)}{rss_1/df_1}$$

$$F_2 = \frac{(rss_0 - rss_1)/(df_0 - df_1)}{(rss_1/df_1 + \sigma^2_{pool})/2}$$

$$F_3 = \frac{(rss_0 - rss_1)/(df_0 - df_1)}{\sigma^2_{pool}}$$
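The three statistics share a numerator and differ only in the variance estimate used in the denominator: the gene's own residual variance (F1), a variance pooled across genes (F3), or the average of the two (F2). A minimal sketch in R follows. The function name `three_f`, its argument names and the pooled variance of 80 are hypothetical; the residual sums of squares are the total and within-group values from the Example 1 ANOVA table.

```r
# rss0, df0: residual sum of squares and residual df of the null model
# rss1, df1: the same quantities for the alternative model
# sigma2_pool: error variance estimate pooled across genes
three_f <- function(rss0, df0, rss1, df1, sigma2_pool) {
  numerator <- (rss0 - rss1) / (df0 - df1)   # shared by all three statistics
  gene_var  <- rss1 / df1                    # gene-specific error variance
  c(F1 = numerator / gene_var,
    F2 = numerator / ((gene_var + sigma2_pool) / 2),
    F3 = numerator / sigma2_pool)
}

# Single-gene illustration; sigma2_pool = 80 is an invented value.
f_values <- three_f(rss0 = 1557.333, df0 = 14,
                    rss1 = 747.2, df1 = 12,
                    sigma2_pool = 80)
print(round(f_values, 3))
```

Because the F2 denominator averages the other two denominators, F2 always lies between F1 and F3.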

Example 1

• A gene that is believed to be related to ovarian cancer is investigated.
• The cancer is sub-classified into 3 categories (stages): I, II, III-IV.
• 15 samples, 5 per stage, are available.
• They are labelled with 3 colors and hybridized on a 4-channel cDNA array (1 channel empty).
• (A seemingly more reasonable procedure: double dye-swap reference design.)

Example 1. Normalized Data

Array     I    II   III
1       100    95    70
2        90    93    72
3       105    79    81
4        83    85    74
5        78    90    75

Example 1: ANOVA table (1)

ANOVA: EXPRLEVE

Source            Sum of squares   df   Mean square       F   Sig.
Between groups           810.133    2       405.067   6.505   .012
Within groups            747.200   12        62.267
Total                   1557.333   14

[Figure: box plots of EXPRLEVE (y-axis, 60 to 110) by CANCTYP (1, 2, 3), with N = 5 per group]

If arrays are homogeneous, the appropriate model is 1 factor ANOVA.
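The one-factor table can be reproduced with base R. The sketch below assumes that the normalized Example 1 values are the input and borrows the names `exprleve` and `canctyp` from the labels in the SPSS output.

```r
# Normalized Example 1 data: 5 arrays, one value per stage on each array.
expr <- data.frame(
  exprleve = c(100, 90, 105, 83, 78,    # stage I
               95, 93, 79, 85, 90,      # stage II
               70, 72, 81, 74, 75),     # stage III
  canctyp  = factor(rep(c("I", "II", "III"), each = 5))
)

# One-factor ANOVA: expression level explained by cancer stage only.
tab <- anova(lm(exprleve ~ canctyp, data = expr))
print(tab)
```

The `canctyp` row corresponds to "Between groups" and the `Residuals` row to "Within groups" in the table above.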

Example (1): Blocking

Source of variation   Sum of squares   df   Mean square            F      P-value   F critical
Rows                            18.4    4           4.6   0.08440367   0.98276859   6.38823394
Columns                          490    1           490   8.99082569   0.04000226   7.70864972
Error                            218    4          54.5
Total                          726.4    9

If arrays are not homogeneous, the appropriate model is 2 factor ANOVA (1 new block factor for arrays).
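The blocked analysis can be sketched in base R by adding the array as a second factor. The table's degrees of freedom (1 for columns, 4 for rows) imply two groups on five arrays. Its sums of squares are reproduced when the columns are stages II and III of the Example 1 data. That is an inference from the arithmetic, not something the slides state.

```r
# Stages II and III of the Example 1 data, with the array as a block factor.
blocked <- data.frame(
  exprleve = c(95, 93, 79, 85, 90,     # stage II, arrays 1 to 5
               70, 72, 81, 74, 75),    # stage III, arrays 1 to 5
  stage    = factor(rep(c("II", "III"), each = 5)),
  array    = factor(rep(1:5, times = 2))
)

# Two-factor ANOVA without interaction: array (rows) is the block,
# stage (columns) is the effect of interest.
tab <- anova(lm(exprleve ~ array + stage, data = blocked))
print(tab)
```

Because the design is balanced, the order of `array` and `stage` in the formula does not change the sums of squares.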

Example 2: CAMDA kidney data
ftp://ftp.camda.duke.edu/CAMDA02_DATASETS/papers/README_normal.html

• 6 mouse kidney samples (suppose 6 different treatments)
• Compared to a common reference in a double reference design
• Dye swap
• Replicate arrays

2.1. The ANOVA model

Work only at the gene level: no main effects (A, D, V, G) as defined.

$$Y_{ijk} = DG_i + AG_j + VG_k + \varepsilon_{ijk}$$

i = 1, 2 (dyes); j = 1, 2 (arrays); k = 1, …, 6 (samples)

Example 3: A 2 factor design Diet X Strain

3.2. Design

3.3. The ANOVA model

$$Y_{ijklm} = DG_i + AG_j + \mathrm{Strain}_l + \mathrm{Diet}_m + (\mathrm{Strain{:}Diet})_{lm} + VG_k + \varepsilon_{ijklm}$$

i = 1, 2 (dyes); j = 1, 2 (arrays); k = 1, …, 12 (samples); l = 1, …, 3 (strains); m = 1, 2 (diets)

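3.4. Sample R code (1)

    library(maanova)  # assumption: the functions below come from the R/maanova package
    data(paigen)
    paigen <- createData(rawdata, 2)
    # full model, including the Strain:Diet interaction
    model.full.fix <- makeModel(data = paigen,
                                formula = ~DG + AG + SG + Strain + Diet + Strain:Diet)
    anova.full.fix <- fitmaanova(paigen, model.full.fix)
    # reduced model without the interaction
    model.noint.fix <- makeModel(data = paigen,
                                 formula = ~DG + AG + SG + Strain + Diet)
    anova.noint.fix <- fitmaanova(paigen, model.noint.fix)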

3.4. Sample R code (2)

    # permutation tests
    # test for interaction effect
    test.int.fix <- ftest(paigen, model.full.fix, model.noint.fix, n.perm = 500)
    idx.int.fix <- volcano(anova.full.fix, test.int.fix, title = "Int. test")
