Statistics for Differential Expression Naomi Altman Oct. 06

Some things to consider before we start

Model

Replication

Correlation / Independence

Treatments (conditions, varieties ...)


Model

Using a statistical model sheds light on the analysis by quantifying features such as condition effects, sources of biological and experimental variation, etc.

Models can be written down before the data are collected, which clarifies how the data should be collected and analyzed.

When an estimate of variability is available, the model can be used to determine appropriate sample size.
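As a rough illustration of the sample-size point, here is a minimal sketch assuming a two-sided comparison of two independent groups and a normal approximation. The function name `samples_per_group` and the example values of `sigma` (within-condition standard deviation of log2 intensity) and `delta` (log2 difference worth detecting) are hypothetical, not taken from the slides.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate replicates per group for a two-sided two-sample test.

    Normal approximation:
        n = 2 * (z_{1 - alpha/2} + z_{power})^2 * sigma^2 / delta^2
    sigma: assumed within-condition SD; delta: difference to detect.
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2
    return ceil(n)


# Hypothetical values: SD of 0.5 on the log2 scale, 2-fold change (delta = 1).
n_needed = samples_per_group(sigma=0.5, delta=1.0)
print(n_needed)
```

Because this uses the normal rather than the t distribution, it tends to be slightly optimistic for very small n, so it is best read as a lower bound.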

Replication

Statistical methods compare the condition means to the variation within condition.

The within condition variation can only be estimated by replication of the condition.

Often technical replication (multiple probes in a probeset or multiple hybridizations of the same sample) is treated as if it had biological meaning, but it is not true replication.


Correlation / Independence

Observations are correlated because:

they are taken on the same individual

they are measured on the same array

they are processed in the same replicate

Most simple analysis methods assume independence and hence must be modified to handle correlated data.
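To see why correlated data need a modified analysis, one standard worked case may help. It assumes (for illustration only, this setup is not on the slides) n observations with common variance \sigma^2 and a common pairwise correlation \rho:

```latex
\mathrm{Var}(\bar{Y})
  = \frac{1}{n^2}\Bigl(n\sigma^2 + n(n-1)\rho\sigma^2\Bigr)
  = \frac{\sigma^2}{n}\,\bigl(1 + (n-1)\rho\bigr)
```

With \rho = 0 this is the familiar \sigma^2/n. With n = 4 and \rho = 0.5 the variance of the mean is 2.5 times larger, so a method that assumes independence would understate the uncertainty.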

Treatments:

what is interesting?

what is the "action"?

how many can we really handle?

2 treatments

We have already considered the simple case of 2 treatments using t-tests (or permutation, bootstrap or Wilcoxon versions of the tests)

Which tests do we use and when are they appropriate?

Tests for 2 treatments

Two-sample "t-tests" (and similar tests) require independent samples within and between the 2 treatments

i.e.

1. All RNA samples are biologically independent.

2. Each sample is hybridized to a different array:

single-channel arrays such as Affy, Nimblegen, CodeLink

2-channel arrays with a reference sample in the same channel on each array (use M as the data)
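The two-sample case can be sketched as follows. This is a minimal pooled-variance version, assuming equal spreads in the two treatments. The function name `two_sample_t` and the log2 intensities are hypothetical illustration values.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    # Pool the two within-treatment sample variances.
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    se = sqrt(pooled_var * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / se, df


# Hypothetical log2 intensities for one gene: one array per independent sample.
control = [7.1, 6.8, 7.4, 7.0]
treated = [8.0, 8.3, 7.7, 8.4]
t_stat, df = two_sample_t(treated, control)
print(round(t_stat, 3), df)
```

The statistic would then be compared with a t distribution on `df` degrees of freedom, or with a permutation distribution if normality is in doubt.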

Tests for 2 treatments

The paired "t-test" (and similar tests) applies when:

1. Each array includes both treatments.

2. Different arrays come from different biological samples.

3. There is no dye effect or technical dye-swaps have been done and the technical replicates have been averaged.
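Under these three conditions the paired test is a one-sample t-test on the per-array log ratios M. A minimal sketch, with hypothetical M values and a hypothetical function name `paired_t`:

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(m_values):
    """Paired t statistic computed from per-array log ratios M.

    Each M is log2(treatment 1) - log2(treatment 2) on one array,
    and each array comes from a different biological sample.
    """
    n = len(m_values)
    se = stdev(m_values) / sqrt(n)
    return mean(m_values) / se, n - 1


# Hypothetical M values for one gene on 5 arrays.
m_values = [0.9, 1.2, 0.7, 1.1, 1.1]
t_stat, df = paired_t(m_values)
print(round(t_stat, 3), df)
```

If dye-swap technical replicates were run, the M values passed in would be the averages over each dye-swap pair, as condition 3 describes.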

Tests for 3 or more treatments with independent samples

Requires independent samples.

(We cannot extend the paired sample idea, because we do not have 3 or more channels on the array.)

H0: all the population means are equal

HA: At least one of the means differs


examples:

Cancers: several cancer types with 1 sample per patient, several patients with each cancer

Genotypes: several genotypes of mice with 1 sample per mouse, several mice per genotype

Drug: different doses applied to different individuals with 1 sample per individual, several individuals per dose


The F-test assumes that the spreads are all approximately equal and that the populations are approximately normally distributed. The other versions of the test (permutation, bootstrap, rank) do not require normality.

The test statistic is the ratio of the variance among the sample means to the variance within the samples.


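If there are T treatments, with $n_i$ observations from the ith treatment, then

$$N = n_1 + \dots + n_T$$

$$MSE = \frac{\sum_{i=1}^{T} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}{N - T}$$

$$MSTr = \frac{\sum_{i=1}^{T} n_i (\bar{y}_i - \bar{y})^2}{T - 1}$$

where $\bar{y}_i$ is the sample mean of the ith treatment and $\bar{y}$ is the overall mean.

$$F^* = MSTr / MSE$$

has an F-distribution with T - 1 and N - T degrees of freedom when the null is true.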
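The MSE, MSTr and F* quantities above can be computed directly. The sketch below is a minimal illustration with made-up data; the function name `one_way_f` is hypothetical, and MSTr uses the usual $n_i$ weighting of each treatment mean.

```python
from statistics import mean


def one_way_f(groups):
    """Return (MSTr, MSE, F*) for a list of samples, one list per treatment."""
    n_treatments = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]
    # Between-treatment sum of squares, weighted by group size.
    ss_treat = sum(
        len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means)
    )
    # Within-treatment (error) sum of squares.
    ss_error = sum(
        (y - gm) ** 2 for g, gm in zip(groups, group_means) for y in g
    )
    ms_treat = ss_treat / (n_treatments - 1)
    ms_error = ss_error / (n_total - n_treatments)
    return ms_treat, ms_error, ms_treat / ms_error


# Made-up data: 3 treatments, 3 independent samples each.
groups = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0], [5.0, 7.0, 9.0]]
ms_treat, ms_error, f_star = one_way_f(groups)
print(ms_treat, ms_error, f_star)
```

Here F* would be referred to an F distribution with 2 and 6 degrees of freedom.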

One-way ANOVA

    summary(aov(iris$Sepal.Length~iris$Species))
                  Df Sum Sq Mean Sq F value    Pr(>F)
    iris$Species   2 63.212  31.606  119.26 < 2.2e-16 ***
    Residuals    147 38.956   0.265
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Permutation, bootstrap and rank tests (Kruskal-Wallis test) are readily extended to this situation.
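A permutation version of the one-way test can be sketched as below: the treatment labels are shuffled many times and the F statistic is recomputed each time. This is one common way to set it up, not necessarily the procedure used in the course; the function names, the number of permutations and the data are assumptions for illustration.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F statistic (MSTr / MSE)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [mean(g) for g in groups]
    ss_treat = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_error = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ss_treat / (len(groups) - 1)) / (ss_error / (n_total - len(groups)))


def permutation_p_value(groups, n_perm=2000, seed=0):
    """Share of label shufflings whose F is at least the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [y for g in groups for y in g]
    sizes = [len(g) for g in groups]
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Cut the shuffled values back into groups of the original sizes.
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        # Small relative tolerance guards against floating-point ties.
        if f_statistic(shuffled) >= observed * (1 - 1e-12):
            count += 1
    # Add 1 so the observed labelling counts as one permutation.
    return (count + 1) / (n_perm + 1)


# Made-up data with clearly separated treatments.
groups = [[1.0, 1.2, 0.9, 1.1], [2.0, 2.1, 1.9, 2.2], [3.0, 3.1, 2.9, 3.2]]
p_value = permutation_p_value(groups)
print(p_value)
```

Shuffling labels is only valid when the observations are exchangeable under the null hypothesis, which is the independence requirement stated above.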

More complex situations

Many microarray experiments do not fall into this simple situation because of correlation in the data due to:

biological correlation (same cell-line, individual ...)

using 2-channel microarrays

having multiple probes for the same gene

Also, we may have multifactor studies:

e.g. 2 genotypes, control and exposed, time course

For this we use Linear Mixed Models

Linear Models

It is useful to consider a model for the observed data (on a single probe or probeset):

Y = log2(intensity)

$$Y = \mu + \alpha + \beta + \gamma + \dots + \text{error}$$

$\mu$ is the mean over all the conditions and arrays

error is the random error that is a mixture of measurement error and biological variability

the other terms are systematic deviations from the mean, due to the treatments, array effects, lab effects, etc.

Linear Models

e.g. Comparison of liver and kidney tissue in male and female mice on 2-channel arrays with 3 replicate spots per gene

5 males and 5 females

Y is the log2(intensity) in one channel for one spot.

We need to remember that dye might have an effect.

Linear Models

Fixed effects are the conditions of interest in the experiment:

Random effects are conditions which explain some of the noise in the model:

How does the model help us?

Generally, differential expression analysis is looking for differences between treatments that are larger than expected by chance.

The model helps us to understand the meaning of "by chance".

The model also allows us to design our experiment to minimize the probability of chance observation of large differences.

How Does the Model Help Us?

Mean log2 intensity

              Male    Female
    Liver      5.6       6.3
    Kidney     9.3      10.7

difference between male and female in liver: 6.3 - 5.6 = 0.7

difference between liver and kidney in males: 9.3 - 5.6 = 3.7

What is larger than expected by chance?

Suppose the arrays are:

5 arrays - male and female liver

5 arrays - male and female kidney

Suppose the arrays are:

5 arrays - male liver and kidney

5 arrays - female liver and kidney

The simplest model

2 treatments on 2-channel arrays with independent biological samples, no dye effect and no dye-swap

All of the data are independent.

M = log2(Red) - log2(Green)

$$M_i = \mu + \text{error}_i$$

No differential expression implies

$$H_0: \mu = 0$$

The F-test for this model is just $t^2$ from the paired t-test
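The link between F and t can be written out in a few lines. In this sketch n is the number of arrays, and \bar{M} and s are the sample mean and standard deviation of the M_i; these symbols are introduced here for the derivation.

```latex
% Paired t statistic on the log ratios
t = \frac{\bar{M}}{s/\sqrt{n}}

% F compares the model M_i = \mu + error_i with the null model M_i = error_i.
% Numerator: reduction in the sum of squares from fitting \mu, on 1 df
\sum_{i=1}^{n} M_i^2 - \sum_{i=1}^{n} (M_i - \bar{M})^2 = n\bar{M}^2

% Denominator: residual mean square on n - 1 df
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (M_i - \bar{M})^2

F = \frac{n\bar{M}^2}{s^2} = t^2
```

So the F-test on 1 and n - 1 degrees of freedom gives the same p-value as the two-sided paired t-test on n - 1 degrees of freedom.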

One-Way "ANOVA"

$$Y_{ij} = \mu + \alpha_i + \text{error}_{ij}$$

$\mu$ is the mean expression for the gene over the entire experiment.

$\alpha_i$ is the deviation of the mean of the ith condition from the overall mean, $\sum_i \alpha_i = 0$

The error variance should not depend on the condition.

More Complicated Models with Fixed Effects Only

$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{error}_{ijk}$$

We may have 2 or more factors, e.g.

• genotype and drug dose

• genotype and time point

• treatment and dye


$\alpha_i$ is the deviation of the mean of the ith level of factor A from the overall mean, $\sum_i \alpha_i = 0$

$\beta_j$ is the deviation of the mean of the jth level of factor B from the overall mean, $\sum_j \beta_j = 0$

$(\alpha\beta)_{ij}$ is the deviation of the mean of the ijth combination of levels from $\mu + \alpha_i + \beta_j$, $\sum_i (\alpha\beta)_{ij} = \sum_j (\alpha\beta)_{ij} = 0$

The error variance should not depend on the condition.
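The sum-to-zero definitions can be checked numerically on the tissue by sex table of mean log2 intensities shown earlier (liver 5.6 and 6.3, kidney 9.3 and 10.7). The sketch below assumes a balanced design, so that unweighted averages of the cell means are appropriate; the function name `two_way_effects` is hypothetical.

```python
def two_way_effects(cell_means):
    """Split balanced cell means into overall mean, main effects, interaction.

    cell_means maps (level of A, level of B) -> mean response.
    """
    a_levels = sorted({a for a, _ in cell_means})
    b_levels = sorted({b for _, b in cell_means})
    mu = sum(cell_means.values()) / len(cell_means)
    # Main effects: marginal mean minus overall mean (each sums to zero).
    alpha = {
        a: sum(cell_means[a, b] for b in b_levels) / len(b_levels) - mu
        for a in a_levels
    }
    beta = {
        b: sum(cell_means[a, b] for a in a_levels) / len(a_levels) - mu
        for b in b_levels
    }
    # Interaction: what is left after the additive part mu + alpha + beta.
    interaction = {
        (a, b): cell_means[a, b] - (mu + alpha[a] + beta[b])
        for a in a_levels
        for b in b_levels
    }
    return mu, alpha, beta, interaction


cell_means = {
    ("liver", "male"): 5.6,
    ("liver", "female"): 6.3,
    ("kidney", "male"): 9.3,
    ("kidney", "female"): 10.7,
}
mu, alpha, beta, interaction = two_way_effects(cell_means)
print(mu, alpha, beta, interaction)
```

A nonzero interaction term here means the male versus female difference is not the same in liver (0.7) as in kidney (1.4). Whether that is larger than expected by chance depends on the error variance, which cell means alone cannot supply.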


[Figure: plots of cell means (y-axis "means") against the levels of factor A (1 to 3) and factor B (1 to 4), shown for two cases: "Interaction among factors" and "No interaction among factors".]


$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{error}_{ijk}$$

Normal Theory ANOVA is readily extended to this situation and more factors can be added.

Permutation and bootstrap methods begin to get complicated, but can still be applied.

Rank-based methods are available for 2 factors, but get complicated

Replicates that are not Independent

We often have replicates that are NOT independent:

multiple spots for the same gene on an array

multiple arrays from the same RNA

multiple RNAs from the same tissue

multiple samples from the same individual

multiple labs

multiple "batches"

Replicates that are not Independent

e.g. A dye-swap experiment in which the dye-swaps are technical replicates (1 dye-swap pair per sample) and there are 2 spots per gene on the array with 2 or more treatments

$$Y_{ijkst} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_s + \tau_t + \text{error}_{ijkst}$$

$\alpha_i$ is the deviation of the mean of the ith treatment from the overall mean, $\sum_i \alpha_i = 0$

$\beta_j$ is the deviation of the mean of the jth level of dye from the overall mean, $\beta_r + \beta_g = 0$

$\gamma_k$ is the array effect, which induces a correlation between the 2 spots on the same array, $\gamma_k \sim N(0, \sigma_\gamma^2)$

$\delta_s$ is the spot effect, which induces a correlation between the 2 channels at the same spot, $\delta_s \sim N(0, \sigma_\delta^2)$

$\tau_t$ is the biological sample effect, which induces a correlation between the 2 arrays in the dye-swap pair, $\tau_t \sim N(0, \sigma_\tau^2)$


The lack of independence can be modeled as a random effect.

This is handled in a straightforward manner by ANOVA modeling but ...

all the other methods get MUCH more complicated.

Much of the available software does not handle this very well.


In some cases, we can return to fixed effects models by averaging (but this loses power).

e.g. technical replicates can be averaged and the averages can be used as if they were the primary data

This is much better than discarding technical replicates, but not as good as modeling them.
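Averaging technical replicates before the analysis can be sketched as follows. The record layout, a (biological sample id, value) pair per measurement, and the function name are assumptions made for this illustration.

```python
from collections import defaultdict
from statistics import mean


def average_technical_replicates(records):
    """Collapse technical replicates to one value per biological sample.

    records: iterable of (biological_sample_id, value) pairs, where
    repeated ids are technical replicates of the same sample.
    """
    by_sample = defaultdict(list)
    for sample_id, value in records:
        by_sample[sample_id].append(value)
    return {sample_id: mean(values) for sample_id, values in by_sample.items()}


# Hypothetical M values: 2 technical replicates for each of 3 mice.
records = [
    ("mouse1", 1.0), ("mouse1", 1.2),
    ("mouse2", 0.6), ("mouse2", 0.8),
    ("mouse3", 1.1), ("mouse3", 0.9),
]
averaged = average_technical_replicates(records)
print(averaged)
```

The three averages, not the six raw values, would then be passed to a fixed-effects test, so the degrees of freedom reflect the number of biological samples.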

Replicates that are not Independent Example

2 conditions on a 2-channel array with replicate spots for each gene, and a dye-swap technical replicate.

e.g. 2 genotypes of mouse

3 mice per genotype

1 mouse from each genotype on each array

2 arrays from each pair of mice

4 replicate spots per array

We will simplify by modeling M, rather than each channel.

[Figure: design diagram showing the three mouse pairs A1-B1, A2-B2 and A3-B3.]


effects:

• mouse pair

• dye (or equivalently genotype)

• array

pair and array are random

dye is fixed

we need to keep track of whether M is R-G or A-B (genotype difference)

We do not need to include spot as we are using M


data for 1 mouse pair (m):

2 arrays, with 4 spots per array

$M_{mdas}$

m is the mouse pair identifier (1,2,3)

d is the dye for genotype A (r,g)

a is the array (1-6 or 1,2 within m)

s is the spot (1-4 within array)


$$M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$$

effects:

m mouse pair (random)

d dye (or equivalently genotype in R)

a array (random)

t = 1,2,3,4 for the spots

The hypothesis of no genotype effect is $\mu = 0$.

Notice that we have to be careful about the sign of M. If we code the effects in the way it is usually done for ANOVA, M = A-B, not R-G.


Our estimate of $\mu$ is just the sample mean of M over all the spots.

But our estimate of the SE of ave(M) is not the usual standard error of a sample average, because of the random mouse pair and array effects.


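$$M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$$

3 mouse pairs, 6 arrays, 24 observations/gene

$$\bar{M} = \mu + \bar{m} + \bar{a} + \overline{\text{error}}$$

$$\mathrm{Var}(\bar{M}) = \mathrm{Var}(\bar{m}) + \mathrm{Var}(\bar{a}) + \mathrm{Var}(\overline{\text{error}})$$

$$\mathrm{Var}(\bar{m}) = \mathrm{Var}(m)/3$$

$$\mathrm{Var}(\bar{a}) = \mathrm{Var}(a)/6$$

$$\mathrm{Var}(\overline{\text{error}}) = \mathrm{Var}(\text{error})/24$$

$$t^* = \frac{\bar{M}}{\sqrt{s_m^2/3 + s_a^2/6 + s_{error}^2/24}}$$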

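What if we ignore the dependence?

We would use:

$$t^* = \frac{\bar{M}}{\sqrt{s_M^2/24}}$$

where

$$E(s_M^2) = \frac{16}{23}\sigma_m^2 + \frac{20}{23}\sigma_a^2 + \sigma_{error}^2$$

Compare with

$$t^* = \frac{\bar{M}}{\sqrt{s_m^2/3 + s_a^2/6 + s_{error}^2/24}}$$

The denominator of the ordinary t-test is much too small.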
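To get a feel for how much too small, the two denominators can be compared for some assumed variance components. The values below are hypothetical. The coefficients 16/23 and 20/23 come from taking the expectation of the ordinary sample variance of the 24 observations under the mixed model, which works out to 24 Var(M) - 24 Var(ave(M)), divided by 23.

```python
from math import sqrt


def se_mixed(var_m, var_a, var_error, pairs=3, arrays=6, spots=24):
    """SE of ave(M) with mouse-pair and array effects treated as random."""
    return sqrt(var_m / pairs + var_a / arrays + var_error / spots)


def se_naive_expected(var_m, var_a, var_error, spots=24):
    """Approximate SE an ordinary t-test would use for this 3 x 2 x 4 design.

    Uses E(s_M^2) = 16/23 var_m + 20/23 var_a + var_error, then divides
    by the 24 observations as if they were independent.
    """
    expected_s2 = 16 / 23 * var_m + 20 / 23 * var_a + var_error
    return sqrt(expected_s2 / spots)


# Hypothetical variance components on the log2 scale.
var_m, var_a, var_error = 0.04, 0.02, 0.01
correct = se_mixed(var_m, var_a, var_error)
naive = se_naive_expected(var_m, var_a, var_error)
print(round(correct, 4), round(naive, 4), round(correct / naive, 2))
```

With these assumed components the naive denominator is well under half the correct one, so the ordinary t statistic would be inflated by the same factor.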
