Statistics for Differential Expression
Naomi Altman
Oct. 06
Some things to consider before we start
Model
Replication
Correlation / Independence
Treatments (conditions, varieties ...)
Model
Using a statistical model sheds light on the analysis by quantifying features such as condition effects, sources of biological and experimental variation, etc.
Models can be written down before the data are collected, which clarifies how the data should be collected and analyzed.
When an estimate of variability is available, the model can be used to determine appropriate sample size.
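As a rough illustration of how a variability estimate feeds into sample size, the sketch below uses the standard normal-approximation formula for comparing two condition means, n = 2((z_{1-alpha/2} + z_{power}) sigma / delta)^2 per condition. The function name and the input values are hypothetical, and the normal approximation tends to understate n when n is small, so treat the result as a lower bound rather than a design recommendation.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate replicates per condition for a two-sided, two-sample test.

    sigma: within-condition standard deviation (same scale as delta)
    delta: difference between condition means we want to detect
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


# Hypothetical inputs: SD of 0.5 on the log2 scale, and a difference of 1
# (a two-fold change) that we would like to detect.
n_needed = samples_per_group(sigma=0.5, delta=1.0)
```

Halving the detectable difference quadruples the required number of replicates, which is why the variability estimate matters so much at the design stage.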
Replication
Statistical methods compare the differences among the condition means to the variation within condition.
The within condition variation can only be estimated by replication of the condition.
Often technical replicates (multiple probes in a probeset or multiple hybridizations of the same sample) are treated as if they had biological meaning, but this is not true replication.
Correlation / Independence
Observations are correlated because:
they are taken on the same individual
they are measured on the same array
they are processed in the same replicate
Most simple analysis methods assume independence and hence must be modified to handle correlated data.
Treatments:
what is interesting?
what is the "action"?
how many can we really handle?
2 treatments
We have already considered the simple case of 2 treatments using t-tests (or permutation, bootstrap or Wilcoxon versions of the tests).
Which tests do we use and when are they appropriate?
Tests for 2 treatments
Two-sample "t-tests" (and similar tests) require independent samples within and between the 2 treatments, i.e.
1. All RNA samples are biologically independent.
2. Each sample is hybridized to a different array:
• single channel arrays such as Affy, Nimblegen, CodeLink
• 2 channel arrays with a reference sample in the same channel on each array (use M as the data)
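A minimal sketch of the two-sample comparison for a single gene, assuming independent samples as required above. It computes the equal-variance t statistic and an exact permutation p-value by enumerating every relabelling of the pooled data. The data values and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Equal-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))


def exact_permutation_p(x, y):
    """Two-sided p-value over all ways of splitting the pooled data."""
    pooled = list(x) + list(y)
    observed = abs(pooled_t(x, y))
    count = total = 0
    for idx in combinations(range(len(pooled)), len(x)):
        chosen = set(idx)
        gx = [pooled[i] for i in range(len(pooled)) if i in chosen]
        gy = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # small tolerance guards against floating-point ties
        if abs(pooled_t(gx, gy)) >= observed - 1e-12:
            count += 1
    return count / total


# Hypothetical log2 intensities for one gene, one array per biological sample.
control = [7.1, 6.8, 7.4, 7.0]
treated = [8.2, 8.6, 7.9, 8.4]
t_stat = pooled_t(control, treated)
p_perm = exact_permutation_p(control, treated)
```

With 4 samples per treatment there are only 70 relabellings, so the smallest achievable two-sided permutation p-value is 2/70. This is one reason small experiments have limited power under permutation tests.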
Tests for 2 treatments
The paired "t-test" (and similar tests) requires:
1. Each array includes both treatments.
2. Different arrays come from different biological samples.
3. There is no dye effect or technical dye-swaps have been done and the technical replicates have been averaged.
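When these conditions hold, the paired test reduces to a one-sample t-test on M = log2(Red) - log2(Green), one value per array. The sketch below assumes hypothetical intensities and that any dye-swap technical replicates have already been averaged.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(m_values):
    """One-sample t statistic for H0: the mean of M is zero."""
    n = len(m_values)
    return mean(m_values) / (stdev(m_values) / sqrt(n))


# Hypothetical log2 intensities for one gene; each position is one array
# carrying both treatments from a distinct biological pair.
red = [9.1, 8.7, 9.4, 9.0, 8.8]
green = [8.3, 8.2, 8.5, 8.4, 8.1]
m = [r - g for r, g in zip(red, green)]

t_paired = paired_t(m)
df = len(m) - 1  # degrees of freedom = number of arrays - 1
```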
Tests for 3 or more treatments with independent samples
Requires independent samples.
(We cannot extend the paired sample idea, because we do not have 3 or more channels on the array.)
H0: all the population means are equal
HA: At least one of the means differs
Tests for 3 or more treatments with independent samples
examples:
Cancers: several cancer types with 1 sample per patient, several patients with each cancer
Genotypes: several genotypes of mice with 1 sample per mouse, several mice per genotype
Drug: different doses applied to different individuals with 1 sample per individual, several individuals per dose
Tests for 3 or more treatments with independent samples
The F-test assumes that the spreads are all approximately equal and that the populations are approximately normally distributed. The other versions of the test do not require normality.
The test statistic is the ratio of the variance among the sample means to the pooled variance within the samples.
Tests for 3 or more treatments with independent samples
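If there are T treatments, with $n_i$ observations from the ith treatment, let $N = n_1 + \dots + n_T$.
$MSE = \sum_{i=1}^{T} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i})^2 / (N - T)$
$MSTr = \sum_{i=1}^{T} n_i (\bar{y}_{i} - \bar{y})^2 / (T - 1)$
Here $\bar{y}_{i}$ is the sample mean of the ith treatment and $\bar{y}$ is the mean of all N observations.
$F^* = MSTr / MSE$
has an F-distribution with $T-1$ and $N-T$ degrees of freedom when the null is true.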
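The F statistic can be computed directly from its definition. This is a minimal sketch with hypothetical data; the function name is illustrative, and the treatment sum of squares weights each squared deviation by the group size $n_i$, as in the usual one-way ANOVA.

```python
from statistics import mean


def one_way_f(groups):
    """Return (MSTr, MSE, F) for a list of samples, one list per treatment."""
    T = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / N
    group_means = [mean(g) for g in groups]
    # between-treatment sum of squares, weighted by group size
    ss_tr = sum(len(g) * (gm - grand_mean) ** 2
                for g, gm in zip(groups, group_means))
    # within-treatment (error) sum of squares
    ss_e = sum((y - gm) ** 2
               for g, gm in zip(groups, group_means) for y in g)
    ms_tr = ss_tr / (T - 1)
    ms_e = ss_e / (N - T)
    return ms_tr, ms_e, ms_tr / ms_e


# Hypothetical log2 intensities: 3 treatments, 3 independent samples each.
groups = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0], [6.0, 7.0, 8.0]]
ms_tr, ms_e, f_star = one_way_f(groups)
```

For these values MSTr = 7 and MSE = 1, so F* = 7 on 2 and 6 degrees of freedom.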
One-Way ANOVA
summary(aov(iris$Sepal.Length ~ iris$Species))
              Df Sum Sq Mean Sq F value    Pr(>F)
iris$Species   2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals    147 38.956   0.265
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Permutation, bootstrap and rank tests (Kruskal-Wallis test) are readily extended to this situation.
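As an illustration of the rank-based version, the sketch below computes the Kruskal-Wallis statistic H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1), where R_i is the rank sum of the ith treatment. Tied values receive midranks, but the tie correction to H is omitted, so this is a simplified sketch. The data are hypothetical.

```python
def midranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction applied)."""
    pooled = [y for g in groups for y in g]
    ranks = midranks(pooled)
    N = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (N * (N + 1)) * total - 3 * (N + 1)


# Hypothetical log2 intensities: 3 treatments, 3 independent samples each.
h_stat = kruskal_wallis_h([[5.1, 6.2, 7.3], [8.4, 9.5, 10.6], [6.7, 7.8, 8.9]])
```

Under the null, H is approximately chi-squared with T - 1 degrees of freedom for large samples; with samples this small a permutation reference distribution is the safer choice.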
More complex situations
Many microarray experiments do not fall into this simple situation because of correlation in the data due to:
biological correlation (same cell-line, individual ...)
using 2-channel microarrays
having multiple probes for the same gene
Also, we may have multifactor studies:
e.g. 2 genotypes, control and exposed, time course
For this we use Linear Mixed Models
Linear Models
It is useful to consider a model for the observed data (on a single probe or probeset):
Y=log2(intensity)
$= \mu + \alpha + \beta + \gamma + \dots + \text{error}$
$\mu$ is the mean over all the conditions and arrays
error is the random error that is a mixture of measurement error and biological variability
the other terms are systematic deviations from the mean, due to the treatments, array effects, lab effects, etc.
Linear Models
e.g. Comparison of liver and kidney tissue in male and female mice on 2-channel arrays with 3 replicate spots per gene
5 males and 5 females
Y is the log2(intensity) in one channel for one spot.
We need to remember that dye might have an effect.
Linear Models
Fixed effects are the conditions of interest in the experiment:
Random effects are conditions which explain some of the noise in the model:
How does the model help us?
Generally, differential expression analysis is looking for differences between treatments that are larger than expected by chance.
The model helps us to understand the meaning of "by chance".
The model also allows us to design our experiment to minimize the probability of chance observation of large differences.
How Does the Model Help Us?
Mean log2 intensity
          Male   Female
Liver      5.6      6.3
Kidney     9.3     10.7
Difference between male and female in liver: 6.3 - 5.6 = 0.7
Difference between liver and kidney in males: 9.3 - 5.6 = 3.7
What is larger than expected by chance?
Suppose the arrays are:
5 arrays - male and female liver
5 arrays - male and female kidney
Suppose the arrays are:
5 arrays - male liver and kidney
5 arrays - female liver and kidney
The simplest model
2 treatments on 2 channel arrays with independent biological samples, no dye effect and no dye-swap. All of the data are independent.
M=log2(Red) - log2(Green)
$M_i = \mu + \text{error}_i$
No differential expression implies
H0: $\mu = 0$
The F-test for this model is just $t^2$ from the paired t-test
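The identity between the F statistic and the squared t statistic can be checked numerically. In this sketch the model sum of squares for the mean is n * mean(M)^2 on 1 degree of freedom, and the error mean square is the usual sample variance of M. The M values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_and_f(m_values):
    """Return (t, F) for H0: mean of M = 0 in the intercept-only model."""
    n = len(m_values)
    m_bar = mean(m_values)
    mse = variance(m_values)       # error sum of squares / (n - 1)
    t = m_bar / sqrt(mse / n)      # one-sample (paired) t statistic
    f = (n * m_bar ** 2) / mse     # mean square for the mean (1 df) / MSE
    return t, f


# Hypothetical M values, one per array.
t_value, f_value = t_and_f([0.8, 0.5, 0.9, 0.6, 0.7])
```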
One-Way "ANOVA"
$Y_{ij} = \mu + \alpha_i + \text{error}_{ij}$
$\mu$ is the mean expression for the gene over the entire experiment.
$\alpha_i$ is the deviation of the mean of the ith condition from the overall mean, $\sum_i \alpha_i = 0$
The error variance should not depend on the condition.
More Complicated Models with Fixed Effects Only
$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{error}_{ijk}$
We may have 2 or more factors, e.g.
• genotype and drug dose
• genotype and time point
• treatment and dye
$\mu$ is the mean expression for the gene over the entire experiment.
$\alpha_i$ is the deviation of the mean of the ith level of factor A from the overall mean, $\sum_i \alpha_i = 0$
$\beta_j$ is the deviation of the mean of the jth level of factor B from the overall mean, $\sum_j \beta_j = 0$
$(\alpha\beta)_{ij}$ is the deviation of the mean of the ijth combination of levels from $\mu + \alpha_i + \beta_j$, with $\sum_i (\alpha\beta)_{ij} = \sum_j (\alpha\beta)_{ij} = 0$
The error variance should not depend on the condition.
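To make the decomposition concrete, the sketch below applies it to the mean log2 intensities from the liver/kidney by male/female table earlier in these slides, treating tissue as factor A and sex as factor B. The variable names are illustrative; only the four cell means come from the slides.

```python
from statistics import mean

# Cell means (mean log2 intensity) from the liver/kidney table.
cell_means = {
    ("Liver", "Male"): 5.6, ("Liver", "Female"): 6.3,
    ("Kidney", "Male"): 9.3, ("Kidney", "Female"): 10.7,
}
tissues = ["Liver", "Kidney"]
sexes = ["Male", "Female"]

# Overall mean of the cell means.
mu = mean(cell_means.values())

# Main effects: deviation of each level's mean from the overall mean.
alpha = {t: mean(cell_means[(t, s)] for s in sexes) - mu for t in tissues}
beta = {s: mean(cell_means[(t, s)] for t in tissues) - mu for s in sexes}

# Interaction: what is left after removing mu + alpha_i + beta_j.
interaction = {
    (t, s): cell_means[(t, s)] - (mu + alpha[t] + beta[s])
    for t in tissues for s in sexes
}
```

The interaction terms are nonzero here (about 0.175 in absolute value) because the female minus male difference is 0.7 in liver but 1.4 in kidney. Whether that difference is larger than expected by chance depends on the error variance, which the cell means alone do not provide.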
More Complicated Models with Fixed Effects Only
[Figure: four panels plotting the cell means (vertical axis, 0 to 6) against the levels of factor A (1 to 3) and factor B (1 to 4). One pair of panels is titled "Interaction among factors", the other "No interaction among factors".]
More Complicated Models with Fixed Effects Only
$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{error}_{ijk}$
Normal Theory ANOVA is readily extended to this situation and more factors can be added.
Permutation and bootstrap methods begin to get complicated, but can still be applied.
Rank-based methods are available for 2 factors, but get complicated
Replicates that are not Independent
We often have replicates that are NOT independent:
multiple spots for the same gene on an array
multiple arrays from the same RNA
multiple RNAs from the same tissue
multiple samples from the same individual
multiple labs
multiple "batches"
Replicates that are not Independent
e.g. A dye-swap experiment in which the dye-swaps are technical replicates (1 dye-swap pair per sample) and there are 2 spots per gene on the array, with 2 or more treatments
$Y_{ijkst} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_s + \tau_t + \text{error}_{ijkst}$
$\mu$ is the mean expression for the gene over the entire experiment.
$\alpha_i$ is the deviation of the mean of the ith treatment from the overall mean, $\sum_i \alpha_i = 0$
$\beta_j$ is the deviation of the mean of the jth level of dye from the overall mean, $\beta_r + \beta_g = 0$
$\gamma_k$ is the array effect, which induces a correlation between the 2 spots on the same array, $\gamma_k \sim N(0, \sigma_\gamma^2)$
$\delta_s$ is the spot effect, which induces a correlation between the 2 channels at the same spot, $\delta_s \sim N(0, \sigma_\delta^2)$
$\tau_t$ is the biological sample effect, which induces a correlation between the 2 arrays in the dye-swap pair, $\tau_t \sim N(0, \sigma_\tau^2)$
Replicates that are not Independent
The lack of independence can be modeled as a random effect.
This is handled in a straightforward manner by ANOVA modeling but ...
all the other methods get MUCH more complicated.
Much of the available software does not handle this very well.
Replicates that are not Independent
In some cases, we can return to fixed effects models by averaging (but this loses power).
e.g. technical replicates can be averaged and the averages can be used as if they were the primary data
This is much better than discarding technical replicates, but not as good as modeling them.
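A minimal sketch of the averaging step, assuming each measurement is tagged with the identifier of the biological sample it came from. The record layout, identifiers and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_technical_replicates(records):
    """Collapse technical replicates to one value per biological sample.

    records: iterable of (biological_sample_id, value) pairs.
    Returns a dict mapping each sample id to the mean of its replicates.
    """
    by_sample = defaultdict(list)
    for sample_id, value in records:
        by_sample[sample_id].append(value)
    return {sample_id: mean(values) for sample_id, values in by_sample.items()}


# Hypothetical M values: two technical replicates for each of three samples.
records = [
    ("mouse1", 1.0), ("mouse1", 1.2),
    ("mouse2", 0.7), ("mouse2", 0.9),
    ("mouse3", 1.4), ("mouse3", 1.6),
]
sample_means = average_technical_replicates(records)
```

The averaged values, one per biological sample, can then be analyzed with a fixed effects method such as the t-tests above; the sample size for that analysis is the number of biological samples, not the number of measurements.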
Replicates that are not Independent Example
2 conditions on a 2-channel array with replicate spots for each gene, and a dye-swap technical replicate.
e.g. 2 genotypes of mouse
3 mice per genotype
1 mouse from each genotype on each array
2 arrays from each pair of mice
4 replicate spots per array
We will simplify by modeling M, rather than each channel.
[Diagram: three mouse pairs, (A1, B1), (A2, B2) and (A3, B3); the two mice in each pair are hybridized together on 2 arrays.]
Replicates that are not Independent Example
effects:
• mouse pair
• dye (or equivalently genotype)
• array
pair and array are random
dye is fixed
we need to keep track of whether M is R-G or A-B (genotype difference)
We do not need to include spot as we are using M
Replicates that are not Independent Example
data for 1 mouse pair (m):
2 arrays, with 4 spots per array
$M_{mdas}$
m is the mouse pair identifier (1,2,3)
d is the dye for genotype A (r,g)
a is the array (1-6 or 1,2 within m)
s is the spot (1-4 within array)
Replicates that are not Independent Example
$M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$
effects:
m mouse pair (random)
d dye (or equivalently, the genotype in the red channel, R)
a array (random)
t=1,2,3,4 for the spots
The hypothesis of no genotype effect is $\mu = 0$.
Notice that we have to be careful about the sign of M. If we code the effects in the way it is usually done for ANOVA, M=A-B, not R-G.
Replicates that are not Independent Example
$M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$
Our estimate of $\mu$ is just the sample mean of M over all the spots.
But the SE of ave(M) is not the usual sample SD divided by the square root of the number of spots, because of the other (random) effects.
Replicates that are not Independent Example
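$M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$
$\mathrm{Var}(M) = \mathrm{Var}(m) + \mathrm{Var}(a) + \mathrm{Var}(\text{error})$
$\bar{M} = \mu + \bar{m} + \bar{a} + \overline{\text{error}}$
3 mouse pairs, 6 arrays, 24 observations/gene:
$\mathrm{Var}(\bar{m}) = \mathrm{Var}(m)/3$
$\mathrm{Var}(\bar{a}) = \mathrm{Var}(a)/6$
$\mathrm{Var}(\overline{\text{error}}) = \mathrm{Var}(\text{error})/24$
$t^* = \bar{M} / \sqrt{s_m^2/3 + s_a^2/6 + s_{\text{error}}^2/24}$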
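One simple way to obtain a t statistic whose denominator respects this structure, without fitting a mixed model, is to average spots within arrays, then arrays within mouse pairs, and compute a one-sample t from the 3 pair means. In a balanced design like this one, the variance of a pair mean is Var(m) + Var(a)/2 + Var(error)/8, so dividing the sample variance of the pair means by 3 estimates Var(m)/3 + Var(a)/6 + Var(error)/24. This is a sketch of the averaging approach, not the mixed-model fit itself. It assumes M is coded as A - B on every array so that the dye effect cancels within each dye-swap pair, and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def nested_mean_t(data):
    """data[pair][array] is a list of spot-level M values (coded A - B).

    Returns (mean of M, its standard error, t statistic, degrees of freedom).
    """
    # average the spots on each array, then the arrays within each mouse pair
    pair_means = [mean([mean(spots) for spots in arrays]) for arrays in data]
    n_pairs = len(pair_means)
    m_bar = mean(pair_means)
    se = stdev(pair_means) / sqrt(n_pairs)
    return m_bar, se, m_bar / se, n_pairs - 1


# Hypothetical data: 3 mouse pairs x 2 arrays x 4 spots.
data = [
    [[0.9, 1.1, 1.0, 1.0], [1.2, 1.3, 1.1, 1.2]],
    [[0.6, 0.8, 0.7, 0.7], [0.9, 0.9, 1.0, 0.8]],
    [[1.4, 1.3, 1.5, 1.4], [1.5, 1.7, 1.6, 1.6]],
]
m_bar, se, t_star, df = nested_mean_t(data)
```

The test has only 2 degrees of freedom, one fewer than the number of mouse pairs, even though there are 24 spot-level observations.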
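What if we ignore the Dependence
We would use:
$t^* = \bar{M} / \sqrt{s_M^2/24}$
where $s_M^2$ is the sample variance of all 24 values of M, with
$E(s_M^2) = \sigma_{\text{error}}^2 + (20/23)\,\sigma_a^2 + (16/23)\,\sigma_m^2$
Compare with
$t^* = \bar{M} / \sqrt{s_m^2/3 + s_a^2/6 + s_{\text{error}}^2/24}$
The denominator of the ordinary t-test is much too small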
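The size of the problem can be checked with exact arithmetic. For 24 equally weighted observations with a common mean, E(s_M^2) = (24/23) * (Var(M) - Var(mean of M)), which gives the 16/23 and 20/23 coefficients. The sketch below compares the variance of the mean implied by the ordinary t-test with the true one for the 3 pairs x 2 arrays x 4 spots design. The unit variance components in the example are hypothetical.

```python
from fractions import Fraction

N_PAIRS, ARRAYS_PER_PAIR, SPOTS_PER_ARRAY = 3, 2, 4
N_ARRAYS = N_PAIRS * ARRAYS_PER_PAIR   # 6 arrays
N_OBS = N_ARRAYS * SPOTS_PER_ARRAY     # 24 observations per gene


def true_var_of_mean(var_m, var_a, var_e):
    """Var(mean of M) = Var(m)/3 + Var(a)/6 + Var(error)/24."""
    var_m, var_a, var_e = Fraction(var_m), Fraction(var_a), Fraction(var_e)
    return var_m / N_PAIRS + var_a / N_ARRAYS + var_e / N_OBS


def expected_s2(var_m, var_a, var_e):
    """E(s_M^2) when the 24 values are pooled into one sample variance."""
    total = Fraction(var_m) + Fraction(var_a) + Fraction(var_e)
    return Fraction(N_OBS, N_OBS - 1) * (total - true_var_of_mean(var_m, var_a, var_e))


def naive_var_of_mean(var_m, var_a, var_e):
    """What the ordinary t-test implicitly uses: E(s_M^2) / 24."""
    return expected_s2(var_m, var_a, var_e) / N_OBS


# Hypothetical case: all three variance components equal to 1.
ratio = true_var_of_mean(1, 1, 1) / naive_var_of_mean(1, 1, 1)
```

With equal variance components the true variance of the mean is 299/59, about 5 times the naive one, so the ordinary t statistic would be inflated by a factor of roughly 2.25.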