Statistical Analysis of RNA-seq
Davide Risso
Division of Biostatistics, School of Public Health
University of California, Berkeley
IIHG Bioinformatics Summer Course
May 18–19, 2015
A typical RNA-seq pipeline
Biological Question
Experimental Design
Experiment (Microarray/RNA-Seq)
Pre-processing (Read mapping, Expression quantitation, ...)
Normalization
Clustering, Classification, Estimation, Testing, ...
Gene-level expression summaries
A matrix of gene expression levels (J genes × n samples)
          Condition 1                      Condition 2
          Rep 1     ...   Rep k            Rep k+1     ...   Rep n
gene 1    y_{1,1}   ...   y_{1,k}          y_{1,k+1}   ...   y_{1,n}
gene 2    y_{2,1}   ...   y_{2,k}          y_{2,k+1}   ...   y_{2,n}
...       ...       ...   ...              ...         ...   ...
gene J    y_{J,1}   ...   y_{J,k}          y_{J,k+1}   ...   y_{J,n}
Here we consider gene-level expression for simplicity, but most of the concepts are generalizable to transcript-level differential expression.
Statistical challenges in RNA-seq data analysis
1 Statistical inference
• technical and biological variability
• generalizability / interpretation of the results
2 Exploratory data analysis (EDA)
3 Normalization
4 Choice of the statistical model
• Negative binomial approach: dispersion estimation
• Test for differential expression: two-class comparisons + more general designs
5 Exploration and interpretation of the results
Data availability
All the data that I will use in this presentation are available in R data packages.
• Yeast dataset (Risso et al. 2011): https://github.com/drisso/yeastRNASeqRisso2011
• Zebrafish dataset (Ferreira et al. 2014): zebrafishRNASeq (Bioconductor), http://www.bioconductor.org/packages/release/data/experiment/html/zebrafishRNASeq.html
• SEQC dataset (MAQC/SEQC consortium, 2014): seqc (Bioconductor), http://www.bioconductor.org/packages/release/data/experiment/html/seqc.html
Yeast dataset
Risso et al. (2011); Sherlock Lab, Stanford
Sample   Culture/Library prep.   Library prep. protocol   Growth condition   Flow-cell
1        Y1                      Protocol 1               YPD                428R1
2        Y1                      Protocol 1               YPD                4328B
3        Y2                      Protocol 1               YPD                428R1
4        Y2                      Protocol 1               YPD                4328B
5        Y7                      Protocol 1               YPD                428R1
6        Y7                      Protocol 1               YPD                4328B
7        Y4                      Protocol 2               YPD                61MKN
8        Y4                      Protocol 2               YPD                61MKN
9        D1                      Protocol 1               Del                428R1
10       D2                      Protocol 1               Del                428R1
11       D7                      Protocol 1               Del                428R1
12       G1                      Protocol 2               Gly                6247L
13       G2                      Protocol 1               Gly                62OAY
14       G3                      Protocol 1               Gly                62OAY
Zebrafish dataset
Ferreira et al. (2014); Ngai Lab, UC Berkeley
Ctl. 3 Ctl. 1 Ctl. 5 Trt. 9 Trt. 11 Trt. 13
Day 1 Day 2 Day 3
Run 1 1 multiplex lane
Run 2 1 multiplex lane
2 sample types x 3 lib. prep. x 2 runs = 12 samples
92 negative controls ERCC Mix 1
MAQC/SEQC dataset
MAQC/SEQC consortium; Nature Biotechnology 32 (2014)
ERCC Mix 1
A1 A2 A3 A4 B1 B2 B3 B4
ERCC Mix 2
Flow-cell F1 8 multiplex lanes
Flow-cell F2 8 multiplex lanes
2 sample types x 4 lib. prep. x 2 flow-cells x 8 lanes = 128 samples
23 negative controls 69 positive controls
Statistical inference
“The objective of statistics is to make inferences (predictions, decisions) about a population based on information contained in a sample.” (Mendenhall 1987).
In RNA-seq, we want to infer the true (relative) expression of a gene in a particular tissue in a given population (e.g., American women with breast cancer), by measuring it in a sample of that population.
It is always useful to ask:
• What is the reference population? (e.g., European adults, male C57BL/6 mice [in my lab])
• Is my sample representative of the population?
Intuitively, the larger the sample size, the better I can estimate the true expression.
Technical and biological variability
Hansen et al. (2011)
• Gene expression is a stochastic process: any two individuals of the same population will have varying levels of expression of any gene (biological variability).
• In addition, we often observe measurement errors: these lead to technical variability.
$$\mathrm{Var}(Y_g) = \text{Difference between groups} + \text{Biological variability} + \text{Technical variability}$$
• Difference between groups: the quantity of interest.
• Technical variability: can be estimated with technical replicates (Poisson noise).
• Biological variability: can be estimated with biological replicates (over-dispersion).
Proper replication and experimental design
Many sources of technical and biological variability can confound the inference on the quantity of interest (e.g., batch effects, lab effects, population structure, ...).
Normalization can alleviate the problem, but cannot make up for absence of replication and/or poor experimental design.
“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” R. A. Fisher
Exploratory Data Analysis (EDA)
Before fitting any model, it is essential to carefully explore the data.
EDA can reveal:
• Need for normalization (difference in read count distributions).
• Presence of outlying samples.
• Batch effects and other unwanted variation.
• Data recording errors.
• Data duplication.
• Other systematic effects.
Some useful plots for QC/EDA
There are some standard and easy plots that one can use to routinely explore the data prior to differential expression analysis. These include (but are not limited to):
• Boxplots: to compare distribution of replicate samples.
• Relative Log Expression (RLE) plots: identify outliers and/or batch effects.
• Principal Component Analysis (PCA) plots: identify strongest sources of variation.
• Hierarchical clustering: identify duplicate samples, evaluate technical and biological replicates.
• Mean-difference plots: compare technical and biological variability.
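Boxplots
From J. H. Bullard, K. D. Hansen, and M. Taub (2008).
[Figure: anatomy of a boxplot. The box is labelled with the quartiles $q_{0.25}$, $q_{0.5}$ and $q_{0.75}$, the whiskers end at A and B, and points beyond the whiskers are marked as outliers.]
$$r \equiv |q_{0.75} - q_{0.25}|$$
$$A \equiv \inf\{x_i : x_i > q_{0.25} - 1.5r\}$$
$$B \equiv \sup\{x_i : x_i < q_{0.75} + 1.5r\}$$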
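The whisker definitions can be checked numerically. The following is a minimal sketch in base R on a small made-up vector `x` (hypothetical values, not data from this talk). It assumes Tukey's hinges from `fivenum()` as the quartiles, which is the convention used by R's own `boxplot.stats()`.

```r
# Toy data (hypothetical values chosen for illustration only)
x <- c(-3.5, -1.2, -0.8, -0.3, 0, 0.2, 0.5, 0.9, 1.1, 4.2)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)
q25 <- five[2]
q75 <- five[4]

r <- abs(q75 - q25)              # spread of the box
A <- min(x[x > q25 - 1.5 * r])   # end of the lower whisker
B <- max(x[x < q75 + 1.5 * r])   # end of the upper whisker
outliers <- x[x < A | x > B]     # points drawn individually

c(A = A, B = B)
outliers
```

Calling `boxplot.stats(x)$stats` returns the same whisker ends in its first and last entries, which is a quick way to confirm the definitions on real count data.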
Boxplots highlight the need for normalization...
Yeast dataset
[Figure: boxplots of unnormalized read counts for samples Y1_1, Y1_2, Y2_1, Y2_2, Y7_1, Y7_2, Y4_1, Y4_2, D1, D2, D7, G1, G2, G3.]
... and show that normalization worked as promised!
[Figure: boxplots of the same samples after upper-quartile normalization.]
Relative Log Expression (RLE)
• Particularly useful transformation of read counts.
• RLE is defined, for each gene, as the log-ratio of a read count to the median read count across samples.
• Comparable samples should have similar RLE distributions, centered around zero.
• Unusual RLE distributions could reveal problematic samples or batch effects.
RLE plots can reveal “hidden” features
[Figure: RLE boxplots after upper-quartile normalization for samples Y1_1, Y1_2, Y2_1, Y2_2, Y7_1, Y7_2, Y4_1, Y4_2, D1, D2, D7, G1, G2, G3.]
Principal Component Analysis (PCA)
• PCA can be used as a way to visualize the sources of variation in the data.
• PCA is a statistical procedure that looks for a small set of linear combinations of the original variables to summarize the data, losing as little information as possible.
• These linear combinations are called principal components (PCs):
• the first PC is the weighted average of the gene expression measures that gives the highest variance across all samples.
• Each succeeding component in turn has the highest variance possible under the constraint that it is uncorrelated with the preceding components.
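The two defining properties of the PCs (decreasing variance and mutual uncorrelatedness) can be seen with base R's `prcomp()`. This is only a sketch on simulated log-expression values (hypothetical data); the group labels and the size of the shift are assumptions chosen so that one source of variation dominates.

```r
set.seed(2)
# Simulated log-expression: 500 genes x 6 samples, two groups of 3 (hypothetical)
group <- rep(c("A", "B"), each = 3)
expr <- matrix(rnorm(500 * 6), nrow = 500)
# Shift the first 200 genes in group B to create one dominant source of variation
expr[1:200, group == "B"] <- expr[1:200, group == "B"] + 3

# prcomp expects observations in rows, so samples go in rows (transpose)
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each PC (non-increasing)
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 3)

# Scores on different PCs are uncorrelated
round(cor(pca$x[, 1], pca$x[, 2]), 10)

# plot(pca$x[, 1:2], col = factor(group)) would draw the PCA plot
```

Here the first PC separates the two simulated groups; in real data the dominant PC may instead track a batch or a problematic sample, which is why the plot is useful for QC.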
PCA plots can reveal batch effects...
TCGA data, Anne Biton
... problematic data...
Zebrafish dataset
[Figure: PC1 vs. PC2 for samples Ctl1, Ctl3, Ctl5, Trt9, Trt11, Trt13.]
... and well-behaved data
Yeast dataset
[Figure: PC1 vs. PC2 for samples Y1_1, Y1_2, Y2_1, Y2_2, Y7_1, Y7_2, Y4_1, Y4_2, D1, D2, D7, G1, G2, G3.]
Hierarchical clustering highlights structure in the data...
Yeast dataset
[Figure: dendrogram of the yeast samples, hclust (*, "complete"); y-axis: Height. Leaf order: D1, D2, D7, Y2_1, Y2_2, Y4_1, Y4_2, Y7_1, Y7_2, Y1_1, Y1_2, G2, G1, G3.]
... and can identify duplicate samples
Single-cell RNA-seq, Russell Fletcher
[Figure: cluster dendrogram of the single-cell samples, hclust (*, "complete"); y-axis: Height.]
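How a duplicated sample shows up in a dendrogram can be illustrated with base R. This is a sketch on simulated values (hypothetical data); the sample names and the exact copy `S1_copy` are assumptions made for the illustration.

```r
set.seed(3)
# Simulated log-expression: 100 genes x 5 samples (hypothetical data)
expr <- matrix(rnorm(100 * 5), nrow = 100,
               dimnames = list(NULL, paste0("S", 1:5)))
# Add an exact copy of S1 under another name to mimic a duplicated sample
expr <- cbind(expr, S1_copy = expr[, "S1"])

d  <- dist(t(expr))                    # Euclidean distance between samples
hc <- hclust(d, method = "complete")   # complete linkage, as in the figures

# The first merge joins the closest pair; height 0 flags identical profiles
first_pair <- colnames(expr)[abs(hc$merge[1, ])]
first_pair
hc$height[1]
# plot(hc) would draw the dendrogram
```

With real data, duplicates are rarely bit-for-bit identical, so a merge height that is much smaller than all the others is the practical warning sign.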
Mean-difference plots (a.k.a. MA-plots)
Yeast data
[Figure: two mean-difference plots, one for technical replicates and one for biological replicates; x-axis: mean, y-axis: difference.]
Normalization
Normalization is essential to ensure that observed differences in expression measures between samples and/or genes are truly due to biology and not technical artifacts.
We distinguish between two types of effects on read counts and corresponding normalization procedures.
1 Within-sample normalization adjusts for gene-specific (and possibly sample-specific) effects, e.g., gene length or GC-content (Risso et al. 2011).
2 Between-sample normalization adjusts for distributional differences in read counts between samples, e.g., lane sequencing depth or barcode efficiency (Bullard et al. 2010; Risso et al. 2014).
Between-sample normalization
Bullard et al. (2010), Risso et al. (2014)
We group the between-sample normalization methods into global scaling and non-linear approaches.
In global scaling normalization, gene-level counts are scaled by a single factor per sample.
• Total-count (TC), as in RPKM (Mortazavi et al. 2008), still widely used.
• Upper-quartile (UQ, Bullard et al. 2010).
• Trimmed Mean of M values (TMM, Robinson and Oshlack 2010).
• DESeq normalization (Anders and Huber 2010).
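The idea of one scaling factor per sample can be sketched for the two simplest methods in the list, TC and UQ. The code below is a base R illustration on simulated counts (hypothetical data), not the implementation used in EDASeq or edgeR; dividing the factors by their mean and excluding genes with no reads in any sample are assumptions of this sketch.

```r
set.seed(4)
# Simulated counts: 1000 genes x 4 samples with different depths (hypothetical)
depth <- c(1, 2, 0.5, 1.5)
base_mean <- rgamma(1000, shape = 0.5, scale = 100)
counts <- sapply(depth, function(d) rpois(1000, lambda = d * base_mean))
colnames(counts) <- paste0("S", 1:4)

# Total-count (TC) factors: library size relative to the average library size
tc <- colSums(counts) / mean(colSums(counts))

# Upper-quartile (UQ) factors: 75th percentile of each sample, computed on
# genes with at least one read in some sample, relative to the average
expressed <- rowSums(counts) > 0
uq_raw <- apply(counts[expressed, ], 2, quantile, probs = 0.75)
uq <- uq_raw / mean(uq_raw)

# One scaling factor per sample: divide each column by its factor
norm_tc <- sweep(counts, 2, tc, FUN = "/")
norm_uq <- sweep(counts, 2, uq, FUN = "/")

round(rbind(TC = tc, UQ = uq), 2)
```

After scaling, all samples share the same total count (TC) or the same upper quartile (UQ); TMM and DESeq factors follow the same divide-by-a-factor pattern but estimate the factor more robustly.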
Non-linear normalization
• Full-quantile normalization (FQ, Bullard et al. 2010).
• Remove Unwanted Variation (RUV, Risso et al. 2014).
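Full-quantile normalization forces every sample to share the same empirical distribution. The function `full_quantile()` below is a hypothetical helper written for this illustration (a minimal base R sketch, not the EDASeq implementation); it uses the mean of the sorted columns as the common distribution and breaks ties by order of appearance, both of which are assumptions.

```r
# Minimal full-quantile normalization: every sample is mapped onto the same
# reference distribution (the mean of the sorted columns)
full_quantile <- function(x) {
  sorted    <- apply(x, 2, sort)                     # sort each sample separately
  reference <- rowMeans(sorted)                      # average distribution
  ranks     <- apply(x, 2, rank, ties.method = "first")
  out <- apply(ranks, 2, function(r) reference[r])   # replace values by rank
  dimnames(out) <- dimnames(x)
  out
}

set.seed(5)
# Simulated counts: 500 genes x 3 samples with different depths (hypothetical)
counts <- cbind(S1 = rpois(500, 20), S2 = rpois(500, 60), S3 = rpois(500, 5))
fq <- full_quantile(counts)

# After normalization all samples have identical quantiles
apply(fq, 2, quantile, probs = c(0.25, 0.5, 0.75))
```

Unlike global scaling, the result is not the original count divided by a single number, which is why it can correct very different distributions but is harder to interpret.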
Global scaling or non-linear?
Global scaling
• Simple and flexible.
• Extensible: once determined somewhat robustly based on a given set of genes, the same scaling factors can be applied to other features.
• Can be insufficient for large differences in distributions between libraries.
Non-linear normalization
• Does not involve singling out a parameter.
• Can handle very different distributions between libraries.
• Can account for possibly more than sequencing depth.
• Not a simple function of original data; not easily interpretable.
The optimal normalization method often depends on the dataset, so use EDA to choose!
Statistical model
The data: gene-level expression summaries
A matrix of gene expression levels (J genes × n samples)
          Condition 1                      Condition 2
          Rep 1     ...   Rep k            Rep k+1     ...   Rep n
gene 1    y_{1,1}   ...   y_{1,k}          y_{1,k+1}   ...   y_{1,n}
gene 2    y_{2,1}   ...   y_{2,k}          y_{2,k+1}   ...   y_{2,n}
...       ...       ...   ...              ...         ...   ...
gene J    y_{J,1}   ...   y_{J,k}          y_{J,k+1}   ...   y_{J,n}
Choice of the statistical model: Poisson
The Poisson model is the simplest model to deal with count data, hence a natural choice in this context.
The main assumption of this model is that the mean is equal to the variance.
In RNA-seq data, technical variation follows a Poisson noise structure, but biological variation is higher, leading to overdispersion.
Mean-variance relationship: technical replicates
MAQC/SEQC data (sample A)
[Figure: variance vs. mean across genes.]
Mean-variance relationship: biological replicates
Zebrafish data (control samples)
[Figure: variance vs. mean across genes.]
Choice of the statistical model: negative binomial
One way to account for overdispersion is to model the data as negative binomial.
The negative binomial is a generalization of the Poisson model that accounts for excess variability.
The negative binomial model can be described in terms of its mean and variance:
$$E[Y] = \mu; \qquad \mathrm{Var}(Y) = \mu + \phi\mu^2.$$
• The variance is a quadratic function of the mean.
• $\phi$ is referred to as the dispersion parameter.
• The Poisson case is recovered when $\phi = 0$.
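The mean-variance relationship can be checked by simulation. This base R sketch uses arbitrary values for the mean and the dispersion (both hypothetical); it relies on the fact that the `size` argument of `rnbinom()` corresponds to $1/\phi$ when the distribution is specified through `mu`.

```r
set.seed(6)
mu  <- 100    # mean
phi <- 0.1    # dispersion (hypothetical value)
n   <- 1e5    # number of simulated counts

# In the mean-dispersion form, rnbinom's `size` parameter equals 1 / phi
y_nb   <- rnbinom(n, mu = mu, size = 1 / phi)
y_pois <- rpois(n, lambda = mu)          # the phi = 0 limit

c(mean = mean(y_nb),   var = var(y_nb),   expected_var = mu + phi * mu^2)
c(mean = mean(y_pois), var = var(y_pois), expected_var = mu)
```

Both samples have the same mean, but the negative binomial variance is about eleven times larger here, which is the overdispersion seen between biological replicates.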
Negative binomial models for RNA-seq data analysis
Due to its simplicity, the negative binomial is one of the most widely used models for analyzing RNA-seq data.
Popular approaches include edgeR (Robinson et al. 2010), DESeq (Anders and Huber 2010), DESeq2 (Love et al. 2014), baySeq (Hardcastle and Kelly 2010), and others.
Dispersion estimation
Due to the usually small sample size, estimating the dispersion parameter $\phi$ is challenging.
One strategy that has proven effective is to “shrink” the gene-specific dispersion estimates towards the mean across genes.
This strategy, known as shrinkage estimation, allows one to borrow strength across the thousands of genes, and leads to better estimates.
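The benefit of shrinking can be seen in a deliberately simplified simulation. The sketch below is not the estimator used by edgeR or DESeq2: it assumes method-of-moments gene-wise estimates, a fixed and arbitrary shrinkage weight `w`, and a true dispersion shared by all genes (all hypothetical choices made for illustration).

```r
set.seed(7)
n_genes <- 2000
n_rep   <- 3        # small sample size, as is typical
phi     <- 0.2      # true dispersion, shared by all genes in this toy setup
mu      <- runif(n_genes, 50, 500)

# Negative binomial counts: genes in rows, replicates in columns
y <- matrix(rnbinom(n_genes * n_rep, mu = rep(mu, n_rep), size = 1 / phi),
            nrow = n_genes)

# Gene-wise method-of-moments estimate from Var(Y) = mu + phi * mu^2
m <- rowMeans(y)
v <- apply(y, 1, var)
phi_raw <- pmax((v - m) / m^2, 0)

# Shrink each estimate towards the mean across genes (weight w is arbitrary)
w <- 0.3
phi_shrunk <- w * phi_raw + (1 - w) * mean(phi_raw)

# Mean squared error relative to the true dispersion
c(raw = mean((phi_raw - phi)^2), shrunk = mean((phi_shrunk - phi)^2))
```

With only three replicates the gene-wise estimates are very noisy, and pulling them towards the common mean reduces the error. Real methods choose the amount of shrinkage from the data rather than fixing it.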
Illustration of shrinkage estimation
Love et al. (2014)
Dispersion estimation with edgeR
Yeast data
[Figure: mean-variance plot; x-axis: mean expression (log10 scale), y-axis: variance (log10 scale).]
A Generalized Linear Model (GLM) approach
GLMs are a flexible and powerful way to model RNA-seq data (McCullagh and Nelder 1989). In particular, for each gene $j$, we have
$$\log E[Y_j \mid X] = X\beta_j,$$
where $X$ is a known matrix called the design matrix, and $\beta_j$ is a vector of parameters to be estimated.
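In a typical two-class differential expression study (e.g., the zebrafish data), X is
# zfX is not defined on the slide; it is assumed here to be the treatment
# factor for the six zebrafish samples (3 controls, 3 treated)
zfX <- factor(rep(c("Ctl", "Trt"), each = 3))
model.matrix(~zfX)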
## (Intercept) zfXTrt
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 1
## 5 1 1
## 6 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$zfX
## [1] "contr.treatment"
In this case, $\beta_j$ has two components and (dropping the index $j$) we can write the model as
$$\log E[Y \mid X] = \log\mu = \beta_0 + \beta_1 x,$$
where $x$ is the second column of $X$. Moreover,
$$\log E[Y \mid x = 0] = \beta_0$$
$$\log E[Y \mid x = 1] = \beta_0 + \beta_1$$
hence $\beta_1$ can be interpreted as the log-fold-change between the two classes:
$$\beta_1 = \log\frac{E[Y \mid x = 1]}{E[Y \mid x = 0]}$$
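The interpretation of the second coefficient as a log-fold-change can be verified on a toy gene. The counts below are made up (hypothetical values), and the Poisson family is used only as a simplification that keeps the sketch in base R; packages such as edgeR fit the negative binomial version of the same log-linear model.

```r
# Toy counts for one gene in six samples (hypothetical values)
y <- c(12, 15, 9, 40, 33, 47)
x <- factor(rep(c("Ctl", "Trt"), each = 3))

# Log-linear model log E[Y | x] = beta0 + beta1 * x.
# The Poisson family is a simplification that keeps the example in base R.
fit  <- glm(y ~ x, family = poisson(link = "log"))
beta <- coef(fit)

# beta0 is the log mean of the reference class, beta1 the log-fold-change
log_mean_ctl <- log(mean(y[x == "Ctl"]))
log_fc       <- log(mean(y[x == "Trt"]) / mean(y[x == "Ctl"]))

rbind(glm = unname(beta), by_hand = c(log_mean_ctl, log_fc))
```

The two rows agree up to the convergence tolerance of the fitting algorithm, since with a single two-level factor the fitted means are simply the group means.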
Test for differential expression
In this GLM framework, testing for differential expression corresponds to testing the null hypothesis $H_0: \beta_1 = 0$.
One popular approach is the Likelihood Ratio Test (LRT), which has the advantage of generalizing to more complex designs (e.g., multi-class comparisons, time-course, ...).
The LRT exploits the idea that the “correct” model will fit the data better (i.e., it will have the higher value of the likelihood function).
In other words, we test whether the model with $\beta_1 \neq 0$ fits the data “significantly better” than the one with $\beta_1 = 0$.
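The comparison of the two nested models can be written out for a single toy gene. As before, the counts are hypothetical and the Poisson family is a base R simplification of the negative binomial model; the chi-squared reference distribution with one degree of freedom is the usual large-sample approximation.

```r
# Toy counts for one gene in six samples (hypothetical values)
y <- c(12, 15, 9, 40, 33, 47)
x <- factor(rep(c("Ctl", "Trt"), each = 3))

# Poisson family used as a base-R simplification of the count model
fit_null <- glm(y ~ 1, family = poisson)   # beta1 = 0
fit_full <- glm(y ~ x, family = poisson)   # beta1 free

# LRT statistic: twice the gain in log-likelihood; one extra parameter -> 1 df
lrt_stat <- as.numeric(2 * (logLik(fit_full) - logLik(fit_null)))
p_value  <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

c(statistic = lrt_stat, p.value = p_value)
# anova(fit_null, fit_full, test = "Chisq") reports the same test
```

For designs with more than two classes, only the formulas of the two models and the degrees of freedom change, which is what makes the LRT easy to generalize.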
p-values and multiple testing
The result of the testing procedure is often a p-value, i.e., the probability, under the null hypothesis, of obtaining a result at least as extreme as the one observed.
[Figure: standard normal density, dnorm(x), for x between −3 and 3, annotated with a p-value of 0.05.]
Under the null hypothesis, the p-values are uniformly distributed between 0 and 1.
If one performs only one test, it is conventional to reject the null hypothesis for p < 0.05.
When performing 10,000 tests (one per gene), if the null hypothesis holds for every gene we expect about 5% of them (500 genes!) to have p < 0.05, just by chance!
Hence, we need to account for the “multiplicity of the test”.
Correcting for multiple testing
When dealing with multiple hypothesis testing, one needs to control a suitable Type I error rate, e.g., the family-wise error rate (FWER), or the false discovery rate (FDR).
There are simple procedures to adjust the p-values for multiplicity: e.g., the Bonferroni procedure to control the FWER or the Benjamini-Hochberg procedure to control the FDR (Benjamini and Hochberg 1995).
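Both adjustments are available in base R through `p.adjust()`. The sketch below applies them to simulated p-values (hypothetical data); the proportion of DE genes and the Beta distribution used to generate their p-values are assumptions chosen only to produce a mixture of null and non-null tests.

```r
set.seed(8)
n_null <- 9500   # genes with no differential expression
n_de   <- 500    # truly DE genes (hypothetical proportions)

# Null p-values are uniform; DE p-values are concentrated near 0
p <- c(runif(n_null), rbeta(n_de, shape1 = 0.1, shape2 = 10))
is_de <- rep(c(FALSE, TRUE), c(n_null, n_de))

# Without correction, about 5% of the null genes pass p < 0.05 by chance
mean(p[!is_de] < 0.05)

# Adjusted p-values: Bonferroni (FWER) and Benjamini-Hochberg (FDR)
p_bonf <- p.adjust(p, method = "bonferroni")
p_bh   <- p.adjust(p, method = "BH")

# Number of genes called at the 0.05 level, and false positives among them
calls <- cbind(raw = p < 0.05, bonferroni = p_bonf < 0.05, BH = p_bh < 0.05)
rbind(called = colSums(calls), false_pos = colSums(calls & !is_de))
```

Bonferroni is the more conservative of the two, so it never calls more genes than Benjamini-Hochberg; controlling the FDR is usually the more practical choice when thousands of genes are tested.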
Exploration of the results
Distribution of the p-values
If the model is correctly specified, we expect the p-values to come from two distributions:
• uniform between 0 and 1 for the non-DE genes;
• and (almost) 0 for the DE genes.
Departures from this pattern indicate that there is something wrong with the model (presence of additional sources of variation, lack of proper normalization, ...).
Good p-value distribution
Toy example
[Figure: histogram of p-values between 0 and 1; y-axis: Frequency.]
Bad p-value distribution
Toy example
[Figure: histogram of p-values between 0 and 1; y-axis: Frequency.]
p-value distribution
Zebrafish dataset
[Figure: histogram of top$PValue between 0 and 1; y-axis: Frequency.]
Volcano plots
Zebrafish dataset
[Figure: volcano plot; x-axis: logFC, y-axis: −log10(p-value).]
Heatmap of DE genes
Yeast dataset
[Figure: heatmap of the top 100 DE genes (rows, labelled by yeast ORF names) across samples G2, G3, G1, D1, D7, D2, Y2_1, Y1_1, Y4_1, Y7_1 (columns); color key: row Z-score from −1 to 1.]
Software
All analyses were performed within the open source R/Bioconductor framework.
• EDASeq: visualization and normalization.
• edgeR: differential expression.
Rmarkdown available to reproduce all the analyses in this presentation.
Acknowledgements
UC Berkeley
• Sandrine Dudoit
• Terry Speed
• John Ngai
• Todd Ferreira
• Russell Fletcher
Stanford
• Gavin Sherlock
UCSF
• Anne Biton
References
S. Anders and W. Huber, Genome Biol 11, (2010).
Y. Benjamini and Y. Hochberg, Journal of the Royal Statistical Society. Series B (Methodological) 57, 289 (1995).
J. Bullard, E. Purdom, K. Hansen, and S. Dudoit, BMC Bioinformatics 11, 94 (2010).
T. Ferreira, S. R. Wilson, J. Choi, D. Risso, S. Dudoit, T. P. Speed, and J. Ngai, Neuron 81, 847 (2014).
T. Hardcastle and K. Kelly, BMC Bioinformatics 11, 422 (2010).
M. I. Love, W. Huber, and S. Anders, Genome Biology 15, 550 (2014).
P. McCullagh and J. A. Nelder, Generalized Linear Models (London: Chapman & Hall, 1989).
W. Mendenhall, Introduction to Probability and Statistics, 7th ed. (Duxbury Press, 1987).
A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, Nat Methods 5, 621 (2008).
D. Risso, J. Ngai, T. P. Speed, and S. Dudoit, Nature Biotechnology 32, 896 (2014).
D. Risso, K. Schwartz, G. Sherlock, and S. Dudoit, BMC Bioinformatics 12, 480 (2011).
M. D. Robinson, D. J. McCarthy, and G. K. Smyth, Bioinformatics 26, 139 (2010).
M. D. Robinson and A. Oshlack, Genome Biol 11, (2010).