Statistical Analysis of RNA-seq - UIowa Wiki · 2015. 5. 18.

Statistical Analysis of RNA-seq

Davide Risso

Division of Biostatistics, School of Public Health

University of California, Berkeley

IIHG Bioinformatics Summer Course, May 18–19, 2015

A typical RNA-seq pipeline

Biological Question

Experimental Design

Experiment (Microarray/RNA-Seq)

Pre-processing (Read mapping, Expression quantitation, ...)

Normalization

Clustering, Classification, Estimation, Testing, ...

Gene-level expression summaries

A matrix of gene expression levels (J genes × n samples)

          Condition 1               Condition 2
          Rep 1   ...   Rep k       Rep k+1   ...   Rep n
gene 1    y1,1    ...   y1,k        y1,k+1    ...   y1,n
gene 2    y2,1    ...   y2,k        y2,k+1    ...   y2,n
...       ...     ...   ...         ...       ...   ...
gene J    yJ,1    ...   yJ,k        yJ,k+1    ...   yJ,n

Here we consider gene-level expression for simplicity, but most of the concepts are generalizable to transcript-level differential expression.

Statistical challenges in RNA-seq data analysis

1 Statistical inference

• technical and biological variability
• generalizability / interpretation of the results

2 Exploratory data analysis (EDA)

3 Normalization

4 Choice of the statistical model

• Negative binomial approach: dispersion estimation
• Test for differential expression: two-class comparisons + more general designs

5 Exploration and interpretation of the results

Data availability

All the data that I will use in this presentation are available in R data packages.

• Yeast dataset (Risso et al. 2011): github.com/drisso/yeastRNASeqRisso2011

• Zebrafish dataset (Ferreira et al. 2014): zebrafishRNASeq (Bioconductor)

• SEQC dataset (MAQC/SEQC consortium, 2014): seqc (Bioconductor)

https://github.com/drisso/yeastRNASeqRisso2011
http://www.bioconductor.org/packages/release/data/experiment/html/zebrafishRNASeq.html
http://www.bioconductor.org/packages/release/data/experiment/html/seqc.html

Yeast dataset
Risso et al. (2011); Sherlock Lab, Stanford

      Culture/        Library prep.   Growth      Flow-cell
      Library prep.   protocol        condition
 1    Y1              Protocol 1      YPD         428R1
 2    Y1              Protocol 1      YPD         4328B
 3    Y2              Protocol 1      YPD         428R1
 4    Y2              Protocol 1      YPD         4328B
 5    Y7              Protocol 1      YPD         428R1
 6    Y7              Protocol 1      YPD         4328B
 7    Y4              Protocol 2      YPD         61MKN
 8    Y4              Protocol 2      YPD         61MKN
 9    D1              Protocol 1      Del         428R1
10    D2              Protocol 1      Del         428R1
11    D7              Protocol 1      Del         428R1
12    G1              Protocol 2      Gly         6247L
13    G2              Protocol 1      Gly         62OAY
14    G3              Protocol 1      Gly         62OAY

Zebrafish dataset
Ferreira et al. (2014); Ngai Lab, UC Berkeley

[Figure: experimental design. Samples Ctl. 3, Ctl. 1, Ctl. 5, Trt. 9, Trt. 11, Trt. 13; Day 1, Day 2, Day 3; Run 1 and Run 2, 1 multiplex lane each.]

2 sample types x 3 lib. prep. x 2 runs = 12 samples

92 negative controls (ERCC Mix 1)

MAQC/SEQC dataset
MAQC/SEQC consortium; Nature Biotechnology 32 (2014)

[Figure: experimental design. Libraries A1, A2, A3, A4, B1, B2, B3, B4; ERCC Mix 1 and ERCC Mix 2; flow-cells F1 and F2, 8 multiplex lanes each.]

2 sample types x 4 lib. prep. x 2 flow-cells x 8 lanes = 128 samples

23 negative controls, 69 positive controls

Statistical inference


“The objective of statistics is to make inferences (predictions, decisions) about a population based on information contained in a sample.” (Mendenhall 1987).

In RNA-seq, we want to infer the true (relative) expression of a gene in a particular tissue in a given population (e.g., American women with breast cancer), by measuring it in a sample of that population.


Useful to always ask the questions

• What is the reference population? (e.g., European adults, male C57BL/6 mice [in my lab])

• Is my sample representative of the population?

Intuitively, the larger the sample size the better I can estimate the true expression.

Technical and Biological variability
Hansen et al. (2011)

• Gene expression is a stochastic process: any two individuals of the same population will have varying levels of expression of any gene (biological variability).

• In addition, we often observe measurement errors: these lead to technical variability.

Var(Yg) = Difference between groups + Biological variability + Technical variability

• Difference between groups: the quantity of interest.
• Technical variability: can be estimated with technical replicates (Poisson noise).
• Biological variability: can be estimated with biological replicates (over-dispersion).

Proper replication and experimental design

Many sources of technical and biological variability can confound the inference on the quantity of interest (e.g., batch effects, lab effects, population structure, ...)

Normalization can alleviate the problem, but cannot make up for absence of replication and/or poor experimental design.

“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” R. A. Fisher

Exploratory Data Analysis (EDA)

Before fitting any model, it is essential to carefully explore the data.

EDA can reveal:

• Need for normalization (difference in read count distributions).
• Presence of outlying samples.
• Batch effects and other unwanted variation.
• Data recording errors.
• Data duplication.
• Other systematic effects.

Some useful plots for QC/EDA

There are some standard and easy plots that one can use to routinely explore the data prior to differential expression analysis. These include (but are not limited to):

• Boxplots: to compare distribution of replicate samples.

• Relative Log Expression (RLE) plots: identify outliers and/or batch effects.

• Principal Component Analysis (PCA) plots: identify strongest sources of variation.

• Hierarchical clustering: identify duplicate samples, evaluate technical, biological replicates.

• Mean-difference plots: compare technical and biological variability.

Boxplots
From J. H. Bullard, K. D. Hansen, and M. Taub (2008).

[Figure: Anatomy of a boxplot, marking the quartiles q0.25, q0.5, q0.75, the whisker ends A and B, and the outliers.]

$$r \equiv |q_{0.75} - q_{0.25}|$$
$$A \equiv \inf\{x_i : x_i > q_{0.25} - 1.5r\}$$
$$B \equiv \sup\{x_i : x_i < q_{0.75} + 1.5r\}$$
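The quantities r, A, and B defined above can be computed directly from a sample. The following is a minimal sketch using only Python's standard library; the function name `boxplot_stats` and the toy data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since plotting software may use slightly different quartile definitions.

```python
from statistics import quantiles

def boxplot_stats(x):
    """Return quartiles, whisker ends A and B, and outliers for a sample x."""
    q25, q50, q75 = quantiles(x, n=4, method="inclusive")
    r = abs(q75 - q25)                        # interquartile range
    lo, hi = q25 - 1.5 * r, q75 + 1.5 * r     # limits beyond which points are outliers
    A = min(v for v in x if v > lo)           # smallest observation above the lower limit
    B = max(v for v in x if v < hi)           # largest observation below the upper limit
    outliers = sorted(v for v in x if v <= lo or v >= hi)
    return {"q25": q25, "q50": q50, "q75": q75, "A": A, "B": B, "outliers": outliers}

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]      # hypothetical toy data
summary = boxplot_stats(values)
print(summary)
```

In this toy sample the value 30 lies above the upper limit, so it would be drawn as an individual outlier point while the upper whisker ends at 9.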

Boxplots highlight the need for normalization...
Yeast dataset

[Figure: boxplots of unnormalized read counts for samples Y1_1, Y1_2, Y2_1, Y2_2, Y7_1, Y7_2, Y4_1, Y4_2, D1, D2, D7, G1, G2, G3.]

... and show that normalization worked as promised!

[Figure: boxplots for the same samples after upper-quartile normalization.]

Relative Log Expression (RLE)

• Particularly useful transformation of read counts.

• RLE is defined, for each gene, as the log-ratio of a read count to the median read count across samples.

• Comparable samples should have similar RLE distributions, centered around zero.

• Unusual RLE distributions could reveal problematic samples or batch effects.
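The RLE definition above can be sketched in a few lines. This is an illustration rather than the EDASeq implementation: the function name `rle`, the optional pseudo-count, the use of the natural logarithm, and the toy matrix are all assumptions.

```python
from math import log
from statistics import median

def rle(counts, pseudo=1.0):
    """counts: list of genes, each a list of read counts (one per sample).
    Returns, per gene, the log-ratio of each count to the gene's median count."""
    out = []
    for gene in counts:
        shifted = [c + pseudo for c in gene]   # pseudo-count avoids log(0); an assumption
        med = median(shifted)
        out.append([log(c / med) for c in shifted])
    return out

# Hypothetical toy matrix: 3 genes x 4 samples; sample 4 was sequenced about twice as deep
toy = [[10, 12, 11, 22], [100, 95, 105, 200], [5, 5, 6, 11]]
rle_values = rle(toy, pseudo=0.0)

# One RLE distribution per sample: summarize each column by its median
sample_medians = [median(col) for col in zip(*rle_values)]
print(sample_medians)
```

The first three samples have RLE medians close to zero, while the fourth is shifted upwards, which is the kind of pattern an RLE boxplot makes visible.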

RLE plots can reveal “hidden” features

[Figure: RLE boxplots after upper-quartile normalization for samples Y1_1, Y1_2, Y2_1, Y2_2, Y7_1, Y7_2, Y4_1, Y4_2, D1, D2, D7, G1, G2, G3.]

Principal Component Analysis (PCA)

• PCA can be used as a way to visualize the sources of variation in the data.

• PCA is a statistical procedure that looks for a small set of linear combinations of the original variables to summarize the data, losing as little information as possible.

• These linear combinations are called principal components (PCs):
  • the first PC is the weighted average of the gene expression measures that gives the highest variance across all samples.
  • Each succeeding component in turn has the highest variance possible under the constraint that it is uncorrelated with the preceding components.
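To make the idea of "the weighted average with the highest variance" concrete, here is a small sketch that finds the first PC by power iteration on the gene-by-gene covariance matrix. It is only illustrative: real analyses would rely on an SVD routine, and the function names and toy data are hypothetical.

```python
def first_pc(samples, n_iter=500):
    """samples: list of samples, each a list of expression values (one per gene).
    Returns (weights, scores) of the first principal component via power iteration."""
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    # Sample covariance matrix between genes (p x p)
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    w = [1.0] * p                                   # arbitrary starting vector
    for _ in range(n_iter):
        v = [sum(cov[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in v) ** 0.5
        w = [x / norm for x in v]                   # keep the weights at unit length
    scores = [sum(r[j] * w[j] for j in range(p)) for r in centered]
    return w, scores

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical toy data: 4 samples x 3 genes (genes 1 and 2 vary together)
toy = [[2.0, 1.9, 0.5], [4.0, 4.1, 0.4], [6.0, 5.8, 0.6], [8.0, 8.2, 0.5]]
weights, scores = first_pc(toy)
print(weights, variance(scores))
```

The variance of the PC1 scores is at least as large as the variance of any single gene, because no unit-length weighting of the genes can do better than the first PC.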

PCA plots can reveal batch effects...
TCGA data, Anne Biton

... problematic data...
Zebrafish dataset

[Figure: PCA plot (PC1 vs. PC2) of samples Ctl1, Ctl3, Ctl5, Trt9, Trt11, Trt13.]

... and well-behaved data
Yeast dataset

[Figure: PCA plot (PC1 vs. PC2) of samples Y1_1, Y1_2, Y2_1, Y2_2, Y7_1, Y7_2, Y4_1, Y4_2, D1, D2, D7, G1, G2, G3.]

Hierarchical clustering highlights structure in the data...
Yeast dataset

[Figure: cluster dendrogram of the yeast data (hclust, complete linkage; vertical axis: Height), with leaves in the order D1, D2, D7, Y2_1, Y2_2, Y4_1, Y4_2, Y7_1, Y7_2, Y1_1, Y1_2, G2, G1, G3.]

... and can identify duplicate samples
Single-cell RNA-seq, Russell Fletcher

[Figure: cluster dendrogram of the single-cell samples (hclust, complete linkage; vertical axis: Height).]

Mean-difference plots (a.k.a. MA-plots)
Yeast data

[Figure: mean-difference plot for technical replicates (horizontal axis: mean; vertical axis: difference).]

[Figure: mean-difference plot for biological replicates (horizontal axis: mean; vertical axis: difference).]

Normalization

Normalization is essential to ensure that observed differences in expression measures between samples and/or genes are truly due to biology and not technical artifacts.

We distinguish between two types of effects on read counts and corresponding normalization procedures.

1 Within-sample normalization adjusts for gene-specific (and possibly sample-specific) effects, e.g., gene length or GC-content (Risso et al. 2011).

2 Between-sample normalization adjusts for distributional differences in read counts between samples, e.g., lane sequencing depth or barcode efficiency (Bullard et al. 2010; Risso et al. 2014).

Between-sample normalization
Bullard et al. (2010), Risso et al. (2014)

We group the between-sample normalization methods in global scaling and non-linear approaches.

In global scaling normalization, gene-level counts are scaled by a single factor per sample.

• Total-count (TC), as in RPKM (Mortazavi et al. 2008), still widely used.

• Upper-quartile (UQ, Bullard et al. 2010).

• Trimmed Mean of M values (TMM, Robinson and Oshlack 2010).

• DESeq normalization (Anders and Huber 2010).
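As an illustration of "a single factor per sample", the sketch below computes total-count and upper-quartile scaling factors for a toy matrix. It is a simplified variant, not the EDASeq or edgeR code: the function names are hypothetical, rescaling the factors to mean 1 and the quartile convention are assumptions, and TMM and DESeq normalization are not shown.

```python
from statistics import quantiles

def total_count_factors(counts):
    """counts: list of samples, each a list of gene counts. One factor per sample."""
    totals = [sum(s) for s in counts]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]

def upper_quartile_factors(counts):
    """Upper quartile of each sample's counts, computed over genes that have
    at least one read in some sample, rescaled to have mean 1."""
    keep = [j for j in range(len(counts[0])) if any(s[j] > 0 for s in counts)]
    uqs = [quantiles([s[j] for j in keep], n=4, method="inclusive")[2] for s in counts]
    mean_uq = sum(uqs) / len(uqs)
    return [u / mean_uq for u in uqs]

def scale(counts, factors):
    """Divide every count in a sample by that sample's scaling factor."""
    return [[c / f for c in s] for s, f in zip(counts, factors)]

# Hypothetical toy data: 2 samples x 5 genes; sample 2 is sample 1 at twice the depth
toy = [[0, 10, 20, 30, 40], [0, 20, 40, 60, 80]]
tc = total_count_factors(toy)
uq = upper_quartile_factors(toy)
normalized = scale(toy, uq)
print(tc, uq, normalized)
```

Because the only difference between the two toy samples is sequencing depth, both methods give the second sample a factor twice as large as the first, and the scaled counts coincide.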

Between-sample normalization
Bullard et al. (2010), Risso et al. (2014)

Non-linear normalization

• Full-quantile normalization (FQ, Bullard et al. 2010).

• Remove Unwanted Variation (RUV, Risso et al. 2014).
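A rough sketch of the full-quantile idea: every sample is forced to share the same distribution by replacing its sorted values with a common reference. Here the reference is the mean of the sorted values at each rank, which is one common choice and an assumption on my part, as are the function name and the tie-breaking rule; RUV is not shown.

```python
def full_quantile(counts):
    """counts: list of samples, each a list of gene counts (same length).
    Each value is replaced by the across-sample mean of the values at the same rank.
    Ties are broken by original gene order (an assumption)."""
    n_genes = len(counts[0])
    sorted_cols = [sorted(s) for s in counts]
    reference = [sum(col[r] for col in sorted_cols) / len(counts)
                 for r in range(n_genes)]
    normalized = []
    for s in counts:
        order = sorted(range(n_genes), key=lambda j: s[j])   # gene indices by rank
        out = [0.0] * n_genes
        for rank, j in enumerate(order):
            out[j] = reference[rank]
        normalized.append(out)
    return normalized

toy = [[5, 2, 3], [4, 1, 9]]    # hypothetical: 2 samples x 3 genes
fq = full_quantile(toy)
print(fq)
```

After normalization the two samples contain exactly the same set of values, only assigned to different genes, which is why the result is not a simple function of the original counts.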

Global scaling or non-linear?

Global scaling

• Simple and flexible.

• Extensible: once determined somewhat robustly based on a given set of genes, the same scaling factors can be applied to other features.

• Can be insufficient for large differences in distributions between libraries.

Non-linear normalization

• Does not involve singling out a parameter.
• Can handle very different distributions between libraries.
• Can account for possibly more than sequencing depth.
• Not a simple function of original data; not easily interpretable.

The optimal normalization method is often dependent on the dataset -> EDA!

Statistical model

The data: gene-level expression summaries

A matrix of gene expression levels (J genes × n samples)

          Condition 1               Condition 2
          Rep 1   ...   Rep k       Rep k+1   ...   Rep n
gene 1    y1,1    ...   y1,k        y1,k+1    ...   y1,n
gene 2    y2,1    ...   y2,k        y2,k+1    ...   y2,n
...       ...     ...   ...         ...       ...   ...
gene J    yJ,1    ...   yJ,k        yJ,k+1    ...   yJ,n

Choice of the statistical model: Poisson

The Poisson model is the simplest model for count data, hence a natural choice in this context.

The main assumption of this model is that the mean is equal to the variance.

In RNA-seq data, technical variation follows a Poisson noise structure, but biological variation is higher, leading to overdispersion.

Mean variance relationship: technical replicates
MAQC/SEQC data (sample A)

[Figure: variance versus mean.]

Mean variance relationship: biological replicates
Zebrafish data (control samples)

[Figure: variance versus mean.]

Choice of the statistical model: negative binomial

One way to account for overdispersion is to model the data as negative binomial.

The negative binomial is a generalization of the Poisson model that accounts for excess variability.

The negative binomial model can be described in terms of its mean and variance.

$$E[Y] = \mu; \quad \mathrm{Var}(Y) = \mu + \phi\mu^2.$$

• The variance is a quadratic function of the mean.
• φ is referred to as the dispersion parameter.
• The Poisson case is recovered when φ = 0.
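The mean-variance relationship above can be checked by simulation. The sketch below draws negative binomial counts as a gamma-Poisson mixture (Poisson counts whose rate itself varies between replicates), which is one standard way to obtain this distribution; the helper names, the seed, and the values of µ and φ are hypothetical.

```python
import random
from math import exp

def rpois(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for moderate lam)."""
    limit, k, prod = exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def rnbinom(mu, phi, rng):
    """Negative binomial draw as a gamma-Poisson mixture with mean mu, dispersion phi."""
    lam = rng.gammavariate(1.0 / phi, mu * phi)   # gamma: mean mu, variance phi * mu**2
    return rpois(lam, rng)

rng = random.Random(1)                  # fixed seed for reproducibility
mu, phi = 10.0, 0.5                     # hypothetical values
draws = [rnbinom(mu, phi, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((y - mean) ** 2 for y in draws) / (len(draws) - 1)
expected_var = mu + phi * mu ** 2       # variance implied by the formula above
print(mean, var, expected_var)
```

With these values the sample variance should be close to 60, far above the value of 10 that a Poisson model with the same mean would imply.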

Negative binomial models for RNA-seq data analysis

Due to its simplicity, the negative binomial is one of the most widely used models for analyzing RNA-seq data.

Popular approaches include edgeR (Robinson et al. 2010), DESeq (Anders and Huber 2010), DESeq2 (Love et al. 2014), baySeq (Hardcastle and Kelly 2010), and others.

Dispersion estimation

Due to the usually small sample size, estimating the dispersion parameter φ is challenging.

One strategy that has proven effective is to “shrink” the gene-specific dispersion estimates towards the mean across genes.

This strategy, known as shrinkage estimation, makes it possible to borrow strength across the thousands of genes, and leads to better estimates.

Illustration of Shrinkage Estimation
Love et al. (2014)

Dispersion estimation with edgeR
Yeast data

[Figure: Mean-Variance Plot (horizontal axis: Mean Expression, Log10 Scale; vertical axis: Variance, Log10 Scale).]

A Generalized Linear Model (GLM) approach

GLMs are a flexible and powerful way to model RNA-seq data (McCullagh and Nelder 1989). In particular, for each gene j, we have

$$\log E[Y_j \mid X] = X\beta_j,$$

where X is a known matrix, called the design matrix, and βj is a vector of parameters to be estimated.


In a typical two-class differential expression study (e.g., the zebrafish data), X is

# zfX is assumed to be a two-level factor (3 controls, 3 treated), e.g.:
# zfX <- factor(rep(c("Ctl", "Trt"), each = 3))
model.matrix(~zfX)
##   (Intercept) zfXTrt
## 1           1      0
## 2           1      0
## 3           1      0
## 4           1      1
## 5           1      1
## 6           1      1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$zfX
## [1] "contr.treatment"


In this case, βj has two components and (dropping the index j) we can write the model as

$$\log E[Y \mid X] = \log\mu = \beta_0 + \beta_1 x,$$

where x is the second column of X. Moreover,

$$\log E[Y \mid x = 0] = \beta_0$$
$$\log E[Y \mid x = 1] = \beta_0 + \beta_1$$

hence β1 can be interpreted as the log-fold-change between the two classes:

$$\beta_1 = \log \frac{E[Y \mid x = 1]}{E[Y \mid x = 0]}$$
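A small numerical illustration of this interpretation, under simplifying assumptions: for a Poisson model with a 0/1 group indicator and no offsets (no normalization factors), the maximum-likelihood fitted means are the two group means, so β0 and β1 can be written down directly. The function name and the counts are hypothetical; edgeR fits a negative binomial GLM with offsets instead.

```python
from math import log

def two_class_fit(y, x):
    """Estimates (beta0, beta1) for log E[Y|x] = beta0 + beta1 * x when x is a
    0/1 group indicator and there are no offsets: the fitted group means equal
    the sample means of the two groups."""
    y0 = [yi for yi, xi in zip(y, x) if xi == 0]
    y1 = [yi for yi, xi in zip(y, x) if xi == 1]
    mean0, mean1 = sum(y0) / len(y0), sum(y1) / len(y1)
    return log(mean0), log(mean1 / mean0)

x = [0, 0, 0, 1, 1, 1]            # second column of the design matrix
y = [10, 12, 8, 40, 35, 45]       # hypothetical counts for one gene
beta0, beta1 = two_class_fit(y, x)
print(beta0, beta1)
```

Here the treated group has four times the mean count of the control group, so β1 equals log 4 on the natural-log scale; software often reports the same quantity in base 2.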

Test for differential expression

In this GLM framework, testing for differential expression corresponds to testing the null hypothesis H0 : β1 = 0.

One popular approach is the Likelihood Ratio Test (LRT), which has the advantage of generalizing to more complex designs (e.g., multi-class comparisons, time-course, ...).

The LRT exploits the idea that the “correct” model will fit the data better (i.e., it will have the higher value for the likelihood function).

In other words, we test if the model with β1 ≠ 0 fits the data “significantly better” than the one with β1 = 0.
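The comparison of the two fits can be sketched for a single gene. To keep the example short it uses a Poisson likelihood rather than the negative binomial used by edgeR, and it relies on the usual large-sample approximation that twice the log-likelihood difference follows a chi-square distribution with one degree of freedom under H0. Function names and counts are hypothetical.

```python
from math import log, erfc, sqrt

def poisson_loglik(y, mu):
    """Poisson log-likelihood up to a constant that does not depend on mu."""
    return sum(yi * log(mu_i) - mu_i for yi, mu_i in zip(y, mu))

def lrt_two_class(y, x):
    """Likelihood ratio statistic and p-value for H0: beta1 = 0 (x is 0/1)."""
    n = len(y)
    overall = sum(y) / n                                   # fitted mean under H0
    m0 = sum(yi for yi, xi in zip(y, x) if xi == 0) / x.count(0)
    m1 = sum(yi for yi, xi in zip(y, x) if xi == 1) / x.count(1)
    ll_null = poisson_loglik(y, [overall] * n)
    ll_full = poisson_loglik(y, [m1 if xi == 1 else m0 for xi in x])
    stat = 2.0 * (ll_full - ll_null)
    p_value = erfc(sqrt(max(stat, 0.0) / 2.0))             # chi-square, 1 df, upper tail
    return stat, p_value

x = [0, 0, 0, 1, 1, 1]
stat_de, p_de = lrt_two_class([10, 12, 8, 40, 35, 45], x)      # hypothetical DE gene
stat_null, p_null = lrt_two_class([10, 12, 8, 11, 9, 10], x)   # hypothetical non-DE gene
print(stat_de, p_de, stat_null, p_null)
```

For the first gene the two-mean model fits far better than the one-mean model, giving a large statistic and a tiny p-value; for the second gene the two group means coincide, so the statistic is zero and the p-value is one.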

p-values and multiple testing

The result of the testing procedure is often a p-value, i.e., the probability, under the null hypothesis, of obtaining a result as extreme as, or more extreme than, the one observed.

[Figure: standard normal density, dnorm(x), for x between −3 and 3, illustrating a p-value of 0.05.]

p-values and multiple testing

Under the null hypothesis, the p-values are uniformly distributed between 0 and 1.

If one performs only one test, it is customary to reject the null hypothesis for p < 0.05.

When performing 10,000 tests (one per gene), if the null hypothesis holds for every gene we expect about 5% of them (500 genes!) to have p < 0.05, just by chance!

Hence, we need to account for the “multiplicity of the test”.
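The 500-genes figure is easy to reproduce by simulation: draw 10,000 uniform p-values, as expected when no gene is differentially expressed, and count how many fall below 0.05. The seed is arbitrary and the variable names are hypothetical.

```python
import random

rng = random.Random(2015)          # fixed seed; the value is arbitrary
n_genes = 10_000

# Under the null hypothesis, p-values are uniform between 0 and 1
p_values = [rng.random() for _ in range(n_genes)]

false_positives = sum(p < 0.05 for p in p_values)
expected = 0.05 * n_genes          # 500 on average
print(false_positives, expected)
```

The simulated count fluctuates around 500 from run to run, and every one of those genes would be a false discovery.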

Correcting for multiple testing

When dealing with multiple hypothesis testing, one needs to control a suitable Type I error rate, e.g., the family-wise error rate (FWER), or the false discovery rate (FDR).

There are simple procedures to adjust the p-values for multiplicity, e.g., the Bonferroni procedure to control the FWER or the Benjamini-Hochberg procedure to control the FDR (Benjamini and Hochberg 1995).
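Both adjustments fit in a few lines. The sketch below is a plain-Python illustration with hypothetical function names and p-values; in R the same adjusted values are available through `p.adjust`.

```python
def bonferroni(p_values):
    """Bonferroni adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):               # from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p = [0.001, 0.008, 0.039, 0.041, 0.2]          # hypothetical p-values
bonf = bonferroni(p)
bh = benjamini_hochberg(p)
print(bonf)
print(bh)
```

The Benjamini-Hochberg values are never larger than the Bonferroni ones, reflecting that controlling the FDR is a less stringent requirement than controlling the FWER.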

Exploration of the results

Distribution of the p-values

If the model is correctly specified, we expect the p-values to come from two distributions:

• uniform between 0 and 1 for the non-DE genes;
• and (almost) 0 for the DE genes.

Departures from this pattern are an indication that there is something wrong with the model (presence of an additional source of variation, lack of proper normalization, ...).

Good p-value distribution
Toy example

[Figure: histogram of p-values (horizontal axis: x, from 0 to 1; vertical axis: Frequency).]

Bad p-value distribution
Toy example

[Figure: histogram of p-values (horizontal axis: x, from 0 to 1; vertical axis: Frequency).]

p-value distribution
Zebrafish dataset

[Figure: histogram of top$PValue (horizontal axis from 0 to 1; vertical axis: Frequency).]

Volcano plots
Zebrafish dataset

[Figure: volcano plot (horizontal axis: logFC; vertical axis: −log10(p-value)).]

Heatmap of DE genes
Yeast dataset

[Figure: heatmap of the top 100 DE genes (rows) across samples G2, G3, G1, D1, D7, D2, Y2_1, Y1_1, Y4_1, Y7_1 (columns); color key: row Z-score, from −1 to 1.]

Software

All analyses performed within the open source R/Bioconductor framework.

• EDASeq: visualization and normalization.
• edgeR: differential expression.

Rmarkdown available to reproduce all the analyses in this presentation.

Acknowledgements

UC Berkeley

• Sandrine Dudoit
• Terry Speed
• John Ngai
• Todd Ferreira
• Russell Fletcher

Stanford

• Gavin Sherlock

UCSF

• Anne Biton

References

S. Anders and W. Huber, Genome Biol 11, (2010).

Y. Benjamini and Y. Hochberg, Journal of the Royal Statistical Society. Series B (Methodological) 57, 289 (1995).

J. Bullard, E. Purdom, K. Hansen, and S. Dudoit, BMC Bioinformatics 11, 94 (2010).

T. Ferreira, S. R. Wilson, J. Choi, D. Risso, S. Dudoit, T. P. Speed, and J. Ngai, Neuron 81, 847 (2014).

T. Hardcastle and K. Kelly, BMC Bioinformatics 11, 422 (2010).

M. I. Love, W. Huber, and S. Anders, Genome Biology 15, 550 (2014).

P. McCullagh and J. A. Nelder, Generalized Linear Models (London: Chapman & Hall, 1989).

W. Mendenhall, Introduction to Probability and Statistics, 7th ed. (Duxbury Press, 1987).

A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, Nat Methods 5, 621 (2008).

D. Risso, J. Ngai, T. P. Speed, and S. Dudoit, Nature Biotechnology 32, 896 (2014).

D. Risso, K. Schwartz, G. Sherlock, and S. Dudoit, BMC Bioinformatics 12, 480 (2011).

M. D. Robinson, D. J. McCarthy, and G. K. Smyth, Bioinformatics 26, 139 (2010).

M. D. Robinson and A. Oshlack, Genome Biol 11, (2010).
