Statistical Analysis of RNA-seq Davide Risso Division of Biostatistics, School of Public Health University of California, Berkeley IIHG Bioinformatics Summer Course May 18–19, 2015


  • Statistical Analysis of RNA-seq

    Davide Risso

    Division of Biostatistics, School of Public Health

    University of California, Berkeley

    IIHG Bioinformatics Summer Course, May 18–19, 2015

  • A typical RNA-seq pipeline

    Biological Question

    Experimental Design

    Experiment (Microarray/RNA-Seq)

    Pre-processing (Read mapping, Expression quantitation, ...)

    Normalization

    Clustering, Classification, Estimation, Testing, ...

  • Gene-level expression summaries

    A matrix of gene expression levels (J genes × n samples)

                 Condition 1                       Condition 2
              Rep 1    ...   Rep k       Rep k+1    ...   Rep n
    gene 1    y_{1,1}  ...   y_{1,k}     y_{1,k+1}  ...   y_{1,n}
    gene 2    y_{2,1}  ...   y_{2,k}     y_{2,k+1}  ...   y_{2,n}
    ...       ...      ...   ...         ...        ...   ...
    gene J    y_{J,1}  ...   y_{J,k}     y_{J,k+1}  ...   y_{J,n}

    Here we consider gene-level expression for simplicity, but most of the concepts are generalizable to transcript-level differential expression.

  • Statistical challenges in RNA-seq data analysis

    1 Statistical inference

      • technical and biological variability
      • generalizability / interpretation of the results

    2 Exploratory data analysis (EDA)

    3 Normalization

    4 Choice of the statistical model

      • Negative binomial approach: dispersion estimation
      • Test for differential expression: two-class comparisons + more general designs

    5 Exploration and interpretation of the results

  • Data availability

    All the data that I will use in this presentation are available in R data packages.

    • Yeast dataset (Risso et al. 2011): https://github.com/drisso/yeastRNASeqRisso2011

    • Zebrafish dataset (Ferreira et al. 2014): zebrafishRNASeq (Bioconductor), http://www.bioconductor.org/packages/release/data/experiment/html/zebrafishRNASeq.html

    • SEQC dataset (MAQC/SEQC consortium, 2014): seqc (Bioconductor), http://www.bioconductor.org/packages/release/data/experiment/html/seqc.html

  • Yeast dataset

    Risso et al. (2011); Sherlock Lab, Stanford

          Culture/library prep.   Library prep. protocol   Growth condition   Flow-cell
     1    Y1                      Protocol 1               YPD                428R1
     2    Y1                      Protocol 1               YPD                4328B
     3    Y2                      Protocol 1               YPD                428R1
     4    Y2                      Protocol 1               YPD                4328B
     5    Y7                      Protocol 1               YPD                428R1
     6    Y7                      Protocol 1               YPD                4328B
     7    Y4                      Protocol 2               YPD                61MKN
     8    Y4                      Protocol 2               YPD                61MKN
     9    D1                      Protocol 1               Del                428R1
    10    D2                      Protocol 1               Del                428R1
    11    D7                      Protocol 1               Del                428R1
    12    G1                      Protocol 2               Gly                6247L
    13    G2                      Protocol 1               Gly                62OAY
    14    G3                      Protocol 1               Gly                62OAY

  • Zebrafish dataset

    Ferreira et al. (2014); Ngai Lab, UC Berkeley

    Ctl. 3 Ctl. 1 Ctl. 5 Trt. 9 Trt. 11 Trt. 13

    Day 1 Day 2 Day 3

    Run 1 1 multiplex lane

    Run 2 1 multiplex lane

    2 sample types x 3 lib. prep. x 2 runs = 12 samples

    92 negative controls ERCC Mix 1

  • MAQC/SEQC dataset

    MAQC/SEQC consortium; Nature Biotechnology 32 (2014)

    ERCC Mix 1

    A1 A2 A3 A4 B1 B2 B3 B4

    ERCC Mix 2

    Flow-cell F1 8 multiplex lanes

    Flow-cell F2 8 multiplex lanes

    2 sample types x 4 lib. prep. x 2 flow-cells x 8 lanes = 128 samples

    23 negative controls 69 positive controls

  • Statistical inference

  • Statistical inference

    “The objective of statistics is to make inferences (predictions, decisions) about a population based on information contained in a sample.” (Mendenhall 1987).

    In RNA-seq, we want to infer the true (relative) expression of a gene in a particular tissue in a given population (e.g., American women with breast cancer), by measuring it in a sample of that population.

  • Statistical inference

    It is always useful to ask the following questions:

    • What is the reference population? (e.g., European adults, male C57BL/6 mice [in my lab])

    • Is my sample representative of the population?

    Intuitively, the larger the sample size, the better I can estimate the true expression.

  • Technical and Biological variability

    Hansen et al. (2011)

    • Gene expression is a stochastic process: any two individuals of the same population will have varying levels of expression of any gene (biological variability).

    • In addition, we often observe measurement errors: these lead to technical variability.

    $\mathrm{Var}(Y_g) = \text{Difference between groups} + \text{Biological variability} + \text{Technical variability}$

    • Difference between groups: the quantity of interest.
    • Technical variability: can be estimated with technical replicates (Poisson noise).
    • Biological variability: can be estimated with biological replicates (over-dispersion).

  • Proper replication and experimental design

    Many sources of technical and biological variability can confound the inference on the quantity of interest (e.g., batch effects, lab effects, population structure, . . . )

    Normalization can alleviate the problem, but cannot make up for absence of replication and/or poor experimental design.

    “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.” R. A. Fisher

  • Exploratory Data Analysis (EDA)

  • Exploratory Data Analysis (EDA)

    Before fitting any model, it is essential to carefully explore the data.

    EDA can reveal:

    • Need for normalization (difference in read count distributions).
    • Presence of outlying samples.
    • Batch effects and other unwanted variation.
    • Data recording errors.
    • Data duplication.
    • Other systematic effects.

  • Some useful plots for QC/EDA

    There are some standard and easy plots that one can use to routinely explore the data prior to differential expression analysis. These include (but are not limited to):

    • Boxplots: to compare distribution of replicate samples.
    • Relative Log Expression (RLE) plots: identify outliers and/or batch effects.
    • Principal Component Analysis (PCA) plots: identify strongest sources of variation.
    • Hierarchical clustering: identify duplicate samples, evaluate technical, biological replicates.
    • Mean-difference plots: compare technical and biological variability.
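
As one illustration of the plots listed above, hierarchical clustering can flag duplicated samples because they merge at height zero. The sketch below uses base R only; the simulated matrix `expr` and the sample names are hypothetical and not taken from the datasets in this presentation.

```r
set.seed(10)
# Toy log-expression matrix: 100 genes x 4 samples (simulated values).
expr <- matrix(rnorm(100 * 4), nrow = 100,
               dimnames = list(NULL, paste0("s", 1:4)))
# Add a fifth sample, "dup", that is an exact copy of "s1".
expr <- cbind(expr, dup = expr[, "s1"])

# Euclidean distances between samples (dist works on rows, hence the transpose).
d <- dist(t(expr))

# Complete-linkage clustering; the first merge joins the closest pair of samples.
hc <- hclust(d, method = "complete")
print(hc$labels[abs(hc$merge[1, ])])  # the two samples joined first
print(hc$height[1])                   # height of that first merge
```

Calling `plot(hc)` would draw the dendrogram; exact duplicates appear as a pair of leaves joined at the bottom of the tree.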

  • Boxplots

    From J. H. Bullard, K. D. Hansen, and M. Taub (2008).

    [Figure: anatomy of a boxplot, marking the quartiles $q_{0.25}$, $q_{0.5}$, $q_{0.75}$, the whiskers A and B, and the outliers.]

    $r \equiv |q_{0.75} - q_{0.25}|$

    $A \equiv \inf\{x_i : x_i > q_{0.25} - 1.5r\}$

    $B \equiv \sup\{x_i : x_i < q_{0.75} + 1.5r\}$
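
The whisker definitions above can be sketched in a few lines of base R. This is a minimal illustration: the helper name `whiskers` and the toy vector are hypothetical, and the quartiles use the default rule of `quantile()`, so values may differ slightly under other quantile conventions.

```r
# Hypothetical helper: compute r, A and B as defined on the slide.
whiskers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  r <- abs(q[2] - q[1])              # interquartile range
  A <- min(x[x > q[1] - 1.5 * r])    # lowest point inside the lower fence
  B <- max(x[x < q[2] + 1.5 * r])    # highest point inside the upper fence
  list(r = r, A = A, B = B, outliers = x[x < A | x > B])
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)  # toy data with one extreme value
w <- whiskers(x)
print(w)
```

Points outside the interval [A, B] are the ones a boxplot draws individually as outliers.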

  • Boxplots highlight the need for normalization. . .

    Yeast dataset

    [Figure: boxplots of unnormalized read counts for samples Y1_1, Y1_2, Y2_1, Y2_2, Y7_1, Y7_2, Y4_1, Y4_2, D1, D2, D7, G1, G2, G3.]

  • . . . and show that normalization worked as promised!

    [Figure: boxplots of read counts after upper-quartile normalization for samples Y1_1, Y1_2, Y2_1, Y2_2, Y7_1, Y7_2, Y4_1, Y4_2, D1, D2, D7, G1, G2, G3.]

  • Relative Log Expression (RLE)

    • Particularly useful transformation of read counts.

    • RLE is defined, for each gene, as the log-ratio of a read count to the median read count across samples.

    • Comparable samples should have similar RLE distributions, centered around zero.

    • Unusual RLE distributions could reveal problematic samples or batch effects.
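
The RLE definition above can be sketched in base R as follows. The count matrix is made up, and the +1 pseudocount before taking logs is an assumption added here to avoid log(0); it is not part of the definition on the slide.

```r
# Toy count matrix: 4 genes (rows) x 3 samples (columns); values are made up.
counts <- matrix(c( 10, 20,  40,
                     5,  5,   5,
                   100, 50, 200,
                     8, 16,  32),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Work on the log scale; the pseudocount is an illustrative assumption.
logc <- log(counts + 1)

# RLE: subtract, for each gene, its median log count across samples.
rle_mat <- sweep(logc, 1, apply(logc, 1, median))

# Per-sample medians of the RLE values; comparable samples should be near zero.
print(round(apply(rle_mat, 2, median), 3))
```

Passing `rle_mat` to `boxplot()` would give one box per sample, which is the usual RLE plot.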

  • RLE plots can reveal “hidden” features

    [Figure: RLE boxplots after upper-quartile normalization for samples Y1_1, Y1_2, Y2_1, Y2_2, Y7_1, Y7_2, Y4_1, Y4_2, D1, D2, D7, G1, G2, G3.]

  • Principal Component Analysis (PCA)

    • PCA can be used as a way to visualize the sources of variation in the data.

    • PCA is a statistical procedure that looks for a small set of linear combinations of the original variables to summarize the data losing as little information as possible.

    • These linear combinations are called principal components (PCs):

      • The first PC is the weighted average of the gene expression measures that gives the highest variance across all samples.

      • Each succeeding component in turn has the highest variance possible under the constraint that it is uncorrelated with the preceding components.
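
A minimal sketch of these properties using `prcomp` from base R. The expression matrix is simulated, and the sample names and the size of the group effect are illustrative assumptions.

```r
set.seed(1)
# Toy log-expression matrix: 50 genes x 6 samples (simulated values).
expr <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(NULL, c(paste0("Ctl", 1:3), paste0("Trt", 1:3))))
# Add a simulated group effect to the first 10 genes of the treated samples.
expr[1:10, 4:6] <- expr[1:10, 4:6] + 3

# prcomp expects observations in rows, so samples go in rows (transpose).
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each PC (non-increasing by construction).
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(round(prop_var, 3))

# Sample coordinates on the first two PCs, the usual input to a PCA plot.
print(round(pca$x[, 1:2], 2))
```

Plotting `pca$x[, 1]` against `pca$x[, 2]` and labelling the points by sample gives the kind of display shown on the following slides.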

  • PCA plots can reveal batch effects. . .

    TCGA data, Anne Biton

  • . . . problematic data. . .

    Zebrafish dataset

    [Figure: PCA plot, PC1 vs. PC2, of samples Ctl1, Ctl3, Ctl5, Trt9, Trt11, Trt13.]

  • . . . and well-behaved data

    Yeast dataset

    [Figure: PCA plot, PC1 vs. PC2, of samples Y1_1, Y1_2, Y2_1, Y2_2, Y7_1, Y7_2, Y4_1, Y4_2, D1, D2, D7, G1, G2, G3.]

  • Hierarchical clustering highlights structure in the data. . .

    Yeast dataset

    [Figure: dendrogram of the yeast data, hclust (*, "complete"); y-axis: Height. Leaf order: D1, D2, D7, Y2_1, Y2_2, Y4_1, Y4_2, Y7_1, Y7_2, Y1_1, Y1_2, G2, G1, G3.]

  • . . . and can identify duplicate samples

    Single-cell RNA-seq, Russell Fletcher

    [Figure: cluster dendrogram of the single-cell samples, hclust (*, "complete"); y-axis: Height.]

  • Mean-difference plots (a.k.a. MA-plots)

    Yeast data

    [Figure: mean-difference plot for technical replicates; x-axis: mean, y-axis: difference.]

  • Mean-difference plots (a.k.a. MA-plots)

    Yeast data

    [Figure: mean-difference plot for biological replicates; x-axis: mean, y-axis: difference.]
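
The two axes of a mean-difference plot can be computed as in the sketch below. The counts are made up, and both the log2 scale and the +1 pseudocount are assumptions made here for illustration; the slides do not state which transformation was used.

```r
# Toy counts for the same five genes in two replicate samples (made-up values).
rep1 <- c(12, 150,  8, 900, 45)
rep2 <- c(10, 170, 15, 880, 40)

# Log-transform with a pseudocount (an assumption, to avoid log(0)).
l1 <- log2(rep1 + 1)
l2 <- log2(rep2 + 1)

A <- (l1 + l2) / 2   # x-axis: mean log expression
M <- l1 - l2         # y-axis: difference of log expression (log-ratio)

print(round(cbind(A, M), 3))
```

Plotting `M` against `A` for a pair of technical replicates and for a pair of biological replicates allows the spread around zero to be compared between the two.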

  • Normalization

  • Normalization

    Normalization is essential to ensure that observed differences in expression measures between samples and/or genes are truly due to biology and not technical artifacts.

    We distinguish between two types of effects on read counts and corresponding normalization procedures.

    1 Within-sample normalization adjusts for gene-specific (and possibly sample-specific) effects, e.g., gene length or GC-content (Risso et al. 2011).

    2 Between-sample normalization adjusts for distributional differences in read counts between samples, e.g., lane sequencing depth or barcode efficiency (Bullard et al. 2010; Risso et al. 2014).

  • Between-sample normalization

    Bullard et al. (2010), Risso et al. (2014)

    We group the between-sample normalization methods into global scaling and non-linear approaches.

    In global scaling normalization, gene-level counts are scaled by a single factor per sample.

    • Total-count (TC), as in RPKM (Mortazavi et al. 2008), still widely used.
    • Upper-quartile (UQ, Bullard et al. 2010).
    • Trimmed Mean of M values (TMM, Robinson and Oshlack 2010).
    • DESeq normalization (Anders and Huber 2010).
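
To make "a single factor per sample" concrete, here is a base-R sketch of total-count and upper-quartile scaling on a made-up matrix. Dropping genes with zero counts in every sample before taking the quartile, and rescaling the factors to have mean one, are conventions assumed here; packaged implementations such as EDASeq or edgeR may differ in such details.

```r
# Toy count matrix: 5 genes x 3 samples; sample s2 has exactly twice the depth of s1.
counts <- matrix(c( 10,  20,  10,
                     0,   0,   0,
                    30,  60,  35,
                    50, 100,  45,
                   100, 200, 110),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:5), paste0("s", 1:3)))

# Total-count (TC) factors: library size relative to the mean library size.
tc <- colSums(counts) / mean(colSums(counts))

# Upper-quartile (UQ) factors: 75th percentile of each sample's counts,
# computed after dropping genes with zero counts in every sample.
expressed <- rowSums(counts) > 0
uq_raw <- apply(counts[expressed, ], 2, quantile, probs = 0.75)
uq <- uq_raw / mean(uq_raw)

# Divide each sample (column) by its single scaling factor.
norm_tc <- sweep(counts, 2, tc, "/")
norm_uq <- sweep(counts, 2, uq, "/")
print(round(norm_uq, 1))
```

Because s2 is s1 at twice the depth, both methods make those two columns identical after scaling.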

  • Between-sample normalization

    Bullard et al. (2010), Risso et al. (2014)

    Non-linear normalization

    • Full-quantile normalization (FQ, Bullard et al. 2010).
    • Remove Unwanted Variation (RUV, Risso et al. 2014).
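
The idea behind full-quantile normalization, forcing every sample to share the same distribution of counts, can be sketched as below. The helper `full_quantile` is hypothetical; using the mean of the order statistics as the reference distribution and breaking ties by order of appearance are assumptions made for this illustration.

```r
# Hypothetical helper: give every sample (column) the same empirical distribution.
full_quantile <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank of each gene within its sample
  sorted <- apply(x, 2, sort)                         # each column sorted increasingly
  ref    <- rowMeans(sorted)                          # reference: mean of the order statistics
  out    <- apply(ranks, 2, function(r) ref[r])       # map each rank to the reference value
  dimnames(out) <- dimnames(x)
  out
}

# Toy count matrix: 4 genes x 3 samples (made-up values).
counts <- matrix(c(5, 2, 3,
                   4, 1, 4,
                   3, 4, 6,
                   2, 8, 9),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

fq <- full_quantile(counts)
print(round(fq, 2))
```

After this transformation the sorted values of every column coincide, while the ranking of genes within each sample is preserved.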

  • Global scaling or non-linear?

    Global scaling

    • Simple and flexible.
    • Extensible: once determined somewhat robustly based on a given set of genes, the same scaling factors can be applied to other features.
    • Can be insufficient for large differences in distributions between libraries.

    Non-linear normalization

    • Does not involve singling out a parameter.
    • Can handle very different distributions between libraries.
    • Can account for possibly more than sequencing depth.
    • Not a simple function of original data; not easily interpretable.

    The optimal normalization method often depends on the dataset → EDA!

  • Statistical model

  • The data: gene-level expression summaries

    A matrix of gene expression levels (J genes × n samples)

                 Condition 1                       Condition 2
              Rep 1    ...   Rep k       Rep k+1    ...   Rep n
    gene 1    y_{1,1}  ...   y_{1,k}     y_{1,k+1}  ...   y_{1,n}
    gene 2    y_{2,1}  ...   y_{2,k}     y_{2,k+1}  ...   y_{2,n}
    ...       ...      ...   ...         ...        ...   ...
    gene J    y_{J,1}  ...   y_{J,k}     y_{J,k+1}  ...   y_{J,n}

  • Choice of the statistical model: Poisson

    The Poisson model is the simplest model to deal with count data, hence a natural choice in this context.

    The main assumption of this model is that the mean is equal to the variance.

    In RNA-seq data, technical variation follows a Poisson noise structure, but biological variation is higher, leading to overdispersion.
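
A small simulation can illustrate the contrast between Poisson noise and overdispersion. Everything here is simulated: the mean of 20 and the gamma distribution used to make the underlying expression vary between individuals are illustrative assumptions, not estimates from the datasets in this presentation.

```r
set.seed(42)
n  <- 1e5
mu <- 20

# "Technical replicates": pure Poisson noise around a fixed mean.
tech <- rpois(n, lambda = mu)

# "Biological replicates": the underlying mean varies between individuals
# (gamma-distributed with mean mu), with Poisson noise on top.
lambda_i <- rgamma(n, shape = 5, rate = 5 / mu)
bio <- rpois(n, lambda = lambda_i)

print(c(mean = mean(tech), var = var(tech)))  # variance close to the mean
print(c(mean = mean(bio),  var = var(bio)))   # variance well above the mean
```

The second set of counts has the same mean as the first but a much larger variance, which is the pattern the following two slides show on real data.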

  • Mean variance relationship: technical replicates

    MAQC/SEQC data (sample A)

    [Figure: variance (y-axis) plotted against mean (x-axis).]

  • Mean variance relationship: biological replicates

    Zebrafish data (control samples)

    [Figure: variance (y-axis) plotted against mean (x-axis).]

  • Choice of the statistical model: negative binomial

    One way to account for overdispersion is to model the data as a negative binomial.

    The negative binomial is a generalization of the Poisson model that accounts for excess variability.

    The negative binomial model can be described in terms of its mean and variance.

    $E[Y] = \mu; \quad \mathrm{Var}(Y) = \mu + \phi\mu^2.$

    • The variance is a quadratic function of the mean.
    • $\phi$ is referred to as the dispersion parameter.
    • The Poisson case is recovered when $\phi = 0$.
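
The mean-variance relationship above can be checked numerically with base R. The helper `nb_var` and the values of the mean and dispersion are illustrative assumptions; the mapping `size = 1 / phi` follows from the documented variance of `rnbinom` in its `mu` parameterization, which is mu + mu^2 / size.

```r
set.seed(7)
mu  <- 50
phi <- 0.1

# Hypothetical helper: the negative binomial variance as a function of mean and dispersion.
nb_var <- function(mu, phi) mu + phi * mu^2
print(nb_var(mu, phi))

# Simulate counts with dispersion phi by setting size = 1 / phi.
y <- rnbinom(1e5, size = 1 / phi, mu = mu)
print(c(mean = mean(y), var = var(y)))

# With phi = 0 the variance equals the mean, as in the Poisson model.
print(nb_var(mu, 0))
```

The empirical variance of the simulated counts should be close to the value returned by `nb_var(mu, phi)`.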

  • Negative binomial models for RNA-seq data analysis

    Due to its simplicity, the negative binomial is one of the most widely used models for analyzing RNA-seq data.

    Popular approaches include edgeR (Robinson et al. 2010), DESeq (Anders and Huber 2010), DESeq2 (Love et al. 2014), and baySeq (Hardcastle and Kelly 2010), among others.

  • Dispersion estimation

    Due to the usually small sample size, estimating the dispersion parameter φ is challenging.

    One strategy that has proven effective is to “shrink” the gene-specific dispersion estimates towards the mean across genes.

    This strategy, known as shrinkage estimation, allows one to borrow strength across the thousands of genes, and leads to better estimates.
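
The benefit of shrinkage can be illustrated with a deliberately simple scheme: a fixed-weight average of each gene-wise estimate and the mean across genes. This is only a toy sketch and is not the estimator used by edgeR or DESeq2; the method-of-moments estimate, the weight `w`, and all simulation settings are assumptions made for illustration.

```r
set.seed(3)
n_genes <- 2000; n_rep <- 3; mu <- 100; phi_true <- 0.1

# Simulated negative binomial counts (genes x replicates) sharing one true dispersion.
y <- matrix(rnbinom(n_genes * n_rep, size = 1 / phi_true, mu = mu), nrow = n_genes)

# Gene-wise method-of-moments estimate from Var = mu + phi * mu^2 (noisy with 3 replicates).
m <- rowMeans(y)
v <- apply(y, 1, var)
phi_gene <- pmax((v - m) / m^2, 0)

# Shrink each gene-wise estimate toward the mean across genes.
# The weight w is an arbitrary illustrative choice.
w <- 0.2
phi_shrunk <- w * phi_gene + (1 - w) * mean(phi_gene)

# Mean squared error against the simulated truth, before and after shrinkage.
mse <- c(gene_wise = mean((phi_gene - phi_true)^2),
         shrunk    = mean((phi_shrunk - phi_true)^2))
print(mse)
```

In this simulation all genes share the same true dispersion, which is the most favorable case for shrinkage; real methods choose the amount of shrinkage from the data.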

  • Illustration of Shrinkage Estimation

    Love et al. (2014)

  • Dispersion estimation with edgeR

    Yeast data

    [Figure: Mean−Variance Plot; x-axis: Mean Expression (Log10 Scale), y-axis: Variance (Log10 Scale).]

  • A Generalized Linear Model (GLM) approach

    GLMs are a flexible and powerful way to model RNA-seq data (McCullagh and Nelder 1989). In particular, for each gene j, we have

    $\log E[Y_j \mid X] = X\beta_j,$

    where X is a known matrix called the design matrix, and $\beta_j$ is a vector of parameters to be estimated.

  • A Generalized Linear Model (GLM) approach

    In a typical two-class differential expression study (e.g., the zebrafish data), X is

    # zfX is assumed here to be a two-level factor (three controls, three treated)
    zfX <- factor(rep(c("Ctl", "Trt"), each = 3))
    model.matrix(~zfX)
    ##   (Intercept) zfXTrt
    ## 1           1      0
    ## 2           1      0
    ## 3           1      0
    ## 4           1      1
    ## 5           1      1
    ## 6           1      1
    ## attr(,"assign")
    ## [1] 0 1
    ## attr(,"contrasts")
    ## attr(,"contrasts")$zfX
    ## [1] "contr.treatment"

  • A Generalized Linear Model (GLM) approach

    In this case, $\beta_j$ has two components and (dropping the index j) we can write the model as

    $\log E[Y \mid X] = \log\mu = \beta_0 + \beta_1 x,$

    where x is the second column of X. Moreover,

    $\log E[Y \mid x = 0] = \beta_0$

    $\log E[Y \mid x = 1] = \beta_0 + \beta_1$

    hence $\beta_1$ can be interpreted as the log-fold-change between the two classes:

    $\beta_1 = \log \frac{E[Y \mid x = 1]}{E[Y \mid x = 0]}$
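
The interpretation of the second coefficient as a log-fold-change can be verified on a toy example with `glm` from base R. The counts are made up, and a Poisson family is used only to keep the sketch dependency-free; the presentation itself advocates a negative binomial model fitted with dedicated packages.

```r
# Toy counts for one gene: three controls (x = 0) and three treated samples (x = 1).
y <- c(10, 12, 8, 40, 35, 45)
x <- c(0, 0, 0, 1, 1, 1)

# Log-linear model: log E[Y | x] = beta0 + beta1 * x (Poisson family for simplicity).
fit  <- glm(y ~ x, family = poisson(link = "log"))
beta <- coef(fit)

# beta0 estimates the log mean of the controls; beta1 the log-fold-change.
print(beta)
print(log(mean(y[x == 1]) / mean(y[x == 0])))
```

The fitted slope agrees with the log of the ratio of the two group means, up to the convergence tolerance of the fitting algorithm.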

  • Test for differential expression

    In this GLM framework, testing for differential expression corresponds to testing the null hypothesis $H_0 : \beta_1 = 0$.

    One popular approach is the Likelihood Ratio Test (LRT), which has the advantage of generalizing to more complex designs (e.g., multi-class comparisons, time-course, . . . ).

    The LRT exploits the idea that the “correct” model will fit the data better (i.e., it will have the higher value for the likelihood function).

    In other words, we test if the model with $\beta_1 \neq 0$ fits the data “significantly better” than the one with $\beta_1 = 0$.
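
A likelihood ratio test for a single gene can be sketched with base R by fitting the two nested models and comparing their log-likelihoods. The counts are made up and the Poisson family is an assumption made to keep the example self-contained; packages such as edgeR carry out the analogous test under a negative binomial model.

```r
# Toy counts for one gene: three controls (x = 0) and three treated samples (x = 1).
y <- c(10, 12, 8, 40, 35, 45)
x <- c(0, 0, 0, 1, 1, 1)

# Full model (beta1 free) and null model (beta1 = 0).
full <- glm(y ~ x, family = poisson)
null <- glm(y ~ 1, family = poisson)

# LRT statistic: twice the gain in log-likelihood, compared with a chi-squared
# distribution with 1 degree of freedom (one parameter is constrained under H0).
lrt <- as.numeric(2 * (logLik(full) - logLik(null)))
p_value <- pchisq(lrt, df = 1, lower.tail = FALSE)
print(c(LRT = lrt, p = p_value))
```

For more complex designs the same recipe applies, with the degrees of freedom equal to the number of coefficients set to zero under the null hypothesis.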

  • p-values and multiple testing

    The result of the testing procedure is often a p-value, i.e., the probability, under the null hypothesis, of obtaining a result as extreme as or more extreme than the one observed.

    [Figure: standard normal density, dnorm(x), for x from −3 to 3, annotated “p-value of 0.05”.]

  • p-values and multiple testing

    Under the null hypothesis, the p-values are uniformly distributed between 0 and 1.

    If one performs only one test, it is conventional to reject the null hypothesis for p < 0.05.

    When performing 10,000 tests (one per gene), under the null hypothesis we expect 5% of them (500 genes!) to have p < 0.05, just by chance!

    Hence, we need to account for the “multiplicity of the test”.
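
The arithmetic behind the 500 genes can be reproduced with a short simulation. The p-values here are drawn directly from a uniform distribution, which is what the null hypothesis implies; the seed and the number of tests are illustrative choices.

```r
set.seed(2015)
n_tests <- 10000

# Under the null hypothesis, p-values are uniform on [0, 1].
p_null <- runif(n_tests)

# Expected and observed numbers of p-values below 0.05 when no gene is DE.
expected_fp <- 0.05 * n_tests
observed_fp <- sum(p_null < 0.05)
print(c(expected = expected_fp, observed = observed_fp))
```

The observed count fluctuates around the expected 500 from run to run, and every one of those genes would be a false positive.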

  • Correcting for multiple testing

    When dealing with multiple hypothesis testing, one needs to control a suitable Type I error rate, e.g., the family-wise error rate (FWER), or the false discovery rate (FDR).

    There are simple procedures to adjust the p-values for multiplicity: e.g., the Bonferroni procedure to control the FWER or the Benjamini-Hochberg procedure to control the FDR (Benjamini and Hochberg 1995).
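
Both adjustments are available in base R through `p.adjust`. The four p-values below are made up for illustration.

```r
p <- c(0.001, 0.01, 0.03, 0.5)   # toy p-values

p_bonf <- p.adjust(p, method = "bonferroni")  # Bonferroni: controls the FWER
p_bh   <- p.adjust(p, method = "BH")          # Benjamini-Hochberg: controls the FDR

print(rbind(raw = p, bonferroni = p_bonf, BH = p_bh))
```

Bonferroni multiplies each p-value by the number of tests (capped at 1), while Benjamini-Hochberg scales each p-value by the number of tests divided by its rank, which is why it is less conservative here.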

  • Exploration of the results

  • Distribution of the p-values

    If the model is correctly specified, we expect the p-values to come from two distributions:

    • uniform between 0 and 1 for the non-DE genes;
    • concentrated near 0 for the DE genes.

    Departures from this pattern indicate that there is something wrong with the model (presence of additional sources of variation, lack of proper normalization, . . . )

  • Good p-value distribution

    Toy example

    [Figure: histogram of x on [0, 1]; y-axis: Frequency.]

  • Bad p-value distribution

    Toy example

    [Figure: histogram of x on [0, 1]; y-axis: Frequency.]

  • p-value distribution

    Zebrafish dataset

    [Figure: histogram of top$PValue on [0, 1]; y-axis: Frequency.]

  • Volcano plots

    Zebrafish dataset

    [Figure: volcano plot; x-axis: logFC, y-axis: −log10(p-value).]

  • Heatmap of DE genes

    Yeast dataset

    [Figure: heatmap of the top 100 DE genes (rows) across samples G2, G3, G1, D1, D7, D2, Y2_1, Y1_1, Y4_1, Y7_1 (columns); color key: row Z-score from −1 to 1.]

  • Software

    All analyses performed within the open source R/Bioconductor framework.

    • EDASeq: visualization and normalization.
    • edgeR: differential expression.

    Rmarkdown available to reproduce all the analyses in this presentation.

  • Acknowledgements

    UC Berkeley

    • Sandrine Dudoit
    • Terry Speed
    • John Ngai
    • Todd Ferreira
    • Russell Fletcher

    Stanford

    • Gavin Sherlock

    UCSF

    • Anne Biton

  • References

    S. Anders and W. Huber, Genome Biol 11, (2010).

    Y. Benjamini and Y. Hochberg, Journal of the Royal Statistical Society. Series B (Methodological) 57, 289 (1995).

    J. Bullard, E. Purdom, K. Hansen, and S. Dudoit, BMC Bioinformatics 11, 94 (2010).

    T. Ferreira, S. R. Wilson, J. Choi, D. Risso, S. Dudoit, T. P. Speed, and J. Ngai, Neuron 81, 847 (2014).

    T. Hardcastle and K. Kelly, BMC Bioinformatics 11, 422 (2010).

    M. I. Love, W. Huber, and S. Anders, Genome Biology 15, 550 (2014).

    P. McCullagh and J. A. Nelder, Generalized Linear Models (London: Chapman & Hall, 1989).

    W. Mendenhall, Introduction to Probability and Statistics, 7th ed. (Duxbury Press, 1987).

    A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, Nat Methods 5, 621 (2008).

    D. Risso, J. Ngai, T. P. Speed, and S. Dudoit, Nature Biotechnology 32, 896 (2014).

    D. Risso, K. Schwartz, G. Sherlock, and S. Dudoit, BMC Bioinformatics 12, 480 (2011).

    M. D. Robinson, D. J. McCarthy, and G. K. Smyth, Bioinformatics 26, 139 (2010).

    M. D. Robinson and A. Oshlack, Genome Biol 11, (2010).
