Analysis of RNA-seq Data - Bioinformatics-core-shared...

Analysis of RNA-seq Data

Bernard Pereira

The many faces of RNA-seq

Applications

Discovery

•  Find new transcripts

•  Find transcript boundaries

•  Find splice junctions

Comparison

Given samples from different experimental conditions, find effects of the treatment on:

•  Gene expression strengths

•  Isoform abundance ratios, splice patterns, transcript boundaries

Applications

Differential Expression

Mortazavi, A. et al (2008) Nature Methods

•  Comparing feature abundance under different conditions

•  Assumes linearity of signal over a range of expression levels

•  When feature = gene, well-established pre- and post-analysis strategies exist

Range of detection

Wang et al. (2014) Nature Biotech.; Guo et al. (2013) PLoS ONE

Library Prep i

Malone, J.H. & Oliver, B.

Library Prep ii

[Figure labels: Biological, Technical]

Library Prep iii

[Figure: panels A and B]

Library Prep iv

Hansen, K.D. et al. (2010) Nuc. Acids Res.

Library Prep v

•  Duplicates (optical & PCR)

•  Sequence errors

•  Indels

•  Repetitive/problematic sequence

Hot off the sequencer…

Auer and Doerge (2010) Genetics

FASTQC

Trimming

•  Quality-based trimming

•  Adapter ‘contamination’
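The two trimming steps above can be sketched in a few lines of Python. This is a minimal illustration only, assuming Phred quality scores are already decoded to integers; the function names, the quality threshold of 20, the minimum overlap of 3 and the example adapter string are all assumptions for illustration. Dedicated trimming tools also tolerate mismatches, which this sketch does not.

```python
def trim_quality(seq, quals, min_q=20):
    """Remove low-quality bases from the 3' end (threshold is an assumption)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def trim_adapter(seq, adapter, min_overlap=3):
    """Remove an exact adapter match, or an exact adapter prefix at the 3' end."""
    idx = seq.find(adapter)
    if idx != -1:
        # Full adapter found inside the read: keep only what precedes it.
        return seq[:idx]
    # Otherwise look for the longest adapter prefix that ends the read.
    for k in range(min(len(adapter), len(seq)) - 1, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k]
    return seq


# Hypothetical read and adapter, for illustration only.
read, quals = trim_quality("ACGTACGT", [30, 30, 30, 30, 30, 10, 5, 2])
clean = trim_adapter("ACGTTTAGATCGGAAG", "AGATCGGAAGAGC")
```

Quality trimming is applied from the 3' end because base quality typically degrades along the read.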

Sequence to sense

Haas, B.J. & Zody, M.C. (2010) Nature Biotech.

De novo assembly

Haas, B.J. et al. (2013) Nature Protocols

•  e.g. Trinity

Reference-based assembly

Genome mapping

•  Can identify novel features

•  Splice aware?

•  Can be difficult to reconstruct isoform and gene structures

Transcriptome mapping

•  No repetitive reference

•  Overcomes issues of complex structures

•  Novel features?

•  How reliable is the transcriptome?

Trapnell & Salzberg (2009) Nature Biotech

A smart suit(e)

Trapnell, C. et al (2012) Nature Protocols

Tophat/Bowtie

Kim, D. et al (2012) Genome Biology

Cufflinks

Trapnell, C. et al. (2010) Nature Biotech.

How do we look?

Duplicates & RNA-seq

Single-end vs paired-end

Variant calling vs DE analysis

Intrinsically lower complexity

Highly expressed genes

Platform/pipeline

Model as part of counting process

Counting

Genome-based features

•  Exon or gene boundaries?

•  Isoform structures?

•  Gene multireads?

Oshlack, A. et al. (2010) Genome Biology

Transcript-based features

•  Transcript assembly?

•  Novel structures?

•  Isoform multireads?

Counting

•  e.g. HTSeq
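The idea behind genome-based counting can be sketched as follows. This is not HTSeq's API; it is a toy illustration under the assumption that reads and genes are simple half-open intervals on a chromosome. A read is counted only if it overlaps exactly one gene; reads overlapping none or several genes are set aside. The function name and the `__no_feature` / `__ambiguous` keys are chosen for illustration.

```python
def count_reads(reads, genes):
    """Count reads per gene.

    reads: list of (chrom, start, end) tuples, half-open coordinates.
    genes: dict mapping gene name -> (chrom, start, end).
    """
    counts = {name: 0 for name in genes}
    counts["__no_feature"] = 0
    counts["__ambiguous"] = 0
    for chrom, start, end in reads:
        # Genes on the same chromosome whose interval overlaps the read.
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and g_start < end
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Hypothetical annotation and alignments, for illustration only.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 150, 300)}
reads = [
    ("chr1", 110, 140),  # geneA only
    ("chr1", 160, 190),  # overlaps both genes
    ("chr1", 250, 280),  # geneB only
    ("chr1", 400, 450),  # no gene
    ("chr2", 110, 140),  # different chromosome
]
result = count_reads(reads, genes)
```

Setting ambiguous reads aside rather than assigning them arbitrarily is one way of handling the gene multiread question raised above; other strategies are possible.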

Counting

Mortazavi, A. et al (2008) Nature Methods

Counting & normalisation

•  An estimate for the relative counts for each gene is obtained

•  It is assumed that this estimate is representative of the original population

Library size

•  Sequencing depth varies between samples

Gene properties

•  GC content, length, sequence

Library composition

•  Highly expressed genes overrepresented at cost of lowly expressed genes

Normalisation i

Total count

•  Normalise each sample by the total number of reads sequenced.

•  Can also use another statistic similar to total count, e.g. median or upper quartile

Robinson, M.D. & Oshlack, A. (2010) Genome Biology
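A minimal sketch of total-count and upper-quartile scaling, assuming counts are held as a dict of sample name to a list of per-gene read counts. The function names, the data layout and the choice to centre factors on the mean across samples are assumptions made for illustration.

```python
from statistics import quantiles


def total_count_factors(counts):
    """Scaling factor per sample: library size relative to the mean library size."""
    totals = {s: sum(c) for s, c in counts.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {s: t / mean_total for s, t in totals.items()}


def upper_quartile_factors(counts):
    """Same idea, using the upper quartile of non-zero counts instead of the total."""
    uq = {}
    for s, c in counts.items():
        expressed = [x for x in c if x > 0]  # needs at least two expressed genes
        uq[s] = quantiles(expressed, n=4, method="inclusive")[2]
    mean_uq = sum(uq.values()) / len(uq)
    return {s: q / mean_uq for s, q in uq.items()}


# Toy data: sample B was sequenced twice as deeply as sample A.
toy = {"A": [10, 20, 30, 40], "B": [20, 40, 60, 80]}
factors = total_count_factors(toy)
normalised = {s: [x / factors[s] for x in c] for s, c in toy.items()}
uq_factors = upper_quartile_factors(toy)
```

After dividing by the factors, the two toy samples have identical normalised counts, which is what is expected when depth is the only difference between them.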

Normalisation ii

Oshlack, A. & Wakefield, M.J. (2009) Biology Direct

RPKM

•  Reads per kilobase per million:

$$\text{RPKM}_A = \frac{\text{reads for gene } A}{\text{length of gene } A \text{ (kb)} \times \text{total number of reads (millions)}}$$
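The RPKM definition can be written as a one-line function. In this sketch gene length is given in bases and the total in raw reads, so the kilobase and million conversions appear as a single factor of 10^9; the function name and the example numbers are illustrative assumptions.

```python
def rpkm(gene_reads, gene_length_bp, total_reads):
    """Reads per kilobase of gene per million reads sequenced."""
    # (reads) / (length_bp / 1e3) / (total_reads / 1e6)
    return gene_reads * 1e9 / (gene_length_bp * total_reads)


# Hypothetical gene: 500 reads, 2 kb long, in a library of 10 million reads.
value = rpkm(500, 2000, 10_000_000)
```

Here 500 reads over 2 kb gives 250 reads per kilobase, and dividing by 10 million reads gives an RPKM of 25.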

Normalisation iii

Geometric scaling factor

•  Implemented in DESeq

•  Assumes that most genes are not differentially expressed

•  For each sample j, the scaling factor is the median, over genes 1 to N, of the ratio RC / GM:

$$s_j = \operatorname{median}_{i = 1, \dots, N} \frac{RC_{ij}}{GM_i}$$

RC = read counts (per sample); GM = geometric mean (all samples)
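The median-of-ratios calculation can be sketched in plain Python. This is an illustration of the idea rather than DESeq's own code: the function name and data layout are assumptions, and genes with a zero count in any sample are skipped here because their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: dict of sample name -> list of read counts (same gene order)."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Log geometric mean of each gene across all samples (None if any zero).
    log_gm = []
    for i in range(n_genes):
        row = [counts[s][i] for s in samples]
        if all(c > 0 for c in row):
            log_gm.append(sum(math.log(c) for c in row) / len(row))
        else:
            log_gm.append(None)
    factors = {}
    for s in samples:
        # log(RC / GM) for every usable gene, then the median over genes.
        log_ratios = [
            math.log(counts[s][i]) - log_gm[i]
            for i in range(n_genes)
            if log_gm[i] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy data: B has twice the depth of A; the last gene has a zero and is skipped.
sf = size_factors({"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]})
```

Because the median is used, a minority of strongly differentially expressed genes has little influence on the factor, which is where the assumption that most genes are not differentially expressed comes in.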

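Normalisation iv

Trimmed mean of M-values (TMM)

•  Implemented in edgeR

•  Assumes most genes are not differentially expressed

•  For each gene g, compute the log ratio M between sample k and reference sample k'

•  Trim genes with extreme values, then weight each remaining gene by the inverse of its variance

•  The scaling factor is the weighted mean of the remaining log ratios:

$$\log_2 \mathrm{TMM}_k = \frac{\sum_{g \in G^*} w_g M_g}{\sum_{g \in G^*} w_g}, \qquad M_g = \log_2 \frac{Y_{gk} / N_k}{Y_{gk'} / N_{k'}}$$

Y = read count; N = library size; w = weight; G* = genes remaining after trimming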
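A simplified sketch of the trimmed mean of M-values between one sample and a reference. It is not edgeR's implementation: the function name, the trimming fractions (30% on log ratios, 5% on average log expression) and the variance approximation used for the weights are assumptions for illustration, and steps such as choosing the reference sample or rescaling factors across samples are left out.

```python
import math


def tmm_factor(sample, ref, trim_m=0.30, trim_a=0.05):
    """Weighted trimmed mean of log ratios between `sample` and `ref` counts."""
    n_k, n_r = sum(sample), sum(ref)
    genes = []
    for y_k, y_r in zip(sample, ref):
        if y_k == 0 or y_r == 0:
            continue  # log ratio undefined
        p_k, p_r = y_k / n_k, y_r / n_r
        m = math.log2(p_k / p_r)        # log ratio (M)
        a = 0.5 * math.log2(p_k * p_r)  # average log expression (A)
        var = (n_k - y_k) / (n_k * y_k) + (n_r - y_r) / (n_r * y_r)
        genes.append((m, a, 1.0 / var))  # weight = inverse variance

    def untrimmed(values, trim):
        # Indices that survive trimming `trim` of the genes from each tail.
        order = sorted(range(len(values)), key=values.__getitem__)
        cut = int(len(values) * trim)
        return set(order[cut:len(order) - cut])

    kept = untrimmed([g[0] for g in genes], trim_m) & untrimmed(
        [g[1] for g in genes], trim_a
    )
    num = sum(genes[i][2] * genes[i][0] for i in kept)
    den = sum(genes[i][2] for i in kept)
    return 2 ** (num / den)


# Toy data: identical to the reference except one strongly up-regulated gene.
ref = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
sample = ref[:-1] + [1000]
factor = tmm_factor(sample, ref)
```

In the toy data the single up-regulated gene inflates the sample's library size from 550 to 1450. Trimming removes that gene, so the factor comes out at 550/1450 and rescales the library size back to that of the unchanged genes.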

Differential expression

•  Simple

[Figure: expression of Gene X in Cond A vs Cond B]

All we need

•  Know what the data looks like

•  Some measure of difference

Modelling – old trends

•  What the data looks like: normal distribution

•  Some measure of difference: t-test etc

•  Technical replicates introduce some variance
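As a reminder of what the normal-distribution approach looks like, here is a minimal sketch of a two-sample t statistic (the unequal-variance, Welch form is used here as an assumption; the slide only says "t-test etc"). The function name and the expression values are made up for illustration.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """t statistic for the difference in means, allowing unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical log-scale expression values for one gene in two conditions.
cond_a = [5.1, 4.9, 5.0]
cond_b = [7.0, 7.2, 6.8]
t_stat = welch_t(cond_a, cond_b)
```

A large absolute value of the statistic indicates that the difference between the condition means is large relative to the variation among replicates.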

Modelling – in fashion

•  Use the Poisson distribution for count data from technical replicates

•  Just one parameter required – the mean

Modelling – in fashion

•  Biology is never that simple…

Anders, S. & Huber, W. (2010) Genome Biology

•  The negative binomial distribution represents an overdispersed Poisson distribution, and has parameters for both the mean and the overdispersion.
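The difference between the two models is easiest to see through their variances. The sketch below assumes the common parameterisation in which the negative binomial variance is the mean plus the dispersion times the squared mean; the function names and the simple moment-based dispersion estimate are illustrative assumptions, not the estimators used by any particular package.

```python
from statistics import mean, variance


def poisson_variance(mu):
    """Poisson: the variance equals the mean."""
    return mu


def nb_variance(mu, dispersion):
    """Negative binomial: extra variance grows with the square of the mean."""
    return mu + dispersion * mu ** 2


def moment_dispersion(counts):
    """Crude per-gene dispersion estimate from replicate counts."""
    m, v = mean(counts), variance(counts)
    if m == 0:
        return 0.0
    # Negative estimates mean no evidence of overdispersion; clamp at zero.
    return max((v - m) / m ** 2, 0.0)


# Hypothetical replicate counts for two genes.
tight = moment_dispersion([8, 12, 10])      # variance below the mean
spread = moment_dispersion([50, 100, 150])  # variance far above the mean
```

With a dispersion of zero the negative binomial variance reduces to the Poisson variance, which is why it can be read as an overdispersed Poisson.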

Modelling – in fashion

Robinson, M.D. & Smyth, G. (2008) Biostatistics

•  Estimating the dispersion parameter can be difficult with a small number of samples

•  ‘Share’ information from all genes to obtain global estimate

Shrinkage

Robinson, M.D. & Smyth, G. (2007) Bioinformatics

•  Genes do not share a common dispersion parameter

•  ‘Moderated’ estimate – assign a per-gene weight to the combined estimate
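The idea of a moderated estimate can be illustrated with a plain weighted average of the per-gene and the shared dispersion. This linear blend is a deliberate simplification chosen for illustration and is not the weighted-likelihood method of the cited paper; the function name and the `weight` parameter are hypothetical.

```python
def moderated_dispersion(gene_disp, common_disp, weight):
    """Blend a per-gene dispersion with the estimate shared across genes.

    weight = 0 keeps the per-gene estimate; weight = 1 uses the common one.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return weight * common_disp + (1.0 - weight) * gene_disp


# Hypothetical values: a noisy per-gene estimate pulled towards the common one.
shrunk = moderated_dispersion(gene_disp=0.5, common_disp=0.1, weight=0.5)
```

With few replicates the per-gene estimate is unreliable, so a larger weight on the shared estimate is appropriate; with many replicates the per-gene estimate can be trusted more.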

DESeq

•  DESeq fits a mean/dispersion relationship model

•  Shifts individual estimates to regression line

The mean-variance relationship

•  Variance = Technical (variable) + Biological (constant)

•  Panels A to E: A = technical replicates, through to E = (very) biologically different replicates

Law et al. (2014) Genome Biology

Filtering

•  Independent filtering = remove genes that have little chance of showing DE

•  Can use e.g. total count
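A minimal sketch of filtering on total count, assuming counts are stored as a dict of gene name to a list of per-sample counts. The function name and the threshold of 10 are illustrative assumptions; a suitable cut-off depends on the experiment.

```python
def filter_low_counts(counts, min_total=10):
    """Keep genes whose summed count across all samples reaches min_total."""
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}


# Hypothetical counts for three genes across four samples.
counts = {
    "gene1": [0, 1, 0, 2],
    "gene2": [15, 20, 18, 25],
    "gene3": [3, 4, 2, 1],
}
kept = filter_low_counts(counts)
```

The total is taken over all samples without reference to the condition labels, so the filter does not use the information that the differential expression test itself relies on.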

Liu et al. (2014) Bioinformatics

On replicates…

Summary

Oshlack, A. et al (2010) Genome Biology
