RNA-Seq as a Discovery Tool Julia Salzman. Deciphering the Genome

RNA-Seq as a Discovery Tool

Julia Salzman

Deciphering the Genome

Salzman, Gawad, Wang, Lacayo, Brown, 2012

Power of RNA-Seq: Quantification and Discovery

• RNA isoform-specific gene expression

• Gene fusions

• Overlooked RNA structural variants

Paired-end RNA-Seq

Matched sequences are obtained for each library molecule

CTTC…..GAAG GGAC…..GCCT

Data: millions of 70-150+ bp A/C/G/T sequences

• Part 1: Isoform Specific Expression

Example: Paired-end Data Aligned

Some reads are informative about isoform-specific expression

Paired-end RNA-Seq for RNA Isoform Specific Gene Expression

• Since the size distribution of library molecules is known, inferred insert lengths can be used to increase statistical power and improve inference

Rnpep

Goal: estimate the expression of each isoform

Nontrivial: we only observe fragments of sequences

[Figure: gene model showing Exon 4 and Exon 1; axis: sequenced molecule length in base pairs (100, 200, 300)]

Insert lengths of entire library (pooled) can be calculated and used to precisely estimate the distribution of sizes of cDNA in the library:

Insert Length Distributions

Paired-end RNA-Seq Model

• Compute genome-wide insert length distribution

Salzman, Jiang, Wong 2011
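Pooling insert lengths across the library can be sketched in a few lines. This is a minimal illustration, not the published implementation: the function name and the input format (transcript coordinates of pairs that map unambiguously to a single isoform) are assumptions made for the example, and the coordinates are toy values.

```python
from collections import Counter

def insert_length_distribution(pairs):
    """Estimate P(insert length) from pooled read pairs.

    `pairs` is an iterable of (left_start, right_end) transcript coordinates
    (0-based, end-exclusive) for pairs whose isoform of origin is unambiguous.
    This input format is a hypothetical choice for the sketch.
    """
    # Count how often each implied insert length occurs in the pooled library.
    counts = Counter(end - start for start, end in pairs if end > start)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no usable read pairs")
    # Normalize counts into an empirical probability distribution.
    return {length: n / total for length, n in sorted(counts.items())}

# Toy coordinates, not real data.
pairs = [(0, 150), (10, 160), (5, 205), (40, 190)]
dist = insert_length_distribution(pairs)
```

In practice the empirical distribution would probably be smoothed before use, since rare lengths are poorly estimated from counts alone.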

• Mapped to Isoform 1: length 150

• Mapped to Isoform 2: length 90

[Figure: sequenced molecule length in base pairs (100, 200, 300)]
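The idea that one read pair implies a different insert length under each isoform can be illustrated with a small mixture model fitted by EM. This is a simplified sketch and not the model of Salzman, Jiang, Wong 2011, which the slides do not spell out: the function name, the input format, and the probabilities are hypothetical, and the sketch ignores normalization by isoform length.

```python
def em_isoform_fractions(pair_lengths, length_prob, n_iter=200):
    """Estimate the fraction of fragments coming from each isoform.

    pair_lengths: one tuple per read pair, giving the insert length that
        pair implies under each isoform (None if incompatible with it).
    length_prob: dict mapping insert length -> probability, e.g. the
        pooled library-wide distribution.
    """
    k = len(pair_lengths[0])
    theta = [1.0 / k] * k  # start from equal fractions
    for _ in range(n_iter):
        totals = [0.0] * k
        for lengths in pair_lengths:
            # E-step: weight each isoform by its fraction times the
            # probability of the insert length it implies for this pair.
            w = [
                theta[i] * (length_prob.get(l, 0.0) if l is not None else 0.0)
                for i, l in enumerate(lengths)
            ]
            s = sum(w)
            if s == 0:
                continue  # pair is not explained by any isoform
            for i in range(k):
                totals[i] += w[i] / s
        n = sum(totals)
        if n == 0:
            raise ValueError("no read pair is compatible with any isoform")
        # M-step: new fractions are the normalized expected counts.
        theta = [t / n for t in totals]
    return theta

# Hypothetical probabilities for the two lengths on the slide.
length_prob = {150: 0.4, 90: 0.05}
# Each pair implies length 150 under isoform 1 and 90 under isoform 2.
ambiguous = [(150, 90)] * 10
theta_ambiguous = em_isoform_fractions(ambiguous, length_prob)
# Pairs compatible with only one isoform.
exclusive = [(150, None)] * 6 + [(None, 150)] * 2
theta_exclusive = em_isoform_fractions(exclusive, length_prob)
```

With these toy numbers, the ambiguous pairs are assigned almost entirely to isoform 1, because a 150 bp insert is far more likely than a 90 bp insert under the assumed distribution. This is the sense in which insert lengths add information beyond the read sequences themselves.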

Using PE for quantification is statistically more powerful

• PE model is a statistical improvement over naïve models and has optimal information reduction

• “Information” gain using PE Sequencing

• Overall, using “mate pair” information gives more power, but experimental artifacts can sometimes affect results

Paired-end Size Distributions are the Foundation for TopHat and other PE RNA-Seq Algorithms

Summary and Problems:

• rely on a reference

• assume uniformity of size distributions in library

• overlook biases

[Figure panels: Rep. 1 and Rep. 2]

• Part 2: Gene Fusions

Recurrent Gene Fusions in Cancer

A handful of recurrent fusions in solid tumors

• PAX8-PPARγ fusion (thyroid cancer)

• EML4-ALK fusion (non-small cell lung cancer)

• TMPRSS2-ERG family fusion (prostate cancer)

More to be learned by unbiased study of RNA

Not Genome-wide

Fusion Discovery

• 2 flavors

– Totally “de novo” discovery

  • Search for any RNA fragments out of order with respect to the reference genome, not necessarily coinciding with exon boundaries

  • Noisy

– Discovery with a reference database

  • Discover fusions at annotated exon boundaries (protein coding), with better statistical checks

  • Misses some fusions

Reference Approach

• Search for gene fusions with exon A in gene 1 spliced to exon B of gene 2

Exon A Exon B

Algorithm (with respect to reference)

• Remove all PE reads consistent with the reference

• Identify candidate gene pairs: PE reads where (read1, read2) map to (gene1, gene2)

• Find PE reads of the form: (gene A, gene A-B junction)

Exon A Exon B
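The three steps above can be sketched as a simple filter over read-pair alignments. This is an illustrative outline under stated assumptions, not the authors' implementation: the function name, the thresholds, and the record format (each mate is either a gene name or a tuple naming an A-B exon-exon junction it aligned to) are all hypothetical, and the gene names are placeholders.

```python
from collections import Counter

def candidate_fusions(pairs, min_spanning=2, min_junction=1):
    """Return {(gene_a, gene_b): (spanning_pairs, junction_pairs)}.

    Each mate is either a gene name (str) or a tuple (gene_a, gene_b)
    meaning it aligned to an A-B exon-exon junction sequence.
    """
    spanning = Counter()  # pairs whose mates map to two different genes
    junction = Counter()  # pairs of the form (gene A, gene A-B junction)
    for mate1, mate2 in pairs:
        if isinstance(mate1, str) and isinstance(mate2, str):
            if mate1 == mate2:
                continue  # step 1: consistent with the reference, discard
            spanning[frozenset((mate1, mate2))] += 1  # step 2
        else:
            # Step 3: one mate inside a gene, the other on a junction
            # between that gene and a different gene.
            for gene, junc in ((mate1, mate2), (mate2, mate1)):
                if (
                    isinstance(gene, str)
                    and isinstance(junc, tuple)
                    and gene in junc
                    and junc[0] != junc[1]
                ):
                    junction[junc] += 1
    # Keep junctions supported by both kinds of evidence.
    return {
        junc: (spanning[frozenset(junc)], n)
        for junc, n in junction.items()
        if n >= min_junction and spanning[frozenset(junc)] >= min_spanning
    }

# Toy alignments with placeholder gene names.
pairs = [
    ("GENE_A", "GENE_A"),              # consistent with the reference
    ("GENE_A", "GENE_B"),              # mates in two different genes
    ("GENE_B", "GENE_A"),
    ("GENE_A", ("GENE_A", "GENE_B")),  # mate 2 crosses the A-B junction
    ("GENE_C", "GENE_D"),              # single spanning pair, no junction read
]
hits = candidate_fusions(pairs)
```

Requiring both spanning pairs and a junction-crossing read is one way to apply the "better statistical checks" that a reference database allows. A real pipeline would replace the fixed thresholds with a statistical model.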

Paired-End RNA-Seq for Gene Fusions in Ovarian Tumors

• Paired-end sequencing of poly-A selected RNA from 12 late-stage tumors

– genome-wide search

• Top hit of our algorithm: ESRRA-C11orf20

ESRRA

Fusion

C11orf20

• Isoform-specific estimation: ESRRA and the fusion are expressed at roughly equal magnitude (Salzman, Jiang, Wong)

Salzman et al, 2011

• Part 3: Exploratory Analysis of RNA Rearrangements

Bioinformatic Analysis

• Thousands of exon scrambling events in RNA from human leukocytes and cancer samples

Wildtype genome: DNA

Canonical transcript

Inconsistent with the reference genome!

Potential Biological Mechanisms for RNA Rearrangements

DNA Rearrangement

RNA rearrangement

Trans-splicing

Template switching

PCR artifact

Analysis of Leukocyte Data

• Exons in ‘scrambled’ (non-increasing) order with respect to canonical exon order

• Thousands of genes with evidence of exon scrambling

• Naïve estimate of fractional abundance: ratio of scrambled read rate to all read rate (per transcript)
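The naïve estimate can be written out for a single transcript as follows. This is a sketch under assumptions: the function names are hypothetical, only junction-spanning reads are counted, exons are numbered by canonical order, and the counts are toy values.

```python
def is_scrambled(upstream_exon, downstream_exon):
    """A junction is scrambled when exon order is non-increasing.

    Exons are numbered by their canonical order in the transcript; the
    junction joins `upstream_exon` (5' side) to `downstream_exon` (3' side).
    """
    return downstream_exon <= upstream_exon

def scrambled_fraction(junction_counts):
    """Naive ratio of scrambled junction reads to all junction reads.

    junction_counts: dict mapping (upstream_exon, downstream_exon) to the
    number of reads spanning that junction in one transcript.
    Returns None when the transcript has no junction reads.
    """
    total = sum(junction_counts.values())
    if total == 0:
        return None
    scrambled = sum(
        n for (up, down), n in junction_counts.items() if is_scrambled(up, down)
    )
    return scrambled / total

# Toy counts: exon 3 spliced back to exon 2 is the scrambled junction.
counts = {(1, 2): 30, (2, 3): 30, (3, 2): 20}
fraction = scrambled_fraction(counts)
```

The estimate is naïve because it treats every junction as equally easy to detect and ignores differences in the number of junctions per isoform.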

100s of Transcripts with High Fractions of Scrambled Isoforms

[Figure: canonical isoform vs. scrambled isoform, < 25% to > 75%, across 100s of genes]

100s of transcripts from B cells, stem cells and neutrophils have >50% of copies from the scrambled isoform

What Models Can Explain Exon Scrambling in RNA?

Model 1 to Explain RNA Exon Scrambling

Model 1 Prediction

Can be made statistically precise

Model 1 is statistically inconsistent with vast majority of data

Alternative Model

Model and data are consistent

Mining RNA-Seq Data for Evidence Consistent with Circular RNA?

• In poly-A depleted samples, expect to see strong evidence of scrambled exons (circular RNA)

• In poly-A selected samples, expect to see little evidence of scrambled exons (circular RNA)

Poly-A Depleted Samples Enriched for Scrambled Exons

Align all reads to a custom database
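The slide does not say what the custom database contains. One plausible construction, assumed here purely for illustration, is a set of junction sequences for every ordered pair of exons in a gene, so that scrambled junctions are present alongside canonical ones. The function name, naming scheme, and flank length are hypothetical, and the exon sequences are toy strings.

```python
from itertools import product

def junction_database(exons, flank=4):
    """Build junction sequences for every ordered pair of exons in a gene.

    exons: exon sequences in canonical order.
    flank: number of bases taken from each side of the junction; in practice
        this would be chosen relative to the read length.
    Returns a dict mapping a junction name to its sequence.
    """
    db = {}
    for (i, donor), (j, acceptor) in product(enumerate(exons, 1), repeat=2):
        # Non-increasing exon order marks a scrambled junction.
        kind = "scrambled" if j <= i else "canonical"
        # End of the upstream exon joined to the start of the downstream exon.
        db[f"exon{i}|exon{j}|{kind}"] = donor[-flank:] + acceptor[:flank]
    return db

# Toy exon sequences, not real data.
db = junction_database(["AAAACCCC", "GGGGTTTT"])
```

Reads that align to a "scrambled" entry but not to the reference transcript would then be candidates for the kind of evidence described above.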

• RNA-Seq can be used for discovery

• TopHat and other fusion/splicing algorithms give a broad picture

• May have significant noise

• Miss important features of RNA expression

Summary of RNA-Seq for NGS

Currently, all published/downloadable algorithms will miss identifying circular RNA! (Feel free to contact me for the algorithm to identify circular RNA!)
