RNA-Seq as a Discovery Tool
Julia Salzman
Deciphering the Genome
Salzman, Gawad, Wang, Lacayo, Brown, 2012
Power of RNA-Seq: Quantification and Discovery
• RNA isoform-specific gene expression
• Gene fusions
• Overlooked RNA structural variants
Paired-end RNA-Seq
Matched sequences are obtained for each library molecule
CTTC…..GAAG GGAC…..GCCT
Data: millions of 70-150+ bp A/C/G/T sequences
• Part 1: Isoform Specific Expression
Example: Paired-end Data Aligned
Some reads are informative about isoform-specific expression
Paired-end RNA-Seq for RNA Isoform Specific Gene Expression
• Since the size distribution of library molecules is known, inferred insert lengths can be used to increase statistical power and improve inference
Rnpep
Goal: estimate the expression of each isoform
Nontrivial: we only observe fragments of sequences
[Figure: labels Exon 4, Exon 1; x-axis: sequenced molecule length in base pairs (100, 200, 300)]
Insert lengths of entire library (pooled) can be calculated and used to precisely estimate the distribution of sizes of cDNA in the library:
Insert Length Distributions
Paired-end RNA-Seq Model
• Compute genome-wide insert length distribution
Salzman, Jiang, Wong 2011
• Mapped to Isoform 1: length 150
• Mapped to Isoform 2: length 90
[Figure: x-axis: sequenced molecule length in base pairs (100, 200, 300)]
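The example above (one read pair implying length 150 under Isoform 1 and length 90 under Isoform 2) can be sketched as a small calculation. This is a simplified illustration, not the published Salzman, Jiang, Wong model: the function names, the pooled lengths, and the equal starting abundances are all assumptions made for the example.

```python
from collections import Counter


def insert_length_pmf(pooled_lengths):
    """Empirical insert-length distribution from the pooled library."""
    counts = Counter(pooled_lengths)
    total = len(pooled_lengths)
    return {length: n / total for length, n in counts.items()}


def isoform_posterior(implied_lengths, pmf, abundance):
    """Probability that one read pair came from each isoform.

    implied_lengths: isoform name -> insert length implied by mapping
                     the pair to that isoform.
    abundance:       isoform name -> current relative abundance guess.
    """
    weights = {
        iso: abundance[iso] * pmf.get(length, 0.0)
        for iso, length in implied_lengths.items()
    }
    total = sum(weights.values())
    if total == 0:
        # The pair is not explained by any isoform under this distribution.
        return {iso: 0.0 for iso in weights}
    return {iso: w / total for iso, w in weights.items()}


# Hypothetical pooled library: 60% of inserts at 150 bp, 20% at 90 bp.
pmf = insert_length_pmf([150] * 6 + [90] * 2 + [200] * 2)
posterior = isoform_posterior(
    {"isoform1": 150, "isoform2": 90},
    pmf,
    {"isoform1": 0.5, "isoform2": 0.5},
)
print(posterior)
```

With these made-up numbers the pair favors Isoform 1 by 3 to 1, purely because a 150 bp insert is three times as common in the library as a 90 bp insert.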
Using PE for quantification is statistically more powerful
• PE model is a statistical improvement over naïve models and has optimal information reduction
• “Information” gain using PE Sequencing
• Overall, using “mate pair” information gives more power, but experimental artifacts can sometimes affect results
Paired-end Size Distributions are the Foundation for Tophat and other PE RNA-Seq Algorithms
Summary and Problems:
• rely on a reference
• assume uniformity of size distributions in library
• overlook biases
[Figure panels: Rep. 1, Rep. 2]
• Part 2: Gene Fusions
Recurrent Gene Fusions in Cancer
A handful of recurrent fusions in solid tumors
• PAX8-PPARγ fusion (thyroid cancer)
• EML4-ALK fusion (non-small cell lung cancer)
• TMPRSS2-ERG family fusion (prostate cancer)
More to be learned by unbiased study of RNA
Not Genome-wide
Fusion Discovery
• 2 flavors
  – Totally “de novo” discovery
    • Search for any RNA fragments out of order with respect to the reference genome
      – not necessarily coinciding with exon boundaries
    • Noisy
  – Discovery with a reference database
    • Discover fusions at annotated exon boundaries (protein coding) and better statistical checks
    • Misses some fusions
Reference Approach
• Search for gene fusions with exon A in gene 1 spliced to exon B of gene 2
Exon A Exon B
Algorithm (with respect to reference)
• Remove all PE reads consistent with the reference
• Identify PE reads where (read1, read2) map to a gene pair (gene1, gene2)
• Find PE reads of the form: (gene A, gene A-B junction)
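The three steps above can be sketched roughly as follows. The input format is an assumption made for illustration: each read pair is a tuple whose mates are either a gene name (the mate maps inside that gene) or a two-gene tuple (the mate spans a candidate junction between those genes). The gene names and the `min_spanning` threshold are hypothetical, and the statistical checks mentioned in the slides are not included.

```python
from collections import Counter


def find_fusion_candidates(read_pairs, min_spanning=2):
    """Return (gene pair, spanning pair count, junction read count) tuples."""
    # Step 1: remove pairs consistent with the reference
    # (both mates map within the same gene).
    discordant = [(m1, m2) for m1, m2 in read_pairs if m1 != m2]

    # Step 2: count pairs whose mates map to two different genes.
    spanning = Counter(
        frozenset((m1, m2))
        for m1, m2 in discordant
        if isinstance(m1, str) and isinstance(m2, str)
    )

    # Step 3: count pairs of the form (gene A, gene A-B junction).
    junction = Counter()
    for m1, m2 in discordant:
        if isinstance(m1, str) and isinstance(m2, tuple) and m1 in m2:
            junction[frozenset(m2)] += 1

    return sorted(
        (tuple(sorted(pair)), spanning[pair], junction[pair])
        for pair in spanning
        if spanning[pair] >= min_spanning and junction[pair] > 0
    )


# Hypothetical read pairs.
pairs = [
    ("geneA", "geneA"),             # consistent with the reference
    ("geneA", "geneB"),             # mates in different genes
    ("geneB", "geneA"),
    ("geneA", ("geneA", "geneB")),  # one mate crosses the A-B junction
    ("geneC", "geneD"),             # only one supporting pair
]
candidates = find_fusion_candidates(pairs)
print(candidates)
```

Requiring both spanning pairs and at least one junction read is one possible way to cut noise; the real thresholds would depend on sequencing depth.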
Exon A Exon B
Paired-End RNA-Seq for Gene Fusions in Ovarian Tumors
• Paired-end sequencing of poly-A selected RNA from 12 late-stage tumors
  – genome-wide search
• Top hit of our algorithm: ESRRA-C11orf20
[Figure labels: ESRRA, Fusion, C11orf20]
• Isoform-specific estimation: ESRRA and the fusion are expressed at roughly equal magnitude (Salzman, Jiang, Wong)
Salzman et al, 2011
• Part 3: Exploratory Analysis of RNA Rearrangements
Bioinformatic Analysis
• Thousands of exon scrambling events in RNA from human leukocytes and cancer samples
Wildtype genome: DNA
Canonical transcript
Inconsistent with the reference genome!
Potential Biological Mechanisms for RNA Rearrangements
DNA Rearrangement
RNA rearrangement
Trans-splicing
Template switching
PCR artifact
Analysis of Leukocyte Data
• Exons in ‘scrambled’ (non-increasing) order with respect to canonical exon order
• Thousands of genes with evidence of exon scrambling
• Naïve estimate of fractional abundance: ratio of scrambled read rate to all read rate (per transcript)
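The naïve per-transcript estimate can be illustrated with a short sketch. It assumes junction reads are already summarized as (transcript, upstream exon number, downstream exon number) and that only junction reads are counted; the tuple format, the function names, and the example reads are hypothetical.

```python
from collections import defaultdict


def is_scrambled(upstream_exon, downstream_exon):
    """A junction is scrambled when exon order is non-increasing."""
    return downstream_exon <= upstream_exon


def scrambled_fraction(junction_reads):
    """Naive fraction of scrambled junction reads per transcript."""
    scrambled = defaultdict(int)
    total = defaultdict(int)
    for transcript, upstream, downstream in junction_reads:
        total[transcript] += 1
        if is_scrambled(upstream, downstream):
            scrambled[transcript] += 1
    return {t: scrambled[t] / total[t] for t in total}


# Hypothetical junction reads.
reads = [
    ("T1", 1, 2),  # canonical
    ("T1", 2, 3),  # canonical
    ("T1", 4, 1),  # scrambled: exon 4 spliced upstream of exon 1
    ("T1", 3, 2),  # scrambled
    ("T2", 1, 2),  # canonical
]
fractions = scrambled_fraction(reads)
print(fractions)
```

The estimate ignores differences in junction mappability and read coverage, which is part of why it is only a naïve first pass.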
100s of Transcripts with High Fractions of Scrambled Isoforms
[Figure: canonical vs. scrambled isoform fraction per gene (< 25%, > 75%); 100s of genes]
100s of transcripts from B cells, stem cells and neutrophils have >50% copies from scrambled isoform
What Models Can Explain Exon Scrambling in RNA?
Model 1 to Explain RNA Exon Scrambling
Model 1 Prediction
Can be made statistically precise
Model 1 is statistically inconsistent with vast majority of data
Alternative Model
Model and data are consistent
Mining RNA-Seq Data for Evidence Consistent with Circular RNA?
• In poly-A depleted samples, expect to see strong evidence of scrambled exons (circular RNA)
• In poly-A selected samples, expect to see little evidence of scrambled exons (circular RNA)
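One way to check these two expectations against read counts is a one-sided Fisher exact test on a 2x2 table (library type by read type). The slides do not say which test was used, so the choice of test, the function name, and the counts below are assumptions for illustration only.

```python
from math import comb


def enrichment_p_value(depleted_scrambled, depleted_canonical,
                       selected_scrambled, selected_canonical):
    """One-sided Fisher exact p-value for enrichment of scrambled
    reads in the poly-A depleted library."""
    row1 = depleted_scrambled + depleted_canonical   # depleted library
    row2 = selected_scrambled + selected_canonical   # selected library
    col1 = depleted_scrambled + selected_scrambled   # all scrambled reads
    denom = comb(row1 + row2, col1)
    upper = min(row1, col1)
    # Sum hypergeometric probabilities of tables at least as extreme.
    tail = sum(
        comb(row1, k) * comb(row2, col1 - k)
        for k in range(depleted_scrambled, upper + 1)
    )
    return tail / denom


# Hypothetical counts: 3 scrambled / 1 canonical read in the depleted
# library, 1 scrambled / 3 canonical reads in the selected library.
p = enrichment_p_value(3, 1, 1, 3)
print(p)
```

A small p-value would be consistent with scrambled exons coming from molecules that lack a poly-A tail, as expected for circular RNA.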
Poly-A Depleted Samples Enriched for Scrambled Exons
Align all reads to a custom database
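The slides do not describe what the custom database contains. One plausible form, assumed here only for illustration, is a set of exon-exon junction sequences for each gene in every order, so that reads crossing a scrambled junction have something to align to. The exon sequences, the naming scheme, and the `flank` parameter are hypothetical.

```python
def junction_database(exons, flank=4):
    """Build junction sequences for every ordered exon pair of one gene.

    exons: exon sequences in canonical order.
    flank: bases taken from each side of the junction.
    """
    db = {}
    for i, upstream in enumerate(exons, start=1):
        for j, downstream in enumerate(exons, start=1):
            # Increasing exon order is canonical; anything else is scrambled.
            kind = "canonical" if j > i else "scrambled"
            name = f"exon{i}-exon{j}|{kind}"
            db[name] = upstream[-flank:] + downstream[:flank]
    return db


# Hypothetical two-exon gene.
db = junction_database(["AAAACCCC", "GGGGTTTT"], flank=4)
for name, seq in db.items():
    print(name, seq)
```

In practice the flank would be chosen relative to read length so that a read must cross the junction to align uniquely.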
Summary of RNA-Seq for NGS
• RNA-Seq can be used for discovery
• Tophat and other fusion/splicing algorithms give a broad picture
• May have significant noise
• Miss important features of RNA expression
Currently, all published/downloadable algorithms will miss identifying circular RNA! (feel free to contact me for the algorithm to identify circular RNA!)