20140710 Lukas Paul ERCC 2.0 Workshop
© Lexogen, 2013
Spike-In RNA Variants: Design, Production and Application
ERCC 2.0 workshop
Stanford University – July 10-11, 2014
PPT Number TBD Project Number 0221 Theme T5.2 Mixquer Transcript Quantification (WAFF) Author Lukas Paul
© Lexogen, 2014 2
1. Company introduction
2. ERCC spike-in mixes in Lexogen’s R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants
Vertraulich / Confidential
Lexogen: Company
• Founded in 2007
• Based in Vienna, Austria
• 28 employees (75% in R&D)
• Lexogen, Inc.: overnight delivery to US customers
• Services & products with focus on
  o Transcriptome profiling technologies
  o Complementary technologies to Next Generation Sequencing
  o Innovative solutions for transcriptome research
Lexogen’s mission is to develop innovative technologies that make it possible to resolve all complexities of the transcriptome, one of the most enigmatic and exciting areas in biology.
www.LEXOGEN.com
1. Company introduction
2. ERCC spike-in mixes in Lexogen’s R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants
SENSE™ mRNA-Seq Library Preparation Kit
• Convenient, fragmentation-free workflow
• Core technology: reverse transcription and ligation on intact RNA
• Results in very high preservation of strand orientation
Vertraulich / Confidential PN0203 PPT0383
ERCC-based Validation of Strandedness
• Strandedness is usually quantified by comparing the orientation of a mapped read with the genome annotation
• Problem: the annotation is incomplete, and natural antisense transcription interferes
Use of ERCC transcripts with known orientation provides an absolute means to determine strandedness
Total RNA   Strand specificity (ERCCs only) (a)   False antisense reads (b)   Sense reads (genome-wide) (c)
2 µg        99.997%                               0.003%                      99.890%
1 µg        99.986%                               0.014%                      99.815%
500 ng      99.997%                               0.003%                      99.821%
50 ng       99.965%                               0.035%                      99.779%

(a) Number of reads mapping to ERCC genes in the sense direction divided by the total number of ERCC reads.
(b) Number of antisense reads mapping to ERCC transcripts divided by the total number of reads mapped to the ERCC genome.
(c) Number of reads mapping to annotated genes in the sense orientation divided by the number of reads mapping in both directions. Note that this measure includes biologically relevant antisense transcription.
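The two ERCC-based measures defined in footnotes (a) and (b) can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the read counts are hypothetical and are not taken from the experiment behind the table.

```python
def strand_specificity(sense_reads: int, antisense_reads: int) -> tuple[float, float]:
    """Return (strand specificity %, false antisense %) for ERCC-mapped reads.

    ERCC transcripts have a known orientation, so every antisense read
    is counted as a false antisense read.
    """
    total = sense_reads + antisense_reads
    if total == 0:
        raise ValueError("no ERCC reads were counted")
    return 100.0 * sense_reads / total, 100.0 * antisense_reads / total


# Hypothetical counts, chosen only to illustrate the calculation.
specificity, false_antisense = strand_specificity(99_997, 3)
print(f"strand specificity: {specificity:.3f}%, false antisense: {false_antisense:.3f}%")
```

Because both values share the same denominator, they always sum to 100%.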
ERCC-validated Strandedness Determines False Positive Background of Library Preparation Method
Knowing the strandedness of the library preparation protocol makes it possible to determine whether a detected transcript is truly antisense or belongs to the false positive background.
[Figure: true antisense transcripts detected at 98% and 99.9% strandedness (values shown: 1153 and 2415)]
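One way to turn a measured strandedness into a decision rule is sketched below. It is an assumption for illustration, not the method used for the slide: the false antisense background of a gene is modelled as Poisson-distributed, with a mean derived from the gene's sense read count and the ERCC-validated strandedness. All function names, counts and the significance threshold are hypothetical.

```python
import math


def expected_false_antisense(sense_reads: int, strandedness: float) -> float:
    """Expected number of false antisense reads for a purely sense transcript.

    strandedness = sense / (sense + false antisense), as a fraction in (0, 1].
    """
    if not 0.0 < strandedness <= 1.0:
        raise ValueError("strandedness must be in (0, 1]")
    return sense_reads * (1.0 - strandedness) / strandedness


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam); adequate for moderate lam only."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_true_antisense(sense_reads: int, antisense_reads: int,
                      strandedness: float, alpha: float = 0.01) -> bool:
    """True if the antisense reads exceed what the background alone explains."""
    lam = expected_false_antisense(sense_reads, strandedness)
    return poisson_upper_tail(antisense_reads, lam) < alpha


# Hypothetical gene: 10,000 sense reads and 150 antisense reads.
print(is_true_antisense(10_000, 150, 0.98))   # background explains the reads
print(is_true_antisense(10_000, 150, 0.999))  # reads exceed the background
```

The same observation is classified differently under the two protocols, which is the point of the slide: a protocol with higher strandedness has a lower false positive background.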
“ERCC-validated” Strandedness in Lexogen’s Portfolio
• SENSE mRNA-Seq library preparation kit
• SENSE Total RNA-Seq library preparation kit
• QuantSeq™ 3’ mRNA library preparation kit (see workflow, right); ERCCs are also used to assess the correctness of 3’ end mapping
Correlation Between ERCC Input and FPKM Measured
[Figure: FPKM (y-axis, 10⁻² to 10³) versus number of ERCC input molecules (x-axis, N of molecules [10²], 1 to 10⁶), with 7.5×10⁴ marked. SENSE (o): R² = 0.910; competitors (•): R² = 0.834]
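An input/output correlation of this kind can be computed as the squared Pearson correlation between input molecules and measured FPKM. The sketch below assumes that both axes are log10-transformed, as the logarithmic plot suggests; the slide does not state how R² was calculated, and the data points are hypothetical.

```python
import math


def r_squared_log(input_molecules, fpkm):
    """Squared Pearson correlation of log10(input) versus log10(FPKM).

    Pairs with a non-positive value are skipped, because they cannot be
    placed on a logarithmic axis.
    """
    pairs = [(math.log10(x), math.log10(y))
             for x, y in zip(input_molecules, fpkm) if x > 0 and y > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two positive data points")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("one variable is constant; R^2 is undefined")
    return sxy * sxy / (sxx * syy)


# Hypothetical spike-in data: input molecules and measured FPKM.
molecules = [1e2, 1e3, 1e4, 1e5, 1e6]
measured = [0.02, 0.3, 2.5, 40.0, 310.0]
print(f"R^2 = {r_squared_log(molecules, measured):.3f}")
```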
Further Use for ERCC: Transcript Length Coverage
• Native genes: interference from divergent annotations and differentially expressed transcript variants
• Primer selectivity: aa
ERCCs with seamless coverage from first to last nucleotide
Native transcripts start with high coverage, indicative of 5’ truncated annotations
Example: SQUARE™ library prep with intrinsic over-representation of termini
[Figure: coverage along ERCC-0096 and along the top 500 transcripts]
1. Company introduction
2. ERCC spike-in mixes in Lexogen’s R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants
Spike-In RNA Variants (SIRVs) - Rationale
• ERCC spike-in controls were designed as mono-exonic RNAs without sequence overlap.
• As a complement, we found it desirable to have a set of nucleic acids that simulate transcript variants and can be used as external spike-in controls.
• This reference set would
  o comprise two or more transcript families, with transcripts of the same family representing reference transcript variants of the same gene,
  o enable the controlled identification and/or quantification of transcript variants in one or more samples, and
  o permit the assessment, validation and correction of bioinformatics pipelines.
Spike-In RNA Variants – Gene Structure
Reference genes
• 7 human genes selected because of diversity in exon-intron structure
• Annotated transcripts (Ensembl database) aligned to gene in CLC Main Workbench
• “Master transcript” created for each gene (sequence of all transcript variants)
[Figure: gene structures of KLK5 and LDHD, shown in CLC Main Workbench 5]
Addition of Transcript Variants
• Annotated transcript variants were analyzed for alternative splicing (AS) events
• AS events not covered by a variant within a family were incorporated in a new variant based on the master transcript
• To cover non-splicing variants, antisense and overlapping transcripts were added (mono- and poly-exonic)
• Further, transcription start-site (TSS) and end-site (TES) variants were added
[Figure: KLK5 and SIRV1]
Spike-In RNA Variants (SIRV): Nucleotide Sequence
AIM
• The nucleotide sequence of the SIRVs should be non-homologous at least to eukaryotic genomes and transcriptomes.
• In the best case, they should not align with any naturally occurring sequence.
SOLUTION
• Genomic sequences from viruses were used to fill in exon sequences. This would work in external controls for eukaryotes.
• Sequences were then inverted (flipped) to lose alignment identity.
  o Final sequences do not align with any entry in the NCBI nt collection when blasted with standard parameters.
  o SIRV sequences also do not align with ERCC sequences.
  o In silico experiments confirmed that NGS reads generated from the SIRVs would not map to the genome of any model organism or the “ERCCome”.
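The inversion step can be illustrated with a short sketch. It assumes that "inverted (flipped)" means reversing the nucleotide order without complementing, which the slides do not spell out. The reverse complement is shown for contrast, because it is only the opposite strand of the same molecule and would still align to the source sequence. The function names and the example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def invert_sequence(seq: str) -> str:
    """Reverse the nucleotide order only (assumed meaning of 'inverted')."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Opposite strand of the same sequence; aligners still find it."""
    return seq.translate(COMPLEMENT)[::-1]


source = "GATTACA"                 # hypothetical source fragment
print(invert_sequence(source))     # ACATTAG
print(reverse_complement(source))  # TGTAATC
```

A plain reversal keeps length and base composition (and therefore GC content) but yields a sequence that matches neither strand of the original.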
Re-establishing Exon-Intron Junction Dinucleotides
• Most junctions are common, i.e. are also annotated in the master transcript.
• These intron sequences are currently annotated as NN (see below), hence junction recognition is no problem for alignment programs
• Exon-defined intron boundaries were converted to GT-AG (97%), GC-AG (2%) or AT-AC (1%)

Junction type   NN-NN          GT-AG          GC-AG       AT-AC       NN-NN + GT-AG
SIRVs           198 (61.11%)   116 (35.80%)   7 (2.16%)   3 (0.93%)   314 (96.91%)
ICE database                   98.70%         0.79%       0.08%

[Figure: nucleotide conversion to conform with the GT-AG rule]
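A junction tally like the one above can be derived from intron sequences by reading the first and last two nucleotides of each intron. The following sketch uses hypothetical helper names and toy introns; it is not the procedure used to build the SIRV annotation.

```python
from collections import Counter


def junction_type(intron: str) -> str:
    """Return the boundary dinucleotides of an intron, e.g. 'GT-AG'."""
    intron = intron.upper().replace("U", "T")
    if len(intron) < 4:
        raise ValueError("intron too short to carry both dinucleotides")
    return f"{intron[:2]}-{intron[-2:]}"


def junction_table(introns):
    """Map each junction type to (count, percentage of all introns)."""
    counts = Counter(junction_type(i) for i in introns)
    total = sum(counts.values())
    return {kind: (n, round(100.0 * n / total, 2)) for kind, n in counts.items()}


# Toy introns, for illustration only.
toy_introns = ["GTAAGTCCTTAG", "GTCCAG", "GCTTAG", "ATGGAC"]
for kind, (n, pct) in sorted(junction_table(toy_introns).items()):
    print(f"{kind}: {n} ({pct}%)")
```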
SIRV Properties - Summary
SIRVs are modelled on mammalian sequences
• Set of seven SIRV families with 6-18 transcript variants each
• 74 transcript variants in total, average length 1200 nt (median 917 nt)
• Variants include alternative splicing, start- and end-site variations, antisense and overlapping transcripts
• GC content: 30-50% (in analogy to ERCC standards)
• Poly(A) tail: A(30) at 3’ end (ERCCs: 19-25 adenosines)
• Length: 220-2,557 nt; longer SIRVs were trimmed by exon removal
Further modifications
• GT-AG exon-intron junction dinucleotide rule observed
• Homopolymer runs: ≤7 nt
• 5’ truncation to obtain a 5’ G, needed for T7 transcription
• No homology to NCBI nt collection entries or ERCC sequences due to sequence inversion
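Several of these properties lend themselves to an automated check of a candidate sequence. The sketch below is a hypothetical validator, not Lexogen's tooling. It assumes that the length range, GC content and homopolymer limit apply to the transcript body without the A(30) tail, which the slides do not specify.

```python
import re


def check_sirv_design(seq: str, min_len: int = 220, max_len: int = 2557,
                      gc_range: tuple = (0.30, 0.50),
                      max_homopolymer: int = 7, tail_len: int = 30) -> list:
    """Return a list of violated design rules (an empty list means all passed)."""
    seq = seq.upper()
    problems = []
    tail = "A" * tail_len
    if not seq.startswith("G"):
        problems.append("no 5' G for T7 transcription")
    if seq.endswith(tail):
        body = seq[:-tail_len]  # assumption: rules apply to the body only
    else:
        problems.append(f"missing A({tail_len}) tail")
        body = seq
    if not min_len <= len(body) <= max_len:
        problems.append(f"length {len(body)} nt outside {min_len}-{max_len} nt")
    if body:
        gc = (body.count("G") + body.count("C")) / len(body)
        if not gc_range[0] <= gc <= gc_range[1]:
            problems.append(f"GC content {gc:.1%} outside allowed range")
        longest = max(len(m.group(0)) for m in re.finditer(r"(.)\1*", body))
        if longest > max_homopolymer:
            problems.append(f"homopolymer run of {longest} nt")
    return problems


# Hypothetical candidate: 5' G, a repetitive toy body, and an A(30) tail.
candidate = "G" + "ATGCATGA" * 30 + "A" * 30
print(check_sirv_design(candidate) or "all checks passed")
```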
SIRV Design - Overview
• Take natural gene structure and annotated transcript variants
• Shorten transcript length to a maximum of 2500 nt
• Fill gene structure with heterologous sequence
• Duplicate and modify to add alternative splicing variants
• Add transcription start-site and end-site variants
• Add antisense and overlapping variants
• Observe GU-AG intron rule
[Figure: variant types illustrated: cassette exon, alternative start-site, alternative end-site, alternative first exon, alternative last exon, intron retention, A5SS, A3SS, MXE, antisense, overlapping, overlapping antisense]
1. Company introduction
2. ERCC spike-in mixes in Lexogen’s R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants
SIRV Production: In vitro Transcription Construct
• Synthetic constructs cloned for singularization and amplification
• Run-off T7 transcription
• Transcript starts with 5’ G, cap optional
• Poly(A) tail added
Construct (5’ to 3’): T7 promoter, restriction site, G, sequence, A(30), restriction site
Transcript length: 220-2,557 nt
SIRV Production, QC and quantification
Production
• Plasmid linearization
• T7 run-off transcription
• Purification (essential!)
• Storage in Na-citrate buffer
Quality Control
• Photometric (NanoDrop): purity, quantification
• Microfluidics (Bioanalyzer): integrity, quantification
• Planned: qPCR for accurate quantification
SIRVs: Mixes & RNA-Seq Samples
Initially, 2 mixes were prepared from 60 purified transcript variants:
1. Equimolar: 1:1:1…
2. Low dynamic range: 1:10:100
3 samples were prepared from these:
1. Equimolar mix, SIRVs only
   Illumina TruSeq library prep without poly(A) selection
2. Equimolar mix, 30% SIRVs, 3% ERCCs, 67% UHR (Universal Human Reference RNA)
   Illumina TruSeq library prep without poly(A) selection
3. Low dynamic range mix, 30% SIRVs, 3% ERCCs, 67% UHR (Universal Human Reference RNA)
   Illumina TruSeq library prep without poly(A) selection
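Mixing transcripts by molar ratio requires converting between mass and moles, which depends on transcript length. The sketch below uses the common approximation for single-stranded RNA, MW ≈ length × 320.5 + 159.0 g/mol. This is an assumption for illustration and not necessarily the conversion used for the SIRV mixes; the variant names, the lengths and the 1 fmol base amount are likewise hypothetical.

```python
def rna_molecular_weight(length_nt: int) -> float:
    """Approximate molecular weight of single-stranded RNA in g/mol."""
    return length_nt * 320.5 + 159.0


def mass_ng(length_nt: int, amount_fmol: float) -> float:
    """Mass in ng for a molar amount in fmol (1 fmol x 1 g/mol = 1e-6 ng)."""
    return amount_fmol * rna_molecular_weight(length_nt) * 1e-6


def mix_masses(lengths: dict, ratios: dict, base_fmol: float = 1.0) -> dict:
    """Mass of each variant (ng) needed to reach the given molar ratios."""
    return {name: mass_ng(lengths[name], base_fmol * ratios[name])
            for name in lengths}


# Hypothetical variants with illustrative lengths in nt.
lengths = {"variant_A": 220, "variant_B": 917, "variant_C": 2557}
equimolar = mix_masses(lengths, {"variant_A": 1, "variant_B": 1, "variant_C": 1})
low_range = mix_masses(lengths, {"variant_A": 1, "variant_B": 10, "variant_C": 100})
for name in lengths:
    print(f"{name}: {equimolar[name]:.4f} ng (1:1:1), {low_range[name]:.4f} ng (1:10:100)")
```

Even in the equimolar mix the masses differ by roughly the ratio of the transcript lengths, which is why mass-based quantification alone is not enough to set up molar ratios.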
SIRVs: RNA-Seq Experiment
• Illumina MiSeq run: 1x150 nt, 27M reads obtained
• Mapping with TopHat (v2.0.8) against combined transcriptomic and genomic reference (Ensembl GRCh37.75), Ambion’s ERCC92, and SIRVs
Sample                            Total reads   Mapping reads   (%)      Uniquely mapping reads   (%)
#1, equimolar SIRVs               10,246,442    8,585,641       83.79%   8,505,344                83.01%
#2, equimolar SIRVs, ERCCs, UHR   10,119,416    8,642,852       85.41%   8,399,336                83.00%
#3, 1:10:100 SIRVs, ERCCs, UHR    6,308,855     5,404,486       85.67%   5,268,757                83.51%

Sample      GRCh37.75   (%)      ERCC92   (%)     SIRVs       (%)
Sample #1   4,330       0.05%    11       0.00%   8,505,555   99.95%
Sample #2   7,521,308   89.55%   38,031   0.45%   839,997     10.00%
Sample #3   4,156,399   78.89%   22,207   0.42%   1,090,151   20.69%
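The percentages in the second table are the share of reads assigned to each reference. A minimal sketch of that calculation, using the sample #2 counts from the table (the function name is hypothetical):

```python
def reference_fractions(counts: dict) -> dict:
    """Percentage of reads assigned to each reference, rounded to 2 decimals."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {ref: round(100.0 * n / total, 2) for ref, n in counts.items()}


# Read counts per reference for sample #2, taken from the table above.
sample_2 = {"GRCh37.75": 7_521_308, "ERCC92": 38_031, "SIRVs": 839_997}
print(reference_fractions(sample_2))
# {'GRCh37.75': 89.55, 'ERCC92': 0.45, 'SIRVs': 10.0}
```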
SIRV RNA-Seq: Input / Output correlation
[Figure: FPKM versus input molecules for samples #1, #2 and #3, and sample #2 FPKM versus sample #1 FPKM]
SIRVs RNA-Seq: Transcript Hypotheses
Transcript hypotheses by Cufflinks
• Not complete: e.g., A3SS and exons not recognized despite multiple exon-exon reads
[Figure: Cufflinks transcript hypotheses]
Spike-In RNA Variants: Short Summary
Design & production
• 74 transcript variants in 7 families (6-18 variants / family)
• Mimic eukaryotic genes in length and GC content; A(30) tail
• Include variation on alternative splicing, transcription start-sites and end-sites, sense/antisense and overlapping genes
• No homology to NCBI nt collection entries or ERCC sequences
• Produced from stock plasmids as T7 run-off transcripts
Mixtures
• 60 SIRVs were mixed in equimolar or low dynamic range (10²) concentrations
Application in RNA-Seq
• Mixtures showed high mappability and no cross-mapping with UHR or ERCCs
• Low input / output correlation as determined by TopHat / Cufflinks derived FPKM
• Cufflinks cannot reconstruct all SIRV transcript variants, even in the equimolar mix, which will lead to wrong FPKM values
Spike-In RNA Variants: Outlook
Optimizing production & quantification
• Large-scale production and purification of transcripts
• qPCR-based quantification in addition to NanoDrop & Bioanalyzer results
Application
• Evaluation of software for its performance in transcript hypothesis building and transcript isoform quantification
Open questions
• Concentration range?
• Sufficient variant complexity? Length? Capping? SNPs?
• How many different mixes?
• Pipeline validation (Consortium?)
• Sample comparison (DE)
• Technical variation
• Master mix vs. modules: ERCCs, SIRVs, ncRNA standards & miRNA standards (complexity, price, validation?)