20140710 3 l_paul_ercc2.0_workshop

© Lexogen, 2013

Spike-In RNA Variants: Design, Production and Application

ERCC 2.0 workshop

Stanford University – July 10-11, 2014

PPT Number: TBD
Project Number: 0221
Theme: T5.2 Mixquer Transcript Quantification (WAFF)
Author: Lukas Paul

© Lexogen, 2014

1. Company introduction
2. ERCC spike-in mixes in Lexogen’s R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants


Confidential


Lexogen: Company

• Founded in 2007
• Based in Vienna, Austria
• 28 employees (75% in R&D)
• Lexogen, Inc.: overnight delivery to US customers
• Services & products with focus on
  o Transcriptome profiling technologies
  o Complementary technologies to Next Generation Sequencing
  o Innovative solutions for transcriptome research

Lexogen’s mission is to develop innovative technologies that make it possible to resolve all complexities of the transcriptome, one of the most enigmatic and exciting areas in biology.

www.LEXOGEN.com


SENSE™ mRNA-Seq Library Preparation Kit

• Convenient, fragmentation-free workflow
• Core technology: reverse transcription and ligation on intact RNA
• Results in very high preservation of strand orientation

PN0203 PPT0383

ERCC-based Validation of Strandedness

• Strandedness usually quantified by comparing the orientation of a mapped read with the genome annotation

• Problem: annotation incomplete & natural antisense transcription interferes

Use of ERCC transcripts with known orientation provides an absolute means to determine strandedness


Total RNA   Strand Specificity (ERCCs only) a   False Antisense Reads b   Sense Reads (genome-wide) c
2 µg        99.997%                             0.003%                    99.890%
1 µg        99.986%                             0.014%                    99.815%
500 ng      99.997%                             0.003%                    99.821%
50 ng       99.965%                             0.035%                    99.779%

a Number of reads mapping to ERCC genes in the sense direction divided by the total number of ERCC reads.
b Number of antisense reads mapping to ERCC transcripts divided by the total number of reads mapped to the ERCC genome.
c Number of reads mapping to annotated genes in the sense orientation divided by the number of reads mapping in both directions. Note that this measure includes biologically relevant antisense transcription.
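The footnote definitions reduce to simple ratios of read counts. A minimal Python sketch, assuming hypothetical sense and antisense read counts for the ERCC transcripts (the function name and the numbers are illustrative and not taken from the slides):

```python
def strand_specificity(sense_reads: int, antisense_reads: int) -> tuple:
    """Return (strand specificity %, false antisense %) for ERCC-mapped reads."""
    total = sense_reads + antisense_reads
    if total == 0:
        raise ValueError("no ERCC reads mapped")
    specificity = 100.0 * sense_reads / total
    false_antisense = 100.0 * antisense_reads / total
    return specificity, false_antisense


# Hypothetical counts: 99,997 sense and 3 antisense ERCC reads.
spec, false_as = strand_specificity(99_997, 3)
print(f"strand specificity: {spec:.3f}%  false antisense: {false_as:.3f}%")
```

Because the orientation of every ERCC transcript is known, any antisense read counted here can be attributed to the library preparation rather than to biology.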


ERCC-validated Strandedness Determines False Positive Background of Library Preparation Method


Knowing the strandedness of the library preparation protocol allows for determining whether a detected transcript is truly antisense or belongs to the false positive background.

[Figure: true antisense transcripts at 98% and 99.9% strandedness; plotted values: 1153 and 2415]
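One way to make this reasoning concrete is to compare the antisense reads observed at a locus with the number expected from strand errors of the library preparation alone. The sketch below is an illustrative heuristic, not the method used to produce the figure; the function name, the safety factor and the example counts are assumptions.

```python
def is_above_antisense_background(sense_reads, antisense_reads,
                                  strandedness, safety_factor=3.0):
    """Heuristic: is the antisense signal larger than expected from strand errors?

    strandedness  -- fraction of reads assigned to the correct strand (e.g. 0.999)
    safety_factor -- hypothetical margin above the expected background
    """
    if not 0.5 < strandedness <= 1.0:
        raise ValueError("strandedness must be in (0.5, 1.0]")
    error_rate = 1.0 - strandedness
    # Reads of the sense transcript expected to appear on the wrong strand.
    expected_false = sense_reads * error_rate / strandedness
    return antisense_reads > safety_factor * expected_false


# Hypothetical locus: 10,000 sense reads and 60 antisense reads.
for s in (0.98, 0.999):
    print(s, is_above_antisense_background(10_000, 60, s))
```

With these made-up counts, the same 60 antisense reads fall inside the expected background at 98% strandedness but exceed it at 99.9%.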


“ERCC-validated” Strandedness in Lexogen’s Portfolio

• SENSE mRNA-Seq library preparation kit

• SENSE Total RNA-Seq library preparation kit


• QuantSeq™ 3’ mRNA library preparation kit (see workflow figure); ERCCs are also used to assess the correctness of 3’ end mapping


Correlation Between ERCC Input and FPKM Measured


[Figure: log-log scatter plot of FPKM (y-axis) against number of ERCC input molecules [10²] (x-axis). SENSE: R² = 0.910; competitors: R² = 0.834]
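A coefficient of determination like the ones quoted for this plot can be computed from paired input amounts and FPKM values. The sketch below assumes that the correlation is taken on log10-transformed values (the slide does not state this) and uses made-up data points.

```python
import math


def r_squared_loglog(molecules, fpkm):
    """Squared Pearson correlation of log10(input molecules) vs. log10(FPKM)."""
    pairs = [(math.log10(m), math.log10(f))
             for m, f in zip(molecules, fpkm) if m > 0 and f > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two positive data points")
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("values must vary to compute a correlation")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical ERCC input amounts and measured FPKM values.
molecules = [1e2, 1e3, 1e4, 1e5, 1e6]
fpkm = [0.05, 0.9, 7.0, 120.0, 800.0]
print(f"R^2 = {r_squared_loglog(molecules, fpkm):.3f}")
```

Transcripts with zero FPKM are dropped here because they have no logarithm; how undetected transcripts were handled in the original analysis is not stated.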


Further Use for ERCC: Transcript Length Coverage

• Native genes: interference from divergent annotations and differentially expressed transcript variants

• Primer selectivity: aa

ERCCs: seamless coverage from first to last nucleotide
Native transcripts: start with high coverage, indicative of 5’-truncated annotations

Example: SQUARE™ library prep with intrinsic over-representation of termini

[Figure panels: ERCC-0096; top 500 transcripts]


Spike-In RNA Variants (SIRVs) - Rationale

• ERCC spike-in controls were designed as mono-exonic RNAs without sequence overlap.

• As a complement, we found it desirable to have a set of nucleic acids that simulate transcript variants and can be used as external spike-in controls.

• This reference set would
  o comprise two or more transcript families, with transcripts of the same family representing reference transcript variants of the same gene,
  o enable the controlled identification and/or quantification of transcript variants in one or more samples, and
  o permit the assessment, validation and correction of bioinformatics pipelines.



Spike-In RNA Variants – Gene Structure

Reference genes
• 7 human genes selected because of diversity in exon-intron structure
• Annotated transcripts (Ensembl database) aligned to gene in CLC workbench
• “Master transcript” created for each gene (sequence of all transcript variants)

[Figures: gene structures of KLK5 and LDHD, displayed in CLC Main Workbench 5]

Addition of Transcript Variants

• Annotated transcript variants were analyzed for alternative splicing (AS) events
• AS events not covered by a variant within a family were incorporated in a new variant based on the master transcript
• To cover non-splicing variants, antisense and overlapping transcripts were added (mono- and poly-exonic)
• Further, Transcription Start-Site (TSS) and End-Site (TES) variants were added

[Figure labels: KLK5, SIRV1]



Spike-In RNA Variants (SIRV): Nucleotide Sequence

AIM
• The nucleotide sequences of the SIRVs should be non-homologous, at least to eukaryotic genomes and transcriptomes.
• In the best case, they should not align with any naturally occurring sequence.

SOLUTION
• Genomic sequences from viruses were used to fill in the exon sequences. This is suitable for external controls for eukaryotes.
• Sequences were then inverted (flipped) to lose alignment identity.
  o Final sequences do not align with any entry in the NCBI nt collection when blasted with standard parameters.
  o SIRV sequences also do not align with ERCC sequences.
  o In silico experiments confirmed that NGS reads generated from the SIRVs would not map to the genome of any model organism or the “ERCCome”.
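Read literally, inversion means reading a sequence back to front without complementing it. The sketch below assumes that interpretation (the slide does not specify the exact operation) and contrasts it with the reverse complement, which would still align to the opposite strand of the source genome. The example sequence is made up.

```python
def invert(seq: str) -> str:
    """Reverse the sequence without complementing (assumed meaning of 'flipped')."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Reverse complement, shown only for contrast."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


viral_fragment = "ATGGCGTTAACC"  # hypothetical exon filler sequence
print(invert(viral_fragment))              # CCAATTGCGGTA
print(reverse_complement(viral_fragment))  # GGTTAACGCCAT
```

Plain reversal keeps the length and base composition of the original sequence, so properties such as GC content are unchanged by this step.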


Re-establishing Exon-Intron Junction Dinucleotides


• Most junctions are common, i.e. are also annotated in the master transcript.

• These intron sequences are currently annotated as NN (see below), hence junction recognition is no problem for alignment programs

               NN-NN          GT-AG          GC-AG       AT-AC
SIRVs          198 (61.11%)   116 (35.80%)   7 (2.16%)   3 (0.93%)
ICE database                  98.70%         0.79%       0.08%

NN-NN and GT-AG combined (SIRVs): 314 (96.91%)

• Exon-defined intron boundaries were converted to GT-AG (97%), GC-AG (2%) or AT-AC (1%)

Nucleotide conversion to conform with GT-AG rule
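Classifying introns by their terminal dinucleotides, as in the junction table, can be sketched as follows. The intron sequences are made up, and the NN-NN class here simply stands for introns whose boundaries are annotated as N.

```python
from collections import Counter


def junction_type(intron: str) -> str:
    """Return the donor-acceptor dinucleotide pair of an intron, e.g. 'GT-AG'."""
    intron = intron.upper()
    if len(intron) < 4:
        raise ValueError("intron too short to have two dinucleotides")
    return f"{intron[:2]}-{intron[-2:]}"


# Hypothetical intron sequences (DNA alphabet, 5' to 3').
introns = ["GTAAGTCCTTAG", "GTGAGTTTCCAG", "GCAAGTTTTCAG", "ATATCCTTTTAC", "NNNNNNNN"]
counts = Counter(junction_type(i) for i in introns)
for kind, n in counts.most_common():
    print(f"{kind}: {n} ({100 * n / len(introns):.2f}%)")
```

The same tally applied to a real annotation would give the per-class counts and percentages reported for the SIRVs.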


SIRV Properties - Summary

SIRVs are modelled on mammalian sequences
• Set of seven SIRV families with 6-18 transcript variants each
• 74 transcript variants in total, average length 1200 nt (median 917 nt)
• Variants include alternative splicing, start- and end-site variations, antisense and overlapping transcripts
• GC content: 30-50% (in analogy to ERCC standards)
• Poly(A) tail: A(30) at 3’-end (ERCCs: 19-25 adenosines)
• Length: 220-2,557 nt; longer SIRVs were trimmed by exon removal

Further modifications
• GT-AG exon-intron junction dinucleotide rule observed
• Homopolymer runs: ≤7 nt
• 5’ truncation to obtain 5’ G, needed for T7 transcription
• No homology to NCBI nt collection entries or ERCC sequences due to sequence inversion
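Several of the listed properties are easy to verify programmatically for a candidate sequence. The following sketch is illustrative only: the function name and test sequence are hypothetical, and it assumes that the GC and homopolymer limits apply to the transcript body without the A(30) tail while the length range applies to the full transcript (the slides do not say).

```python
import re


def check_sirv_design(seq: str) -> dict:
    """Check a DNA-alphabet transcript sequence against the listed design properties."""
    seq = seq.upper()
    has_tail = seq.endswith("A" * 30)
    # Assumption: GC content and homopolymer limits are evaluated without the tail.
    body = seq[:-30] if has_tail else seq
    if not body:
        raise ValueError("sequence has no body besides the poly(A) tail")
    gc_percent = 100.0 * sum(base in "GC" for base in body) / len(body)
    longest_run = max(
        (len(m.group(0)) for m in re.finditer(r"A+|C+|G+|T+", body)), default=0
    )
    return {
        "starts_with_G": seq.startswith("G"),
        "has_A30_tail": has_tail,
        "length_220_to_2557": 220 <= len(seq) <= 2557,
        "gc_30_to_50_percent": 30.0 <= gc_percent <= 50.0,
        "homopolymer_max_7": longest_run <= 7,
    }


# Hypothetical candidate: 5' G, a repetitive 300 nt body, and an A(30) tail.
unit = "ACGTTGCAAT"
candidate = "G" + unit * 30 + "A" * 30
for name, ok in check_sirv_design(candidate).items():
    print(f"{name}: {ok}")
```

Homology to database entries cannot be checked this way; that requires an alignment search such as the BLAST comparison mentioned in the slides.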



SIRV Design - Overview


• Take natural gene structure and annotated transcript variants
• Shorten transcript length to a maximum of 2500 nt
• Fill gene structure with heterologous sequence
• Duplicate and modify to add alternative splicing variants
• Add transcription start-site and end-site variants
• Add antisense and overlapping variants
• Observe GU-AG intron rule

[Figure: variant types illustrated: cassette exon, alternative start-site, alternative end-site, alternative first exon, alternative last exon, intron retention, A5SS, A3SS, MXE, antisense, overlapping, overlapping antisense]


SIRV Production: In vitro Transcription Construct


• Synthetic constructs cloned for singularization and amplification
• Run-off T7 transcription
• Transcript starts with 5’ G, cap optional
• Poly(A) tail added

Construct (5’ to 3’): T7 promoter | restriction site | G | sequence | A(30) | restriction site
Transcript length: 220-2557 nt


SIRV Production, QC and quantification

Production
• Plasmid linearization
• T7 run-off transcription
• Purification (essential!)
• Storage in Na-citrate buffer

Quality Control
• Photometric (Nanodrop): purity, quantification
• Microfluidics (Bioanalyzer): integrity, quantification
• Planned: qPCR for accurate quantification



SIRVs: Mixes & RNA-Seq Samples

Initially, 2 mixes were prepared from 60 purified transcript variants:
1. Equimolar: 1:1:1…
2. Low dynamic range: 1:10:100

3 samples were prepared from these:
1. Equimolar mix, SIRVs only; Illumina TruSeq library prep without poly(A) selection
2. Equimolar mix: 30% SIRVs, 3% ERCCs, 67% UHR (Universal Human Reference RNA); Illumina TruSeq library prep without poly(A) selection
3. Low dynamic range mix: 30% SIRVs, 3% ERCCs, 67% UHR (Universal Human Reference RNA); Illumina TruSeq library prep without poly(A) selection
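How molar ratios translate into RNA mass can be sketched as follows. Transcript names, lengths and the assignment of transcripts to the 1:10:100 concentration levels are hypothetical, mass is approximated as proportional to molar amount times transcript length, and the 30% SIRV share is interpreted as a mass fraction, which the slides do not state.

```python
def mass_fractions(lengths_nt, molar_ratio):
    """Relative mass share of each transcript, taking mass ~ moles x length."""
    weights = {name: molar_ratio[name] * length for name, length in lengths_nt.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical transcripts and lengths (nt).
lengths = {"variant_A": 500, "variant_B": 1000, "variant_C": 2000}

equimolar = mass_fractions(lengths, {"variant_A": 1, "variant_B": 1, "variant_C": 1})
low_range = mass_fractions(lengths, {"variant_A": 1, "variant_B": 10, "variant_C": 100})

sirv_share_of_sample = 0.30  # assumed mass fraction of SIRVs in samples 2 and 3
for name in lengths:
    print(name,
          f"equimolar: {sirv_share_of_sample * equimolar[name]:.4f}",
          f"1:10:100: {sirv_share_of_sample * low_range[name]:.4f}")
```

Even in an equimolar mix, longer transcripts contribute more mass and therefore more reads, which is why length-normalized measures such as FPKM are used for the comparison with input amounts.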



SIRVs: RNA-Seq Experiment

• Illumina MiSeq run: 1x150 nt, 27M reads obtained
• Mapping with TopHat (v.2.0.8) against combined transcriptomic and genomic reference (Ensembl GRCh37.75), Ambion’s ERCC92, and SIRVs

Sample                            Total reads   Mapping reads (%)     Uniquely mapping reads (%)
#1, equimolar SIRVs               10,246,442    8,585,641 (83.79%)    8,505,344 (83.01%)
#2, equimolar SIRVs, ERCCs, UHR   10,119,416    8,642,852 (85.41%)    8,399,336 (83.00%)
#3, 1:10:100 SIRVs, ERCCs, UHR     6,308,855    5,404,486 (85.67%)    5,268,757 (83.51%)

Sample       GRCh37.75            ERCC92           SIRVs
Sample #1    4,330 (0.05%)        11 (0.00%)       8,505,555 (99.95%)
Sample #2    7,521,308 (89.55%)   38,031 (0.45%)   839,997 (10.00%)
Sample #3    4,156,399 (78.89%)   22,207 (0.42%)   1,090,151 (20.69%)
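The percentages in these tables are plain ratios of read counts. The sketch below recomputes them for sample #2 from the reported numbers; it assumes that the per-reference percentages are relative to the uniquely mapping reads, which is consistent with the three per-reference counts of sample #2 adding up to that total.

```python
def percent(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`."""
    return 100.0 * part / whole


# Sample #2 (equimolar SIRVs, ERCCs, UHR), numbers as reported in the slides.
total_reads = 10_119_416
mapped_reads = 8_642_852
unique_by_reference = {"GRCh37.75": 7_521_308, "ERCC92": 38_031, "SIRVs": 839_997}

unique_reads = sum(unique_by_reference.values())
print(f"mapped:          {percent(mapped_reads, total_reads):.2f}%")
print(f"uniquely mapped: {percent(unique_reads, total_reads):.2f}%")
for reference, reads in unique_by_reference.items():
    print(f"{reference}: {percent(reads, unique_reads):.2f}% of uniquely mapped reads")
```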


SIRV RNA-Seq: Input / Output correlation


[Figure: scatter plots of FPKM against input molecules for samples #1, #2 and #3, and of sample #1 FPKM against sample #2 FPKM (#1 vs #2)]


SIRVs RNA-Seq: Transcript Hypotheses

Transcript hypotheses by Cufflinks
• Not complete: e.g., A3SS and exons not recognized despite multiple exon-exon reads

[Figure: Cufflinks transcript hypotheses]
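Whether an assembler has reconstructed a given variant can be checked by comparing intron chains. The sketch below assumes that transcripts are represented as lists of (start, end) exon coordinates; all coordinates and names are hypothetical, and this is not how Cufflinks itself reports its results.

```python
def intron_chain(exons):
    """Introns implied by sorted exons, as (end of exon i, start of exon i+1) pairs."""
    exons = sorted(exons)
    return tuple((left[1], right[0]) for left, right in zip(exons, exons[1:]))


def missed_variants(reference, assembled):
    """Names of reference transcripts whose intron chain is absent from the assembly."""
    found = {intron_chain(exons) for exons in assembled.values()}
    return sorted(name for name, exons in reference.items()
                  if intron_chain(exons) not in found)


# Hypothetical reference variants: the second uses an alternative 3' splice site.
reference = {
    "variant_1": [(1, 100), (201, 300), (401, 500)],
    "variant_2": [(1, 100), (211, 300), (401, 500)],
}
# Hypothetical assembler output that only recovered the first variant.
assembled = {"hypothesis_1": [(1, 100), (201, 300), (401, 500)]}

print(missed_variants(reference, assembled))
```

Mono-exonic transcripts all have an empty intron chain, so they would need a separate comparison, for example by coordinates and strand.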


Spike-In RNA Variants: Short Summary

Design & production
• 74 transcript variants in 7 families (6-18 variants / family)
• Mimic eukaryotic genes in length and GC content; A(30) tail
• Include variation on alternative splicing, transcription start-sites and end-sites, sense/antisense and overlapping genes
• No homology to NCBI nt collection entries or ERCC sequences
• Produced from stock plasmids as T7 run-off transcripts

Mixtures
• 60 SIRVs were mixed in equimolar or low dynamic range (10²) concentrations

Application in RNA-Seq
• Mixtures showed high mappability and no cross-mapping with UHR or ERCCs
• Low input / output correlation as determined by TopHat / Cufflinks-derived FPKM
• Cufflinks cannot reconstruct all SIRV transcript variants, even in the equimolar mix, which will lead to wrong FPKM values



Spike-In RNA Variants: Outlook

Optimizing production & quantification
• Large-scale production and purification of transcripts
• qPCR-based quantification in addition to Nanodrop & Bioanalyzer results

Application
• Evaluation of software for its performance in transcript hypothesis building and transcript isoform quantification

Open questions
• Concentration range?
• Sufficient variant complexity? Length? Capping? SNPs?
• How many different mixes?
• Pipeline validation (Consortium?)
• Sample comparison (DE)
• Technical variation
• Master mix vs. modules: ERCCs, SIRVs, ncRNA standards & miRNA standards (complexity, price, validation?)

