© Lexogen, 2013

Spike-In RNA Variants: Design, Production and Application

ERCC 2.0 workshop
Stanford University, July 10-11, 2014

PPT Number TBD
Project Number 0221
Theme T5.2 Mixquer Transcript Quantification (WAFF)
Author Lukas Paul

Page 2: 20140710 3 l_paul_ercc2.0_workshop

© Lexogen, 2014

ERCC 2.0 Workshop

1. Company introduction
2. ERCC spike-in mixes in Lexogen’s R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants

Confidential

Page 3: 20140710 3 l_paul_ercc2.0_workshop

Lexogen: Company

• Founded in 2007
• Based in Vienna, Austria
• 28 employees (75% in R&D)
• Lexogen, Inc.: overnight delivery to US customers
• Services & products with focus on
  o Transcriptome profiling technologies
  o Complementary technologies to Next Generation Sequencing
  o Innovative solutions for transcriptome research

Lexogen’s mission is to develop innovative technologies that make it possible to resolve all complexities of the transcriptome, one of the most enigmatic and exciting areas in biology.

www.LEXOGEN.com

Page 4: 20140710 3 l_paul_ercc2.0_workshop

ERCC 2.0 Workshop

1. Company introduction
2. ERCC spike-in mixes in Lexogen’s R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants

Page 5: 20140710 3 l_paul_ercc2.0_workshop

SENSE™ mRNA-Seq Library Preparation Kit

• Convenient, fragmentation-free workflow
• Core technology: reverse transcription and ligation on intact RNA
• Results in very high preservation of strand orientation

PN0203 PPT0383

Page 6: 20140710 3 l_paul_ercc2.0_workshop

ERCC-based Validation of Strandedness

• Strandedness is usually quantified by comparing the orientation of a mapped read with the genome annotation
• Problem: the annotation is incomplete and natural antisense transcription interferes

Using ERCC transcripts of known orientation provides an absolute means to determine strandedness.

Total RNA   Strand Specificity (ERCCs only) a   False Antisense Reads b   Sense Reads (genome-wide) c
2 µg        99.997%                             0.003%                    99.890%
1 µg        99.986%                             0.014%                    99.815%
500 ng      99.997%                             0.003%                    99.821%
50 ng       99.965%                             0.035%                    99.779%

a Number of reads mapping to ERCC genes in the sense direction divided by the total number of ERCC reads.
b Number of antisense reads mapping to ERCC transcripts divided by the total number of reads mapped to the ERCC genome.
c Number of reads mapping to annotated genes in the sense orientation divided by the number of reads mapping in both directions. Note that this measure includes biologically relevant antisense transcription.
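The two ERCC-based measures defined in the footnotes are simple ratios of read counts. A minimal Python sketch of that arithmetic follows; the function name and the read counts are hypothetical and are not taken from the slides.

```python
def strand_specificity(sense_reads: int, antisense_reads: int) -> tuple[float, float]:
    """Return (strand specificity, false antisense fraction) for spike-in reads.

    Both values are fractions of all reads mapped to the spike-in reference,
    so they always add up to 1.
    """
    total = sense_reads + antisense_reads
    if total <= 0:
        raise ValueError("no spike-in reads mapped")
    return sense_reads / total, antisense_reads / total


# Hypothetical counts, chosen only to illustrate the calculation.
specificity, false_antisense = strand_specificity(sense_reads=99_997, antisense_reads=3)
print(f"{specificity:.3%} sense, {false_antisense:.3%} false antisense")
```

Because the orientation of every spike-in transcript is known, any antisense read counted here can be attributed to the library preparation rather than to biology.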

Page 7: 20140710 3 l_paul_ercc2.0_workshop

ERCC-validated Strandedness Determines False Positive Background of Library Preparation Method

Knowing the strandedness of the library preparation protocol allows for determining whether a detected transcript is truly antisense or belongs to the false positive background.

[Figure: number of true antisense transcripts at 98% and 99.9% strandedness; values shown: 1153 and 2415]
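One way to turn a measured strandedness into a decision rule is sketched below. It is an illustration under stated assumptions, not the method used for the figure: false antisense reads are assumed to arise from sense reads at a rate of (1 - s)/s for strandedness s, and the observed antisense count is compared against that background with a Poisson tail probability. All function names and numbers are hypothetical.

```python
import math


def expected_false_antisense(sense_reads: int, strandedness: float) -> float:
    """Expected antisense reads caused purely by imperfect strandedness."""
    if not 0.0 < strandedness <= 1.0:
        raise ValueError("strandedness must be in (0, 1]")
    return sense_reads * (1.0 - strandedness) / strandedness


def poisson_tail(observed: int, mean: float) -> float:
    """P(X >= observed) for X ~ Poisson(mean)."""
    if mean <= 0.0:
        return 1.0 if observed <= 0 else 0.0
    term = math.exp(-mean)  # P(X = 0)
    cdf = 0.0
    for k in range(observed):
        cdf += term
        term *= mean / (k + 1)  # P(X = k + 1) from P(X = k)
    return min(1.0, max(0.0, 1.0 - cdf))


# Hypothetical locus: 10,000 sense reads and 50 antisense reads.
for s in (0.98, 0.999):
    background = expected_false_antisense(10_000, s)
    p = poisson_tail(50, background)
    print(f"strandedness {s:.1%}: background {background:.1f} reads, p = {p:.3g}")
```

With these made-up counts, the 50 antisense reads are indistinguishable from background at 98% strandedness but stand out clearly at 99.9%.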

Page 8: 20140710 3 l_paul_ercc2.0_workshop

© Lexogen, 2014 8

“ERCC-validated”  Strandedness  in  Lexogen’s  Portfolio  

• SENSE mRNA-Seq library preparation kit

• SENSE Total RNA-Seq library preparation kit

Vertraulich / Confidential PN0203 PPT0383

• QuantSeqTM 3’  mRNA  library preparation Kit, see workflow (right), ERCCs also used to assess correctness  of  3’  end  mapping

Page 9: 20140710 3 l_paul_ercc2.0_workshop

Correlation Between ERCC Input and FPKM Measured

[Figure: measured FPKM (y-axis, logarithmic) plotted against N of molecules [10²] (x-axis, logarithmic). SENSE: R² = 0.910; competitors: R² = 0.834]
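An input/output correlation of this kind can be computed as the squared Pearson correlation of the log-transformed values. The sketch below assumes log10 transformation of both axes, which the slide does not state explicitly, and uses made-up numbers.

```python
import math


def log_r_squared(input_molecules: list[float], fpkm: list[float]) -> float:
    """Squared Pearson correlation between log10(input) and log10(FPKM).

    Pairs with a non-positive value are skipped because their log is undefined.
    """
    pairs = [(math.log10(x), math.log10(y))
             for x, y in zip(input_molecules, fpkm) if x > 0 and y > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two detectable spike-ins")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in input or output values")
    return sxy ** 2 / (sxx * syy)


# Hypothetical spike-in inputs and measured FPKM values.
molecules = [1e2, 1e3, 1e4, 1e5, 1e6]
measured = [0.4, 6.0, 45.0, 700.0, 5200.0]
print(f"R^2 = {log_r_squared(molecules, measured):.3f}")
```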

Page 10: 20140710 3 l_paul_ercc2.0_workshop

Further Use for ERCC: Transcript Length Coverage

• Native genes: interference from divergent annotations and differentially expressed transcript variants
• Primer selectivity: aa

ERCCs show seamless coverage from the first to the last nucleotide.
Native transcripts start with high coverage, indicative of 5’-truncated annotations.

Example: SQUARE™ library prep with intrinsic over-representation of termini

[Figure: coverage plots; panels: ERCC-0096, Top 500 transcripts]

Page 11: 20140710 3 l_paul_ercc2.0_workshop

ERCC 2.0 Workshop

1. Company introduction
2. ERCC spike-in mixes in Lexogen’s R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants

Page 12: 20140710 3 l_paul_ercc2.0_workshop

Spike-In RNA Variants (SIRVs) - Rationale

• ERCC spike-in controls were designed as mono-exonic RNAs without sequence overlap.
• As a complement, we found it desirable to have a set of nucleic acids that simulate transcript variants and can be used as external spike-in controls.
• This reference set would
  o comprise two or more transcript families, with transcripts of the same family representing reference transcript variants of the same gene,
  o enable the controlled identification and/or quantification of transcript variants in one or more samples, and
  o permit the assessment, validation and correction of bioinformatics pipelines.

Page 13: 20140710 3 l_paul_ercc2.0_workshop

Spike-In RNA Variants – Gene Structure

Reference genes
• 7 human genes selected because of diversity in exon-intron structure
• Annotated transcripts (Ensembl database) aligned to gene in CLC workbench
• “Master transcript” created for each gene (sequence of all transcript variants)

[Figure: gene structures of KLK5 and LDHD (CLC Main Workbench 5)]
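A master transcript can be pictured as the union of all exonic regions of a gene's annotated variants. The sketch below assumes that interpretation and uses 1-based, inclusive exon coordinates; the function name, transcript names and coordinates are hypothetical, and the slides describe the actual work as done in CLC workbench.

```python
def master_exons(variants: dict[str, list[tuple[int, int]]]) -> list[tuple[int, int]]:
    """Merge the exons of all transcript variants into non-overlapping blocks.

    Coordinates are 1-based and inclusive; overlapping or directly adjacent
    exons are fused into one block.
    """
    intervals = sorted(exon for exons in variants.values() for exon in exons)
    merged: list[list[int]] = []
    for start, end in intervals:
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end) for start, end in merged]


# Hypothetical gene with two annotated variants.
gene = {
    "variant_1": [(1, 100), (201, 300)],
    "variant_2": [(1, 100), (151, 300), (401, 500)],
}
print(master_exons(gene))
```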

Page 14: 20140710 3 l_paul_ercc2.0_workshop

Addition of Transcript Variants

• Annotated transcript variants were analyzed for alternative splicing (AS) events
• AS events not covered by a variant within a family were incorporated in a new variant based on the master transcript
• To cover non-splicing variants, antisense and overlapping transcripts were added (mono- and poly-exonic)
• Further, transcription start-site (TSS) and end-site (TES) variants were added

[Figure: KLK5 and SIRV1 transcript variants]

Page 15: 20140710 3 l_paul_ercc2.0_workshop

Spike-In RNA Variants (SIRV): Nucleotide Sequence

AIM
• The nucleotide sequence of the SIRVs should be non-homologous, at least to eukaryotic genomes and transcriptomes.
• In the best case, they should not align with any naturally occurring sequence.

SOLUTION
• Genomic sequences from viruses were used to fill in exon sequences. These would work in external controls for eukaryotes.
• Sequences were then inverted (flipped) to lose alignment identity.

Final sequences do not align with any entry in the NCBI nt collection when blasted with standard parameters.
SIRV sequences also do not align with ERCC sequences.
In silico experiments confirmed that NGS reads generated from the SIRVs would not map to the genome of any model organism or the “ERCCome”.
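The inversion step can be illustrated in a few lines of Python. This sketch assumes that "inverted (flipped)" means reversing the base order without complementing, which is what removes alignment identity; a reverse complement would still align to the opposite strand of the source. The example sequence is made up.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def invert(seq: str) -> str:
    """Reverse the base order only (assumed meaning of 'inverted')."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Opposite strand of the same molecule; still alignable to the source."""
    return seq.translate(COMPLEMENT)[::-1]


source = "ATGGCC"  # hypothetical stretch of viral sequence
print(invert(source))              # reversed base order
print(reverse_complement(source))  # shown only for contrast
```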

Page 16: 20140710 3 l_paul_ercc2.0_workshop

Re-establishing Exon-Intron Junction Dinucleotides

• Most junctions are common, i.e. are also annotated in the master transcript.
• These intron sequences are currently annotated as NN (see below), hence junction recognition is no problem for alignment programs

               NN-NN          GT-AG          GC-AG       AT-AC       NN-NN + GT-AG
SIRVs          198 (61.11%)   116 (31.10%)   7 (2.16%)   3 (0.93%)   314 (96.91%)
ICE database                  98.70%         0.79%       0.08%

• Exon-defined intron boundaries were converted to GT-AG (97%), GC-AG (2%) or AT-AC (1%)

Nucleotide conversion to conform with GT-AG rule
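A junction tally like the one in the table can be produced by classifying each intron by its first and last two nucleotides. The following is a minimal sketch with made-up intron sequences; the function names are hypothetical.

```python
from collections import Counter

# Intron boundary classes named on the slide.
JUNCTION_CLASSES = {("GT", "AG"): "GT-AG", ("GC", "AG"): "GC-AG", ("AT", "AC"): "AT-AC"}


def classify_intron(intron: str) -> str:
    """Classify an intron by its donor (first 2 nt) and acceptor (last 2 nt)."""
    seq = intron.upper().replace("U", "T")
    if len(seq) < 4:
        raise ValueError("intron too short to carry both dinucleotides")
    return JUNCTION_CLASSES.get((seq[:2], seq[-2:]), "other")


def junction_summary(introns: list[str]) -> dict[str, tuple[int, float]]:
    """Return {class: (count, percent of all introns)}."""
    counts = Counter(classify_intron(intron) for intron in introns)
    total = sum(counts.values())
    return {name: (n, 100.0 * n / total) for name, n in counts.items()}


# Hypothetical intron sequences.
introns = ["GTAAGTTTTCAG", "GCAAGTCCAG", "ATATCCTTAC", "GTNNNNAG"]
print(junction_summary(introns))
```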

Page 17: 20140710 3 l_paul_ercc2.0_workshop

SIRV Properties - Summary

SIRVs are modelled on mammalian sequences
• Set of seven SIRV families with 6-18 transcript variants each
• 74 transcript variants in total, average length 1200 nt (median 917 nt)
• Variants include alternative splicing, start- and end-site variations, antisense and overlapping transcripts
• GC content: 30-50% (in analogy to ERCC standards)
• Poly(A) tail: A(30) at 3’-end (ERCCs: 19-25 adenosines)
• Length: 220-2,557 nt; longer SIRVs were trimmed by exon removal

Further modifications
• GT-AG exon-intron junction dinucleotide rule observed
• Homopolymer runs: ≤7 nt
• 5’ truncation to obtain 5’ G, needed for T7 transcription
• No homology to NCBI nt collection entries or ERCC sequences due to sequence inversion
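Several of the listed properties can be checked per sequence. The sketch below is an illustrative validator, not a tool from the presentation; it assumes that GC content and the homopolymer limit apply to the sequence without the A(30) tail and that the length range applies to the full transcript. The test sequence is synthetic.

```python
import re


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    return sum(seq.count(base) for base in "GC") / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base (0 for an empty string)."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)


def check_sirv(seq: str, tail_len: int = 30) -> dict[str, bool]:
    """Check one transcript against the design rules listed on the slide."""
    seq = seq.upper()
    has_tail = seq.endswith("A" * tail_len)
    body = seq[:-tail_len] if has_tail else seq  # assumption: rules apply without the tail
    return {
        "starts_with_G": seq.startswith("G"),
        "poly_A_tail": has_tail,
        "gc_30_to_50_percent": bool(body) and 0.30 <= gc_fraction(body) <= 0.50,
        "homopolymers_le_7": longest_homopolymer(body) <= 7,
        "length_220_to_2557": 220 <= len(seq) <= 2557,
    }


# Synthetic example sequence, not a real SIRV.
candidate = "G" + "ACGTAT" * 50 + "A" * 30
print(check_sirv(candidate))
```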

Page 18: 20140710 3 l_paul_ercc2.0_workshop

SIRV Design - Overview

• Take natural gene structure and annotated transcript variants
• Shorten transcript length to a maximum of 2500 nt
• Fill gene structure with heterologous sequence
• Duplicate and modify to add alternative splicing variants
• Add transcription start-site and end-site variants
• Add antisense and overlapping variants
• Observe GU-AG intron rule

[Figure: variant types illustrated: cassette exon, alternative 5’ splice site (A5SS), alternative 3’ splice site (A3SS), mutually exclusive exons (MXE), intron retention, alternative first exon, alternative last exon, alternative start-site, alternative end-site, antisense, overlapping]

Page 19: 20140710 3 l_paul_ercc2.0_workshop

ERCC 2.0 Workshop

1. Company introduction
2. ERCC spike-in mixes in Lexogen’s R&D
3. Design and rationale of Spike-In RNA Variants
4. Production and application of Spike-In RNA Variants

Page 20: 20140710 3 l_paul_ercc2.0_workshop

SIRV Production: In vitro Transcription Construct

• Synthetic constructs cloned for singularization and amplification
• Run-off T7 transcription
• Transcript starts with 5’ G, cap optional
• Poly(A) tail added

Construct (5’ to 3’): T7 promoter, restriction site, G, sequence, A(30), restriction site
Transcript length: 220-2557 nt

Page 21: 20140710 3 l_paul_ercc2.0_workshop

SIRV Production, QC and Quantification

Production
• Plasmid linearization
• T7 run-off transcription
• Purification (essential!)
• Storage in Na-citrate buffer

Quality Control
• Photometric (Nanodrop): purity, quantification
• Microfluidics (Bioanalyzer): integrity, quantification
• Planned: qPCR: accurate quantification

Page 22: 20140710 3 l_paul_ercc2.0_workshop

SIRVs: Mixes & RNA-Seq Samples

Initially, 2 mixes were prepared from 60 purified transcript variants:
1. Equimolar: 1:1:1…
2. Low dynamic range: 1:10:100

3 samples were prepared from these:
1. Equimolar mix, SIRVs only
   Illumina TruSeq library prep without poly(A) selection
2. Equimolar mix, 30% SIRVs, 3% ERCCs, 67% UHR (Universal Human Reference RNA)
   Illumina TruSeq library prep without poly(A) selection
3. Low dynamic range mix, 30% SIRVs, 3% ERCCs, 67% UHR (Universal Human Reference RNA)
   Illumina TruSeq library prep without poly(A) selection
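Preparing molar ratios from transcripts of different lengths requires converting between mass and amount of substance. The sketch below uses the common approximation for single-stranded RNA with a 5’ triphosphate, MW ≈ length × 320.5 + 159.0 g/mol; this approximation, the function names, the transcript names and the amounts are assumptions for illustration and are not taken from the slides.

```python
AVG_NT_MASS = 320.5        # g/mol per nucleotide, approximate average for ssRNA
TRIPHOSPHATE_MASS = 159.0  # g/mol, 5' triphosphate of an in vitro transcript


def rna_molecular_weight(length_nt: int) -> float:
    """Approximate molecular weight of a single-stranded RNA in g/mol."""
    return length_nt * AVG_NT_MASS + TRIPHOSPHATE_MASS


def mass_ng(length_nt: int, amount_fmol: float) -> float:
    """Mass in ng for a given amount in fmol (1 fmol x 1 g/mol = 1e-6 ng)."""
    return amount_fmol * rna_molecular_weight(length_nt) * 1e-6


def mix_masses(lengths: dict[str, int], relative_amounts: dict[str, float],
               base_fmol: float = 1.0) -> dict[str, float]:
    """Mass of each transcript (ng) needed to reach the requested molar ratios."""
    return {name: mass_ng(lengths[name], base_fmol * relative_amounts[name])
            for name in lengths}


# Hypothetical transcripts in a 1:10:100 molar ratio.
lengths = {"short": 220, "medium": 1000, "long": 2557}
ratios = {"short": 1, "medium": 10, "long": 100}
for name, ng in mix_masses(lengths, ratios).items():
    print(f"{name}: {ng:.3f} ng")
```

For an equimolar mix, every relative amount is set to 1, so the required mass simply scales with transcript length.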

Page 23: 20140710 3 l_paul_ercc2.0_workshop

SIRVs: RNA-Seq Experiment

• Illumina MiSeq run: 1x150 nt, 27M reads obtained
• Mapping with TopHat (v2.0.8) against combined transcriptomic and genomic reference (Ensembl GRCh37.75), Ambion’s ERCC92, and SIRVs

Sample                             Total reads   Mapping reads   (%)      Uniquely mapping reads   (%)
#1, equimolar SIRVs                10,246,442    8,585,641       83.79%   8,505,344                83.01%
#2, equimolar SIRVs, ERCCs, UHR    10,119,416    8,642,852       85.41%   8,399,336                83.00%
#3, 1:10:100 SIRVs, ERCCs, UHR     6,308,855     5,404,486       85.67%   5,268,757                83.51%

            GRCh37.75            ERCC92           SIRVs
Sample #1   4,330      0.05%     11      0.00%    8,505,555   99.95%
Sample #2   7,521,308  89.55%    38,031  0.45%    839,997     10.00%
Sample #3   4,156,399  78.89%    22,207  0.42%    1,090,151   20.69%
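The per-reference percentages follow directly from the read counts. As a small worked check, the sketch below recomputes the shares for sample #2 from the counts given above; the function name is hypothetical.

```python
def reference_shares(counts: dict[str, int]) -> dict[str, float]:
    """Percentage of reads assigned to each reference."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {reference: 100.0 * n / total for reference, n in counts.items()}


# Read counts for sample #2 as listed in the table.
sample_2 = {"GRCh37.75": 7_521_308, "ERCC92": 38_031, "SIRVs": 839_997}
for reference, share in reference_shares(sample_2).items():
    print(f"{reference}: {share:.2f}%")
```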

Page 24: 20140710 3 l_paul_ercc2.0_workshop

SIRV RNA-Seq: Input / Output correlation

[Figure: scatter plots of FPKM against input molecules for samples #1, #2 and #3, and of sample #1 FPKM against sample #2 FPKM (#1 vs #2)]

Page 25: 20140710 3 l_paul_ercc2.0_workshop

SIRVs RNA-Seq: Transcript Hypotheses

Transcript hypotheses by Cufflinks
• Not complete: e.g., alternative 3’ splice sites (A3SS) and exons are not recognized despite multiple exon-exon reads

[Figure: Cufflinks transcript hypotheses]
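Whether an assembler has recovered a known spike-in variant can be checked by comparing intron chains, i.e. the ordered list of splice junctions. The sketch below is one possible scoring approach under that assumption and is not the evaluation used in the presentation; transcript names and coordinates are hypothetical, and mono-exonic transcripts are left unscored because they have no junctions.

```python
from typing import Optional

Exons = list[tuple[int, int]]


def intron_chain(exons: Exons) -> tuple[tuple[int, int], ...]:
    """Ordered (last exon base, next exon first base) pairs of a transcript."""
    ordered = sorted(exons)
    return tuple((ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1))


def reconstruction_report(reference: dict[str, Exons],
                          assembled: dict[str, Exons]) -> dict[str, Optional[bool]]:
    """For each reference variant: True if an assembled transcript shares its
    intron chain, False if not, None for mono-exonic variants."""
    assembled_chains = {intron_chain(exons) for exons in assembled.values()}
    report: dict[str, Optional[bool]] = {}
    for name, exons in reference.items():
        chain = intron_chain(exons)
        report[name] = (chain in assembled_chains) if chain else None
    return report


# Hypothetical reference variants; variant_b uses an alternative 3' splice site.
reference = {
    "variant_a": [(1, 100), (201, 300), (401, 500)],
    "variant_b": [(1, 100), (251, 300), (401, 500)],
    "variant_c": [(1, 500)],
}
assembled = {"hypothesis_1": [(1, 100), (201, 300), (401, 520)]}
print(reconstruction_report(reference, assembled))
```

Comparing intron chains ignores differences in transcript ends, so start-site and end-site variants would need a separate check.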

Page 26: 20140710 3 l_paul_ercc2.0_workshop

Spike-In RNA Variants: Short Summary

Design & production
• 74 transcript variants in 7 families (6-18 variants / family)
• Mimic eukaryotic genes in length and GC content; A(30) tail
• Include variation in alternative splicing, transcription start-sites and end-sites, sense/antisense and overlapping genes
• No homology to NCBI nt collection entries or ERCC sequences
• Produced from stock plasmids as T7 run-off transcripts

Mixtures
• 60 SIRVs were mixed in equimolar or low dynamic range (10²) concentrations

Application in RNA-Seq
• Mixtures showed high mappability and no cross-mapping with UHR or ERCCs
• Low input / output correlation as determined by TopHat / Cufflinks-derived FPKM
• Cufflinks cannot reconstruct all SIRV transcript variants, even in the equimolar mix, which will lead to wrong FPKM values

Page 27: 20140710 3 l_paul_ercc2.0_workshop

Spike-In RNA Variants: Outlook

Optimizing production & quantification
• Large-scale production and purification of transcripts
• qPCR-based quantification in addition to Nanodrop & Bioanalyzer results

Application
• Evaluation of software for its performance in transcript hypothesis building and transcript isoform quantification

Open questions
• Concentration range?
• Sufficient variant complexity? Length? Capping? SNPs?
• How many different mixes?
• Pipeline validation (Consortium?)
• Sample comparison (DE)
• Technical variation
• Master mix vs. modules: ERCCs, SIRVs, ncRNA standards & miRNA standards (complexity, price, validation?)