RNA sequencing

ALLBIO – Scilife – UPPNEX – BILS course 12 -16 May 2014

Maja Molin, PhD
Dept. of Medical Biochemistry and Microbiology, Uppsala University

Overview

Lecture
• Historical perspective – “past” and present techniques
• An RNA-seq experiment consists of many steps
  • Design experiment
  • Purify RNA
  • Prepare libraries
  • Sequence
  • Analysis

Exercise
• RNA-seq analysis using the de novo assembler Trinity

“Past”
• Sequencing -> Sanger sequencing of cDNA libraries
  • Limitations in the number of sequences
  • Redundancy due to highly expressed genes
  • Read length about 800 bp -> poor coverage of full-length transcripts
  • Prone to indel errors
• Global quantifications -> Expression microarrays
  • Sequences have to be known
  • Incomplete annotations
  • No discovery of novel transcripts
  • Hybridization-based method, problems with SNPs and indels
  • Noise
  • Signal intensity is used to calculate the expression level of the gene

Historical perspective – “past” and present techniques

Present
• Sequencing -> Next-Gen Sequencing technologies
  • Several different platforms: Illumina, SOLiD, Ion Torrent, 454, PacBio
  • Short reads
  • Full-length transcripts
  • High dynamic range
  • Strand-specific sequencing
  • Sequencing errors are mostly substitutions
• Applications
  • Global differential expression analysis
  • Characterization of alternative splicing, polyadenylation, transcription
  • Discovery of novel transcripts
  • SNP finding
  • RNA editing
  • Allelic gene expression


An RNA-seq experiment consists of many steps

1. Design experiment
2. Purify RNA
3. Prepare libraries
4. Sequence
5. Analysis

1. Design experiment
• Is the primary aim qualitative or quantitative?

Qualitative/Annotation: identify expressed transcripts, exon/intron boundaries, TSS, poly-A sites.
Sequence reads must cover the transcripts evenly, including both ends. Coverage depends on library prep and sequencing depth.

Quantitative/DGE: measure differences in gene expression, alternative splicing, TSS and poly-A sites between ≥2 groups.
Must accurately measure the counts of transcripts and the variances associated with the counts. Replicates are essential!

http://rnaseq.uoregon.edu/

• Other objectives? SNP finding, allelic gene expression, RNA editing?
• Which sequencing technology: Illumina, SOLiD, Ion Torrent, 454, PacBio?


2. Purify RNA
• A cell contains many types of RNA, e.g.
  • rRNA (>80%)
  • tRNA
  • mRNA (1-5% of total RNA)
  • miRNA
  • ncRNA
  • snoRNA
• Always use high-quality and high-purity RNA for sequencing
  • OD 260/280 ratio > 1.8, 260/230 ratio close to 2.0
  • RIN > 8.0
  • Measure concentration using Qubit
  • If RNA extraction is based on phenol (e.g. TRIzol) or other organic methods -> RNA clean-up is recommended, using e.g. columns to remove traces of phenol
  • DNase I treatment of RNA is recommended
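The quality thresholds above can be turned into a simple pre-submission check. This is a minimal sketch: the `RnaSample` class, the sample values, and the tolerance used to interpret "close to 2.0" are assumptions made for illustration, not part of the lecture.

```python
from dataclasses import dataclass


@dataclass
class RnaSample:
    """Hypothetical container for the measurements named on the slide."""
    name: str
    od_260_280: float
    od_260_230: float
    rin: float


def qc_problems(sample, tol_260_230=0.2):
    """Return the reasons a sample misses the thresholds (empty list = pass)."""
    problems = []
    if sample.od_260_280 <= 1.8:
        problems.append("260/280 ratio is not above 1.8")
    # "Close to 2.0" is interpreted as 2.0 +/- tol_260_230 (an assumption).
    if abs(sample.od_260_230 - 2.0) > tol_260_230:
        problems.append("260/230 ratio is not close to 2.0")
    if sample.rin <= 8.0:
        problems.append("RIN is not above 8.0")
    return problems


# Made-up measurements for two samples.
good = RnaSample("ctrl_1", od_260_280=2.0, od_260_230=2.1, rin=9.2)
bad = RnaSample("treated_2", od_260_280=1.6, od_260_230=1.2, rin=6.5)
print(good.name, qc_problems(good))
print(bad.name, qc_problems(bad))
```

A sample that fails any check would typically go back for clean-up or re-extraction before library preparation.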


3. Prepare libraries
• Library preparation by the platform or by you?
• Library prep needs to match the sequencing technology.
• PolyA selection or rRNA depletion for mRNA sequencing?
  • PolyA selection isolates mRNA very efficiently but cannot be used for non-polyadenylated RNA.
  • rRNA depletion preserves non-polyA RNAs, but is less effective at removing all rRNA.
• Single-end or paired-end? (PE allows more accurate mapping and is useful for isoform detection)
• Strand-specific or non-stranded library?
• Barcoding and pooling

Strand-specific (or non-stranded) library

Levin JZ, et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat Methods. 2010 Sep;7(9):709-15.

Non-stranded library
• Does not contain any information about which strand was originally transcribed

Strand-specific library
• Preserves the information about which strand was transcribed
• Antisense transcripts can be identified
• Identifies the exact boundaries of adjacent genes transcribed from opposite strands
• Gives the correct expression pattern of coding or non-coding overlapping transcripts
• Often the default method today


Barcoding and pooling

[Figure: Total RNA -> Fragmented mRNA/cDNA -> Finished library (Adapter, cDNA insert, Adapter, with the index in the adapter)]

Barcoding and pooling:
• Short index sequences of 6-8 nt are introduced as part of the adapters
• The index provides a unique identifier for each sample
• The index allows pooling of samples to avoid lane effects and to use the sequencing capacity more efficiently
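To make the role of the index concrete, the sketch below sorts pooled reads back into samples by their index sequence. The index sequences, sample names, and reads are hypothetical, and only exact matches are accepted; real demultiplexing software usually tolerates a small number of mismatches.

```python
from collections import defaultdict

# Hypothetical 6-nt indexes; real values come from the library prep kit.
INDEX_TO_SAMPLE = {
    "ATCACG": "control_1",
    "CGATGT": "control_2",
    "TTAGGC": "treated_1",
}


def demultiplex(reads, index_to_sample):
    """Group (index, sequence) pairs by sample name.

    Reads whose index is not in the table are collected under 'undetermined'.
    """
    by_sample = defaultdict(list)
    for index, sequence in reads:
        sample = index_to_sample.get(index, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


pooled_reads = [
    ("ATCACG", "CGGGTTAGAATC"),
    ("TTAGGC", "ACGAGTAATCTT"),
    ("ATCACG", "GGATCTTTCACA"),
    ("NNNNNN", "GTCTTCACCAAC"),  # failed index read: cannot be assigned
]
bins = demultiplex(pooled_reads, INDEX_TO_SAMPLE)
for sample, sequences in bins.items():
    print(sample, len(sequences))
```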


4. Sequence
• Pooling strategy
• Sequence depth

Pooling strategy (example):
• Control: 3 biological replicates
• Treated: 3 biological replicates
• Pool and sequence in one lane on Illumina HiSeq
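Pooling divides the output of a lane among the samples, so the expected depth per sample is easy to estimate in advance. The lane yield used here is an assumed round number for illustration only; the real figure depends on the instrument and run mode.

```python
def reads_per_sample(lane_yield_reads, n_samples):
    """Expected reads per sample if one lane is shared equally."""
    if n_samples < 1:
        raise ValueError("need at least one sample")
    return lane_yield_reads / n_samples


ASSUMED_LANE_YIELD = 180e6  # assumption for illustration, not from the lecture
n_samples = 3 + 3           # 3 control + 3 treated biological replicates
per_sample = reads_per_sample(ASSUMED_LANE_YIELD, n_samples)
print(f"{per_sample / 1e6:.0f}M reads per sample")
```

In practice the split is never perfectly even, so it is worth leaving some margin above the depth you need.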

4. Sequence: sequence depth

• 30M reads are sufficient to detect nearly all annotated chicken genes (15,742).
• 30M reads generate representative assemblies, with a good balance between coverage and noise.
• With >60M reads, sequencing errors accumulate in highly expressed genes and few new genes are discovered.
• Increasing the number of replicates is more important than increasing sequencing depth for DE analysis.

Wang et al. BMC Bioinformatics 2011, 12(Suppl 10):S5
Francis et al. BMC Genomics 2013, 14:167
Rapaport et al. Genome Biology 2013, 14:R95


5. Analysis

• Quality check of sequence reads

• Preprocessing of sequencing reads

• De novo transcriptome assembly or aligning RNA-seq reads to a reference?

• Annotation of transcripts/differential gene expression, downstream analysis

Quality check of sequence reads
• Illumina sequencing runs store data in large text files called FASTQ (extension .fq or .fastq).
• FASTQ files contain both the sequence and the quality of each base call for every read in the run.
• Information about each read is listed on four consecutive lines:

1. Sequence ID beginning with @
2. Base calls (sequence)
3. A plus sign
4. Sequence quality codes

Example record:

1. @61G9EAAXX100520:5:100:10000:12335/1
2. CGGGTTAGAATCAACAAGTGTAGGAGGAACTTGGTAACGATGATTTAAATTATCTGCACTACGGTCGT
3. +
4. GGGFEGGGGFGGGGGGGGEGDGGEFGGEEFGGFFCFCGGEFFDEEEEAEGDEEBDEDCDEAEBCACED
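The four-line layout is simple enough to read with a few lines of Python. The sketch below parses records from a string and converts the quality codes to integer scores. The `offset=33` default (Sanger/Phred+33 encoding) is an assumption; older Illumina files may use an offset of 64, so check which encoding your data uses.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality) tuples from 4-line FASTQ records."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have exactly four lines each")
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert quality characters to integer scores (offset is an assumption)."""
    return [ord(char) - offset for char in quality]


record = """@61G9EAAXX100520:5:100:10000:12335/1
CGGGTTAGAATCAACAAGTGTAGGAGGAACTTGGTAACGATGATTTAAATTATCTGCACTACGGTCGT
+
GGGFEGGGGFGGGGGGGGEGDGGEFGGEEFGGFFCFCGGEFFDEEEEAEGDEEBDEDCDEAEBCACED
"""
read_id, seq, qual = next(parse_fastq(record))
print(read_id, len(seq), phred_scores(qual)[:5])
```

For real files, which can be many gigabytes, you would read line by line from disk (or use an established parser) rather than load the whole text into memory.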

Paired-end sequences: each cDNA insert is read from both ends.

One FASTQ file with all the left (/1) reads:

@61G9EAAXX100520:5:100:10000:12335/1
CGGGTTAGAATCAACAAGTGTAGGAGGAACTTGGTAACGATGATTTAAATTATCTGCACTACGGTCGT
+
GGGFEGGGGFGGGGGGGGEGDGGEFGGEEFGGFFCFCGGEFFDEEEEAEGDEEBDEDCDEAEBCACED
@61G9EAAXX100520:5:100:10000:14468/1
ACGAGTAATCTTGGTGGGGATACCAAGAGCTTGGAAGAAAGAGGTCTTACCGGGTTCCATACCAGTGT
+
GGGGGGGGGDGGGGBGGGGGGGGFDFGGGGGGGFEFGEFFGDEFDDEGGEEEEECDDFDEDDACDCDE

One FASTQ file with all the right (/2) reads:

@61G9EAAXX100520:5:100:10000:12335/2
GGATCTTTCACATTTGAAATGTCTCTTCCTCACCGTAATCCCTCATTGTCTTCCCTTCCAACTACTGG
+
GGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGFGGGGGGFFFGEFFGGGGGGGGDEEGEFGFG
@61G9EAAXX100520:5:100:10000:14468/2
GTCTTCACCAACGCTGATTTGAAGGAAGTCCGTGAGACCATTATTGCTAATGTTATTGCTGCTCCTGC
+
GGFGGGGGDGGGGGGGGGGGFEGGFGGGEGGGFGGGGGFGGGGGGGGGGGGGDGBGFFFFFEEFEFFB
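Most downstream tools expect the two files to list the same fragments in the same order, so a quick consistency check on the read IDs can save trouble later. A minimal sketch, assuming the `/1` and `/2` suffix convention shown in the read names here (newer Illumina headers mark the mate differently):

```python
def base_name(read_id):
    """Strip a trailing /1 or /2 mate suffix from a read ID."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id


def pairs_in_sync(left_ids, right_ids):
    """True if both files list the same fragments in the same order."""
    if len(left_ids) != len(right_ids):
        return False
    return all(
        left.endswith("/1")
        and right.endswith("/2")
        and base_name(left) == base_name(right)
        for left, right in zip(left_ids, right_ids)
    )


left = [
    "61G9EAAXX100520:5:100:10000:12335/1",
    "61G9EAAXX100520:5:100:10000:14468/1",
]
right = [
    "61G9EAAXX100520:5:100:10000:12335/2",
    "61G9EAAXX100520:5:100:10000:14468/2",
]
print(pairs_in_sync(left, right))        # same order in both files
print(pairs_in_sync(left, right[::-1]))  # right file shuffled
```

Filtering or trimming the two files independently is a common way for them to fall out of sync, which is why paired-aware tools process both files together.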

Quality check of sequence reads using the FastQC tool (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

Per base sequence quality: quality scores across all reads in a file, summarized by position. A good run has quality scores >28 at every position. If the quality drops below that at some point, consider trimming.

Per sequence quality scores: shows whether a subset of your sequences has overall low quality scores. If the most frequently observed mean quality is <27, a warning is raised. You can filter your reads by average quality to keep only the best reads.
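Both summaries are straightforward to compute from the quality strings. The sketch below is not FastQC's implementation; it only illustrates the two ideas (mean quality by position, and filtering reads by their average quality). The Phred+33 offset and the cutoff of 27, borrowed from the warning threshold, are assumptions.

```python
def per_position_means(quality_strings, offset=33):
    """Mean quality score at each read position (reads may differ in length)."""
    totals, counts = [], []
    for quality in quality_strings:
        for pos, char in enumerate(quality):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


def filter_by_mean_quality(quality_strings, min_mean=27, offset=33):
    """Keep reads whose average quality score is at least min_mean."""
    kept = []
    for quality in quality_strings:
        scores = [ord(char) - offset for char in quality]
        if scores and sum(scores) / len(scores) >= min_mean:
            kept.append(quality)
    return kept


# Toy data: 'I' encodes score 40 and '#' encodes score 2 at offset 33.
quals = ["IIII", "II##", "I#"]
print(per_position_means(quals))
print(filter_by_mean_quality(quals))
```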



Preprocessing of sequencing reads using Trimmomatic (http://www.usadellab.org/cms/index.php?page=trimmomatic)

Consider running FastQC again to check your trimming
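To give a feel for what quality trimming does, here is a simplified sliding-window trimmer. It is a sketch of the general idea, not Trimmomatic's implementation: the window size, quality threshold, and Phred+33 offset are illustrative assumptions, and the read is simply cut at the start of the first window whose mean quality is too low.

```python
def sliding_window_trim(sequence, quality, window=4, min_mean=15, offset=33):
    """Cut the read at the first window whose mean quality is below min_mean.

    Reads shorter than the window are returned unchanged.
    """
    scores = [ord(char) - offset for char in quality]
    cut = len(scores)
    for start in range(0, len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean:
            cut = start
            break
    return sequence[:cut], quality[:cut]


# Toy read: six high-quality bases ('I' = 40) followed by four poor ones ('#' = 2).
trimmed = sliding_window_trim("ACGTACGTAC", "IIIIII####")
print(trimmed)
```

Because the cut is made at the start of the failing window, a good base at the window's leading edge can be lost; production tools handle such details more carefully and also remove adapter sequence.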


• Novel organism – little or no previous sequencing

• Non-model organism – some sequences available (ESTs, Unigene set)

• Genome-sequenced organism – draft genome with maybe tens of chromosomes, some annotations etc.

• Model organism – genome fully sequenced and annotated, with multiple genomes available, well-annotated transcriptomes, genetic maps, available mutants etc.

De novo transcriptome assembly or aligning RNA-seq reads to a reference?

Haas BJ and Zody MC. Nat Biotechnol. 2010 May;28(5):421-3.

• Aligning reads to a reference: TopHat, Cufflinks
• De novo assembly: Trinity, Trans-ABySS, Velvet-Oases, SOAPdenovo-Trans

De novo assembly using Trinity

Trinity combines three independent software modules:
• Inchworm
• Chrysalis
• Butterfly

Inchworm

• kmer = short oligonucleotide of length k

• All sequence reads are cut into overlapping kmers (25-mers). Each kmer overlaps with its neighbor in all but one base.
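Cutting reads into kmers and counting them is the first step of the assembly. A minimal sketch, using k=4 on a short made-up read so the output stays readable (the lecture's Trinity example uses 25-mers):

```python
from collections import Counter


def kmers(read, k):
    """Return all overlapping kmers of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Build the kmer catalog: how often each kmer occurs across all reads."""
    catalog = Counter()
    for read in reads:
        catalog.update(kmers(read, k))
    return catalog


example = kmers("AGATTACAGA", 4)
print(example)

# Neighboring kmers share k-1 bases: the suffix of one is the prefix of the next.
print(all(a[1:] == b[:-1] for a, b in zip(example, example[1:])))

print(count_kmers(["AGATT", "GATTA"], 4))
```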

Martin and Wang, Nat. Rev. Genet. Oct 2011, vol 12:671-682

1. Identify the seed kmer as the most abundant kmer.

2. Extend the kmer at the 3´ end and at the 5´ end based on coverage.

3. For each extension, 4 possible kmers exist, each ending with one of the four nucleotides. The most abundant cumulative ending wins!

4. The assembled contig is reported, the assembled kmers are removed from the catalog, and the whole process starts again.
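The four steps can be sketched as a small greedy assembler. This is a simplified illustration, not Trinity's Inchworm code: it picks each extension by the abundance of the single next kmer rather than a cumulative look-ahead, ignores reverse complements and sequencing errors, and uses k=4 with made-up reads.

```python
from collections import Counter


def extend(contig, catalog, used, k, three_prime):
    """Greedily extend one end of the contig, one base at a time."""
    while True:
        if three_prime:
            options = [contig[-(k - 1):] + base for base in "ACGT"]
        else:
            options = [base + contig[:k - 1] for base in "ACGT"]
        # Only kmers that are present in the catalog and not yet used qualify.
        options = [km for km in options if catalog[km] > 0 and km not in used]
        if not options:
            return contig
        best = max(options, key=lambda km: catalog[km])  # most abundant wins
        used.add(best)
        contig = contig + best[-1] if three_prime else best[0] + contig


def greedy_contig(catalog, k):
    """Build one contig from the seed kmer and remove the kmers it used."""
    seed, _ = catalog.most_common(1)[0]          # step 1: most abundant kmer
    used = {seed}
    contig = extend(seed, catalog, used, k, three_prime=True)     # steps 2-3
    contig = extend(contig, catalog, used, k, three_prime=False)
    for km in used:                              # step 4: remove used kmers
        del catalog[km]
    return contig


def assemble(reads, k):
    """Repeat the greedy procedure until the kmer catalog is empty."""
    if k < 2:
        raise ValueError("k must be at least 2")
    catalog = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    contigs = []
    while catalog:
        contigs.append(greedy_contig(catalog, k))
    return contigs


reads = ["AGATTACAGC", "GATTACA", "GATTACA", "CCCGGGTTT"]
print(assemble(reads, 4))
```

The first contig grows out from the abundant GATT seed in both directions; the second is built from the kmers that remain afterwards.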

Inchworm algorithm

[Figure: Inchworm extension starting from the seed kmer GATTACA (labeled 9)]

• Report the contig …….AGATTACAGA…...

• Remove the assembled kmers from the catalog of all kmers and then repeat this step

• Trinity's default minimum kmer coverage is 1 (all kmers are used), but with large datasets this parameter can be raised to 2


Trinity combines three independent software modules:

• Inchworm – linear contigs

• Chrysalis – re-clusters (regroups) related contigs from Inchworm

• Butterfly – reconstructs transcripts and alternatively spliced isoforms

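Trinity output – a FASTA file with all the transcripts

In each transcript identifier:
• c2 is the read cluster from Inchworm
• g0 is the “gene”
• i1 is the isoform

The part of the identifier before the isoform is the gene identifier.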
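Splitting the identifier into its parts is useful when summarizing expression per gene rather than per isoform. The sketch below assumes the three components are joined by underscores (for example `c2_g0_i1`); that layout is an assumption that varies between Trinity versions, so check the headers in your own FASTA file.

```python
import re

# Assumed layout: cluster, gene and isoform joined by underscores, e.g. "c2_g0_i1".
TRINITY_ID = re.compile(r"^(c\d+)_(g\d+)_(i\d+)$")


def split_trinity_id(identifier):
    """Split an identifier into cluster, gene, isoform and the gene identifier."""
    match = TRINITY_ID.match(identifier)
    if match is None:
        raise ValueError(f"unexpected identifier: {identifier!r}")
    cluster, gene, isoform = match.groups()
    return {
        "cluster": cluster,
        "gene": gene,
        "isoform": isoform,
        "gene_id": f"{cluster}_{gene}",
    }


parts = split_trinity_id("c2_g0_i1")
print(parts)
```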


1. Design experiment
• Is the primary aim qualitative/annotation or quantitative/differential gene expression?
• Qualitative/annotation


• Quantitative/differential gene expression
  • The level of gene expression corresponds to read counts
  • Align reads to the transcriptome assembly or reference genome
  • Calculate expression values/abundance estimates based on the mapped reads
  • Output is normalized expression values
  • Normalization is based on both the length of the transcript and the total depth of the sequencing
  • RPKM (Reads Per Kilobase per Million mapped reads)
  • FPKM (Fragments Per Kilobase per Million mapped fragments)
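The normalization can be written out directly: divide the read count by the transcript length in kilobases and by the total number of mapped reads in millions. A minimal sketch with made-up numbers; FPKM uses the same formula but counts each sequenced fragment (read pair) once.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


total = 30_000_000  # hypothetical number of mapped reads in the sample

# A long transcript collects more reads than a short one at the same expression level.
short_tx = rpkm(read_count=600, transcript_length_bp=1_000, total_mapped_reads=total)
long_tx = rpkm(read_count=1_800, transcript_length_bp=3_000, total_mapped_reads=total)
print(short_tx, long_tx)
```

The two transcripts differ threefold in raw read count but get the same normalized value, which is the point of correcting for length.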

[Figure: Normalized read count/expression values. Bar charts compare read count with expression value (RPKM or FPKM) for (1) a low-expression and (2) a high-expression transcript, and for (3) a short and (4) a long transcript.]

Summary

• An RNA-seq experiment consists of many steps
  • Design experiment
  • Purify RNA
  • Prepare libraries
  • Sequence
  • Analysis
• Several different options to choose between at every step
• De novo assembler Trinity


Thank you!

Questions?

RNA sequencing – a basic introduction
