RNA sequencing – a basic introduction

Preview:

DESCRIPTION

RNA sequencing – a basic introduction. ALLBIO – Scilife – UPPNEX – BILS course 12 -16 May 2014. Maja Molin , PhD Dept. of Medical Biochemistry and Microbiology, Uppsala University. Overview. Lecture Historical perspective – “past” and present techniques - PowerPoint PPT Presentation

Citation preview

RNA sequencing – a basic introduction

ALLBIO – Scilife – UPPNEX – BILS course 12 -16 May 2014

Maja Molin, PhDDept. of Medical Biochemistry and Microbiology, Uppsala University

Overview

Lecture• Historical perspective – “past” and present techniques• An RNAseq experiment consist of many steps• Design experiment• Purify RNA• Prepare libraries• Sequence • Analysis

ExerciseRNA seq analysis using the de novo assembler Trinity

“Past”• Sequencing -> Sanger sequencing of cDNA libraries

• Limitations in the number of sequences• Redundancy due to highly expressed genes• Read length about 800bp -> poor in full-length• Prone to indel errors

• Global quantifications -> Expression microarrays• Sequences have to be known• Incomplete annotations• No discovery of novel transcripts• Hybridization-based method, problems with SNPs, Indels• Noise• Signal intensity is used to calculate the expression level of the gene

Historical perspective – “past” and present techniques

Present• Sequencing -> Next-Gen Sequencing technologies

• Several different platforms, Illumina, SOLiD, Ion Torrent, 454, PacBio• Short reads • Full-length transcripts• High dynamic range• Strand-specific sequencing• Sequencing errors are mostly substitutions

• Applications• Global differential expression analysis• Characterization of alternative splicing, polyadenylation, transcription• Discovery of novel transcripts• SNP finding• RNA editing• Allelic gene expression

Historical perspective – “past” and present techniques

An RNA seq experiment consist of many steps

3. Prepare libraries

1. Design experiment

2. Purify RNA

4. Sequence 5. Analysis

1. Design experiment• Is the primary aim qualitative or quantitative?

Sequence reads must cover the transcripts evenly, including both ends. Coverage depends on library prep and seq. depth

Qualitative/Annotation: identify expressed transcripts, exon/intron boundaries, TSS, poly-A sites.

Quantitative/DGE:meassure differences in gene

expression, alternative splicing, TSS and poly-A sites between ≥2 groups

Must accurately measure the counts of transcripts and the variances assoc. with the counts. Replicates are essential!

http://rnaseq.uoregon.edu/

• Other objectives? SNP finding, allelic gene expression, RNA editing?• Which sequencing technology, Illumina, SOLiD, Ion Torrent, 454, PacBio?

An RNA seq experiment consist of many steps

3. Prepare libraries

1. Design experiment

2. Purify RNA

4. Sequence 5. Analysis

2. Purify RNA• A cell contains many types of RNA, e.g

• rRNA (>80%)• tRNA• mRNA (1-5% of totalRNA)• miRNA• ncRNA• snoRNA

• Always use high quality and high purity RNA for sequencing• OD 260/280 ratio > 1.8, 260/230 ratio close to 2.0 • RIN > 8.0• Measure concentration using Qubit• If RNA extraction is based on phenol (e.g. TRIzol) or

organic methods -> RNA clean-up is recommended using e.g. columns to remove traces of phenol

• DNaseI treatment of RNA is recommended

An RNA seq experiment consist of many steps

3. Prepare libraries

1. Design experiment

2. Purify RNA

4. Sequence 5. Analysis

3. Prepare libraries• Library preparation by the platform or by you?• Library prep. needs to match the sequencing technology.• PolyA selection or rRNA depletion for mRNA sequencing?

• PolyA selection isolates mRNA very efficiently but cannot be used for non-poly RNA.

• rRNA depletion preserves non-polyA RNAs, but less effective of removing all rRNA.

• Single-end or paired end (PE allows more accurate mapping and is useful for isoform detection)

• Strand-specific library (or non-stranded?)• Barcoding and Pooling

Strand-specific (or non-stranded) library

LevinJZ, et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods.Nat Methods. 2010 Sep;7(9):709-15.

Non-stranded library• Does not contain any information about which strand was originally transcribed

Strand-specific library• Preserve the information about which strand was transcribed• Anti-sense transcripts can be identified• Identify the exact boundaries of adjacent genes transcribed from opposite strands• Correct expression pattern of coding or non-coding overlapping transcripts• Often the default method today

Strand-specific (or non-stranded) library

LevinJZ, et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods.Nat Methods. 2010 Sep;7(9):709-15.

Barcoding and pooling

cDNA insertAdapter Adapter

mRNA

Total RNA

Fragmented mRNA/cDNA

Finished library

IndexBarcoding and pooling:• Short 6-8 nt´s (index) are introduced as part of the adapters• Index provide unique identifier for each sample• The index allows pooling of samples to avoid lane effects and to use the sequencing capacity more efficiently

An RNA seq experiment consist of many steps

3. Prepare libraries

1. Design experiment

2. Purify RNA

4. Sequence 5. Analysis

4. Sequence• Pooling strategy• Sequence depth

1 2 3

1 2 3

Control: 3 biological replicates

Treated: 3 biological replicates

Pool and sequence in one lane on Illumina HiseqPool and sequence in one lane on Illumina Hiseq

Pool and sequence in one lane on Illumina Hiseq

Pool and sequence in one lane on Illumina Hiseq

An RNA seq experiment consist of many steps

3. Prepare libraries

1. Design experiment

2. Purify RNA

4. Sequence 5. Analysis

4. Sequence• Pooling strategy• Sequence depth

• 30M reads is sufficient to detect nearly all annotated chicken genes (15742).

• 30M reads generate representative assemblies, good balance between coverage and noise.

• >60M reads sequencing errors accumulate in highly expressed genes and few new genes are discovered

• Increasing replicates is more important than increasing sequencing depth for DE analysis. Wang et al. BMC Bioinformatics 2011, 12(Suppl10):S5

Francis et al. BMC Genomics 2013, 14:167Rapaport et al. Genome Biology 2013, 14:R95

An RNA seq experiment consist of many steps

3. Prepare libraries

1. Design experiment

2. Purify RNA

4. Sequence 5. Analysis

5. Analysis

• Quality check of sequence reads

• Preprocessing of sequencing reads

• De novo transcriptome assembly or aligning RNA-seq reads to a reference?

• Annotation of transcripts/differential gene expression, downstream analysis

Quality check of sequence reads• Illumina sequencing runs stores data in large text files called FASTQ (extension .fq or .fastq).

• FastQ files contain both the sequence and the quality of each base call for every read in the run.

• Information about each read is listed on four consecutive lines

1. Sequence ID beginning with @

2. Base calls (sequence)

3. A plus sign

4. Sequence quality codes

@61G9EAAXX100520:5:100:10000:12335/1CGGGTTAGAATCAACAAGTGTAGGAGGAACTTGGTAACGATGATTTAAATTATCTGCACTACGGTCGT+GGGFEGGGGFGGGGGGGGEGDGGEFGGEEFGGFFCFCGGEFFDEEEEAEGDEEBDEDCDEAEBCACED

1.2.3.4.

@61G9EAAXX100520:5:100:10000:12335/1CGGGTTAGAATCAACAAGTGTAGGAGGAACTTGGTAACGATGATTTAAATTATCTGCACTACGGTCGT+GGGFEGGGGFGGGGGGGGEGDGGEFGGEEFGGFFCFCGGEFFDEEEEAEGDEEBDEDCDEAEBCACED@61G9EAAXX100520:5:100:10000:14468/1ACGAGTAATCTTGGTGGGGATACCAAGAGCTTGGAAGAAAGAGGTCTTACCGGGTTCCATACCAGTGT+GGGGGGGGGDGGGGBGGGGGGGGFDFGGGGGGGFEFGEFFGDEFDDEGGEEEEECDDFDEDDACDCDE

@61G9EAAXX100520:5:100:10000:12335/2GGATCTTTCACATTTGAAATGTCTCTTCCTCACCGTAATCCCTCATTGTCTTCCCTTCCAACTACTGG+GGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGFGGGGGGFFFGEFFGGGGGGGGDEEGEFGFG@61G9EAAXX100520:5:100:10000:14468/2GTCTTCACCAACGCTGATTTGAAGGAAGTCCGTGAGACCATTATTGCTAATGTTATTGCTGCTCCTGC+GGFGGGGGDGGGGGGGGGGGFEGGFGGGEGGGFGGGGGFGGGGGGGGGGGGGDGBGFFFFFEEFEFFB

Quality check of sequence reads

Paired-end Sequences cDNA insert

One FastQ file with all the left (/1) reads

One FastQ file with all the right (/2) reads

Quality check of sequence reads using FastQC tool(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

Quality score across all reads in a file summarized by position. A good run will have quality score >28. If lower at some point, consider trimming.

Shows if a subset of your sequences have overall low quality scores. If the most frequently observed mean quality is <27, a warning is raised. You can consider filtering your reads by average quality to keep only the best reads.

Quality check of sequence reads using FastQC tool(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

An RNA seq experiment consist of many steps

3. Prepare libraries

1. Design experiment

2. Purify RNA

4. Sequence 5. Analysis

5. Analysis

• Quality check of sequence reads

• Preprocessing of sequencing read

• De novo transcriptome assembly or aligning RNA-seq reads to a reference?

• Annotation of transcripts/differential gene expression, downstream analysis

Preprocessing of sequencing read using Trimmomatic(http://www.usadellab.org/cms/index.php?page=trimmomatic)

Consider running FastQC again to check your trimming

An RNA seq experiment consist of many steps

3. Prepare libraries

1. Design experiment

2. Purify RNA

4. Sequence 5. Analysis

5. Analysis

• Quality check of sequence reads

• Preprocessing of sequencing read

• De novo transcriptome assembly or aligning RNA-seq reads to a reference?

• Annotation of transcripts/differential gene expression, downstream analysis

• Novel organism – little or no previous sequencing?

• Non-model organism some sequences available (ESTs, Unigene set)

• Genome-Sequenced organism– draft genome with maybe tens of chromosomes, some annotations etc.

• Model organism – genome fully sequenced and annotated with multiple genomes available, well-annotated transcriptomes, genetic maps, available mutants etc.

De novo transcriptome assembly or aligning RNA-seq reads to a reference?

De novo transcriptome assembly or aligning RNA-seq reads to a reference?

Haas BJ and Zody MC. Nat Biotechnol. 2010 May;28(5):421-3.

TopHat

Cufflinks

De novo transcriptome assembly or aligning RNA-seq reads to a reference?

TrinityTrans-ABySSVelvet-OasesSOAPdenovo-trans

De novo assembly using Trinity

Trinity combines three independent software modules:• Inchworm• Chrysalis• Butterfly

Inchworm

• kmer =short oligonucleotide of length k

• All sequence reads are cut into overlapping kmers (25-mers). Each kmer overlap with its neighbor in all but one base.

Martin and Wang, Nat. Rev. Genet. Oct 2011, vol 12:671-682

1. Identifies seed kmer as most abundant kmer.

2. Extend kmer at 3´end and at 5´end based on coverage

3. For each extension, 4 possible kmers exists, each ending with one of the four nt´s. The most abundant cumulative ending wins!

4. The assembled contig is reported and the assembled kmers are removed from the catalog and the whole process starts again.

De novo assembly using Trinity

Inchworm algorithm

GATTACA9

G

A

TC

4

1

0

4

GATTACA9

G

A

TC

4

1

0

4

G AT

C

GA

TC

0 5

10

11

11

A

GATTACA9

G4

5

Inchworm algorithm

A

GATTACA9

G4

5

0 0 0

0

C

T

AG

0

0

6

1

A

GATTACA9

G4

5

0 0 0

0

6A

0

00

0

• Report the contig …….AGATTACAGA…...

• Remove assembeld kmers from the catalog of all kmers and then repeat this step

• Trinity default is set at a minimum kmer of 1 (all kmers are used) but with large datasets this parameter can be changed to min. kmer of 2

Inchworm algorithm

De novo assembly using Trinity

Trinity combines three independent software modules:

• Inchworm – linear contigs

• Chrysalis – recluster/re-groups related contigs from Inchworm

• Butterfly – reconstructs transcripts and alternatively spliced isoforms

Trinity output – a fasta file with all the transcripts

c2 is read cluster from Inchwormg0 is “gene”i1 is the isoform

gene identifier

An RNA seq experiment consist of many steps

3. Prepare libraries

1. Design experiment

2. Purify RNA

4. Sequence 5. Analysis

5. Analysis

• Quality check of sequence reads

• Preprocessing of sequencing read

• De novo transcriptome assembly or aligning RNA-seq reads to a reference?

• Annotation of transcripts/differential gene expression, downstream analysis

An RNA seq experiment consist of many steps

3. Prepare libraries

1. Design experiment

2. Purify RNA

4. Sequence 5. Analysis

1. Design experiment• Is the primary aim qualitative/annotation or

quantitative/Differential gene expression?

• Qualitative/annotation

An RNA seq experiment consist of many steps

3. Prepare libraries

1. Design experiment

2. Purify RNA

4. Sequence 5. Analysis

1. Design experiment• Is the primary aim qualitative/annotation or quantitative/Differential

gene expression?

• Quantitative/differential gene expression• The level of gene expression corresponds to read counts• Align reads to transcriptome assembly or reference genome• Calculate expression values/abundance estimation based on

the mapped reads• Output is normalized expression values• Normalization based on both length of the transcript and total

depth of the sequencing.• RPKM (Reads Per Kilobase per Million reads Mapped)• FPKM (Fragments Per Kilobase per Million reads mapped)

Normalized read count/expression values

1. Low expression 2. High expression

Read count

Expression value (RPKM or FPKM)

1 2 1 2

3. Short transcript 4. Long transcript

3 4 3 4

Summary

• An RNAseq experiment consist of many steps• Design experiment• Purify RNA• Prepare libraries• Sequence • Analysis

• Several different options to choose between at every step• De novo assembler Trinity

ALLBIO – Scilife – UPPNEX – BILS course 12 -16 May 2014

Maja Molin, PhDDept. of Medical Biochemistry and Microbiology, Uppsala University

Thank you!

Questions?

Recommended