RNA-seq: analysis of raw data and preprocessing - part 2

Raw data investigation

Joachim Jacob, 20 and 27 January 2014

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

Experimental setup

We have decided on:

● how many samples per condition

● how deep to sequence

This determines how reliable the statistics will be. Decide using experience and tools like Scotty. A wrong experimental design cannot be fixed afterwards. Best approach: pilot data (3 samples per condition, 10M reads).

But we have other sequencing options to choose!

PE versus SE Illumina

● Single end (SE): from each cDNA fragment only one end is read.

● Paired end (PE): the cDNA fragment is read from both ends.

Purify and fragment

PE versus SE Illumina

Single end (SE):

● Gene level differential expression

Paired end (PE):

● Novel splice junction detection

● De novo assembly of transcriptome

● Helps with correctly positioning reads on the reference genome sequence.

Note: PE is not the same as mate pairs.

Strandedness

● Naive protocols obtain reads from cDNA fragments. BUT the link with the sense or antisense strand is broken.

● Stranded protocols generate reads from one strand, corresponding to the sense or antisense strand (depending on the protocol).

Strandedness

Not stranded

Stranded

Example of a stranded protocol

● dUTP protocol to generate stranded reads.

Importance of strandedness

● Strandedness can bias the read counts compared to non-stranded protocols.

● Whether you should use a stranded protocol depends on the genome: where genes overlap, the benefit of assigning reads to the correct gene can outweigh the added technical variation.

Length of the reads

● Does not matter so much when we want to quantify by aligning to a reference sequence: 50 bp will do.

● The most important point is to be able to accurately position the read on the reference genome sequence, to assign it to the correct gene.

● Length can become important, if you want to assemble the transcriptome.

For DE on the gene level

The 'cheapest' protocol for high-throughput sequencing suffices to achieve DE detection:

● SE

● 50 bp

● Option: strandedness.

Use the money you have left over for increasing the number of replicates.

Illumina TruSeq protocol

Raw Illumina data

The data you get arrives as...

barcode

experiment

Compressed, usually with gzip

Raw Illumina data

@HWI-ST571:202:D1B86ACXX:2:1102:1146:2155 1:N:0:ACAGTG

CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA

@@@FFFDFHHDG?FFHIIJJJJJIJ#1#1:BFFIGJJJJJIJJGIJJJJA

CGGAGCTGAAGGAGAAACTGAAATCCCTGCAATGTGAATTGTACGTTCTT

CCCFFFFFGGHHHIJJJJJJJIJFHIJIIIJJJJGIIIIIEFGHIFCHJI

GTTGGCAGCCCTGGAGCCCTGCCTCGGTGGTTTAGCCAGTACTAGGGGAT

CCCFFFFFHHHHHJJJIJJJJJJGIJJCGHFHIGIHJJJBDHGHHJJJIE

ATTTCCTCTTATTTACGTTGCTTTAAAGCGAGACTTCAACGCCATTTGAC

@@CFFFFFHHFHDFGHIJIIJGIJGGEHGGJB>??FHHGFFFGHIGIECF

CATCGAAGCAAAGCATATAAAGTTANTNNTNNCTGAGTTGTACATATTGC

??;;D?DB6CDB+<EFE>:AFA443#2##1##11)0:0?9**0??DAGI4

GAAGTGCCCCGCTGGCAGCACACAAGGAGCAGCCCGCTGCCGGACCACTC

?@@DDDADFFAA:CEGHBFGAHGD?F@BE9BFF?D@F;'-8AG<B92=;;

One read (minimum 4 lines)

http://wiki.bits.vib.be/index.php/.fastq

sequence

certainty reading this base at this position ('quality')

(this one: 87196924 lines)
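As a minimal sketch of how such a record can be read, the snippet below takes the first read shown above and converts its quality string to Phred scores. Two things are assumptions: the `+` separator line (part of the FASTQ format, but not visible in the excerpt) and the Phred+33 encoding. The helper names are only illustrative.

```python
# First record from the slide: header, sequence, separator, quality string.
# The "+" separator line is assumed; it is not visible in the excerpt.
record = [
    "@HWI-ST571:202:D1B86ACXX:2:1102:1146:2155 1:N:0:ACAGTG",
    "CCAACATCGAGGTCGCAATCTTTTTNANCGATATGAACTCTCCAAAAAAA",
    "+",
    "@@@FFFDFHHDG?FFHIIJJJJJIJ#1#1:BFFIGJJJJJIJJGIJJJJA",
]


def phred_scores(quality, offset=33):
    """Convert a quality string to Phred scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


header, sequence, _, quality = record
scores = phred_scores(quality)

# Each base has exactly one quality character.
assert len(sequence) == len(scores)

# Positions called as N, and the scores they received.
n_positions = [i for i, base in enumerate(sequence) if base == "N"]
print(n_positions, [scores[i] for i in n_positions])

# With 4 lines per read, the line count on the slide gives the read count.
total_lines = 87196924
print(total_lines // 4)
```

The two N calls carry the lowest score in this read, which is why low quality read parts and uncalled bases tend to show up together in the QC plots.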

Exploring the raw data

1) Check whether the FASTQ file is consistent

2) Make graphs of some metrics of the raw data

http://wiki.bits.vib.be/index.php/.fastq

http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Quality_control_and_visualization_of_raw_reads
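A consistency check could look roughly like the sketch below. It assumes the common layout of exactly 4 lines per read (no wrapped sequence lines); `check_fastq` is a hypothetical helper, not one of the tools listed on the wiki.

```python
import io


def check_fastq(handle):
    """Yield (record_number, problem) for every inconsistent 4-line record."""
    record_number = 0
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            break  # clean end of file
        record_number += 1
        if not lines[3]:
            yield record_number, "truncated record"
            break
        header, sequence, separator, quality = (l.rstrip("\n") for l in lines)
        if not header.startswith("@"):
            yield record_number, "header does not start with '@'"
        if not separator.startswith("+"):
            yield record_number, "separator does not start with '+'"
        if len(sequence) != len(quality):
            yield record_number, "sequence and quality lengths differ"


# Toy input: the second read has 5 bases but only 4 quality characters.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGTN\n+\nIII#\n"
)
problems = list(check_fastq(example))
print(problems)
```

For a real, gzip-compressed file the same function can be fed a handle opened with `gzip.open(path, "rt")`.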

FastQC – graphical exploration

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

FastQC – perfect example

Reads have good quality!

Anna Karenina principle: “There is only one way to be good, but there are many ways to be wrong.”

We will start by showing a good sample. Afterwards we will discuss a less good sample.

http://en.wikipedia.org/wiki/Anna_Karenina_principle

Smooth histogram/ density line towards the right,

steady nucleotide distribution.

Bias typical for Illumina

Not strongly fluctuating GC content

Bias typical for Illumina

GC-content nicely bell shaped

No N's! (this should ring a bell)

All reads have length 50 bp.

Reads are nicely duplicated: some amount of duplication is to be expected in RNA-seq data.

K-mers are short sequence stretches. Sometimes they are overrepresented, but in RNA-seq this matters less because duplication is expected.
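To make the metrics behind these plots more concrete, here is a rough sketch of three of them computed on toy reads. It is not FastQC's own implementation (FastQC estimates duplication differently, for example); the function names and the Phred+33 offset are assumptions.

```python
from collections import Counter


def gc_fraction(sequence):
    """Fraction of G and C among the called (non-N) bases of one read."""
    called = [b for b in sequence.upper() if b in "ACGT"]
    if not called:
        return 0.0
    return sum(b in "GC" for b in called) / len(called)


def duplication_fraction(sequences):
    """Fraction of reads that are exact copies of another read in the set."""
    counts = Counter(sequences)
    return 1 - len(counts) / len(sequences)


def mean_quality_per_position(qualities, offset=33):
    """Mean Phred score per read position (reads of equal length assumed)."""
    columns = zip(*qualities)
    return [sum(ord(ch) - offset for ch in col) / len(col) for col in columns]


# Toy data, not taken from the sample shown on the slides.
reads = ["ACGT", "ACGT", "TTTT", "GGGG"]
quals = ["II", "I#"]

print([gc_fraction(r) for r in reads])
print(duplication_fraction(reads))
print(mean_quality_per_position(quals))
```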

FastQC – less good RNA-seq sample

A relatively large portion of the reads have mistakes at the 3' end of the read.

There is an overrepresentation of reads with a low mean quality score

Not a steady level of different nucleotide fractions

Fluctuates

Heavily skewed towards AT-rich reads

Apparently a mixture of two sets of reads with different lengths

Duplication seems a bit on the low side (reported figures are from 60 to 75%)

Very highly skewed read number.

Often the sequence of a TruSeq adaptor, or multiplex identifiers, can be found here. BLAST can reveal more information!

Specific patterns of specific k-mers.

Note: A and T rich

Quality control of raw data

Proceed? Or rerun?

This QC tells you which preprocessing steps you definitely need to apply. The extra time and money needed to correct the biases can sometimes justify a rerun of the experiment.

This QC also shows which preprocessing steps have already been applied by the sequencing provider.

Preprocessing

Removing unwanted parts of the raw data, so that the data helps as much as possible in reaching our goal: identifying differentially expressed genes.

1) Removing technical contamination

● Low quality read parts

● Technical sequences: adaptors

● PhiX internal control sequences

2) Removing biological contamination

● polyA-tails

● rRNA sequences

● mtDNA sequences

After this, we run FastQC again.

Technical contamination

Our goal is to detect differential expression (DE). For this we need to assign reads with high confidence to the correct genomic location.

Removal of low quality read parts: they have a higher chance to contain errors, and cause noise in our read counts.

Removal of adaptor sequences (and other technical sequences, such as multiplex identifiers), as they cannot be mapped to the reference genome.
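The two removal steps above can be sketched as follows. This is a simplified illustration with exact matching only, not how fastq-mcf works internally. The function names, the score threshold of 20, the minimum overlap of 5 and the Phred+33 offset are assumptions, and the adaptor string is just an example; use the list of technical sequences that applies to your own libraries.

```python
def trim_low_quality_3prime(sequence, quality, min_score=20, offset=33):
    """Cut bases from the 3' end until a base with score >= min_score is met."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_score:
        end -= 1
    return sequence[:end], quality[:end]


def clip_adapter(sequence, quality, adapter, min_overlap=5):
    """Remove an adaptor, or the start of one, from the 3' side of a read."""
    # Case 1: the full adaptor occurs somewhere in the read.
    position = sequence.find(adapter)
    if position == -1:
        # Case 2: the read ends with the first bases of the adaptor.
        longest = min(len(adapter), len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                position = len(sequence) - overlap
                break
    if position == -1:
        return sequence, quality  # no adaptor found, read unchanged
    return sequence[:position], quality[:position]


# Example adaptor string, used for illustration only.
ADAPTER = "AGATCGGAAGAGC"

print(trim_low_quality_3prime("ACGTAC", "IIII##"))
print(clip_adapter("ACGTACGT" + ADAPTER, "I" * 21, ADAPTER))
print(clip_adapter("ACGTACGTAGATCG", "I" * 14, ADAPTER))
```

Real tools additionally tolerate mismatches in the adaptor match, which this sketch deliberately leaves out.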

List of technical sequences

Advised to use defaults

http://code.google.com/p/ea-utils/wiki/FastqMcf

Fastq-mcf output

http://code.google.com/p/ea-utils/wiki/FastqMcf

● Never remove duplicate reads! Highly expressed genes can have genuine duplicate reads, which are not due to the PCR amplification step in the protocol.

● PhiX sequences: the DNA of Phi X bacteriophage is spiked in to monitor and optimize sequencing on Illumina machines. Your sequencing provider should filter out those sequences before delivery. You can filter them out by aligning your reads to the PhiX genome.

http://en.wikipedia.org/wiki/Phi_X_174

Biological contamination

Mitochondria contain rRNA, mRNA and mtDNA

rRNA and non-coding (95% of RNA)

mRNA (5% of RNA)

nucleus

mRNAs are captured with oligo-dT coated beads.

Occasionally, non-protein coding sequences are also captured (especially since mtRNA and rRNA can be relatively rich in AT).

We can remove them via homology searching (BLAST) with known non-protein coding sequences.

Mitochondrial

mRNA (5% of RNA)

rRNA and nc

mRNAs are post-transcriptionally modified: e.g. the addition of a poly-A tail. If our goal is to map the reads to a reference genome sequence, the polyA tails should be removed. This can be viewed as some source of 'biological contamination' in our sequences (…).

AAAAAAAAAAAAA
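Removing such a tail can be sketched as below. The helper name and the minimum tail length of 10 are assumptions; a short run of A's at the end of a read may simply be genuine sequence, so very short runs are left alone.

```python
def trim_poly_a(sequence, quality, min_length=10):
    """Remove a trailing run of A's if it is at least min_length long."""
    stripped = sequence.rstrip("A")
    tail_length = len(sequence) - len(stripped)
    if tail_length < min_length:
        return sequence, quality  # too short to call it a polyA tail
    return stripped, quality[:len(stripped)]


print(trim_poly_a("ACGTGC" + "A" * 13, "I" * 19))
print(trim_poly_a("ACGTAAA", "IIIIIII"))
```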

● Get the non-protein coding sequences via Biomart.

Mitochondrial genome sequence also.

Filter the biological contamination

Your reads

The biological reads

Imported via Biomart

We are interested in the reads that don't map!
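The idea of keeping only the reads that do not match can be illustrated with a toy filter. This sketch uses exact k-mer lookup as a simplified stand-in for a real aligner; it is not what an aligner such as bowtie does internally, and the function names and the value of k are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def kmers(sequence, k):
    """Set of all substrings of length k (empty if the sequence is shorter)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_contaminant_index(contaminants, k=20):
    """Collect the k-mers of both strands of every contaminant sequence."""
    index = set()
    for sequence in contaminants:
        index |= kmers(sequence, k)
        index |= kmers(reverse_complement(sequence), k)
    return index


def keep_unmatched(reads, index, k=20):
    """Return the reads that share no k-mer with the contaminant index."""
    return [read for read in reads if kmers(read, k).isdisjoint(index)]


# Toy sequences; a small k is used only because the examples are short.
index = build_contaminant_index(["ACGTACGTAGGCTTAACCGG"], k=8)
reads = [
    "TTTTACGTACGTAGTTTT",   # contains a forward-strand contaminant k-mer
    "GGGGGGGGCCCCCCCC",     # unrelated, should be kept
    "AAAACCTACGTACGTAAAA",  # contains a reverse-strand contaminant k-mer
]
print(keep_unmatched(reads, index, k=8))
```

Reads shorter than k have no k-mers and are therefore always kept by this sketch.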


Doing this in Galaxy

Useful: take a sample of your reads: fastq-to-tabular, select random lines, tabular-to-fastq

1. Create a new history.

2. Load the sample data.

3. Run fastq-mcf to remove technical sequences.

4. Run bowtie to match against biological sequence databases, and keep reads that don't match.

5. Summarize with FastQC.

→ Make a workflow of this sample history.

→ Run the workflow on all your samples in parallel.

→ Store the cleaned reads in a data library.
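Outside Galaxy, the sampling step mentioned above (taking a random subset of reads) could be done with reservoir sampling, which needs only one pass over the file. This is a sketch under the assumption of 4 lines per read; `sample_fastq` is a hypothetical helper, not a Galaxy tool.

```python
import io
import random


def sample_fastq(handle, n_reads, seed=0):
    """Reservoir-sample n_reads records (4 lines each) from a FASTQ stream."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[3]:
            break  # end of file, or a truncated last record
        seen += 1
        if len(reservoir) < n_reads:
            reservoir.append(record)
        else:
            # Replace a stored record with probability n_reads / seen.
            j = rng.randrange(seen)
            if j < n_reads:
                reservoir[j] = record
    return reservoir


# Toy input with 10 reads; keep 3 of them.
data = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(10))
sample = sample_fastq(io.StringIO(data), 3, seed=1)
print("".join(line for record in sample for line in record), end="")
```

Fixing the seed makes the sample reproducible, which helps when the same subset has to be used to build and test the workflow.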

Summary preprocessing

Your reads

→ Format consistent? Errors in quality?

Your groomed reads

→ Trends in raw data? QC report

→ Get technical contaminants

Your groomed reads without technical contamination

→ Get biological contaminants

Your groomed reads without technical and biological contamination

→ How does your data look now? QC

Keywords

Paired end

Stranded reads

Adapter sequence

Write in your own words what the terms mean

Exercise

→ investigating and preprocessing raw RNA-seq data
