BIOINFORMATICS LAB Episode IV – Next Generation Sequencing Federico M. Giorgi, PhD Department of Pharmacy and Biotechnology First Cycle Degree in Genomics

BIOINFORMATICS LAB Episode IV Next Generation Sequencing · 2019-04-08 · 3/60 Sequencing Techniques y Length (nt) Illumina HiSeq 2000 Illumina NextSeq 500 Roche 454 Illumina MiSeq


BIOINFORMATICS LAB

Episode IV – Next Generation Sequencing

Federico M. Giorgi, PhD

Department of Pharmacy and Biotechnology

First Cycle Degree in Genomics


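Sequencing Techniques

[Figure: sequencing platforms compared by Quality (y-axis) against read Length (nt) (x-axis, ticks at 20, 100, 300, 600, 2000, 10000), with throughput also indicated. Platforms shown: Illumina HiSeq 2000, Illumina NextSeq 500, Roche 454, Illumina MiSeq, Oxford Nanopore, Sanger, Solexa.]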


FASTQ format


Phred+33 Quality encoding

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ

0.........................26.............41

The numeric Quality Score (Q) is then converted to the error probability (P) using this formula:

$Q = -10 \log_{10}(P)$

$P = 10^{-Q/10}$
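
As a small illustration of the encoding and formula above, the following Python sketch decodes a Phred+33 quality string into numeric scores and converts a score into an error probability. The function names are made up for this example and do not come from any particular library.

```python
def phred33_to_scores(quality_string):
    """Convert a Phred+33 quality string into a list of integer Q scores."""
    # Each character encodes Q as its ASCII code minus 33
    return [ord(char) - 33 for char in quality_string]


def error_probability(q):
    """Convert a quality score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# '!' is the lowest score, ';' sits at 26 and 'J' at 41, as on the scale above
print(phred33_to_scores("!;J"))  # [0, 26, 41]

# Q = 20 corresponds to roughly 1 error in 100 base calls
print(error_probability(20))
```

A score of 30 corresponds to about 1 error in 1000 base calls, and so on: every 10 points of Q divide the error probability by 10.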


FastQC


Read Trimming

• Quality

• Adapters


Read Trimming

Barplots indicating the performance of nine read trimming tools at different quality thresholds on a Homo sapiens RNA-Seq dataset.



Read Trimming

• Benefits for

– RNA-Seq (higher quality reads)

– Variant/Mutation Calling (lower error rate)

– Genome Assembly (faster with lower RAM requirements at similar quality levels)
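
To make the two kinds of trimming (quality and adapters) concrete, here is a minimal Python sketch. It is a deliberately simple illustration and not the algorithm of any specific trimming tool: it assumes Phred+33 qualities, trims only from the 3' end, and looks for an exact adapter match. The function names, the default threshold of 20 and the example sequences are assumptions made for this example.

```python
def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(sequence)
    # Walk backwards from the 3' end while the base quality is too low
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def remove_adapter(sequence, quality, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    position = sequence.find(adapter)
    if position == -1:
        return sequence, quality
    return sequence[:position], quality[:position]


# The last three bases have scores 2, 0 and 0 and are removed
print(trim_3prime("ACGTACGT", "IIIII#!!"))  # ('ACGTA', 'IIIII')

# Hypothetical adapter sequence, used only for illustration
print(remove_adapter("ACGTAGATCGGAAG", "I" * 14, "AGATCGGAAG"))  # ('ACGT', 'IIII')
```

Real tools usually go further, for example with sliding-window quality averages and adapter matching that tolerates mismatches.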


PCR duplicates removal

• Generated during library preparation (sequence amplification)

• Detected by FastQC

• Taken care of by most Trimming Tools (e.g. PRINSEQ)
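
As a rough idea of what duplicate removal means at the read level, the sketch below keeps only the first read for each distinct sequence. This is a simplification under stated assumptions: it treats only exact sequence copies as duplicates, whereas real tools may also consider reverse complements, read pairs or alignment coordinates. The function name and the toy reads are invented for this example.

```python
def remove_exact_duplicates(reads):
    """Keep the first occurrence of each read sequence, preserving order.

    reads is a list of (name, sequence) tuples.
    """
    seen = set()
    unique = []
    for name, sequence in reads:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((name, sequence))
    return unique


reads = [
    ("read1", "ACGTACGT"),
    ("read2", "TTTTCCCC"),
    ("read3", "ACGTACGT"),  # exact copy of read1
]
print(remove_exact_duplicates(reads))
# [('read1', 'ACGTACGT'), ('read2', 'TTTTCCCC')]
```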


Aligning Reads on a Genome

• Input: FASTQ

• Tools

– DNA: BWA, Bowtie, Bowtie2

– RNA: TopHat, STAR

– Both: HISAT2

• Output: SAM


The SAM format

• Format used to store information on read alignment on a reference genome

• Can be compressed (BAM)

• Can contain only aligned reads (SAM < FASTQ)

• Can contain all reads (you can then delete the original FASTQ files)

https://samtools.github.io/hts-specs/SAMv1.pdf
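
A SAM alignment line is tab-separated, with 11 mandatory columns defined in the specification linked above, followed by optional tags. The Python sketch below splits one line into named fields. The example line is made up for illustration, and header lines (which start with "@") are assumed to have been filtered out beforehand.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one SAM alignment line into a dictionary of named fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for field in INTEGER_FIELDS:
        record[field] = int(record[field])
    # Anything after the 11 mandatory columns is an optional tag
    record["TAGS"] = columns[11:]
    return record


# Invented alignment line, for illustration only
example = "read1\t147\tchr1\t100\t60\t8M\t=\t50\t-58\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["CIGAR"])  # chr1 100 8M
```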


The SAM Flag Column

FLAG

The number is an unambiguous sum of individual flags (each combination gives a distinct total), such as:

• Read paired: 1

• Both reads in pair are aligned: 2

• Read not aligned: 4

• Read in reverse strand: 16

• Supplementary alignment: 2048


The SAM Flag Column

FLAG

The number is an unambiguous sum of individual flags, written in hexadecimal format (0x) such as:

• Read paired: 0x1

• Both reads in pair are aligned: 0x2

• Read not aligned: 0x4

• Read in reverse strand: 0x10

• Second in pair: 0x80

• Supplementary alignment: 0x800

…etc

E.g.

• Read paired: 0x1=1

• Both reads in pair are aligned: 0x2=2

• Read in reverse strand: 0x10=16

• Second in pair: 0x80=128

Total: 128 + 16 + 2 + 1 = 147

Trick: if this column is an odd number, the read is paired (so the dataset has paired reads)
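
The worked example above (147) can be checked with a few lines of Python. This sketch only covers a subset of the flags, namely those used in the example plus "read not aligned"; the dictionary and function names are chosen for this illustration.

```python
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "both reads in pair are aligned",
    0x4: "read not aligned",
    0x10: "read in reverse strand",
    0x80: "second in pair",
}


def decode_flag(flag):
    """Return the descriptions of all known bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def is_paired(flag):
    """The 'odd number' trick: bit 0x1 is set exactly when the flag is odd."""
    return flag % 2 == 1


print(decode_flag(147))
# ['read paired', 'both reads in pair are aligned', 'read in reverse strand', 'second in pair']
print(is_paired(147))  # True
```

Because every flag is a distinct power of two, the bitwise AND (`&`) recovers each component of the sum independently.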


The SAM CIGAR Column

CIGAR

• A string describing how the read aligns with the reference

• It consists of one or more components

• Each component comprises an operator and the number of bases which the operator applies to

CIGAR string operators:

D Deletion; the nucleotide is present in the reference but not in the read

H Hard Clipping; the clipped nucleotides are not present in the read

I Insertion; the nucleotide is present in the read but not in the reference

M Match; can be either an alignment match or mismatch. The nucleotide is present in the reference

N Skipped region; a region of nucleotides is not present in the read

P Padding; padded area in the read and not in the reference

S Soft Clipping; the clipped nucleotides are present in the read
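
The component structure described above (a number followed by an operator) can be parsed with a regular expression. The Python sketch below splits a CIGAR string into components and derives how many bases the alignment covers on the read and on the reference. Besides the operators listed above, it also accepts `=` and `X` (sequence match and mismatch), which the SAM specification defines; the example string is invented.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators that advance along the read sequence or along the reference
CONSUMES_READ = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operator) components."""
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def read_length(cigar):
    """Number of read bases described by the CIGAR (hard clips excluded)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


def reference_span(cigar):
    """Number of reference positions covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


cigar = "3S10M2I5M1D4M"
print(parse_cigar(cigar))
# [(3, 'S'), (10, 'M'), (2, 'I'), (5, 'M'), (1, 'D'), (4, 'M')]
print(read_length(cigar))     # 24
print(reference_span(cigar))  # 20
```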



Working on SAM files

Common Operations:

• Converting to BAM (binary zipped SAM: smaller)

• Sort BAM (required by BAM visualizers for faster navigation)

• Index BAM (generates a BAI, makes the BAM faster to read by tools)

• Merge BAMs (e.g. from technical replicates)

Common Tools:

• samtools (the old classic: fast and reliable)

• Picard Tools (the Broad Institute alternative: it performs more operations and has several more parameters to play with)


Visualizing BAMs

• samtools tview

– Command line

– Fast

– Weak

• Tablet

– The first beautiful GUI

• SeqMonk

– For ChIP-Seq

• Integrative Genomics Viewer

– Everyone uses this


Getting NGS data from public databases

• GEO – Gene Expression Omnibus

– American (NCBI, Bethesda, Maryland)

– Largest repository of high-throughput data in the world

• NGS

• Microarrays


Getting NGS data from public databases

• ArrayExpress

– European (EBI, Hinxton, United Kingdom)

– More recent than GEO (better search tools)

– GEO and ArrayExpress are partially redundant


Getting NGS data from public databases

• Sequence Read Archive (SRA)

– Subset of NCBI GEO specifically for NGS data (no microarrays)

– Raw data is available

– Essentially FASTQ files

– Compressed and optionally encrypted in the SRA format


The SRA Toolkit

• Common pipeline when you start from a public dataset

– Find a suitable dataset (ArrayExpress is the best)

– Find a link to the sample IDs (in SRA format)

– Download SRA files

– Convert SRA files to FASTQ files

– Quality control of FASTQ files

– Optional FASTQ Trimming/Adapter removal

– FASTQ alignment on reference genome (BAM)

– BAM visualization

– Downstream Analysis

[Slide bracket label: NCBI’s SRA Toolkit]


Three Datasets

We will now download and analyze 3 different datasets

Each one represents one of the three major classes of NGS Experiments:

• DNA-Seq

– Whole Genome Sequencing (WGS)

– Whole Exome Sequencing (WXS)

• RNA-Seq

• ChIP-Seq

[Figure labels: Stark, Lannister, Baratheon]


Converting BAM to gene expression

The predominant reads within a BAM originating from an RNA-Seq experiment derive from messenger RNAs

RNA-seq reads

• Short (36-250 bases)

• High error rates (1%)

• Hundreds of millions of reads

• Many reads span exon-exon junctions


Converting BAM to gene expression

Peculiarities of RNA-Seq short reads:

• Alignment is not uniform (proportional to transcript expression)

• Alignment on the same transcript is not uniform (exonucleases cut from 5’ and 3’)

• When aligned on the genome, eukaryotic RNA-Seq reads can span across introns

• Alternative isoforms

• RNA editing


The GFF format

1. seqid - Chromosome/Scaffold/Reference name

2. source - Source that annotated this feature

3. type - Type of feature (e.g. gene, transcript, exon)

4. start - Start position of the feature

5. end - End position of the feature

6. score - A floating point value (can be used for e.g. peak intensity for ChIP-Seq features)

7. strand - Defined as + (forward) or - (reverse)

8. phase - 0, 1 or 2. For coding sequences. “0” means “in frame”, 1 and 2 mean that the codon is shifted 1 or 2 bases

9. attributes - A semicolon-separated list of tag-value pairs, providing additional information about each feature. E.g. ID, Parent, gene_type, gene_name

Tab-separated

Empty columns denoted with “.”
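
The nine columns above map naturally onto a dictionary. The Python sketch below parses one GFF line, assuming the GFF3 convention of `tag=value` pairs in the attributes column. The example line and the function name are invented for illustration, and comment lines (starting with "#") are assumed to be skipped beforehand.

```python
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def parse_gff_line(line):
    """Parse one tab-separated GFF3 feature line into a dictionary."""
    values = line.rstrip("\n").split("\t")
    feature = dict(zip(GFF_COLUMNS, values))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    # Split the attributes column into tag=value pairs
    attributes = {}
    for pair in feature["attributes"].split(";"):
        if "=" in pair:
            tag, value = pair.split("=", 1)
            attributes[tag.strip()] = value.strip()
    feature["attributes"] = attributes
    return feature


# Invented feature line; "." marks the empty score and phase columns
example = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;gene_name=ABC1"
feature = parse_gff_line(example)
print(feature["type"], feature["start"], feature["end"])  # gene 1000 2000
print(feature["attributes"]["gene_name"])                 # ABC1
```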


Getting counts from RNA-Seq

Inputs: GFF3 annotation, BAM alignment

Tool: htseq-count

Outputs: Gene Counts, Exon Counts, Transcript Counts, Anything Counts
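
The idea of turning alignments plus an annotation into counts can be sketched in plain Python. This is a toy illustration and not how htseq-count is implemented: it assumes each read is reduced to a single interval, ignores strand and spliced alignments, and counts a read for a gene only when it overlaps exactly one gene. The function name, the category labels and the coordinates are all invented for this example.

```python
from collections import Counter


def count_reads(genes, reads):
    """Count reads per gene.

    genes: dict mapping gene name to (chrom, start, end)
    reads: list of (chrom, start, end) tuples
    Coordinates are 1-based and inclusive, as in GFF.
    """
    counts = Counter()
    for chrom, start, end in reads:
        hits = [name for name, (g_chrom, g_start, g_end) in genes.items()
                if g_chrom == chrom and start <= g_end and end >= g_start]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts


genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
reads = [
    ("chr1", 120, 170),   # inside geneA only
    ("chr1", 460, 510),   # overlaps both genes
    ("chr1", 950, 1000),  # outside any gene
    ("chr1", 600, 650),   # inside geneB only
]
counts = count_reads(genes, reads)
print(dict(counts))
# {'geneA': 1, 'ambiguous': 1, 'no_feature': 1, 'geneB': 1}
```

Swapping the gene intervals for exon or transcript intervals gives exon or transcript counts with the same logic.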


Let’s open The Terminal!

Reminders:

• userid student

• password 4genomics4


Sequences Exercises (Open exercises_04_NGS.pdf)

Turn Unix off CORRECTLY

• Please turn it off nicely

[Figures: three shutdown steps, labelled “Click on the mouse”, “Click Again”, “Last Click”]


www.giorgilab.org

Federico M. Giorgi, PhD

Department of Pharmacy and Biotechnology

[email protected]