Surya Saha

Sol Genomics Network (SGN)

Boyce Thompson Institute, Ithaca, NY

[email protected] // Twitter: @SahaSurya

BTI Plant Bioinformatics Course 2016

http://www.acgt.me/blog/2015/3/7/next-generation-sequencing-must-die

Sequencing 2016


Sequencing over the Ages

(Timeline figure, 1953 to 2015)

1953: DNA structure discovery

1977: Sanger DNA sequencing by chain-terminating inhibitors

1984: Epstein-Barr virus (170 Kb)

1987: ABI 370 sequencer

1995: Haemophilus influenzae (1.83 Mb)

2001: Homo sapiens (3.0 Gb)

2005: 454

Later platforms shown on the timeline (axis ticks at 2007, 2011, 2012 and 2013): Solexa, SOLiD, Illumina, Ion Torrent, PacBio, Illumina Hiseq X

2014: Pinus taeda (24 Gb), Nanopore MinION

2015: 10X Genomics

Slide concept: Aureliano Bombarely

3/29/2016 BTI Plant Bioinformatics Course 2016 2

First generation sequencing


Sanger. Annu Rev Biochem. 1988;57:1-28.

Thanks to Nick Loman for the mention

Maxam-Gilbert method


http://en.wikipedia.org/wiki/File:Maxam-Gilbert_sequencing_en.svg

https://www.nationaldiagnostics.com/electrophoresis/article/maxam-gilbert-sequencing

Sanger method

Frederick Sanger (13 Aug 1918 – 19 Nov 2013)

Won the Nobel Prize in Chemistry in 1958 and 1980. Published the dideoxy chain-termination method, or “Sanger method”, in 1977.

http://dailym.ai/1f1XeTB


http://en.wikipedia.org/wiki/File:Sanger-sequencing.svg

http://en.wikipedia.org/wiki/File:Radioactive_Fluorescent_Seq.jpg

First generation sequencing

• Very high quality sequences (99.999% or Q50)

• Very very low throughput

Platform | Run time | Read length | Reads / run | Total nucleotides sequenced | Cost / Mb

Capillary sequencing (ABI 3730xl) | 20 min - 3 h | 400-900 bp | 96 or 384 | 1.9-84 Kb | $2400

http://www.hindawi.com/journals/bmri/2012/251364/tab1/

Next generation sequencing


Name the specific technology used to generate the data:

– Illumina Hiseq/Miseq/NextSeq

– Pacific Biosciences RS I/RS II

– Ion Torrent Proton/PGM

– SOLiD

– Oxford Nanopore


http://www.acgt.me/blog/2015/3/10/next-generation-sequencing-must-diepart-2

454 Pyrosequencing

One purified DNA fragment, to one bead, to one read.


http://www.genengnews.com/

GS FLX Titanium

https://mariamuir.com/wp-content/uploads/2013/04/rip.gif

Illumina

(Table with one column per instrument; instrument labels recovered from the slide images: 500 and 2500, 3000, 4000)

Output: 0.3-15 Gb | 20-120 Gb | 10-1500 Gb | 900-1800 Gb

Number of reads / flow cell: 25 million | 130-400 million | 300 million - 2.5 billion | 3 billion

Read length: 2x300 bp | 2x150 bp | 2x250 - 2x125 bp | 2x150 bp

Cost: $99K | $250K | $740K | $10M (10 units)

Source: Illumina

$1000 human genome??

Illumina

Mardis 2008. Annu. Rev. Genomics Hum. Genet. 2008. 9:387–402

Pacific Biosciences SMRT sequencing

Single Molecule Real Time sequencing


http://smrt.med.cornell.edu/images/pacbio_library_prep-1.gif

RS II

Sequel

Pacific Biosciences SMRT sequencing: Error correction methods

Hierarchical genome-assembly process (HGAP)

English et al., PLOS One. 2012

PBJelly

PBcR Pipeline

3/29/2016 Centre for Agricultural Bioinformatics, Pusa 20

Pacific Biosciences SMRT sequencing: Read Lengths

Oxford Nanopore


https://www.nanoporetech.com/

http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion

http://halegrafx.com/vector-art/free-vector-despicable-me-minions/


Next generation sequencing

Platform | Run time | Read length | Quality | Total nucleotides sequenced | Cost / Mb

454 Pyrosequencing | 24 h | 700 bp | Q20-Q30 | 1 Gb | $10

Illumina Miseq | 27 h | 2x300 bp | >Q30 | 15 Gb | $0.15

Illumina Hiseq 2500 | 1-10 days | 2x250 bp | >Q30 | 3000 Gb | $0.05

Ion Torrent | 2 h | 400 bp | >Q20 | 50 Mb - 1 Gb | $1

Pacific Biosciences | 30 min - 4 h | 10 kb to >40 kb | >Q50 consensus, >Q10 single | 500-1000 Mb / SMRT cell | $0.13 - $0.60

http://www.hindawi.com/journals/bmri/2012/251364/

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431227

Note: Some figures might be out of date
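As a rough worked example of the cost column, the sketch below multiplies genome size by target coverage and the per-megabase price. The function name and the 5 Mb genome at 100X are illustrative assumptions; the prices are the ones in the table, which may be out of date, and the result covers sequencing output only, not library prep, labor or analysis.

```python
def sequencing_cost_usd(genome_size_mb: float, coverage: float, cost_per_mb: float) -> float:
    """Estimate sequencing cost as (genome size x coverage) x price per megabase."""
    total_mb_sequenced = genome_size_mb * coverage
    return total_mb_sequenced * cost_per_mb


if __name__ == "__main__":
    # Hypothetical 5 Mb microbial genome at 100X, using two prices from the table.
    for platform, price in [("Illumina Miseq", 0.15), ("454 Pyrosequencing", 10.0)]:
        cost = sequencing_cost_usd(5, 100, price)
        print(f"{platform}: about ${cost:,.0f}")
```

The same 500 Mb of sequence differs by nearly two orders of magnitude in price between the two platforms, which is the point of the cost column.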

Long range scaffolding


Hi-C Crosslinking


GemCode

• Long-read information from short reads using 14 bp barcodes

• Very low input DNA (ng) and 20-minute library preparation time

• 1 ng of DNA is split across 100,000 gel bead-in-emulsion partitions (GEMs)

• Chromium instrument for single-cell RNA-seq

http://mms.businesswire.com/media/20150225005296/en/454639/5/GemCodePlatform.jpg

http://www.nature.com/nbt/journal/v34/n3/full/nbt.3432.html
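The barcode idea can be sketched in a few lines: short reads that share a barcode came from the same partition, so they can be grouped back together as long-range information. This is only an illustration. It assumes, purely for simplicity, that the 14 bp barcode is the first 14 bases of each read, which is not necessarily how the real platform lays out its reads; the function name and constant are hypothetical.

```python
from collections import defaultdict

BARCODE_LENGTH = 14  # barcode length quoted on the slide

# Number of distinct barcodes possible with 14 bases over A/C/G/T.
POSSIBLE_BARCODES = 4 ** BARCODE_LENGTH


def group_by_barcode(reads):
    """Group reads by their leading barcode (assumed layout: barcode + insert)."""
    groups = defaultdict(list)
    for read in reads:
        barcode = read[:BARCODE_LENGTH]
        insert = read[BARCODE_LENGTH:]
        groups[barcode].append(insert)
    return dict(groups)


if __name__ == "__main__":
    reads = [
        "AAAAAAAAAAAAAA" + "ACGTACGT",
        "CCCCCCCCCCCCCC" + "TTTTGGGG",
        "AAAAAAAAAAAAAA" + "GGGGCCCC",
    ]
    for barcode, inserts in group_by_barcode(reads).items():
        print(barcode, inserts)
```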


http://www.bionanogenomics.com/technology/why-genome-mapping/


Human MHC map

• Sample prep requires very high molecular weight DNA

• Nicks at 10 sites / 100 kb

• Individual molecules are assembled into optical maps

• Optical maps and sequences are merged in a hybrid assembly

Many Others..

• Ion Torrent Proton/PGM

• Supporting technologies

– Nabsys

– OpGen

– Fluidigm


http://nextgenseek.com/2012/11/did-you-know-there-are-at-least-14-next-gen-sequence-technology-companies/

Sequencing Trends


https://www.google.com/trends/

Number of Publications, 2008-2015 (chart; series: Illumina, Pacific Biosciences, Roche 454, Ion Torrent)

Increase in Number of Publications, 2009-2015 (chart; series: Illumina, Pacific Biosciences, Roche 454, Ion Torrent)

% Increase in Number of Publications, 2009-2015 (chart; series: Pacific Biosciences, Roche 454, Ion Torrent)

Real cost of Sequencing!!

Sboner, Genome Biology, 2011

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2011-12-8-125

So What Sequencer Do I Use??

Microbial genome

• Draft genome

– Illumina Miseq (100-130X)

– Illumina Hiseq (<200X)

• Complete genome

– Pacific Biosciences (80-100X)

• Amplicons (16S, ITS)

– Illumina Miseq

Eukaryotic genome

• De novo assembly

– Pacific Biosciences (70-80X)

– Illumina Hiseq (100X+)

– 10X Genomics

– Bionano

• Genotyping (GBS)

– Illumina Hiseq

• BACs

– Pacific Biosciences
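The coverage figures (for example 100X) translate into a read count with the usual relation coverage = reads × read length / genome size. The sketch below is a minimal illustration of that arithmetic; the function names and the 5 Mb example genome are assumptions, not part of the slides.

```python
import math


def coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(genome_size_bp: int, target_coverage: float, read_length_bp: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(genome_size_bp * target_coverage / read_length_bp)


if __name__ == "__main__":
    # Hypothetical 5 Mb microbial genome, 100X, 300 bp reads (one mate of a 2x300 run).
    n = reads_needed(5_000_000, 100, 300)
    print(n, "reads, i.e. about", math.ceil(n / 2), "read pairs")
```

This is an average only; real coverage is uneven, which is one reason the recommended depths are well above the theoretical minimum.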


$$$$ ????


The diploid reference genome

Cornell Sequencing Core

• Illumina Hiseq 2500 (Rapid run and High output)

• Illumina Miseq

• Illumina Nextseq 500

• 10X Genomics GemCode


http://www.biotech.cornell.edu/brc/genomics/services/price-list#overlay-context=brc/genomics-facility/next-generation-sequencing


Library Types

Single end

Paired end (PE, 150-300 bp, Fwd:/1, Rev:/2)

Mate pair (MP, 2 Kb to 20 Kb)

(Figure: orientation of forward (F) and reverse (R) reads for each library type, with separate mate-pair diagrams for 454/Roche and Illumina)

Slide credit: Aureliano Bombarely
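The /1 and /2 suffixes are what lets software match a forward read to its reverse mate. A minimal sketch of that matching is below; the function name and the example IDs are hypothetical, and newer Illumina headers encode the mate number differently, so treat this as an illustration of the older naming scheme only.

```python
def pair_reads(read_ids):
    """Match read IDs ending in /1 (forward) and /2 (reverse) by their shared prefix."""
    forward, reverse = {}, {}
    for read_id in read_ids:
        base, sep, mate = read_id.rpartition("/")
        if sep and mate == "1":
            forward[base] = read_id
        elif sep and mate == "2":
            reverse[base] = read_id
    # Pairs have both mates; orphans lost their partner (e.g. after filtering).
    pairs = [(forward[b], reverse[b]) for b in forward if b in reverse]
    orphans = [rid for b, rid in forward.items() if b not in reverse]
    orphans += [rid for b, rid in reverse.items() if b not in forward]
    return pairs, orphans


if __name__ == "__main__":
    pairs, orphans = pair_reads(["read1/1", "read1/2", "read2/1"])
    print("pairs:", pairs)
    print("orphans:", orphans)
```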

Implications of Choice of Library

(Figure: assembly hierarchy)

Reads → Consensus sequence (Contig)

Contigs + paired-read information → Scaffold (or Supercontig), with gaps shown as NNNNN

Scaffolds + genetic information (markers) or optical maps → Pseudomolecule (or ultracontig)

Slide credit: Aureliano Bombarely
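The scaffold step can be pictured as joining ordered contigs with runs of N whose length is the gap estimated from read pairs. The sketch below assumes the contig order and gap sizes are already known; the function name and inputs are illustrative, and real scaffolders also have to work out order and orientation.

```python
def build_scaffold(contigs, gap_sizes, min_gap=1):
    """Join ordered contigs, filling each estimated gap with N characters."""
    if not contigs:
        return ""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        # Keep at least min_gap Ns so the join stays visible in the scaffold.
        parts.append("N" * max(gap, min_gap))
        parts.append(contig)
    return "".join(parts)


if __name__ == "__main__":
    print(build_scaffold(["ACGTAC", "GGCCTT", "TTAA"], [5, 2]))
```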

Multiplexing Libraries

Use of different tags (4-6 nucleotides) to identify different samples in the same lane/sector.

(Figure: two libraries tagged with AGTCGT and TGAGCA are pooled and sequenced together)

Slide credit: Aureliano Bombarely
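After sequencing, the pooled reads are sorted back into samples by their tag. The sketch below uses the two tags from the slide and assumes, for illustration, that the tag is the first six bases of each read and must match exactly; the function and sample names are hypothetical, and real demultiplexers usually tolerate a mismatch.

```python
def demultiplex(reads, tag_to_sample):
    """Assign each read to a sample by its leading tag and strip the tag."""
    tag_length = len(next(iter(tag_to_sample)))
    bins = {sample: [] for sample in tag_to_sample.values()}
    bins["unassigned"] = []
    for read in reads:
        sample = tag_to_sample.get(read[:tag_length])
        if sample is None:
            bins["unassigned"].append(read)  # unknown tag: keep the read untouched
        else:
            bins[sample].append(read[tag_length:])
    return bins


if __name__ == "__main__":
    tags = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}
    reads = ["AGTCGTACGTAC", "TGAGCATTTT", "CCCCCCGG"]
    for sample, seqs in demultiplex(reads, tags).items():
        print(sample, seqs)
```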

Data!!

File Formats

Fasta files:

It is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.

-Wikipedia

Slide credit: Aureliano Bombarely
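A FASTA record is a header line starting with “>” followed by one or more sequence lines. A minimal reader might look like the sketch below; the function name is hypothetical, and for real work an established parser such as Biopython is a safer choice.

```python
def parse_fasta(lines):
    """Return {record_id: sequence} from an iterable of FASTA lines."""
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # The ID is the first word of the header; the rest is a description.
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if name is not None:
        records[name] = "".join(chunks)
    return records


if __name__ == "__main__":
    example = [">seq1 example nucleotide record", "ACGT", "TTGA", ">seq2", "MKV"]
    print(parse_fasta(example))
```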

Fastq files:

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.

-Wikipedia

• Single-line ID with the at symbol (“@”) in the first column.

• The sequence can span multiple lines after the ID line.

• A single line with the plus symbol (“+”) in the first column marks the start of the quality block.

• The “+” line may optionally repeat the ID.

• Quality values can span multiple lines after the “+” line, but their total length must equal the sequence length.

Slide credit: Aureliano Bombarely
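Because “@” is also a valid quality character, a reader cannot simply treat every line starting with “@” as a new record; it has to read quality characters until their count matches the sequence length. The sketch below follows the rules listed above under that approach. The function name is hypothetical and the code is an illustration, not a replacement for a tested parser.

```python
def parse_fastq(lines):
    """Return a list of (read_id, sequence, quality) tuples from FASTQ lines."""
    stream = (line.rstrip("\r\n") for line in lines)
    records = []
    for header in stream:
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got {header!r}")
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        for line in stream:
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("record ended before the '+' line")
        sequence = "".join(seq_parts)
        # Read quality lines until we have one value per base.
        quality = ""
        while len(quality) < len(sequence):
            line = next(stream, None)
            if line is None:
                raise ValueError("file ended inside the quality block")
            quality += line
        if len(quality) != len(sequence):
            raise ValueError("quality length differs from sequence length")
        records.append((header[1:], sequence, quality))
    return records


if __name__ == "__main__":
    example = ["@r1", "ACGT", "TT", "+r1", "IIII", "@I", "@r2", "GG", "+", "!!"]
    for record in parse_fastq(example):
        print(record)
```

In the example, the quality line “@I” belongs to read r1 even though it starts with “@”.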

Quality control: Encoding

Fastq files:

!"#$%&'()*+,-./0123456789 Offset by 33 (Phred+33)

KLMNOPQRSTUVWXYZ[\]^_`abcdefgh Offset by 64 (Phred+64)
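Decoding is a subtraction: the quality score is the character's ASCII code minus the offset. The sketch below shows that, plus a simple guess at which offset a file uses. The function names are hypothetical, and the guess is only a heuristic: a Phred+33 file containing nothing but very high qualities would be misread as Phred+64.

```python
def decode_qualities(quality_string, offset=33):
    """Convert quality characters to integer scores for the given offset."""
    return [ord(ch) - offset for ch in quality_string]


def guess_offset(quality_string):
    """Heuristic: characters below '@' (ASCII 64) can only occur with offset 33."""
    lowest = min(ord(ch) for ch in quality_string)
    return 33 if lowest < 64 else 64


if __name__ == "__main__":
    print(decode_qualities("!+5I"))           # Phred+33 characters
    print(decode_qualities("Kh", offset=64))  # Phred+64 characters
    print(guess_offset("!\"#$%"), guess_offset("KLMNh"))
```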


http://en.wikipedia.org/wiki/Phred_quality_score

Phred score of a base is: Q_phred = -10 × log10(e)

where e is the estimated error probability of a base
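The formula can be checked numerically in both directions. In the sketch below (function names are illustrative), Q50 gives an error probability of 0.00001, which matches the 99.999% accuracy quoted for first generation sequencing.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability e for a Phred score Q, from Q = -10 * log10(e)."""
    return 10 ** (-q / 10)


def error_to_phred(e: float) -> float:
    """Phred score Q for an error probability e."""
    return -10 * math.log10(e)


if __name__ == "__main__":
    for q in (10, 20, 30, 50):
        e = phred_to_error(q)
        print(f"Q{q}: error {e:g}, accuracy {100 * (1 - e):.3f}%")
```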


Pre-processing: Tools

Trimming

• FastQC

• FASTX toolkit

• Trimmomatic

• Scythe

Joining paired-end reads

• fastq-join

• FLASH

• PANDAseq
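To give a feel for what quality trimming does, the sketch below removes bases from the 3' end of a read while their quality is under a threshold. It is a deliberately simple illustration and not the algorithm of any tool listed above; the function name, the default threshold of Q20 and the Phred+33 offset are assumptions.

```python
def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


if __name__ == "__main__":
    # 'I' is Q40, '#' is Q2 and '!' is Q0 with offset 33.
    print(trim_3prime("ACGTAC", "IIII#!"))
```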


Thank you!!
