Surya Saha, Sol Genomics Network (SGN)
Boyce Thompson Institute, Ithaca, [email protected] // Twitter:@SahaSurya
BTI Plant Bioinformatics Course 2016
http://www.acgt.me/blog/2015/3/7/next-generation-sequencing-must-die
Sequencing over the Ages
Slide concept: Aureliano Bombarely
[Timeline figure]
1953: DNA structure discovery
1977: Sanger DNA sequencing by chain-terminating inhibitors
1984: Epstein-Barr virus (170 Kb)
1987: ABI 370 sequencer
1995: Haemophilus influenzae (1.83 Mb)
2001: Homo sapiens (3.0 Gb)
2005: 454
2007: Solexa, SOLiD
2011: Ion Torrent, PacBio
2012-2013: Illumina, Illumina HiSeq X, 454
2014: Pinus taeda (24 Gb), Nanopore MinION
2015: 10X Genomics
3/29/2016 BTI Plant Bioinformatics Course 2016 2
First generation sequencing
Sanger. Annu Rev Biochem. 1988;57:1-28.
Thanks to Nick Loman for the mention
Maxam-Gilbert method
http://en.wikipedia.org/wiki/File:Maxam-Gilbert_sequencing_en.svg
https://www.nationaldiagnostics.com/electrophoresis/article/maxam-gilbert-sequencing
Sanger method
Frederick Sanger (13 Aug 1918 – 19 Nov 2013)
Won the Nobel Prize for Chemistry in 1958 and 1980. Published the dideoxy chain termination method, or “Sanger method”, in 1977.
http://dailym.ai/1f1XeTB
Sanger method
http://en.wikipedia.org/wiki/File:Sanger-sequencing.svg
http://en.wikipedia.org/wiki/File:Radioactive_Fluorescent_Seq.jpg
First generation sequencing
• Very high quality sequences (99.999% or Q50)
• Very very low throughput
Platform | Run Time | Read Length | Reads / Run | Total nucleotides sequenced | Cost / MB
Capillary Sequencing (ABI3730xl) | 20m-3h | 400-900 bp | 96 or 384 | 1.9-84 Kb | $2400
http://www.hindawi.com/journals/bmri/2012/251364/tab1/
Name the specific technology used to generate the data, not “next-generation sequencing”
– Illumina HiSeq/MiSeq/NextSeq
– Pacific Biosciences RS I/RS II
– Ion Torrent Proton/PGM
– SOLiD
– Oxford Nanopore
http://www.acgt.me/blog/2015/3/10/next-generation-sequencing-must-diepart-2
454 Pyrosequencing
One purified DNA fragment, to one bead, to one read.
http://www.genengnews.com/
GS FLX Titanium
https://mariamuir.com/wp-content/uploads/2013/04/rip.gif
Illumina
Instrument | MiSeq | NextSeq 500 | HiSeq 2500/3000/4000 | HiSeq X Ten
Output | 0.3-15 Gb | 20-120 Gb | 10-1500 Gb | 900-1800 Gb
Number of Reads / Flow cell | 25 million | 130-400 million | 300 million – 2.5 billion | 3 billion
Read Length | 2x300 bp | 2x150 bp | 2x250 - 2x125 bp | 2x150 bp
Cost | $99K | $250K | $740K | $10M (10 units)
Source: Illumina
$1000 human genome??
Illumina
Mardis 2008. Annu. Rev. Genomics Hum. Genet. 2008. 9:387–402
Illumina: TruSeq Long Read
Voskoboynik eLife 2013;2:e00569
Pacific Biosciences SMRT sequencing
Single Molecule Real Time sequencing
http://smrt.med.cornell.edu/images/pacbio_library_prep-1.gif
RS II
Sequel
Pacific Biosciences SMRT sequencing: Error correction methods
Hierarchical genome-assembly process (HGAP)
PBJelly (English et al., PLOS One. 2012)
PBcR Pipeline
Pacific Biosciences SMRT sequencing: Read Lengths
Oxford Nanopore
https://www.nanoporetech.com/
http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion
http://halegrafx.com/vector-art/free-vector-despicable-me-minions/
Next generation sequencing
Platform | Run Time | Read Length | Quality | Total nucleotides sequenced | Cost / MB
454 Pyrosequencing | 24h | 700 bp | Q20-Q30 | 1 GB | $10
Illumina Miseq | 27h | 2x300bp | >Q30 | 15 GB | $0.15
Illumina Hiseq 2500 | 1 - 10 days | 2x250bp | >Q30 | 3000 GB | $0.05
Ion Torrent | 2h | 400bp | >Q20 | 50MB-1GB | $1
Pacific Biosciences | 30m - 4h | 10kb - >40kb | >Q50 consensus, >Q10 single | 500 - 1000MB / SMRT cell | $0.13 - $0.60
http://www.hindawi.com/journals/bmri/2012/251364/
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431227
Note: Some figures might be out of date
http://mms.businesswire.com/media/20150225005296/en/454639/5/GemCodePlatform.jpg
• Long read information from short reads using 14 bp barcodes
• Very low input DNA (ng) and 20 minute library preparation time
• 1 ng of DNA is split across 100,000 gel bead-in-emulsion partitions (GEMs)
• Chromium instrument for single-cell RNAseq
GemCode
http://mms.businesswire.com/media/20150225005296/en/454639/5/GemCodePlatform.jpg
GemCode
http://www.nature.com/nbt/journal/v34/n3/full/nbt.3432.html
http://www.bionanogenomics.com/technology/why-genome-mapping/
Human MHC map
• Sample prep requires very high molecular weight DNA
• Nicks at 10 sites / 100kb
• Individual molecules are assembled into optical maps
• Optical maps and sequences are merged in a hybrid assembly
http://www.bionanogenomics.com/technology/why-genome-mapping/
Many Others..
• Ion Torrent Proton/PGM
• Supporting technologies
– Nabsys
– OpGen
– Fluidigm
http://nextgenseek.com/2012/11/did-you-know-there-are-at-least-14-next-gen-sequence-technology-companies/
Sequencing Trends
https://www.google.com/trends/
[Chart: Number of Publications, 2008-2015 (Illumina, Pacific Biosciences, Roche 454, Ion Torrent)]
[Chart: Increase in Number of Publications, 2009-2015 (Illumina, Pacific Biosciences, Roche 454, Ion Torrent)]
[Chart: % Increase in Number of Publications, 2009-2015 (Pacific Biosciences, Roche 454, Ion Torrent)]
Real cost of Sequencing!!
Sboner, Genome Biology, 2011
https://genomebiology.biomedcentral.com/articles/10.1186/gb-2011-12-8-125
So What Sequencer Do I Use??
Microbial genome
• Draft genome
– Illumina Miseq (100-130X)
– Illumina Hiseq (<200X)
• Complete genome
– Pacific Biosciences (80-100X)
• Amplicons (16S, ITS)
– Illumina Miseq
Eukaryotic genome
• De novo assembly
– Pacific Biosciences (70-80X)
– Illumina Hiseq (100X+)
– 10X Genomics
– Bionano
• Genotyping (GBS)
– Illumina Hiseq
• BACs
– Pacific Biosciences
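The coverage targets listed for each platform translate into read counts through the standard relation coverage = (number of reads x read length) / genome size. The following is a minimal sketch of that arithmetic, not a planning tool; the function name and the 5 Mb genome size are hypothetical and chosen only for illustration.

```python
import math

def read_pairs_needed(genome_size_bp, coverage, read_length_bp, paired=True):
    """Reads (or read pairs, if paired=True) needed for a target mean coverage."""
    bases_needed = genome_size_bp * coverage
    # A paired-end fragment contributes two reads of read_length_bp each.
    bases_per_fragment = read_length_bp * (2 if paired else 1)
    return math.ceil(bases_needed / bases_per_fragment)

# Hypothetical 5 Mb microbial genome, 100X coverage, 2x300 bp paired-end reads.
pairs = read_pairs_needed(5_000_000, 100, 300)
print(pairs)
```

The estimate ignores reads lost to trimming, contamination, or uneven coverage, so real projects usually budget some margin above it.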
$$$$ ????
Cornell Sequencing Core
• Illumina Hiseq 2500 (Rapid run and High output)
• Illumina Miseq
• Illumina Nextseq 500
• 10X Genomics GemCode
http://www.biotech.cornell.edu/brc/genomics/services/price-list#overlay-context=brc/genomics-facility/next-generation-sequencing
Library Types
Single end
Paired end (PE, 150-300 bp, Fwd:/1, Rev:/2)
Mate pair (MP, 2 Kb to 20 Kb)
[Figure: forward (F) and reverse (R) read orientations for each library type, with 454/Roche and Illumina variants labeled]
Slide credit: Aureliano Bombarely
Implications of Choice of Library
Slide credit: Aureliano Bombarely
[Figure: assembly hierarchy]
Reads
Consensus sequence (Contig): built from overlapping reads
Scaffold (or Supercontig): contigs ordered using pair read information, with gaps shown as NNNNN
Pseudomolecule (or ultracontig): scaffolds ordered using genetic information (markers) or optical maps
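To make the contig-to-scaffold step concrete, here is a small sketch of how ordered contigs can be joined with runs of N that stand for gaps whose size was estimated from paired reads. The function name, contig sequences, and gap sizes are all hypothetical; real scaffolders also decide contig order and orientation, which this sketch takes as given.

```python
def build_scaffold(contigs, gap_sizes):
    """Join ordered contigs, padding each estimated gap with N characters."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between two contigs
        parts.append(contig)
    return "".join(parts)

# Three toy contigs separated by estimated gaps of 5 and 2 bases.
scaffold = build_scaffold(["ACGTAC", "GGCTA", "TTGAC"], [5, 2])
print(scaffold)  # ACGTACNNNNNGGCTANNTTGAC
```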
Multiplexing Libraries
Use of different tags (4-6 nucleotides) to identify different samples in the same lane/sector.
Slide credit: Aureliano Bombarely
[Figure: reads from two samples carry the tags AGTCGT and TGAGCA, are pooled for sequencing, and are then separated by tag]
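Separating pooled reads by tag can be sketched as follows. This assumes the tag sits at the start of each read and must match exactly; both are simplifications (many Illumina workflows read the index separately, and production demultiplexers tolerate mismatches). The function name, sample names, and reads are hypothetical; only the two tags come from the slide.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample):
    """Group reads by the barcode at their start and strip the barcode."""
    bins = defaultdict(list)
    for read in reads:
        for barcode, sample in barcode_to_sample.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched exactly.
            bins["unassigned"].append(read)
    return dict(bins)

barcodes = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}
reads = ["AGTCGTACGTTGCA", "TGAGCATTGACCGT", "AGTCGTGGGTTTAA", "CCCCCCAAAATTTT"]
result = demultiplex(reads, barcodes)
print(result)
```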
File Formats
Fasta files:
It is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
-Wikipedia
Slide credit: Aureliano Bombarely
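As an illustration of the FASTA layout, the sketch below reads records in which each entry starts with a ">" header line followed by one or more sequence lines. The function name and the example records are hypothetical; for real work an established parser such as Biopython's SeqIO is a safer choice.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

example = """>seq1 hypothetical example
ACGTACGT
TTGA
>seq2
MKV
""".splitlines()
records = list(parse_fasta(example))
print(records)
```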
Fastq files:
FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
-Wikipedia
• Single line ID with an at symbol (“@”) in the first column.
• Sequences can span multiple lines after the ID line.
• Single line with a plus symbol (“+”) in the first column marks the start of the quality block.
• The “+” line may repeat the ID.
• Quality values can span multiple lines after the “+” line, but their total length is identical to that of the sequence.
Slide credit: Aureliano Bombarely
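The FASTQ rules can be sketched as a small reader. Because quality lines may themselves begin with "@", this sketch decides where a record ends by counting quality characters until they match the sequence length, rather than by looking for the next "@". The function name and example reads are hypothetical, and the sketch assumes no blank lines inside a record.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) tuples, allowing wrapped records."""
    it = (line.rstrip("\n") for line in lines)
    for line in it:
        if not line:
            continue
        if not line.startswith("@"):
            raise ValueError(f"expected '@' ID line, got: {line!r}")
        read_id = line[1:]
        seq_parts = []
        for line in it:  # sequence lines run until the '+' line
            if line.startswith("+"):
                break
            seq_parts.append(line)
        seq = "".join(seq_parts)
        qual = ""
        while len(qual) < len(seq):  # quality lines run until lengths match
            chunk = next(it, None)
            if chunk is None:
                raise ValueError(f"truncated quality block for {read_id}")
            qual += chunk
        if len(qual) != len(seq):
            raise ValueError(f"quality and sequence lengths differ for {read_id}")
        yield read_id, seq, qual

example = ["@read1", "ACGT", "TTGA", "+read1", "IIII", "@@II", "@read2", "GG", "+", "!!"]
fastq_records = list(parse_fastq(example))
print(fastq_records)
```

Most current instruments write each record on exactly four lines, in which case the wrapped-line handling is simply never triggered.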
Quality control: Encoding
Fastq files:
!"#$%&'()*+,-./0123456789 Offset by 33 (Phred+33)
KLMNOPQRSTUVWXYZ[\]^_`abcdefgh Offset by 64 (Phred+64)
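Decoding a quality string under either offset amounts to subtracting the offset from each character's ASCII code. A minimal sketch, with a hypothetical function name; it does not try to guess which encoding a file uses, and it treats a negative result as a sign that the wrong offset was chosen.

```python
def decode_qualities(quality_string, offset=33):
    """Convert an ASCII-encoded quality string to a list of Phred scores."""
    scores = [ord(ch) - offset for ch in quality_string]
    if min(scores, default=0) < 0:
        raise ValueError("negative score: wrong offset for this quality string?")
    return scores

print(decode_qualities("!5I"))            # [0, 20, 40]
print(decode_qualities("Kh", offset=64))  # [11, 40]
```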
Quality control: Encoding
http://en.wikipedia.org/wiki/Phred_quality_score
Phred score of a base is:
Qphred = -10 log10(e)
where e is the estimated error probability of a base
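The Phred relation can be checked numerically in both directions. The two helper names below are hypothetical; the arithmetic follows directly from Qphred = -10 log10(e).

```python
import math

def phred_from_error(e):
    """Phred score for an estimated per-base error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)

def error_from_phred(q):
    """Estimated per-base error probability for a Phred score q."""
    return 10 ** (-q / 10)

for q in (10, 20, 30, 50):
    e = error_from_phred(q)
    print(f"Q{q}: error probability {e:g}, accuracy {100 * (1 - e):.3f}%")
```

Q50 corresponds to an error probability of 1 in 100,000, which is the 99.999% accuracy quoted earlier for first generation sequencing.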
Pre-processing: Tools
Quality control and trimming
• FastQC
• FASTX toolkit
• Trimmomatic
• Scythe
Joining paired-end reads
• fastq-join
• FLASH
• PANDAseq