Surya Saha
Sol Genomics Network (SGN)
Boyce Thompson Institute, Ithaca, NY
[email protected] // @SahaSurya
BTI PGRP Internship Program 2015
http://www.acgt.me/blog/2015/3/7/next-generation-sequencing-must-die

Sequencing and Bioinformatics PGRP Summer 2015


Hello Experiment!

• Experimental design for survey
  – Sample size
  – Locations
  – Phenotypes

Figure: Early blight infected tomato plants (http://www.longislandhort.cornell.edu/vegpath/photos/early_blight.htm)

6/11/2015 BTI PGRP Summer Internship Program 2015

Hello Experiment!

• Experimental design to identify genetic differences
  – PCR-based
    • Simple Sequence Repeats
    • Other markers
  – Sequencing-based
    • Genes of interest
    • Single Nucleotide Polymorphisms
    • Gene expression
    • Genotyping by Sequencing

Why Sequencing?

• Targeted interrogation of genome

• Economical

• Technological developments

• High-throughput assays

• But requires subsequent validation


Sequencing: Then and Now

Figure: timeline of sequencing milestones, 1953 to 2014.
• 1953: DNA structure discovery
• 1977: Sanger DNA sequencing by chain-terminating inhibitors
• 1984: Epstein-Barr virus (170 Kb)
• 1987: ABI 370 sequencer
• 1995: Haemophilus influenzae (1.83 Mb)
• 2001: Homo sapiens (3.0 Gb)
• 2014: Pinus taeda (24 Gb)
• Platforms shown along the timeline: 454, Solexa, SOLiD, Illumina, Ion Torrent, PacBio, Illumina HiSeq X, Nanopore MinION

Slide design credit: Aureliano Bombarely

First generation sequencing

Sanger. Annu Rev Biochem. 1988;57:1-28.

Thanks to Nick Loman for the mention

Maxam-Gilbert method (1977)

Figure: Maxam-Gilbert sequencing (http://en.wikipedia.org/wiki/File:Maxam-Gilbert_sequencing_en.svg)

https://www.nationaldiagnostics.com/electrophoresis/article/maxam-gilbert-sequencing

Sanger method (1977)

Frederick Sanger (13 Aug 1918 – 19 Nov 2013)

Won the Nobel Prize for Chemistry in 1958 and 1980. Published the dideoxy chain termination method or “Sanger method” in 1977.

http://dailym.ai/1f1XeTB

Figures:
http://en.wikipedia.org/wiki/File:Sanger-sequencing.svg
http://en.wikipedia.org/wiki/File:Radioactive_Fluorescent_Seq.jpg

First generation sequencing

• Very high quality sequences (99.999% or Q50)
• Very low throughput

                                   Run Time   Read Length   Reads / Run   Total nucleotides sequenced   Cost / MB
Capillary Sequencing (ABI3730xl)   20m-3h     400-900 bp    96 or 384     1.9-84 Kb                     $2400

http://www.hindawi.com/journals/bmri/2012/251364/tab1/

Next generation sequencing

https://twitter.com/kbradnam/status/443153578429923328

• Second generation
• Third generation
• Fourth generation
• Next-next-generation
• Next-next-next generation

Mention the specific technology used to generate the data
– Illumina HiSeq/MiSeq/NextSeq
– Pacific Biosciences RS1/RSII
– Ion Torrent Proton/PGM
– SOLiD
– Oxford Nanopore MinION

http://www.acgt.me/blog/2015/3/10/next-generation-sequencing-must-diepart-2

454 Pyrosequencing

One purified DNA fragment, to one bead, to one read.

http://www.genengnews.com/

GS FLX Titanium

https://mariamuir.com/wp-content/uploads/2013/04/rip.gif

Illumina

                              MiSeq        NextSeq 500       HiSeq 2500/3000/4000        HiSeq X
Output                        0.3-15 Gb    20-120 GB         10-1500 GB                  900-1800 GB
Number of Reads / Flow cell   25 Million   130-400 Million   300 Million - 2.5 Billion   3 Billion
Read Length                   2x300 bp     2x150 bp          2x250 - 2x125 bp            2x150 bp
Cost                          $99K         $250K             $740K                       $10M (10 units)

Source: Illumina

$1000 human genome??

Illumina

Figure from Mardis 2008. Annu. Rev. Genomics Hum. Genet. 2008. 9:387–402

Pacific Biosciences SMRT sequencing

Single Molecule Real Time sequencing

http://smrt.med.cornell.edu/images/pacbio_library_prep-1.gif

Pacific Biosciences SMRT sequencing: Error correction methods

Hierarchical genome-assembly process (HGAP)

PBJelly (English et al., PLOS One. 2012)

PBcR Pipeline

Pacific Biosciences SMRT sequencing: Read Lengths

http://www.igs.umaryland.edu/labs/grc/

Mean Read Length: 8391 bp
Maximum Subread Length: 24585 bp

Genome Assembly with Long Reads

Oxford Nanopore

https://www.nanoporetech.com/
http://erlichya.tumblr.com/post/66376172948/hands-on-experience-with-oxford-nanopore-minion
http://halegrafx.com/vector-art/free-vector-despicable-me-minions/
https://theconversation.com/how-a-small-backpack-for-fast-genomic-sequencing-is-helping-combat-ebola-41863

Sequencing Trends

https://www.google.com/trends/

Figure: Number of Publications per year, 2008 to 2014 (Illumina, Pacific Biosciences, Roche 454, Ion Torrent)
Figure: Increase in Number of Publications, 2009 to 2014 (Illumina, Pacific Biosciences, Roche 454, Ion Torrent)
Figure: % Increase in Number of Publications, 2009 to 2014 (Pacific Biosciences, Roche 454, Ion Torrent)

Hi-C Crosslinking

Others

• Ion Torrent Proton/PGM

• SOLiD

• Helicos

• Supporting technologies
– BioNano
– Nabsys
– OpGen
– 10X Genomics
– Fluidigm

Comparison

Next generation sequencing

                      Run Time      Read Length      Quality                       Total nucleotides sequenced   Cost / MB
454 Pyrosequencing    24h           700 bp           Q20-Q30                       1 GB                          $10
Illumina MiSeq        27h           2x300 bp         >Q30                          15 GB                         $0.15
Illumina HiSeq 2500   1 - 10 days   2x250 bp         >Q30                          3000 GB                       $0.05
Ion Torrent           2h            400 bp           >Q20                          50 MB - 1 GB                  $1
Pacific Biosciences   30m - 4h      10 kb - >40 kb   >Q50 consensus, >Q10 single   500 - 1000 MB / SMRT cell     $0.13 - $0.60

http://www.hindawi.com/journals/bmri/2012/251364/
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431227
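To put the Cost / MB column in perspective, the following is a rough sketch that multiplies it out for a whole genome. It assumes "MB" means megabases, uses the 3.0 Gb human genome size from the timeline slide, and picks a 30x sequencing depth purely as an illustrative assumption. The function name `sequencing_cost` is hypothetical.

```python
# Cost per megabase (USD) as listed in the comparison table.
# Where the table gives a range, both ends are kept as (low, high).
COST_PER_MB = {
    "454 Pyrosequencing": (10.0, 10.0),
    "Illumina MiSeq": (0.15, 0.15),
    "Illumina HiSeq 2500": (0.05, 0.05),
    "Ion Torrent": (1.0, 1.0),
    "Pacific Biosciences": (0.13, 0.60),
}


def sequencing_cost(genome_size_mb, coverage, cost_range):
    """Return (low, high) raw sequencing cost in USD for one genome."""
    total_mb = genome_size_mb * coverage  # megabases that must be sequenced
    low, high = cost_range
    return (total_mb * low, total_mb * high)


if __name__ == "__main__":
    genome_mb = 3000  # Homo sapiens, 3.0 Gb
    coverage = 30     # assumed depth, not taken from the slides
    for platform, cost_range in COST_PER_MB.items():
        low, high = sequencing_cost(genome_mb, coverage, cost_range)
        print(f"{platform:22s} ${low:,.0f} - ${high:,.0f}")
```

These figures cover raw sequencing only; library preparation, storage and analysis are not included.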

Next Generation Genomics: World Map of High-throughput Sequencers

http://omicsmaps.com/

https://flxlexblog.wordpress.com/2014/06/11/developments-in-next-generation-sequencing-june-2014-edition/

Real cost of Sequencing!!

Sboner, Genome Biology, 2011

Sequencing Data and Concepts

Library Types

• Single end
• Paired end (PE, 150-800 bp, Fwd:/1, Rev:/2)
• Mate pair (MP, 2Kb to 20 Kb)

Figure: forward (F) and reverse (R) read orientation for each library type, with 454/Roche and Illumina layouts.

Slide credit: Aureliano Bombarely
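The /1 and /2 suffixes on paired-end read IDs can be used to match the two mates of a fragment. The sketch below is a minimal illustration; the function name `pair_reads` and the read IDs in the usage note are hypothetical. It handles only the "/1" and "/2" suffix convention named on the slide.

```python
def pair_reads(read_ids):
    """Group read IDs that share a name and differ only in the /1 or /2 suffix."""
    mates_by_name = {}
    for read_id in read_ids:
        name, _, mate = read_id.rpartition("/")
        if mate not in ("1", "2") or not name:
            continue  # not a paired-end style ID; ignored in this sketch
        mates_by_name.setdefault(name, {})[mate] = read_id
    # Keep only the names for which both the forward and reverse mate were seen.
    return {
        name: (mates["1"], mates["2"])
        for name, mates in mates_by_name.items()
        if len(mates) == 2
    }
```

For example, `pair_reads(["frag7/1", "frag7/2", "frag9/1"])` keeps only `frag7`, because `frag9` has no reverse mate.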

Implications of Choice of Library

Figure: assembly hierarchy.
• Reads
• Consensus sequence (Contig)
• Scaffold (or Supercontig): contigs joined using pair read information, gaps shown as NNNNN
• Pseudomolecule (or ultracontig): scaffolds ordered using genetic information (markers) or optical maps

Slide credit: Aureliano Bombarely
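The scaffold step can be illustrated with a small sketch: contigs whose order and spacing are known from paired reads are joined, and each gap is filled with N characters. The function name `build_scaffold` and the example sequences are hypothetical. Estimating the gap sizes from real read pairs is outside the scope of this sketch.

```python
def build_scaffold(contigs, gap_sizes):
    """Join ordered contigs into one scaffold, filling each gap with N characters."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between two contigs
        parts.append(contig)
    return "".join(parts)
```

For example, `build_scaffold(["ACGT", "GGCC"], [5])` gives `ACGTNNNNNGGCC`.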

Multiplexing Libraries

Use of different tags (4-6 nucleotides) to identify different samples in the same lane/sector.

Figure: reads from two samples carrying the tags AGTCGT and TGAGCA, pooled for sequencing.

Slide credit: Aureliano Bombarely
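After sequencing, the pooled reads have to be sorted back into samples by their tag. The sketch below shows one simple way to do that, assuming the tag sits at the start of each read and must match exactly. Real demultiplexing tools may read the tag separately or tolerate mismatches. The function name `demultiplex` and the sample names are hypothetical; the two tags are the ones shown on the slide.

```python
def demultiplex(reads, barcodes):
    """Assign each read to a sample by an exact match of its leading tag.

    reads    -- iterable of read sequences (strings)
    barcodes -- dict mapping tag sequence to sample name
    Returns a dict of sample name -> list of reads with the tag removed.
    Reads with no matching tag are collected under "unassigned".
    """
    bins = {sample: [] for sample in barcodes.values()}
    bins["unassigned"] = []
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            # No tag matched this read.
            bins["unassigned"].append(read)
    return bins


if __name__ == "__main__":
    tags = {"AGTCGT": "sample_A", "TGAGCA": "sample_B"}
    pooled = ["AGTCGTAAAA", "TGAGCACCCC", "GGGGGGGGGG"]
    print(demultiplex(pooled, tags))
```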

File Formats

Fasta files:

It is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
-Wikipedia

Slide credit: Aureliano Bombarely
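As an illustration of the format, here is a minimal FASTA reader. It relies on the standard convention that each record starts with a header line beginning with ">" and that the sequence may be wrapped over several lines. The function name `parse_fasta` is hypothetical. For real work a library such as BioPython is the usual choice.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)
```

It can be fed an open file handle directly, for example `for name, seq in parse_fasta(open("genes.fasta")):`, where the file name is only a placeholder.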

File Formats

Fastq files:

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.
-Wikipedia

• Single line ID with at symbol (“@”) in the first column.
• Sequences can be in multiple lines after the ID line.
• Single line with plus symbol (“+”) in the first column to mark the start of the quality values.
• The “+” line may repeat the ID.
• Quality values can span multiple lines after the “+” line, but their total length is identical to the sequence length.

Slide credit: Aureliano Bombarely
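The rules above can be sketched as a small reader. This version assumes the common four-line layout (ID, sequence, "+", quality), which is a simplification: it does not handle sequences or qualities wrapped over several lines. The function name `parse_fastq` is hypothetical.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if not line and not record:
            continue  # skip blank lines between records
        record.append(line)
        if len(record) == 4:
            header, seq, plus, qual = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {header!r}")
            if len(seq) != len(qual):
                raise ValueError(f"sequence/quality length mismatch in {header!r}")
            yield header[1:], seq, qual
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")
```

The length check mirrors the last rule on the slide: every base must have exactly one quality character.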

Quality control: Encoding

Fastq files:

!"#$%&'()*+,-./0123456789 Offset by 33 (Phred+33)

KLMNOPQRSTUVWXYZ[\]^_`abcdefgh Offset by 64 (Phred+64)
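Decoding a quality string means subtracting the offset from each character's ASCII code. The sketch below shows this, together with a simple heuristic for guessing the offset from the characters present. Both function names are hypothetical. The guess can be wrong for short or uniformly high-quality Phred+33 data, so treat it as a hint only.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores."""
    scores = [ord(char) - offset for char in quality_string]
    if min(scores, default=0) < 0:
        raise ValueError("negative score: wrong offset for this quality string?")
    return scores


def guess_offset(quality_string):
    """Guess the encoding offset from the characters present (heuristic only)."""
    lowest = min(ord(char) for char in quality_string)
    # Characters below '@' (ASCII 64) cannot occur in Phred+64 data.
    return 33 if lowest < 64 else 64
```

For example, `decode_quality("!5I")` gives `[0, 20, 40]` with the default offset of 33, and `decode_quality("@Th", offset=64)` gives the same scores.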

Quality control: Encoding

http://en.wikipedia.org/wiki/Phred_quality_score

Phred score of a base is: Qphred = -10 log10(e)

where e is the estimated probability of a base being wrong
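The formula and its inverse are short enough to write out directly. This is a minimal sketch with hypothetical function names; it simply restates Qphred = -10 log10(e) and solves it for e.

```python
import math


def phred_from_error(error_probability):
    """Q = -10 * log10(e), where e is the probability the base call is wrong."""
    if not 0 < error_probability <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(error_probability)


def error_from_phred(q):
    """Inverse of the formula above: e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)
```

Q20 corresponds to 1 error in 100 bases, Q30 to 1 in 1,000, and Q50 to 1 in 100,000, which is the 99.999% accuracy quoted for first generation sequencing.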

Pre-processing: Tools

Trimming

• FastQC

• FASTX toolkit

• Trimmomatic

• Scythe

Joining paired-end reads

• fastq-join

• FLASH

• PANDAseq
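To show what quality trimming does in principle, here is a deliberately simple sketch that removes low-quality bases from the 3' end of a read. It is not the algorithm used by any of the tools listed above, which offer more elaborate strategies. The function name `trim_3prime` and the default threshold of 20 are assumptions for illustration.

```python
def trim_3prime(sequence, qualities, min_quality=20):
    """Drop bases from the 3' end until one meets the quality threshold.

    sequence  -- read sequence (string)
    qualities -- list of integer Phred scores, one per base
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]
```

For example, a read `ACGTAC` with scores `[30, 30, 25, 20, 10, 5]` is trimmed to `ACGT`.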


Sequencing done! Now What??

• 1 HiSeq run can produce up to 1500GB or 1.5TB of data
• How much is 250GB of data?
– 250,000,000,000 characters
– 3000 characters per sheet
– 100 sheets / cm
– Stack of ~8000m

Mount Everest - 8848m
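The arithmetic behind the paper stack can be checked with a few lines. The function name `paper_stack_height_m` is hypothetical; the default values are the ones given on the slide.

```python
def paper_stack_height_m(n_characters, chars_per_sheet=3000, sheets_per_cm=100):
    """Height in metres of a paper stack holding n_characters of printed text."""
    sheets = n_characters / chars_per_sheet
    height_cm = sheets / sheets_per_cm
    return height_cm / 100  # centimetres to metres


if __name__ == "__main__":
    print(round(paper_stack_height_m(250_000_000_000)))    # 250 GB
    print(round(paper_stack_height_m(1_500_000_000_000)))  # 1.5 TB
```

With these numbers 250 GB comes to about 8,333 m, and a full 1.5 TB run to about 50,000 m.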

Increase in Sequencing Data

L. Stein, Genome Biology, 2010

Slide credit: Lukas Mueller

Big Data

Slide credit: Lukas Mueller

High Performance Computing

Powerful servers with large amounts of memory, compute cores, and disk


What is bioinformatics?

Bioinformatics /baɪ.oʊˌɪnfərˈmætɪks/ is the application of computer science and information technology to the field of biology and medicine.

Slide credit: Lukas Mueller

Bioinformatics deals with

Algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, software engineering, data mining, image processing, modeling and simulation, signal processing, discrete mathematics, control and system theory, circuit theory, and statistics.

Generation of new knowledge in biology and medicine, and improving & discovering new models of computation (e.g. DNA computing, neural computing, evolutionary computing, immuno-computing, swarm-computing, cellular-computing).

Slide credit: Lukas Mueller

Bioinformatics can...

• Identify similar sequences
• Provide a putative function for a sequence
• Assemble sequences (genomes, transcriptomes)
• Annotate genomes
• Identify differentially expressed genes
• Build networks of genes or metabolites
• Determine phylogenetic relationships
• Mine literature for biological information
• Uncover differences between two genomes
• Calculate how a protein folds

Slide credit: Lukas Mueller

What can bioinformatics do for me?

Majority of projects involve large datasets

Speed up your research

Enable you to ask new questions

Basic knowledge of bioinformatics needed

Extract information

Transform information

Run analyses

Build hypotheses, etc.

Slide credit: Lukas Mueller

Linux

UNIX-based, free and open source operating system
Very stable, easy to use
Created by Linus Torvalds in 1990s as a student
Adopted for most bioinformatics work
Also: installed on cell phones, laptops, desktops, clusters, supercomputers
Can run on your computer! Virtualized or native

http://www.linux-netbook.com/linux/distributions/

Slide credit: Lukas Mueller

Further Reading

Plant Bioinformatics Course
• Virtual machine setup instructions
• Slides for Linux, Sequencing, RNAseq, NGS Read Mapping and R graphics
• http://btiplantbioinfocourse.wordpress.com

Slide credit: Lukas Mueller

Scripting

Scripts: Small programs written by the end-user that control the execution of other programs or perform a simple algorithm

Extremely flexible

Written in Shell, Perl, Python

You can write them yourself!!!

Slide credit: Lukas Mueller

Perl

Developed since 1980s by Larry Wall

Useful for bioinformatics and web development

Support for objects

Excellent integration of regular expressions (text handling language)

Vast open source code library (http://cpan.org/)

BioPerl (http://bioperl.org/)

Easy to learn

http://www.perl.org/

Slide credit: Lukas Mueller

Python

Created by Guido van Rossum in 1989

Very elegant language

BioPython libraries

The “new” popular language

Many frameworks (Django for web etc.)

Slide credit: Lukas Mueller

R

Language designed for statistics

Support for matrix calculations, graphics

Expression analysis, Next-Gen sequence analysis, Graphics, genome annotation statistics, phylogeny

Interactive

Slide credit: Lukas Mueller

Databases

Need to store and query data

Biological data is highly structured

Relational database systems

Non-relational systems
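As a small illustration of storing and querying structured data in a relational system, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name `run`, its columns and both function names are hypothetical; the three rows reuse read lengths from the comparison table.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small illustrative table of platforms."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE run (id INTEGER PRIMARY KEY, platform TEXT, read_length INTEGER)"
    )
    conn.executemany(
        "INSERT INTO run (platform, read_length) VALUES (?, ?)",
        [("Illumina MiSeq", 300), ("454 Pyrosequencing", 700), ("Ion Torrent", 400)],
    )
    conn.commit()
    return conn


def platforms_with_reads_longer_than(conn, min_length):
    """Return platform names whose read length exceeds min_length, sorted by name."""
    rows = conn.execute(
        "SELECT platform FROM run WHERE read_length > ? ORDER BY platform",
        (min_length,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    connection = build_demo_db()
    print(platforms_with_reads_longer_than(connection, 350))
```

The query is declarative: it states which rows are wanted and leaves the lookup strategy to the database engine.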

Slide credit: Lukas Mueller

Thank you!!
