High-throughput sequencing technologies in genome assembly Hans Jansen

20150601 bio sb_assembly_course


Page 1: 20150601 bio sb_assembly_course

High-throughput sequencing technologies in genome assembly

Hans Jansen

Page 2: 20150601 bio sb_assembly_course

Dutch SME at the Bioscience Park in Leiden, the Netherlands

• High-throughput drug screens and toxicity assays in zebrafish larvae

• Fish fertility (eel, pike perch, sole) to aid sustainable aquaculture

• Sequencing (genomes, transcriptomes)

• Bioinformatics

ZF-screens B.V.

Page 3: 20150601 bio sb_assembly_course

Common carp (Cyprinus carpio)
High-throughput screening model
Genome and transcriptomes

European and Japanese eel (Anguilla anguilla and Anguilla japonica)
Completing the life cycle in aquaculture
Genome and transcriptomes

King cobra (Ophiophagus hannah)
Evolution and toxins
Genome and transcriptomes

Some examples of genome projects

Page 4: 20150601 bio sb_assembly_course

Chemical cleavage (Maxam and Gilbert)
Chain termination (Sanger, Nicklen, and Coulson)

Throughput: 5 samples, 1 Kb/day, micrograms of ssDNA needed

1977 2000 2011

Massively parallel signature sequencing (Brenner)

SMRT (Pacific Biosciences)

Throughput: 3×10^9 samples, 55 Gb/day, single molecule of DNA needed

A brief history of DNA sequencing

Page 5: 20150601 bio sb_assembly_course

A brief history of DNA sequencing

February 1977: Maxam and Gilbert
Chemical cleavage: modify nucleotides and cut at the modified position.

December 1977: Sanger, Nicklen, and Coulson
Chain termination: use modified nucleotides to stop the extension of a newly synthesized DNA strand.

Page 6: 20150601 bio sb_assembly_course

A brief history of DNA sequencing

Maxam and Gilbert sequencing was abandoned relatively soon. It was technically complex and relied on hazardous chemicals and radioactivity.

The Sanger sequencing method was improved over the years and was the method of choice to sequence the first draft of a human genome.
• Thermostable polymerases alleviated the need for ssDNA template.
• Fluorescent dye terminators combined all four reactions in one.
• The separation of the DNA fragments was automated.

Shotgun sequencing was already used by Sanger to sequence lambda DNA and proved to be a powerful tool to sequence and assemble larger DNA molecules and even whole genomes.

Page 7: 20150601 bio sb_assembly_course

A brief history of DNA sequencing

To make assembly easier, partially overlapping BAC clones from the genome were first selected and then sequenced and assembled by the shotgun method.

gDNA

BAC

This was a laborious method, and later a whole genome shotgun approach was used.

Page 8: 20150601 bio sb_assembly_course

A brief history of DNA sequencing

Page 9: 20150601 bio sb_assembly_course

Genomic DNA

1. Break the DNA into < 1 Kb fragments
2. Polish the ends of the DNA and adenylate them (3’ A overhang)
3. Ligate adapters (with a T overhang) to the ends of the DNA
4. Amplify the paired end library
5. Bind the ss-library to the flowcell

Making a paired end library

Page 10: 20150601 bio sb_assembly_course

Attach and cluster the library on a carrier

Page 11: 20150601 bio sb_assembly_course

Sequence the library

Page 12: 20150601 bio sb_assembly_course

2 x 50 bp

Generate large fragments by shearing, and label the ends with biotin (green dash).

Self ligate the fragments in a large volume, and shear the circular fragments (black dash).

Isolate the biotinylated fragments, convert them to a paired end library and sequence them (red arrows).

Problem: part of these fragments have unconvertible ends.

Problems: larger fragments will self ligate inefficiently. Nicks in the DNA will enable digestion of circularized molecules.

These problems limit the libraries to ~10 kb insert size, and the libraries tend to contain a low number of unique fragments.

Obtaining scaffolding information: mate pairs

Page 13: 20150601 bio sb_assembly_course

Generate large fragments by shearing, isolate ~39 kb fragments and clone them in an adapted fosmid vector that contains insert-flanking EcoP15I sites (purple dash).

Cut with EcoP15I, which cleaves about 26 bp into the insert on each side, end repair the fragments and self ligate.

PCR the diTag library from these fragments, and sequence the 52 bp inserts.

Problem: These large fragments will ligate inefficiently in the fosmid vector leading to low complexity libraries.

Obtaining scaffolding information: Fosmid diTags

Page 14: 20150601 bio sb_assembly_course

Library  Insert        Reads           Gbp    Coverage  Span
PE200    <155 bp       2 × 76 nt       21.9   14.6×     -
PE280    230–305 bp    2 × 151 nt      11.0   7.3×      -
PE500    370–485 bp    2 × 50–151 nt   19.3   12.9×     1.2×
MP2K     1.6–2.4 Kbp   2 × 36 nt       5.4    -         4.5×
MP7K     4–6 Kbp       2 × 51 nt       2.3    -         0.6×
MP10K    6.5–10 Kbp    2 × 51 nt       5.3    -         7.7×
MP15K    9–13 Kbp      2 × 51 nt       3.8    -         8.8×
Total                                  69     34.8×     22.9×

King cobra sequence data
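The coverage of a library is its yield divided by the genome size. The slides do not state the genome size, so the sketch below infers it from the paired end rows of the table; the value of about 1.5 Gbp is an inference, not a number from the talk, and the function name is only illustrative.

```python
# Yield (Gbp) and coverage (x) for the paired end libraries, taken from the table.
paired_end = {
    "PE200": (21.9, 14.6),
    "PE280": (11.0, 7.3),
    "PE500": (19.3, 12.9),
}


def implied_genome_size(yield_gbp, coverage):
    """Genome size (Gbp) implied by coverage = yield / genome size."""
    return yield_gbp / coverage


sizes = {name: implied_genome_size(y, c) for name, (y, c) in paired_end.items()}

# Assumption: one shared genome size, rounded from the per-library estimates.
GENOME_SIZE_GBP = 1.5
total_yield = sum(y for y, _ in paired_end.values())
total_coverage = total_yield / GENOME_SIZE_GBP

for name, size in sizes.items():
    print(f"{name}: implied genome size {size:.2f} Gbp")
print(f"total paired end coverage: {total_coverage:.1f}x")
```

With a genome size of 1.5 Gbp the three paired end libraries add up to the 34.8× total coverage given in the table.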

Page 15: 20150601 bio sb_assembly_course

Read merging

If the two reads of a paired end fragment overlap, they can be merged into a single longer read.

• We used our own script, since no tools were available at the time

• Now there are a number of tools: FLASH, SHERA, SeqPrep

• Paired end libraries need to be prepared with the read length in mind, and size selected as narrowly as possible.
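A minimal sketch of read merging, to make the idea concrete. This is not the script used for the cobra assembly and not how FLASH, SHERA or SeqPrep work internally; it ignores base qualities, and the parameter names and thresholds (`min_overlap`, `max_mismatch_frac`) are assumptions chosen for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge read1 with the reverse complement of read2.

    Tries the longest possible overlap first and accepts the first overlap
    whose mismatch fraction is low enough. Returns None if no overlap is found.
    """
    r2 = reverse_complement(read2)
    for olen in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        tail, head = read1[-olen:], r2[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            # Keep read1 in full and append the non-overlapping part of read2.
            return read1 + r2[olen:]
    return None


# Toy example: a 30 bp fragment sequenced with 2 x 20 nt reads (10 nt overlap).
fragment = "ACGTTGCAAGGCTTACCGATGCATTGACCA"
read1 = fragment[:20]
read2 = reverse_complement(fragment[10:])
merged = merge_pair(read1, read2)
```

If the insert is longer than the sum of both read lengths there is no overlap and the pair cannot be merged, which is why the library has to be size selected with the read length in mind.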

~600 bp

~270 bp

Page 16: 20150601 bio sb_assembly_course

[Figure: % of the assembly versus fragment size (bp, log scale from 10^2 to 10^6) as libraries are added: + 500 bp, + 2 Kbp, + 7 Kbp, + 10 Kbp, + 15 Kbp]

Assembly (cobra)

Page 17: 20150601 bio sb_assembly_course

Contigs

N50 3982 bp

largest 70 Kbp

number 1186408

Total length 1.45 Gbp

Scaffolds

N50 226 Kbp

largest 2.84 Mbp

number 716551

Total length 1.66 Gbp

number of genes 22183

King cobra sequence assembly
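For reference, N50 is the length of the contig (or scaffold) at which the cumulative length, counted from the longest sequence down, first reaches half of the total assembly length. A small sketch with made-up lengths (not the cobra data):

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that takes us to half the total length.
        if running * 2 >= total:
            return length
    return 0


# Toy example: total length 32, so half is 16; 8 + 8 reaches it.
example_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

This is why scaffolding raises the N50 so strongly: joining many short contigs moves most of the total length into a few long sequences.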

Page 18: 20150601 bio sb_assembly_course

Genome Res. 2007 17: 240-248

This is a method to sequence (a small) part of a genome, and to do this for multiple siblings. From the sequence data SNPs can be identified and used as markers to build a genetic map of this genome.

Analysis of the spotted gar genome cut with SbfI in the parents and 94 individuals from their progeny produced 8406 markers in 29 linkage groups.

Generating a RAD-tag genetic map
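To illustrate why only a small part of the genome is sequenced: RAD tags are the sequences directly next to the restriction sites. The sketch below scans a sequence for the SbfI recognition site (CCTGCAGG) and returns the flanking bases. It is a simplification: real RAD reads start within the cut site, and the toy sequence and `tag_len` are assumptions.

```python
SBFI_SITE = "CCTGCAGG"  # palindromic, so scanning one strand finds all sites


def find_sites(seq, site=SBFI_SITE):
    """Return the start positions of all occurrences of the site."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def rad_tags(seq, tag_len=10, site=SBFI_SITE):
    """Return (upstream, downstream) flanking sequence for every site."""
    tags = []
    for pos in find_sites(seq, site):
        upstream = seq[max(0, pos - tag_len):pos]
        downstream = seq[pos + len(site):pos + len(site) + tag_len]
        tags.append((upstream, downstream))
    return tags


# Toy sequence with two SbfI sites.
toy = "AAAA" + SBFI_SITE + "TTTTGGGG" + SBFI_SITE + "ACAC"
sites = find_sites(toy)
tags = rad_tags(toy, tag_len=4)
```

Because every individual is cut at the same sites, the same tags are sequenced in the parents and the progeny, and SNPs in those tags can be followed as markers.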

Page 19: 20150601 bio sb_assembly_course

From Baird, PLoS ONE 2008

This can be done with multiple samples when using barcodes

After adding the barcodes all samples can be pooled to reduce workload

Pools of short fragments from different individuals.

Generating a RAD-tag genetic map
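After sequencing, the pooled reads have to be assigned back to the individuals. A minimal sketch, assuming in-line barcodes at the start of each read, exact matching, and barcodes of equal length; the barcode sequences and sample names are made up.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Split pooled reads by barcode.

    barcodes maps a barcode sequence to a sample name. The barcode is
    trimmed from assigned reads; unmatched reads go to 'unassigned'.
    """
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            by_sample["unassigned"].append(read)
    return dict(by_sample)


barcodes = {"ACGT": "parent_1", "TGCA": "offspring_1"}
pooled = ["ACGTGGGG", "TGCACCCC", "NNNNAAAA"]
assigned = demultiplex(pooled, barcodes)
```

In practice barcodes are usually designed to differ at several positions, so that a single sequencing error does not assign a read to the wrong individual.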

Page 20: 20150601 bio sb_assembly_course

Amores A et al. Genetics 2011;188:799-808

Generating a RAD-tag genetic map

Page 21: 20150601 bio sb_assembly_course

Long DNA molecules, fluorescently labeled at specific sites, are linearized in nanochannels and imaged. The fluorescent fingerprints of each molecule can be assembled and linked to contigs and scaffolds.

Optical mapping: BioNano Genomics

Gabino Sanchez-Perez’s lecture at 15.00 hrs will explain this in much more detail and show some great examples of how to use this technology.

Page 22: 20150601 bio sb_assembly_course

Just a genome is usually not the goal of a de novo sequencing project.

Based on the general structure of a gene, gene predictions can be made.

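[Figure: general structure of a gene. Exons are separated by introns. Labeled features: start codon (ATG), stop codon (STOP), polyadenylation signal, splice donor site, branch site (with a distance of 20-50 bases marked), pyrimidine-rich tract and splice acceptor site.]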

RNAseq reads can help validate predictions

Annotation of the genome

Page 23: 20150601 bio sb_assembly_course

Different flavors of RNAseq

• Stranded dUTP RNAseq: a simple modification of the standard prep gives information on the strandedness of the transcript.

• RNAseq with minimal quantities of RNA: a great tool to look at small numbers of (FACS sorted) cells

• CAGE: ideal to find transcription start sites

• smallRNA: to explore the miRNA content of a sample

Transcriptome sequencing

Page 24: 20150601 bio sb_assembly_course

Disadvantages of next generation sequencing:

• Complex sample preparation including PCR amplification.
• High run costs.
• Long run times.
• Short reads

Changes needed:

• Single molecule analysis
• Reading sequences at a high speed
• Highly parallel
• Long reads >10kb
• No errors

Long reads: what do we want?

Page 25: 20150601 bio sb_assembly_course

Pacific Biosciences PacBio RS II
Available since 2010

Oxford Nanopore Technologies MinION
Available since 2014

Generating long reads

Page 26: 20150601 bio sb_assembly_course

Pacific Biosciences PacBio RSII

It uses a zero mode waveguide to measure fluorescence in a very small volume.

Page 27: 20150601 bio sb_assembly_course

Ligate hairpin adapters

Fragment gDNA and polish ends, and add adenosine.

Attach polymerase, load on SMRT cell and sequence

DNA polymerase

Transparent bottom of zero mode waveguide

Pacific Biosciences

Page 28: 20150601 bio sb_assembly_course

Pacific Biosciences P6-C4

• Yield 0.5-1 Gbp/SMRT cell.

• Since no amplification is done you sequence the DNA as it comes out of your sample (nicks, base modifications).

• There is very little sequence bias and there are no systematic errors

Christoph Konig’s lecture at 14.15 hrs will delve much deeper into this technology.

Page 29: 20150601 bio sb_assembly_course

• Started to work on nanopore sensing in 2005
• Investments to date 180 million GBP (227 M€)
• ~200 employees
• Broad IP portfolio

• Announced products: MinION and PromethION systems

• Access program for MinION (MAP)

Oxford Nanopore Technologies

Page 30: 20150601 bio sb_assembly_course

But MAP is much more. It is about being a community and a playground to test new applications.

The last part of the development of this technology is done “in the field” in a fairly open program.

Hundreds of MinIONs were sent around the globe to see how they would behave in real life.
MAP is visible as a web portal with information from ONT and a social-media-like system with blog possibilities, comments, likes, and a forum to ask for advice.

MinION access program

Page 31: 20150601 bio sb_assembly_course

Tethering oligo

Motor protein Brake protein

hairpin

abasic nucleotides

1. Shear (optional)
2. DNA repair (optional), AmpureXP purification
3. End repair, AmpureXP purification
4. A tailing, AmpureXP purification
5. Ligation, His-tag purification, dilution in run buffer and ATP

A MuA transposase protocol is under development. This should further simplify sample preparation (10 minutes).

Library preparation

Page 32: 20150601 bio sb_assembly_course

Tethering oligo

Motor protein E5 Brake protein E3

hairpin

abasic nucleotides

Tether keeps DNA fragment on the membrane leading to a ~20K fold higher DNA concentration close to the pore.

Motor protein unwinds DNA and ratchets it through the pore.

Abasic nucleotides in the hairpin are a recognition point.

Brake protein prevents the motor protein from zipping through the complement strand.

Sequencing

Page 33: 20150601 bio sb_assembly_course

Stills taken from: https://www.nanoporetech.com/news/movies#movie-24-nanopore-dna-sequencing

Strand sequencing

ATP

Page 34: 20150601 bio sb_assembly_course

Sequence: GGCTCACTCCCATAAGCGGCTC
5-mers: GCTCA CTCAC TCACT CACTC ACTCC CTCCC

Raw data (ionic current, pA)

Events (with time domain)

Squiggle (events with time domain removed)

Sensing the DNA
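The pore senses several bases at once, so each event corresponds to a k-mer rather than a single base. A small sketch of how a sequence decomposes into overlapping 5-mers, using part of the sequence shown on this slide; the function names are only illustrative.

```python
def kmers(seq, k=5):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmers_covering(seq_len, pos, k=5):
    """Number of k-mers that contain the base at position pos (0-based)."""
    first = max(0, pos - k + 1)
    last = min(pos, seq_len - k)
    return max(0, last - first + 1)


slide_sequence = "GGCTCACTCCCATAAGCGGCTC"
# The 5-mers listed on the slide start at the second base of the sequence.
window = slide_sequence[1:11]
window_kmers = kmers(window)
```

Away from the ends of a read every base is part of five consecutive 5-mers, so it contributes to five events.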

Page 35: 20150601 bio sb_assembly_course

Squiggle plot for a complete read

First the template part in blue, then the abasic nucleotides in the hairpin in red, and finally the complement part in turquoise.

Alignment of template and complement squiggles gives a 2D read.

Squiggle plot

Page 36: 20150601 bio sb_assembly_course

MinKNOW controls the run and shows channel states ...

Interactive interface

Page 37: 20150601 bio sb_assembly_course

... and the number of events vs read length.

The Metrichor agent runs in the background to send sequence files to and from the (cloud based) base caller.

MinKNOW can interact with other software.
minoTour analyses reads in a streaming mode and can control MinKNOW.

Interactive interface

Page 38: 20150601 bio sb_assembly_course

template mean 8734 bp
complement mean 8126 bp
2D mean 9930 bp

Read length is limited by the non-nicked fragment length rather than by the system.
My longest 2D read so far: 93.5 Kbp; longest template read: 120 Kbp.

Read length distribution

Page 39: 20150601 bio sb_assembly_course

There are actually 4 wells per detection channel. QC at the beginning of the run determines the quality of the 4 wells. Sequencing starts on the best set of wells. Every 24 hrs the next best set of wells is chosen.

Yield over time

Page 40: 20150601 bio sb_assembly_course

Errors

Page 41: 20150601 bio sb_assembly_course

ref    TGATGTATATGCTCTCTTTTCTGACGTTAGTCTCCGACGGCAGGCTTCAA-TGACCC-A-GGCTGAGAAATTCCCGGACCCTTTTTGCTCAAGAGCGATG
MinION TGATGTATATGCTC----TTCTGACGTTAGCCTCCGACGGCAGGCTTCAATTGACCCGATGGCTGAGAAATTCCCGGACCC--TTTGCTACAGAGTG-T-

ref    TTAATTTGTTCAATCATTTGGTTAGGAAAGCGGATGTTGCGGGTTGTTGTTCTGCGGGTTCTGTTCTTCGTTGACATGAG---GTTGCCCCGTATTCAGT
MinION TTAATTTGTTCAATCATTTGGTTAGGAAAGCGGA---TGC-GGTTGT--TCCTGC-GGTTCTG----TCG-TGACATCCGTTATTTGCGCTGT-TACGC

ref    GTCGC-TGATTTGTATTGTCTGAAGTTGTTTTTACGTTAAGTTGATGCAGATCAATTAATACGATACCT--GCGTCATAATTGATTATTTGACGT--GGT
MinION ATGGCATGTTTTGTATTGTCTGAAGTTGTTTTTACGTTAAGTTAATGCAGATCAATTAATACGATACCTCGGCGTCATAATTGATTATTTGACGTGGGGT

The error rate lies around 15% for the current chemistry (R7.3). A typical passing 2D R7.3 read now has 2.8% deletions, 2.7% insertions and 1.7% substitutions.
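Such percentages can be counted directly from a gapped alignment. The sketch below assumes both aligned strings have equal length, use "-" for gaps, and that rates are expressed per alignment column; other tools use the reference or read length as denominator, so numbers are only comparable when the definition is the same. The toy alignment is made up.

```python
def alignment_error_rates(ref_aln, read_aln):
    """Return match, substitution, insertion and deletion rates per column."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have the same length")
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-" and read_base == "-":
            continue  # empty column, carries no information
        if ref_base == "-":
            counts["insertion"] += 1  # base present only in the read
        elif read_base == "-":
            counts["deletion"] += 1  # base missing from the read
        elif ref_base == read_base:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    columns = sum(counts.values())
    return {kind: n / columns for kind, n in counts.items()}


# Toy alignment: 7 matches, 1 substitution, 1 insertion, 1 deletion.
rates = alignment_error_rates("ACGT-ACGTA", "ACCTTAC-TA")
```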

R8 and R9 nanopores are in the pipeline (improving on G/C-rich reads and giving a better signal-to-noise ratio).

Errors

Page 42: 20150601 bio sb_assembly_course

Errors result from different parts of the system.

On the ASIC:
Events are missed by the translation from raw data to event data.
Solution: sharpen up the raw data by playing with voltage and by new nanopores with lower noise. Sequence faster.

In the base caller:
Bases outside the observed k-mer influence the current.
Solution: higher k-mer models.

Modified bases are currently not included in the k-mer model.
Solution: add modified k-mers to the model. Modified k-mers are different from unmodified k-mers.

Errors

Page 43: 20150601 bio sb_assembly_course

Throughput is defined by:
• Number of channels: 512 on the MinION
• Speed of translocation: 30 bases/sec
• Occupancy of the pore: 90%
• The time a Flow Cell can run: ~60 hrs

Currently well over 1 Gb events.
On R7.3 this translates to ~400 Mb 2D data.

Throughput

In “fast mode” the MinION will read 500 bases/sec. Currently three MAP groups are testing this. Throughput will increase to ~20 Gb in events.
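Multiplying the four factors gives a theoretical ceiling on the number of events per flow cell. This is a back-of-the-envelope sketch, assuming all channels stay active at the stated occupancy for the whole run; real yields, such as the figures quoted on this slide, stay below it.

```python
def max_events(channels, bases_per_sec, occupancy, run_hours):
    """Upper bound on events per run: every channel busy at the given occupancy."""
    return channels * bases_per_sec * occupancy * run_hours * 3600


normal_mode = max_events(channels=512, bases_per_sec=30, occupancy=0.9, run_hours=60)
fast_mode = max_events(channels=512, bases_per_sec=500, occupancy=0.9, run_hours=60)

print(f"normal mode ceiling: {normal_mode / 1e9:.1f} G events")
print(f"fast mode ceiling:   {fast_mode / 1e9:.1f} G events")
```

The ceiling is roughly 3 G events at 30 bases/sec and roughly 50 G events at 500 bases/sec, so the observed "well over 1 Gb" and the expected ~20 Gb are each around 40% of the theoretical maximum.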

Page 44: 20150601 bio sb_assembly_course

Longest 2D read: 93.5 Kbp
Longest template read: 120 Kbp (231 Kbp)
Highest yield: 1.32 Gevents

[Figure: template and 2D yield over the past year. x-axis: runs (1 to 30); y-axis: base pairs sequenced (0 to 350,000,000); series: template, 2D; chemistries marked: R6, R7, R7.3]

Page 45: 20150601 bio sb_assembly_course

[Figure: a repeat flanked by unique sequence in and unique sequence out]

Long reads can help to resolve repeat areas in the assembly graph

And the resulting contigs will now look like this:

Untangle

Page 46: 20150601 bio sb_assembly_course

Task: Software

1. Short read correction: Quake (not for small genomes)
2. Short read assembly: Velvet
3. MinION read alignment to Velvet contigs: LAST
4. Link filtering and contig tiling: Untangle script
5. Path detachment around repeats: Untangle script
6. Bubble popping: Untangle script
7. Delete unconfirmed connections: Untangle script
8. Contig extraction: Untangle script

Assembly and scaffolding strategy

Page 47: 20150601 bio sb_assembly_course

Agrobacterium NCPPB 1771 assembly graph

25× transposon →(1160 bp)

8× transposon →(873 bp)

4× rRNA →(6.4 Kb)

271 nodes, 311 connections
154 contigs
N50 = 198 Kb
Sum = 5.87 Mb

Page 48: 20150601 bio sb_assembly_course

• Alignment: LAST with optimized settings

• Links: alignment filtering and contig tiling

• 7328 reads aligned to contigs

• 438 reads aligned to multiple contigs

• 585 links between contigs

• 13158 reads on R6 and R7 chemistry

• 73.8 Mb total yield (template and 2D)

• 5–85970 nt length, typical ~12 Kb

MinION sequencing and scaffolding
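A rough sketch of how reads that align to multiple contigs can be turned into links between contigs. This is not the Untangle script: the input format (read id, start position on the read, contig id) is an assumption, and orientation, alignment quality filtering and contig tiling are all ignored.

```python
from collections import Counter


def contig_links(alignments):
    """Count links between contigs that are adjacent on the same read.

    alignments: iterable of (read_id, read_start, contig_id) tuples.
    Returns a Counter keyed by sorted (contig_a, contig_b) pairs.
    """
    by_read = {}
    for read_id, read_start, contig in alignments:
        by_read.setdefault(read_id, []).append((read_start, contig))

    links = Counter()
    for hits in by_read.values():
        # Order the contigs by where they align along the read.
        ordered = [contig for _, contig in sorted(hits)]
        for a, b in zip(ordered, ordered[1:]):
            if a != b:
                links[tuple(sorted((a, b)))] += 1
    return links


toy_alignments = [
    ("read1", 0, "contig1"), ("read1", 5000, "contig2"),
    ("read2", 0, "contig2"), ("read2", 4000, "contig1"),
    ("read3", 0, "contig3"),
]
links = contig_links(toy_alignments)
```

Links supported by several independent reads are more trustworthy than single-read links, which is one reason to filter links before changing the assembly graph.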

Page 49: 20150601 bio sb_assembly_course

Links between nodes are specific

Means link is confirmed by PCR

Page 50: 20150601 bio sb_assembly_course

Final assembly graph after scaffolding

• 271 nodes + 312 connections → 49 nodes + 5 connections
• 154 contigs → ~8 contigs
• Complete chromosome 2 (1.2 Mb), pTi (190 Kb), cryptic megaplasmid (746 Kb)
• Slight residual fragmentation of chromosome 1

Page 51: 20150601 bio sb_assembly_course

Reads are in HDF5 format and contain all data from the event data onwards.

A cloud based basecaller is provided by Oxford Nanopore Technologies.

The MAP community is actively developing software to use this type of data.

Some examples:

Jared Simpson’s pipeline to correct and assemble using only nanopore reads.

Live monitoring, alignments and feedback to the MinION: Matt Loose’s minoTour.

Squiggle space aligners: each base is measured 5 times in consecutive k-mers, so it makes sense to avoid basecalling and work directly with the events (squiggle space).

Software

Page 52: 20150601 bio sb_assembly_course

London Calling 2015

Highlights from Clive Brown’s talk

• Improvements to the basecaller.
• Read until (and barcoding).
• Fast mode on the MinION MkI (500 bp/sec instead of 30).
• New 3000 channel ASIC with “crumpet” chip design to separate the ASIC and fluidics parts.
• MinION MkII and PromethION will have this new ASIC.
• Library prep on beads to reduce the amount of DNA needed (low ng to pg).
• Direct RNA sequencing.
• Simplified sample preparation and VolTRAX.
• Pricing will be “pay as you go”. The initial payment for hardware includes some hours of sequencing.
• MkI: $270 and 3 hrs sequencing (~3 Gbp in fast mode).

Page 53: 20150601 bio sb_assembly_course

London Calling 2015

Much emphasis on making the library prep simpler and faster, to be able to leave the lab.
If the system leaves the lab, many more applications become possible.

VolTRAX

Page 54: 20150601 bio sb_assembly_course

The technology underlying the MinION system is scalable, so larger throughput can be made available relatively easily.

PromethION will use the new ASIC design and will have 144000 channels.
Projected throughput: 6.4 Tbp/day.
This is too much data for cloud basecalling, so basecalling will be done locally.

Access Program will start later this year.

London Calling 2015

PromethION

Page 55: 20150601 bio sb_assembly_course

Freek Vonk, Harald Kerkkamp, Asad Hyder, Michael Richardson, Christiaan Henkel, Paul Hooykaas

Ron Dirks, Guido van den Thillart, Herman Spaink

Pim Arntzen

Erwin Fakkert, Marten Boetzer, Walter Pirovano, Diana Uffink

R. Manjunatha Kini

Ken Kraaijeveld, Yavuz Ariyurek, Arnoud Schmitz, Yahya Anvar

Acknowledgments

Dan Turner, Oliver Hartwell

Page 56: 20150601 bio sb_assembly_course