High-throughput sequencing technologies in genome assembly
Hans Jansen
Dutch SME at Bioscience Park in Leiden, the Netherlands
• High throughput drug screens, and toxicity assays in zebrafish larvae
• Fish fertility (eel, pike perch, sole) to aid sustainable aquaculture
• Sequencing (genomes, transcriptomes)
• Bioinformatics
ZF-screens B.V.
Common carp (Cyprinus carpio): high throughput screening model; genome and transcriptomes
European and Japanese eel (Anguilla anguilla and Anguilla japonica): completing the life cycle in aquaculture; genome and transcriptomes
King cobra (Ophiophagus hannah): evolution and toxins; genome and transcriptomes
Some examples of genome projects
1977: Chemical cleavage (Maxam and Gilbert); chain termination (Sanger, Nicklen, and Coulson)
Throughput: 5 samples, 1 Kb/day, micrograms of ssDNA needed
2000: Massively parallel signature sequencing (Brenner)
2011: SMRT (Pacific Biosciences)
Throughput: 3 × 10^9 samples, 55 Gb/day, single molecule of DNA needed
A brief history of DNA sequencing
February 1977: Maxam and Gilbert. Chemical cleavage: modify nucleotides and cut at the modified position.
December 1977: Sanger, Nicklen, and Coulson. Chain termination: use modified nucleotides to stop the extension of a newly synthesized DNA strand.
A brief history of DNA sequencing
Maxam and Gilbert sequencing was abandoned relatively soon: it was technically complex and used some nasty chemicals and radioactivity.
The Sanger sequencing method was improved over the years and was the method of choice to sequence the first draft of a human genome.
• Thermostable polymerases alleviated the need for ssDNA template.
• Fluorescent dye terminators combined all four reactions in one.
• Automation of the separation of the DNA fragments.
Shotgun sequencing was already used by Sanger to sequence lambda DNA and proved to be a powerful tool to sequence and assemble larger DNA molecules and even whole genomes.
A brief history of DNA sequencing
To make assembly easier, partially overlapping BAC clones from the genome were first selected, then sequenced and assembled by the shotgun method.
gDNA
BAC
This was a laborious method, and later a whole genome shotgun approach was used.
A brief history of DNA sequencing
Starting material: genomic DNA
1. Break the DNA into fragments of < 1 Kb.
2. Polish the ends of the DNA and adenylate them (3’ A overhangs).
3. Ligate adapters with T overhangs to the ends of the DNA.
4. Amplify the paired end library.
5. Bind the single-stranded library to the flowcell.
Making a paired end library
Attach and cluster the library on a carrier
Sequence the library
2 x 50 bp
Generate large fragments by shearing, and label the ends with biotin (green dash).
Self-ligate the fragments in a large volume, and shear the circular fragments (black dash).
Isolate the biotinylated fragments, convert them to a paired end library, and sequence them (red arrows).
Problem: some of these fragments have unconvertible ends.
Problems: larger fragments self-ligate inefficiently, and nicks in the DNA enable digestion of circularized molecules.
These problems limit mate pair libraries to ~10 kb insert size, and the libraries tend to contain a low number of unique fragments.
Obtaining scaffolding information: mate pairs
Generate large fragments by shearing, isolate ~39 kb fragments, and clone them in an adapted fosmid vector that contains insert-flanking EcoP15I sites (purple dash).
Cut with EcoP15I, which leaves 26 bp of insert at each end, end repair the fragments, and self-ligate.
PCR the diTag library from these fragments, and sequence the 52 bp inserts.
Problem: these large fragments ligate inefficiently into the fosmid vector, leading to low complexity libraries.
Obtaining scaffolding information: Fosmid diTags
Library  Insert        Reads            Gbp    Coverage  Span
PE200    <155 bp       2 × 76 nt        21.9   14.6×     -
PE280    230–305 bp    2 × 151 nt       11.0   7.3×      -
PE500    370–485 bp    2 × 50–151 nt    19.3   12.9×     1.2×
MP2K     1.6–2.4 Kbp   2 × 36 nt        5.4    -         4.5×
MP7K     4–6 Kbp       2 × 51 nt        2.3    -         0.6×
MP10K    6.5–10 Kbp    2 × 51 nt        5.3    -         7.7×
MP15K    9–13 Kbp      2 × 51 nt        3.8    -         8.8×
Total                                   69     34.8×     22.9×
King cobra sequence data
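Sequence coverage is the yield divided by the genome size, so the table can be turned around to see which genome size was assumed. The sketch below does this for the three paired end libraries; the function name is ours, and the resulting size is an inference from the slide's numbers, not a value stated in the talk.

```python
# Coverage = yield / genome size, hence genome size = yield / coverage.
# (yield in Gbp, reported coverage) for the paired end libraries in the table.
paired_end_libraries = {
    "PE200": (21.9, 14.6),
    "PE280": (11.0, 7.3),
    "PE500": (19.3, 12.9),
}

def implied_genome_size_gbp(yield_gbp, coverage):
    """Genome size (Gbp) implied by a yield and its reported coverage."""
    return yield_gbp / coverage

sizes = {name: implied_genome_size_gbp(y, c)
         for name, (y, c) in paired_end_libraries.items()}

for name, size in sizes.items():
    print(f"{name}: {size:.2f} Gbp")
```

All three libraries point to a genome size of roughly 1.5 Gbp.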
Read merging
If the two reads of a paired end fragment overlap, they can be merged into a single longer read.
• We used our own script, since nothing was available at the time.
• Now there are a number of tools: FLASH, SHERA, SeqPrep.
• Paired end libraries need to be prepared with the read length in mind, and size selected as narrowly as possible.
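The idea of read merging can be sketched in a few lines. This is a minimal illustration, not our original script and not how FLASH, SHERA or SeqPrep work internally: it ignores base qualities, takes the bases of read 1 in the overlap, and does not handle fragments shorter than the read length. The function names and thresholds are assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence (upper case A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Merge a read pair if the end of read 1 overlaps the start of the
    reverse-complemented read 2; return None when no overlap is found."""
    mate = reverse_complement(read2)
    # Try the longest possible overlap first and shrink until one fits.
    for n in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-n:], mate[:n]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * n:
            return read1 + mate[n:]
    return None

# Hypothetical 50 bp fragment sequenced as 2 x 35 nt: the reads overlap by 20 nt.
fragment = "ACGTTGCATGGCCTAAGTCAGGATCCTTAGCAATCGGATTACAGCTTGAC"
read1 = fragment[:35]
read2 = reverse_complement(fragment[-35:])
merged = merge_pair(read1, read2)
print(merged == fragment)
```

The narrower the size selection, the more pairs fall in the range where such an overlap exists.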
~600 bp
~270 bp
[Figure: % of the assembly versus fragment size (bp; log scale, 10^2 to 10^6) as libraries are added: + 500 bp, + 2 Kbp, + 7 Kbp, + 10 Kbp, + 15 Kbp]
Assembly (cobra)
Contigs
N50 3982 bp
largest 70 Kbp
number 1186408
Total length 1.45 Gbp
Scaffolds
N50 226 Kbp
largest 2.84 Mbp
number 716551
Total length 1.66 Gbp
number of genes 22183
King cobra sequence assembly
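For readers unfamiliar with the N50 statistic used above: it is the length L such that contigs (or scaffolds) of length L or longer contain at least half of the assembled bases. A minimal sketch, with toy lengths rather than the cobra data:

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequence lengths given")

# Toy assembly of 300 bases: 80 + 70 = 150 reaches half of the total.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

For the toy contigs this returns 70. Scaffolding raises the N50 because linked contigs are counted as one longer sequence.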
Genome Res. 2007 17: 240-248
This is a method to sequence a small part of a genome, and to do this for multiple siblings. From the sequence data SNPs can be identified and used as markers to build a genetic map of this genome.
Analysis of the spotted gar genome cut with SbfI in the parents and 94 individuals from their progeny produced 8406 markers in 29 linkage groups.
Generating a RAD-tag genetic map
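To see which part of a genome RAD sequencing samples, the restriction sites can be located in silico. The sketch below finds SbfI sites (recognition sequence CCTGCAGG) and returns the flanking sequence. It is a simplification under stated assumptions: tags are taken as forward-strand substrings next to the site, the tag length is arbitrary, and the test sequence is made up.

```python
import re

SBFI_SITE = "CCTGCAGG"  # SbfI recognition sequence (palindromic 8-cutter)

def rad_tags(sequence, tag_length=20):
    """Return (position, upstream_tag, downstream_tag) for every SbfI site.

    Simplification: the tags are the forward-strand sequence directly
    flanking the recognition site, truncated at the sequence ends."""
    sequence = sequence.upper()
    tags = []
    for match in re.finditer(SBFI_SITE, sequence):
        start, end = match.start(), match.end()
        upstream = sequence[max(0, start - tag_length):start]
        downstream = sequence[end:end + tag_length]
        tags.append((start, upstream, downstream))
    return tags

# Made-up sequence with two SbfI sites.
toy = "ATATAT" + SBFI_SITE + "GCGCGC" + SBFI_SITE + "TTTT"
tags = rad_tags(toy, tag_length=4)
for position, up, down in tags:
    print(position, up, down)
```

Because the same sites are sampled in every individual, SNPs in these tags can be compared across parents and progeny.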
From Baird, PLoS ONE 2008
This can be done with multiple samples when using barcodes
After adding the barcodes all samples can be pooled to reduce workload
Pools of short fragments from different individuals.
Generating a RAD-tag genetic map
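After pooled sequencing the reads have to be assigned back to individuals by their barcode. A minimal sketch of that step, assuming the barcode sits at the very start of the read and must match exactly; the barcodes, sample names and reads are hypothetical.

```python
from collections import defaultdict

def demultiplex(reads, barcodes):
    """Group reads by their leading barcode and trim the barcode off.

    `barcodes` maps barcode sequence -> sample name. Reads whose start
    matches no barcode exactly are collected under "unassigned"."""
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched: keep the untrimmed read for inspection.
            by_sample["unassigned"].append(read)
    return dict(by_sample)

# Hypothetical barcodes and reads.
barcodes = {"ACGT": "parent_1", "TGCA": "offspring_01"}
reads = ["ACGTTGCAGGAAA", "TGCATGCAGGCCC", "GGGGTGCAGGTTT"]
pools = demultiplex(reads, barcodes)
print(pools)
```

Real barcode sets are usually designed so that a single sequencing error cannot turn one barcode into another; this sketch does not attempt error correction.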
Amores A et al. Genetics 2011;188:799-808
Generating a RAD-tag genetic map
Long DNA molecules, fluorescently labeled at specific sites, are linearized in nanochannels and imaged. The fluorescent fingerprints of the molecules can be assembled and linked to contigs and scaffolds.
Optical mapping: BioNano Genomics
Gabino Sanchez-Perez's lecture at 15.00 hrs will explain this in much more detail and show some great examples of how to use this technology.
Just a genome is usually not the goal of a de novo sequencing project.
Based on the general structure of a gene, gene predictions can be made.
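[Figure: general structure of a gene. Exons separated by introns, with ATG start codon, STOP codon and polyadenylation signal. Each intron has a splice donor site ((A/C)AG|GT(A/G)AGT), a branch site (CT(A/G)A(C/T)) 20-50 bases upstream of the splice acceptor site, and a pyrimidine-rich stretch followed by the splice acceptor site (CAG|G).]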
RNAseq reads can help validate predictions
Annotation of the genome
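The signals in the gene structure can be used as a sanity check on a predicted gene model. The sketch below tests only the most basic ones (ATG start, stop codon, reading frame, GT...AG introns); real gene predictors use probabilistic models and much more evidence. The function name, coordinate convention and toy sequence are assumptions for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_gene_model(genome, exons):
    """Check a predicted gene model against the canonical signals.

    `exons` is a list of (start, end) half-open coordinates on the forward
    strand, covering the coding sequence only (no UTRs)."""
    coding = "".join(genome[start:end] for start, end in exons)
    # Introns are the stretches between consecutive exons.
    introns = [genome[left_end:right_start]
               for (_, left_end), (right_start, _) in zip(exons, exons[1:])]
    return {
        "starts_with_ATG": coding.startswith("ATG"),
        "ends_with_stop": coding[-3:] in STOP_CODONS,
        "length_multiple_of_3": len(coding) % 3 == 0,
        "introns_GT_AG": all(i.startswith("GT") and i.endswith("AG")
                             for i in introns),
    }

# Toy two-exon gene; the intron carries a GT donor and an AG acceptor.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTTCTAACTTTTTTTTCAG", "GGATAA"
genome = "CC" + exon1 + intron + exon2 + "CC"
second_start = 2 + len(exon1) + len(intron)
exons = [(2, 2 + len(exon1)), (second_start, second_start + len(exon2))]
report = check_gene_model(genome, exons)
print(report)
```

Spliced RNAseq reads give direct evidence for the intron boundaries that such a check can only judge by sequence.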
Different flavors of RNAseq
• Stranded dUTP RNAseq: a simple modification of the standard prep gives information on the strandedness of the transcript.
• RNAseq with minimal quantities of RNA: a great tool to look at small numbers of (FACS sorted) cells.
• CAGE: ideal for finding transcription start sites.
• Small RNA: to explore the miRNA content of a sample.
Transcriptome sequencing
Disadvantages of next generation sequencing:
• Complex sample preparation including PCR amplification.
• High run costs.
• Long run times.
• Short reads.
Changes needed:
• Single molecule analysis
• Reading sequences at a high speed
• Highly parallel
• Long reads > 10 kb
• No errors
Long reads: what do we want?
Pacific Biosciences PacBio RS II
Available since 2010
Oxford Nanopore Technologies MinION
Available since 2014
Generating long reads
Pacific Biosciences PacBio RSII
It uses a zero mode waveguide to measure fluorescence in a very small volume.
Fragment gDNA, polish the ends, and add adenosine.
Ligate hairpin adapters.
Attach polymerase, load on the SMRT cell, and sequence.
DNA polymerase
Transparent bottom of zero mode waveguide
Pacific Biosciences
Pacific Biosciences P6-C4
• Yield 0.5-1 Gbp/SMRT cell.
• Since no amplification is done you sequence the DNA as it comes out of your sample (nicks, base modifications).
• There is very little sequence bias and there are no systematic errors.
Christoph Konig’s lecture at 14.15 hrs will delve much deeper into this technology.
• Started to work on nanopore sensing in 2005
• Investments to date: 180 million GBP (227 M€)
• ~200 employees
• Broad IP portfolio
• Announced products: MinION and PromethION systems
• Access program for MinION (MAP)
Oxford Nanopore Technologies
But MAP is much more. It is about being a community and a playground to test new applications.
The last part of the development of this technology is done “in the field” in a fairly open program.
Hundreds of MinIONs were sent around the globe to see how they would behave in real life.
MAP is visible as a web portal with information from ONT and a social-media-like system with blog possibilities, comments, likes, and a forum to ask for advice.
MinION access program
[Figure: A-tailed DNA fragment ligated to T-overhang adapters, showing the tethering oligo, motor protein, brake protein, and hairpin with abasic nucleotides]
• Shear (optional)
• DNA repair (optional), AmpureXP purification
• End repair, AmpureXP purification
• A-tailing, AmpureXP purification
• Ligation, His-tag purification, dilution in run buffer and ATP
A MuA transposase protocol is under development. This should further simplify sample preparation (10 minutes).
Library preparation
[Figure: library molecule at the pore, showing the tethering oligo, motor protein (E5), brake protein (E3), and hairpin with abasic nucleotides]
The tether keeps the DNA fragment on the membrane, leading to a ~20,000-fold higher DNA concentration close to the pore.
The motor protein unwinds the DNA and ratchets it through the pore.
The abasic nucleotides in the hairpin are a recognition point.
The brake protein prevents the motor protein from zipping through the complement strand.
Sequencing
Stills taken from: https://www.nanoporetech.com/news/movies#movie-24-nanopore-dna-sequencing
Strand sequencing
ATP
Sequence: GGCTCACTCCCATAAGCGGCTC
Consecutive 5-mers: GCTCA CTCAC TCACT CACTC ACTCC CTCCC
Raw data (ionic current, pA)
Events (with time domain)
Squiggle (events with time domain removed)
Sensing the DNA
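The current through the pore depends on the handful of bases occupying it at once, so the read is a series of overlapping k-mers rather than single bases. The sketch below reproduces the 5-mers shown for the example sequence and shows that an internal base is covered by five consecutive 5-mers; the helper names are ours.

```python
def kmers(sequence, k=5):
    """All consecutive k-mers of a sequence, one per single-base step."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def kmers_covering(sequence, position, k=5):
    """The k-mers whose window includes the base at `position` (0-based)."""
    return [kmer for i, kmer in enumerate(kmers(sequence, k))
            if i <= position < i + k]

sequence = "GGCTCACTCCCATAAGCGGCTC"  # example sequence from the slide
print(kmers(sequence)[1:7])          # the six 5-mers listed on the slide
print(kmers_covering(sequence, 5))   # every 5-mer that contains base 5 (A)
```

Each event in the squiggle therefore carries information about five bases, and each base influences five events.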
Squiggle plot for a complete read
First the template part in blue, then the abasic nucleotides in the hairpin in red, and finally the complement part in turquoise.
Alignment of the template and complement squiggles gives a 2D read.
Squiggle plot
MinKNOW controls the run and shows channel states…..
Interactive interface
… and the number of events versus read length.
The Metrichor agent runs in the background to send sequence files to and from the (cloud based) base caller.
MinKNOW can interact with other software.
minoTour analyses reads in a streaming mode and can control MinKNOW.
Interactive interface
Template mean: 8734 bp
Complement mean: 8126 bp
2D mean: 9930 bp
Read length is limited by the length of the non-nicked fragments rather than by the system.
My longest 2D read until now: 93.5 Kbp; longest template read: 120 Kbp.
Read length distribution
There are actually 4 wells per detection channel. QC at the beginning of the run determines the quality of the 4 wells. Sequencing starts on the best set of wells. Every 24 hrs the next best set of wells is chosen.
Yield over time
Errors
ref    TGATGTATATGCTCTCTTTTCTGACGTTAGTCTCCGACGGCAGGCTTCAA-TGACCC-A-GGCTGAGAAATTCCCGGACCCTTTTTGCTCAAGAGCGATG
MinION TGATGTATATGCTC----TTCTGACGTTAGCCTCCGACGGCAGGCTTCAATTGACCCGATGGCTGAGAAATTCCCGGACCC--TTTGCTACAGAGTG-T-

ref    TTAATTTGTTCAATCATTTGGTTAGGAAAGCGGATGTTGCGGGTTGTTGTTCTGCGGGTTCTGTTCTTCGTTGACATGAG---GTTGCCCCGTATTCAGT
MinION TTAATTTGTTCAATCATTTGGTTAGGAAAGCGGA---TGC-GGTTGT--TCCTGC-GGTTCTG----TCG-TGACATCCGTTATTTGCGCTGT-TACGC

ref    GTCGC-TGATTTGTATTGTCTGAAGTTGTTTTTACGTTAAGTTGATGCAGATCAATTAATACGATACCT--GCGTCATAATTGATTATTTGACGT--GGT
MinION ATGGCATGTTTTGTATTGTCTGAAGTTGTTTTTACGTTAAGTTAATGCAGATCAATTAATACGATACCTCGGCGTCATAATTGATTATTTGACGTGGGGT
The error rate lies around 15% for the current chemistry (R7.3). A typical passing 2D R7.3 read now has 2.8% deletions, 2.7% insertions and 1.7% substitutions.
R8 and R9 nanopores are in the pipeline (improving on G/C-rich reads and with a better signal-to-noise ratio).
Errors
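Percentages of deletions, insertions and substitutions such as those quoted for 2D reads can be counted directly from a gapped alignment. The sketch below does so per alignment column; whether the quoted figures were normalised by alignment length or by read length is not stated, so the denominator here is an assumption, and the aligned pair is a made-up example.

```python
def alignment_error_rates(ref, read):
    """Error percentages per alignment column for a gapped alignment.

    `ref` and `read` are equal-length aligned strings using '-' for gaps.
    A gap in the read is a deletion, a gap in the reference an insertion."""
    if len(ref) != len(read) or not ref:
        raise ValueError("aligned strings must be non-empty and equal length")
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    for r, q in zip(ref, read):
        if r == "-":
            counts["insertion"] += 1
        elif q == "-":
            counts["deletion"] += 1
        elif r == q:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    columns = len(ref)
    return {kind: 100.0 * n / columns for kind, n in counts.items()}

# Made-up alignment: one insertion, one deletion, one substitution.
rates = alignment_error_rates("ACGT-ACGTA", "ACGTTAC-TC")
print(rates)
```

Applied to alignments like the one shown, this makes visible that deletions and insertions, not substitutions, dominate the errors.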
Errors result from different parts of the system.
On the ASIC: events are missed in the translation from raw data to event data.
Solution: sharpen up the raw data by adjusting the voltage and by using new nanopores with lower noise. Sequence faster.
In the base caller: bases outside the observed k-mer influence the current.
Solution: higher-order (longer) k-mer models.
Modified bases are currently not included in the k-mer model; k-mers containing modified bases differ from unmodified k-mers.
Solution: add modified k-mers to the model.
Errors
Throughput is defined by:
• Number of channels: 512 on the MinION
• Speed of translocation: 30 bases/sec
• Occupancy of the pore: 90%
• The time a flow cell can run: ~60 hrs
Current yield is well over 1 Gb of events.
On R7.3 this translates to ~400 Mb of 2D data.
Throughput
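Multiplying the four factors gives the ceiling for a single run. A small worked example with the numbers from the slide; the function name is ours, and the result is a theoretical maximum that assumes every channel performs at the stated occupancy for the whole run.

```python
def max_events(channels, bases_per_sec, occupancy, run_hours):
    """Upper bound on the number of events (bases read) in one run."""
    return channels * bases_per_sec * occupancy * run_hours * 3600

normal = max_events(channels=512, bases_per_sec=30, occupancy=0.90, run_hours=60)
print(f"{normal / 1e9:.2f} G events")
```

This comes to about 2.99 G events, so the observed yield of well over 1 Gb of events is a substantial fraction of what the current settings allow.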
In “fast mode” the MinION will read 500 bases/sec. Currently three MAP groups are testing this. Throughput will increase to ~20 Gb in events.
Longest 2D read: 93.5 Kbp
Longest template read: 120 Kbp (231 Kbp)
Highest yield: 1.32 Gevents
[Figure: template and 2D yield over the past year. x-axis: runs (1 to 30); y-axis: base pairs sequenced (Mbp); series: template and 2D; chemistries: R6, R7, R7.3]
[Figure: a repeat in the assembly graph, with unique sequence in and unique sequence out]
Long reads can help to resolve repeat areas in the assembly graph
And the resulting contigs will now look like this:
Untangle
Task                                        Software
1. Short read correction                    Quake (not for small genomes)
2. Short read assembly                      Velvet
3. MinION read alignment to Velvet contigs  LAST
4. Link filtering and contig tiling         Untangle script
5. Path detachment around repeats           Untangle script
6. Bubble popping                           Untangle script
7. Delete unconfirmed connections           Untangle script
8. Contig extraction                        Untangle script
Assembly and scaffolding strategy
Agrobacterium NCPPB 1771 assembly graph
25× transposon →(1160 bp)
8× transposon →(873 bp)
4× rRNA →(6.4 Kb)
271 nodes, 311 connections
154 contigs
N50 = 198 Kb
Sum = 5.87 Mb
• Alignment: LAST with optimized settings
• Links: alignment filtering and contig tiling
• 7328 reads aligned to contigs
• 438 reads aligned to multiple contigs
• 585 links between contigs
• 13158 reads on R6 and R7 chemistry
• 73.8 Mb total yield (template and 2D)
• 5–85970 nt length, typical ~12 Kb
MinION sequencing and scaffolding
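The step from "reads aligned to multiple contigs" to "links between contigs" can be illustrated as follows. This is a simplified sketch, not the Untangle script: it takes alignment results as a plain dictionary, lets every pair of contigs hit by the same read vote for a link, and ignores orientation, distance and the order of contigs along the read. All names and the example data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def contig_links(read_to_contigs, min_reads=2):
    """Count reads supporting each contig pair and keep supported links.

    `read_to_contigs` maps a read name to the contigs it aligns to. Every
    pair of contigs hit by the same read gets one vote from that read."""
    votes = Counter()
    for contigs in read_to_contigs.values():
        for pair in combinations(sorted(set(contigs)), 2):
            votes[pair] += 1
    return {pair: n for pair, n in votes.items() if n >= min_reads}

# Hypothetical alignment summary: two reads bridge c1 and c2, one bridges c2 and c3.
alignments = {
    "read1": ["c1", "c2"],
    "read2": ["c2", "c1"],
    "read3": ["c2", "c3"],
    "read4": ["c4"],
}
links = contig_links(alignments)
print(links)
```

Requiring more than one supporting read is one simple way to filter out links caused by spurious alignments.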
Links between nodes are specific
Means link is confirmed by PCR
Final assembly graph after scaffolding
• 271 nodes + 312 connections → 49 nodes + 5 connections
• 154 contigs → ~8 contigs
• Complete chromosome 2 (1.2 Mb), pTi (190 Kb), cryptic megaplasmid (746 Kb)
• Slight residual fragmentation of chromosome 1
Reads are in HDF5 format and contain all data from the event data onwards.
A cloud based base caller is provided by Oxford Nanopore Technologies.
The MAP community is actively developing software to use this type of data.
Some examples:
• Jared Simpson’s pipeline to correct and assemble using only nanopore reads.
• Matt Loose’s minoTour: live monitoring, alignments and feedback to the MinION.
• Squiggle space aligners: each base is measured 5 times in consecutive k-mers, so it makes sense to avoid base calling and work directly with the events (squiggle space).
Software
London Calling 2015
Highlights from Clive Brown’s talk
• Improvements to the base caller.
• Read until (and barcoding).
• Fast mode on the MinION MkI (500 bp/sec instead of 30).
• New 3000 channel ASIC with “crumpet” chip design to separate the ASIC and fluidics parts.
• MinION MkII and PromethION will have this new ASIC.
• Library prep on beads to reduce the amount of DNA needed (low ng to pg).
• Direct RNA sequencing.
• Simplified sample preparation and VolTRAX.
• Pricing will be “pay as you go”. The initial payment for hardware includes some hrs of sequencing.
• MkI: $270 and 3 hrs of sequencing (~3 Gbp in fast mode).
London Calling 2015
Much emphasis is on making the library prep simpler and faster, so that the system can leave the lab.
If the system leaves the lab, many more applications become possible.
VolTRAX
The technology underlying the MinION system is scalable, so larger throughput can be made available relatively easily.
The PromethION will use the new ASIC design and will have 144000 channels.
Projected throughput: 6.4 Tbp/day.
This is too much data for cloud base calling, so base calling will be done locally.
Access Program will start later this year.
London Calling 2015
PromethION
Freek Vonk, Harald Kerkkamp, Asad Hyder, Michael Richardson, Christiaan Henkel, Paul Hooykaas
Ron Dirks, Guido van den Thillart, Herman Spaink
Pim Arntzen
Erwin Fakkert, Marten Boetzer, Walter Pirovano, Diana Uffink
R. Manjunatha Kini
Ken Kraaijeveld, Yavuz Ariyurek, Arnoud Schmitz, Yahya Anvar
Acknowledgments
Dan Turner, Oliver Hartwell