This talk was given on 2014-09-17 for the Genome Center’s Bioinformatics Core as part of a one-week workshop. It covers the Assemblathon projects as well as other aspects of genome assembly. A version of this talk with embedded notes is also available on Slideshare. Note that this is an evolving talk: older and newer versions are also available on Slideshare.
Genome assembly: then and now
Keith Bradnam
Image from Wellcome Trust
v1.2
Image from flickr.com/photos/dougitdesign/5613967601/
Contents
Sequencing 101
Genome assembly: then
Genome assembly: now
Assemblathons
Intermission
Advice
Sequencing 101
A, C, G, T...
Image from nlm.nih.gov
Read
Read pair
Mate pair
Contigs
Scaffold
NNNNNNNNNNNNNNNNNNN
Assembly size
[Figure: a toy assembly drawn as 12 sequences of 70, 25, 20, 10, 10, 5, 5, 5, 15, 15, 15 and 5 Mbp, some containing runs of Ns; total size 200 Mbp]
N50 length
[Figure, built up over several slides: sequence lengths are summed from the longest downwards (running totals 70, 95, 115 Mbp) against the 200 Mbp total; a second version drops two of the 5 Mbp sequences, leaving 190 Mbp]
N50 for two assemblies
Assembly 1: 208 Mbp, N50 = 15 Mbp
Assembly 2: 190 Mbp, N50 = 25 Mbp
NG50 for two assemblies
Expected genome size = 250 Mbp
Assembly 1: 208 Mbp, NG50 = 15 Mbp
Assembly 2: 190 Mbp, NG50 = 15 Mbp
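The N50 and NG50 slides can be reproduced with a few lines of Python. This is a minimal sketch rather than the code used for the talk: the function name `nx_length` is hypothetical, and the toy lengths assume that the figure shows twelve sequences of 70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5 and 5 Mbp (which sums to the 200 Mbp shown on the slide).

```python
from itertools import accumulate

def nx_length(lengths, fraction=0.5, total=None):
    """Return the N(x) length; pass an expected genome size as `total` for NG(x)."""
    ordered = sorted(lengths, reverse=True)
    # N50 uses the assembly size as the denominator, NG50 the expected genome size.
    target = fraction * (total if total is not None else sum(ordered))
    for length, running in zip(ordered, accumulate(ordered)):
        if running >= target:
            return length
    return 0  # the assembly never reaches the target

# Assumed toy assemblies (lengths in Mbp), read off the figure residue.
toy = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, 5]   # 200 Mbp in total
trimmed = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5]     # 190 Mbp: two 5 Mbp sequences dropped

print(nx_length(toy))                  # N50 of the 200 Mbp assembly
print(nx_length(trimmed))              # N50 of the 190 Mbp assembly
print(nx_length(trimmed, total=250))   # NG50 against a 250 Mbp genome
```

Dropping the two shortest sequences raises N50 even though no sequence got any longer, whereas NG50 is unchanged because its denominator is the expected genome size rather than the assembly size.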
$ n50_booster.pl c_japonica.WS230.genomic.fa

Before:
==============
Total assembly size = 166256191 bp
N50 length = 94149 bp

Boosting N50...please wait

After:
==============
Total assembly size = 166256191 bp
N50 length = 104766 bp

Improvement in N50 length = 10617 bp

See file c_japonica.WS230.genomic.fa.n50 for your new (and improved) assembly

You should check that high N50 values are not simply due to lots of Ns in the scaffolds!
Assembly 'x'
Size: 859 Mbp
Number of scaffolds: 28
N50 = 70.3 Mbp
Ns = 90.6%!!!
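A quick way to run that check is to count the unknown bases directly. The sketch below is illustrative only: `n_percentage` is a hypothetical helper, and the two-scaffold FASTA text is made up for the example.

```python
def n_percentage(fasta_text):
    """Percentage of bases that are N (or n) across all sequences in FASTA text."""
    total = unknown = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
        unknown += line.upper().count("N")
    return 100.0 * unknown / total if total else 0.0

# Made-up example: 12 of the 24 bases are Ns.
example = """>scaffold_1
ACGTNNNNNNNNACGT
>scaffold_2
NNNNACGT
"""
print(f"Ns = {n_percentage(example):.1f}%")
```

For a real assembly, the same function can be given the contents of the FASTA file; reporting this figure next to N50 makes a scaffold set like Assembly 'x' easy to spot.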
Basic assembly metrics
Metric            Description
Assembly size     With or without very short contigs?
N50 / NG50        For contigs and/or scaffolds
Coverage          When compared to a reference sequence
Errors            Base errors from alignment to reference sequence and/or input read data
Number of genes   From comparison to reference transcriptome and/or set of known genes
And many, many more...
Genome assembly
Back in the day...
1998
Genome assembly: then
Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓
So what was the result of spending millions of dollars to assemble genomes of well-characterized species, with accurate long reads, and detailed maps???
Arabidopsis thaliana
✤ 2000: published genome size = 125 Mbp
✤ 2007: genome size = 157 Mbp
✤ 2012: genome size = 135 Mbp
✤ Amount sequenced = 119 Mbp
✤ Ns = 0.2% of genome
Drosophila melanogaster
✤ Genome published 2000
✤ Heterochromatin finished 2007
✤ Ns = 4% of genome
Caenorhabditis elegans
✤ Genome published 1998
✤ 2004: last N removed
✤ 1998–2014: genome sequence changes
✤ 558 insertions
✤ 230 deletions
✤ 614 substitutions
} Nov 2012
Saccharomyces cerevisiae
✤ Genome published 1997
✤ 12 Mbp genome
✤ 1,653 changes to genome since 1997
✤ Last changes made in 2011
Genome assembly: then
Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓
Genome assembly: now
Genetic maps ✗
Physical maps ✗
Understanding of target genome ✗
Haploid / low heterozygosity genome ✗
Accurate & long reads ✗
Resources (time, money, people) ✗
Assembling & finishing a genome is not easy!
Assemblathons
A new idea is born
Image from flickr.com/photos/dullhunk/4422952630
If you sequence 10,000 genomes...
...you need to assemble 10,000 genomes
How many assembly tools are out there?
bambus2, Ray, Celera, MIRA, ALLPATHS-LG, SGA, Curtain, Metassembler, Phusion, ABySS, Amos, Arapan, CLC, Cortex, DNAnexus, DNA Dragon, Edena, Forge, Geneious, IDBA, Newbler, PRICE, PADENA, PASHA, Phrap, TIGR, Sequencher, SeqMan NGen, SHARCGS, SOPRA, SSAKE, SPAdes, Taipan, VCAKE, Velvet, Arachne, PCAP, GAM, Monument, Atlas, ABBA, Anchor, ATAC, Contrail, DecGPU, GenoMiner, Lasergene, PE-Assembler, Pipeline Pilot, QSRA, SeqPrep, SHORTY, fermi, Telescoper, Quast, SCARPA, Hapsembler, HapCompass, HaploMerger, SWiPS, GigAssembler, MSR-CA, MaSuRCA, GARM, Cerulean, TIGRA, ngsShoRT, PERGA, SOAPdenovo, REAPR, FRCBam, EULER-SR, SSPACE, Opera, mip, gapfiller, image, PBJelly, HGAP, FALCON, Dazzler, GGAKE, A5, CABOG, SHRAP, SR-ASM, SuccinctAssembly, SUTTA, Ragout, Tedna, Trinity, SWAP-Assembler, SILP3, AutoAssemblyD, KGBAssembler, MetAMOS, iMetAMOS, MetaVelvet-SL, KmerGenie, Nesoni, Pilon, Platanus, CGAL, GAGM, Enly, BESST, Khmer, GRIT, IDBA-MTP, dipSPAdes, WhatsHap, SHEAR, ELOPER, OMACC, Omega, GABenchToB, HiPGA, SAGE, HyDA-Vista, MHAP, Mapsembler 2, GAML, SAT-Assembler, RAMPART, VICUNA
Which is the best?
All published since August 14th, 2014!
Comparing assemblers
✤ Can't fairly compare two assemblers if they:
✤ produced assemblies from different species
✤ assembled same species, but used sequence data from different sequencing technologies
✤ used same sequencing technologies but have different sequence libraries
✤ Even using different options for the same assembler may produce very different assemblies!
The PRICE genome assembler has 52 command-line options!!!
how many of them are you going to learn?
A genome assembly competition
An attempt to standardize some aspects of the genome assembly process
Genome assembly contests
Assemblathon 1
✤ 2010–2011
✤ Used synthetic data
✤ Small genome (~100 Mbp)
✤ We knew the answer
Assemblathon 2 paper
First published on arXiv.org, Jan 2013
Published in GigaScience, July 2013
Attracted lots of interest, and provoked lots of commentary
But what did the paper reveal?
                 Type of data   Number of genomes   Size of genomes   Do we know the answer?
Assemblathon 1   Synthetic      1                   Small             ✓
Assemblathon 2   Real           3                   Large             ✗
Bird: Melopsittacus undulatus
Snake: Boa constrictor constrictor
Fish: Maylandia zebra
Why these three species?
Because they were there
Assemble this!
Species   Estimated genome size   Illumina              Roche 454           PacBio
Bird      1.2 Gbp                 285x (14 libraries)   16x (3 libraries)   10x (2 libraries)
Fish      1.0 Gbp                 192x (8 libraries)    none                none
Snake     1.6 Gbp                 125x (4 libraries)    none                none
Who took part?
21 teams
43 assemblies
52,013,623,777 bp of sequence
Entries
Species   Competitive entries   Evaluation entries
Bird      12                    3
Fish      10                    6
Snake     12                    0
Goals
✤ Assess 'quality' of assemblies
✤ Define quality!
✤ Produce ranking of assemblies for each species
✤ Produce ranking of assemblers across species?
Who did what?
Person/group                    Jobs
Me, Ian Korf, and Joseph Fass   Perform various analyses of all assemblies
David Schwartz et al.           Produce & evaluate optical maps
Jay Shendure et al.             Produce Fosmid sequences (bird & snake only)
Martin Hunt & Thomas Otto       Perform REAPR analysis
Dent Earl & Benedict Paten      Help with meta-analysis of final rankings
91 co-authors!
flickr.com/photos/jamescridland/613445810
Results!
Lots of results!
102 different metrics!
10 key metrics
Key Metric Description
1 NG50 scaffold length
2 NG50 contig length
3 Amount of assembly in 'gene-sized' scaffolds
4 Number of 'core genes' present
5 Fosmid coverage
6 Fosmid validity
7 Short-range scaffold accuracy
8 Optical map: level 1
9 Optical map: levels 1–3
10 REAPR summary score
1) Scaffold NG50 lengths
✤ Can calculate NG50 length for each assembly
✤ But also calculate NG60, NG70 etc.
✤ Plot all results as a graph
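The bullets above can be sketched as a function that returns the whole NG(x) series, ready to be plotted with any plotting tool. This is an illustration under stated assumptions, not the Assemblathon code: `ngx_series`, the scaffold lengths and the 250 Mbp genome size are all hypothetical.

```python
def ngx_series(lengths, genome_size, steps=range(10, 100, 10)):
    """Map each x to the NG(x) length, or 0 where the assembly is too small."""
    ordered = sorted(lengths, reverse=True)
    series = {}
    for x in steps:
        target = genome_size * x / 100
        running, value = 0, 0
        for length in ordered:
            running += length
            if running >= target:
                value = length
                break
        series[x] = value
    return series

scaffolds = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5]  # hypothetical lengths in Mbp
for x, length in ngx_series(scaffolds, genome_size=250).items():
    print(f"NG{x}: {length} Mbp")
```

Because the denominator is the expected genome size, the series falls to zero once the assembly runs out of sequence, which makes assemblies of different total size directly comparable on one graph.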
2) Contig vs scaffold NG50
3) Gene-sized scaffolds
✤ Some assembly folks get a little obsessed by length!
✤ How long is 'long enough' for a scaffold?
✤ What if you just wanted to find genes?
✤ Average vertebrate gene = ~25 Kbp
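One way to turn the 25 Kbp figure into a metric is to ask how much of the genome sits in scaffolds at least that long. The sketch below is an assumption-laden illustration: `gene_sized_fraction` is a hypothetical name, the scaffold lengths are invented, and using the estimated genome size as the denominator is a choice made here rather than something stated on the slides.

```python
def gene_sized_fraction(scaffold_lengths, genome_size, min_length=25_000):
    """Fraction of the estimated genome held in scaffolds of at least min_length bp."""
    long_enough = sum(n for n in scaffold_lengths if n >= min_length)
    return long_enough / genome_size

lengths = [400_000, 120_000, 25_000, 24_999, 8_000, 1_500]  # hypothetical scaffolds (bp)
print(f"{gene_sized_fraction(lengths, genome_size=600_000):.1%}")
```

An assembly can score well here while having a modest NG50, which is the point of the slide: for gene finding, many gene-sized scaffolds may matter more than a few very long ones.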
4) Core genes
✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach)
✤ CEGMA uses a set of 458 'Core Eukaryotic Genes' (CEGs)
✤ CEGs are conserved in: S. cerevisiae, S. pombe, A. thaliana, C. elegans, D. melanogaster, and H. sapiens
✤ How many full-length CEGs are in each assembly?
4) Core genes
Core genes (out of 458)
Species   Best individual assembly   Across all assemblies
Bird      420                        442
Fish      436                        455
Snake     438                        454
What does this all mean?
102 metrics per assembly
10 key metrics
1 final ranking
Assembly   Number of core genes   Rank   Z-score
CRACS      438                    1      +0.68
SYMB       436                    2      +0.59
PHUS       435                    3      +0.54
BCM        434                    4      +0.49
SGA        433                    5      +0.44
MERAC      430                    6      +0.30
ABYSS      429                    7      +0.25
SOAP       428                    8      +0.21
RAY        422                    9      –0.08
GAM        415                    10     –0.41
CURT       360                    11     –3.02
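The Z-scores in this table can be recomputed from the core gene counts. The sketch below is not the Assemblathon analysis code; `z_scores` is a hypothetical helper, and the use of the population standard deviation is an inference from the fact that it reproduces the values on the slide.

```python
from statistics import mean, pstdev

# Core gene counts (out of 458) from the table above.
core_genes = {
    "CRACS": 438, "SYMB": 436, "PHUS": 435, "BCM": 434, "SGA": 433, "MERAC": 430,
    "ABYSS": 429, "SOAP": 428, "RAY": 422, "GAM": 415, "CURT": 360,
}

def z_scores(values):
    """Z-score of each entry: (value - mean) / population standard deviation."""
    counts = list(values.values())
    mu, sigma = mean(counts), pstdev(counts)
    return {name: (count - mu) / sigma for name, count in values.items()}

scores = z_scores(core_genes)
ranked = sorted(scores, key=scores.get, reverse=True)
for rank, name in enumerate(ranked, start=1):
    print(f"{rank:>2}  {name:<6} {core_genes[name]}  {scores[name]:+.2f}")
```

Unlike a plain rank, the Z-score keeps the size of the gaps: ten assemblies sit within about one standard deviation of each other, while the last one is roughly three standard deviations below the mean.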
What does this all mean?
No really, what does this all mean?
Some conclusions
✤ Very hard to find assemblers that performed well across all 10 key metrics
✤ Assemblers that perform well in one species do not always perform as well in another
✤ Bird & snake assemblies appear better than fish
✤ No real 'winner' for bird and fish
SGA: best assembler for snake?
Description                                    Rank of snake SGA assembly
NG50 scaffold length                           2
NG50 contig length                             5
Amount of assembly in 'gene-sized' scaffolds   7
Number of 'core genes' present                 5
Fosmid coverage                                2
Fosmid validity                                2
Short-range scaffold accuracy                  3
Optical map: level 1                           2
Optical map: levels 1–3                        1
REAPR summary score                            2
BCM bird assemblies
Assembler           Final rank   NGS data used in assembly   Coverage Z-score   Validity Z-score   NG50 contig Z-score
BCM - evaluation    1            Illumina + 454              +2.0               +1.4               +1.5
BCM - competitive   2            Illumina + 454 + PacBio     –0.3               –0.8               +2.7
BCM evaluation scaffold
NNNNNNNNNNNNNNNNNNN
BCM competition scaffold (with PacBio sequence)
CGTCGNNATCNNGGTTACG
Mismatches from PacBio sequence penalized alignment score more than matching unknown bases
The choice of one command-line option, used by one tool in the calculation of one key metric...
...probably made enough difference to drop the PacBio-containing assembly to 2nd place.
Other conclusions
✤ Different metrics tell different stories
✤ Heterozygosity was a big issue for bird & fish assemblies
✤ Final rankings very sensitive to changes in metrics
✤ N50 is a semi-useful predictor of assembly quality
Inter-specific differences matter
✤ The three species have genomes with different properties
✤ repeats
✤ heterozygosity
✤ The three genomes had very different NGS data sets
✤ Only bird had PacBio & 454 data
✤ Different insert sizes in short-insert libraries
The Big Conclusion
"You can't always get what you want"
Sir Michael Jagger, 1969
What comes next?
3?
A wish list for Assemblathon 3
✤ Only have 1 species
✤ Teams have to 'buy' resources using virtual budgets
✤ Factor in CPU time/cost?
✤ Agree on metrics before evaluating assemblies!
✤ Encourage experimental assemblies
✤ Use FASTG or GFA genome assembly file format?
✤ Get someone else to write the paper!
But maybe we don't need an Assemblathon 3?
Intermission
NGS must die!
‘NGS’ is used to refer to everything post-Sanger
Pyrosequencing was developed ~1996
NGS madness
Next generation sequencing
aka second generation sequencing
but there’s also: third generation sequencing
fourth generation sequencing
next-next generation sequencing
next-next-next generation sequencing
NGS madness
Technology          According to some papers…   According to other papers…
Complete Genomics   2nd generation              3rd generation
Ion Torrent         2nd generation              3rd generation
PacBio              2nd generation              3rd generation
Oxford Nanopore     3rd generation              4th generation
“PacBio is a 2.5th generation”
“Helicos lies between the transition of next-generation to third generation”
NGS madness
There are different sequencing methodologies, and there are different sequencing platforms.
Use one or the other.
Or just say ‘current sequencing technologies’.
Intermission
flickr.com/thomashawk
From a vertebrate genome assembly with 72,214 sequences…
Length of 10 shortest sequences: 100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp
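Checking for sequences like these takes only a few lines. The sketch below is a hypothetical illustration: the helper names, the toy records and the 200 bp cut-off are all assumptions, and the right threshold depends on what the assembly is for.

```python
def shortest_lengths(records, how_many=10):
    """Lengths of the shortest sequences, smallest first."""
    return sorted(len(seq) for seq in records.values())[:how_many]

def drop_short(records, min_length=200):
    """Keep only sequences of at least min_length bases (threshold is an assumption)."""
    return {name: seq for name, seq in records.items() if len(seq) >= min_length}

# Toy records standing in for a parsed FASTA file: sequence name -> sequence.
records = {
    "scaffold_1": "ACGT" * 100,   # 400 bp
    "scaffold_2": "ACGT" * 50,    # 200 bp
    "scaffold_3": "ACGTACGTACGT", # 12 bp
    "scaffold_4": "ACG",          # 3 bp
}
print(shortest_lengths(records, how_many=2))
print(sorted(drop_short(records)))
```

Listing the shortest sequences before any downstream analysis is a cheap instance of the advice to look at your output data.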
Improvements in sequencing technology will lead to improvements in genome assembly
Data from Lex Nederbragt’s blog, June 2014
Long-read technology
Moleculo read data from Illumina BaseSpace, July 2013
Long-read technology
PacBio data
From https://flxlexblog.wordpress.com (Lex Nederbragt's blog)
Long-read technology
MinION from Oxford Nanopore
Where is the data?
Nick Loman published the first real-world data on June 10th
Nick also released the first MinION dataset on September 10th
Some other ways to tackle the problems inherent in genome assembly
Single chromosome assembly?
Tackling heterozygosity
1000 Genomes project plans to sequence 15 'trios' in high-depth
Hi-C
✤ Nature Biotechnology, 31, 2013
✤ Burton et al.
✤ Selvaraj et al.
✤ Kaplan & Dekker
The future of genome assembly
Kwik-E-Assembler
acgtaacacaancac gggaacnnnacatta acnactagcataata nnnnnnnnnnaacac actttaaattatatc
The future of genome assembly
✤ At some point we will look back with embarrassment at this era.
✤ Assembly must, and will, get better, but...
✤ ...'perfect' genomes may remain elusive.
✤ Data management will remain an issue:
✤ the human genome -> human genomes -> tissue-specific genomes
Summary
✤ There is no real consensus on how to make a good genome assembly
✤ Try different assemblers, try different command-line options
✤ Decide what it is you want to get out of a genome assembly
✤ Look at your input and output data
✤ Wait 5 years and come back, we’ll (probably) have solved everything!
Useful blogs/tweeps to follow
Lex Nederbragt, @lexnederbragt, flxlexblog.wordpress.com
Nick Loman, @pathogenomenick, pathogenomic.bham.ac.uk/blog
Mick Watson, @BioMickWatson, biomickwatson.wordpress.com
Any questions???