For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel
are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.
Un-zipping Diploid Genomes - Revealing All Kinds of Heterozygous Variants from Comprehensive Haplotig Assembly Jason Chin1, Paul Peluso1, David Rank1, Maria Nattestad2, Fritz J. Sedlazeck2, Michael Schatz2, Alicia Clum3, Kerrie Barry3, Alex Copeland3,
Ronan O’Malley4, Chongyuan Luo4, Joseph Ecker4
1PacBio, 2Cold Spring Harbor Laboratory, 3Joint Genome Institute, 4 HHMI / The Salk Institute
Outside of the simplest cases (haploid, bacteria, or inbreds), genomic information is not carried in a single reference per individual, but rather has higher
ploidy (n ≥ 2) for almost all organisms. The existence of two or more highly related sequences within an individual makes it extremely difficult to build high-
quality, highly contiguous genome assemblies from short DNA fragments. Based on the earlier work on a polyploidy aware assembler, FALCON
(https://github.com/PacificBiosciences/FALCON), we developed new algorithms and software (FALCON-unzip) for de novo haplotype reconstructions from
SMRT Sequencing data.
We apply the algorithms and the prototype software for (1) a highly repetitive diploid fungal genome (Clavicorona pyxidata) and (2) a F1 hybrid from two
inbred Arabidopsis strains: Cvi-0 and Col-0.
For the fungal genome, we achieve an N50 of 1.53 Mb (of the 1n assembly contigs) of the ~42 Mb 1n genome and an N50 of the haplotigs of 872 kb from a
95x read length N50 ~16 kb dataset. We find that ~45% of the genome is highly heterozygous and ~55% of the genome is highly homozygous. We
develop methods to assess the base-level accuracy and local haplotype phasing accuracy of the assembly with short-read data from an Illumina platform.
For the Arabidopsis F1 hybrid genome, we find that 80% of the genome can be separated into haplotigs. The long-range accuracy of phasing haplotigs is
evaluated by comparing them to the assemblies from the two inbred parental lines.
The more complete view of both haplotypes can help to generate better annotations, provide useful biological insights, including information about all kinds
of heterozygous variants, and be used for studying differential allele expression in organisms. Finally, we apply FALCON/FALCON-Unzip on three other
organisms (grape, zebra finch, human) to demonstrate this is a general approach to solve diploid assembly problem with PacBio long-read data.
ABSTRACT AND BASIC IDEAS RESULTS
Why Do We See Bubbles in Assembly Graphs?
SNPs SNPs SNPs SVs SVs
SNPs SNPs SNPs
SVs SVs
Haplotype 1
Haplotype 2
Genome Sequences
Assembly Graph
In most OLC assembler designs, the overlapper does not catch differences
at SNP level but structural variations are naturally segregated.
Associate contig 1
(Alternative allele)
Associate contig 2
(Alternative allele)
SNPs SNPs SNPs SVs SVs
Primary contig
Augmented with haplotype
information of each reads
FALCON
FALCON-Unzip
Updated primary contig + “associate haplotigs”
FALCON Assembly & FALCON-Unzip Tool
How to Resolve Structural Variations & Het-SNPs Phasing at Once?
3 kb – 100 kb
300 b – 10 kb
Structural
Variations
het-SNP
Overlap-layout
process catches
SV haplotypes
✗Collapsed paths
when there is no
SV
Easy to group
SNPs/reads into
different
haplotypes
✗No phasing
information
associated with
SVs
Nearby SVs may
be phased
automatically
✗Haplotype-fused
paths
Haplotype-
specific paths
✗More
fragmented
contigs
Information Sources Pros & Cons Assembly graph features
Tiling path of haplotype 0
Tiling path of haplotype 1
Remove edges connecting
different haplotypes
“FA
LC
ON
-Un
zip
Pro
ce
ss”
~ 4.80 Mb
Add missing haplotype specific nodes & edges
Remove edges that connect different haplotypes
The final graph comprises a primary contig (blue),
a major haplotig (red) and other smaller haplotigs.
4 major haplotype phased blocks determined by het-
SNPs
Un-phased region
• SMRT Sequencing provides a single data type for routine diploid assembly
• Large genomes are more computationally challenging, but mostly an engineering problem now
• Future Work:
• Haplotype phasing improvement, incorporate other 3rd party and better phasing code
• To develop a sequence aligner for “augmented alignment” for faster Quiver consensus process
• Want to attack the algorithm problem for polyploid assembly? Let us know. We can work together.
Arabidopsis Thaliana F1 Diploid
Assembly Statistics
Strain Inbred Col-0 Inbred Cvi-0 Col-0 x Cvi-0
F1
Assembler CA/HGAP CA/HGAP FALCON FALCON-Unzip FALCON-Unzip
primary contigs primary contigs haplotigs
Assembly Size (Mb) 126 119 143 140 105
# contigs 1325 194 426 172 248
N50 size (Mb) 6.210 4.79 7.92 7.96 6.92
Max Contig size (Mb) 10.25 11.25 13.39 13.32 11.65
126
119
143
57
140
105
0 20 40 60 80 100 120 140 160
Inbred Col-0
Inbred Cvi-0
F1 FALCON p-contigs
F1 FALCON a-contigs
F1 Unzip p-contigs
F1 Unzip haplotigs
Assembly Size (Mb)
6.21
4.79
7.92
0.146
7.96
6.92
0 2 4 6 8 10
N50 size (Mb)
Construct Arabidopsis thaliana Col-0
x CVI-0 Diploid F1 Line
Col-0 Cvi-0
Col-0 x Cvi-0
• Two inbred lines
sequenced in 2013
(P4-C2 chemistry),
assembled as haploid
genomes
• F1 line constructed
and sequenced in
2015 (P6-C4
chemistry), assembled
with FALCON and
FALCON-Unzip (18.5 Gbp reads, reads
N50 ~ 17.5 kb)
F1 Diploid Assembly vs.
Parental Line Assemblies
Col-0 x Cvi-0 assembly
Col-0 assembly
Cvi-0 assembly
haplotigs
primary contig
Haploid-like contig in the inbred-line assemblies
Many variations
Few or no
variations
Few or no variations
Many variations
Clavicorona
pyxidata
(Coral fungus)
Cabernet
Sauvignon+*
Taeniopygia
guttata
(Zebra finch)$* Human*
Haploid Genome
Size: ~ 44 Mb ~ 500 Mb ~1.2 Gb ~ 3 Gb
Sequencing
Coverage 4.1 Gb / 95x
73.7 Gb /
147x 50 Gb / 42x 255 Gb / 85x
Primary contig size 41.9 Mb 591.0 Mb 1.07 Gb 2.76 Gb
Primary contig N50 1.5 Mb 2.2 Mb 3.23 Mb 22.9 Mb
Haplotig size 25.5 Mb 372.2 Mb 0.84 Gb 2.0 Gb
Haplotig N50 872 kb 767 kb 910 kb 330 kb
+ Led by Cantu lab, UC Davis and Cramer lab, UN Reno
$ Thanks to Erich Jarvis for permission to use preliminary data
* Preliminary results. Fast file system and efficient computational infrastructure are currently needed for
large genomes.
Image credits: Pajoro, et al, Trends in
Plant Science 21.1 (2016): 6-8.
Example of the
FALCON-Unzip
Process
Additional Diploid Genome Assembly
Examples
SUMMARY