1
For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners. Un-zipping Diploid Genomes - Revealing All Kinds of Heterozygous Variants from Comprehensive Haplotig Assembly Jason Chin 1 , Paul Peluso 1 , David Rank 1 , Maria Nattestad 2 , Fritz J. Sedlazeck 2 , Michael Schatz 2 , Alicia Clum 3 , Kerrie Barry 3 , Alex Copeland 3 , Ronan O’Malley 4 , Chongyuan Luo 4 , Joseph Ecker 4 1 PacBio, 2 Cold Spring Harbor Laboratory, 3 Joint Genome Institute, 4 HHMI / The Salk Institute Outside of the simplest cases (haploid, bacteria, or inbreds), genomic information is not carried in a single reference per individual, but rather has higher ploidy (n ≥ 2) for almost all organisms. The existence of two or more highly related sequences within an individual makes it extremely difficult to build high- quality, highly contiguous genome assemblies from short DNA fragments. Based on the earlier work on a polyploidy aware assembler, FALCON (https://github.com/PacificBiosciences/FALCON), we developed new algorithms and software (FALCON-unzip) for de novo haplotype reconstructions from SMRT Sequencing data. We apply the algorithms and the prototype software for (1) a highly repetitive diploid fungal genome ( Clavicorona pyxidata) and (2) a F1 hybrid from two inbred Arabidopsis strains: Cvi-0 and Col-0. For the fungal genome, we achieve an N50 of 1.53 Mb (of the 1n assembly contigs) of the ~42 Mb 1n genome and an N50 of the haplotigs of 872 kb from a 95x read length N50 ~16 kb dataset. We find that ~45% of the genome is highly heterozygous and ~55% of the genome is highly homozygous. We develop methods to assess the base-level accuracy and local haplotype phasing accuracy of the assembly with short-read data from an Illumina platform. For the Arabidopsis F1 hybrid genome, we find that 80% of the genome can be separated into haplotigs. The long-range accuracy of phasing haplotigs is evaluated by comparing them to the assemblies from the two inbred parental lines. The more complete view of both haplotypes can help to generate better annotations, provide useful biological insights, including information about all kinds of heterozygous variants, and be used for studying differential allele expression in organisms. Finally, we apply FALCON/FALCON-Unzip on three other organisms (grape, zebra finch, human) to demonstrate this is a general approach to solve diploid assembly problem with PacBio long-read data. ABSTRACT AND BASIC IDEAS RESULTS Why Do We See Bubbles in Assembly Graphs? SNPs SNPs SNPs SVs SVs SNPs SNPs SNPs SVs SVs Haplotype 1 Haplotype 2 Genome Sequences Assembly Graph In most OLC assembler designs, the overlapper does not catch differences at SNP level but structural variations are naturally segregated. Associate contig 1 (Alternative allele) Associate contig 2 (Alternative allele) SNPs SNPs SNPs SVs SVs Primary contig Augmented with haplotype information of each reads FALCON FALCON-Unzip Updated primary contig + “associate haplotigs” FALCON Assembly & FALCON-Unzip Tool How to Resolve Structural Variations & Het-SNPs Phasing at Once? 3 kb 100 kb 300 b 10 kb Structural Variations het-SNP Overlap-layout process catches SV haplotypes Collapsed paths when there is no SV Easy to group SNPs/reads into different haplotypes No phasing information associated with SVs Nearby SVs may be phased automatically Haplotype-fused paths Haplotype- specific paths More fragmented contigs Information Sources Pros & Cons Assembly graph features Tiling path of haplotype 0 Tiling path of haplotype 1 Remove edges connecting different haplotypes “FALCON-Unzip Process” ~ 4.80 Mb Add missing haplotype specific nodes & edges Remove edges that connect different haplotypes The final graph comprises a primary contig (blue), a major haplotig (red) and other smaller haplotigs. 4 major haplotype phased blocks determined by het- SNPs Un-phased region SMRT Sequencing provides a single data type for routine diploid assembly Large genomes are more computationally challenging, but mostly an engineering problem now Future Work: Haplotype phasing improvement, incorporate other 3 rd party and better phasing code To develop a sequence aligner for “augmented alignment” for faster Quiver consensus process Want to attack the algorithm problem for polyploid assembly? Let us know. We can work together. Arabidopsis Thaliana F1 Diploid Assembly Statistics Strain Inbred Col-0 Inbred Cvi-0 Col-0 x Cvi-0 F1 Assembler CA/HGAP CA/HGAP FALCON FALCON-Unzip FALCON-Unzip primary contigs primary contigs haplotigs Assembly Size (Mb) 126 119 143 140 105 # contigs 1325 194 426 172 248 N50 size (Mb) 6.210 4.79 7.92 7.96 6.92 Max Contig size (Mb) 10.25 11.25 13.39 13.32 11.65 126 119 143 57 140 105 0 20 40 60 80 100 120 140 160 Inbred Col-0 Inbred Cvi-0 F1 FALCON p-contigs F1 FALCON a-contigs F1 Unzip p-contigs F1 Unzip haplotigs Assembly Size (Mb) 6.21 4.79 7.92 0.146 7.96 6.92 0 2 4 6 8 10 N50 size (Mb) Construct Arabidopsis thaliana Col-0 x CVI-0 Diploid F1 Line Col-0 Cvi-0 Col-0 x Cvi-0 Two inbred lines sequenced in 2013 (P4-C2 chemistry), assembled as haploid genomes F1 line constructed and sequenced in 2015 (P6-C4 chemistry), assembled with FALCON and FALCON-Unzip (18.5 Gbp reads, reads N50 ~ 17.5 kb) F1 Diploid Assembly vs. Parental Line Assemblies Col-0 x Cvi-0 assembly Col-0 assembly Cvi-0 assembly haplotigs primary contig Haploid-like contig in the inbred-line assemblies Many variations Few or no variations Few or no variations Many variations Clavicorona pyxidata (Coral fungus) Cabernet Sauvignon + * Taeniopygia guttata (Zebra finch) $ * Human* Haploid Genome Size: ~ 44 Mb ~ 500 Mb ~1.2 Gb ~ 3 Gb Sequencing Coverage 4.1 Gb / 95x 73.7 Gb / 147x 50 Gb / 42x 255 Gb / 85x Primary contig size 41.9 Mb 591.0 Mb 1.07 Gb 2.76 Gb Primary contig N50 1.5 Mb 2.2 Mb 3.23 Mb 22.9 Mb Haplotig size 25.5 Mb 372.2 Mb 0.84 Gb 2.0 Gb Haplotig N50 872 kb 767 kb 910 kb 330 kb + Led by Cantu lab, UC Davis and Cramer lab, UN Reno $ Thanks to Erich Jarvis for permission to use preliminary data * Preliminary results. Fast file system and efficient computational infrastructure are currently needed for large genomes. Image credits: Pajoro, et al, Trends in Plant Science 21.1 (2016): 6-8. Example of the FALCON-Unzip Process Additional Diploid Genome Assembly Examples SUMMARY

Un-zipping diploid genomes – revealing all kinds of heterozygous … · 2019-04-10 · For the fungal genome, we achieve an N50 of 1.53 Mb (of the 1n assembly contigs) of the ~42

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Un-zipping diploid genomes – revealing all kinds of heterozygous … · 2019-04-10 · For the fungal genome, we achieve an N50 of 1.53 Mb (of the 1n assembly contigs) of the ~42

For Research Use Only. Not for use in diagnostics procedures. © Copyright 2016 by Pacific Biosciences of California, Inc. All rights reserved. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel

are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.

Un-zipping Diploid Genomes - Revealing All Kinds of Heterozygous Variants from Comprehensive Haplotig Assembly Jason Chin1, Paul Peluso1, David Rank1, Maria Nattestad2, Fritz J. Sedlazeck2, Michael Schatz2, Alicia Clum3, Kerrie Barry3, Alex Copeland3,

Ronan O’Malley4, Chongyuan Luo4, Joseph Ecker4

1PacBio, 2Cold Spring Harbor Laboratory, 3Joint Genome Institute, 4 HHMI / The Salk Institute

Outside of the simplest cases (haploid, bacteria, or inbreds), genomic information is not carried in a single reference per individual, but rather has higher

ploidy (n ≥ 2) for almost all organisms. The existence of two or more highly related sequences within an individual makes it extremely difficult to build high-

quality, highly contiguous genome assemblies from short DNA fragments. Based on the earlier work on a polyploidy aware assembler, FALCON

(https://github.com/PacificBiosciences/FALCON), we developed new algorithms and software (FALCON-unzip) for de novo haplotype reconstructions from

SMRT Sequencing data.

We apply the algorithms and the prototype software for (1) a highly repetitive diploid fungal genome (Clavicorona pyxidata) and (2) a F1 hybrid from two

inbred Arabidopsis strains: Cvi-0 and Col-0.

For the fungal genome, we achieve an N50 of 1.53 Mb (of the 1n assembly contigs) of the ~42 Mb 1n genome and an N50 of the haplotigs of 872 kb from a

95x read length N50 ~16 kb dataset. We find that ~45% of the genome is highly heterozygous and ~55% of the genome is highly homozygous. We

develop methods to assess the base-level accuracy and local haplotype phasing accuracy of the assembly with short-read data from an Illumina platform.

For the Arabidopsis F1 hybrid genome, we find that 80% of the genome can be separated into haplotigs. The long-range accuracy of phasing haplotigs is

evaluated by comparing them to the assemblies from the two inbred parental lines.

The more complete view of both haplotypes can help to generate better annotations, provide useful biological insights, including information about all kinds

of heterozygous variants, and be used for studying differential allele expression in organisms. Finally, we apply FALCON/FALCON-Unzip on three other

organisms (grape, zebra finch, human) to demonstrate this is a general approach to solve diploid assembly problem with PacBio long-read data.

ABSTRACT AND BASIC IDEAS RESULTS

Why Do We See Bubbles in Assembly Graphs?

SNPs SNPs SNPs SVs SVs

SNPs SNPs SNPs

SVs SVs

Haplotype 1

Haplotype 2

Genome Sequences

Assembly Graph

In most OLC assembler designs, the overlapper does not catch differences

at SNP level but structural variations are naturally segregated.

Associate contig 1

(Alternative allele)

Associate contig 2

(Alternative allele)

SNPs SNPs SNPs SVs SVs

Primary contig

Augmented with haplotype

information of each reads

FALCON

FALCON-Unzip

Updated primary contig + “associate haplotigs”

FALCON Assembly & FALCON-Unzip Tool

How to Resolve Structural Variations & Het-SNPs Phasing at Once?

3 kb – 100 kb

300 b – 10 kb

Structural

Variations

het-SNP

Overlap-layout

process catches

SV haplotypes

✗Collapsed paths

when there is no

SV

Easy to group

SNPs/reads into

different

haplotypes

✗No phasing

information

associated with

SVs

Nearby SVs may

be phased

automatically

✗Haplotype-fused

paths

Haplotype-

specific paths

✗More

fragmented

contigs

Information Sources Pros & Cons Assembly graph features

Tiling path of haplotype 0

Tiling path of haplotype 1

Remove edges connecting

different haplotypes

“FA

LC

ON

-Un

zip

Pro

ce

ss”

~ 4.80 Mb

Add missing haplotype specific nodes & edges

Remove edges that connect different haplotypes

The final graph comprises a primary contig (blue),

a major haplotig (red) and other smaller haplotigs.

4 major haplotype phased blocks determined by het-

SNPs

Un-phased region

• SMRT Sequencing provides a single data type for routine diploid assembly

• Large genomes are more computationally challenging, but mostly an engineering problem now

• Future Work:

• Haplotype phasing improvement, incorporate other 3rd party and better phasing code

• To develop a sequence aligner for “augmented alignment” for faster Quiver consensus process

• Want to attack the algorithm problem for polyploid assembly? Let us know. We can work together.

Arabidopsis Thaliana F1 Diploid

Assembly Statistics

Strain Inbred Col-0 Inbred Cvi-0 Col-0 x Cvi-0

F1

Assembler CA/HGAP CA/HGAP FALCON FALCON-Unzip FALCON-Unzip

primary contigs primary contigs haplotigs

Assembly Size (Mb) 126 119 143 140 105

# contigs 1325 194 426 172 248

N50 size (Mb) 6.210 4.79 7.92 7.96 6.92

Max Contig size (Mb) 10.25 11.25 13.39 13.32 11.65

126

119

143

57

140

105

0 20 40 60 80 100 120 140 160

Inbred Col-0

Inbred Cvi-0

F1 FALCON p-contigs

F1 FALCON a-contigs

F1 Unzip p-contigs

F1 Unzip haplotigs

Assembly Size (Mb)

6.21

4.79

7.92

0.146

7.96

6.92

0 2 4 6 8 10

N50 size (Mb)

Construct Arabidopsis thaliana Col-0

x CVI-0 Diploid F1 Line

Col-0 Cvi-0

Col-0 x Cvi-0

• Two inbred lines

sequenced in 2013

(P4-C2 chemistry),

assembled as haploid

genomes

• F1 line constructed

and sequenced in

2015 (P6-C4

chemistry), assembled

with FALCON and

FALCON-Unzip (18.5 Gbp reads, reads

N50 ~ 17.5 kb)

F1 Diploid Assembly vs.

Parental Line Assemblies

Col-0 x Cvi-0 assembly

Col-0 assembly

Cvi-0 assembly

haplotigs

primary contig

Haploid-like contig in the inbred-line assemblies

Many variations

Few or no

variations

Few or no variations

Many variations

Clavicorona

pyxidata

(Coral fungus)

Cabernet

Sauvignon+*

Taeniopygia

guttata

(Zebra finch)$* Human*

Haploid Genome

Size: ~ 44 Mb ~ 500 Mb ~1.2 Gb ~ 3 Gb

Sequencing

Coverage 4.1 Gb / 95x

73.7 Gb /

147x 50 Gb / 42x 255 Gb / 85x

Primary contig size 41.9 Mb 591.0 Mb 1.07 Gb 2.76 Gb

Primary contig N50 1.5 Mb 2.2 Mb 3.23 Mb 22.9 Mb

Haplotig size 25.5 Mb 372.2 Mb 0.84 Gb 2.0 Gb

Haplotig N50 872 kb 767 kb 910 kb 330 kb

+ Led by Cantu lab, UC Davis and Cramer lab, UN Reno

$ Thanks to Erich Jarvis for permission to use preliminary data

* Preliminary results. Fast file system and efficient computational infrastructure are currently needed for

large genomes.

Image credits: Pajoro, et al, Trends in

Plant Science 21.1 (2016): 6-8.

Example of the

FALCON-Unzip

Process

Additional Diploid Genome Assembly

Examples

SUMMARY