
ntEdit: scalable genome sequence polishing


[Figure, controlled experiments, panels A-K.
Panels A-D: # mismatches per 100kbp, # indels per 100kbp, time (min) and peak memory (GB) versus coverage (10-50). Tools: Baseline, GATK, Racon, Pilon, ntEdit k=20, ntEdit k=25, ntEdit k=30, ntEdit iterative k=35,30,25.
Panels E-H: the same four metrics versus coverage (20-50). Tools: Baseline, GATK, Racon, Pilon, ntEdit k=25, ntEdit k=30, ntEdit k=35, ntEdit iterative k=40,35,30,25.
Panels I-K: errors per 100kbp (indels and mismatches), time (min) and peak memory (GB) versus k (30-70). Tools: Baseline, GATK, Racon, Pilon, ntEdit.]

                          #Edits (M)   BUSCO (%)
Base 10xG linked reads    N/A          5,670 (90.7)
+ntEdit k50i1d1           0.2          5,677 (90.8)
Base PacBio               N/A          1,248 (31.6)
+ntEdit k40i3d3           59.0         1,354 (34.3)


11 Simao, 2015 12 Koren, 2017

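[Figure: BUSCOs (% of 1,440 searched), split into complete, fragmented and missing, for four assemblies (bar labels as extracted: canu, ntedit, pilon, +ntedit). Values as extracted, one per bar:
complete     96.0   95.4   93.1   56.1
fragmented    1.3    1.6    2.5    7.7
missing       2.7    3.1    4.4   36.2]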

Polishing

[Figure, effect of Bloom filter FPR: errors per 100 kbp (indels and mismatches) versus Bloom filter false positive rate (0.0001 to 0.1), shown for 20X coverage and 40X coverage.]
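The false positive rate on this axis is tied to how many bits the filter is given. As a minimal sketch, assuming the standard Bloom filter approximation p ≈ (1 - e^(-h*n/m))^h for n items, m bits and h hash functions (a textbook formula, not something stated on the poster), the code below estimates the bits needed for a target FPR such as the ~0.0005 with 3 hash functions quoted for the experiments. The function names are illustrative and are not part of ntHits or ntEdit.

```python
import math


def bloom_fpr(num_bits, num_items, num_hashes):
    """Approximate false positive rate of a Bloom filter."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def bloom_bits_for_fpr(num_items, target_fpr, num_hashes):
    """Smallest bit count whose approximate FPR does not exceed target_fpr."""
    per_hash = target_fpr ** (1.0 / num_hashes)
    return math.ceil(-num_hashes * num_items / math.log(1.0 - per_hash))


# Example: one million distinct kmers, FPR ~ 0.0005, 3 hash functions.
bits = bloom_bits_for_fpr(1_000_000, 0.0005, 3)
bits_per_kmer = bits / 1_000_000
```

Under this approximation the example works out to a little over 36 bits per kmer, which gives a feel for why a lower FPR costs memory.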

[Workflow diagram: sequence reads (NGS reads) from a haploid or diploid DNA source -> ntHits -> kmers -> Bloom filter; Sequence -> ntEdit -> Edited sequence.]

1. Check each word of size k (kmer) in filter

2. Check k kmer subset (Sk) for absence
   If Sk- ≥ k/x :

3. Permutate 3’-end base
   Check k kmer* subset (Sk_alt) for presence
   If Sk_alt+ ≥ k/y : Apply change to sequence, resume 3

4. Insert 3’-end positions
   Check k kmer subset presence
   If Sk_alt+ ≥ k/y : Apply change, resume 1

5. Delete 3’-end positions
   Check k kmer subset presence
   If Sk_alt+ ≥ k/y : Apply change, resume 1

*kmers with alternate 3’-end base (k_alt)

Definitions
kmer ....... word of length k
Sk ......... subset of overlapping k kmers
Sk- ........ subset of absent, overlapping k kmers
Sk_alt+ .... subset of present, overlapping, k alternate kmers
x .......... leniency factor 1, test for absence
y .......... leniency factor 2, test for presence
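The steps and definitions above can be sketched as follows for the substitution case only. This is a simplified illustration under stated assumptions, not the ntEdit implementation: a Python set stands in for the Bloom filter, every kmer overlapping the 3’-end base is used as the subset, the alternate base with the highest support is chosen, the scan simply moves on after an edit instead of reproducing the "resume" rules, and the indel steps 4 and 5 are left out. The function names and the example values x=5, y=9 are hypothetical choices taken from the ranges shown in the tuning plots.

```python
def kmers_covering(seq, pos, k):
    """Return every kmer of seq that contains position pos (at most k of them)."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(first, last + 1)]


def polish_substitutions(draft, read_kmers, k, x=5, y=9):
    """Fix substitutions in draft using a set of kmers observed in the reads."""
    seq = draft
    for i in range(len(seq) - k + 1):
        # Step 1: a kmer found in the filter needs no action.
        if seq[i:i + k] in read_kmers:
            continue
        end = i + k - 1  # 3'-end base of the absent kmer
        # Step 2: count absent kmers overlapping that base (Sk-).
        absent = sum(km not in read_kmers for km in kmers_covering(seq, end, k))
        if absent < k / x:
            continue
        # Step 3: try each alternate 3'-end base and count present kmers (Sk_alt+).
        best_base, best_support = None, 0
        for base in "ACGT":
            if base == seq[end]:
                continue
            alt = seq[:end] + base + seq[end + 1:]
            present = sum(km in read_kmers for km in kmers_covering(alt, end, k))
            if present >= k / y and present > best_support:
                best_base, best_support = base, present
        if best_base is not None:
            seq = seq[:end] + best_base + seq[end + 1:]
    return seq


# Usage: build the kmer set from an error-free toy sequence, then polish a
# copy of it that carries one substitution.
reference = "ACGTTGCATGGCTAAGCTTAGGCATCGA"
k = 5
read_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
draft = reference[:12] + "G" + reference[13:]
polished = polish_substitutions(draft, read_kmers, k)
```

In this toy case the edited base is restored because only the original base makes all five overlapping kmers present again. A real Bloom filter would add occasional false positives to both tests, which is what the leniency factors x and y are there to absorb.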

[Figure, tuning x and y: sensitivity versus FDR, one panel per value of x, points labelled with y.
E. coli (30x, k25): x = 4, 5, 6, 8; y = 4, 5, 6, 8, 9, 12; FDR 0 to 0.0003; sensitivity 0.96 to 0.98.
C. elegans (30x, k35): x = 4, 5, 12, 20; y = 4, 5, 6, 8, 9, 12, 16, 20, 24, 28, 32, 35; FDR 0 to 0.04; sensitivity 0.94 to 0.99.
H. sapiens Chr21 (17x, k50): x = 4, 5, 20, 30; y = 4, 5, 6, 8, 9, 12, 20, 30, 40, 50; FDR 0 to 0.10; sensitivity 0.85 to 0.93.]

4 McKenna, 2010   5 Vaser, 2017   6 Jain, 2018   7 Pendleton, 2015

*Single Molecule Sequencing draft genomes
**Time for pipeline
***15GB RAM

[Figure: # mismatches per 100kbp and # indels per 100kbp versus coverage, one series per setting: c1, c2, c3, c4, auto.]

Method

https://github.com/bcgsc/nthits https://github.com/bcgsc/ntedit

Human*

Controlled

Software

Funding

Cacao8, Beluga9, Axolotl10

www.bcgsc.ca · [email protected]

Tuning

Experimental

ntEdit
René Warren, Jessica Zhang, Lauren Coombe, Hamid Mohamadi, Inanç Birol

Effect of Bloom filter FPR

coverage threshold (-c)

FPR ~ 0.0005

Threshold error kmers

ntCard 2

Controlled C. elegans sequence data

Base Illumina        White spruce   Interior spruce
Subs. (M)            49.39          47.29
Indels (M)           1.11           1.02
Time, ntHits         3h29m          4h23m
Time, ntEdit         25m            23m
RAM (GB), ntHits     207.8          206.9
RAM (GB), ntEdit     90.2           86.1

Polish w/ 54X Illumina

Baseline SMS*   Baseline BUSCO (%)   Tool        Time**   Edits (M)   BUSCO (%)
Nanopore 6      5,647 (91.2)         GATK 4      41h45    0.97        5,654 (91.3)
                                     ntEdit***   2h18     0.95        5,670 (91.6)
                                     Racon 5     45h54    N/A         5,681 (91.7)
PacBio 7        5,285 (85.4)         GATK        42h21    2.66        5,285 (85.4)
                                     ntEdit***   2h10     3.63        5,651 (91.3)
                                     Racon       40h55    N/A         5,670 (91.6)

8 Morrissey, 2019 9 Jones, 2017 10 Nowoshilow, 2018

1Mikheenko, 2018

http://birol-lab.ca

2Mohamadi, 2017

http://renewarren.ca

ntHits 2h, 40GB / ntEdit 5m, 12GB

Experimental

Results

(FPR)

haploid

scalable genome sequence polishing


*Values of y are indicated on the plot

ntHits ntEdit

Bloom filter <bit size

2.3 Gbp genome

20 Gbp genome

32 Gbp genome

ntHits 3h, 210GB / ntEdit 20m, 95GB

[laurasiatheria]

[tetrapoda]

< 8X Illumina

60X Illumina

diploid

RAM (GB)

Time

net

7 23

34

0

366

385

Baseline SMS* BUSCO%

Δ7

Δ106

Check absence

Check presence

NGS reads (e.g. Illumina)

SMS, 10xG, Illumina genome assembly gene sequence, etc.

New feature (v1.2.0)

-m option: editing mode 0-2 [default=0]
  0: best substitution, or first supported indel
  1: best substitution, or best indel
  2: best edit overall (exhaustive)
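One possible reading of the three modes, written as a small selection function. This is an interpretation of the short descriptions above and not ntEdit's code: it assumes each candidate edit has already passed the presence test and carries a support count, that candidates are listed in the order they were tested, and that "best" means highest support. The tuple layout and the name `choose_edit` are hypothetical.

```python
def choose_edit(candidates, mode=0):
    """Pick one edit from (kind, edit, support) tuples; kind is 'sub', 'ins' or 'del'."""
    if mode not in (0, 1, 2):
        raise ValueError("mode must be 0, 1 or 2")

    def support(candidate):
        return candidate[2]

    if mode == 2:
        # Exhaustive: best edit overall, whatever its kind.
        return max(candidates, key=support, default=None)
    subs = [c for c in candidates if c[0] == "sub"]
    if subs:
        # Modes 0 and 1 both prefer the best substitution.
        return max(subs, key=support)
    indels = [c for c in candidates if c[0] in ("ins", "del")]
    if not indels:
        return None
    # Mode 0 stops at the first supported indel; mode 1 compares them all.
    return indels[0] if mode == 0 else max(indels, key=support)


candidates = [("ins", "+A", 3), ("sub", "C", 4), ("del", "-1", 6)]
indels_only = [("ins", "+A", 3), ("del", "-1", 6)]
```

Under this reading the modes trade speed for thoroughness: mode 0 can return as soon as one indel qualifies, while mode 2 has to score every candidate before deciding.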

Testing


Copy reference (ref): Subs. 0.001, Indels 0.0001

Simulate: PE100, 300bp frag., err 0.1%

Run

Assess: QUAST 1

FPR~0.0005, 3 hash fn
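The first step of this controlled setup, copying a reference and seeding it with errors, might look like the sketch below. It uses the substitution rate 0.001 and indel rate 0.0001 listed above; everything else is an assumption made for illustration: errors are drawn independently per base, an indel is an insertion or a deletion of a single base with equal probability, and the function name `introduce_errors` is hypothetical. Read simulation and the QUAST assessment are not covered.

```python
import random


def introduce_errors(ref, sub_rate=0.001, indel_rate=0.0001, rng=None):
    """Return a copy of ref with random substitutions and 1 bp indels."""
    rng = rng or random.Random(0)
    out = []
    for base in ref:
        r = rng.random()
        if r < sub_rate:
            # Substitution: replace with a different base.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < sub_rate + indel_rate:
            if rng.random() < 0.5:
                # Insertion: keep the base and add a random one after it.
                out.append(base)
                out.append(rng.choice("ACGT"))
            # Otherwise deletion: the base is dropped.
        else:
            out.append(base)
    return "".join(out)


ref = "ACGT" * 250
draft_a = introduce_errors(ref, rng=random.Random(42))
draft_b = introduce_errors(ref, rng=random.Random(42))
all_subs = introduce_errors(ref, sub_rate=1.0, indel_rate=0.0)
```

Passing a seeded `random.Random` keeps the mutated copy reproducible, so a polished result can be compared with the same draft across runs.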


‘Haploidizing’

Spruce13

13 Warren, 2015

Canu12

+ntEdit

+pilon

+pilon +ntEdit

400 Mbp genome

Base Nanopore

k35 30 27 25 23 i5 d5 m1

ntHits 15m 4GB / ntEdit 5m <2GB

• Polish w/ 30X PE100 Illumina reads
• Assess completeness / accuracy w/ BUSCO 11: single-copy gene orthologs [embryophyta]

Acknowledgements

Controlled

Warren et al. 2019. Bioinformatics. DOI: 10.1093/bioinformatics/btz400

Reference

E. coli

3 Walker, 2014

Bloom filter

Bloom filter

Summary

Short

Linked

Long

Sealer

Assembly Correction Scaffolding Gap-filling Polishing

Illumina, SMS drafts (Nanopore/PacBio), etc.

Read Technology

https://github.com/bcgsc

Scalable solutions for genome assembly

n=6,253

n=3,950

[euarchontoglires] n=6,192

17X Illumina

(pseudohap.)

FPR~0.0005 3 hash fn

C. elegans

H. sapiens chr21

*kmers with alternate 3’end base (k_alt)

[Figure: memory (GB) and time (hours) versus reads (billion) and bases (billion). Genome: cacao, beluga, human, spruce, axolotl. 48 threads (250bp reads, k40). Other extracted labels: k50i3d3, k40i3d3.]

ntEdit k50i1d1

rate~0.0023

rate~0.0001

ntHits
