ntEdit: scalable genome sequence polishing

[Figure, panels A-D: mismatches per 100 kbp (A), indels per 100 kbp (B), time in min (C) and peak memory in GB (D) versus coverage (10 to 50). Tools: Baseline, GATK, Racon, Pilon, ntEdit k=20, ntEdit k=25, ntEdit k=30, ntEdit iterative k=35,30,25.]

[Figure, panels E-H: mismatches per 100 kbp (E), indels per 100 kbp (F), time in min (G) and peak memory in GB (H) versus coverage (20 to 50). Tools: Baseline, GATK, Racon, Pilon, ntEdit k=25, ntEdit k=30, ntEdit k=35, ntEdit iterative k=40,35,30,25.]

[Figure, panels I-K: errors per 100 kbp, split into indels and mismatches (I), time in min (J) and peak memory in GB (K) versus k (30 to 70). Tools: Baseline, GATK, Racon, Pilon, ntEdit.]

Assembly                  #Edits (M)   BUSCO (%)
Base 10xG linked reads    N/A          5,670 (90.7)
+ntEdit k50i1d1           0.2          5,677 (90.8)
Base PacBio               N/A          1,248 (31.6)
+ntEdit k40i3d3           59.0         1,354 (34.3)


11 Simao, 2015 12 Koren, 2017

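[Figure: BUSCOs (% of 1,440 searched), shown as complete / fragmented / missing, for the canu assembly and after polishing (ntedit, pilon, pilon +ntedit). Bar values (complete / fragmented / missing): 96.0 / 1.3 / 2.7; 95.4 / 1.6 / 3.1; 93.1 / 2.5 / 4.4; 56.1 / 7.7 / 36.2.]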

Polishing

[Figure: errors per 100 kbp (indels, mismatches) versus Bloom filter false positive rate (0.0001 to 0.1), at 20X coverage and at 40X coverage.]

Sequence reads (haploid or diploid DNA source) -> ntHits -> kmers -> Bloom filter
Sequence + Bloom filter -> ntEdit -> Edited sequence

1. Check each word of size k (kmer) in filter
2. Check k kmer subset (Sk) for absence
   If Sk- ≥ k/x :
3. Permutate 3’-end base
   Check k kmer* subset (Sk_alt) for presence
   If Sk_alt+ ≥ k/y : Apply change to sequence, resume 3
4. Insert 3’-end positions
   Check k kmer subset presence
   If Sk_alt+ ≥ k/y : Apply change, resume 1
5. Delete 3’-end positions
   Check k kmer subset presence
   If Sk_alt+ ≥ k/y : Apply change, resume 1

*kmers with alternate 3’-end base (k_alt)

[Diagram labels: ref, copy, edited, draft, Bloom filter, ntHits, ntEdit, kmers, NGS reads]

Definitions
kmer: word of length k
Sk: subset of overlapping k kmers
Sk-: subset of absent, overlapping k kmers
Sk_alt+: subset of present, overlapping, k alternate kmers
x: leniency factor 1, test for absence
y: leniency factor 2, test for presence
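The absence test (k/x) and presence test (k/y) defined above can be illustrated with a minimal Python sketch. It is not the ntEdit implementation: a plain Python set stands in for the Bloom filter, only 3’-end base substitutions are handled (no insertions or deletions), and the function names and the default values of x and y are illustrative assumptions.

```python
def kmers_covering(seq, pos, k):
    """Return every kmer of `seq` that overlaps position `pos`."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(first, last + 1)]


def polish_substitutions(draft, kmer_set, k, x=5.0, y=9.0):
    """Substitution-only sketch of the k/x absence and k/y presence tests.

    `kmer_set` is an exact set standing in for the Bloom filter
    (an assumption made for simplicity).
    """
    seq = draft
    for i in range(len(seq) - k + 1):
        # Step 1: is the current kmer in the filter?
        if seq[i:i + k] in kmer_set:
            continue
        pos = i + k - 1  # 3'-end base of the absent kmer
        # Step 2: count absent kmers among those overlapping the 3'-end base.
        absent = sum(km not in kmer_set for km in kmers_covering(seq, pos, k))
        if absent < k / x:
            continue
        # Step 3: try each alternate 3'-end base and count supporting kmers.
        best_base, best_support = None, 0
        for base in "ACGT":
            if base == seq[pos]:
                continue
            candidate = seq[:pos] + base + seq[pos + 1:]
            support = sum(km in kmer_set
                          for km in kmers_covering(candidate, pos, k))
            if support >= k / y and support > best_support:
                best_base, best_support = base, support
        if best_base is not None:
            seq = seq[:pos] + best_base + seq[pos + 1:]
    return seq


# Toy usage: one substitution introduced into a short reference.
reference = "ACGTTGCATGGACCTAGGATCCAGT"
k = 5
kmer_set = {reference[i:i + k] for i in range(len(reference) - k + 1)}
draft = reference[:12] + "T" + reference[13:]
polished = polish_substitutions(draft, kmer_set, k)
print(polished == reference)
```

Because the sketch only edits the 3’-end base of an absent kmer, an error within the first k-1 bases of the sequence is left untouched.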

[Figure: sensitivity versus false discovery rate (FDR), one panel per value of x, points labelled with the value of y.
E. coli (30x, k25): x = 4, 5, 6, 8; y from 4 to 12
C. elegans (30x, k35): x = 4, 5, 12, 20; y from 4 to 35
H. sapiens Chr21 (17x, k50): x = 4, 5, 20, 30; y from 4 to 50]

4 McKenna, 2010 5 Vaser, 2017 6 Jain, 2018 7 Pendleton, 2015

*Single Molecule Sequencing draft genomes
**Time for pipeline
***15GB RAM

[Figure: mismatches per 100 kbp and indels per 100 kbp versus coverage (20 to 50), for c1, c2, c3, c4 and auto.]

Method

https://github.com/bcgsc/nthits https://github.com/bcgsc/ntedit

Human*

Controlled

Software

Funding

Cacao8, Beluga9, Axolotl10

www.bcgsc.ca | [email protected]

Tuning

Experimental

ntEdit
René Warren, Jessica Zhang, Lauren Coombe, Hamid Mohamadi, Inanç Birol

Effect of Bloom filter FPR

coverage threshold (-c)

FPR ~ 0.0005

Threshold error kmers

ntCard 2

Controlled C. elegans sequence data

Base Illumina

                    White spruce   Interior spruce
Subs. (M)           49.39          47.29
Indels (M)          1.11           1.02
Time, ntHits        3h29m          4h23m
Time, ntEdit        25m            23m
RAM (GB), ntHits    207.8          206.9
RAM (GB), ntEdit    90.2           86.1

Polish w/ 54X Illumina

Baseline BUSCO (%)         Tool        Time**   Edits (M)   BUSCO (%)
Nanopore6  5,647 (91.2)    GATK4       41h45    0.97        5,654 (91.3)
                           ntEdit***   2h18     0.95        5,670 (91.6)
                           Racon5      45h54    N/A         5,681 (91.7)
PacBio7    5,285 (85.4)    GATK        42h21    2.66        5,285 (85.4)
                           ntEdit***   2h10     3.63        5,651 (91.3)
                           Racon       40h55    N/A         5,670 (91.6)

8 Morrissey, 2019 9 Jones, 2017 10 Nowoshilow, 2018

1 Mikheenko, 2018

http://birol-lab.ca

2 Mohamadi, 2017

http://renewarren.ca

ntHits 2h, 40GB / ntEdit 5m, 12GB

Experimental

Results

(FPR)

haploid


E. coli 30x k25
C. elegans 30x k35
H. sapiens chr21 17x k50

Sensitivity

*Values of y are indicated on the plot
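The poster does not spell out how sensitivity and FDR were computed. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and FDR = FP / (TP + FP), applied to sets of positions; the function name and the position-set representation are assumptions made for illustration.

```python
def sensitivity_and_fdr(true_errors, edits):
    """Compare the positions that were edited with the positions of known errors.

    true_errors: set of positions where errors were introduced (hypothetical input)
    edits:       set of positions changed by the polisher (hypothetical input)
    """
    tp = len(true_errors & edits)   # errors that were edited
    fn = len(true_errors - edits)   # errors that were missed
    fp = len(edits - true_errors)   # edits made where there was no error
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, fdr


# Toy usage: 4 known errors, 4 edits, 3 of which hit a known error.
sens, fdr = sensitivity_and_fdr({10, 20, 30, 40}, {10, 20, 30, 55})
print(sens, fdr)  # 0.75 0.25
```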

ntHits ntEdit

Bloom filter <bit size

2.3 Gbp genome
20 Gbp genome
32 Gbp genome

ntHits 3h, 210GB / ntEdit 20m, 95GB

[laurasiatheria]
[tetrapoda]
< 8X Illumina
60X Illumina

diploid

RAM (GB)
Time
net: 7, 23, 34; 0, 366, 385

Baseline SMS* BUSCO%

Δ7

Δ106

Check absence
Check presence

NGS reads (e.g. Illumina)

SMS, 10xG, Illumina genome assembly gene sequence, etc.

New feature (v1.2.0)

-m option: editing mode 0-2 [default=0]
  0: best substitution, or first supported indel
  1: best substitution, or best indel
  2: best edit overall (exhaustive)

Testing


1

False discovery rate

Copy reference (ref)

Subs. 0.001 Indels 0.0001

Simulate

Run

Assess QUAST1

PE100, 300bp frag. err0.1%

FPR~0.0005 3 hash fn
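To give a feel for what an FPR of about 0.0005 with 3 hash functions implies, the sketch below applies the standard Bloom filter approximation, FPR ≈ (1 - e^(-h·n/m))^h, for n elements, m bits and h hash functions. This is the textbook formula, not necessarily how ntHits or ntEdit size their filters, and the function names are illustrative.

```python
import math


def bloom_fpr(bits, elements, hashes):
    """Approximate false positive rate of a Bloom filter."""
    return (1.0 - math.exp(-hashes * elements / bits)) ** hashes


def bits_per_element(target_fpr, hashes):
    """Bits per inserted element needed to reach target_fpr (formula inverted)."""
    return -hashes / math.log(1.0 - target_fpr ** (1.0 / hashes))


# Setting quoted on the poster: FPR ~ 0.0005 with 3 hash functions.
bpe = bits_per_element(0.0005, 3)
print(f"{bpe:.1f} bits per kmer")

# Round trip: sizing a filter for 1e9 kmers recovers the target rate.
n = 1_000_000_000
print(bloom_fpr(bpe * n, n, 3))
```

Under this approximation, about 36 bits per kmer are needed for this setting.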


‘Haploidizing’

Spruce13

13 Warren, 2015

Canu12

+ntEdit

+pilon

+pilon +ntEdit

400 Mbp genome

Base Nanopore

k35 30 27 25 23 i5 d5 m1
ntHits 15m, 4GB / ntEdit 5m, <2GB

• Polish w/ 30X PE100 Illumina reads
• Assess completeness / accuracy w/ BUSCO11: single-copy gene orthologs

[embryophyta]

Acknowledgements

Controlled

Warren et al. 2019. Bioinformatics. DOI: 10.1093/bioinformatics/btz400

Reference

E. coli

3 Walker, 2014

Bloom filter

Bloom filter

Summary

Short

Linked

Long

S

Sea

ler Assembly Correction Scaffolding Gap-filling Polishing

Illumina, SMS drafts (Nanopore/PacBio), etc.

Read Technology


https://github.com/bcgsc

Scalable solutions for genome assembly

n=6,253
n=3,950
[euarchontoglires] n=6,192
17X Illumina

(pseudohap.)

FPR~0.0005 3 hash fn

C. elegans

H. sapiens chr21

*kmers with alternate 3’end base (k_alt)

Genome: cacao, beluga, human, spruce, axolotl

48 threads (250bp reads, k40)

[Figure: memory (GB) and time (hours) versus bases (billion) and reads (billion); labels: k50i3d3, k40i3d3.]

ntEdit k50i1d1

rate~0.0023

rate~0.0001

ntHits

1. Check kmer
