Fast and sensitive mapping of nanopore sequencing reads with GraphMap
Niranjan Nagarajan, Genome Institute of Singapore


Page 1

Fast and sensitive mapping of nanopore sequencing reads with GraphMap

Niranjan Nagarajan, Genome Institute of Singapore

Page 2

[Figure: timeline with milestones labeled 1960s, 1980s, 2003, 2011, 2014, and 2015]

Page 3

[Figure labels: "DR. LOO", "WIMP"]

Research → Medical Genomics → Consumer Genomics

Operating System = Sequence Alignment

Page 4

Mapping nanopore reads to human genome

[Figure: precision (%) vs. recall (%), both axes 0 to 100, for LAST, BWA-MEM, DALIGNER, and BLASR on E. coli (4.6 Mbp), S. cerevisiae (12.1 Mbp), C. elegans (100 Mbp), H. sapiens chr3 (198 Mbp), and H. sapiens (3 Gbp)]

Simulated reads based on 1D error profiles for the E. coli dataset from Quick et al. 2014

Page 5

Mapping nanopore reads to human genome

[Figure: precision (%) vs. recall (%), both axes 0 to 100, for GraphMap, LAST, BWA-MEM, DALIGNER, and BLASR on E. coli (4.6 Mbp), S. cerevisiae (12.1 Mbp), C. elegans (100 Mbp), H. sapiens chr3 (198 Mbp), and H. sapiens (3 Gbp)]

Simulated reads based on 1D error profiles for the E. coli dataset from Quick et al. 2014

Page 6

GraphMap Design

Region Selection → Graph Mapping → LCSk + L1 → Align

1. Gapped Spaced Seeds

2. Graph Mapping

3. LCSk Chaining

[Figure: pipeline diagram with stages annotated Read, Region, Candidate Positions, Locations, and Alignments; inset shows a k-mer graph with vertices 0 to 6]

http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html

Page 7

1. Gapped Spaced Seeds

Legend: seed base; "don't care" base; unused reference base (unless colored)

Index construction
Reference bases: 1-14
Selected bases: 1-6, 8-14
Indexed seed: 1-6, 8-13

Index lookup
Query bases: 1-14
Mismatch seed (base 7): 1-6, 8-13
Insertion seed (base 7/8): 1-6, 9-14
Deletion seed (between bases 6 and 7): 1-12

Shapes: 6-1-6, 4-1-4-1-4
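The index construction and lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GraphMap's implementation: the helper names (`key_at`, `build_index`, `lookup_variants`, `lookup`), the dictionary-based index, and the toy reference are all hypothetical. The shape string encodes 6-1-6, and the three lookup variants let the gap span one, two, or zero query bases, matching the mismatch, insertion, and deletion seeds above.

```python
SHAPE = "1111110111111"  # 6-1-6 shape: '1' = seed base, '0' = "don't care" base


def key_at(seq, start, shape):
    """Concatenate the bases of seq that fall under the '1' positions of shape.

    Returns None when the shape would run past the end of seq.
    """
    if start + len(shape) > len(seq):
        return None
    return "".join(seq[start + i] for i, c in enumerate(shape) if c == "1")


def build_index(ref, shape=SHAPE):
    """Map each gapped key to the reference positions where it starts."""
    index = {}
    for pos in range(len(ref) - len(shape) + 1):
        index.setdefault(key_at(ref, pos, shape), []).append(pos)
    return index


def lookup_variants(shape=SHAPE):
    """Query-side shapes in which the gap spans one, two, or zero bases."""
    return {
        "mismatch": shape,                      # bases 1-6, 8-13
        "insertion": shape.replace("0", "00"),  # bases 1-6, 9-14
        "deletion": shape.replace("0", ""),     # bases 1-12
    }


def lookup(index, query, start=0, shape=SHAPE):
    """Return reference hits for each lookup variant at one query position."""
    hits = {}
    for kind, variant in lookup_variants(shape).items():
        key = key_at(query, start, variant)
        hits[kind] = index.get(key, []) if key is not None else []
    return hits


# Toy reference (hypothetical); the seed of interest starts at position 2.
ref = "ACGTTGCATGCCAGTAGGCTA"
index = build_index(ref)

q_mismatch = "GTTGCACGCCAGTA"   # base 7 substituted (T -> C)
q_insertion = "GTTGCATAGCCAGT"  # extra base inserted after base 7
q_deletion = "GTTGCAGCCAGTAG"   # base 7 missing from the read

for name, q in [("mismatch", q_mismatch), ("insertion", q_insertion),
                ("deletion", q_deletion)]:
    print(name, lookup(index, q)[name])
```

Replacing every "don't care" position at once is a simplification for shapes with more than one gap, such as 4-1-4-1-4; it assumes the same kind of error at each gap.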

Page 8

Gapped spaced seeds retain sensitivity and recall

[Figure: precision and recall (%), axis 60 to 100, for GraphMap vs. k=13 seeds on N. meningitidis, E. coli, S. cerevisiae, C. elegans, and H. sapiens (chr3); series: GraphMap Precision, k=13 Precision, GraphMap Recall, k=13 Recall]

Page 9

2. Graph Mapping to get Anchors

Graph based search for MEMs with indels

[Figure: alignment graph for a short reference/query pair containing a mismatch and indels; the initial graph (vertices 0 to 6, edges labeled with short k-mers) is reduced to the final anchor graph with anchors CTA and AGA]

Vertex-centric parallelism

Page 10

3. Chaining Anchors with LCSk

Motivation: Longest Common Subsequence (LCS) provides fast alignments but allows arbitrary insertions and deletions

[Figure: anchors plotted between Region and Read]

Approximate alignments evaluated to select best region

Pavetic et al. 2014: O(n log(n))
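To make the LCSk idea concrete, here is a minimal sketch that counts the largest number of non-overlapping length-k substrings that appear in the same order in both sequences. It is the plain quadratic dynamic program, chosen for readability; it is not the O(n log(n)) algorithm cited on the slide and not GraphMap's code. The function name `lcsk` is hypothetical.

```python
def lcsk(a: str, b: str, k: int) -> int:
    """Max number of non-overlapping length-k substrings matched in order in a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = best count using the prefixes a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either drop the last base of a or the last base of b ...
            best = max(dp[i - 1][j], dp[i][j - 1])
            # ... or end a matching k-mer exactly at (i, j).
            if i >= k and j >= k and a[i - k:i] == b[j - k:j]:
                best = max(best, dp[i - k][j - k] + 1)
            dp[i][j] = best
    return dp[n][m]


print(lcsk("ABXCD", "ABYCD", 2))
```

With k = 1 this reduces to ordinary LCS. Larger k only rewards runs of at least k consecutive matches, which limits the arbitrary single-base insertions and deletions that plain LCS tolerates.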

Page 11

Other Features

1. Different Alignment Options (marginAlign)

2. Mapping Quality, BLAST-like E-value

3. Circular Genome Support

4. Technology Agnostic (w/o tuning parameters)
   – Illumina, PacBio, Ion Torrent

https://github.com/isovic/graphmap

Page 12

Application: SNV Calling (Ammar et al. 2015)

Number of on-target reads:
GraphMap: 6879
BWA-MEM: 5900
LAST: 5032
BLASR: 2284
marginAlign: 4683
DALIGNER: 1451

[Figure: read coverage and read alignments over CYP2D6 (chr22); CYP2D7 and CYP2D6 annotated with "94% Identity"]

                 LAST   marginAlign   BWA-MEM   BLASR   DALIGNER   GraphMap
Precision (%)    94     100 (36)      96        100     93         96
True Positives   49     1 (107)       47        43      75         86

LoFreq: https://github.com/csb5/lofreq; results in parentheses are for marginCaller

Page 13

Application: Structural Variants (large Insertions and Deletions)

[Figure: 4 kbp deletion spanned by GraphMap]

                 LAST   marginAlign   BWA-MEM   BLASR   DALIGNER   GraphMap
Precision (%)    0      50            67 (90)   94      0          100
Recall (%)       0      5             10 (45)   75      0          100
F1 Score (%)     0      9             17 (60)   83      0          100

Data from Quick et al. 2014 mapped to mutated reference

Window size > 20 bp; AF > 15%; results from LUMPY are in parentheses
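The F1 scores in the table follow from the precision and recall rows as their harmonic mean, F1 = 2PR / (P + R). The short check below recomputes them from the tabulated values; the function name `f1_score`, the row labels, and the convention of returning 0 when both inputs are 0 are assumptions made for this sketch.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0  # convention assumed here for the all-zero columns
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) pairs copied from the table
rows = {
    "LAST": (0, 0),
    "marginAlign": (50, 5),
    "BWA-MEM": (67, 10),
    "BWA-MEM (LUMPY)": (90, 45),
    "BLASR": (94, 75),
    "DALIGNER": (0, 0),
    "GraphMap": (100, 100),
}

for tool, (p, r) in rows.items():
    print(f"{tool}: F1 = {f1_score(p, r):.0f}%")
```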

Page 14

Application: Pathogen Identification

[Figure: bar charts (axes 0 to 14000 and 0 to 10000) comparing GraphMap, BWA-MEM, LAST, BLASR, and DALIGNER across reference sequences: S. enterica Typhi str. Ty2; S. enterica Typhi str. CT18; S. enterica Paratyphi A str.; S. enterica Paratyphi A str. ATCC 9150; S. enterica str. CT18 plasmid pHCM2; S. enterica Heidelberg str. SL476; S. enterica Paratyphi C strain RKS4594; S. enterica Agona str. SL483; S. enterica Paratyphi B str. SPB7; S. typhimurium LT2]

Database of 258 reference sequences

K-12 and BW2952 have >99% identity

Data from Quick et al. 2014

Page 15

GraphMap for Assembly

[Figure: sensitivity (0% to 100%) vs. read error rate (10%, 15%, 20%) for GraphMap, MHAP (nanopore fast), MHAP (pacbio fast), MHAP (pacbio sensitive), MHAP (default), Minimap (default), and Minimap (github params)]

30X coverage simulated reads; Nanopore 2D error profile; E. coli genome

Page 16

Croatian Science Foundation project UIP-11-2013-7353

Ivan Sovic, Mile Sikic, Swaine Chen

https://github.com/isovic/graphmap
http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html
