Fast and sensitive mapping of nanopore sequencing reads with GraphMap
Niranjan Nagarajan, Genome Institute of Singapore
1960’s 1980’s 2015
2003 2011 2014
DR. LOO
WIMP
Research → Medical Genomics → Consumer Genomics
Operating System = Sequence Alignment
Mapping nanopore reads to human genome
[Figure: precision and recall (0 to 100) for LAST, BWA-MEM, DALIGNER and BLASR on S. cerevisiae (12.1 Mbp), E. coli (4.6 Mbp), C. elegans (100 Mbp), H. sapiens chr3 (198 Mbp) and H. sapiens (3 Gbp)]
Simulated reads based on 1D error profiles for E. coli dataset from Quick et al 2014
Mapping nanopore reads to human genome
[Figure: precision and recall (0 to 100) for GraphMap, LAST, BWA-MEM, DALIGNER and BLASR on S. cerevisiae (12.1 Mbp), E. coli (4.6 Mbp), C. elegans (100 Mbp), H. sapiens chr3 (198 Mbp) and H. sapiens (3 Gbp)]
Simulated reads based on 1D error profiles for E. coli dataset from Quick et al 2014
GraphMap Design
[Figure: pipeline schematic. Stages: Region Selection, Graph Mapping, LCSk + L1, Align. Labels: Candidate Positions, Locations, Alignments; Region and Read; 2-mer graph with nodes GC, CT, TA, AA, AA, AG, GA at positions 0 to 6]
1. Gapped Spaced Seeds
2. Graph Mapping
3. LCSk Chaining
http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html
1. Gapped Spaced Seeds
[Figure: index construction and index lookup over reference bases 1 to 14. Legend: seed base; unused reference base (unless colored); "don't care" base. Index construction: reference bases, selected bases, indexed seed. Index lookup: query bases, mismatch seed (base 7), insertion seed (base 7/8), deletion seed (between bases 6 and 7)]
Shapes: 6-1-6, 4-1-4-1-4
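The lookup scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration of a 6-1-6 shape, not GraphMap's implementation: the function names (`index_reference`, `query_keys`, `lookup`) and the toy sequences are assumptions made for the example. The reference is indexed with one skipped base in the middle of each seed. Each query position is then looked up three times: skipping one base (mismatch), skipping two bases (insertion in the read), or skipping none (deletion in the read).

```python
from collections import defaultdict


def index_reference(ref, left=6, right=6):
    """Index every gapped seed: `left` bases, one skipped base, `right` bases."""
    index = defaultdict(list)
    span = left + 1 + right
    for i in range(len(ref) - span + 1):
        key = ref[i:i + left] + ref[i + left + 1:i + span]
        index[key].append(i)
    return index


def query_keys(query, pos, left=6, right=6):
    """Build the three lookup keys for the seed starting at `pos` in the query."""
    head = query[pos:pos + left]
    tails = {
        "mismatch": query[pos + left + 1:pos + left + 1 + right],   # skip 1 base
        "insertion": query[pos + left + 2:pos + left + 2 + right],  # skip 2 bases
        "deletion": query[pos + left:pos + left + right],           # skip 0 bases
    }
    # Keep only variants that fit completely inside the query.
    return {name: head + tail for name, tail in tails.items()
            if len(head) == left and len(tail) == right}


def lookup(index, query, pos):
    """Return reference positions hit by each seed variant (hits only)."""
    hits = {}
    for name, key in query_keys(query, pos).items():
        if key in index:
            hits[name] = index[key]
    return hits


ref = "TTGACCATGCAAGTCCGATAGG"              # toy reference (assumption)
index = index_reference(ref)
print(lookup(index, "TTGACCGTGCAAGT", 0))   # base 7 substituted
print(lookup(index, "TTGACCACTGCAAG", 0))   # extra base after base 7
print(lookup(index, "TTGACCTGCAAGTC", 0))   # base 7 missing
```

A single index therefore serves all three error types; the cost is three lookups per query position instead of one.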
Gapped spaced seeds retain sensitivity and recall
[Figure: precision and recall (axis 60 to 100) of GraphMap versus k=13 seeds on N. meningitidis, E. coli, S. cerevisiae, C. elegans and H. sapiens (chr3). Series: GraphMap Precision, k=13 Precision, GraphMap Recall, k=13 Recall]
2. Graph Mapping to get Anchors
Graph based search for MEMs with indels
[Figure: alignment graph for positions 0 to 7 of a reference and a query (match string -|||x|||); initial graph with 2-mers GC, CT, TA, AA, AA, AG, GA at positions 0 to 6; final anchor graph with anchors CTA and AGA]
Vertex-centric parallelism
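To make the idea of walking a k-mer graph concrete, here is a heavily simplified sketch. It is not the algorithm used in GraphMap. The reference GCTAAAGA is spelled by the 2-mers on the slide (GC, CT, TA, AA, AA, AG, GA). The query CTACAGA is an assumed example chosen so that the anchors CTA and AGA from the slide appear. The function name and the parameters `max_step` and `min_kmers` are hypothetical. A walk advances one query k-mer at a time and may jump up to `max_step` vertices forward in the reference. With `max_step=1` it reports exact matches only; larger values let a walk cross bases that are missing from the read.

```python
def kmer_walk_anchors(ref, query, k=2, max_step=1, min_kmers=2):
    """Return anchors as (ref_start, ref_end, query_start, query_end), end-exclusive."""
    ref_kmers = [ref[i:i + k] for i in range(len(ref) - k + 1)]
    anchors = []
    prev = {}  # ref vertex -> (walk length in k-mers, ref_start, query_start)
    n_query_kmers = len(query) - k + 1
    for j in range(n_query_kmers + 1):  # one extra round flushes open walks
        cur, extended = {}, set()
        if j < n_query_kmers:
            q = query[j:j + k]
            for p, r in enumerate(ref_kmers):
                if r != q:
                    continue
                # Walks that ended at most max_step vertices before p can continue here.
                preds = [p - s for s in range(1, max_step + 1) if p - s in prev]
                extended.update(preds)
                if preds:
                    length, rs, qs = max(prev[x] for x in preds)
                    cur[p] = (length + 1, rs, qs)
                else:
                    cur[p] = (1, p, j)  # start a new walk
        # Walks that could not be extended are maximal: report them as anchors.
        for p, (length, rs, qs) in prev.items():
            if p not in extended and length >= min_kmers:
                anchors.append((rs, p + k, qs, j - 1 + k))
        prev = cur
    return sorted(anchors)


ref = "GCTAAAGA"
for rs, re_, qs, qe in kmer_walk_anchors(ref, "CTACAGA"):
    print(ref[rs:re_], "CTACAGA"[qs:qe])    # exact anchors around the mismatch
print(kmer_walk_anchors(ref, "CTAAGA", max_step=2))  # one anchor across a deleted base
```

Because each reference vertex only looks at its own short list of predecessors, the inner loop can be evaluated independently per vertex, which is one way to read the slide's note on vertex-centric parallelism.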
3. Chaining Anchors with LCSk
Motivation: Longest Common Subsequence (LCS) provides fast alignments but allows arbitrary insertions and deletions
[Figure: anchors chained between Region and Read]
Approximate alignments evaluated to select best region
Pavetic et al 2014, O(n log(n))
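LCSk counts the largest number of non-overlapping, in-order common substrings of length k, so single-base matches scattered between arbitrary indels no longer contribute. The sketch below uses the textbook quadratic dynamic program to show the definition. It is not the O(n log(n)) algorithm cited on the slide, and the function name and example strings are assumptions for illustration.

```python
def lcsk(a, b, k):
    """Number of length-k blocks in the best chain of common substrings of a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either skip the last base of a or of b ...
            best = max(dp[i - 1][j], dp[i][j - 1])
            # ... or end a matching block of exactly k bases at (i, j).
            if i >= k and j >= k and a[i - k:i] == b[j - k:j]:
                best = max(best, dp[i - k][j - k] + 1)
            dp[i][j] = best
    return dp[n][m]


region, read = "GCTAAAGA", "CTACAGA"
print(lcsk(region, read, 1))  # k = 1 is the ordinary LCS length
print(lcsk(region, read, 3))  # only blocks of 3 matching bases count
```

With k = 1 the score rewards any scattered matches; with k = 3 only the two solid blocks (CTA and AGA) count, which is the property that makes the chain behave more like an alignment.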
Other Features
1. Different Alignment Options (marginAlign)
2. Mapping Quality, BLAST-like E-value
3. Circular Genome Support
4. Technology Agnostic (w/o tuning parameters)
– Illumina, PacBio, Ion Torrent
https://github.com/isovic/graphmap
Application: SNV Calling (Ammar et al. 2015)
# of on-target reads: GraphMap 6879; BWA-MEM 5900; LAST 5032; BLASR 2284; marginAlign 4683; DALIGNER 1451
[Figure: read coverage and read alignments over CYP2D6 (chr22); genes CYP2D7 and CYP2D6; 94% identity]
                 LAST   marginAlign   BWA-MEM   BLASR   DALIGNER   GraphMap
Precision (%)    94     100 (36)      96        100     93         96
True Positives   49     1 (107)       47        43      75         86
LoFreq: https://github.com/csb5/lofreq; results in parentheses for marginCaller
Application: Structural Variants (large Insertions and Deletions)
4 kbp deletion spanned by GraphMap
                 LAST   marginAlign   BWA-MEM   BLASR   DALIGNER   GraphMap
Precision (%)    0      50            67 (90)   94      0          100
Recall (%)       0      5             10 (45)   75      0          100
F1 Score (%)     0      9             17 (60)   83      0          100
Data from Quick et al 2014 mapped to mutated reference
Window size > 20 bp; AF > 15%; results from LUMPY are in parentheses
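The F1 row is consistent with the usual harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the precision and recall values on the slide. The function name and the convention of returning 0 when both inputs are 0 are assumptions made for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0  # convention assumed here for the all-zero rows
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) as listed on the slide
results = {
    "LAST": (0, 0),
    "marginAlign": (50, 5),
    "BWA-MEM": (67, 10),
    "BWA-MEM (LUMPY)": (90, 45),
    "BLASR": (94, 75),
    "DALIGNER": (0, 0),
    "GraphMap": (100, 100),
}
for mapper, (p, r) in results.items():
    print(f"{mapper}: F1 = {f1_score(p, r):.0f}%")
```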
Application: Pathogen Identification
[Figure: charts comparing GraphMap, BWA-MEM, LAST, BLASR and DALIGNER across reference sequences (axis ticks 0 to 14000 and 0 to 10000). References shown: S. enterica Typhi str. Ty2; S. enterica Typhi str. CT18; S. enterica Paratyphi A str.; S. enterica Paratyphi A str. ATCC 9150; S. enterica str. CT18 plasmid pHCM2; S. enterica Heidelberg str. SL476; S. enterica Paratyphi C strain RKS4594; S. enterica Agona str. SL483; S. enterica Paratyphi B str. SPB7; S. typhimurium LT2]
Database of 258 reference sequences
K-12 and BW2952 have >99% identity
Data from Quick et al 2014
GraphMap for Assembly
[Figure: sensitivity (0% to 100%) versus read error rate (10%, 15%, 20%) for GraphMap, MHAP (nanopore fast), MHAP (pacbio fast), MHAP (pacbio sensitive), MHAP (default), Minimap (default) and Minimap (github params)]
30X coverage simulated reads; Nanopore 2D error profile; E. coli genome
Croatian Science Foundation project UIP-11-2013-7353
Ivan Sovic, Mile Sikic, Swaine Chen
https://github.com/isovic/graphmap
http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html