DNA Barcode sequence identification incorporating taxonomic hierarchy and
within taxon variability
Damon P. Little
Cullman Program for Molecular Systematics Studies, The New York Botanical Garden, Bronx, New York
test data sets (Little and Stevenson 2007)
gymnosperm nuclear ribosomal internal transcribed spacer 2 (nrITS 2)
1,037 sequences
413 species, 71 genera
gymnosperm plastid encoded maturase K (matK)
522 sequences, 334 species, 75 genera
…alignment
locus    sequences        median unaligned length (IQR)  aligned length
nrITS 2  all              137 (108–250) bp               8,733 bp
nrITS 2  one per species  196 (115–260) bp               6,778 bp
matK     all              1,561 (1,412–1,661) bp         3,975 bp
matK     one per species  1,601 (1,530–1,661) bp         3,906 bp
pairwise divergence
locus    sequences        median  interquartile range  zero comparisons
nrITS 2  all              30.99%  26.53–34.48%         0.09%
nrITS 2  one per species  29.39%  25.75–33.30%         0.21%
matK     all              20.39%  5.95–23.30%          0.54%
matK     one per species  21.38%  8.13–23.89%          0.42%
measuring precision and accuracy
precision
method nrITS2 matK
parsimony ratchet 58% (13%) 71% (41%)
SPR search 60% (11%) 70% (41%)
neighbor joining 65% (8%) 44% (23%)
BLAST 94% (81%) 99% (67%)
BLAT 94% (82%) 99% (69%)
megaBLAST 94% (80%) 99% (61%)
BLAST/parsimony ratchet 86% (74%) 77% (55%)
BLAST/SPR 87% (73%) 76% (53%)
BLAST/neighbor joining 93% (71%) 95% (56%)
DNA–BAR 98% (89%) 100% (79%)
DOME ID 80% (80%) 60% (60%)
ATIM 100% (83%) 100% (67%)
accuracy to species
method nrITS2 matK
parsimony ratchet 67% (46%) 77% (60%)
SPR search 69% (47%) 78% (58%)
neighbor joining 68% (42%) 75% (52%)
BLAST 67% (63%) 84% (68%)
BLAT 66% (62%) 82% (67%)
megaBLAST 72% (68%) 84% (64%)
BLAST/parsimony ratchet 78% (67%) 80% (60%)
BLAST/SPR 79% (67%) 78% (61%)
BLAST/neighbor joining 80% (64%) 86% (56%)
DNA–BAR 65% (62%) 73% (62%)
DOME ID 67% (66%) 50% (50%)
ATIM 83% (71%) 87% (53%)
lessons learned
“global” alignments do not work
“fuzzy” matches are not precise
autoapomorphies (unique characters) work... but are not always present
precision
method nrITS2 matK
DOME ID 80% (80%) 60% (60%)
DOME ID* 100% (100%) 100% (100%)
accuracy to species
method nrITS2 matK
DOME ID 67% (66%) 50% (50%)
DOME ID* 76% (75%) 90% (90%)
some sequences are simply unidentifiable
...remaining (insoluble) problems
identical sequences for multiple terminals
shared alleles between terminals
use allele frequency as a predictor?
desirable methodologies and properties of Sequence IDentification Engines (SIDEs)
avoid global alignment by comparing short segments: pseudo–alignment
use exact matches
use autoapomorphies where possible
...but allow the use of other characters too
context/text DNA recoding
characters are defined by flanking context
=> pretext and postext
permit “alignment–free” comparisons
size of, and separation between, pretext and postext must be arbitrarily delimited
states (text) are limited by the proximity of the context
the number of possible states (text) is limited by the length of the text
terminals can be individual sequences or composites representing taxa
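The recoding described above can be sketched in a few lines. This is a minimal illustration, not the BRONX implementation: it assumes a fixed pretext/postext length and a fixed text length, and the names `recode`, `build_database`, `context_len`, and `text_len` are hypothetical.

```python
from collections import defaultdict

def recode(sequence, context_len=4, text_len=2):
    """Yield ((pretext, postext), text) for every window of the sequence.

    Assumption: pretext and postext share one fixed length and the text
    between them has a fixed length; the slides only say that these sizes
    must be arbitrarily delimited.
    """
    window = 2 * context_len + text_len
    for start in range(len(sequence) - window + 1):
        pretext = sequence[start:start + context_len]
        text = sequence[start + context_len:start + context_len + text_len]
        postext = sequence[start + context_len + text_len:start + window]
        yield (pretext, postext), text

def build_database(references, context_len=4, text_len=2):
    """Map each character (pretext, postext) to {text: set of terminals}."""
    database = defaultdict(lambda: defaultdict(set))
    for terminal, sequence in references.items():
        for character, text in recode(sequence, context_len, text_len):
            database[character][text].add(terminal)
    return database
```

Here a character is the (pretext, postext) pair and its state is the text found between them, so two sequences can be compared character by character without a global alignment. A terminal name may stand for a single sequence or for a taxon whose sequences have all been added under the same name.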
querying text/context database
find pretext/text/postext in the query sequence and match to references
score terminals based on the number of matches
final score can be raw or based on a weighting function
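The query step can be sketched as follows, using the raw (unweighted) score. This is an illustration under stated assumptions rather than the BRONX code: window sizes are fixed, only exact pretext/text/postext matches count, and `recode`, `raw_scores`, and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict

def recode(sequence, context_len=2, text_len=1):
    """Yield (pretext, text, postext) triples for every window (fixed sizes assumed)."""
    window = 2 * context_len + text_len
    for start in range(len(sequence) - window + 1):
        pretext = sequence[start:start + context_len]
        text = sequence[start + context_len:start + context_len + text_len]
        postext = sequence[start + context_len + text_len:start + window]
        yield pretext, text, postext

def raw_scores(query, references, context_len=2, text_len=1):
    """Count, per terminal, the distinct query triples matched exactly."""
    index = defaultdict(set)  # (pretext, text, postext) -> terminals containing it
    for terminal, sequence in references.items():
        for triple in recode(sequence, context_len, text_len):
            index[triple].add(terminal)
    scores = Counter()
    for triple in set(recode(query, context_len, text_len)):
        for terminal in index.get(triple, ()):
            scores[terminal] += 1
    return scores

references = {"sp_A": "ACGTACGTTA", "sp_B": "ACGAACGTTA"}
scores = raw_scores("CGTACGT", references)
print(scores.most_common())
```

In this toy case every triple in the query occurs only in `sp_A`, so `sp_A` is the sole scoring terminal. Terminals that tie for the best score would all be reported, which is how identical or shared sequences surface as ambiguous identifications.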
possible weighting functions
equal weights (raw score)
number of distinct texts
=> up weights more variable characters
1/(number of distinct texts)
=> down weights more variable characters
(number of texts)/(number of scores)
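The first three weighting functions listed above can be sketched as below. This is a hedged illustration, not the BRONX scoring code: it assumes that "number of distinct texts" means the number of different texts observed in the reference set for a given (pretext, postext) character, and the names `recode`, `WEIGHTS`, and `weighted_scores` are hypothetical. The fourth function, (number of texts)/(number of scores), is left out because the slide does not define its terms.

```python
from collections import defaultdict

def recode(sequence, context_len=2, text_len=1):
    """Yield ((pretext, postext), text) for every window (fixed sizes assumed)."""
    window = 2 * context_len + text_len
    for start in range(len(sequence) - window + 1):
        pretext = sequence[start:start + context_len]
        text = sequence[start + context_len:start + context_len + text_len]
        postext = sequence[start + context_len + text_len:start + window]
        yield (pretext, postext), text

# n is the number of distinct texts seen for one character in the references.
WEIGHTS = {
    "equal": lambda n: 1.0,        # raw score
    "distinct": lambda n: float(n),  # up weights more variable characters
    "inverse": lambda n: 1.0 / n,    # down weights more variable characters
}

def weighted_scores(query, references, weighting="equal",
                    context_len=2, text_len=1):
    """Sum the weights of the characters whose query text matches each terminal."""
    database = defaultdict(lambda: defaultdict(set))
    for terminal, sequence in references.items():
        for character, text in recode(sequence, context_len, text_len):
            database[character][text].add(terminal)
    weight = WEIGHTS[weighting]
    scores = defaultdict(float)
    for character, text in set(recode(query, context_len, text_len)):
        texts = database.get(character)
        if not texts or text not in texts:
            continue  # character or state absent from the references
        for terminal in texts[text]:
            scores[terminal] += weight(len(texts))
    return dict(scores)
```

With references `{"sp_A": "ACGTACGTTA", "sp_B": "ACGAACGTTA"}` and query `"CGTACGT"`, one of the three matched characters has two distinct texts and the other two have one each, so the choice of weighting changes the score for `sp_A` without changing which terminal wins.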
precision
method nrITS2 matK
parsimony ratchet 58% (13%) 71% (41%)
SPR search 60% (11%) 70% (41%)
neighbor joining 65% (8%) 44% (23%)
BLAST 94% (81%) 99% (67%)
BLAT 94% (82%) 99% (69%)
megaBLAST 94% (80%) 99% (61%)
BLAST/parsimony ratchet 86% (74%) 77% (55%)
BLAST/SPR 87% (73%) 76% (53%)
BLAST/neighbor joining 93% (71%) 95% (56%)
DNA–BAR 98% (89%) 100% (79%)
DOME ID 80% (80%) 60% (60%)
ATIM 100% (83%) 100% (67%)
BRONX 0 91% (90%) 88% (84%)
BRONX 1 96% (86%) 98% (79%)
accuracy to species
method nrITS2 matK
parsimony ratchet 67% (46%) 77% (60%)
SPR search 69% (47%) 78% (58%)
neighbor joining 68% (42%) 75% (52%)
BLAST 67% (63%) 84% (68%)
BLAT 66% (62%) 82% (67%)
megaBLAST 72% (68%) 84% (64%)
BLAST/parsimony ratchet 78% (67%) 80% (60%)
BLAST/SPR 79% (67%) 78% (61%)
BLAST/neighbor joining 80% (64%) 86% (56%)
DNA–BAR 65% (62%) 73% (62%)
DOME ID 67% (66%) 50% (50%)
ATIM 83% (71%) 87% (53%)
BRONX 0 59% (58%) 76% (71%)
BRONX 1 72% (67%) 92% (75%)
BRONX conclusions
BRONX is more precise than existing algorithms
BRONX is sometimes more accurate than existing algorithms
BRONX is an incremental improvement
future directions
improve the scoring function in BRONX
dynamically size context/text
benchmark additional datasets for all methods
incorporate context/text recoding into a scalable version of the ATIM algorithm
acknowledgments
Kenneth Cameron
Santiago Madriñán
Christian Schulz
Dennis Stevenson