DNA Barcode sequence identification incorporating taxonomic hierarchy and within taxon variability

Damon P. Little

Cullman Program for Molecular Systematics Studies, The New York Botanical Garden, Bronx, New York

test data sets (Little and Stevenson 2007)

gymnosperm nuclear ribosomal internal transcribed spacer 2 (nrITS 2)

1,037 sequences, 413 species, 71 genera

gymnosperm plastid encoded maturase K (matK)

522 sequences, 334 species, 75 genera

…alignment

locus     sequences         median unaligned length (IQR)   aligned length
nrITS 2   all               137 (108–250) bp                8,733 bp
nrITS 2   one per species   196 (115–260) bp                6,778 bp
matK      all               1,561 (1,412–1,661) bp          3,975 bp
matK      one per species   1,601 (1,530–1,661) bp          3,906 bp

pairwise divergence

locus     sequences         median    interquartile range   zero comparisons
nrITS 2   all               30.99%    26.53–34.48%          0.09%
nrITS 2   one per species   29.39%    25.75–33.30%          0.21%
matK      all               20.39%    5.95–23.30%           0.54%
matK      one per species   21.38%    8.13–23.89%           0.42%
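The slides do not say how pairwise divergence was computed. As a rough sketch only, the summary columns above could be produced as follows, assuming an uncorrected p-distance that skips gapped positions and assuming "zero comparisons" means the fraction of sequence pairs with no differences. The function names `p_distance` and `summarize` are hypothetical.

```python
from itertools import combinations
from statistics import median, quantiles


def p_distance(a, b):
    """Uncorrected divergence: mismatches / compared positions.

    Assumption: positions where either aligned sequence has a gap are skipped.
    Returns None when the two sequences share no comparable positions.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return None
    return sum(x != y for x, y in pairs) / len(pairs)


def summarize(aligned):
    """Median, quartiles, and zero fraction over all pairwise comparisons."""
    d = [p_distance(a, b) for a, b in combinations(aligned, 2)]
    d = [v for v in d if v is not None]
    q1, _, q3 = quantiles(d, n=4)  # needs at least two comparisons
    return {
        "median": median(d),
        "iqr": (q1, q3),
        "zero_fraction": sum(v == 0 for v in d) / len(d),
    }
```

With three toy aligned sequences `["ACGT", "ACGA", "ACGT"]`, one of the three comparisons is between identical sequences, so `zero_fraction` is one third.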

measuring precision and accuracy

precision

method                    nrITS2        matK
parsimony ratchet         58% (13%)     71% (41%)
SPR search                60% (11%)     70% (41%)
neighbor joining          65% (8%)      44% (23%)
BLAST                     94% (81%)     99% (67%)
BLAT                      94% (82%)     99% (69%)
megaBLAST                 94% (80%)     99% (61%)
BLAST/parsimony ratchet   86% (74%)     77% (55%)
BLAST/SPR                 87% (73%)     76% (53%)
BLAST/neighbor joining    93% (71%)     95% (56%)
DNA–BAR                   98% (89%)     100% (79%)
DOME ID                   80% (80%)     60% (60%)
ATIM                      100% (83%)    100% (67%)

accuracy to species

method       nrITS2       matK
SPR search   69% (47%)    78% (58%)
BLAST        67% (63%)    84% (68%)
BLAT         66% (62%)    82% (67%)
megaBLAST    72% (68%)    84% (64%)
BLAST/SPR    79% (67%)    78% (61%)
DNA–BAR      65% (62%)    73% (62%)
DOME ID      67% (66%)    50% (50%)
ATIM         83% (71%)    87% (53%)

lessons learned

“global” alignments do not work

“fuzzy” matches are not precise

autoapomorphies (unique characters) work... but not always present

precision

method       nrITS2         matK
SPR search   60% (11%)      70% (41%)
BLAST        94% (81%)      99% (67%)
BLAT         94% (82%)      99% (69%)
megaBLAST    94% (80%)      99% (61%)
BLAST/SPR    87% (73%)      76% (53%)
DNA–BAR      98% (89%)      100% (79%)
DOME ID      80% (80%)      60% (60%)
DOME ID*     100% (100%)    100% (100%)
ATIM         100% (83%)     100% (67%)

accuracy to species

method       nrITS2       matK
SPR search   69% (47%)    78% (58%)
BLAST        67% (63%)    84% (68%)
BLAT         66% (62%)    82% (67%)
megaBLAST    72% (68%)    84% (64%)
BLAST/SPR    79% (67%)    78% (61%)
DNA–BAR      65% (62%)    73% (62%)
DOME ID      67% (66%)    50% (50%)
DOME ID*     76% (75%)    90% (90%)
ATIM         83% (71%)    87% (53%)

some sequences are simply unidentifiable

...remaining (insoluble) problems

identical sequences for multiple terminals

shared alleles between terminals

use allele frequency as a predictor?

desirable methodologies and properties of Sequence IDentification Engines (SIDEs)

avoid global alignment by comparing short segments: pseudo–alignment

use exact matches

use autoapomorphies where possible

...but allow the use of other characters too
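One way to picture "exact matches" plus "autoapomorphies where possible" is sketched below. It is an illustration only, not the method described in the talk: fixed-length k-mers stand in for the short segments, a segment found in exactly one terminal is treated as an autoapomorphy, and the names `kmers`, `unique_segments`, and `identify` are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All distinct substrings of length k (assumed stand-in for short segments)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_segments(references, k=4):
    """Map each terminal to the k-mers that occur in no other terminal."""
    owners = defaultdict(set)
    for name, seq in references.items():
        for km in kmers(seq, k):
            owners[km].add(name)
    unique = {name: set() for name in references}
    for km, names in owners.items():
        if len(names) == 1:
            unique[next(iter(names))].add(km)
    return unique


def identify(query, references, k=4):
    """Count exact matches between the query and each terminal's unique k-mers."""
    q = kmers(query, k)
    hits = {}
    for name, segs in unique_segments(references, k).items():
        shared = q & segs
        if shared:
            hits[name] = len(shared)
    return hits
```

A query that contains only segments shared by several terminals returns an empty result, which mirrors the earlier point that unique characters are not always present.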

context/text DNA recoding

characters are defined by flanking context

=> pretext and postext

permit “alignment–free” comparisons

size and separation between pretext and postext must be arbitrarily delimited

states (text) limited by the proximity of context

terminals can be individual sequences or composites representing taxa



the number of possible states (text) is limited by the length of the text
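The recoding can be sketched as a sliding window. This is a minimal illustration under stated assumptions, not the talk's implementation: pretext and postext are given the same fixed length (`context`), the text has a fixed length (`text`), and the names `recode` and `build_terminal` are hypothetical.

```python
from collections import defaultdict


def recode(seq, context=4, text=2):
    """Yield (pretext, text, postext) for every window of the sequence.

    Assumption: fixed, arbitrarily chosen sizes for context and text.
    """
    window = 2 * context + text
    for i in range(len(seq) - window + 1):
        pre = seq[i:i + context]
        txt = seq[i + context:i + context + text]
        post = seq[i + context + text:i + window]
        yield pre, txt, post


def build_terminal(seqs, context=4, text=2):
    """Composite terminal: character (pretext, postext) -> set of observed texts."""
    chars = defaultdict(set)
    for s in seqs:
        for pre, txt, post in recode(s, context, text):
            chars[(pre, post)].add(txt)
    return dict(chars)
```

Because each character is named by its flanking context rather than by a column in an alignment, two sequences can be compared without aligning them. Passing one sequence gives an individual terminal; passing several sequences of a taxon gives a composite.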

querying text/context database

find pretext/text/postext in the query sequence and match to references



score terminals based on the number of matches

final score can be raw or based on a weighting function
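The query step with raw scoring might look like the following sketch. It assumes fixed-size pretext/text/postext windows and a plain in-memory index; `triples`, `build_index`, and `raw_scores` are hypothetical names, and weighting is left out here.

```python
from collections import defaultdict


def triples(seq, context=3, text=2):
    """Set of (pretext, text, postext) windows in a sequence (fixed sizes assumed)."""
    w = 2 * context + text
    return {
        (seq[i:i + context],
         seq[i + context:i + context + text],
         seq[i + context + text:i + w])
        for i in range(len(seq) - w + 1)
    }


def build_index(references, context=3, text=2):
    """Map each triple to the terminals whose reference sequences contain it."""
    index = defaultdict(set)
    for name, seqs in references.items():
        for s in seqs:
            for t in triples(s, context, text):
                index[t].add(name)
    return index


def raw_scores(query, index, context=3, text=2):
    """Raw score: number of query triples matched exactly by each terminal."""
    scores = defaultdict(int)
    for t in triples(query, context, text):
        for name in index.get(t, ()):
            scores[name] += 1
    return dict(scores)
```

For example, with references `{"sp_a": ["AACCGGTTA"], "sp_b": ["AACTTGTTA"]}`, the query `"AACCGGTTA"` matches two triples of `sp_a` and none of `sp_b`.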

possible weighting functions

equal weights (raw score)

number of distinct texts

=> upweights more variable characters

1/(number of distinct texts)

=> downweights more variable characters

(number of texts)/(number of scores)
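The first three weighting functions can be sketched as below. This assumes that "number of distinct texts" means the number of different texts seen across all references for one pretext/postext pair; the fourth function is omitted because the slide does not define "number of scores". All names (`windows`, `WEIGHTS`, `weighted_scores`) are hypothetical.

```python
from collections import defaultdict


def windows(seq, context=3, text=2):
    """Yield ((pretext, postext), text) for each fixed-size window."""
    w = 2 * context + text
    for i in range(len(seq) - w + 1):
        char = (seq[i:i + context], seq[i + context + text:i + w])
        yield char, seq[i + context:i + context + text]


# n = number of distinct texts observed for a character (assumed meaning)
WEIGHTS = {
    "equal": lambda n: 1.0,         # raw score
    "distinct": lambda n: float(n), # upweights more variable characters
    "inverse": lambda n: 1.0 / n,   # downweights more variable characters
}


def weighted_scores(query, references, weighting="equal", context=3, text=2):
    """Sum the chosen weight over every exact character/text match per terminal."""
    states = defaultdict(lambda: defaultdict(set))  # character -> text -> terminals
    for name, seqs in references.items():
        for s in seqs:
            for char, txt in windows(s, context, text):
                states[char][txt].add(name)
    weight = WEIGHTS[weighting]
    scores = defaultdict(float)
    for char, txt in windows(query, context, text):
        if char in states and txt in states[char]:
            w = weight(len(states[char]))
            for name in states[char][txt]:
                scores[name] += w
    return dict(scores)
```

With references `{"sp_a": ["AACCGGTTA"], "sp_b": ["AACTTGTTA"]}` and query `"AACCGGTTA"`, one matched character has two distinct texts and the other has one, so the three weightings give `sp_a` scores of 2.0, 3.0, and 1.5 respectively.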

precision

method       nrITS2        matK
SPR search   60% (11%)     70% (41%)
BLAST        94% (81%)     99% (67%)
BLAT         94% (82%)     99% (69%)
megaBLAST    94% (80%)     99% (61%)
BLAST/SPR    87% (73%)     76% (53%)
DNA–BAR      98% (89%)     100% (79%)
DOME ID      80% (80%)     60% (60%)
ATIM         100% (83%)    100% (67%)
BRONX 0      91% (90%)     88% (84%)
BRONX 1      96% (86%)     98% (79%)

accuracy to species

method       nrITS2       matK
SPR search   69% (47%)    78% (58%)
BLAST        67% (63%)    84% (68%)
BLAT         66% (62%)    82% (67%)
megaBLAST    72% (68%)    84% (64%)
BLAST/SPR    79% (67%)    78% (61%)
DNA–BAR      65% (62%)    73% (62%)
DOME ID      67% (66%)    50% (50%)
ATIM         83% (71%)    87% (53%)
BRONX 0      59% (58%)    76% (71%)
BRONX 1      72% (67%)    92% (75%)

BRONX conclusions

BRONX is more precise than existing algorithms

BRONX is sometimes more accurate than existing algorithms

BRONX is an incremental improvement

future directions

improve the scoring function in BRONX

dynamically size context/text

benchmark additional datasets for all methods

incorporate context/text recoding into a scalable version of the ATIM algorithm

acknowledgments

Kenneth Cameron

Santiago Madriñán

Christian Schulz

Dennis Stevenson

http://barcoding.si.edu/