A search engine for phylogenetic tree databases - D. Fernándes-Baca

Preview:

Citation preview

A search engine for phylogenetic tree databases

David Fernández-BacaJoint work with Mukul Bansal, Duhong Chen (Computer Science, ISU) and J. Gordon Burleigh (NESCent)

PhyloFinder

http://pilin.cs.iastate.edu/phylofinder/

Outline

1. Introduction2. PhyloFinder queries3. Implementation4. Future directions5. Acknowledgements

Outline

1. Introduction2. PhyloFinder queries3. Implementation4. Future directions5. Acknowledgements

Issues in Phylogenetic Databases Taxonomic consistency

Species may appear in multiple trees by different but synonymous names.

Homonyms Misspellings

Querying capability Storage/representation Exploiting classification trees (e.g., NCBI)

Clustering capabilities Distance measures

Aggregation (synthesis) capabilities Supertrees

Visualization

Classification trees and phylogenies

Exploiting taxonomic classifications The leaves in phylogenetic trees may

represent different taxonomic levels A classification tree can allows us to

locate trees that contain a taxon, as well as its descendants or ancestors. E.g., a “Pinaceae" query would identify

trees that contain “Pinus thunbergii” or “Abies alba”

TreeBASE (Piel, Donoghue, & Sanderson, 1996)

TreeBASE capabilities

Search by taxon author citation study accession number matrix accession number structure (topology)

Tree surfing

TreeBASE limitations

Taxonomic name consistency Querying

Few options Does not exploit classification

Can’t identify ancestors/descendants

Visualization Clustering and aggregation (supertrees)

PhyloFinder

A search engine for tree databases Not a database

Allows powerful phylogenetic queries Handles synonymous taxonomic names (via TBMap) Handles misspellings. Exploits taxonomic classification Offers precise options for identifying different types of

subtrees and metrics for identifying similar trees. Provides a visualization tool with links to GenBank and

TBMap. Fast

Efficient storage and filtering

PhyloFinder Design

Uses simple but powerful techniques Inverted index for filtering Nested-set representation of trees

Least common ancestor queries directly on database

Off-the-shelf spell-checking technology Can be used with any phylogenetic

database E.g., PhyLoTA browser However, set-up is not (yet) automatic

Outline

1. Introduction2. Queries3. Storage and querying4. Acknowledgements

PhyloFinder Queries

Taxonomic queries involve a single taxon or set of taxa.

Phylogenetic queries take as input a phylogenetic tree Locate trees that match it in some specified

way.

Taxonomic Queries

1. Contains: Given a list of taxa, return all trees that contain all or any of these names. Similar to Boolean “AND” and “OR” searches. Automatically searches for synonymous taxa

2. Related: Given a taxon, find all trees involving it or any of its descendants in the NCBI taxonomy. E.g., if the query taxon is “birds", identify all trees

that contain bird taxa.

3. Pathlength: Given a pair of taxa, return all trees containing them, along with the distance between them in each tree.

Taxonomic Queries: Contains

Phylogenetic Queries

Tree mining: Given a query tree Q, find the database trees that exhibit Q in some way. Options: Return the trees that have Q as an embedded

subtree. Return the trees that refine Q.

Similarity: Given a query tree Q and a specified similarity measure, return trees in database ranked by decreasing similarity from Q. Requires at least 3 taxon overlap

Phylogenetic queries: Notation 1 T(A) is the minimal subtree of T that

contains the leaves in A.

a b c d e f g

Phylogenetic queries: Notation 1 T(A) is the minimal subtree of T that

contains the leaves in A.

a b c d e f g

Phylogenetic queries: Notation 1 T(A) is the minimal subtree of T that

contains the leaves in A.

a b c d e f g

Phylogenetic queries: Notation 2 T|A is obtained from T(A) by suppressing all

internal nodes that have only one child.

a b c d e f g

Phylogenetic queries

Let Q be a query tree with leaf set A. Q is an embedded subtree of T if and

only if it is identical to T|A. Q is refined by T (T refines Q) if T|A is

a refinement of Q.

Phylogenetic queries

Phylogenetic queries: Embedded

Phylogenetic queries: Refined by

Phylogenetic queries: Refined by

Q embedded in T Q refined by T

Phylogenetic queries: Embedded

Similarity queries

Return trees ranked by a similarity score Score is a percentage between 0 and 100%

reflecting how similar query tree is to candidate tree.

PhyloFinder’s similarity measures: Robinson-Foulds (RF) similarity Least common ancestor (LCA) similarity

Score takes degree of taxon overlap into account.

Outline

1. Introduction2. PhyloFinder queries3. Implementation4. Future directions5. Acknowledgements

System architecture

Least Common Ancestors (LCAs)

a b c d e f g

Least Common Ancestors (LCAs)

a b c d e f g

LCA(b,e)

Storage: Nested intervals

Ancestor/descendant relationship is easy to determine The between predicate defines subtrees

LCAs are easily computed Find common ancestor with largest Node_ID

a b dc e f

(1,10)

(2,9)

(3,5)

(10,10)(4,4) (5,5)(6,6)

(7,9)

(8,8)(9,9)

(Node_ID,RMD_ID)

Storage: Inverted index

For each taxon, store a list of all trees that contain it.

Easy to find trees containing any or all elements in a list of taxa Used as a filter

Cornus

Spigelia

Hedera

1 2 3 5 8 13 21 34

2 4 8 16 32 64128

13 16

Building the inverted indexI. Input trees:

1: (((man,pan),gorilla),pongo), 2. (((human, coprinus),cryptomonas),zea_mays), . . . ,N: (((dogs,homo_sapiens),pig),lambs)

II. Convert trees into lists of taxa:

man pan gorilla pongo, human coprinus . . . . . .

III. Synonymy preprocessing: Replace names by TBMap name clusters:

tc1 tc2 tc3 tc4 , tc1 tc5 . . . . . .

IV. Build index consisting of (i) dictionary (mapping of taxon names to name clusters) and (ii) postings (lists of tree IDs).

tc1 1 2 3 4

tc2 1

Schema

Query Processing: Outline

Consultinverted index

Q: Candidate trees:

Results:

Compare against Q using LCA queries

Implementing Phylogenetic Queries Idea: Use LCA queries to compare ancestor-

descendant relationships in Q with those in T.

M(x) and M(y) have the same relationship in T

as x and y have in Q Q can be embedded in T.

Advantage: Database trees need not be read into main memory.

Implementing Taxonomic Queries Use Boolean (union/intersection)

operations on the inverted index Example: Querying for “birds"

1. Find all bird species in the database trees using the NCBI taxonomy tree.

2. Use inverted index to retrieves the tree ID lists for each bird species.

3. Return the union of these lists.

Tree visualization

Tree visualization

Other tree visualization tools are available: Hillis, Heath, & St. John 2005. Syst. Biol. 54: 471-

482. Sanderson 2006. Bioinformatics 22: 1004-1006. Zmasek & Eddy. 2001. Bioinformatics 17: 383-384.

We developed our own to avoid plug-ins, and easily highlight query results and provide outlinks to

GenBank and TBMap.

Spelling

Suggestions come from TreeBASE and NCBI Uses GNU Aspell

Modified to handle special characters (`-', `&', '.') and compound words.

Outline

1. Introduction2. PhyloFinder queries3. Implementation4. Future directions5. Acknowledgements

Under construction

Unrooted trees Supertree methods

MRP, MRF, MMC Desktop version Automatic update Suggestions?

Outline

1. Introduction2. PhyloFinder queries3. Implementation4. Future directions5. Acknowledgements

Thanks to

Rod Page for TBMap Bill Piel for TreeBASE data Mike Sanderson Oliver Eulenstein National Science Foundation (grant EF-

0334832)

Recommended