Data Mining GenBank for Phylogenetic inference - T. Vision

Prospects for enabling phylogenetically informed comparative biology on the web

Todd Vision (1,2) & Hilmar Lapp (1)

(1) U.S. National Evolutionary Synthesis Center

(2) Dept. of Biology, University of North Carolina at Chapel Hill

Suppose you have the sequence of a protein-coding gene, and are interested in its function. What is the first thing you would do?

• If it were me, I would search for conserved domains that match records in Pfam and other protein domain databases.

• Are these databases complete?

• Are they infallible?

• Are they still useful?

Why are these data useful?

• You needn’t have mastery of the specialist literature before the search

• A match connects you to a vast interconnected world of information

• Why not worry about completeness?

  - A negative result is not expensive

  - Many broadly useful records are already present

• Why not worry about fallibility?

  - The user can weigh the evidence once a match is found

  - Assertions should be exposed to scrutiny

Some observations

• This infrastructure is designed to disseminate data to non-specialists

• The relevant data may be derived from multiple “studies”, not all of which are published

• Data is hoarded neither by the researcher nor by the domain database

• The search service is as widely disseminated as the data

• Semantic-level machine-to-machine communication facilitates human comprehension

The case of phylogenetic data

• There is a broad audience for phylogenetic data

  - Organismal phylogeny (e.g. Encyclopedia of Life)

  - Gene/protein trees

• Many of the available resources are geared toward specialist researchers & students

• Non-specialists turn to taxonomic classifications when they need organismal phylogenetic information

• Few know where to find gene/protein trees at all

TreeBase

• screenshot

Tree of Life Web Project

The NCBI taxonomy

• Provides

  - A hierarchy for all species represented by DNA sequences in GenBank

  - Names and IDs for internal nodes

  - An FTP dump

• But does NOT

  - Include unsequenced species

  - Report confidence in topology or monophyly

  - Capture taxonomic nuance (though it has synonyms & common names)
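
As a minimal sketch of working with the dump, assuming the `nodes.dmp` layout in which fields are delimited by `|` and the first two fields are a node's tax ID and its parent's tax ID, the hierarchy can be walked from any node up to the root. The rows below are toy data with made-up IDs, and the helper names `read_parents` and `lineage` are hypothetical.

```python
import io

# Toy rows in the nodes.dmp layout: tax_id | parent_tax_id | rank | ...
# The IDs are invented for illustration; the real file has more columns.
SAMPLE = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tclass\t|\n"
    "20\t|\t10\t|\tfamily\t|\n"
    "30\t|\t20\t|\tspecies\t|\n"
)

def read_parents(handle):
    """Map each tax_id to its parent tax_id."""
    parents = {}
    for line in handle:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        parents[int(fields[0])] = int(fields[1])
    return parents

def lineage(tax_id, parents):
    """Return the path from tax_id up to the root (the root is its own parent)."""
    path = [tax_id]
    while parents[path[-1]] != path[-1]:
        path.append(parents[path[-1]])
    return path

parents = read_parents(io.StringIO(SAMPLE))
print(lineage(30, parents))  # [30, 20, 10, 1]
```

Note that nothing in this structure can carry support values, conflicting topologies, or provenance, which is the gap the next slide asks about.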

What if the NCBI taxonomy…

• Listed all taxa, including fossils?

• Allowed one to assess where there are conflicting topologies?

• Reported support values for clades?

• Reported divergence time estimates for nodes (e.g. from TimeTree)?

• Reported the provenance of the data?

Node-oriented web services from the Tree of Life Web Project

• Name

• Description

• Authority

• Date

• Other names

• Completeness of children

• Extinction status

• Confidence of position

• Monophyly

Further barriers to dissemination of phylogenetic information

• Technical obstacles

  - Technology for storing and querying trees

  - Difficulties with exchange standards

  - Inference of consensus trees and supertrees

  - Taxonomic intelligence

  - Globally unique identifiers

• Social obstacles

  - Reluctance to provide incomplete or fallible information

Outline

• Informatics @ NESCent

• An example of a phylogenetically informed semantic web application for phenotype data

• Promoting interoperability and closing technical gaps in phyloinformatics through open development

NESCent sponsored science

• Catalysis Meetings (large, one-time events)

  - To foster new collaborations and synthetic research

• Working Groups

  - Smaller, focused, multiple meetings

• Sabbatical Scholars

• Postdoctoral fellows

• Short-term visitor program

  - 2 weeks to 3 months

  - Encourage collaborative projects

• Application info: http://www.nescent.org

Evolutionary Informatics WG

• Organizers: Arlin Stoltzfus and Rutger Vos

• Selected goals:

  - XML serialization of NEXUS

  - Formal grammar for validation and interconversion of NEXUS & other formats

  - A transition model language for evolutionary models used in statistical inference

  - An ontology for evolutionary comparative data analysis

• http://www.nescent.org/wg_evoinfo

NESCent Informatics

• Support for sponsored science and scientists

  - Facilitating electronic collaboration

  - Software/database development

  - Providing HPC and other IT infrastructure

• Cyberinfrastructure for synthetic science

  - Data sharing

  - Software interoperability

  - Training

  - In partnership with major national and international efforts

GeoPhyloBuilder

• Extension for ArcGIS Software that creates a spatiotemporal GIS network model from a tree with georeferenced nodes.

• 3D visualizations are possible through ArcSCENE.

• http://www.nescent.org/informatics/software.php

“Putting the geography into phylogeography”

David Kidd & Xianhua Liu

Phylogenetic cyberinfrastructure to enable comparative biology

• Two traditions in the recording of phenotype data

  - Natural language descriptions and character matrices

  - Statements made using anatomical and trait ontologies, designed to capitalize on the semantic web

• NESCent WG on morphological evolution in fish

  - Organized by Paula Mabee and Monte Westerfield

  - Led to a larger project

• Aim is to integrate

  - Mutant phenotype data for zebrafish

  - Comparative morphology data for the Ostariophysi

Ontologies

• Defined terms with defined relationships

  - e.g. Gene Ontology, Cell Ontology

[Figure: example ontology graph in which the terms cell, membrane, axolemma, axon, and cell projection are linked by is_a and part_of relationships]
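
The notion of defined terms with defined relationships can be sketched as a small directed graph. This is an illustration only: the edge list below is an assumption about how the terms cell, membrane, axolemma, axon, and cell projection are connected, and the function name `ancestors` is hypothetical rather than part of any ontology toolkit.

```python
# Each edge is (child term, relationship, parent term).
# Assumption: this edge list is one plausible reading of the example terms.
EDGES = [
    ("axolemma", "is_a", "membrane"),
    ("axolemma", "part_of", "axon"),
    ("axon", "is_a", "cell projection"),
    ("membrane", "part_of", "cell"),
    ("cell projection", "part_of", "cell"),
]

def ancestors(term, edges, relations=("is_a", "part_of")):
    """Collect every term reachable by following the given relationships."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child, relation, parent in edges:
            if child == current and relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(sorted(ancestors("axolemma", EDGES)))
# ['axon', 'cell', 'cell projection', 'membrane']
print(sorted(ancestors("axolemma", EDGES, relations=("is_a",))))
# ['membrane']
```

Because the relationships are typed, a query can choose which ones to follow, which is what lets software answer questions such as "which annotated structures are part of a cell?" without free-text matching.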

Describing phenotypes using ontologies

• Entity-Quality system (EQ)

• Entity term from an anatomy ontology

  - zebrafish anatomy, cell ontology, etc.

• Quality term from Phenotype and Trait Ontology (PATO)

• e.g. Entity=dorsal fin, Shape=round

Phenotype and Trait Ontology (PATO)

[Figure: excerpt of the PATO hierarchy, showing terms including physical quality, optical quality, chromatic property, color (blue, with bright blue and dark blue; green), amplitude, and buoyancy]

Evolutionary character matrices

• Common phenotypic data format in evolutionary biology (e.g. NEXUS)

• Characters + character states, similar to EQ

[Figure: example character matrix with rows Species one, Species two, and Species three; the character "dorsal fin shape" has the states round, pointed, and undulate; a second column is labeled "character 2"]
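
To illustrate how close the two representations are, here is a minimal sketch that rewrites one character of a matrix as Entity-Quality statements. The assignment of states to species is invented for the example, and the `EQ` class and `matrix_to_eq` function are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EQ:
    taxon: str
    entity: str   # term from an anatomy ontology
    quality: str  # term from PATO

# Each character maps to (entity, ordered list of states).
CHARACTERS = {"dorsal fin shape": ("dorsal fin", ["round", "pointed", "undulate"])}

# Hypothetical matrix: taxon -> {character: state index}. Scores are made up.
MATRIX = {
    "Species one": {"dorsal fin shape": 0},
    "Species two": {"dorsal fin shape": 1},
    "Species three": {"dorsal fin shape": 2},
}

def matrix_to_eq(matrix, characters):
    """Turn every scored cell of the matrix into one EQ statement."""
    statements = []
    for taxon, row in matrix.items():
        for character, state_index in row.items():
            entity, states = characters[character]
            statements.append(EQ(taxon, entity, states[state_index]))
    return statements

for statement in matrix_to_eq(MATRIX, CHARACTERS):
    print(statement)
```

The hard part in practice is not this loop but deciding which ontology terms a free-text character such as "dorsal fin shape" should map to, which is a curation task.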

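Character Matrix vs. EQ

[Figure: the character "dorsal fin shape" with character state "round" corresponds to the EQ entity "dorsal fin" (from the anatomy ontology, AO) plus the attribute "shape" and value "round" (both qualities from PATO)]

A scenario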

• A geneticist observes a reduction in the number of a particular bone type (e.g. branchiostegal ray) in a zebrafish mutant of her favorite gene.

• She asks: is this bone variable in number among species in nature?

• She could query the evolutionary phenotype database using:

  - Entity = Branchiostegal ray (from TAO)

  - Qualities pertaining to attribute ‘count’ (from PATO)

• She could examine a visualization of the phylogenetic relationships of the taxa with the relevant character changes mapped.

• She would see that most Ostariophysi have 3 rays, but that reduction has occurred multiple times:

  - solenostomids and syngnathids (ghost pipefishes and pipefishes)

  - giganturids

  - saccopharyngoid (gulper and swallower) eels

• By examining additional changes on these same branches, she sees several parallelisms:

  - loss of the swimbladder, pelvic fins, and scales

  - elongation of the mandibular or hyoid arches

  - reduction or loss of the opercle in syngnathids and saccopharyngoids

  - a variety of other bones and soft tissues are lost or greatly modified

• She might hypothesize that these trait correlations are all due to alterations in the expression of the same suite of morphogens.

• She can select appropriate species from these lineages to follow up experimentally.
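
The first step of this scenario, the query by entity and attribute, can be sketched over a flat list of annotations. Everything here is hypothetical: the taxon names and counts are placeholders rather than real observations, and the `query` function stands in for whatever interface the database would actually expose.

```python
from collections import Counter

# Hypothetical annotations: (taxon, entity, attribute, value). Not real data.
ANNOTATIONS = [
    ("Taxon A", "branchiostegal ray", "count", 3),
    ("Taxon B", "branchiostegal ray", "count", 3),
    ("Taxon C", "branchiostegal ray", "count", 1),
    ("Taxon C", "dorsal fin", "shape", "round"),
    ("Taxon D", "branchiostegal ray", "count", 2),
]

def query(annotations, entity, attribute):
    """Return {taxon: value} for annotations matching the entity and attribute."""
    return {
        taxon: value
        for taxon, ann_entity, ann_attribute, value in annotations
        if ann_entity == entity and ann_attribute == attribute
    }

counts = query(ANNOTATIONS, "branchiostegal ray", "count")
typical, _ = Counter(counts.values()).most_common(1)[0]
reduced = sorted(taxon for taxon, n in counts.items() if n < typical)
print(typical, reduced)  # 3 ['Taxon C', 'Taxon D']
```

Deciding whether the reductions are independent events, rather than one inherited change, requires mapping the returned taxa onto a phylogeny, which this sketch leaves out.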

What data are needed to enable this scenario?

• Anatomy and trait ontologies

• Phenotypes in EQ syntax for

  - Zebrafish mutants (already exist)

  - Species/clades of Ostariophysi

• Phylogenetic relationships among the Ostariophysi

  - Taxonomy ontology

Some anatomical ontologies

• Amphibia

• C. elegans

• Fish (zebrafish, medaka, teleosts)

• Insects (Drosophila, Mosquito, Hymenoptera)

• Mammals (mouse, human)

• Plants (Arabidopsis, cereals, maize, all plants)

[Figure: organization of the phenotype ontology project. Labeled components:]

• NCBO

• NESCent (Vision, Lapp, software developers)

• OBO (host of TAO, PATO, taxonomy ontology)

• Applications (Phenote, OBO-Edit)

• Phenotype Ontologies for Evolutionary Biology Workshops

• EQSYTE database, curator interface, and public interface

• EQSYTE contents

• USD (Mabee, data curator)

• U. Oregon (Westerfield)

• Tulane U. (Rios, ontology curator)

• Liaisons to ZFIN, NCBO, and CToL

• Usability testing

• Working groups

• Morphology collaborators (Arratia, Coburn, Hilton, Lundberg, Mayden)

• Ichthyology community (DeepFin, FishBase)

• Ostariophysan phenotypic data

• Zebrafish phenotypic & genetic data

• Ontologies (taxonomy, TAO, PATO, homology)

Preserving published data for future integration efforts

• Sequence alignments (e.g. TreeBASE)

• Long-term population records (e.g. pedigrees)

• 2D and 3D images

• Collection and locality information

• Behavioral observations

• Numerical tables

• Etc.

• Most of these data are lost upon publication

• These are the stuff of comparative biology

Dryad: A digital repository for published data in evolutionary biology

NCSU Digital Library Initiative

Journals and societies involved so far

• American Naturalist (ASN)

• Evolution (SSE)

• Journal of Evolutionary Biology (ESEB)

• Integrative and Comparative Biology (SICB)

• Molecular Biology and Evolution (SMBE)

• Molecular Ecology

• Molecular Phylogenetics and Evolution

• Systematic Biology (SSB)

Open development

• Open source refers only to the licensing of the software code

• At NESCent, we have been experimenting with practices in open development

  - Community contributes to a shared code base

  - Higher barrier to entry

  - Can be a substantial payoff in terms of interoperability, functionality, usability, maintenance

  - Surprisingly rare in academia

2006 Phyloinformatics Hackathon

[Logos of participating projects and organizations: TreeBASE, PAUP*, CIPRES, HyPhy, ATV, GARLI, NESCent, NCL, JEBL, Biojava, Biopython, BioRuby, BioSQL, Bio::CDAT, BioPerl]

Hackathon mechanics

• Before the meeting

  - Participants and users suggested integrative workflows

• At the meeting

  - Gaps in existing toolkits were identified

  - Subgroups collaborated on high-priority targets

  - Followed a “use case” model

  - Subgroups and targets were allowed to be fluid

  - Users were on hand to provide datasets, test code, and provide their perspective

  - Dedicated participants tasked with documentation

• All code is open-source and deposited in established repositories

Accomplishments

• Sequence family evolution

  - BioPerl: Support for TribeMCL, QuickTree, ClustalW, Phylip, PAML

  - BioPerl & Biopython: Support for dN/dS-based tests for selection in HyPhy

  - Biojava: Parser for Phylip alignment format

  - BioRuby: Support for T-Coffee, MAFFT, and Phylip

• Reconciling trees

  - BioPerl: Support for NJTree

  - Biopython: Wrapper for Softparsmap

  - BioRuby: Model for phylogenetic trees and networks with graph algorithms

  - BioSQL: Model for phylogenetic trees and networks with optimization methods and topological queries
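
To give a sense of what a topological query over a relational tree model can look like, here is a minimal sketch using Python's built-in `sqlite3` module. The single `node` table with a parent pointer is an assumption made for illustration and is not the BioSQL schema; the tree, labels, and the `ancestors` and `mrca` helpers are likewise hypothetical.

```python
import sqlite3

# Illustrative schema only: one row per node, with a pointer to its parent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [(1, None, "root"), (2, 1, "AB"), (3, 2, "A"), (4, 2, "B"), (5, 1, "C")],
)

# Recursive query: start at the labeled node and climb parent pointers.
ANCESTORS = """
WITH RECURSIVE up(id, parent_id, depth) AS (
    SELECT id, parent_id, 0 FROM node WHERE label = ?
    UNION ALL
    SELECT n.id, n.parent_id, up.depth + 1
    FROM node AS n JOIN up ON n.id = up.parent_id
)
SELECT id, depth FROM up
"""

def ancestors(label):
    """Return {node id: distance} for the node itself and all its ancestors."""
    return dict(conn.execute(ANCESTORS, (label,)).fetchall())

def mrca(label_a, label_b):
    """Return the label of the most recent common ancestor of two nodes."""
    a, b = ancestors(label_a), ancestors(label_b)
    node_id = min(set(a) & set(b), key=lambda i: a[i])
    return conn.execute("SELECT label FROM node WHERE id = ?", (node_id,)).fetchone()[0]

print(mrca("A", "B"))  # AB
print(mrca("A", "C"))  # root
```

Climbing parent pointers row by row is simple but slow on large trees, which is presumably why a production model would add precomputed indexes of the kind the "optimization methods" above refer to.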

• Phylogenetic inference on non-molecular characters

  - BioPerl: Interoperability between Bio::Phylo and BioPerl APIs

  - BioRuby: NEXUS-compliant data model and parser for PAUP and TNT results

• Phylogenetic footprinting

  - BioPerl: Support for Footprinter, PhastCons, and using ClustalW over a sliding window

• Estimation of divergence times

  - BioPerl: Draft design of r8s wrapper

• NEXUS compliance

  - Biojava: Interoperability between Biojava and JEBL

  - Biojava & BioRuby: Level II-compliant NEXUS parsers

  - All:

      - Evaluated major APIs

      - Proposed compliance levels

      - Gathered test files exposing common errors

      - Fixed compliance issues in NCL and Bio::NEXUS reference implementations

      - Worked on integrating those into GARLI and BioPerl, respectively

Next hackathon

• Comparative Phylogenetic Methods in R

• December 10-14, 2007

• Organizers: S. Kembel, H. Lapp, B. O'Meara, S. Price, T. Vision, A. Zanne

• http://hackathon.nescent.org/R_Hackathon_1

• Have an idea for a future event? Submit a whitepaper!

• Student internships in open-source software development

  - Students work with any of a large number of established OS projects

  - Students and mentors work & communicate remotely

• NESCent recruited mentors and oversaw student progress

  - Eleven students worked on projects in visualization, usability, interoperability & implementation of new methods

NEXML

• Student: Jason Caravas

• Mentor: Rutger Vos

• Flexible serialization of phylogenetic objects

• Perl Bio::Phylo module tools for NEXML parsing and serialization

Command-line BioSQL

• Student: Jamie Estill

• Mentor: Hilmar Lapp

• Commands for

  - Database initialization

  - Bio::TreeIO import

  - Bio::TreeIO export

  - Tree query

  - Tree optimization

  - Tree manipulation

Conservation of phylogenetic diversity

• Student: Klaas Hartmann

• Mentor: Tobias Thierer

• Implementation of algorithm and GUI for optimal allocation of a finite budget to individual species to maximize phylogenetic diversity.
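
The quantity being maximized can be illustrated with a small sketch. It assumes that phylogenetic diversity is measured as the total length of the branches connecting the chosen species to the root, and it simplifies the budget problem to "every species costs the same, pick k of them", solved by exhaustive search. This is not the student project's algorithm; the tree, branch lengths, and function names are hypothetical.

```python
from itertools import combinations

# Toy rooted tree: node -> (parent, length of the branch above the node).
# Names and branch lengths are invented for illustration.
TREE = {
    "root": (None, 0.0),
    "AB": ("root", 1.0),
    "A": ("AB", 2.0),
    "B": ("AB", 3.0),
    "C": ("root", 4.0),
}

def phylogenetic_diversity(species, tree):
    """Sum the lengths of all branches on root-to-tip paths of the chosen species."""
    visited = set()
    total = 0.0
    for node in species:
        # Climb toward the root, counting each branch only once.
        while node is not None and node not in visited:
            visited.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total

def best_subset(tree, tips, k):
    """Exhaustively find the k species with the greatest phylogenetic diversity."""
    return max(combinations(tips, k), key=lambda s: phylogenetic_diversity(s, tree))

print(phylogenetic_diversity(["A", "B"], TREE))  # 6.0
print(best_subset(TREE, ["A", "B", "C"], 2))     # ('B', 'C')
```

Exhaustive search is only workable for a handful of species; with unequal costs and partial investment per species the problem becomes a genuine optimization task.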

Bayesian calibration of divergence times

• Student: Michael Nowak

• Mentor: Derrick Zwickl

• Fossil occurrence data is used to construct informative priors on divergence times for Bayesian analysis in, e.g., BEAST

Phyloinformatics Summer Course

• Teaching advanced programming skills to phylogenetic methods developers

• Focus is on software technologies rather than methodology

• First year

  - 10 days in July 2007

  - Organized by Bill Piel of TreeBASE

  - 8 co-instructors

  - 23 students (11 female) in the first year

Conclusions

• The future of web-enabled comparative biology is beginning to become clearer.

  - For a preview, see genomics!

• The facile exchange of phylogenetic data is what will enable it.

• Expect to be using technologies such as ontologies and web services, which are now largely foreign to phylogenetic researchers.

• Also expect a shift toward open development.

  - This will necessitate new modes of training for academic phyloinformaticists.

Additional acknowledgements

• Hackathon participants

• GSoC mentors and students

• Summer course instructors

• Phenotype evolution project

  - Jim Balhoff, Wasila Dahdul, John Lundberg, Paula Mabee, Peter Midford, Monte Westerfield

• Data depository:

  - Ryan Scherle, Jane Greenberg