Prospects for enabling phylogenetically informed
comparative biology on the web
Todd Vision (1,2) & Hilmar Lapp (1)
(1) U.S. National Evolutionary Synthesis Center
(2) Dept. of Biology, University of North Carolina at Chapel Hill
Suppose you have the sequence of a protein-coding gene, and are interested in its function. What is the first thing you would do?
• If it were me, I would search for conserved domains that match records in Pfam and other protein domain databases.
• Are these databases complete?
• Are they infallible?
• Are they still useful?
Why are these data useful?
• You needn’t have mastery of the specialist literature before the search
• A match connects you to a vast interconnected world of information
• Why not worry about completeness?
! A negative result is not expensive
! Many broadly useful records are already present
• Why not worry about fallibility?
! The user can weigh the evidence once a match is found
! Assertions should be exposed to scrutiny
Some observations
• This infrastructure is designed to disseminate data to non-specialists
• The relevant data may be derived from multiple “studies”, not all of which are published
• Data is hoarded neither by the researcher nor by the domain database
• The search service is as widely disseminated as the data
• Semantic-level machine-to-machine communication facilitates human comprehension
The case of phylogenetic data
• There is a broad audience for phylogenetic data
! Organismal phylogeny (e.g. Encyclopedia of Life)
! Gene/protein trees
• Many of the available resources are geared toward specialist researchers & students
• Non-specialists turn to taxonomic classifications when they need organismal phylogenetic information
• Few know where to find gene/protein trees at all
TreeBase
• screenshot
Tree of Life Web Project
The NCBI taxonomy
• Provides
! A hierarchy for all species represented by DNA sequences in GenBank
! Names and IDs for internal nodes
! An FTP dump
• But does NOT
! Include unsequenced species
! Report confidence in topology or monophyly
! Capture taxonomic nuance (though it does have synonyms & common names)
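As a rough illustration of what the FTP dump offers, the sketch below reads a few rows laid out like the dump's `nodes.dmp` and `names.dmp` files (fields separated by a tab, a pipe, and a tab; each row ending in a tab and a pipe) and walks from a taxon up to the root. The rows, taxon IDs, and names are invented for illustration and are not real NCBI records, and the function names are hypothetical.

```python
# Toy rows in the taxdump layout; every ID and name here is invented.
NODES_DMP = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tclass\t|\n"
    "20\t|\t10\t|\torder\t|\n"
    "30\t|\t20\t|\tspecies\t|\n"
)
NAMES_DMP = (
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "10\t|\tExampleclass\t|\t\t|\tscientific name\t|\n"
    "20\t|\tExampleorder\t|\t\t|\tscientific name\t|\n"
    "30\t|\tExamplus fictus\t|\t\t|\tscientific name\t|\n"
    "30\t|\tmade-up fish\t|\t\t|\tcommon name\t|\n"
)


def split_row(line):
    """Split one dump row into its fields, dropping the row terminator."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def load_taxonomy(nodes_text, names_text):
    """Build parent, rank, and scientific-name lookups keyed by taxon ID."""
    parent, rank, sci_name = {}, {}, {}
    for line in nodes_text.splitlines():
        tax_id, parent_id, node_rank = split_row(line)[:3]
        parent[int(tax_id)] = int(parent_id)
        rank[int(tax_id)] = node_rank
    for line in names_text.splitlines():
        tax_id, name, _unique_name, name_class = split_row(line)[:4]
        if name_class == "scientific name":
            sci_name[int(tax_id)] = name
    return parent, rank, sci_name


def lineage(tax_id, parent):
    """Return the IDs from a taxon up to the root (the root is its own parent)."""
    path = [tax_id]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


parent, rank, sci_name = load_taxonomy(NODES_DMP, NAMES_DMP)
print([sci_name[t] for t in lineage(30, parent)])
```

Note what such a hierarchy cannot express: there is one parent per node and no field for support, conflict, or provenance.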
What if the NCBI taxonomy…
• Listed all taxa, including fossils?
• Allowed one to assess where there are conflicting topologies?
• Reported support values for clades?
• Reported divergence time estimates for nodes (e.g. from TimeTree)?
• Reported the provenance of the data?
Node-oriented web services from the Tree of Life Web Project
• Name
• Description
• Authority
• Date
• Other names
• Completeness of children
• Extinction status
• Confidence of position
• Monophyly
Further barriers to dissemination of phylogenetic information
• Technical obstacles
! Technology for storing and querying trees
! Difficulties with exchange standards
! Inference of consensus trees and supertrees
! Taxonomic intelligence
! Globally unique identifiers
• Social obstacles
! Reluctance to provide incomplete or fallible information
Outline
• Informatics @ NESCent
• An example of a phylogenetically-informed semantic web application for phenotype data
• Promoting interoperability and closing technical gaps in phyloinformatics through open development
NESCent sponsored science
• Catalysis Meetings (large, one-time events)
! To foster new collaborations and synthetic research
• Working Groups
! Smaller, focused, multiple meetings
• Sabbatical Scholars
• Postdoctoral fellows
• Short-term visitor program
! 2 weeks to 3 months
! Encourage collaborative projects
• Application info: http://www.nescent.org
Evolutionary Informatics WG
• Organizers: Arlin Stoltzfus and Rutger Vos
• Selected goals:
! XML serialization of NEXUS
! Formal grammar for validation and interconversion of NEXUS & other formats
! A transition model language for evolutionary models used in statistical inference
! An ontology for evolutionary comparative data analysis
• http://www.nescent.org/wg_evoinfo
NESCent Informatics
• Support for sponsored science and scientists
! Facilitating electronic collaboration
! Software/database development
! Providing HPC and other IT infrastructure
• Cyberinfrastructure for synthetic science
! Data sharing
! Software interoperability
! Training
! In partnership with major national and international efforts
GeoPhyloBuilder
• Extension for ArcGIS software that creates a spatiotemporal GIS network model from a tree with georeferenced nodes.
• 3D visualizations are possible through ArcScene.
• http://www.nescent.org/informatics/software.php
“Putting the geography into phylogeography”
David Kidd & Xianhua Liu
Phylogenetic cyberinfrastructure to enable comparative biology
• Two traditions in the recording of phenotype data
! Natural language descriptions and character matrices
! Statements made using anatomical and trait ontologies, designed to capitalize on the semantic web
• NESCent WG on morphological evolution in fish
! Organized by Paula Mabee and Monte Westerfield
! Led to a larger project
• Aim is to integrate
! Mutant phenotype data for zebrafish
! Comparative morphology data for the Ostariophysi
[Figure: ontology graph linking the terms cell, membrane, axolemma, axon, and cell projection by is_a and part_of relationships]
Ontologies
• Defined terms with defined relationships
! e.g. Gene Ontology, Cell Ontology
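To make "defined terms with defined relationships" concrete, here is a minimal sketch of a tiny ontology as a list of typed edges, using the terms cell, membrane, axolemma, axon, and cell projection that accompany this slide. The slide's figure is flattened in this copy, so which relation joins which pair of terms is an assumption, and the function names are hypothetical. The reasoning rule used (any chain of is_a and part_of links containing at least one part_of link implies part_of) is a common convention, not something the slide states.

```python
# Edges as (child, relation, parent). The terms and the two relation types come
# from the slide; the pairing of relations to terms is an assumption.
EDGES = [
    ("axolemma", "is_a", "membrane"),
    ("axolemma", "part_of", "axon"),
    ("axon", "is_a", "cell projection"),
    ("cell projection", "part_of", "cell"),
    ("membrane", "part_of", "cell"),
]


def ancestors(term, edges=EDGES):
    """Return every term reachable from `term` over is_a or part_of edges."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child, _relation, parent in edges:
            if child == current and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def is_part_of(term, whole, edges=EDGES):
    """True if some path from `term` to `whole` crosses at least one part_of edge."""
    stack = [(term, False)]
    seen = set()
    while stack:
        current, crossed_part_of = stack.pop()
        if (current, crossed_part_of) in seen:
            continue
        seen.add((current, crossed_part_of))
        if current == whole and crossed_part_of:
            return True
        for child, relation, parent in edges:
            if child == current:
                stack.append((parent, crossed_part_of or relation == "part_of"))
    return False


print(sorted(ancestors("axolemma")))
print(is_part_of("axolemma", "cell"))
```

A query for everything that is part of a cell would then return the axolemma even though no edge states that directly, which is the kind of inference that free-text descriptions do not support.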
Describing phenotypes using ontologies
• Entity-Quality system (EQ)
• Entity term from an anatomy ontology
! zebrafish anatomy, cell ontology, etc.
• Quality term from Phenotype and Trait Ontology (PATO)
• e.g. Entity = dorsal fin, Quality = round (a shape)
Phenotype and Trait Ontology (PATO)
[Figure: excerpt of the PATO hierarchy showing the terms physical quality, chromatic property, optical quality, color, blue, bright blue, dark blue, green, amplitude, and buoyancy]
[Figure: example character matrix with rows Species one, Species two, and Species three; columns for a dorsal fin shape character (states round, pointed, undulate) and a second character]
Evolutionary character matrices
• Common phenotypic data format in evolutionary biology (e.g. NEXUS)
• Characters + character states, similar to EQ
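The similarity between characters with character states and EQ statements can be sketched as a simple mapping: annotate each character once with the entity and attribute it describes, and every matrix cell then becomes an EQ statement. This is only an illustration. The species and states echo the slide's example, but which species carries which state is an assumption, real ontology identifiers are omitted, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQStatement:
    taxon: str
    entity: str     # term from an anatomy ontology
    attribute: str  # PATO attribute, e.g. shape
    value: str      # PATO value, e.g. round


# Each character is annotated once with the entity and attribute it describes.
CHARACTERS = {"dorsal fin shape": ("dorsal fin", "shape")}

# Rows are taxa, columns are characters, cells are character states.
# The assignment of states to species is assumed for illustration.
MATRIX = {
    "Species one": {"dorsal fin shape": "round"},
    "Species two": {"dorsal fin shape": "pointed"},
    "Species three": {"dorsal fin shape": "undulate"},
}


def matrix_to_eq(matrix, characters):
    """Turn every (taxon, character, state) cell into one EQ statement."""
    statements = []
    for taxon, row in matrix.items():
        for character, state in row.items():
            entity, attribute = characters[character]
            statements.append(EQStatement(taxon, entity, attribute, state))
    return statements


for statement in matrix_to_eq(MATRIX, CHARACTERS):
    print(statement)
```

The curation effort sits in the `CHARACTERS` table: a free-text character label has to be resolved to ontology terms by someone who knows the anatomy.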
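Character Matrix vs. EQ
[Figure: the character "dorsal fin shape" with character state "round" maps to an EQ statement with Entity = dorsal fin (from an anatomy ontology, AO) and Quality = attribute shape with value round (from PATO)]
A scenario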
• A geneticist observes a reduction in the number of a particular bone type (e.g. branchiostegal ray) in a zebrafish mutant of her favorite gene.
• She asks: is this bone variable in number among species in nature?
• She could query the evolutionary phenotype database using:
! Entity = Branchiostegal ray (from TAO)
! Qualities pertaining to attribute ‘count’ (from PATO)
• She could examine a visualization of the phylogenetic relationships of the taxa with the relevant character changes mapped.
• She would see that most Ostariophysi have 3 rays, but that reduction has occurred multiple times:
! solenostomids and syngnathids (ghost pipefishes and pipefishes)
! giganturids
! saccopharyngoid (gulper and swallower) eels
• By examining additional changes on these same branches, she sees several parallelisms:
! loss of the swimbladder, pelvic fins, and scales
! elongation of the mandibular or hyoid arches
! reduction or loss of the opercle in syngnathids and saccopharyngoids.
! a variety of other bones and soft tissues are lost or greatly modified
• She might hypothesize that these trait correlations are all due to alterations in the expression of the same suite of morphogens.
• She can select appropriate species from these lineages to follow up experimentally.
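The query step in this scenario can be sketched in a few lines: filter phenotype annotations by entity and attribute, then look for taxa whose value falls below the typical one and list what else is annotated for them. The sketch assumes annotations are available as simple tuples. The taxa and values are invented placeholders rather than real data, and the function names are hypothetical.

```python
from collections import Counter

# (taxon, entity, attribute, value); all taxa and values are invented placeholders.
ANNOTATIONS = [
    ("Taxon A", "branchiostegal ray", "count", 3),
    ("Taxon B", "branchiostegal ray", "count", 3),
    ("Taxon C", "branchiostegal ray", "count", 1),
    ("Taxon D", "branchiostegal ray", "count", 3),
    ("Taxon C", "pelvic fin", "presence", "absent"),
]


def query(annotations, entity, attribute):
    """Return {taxon: value} for annotations matching the entity and attribute."""
    return {t: v for t, e, a, v in annotations if e == entity and a == attribute}


def reduced_taxa(counts):
    """Return taxa whose count is below the most common count."""
    typical, _ = Counter(counts.values()).most_common(1)[0]
    return sorted(t for t, v in counts.items() if v < typical)


def other_annotations(annotations, taxon, entity):
    """Return the remaining annotations for a taxon, excluding the queried entity."""
    return [(e, a, v) for t, e, a, v in annotations if t == taxon and e != entity]


counts = query(ANNOTATIONS, "branchiostegal ray", "count")
for taxon in reduced_taxa(counts):
    print(taxon, counts[taxon], other_annotations(ANNOTATIONS, taxon, "branchiostegal ray"))
```

Mapping the changes onto branches, as the scenario describes, would additionally need the tree; this sketch only compares tip values.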
What data are needed to enable this scenario?
• Anatomy and trait ontologies
• Phenotypes in EQ syntax for
! Zebrafish mutants (already exist)
! Species/clades of Ostariophysi
• Phylogenetic relationships among the Ostariophysi
! Taxonomy ontology
Some anatomical ontologies
• Amphibia
• C. elegans
• Fish (zebrafish, medaka, teleosts)
• Insects (Drosophila, Mosquito, Hymenoptera)
• Mammals (mouse, human)
• Plants (Arabidopsis, cereals, maize, all plants)
[Figure: project organization diagram. Components shown:]
• NCBO
• NESCent (Vision, Lapp, software developers)
• OBO (host of TAO, PATO, taxonomy ontology)
• Applications (Phenote, OBO-Edit)
• Phenotype Ontologies for Evolutionary Biology Workshops
• EQSYTE database, curator interface, and public interface
• USD (Mabee, data curator)
• U. Oregon (Westerfield)
• Tulane U. (Rios, ontology curator)
• Liaisons to ZFIN, NCBO, and CToL
• Usability testing; working groups
• Morphology collaborators (Arratia, Coburn, Hilton, Lundberg, Mayden)
• Ichthyology community (DeepFin, FishBase)
• EQSYTE contents: ostariophysan phenotypic data; zebrafish phenotypic & genetic data; ontologies (taxonomy, TAO, PATO, homology)
Preserving published data for future integration efforts
• Sequence alignments (e.g. TreeBASE)
• Long-term population records (e.g. pedigrees)
• 2D and 3D images
• Collection and locality information
• Behavioral observations
• Numerical tables
• Etc.
• Most of these data are lost upon publication
• These are the stuff of comparative biology
Dryad: A digital repository for published data in evolutionary biology
NCSU Digital Library Initiative
Journals and societies involved so far
• American Naturalist (ASN)
• Evolution (SSE)
• Journal of Evolutionary Biology (ESEB)
• Integrative and Comparative Biology (SICB)
• Molecular Biology and Evolution (SMBE)
• Molecular Ecology
• Molecular Phylogenetics and Evolution
• Systematic Biology (SSB)
Open development
• Open source refers only to the licensing of the software code
• At NESCent, we have been experimenting with practices in open development
! Community contributes to a shared code base
! Higher barrier to entry
! Can be a substantial payoff in terms of interoperability, functionality, usability, maintenance
! Surprisingly rare in academia
2006 Phyloinformatics Hackathon
[Logos: TreeBASE, PAUP*, CIPRES, HyPhy, ATV, GARLI, NESCent, NCL, JEBL, Biojava, Biopython, BioRuby, BioSQL, Bio::CDAT, BioPerl]
Hackathon mechanics
• Before the meeting
! Participants and users suggested integrative workflows
• At the meeting
! Gaps in existing toolkits were identified
! Subgroups collaborated on high priority targets
! Followed a “use case” model
! Subgroups and targets were allowed to be fluid
! Users were on hand to provide datasets, test code, provide their perspective
! Dedicated participants tasked with documentation
• All code is open-source and deposited in established repositories
Accomplishments
• Sequence family evolution
! BioPerl: Support for TribeMCL, QuickTree, ClustalW, Phylip, PAML
! BioPerl & Biopython: Support for dN/dS-based tests for selection in HyPhy
! Biojava: Parser for Phylip alignment format
! BioRuby: Support for T-Coffee, MAFFT, and Phylip
• Reconciling trees
! BioPerl: Support for NJTree
! Biopython: Wrapper for Softparsmap
! BioRuby: Model for phylogenetic trees and networks with graph algorithms
! BioSQL: Model for phylogenetic trees and networks with optimization methods and topological queries
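One well-known way to make topological queries on trees cheap in a relational database is nested-set indexing: each node stores a left and right index from a depth-first walk, and every descendant's interval falls inside its ancestor's. The sketch below shows the idea with Python's built-in `sqlite3` module. The slide does not say which optimization the BioSQL tree model uses, so treat this as a generic illustration; the table layout, column names, and toy tree are all assumptions.

```python
import sqlite3

# Toy rooted tree as child -> parent; the labels are invented.
PARENT = {"A": "n1", "B": "n1", "n1": "root", "C": "root"}


def nested_set(parent):
    """Assign (left, right) indices to every node by a depth-first walk."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    root = next(p for p in parent.values() if p not in parent)
    bounds, counter = {}, 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in sorted(children.get(node, [])):
            visit(child)
        counter += 1
        bounds[node] = (left, counter)

    visit(root)
    return bounds


def load(conn, bounds):
    """Store the indexed nodes in a single table."""
    conn.execute(
        "CREATE TABLE node (label TEXT PRIMARY KEY, left_idx INTEGER, right_idx INTEGER)"
    )
    conn.executemany(
        "INSERT INTO node VALUES (?, ?, ?)",
        [(label, left, right) for label, (left, right) in bounds.items()],
    )


def descendants(conn, label):
    """Return all descendants of a node with one range comparison, no recursion."""
    rows = conn.execute(
        "SELECT d.label FROM node AS a JOIN node AS d "
        "ON d.left_idx > a.left_idx AND d.right_idx < a.right_idx "
        "WHERE a.label = ? ORDER BY d.left_idx",
        (label,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
load(conn, nested_set(PARENT))
print(descendants(conn, "n1"))
```

The trade-off is that inserting or moving a node means renumbering part of the tree, so the indices suit trees that are loaded once and queried often.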
• Phylogenetic inference on non-molecular characters
! BioPerl: Interoperability between Bio::Phylo and BioPerl APIs
! BioRuby: NEXUS-compliant data model and parser for PAUP and TNT results
• Phylogenetic footprinting
! BioPerl: Support for Footprinter, PhastCons, and using ClustalW over a sliding window
• Estimation of divergence times
! BioPerl: Draft design of r8s wrapper
• NEXUS compliance
! Biojava: Interoperability between Biojava and JEBL
! Biojava & BioRuby: Level II-compliant NEXUS parsers
! All:
  ! Evaluated major APIs
  ! Proposed compliance levels
  ! Gathered test files exposing common errors
  ! Fixed compliance issues in NCL and Bio::NEXUS reference implementations
  ! Worked on integrating those into GARLI and BioPerl, respectively
Next hackathon
• Comparative Phylogenetic Methods in R
• December 10-14, 2007
• Organizers: S. Kembel, H. Lapp, B. O'Meara, S. Price, T. Vision, A. Zanne
• http://hackathon.nescent.org/R_Hackathon_1
• Have an idea for a future event? Submit a whitepaper!
• Student internships in open-source software development
! Students work with any of a large number of established OS projects
! Students and mentors work & communicate remotely
• NESCent recruited mentors and oversaw student progress
! Eleven students worked on projects in visualization, usability, interoperability & implementation of new methods
NEXML
• Student: Jason Caravas
• Mentor: Rutger Vos
• Flexible serialization of phylogenetic objects
• Perl Bio::Phylo module tools for NEXML parsing and serialization
Command-line BioSQL
• Student: Jamie Estill
• Mentor: Hilmar Lapp
• Commands for
! Database initialization
! Bio::TreeIO import
! Bio::TreeIO export
! Tree query
! Tree optimization
! Tree manipulation
Conservation of phylogenetic diversity
• Student: Klaas Hartmann
• Mentor: Tobias Thierer
• Implementation of algorithm and GUI for optimal allocation of a finite budget to individual species to maximize phylogenetic diversity.
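For readers unfamiliar with the quantity being maximized, the sketch below computes phylogenetic diversity as the total branch length connecting a set of species to the root, and then finds the best affordable set by brute force. This is a deliberately simplified stand-in, not the student's algorithm: it assumes each species is either fully protected at a fixed cost or not at all, and it only scales to a handful of species. The tree, branch lengths, costs, and function names are all invented for illustration.

```python
from itertools import combinations

# Toy rooted tree: node -> (parent, branch length to parent). All values invented.
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "n1": ("root", 2.0),
    "C": ("root", 4.0),
}
COST = {"A": 2, "B": 1, "C": 2}  # hypothetical cost of protecting each species


def phylogenetic_diversity(species, tree=TREE):
    """Sum the lengths of all branches on the paths from the species to the root."""
    branches = set()
    for node in species:
        while node in tree:  # the root has no entry, so the walk stops there
            branches.add(node)
            node = tree[node][0]
    return sum(tree[branch][1] for branch in branches)


def best_set(budget, tree=TREE, cost=COST):
    """Try every affordable subset and keep the one with the highest diversity."""
    best, best_pd = (), 0.0
    names = sorted(cost)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if sum(cost[s] for s in subset) <= budget:
                pd = phylogenetic_diversity(subset, tree)
                if pd > best_pd:
                    best, best_pd = subset, pd
    return best, best_pd


print(best_set(budget=3))
```

Shared branches are counted once, which is why protecting two close relatives adds less diversity than protecting two distant ones at the same cost.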
Bayesian calibration of divergence times
• Student: Michael Nowak
• Mentor: Derrick Zwickl
• Fossil occurrence data is used to construct informative priors on divergence times for Bayesian analysis in, e.g., BEAST
Phyloinformatics Summer Course
• Teaching advanced programming skills to phylogenetic methods developers
• Focus is on software technologies rather than methodology
• First year
! 10 days in July 2007
! Organized by Bill Piel of TreeBASE
! 8 co-instructors
! 23 students (11 female)
Conclusions
• The future of web-enabled comparative biology is beginning to become clearer.
! For a preview, see genomics!
• The facile exchange of phylogenetic data is what will enable it.
• Expect to be using technologies such as ontologies and web services, which are now largely foreign to phylogenetic researchers.
• Also expect a shift toward open development.
! This will necessitate new modes of training for academic phyloinformaticists.
Additional acknowledgements
• Hackathon participants
• GSoC mentors and students
• Summer course instructors
• Phenotype evolution project
! Jim Balhoff, Wasila Dahdul, John Lundberg, Paula Mabee, Peter Midford, Monte Westerfield
• Data depository
! Ryan Scherle, Jane Greenberg