The Development of an Ontology for Data Integration and Query in Comparative Genomics. Trevor Paterson and Andy Law Roslin Institute, Scotland

  • The Development of an Ontology for Data Integration and Query in Comparative Genomics.

    Trevor Paterson and Andy Law, Roslin Institute, Scotland

  • Aims:

    - To develop enabling technologies for comparative genomics.
    - To integrate disparate resources (genomic mapping, DNA sequence, evolutionary relationships, functional information) across species boundaries.
    - In order to inform and expedite genomic mapping, particularly in non-model organisms.

  • Collaborators: Farm animal, crop and microbial genomics; Bioinformatics; Computer Sciences; Statistics.

  • DISPARATE GENOMIC MAPPING DATA

    - for individual species
    - multiple datatypes
    - in many non-standard formats and databases
    - archived in many locations, with a variety of access protocols
    - data of variable quality and completeness

    PLUS ONLINE BIOINFORMATICS RESOURCES

    - DNA sequence and genome projects
    - Gene structure and function
    - Protein structure, family, function
    - Evolutionary history, orthology, homology
    - Phenotypes (genetic traits and diseases)
    - Population genetics
    - Gene expression patterns
    - Publications

    Current integration between data sources and across species is largely manual, i.e. difficult, error-prone and very inefficient.

  • Why do Biologists want to integrate mapping data across species?

    What are they trying to do?

    GOAL: MAP, IDENTIFY AND UNDERSTAND GENES BEHIND PHENOTYPES (i.e. DISEASES & TRAITS)

    ComparaGrid aims to assist this process by exploiting existing mapping data across species boundaries.

  • UNDERLYING BIOLOGICAL PRINCIPLE BEHIND CROSS-SPECIES MAP COMPARISON

    Conservation of Synteny: Conservation of (blocks of) gene order throughout chromosomal evolution

    As species evolve and diverge, their chromosomes rearrange through duplications, inversions, translocations etc - but blocks of genes can be traced through evolutionary history between even relatively divergent species (e.g. chicken and man).

    Therefore the known gene order in these blocks in one species can inform/predict the order of evolutionarily related genes (orthologues) in other species.
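
The idea of tracing conserved blocks can be illustrated with a minimal Python sketch. It assumes that orthologue pairs are already known and that each chromosome is reduced to an ordered list of gene names. The function name, the gene identifiers and the `min_len` threshold are all hypothetical, and real synteny detection has to tolerate gaps, duplications and uncertain orthology, which this sketch does not.

```python
from typing import Dict, List


def conserved_blocks(order_a: List[str], order_b: List[str],
                     orthologue: Dict[str, str], min_len: int = 2) -> List[List[str]]:
    """Return runs of species-A genes whose orthologues are adjacent in species B.

    A run may appear in the same orientation or inverted in species B.
    Species-A genes with no known orthologue are ignored.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    # Keep only species-A genes that have a mapped orthologue in species B.
    mapped = [g for g in order_a if orthologue.get(g) in pos_b]

    blocks: List[List[str]] = []
    current: List[str] = []
    direction = 0  # +1 same orientation, -1 inverted, 0 not yet known
    for gene in mapped:
        if current:
            step = pos_b[orthologue[gene]] - pos_b[orthologue[current[-1]]]
            if step in (1, -1) and direction in (0, step):
                current.append(gene)
                direction = step
                continue
            if len(current) >= min_len:
                blocks.append(current)
        current = [gene]
        direction = 0
    if len(current) >= min_len:
        blocks.append(current)
    return blocks


# Invented example: a1..a3 are inverted in species B, a4..a5 keep their order.
ORDER_A = ["a1", "a2", "a3", "a4", "a5", "a6"]
ORDER_B = ["b3", "b2", "b1", "b9", "b4", "b5"]
ORTHOLOGUE = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4", "a5": "b5"}
BLOCKS = conserved_blocks(ORDER_A, ORDER_B, ORTHOLOGUE)
```

With blocks like these in hand, the known order within a block in one species gives a prediction for the order of the corresponding genes in the other.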

  • [Figure: evolution of an ancestral chromosome into the chromosomes of modern species A and B. Labels: Ancestral Chromosome, 20M, 10M, Speciation Event, Breakage, Inversion, Modern Species.]

    HYPOTHESIS: Sequence Similarity & Conserved Synteny => Orthology

  • COMPARATIVE GENOMICS USE CASE

    Agribusiness wants to map the underlying genetic basis of the Tasty Bacon Trait (a QTL).

    [Figure labels: QTL (Genetic) Map, Tasty Bacon]

  • COMPARATIVE GENOMICS USE CASE

    The position of the QTL is correlated on various types of Pig Genetic maps.

    [Figure labels: QTL Map, Tasty Bacon]

  • COMPARATIVE GENOMICS USE CASE

    There is a known homology between a Pig Marker/Sequence in this region and the human genome.

    DNA Sequence Similarity => Homology =>? Orthology

    [Figure labels: Human, Pig, QTL Map]

  • COMPARATIVE GENOMICS USE CASE

    A Physical Map of BAC clones exists for this region of the Human Genome.

    [Figure labels: Human, Pig, QTL Map]

  • COMPARATIVE GENOMICS USE CASE

    There are known chicken expressed sequences homologous to Human Gene Sequences in this region.

    [Figure labels: Human, Pig, QTL Map, Chicken]

  • COMPARATIVE GENOMICS USE CASE

    Gene expression data for these Chick ESTs might correlate with a trait similar to Tastiness.

    [Figure labels: Human, Pig, QTL Map, Chicken, Expression Analysis]

  • COMPARATIVE GENOMICS USE CASE

    The literature may detail functions of Human genes in this region, and homologies to genes in other species, helping the researcher predict candidate genes in Pigs responsible for tastiness.

    [Figure labels: Human, Pig, QTL Map, Chicken, Expression Analysis, Linked References]

  • COMPARATIVE GENOMICS USE CASE: HOW CAN WE AUTOMATE THIS?

    Provide Architecture to Link and Traverse Data Sources: GRID / Web services

    Provide Data Standards to allow this: Syntax and Semantics of Data

    Formalise the Links between Data: these Relationships are Data too, and they are what the Biologists care about
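
To make "relationships are data too" concrete, here is a small Python sketch in which every link from the pig/human/chicken use case is stored as a record and then traversed automatically. All identifiers and relationship names are invented for illustration; nothing here reflects the actual ComparaGrid services or vocabulary.

```python
from collections import deque
from typing import Dict, List, Tuple

# Each link is itself a record: (subject, relationship, object).
# All identifiers and relationship names are invented for illustration.
LINKS: List[Tuple[str, str, str]] = [
    ("pig:QTL_tasty_bacon", "maps_near", "pig:marker_M1"),
    ("pig:marker_M1", "homologous_to", "human:gene_G1"),
    ("human:gene_G1", "located_on", "human:BAC_clone_B7"),
    ("human:gene_G1", "homologous_to", "chicken:EST_E42"),
    ("chicken:EST_E42", "has_expression_data", "chicken:expt_X9"),
]


def traverse(start: str, links: List[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Breadth-first walk over the links.

    Returns, for every reachable item, the chain of relationships that
    leads to it from `start`.
    """
    outgoing: Dict[str, List[Tuple[str, str]]] = {}
    for subj, rel, obj in links:
        outgoing.setdefault(subj, []).append((rel, obj))

    paths: Dict[str, List[str]] = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for rel, obj in outgoing.get(node, []):
            if obj not in paths:
                paths[obj] = paths[node] + [rel]
                queue.append(obj)
    return paths


PATHS = traverse("pig:QTL_tasty_bacon", LINKS)
```

Because the path to each item is kept, a biologist (or an application) can see which chain of relationships, and therefore which assumptions, connects a pig QTL to a chicken expression experiment.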

  • WHAT DOES COMPARAGRID NEED TO INTEGRATE DATASOURCES IN A BIOLOGICALLY RELEVANT FASHION?

    A lightweight Exchange Standard or a heavyweight Ontology in OWL-DL?

    1. Lightweight Mapping from RDB Schema to standard

    Minimally: a data exchange standard (defines structure and vocabulary for data exchange): XML Schema? RDF? (a straightforward mapping for data providers, but the integration logic that handles the meaning of relationships must live in the Application)

  • 2. More Heavyweight Mapping

    Capturing the Semantics of the Data: a Defined RDFS Vocabulary? (mapping is still quite lightweight; data is better defined and more reliably integrated; integration of data can be automatic; Applications can rely on the semantics)
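
As a rough illustration of what "Applications can rely on semantics" might mean at the RDFS level, the following Python sketch applies two standard RDFS entailment rules (rdfs7 for `rdfs:subPropertyOf`, rdfs9 for `rdfs:subClassOf`) to a handful of triples. The `cg:` terms are a made-up vocabulary, not the ComparaGrid one, and the naive fixed-point loop is for demonstration only; a real system would use an RDF library or reasoner.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def rdfs_closure(triples: Set[Triple]) -> Set[Triple]:
    """Apply RDFS rules rdfs7 and rdfs9 until no new triple is derived."""
    inferred = set(triples)
    while True:
        new: Set[Triple] = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # rdfs7: (s p o), (p subPropertyOf q) => (s q o)
                if p2 == "rdfs:subPropertyOf" and s2 == p:
                    new.add((s, o2, o))
                # rdfs9: (s type C), (C subClassOf D) => (s type D)
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdf:type", o2))
        if new <= inferred:
            return inferred
        inferred |= new


# Hypothetical vocabulary and data.
DATA: Set[Triple] = {
    ("cg:orthologousTo", "rdfs:subPropertyOf", "cg:homologousTo"),
    ("cg:Gene", "rdfs:subClassOf", "cg:MappableObject"),
    ("pig:G1", "rdf:type", "cg:Gene"),
    ("pig:G1", "cg:orthologousTo", "human:G1"),
}
CLOSURE = rdfs_closure(DATA)
```

Here a query for everything `cg:homologousTo` a human gene would also find the pig gene, even though the data provider only stated the more specific orthology relationship.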

  • 3. Heavyweight Mapping

    Semantically represent the Relationships between Data (and Relationships between Relationships): Formal Ontology (OWL-DL) (mapping from data source to Ontology is complex and specialist; automatic integration and inference is possible over data represented as individuals of the ontology)

  • DO WE NEED YET ANOTHER ONTOLOGY?

    We think comparative genomics is very different from other biological knowledge domains (SO, OBO, GO)

    We need to integrate both abstract and physical data: experimental observations positioning markers on abstract maps, and physical locations of features on representations of DNA sequences

    Metadata is important: we need to treat mapping data as assertions that might be accepted or rejected on the basis of quality, provenance and trust

    We need to represent evolutionary relationships between mapped objects: these are also assertions, not facts, based on the relatedness of underlying physical objects (sequence similarity).

    Integration between datasources depends on accepting these evolutionary assertions!
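
One way to picture "assertions, not facts" is to store each evolutionary relationship as an object carrying its own provenance and evidence, and to let the user decide which assertions to accept before any cross-species link is followed. The Python sketch below does this under simple assumptions: the class, field names, sources and thresholds are all hypothetical, and sequence identity stands in for whatever evidence a real assertion would carry.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class OrthologyAssertion:
    """A claimed evolutionary relationship, kept together with its metadata."""
    gene_a: str
    gene_b: str
    source: str               # who made the assertion (provenance)
    sequence_identity: float  # supporting evidence, from 0.0 to 1.0


def accept(assertions: Iterable[OrthologyAssertion],
           trusted_sources: Set[str],
           min_identity: float) -> List[OrthologyAssertion]:
    """Keep only assertions from trusted sources with enough supporting evidence."""
    return [a for a in assertions
            if a.source in trusted_sources and a.sequence_identity >= min_identity]


def orthologues_of(gene: str, accepted: Iterable[OrthologyAssertion]) -> Set[str]:
    """Genes linked to `gene` by an accepted assertion, in either direction."""
    linked: Set[str] = set()
    for a in accepted:
        if a.gene_a == gene:
            linked.add(a.gene_b)
        elif a.gene_b == gene:
            linked.add(a.gene_a)
    return linked


# Invented assertions from two invented sources.
ASSERTIONS = [
    OrthologyAssertion("pig:G1", "human:G1", "curated_db", 0.92),
    OrthologyAssertion("pig:G1", "chicken:G7", "blast_pipeline", 0.55),
    OrthologyAssertion("human:G1", "chicken:G3", "curated_db", 0.81),
]

STRICT = accept(ASSERTIONS, {"curated_db"}, 0.8)
LENIENT = accept(ASSERTIONS, {"curated_db", "blast_pipeline"}, 0.5)
```

The same data yields different integrated views depending on which assertions are accepted, which is why the metadata has to travel with the relationship.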

  • IDEALIZED COMPARAGRID ARCHITECTURE:

    The OWL Ontology forms the 'semantic glue' to integrate data sources and express cross species queries.

    The mapping between the data source schema and the integration schema (the CG OWL Ontology) is critical.

  • COMPARAGRID STACK ARCHITECTURE:

    [Figure labels: Raw data, Raw data, Publisher service, Transformer service, Integrator, Aggregation, Semantics, Syntax]

    A publisher service automates mapping DB Schema to OWL

    Bespoke mapping rules map from DB-OWL to CG-OWL
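
The two steps named above (a mechanical schema-to-OWL publication, then bespoke rules from DB-OWL to CG-OWL) can be sketched in Python with plain triples standing in for OWL individuals. This is only an assumed shape for the idea: the `db:` and `cg:` names, the `marker` table and the rule format are invented, and the real publisher and transformer services may work quite differently.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def publish(table: str, key: str, rows: List[Dict[str, str]]) -> List[Triple]:
    """Mechanical step: one typed individual per row, one property per column."""
    triples: List[Triple] = []
    for row in rows:
        subject = f"db:{table}/{row[key]}"
        triples.append((subject, "rdf:type", f"db:{table}"))
        for column, value in row.items():
            if column != key:
                triples.append((subject, f"db:{table}.{column}", value))
    return triples


def transform(triples: List[Triple], rules: Dict[str, str]) -> List[Triple]:
    """Bespoke step: rename database-level terms to shared-ontology terms.

    Triples whose predicate has no rule are dropped; a class name in the
    object position of rdf:type is renamed when a rule exists for it.
    """
    out: List[Triple] = []
    for s, p, o in triples:
        if p not in rules:
            continue
        if p == "rdf:type":
            o = rules.get(o, o)
        out.append((s, rules[p], o))
    return out


# Hypothetical rules for a hypothetical "marker" table.
RULES = {
    "rdf:type": "rdf:type",
    "db:marker": "cg:Marker",
    "db:marker.chromosome": "cg:locatedOnChromosome",
    "db:marker.position_cm": "cg:hasGeneticPosition",
}

ROWS = [{"id": "M1", "chromosome": "7", "position_cm": "42.5", "internal_flag": "x"}]
DB_LEVEL = publish("marker", "id", ROWS)
CG_LEVEL = transform(DB_LEVEL, RULES)
```

The first step needs no domain knowledge and could be automated for any table; all of the biological meaning sits in the rule set, which is where the specialist mapping effort goes.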

  • BUILDING THE COMPARAGRID ONTOLOGY

    Stage I (Biologists & Bioinformaticians input)

    - Define the Scope of the Domain
    - Collect the terminology used in the Domain
    - Interview practising experts
    - Document some use cases
    - Observe how the experts perform an analysis
    - Define the terms and relationships necessary
    - Model the knowledge domain

    OUTPUTS:

    - a model of the knowledge domain
    - a prototype ontology (in OWL-DL): terms and relationships necessary to represent the data and the relationships between data (using Protégé).

  • BUILDING THE COMPARAGRID ONTOLOGY

    Stage II (Biologists, Bioinformaticians, Ontologists)

    - Hold workshops for panels of experts across the scope of the domain (animal, plant, microbe).
    - Confirm the Concepts and Relationships that are required.
    - Confirm our model of the knowledge domain.
    - Iterate and refine the prototype ontology representing this model.

    OUTPUT: version 1 prototype ComparaGrid OWL Ontology

  • HIERARCHY OF CONCEPTS IN THE COMPARAGRID ONTOLOGY

  • COMPARAGRID ONTOLOGY: Simple Relationships = Properties

    [Figure panels: Hierarchy of Object to Object Properties; Hierarchy of Object to Value Properties]

  • In OWL-DL complex relationships can be modelled as Concepts

  • The Importance of Relationships

    Biologists and Bioinformaticians see an important conceptual difference between:

    The nuts and bolts relationships within the data (EXPERIMENTAL OBSERVATIONS and FACTS)

    Vs

    The biological hypotheses (ASSERTIONS)

    Hopefully the richness and expressivity of OWL-DL will give us the opportunity to capture the subtleties of the different types of relationships and how they may relate to each other.

    Critically, we want to infer over the data represented as individuals, not merely over properties of the ontology.

  • COMPARAGRID ONTOLOGY: Complex Relationships (as Concepts)

  • BUILDING THE COMPARAGRID ONTOLOGY

    Stage III (Expert Ontologists)

    - Refactor the prototype ontology according to good design principles
    - Build a core upper-level comparative mapping domain ontology that will integrate with other domains
    - Incorporate additional modules to represent specific subdomains (Genetic Variation, Abstract Mapping Concepts, Evidence, Evolutionary Relationships etc.)

    OUTPUT: modularised ComparaGrid OWL Ontology

  • THE MODULARISED COMPARAGRID ONTOLOGY

  • BUILDING THE COMPARAGRID ONTOLOGY

    Timescale

    - Stage I: 6 months
    - Stage II: 6 months
    - Stage III: ongoing / 3 years

    Problem: how do we develop the architecture and software when we don't have a final Ontology or model?

    Use the Prototype version? Use small hack ontologies for demonstration data?

    But can we be sure the principles will work for the final larger, more complex Ontology?

  • USING THE COMPARAGRID ONTOLOGY:

    Querying distributed resources through the ComparaGrid Stack Architecture

    Under Development:

    - Tools for converting DB schema to OWL ontology
    - Tool support for mapping DB ontologies to CG ontology
    - Automatic query translations up and down the stack

    Allows queries to be expressed and resolved in OWL: should allow automated reasoning and inference, i.e. Fun Time for the Computer Scientists...

  • Roslin Ark Databases experience as Data Providers (and Biologists/Users)

    We want to export and import data in reusable format

    We could build all our own applications using a common data format, allowing us to traverse data sets according to assertions made between the data...

    ...but we want to use ComparaGrid's clever integration and query through OWL

    i.e. we want to exchange data as OWL, so we have to incorporate mapping from schema to OWL into our service architecture

  • Roslin Ark Databases experience as Data Providers

    Problems:

    We are waiting for the final ontology

    We are waiting for the stack architecture (which is waiting for the ontology)

    The ComparaGrid Architecture/Toolset is being designed to map from DB schema to OWL, but our DB schema captures none of our domain model: our mapping should be from Object model to OWL.

    We have to implement our own mapping to OWL.

    We want to progress and ACTUALLY DO SOME BIOLOGY!

  • [Figure label: Java Application]

  • ComparaGrid Ontology: Where are we at, and Why?

    Prototype OWL Ontology created:

    - used to demonstrate mapping of ArkDB to Web services.
    - Ontology is flabby and poorly designed?
    - Mapping from Java to OWL/XML is a cumbersome, manual process.

    Refactoring/modularising the ComparaGrid OWL Ontology is non-trivial (a Research Project in its own right!).

    - We are not able to use a final ontology to drive the development of services.

    Until we have a working common data format or ontology we can't start to import and export further data sources.

  • ComparaGrid Ontology: Where are we at, and Why?

    Implementation of the ComparaGrid stack integration and query architecture is ongoing.

    Automated / Assisted mapping tools under development. (DB relational schema → DB-OWL → CG-OWL) [Using hack ontology fragments in the interim.]

    We need further tools to support mapping from any ad hoc database or object model to OWL

  • ComparaGrid Ontology: Where are we at, and Why?

    As data providers, Roslin ArkDB is dependent on the tools and infrastructure being developed by ComparaGrid, without knowing how much added value an ontology will give. We hope that the ontology will:

    - allow us to represent the interesting biological relationships
    - facilitate automated integration and data traversal
    - allow inference of new knowledge automatically

    However, the burden is put on the data mapping process. A more lightweight approach (e.g. RDF/RDFS) would simplify this, but might require that applications understand the context of information sources.

    RDF(S) is becoming quite well supported and allows some inference over semantic relationships. WOULD IT BE GOOD ENOUGH FOR US?