Bio-ontologies for Annotation and Service Discovery
Chris Wroe (with material from Carole Goble, Alan Rector, Jeremy Rogers, Ian Horrocks)
University of Manchester, UK
Overview
Example-driven tour of the why, what and how of ontologies in the life sciences
Cover the key features of an ontology
  Vocabulary, definitions, hierarchies, grammar & reasoning
Cover the key targets of ontology use
  Biological knowledge, service descriptions, (database schema)
Ontology – the discipline
Semantics – the meaning of meaning.
Philosophical discipline: the branch of philosophy that deals with the nature and the organisation of reality.
Science of Being (Aristotle, Metaphysics, IV, 1)
What is being?
What are the features common to all beings?
In science… ontology the thing
A resource to aid the precise communication and integration of information.
Binds a community to communicate information in some domain of interest in a consistent manner.
Gene Ontology – a community effort
Model organism databases need to be integrated.
Not possible if they all use a different vocabulary.
The Gene Ontology Consortium got together to form “a dynamic controlled vocabulary that can be applied to all eukaryotes”.
Gene Ontology – keeping it simple
Provide three separate vocabularies to describe:
  The function a gene product is capable of.
  The process a gene product takes part in.
  The location at which the gene product has been found.
GO annotations
Gene detail page in MGD for the vitamin D receptor gene, Vdr
Annotation
Feature 1:
Ontologies provide a shared controlled vocabulary of concepts.
Gene ontology - definitions
A diverse community, so explicit definitions are important.
60% of GO concepts have a textual definition, e.g.
apoptotic nuclear changes (GO:0030262): Changes affecting the nucleus and its contents during apoptosis; includes condensation and fragmentation of nuclear DNA and of the nucleus itself.
Feature 2:
Ontologies provide an agreed definition for each concept to ensure each concept is used in the same way.
[Figure: hierarchy fragment with nodes biological process, death, cell death, tissue death, necrosis, histolysis]
Gene ontology – organisation
An alphabetical list of 11000 terms is not enough
Hierarchies allow similar terms to be grouped together.
Gene ontology – hierarchy use
GO hierarchy is used for
  Navigation of concepts by users
  Indexing of information in databases
  Aggregating information
Taxonomy remark 1
The world is not a tree, it’s a lattice
[Figure: lattice with nodes animal, rodent, cow, cat, mouse, dog, domestic, vermin, wild, pet, working]
Taxonomy remark 2
What does the taxonomy mean?
Concept A is a parent of concept B iff every instance of B is also an instance of A (superset/subset)
[Figure: ICONCLASS entries under Door: Metalwork of a Door, Closing the Door, Monumental Door, Door-Knocker, Door-keeper, Threshold; annotated as "action associated with a door", "something attached to a door" and "kind of a door"]
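The subset reading of is-a can be checked mechanically once instances are enumerated. The following is a minimal sketch in Python; the instance sets and individual names are invented for illustration and are not taken from ICONCLASS.

```python
# Toy extensions: each concept maps to the set of its instances.
# All individual names are hypothetical, chosen only to illustrate the rule.
extension = {
    "door": {"front_door", "cathedral_door"},
    "monumental_door": {"cathedral_door"},
    "door_knocker": {"brass_knocker"},
}


def is_valid_parent(parent: str, child: str) -> bool:
    """A is a valid is-a parent of B iff every instance of B is an instance of A."""
    return extension[child] <= extension[parent]


# A monumental door is a kind of door, so the subset test holds.
print(is_valid_parent("door", "monumental_door"))
# A door-knocker is attached to a door but is not itself a door.
print(is_valid_parent("door", "door_knocker"))
```

Under this reading, only "kind of" links qualify as is-a; "attached to" and "action associated with" links need a different, explicitly named relationship.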
Classification trickiness
"On those remote pages it is written that animals are divided into: a. those that belong to the Emperor b. embalmed ones c. those that are trained d. suckling pigs e. mermaids f. fabulous ones g. stray dogs h. those that are included in this classification i. those that tremble as if they were mad j. innumerable ones k. those drawn with a very fine camel's hair brush l. others m. those that have just broken a flower vase n. those that resemble flies from a distance"
The Celestial Emporium of Benevolent Knowledge, Borges
Classification is task and culture specific
Dyirbal classification of objects in the universe
Bayi: men, kangaroos, possums, bats, most snakes, most fishes, some birds, most insects, the moon, storms, rainbows, boomerangs, some spears, etc.
Balan: women, anything connected with water or fire, bandicoots, dogs, platypus, echidna, some snakes, some fishes, most birds, fireflies, scorpions, crickets, the stars, shields, some spears, some trees, etc.
Balam: all edible fruit and the plants that bear them, tubers, ferns, honey, cigarettes, wine, cake.
Bala: parts of the body, meat, bees, wind, yamsticks, some spears, most trees, grass, mud, stones, noises, language, etc.
Gene ontology – directed acyclic graphs
Each concept is explicitly grouped by either is-a or part-of relationships
  Functions are often grouped by type
  Cellular components are often grouped by part
Each concept can have multiple parents
A concept's position is represented by a directed acyclic graph
Hierarchies are handcrafted so as to suit the ‘culture’ of biologists
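A directed acyclic graph with typed edges can be sketched as a mapping from each term to its parents. The example below is a minimal sketch: the term names come from the metabolism example later in this talk, but the edge set is one assumed reading of that flattened figure, not a copy of the real GO graph.

```python
# Each term maps to a list of (relationship, parent) edges.
# The edges are an illustrative assumption, not the actual GO graph.
edges = {
    "heparin biosynthesis": [
        ("is_a", "aminoglycan biosynthesis"),
        ("is_a", "heparin metabolism"),
    ],
    "aminoglycan biosynthesis": [("is_a", "carbohydrate biosynthesis")],
    "carbohydrate biosynthesis": [("is_a", "biosynthesis")],
    "heparin metabolism": [],
    "biosynthesis": [],
}


def ancestors(term, relations=("is_a", "part_of")):
    """Collect every term reachable by following the chosen edge types upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in edges.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("heparin biosynthesis")))
```

Because a term may have several parents, the traversal keeps a `seen` set so that a shared ancestor is visited only once.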
Feature 3:
Ontologies organise concepts in multiple ways for multiple uses. Principle of grouping should be explicit.
Taking it further
GO concepts are often phrases
  insulin control element activator complex, insulin processing, insulin receptor, insulin receptor complex, insulin receptor ligand, insulin receptor signalling pathway, insulin secretion, insulin activated sodium/amino acid transporter, …
Components of the phrase are hidden to computer applications
  Explicit conceptualisation
  Semantic similarity searching
  Automated maintenance of hierarchies
What we need is:
  A formal grammar with which to compose phrases
  Software which can interpret phrases and produce sound and complete hierarchies
The exploding bicycle
  ICD-9 (E826): 8
  READ-2 (T30..): 81
  READ-3: 87
  ICD-10 (V10-19): 587
V31.22 Occupant of three-wheeled motor vehicle injured in collision with pedal cycle, person on outside of vehicle, nontraffic accident, while working for income
W65.40 Drowning and submersion while in bath-tub, street and highway, while engaged in sports activity
X35.44 Victim of volcanic eruption, street and highway, while resting, sleeping, eating or engaging in other vital activities
Defusing the exploding bicycle: 500 codes in pieces
10 things to hit…
  Pedestrian / cycle / motorbike / car / HGV / train / unpowered vehicle / a tree / other
5 roles for the injured…
  Driving / passenger / cyclist / getting in / other
5 activities when injured…
  resting / at work / sporting / at leisure / other
2 contexts…
  In traffic / not in traffic
V12.24 Pedal cyclist injured in collision with two- or three-wheeled motor vehicle, unspecified pedal cyclist, nontraffic accident, while resting, sleeping, eating or engaging in other vital activities
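The arithmetic behind "500 codes in pieces" can be checked directly. This short sketch uses only the axis sizes stated on the slide; the dictionary keys are descriptive labels chosen here.

```python
from math import prod

# Axis sizes as stated on the slide.
axes = {
    "thing hit": 10,
    "role of the injured": 5,
    "activity when injured": 5,
    "context": 2,
}

building_blocks = sum(axes.values())  # pieces that have to be maintained by hand
combinations = prod(axes.values())    # codes that can be composed from those pieces

print(building_blocks, combinations)
```

The point of composition is that the number of hand-maintained pieces grows with the sum of the axis sizes, while the number of expressible codes grows with their product.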
Coordination: Conceptual Lego
[Figure: building blocks labelled hand, extremity, body, acute, chronic, abnormal, normal, ischaemic, deletion, bacterial, polymorphism, cell, protein, gene, infection, inflammation, Lung, expression]
Conceptual Lego
“SNPolymorphism of CFTRGene causing Defect in MembraneTransport of ChlorideIon causing Increase in Viscosity of Mucus in CysticFibrosis…”
“Hand which is anatomically normal”
DAML+OIL
Specifically designed to compose phrases in a compositional manner
Becoming a standard ontology interchange language
Adopted by W3C and will soon become the Web Ontology Language (OWL)
Reasoning support
Consistency: check if knowledge is meaningful
Subsumption: structure knowledge, compute taxonomy
Equivalence: check if two classes denote the same set of instances
Instantiation: check if individual i is an instance of class C
Retrieval: retrieve the set of individuals that instantiate C
These problems are all reducible to consistency (satisfiability)
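The reductions can be written out as follows. These are the standard description logic reductions for a logic with full negation; the notation ($\mathcal{K}$ for the knowledge base, $C$ and $D$ for classes, $i$ for an individual) is assumed for this sketch and does not appear on the slides.

```latex
\begin{align*}
C \sqsubseteq D &\iff C \sqcap \lnot D \text{ is unsatisfiable w.r.t. } \mathcal{K} \\
C \equiv D &\iff C \sqsubseteq D \text{ and } D \sqsubseteq C \\
\mathcal{K} \models i : C &\iff \mathcal{K} \cup \{\, i : \lnot C \,\} \text{ is inconsistent} \\
\mathrm{retrieve}(C) &= \{\, i \mid \mathcal{K} \models i : C \,\}
\end{align*}
```

A reasoner therefore needs only one core procedure, the satisfiability test, to support all five services.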
Gene Ontology Next Generation
Early aim
  Proof of concept showing DAML+OIL & description logic can practically help in at least one aspect of GO maintenance.
  In cooperation with Mike Ashburner and the GO editorial team
Further aims
  Prototype an evolutionary environment in which the benefits can be replicated on a larger scale
Preliminary task
Providing an exhaustive is-a taxonomy
GO is-a poly-hierarchy
It becomes increasingly laborious to make sure that all concepts are linked to all possible is-a parents
Metabolism terms: e.g. heparin biosynthesis
Axis 1: Chemicals
Axis 2: Process
[chemical] biosynthesis (GO:0009058)
[i] carbohydrate biosynthesis (GO:0016051)
[i] aminoglycan biosynthesis (GO:0006023)
[i] heparin biosynthesis (GO:0030210)
[i] glycosaminoglycan biosynthesis (GO:0006024)
[i] heparin metabolism (GO:0030202)
[i] heparin biosynthesis (GO:0030210)
Is this important?
A complete taxonomy is not necessary for browsing by biologists (and may actually get in the way)
BUT… it improves the fidelity of database record retrieval.
Asking for records annotated with ‘glycosaminoglycan biosynthesis’ or more specific will lead to an additional result:
  O94923 SPTr ISS - D-glucuronyl C5-epimerase (Fragment)
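The effect of a missing is-a link on retrieval can be sketched in a few lines. This is an illustration under stated assumptions: the slide names O94923 only as the additional result, so its annotation with "heparin biosynthesis" is assumed here, and the parent tables are simplified.

```python
# is-a parents per term, before and after the missing link is added.
# Assumption: record O94923 is annotated with "heparin biosynthesis".
parents_before = {
    "heparin biosynthesis": {"heparin metabolism"},
    "glycosaminoglycan biosynthesis": set(),
    "heparin metabolism": set(),
}
parents_after = {
    **parents_before,
    "heparin biosynthesis": {"heparin metabolism", "glycosaminoglycan biosynthesis"},
}
annotations = {"O94923": "heparin biosynthesis"}


def is_term_or_descendant(term, query, parents):
    """True if term equals query or reaches it by following is-a links upward."""
    if term == query:
        return True
    return any(is_term_or_descendant(p, query, parents) for p in parents[term])


def retrieve(query, parents):
    """Records annotated with the query term or anything more specific."""
    return [record for record, term in annotations.items()
            if is_term_or_descendant(term, query, parents)]


print(retrieve("glycosaminoglycan biosynthesis", parents_before))
print(retrieve("glycosaminoglycan biosynthesis", parents_after))
```

With the link missing the query returns nothing; once the link is present the record is found, which is the fidelity gain the slide describes.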
How can we support the task?
Step 0. Translate to DAML+OIL syntax
Provided by OilEd
Provide DAML+OIL based definitions of GO concepts – initially in the metabolism area
DAML+OIL definitions for metabolism concepts
heparin biosynthesis
  class heparin biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass heparin
  (acts_on is unique)
  Paraphrase: biosynthesis which acts solely on heparin
glycosaminoglycan biosynthesis
  class glycosaminoglycan biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass glycosaminoglycan
Feature 4:
Ontologies provide a formal computer interpretable concept definition.
A chemical ontology
Initially used MeSH to create a DAML+OIL ontology from a subset of the chemical taxonomy (using UMLS tools/API)
Provides the following information
carbohydrates
  [i] polysaccharides
    [i] glycosaminoglycans
      [i] heparin
Reason over the combination
Combine GO definitions with chemical ontology using OilEd API
Send to FaCT DL reasoner…
Paraphrased reasoning process
heparin biosynthesis
  class heparin biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass heparin
glycosaminoglycan biosynthesis
  class glycosaminoglycan biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass glycosaminoglycan
[Figure: the two definitions joined by an is-a arrow]
Inferring a new is-a link
heparin biosynthesis
  class heparin biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass heparin
glycosaminoglycan biosynthesis
  class glycosaminoglycan biosynthesis defined
    subClassOf biosynthesis
    restriction onProperty acts_on hasClass glycosaminoglycan
[Figure: the two definitions joined by two is-a arrows]
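The inference can be imitated with a toy structural comparison. This is a sketch only: it is not the tableau algorithm used by FaCT and does not use the OilEd API. It assumes every defined class has the shape "base process plus one acts_on filler", and the chemical names are normalised to the singular.

```python
# Chemical is-a chain from the chemical ontology slide (child -> parent).
chemical_parent = {
    "heparin": "glycosaminoglycan",
    "glycosaminoglycan": "polysaccharide",
    "polysaccharide": "carbohydrate",
    "carbohydrate": None,
}

# Defined classes: (base process, chemical filling the acts_on restriction).
definitions = {
    "heparin biosynthesis": ("biosynthesis", "heparin"),
    "glycosaminoglycan biosynthesis": ("biosynthesis", "glycosaminoglycan"),
}


def chemical_subsumes(general, specific):
    """Walk up the chemical chain from specific, looking for general."""
    while specific is not None:
        if specific == general:
            return True
        specific = chemical_parent[specific]
    return False


def inferred_is_a(defs):
    """Pairs (child, parent) that share a base and whose fillers are related."""
    links = []
    for child, (child_base, child_chem) in defs.items():
        for parent, (parent_base, parent_chem) in defs.items():
            if (child != parent and child_base == parent_base
                    and chemical_subsumes(parent_chem, child_chem)):
                links.append((child, parent))
    return links


print(inferred_is_a(definitions))
```

Because heparin is a glycosaminoglycan in the chemical ontology, a biosynthesis acting on heparin is classified under the biosynthesis acting on glycosaminoglycan, which is the new link reported to the GO editors.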
Feature 5:
Ontologies can become a dynamic service with reasoning support.
Output
OilEd API reports additional inferred is-a relationships, e.g.
  heparin biosynthesis has new is-a parent glycosaminoglycan biosynthesis
Sanitised version sent to GO editorial team for comment.
They (Jane Lomax) make changes to GO if appropriate and send back queries.
Results
Carbohydrate metabolism
  22 additional is-a links, 17 of which are now in GO
Amino acid metabolism
  Further 17 additional is-a links now in GO
Currently preparing results for metabolism as a whole
Where next with GONG?
Moving from proof of concept requires dedicated software tools to support the process.
  Authoring/curation of DAML+OIL definitions
  Tracking GO as it evolves
  Tracking suggested changes and response to changes.
myGrid & high level ontologies
myGrid: Personalised extensible environments for data-intensive in silico experiments in biology
Higher level services: workflow, databases, knowledge management, provenance…
Bioinformatics services are published as Web services (and soon Grid Services)
http://www.ebi.ac.uk/collab/mygrid/service0/axis/index.html
Ontologies for Service Discovery
Find appropriate type of services
  sequence alignment
Find appropriate instances of that service
  BLAST (an algorithm for sequence alignment), as delivered by NCBI
Assist in forming an appropriate assembly of discovered services.
Find, select and execute instances of services while the workflow is being enacted.
Knowledge in the head of expert bioinformatician
An in silico experiment as a workflow
[Figure: workflow diagram with nodes labelled Fetch, WF, Similar sequences, Structure modelling, Fetch, View, RASMOL, Protein name]
Four-tiered service descriptions
1. Class of service:
  • a protein sequence alignment, a protein sequence database.
2. Specific example of an abstract service:
  • BLAST, SWISS-PROT.
3. Instance service description of a specific service:
  • BLAST, SWISS-PROT as offered by the EBI.
4. Invoked instance service description:
  • BLAST as offered by the EBI on a particular date, with particular parameters when a service was actually enacted.
[Figure labels: Domain “semantic”; Business “operational”]
Service description phrases
Build up a phrase describing classes of service functionality.
Building blocks for phrase come from a suite of ontologies
Template for the description based on DAML-S specialised for bioinformatics.
Use reasoning to maintain a classification of services
[Figure: suite of ontologies: Bioinformatics ontology, Web service ontology, Task ontology, Publishing ontology, Informatics ontology, Molecular biology ontology, Organisation ontology, Upper level ontology]
Arrow legend:
  Specialises. All concepts are subclassed from those in the more general ontology.
  Contributes concepts to form definitions.
[Figure labels: parameters: input, output, precondition, effect; performs_task; uses-resource; is_function_of]
class-def defined BLAST-n_service_operation
  subclass-of atomic_service_operation
  has_Class performs_task (aligning
    has_Class has_feature local
    has_Class has_feature pairwise)
  has_Class produces_result (report
    has_Class is_report_of sequence_alignment)
  has_Class uses_resource (database
    has_Class contains (data
      has_Class encodes (sequence
        has_Class is_sequence_of nucleic_acid_molecule)))
  has_Class requires_input (data
    has_Class encodes (sequence
      has_Class is_sequence_of nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)
class-def defined pairwise_sequence_alignment_service
  subclass-of atomic_service_operation
  has_Class performs_task (aligning
    has_Class has_feature local
    has_Class has_feature pairwise)
  has_Class produces_result (report
    has_Class is_report_of sequence_alignment)
  has_Class uses_resource (database
    has_Class contains (data
      has_Class encodes (sequence
        has_Class is_sequence_of nucleic_acid_molecule)))
  has_Class requires_input (data
    has_Class encodes (sequence
      has_Class is_sequence_of nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)
Description driven classification
[Figure: architecture diagram with components Personal Repository (Meta Data), Ontology Server, Workflow Repository (Meta Data), Service Type Directory, Repository Client, Ontology Client, Workflow Client, Portal, Workflow enactment, Bioinformatics services, Service instance directory, DAML+OIL Reasoner (FaCT), Matcher and Ranker]
Client framework myGrid.version0
1. User selects values from a drop down list to create a property based description of their required service. Values are constrained to provide only sensible alternatives.
2. Once the user has entered a partial description they submit it for matching. The results are displayed below.
3. The user adds the operation to the growing workflow.
4. The workflow specification is complete and ready to match against those in the workflow repository.
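Matching a partial description against advertised services can be sketched as a property-by-property comparison. This is a deliberate simplification: the slides describe matching backed by a DAML+OIL reasoner, whereas the sketch below uses exact value equality. Property names follow the class definitions in this talk; the flattened values, the dictionary layout and the second service are illustrative assumptions.

```python
# Advertised services, each described as property -> value.
# The values are flattened stand-ins for the nested class expressions.
services = {
    "BLAST-n_service_operation": {
        "performs_task": "aligning",
        "requires_input": "nucleic_acid_sequence_data",
        "produces_result": "sequence_alignment_report",
        "is_function_of": "BLAST_application",
    },
    # Hypothetical service, included only to give the matcher something to reject.
    "hypothetical_structure_viewer": {
        "performs_task": "viewing",
        "requires_input": "protein_structure_data",
    },
}


def match(partial, advertised):
    """Return services whose description agrees with every property in partial."""
    return sorted(
        name for name, description in advertised.items()
        if all(description.get(prop) == value for prop, value in partial.items())
    )


print(match({"performs_task": "aligning"}, services))
```

Each property the user adds narrows the result set; an empty description matches every advertised service.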
Ontology grounds out
Link ontology to WSDL and UDDI
[Figure: WSDL elements (types, messages, portType, operation, binding, service; XML Schema) linked to UDDI elements (businessEntity, businessService, bindingTemplate, tModel)]
Other uses of ontology
Labelling data items in databases
Semantic typing for controlling inputs and outputs
Use by distributed query processing
Ontology/registry issues
How best to integrate with existing registry technology such as UDDI?
How do ontological descriptions of data relate to type systems?
How big should the phrases become within the ontology?
Who builds these descriptions?
Summary
Different ontologies can have a different selection of features tailored to requirements
Form a wide spectrum of resources
Powerful technology available
Harness it for end users
And finally.. predates computers
Linnaeus, 18th century
  Nomenclature/classification of species
  Language independent (Latin)
  Promoted sharing and integration of knowledge about related species
  A community effort: botanists / zoologists
Farr, 19th century
  Nomenclature of disease for consistent cause of death reporting
  Allowed aggregation/integration of data to discover new knowledge about the aetiology of cholera.
  A community effort: surgeons
Links
All myGrid tools & ontology available from: http://www.mygrid.org.uk
GONG site: http://gong.man.ac.uk
Building ontologies site: http://oiled.man.ac.uk/building
Acknowledgements
Manchester metadata team
  Carole Goble, Robert Stevens, Sean Bechhofer, Phil Lord, Alan Rector, Jeremy Rogers, Chris Garwood
myGrid team
GO Consortium
  Esp. Mike Ashburner, Midori Harris, Jane Lomax
Sharing info, sharing meaning
Metadata: data describing the content and meaning of resources and services.
  But everyone must speak the same language…
Terminologies: shared and common vocabularies for search engines, agents, curators, authors and users.
  But everyone must mean the same thing…
Ontologies: shared and common understanding of a domain. Essential for search, exchange and discovery.
[Figure: multiple service providers]
Origin and History
• Humans require words (or at least symbols) to communicate efficiently. The mapping of words to things is only indirectly possible. We do it by creating concepts that refer to things.
• The relation between symbols and things has been described in the form of the meaning triangle:
[Figure: meaning triangle with the symbol “Jaguar” and Concept] [Ogden, Richards, 1923]
So what is an ontology?
[Figure: spectrum of ontology kinds: Catalog/ID, Thesauri, Terms/glossary, Informal is-a, Formal is-a, Formal instance, Frames (properties), General logical constraints, Value restrictions, Disjointness, inverse, part-of; examples placed along the spectrum: Gene Ontology, Mouse Anatomy, EcoCyc, PharmGKB, TAMBIS, Arom] [Deborah McGuinness, Stanford]
Human and machine communication
[Figure: human agents (HA1, HA2) exchange symbols such as “JAGUAR” via natural language, and machine agents (MA1, MA2) exchange symbols via protocols. Each agent relates Symbol, Concept and Things through the meaning triangle, using internal models or formal models. The agents commit to an ontology description with formal semantics for a specific domain, e.g. animals.] [Maedche et al., 2002]
Important life science ontologies
SWISS-PROT Keywords: the SWISS-PROT keyword list now has definitions (in natural language) associated with each keyword.
Edinburgh Anatomies: whole or partial anatomy ontologies for adult and developmental stages for several model organisms.
Ingenuity: the company has a large knowledge base of experimental findings in biology. Currently, their ontology is not viewable.
MGED: the ontology working group aims to develop ontologies for describing gene expression experiments and data.
Semiotes Regulatory Networks Model
PharmGKB: Pharmacogenetics Knowledge Base.
TAMBIS ontology (TaO): an ontology of bioinformatics and molecular biology.
RiboWeb: an ontology describing ribosomal components, associated data and computations for processing those data.
EcoCyc: an ontology describing the genes, gene product function, metabolism and regulation within E. coli.
Molecular Biology Ontology (MBO): a general, reference ontology for molecular biology.
Gene Ontology (GO): an ontology describing the function, the process and cellular location of gene products from eukaryotes.
Mouse Genome Informatics GO browser
Mouse Anatomical Dictionary
ImMunoGeneTics (IMGT) Ontology
STAR/mmCIF: macromolecule structure ontology.
Signal Transduction Knowledge Environment (STKE)
GENAROM: ontology of gene product interactions.
GeneX: ontologies for comparing gene expression across species.
EpoDB Controlled Vocabulary: function, cell and tissue type, developmental stage and experimental type.
CBIL Controlled Vocabulary: terms for human anatomy.
Japan Bio-Ontology Committee: including Signal Transduction Ontology
FlyBase: controlled vocabulary for fly anatomy used for describing phenotypes.