Ontologies and Biomedicine What is the "right" amount of semantics?


Page 1: Ontologies and Biomedicine

Ontologies and Biomedicine

What is the "right" amount of semantics?

Page 2: Ontologies and Biomedicine

Ontologies and Biomedicine

The “right” amount of semantics depends on what you want to do with it

Page 3: Ontologies and Biomedicine

Ontologies and Biomedicine

Research is based on inference from what is known, and therefore it demands rigor

Page 4: Ontologies and Biomedicine

Ontologies and Biomedicine

Without rigor, we won’t know what we know, or where to find it, or what to infer from it.

Page 5: Ontologies and Biomedicine

Semantic Spectrum

Natural Language
• Highly expressive
• Ambiguous

Computable Ontology
• Less expressive
• Logical and precise

Page 6: Ontologies and Biomedicine

Ad hoc tagging approach

Let the users define words and phrases
• Forgoes the use of an expertly curated vocabulary or ontology.

Fast and distributed approach yields a vast amount of content
• No recruitment and training of people to maintain the ontology is required.
• No recruitment and training of annotators to interpret the material is required.

Page 7: Ontologies and Biomedicine
Page 8: Ontologies and Biomedicine

Ad hoc tagging approach

The tagging approach places the burden of interpretation and classification on every end user
• Overall this is more costly and wasteful
• It is inappropriate in the scientific domain

The problem is not about people communicating. It is about computers and HCI.

Page 9: Ontologies and Biomedicine

Build, apply, and use

An ontology captures current scientific theory that seeks to explain all of the existing evidence and is used to draw inferences and make predictions
• Acts like a review
• Requires curators who are experts in both the science and logic

Ontology application is the real bottleneck
• But overall it is less costly and wasteful

Page 10: Ontologies and Biomedicine

1. Univocity: Terms should have the same meanings on every occasion of use

2. Positivity: Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine classes.

3. Objectivity: Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.

4. Single Inheritance: No class in a classification hierarchy should have more than one is_a parent on the immediate higher level

5. Intelligible Definitions: The terms used in a definition should be simpler (more intelligible) than the term to be defined

6. Reality Based: When building or maintaining an ontology, always think carefully about how classes relate to instances in reality

7. Distinguish Classes and Instances: What is necessarily true for instances is not necessarily true for classes
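Some of these principles can be checked mechanically. The following is a minimal sketch, assuming a toy term list in which each term is a name plus its is_a parents; the `lint` function, the term names, and the crude string tests are illustrative assumptions, not part of any GO tooling. It covers only positivity, objectivity, and single inheritance, since the other principles need human judgment.

```python
import re

# Crude textual signals for two of the principles (assumptions for this sketch).
NEGATIVE_PREFIX = re.compile(r"^non-", re.IGNORECASE)
PLACEHOLDER_WORDS = {"unknown", "unclassified", "unlocalized"}


def lint(terms):
    """Return (term name, violated principle) pairs for a list of
    (name, is_a_parents) tuples."""
    problems = []
    for name, parents in terms:
        # Positivity: names like 'non-membrane' do not designate genuine classes.
        if NEGATIVE_PREFIX.match(name):
            problems.append((name, "positivity"))
        # Objectivity: 'unknown', 'unclassified', 'unlocalized' are not natural kinds.
        if any(word in PLACEHOLDER_WORDS for word in name.lower().split()):
            problems.append((name, "objectivity"))
        # Single inheritance: at most one is_a parent.
        if len(parents) > 1:
            problems.append((name, "single inheritance"))
    return problems


# Hypothetical terms, chosen only to trigger each check once.
terms = [
    ("membrane", ["cellular component"]),
    ("non-membrane", ["cellular component"]),
    ("unlocalized protein", ["protein"]),
    ("neuroblast", ["stem cell", "neural cell"]),
]
problems = lint(terms)
```

A check like this only flags candidates for a curator to review; it cannot decide whether a term is well defined.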

Page 11: Ontologies and Biomedicine

Annotation bottleneck

An active lab can easily generate 10-100GB of data per month, and it is very difficult to manage data on this scale.

Even the best analytic schemes will be for naught if we cannot find our data.

And the data is complex.

Yet the annotation effort required will be utterly wasted if it cannot be reliably computed upon.

Page 12: Ontologies and Biomedicine
Page 13: Ontologies and Biomedicine

Implies numerous “light” ontologies

• 3-dimensions
• Protein function
• Cell type
• Tissue
• Stage
• Cellular component
• Organism
• And more…

Page 14: Ontologies and Biomedicine

Or it implies a single complex one

• 3-dimensions
• Protein function
• Cell type
• Tissue
• Stage
• Cellular anatomy
• Organism
• And more…

Plus all of the relations between these elements

Page 15: Ontologies and Biomedicine

Practicalities

1. The ontology should be robust or the annotator’s time is wasted

2. Research won’t wait: data must be annotated at the rate at which it is generated

3. Complex ontologies are much more difficult to get right than lighter ones

4. Light ontologies are easier to build and maintain

5. Complex ontologies can be built from lighter ones
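Point 5 can be illustrated with a minimal sketch, assuming two "light" ontologies stored as simple child-to-parent maps. The `compose` function, the `cell_type` relation name, and the "cellular process" root are assumptions made for this example; the cell-type chain mirrors the neuroblast example used later in the deck.

```python
# Two "light" ontologies, each a child -> is_a parent map (None marks a root).
CELL_TYPES = {
    "cell": None,
    "stem cell": "cell",
    "neuronal stem cell": "stem cell",
    "neuroblast": "neuronal stem cell",
}
PROCESSES = {
    "cellular process": None,  # illustrative root for this sketch
    "cell proliferation": "cellular process",
}


def compose(cell_type, process):
    """Build a composite term from one term of each light ontology.

    Both inputs must already exist in their own ontology, so the composite
    inherits its meaning from terms that are already defined.
    """
    if cell_type not in CELL_TYPES:
        raise KeyError(f"unknown cell type: {cell_type}")
    if process not in PROCESSES:
        raise KeyError(f"unknown process: {process}")
    # "cell proliferation" + "neuroblast" -> "neuroblast proliferation"
    name = process.replace("cell", cell_type, 1)
    return {"name": name, "is_a": process, "cell_type": cell_type}


term = compose("neuroblast", "cell proliferation")
```

The composite term carries explicit links back to both source ontologies, so a change in either one can be traced to the terms built from it.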

Page 16: Ontologies and Biomedicine

A “successful” case study

Gene Ontology

Page 17: Ontologies and Biomedicine

The aims of GO

1. To develop comprehensive shared vocabularies of terms describing aspects of molecular biology.

2. To describe the gene products held in each contributing model organism database.

3. To provide a scientific resource for access to the vocabularies, the annotations, and associated data.

4. To provide a software resource to assist in curation of GO term assignments to biological objects.

Page 18: Ontologies and Biomedicine

The primary strength of the GO

The GO covers three domains of biology
• Molecular Function
• Biological Process
• Cellular Component

These are “precisely defined” axes of classification

Page 19: Ontologies and Biomedicine

The breakdown of work

Task 1
• Building the ontology: a computable description of the biological world

Task 2
• Describing your gene product (annotation)
  • Biological process
  • Molecular function
  • Cellular localization
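The split between the two tasks can be sketched as follows, assuming the ontology is a fixed lookup table and an annotation is a link from a gene product to a term in it. The three GO identifiers and names appear elsewhere in this deck; the `Annotation` record, the helper functions, and the gene product name `geneX` are hypothetical.

```python
from collections import namedtuple

# Task 1 (the ontology): a controlled set of terms, one per GO axis here.
GO_TERMS = {
    "GO:0006810": ("transport", "biological_process"),
    "GO:0005524": ("ATP binding", "molecular_function"),
    "GO:0019012": ("virion", "cellular_component"),
}

# Task 2 (annotation): a gene product linked to a term from the ontology.
Annotation = namedtuple("Annotation", "gene_product go_id aspect")


def annotate(gene_product, go_id):
    """Create an annotation; raises KeyError for ids outside the vocabulary."""
    _name, aspect = GO_TERMS[go_id]
    return Annotation(gene_product, go_id, aspect)


def by_aspect(annotations):
    """Group annotated GO ids by the axis they belong to."""
    grouped = {}
    for a in annotations:
        grouped.setdefault(a.aspect, []).append(a.go_id)
    return grouped


# "geneX" is a made-up gene product used only for illustration.
annotations = [annotate("geneX", go_id) for go_id in GO_TERMS]
grouped = by_aspect(annotations)
```

Because annotators can only pick identifiers that already exist, free-text variation never enters the annotation data.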

Page 20: Ontologies and Biomedicine

The early key decisions

The vocabulary itself requires a serious and ongoing effort.

• Carefully define every concept

Initially keep things as simple as possible and only use a minimally sufficient data representation.

Focus initially on molecular aspects that are shared between many organisms.

Page 21: Ontologies and Biomedicine

GO databases: distributed and centralized

Support cross-database queries
• By having a mutual understanding of the definition and meaning of any word used to describe a gene product

Provide database access to a common repository of annotations
• By submitting a summary of gene products that have been annotated
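A cross-database query becomes a simple lookup once every database uses the same term identifiers. The sketch below assumes two in-memory databases; the database contents, gene names, and the `query` function are hypothetical and do not reflect how any GO database is actually implemented.

```python
# Two hypothetical model organism databases: gene product -> set of GO ids.
FLY_DB = {
    "fly_gene_1": {"GO:0006810"},
    "fly_gene_2": {"GO:0003677"},
}
MOUSE_DB = {
    "mouse_gene_1": {"GO:0006810", "GO:0005524"},
}


def query(go_id, **databases):
    """Return, per database, the gene products annotated with go_id."""
    return {
        db_name: sorted(gene for gene, ids in records.items() if go_id in ids)
        for db_name, records in databases.items()
    }


# GO:0006810 is 'transport'; the same id means the same thing in both databases.
hits = query("GO:0006810", fly=FLY_DB, mouse=MOUSE_DB)
```

No per-database translation layer is needed, because the shared vocabulary already fixes what each identifier means.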

Page 22: Ontologies and Biomedicine

[Diagram labels: GO CVS, FTP, Anonymous CVS, GO data, HTTPD, Scripts]

Page 23: Ontologies and Biomedicine

[Diagram labels: GO CVS, Many Scripts, GO Database, AmiGO]

Page 24: Ontologies and Biomedicine

GODatabase.org: averages per week

Hits = 77,012
Visits = 14,063
Sites = 6,638

Page 25: Ontologies and Biomedicine
Page 26: Ontologies and Biomedicine

Number of links to a site, as reported by Google

www.geneontology.org    7,240
www.godatabase.org      33
obo.sourceforge.net     10
song.sourceforge.net    6
genome.ucsc.edu         3,670
www.ncbi.nih.gov        12,000
www.ebi.ac.uk           14,900
sciencemag.org          14,900
www.ncbi.nlm.nih.gov    34,500

Page 27: Ontologies and Biomedicine

Most Common GOIDs accessed via AmiGO

72020  GO:0006810  transport
56862  GO:0005524  ATP binding
53622  GO:0019012  virion
47773  GO:0006955  immune response
46943  GO:0003677  DNA binding
41474  GO:0006508  proteolysis and peptidolysis
41126  GO:0006355  regulation of transcription, DNA-dependent
40427  GO:0004872  receptor activity
34943  GO:0005215  transporter activity
30890  GO:0007186  G-protein coupled receptor protein signaling pathway
30001  GO:0003700  transcription factor activity
28127  GO:0006118  electron transport
26636  GO:0005509  calcium ion binding
24007  GO:0006968  cellular defense response
21250  GO:0016486  peptide hormone processing
20440  GO:0008152  metabolism
19742  GO:0005515  protein binding
19316  GO:0007155  cell adhesion
18254  GO:0005198  structural molecule activity

Page 28: Ontologies and Biomedicine

Taxa covered by the GO (some)

Arabidopsis: TAIR, taxon:3702
Caenorhabditis: WormBase, taxon:6239
Candida albicans: CGD, taxon:5476
Danio: ZFIN, taxon:7955
Dictyostelium: DictyBase, taxon:5782
Drosophila: FlyBase, taxon:7227
Mus: MGI, taxon:10090
Oryza sativa: Gramene, taxon:39947 = Oryza sativa (japonica cultivar-group)
Rattus: RGD, taxon:10116
Saccharomyces: SGD, taxon:4932
Leishmania major: GeneDB, taxon:5664
Plasmodium falciparum: GeneDB, taxon:5833
Schizosaccharomyces pombe: GeneDB, taxon:4896
Trypanosoma brucei: GeneDB, taxon:185431
Bacillus anthracis: TIGR, taxon:198094
Coxiella burnetii: TIGR, taxon:227377
Geobacter sulfurreducens: TIGR, taxon:243231
Listeria monocytogenes: TIGR, taxon:265669
Methylococcus capsulatus: TIGR, taxon:243233
Pseudomonas syringae: TIGR, taxon:223283
Shewanella oneidensis: TIGR, taxon:211586
Vibrio cholerae: TIGR, taxon:686

Page 29: Ontologies and Biomedicine

NIH-funded experimental research that uses the GO

• National Institute on Aging (NIA)
• National Institute of Allergy and Infectious Diseases (NIAID)
• National Cancer Institute (NCI)
• National Institute on Drug Abuse (NIDA)
• National Institute on Deafness and Other Communication Disorders (NIDCD)
• National Institute of Dental & Craniofacial Research (NIDCR)
• National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)
• National Institute of Biomedical Imaging and Bioengineering (NIBIB)
• National Institute of Environmental Health Sciences (NIEHS)
• National Eye Institute (NEI)
• National Institute of General Medical Sciences (NIGMS)
• National Institute of Child Health and Human Development (NICHD)
• National Human Genome Research Institute (NHGRI)
• National Heart, Lung and Blood Institute (NHLBI)
• National Library of Medicine (NLM)
• National Institute of Neurological Disorders and Stroke (NINDS)
• National Center for Research Resources (NCRR)

Page 30: Ontologies and Biomedicine

Other funded experimental projects that use the GO

• Public Health Service
• Walter Reed Army Medical Center
• United States Department of Agriculture
• Department of Defense
• USAID
• National Science Foundation

Page 31: Ontologies and Biomedicine

A “successful” case study

There are still challenges to meet

Page 32: Ontologies and Biomedicine

Building upon (sharing) light, axiomatic ontologies eliminates:

1. Spelling mistakes or differences: oesinophil vs. eosinophil

2. Differences in synonyms, names or naming conventions: Spermatazoon, sperm cell, spermatozoid, sperm

3. Differences in definitions: "pericardial cell develops_from mesodermal cell" vs. "Nothing develops_from pericardial cell"

4. Inconsistent structure
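How a shared vocabulary removes spelling and synonym differences can be sketched as a lookup from accepted labels to one identifier. The `EX:` identifiers, the synonym table, and the `resolve` function below are hypothetical placeholders, not real Cell Ontology entries.

```python
# Hypothetical shared vocabulary: one identifier per concept, many accepted labels.
SYNONYMS = {
    "EX:0001": {"eosinophil"},
    "EX:0002": {"sperm cell", "spermatozoid", "sperm"},
}

# Invert the table once: label -> identifier.
LOOKUP = {
    label: term_id
    for term_id, labels in SYNONYMS.items()
    for label in labels
}


def resolve(label):
    """Map a free-text label to its shared identifier, or None if unknown."""
    return LOOKUP.get(label.strip().lower())


same_concept = resolve("Sperm cell") == resolve("spermatozoid")
misspelled = resolve("oesinophil")  # not an accepted label, so it is rejected
```

All accepted synonyms collapse to one identifier, while a misspelling fails to resolve instead of silently becoming a new tag.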

Page 33: Ontologies and Biomedicine

Inconsistent structure

[Diagram: GO terms (hemocyte differentiation (sensu Arthropoda), lamellocyte differentiation, plasmatocyte differentiation) shown against CL terms (hemocyte, lamellocyte, plasmocyte)]

Page 34: Ontologies and Biomedicine

Finer granularity in the GO

GO
• immune cell activation, migration, chemotaxis…
• “erythrocyte differentiation is_a myeloid blood cell differentiation”

CL
• no such term: “immune cell”
• no such term: “myeloid blood cell”
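Gaps of this kind can be detected automatically when process names embed a cell-type name. The sketch below assumes that naming pattern holds; the term sets are small excerpts taken from these slides, and the `cell_type_of` helper and the suffix list are illustrative assumptions.

```python
# Process words that follow a cell-type name in a GO process term.
PROCESS_SUFFIXES = ("activation", "migration", "chemotaxis", "differentiation")

# Small excerpts from the slides, not the full ontologies.
CL_TERMS = {"hemocyte", "lamellocyte", "plasmocyte"}
GO_PROCESS_TERMS = [
    "immune cell activation",
    "myeloid blood cell differentiation",
    "lamellocyte differentiation",
]


def cell_type_of(go_name):
    """Strip a trailing process word to recover the embedded cell-type name."""
    for suffix in PROCESS_SUFFIXES:
        if go_name.endswith(" " + suffix):
            return go_name[: -len(suffix) - 1]
    return None


# Cell types that GO refers to but the cell ontology does not define.
referenced = {cell_type_of(name) for name in GO_PROCESS_TERMS} - {None}
missing_in_cl = sorted(referenced - CL_TERMS)
```

String matching like this is fragile; explicit cross-references between the two ontologies would make the same check reliable.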

Page 35: Ontologies and Biomedicine

Coarser granularity in the GO

GO
• neuroblast proliferation is_a cell proliferation

CL
• neuroblast is_a neuronal stem cell is_a stem cell is_a cell
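The difference in granularity shows up when the is_a chains are walked side by side. This is a minimal sketch using only the relations quoted on this slide; the `ancestors` helper and the derived "expected" process names are assumptions for illustration.

```python
# is_a relations quoted on the slide, as child -> parent maps.
CL_IS_A = {
    "neuroblast": "neuronal stem cell",
    "neuronal stem cell": "stem cell",
    "stem cell": "cell",
}
GO_IS_A = {
    "neuroblast proliferation": "cell proliferation",
}


def ancestors(term, is_a):
    """Follow is_a links upward and return the chain of ancestors."""
    chain = []
    while term in is_a:
        term = is_a[term]
        chain.append(term)
    return chain


cl_chain = ancestors("neuroblast", CL_IS_A)
go_chain = ancestors("neuroblast proliferation", GO_IS_A)

# Process terms GO would need in order to mirror the CL hierarchy step by step.
go_known = set(GO_IS_A) | set(GO_IS_A.values())
expected = [f"{cell} proliferation" for cell in cl_chain]
missing_in_go = [name for name in expected if name not in go_known]
```

Here CL has three ancestors for neuroblast while GO has one for neuroblast proliferation, so the two intermediate levels have no GO counterpart.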

Page 36: Ontologies and Biomedicine

Even a “light” ontology like the GO is difficult enough

A methodology that enforces clear, coherent definitions:

• Promotes quality assurance
  • Intent is not hard-coded into software
  • Meaning of relationships is defined, not inferred

• Guarantees automatic reasoning across ontologies and across data at different granularities

Consequences of inconsistencies
• Hard to synchronize manually
• Inconsistent user-search results

Page 37: Ontologies and Biomedicine

Meeting the goal: Drawing inferences

[Diagram: proteins A, B, C, D compared across species, with labels Human, Xenopus (toad), Drosophila, and yeast. Human: B SP:1234, C SP:8723, D SP:19345?; toad: B SP:48392; yeast: B SP:48291, C SP:38921. Links are marked as direct evidence or indirect evidence and cite PMID:5555, PMID:4444, PMID:8976, PMID:9550, PMID:3924.]
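One possible reading of this slide is that an annotation backed by direct evidence in one species can be carried over, as indirect evidence, to the corresponding proteins in other species. The sketch below assumes that reading. The SP identifiers echo the slide's labels for protein B, but the ortholog pairing, the GO term, and the choice of PMID:5555 as the source are all assumptions made for illustration.

```python
# Assumed ortholog table: human B -> toad B and yeast B (pairing is hypothetical).
ORTHOLOGS = {
    "SP:1234": ["SP:48392", "SP:48291"],
}

# Assumed direct annotation: (protein, GO id, literature source).
DIRECT = [
    ("SP:1234", "GO:0006810", "PMID:5555"),
]


def infer(direct, orthologs):
    """Propagate each direct annotation to orthologs as indirect evidence."""
    inferred = []
    for protein, go_id, source in direct:
        for other in orthologs.get(protein, []):
            inferred.append({
                "protein": other,
                "go_id": go_id,
                "evidence": "indirect",
                "derived_from": (protein, source),
            })
    return inferred


inferred = infer(DIRECT, ORTHOLOGS)
```

Keeping the `derived_from` link means every inferred annotation can be traced back to the experiment that supports it.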

Page 38: Ontologies and Biomedicine

Thank you

Chris Mungall

Sima Misra

NCBO

Reactome

GO

SO