
http://ncor.us 1

Ontologies in Biomedicine:

The Good, The Bad and The Ugly

Barry Smith

http://ontology.buffalo.edu/smith

http://ncor.us 2

The Good

Foundational Model of Anatomy (FMA)

Pro

• Very clear statement of scope: structural human anatomy, at all levels of granularity, from the whole organism to the biological macromolecule

• Powerful treatment of definitions, from which the entire FMA hierarchy is generated; can serve as basis for formal reasoning

Con

• Some unfortunate artifacts in the ontology deriving from its specific computer representation (Protégé)

http://ncor.us 3

Intermediate

GALEN

Pro

• Allows formal representation of clinical information

• Allows multiple views of relevant detail as needed

• Uses powerful Description Logic (DL)-based formal structure

Con

• Remains only partially developed

• Contains errors which DLs did not prevent: Vomitus contains carrot

http://ncor.us 4

Intermediate

The Gene Ontology

Con

Poor formal architecture

Full of errors

menopause part_of death

Poor support for automatic reasoning and error-checking

Poor treatment of definitions

Not trans-granular

No relation to time or instances

http://ncor.us 5

The Gene Ontology

Pro

Open Source

Cross-Species

... has recognized the need for reform, including explicit representation of granular levels

http://ncor.us 6

Problem of Circularity

GO:0042270:

Protection from natural killer cell mediated cytolysis

Definition: The process of protecting a cell from cytolysis by natural killer cells.

http://ncor.us 7

GO:0019836 hemolysis

Definition: The processes that cause hemolysis

X =def. the Y of X

this is worse than circular
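The simplest kind of circularity can be checked mechanically. The sketch below is an illustration only and is not part of any GO tooling: `tokens` and `is_circular` are hypothetical helpers that flag a definition when the full term being defined reappears in it word for word. It catches the hemolysis case; it does not catch GO:0042270, where the term is paraphrased rather than repeated.

```python
import re

def tokens(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def is_circular(term, definition):
    """Return True if the whole term reappears, word for word, in its definition."""
    term_tokens = tokens(term)
    def_tokens = tokens(definition)
    n = len(term_tokens)
    if n == 0:
        return False
    # Slide a window of the term's length across the definition.
    return any(def_tokens[i:i + n] == term_tokens
               for i in range(len(def_tokens) - n + 1))

# Term and definition quoted from the slide above.
print(is_circular("hemolysis", "The processes that cause hemolysis"))
```

A check of this kind only finds literal repetition; paraphrased circularity still needs human review.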

http://ncor.us 8

The Bad

Reactome

Pro

• Rich catalogue of biological processes

Con

• Incoherent treatment of categories:

ReferentEntity (embracing e.g. small molecules) is a sibling of PhysicalEntity (embracing complexes, molecules, ions and particles). Similarly CatalystActivity is a sibling of Event.

http://ncor.us 9

The Bad

National Cancer Institute Thesaurus

Pro

• Open source; ambitiously broad coverage; DL-based

Con

• Poor realization of DL formalism

• Full of mistakes (many inherited from its UMLS sources):

– three disjoint classes of plants: Vascular Plant, Non-vascular Plant, Other Plant

– three disjoint kinds of cells: Cell, Normal Cell, Abnormal Cell

– Normal Cell is_a Microanatomy

See http://ontology.buffalo.edu/medo/NCIT_Smith.html

http://ncor.us 10

National Cancer Institute Thesaurus

Duratec, Lactobutyrin and Stilbene Aldehyde classified as: Unclassified Drugs and Chemicals

Pro

NCIT, too, has recognized the need for reform

(NCIT is part of the OBO library)

http://ncor.us 11

The Ugly

UMLS Semantic Network

Pros

Broad coverage; no multiple inheritance

Cons

Incoherent use of ‘conceptual entities’

(e.g. the digestive system as a conceptual part of the organism)

Full of errors

http://ncor.us 12

UMLS Semantic Network

Edges in the graph represent merely “possible significant relations”:

– Bacterium causes Experimental Model of Disease

– Experimental Model of Disease affects Fungus

– Experimental Model of Disease is_a Pathologic Function

http://ncor.us 13

UMLS Semantic Network

Unclear what the nodes of the graph are:

– Drug Delivery Device contains Clinical Drug

– Drug Delivery Device narrower_in_meaning_than Manufactured Object

The use-mention confusion:

“Swimming is healthy and has 8 letters”

http://ncor.us 14

The Ugly

Clinical Terms Version 2 (The Read Codes)

Classifies chemicals into:

chemicals whose name begins with ‘A’,

chemicals whose name begins with ‘B’,

chemicals whose name begins with ‘C’, ...

http://ncor.us 15

The Astonishingly (Criminally?) Ugly

Health Level 7

• HL7 is a UML-based standard for exchange of information between clinical information systems

• has proved very crumbly as a standard

• The HL7 Reference Information Model (RIM) is supposed to overcome this problem by defining the universe of healthcare data in a rigorous way

http://ncor.us 16

HL7-RIM

Animal

Definition: A subtype of Living Subject representing any animal-of-interest to the Personnel Management domain.

Person

Definition: A subtype of Living Subject representing single human being [sic] who, in the context of the Personnel Management domain, must also be uniquely identifiable through one or more legal documents.

LivingSubject

Definition: A subtype of Entity representing an organism or complex animal, alive or not.

http://ncor.us 17

HL7 RIM: The Problem of Circularity

Person = Person with documents

has the form: ‘An A is an A which is B’

Definitions of this form:

– are useless in practical terms, since neither we nor the machine can use them to find out what ‘A’ means

– incorporate a vicious infinite regress

– have the effect of making it impossible to refer to As which are not Bs, for example to an undocumented person

http://ncor.us 18

HL7 Logically Incoherent

act = the record of an act

This has the form: An X is the Y of an X

again worse than circular

http://ncor.us 19

HL7-RIM: Logically Contradictory Definitions

Definition of Act: An Act is an action of interest that has happened, can happen, is happening, is intended to happen, or is requested/demanded to happen.

Definition of Act: An Act is the record of something that is being done, has been done, can be done, or is intended or requested to be done.

http://ncor.us 20

HL7 RIM Ontologically Incoherent

The truth about the real world is constructed through a combination and arbitration of attributed statements ...

As such, there is no distinction between an activity and its documentation.

http://ncor.us 21

HL7 Incredibly Successful

• embraced as US federal standard;

• central part of $15 billion program to integrate all UK hospital information systems

• made mandatory by Canada Health Infoway

• adopted by Oracle as basis for its EHR support programs

http://ncor.us 22

HL7 Merchandizing

http://ncor.us 23

From molecules to diseases

A good ontology should enable us to organize our information resources in such a way that we can bridge the granularity gap between genomics and proteomics data and phenotype (clinical, pharmacological, patient-centered) data

http://ncor.us 24

good ontologies require:

Coherent upper level taxonomy distinguishing

• continuants (cells, molecules, organisms ...)

• occurrents (events, processes)

• dependent entities (qualities, functions ...)

• independent entities (their bearers)

• universals (types, kinds)

• instances (tokens, individuals)

Coherent relation ontology supporting inference both within and between ontologies.

http://ncor.us 25

good ontologies require:

Consistent use of terms, supported by logically coherent (non-circular) definitions, in both human-readable and computable formats

http://ncor.us 26

Open Biomedical Ontologies (OBO) Upper Biomedical Ontology (UBO)

root UBO:0000001:top
subclass BFO:continuant:continuant
– subclass BFO:dependent_entity:dependent_entity
• subclass UBO:0000023:quality
– subclass UBO:0000026:phenotype
» subclass UBO:0000025:state
– subclass UBO:0000027:disease
» subclass UBO:0000005:function
– subclass GO:0003674:molecular_function
• subclass BFO:disposition:disposition
– subclass BFO:independent_entity:independent_entity
• subclass UBO:0000002:substance
– subclass UBO:0000019:protein
– subclass GO:0005575:cellular_component
– subclass UBO:0000006:anatomical_entity
» subclass UBO:0000008:gross_anatomical_entity
– subclass UBO:0000007:organism
» subclass UBO:0000015:microbe
» subclass UBO:0000014:plant
» subclass UBO:0000017:animal
• subclass BFO:fiat_part_of_substance:fiat_part_of_substance
• subclass BFO:boundary_of_substance:boundary_of_substance
• subclass BFO:aggregate_of_substances:aggregate_of_substances
subclass BFO:occurrent:occurrent
– subclass BFO:dependent_occurrent:dependent_occurrent
• subclass UBO:0000004:process
– subclass GO:0008150:biological_process
• subclass BFO:fiat_part_of_process:fiat_part_of_process
– subclass UBO:0000029:life_cycle_stage
• subclass BFO:aggregate_of_processes:aggregate_of_processes
– subclass EO:0007359:environment ontology
• subclass BFO:temporal_boundary_of_process:temporal_boundary_of_process
– subclass BFO:independent_occurrent:independent_occurrent

http://ncor.us 27

OBO Relation Ontology (RO)

• Clear distinction between universals (classes, kinds, types) and instances (individuals, tokens)

• Precise formal definitions of relations

• Automatic applicability to time-indexed instance-data, e.g. in Electronic Health Record

• Consistency with the Relation Ontology now a criterion for admission to the OBO ontology library

see Genome Biology, Apr. 2006

http://ncor.us 28

Three types of relations

between instances:

Mary’s heart part_of Mary

between an instance and a universal:

Mary instance_of homo sapiens

between universals:

gastrulation part_of embryonic development

http://ncor.us 29

A suite of primitive instance-level relations

identical_to

part_of

located_in

adjacent_to

earlier

derives_from

...

http://ncor.us 30

A suite of defined relations between universals

Foundational: is_a, part_of

Spatial: located_in, contained_in, adjacent_to

Temporal: transformation_of, derives_from, preceded_by

Participation: has_participant, has_agent

http://ncor.us 31

GALEN: Vomitus contains carrot

All portions of vomit contain all portions of carrot

All portions of vomit contain some portion of carrot

Some portions of vomit contain some portion of carrot

Some portions of vomit contain all portions of carrot
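The four readings differ only in how the two universals are quantified, which can be made concrete on instance data. The sketch below uses invented toy data (two portions of vomit, two portions of carrot, one containment fact) and hypothetical helper names; it is meant only to show that the readings come apart, not to model GALEN.

```python
def all_all(xs, ys, rel):
    """Every x stands in rel to every y."""
    return all(rel(x, y) for x in xs for y in ys)

def all_some(xs, ys, rel):
    """Every x stands in rel to at least one y."""
    return all(any(rel(x, y) for y in ys) for x in xs)

def some_some(xs, ys, rel):
    """At least one x stands in rel to at least one y."""
    return any(rel(x, y) for x in xs for y in ys)

def some_all(xs, ys, rel):
    """At least one x stands in rel to every y."""
    return any(all(rel(x, y) for y in ys) for x in xs)

# Hypothetical instance data: only portion v1 contains a portion of carrot.
vomit = ["v1", "v2"]
carrot = ["c1", "c2"]
CONTAINS = {("v1", "c1")}

def contains(x, y):
    return (x, y) in CONTAINS

readings = {
    "all-all": all_all(vomit, carrot, contains),
    "all-some": all_some(vomit, carrot, contains),
    "some-some": some_some(vomit, carrot, contains),
    "some-all": some_all(vomit, carrot, contains),
}
print(readings)
```

On this data only the some-some reading is true, so an assertion written simply as "Vomitus contains carrot" is true or false depending on a quantification the notation leaves unstated.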

http://ncor.us 32

all-some structure

A part_of B =def. given any instance a of A there is some instance b of B such that a part_of b on the instance level

Allows automatic ontology integration via cascading reasoning:

A R1 B

B R2 C

A R3 C
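The cascading pattern above can be sketched as a closure computation over triples. This is a minimal illustration under stated assumptions: the `COMPOSITION` table and the `cascade` function are hypothetical names, and the table lists only a few compositions of is_a and part_of that follow from the all-some reading; it is not the Relation Ontology's own rule set.

```python
# Assumed composition table: (R1, R2) -> R3, meaning A R1 B and B R2 C give A R3 C.
COMPOSITION = {
    ("is_a", "is_a"): "is_a",
    ("part_of", "part_of"): "part_of",
    ("is_a", "part_of"): "part_of",
    ("part_of", "is_a"): "part_of",
}

def cascade(facts):
    """Return all triples derivable from `facts` by repeated composition."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(derived):
            for (b2, r2, c) in list(derived):
                r3 = COMPOSITION.get((r1, r2))
                if b == b2 and r3 and (a, r3, c) not in derived:
                    derived.add((a, r3, c))
                    changed = True
    return derived

# A, B, C are the placeholder universals from the slide; facts may come from different ontologies.
facts = {("A", "is_a", "B"), ("B", "part_of", "C")}
result = cascade(facts)
print(("A", "part_of", "C") in result)
```

Because every relation is defined over instances in the same all-some way, a triple from one ontology can be chained with a triple from another without further translation.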

http://ncor.us 33

adjacent_to

cell wall adjacent_to cytoplasm

intron adjacent_to exon

Golgi apparatus adjacent_to endoplasmic reticulum

periplasm adjacent_to plasma membrane

presynaptic membrane adjacent_to synaptic cleft

http://ncor.us 34

A adjacent_to B

every instance of A stands in the instance-level adjacent_to relation to some instance of B

http://ncor.us 35

adjacent_to as a relation between universals is not symmetric

nucleus adjacent_to cytoplasm

Not: cytoplasm adjacent_to nucleus

seminal vesicle adjacent_to urinary bladder

Not: urinary bladder adjacent_to seminal vesicle
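The asymmetry follows from the all-some reading even when the instance-level relation is symmetric. The sketch below uses invented toy data and a hypothetical helper `holds_for_universals`; the third portion of cytoplasm stands for a cell that has no nucleus.

```python
def holds_for_universals(a_instances, b_instances, related):
    """All-some reading: every instance of A is related to some instance of B."""
    return all(any(related(a, b) for b in b_instances) for a in a_instances)

# Hypothetical instances. p3 is the cytoplasm of a cell without a nucleus.
nuclei = ["n1", "n2"]
cytoplasms = ["p1", "p2", "p3"]

# Instance-level adjacency is stored as unordered pairs, so it is symmetric.
ADJACENT = {frozenset(("n1", "p1")), frozenset(("n2", "p2"))}

def adjacent(x, y):
    return frozenset((x, y)) in ADJACENT

nucleus_to_cytoplasm = holds_for_universals(nuclei, cytoplasms, adjacent)
cytoplasm_to_nucleus = holds_for_universals(cytoplasms, nuclei, adjacent)
print(nucleus_to_cytoplasm, cytoplasm_to_nucleus)
```

Every nucleus here has an adjacent portion of cytoplasm, but p3 has no adjacent nucleus, so the relation holds between the universals in one direction only.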

http://ncor.us 36

The Granularity Gulf

most existing data-sources are of fixed, single granularity

many (all?) clinical phenomena cross granularities

http://ncor.us 37

Main obstacle to integrating genetic and EHR data

No facility for dealing with time and instances (particulars, individuals) in current ontologies

http://ncor.us 38

Key idea

To define ontological relations like

part_of, develops_from

it is not enough to look just at universals / classes / types / ‘concepts’ :

we need also to take account of instances and time

http://ncor.us 39

transformation_of

A transformation_of B

=def. any instance of A was at some earlier time an instance of B
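The definition can be checked directly against time-indexed instance data of the kind an EHR might hold. The sketch below follows the wording of the slide only; the function name, the record layout `(instance, universal, time)` and the sample records are assumptions made for illustration.

```python
def transformation_of(a, b, records):
    """True if every instance of `a` was, at some earlier time, an instance of `b`.

    `records` is an iterable of (instance, universal, time) triples.
    """
    history = {}
    for instance, universal, time in records:
        history.setdefault(instance, {}).setdefault(universal, []).append(time)
    return all(
        any(tb < ta for tb in by_universal.get(b, []))
        for by_universal in history.values()
        for ta in by_universal.get(a, [])
    )

# Hypothetical time-indexed records (years).
records = [
    ("mary", "child", 1990), ("mary", "adult", 2010),
    ("john", "child", 1985), ("john", "adult", 2005),
]

print(transformation_of("adult", "child", records))
print(transformation_of("child", "adult", records))
```

Note that the check is vacuously true when the data contain no instances of A, so in practice it would be paired with a test that A is instantiated at all.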

http://ncor.us 40

transformation_of

[Figure: timeline showing the same instance c at times t and t1, instantiating the universals C and C1]

mature RNA transformation_of pre-RNA

adult transformation_of child

carcinomatous colon transformation_of colon

http://ncor.us 41

transformation_of relations cross both time and granularity

[Figure: the instance c at t and at t1, instantiating the universals C and C1]

http://ncor.us 42

Advantages of the methodology of enforcing commonly accepted coherent definitions

promote quality assurance (better coding)

guarantee automatic reasoning across ontologies and across data at different granularities

yield direct connection to times and instances in the EHR