Biomedical ontology tutorial_atlanta_june2011_part1


How to Build a Biomedical Ontology

Success Stories

The Gene Ontology (GO)

SNOMED, ICD and other controlled vocabularies

Ontology Design Principles

Ontology Applications

Barry Smith

http://ontology.buffalo.edu/smith

Uses of ‘ontology’ in PubMed abstracts

2

3

By far the most successful: GO (Gene Ontology)

4

5

Hierarchical view of GO representing relations between represented types

6

Gene Ontology

$100 million invested in literature and database curation using the Gene Ontology (GO)

based on the idea of annotation

over 11 million annotations relating gene products (proteins) described in the UniProt, Ensembl and other databases to terms in the GO

multiple secondary uses – because the ontology was not built to meet one specific set of requirements

7

GO provides a controlled system of terms for use in annotating (describing, tagging) data

• multi-species, multi-disciplinary, open source

• contributing to the cumulativity of scientific results obtained by distinct research communities

• compare use of kilograms, meters, seconds in formulating experimental results

8

Sample Gene Array Data

9

where in the cell?

what kind of molecular function?

semantic annotation of data

what kind of biological process?

10

natural language labels

to make the data cognitively accessible to human beings

11

compare: legends for maps

12

compare: legends for diagrams

13

ontologies are legends for data

14

compare: legends for maps

15

ontologies are legends for images

16

what lesion?

what brain function?

17

ontologies are legends for databases

[Figure: databases MouseEcotope, GlyProt, DiabetInGene, GluChem; term: sphingolipid transporter activity]

18

annotation using common ontologies yields integration of databases

[Figure: databases MouseEcotope, GlyProt, DiabetInGene, GluChem; term: Holliday junction helicase complex]

19

annotation using common ontologies can support comparison of data

20

annotation with Gene Ontology

supports reusability of data

supports search of data by humans

supports comparison of data

supports aggregation of data

supports reasoning with data by humans and machines
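A minimal sketch of why annotation with a common ontology supports search and aggregation across databases. The database names and the two term labels come from the slides above; the record identifiers and the layout of `annotations` are hypothetical, and terms are plain strings rather than real GO identifiers.

```python
from collections import defaultdict

# Hypothetical annotation records: (database, record id, term label).
annotations = [
    ("MouseEcotope", "rec-1", "sphingolipid transporter activity"),
    ("GlyProt", "rec-7", "sphingolipid transporter activity"),
    ("DiabetInGene", "rec-3", "Holliday junction helicase complex"),
    ("GluChem", "rec-9", "sphingolipid transporter activity"),
]

def index_by_term(records):
    """Group (database, record id) pairs under the term they are annotated with."""
    index = defaultdict(list)
    for database, record_id, term in records:
        index[term].append((database, record_id))
    return index

index = index_by_term(annotations)
hits = index["sphingolipid transporter activity"]
# Databases that can be searched together because they share one term.
databases = sorted({database for database, _ in hits})
print(databases)
```

Because every database uses the same term label, one lookup retrieves records from all of them; with database-specific vocabularies the same query would need a mapping per database.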

21

22

The goal: virtual science

• consistent (non-redundant) annotation

• cumulative (additive) annotation

yielding, by incremental steps, a virtual map of the entirety of reality that is accessible to computational reasoning

23

This goal is realizable if we have a common ontology framework

data is retrievable

data is comparable

data is integratable

only to the degree that it is annotated using a common controlled vocabulary

– compare the role of seconds, meters, kilograms … in unifying science

24

To achieve this end we have to engage in something like philosophy (?)

is this the right way to organize the top level of this portion of the GO?

how does the top level of this ontology relate to the top levels of other, neighboring ontologies?

25

Strategy for doing this

see the world as organized via types/universals/categories which are hierarchically organized

and in relation to which statements can be formulated which are universally true of all instances:

cell membrane part_of cell

26

[Figure: fragment of the Foundational Model of Anatomy Ontology, with is_a and part_of links among the nodes Anatomical Structure, Anatomical Space, Organ, Organ Cavity, Organ Part, Organ Subdivision, Organ Component, Organ Cavity Subdivision, Tissue, Serous Sac, Serous Sac Cavity, Serous Sac Cavity Subdivision, Pleural Sac, Pleural Cavity, Pleura (Wall of Sac), Visceral Pleura, Parietal Pleura, Mediastinal Pleura, Mesothelium of Pleura, Interlobar recess]

Foundational Model of Anatomy Ontology

27

[Figure: tree of species and genera with nodes substance, organism, animal, mammal, cat, siamese, frog; instances shown beneath]

28

29

with thanks to http://dbmotion.com 30

the problem of continuity of care: patients move around

31

synchronic and diachronic problems of semantic interoperability

(across space and across time)

32

how can we link EHR 1 to EHR 2 in a reliable, trustworthy, useful way, which both systems can understand?

[Figure: EHR 1, EHR 2]

33

the ideal solution: WHO International Classification of Diseases

[Figure: EHR 1, ICD, EHR 2]

ICD

PRO:
• De facto US billing standard
• Multilanguage

CON:
• De facto US billing standard (corrupts data)
• No definitions of terms, and so difficult to judge accuracy of hierarchy and of coding
• Inconsistent hierarchies
• Hard to reason with results
• Hence few secondary uses, e.g. for research

34

ICD 11

The (ontology-based) plan

multiple views including
◦ billing
◦ public health statistics
◦ research
◦ SNOMED compatibility

35

36

the ideal solution: a single universal clinical vocabulary

[Figure: EHR 1, SNOMED-CT, EHR 2]

SNOMED CT: Systematized Nomenclature of Medicine-Clinical Terms

PRO:
• International standard (sort of)
• Huge resource
• Free for member countries
• Multi-language (including Spanish)

37

SNOMED CT

CON:
• Huge (but redundant ... and gappy)
• Contains many examples of false synonymy
• Still in need of work
◦ No consistent interpretation of relations
◦ Many erroneous relation assertions
◦ Many idiosyncratic relations
◦ Mixes ontology with epistemology
◦ It contains numerous compound terms (e.g., test for X) without the constituent terms (here: X), even where the latter are of obvious salience

38

Coding with SNOMED-CT is unreliable and inconsistent

Multi-stage multi-committee process for adding terms that follows intuitive rules and not formal principles

Does there exist a strategy for evolutionary improvement?

39

SNOMED CT

40

above all: SNOMED CT cannot solve the problem of continuity of care because it has too much redundancy

[Figure: EHR 1, SNOMED-CT, EHR 2]

41

AND because it is used only in certain countries

[Figure: EHR 1, SNOMED-CT, EHR 2]

42

link EHR 1 to EHR 2 through a snapshot of the patient’s condition which both systems can understand

[Figure: EHR 1, Unified Medical Language System (UMLS), EHR 2]

Unified Medical Language System (UMLS)

UMLS is not unified, not a language, not a system (and not only medical); it is an aggregation

If we use something like UMLS as reference terminology, we will not solve the translation problem

43

EN

DE

New York State Center of Excellence in Bioinformatics & Life Sciences

R T U

UMLS approach to countering silo formation

– By ‘linking between different clinical or biomedical vocabularies’

– However: ‘… the Metathesaurus does not represent a comprehensive NLM-authored ontology of biomedicine or a single consistent view of the world. The Metathesaurus preserves the many views of the world present in its source vocabularies because these different views may be useful for different tasks.’

http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html


Prospective standardization is a good thing

Prospective standardization is the only thing which will work in mission critical domains

Prospective standardization means that certain limits to tolerance must be imposed

Need for top-down governance to ensure common architecture and resolution of border disputes in areas of overlap between domains

46

Principles of Best Practice in Ontology Development

47

Problem of ensuring sensible cooperation in a massively interdisciplinary community

Consider multiple uses of technical terms such as

− type
− concept
− instance
− model
− representation
− data

48

Three Levels

L3. Words, models (published representations, ontologies, databases ...)

L2. Ideas (concepts, thoughts, memories, ...)

L1. Things (cells, planets, processes of cell division ...)

49

Entity =def

anything which exists, including things and processes, functions and qualities, beliefs and actions, documents and software

(entities on levels 1, 2 and 3)

50

First basic distinction among entities

type vs. instance

(science text vs. diary)

(human being vs. Tom Cruise)

51

For ontologies

it is generalizations that are important = types, universals, kinds, species

52

A 515287 DC3300 Dust Collector Fan

B 521683 Gilmer Belt

C 521682 Motor Drive Belt

Catalog vs. inventory

53

An ontology is a representation of types

We learn about types in reality from looking at the results of scientific experiments in the form of scientific theories

experiments relate to what is particular

science describes what is general

54

Ontology =def.

a representational artifact whose representational units (which may be drawn from a natural or from some formalized language) are intended to represent

1. types in reality

2. those relations between these types which obtain universally (= for all instances)

lung is_a anatomical structure

lobe of lung part_of lung

in accordance with our best current established science
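A minimal sketch of an ontology as a set of type-level assertions, using the two examples from this slide. The triple layout, the `ancestors` helper and the third assertion (`left lung is_a lung`) are additions for illustration, and the sketch assumes that is_a may be followed transitively.

```python
# Type-level assertions: the first two come from the slide,
# the third is a hypothetical addition to show a two-step chain.
assertions = [
    ("lung", "is_a", "anatomical structure"),
    ("lobe of lung", "part_of", "lung"),
    ("left lung", "is_a", "lung"),
]

def ancestors(term, relation, triples):
    """Follow one relation upward from term and collect every type reached."""
    found = []
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for subject, rel, obj in triples:
            if subject == current and rel == relation and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found

print(ancestors("left lung", "is_a", assertions))
print(ancestors("lobe of lung", "part_of", assertions))
```

Each triple is meant to hold for all instances of its subject type, which is what licenses chaining the assertions in this way.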

55

[Figure: tree of types with nodes object, organism, animal, mammal, cat, siamese, frog; instances shown beneath]

56

Domain =def

a portion of reality that forms the subject-matter of a single science or technology or mode of study or administrative practice:

proteomics

epidemiology

C2

M&S

57

Representation =def

an image, idea, map, picture, name or description ... of some entity or entities.

58

Ontologies are representational artifacts

comparable to science texts

and subject to the same sorts of constraints (including need for update)

59

Representational units =def

terms, icons, alphanumeric identifiers ... which refer, or are intended to refer, to entities

and which are minimal (atoms)

60

Composite representation =def

representation

(1) built out of representational units

which

(2) form a structure that mirrors, or is intended to mirror, the entities in some domain

61

The Periodic Table

62

Ontologies are here

63

or here

64

Ontologies represent general structures in reality (leg)

65

Ontologies do not represent concepts in people’s heads

66

They represent types in reality

67

How do we know which general terms designate types?

Types are repeatables:

cell, electron, weapon, F16 ...

Instances are one-off:

Bill Clinton, this laptop, this handwave

68

Problem

The same general term can be used to refer both to types and to collections of particulars. Consider:

HIV is an infectious retrovirus

HIV is spreading very rapidly through Asia

69

Class =def

a maximal collection of particulars determined by a general term (‘cell’, ‘electron’ but also: ‘restaurant in Palo Alto’, ‘Italian’)

the class A = the collection of all particulars x for which ‘x is A’ is true

70

types vs. their extensions

types

{a,b,c,...} collections of particulars

71

Extension

=def The extension of a type is the class of its instances
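A minimal sketch of the distinction between a type and its extension. The particulars are borrowed from earlier slides; the `instance_of` mapping and the `extension` function are hypothetical names, and representing a class as a Python set is an assumption made for illustration.

```python
# Hypothetical instance data: each particular is mapped to the type it instantiates.
instance_of = {
    "Tom Cruise": "human being",
    "Bill Clinton": "human being",
    "this laptop": "laptop",
}

def extension(type_name, instance_map):
    """Return the class (here a set) of all particulars that instantiate type_name."""
    return {particular for particular, t in instance_map.items() if t == type_name}

print(sorted(extension("human being", instance_of)))
```

The type is named by the string `"human being"`; its extension is the set that the function returns, and that set changes whenever the instance data change while the type stays the same.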

72

types vs. classes

types

{c,d,e,...} classes

73

types vs. classes

compare: ‘natural kinds’

types

extensions other sorts of classes

74

types vs. classes

types

populations, ...

the class of all diabetic patients in Leipzig on 4 June 1952

75

OWL is a good representation of classes

• F16s

• sibling of Finnish spy

• member of Abba aged > 50 years

76

types, classes, concepts

types

classes

‘concepts’ ?

77

types < classes < ‘concepts’ ?

Cases of ‘concepts’ which, some people say, do not correspond to classes:

‘Cancelled oophorectomy’

‘Absent nipple’

‘Unlocalized ligand’

A cancelled oophorectomy is not a special kind of conceptual oophorectomy

Use: Information Artifact Ontology (IAO)

78

Principle of Low Hanging Fruit

Include even absolutely trivial assertions (assertions you know to be universally true)

pneumococcal virus is_a virus

Computers need to be led by the hand

79

Example: MeSH

MeSH Descriptors > Index Medicus Descriptor > Anthropology, Education, Sociology and Social Phenomena (MeSH Category) > Social Sciences > Political Systems > National Socialism

National Socialism is_a Political Systems

National Socialism is_a Anthropology ...

80

Principle of Singular Nouns

Terms in ontologies represent types

Goal: Each term in an ontology should represent exactly one type

Thus every term should be a singular noun

81

Principle: do not commit the use-mention confusion

mouse =def. common name for the species Mus musculus

swimming is healthy and has eight letters

82

Principle: do not commit the use-mention confusion

Avoid confusing words with things

Avoid confusing concepts in our minds with entities in reality

Recommendation: avoid the word ‘concept’ entirely

83

Trialbank

‘information’ =def. ‘a written or spoken designation of a concept’

84

‘Heparin therapy’ is an instance of ‘written or spoken designation of a concept’

What are the problems here?

1. misuse of quotation marks

2. confusion of instances and types

3. confusion of concept and reality

Trialbank

85

Principle: beware of terminological baggage

For the sake of interoperability with other ontologies, do not give special meanings to terms with established general meanings

(Don’t use ‘cell’ when you mean ‘plant cell’)

86

ICNP: International Classification of Nursing Procedures (old version)

water =def. a type of Nursing Phenomenon of Physical Environment with the specific characteristics: clear liquid compound of hydrogen and oxygen that is essential for most plant and animal life influencing life and development of human beings.

87

Principle of definitions

Supply definitions for every term

1. human-understandable natural language definition

2. an equivalent formal definition

88

Principle: definitions must be unique

Each term should have exactly one definition

it may have both natural-language and formal versions

(issue with ontologies which exist with different levels of expressivity)

89

The Problem of Circularity

A Person =def. A person with an identity document

Hemolysis =def. The causes of hemolysis

90

Principle of non-circularity

The term defined should not appear in its own definition
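A minimal sketch of a check for this principle, applied to the two circular definitions from the previous slide. The function name `is_circular` is hypothetical, and the check is deliberately naive: it matches whole words only, so it would miss plurals, inflected forms and circularity that runs through a chain of several definitions.

```python
import re

def is_circular(term, definition):
    """Return True if the defined term occurs as a whole word in its own definition."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None

print(is_circular("Person", "A person with an identity document"))
print(is_circular("Hemolysis", "The causes of hemolysis"))
print(is_circular("human being", "an animal which is rational"))
```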

91

Example: HL7

‘stopping a medication’ = def.

change of state in the record of a Substance Administration Act from Active to Aborted

92

Principle of Increase in Understandability

A definition should use only terms which are easier to understand than the term defined

Definitions should not make simple things more difficult than they are

93

Generalized Tarski principle (a good, general constraint on a theory of meaning)

For each linguistic expression ‘E’

‘E’ means E

‘snow’ means: snow

‘pneumonia’ means: pneumonia

94

HL7 Reference Information Model

‘medication’ does not mean: medication

rather it means:

the record of medication in an information system

‘disease’ does not mean: disease

rather it means:

the observation of a disease

95

Principle of Acknowledging Primitives

In every ontology some terms and some relations are primitive = they cannot be defined (on pain of infinite regress)

Examples of primitive relations:

identity

instance_of

96

Principle of Aristotelian Definitions

Use Aristotelian definitions

An A is a B which C’s.

A human being is an animal which is rational
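A minimal sketch of the "An A is a B which C's" pattern as a data structure. The class and field names are hypothetical; reading B as the parent type (genus) and C as the distinguishing feature (differentia) is the usual gloss on Aristotelian definitions and is assumed here rather than stated on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AristotelianDefinition:
    term: str          # the A being defined
    genus: str         # the B: the parent type
    differentia: str   # the C: what marks A's off from other B's

    def render(self):
        """Format the definition in the '=def.' style used on these slides."""
        return f"{self.term} =def. {self.genus} which {self.differentia}"

human = AristotelianDefinition("human being", "animal", "is rational")
print(human.render())
```

Storing the genus as a separate field means the is_a parent of a term can be read directly from its definition.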

97

Rules for Formulating Terms

Avoid abbreviations even when it is clear in context what they mean (‘breast’ for ‘breast tumor’)

Avoid acronyms

Avoid mass terms (‘tissue’, ‘brain mapping’, ‘clinical research’ ...)

Treat each term ‘A’ in an ontology as shorthand for a term of the form ‘the type A’

98

Univocity

Terms should have the same meanings on every occasion of use.

(= They should refer to the same types)

Basic ontological relations such as is_a and part_of should be used in the same way by all ontologies

99

Universality

Ontologies are made of relational assertions

They should include only those which hold universally

100

Universality

Often, order will matter:

We can assert

adult transformation_of child

but not

child transforms_into adult
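A minimal sketch of why order matters, assuming that a type-level assertion "A rel B" is read as: every instance of A stands in rel to some instance of B. That reading, the instance identifiers and the function name `holds_universally` are assumptions made for illustration.

```python
# Hypothetical instance-level data; c3 is a child that never reaches adulthood.
instances = {"adult": {"a1", "a2"}, "child": {"c1", "c2", "c3"}}
transformation_of = {("a1", "c1"), ("a2", "c2")}
transforms_into = {(child, adult) for adult, child in transformation_of}

def holds_universally(subject_type, pairs, object_type):
    """True if every instance of subject_type is related to some instance of object_type."""
    return all(
        any((s, o) in pairs for o in instances[object_type])
        for s in instances[subject_type]
    )

# Every adult was once a child, so this assertion holds for all adults.
print(holds_universally("adult", transformation_of, "child"))
# Not every child becomes an adult, so the converse fails for c3.
print(holds_universally("child", transforms_into, "adult"))
```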

101

Universality

viral pneumonia caused by virus

but not

virus causes pneumonia

pneumococcal virus causes pneumonia

102

Principle of Universality

results analysis later_than protocol-design

but not

protocol-design earlier_than results analysis

103

Principle of Positivity

Complements of types are not themselves types.

Terms such as

non-mammal

non-membrane

other metalworker in New Zealand

do not designate types in reality
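A minimal sketch of a label check that flags candidate complement terms for human review. The prefix list is an assumption drawn only from the three examples on this slide; a label check of this kind is a heuristic and cannot decide on its own whether a term designates a type.

```python
# Assumed markers of complement terms, taken from the slide's examples.
NEGATIVE_PREFIXES = ("non-", "other ")

def looks_like_complement(term):
    """Flag terms whose label starts with one of the assumed complement markers."""
    return term.lower().startswith(NEGATIVE_PREFIXES)

terms = ["non-mammal", "non-membrane", "other metalworker in New Zealand", "mammal"]
flagged = [t for t in terms if looks_like_complement(t)]
print(flagged)
```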

104

Generalized Anti-Boolean Principle

There are no conjunctive and disjunctive types:

anatomic structure, system, or substance

musculoskeletal and connective tissue disorder

105

Objectivity

Which types exist in reality is not a function of our knowledge.

Terms such as

unknown

unclassified

unlocalized

arthropathies not otherwise specified

do not designate types in reality.

106

Keep Epistemology Separate from Ontology

If you want to say that

We do not know where A’s are located

do not invent a new class of

A’s with unknown locations

(A well-constructed ontology should grow linearly; it should not need to delete classes or relations because of increases in knowledge)

107

If you want to say

I surmise that this is a case of pneumonia

do not invent a new class of surmised pneumonias

Confusion of ‘findings’ in medical terminologies

Keep Sentences Separate from Terms

108

Single Inheritance

No kind in a classificatory hierarchy should be asserted to have more than one is_a parent on the immediate higher level

109

Multiple Inheritance

[Figure: thing at the top; car and blue thing below it; blue car at the bottom with two is_a links]

110

Multiple Inheritance

is a source of errors

encourages laziness

serves as obstacle to integration with neighboring ontologies

hampers use of Aristotelian methodology for defining terms

hampers use of statistical search tools

111

Multiple Inheritance

[Figure: thing at the top; car and blue thing below it; blue car at the bottom with two links labeled is_a1 and is_a2]

112

Principle of asserted single inheritance

Each reference ontology module should be built as an asserted monohierarchy (a hierarchy in which each term has at most one parent)

Asserted hierarchy vs. inferred hierarchy
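A minimal sketch of a check for asserted single inheritance, using the blue car example from the Multiple Inheritance slides. The `(child, parent)` pair layout and the function name `multiple_parents` are hypothetical.

```python
from collections import defaultdict

# Asserted is_a links from the blue car example: (child, parent).
asserted_is_a = [
    ("car", "thing"),
    ("blue thing", "thing"),
    ("blue car", "car"),
    ("blue car", "blue thing"),
]

def multiple_parents(links):
    """Return the terms that have more than one asserted is_a parent."""
    parents = defaultdict(set)
    for child, parent in links:
        parents[child].add(parent)
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}

print(multiple_parents(asserted_is_a))
```

An empty result means the asserted hierarchy is a monohierarchy; additional parents can still appear in the inferred hierarchy without violating the principle.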

113

Principle of normalization

Polyhierarchies should be decomposable into homogeneous disjoint monohierarchies

114

Principle of instantiability

A term should be included in an ontology only if there is evidence that instances to which that term refers exist or have existed or can exist in reality.

Fist

Crowd

115

Avoid mass nouns

Count nouns = an organism, a planet, a handshake

Mass nouns = tissue, information, discourse

Mass nouns almost always go hand in hand with ontological confusion

116

is_a Overloading

The success of ontology alignment demands that ontological relations (is_a, part_of, ...) have the same meanings in the different ontologies to be aligned.

117

Multiple Inheritance

[Figure: thing at the top; car and blue thing below it; blue car at the bottom with two links labeled is_a1 and is_a2]

118

How to solve this problem

Create two ontologies:

of cars

of colors

Link the two together via cross-products

(= factoring, normalization, modularization)
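A minimal sketch of the cross-product approach: two separate monohierarchies, with compound types represented as pairs and their subsumption computed rather than asserted. The contents of both hierarchies and the function names are hypothetical.

```python
# Two small monohierarchies (child -> single parent); contents are hypothetical.
car_is_a = {"car": "vehicle", "sports car": "car"}
color_is_a = {"blue": "color", "navy blue": "blue"}

def ancestors(term, is_a):
    """Return term plus everything above it in a monohierarchy."""
    chain = [term]
    while chain[-1] in is_a:
        chain.append(is_a[chain[-1]])
    return chain

def subsumes(general, specific):
    """A (car type, color) pair subsumes another if both of its components do."""
    return (general[0] in ancestors(specific[0], car_is_a)
            and general[1] in ancestors(specific[1], color_is_a))

blue_car = ("car", "blue")
navy_sports_car = ("sports car", "navy blue")
print(subsumes(blue_car, navy_sports_car))
print(subsumes(navy_sports_car, blue_car))
```

Neither hierarchy needs a term with two parents; the polyhierarchy over compound terms such as blue car is inferred from the pair of monohierarchies.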

119

Compositionality

The meanings of compound terms should be determined

1. by the meanings of component terms

together with

2. the rules governing syntax

120

User feedback principle

An ontology should evolve on the basis of feedback derived from those who are using the ontology, for example for purposes of annotation.

121
