The complexity of biodiversity knowledge

Andrew C. Jones
Cardiff University
Andrew.C.Jones@cs.cardiff.ac.uk

Malcolm Scoble
The Natural History Museum
M.Scoble@nhm.ac.uk

Jones & Scoble, Semantic Interop., Imperial Coll, 30/03/06

Purpose of talk

• Malcolm & Andrew are both investigators in BiodiversityWorld (BDW)

• There are many problems BDW doesn’t solve yet …

• … and the funding runs out tomorrow!
• We’ll present
  – BiodiversityWorld as a framework to support biodiversity research
  – Other projects in which biodiversity informatics problems have been addressed individually

• Major challenge: draw these disparate efforts together

Part 1 (Andrew Jones)


Why Biodiversity Informatics is hard

• Need to integrate data & tools of different kinds for interesting “in silico” analyses

• Various computer science issues, e.g.
  – Human-Computer Interaction
    • Design of environments to support scientific research
  – Interoperability
  – Complexity & heterogeneity of data
    • Differences of scientific opinion
    • Data quality problems


The BiodiversityWorld project

• 3-year e-Science project funded by BBSRC
• Partners: The University of Reading, Cardiff University, The Natural History Museum, Southampton University
• Aim:
  – Build a Biodiversity Grid (Problem Solving Environment to support Biodiversity research)
  – Support discovery & use of arbitrary tools & data sources for interesting in silico experiments
  – Provide environment to get beyond the ‘cutting and pasting into Word documents’ approach to data integration and analysis


Example problems for BiodiversityWorld

• How should conservation efforts be concentrated?
  – (example of Biodiversity Richness & Conservation Evaluation)
• Where might a species be expected to occur, under present or predicted climatic conditions?
  – (example of Bioclimatic & Ecological Niche Modelling)
• How can geographical information assist in selection among possible phylogenetic trees?
  – (example of Phylogenetic Analysis & Palaeoclimate Modelling)


BiodiversityWorld architecture

[Figure: labelled components are the user interface; presentation; workflow enactment engine; metadata repository; BGI API; BiodiversityWorld-GRID Interface (BGI); the GRID; wrapped resources; native BiodiversityWorld resources.]


Some problems not fully solved in BDW

• Flexible data access
  – BGI designed to make BDW maintainable, but currently assumes each resource has a predefined set of operations
  – BioDA project investigated use of OGSA-DAI in BDW
• HCI issues
  – A much more exploratory approach to workflow construction might be appropriate?
• Semantic interoperability & data quality
  – Metadata repository: basic information only
  – Only basic solution to species naming problems (SPICE)
  – Other problems of descriptive terms, differences of expert opinion, etc., remain to be addressed


Complexity of biodiversity data: a multi-dimensional problem

• Same specimen might be described with differences of:
  – Terminology
  – Opinion about identification
  – Opinion about whether a particular feature is present
  – Accuracy
• Experts may differ as to:
  – Circumscription associated with a given scientific name
    • (So may not be describing the same concept)
  – Terminology used to describe a given taxon
  – Accepted name for a species in a taxonomic checklist
• There may be errors!
• ...


SPICE for Species 2000

• BBSRC/EPSRC- and EU-funded
• SPecies 2000 Interoperability Co-ordination Environment
• Aims:
  – build scalable, federated scientific name catalogue organised by taxon (species, etc.)
  – provide ‘synonymy server’, enriching information retrieval
• Issue: how to build an architecture to integrate specialist, heterogeneous databases, providing a consistent federated view of broader scope?
• Common Data Model sufficed …
  – data requirements of federation identical for each database
  – small set of ‘canned queries’ adequate for the catalogue


SPICE internal architecture

[Figure: labelled components are users (Web browsers); User Server module (HTTP); ‘Query’ co-ordinator; CAS knowledge repository (taxonomic hierarchy, annual checklist, genus and other caches, ...); Common Access System (CAS); CORBA and CGI/XML links; internal and external wrappers; GSD wrappers (e.g. JDBC; CGI/XML + ODBC), whose CORBA ‘wrapper’ element is in some cases generic; GSDs.]


LITCHI

• BBSRC/EPSRC- and EU-funded
• Logic-based Integration of Taxonomic Conflicts in Heterogeneous Information systems
• Aim: detect conflicts between species checklists and either
  – Assist in producing a consistent checklist, or
  – Generate correspondences between checklists (‘cross-map’)
• Addressing problems of species classification & naming variations when accessing species-related data
• More general, semantic interoperability issue:
  – detecting conflicts between different expert views of same subject matter;
  – supporting data access based on these views


LITCHI example

Checklist 1
– Caragana arborescens Lam. (accepted name)
– Caragana sibirica Medikus (synonym)

Checklist 2
– Caragana sibirica Medikus (accepted name)
– Caragana arborescens Lam. (synonym)

(“Lam.” = “Lamarck”)

“A full name which is not a pro-parte name may not appear as both an accepted name and a synonym in the same checklist”

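\[
\forall n, a, l, c_1, c_2, t_1, t_2 \;\bigl(\, \mathit{accepted\_name}(n, a, c_1, l, t_1) \land \mathit{synonym}(n, a, c_2, l, t_2) \;\Rightarrow\; \mathit{pro\_parte}(c_1) \land \mathit{pro\_parte}(c_2) \,\bigr)
\]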
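As a rough illustration of the quoted rule, the Python sketch below flags full names that occur as both an accepted name and a synonym once the two example checklists are merged. The record layout, the names `CHECKLIST_1`, `CHECKLIST_2` and `find_conflicts`, and the choice to tolerate a pair only when both occurrences are flagged pro parte are assumptions made for this example. LITCHI itself is logic-based, and its implementation is not shown in these slides.

```python
from itertools import product

# Hypothetical record layout: (full_name, status, is_pro_parte).
CHECKLIST_1 = [
    ("Caragana arborescens Lam.", "accepted", False),
    ("Caragana sibirica Medikus", "synonym", False),
]
CHECKLIST_2 = [
    ("Caragana sibirica Medikus", "accepted", False),
    ("Caragana arborescens Lam.", "synonym", False),
]


def find_conflicts(checklist):
    """Return, in sorted order, the full names that break the rule.

    A name breaks the rule if it has an accepted-name occurrence and a
    synonym occurrence in the same checklist and the two occurrences
    are not both flagged pro parte (an assumption of this sketch).
    """
    occurrences = {}
    for name, status, pro_parte in checklist:
        entry = occurrences.setdefault(name, {"accepted": [], "synonym": []})
        entry[status].append(pro_parte)

    conflicts = []
    for name in sorted(occurrences):
        entry = occurrences[name]
        pairs = product(entry["accepted"], entry["synonym"])
        if any(not (p1 and p2) for p1, p2 in pairs):
            conflicts.append(name)
    return conflicts
```

Each checklist is consistent on its own, so `find_conflicts(CHECKLIST_1)` returns an empty list. Calling `find_conflicts(CHECKLIST_1 + CHECKLIST_2)` reports both Caragana names, which is the kind of conflict a merged checklist would need an expert to resolve.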


Name relationships (LITCHI 2)


myViews

• Not funded yet – limited proof-of-concept prototype only

• Addresses problem that an expert may wish to generate taxon descriptions which are:
  – Coherent;
  – Mapped explicitly to other taxon descriptions, and
  – Based directly on existing documentation (monographs, etc), rather than completely re-coded in some restrictive formalism with a new vocabulary


Example: describing the same things?

• Description A:
  – Sarothamnus scoparius (L.) Wimm. ex Koch.
  – Broom
  – ... a bush which is 50-200 cm high ...
• Description B:
  – Cytisus scoparius
  – Yellow broom
  – ... a small shrub up to 6ft or more ... native in its yellow form ...
• Description C:
  – Cytisus scoparius (L.) Link.
  – Broom
  – ... a deciduous shrub growing to 2.4m by 1m at a fast rate ... scented flowers ...
• Description D:
  – Common Broom
  – Cytisus scoparius
  – ... covered in profuse golden-yellow flowers ... shrub about 1-3m tall ...
• Description E:
  – Broom
  – Cytisus scoparius
  – ... Like a spineless edition of gorse ... with larger scentless flowers ...
• Similar problems apply to individual specimen descriptions


Things we might want to do

• In a system where
  – data is held in as ‘raw’ a form as possible, to avoid information loss, but
  – we can impose various views and hypotheses
  we might wish to …
• Create our own ‘view’ of the data
  – For a given piece of knowledge, we could
    • accept it unaltered
    • accept but re-express in our terms (e.g. different scientific name; different units; ...)
    • state it is equivalent to another piece of knowledge (e.g. minor differences in measurements)
    • flag it as ‘wrong’
    • ...
  – In relation to another’s view, we might
    • include or ignore it
    • declare some ‘mapping’ applicable to a group of items (e.g. every species of ‘Sarothamnus’ is mapped to ‘Cytisus’)
    • ...
• Reason with differing levels of precision simultaneously (e.g. binary/continuous characters derived from same features)
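Two of the options listed on this slide, re-expressing a measurement in different units and declaring two measurements equivalent despite minor differences, can be sketched in Python as follows. The conversion table, the function names `to_metres` and `equivalent`, and the 10% tolerance are assumptions chosen for illustration; the slides do not say how such equivalences would be decided.

```python
import math

# Conversion factors to metres for the units used in the example descriptions.
TO_METRES = {"cm": 0.01, "m": 1.0, "in": 0.0254, "ft": 0.3048}


def to_metres(value, unit):
    """Re-express a measurement in metres, leaving the raw value untouched."""
    return value * TO_METRES[unit]


def equivalent(a, b, rel_tol=0.1):
    """Treat two (value, unit) measurements as equivalent if they agree
    to within rel_tol of the larger one (the tolerance is arbitrary here)."""
    return math.isclose(to_metres(*a), to_metres(*b), rel_tol=rel_tol)
```

With these assumptions, `equivalent((6, "ft"), (2, "m"))` holds, while `equivalent((6, "in"), (6, "ft"))` does not. Keeping the raw `(value, unit)` pairs and converting only on comparison follows the slide's point about holding data in as raw a form as possible.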


An experimental prototype

• Proof of concept ...
  – arbitrary, small data set from various sources: Cytisus & Genista species
  – No real ‘front end’ or ‘back end’ yet!
• Implemented in Prolog (a logic programming language)
• Formalisms to record complex assertions & their sources
• Ontological knowledge not currently separated out explicitly; rules perform inference
• User makes his/her own assertions about (for example)
  – synonymy;
  – which assertions of others to accept;
  – ...
• ... both very specific and more general rules
• Main purpose: illustrate handling multiple opinions/hypotheses


Sample knowledge base extracts

assertion(1, association(2, 3, absent(scent(flowers)))).
assertion(1, property(2, yellow(flowers))).
assertion(1, label(2, common('Broom'))).
assertion(1, label(2, species('Cytisus', 'scoparius'))).

assertion(4, property(5, shrublet(whole))).
assertion(4, property(5, deciduous(whole))).
assertion(4, property(5, size(6, in, whole))).
assertion(4, property(5, deep_yellow(flowers))).
assertion(4, property(5, small(leaves))).
assertion(4, label(5, species('Cytisus', 'ardoinii'))).

assertion(4, property(7, size(6, ft, whole))).
assertion(4, label(7, species('Cytisus', 'scoparius'))).

assertion(12, label(13, common('Broom'))).
assertion(12, label(13, common('Scotch Broom'))).
assertion(12, property(13, compound('sparteine'))).
assertion(12, property(13, compound('tyramine'))).
assertion(12, label(13, species('Sarothamnus', 'scoparius'))).

assertion(14, label(15, species('Sarothamnus', 'scoparius'))).
assertion(14, property(15, size_range(50, 200, cm, whole))).
assertion(14, property(15, bright_yellow(flowers))).

assertion(16, label(17, species('Cytisus', 'scoparius'))).
assertion(16, property(17, max_height(2.4, m, whole))).
assertion(16, property(17, max_width(1, m, whole))).
assertion(16, property(17, present(scent(flowers)))).

assertion(8, property(9, golden_yellow(flowers))).
assertion(8, property(9, size_range(1, 3, m, whole))).
assertion(8, label(9, species('Cytisus', 'scoparius'))).

(Callout: source 12 asserts that item 13’s label is the common name ‘Scotch Broom’)


Deducing from the knowledge base

?- display_accepted_props('Cytisus', 'ardoinii').
shrublet(whole)
deciduous(whole)
size(6, in, whole)
deep_yellow(flowers)
small(leaves)

Yes

?- display_accepted_props('Cytisus', 'scoparius').
yellow(flowers)
size(6, ft, whole)
golden_yellow(flowers)
size_range(1, 3, m, whole)
max_height(2.4, m, whole)
max_width(1, m, whole)
present(scent(flowers))
absent(spines)
absent(scent(flowers))

Yes

?- display_contradictions_for('Cytisus', 'scoparius').
[present(scent(flowers)), absent(scent(flowers))]

Yes
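The queries above can be mimicked with a small Python sketch. It is not the prototype's Prolog: the tuple encoding, the names `ASSERTIONS`, `accepted_props` and `contradictions`, the decision to accept every source, and the recording of the scent observation from source 1 as a plain property (the slide records it as an association) are all assumptions made to keep the example short.

```python
# Each assertion is (source, item, kind, value). Values are nested tuples
# that loosely mirror Prolog terms, e.g. ("present", ("scent", "flowers")).
ASSERTIONS = [
    (1, 2, "label", ("species", "Cytisus", "scoparius")),
    (1, 2, "property", ("yellow", "flowers")),
    (1, 2, "property", ("absent", ("scent", "flowers"))),  # simplified
    (16, 17, "label", ("species", "Cytisus", "scoparius")),
    (16, 17, "property", ("present", ("scent", "flowers"))),
    (4, 5, "label", ("species", "Cytisus", "ardoinii")),
    (4, 5, "property", ("small", "leaves")),
]


def accepted_props(assertions, genus, epithet):
    """Collect properties of every item labelled with the given species.

    This sketch accepts all sources; the prototype lets the user choose.
    """
    target = ("species", genus, epithet)
    items = {
        item
        for _source, item, kind, value in assertions
        if kind == "label" and value == target
    }
    return [
        value
        for _source, item, kind, value in assertions
        if kind == "property" and item in items
    ]


def contradictions(props):
    """Pair up features asserted to be both present and absent."""
    present = {p[1] for p in props if p[0] == "present"}
    absent = {p[1] for p in props if p[0] == "absent"}
    return [(("present", f), ("absent", f)) for f in sorted(present & absent)]
```

Calling `contradictions(accepted_props(ASSERTIONS, "Cytisus", "scoparius"))` yields a single present/absent pair for flower scent, matching the shape of the contradiction reported on the slide.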


Adding synonymy (1)

• User regards any statement about a Sarothamnus species as being a statement about a Cytisus species with the same epithet:
  assertion(20, synonym(species('Cytisus', Epithet), _, species('Sarothamnus', Epithet), _)).
• (Could be more restrictive, e.g. apply only to particular information sources)


Adding synonymy (2)

?- display_accepted_props('Cytisus', 'scoparius').
yellow(flowers)
size(6, ft, whole)
golden_yellow(flowers)
size_range(1, 3, m, whole)
max_height(2.4, m, whole)
max_width(1, m, whole)
present(scent(flowers))
compound(sparteine)
compound(tyramine)
size_range(50, 200, cm, whole)
bright_yellow(flowers)
absent(spines)
absent(scent(flowers))

Yes

?- display_contradictions_for('Cytisus', 'scoparius').
[size_range(1, 3, m, whole), size_range(50, 200, cm, whole)]
[present(scent(flowers)), absent(scent(flowers))]

Yes
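The effect of the genus-level synonymy assertion can be illustrated with another short Python sketch. Again this is not the prototype's code: the dictionaries `GENUS_SYNONYMS`, `LABELS` and `PROPERTIES` and the functions `resolve` and `props_for` are hypothetical, and only a few of the sample assertions are included.

```python
# The user's own view: any Sarothamnus species is treated as the Cytisus
# species with the same epithet (cf. assertion 20 on the previous slide).
GENUS_SYNONYMS = {"Sarothamnus": "Cytisus"}

# Item id -> (genus, epithet) label, and item id -> asserted properties.
LABELS = {
    5: ("Cytisus", "ardoinii"),
    9: ("Cytisus", "scoparius"),
    15: ("Sarothamnus", "scoparius"),
}
PROPERTIES = {
    5: [("deep_yellow", "flowers")],
    9: [("golden_yellow", "flowers"), ("size_range", 1, 3, "m", "whole")],
    15: [("size_range", 50, 200, "cm", "whole"), ("bright_yellow", "flowers")],
}


def resolve(genus, epithet, synonyms):
    """Map a name to the user's preferred genus, keeping the epithet."""
    return (synonyms.get(genus, genus), epithet)


def props_for(genus, epithet, synonyms):
    """Gather properties of all items whose resolved label matches."""
    target = resolve(genus, epithet, synonyms)
    found = []
    for item in sorted(LABELS):
        if resolve(*LABELS[item], synonyms) == target:
            found.extend(PROPERTIES.get(item, []))
    return found
```

With an empty synonym table, `props_for("Cytisus", "scoparius", {})` sees only item 9. Passing `GENUS_SYNONYMS` also pulls in the Sarothamnus description, so the two differing size ranges now sit side by side, which is how the extra contradiction on this slide arises.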


Some important issues for future work

• Complexity, e.g.
  – Trade-off: effective resource discovery v. computational expense of traversing rich ontology
  – Scalability of taxonomic conflict detection
    • May find large data sets need clever techniques such as Rete network
  – Scalability of inference in myViews; caching inferred information
• Managing & ranking large result sets
  – How to rank resources discovered
  – How to rank conflicts
  to present users with matches they are likely to want
• Joining all these fragmentary projects up together

Part 2 (Malcolm Scoble)


The complexity of taxonomic/biodiversity data

[Figure: kinds of data shown in the diagram are specimen (unit) data; collection-level; observations; locality; date of specimen collection; time of specimen collection; name of collector; species/taxon concept; type specimen; homonyms; author of taxon; date of description; genus name (for binomial); images; species name; DNA barcodes; synonyms; species concepts.]


Taxonomy: from a ‘fragmented’ to a ‘distributed’ resource

Where we are now
• Fragmented results
• Fragmented effort
• Largely a paper medium (restricted access)

Where we want to be
• Less fragmented; single site or distributed access
• Easier to update
• Coordinated effort
• Electronic (or dual) medium
• Free access to data
• Taxonomy easier to use


Projects to integrate biodiversity data

• BioCISE (collection-level)

• ENHSIN (specimen (unit)-level)

• BioCASE (unit- & collection-level)

• Species 2000 (species nomenclature)

• SYNTHESYS (taxonomic infrastructure)

• ENBI (network of biodiversity information)

• EDIT (distributed approach to taxonomy)

• PBIs (inventorying the planet’s biodiversity)

• CATE: Creating a Taxonomic e-Science


BioCASE National Node Network

BioCASE National Node

CORM

• 31 National Nodes

• Core Meta Database is updated every night

Collection-level


A Biological Collections Service for Europe

[Figure: BioCASE architecture. Labelled components are National Nodes (NN) with collection-level meta-data; Special Interest Networks; unit information databases supplying unit data (ABCD); the BioCASE Core with Core Data Items (BioCASE Profile), enhanced meta-data, a thesaurus (keywords, keyword relations) and a metadata index; unit access; the WWW interface.]


Creating a taxonomic e-science (CATE)

• Literature scattered over 250 years of paper publications.

• Data inaccessible other than to specialist users

• Aim to transfer in toto the taxonomy of two groups of organisms to the web (Hawkmoths and Aroids).

• Broad aim: to encourage migration of taxonomy to the web.

• Provide data for those studying biodiversity.

• Encourage quality control, peer-review and the development of “consensus” taxonomies in the web environment.

• Develop means of citation for web-based revisions

Arisaema candidissimum. Photo: RBG Kew

The Hawkmoth Sphinx caligineus sinicus from Beijing, China. Photo: Tony Pittaway