The Prometheus Taxonomic Database Cédric Raguenaud, Jessie Kennedy, Peter Barclay Napier University, Edinburgh http://www.dcs.napier.ac.uk/~prometheus

The Prometheus Taxonomic Database

Cédric Raguenaud, Jessie Kennedy, Peter Barclay
Napier University, Edinburgh

http://www.dcs.napier.ac.uk/~prometheus

Contents

- What is taxonomy?
- What are the features of taxonomic data/processes?
- Which database?
- The Prometheus approach
- Schema example
- Particularities of the model
- Example queries
- Summary & Conclusions

What is plant taxonomy?

[Figure: six panels, (i) to (vi), showing classification hierarchies built from the ranks family, tribe, genus, species and variety, together with groups of specimens labelled "Red squares", "Yellow round shapes" and "Purple diamond shapes".]

Plant Taxonomy Data

- The data is hierarchical
  - Multiple overlapping hierarchies co-exist
    - distinct hierarchies need to be identified for manipulation and extraction
    - explicit relationships (=> graphs)
    - querying is recursive and dependent on the context of the relationships
- Nodes in the hierarchy are aggregate objects
  - they also have associations to other objects outside the hierarchy
    - differentiate between association and aggregation in relationships
    - extraction of composite objects required
- Levels of the hierarchy bear information
  - Ranks are biologically significant (e.g. "genus" vs "species")
- Domain-specific rules are important
  - data is derived based on domain-specific rules
    - definition of constraints necessary for defining rules
    - positioning of objects in a hierarchy depends on domain-specific constraints (e.g. family names must end with -aceae)
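
The points above about overlapping hierarchies and domain-specific constraints can be illustrated with a small Python sketch. It is not Prometheus code: the edge tuples, the classification labels "A" and "B", the placeholder taxon names and both helper functions are assumptions made for illustration. The rule checked is the standard botanical suffix -aceae for family names.

```python
from collections import defaultdict

# Edges are (parent, child, classification); all names are placeholders.
# "species s2" is placed in a different genus by each classification.
EDGES = [
    ("family F", "genus G1", "A"),
    ("genus G1", "species s1", "A"),
    ("genus G1", "species s2", "A"),
    ("family F", "genus G2", "B"),
    ("genus G2", "species s2", "B"),
]

def hierarchy(edges, classification):
    """Extract one hierarchy (parent -> children) from the overlapping set."""
    tree = defaultdict(list)
    for parent, child, label in edges:
        if label == classification:
            tree[parent].append(child)
    return dict(tree)

def valid_family_name(name):
    """Domain rule: a family name must end with -aceae."""
    return name.endswith("aceae")
```

Keeping the classification label on the edge rather than on the taxon lets the same taxon sit in several hierarchies at once, which is the situation the slide describes.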

Which Database?

- Existing taxonomic databases are inadequate due to:
  - simplicity of their model of taxonomy
    - support single classifications only
  - limitations of the underlying database:
    - Relational model: limited semantics, no explicit relationships, no recursive querying
    - Graph models: limited semantics, often no constraints
    - Semi-structured data: limited semantics, no a priori schema
    - Object-oriented models: limited support for relationships, no recursive querying
- Need an OODB with relationships + graph functionality
  - OODBs with relationships already exist (e.g. OMS, Albano's, GraphDB), but they are
    - limited (e.g. no QL, no semantics for relationships, or no constraints), or
    - based on uncommon models (e.g. the collection-based model of Albano)

Prometheus Approach

- Prometheus model
  - ODMG model extended with relationships as first-class constructs
    - association & aggregation
    - cardinality, traversability, sharability, dependency …
  - Reduces the gap between design and implementation
  - Attributes on relationships are used to distinguish classifications
- POOL
  - OQL + operators for manipulating relationships and graphs
    - query relationship objects
    - define a query on aggregation relationships only
    - specify a particular path to be followed through a hierarchy
    - specify the transitive closure of a relationship
    - return a hierarchy as a structure
- Prometheus prototype implemented using POET (ODMG OODB) and Java

Simple Taxonomic Schema

[Figure: class diagram of the schema. Elements recoverable from the diagram:
Classes (attributes): Name (TheValidity, calculatedFullNameNoAuthor, calculatedFullName); Rank (theBinomial, theName); Epithet (theName); Specimen (barCode, herbarium, collectionNumber, latitude, longitude, Note); Type (definition); Author (givenNames, surname, DOB, DOD); Publication (thePublication, thePage); Date (theDate); AuthorAbbreviation (theAbbreviation); PublicationAbbreviation (theAbbreviation); ReferenceDatabase (theReference).
Relationships and roles: Circumscription (theCircumscription, theCircAuthor, theCircPublication), Placement, LinkToType, LinkToDet, theRank, nextRank, previousRank, theEpithet, theAuthor, theAuthors, authors, collector, theAuthorAbbreviations, thePublicationAbbreviation, thePublication, theDate, theRef.
Multiplicities: most relationship ends are marked 0..n; one is marked 0..1.]

Relationships in the DB

- The semantics of relationships (e.g. composition) can vary
  - Prometheus implements all these semantics by providing a set of behaviours, constraints, and flags that can be combined
  - e.g. when a classification is published, it is unchangeable (even if it includes mistakes)
    - the theCircumscription relationship implements the "not changeable" behaviour
- Directionality of relationships is important
  - for propagation of operations (e.g. deletion of a composition)
  - as groups at any level contain groups at lower levels
    - a family contains several genera, each of which contains several species
- Attributes of relationships are important
  - classifying is independent of the objects classified
    - relationships build the classification
    - attributes of relationships differentiate classifications
  - the system is a generic classification system
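
A rough Python analogue of these ideas may help: a relationship is an object in its own right, carries attributes, can be flagged as not changeable, and gives operations a direction to propagate along. The class `Relationship`, the `changeable` flag, the method `retarget` and the function `cascade_delete` are all hypothetical names for this sketch; they are not the Prometheus API.

```python
from dataclasses import dataclass, field

@dataclass
class Relationship:
    origin: str
    destination: str
    attributes: dict = field(default_factory=dict)
    changeable: bool = True  # hypothetical flag for the "not changeable" behaviour

    def retarget(self, new_destination):
        """Point the relationship at another object, unless it is frozen."""
        if not self.changeable:
            raise PermissionError("published relationship cannot be changed")
        self.destination = new_destination

def cascade_delete(relationships, root):
    """Return the objects removed when `root` is deleted, following
    relationships only from origin to destination."""
    removed, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in removed:
            continue
        removed.add(node)
        stack.extend(r.destination for r in relationships if r.origin == node)
    return removed

# Placeholder data: an unpublished draft and one published (frozen) relationship.
draft = [
    Relationship("family F", "genus G"),
    Relationship("genus G", "species s"),
]
published = Relationship("genus G", "specimen 1", {"publication": "Y"}, changeable=False)
```

Deleting "genus G" in the draft removes "species s" but leaves "family F" untouched, because propagation runs only from a group to the groups it contains.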

Example Queries - 1

Querying relationships
- Select the Names whose rank is Genus.
    select n from Name n where n.theRank.destination.theName = "Genus"
- theRank is a relationship class.
- n is considered the origin of theRank in the query, and the relationship should be followed only from source to destination, i.e. no reverse traversing of the relationship.

Downcast operator
- Select the Names whose type is called graveolens.
    select n from Name n where n.LinkToType[Name].theEpithet.theName = "graveolens"
- The type of the object targeted by the destination attribute of the TaxonomicType relationship should be Name, and not TypeDefinition as shown in the model.
- All objects which are not of type Name are discarded with no error reported.
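
The behaviour of these two queries can be approximated in Python. This is a sketch under assumptions, not POOL: the classes are cut down to one attribute each, the helper `follow` is hypothetical, and the sample objects are illustrative data only.

```python
from dataclasses import dataclass

@dataclass
class Rank:
    theName: str

@dataclass
class Epithet:
    theName: str

@dataclass
class Name:
    theEpithet: Epithet

@dataclass
class TypeDefinition:
    definition: str

@dataclass
class Relationship:
    kind: str            # e.g. "theRank" or "LinkToType"
    origin: object
    destination: object

def follow(rels, obj, kind, as_type=None):
    """Destinations reached from `obj` via `kind`, origin to destination only.
    With `as_type`, objects of any other type are dropped silently."""
    out = [r.destination for r in rels if r.kind == kind and r.origin is obj]
    if as_type is not None:
        out = [d for d in out if isinstance(d, as_type)]
    return out

genus, species = Rank("Genus"), Rank("Species")
apium, graveolens = Name(Epithet("Apium")), Name(Epithet("graveolens"))
rels = [
    Relationship("theRank", apium, genus),
    Relationship("theRank", graveolens, species),
    Relationship("LinkToType", apium, graveolens),
    Relationship("LinkToType", graveolens, TypeDefinition("illustrative type")),
]
names = [apium, graveolens]

# Analogue of: n.theRank.destination.theName = "Genus"
genus_names = [n for n in names
               if any(r.theName == "Genus" for r in follow(rels, n, "theRank"))]

# Analogue of: n.LinkToType[Name].theEpithet.theName = "graveolens"
typed = [n for n in names
         if any(t.theEpithet.theName == "graveolens"
                for t in follow(rels, n, "LinkToType", Name))]
```

Without the `as_type=Name` filter, the second query would fail on the `TypeDefinition` object, which has no `theEpithet`; the filter drops it without raising an error, as the slide describes for the downcast.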

[Figure: schema fragments for these queries, showing the classes Name, Rank, Epithet, Specimen and Type with the relationships theRank (0..n), theEpithet and LinkToType (0..n).]

Example Queries - 2

Aggregate operator
- Select the Names whose circumscription contains the specimen whose name is "X"
    select shallow aggregate n from Name n where n.theCircumscription[Specimen].barCode = "X"
- extracts the Name objects that satisfy the criterion, then finds for each Name object all objects aggregated to form the concept of Name.
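
One possible reading of this operator, sketched in Python: first select the matching Names, then collect what each one aggregates. The sketch assumes that "shallow" means one level of aggregation, which the slide does not state. The tuples, the "aggregation"/"association" labels and the sample data are likewise invented for illustration.

```python
# Relationships as (origin, role, destination, semantics); all data is made up.
RELS = [
    ("name1", "theCircumscription", "specimen X", "aggregation"),
    ("name1", "theEpithet", "epithet e1", "aggregation"),
    ("name1", "theRank", "rank Species", "association"),
    ("name2", "theCircumscription", "specimen Z", "aggregation"),
    ("epithet e1", "theAuthor", "author a1", "aggregation"),
]

def shallow_aggregate(rels, roots):
    """For each root, gather the objects it aggregates directly (one level)."""
    return {
        root: [dst for org, _, dst, sem in rels
               if org == root and sem == "aggregation"]
        for root in roots
    }

# Step 1: Names whose circumscription contains specimen X.
matching = [org for org, role, dst, _ in RELS
            if role == "theCircumscription" and dst == "specimen X"]

# Step 2: the parts aggregated by each matching Name.
result = shallow_aggregate(RELS, matching)
```

The association to "rank Species" is left out of the result, and "author a1" is not reached because it sits two aggregation steps away.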

Transitive closure
- Select the Names that contain, or whose subordinate Names contain, the specimen whose name is "X"
    select n from Name n where n.theCircumscription[Name]*.theCircumscription.destination[Specimen].barCode = "X"
- we use a relationship class as a simple regular expression
- follow 0 or more theCircumscription relationships to find the Name objects containing the specimen called "X"
- "*": the repetition of a path between 0 and n times; "?": an optional path; "+": the repetition of a path one or more times
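
The three quantifiers can be mimicked over a plain adjacency map. The following Python sketch is an approximation for illustration only; `reach`, `names_containing`, the two dictionaries and the placeholder names are assumptions, not part of POOL.

```python
def reach(children, start, quantifier):
    """Nodes reached from `start` by repeating one relationship:
    "*" = 0..n steps, "+" = 1..n steps, "?" = 0..1 steps."""
    one_step = set(children.get(start, ()))
    if quantifier == "?":
        return {start} | one_step
    seen, stack = set(), list(one_step)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen | {start} if quantifier == "*" else seen

# Placeholder data: Name-to-Name circumscriptions and specimens per Name.
SUB_NAMES = {"family F": ["genus G"], "genus G": ["species s"]}
SPECIMENS = {"species s": ["X"]}
ALL_NAMES = ["family F", "genus G", "species s", "species t"]

def names_containing(barcode):
    """Names that hold `barcode` themselves or through subordinate Names."""
    return sorted(
        n for n in ALL_NAMES
        if any(barcode in SPECIMENS.get(m, ()) for m in reach(SUB_NAMES, n, "*"))
    )
```

Using "*" rather than "+" matters here: with zero steps allowed, "species s" itself is included because it holds the specimen directly.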

[Figure: schema fragments for these queries, showing Name, Circumscription (theCircumscription, 0..n), Specimen, Author, Epithet, Type, Placement, Publication and Rank.]

Example Queries - 3

Follow operator
- Select the names hierarchy
    select n, n.theCircumscription from Name n follow theCircumscription
- the query engine would know that Name objects in the resulting set must be related by a theCircumscription relationship object.
- a hierarchy is a directed connected graph. Therefore, the answer to such a query is a set of connected graphs.
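
To make "a set of connected graphs" concrete, the sketch below groups directed edges into weakly connected components, one per hierarchy. It is an illustration in Python, not the Prometheus query engine; the function `hierarchies` and the placeholder edges are assumptions.

```python
from collections import defaultdict

def hierarchies(edges):
    """Group directed (origin, destination) edges into weakly connected
    graphs, returning one sorted edge list per graph."""
    neighbours = defaultdict(set)
    for origin, destination in edges:
        neighbours[origin].add(destination)
        neighbours[destination].add(origin)
    seen, result = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        result.append(sorted(e for e in edges if e[0] in component))
    return result

# Placeholder theCircumscription edges forming two separate hierarchies.
EDGES = [
    ("family F", "genus G"),
    ("genus G", "species s"),
    ("family H", "genus K"),
]
```

Connectivity is computed ignoring direction, but each returned graph keeps its edges directed, so the hierarchy structure is preserved in the answer.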

[Figure: schema fragment showing Name, the Circumscription relationship (theCircumscription, 0..n) and Specimen.]

XLINK
- Select the names that have specimen "X" in their circumscription
    select n from Name n where n.theCircumscription[Name]*.theCircumscription[Specimen].barCode = "X" xlink
- finds Name objects that are related to a Specimen whose name is "X" via one or more theCircumscription relationships in a single hierarchy.
- Without xlink, any path relating a Name to a Specimen would be followed and hierarchies mixed up.
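
The effect of restricting a path to a single hierarchy can be shown with labelled edges. In this Python sketch each edge carries a classification label, which is an assumption about how hierarchies are told apart; `contains`, `names_with` and the data are hypothetical as well.

```python
# Edges are (origin, destination, classification); all data is made up.
# Only classification "B" places specimen X in species s.
EDGES = [
    ("genus G", "species s", "A"),
    ("genus K", "species s", "B"),
    ("species s", "specimen X", "B"),
]

def contains(edges, name, target, classification=None):
    """True if `target` is reachable from `name`; when `classification` is
    given, only edges carrying that label are followed."""
    stack, seen = [name], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(dst for org, dst, label in edges
                     if org == node and classification in (None, label))
    return False

def names_with(edges, names, target, single_hierarchy):
    """Names reaching `target`, optionally within one classification only."""
    labels = {label for _, _, label in edges} if single_hierarchy else {None}
    return sorted(n for n in names
                  if any(contains(edges, n, target, lab) for lab in labels))

NAMES = ["genus G", "genus K", "species s"]
```

With `single_hierarchy=False`, "genus G" is reported because the path switches from classification "A" to "B" halfway; with `single_hierarchy=True` it is dropped, since no single classification links it to the specimen.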

Example Queries - 4

Integrity of graphs in path expressions
- Select the names containing specimen "X" in the circumscription where the classification was published in "Y"
    select n from Name n, theCircumscription c where n.c[Name]*.c[Specimen].barCode = "X" xlink where c.theCircPublication.thePublication = "Y"
- finds all Name objects containing the specimen in their circumscription at any depth, but only according to the one publication that is declared in the xlink clause.
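
A compact way to picture the publication condition is to filter the relationship objects first and search afterwards. The Python sketch below does that; for brevity it walks upwards from the specimen, which is an implementation shortcut and not how POOL evaluates paths. The dictionaries, the key names and `names_containing` are illustrative assumptions.

```python
# Each relationship object carries its own attributes; the data is invented.
CIRCUMSCRIPTIONS = [
    {"origin": "genus G", "destination": "species s", "publication": "Y"},
    {"origin": "species s", "destination": "specimen X", "publication": "Y"},
    {"origin": "genus K", "destination": "species s", "publication": "Z"},
    {"origin": "species s", "destination": "specimen X", "publication": "Z"},
]

def names_containing(rels, barcode, publication):
    """Names that reach `barcode` using only relationships from `publication`."""
    kept = [r for r in rels if r["publication"] == publication]
    containers = {barcode}
    changed = True
    while changed:  # repeat until no new container is found
        changed = False
        for r in kept:
            if r["destination"] in containers and r["origin"] not in containers:
                containers.add(r["origin"])
                changed = True
    return sorted(containers - {barcode})
```

Here "species s" contains the specimen under both publications, but its parent genus differs, so the answer depends on which publication is named.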

[Figure: schema fragment showing Name, Circumscription (theCircumscription, theCircAuthor, theCircPublication), Specimen, Author and Publication.]

Summary & Conclusions

- New model (schema) of plant taxonomy defined
  - extensive use of relationships
- Plant taxonomy DBMS implemented using Prometheus DB
  - final stages of testing by taxonomists
  - stores all examples of data provided
  - can answer all queries posed
  - demo via HTTP interface available
  - soon available for download
- Conclusion
  - Explicit relationships in the DB provide ways to improve
    - modelling power & mapping between model and implementation
    - support for graph structures
  - QL support is necessary to profit from relationships
    - increased power of ad hoc querying without being domain specific

Acknowledgements

- Collaborators
  - Dr Mark Watson, Dr Martin Pullan, Dr Mark Newman (Royal Botanic Garden, Edinburgh)
- Funding
  - UK Engineering and Physical Sciences Research Council and Biotechnology and Biological Sciences Research Council - Bioinformatics Initiative

Project page: http://www.dcs.napier.ac.uk/~prometheus

Demo: http://146.176.18.75:8080