Transcript
Page 1: Publishing Germplasm Vocabularies as Linked Data

Publishing germplasm vocabularies as Linked Data

What has already been published?

What may still be needed?

How to do it?

Valeria Pesce (GFAR), Guntram Geser (Salzburg Research), Caterina Caracciolo (FAO), Vassilis Protonotarios (AgroKnow)

This presentation is a part of the 3rd Session of the 1st International e-Conference on Germplasm Data Interoperability https://sites.google.com/site/germplasminteroperability/

Page 2: Publishing Germplasm Vocabularies as Linked Data

“Vocabularies”

Page 3: Publishing Germplasm Vocabularies as Linked Data

Ingredients for describing things

• Metadata elements to describe individual pieces of information in the data sets
– Metadata sets, metadata element sets, vocabularies
• Sets of values for (some of) the metadata elements
– Controlled vocabularies, authority data, value vocabularies, KOS
• They are often both called “vocabularies”

Page 4: Publishing Germplasm Vocabularies as Linked Data

Various flavors of vocabularies

Entity to be described (Type: Bibliographic resource)

Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features…

Page 5: Publishing Germplasm Vocabularies as Linked Data

Various flavors of vocabularies

Entity to be described (Type: Bibliographic resource)

Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features…

“Description vocabularies”:
• Metadata vocabulary for describing bibliographic resources

Page 6: Publishing Germplasm Vocabularies as Linked Data

Various flavors of vocabularies

Entity to be described (Type: Bibliographic resource)

Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features…

“Description vocabularies”:
• Metadata vocabulary for describing bibliographic resources

• KOS: concepts suitable for organizing by Topic
• Controlled list: concepts suitable for organizing by Type

Page 7: Publishing Germplasm Vocabularies as Linked Data

Various flavors of vocabularies

Entity to be described (Type: Bibliographic resource)

Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features…

“Description vocabularies”:
• Metadata vocabulary for describing bibliographic resources

• KOS: concepts suitable for organizing by Topic
• Controlled list: concepts suitable for organizing by Type
• Authority data: data of type Person
• Authority data: data of type Geographic location

Page 8: Publishing Germplasm Vocabularies as Linked Data

Various flavors of vocabularies

Entity to be described (Type: Bibliographic resource)

Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features…

“Description vocabularies”:
• Metadata vocabulary for describing bibliographic resources

“Value vocabularies”:
• KOS: concepts suitable for organizing by Topic
• Controlled list: concepts suitable for organizing by Type
• Authority data: data of type Person
• Authority data: data of type Geographic location

Page 9: Publishing Germplasm Vocabularies as Linked Data

Various flavors of vocabularies

Entity to be described (Type: Bibliographic resource)

Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features…

“Description vocabularies”:
• Metadata vocabulary for describing bibliographic resources
• Metadata vocabulary for describing people
• Metadata vocabulary for describing geographic places

“Value vocabularies”:
• KOS: concepts suitable for organizing by Topic
• Controlled list: concepts suitable for organizing by Type
• Authority data: data of type Person
• Authority data: data of type Geographic location

Ontology

Page 10: Publishing Germplasm Vocabularies as Linked Data

Vocabularies in RDF LOD

• Resource Description Framework (RDF) approach:
– Formalize vocabularies by assigning a Uniform Resource Identifier (URI) to each metadata element and to each concept
– RDF vocabularies have published URIs and published machine-readable semantics, so things described and indexed with RDF vocabularies can be “understood” by machines and automatically discovered
• Linking classes or concepts across vocabularies makes them Linked Open Data (LOD) vocabularies and allows machines to follow semantic linkages across vocabularies and discover more data.
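As a rough illustration of the RDF approach described on this slide, the sketch below gives a metadata element, a concept and a described thing their own URIs and prints the resulting statements in N-Triples syntax. All `example.org` / `example.com` URIs and the names `accessionName`, `acquisitionType`, `donation` and `gift` are hypothetical placeholders, not terms from any published vocabulary; only `skos:closeMatch` is a real SKOS property.

```python
# Hypothetical base URIs, used only for illustration.
VOCAB = "http://example.org/germplasm-vocab/"      # metadata elements (properties)
CONCEPTS = "http://example.org/acquisition-type/"  # concepts of a value vocabulary
DATA = "http://example.org/accession/"             # the things being described
SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"


def triple(subject: str, predicate: str, obj: str, literal: bool = False) -> str:
    """Return one N-Triples statement; obj is a URI unless literal is True."""
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        rendered = f'"{escaped}"'
    else:
        rendered = f"<{obj}>"
    return f"<{subject}> <{predicate}> {rendered} ."


statements = [
    # A metadata element identified by a URI, with a plain literal value.
    triple(DATA + "A001", VOCAB + "accessionName", "Example accession", literal=True),
    # A value taken from a value vocabulary: the concept is itself a URI.
    triple(DATA + "A001", VOCAB + "acquisitionType", CONCEPTS + "donation"),
    # A link from the local concept to a concept in another (hypothetical) vocabulary.
    triple(CONCEPTS + "donation", SKOS_CLOSE_MATCH, "http://example.com/other-vocab/gift"),
]

for statement in statements:
    print(statement)
```

The third statement is the kind of cross-vocabulary link that lets software move from one vocabulary to another.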

Page 11: Publishing Germplasm Vocabularies as Linked Data

The importance of LOD vocabularies

• Data exposed using a LOD vocabulary can, for this reason alone, be considered “Linked Data”
– The first step in publishing Linked Data is therefore to identify or publish suitable LOD vocabularies

• Data mash-ups rely on common and semantically defined classes, properties and concepts identifiable by URIs.

Page 12: Publishing Germplasm Vocabularies as Linked Data

“Vocabularies” for germplasm data

Page 13: Publishing Germplasm Vocabularies as Linked Data

Metadata (1)

Reference standards:

• Multi-crop Passport Descriptors (MCPD) (FAO/Bioversity)
– V.1 2006, V.2 2012
– Data to EURISCO catalogue
• Darwin Core (Biodiversity Information Standards Working Group, TDWG): http://rs.tdwg.org/dwc/
– Includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries.

Page 14: Publishing Germplasm Vocabularies as Linked Data

Metadata (2)

Standard extensions

• The MCPD do not include descriptors for Characterization and Evaluation (C&E) measurements of plant traits/scores
– E.g. morphological and agronomic traits, reaction to biotic and abiotic stresses, resistance to specific pathotypes, grain yield, and protein content
• An initial set of C&E descriptors for the utilization of 22 crops has been developed by Bioversity International together with CGIAR and other research centers
• The DarwinCore Germplasm Extension (Biodiversity TDWG)
– Additional terms to describe germplasm samples
– Maintained by genebanks worldwide
– Modelled starting from the Multi-Crop Passport standard (MCPD, 2001)
– Includes the new terms for crop trait experiments developed as part of the European EPGRIS3 project
– Includes a few additional terms for new international crop treaty regulations

Page 15: Publishing Germplasm Vocabularies as Linked Data

RDF vocabularies for germplasm

• TaxonConcept OWL Ontology
– Written by Peter J. DeVries from 2009 through 2012, based on the earlier GeoSpecies from 2007: http://www.taxonconcept.org/

Biodiversity Information Standards (TDWG)

• Metadata: Darwin Core “SW” ontology in RDF OWL
– Semantic web terms for biodiversity data, based on Darwin Core: http://rs.tdwg.org/dwc/terms/
• DwC-germplasm: already represented in RDF SKOS: http://purl.org/germplasm/
• Much activity around the semantic technologies to express major plant / trait / gene ontologies (this overlaps with KOSs)
– Plant Ontology (explicitly referenced in the DwC-germplasm)
– Gene Ontology
– Trait Ontology
– Phenotypic Quality Ontology

Page 16: Publishing Germplasm Vocabularies as Linked Data

Metadata: Darwin “SW” Core RDF classes

From: http://code.google.com/p/tdwg-rdf/wiki/BiodiversityOntologies

Semantic web terms for biodiversity data, based on Darwin Core

Page 17: Publishing Germplasm Vocabularies as Linked Data

Metadata: Darwin Core RDF model

From: https://code.google.com/p/darwin-sw/

Page 18: Publishing Germplasm Vocabularies as Linked Data

Metadata / KOS: DwC-germplasm extension

From: http://terms.tdwg.org/wiki/Germplasm

Page 19: Publishing Germplasm Vocabularies as Linked Data

KOSs

Authoritative plant names and taxonomies

– Plant Ontology (OBO format; explicitly referenced in the DwC-germplasm): http://www.plantontology.org
– Gene Ontology (RDF and OWL/RDF): http://www.geneontology.org/
– Trait Ontology (OBO format): http://www.gramene.org/db/ontology/search?id=TO:0000387
– Phenotypic Quality Ontology (OBO and OWL): http://obofoundry.org/cgi-bin/detail.cgi?quality

Some of them are already inter-linked

Page 20: Publishing Germplasm Vocabularies as Linked Data

KOSs: value lists

• The DwC-germplasm is mainly a KOS: http://purl.org/germplasm/
– It defines concepts. For example, http://purl.org/germplasm/germplasmType# is a “List of controlled values for some of the germplasm terms”

Page 21: Publishing Germplasm Vocabularies as Linked Data

KOSs: value lists

• When it comes to ranges and controlled sets of values, there are two typical scenarios:
– Ranges of values (numeric or not) that represent a continuum of values (e.g. “From 1 to 10”, “From 10 to 20” etc., or percentages; see table 2)
– Sets of controlled values (e.g. for “acquisition type”, “measurement type”, color and other observed properties)
• The second case can be split further into two cases:
– The values can come from a dedicated controlled list
– The values can come from an established taxonomy, of which, however, only a subset of values is valid for that property

Page 22: Publishing Germplasm Vocabularies as Linked Data

KOSs: value lists

Value lists: examples of allowed values for some C&E properties

• Young shoot: aperture of tip
1=closed, 3=half open, 5=fully open
• Young shoot: intensity of anthocyanin coloration on prostrate hairs of tip
1=none or very low, 3=low, 5=medium, 7=high, 9=very high
• B. Berry color (color of the berry skin)
green, green-grey, green-rose, green-red, green-black, grey, grey-rose, rose, red, red-violet, black, black-red, black-grey
Example: green-rose
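To make the idea of controlled value lists concrete, the following sketch encodes the three example lists from this slide and checks recorded values against them. The allowed values are copied from the slide; the property keys (`young_shoot_aperture_of_tip` and so on) and the function names are hypothetical choices made for this example, not identifiers from any descriptor standard.

```python
# Allowed values transcribed from the slide; the dictionary keys are
# hypothetical property names chosen for this sketch.
SCORED_STATES = {
    "young_shoot_aperture_of_tip": {1: "closed", 3: "half open", 5: "fully open"},
    "young_shoot_anthocyanin_intensity": {
        1: "none or very low", 3: "low", 5: "medium", 7: "high", 9: "very high",
    },
}

BERRY_COLORS = {
    "green", "green-grey", "green-rose", "green-red", "green-black",
    "grey", "grey-rose", "rose", "red", "red-violet",
    "black", "black-red", "black-grey",
}


def decode_score(prop: str, score: int) -> str:
    """Translate a numeric score into its state label, rejecting unknown scores."""
    states = SCORED_STATES[prop]
    if score not in states:
        allowed = sorted(states)
        raise ValueError(f"{score} is not an allowed score for {prop}: {allowed}")
    return states[score]


def is_valid_berry_color(value: str) -> bool:
    """Check a recorded value against the controlled list of berry colors."""
    return value.strip().lower() in BERRY_COLORS


print(decode_score("young_shoot_aperture_of_tip", 3))
print(is_valid_berry_color("green-rose"))
```

Once such a list is published with a URI per value, the same check can be done against the published list instead of a local copy.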

Page 23: Publishing Germplasm Vocabularies as Linked Data

KOSs: value lists

• An interesting task would be the publication of most of these lists as Linked Data, following the example of the Dublin Core Types list: http://dublincore.org/documents/dcmi-type-vocabulary/

• Darwin Core Types: http://rs.tdwg.org/dwc/terms/type-vocabulary/index.htm

Page 24: Publishing Germplasm Vocabularies as Linked Data

KOSs: subsets of published KOSs

• Special case: values for which reference to a published thesaurus is recommended, but only a specific subset of terms is valid for a specific property.
• Thesauri are rarely structured around “facets” (the various properties of entities that can be described by the terms in the thesaurus): they usually have an internal logic that reflects the domain they represent.

Example from the DwC Germplasm extension: values can come from an existing ontology

Page 25: Publishing Germplasm Vocabularies as Linked Data

Which vocabularies for germplasm data need to be published?

Page 26: Publishing Germplasm Vocabularies as Linked Data

How to decide if and what to publish

1. Data set already uses some standard vocabularies published as LOD
– No need to publish new vocabularies
2. Data set uses some local vocabularies
– If they have the same intended meaning as some standard vocabulary and the data owners agree…
– Then, replace the local vocabularies with standard vocabularies (back to case 1)
3. Data set uses some local vocabularies
– If they have the same intended meaning as some standard vocabulary, but the data owners need to keep the local ones…
– Then, publish the local vocabulary and map it to standard vocabularies
4. Data set uses some local vocabularies
– If there is no match or overlap with any standard vocabulary…
– Then, publish the local vocabulary for others to re-use
4b. No existing vocabulary contains properties or concepts that are deemed useful by the community
– The community works on a new vocabulary to extend the existing ones
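The four data-set cases above can be read as a small decision procedure. The sketch below is one possible encoding; the names `Action` and `decide` and the three boolean inputs are assumptions made for this example, and case 4b (a community effort to extend existing vocabularies) is deliberately left out because it is not a decision about a single data set.

```python
from enum import Enum


class Action(Enum):
    REUSE_STANDARD = "no need to publish new vocabularies"                # case 1
    REPLACE_LOCAL = "replace local vocabulary with standard vocabulary"   # case 2
    PUBLISH_AND_MAP = "publish local vocabulary and map it to standards"  # case 3
    PUBLISH_LOCAL = "publish local vocabulary for others to re-use"       # case 4


def decide(uses_standard_lod: bool,
           matches_standard: bool = False,
           owners_accept_replacement: bool = False) -> Action:
    """Map the situation of one data set onto cases 1 to 4."""
    if uses_standard_lod:
        return Action.REUSE_STANDARD
    if matches_standard:
        if owners_accept_replacement:
            return Action.REPLACE_LOCAL
        return Action.PUBLISH_AND_MAP
    return Action.PUBLISH_LOCAL


# A data set with a local vocabulary that matches a standard one,
# whose owners need to keep the local terms (case 3).
print(decide(uses_standard_lod=False, matches_standard=True).value)
```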

Page 27: Publishing Germplasm Vocabularies as Linked Data

What vocabularies to publish for germplasm data?

• Good RDF metadata vocabularies / ontologies exist
– Need to further extend Darwin Core classes and properties?
– Publish an extension to Darwin Core as an RDF or OWL vocabulary (see how later)
• Good domain KOSs exist
– Need to indicate subsets in domain KOSs to be used for specific properties?
a) Work with classification owners to identify subsets
b) Re-publish subsets as SKOS collections linking to concepts in the original KOS, or as Application Profiles
• Only a few value lists have been published (e.g. in DwC-Germplasm or in DwC Types)
– Publish value lists as SKOS
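Re-publishing a subset of an existing KOS as a SKOS collection could look roughly like the sketch below, which writes a `skos:Collection` whose `skos:member` links point at concepts in the original KOS. `skos:Collection`, `skos:prefLabel` and `skos:member` are standard SKOS terms; the collection URI, the label and the member URIs are hypothetical placeholders standing in for a real ontology's concept URIs.

```python
def skos_collection(collection_uri: str, label: str, member_uris: list[str]) -> str:
    """Return a Turtle document describing a SKOS collection of existing concepts."""
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    members = ",\n        ".join(f"<{uri}>" for uri in member_uris)
    return (
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n\n"
        f"<{collection_uri}> a skos:Collection ;\n"
        f'    skos:prefLabel "{safe_label}"@en ;\n'
        f"    skos:member {members} .\n"
    )


# Hypothetical URIs: the members stand for concepts defined in an existing KOS.
turtle = skos_collection(
    "http://example.org/subsets/valid-for-some-property",
    "Terms valid for one specific property",
    [
        "http://example.com/original-kos/concept-1",
        "http://example.com/original-kos/concept-2",
    ],
)
print(turtle)
```

Because the collection only links to the original concepts, the original KOS stays the single place where those concepts are defined.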

Page 28: Publishing Germplasm Vocabularies as Linked Data

Publishing value lists

• Identify the most relevant controlled lists that need to be published

• Check if anything similar has already been published or if some existing lists of values can be extended

• Publish them as LOD, linking to any similar concepts already published in other vocabularies.
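The second and third steps above can be sketched as a comparison between a local value list and a list that has already been published: values with a published counterpart become candidates for linking, and the rest are candidates for new publication. The matching here is a deliberately naive exact comparison of normalized labels, and the `PUBLISHED` mapping with its `example.org` URIs is hypothetical; in practice the candidate links would need review by a domain expert.

```python
def normalize(label: str) -> str:
    """Lower-case a label and treat hyphens and repeated spaces as single spaces."""
    return " ".join(label.lower().replace("-", " ").split())


def plan_publication(local_values: list[str], published: dict[str, str]):
    """Split local values into link candidates and values still to be published.

    published maps the label of an already published concept to its URI.
    """
    index = {normalize(label): uri for label, uri in published.items()}
    links: dict[str, str] = {}
    new_values: list[str] = []
    for value in local_values:
        uri = index.get(normalize(value))
        if uri is None:
            new_values.append(value)
        else:
            links[value] = uri
    return links, new_values


# Hypothetical, already published color list.
PUBLISHED = {
    "Red": "http://example.org/colors/red",
    "Green grey": "http://example.org/colors/green-grey",
}

links, new_values = plan_publication(["red", "green-grey", "black-red"], PUBLISHED)
print(links)
print(new_values)
```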

Page 29: Publishing Germplasm Vocabularies as Linked Data

How to publish new vocabularies as LOD?

Page 30: Publishing Germplasm Vocabularies as Linked Data

LOD guidelines

• The methodologies comply with the Linked Data rules (Berners-Lee, 2006)
• “Use URIs as names for things”
– Concepts / values in value vocabularies, classes and properties in description vocabularies, and the vocabularies themselves have to be identified by URIs.
• “Use HTTP URIs so that people can look up those names”
– The URIs for concepts / values, classes and properties, as well as vocabularies, have to resolve as HTTP URLs.
• “When someone looks up a URI, provide useful information”
– The URLs for concepts, classes and properties, as well as vocabularies, have to return an HTML page with useful information when requested by browsers, or RDF when requested by RDF software; in addition, vocabularies should be available for querying through a SPARQL endpoint.
• “Include links to other URIs, so that more things can be discovered”
– The URIs of concepts, classes and properties should, whenever possible, be linked to URIs in other vocabularies, for instance as a close match of another concept or a sub-class of another class.
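The third rule (HTML for browsers, RDF for RDF software) is usually implemented with HTTP content negotiation on the `Accept` request header. The sketch below shows only the selection logic, under simplifying assumptions: the server offers the three media types in `SUPPORTED`, wildcards other than `*/*` are ignored, and the function names are invented for this example. A production server would rely on its web framework's negotiation support instead.

```python
SUPPORTED = ("text/html", "text/turtle", "application/rdf+xml")


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Return (media_type, q) pairs sorted by descending q, ties in header order."""
    items = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        items.append((media_type, q))
    # sorted() is stable, so equal q values keep their original order.
    return sorted(items, key=lambda item: item[1], reverse=True)


def choose_representation(accept_header: str, default: str = "text/html") -> str:
    """Pick the representation to return for a vocabulary or concept URI."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return default
    return default


# A typical browser header yields HTML; an RDF client asking for Turtle gets Turtle.
print(choose_representation("text/html,application/xhtml+xml,*/*;q=0.8"))
print(choose_representation("text/turtle, application/rdf+xml;q=0.9"))
```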

Page 31: Publishing Germplasm Vocabularies as Linked Data

Metadata vocabularies

• As indicated by the W3C Library Linked Data Incubator Group, metadata element sets are expressed as RDFS (RDF Schema) or OWL (Web Ontology Language) ontologies.
• They define classes and properties used to describe something

Tools (listed in http://linkeddatabook.com/editions/1.0/):
• The Neologism Drupal distribution (open source, easy to use, deployable online and dedicated to the building and online publication of simple RDF vocabularies): http://neologism.deri.ie/
• TopBraid Composer (a powerful commercial modeling environment): http://www.topquadrant.com/products/TB_Composer.html
• Protégé (open-source ontology editor): http://protege.stanford.edu/
• The NeOn Toolkit (open-source ontology engineering environment for networked ontologies): http://neon-toolkit.org/

Heath, Tom and Bizer, Christian (2011). Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool. http://linkeddatabook.com/editions/1.0/

Page 32: Publishing Germplasm Vocabularies as Linked Data

KOSs

• In RDF, KOSs are normally expressed using the SKOS vocabulary.
• They define concepts

Tools:
• The VocBench: a multilingual editing and workflow tool developed by FAO for the management of various types of KOS. It provides functionalities that facilitate both collaborative editing and multilingual terminology: http://aims.fao.org/tools/vocbench-2
• MoKi: an ontology editing tool based on MediaWiki, where concepts can be added, revised, translated and deleted: https://moki.fbk.eu/website/index.php
• SKOSJS: https://github.com/tkurz/skosjs
• Protégé: http://protege.stanford.edu
• TemaTres Controlled Vocabulary server: http://www.vocabularyserver.com
• Commercial tools like PoolParty (http://poolparty.punkt.at/) or TopBraid Enterprise Vocabulary Net (http://www.topquadrant.com/solutions/ent_vocab_net.html)

Page 33: Publishing Germplasm Vocabularies as Linked Data

Thank you

