Publishing Germplasm Vocabularies as Linked Data

DESCRIPTION

What has already been published? What may still be needed? How to do it? This presentation is a part of the 3rd Session of the 1st International e-Conference on Germplasm Data Interoperability https://sites.google.com/site/germplasminteroperability/

Text of Publishing Germplasm Vocabularies as Linked Data

  • 1. Publishing germplasm vocabularies as Linked Data
      What has already been published? What may still be needed? How to do it?
      Valeria Pesce (GFAR), Guntram Geser (Salzburg Research), Caterina Caracciolo (FAO), Vassilis Protonotarios (AgroKnow)
      This presentation is a part of the 3rd Session of the 1st International e-Conference on Germplasm Data Interoperability https://sites.google.com/site/germplasminteroperability/

  • 2. Vocabularies

  • 3. Ingredients for describing things
      - Metadata elements to describe individual pieces of information in the data sets: metadata sets, metadata element sets, vocabularies
      - Sets of values for (some of) the metadata elements: controlled vocabularies, authority data, value vocabularies, KOS
      - Both are often called vocabularies

  • 4. Various flavors of vocabularies
      Entity to be described. Type: Bibliographic resource
      Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features

  • 5. Various flavors of vocabularies
      Entity to be described. Type: Bibliographic resource
      Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features
      - Description vocabularies: metadata vocabulary for describing bibliographic resources

  • 6. Various flavors of vocabularies
      Entity to be described. Type? Bibliographic resource
      Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features
      - KOS: concepts suitable for organizing by topic
      - Controlled list: concepts suitable for organizing by type
      - Description vocabularies: metadata vocabulary for describing bibliographic resources

  • 7. Various flavors of vocabularies
      Entity to be described. Type? Bibliographic resource
      Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features
      - Authority data: data of type Person
      - KOS: concepts suitable for organizing by topic
      - Controlled list: concepts suitable for organizing by type
      - Description vocabularies: metadata vocabulary for describing bibliographic resources
      - Authority data: data of type Geographic location

  • 8. Various flavors of vocabularies
      Value vocabularies
      Entity to be described. Type? Bibliographic resource
      Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features
      - Authority data: data of type Person
      - KOS: concepts suitable for organizing by topic
      - Controlled list: concepts suitable for organizing by type
      - Description vocabularies: metadata vocabulary for describing bibliographic resources
      - Authority data: data of type Geographic location

  • 9. Various flavors of vocabularies
      Value vocabularies
      Entity to be described. Type? Bibliographic resource
      Title, Author(s), Abstract, Subject(s), Publication date, Publication place, Type of document, other features
      - Authority data: data of type Person
      - KOS: concepts suitable for organizing by topic
      - Controlled list: concepts suitable for organizing by type
      - Description vocabularies: metadata vocabulary for describing bibliographic resources
      - Ontology for describing geographic places
      - Authority data: data of type Geographic location
      - Metadata vocabulary for describing people

  • 10. Vocabularies in RDF LOD
      - Resource Description Framework (RDF) approach: formalize vocabularies by assigning a Uniform Resource Identifier (URI) to each metadata element and to each concept
      - RDF vocabularies have published URIs and published machine-readable semantics
      - Things described and indexed with RDF vocabularies can be understood by machines and automatically discovered
      - Linking classes or concepts across vocabularies makes them Linked Open Data (LOD) vocabularies and allows machines to follow semantic linkages across vocabularies and discover more data

  • 11. The importance of LOD vocabularies
      - Data exposed using a LOD vocabulary can for this reason alone be considered Linked Data
      - The first thing to do for publishing Linked Data is identifying or publishing the suitable LOD vocabularies
      - Data mash-ups rely on common and semantically defined classes, properties and concepts identifiable by URIs

  • 12. Vocabularies for germplasm data

  • 13. Metadata (1)
      Reference standards:
      - Multi-crop Passport Descriptors (MCPD) (FAO/Bioversity): V.1 2006, V.2 2012. Data to EURISCO catalogue
      - Darwin Core (Biodiversity Information Standards Working Group, TDWG) http://rs.tdwg.org/dwc/
        Includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries.

  • 14. Metadata (2)
      Standard extensions:
      - The MCPD do not include descriptors for Characterization and Evaluation (C&E) measurements of plant traits/scores, e.g. morphological and agronomic traits as well as reaction to biotic and abiotic stresses: resistance to specific pathotypes, grain yield, and protein content
      - An initial set of C&E descriptors for the utilization of 22 crops has been developed by Bioversity International together with CGIAR and other research centers
      - The DarwinCore Germplasm Extension (Biodiversity TDWG): additional terms to describe germplasm samples maintained by genebanks worldwide
        - Modelled starting from the Multi-Crop Passport standard (MCPD, 2001)
        - Includes the new terms for crop trait experiments developed as part of the European EPGRIS3 project
        - Includes a few additional terms for new international crop treaty regulations

  • 15. RDF vocabularies for germplasm
      - TaxonConcept OWL Ontology: written by Peter J. DeVries from 2009 through 2012, based on the earlier GeoSpecies from 2007: http://www.taxonconcept.org/
      - Biodiversity Information Standards (TDWG). Metadata: Darwin Core SW ontology in RDF OWL. Semantic web terms for biodiversity data, based on Darwin Core: http://rs.tdwg.org/dwc/terms/
      - DwC-germplasm: already represented in RDF SKOS http://purl.org/germplasm/
      - Much activity around the semantic technologies to express major plant / trait / gene ontologies (this overlaps with KOSs): Plant Ontology (explicitly referenced in the DwC-germplasm), Gene Ontology, Trait Ontology, Phenotypic Quality Ontology

  • 16. Metadata: Darwin SW Core RDF classes
      Semantic web terms for biodiversity data, based on Darwin Core
      From: http://code.google.com/p/tdwg-rdf/wiki/BiodiversityOntologies

  • 17. Metadata: Darwin Core RDF model
      From: https://code.google.com/p/darwin-sw/

  • 18. Metadata / KOS: DwC-germplasm extension
      From: http://terms.tdwg.org/wiki/Germplasm

  • 19. KOSs
      Authoritative plant names and taxonomies:
      - Plant Ontology (OBO format) (explicitly referenced in the DwC-germplasm) http://www.plantontology.org
      - Gene Ontology (RDF and OWL/RDF) http://www.geneontology.org/
      - Trait Ontology (OBO format) http://www.gramene.org/db/ontology/search?id=TO:0000387
      - Phenotypic Quality Ontology (OBO and OWL) http://obofoundry.org/cgi-bin/detail.cgi?quality
      Some of them are already inter-linked

  • 20. KOSs: value lists
      - The DwC-germplasm is mainly a KOS http://purl.org/germplasm/
      - It defines concepts. For example, http://purl.org/germplasm/germplasmType# is a list of controlled values for some of the germplasm terms

  • 21. KOSs: value lists
      When it comes to ranges and controlled sets of values, there are two typical scenarios:
      - Ranges of values (numeric or not) that represent a continuum of values (e.g. from 1 to 10, from 10 to 20, etc., or percentages; see table 2)
      - Sets of controlled values (e.g. for acquisition type, measurement type, color and other observed properties)
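The two scenarios above can be sketched as two small validators. This is a minimal illustration and not part of the presentation: the class names `RangeBins` and `ControlledSet`, the bin edges, and the color values are hypothetical. Treating the lower edge of each interval as inclusive is also an assumption, since "from 1 to 10, from 10 to 20" leaves the boundary value open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeBins:
    """Scenario 1 (hypothetical helper): a continuum cut into consecutive intervals."""
    edges: tuple  # ascending edges, e.g. (1, 10, 20) gives 1 to 10 and 10 to 20

    def bin_of(self, value):
        # Return the (low, high) interval holding `value`, or None if out of range.
        if not self.edges[0] <= value <= self.edges[-1]:
            return None
        for low, high in zip(self.edges, self.edges[1:]):
            if value < high:  # assumption: the lower edge is inclusive
                return (low, high)
        # `value` equals the top edge, so it falls in the last interval.
        return (self.edges[-2], self.edges[-1])


@dataclass(frozen=True)
class ControlledSet:
    """Scenario 2 (hypothetical helper): a closed list of allowed values."""
    allowed: frozenset

    def is_valid(self, value):
        return value in self.allowed


score_bins = RangeBins(edges=(1, 10, 20))
skin_color = ControlledSet(allowed=frozenset({"green", "rose", "red", "black"}))

print(score_bins.bin_of(10))        # (10, 20)
print(skin_color.is_valid("blue"))  # False
```

Keeping the two kinds of value constraint as separate types makes it explicit which properties take a binned measurement and which take a term from a closed list.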
      The second case can be split further into two cases:
      - the values can come from a dedicated controlled list
      - the values can come from an established taxonomy, of which, however, only a subset of values is valid for that property

  • 22. KOSs: value lists
      Value lists: examples of allowed values for some C&E properties
      - Young shoot: aperture of tip: 1=closed, 3=half open, 5=fully open
      - Young shoot: intensity of anthocyanin coloration on prostrate hairs of tip: 1=none or very low, 3=low, 5=medium, 7=high, 9=very high
      - B. Berry color. Color of the berry skin: green, green-grey, green-rose, green-red, green-black, grey, grey-rose, rose, red, red-violet, black, black-red, black-grey. Example: green-rose

  • 23. KOSs: value lists
      An interesting task would be the publication of most of these lists as Linked Data, following the example of the Dublin Core Types list.
      - Dublin Core Types: http://dublincore.org/documents/dcmi-type-vocab
      - Darwin Core Types: http://rs.tdwg.org/dwc/terms/type-vocabulary/ind

  • 24. KOSs: subsets of published KOSs
      Special case: values for which reference to a published thesaurus is recommended, but only a specific subset of terms is valid for a specific property.
      Thesauri are rarely structured around facets (or the various properties of entities that can be described by the terms in the thesaurus): they usually have an internal logic that reflects the domain they represent.
      Example from the DwC Germplasm extension: values can come from an existing ontology

  • 25. Which vocabularies for germplasm data need to be published?

  • 26. How to decide if and what to publish
      1. Data set already uses some standard vocabularies published as LOD
         No need to publish new vocabularies
      2. Data set uses some local vocabularies
         If they have the same intended meaning as some standard vocabulary and the data owners agree, then replace the local vocabulary with standard vocabularies (back to case 1)
      3. Data set uses some local vocabularies
         If they have the same intended meaning as some standard vocabulary, but the data owners need to keep the local ones, then publish the local vocabulary and map it to standard vocabularies
      4. Data set uses some local vocabularies
         If there is no matching or overlap with any standard vocabularies, then publish the local vocabulary for others to re-use
      4b. No existing vocabulary contains properties or concepts that are deemed useful by the community
         The community works on a new vocabulary to extend the existing ones

  • 27. What vocabularies to publish for germplasm data?
      Good RDF metadata vo