Describing Datasets with the Health Care and Life Sciences Community Profile

Alasdair Gray
Department of Computer Science, Heriot-Watt University
www.macs.hw.ac.uk/~ajg33/
[email protected]
@gray_alasdair

Michel Dumontier, Stanford University
M. Scott Marshall, Netherlands Cancer Institute
Materials released under CC-BY License
You are free to:
• Share: copy and redistribute the material in any medium or format
• Adapt: remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
• Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
4 December 2016
Outline
• Dataset Descriptions
  • Overview
  • Use Cases
  • Requirements
  • Implementations
  • Existing Vocabularies
• HCLS Community Profile
  • Overview
  • Example
  • Modules
• Hands On
  • Develop your own description
• Validation
  • Overview
  • Demonstration
• Hands On
  • Validate your description
• RDF Dataset Statistics
  • Overview
  • SPARQL Queries
W3C HCLS Group
Dumontier M, Gray AJG, Marshall MS, et al. (2016) The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331 https://doi.org/10.7717/peerj.2331
Which version of ChEMBL?
[Figure: platform architecture diagram. Components: Linked Data API (RDF/XML, TTL, JSON), Semantic Workflow Engine, Data Cache (Triple Store), Identity Resolution Service, Identifier Management Service, Domain Specific Services, Core Platform. Example inputs: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Data source labels: ChEMBL-RDF, ChEMBL v13, Chem2Bio2RDF, SD; further version labels: v13, v12, v2 or v8.]
Which version of ChEMBL?
Open PHACTS Discovery Platform, historic use case (~January 2012)
Open PHACTS v2.1: ChEMBL 20
http://tiny.cc/ops-datasets
Challenges
• Datasets available
  • In many versions over time
  • In different formats
  • From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
  • Can be out-of-date
  • Can contain conflicting information
Scientists require data provenance!
Metadata Requirements
• What is the dataset about?
  • Name and description
  • Content themes
• Who produced the dataset?
  • Publisher details
  • Content author details
• When was the dataset created?
  • Date of creation
  • Date of publication
  • Versioning
• Where can I get the dataset?
  • Licence and rights
  • Download URL
  • SPARQL endpoint
  • Web service
• How was the dataset created?
  • Experimental methodology
  • Post-processing
• Why was the dataset created?
  • Motivation
HCLS Additional Requirements
Standard metadata requirements plus:
1. Resolvable identifiers for metadata
2. Descriptions of data identifiers
3. Data provenance
4. Data statistics
Dublin Core Metadata Initiative
Widely used
Broadly applicable
• Documents
• Datasets
✗ Generic terms
✗ Not comprehensive
✗ No required properties
“Date: A point or period of time associated with an event in the lifecycle of the resource.”
VoID: Vocabulary of Interlinked Datasets
Metadata carried with data
• Directly embedded: void:inDataset
✗ No versioning
✗ No checklist of requisite fields
✗ Only for RDF data
DCAT: Data Catalog
Separates Dataset and Distribution
✗ No versioning
✗ No prescribed properties
Fixed with DCAT-AP
DCAT-AP doesn’t meet use case needs
HCLS Community Profile
http://www.w3.org/TR/hcls-dataset/

HCLS Dataset Descriptions
• 61 metadata properties
  – 5 modules
• 18 vocabularies
  – DCTerms
  – DCAT
  – VoID
  – …
Core Metadata (Title & description)

Element | Property | Value | Summary Level | Version Level | Distribution Level
Type declaration | rdf:type | dctypes:Dataset | MUST | MUST | SHOULD
Type declaration | rdf:type | void:Dataset or dcat:Distribution | MUST NOT | MUST NOT | MUST
Title | dct:title | rdf:langString | MUST | MUST | MUST
Alternative titles | dct:alternative | rdf:langString | MAY | MAY | MAY
Description | dct:description | rdf:langString | MUST | MUST | MUST
Core Metadata (Dates & contributors)

Element | Property | Value | Summary Level | Version Level | Distribution Level
Date created | dct:created | rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype | MUST NOT | SHOULD | SHOULD
Other dates | pav:createdOn or pav:authoredOn or pav:curatedOn | xsd:dateTime, xsd:date, xsd:gYearMonth, or xsd:gYear | MUST NOT | MAY | MAY
Creators | dct:creator | IRI | MUST NOT | MUST | MUST
Contributors | dct:contributor or pav:createdBy or pav:authoredBy or pav:curatedBy | IRI | MUST NOT | MAY | MAY
Date of issue | dct:issued | rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype | MUST NOT | SHOULD | SHOULD
Core Metadata (Publisher and licence)

Element | Property | Value | Summary Level | Version Level | Distribution Level
Publisher | dct:publisher | IRI | MUST | MUST | MUST
HTML page | foaf:page | IRI | SHOULD | SHOULD | SHOULD
Logo | schemaorg:logo | IRI | SHOULD | SHOULD | SHOULD
License | dct:license | IRI | MAY | SHOULD | MUST
Rights | dct:rights | rdf:langString | MAY | MAY | MAY
Core Metadata (Content description)

Element | Property | Value | Summary Level | Version Level | Distribution Level
Keywords | dcat:keyword | xsd:string | MAY | MAY | MAY
Language | dct:language | http://lexvo.org/id/iso639-3/{tag} | MUST NOT | SHOULD | SHOULD
References | dct:references | IRI | MAY | MAY | MAY
Concept descriptors | dcat:theme | IRI of type skos:Concept | MAY | MAY | MAY
Vocabulary used | void:vocabulary | IRI | MUST NOT | MUST NOT | SHOULD
Standards used | dct:conformsTo | IRI | MUST NOT | MAY | SHOULD
Citations | cito:citesAsAuthority | IRI | MAY | MAY | MAY
Related material | rdfs:seeAlso | IRI | MAY | MAY | MAY
Partitions | dct:hasPart | IRI | MAY | MAY | MUST NOT
Identifiers

Element | Property | Value | Summary Level | Version Level | Distribution Level
Preferred prefix | idot:preferredPrefix | xsd:string | MAY | MAY | MAY
Alternate prefix | idot:alternatePrefix | xsd:string | MAY | MAY | MAY
Identifier pattern | idot:identifierPattern | xsd:string | MUST NOT | MUST NOT | MAY
URI pattern | void:uriRegexPattern | xsd:string | MUST NOT | MUST NOT | MAY
File access pattern | idot:accessPattern | idot:AccessPattern | MUST NOT | MUST NOT | MAY
Example identifier | idot:exampleIdentifier | xsd:string | MUST NOT | MUST NOT | SHOULD
Example resource | void:exampleResource | IRI | MUST NOT | MUST NOT | SHOULD
Provenance and Change (Versioning)

Element | Property | Value | Summary Level | Version Level | Distribution Level
Version identifier | pav:version | xsd:string | MUST NOT | MUST | SHOULD
Version linking | dct:isVersionOf | IRI | MUST NOT | MUST | MUST NOT
Version linking | pav:previousVersion | IRI | MUST NOT | SHOULD | SHOULD
Version linking | pav:hasCurrentVersion | IRI | MAY | MUST NOT | MUST NOT
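The versioning properties are what tie the summary, version and distribution descriptions together. The following is a minimal Python sketch of that linking, using plain tuples rather than an RDF library. The `ex:` IRIs and the version string "1" are hypothetical placeholders, not values from the slides; the property-to-level assignments follow the profile tables (dcat:distribution is listed under Availability and Distributions).

```python
# Hypothetical IRIs for illustration only; they do not come from the slides.
SUMMARY = "ex:dataset"
VERSION = "ex:dataset-v1"
DISTRIBUTION = "ex:dataset-v1.ttl"


def link_levels(summary, version, distribution, version_id):
    """Return the triples that tie the three description levels together."""
    return [
        # Summary level: MAY point at the current version.
        (summary, "pav:hasCurrentVersion", version),
        # Version level: MUST carry a version identifier.
        (version, "pav:version", f'"{version_id}"'),
        # Version level: MUST link back to the summary description.
        (version, "dct:isVersionOf", summary),
        # Version level: SHOULD list its distributions.
        (version, "dcat:distribution", distribution),
    ]


def to_turtle(triples):
    """Serialise triples one per line in a Turtle-like form (prefixes omitted)."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = link_levels(SUMMARY, VERSION, DISTRIBUTION, "1")
print(to_turtle(triples))
```

Note that the summary description carries no version identifier or dates of its own; those belong on the version description it points to.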
Provenance and Change

Element | Property | Value | Summary Level | Version Level | Distribution Level
Data source provenance | dct:source or pav:retrievedFrom or prov:wasDerivedFrom | IRI | MUST NOT | SHOULD | SHOULD
Item listing | sio:has-data-item | IRI | MUST NOT | MUST NOT | MAY
Creation tool | pav:createdWith | IRI | MUST NOT | SHOULD | SHOULD
Update frequency | dct:accrualPeriodicity | IRI of type dctypes:Frequency | SHOULD | MUST NOT | MUST NOT
Availability and Distributions

Element | Property | Value | Summary Level | Version Level | Distribution Level
Distribution description | dcat:distribution | IRI of Distribution Level description | MUST NOT | SHOULD | MUST NOT
File format | dct:format | IRI or xsd:string | MUST NOT | MUST NOT | MUST
File directory | dcat:accessURL | IRI | MAY | MAY | MAY
File URL | dcat:downloadURL | IRI | MUST NOT | MUST NOT | SHOULD
Byte size | dcat:byteSize | xsd:decimal | MUST NOT | MUST NOT | SHOULD
RDF File URL | void:dataDump | IRI | MUST NOT | MUST NOT | SHOULD
SPARQL endpoint | void:sparqlEndpoint | IRI | SHOULD | SHOULD NOT | SHOULD NOT
Documentation | dcat:landingPage | IRI | MUST NOT | MAY | MAY
Linkset | void:subset | IRI | MUST NOT | MUST NOT | SHOULD
Statistics (RDF Core)

Element | Property | Value | Summary Level | Version Level | Distribution Level
# of triples | void:triples | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of typed entities | void:entities | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of subjects | void:distinctSubjects | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of properties | void:properties | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of objects | void:distinctObjects | xsd:integer | MUST NOT | MUST NOT | SHOULD
# of classes | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
# of literals | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
# of RDF graphs | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD
Statistics (RDF Complete)

Element | Property | Value | Summary Level | Version Level | Distribution Level
class frequency | void:classPartition | IRI | MUST NOT | MUST NOT | MAY
property frequency | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
property and subject types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
property and object types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
property and literals | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
property subject and object types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY
Hands on
Create your own dataset description
If you don’t have your own, pick one of the following two

Validation Service
Andrew Beveridge, Jacob Baungard Hansen, Johnny Val, Leif Gehrmann, Roisin Farmer, Sunil Khutan, Tomas Robertson
Example Constraint
• Shape
• A Dataset Summary
  • MUST be declared to be of type dctypes:Dataset
  • MUST have a dcterms:title as a language-typed string
  • MUST NOT have a dcterms:created date
[Figure: <Dataset> shape diagram; labels: rdf:langString, ✗]
Dates are associated with versions in HCLS
Example Validation
• Shape
• Data
Valid

Example Validation
• Shape
• Data
Not Valid

Example Validation
• Shape
• Data
Valid

Example Validation (Closed Shape)
• Shape
• Data
Not Valid
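The difference between validating against an open and a closed shape can be sketched in a few lines of Python. This is an illustrative sketch only, not the validation service's implementation: it checks property presence for the simplified Dataset Summary constraint (type, title, no creation date), ignores value typing such as rdf:langString, and represents a description as a hypothetical dict from property name to a list of values.

```python
# Properties named by the simplified <Dataset> summary constraint.
REQUIRED = {"rdf:type", "dct:title"}
OPTIONAL = {"dct:alternative"}
FORBIDDEN = {"dct:created"}


def validate(description, closed=False):
    """Return a list of problems; an empty list means the description is valid."""
    problems = []
    present = set(description)
    for prop in sorted(REQUIRED - present):
        problems.append(f"missing {prop}")
    for prop in sorted(FORBIDDEN & present):
        problems.append(f"forbidden {prop}")
    if "rdf:type" in description and "dctypes:Dataset" not in description["rdf:type"]:
        problems.append("rdf:type must include dctypes:Dataset")
    if closed:
        # A closed shape rejects any property the shape does not mention.
        for prop in sorted(present - REQUIRED - OPTIONAL - FORBIDDEN):
            problems.append(f"unexpected {prop} (closed shape)")
    return problems


# Hypothetical example data, not taken from the slides.
good = {"rdf:type": ["dctypes:Dataset"], "dct:title": ['"ChEMBL"@en']}
dated = {**good, "dct:created": ['"2016-12-04"^^xsd:date']}
extra = {**good, "rdfs:seeAlso": ["ex:related"]}

print(validate(good))                # valid
print(validate(dated))               # not valid: creation date on a summary
print(validate(extra))               # valid against the open shape
print(validate(extra, closed=True))  # not valid against the closed shape
```

The same data can therefore pass an open shape and fail a closed one; the only difference is whether properties outside the shape are tolerated.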
Shape Expressions (ShEx)
Shape
<Dataset> {
  rdf:type (dctypes:Dataset),
  dct:title rdf:langString,
  dct:alternative rdf:langString+,
  !dct:created .
}
ShEx Validation
Shape
<Dataset> {
  rdf:type (dctypes:Dataset),
  dct:title rdf:langString,
  dct:alternative rdf:langString+,
  !dct:created .
}
Example data
Validator can’t warn of missing property
Requirement Levels
Shape
<Dataset> {
  `MUST` rdf:type (dctypes:Dataset),
  `MUST` dct:title rdf:langString,
  `MAY` dct:alternative rdf:langString+,
  `MUST` !dct:created .
}
Validator can warn of missing property
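To illustrate how requirement levels let a validator distinguish errors from warnings, here is a small table-driven Python sketch. It is an assumption about how such a check could be organised, not the behaviour of the actual validator: the mapping holds a handful of summary-level rows from the profile tables, and the choice to treat SHOULD and SHOULD NOT as warnings is ours.

```python
# A few summary-level requirement levels, copied from the profile tables.
SUMMARY_LEVELS = {
    "rdf:type": "MUST",
    "dct:title": "MUST",
    "dct:description": "MUST",
    "dct:publisher": "MUST",
    "foaf:page": "SHOULD",
    "dct:alternative": "MAY",
    "dct:created": "MUST NOT",
}


def check_levels(properties, levels=SUMMARY_LEVELS):
    """Split findings into errors (MUST, MUST NOT) and warnings (SHOULD, SHOULD NOT)."""
    present = set(properties)
    errors, warnings = [], []
    for prop, level in sorted(levels.items()):
        found = prop in present
        if level == "MUST" and not found:
            errors.append(f"{prop} is required")
        elif level == "MUST NOT" and found:
            errors.append(f"{prop} is not allowed")
        elif level == "SHOULD" and not found:
            warnings.append(f"{prop} is recommended")
        elif level == "SHOULD NOT" and found:
            warnings.append(f"{prop} is discouraged")
        # MAY never produces a finding.
    return errors, warnings


# Hypothetical summary description that has a creation date but no publisher.
errors, warnings = check_levels(
    {"rdf:type", "dct:title", "dct:description", "dct:created"}
)
print(errors)
print(warnings)
```

Keeping the levels in a lookup table means changing a requirement level is a data edit rather than a code change.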
http://www.w3.org/2015/03/ShExValidata/
• Try the ChEMBL example
• Play with Options
  • Add resources
  • Switch between open/closed shapes
  • Change requirement level
• Check your own work
• Try descriptions from
  • Riken Metadatabase: http://metadb.riken.jp/archives/HCLSProfile/HCLSProfile_MetaDB_ALL.ttl
  • PHI-Base: https://www.dropbox.com/s/tghogwagn962tmi/Metadata.ttl?dl=0
Why provide rich dataset descriptions?
• Support
  • Endpoint exploration
  • Query writing
• Eliminates expensive exploratory queries
Generate with SPARQL queries
• Number of triples
• Subject-types related to Object-types
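The queries shown on the slide did not survive extraction, so the following is only a sketch of what the two statistics compute. It works over a small in-memory list of triples in plain Python; the `ex:` data is hypothetical, and the SPARQL in the comments is our approximation of an equivalent query, not a copy of the slide or of the profile text.

```python
from collections import Counter

# Tiny hypothetical graph as (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:aspirin", "rdf:type", "ex:Compound"),
    ("ex:cox1", "rdf:type", "ex:Target"),
    ("ex:cox2", "rdf:type", "ex:Target"),
    ("ex:aspirin", "ex:inhibits", "ex:cox1"),
    ("ex:aspirin", "ex:inhibits", "ex:cox2"),
]


def count_triples(triples):
    """Number of triples (void:triples).

    Approximate SPARQL: SELECT (COUNT(*) AS ?triples) { ?s ?p ?o }
    """
    return len(set(triples))  # an RDF graph is a set, so drop duplicates


def type_links(triples):
    """Count links between subject types and object types, per property.

    Approximate SPARQL:
      SELECT ?stype ?p ?otype (COUNT(*) AS ?n)
      { ?s ?p ?o . ?s a ?stype . ?o a ?otype }
      GROUP BY ?stype ?p ?otype
    """
    graph = set(triples)
    types = {}
    for s, p, o in graph:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    counts = Counter()
    for s, p, o in graph:
        for stype in types.get(s, ()):
            for otype in types.get(o, ()):
                counts[(stype, p, otype)] += 1
    return counts


print(count_triples(TRIPLES))
print(dict(type_links(TRIPLES)))
```

Publishing these numbers in the dataset description spares users from running the corresponding aggregate queries against the endpoint themselves.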
HCLS Dataset Descriptions
https://www.w3.org/TR/hcls-dataset/
Dumontier M, Gray AJG, Marshall MS, et al. (2016) The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331 https://doi.org/10.7717/peerj.2331
[email protected] @gray_alasdair