Tutorial: Describing Datasets with the Health Care and Life Sciences Community Profile

Alasdair Gray
Department of Computer Science, Heriot-Watt University
www.macs.hw.ac.uk/~ajg33/
[email protected]
@gray_alasdair

Michel Dumontier
Stanford University

M. Scott Marshall
Netherlands Cancer Institute

Alasdair Gray @gray_alasdair 2

Materials released under CC-BY License

You are free to:
• Share: copy and redistribute the material in any medium or format
• Adapt: remix, transform, and build upon the material for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.

Under the following terms:
• Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

4 December 2016

Outline
• Dataset Descriptions
  • Overview
  • Use Cases
  • Requirements
  • Implementations
  • Existing Vocabularies
• HCLS Community Profile
  • Overview
  • Example
  • Modules
• Hands On
  • Develop your own description
• Validation
  • Overview
  • Demonstration
• Hands On
  • Validate your description
• RDF Dataset Statistics
  • Overview
  • SPARQL Queries

W3C HCLS Group

Dumontier M, Gray AJG, Marshall MS, et al. (2016) The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331 https://doi.org/10.7717/peerj.2331

Use Cases

Overview of a couple of examples

FAIR Data Principles

FAIR Data

Open PHACTS Drug Discovery Platform

[Figure: Open PHACTS platform architecture. Core Platform components: Data Cache (Triple Store), Semantic Workflow Engine, Linked Data API (RDF/XML, TTL, JSON), Domain Specific Services, Identity Resolution Service, Identifier Management Service. Example inputs: “Adenosine receptor 2a”, EC2.43.4, CS4532, P12374. Source datasets: ChEMBL-RDF, ChEMBL v13, Chem2Bio2RDF, SD, with version labels v13, v12, and v2 or v8.]

Which version of ChEMBL?

[Figure: Open PHACTS platform architecture, with the source datasets ChEMBL-RDF, ChEMBL v13, Chem2Bio2RDF and SD carrying the version labels v13, v12, and v2 or v8]

Open PHACTS Discovery Platform: historic use case, ~January 2012

Open PHACTS v2.1: ChEMBL 20

http://tiny.cc/ops-datasets

Which version of ChEMBL?

Challenges
• Datasets available
  • In many versions over time
  • In different formats
  • From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
  • Can be out-of-date
  • Can contain conflicting information

Scientists require data provenance!

Requirements

Metadata Requirements
• What is the dataset about?
  • Name and description
  • Content themes
• Who produced the dataset?
  • Publisher details
  • Content author details
• When was the dataset created?
  • Date of creation
  • Date of publication
  • Versioning
• Where can I get the dataset?
  • Licence and rights
  • Download URL
  • SPARQL endpoint
  • Web service
• How was the dataset created?
  • Experimental methodology
  • Post-processing
• Why was the dataset created?
  • Motivation

HCLS Additional Requirements

Standard metadata requirements plus:

1. Resolvable identifiers for metadata

2. Descriptions of data identifiers

3. Data provenance

4. Data statistics

HCLS Implementations

RDF Platform

More coming…

Existing Vocabularies

Dublin Core Metadata Initiative
Widely used
Broadly applicable
• Documents
• Datasets
✗ Generic terms
✗ Not comprehensive
✗ No required properties

“Date: A point or period of time associated with an event in the lifecycle of the resource.”

VoID: Vocabulary of Interlinked Datasets
Metadata carried with data
• Directly embedded: void:inDataset
✗ No versioning
✗ No checklist of requisite fields
✗ Only for RDF data

DCAT: Data Catalog
Separates Dataset and Distribution
✗ No versioning
✗ No prescribed properties

Fixed with DCAT-AP

DCAT-AP doesn’t meet use case needs

HCLS Community Profile
http://www.w3.org/TR/hcls-dataset/

HCLS Dataset Descriptions
• 61 metadata properties
  – 5 modules
• 18 vocabularies
  – DCTerms
  – DCAT
  – VoID
  – …

ChEMBL Summary Level Description

ChEMBL 17 Version Level Description

ChEMBL 17 DB Distribution

ChEMBL 17 RDF Distribution

Description Modules

Core Metadata (Title & description)

| Element | Property | Value | Summary Level | Version Level | Distribution Level |
|---|---|---|---|---|---|
| Type declaration | rdf:type | dctypes:Dataset | MUST | MUST | SHOULD |
| Type declaration | rdf:type | void:Dataset or dcat:Distribution | MUST NOT | MUST NOT | MUST |
| Title | dct:title | rdf:langString | MUST | MUST | MUST |
| Alternative titles | dct:alternative | rdf:langString | MAY | MAY | MAY |
| Description | dct:description | rdf:langString | MUST | MUST | MUST |

Core Metadata (Dates & contributors)

| Element | Property | Value | Summary Level | Version Level | Distribution Level |
|---|---|---|---|---|---|
| Date created | dct:created | rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype | MUST NOT | SHOULD | SHOULD |
| Other dates | pav:createdOn or pav:authoredOn or pav:curatedOn | xsd:dateTime, xsd:date, xsd:gYearMonth, or xsd:gYear | MUST NOT | MAY | MAY |
| Creators | dct:creator | IRI | MUST NOT | MUST | MUST |
| Contributors | dct:contributor or pav:createdBy or pav:authoredBy or pav:curatedBy | IRI | MUST NOT | MAY | MAY |
| Date of issue | dct:issued | rdfs:Literal encoded using the relevant ISO 8601 Date and Time compliant string and typed using the appropriate XML Schema datatype | MUST NOT | SHOULD | SHOULD |
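The date rows ask for an ISO 8601 string typed with the appropriate XML Schema datatype. A minimal sketch of how that choice could be automated is shown here; the function names are hypothetical, and the patterns are simplified (no timezone suffix on dates or years, no range checks on months or days).

```python
import re

# Simplified lexical patterns for the four XML Schema date datatypes named
# in the table. Assumption: timezone suffixes are only handled for dateTime.
_PATTERNS = [
    ("xsd:dateTime",
     re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?")),
    ("xsd:date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("xsd:gYearMonth", re.compile(r"\d{4}-\d{2}")),
    ("xsd:gYear", re.compile(r"\d{4}")),
]


def xsd_date_datatype(value):
    """Return the XSD datatype whose lexical form matches the whole string."""
    for datatype, pattern in _PATTERNS:
        if pattern.fullmatch(value):
            return datatype
    raise ValueError(f"not a supported ISO 8601 date form: {value!r}")


def typed_literal(value):
    """Render the value as a Turtle-style typed literal."""
    return f'"{value}"^^{xsd_date_datatype(value)}'


print(typed_literal("2013-08-29"))
print(typed_literal("2013"))
```

A string such as `29/08/2013` raises `ValueError`, which is the point: the profile expects the ISO 8601 form, so other formats should be converted before the description is written.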

Core Metadata (Publisher and licence)

| Element | Property | Value | Summary Level | Version Level | Distribution Level |
|---|---|---|---|---|---|
| Publisher | dct:publisher | IRI | MUST | MUST | MUST |
| HTML page | foaf:page | IRI | SHOULD | SHOULD | SHOULD |
| Logo | schemaorg:logo | IRI | SHOULD | SHOULD | SHOULD |
| License | dct:license | IRI | MAY | SHOULD | MUST |
| Rights | dct:rights | rdf:langString | MAY | MAY | MAY |

Core Metadata (Content description)

| Element | Property | Value | Summary Level | Version Level | Distribution Level |
|---|---|---|---|---|---|
| Keywords | dcat:keyword | xsd:string | MAY | MAY | MAY |
| Language | dct:language | http://lexvo.org/id/iso639-3/{tag} | MUST NOT | SHOULD | SHOULD |
| References | dct:references | IRI | MAY | MAY | MAY |
| Concept descriptors | dcat:theme | IRI of type skos:Concept | MAY | MAY | MAY |
| Vocabulary used | void:vocabulary | IRI | MUST NOT | MUST NOT | SHOULD |
| Standards used | dct:conformsTo | IRI | MUST NOT | MAY | SHOULD |
| Citations | cito:citesAsAuthority | IRI | MAY | MAY | MAY |
| Related material | rdfs:seeAlso | IRI | MAY | MAY | MAY |
| Partitions | dct:hasPart | IRI | MAY | MAY | MUST NOT |

Identifiers

| Element | Property | Value | Summary Level | Version Level | Distribution Level |
|---|---|---|---|---|---|
| Preferred prefix | idot:preferredPrefix | xsd:string | MAY | MAY | MAY |
| Alternate prefix | idot:alternatePrefix | xsd:string | MAY | MAY | MAY |
| Identifier pattern | idot:identifierPattern | xsd:string | MUST NOT | MUST NOT | MAY |
| URI pattern | void:uriRegexPattern | xsd:string | MUST NOT | MUST NOT | MAY |
| File access pattern | idot:accessPattern | idot:AccessPattern | MUST NOT | MUST NOT | MAY |
| Example identifier | idot:exampleIdentifier | xsd:string | MUST NOT | MUST NOT | SHOULD |
| Example resource | void:exampleResource | IRI | MUST NOT | MUST NOT | SHOULD |
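The identifier and URI patterns are regular expressions, so a description can be sanity-checked by testing its own example identifier and example resource against them. The sketch below assumes illustrative ChEMBL-style values; the patterns, the example identifier, and the URI are assumptions for the example, not values taken from a published description.

```python
import re

# Hypothetical distribution-level identifier metadata (values are illustrative).
description = {
    "idot:identifierPattern": r"CHEMBL\d+",
    "void:uriRegexPattern":
        r"http://rdf\.ebi\.ac\.uk/resource/chembl/molecule/CHEMBL\d+",
    "idot:exampleIdentifier": "CHEMBL25",
    "void:exampleResource":
        "http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL25",
}


def examples_match_patterns(desc):
    """Check that the example identifier and resource fit their declared patterns."""
    id_ok = re.fullmatch(desc["idot:identifierPattern"],
                         desc["idot:exampleIdentifier"]) is not None
    uri_ok = re.fullmatch(desc["void:uriRegexPattern"],
                          desc["void:exampleResource"]) is not None
    return id_ok and uri_ok


print(examples_match_patterns(description))
```

A mismatch usually means either the pattern or the example is out of date, which is worth catching before the description is published.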

Provenance and Change (Versioning)

| Element | Property | Value | Summary Level | Version Level | Distribution Level |
|---|---|---|---|---|---|
| Version identifier | pav:version | xsd:string | MUST NOT | MUST | SHOULD |
| Version linking | dct:isVersionOf | IRI | MUST NOT | MUST | MUST NOT |
| Version linking | pav:previousVersion | IRI | MUST NOT | SHOULD | SHOULD |
| Version linking | pav:hasCurrentVersion | IRI | MAY | MUST NOT | MUST NOT |
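The versioning properties are what tie the three description levels together. As a rough sketch, the links for a ChEMBL 17 description might be modelled as below; the resource names (`:chembl`, `:chembl17`, and so on) are hypothetical, and prefix declarations are omitted from the serialised output.

```python
# Hypothetical resource names for the three description levels.
SUMMARY = ":chembl"
VERSION = ":chembl17"
PREVIOUS = ":chembl16"
DISTRIBUTION = ":chembl17rdf"

# Version linking as (subject, property, object) triples.
triples = [
    (SUMMARY, "pav:hasCurrentVersion", VERSION),       # summary level: MAY
    (VERSION, "pav:version", '"17"'),                  # version level: MUST
    (VERSION, "dct:isVersionOf", SUMMARY),             # version level: MUST
    (VERSION, "pav:previousVersion", PREVIOUS),        # version level: SHOULD
    (VERSION, "dcat:distribution", DISTRIBUTION),      # version level: SHOULD
]


def to_turtle(triples):
    """Serialise the triples one statement per line (prefixes not declared)."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def current_distributions(triples, summary):
    """Follow summary -> current version -> distributions."""
    versions = {o for s, p, o in triples
                if s == summary and p == "pav:hasCurrentVersion"}
    return [o for s, p, o in triples
            if s in versions and p == "dcat:distribution"]


print(to_turtle(triples))
print(current_distributions(triples, SUMMARY))
```

Starting from the summary level and following `pav:hasCurrentVersion` is how a client could answer the "which version of ChEMBL?" question without guessing from file names.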

Provenance and Change

| Element | Property | Value | Summary Level | Version Level | Distribution Level |
|---|---|---|---|---|---|
| Data source provenance | dct:source or pav:retrievedFrom or prov:wasDerivedFrom | IRI | MUST NOT | SHOULD | SHOULD |
| Item listing | sio:has-data-item | IRI | MUST NOT | MUST NOT | MAY |
| Creation tool | pav:createdWith | IRI | MUST NOT | SHOULD | SHOULD |
| Update frequency | dct:accrualPeriodicity | IRI of type dctypes:Frequency | SHOULD | MUST NOT | MUST NOT |

Availability and Distributions

| Element | Property | Value | Summary Level | Version Level | Distribution Level |
|---|---|---|---|---|---|
| Distribution description | dcat:distribution | IRI of Distribution Level description | MUST NOT | SHOULD | MUST NOT |
| File format | dct:format | IRI or xsd:string | MUST NOT | MUST NOT | MUST |
| File directory | dcat:accessURL | IRI | MAY | MAY | MAY |
| File URL | dcat:downloadURL | IRI | MUST NOT | MUST NOT | SHOULD |
| Byte size | dcat:byteSize | xsd:decimal | MUST NOT | MUST NOT | SHOULD |
| RDF File URL | void:dataDump | IRI | MUST NOT | MUST NOT | SHOULD |
| SPARQL endpoint | void:sparqlEndpoint | IRI | SHOULD | SHOULD NOT | SHOULD NOT |
| Documentation | dcat:landingPage | IRI | MUST NOT | MAY | MAY |
| Linkset | void:subset | IRI | MUST NOT | MUST NOT | SHOULD |

Statistics (RDF Core)

| Element | Property | Value | Summary Level | Version Level | Distribution Level |
|---|---|---|---|---|---|
| # of triples | void:triples | xsd:integer | MUST NOT | MUST NOT | SHOULD |
| # of typed entities | void:entities | xsd:integer | MUST NOT | MUST NOT | SHOULD |
| # of subjects | void:distinctSubjects | xsd:integer | MUST NOT | MUST NOT | SHOULD |
| # of properties | void:properties | xsd:integer | MUST NOT | MUST NOT | SHOULD |
| # of objects | void:distinctObjects | xsd:integer | MUST NOT | MUST NOT | SHOULD |
| # of classes | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD |
| # of literals | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD |
| # of RDF graphs | void:classPartition | IRI | MUST NOT | MUST NOT | SHOULD |

Statistics (RDF Complete)

| Element | Property | Value | Summary Level | Version Level | Distribution Level |
|---|---|---|---|---|---|
| class frequency | void:classPartition | IRI | MUST NOT | MUST NOT | MAY |
| property frequency | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY |
| property and subject types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY |
| property and object types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY |
| property and literals | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY |
| property subject and object types | void:propertyPartition | IRI | MUST NOT | MUST NOT | MAY |

Hands on: Create your own dataset description

If you don’t have your own dataset, pick one of the following two:
• PHI-Base: http://www.phi-base.org/
• Guide to Pharmacology: http://www.guidetopharmacology.org/

Validation Service

Andrew Beveridge, Jacob Baungard Hansen, Johnny Val, Leif Gehrmann, Roisin Farmer, Sunil Khutan, Tomas Robertson

Example Constraint

Shape: a Dataset Summary
• MUST be declared to be of type dctypes:Dataset
• MUST have a dcterms:title as a language-typed string
• MUST NOT have a dcterms:created date

Dates are associated with versions in HCLS

[Figure: shape diagram for <Dataset>, with nodes rdf:langString, “.” (any value) and ✗ (forbidden)]
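The three constraints on a Dataset Summary can be sketched as a small check. This is not how the validation service is implemented; it only illustrates the rules, with a hypothetical in-memory representation where each property maps to a list of values and language-tagged strings are `(text, language)` tuples.

```python
def check_dataset_summary(description):
    """Return the violated constraints for a summary-level description.

    `description` maps a property name to a list of values. Language-tagged
    strings are (text, language) tuples; other values are plain strings.
    """
    errors = []
    if "dctypes:Dataset" not in description.get("rdf:type", []):
        errors.append("MUST be of type dctypes:Dataset")
    titles = description.get("dct:title", [])
    has_lang_title = bool(titles) and all(
        isinstance(t, tuple) and len(t) == 2 and t[1] for t in titles
    )
    if not has_lang_title:
        errors.append("MUST have a language-tagged dct:title")
    if description.get("dct:created"):
        errors.append("MUST NOT have a dct:created date")
    return errors


# Illustrative data: a valid summary, and the same summary with a date added.
valid = {"rdf:type": ["dctypes:Dataset"], "dct:title": [("ChEMBL", "en")]}
dated = {**valid, "dct:created": ["2013-08-29"]}

print(check_dataset_summary(valid))
print(check_dataset_summary(dated))
```

The second description fails only because of the creation date, matching the rule that dates belong on version-level descriptions.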

Example Validation
[Figure: <Dataset> shape checked against example data] Result: Valid

Example Validation
[Figure: <Dataset> shape checked against example data] Result: Not Valid

Example Validation
[Figure: <Dataset> shape checked against example data] Result: Valid

Example Validation (Closed Shape)
[Figure: <Dataset> shape checked against example data] Result: Not Valid
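The difference between open and closed shapes can be illustrated with a short sketch. It assumes the usual ShEx meaning of a closed shape, namely that properties not mentioned by the shape are disallowed; the data and the helper name are hypothetical.

```python
# Properties mentioned by the <Dataset> shape. dct:created is mentioned, but
# only to forbid it, so its presence is reported by that rule rather than by
# closedness.
SHAPE_PROPERTIES = {"rdf:type", "dct:title", "dct:alternative", "dct:created"}


def unexpected_properties(description, closed):
    """List properties that a closed shape rejects; an open shape ignores them."""
    if not closed:
        return []
    return sorted(set(description) - SHAPE_PROPERTIES)


# Illustrative data carrying one property the shape does not mention.
data = {
    "rdf:type": ["dctypes:Dataset"],
    "dct:title": [("ChEMBL", "en")],
    "dct:publisher": ["http://www.ebi.ac.uk/"],
}

print(unexpected_properties(data, closed=False))
print(unexpected_properties(data, closed=True))
```

Open shapes suit descriptions that legitimately carry extra metadata; closed shapes are useful for catching misspelt or misplaced properties.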

Shape Expressions (ShEx)

Shape:

<Dataset> {
  rdf:type (dctypes:Dataset),
  dct:title rdf:langString,
  dct:alternative rdf:langString+,
  !dct:created .
}

ShEx Validation

<Dataset> {
  rdf:type (dctypes:Dataset),
  dct:title rdf:langString,
  dct:alternative rdf:langString+,
  !dct:created .
}

[Figure: example data checked against the shape]

Validator can’t warn of missing property

Requirement Levels

Shape:

<Dataset> {
  `MUST` rdf:type (dctypes:Dataset),
  `MUST` dct:title rdf:langString,
  `MAY` dct:alternative rdf:langString+,
  `MUST` !dct:created .
}

Validator can warn of missing property
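Annotating each constraint with a requirement level lets a validator separate hard failures from advice. The sketch below is an assumption about how that might work, not the Validata implementation: MUST and MUST NOT violations become errors, a missing SHOULD property becomes a warning, and MAY properties are never reported. The `foaf:page` entry is added from the publisher table to show a SHOULD case.

```python
# Requirement level per property for a summary-level description.
SUMMARY_SHAPE = {
    "rdf:type": "MUST",
    "dct:title": "MUST",
    "dct:alternative": "MAY",
    "dct:created": "MUST NOT",
    "foaf:page": "SHOULD",
}


def validate(description, shape):
    """Return (errors, warnings); only the presence of a property is checked."""
    errors, warnings = [], []
    for prop, level in shape.items():
        present = bool(description.get(prop))
        if level == "MUST" and not present:
            errors.append(f"missing {prop}")
        elif level == "MUST NOT" and present:
            errors.append(f"forbidden {prop}")
        elif level == "SHOULD" and not present:
            warnings.append(f"missing {prop}")
    return errors, warnings


# Illustrative data: valid, but without the recommended HTML page.
summary = {"rdf:type": ["dctypes:Dataset"], "dct:title": [("ChEMBL", "en")]}
print(validate(summary, SUMMARY_SHAPE))
```

Keeping errors and warnings in separate lists lets a caller decide how strict to be, for example treating warnings as failures when preparing a description for publication.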

Validata Demo
http://www.w3.org/2015/03/ShExValidata/

Validator Hands-On
http://www.w3.org/2015/03/ShExValidata/
• Try the ChEMBL example
• Play with Options
  • Add resources
  • Switch between open/closed shapes
  • Change requirement level
• Check your own work
• Try descriptions from
  • Riken Metadatabase: http://metadb.riken.jp/archives/HCLSProfile/HCLSProfile_MetaDB_ALL.ttl
  • PHI-Base: https://www.dropbox.com/s/tghogwagn962tmi/Metadata.ttl?dl=0

RDF Dataset Statistics


Why provide rich dataset descriptions?

• Support
  • Endpoint exploration
  • Query writing
• Eliminates expensive exploratory queries

Generate with SPARQL queries
• Number of triples
• Subject-types related to Object-types
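To make the two statistics concrete, the sketch below computes them in plain Python over a handful of in-memory triples. It mirrors what the SPARQL queries would return from an endpoint rather than reproducing the queries themselves; the data and function names are hypothetical.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def core_statistics(triples):
    """Compute a few RDF Core statistics over (subject, property, object) triples.

    SPARQL counterpart for the first value:
    SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }
    """
    graph = set(triples)  # an RDF graph is a set, so duplicates count once
    return {
        "void:triples": len(graph),
        "void:entities": len({s for s, p, _ in graph if p == RDF_TYPE}),
        "void:distinctSubjects": len({s for s, _, _ in graph}),
        "void:properties": len({p for _, p, _ in graph}),
        "void:distinctObjects": len({o for _, _, o in graph}),
    }


def subject_object_type_links(triples):
    """Count triples per (subject type, property, object type) combination."""
    graph = set(triples)
    types = {}
    for s, p, o in graph:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)
    counts = Counter()
    for s, p, o in graph:
        if p == RDF_TYPE:
            continue
        for subject_type in types.get(s, ()):
            for object_type in types.get(o, ()):
                counts[(subject_type, p, object_type)] += 1
    return dict(counts)


# Illustrative data only.
triples = [
    (":CHEMBL25", "rdf:type", ":Molecule"),
    (":P12345", "rdf:type", ":Target"),
    (":CHEMBL25", ":hasTarget", ":P12345"),
    (":CHEMBL25", "rdfs:label", '"aspirin"'),
]

print(core_statistics(triples))
print(subject_object_type_links(triples))
```

Publishing these numbers in the description means a client can see that molecules link to targets via `:hasTarget` without running the equivalent aggregate query against the endpoint.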

Summary


FAIR Data Principles

HCLS Dataset Descriptions

https://www.w3.org/TR/hcls-dataset/

Dumontier M, Gray AJG, Marshall MS, et al. (2016) The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331 https://doi.org/10.7717/peerj.2331

[email protected] @gray_alasdair