Transcript
Page 1: 2013 01-14 ops-dataset_descriptions

Dataset Descriptions in Open PHACTS

Alasdair J G Gray
University of Manchester
W3C HCLS Call – 14 January 2013

www.openphacts.org/specs/datadesc/

Authors: Christian Y. A. Brenninkmeijer, Chris Evelo, Carole Goble, Alasdair J. G. Gray, Andra Waagmeester and Egon L. Willighagen

Page 2: 2013 01-14 ops-dataset_descriptions

Why?

Public Domain Drug Discovery Data: Pharma are accessing, processing, storing & re-processing

[Figure labels: Literature; PubChem; Genbank; Patents; Databases; Downloads; Data Integration; Data Analysis; Firewalled Databases; Repeat @ each company]

Page 3: 2013 01-14 ops-dataset_descriptions

The Project

The Innovative Medicines Initiative
• EC funded public-private partnership for pharmaceutical research
• Focus on key problems
  – Efficacy, Safety, Education & Training, Knowledge Management

The Open PHACTS Project
• Create a semantic integration hub (“Open Pharmacological Space”)…
• Delivering services to support on-going drug discovery programs in pharma and public domain
• Not just another project; leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements
• 13 academic partners, 9 pharmaceutical companies, 6 SMEs
• Work split into clusters:
  – Technical Build (focus here)
  – Scientific Drive
  – Community & Sustainability

Page 4: 2013 01-14 ops-dataset_descriptions

Architecture

User Interfaces & Applications

Linked Data API

Linked Data Cache

Identity Mapping Service

Identity Resolution Service

Domain Specific Services

Data

Page 5: 2013 01-14 ops-dataset_descriptions

Datasets and Links

Page 6: 2013 01-14 ops-dataset_descriptions

ChemSpider
• ChemSpider aggregates data from over 400 sources
• Central integration point for chemicals in OPS
• OPS data covers
  – ChEBI
  – ChEMBL
  – DrugBank

14 January 2013 OPS Dataset Descriptions – A. J. G. Gray 6

Page 7: 2013 01-14 ops-dataset_descriptions

What version of ChEMBL? ~Jan 2012
• ChemSpider: EBI SDF file
  – ChEMBL 13
• Data Cache: Chem2Bio2RDF ChEMBL RDF
  – File downloaded May 2011
  – Chem2Bio2RDF metadata webpages: ChEMBL 8
  – File: ChEMBL 2
• Mapping Server: Kasabi ChEMBL RDF file
  – ChEMBL 12
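
The mismatch on this slide can be made explicit with a small check. The following is a minimal sketch in Python, assuming the version claims above are transcribed by hand as (component, evidence, version) tuples. The names `CLAIMS`, `versions_by_component` and `conflicting_versions` are illustrative and not part of the Open PHACTS platform.

```python
from collections import defaultdict

# Version claims transcribed from the slide:
# (component, evidence, claimed ChEMBL version).
CLAIMS = [
    ("ChemSpider", "EBI SDF file", 13),
    ("Data Cache", "Chem2Bio2RDF metadata webpages", 8),
    ("Data Cache", "Chem2Bio2RDF file", 2),
    ("Mapping Server", "Kasabi ChEMBL RDF file", 12),
]


def versions_by_component(claims):
    """Group the claimed versions by platform component."""
    grouped = defaultdict(set)
    for component, _evidence, version in claims:
        grouped[component].add(version)
    return dict(grouped)


def conflicting_versions(claims):
    """Return True when the claims do not agree on a single version."""
    distinct = {version for _component, _evidence, version in claims}
    return len(distinct) > 1


if __name__ == "__main__":
    for component, versions in sorted(versions_by_component(CLAIMS).items()):
        print(component, sorted(versions))
    print("conflict:", conflicting_versions(CLAIMS))
```

Even within one component (the Data Cache) the metadata webpages and the file itself disagree, which is the situation the dataset descriptions are meant to prevent.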

Page 8: 2013 01-14 ops-dataset_descriptions

For the record
• OPS currently uses ChEMBL 13
  – RDF generated from EBI database dump
  – Published at linkedchemistry.info
    • Credit: Egon Willighagen
• Soon moving to ChEMBL 15
  – RDF published by EBI

Page 9: 2013 01-14 ops-dataset_descriptions

Challenges
• Datasets available
  – In many versions over time
  – In different formats
  – From many mirrors/registries
• Files do not carry metadata
• Registries
  – Can be out-of-date
  – Can contain conflicting information

Page 10: 2013 01-14 ops-dataset_descriptions

VoID: Vocabulary of Interlinked Datasets
• Describes RDF datasets
  – W3C Note: http://www.w3.org/TR/void/
• Metadata carried with data
  – Directly embedded or linked (void:inDataset)
• Problems
  – Very generic
  – No checklist of requisite fields
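
To illustrate how metadata can be carried with the data, here is a minimal sketch in Python that emits a VoID description as Turtle, plus the `void:inDataset` triple that links a data document back to that description. It builds plain strings and uses no RDF library. The function names and the example.org URIs in the usage note are hypothetical, and the choice of `dcterms:title` and `void:dataDump` is only one plausible selection of fields.

```python
VOID = "http://rdfs.org/ns/void#"
DCTERMS = "http://purl.org/dc/terms/"


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def void_description(dataset_uri, title, data_dump_uri):
    """Return Turtle describing one dataset as a void:Dataset."""
    lines = [
        f"@prefix void: <{VOID}> .",
        f"@prefix dcterms: <{DCTERMS}> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f"    dcterms:title {turtle_literal(title)} ;",
        f"    void:dataDump <{data_dump_uri}> .",
    ]
    return "\n".join(lines)


def link_document(document_uri, dataset_uri):
    """Return the triple that ties a data document to its description."""
    return f"<{document_uri}> void:inDataset <{dataset_uri}> ."
```

For example, `link_document("http://example.org/chembl.ttl", "http://example.org/void#chembl")` produces the single triple a publisher would place in the data file itself when the description is linked rather than embedded.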

Page 11: 2013 01-14 ops-dataset_descriptions

Provenance Vocabularies
• Dublin Core Terms
  – Widely used
  – Terms too generic to give proper credit
    • “Date: A point or period of time associated with an event in the lifecycle of the resource.”
• PROV
  – New W3C standard: www.w3.org/2011/prov
  – Generic framework for exchanging data
  – Does not contain required predicates

Page 12: 2013 01-14 ops-dataset_descriptions

PAV: Provenance, Authoring and Versioning Vocabulary
http://code.google.com/p/pav-ontology/wiki/Homepage
• Easy to understand predicates
  – http://purl.org/pav/
• Right level of granularity
  – Distinguishes: author/creator/curator
  – Captures source of data:
    • import/derived/accessed
    • version/previousVersion
• Being aligned with PROV-O
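
As an illustration of that granularity, the sketch below emits PAV statements for a dataset as Turtle from Python. It uses only plain strings and covers four PAV predicates: `pav:version`, `pav:importedFrom`, `pav:createdBy` and `pav:previousVersion`. The function name `pav_description` and every example.org URI in the usage note are hypothetical placeholders, not identifiers used by Open PHACTS.

```python
PAV = "http://purl.org/pav/"


def pav_description(dataset_uri, version, imported_from,
                    created_by=None, previous_version=None):
    """Return Turtle stating version and source provenance for a dataset."""
    # Mandatory pairs in this sketch: the version label and the import source.
    pairs = [
        ("pav:version", f'"{version}"'),
        ("pav:importedFrom", f"<{imported_from}>"),
    ]
    # Optional pairs are only emitted when a value was supplied.
    if created_by:
        pairs.append(("pav:createdBy", f"<{created_by}>"))
    if previous_version:
        pairs.append(("pav:previousVersion", f"<{previous_version}>"))
    body = " ;\n".join(f"    {pred} {obj}" for pred, obj in pairs)
    return f"@prefix pav: <{PAV}> .\n\n<{dataset_uri}>\n{body} ."
```

A call such as `pav_description("http://example.org/chembl13-rdf", "13", "http://example.org/ebi-dump", created_by="http://example.org/people/creator")` would record that an RDF conversion is version 13, was imported from a database dump, and credits whoever created the conversion, which is the kind of credit Dublin Core's generic terms cannot express.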

Page 13: 2013 01-14 ops-dataset_descriptions

Dataset Descriptions in the Open Pharmacological Space

Page 14: 2013 01-14 ops-dataset_descriptions

Related Work
• Registries: DataHub, MIRIAM
  – Do not tie metadata with the data
  – No checklist of attributes
• BioDBCore
  – Checklist
    • Similar information captured
    • Includes point of contact information
  – Not tied to the data

Page 15: 2013 01-14 ops-dataset_descriptions

Realisation of Dataset Descriptions
• Needs to be incorporated into data publishing pipeline
• Hard for publishers to provide conformant descriptions
  – Datasets are complex
  – Evolve over time
  – Seen as yet another burden
• Validation tool provided
  – http://openphacts.cs.man.ac.uk:9090/OPS-IMS/validate
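
The slides do not describe what the validation tool checks, so the following is only a sketch of the general idea of checklist validation, not the behaviour of the OPS validator. It assumes a description has already been parsed into a Python dict keyed by prefixed predicate name. The `REQUIRED` tuple is a made-up checklist for illustration; the actual requisite fields are defined in the specification at www.openphacts.org/specs/datadesc/.

```python
# Hypothetical checklist; the real list of requisite fields is in the spec.
REQUIRED = (
    "dcterms:title",
    "dcterms:description",
    "dcterms:license",
    "pav:version",
)


def missing_fields(description, required=REQUIRED):
    """Return the required predicates that are absent or empty."""
    return [pred for pred in required if not description.get(pred)]


def is_conformant(description, required=REQUIRED):
    """A description conforms when no required predicate is missing."""
    return not missing_fields(description, required)
```

Calling `missing_fields({"dcterms:title": "ChEMBL RDF", "pav:version": "13"})` reports the two predicates the publisher still has to supply, which is the kind of feedback that makes a checklist less of a burden than a free-form vocabulary.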

Page 16: 2013 01-14 ops-dataset_descriptions

Future Vision
• Provide rich and accurate provenance trail of data
  – Alignment with BioDBCore
• One standard to rule them all
  – Automatic pipeline from VoID file to registries
• Write once, use many times

Page 17: 2013 01-14 ops-dataset_descriptions

Thank you

[email protected]/~graya/
www.openphacts.org
