Building linked-data, large-scale chemistry platform: challenges, lessons and solutions

Valery Tkachenko, Alexey Pshenichnov, Aileen Day, Colin Batchelor, Peter Corbett
Royal Society of Chemistry

ACS Spring 2016, San Diego, CA, March 13th 2016

ChemSpider – 2007 - 2011

OpenPHACTS – 2011 - 2014

Chemistry Data Platform – 2014 - …

• 45 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• A structure-centric hub for web searching

ChemSpider

Chemical vendors and data sources

ChemSpider

Properties - experimental

Literature and patent references

Classification

Spectra

Multimedia

Tagging

ChemSpider - Summary

• Simple, flattish data model
• InChI as a primary identifier
• Linked by synonyms
• Linked by “ExtId”
• Standard searches (identity, substructure, similarity)
• Very little semantics
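The flat, InChI-keyed model summarised on this slide can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not ChemSpider's actual schema: the `CompoundRecord` and `FlatStore` classes, their field names, and the `ExampleVendor` source with its ID are all hypothetical. Only identity lookup is shown; substructure and similarity search need a cheminformatics toolkit and are left out.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """One flat record per structure, keyed by its InChI (hypothetical layout)."""
    inchi: str
    synonyms: set = field(default_factory=set)
    ext_ids: dict = field(default_factory=dict)  # data source name -> external ID


class FlatStore:
    """In-memory stand-in for a flat, structure-centric compound table."""

    def __init__(self):
        self._by_inchi = {}
        self._by_synonym = {}  # lower-cased synonym -> set of InChIs
        self._by_ext_id = {}   # (source, external ID) -> InChI

    def add(self, record):
        self._by_inchi[record.inchi] = record
        for name in record.synonyms:
            self._by_synonym.setdefault(name.lower(), set()).add(record.inchi)
        for source, ext_id in record.ext_ids.items():
            self._by_ext_id[(source, ext_id)] = record.inchi

    def by_inchi(self, inchi):
        """Identity search: exact match on the primary identifier."""
        return self._by_inchi.get(inchi)

    def by_synonym(self, name):
        """All records sharing a synonym (a name can be ambiguous)."""
        inchis = self._by_synonym.get(name.lower(), set())
        return [self._by_inchi[i] for i in sorted(inchis)]

    def by_ext_id(self, source, ext_id):
        """Follow an external-ID link back to the record, or None."""
        inchi = self._by_ext_id.get((source, ext_id))
        return self._by_inchi.get(inchi)


store = FlatStore()
methane = CompoundRecord(
    inchi="InChI=1S/CH4/h1H4",
    synonyms={"Methane", "Marsh gas"},
    ext_ids={"ExampleVendor": "EV-0001"},  # made-up source and ID
)
store.add(methane)
print(store.by_synonym("marsh gas")[0].inchi)
```

Everything hangs off one key, which is what keeps the model simple. It is also why there is little room for semantics: a synonym or an external ID can only say "this points at that structure", not how the two are related.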

Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Sustainable Access Point

OpenPHACTS: 2011-2014

[email protected] @Open_PHACTS

Open PHACTS Practical Semantics

• GlaxoSmithKline – Coordinator
• Universität Wien – Managing entity
• Technical University of Denmark
• University of Hamburg, Center for Bioinformatics
• BioSolveIT GmbH
• Consorci Mar Parc de Salut de Barcelona
• Leiden University Medical Centre
• Royal Society of Chemistry
• Vrije Universiteit Amsterdam
• Novartis
• Merck Serono
• H. Lundbeck A/S
• Eli Lilly
• Netherlands Bioinformatics Centre
• Swiss Institute of Bioinformatics
• ConnectedDiscovery
• EMBL-European Bioinformatics Institute
• Janssen
• Esteve
• Almirall
• OpenLink
• Scibite
• The Open PHACTS Foundation
• Spanish National Cancer Research Centre
• University of Manchester
• Maastricht University
• Aqnowledge
• University of Santiago de Compostela
• Rheinische Friedrich-Wilhelms-Universität Bonn
• AstraZeneca
• Pfizer

Why is it so hard to…

• Competitors?
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known Pathways?
• Working On Now?
• Connections to disease?
• Expressed in right cell type?
• IP?

@gray_alasdair Big Data Integration

OpenPHACTS Discovery Platform

[Architecture diagram; recoverable labels only]
• Data inputs: Public Content, Commercial, Public Ontologies, User Annotations (RDF, Nanopub, Db, VoID)
• Core Platform: Data Cache (Virtuoso Triple Store), Semantic Workflow Engine, Identity Resolution Service, Identifier Management Service, Indexing
• Domain Specific Services: Chemistry Registration, Normalisation & Q/C
• Linked Data API (RDF/XML, TTL, JSON)
• Apps
• Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”

21 October 2014 Scientific Lenses – A. J. G. Gray
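The Identity Resolution Service in the diagram takes a free-text label or a source identifier and turns it into a concept the platform knows about. A minimal sketch of that idea, assuming a simple normalised lookup table, might look like this. The `IdentityResolver` class and the `http://example.org/...` URIs are invented for illustration; the slide does not say what the example identifiers resolve to, and nothing here reflects the real service's interface.

```python
class IdentityResolver:
    """Maps free-text labels and source identifiers to concept URIs."""

    def __init__(self):
        self._index = {}  # normalised text -> set of concept URIs

    @staticmethod
    def _normalise(text):
        # Case-fold and collapse whitespace so small typing differences still match.
        return " ".join(text.casefold().split())

    def register(self, text, concept_uri):
        self._index.setdefault(self._normalise(text), set()).add(concept_uri)

    def resolve(self, text):
        """Return candidate concept URIs, sorted for stable output."""
        return sorted(self._index.get(self._normalise(text), set()))


resolver = IdentityResolver()
# Placeholder URIs: the real targets of these labels are not given in the slides.
resolver.register("Adenosine receptor 2a", "http://example.org/concept/A")
resolver.register("CS4532", "http://example.org/concept/B")

print(resolver.resolve("  adenosine   RECEPTOR 2A "))
print(resolver.resolve("no such label"))
```

Returning a list rather than a single URI leaves room for ambiguous labels, which is the usual case once several data sources are integrated.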

Gleevec®: Imatinib Mesylate

DrugBank, ChemSpider, PubChem

Imatinib
Mesylate
Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N

Structure Lens

skos:exactMatch (InChI)

Strict – Relaxed
Analysing – Browsing

I need to compute an analysis, give me details of the active compound in Gleevec.

Lens Effects: Ibuprofen

Commercial ibuprofen is a racemic mixture containing the same proportion of two chiral forms. Both chiral forms are equally active. Typically, the user will wish to retrieve info for any stereoisomer.

CHEMBL427526, CHEMBL521, CHEMBL175

Default Lens

Stereoisomer Lens
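One way to read the difference between the default lens and the stereoisomer lens is that each lens names the link types it is willing to treat as "the same compound", and a query then collects everything reachable through those links. The sketch below works under that assumption. The lens contents, the `ex:` and `other:` identifiers, and the choice to follow links in both directions are illustrative guesses, not the Open PHACTS implementation; only the predicate names come from the slides.

```python
from collections import deque

# Link predicates named on the slides.
EXACT = "skos:exactMatch"
STEREO_PARENT = "ci:CHEMINF_000456"  # has_stereoundefined_parent
STEREOISOMER = "ci:CHEMINF_000461"   # is_stereoisomer_of

# Which predicates each lens admits is an assumption made for this sketch.
LENSES = {
    "default": {EXACT},
    "stereoisomer": {EXACT, STEREO_PARENT, STEREOISOMER},
}

# Hypothetical links for a racemate and its two enantiomers.
LINKS = [
    ("ex:racemate", EXACT, "other:racemate"),
    ("ex:S-form", STEREO_PARENT, "ex:racemate"),
    ("ex:R-form", STEREO_PARENT, "ex:racemate"),
    ("ex:S-form", STEREOISOMER, "ex:R-form"),
]


def equivalents(start, lens, links=LINKS):
    """Breadth-first walk over the links the chosen lens accepts."""
    allowed = LENSES[lens]
    neighbours = {}
    for subject, predicate, obj in links:
        if predicate in allowed:
            # Links are followed in both directions (an assumption).
            neighbours.setdefault(subject, set()).add(obj)
            neighbours.setdefault(obj, set()).add(subject)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(equivalents("ex:racemate", "default")))
print(sorted(equivalents("ex:racemate", "stereoisomer")))
```

Under the default lens the racemate only picks up its exact match in the other source; under the stereoisomer lens the same query also returns both enantiomers.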

Mapping Generation

ops:OPS437281
ops:OPS380297
has_stereoundefined_parent [ci:CHEMINF_000456]
ops:OPS380297
is_stereoisomer_of [ci:CHEMINF_000461]

Other relationships
• has part
• is tautomer of
• uncharged counterpart
• isotope…
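Links of this kind can in principle be generated automatically from the structures themselves. The sketch below is one possible approach, not the method used in Open PHACTS. It relies on the layout of the standard InChIKey: the first 14-letter block hashes the connectivity, and the second block conventionally reads `UHFFFAOYSA` when the InChI carries no stereo or isotope layers. The compound IDs and keys in the example are made-up placeholders, and a differing second block is treated as a stereo difference even though it could also reflect isotopes.

```python
from itertools import combinations

HAS_STEREOUNDEFINED_PARENT = "ci:CHEMINF_000456"
IS_STEREOISOMER_OF = "ci:CHEMINF_000461"

# Second InChIKey block when there are no stereo or isotope layers.
NO_STEREO_BLOCK = "UHFFFAOYSA"


def generate_mappings(compounds):
    """compounds: dict of compound ID -> standard InChIKey.

    Returns a sorted list of (subject, predicate, object) triples.
    """
    groups = {}
    for cid, key in compounds.items():
        skeleton = key.split("-")[0]  # connectivity block
        groups.setdefault(skeleton, []).append(cid)

    triples = []
    for members in groups.values():
        parents = [c for c in members
                   if compounds[c].split("-")[1] == NO_STEREO_BLOCK]
        children = [c for c in members if c not in parents]
        for child in children:
            for parent in parents:
                triples.append((child, HAS_STEREOUNDEFINED_PARENT, parent))
        for a, b in combinations(sorted(children), 2):
            if compounds[a] != compounds[b]:  # identical keys are duplicates
                triples.append((a, IS_STEREOISOMER_OF, b))
    return sorted(triples)


# Placeholder IDs and keys, not real compounds.
example = {
    "ex:1": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "ex:2": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "ex:3": "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "ex:4": "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N",
}
for triple in generate_mappings(example):
    print(triple)
```

The other relationships listed on the slide (has part, is tautomer of, uncharged counterpart) cannot be derived from the key alone and would need structure-level processing.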

OpenPHACTS UI: http://explorer.openphacts.org/

Explorer Screenshot


OpenPHACTS - Summary

• Principal difference – inter-domain links
• More complex, but still structure-centric data model
• Ontological relationships introduced
• Chemical Lenses – new type of search

Chemistry Data Platform – 2014 - …

RSC Archive – since 1841

Digitally Enabling RSC Archive

ChemSpider Synthetic Pages
• Compounds
• Reaction
• Analytical Data
• Text and References

RSC Databases
• RSC Compounds
• RSC Reactions
• RSC Spectra
• RSC Crystals
• RSC Polymers
• RSC Materials
• RSC Assays
• RSC Algorithms
• RSC Models
• …and on…

Compounds domain

Data quality issue and CVSP

– Robochemistry
– Proliferation of errors in public and private databases
  • ChemSpider
  • PubChem
  • DrugBank
  • KEGG
  • ChEBI/ChEMBL
– Automated quality control system
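An automated quality control system of this kind can be thought of as a list of independent rules run over every incoming record, each reporting an issue with a severity. The sketch below illustrates that pattern only. The rules, severities, and record fields are invented for the example and are not CVSP's rule set, and the checks are purely about format: they do not verify that the InChI and InChIKey describe the same structure.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def check_standard_inchi(record):
    # Standard InChI strings start with the "InChI=1S/" prefix.
    if not record.get("inchi", "").startswith("InChI=1S/"):
        return ("error", "missing or non-standard InChI")
    return None


def check_inchikey_format(record):
    if not INCHIKEY_RE.match(record.get("inchikey", "")):
        return ("error", "malformed InChIKey")
    return None


def check_has_name(record):
    if not record.get("names"):
        return ("warning", "record has no name or synonym")
    return None


# Hypothetical rule list; a real system would have many more rules.
RULES = [check_standard_inchi, check_inchikey_format, check_has_name]


def validate(record):
    """Run every rule and collect (severity, message) pairs."""
    return [issue for issue in (rule(record) for rule in RULES)
            if issue is not None]


good = {
    "inchi": "InChI=1S/CH4/h1H4",
    "inchikey": "VNWKTOKETHGBQD-UHFFFAOYSA-N",
    "names": ["methane"],
}
bad = {"inchi": "CH4", "inchikey": "not-a-key", "names": []}

print(validate(good))
print(validate(bad))
```

Keeping each rule as a small function makes it easy to add, remove, or re-grade checks per data source without touching the rest of the pipeline.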

Chemistry Validation and Standardization Platform

Reactions domain

Analytical data domain

Crystallography domain

Chemistry Data Platform - Summary

• Simplified models within each domain
• Each domain is described with its own models, with embedded semantics
• No proper domain-specific identifiers
• Extensive quality control – CVSP (DOI 10.1186/s13321-015-0072-8)

There is no way back

Thank you

Email: [email protected]

Slides: http://www.slideshare.net/valerytkachenko16