Building linked-data, large-scale chemistry platform: challenges, lessons and solutions

Valery Tkachenko, Alexey Pshenichnov, Aileen Day, Colin Batchelor, Peter Corbett
Royal Society of Chemistry

ACS Spring 2016, San Diego, CA, March 13th 2016

ChemSpider – 2007 - 2011

OpenPHACTS – 2011 - 2014

Chemistry Data Platform – 2014 - …

• 45 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• A structure-centric hub for web searching

ChemSpider

Chemical vendors and datasources

ChemSpider

Properties - experimental

Literature and patents references

Classification

Spectra

Multimedia

Tagging

ChemSpider - Summary

• Simple, flattish data model
• InChI as a primary identifier
• Linked by synonyms
• Linked by “ExtId”
• Standard searches (identity, substructure, similarity)
• Very little semantics
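The flat model summarised above can be pictured as one record per structure, keyed by InChI, with synonyms and external identifiers as the only links between records and sources. The following is a minimal sketch of that idea, not ChemSpider's actual schema: the class names, field names and lookup methods are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """Hypothetical flat record: one entry per structure, keyed by InChI."""
    inchi: str                                    # primary identifier
    synonyms: set = field(default_factory=set)    # names that resolve to this record
    ext_ids: dict = field(default_factory=dict)   # data source name -> external ID


class FlatIndex:
    """Hypothetical in-memory index with lookups by synonym and by ExtId."""

    def __init__(self):
        self._by_inchi = {}
        self._by_synonym = {}
        self._by_ext_id = {}

    def add(self, record):
        # Register the record under its InChI and under every secondary key.
        self._by_inchi[record.inchi] = record
        for name in record.synonyms:
            self._by_synonym[name.lower()] = record.inchi
        for source, ext_id in record.ext_ids.items():
            self._by_ext_id[(source, ext_id)] = record.inchi

    def by_synonym(self, name):
        # Synonym lookup is case-insensitive; returns None when unknown.
        return self._by_inchi.get(self._by_synonym.get(name.lower()))

    def by_ext_id(self, source, ext_id):
        # ExtId lookup needs both the source name and the ID within that source.
        return self._by_inchi.get(self._by_ext_id.get((source, ext_id)))
```

Every lookup resolves to the InChI first, so there is nowhere to record how two records relate to each other. That is one way to read the "very little semantics" point.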

Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Sustainable Access Point

OpenPHACTS: 2011-2014

info@openphactsfoundation.org @Open_PHACTS

Open PHACTS Practical Semantics

OpenPHACTS

• GlaxoSmithKline – Coordinator
• Universität Wien – Managing entity
• Technical University of Denmark
• University of Hamburg, Center for Bioinformatics
• BioSolveIT GmbH
• Consorci Mar Parc de Salut de Barcelona
• Leiden University Medical Centre
• Royal Society of Chemistry
• Vrije Universiteit Amsterdam
• Novartis
• Merck Serono
• H. Lundbeck A/S
• Eli Lilly
• Netherlands Bioinformatics Centre
• Swiss Institute of Bioinformatics
• ConnectedDiscovery
• EMBL-European Bioinformatics Institute
• Janssen
• Esteve
• Almirall
• OpenLink
• Scibite
• The Open PHACTS Foundation
• Spanish National Cancer Research Centre
• University of Manchester
• Maastricht University
• Aqnowledge
• University of Santiago de Compostela
• Rheinische Friedrich-Wilhelms-Universität Bonn
• AstraZeneca
• Pfizer

Why is it so hard to…

• Competitors?
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known Pathways?
• Working On Now?
• Connections to disease?
• Expressed in right cell type?
• IP?

@gray_alasdair Big Data Integration

OpenPHACTS Discovery Platform

[Architecture diagram. Labels recoverable from the slide: data sources marked RDF, Nanopub, Db and VoID, grouped as Public Content, Commercial, Public Ontologies and User Annotations; Data Cache (Virtuoso Triple Store); Semantic Workflow Engine; Identity Resolution Service; Identifier Management Service; Chemistry Registration, Normalisation & Q/C; Indexing; Domain-Specific Services; Core Platform; Linked Data API (RDF/XML, TTL, JSON); Apps. Example inputs shown: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”.]

21 October 2014 Scientific Lenses – A. J. G. Gray

Gleevec®: Imatinib Mesylate


DrugBank, ChemSpider, PubChem

Imatinib; Mesylate; Imatinib Mesylate

YLMAHDNUQAMNNX-UHFFFAOYSA-N


skos:exactMatch (InChI)

Strict (analysing) to Relaxed (browsing)

Structure Lens


I need to compute an analysis, give me details of the active compound in Gleevec.


Commercial ibuprofen is a racemic mixture containing the same proportion of two chiral forms. Both chiral forms are equally active. Typically, the user will wish to retrieve info for any stereoisomer.

CHEMBL427526, CHEMBL521, CHEMBL175

Lens Effects: Ibuprofen


Default Lens


Stereoisomer Lens


Mapping Generation


ops:OPS437281

ops:OPS380297

has_stereoundefined_parent [ci:CHEMINF_000456]

ops:OPS380297

is_stereoisomer_of [ci:CHEMINF_000461]

Other relationships

• has part
• is tautomer of
• uncharged counterpart
• isotope…
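One way to read the lens idea: each link between two compound records carries a relationship type, and a lens is the set of relationship types a query is allowed to follow. The sketch below illustrates that reading only and is not the Open PHACTS implementation. The `ex:` identifiers, the link table and the Python constant names are hypothetical; only the relationship names come from the slides. Treating links as symmetric is also a simplifying assumption.

```python
from collections import deque

# Hypothetical link table: (subject, relationship, object).
# Relationship names follow the slides; the ex: identifiers are made up.
LINKS = [
    ("ex:S-form", "has_stereoundefined_parent", "ex:parent"),
    ("ex:R-form", "has_stereoundefined_parent", "ex:parent"),
    ("ex:S-form", "is_stereoisomer_of", "ex:R-form"),
    ("ex:parent", "exactMatch", "ex:parent-in-other-db"),
    ("ex:salt", "has_part", "ex:parent"),
]

# A lens is modelled here as the set of relationships a query may follow.
DEFAULT_LENS = {"exactMatch"}
STEREOISOMER_LENS = DEFAULT_LENS | {
    "has_stereoundefined_parent",
    "is_stereoisomer_of",
}


def expand(start, lens, links=LINKS):
    """Return all identifiers reachable from start via relationships in lens.

    Breadth-first search; each link is followed in both directions
    (an assumption of this sketch).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for subj, rel, obj in links:
            if rel not in lens:
                continue  # this lens does not follow that relationship
            if subj == node:
                neighbour = obj
            elif obj == node:
                neighbour = subj
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen
```

With this table, `expand("ex:parent", DEFAULT_LENS)` stays within exact matches, while `expand("ex:parent", STEREOISOMER_LENS)` also pulls in both stereoisomers. This mirrors the ibuprofen case, where a user usually wants data for any stereoisomer.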

OpenPHACTS UI: http://explorer.openphacts.org/

Explorer Screenshot


OpenPHACTS - Summary

• Principal difference – inter-domain links
• More complex, but still structure-centric data model
• Ontological relationships introduced
• Chemical Lenses – new type of search

Chemistry Data Platform – 2014 - …

RSC Archive – since 1841

Digitally Enabling RSC Archive

ChemSpider Synthetic Pages

• Compounds
• Reaction
• Analytical Data
• Text and References

RSC Databases

• RSC Compounds
• RSC Reactions
• RSC Spectra
• RSC Crystals
• RSC Polymers
• RSC Materials
• RSC Assays
• RSC Algorithms
• RSC Models
• …and on…

Compounds domain

Data quality issue and CVSP

– Robochemistry

– Proliferation of errors in public and private databases

• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL

– Automated quality control system

Chemistry Validation and Standardization Platform


Reactions domain

Analytical data domain

Crystallography domain

Chemistry Data Platform - Summary

• Simplified models within each domain
• Each domain is described by its own model with embedded semantics
• No proper domain-specific identifiers
• Extensive quality control – CVSP (DOI 10.1186/s13321-015-0072-8)
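Automated quality control of this kind can be thought of as a list of rules applied to every incoming record, each rule reporting an issue at some severity. The sketch below shows that pattern only. The two rules, the severity labels and the record fields are assumptions for illustration and are not taken from CVSP; the cited paper describes its actual rule set.

```python
import re
from typing import NamedTuple

# Shape of a standard InChIKey: blocks of 14, 10 and 1 uppercase
# letters separated by hyphens.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


class Issue(NamedTuple):
    severity: str   # "error" or "warning" (hypothetical levels)
    message: str


def check_inchikey(record):
    """Hypothetical rule: the InChIKey field must have the standard shape."""
    key = record.get("inchikey", "")
    if not INCHIKEY_RE.match(key):
        return Issue("error", f"malformed InChIKey: {key!r}")
    return None


def check_has_name(record):
    """Hypothetical rule: warn when a record carries no name."""
    if not record.get("name", "").strip():
        return Issue("warning", "record has no name")
    return None


RULES = [check_inchikey, check_has_name]


def validate(record, rules=RULES):
    """Run every rule against the record and collect the reported issues."""
    issues = []
    for rule in rules:
        issue = rule(record)
        if issue is not None:
            issues.append(issue)
    return issues
```

Keeping each rule as a separate function means new checks can be added without touching the others, and the severities let a pipeline decide whether to reject a record or only flag it.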

There is no way back

Thank you

Email: tkachenkov@rsc.org

Slides: http://www.slideshare.net/valerytkachenko16
