Building a linked-data, large-scale chemistry platform: challenges, lessons and solutions
Valery Tkachenko, Alexey Pshenichnov, Aileen Day, Colin Batchelor, Peter Corbett
Royal Society of Chemistry
ACS Spring 2016
San Diego, CA
March 13th 2016
ChemSpider – 2007 - 2011
OpenPHACTS – 2011 - 2014
Chemistry Data Platform – 2014 - …
• 45 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• A structure centric hub for web-searching
ChemSpider
Chemical vendors and datasources
ChemSpider
Properties - experimental
Literature and patents references
Classification
Spectra
Multimedia
Tagging
ChemSpider - Summary
• Simple, flattish data model
• InChI as a primary identifier
• Linked by synonyms
• Linked by “ExtId”
• Standard searches (identity, substructure, similarity)
• Very little semantics
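As a rough illustration of what a flat, InChI-keyed model with synonym and “ExtId” links might look like, here is a minimal Python sketch. The class names, field names and the `ExampleVendor` / `EV-001` values are hypothetical and are not taken from ChemSpider itself; only identity lookup is shown, since substructure and similarity search need a cheminformatics toolkit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class CompoundRecord:
    """One flat record per structure, keyed by its InChI (hypothetical layout)."""
    inchi: str
    synonyms: Set[str] = field(default_factory=set)
    ext_ids: Dict[str, str] = field(default_factory=dict)  # source name -> external ID


class FlatCompoundStore:
    """In-memory stand-in for a flat compound table with a synonym index."""

    def __init__(self) -> None:
        self._by_inchi: Dict[str, CompoundRecord] = {}
        self._by_synonym: Dict[str, Set[str]] = {}

    def add(self, record: CompoundRecord) -> None:
        self._by_inchi[record.inchi] = record
        for name in record.synonyms:
            # Synonyms are matched case-insensitively in this sketch.
            self._by_synonym.setdefault(name.lower(), set()).add(record.inchi)

    def by_inchi(self, inchi: str) -> Optional[CompoundRecord]:
        """Identity search: exact InChI match."""
        return self._by_inchi.get(inchi)

    def by_synonym(self, name: str) -> List[CompoundRecord]:
        """Name search: every record that lists this synonym."""
        keys = sorted(self._by_synonym.get(name.lower(), ()))
        return [self._by_inchi[k] for k in keys]


store = FlatCompoundStore()
store.add(CompoundRecord(
    inchi="InChI=1S/H2O/h1H2",             # water
    synonyms={"water", "oxidane"},
    ext_ids={"ExampleVendor": "EV-001"},   # made-up source and ID
))
print(store.by_synonym("Water")[0].ext_ids)
```

The model has no typed relationships between records, which is what “very little semantics” amounts to in practice: two records are related only if they share a name or an external ID.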
Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Sustainable Access Point
OpenPHACTS: 2011-2014
@Open_PHACTS
Open PHACTS Practical Semantics
GlaxoSmithKline – Coordinator
Universität Wien – Managing entity
Technical University of Denmark
University of Hamburg, Center for Bioinformatics
BioSolveIT GmbH
Consorci Mar Parc de Salut de Barcelona
Leiden University Medical Centre
Royal Society of Chemistry
Vrije Universiteit Amsterdam
Novartis
Merck Serono
H. Lundbeck A/S
Eli Lilly
Netherlands Bioinformatics Centre
Swiss Institute of Bioinformatics
ConnectedDiscovery
EMBL-European Bioinformatics Institute
Janssen
Esteve
Almirall
OpenLink
Scibite
The Open PHACTS Foundation
Spanish National Cancer Research Centre
University of Manchester
Maastricht University
Aqnowledge
University of Santiago de Compostela
Rheinische Friedrich-Wilhelms-Universität Bonn
AstraZeneca
Pfizer
Why is it so hard to….
Competitors?
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Working On Now?
Connections to disease?
Expressed in right cell type?
IP?
@gray_alasdair Big Data Integration
OpenPHACTS Discovery Platform
[Architecture diagram. Components shown: Apps; Linked Data API (RDF/XML, TTL, JSON); Domain Specific Services; Semantic Workflow Engine; Core Platform; Data Cache (Virtuoso Triple Store); Identity Resolution Service; Identifier Management Service; Chemistry Registration, Normalisation & Q/C; Indexing. Data sources are shown as RDF, Nanopub, Db and VoID, grouped under Public Content, Commercial, Public Ontologies and User Annotations. Example identifiers shown: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”.]
21 October 2014 Scientific Lenses – A. J. G. Gray
Gleevec®: Imatinib Mesylate
DrugBank, ChemSpider, PubChem
Imatinib
Mesylate
Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
skos:exactMatch(InChI)
Structure Lens: from Strict (analysing) to Relaxed (browsing)
I need to compute an analysis, give me details of the active compound in Gleevec.
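One way to read the skos:exactMatch (InChI) link above is as a link generated whenever two records from different datasets carry the same structure key. The sketch below groups records by InChIKey and emits one exactMatch triple per pair. The record URIs, the `PLACEHOLDER-KEY` value and the assignment of keys to datasets are invented for illustration; only the InChIKey `YLMAHDNUQAMNNX-UHFFFAOYSA-N` comes from the slide, and this is not presented as the Open PHACTS implementation.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def exact_match_links(records: Iterable[Tuple[str, str]]) -> List[Triple]:
    """Emit (a, 'skos:exactMatch', b) for every pair of records sharing a key.

    `records` is an iterable of (record_uri, inchikey) pairs.
    """
    groups = defaultdict(list)
    for uri, key in records:
        groups[key].append(uri)

    links = []
    for uris in groups.values():
        # Sorting makes the output deterministic; each unordered pair appears once.
        for a, b in combinations(sorted(uris), 2):
            links.append((a, "skos:exactMatch", b))
    return sorted(links)


# Hypothetical records: two share the key from the slide, one does not.
records = [
    ("dataset_a:rec1", "YLMAHDNUQAMNNX-UHFFFAOYSA-N"),
    ("dataset_b:rec7", "YLMAHDNUQAMNNX-UHFFFAOYSA-N"),
    ("dataset_c:rec3", "PLACEHOLDER-KEY"),
]
for triple in exact_match_links(records):
    print(triple)
```

Under a strict lens only these links are followed, so a record holding the salt and a record holding the parent compound stay separate, which is the behaviour an analysis use case needs.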
Commercial ibuprofen is a racemic mixture containing equal proportions of two chiral forms. Both chiral forms are equally active. Typically, the user will want to retrieve information for any stereoisomer.
CHEMBL427526
CHEMBL521
CHEMBL175
Lens Effects: Ibuprofen
Default Lens
Stereoisomer Lens
Mapping Generation
ops:OPS437281
✔
ops:OPS380297
has_stereoundefined_parent [ci:CHEMINF_000456]
ops:OPS380297
is_stereoisomer_of [ci:CHEMINF_000461]
Other relationships
• has part
• is tautomer of
• uncharged counterpart
• isotope…
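The lens idea can be sketched as a named set of relationship types that are allowed to count as "the same thing" for a given query. The Python below is a minimal illustration: the lens contents, the `ex:` and `other:` identifiers, and the choice to treat links as symmetric and transitive are all assumptions made here. Only the predicate identifiers (skos:exactMatch, ci:CHEMINF_000456, ci:CHEMINF_000461) and the lens names come from the slides.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]

# Assumed lens definitions: which predicates each lens is willing to follow.
LENSES: Dict[str, Set[str]] = {
    "default": {"skos:exactMatch"},
    "stereoisomer": {
        "skos:exactMatch",
        "ci:CHEMINF_000456",  # has_stereoundefined_parent
        "ci:CHEMINF_000461",  # is_stereoisomer_of
    },
}

# Hypothetical link set for an ibuprofen-like case.
LINKS = [
    ("ex:ibuprofen", "skos:exactMatch", "other:ibuprofen"),
    ("ex:S-ibuprofen", "ci:CHEMINF_000456", "ex:ibuprofen"),
    ("ex:R-ibuprofen", "ci:CHEMINF_000456", "ex:ibuprofen"),
    ("ex:S-ibuprofen", "ci:CHEMINF_000461", "ex:R-ibuprofen"),
]


def equivalents(start: str, links: Iterable[Triple], lens: str) -> Set[str]:
    """Return every identifier reachable from `start` via links the lens allows."""
    allowed = LENSES[lens]
    neighbours: Dict[str, Set[str]] = {}
    for subj, pred, obj in links:
        if pred in allowed:
            # Links are followed in both directions in this sketch.
            neighbours.setdefault(subj, set()).add(obj)
            neighbours.setdefault(obj, set()).add(subj)

    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first traversal over the allowed links
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(equivalents("ex:ibuprofen", LINKS, "default")))
print(sorted(equivalents("ex:ibuprofen", LINKS, "stereoisomer")))
```

Switching the lens changes the result set without touching the underlying links, so the other relationships listed above (has part, is tautomer of, and so on) could be added as further lenses in the same way.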
OpenPHACTS UI
http://explorer.openphacts.org/
Explorer Screenshot
OpenPHACTS - Summary
• Principal difference – inter-domain links
• More complex, but still structure-centric data model
• Ontological relationships introduced
• Chemical Lenses – new type of search
Chemistry Data Platform – 2014 - …
Dimensions and complexity of science
RSC Archive – since 1841
Digitally Enabling RSC Archive
ChemSpider Synthetic Pages
Compounds
Reaction
Analytical Data
Text and References
RSC Databases
RSC Compounds
RSC Reactions
RSC Spectra
RSC Crystals
RSC Polymers
RSC Materials
RSC Assays
RSC Algorithms
RSC Models
…and on…
Compounds domain
Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and private databases
• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL
– Automated quality control system
Chemistry Validation and Standardization Platform
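An automated quality control pass of this kind can be pictured as a list of small rules, each inspecting a record and reporting an issue with a severity. The sketch below is only an illustration of that pattern: the rule names, the severity labels and the dictionary record layout are assumptions made here, and the checks are far simpler than real structure validation, which needs a cheminformatics toolkit. It does not reproduce CVSP's actual rule set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Issue:
    rule: str
    severity: str  # "error" or "warning" in this sketch
    message: str


Record = Dict[str, object]
Rule = Callable[[Record], Optional[Issue]]


def missing_inchi(record: Record) -> Optional[Issue]:
    if not record.get("inchi"):
        return Issue("missing_inchi", "error", "record has no InChI")
    return None


def malformed_inchi(record: Record) -> Optional[Issue]:
    inchi = record.get("inchi")
    # Every InChI string starts with the literal prefix "InChI=".
    if inchi and not str(inchi).startswith("InChI="):
        return Issue("malformed_inchi", "error", "InChI lacks the 'InChI=' prefix")
    return None


def no_name(record: Record) -> Optional[Issue]:
    if not record.get("synonyms"):
        return Issue("no_name", "warning", "record has no synonyms")
    return None


RULES: List[Rule] = [missing_inchi, malformed_inchi, no_name]


def validate(record: Record, rules: Optional[List[Rule]] = None) -> List[Issue]:
    """Run every rule against the record and collect the issues raised."""
    issues = []
    for rule in (RULES if rules is None else rules):
        issue = rule(record)
        if issue is not None:
            issues.append(issue)
    return issues


for issue in validate({"inchi": "1S/H2O/h1H2", "synonyms": []}):
    print(issue.severity, issue.rule, issue.message)
```

Keeping each rule as an independent function means new checks can be added without changing the driver, and the same rule list can be run unchanged over records from any of the databases listed above.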
Reactions domain
Analytical data domain
Crystallography domain
Chemistry Data Platform - Summary
• Simplified models within each domain
• Each domain is described by its own model with embedded semantics
• No proper domain-specific identifiers
• Extensive quality control – CVSP (DOI 10.1186/s13321-015-0072-8)
There is no way back