Scientific Lenses over Linked Data: Identity Management in the Open PHACTS project
Alasdair J G Gray
www.alasdairjggray.co.uk
@gray_alasdair
http://c745.r45.cf2.rackcdn.com/img/2009/lens_filter_coasters.jpg
Open PHACTS Use Case
“Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”
• Chemical Properties (Chemspider)
• Launched drugs (Drugbank)
• Human => Mouse (Homologene)
• Protein Families (Enzyme)
• Bioactivity Data (ChEMBL)
• … other info (Uniprot/Entrez etc.)
21/05/2014 Brighton Seminar 2
[Diagram labels: Literature, PubChem, Genbank, Patents, Databases, Downloads; Data Integration, Data Analysis, Firewalled Databases. Repeat @ each company.]
Lowering industry firewalls: pre-competitive informatics in drug discovery Nature Reviews Drug Discovery (2009) 8, 701-708 doi:10.1038/nrd2944
A single, shared solution.
Funded under
• IMI: 2011-14
• ENSO: 2014-16
Pre-competitive Informatics
Open PHACTS Discovery Platform
Drug Discovery Platform
Apps
Domain API
Interactive responses
Production quality integration platform
Method Calls
API Hits (April 2013 – March 2014)
15.8 million total hits
An “App Store”?
http://www.openphactsfoundation.org/apps.html
Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium
MOE Collector Cytophacts Utopia Garfield SciBite
KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna
Drug
Disease
Pathway
Target
https://dev.openphacts.org/
Linked Data API
OPS Discovery Platform
[Architecture diagram. Components shown:
• Apps
• Linked Data API (RDF/XML, TTL, JSON) and Domain Specific Services
• Core Platform: Semantic Workflow Engine; Identity Resolution Service; Identifier Management Service; Chemistry Registration, Normalisation & Q/C; Indexing; Data Cache (Virtuoso Triple Store)
• Data sources (RDF, Nanopub, Db), each with a VoID description: Public Content, Commercial, Public Ontologies, User Annotations
• Example inputs: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”]
Platform Interaction
Provenance
Multiple Identities
Andy Law's Third Law: “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
P12047, X31045, P12047, GB:29384, RS_2353
Are these the same thing?
Gleevec® = Imatinib Mesylate
Drugbank, ChemSpider, PubChem
Imatinib
Mesylate
Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
Multiple Links: Different Reasons
Link: skos:closeMatch; Reason: non-salt form
Link: skos:exactMatch; Reason: drug name
Scale: Strict (analysing) to Relaxed (browsing)
Dynamic Equality
skos:exactMatch (InChI)
Scale: Strict (analysing) to Relaxed (browsing)
Dynamic Equality
skos:closeMatch (Drug Name)
skos:closeMatch (Drug Name)
skos:exactMatch (InChI)
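One way to picture dynamic equality is as a filter over links keyed by their justification. The sketch below is only an illustration under stated assumptions: the record names are hypothetical placeholders, and modelling a lens as a set of accepted justifications (`LENSES`) is an assumption, not the platform's actual data model.

```python
# Each link: (source, target, predicate, justification). The record names are
# hypothetical placeholders, not identifiers from any real dataset.
LINKS = [
    ("recordA", "recordB", "skos:closeMatch", "Drug Name"),
    ("recordA", "recordC", "skos:closeMatch", "Drug Name"),
    ("recordB", "recordC", "skos:exactMatch", "InChI"),
]

# Assumption: a lens is modelled as the set of justifications it accepts.
LENSES = {
    "strict": {"InChI"},
    "relaxed": {"InChI", "Drug Name"},
}


def links_under(lens_name, links=LINKS):
    """Keep only the links whose justification the lens accepts."""
    accepted = LENSES[lens_name]
    return [link for link in links if link[3] in accepted]


def equivalents(uri, lens_name):
    """URIs directly linked to `uri` under the chosen lens (no transitivity)."""
    found = set()
    for source, target, _predicate, _why in links_under(lens_name):
        if source == uri:
            found.add(target)
        elif target == uri:
            found.add(source)
    return found
```

Under the strict lens, `equivalents("recordB", "strict")` returns only the InChI-matched record. Under the relaxed lens the drug-name links are followed as well.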
Initial Connectivity
Datasets 37
Linksets 104
Links 7,096,712
Justifications 7
Compound Information
Genes == Proteins?
BRCA1
Breast cancer type 1 susceptibility protein
http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png
http://en.wikipedia.org/wiki/File:BRCA1_en.png
Proceed with Caution!
Co-reference Computation
Rules ensure
• Unrestricted transitivity within conceptual type
• Restrict crossing conceptual types
Based on justifications
Provenance captured
[Diagram: conceptual types Target, Protein and Gene, connected by links with cardinalities 0..*, 0..*, 0..*, 0..1 and 0..1]
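The rules above can be sketched as a breadth-first search over the given links. This is a minimal illustration, not the platform's implementation. It assumes links are symmetric, that each chain may cross conceptual types at most once, and that a crossing is only allowed for type pairs listed in `allowed_crossings`. All identifiers and justification strings are hypothetical.

```python
from collections import deque


def infer_links(links, types, allowed_crossings):
    """Derive co-reference links by chaining given links under two rules.

    links: iterable of (a, b, justification), treated as symmetric.
    types: dict mapping each identifier to its conceptual type.
    allowed_crossings: set of frozensets naming type pairs that may be linked.
    Returns {(start, end): chain}; the chain of links used is the provenance.
    """
    graph = {}
    for a, b, why in links:
        graph.setdefault(a, []).append((b, why))
        graph.setdefault(b, []).append((a, why))

    inferred = {}
    for start in graph:
        # Search state: (identifier, has this chain already crossed types?).
        seen = {(start, False)}
        queue = deque([(start, False, [])])
        while queue:
            node, crossed, chain = queue.popleft()
            for nxt, why in graph[node]:
                crossing = types[node] != types[nxt]
                pair = frozenset((types[node], types[nxt]))
                if crossing and (crossed or pair not in allowed_crossings):
                    continue  # restricted: second crossing or forbidden pair
                state = (nxt, crossed or crossing)
                if state in seen:
                    continue
                seen.add(state)
                new_chain = chain + [(node, nxt, why)]
                queue.append((nxt, state[1], new_chain))
                if nxt != start:
                    inferred.setdefault((start, nxt), new_chain)
    return inferred


# Hypothetical example data.
TYPES = {"gene:1": "Gene", "gene:2": "Gene", "protein:1": "Protein",
         "protein:2": "Protein", "compound:1": "Compound"}
LINKS = [("gene:1", "gene:2", "same gene"),
         ("gene:2", "protein:1", "gene encodes protein"),
         ("protein:1", "protein:2", "same protein"),
         ("protein:2", "compound:1", "erroneous link")]
ALLOWED = {frozenset(("Gene", "Protein"))}

result = infer_links(LINKS, TYPES, ALLOWED)
```

In this example `gene:1` becomes linked to `protein:2` through a three-link chain, while `compound:1` is never linked to anything because no Compound crossing is allowed.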
Inferred Connectivity
Datasets 37
Linksets 883
Links 17,383,846
Justifications 7
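To make the comparison with the initial connectivity figures concrete, the short calculation below derives growth factors from the counts on the two connectivity slides. The variable names are illustrative only.

```python
# Counts copied from the Initial Connectivity and Inferred Connectivity slides.
initial = {"linksets": 104, "links": 7_096_712}
inferred = {"linksets": 883, "links": 17_383_846}

# Growth factor per measure after computing co-reference.
growth = {name: inferred[name] / initial[name] for name in initial}

for name, factor in growth.items():
    print(f"{name}: x{factor:.2f}")
```

The number of linksets grows by a factor of about 8.5 and the number of links by about 2.4, with the same 37 datasets and 7 justifications.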
BridgeDb
http://ops.rsc.org/OPS45975 http://ops.rsc.org/OPS45978
has_isotopically_unspecified_parent [CHEMINF:000459]
has OPS normalized counterpart [CHEMINF:000458]
http://ops.rsc.org/OPS45991
is_tautomer_of [chebi:is_tautomer_of]
http://ops.rsc.org/OPS45987
has_stereoundefined_parent [CHEMINF:000456]
http://ops.rsc.org/OPS45981
Lenses
Query Expansion
Input query Q:
  GRAPH <http://rdf.chemspider.com> {
    cw:979b545d-f9a9 cheminf:logd ?logd .
  }
Identity Mapping Service (BridgeDB), using Profiles and Mappings:
  cw:979b545d-f9a9, L1 → [cw:979b545d-f9a9, cs:2157, chembl:1280, db:db00945]
Query Expander Service: Q, L1 → Q’
The triple pattern
  cw:979b545d-f9a9 cheminf:logd ?logd .
is rewritten in Q’ to
  ?iri cheminf:logd ?logd .
  FILTER (?iri = cw:979b545d-f9a9 || ?iri = cs:2157 || ?iri = chembl:1280 || ?iri = db:db00945)
Can also be achieved through UNION
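The two rewriting styles can be sketched as simple string builders. This is a minimal sketch, not the Query Expander Service itself: `MAPPINGS` is a hypothetical in-memory stand-in for the call to the identity mapping service, and the function names are invented for illustration.

```python
# Hypothetical stand-in for the identity mapping service:
# (URI, lens) -> list of co-referent URIs, as in the slide's example.
MAPPINGS = {
    ("cw:979b545d-f9a9", "L1"): [
        "cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945",
    ],
}


def lookup(uri, lens):
    """Return the co-referent URIs for `uri` under `lens` (itself if unknown)."""
    return MAPPINGS.get((uri, lens), [uri])


def expand_with_filter(uri, lens, pattern_tail, var="?iri"):
    """Replace the URI with a variable and constrain it with a FILTER."""
    tests = " || ".join(f"{var} = {u}" for u in lookup(uri, lens))
    return f"{var} {pattern_tail} .\nFILTER ({tests})"


def expand_with_union(uri, lens, pattern_tail):
    """Repeat the triple pattern once per co-referent URI, joined by UNION."""
    blocks = [f"{{ {u} {pattern_tail} . }}" for u in lookup(uri, lens)]
    return "\nUNION\n".join(blocks)


filter_query = expand_with_filter("cw:979b545d-f9a9", "L1", "cheminf:logd ?logd")
union_query = expand_with_union("cw:979b545d-f9a9", "L1", "cheminf:logd ?logd")
```

Both functions return a fragment that would replace the original triple pattern in the query. A URI with no known mappings is left as a single-element expansion.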
Experiment
Is it feasible to use a stand-off mapping service?
• Baselines (no external call):
  – “Perfect” URIs
  – Linked data querying
• Expansion approaches (external service call):
  – FILTER by Graph
  – UNION by Graph
C. Y. A. Brenninkmeijer, C. A. Goble, A. J. G. Gray, P. T. Groth, A. Loizou, S. Pettifer: Including Co-referent URIs in a SPARQL Query. COLD 2013. http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
“Perfect” URI Baseline
WHERE {
  GRAPH <chemspider> {
    cs:2157 cheminf:logp ?logp .
  }
  GRAPH <chembl> {
    chembl_mol:m1280 cheminf:mw ?mw .
  }
}
Linked Data Baseline
WHERE {
  GRAPH <chemspider> {
    cs:2157 cheminf:logp ?logp .
  }
  GRAPH <chembl> {
    ?chemblid cheminf:mw ?mw .
  }
  cs:2157 skos:exactMatch ?chemblid .
}
Queries
Drawn from Open PHACTS API:
1. Simple compound information (1)
2. Compound information (1)
3. Compound pharmacology (M)
4. Simple target information (1)
5. Target information (1)
6. Target pharmacology (M)
Experiment Data
Data: 167,783,592 triples
Mappings: 2,114,584 triples
Lenses: 1
[Figure: Average execution times]
[Figure: Q6: Target Pharmacology]
Conclusions
• Computing co-reference advantageous
  – Requires fewer raw linksets
  – Larger coverage across datasets
• Rules ensure control
  – Genes can equal proteins
  – Compounds never equal proteins
• Provenance captured throughout
Conclusions
• Query expansion slower in general
  – Due to separate service call
  – Difference below human perception
  – UNION faster than FILTER on Virtuoso
• Stand-off mappings feasible
• Infrastructure can support lenses
Scale: Strict (analysing) to Relaxed (browsing)