New searching paradigms in drug discovery enabled by semantic integration of public data
Indiana University School of Informatics and Computing
David Wild (1), Erik Stolterman (1), Mic Lajiness (2)
(1) Indiana University School of Informatics & Computing
(2) Eli Lilly & Co, Indianapolis
More information at http://djwild.info
“Information is cheap. Understanding is expensive” (Karl Fast)
Searching in drug discovery
Web searching
  Only based on keyword co-occurrence, promotes “sloppy” searching
  Only covers data exposed as web pages
  Poor coverage of the literature
  No mechanisms for hypothesis testing, discovery, advanced searching
Commercial literature & patent searching
  Searching curated literature data in commercial datasets
  Improved semantics and promotes careful searching
  Limited by dataset and system
Database searching
  Direct searching of experimental data, from bioassays, literature extraction, etc
  Completely siloed – ChEMBL, PubChem, UniProt, etc
  Tools often tied to the data
  Some cross-linking
Searching strategies differ widely
Study of academic searching – Wild & Beckman in Banville, D. (ed). Chemical Information Mining: Facilitating Literature-Based Discovery. CRC Press 2008.
  Older academics (professors, postdocs) tend to have a small number of trusted sources
  These sources differ widely even for the same domain (e.g. SciFinder vs WoS vs print journals)
  Younger academics (students) almost always start with Google and use other tools if necessary, but readily adopt other searching tools when they are introduced to them
Industry practitioners tend to use wider subsets of commercial and public resources, but again these differ quite widely
The choice of searching tools determines the data that is searched
Current Searching - Web
Search engine – Keyword Search
[Diagram: data sources PubMed, PubChem, UniProt, Web Pages, SIDER, BindingDB, ChemSpider, DrugBank, CAS, Reaxys, Internal]
Siloed Database Search
[Diagram: data sources PubMed, PubChem, UniProt, Web Pages, SIDER, BindingDB, ChemSpider, DrugBank, CAS, Reaxys, Internal]
Google Search
[Figure: search results labeled by source: UniProt, PubMed, ChemSpider]
ChemBioGrid: a web service infrastructure
Provides a common REST web service infrastructure for cheminformatics and drug discovery tools, algorithms and data sources
Product of a 2005-2008 NIH grant
Provides for atomic web services, and aggregate web services which in turn use other services
For more information see Journal of Chemical Information and Modeling, 2007; 47(4) pp 1303-1307 and http://chembiogrid.info
Enabled production of “mash-up” searching applications (BMC Bioinformatics, 2007; 8:487)
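As a rough illustration of the atomic versus aggregate distinction, the sketch below models each atomic service as a plain Python function and an aggregate service as a function that calls them. The function names, the lookup tables and the entity names are all hypothetical placeholders; none of them are ChemBioGrid endpoints or data.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for atomic services (NOT real ChemBioGrid endpoints).
_SIMILAR: Dict[str, List[str]] = {"compound_X": ["compound_A", "compound_B"]}
_TARGETS: Dict[str, List[str]] = {
    "compound_A": ["target_1"],
    "compound_B": ["target_1", "target_2"],
}


def similar_compounds(compound: str) -> List[str]:
    """Atomic service: return structurally similar compounds (toy lookup)."""
    return _SIMILAR.get(compound, [])


def known_targets(compound: str) -> List[str]:
    """Atomic service: return targets with recorded activity (toy lookup)."""
    return _TARGETS.get(compound, [])


def profile_compound(compound: str) -> Dict[str, List[str]]:
    """Aggregate service: chain the two atomic services into one answer."""
    profile: Dict[str, List[str]] = {}
    for neighbour in similar_compounds(compound):
        profile[neighbour] = known_targets(neighbour)
    return profile
```

In a real deployment each function would presumably be an HTTP call to a separate REST resource; the composition pattern stays the same.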
WENDI – what can we find relating to compound X?
Designed to profile a new compound against the network of available public information through chemical structure similarity
Product of a 3-year collaboration with Eli Lilly
Makes a variety of predictions including anti-tumor activity, and various toxicity parameters; explores bioactivities of similar compounds and their relationships with genes and diseases using multiple datasets (DrugBank, PubChem, ChEMBL, etc); finds journal articles which discuss similar compounds and their relationships to genes and diseases
Presents results as interactive web page or in PDF report format
Links to deeper analysis tools including Chemogenomic Explorer, ChemoHub, Chem2Bio2RDF
Componentized as web service and interface to allow easy integration with other tools (e.g. Chemogenomic Explorer) with XML output
For more information see Journal of Cheminformatics, 2010, 2, 6
WENDI results page
Semantic Technologies: an enabler for integration
Allows simple, flexible description of heterogeneous graphs of data relationships (RDF), optionally following the rules of an ontology (OWL)
Strengths
  Merging datasets and moving data between repositories is technically straightforward – dataset mappings are themselves described in RDF (and OWL). RDF and OWL are highly standardized
  Powerful cross-dataset searching with SPARQL
  Increasing availability of powerful off-the-shelf searching and visualization tools (TopBraid, etc)
  Allows application of graph theory algorithms to data
  Can express data provenance in RDF
Weaknesses
  Just emerging from early adopters phase – received bad press in pharma as hyped too early
  Triple stores historically less efficient than relational DBMSs (but rapidly changing)
  Most focus has been on data and integration rather than algorithms to use the data
  Difficulty weighting edges in a relational graph
http://blog.project-sierra.de/archives/1639
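To make the "merging is technically straightforward" point concrete, here is a minimal sketch that treats each dataset as a set of (subject, predicate, object) triples in plain Python. It is not an RDF library and uses no real vocabulary: the predicate names `inhibits` and `associated_with` and the identifier prefixes are assumptions chosen for illustration, and the two triples only echo relationships mentioned elsewhere in these slides.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Two toy "datasets", each just a set of (subject, predicate, object) triples.
drug_data: Set[Triple] = {
    ("drug:ibuprofen", "inhibits", "gene:PTGS2"),
}
disease_data: Set[Triple] = {
    ("gene:PTGS2", "associated_with", "disease:parkinson"),
}

# Merging two graphs is a set union: no schema migration is needed.
merged: Set[Triple] = drug_data | disease_data


def match(graph: Set[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def two_hop(graph: Set[Triple], start: str, end: str):
    """Find intermediate nodes m such that start -> m -> end both exist."""
    middles = {t[2] for t in match(graph, s=start)}
    return sorted(m for m in middles if match(graph, s=m, o=end))
```

Neither toy dataset alone connects the drug to the disease; the two-hop link only appears in the merged graph, which is the kind of cross-dataset question SPARQL is used for.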
Systems chemical biology + Semantic Web
Drug Discovery Today, 2012, in press.
Current Semantic Searching
SPARQL searching & RDF Triple Store
[Diagram: data sources PubMed, PubChem, UniProt, Web Pages, SIDER, BindingDB, ChemSpider, DrugBank, CAS, Reaxys, Internal]
Including commercial and internal data
Keyword, SPARQL searching & RDF Triple Store
[Diagram: data sources PubMed, PubChem, UniProt, Web Pages, SIDER, BindingDB, ChemSpider, DrugBank, CAS, Reaxys, Internal]
Chem2Bio2RDF – www.chem2bio2rdf.org
Semantically integrates 42 heterogeneous public datasets related to drug discovery in a fast Virtuoso triple-store with SPARQL endpoint (linked from main site)
Datasets cover chemistry, chemogenomics, biology, systems & pathways, pharmacology, phenotypes, toxicology, glycomics and publications, and biological entities of compounds, drugs, targets, genes, pathways, diseases and side-effects
Major datasets include PubChem, ChEMBL, DrugBank, PharmGKB, BindingDB, STITCH, CTD, KEGG, SWISSPROT, PDB, SIDER, PubMed. Full set at http://chem2bio2rdf.wikispaces.com/Datasets
Holds data on ~31m chemical structures, ~5,000 marketed drugs, ~59m bioactivity data points and ~19m publications
Linked into LOD cloud, and may form part of OpenPHACTS repository
Permits SPARQL searching using Chem2Bio2OWL ontology. For more information, see BMC Bioinformatics 2010, 11, 255.
Chem2Bio2OWL – Semantic annotation
Fills a gap in current ontologies: covers relationship of chemical compounds and drugs to targets, genes, assays and side-effects
Aligned with other ontologies: released on NCBO Bioportal (http://bioportal.bioontology.org/ontologies/1615)
Simplifies SPARQL searching by integrating equivalent classes across datasets (no longer need to explicitly specify datasets and fields)
Increases power of SPARQL searching allowing inclusion of data and relational classes (e.g. activator vs antagonist)
For more information see http://chem2bio2owl.wikispaces.com. Publication in progress.
Finding multi-target inhibitors of MAPK pathway with a SPARQL query
PREFIX pubchem: <http://chem2bio2rdf.org/pubchem/resource/>
PREFIX kegg: <http://chem2bio2rdf.org/kegg/resource/>
PREFIX uniprot: <http://chem2bio2rdf.org/uniprot/resource/>

SELECT ?compound_cid (count(?compound_cid) as ?active_assays)
FROM <http://chem2bio2rdf.org/pubchem>
FROM <http://chem2bio2rdf.org/kegg>
FROM <http://chem2bio2rdf.org/uniprot>
WHERE {
  ?bioassay pubchem:CID ?compound_cid .
  ?bioassay pubchem:outcome ?activity .
  FILTER (?activity=2) .
  ?bioassay pubchem:Score ?score .
  FILTER (?score>50) .
  ?bioassay pubchem:gi ?gi .
  ?uniprot uniprot:gi ?gi .
  ?pathway kegg:protein ?uniprot .
  ?pathway kegg:Pathway_name ?pathway_name .
  FILTER regex(?pathway_name,"MAPK signaling pathway","i") .
}
GROUP BY ?compound_cid
HAVING (count(*)>1)
BMC Bioinformatics 2010, 11, 255.
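For readers less familiar with SPARQL, the logic of the query above can be approximated in plain Python over in-memory records. This is only a sketch of the join, filter and group steps: the record values, protein identifiers and the second pathway name are invented, and the dictionaries stand in for the PubChem, UniProt and KEGG graphs rather than reproducing their actual structure.

```python
import re
from collections import Counter

# Toy records; field names mirror the query's predicates, values are invented.
bioassays = [
    {"CID": 1, "outcome": 2, "Score": 80, "gi": 100},
    {"CID": 1, "outcome": 2, "Score": 90, "gi": 101},
    {"CID": 2, "outcome": 2, "Score": 95, "gi": 100},
    {"CID": 3, "outcome": 1, "Score": 99, "gi": 101},  # outcome is not 2
    {"CID": 4, "outcome": 2, "Score": 40, "gi": 100},  # score too low
    {"CID": 4, "outcome": 2, "Score": 70, "gi": 102},  # protein outside pathway
]
gi_to_uniprot = {100: "P_A", 101: "P_B", 102: "P_C"}
pathway_proteins = {
    "MAPK signaling pathway": {"P_A", "P_B"},
    "Some other pathway": {"P_C"},
}


def multi_assay_hits(pattern: str = "MAPK signaling pathway") -> dict:
    """Count qualifying assays per compound; keep compounds with more than one."""
    regex = re.compile(pattern, re.IGNORECASE)  # FILTER regex(..., "i")
    in_pathway = set()
    for name, proteins in pathway_proteins.items():
        if regex.search(name):
            in_pathway |= proteins

    counts = Counter()
    for row in bioassays:
        if row["outcome"] != 2 or row["Score"] <= 50:  # the two numeric FILTERs
            continue
        if gi_to_uniprot.get(row["gi"]) in in_pathway:  # join assay -> protein -> pathway
            counts[row["CID"]] += 1                     # GROUP BY ?compound_cid

    return {cid: n for cid, n in counts.items() if n > 1}  # HAVING (count(*)>1)
```

One simplification to note: the sketch counts each assay once, whereas the SPARQL version counts one row per matching solution, so a protein listed under several matching pathway entries would be counted more than once there.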
Semantic Association Algorithms at IU
Association Search – visualize literature supported associations between any two entities (compound, drug, gene, pathway, disease, side effect). PLoS One, 2011, 6(12), e27506.
Semantic Link Association Prediction (SLAP) – find the entities (compound, drug, gene, pathway, disease, side effect) most highly associated with any other entity, using probabilistic weightings of graph edges derived from public experimental datasets. PLoS Computational Biology, in review.
BioLDA – find most highly associated entities to any other entity based on a complex topic model analysis of the literature (PubMed). PLoS One, 2011, 6 (3), e17243
See also: WENDI (J. Cheminf., 2010,2,6); Chemogenomic Explorer (BMC Bio. 2011,12,256), ChemLDA, ChemBioGrid (J. Chem. Inf. Model., 2007; 47(4) pp 1303-1307)
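The idea of ranking associations over a graph with weighted edges can be sketched as follows. This is not the SLAP algorithm, whose scoring is not described in these slides; it is a simple stand-in that assumes each edge carries a weight between 0 and 1 and scores a pair of entities by summing, over all simple paths up to a length limit, the product of the edge weights. The graph, entity names and weights are invented.

```python
# Toy undirected graph; entity names and weights are invented placeholders.
EDGES = {
    ("drug_D", "gene_G1"): 0.9,
    ("drug_D", "gene_G2"): 0.5,
    ("gene_G1", "disease_X"): 0.8,
    ("gene_G2", "disease_X"): 0.4,
    ("gene_G2", "disease_Y"): 0.6,
}


def build_adjacency(edges):
    """Turn an edge dictionary into a symmetric adjacency map."""
    adj = {}
    for (a, b), w in edges.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    return adj


def weighted_paths(adj, start, end, max_edges=3):
    """Enumerate simple paths from start to end with the product of their weights."""
    found = []

    def walk(node, visited, weight):
        if node == end:
            found.append((visited, weight))
            return
        if len(visited) - 1 >= max_edges:  # path already uses max_edges edges
            return
        for nxt, w in adj.get(node, {}).items():
            if nxt not in visited:
                walk(nxt, visited + [nxt], weight * w)

    walk(start, [start], 1.0)
    return found


def association_score(adj, start, end, max_edges=3):
    """Sum of path weights: more and stronger paths give a higher score."""
    return sum(w for _, w in weighted_paths(adj, start, end, max_edges))
```

With this toy graph, `disease_X` is reached from `drug_D` by two short paths and so scores higher than `disease_Y`, which is reached by one weaker path.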
Ibuprofen and Parkinson’s Disease
Recent paper got attention of news outlets (Gao et al. Neurology, 78(10))
Observational meta-study looking at 6-year Parkinson’s occurrence related to NSAID usage
Statistical preventive link with Ibuprofen, but not aspirin or acetaminophen
Clinical finding – mechanism of action unknown
What can we learn from Google, from semantic association methods, and from both together?
Ibuprofen and Parkinson’s Disease
More Google searches…
Ibuprofen-neuroinflammation-Parkinson’s
Mining relational paths – association search
Identified 70 genes associated with Ibuprofen and Parkinson’s disease, 9 of which are related to inflammation (IL1A, IL1B, IL1RN, IL6, LTA, NFKB1, NFKBIA, PTGS2, TNF)
Clear direct association between PTGS2 (COX2) and Parkinson’s Disease via CTD (leading to literature)
Single gene, AMBP, differentially associated with Ibuprofen and Parkinson’s Disease but not with other NSAIDs (AMBP has shown potential as a Parkinson’s biomarker)
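The two kinds of result reported above (genes shared by a drug and a disease, and genes shared by them but absent for related drugs) reduce to set operations once each entity has a set of associated genes. The sketch below uses placeholder entity and gene names only; it does not reproduce the 70-gene result or any real association data.

```python
# Toy association sets (entity -> associated genes); all names are placeholders.
ASSOCIATED_GENES = {
    "drug_A": {"gene_1", "gene_2", "gene_3", "gene_4"},
    "drug_B": {"gene_1", "gene_2"},
    "drug_C": {"gene_2", "gene_3"},
    "disease_X": {"gene_1", "gene_3", "gene_4", "gene_5"},
}
# Hypothetical annotation set, standing in for "related to inflammation".
INFLAMMATION_GENES = {"gene_1", "gene_5"}


def shared_genes(a: str, b: str) -> set:
    """Genes associated with both entities."""
    return ASSOCIATED_GENES[a] & ASSOCIATED_GENES[b]


def differential_genes(drug: str, disease: str, other_drugs: list) -> set:
    """Genes shared by drug and disease but not associated with any other drug."""
    result = shared_genes(drug, disease)
    for other in other_drugs:
        result = result - ASSOCIATED_GENES[other]
    return result
```

Intersecting `shared_genes(...)` with an annotation set such as `INFLAMMATION_GENES` gives the annotated subset, analogous to picking out the inflammation-related genes from the shared list.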
SLAP – target profile for Ibuprofen
COX2 – main target
Regulate neurotransmitter release
COX1
Dopamine receptor
Serotonin receptors
Cannabinoid receptors
Muscarinic receptor (motor control)
vs acetaminophen and aspirin
SLAP - Biologically Similar Drugs
Dopamine agonist, used in Parkinson’s
Dopamine agonist, used in Parkinson’s
BLASC
Calculates a KL-divergence score for any pair of bioterms (drugs, genes, side-effects, pathways, etc)
Available from http://djwild.info
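For reference, the standard definitions of entropy and KL divergence for discrete distributions can be written in a few lines. How BLASC builds the distribution for each bioterm, and which logarithm base it uses, is not stated in these slides; the sketch assumes each term is already represented as a probability vector of the same length (for example over topics) and uses base 2.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy H(P) in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """D_KL(P || Q) in bits.

    Entries of q are floored at eps (an assumption of this sketch) so that a
    zero in q where p is positive gives a large finite value, not an error.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log2(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )
```

KL divergence is asymmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ; a lower value means the two distributions are more alike.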
BLASC results
Item1 | Item2 | Item1 literatures | Item2 literatures | # of co-literatures | co-occur score | Item1 entropy | Item2 entropy | KL divergence
Oxaprozin | Parkinson Disease | 124 | 3587 | 0 | -13.005 | 4.764 | 1.709 | 10.159
Meclofenamic acid | Parkinson Disease | 234 | 3587 | 0 | -13.64 | 4.464 | 1.709 | 12.262
Lumiracoxib | Parkinson Disease | 144 | 3587 | 0 | -13.155 | 4.188 | 1.709 | 12.72
Fenoprofen | Parkinson Disease | 341 | 3587 | 0 | -14.017 | 3.476 | 1.709 | 13.586
Tenoxicam | Parkinson Disease | 451 | 3587 | 0 | -14.297 | 3.634 | 1.709 | 13.895
Sulindac | Parkinson Disease | 1390 | 3587 | 0 | -15.422 | 3.367 | 1.709 | 14.309
Etodolac | Parkinson Disease | 442 | 3587 | 0 | -14.276 | 3.525 | 1.709 | 14.635
Nabumetone | Parkinson Disease | 348 | 3587 | 0 | -14.037 | 2.746 | 1.709 | 14.639
Diflunisal | Parkinson Disease | 539 | 3587 | 0 | -14.475 | 2.616 | 1.709 | 15.13
Valdecoxib | Parkinson Disease | 337 | 3587 | 0 | -14.005 | 2.693 | 1.709 | 15.161
Rofecoxib | Parkinson Disease | 1689 | 3587 | 0 | -15.617 | 3.466 | 1.709 | 15.316
Etoricoxib | Parkinson Disease | 268 | 3587 | 0 | -13.776 | 2.477 | 1.709 | 16.372
Ketorolac | Parkinson Disease | 1504 | 3587 | 0 | -15.501 | 2.598 | 1.709 | 17.208
Meloxicam | Parkinson Disease | 803 | 3587 | 0 | -14.873 | 2.346 | 1.709 | 17.835
Mefenamic acid | Parkinson Disease | 823 | 3587 | 0 | -14.898 | 1.558 | 1.709 | 17.913
Celecoxib | Parkinson Disease | 2676 | 3587 | 0 | -16.077 | 3.096 | 1.709 | 18.056
Piroxicam | Parkinson Disease | 2221 | 3587 | 0 | -15.891 | 1.495 | 1.709 | 18.397
Flurbiprofen | Parkinson Disease | 1616 | 3587 | 0 | -15.573 | 1.218 | 1.709 | 18.913
Ibuprofen | Parkinson Disease | 6791 | 3587 | 1 | -0.275 | 1.726 | 1.709 | 19.296
Naproxen | Parkinson Disease | 3600 | 3587 | 1 | 0.36 | 2.021 | 1.709 | 19.36
Aspirin | Parkinson Disease | 28248 | 3587 | 5 | -0.091 | 2.506 | 1.709 | 19.851
Indomethacin | Parkinson Disease | 29527 | 3587 | 2 | -1.052 | 2.143 | 1.709 | 20.085
Ketoprofen | Parkinson Disease | 2229 | 3587 | 0 | -15.894 | 1.066 | 1.709 | 20.117
Diclofenac | Parkinson Disease | 5484 | 3587 | 0 | -16.795 | 1.978 | 1.709 | 20.417
What do we learn from this?
Right now, our options are
Web searching, with very broad scope, ability to be “sloppy”, but very little “drill down” into datasets, publications, etc, and very limited searching options (keyword only)
Dataset searching which is siloed
Web service infrastructures allow ad hoc integration of searching through mashups
Semantic integration with associated tools allows more advanced kinds of search, exploration and prediction over integrated datasets, including hypothesis testing, etc. These are not really mapped to current searching paradigms (but are closer to traditional structured searching)
Semantic searching likely to have wide adoption in the next few years, which might allow web and semantic dataset searching to be integrated
Wolfram Alpha is a kind of prototype for this (http://www.wolframalpha.com/) but almost useless for drug discovery
Wolfram Alpha
Work in progress….
[Diagram: Query graph linking Ibuprofen, Parkinson’s Disease, Neuro-inflammation and COX2; Result graph with the same nodes plus AMBP, DRD2 and IL1A]
Karl Fast (again)
“Mess is where creativity and insight come from. We must stop thinking we need to make the world a neat and tidy place”
Cheminformatics Education at Indiana University
LuLu eBook - $29
http://slg.djwild.info
Free cheminformatics learning resources
http://icep.wikispaces.com
Residential Ph.D. program in Informatics with a Cheminformatics specialty
Distance Graduate Certificate program in Chemical Informatics
http://djwild.info/ed