New searching paradigms in drug discovery enabled by semantic integration of public data

Indiana University School of Informatics and Computing

David Wild1, Erik Stolterman1, Mic Lajiness2

1 Indiana University School of Informatics & Computing
2 Eli Lilly & Co, Indianapolis

More information at http://djwild.info

“Information is cheap. Understanding is expensive” (Karl Fast)

Searching in drug discovery

Web searching
  Only based on keyword co-occurrence, promotes “sloppy” searching
  Only covers data exposed as web pages
  Poor coverage of the literature
  No mechanisms for hypothesis testing, discovery, advanced searching

Commercial literature & patent searching
  Searching curated literature data in commercial datasets
  Improved semantics and promotes careful searching
  Limited by dataset and system

Database searching
  Direct searching of experimental data, from bioassays, literature extraction, etc
  Completely siloed – ChEMBL, PubChem, UniProt, etc
  Tools often tied to the data
  Some cross-linking

Searching strategies differ widely

Study of academic searching – Wild & Beckman in Banville, D. (ed). Chemical Information Mining: Facilitating Literature-Based Discovery. CRC Press 2008.
  Older academics (professors, postdocs) tend to have a small number of trusted sources
  These sources differ widely even for the same domain (e.g. SciFinder vs WoS vs print journals)
  Younger academics (students) almost always start with Google and use other tools if necessary, but readily adopt other searching tools when they are introduced to them

Industry practitioners tend to use wider subsets of commercial and public resources, but again these differ quite widely

The choice of searching tools determines the data that is searched

Current Searching - Web

[Diagram: “Search engine – Keyword Search”; data sources shown: PubMed, PubChem, UniProt, Web Pages, SIDER, BindingDB, ChemSpider, DrugBank, CAS, Reaxys, Internal]

Siloed Database Search

[Diagram: data sources shown: PubMed, PubChem, UniProt, Web Pages, SIDER, BindingDB, ChemSpider, DrugBank, CAS, Reaxys, Internal]

[Screenshots: Google Search, UniProt, PubMed, ChemSpider]

ChemBioGrid: a web service infrastructure

Provides a common REST web service infrastructure for cheminformatics and drug discovery tools, algorithms and data sources

Product of a 2005-2008 NIH grant

Provides for atomic web services, and aggregate web services which in turn use other services

For more information see Journal of Chemical Information and Modeling, 2007; 47(4) pp 1303-1307 and http://chembiogrid.info

Enabled production of “mash-up” searching applications (BMC Bioinformatics, 2007; 8:487)

WENDI – what can we find relating to compound X?

Designed to profile a new compound against the network of available public information through chemical structure similarity

Product of a 3-year collaboration with Eli Lilly

Makes a variety of predictions including anti-tumor activity and various toxicity parameters; explores bioactivities of similar compounds and their relationships with genes and diseases using multiple datasets (DrugBank, PubChem, ChEMBL, etc); finds journal articles which discuss similar compounds and their relationships to genes and diseases

Presents results as interactive web page or in PDF report format

Links to deeper analysis tools including Chemogenomic Explorer, ChemoHub, Chem2Bio2RDF

Componentized as web service and interface to allow easy integration with other tools (e.g. Chemogenomic Explorer) with XML output

For more information see Journal of Cheminformatics, 2010, 2, 6

WENDI results page

Semantic Technologies: an enabler for integration

Allows simple, flexible description of heterogeneous graphs of data relationships (RDF), optionally following the rules of an ontology (OWL)

Strengths
  Merging datasets and moving data between repositories is technically straightforward – dataset mappings are themselves described in RDF (and OWL)
  RDF and OWL are highly standardized
  Powerful cross-dataset searching with SPARQL
  Increasing availability of powerful off-the-shelf searching and visualization tools (TopBraid, etc)
  Allows application of graph theory algorithms to data
  Can express data provenance in RDF

Weaknesses
  Just emerging from early adopters phase – received bad press in pharma as hyped too early
  Triple stores historically less efficient than relational DBMSs (but rapidly changing)
  Most focus has been on data and integration rather than algorithms to use the data
  Difficulty weighting edges in a relational graph

http://blog.project-sierra.de/archives/1639
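The claim that merging RDF datasets is technically straightforward can be illustrated with a minimal sketch in plain Python, assuming triples are modelled as (subject, predicate, object) tuples. The `ex:` identifiers, the predicate names and the helper functions `match` and `two_hop` are hypothetical and are not taken from Chem2Bio2RDF or any RDF library; a real system would use a triple store and SPARQL.

```python
from typing import Iterable, Optional

Triple = tuple[str, str, str]  # (subject, predicate, object)


def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> list[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def two_hop(triples: Iterable[Triple], start: str, end: str) -> list[str]:
    """Find intermediate nodes m such that start -> m -> end."""
    triples = list(triples)
    return sorted({
        mid for (_, _, mid) in match(triples, s=start)
        if match(triples, s=mid, o=end)
    })


# Two hypothetical datasets, each just a set of triples.
drug_data = {("ex:Ibuprofen", "ex:hasTarget", "ex:PTGS2")}
disease_data = {("ex:PTGS2", "ex:associatedWith", "ex:ParkinsonDisease")}

# Merging two graphs is a set union; no schema migration is needed.
merged = drug_data | disease_data

print(two_hop(merged, "ex:Ibuprofen", "ex:ParkinsonDisease"))
```

The cross-dataset link only appears after the union: neither dataset alone contains a path from the drug to the disease.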

Systems chemical biology + Semantic Web

Drug Discovery Today, 2012, in press.

Current Semantic Searching

[Diagram: “SPARQL searching & RDF Triple Store”; data sources shown: PubMed, PubChem, UniProt, Web Pages, SIDER, BindingDB, ChemSpider, DrugBank, CAS, Reaxys, Internal]

Including commercial and internal data

[Diagram: “Keyword, SPARQL searching & RDF Triple Store”; data sources shown: PubMed, PubChem, UniProt, Web Pages, SIDER, BindingDB, ChemSpider, DrugBank, CAS, Reaxys, Internal]

Chem2Bio2RDF – www.chem2bio2rdf.org

Semantically integrates 42 heterogeneous public datasets related to drug discovery in a fast Virtuoso triple-store with SPARQL endpoint (linked from main site)

Datasets cover chemistry, chemogenomics, biology, systems & pathways, pharmacology, phenotypes, toxicology, glycomics and publications, and biological entities of compounds, drugs, targets, genes, pathways, diseases and side-effects

Major datasets include PubChem, ChEMBL, DrugBank, PharmGKB, BindingDB, STITCH, CTD, KEGG, SWISSPROT, PDB, SIDER, PubMed. Full set at http://chem2bio2rdf.wikispaces.com/Datasets

Holds data on ~31m chemical structures, ~5,000 marketed drugs, ~59m bioactivity data points and ~19m publications

Linked into LOD cloud, and may form part of OpenPHACTS repository

Permits SPARQL searching using Chem2Bio2OWL ontology. For more information, see BMC Bioinformatics 2010, 11, 255.


Chem2Bio2OWL – Semantic annotation

Fills a gap in current ontologies: covers relationship of chemical compounds and drugs to targets, genes, assays and side-effects

Aligned with other ontologies: released on NCBO Bioportal (http://bioportal.bioontology.org/ontologies/1615)

Simplifies SPARQL searching by integrating equivalent classes across datasets (no longer need to explicitly specify datasets and fields)

Increases power of SPARQL searching allowing inclusion of data and relational classes (e.g. activator vs antagonist)

For more information see http://chem2bio2owl.wikispaces.com. Publication in progress.


Finding multi-target inhibitors of MAPK pathway with a SPARQL query

PREFIX pubchem: <http://chem2bio2rdf.org/pubchem/resource/>
PREFIX kegg: <http://chem2bio2rdf.org/kegg/resource/>
PREFIX uniprot: <http://chem2bio2rdf.org/uniprot/resource/>

SELECT ?compound_cid (count(?compound_cid) as ?active_assays)
FROM <http://chem2bio2rdf.org/pubchem>
FROM <http://chem2bio2rdf.org/kegg>
FROM <http://chem2bio2rdf.org/uniprot>
WHERE {
  ?bioassay pubchem:CID ?compound_cid .
  ?bioassay pubchem:outcome ?activity .
  FILTER (?activity=2) .
  ?bioassay pubchem:Score ?score .
  FILTER (?score>50) .
  ?bioassay pubchem:gi ?gi .
  ?uniprot uniprot:gi ?gi .
  ?pathway kegg:protein ?uniprot .
  ?pathway kegg:Pathway_name ?pathway_name .
  FILTER regex(?pathway_name,"MAPK signaling pathway","i") .
}
GROUP BY ?compound_cid
HAVING (count(*)>1)

BMC Bioinformatics 2010, 11, 255.
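To make the logic of the query easier to follow, here is a rough Python emulation of its joins, filters and grouping over a few in-memory rows. All rows, identifiers (`gi1`, `P1`, and so on) and the function name `multi_target_hits` are invented for illustration; none of them are real PubChem, UniProt or KEGG records, and the real query runs against the triple store rather than Python lists.

```python
from collections import Counter

# Hypothetical toy rows: (compound_cid, outcome, score, gi)
bioassays = [
    (1001, 2, 80, "gi1"),
    (1001, 2, 65, "gi2"),
    (1002, 2, 90, "gi1"),
    (1002, 1, 95, "gi2"),  # rejected: outcome is not 2
    (1003, 2, 40, "gi1"),  # rejected: score is not above 50
    (1003, 2, 70, "gi3"),  # rejected: protein is not in a MAPK pathway
]

# Hypothetical gi -> protein mapping (the uniprot:gi join).
gi_to_protein = {"gi1": "P1", "gi2": "P2", "gi3": "P3"}

# Hypothetical pathway membership (the kegg:protein join).
pathway_proteins = {
    "MAPK signaling pathway": {"P1", "P2"},
    "Other pathway": {"P3"},
}


def multi_target_hits(name_pattern: str = "mapk signaling pathway") -> dict[int, int]:
    """Count qualifying assays per compound and keep compounds with more than one."""
    counts: Counter[int] = Counter()
    for cid, outcome, score, gi in bioassays:
        if outcome != 2 or score <= 50:  # FILTER (?activity=2), FILTER (?score>50)
            continue
        protein = gi_to_protein.get(gi)
        for pathway, proteins in pathway_proteins.items():
            # Case-insensitive substring test stands in for regex(..., "i").
            if name_pattern in pathway.lower() and protein in proteins:
                counts[cid] += 1
    # GROUP BY ?compound_cid HAVING (count(*)>1)
    return {cid: n for cid, n in counts.items() if n > 1}


print(multi_target_hits())
```

Only compounds with more than one qualifying assay against proteins in the pathway survive the final filter, which is what makes them candidate multi-target inhibitors.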

Semantic Association Algorithms at IU

Association Search – visualize literature-supported associations between any two entities (compound, drug, gene, pathway, disease, side effect). PLoS One, 2011, 6(12), e27506.

Semantic Link Association Prediction (SLAP) – find the entities (compound, drug, gene, pathway, disease, side effect) most highly associated with any other entity, using probabilistic weightings of graph edges derived from public experimental datasets. PLoS Computational Biology, in review.

BioLDA – find the entities most highly associated with any other entity, based on a complex topic model analysis of the literature (PubMed). PLoS One, 2011, 6(3), e17243.

See also: WENDI (J. Cheminf., 2010, 2, 6); Chemogenomic Explorer (BMC Bio. 2011, 12, 256); ChemLDA; ChemBioGrid (J. Chem. Inf. Model., 2007; 47(4) pp 1303-1307)

Ibuprofen and Parkinson’s Disease

Recent paper got attention of news outlets (Gao et al. Neurology, 78(10))

Observational meta-study looking at 6-year Parkinson’s occurrence related to NSAID usage

Statistical preventive link with Ibuprofen, but not aspirin or acetaminophen

Clinical finding – mechanism of action unknown

What can we learn from Google, from semantic association methods, and from both together?

[Screenshots: Google searches on Ibuprofen and Parkinson’s Disease, including “Ibuprofen-neuroinflammation-Parkinson’s”]

Mining relational paths – association search

Identified 70 genes associated with Ibuprofen and Parkinson’s disease, 9 of which are related to inflammation (IL1A, IL1B, IL1RN, IL6, LTA, NFKB1, NFKBIA, PTGS2, TNF)

Clear direct association between PTGS2 (COX2) and Parkinson’s Disease via CTD (leading to literature)

Single gene, AMBP, differentially associated with Ibuprofen and Parkinson’s Disease but not with other NSAIDs (AMBP has shown potential as a Parkinson’s biomarker)
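The kind of path mining described above can be sketched as a shared-neighbour search on a small undirected graph. This is only an illustration under stated assumptions: the graph holds a handful of edges, the Aspirin to PTGS2 edge is added here as an assumption and is not taken from the slides, and the helper names `linking_genes` and `differential_genes` are hypothetical. It is not the published Association Search algorithm.

```python
from collections import defaultdict

# Toy association edges. The Ibuprofen, PTGS2 and AMBP links follow the
# slide text; the Aspirin-PTGS2 edge is an added assumption for contrast.
edges = [
    ("Ibuprofen", "PTGS2"),
    ("Ibuprofen", "AMBP"),
    ("Aspirin", "PTGS2"),
    ("PTGS2", "Parkinson Disease"),
    ("AMBP", "Parkinson Disease"),
]

# Build an undirected adjacency map.
neighbours: dict[str, set[str]] = defaultdict(set)
for a, b in edges:
    neighbours[a].add(b)
    neighbours[b].add(a)


def linking_genes(drug: str, disease: str) -> set[str]:
    """Nodes directly associated with both the drug and the disease."""
    return neighbours[drug] & neighbours[disease]


def differential_genes(drug: str, other_drugs: list[str], disease: str) -> set[str]:
    """Linking nodes for `drug` that no drug in `other_drugs` shares."""
    result = linking_genes(drug, disease)
    for other in other_drugs:
        result = result - linking_genes(other, disease)
    return result


print(sorted(linking_genes("Ibuprofen", "Parkinson Disease")))
print(sorted(differential_genes("Ibuprofen", ["Aspirin"], "Parkinson Disease")))
```

On this toy graph the set difference isolates the one gene linked to the disease through Ibuprofen but not through the comparison drug, mirroring how AMBP was singled out.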

SLAP – target profile for Ibuprofen vs acetaminophen and aspirin

[Figure annotations: COX2 – main target; Regulate neurotransmitter release; COX1; Dopamine receptor; Serotonin receptors; Cannabinoid receptors; Muscarinic receptor (motor control)]

SLAP - Biologically Similar Drugs

Dopamine agonist, used in Parkinson’s

Dopamine agonist, used in Parkinson’s

BLASC

Calculates KL-divergence score for any bioterm pairs (drugs, genes, side-effects, pathways, etc)

Available from http://djwild.info
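For readers unfamiliar with the measure, the following is a minimal sketch of Shannon entropy and KL divergence between two discrete distributions, for example the topic distributions of two bioterms. The slides do not give BLASC's exact formulation, log base or smoothing, so the natural logarithm, the `eps` floor and the example distributions below are all assumptions.

```python
import math


def entropy(p: list[float]) -> float:
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two distributions over the same outcomes.

    `eps` floors q to avoid division by zero; this smoothing choice is an
    assumption, not something stated in the slides.
    """
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


# Hypothetical topic distributions for two bioterms.
drug_topics = [0.7, 0.2, 0.1]
disease_topics = [0.1, 0.2, 0.7]

print(entropy(drug_topics))
print(kl_divergence(drug_topics, disease_topics))
print(kl_divergence(drug_topics, drug_topics))
```

A lower divergence means the two terms have more similar distributions; identical distributions give zero. KL divergence is not symmetric in general, so the order of the pair matters.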

BLASC results

Item1              Item2              Item1 literatures  Item2 literatures  # of co-literatures  co-occur score  Item1 entropy  Item2 entropy  KL divergence
Oxaprozin          Parkinson Disease  124                3587               0                    -13.005         4.764          1.709          10.159
Meclofenamic acid  Parkinson Disease  234                3587               0                    -13.64          4.464          1.709          12.262
Lumiracoxib        Parkinson Disease  144                3587               0                    -13.155         4.188          1.709          12.72
Fenoprofen         Parkinson Disease  341                3587               0                    -14.017         3.476          1.709          13.586
Tenoxicam          Parkinson Disease  451                3587               0                    -14.297         3.634          1.709          13.895
Sulindac           Parkinson Disease  1390               3587               0                    -15.422         3.367          1.709          14.309
Etodolac           Parkinson Disease  442                3587               0                    -14.276         3.525          1.709          14.635
Nabumetone         Parkinson Disease  348                3587               0                    -14.037         2.746          1.709          14.639
Diflunisal         Parkinson Disease  539                3587               0                    -14.475         2.616          1.709          15.13
Valdecoxib         Parkinson Disease  337                3587               0                    -14.005         2.693          1.709          15.161
Rofecoxib          Parkinson Disease  1689               3587               0                    -15.617         3.466          1.709          15.316
Etoricoxib         Parkinson Disease  268                3587               0                    -13.776         2.477          1.709          16.372
Ketorolac          Parkinson Disease  1504               3587               0                    -15.501         2.598          1.709          17.208
Meloxicam          Parkinson Disease  803                3587               0                    -14.873         2.346          1.709          17.835
Mefenamic acid     Parkinson Disease  823                3587               0                    -14.898         1.558          1.709          17.913
Celecoxib          Parkinson Disease  2676               3587               0                    -16.077         3.096          1.709          18.056
Piroxicam          Parkinson Disease  2221               3587               0                    -15.891         1.495          1.709          18.397
Flurbiprofen       Parkinson Disease  1616               3587               0                    -15.573         1.218          1.709          18.913
Ibuprofen          Parkinson Disease  6791               3587               1                    -0.275          1.726          1.709          19.296
Naproxen           Parkinson Disease  3600               3587               1                    0.36            2.021          1.709          19.36
Aspirin            Parkinson Disease  28248              3587               5                    -0.091          2.506          1.709          19.851
Indomethacin       Parkinson Disease  29527              3587               2                    -1.052          2.143          1.709          20.085
Ketoprofen         Parkinson Disease  2229               3587               0                    -15.894         1.066          1.709          20.117
Diclofenac         Parkinson Disease  5484               3587               0                    -16.795         1.978          1.709          20.417

What do we learn from this?

Right now, our options are
  Web searching, with very broad scope, ability to be “sloppy”, but very little “drill down” into datasets, publications, etc, and very limited searching options (keyword only)
  Dataset searching, which is siloed

Web service infrastructures allow ad hoc integration of searching through mashups

Semantic integration with associated tools allows more advanced kinds of search, exploration and prediction of integrated datasets, including hypothesis testing, etc – these are not really mapped to current searching paradigms (but are closer to traditional structured searching)

Semantic searching likely to have wide adoption in the next few years, which might allow web and semantic dataset searching to be integrated

Wolfram Alpha is a kind of prototype for this (http://www.wolframalpha.com/) but almost useless for drug discovery


Wolfram Alpha

Work in progress…

[Diagram with panels labelled “Query” and “Result”. First node set: Ibuprofen, Parkinson’s Disease, Neuro-inflammation, COX2. Second node set: Ibuprofen, Parkinson’s Disease, Neuro-inflammation, COX2, AMBP, DRD2, IL1A]

Karl Fast (again)

“Mess is where creativity and insight come from. We must stop thinking we need to make the world a neat and tidy place”

Cheminformatics Education at Indiana University

LuLu eBook - $29

http://slg.djwild.info

Free cheminformatics learning resources

http://icep.wikispaces.com

Residential Ph.D. program in Informatics with a Cheminformatics specialty

Distance Graduate Certificate program in Chemical Informatics

http://djwild.info/ed