From Data Integration to Data mining in Semantic Web: systems chemical biology as a case study. Bin Chen, School of Informatics and Computing, Indiana University at Bloomington. Lecture for S636, Nov 17, 2011

Towards semantic systems chemical biology


DESCRIPTION

Introduces a semantic framework for studying systems chemical biology / systems pharmacology, covering three major projects: Chem2Bio2RDF, Chem2Bio2OWL, and SLAP (Semantic Link Association Prediction).


Page 1: Towards semantic systems chemical biology

From Data Integration to Data mining in Semantic Web

Systems chemical biology as a case study

Bin Chen

School of Informatics and Computing, Indiana University at Bloomington

Lecture for S636, Nov 17, 2011

Page 2: Towards semantic systems chemical biology

Outline

• Introduction
• RDF (Chem2Bio2RDF)
• OWL (Chem2Bio2OWL)
• Graph mining (SLAP)

Page 3: Towards semantic systems chemical biology

What is Systems Chemical Biology?

[Figure: four layers connected by "interacting" and "mapping" links, with chemogenomics relating the chemical and biology layers.]

• Chemical: compound, drug
• Biology: protein, gene
• Systems: PPI, metabolic pathway, gene regulatory network
• Phenotype: disease, side effect, toxicity

Oprea TI, et al. Systems chemical biology. Nature Chemical Biology, 2007.

Page 5: Towards semantic systems chemical biology

Semantic Web

• an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

Semantic Web stack: http://en.wikipedia.org/wiki/Semantic_Web

Page 6: Towards semantic systems chemical biology

Architecture of Semantic Systems Chemical Biology

[Figure: layered architecture, bottom to top]

• Data: experimental data; text mining data
• RDF / SPARQL: Chem2Bio2RDF
• Ontology: Chem2Bio2OWL
• Algorithms and tools: path finding; association search; association ranking and prediction
• Applications: polypharmacology; drug side effect

Page 7: Towards semantic systems chemical biology

Outline

• Introduction
• RDF (Chem2Bio2RDF)
• OWL (Chem2Bio2OWL)
• Graph mining (SLAP)

Page 8: Towards semantic systems chemical biology

RDF (Resource Description Framework)

• a standard model for data interchange on the Web, using triples (subject, predicate, object) to present and link data, and using URIs to identify resources.

[Figure: a triple links a resource (subject) through a property (predicate) to a value (object). Example: the drug resource has the property "name" with value "Lipitor" and the property "company" with value "Pfizer".]

URI of the drug resource: http://chem2bio2rdf.org/drugbank/resource/drugbank_drug/DB01076

<RDF>
  <Description about="http://chem2bio2rdf.org/drugbank/resource/drugbank_drug/DB01076">
    <name>Lipitor</name>
    <company>Pfizer</company>
  </Description>
</RDF>
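The triple model on this slide can be sketched in a few lines of Python. This is only an illustration: the plain strings "name" and "company" stand in for full predicate URIs, and the helper `objects` is a hypothetical name, not part of any RDF library.

```python
# URI of the drug resource, as shown on the slide.
LIPITOR = "http://chem2bio2rdf.org/drugbank/resource/drugbank_drug/DB01076"

# Each statement is one (subject, predicate, object) triple.
# Assumption: short strings stand in for full predicate URIs.
triples = [
    (LIPITOR, "name", "Lipitor"),
    (LIPITOR, "company", "Pfizer"),
]


def objects(triples, subject, predicate):
    """Return every object stored for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects(triples, LIPITOR, "company"))  # ['Pfizer']
```

Because every statement has the same three-part shape, new properties can be added without changing any schema.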

Page 9: Towards semantic systems chemical biology

Use RDF to Integrate Data

Database 1: http://chem2bio2rdf.org/drugbank/DB01076
  name: lipitor
  company: Pfizer

Database 2: http://chem2bio2rdf.org/drugbank/DB01076
  Molecular_Weight: 558.6398
  formula: C33H35FN2O5

Same URI, merged!
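A minimal sketch of why a shared URI makes integration nearly free: when two sources describe the same subject URI, concatenating their triples already yields one merged description. The `merge` helper and the short predicate strings are illustrative assumptions; the values come from the slide.

```python
from collections import defaultdict

DRUG = "http://chem2bio2rdf.org/drugbank/DB01076"

# Triples contributed by two independent sources (values from the slide).
database_1 = [(DRUG, "name", "lipitor"), (DRUG, "company", "Pfizer")]
database_2 = [(DRUG, "Molecular_Weight", "558.6398"),
              (DRUG, "formula", "C33H35FN2O5")]


def merge(*datasets):
    """Group all triples by subject URI: {subject: {predicate: {objects}}}."""
    merged = defaultdict(dict)
    for dataset in datasets:
        for s, p, o in dataset:
            merged[s].setdefault(p, set()).add(o)
    return merged


merged = merge(database_1, database_2)
# One subject now carries the properties from both databases.
print(len(merged), sorted(merged[DRUG]))
```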

Page 10: Towards semantic systems chemical biology

Use RDF to Link Data

http://chem2bio2rdf.org/drugbank/DB01076
  sameAs: http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB01076
  cid: http://chem2bio2rdf.org/pubchem/resource/pubchem_compound/60823
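One way to use sameAs links when querying across datasets is to map every URI to a single canonical representative first. The sketch below does this with a small union-find structure. It is an assumed approach for illustration, not necessarily how Chem2Bio2RDF resolves links, and `canonical_map` is a hypothetical helper.

```python
C2B = "http://chem2bio2rdf.org/drugbank/DB01076"
LODD = "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB01076"


def canonical_map(same_as_pairs):
    """Map each URI to one representative of its sameAs equivalence class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Arbitrary but deterministic choice: keep the smaller string.
            parent[max(ra, rb)] = min(ra, rb)
    return {uri: find(uri) for uri in list(parent)}


canon = canonical_map([(C2B, LODD)])
# Both URIs now resolve to the same representative.
print(canon[LODD] == canon[C2B])  # True
```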

Page 11: Towards semantic systems chemical biology

[Figure: Chem2Bio2RDF architecture]

• Data sources: UniProt, Bio2RDF, LODD, Chem2Bio2RDF, others
• Storage: Virtuoso triple store
• Access: SPARQL endpoints; dereferenceable URIs; browsing
• Tools: PlotViz (visualization); Cytoscape plugin; linked path generation and ranking; third-party tools

Page 12: Towards semantic systems chemical biology

Workflow for RDF conversion

[Figure: RDF conversion workflow. External sources are downloaded as local copies (XML, CSV, DB, TXT); scripts load them into a relational DB; a D2R mapping and D2R server expose the tables as RDF; the RDF is dumped into the Virtuoso triple store and published. An ontology is also part of the workflow.]

Chen B, et al. Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics. 2010.

Page 13: Towards semantic systems chemical biology

# Table c2b2r_DrugBankDrug
map:c2b2r_DrugBankDrug a d2rq:ClassMap;
    d2rq:dataStorage map:database;
    d2rq:uriPattern "drugbank_drug/@@c2b2r_DrugBankDrug.DBID|urlify@@";
    d2rq:class drugbank:DrugBankDrug;
    d2rq:classDefinitionLabel "c2b2r_DrugBankDrug";
    .

map:c2b2r_DrugBankDrug__label a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:c2b2r_DrugBankDrug;
    d2rq:property rdfs:label;
    d2rq:pattern "@@c2b2r_DrugBankDrug.Generic_Name@@";
    .

map:c2b2r_DrugBankDrug_DBID a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:c2b2r_DrugBankDrug;
    d2rq:property drugbank:DBID;
    d2rq:propertyDefinitionLabel "c2b2r_DrugBankDrug DBID";
    d2rq:column "c2b2r_DrugBankDrug.DBID";
    .

[Figure: Table → D2R mapping → RDF]

Exhibit link
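To make the table-to-RDF step concrete, here is a rough Python sketch of what the mapping above does to a single row: the `@@table.column@@` placeholders are filled from the row, and one triple is emitted per class map or property bridge. This is not D2R code. The base URI is an assumption inferred from the resource URIs shown earlier in the deck, `|urlify` is approximated by percent-encoding, and the sample row is illustrative.

```python
import re
from urllib.parse import quote

# Assumption: base URI inferred from the drug URIs shown on the RDF slide.
BASE = "http://chem2bio2rdf.org/drugbank/resource/"
URI_PATTERN = "drugbank_drug/@@c2b2r_DrugBankDrug.DBID|urlify@@"
LABEL_PATTERN = "@@c2b2r_DrugBankDrug.Generic_Name@@"

# Matches @@table.column@@ with an optional |urlify suffix.
PLACEHOLDER = re.compile(r"@@(\w+)\.(\w+)(\|urlify)?@@")


def expand(pattern, row):
    """Fill each @@table.column@@ placeholder from the row dictionary."""
    def fill(match):
        value = str(row[match.group(2)])
        # Approximation of urlify: percent-encode the value.
        return quote(value, safe="") if match.group(3) else value
    return PLACEHOLDER.sub(fill, pattern)


def row_to_triples(row):
    """Emit one triple per class map / property bridge in the mapping."""
    subject = BASE + expand(URI_PATTERN, row)
    return [
        (subject, "rdf:type", "drugbank:DrugBankDrug"),
        (subject, "rdfs:label", expand(LABEL_PATTERN, row)),
        (subject, "drugbank:DBID", row["DBID"]),
    ]


# Illustrative row of table c2b2r_DrugBankDrug.
row = {"DBID": "DB01076", "Generic_Name": "Atorvastatin"}
for triple in row_to_triples(row):
    print(triple)
```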

Page 14: Towards semantic systems chemical biology

Chem2Bio2RDF Datasets

[Figure: network of Chem2Bio2RDF datasets] Each node represents a database, colored by its RDF vendor. Each directed edge shows the linkage from one dataset to another, colored by linkage type; e.g., the type "compound" includes CID, CAS, ChEBI, DBID and so on. Node size and edge width depend on the number of triples and the number of linkages, respectively.

Node colors: Chem2Bio2RDF data; other data vendors
Edge colors: compound; protein/gene; chemogenomics; literature; others

http://chem2bio2rdf.org

Page 15: Towards semantic systems chemical biology

http://linkeddata.org

Page 16: Towards semantic systems chemical biology

[Figure: Chem2Bio2RDF architecture]

• Data sources: UniProt, Bio2RDF, LODD, Chem2Bio2RDF, others
• Storage: Virtuoso triple store
• Access: SPARQL endpoints; dereferenceable URIs; browsing
• Tools: PlotViz (visualization); Cytoscape plugin; linked path generation and ranking; third-party tools

Page 17: Towards semantic systems chemical biology

SPARQL

• SQL-like Query Language for RDF

Page 18: Towards semantic systems chemical biology

Integrating cheminformatics and bioinformatics tools into SPARQL

[Figure: ARQ function extensions expose the Chemistry Development Kit, BioJava, and web services to SPARQL.]

PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX f: <java:org.bio2chem2rdf.arq.>
SELECT ?x ?s
WHERE {
  ?x drugbank:smilesStringCanonical ?s
  FILTER (f:tanimoto('NS(=O)(=O)C1=CC(=C(Cl)C(Cl)=C1)S(N)(=O)=O', ?s, 'MACCS') > 0.9)
}

f:tanimoto performs the compound similarity search: the query keeps compounds whose MACCS-fingerprint Tanimoto similarity to the given SMILES exceeds 0.9.
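For reference, the Tanimoto coefficient itself is simple: the number of fingerprint bits two compounds share, divided by the number of bits set in either. The sketch below works on sets of "on" bit positions. The bit positions are made up for illustration, and a real search would derive MACCS keys from the SMILES with a cheminformatics toolkit.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Hypothetical fingerprints: positions of the bits that are set.
query = {1, 4, 9, 17, 42}
candidate = {1, 4, 9, 17, 58}

# 4 shared bits out of 6 distinct bits overall.
print(tanimoto(query, candidate))
```

With these values the candidate scores 4/6, so the `> 0.9` filter from the query would reject it.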

Page 19: Towards semantic systems chemical biology

Answer scientific questions

• Give me all information about this compound
• Give me all information about this target
• Find chemical associated genes
• Find gene associated chemicals
• Find disease associated chemicals
• Find side effect associated chemicals
• Find all the drug-like compounds in PubChem BioAssay that share at least two targets with a drug in DrugBank
• Link KEGG / Reactome Pathways and PubChem to identify potential multiple pathway inhibitors for MAPK

More at http://chem2bio2rdf.wikispaces.com/multiple+sources
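The "share at least two targets" question reduces to a set intersection once compound-target and drug-target associations have been retrieved. A minimal sketch, with hypothetical identifiers and target sets standing in for query results:

```python
def share_targets(compound_targets, drug_targets, minimum=2):
    """List (compound, drug, shared targets) with at least `minimum` shared."""
    hits = []
    for compound, c_targets in compound_targets.items():
        for drug, d_targets in drug_targets.items():
            shared = c_targets & d_targets
            if len(shared) >= minimum:
                hits.append((compound, drug, sorted(shared)))
    return hits


# Hypothetical associations, e.g. as returned by two SPARQL queries.
compound_targets = {
    "CID_A": {"PPARG", "PTGS2", "ESR1"},
    "CID_B": {"PPARG"},
}
drug_targets = {"Drug_X": {"PPARG", "PTGS2"}}

print(share_targets(compound_targets, drug_targets))
```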

Page 20: Towards semantic systems chemical biology

link

Page 21: Towards semantic systems chemical biology

Outline

• Introduction
• RDF (Chem2Bio2RDF)
• OWL (Chem2Bio2OWL)
• Graph mining (SLAP)

Page 22: Towards semantic systems chemical biology

Chem2Bio2RDF Datasets

[Figure: network of Chem2Bio2RDF datasets] Each node represents a database, colored by its RDF vendor. Each directed edge shows the linkage from one dataset to another, colored by linkage type; e.g., the type "compound" includes CID, CAS, ChEBI, DBID and so on. Node size and edge width depend on the number of triples and the number of linkages, respectively.

Node colors: Chem2Bio2RDF data; other data vendors
Edge colors: compound; protein/gene; chemogenomics; literature; others

http://chem2bio2rdf.org

Page 23: Towards semantic systems chemical biology

Ontology workflow

Page 24: Towards semantic systems chemical biology

Step 1: Hunting for scientific questions and targeting goals

• What are the targets of troglitazone?
• Find PPARG inhibitors with molecular weight smaller than 500 Da.
• Which pathways will be affected by troglitazone?
• Find all the common/unique genes, proteins or drugs between two nodes or among many nodes.
• Which genes may the compound interact with that are expressed in liver?

Page 25: Towards semantic systems chemical biology

Step 2: Propose framework and basic classes

• SmallMolecule
• MacroMolecule
• Disease
• SideEffect
• Pathway
• BioAssay
• Literature
• Interaction

Page 26: Towards semantic systems chemical biology

Step 3: Define classes, relations and data properties

• Refine class
  – Subclass
  – Utility class
• Object property
• Data property

http://chem2bio2owl.wikispaces.com/Version+1.0

Page 27: Towards semantic systems chemical biology

Step 4: Align with external ontologies

• Import BioPAX
• Map disease to Disease Ontology
• Standardize terms
  – OBO Foundry
  – NCBO BioPortal

Page 28: Towards semantic systems chemical biology

Chem2Bio2OWL

Page 29: Towards semantic systems chemical biology

Step 5: Populate Chem2Bio2OWL

• Identifier for compound, drug, protein, gene, pathway, side effect and disease
  – Primary source
• Term mapping
  – String similarity match

[Figure: Chem2Bio2RDF → Protégé API, Virtuoso, Pellet reasoning → Chem2Bio2OWL]
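The slide does not say which string similarity measure was used for term mapping. As one possible illustration, the sketch below uses `difflib.SequenceMatcher` from the Python standard library to map a free-text term onto the closest entry of a controlled vocabulary. The threshold of 0.8 and the sample terms are assumptions.

```python
from difflib import SequenceMatcher


def best_match(term, vocabulary, threshold=0.8):
    """Return the vocabulary entry most similar to `term`, or None."""
    scored = [
        (SequenceMatcher(None, term.lower(), candidate.lower()).ratio(), candidate)
        for candidate in vocabulary
    ]
    score, candidate = max(scored)
    # Reject weak matches instead of forcing a mapping.
    return candidate if score >= threshold else None


# Hypothetical ontology terms and a misspelled source term.
vocabulary = ["Hypertension", "Hypotension", "Diabetes mellitus"]
print(best_match("Hypertention", vocabulary))  # Hypertension
```

Matches close to the threshold are good candidates for the manual expert check described in the evaluation step.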

Page 30: Towards semantic systems chemical biology

Step 6: Evaluation (consistency checking)

• Data property
• Sample reasoning results checked manually by domain experts

Page 31: Towards semantic systems chemical biology

Step 6: Evaluation (case study)

• Drug target identification

PREFIX c2b2r: <http://chem2bio2rdf.org/chem2bio2rdf.owl#>
PREFIX bp: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?target
FROM <http://chem2bio2rdf.org/owl#>
WHERE {
  ?chemical rdfs:label ?drugName ;
            c2b2r:hasInteraction ?interaction .
  ?interaction c2b2r:hasTarget [ bp:name ?target ] ;
               c2b2r:drugTarget true .
  FILTER (str(?drugName) = "Troglitazone")
}

Annotated Chem2Bio2OWL

Mashed Chem2Bio2RDF

Page 32: Towards semantic systems chemical biology

Outline

• Introduction
• RDF (Chem2Bio2RDF)
• OWL (Chem2Bio2OWL)
• Graph mining (SLAP)
  – Semantic Link Association Prediction

Page 33: Towards semantic systems chemical biology
Page 34: Towards semantic systems chemical biology

Two objects are similar if they are related to similar objects

Coauthorship

Same Target

Page 35: Towards semantic systems chemical biology

Two objects are related if they share the same objects or if their related objects are themselves related

[Figure: two pairs of examples. Chemical biology: Compound 1 is linked to Protein 1 and Protein 2 directly, or through Compound 2. Social network analogy: Person 1 and Person 2 are linked through Computer Science, or through paper1 and paper2. Edge labels: advisor, major, publish, cite, conference.]
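The relatedness rule can be written down directly on an undirected graph: two nodes are related if they have a common neighbor, or if some neighbor of one is linked to some neighbor of the other. The sketch below is a literal reading of that rule for illustration; the edges are hypothetical and this is not the SLAP scoring itself.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected adjacency sets from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def related(graph, a, b):
    """True if a and b share a neighbor, or their neighbors are linked."""
    na, nb = graph.get(a, set()), graph.get(b, set())
    if na & nb:
        return True  # they share the same object
    # Otherwise: is any neighbor of a directly linked to a neighbor of b?
    return any(graph.get(x, set()) & nb for x in na)


# Hypothetical network.
graph = build_graph([
    ("Compound 1", "Protein 1"),
    ("Compound 2", "Protein 2"),
    ("Protein 1", "Protein 2"),   # e.g. a protein-protein interaction
    ("Compound 3", "Protein 3"),
])
print(related(graph, "Compound 1", "Compound 2"))  # True
print(related(graph, "Compound 1", "Compound 3"))  # False
```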

Page 36: Towards semantic systems chemical biology

Sample patterns

[Figure: sample path patterns linking Cmpd1 to Protein1]

• Cmpd1, Protein2, Cmpd 2, Protein1: chemogenomics edges only
• Cmpd1, Cmpd 2, Protein1: neighbor and chemogenomics edges
• Cmpd1, Protein2, GO:0001, Protein1: chemogenomics and hasGO edges
• Cmpd1, Protein2, Protein1: chemogenomics and PPI edges
• Cmpd1, hypertension, Cmpd 2, Protein1: side effect and chemogenomics edges
• Cmpd1, substructure, Cmpd 2, Protein1: substructure and chemogenomics edges

Page 37: Towards semantic systems chemical biology

[Figure: heterogeneous network. Nodes: Compound 1, Compound 2, Compound 3, Target 1, Target 2, Target 3, Target 4, GO:00001, Side Effect 1, Gene Family 1. Edge types: chemogenomics, neighbor, hasGO, hasSideEffect, hasGeneFamily, proteinProteinInteraction.]

Association depends on its neighborhood

Page 38: Towards semantic systems chemical biology
Page 39: Towards semantic systems chemical biology

Statistical Model

Convert the question to a path surfing problem

[Figure: Gene i and Gene j in a network with edge types PPI, hasGO, hasPathway, and chemogenomics.]

P(i → j) = 1/3
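The slide does not spell out how the probability is computed. One reading, assumed here, is that a surfer at Gene i picks uniformly among its PPI links, so with three PPI links the chance of stepping to Gene j is 1/3, and a whole path is scored by multiplying its step probabilities. The graph, the node names other than Gene i and Gene j, and the helper functions are hypothetical.

```python
from collections import defaultdict


def build(edges):
    """graph[node][edge_type] -> list of neighbors, for undirected typed edges."""
    graph = defaultdict(lambda: defaultdict(list))
    for source, edge_type, target in edges:
        graph[source][edge_type].append(target)
        graph[target][edge_type].append(source)
    return graph


def step_probability(graph, source, edge_type, target):
    """Chance of reaching target when picking uniformly among typed links."""
    neighbors = graph[source][edge_type]
    return neighbors.count(target) / len(neighbors) if neighbors else 0.0


def path_probability(graph, path):
    """path alternates nodes and edge types: [n0, t0, n1, t1, n2, ...]."""
    probability = 1.0
    for i in range(0, len(path) - 2, 2):
        probability *= step_probability(graph, path[i], path[i + 1], path[i + 2])
    return probability


# Assumed neighborhood: Gene i has three PPI links, one of them to Gene j.
graph = build([
    ("Gene i", "PPI", "Gene j"),
    ("Gene i", "PPI", "Gene k"),
    ("Gene i", "PPI", "Gene m"),
    ("Gene i", "hasGO", "GO:0001"),
])
print(step_probability(graph, "Gene i", "PPI", "Gene j"))
```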

Page 40: Towards semantic systems chemical biology

[Figure: a path from the source Cmpd1 (s) through Protein2 to the target Protein 1 (t), with edges e1 and e2.]

Page 41: Towards semantic systems chemical biology

• Randomly sampled 100,000 drug-target pairs
• Yielded 453,087 paths in 17 patterns

Pattern Samples:

Pattern Distribution

Page 42: Towards semantic systems chemical biology

Statistical Model

3. Node association estimation

Raw scores of random pairs fit a normal distribution!
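If the raw scores of random pairs are approximately normal, a raw score for a new pair can be turned into a z-score and a one-sided p-value against that background. The sketch below shows the arithmetic with the standard library. The background scores are made-up numbers, and the exact normalization used in SLAP is not given on the slide.

```python
from statistics import NormalDist, mean, stdev


def significance(raw_score, random_scores):
    """Return (z, p): z-score and upper-tail p-value under a normal null."""
    null = NormalDist(mean(random_scores), stdev(random_scores))
    z = (raw_score - null.mean) / null.stdev
    p = 1.0 - null.cdf(raw_score)  # chance a random pair scores this high
    return z, p


# Hypothetical raw scores of random compound-target pairs.
random_scores = [1.0, 2.0, 3.0, 4.0, 5.0]

z, p = significance(6.0, random_scores)
print(round(z, 3), round(p, 4))
```

A pair whose raw score sits far in the upper tail of the random distribution gets a small p-value and is a stronger candidate association.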

Page 43: Towards semantic systems chemical biology

Direct: drug-target pairs with IC50 < 30 µM
Indirect: drug-target pairs with no interaction
Random: random pairs

Page 44: Towards semantic systems chemical biology
Page 45: Towards semantic systems chemical biology

SLAP interface

Page 46: Towards semantic systems chemical biology

Acknowledgement

• Cheminformatics/Chemogenomics Group (Dr. David Wild, Indiana University)
  – Xiao Dong, Huijun Wang, Dazhi Jiao, Dr. Qian Zhu, Madhuvanthi Sankaranarayanan, Jaehong Shin
• Semantic Web Lab (Dr. Ying Ding, Indiana University)
  – Yuyin Sun, Bing He, Shanshan Chen
• High performance computing (Indiana University)
  – Jong Youl Choi
• Pfizer CS COE (Dr. Eric Gifford)

Page 47: Towards semantic systems chemical biology

Thanks!