Introduces a semantic framework for studying systems chemical biology / systems pharmacology, covering three major projects: Chem2Bio2RDF, Chem2Bio2OWL, and SLAP (Semantic Link Association Prediction).
From Data Integration to Data Mining in Semantic Web: systems chemical biology as a case study
Bin Chen
School of Informatics and Computing, Indiana University at Bloomington
Lecture for S636, Nov 17, 2011
Outline
• Introduction
• RDF (Chem2Bio2RDF)
• OWL (Chem2Bio2OWL)
• Graph mining (SLAP)
What's Systems Chemical Biology
[Figure: four layers connected by interacting and mapping relations. Chemical: compound, drug. Biology: protein, gene. Systems: PPI, metabolic pathway, gene regulatory. Phenotype: disease, side effect, toxicity. Additional label: Chemogenomics.]
Oprea TI, et al. Systems chemical biology. Nature, 2007
The data are heterogeneous and scattered around the web…
MATADOR
Semantic Web
• an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.
Semantic Web stack: http://en.wikipedia.org/wiki/Semantic_Web
Architecture of Semantic Systems Chemical Biology
• Data: experimental data; text mining data
• RDF and SPARQL: Chem2Bio2RDF
• Ontology: Chem2Bio2OWL
• Algorithms and tools: path finding; association search; association ranking and prediction
• Applications: polypharmacology; drug side effect
RDF (Resource Description Framework)
• a standard model for data interchange on the Web, using triples (subject, predicate, object) to present and link data, and using URIs to identify resources.
Resource (subject) --Property (predicate)--> Value (object)
Example: the drug identified by the URI http://chem2bio2rdf.org/drugbank/resource/drugbank_drug/DB01076 has the name "Lipitor" and the company "Pfizer". In simplified RDF/XML (namespace declarations omitted):
<RDF>
  <Description about="http://chem2bio2rdf.org/drugbank/resource/drugbank_drug/DB01076">
    <name>Lipitor</name>
    <company>Pfizer</company>
  </Description>
</RDF>
Use RDF to Integrate Data
Database 1: http://chem2bio2rdf.org/drugbank/DB01076 has name "lipitor" and company "Pfizer".
Database 2: http://chem2bio2rdf.org/drugbank/DB01076 has Molecular_Weight 558.6398 and formula C33H35FN2O5.
Same URI, merged!
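A minimal sketch of this merge, written in Python with plain tuples rather than a real triple store. The triples mirror the two databases on this slide; the helper name `describe` is hypothetical and exists only for illustration.

```python
# Triples are (subject, predicate, object) tuples; a graph is a set of them.
DRUG = "http://chem2bio2rdf.org/drugbank/DB01076"

database_1 = {
    (DRUG, "name", "lipitor"),
    (DRUG, "company", "Pfizer"),
}
database_2 = {
    (DRUG, "Molecular_Weight", "558.6398"),
    (DRUG, "formula", "C33H35FN2O5"),
}

# Both sources use the same URI as subject, so a set union is the merge.
merged = database_1 | database_2


def describe(graph, subject):
    """Collect every predicate/object pair known about one subject (hypothetical helper)."""
    return {p: o for s, p, o in graph if s == subject}


print(describe(merged, DRUG))
```

No join key or schema alignment is needed here: the shared URI alone brings the four facts together under one resource.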
Use RDF to Link Data
http://chem2bio2rdf.org/drugbank/DB01076 --sameAs--> http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB01076
http://chem2bio2rdf.org/drugbank/DB01076 --cid--> http://chem2bio2rdf.org/pubchem/resource/pubchem_compound/60823
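[Figure: Chem2Bio2RDF system overview. Linked sources: UniProt, Bio2RDF, LODD, others. Chem2Bio2RDF is stored in a Virtuoso triple store and exposed through SPARQL endpoints, dereferenceable URIs, browsing, PlotViz visualization, a Cytoscape plugin, linked path generation and ranking, and third-party tools.]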
Workflow for RDF conversion
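[Figure: RDF conversion workflow. External sources (XML, CSV, DB, TXT, …) are downloaded as local copies and loaded into a relational DB with scripts. A D2R mapping and a D2R server expose the tables as RDF, which is dumped into the Virtuoso triple store for publishing. An ontology is also part of the pipeline.]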
Chen B, et al. Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics. 2010
# Table c2b2r_DrugBankDrug
map:c2b2r_DrugBankDrug a d2rq:ClassMap;
    d2rq:dataStorage map:database;
    d2rq:uriPattern "drugbank_drug/@@c2b2r_DrugBankDrug.DBID|urlify@@";
    d2rq:class drugbank:DrugBankDrug;
    d2rq:classDefinitionLabel "c2b2r_DrugBankDrug";
    .
map:c2b2r_DrugBankDrug__label a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:c2b2r_DrugBankDrug;
    d2rq:property rdfs:label;
    d2rq:pattern "@@c2b2r_DrugBankDrug.Generic_Name@@";
    .
map:c2b2r_DrugBankDrug_DBID a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:c2b2r_DrugBankDrug;
    d2rq:property drugbank:DBID;
    d2rq:propertyDefinitionLabel "c2b2r_DrugBankDrug DBID";
    d2rq:column "c2b2r_DrugBankDrug.DBID";
    .
Conversion path: Table -> D2R mapping -> RDF
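To make the effect of the mapping above concrete, here is a rough Python sketch of what the class map and its two property bridges produce for one table row. This is not how D2R itself works internally. The base URI is an assumption taken from the resource URIs shown earlier in the deck, `urlify` is only approximated with `urllib.parse.quote`, and the sample row is illustrative data.

```python
from urllib.parse import quote

BASE = "http://chem2bio2rdf.org/drugbank/resource/"  # assumed base URI


def row_to_triples(row):
    """Mimic the class map and two property bridges for one table row."""
    # uriPattern "drugbank_drug/@@c2b2r_DrugBankDrug.DBID|urlify@@"
    subject = BASE + "drugbank_drug/" + quote(row["DBID"], safe="")
    return [
        (subject, "rdf:type", "drugbank:DrugBankDrug"),  # d2rq:class
        (subject, "rdfs:label", row["Generic_Name"]),    # label bridge (d2rq:pattern)
        (subject, "drugbank:DBID", row["DBID"]),         # DBID bridge (d2rq:column)
    ]


# Illustrative row; the values are sample data, not taken from the slides.
sample_row = {"DBID": "DB01076", "Generic_Name": "Atorvastatin"}
for triple in row_to_triples(sample_row):
    print(triple)
```

Each row yields one resource URI plus one triple per bridge, which is why adding a column to the RDF output only requires adding one more property bridge to the mapping.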
Chem2Bio2RDF Datasets
[Figure: dataset linkage graph.] Each node represents a database, colored by its RDF vendor (Chem2Bio2RDF data or other data vendors). Each directed edge shows the linkage from one dataset to another, colored by linkage type (compound, protein/gene, chemogenomics, literature, others); e.g., the type compound includes CID, CAS, ChEBI, DBID and so on. Node size and edge width depend on the number of triples and the number of linkages, respectively.
http://chem2bio2rdf.org
http://linkeddata.org
SPARQL
• SQL-like Query Language for RDF
Implement cheminformatics and bioinformatics tools in SPARQL
ARQ function extension: SPARQL calls out to the Chemistry Development Kit, BioJava and Web services.
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
PREFIX f: <java:org.bio2chem2rdf.arq.>
SELECT ?x ?s
WHERE {
  ?x drugbank:smilesStringCanonical ?s .
  FILTER (f:tanimoto('NS(=O)(=O)C1=CC(=C(Cl)C(Cl)=C1)S(N)(=O)=O', ?s, 'MACCS') > 0.9)
}
f:tanimoto is used for compound similarity search.
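For readers unfamiliar with the measure, the sketch below shows the Tanimoto coefficient itself on toy fingerprints in Python. It is not the `f:tanimoto` extension: the bit positions are hypothetical rather than real MACCS keys, and the value returned for two empty fingerprints is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Toy fingerprints (hypothetical bit positions, not real MACCS keys).
query = {1, 5, 9, 12}
library = {
    "cmpd_a": {1, 5, 9, 12},
    "cmpd_b": {1, 5, 9, 40},
    "cmpd_c": {2, 3},
}

hits = {name: tanimoto(query, fp) for name, fp in library.items()}
# Same threshold as the FILTER clause in the query.
similar = [name for name, score in hits.items() if score > 0.9]
print(hits, similar)
```

The coefficient is the number of shared on-bits divided by the number of bits set in either fingerprint, so it ranges from 0 (nothing in common) to 1 (identical fingerprints).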
Answer scientific questions
• Give me all information about this compound
• Give me all information about this target
• Find chemical associated genes
• Find gene associated chemicals
• Find disease associated chemicals
• Find side effect associated chemicals
• Find all the drug-like compounds in PubChem BioAssay that share at least two targets with a drug in DrugBank
• Link KEGG / Reactome Pathways and PubChem to identify potential multiple pathway inhibitors for MAPK
More in http://chem2bio2rdf.wikispaces.com/multiple+sources
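As an illustration of the "share at least two targets" question, here is a small Python sketch over in-memory pairs. All compound, drug and target identifiers are hypothetical stand-ins for query results, the drug-likeness filter is left out, and the function names are invented for this example.

```python
from collections import defaultdict

# Hypothetical compound-target pairs standing in for query results.
bioassay_hits = [("cmpd_1", "P1"), ("cmpd_1", "P2"), ("cmpd_2", "P2"), ("cmpd_3", "P3")]
drug_targets = [("drug_A", "P1"), ("drug_A", "P2"), ("drug_B", "P3")]


def targets_by_entity(pairs):
    """Group targets by the compound or drug that hits them."""
    grouped = defaultdict(set)
    for entity, target in pairs:
        grouped[entity].add(target)
    return grouped


def share_at_least(compounds, drugs, minimum=2):
    """Return (compound, drug, shared targets) with at least `minimum` shared targets."""
    results = []
    for compound, c_targets in sorted(compounds.items()):
        for drug, d_targets in sorted(drugs.items()):
            shared = c_targets & d_targets
            if len(shared) >= minimum:
                results.append((compound, drug, sorted(shared)))
    return results


matches = share_at_least(targets_by_entity(bioassay_hits), targets_by_entity(drug_targets))
print(matches)
```

In a SPARQL setting the same question would typically be expressed with grouping and a count threshold; the set intersection here plays that role.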
Ontology workflow
Step 1: Hunting for scientific questions and targeting goals
• What are the targets of troglitazone?
• Find PPARG inhibitors with molecular weight smaller than 500 Da.
• Which pathways will be affected by troglitazone?
• Find all the common/unique genes, proteins or drugs between two nodes or among many nodes.
• Which genes that the compound may interact with are expressed in liver?
Step 2: Propose framework and basic classes
• SmallMolecule
• MacroMolecule
• Disease
• SideEffect
• Pathway
• BioAssay
• Literature
• Interaction
Step 3: Define classes, relations and data properties
• Refine class
  – Subclass
  – Utility class
• Object property
• Data property
http://chem2bio2owl.wikispaces.com/Version+1.0
Step 4: Align with External ontology
• Import BioPAX
• Map disease to Disease Ontology
• Standardize terms
  – OBO Foundry
  – NCBO Bioportal
Chem2Bio2OWL
Chem2Bio2RDF
Step 5: Populate Chem2Bio2OWL
• Identifier for compound, drug, protein, gene, pathway, side effect and disease
  – Primary source
• Term mapping
  – String similarity match
[Figure labels: Protégé API; Virtuoso; Pellet reasoning; Chem2Bio2OWL]
Step 6: Evaluation: consistency checking
• Data property
• Sample reasoning results manually checked by domain experts
Step 6: Evaluation: case study
• Drug target identification
PREFIX c2b2r: <http://chem2bio2rdf.org/chem2bio2rdf.owl#>
PREFIX bp: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?target
from <http://chem2bio2rdf.org/owl#>
where {
  ?chemical rdfs:label ?drugName ;
            c2b2r:hasInteraction ?interaction .
  ?interaction c2b2r:hasTarget [bp:name ?target] ;
               c2b2r:drugTarget true .
  FILTER (str(?drugName)="Troglitazone")
}
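The graph pattern in this query can be read as a chain of joins: label, then interaction, then target, then name. The Python sketch below walks the same chain over a handful of in-memory triples. The identifiers (`chem:1`, `int:1`, `prot:1`, ...) and the data values are made up for illustration and do not come from Chem2Bio2OWL.

```python
# Illustrative triples; identifiers and values are made up for the sketch.
triples = {
    ("chem:1", "rdfs:label", "Troglitazone"),
    ("chem:1", "c2b2r:hasInteraction", "int:1"),
    ("chem:1", "c2b2r:hasInteraction", "int:2"),
    ("int:1", "c2b2r:hasTarget", "prot:1"),
    ("int:1", "c2b2r:drugTarget", True),
    ("int:2", "c2b2r:hasTarget", "prot:2"),
    ("int:2", "c2b2r:drugTarget", False),
    ("prot:1", "bp:name", "PPARG"),
    ("prot:2", "bp:name", "ProteinX"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def drug_targets(drug_name):
    """Follow label -> hasInteraction -> hasTarget -> name, keeping drugTarget = true."""
    names = set()
    chemicals = {s for s, p, o in triples if p == "rdfs:label" and str(o) == drug_name}
    for chemical in chemicals:
        for interaction in objects(chemical, "c2b2r:hasInteraction"):
            if True in objects(interaction, "c2b2r:drugTarget"):
                for target in objects(interaction, "c2b2r:hasTarget"):
                    names |= objects(target, "bp:name")
    return sorted(names)


print(drug_targets("Troglitazone"))
```

The second interaction is skipped because its `c2b2r:drugTarget` value is false, which mirrors the `c2b2r:drugTarget true` constraint in the query.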
Annotated Chem2Bio2OWL
Mashed Chem2Bio2RDF
Graph mining (SLAP): Semantic Link Association Prediction
Two objects are similar if they are related to similar objects.
Two objects are related if they share the same objects or their related objects are related.
[Figure: two analogous examples. Coauthorship: Person 1 and Person 2 are connected through Computer Science, paper1 and paper2, with edge types advisor, major, publish, cite and conference. Same target: Compound 1 and Compound 2 are connected through Protein 1 and Protein 2.]
Sample patterns
[Figure: sample path patterns linking a compound (Cmpd1) to a protein (Protein1) through intermediate nodes: a second compound (Cmpd 2), a second protein (Protein2), a GO term (GO:0001), a side effect (hypertension), or a substructure. Edge types shown: chemogenomics, neighbor, hasGO, PPI, side effect, substructure.]
Association depends on its neighborhood
[Figure: example network. Nodes: Compound1, Compound 2, Compound 3, Target 1, Target 2, Target 3, Target 4, GO:00001, Side Effect 1, Gene Family 1. Edge types: chemogenomics, neighbor, hasGO, hasSideEffect, hasGeneFamily, proteinProteinInteraction.]
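One way to make "neighborhood" concrete is to enumerate the paths that connect a compound to a target and record the sequence of edge types along each path (its pattern). The Python sketch below does this on a small hypothetical network. The edges are invented for illustration, edges are treated as undirected, and the depth limit of three is an arbitrary choice, not a setting taken from SLAP.

```python
# Hypothetical typed edges (node, edge type, node); treated as undirected.
edges = [
    ("Compound1", "chemogenomics", "Target2"),
    ("Target2", "proteinProteinInteraction", "Target1"),
    ("Compound1", "neighbor", "Compound2"),
    ("Compound2", "chemogenomics", "Target1"),
    ("Compound1", "hasSideEffect", "SideEffect1"),
    ("Compound3", "hasSideEffect", "SideEffect1"),
    ("Compound3", "chemogenomics", "Target1"),
]

adjacency = {}
for a, label, b in edges:
    adjacency.setdefault(a, []).append((label, b))
    adjacency.setdefault(b, []).append((label, a))


def find_paths(source, target, max_edges=3):
    """Depth-first enumeration of simple paths with at most max_edges edges."""
    found = []

    def walk(node, visited, labels):
        if node == target:
            found.append((tuple(visited), tuple(labels)))
            return
        if len(labels) == max_edges:
            return
        for label, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                walk(neighbor, visited + [neighbor], labels + [label])

    walk(source, [source], [])
    return found


paths = find_paths("Compound1", "Target1")
patterns = sorted({labels for _, labels in paths})
for nodes, labels in paths:
    print(" - ".join(nodes), labels)
```

Grouping paths by their edge-type sequence is what turns a large set of individual paths into a small set of patterns that can be counted and compared.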
Statistical Model
Convert the question to a path surfing problem.
[Figure: Gene i with three PPI edges, one of them leading to Gene j, giving P(i -> j) = 1/3. Other edge types shown: hasGO, hasPathway, chemogenomics. Example path: Cmpd1 (s), edge e1, Protein2, edge e2, Protein 1 (t).]
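The P(i -> j) = 1/3 label is consistent with a surfer that picks uniformly among a node's edges. Under that assumption, which the slides do not spell out, the probability of following a whole path is the product of the step probabilities. The Python sketch below uses a hypothetical neighborhood in which Gene i has three neighbors; any weighting SLAP applies beyond uniform choice is not modeled.

```python
from fractions import Fraction

# Hypothetical undirected neighborhood: Gene i has three neighbors.
neighbors = {
    "Cmpd1": ["Protein2", "GeneI"],
    "Protein2": ["Cmpd1", "Protein1"],
    "GeneI": ["Cmpd1", "GeneJ", "Protein1"],
    "GeneJ": ["GeneI"],
    "Protein1": ["Protein2", "GeneI"],
}


def step_probability(i, j):
    """P(i -> j) under a uniform choice among i's neighbors (an assumption)."""
    if j not in neighbors[i]:
        return Fraction(0)
    return Fraction(1, len(neighbors[i]))


def path_probability(path):
    """Probability that a surfer starting at path[0] follows the path edge by edge."""
    probability = Fraction(1)
    for i, j in zip(path, path[1:]):
        probability *= step_probability(i, j)
    return probability


print(step_probability("GeneI", "GeneJ"))
print(path_probability(["Cmpd1", "Protein2", "Protein1"]))
```

Exact fractions are used instead of floats so that the step and path probabilities can be checked by hand.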
• Randomly sample 100,000 drug target pairs
• Yielding 453,087 paths, 17 patterns
Pattern Samples:
Pattern Distribution
Statistical Model
3. Nodes association estimation
Raw scores of random pairs fit a normal distribution!
Direct: drug target pairs with IC50 < 30 µM
Indirect: drug target pairs with no interaction
Random: random pairs
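If the raw scores of random pairs are roughly normal, one natural way to use that fact is to treat them as a null distribution and express a new pair's raw score as a z-score with a one-sided p-value. The Python sketch below does this. It is an assumption about how the fit could be used, not a description of SLAP's actual scoring, and the raw scores are hypothetical numbers.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical raw scores of random drug-target pairs (the null sample).
random_scores = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 1.0, 0.9, 1.1]

# Fit a normal distribution to the random-pair scores.
null_model = NormalDist(mu=mean(random_scores), sigma=stdev(random_scores))


def significance(raw_score):
    """Return (z-score, one-sided p-value) of a raw score against the null model."""
    z = (raw_score - null_model.mean) / null_model.stdev
    p_value = 1.0 - null_model.cdf(raw_score)
    return z, p_value


z, p = significance(2.0)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A pair whose raw score sits far in the upper tail of the random-pair distribution gets a small p-value, which is the sense in which its association would be called significant.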
SLAP interface
Acknowledgement
• Cheminformatics/Chemogenomics Group (Dr. David Wild, Indiana University)
  – Xiao Dong, Huijun Wang, Dazhi Jiao, Dr. Qian Zhu, Madhuvanthi Sankaranarayanan, Jaehong Shin
• Semantic Web Lab (Dr. Ying Ding, Indiana University)
  – Yuyin Sun, Bing He, Shanshan Chen
• High performance computing (Indiana University)
  – Jong Youl Choi
• Pfizer CS COE (Dr. Eric Gifford)
Thanks!