Scripps bioinformatics seminar_day_2

Day 2 of Computing on the shoulders of giants: how existing knowledge is represented and applied in bioinformatics

Benjamin Good

[email protected] Professor of the Department of Molecular and Experimental Medicine


Recap from Day 1
• Make things (articles, genes, antibodies, etc.) easier to find
• Answer questions
• Generate hypotheses

Controlled vocabularies (MeSH)

Ontologies (Gene Ontology)

Knowledge graphs on the Web: the SPARQL query language

Knowledge plus computation = inference, the ABC model

Computing with knowledge
• Challenges with knowledge graphs
  • Too much data -> query, sort, visualize, interact
  • Not enough data -> mine for more
• Goal for practical day: Go beyond PubMed!
  • Gain hands-on experience using a knowledge graph
  • Either with tools built for the purpose or with your own code

Assignment: knowledge graph to hypothesis
• Option 1: Coding
  • Implement and apply an ABC-model-style hypothesis-generating program (you can adapt the example provided)
  • Explain its logic, explain how you used it to generate a hypothesis, and explain the hypothesis (provide a visual)
• Option 2: Non-coding
  • Use one or more knowledge discovery applications (list provided) to define a new hypothesis
  • If you can't think of where to start, try to explain why Metformin may contribute to cancer survival
• Assignment deliverables: a document containing
  • the inputs you gave to your program or the online tool(s) you used
  • what was generated in response and the underlying logic
  • an image and text describing the results, especially any hypothesis you could derive
  • (for Option 1, also submit any code written or files generated as a tar or zip archive)

Online tools for knowledge discovery
• http://knowledge.bio (* we make this one)
• http://www.biograph.be (a good tool, but it often breaks down)
• http://epiphanet.uth.tmc.edu (also on the flaky side, but can be good)
• https://skr3.nlm.nih.gov/SemMed/ (works okay, requires a free account)
• http://arrowsmith.psych.uic.edu (ugly interface, but a good tool)


Demos
• http://knowledge.bio
• http://www.biograph.be
• http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/start.cgi






Example question: repurposing all drugs

http://tinyurl.com/hwm9388

[Figure: query pattern. ?drug -(interacts with)-> protein -(encoded by)-> gene -(genetic association)-> ?disease; candidate link: ?drug -(treats??)-> ?disease]
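The drug, protein, gene, disease pattern in this example can be written as a SPARQL query against Wikidata. The following is a minimal sketch, not the query behind the tinyurl link. The property IDs (P129, P702, P2293, P2175), the filter that drops known treatments, the User-Agent string, and the function names are all assumptions made for illustration; check the properties on wikidata.org before relying on them.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_repurposing_query(limit=100):
    """Return a SPARQL query for drug -> protein -> gene -> disease paths.

    Assumed Wikidata properties (verify before use):
    P129 physically interacts with, P702 encoded by,
    P2293 genetic association, P2175 medical condition treated.
    """
    return f"""
SELECT ?drug ?drugLabel ?gene ?geneLabel ?disease ?diseaseLabel WHERE {{
  ?drug wdt:P129 ?protein .
  ?protein wdt:P702 ?gene .
  ?gene wdt:P2293 ?disease .
  FILTER NOT EXISTS {{ ?drug wdt:P2175 ?disease . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {int(limit)}
"""


def run_query(query, endpoint=WIKIDATA_ENDPOINT):
    """Send the query over HTTP GET and return the list of result bindings."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    # A descriptive User-Agent is good practice for public endpoints (value is made up).
    request = urllib.request.Request(url, headers={"User-Agent": "abc-seminar-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


if __name__ == "__main__":
    # Print the query only; call run_query(...) yourself when you have network access.
    print(build_repurposing_query(limit=10))
```

Each binding returned by run_query is a dictionary keyed by variable name, so a row's drug label would be read as row["drugLabel"]["value"].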



Example program (feel free to follow or adapt to your interest)
• Example
  • Input = a disease (A)
  • Output = a ranked list of drugs (C) that might be used for treatment
  • Render the results of your workflow as a Cytoscape network that illustrates the reasoning behind the predictions
• Implementation
  • Python
  • Use a SPARQL endpoint such as http://query.wikidata.org
  • + identify and use another endpoint (e.g. EBI, UniProt)
  • ++ access PubMed articles and MeSH indexing
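The input/output contract of the example program can be sketched on in-memory data before any endpoint is involved. This is only an illustration under stated assumptions: the function names, the edge labels, and the toy identifiers (disease_X, gene_1, drug_a, and so on) are hypothetical placeholders, not real biology and not the provided example code.

```python
from collections import defaultdict


def abc_candidates(disease, disease_genes, gene_drugs, known_treatments):
    """Open-discovery ABC step: disease (A) -> genes (B) -> drugs (C).

    Returns (drug, supporting_genes) pairs, ranked by the number of
    linking genes, with alphabetical order breaking ties.
    """
    support = defaultdict(set)
    already_used = known_treatments.get(disease, ())
    for gene in disease_genes.get(disease, ()):
        for drug in gene_drugs.get(gene, ()):
            if drug not in already_used:  # keep only novel A-C links
                support[drug].add(gene)
    return sorted(support.items(), key=lambda item: (-len(item[1]), item[0]))


def to_edges(disease, ranked):
    """Flatten predictions into (source, interaction, target) edges for a network view."""
    edges = []
    for drug, genes in ranked:
        for gene in sorted(genes):
            edges.append((drug, "interacts_with", gene))
            edges.append((gene, "associated_with", disease))
    return list(dict.fromkeys(edges))  # drop duplicates, keep order


# Hypothetical toy data
disease_genes = {"disease_X": ["gene_1", "gene_2", "gene_3"]}
gene_drugs = {"gene_1": ["drug_a", "drug_b"], "gene_2": ["drug_a"], "gene_3": ["drug_c"]}
known_treatments = {"disease_X": {"drug_c"}}

ranked = abc_candidates("disease_X", disease_genes, gene_drugs, known_treatments)
edges = to_edges("disease_X", ranked)
```

Keeping the supporting genes alongside each drug is what makes the prediction explainable: the edge list is exactly the subnetwork a Cytoscape rendering would show.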


Python setup
• pip install rdflib SPARQLWrapper pandas …
• Jupyter is hopefully already installed; if not, install it: http://jupyter.readthedocs.io/en/latest/install.html
• Get the notebook from https://github.com/SuLab/sparql_to_pandas/blob/master/SPARQL_pandas.ipynb
• Go to the directory where you put the notebook
• Run it with: jupyter notebook
• It should be ready to run


The notebook
• will run a basic search for disease-gene-drug connections in Wikidata
• will sort the results by the number of intervening genes
• will export the data to a tab-delimited file you can view in Excel or a text editor, or load into Cytoscape
• Your job: run it and extend it by one or more of:
  • adapting the query
  • changing the way the results are sorted
  • working with the output in Cytoscape to produce an informative visualization
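The sort-and-export step can be sketched as follows. The notebook itself works with pandas; this standalone version uses only the standard library so the logic is easy to follow and to change. The column names, function names, and toy rows are assumptions for illustration, not the notebook's actual output.

```python
import csv
import io
from collections import defaultdict


def summarize(rows):
    """Group (disease, gene, drug) rows by drug-disease pair and count distinct genes."""
    genes_by_pair = defaultdict(set)
    for disease, gene, drug in rows:
        genes_by_pair[(drug, disease)].add(gene)
    summary = [
        {
            "drug": drug,
            "disease": disease,
            "n_genes": len(genes),
            "genes": ";".join(sorted(genes)),
        }
        for (drug, disease), genes in genes_by_pair.items()
    ]
    # Most intervening genes first; names break ties so output is stable.
    summary.sort(key=lambda r: (-r["n_genes"], r["drug"], r["disease"]))
    return summary


def write_tsv(summary, handle):
    """Write the summary as a tab-delimited table with a header row."""
    writer = csv.DictWriter(
        handle,
        fieldnames=["drug", "disease", "n_genes", "genes"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(summary)


# Hypothetical rows, including one duplicate path
rows = [
    ("disease_X", "gene_1", "drug_a"),
    ("disease_X", "gene_2", "drug_a"),
    ("disease_X", "gene_1", "drug_b"),
    ("disease_X", "gene_1", "drug_a"),
]
buffer = io.StringIO()
write_tsv(summarize(rows), buffer)
tsv_text = buffer.getvalue()
```

To write a file instead of a string, pass a handle opened with open("results.tsv", "w", newline="") in place of the StringIO buffer. Changing the sort key in summarize is the simplest way to experiment with a different ranking.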

[Figure: example output rendered in Cytoscape]

Other queries from Day 1 (slides 48-54)
• Drugs that target a cancer and impact a specific biological process
  • http://tinyurl.com/j222k6g
• Drugs that target a new disease linked, via a biological pathway with shared genes, to the disease the drug is now used to treat
  • http://tinyurl.com/gpfr9kj




Possible inputs for adaptations
• Browse and examine wikidata.org to see what you might make use of, e.g.
  • Type of physical interaction between gene and drug
  • Gene Ontology annotation (what evidence codes?)
  • Disease Ontology hierarchy
  • Drug characteristics

Other possible knowledge sources
• SPARQL
  • UniProt http://sparql.uniprot.org
  • EBI SPARQL https://www.ebi.ac.uk/rdf/documentation/sparql-endpoints
  • Look for unique identifiers on genes and proteins that you can use to link Wikidata content to their content
• Text
  • Use the NCBI E-utils API to programmatically access PubMed articles and MeSH indexing http://www.ncbi.nlm.nih.gov/books/NBK25501/
  • Can be used to build co-occurrence networks of, e.g., MeSH terms
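A co-occurrence network of MeSH terms can be sketched in two parts: building an E-utils ESearch URL to find articles, and counting how often pairs of terms are indexed on the same article. The function names and the per-article term lists below are invented for illustration (they are not real indexing), and fetching and parsing the article records is left out.

```python
import urllib.parse
from collections import Counter
from itertools import combinations

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """Build an ESearch URL that returns PubMed IDs for a query as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?" + urllib.parse.urlencode(params)


def cooccurrence(mesh_per_article):
    """Count how many articles each unordered pair of terms shares.

    mesh_per_article: iterable of term lists, one list per article.
    Returns a Counter keyed by alphabetically ordered (term, term) tuples.
    """
    counts = Counter()
    for terms in mesh_per_article:
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts


# Invented example lists, one per hypothetical article
articles = [
    ["Metformin", "Neoplasms", "AMP-Activated Protein Kinases"],
    ["Metformin", "Neoplasms"],
    ["Metformin", "Diabetes Mellitus, Type 2"],
]
url = esearch_url("metformin AND neoplasms")
edge_weights = cooccurrence(articles)
```

Each key in edge_weights is an edge and each count is its weight, so the Counter can be written out as a weighted edge list for Cytoscape or used as the A-B and B-C links of an ABC search.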




Good luck! Ask questions!

ABC ranking algorithms
• Out of all C, which are most strongly related to A?
• Rank by N shared B concepts
  • c2: 4
  • c4: 3
  • c1: 1
  • c3: 1
  • c5: 1
  • c6: 1
• Next level: adjust to down-weight highly connected nodes

[Figure: network diagram with node A linked through intermediate B nodes to candidate nodes c1-c6]
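The counting rule, and one possible form of the down-weighting idea, can be sketched as below. The B-to-C links are hypothetical, chosen only so that the shared-B counts match the numbers listed above; the actual links in the slide figure are not recoverable. The inverse-degree weighting is an assumption picked for simplicity, not necessarily the adjustment the slide has in mind.

```python
from collections import defaultdict


def rank_by_shared_b(a_links, c_links):
    """Rank each C by how many B nodes it shares with A."""
    counts = {c: len(a_links & bs) for c, bs in c_links.items()}
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def rank_downweighted(a_links, c_links):
    """Like rank_by_shared_b, but each shared B contributes 1 / degree(B).

    degree(B) is the number of C nodes linked to that B, so a B that
    connects to everything adds little evidence. (Assumed weighting.)
    """
    degree = defaultdict(int)
    for bs in c_links.values():
        for b in bs:
            degree[b] += 1
    scores = {
        c: sum(1.0 / degree[b] for b in a_links & bs)
        for c, bs in c_links.items()
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical links consistent with the counts c2: 4, c4: 3, others: 1
a_links = {"b1", "b2", "b3", "b4"}
c_links = {
    "c1": {"b1"},
    "c2": {"b1", "b2", "b3", "b4"},
    "c3": {"b2"},
    "c4": {"b2", "b3", "b4"},
    "c5": {"b3"},
    "c6": {"b4"},
}

by_count = rank_by_shared_b(a_links, c_links)
by_weight = rank_downweighted(a_links, c_links)
```

With these toy links the plain count leaves c1, c3, c5 and c6 tied, while the weighted score separates c1 from the rest because its single linking node b1 is less widely connected.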

ABC ranking algorithms: advanced (require large networks to be useful)
• Average Minimum Weight (AMW) (Wren)
  • http://bioinformatics.oxfordjournals.org/content/20/3/389.full.pdf
• Linking Term Count with Average Minimum Weight (LTC-AMW) (Yetisgen-Yildiz and Pratt)
  • https://www.researchgate.net/publication/23759128_A_new_evaluation_methodology_for_literature-based_discovery_systems
• Predicate inter-dependence (Rastegar-Mojarad)
  • https://s3.amazonaws.com/uploads.hipchat.com/25885/154162/UaGvvQqbrhPBAWN/A%20new%20method.pdf
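As a rough illustration of the first two measures, the sketch below reads their names literally: AMW as the average, over the linking B terms, of the smaller of the A-B and B-C link weights, and LTC-AMW as ranking by linking term count with AMW as the tie-breaker. This reading is an assumption and the cited papers should be consulted for the exact definitions; the weights and names here are invented.

```python
def average_minimum_weight(ab_weights, bc_weights):
    """Average over shared B terms of min(weight(A,B), weight(B,C)).

    ab_weights: dict mapping B term -> weight of the A-B link.
    bc_weights: dict mapping B term -> weight of the B-C link for one C.
    """
    shared = ab_weights.keys() & bc_weights.keys()
    if not shared:
        return 0.0
    return sum(min(ab_weights[b], bc_weights[b]) for b in shared) / len(shared)


def rank_ltc_amw(ab_weights, bc_by_c):
    """Rank C terms by linking term count, breaking ties by AMW.

    Returns (c, linking_term_count, amw) tuples, best first.
    """
    scored = []
    for c, bc_weights in bc_by_c.items():
        ltc = len(ab_weights.keys() & bc_weights.keys())
        amw = average_minimum_weight(ab_weights, bc_weights)
        scored.append((c, ltc, amw))
    return sorted(scored, key=lambda t: (-t[1], -t[2], t[0]))


# Invented link weights (e.g. co-occurrence counts)
ab_weights = {"b1": 5, "b2": 2, "b3": 4}
bc_by_c = {
    "c1": {"b1": 3, "b2": 6},
    "c2": {"b1": 1, "b3": 2},
    "c3": {"b2": 9},
}
ranking = rank_ltc_amw(ab_weights, bc_by_c)
```

Taking the minimum of the two link weights means a path is only as strong as its weaker half, which is why c3's single strong B-C link does not lift it above the candidates with two linking terms.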

