Day 2 of Computing on the shoulders of giants: how existing knowledge is represented and applied in bioinformatics
Benjamin Good
[email protected]
Assistant Professor, Department of Molecular and Experimental Medicine


Page 2: Scripps bioinformatics seminar_day_2

Recap from Day 1
• Make things (articles, genes, antibodies, etc.) easier to find
• Answer questions
• Generate hypotheses

Controlled vocabularies (MeSH)
Ontologies (Gene Ontology)
Knowledge graphs on the Web: the SPARQL query language
Knowledge plus computation = inference, the ABC model

Page 3: Scripps bioinformatics seminar_day_2

Computing with knowledge
• Challenges with knowledge graphs
  • Too much data
    • ->> query, sort, visualize, interact
  • Not enough data
    • ->> mine for more...
• Goal for practical day: Go beyond PubMed!
  • gain hands-on experience using a knowledge graph
  • either with tools built for the purpose or with your own code…

Page 4: Scripps bioinformatics seminar_day_2

Assignment: knowledge graph to hypothesis
• Option 1: Coding
  • Implement and apply an ABC Model style hypothesis generating program (can adapt from example provided)
  • explain its logic, explain how you used it to generate a hypothesis, explain the hypothesis (provide a visual)
• Option 2: Non-coding
  • Use a knowledge discovery application(s) (list provided) to define a new hypothesis
  • if you can't think of where to start, try to explain why Metformin may contribute to cancer survival
• Assignment deliverables: a document containing
  • the inputs you gave to your program or the online tool(s) you used
  • what was generated in response and the underlying logic
  • an image and text describing the results, especially any hypothesis you could derive
  • (for Option 1 also submit any code written or files generated as a tar or zip archive)

Page 5: Scripps bioinformatics seminar_day_2

Online tools for knowledge discovery
• http://knowledge.bio (* we make this one…)
• http://www.biograph.be (this is a good tool, but often breaks down)
• http://epiphanet.uth.tmc.edu (also on the flaky side, but can be good)
• https://skr3.nlm.nih.gov/SemMed/ (works okay, requires a (free) account)
• http://arrowsmith.psych.uic.edu (ugly interface, but good tool)

Page 7: Scripps bioinformatics seminar_day_2
Page 8: Scripps bioinformatics seminar_day_2

Example question: repurposing all drugs

http://tinyurl.com/hwm9388

[Diagram: ?drug -(interacts with)-> protein -(encoded by)-> gene -(genetic association)-> ?disease; candidate link: ?drug -(treats??)-> ?disease]
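
The path in this question (drug, protein, gene, disease) can be written as a SPARQL graph pattern. The block below is a minimal sketch, not the query behind the short link above. It assumes the Wikidata property IDs P129 (physically interacts with), P702 (encoded by), P2293 (genetic association) and P2175 (medical condition treated); check them on wikidata.org before relying on them. The function name `build_repurposing_query` is made up for this example.

```python
# Sketch of the drug -> protein -> gene -> disease pattern as a SPARQL query
# for the Wikidata endpoint. Property IDs are assumptions to verify:
#   P129  = physically interacts with   P702  = encoded by
#   P2293 = genetic association         P2175 = medical condition treated
REPURPOSING_TEMPLATE = """
SELECT ?drug ?drugLabel ?gene ?geneLabel ?disease ?diseaseLabel WHERE {
  ?drug wdt:P129 ?protein .       # drug interacts with a protein
  ?protein wdt:P702 ?gene .       # protein is encoded by a gene
  ?gene wdt:P2293 ?disease .      # gene has a genetic association with a disease
  FILTER NOT EXISTS { ?drug wdt:P2175 ?disease . }   # skip known treatments
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT %d
"""


def build_repurposing_query(limit=100):
    """Return the query text with a row limit (hypothetical helper)."""
    return REPURPOSING_TEMPLATE % int(limit)


if __name__ == "__main__":
    # Paste the printed text into the query box at http://query.wikidata.org
    print(build_repurposing_query(50))
```

The FILTER NOT EXISTS clause is what turns the pattern into a repurposing question: it keeps only drug-disease pairs that are not already connected by a "treats" statement.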

Page 9: Scripps bioinformatics seminar_day_2

Example program (feel free to follow or adapt to your interest)
• Example
  • Input = a disease (A)
  • Output = a ranked list of drugs (C) that might be used for treatment
  • Render the results of your workflow as a Cytoscape network that illustrates the reasoning behind the predictions
• Implementation
  • Python
  • Use a SPARQL endpoint such as http://query.wikidata.org
  • + identify and use another endpoint (e.g. EBI, UniProt)
  • ++ access PubMed articles and MeSH indexing
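
One way to turn "input = a disease, output = a ranked list of drugs" into code is sketched below, using only the Python standard library. It is an illustration under assumptions, not the course's example program: the property IDs (P129, P702, P2293, P2175) should be checked on wikidata.org, the helper names are invented here, and `Q12345` in the usage comment is a placeholder, not a real disease identifier.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Drugs (C) linked to one disease (A) through shared genes (B), ranked by the
# number of distinct linking genes. Property IDs are assumptions to verify.
DISEASE_TO_DRUG_TEMPLATE = """
SELECT ?drug ?drugLabel (COUNT(DISTINCT ?gene) AS ?n_genes) WHERE {
  ?gene wdt:P2293 wd:%s .         # gene genetically associated with disease A
  ?protein wdt:P702 ?gene .       # protein encoded by that gene
  ?drug wdt:P129 ?protein .       # drug interacts with the protein
  FILTER NOT EXISTS { ?drug wdt:P2175 wd:%s . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?drug ?drugLabel
ORDER BY DESC(?n_genes)
"""


def disease_to_drug_query(disease_qid):
    """Fill the template with a Wikidata item ID such as 'Q12345'."""
    if not (disease_qid.startswith("Q") and disease_qid[1:].isdigit()):
        raise ValueError("expected a Wikidata item ID like 'Q12345'")
    return DISEASE_TO_DRUG_TEMPLATE % (disease_qid, disease_qid)


def run_sparql(query, endpoint=WIKIDATA_ENDPOINT):
    """Send the query over HTTP and return the list of result bindings."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "abc-model-class-example/0.1",  # made-up agent string
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["results"]["bindings"]


# Usage (needs network access; replace the placeholder ID with your disease):
#   for row in run_sparql(disease_to_drug_query("Q12345")):
#       print(row["drugLabel"]["value"], row["n_genes"]["value"])
```

Doing the counting inside the query keeps the Python side small. The alternative is to fetch the individual drug-gene-disease rows and rank them locally, which makes it easier to try other ranking schemes.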

Page 10: Scripps bioinformatics seminar_day_2

Python setup
• pip install rdflib SPARQLWrapper pandas …
• Hopefully Jupyter is already installed; if not, install it: http://jupyter.readthedocs.io/en/latest/install.html
• get the notebook from https://github.com/SuLab/sparql_to_pandas/blob/master/SPARQL_pandas.ipynb
• go to the directory where you put the notebook
• run it with
  • >jupyter notebook
• it should be ready to run

Page 11: Scripps bioinformatics seminar_day_2

The notebook
• will run a basic search for disease-gene-drug connections in Wikidata
• will sort the results by the number of intervening genes
• will export the data to a tab-delimited file you can view in Excel, a text editor, or load into Cytoscape
• Your job:
  • Run it and extend it by one or more of:
    • adapting the query
    • changing the way the results are sorted
    • working with the output in Cytoscape to produce an informative visualization
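
The sort and export steps can be sketched without the notebook, which may well do this differently (for example with pandas). The block below assumes each result row has been flattened to a dict with `disease`, `gene` and `drug` keys. The sample rows, column names and edge labels are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows standing in for flattened query results.
SAMPLE_ROWS = [
    {"disease": "D1", "gene": "G1", "drug": "drugA"},
    {"disease": "D1", "gene": "G2", "drug": "drugA"},
    {"disease": "D1", "gene": "G1", "drug": "drugB"},
    {"disease": "D1", "gene": "G1", "drug": "drugB"},  # duplicate row
]


def rank_pairs_by_gene_count(rows):
    """Rank (disease, drug) pairs by the number of distinct linking genes."""
    genes = defaultdict(set)
    for row in rows:
        genes[(row["disease"], row["drug"])].add(row["gene"])
    ranked = sorted(genes.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(disease, drug, len(g), sorted(g)) for (disease, drug), g in ranked]


def write_edge_table(rows, handle):
    """Write unique drug-gene and gene-disease edges as tab-delimited text."""
    edges = set()
    for row in rows:
        edges.add((row["drug"], "interacts_with", row["gene"]))
        edges.add((row["gene"], "associated_with", row["disease"]))
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "interaction", "target"])
    writer.writerows(sorted(edges))


if __name__ == "__main__":
    for disease, drug, n_genes, gene_list in rank_pairs_by_gene_count(SAMPLE_ROWS):
        print(disease, drug, n_genes, ",".join(gene_list))
    buffer = io.StringIO()
    write_edge_table(SAMPLE_ROWS, buffer)
    print(buffer.getvalue())
```

A source/interaction/target table is a convenient shape for Cytoscape's network-from-table import, and keeping the gene as its own node preserves the reasoning behind each prediction in the picture.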

Page 12: Scripps bioinformatics seminar_day_2

example output rendered in cytoscape

Page 13: Scripps bioinformatics seminar_day_2

Other queries from Day 1 (slides 48-54)
• Drugs that target a cancer and impact a specific biological process
  • http://tinyurl.com/j222k6g
• Drugs that might target a new disease that is linked, via a biological pathway with shared genes, to the disease the drug is currently used to treat
  • http://tinyurl.com/gpfr9kj

Page 14: Scripps bioinformatics seminar_day_2

Possible inputs for adaptations
• Browse and examine wikidata.org to see what you might make use of
• e.g.
  • Type of physical interaction between gene and drug
  • Gene Ontology annotation (what evidence codes?)
  • Disease Ontology hierarchy
  • Drug characteristics

Page 15: Scripps bioinformatics seminar_day_2

Other possible knowledge sources
• SPARQL
  • UniProt http://sparql.uniprot.org
  • EBI SPARQL https://www.ebi.ac.uk/rdf/documentation/sparql-endpoints
  • look for unique identifiers on genes and proteins that you can use to link Wikidata content to their content
• Text
  • use the NCBI E-utils API to programmatically access PubMed articles and MeSH indexing http://www.ncbi.nlm.nih.gov/books/NBK25501/
  • Can use to build co-occurrence networks of e.g. MeSH terms
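
A co-occurrence network of MeSH terms can be built by counting how often two descriptors are assigned to the same article. The sketch below assumes PubMed XML as returned by the E-utils `efetch` call, where each article's MeSH descriptors appear in `DescriptorName` elements. The helper names and the tiny hand-written XML sample are made up for this example, and nothing here contacts NCBI.

```python
import itertools
import urllib.parse
import xml.etree.ElementTree as ET
from collections import Counter

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Hand-written toy record set, trimmed to the elements used below.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><MeshHeadingList>
    <MeshHeading><DescriptorName>Metformin</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Neoplasms</DescriptorName></MeshHeading>
  </MeshHeadingList></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><MeshHeadingList>
    <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Metformin</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Neoplasms</DescriptorName></MeshHeading>
  </MeshHeadingList></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def efetch_url(pmids):
    """Build an efetch URL that returns PubMed records as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return EUTILS_BASE + "efetch.fcgi?" + urllib.parse.urlencode(params)


def mesh_terms_per_article(pubmed_xml):
    """Return one sorted list of MeSH descriptor names per article."""
    root = ET.fromstring(pubmed_xml)
    return [
        sorted({d.text for d in article.iter("DescriptorName")})
        for article in root.iter("PubmedArticle")
    ]


def cooccurrence_counts(term_lists):
    """Count unordered term pairs that share an article."""
    counts = Counter()
    for terms in term_lists:
        counts.update(itertools.combinations(sorted(set(terms)), 2))
    return counts


if __name__ == "__main__":
    pairs = cooccurrence_counts(mesh_terms_per_article(SAMPLE_XML))
    for (term_a, term_b), n in pairs.most_common():
        print(term_a, term_b, n)
```

Each counted pair can serve as a weighted edge, so the same output can feed an ABC-style search where A, B and C are MeSH terms.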

Page 16: Scripps bioinformatics seminar_day_2

Good luck! Ask questions!

Page 17: Scripps bioinformatics seminar_day_2

ABC ranking algorithms
• Out of all C, which are most strongly related to A?
• Rank by N shared B concepts
  • c2: 4
  • c4: 3
  • c1: 1
  • c3: 1
  • c5: 1
  • c6: 1
• Next level: adjust to down-weight highly connected nodes

[Diagram: A linked through B concepts to candidates C: c1, c2, c3, c4, c5, c6]
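
Both ranking ideas on this slide fit in a few lines. The sketch below uses a toy graph invented to reproduce the counts listed above (the real diagram's edges are not recoverable from the text). For the "next level", it assumes one simple down-weighting scheme in which each B concept contributes 1 divided by the number of C nodes it touches. That scheme is an illustrative choice, not necessarily the one intended in the lecture.

```python
from collections import defaultdict

# Toy graph (invented): the B concepts linked to A, and the C nodes each B reaches.
A_TO_B = {"b1", "b2", "b3", "b4"}
B_TO_C = {
    "b1": {"c1", "c2", "c4"},
    "b2": {"c2", "c3", "c4"},
    "b3": {"c2", "c4", "c5"},
    "b4": {"c2", "c6"},
}


def rank_by_shared_b(a_to_b, b_to_c):
    """Rank each C by the number of B concepts it shares with A."""
    counts = defaultdict(int)
    for b in a_to_b:
        for c in b_to_c.get(b, ()):
            counts[c] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def rank_down_weighted(a_to_b, b_to_c):
    """Same idea, but a B that reaches many C nodes counts for less.

    Each B contributes 1 / (number of C nodes it connects to), an assumed
    weighting chosen for simplicity.
    """
    scores = defaultdict(float)
    for b in a_to_b:
        targets = b_to_c.get(b, ())
        for c in targets:
            scores[c] += 1.0 / len(targets)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(rank_by_shared_b(A_TO_B, B_TO_C))
    print(rank_down_weighted(A_TO_B, B_TO_C))
```

With this toy graph the plain count ties c1, c3, c5 and c6 at 1, while the weighted score moves c6 ahead of the other three because its only linking concept, b4, is less widely connected.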

Page 18: Scripps bioinformatics seminar_day_2

ABC ranking algorithms – advanced (require large networks to be useful)
• Average Minimum Weight (AMW) (Wren)
  • http://bioinformatics.oxfordjournals.org/content/20/3/389.full.pdf
• Linking Term Count with Average Minimum Weight (LTC-AMW) (Yetisgen-Yildiz and Pratt)
  • https://www.researchgate.net/publication/23759128_A_new_evaluation_methodology_for_literature-based_discovery_systems
• Predicate inter-dependence (Rastegar-Mojarad)
  • https://s3.amazonaws.com/uploads.hipchat.com/25885/154162/UaGvvQqbrhPBAWN/A%20new%20method.pdf
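
As a rough illustration of the first two measures, the sketch below assumes the commonly summarized definitions: the linking term count (LTC) of a candidate C is the number of B concepts connecting it to A, its AMW is the average over those B concepts of min(weight(A-B), weight(B-C)), and LTC-AMW ranks by LTC and uses AMW to break ties. Check the papers linked above for the exact definitions and for how edge weights are derived. The weights and names below are invented.

```python
# Invented edge weights (for example, co-occurrence counts).
W_AB = {"b1": 5, "b2": 2}                       # weight of each A-B link
W_BC = {                                        # weight of each B-C link
    ("b1", "c1"): 3, ("b2", "c1"): 4,
    ("b1", "c2"): 1, ("b2", "c2"): 1,
    ("b1", "c3"): 9,
}


def linking_terms(w_ab, w_bc, c):
    """B concepts that connect A to candidate c."""
    return sorted(b for b in w_ab if (b, c) in w_bc)


def average_minimum_weight(w_ab, w_bc, c):
    """Mean over linking B of min(weight(A-B), weight(B-c)); 0.0 if no links."""
    links = linking_terms(w_ab, w_bc, c)
    if not links:
        return 0.0
    return sum(min(w_ab[b], w_bc[(b, c)]) for b in links) / len(links)


def rank_ltc_amw(w_ab, w_bc):
    """Rank candidates by linking term count, breaking ties with AMW."""
    candidates = {c for (_, c) in w_bc}
    scored = []
    for c in candidates:
        ltc = len(linking_terms(w_ab, w_bc, c))
        if ltc:  # ignore C nodes with no path back to A
            scored.append((c, ltc, average_minimum_weight(w_ab, w_bc, c)))
    return sorted(scored, key=lambda s: (-s[1], -s[2], s[0]))


if __name__ == "__main__":
    for c, ltc, amw in rank_ltc_amw(W_AB, W_BC):
        print(c, ltc, amw)
```

In this toy example c3 has the highest AMW but only one linking term, so LTC-AMW still places it below c1 and c2. That shows how the two measures can disagree.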