Asking the scientific literature to tell us about metabolism

Peter Murray-Rust, Reader Emeritus, Dept of Chemistry, Univ Cambridge

and Founder TheContentMine

Lhasa, Leeds, UK, 2017-01-12

contentmine.org is supported by a grant to PMR as a

Thousands of scientists have to re-type the literature.

Machines should be doing it!

Treat them as friends.

100 clinical trials a day, 5000 articles a day

Software and Special Thanks

Molecular Informatics, Cambridge

Peter Corbett, OSCAR (chemical entities)

Andy Howlett, “OSIRIS” (graphical chemistry)

Daniel Lowe, OPSIN (name 2 structure)

Lezan Hawizy, ChemicalTagger (recipes)

Mark Williamson, integration and deployment

ContentMine

Rik Smith-Unna, getpapers, quickscrape (discovery)

Tom Arrow, WikiFactMine (Wikimedia semantics)

PM-R, norma, AMI (platform), CML (semantics)

ALL SOFTWARE IS OPEN (Apache2)

AMI! Tell me what YOU know about monoxidine?

Wikipedia

Wikidata for monoxidine

Wikidata for moxonidine

Entity extraction

OPSIN says this name is wrong! OSIRIS will interpret this structure

Including the annotation

Reaction Schemes

Tables

Tables

Graphs

Entities

Plot

Plot

Maths?

Models?

What’s the title?

Some demos

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.

CLICK HERE FOR ANIMATION

(may be browser dependent)

UNITS

TICKS

QUANTITY

SCALE

TITLES

DATA!! 2000+ points

VECTOR PDF

Dumb PDF

CSV
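The labels above (ticks, scale, data points, vector PDF to CSV) suggest a calibration step. A minimal sketch of how tick marks could map page coordinates to data values, assuming the ticks have already been read as (page coordinate, printed value) pairs. This is not AMI's own code: the function names `fit_axis`, `calibrate_points` and `rows_to_csv` and the input format are hypothetical, and a linear axis is assumed.

```python
import csv
import io


def fit_axis(ticks):
    """Least-squares line value = a * coord + b through (coord, value) tick pairs."""
    coords = [c for c, _ in ticks]
    values = [v for _, v in ticks]
    n = len(ticks)
    mean_c = sum(coords) / n
    mean_v = sum(values) / n
    sxx = sum((c - mean_c) ** 2 for c in coords)
    sxy = sum((c - mean_c) * (v - mean_v) for c, v in ticks)
    a = sxy / sxx
    b = mean_v - a * mean_c
    return a, b


def calibrate_points(points, x_ticks, y_ticks):
    """Convert page (x, y) coordinates into data values using the fitted axes."""
    ax, bx = fit_axis(x_ticks)
    ay, by = fit_axis(y_ticks)
    return [(ax * x + bx, ay * y + by) for x, y in points]


def rows_to_csv(rows, header):
    """Serialise calibrated rows as CSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

Fitting a line through all ticks, rather than using only two of them, averages out small errors in the extracted tick positions. The y axis needs no special handling when page coordinates run downwards: the fitted slope simply comes out negative.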

SemanticSpectrum

2nd Derivative

Smoothing Gaussian Filter
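A minimal sketch of the two operations named above, applied to an extracted spectrum held as a list of intensities at evenly spaced x values. It assumes a truncated Gaussian kernel and central differences; the function names and the edge handling (clamping to the end values) are illustrative choices, not taken from AMI.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Normalised Gaussian weights covering about three sigma either side."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(ys, sigma):
    """Convolve ys with a Gaussian kernel, clamping indices at both edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(ys)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * ys[j]
        out.append(acc)
    return out


def second_derivative(ys, dx=1.0):
    """Central-difference second derivative; the result has len(ys) - 2 values."""
    return [(ys[i - 1] - 2 * ys[i] + ys[i + 1]) / dx ** 2
            for i in range(1, len(ys) - 1)]
```

Smoothing first matters because differencing amplifies noise. In the smoothed curve, a peak shows up as a negative minimum of the second derivative.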

Automatic extraction

Search on publicly accessible papers on “Zika”

https://rawgit.com/ContentMine/amidemos/master/zika/full.dataTables.html

C) What’s the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084–4087

Original thanks to ChemBark

After AMI2 processing…

… AMI2 has detected a square

“… simulated by 21cmFAST is in principle independent”

“it is a feature of the 21cmFAST code, and is explained in §3.1.”

SciCodes[1]: Searching for software in arXiv[2]

[1] Proposal to LJ Arnold Foundation (Alice Allen ASCL and PMR)

Using the semi-numerical simulation, 21cmFAST,

[2] arxiv.org: the physics/maths/astronomy… preprint server

The language identifies the software!

arXiv has >500 mentions of “21cmFAST”
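A minimal sketch of the idea that the surrounding language identifies the software: look for cue phrases such as "simulated by X" or "the X code" and count the names they capture. The three patterns below are modelled only on the quotations shown above and are an assumption, not the SciCodes method; a real run would need many more cues and a filter for false positives.

```python
import re
from collections import Counter

# A candidate name: a word character followed by word characters, '+' or '-'.
NAME = r"(\w[\w+-]*)"

# Hypothetical cue phrases, taken from the example sentences only.
CUE_PATTERNS = [
    re.compile(r"\bsimulated by " + NAME),
    re.compile(r"\bfeature of the " + NAME + r" code\b"),
    re.compile(r"\bsimulation, " + NAME),
]


def software_mentions(sentences):
    """Count the names captured by any cue phrase across the sentences."""
    counts = Counter()
    for sentence in sentences:
        for pattern in CUE_PATTERNS:
            for match in pattern.finditer(sentence):
                counts[match.group(1)] += 1
    return counts
```

Applied to the three quoted sentences, each cue captures "21cmFAST" once. The names are counted exactly as written, so spelling variants such as "21cmFast" would need case-folding before the counts are merged.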
