
ContentMining for Synthetic Biology


Page 1: ContentMining for Synthetic Biology

Mining Bioscience Literature

Peter Murray-Rust, University of Cambridge and TheContentMine

SynBio, Cambridge UK, 2015-05-18

Much Scientific Data lies hidden in text and images, in articles, theses, reports, patents, lab-books…

The ContentMine has Open collaborative tools that anyone can use to find facts and re-use for their own research

Page 2: ContentMining for Synthetic Biology

http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html

We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.

Adage in public health: “The road to inaction is paved with research papers.”

Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)

Vera Mussah (director of county health services)

Cameron Nutt (Ebola response adviser to Partners in Health)

A System Failure of Scholarly Publishing

Page 3: ContentMining for Synthetic Biology

Scientific and Medical publication (STM) [+]

• World Citizens pay $400,000,000,000 …
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” [*] …
• … $10,000,000,000 from academic libraries …
• … to “publishers” who forbid access to 99.9% of citizens of the world …
• 85% of medical research is wasted (not published, badly conceived, duplicated, …)

[+] Figures probably ±50%

[*] arXiv preprint server costs $7 USD per paper
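The figures on this slide can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the slide's own round numbers (which the footnote says may be off by about 50%); the variable names are illustrative.

```python
# Round figures quoted on the slide (stated accuracy: roughly +-50%).
total_research_spend = 400_000_000_000  # USD paid by world citizens
articles = 1_500_000                    # articles produced
publish_cost_per_article = 7_000        # USD to "publish" one article
arxiv_cost_per_paper = 7                # USD per paper on the arXiv preprint server

# Implied cost to create one article.
cost_to_create = total_research_spend / articles

# Implied total publishing bill across all articles.
total_publish_cost = publish_cost_per_article * articles

# How many times more expensive conventional publishing is than arXiv.
ratio = publish_cost_per_article / arxiv_cost_per_paper

print(f"Cost to create one article: ${cost_to_create:,.0f}")
print(f"Total publishing bill: ${total_publish_cost:,.0f}")
print(f"Publisher cost vs arXiv: {ratio:,.0f}x")
```

This gives about $267,000 per article and $10.5 billion in total, consistent with the slide's rounded $300,000 and $10,000,000,000, and a factor of 1000 between publisher and arXiv costs per paper.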

Page 4: ContentMining for Synthetic Biology

The Right to Read is the Right to Mine

http://contentmine.org

Page 5: ContentMining for Synthetic Biology

Facts Marked by “non-scientists” in ContentMine workshops

With Wikipedia everyone can be a scientist

Page 6: ContentMining for Synthetic Biology

ContentMine Workshops and Hackdays

Open Science Brazil, 2014-08

Easily distributed software

Get started in 30 mins

Build application in a morning

Start simple: bagOfWords, Stemming, Regex, templates

Page 7: ContentMining for Synthetic Biology

Oxford 2013

Berlin 2014

Delhi 2014

Jenny Molloy with mascot AMI

Page 8: ContentMining for Synthetic Biology

OUR TEAM

• Jenny Molloy (@jenny_molloy)
• Ross Mounce (@rmounce)
• Richard Smith-Unna (@blahah404)
• Stephanie Smith-Unna (@treblesteph)
• Mark MacGillivray (@cottagelabs)
• Peter Murray-Rust (@petermurrayrust)
• Charles Oppenheim (@CharlesOppenh)
• Graham Steel (@McDawg)

Page 9: ContentMining for Synthetic Biology

Workshops (1-hour -> full day or more)

2014, May to November
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• Sci DataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Wash DC, US
• JISC, London

2015
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO

Collaborators
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE
• EuropePubmedCentral

Page 10: ContentMining for Synthetic Biology

Content-Mining (TDM*)

• Now COMPLETELY LEGAL IN UK since 2014-06-01 (“Hargreaves”) …
• … Whatever the publishers tell you. Do NOT sign their APIs
• UK can legally IGNORE contractual restrictions
• Movement to extend this to Europe (Julia Reda, MEP proposal)
• And STM publishers are spending millions to stop us

*Text and Data Mining

Page 11: ContentMining for Synthetic Biology

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

[Figure labels: SECTIONS, MAPS, TABLES, CHEMISTRY, TEXT, MATH]

contentmine.org tackles these

Page 12: ContentMining for Synthetic Biology

“nuggets” in a scientific paper

[Figure labels: quantity, units, value ranges, chemical, project, places]

Humans aren’t designed to mine this …

Page 13: ContentMining for Synthetic Biology

What is “Content”?

Emily Sena (neuroscience.ed.ac.uk) spends half a day digitising a diagram like this

ContentMine will soon be able to do it in 1 second

Page 14: ContentMining for Synthetic Biology

contentmine.org Infrastructure

• CRAWL the web for scientific documents (articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)

… Open semantic science …
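The stages listed on this slide can be pictured as a chain of small functions, each feeding the next. The following is a toy sketch only: the function names, the in-memory `FAKE_WEB` dictionary and the term-lookup "plugin" are all hypothetical, and none of it reproduces the real quickscrape, Norma, AMI or CAT interfaces.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "web" (URL -> raw HTML); stands in for real crawling.
FAKE_WEB = {
    "http://example.org/a1": "<html><body><p>Oil of <i>A. judaica</i> was analysed.</p></body></html>",
    "http://example.org/a2": "<html><body><p>Nothing relevant here.</p></body></html>",
}

def crawl():
    """CRAWL: yield the URLs of candidate documents."""
    yield from sorted(FAKE_WEB)

def scrape(url):
    """SCRAPE: fetch the raw page for one URL."""
    return {"url": url, "raw": FAKE_WEB[url]}

def normalize(doc):
    """NORMA-lize: reduce the page to plain text (crude tag stripping)."""
    text = re.sub(r"<[^>]+>", " ", doc["raw"])
    return {**doc, "text": " ".join(text.split())}

def mine(doc, terms=("A. judaica", "O. dayi")):
    """MINE: toy plugin that looks up known terms in the text."""
    return [{"url": doc["url"], "fact": t} for t in terms if t in doc["text"]]

def catalogue(facts):
    """CAT-alogue: build a searchable index, fact -> list of URLs."""
    index = defaultdict(list)
    for fact in facts:
        index[fact["fact"]].append(fact["url"])
    return dict(index)

def run_pipeline():
    """One pass over every crawled document; a daily job would repeat this."""
    facts = []
    for url in crawl():
        facts.extend(mine(normalize(scrape(url))))
    return catalogue(facts)

index = run_pipeline()
print(index)
```

Keeping each stage as a separate function mirrors the slide's separation of concerns: a new mining plugin only has to accept normalized text and return facts.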

Page 15: ContentMining for Synthetic Biology

https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0

Daily Stream of 100,000 Open Facts

Twitter?

Indexed by CAT

http://catalogue.cottagelabs.com/browse http://catalogue.cottagelabs.com/graph

Page 16: ContentMining for Synthetic Biology

[Diagram: CANARY pipeline. Stages: Crawl/Feed, quickscrape, Norma, AMI, Index & Transform. Sources: scientific literature, repositories. Inputs and formats: URL, DOI, PDF, XML, DOC, CSV, sHTML, bad HTML. Components: per-journal scrapers, per-journal taggers (XPath), OCR, diagrams. Plugins: regex, sequences, species, metadata, chemistry, phylogenetics, farming, bespoke. Outputs: Open NORMA-lized Scientific Literature + Facts, CAT-alogue index.]

Page 17: ContentMining for Synthetic Biology

[Diagram: Daily Crawl of 2000-5000 articles from PLoSONE, BMC1, BMC2, Closed1, Closed2 and Hybrid sources. Crawl … Scrape … Normalize … Mine. Outputs: CATalog, enhanced annotated articles, FACTS, Linked Open Data, Semantic Scientific Objects.]

Page 18: ContentMining for Synthetic Biology

CORE Repository UK

Page 19: ContentMining for Synthetic Biology

[Diagram: Daily Crawl of PLoSONE, BMC1, BMC2, Closed1, Closed2 and Hybrid sources, plus Selected Retrospective sources (Open3, Closed3) and a REPO of articles, theses, reports and patents. Crawl … Scrape … Normalize … Mine. Outputs: CATalog, FACTS.]

Page 20: ContentMining for Synthetic Biology

Peter Murray-Rust

BMC publisher

Blue Obelisk paper (20 co-authors)

Sub-network From CATalog

Page 21: ContentMining for Synthetic Biology

Phytochemistry extraction

O. dayi

“volatile composition of”

A. sibeiri

A. judaica

Displayed by CAT (CottageLabs)

Page 22: ContentMining for Synthetic Biology

Retrieval/Extraction Technologies

• Bag of Words (https://en.wikipedia.org/wiki/Bag-of-words_model)
• Term-Frequency Inverse-Document-Frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
• Regular Expressions
• Templates (Information Extraction)
• Natural Language Processing (NLP)
• Image processing and mining
• Lookup (Wikidata, Bioscience databases)
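Of the techniques listed, term-frequency inverse-document-frequency is easy to show in a few lines. The sketch below uses one common textbook variant (tf = count / document length, idf = log(N / df)); real libraries often add smoothing, and the three sample "documents" are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Invented sample texts, for illustration only.
docs = [
    "ebola virus in liberia",
    "virus load per fly",
    "volatile oil composition",
]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", sorted(w, key=w.get, reverse=True)[:2])
```

A term such as "virus", which occurs in two of the three texts, gets a lower weight than a term unique to one text, which is the property that makes the measure useful for ranking.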

Page 23: ContentMining for Synthetic Biology

Bag of Words

Theses from HAL repository
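A bag of words simply counts how often each word occurs, ignoring word order. The following is a minimal sketch; the stopword list is a tiny illustrative one, the sample sentence is invented, and the tokenizer only handles unaccented ASCII letters, so it would need extending for French theses.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "is"}

def bag_of_words(text):
    """Count lower-cased word tokens, ignoring order and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())  # ASCII letters only
    return Counter(t for t in tokens if t not in STOPWORDS)

bag = bag_of_words("The volatile composition of the oil and the oil yield")
print(bag.most_common(3))
```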

Page 24: ContentMining for Synthetic Biology

Species
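Species names are one kind of fact that a regular expression can pick out. The sketch below only handles the abbreviated form (a capital initial, a full stop, then a lower-case epithet); the pattern is an illustrative assumption, not the expression used by ContentMine, and it will produce false positives on ordinary prose.

```python
import re

# Abbreviated genus (one capital + full stop), optional space, lower-case epithet.
ABBREV_BINOMIAL = re.compile(r"\b([A-Z])\.\s?([a-z]{3,})\b")

def find_abbreviated_species(text):
    """Return normalized 'G. epithet' strings in order of appearance."""
    return [f"{genus}. {epithet}" for genus, epithet in ABBREV_BINOMIAL.findall(text)]

sample = "Volatile composition of O. dayi, A.sibeiri and A. judaica."
found = find_abbreviated_species(sample)
print(found)
```

Normalizing the spacing ("A.sibeiri" becomes "A. sibeiri") makes later lookup against a species list simpler.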

Page 25: ContentMining for Synthetic Biology

Regex for Clinical Trials

Page 26: ContentMining for Synthetic Biology

CLINICAL TRIALS

How do we find (mentions of) clinical trials?

Is a document a (clinical) trial?

What is the subject of the trial?

What is the methodology used? How many/long?

Does the design and practice conform to CONSORT?

What are the outcomes?

Can we extract specific re-usable information?

Who is involved? (researchers, sponsors, patients?)

Has a proposed trial been completed and reported?
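One simple way to start finding mentions of clinical trials is to look for registry identifiers. The sketch below assumes two identifier formats: ClinicalTrials.gov ("NCT" plus 8 digits) and ISRCTN ("ISRCTN" plus 8 digits). The identifiers in the sample text are made up, and other registries would need their own patterns.

```python
import re

# Registry identifier formats assumed here:
#   ClinicalTrials.gov: "NCT" followed by 8 digits
#   ISRCTN registry:    "ISRCTN" followed by 8 digits
TRIAL_ID = re.compile(r"\b(NCT\d{8}|ISRCTN\d{8})\b")

def find_trial_ids(text):
    """Return the distinct trial IDs mentioned, in order of first mention."""
    seen = []
    for trial_id in TRIAL_ID.findall(text):
        if trial_id not in seen:
            seen.append(trial_id)
    return seen

# Invented example text; the identifiers are not real trials.
text = ("This randomised trial (NCT01234567) extends ISRCTN12345678; "
        "see NCT01234567 for the protocol.")
ids = find_trial_ids(text)
print(ids)
```

An identifier match answers only the first question on the slide; deciding whether the document itself reports a trial, or what its outcomes are, needs more than a regex.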

Page 27: ContentMining for Synthetic Biology

Natural Language Processing

Part of speech tagging (Wordnet, Brown Corpus, etc.)

Page 28: ContentMining for Synthetic Biology

Parsing chemical sentences

This could be extended to much other scientific language

Page 29: ContentMining for Synthetic Biology

http://chemicaltagger.ch.cam.ac.uk/


Typical chemical synthesis

Page 30: ContentMining for Synthetic Biology

Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.

Page 31: ContentMining for Synthetic Biology

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have processed 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B EUR.

Page 32: ContentMining for Synthetic Biology

[Plot: Y axis "Ln Bacterial load per fly", ticks read as 11.5, 11.0, 10.5, 10.0, 9.5, 9.0, 6.5, 6.0. X axis "Days post-infection", ticks 0 to 5.]

Bitmap Image and Tesseract OCR

Page 33: ContentMining for Synthetic Biology

[Figure labels: UNITS, TICKS, QUANTITY, SCALE, TITLES]

DATA!! 2000+ points

VECTOR PDF
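Once the ticks and scale of an axis are known, the position of each plotted point can be converted back into a data value. The following is a minimal sketch assuming linear axes calibrated from two tick marks each; the pixel coordinates are invented, and the axis ranges are borrowed from the bacterial-load plot for illustration.

```python
def make_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value.

    Assumes a linear axis calibrated by two tick marks:
    (pixel_a -> value_a) and (pixel_b -> value_b).
    """
    if pixel_a == pixel_b:
        raise ValueError("tick marks must be at different pixel positions")
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale

# Hypothetical pixel positions of two ticks on each axis.
x_of = make_axis_mapper(100, 0.0, 600, 5.0)   # days post-infection, 0 to 5
y_of = make_axis_mapper(500, 6.0, 60, 11.5)   # ln load, 6.0 to 11.5 (pixel y grows downward)

point_px = (350, 280)                          # a plotted symbol, in pixels
point = (x_of(point_px[0]), y_of(point_px[1]))
print(point)
```

With a vector PDF the pixel positions are exact drawing coordinates, which is why thousands of points can be recovered reliably; with a bitmap they first have to be estimated by image analysis and OCR.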

Page 34: ContentMining for Synthetic Biology

Dumb PDF

CSV

Semantic Spectrum

2nd Derivative

Smoothing Gaussian Filter

Automatic extraction
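Gaussian smoothing followed by a second derivative is a standard way to locate peaks in a digitized spectrum. The sketch below is a plain-Python illustration under assumed settings (sigma of 1 sample, unit spacing, clamped edges) with an invented spectrum; it is not the code behind the slide.

```python
import math

def gaussian_kernel(sigma):
    """Discrete Gaussian weights out to 3 sigma, normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(values, sigma=1.0):
    """Gaussian-smooth a list, clamping indices at both ends."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(values) - 1
    return [
        sum(k * values[min(max(i + j - radius, 0), last)]
            for j, k in enumerate(kernel))
        for i in range(len(values))
    ]

def second_derivative(values):
    """Central finite difference with unit spacing; endpoints are omitted,
    so element i - 1 of the result belongs to input index i."""
    return [values[i - 1] - 2 * values[i] + values[i + 1]
            for i in range(1, len(values) - 1)]

def peak_indices(values, sigma=1.0):
    """Indices of local maxima of the smoothed curve, confirmed by a
    negative second derivative at the same index."""
    s = smooth(values, sigma)
    d2 = second_derivative(s)
    return [
        i for i in range(1, len(s) - 1)
        if s[i] > s[i - 1] and s[i] >= s[i + 1] and d2[i - 1] < 0
    ]

# Invented intensities with two peaks, at indices 4 and 10.
spectrum = [0, 0, 1, 4, 9, 4, 1, 0, 0, 2, 6, 2, 0, 0]
peaks = peak_indices(spectrum)
print(peaks)
```

Smoothing first matters because differentiating raw digitized data amplifies noise; the filter width is a trade-off between noise suppression and merging neighbouring peaks.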

Page 35: ContentMining for Synthetic Biology

[Figure: phylogenetic tree processed by AMI. Family labels: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae. Genus labels: Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma. Posterior probabilities: 0.84, 0.91, 0.93, 0.95. AMI values: 23.12, 34.54, 37.21, 38.55. Formats: NexML, HTML.]

AMI can MEASURE branch lengths!

Page 36: ContentMining for Synthetic Biology

https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction

Thinning Topology

Serialization

Newick
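The final serialization step can be illustrated with a short recursive function that writes a tree in Newick notation. The tuple-based tree structure is a hypothetical choice for this sketch, and the small tree below is invented: it reuses genus names and numbers seen on the earlier slide but does not reproduce the real topology.

```python
def to_newick(node):
    """Serialize a tree node to Newick text (without the final ';').

    A node is (name, branch_length, children), where children is a list
    of nodes (empty for a leaf), name may be "" and branch_length may be None.
    """
    name, length, children = node
    text = ""
    if children:
        text = "(" + ",".join(to_newick(child) for child in children) + ")"
    text += name
    if length is not None:
        text += f":{length}"
    return text

# Invented topology, for illustration only.
tree = ("", None, [
    ("Acanthisitta", 23.12, []),
    ("", 34.54, [
        ("Acrocephalus", 37.21, []),
        ("Ailuroedus", 38.55, []),
    ]),
])
newick = to_newick(tree) + ";"
print(newick)
```

Because Newick is plain text, a tree recovered from an image in this way can be loaded directly into standard phylogenetics software.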

Page 37: ContentMining for Synthetic Biology

Problems

• Cannot do handwriting
• Scanned documents give poorer results
• The older the document the poorer the result
• Tables are a major problem
• Always try to get the original document
• XML is better than Word, which is better than PDF
• Vector images >> PNG > JPEG
• Maths and chemistry are specialist areas

Page 38: ContentMining for Synthetic Biology

POSSIBLE USES

• Indexing/searching the literature; G***** for science

• Current awareness; alerts and practices

• Extraction and re-use of facts; re-computation

• Multidisciplinary integration; co-occurrence

• Compliance with funder/institution policies

• Managing your Research Data!

• Finding similar and complementary colleagues

• Reproducibility, checking data and avoiding fraud

Page 39: ContentMining for Synthetic Biology

ContentMine Workshops and Hackdays

Open Science Brazil, 2014-08

Easily distributed software

Get started in 30 mins

Build application in a morning

Start simple: bagOfWords, Stemming, Regex, templates

Page 40: ContentMining for Synthetic Biology

Join our Open-Source community at http://www.contentmine.org