Mining Bioscience Literature
Peter Murray-Rust, University of Cambridge and TheContentMine
SynBio, Cambridge UK, 2015-05-18
Much Scientific Data lies hidden in text and images, in articles, theses, reports, patents, lab-books…
The ContentMine has Open collaborative tools that anyone can use to find facts and re-use for their own research
http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html
We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.
Adage in public health: “The road to inaction is paved with research papers.”
Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing
Scientific and Medical publication (STM)[+]
• World Citizens pay $400,000,000,000 …
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” [*] …
• … $10,000,000,000 from academic libraries …
• … to “publishers” who forbid access to 99.9% of citizens of the world …
• 85% of medical research is wasted (not published, badly conceived, duplicated, …)
[+] Figures probably ±50%
[*] arXiv preprint server costs $7 USD per paper
The Right to Read is the Right to Mine
http://contentmine.org
Facts Marked by “non-scientists” in ContentMine workshops
With Wikipedia everyone can be a scientist
ContentMine Workshops and Hackdays
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build application in a morning
Start simple: bagOfWords, Stemming, Regex, templates
Oxford 2013
Berlin 2014
Delhi 2014
Jenny Molloy with mascot AMI
OUR TEAM
Jenny Molloy @jenny_molloy
Ross Mounce @rmounce
Richard Smith-Unna @blahah404
Stephanie Smith-Unna @treblesteph
Mark MacGillivray @cottagelabs
Peter Murray-Rust @petermurrayrust
Charles Oppenheim @CharlesOppenh
Graham Steel @McDawg
Workshops (1-hour -> full day or more)
2014 (May to Nov)
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
• JISC, London
2015
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO
Collaborators
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE
• EuropePubmedCentral
Content-Mining (TDM*)
• Now COMPLETELY LEGAL IN UK since 2014-06-01 (“Hargreaves”)…
• … Whatever the publishers tell you. Do NOT sign their APIs
• UK can legally IGNORE contractual restrictions
• Movement to extend this to Europe (Julia Reda, MEP proposal)
• And STM publishers are spending millions to stop us
*Text and Data Mining
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
[Annotated example paper. Content types labelled: SECTIONS, MAPS, TABLES, CHEMISTRY, TEXT, MATH]
contentmine.org tackles these
“Nuggets” in a scientific paper: quantity, units, value ranges, chemical, project, places
Humans aren’t designed to mine this …
What is “Content”?
Emily Sena (neuroscience.ed.ac.uk) spends half a day digitising a diagram like this
ContentMine will soon be able to do it in 1 second
… Open semantic science …
• CRAWL the web for scientific documents (articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)
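The stages listed above can be pictured as a chain of small functions. The following is only a minimal sketch under stated assumptions: the function names, the in-memory `FAKE_WEB` dictionary and the term-lookup "miner" are hypothetical illustrations, not the interfaces of quickscrape, Norma, AMI or CAT.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the web (URL -> raw page); not real data.
FAKE_WEB = {
    "http://example.org/a1": "<html><body><p>Volatile composition of "
                             "<i>Artemisia judaica</i>.</p></body></html>",
    "http://example.org/a2": "<html><body><p>No   species   here.</p></body></html>",
}

def crawl():
    """CRAWL: yield the URLs of candidate documents."""
    yield from sorted(FAKE_WEB)

def scrape(url):
    """SCRAPE: fetch the raw page for one URL."""
    return FAKE_WEB[url]

def normalize(raw_html):
    """NORMALIZE: strip tags and collapse whitespace (a crude stand-in)."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", text).strip()

def mine(text, terms):
    """MINE: return the terms of interest that occur in the text."""
    return [term for term in terms if term in text]

def catalogue(facts_by_url):
    """CATALOGUE: invert {url: facts} into {fact: [urls]} for searching."""
    index = defaultdict(list)
    for url, facts in facts_by_url.items():
        for fact in facts:
            index[fact].append(url)
    return dict(index)

def run_pipeline(terms):
    """Run crawl -> scrape -> normalize -> mine -> catalogue once."""
    facts_by_url = {
        url: mine(normalize(scrape(url)), terms) for url in crawl()
    }
    return catalogue(facts_by_url)

index = run_pipeline(["Artemisia judaica", "Ebola"])
print(index)
```

Keeping each stage a separate function means any one of them (for example the miner) can be swapped without touching the others, which is the point of the staged design on the slide.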
contentmine.org Infrastructure
https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0
Daily Stream of 100,000 Open Facts
Twitter?
Indexed by CAT
http://catalogue.cottagelabs.com/browse
http://catalogue.cottagelabs.com/graph
[Pipeline diagram labels]
Stages: Crawl, Feed, quickscrape, Norma, AMI, Index & Transform
Inputs and formats: URL, DOI, scientific literature, repositories, XML, DOC, CSV, bad HTML, OCR, diagrams, sHTML
Scrapers: bespoke, XPath, per-journal
Taggers: per-journal, metadata
Plugins: regex, sequences, species, chemistry, phylogenetics, farming
Open NORMA-lized Scientific Literature + Facts
CANARY pipeline
CAT-alogue index
[Diagram labels: Daily Crawl of PLoS ONE, BMC1, BMC2, Closed1, Closed2, Hybrid; Crawl … Scrape … Normalize … Mine; FACTS; enhanced annotated articles; CATalog; Linked Open Data; Semantic Scientific Objects; 2000-5000 articles; CORE Repository UK]
[Second diagram adds: Open3, Closed3; selected retrospective REPO of articles, theses, reports, patents; FACTS]
Peter Murray-Rust
BMC publisher
Blue Obelisk paper (20 co-authors)
Sub-network From CATalog
Phytochemistry extraction
O. dayi
“volatile composition of”
A. sieberi
A. judaica
Displayed by CAT (CottageLabs)
Retrieval/Extraction Technologies
• Bag of Words (https://en.wikipedia.org/wiki/Bag-of-words_model)
• Term-Frequency Inverse-Document-Frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
• Regular Expressions
• Templates (Information Extraction)
• Natural Language Processing (NLP)
• Image processing and mining
• Lookup (Wikidata, Bioscience databases)
Bag of Words
Theses from HAL repository
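A bag of words simply counts how often each word occurs in a document, and TF-IDF then down-weights words that occur in every document. The sketch below uses only the Python standard library; the three sample sentences are made up for illustration and are not taken from the HAL theses.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split on non-letters, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def tf_idf(docs):
    """Return one {word: score} dict per document.

    tf  = count of the word in the document / total words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    bags = [bag_of_words(doc) for doc in docs]
    n_docs = len(bags)
    # Iterating a Counter yields each distinct word once per document.
    doc_freq = Counter(word for bag in bags for word in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return scores

# Made-up sample documents.
docs = [
    "the virus was found in the blood samples",
    "the reaction yield was high",
    "the virus spreads in hospitals",
]
scores = tf_idf(docs)

# Show the three highest-scoring words of the first document.
print(sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)[:3])
```

A word such as "the", which appears in every document, scores zero, while words confined to one document rise to the top.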
Species
Regex for Clinical Trials
CLINICAL TRIALS
How do we find (mentions of) clinical trials?
Is a document a (clinical) trial?
What is the subject of the trial?
What is the methodology used? How many/long?
Does the design and practice conform to CONSORT?
What are the outcomes?
Can we extract specific re-usable information?
Who are involved? (researchers, sponsors, patients?)
Has a proposed trial been completed and reported?
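The first of these questions, finding mentions of trials, is the one a regular expression can address most directly. The following is a hedged sketch, not the ContentMine regex set: it assumes registry identifiers of the form "NCT" plus eight digits (ClinicalTrials.gov) and "ISRCTN" plus eight digits, and the list of trial-design phrases is an illustrative choice. The sample sentence and its identifiers are invented.

```python
import re

# Assumed identifier shapes:
#   ClinicalTrials.gov: "NCT" followed by 8 digits
#   ISRCTN registry:    "ISRCTN" followed by 8 digits
TRIAL_ID = re.compile(r"\b(NCT\d{8}|ISRCTN\d{8})\b")

# Illustrative phrases that suggest a document reports a trial.
TRIAL_PHRASE = re.compile(
    r"\b(randomi[sz]ed controlled trial|double[- ]blind|placebo[- ]controlled)\b",
    re.IGNORECASE,
)

def find_trial_mentions(text):
    """Return registry identifiers and trial-design phrases found in text."""
    return {
        "ids": TRIAL_ID.findall(text),
        "phrases": [m.group(0).lower() for m in TRIAL_PHRASE.finditer(text)],
    }

# Invented sample text.
sample = (
    "This randomised controlled trial (NCT01234567) was double-blind; "
    "see also ISRCTN12345678."
)
mentions = find_trial_mentions(sample)
print(mentions)
```

Patterns like these answer "is a trial mentioned?"; questions about methodology, outcomes or CONSORT conformance need templates or NLP on top.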
Natural Language Processing
Part of speech tagging (Wordnet, Brown Corpus, etc.)
Parsing chemical sentences
This could be extended to much other scientific language
http://chemicaltagger.ch.cam.ac.uk/
Typical chemical synthesis
Automatic semantic markup of chemistry
Could be used for analytical, crystallization, etc.
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
[Plot: y-axis "Ln Bacterial load per fly", ticks from 6.0 to 11.5; x-axis "Days post-infection", ticks 0 to 5]
Bitmap Image and Tesseract OCR
Recovered from the plot: UNITS, TICKS, QUANTITY, SCALE, TITLES
DATA!! 2000+ points
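Once OCR has read the tick labels, turning plotted points back into data is a matter of calibrating each axis from two ticks. The sketch below assumes linear axes; the pixel positions are hypothetical, while the value ranges (0 to 5 days, 6.0 to 11.5 on the y-axis) follow the plot above.

```python
def make_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value.

    Assumes a linear axis calibrated from two ticks whose pixel
    positions and OCR-read values are known.
    """
    if pixel_a == pixel_b:
        raise ValueError("the two ticks must be at different pixels")
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale

# Hypothetical tick pixel positions.
x_of = make_axis_mapper(100, 0.0, 600, 5.0)   # days post-infection
y_of = make_axis_mapper(400, 6.0, 70, 11.5)   # image y grows downward

# Hypothetical pixel centres of three plotted points.
points_px = [(100, 400), (350, 235), (600, 70)]
points = [(x_of(px), y_of(py)) for px, py in points_px]
print(points)
```

Because image y coordinates grow downward, the y-axis mapper is given the lower tick at the larger pixel value; the same function handles both directions.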
[Figure labels: VECTOR PDF, Dumb PDF, CSV, Semantic Spectrum, 2nd Derivative, Smoothing Gaussian Filter, Automatic extraction]
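Gaussian smoothing followed by a second derivative is a common way to locate peaks in an extracted spectrum: a peak shows up as a negative minimum of the second derivative. The sketch below is a generic standard-library illustration of that idea on a synthetic signal; the parameter values and the edge handling (clamping) are assumptions, and this is not claimed to be the method used on the slide.

```python
import math

def gaussian_kernel(sigma, radius):
    """Discrete Gaussian weights over [-radius, radius], normalised to 1."""
    weights = [
        math.exp(-(i * i) / (2 * sigma * sigma))
        for i in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, sigma=1.0, radius=3):
    """Convolve with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def second_derivative(signal):
    """Central second difference; the two endpoints are left at 0.0."""
    d2 = [0.0] * len(signal)
    for i in range(1, len(signal) - 1):
        d2[i] = signal[i - 1] - 2 * signal[i] + signal[i + 1]
    return d2

def peak_positions(signal, sigma=1.0, radius=3):
    """Indices where the smoothed second derivative has a negative local minimum."""
    d2 = second_derivative(smooth(signal, sigma, radius))
    return [
        i for i in range(1, len(d2) - 1)
        if d2[i] < 0 and d2[i] < d2[i - 1] and d2[i] <= d2[i + 1]
    ]

# Synthetic spectrum: a single Gaussian-shaped peak centred at index 10.
spectrum = [math.exp(-((i - 10) ** 2) / 8.0) for i in range(21)]
print(peak_positions(spectrum))
```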
[Figure: phylogenetic tree image processed by AMI]
Family labels: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae
Genus labels: Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma
Posterior probability: 0.84, 0.91, 0.93, 0.95
AMI can MEASURE branch lengths! (AMI: 23.12, 34.54, 37.21, 38.55)
Other labels: NexML, HTML
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning Topology
Serialization
Newick
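The final serialization step can be illustrated with a few lines of code. This is a minimal sketch, assuming the tree has already been recovered as nested dictionaries; the topology and the internal branch length are hypothetical, and only the genus names are borrowed from the figure. It is not AMI's own serializer.

```python
def to_newick(node):
    """Serialize a nested node dict to Newick text (no trailing ';').

    Each node may carry "name", "length" (branch length to its parent)
    and "children" (a list of nodes). Names containing spaces or
    punctuation would need quoting, which this sketch does not handle.
    """
    text = ""
    children = node.get("children", [])
    if children:
        text = "(" + ",".join(to_newick(child) for child in children) + ")"
    text += node.get("name", "")
    if "length" in node:
        text += ":" + format(node["length"], "g")
    return text

# Hypothetical topology and branch lengths, for illustration only.
tree = {
    "children": [
        {"name": "Acanthisitta", "length": 23.12},
        {
            "length": 1.5,
            "children": [
                {"name": "Acrocephalus", "length": 34.54},
                {"name": "Ailuroedus", "length": 37.21},
            ],
        },
    ],
}
newick = to_newick(tree) + ";"
print(newick)
```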
Problems
• Cannot do handwriting
• Scanned documents give poorer results
• The older the document the poorer the result
• Tables are a major problem
• Always try to get the original document
• XML is better than Word, which is better than PDF
• Vector images >> PNG > JPEG
• Maths, chemistry are specialist
POSSIBLE USES
• Indexing/searching the literature; G***** for science
• Current awareness; alerts and practices
• Extraction and re-use of facts; re-computation
• Multidisciplinary integration; co-occurrence
• Compliance with funder/institution policies
• Managing your Research Data!
• Finding similar and complementary colleagues
• Reproducibility, checking data and avoiding fraud