Content Mining of Science in Europe
Peter Murray-Rust, ContentMine.org, University of Cambridge & Open Forum Europe
OFA, Brussels, BE 2015-10-22
What is mining?
Why is it useful?
How YOU can do it without using publishers’ APIs
Copyright and restrictive practices are still a major problem
The Right to Read is the Right to Mine*
*Peter Murray-Rust, 2011
http://contentmine.org
My European Heroes
Young People (ContentMine)
NEELIE KROES
Use Cases of ContentMining
• Epidemiology of obesity (Cambridge U)
• (OKF, OpenTrials) Mapping clinical trials repositories to reports in scientific literature
• Mining chemical reactions from patents
• Creating a bacterial supertree-of-life from 4500 papers
Polly has 20 seconds to read this paper…
…and 10,000 more
ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
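The workload in Polly's account can be checked with a few lines of arithmetic. This is only a sketch: the numbers come from the slides above, the variable names are illustrative, and combining the "20 seconds" per paper with the 10,000 abstracts is an assumption about how the two slides relate.

```python
# Figures taken from the slides; variable names are illustrative.
abstracts = 10_000
researchers = 6
min_days_each = 2            # lower end of "2-3 days" per researcher
seconds_per_abstract = 20    # assumption: 20 s applied to every abstract

per_researcher = abstracts / researchers          # about 1,667 abstracts each
person_days = researchers * min_days_each         # minimum full-time effort
reading_hours = abstracts * seconds_per_abstract / 3600

print(f"{per_researcher:.0f} abstracts per researcher")
print(f"{person_days} person-days at minimum")
print(f"{reading_hours:.1f} hours of pure reading at 20 s each")
```

Even at 20 seconds per abstract, the reading alone is more than a full working week for one person, which is the gap the mining software is meant to close.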
400,000 Clinical Trials
In 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What’s happened in the last 6 years?
Search the whole scientific literature for “2009-0100068-41”
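A minimal sketch of such a search, assuming the papers are already available as plain text. The function names and the three sample texts are made up for illustration and are not ContentMine code. The one design point shown is normalization: text extracted from PDF or HTML often replaces the hyphen with another dash character or inserts stray spaces, so both the identifier and the text are normalized before matching.

```python
import re

# Unicode dash variants that PDF/HTML extraction often substitutes for "-".
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"

def normalize(text):
    """Map dash variants to an ASCII hyphen and drop whitespace around hyphens."""
    text = re.sub("[" + DASHES + "]", "-", text)
    return re.sub(r"\s*-\s*", "-", text)

def find_identifier(identifier, papers):
    """Return the sorted ids of papers whose text mentions the identifier."""
    needle = normalize(identifier)
    return sorted(pid for pid, text in papers.items() if needle in normalize(text))

# Made-up sample texts, for illustration only.
papers = {
    "paper-A": "Registered as 2009\u20130100068\u201341 in the trial registry.",
    "paper-B": "No registry number is reported in this study.",
    "paper-C": "Trial 2009-0100068- 41 was completed in 2012.",
}

hits = find_identifier("2009-0100068-41", papers)
print(hits)
```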
ContentMine-ing strategy
• Discover. Crawl the COMPLETE relevant literature. => bibliography
• Scrape (download). ALL papers
• Index papers => Facts
• Search/analyze papers => complex science
• Extract, Annotate, Aggregate (“Transformative”)
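The strategy can be sketched as a chain of small functions over an in-memory corpus. Everything here is hypothetical: the function names, the corpus and the term list are illustrative stand-ins, and a real run would crawl and download from remote sources rather than read a dict. The sketch does not use the ContentMine tools themselves.

```python
import re
from collections import defaultdict

def discover(corpus, query):
    """Discover: ids of papers whose title mentions the query (the bibliography)."""
    return [pid for pid, paper in corpus.items()
            if query.lower() in paper["title"].lower()]

def scrape(corpus, ids):
    """Scrape: get the full text of every discovered paper (here a dict lookup)."""
    return {pid: corpus[pid]["text"] for pid in ids}

def index(texts, terms):
    """Index: map each dictionary term found (a 'fact') to the papers containing it."""
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    facts = defaultdict(set)
    for pid, text in texts.items():
        for match in pattern.findall(text):
            facts[match].add(pid)
    return facts

def aggregate(facts):
    """Aggregate: count how many papers mention each fact."""
    return {term: len(pids) for term, pids in facts.items()}

# Hypothetical corpus and lookup terms, for illustration only.
corpus = {
    "p1": {"title": "Bacterial phylogeny of gut isolates",
           "text": "We sampled Mycobacterium tuberculosis and Borrelia burgdorferi."},
    "p2": {"title": "A bacterial tree from 16S data",
           "text": "Borrelia burgdorferi was placed near the root."},
    "p3": {"title": "Obesity epidemiology",
           "text": "No bacteria were studied."},
}
terms = ["Mycobacterium tuberculosis", "Borrelia burgdorferi"]

ids = discover(corpus, "bacterial")
counts = aggregate(index(scrape(corpus, ids), terms))
print(ids, counts)
```

Keeping each stage as a separate function mirrors the bullets: the bibliography, the downloaded papers and the facts are each an inspectable intermediate result.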
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
[Diagram: the ContentMine pipeline, from sources to facts]
Sources: EuPMC, arXiv, CORE, HAL, (UNIV repos); ToC services; Publisher Sites (PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…)
Discover and download: catalogue, getpapers (query), DailyCrawl, crawl, quickscrape (scrapers, queries)
Formats handled: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
norma (Normalizer, Structurer, Semantic Tagger): Text, Data, Figures => Semantic ScholarlyHTML (abstract, methods, references, Captioned Figures, HTML tables), 30,000 pages/day
ami (search, Lookup, taggers) => Facts
COMMUNITY plugins: Chem, Phylo, Trials, Crystal, Plants
Visualization and Analysis
CONTENTMINE Complete OPEN Platform for Mining Scientific Literature
http://chemicaltagger.ch.cam.ac.uk/
Typical chemical synthesis
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
Facts in context
daily IUCN endangered species news
en.wikipedia.org CC BY-SA
ContentMine Fact of The Day
• Fact of the day
• Endangered species in recent science
• Facts
• Bubbles
https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA
“Root” 4500 papers each with 1 tree
OCR (Tesseract)
Norma (image analysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);
Semantic re-usable/computable output (ca 4 secs/image)
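The bracketed string above is in Newick format, which is what makes the output computable. As a minimal sketch, assuming only names and branch lengths need to be read (no quoted labels or comments), a small recursive parser is enough. The function names are illustrative, and the input is a closed-off fragment of the tree shown above, since the slide's full string is elided.

```python
def parse_newick(s):
    """Parse a simple Newick string into nested dicts: name, length, children."""
    s = s.strip()
    if not s.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1                      # consume '('
            children.append(node())
            while s[pos] == ",":
                pos += 1                  # consume ','
                children.append(node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                      # consume ')'
        start = pos
        while s[pos] not in ",():;":
            pos += 1
        name = s[start:pos]               # empty for unnamed internal nodes
        length = None
        if s[pos] == ":":
            pos += 1
            start = pos
            while s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return {"name": name, "length": length, "children": children}

    root = node()
    if s[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root

def leaves(tree):
    """Return leaf names in left-to-right order."""
    if not tree["children"]:
        return [tree["name"]]
    names = []
    for child in tree["children"]:
        names.extend(leaves(child))
    return names

# A fragment of the tree above, closed off so that it is valid Newick.
fragment = ("((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
            "Synergistes_jonesii:301);")
tree = parse_newick(fragment)
print(leaves(tree))
```

Once every extracted tree is in this form, the species names from thousands of papers can be compared and merged by standard supertree software.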
Supertree for 924 species
Tree
Supertree created from 4300 papers
Copyright and Mining
• PMR-premise: You cannot do reproducible scientific mining and avoid violating copyright.
• UK (“Hargreaves”) 2014 legislation:
– “personal” “non-commercial*” “research” “data analytics”
– legitimizes copying (?to disk), but not publishing
*teaching, textbooks, etc. may be “commercial”
Publishing and ICT
Trust these (publishing): Elsevier, Mendeley (Elsevier), Digital Science/Macmillan, Wiley, etc.
as much as you trust these (ICT): Microsoft, Facebook, Apple, etc.
STM Publishers prevent Mining
• FUD & disinformation about legality (Elsevier)
• Monopolies on infrastructure (“API”s, CCC Rightfind)
• Technical obstruction (Wiley Captcha, Macmillan Readcube)
• Restrictive contracts with libraries (ALL) [1]
• Wasting my/our time (ALL)
[1] [You may not] utilize the TDM Output to enhance … subject repositories in a way that would [… ] have the potential to substitute and/or replicate any other existing Elsevier products, services and/or solutions.
WILEY … “new security feature… to prevent systematic download of content”
“[limit of] 100 papers per day”
“essential security feature … to protect both parties (sic)”
CAPTCHA: User has to type words
ContentMine working with Libraries
• Cambridge: Library, Plant Sciences, Epidemiology, Chemistry
• Cochrane Collaboration on Systematic Reviews of Clinical Trials
• FutureTDM (H2020, LIBER)
• Running workshops and training