The ContentMine Scraping Stack Richard Smith-Unna Peter Murray-Rust University of Cambridge

The ContentMine Scraping Stack: how we'll harvest 100,000,000 facts from the scientific literature

DESCRIPTION

The ContentMine project (http://contentmine.org) will harvest 100 million facts from the literature. Here we summarise the technology stack we're building to enable the first step: collecting the literature. This presentation was given with a paper (https://github.com/Blahah/scraperJSON-demo-paper) at WOSP 2014.

Citation preview

Page 1

The ContentMine Scraping Stack

Richard Smith-Unna, Peter Murray-Rust
University of Cambridge

Page 2

Our mission

“make 100,000,000 facts from the scholarly literature open, accessible and reusable”

Page 3

The scale of the task

• ~27,000 peer-reviewed journals (Ulrich's)

• > 5,000 publishers

• new papers every day

Page 4

The pipeline

Page 5

scraperJSON

• scrapers all have the same plumbing

• ignore the plumbing, just configure

• benefits

  • supports large collections of scrapers

  • no programming required

  • not limited to one piece of software

Page 6

Basic scraperJSON

    {
      "name": "PLOS",
      "url": "plos\\w*.org",
      "elements": {
        "title": {
          "selector": "//h1[@property='dc:title']"
        }
      }
    }

• "name": name of the scraper

• "url": the URL(s) it applies to

• "elements": the elements to capture

• "title": element name

• "selector": where to find it

http://github.com/ContentMine/scraperJSON
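
As an illustration of how a definition like this can drive a scraper with no scraper-specific code, here is a minimal Python sketch. It is not how thresher is implemented (thresher is a Node.js library); it uses only the Python standard library, and the sample page, the `scrape` function and the list-valued result format are assumptions made for the example. ElementTree supports only a subset of XPath and does not accept absolute paths on an element, so the sketch rewrites a leading `//` as `.//`.

```python
import json
import xml.etree.ElementTree as ET

# The definition from the slide, stored as JSON text.
SCRAPER_JSON = r'''
{
  "name": "PLOS",
  "url": "plos\\w*.org",
  "elements": {
    "title": {
      "selector": "//h1[@property='dc:title']"
    }
  }
}
'''

# Hypothetical, well-formed page standing in for a downloaded article.
PAGE = """<html><body>
<h1 property="dc:title">An example article title</h1>
<p>Body text</p>
</body></html>"""


def scrape(scraper, page):
    """Apply every element selector in the definition to one page."""
    root = ET.fromstring(page)
    results = {}
    for name, spec in scraper["elements"].items():
        selector = spec["selector"]
        # ElementTree cannot evaluate "//x" on an element; ".//x" searches
        # all descendants of the root, which is close enough for this sketch.
        if selector.startswith("//"):
            selector = "." + selector
        matches = root.findall(selector)
        # Collect the text of every match (assumed output shape: a list).
        results[name] = ["".join(m.itertext()).strip() for m in matches]
    return results


if __name__ == "__main__":
    definition = json.loads(SCRAPER_JSON)
    print(json.dumps(scrape(definition, PAGE), indent=2))
```

The generic `scrape` function is the "plumbing"; supporting another site means writing another JSON definition, not more code. Real article pages are rarely well-formed XML, so a production scraper would need an HTML-tolerant parser with full XPath support.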

Page 11

bibJSON output

    {
      "title": "Ab Initio Identification of Novel Regulatory Elements in the Genome of Trypanosoma brucei by Bayesian Inference on Sequence Segmentation"
    }

Page 12

thresher & quickscrape

• reference implementation of scraperJSON

• thresher is the scraping library

  • http://github.com/ContentMine/thresher

• quickscrape is the command-line tool

  • http://github.com/ContentMine/quickscrape

• Node.js, MIT licensed

Page 13

journal-scrapers

http://github.com/ContentMine/journal-scrapers

a self-testing collection of scraperJSON scrapers for academic journals

• PLOS • MDPI • PeerJ • Wiley • ScienceDirect • Springer • Taylor & Francis • NPG, AAAS, RSC, ACS, …
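
Because each definition declares the URL(s) it applies to, a collection like this can be searched for the right scraper for any given article. The Python sketch below shows one way that lookup might work. It is an assumption for illustration, not the logic quickscrape uses: the `find_scraper` helper is hypothetical, and the PeerJ entry is a made-up stand-in rather than a copy of the real definition in journal-scrapers.

```python
import re

# A tiny stand-in for a scraper collection. The PLOS entry follows the
# slide; the PeerJ entry is hypothetical and only exists for this example.
SCRAPERS = [
    {
        "name": "PLOS",
        "url": r"plos\w*.org",
        "elements": {"title": {"selector": "//h1[@property='dc:title']"}},
    },
    {
        "name": "PeerJ",
        "url": r"peerj\.com",  # hypothetical pattern
        "elements": {},        # selectors omitted in this sketch
    },
]


def find_scraper(scrapers, url):
    """Return the first definition whose "url" regex matches, else None."""
    for scraper in scrapers:
        if re.search(scraper["url"], url):
            return scraper
    return None


if __name__ == "__main__":
    for url in ("http://www.plosone.org/article/example",
                "https://peerj.com/articles/1/",
                "https://example.org/paper"):
        match = find_scraper(SCRAPERS, url)
        print(url, "->", match["name"] if match else "no scraper")
```

Returning the first match keeps the sketch short; a larger collection might need a rule for resolving URLs that match more than one definition.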

Page 14

Future work

• GUI (browser plugin) for creating scrapers

• Standalone GUI for scraping

Page 15

Acknowledgements

• Peter Murray-Rust

• Michelle Brook

• Mark MacGillivray

• Emanuil Tolev

• Ross Mounce

• Jenny Molloy

• Our volunteer community and collaborators

• Funding: Shuttleworth Foundation

http://contentmine.org http://github.com/ContentMine