The ContentMine Scraping Stack: how we'll harvest 100,000,000 facts from the scientific...

The ContentMine Scraping Stack

Richard Smith-Unna, Peter Murray-Rust

University of Cambridge

Our mission

“make 100,000,000 facts from the scholarly literature open, accessible and reusable”

The scale of the task

• ~27,000 peer-reviewed journals (Ulrich's)

• > 5,000 publishers

• new papers every day

The pipeline

scraperJSON

• scrapers all have the same plumbing

• ignore the plumbing, just configure

• benefits

• supports large collections of scrapers

• no programming required

• not limited to one piece of software

Basic scraperJSON

{
  "name": "PLOS",
  "url": "plos\\w*.org",
  "elements": {
    "title": {
      "selector": "//h1[@property='dc:title']"
    }
  }
}

• "name": name of the scraper

• "url": the URL(s) it applies to (a regular expression)

• "elements": the elements to capture

• "title": element name

• "selector": where to find it (an XPath expression)

http://github.com/ContentMine/scraperJSON

bibJSON output

"title": "Ab Initio Identification of Novel Regulatory Elements in the Genome of Trypanosoma brucei by Bayesian Inference on Sequence Segmentation"
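The step from scraper definition to captured output can be sketched in a few lines of Python. This is a minimal illustration, not how thresher itself works: it uses only the Python standard library, the sample page and article URL are made up for the example, and the function name `scrape` is hypothetical. It returns every match as a list of strings and does not attempt the bibJSON conversion.

```python
import json
import re
import xml.etree.ElementTree as ET

# The scraper definition from the slide, closed off so that it parses as JSON.
SCRAPER = json.loads(r"""
{
  "name": "PLOS",
  "url": "plos\\w*.org",
  "elements": {
    "title": {
      "selector": "//h1[@property='dc:title']"
    }
  }
}
""")

TITLE = ("Ab Initio Identification of Novel Regulatory Elements in the "
         "Genome of Trypanosoma brucei by Bayesian Inference on "
         "Sequence Segmentation")

# Assumption: a tiny, well-formed stand-in for an article page.
SAMPLE_PAGE = (
    "<html><body>"
    '<h1 property="dc:title">' + TITLE + "</h1>"
    "</body></html>"
)


def scrape(scraper, url, page_source):
    """Apply every element selector in `scraper` to `page_source`."""
    # The "url" field is treated as a regular expression.
    if not re.search(scraper["url"], url):
        raise ValueError("scraper %r does not apply to %s" % (scraper["name"], url))

    root = ET.fromstring(page_source)
    results = {}
    for element_name, spec in scraper["elements"].items():
        # ElementTree supports only a subset of XPath and needs a relative
        # path, so "//h1[...]" becomes ".//h1[...]".
        matches = root.findall("." + spec["selector"])
        results[element_name] = ["".join(m.itertext()).strip() for m in matches]
    return results


if __name__ == "__main__":
    # Hypothetical article URL that the PLOS pattern matches.
    print(scrape(SCRAPER, "http://www.plosone.org/article/example", SAMPLE_PAGE))
```

The point of the design is that adding a new element to capture means adding an entry under "elements", not writing new code.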

thresher & quickscrape

• reference implementation of scraperJSON

• thresher is the scraping library

• http://github.com/ContentMine/thresher

• quickscrape is the command-line tool

• http://github.com/ContentMine/quickscrape

• Node.js, MIT licensed

journal-scrapers

http://github.com/ContentMine/journal-scrapers

a self-testing collection of scraperJSON scrapers for academic journals

• PLOS • MDPI • PeerJ • Wiley • ScienceDirect • Springer • Taylor & Francis • NPG, AAAS, RSC, ACS, …
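Because every definition carries its own "url" pattern, choosing the right scraper from a large collection can be a simple lookup. The sketch below shows one way this might work; it is an assumption, not the selection logic of thresher or quickscrape. Only the PLOS pattern comes from the slides; the PeerJ and MDPI patterns, the test URLs and the function name `find_scraper` are hypothetical.

```python
import re

# A toy collection of scraper definitions, reduced to name and URL pattern.
SCRAPERS = [
    {"name": "PLOS", "url": "plos\\w*.org"},   # pattern from the slides
    {"name": "PeerJ", "url": "peerj\\.com"},   # hypothetical pattern
    {"name": "MDPI", "url": "mdpi\\.com"},     # hypothetical pattern
]


def find_scraper(url, scrapers):
    """Return the first definition whose "url" regex matches, else None."""
    for scraper in scrapers:
        if re.search(scraper["url"], url):
            return scraper
    return None


if __name__ == "__main__":
    for url in ("http://www.plosone.org/article/example",
                "https://peerj.com/articles/1/",
                "http://example.org/"):
        match = find_scraper(url, SCRAPERS)
        print(url, "->", match["name"] if match else "no scraper")
```

Supporting another publisher then means adding one more definition to the collection.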

Future work

• GUI (browser plugin) for creating scrapers

• Standalone GUI for scraping

Acknowledgements

• Peter Murray-Rust

• Michelle Brook

• Mark MacGillivray

• Emanuil Tolev

• Ross Mounce

• Jenny Molloy

• Our volunteer community and collaborators

• Funding: Shuttleworth Foundation

http://contentmine.org http://github.com/ContentMine
