Scholarly Infrastructure: Open or Closed? Peter Murray-Rust*, University of Cambridge and OpenKnowledge DRTD-SHS, Lille, FR 2015-04-21 We can build an Open discovery and re-use system. Theses represent huge untapped communal knowledge. Bliss was it in that dawn to be alive, But to be young was very heaven! Wordsworth on the French Revolution

ContentMine: Liberating scholarship from Open publications and theses


Page 1: ContentMine: Liberating scholarship from Open publications and theses

Scholarly Infrastructure: Open or Closed?

Peter Murray-Rust*, University of Cambridge and OpenKnowledge

DRTD-SHS, Lille, FR 2015-04-21

We can build an Open discovery and re-use system.

Theses represent huge untapped communal knowledge.

Bliss was it in that dawn to be alive, But to be young was very heaven!

Wordsworth on the French Revolution

Page 2: ContentMine: Liberating scholarship from Open publications and theses

Scholarly infrastructure becomes closed

No accountability for monitoring and control

Page 3: ContentMine: Liberating scholarship from Open publications and theses

The Digital Enlightenment: some of my icons

Diderot, Paris, 1751

Berkeley, US, 1966; Paris, 1968

UK, 1969-73

Page 4: ContentMine: Liberating scholarship from Open publications and theses

“How We Stopped SOPA”:

This bill ... shut down whole websites. Essentially, it stopped Americans from communicating entirely with certain groups....

I called all my friends, and we stayed up all night setting up a website for this new group, Demand Progress, with an online petition opposing this noxious bill.... We [got] ... 300,000 signers.... We met with the staff of members of Congress and pleaded with them.... And then it passed unanimously....

And then, suddenly, the process stopped. Senator Ron Wyden ... put a hold on the bill.

He added, “We won this fight because everyone made themselves the hero of their own story. Everyone took it as their job to save this crucial freedom.”

Robert Swartz: "Aaron was killed by the government, and MIT betrayed all of its basic principles."

Aaron Swartz

Page 5: ContentMine: Liberating scholarship from Open publications and theses

Some Children of the Digital Enlightenment

• David Carroll & Joe McArthur: OAButton
• Rayna Stamboliyska & Pierre-Carl Langlais
• Jon Tennant
• Ross Mounce
• Jenny Molloy
• Erin McKiernan
• Jack Andraka
• Michelle Brook
• Heather Piwowar
• The ContentMine Team
• Rufus Pollock
• Jonathan Gray
• Sophie Kay

Jean-Claude Bradley [1], a chemist, developed Open Notebook Science: making the entire primary record of a research project publicly available online as it is recorded. (WP)

J-C promoted these ideas with UNDERGRADUATE scientists.

[1] Unfortunately J-C died in 2014; we held a memorial meeting in Cambridge

Sophie Kay

Page 6: ContentMine: Liberating scholarship from Open publications and theses

http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html

We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.

Adage in public health: “The road to inaction is paved with research papers.”

Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)

Vera Mussah (director of county health services)

Cameron Nutt (Ebola response adviser to Partners in Health)

A System Failure of Scholarly Publishing

Page 7: ContentMine: Liberating scholarship from Open publications and theses

Open Scholarship must build its own discovery system before it is too late

Communities of Practice + software:

• Wikip(m)edia
• Open Street Map
• Open Corporates

Theses are under OUR control and hugely valuable.

Page 8: ContentMine: Liberating scholarship from Open publications and theses

eTheses

• Citizens pay $20,000,000,000*…

• … for research in 200,000 science theses*…

• … cost $100,000 each to create* …

• … re-use ??? (near zero)

• … Value???

• *Please challenge these numbers…

• NOTE: we pay publishers $15,000,000,000 for journals and APCs
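A quick check of the arithmetic behind the first three bullets, using only the figures quoted above (which the slide itself invites readers to challenge):

```latex
\[
200{,}000 \ \text{theses} \times \$100{,}000 \ \text{per thesis} = \$20{,}000{,}000{,}000
\]
```

On these figures the spend on theses is about 1.3 times the $15,000,000,000 quoted for journals and APCs.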

Page 9: ContentMine: Liberating scholarship from Open publications and theses

Linked Open Data – the world’s knowledge

very little physical science and THESES??

http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png

Diagram labels: DBPedia; BIO; Comp; Lib; PDB; Ontologies; GOV; GOV.uk; Music, Art, Literature; Social; Knowledgebases; RDF triples

Page 10: ContentMine: Liberating scholarship from Open publications and theses

Liberation Software

Steve Coast developed OpenStreetMap to challenge the monopoly of the UK Ordnance Survey

Page 11: ContentMine: Liberating scholarship from Open publications and theses

The Right to Read is the Right to Mine

http://contentmine.org

Page 12: ContentMine: Liberating scholarship from Open publications and theses

OUR TEAM

• Jenny Molloy (@jenny_molloy)
• Ross Mounce (@rmounce)
• Richard Smith-Unna (@blahah404)
• Stephanie Smith-Unna (@treblesteph)
• Mark MacGillivray (@cottagelabs)
• Peter Murray-Rust (@petermurrayrust)
• Charles Oppenheim (@CharlesOppenh)
• Graham Steel (@McDawg)

Page 13: ContentMine: Liberating scholarship from Open publications and theses

https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0

Daily Stream of 100,000 Open Facts

Twitter? Indexed by CAT

http://catalogue.cottagelabs.com/browse

http://catalogue.cottagelabs.com/graph

Page 14: ContentMine: Liberating scholarship from Open publications and theses

Content-Mining (TDM*)

• Now COMPLETELY LEGAL IN UK since 2014-06-01 (“Hargreaves”)…

• … Whatever the publishers tell you. Do NOT sign their APIs

• UK can legally IGNORE contractual restrictions

• Movement to extend this to Europe (Julia Reda, MEP proposal)

• And STM publishers are spending millions to stop us

*Text and Data Mining

Page 15: ContentMine: Liberating scholarship from Open publications and theses

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

Labelled content types: SECTIONS, MAPS, TABLES, CHEMISTRY, TEXT, MATH

contentmine.org tackles these

Page 16: ContentMine: Liberating scholarship from Open publications and theses

“nuggets” in a scientific paper

Labelled nuggets: quantity, units, value ranges, chemical, project, places

Humans aren’t designed to mine this …

Page 17: ContentMine: Liberating scholarship from Open publications and theses

What is “Content”?

Emily Sena (neuroscience.ed.ac.uk) spends half a day digitising a diagram like this

ContentMine will soon be able to do it in 1 second

Page 18: ContentMine: Liberating scholarship from Open publications and theses

• CRAWL the web for scientific documents (articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form

… Open semantic science …

• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)

contentmine.org Infrastructure
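The stages listed on this slide can be sketched as a chain of small functions. This is only an illustrative toy, not the ContentMine code: every function name, the `Document` class, the example.org URLs and the in-memory "web" are assumptions made up for the sketch, and the mining step is a single hard-coded regex.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    url: str
    raw: str                      # scraped page content
    normalized: str = ""          # plain text after normalization
    facts: list = field(default_factory=list)


def crawl(seed_pages):
    """CRAWL: yield candidate URLs (here, the keys of an in-memory dict)."""
    yield from seed_pages.keys()


def scrape(url, seed_pages):
    """SCRAPE: fetch one page (here, a dictionary lookup, no network)."""
    return Document(url=url, raw=seed_pages[url])


def normalize(doc):
    """NORMALIZE: strip markup and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", doc.raw)
    doc.normalized = re.sub(r"\s+", " ", text).strip()
    return doc


def mine(doc, pattern):
    """MINE: collect every match of the pattern as a 'fact'."""
    doc.facts = pattern.findall(doc.normalized)
    return doc


def catalogue(docs):
    """CATALOGUE: build an index from each fact to the URLs that mention it."""
    index = {}
    for doc in docs:
        for fact in doc.facts:
            index.setdefault(fact, []).append(doc.url)
    return index


def run_pipeline(seed_pages, pattern):
    docs = [mine(normalize(scrape(url, seed_pages)), pattern)
            for url in crawl(seed_pages)]
    return catalogue(docs)


# Hypothetical input: two tiny pages standing in for the scientific literature.
PAGES = {
    "http://example.org/a": "<p>We observed <i>Turdus iliacus</i> in winter.</p>",
    "http://example.org/b": "<p><i>Pavo cristatus</i> and <i>Turdus iliacus</i> were compared.</p>",
}
SPECIES = re.compile(r"\b(?:Turdus iliacus|Pavo cristatus)\b")

index = run_pipeline(PAGES, SPECIES)
for fact, urls in index.items():
    print(fact, urls)
```

Keeping each stage a separate function mirrors the slide: any stage (for instance the miner) can be swapped without touching the others, and the whole chain can be re-run daily.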

Page 19: ContentMine: Liberating scholarship from Open publications and theses

Pipeline diagram (labels as extracted; grouping approximate):

• Sources: scientific literature, repositories
• Inputs: URL, DOI, PDF, XML, DOC, CSV, bad HTML
• quickscrape: crawl, feed; bespoke per-journal scrapers (XPath)
• Norma: index & transform; per-journal taggers; OCR; diagrams; sHTML
• AMI plugins: regex, sequences, species, metadata, chemistry, phylogenetics, farming
• Open NORMA-lized Scientific Literature + Facts
• CANARY pipeline
• CAT-alogue index

Page 20: ContentMine: Liberating scholarship from Open publications and theses

CORE Repository UK

Page 21: ContentMine: Liberating scholarship from Open publications and theses

HAL repository FR

Page 22: ContentMine: Liberating scholarship from Open publications and theses

Retrieval/Extraction Technologies

• Bag of Words (https://en.wikipedia.org/wiki/Bag-of-words_model)

• Term-Frequency Inverse-Document-Frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

• Regular Expressions

• Templates (Information Extraction)

• Natural Language Processing (NLP)

• Image processing and mining

• Lookup (Wikidata, Bioscience databases)
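The first two techniques in this list can be shown in a few lines. This is a minimal sketch using only the Python standard library; the three sample sentences are invented, the tokenizer is a crude lowercase letter-run split, and the weighting is one common unsmoothed variant (relative term frequency times ln(N/df)), assumed here rather than taken from the talk.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tf_idf(docs):
    """Return one {word: score} dict per document.

    tf  = count of the word / total words in the document
    idf = ln(number of documents / number of documents containing the word)
    """
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())   # each document counts once per word
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({word: (count / total) * math.log(n_docs / doc_freq[word])
                       for word, count in bag.items()})
    return scores


# Invented sample "documents".
DOCS = [
    "The thesis reports a chemical synthesis",
    "The thesis reports a phylogenetic tree",
    "The trial reports a clinical outcome",
]

scores = tf_idf(DOCS)
for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word:12s} {score:.3f}")
```

Words that occur in every document ("the", "reports", "a") score zero, while a word unique to one document ("chemical") scores highest, which is why TF-IDF is a cheap first filter for what a thesis is about.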

Page 23: ContentMine: Liberating scholarship from Open publications and theses

Bag of Words

Theses from HAL repository

Page 24: ContentMine: Liberating scholarship from Open publications and theses

Species

Page 25: ContentMine: Liberating scholarship from Open publications and theses

Regex for Clinical Trials

Page 26: ContentMine: Liberating scholarship from Open publications and theses

CLINICAL TRIALS

• How do we find (mentions of) clinical trials?
• Is a document a (clinical) trial?
• What is the subject of the trial?
• What is the methodology used? How many/long?
• Does the design and practice conform to CONSORT?
• What are the outcomes?
• Can we extract specific re-usable information?
• Who are involved? (researchers, sponsors, patients?)
• Has a proposed trial been completed and reported?
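The first question, finding mentions of trials, is the kind of task a regular expression handles well. The sketch below assumes two registry identifier formats, ClinicalTrials.gov ("NCT" plus eight digits) and ISRCTN ("ISRCTN" plus eight digits); it is not the regex set used by ContentMine, and the identifiers in the sample text are made up.

```python
import re

# Registry identifiers: NCT or ISRCTN followed by exactly eight digits.
TRIAL_ID = re.compile(r"\b(NCT\d{8}|ISRCTN\d{8})\b")


def find_trial_ids(text):
    """Return the distinct trial identifiers mentioned in the text, sorted."""
    return sorted(set(TRIAL_ID.findall(text)))


# Invented sample sentence; the identifiers are placeholders, not real trials.
SAMPLE = ("The study (NCT01234567) followed ISRCTN12345678; "
          "see also NCT01234567 and the malformed NCT123.")

trial_ids = find_trial_ids(SAMPLE)
print(trial_ids)
```

A match only shows that a trial is mentioned; the later questions (methodology, CONSORT conformance, outcomes) need templates or NLP rather than a single pattern.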

Page 27: ContentMine: Liberating scholarship from Open publications and theses

How a machine reads a chemical thesis

nodes are compounds; arrows are reactions

Page 28: ContentMine: Liberating scholarship from Open publications and theses

Natural Language Processing

Part of speech tagging (Wordnet, Brown Corpus, etc.)

Page 29: ContentMine: Liberating scholarship from Open publications and theses

Parsing chemical sentences

Page 30: ContentMine: Liberating scholarship from Open publications and theses

http://chemicaltagger.ch.cam.ac.uk/


Typical chemical synthesis

Page 31: ContentMine: Liberating scholarship from Open publications and theses

Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.

Page 32: ContentMine: Liberating scholarship from Open publications and theses

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

Page 33: ContentMine: Liberating scholarship from Open publications and theses

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other

CLICK HERE FOR ANIMATION

(may be browser dependent)

Page 34: ContentMine: Liberating scholarship from Open publications and theses

Evolution of ultraviolet vision in the largest avian radiation - the passerines. Anders Ödeen 1*, Olle Håstad 2,3 and Per Alström 4

PDF

HTML

Styles, superscripts and diåcritics preserved!

AMI

Page 35: ContentMine: Liberating scholarship from Open publications and theses

PDF

Turdus iliacus, Taeniopygia guttata, Serinus canaria, Lanius excubitor, Melopsittacus undulatus, Pavo cristatus, Sturnus vulgaris, Dolichonyx oryzivorus, Ficedula hypoleuca, Vaccinium myrtillus, Falco tinnunculus

Turdus, Pomatostomus, Leothrix, Amytornis, Acanthisitta, Orthonyx x 2, Malurus, Cnemophilus x 4, Philesturnus x 2, Motacilla x 2, Toxorhampus x 2

Page 36: ContentMine: Liberating scholarship from Open publications and theses

Typical phylo tree: 60 nodes, complex and minuscule annotation, vertical text, hyphenation and valuable branch lengths. AMI extracts ALL

Page 37: ContentMine: Liberating scholarship from Open publications and theses

Diagram labels (AMI output as HTML and NexML):

• Family: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae
• Genus: Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma
• Posterior probability
• Values shown: 0.84, 0.91, 0.93, 0.95 and 23.12, 34.54, 37.21, 38.55

AMI can MEASURE branch lengths!

Page 38: ContentMine: Liberating scholarship from Open publications and theses

https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction

Thinning Topology

Serialization

Newick
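The final step, serializing a recovered tree to Newick, can be sketched as a short recursive function. This is a generic illustration, not AMI's implementation: the tuple representation is an assumption, and the tree below is invented (genus names and numbers are borrowed from the earlier slides, but their pairing and the topology are made up).

```python
def to_newick(node):
    """Serialize a (name, branch_length, children) tuple to Newick text.

    Leaves have an empty children list; internal nodes may have an empty
    name; a branch length of None is omitted (as for the root).
    """
    name, length, children = node
    text = name
    if children:
        text = "(" + ",".join(to_newick(child) for child in children) + ")" + name
    if length is not None:
        text += f":{length}"
    return text


# Invented example tree: topology and length assignments are hypothetical.
TREE = ("", None, [
    ("Turdus", 23.12, []),
    ("", 34.54, [
        ("Malurus", 37.21, []),
        ("Amytornis", 38.55, []),
    ]),
])

newick = to_newick(TREE) + ";"   # a Newick string ends with a semicolon
print(newick)
```

Once a tree image has been reduced to nodes, edges and measured lengths, this plain-text form is what makes it re-usable in standard phylogenetics software.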

Page 39: ContentMine: Liberating scholarship from Open publications and theses

Problems

• Cannot do handwriting

• Scanned documents give poorer results

• The older the document the poorer the result

• Tables are a major problem

• Always try to get the original document

• XML is better than Word, which is better than PDF

• Vector images >> PNG > JPEG

• Maths, chemistry are specialist

Page 40: ContentMine: Liberating scholarship from Open publications and theses

Additional material on Open Notebook Science (not presented)

Page 41: ContentMine: Liberating scholarship from Open publications and theses

Free/Open Software Development

Diagram labels: engineered repository; world community; CODE (rewrite, validate, fork, re-use); Github, BitBucket, StackOverflow, Apache; OSI; inspires

Example: ContentMine at http://github.com/ContentMine/quickscrape

Page 42: ContentMine: Liberating scholarship from Open publications and theses

Sophie Kershaw, Panton Fellow, Training PhD Students

Page 43: ContentMine: Liberating scholarship from Open publications and theses

“Do you think you would be more confident in the future about trying to apply Open techniques to your work..?”

• 50% Yes, by myself
• 41% Yes, with help/guidance
• 9% No opinion/neutral
• 0% No

Page 44: ContentMine: Liberating scholarship from Open publications and theses

Rotation-Based Learning (RBL)

Phase 1: Initiator

• No communication permitted between groups
• Attempt to reproduce existing literature
• Deliver a coherent research story by the end of Phase 1

Phase 2: Successor

• Communication between groups still prohibited
• Validate and develop the inherited research story
• Critique your predecessors

• Role of research producer vs. research user
• Can this approach help to foster awareness of reproducibility issues?

Throughout Phases 1 & 2:

• Daily lectures on open science culture & techniques
• First-hand application to own research work
• Version control using GitHub
• Daily group supervision

Page 45: ContentMine: Liberating scholarship from Open publications and theses

http://michaelnielsen.org/blog/reinventing-discovery/

http://en.wikipedia.org/wiki/Reinventing_Discovery

Page 46: ContentMine: Liberating scholarship from Open publications and theses

Open Notebook Science

Diagram labels: TOOLS; open engineered repository; world community; INSTRUMENT; validate; merge; MODEL; CODE; DATA; knowledge; calibrate

Problems are solved communally; Nothing is needlessly duplicated; “publication“ is continuous

Machines and humans working together

CC-BY

Page 47: ContentMine: Liberating scholarship from Open publications and theses

“Free” and “Open”

• “Free software is a matter of liberty, not price. ‘Free speech’, not ‘free beer’.” (R M Stallman)

• “A piece of data or content is open if anyone is free to use, reuse, and redistribute it” (OKFN) http://opendefinition.org/

• “open” (access) has multiple incompatible “definitions”. Major split is “human eyeballs” vs copying and machine “reusability”

• “Open” is a marketing term for publishers, who frequently (often deliberately) do not grant full Openness.

“Gratis” vs “Libre”

Page 48: ContentMine: Liberating scholarship from Open publications and theses

Critical Historical Open Events

• Free Software Foundation (RMS, 1985) and Linux (Torvalds, 1991)

• The World Wide Web (TBL, 1991)

• The human genome (1990-2001)

The life of Aaron Swartz (1986-2013)

Page 49: ContentMine: Liberating scholarship from Open publications and theses

http://www.budapestopenaccessinitiative.org/read

… an unprecedented public good. …

… completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. …

… Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2002)

Page 50: ContentMine: Liberating scholarship from Open publications and theses

Panton Authors and Fellows