ContentMine: Open Data and Social Machines


Peter Murray-Rust,

Computation Lab, Univ of Chicago, 2014-11-12

ContentMine: We use machines to liberate[1] 100 million facts/yr from the scientific scholarly literature and make them free for everyone (WikiData)

WikiData and ContentMines are social machines. There are no longer any technical obstacles, only people.

[1] Friday workshop: build your own social machine: scraping XML, PDFs, web pages, FORTRAN output; chemistry, evolutionary biology, computational materials science, social science…

Liberation Software

http://en.wikipedia.org/wiki/Tim_Berners-Lee

Everything in this presentation is ODOSOS (Open Data, Open Standards, Open Source): CC0, CC-BY, W3C etc., Apache2, etc.

Open = “Free to use, re-use and redistribute”

http://contentmine.org
http://bitbucket.org/petermr
http://wwmm.ch.cam.ac.uk

A promise: I (Petermr) will never sell out to non-transparent organizations.

http://www.budapestopenaccessinitiative.org/read

… an unprecedented public good. …

… completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. …

… Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2002)

Scientific and Medical publication (STM)[+]

• World Citizens pay $400,000,000,000 …
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” [*] …
• … $10,000,000,000 from academic libraries …
• … to “publishers” who forbid access to 99.9% of citizens of the world …

[+] Figures probably ±50%
[*] arXiv preprint server costs $7 USD per paper
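The rounded figures in this list can be checked against each other with a few lines of arithmetic. The sketch below uses only the numbers quoted on the slide; the variable names are illustrative.

```python
# Figures quoted on the slide (the slide itself rates them as good to about 50%).
TOTAL_RESEARCH_SPEND_USD = 400_000_000_000
ARTICLES_PER_YEAR = 1_500_000
PUBLISH_COST_PER_ARTICLE_USD = 7_000
ARXIV_COST_PER_PAPER_USD = 7

# Research cost behind one article.
cost_to_create = TOTAL_RESEARCH_SPEND_USD / ARTICLES_PER_YEAR

# Total paid for "publishing" all articles.
total_publishing = PUBLISH_COST_PER_ARTICLE_USD * ARTICLES_PER_YEAR

# How much more a publisher charges per paper than arXiv spends.
ratio = PUBLISH_COST_PER_ARTICLE_USD / ARXIV_COST_PER_PAPER_USD

print(f"research cost per article: ${cost_to_create:,.0f}")
print(f"total publishing cost:     ${total_publishing:,.0f}")
print(f"publisher vs arXiv cost:   {ratio:,.0f}x")
```

This gives about $267,000 per article (rounded to $300,000 on the slide), $10.5 billion in total publishing cost (rounded to $10 billion), and a 1000-fold gap between the per-paper publishing charge and the arXiv cost.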

petermr: I believe in Wikipedia

• 2006 http://en.wikipedia.org/wiki/User:Petermr

• 2006 started Open Data (term unknown then!)

• 2009: “the bit of Wikipedia that I wrote is correct” [challenging the idea of “WP is junk”]

• 2009: “Wikipedia is the digital library of this century”

• 2012: I alert WP that Springer has copyrighted > 1000 of our images [Springergate]

• 2014: “For facts in maths, physical and biological sciences I trust Wikipedia.” (Wikimania2014)

A meritocratic critical volunteer community

Volunteer community in chemistry: Open Data/Source/Standards

4 Billion USD on the human genome yielded 800 Billion USD and 4 M job-years

Gloom Warning

…three problems—flawed design, non-publication, and poor reporting—together meant >85% of research funds were wasted, a global total loss >100 billion USD per year. [Lancet 2009]

[Even more] waste clearly occurs after publication: from poor access, poor dissemination, and poor uptake of the findings of research. [PLOS Medicine 2014-05-27]

Bad publication wastes science

Publishers’ PDFs destroy science

PDFs do not contain words or subscripts!

PDFs do not contain tables and do not have columns

SVG is turned into JPEG because it’s easier to process

Elsevier wants to control Open Data

[asked by Michelle Brook]

STM Publishers Licence: 2012_03_15_Sample_Licence_Text_Data_Mining.pdf (Summary: PMR has NO rights)

• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or website”
• “Subscriber shall pay a […] fee”

Heather Piwowar: “negotiating with publishers [made me physically ill]”

WE WALKED OUT

• Brit Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R

Licences destroy Content Mining

CLOSED ACCESS MEANS PEOPLE DIE

CLOSED DATA MEANS PEOPLE DIE

Happiness Restored

The scientist’s amanuensis

• "The bane of my life is doing things I know computers could do for me" (Dan Connolly, W3C)

Example: A semantic amanuensis could

• Give me a daily digest of mineralogy papers
• Extract all the crystal structures from them
• Compute physical properties with GULP and NWChem
• Compare the results statistically
• Preserve and distribute the complete operation
• Prepare the results for publication

The semantic web is having a personal amanuensis

Artificial Intelligence in science

In 1970 chess and chemistry were the sandboxes for AI. Some approaches:

• Lookup (Knowledge)
• Natural Language Processing (NLP)
• Brute force calculation (inc. physical methods)
• Tree-pruning and heuristics
• Logic (cf. OWL-DL)
• Human-machine integration (crowdsourcing)
• Computer Vision

Domain-specific Turing test: Can a machine pass a first-year chemistry exam?

The Semantic Web

"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation."

Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001

CC-BY-SA Images from Wikipedia

“Which Rivers flow into the Rhine and are longer than 50 kilometers?” or “Which Skyscrapers in China have more than 50 floors and have been constructed before the year 2000?”

Open Crystallography?

“Which countries where tropical diseases are endemic have published structures of chiral natural products?”
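Questions like these become simple filters once the facts are stored as subject-predicate-object triples. The sketch below answers the Rhine question over a toy in-memory triple list. The river names, lengths and predicate names are invented placeholders, not DBpedia data, and a real system would query an RDF store rather than a Python list.

```python
# Toy triple store: (subject, predicate, object).
# All entries are illustrative placeholders, not real data.
TRIPLES = [
    ("River_A", "flows_into", "Rhine"),
    ("River_A", "length_km", 120.0),
    ("River_B", "flows_into", "Rhine"),
    ("River_B", "length_km", 30.0),
    ("River_C", "flows_into", "Danube"),
    ("River_C", "length_km", 300.0),
]


def objects(triples, subject, predicate):
    """Return every object stored for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def tributaries_longer_than(triples, river, min_km):
    """Subjects that flow into `river` and have a length above `min_km`."""
    result = []
    for s, p, o in triples:
        if p == "flows_into" and o == river:
            lengths = objects(triples, s, "length_km")
            if any(length > min_km for length in lengths):
                result.append(s)
    return sorted(result)


rhine_answer = tributaries_longer_than(TRIPLES, "Rhine", 50)
print(rhine_answer)
```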

Linked Open data from Wikipedia

CC-BY-SA from Wikipedia

The Right to Read is the Right to Mine

http://contentmine.org

• Science can be read and understood by human-machine Amanuensis-symbionts.

• Amanuenses are based on Wikipedia, databases and software (e.g. ContentMine’s AMI)

• The results are fed back into WP and WikiData

http://en.wikipedia.org/wiki/Symbiosis http://en.wikipedia.org/wiki/Eric_Fenby

• Crawl scientific literature (Open Bibliography)
• Scrape each scientific article (ContentMine-quickscrape)
• Extract the facts (ContentMine-AMI)
• Index (Wikipedia)
• Republish (WikiData)
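The five steps above can be sketched as a chain of small functions. This is a toy illustration only: the function names, the in-memory corpus and the vocabulary are invented for the example, and nothing here calls quickscrape, AMI, Wikipedia or WikiData.

```python
import re

# Hypothetical stand-in for the literature: article id -> raw HTML.
CORPUS = {
    "paper-1": "<p>Soybean meal was fermented.</p>",
    "paper-2": "<p>The lion and the soybean.</p>",
}

# Hypothetical vocabulary: term -> Wikipedia page title.
VOCAB = {"lion": "Lion", "soybean": "Soybean"}


def crawl(corpus):
    """Step 1: list the article identifiers to process."""
    return sorted(corpus)


def scrape(corpus, article_id):
    """Step 2: fetch one article and strip the markup."""
    return re.sub(r"<[^>]+>", " ", corpus[article_id]).strip()


def extract(text):
    """Step 3: find vocabulary terms mentioned in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(words & set(VOCAB))


def build_index(facts_by_article):
    """Step 4: invert to term -> list of articles mentioning it."""
    index = {}
    for article_id, terms in facts_by_article.items():
        for term in terms:
            index.setdefault(term, []).append(article_id)
    return index


def republish(index):
    """Step 5: emit one tab-separated record per indexed term."""
    return [f"{VOCAB[term]}\t{','.join(ids)}" for term, ids in sorted(index.items())]


facts_by_article = {a: extract(scrape(CORPUS, a)) for a in crawl(CORPUS)}
fact_index = build_index(facts_by_article)
records = republish(fact_index)
print("\n".join(records))
```

Keeping each step as a separate function mirrors the slide's division of labour: any one stage can be replaced by a real tool without touching the others.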

Machine Extraction of scientific facts

[Workflow diagram. Labels: Queues; Repos; Scientific literature; Science plugins; Science volunteers. Key: RSU = Richard Smith-Unna; PMR = Peter Murray-Rust; CL = CottageLabs]

Linked Open Data – the world’s knowledge

very little physical science http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png

[LOD cloud diagram. Labelled regions: DBPedia; BIO; Comp; Lib; PDB; Ontologies; GOV; GOV.uk; Music, Art, Literature; Social; Knowledgebases]

RDF triples

Part of a COD RDF entry

The Semantic Web understands this

Mathematics Markup Language

Energy of c.c.p. lattice of argon

4 pages clipped

Human-friendly

Machine-friendly

Many editors and tools exist. We used MathWeaver

Automatic!

MathML

CML (Chemical Markup Language)

Human-friendly Machine-friendly

Automatic!

Innovation with Componentisation

Individual, manual, unreusable, flaky

Commodity, standard, reliable, re-usable

Non-semantic data

Data extraction difficult and incomplete

Human readers

Current scientific information flow … is broken for data-rich science

PDF

Lineprinter output

Text files

Human input

Semantic network closes the loop

Data mined from document

Data available for e-science and re-use

Computation

Measurement

Semantic Authoring

Community

Analysis

The network grows autonomously

Machine-machine

Machine-human

Human-machine

Human-human

Humans and machines use different languages

How a machine reads a chemical thesis

nodes are compounds; arrows are reactions

Human-machine symbionts can read science!

WP_Lion

WP_Aspergillus_oryzae

WP_Soybean

Facts Marked by “non-scientists” in ContentMine workshops

With Wikipedia everyone can be a scientist

“nuggets” in a scientific paper

quantity

units

Value ranges
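Nuggets of this kind (a quantity, its units, and sometimes a value range) can often be picked out with a pattern match. The sketch below is a minimal illustration with an invented sentence and a deliberately tiny unit list. It is not how AMI or ChemicalTagger work internally, which the slides do not describe.

```python
import re

# Units recognised by this sketch; a real miner needs a far larger vocabulary.
UNITS = ["mg", "g", "mL", "mmol", "°C", "h", "min"]

_NUMBER = r"\d+(?:\.\d+)?"
# Longest units first so that "mmol" is tried before "min" or "mg".
_UNIT = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# number, optional "-upper" for a range, optional space, then a unit.
QUANTITY_RE = re.compile(
    rf"(?<![\w.])({_NUMBER})(?:\s*[-–]\s*({_NUMBER}))?\s*({_UNIT})(?!\w)"
)


def find_quantities(text):
    """Return (low, high, unit) tuples; low == high for single values."""
    found = []
    for low, high, unit in QUANTITY_RE.findall(text):
        low_value = float(low)
        high_value = float(high) if high else low_value
        found.append((low_value, high_value, unit))
    return found


sentence = "The mixture (5.2 g) was stirred at 20-25 °C for 3 h."
quantities = find_quantities(sentence)
for low, high, unit in quantities:
    print(low, high, unit)
```

Each tuple maps naturally onto a triple such as (sample, has_mass, 5.2 g), which is what makes the extracted fact reusable by machines.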

Humans aren’t designed to mine this … chemical

project places

Parsing chemical sentences

A FACT, uncopyrightable, and representable by triples

http://wwmm.ch.cam.ac.uk/chemicaltagger

Typical chemical synthesis

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

We can’t turn a hamburger into a cow

But we can now turn PDFs into Science

[Annotated plot. Recognised parts: UNITS; TICKS; QUANTITY; SCALE; TITLES; DATA!! 2000+ points]

Dumb PDF

CSV

Semantic Spectrum

2nd Derivative

Gaussian Filter

Automatic extraction

Takes < 1 second
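The slide names a Gaussian filter and a second derivative but does not say how they are combined. One common use is peak picking: smooth the extracted points, then look for strong negative minima of the second derivative. The sketch below does that in plain Python on a synthetic two-peak spectrum. The function names, parameters and data are assumptions made for illustration.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_filter(values, sigma=1.0, radius=3):
    """Smooth a 1-D signal; edge samples are repeated at the boundaries."""
    kernel = gaussian_kernel(sigma, radius)
    last = len(values) - 1
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        for offset, weight in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + offset, 0), last)
            acc += weight * values[j]
        smoothed.append(acc)
    return smoothed


def second_derivative(values):
    """Central second difference; the two end points are left at 0."""
    d2 = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        d2[i] = values[i - 1] - 2.0 * values[i] + values[i + 1]
    return d2


def find_peaks(values, sigma=1.0, min_height=0.1):
    """Indices where the smoothed signal has a negative second-derivative minimum."""
    smoothed = gaussian_filter(values, sigma)
    d2 = second_derivative(smoothed)
    floor = min_height * max(smoothed)  # ignore bumps far below the tallest peak
    peaks = []
    for i in range(1, len(d2) - 1):
        is_minimum = d2[i] < d2[i - 1] and d2[i] < d2[i + 1]
        if is_minimum and d2[i] < 0 and smoothed[i] >= floor:
            peaks.append(i)
    return peaks


# Synthetic spectrum: two Gaussian peaks centred at x = 20 and x = 60.
spectrum = [
    1.0 * math.exp(-((x - 20) ** 2) / 18.0) + 0.5 * math.exp(-((x - 60) ** 2) / 18.0)
    for x in range(100)
]
peaks = find_peaks(spectrum)
print(peaks)
```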

Chemical Computer Vision

Raw mobile photo; problems: shadows, contrast, noise, skew, clipping

Binarization (pixels = 0,1)

Irregular edges

Thinning: thick lines to 1-pixel
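The binarization step can be illustrated with a global threshold. The slides do not say which thresholding method is used, so the sketch below swaps in Otsu's method as a well-known choice, and treats dark ink as 1 and light background as 0 (also an assumption). It covers binarization only, not thinning, and the tiny grey-level image is invented.

```python
def otsu_threshold(image):
    """Grey level (0-255) that maximises the between-class variance."""
    histogram = [0] * 256
    for row in image:
        for value in row:
            histogram[value] += 1
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_score = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        # Pixels with value <= level form the "dark" class.
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        light_count = total - dark_count
        if dark_count == 0:
            continue
        if light_count == 0:
            break
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        score = dark_count * light_count * (dark_mean - light_mean) ** 2
        if score > best_score:
            best_threshold, best_score = level, score
    return best_threshold


def binarize(image, threshold):
    """Map dark pixels (<= threshold) to 1 and light pixels to 0."""
    return [[1 if value <= threshold else 0 for value in row] for row in image]


# Invented 3x4 grey-level image: a few dark "ink" pixels on a light page.
image = [
    [200, 210, 205, 198],
    [202,  30,  35, 207],
    [199,  28, 201, 203],
]
threshold = otsu_threshold(image)
binary = binarize(image, threshold)
for row in binary:
    print(row)
```

A single global threshold struggles with the shadows and uneven contrast listed above for mobile photos; locally adaptive thresholds are the usual remedy in that case.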

Chemical Optical Character Recognition

Small alphabet, clean typefaces, clear boundaries make this relatively tractable. Problems are “I” “O” etc.

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.


AMI Demo

http://www.mdpi.com/2218-1989/2/1/39/pdf

https://bitbucket.org/AndyHowlett/ami2-poc

ami2-poc -i example -v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor

May take time to start if not connected to web

Output: ./target/output/reactionsexample/

SVG: ./page1annotated.svg

CML: image.g.1.4.svg.reaction0.cml

Avogadro viewer:

Bacterial WP_phylogenetic tree

Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy

Trees from http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)

WP: Clostridium_butyricum

Genbank ID

American Type Culture Collection

“Adaptive Evolution of HIV at HLA Epitopes Is Associated with Ethnicity in Canada” (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0036933)

((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((n53,n131),n159)))))));

http://en.wikipedia.org/wiki/Digital_image_processing

http://en.wikipedia.org/wiki/Newick_format http://en.wikipedia.org/wiki/Phylogenetics
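A Newick string like the one above is just nested parentheses, so a machine can turn it back into a tree with a short recursive parser. The sketch below is a minimal illustration that handles topology and node names only, as in the example; it does not interpret branch lengths, quoted labels, comments or whitespace. The function names and the five-leaf test string are invented.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested lists of leaf names."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            # Skip an optional internal-node label.
            while text[pos] not in ",();":
                pos += 1
            return children
        # Leaf: read the name up to the next structural character.
        start = pos
        while text[pos] not in ",();":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return tree


def leaf_names(tree):
    """Flatten a parsed tree into its leaf names, left to right."""
    if isinstance(tree, str):
        return [tree]
    names = []
    for child in tree:
        names.extend(leaf_names(child))
    return names


small = "((n1,n2),(n3,(n4,n5)));"
tree = parse_newick(small)
print(tree)
print(leaf_names(tree))
```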

Open notebook science is the practice of making the entire primary record of a research project publicly available online as it is recorded. (WP)

Jean-Claude Bradley was a chemist who actively promoted Open Science in chemistry, … He coined the term Open Notebook Science. … A memorial symposium was held July 14, 2014 at Cambridge University, UK.

Thanks

• Shuttleworth Foundation and Fellowship
• Contentmine.org: Michelle Brook, Jenny Molloy, Ross Mounce, Richard Smith-Unna, CottageLabs, Charles Oppenheim
• Open Knowledge Foundation Community
• Wikimedia Community
• Blue Obelisk Community

My/our Dream

• An Open Bibliography of science, updated daily

• An interface for ContentMine to feed new facts into WikiData

• Domain-specific enthusiasts to create and run fact extraction and validation

• Wikipedia to become a C21 publisher of reference science
