The Content Mine. Peter Murray-Rust[*], University of Cambridge, Open Knowledge, & Shuttleworth Fellow. OKFest, Berlin, 2014-07-15, DE. [*] and Michelle Brook, Jenny Molloy, Ross Mounce, Richard Smith-Unna, Mark MacGillivray, Emanuel Toliv

Csvconf


Page 1: Csvconf

The Content Mine

Peter Murray-Rust[*]

University of Cambridge, Open Knowledge, & Shuttleworth Fellow

OKFest, Berlin, 2014-07-15, DE

[*] and Michelle Brook, Jenny Molloy, Ross Mounce, Richard Smith-Unna, Mark MacGillivray, Emanuel Toliv

Page 2: Csvconf

Liberating facts for humanity*

• Public science: 500,000,000,000 USD per year

• 85% of medical research is wasted (bad design, lost data, non-communication)

• ContentMine will liberate 100,000,000 facts per year from scientific literature

• Crawl, Scrape, Extract, Republish

• Open Data CC 0, Open Standards, Open Source

• COLLABORATIVE, any data-rich discipline

• [*] Closed data means people die

Page 3: Csvconf

We can’t turn a hamburger into a cow

But we can now turn PDFs into Science

Page 4: Csvconf

[Figure: a plot annotated with the parts that can be extracted from it: UNITS, TICKS, QUANTITY, SCALE, TITLES, and the DATA!! (2000+ points)]

Page 5: Csvconf

[Figure: a spectrum taken from a dumb PDF to CSV and then to a semantic spectrum by automatic extraction. Processing steps labelled: smoothing (Gaussian filter), 2nd derivative]
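The smoothing and second-derivative steps named on this slide can be sketched roughly as follows. This is a minimal illustration that assumes the extracted CSV has already been read into a list of evenly spaced intensity values; the function names (`smooth`, `second_derivative`, `find_peaks`) and the default `sigma` are hypothetical and are not taken from the ContentMine tools.

```python
import math

def gaussian_kernel(sigma, radius=None):
    """Return a normalised 1-D Gaussian kernel (radius defaults to about 3 sigma)."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(ys, sigma=2.0):
    """Convolve ys with a Gaussian kernel, clamping indices at both ends."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(ys)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # repeat the edge value
            acc += w * ys[j]
        out.append(acc)
    return out

def second_derivative(ys, dx=1.0):
    """Central-difference second derivative; the two endpoints are left at 0.0."""
    d2 = [0.0] * len(ys)
    for i in range(1, len(ys) - 1):
        d2[i] = (ys[i - 1] - 2 * ys[i] + ys[i + 1]) / (dx * dx)
    return d2

def find_peaks(ys, sigma=2.0):
    """Indices where the smoothed second derivative has a negative local minimum."""
    d2 = second_derivative(smooth(ys, sigma))
    return [i for i in range(1, len(d2) - 1)
            if d2[i] < 0 and d2[i] <= d2[i - 1] and d2[i] < d2[i + 1]]
```

Smoothing first matters because differentiating twice amplifies noise; a negative minimum in the second derivative then marks a peak position even when peaks overlap in the raw signal.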

Page 6: Csvconf

Chemical Computer Vision

1 sec to turn this into semantic science

Page 7: Csvconf

PROPERTIES (Name-Value-Units-Error)

[Figure: a table of properties with each cell annotated as Name (N), Value (V), Units (U) or Error (E)]

Note: CML supports value ranges and errors
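One way to picture a Name-Value-Units-Error record, including the value ranges and errors mentioned in the note, is the small data structure below. It is only a sketch: the class `Property` and its field names are hypothetical and do not reproduce the CML schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Property:
    name: str
    value: float                        # single value, or lower bound of a range
    units: Optional[str] = None
    error: Optional[float] = None       # symmetric uncertainty, same units as value
    value_max: Optional[float] = None   # upper bound, set only for a range

    def is_range(self) -> bool:
        return self.value_max is not None

    def describe(self) -> str:
        """Render the record as 'name: value [units]' in plain text."""
        if self.is_range():
            text = f"{self.value:g}-{self.value_max:g}"
        elif self.error is not None:
            text = f"{self.value:g} +/- {self.error:g}"
        else:
            text = f"{self.value:g}"
        return f"{self.name}: {text}" + (f" {self.units}" if self.units else "")
```

For example, `Property("melting point", 120, "C", value_max=122)` represents a range, while `Property("yield", 85.0, "%", error=2.5)` carries an error estimate.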

Page 8: Csvconf

“nuggets” in a scientific paper

[Figure: a passage from a paper with the nuggets highlighted and labelled: quantity, units, value ranges, chemical, project, places]

Humans aren’t designed to mine this …
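A machine can pick out the simplest of these nuggets (quantities, units and value ranges) with pattern matching. The sketch below is illustrative only: the unit list, the pattern and the function name `find_quantities` are assumptions made for this example, and real text mining needs a far richer grammar.

```python
import re

# A deliberately small, assumed list of unit tokens.
UNITS = r"(?:°C|K|mL|mg|mmol|mol|g|h|min|%)"
NUMBER = r"\d+(?:\.\d+)?"

# A number, an optional "-", "–" or "to" followed by a second number (a range),
# then a unit that is not immediately followed by another letter.
QUANTITY = re.compile(
    rf"(?P<low>{NUMBER})(?:\s*(?:-|–|to)\s*(?P<high>{NUMBER}))?\s*(?P<units>{UNITS})(?![A-Za-z])"
)

def find_quantities(text):
    """Return (low, high, units) tuples; high is None when no range is given."""
    results = []
    for m in QUANTITY.finditer(text):
        low = float(m.group("low"))
        high = float(m.group("high")) if m.group("high") else None
        results.append((low, high, m.group("units")))
    return results
```

Applied to a sentence such as "The mixture was stirred at 70-80 °C for 2 h, giving 1.5 g of product.", this picks up one temperature range and two single values.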

Page 9: Csvconf

Parsing chemical sentences

Page 10: Csvconf

http://wwmm.ch.cam.ac.uk/chemicaltagger

• Typical chemical synthesis

Page 11: Csvconf

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

Page 12: Csvconf

Evolution of ultraviolet vision in the largest avian radiation - the passerines. Anders Ödeen 1*, Olle Håstad 2,3 and Per Alström 4

PDF to HTML via AMI

Styles, superscripts and diåcritics preserved!

Page 13: Csvconf

PDF: Turdus iliacus, Taeniopygia guttata, Serinus canaria, Lanius excubitor, Melopsittacus undulatus, Pavo cristatus, Sturnus vulgaris, Dolichonyx oryzivorus, Ficedula hypoleuca, Vaccinium myrtillus, Falco tinnunculus

Turdus, Pomatostomus, Leothrix, Amytornis, Acanthisitta, Orthonyx x 2, Malurus, Cnemophilus x 4, Philesturnus x 2, Motacilla x 2, Toxorhampus x 2

Page 14: Csvconf

Linked Open Data – the world’s knowledge

very little physical science

http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png

[Figure: LOD cloud diagram as of September 2011, with regions labelled DBPedia, BIO, Comp, Lib, PDB, Ontologies, GOV, GOV.uk, Music, Art, Literature, Social, Knowledgebases, RDF triples]

Page 15: Csvconf

[Figure: a phylogenetic tree processed by AMI, with outputs labelled NexML and HTML]

Family: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae

Genus: Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma

Posterior probability: 0.84, 0.91, 0.93, 0.95

AMI: 23.12, 34.54, 37.21, 38.55

AMI can MEASURE branch lengths!
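The idea of measuring branch lengths from a drawn tree can be illustrated with a simple conversion from pixels to tree units. This is a sketch under stated assumptions, not AMI's actual method: it supposes the horizontal branch segments and a scale bar have already been located in the image, and the function name and arguments are hypothetical.

```python
def branch_lengths(segments, scale_bar_pixels, scale_bar_value):
    """Convert horizontal branch segments from pixels to tree units.

    segments maps a branch label to its (x_start, x_end) pixel coordinates;
    the scale bar is scale_bar_pixels wide and represents scale_bar_value units.
    """
    if scale_bar_pixels <= 0:
        raise ValueError("scale bar must have a positive pixel length")
    units_per_pixel = scale_bar_value / scale_bar_pixels
    return {label: abs(x_end - x_start) * units_per_pixel
            for label, (x_start, x_end) in segments.items()}
```

With a scale bar of 50 pixels standing for 0.1 units, a branch drawn from x = 10 to x = 110 would come out at about 0.2 units.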

Page 16: Csvconf

We can do any data…

Page 17: Csvconf

… pixel analysis …