Overview of Practical Content Mining Peter Murray-Rust JISC, London, 2014-12-01


What is Content Mining

• Mining Text, Tables and Lists, Diagrams, Images

• Born-digital documents

• High-throughput (millions of items/year)

• Formal and Informal Collaboration

• Role of UK

• Hands-on

• Everything is OPEN (OSI, CC-BY, CC0)

The Right to Read is the Right to Mine

http://contentmine.org

ContentMine

• 1-2 year Shuttleworth Funding from 2014-03

• Free to everyone, Open Source, updated daily

• Structured Text, and Image/Diagram Mining

• Workshops for training and training trainers

• Bottom-up community development

– Bioscience (EuropePMC, BBSRC)

– Disease: Ebola

– Astrophysics (Stray Toaster)

– Chemistry (TSB, EBI, PennState - Citeseer)

• We fight for Justice and Freedom

ContentMine People

• Jenny Molloy

• Ross Mounce

• Peter Murray-Rust + volunteers (Bioscience, disease)

• Richard Smith-Unna + 20 quickscrape volunteers

• Steph Unna

• Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones)

• Prof Charles Oppenheim

• Karien Bezuidenhout (Shuttleworth)

• Advisory Board RSN

ContentMine Workshops (1-hour -> full day or more)

2014-May->Nov

• Budapest/Shuttleworth

• Leicester Univ

• Electronic Theses and Dissertations

• Austrian Science Fund AT

• OKFest DE

• Eur. Bioinformatics Institute

• Open Science Rio de Janeiro BR

• Sci DataCon, Delhi IN

• Univ of Chicago US

• OpenCon 2014, Washington DC, US

Upcoming

• JISC

• LIBER

• BL

• Wellcome Trust

• WHO

Ebola Collaborators (Atlanta): Roxanne Further Moore, Jessie Gunter, April Clyburne-Sherin

Regular Expressions (Easier than Crosswords or Sudoku)

To match                           Regex                    Notes
Ebola                              Ebola
Mali (not Malicious)               Mali\W                   (end of word)
Bat or bat                         [Bb]at                   (alternatives)
bat or bats                        bats?                    (optional letter)
Bat or Bats or bat or bats         [Bb]ats?
Sudden onset                       [Ss]udden\s+onset        (space/s)
Panthera leo or Gorilla gorilla    [A-Z][a-z]+\s+[a-z]+     (ranges of letters)
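The patterns above can be tried directly in most languages. The following is a minimal sketch in Python, not part of ContentMine itself; the dictionary keys and the `find_all` helper are names invented here for illustration.

```python
import re

# Patterns copied from the examples above; the keys are illustrative labels.
PATTERNS = {
    "ebola": r"Ebola",
    "mali": r"Mali\W",                    # "Mali" followed by a non-word character
    "bat": r"[Bb]ats?",                   # Bat, bat, Bats or bats
    "sudden_onset": r"[Ss]udden\s+onset",
    "binomial": r"[A-Z][a-z]+\s+[a-z]+",  # e.g. Panthera leo
}

def find_all(name, text):
    """Return every non-overlapping match of the named pattern in text."""
    return re.findall(PATTERNS[name], text)

if __name__ == "__main__":
    print(find_all("mali", "Mali, not Malicious"))
    print(find_all("bat", "Bats and a bat"))
    print(find_all("binomial", "Panthera leo and Gorilla gorilla"))
```

Two caveats follow from the patterns as written: `Mali\W` needs a character after the word, so it does not match "Mali" at the very end of the text, and the binomial pattern matches any capitalised word followed by a lower-case word (for example "The outbreak"), not only species names.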

Ebola regex

<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
  <regex weight="1.0" fields="marburg">(Marburg)</regex>
  <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagic\s+fever)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omiting\s+diarrho?ea)</regex>
  <regex weight="0.5" fields="guinea">(Guinea)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="liberia">(Liberia)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
  <regex weight="0.6" fields="contact_tracing">([Cc]ontact\s+tracing)</regex>
  <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
  <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex>
  <regex weight="0.5" fields="drc">(Democratic Republic\s*(\s*of)?(\s*the)?\s*Congo)(DRC)</regex>
  <regex weight="0.6" fields="safe_burial">([Ss]afe\s+burial\s+practice?s)</regex>
  <regex weight="1.0" fields="etu">([Ee]bola\s+treatment\s+units?)(ETU)</regex>
</compoundRegex>
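A compound regex file of this shape can be loaded and applied with a few lines of code. The sketch below is an illustration in Python, not the AMI implementation: `load_compound` and `score` are hypothetical helpers, only four of the rules are reproduced, and adding each rule's weight once when it matches is an assumption, since the slides do not say how AMI combines weights.

```python
import re
import xml.etree.ElementTree as ET

# A subset of the compound regex shown above (raw string keeps the backslashes).
COMPOUND_XML = r"""<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola">(Ebola)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
</compoundRegex>"""

def load_compound(xml_text):
    """Parse the XML into (field, weight, compiled pattern) tuples."""
    root = ET.fromstring(xml_text)
    return [
        (el.get("fields"), float(el.get("weight")), re.compile(el.text))
        for el in root.findall("regex")
    ]

def score(text, rules):
    """Return (total weight, hits by field).

    Assumption: a rule contributes its weight once if it matches at all.
    """
    total = 0.0
    hits = {}
    for field, weight, pattern in rules:
        found = [m.group(1) for m in pattern.finditer(text)]
        if found:
            hits[field] = found
            total += weight
    return total, hits

if __name__ == "__main__":
    rules = load_compound(COMPOUND_XML)
    sample = "Sudden onset of Ebola in Sierra Leone; fruit bats suspected."
    print(score(sample, rules))
```

A document-level score of this kind could be used to rank or filter texts before closer reading; the threshold would have to be chosen by the user.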


15 mins to create, 15 mins to install and test

Or run online at CottageLabs

Results of Regex on Ebola

<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <source xmlns="http://www.xml-cml.org/ami"
        name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
          lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]">
          <pattern>(Ebola)</pattern>
        </regex>
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
          lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
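Results in this form are easy to post-process. The following Python sketch reads the two results shown above into (line number, field, value) rows. It is an illustration only: `extract_hits` is a hypothetical helper, the `source` element is omitted, and the closing `</results>` and `</resultsList>` tags are added here as an assumption so that the excerpt parses.

```python
import xml.etree.ElementTree as ET

AMI_NS = "http://www.xml-cml.org/ami"

# The excerpt above, with closing tags added (assumption) so it is well-formed.
SAMPLE = r"""<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
          lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]">
          <pattern>(Ebola)</pattern>
        </regex>
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
          lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
  </results>
</resultsList>"""

def extract_hits(xml_text):
    """Return a list of (line number, field, matched value) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    # The outer <regex> is in the AMI namespace; the inner one and <hit> are not.
    for outer in root.iter(f"{{{AMI_NS}}}regex"):
        line_no = int(outer.get("lineNumber"))
        for hit in outer.iter("hit"):
            for field, value in hit.attrib.items():
                rows.append((line_no, field, value))
    return rows

if __name__ == "__main__":
    for row in extract_hits(SAMPLE):
        print(row)
```

Note the mixed namespaces in the output: only the elements that redeclare `xmlns="http://www.xml-cml.org/ami"` need the namespace prefix when searching.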

Demo of Content Mining

ChemicalTagger (Lezan Hawizy) a shallow, domain-specific, semantic parser for un/natural language.

Bacterial WP_phylogenetic tree

Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy

Trees from http://ijs.sgmjournals.org/, used under new UK legislation (Hargreaves)

WP: Clostridium_butyricum

Genbank ID

American Type Culture Collection

RSU: Richard Smith-Unna

PMR: Peter Murray-Rust

CL: CottageLabs

[Diagram labels: Queues, Repos, Scientific literature, Science plugins, Science volunteers]

Collaboration with Open Access Button

AMI (extraction) architecture

[Diagram labels: PDF2SVG; Image analysis; SVG2XML; AMI; plugins: Regex, Species, Phylo, Chem; outputs: tables, sections, captioned diagrams]

Immediate Stakeholders

– Researchers (bio, EBI, chem, materials, astro)

– Funders (WT, FWF (Austria), RCUK)

– Libraries (repositories, theses)

– Service providers (EuropePMC)

– knowledge-based SMEs

– Library organisations (JISC, RLUK, LIBER, SPARC)

– Non-profits (Wikimedia, WHO, Mozilla)

Content production

• Scholarly articles

• Theses

• Repositories

• Grey scientific literature

• Grey politico-socio-legal literature

• Company output (reports, accounts, contracts) (e.g. OpenOil)

STM Publishers Licence

2012_03_15_Sample_Licence_Text_Data_Mining.pdf

(Summary: PMR has NO rights)

• [cannot publish to: ] “libraries, repositories, or archives”

• [cannot] “Make the results of any TDM Output available on an externally facing server or website”

• “Subscriber shall pay a […] fee”

Heather Piwowar: “negotiating with publishers [made me physically ill]”

WE WALKED OUT

• Brit Library

• JISC

• RLUK

• OKFN

• …

• Ross Mounce

• PM-R

Licences destroy Content Mining

Challenges

• Active opposition from content “owners” including serious lobbying and FUD

• Ignorance and apathy from universities; inappropriate reward system

• Sub-optimal technology of publishers

• Lack of common infrastructure, technology, APIs

• And it’s objectively messy anyway

Technical problems

• PDF: lacks words, tables, diagrams

• Non-Unicode character sets (or worse)

• Graphics objects largely destroyed (converted to PNG or worse)

• No communal ontology for document structure.

• HTML carries PublisherJunk and Javascript

Goals of Mining

• Classification of resources

• Entity extraction and indexing

• Aggregation within discipline

• Inter-disciplinary, e.g. biodiversity, phytochemistry

• Repurposing (twitter, ePub, annotation)

• Semantification/intelligent documents

• Detection of error and fraud

What we need

• Inter/national commitment to infrastructure

• Common ontologies and APIs

• Development of community

• Go beyond academia; non-academic reward system