
Page 1

WISSEN · TECHNIK · LEIDENSCHAFT · www.tugraz.at

Science 2.0 VU Processing Science 2.0 Data, Content Mining

WS 2015/16

Elisabeth Lex KTI, TU Graz

Page 2

Agenda

•  Repetition from last time: Open Science
•  Processing academic resources
•  Mining in academic resources (content perspective)
•  Example:
   •  ContentMine: extraction of scientific facts

2

Page 3

Repetition: Open Science

•  Open Science
   •  Ideas, concepts, benefits, and pitfalls
   •  E.g. enhancing collaboration and community building and increasing the efficiency of research vs. no established reward system yet
•  Open Data
   •  Sharing your data influences how often you get cited (Piwowar et al., 2007; Piwowar et al., 2013)
•  Different models for Open Access
   •  Green vs. Gold vs. Hybrid

3

Page 4

Open Science – 5 schools of thought

4

Page 5

Example: Open Government Data: Eurostat

5

“I’d like to compare the unemployment rate in Austria with other European ones”

Via Google Public Data Explorer, https://www.google.com/publicdata/directory

Page 6

Open Access in Science: Open Access Journals

●  Green ("self-archiving"): the author can self-archive at the time of submission, whether the publication is grey literature (usually internal, non-peer-reviewed), a peer-reviewed journal article, a peer-reviewed conference proceedings paper, or a monograph
●  Gold ("author pays"): the author or the author's institution pays a fee to the publisher at publication time; the publisher then makes the publication available "free" at the point of access
●  Further little-used "roads" and hybrid forms exist, for example platinum open access (which does not charge author fees)
●  Green and gold are compatible and can co-exist

Source: Jeffery, K. Open Access: An Introduction, 2006. http://www.ercim.eu/publication/Ercim_News/enw64/jeffery.html

Page 7

Processing Academic Resources

7

Page 8

Motivation

•  Aggregate scientific results
•  Exploratory search in digital collections
•  Find experts in domains
•  Make science discoverable
•  Improve access to scientific publications
•  Extract facts for research
•  Discover relationships
•  Check for errors => improve science

Page 9

How?

•  Aggregate and manage data: repositories, aggregators, datasets, ...
•  Mining in Academic Resources
   •  Information Extraction
   •  Topic Modeling
   •  Clustering/Classification
   •  Linking publications
•  Make data and source code available :-)

9

Page 10

KDD Process

10

Page 11

How?

•  Aggregate and manage data: repositories, aggregators, datasets, ...
•  Mining in Academic Resources
   •  Information Extraction
   •  Topic Modeling
   •  Clustering/Classification
   •  Linking publications
•  Make data and source code available :-)

11

Page 12

Datasets

•  The European Library Open Dataset
   •  Digital collection and 200 million bibliographic records
   •  http://www.theeuropeanlibrary.org/tel4/access/data/opendata
•  Datahub.io
   •  E.g. DBLP Computer Science Bibliography, http://datahub.io/dataset/dblp
   •  Metadata of over 1.8 million publications by 1 million authors

12

Page 13

Repositories and Aggregators

•  ISI Web of Science
•  Scopus
•  PubMed
•  The European Library
•  Library of Congress
•  arXiv
•  Figshare
•  Data Citation Index
•  Mendeley
•  Google Scholar
•  CiteSeerX
•  ...

13

Page 14

APIs to Repositories ...

•  APIs to access scientific publications and research data

•  rOpenSci: arXiv, PLOS ONE, Figshare (a direct arXiv API sketch follows below)
•  Mendeley: Developer API, http://dev.mendeley.com
   •  Python package: pip install mendeley
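Beyond these clients, such repository APIs can also be queried directly. A minimal sketch in Python against the public arXiv API (http://export.arxiv.org/api/query, which returns an Atom feed), using only the standard library; the query string is just an example:

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Build an example query against the public arXiv API.
params = urllib.parse.urlencode({
    "search_query": "all:open science",
    "start": 0,
    "max_results": 5,
})
url = "http://export.arxiv.org/api/query?" + params

with urllib.request.urlopen(url) as response:
    feed = ET.parse(response)   # the response is an Atom XML feed

# One <entry> element per matching article.
ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in feed.getroot().findall("atom:entry", ns):
    print(entry.find("atom:title", ns).text.strip())
    print("   ", entry.find("atom:id", ns).text)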

14

Page 15

Example - rOpenSci

15

Page 16

How?

•  Aggregate and manage data: repositories, aggregators, datasets, ...
•  Mining in Academic Resources
   •  Information Extraction
   •  Topic Modeling
   •  Clustering/Classification
   •  Linking publications
•  Make data and source code available :-)

16

Page 17

Information Extraction

•  IE goal: extract structured information out of unstructured content, e.g.
   •  Method names, quantities, temporal expressions
   •  Authors from scientific publications
   •  Organizations in the acknowledgements section of papers
   •  References
   •  ...

17

Page 18

IE Process

18

http://www.nltk.org/book/ch07.html

Pipeline (cf. the figure in the NLTK book, ch. 7): raw text of a document → sentence segmentation → tokenization → part-of-speech tagging (applying word classes to words within a sentence) → entity detection → relation detection.

Input: raw text of a document. Output: a list of (entity, relation, entity) triples.

Typical entity types: ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, and GPE (geo-political entity).
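A minimal sketch of this pipeline with NLTK; the sample sentence is made up, and the resource names in the download comments are those used by NLTK 3.x and may differ between versions:

import nltk

# One-time downloads (names as used by NLTK 3.x):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

raw = "Elisabeth Lex teaches the Science 2.0 course at TU Graz in Austria."

tokens = nltk.word_tokenize(raw)   # sentence segmentation + tokenization
tagged = nltk.pos_tag(tokens)      # apply word classes (POS tags) to the tokens
tree = nltk.ne_chunk(tagged)       # entity detection

# Print the detected named entities and their types.
for subtree in tree.subtrees():
    if subtree.label() in ("PERSON", "ORGANIZATION", "GPE", "LOCATION"):
        print(subtree.label(), "->", " ".join(word for word, tag in subtree.leaves()))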

Page 19

IE Standard Approaches (1/2)

•  Regular expressions / rule-based approaches
   •  E.g. dates, email addresses, @user, RT@user (see the sketch below)

http://localhost:8888/notebooks/twitterprocessing.ipynb
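The notebook itself is not part of the slides; the following is a small stand-alone sketch of the same idea, with a made-up sample text (handle and address are placeholders):

import re

# Hypothetical sample text standing in for a tweet from the notebook demo.
text = "RT @some_user: Slides from 18/11/15 are online, questions to someone@example.org"

patterns = {
    "date": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "retweet": r"\bRT\s+@\w+",
    "mention": r"(?<!\w)@\w+",
}

# Each rule is a regular expression; findall() returns every match in the text.
for label, pattern in patterns.items():
    print(label, re.findall(pattern, text))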

19

Page 20

IE as Machine Learning Task

•  Supervised: train model with annotated training data, use trained model to classify unknown text

•  Choose a class label for a given input
•  Identify features of language data to classify it
•  Construct language models out of them
•  Learn about text/language from these models
•  Methods (a toy sketch follows below):
   •  Classifiers: Naive Bayes, maximum entropy models
   •  Sequence models: Hidden Markov Models, CRFs
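As a toy illustration of this supervised setup with one of the listed classifiers (NLTK's Naive Bayes); the task, features, and labels are made up:

import nltk

# Hypothetical task: label tokens as METHOD (part of a method name) or O (other),
# using simple word-shape features.
def features(token):
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "suffix": token[-3:].lower(),
    }

train = [
    (features("LDA"), "METHOD"), (features("CRF"), "METHOD"),
    (features("HMM"), "METHOD"), (features("the"), "O"),
    (features("paper"), "O"), (features("results"), "O"),
]

classifier = nltk.NaiveBayesClassifier.train(train)

# Apply the trained model to unseen tokens.
print(classifier.classify(features("SVM")))
print(classifier.classify(features("tables")))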

20

Page 21

Libraries

•  NLTK (http://www.nltk.org)
•  http://localhost:8888/notebooks/science20-ie.ipynb

21

Page 22

Mining academic documents

•  Extraction of structural elements
   •  Tables, figures, ...
•  Extraction of facts from structural elements and the document
   •  Named entity recognition (e.g. gene names, ...)
   •  Relation extraction (e.g. "system A impacts system B")
•  Mostly: PDF format
   •  Good for presentation, but problems with metadata quality; hard to analyse
   •  While PDF analysis tools exist, there is still room for improvement (a plain-text extraction sketch follows below)
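As a minimal sketch of getting plain text out of a PDF in Python, using pdfminer.six's high-level API (one of several existing tools, not necessarily the one used in the lecture; the file name is hypothetical):

# pip install pdfminer.six
from pdfminer.high_level import extract_text

# Layout and structure are largely lost in the flat text, which is exactly why
# the block-based approaches on the following slides are needed.
text = extract_text("some_paper.pdf")   # hypothetical file name
print(text[:500])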

22

Page 23

Approach

•  Divide and conquer
   •  Extract blocks from the PDF based on structure and layout information
   •  Classify the extracted blocks, e.g. into title, body, references, abstract, ...
   •  Classify the content of the extracted blocks, e.g. tables
   •  Extract relevant info from the content (named entities, nouns, dates, quantities, ...)

23

Page 24

Approach

•  Extracting blocks
   •  Features: layout-specific, such as position, font, font size, ...
   •  Apply machine learning approaches
      •  Unsupervised (clustering)
      •  Supervised (classification)

24

Page 25

Unsupervised Approach

•  Clustering: given a set of objects, find groupings of the objects so that the similarity within a group is maximized and the similarity between groups is minimized
•  Cluster = block
•  Successive merge and split mechanism

25

Page 26

Supervised Approach

•  Classification: given a set of labeled examples, create a model and use it to predict the label of unknown examples

•  Classify blocks: maximum entropy models (see the sketch below)
   •  Create training data by labeling blocks, i.e. assigning blocks to classes
   •  Learn a model based on the training data and apply it to classify unknown blocks
   •  Features: layout, formatting, word frequencies, ...
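A minimal sketch of this supervised step, with scikit-learn's logistic regression standing in for a maximum entropy model; the layout features and training blocks are made-up toy values:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled blocks: layout/formatting features -> block class.
train_blocks = [
    ({"font_size": 18, "bold": 1, "y_pos": 0.05, "n_words": 9},   "title"),
    ({"font_size": 10, "bold": 0, "y_pos": 0.20, "n_words": 150}, "abstract"),
    ({"font_size": 10, "bold": 0, "y_pos": 0.60, "n_words": 400}, "body"),
    ({"font_size": 9,  "bold": 0, "y_pos": 0.90, "n_words": 80},  "references"),
]
X = [feats for feats, label in train_blocks]
y = [label for feats, label in train_blocks]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Classify an unknown block.
print(model.predict([{"font_size": 17, "bold": 1, "y_pos": 0.04, "n_words": 11}]))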

26

Page 27

Fact Extraction from Publications

•  Extract entities from within the identified blocks
   •  E.g. the author block: divide it further to extract all authors contained in the block
•  Extract relations between entities
   •  Open Information Extraction
   •  Learns a model without needing training data
   •  Can extract binary relations from sentences (toy sketch below)
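A toy sketch of the triple-extraction idea: chunk noun phrases with NLTK and emit (entity, relation, entity) around a verb. Real Open IE systems such as ReVerb are considerably more sophisticated, and the sentence is made up:

import nltk

sentence = "System A increases the throughput of system B."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Chunk simple noun phrases, then look for NP - verb - NP sequences.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)

items = []
for node in tree:
    if isinstance(node, nltk.Tree):                     # an NP chunk
        items.append(("NP", " ".join(word for word, tag in node.leaves())))
    else:                                               # a single (word, POS) pair
        items.append((node[1], node[0]))

for i, (kind, text) in enumerate(items):
    if kind.startswith("VB"):                           # a verb with an NP on each side
        left = next((t for k, t in reversed(items[:i]) if k == "NP"), None)
        right = next((t for k, t in items[i + 1:] if k == "NP"), None)
        if left and right:
            print((left, text, right))                  # (entity, relation, entity)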

27

Page 28

Example: Measuring quality of Wikipedia

28

Elisabeth Lex, Michael Voelske, Marcelo Errecalde, Edgardo Ferretti, Leticia Cagnina, Christopher Horn, Benno Stein, and Michael Granitzer. 2012. Measuring the quality of web content using factual information. In Proceedings of the WebQuality '12 workshop at WWW 2012.

(a) Unbalanced  (b) Balanced
Figure 1: Histograms of Wikipedia corpora for the unbalanced dataset and the balanced dataset.

... is the word count of t, and t is a Wikipedia article. The same holds for "Factual-density/sentence-count".

The word count measure outperforms the factual density measure normalized to sentence count as well as the word count on the unbalanced corpus. Apparently, word count is a strong feature on the unbalanced corpus.

We then evaluated the factual density measure on the balanced corpus, where both featured/good and non-featured articles are more similar with respect to document length. The results for this experiment are shown in Figure 2(b) as precision-recall curves. On the balanced corpus, factual density normalized to sentence count as well as word count performs much better than on the unbalanced corpus, while word count, as expected, performs worse. There is not much difference between the normalization to word or sentence count, since here the number of words per document has a smaller influence on the result.

We also analyzed the distributions of featured/good and non-featured articles if factual density is used as measure, as depicted in Figure 3. We found that the distribution of the featured/good articles is clearly separated from the distribution of the non-featured articles, with peaks at two different factual density values (0.06 and 0.03, respectively). This finding is in contrast to the fact that the distributions of featured/good articles and non-featured articles have a high degree of overlap if word count is used, as shown in Figure 1(b). Consequently, on the balanced corpus, factual density clearly outperforms our baseline word count.

In a related experiment, we investigated the relational information contained in the binary relationships ReVerb extracts from sentences. We used the relations, i.e. only the predicates from the extracted triples, as a vocabulary to represent the documents. We then tested the discriminative power of these features by training a classifier to solve the binary classification problem of distinguishing featured/good from non-featured articles. The results reported in Table 1 were obtained using the WEKA implementation of a Naive Bayes classifier in combination with feature selection based on Information Gain (IG). From 40,000 relations, we selected the 10% best features in terms of IG. We achieved similar results for both corpora.

Figure 3: Distribution of articles by factual density.

Table 1: Classification results using relational features on both corpora.

Measure      Unbalanced [%]   Balanced [%]
Accuracy     84.01            87.14
F-Measure    84               86.7
Precision    84               89.2
Recall       84               87.1

Apparently, relational features are more robust when the document length varies. However, we need to investigate this in more detail.

Footnote: WEKA, http://www.cs.waikato.ac.nz/~ml/weka/

Page 29

Extract Topics from Publications

•  Topic Models: algorithms that uncover thematic structure in document collections

•  Facilitate searching, browsing, summarizing
•  Latent Dirichlet Allocation (LDA)
   •  Hierarchical probabilistic model

18/11/15 29

Page 30

LDA

•  Probabilistic model that helps find latent topics for documents

•  Probabilistic model: treat data as observations that stem from a generative probabilistic process involving hidden variables
   •  Documents: the thematic structure corresponds to the hidden variables
   •  Each topic is described by words in the documents

18/11/15 30

Page 31

LDA

•  Infer hidden structure using posterior inference

   •  "What are the topics that describe the documents?"
•  Classify unknown data using the topic model
   •  "How does unknown data fit into the estimated topic structure?"
•  The number of topics Z has to be chosen in advance
   •  Defines the level of specificity of the topics

18/11/15 31

The probability of the i-th word in document d decomposes over topics:

P(w_i \mid d) = \sum_{j=1}^{Z} P(w_i \mid z_i = j) \, P(z_i = j \mid d)

where P(w_i | d) is the probability of the i-th word for document d, P(w_i | z_i = j) is the probability of term t_i within topic z_i = j, and P(z_i = j | d) is the probability of using a word from topic z_i = j in the document.
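A minimal sketch of fitting such a model in Python, assuming gensim's LdaModel API; the "documents" are toy token lists:

from gensim import corpora, models

# Toy corpus: each document is already tokenized (normally it would also be
# stop-word-filtered and stemmed).
docs = [
    ["gene", "expression", "cell", "protein"],
    ["protein", "folding", "cell", "membrane"],
    ["topic", "model", "document", "corpus"],
    ["latent", "topic", "document", "inference"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Z, the number of topics, has to be chosen in advance.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)

# Posterior inference for an unseen document: how does it fit the estimated topics?
unseen = dictionary.doc2bow(["topic", "inference", "document"])
print(lda[unseen])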

Page 32

Example: Modeling the evolution of topics over time in the journal Science

18/11/15 32

•  Dataset: pages of Science from 1880-2002, from the JSTOR archive

https://www.cs.princeton.edu/~blei/topicmodeling.html

Page 33

Validation of extracted information

33

•  Crowdsourcing as a way to evaluate mining quality
   •  Share the extracted information via e.g. a Web-based platform
   •  Enable users to give feedback
      •  Accept, reject, suggest new concepts/facts

Page 34

HowTo: Text Mining using rOpenSci

•  Library that facilitates text mining on publications
   •  Search for articles
   •  Fetch articles
   •  Get links for full-text articles (XML, PDF)
   •  Extract text from articles / convert formats
   •  Collect the bits of articles that you actually need
   •  Download supplementary materials from papers

34
https://ropensci.org/tutorials/fulltext_tutorial.html

Chamberlain Scott (2015). fulltext: Full Text of Scholarly Articles Across Many Data Sources. R package version 0.1.0. https://github.com/ropensci/fulltext

Page 35

Example: Text Mining using rOpenSci

# include the library
library("fulltext")

# ft_search() - get metadata on a search query
> (res1 <- ft_search(query = 'open science', from = 'arxiv'))
> (out <- ft_get(res1))
> res1$arxiv

# ft_get() - get full or partial text of articles
> res <- ft_get('cs/9301113v1', from = 'arxiv')

# extract the fulltext
> res2 <- ft_extract(res)
> res2$arxiv$data

# extract interesting parts from the fulltext
> out %>% chunks("doi")

35

Page 36

Example: Text Mining using rOpenSci

•  fulltext can extract parts of a paper via chunks():
   •  "all", "front", "body", "back", "title", "doi", "categories", "authors", "keywords", "abstract", "executive_summary", "refs", "refs_dois", "publisher", "journal_meta", "article_meta", "acknowledgments", "permissions", "history"
•  Can do PDF extraction
   •  E.g. via GhostScript: (res_gs <- ft_extract(pdf, "gs"))
•  ...

36

https://ropensci.org/tutorials/fulltext_tutorial.html

Page 37

How?

•  Aggregate and manage data: repositories, aggregators, datasets, ...
•  Mining in Academic Resources
   •  Information Extraction
   •  Topic Modeling
   •  Clustering/Classification
   •  Linking publications
•  Make data and source code available :-)

37

Page 38

Clustering of Academic Resources

•  Detect groupings of papers based on content similarity

   •  E.g. along topics
•  Transform content (e.g. the abstract of a paper) into a machine-readable representation
   •  Bag-of-words approach: a document is treated as a bag of words/terms and represented as a vector
   •  Document-term matrix: term frequencies across all documents

38

Page 39

Vector Space Model

•  Documents are vectors in Term-Document Space

•  The elements of the vector are weights w_ij corresponding to document i and term j
•  Weights: frequencies of terms in the documents, or TF-IDF
•  Proximity (similarity) of documents is calculated as the cosine of the angle between the document vectors (sketch below)
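A minimal sketch with scikit-learn; the toy abstracts stand in for real paper content:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "Topic models uncover thematic structure in document collections.",
    "Latent Dirichlet Allocation is a hierarchical probabilistic topic model.",
    "We cluster academic papers by the similarity of their abstracts.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)   # document-term matrix with TF-IDF weights w_ij

print(X.shape)                 # (number of documents, number of terms)
print(cosine_similarity(X))    # pairwise cosine similarities between document vectors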

39

Page 40

Example: Facilitate exploratory search

•  By topic of interest (cluster = topic of interest)
•  Setting: social bookmarking dataset, URLs described by tags
•  Research questions:
   •  What clusters (aka groups of interest) exist?
   •  Are they somehow related?
   •  How do they evolve over time?

Page 41

Clustering Algorithms

•  See the KDD lectures!
•  Here, briefly: the K-means algorithm (scikit-learn sketch below)

1.  Select k points as initial centroids
2.  Repeat:
3.      Form k clusters by assigning each point to its closest centroid
4.      Recompute the centroid of each cluster
5.  Until the centroids don't change
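The same algorithm in scikit-learn, applied to TF-IDF vectors of toy abstracts; k = 2 is an arbitrary choice here and, as noted above, has to be set in advance:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "topic models for large document collections",
    "latent dirichlet allocation and probabilistic topic inference",
    "citation analysis of scientific publications",
    "bibliometric indicators based on citation counts",
]

X = TfidfVectorizer().fit_transform(abstracts)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)          # cluster assignment for each abstract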

18/11/15 41

Page 42

Example

Page 43

Page 44

Classification of Scientific Publications

•  Categorize into an established subject-based taxonomy (sketch below), e.g.
   •  Library of Congress
   •  UNESCO thesaurus
   •  DOAJ subject classification
   •  Library of Congress Subject Headings
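A minimal sketch of such a subject classification, with a made-up two-class "taxonomy" and toy abstracts:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

abstracts = [
    "gene expression patterns in cancer cells",
    "protein folding and molecular dynamics simulations",
    "deep neural networks for image recognition",
    "information retrieval and ranking models for search",
]
subjects = ["Biology", "Biology", "Computer Science", "Computer Science"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(abstracts, subjects)

# Predict the subject of an unseen abstract.
print(model.predict(["support vector machines for text classification"]))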

44

Page 45

How?

•  Aggregate and manage data: repositories, aggregators, datasets, ...
•  Mining in Academic Resources
   •  Information Extraction
   •  Topic Modeling
   •  Clustering/Classification
   •  Linking publications
•  Make data and source code available :-)

45

Page 46

Linking Scientific Publications

•  Citations (explicitly defined)
•  Similarity
   •  Statistical similarity: cosine
   •  Semantic similarity: more complex, e.g. via topics
•  Usage
   •  Argument support
   •  Contradiction
   •  ...

46

Page 47

Linking via Citations

47

Page 48

How?

•  Aggregate and manage data: repositories, aggregators, datasets, ...
•  Mining in Academic Resources
   •  Information Extraction
   •  Clustering/Classification
   •  Linking publications
   •  Search
•  Make data and source code available :-)

48

Page 49

Sharing code

•  GitHub
•  Bitbucket
•  IPython Notebooks
•  ...

49

Page 50

Example: ContentMine

50
http://contentmine.org

Idea:
•  Facts cannot be copyrighted
•  Billions of facts are contained in copyright-protected research articles
→ Make them publicly accessible!

Page 51

Possible questions for ContentMine

•  Find references to papers by a given author. This is metadata and therefore factual. It is usually trivial to extract references and authors; it is, of course, more difficult to disambiguate them.
•  Find who sponsors research. Extract acknowledgements and perform named entity recognition to detect companies. Link the companies to the papers in whose acknowledgements they are listed.

51

Page 52

Machine Extraction of scientific facts

1.  Crawl scientific literature
2.  Scrape each scientific article
3.  Extract facts
4.  Index
5.  Republish (WikiData)

https://github.com/ContentMine

Page 53

Example: retrieve metadata for specific article

18/11/15 53
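The slide shows a screenshot; as a stand-in, here is one common way to retrieve metadata for a specific article in Python, via the public Crossref REST API (https://api.crossref.org). The DOI is a placeholder and should be replaced by the DOI of the article of interest:

import json
import urllib.parse
import urllib.request

doi = "10.1234/example.doi"   # placeholder DOI; replace with a real one
url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)

with urllib.request.urlopen(url) as response:
    record = json.load(response)["message"]

print(record["title"][0])                                              # article title
print(record.get("container-title", [""])[0])                          # journal / proceedings
print([author.get("family") for author in record.get("author", [])])   # author surnames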

Page 54

Content Mining Problems

•  Secondary publishers create walled gardens
   •  E.g. the ResearchGate portal
•  Publishers' contracts ban content mining
•  Publishers may cut off universities who mine
•  Publishers lobby governments to require "licences for content mining"
•  UK → "the right to read is the right to mine"

http://blogs.ch.cam.ac.uk/pmr/2013/10/02/text-and-data-mining-fighting-for-our-digital-future-peter-murray-rust-is-the-problem/

Page 55

Summary

•  Aggregators/repositories for scientific publications
•  Mining content/data in publications
   •  Information / fact extraction
   •  Topic modeling
   •  Clustering
      •  E.g. exploratory analysis of large datasets: find groups of interest expressed by user-generated tags, and their relations
•  ContentMine as an example

55

Page 56

Questions?

See you next week!

56