
Page 1

WISSEN · TECHNIK · LEIDENSCHAFT · www.tugraz.at

Science 2.0 VU Processing Science 2.0 Data, Content Mining

WS 2015/16

Elisabeth Lex KTI, TU Graz

Page 2

Agenda

•  Repetition from last time: Open Science
•  Processing academic resources
•  Mining in academic resources (content perspective)
•  Example:
   •  ContentMine: extraction of scientific facts

2

Page 3

Repetition: Open Science

•  Open Science
   •  Ideas, concepts, benefits, and pitfalls
   •  E.g. enhancing collaboration and community building and increasing the efficiency of research vs. no established reward system yet
•  Open Data
   •  Sharing your data influences how often you get cited (Piwowar et al., 2007; Piwowar et al., 2013)
•  Different models for Open Access
   •  Green vs. Gold vs. Hybrid

3

Page 4

Open Science – 5 schools of thought

4

Page 5

Example: Open Government Data: Eurostat

5

“I’d like to compare the unemployment rate in Austria with other European ones”

Via Google Public Data Explorer, https://www.google.com/publicdata/directory

Page 6

Open Access in Science: Open Access Journals

●  Green ("self-archiving"): the author can self-archive at the time of submission, whether the publication is grey literature (usually internal, non-peer-reviewed), a peer-reviewed journal article, a peer-reviewed conference proceedings paper, or a monograph
●  Gold ("author pays"): the author or the author's institution pays a fee to the publisher at publication time; the publisher then makes the publication available "free" at the point of access
●  Further little-used "roads" and hybrid forms exist, for example platinum open access (which does not charge author fees)
●  Green and gold are compatible and can co-exist

Source: Jeffery, K. Open Access: An Introduction, 2006. http://www.ercim.eu/publication/Ercim_News/enw64/jeffery.html

Page 7

Processing Academic Resources

7

Page 8

Motivation

•  Aggregate scientific results
•  Exploratory search in digital collections
•  Find experts in domains
•  Make science discoverable
•  Improve access to scientific publications
•  Extract facts for research
•  Discover relationships
•  Check for errors => improve science

Page 9

How?

•  Aggregate and manage data: repositories, aggregators, datasets, ...
•  Mining in Academic Resources
   •  Information Extraction
   •  Topic Modeling
   •  Clustering/Classification
   •  Linking publications
•  Make data and source code available :-)

9

Page 10

KDD Process

10

Page 11

How?

•  Aggregate and manage data: repositories, aggregators, datasets, ...
•  Mining in Academic Resources
   •  Information Extraction
   •  Topic Modeling
   •  Clustering/Classification
   •  Linking publications
•  Make data and source code available :-)

11

Page 12

Datasets

•  The European Library Open Dataset
   •  Digital collection and 200 million bibliographic records
   •  http://www.theeuropeanlibrary.org/tel4/access/data/opendata
•  Datahub.io
   •  E.g. DBLP Computer Science Bibliography, http://datahub.io/dataset/dblp
   •  Metadata of over 1.8 million publications by 1 million authors

12

Page 13

Repositories and Aggregators

•  ISI Web of Science
•  Scopus
•  PubMed
•  The European Library
•  Library of Congress
•  arXiv
•  Figshare
•  Data Citation Index
•  Mendeley
•  Google Scholar
•  CiteSeerX
•  ...

13

Page 14

APIs to Repositories ...

•  APIs to access scientific publications and research data

•  rOpenSci: arXiv, PLOS ONE, Figshare (a direct arXiv API sketch follows below)
•  Mendeley: Developer API, http://dev.mendeley.com
   •  Python package: pip install mendeley
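Beyond these clients, such repository APIs can also be queried directly. A minimal sketch in Python against the public arXiv API (http://export.arxiv.org/api/query, which returns an Atom feed), using only the standard library; the query string is just an example:

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Build an example query against the public arXiv API.
params = urllib.parse.urlencode({
    "search_query": "all:open science",
    "start": 0,
    "max_results": 5,
})
url = "http://export.arxiv.org/api/query?" + params

with urllib.request.urlopen(url) as response:
    feed = ET.parse(response)   # the response is an Atom XML feed

# One <entry> element per matching article.
ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in feed.getroot().findall("atom:entry", ns):
    print(entry.find("atom:title", ns).text.strip())
    print("   ", entry.find("atom:id", ns).text)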

14

Page 15

Example - rOpenSci

15

Page 16

How?

•  Aggregate and manage data: repositories, aggregators, datasets, ...
•  Mining in Academic Resources
   •  Information Extraction
   •  Topic Modeling
   •  Clustering/Classification
   •  Linking publications
•  Make data and source code available :-)

16

Page 17

Information Extraction

•  IE goal: extract structured information out of unstructured content, e.g.
   •  Method names, quantities, temporal expressions
   •  Authors from scientific publications
   •  Organizations in the acknowledgements section of papers
   •  References
   •  ...

17

Page 18

IE Process

18

http://www.nltk.org/book/ch07.html

Pipeline (cf. the figure in the NLTK book, ch. 7): raw text of a document → sentence segmentation → tokenization → part-of-speech tagging (applying word classes to words within a sentence) → entity detection → relation detection.

Input: raw text of a document. Output: a list of (entity, relation, entity) triples.

Typical entity types: ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, and GPE (geo-political entity).
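A minimal sketch of this pipeline with NLTK; the sample sentence is made up, and the resource names in the download comments are those used by NLTK 3.x and may differ between versions:

import nltk

# One-time downloads (names as used by NLTK 3.x):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

raw = "Elisabeth Lex teaches the Science 2.0 course at TU Graz in Austria."

tokens = nltk.word_tokenize(raw)   # sentence segmentation + tokenization
tagged = nltk.pos_tag(tokens)      # apply word classes (POS tags) to the tokens
tree = nltk.ne_chunk(tagged)       # entity detection

# Print the detected named entities and their types.
for subtree in tree.subtrees():
    if subtree.label() in ("PERSON", "ORGANIZATION", "GPE", "LOCATION"):
        print(subtree.label(), "->", " ".join(word for word, tag in subtree.leaves()))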

Page 19

IE Standard Approaches (1/2)

•  Regular expressions / rule-based approaches
   •  E.g. dates, email addresses, @user, RT@user (see the sketch below)

http://localhost:8888/notebooks/twitterprocessing.ipynb
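The notebook itself is not part of the slides; the following is a small stand-alone sketch of the same idea, with a made-up sample text (handle and address are placeholders):

import re

# Hypothetical sample text standing in for a tweet from the notebook demo.
text = "RT @some_user: Slides from 18/11/15 are online, questions to someone@example.org"

patterns = {
    "date": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "retweet": r"\bRT\s+@\w+",
    "mention": r"(?<!\w)@\w+",
}

# Each rule is a regular expression; findall() returns every match in the text.
for label, pattern in patterns.items():
    print(label, re.findall(pattern, text))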

19

Page 20

IE as Machine Learning Task

•  Supervised: train model with annotated training data, use trained model to classify unknown text

•  Choose a class label for a given input
•  Identify features of language data to classify it
•  Construct language models out of them
•  Learn about text/language from these models
•  Methods (a toy sketch follows below):
   •  Classifiers: Naive Bayes, maximum entropy models
   •  Sequence models: Hidden Markov Models, CRFs
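As a toy illustration of this supervised setup with one of the listed classifiers (NLTK's Naive Bayes); the task, features, and labels are made up:

import nltk

# Hypothetical task: label tokens as METHOD (part of a method name) or O (other),
# using simple word-shape features.
def features(token):
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "suffix": token[-3:].lower(),
    }

train = [
    (features("LDA"), "METHOD"), (features("CRF"), "METHOD"),
    (features("HMM"), "METHOD"), (features("the"), "O"),
    (features("paper"), "O"), (features("results"), "O"),
]

classifier = nltk.NaiveBayesClassifier.train(train)

# Apply the trained model to unseen tokens.
print(classifier.classify(features("SVM")))
print(classifier.classify(features("tables")))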

20

Page 21

Libraries

•  NLTK (http://www.nltk.org)
•  http://localhost:8888/notebooks/science20-ie.ipynb

21

Page 22

Mining academic documents

•  Extraction of structural elements
   •  Tables, figures, ...
•  Extraction of facts from structural elements and the document
   •  Named entity recognition (e.g. gene names, ...)
   •  Relation extraction (e.g. "system A impacts system B")
•  Mostly: PDF format
   •  Good for presentation, but problems with metadata quality; hard to analyse
   •  While PDF analysis tools exist, there is still room for improvement (a plain-text extraction sketch follows below)
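As a minimal sketch of getting plain text out of a PDF in Python, using pdfminer.six's high-level API (one of several existing tools, not necessarily the one used in the lecture; the file name is hypothetical):

# pip install pdfminer.six
from pdfminer.high_level import extract_text

# Layout and structure are largely lost in the flat text, which is exactly why
# the block-based approaches on the following slides are needed.
text = extract_text("some_paper.pdf")   # hypothetical file name
print(text[:500])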

22

Page 23

Approach

•  Divide and conquer
   •  Extract blocks from the PDF based on structure and layout information
   •  Classify the extracted blocks, e.g. into title, body, references, abstract, ...
   •  Classify the content of the extracted blocks, e.g. tables
   •  Extract relevant info from the content (named entities, nouns, dates, quantities, ...)

23

Page 24

Approach

•  Extracting blocks
   •  Features: layout-specific, such as position, font, font size, ...
   •  Apply machine learning approaches
      •  Unsupervised (clustering)
      •  Supervised (classification)

24

Page 25

Unsupervised Approach

•  Clustering: given a set of objects, find groupings of the objects so that the similarity within a group is maximized and the similarity between groups is minimized
•  Cluster = block
•  Successive merge and split mechanism

25

Page 26

Supervised Approach

•  Classification: given a set of labeled examples, create a model and use it to predict the label of unknown examples

•  Classify blocks: maximum entropy models (see the sketch below)
   •  Create training data by labeling blocks, i.e. assigning blocks to classes
   •  Learn a model based on the training data and apply it to classify unknown blocks
   •  Features: layout, formatting, word frequencies, ...
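A minimal sketch of this supervised step, with scikit-learn's logistic regression standing in for a maximum entropy model; the layout features and training blocks are made-up toy values:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled blocks: layout/formatting features -> block class.
train_blocks = [
    ({"font_size": 18, "bold": 1, "y_pos": 0.05, "n_words": 9},   "title"),
    ({"font_size": 10, "bold": 0, "y_pos": 0.20, "n_words": 150}, "abstract"),
    ({"font_size": 10, "bold": 0, "y_pos": 0.60, "n_words": 400}, "body"),
    ({"font_size": 9,  "bold": 0, "y_pos": 0.90, "n_words": 80},  "references"),
]
X = [feats for feats, label in train_blocks]
y = [label for feats, label in train_blocks]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Classify an unknown block.
print(model.predict([{"font_size": 17, "bold": 1, "y_pos": 0.04, "n_words": 11}]))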

26

Page 27

Fact Extraction from Publications

•  Extract entities from within the identified blocks
   •  E.g. the author block: divide it further to extract all authors contained in the block
•  Extract relations between entities
   •  Open Information Extraction
   •  Learns a model without needing training data
   •  Can extract binary relations from sentences (toy sketch below)
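A toy sketch of the triple-extraction idea: chunk noun phrases with NLTK and emit (entity, relation, entity) around a verb. Real Open IE systems such as ReVerb are considerably more sophisticated, and the sentence is made up:

import nltk

sentence = "System A increases the throughput of system B."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Chunk simple noun phrases, then look for NP - verb - NP sequences.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)

items = []
for node in tree:
    if isinstance(node, nltk.Tree):                     # an NP chunk
        items.append(("NP", " ".join(word for word, tag in node.leaves())))
    else:                                               # a single (word, POS) pair
        items.append((node[1], node[0]))

for i, (kind, text) in enumerate(items):
    if kind.startswith("VB"):                           # a verb with an NP on each side
        left = next((t for k, t in reversed(items[:i]) if k == "NP"), None)
        right = next((t for k, t in items[i + 1:] if k == "NP"), None)
        if left and right:
            print((left, text, right))                  # (entity, relation, entity)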

27

Page 28

Example: Measuring quality of Wikipedia

28

Elisabeth Lex, Michael Voelske, Marcelo Errecalde, Edgardo Ferretti, Leticia Cagnina, Christopher Horn, Benno Stein, and Michael Granitzer. 2012. Measuring the quality of web content using factual information. In Proceedings of the WebQuality '12 workshop at WWW 2012.

(a) Unbalanced  (b) Balanced
Figure 1: Histograms of Wikipedia corpora for the unbalanced dataset and the balanced dataset.

... is the word count of t, and t is a Wikipedia article. The same holds for "Factual-density/sentence-count".

The word count measure outperforms the factual density measure normalized to sentence count as well as the word count on the unbalanced corpus. Apparently, word count is a strong feature on the unbalanced corpus.

We then evaluated the factual density measure on the balanced corpus, where both featured/good and non-featured articles are more similar with respect to document length. The results for this experiment are shown in Figure 2(b) as precision-recall curves. On the balanced corpus, factual density normalized to sentence count as well as word count performs much better than on the unbalanced corpus, while word count, as expected, performs worse. There is not much difference between the normalization to word or sentence count, since here the number of words per document has a smaller influence on the result.

We also analyzed the distributions of featured/good and non-featured articles if factual density is used as measure, as depicted in Figure 3. We found that the distribution of the featured/good articles is clearly separated from the distribution of the non-featured articles, with peaks at two different factual density values (0.06 and 0.03, respectively). This finding is in contrast to the fact that the distributions of featured/good articles and non-featured articles have a high degree of overlap if word count is used, as shown in Figure 1(b). Consequently, on the balanced corpus, factual density clearly outperforms our baseline word count.

In a related experiment, we investigated the relational information contained in the binary relationships ReVerb extracts from sentences. We used the relations, i.e. only the predicates from the extracted triples, as a vocabulary to represent the documents. We then tested the discriminative power of these features by training a classifier to solve the binary classification problem of distinguishing featured/good from non-featured articles. The results reported in Table 1 were obtained using the WEKA implementation of a Naive Bayes classifier in combination with feature selection based on Information Gain (IG). From 40,000 relations, we selected the 10% best features in terms of IG. We achieved similar results for both corpora.

Figure 3: Distribution of articles by factual density.

Table 1: Classification results using relational features on both corpora.

Measure      Unbalanced [%]   Balanced [%]
Accuracy     84.01            87.14
F-Measure    84               86.7
Precision    84               89.2
Recall       84               87.1

Apparently, relational features are more robust when the document length varies. However, we need to investigate this in more detail.

Footnote: WEKA, http://www.cs.waikato.ac.nz/~ml/weka/

Page 29

Extract Topics from Publications

•  Topic Models: algorithms that uncover thematic structure in document collections

•  Facilitate searching, browsing, summarizing
•  Latent Dirichlet Allocation (LDA)
   •  Hierarchical probabilistic model

18/11/15 29

Page 30

LDA

•  Probabilistic model that helps find latent topics for documents

•  Probabilistic model: treat data as observations that stem from a generative probabilistic process involving hidden variables
   •  Documents: the thematic structure corresponds to the hidden variables
   •  Each topic is described by words in the documents

18/11/15 30

Page 31

LDA

•  Infer hidden structure using posterior inference

   •  "What are the topics that describe the documents?"
•  Classify unknown data using the topic model
   •  "How does unknown data fit into the estimated topic structure?"
•  The number of topics Z has to be chosen in advance
   •  Defines the level of specificity of the topics

18/11/15 31

The probability of the i-th word in document d decomposes over topics:

P(w_i \mid d) = \sum_{j=1}^{Z} P(w_i \mid z_i = j) \, P(z_i = j \mid d)

where P(w_i | d) is the probability of the i-th word for document d, P(w_i | z_i = j) is the probability of term t_i within topic z_i = j, and P(z_i = j | d) is the probability of using a word from topic z_i = j in the document.
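A minimal sketch of fitting such a model in Python, assuming gensim's LdaModel API; the "documents" are toy token lists:

from gensim import corpora, models

# Toy corpus: each document is already tokenized (normally it would also be
# stop-word-filtered and stemmed).
docs = [
    ["gene", "expression", "cell", "protein"],
    ["protein", "folding", "cell", "membrane"],
    ["topic", "model", "document", "corpus"],
    ["latent", "topic", "document", "inference"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Z, the number of topics, has to be chosen in advance.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)

# Posterior inference for an unseen document: how does it fit the estimated topics?
unseen = dictionary.doc2bow(["topic", "inference", "document"])
print(lda[unseen])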

Page 32

Example: Modeling the evolution of topics over time in the journal Science

18/11/15 32

•  Dataset: pages of Science from 1880-2002, from the JSTOR archive

https://www.cs.princeton.edu/~blei/topicmodeling.html

Page 33

Validation of extracted information

33

•  Crowdsourcing as a way to evaluate mining quality
   •  Share the extracted information via e.g. a Web-based platform
   •  Enable users to give feedback
      •  Accept, reject, suggest new concepts/facts

Page 34

HowTo: Text Mining using rOpenSci

•  Library that facilitates text mining on publications
   •  Search for articles
   •  Fetch articles
   •  Get links for full-text articles (XML, PDF)
   •  Extract text from articles / convert formats
   •  Collect the bits of articles that you actually need
   •  Download supplementary materials from papers

34
https://ropensci.org/tutorials/fulltext_tutorial.html

Chamberlain Scott (2015). fulltext: Full Text of Scholarly Articles Across Many Data Sources. R package version 0.1.0. https://github.com/ropensci/fulltext

Page 35

Example: Text Mining using rOpenSci

# include the library
library("fulltext")

# ft_search() - get metadata on a search query
> (res1 <- ft_search(query = 'open science', from = 'arxiv'))
> (out <- ft_get(res1))
> res1$arxiv

# ft_get() - get full or partial text of articles
> res <- ft_get('cs/9301113v1', from = 'arxiv')

# extract the fulltext
> res2 <- ft_extract(res)
> res2$arxiv$data

# extract interesting parts from the fulltext
> out %>% chunks("doi")

35

Page 36

Example: Text Mining using rOpenSci

•  fulltext can extract parts of a paper via chunks():
   •  "all", "front", "body", "back", "title", "doi", "categories", "authors", "keywords", "abstract", "executive_summary", "refs", "refs_dois", "publisher", "journal_meta", "article_meta", "acknowledgments", "permissions", "history"
•  Can do PDF extraction
   •  E.g. via GhostScript: (res_gs <- ft_extract(pdf, "gs"))
•  ...

36

https://ropensci.org/tutorials/fulltext_tutorial.html

Page 37

How?

•  Aggregate and manage data: repositories, aggregators, datasets, ...
•  Mining in Academic Resources
   •  Information Extraction
   •  Topic Modeling
   •  Clustering/Classification
   •  Linking publications
•  Make data and source code available :-)

37

Page 38

Clustering of Academic Resources

•  Detect groupings of papers based on content similarity

   •  E.g. along topics
•  Transform content (e.g. the abstract of a paper) into a machine-readable representation
   •  Bag-of-words approach: a document is treated as a bag of words/terms and represented as a vector
   •  Document-term matrix: term frequencies across all documents

38

Page 39

Vector Space Model

•  Documents are vectors in Term-Document Space

•  The elements of the vector are weights w_ij corresponding to document i and term j
•  Weights: frequencies of terms in the documents, or TF-IDF
•  Proximity (similarity) of documents is calculated as the cosine of the angle between the document vectors (sketch below)
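A minimal sketch with scikit-learn; the toy abstracts stand in for real paper content:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "Topic models uncover thematic structure in document collections.",
    "Latent Dirichlet Allocation is a hierarchical probabilistic topic model.",
    "We cluster academic papers by the similarity of their abstracts.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)   # document-term matrix with TF-IDF weights w_ij

print(X.shape)                 # (number of documents, number of terms)
print(cosine_similarity(X))    # pairwise cosine similarities between document vectors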

39

Page 40

Example: Facilitate exploratory search

•  By topic of interest (cluster = topic of interest)
•  Setting: social bookmarking dataset, URLs described by tags
•  Research questions:
   •  What clusters (aka groups of interest) exist?
   •  Are they somehow related?
   •  How do they evolve over time?

Page 41

Clustering Algorithms

•  See the KDD lectures!
•  Here, briefly: the K-means algorithm (scikit-learn sketch below)

1.  Select k points as initial centroids
2.  Repeat:
3.      Form k clusters by assigning each point to its closest centroid
4.      Recompute the centroid of each cluster
5.  Until the centroids don't change
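The same algorithm in scikit-learn, applied to TF-IDF vectors of toy abstracts; k = 2 is an arbitrary choice here and, as noted above, has to be set in advance:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "topic models for large document collections",
    "latent dirichlet allocation and probabilistic topic inference",
    "citation analysis of scientific publications",
    "bibliometric indicators based on citation counts",
]

X = TfidfVectorizer().fit_transform(abstracts)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)          # cluster assignment for each abstract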

18/11/15 41

Page 42

Example

Page 43

Page 44

Classification of Scientific Publications

•  Categorize into an established subject-based taxonomy (sketch below), e.g.
   •  Library of Congress
   •  UNESCO thesaurus
   •  DOAJ subject classification
   •  Library of Congress Subject Headings
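A minimal sketch of such a subject classification, with a made-up two-class "taxonomy" and toy abstracts:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

abstracts = [
    "gene expression patterns in cancer cells",
    "protein folding and molecular dynamics simulations",
    "deep neural networks for image recognition",
    "information retrieval and ranking models for search",
]
subjects = ["Biology", "Biology", "Computer Science", "Computer Science"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(abstracts, subjects)

# Predict the subject of an unseen abstract.
print(model.predict(["support vector machines for text classification"]))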

44

Page 45

How?

•  Aggregate and manage data: repositories, aggregators, datasets, ...
•  Mining in Academic Resources
   •  Information Extraction
   •  Topic Modeling
   •  Clustering/Classification
   •  Linking publications
•  Make data and source code available :-)

45

Page 46

Linking Scientific Publications

•  Citations (explicitly defined)
•  Similarity
   •  Statistical similarity: cosine
   •  Semantic similarity: more complex, e.g. via topics
•  Usage
   •  Argument support
   •  Contradiction
   •  ...

46

Page 47

Linking via Citations

47

Page 48

How?

•  Aggregate and manage data: repositories, aggregators, datasets, ...
•  Mining in Academic Resources
   •  Information Extraction
   •  Clustering/Classification
   •  Linking publications
   •  Search
•  Make data and source code available :-)

48

Page 49

Sharing code

•  GitHub
•  Bitbucket
•  IPython Notebooks
•  ...

49

Page 50

Example: ContentMine

50
http://contentmine.org

Idea:
•  Facts cannot be copyrighted
•  Billions of facts are contained in copyright-protected research articles
→ Make them publicly accessible!

Page 51

Possible questions for ContentMine

•  Find references to papers by a given author. This is metadata and therefore factual. It is usually trivial to extract references and authors; it is, of course, more difficult to disambiguate them.
•  Find who sponsors research. Extract acknowledgements and perform named entity recognition to detect companies. Link the companies to the papers in whose acknowledgements they are listed.

51

Page 52

Machine Extraction of scientific facts

1.  Crawl scientific literature
2.  Scrape each scientific article
3.  Extract facts
4.  Index
5.  Republish (WikiData)

https://github.com/ContentMine

Page 53

Example: retrieve metadata for specific article

18/11/15 53
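The slide shows a screenshot; as a stand-in, here is one common way to retrieve metadata for a specific article in Python, via the public Crossref REST API (https://api.crossref.org). The DOI is a placeholder and should be replaced by the DOI of the article of interest:

import json
import urllib.parse
import urllib.request

doi = "10.1234/example.doi"   # placeholder DOI; replace with a real one
url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)

with urllib.request.urlopen(url) as response:
    record = json.load(response)["message"]

print(record["title"][0])                                              # article title
print(record.get("container-title", [""])[0])                          # journal / proceedings
print([author.get("family") for author in record.get("author", [])])   # author surnames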

Page 54

Content Mining Problems

•  Secondary publishers create walled gardens
   •  E.g. the ResearchGate portal
•  Publishers' contracts ban content mining
•  Publishers may cut off universities who mine
•  Publishers lobby governments to require "licences for content mining"
•  UK → "the right to read is the right to mine"

http://blogs.ch.cam.ac.uk/pmr/2013/10/02/text-and-data-mining-fighting-for-our-digital-future-peter-murray-rust-is-the-problem/

Page 55

Summary

•  Aggregators/repositories for scientific publications
•  Mining content/data in publications
   •  Information / fact extraction
   •  Topic modeling
   •  Clustering
      •  E.g. exploratory analysis of large datasets: find groups of interest expressed by user-generated tags, and their relations
•  ContentMine as an example

55

Page 56

Questions?

See you next week!

56