Information Retrieval: Semantic Web Technologies and Wikidata from R

Goran S. Milovanović, PhD
Wikimedia Deutschland, Data Scientist for Wikidata
DataKolektiv, Belgrade

2019-07-02



R (Q206904)

Q206904 (R) is P31 (instance of)?

Q206904 (R) is P31 (instance of) Q9143 (programming language).

(This is a triplet, also called a triple: a subject-predicate-object unit used to describe knowledge in Wikidata and other semantic knowledge bases.)
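As a rough illustration, the statement about R can be held in an R named character vector and rendered as a SPARQL triple pattern. The helper name `as_sparql_pattern` is hypothetical; `wd:` and `wdt:` are the entity and direct-property prefixes predefined by the Wikidata Query Service.

```r
# The statement as a named character vector: subject, predicate, object.
triplet <- c(subject = "Q206904", predicate = "P31", object = "Q9143")

# Hypothetical helper: render the statement as a SPARQL triple pattern,
# using the wd: (entity) and wdt: (direct property) prefixes.
as_sparql_pattern <- function(t) {
  sprintf("wd:%s wdt:%s wd:%s .", t[["subject"]], t[["predicate"]], t[["object"]])
}

pattern <- as_sparql_pattern(triplet)
print(pattern)
```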


How to access

Method 1: SPARQL query against the Wikidata Query Service (WDQS)
https://query.wikidata.org/

All programming languages:

SELECT ?item WHERE { ?item wdt:P31 wd:Q9143 .}

1418 results

All functional programming languages:

SELECT ?item WHERE { ?item wdt:P31 wd:Q9143 . ?item wdt:P3966 wd:Q193076 . }

73 results
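A minimal sketch of how such a query could be prepared for WDQS from R, using base R only. It builds the request URL for the `https://query.wikidata.org/sparql` endpoint and stops short of sending it; the commented-out lines assume the {httr} and {jsonlite} packages and are one possible way to run the request, not necessarily the one used in the notebook.

```r
# SPARQL query from the slide: all items that are an instance of (P31)
# programming language (Q9143).
query <- "SELECT ?item WHERE { ?item wdt:P31 wd:Q9143 . }"

# WDQS SPARQL endpoint.
endpoint <- "https://query.wikidata.org/sparql"

# Percent-encode the query so that it can travel as a URL parameter.
request_url <- paste0(
  endpoint,
  "?format=json&query=",
  utils::URLencode(query, reserved = TRUE)
)
print(request_url)

# Assumption: with {httr} and {jsonlite} installed, the request could be
# sent along these lines (not run here; the user agent string is a placeholder):
# res   <- httr::GET(request_url, httr::user_agent("my-r-script (contact)"))
# items <- jsonlite::fromJSON(httr::content(res, as = "text"))
```

Result counts such as those on the slide change over time as Wikidata is edited.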

Examples

R Notebook to learn from:

A_WikidataFromR.nb.html

GitHub: DataKolektiv’s MilanoR2019 Repository, https://github.com/datakolektiv


How to access

Method 2: Wikidata MediaWiki API
https://www.mediawiki.org/wiki/API:Presenting_Wikidata_knowledge

Examples

https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q180736&props=labels%7Cdescriptions%7Cclaims%7Csitelinks/urls&languages=az&languagefallback=&sitefilter=azwiki&formatversion=2
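The example URL can be read as a set of key=value parameters. The sketch below, in base R, rebuilds an equivalent URL from a named list; the variable names are illustrative. Note that `utils::URLencode(reserved = TRUE)` also encodes the `/` in `sitelinks/urls` as `%2F`, which is standard percent-encoding and differs only cosmetically from the URL above.

```r
# Parameters of the wbgetentities call shown above, as a named list.
params <- list(
  action = "wbgetentities",
  ids = "Q180736",
  props = paste(c("labels", "descriptions", "claims", "sitelinks/urls"), collapse = "|"),
  languages = "az",
  languagefallback = "",
  sitefilter = "azwiki",
  formatversion = "2"
)

# Percent-encode each value ("|" becomes "%7C") and join as key=value pairs.
encode <- function(x) utils::URLencode(x, reserved = TRUE)
query_string <- paste0(
  names(params), "=", vapply(params, encode, character(1)),
  collapse = "&"
)
api_url <- paste0("https://www.wikidata.org/w/api.php?", query_string)
print(api_url)
```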

R Notebook to learn from:

A_WikidataFromR.nb.html

GitHub: DataKolektiv’s MilanoR2019 Repository, https://github.com/datakolektiv


How to access

Method 3: {WikidataR} R Package
https://cran.r-project.org/web/packages/WikidataR/vignettes/Introduction.html

Examples

R Notebook to learn from:

A_WikidataFromR.nb.html

GitHub: DataKolektiv’s MilanoR2019 Repository, https://github.com/datakolektiv

library(WikidataR)

# - Retrieve the Wikidata item: Milano (Q490)
item <- get_item(id = 490)

# - Retrieve the property IDs of all claims for Q490
claims <- names(item[[1]]$claims)
head(claims, 20)

"P2924" "P373" "P1225" "P1082" "P1667" "P625" "P910" "P3365" "P349" "P268" "P1791" "P242" "P1036" "P1334" "P227" "P2046" "P6" "P1792" "P1448" "P395"

# What is P2924?
# Use WikidataR::get_property()
prop <- get_property(id = 'P2924')
prop[[1]]$labels$en$value

[1] "Great Russian Encyclopedia Online ID"


Case Study: Machine Tagging of Textual Corpora + Enriching Topic Model Representations from Wikidata

Apple has so far remained mum on exactly how many Apple Watches they have sold. We're not sure why, especially in the face of multiple reviews and third-party data, all of which seem to suggest that the device is doing pretty well, so why not boast about it, right? Unfortunately the company is still keeping quiet on that front, so for now third-party data will have to do. According to the recent numbers from Strategy Analytics (via AppleInsider), it seems that in Q4 2018, Apple had managed to ship as many as 9.2 million Apple Watches. Note that these are shipped figures, meaning that the actual number of units sold could be less, but since Apple probably as a good handle on the demand for the wearable, it could be close. The shipment of 9.2 million also meant that Apple had managed to capture 50.7% of the market. However according to Strategy Analytics, 50.7% is actually a lot less than the previous year in 2017 where Apple reportedly commanded 60.4% of the smartwatch market. The closest company on that list is Fitbit who is sitting at 12.2%, which is actually a remarkable jump considering that in 2017 the company was at 1.7%. Samsung also ...

2019/03/01, ubergizmo.com

{newsrivr} package

The package wraps the NEWSRIVER (https://newsriver.io/) API calls from within R and retrieves full-text documents.

Learn {newsrivr}: 00_newsrivr.nb.html

GitHub: DataKolektiv’s MilanoR2019 Repository, https://github.com/datakolektiv


Apple has so far remained mum on exactly how many Apple Watches they have sold.

- We want to (1) recognize “Apple” and “Apple Watch” as named entities, then

- (2) search Wikidata and collect all of its entities that match “Apple” and “Apple Watch”, and then

- (3) disambiguate against all candidate Wikidata entities to discover which of them best represent the tokens “Apple” and “Apple Watch”, and finally

- (4) fetch all of their properties to (4a) perform machine tagging of the documents in the corpus and (4b) experiment with enriched Term-Document matrices in Topic Modeling.


(1) Named Entity Recognition w. {spacyr}

Notebook: 01_NER.nb.html
GitHub: DataKolektiv’s MilanoR2019 Repository, https://github.com/datakolektiv

Learn about the {spacyr} package: https://github.com/quanteda/spacyr

An R wrapper to the spaCy “industrial strength natural language processing” Python library from https://spacy.io.

spaCy NER categories:

PERSON, NORP, FAC, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL, CARDINAL

    doc_id  text
1   1       Apple
2   1       Apple Watch
3   1       Strategy Analytic
4   1       AppleInsider
5   1       Watch
6   1       Fitbit
7   1       Samsung
8   2       ValuEngine
9   2       Apple
10  2       NASDAQ
11  2       AAPL
12  2       JPMorgan Chase & Co.
13  2       Royal Bank of Canada
14  2       Citigroup
15  2       Rosenblatt Security
16  2       Morgan Stanley
17  2       PE
18  2       PEG
19  2       iPhone
20  2       EP
...


(2) Search Wikidata w. {WikidataR} and collect all of the results’ Wikidata classes w. SPARQL from WDQS

Notebook: 02_WikidataMatch.nb.html
GitHub: DataKolektiv’s MilanoR2019 Repository, https://github.com/datakolektiv

Use the WikidataR::find_item() function to search Wikidata for each entry in the `text` field of the NER output. Candidate matches:

doc_id  text         WD_Label              URI
1       Apple        Apple                 Q312
1       Apple        Apple                 Q1754545
1       Apple        The Apple             Q595660
1       Apple Watch  Apple Watch           Q18010946
1       Apple Watch  Apple Watch           Q50617478
1       Apple Watch  watchOS               Q18012472
1       Apple Watch  Apple Watch Series 3  Q39512123
1       Apple Watch  Apple Watch Series 2  Q26868823
1       Apple Watch  Apple Watch Series 1  Q28418943
1       Apple Watch  Apple Watch Series 4  Q56599236
1       Apple Watch  Apple Watch Steps     Q58607538
1       Watch        The Watch             Q29313
1       Watch        watch                 Q1934809
1       Watch        Watch                 Q1946176
1       Watch        W                     Q2552455
1       Watch        Watch                 Q60854523
1       Watch        Watch                 Q15883857
1       Watch        Watch                 Q53758999
1       Watch        Watch                 Q7973042
1       Fitbit       Fitbit Inc.           Q5455414
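A small sketch of how search hits could be flattened into the candidate table. It works on a mock list rather than calling Wikidata; the assumption is that each search hit exposes at least an `id` and a `label` field, and `candidates_for` is a hypothetical helper, not a {WikidataR} function.

```r
# Mock of a search result: one list element per hit, each carrying an `id`
# and a `label` (values taken from the candidate table for "Apple").
hits <- list(
  list(id = "Q312", label = "Apple"),
  list(id = "Q1754545", label = "Apple"),
  list(id = "Q595660", label = "The Apple")
)

# Hypothetical helper: flatten the hits for one mention into candidate rows.
candidates_for <- function(doc_id, text, hits) {
  data.frame(
    doc_id = doc_id,
    text = text,
    WD_Label = vapply(hits, function(h) h$label, character(1)),
    URI = vapply(hits, function(h) h$id, character(1)),
    stringsAsFactors = FALSE
  )
}

candidates <- candidates_for(1, "Apple", hits)
print(candidates)
```

Row-binding the result of this helper over all mentions in all documents would give a table of the same shape as the one on the slide.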


1. text: Apple, URI: Q312

CLASSES (P31/P279): business, company, organization, juridical person, group of humans, agent, legal person, subject, Q26720107, group of living things, individual, entity, structure, group of physical objects, group, object, concrete object, spacio-temporal entity, enterprise, organization, group of humans, agent, group of living things, individual, structure, group of physical objects, entity, group, object, concrete object, spacio-temporal entity

2. text: Apple, URI: Q1754545

CLASSES (P31/P279): album, release, publication, musical work, mass media, work, product, work of art, communication medium, creative work, item of collection or exhibition, artificial entity, information, manifestation, entity, intellectual work, physical object, abstract object, concrete object, product, artificial physical object, goods, spacio-temporal entity, object, perceptible object

3. text: Apple, URI: Q595660

CLASSES (P31/P279): film, audiovisual work, visual artwork, moving image, series, intangible good, creative work, work of art, artificial physical object, motion, image, group, change, physical process, goods, depiction, item of collection or exhibition, physical object, artificial entity, object, intellectual work, perceptible object, concrete object, inconstancy, work, entity, physical phenomenon, process, product, phenomenon, occurrence, spacio-temporal entity, property, category of being, quality, temporal entity, concept, mental representation, abstract object, representation, object

SPARQL for Wikidata classes: an additional filter for our dataset.
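One way such a class query could look, sketched as a base R string builder. The property path `wdt:P31/wdt:P279*` (instance of, followed by any number of subclass of links) is standard SPARQL on WDQS; whether the notebook uses exactly this query is an assumption, and `classes_query` is a hypothetical helper. `DISTINCT` is added here to drop repeated classes, which the lists above still contain.

```r
# Hypothetical helper: build a SPARQL query that walks instance of (P31)
# followed by any number of subclass of (P279) links from one item.
classes_query <- function(qid) {
  stopifnot(grepl("^Q[0-9]+$", qid))
  paste0(
    "SELECT DISTINCT ?class ?classLabel WHERE { ",
    "wd:", qid, " wdt:P31/wdt:P279* ?class . ",
    "SERVICE wikibase:label { bd:serviceParam wikibase:language \"en\" . } ",
    "}"
  )
}

q <- classes_query("Q312")
cat(q, "\n")
```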


(3) Disambiguate Against Wikidata concepts

Note: approximate, document-level entity disambiguation

Notebooks: 03_ReferenceWikipediaCorpus.nb.html, 04_referenceLDAPreparation.nb.html, 05_referenceLDATraining.nb.html
GitHub: DataKolektiv’s MilanoR2019 Repository, https://github.com/datakolektiv


1. Collect all English Wikipedia articles for the candidate Wikidata items → Corpus for Wikidata items.

2. Train an LDA topic model (each Wikidata item is a document).

3. Predict topics for the News corpus (each document is a news article).

4. Compute similarity/distance between (a) the news and (b) the Wikidata entities from a common topical representation.

5. Disambiguate by picking the most similar Wikidata entity to the document at hand.


[Diagram: for each Wikidata entity, follow its sitelinks to Wikipedia and use the MediaWiki API to get the page content from Wikipedia; these pages feed the Reference Corpus TDM, which sits alongside the News Corpus TDM.]


[Diagram: the Reference Corpus (Wikipedia articles for Wikidata items) and the News Corpus pass through the same Text Pre-Processing Pipeline, producing a Reference Term-Document Matrix and a Target Term-Document Matrix over a Common Vocabulary.]
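A toy sketch of the common-vocabulary idea in base R. The two tiny corpora are invented for illustration, and taking the intersection of the two term sets is an assumption about how the common vocabulary is formed. The matrices are laid out with documents in rows (the transpose of a term-document matrix).

```r
# Toy corpora (hypothetical text, for illustration only).
reference_docs <- c(Q312 = "apple designs the iphone and the watch",
                    Q1754545 = "apple is an album of rock music")
news_docs <- c(doc1 = "apple shipped many watch units",
               doc2 = "the album was released")

# One pre-processing function, applied to both corpora.
tokenize <- function(x) setNames(strsplit(tolower(x), "[^a-z]+"), names(x))
ref_tokens <- tokenize(reference_docs)
news_tokens <- tokenize(news_docs)

# Common vocabulary: terms that occur in both corpora (an assumption).
vocab <- sort(intersect(unlist(ref_tokens), unlist(news_tokens)))

# Count matrix (documents in rows) restricted to a fixed vocabulary.
count_matrix <- function(tokens, vocab) {
  m <- t(vapply(tokens,
                function(tk) as.numeric(table(factor(tk, levels = vocab))),
                numeric(length(vocab))))
  dimnames(m) <- list(names(tokens), vocab)
  m
}

reference_tdm <- count_matrix(ref_tokens, vocab)
target_tdm <- count_matrix(news_tokens, vocab)
print(reference_tdm)
print(target_tdm)
```

Because both matrices share the same columns, a topic model trained on the reference matrix can be applied to the target matrix without re-mapping terms.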


Training

- {text2vec} implementation of the WarpLDA algorithm;

- i7 on four physical cores (7 threads), 32 GB RAM, 16 GB swap on SSD;

- each model is trained serially, with many models run in parallel;

- number of topics: from 10 to 3000, in steps of 10.

Result: selected model with 1000 topics.
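The model grid above can be sketched in base R as follows. The number of workers (7) is taken from the thread count on the slide; the round-robin split is an illustrative choice, not necessarily the scheduling used in the notebook, and the actual {text2vec} model fitting is left out.

```r
# Candidate numbers of topics: 10, 20, ..., 3000.
n_topics_grid <- seq(10, 3000, by = 10)
print(length(n_topics_grid))

# Assumption: 7 workers, matching the 7 threads mentioned above.
n_workers <- 7

# Deal the candidates out round-robin, so that each worker gets a mix of
# small and large models instead of one worker receiving all the largest.
assignment <- split(n_topics_grid, (seq_along(n_topics_grid) - 1) %% n_workers)
print(lengths(assignment))

# Each worker would then fit its models one after another, e.g. via
# parallel::mclapply(assignment, ...), with the fitting function supplied
# by the user (not shown here).
```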


Notebook: 06_EntityDisambiguation.nb.html
GitHub: DataKolektiv’s MilanoR2019 Repository, https://github.com/datakolektiv

doc_id  text         WD_Label              URI
1       Apple        Apple                 Q312
1       Apple        Apple                 Q1754545
1       Apple        The Apple             Q595660
1       Apple Watch  Apple Watch           Q18010946
1       Apple Watch  Apple Watch           Q50617478
1       Apple Watch  watchOS               Q18012472
1       Apple Watch  Apple Watch Series 3  Q39512123
1       Apple Watch  Apple Watch Series 2  Q26868823
1       Apple Watch  Apple Watch Series 1  Q28418943
1       Apple Watch  Apple Watch Series 4  Q56599236
1       Apple Watch  Apple Watch Steps     Q58607538

Disambiguation against Wikidata: combine the Structural Knowledge component (from Wikidata search) and the Probabilistic Knowledge component (from ML/LDA):

- compute the Hellinger distance between the two topical distributions (i.e. for Wikidata entities and for News);

- for each concept, in each document, focus only on the subset of the distance matrix that encompasses the candidate Wikidata entities;

- select by min(distance).
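A minimal sketch of the selection step in base R. The Hellinger distance between two discrete distributions p and q is sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2). The topic distributions below are invented three-topic toy values (the selected model has 1000 topics), and the entity IDs are the three candidates for “Apple”.

```r
# Hellinger distance between two discrete probability distributions.
hellinger <- function(p, q) {
  sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)
}

# Toy topic distributions over 3 topics (hypothetical numbers; rows sum to 1).
entity_topics <- rbind(
  Q312     = c(0.7, 0.2, 0.1),
  Q1754545 = c(0.1, 0.8, 0.1),
  Q595660  = c(0.1, 0.1, 0.8)
)
news_topics <- c(0.6, 0.3, 0.1)

# Distances from the news document to the candidate entities only.
d <- apply(entity_topics, 1, hellinger, q = news_topics)
print(round(d, 3))

# Select the candidate with the minimum distance.
best <- names(which.min(d))
print(best)
```

The distance is 0 for identical distributions and 1 for distributions with no overlap, so the smallest value marks the candidate whose topical profile is closest to the document.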


(4) Automatic document tagging by Wikidata classes + enriched LDA corpus representation

Notebooks: 06_EntityDisambiguation.nb.html, 07_LDA_Meta_Features.nb.html
GitHub: DataKolektiv’s MilanoR2019 Repository, https://github.com/datakolektiv

RESULTS

- All documents in the News corpus are tagged by the P31 (instance of) and P279 (subclass of) Wikidata classes encompassing all concepts matched in the respective document.

- For each matched concept, in each document: additional features (properties) are fetched from Wikidata,

e.g. for companies:

P452 (industry) → Q8148 (industry) | Q268592 (economic branch)
P17 (country) → Q6256 (country) | Q3024240 (historical country) | Q1763527 (constituent country)
P463 (member of) → Q9200127 (member)
P1454 (legal form) → Q155076 (juridical person) | Q12047392 (legal form)
P2770 (source of income) → Q1527264 (income)
P2283 (uses) → Q1724915 (use)
P355 (subsidiary) → Q658255 (subsidiary company)
P749 (parent organization) → Q1956113 (parent company)
P127 (owned by)
P1716 (brand) → Q431289 (brand)
P199 (business division) → Q334453 (division)

Throw these into the Bag of Words together with all other News document features and train LDA again.
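A toy sketch of what adding Wikidata features to a bag of words could look like in base R. The `WD_<property>_<value>` token format and the property/value pairs are illustrative assumptions, not taken from the notebook.

```r
# Tokens of one news document (a shortened, hypothetical example).
doc_tokens <- c("apple", "shipped", "apple", "watches")

# Wikidata features for the entity matched in this document:
# illustrative property/value pairs (assumed, not fetched from Wikidata).
wd_features <- c(P31 = "Q4830453", P17 = "Q30")

# Turn each property/value pair into a single pseudo-word.
feature_tokens <- paste0("WD_", names(wd_features), "_", wd_features)

# Enriched bag of words: ordinary terms plus Wikidata pseudo-words.
enriched <- c(doc_tokens, feature_tokens)
bag <- table(enriched)
print(bag)
```

Documents that mention different companies sharing the same property values then share pseudo-words, which gives the topic model extra evidence to group them.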


Notebook: 08_Visualizations.nb.html
GitHub: DataKolektiv’s MilanoR2019 Repository, https://github.com/datakolektiv
