KNOWLEDGE GRAPHS AND THE ROLE OF DBPEDIA Paul Groth @pgroth pgroth.com Thanks to Joao Moura Elsevier Labs @elsevierlabs 6th DBpedia Community Meeting in The Hague 2016 Feb. 12, 2016

Knowledge Graph Construction and the Role of DBPedia

KNOWLEDGE GRAPHS AND THE ROLE OF DBPEDIA

Paul Groth @pgroth

pgroth.com

Thanks to Joao Moura

Elsevier Labs @elsevierlabs

6th DBpedia Community Meeting in The Hague 2016

Feb. 12, 2016

FAVORITE DBPEDIA PREDICATE….

OUTLINE

• The Importance of Structure

• Better taxonomies

• Knowledge graph construction

ELSEVIER LABS - INTRO

WORLD LEADER IN DIGITAL INFO SOLUTIONS

Published over 330,000 articles in 2013

Founded over 130 years ago

Work with over 30 million scientists, students, health & information professionals

Employ over 7,000 employees in 24 countries

Received over 1 million submissions in 2013

Over the last 50 years the majority of Nobel Laureates have published with Elsevier

Over 53 million items indexed by Scopus

Elsevier eBooks, Online Journals, Databases

Publishes over 2,200 online journals & over 10,000 e-books

SOLUTIONS

Elsevier R+D Solutions

Helps corporate researchers, R+D professionals, and engineers improve how they interact with, share, and apply information to solve problems using our digital workflow tools, analytics, and data.

Elsevier Research Intelligence

Provides universities, governments, and research institutions with the resources and insights to improve institutional research strategy, management, and performance.

Elsevier Clinical Solutions

Helps medical professionals apply trusted data and sophisticated tools to make better clinical decisions, deliver better care, and produce better healthcare outcomes.

Elsevier Education

Helps educate highly-skilled, effective healthcare professionals, using the most advanced pedagogical tools and reference works.

CONTENT

CAPABILITIES

PLATFORMS

60 % OF TIME IS SPENT ON DATA PREPARATION

STRUCTURED DATA

CONNECTING DATA TO APPS

BUILDING BETTER TAXONOMIES

• Ontologies and taxonomies help organize and query content
  • Annotation
  • Classification / Navigation
  • Autocomplete
  • Suggestion & Recommendation

• We have lots of taxonomies/ontologies
  • Journal Classification for Scopus
  • Mendeley classification system
  • Science Direct Subject classification
  • Reference Modules Hierarchies for Books
  • Submission system Journal classifications
  • …

• Connect to external ontologies (e.g. MeSH)

• Ontology Maintenance, Usage and Mapping

TAXONOMY INDUCTION

Starting with a very shallow hierarchy of syntactical concepts with almost no intersections:

1. Matching concepts against a target (well accepted) taxonomy and dbpedia:

• Problems: Same concept may have different names or terminologies in different branches; Multiple languages etc.

2. Check for partial orders between these concepts, using the hierarchy of the target taxonomy and dbpedia (skos:broader).

3. Finding/completing missing links between concepts.

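Example: given two concepts, check whether they form a parent-child relation:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT * WHERE {
  # Child concept: follow redirects, then collect its categories.
  <http://dbpedia.org/resource/Model-checking> dbo:wikiPageRedirects* ?conceptChild .
  ?conceptChild dbo:wikiPageRedirects* ?redirectedChild .
  ?redirectedChild dct:subject ?subjectChild .

  # Candidate parent concept: same steps.
  <http://dbpedia.org/resource/Formal_methods> dbo:wikiPageRedirects* ?conceptParent .
  ?conceptParent dbo:wikiPageRedirects* ?redirectedParent .
  ?redirectedParent dct:subject ?subjectParent .

  # Keep rows where a category of the child has a category of the parent as skos:broader.
  ?subjectChild skos:broader ?subjectChildsParent .
  FILTER(?subjectChildsParent = ?subjectParent)
}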
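
The same parent-child check can be sketched offline in Python. This is a minimal illustration rather than the code used in the project: the three dictionaries are toy stand-ins for `dbo:wikiPageRedirects`, `dct:subject` and `skos:broader`, their contents are made-up values rather than real DBpedia results, and the helper names `resolve` and `is_parent_child` are hypothetical.

```python
from typing import Dict, Set

# Toy stand-ins for DBpedia data (hypothetical values, not real query results).
REDIRECTS: Dict[str, str] = {"Model-checking": "Model_checking"}
SUBJECTS: Dict[str, Set[str]] = {
    "Model_checking": {"Category:Model_checking"},
    "Formal_methods": {"Category:Formal_methods"},
}
BROADER: Dict[str, Set[str]] = {
    "Category:Model_checking": {"Category:Formal_methods"},
}


def resolve(resource: str) -> str:
    """Follow redirects until a resource with no further redirect is reached."""
    seen = set()
    while resource in REDIRECTS and resource not in seen:
        seen.add(resource)  # guard against redirect cycles
        resource = REDIRECTS[resource]
    return resource


def is_parent_child(child: str, parent: str) -> bool:
    """True if a category of `child` has a category of `parent` as a direct broader category."""
    child_subjects = SUBJECTS.get(resolve(child), set())
    parent_subjects = SUBJECTS.get(resolve(parent), set())
    return any(
        BROADER.get(subject, set()) & parent_subjects
        for subject in child_subjects
    )
```

With the toy data, `is_parent_child("Model-checking", "Formal_methods")` holds and the reverse direction does not, which is the partial-order test described in step 2.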

TOWARDS AN ELSEVIER KNOWLEDGE GRAPH

• Ongoing proof-of-concept work by Paul Groth, Sujit Pal and Ron Daniel of Elsevier Labs

• Unsupervised, scalable and built with off-the-shelf technologies

• Based on recent work at University College London and University of Massachusetts Amherst
  • Riedel, Sebastian, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. "Relation extraction with matrix factorization and universal schemas." (2013).

Pipeline figure labels:

• 14M articles from Science Direct
• 3.3M triples
• 475M triples
• 49M triples
• p x r matrix
• p x k, k x r latent factor matrices
• ~102 triples
• 920K concepts from EMMeT

ENTITY RESOLUTION: GLAUCOMA

Surface form triples downsampled from 49M entity-resolved triples

ANNOTATION

• http://www.slideshare.net/SparkSummit/dictionary-based-annotation-at-scale-with-spark-by-sujit-pal

• What is the problem?
  • Annotate millions of documents from different corpora.
    • 14M docs from Science Direct alone.
    • More from other corpora, dependency parsing, etc.
  • Critical step for Machine Reading and Knowledge Graph applications.

• Why is this such a big deal?
  • Takes advantage of existing linked data.
  • No model training for multiple complex STM domains.
  • However, simple until done at scale.

ANNOTATION PIPELINE

DICTIONARY BASED NE ANNOTATOR (SODA)

• Part of Document Annotation Pipeline.

• Annotates text with Named Entities from external Dictionaries.

• Why do we have to scale (Wikipedia KBs) – 8 Million entities

• Built with Open Source Components
  • Apache Solr – Highly reliable, scalable and fault-tolerant search index.
  • SolrTextTagger – Solr component for text tagging, uses Lucene FST technology.
  • Apache OpenNLP – Machine Learning based toolkit for processing Natural Language Text.
  • Apache Spark – Lightning fast, large scale data processing.

• Uses ideas from other Open Source libraries
  • FuzzyWuzzy – Fuzzy String Matching like a boss.

• Contributed back to Open Source
  • https://github.com/elsevierlabs-os/soda
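
To make the idea of dictionary-based annotation concrete, here is a minimal Python sketch of longest-match tagging against an in-memory dictionary. It is not SODA's implementation and does not use Solr, SolrTextTagger, OpenNLP or Spark; the dictionary entries, the `ent:` identifiers and the `annotate` function are all hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical dictionary: lowercased surface form -> entity identifier.
DICTIONARY: Dict[str, str] = {
    "glaucoma": "ent:glaucoma",
    "visual field loss": "ent:visual_field_loss",
    "visual field": "ent:visual_field",
}
MAX_PHRASE_TOKENS = max(len(name.split()) for name in DICTIONARY)


def annotate(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, entity_id) character spans, preferring the longest match."""
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shrink it.
        for n in range(min(MAX_PHRASE_TOKENS, len(tokens) - i), 0, -1):
            key = " ".join(text[s:e].lower() for s, e in tokens[i:i + n])
            if key in DICTIONARY:
                spans.append((tokens[i][0], tokens[i + n - 1][1], DICTIONARY[key]))
                i += n  # skip past the matched phrase
                matched = True
                break
        if not matched:
            i += 1
    return spans


spans = annotate("Glaucoma causes visual field loss.")
```

Each span points at an entity identifier rather than a string, so "visual field loss" is tagged once as a whole instead of as the shorter "visual field". A linear scan like this does not scale to millions of entities and documents, which is the gap the components listed above address.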

MATRIX CONSTRUCTION: GLAUCOMA

p = 83, r = 176

83 x 176 sparse binary-valued matrix with 366 entries

Figure labels: surface form relations, structured relations, entity pairs
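
A small Python sketch of how such a matrix could be assembled from triples. It assumes, following the universal-schema setup of the Riedel et al. paper cited earlier, that rows are entity pairs and columns are relations; the `build_matrix` helper and the example triples are illustrative, not the project's code or data.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def build_matrix(triples: List[Triple]):
    """Index entity pairs as rows and relations as columns; keep only observed cells."""
    pairs: Dict[Tuple[str, str], int] = {}
    relations: Dict[str, int] = {}
    cells: Set[Tuple[int, int]] = set()
    for subj, rel, obj in triples:
        row = pairs.setdefault((subj, obj), len(pairs))
        col = relations.setdefault(rel, len(relations))
        cells.add((row, col))  # binary-valued: a repeated triple adds nothing
    return pairs, relations, cells


# Illustrative triples only.
triples = [
    ("glaucoma", "contributed to", "functional visual field loss"),
    ("glaucoma", "the risk of", "functional visual field loss"),
    ("glaucoma", "can appear soon in", "age over 40"),
    ("glaucoma", "contributed to", "functional visual field loss"),
]
pairs, relations, cells = build_matrix(triples)
p, r = len(pairs), len(relations)
```

Storing only the set of observed cells keeps the representation sparse: on the slide, 366 of the 83 x 176 = 14,608 possible cells are filled.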

MATRIX COMPLETION: GLAUCOMA

p = 83, r = 176

Latent factor matrix × latent factor matrix = 83 x 176 real-valued matrix with 14,608 entries
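
The completion step can be illustrated with a tiny logistic matrix factorization in pure Python. This is a simplified sketch: it treats every unobserved cell as a negative example and uses plain stochastic gradient ascent, which is an assumption made for brevity and not the training objective of the cited paper or of the project. The function name, the hyperparameters and the 4 x 4 toy matrix are hypothetical.

```python
import math
import random
from typing import List, Set, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def factorize(cells: Set[Tuple[int, int]], p: int, r: int,
              k: int = 2, epochs: int = 300, lr: float = 0.1,
              seed: int = 0) -> List[List[float]]:
    """Learn p x k and k x r latent factors; return the dense p x r score matrix."""
    rng = random.Random(seed)
    rows = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(p)]
    cols = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(r)]
    for _ in range(epochs):
        for i in range(p):
            for j in range(r):
                target = 1.0 if (i, j) in cells else 0.0
                pred = sigmoid(sum(a * b for a, b in zip(rows[i], cols[j])))
                err = target - pred  # gradient of the log-likelihood w.r.t. the logit
                for f in range(k):
                    row_f, col_f = rows[i][f], cols[j][f]
                    rows[i][f] += lr * err * col_f
                    cols[j][f] += lr * err * row_f
    return [
        [sigmoid(sum(a * b for a, b in zip(rows[i], cols[j]))) for j in range(r)]
        for i in range(p)
    ]


# Toy 4 x 4 matrix with two blocks of co-occurring pairs and relations.
observed = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (3, 3)}
scores = factorize(observed, p=4, r=4)
```

The product of the two factor matrices assigns a real-valued score to every cell, observed or not, which is why the completed matrix on the slide has all 14,608 entries filled.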

PREDICTED RELATIONS: GLAUCOMA

• At threshold = 0.08
  • 22 unseen relations
  • F1 = 0.71

• Applications beyond knowledge graph construction

• Taxonomy and ontology maintenance

• Entity search in task-specific and/or mobile context

• Question answering

glaucoma developed many years after chronic inflammation of uveal tract
glaucoma develop following chronic inflammation of uveal tract
glaucoma can appear soon in family history of glaucoma
glaucoma can appear soon in age over 40
glaucoma the risk of functional visual field loss
glaucoma contributing causes of functional visual field loss
glaucoma contributed to functional visual field loss
glaucoma is considered the second leading cause of functional visual field loss
glaucoma remains the second leading cause of functional visual field loss

This is a unique entity not a string
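
How a threshold turns scores into predicted relations, and how an F1 value is computed from them, can be sketched as follows. The slides do not say how the reported F1 was obtained, so comparing against a set of held-out gold cells is an assumption here; the score matrix, the gold set and the function names are toy values for illustration only.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]


def predict(scores: List[List[float]], observed: Set[Cell], threshold: float) -> Set[Cell]:
    """Cells scoring at or above the threshold that were not already observed."""
    return {
        (i, j)
        for i, row in enumerate(scores)
        for j, s in enumerate(row)
        if s >= threshold and (i, j) not in observed
    }


def f1_score(predicted: Set[Cell], gold: Set[Cell]) -> float:
    """Harmonic mean of precision and recall over predicted vs. gold cells."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy values, not results from the talk.
toy_scores = [[0.9, 0.2, 0.05], [0.1, 0.6, 0.3]]
toy_observed = {(0, 0)}
toy_gold = {(0, 1), (1, 1), (1, 2), (0, 2)}
predicted = predict(toy_scores, toy_observed, threshold=0.08)
f1 = f1_score(predicted, toy_gold)
```

Lowering the threshold admits more unseen cells and tends to raise recall at the cost of precision, so the threshold is a tuning knob rather than a fixed constant.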

A DBPEDIA IDEA?

• Connect to the Scholarly Ecosystem

• Crossref & DataCite DOIs + ORCIDs

CONCLUSION

• DBpedia and Wikipedia KBs are great reference sources

• Beyond expected use for…
  • Internal knowledge curation
  • Stress testing

• We’re hiring