KNOWLEDGE GRAPHS AND THE ROLE OF DBPEDIA
Paul Groth @pgroth
pgroth.com
Thanks to Joao Moura
Elsevier Labs @elsevierlabs
6th DBpedia Community Meeting in The Hague 2016
Feb. 12, 2016
ELSEVIER LABS - INTRO
WORLD LEADER IN DIGITAL INFO SOLUTIONS
Published over 330,000 articles in 2013
Founded over 130 years ago
Work with over 30 million scientists, students, health & information professionals
Employ over 7,000 employees in 24 countries
Received over 1 million submissions in 2013
Over the last 50 years the majority of Nobel Laureates have published with Elsevier
Over 53 million items indexed by Scopus
Elsevier eBooks, Online Journals, Databases
Publishes over 2,200 online journals & over 10,000 e-books
SOLUTIONS
Elsevier R+D Solutions
Helps corporate researchers, R+D professionals, and engineers improve how they interact with, share, and apply information to solve problems using our digital workflow tools, analytics, and data.
Elsevier Research Intelligence
Provides universities, governments, and research institutions with the resources and insights to improve institutional research strategy, management, and performance.
Elsevier Clinical Solutions
Helps medical professionals apply trusted data and sophisticated tools to make better clinical decisions, deliver better care, and produce better healthcare outcomes.
Elsevier Education
Helps educate highly skilled, effective healthcare professionals, using the most advanced pedagogical tools and reference works.
CONTENT
CAPABILITIES
PLATFORMS
BUILDING BETTER TAXONOMIES
• Ontologies and taxonomies help organize and query content
  • Annotation
  • Classification / Navigation
  • Autocomplete
  • Suggestion & Recommendation
• We have lots of taxonomies/ontologies
  • Journal Classification for Scopus
  • Mendeley classification system
  • Science Direct Subject classification
  • Reference Modules Hierarchies for Books
  • Submission system Journal classifications
  • …
• Connect to external ontologies (e.g. MeSH)
• Ontology Maintenance, Usage and Mapping
TAXONOMY INDUCTION
Starting with a very shallow hierarchy of syntactical concepts with almost no intersections:
1. Matching concepts against a target (well-accepted) taxonomy and DBpedia:
  • Problems: the same concept may have different names or terminologies in different branches; multiple languages; etc.
2. Check for partial orders between these concepts, using the hierarchy of the target taxonomy and DBpedia (skos:broader).
3. Finding/completing missing links between concepts.
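Example: given two concepts, check if they form a parent-child relation:
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT * WHERE {
  # Child concept: follow redirects, then collect its categories
  <http://dbpedia.org/resource/Model-checking> dbo:wikiPageRedirects* ?conceptChild .
  ?conceptChild dbo:wikiPageRedirects* ?redirectedChild .
  ?redirectedChild dct:subject ?subjectChild .
  # Parent concept: same resolution
  <http://dbpedia.org/resource/Formal_methods> dbo:wikiPageRedirects* ?conceptParent .
  ?conceptParent dbo:wikiPageRedirects* ?redirectedParent .
  ?redirectedParent dct:subject ?subjectParent .
  # Keep rows where a child category is directly below a parent category
  ?subjectChild skos:broader ?subjectChildsParent .
  FILTER(?subjectChildsParent = ?subjectParent)
}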
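The same parent-child test can be sketched offline in Python, assuming the redirect, dct:subject and skos:broader links have already been fetched into plain dictionaries. The function names and the toy data are hypothetical illustrations, not actual DBpedia content or the authors' code.

```python
def redirect_chain(concept, redirects):
    """Return the concept plus every resource reachable through redirect links."""
    chain = [concept]
    while chain[-1] in redirects and redirects[chain[-1]] not in chain:
        chain.append(redirects[chain[-1]])
    return chain


def subjects_of(concept, redirects, subjects):
    """Collect the categories (dct:subject) of a concept and its redirect targets."""
    found = set()
    for resource in redirect_chain(concept, redirects):
        found |= subjects.get(resource, set())
    return found


def is_parent_child(child, parent, redirects, subjects, broader):
    """True if a category of `child` has a direct skos:broader link to a category of `parent`."""
    parent_subjects = subjects_of(parent, redirects, subjects)
    return any(
        broader.get(subject, set()) & parent_subjects
        for subject in subjects_of(child, redirects, subjects)
    )


# Toy data (made up for illustration)
redirects = {"Model-checking": "Model_checking"}
subjects = {
    "Model_checking": {"Category:Model_checking"},
    "Formal_methods": {"Category:Formal_methods"},
}
broader = {"Category:Model_checking": {"Category:Formal_methods"}}

print(is_parent_child("Model-checking", "Formal_methods", redirects, subjects, broader))
```

Like the query, this only checks a single skos:broader hop; finding more distant ancestors would need a transitive walk over the category hierarchy.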
TOWARDS AN ELSEVIER KNOWLEDGE GRAPH
• Ongoing proof-of-concept work by Paul Groth, Sujit Pal and Ron Daniel of Elsevier Labs
• Unsupervised, scalable and built with off-the-shelf technologies
• Based on recent work at University College London and University of Massachusetts Amherst
  • Riedel, Sebastian, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. "Relation extraction with matrix factorization and universal schemas." (2013).
[Figure: pipeline diagram. Labels: 14M articles from Science Direct; 3.3M triples; 475M triples; 49M triples; p x r matrix; p x k, k x r latent factor matrices; ~102 triples; 920K concepts from EMMeT]
ANNOTATION
• http://www.slideshare.net/SparkSummit/dictionary-based-annotation-at-scale-with-spark-by-sujit-pal
• What is the problem?
  • Annotate millions of documents from different corpora.
  • 14M docs from Science Direct alone.
  • More from other corpora, dependency parsing, etc.
  • Critical step for Machine Reading and Knowledge Graph applications.
• Why is this such a big deal?
  • Takes advantage of existing linked data.
  • No model training for multiple complex STM domains.
  • However, simple until done at scale.
DICTIONARY BASED NE ANNOTATOR (SODA)
• Part of Document Annotation Pipeline.
• Annotates text with Named Entities from external Dictionaries.
• Why do we have to scale: (Wikipedia KBs) – 8 million entities
• Built with Open Source Components
  • Apache Solr – Highly reliable, scalable and fault-tolerant search index.
  • SolrTextTagger – Solr component for text tagging, uses Lucene FST technology.
  • Apache OpenNLP – Machine Learning based toolkit for processing Natural Language Text.
  • Apache Spark – Lightning fast, large scale data processing.
• Uses ideas from other Open Source libraries
  • FuzzyWuzzy – Fuzzy String Matching like a boss.
• Contributed back to Open Source
  • https://github.com/elsevierlabs-os/soda
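The idea of dictionary-based annotation can be illustrated with a minimal in-memory sketch. It is not SoDA's API and does not use Solr or SolrTextTagger; it is a plain-Python greedy longest-match tagger, and the entity IDs and dictionary entries are made up for the example.

```python
import re


def tokenize(text):
    """Split text into (lowercased token, start offset, end offset) triples."""
    return [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def build_dictionary(entries):
    """Map each tokenized surface form to its entity ID."""
    return {
        tuple(token for token, _, _ in tokenize(name)): entity_id
        for name, entity_id in entries
    }


def annotate(text, dictionary):
    """Return (start, end, matched text, entity ID) for greedy longest matches."""
    tokens = tokenize(text)
    max_len = max((len(key) for key in dictionary), default=0)
    annotations = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(token for token, _, _ in tokens[i:i + n])
            if key in dictionary:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                annotations.append((start, end, text[start:end], dictionary[key]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return annotations


# Hypothetical dictionary: surface form -> entity ID
dictionary = build_dictionary([
    ("glaucoma", "E1"),
    ("visual field loss", "E2"),
    ("visual field", "E3"),
])
annotations = annotate("Glaucoma is a leading cause of visual field loss.", dictionary)
for annotation in annotations:
    print(annotation)
```

Exact lookups like this stop being practical with millions of entries and documents, which is where an FST-backed index and a distributed runner come in.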
MATRIX CONSTRUCTION: GLAUCOMA
p = 83
r = 176
83 x 176 sparse binary-valued matrix with 366 entries
[Figure: matrix diagram. Labels: entity pairs; surface form relations; structured relations]
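A minimal sketch of how such a matrix could be assembled from triples, assuming (as in the universal-schema setup of Riedel et al.) one row per entity pair and one column per relation. The triples reuse phrases from this deck as toy input, and the `structured:has_risk_factor` relation name is hypothetical.

```python
def build_matrix(triples):
    """Build a sparse binary matrix from (subject, relation, object) triples.

    Rows are entity pairs, columns are relations; only observed cells are stored.
    """
    pairs = sorted({(s, o) for s, _, o in triples})
    relations = sorted({rel for _, rel, _ in triples})
    row = {pair: i for i, pair in enumerate(pairs)}
    col = {rel: j for j, rel in enumerate(relations)}
    cells = {(row[(s, o)], col[rel]) for s, rel, o in triples}
    return pairs, relations, cells


# Toy input; the last relation name is made up to stand in for a structured relation
triples = [
    ("glaucoma", "contributed to", "functional visual field loss"),
    ("glaucoma", "remains the second leading cause of", "functional visual field loss"),
    ("glaucoma", "can appear soon in", "age over 40"),
    ("glaucoma", "structured:has_risk_factor", "age over 40"),
]

pairs, relations, cells = build_matrix(triples)
print(f"{len(pairs)} x {len(relations)} matrix with {len(cells)} entries")
```

Storing only the set of observed cells keeps the representation sparse: the slide's matrix has 366 observed entries out of 83 x 176 possible cells.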
MATRIX COMPLETION: GLAUCOMA
p = 83, r = 176
Latent factor matrix (p x k) × latent factor matrix (k x r) = 83 x 176 real-valued matrix with 14,608 entries
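The completion step can be sketched as a small logistic matrix factorization trained with stochastic gradient descent. This is a simplified stand-in, not the authors' pipeline: it treats every unobserved cell as a negative example, whereas Riedel et al. train with a ranking objective. The toy matrix, the rank k, and the hyperparameters are all assumptions.

```python
import math
import random


def sigmoid(x):
    """Logistic function, clamped to avoid overflow in math.exp."""
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def factorize(cells, p, r, k=2, epochs=200, lr=0.1, seed=0):
    """Learn a p x k matrix P and a k x r matrix R from the observed cells."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(p)]
    R = [[rng.uniform(-0.1, 0.1) for _ in range(r)] for _ in range(k)]
    for _ in range(epochs):
        for i in range(p):
            for j in range(r):
                target = 1.0 if (i, j) in cells else 0.0
                pred = sigmoid(sum(P[i][f] * R[f][j] for f in range(k)))
                err = target - pred  # gradient of the log-likelihood w.r.t. the logit
                for f in range(k):
                    p_if, r_fj = P[i][f], R[f][j]
                    P[i][f] += lr * err * r_fj
                    R[f][j] += lr * err * p_if
    return P, R


def complete(P, R):
    """Multiply the factors into a dense p x r matrix of scores in (0, 1)."""
    k = len(R)
    return [
        [sigmoid(sum(P[i][f] * R[f][j] for f in range(k))) for j in range(len(R[0]))]
        for i in range(len(P))
    ]


# Toy 3 x 4 binary matrix given as its set of observed cells
cells = {(0, 0), (0, 1), (1, 1), (1, 2), (2, 3)}
P, R = factorize(cells, p=3, r=4, k=2)
scores = complete(P, R)
print(sum(len(row) for row in scores), "real-valued entries")
```

The product of the two factors is dense, which is why the 83 x 176 completed matrix on the slide has 83 × 176 = 14,608 real-valued entries while the input had only 366.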
PREDICTED RELATIONS: GLAUCOMA
• At threshold = 0.08
  • 22 unseen relations
  • F1 = 0.71
• Applications beyond knowledge graph construction
• Taxonomy and ontology maintenance
• Entity search in task-specific and/or mobile context
• Question answering
glaucoma developed many years after chronic inflammation of uveal tract
glaucoma develop following chronic inflammation of uveal tract
glaucoma can appear soon in family history of glaucoma
glaucoma can appear soon in age over 40
glaucoma the risk of functional visual field loss
glaucoma contributing causes of functional visual field loss
glaucoma contributed to functional visual field loss
glaucoma is considered the second leading cause of functional visual field loss
glaucoma remains the second leading cause of functional visual field loss
This is a unique entity not a string
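The thresholding and scoring behind the predicted-relations slide can be sketched as follows. The deck does not say how the gold set was built or whether the threshold comparison is inclusive, so both are assumptions here, and the scores and gold cells are toy values.

```python
def predict_unseen(scores, observed, threshold):
    """Cells scoring at or above the threshold that were not in the input matrix."""
    return {
        (i, j)
        for i, row in enumerate(scores)
        for j, score in enumerate(row)
        if score >= threshold and (i, j) not in observed
    }


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over sets of cells."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy completed matrix, observed cells, and a hypothetical gold set of unseen relations
toy_scores = [
    [0.90, 0.10, 0.02],
    [0.05, 0.80, 0.30],
]
observed = {(0, 0), (1, 1)}
gold = {(1, 2), (1, 0)}

predicted = predict_unseen(toy_scores, observed, threshold=0.08)
print(sorted(predicted), f1_score(predicted, gold))
```

Lowering the threshold admits more unseen relations at the cost of precision, so the threshold is a tuning knob chosen against a metric such as F1.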