DBPEDIA INSIDEOUT: AN INTRODUCTION TO THE MAJOR HUB FOR LINKED OPEN DATA Cristina Pattuelli, Pratt Institute March 16, 2015

“DBpedia is the Semantic Web mirror of Wikipedia”

WHAT IT IS

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web in the form of Linked Open Data.

Source: http://lod-cloud.net/

THE STATE OF THE LOD CLOUD 2014

2011: 295 DATASETS 2014: 570 DATASETS (+93%)

Source: blog.classora.com/2012/10/10/describiendo-el-conocimiento-en-un-formato-estandar-para-la-web-semantica-rdf/

Connected with other Linked Datasets by 50 million RDF links

Most widely used linking predicates: owl:sameAs, rdfs:seeAlso, foaf:knows

CENTRAL INTERLINKING HUB OF THE WEB OF DATA

Web of Data Browsing and Crawling
Web Data Integration and Mashups

“Which albums did Miles Davis record with female instrumentalists?”
“Which populated places in Australia are below sea level?”
“What did Andy Warhol and Thelonious Monk have in common?”

PEAN TO DBPEDIA

Multi-domain
Automatically evolving
Community consensus driven
Multilingual: >125 language editions
Accessible on the Web

DBPEDIA SEMANTICS

4.58 million “things” 583 million “facts”

“THINGS”

Each thing in the DBpedia dataset is identified by a URI of the form http://dbpedia.org/resource/Name. Name is derived from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name.

http://dbpedia.org/page/Billie_Holiday

Dereferencing the URI DBpedia: Billie Holiday’s Green Page

http://en.wikipedia.org/wiki/Billie_Holiday

http://dbpedia.org/resource/Billie_Holiday
http://en.wikipedia.org/wiki/Billie_Holiday
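The naming rule above can be sketched in a few lines of Python. This is a minimal illustration, not DBpedia's own extraction code: the function name `wikipedia_to_dbpedia` is hypothetical, and it assumes the article name can be copied over unchanged, which ignores any character-encoding differences between the two sites.

```python
from urllib.parse import urlparse

WIKIPEDIA_PREFIX = "/wiki/"
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"  # identifies the thing
DBPEDIA_PAGE = "http://dbpedia.org/page/"          # HTML view of the thing


def wikipedia_to_dbpedia(wikipedia_url: str) -> dict:
    """Derive the DBpedia resource URI and page URL from a Wikipedia article URL.

    Hypothetical helper: assumes the article name is reused verbatim.
    """
    path = urlparse(wikipedia_url).path
    if not path.startswith(WIKIPEDIA_PREFIX):
        raise ValueError(f"not a Wikipedia article URL: {wikipedia_url}")
    name = path[len(WIKIPEDIA_PREFIX):]
    if not name:
        raise ValueError(f"no article name in URL: {wikipedia_url}")
    return {"resource": DBPEDIA_RESOURCE + name, "page": DBPEDIA_PAGE + name}


uris = wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/Billie_Holiday")
print(uris["resource"])
print(uris["page"])
```

The `resource` URI names the entity itself, while the `page` URL is the human-readable view a browser is sent to when the resource URI is dereferenced.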

“Facts” as RDF Triples

[Diagram: Subject (Thing) - Predicate - Object, with “Billie Holiday” as the subject and “has name” as the predicate]

GENERATING FACTS FOR THE ENTITY BILLIE HOLIDAY

Subject (S) Predicate (P) Object (O)

S <http://dbpedia.org/resource/Billie_Holiday>
P <http://xmlns.com/foaf/0.1/name>
O "Billie Holiday"

S <http://dbpedia.org/resource/Billie_Holiday>
P <http://dbpedia.org/ontology/alias>
O "Lady Day"

S <http://dbpedia.org/resource/Billie_Holiday>
P <http://dbpedia.org/ontology/occupation>
O <http://dbpedia.org/resource/Songwriter>
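As a rough sketch of how such facts can be written down in machine-readable form, the Python below serializes the three Billie Holiday statements as N-Triples lines. The `Triple` class and `to_ntriples` function are illustrative names, not part of any DBpedia tooling, and the sketch assumes that the dbpedia-owl prefix abbreviates http://dbpedia.org/ontology/ and that the occupation object is a /resource/ URI.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str          # URI of the thing being described
    predicate: str        # URI of the property
    obj: str              # URI or literal value
    obj_is_literal: bool  # True for plain strings such as names


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as an N-Triples line (plain string literals only)."""
    if t.obj_is_literal:
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


BILLIE = "http://dbpedia.org/resource/Billie_Holiday"
DBO = "http://dbpedia.org/ontology/"  # assumed expansion of dbpedia-owl:

facts = [
    Triple(BILLIE, "http://xmlns.com/foaf/0.1/name", "Billie Holiday", True),
    Triple(BILLIE, DBO + "alias", "Lady Day", True),
    Triple(BILLIE, DBO + "occupation",
           "http://dbpedia.org/resource/Songwriter", False),
]

lines = [to_ntriples(t) for t in facts]
for line in lines:
    print(line)
```

Literal objects are quoted and URI objects are wrapped in angle brackets, which is the only structural difference between the name/alias facts and the occupation fact.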

CHARTING DBPEDIA

Extraction Mapping Categorization

HARVESTING FACTS

Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates, categorization information, images, geo-coordinates, and links to external Web pages.

DBPEDIA COMPONENTS

Source: http://wiki.dbpedia.org/PHPframework

Extractors turn a specific type of wiki markup into triples.

The core of DBpedia consists of an infobox extraction process. Infoboxes are templates contained in many Wikipedia articles. They are usually displayed in the top right corner of articles and contain factual information.

Infobox for MusicalArtist

INFOBOX EXTRACTION

Raw Infobox Extraction: creates triples directly from the infobox data.
Mapping-based Infobox Extraction: maps infobox data to the DBpedia Ontology.

RAW INFOBOX EXTRACTION

Generic
Algorithm-based
Retains property names used in the infobox
Properties are identified by the dbpprop prefix.

MAPPING-BASED INFOBOX EXTRACTION

Mapping of infobox data to community-curated DBpedia Ontology. Properties are identified by the dbpedia-owl prefix.

RAW INFOBOX EXTRACTION

Pros: Complete coverage of all the infobox attributes (not all the infoboxes have been mapped yet).
Cons: Lower data quality (synonyms are not resolved, e.g., placeOfBirth/birthPlace; high error rate in determining the datatype of an attribute value).

MAPPING-BASED INFOBOX EXTRACTION

Pros: Data is cleaner (typing resources, merging name variants, assigning specific datatypes to the values).
Cons: Not full coverage.
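The trade-off between the two approaches can be illustrated with a toy example. The sketch below is not the DBpedia extraction framework: the infobox contents, the mapping table, and both function names are invented for illustration. Only the dbpprop and dbpedia-owl prefixes come from the slides.

```python
RAW_PREFIX = "dbpprop:"         # raw extraction keeps infobox property names
MAPPED_PREFIX = "dbpedia-owl:"  # mapping-based extraction uses ontology names

# Hypothetical mapping table: infobox attribute -> ontology property.
# Two spellings of the same attribute are merged into one property.
MUSICAL_ARTIST_MAPPING = {
    "place_of_birth": "birthPlace",
    "birth_place": "birthPlace",
    "occupation": "occupation",
}


def raw_extract(subject, infobox):
    """One triple per infobox attribute, property name kept as-is."""
    return [(subject, RAW_PREFIX + key, value) for key, value in infobox.items()]


def mapped_extract(subject, infobox, mapping):
    """Triples only for attributes that have a mapping; the rest are dropped."""
    return [
        (subject, MAPPED_PREFIX + mapping[key], value)
        for key, value in infobox.items()
        if key in mapping
    ]


# Invented infobox content for illustration only.
infobox = {
    "place_of_birth": "Philadelphia",
    "occupation": "Singer",
    "unmapped_attribute": "some value",
}

raw = raw_extract("dbpedia:Billie_Holiday", infobox)
mapped = mapped_extract("dbpedia:Billie_Holiday", infobox, MUSICAL_ARTIST_MAPPING)
print(len(raw), "raw triples;", len(mapped), "mapped triples")
```

The raw pass keeps every attribute, including the unmapped one, but leaves variant names unresolved. The mapped pass normalizes place_of_birth to birthPlace and silently loses whatever has no mapping, which is the coverage gap noted above.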

Of 4.58 million things, 4.22 million are classified in a consistent ontology.

Normalization of variant names

THE DBPEDIA ONTOLOGY

Cross-domain ontology
Large thematic coverage
Currently covers 685 classes, which form a subsumption hierarchy, and 2,795 different properties describing the classes (e.g., aircraftHelicopterAttack)
Shallow (≤ 5 levels)

THE DBPEDIA ONTOLOGY

Because the DBpedia Ontology is built upon infobox templates, its semantic structure suffers from a lack of logical consistency and presents significant semantic gaps in the hierarchy.

http://mappings.dbpedia.org/server/ontology/classes/

THE DOMAIN OF MUSIC IN THE DBPEDIA ONTOLOGY

Hierarchy is kept shallow for the sake of visualization and navigation.
http://dbpedia.org/ontology/MusicalArtist

CATEGORIZING DBPEDIA

WIKIPEDIA CATEGORY SYSTEM

Wikipedia uses categories to group articles that share similar subjects.
Wikipedia categories are constantly evolving and currently number more than 740,000.
80.9 million links to Wikipedia categories.

WIKIPEDIA CATEGORY SYSTEM

Most categories are assigned manually by Wikipedia contributors and can be found listed as links at the bottom of a Wikipedia article.

CATEGORIZING PEOPLE

At least four categories:
• the year the person was born
• the year they died
• their nationality
• their reason for being notable.

CATEGORIZATION OF PEOPLE

First sentence of an article: Billie Holiday (born Eleanora Fagan; April 7, 1915 – July 17, 1959) was an American jazz singer and songwriter.

Year born: Category:1915 births
Year died: Category:1959 deaths
Nationality: Category:American people
Reason for notability / Occupation: Category:Musicians

WIKIPEDIA CATEGORY SYSTEM

Collaborative effort
Advantages → categories are continually updated to correspond with article content.
Dis/advantages → lack of consistency in its hierarchical structure and “rather loose relatedness between articles” (Bizer et al., 2009).
“Messy hierarchy”

RE-CATEGORIZATION OF BILLIE HOLIDAY

(External links: re-categorisation per Wikipedia:Categories for discussion/Log/2014 December 26, replaced: Category:American women composers → Category:American female composers)

(Robot - Moving category African-American female musicians to Category:African-American musicians per CFD at Wikipedia:Categories for discussion/Log/2013 January 10.)

WIKIPEDIA ONTOLOGY IN DBPEDIA

The hierarchical structure of the categories is represented in DBpedia by way of two different properties:
dcterms:subject (relates an entity to a category)
skos:broader (relates a child category to its parent category)
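To make the role of the two properties concrete, here is a small Python sketch that walks from an entity to its categories and then up the hierarchy. The triples are invented sample data, not actual DBpedia content, and `categories_of` is a hypothetical helper. Because the category graph is described as messy, the walk keeps track of visited categories so that a loop cannot trap it.

```python
from collections import deque

SUBJECT = "dcterms:subject"  # entity -> category
BROADER = "skos:broader"     # child category -> parent category

# Invented sample triples for illustration (including a deliberate loop).
TRIPLES = [
    ("dbpedia:Billie_Holiday", SUBJECT, "category:American_jazz_singers"),
    ("category:American_jazz_singers", BROADER, "category:Jazz_singers"),
    ("category:Jazz_singers", BROADER, "category:Jazz_musicians"),
    ("category:Jazz_musicians", BROADER, "category:Jazz_singers"),
]


def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def categories_of(triples, entity):
    """Direct categories of an entity plus all broader ones, nearest first."""
    queue = deque(objects(triples, entity, SUBJECT))
    seen = []
    while queue:
        category = queue.popleft()
        if category in seen:  # guard against loops in the hierarchy
            continue
        seen.append(category)
        queue.extend(objects(triples, category, BROADER))
    return seen


found = categories_of(TRIPLES, "dbpedia:Billie_Holiday")
print(found)
```

Only the first hop uses dcterms:subject; every later hop follows skos:broader, which mirrors the division of labor between the two properties.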

http://ensiwiki.ensimag.fr/images/f/fa/Dbpedia-relation-discovery-demo.pdf

The hierarchy of categories between “flower” and “cucumber”

CATEGORY:JAZZ_MUSICIANS

http://dbpedia.org/page/Category:Jazz_musicians  

YAGO ONTOLOGY

A robust classification scheme with a deep hierarchical structure. Originally derived from the Wikipedia category system using the semantic lexicon WordNet.

Over 350,000 classes; 100 relationships
Provides DBpedia data with coherence and structural consistency
A taxonomic backbone

QUERYING DBPEDIA FOR LINKED JAZZ

Jazz Name Vocabulary
Personal name vocabulary in the form of RDF statements, each pairing the artist’s name with a Uniform Resource Identifier (URI).

<http://dbpedia.org/resource/Billie_Holiday> <http://xmlns.com/foaf/0.1/name> "Billie Holiday"

QUERYING DBPEDIA FOR LINKED JAZZ

DBpedia was initially queried for literal triples with a foaf:name predicate that satisfied the following criteria:
1. the entity must have an rdf:type of dbpedia-owl:MusicalArtist
2. the entity must have the dbpedia:genre property with the value dbpedia:Jazz.

+ rdfs:label → name of the resource
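A query along these lines might look like the SPARQL text assembled below. This is a sketch, not the query the Linked Jazz project actually ran: the function name `build_name_query` is hypothetical, the genre property is assumed to be dbpedia-owl:genre, and the rdfs:label lookup is made optional as one possible reading of the last line above. The code only builds the query string and does not contact any endpoint.

```python
PREFIXES = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
"""


def build_name_query(genre: str = "Jazz") -> str:
    """Build a SPARQL query for names of musical artists of one genre.

    Assumption: the genre link is expressed with dbpedia-owl:genre.
    """
    return PREFIXES + f"""SELECT DISTINCT ?artist ?name ?label WHERE {{
  ?artist rdf:type dbpedia-owl:MusicalArtist ;
          dbpedia-owl:genre dbpedia:{genre} ;
          foaf:name ?name .
  OPTIONAL {{ ?artist rdfs:label ?label }}
}}"""


query = build_name_query()
print(query)
```

The two criteria appear as the first two triple patterns; foaf:name supplies the literal, and rdfs:label is fetched alongside it as an additional name for the resource.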

QUERYING DBPEDIA FOR LINKED JAZZ

Prominent musicians whom we expected to find by querying for the dbpedia:Jazz genre were not returned.
Example: “Count Basie”
- fell under dbpedia:Swing_music, dbpedia:Big_band_music and dbpedia:Piano_blues
- not under dbpedia:Jazz
This required us to revise our query method by expanding it to include additional relevant music genres.
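The effect of widening the genre list can be shown with a small in-memory example. The Count Basie genres are the ones named above; the second artist, the function name `match_artists`, and the exact contents of the expanded genre set are assumptions made for illustration.

```python
# Genres per artist: Count Basie's come from the example above,
# the second entry is a made-up control case.
ARTIST_GENRES = {
    "Count Basie": {
        "dbpedia:Swing_music",
        "dbpedia:Big_band_music",
        "dbpedia:Piano_blues",
    },
    "Hypothetical Artist": {"dbpedia:Jazz"},
}


def match_artists(artist_genres, accepted_genres):
    """Names of artists having at least one genre in the accepted set."""
    return sorted(
        name
        for name, genres in artist_genres.items()
        if genres & accepted_genres
    )


narrow = {"dbpedia:Jazz"}
# Assumed expansion: the related genres named in the Count Basie example.
expanded = narrow | {
    "dbpedia:Swing_music",
    "dbpedia:Big_band_music",
    "dbpedia:Piano_blues",
}

narrow_hits = match_artists(ARTIST_GENRES, narrow)
expanded_hits = match_artists(ARTIST_GENRES, expanded)
print("narrow:", narrow_hits)
print("expanded:", expanded_hits)
```

With the narrow set, Count Basie is missed even though he is an obvious candidate; with the expanded set he is returned alongside the artist tagged directly with dbpedia:Jazz.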

Name Extraction from DBpedia

Bootstrapping & Querying

IN SUM

New type of knowledge representation environment:
- constant state of flux
- decentralized interplay of different descriptive and classification systems
- it challenges our tolerance threshold for data quality and our traditional notion of authority control.

http://dbpedia.org/page/Billie_Holiday

LodLive

Visualizing DBpedia

Thank You!

@cristinapattuel