open-knowledge-belgium
RML.io Generating High Quality
Linked Open Data from Open or Not Data
Anastasia Dimou Data Science Lab, Ghent University - iMinds
@natadimou
What is the Semantic Web?
The Semantic Web is the extension of the World Wide Web
Are you the owner of your data, OR is it the application that hosts your data?
enables sharing content beyond the boundaries of applications & websites
Thanks to HTML, the Web is understandable & constant for humans, BUT is it for machines too?
allows machines to understand the meaning of hyperlinked information
Semantic Web enabled applications rely on data represented as Linked Data
What is Linked (Open) Data?
Linked (Open) Data
a standardized way of expressing the relationships between data
data semantically annotated with different vocabularies or ontologies that describe domain-level knowledge, understandable by humans & machines
How is Linked Data published?
Linked (Open) Data is published in the form of RDF datasets
Resource Description Framework (RDF)
is the prevalent data model for describing Linked (Open) Data
is driven by unique identifiers (URIs)
allows establishing a shared meaning
subject → predicate → object
How is Linked Data derived from (semi-)structured data?
id  firstname  surname   lab    city
1   Anastasia  Dimou     DSLab  Ghent
2   Ruben      Verborgh  DSLab  Ghent
3   Erik       Mannens   DSLab  Ghent
[Figure: the table rows as a graph]
Person 1 (“Anastasia Dimou”) works Data Science Lab
Person 2 (“Ruben Verborgh”) works Data Science Lab
Person 3 (“Erik Mannens”) works Data Science Lab
Data Science Lab located Ghent
Assign unique identifiers (URIs)
Person {id} → http://ex.com/{id}
{lab} → http://ex.com/{lab}
“{firstname} {surname}” stays a literal value
Annotate data relationships with ontologies
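The URI assignment step above can be sketched in Python. This is only an illustration under assumptions, not how any particular tool does it: the function name `expand_template` and the choice to percent-encode values are not from the slides.

```python
import re
from urllib.parse import quote

# Matches {column} placeholders in a template such as "http://ex.com/{id}".
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def expand_template(template, row, iri_safe=True):
    """Fill each {column} placeholder with the value from the row.

    With iri_safe=True the values are percent-encoded (an assumption of
    this sketch) so that spaces or slashes in the data cannot break the URI.
    """
    def fill(match):
        value = str(row[match.group(1)])
        return quote(value, safe="") if iri_safe else value

    return PLACEHOLDER.sub(fill, template)


row = {"id": 1, "firstname": "Anastasia", "surname": "Dimou",
       "lab": "DSLab", "city": "Ghent"}

person_uri = expand_template("http://ex.com/{id}", row)
lab_uri = expand_template("http://ex.com/{lab}", row)
# The name is a literal, so it is filled in without URI encoding.
full_name = expand_template("{firstname} {surname}", row, iri_safe=False)

print(person_uri, lab_uri, full_name)
```

Encoding matters as soon as a value contains a space: a lab called "Data Science Lab" would otherwise yield an invalid URI.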
ex:1 (“Anastasia Dimou”) ex:works ex:DSLab
ex:2 (“Ruben Verborgh”) ex:works ex:DSLab
ex:3 (“Erik Mannens”) ex:works ex:DSLab
ex:DSLab ex:located ex:Ghent
Sets of triples of a dataset have repetitive patterns
ex:{id} (“{firstname} {surname}”) ex:works ex:{lab}
ex:{lab} ex:located ex:{city}
RDF dataset generation tools base their implementation on repeatedly applying those patterns to the input data
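Applying such patterns repeatedly to the input can be sketched as below. It is a toy generator, not an actual tool: the pattern list, the function name `generate_triples` and the predicate `ex:name` are assumptions (the slides leave the edge to the name unlabeled).

```python
rows = [
    {"id": 1, "firstname": "Anastasia", "surname": "Dimou", "lab": "DSLab", "city": "Ghent"},
    {"id": 2, "firstname": "Ruben", "surname": "Verborgh", "lab": "DSLab", "city": "Ghent"},
    {"id": 3, "firstname": "Erik", "surname": "Mannens", "lab": "DSLab", "city": "Ghent"},
]

# Each pattern is a (subject, predicate, object) triple of format strings.
# ex:name is a hypothetical predicate chosen for this sketch.
PATTERNS = [
    ("ex:{id}", "ex:works", "ex:{lab}"),
    ("ex:{id}", "ex:name", '"{firstname} {surname}"'),
    ("ex:{lab}", "ex:located", "ex:{city}"),
]


def generate_triples(rows, patterns):
    """Apply every pattern to every row, keeping each distinct triple once."""
    seen = set()
    triples = []
    for row in rows:
        for pattern in patterns:
            triple = tuple(part.format(**row) for part in pattern)
            if triple not in seen:  # the lab triple repeats for every row
                seen.add(triple)
                triples.append(triple)
    return triples


triples = generate_triples(rows, PATTERNS)
for subject, predicate, obj in triples:
    print(subject, predicate, obj, ".")
```

The three rows share one lab, so the `ex:located` pattern produces the same triple three times and is kept only once.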
What are the different Linked Data Generation approaches?
Linked Data generation approaches
case-specific solutions OR format and source specific
[Figure: the data owner / publisher defines R2RML mappings; an R2RML processor generates RDF from the DB, while CSV, JSON and XML sources are each converted to RDF separately]
RDF Terms (focusing on IRIs) are…
generated independently, disregarding their possible prior definitions
manually replicated, by reconstructing the same URIs (if possible)
manually aligned afterwards: links with other datasets are defined after the RDF terms are published
Why not a uniform approach?
Uniform and declarative RDF generation
from heterogeneous data sources
[Figure: the data owner / publisher defines mappings; a processor generates RDF from DB, CSV, JSON, XML and RDF sources]
RDF Mapping Language (RML)
generic scalable mapping language
for generating and interlinking
RDF data from heterogeneous resources
in an integrable and interoperable fashion
superset of the W3C standardized
R2RML mapping language
http://rml.io
Uniform and declarative RDF generation
from heterogeneous data sources
[Figure: the data owner / publisher defines RML mappings; a processor generates RDF from DB, CSV, JSON, XML and RDF sources]
Defining Mappings to generate Linked Data
Retrieving Input Data
Assessing Quality
Generating Metadata
Editing Mappings
RML describes how to generate RDF from structured data
<#TriplesMap>: subject → Subject Map, predicate → Predicate Map, object → Object Map
<#LabMap>
  Subject Map: rr:template “http://ex.com/{lab}”
  Predicate Map: rr:constant ex:located
  Object Map: rr:template “http://ex.com/{city}”
<#ResearcherMap>
  Subject Map: rr:template “http://ex.com/{id}”
  Object Map: rr:template “http://ex.com/{lab}”
  Object Map: rr:template “{firstname} {surname}” rr:termType rr:Literal
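To make the roles of the term maps concrete, here is a small Python model of them. It is a sketch under assumptions rather than an RML processor: the classes `TermMap` and `TriplesMap` are invented for illustration, `ex:` is assumed to expand to `http://ex.com/`, and the predicates `works` and `name` for the researcher are assumed because the slide residue does not show them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TermMap:
    """One rule for an RDF term: a constant, or a template over the input."""
    template: Optional[str] = None  # mirrors rr:template
    constant: Optional[str] = None  # mirrors rr:constant
    literal: bool = False           # mirrors rr:termType rr:Literal

    def render(self, record):
        if self.constant is not None:
            value = self.constant
        else:
            value = self.template.format(**record)
        if self.literal:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            return f'"{escaped}"'
        return f"<{value}>"


@dataclass(frozen=True)
class TriplesMap:
    subject: TermMap
    predicate_objects: tuple  # pairs of (predicate TermMap, object TermMap)

    def apply(self, record):
        """Return one N-Triples-style line per predicate-object pair."""
        subject = self.subject.render(record)
        return [f"{subject} {p.render(record)} {o.render(record)} ."
                for p, o in self.predicate_objects]


EX = "http://ex.com/"  # assumed expansion of the ex: prefix

lab_map = TriplesMap(
    subject=TermMap(template=EX + "{lab}"),
    predicate_objects=(
        (TermMap(constant=EX + "located"), TermMap(template=EX + "{city}")),
    ),
)
researcher_map = TriplesMap(
    subject=TermMap(template=EX + "{id}"),
    predicate_objects=(
        # "works" and "name" are assumed predicates for this sketch.
        (TermMap(constant=EX + "works"), TermMap(template=EX + "{lab}")),
        (TermMap(constant=EX + "name"),
         TermMap(template="{firstname} {surname}", literal=True)),
    ),
)

record = {"id": 1, "firstname": "Anastasia", "surname": "Dimou",
          "lab": "DSLab", "city": "Ghent"}
lines = lab_map.apply(record) + researcher_map.apply(record)
print("\n".join(lines))
```

The term type decides the output form: templates render as IRIs in angle brackets unless the map is marked as a literal, in which case the value is quoted.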
RDF Mapping Language (RML)
RML Processor: Extraction Module, Mapping Module
Retrieving Input Data
RDF Mapping Language (RML)
[Figure: Triples Map → Subject Map, Predicate Object Map (Predicate Map, Object Map)]
RML describes rules to map any structured data to RDF
RML supports any data independently of
which structure and format they have
where they originally reside
how they are accessed & retrieved
Specifying data: which data form a data input, how to reference data input extracts
Accessing & Retrieving data: data input from original source(s)
RDF Mapping Language (RML)
[Figure: Triples Map → Logical Source, Subject Map, Predicate Object Map (Predicate Map, Object Map)]
Support data in Heterogeneous Structures and Formats
tabular-structured: tables in DBs or CSV files …
hierarchical-structured: JSON or XML …
(semi-)structured: HTML …
…
tabular-structured data
id  firstname  surname   lab
1   Anastasia  Dimou     DSLab
2   Ruben      Verborgh  DSLab
3   Erik       Mannens   DSLab
<#ResearcherMap>
  rr:template “http://ex.com/{id}”
  rr:template “http://ex.com/{lab}”
  rr:template “{firstname} {surname}” rr:termType rr:Literal
hierarchical-structured data
<labs>
  <lab>
    <short>MMLab</short>
    <title>Multimedia Lab</title>
    <location>
      <city>Ghent</city>
    </location>
  </lab>
  <lab> …. </lab>
  …
</labs>
<#LabMap>
  rr:template “http://ex.com/{/labs/lab/short}”
  rr:constant ex:located
  rr:template “http://ex.com/{/labs/lab/location/city}”
RDF Mapping Language (RML)
[Figure: Triples Map → Logical Source (Reference Formulation), Subject Map, Predicate Object Map (Predicate Map, Object Map)]
<#LabLogicalSource>
  Reference Formulation: ql:XPath
RDF Mapping Language (RML)
[Figure: Triples Map → Logical Source (Reference Formulation, iterator), Subject Map, Predicate Object Map (Predicate Map, Object Map)]
<#LabLogicalSource>
  Reference Formulation: ql:XPath
  iterator: “/labs/lab”
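How an iterator and a reference formulation work together can be sketched with Python's standard `xml.etree.ElementTree`. Several things here are assumptions: the second `<lab>` record is made up, the helpers `iterate` and `expand` are invented names, and the references inside the templates are written relative to each iteration (the slides write them as absolute paths). ElementTree supports only a subset of XPath, so this stands in for a full XPath engine.

```python
import re
import xml.etree.ElementTree as ET

# The first <lab> follows the slides; the second one is made up for the demo.
XML = """<labs>
  <lab>
    <short>MMLab</short>
    <title>Multimedia Lab</title>
    <location><city>Ghent</city></location>
  </lab>
  <lab>
    <short>DSLab</short>
    <title>Data Science Lab</title>
    <location><city>Ghent</city></location>
  </lab>
</labs>"""

PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def iterate(root, iterator):
    """Evaluate a simple absolute path such as "/labs/lab".

    ElementTree does not accept absolute paths on elements, so the first
    step is compared with the root tag by hand and the rest is delegated.
    """
    steps = iterator.strip("/").split("/")
    if steps[0] != root.tag:
        return []
    rest = "/".join(steps[1:])
    return root.findall(rest) if rest else [root]


def expand(template, node):
    """Replace each {path} with the text found at that path below node."""
    return PLACEHOLDER.sub(
        lambda match: node.findtext(match.group(1), default=""), template)


root = ET.fromstring(XML)
triples = [
    (expand("http://ex.com/{short}", lab),
     "ex:located",
     expand("http://ex.com/{location/city}", lab))
    for lab in iterate(root, "/labs/lab")
]
for triple in triples:
    print(*triple)
```

The iterator decides how often the Triples Map fires (once per `<lab>`), while the references pick values within each iteration.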
Accessing & Retrieving data: data input from original source(s)
[Figure: RML Processor. Data sources are reached through access interfaces; a Retrieval module, guided by source descriptions, passes the input data to the Mapping module, which applies the map doc and outputs RDF]
Support different Locations and Access Interfaces
Local File(s)
Database connectivity: D2RQ
Web source(s) (Web API/service): DCAT, CSVW, Hydra, VoID (Dataset)
RDF source(s): VoID (Endpoint), SPARQL-SD
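One way to picture the separation between retrieval and mapping is a small dispatch table keyed by access interface. Everything named here is hypothetical: the dictionary keys (`access`, `path`), the function names, and the file `researchers.csv`. Real source descriptions would use the vocabularies listed above, and only the local-file case is implemented in this sketch.

```python
import csv
import io
import os
import tempfile
from pathlib import Path


def retrieve_local_file(description):
    """Read the whole file named by the (hypothetical) "path" key."""
    return Path(description["path"]).read_text(encoding="utf-8")


def retrieve_unsupported(description):
    """Placeholder for access interfaces this sketch does not implement."""
    raise NotImplementedError(
        f"no retriever for {description['access']!r} in this sketch")


# Retrieval module: one retriever per access interface.
RETRIEVERS = {
    "local-file": retrieve_local_file,
    "web-api": retrieve_unsupported,
    "database": retrieve_unsupported,
    "sparql-endpoint": retrieve_unsupported,
}


def retrieve(description):
    """Pick the retriever that matches the source description."""
    retriever = RETRIEVERS.get(description["access"])
    if retriever is None:
        raise ValueError(f"unknown access interface: {description['access']!r}")
    return retriever(description)


def rows_from_csv(text):
    """What the mapping side sees: plain records, whatever the origin."""
    return list(csv.DictReader(io.StringIO(text)))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "researchers.csv")
    Path(path).write_text(
        "id,firstname,surname,lab\n1,Anastasia,Dimou,DSLab\n", encoding="utf-8")
    rows = rows_from_csv(retrieve({"access": "local-file", "path": path}))

print(rows)
```

The mapping side only receives records, so adding a new access interface means registering one more retriever without touching the mapping rules.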
RDF Mapping Language (RML)
[Figure: Triples Map → Logical Source (Reference Formulation, iterator, Source), Subject Map, Predicate Object Map (Predicate Map, Object Map)]
[Figure: RML Processor. Sources: file.xml; Web API (DCAT); data repo, Web API (Hydra); database, JDBC (D2RQ); triple store, SPARQL. The Retrieval module uses the source descriptions to fetch XML, JSON and tabular data; the Mapping module applies the map doc and outputs RDF]
Assessing Quality
http://example.com/Giddeon_Massie (dbo:Event) dbo:birthDate "1981-08-27" xsd:gYear
http://example.com/Brick_Bronsky (dbo:Event) dbo:birthDate "1964" xsd:gYear
http://example.com/Steve_Meilinger (dbo:Event) dbo:birthDate "1930-12-12" xsd:gYear
http://example.com/Chuck_Bednarik (dbo:Event) dbo:birthDate "1925-05-01" xsd:gYear
http://example.com/Matt_McBride (dbo:Event) dbo:birthDate "1985-05-23" xsd:gYear
dbo:birthDate range xsd:date
dbo:birthDate domain dbo:Person
http://example.com/Chuck_Bednarik (dbo:Event) dbo:birthDate "1925-05-01" xsd:gYear
Violations
Most frequent violations are related to how vocabularies or ontologies are applied to the data
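The two violations in the Chuck_Bednarik example can be found with a very small check. This is a plain-Python stand-in for illustration only: it does not use RDFUnit or SPARQL, it ignores subclass reasoning, and the names `ONTOLOGY`, `DATASET` and `check_statement` are invented for the sketch.

```python
# Schema axioms from the slide: domain and range of dbo:birthDate.
ONTOLOGY = {
    "dbo:birthDate": {"domain": "dbo:Person", "range": "xsd:date"},
}

# Each statement: (subject, type of subject, predicate, value, datatype).
DATASET = [
    ("http://example.com/Chuck_Bednarik", "dbo:Event",
     "dbo:birthDate", "1925-05-01", "xsd:gYear"),
]


def check_statement(subject, rdf_type, predicate, value, datatype,
                    ontology=ONTOLOGY):
    """Return a list of domain/range violations for one statement.

    Simplification: types must match exactly; a real checker would also
    accept subclasses of the domain.
    """
    violations = []
    schema = ontology.get(predicate)
    if schema is None:
        return violations
    if rdf_type != schema["domain"]:
        violations.append(
            f"domain: {subject} is typed {rdf_type}, "
            f"{predicate} expects {schema['domain']}")
    if datatype != schema["range"]:
        violations.append(
            f"range: \"{value}\" has datatype {datatype}, "
            f"{predicate} expects {schema['range']}")
    return violations


violations = [v for statement in DATASET for v in check_statement(*statement)]
for violation in violations:
    print(violation)
```

Run over a whole dataset, the same two messages would repeat for every resource generated by the same rule, which is the repetition the later slides exploit.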
RDF DQA with RDFUnit
test-driven data-debugging framework
based on SPARQL-patterns
http://rdfunit.aksw.org
DQA: Dataset Quality Assessment
Adjustments to the dataset
are applied manually, but rarely
are not applied at the root (hard to identify)
are overwritten if a new version of the original data is mapped & published
[Figure: DQA reporting violations]
Instead of applying Quality Assessment to the already published RDF dataset
as part of data consumption
Apply Quality Assessment to the Mappings that generate the RDF dataset
MQA: Mapping Quality Assessment
discover violations before they are even generated
specify the origin of the violation
easily apply structural adjustments to the mappings
sets of triples of a dataset have repetitive patterns
http://example.com/{Name}_{Surname} (dbo:Event) dbo:birthDate “Birth” xsd:gYear
Mapping languages formalize patterns into rules to generate the RDF dataset from the original data
MQA with RDFUnit over RML
http://example.com/Chuck_Bednarik (dbo:Person) dbo:birthDate "1925-05-01" xsd:date
DEL: <#ObjectMap> rr:datatype xsd:gYear
ADD: <#ObjectMap> rr:datatype xsd:date
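Checking the mapping instead of the generated data can be sketched as follows. The dictionary layout of `MAPPING`, the identifier `<#SubjectMap>` and the function `assess_mapping` are assumptions made for the example; `rr:class` and `rr:datatype` are the R2RML properties a real mapping would carry. The sketch does not use RDFUnit.

```python
ONTOLOGY = {
    "dbo:birthDate": {"domain": "dbo:Person", "range": "xsd:date"},
}

# A simplified view of one mapping rule (hypothetical structure).
MAPPING = {
    "subject_map": {"id": "<#SubjectMap>", "class": "dbo:Event"},
    "predicate_object_maps": [
        {"id": "<#ObjectMap>", "predicate": "dbo:birthDate",
         "datatype": "xsd:gYear"},
    ],
}


def assess_mapping(mapping, ontology):
    """Return DEL/ADD suggestions that would align the mapping with the schema."""
    edits = []

    def suggest(edit):
        if edit not in edits:  # avoid repeating the same class fix
            edits.append(edit)

    subject = mapping["subject_map"]
    for rule in mapping["predicate_object_maps"]:
        schema = ontology.get(rule["predicate"])
        if schema is None:
            continue
        if subject["class"] != schema["domain"]:
            suggest(("DEL", subject["id"], "rr:class", subject["class"]))
            suggest(("ADD", subject["id"], "rr:class", schema["domain"]))
        if rule["datatype"] != schema["range"]:
            suggest(("DEL", rule["id"], "rr:datatype", rule["datatype"]))
            suggest(("ADD", rule["id"], "rr:datatype", schema["range"]))
    return edits


edits = assess_mapping(MAPPING, ONTOLOGY)
for action, term_map, prop, value in edits:
    print(f"{action}: {term_map} {prop} {value}")
```

One rule is inspected once, however many triples it would generate, and the suggested edit points at the term map that causes the violation.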
[Figure: data and map doc feed the Mapping Processor; MDQA reports violations]
MDQA: Uniform Mapping & Dataset Quality Assessment
Dataset Vs Mapping Quality Assessment

             Dataset Quality Assessment   Mapping Quality Assessment
             size      time               size      time
DBpedia EN   62M       16h                115K      11s
DBpedia NL   21M       1.5h               53K       6s
DBpedia all                               511K      32s
* http://mappings.dbpedia.org/validation
Live update of DBpedia Mapping Quality Assessment results every night!
Generating Metadata
Metadata
manually defined by data publishers (person-agents), rather than produced by applications (software-agents)
Consider mapping rules to automatically generate self-descriptive provenance and other metadata
W3C standardized Metadata
PROV: provenance information
VoID: expressing RDF dataset metadata (general metadata, structural metadata, links between datasets)
DCAT: describing datasets in data catalogs
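As an illustration of metadata produced by software rather than by hand, the sketch below emits a few PROV statements next to a generation run. The PROV-O terms used (`prov:Entity`, `prov:Activity`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasDerivedFrom`, `prov:generatedAtTime`) are real; the function name, the `ex:` identifiers and the timestamp are made up for the example.

```python
from datetime import datetime, timezone


def provenance_triples(dataset, mapping, source, activity, generated_at):
    """Describe one generation run with PROV-O terms.

    All arguments except generated_at are prefixed names chosen by the
    caller; generated_at must be a timezone-aware datetime.
    """
    stamp = generated_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [
        (dataset, "rdf:type", "prov:Entity"),
        (activity, "rdf:type", "prov:Activity"),
        (activity, "prov:used", mapping),
        (activity, "prov:used", source),
        (dataset, "prov:wasGeneratedBy", activity),
        (dataset, "prov:wasDerivedFrom", source),
        (dataset, "prov:generatedAtTime", f'"{stamp}"^^xsd:dateTime'),
    ]


# Hypothetical identifiers and timestamp, for demonstration only.
metadata = provenance_triples(
    dataset="ex:researchers-dataset",
    mapping="ex:researchers-mapping",
    source="ex:researchers-csv",
    activity="ex:generation-run-1",
    generated_at=datetime(2016, 1, 1, 12, 0, tzinfo=timezone.utc),
)
for subject, predicate, obj in metadata:
    print(subject, predicate, obj, ".")
```

Because the processor already knows which mapping and which source it used, these statements can be written at generation time without asking the publisher.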
Editing Mappings
Semantic Web experts Vs. Data specialists
Modeling Domain Knowledge as Linked (Open) Data is not straightforward for Data Specialists
Data context is not straightforward for Semantic Web experts
Data Specialists should be able to specify the mappings, modify and extend them at any time
Approaches for Editing Mappings
RML Editor
http://rml.io/RMLeditor
The five stars of the Linked Open Data scheme should not be approached as a set of consecutive steps
Well-considered policy regarding mapping and interlinking of data in the context of a certain knowledge domain