DB Group @ UNIMO Dot. Fabio Benedetti Dip. Ing. “Enzo Ferrari” – University of Modena e Reggio Emilia 1 D Day 2015 – Modena Italy LODeX: Schema Summarization and automatic SPARQL query generation for Linked Open Data sources Fabio Benedetti Department of Engineering “Enzo Ferrari” University of Modena & Reggio Emilia D-Day 2015 - Modena

LODeX: Schema Summarization and automatic SPARQL query generation for Linked Open Data sources

Fabio Benedetti

Department of Engineering “Enzo Ferrari”

University of Modena & Reggio Emilia

D-Day 2015 - Modena

[Schmachtenberg, Max, Christian Bizer, and Heiko Paulheim. "Adoption of the Linked Data Best Practices in Different Topical Domains." The Semantic Web – ISWC 2014. Springer International Publishing, 2014. 245-260]

Domain                  2009 Number  2009 %   2014* Number  2014* %
Cross-domain            41           13.95%   41            4.04%
Geographic              31           10.54%   21            2.07%
Government              49           16.67%   183           18.05%
Life sciences           41           13.95%   83            8.19%
Media                   25           8.50%    22            2.17%
Publications            87           29.59%   96            9.47%
Social web              0            0.00%    520           51.28%
User-generated content  20           6.80%    48            4.73%
Total                   294                   1014

*Only 570 of the 2014 datasets belong to the LOD cloud; the remaining datasets have no incoming or outgoing links to the LOD cloud.

[Figure: distribution of datasets by domain, 2009 and 2014]

The Open Access trend encourages the publication of Open Data in the form of Linked Data.

But discovering LOD sources of interest is a complex task for a user.

Main issues

• There is no standard for documenting a dataset

• The structure of a dataset can be understood only by exploring it manually

• Semantic Web technologies are extremely complex for unskilled users

• Automatically extract and summarize a schema (Schema Summary) able to describe a LOD dataset

• Use the Schema Summary to support the user in the information extraction task

Online & automatic extraction

• It does not require any additional information from the user

• It works with SPARQL endpoints

– We have to handle the performance issues of these endpoints

The Schema Summary has to describe a dataset

• Ontology/Vocabulary (OWL & RDFS constraints)

• Open Data (i.e. generated from an existing RDBMS)
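
Because the extraction works directly against SPARQL endpoints with unreliable performance, every request needs a timeout and a failure path. The following is a minimal sketch of that idea using only Python's standard library and an HTTP GET as defined by the SPARQL 1.1 Protocol. The function names `build_request` and `run_query` and the default timeout are assumptions made for illustration; they are not taken from LODeX.

```python
import json
import urllib.parse
import urllib.request


def build_request(endpoint_url, query):
    """Build an HTTP GET request following the SPARQL 1.1 Protocol."""
    params = urllib.parse.urlencode({"query": query})
    separator = "&" if "?" in endpoint_url else "?"
    return urllib.request.Request(
        endpoint_url + separator + params,
        headers={"Accept": "application/sparql-results+json"},
    )


def run_query(endpoint_url, query, timeout_s=30.0):
    """Return the result bindings, or None if the endpoint fails or times out."""
    request = build_request(endpoint_url, query)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError subclasses;
        # ValueError covers responses that are not valid JSON.
        return None
    return payload.get("results", {}).get("bindings", [])
```

Returning `None` instead of raising lets a caller skip an unresponsive endpoint and continue with the next one, which is one plausible way to cope with slow sources.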

Two main modules

• Extraction & Summarization

• Visualization & Querying

LODeX uses a NoSQL database as back-end

Input: URLs of SPARQL endpoints

Output: Interactive Schema Summary

[Architecture diagram. Labels: LOD Cloud, SPARQL Queries, LODeX Indexes Extraction, Query Orchestrator, Statistical Indexes, LODeX Post-processing, Schema Summary, NoSQL, Schema Summary Visualization, Basic Query, Results, Endpoint URLs, Sgvizler]

Statistical Indexes

They consist of 9 indexes divided into three groups:

• General group

• Intensional group

• Extensional group

The index extraction (IE) process is able to generate the SPARQL queries used to extract the different indexes.

• Iterative algorithm able to extract the intensional knowledge

• Pattern Strategy technique

– It is a technique that produces a higher number of less complex SPARQL queries

The IE process is able to perform online index extraction while handling the performance issues of the SPARQL endpoints

[F. Benedetti, S. Bergamaschi, and L. Po, “Online index extraction from linked open data sources,” 2014, Linked Data for Information Extraction (LD4IE) Workshop held at International Semantic Web Conference]
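
The trade-off behind producing more but simpler queries can be illustrated with a small sketch. It assumes, purely for illustration, that the quantity to extract is the number of instances per class. The actual index definitions and query patterns used by LODeX are not given in these slides, and all function names below are hypothetical.

```python
def single_complex_query():
    """One aggregate query over the whole dataset (costly on slow endpoints)."""
    return (
        "SELECT ?class (COUNT(?s) AS ?instances) "
        "WHERE { ?s a ?class } GROUP BY ?class"
    )


def list_classes_query():
    """Cheaper first step: only enumerate the distinct classes."""
    return "SELECT DISTINCT ?class WHERE { ?s a ?class }"


def count_instances_query(class_iri):
    """Cheaper follow-up: count the instances of one known class."""
    return f"SELECT (COUNT(?s) AS ?instances) WHERE {{ ?s a <{class_iri}> }}"


def split_into_simple_queries(class_iris):
    """Replace the single aggregate query with one small query per class."""
    return {iri: count_instances_query(iri) for iri in class_iris}


if __name__ == "__main__":
    classes = [
        "http://xmlns.com/foaf/0.1/Person",
        "http://xmlns.com/foaf/0.1/Document",
    ]
    for iri, query in split_into_simple_queries(classes).items():
        print(iri, "->", query)
```

Each small query can fail or time out on its own without losing the results already collected, at the cost of more round trips to the endpoint.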

The elements composing the Schema Summary are:

• Classes

• Properties

• Attributes

An algorithm combines the information contained in the Statistical Indexes to produce and store the Schema Summary

[F. Benedetti, S. Bergamaschi, and L. Po, “A visual summary for linked open data sources,” 2014, International Semantic Web Conference (Posters & Demos)]
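
The three kinds of elements suggest a graph-like structure. The sketch below assumes one possible reading: classes are nodes, properties are edges between two classes, and attributes are literal-valued predicates attached to a class. The field names and the shape of the three input tables are hypothetical and stand in for the Statistical Indexes, whose real layout is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class ClassNode:
    iri: str
    instances: int = 0
    attributes: list = field(default_factory=list)  # literal-valued predicates


@dataclass
class PropertyEdge:
    source: str  # IRI of the subject class
    predicate: str
    target: str  # IRI of the object class


@dataclass
class SchemaSummary:
    classes: dict = field(default_factory=dict)
    properties: list = field(default_factory=list)


def build_summary(class_counts, object_links, literal_links):
    """Combine three hypothetical index tables into one summary."""
    summary = SchemaSummary()
    for iri, count in class_counts.items():
        summary.classes[iri] = ClassNode(iri, count)
    for source, predicate, target in object_links:
        # Keep only edges whose endpoints are both known classes.
        if source in summary.classes and target in summary.classes:
            summary.properties.append(PropertyEdge(source, predicate, target))
    for class_iri, predicate in literal_links:
        if class_iri in summary.classes:
            summary.classes[class_iri].attributes.append(predicate)
    return summary
```

A structure of this kind maps naturally onto a document stored in a NoSQL back-end, although the storage format actually used is not specified in the slides.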

[Diagram labels: Schema Summary, Basic Query, SPARQL compiler, SPARQL query]

• Using the Web application GUI, the user is guided in building a Basic Query

• A refinement panel helps the user refine the Basic Query

A SPARQL compiler automatically generates the corresponding SPARQL query

Operators supported by the compiler:

• AND

• OPTIONAL

• FILTER

• ORDER BY

• LIMIT

• OFFSET

The query is sent to the SPARQL endpoint and the results can be visualized in a tabular, map or chart view (pie, bar, etc.)
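
The compilation step on this slide can be sketched as a function that turns a structured Basic Query into SPARQL text. The dictionary layout used here (`select`, `required`, `optional`, `filters`, `order_by`, `limit`, `offset`) is an assumption made for illustration; the internal representation used by the LODeX compiler is not shown in the slides.

```python
def compile_basic_query(basic_query):
    """Turn a hypothetical Basic Query dict into a SPARQL SELECT string."""
    variables = " ".join("?" + name for name in basic_query["select"])
    lines = [f"SELECT {variables}", "WHERE {"]
    # AND: required triple patterns are joined with "."
    for s, p, o in basic_query.get("required", []):
        lines.append(f"  {s} {p} {o} .")
    # OPTIONAL: each optional pattern gets its own block
    for s, p, o in basic_query.get("optional", []):
        lines.append(f"  OPTIONAL {{ {s} {p} {o} . }}")
    # FILTER: SPARQL expressions supplied by the refinement step
    for expression in basic_query.get("filters", []):
        lines.append(f"  FILTER ({expression})")
    lines.append("}")
    # Solution modifiers, in the order ORDER BY, LIMIT, OFFSET
    if basic_query.get("order_by"):
        lines.append("ORDER BY ?" + basic_query["order_by"])
    if basic_query.get("limit") is not None:
        lines.append(f"LIMIT {int(basic_query['limit'])}")
    if basic_query.get("offset") is not None:
        lines.append(f"OFFSET {int(basic_query['offset'])}")
    return "\n".join(lines)


example = {
    "select": ["person", "name", "age"],
    "required": [
        ("?person", "a", "<http://xmlns.com/foaf/0.1/Person>"),
        ("?person", "<http://xmlns.com/foaf/0.1/name>", "?name"),
    ],
    "optional": [("?person", "<http://xmlns.com/foaf/0.1/age>", "?age")],
    "filters": ['regex(?name, "^A")'],
    "order_by": "name",
    "limit": 10,
    "offset": 0,
}

if __name__ == "__main__":
    print(compile_basic_query(example))
```

Keeping the Basic Query as plain data means the GUI never has to produce SPARQL syntax itself; only the compiler needs to know the language.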

Try LODeX demo at: http://dbgroup.unimo.it/lodex2

[F. Benedetti, S. Bergamaschi, and L. Po, “Visual Querying LOD sources with LODeX,” 2014, submitted to the Semantic Web journal]

Test Nov. 2014

Dataset URLs             559
Reachable datasets       302
SPARQL 1.1 compatible    206
Extraction completed     185

Online survey with 17 anonymous users:

• 8 skilled users

• 9 unskilled users

The survey is divided into two parts:

• Schema Summary browsing clarity

• Query generation

Task                     Correct Answers
Schema Summary browsing  94% (32/34)
Query generation         88% (60/68)

• Modify the interface of LODeX according to the results of the online survey

• Extend the VoID descriptor vocabulary in order to represent the Statistical Indexes and publish our data as LOD

– Build an observatory for the LOD cloud

• Define clustering techniques to reduce the size of the Summary for huge datasets

Accepted papers

• D. Beneventano, S. Bergamaschi, S. Sorrentino, M. Vincini, and F. Benedetti, “Semantic annotation of the CEREALAB database by the AGROVOC linked dataset,” 2014, Ecological Informatics journal. Article in Press.

• F. Benedetti, S. Bergamaschi, and L. Po, “Online index extraction from linked open data sources,” 2014, Linked Data for Information Extraction (LD4IE) Workshop held at International Semantic Web Conference

• F. Benedetti, S. Bergamaschi, and L. Po, “A visual summary for linked open data sources,” 2014, International Semantic Web Conference (Posters & Demos)

Submitted papers

• F. Benedetti, S. Bergamaschi, and L. Po, “Visual Querying LOD sources with LODeX,” 2014, submitted to Semantic Web – Interoperability, Usability, Applicability, an IOS Press journal

European projects & schools

• Web Science Summer School - Southampton University (20-26 July 2014)

• RDA Research Data Alliance - RDA Fourth Plenary Meeting, 22-24 September 2014 in Amsterdam. I won an Early Career Scientist grant and I belong to the Big Data Analytics Interest Group.

• Keystone - COST Action IC1302. Autumn 2014 MC and WG Meetings “QUERYING THE SEMANTIC WEB”, 17-18 October 2014, Riva del Garda, TN.

Thanks for your attention!