Profiling Linked Open Data

Profiling Linked (Open) Data

Blerina Spahiu

Department of Computer Science, Systems and Communication, University of Milan - Bicocca

Supervisors: Andrea Maurino, Matteo Palmonari
Tutor: Prof. Flavio De Paoli

[email protected]


Outline

• The research background
• The research plan
• Preliminary results
• Conclusions and Future Work

University of Milan - Bicocca

Data profiling definition

- Where shall I begin, please your Majesty? - Begin at the beginning - the King said gravely.

Lewis Carroll in Alice’s Adventures in Wonderland

The process of evaluating data quality is called data profiling and typically involves gathering several aggregated data statistics which constitute the data profile.

Encyclopedia of Database Systems, June 2014



Linked Open Data Cloud


- 1014 datasets
- 188 million triples
- 7 topical categories
- 80% of linking property is owl:sameAs
- 98.22% of datasets use RDF vocabulary
- 7.85% of the datasets provide licensing information
- etc.


Linked Open Data Cloud


What types of resources are described in a data set? How are they described?

How well connected are the datasets in the LOD cloud? What is their topic/s?

Are data described as prescribed by the ontology?
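
The first of these questions can be approached with very simple counts. The following is a minimal sketch, assuming the dataset is already available as (subject, predicate, object) tuples; the toy triples, the `ex:` names and the `profile` function are hypothetical and only illustrate which resource types and properties a dataset uses.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Hypothetical toy triples (subject, predicate, object); not from a real dataset.
triples = [
    ("ex:milan", RDF_TYPE, "ex:City"),
    ("ex:milan", "ex:locatedIn", "ex:italy"),
    ("ex:italy", RDF_TYPE, "ex:Country"),
    ("ex:milan", "owl:sameAs", "dbr:Milan"),
]


def profile(triples):
    """Return simple counts: number of triples, class usage, property usage."""
    # Classes are the objects of rdf:type assertions.
    class_usage = Counter(o for _, p, o in triples if p == RDF_TYPE)
    # Every other predicate counts as a property describing a resource.
    property_usage = Counter(p for _, p, _ in triples if p != RDF_TYPE)
    return {
        "triples": len(triples),
        "classes": dict(class_usage),
        "properties": dict(property_usage),
    }


stats = profile(triples)
print(stats)
```

Counting `owl:sameAs` among the properties also gives a first, rough indication of how a dataset links to others.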

Why do we need data profiling?

“…because prevention is better than cure”


• Data quality assessment
• Query optimization
• Ontology / Data integration
• Data analytics
• Complex schema discovery
• Topical discovery
• Data visualization


State of the art

Table columns: Tool, Goal, Input, Output, Automatization, Scalability, Availability, License, Tutorial

Roomba (Assaf et al., 2015)
  Goal: Generate descriptive dataset profiles
  Input: Query portal APIs for available metadata
  Output: Quality assessment of metadata
  Availability: Code in GitHub
  License: Open Source

LODStats (Auer et al., 2012)
  Goal: Comprehensive statistics about RDF
  Input: RDF
  Output: 32 statistical criteria on schema and data level
  Other cells (column unclear): Only the demo; Demo

ExpLOD (Khatchadourian S. and Consens M. P., 2010)
  Goal: Supports exploring summaries of RDF usage and interlinking among datasets
  Input: RDF dataset, the BL (bisimulation label) schema and the neighborhoods to consider
  Output: Summaries can be viewed and explored in an interactive graphical way and can be exported in a variety of formats

RDFStats (Langegger A. and Wöß W., 2010)
  Goal: Generation of different statistics
  Input: RDF dataset
  Output: Histograms for value distributions, classes/properties/datatypes
  Automatization: Semi-Automatic
  License: Apache
  Other cells (column unclear): Yes; Yes

ProLOD++ (Böhm et al., 2010)
  Goal: Computes different profiling, mining or cleansing tasks
  Input: RDF dataset
  Output: Statistics about properties, classes etc. Information about uniqueness and keyness.
  Automatization: Automatic
  Other cells (column unclear): Demo


Data Profiling Tools Survey


Profiling challenges

The results of data profiling are computationally complex to discover
Different and new data management architectures and frameworks have emerged
Linked Open Data are heterogeneous data
• Syntactic Heterogeneity (different formats, query languages)
• Schematic Heterogeneity (different encoding schemas)
• Semantic Heterogeneity (different vocabularies, semantic overlap of terms)
Unified view of data profiling as a field
Unifying framework for its tasks


Objectives

Develop automatic approaches
Generate new statistics and knowledge patterns to provide a dataset summary and inspect its quality
• Apply data mining techniques to extract useful knowledge from large datasets
• Implementation of different approaches for outlier detection
Algorithms to overcome challenges to perform profiling in Linked Open Data
• Parallel calculation of statistics and patterns extraction in LOD
• Data mining techniques to deal with high dimensionality
Topical information extraction and classification
Developing a methodology on how to perform profiling tasks
• A deep literature study to classify and formalize profiling tasks



Methodology Used

[Figure: diagram with the labels Schematic Heterogeneity, Semantic Heterogeneity, Syntactic Heterogeneity, Topical Discovery, Data Quality, Data Understanding]


Work already done (1)

Profiling of Italian Public Administration websites

• Decree 33 and 150



Profiling of Italian PAs

Benchmark of PAs
• Geographical distribution (country wide)
• Type of PAs (region, municipality, county)
• Size (number of inhabitants)

Compliance Index
• Completeness
• Accuracy
• Timeliness

Profiling websites in terms of compliance



“Data Profiling, the moment of truth”


The average index of compliance for the selected
• Italian Regions is 0.488 (50% have an index lower than the mean).
• Italian Provinces is 0.561 (more than 50% have an index lower than the mean).
• Italian Municipalities is 0.462 (more than 50% have an index lower than the mean).

Regions
• Veneto has the highest score (0.839)
• Campania has the lowest score (0.043)

Provinces
• Bergamo, in the Lombardia Region, has the highest score (0.759)
• Massa Carrara, in the Toscana Region, has the lowest score (0.266)

Municipalities
• Voghera (Lombardia Region) has the highest score (0.759)
• Ozegna (Piemonte Region) has the lowest score (0.164)

Work already done (2)

• Facilitating queries for the discovery of similar datasets
• Speeding up data searches
• Trends and best practices of a particular domain can be identified


To what extent can topical classification be automated?

Data Corpus and Feature Set

Category                 Datasets   %
Government               183        18.05
Publications             96         9.47
Life sciences            83         8.19
User generated content   48         4.73
Cross domain             41         4.04
Media                    22         2.17
Geographic               21         2.07
Social Web               520        51.28


Data corpus (1014 datasets) extracted in April 2014 from Schmachtenberg et al.

• Vocabulary Usage (1439)
• Class URIs (914)
• Property URIs (2333)
• Local Class Names (1041)
• Local Property Names (2493)
• Text from rdfs:label (1440)
• Top Level Domain (55)
• In and Out Degree (2)

Experimental Setup

Classification approaches
• K-Nearest Neighbor
• J-48
• Naïve Bayes

Two normalization strategies
• Binary (bin)
• Relative term occurrences (rto)

Three sampling techniques
• No sampling
• Down sampling
• Up sampling
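
The two normalization strategies can be illustrated on a single feature vector. This is a minimal sketch, assuming that bin records only whether a term is used and that rto divides each term count by the total count for the dataset; the `usage` counts and the function names are hypothetical and are not taken from the experiments.

```python
def binary(counts):
    """Binary (bin): 1 if the term occurs in the dataset at all, else 0."""
    return {term: 1 if n > 0 else 0 for term, n in counts.items()}


def relative_term_occurrence(counts):
    """Relative term occurrence (rto): each count divided by the dataset total."""
    total = sum(counts.values())
    if total == 0:
        # A dataset with no term usage gets an all-zero vector.
        return {term: 0.0 for term in counts}
    return {term: n / total for term, n in counts.items()}


# Hypothetical vocabulary usage counts for one dataset.
usage = {"foaf": 6, "dcterms": 2, "skos": 0}

bin_vec = binary(usage)
rto_vec = relative_term_occurrence(usage)
print(bin_vec)
print(rto_vec)
```

The binary form discards how heavily a vocabulary is used, while the relative form keeps that proportion but makes vectors depend on everything else the dataset contains.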


Results on Combined Feature Sets


Our model reaches an accuracy of 81.62%

Confusion Matrix


Confusion of publications with government and life sciences

Confusion between user generated content and social networking
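
Both the accuracy figure and the confused category pairs are read off a confusion matrix. The following is a minimal sketch of that reading; the matrix values are hypothetical placeholders, not the results reported here, and rows are assumed to be true categories and columns predicted categories.

```python
# Hypothetical confusion matrix: rows are true labels, columns are predictions.
labels = ["government", "publications", "life sciences"]
matrix = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]


def accuracy(matrix):
    """Share of correctly classified datasets: diagonal sum over total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def most_confused(matrix, labels):
    """Return (true_label, predicted_label, count) for the largest off-diagonal cell."""
    best = None
    for i, row in enumerate(matrix):
        for j, n in enumerate(row):
            # Strict comparison keeps the first cell found among ties.
            if i != j and (best is None or n > best[2]):
                best = (labels[i], labels[j], n)
    return best


acc = accuracy(matrix)
pair = most_confused(matrix, labels)
print(acc, pair)
```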


Work already done (3)

ABSTAT is a framework which can be used to summarize linked datasets and at the same time to provide statistics about them
• The summary consists of Abstract Knowledge Patterns (AKPs) of the form <subjectType, predicate, objectType>
• Can help users compare two datasets
• Helps detect errors in the data, such as accuracy errors
  E.g.: AKP <dbo:Band, dbo:genre, dbo:Band>
• The domain or the range is unspecified for 585 properties in the DBpedia Ontology


SubjectType Property ObjectType

http://dbpedia.org/ontology/Town http://dbpedia.org/ontology/governmentType http://dbpedia.org/ontology/Country

http://dbpedia.org/ontology/City http://dbpedia.org/ontology/governmentType http://dbpedia.org/ontology/Legistrature

http://dbpedia.org/ontology/Settlement http://dbpedia.org/ontology/governmentType http://dbpedia.org/ontology/Settlement

http://dbpedia.org/ontology/Country http://dbpedia.org/ontology/governmentType http://dbpedia.org/ontology/PoliticalParty

http://dbpedia.org/ontology/Village http://dbpedia.org/ontology/governmentType http://dbpedia.org/ontology/MilitaryConflict

http://dbpedia.org/ontology/Organization http://dbpedia.org/ontology/governmentType http://dbpedia.org/ontology/City

http://dbpedia.org/ontology/AdministrativeRegion http://dbpedia.org/ontology/governmentType

[Figure: node labels Town, City, Settlement, Country, Political (truncated), Village, Organization, all in the http://dbpedia.org/ontology/ namespace]
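
The idea of a pattern of the form <subjectType, predicate, objectType> can be sketched in a few lines. This is a simplified illustration, not ABSTAT's implementation: it assumes each resource has exactly one known type and falls back to `owl:Thing` otherwise, and the `ex:` resources, the `types` map and the `abstract_patterns` function are hypothetical.

```python
from collections import Counter

# Hypothetical type assertions: one type per resource (a simplifying assumption).
types = {
    "ex:oasis": "dbo:Band",
    "ex:blur": "dbo:Band",
    "ex:rock": "dbo:MusicGenre",
}

# Hypothetical relational triples (subject, predicate, object).
triples = [
    ("ex:oasis", "dbo:genre", "ex:rock"),
    ("ex:blur", "dbo:genre", "ex:rock"),
    ("ex:oasis", "dbo:genre", "ex:blur"),  # suspicious: a band used as a genre
]


def abstract_patterns(triples, types, default="owl:Thing"):
    """Count <subjectType, predicate, objectType> patterns over the triples."""
    patterns = Counter()
    for s, p, o in triples:
        pattern = (types.get(s, default), p, types.get(o, default))
        patterns[pattern] += 1
    return patterns


akps = abstract_patterns(triples, types)
for pattern, count in akps.most_common():
    print(pattern, count)
```

A rare pattern such as <dbo:Band, dbo:genre, dbo:Band> stands out next to the frequent one, which is how a summary of this kind can point at likely errors.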


Evaluation Plan

Where a gold standard exists, validate in terms of precision, recall and F-measure

Difficulties in evaluating the validity of the proposed approach

• How these statistics or summaries improve the performance of the actual profiling tasks

• Humans will evaluate the validity of the summarization in terms of relatedness and informativeness

• Provide users with a list of statistics and ask which are most important for their use case
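
For the gold standard case, the three measures can be computed as follows. This is a minimal sketch using the standard set-based definitions, with the F-measure taken as the balanced F1; the example sets and the function name are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 of predicted items against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four predicted items, three gold items, two in common.
p, r, f = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
print(p, r, f)
```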


Conclusions and Future Work

The topical classification approach yields an accuracy of 82%; it can be enriched with other features such as linkage coverage
• Each dataset has only one topic; for some datasets multi-label classification can be appropriate
• A classifier chain for the multi-label classification
• Because of the heavy imbalance of the data, a two-stage classifier might help
Enrich the ABSTAT framework with other statistics and apply it to unstructured data such as microdata.
Investigate the trade-off between ABSTAT summarization to support dataset exploration and understanding.


Publications

• A. Maurino, B. Spahiu, C. Batini, G. Viscusi – Compliance with Open Government Data Policies: An empirical evaluation of Italian Local Public Administration. ECIS – eGOV Workshop in the Twenty Second European Conference on Information Systems, Tel Aviv 2014

• A. Maurino, B. Spahiu, C. Batini, G. Viscusi – Compliance with Open Government Data Policies: An empirical evaluation of Italian Local Public Administration. Information Polity Journal, p. 263-275, 2014

• M. Palmonari, A. Rula, R. Porrini, A. Maurino, B. Spahiu, V. Ferme – ABSTAT: Linked Data Summaries with Abstraction and STATistics – The Semantic Web: ESWC 2015, Portoroz, Slovenia, May 31st to June 4th, 2015

• R. Meusel, B. Spahiu, C. Bizer, H. Paulheim – Towards Automatic Classification of LOD datasets – LDOW Workshop co-located with 24th International World Wide Web Conference (WWW 2015), Firenze, May 19, 2015

• B. Spahiu – Profiling the Linked (Open) Data – Doctoral Consortium Call at ISWC 2015

• C. Xie, D. Ritze, B. Spahiu, H. Cai – Instance-based property matching in Linked Open Data Environment – Ontology Matching Workshop co-located with 14th International Semantic Web Conference, 2015, Bethlehem, Pennsylvania, USA.


Thank you for your attention!

