Publish-Time Data Integration for Open Data Platforms

Dipl. Medien-Inf. Julian Eberius |


WOD 2013

Julian Eberius, Patrick Damme, Katrin Braunschweig, Maik Thiele and Wolfgang Lehner (TU Dresden)


> Motivation


> Premise

Reusability
• Standardization
• Integration

Free-For-All
• Many contributors
• Many domains
• Lack of standards

Continuous publishing without standardization will continuously increase heterogeneity on the platform.

Is there a solution without predefined schemata / ontologies?


> Problem

• Different names for attributes of the same meaning
• Different meanings for attributes with same values


> System Overview


> Offline

Domain Clustering
• Bottom-up clustering on schema level
• Used online to limit search space
• But also to improve accuracy

Domain Statistics
• Create different forms of value set synopses
• Used to save comparison work online
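The offline domain clustering step can be illustrated with a small sketch. It assumes that schema-level similarity is the Jaccard overlap of attribute-name sets and that clusters are merged greedily (single link) above a fixed threshold; the slides name neither the similarity measure nor the linkage, so `jaccard`, `cluster_domains`, and `threshold` are hypothetical choices for illustration only.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster_domains(schemas, threshold=0.3):
    """Bottom-up clustering of datasets by schema overlap (assumed: single link).

    schemas: dict mapping dataset id -> iterable of attribute names.
    Returns a list of sets of dataset ids, one set per domain.
    """
    clusters = [{ds} for ds in schemas]  # start with one cluster per dataset
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            # Single link: similarity of the closest pair of schemas.
            sim = max(jaccard(schemas[a], schemas[b])
                      for a in clusters[i] for b in clusters[j])
            if sim >= threshold:
                clusters[i] |= clusters[j]
                del clusters[j]
                merged = True
                break  # indices changed, restart the scan
    return clusters

schemas = {
    "cities_a": ["city", "population"],
    "cities_b": ["city", "population", "area"],
    "drugs": ["drug", "dose"],
}
print([sorted(c) for c in cluster_domains(schemas)])
# [['cities_a', 'cities_b'], ['drugs']]
```

The resulting clusters play the role of domains: at publish time, only the datasets of one cluster have to be searched.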


> Online

Input
• New dataset ds+ with value sets vs+

Output
• Attribute name suggestions

Constraint
• Instantaneous response time (Publish-Time!)

Basic Approach
• Assign ds+ to a domain based on schema information
• Generate recommendations based on values


> Naiv-C

Most Naive Approach:
• Iterate over corpus C
• Return the names of all attributes with sufficiently similar value sets
• Order them by overall frequency in the corpus

Properties:
• Finds all similar value sets
• Generates the largest possible number of recommendations
• Extremely long run time
• Might generate too many recommendations
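A minimal sketch of Naiv-C, under assumptions: the corpus is modelled as a flat list of (attribute name, value set) pairs, and "sufficiently similar" is taken to mean a Jaccard similarity at or above a threshold. The function name `naiv_c`, the data layout, and the threshold value are illustrative and not taken from the slides.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def naiv_c(new_values, corpus, threshold=0.5):
    """Scan the whole corpus and recommend attribute names.

    corpus: list of (attribute_name, value_set) pairs from all datasets.
    Returns matching names, most frequent in the corpus first.
    """
    # Overall frequency of each attribute name in the corpus.
    name_freq = Counter(name for name, _ in corpus)
    # Names of all attributes whose value set is sufficiently similar.
    matches = {name for name, values in corpus
               if jaccard(new_values, values) >= threshold}
    # Order by frequency (descending), ties broken alphabetically.
    return sorted(matches, key=lambda n: (-name_freq[n], n))

corpus = [
    ("city", {"Berlin", "Dresden", "Hamburg"}),
    ("city", {"Berlin", "Munich"}),
    ("town", {"Berlin", "Dresden"}),
    ("drug", {"aspirin"}),
]
print(naiv_c({"Berlin", "Dresden"}, corpus))
# ['city', 'town']
```

Every call compares the new value set against every value set in the corpus, which is where the long run time comes from.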


> Naiv-D

Domain-based Approach:
• Classify incoming dataset into domain D
• Iterate over domain D
• Continue as in Naiv-C

Properties:
• Finds fewer similar value sets
• Shorter run time
• Only generates recommendations from one domain
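Naiv-D could be sketched as follows. The classification rule used here (pick the domain whose attribute vocabulary has the highest Jaccard overlap with the new schema) is an assumption, as are the function name `naiv_d`, the dict-of-lists layout of `domains`, and the threshold.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def naiv_d(new_schema, new_values, domains, threshold=0.5):
    """Classify the new dataset into one domain, then scan only that domain.

    domains: dict mapping domain id -> list of (attribute_name, value_set).
    Returns (chosen domain id, recommended attribute names).
    """
    def schema_sim(domain_id):
        # Assumed classifier: overlap with the domain's attribute vocabulary.
        vocabulary = {name for name, _ in domains[domain_id]}
        return jaccard(new_schema, vocabulary)

    best = max(domains, key=schema_sim)
    members = domains[best]
    # From here on, the same steps as Naiv-C, restricted to one domain.
    name_freq = Counter(name for name, _ in members)
    matches = {name for name, values in members
               if jaccard(new_values, values) >= threshold}
    return best, sorted(matches, key=lambda n: (-name_freq[n], n))

domains = {
    "geo": [("city", {"Berlin", "Dresden"}),
            ("population", {"3500000", "550000"})],
    "med": [("drug", {"aspirin"}),
            ("town", {"Berlin", "Dresden"})],
}
print(naiv_d(["stadt", "population"], {"Berlin", "Dresden"}, domains))
# ('geo', ['city'])
```

In this toy example the attribute "town" in the other domain has an identical value set but is never considered, which illustrates the trade-off listed above: shorter run time, but recommendations from one domain only.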


> Cluster / Analysis-D

Synopsis-based Approaches:
• Create representative value sets (RVS) for datasets in a domain
• Match only against RVS

Clustering-D
• Cluster value sets (VS) in the domain, create RVS
• Pre-compute the recommendation list as all attribute names of value sets participating in the final cluster
• Online: find the single most similar RVS in D

Analysis-D
• Create RVS directly for sets of VS with equal name
• Online: find the set of similar RVS in D
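The Analysis-D variant can be illustrated with a short sketch. The slides do not say how a representative value set is derived; the union of all value sets that share an attribute name is used here purely as a stand-in, and `build_rvs`, `analysis_d`, and the Jaccard threshold are likewise hypothetical.

```python
from collections import defaultdict

def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def build_rvs(domain):
    """Offline: one representative value set (RVS) per attribute name.

    domain: list of (attribute_name, value_set) pairs of one domain.
    Assumption: the RVS is the union of all value sets with that name.
    """
    rvs = defaultdict(set)
    for name, values in domain:
        rvs[name] |= set(values)
    return dict(rvs)

def analysis_d(new_values, rvs, threshold=0.5):
    """Online: return the names of all sufficiently similar RVS, best first."""
    scored = [(jaccard(new_values, values), name) for name, values in rvs.items()]
    ranked = sorted(scored, key=lambda t: (-t[0], t[1]))
    return [name for sim, name in ranked if sim >= threshold]

domain = [
    ("city", {"Berlin", "Dresden"}),
    ("city", {"Berlin", "Hamburg"}),
    ("country", {"Germany", "France"}),
]
rvs = build_rvs(domain)
print(analysis_d({"Berlin", "Hamburg"}, rvs))
# ['city']
```

The online step now performs one comparison per distinct attribute name instead of one per value set. A Clustering-D lookup would differ in that it returns only the single most similar RVS together with its pre-computed recommendation list.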


> Evaluation


> Quality I


> Quality II


> Runtimes


> Cluster Size


> Conclusion

We need statistics-based data integration at publish time to limit the growth of heterogeneity in large public dataset corpora.

Lots of work to do: clustering, matching, statistics, indexing, performance.
