Dipl. Medien-Inf. Julian Eberius |
Publish-Time Data Integration for Open Data Platforms
WOD 2013
Julian Eberius, Patrick Damme, Katrin Braunschweig, Maik Thiele and Wolfgang Lehner (TU Dresden)
> Motivation
> Premise
Reusability
• Standardization
• Integration
Free-For-All
• Many contributors
• Many domains
• Lack of standards
Continuous publishing without standardization will continuously increase heterogeneity on the platform.
Is there a solution without predefined schemata / ontologies?
> Problem
• Different names for attributes of the same meaning
• Different meanings for attributes with same values
> System Overview
> Offline
Domain Clustering
• Bottom-up clustering on schema level
• Used online to limit search space
• But also to improve accuracy
Domain Statistics
• Create different forms of value set synopses
• Used to save comparison work online
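The bottom-up domain clustering step can be illustrated with a minimal sketch. The slides do not name a similarity measure or a merge rule, so everything here is an assumption: schemas are compared by Jaccard similarity of their attribute names, a cluster is represented by the union of its members' names, and `cluster_schemas` and its `threshold` parameter are hypothetical names.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure; the slides name none)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_schemas(schemas, threshold=0.3):
    """Bottom-up clustering of datasets by schema overlap.

    schemas: dict mapping a dataset id to its set of attribute names.
    Returns a list of domains, each a set of dataset ids.
    """
    # Start with one cluster per dataset: (member ids, union of attribute names).
    clusters = [({ds}, set(names)) for ds, names in sorted(schemas.items())]
    while True:
        # Score every pair of clusters and pick the most similar one.
        pairs = [
            (jaccard(clusters[i][1], clusters[j][1]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        if not pairs:
            break
        sim, i, j = max(pairs)
        # Stop once no pair is similar enough to merge.
        if sim < threshold:
            break
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [members for members, _ in clusters]
```

Called on three datasets where two share most attribute names, the sketch groups those two into one domain and leaves the third on its own. The resulting domains are what the online phase would use to limit its search space.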
> Online
Input
• New dataset ds+ with value sets vs+
Output
• Attribute name suggestions
Constraint
• Instantaneous response time (Publish-Time!)
Basic Approach
• Assign ds+ to domain based on schema information
• Generate recommendations based on values
> Naiv-C
Most Naive Approach:
• Iterate over Corpus C
• Return the names of all attributes with sufficiently similar value sets
• Order them by overall frequency in the corpus
Properties:
• Finds all similar value sets
• Generates the largest possible number of recommendations
• Extremely long run time
• Might generate too many recommendations
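The Naiv-C steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not say what "sufficiently similar" means, so Jaccard similarity with a fixed `threshold` is assumed, and the corpus is modelled as a flat list of `(attribute_name, value_set)` pairs. The function name `naive_c` is hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure; the slides name none)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_c(corpus, new_values, threshold=0.5):
    """Recommend attribute names for a new value set by scanning the whole corpus.

    corpus: list of (attribute_name, value_set) pairs over all datasets.
    """
    # Overall frequency of every attribute name in the corpus.
    name_freq = Counter(name for name, _ in corpus)
    # Names of all attributes whose value set is sufficiently similar.
    matches = {
        name for name, values in corpus
        if jaccard(values, new_values) >= threshold
    }
    # Order by overall corpus frequency; ties are broken alphabetically.
    return sorted(matches, key=lambda n: (-name_freq[n], n))
```

Every value set in the corpus is compared against the new one, which is why the run time grows with corpus size and why the result list can become long.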
> Naiv-D
Domain-based Approach:
• Classify incoming dataset into domain D
• Iterate over Domain D
• Continue as in Naiv-C
Properties:
• Finds fewer similar value sets
• Shorter run time
• Only generates recommendations from one domain
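A minimal sketch of the domain-based variant follows, under stated assumptions: the incoming dataset is assigned to the domain whose known attribute names overlap most with its schema (Jaccard overlap is an assumption, as is the similarity `threshold`), and names are ranked by their frequency across all domains. The names `classify_domain` and `naive_d` and the dictionary layout are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure; the slides name none)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_domain(domains, schema):
    """Pick the domain whose attribute names overlap most with the new schema.

    domains: dict mapping a domain name to a list of (attribute_name, value_set).
    """
    def overlap(domain_name):
        names = {name for name, _ in domains[domain_name]}
        return jaccard(names, schema)

    # sorted() makes the choice deterministic when overlaps are tied.
    return max(sorted(domains), key=overlap)


def naive_d(domains, schema, new_values, threshold=0.5):
    """Classify the dataset, then scan only the chosen domain as in Naiv-C."""
    domain = classify_domain(domains, schema)
    # Frequency of each attribute name over the whole corpus (all domains).
    name_freq = Counter(
        name for attrs in domains.values() for name, _ in attrs
    )
    matches = {
        name for name, values in domains[domain]
        if jaccard(values, new_values) >= threshold
    }
    return domain, sorted(matches, key=lambda n: (-name_freq[n], n))
```

Because only one domain is scanned, a matching attribute in another domain is never suggested, which mirrors the last property listed above.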
> Cluster / Analysis-D
Synopsis-based Approaches:
• Create representative value sets (RVS) for datasets in domain
• Match only against RVS
Clustering-D
• Cluster VS in domain, create RVS
• Pre-compute recommendation list as all attribute names of value sets participating in final cluster
• Online: find single most similar RVS in D
Analysis-D
• Create RVS directly for sets of VS with equal name
• Online: find set of similar RVS in D
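Both synopsis-based variants can be illustrated with a short sketch. The slides do not specify how value sets are clustered or how an RVS is built, so several assumptions are made: Jaccard similarity is the measure, an RVS is the union of its member value sets, and Clustering-D uses a simple greedy single-pass clustering as a stand-in for whatever clustering the authors use. All function names and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure; the slides name none)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# --- Clustering-D ---------------------------------------------------------
def cluster_domain(attributes, threshold=0.5):
    """Offline: greedily cluster the (name, value_set) pairs of one domain.

    Each cluster keeps an RVS (union of its members, an assumption) and the
    pre-computed recommendation list (names of all participating value sets).
    """
    clusters = []  # each entry: {"rvs": set, "names": set}
    for name, values in attributes:
        values = set(values)
        best = max(clusters, key=lambda c: jaccard(c["rvs"], values), default=None)
        if best is not None and jaccard(best["rvs"], values) >= threshold:
            best["rvs"] |= values
            best["names"].add(name)
        else:
            clusters.append({"rvs": values, "names": {name}})
    return clusters


def clustering_d(clusters, new_values):
    """Online: return the names attached to the single most similar RVS."""
    best = max(clusters, key=lambda c: jaccard(c["rvs"], new_values))
    return sorted(best["names"])


# --- Analysis-D -----------------------------------------------------------
def build_rvs_by_name(attributes):
    """Offline: merge all value sets that share an attribute name into one RVS."""
    rvs = defaultdict(set)
    for name, values in attributes:
        rvs[name] |= set(values)
    return dict(rvs)


def analysis_d(rvs, new_values, threshold=0.5):
    """Online: return the names of all RVS that are sufficiently similar."""
    scored = [(jaccard(values, new_values), name) for name, values in rvs.items()]
    ranked = sorted(scored, key=lambda t: (-t[0], t[1]))
    return [name for score, name in ranked if score >= threshold]
```

In both cases the online step compares the new value set against a handful of RVS instead of every value set in the domain, which is where the saving in comparison work comes from.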
> Evaluation
> Quality I
> Quality II
> Runtimes
> Cluster Size
> Conclusion
We need statistics-based data integration at publish time to limit the growth of heterogeneity in large public dataset corpora.
Lots of work to do: clustering, matching, statistics, indexing, performance.