Authors: Nguyen Quoc Viet Hung (1), Nguyen Thanh Tam (1), Zoltán Miklós (2), Karl Aberer (1), Avigdor Gal (3), and Matthias Weidlich (4). Affiliations: (1) École Polytechnique Fédérale de Lausanne, (2) Université de Rennes 1, (3) Technion – Israel Institute of Technology, (4) Imperial College London.
Pay-as-you-go Reconciliation in Schema Matching Networks
Nguyen Quoc Viet Hung (1), Nguyen Thanh Tam (1), Zoltán Miklós (2), Karl Aberer (1), Avigdor Gal (3), and Matthias Weidlich (4)
(1) École Polytechnique Fédérale de Lausanne
(2) Université de Rennes 1
(3) Technion – Israel Institute of Technology
(4) Imperial College London
ICDE | 2014 2
Schema Matching - Where?
WWW
Cloud
Large enterprises
P2P Networks
Collaborative Systems
Schema matching is the process of establishing correspondences between the attributes of schemas for the purpose of data integration.
Schema Matching Network
Traditional approach: mediated schema
  Requires consensus on the schema
  Updated frequently
Our approach: schema matching network
  A network of schemas that are matched against each other
[Figure: schemas S1, S2, and S3 under each approach]
Pay-as-you-go Reconciliation
Reconciliation is the process of asking a human user for feedback on correspondences.
Need for reconciliation: automatic techniques rely on heuristics, so their results are inherently uncertain.
[Figure: example network of three schemas (s1: EoverI, s2: BBC, s3: DVDizzy) with attributes a1: releaseDate, a2: screeningDate, a3: availabilityDate, a4: productionDate, linked by candidate correspondences c1 to c5]
The attribute names are quite similar, so automatic matching tools often fail to identify the correct correspondences.
Pay‐as‐you‐go reconciliation combines two tasks:
  Uncertainty reduction: incrementally improve matching quality with minimal user effort
  Instantiation of a selective matching: instantiate a single trusted set of correspondences
System Overview
General approach:
1. Develop a probabilistic schema matching network (pSMN), which makes it possible to measure the overall uncertainty of the network.
2. Reduce network uncertainty: guide user feedback so that it requires minimal effort.
3. Instantiate a selective matching: maintain a good set of attribute correspondences so that the system is usable at any time.
Outline
Probabilistic Schema Matching Network (pSMN)
  Model
  Computation
Uncertainty Reduction
Instantiation of the selective matching
Experimental results
Conclusion and future work
pSMN - Modeling
A schema matching network is modeled as a quadruple $N = \langle S, G_S, \Gamma, C \rangle$:
  $S$: set of schemas
  $G_S$: interaction graph, which represents the connections in the network
  $C$: set of attribute correspondences
  $\Gamma$: set of integrity constraints
An integrity constraint formalizes a natural property of the network, for example:
  1‐1 constraint
  Cycle constraint (transitivity)
A pSMN adds $P$, a set of probabilities: each probability $p_c \in P$ is associated with a correspondence $c \in C$.
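A minimal Python sketch of how the two constraints named on this slide could be checked for a set of correspondences. The `Correspondence` class, the helper names, and the toy schemas A, B, C with attributes x, y, z, w are illustrative assumptions, not part of the original system. The cycle check uses a simplified reading: a chain of correspondences must never connect two different attributes of the same schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Correspondence:
    name: str
    left: tuple   # (schema, attribute)
    right: tuple  # (schema, attribute)

def violates_one_to_one(corrs):
    """An attribute may match at most one attribute of any other schema."""
    partner = {}  # (attribute, other schema) -> attribute it is matched to
    for c in corrs:
        for src, dst in ((c.left, c.right), (c.right, c.left)):
            if partner.setdefault((src, dst[0]), dst) != dst:
                return True
    return False

def violates_cycle(corrs):
    """Simplified transitivity check (assumption): attributes connected by a
    chain of correspondences must all belong to different schemas."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for c in corrs:
        parent[find(c.left)] = find(c.right)  # merge the two components

    seen = set()  # (component root, schema) pairs already encountered
    for attr in list(parent):
        key = (find(attr), attr[0])
        if key in seen:
            return True
        seen.add(key)
    return False

# Hypothetical toy network with three schemas A, B, C
c1 = Correspondence("c1", ("A", "x"), ("B", "y"))
c2 = Correspondence("c2", ("B", "y"), ("C", "z"))
c3 = Correspondence("c3", ("A", "x"), ("C", "z"))
c4 = Correspondence("c4", ("B", "y"), ("C", "w"))
c5 = Correspondence("c5", ("A", "x"), ("C", "w"))

print(violates_one_to_one([c2, c4]))   # B.y is matched to both C.z and C.w
print(violates_cycle([c1, c2, c5]))    # the chain links C.z and C.w
print(violates_cycle([c1, c2, c3]))    # a closed, consistent cycle
```

In this toy network, {c1, c2, c3} satisfies both checks, while {c1, c2, c5} passes the 1‐1 check but fails the cycle check.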
pSMN - Computing Probability of a correspondence
Semantics: the probability of a correspondence indicates how likely it is to be correct.
Source: integrity constraints and user input.
Idea: a correspondence that is involved in many violations has a high chance of being problematic.
Computation:
  Step 1: construct all possible matching instances $\Omega = \{I_1, \dots, I_n\}$. A matching instance is a maximal set of correspondences that satisfies all integrity constraints and user input.
  Step 2: compute $p_c$ as the fraction of matching instances that contain $c$: $p_c = \frac{\#\{I \in \Omega : c \in I\}}{\#\Omega}$
Challenge: exact probability computation has high complexity. We use non‐uniform sampling and a view‐maintenance technique to approximate the probabilities efficiently.
Network uncertainty: quantify the uncertainty of the pSMN based on entropy:
$H(C, P) = -\sum_{c \in C} \left[ p_c \log p_c + (1 - p_c) \log(1 - p_c) \right]$
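A small Python sketch of the two computations on this slide: the probability of a correspondence as the fraction of matching instances that contain it, and the entropy-based network uncertainty. It enumerates the instances exactly, whereas the talk approximates them by sampling. The function names, the base-2 logarithm, and the two illustrative matching instances are assumptions.

```python
import math

def probabilities(instances, correspondences):
    """Probability of c = fraction of matching instances that contain c."""
    n = len(instances)
    return {c: sum(c in inst for inst in instances) / n for c in correspondences}

def network_uncertainty(probs):
    """Sum of binary entropies over all correspondences (log base 2 assumed).
    Correspondences with probability 0 or 1 contribute no uncertainty."""
    def term(p):
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return sum(term(p) for p in probs.values())

C = ["c1", "c2", "c3", "c4", "c5"]
omega = [{"c1", "c2", "c3"}, {"c1", "c4", "c5"}]  # illustrative instances

P = probabilities(omega, C)
H = network_uncertainty(P)
print(P)  # c1 appears in every instance; c2..c5 appear in half of them
print(H)  # 4.0: four correspondences with probability 0.5, one bit each
```

Here c1 is certain because it occurs in every instance, so only the other four correspondences contribute to the uncertainty.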
Reduce Network Uncertainty
Goal: guide the user to give feedback with minimal effort.
Problem (UNCERTAINTY MINIMIZATION WITH LIMITED EFFORT BUDGET). Given a probabilistic matching network $\langle S, G_S, C, \Gamma, P \rangle$ and a budget of user effort $b$, find a set of correspondences $F \subseteq C$ with $|F| \le b$ such that the network uncertainty $H(C, P)$ after validating $F$ is minimal.
Approach – Use heuristic ordering
Idea: show users the correspondences with the highest information gain first.
Information gain: the reduction in network uncertainty achieved by validating a correspondence $c$:
$IG(c) = H(C, P) - H(C \mid c, P)$
where $H(C \mid c, P)$ is the expected network uncertainty when the true value of $c$ is known.
Example: two possible solutions, {c1, c2, c3} and {c1, c4, c5}.
  Ask c1 first: the network is unchanged, so there is no uncertainty reduction.
  Ask c2 first: only one solution is left, so the network becomes certain.
[Figure: the example network over schemas SA, SB, and SC, shown three times: with all correspondences c1 to c5 (twice) and with only c1, c2, and c3 remaining]
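The two-solution example can be replayed in a short Python sketch. It computes the information gain of each correspondence from the set of matching instances and then asks questions greedily until the budget is spent or one instance remains. All function names, the base-2 logarithm, and the simulated user (`truth`) are assumptions for illustration; the original system works on sampled instances rather than a full enumeration.

```python
import math

def probabilities(instances, correspondences):
    n = len(instances)
    return {c: sum(c in i for i in instances) / n for c in correspondences}

def network_uncertainty(probs):
    def term(p):
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return sum(term(p) for p in probs.values())

def expected_uncertainty(instances, correspondences, c):
    """Uncertainty after learning c, averaged over both possible answers."""
    with_c = [i for i in instances if c in i]
    without_c = [i for i in instances if c not in i]
    total = 0.0
    for subset in (with_c, without_c):
        if subset:
            weight = len(subset) / len(instances)  # p_c or 1 - p_c
            total += weight * network_uncertainty(
                probabilities(subset, correspondences))
    return total

def information_gain(instances, correspondences, c):
    before = network_uncertainty(probabilities(instances, correspondences))
    return before - expected_uncertainty(instances, correspondences, c)

def reconcile(instances, correspondences, budget, truth):
    """Greedy ordering: always ask about the highest-gain correspondence."""
    asked = []
    for _ in range(budget):
        remaining = [c for c in correspondences if c not in asked]
        if len(instances) < 2 or not remaining:
            break  # nothing left to decide
        best = max(remaining,
                   key=lambda c: information_gain(instances, correspondences, c))
        asked.append(best)
        # Keep only the instances that agree with the (simulated) user answer
        instances = [i for i in instances if (best in i) == (best in truth)]
    return asked, instances

C = ["c1", "c2", "c3", "c4", "c5"]
omega = [{"c1", "c2", "c3"}, {"c1", "c4", "c5"}]

print(information_gain(omega, C, "c1"))  # 0.0: asking c1 changes nothing
print(information_gain(omega, C, "c2"))  # 4.0: asking c2 removes all uncertainty
print(reconcile(omega, C, budget=3, truth={"c1", "c2", "c3"}))
```

With a simulated user whose true matching is {c1, c2, c3}, a single question about c2 is enough, even though the budget allows three.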
Instantiate a selective matching
Goal: maintain a single trusted set of correspondences.
Goodness measures for a set of correspondences $I \subseteq C$:
  Repair distance: the information lost by eliminating some correspondences to guarantee the integrity constraints: $\Delta(I) = |C \setminus I|$
  Likelihood: the collective correctness of the correspondences: $u(I) = \prod_{c \in I} p_c$
Instantiation problem: given a schema matching network, identify a set of correspondences $I \subseteq C$ with minimal repair distance (w.r.t. $C$) and maximal likelihood.
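A minimal Python sketch of the two goodness measures, assuming the repair distance counts the correspondences left out of the instantiation and the likelihood is the product of the probabilities of the correspondences kept. The function names and the probability values are hypothetical.

```python
import math

def repair_distance(instance, correspondences):
    """Number of candidate correspondences dropped from the instantiation."""
    return len(set(correspondences) - set(instance))

def likelihood(instance, probs):
    """Product of the probabilities of the correspondences that are kept."""
    return math.prod(probs[c] for c in sorted(instance))

C = ["c1", "c2", "c3", "c4", "c5"]
P = {"c1": 0.9, "c2": 0.8, "c3": 0.7, "c4": 0.3, "c5": 0.2}  # hypothetical

I1 = {"c1", "c2", "c3"}
I2 = {"c1", "c4", "c5"}
for inst in (I1, I2):
    print(sorted(inst), repair_distance(inst, C), likelihood(inst, P))
```

Both candidate sets have the same repair distance of 2, so the likelihood breaks the tie in favor of {c1, c2, c3}.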
Approach
The instantiation problem is NP‐complete, so we use a heuristic approach.
Algorithm:
  Step 1: Initialization: pick a sampled matching instance with minimal repair distance.
  Step 2: Optimization: randomized local search.
[Figure: matching instances (sets that satisfy all constraints) plotted by repair distance and likelihood. Legend: non‐sampled instance; sampled instance; I0 (sampled, minimal repair distance); Iopt (minimal repair distance, maximal likelihood). Randomized local search moves from I0 to Iopt.]
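The two-step algorithm can be sketched in Python as a random walk that remembers the best set seen so far. This is not the authors' implementation: the pairwise `conflicts` map stands in for the real integrity constraints (which need not be pairwise), the scoring order (repair distance first, then likelihood) is an assumption, and all names and probability values are hypothetical.

```python
import math
import random

def score(instance, correspondences, probs):
    """Lower is better: repair distance first, then higher likelihood."""
    distance = len(set(correspondences) - set(instance))
    likelihood = math.prod(probs[c] for c in sorted(instance))
    return (distance, -likelihood)

def local_search(start, correspondences, probs, conflicts, steps=200, seed=0):
    """Random walk over consistent sets, remembering the best one seen."""
    rng = random.Random(seed)
    current = best = frozenset(start)
    for _ in range(steps):
        outside = sorted(set(correspondences) - current)
        if not outside:
            break
        c = rng.choice(outside)
        # Add c and drop whatever conflicts with it, so the set stays consistent
        current = frozenset((current - conflicts.get(c, set())) | {c})
        if score(current, correspondences, probs) < score(best, correspondences, probs):
            best = current
    return best

C = ["c1", "c2", "c3", "c4", "c5"]
P = {"c1": 0.9, "c2": 0.8, "c3": 0.7, "c4": 0.3, "c5": 0.2}  # hypothetical
# Hypothetical symmetric pairwise conflicts standing in for the constraints
conflicts = {
    "c2": {"c4", "c5"}, "c3": {"c4", "c5"},
    "c4": {"c2", "c3"}, "c5": {"c2", "c3"},
}

I0 = {"c1", "c4", "c5"}  # Step 1: a consistent starting instance
I_opt = local_search(I0, C, P, conflicts)  # Step 2: randomized local search
print(sorted(I_opt), score(I_opt, C, P))
```

Because the walk only records a candidate when it scores strictly better, the returned set is never worse than the starting instance under this scoring.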
Experiment – Dataset and Setting
Datasets:
  Business Partner: schemas from enterprise systems
  Purchase Order: purchase order e‐business schemas
  University Application Form: schemas from Web interfaces of American university application forms
  WebForm: schemas from Web forms of different domains
  Thalia: schemas describing university courses
Metrics:
  Precision: measures quality improvement at each user interaction step $i$, with $G$ being the exact match and $D_i$ the correspondence set evaluated at step $i$: $Prec_i = |D_i \cap G| / |D_i|$
  User effort: the percentage of feedback steps relative to the size of the matcher output: $E_i = i / |C|$
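The two metrics can be written as short Python helpers. The function names and the sample sets are hypothetical; the precision helper returns 0.0 for an empty set, which is an added convention.

```python
def precision(current, exact):
    """Share of the current correspondences that are in the exact match."""
    return len(current & exact) / len(current) if current else 0.0

def user_effort(steps, matcher_output):
    """Feedback steps relative to the size of the matcher output."""
    return steps / len(matcher_output)

C = ["c1", "c2", "c3", "c4", "c5"]  # matcher output
G = {"c1", "c2", "c3"}              # exact match
D = {"c1", "c2", "c4"}              # correspondences kept at some step

print(precision(D, G))    # two of the three kept correspondences are correct
print(user_effort(2, C))  # 0.4: two feedback steps out of five candidates
```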
Efficiency of guiding strategy on uncertainty reduction
Goal: compare the guiding and non‐guiding strategies with respect to uncertainty reduction.
Evaluation procedure:
  Gradually increase user effort.
  Upon each user input, measure the network uncertainty and precision.
Interesting finding: the heuristic ordering strategy saves up to 48% of user effort compared to random ordering.
Efficiency of guiding strategy on instantiation
Goal: compare the guiding and non‐guiding strategies with respect to instantiation.
Evaluation procedure:
  Gradually increase user effort.
  Measure the precision and recall of the instantiated matching.
Interesting finding: the heuristic ordering strategy outperforms the baseline with an average difference of 15% (precision) and 14% (recall).
Conclusions
We introduce the concepts of schema matching networks and probabilistic matching networks.
We define a model for pay‐as‐you‐go reconciliation on top of matching networks.
We propose a guiding technique to reduce network uncertainty and a heuristic approach to instantiate a selective matching.
In experiments with real‐world schemas, our guiding strategy outperforms the baseline:
  Saving up to 48% of user effort
  Increasing precision (15%) and recall (14%)
Future Work
Generalizing pay‐as‐you‐go reconciliation for crowdsourced models: Business process matching
Ontology alignment
THANK YOU
Q&A