Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Karl Aberer
École Polytechnique Fédérale de Lausanne, Switzerland
Zoltán Miklós
Université de Rennes 1, IRISA, France
DASFAA 2013, Part II, LNCS 7826, pp. 139 – 154, 2013
Database schema matching is an active research field:
Surveys: [1], [2]
Applications: data transformation, data migration, data alignment, …
Automatic Matching Tools: COMA++, AMC, OpenII, Falcon, …
Schema matching is the task of establishing correspondences that connect related attributes in two (independently developed) database schemas.
[1] Rahm, E. et al. “A Survey of Approaches to Automatic Schema Matching”. JVLDB, 2001
[2] Bernstein, P.A. et al. “Generic Schema Matching, Ten Years Later”. PVLDB, 2011
[Figure: two example schemas, SA and SB, whose attributes (labels in the extract: BirthName, BirthDate, Address) are connected by correspondences]
Automatic schema matchers will (sometimes) fail to identify the correct correspondences
There is a need for post‐matching reconciliation through human input
This effort is the « real cost » in the company
Schemas do not appear alone, they are part of a matching network
The network‐level consistency constraints are very important for business users
Real‐world scenario: a repository of schemas in the same domain
Schema matching network: connect schemas by pair‐wise matchings
Network‐level consistency constraints
Automatic tools produce incorrect correspondences, which need validation by humans
DASFAA’2013, BDA’2013: On Leveraging Crowdsourcing Techniques for Schema Matching Networks
ER’2013: Minimizing Human Effort in Reconciling Match Networks
CoopIS’2013: Collaborative Schema Matching Reconciliation
ICDE’2014: Pay‐as‐you‐go Reconciliation in Schema Matching Networks
“Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers.” ‐Wiki
Our context: employ many workers (users) to validate the same correspondences and combine their answers.
Surveys: [1], [2]
A wide range of applications (e.g. CrowdSearch) have been developed on top of more than 70 crowdsourcing platforms (e.g. Amazon Mechanical Turk).
Our contribution:
Define network‐level constraints in schema matching networks
Design questions for workers to validate correspondences
Leverage network‐level constraints to reduce user effort
[1] E. Law et al. “Human Computation”. Morgan & Claypool Publishers, 2011
[2] A. Doan et al. “Crowdsourcing systems on the World Wide Web”. CACM, 2011
Three elements of questions:
Asking object: correspondence
Possible choices: simple YES/NO question
Support information: alternatives, constraint satisfactions, constraint violations
User Feedbacks
User  Question  Answer
U1    C         Yes
U2    C         Yes
U3    C         No

User Quality
User  Reliability
U1    r1
U2    r2
U3    r3
where r1 = Pr(C=true | U1=yes) = Pr(C=false | U1=no)

Answer Aggregation
Probabilistic Model (*): compute Pr(C), then <a, e> (aggregation + error rate)

Corr  Aggregation  Error Rate
C     True         0.19

(*) Majority Voting, Expectation Maximization, … See full paper for details
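The aggregation step can be illustrated with a minimal sketch. It is not the model from the full paper: it assumes a uniform prior on C, independent workers, and that each reliability r_i is the probability that worker U_i answers correctly. The function name `aggregate` and the example reliabilities are hypothetical, and the sketch does not attempt to reproduce the 0.19 shown on the slide.

```python
from math import prod


def aggregate(answers, reliabilities):
    """Combine YES/NO answers on one correspondence into <a, e>.

    answers: list of bool (True = worker said YES).
    reliabilities: list of float, assumed Pr(worker answers correctly).
    Returns (aggregated value, error rate).
    Assumption: uniform prior and independent workers (naive Bayes).
    """
    pairs = list(zip(answers, reliabilities))
    # Likelihood of the observed answers if the correspondence is true / false.
    like_true = prod(r if a else 1 - r for a, r in pairs)
    like_false = prod(1 - r if a else r for a, r in pairs)
    p_true = like_true / (like_true + like_false)
    decision = p_true >= 0.5
    # Error rate = probability that the aggregated value is wrong.
    error_rate = 1 - max(p_true, 1 - p_true)
    return decision, error_rate


# Hypothetical reliabilities for U1, U2, U3 answering Yes, Yes, No.
decision, error_rate = aggregate([True, True, False], [0.9, 0.7, 0.6])
print(decision, round(error_rate, 3))
```

With equal reliabilities, one Yes and one No cancel out, so two Yes and one No carry the same evidence as a single Yes.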
To achieve higher accuracy, we need more answers: a cost‐accuracy tradeoff
[Figure: plot for worker reliability r = 0.6, with the target marked “Goal”]
Solution: Leverage constraints to reduce error rate
Idea: correspondences support each other if they satisfy a constraint
1‐1 constraint: ONE source attribute matches only ONE target attribute
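[Figure: attribute a in schema S with candidate correspondences ab1 and ab2 to attributes b1 and b2 in schema T]
Pr(ab1=true) = 0.8
Pr(ab2=false) = 0.6

ab1  ab2  Prob  1‐1 constraint
T    T    0.32  not satisfied
T    F    0.48  satisfied
F    T    0.08  satisfied
F    F    0.12  satisfied
(each Prob follows by independence, e.g. 0.48 = 0.8 x 0.6)

Pr(ab2=false | 1‐1 constraint) = (0.48 + 0.12) / (0.48 + 0.08 + 0.12) ≈ 0.88

Without Constraint
Corr  Aggregation  Error Rate
ab2   False        0.4 (*)

With Constraint
Corr  Aggregation  Error Rate
ab2   False        0.12 (**)

(*) Error Rate = 1 – Pr(ab2=false)
(**) Error Rate = 1 – Pr(ab2=false | 1‐1 constraint)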
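The 1‐1 example can be checked by enumerating the four joint assignments and conditioning on the constraint. The sketch below uses the probabilities from the slide and assumes independent correspondences. The names `p_true`, `joint`, and `satisfies_one_to_one` are illustrative only.

```python
from itertools import product

# Marginals from the slide; Pr(ab2=true) = 1 - Pr(ab2=false) = 0.4.
p_true = {"ab1": 0.8, "ab2": 0.4}


def joint(world):
    """Probability of one truth assignment, assuming independence."""
    prob = 1.0
    for name, value in world.items():
        prob *= p_true[name] if value else 1 - p_true[name]
    return prob


def satisfies_one_to_one(world):
    """Attribute a may match at most one of b1, b2."""
    return sum(world.values()) <= 1


worlds = [dict(zip(p_true, values))
          for values in product([True, False], repeat=len(p_true))]
consistent = [w for w in worlds if satisfies_one_to_one(w)]

p_constraint = sum(joint(w) for w in consistent)  # 0.48 + 0.08 + 0.12
p_ab2_false = sum(joint(w) for w in consistent if not w["ab2"]) / p_constraint

error_without = 1 - (1 - p_true["ab2"])  # 1 - Pr(ab2=false)
error_with = 1 - p_ab2_false             # 1 - Pr(ab2=false | constraint)
print(round(error_without, 2), round(error_with, 2))  # 0.4 0.12
```

Removing the one inconsistent assignment (both correspondences true) shifts probability mass towards ab2=false, which is why the error rate drops without asking any additional worker.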
Circle constraint: a sequence of correspondences creates a closed circle
Δ: probability of compensating errors along the circle (*)
[Figure: attributes a, b, c in schemas S1, S2, S3, connected by correspondences ab, bc, ac that form a closed circle]
Pr(ab=T) = 0.8
Pr(ac=T) = 0.8
Pr(bc=T) = 0.8

ab  bc  ac  Prob   Pr(constraint)
T   T   T   0.512  1.0
T   T   F   0.128  0.0
T   F   T   0.128  0.0
T   F   F   0.032  Δ
F   T   T   0.128  0.0
F   T   F   0.032  Δ
F   F   T   0.032  Δ
F   F   F   0.008  Δ
(each Prob follows by independence, e.g. 0.512 = 0.8 x 0.8 x 0.8)

Pr(ab=T | circle constraint) = (0.512 + Δ x 0.032) / (0.512 + 3 x Δ x 0.032 + Δ x 0.008)

Without Constraint
Corr  Aggregation  Error Rate
ab    True         0.2 (**)

With Constraint
Corr  Aggregation  Error Rate
ab    True         0.027 (***)

(*) Cudré‐Mauroux et al. “Probabilistic Message Passing in Peer Data Management Systems”. ICDE, 2006
(**) Error Rate = 1 – Pr(ab=T)
(***) Error Rate = 1 – Pr(ab=T | circle constraint)
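The circle example can be worked through the same way. In the sketch below, an assignment with no wrong correspondence closes the circle with probability 1.0, one with exactly one wrong correspondence never closes it, and one with two or more errors closes it with probability Δ. The value Δ = 0.2 is an assumption, chosen because it gives an error rate close to the 0.027 on the slide; the slide's own value of Δ is not available here. The function name `circle_posterior` is hypothetical.

```python
from itertools import product


def circle_posterior(p, delta):
    """Pr(ab=T | circle constraint) for three correspondences (ab, bc, ac).

    p: marginal Pr(correspondence = T), assumed equal and independent.
    delta: probability that two or more errors compensate along the circle.
    """
    numerator = denominator = 0.0
    for world in product([True, False], repeat=3):  # (ab, bc, ac)
        prior = 1.0
        for value in world:
            prior *= p if value else 1 - p
        errors = world.count(False)
        if errors == 0:
            closes = 1.0    # all correct: the circle closes
        elif errors == 1:
            closes = 0.0    # a single error cannot be compensated
        else:
            closes = delta  # several errors may cancel out
        denominator += prior * closes
        if world[0]:        # ab = T
            numerator += prior * closes
    return numerator / denominator


DELTA = 0.2  # assumed value, not taken from the slide
error_without = 1 - 0.8
error_with = 1 - circle_posterior(0.8, DELTA)
print(round(error_without, 3), round(error_with, 3))
```

A smaller Δ makes the constraint more informative: with Δ = 0 the only consistent assignment is the one in which all three correspondences are true.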
Settings:
Real‐world schemas.
Use ground truth to simulate users/workers.
Error Threshold = 0.1: make a decision when error rate < 0.1; otherwise, continue to ask users.
Metric: Cost =
Observation: the cost with constraints is lower than the cost without constraints.
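The stopping rule in the settings can be sketched as a loop that keeps consuming answers until the error rate falls below the threshold. This is an illustration only: it assumes a uniform prior, independent workers with one shared reliability, and no constraints, and it uses the number of answers as a stand-in for cost, since the slide's cost formula is not available. The function name `answers_needed` is hypothetical.

```python
def answers_needed(answers, reliability, threshold=0.1):
    """Consume answers one by one until the error rate drops below threshold.

    answers: iterable of bool (True = YES) for a single correspondence.
    reliability: assumed Pr(worker answers correctly), same for all workers.
    Returns (decision, error_rate, answers_used); decision is None if the
    threshold was never reached.
    """
    like_true = like_false = 1.0
    error_rate = 0.5  # no evidence yet under a uniform prior
    used = 0
    for answer in answers:
        used += 1
        like_true *= reliability if answer else 1 - reliability
        like_false *= 1 - reliability if answer else reliability
        p_true = like_true / (like_true + like_false)
        error_rate = min(p_true, 1 - p_true)
        if error_rate < threshold:
            return p_true >= 0.5, error_rate, used
    return None, error_rate, used


# Unanimous YES answers from workers with reliability 0.6 and 0.8.
print(answers_needed([True] * 10, 0.6))
print(answers_needed([True] * 10, 0.8))
```

Under these assumptions, workers with reliability 0.6 need six unanimous answers to get below 0.1, while workers with reliability 0.8 need two, which illustrates the cost‐accuracy tradeoff that the constraints are meant to ease.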
We model a crowdsourcing process for schema matching networks and address two optimization goals: minimize monetary cost and maximize accuracy (minimize error rate).
We design a variety of questions with different support information.
We leverage consistency constraints to reduce the error rate, which in turn reduces the monetary cost.