On Leveraging Crowdsourcing Techniques for Schema Matching Networks

Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Karl Aberer
École Polytechnique Fédérale de Lausanne, Switzerland

Zoltán Miklós
Université de Rennes 1, IRISA, France

DASFAA 2013, Part II, LNCS 7826, pp. 139–154, 2013

Page 2: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

Schema matching is the task of establishing correspondences that connect related attributes in two (independently developed) database schemas.

Database schema matching is an active research field:

Surveys: [1], [2]

Applications: data transformation, data migration, data alignment, …

Automatic matching tools: COMA++, AMC, OpenII, Falcon, …

[Figure: two example schemas, SA and SB, with attributes such as BirthName, BirthDate, and Address linked by correspondences.]

[1] Rahm, E. et al. “A Survey of Approaches to Automatic Schema Matching”. VLDB Journal, 2001

[2] Bernstein, P.A. et al. “Generic Schema Matching, Ten Years Later”. PVLDB, 2011

Page 3: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

Automatic schema matchers will (sometimes) fail to identify the correct correspondences.

There is a need for post-matching reconciliation through human input. This effort is the « real cost » in the company.

Schemas do not appear alone; they are part of a matching network.

The network-level consistency constraints are very important for business users.

Page 4: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

Real-world scenario: a repository of schemas in the same domain

Schema matching network: connect schemas by pair-wise matchings

Network-level consistency constraints

Automatic tools produce incorrect correspondences, so validation by humans is needed

Page 7: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

DASFAA’2013, BDA’2013: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

ER’2013: Minimizing Human Effort in Reconciling Match Networks

CoopIS’2013: Collaborative Schema Matching Reconciliation

ICDE’2014: Pay-as-you-go Reconciliation in Schema Matching Networks

Page 8: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

“Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers.” (Wiki)

Surveys: [1], [2]

A wide range of applications (e.g. CrowdSearch) have been developed on top of more than 70 crowdsourcing platforms (e.g. Amazon Mechanical Turk).

Our context: employ many workers (users) to validate the same correspondences and combine their answers.

Our contribution:

Define network-level constraints in schema matching networks

Design questions for workers to validate correspondences

Leverage network-level constraints to reduce user effort

[1] E. Law et al. “Human Computation”. Morgan & Claypool Publishers, 2011

[2] A. Doan et al. “Crowdsourcing systems on the World Wide Web”. CACM, 2011

Page 11: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

Three elements of questions:

Asking object: correspondence

Possible choices: simple YES/NO question

Support information: alternatives, constraint satisfactions, constraint violations
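The three elements of a question could be held in a small record like the one below. This is only a sketch: the class name `Question`, its field names, and the example attribute names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    """Hypothetical container for one validation question shown to a worker."""
    # Asking object: the correspondence (source attribute, target attribute).
    correspondence: tuple[str, str]
    # Possible choices: a simple YES/NO question.
    choices: tuple[str, str] = ("YES", "NO")
    # Support information, keyed by kind: alternatives, constraint
    # satisfactions, constraint violations.
    support_info: dict[str, list[str]] = field(default_factory=dict)


# Illustrative values only.
q = Question(
    correspondence=("SA.BirthName", "SB.BirthName"),
    support_info={"alternatives": ["SA.BirthName ~ SB.BirthDate"]},
)
```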

Page 12: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

User feedback:

User  Question  Answer
U1    C         Yes
U2    C         Yes
U3    C         No

User quality:

User  Reliability
U1    r1
U2    r2
U3    r3

r1 = Pr(C=true | U1=yes) = Pr(C=false | U1=no)

Answer aggregation: a probabilistic model (*) combines user feedback and user quality to obtain Pr(C), from which we compute <a, e>, the aggregated value and its error rate.

Corr  Aggregation  Error Rate
C     True         0.19

(*) Majority Voting, Expectation Maximization, … See full paper for details.
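One possible way to compute the pair <a, e> is a naive Bayes update over the workers' answers. The sketch below is not the model from the full paper: it assumes independent workers, a uniform prior, and symmetric reliabilities, and the function name `aggregate` and the reliability values in the example are made up for illustration.

```python
from math import prod


def aggregate(answers, reliabilities, prior=0.5):
    """Return (aggregated_value, error_rate) for one correspondence.

    answers: list of bool, True meaning the worker answered "yes".
    reliabilities: list of r_i, one per worker (assumed symmetric).
    """
    pairs = list(zip(answers, reliabilities))
    # Likelihood of the observed answers if the correspondence is true / false.
    like_true = prod(r if yes else 1 - r for yes, r in pairs)
    like_false = prod(1 - r if yes else r for yes, r in pairs)
    p_true = prior * like_true / (prior * like_true + (1 - prior) * like_false)
    value = p_true >= 0.5
    # The error rate is the probability that the aggregated value is wrong.
    error = 1 - p_true if value else p_true
    return value, error


# Answers Yes, Yes, No with hypothetical reliabilities of 0.8 each.
value, error = aggregate([True, True, False], [0.8, 0.8, 0.8])
```

With these assumed reliabilities the aggregated value is True with an error rate of 0.2; the slide's 0.19 depends on reliabilities that are not given here.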

Page 13: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

To achieve higher accuracy, we need more answers: a cost-accuracy tradeoff.

[Figure residue: plot labelled r = 0.6 and "Goal".]

Solution: leverage constraints to reduce the error rate.
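The tradeoff can be illustrated with plain majority voting, which is one of the aggregation options named earlier. The sketch assumes an odd number of independent workers who all share reliability r = 0.6; the function name `majority_vote_error` is hypothetical.

```python
from math import comb


def majority_vote_error(n_workers, r):
    """Probability that a majority of n_workers (odd) answers incorrectly.

    Each worker is assumed to be correct independently with probability r.
    """
    # The majority is wrong when at most n_workers // 2 workers are correct.
    return sum(
        comb(n_workers, k) * r**k * (1 - r) ** (n_workers - k)
        for k in range(n_workers // 2 + 1)
    )


for n in (1, 3, 5):
    print(n, round(majority_vote_error(n, 0.6), 5))
```

Under these assumptions the error rate falls only slowly (0.4, 0.352, 0.31744 for 1, 3, and 5 workers), which is why extra accuracy is expensive when workers are not very reliable.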

Page 14: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

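Idea: correspondences support each other if they satisfy a constraint.

1-1 constraint: ONE source attribute matches only ONE target attribute.

[Figure: attribute a in schema S with candidate correspondences ab1 and ab2 to attributes b1 and b2 in schema T.]

Pr(ab1=true) = 0.8

Pr(ab2=false) = 0.6

ab1  ab2  Prob  1-1 constraint
T    T    0.32  not satisfied
T    F    0.48  satisfied (by independence, 0.8 x 0.6)
F    T    0.08  satisfied
F    F    0.12  satisfied

Pr(ab2=false | 1-1 constraint) = (0.48 + 0.12) / (0.48 + 0.08 + 0.12) ≈ 0.88

Without constraint:

Corr  Aggregation  Error Rate
ab2   False        0.4 (*)

With constraint:

Corr  Aggregation  Error Rate
ab2   False        0.12 (**)

Error rate without constraint > error rate with constraint (0.4 > 0.12).

(*) Error Rate = 1 – Pr(ab2=false)

(**) Error Rate = 1 – Pr(ab2=false | 1-1 constraint)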
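The 1-1 example can be checked by enumerating the four truth assignments and conditioning on those that satisfy the constraint. This is a minimal sketch that assumes the two correspondences are independent before conditioning, as the slide does; the function name is hypothetical.

```python
from itertools import product


def error_rate_with_one_to_one(p_ab1_true, p_ab2_true):
    """Error rate of deciding ab2 = false, given the 1-1 constraint holds."""
    consistent = {}
    for ab1, ab2 in product([True, False], repeat=2):
        prob = (p_ab1_true if ab1 else 1 - p_ab1_true) * (
            p_ab2_true if ab2 else 1 - p_ab2_true
        )
        # 1-1 constraint: attribute a cannot match both b1 and b2.
        if not (ab1 and ab2):
            consistent[(ab1, ab2)] = prob
    total = sum(consistent.values())
    p_ab2_false = sum(p for (_, ab2), p in consistent.items() if not ab2) / total
    return 1 - p_ab2_false


# Pr(ab1=true) = 0.8 and Pr(ab2=false) = 0.6, so Pr(ab2=true) = 0.4.
without_constraint = 1 - 0.6
with_constraint = error_rate_with_one_to_one(0.8, 0.4)
```

Here `with_constraint` comes out at about 0.118, which rounds to the 0.12 shown in the table, against 0.4 without the constraint.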

Page 15: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

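Circle constraint: a sequence of correspondences creates a closed circle.

Δ: probability of compensating errors along the circle (*)

[Figure: schemas S1, S2, and S3 with attributes a, b, and c, linked by the correspondences ab, bc, and ac.]

Pr(ab=T) = 0.8

Pr(ac=T) = 0.8

Pr(bc=T) = 0.8

ab  bc  ac  Prob   Pr(circle constraint satisfied)
T   T   T   0.512  1.0 (by independence, 0.8 x 0.8 x 0.8)
T   T   F   0.128  0.0
T   F   T   0.128  0.0
T   F   F   0.032  Δ
F   T   T   0.128  0.0
F   T   F   0.032  Δ
F   F   T   0.032  Δ
F   F   F   0.008  Δ

Pr(ab=T | circle constraint) = (0.512 + Δ x 0.032) / (0.512 + 3 x Δ x 0.032 + Δ x 0.008), where the value of Δ is missing from this transcript.

Without constraint:

Corr  Aggregation  Error Rate
ab    True         0.2 (**)

With constraint:

Corr  Aggregation  Error Rate
ab    True         0.027 (***)

Error rate without constraint > error rate with constraint (0.2 > 0.027).

(*) Cudré-Mauroux et al. “Probabilistic message passing in peer data management systems”. ICDE 2006.

(**) Error Rate = 1 – Pr(ab=T)

(***) Error Rate = 1 – Pr(ab=T | circle constraint)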
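The circle example can be reproduced by weighting each of the eight truth assignments with the probability that it satisfies the constraint. In this sketch the weights follow the table (1.0 when all three correspondences hold, 0.0 when exactly one is false, Δ otherwise). The value Δ = 0.2 is an assumption: it is not stated in this transcript and was chosen because it reproduces the reported 0.027. The function name is hypothetical.

```python
from itertools import product


def circle_error_rate(p_true=0.8, delta=0.2):
    """Error rate of deciding ab = true, given the circle constraint holds.

    p_true: prior probability that each of ab, bc, ac is correct.
    delta: assumed probability of compensating errors along the circle.
    """
    numerator = 0.0    # assignments where ab is true
    denominator = 0.0  # all assignments
    for ab, bc, ac in product([True, False], repeat=3):
        prob = 1.0
        for holds in (ab, bc, ac):
            prob *= p_true if holds else 1 - p_true
        n_true = ab + bc + ac
        if n_true == 3:
            satisfied = 1.0   # all correct: circle closes
        elif n_true == 2:
            satisfied = 0.0   # a single error cannot be compensated
        else:
            satisfied = delta  # two or more errors may compensate
        denominator += prob * satisfied
        if ab:
            numerator += prob * satisfied
    return 1 - numerator / denominator


with_constraint = circle_error_rate()
```

With the assumed Δ the result is about 0.027, compared with 0.2 without the constraint.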

Page 16: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

Settings:

Real-world schemas.

Use ground truth to simulate users/workers.

Error threshold = 0.1: make a decision when the error rate < 0.1; otherwise, continue to ask users.

Metric: cost (the formula is missing from this transcript).

Observation: the cost with constraints is lower than the cost without constraints.
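The stopping rule in the settings can be sketched as a loop that keeps consuming answers until the error rate drops below the threshold. This is an illustration only: it uses the number of answers as a stand-in for cost, a Bayesian update with a uniform prior, and a single reliability shared by all workers, none of which is confirmed by the slides. The function name is hypothetical.

```python
def answers_until_decision(answers, reliability, threshold=0.1, prior=0.5):
    """Return (answers_used, decision, error_rate), or None if undecided.

    answers: iterable of bool, True meaning a worker answered "yes".
    reliability: assumed probability that any single worker is correct.
    """
    p_true = prior
    for used, yes in enumerate(answers, start=1):
        like_true = reliability if yes else 1 - reliability
        like_false = 1 - reliability if yes else reliability
        # Bayesian update of Pr(correspondence is true) after this answer.
        p_true = p_true * like_true / (
            p_true * like_true + (1 - p_true) * like_false
        )
        error = min(p_true, 1 - p_true)
        if error < threshold:
            return used, p_true >= 0.5, error
    return None  # threshold not reached: more answers would be needed


agreeing = answers_until_decision([True, True, False, True], 0.8)
conflicting = answers_until_decision([True, False, True, True], 0.8)
```

In this sketch two agreeing answers are enough to decide, while a conflicting answer delays the decision until the fourth answer.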

Page 17: On Leveraging Crowdsourcing Techniques for Schema Matching Networks

We model a crowdsourcing process for schema matching networks.

We address two optimization goals: minimize monetary cost and maximize accuracy (minimize error rate).

We design a variety of questions with different support information.

We leverage consistency constraints to reduce the error rate and, in turn, the monetary cost.
