On Leveraging Crowdsourcing Techniques for Schema Matching Networks


Nguyen Quoc Viet Hung, Nguyen Thanh Tam, Karl Aberer

École Polytechnique Fédérale de Lausanne, Switzerland

Zoltán Miklós

Université de Rennes 1, IRISA, France

DASFAA 2013, Part II, LNCS 7826, pp. 139 – 154, 2013


Database schema matching is an active research field:
Surveys: [1], [2]
Applications: data transformation, data migration, data alignment, …
Automatic matching tools: COMA++, AMC, OpenII, Falcon, …

Schema matching is the task of establishing correspondences that connect related attributes in two (independently developed) database schemas.

[1] Rahm, E. et al. “A Survey of Approaches to Automatic Schema Matching”. JVLDB, 2001
[2] Bernstein, P.A. et al. “Generic Schema Matching, Ten Years Later”. PVLDB, 2011

[Figure: two example schemas, SA and SB, with attributes BirthName, BirthDate, and Address.]


Automatic schema matchers will (sometimes) fail to identify the correct correspondences.

There is a need for post-matching reconciliation through human input.
This effort is the “real cost” in the company.

Schemas do not appear alone; they are part of a matching network.

The network-level consistency constraints are very important for business users.


Real‐world scenario: a repository of schemas in the same domain

Schema matching network: connect schemas by pair‐wise matchings

Network‐level consistency constraints

Automatic tools produce incorrect correspondences, which need validation by humans.


DASFAA'2013, BDA'2013: On Leveraging Crowdsourcing Techniques for Schema Matching Networks
ER'2013: Minimizing Human Effort in Reconciling Match Networks
CoopIS'2013: Collaborative Schema Matching Reconciliation
ICDE'2014: Pay-as-you-go Reconciliation in Schema Matching Networks


“Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers.” ‐Wiki

Our context: employ many workers (users) to validate the same correspondences and combine their answers.

Surveys: [1], [2]
A wide range of applications (e.g. CrowdSearch) have been developed on top of more than 70 crowdsourcing platforms (e.g. Amazon Mechanical Turk).

Our contribution:
Define network-level constraints in schema matching networks
Design questions for workers to validate correspondences
Leverage network-level constraints to reduce user effort

[1] E. Law et al. “Human Computation”. Morgan & Claypool Publishers, 2011
[2] A. Doan et al. “Crowdsourcing systems on the World Wide Web”. CACM, 2011


Three elements of questions:
Asking object: correspondence
Possible choices: simple YES/NO question
Support information: alternatives, constraint satisfactions, constraint violations


User feedback:

User  Question  Answer
U1    C         Yes
U2    C         Yes
U3    C         No

User quality:

User  Reliability
U1    r1
U2    r2
U3    r3

Answer aggregation: a probabilistic model (*) takes the user feedback and the user quality as input and outputs Pr(C).

Corr  Aggregation  Error Rate
C     True         0.19

Compute <a, e>: the aggregated answer a and its error rate e.

r1 = Pr (C=true | U1=yes) = Pr (C=false | U1=no)

(*) Majority Voting, Expectation Maximization, …
See full paper for details.
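The aggregation step can be sketched in a few lines of Python. This is only an illustration, not the model from the full paper: it assumes that workers answer independently, that the prior on the correspondence is uniform, and that each reliability can be read as the chance that the worker answers correctly. The function name `aggregate` and the reliability values are hypothetical.

```python
from math import prod

def aggregate(answers, reliabilities, prior_true=0.5):
    """Combine YES/NO answers on one correspondence into <a, e>.

    answers: list of bools (True = YES), one per worker.
    reliabilities: assumed probability that each worker answers correctly.
    Returns (aggregated answer, error rate).
    """
    # Likelihood of the observed answers if the correspondence is true / false.
    like_true = prod(r if ans else 1 - r for ans, r in zip(answers, reliabilities))
    like_false = prod(1 - r if ans else r for ans, r in zip(answers, reliabilities))
    # Bayes' rule gives Pr(C = true | answers).
    p_true = prior_true * like_true / (
        prior_true * like_true + (1 - prior_true) * like_false
    )
    decision = p_true >= 0.5
    error_rate = 1 - max(p_true, 1 - p_true)
    return decision, error_rate

# Hypothetical reliabilities for U1, U2, U3, who answered Yes, Yes, No.
decision, error_rate = aggregate([True, True, False], [0.8, 0.8, 0.8])
```

Under these assumptions, one YES and one NO from equally reliable workers cancel out, so the remaining YES alone determines the result.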


To achieve higher accuracy, we need more answers: a cost-accuracy tradeoff.

Solution: leverage constraints to reduce the error rate.

[Figure labels: r = 0.6; Goal]
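The cost-accuracy tradeoff can be illustrated with majority voting, one of the aggregation options named earlier. The sketch below is a hedged example rather than the paper's analysis: it assumes independent workers who all share the same reliability r and an odd number of answers, and the helper name `majority_error` is hypothetical.

```python
from math import comb

def majority_error(n_answers, r):
    """Probability that a majority vote over n_answers workers is wrong.

    Assumes independent workers with identical reliability r and an odd
    n_answers, so that ties cannot occur.
    """
    wrong_needed = n_answers // 2 + 1  # wrong answers needed to win the vote
    return sum(
        comb(n_answers, k) * (1 - r) ** k * r ** (n_answers - k)
        for k in range(wrong_needed, n_answers + 1)
    )

# Error rate as more answers are collected, for r = 0.6.
for n in (1, 3, 5, 7, 9):
    print(n, round(majority_error(n, 0.6), 3))
```

With r = 0.6 the error rate falls only slowly as answers are added, so each gain in accuracy has to be paid for with more questions.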


Idea: correspondences support each other if they satisfy a constraint

1-1 constraint: ONE source attribute matches only ONE target attribute

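[Figure: attribute a of schema S has two candidate correspondences, ab1 and ab2, to attributes b1 and b2 of schema T.]

Pr(ab1=true) = 0.8
Pr(ab2=false) = 0.6

ab1  ab2  Prob  1-1 constraint
T    T    0.32  not satisfied
T    F    0.48  satisfied
F    T    0.08  satisfied
F    F    0.12  satisfied

By independence, each row is a product, e.g. Pr(ab1=T, ab2=F) = 0.8 x 0.6 = 0.48.

$$\Pr(ab_2 = \text{false} \mid \text{1-1 constraint}) = \frac{0.48 + 0.12}{0.48 + 0.08 + 0.12} \approx 0.88$$

Setting             Corr  Aggregation  Error Rate
Without constraint  ab2   False        0.4 (*)
With constraint     ab2   False        0.12 (**)

(*) Error Rate = 1 - Pr(ab2=false)
(**) Error Rate = 1 - Pr(ab2=false | 1-1 constraint)

Error rate without constraint (0.4) > error rate with constraint (0.12).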
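The 1-1 constraint calculation can be reproduced by enumerating the four truth assignments and keeping only those that satisfy the constraint. This is a minimal sketch that assumes, as on the slide, that the two correspondences are independent before conditioning; the function name `posterior_with_one_to_one` is hypothetical.

```python
from itertools import product

def posterior_with_one_to_one(p_ab1_true, p_ab2_true):
    """Pr(ab2 = false | at most one of ab1, ab2 is true)."""
    satisfied = 0.0  # probability mass of assignments that satisfy the constraint
    ab2_false = 0.0  # part of that mass in which ab2 is false
    for ab1, ab2 in product([True, False], repeat=2):
        prob = (p_ab1_true if ab1 else 1 - p_ab1_true) * (
            p_ab2_true if ab2 else 1 - p_ab2_true
        )
        if ab1 and ab2:
            continue  # both true violates the 1-1 constraint
        satisfied += prob
        if not ab2:
            ab2_false += prob
    return ab2_false / satisfied

p = posterior_with_one_to_one(0.8, 0.4)  # Pr(ab2=true) = 1 - 0.6 = 0.4
error_rate = 1 - p
```

Here `error_rate` comes out at roughly 0.12, compared with 0.4 when the constraint is ignored.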


Circle constraint: a sequence of correspondences creates a closed circle.
Δ: probability of compensating errors along the circle (*)

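[Figure: three schemas S1, S2, and S3 whose attributes a, b, and c are connected by the correspondences ab, bc, and ac, forming a closed circle.]

Pr(ab=T) = 0.8
Pr(bc=T) = 0.8
Pr(ac=T) = 0.8

ab  bc  ac  Prob   Pr(constraint satisfied)
T   T   T   0.512  1.0
T   T   F   0.128  0.0
T   F   T   0.128  0.0
T   F   F   0.032  Δ
F   T   T   0.128  0.0
F   T   F   0.032  Δ
F   F   T   0.032  Δ
F   F   F   0.008  Δ

By independence, each row is a product, e.g. Pr(ab=T, bc=T, ac=T) = 0.8 x 0.8 x 0.8 = 0.512.

$$\Pr(ab = T \mid \text{circle constraint}) = \frac{0.512 + \Delta \cdot 0.032}{0.512 + 3\Delta \cdot 0.032 + \Delta \cdot 0.008}$$

Setting             Corr  Aggregation  Error Rate
Without constraint  ab    True         0.2 (**)
With constraint     ab    True         0.027 (***)

(**) Error Rate = 1 - Pr(ab=T)
(***) Error Rate = 1 - Pr(ab=T | circle constraint)

Error rate without constraint (0.2) > error rate with constraint (0.027).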

* Cudré-Mauroux, et al. Probabilistic message passing in peer data management systems. ICDE 2006.
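The circle constraint calculation can be sketched the same way, by weighting each of the eight truth assignments with the probability that it satisfies the constraint. The value of Δ is not legible in this copy of the slides, so `delta=0.2` below is an assumption; it is chosen because it yields an error rate that rounds to the 0.027 reported above. The function name `posterior_with_circle` is hypothetical.

```python
from itertools import product

def posterior_with_circle(p_true, delta):
    """Pr(ab = T | circle constraint) for three correspondences ab, bc, ac.

    p_true: Pr(correspondence = T), assumed equal and independent for all three.
    delta: probability of compensating errors along the circle.
    """
    numerator = 0.0    # weighted mass of assignments in which ab is true
    denominator = 0.0  # weighted mass of all assignments
    for ab, bc, ac in product([True, False], repeat=3):
        prob = 1.0
        for value in (ab, bc, ac):
            prob *= p_true if value else 1 - p_true
        n_false = [ab, bc, ac].count(False)
        # 0 false: circle holds; 1 false: circle broken; 2+ false: errors may compensate.
        weight = 1.0 if n_false == 0 else 0.0 if n_false == 1 else delta
        denominator += prob * weight
        if ab:
            numerator += prob * weight
    return numerator / denominator

# delta = 0.2 is an assumed value, not taken from the slides.
error_rate = 1 - posterior_with_circle(0.8, delta=0.2)
```

Treating Δ as a parameter makes it easy to see how strongly the result depends on it: a larger Δ gives more weight to assignments with several wrong correspondences and raises the error rate.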


Settings:
Real-world schemas.
Use ground truth to simulate users/workers.
Error threshold = 0.1: make a decision when the error rate < 0.1; otherwise, continue to ask users.
Metric: Cost =

Observation: the cost with constraints is lower than the cost without constraints.
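The stopping rule in the settings (keep asking until the error rate drops below the threshold) can be sketched as a small simulation. This is an illustrative assumption-laden sketch, not the experimental code: simulated workers answer independently with a single known reliability, the posterior is updated with Bayes' rule from a uniform prior, and cost is counted simply as the number of answers. The function name `ask_until_confident` and its parameters are hypothetical.

```python
import random

def ask_until_confident(truth, reliability, threshold=0.1, max_answers=100, seed=0):
    """Ask simulated workers about one correspondence until error < threshold.

    truth: ground-truth value of the correspondence, used to simulate answers.
    Returns (decision, error rate, number of answers asked).
    """
    rng = random.Random(seed)
    p_true = 0.5  # uniform prior on the correspondence being correct
    error = 0.5
    for n in range(1, max_answers + 1):
        # A simulated worker reports the truth with probability `reliability`.
        answer = truth if rng.random() < reliability else not truth
        like_true = reliability if answer else 1 - reliability
        like_false = 1 - reliability if answer else reliability
        p_true = p_true * like_true / (p_true * like_true + (1 - p_true) * like_false)
        error = 1 - max(p_true, 1 - p_true)
        if error < threshold:
            return p_true >= 0.5, error, n
    return p_true >= 0.5, error, max_answers

decision, error, n_answers = ask_until_confident(True, reliability=0.8, seed=1)
```

Any mechanism that lowers the error rate per answer, such as the constraints above, lets this loop stop earlier and therefore lowers the cost.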


We model a crowdsourcing process for schema matching networks.

We address two optimization goals: minimize monetary cost and maximize accuracy (minimize error rate).

We design a variety of questions with different support information.
We leverage consistency constraints to reduce the error rate, which in turn reduces the monetary cost.
