
MANAGING UNCERTAINTY OF XML SCHEMA MATCHING

Reynold Cheng, Jian Gong, David W. Cheung

ICDE’2010


THE DATA INTEGRATION PROBLEM

Querying the source data through a target query interface
  E.g., querying multiple data sources through a mediated query interface

[Figure: data sources, each with a source schema, linked by schema mappings to the target schema behind the query interface]

SCHEMA MATCHING & MAPPING

Schema matching: finding element correspondences, with similarities, between schemas
Schema mapping: a set of one-to-one correspondences between two schemas
  Generation: pick the best correspondences

Sample mapping: Order - ORDER, BP - IP, BCN - ICN, …

SCHEMA MAPPING AND UNCERTAINTY

The mapping between schemas can be uncertain
  Example: Purchase Order schemas
  Which one is correct?

Uncertain mappings
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …

Compute Pr(Mi) by:
  1) aggregating the similarities of the correspondences, and
  2) normalizing the probabilities of the top-k mappings
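The two steps for computing Pr(Mi) can be illustrated with a minimal Python sketch. It assumes that "aggregating" means summing the similarities of a mapping's correspondences (the score used for mapping generation later in the deck) and that "normalizing" divides each top-k score by the total of the top-k scores. The function name and the similarity values are invented for illustration.

```python
def mapping_probabilities(mappings, similarity, k):
    """Return the k best mappings paired with normalized probabilities.

    mappings:   iterable of mappings, each a tuple of (source, target) pairs
    similarity: dict from (source, target) to a similarity value
    """
    # Step 1: aggregate (here: sum) the similarities of each mapping.
    scored = [(sum(similarity[c] for c in m), m) for m in mappings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    # Step 2: normalize the scores of the top-k mappings so they sum to 1.
    total = sum(score for score, _ in top)
    return [(m, score / total) for score, m in top]


# Hypothetical similarities for the purchase order example.
similarity = {("Order", "ORDER"): 0.9, ("BCN", "ICN"): 0.6, ("RCN", "ICN"): 0.5}
m1 = (("Order", "ORDER"), ("BCN", "ICN"))
m2 = (("Order", "ORDER"), ("RCN", "ICN"))
for mapping, probability in mapping_probabilities([m1, m2], similarity, k=2):
    print(mapping, round(probability, 3))
```

Other aggregation functions are possible; the sum is only the simplest choice consistent with the rest of the slides.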

DATA INTEGRATION RELOADED

Managing uncertainty of XML schema matching
  Issues: mapping generation and storage, query evaluation, etc.

[Figure: data sources, each with a source schema, linked by uncertain schema mappings to the mediated schema behind the query interface]

OBSERVATION

Sharing among uncertain mappings
  Overlapping: “Order~ORDER” shared by m1-m5
  “BP~IP” shared by m1, m2, m4, m5
  “BCN~ICN” shared by m1, m2
  …

[Figure: the uncertain mappings m1-m5]

OBSERVATION

How much overlap is there in real-world schema mappings?
  Overlapping ratio (o-ratio): the average overlap of the top-100 possible schema mappings

OUR CONTRIBUTION

Propose the block tree: a novel data structure to represent a set of mappings
  Definition
  Efficient generation
Propose the probabilistic twig query (PTQ)
  Definition
  Efficient evaluation with the block tree
  Top-k PTQ, and its computation issue
Improve the possible mapping generation process
  A divide-and-conquer approach
Conduct experiments on real data to validate our methods

RELATED WORK

Schema matching approaches and tools [RB01]
  COMA [DR02]
Managing uncertainty in schema matching
  Top-k schema mappings [Gal06]
  Generating top-k mappings [Murty86]
Query evaluation in data integration
  Theoretical foundation [Len02]
  Data integration with uncertainty [DHY07]
  XML query rewriting for data integration [YP04]
XML query evaluation
  Twig query [QYD07]
  Querying probabilistic XML documents [KYS08]

OUTLINE

Introduction
Problem
  Data model
  Query model
Techniques
Results
Conclusion

DATA MODEL

XML schema and document [QYD07]
  Node-labeled tree
  Document nodes may carry text values
Schema mapping [DHY07]
  One-to-one mapping

[Figure: two schemas and a document, each drawn as a node-labeled tree]

Uncertain mappings
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …

QUERY MODEL (SINGLE MAPPING)

Twig query through a target schema [YP04]
  Step 1: rewrite the target query into a source query, based on the schema mapping
    M1: Order-ORDER, BP-IP, BCN-ICN, …
  Step 2: evaluate the source query on the source document

[Figure: the target query over the target schema, the rewritten source query over the source schema, and the source document on which the source query is evaluated]
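Step 1 can be sketched in Python under strong simplifying assumptions: a twig query is a nested `(label, children)` tuple, a mapping is a dict from target labels to source labels, and the rewritten query keeps the shape of the target query. Connecting source elements whose structure differs from the target schema is not modeled. The `rewrite` function is illustrative and is not the rewriting algorithm of [YP04].

```python
def rewrite(query, target_to_source):
    """Rewrite a target twig query into a source twig query.

    query:            (label, [child queries]) over target schema labels
    target_to_source: dict from target label to source label (one mapping)
    Returns None when some query element has no correspondence.
    """
    label, children = query
    if label not in target_to_source:
        return None
    rewritten_children = []
    for child in children:
        rewritten = rewrite(child, target_to_source)
        if rewritten is None:
            return None
        rewritten_children.append(rewritten)
    return (target_to_source[label], rewritten_children)


# M1 from the slide: Order-ORDER, BP-IP, BCN-ICN (written source-target).
m1 = {"ORDER": "Order", "IP": "BP", "ICN": "BCN"}
target_query = ("ORDER", [("IP", [("ICN", [])])])
print(rewrite(target_query, m1))
```

Step 2 would then hand the rewritten twig to any twig pattern matcher over the source document.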

QUERY MODEL (UNCERTAIN MAPPINGS)

Query evaluation with uncertain mappings [DHY07]
  Mappings: pM = {(M1, Pr(M1)), …, (Mh, Pr(Mh))}
  The query answers from mapping Mi have probability Pr(Mi)

[Figure: the target query QT is rewritten under each mapping (Mi, Pr(Mi)) into a source query QSi; evaluating QSi yields the answers (Ri, Pr(Mi)), for i = 1, …, h]
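The rewrite-then-evaluate pipeline over uncertain mappings can be sketched as follows. This is a toy illustration: the query is a single label such as `//ICN`, the source document is a flat list of `(label, text)` nodes, and the document contents and probabilities are made up. Each mapping contributes its own answer set Ri tagged with Pr(Mi), as stated on the slide.

```python
def evaluate_by_mapping(target_label, p_mappings, source_doc):
    """Evaluate a single-label target query under every possible mapping.

    p_mappings: list of (target_to_source dict, probability) pairs
    source_doc: list of (label, text) nodes of the source document
    Returns a list of (answers, probability), one entry per mapping.
    """
    results = []
    for target_to_source, probability in p_mappings:
        # Rewriting: replace the target label by its source label.
        source_label = target_to_source.get(target_label)
        # Evaluation: collect the matching nodes of the source document.
        answers = [text for label, text in source_doc if label == source_label]
        results.append((answers, probability))
    return results


# Hypothetical source document and mapping probabilities.
source_doc = [("BCN", "Alice"), ("RCN", "Bob")]
p_mappings = [({"ICN": "BCN"}, 0.6), ({"ICN": "RCN"}, 0.4)]
for answers, probability in evaluate_by_mapping("ICN", p_mappings, source_doc):
    print(answers, probability)
```

The loop runs one rewriting and one evaluation per mapping, which is exactly the cost that grows with h.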

OUTLINE

Introduction
Problem
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion

THE BLOCK

Each block, which is attached to a target schema element, consists of:
  C: a set of correspondences
  M: a set of mappings
Semantics: the mappings in M share the correspondences in C
Drawback: an exponential number of blocks to handle

[Figure: three example blocks]

THE C-BLOCK

A c-block (constrained block) is a block which:
  Contains correspondences for all elements in its sub-tree (so that it is more useful for query evaluation)
  Has more shared mappings than a threshold (otherwise it is not worth storing)

[Figure: an example c-block, with |pM| = 5 and threshold = 0.4]
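The threshold condition can be sketched in Python for leaf elements, where the sub-tree condition holds trivially. The sketch assumes the threshold is a fraction of |pM| and that a block qualifies when its share is at least the threshold (the slide does not say whether the bound is strict). The five toy mappings are reconstructed from the labels used in this deck and are illustrative only.

```python
from collections import defaultdict

# Toy mappings m1-m5 as sets of (source, target) correspondences.
MAPPINGS = {
    "m1": {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN"), ("RCN", "SCN")},
    "m2": {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN"), ("OCN", "SCN")},
    "m3": {("Order", "ORDER"), ("SP", "IP"), ("BP", "SP"), ("RCN", "ICN"),
           ("OCN", "SCN")},
    "m4": {("Order", "ORDER"), ("BP", "IP"), ("RCN", "ICN"), ("BCN", "SCN")},
    "m5": {("Order", "ORDER"), ("BP", "IP"), ("OCN", "ICN"), ("BCN", "SCN")},
}


def shared_correspondences(mappings, threshold):
    """Map each correspondence to the mappings sharing it, keeping only
    those shared by at least `threshold` (a fraction) of all mappings."""
    sharers = defaultdict(set)
    for name, correspondences in mappings.items():
        for correspondence in correspondences:
            sharers[correspondence].add(name)
    return {c: names for c, names in sharers.items()
            if len(names) / len(mappings) >= threshold}


blocks = shared_correspondences(MAPPINGS, threshold=0.4)
for correspondence, names in sorted(blocks.items()):
    print(correspondence, sorted(names))
```

With |pM| = 5 and a threshold of 0.4, a correspondence must be shared by at least two mappings to be kept, so correspondences that occur in a single mapping are dropped.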

THE BLOCK TREE

Creation of the block tree
  Follows the structure of the target schema
  A bottom-up method

Lemma 1 (informal): The c-blocks for an element can be created from the c-blocks of its children.
Lemma 2 (informal): If an element has no c-block, then its parent (if any) has no c-block.
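One way to read the two lemmas is sketched below in Python. The sketch assumes a block is a `(C, M)` pair, that a parent's candidate c-blocks are formed by picking one block for the parent's own correspondence and one c-block per child, and that the resulting M is the intersection of the chosen mapping sets. The function name, the threshold semantics (a fraction, inclusive) and the sample data are assumptions; the slides do not spell out the actual construction.

```python
from itertools import product


def combine_child_blocks(own_blocks, child_block_lists, num_mappings, threshold):
    """Build the c-blocks of an element from the c-blocks of its children.

    own_blocks:        (C, M) pairs for the element's own correspondences
    child_block_lists: one list of (C, M) c-blocks per child element
    """
    # Lemma 2: a child without c-blocks leaves the parent without c-blocks.
    if any(not blocks for blocks in child_block_lists):
        return []
    result = []
    # Lemma 1: combine one own block with one c-block of every child.
    for combo in product(own_blocks, *child_block_lists):
        correspondences = frozenset().union(*(c for c, _ in combo))
        mappings = set.intersection(*(set(m) for _, m in combo))
        if len(mappings) / num_mappings >= threshold:
            result.append((correspondences, mappings))
    return result


# Element IP with child ICN, using toy blocks in the style of the deck.
ip_own = [(frozenset({("BP", "IP")}), {"m1", "m2", "m4", "m5"})]
icn_blocks = [(frozenset({("BCN", "ICN")}), {"m1", "m2"}),
              (frozenset({("RCN", "ICN")}), {"m3", "m4"})]
for c, m in combine_child_blocks(ip_own, [icn_blocks], 5, 0.4):
    print(sorted(c), sorted(m))
```

In this toy run only the combination of BP~IP with BCN~ICN survives, because pairing BP~IP with RCN~ICN leaves a single shared mapping, which is below the threshold.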

THE BLOCK TREE

Reducing the storage cost of uncertain mappings

[Figure: a block tree over the target schema elements ORDER, IP, ICN, SP and SCN, with blocks b1-b5 attached to the elements. Blocks shown (C = correspondences, M = mappings):
  C: BCN~ICN; M: m1, m2
  C: RCN~ICN; M: m3, m4
  C: OCN~SCN; M: m2, m3
  C: BCN~SCN; M: m4, m5
  C: BP~IP; M: m1, m2, m4, m5
  C: BP~IP, BCN~ICN; M: m1, m2
  C: Order~ORDER; M: m1, m2, m3, m4, m5
The mappings m1-m5 are stored next to the tree, with their shared parts replaced by links such as b2.C, b3.C, b4.C and b5.C.]

If part of a mapping is in the block tree, then replace it with a link

OUTLINE

Introduction
Problem
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion

QUERY EVALUATION AND UNCERTAINTY

The uncertainty in mappings may affect query answers

Uncertain mappings
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …

Target query Q: //ICN
  which finds all ICNs (contact names of invoice parties) in the purchase order

[Figure: an example source document, marking the nodes returned by M1 and the nodes returned by M2]

THE BASELINE APPROACH

Evaluate QT with each mapping in pM separately
Drawback
  When the mappings Mi are large, or h is large, the computation is expensive

[Figure: QT is rewritten under each (Mi, Pr(Mi)) into QSi, and each QSi is evaluated on the source document DS to yield (Ri, Pr(Mi)), for i = 1, …, h]

QUERY EVALUATION WITH BLOCK TREE

Consider the root of a query
  Case 1): the root is found in the block tree; use the blocks to evaluate the whole query
    Only one mapping in the block is used
    Deal with the remainder mappings
  Case 2): the root is not found; decompose the query (if possible), invoke recursion, and join the partial answers

[Figure: Case 1 example, a query with root IP and child ICN]
[Figure: Case 2 example, a query with root ORDER and branches IP/ICN and SP, split into ORDER + IP/ICN + SP; the parts are labeled "Direct query", "Recursion" and "Direct query"]
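The saving behind Case 1 can be illustrated with a deliberately simplified Python sketch: mappings that share the correspondence for the query element are grouped (playing the role of a block), the query is evaluated once per group, and the result is reused for every mapping in the group. This is not the block-tree algorithm itself. The query is a single label, the document is a flat list of `(label, text)` nodes, and all names and data are hypothetical.

```python
from collections import defaultdict


def evaluate_shared(target_label, mappings, source_doc):
    """Evaluate a single-label query once per group of agreeing mappings.

    mappings:   dict from mapping name to a target-to-source label dict
    source_doc: list of (label, text) nodes
    Returns (answers per mapping name, number of evaluations performed).
    """
    # Group mappings by the source label they assign to the query element.
    groups = defaultdict(list)
    for name, target_to_source in mappings.items():
        groups[target_to_source.get(target_label)].append(name)

    answers, evaluations = {}, 0
    for source_label, names in groups.items():
        # One evaluation stands in for every mapping of the group.
        result = [text for label, text in source_doc if label == source_label]
        evaluations += 1
        for name in names:
            answers[name] = result
    return answers, evaluations


mappings = {"m1": {"ICN": "BCN"}, "m2": {"ICN": "BCN"},
            "m3": {"ICN": "RCN"}, "m4": {"ICN": "RCN"},
            "m5": {"ICN": "OCN"}}
source_doc = [("BCN", "Alice"), ("RCN", "Bob"), ("OCN", "Carol")]
answers, evaluations = evaluate_shared("ICN", mappings, source_doc)
print(evaluations, answers)
```

Here five mappings need only three evaluations; the more the mappings overlap, the fewer evaluations remain.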

OUTLINE

Introduction
Problem
  Data model
  Query model
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion

MAPPING GENERATION

A mapping m for a schema S with another schema T contains a set of correspondences (es, et)
  et may be EMPTY, i.e., es matches no element in T
  Each element in S occurs exactly once in m
  Each element in T occurs at most once in m
  m’s score is the sum of the similarities of its correspondences

Problem definition
  Given: two schemas S and T, and a set of correspondences (es, et) with similarities (the schema matching results)
  Return: h mappings m1, …, mh, whose scores are the highest
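The problem definition can be made concrete with a brute-force Python sketch that enumerates every valid mapping and keeps the h best. It is exponential and only meant to pin down the constraints; it is not the ranking algorithm cited in the deck. `EMPTY` is modeled as `None`, and the element names and similarity values are hypothetical.

```python
import heapq
from itertools import product

EMPTY = None  # stands for "es matches no element in T"


def top_h_mappings(source, target, similarity, h):
    """Return the h highest-scoring mappings as (score, pairs) tuples."""
    candidates = []
    options = list(target) + [EMPTY]
    # Each source element occurs exactly once: choose one option for each.
    for assignment in product(options, repeat=len(source)):
        used = [t for t in assignment if t is not EMPTY]
        if len(used) != len(set(used)):
            continue  # a target element would occur more than once
        pairs = list(zip(source, assignment))
        # Only correspondences produced by the matcher may be used.
        if any(t is not EMPTY and (s, t) not in similarity for s, t in pairs):
            continue
        score = sum(similarity[(s, t)] for s, t in pairs if t is not EMPTY)
        candidates.append((score, pairs))
    return heapq.nlargest(h, candidates, key=lambda c: c[0])


similarity = {("BCN", "ICN"): 0.8, ("RCN", "ICN"): 0.6,
              ("RCN", "SCN"): 0.5, ("BCN", "SCN"): 0.3}
for score, pairs in top_h_mappings(["BCN", "RCN"], ["ICN", "SCN"], similarity, 2):
    print(round(score, 2), pairs)
```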

MAPPING GENERATION

Baseline solution
  Finding the h-maximum bipartite matchings (Min-Cost Flow)
  Polynomial in the size of the bipartite graph

MAPPING GENERATION

Observation: XML schema matching is usually sparse
Improvement: a divide-and-conquer approach
  Derive partitions (maximal connected sub-graphs) of the bipartite graph
  Find the top-h partial mappings from each partition
  Merge
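The partition and merge steps can be sketched in Python as follows. The sketch assumes that partitions are the connected components of the bipartite correspondence graph and that two partitions' top-h lists are merged by scoring all pairwise combinations and keeping the h best sums. Finding the top-h partial mappings inside a partition is left to any solver. Function names and data are illustrative.

```python
import heapq
from collections import defaultdict
from itertools import product


def partitions(correspondences):
    """Split (source, target) correspondences into connected components."""
    adjacency = defaultdict(set)
    for s, t in correspondences:
        adjacency[("S", s)].add(("T", t))
        adjacency[("T", t)].add(("S", s))
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, nodes = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in nodes:
                continue
            nodes.add(node)
            stack.extend(adjacency[node] - nodes)
        seen |= nodes
        components.append({(s, t) for s, t in correspondences
                           if ("S", s) in nodes})
    return components


def merge_top_h(left, right, h):
    """Merge two lists of (score, partial mapping) into the h best sums."""
    combined = [(ls + rs, lm + rm)
                for (ls, lm), (rs, rm) in product(left, right)]
    return heapq.nlargest(h, combined, key=lambda c: c[0])


print(partitions({("BCN", "ICN"), ("RCN", "ICN"), ("OCN", "SCN")}))
left = [(0.8, [("BCN", "ICN")]), (0.6, [("RCN", "ICN")])]
right = [(0.5, [("OCN", "SCN")]), (0.0, [])]
print(merge_top_h(left, right, 2))
```

Merging is safe because partitions share no elements: a combined mapping in the overall top-h can only be built from partial mappings that are in the top-h of their own partition.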

OUTLINE

Introduction
Problem
Techniques
Results
Conclusion

DATASET AND RESULTS

XML schemas and documents
  7 schemas for purchase orders, obtained from various E-Commerce standards (e.g., XCBL, OpenTrans)
  Accompanied by sample XML documents
Schema matching
  Tool: COMA++, with different schema matching methods
  10 datasets: (source-schema, target-schema, matching-method)
Target query
  10 hand-written queries

RESULTS

Uncertain mappings: do they really overlap?

RESULTS

How much space does the block tree save for storing uncertain mappings? And why?

RESULTS

Is the block tree effective?
  Intuitively, larger blocks tend to be more useful

RESULTS

The block tree can be efficiently created
  Fast, and controllable

RESULTS

Can the block tree really help to improve query performance?
  Varying the total number of mappings

RESULTS

Can it scale?
  Probabilistic twig query and top-k query

RESULTS

Top-h mapping generation
  Performance gain of partitioning

CONCLUSION

We study the problem of handling uncertainty in XML schema matching
Observation
  Overlapping mappings, sparse bipartite graphs, etc.
Approach
  The block tree
  Query evaluation with the block tree
  Generating uncertain mappings more efficiently
Future work
  Other types of queries, probabilistic documents, index updates, the relational scenario, etc.


THANKS!

Q & A


REFERENCES

[Len02] Lenzerini, “Data integration: a theoretical perspective”, in PODS, 2002
[YP04] Yu et al., “Constraint-based XML query rewriting for data integration”, in SIGMOD, 2004
[DR02] Do et al., “COMA: a system for flexible combination of schema matching approaches”, in VLDB, 2002
[Gal06] Gal, “Managing uncertainty in schema matching with top-k schema mappings”, in J. Data Semantics VI, 2006
[DHY07] Dong et al., “Data integration with uncertainty”, in VLDB, 2007
[QYD07] Qin et al., “TwigList: make twig pattern matching fast”, in DASFAA, 2007
[Murty86] Murty, “An algorithm for ranking all the assignment in increasing order of cost”, Operations Research, vol. 16, 1986
[RB01] Rahm et al., “A survey of approaches to automatic schema matching”, VLDB J., vol. 10, 2001
[KYS08] Kimelfeld et al., “Query efficiency in probabilistic XML models”, in SIGMOD, 2008
…

QUERY REWRITING

Given
  A target twig query QT
  A schema mapping m between S and T, which is a set of correspondences (es, et)
Mapping semantics
  For each sub-tree in the source document DS that contains a set of source elements in m, there exists a sub-tree in the target document DT that contains the corresponding target elements
Procedure
  For each element in QT, replace it with a source element
  Connect all the source elements

LEMMA 1

Lemma 1 (conceptually): The c-blocks for a schema element t can be created from the c-blocks of t’s children.

An example

[Figure: a schema tree with the elements Order, InvoiceTo, DeliverTo, Contact, name, email, Address, street, city and country. Nodes are annotated with counts such as 27|24|25|24, 51|49, 49|51, 52|48, 53|47, 50|50 and 100. Example blocks: b1.M: 1-52, b2.M: 53-100; b3.M: 1, 3, 5, …, b4.M: 2, 4, 6, …]

RESULTS

What kinds of queries did we use?