MANAGING UNCERTAINTY OF XML SCHEMA MATCHING

Reynold Cheng, Jian Gong, David W. Cheung

ICDE’2010

THE DATA INTEGRATION PROBLEM

Querying the source data through a target query interface
E.g., querying multiple data sources through a mediated query interface

[Figure labels: data source, source schema, schema mapping, target schema, query interface]

SCHEMA MATCHING & MAPPING

Schema matching: finding element correspondences, with similarities, between schemas
Schema mapping: a set of one-to-one correspondences between two schemas
Generation: pick the best correspondences

Sample mapping: Order - ORDER, BP - IP, BCN - ICN, ……

SCHEMA MAPPING AND UNCERTAINTY

The mapping between schemas can be uncertain
Example: Purchase Order schemas

Uncertain mappings (which one is correct?)
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …

Compute Pr(Mi) by:
  1) aggregating similarities of correspondences, and
  2) normalizing probabilities of top-k mappings
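The two-step computation of Pr(Mi) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the similarity values are made up, and summation is assumed as the aggregation function.

```python
def mapping_probabilities(mappings, similarity):
    """Turn the scores of the given top-k mappings into probabilities.

    mappings:   list of mappings, each a list of (source, target) pairs
    similarity: dict from (source, target) to a similarity value
    """
    # Step 1: aggregate the similarities of each mapping (assumed: plain sum).
    scores = [sum(similarity[c] for c in m) for m in mappings]
    # Step 2: normalize over the mappings that were kept.
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical similarity values, for illustration only.
similarity = {
    ("Order", "ORDER"): 0.9,
    ("BCN", "ICN"): 0.6,
    ("RCN", "ICN"): 0.4,
}
m1 = [("Order", "ORDER"), ("BCN", "ICN")]
m2 = [("Order", "ORDER"), ("RCN", "ICN")]
probs = mapping_probabilities([m1, m2], similarity)
```

With these toy values the first mapping receives the larger share, and the probabilities of the retained mappings sum to one.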

DATA INTEGRATION RELOADED

Managing uncertainty of XML schema matching
Issues: mapping generation and storage, query evaluation, etc.

[Figure labels: data source, source schema, uncertain schema mapping, mediated schema, query interface]

OBSERVATION

Sharing among uncertain mappings

Uncertain mappings

Overlapping: “Order~ORDER” shared by m1-m5

“BP~IP” shared by m1, m2, m4, m5

“BCN~ICN” shared by m1, m2

OBSERVATION

How much overlap is there in real-world schema mappings?
Overlapping ratio (o-ratio): the average overlap of the top-100 possible schema mappings
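One way to make the o-ratio concrete is sketched below. The slide does not give the exact formula, so this sketch assumes that the overlap of two mappings is the size of their intersection divided by the size of their union, averaged over all pairs; the three mappings are toy data.

```python
from itertools import combinations


def pair_overlap(m_a, m_b):
    """Assumed overlap measure: shared correspondences over all correspondences."""
    a, b = set(m_a), set(m_b)
    return len(a & b) / len(a | b)


def o_ratio(mappings):
    """Average pairwise overlap of a list of mappings."""
    pairs = list(combinations(mappings, 2))
    return sum(pair_overlap(a, b) for a, b in pairs) / len(pairs)


# Toy mappings (hypothetical), each a list of (source, target) pairs.
toy = [
    [("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN")],
    [("Order", "ORDER"), ("BP", "IP"), ("RCN", "ICN")],
    [("Order", "ORDER"), ("SP", "IP"), ("RCN", "ICN")],
]
ratio = o_ratio(toy)
```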

OUR CONTRIBUTION

Propose the block tree: a novel data structure to represent a set of mappings
  Definition
  Efficient generation
Propose the probabilistic twig query (PTQ)
  Definition
  Efficient evaluation with the block tree
  Top-k PTQ, and its computation issues
Improve the possible-mapping generation process
  A divide-and-conquer approach
Conduct experiments on real data to validate our methods

RELATED WORK

Schema matching approaches and tools [RB01]
  COMA [DR02]
Managing uncertainty in schema matching
  Top-k schema mappings [Gal06]
  Generating top-k mappings [Murty86]
Query evaluation in data integration
  Theoretical foundation [Len02]
  Data integration with uncertainty [DHY07]
  XML query rewriting for data integration [YP04]
XML query evaluation
  Twig query [QYD07]
  Querying probabilistic XML documents [KYS08]

OUTLINE

Introduction
Problem
  Data model
  Query model
Techniques
Results
Conclusion

DATA MODEL

XML schema and document [QYD07]
  Node-labeled tree
  Document nodes may carry text values
Schema mapping [DHY07]
  One-to-one mapping

[Figure labels: schema, document]

QUERY MODEL (SINGLE MAPPING)

Twig query through a target schema [YP04]
Step 1: rewrite the target query into a source query, based on the schema mapping
  M1: Order-ORDER, BP-IP, BCN-ICN, …
  [Figure labels: source schema, target schema, source query, target query]
Step 2: evaluate the source query on the source document
  [Figure labels: source query, source document]

QUERY MODEL (UNCERTAIN MAPPINGS)

Query evaluation with uncertain mappings [DHY07]
Mappings: pM = {(M1, Pr(M1)), …, (Mh, Pr(Mh))}
The query answers from mapping Mi have probability Pr(Mi)

[Figure: the target query QT is rewritten through each mapping (M1, Pr(M1)), …, (Mh, Pr(Mh)) into a source query, and evaluation yields the results (R1, Pr(M1)), …, (Rh, Pr(Mh))]
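The semantics above (answers obtained through Mi carry probability Pr(Mi)) can be illustrated with a small Python sketch. It is deliberately simplified: the query is a single target element, the document is a flat list of (tag, text) pairs, and all names, values and probabilities are hypothetical.

```python
def evaluate(target_tag, mapping, document):
    """Rewrite a one-element target query through `mapping`, then scan the document.

    mapping:  dict from source tag to target tag
    document: list of (tag, text) pairs standing in for an XML document
    """
    # Rewriting: find the source element that corresponds to the target element.
    source_tag = next((s for s, t in mapping.items() if t == target_tag), None)
    # Evaluation: collect the text of every node carrying that source tag.
    return [text for tag, text in document if tag == source_tag]


def answers_with_probabilities(target_tag, p_mappings, document):
    """Return one (result, probability) pair per possible mapping."""
    return [(evaluate(target_tag, m, document), pr) for m, pr in p_mappings]


document = [("BCN", "Alice"), ("RCN", "Bob")]
p_mappings = [
    ({"Order": "ORDER", "BCN": "ICN"}, 0.6),
    ({"Order": "ORDER", "RCN": "ICN"}, 0.4),
]
results = answers_with_probabilities("ICN", p_mappings, document)
```

Here the two mappings disagree on which source element corresponds to ICN, so the same target query returns different answers, each tagged with the probability of the mapping that produced it.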

OUTLINE

Introduction
Problem
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion

THE BLOCK

Each block, which is attached to a target schema element, consists of:
  C: a set of correspondences
  M: a set of mappings
Semantics: the mappings in M share the correspondences in C
Drawback: an exponential number of blocks to handle

THE C-BLOCK

A c-block (constrained block) is a block that:
  Contains correspondences for all elements in its sub-tree (so that it is more useful for query evaluation)
  Is shared by enough mappings to pass a threshold (otherwise it is not worth storing)

Example: |pM| = 5, threshold = 0.4
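A block and the two c-block conditions can be written down as a small Python sketch. The class and function names are hypothetical, the assumption that ICN sits under IP is for illustration only, and the threshold test uses "at least threshold × |pM| mappings", which is one possible reading of the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    element: str   # target schema element the block is attached to
    C: frozenset   # shared correspondences, as (source, target) pairs
    M: frozenset   # ids of the mappings that contain every correspondence in C


def is_c_block(block, subtree_elements, num_mappings, threshold):
    """Check the two c-block conditions for a block."""
    covered = {target for _, target in block.C}
    # Condition 1: every element of the sub-tree has a correspondence in C.
    covers_subtree = set(subtree_elements) <= covered
    # Condition 2 (assumed reading): enough mappings share the block.
    shared_enough = len(block.M) >= threshold * num_mappings
    return covers_subtree and shared_enough


subtree = ["IP", "ICN"]  # hypothetical sub-tree rooted at IP
full = Block("IP", frozenset({("BP", "IP"), ("BCN", "ICN")}), frozenset({1, 2}))
partial = Block("IP", frozenset({("BP", "IP")}), frozenset({1, 2, 4, 5}))
rare = Block("IP", full.C, frozenset({1}))

checks = [is_c_block(b, subtree, 5, 0.4) for b in (full, partial, rare)]
```

With |pM| = 5 and threshold 0.4, only the first block qualifies: the second misses a correspondence for ICN, and the third is shared by a single mapping.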

THE BLOCK TREE

Creation of the block tree
  Follows the structure of the target schema
  A bottom-up method
Lemma 1 (informal): the c-blocks for an element can be created from the c-blocks of its children.
Lemma 2 (informal): if an element has no c-block, then its parent (if any) has no c-block.
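The bottom-up idea behind the two lemmas can be sketched in Python as below. This is an interpretation rather than the authors' algorithm: a candidate c-block for an element is assumed to combine one correspondence for the element itself with one c-block from each child, intersecting the mapping sets and keeping the result only if it passes the threshold. The function name and the schema shape (ICN as the only child of IP) are hypothetical.

```python
from itertools import product


def c_blocks_for(own_corrs, children_blocks, num_mappings, threshold):
    """Build the c-blocks of one element from those of its children.

    own_corrs:       dict from a correspondence for this element to the set
                     of mapping ids that contain it
    children_blocks: one list per child, each holding (C, M) pairs
    """
    # Lemma 2: a child without c-blocks leaves the parent without c-blocks.
    if any(not blocks for blocks in children_blocks):
        return []
    result = []
    for corr, m_own in own_corrs.items():
        # Lemma 1: pick one c-block per child (a leaf yields one empty pick).
        for combo in product(*children_blocks):
            C, M = {corr}, set(m_own)
            for child_C, child_M in combo:
                C |= child_C
                M &= child_M  # only mappings shared by all parts remain
            if len(M) >= threshold * num_mappings:
                result.append((frozenset(C), frozenset(M)))
    return result


# Leaf element ICN: three candidate correspondences over five mappings.
icn_corrs = {("BCN", "ICN"): {1, 2}, ("RCN", "ICN"): {3, 4}, ("OCN", "ICN"): {5}}
icn_blocks = c_blocks_for(icn_corrs, [], 5, 0.4)

# Parent element IP, assumed to have ICN as its only child.
ip_corrs = {("BP", "IP"): {1, 2, 4, 5}, ("SP", "IP"): {3}}
ip_blocks = c_blocks_for(ip_corrs, [icn_blocks], 5, 0.4)
```

On this toy input the leaf keeps the blocks for BCN~ICN and RCN~ICN, and the parent keeps a single block with C = {BP~IP, BCN~ICN} and M = {m1, m2}.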

THE BLOCK TREE

Reducing the storage cost of uncertain mappings

Blocks shown in the example:
  b1: C: BCN~ICN; M: m1, m2
  C: RCN~ICN; M: m3, m4
  C: OCN~SCN; M: m2, m3
  C: BCN~SCN; M: m4, m5
  C: BP~IP; M: m1, m2, m4, m5
  C: BP~IP, BCN~ICN; M: m1, m2
  g3: C: Order~ORDER; M: m1, m2, m3, m4, m5

Mappings shown in the example (shared parts appear as links such as b5.C):
  m1: Order~ORDER, RCN~SCN, ...
  m2: Order~ORDER, OCN~SCN, ...
  m3: Order~ORDER, SP~IP, BP~SP, ...
  m4: Order~ORDER, BP~IP
  m5: Order~ORDER, BP~IP, OCN~ICN, ...

If part of a mapping is in the block tree, then replace it with a link
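The link replacement can be illustrated with the following Python sketch. The greedy "largest block first" order, the function name and the block names are all assumptions made for the example; the slide only states that shared parts are replaced by links.

```python
def compress(mapping_id, correspondences, blocks):
    """Replace the parts of a mapping that are covered by blocks with links.

    blocks: dict from a block name to (C, M), where C is a frozenset of
            correspondences and M a set of mapping ids
    Returns (links, remaining): the names of the linked blocks and the
    correspondences that still have to be stored explicitly.
    """
    remaining = set(correspondences)
    links = []
    # Assumed heuristic: try blocks with more correspondences first.
    for name, (C, M) in sorted(blocks.items(), key=lambda kv: -len(kv[1][0])):
        if mapping_id in M and C <= remaining:
            links.append(name)
            remaining -= C
    return links, remaining


# Hypothetical block names and contents.
blocks = {
    "blk_icn": (frozenset({("BCN", "ICN")}), {1, 2}),
    "blk_ip": (frozenset({("BP", "IP"), ("BCN", "ICN")}), {1, 2}),
    "blk_root": (frozenset({("Order", "ORDER")}), {1, 2, 3, 4, 5}),
}
m1 = {("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN"), ("RCN", "SCN")}
links, remaining = compress(1, m1, blocks)
```

In this toy case three of the four correspondences of the mapping are stored once in the blocks, and only one correspondence plus two links is kept with the mapping itself.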

QUERY EVALUATION AND UNCERTAINTY

The uncertainty in mappings may affect query answers
Target query Q: //ICN, which finds all ICNs (contact names of invoice parties) in the purchase order

[Figure: an example source document, marking the nodes returned under M1 and those returned under M2]

THE BASELINE APPROACH

Evaluate QT with each mapping in pM separately
Drawback: when a mapping Mi is large, or h is large, the computation is expensive

[Figure: QT is rewritten and evaluated once per mapping (M1, Pr(M1)), …, (Mh, Pr(Mh)), giving the results (R1, Pr(M1)), …, (Rh, Pr(Mh))]

QUERY EVALUATION WITH BLOCK TREE

Consider the root of a query
Case 1): the root is found in the block tree; use the blocks to evaluate the whole query
  Only one mapping in the block is used
  Deal with the remainder mappings
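Case 1 can be sketched in Python as follows. The sketch rests on several assumptions: mappings and blocks are stored as target-to-source dictionaries for easy rewriting, the query is a plain list of target elements, and `run_source_query` is a hypothetical stand-in for a real XML query engine. The point it illustrates is that all mappings in a block rewrite the query identically, so one evaluation serves the whole block.

```python
def evaluate_with_blocks(query, blocks, p_mappings, run_source_query):
    """Evaluate a target query once per covering block, then per remainder mapping.

    query:      list of target elements
    blocks:     list of (C, M); C maps target -> source, M is a set of mapping ids
    p_mappings: dict from mapping id to (target -> source dict, probability)
    """
    results = {}
    for C, M in blocks:
        if all(t in C for t in query):
            # One rewriting and one evaluation for every mapping in the block.
            answer = run_source_query([C[t] for t in query])
            for mid in M:
                results[mid] = (answer, p_mappings[mid][1])
    # Remainder mappings: not covered by any block, evaluated one by one.
    for mid, (mapping, pr) in p_mappings.items():
        if mid not in results:
            results[mid] = (run_source_query([mapping[t] for t in query]), pr)
    return results


# Toy document and a query engine stub that records how often it is called.
document = {"BCN": ["Alice"], "RCN": ["Bob"], "OCN": ["Carol"]}
calls = []

def run_source_query(tags):
    calls.append(tags)
    return [v for tag in tags for v in document.get(tag, [])]

p_mappings = {
    1: ({"ICN": "BCN"}, 0.3), 2: ({"ICN": "BCN"}, 0.25),
    3: ({"ICN": "RCN"}, 0.2), 4: ({"ICN": "RCN"}, 0.15),
    5: ({"ICN": "OCN"}, 0.1),
}
blocks = [({"ICN": "BCN"}, {1, 2}), ({"ICN": "RCN"}, {3, 4})]
results = evaluate_with_blocks(["ICN"], blocks, p_mappings, run_source_query)
```

With five mappings and two blocks, the stub is called three times instead of five, which is the saving the block tree aims for.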

Case 2): the root is not found; decompose the query (if possible), invoke recursion, and join the partial answers

[Figure labels: ORDER, IP; direct query; recursion; direct query]

MAPPING GENERATION

A mapping m for a schema S with another schema T contains a set of correspondences (es, et)
  et may be EMPTY, i.e., es matches no element in T
  Each element in S occurs exactly once in m
  Each element in T occurs at most once in m
  m’s score is the sum of the similarities of its correspondences

Problem definition
  Given: two schemas S and T, and a set of correspondences (es, et) with similarities (the schema matching results)
  Return: h mappings m1, …, mh whose scores are the highest
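The constraints and the score in this definition translate directly into a short Python sketch. The names are hypothetical, EMPTY is represented by `None`, and a correspondence without a listed similarity (including any match to EMPTY) is assumed to contribute zero.

```python
def is_valid_mapping(mapping, source_elements):
    """Check the two occurrence constraints of a mapping.

    mapping: list of (es, et) pairs, where et is None for EMPTY
    """
    sources = [s for s, _ in mapping]
    targets = [t for _, t in mapping if t is not None]
    # Each element of S occurs exactly once.
    each_source_once = sorted(sources) == sorted(source_elements)
    # Each element of T occurs at most once.
    targets_unique = len(targets) == len(set(targets))
    return each_source_once and targets_unique


def score(mapping, similarity):
    """Sum of the similarities of the correspondences (missing ones count as 0)."""
    return sum(similarity.get(c, 0.0) for c in mapping)


# Hypothetical schema elements and similarity values.
S = ["Order", "BP", "BCN"]
similarity = {("Order", "ORDER"): 0.9, ("BP", "IP"): 0.8, ("BCN", "ICN"): 0.6}

good = [("Order", "ORDER"), ("BP", "IP"), ("BCN", None)]
bad = [("Order", "ORDER"), ("BP", "IP"), ("BCN", "IP")]  # IP used twice
good_score = score(good, similarity)
```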

MAPPING GENERATION

Baseline solution
  Find the h maximum bipartite matchings (min-cost flow)
  Polynomial in the size of the bipartite graph
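To make the target of the baseline concrete, the sketch below enumerates every valid mapping of a tiny instance and keeps the h best. This brute-force enumeration is exponential and is not the min-cost-flow method named on the slide; it only shows what "the h highest-scoring mappings" means. All names and similarity values are hypothetical.

```python
import heapq
from itertools import permutations


def top_h_mappings(sources, targets, similarity, h):
    """Brute-force top-h mappings for a very small pair of schemas."""
    # Pad the targets with None so that any source element may stay unmatched.
    padded = list(targets) + [None] * len(sources)
    candidates = set()
    for perm in permutations(padded, len(sources)):
        # Each source appears once; each real target is used at most once.
        candidates.add(tuple(zip(sources, perm)))
    scored = [(sum(similarity.get(c, 0.0) for c in m), m) for m in candidates]
    return heapq.nlargest(h, scored, key=lambda item: item[0])


similarity = {
    ("BCN", "ICN"): 0.6, ("BCN", "SCN"): 0.3,
    ("RCN", "ICN"): 0.4, ("RCN", "SCN"): 0.5,
}
top = top_h_mappings(["BCN", "RCN"], ["ICN", "SCN"], similarity, h=2)
```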

MAPPING GENERATION

Observation: XML schema matching is usually sparse
Improvement: a divide-and-conquer approach
  Derive the partitions (maximal connected sub-graphs) of the bipartite graph
  Find the top-h partial mappings from each partition
  Merge
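The partition and merge steps can be sketched in Python as below; the per-partition top-h search is left out and its output is supplied by hand. The function names, the union-find helper and the toy scores are assumptions for illustration. Merging works because partitions share no elements, so the score of a combined mapping is the sum of the partial scores.

```python
import heapq
from itertools import product


def partitions(correspondences):
    """Group correspondences into the connected components of the bipartite graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, t in correspondences:
        # Source and target names are tagged so equal names do not collide.
        parent[find(("S", s))] = find(("T", t))
    groups = {}
    for s, t in correspondences:
        groups.setdefault(find(("S", s)), []).append((s, t))
    return list(groups.values())


def merge_top_h(list_a, list_b, h):
    """Combine two lists of (score, partial mapping) and keep the h best sums."""
    combined = [(sa + sb, ma + mb) for (sa, ma), (sb, mb) in product(list_a, list_b)]
    return heapq.nlargest(h, combined, key=lambda item: item[0])


parts = partitions([("BCN", "ICN"), ("RCN", "ICN"), ("Order", "ORDER")])

# Hand-written top-h lists for the two partitions (hypothetical scores).
top_a = [(0.6, (("BCN", "ICN"),)), (0.4, (("RCN", "ICN"),))]
top_b = [(0.9, (("Order", "ORDER"),))]
merged = merge_top_h(top_a, top_b, h=2)
```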

DATASET AND RESULTS

XML schemas and documents
  7 schemas for purchase orders, obtained from various e-commerce standards (e.g., XCBL, OpenTrans)
  Accompanying sample XML documents
Schema matching
  Tool: COMA++, with different schema matching methods
  10 datasets: (source schema, target schema, matching method)
Target queries
  10 hand-written queries

RESULTS

Uncertain mappings: do they really overlap?

RESULTS

How much space does the block tree save for storing uncertain mappings? And why?

RESULTS

Is the block tree effective?
Intuitively, larger blocks tend to be more useful

RESULTS

The block tree can be efficiently created
Fast, and controllable

RESULTS

Can the block tree really help to improve query performance?
Varying the total number of mappings

RESULTS

Can it scale?
Probabilistic twig query and top-k query

RESULTS

Top-h mapping generation
Performance gain of partitioning

CONCLUSION

We study the problem of handling uncertainty in XML schema matching
Observation
  Overlapping mappings, sparse bipartite graphs, etc.
Approach
  The block tree
  Query evaluation with the block tree
  Generating uncertain mappings more efficiently
Future work
  Other types of queries, probabilistic documents, index update, relational scenario, etc.

THANKS!

REFERENCES

[Len02] Lenzerini, “Data integration: a theoretical perspective”, in PODS, 2002
[YP04] Yu et al, “Constraint-based XML query rewriting for data integration”, in SIGMOD, 2004
[DR02] Do et al, “COMA: a system for flexible combination of schema matching approaches”, in VLDB, 2002
[Gal06] Gal, “Managing uncertainty in schema matching with top-k schema mappings”, in J. Data Semantics VI, 2006
[DHY07] Dong et al, “Data integration with uncertainty”, in VLDB, 2007
[QYD07] Qin et al, “TwigList: make twig pattern matching fast”, in DASFAA, 2007
[Murty86] Murty, “An algorithm for ranking all the assignments in increasing order of cost”, Operations Research, vol 16, 1986
[RB01] Rahm et al, “A survey of approaches to automatic schema matching”, VLDB J, vol 10, 2001
[KYS08] Kimelfeld et al, “Query efficiency in probabilistic XML models”, in SIGMOD, 2008
…

QUERY REWRITING

Given
  A target twig query QT
  A schema mapping m between S and T, which is a set of correspondences (es, et)
Mapping semantics
  For each sub-tree in the source document DS that contains a set of source elements in m, there exists a sub-tree in the target document DT that contains the corresponding target elements
Procedure
  For each element in QT, replace it with a source element
  Connect all the source elements
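The two-step procedure can be illustrated on a simplified query shape. The Python sketch below assumes that the query is a single path of target elements rather than a full twig, and that "connecting" the source elements means joining them with the descendant axis (//); both are simplifications chosen for the example, and the function name is hypothetical.

```python
def rewrite(target_query, mapping):
    """Rewrite a path of target elements into a source path expression.

    target_query: list of target element names, outermost first
    mapping:      list of (source, target) correspondences
    Returns None when some query element has no correspondence.
    """
    to_source = {t: s for s, t in mapping}
    if any(t not in to_source for t in target_query):
        return None
    # Step 1: replace each target element with its source element.
    source_elements = [to_source[t] for t in target_query]
    # Step 2: connect the source elements (assumed: descendant axis).
    return "//" + "//".join(source_elements)


m1 = [("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN")]
m2 = [("Order", "ORDER"), ("BP", "IP"), ("RCN", "ICN")]
q1 = rewrite(["ORDER", "ICN"], m1)
q2 = rewrite(["ORDER", "ICN"], m2)
```

The same target query becomes //Order//BCN under the first mapping and //Order//RCN under the second, which is why uncertain mappings lead to different answers.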

LEMMA 1

Lemma 1 (conceptually): the c-blocks for a schema element t can be created from the c-blocks of t’s children.

An example
[Figure: target schema elements InvoiceTo and DeliverTo, each with a Contact and an Address (street, email, city, country); nodes are annotated with splits of 100 mappings such as 27|24|25|24, 51|49, 49|51, 100, 52|48, 53|47 and 50|50]
  b1.M: 1-52
  b2.M: 53-100
  b3.M: 1, 3, 5, …
  b4.M: 2, 4, 6, ...

RESULTS

What kind of queries did we use?
