CSE 636 Data Integration
Schema Matching
Cupid
Fall 2006
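
Virtual Integration Architecture
[Figure: mediator architecture. An End User sends a Query to the Mediator and receives a Result. The Mediator exposes the Global Schema; two Wrappers sit over two Data Sources, each with its own Local Schema. Design-time components: Mediation Language, Schema Matching. Run-time components: Query Reformulation, Optimization & Execution. XML and Web Services also appear as labels.]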

Schema Heterogeneity
Independently created schemas might be modeling similar information in slightly different ways.
[Figure: three example schemas. DB1: ugrad * (name, ugradID) and enrollment * (courseID, ugradID, grade). DB2: course * (courseID, title) and student * (studentID, name, type). DB3: student * (studentID, name, type), course * (courseID, title ?), evaluation, type, letter.]

Schema Heterogeneity
• Similar entities represented
• Dissimilar structures (inverted nesting)
• Different element names for similar data values
• Similar element names for different data values

Schema Matching vs. Schema Mapping
• GAV and LAV are schema mapping languages
• Mappings:
  – set of queries
  – associations + semantics
• Match:
  – set of associations only
• Schema Matching:
  – Identifying associations
  – First step towards constructing mappings

Schema Matching vs. Schema Mapping
LAV Mapping: DB1 as a query Q over DB3
    for $s1 in DB3/student
    where $s1/type = 'UGRAD'
    return
      <DB1>
        <ugrad>
          <ugradID>{$s1/studentID}</ugradID>
          <name>{$s1/name}</name>
        </ugrad>
      </DB1>
[Figure: the DB1 and DB3 schemas, annotated with the labels "Associations" and "Semantics".]

The Problem of Schema Matching
Input
• Schemas S1 and S2
• Possibly data instances for S1 and S2
• Background knowledge
  – thesauri
  – validated matches
  – standard schemas
  – reference instances
  – ontologies
  – constraints (keys, data types, etc.)
Output
• Associations between S1 and S2
Goal
• Schema matching tools with significant automated support

Schema Matching
How is the match result expressed?
[Figure: the DB2 and DB3 schemas side by side.]
• Pairs of paths
• Lists of paths
• Schema names
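
As a rough illustration of the "pairs of paths" option, a match result can be held as a plain list of path pairs. The sketch below is an assumption, not the notation of any particular tool: the `Association` type, the slash-separated path syntax and the `targets_of` helper are hypothetical, and the two pairs are read off the DB3 to DB1 query shown earlier in the slides.

```python
from typing import NamedTuple


class Association(NamedTuple):
    """A single association: a path in one schema paired with a path in another."""
    source_path: str
    target_path: str


# Pairs read off the DB3 -> DB1 query from the earlier slide.
# The slash-separated path syntax is an assumption made for this sketch.
MATCH_DB3_DB1 = [
    Association("DB3/student/studentID", "DB1/ugrad/ugradID"),
    Association("DB3/student/name", "DB1/ugrad/name"),
]


def targets_of(match, source_path):
    """Return every target path associated with the given source path."""
    return [a.target_path for a in match if a.source_path == source_path]
```

Note that the list carries associations only. The condition `type = 'UGRAD'` from the query is semantics, so it belongs to the mapping and has no place in the match.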

Schema Matching
What do we match?
• Depends on the queries we want to ask
  1. Elements in isolation (leaves in particular)
  2. Substructures
  3. Whole schemas

Motivation
• Important component in many applications
  – Data Integration
  – Data Migration
  – E-Commerce
• Model Management [Bernstein, Halevy, Pottinger '00]
  – Algebra for manipulating models and mappings
  – Match, Merge, Compose …

Problems
• Minimize user involvement (semi-automatic)
• Data model independent matching (generic)
• Schema matching is a hard problem
  – Naming and structural differences in schemas
  – Similar, but non-identical concepts modeled
  – Multiple data models: SQL DDL, XML, ODMG…

Schema Matching Approaches
How to match?
• Individual matchers
  – Schema-based
      Per-Element
        Linguistic: names, descriptions
        Constraint-based: types, keys
      Structural
        Constraint-based: graph matching
  – Content-based
      Per-Element
        Linguistic: IR (word frequencies, key terms)
        Constraint-based: value pattern and ranges
• Combined matchers
  – Hybrid
  – Composite
      manual composition
      automatic composition
Taxonomy based survey: Rahm and Bernstein, VLDB J, 2001

Cupid
[Figure: the taxonomy of matching approaches from the previous slide, repeated to show where Cupid fits.]
Madhavan, Bernstein and Rahm, VLDB, 2001
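
Cupid Example
[Figure: two purchase-order schema trees to be matched. The first is rooted at PO and contains POLines, Item, Line, Qty, UoM, POShipTo, Street and City. The second is rooted at PurchaseOrder and contains Items, Item, ItemNumber, Quantity, UnitofMeasure, DeliverTo, Address, Street and City. Two Name labels also appear.]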

Cupid Architecture
[Figure: Schema 1 and Schema 2 feed Linguistic Matching, which consults a Thesaurus and produces LSIM. Structure Matching then produces SSIM and WSIM, and Generate Mapping turns these into the Output Mapping.]

Linguistic Matching
• Heuristic name matching
  – Tokenization of names
      POOrderNum → PO, Order, Num
  – Expansion of short-forms, acronyms
      PO → Purchase, Order; Num → Number
  – Clustering of schema elements based on keywords and data-types
      Street, City, POAddress → Address
  – Thesaurus of synonyms, hypernyms, acronyms
  – Linguistic Similarity coefficient (LSIM) in [0,1]
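
The tokenization, expansion and thesaurus steps can be sketched as follows. This is a simplified illustration under stated assumptions, not Cupid's actual algorithm: the `EXPANSIONS` and `SYNONYMS` tables, the 0.8 synonym score and the averaging formula in `lsim` are all hypothetical, chosen only so that the result lands in [0,1].

```python
import re

# Hypothetical short-form table and thesaurus; the real ones are not given in the slides.
EXPANSIONS = {
    "po": ["purchase", "order"],
    "num": ["number"],
    "qty": ["quantity"],
    "uom": ["unit", "of", "measure"],
}
SYNONYMS = {frozenset({"bill", "invoice"}), frozenset({"ship", "deliver"})}


def tokenize(name):
    """Split a name on case changes, e.g. 'POOrderNum' -> ['PO', 'Order', 'Num']."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", name)


def normalize(name):
    """Lower-case the tokens of a name and expand known short forms."""
    if name.lower() in EXPANSIONS:  # the whole name is a short form, e.g. 'UoM'
        return list(EXPANSIONS[name.lower()])
    tokens = []
    for tok in tokenize(name):
        tokens.extend(EXPANSIONS.get(tok.lower(), [tok.lower()]))
    return tokens


def token_sim(a, b):
    """1.0 for identical tokens, 0.8 (assumed value) for thesaurus synonyms, else 0.0."""
    if a == b:
        return 1.0
    if frozenset({a, b}) in SYNONYMS:
        return 0.8
    return 0.0


def lsim(name1, name2):
    """Average best-match token similarity in both directions; result is in [0, 1]."""
    t1, t2 = normalize(name1), normalize(name2)
    if not t1 or not t2:
        return 0.0
    best1 = sum(max(token_sim(a, b) for b in t2) for a in t1)
    best2 = sum(max(token_sim(a, b) for a in t1) for b in t2)
    return (best1 + best2) / (len(t1) + len(t2))
```

With these tables, `lsim("Qty", "Quantity")` is 1.0 because the short form expands to the full word, while `lsim("City", "Street")` is 0.0.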

Structure Matching
[Figure: the PO and PurchaseOrder schema trees from the Cupid example, repeated.]

Structure Matching
Mutually Reinforcing Similarity
[Figure: PO, POLines, Item (Line, Qty, UoM) compared with PurchaseOrder, Items, Item (ItemNum, Quantity, UnitofMeasure). Node pairs annotated "WSIM > th_high" lead to "SSIM++" annotations on other pairs.]

Structure Matching
Context Dependent Disambiguation
[Figure: PO with POShipTo and POBillTo, each containing Street and City, compared with PurchaseOrder with InvoiceTo and DeliverTo, each containing an Address with Street and City. Pairs are annotated SSIM++ (twice) and SSIM-- (once).]

Intuition
• Atomic elements are similar if
  – Linguistically and data-type similar
  – Their ancestors are similar
• Compound elements (non-leaf) are similar if
  – Linguistically similar
  – Subtrees rooted at the elements are similar
• Mutually recursive
  – Leaves determine internal node similarity
  – Similarity of internal nodes leads to increase in leaf similarity
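
The mutual reinforcement between internal nodes and leaves can be sketched as a small update rule. Everything concrete here is an assumption made for illustration: the slides do not say how WSIM combines SSIM and LSIM, so the weighted sum in `wsim` is a guess at its form, and the thresholds `th_high` and `th_low`, the `step` size and the function names are hypothetical.

```python
def wsim(ssim_value, lsim_value, w_struct=0.5):
    """Weighted similarity, assumed here to be a blend of structural and linguistic similarity."""
    return w_struct * ssim_value + (1.0 - w_struct) * lsim_value


def reinforce(ssim, leaf_pairs, wsim_value, th_high=0.6, th_low=0.35, step=0.1):
    """Adjust leaf-pair SSIM after comparing two internal nodes.

    ssim       -- dict mapping (leaf_in_s, leaf_in_t) to structural similarity
    leaf_pairs -- leaf pairs drawn from the two subtrees being compared
    wsim_value -- weighted similarity of the two internal nodes
    """
    if wsim_value > th_high:
        delta = step        # similar context: SSIM++
    elif wsim_value < th_low:
        delta = -step       # dissimilar context: SSIM--
    else:
        return              # inconclusive: leave the leaves alone
    for pair in leaf_pairs:
        # Clamp to [0, 1] so repeated updates stay in range.
        ssim[pair] = min(1.0, max(0.0, ssim.get(pair, 0.0) + delta))
```

For example, if two internal nodes score `wsim(0.8, 0.7)` (0.75, above the assumed `th_high`), every leaf pair beneath them gains `step`, which is how the same City leaf can end up closer to one parent's City than to another's.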

Structure Match Details
• Subtrees are similar if
  – Immediate children are similar
  – Leaf sets are similar
• Subtree Similarity (nodes s and t)
  – Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa
  – Less sensitive to variation in intermediate structure
• Pruning the number of comparisons
  – Elements must have comparable number of leaves
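
A minimal sketch of the leaf-fraction measure and the pruning rule, assuming schema trees are given as nested dicts in which a leaf maps to an empty dict. The names `leaf_names`, `subtree_ssim` and `comparable`, the acceptance threshold `th_accept` and the `max_ratio` bound are hypothetical choices for this example, not values taken from the slides.

```python
def leaf_names(tree):
    """Collect the leaf names of a schema tree given as nested dicts (a leaf maps to {})."""
    result = []
    for name, children in tree.items():
        if children:
            result.extend(leaf_names(children))
        else:
            result.append(name)
    return result


def subtree_ssim(subtree_s, subtree_t, leaf_sim, th_accept=0.5):
    """Fraction of leaves, over both subtrees, that have an acceptable partner in the other."""
    ls, lt = leaf_names(subtree_s), leaf_names(subtree_t)
    if not ls or not lt:
        return 0.0
    linked_s = sum(1 for x in ls if any(leaf_sim(x, y) >= th_accept for y in lt))
    linked_t = sum(1 for y in lt if any(leaf_sim(x, y) >= th_accept for x in ls))
    return (linked_s + linked_t) / (len(ls) + len(lt))


def comparable(subtree_s, subtree_t, max_ratio=2.0):
    """Pruning: compare two subtrees only if their leaf counts are within max_ratio."""
    ns, nt = len(leaf_names(subtree_s)), len(leaf_names(subtree_t))
    return min(ns, nt) > 0 and max(ns, nt) / min(ns, nt) <= max_ratio
```

Because only leaves are counted, a flat `{Street, City}` and a nested `{Address: {Street, City}}` score 1.0 under an exact-name `leaf_sim`, which is the "less sensitive to variation in intermediate structure" property.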

Referential Integrity
[Figure: Schema A contains Purchase Order (Order ID, Product Name, Customer ID) and Customer (Customer ID, Name, Address), linked by the foreign key Order-Customer-fk, which appears as an added join node. Schema B contains Customer-Purchase-Order.]
• Join nodes added to the schema tree for each referential integrity constraint
• Views can be similarly used

Cupid Architecture
[Figure: the Cupid architecture diagram, repeated, with sample similarity values.]
Linguistic Similarity (LSIM)
  InvoiceTo        BillTo        0.7
  UoM              UnitMeasure   0.9
  City             City          1.0
Structural (SSIM), Weighted (WSIM) Similarity
  InvoiceTo        BillTo        0.8   0.7
  UoM              UnitMeasure   0.7   0.8
  InvoiceTo/City   BillTo/City   0.8   0.9

Mapping Generation
• Individual mapping elements computed from WSIM values:
  – Consider only mapping pairs that have WSIM greater than a threshold
  – For each element of the target, find the most similar source element
  – Mappings that are not accepted but have high similarity are also returned, to help the user modify the map
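
The three steps above can be sketched as a single pass over the WSIM table. The function name `generate_mapping`, the threshold value `th_accept` and the dict-based input are assumptions made for this example, and the similarity values in the usage note are made up.

```python
def generate_mapping(wsim, th_accept=0.5):
    """Pick, for each target element, the most similar source element above the threshold.

    wsim -- dict mapping (source, target) to weighted similarity
    Returns (accepted, near_misses): accepted maps target -> source; near_misses lists
    (source, target, score) triples above the threshold that were not chosen, best first.
    """
    best = {}
    for (source, target), score in wsim.items():
        if score > th_accept and (target not in best or score > best[target][1]):
            best[target] = (source, score)
    accepted = {target: source for target, (source, _) in best.items()}
    near_misses = sorted(
        ((s, t, sc) for (s, t), sc in wsim.items()
         if sc > th_accept and accepted.get(t) != s),
        key=lambda triple: -triple[2],
    )
    return accepted, near_misses
```

Given `{("BillTo", "InvoiceTo"): 0.7, ("ShipTo", "InvoiceTo"): 0.6}`, the target InvoiceTo is mapped to BillTo, and the ShipTo pair comes back as a near miss for the user to review.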

Cupid Architecture
[Figure: the Cupid architecture diagram, repeated, with an additional "Input hint" label.]

Work Needed
• A more robust solution
  – Auto-tuning parameters
  – Thesaurus Generation and Evolution
• Schema matching component architecture
  – Easily extensible by adding multiple techniques
  – Data Instances for matching
  – Look at COMA & ProtoPlasm systems

References
1. J. Madhavan, P. A. Bernstein, E. Rahm: Generic Schema Matching with Cupid. VLDB, 2001
2. H. H. Do, E. Rahm: COMA - A System for Flexible Combination of Schema Matching Approaches. VLDB, 2002
3. P. A. Bernstein, S. Melnik, M. Petropoulos, C. Quix: Industrial-Strength Schema Matching. SIGMOD Record 33(4), 2004