45
1 A Survey of Approaches to Automatic Schema Matching Name: Samer Samarah Number: 3740535 Email: [email protected] This Presentation based on the following paper: E. Rahm, P. Bernstein,"A Survey of Approaches to Automatic Schema Matching“, VLDB Journal, 10(4): 334- 350 (2001).

1 A Survey of Approaches to Automatic Schema Matching Name: Samer Samarah Number: 3740535 Email: [email protected]@Site.Uottawa.Ca This

Embed Size (px)

Citation preview

1

A Survey of Approaches to Automatic Schema Matching

Name: Samer SamarahNumber: 3740535

Email: [email protected] Presentation based on the following

paper:E. Rahm, P. Bernstein,"A Survey of Approaches to Automatic Schema

Matching“, VLDB Journal, 10(4): 334-350 (2001).

2

Presentation Outlines Definition of Schema Matching. Schema Matching problem. Applications of Schema Matching. Schema matching approaches. Personal contribution. Conclusion.

3

Schema Matching: Definition Schema Matching: is the process of finding

semantic correspondence between elements of two schemas.

Schema Matching is achieved through Match operation.

Match Operator is a function that takes two schemas as input and returns a mapping between those two schemas as output, called the match result.

Match(S1,S2) Match Result.

4

The Match operator

Match(S1,S2) Match Result

The schema (either S1 or S2) defined to be a set of elements connected by some structure (ER model, OO model,..).

The Match Result is a set of mapping elements, each of which indicate that certain elements of S1 are mapped to certain elements of S2, expressed by .

A mapping expressions, which specifies how the S1 and S2 elements are related, may be associated with the mapping elements.

)(

5

The Match Operator

6

Schema Matching Problem Schema Matching currently performed

manually which makes it: Error-prone. Tedious. Time-consuming.

So, the solution is to automate the match function, but There is no mathematical model that capture the

matching process. Application dependent.

7

Schema Matching Applications Schema Integration. Data Warehouses. E-commerce. Semantic Query Processing. Data Integration system

8

1-Schema Integration

Is the process of constricting a global view from a set of independently developed schema.

Schema integration achieved through identifying the interschema relationships (applying schema matching) then unify these matched elements.

9

Schema Matching Example

S1{SNam,….}

S2{FName,LName,

…}..

Sn

Match + UnifySg

{Name,…}

Example:

Determining the correspondence between SName and Fname,Lname achived through the match operation.

10

2-Data Warehouses DWH is a decision support database that is

extracted from a set of data sources. DWH and sources represent data in

different format (i.e. relations or XML versus multidimensional view)

Constructing DWH require transform data from its format into DWH format.

match operation can be used to identify those elements in the sources that are represented in the DWH, according to this mapping appropriate transformation can be designed.

11

3- E-commerce Trading partners exchange messages

that describes business transactions. Each partners uses its own message

format (EDI, XML,..) or different message schema.

In order to exchange messages, there is a need to translate the messages to the format required by different partners (Matching problem).

12

4- Semantic Query Processing

A run time scenarios where the user specifies the output of the query (in terms of concepts familiar to him, which may be not the same concepts presented in the DB {the Select Clause}), and the system figures out how to produce that output (i.e. the from and where clauses in SQL).

The match operation is used to determent the mapping between the user concepts and the DB concepts).

13

4-Semantic Query Processing (Example)

All employees earn more than 2000$

Emplyee(FName,LName,Salary)

Applying MatchMapping elements

Producing New Query in SQL format:{Select FName, LNameFrom EmployeeWhere Salary >2000}

Output

Example

SQL Q

14

5- Data Integration System

The major component of data integration system is the source description.

Source description maps the sources schema to the mediated schema.

Match operation applied in order to specify this mapping.

15

Generic Match Architecture Schemas to be matched are represented

in a uniform internal representation. Importer translates input schemas from

their native representation into the internal representation.

Exporter translates the match result produced by the match from the internal representation into the representation required by each tool.

16

Generic Match Architecture

Tool 1(Portal Schemas)

Tool 2(E-business Schemas)

Tool 3(DWH Schemas)

Tool 4 (DB Schemas)

Generic Match Implementation

Schema import / export

Internal Schema Representation

Global Libraries(dictionaries, schemas,..)

17

Instance Vs schema: consider instance data or schema information.

Element Vs structure: matching performed for individual schema element (attribute), or for combinations of elements (structure).

Language Vs constraint: use linguistic information (names and textual description) or constraint information (key, relationship)

Matching cardinality: the overall match result may relate one or more elements of one schema to one or more elements of the other (1:1, 1:n, m:n).

Auxiliary information: the use of auxiliary information (dictionaries, pervious matching results, user input,..)

Classification of schema matching approaches

18

Classification of schema matching approaches

Schema Matching Approaches

Combine MatcherIndividual Matcher

Schema Based Instance Based

Element Level Structure Level

Linguistic Constraint Constraint

Element Level

Linguistic Constraint

Hybrid Matcher Composite Matcher

AutomaticManual

19

Schema level matcher Consider only schema information, not

instance data, such as: Name Description Data Type Relationship (is-a, part-of) Constraints Schema structure

Multiple match candidates could be founded, each of which assigned with a similarity degree.

20

Granularity of match

Element level matching: for each element in the first schema, determines the matching elements in the second schema.

Structural level matching: matching combinations of elements that appear together in a structure. Could be fully match or partial match.

21

Granularity of match

Example

TaxExempt

CPhone Birthdate

Address CAddress (E.L) CAddress Address

Name CName (E.L) Cname Name

AccountOwner Customer (S.L partial Match)CustomerAccountOwner

Zip PostalCode (E.L) PostalCode ZIP

State USState (E. L) USState State

Street City (E.L) City Street

Address CustomerAddress (S.L full Match)CustomerAddressAddress

Match GranularityS2 ElementsS1 Elements

22

Linguistic approaches

Linguistic matchers use names and text (word or sentence) to find semantically similar schema elements: Name Matching Description matching

23

Name Matching Name based matching matches schema

elements with equal names or similar names. Equality of names. Equality of canonical name representation

after stemming and preprocessing. Equality of synonyms. Equality of hypernyms (X is a hypernym of Y

if Y is a kind of X. Similarity based on common substring, edit

distance, soundex. User- provided name matches.

Thesauri or dictionary should be exploited.

24

Name Matching

Example: two schema S1, S2 represent two automobile suppliers

CAddress CustomerAddress (Preprocessing)

CustomerAddressCAddress

SoldTo Sold2 (Soundex)Sold2SoldTO

Price Price (Equality of Names)PricePrice

Brand Make (Synonyms)MakeBrand

Car Truck (Hypernyms)Car is an automobile and truck is an automobile.

TruckIDCarID

Matching Based onS2 ElementsS1 Elements

25

Constraint- based approaches

Exploit constraints information associated with the input schemas to determine the similarity of schema elements. Data types and domain constraints Key characteristics (primary, unique) Relationship cardinality Structural constraints such as foreign key

(used by structural matches approaches)

26

Constraint- based approaches

DeptName {String}

DeptNo {int,PK}

Department

BirthDate {date}

Born {date} salary {single}

S2.Personal {Pno, Pname, Dept, born}Select S1.Employee.EmpNo, S1.Employee.EmpName, S1.Department.DeptName, S1.Employee.BirthDateFrom S1.Employee, S1.DepartmentWhere (S1.Employee.DeptNo = S1.Department.DeptNo)Note: Structural Matching

Dept {String} DeptNo {int, ref dep}

Pname {string} EmpName {String}

Born Birthdate {Type}Pno EmpNo| DeptNo {Key}

Pname DeptName {Type}

Pname EmpName {Type}

Dept DeptName {Type}

Dept EmpName {Type}

Pno {int, PK} EmpNo {int, PK}

PersonalEmployee

MatchingS2 ElementsS1 Elements

Several match candidate could results so this approach could be sued to limit the number of candidate.

27

Description Matching Based on linguistic evaluation for the

comment associated with schema elements.

Example:

NL Understanding technology

S1: empn // employee name

S2: name // name of employeeEmpn name

28

Reusing Schema and mapping information

This approach support and exploit the reuse of common schema components and previously determined mapping.

Useful when matching applied to different but similar schemas to the same destination schemas.

29

Reusing Schema and mapping information

Schema S1

Purchase order2

Product

BillTo

Name

Address

ShipTo

Name

Address

ContactPhone

Schema S2

Purchase order

Product

BillTo

Name

Address

ShipTo

Name

Address

Contact

Name

Address

Schema S

POrder

Article

Payee

BillAddress

Recipient

ShipAddress

Goal: Mapping S1 to S

matching result between S2 and S are previously determent and can be reused to map S1 to S

EX:

30

Match cardinality Global cardinality: how many mapping

elements S1 or (S2) elements can participate in the matching results.

Local cardinality: how many elements in S1 match how many elements in S2 within individual mapping element.

Most of approaches restricted to 1:1 local and 1:1 or 1:n global cardinality.

31

Match cardinality

Local Match Cardinality

S1 elements

S2 elements

Matching Expression

1 1:1 element level Price Amount Amount = Price

2 n:1 element level Price,Tax Cost Cost = Price* (1 + Tax/100)

3 1:n element level Name FirstNameLastName

FirstName, LastName = Extract(Name,..)

4 n:1 structure level(n:m element level)

B.TitleB.PuNoP.PuNoP.Name

A.BookA.Publisher

A.Book, A.Publisher =Select B.Title, P.NameFrom B.PWhere B.PuNo =P.PuNo

Example:

Price has 1:n Global Cardinality

32

Instance level approaches Consider data contents. Useful when schema information limited (or

no schema at all). Enhance schema matching by considering

elements whose instances are more similar. Linguistic approach based on IR techniques for

text elements. Constraint based such as value range and

average for numeric element.

33

Instance level approach

EmpNo Dept

234 Marketing

235 Accounting

236 Marketing

SSN Works for

230 Accounting

229 Marketing

228 Marketing

{Dept ≈ works for} (based on “Marketing” Frequency)

{EmpNo ≈ SSN} (based on value range)

Example:

34

Combining matchers Combine several approaches to achieve good match

candidates. Hybrid Matcher: combine several matching approaches

to determine match candidate based on multiple criteria (name, type constraints). More effective (poor candidates filtered out early) Better performance (reducing number of pass over

the schema). Composite matcher: combines the results of several

independently executed matchers, including hybrid matchers. More flexible than hybrid matchers, (allow us to

select from set of matchers). The combination of results could be automatic or

manually).

35

personal contribution A prototype Datalog program designed to

implement a composite matcher. The program was tested on DES system

( Datalog Educational System) available at: http//www.fdi.ucm.es/professor/fernan/DES/

The program takes advantage of linguistic based approach over constraint approach.

36

Database Description

The DataBase contains the following predicates: Source(ElementName, DataType,

constraints,…). // to describe sources Dictionary(Name1,Name2)// a dictionary

to provides synonyms.

37

The Program Rules cand1(N,N) :- s1(N,D,C), s2(N,DataType,Constraint). cand2(N,N1) :- s1(N,D,C), s2(N1,DataType,Constraint), d(N,N1). ok1(N) :- cand1(N,N1). ok1(N1) :- cand1(N,N1). ok2(N) :- cand2(N,N1). ok2(N1) :- cand2(N,N1). cand3(N,N1) :- s1(N,D,C),s2(N1,D,Constraint), not(ok1(N)),not(ok2(N1)), not(ok1(N1)),not(ok2(N)). cand4(N,N1) :- s1(N,DataType,C),s2(N1,D,C), not(ok1(N)),not(ok2(N1)), not(ok1(N1)),not(ok2(N)). match(N,N1) :- cand1(N,N1). match(N,N1) :- cand2(N,N1). match(N,N1) :- cand3(N,N1),cand4(N,N1). match(N,N1) :- cand3(N,N1),not(cand4(N,N1)). match(N,N1) :- cand4(N,N1),not(cand3(N,N1)).

38

EXAMPLE The program was tested on the following schemas.

s1(eNo,integer,pk).s1(city,string,20).s1(street,string,30).s1(state,string,12). s2(eID,integer,pk).s2(cName,string,15).s2(street,string,10).s2(province,string,25).

d(state, province).

39

Results

40

Conclusion A taxonomy for schema approaches was

presented, in order to compare between different approaches to schema matching.

The generic implementation described in the paper, could be base for any new implantation.

Different techniques could be used in order to automate schema matching, such as Natural language, machine learning,IR ...

Having full automated matching (without user interaction ) not achieved yet.

41

Thank You

42

References E. Rahm, Ph.A. Bernstein,"A Survey of Approaches to

Automatic Schema Matching“, VLDB Journal, 10(4): 334-350 (2001).

DES system ( Datalog Educational System) available at: http//www.fdi.ucm.es/professor/fernan/DES/.

Jayant Madhavan, Philip A.Bernstein, Erhard Rahm, “Genric Schema Matching with Cupid”, Proceedings of 27th VLDB conference, Roma, Italy,2001.

Hong-Hai Do, Sergey Melnik, Erhard Rahm, “Comparison of Schema Matching Evaluations”, University of Leipzig Augustusplatz 10-11, 04109, Leipzig, Germany.

AnHai Doan, Pedro Domigos, Alon Levy, “Learning Source Description for Data Integration”, University of Washington, Sattle, WA 98195

43

Appendix

1- Cupid (Microsoft Research)

2- LSD (Learning Source Description)

44

1- Cupid (Microsoft Research) A hybrid matcher based on element and

structure matching. What is new in this approach, that the

schemas represented as a graph which encode the referential constraints into structure the can be matched just like other structures.

The algorithm has two phase: Linguistic matching: matches individual schema

elements based on their names, data types, domains,..

Structural matching: matches schema elements based on the similarity of their context.

45

2- LSD (Learning Source Description)

Composite matcher, with autonomic combination of match results.

An attempt to automate the mappings between source schemas and mediated schema in data integration system.

Uses machine learning techniques to match a new data source against previously determined global schema.

After a set of data sources have been manually mapped to a mediated schema, the system should be able to glean significant information from these mapping for subsequent data sources.