Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)


Page 1: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Creating semantic mappings

(based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Page 2: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

What have we studied before? Formalisms for specifying source descriptions, and how to use these descriptions to reformulate queries.

What is the goal now? A set of techniques that helps a designer create semantic mappings and source descriptions for a particular data integration application. This is a heuristic task; the idea is to reduce the time it takes to create semantic mappings.

Page 3: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Motivating example

DVD vendor schema:
Product(title, productionYear, releaseDate, basePrice, listPrice, rating, customerReviews, saleLocation)
Movie(title, year, director, directorOrigin, mainActors, genre, awards)
Locations(name, taxRate, shippingCharge)

Online aggregator schema:
Item(title, year, classification, genre, director, price, starring, angReviews)

Page 4: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Recap: semantic heterogeneities

Table and attribute names can differ: attributes rating and classification, as well as mainActors and starring.

Multiple attributes in one schema may correspond to a single attribute in the other: basePrice and taxRate from the vendor are used to compute the value of price in the aggregator schema.

The tabular organization may be different: the DVD vendor requires 3 tables, the aggregator needs only one.

Coverage and level of detail may differ: the DVD vendor models releaseDate and awards, which the aggregator does not.

Page 5: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Process of creating schema mappings

1) Schema matching: creating correspondences between elements of the 2 schemas.
Ex: title and rating in the vendor schema correspond to title and classification in the aggregator schema.

2) Creating schema mappings from the correspondences (and filling in missing details): specifies the transformations that have to be applied to the source data in order to produce the target data.
Ex: to compute the value of price in the aggregator schema, we have to join the Product table with the Locations table, using saleLocation = name, and add the appropriate local tax given by taxRate.

Page 6: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Schema matching

Goal: to automatically create a set of correspondences between the 2 schemas.

Given two schemas S1 and S2, where the schema elements are the table/attribute names in the schema, a correspondence A -> B states that a set of elements A in S1 maps to a set of elements B in S2. The most common correspondences are 1-1, where A and B are singletons.

Ex: for the target relation Item, there are the following correspondences:
Product.title -> title
Movie.year -> year
{Product.basePrice, Locations.taxRate} -> price

Page 7: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Output of the schema matcher

Associate a confidence measure (in [0,1]) with every correspondence, because of the heuristic nature of the schema matching process.
Possible heuristics: examine the names of the schema elements; examine the data values; etc.

May associate a filter with a correspondence.
Ex: {Product.basePrice, Locations.taxRate} -> price may apply only to locations in the US.

Page 8: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Components of a schema matcher

Basic Matcher: predicts correspondences based on cues available in schema and data

Combiner: combines the predictions of the basic matchers into a single similarity matrix

Constraint enforcer: applies domain knowledge and constraints to prune the possible matches

Match selector: chooses the best match or matches from the similarity matrix
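A rough Python sketch of how these components could be wired together; the function signatures are assumptions for illustration, not an interface defined in the slides:

def match_schemas(s1, s2, base_matchers, combiner, enforcer, selector):
    """Generic schema-matching pipeline following the four components above."""
    # 1. Each base matcher produces one layer of the similarity cube.
    cube = [matcher(s1, s2) for matcher in base_matchers]
    # 2. The combiner collapses the cube into a single similarity matrix.
    matrix = combiner(cube)
    # 3. Domain knowledge and constraints prune or re-rank the matrix entries.
    matrix = enforcer(matrix)
    # 4. The match selector picks the best match (or the top few matches).
    return selector(matrix)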

Page 9: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Process of creating schema mappings

1) Schema matching: creating correspondences between elements of the 2 schemas.
Ex: title and rating in the vendor schema correspond to title and classification in the aggregator schema.

2) Creating schema mappings from the correspondences (and filling in missing details): specifies the transformations that have to be applied to the source data in order to produce the target data.
Ex: to compute the value of price in the aggregator schema, we have to join the Product table with the Locations table, using saleLocation = name, and add the appropriate local tax given by taxRate.

Page 10: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Creating mappings from correspondences

Create the actual mappings from the matches (correspondences): find how tuples from one source can be translated into tuples in the other.

Challenge: there may be more than one possible way of joining the data.
Ex: to compute the value of price in the aggregator schema, we may join the Product table with the Locations table, using saleLocation = name, and add the appropriate local tax given by taxRate; or we may join Product with Movie to obtain the origin of the director, and compute the price based on the taxes in the director's country of birth.

Page 11: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Outline

Base matchers
Combining match predictions
Applying constraints and domain knowledge to candidate schema matches
Match selector
Applying machine learning techniques to enable the schema matcher to learn
Discovering m-m matches
From matches to mappings

Page 12: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Base matchers

Input: a pair of schemas S1 and S2, with elements Ai and Bj, respectively, plus additional available information, such as data instances or text descriptions.

Output: a correspondence matrix that assigns to every pair of elements (Ai, Bj) a number between 0 and 1 predicting whether Ai corresponds to Bj.

Page 13: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Classes of base matchers

1) Name-based matchers: based on comparing the names of schema elements.

2) Instance-based matchers: based on inspecting data instances.
They must look at large amounts of data, so they are slower (though efficiency can be improved), but they are more precise.

For specific domains, it is possible to develop more specialized and effective matchers.

Page 14: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Name-based matchers

Compare the names of the elements, hoping that the names convey the true semantics of the elements.

Challenge: to find effective distance measures reflecting the distance between element names, since names are never written in exactly the same way.

Page 15: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Edit distance

Edit distance between the strings that represent the names of the elements.

Levenshtein distance: the minimum number of operations (insertions, deletions or replacements) needed to transform one string into another.

Given two schema elements represented by the strings s1 and s2, and their edit distance, denoted editDistance(s1, s2):

editSimilarity(s1, s2) = 1 - editDistance(s1, s2) / max(length(s1), length(s2))

Ex: The Levenshtein distance between the strings FullName and FName is 3; between TotalPrice and PriceSum it is 8. The editSimilarity between FullName and FName is 0.625; between TotalPrice and PriceSum it is 0.2.
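A small Python sketch of these two measures (plain dynamic programming, no external libraries; the function names are illustrative):

def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of insertions, deletions, replacements."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # replacement
        prev = curr
    return prev[-1]

def edit_similarity(s1, s2):
    """editSimilarity as defined above."""
    return 1 - edit_distance(s1, s2) / max(len(s1), len(s2))

# edit_similarity("FullName", "FName") == 0.625, as in the example above.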

Page 16: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Qgram distance

Qgrams: substrings within the names. Given a positive integer q, the set of q-grams of s, qgrams(s), consists of all the substrings of s of size q.
Ex: the 3-grams of price are: {pri, ric, ice}

Given a number q:
qgramSimilarity(s1, s2) = ||qgrams(s1) ∩ qgrams(s2)|| / ||qgrams(s1) ∪ qgrams(s2)||

Ex: the 3-gram similarity between pricemarked and markedprice is 7/18 = 0.39

Advantages over edit distance: faster, and more resilient to word-order interchange.
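A sketch of q-gram similarity using the set-based (intersection over union) reading of the formula above; variants that pad the strings or count q-grams as multisets also exist, so exact numbers may differ from the slide's example:

def qgrams(s, q=3):
    """All contiguous substrings of s of length q (no padding)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(s1, s2, q=3):
    """Overlap of the two q-gram sets (intersection over union)."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    return len(g1 & g2) / len(g1 | g2)

# qgrams("price") == {"pri", "ric", "ice"}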

Page 17: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Sound distance

Based on the way strings sound. The Soundex algorithm encodes names by their sound; it can detect when two names sound the same even if their spelling varies.

Ex: ship2 is similar to shipto

Page 18: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Normalization

Element names can be composed of acronyms or short phrases to express their meanings.

Normalization replaces a single token by several tokens that can be compared. Element names should be normalized before applying distance measures.

Some normalization techniques:
Expand known abbreviations
Expand a string with its synonyms
Remove articles, prepositions and conjunctions
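A minimal normalization sketch; the abbreviation table and stop-word list below are illustrative placeholders, not part of the slides:

import re

ABBREVIATIONS = {"qty": "quantity", "addr": "address", "amt": "amount"}   # illustrative
STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "for"}      # illustrative

def normalize(name):
    """Split an element name into comparable tokens.

    Breaks camelCase and punctuation, lower-cases, expands known
    abbreviations, and drops articles, prepositions and conjunctions.
    """
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)   # "listedPrice" -> "listed Price"
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

# normalize("listedPrice") -> ["listed", "price"]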

Page 19: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Instance-based matchers

Data instances, if available, convey the meaning of a schema element even more than its name does. Use them for predicting correspondences between schema elements.

Techniques:
Develop a set of rules for inferring common types from the format of the data values (ex: phone numbers, prices, zip codes, etc.)
Value overlap
Text-field analysis

Page 20: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Value overlap

Measuring the overlap of values in the two elements. Applies to categorical elements, whose values range over some finite domain (e.g., movie ratings, country names).

Jaccard coefficient: the fraction of the values for the two elements that can be an instance of both of them. It can also be defined as the conditional probability of a value being an instance of both elements, given that it is an instance of one of them:

JaccardSim(e1, e2) = Pr(e1 ∩ e2 | e1 ∪ e2) = ||D(e1) ∩ D(e2)|| / ||D(e1) ∪ D(e2)||

where D(e) is the set of values for element e.
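A sketch of the value-overlap measure, assuming each element's observed instances are available as a Python iterable (the representation is an assumption):

def jaccard_similarity(values1, values2):
    """Fraction of the observed values shared by two categorical elements."""
    d1, d2 = set(values1), set(values2)
    if not (d1 | d2):
        return 0.0
    return len(d1 & d2) / len(d1 | d2)

# jaccard_similarity(["PG", "R", "PG-13"], ["R", "PG", "G"]) == 0.5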

Page 21: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Text-field analysis

Applies to elements whose values are longer texts (e.g., house descriptions). Their values can vary drastically, so the probability of finding the exact same string for both elements is very low.

Idea: compare the general topics these text fields are about, using text classifiers.

Page 22: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Text classifiers

A classifier for a concept C is an algorithm that distinguishes instances of C from non-instances. It creates an internal model based on training examples: positive examples that are known to be instances of C, and negative examples that are known not to be instances of C.

Given an example c, the classifier applies its model to decide whether c is an instance of C.

Page 23: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Combining match predictions (1)

The results of the base matchers are summarized in a similarity cube. Suppose the schema matcher used l base matchers to predict correspondences between the elements A1, ..., An of S1 and the elements B1, ..., Bm of S2. The similarity cube assigns to each triple (b, i, j) a number between 0 and 1 describing the prediction of base matcher b about the correspondence between Ai and Bj.

Page 24: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Combining match predictions (2)

Output: a similarity matrix that combines the predictions of the base matchers. For every pair (i, j), we want a value between 0 and 1, Combined(i, j), that gives a single prediction about the correspondence between Ai and Bj.

Two possible combinations:
Combined(i, j) = max_{b=1..l} Base(b, i, j)
Combined(i, j) = (1/l) * sum_{b=1..l} Base(b, i, j)

Max is used when we trust a matcher that outputs a high value; avg is used otherwise. Multi-step combination functions are also possible, as is giving weights to the matchers.
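A sketch of the two combination functions, assuming the similarity cube is stored as a nested list cube[b][i][j] (the representation is illustrative):

def combine_max(cube, i, j):
    """Combined(i, j) = max over base matchers b of Base(b, i, j)."""
    return max(layer[i][j] for layer in cube)

def combine_avg(cube, i, j):
    """Combined(i, j) = average over base matchers b of Base(b, i, j)."""
    return sum(layer[i][j] for layer in cube) / len(cube)

# Two base matchers scoring the single pair (i=0, j=1):
cube = [[[0.2, 0.9]], [[0.4, 0.5]]]
# combine_max(cube, 0, 1) == 0.9 and combine_avg(cube, 0, 1) == 0.7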

Page 25: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Applying constraints and domain knowledge to candidate matches

There may exist domain-specific knowledge that is helpful in the process of schema matching. It is expressed as a set of constraints that enable pruning candidate matches.

Hard constraints: must be applied; the schema matcher will not output any match that violates them.
Soft constraints: of a more heuristic nature; they may be violated in some schemas, but the number of violations should be minimized.

A cost is associated with each constraint: infinite for hard constraints; any positive number for soft constraints.
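A minimal sketch of how a constraint enforcer could score a candidate match; the representation of matches and constraints below is assumed for illustration, not prescribed by the slides:

import math

def violation_cost(match, constraints):
    """Total cost of the constraints violated by a candidate match.

    `match` maps source elements to target elements; `constraints` is a list of
    (is_violated, cost) pairs, where is_violated is a predicate over the match
    and hard constraints carry cost = math.inf.  Matches with infinite cost are
    pruned; among the remaining ones, a lower total cost is preferred.
    """
    return sum(cost for is_violated, cost in constraints if is_violated(match))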

Page 26: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Example

Schema:
Book(ISBN, publisher, pubCountry, title, review)
Item(code, name, brand, origin, desc)
Inventory(ISBN, quantity, location)
InStore(code, availQuant)

Constraints:
T1: if A -> Item.name, then A is a key. Cost = ∞
T2: if sim(A1,B1) ≈ sim(A2,B1), A1 is next to A3, B1 is next to B2, and sim(A3,B2) > 0.8, then match A1 to B1. Cost = 2
T3: Average(length(desc)) ≥ 20. Cost = 1.5
T4: if ||{Ai ∈ attributes(R1) | ∃ Bj ∈ attributes(R2) s.t. sim(Ai,Bj) > 0.8}|| ≥ ||attributes(R1)||/2, then match table R1 to table R2. Cost = 1

Page 27: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Algorithms for applying constraints to the similarity matrix

Applying constraints with A* search: guaranteed to find the optimal solution, but computationally more expensive.

Applying constraints with local propagation: faster, but may get stuck in a local minimum.

Page 28: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Components of a schema matcher

Page 29: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Match selector

Input: the similarity matrix. Output: a schema match, or the top few matches.

If the matching system is interactive, the system computes several possible matches and the user can choose among the top k correspondences.
Ex: schema1(shipAddr, shipPhone, billAddr, billPhone) and schema2(addr, phone).
Both shipPhone -> phone and billPhone -> phone are plausible until the user indicates shipAddr -> addr; then shipPhone -> phone becomes more likely than billPhone -> phone.

Page 30: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

The algorithm behind

The match selection problem can be formulated as an instance of finding a stable marriage:
Elements of S1 play the men; elements of S2 play the women.
sim(i, j) is the degree to which Ai and Bj desire each other.
Goal: find a stable match between men and women.

A match is unstable if it contains Ai -> Bj and Ak -> Bl such that sim(i,l) > sim(i,j) and sim(i,l) > sim(k,l); if such couples existed, Ai and Bl would want to be matched together.

To produce a schema match without unhappy couples:
match = {}
Repeat:
  let (i, j) be the pair with the highest value in sim such that Ai and Bj are not in match
  add Ai -> Bj to match
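A sketch of this greedy selection in Python, assuming the similarity matrix is given as a nested list sim[i][j] (names are illustrative):

def select_matches(sim):
    """Greedy 1-1 selection: repeatedly pick the highest remaining similarity
    whose row (element of S1) and column (element of S2) are still unmatched."""
    n, m = len(sim), len(sim[0])
    used_rows, used_cols, match = set(), set(), {}
    pairs = sorted(((i, j) for i in range(n) for j in range(m)),
                   key=lambda p: sim[p[0]][p[1]], reverse=True)
    for i, j in pairs:
        if i not in used_rows and j not in used_cols:
            match[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return match

# select_matches([[0.9, 0.1], [0.4, 0.8]]) -> {0: 0, 1: 1}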

Page 31: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Applying machine learning techniques to enable the schema matcher to learn

Schema matching tasks are often repetitive. When working in the same domain, one starts to identify how common domain concepts get expressed in schemas, so the designer can create schema matches more quickly over time.

So: can the schema matching also improve over time? Or: can a schema matcher learn from previous experience? Machine learning techniques can be applied to schema matching, thus enabling the matcher to improve over time.

Page 32: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Learning to match

Suppose n data sources s1, ..., sn whose schemas must be mapped into the mediated schema G.

Goal: train the system by manually providing it with schema matches on a small number of data sources (e.g., s1, ..., sm, where m is much smaller than n). The system generalizes from the training examples so that it is able to predict matches for the sources sm+1, ..., sn.

Page 33: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Components of the system

Page 34: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Training phase (1)

1. Manually specify mappings for several sources

2. Extract source data

3. Create training data for each base learner

4. Train the base learners

5. Train the meta-learner

Page 35: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Training phase (2)

Learning classifiers for the elements in the mediated schema: the classifier for an element e of the mediated schema examines an element of a source schema and predicts whether it matches e or not.

To create the classifiers, employ a machine learning algorithm. Each machine learning algorithm typically considers only one aspect of the schema and has its own advantages and drawbacks, so a multi-strategy learning technique is used.

Page 36: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Multi-strategy learning

Training phase:
Employ a set of learners l1, ..., lk.
Each base learner creates a classifier for each element e of the mediated schema from its training examples.
Use a meta-learner to learn weights for the different base learners: for each element e of the mediated schema and base learner l, the meta-learner computes a weight w_{e,l}. It can do this because it is working with labeled training examples.

Matching phase:
When presented with a schema S whose elements are e1', ..., et':
Apply the base learners to e1', ..., et'. Let p_{e,l}(e') be the prediction of learner l on whether e' matches e.
Combine the learners:
p_e(e') = sum_{j=1..k} w_{e,lj} * p_{e,lj}(e')
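A sketch of the combination step, assuming the base learners and the meta-learned weights for one mediated-schema element e are kept in dictionaries keyed by learner name (an illustrative representation):

def combined_prediction(e_prime, base_learners, weights):
    """p_e(e') = sum over learners l of w_{e,l} * p_{e,l}(e').

    base_learners maps each learner name to a function returning p_{e,l}(e');
    weights maps the same names to the meta-learned weights w_{e,l}.
    """
    return sum(weights[name] * predict(e_prime)
               for name, predict in base_learners.items())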

Page 37: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Example: Charlie comes to town and wants to find houses with 2 bedrooms priced under 300K, using the listing sites realestate.com, homeseekers.com, and homes.com.

Page 38: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Data Integration

The query "Find houses with 2 bedrooms priced under 300K" is posed against the mediated schema; a wrapper for each source (realestate.com, homeseekers.com, homes.com) exposes source schemas 1, 2, and 3.

Page 39: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Example

Mediated schema: address, price, agent-phone, description.

Schema of realestate.com: location, listed-price, phone, comments, with data values such as
location: Miami, FL / Boston, MA / ...
listed-price: $250,000 / $110,000 / ...
phone: (305) 729 0831 / (617) 253 1429 / ...
comments: Fantastic house / Great location / ...

homes.com data values:
price: $550,000 / $320,000 / ...
contact-phone: (278) 345 7215 / (617) 335 2315 / ...
extra-info: Beautiful yard / Great beach / ...

Learned hypotheses:
Naive-Bayes learner: if "fantastic" & "great" occur frequently in data values => description
Rule-based learner: if "phone" occurs in the name => agent-phone

Page 40: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Training the Learners

Source realestate.com (schema: location, listed-price, phone, comments), together with the manually specified match to the mediated schema (address, price, agent-phone, description), yields training data such as:

<location> Miami, FL </> <listed-price> $250,000 </> <phone> (305) 729 0831 </> <comments> Fantastic house </>
<location> Boston, MA </> <listed-price> $110,000 </> <phone> (617) 253 1429 </> <comments> Great location </>

Training examples for the Name Learner: (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...

Training examples for the Naive Bayes Learner: ("Miami, FL", address), ("$ 250,000", price), ("(305) 729 0831", agent-phone), ("Fantastic house", description), ...

Page 41: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Applying the Learners

Schema of homes.com: area, day-phone, extra-info, matched against the mediated schema (address, price, agent-phone, description). Sample instances:

<area> Seattle, WA </> <area> Kent, WA </> <area> Austin, TX </>
<day-phone> (278) 345 7215 </> <day-phone> (617) 335 2315 </> <day-phone> (512) 427 1115 </>
<extra-info> Beautiful yard </> <extra-info> Great beach </> <extra-info> Close to Seattle </>

The Name Learner and the Naive Bayes learner each produce weighted predictions per instance, e.g. (address, 0.8), (description, 0.2) and (address, 0.6), (description, 0.4), and the Meta-Learner combines them into predictions such as (address, 0.7), (description, 0.3) for area and (agent-phone, 0.9), (description, 0.1) for day-phone.

Page 42: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Base Learners

Input:
schema information: name, proximity, structure, ...
data information: value, format, ...

Output: a prediction weighted by a confidence score.

Examples:
Name learner: agent-name => (name, 0.7), (phone, 0.3)
Naive Bayes learner: "Kent, WA" => (address, 0.8), (name, 0.2); "Great location" => (description, 0.9), (address, 0.1)

Page 43: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Rule-based learner

Examines a set of training examples and computes a set of rules that can be applied to test instances. Rules can be represented as logical formulae or as decision trees.

Works well in domains where a set of rules can accurately characterize instances of the class (e.g., identifying elements that adhere to certain formats).

Page 44: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Example: rule-based learner for identifying phone numbers (1)

Positive and negative examples of phone numbers:

Example           instance?   # of digits   position of (   position of )   position of -
(608)435-2322     yes         10            1               5               9
(60)445-284       no          9             1               4               8
849-7394          yes         7             -               -               4
(1343) 429-441    no          10            1               6               10
43 43 (12 1285)   no          10            5               12              -
5549902           no          7             -               -               -
(212) 433 8842    yes         10            1               5               -

Page 45: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Example: rule-based learner for identifying phone numbers (2)

A common method to learn rules is to create a decision tree. It encodes rules such as:
If i has 10 digits, a '(' in position 1 and a ')' in position 5, then yes.
If i has 7 digits, but no '-' in position 4, then no.
...
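The kind of rules such a decision tree encodes can be written directly as Python; the feature extraction below mirrors the columns of the table above, but in practice the rules are learned rather than hand-written:

def phone_features(s):
    """Features from the table: digit count and 1-based positions of '(', ')', '-'."""
    return {
        "digits": sum(ch.isdigit() for ch in s),
        "lparen": s.find("(") + 1,   # 0 means the character is absent
        "rparen": s.find(")") + 1,
        "dash": s.find("-") + 1,
    }

def looks_like_phone(s):
    """Hand-written stand-in for the learned decision tree."""
    f = phone_features(s)
    if f["digits"] == 10 and f["lparen"] == 1 and f["rparen"] == 5:
        return True
    if f["digits"] == 7 and f["dash"] == 4:
        return True
    return False

# looks_like_phone("(608)435-2322") -> True; looks_like_phone("5549902") -> False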

Page 46: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Naive Bayes Learner

Examines the tokens of a test instance and assigns to the instance the most likely class, given the occurrences of tokens in the training set. Effective for recognizing text fields.

Given a test instance, the learner converts it into a bag of tokens.

Page 47: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Naive Bayes learner at work

Given that c1, ..., cn are the elements of the mediated schema, the learner is given a test instance d = {w1, ..., wk} to classify.

Goal: assign d to the element cd with the highest posterior probability given d:

cd = argmax_ci P(ci | d)

Since P(ci | d) = P(d | ci) P(ci) / P(d),

cd = argmax_ci [P(d | ci) P(ci) / P(d)] = argmax_ci [P(d | ci) P(ci)]

P(d | ci) and P(ci) must be estimated from the training data.

Page 48: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Estimation of P(d|ci) and P(ci)

P(ci) is approximated by the fraction of the training instances with label ci.

To compute P(d|ci), assume that the tokens wj appear in d independently of each other given ci:

P(d|ci) = P(w1|ci) P(w2|ci) ... P(wk|ci)

P(wj|ci) = n(wj,ci) / n(ci), where
n(ci): total number of tokens in the training instances with label ci
n(wj,ci): number of times token wj appears in all training instances with label ci
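A compact sketch of these estimates; training instances are (token list, label) pairs, and Laplace smoothing is added to avoid zero probabilities, a detail the slide does not mention:

import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (tokens, label). Returns priors P(ci) and token counts n(wj, ci)."""
    label_counts = Counter(label for _, label in examples)
    token_counts = defaultdict(Counter)            # token_counts[ci][wj] = n(wj, ci)
    for tokens, label in examples:
        token_counts[label].update(tokens)
    priors = {c: n / len(examples) for c, n in label_counts.items()}
    return priors, token_counts

def classify(tokens, priors, token_counts, vocab_size):
    """Return argmax_ci P(ci) * prod_j P(wj | ci), computed in log space."""
    best, best_score = None, -math.inf
    for c, prior in priors.items():
        n_c = sum(token_counts[c].values())
        score = math.log(prior) + sum(
            math.log((token_counts[c][w] + 1) / (n_c + vocab_size))   # Laplace smoothing
            for w in tokens)
        if score > best_score:
            best, best_score = c, score
    return best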

Page 49: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Conclusions on the Naive Bayes learner

Naive Bayes performs well in many domains, even though the independence assumption is not always valid.

Works best when:
There are tokens strongly indicative of the correct label, because they appear in one element and not in the others (ex: "beautiful", "fantastic" to describe houses).
There are only weakly suggestive tokens, but many of them.

Doesn't work well on short or numeric fields.

Page 50: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Recap: multi-strategy learning

Training phase:
Employ a set of learners l1, ..., lk.
Each base learner creates a classifier for each element e of the mediated schema from its training examples.
Use a meta-learner to learn weights for the different base learners: for each element e of the mediated schema and base learner l, the meta-learner computes a weight w_{e,l}.

Matching phase:
When presented with a schema S whose elements are e1', ..., et':
Apply the base learners to e1', ..., et'. Let p_{e,l}(e') be the prediction of learner l on whether e' matches e.
Combine the learners:
p_e(e') = sum_{j=1..k} w_{e,lj} * p_{e,lj}(e')

Page 51: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Training the meta-learner

The meta-learner learns the weights to attach to each of the base learners from the training examples; the weights can be different for every mediated-schema element.

How does it work?
It asks the base learners for predictions on the training examples.
It judges how well each learner performed in providing the prediction for each mediated-schema element.
It assigns to each combination (mediated-schema element ci, base learner lj) a weight indicating how much it trusts that learner's predictions regarding ci.

Any classification algorithm can be used to compute the weights.
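One simple, assumed way to obtain such weights is to score each base learner's accuracy on the training examples for a mediated-schema element and normalize; the slides only say that any classification algorithm can be used:

def meta_weights(training, base_learners):
    """training: list of (instance, true_label) pairs for one mediated-schema element.
    Returns a normalized weight per base learner, proportional to its accuracy."""
    scores = {}
    for name, predict in base_learners.items():
        correct = sum(predict(x) == y for x, y in training)
        scores[name] = correct / len(training)
    total = sum(scores.values()) or 1.0      # avoid division by zero
    return {name: s / total for name, s in scores.items()}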

Page 52: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Outline

Base matchers
Combining match predictions
Applying constraints and domain knowledge to candidate schema matches
Match selector
Applying machine learning techniques to enable the schema matcher to learn
Discovering m-m matches
From matches to mappings

Page 53: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

From matches to mappings

Schema matches: correspondences between the source and the target schemas. Now: specify the operations to be performed on the source data so that they can be transformed into the target data.

Use a DBMS as the transformation engine. Creating mappings then becomes a process of query discovery: find the queries, using joins, unions, filtering and aggregates, that correctly transform the data into the desired schema.

The algorithm presented next explores the space of possible schema mappings; it is used in the CLIO system.

Page 54: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

User interaction

Creating mappings is a complex process. The system generates the mapping expressions automatically: the possible mappings are produced using the semantics conveyed by constraints such as foreign keys.

The system then shows the designer example data instances so that she can verify which are the right mappings.

Page 55: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Motivating example

Two correspondences, f1 and f2, map into the salary of the target relation Personnel: f1 computes a salary from PayRate.HrRate and WorksOn.Hrs, while f2 maps the salary stored in Professor.

Question: should professor salaries be unioned with the other employee salaries, or should the salaries computed from the two correspondences be joined?

Page 56: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Possible mappings

If attribute ProjRank is a foreign key of the relation PayRate, then the mapping would be:

SELECT P.HrRate * W.Hrs
FROM PayRate P, WorksOn W
WHERE P.Rank = W.ProjRank

Suppose instead that ProjRank is not a foreign key of PayRate, but the Name attribute of WorksOn is a foreign key of Student and the Yr attribute of Student is a foreign key of PayRate (the salary depends on the year of the student). Then the following query should be chosen:

SELECT P.HrRate * W.Hrs
FROM PayRate P, WorksOn W, Student S
WHERE W.Name = S.Name AND S.Yr = P.Rank

It is not clear which join path to choose for mapping f1!

Page 57: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Possible mappings (2)

One interpretation of f2 is that the values produced from f1 should be joined with those produced by f2. Then most of the values in the source DB would not be mapped to the target.

Another interpretation: there are two ways of computing the salary of employees, one applying to professors and another to the other employees. The corresponding mapping is:

SELECT P.HrRate * W.Hrs
FROM PayRate P, WorksOn W, Student S
WHERE W.Name = S.Name AND S.Yr = P.Rank
UNION ALL
SELECT Salary
FROM Professor

Page 58: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Principles to guide the mapping construction

If possible, all values in the source should appear in the target: choose a union rather than a join.

If possible, a value from the source should only contribute once to the target.

Associations between values that exist in the source should not be lost: use a join rather than a cartesian product to compute the salary value using f1.

Page 59: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Possible mappings (3)

Consider a few more correspondences:
f3: Professor(Id) -> Personnel(Id)
f4: Professor(Name) -> Personnel(Name)
f5: Address(Addr) -> Personnel(Addr)
f6: Student(Name) -> Personnel(Name)

They fall into two candidate sets of correspondences:
f2, f3, f4 and f5: map professors to Personnel
f1, f6: map the other employees (students) to Personnel

The algorithm explores the possible joins within every candidate set and considers how to union the transformations corresponding to each candidate set.

Page 60: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Possible mappings (4)

f3: Professor(Id) -> Personnel(Id)
f4: Professor(Name) -> Personnel(Name)
f5: Address(Addr) -> Personnel(Addr)
f6: Student(Name) -> Personnel(Name)

The most reasonable mapping is:

SELECT P.Id, P.Name, P.Sal, A.Addr
FROM Professor P, Address A
WHERE A.Id = P.Id
UNION ALL
SELECT NULL as Id, S.Name, P.HrRate * W.Hrs, NULL as Addr
FROM Student S, PayRate P, WorksOn W
WHERE S.Name = W.Name AND S.Yr = P.Rank

Page 61: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Possible mappings (5)

f3: Professor(Id) -> Personnel(Id)
f4: Professor(Name) -> Personnel(Name)
f5: Address(Addr) -> Personnel(Addr)
f6: Student(Name) -> Personnel(Name)

But this one is also possible:

SELECT NULL as Id, NULL as Name, NULL as Sal, A.Addr
FROM Address A
UNION ALL
SELECT P.Id, P.Name, P.Sal, NULL as Addr
FROM Professor P
UNION ALL
SELECT NULL as Id, S.Name, NULL as Sal, NULL as Addr
FROM Student S
...

Page 62: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Query discovery algorithm - Goal

Eliminate unlikely mappings from the large search space of candidate mappings, and identify correct mappings a user might not otherwise have considered.

Page 63: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Query discovery algorithm - Characteristics

It is interactive: it explores the space of possible mappings, proposes the most likely ones to the user, and accepts user feedback to guide it in the right direction.

It uses heuristics, which can be replaced by better ones if available.

Page 64: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Query discovery algorithm – Input

A set of correspondences M = {fi : Ai -> Bi}, where
Ai: a set of attributes in the source S1
Bi: one attribute of the target S2

Possible filters on source attributes: a range restriction on an attribute, an aggregate of an attribute, etc.

Page 65: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Query discovery algorithm – 1st phase

Create all possible candidate sets (subsets of M) that contain at most one correspondence per attribute of S2. Each candidate set represents one way of computing the attributes of S2. If a set covers all attributes of S2, it is called a complete set.

The candidate sets do not need to be disjoint.

Page 66: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Example

Given the correspondences:
f1: S1.A -> T.C
f2: S2.A -> T.D
f3: S2.B -> T.C

the complete candidate sets are {f1, f2} and {f2, f3}. The singleton sets {f1}, {f2} and {f3} are also candidate sets.
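A sketch of this first phase, assuming M is given as a list of (correspondence name, target attribute) pairs (an illustrative representation):

from itertools import product

def candidate_sets(correspondences):
    """All non-empty subsets of M with at most one correspondence per target attribute.
    Complete sets are those that cover every target attribute."""
    by_target = {}
    for name, target in correspondences:
        by_target.setdefault(target, []).append(name)
    # For each target attribute, either pick one of its correspondences or skip it.
    choices = [options + [None] for options in by_target.values()]
    sets = {frozenset(c for c in combo if c is not None) for combo in product(*choices)}
    return [s for s in sets if s]

M = [("f1", "T.C"), ("f2", "T.D"), ("f3", "T.C")]
# candidate_sets(M) -> {f1,f2}, {f2,f3}, {f1}, {f2}, {f3}, matching the example above.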

Page 67: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Query discovery algorithm – 2nd phase

Consider each candidate set and search for the best set of joins within it. Given a candidate set v, suppose (Ai -> Bi) is in v and Ai includes attributes from multiple relations in S1. Then search for a join path connecting the relations mentioned in Ai, using the following heuristic.

Heuristic: a join path can be either
a path through foreign keys,
a path proposed by inspecting previous queries on the source, or
a path discovered by mining the source data for joinable columns.

Page 68: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Query discovery algorithm – 2nd phase (cont.)

Keep the candidate sets for which join paths are found. When there are multiple join paths, use the following heuristic for selecting among them.

Heuristic: prefer paths through foreign keys. If there are multiple such paths, choose one that involves an attribute on which there is a filter in a correspondence, if it exists. To further rank paths, favor the join path where the estimated difference between the outer join and the inner join is the smallest: this favors joins with the least number of dangling tuples.

Page 69: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Query discovery algorithm – 3rd phase

Examine the remaining candidate sets and try to combine them by union so that they cover all the correspondences in M.

Search for covers of the correspondences: a subset T of the remaining candidate sets is a cover if it includes all the correspondences in M and it is minimal (no candidate set can be removed from T while still obtaining a cover).

Example: with candidate sets {{f1, f2}, {f2, f3}, {f1}, {f2}, {f3}}, possible covers include
T1 = {{f1}, {f2, f3}}
T2 = {{f1, f2}, {f2, f3}}

Page 70: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Query discovery algorithm – 3rd phase (cont.)

If there are multiple possible covers, use the following heuristic.

Heuristic: choose the cover with the smallest number of candidate sets (a simpler mapping should be more appropriate). If there is more than one with the same number of candidate sets, choose the one that includes more attributes of S2 (to cover more of that schema).

Page 71: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Query discovery algorithm – 4th phase

Create a schema mapping expression as an SQL query: first create an SQL query for each candidate set in the selected cover, and then union them.

Example: suppose v is a candidate set.
The attributes of S2 in v are put in the SELECT clause.
Each of the relations in the join paths found for v is put in the FROM clause.
The corresponding join predicates are put in the WHERE clause.
Any filters associated with the correspondences in v are also added to the WHERE clause.
Finally, take the union of the queries for each candidate set in the cover.

Computing the SQL mapping expressions for T1 and T2 is left as an exercise.

Page 72: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

The CLIO tool

http://www.almaden.ibm.com/cs/projects/criollo/

http://birte08.stanford.edu/ppts/11-ho.pd

Page 73: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

References

Chapter 4, draft of the book "Principles of Data Integration" by AnHai Doan, Alon Halevy, and Zachary Ives (in preparation).

Erhard Rahm and Philip A. Bernstein. "A survey of approaches to automatic schema matching". VLDB Journal, 10(4):334–350, 2001.

AnHai Doan, Pedro Domingos, and Alon Halevy. "Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach". SIGMOD 2001.

R. J. Miller, L. M. Haas, and M. Hernandez. "Schema Mapping as Query Discovery". VLDB 2000.

Renée J. Miller, Mauricio A. Hernandez, Laura M. Haas, Ling-Ling Yan, C. T. Howard Ho, Ronald Fagin, and Lucian Popa. "The Clio Project: Managing Heterogeneity". SIGMOD Record 30(1), March 2001, pp. 78–83.

Page 74: Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Virtual Data Integration in industry

Known as EII: Enterprise Information Integration.
Different from EAI: Enterprise Application Integration.
Different from ETL in DW: Extraction, Transformation and Loading in Data Warehousing.

A good reference to look at: A. Halevy et al., "Enterprise Information Integration: Successes, Challenges and Controversies", SIGMOD'05.