
Knowledge and the Web – Schema, instance and ontology matching

Bettina Berendt

KU Leuven, Department of Computer Science

http://www.cs.kuleuven.be/~berendt/teaching/2014-15-1stsemester/kaw/

Last update: 22 October 2014


Until now ...

... we have looked into modelling

... we have seen how the languages RDF(S) and OWL allow us to combine different schemas and data

... we have seen how Linked Data on the Web uses HTTP as a connecting protocol/architecture

... we have assumed that such combinations can be done effortlessly (unique names etc.)

... we have looked at some interpretation problems associated with these procedures

Now we need to ask:

What are (further) challenges of such combinations? What are approaches proposed to solve them?

– from the databases & the Semantic Web / ontologies fields

– from architectural and logical points of view


Motivation 1: Price comparison engines search & combine heterogeneous travel-agency DBs, which search & combine heterogeneous airline DBs


Motivation 2a: Schemas coming from different languages

A river is a natural stream of water, usually freshwater, flowing toward an ocean, a lake, or another stream. In some cases a river flows into the ground or dries up completely before reaching another body of water. Usually larger streams are called rivers while smaller streams are called creeks, brooks, rivulets, rills, and many other terms, but there is no general rule that defines what can be called a river. Sometimes a river is said to be larger than a creek,[1] but this is not always the case.[2]

Une rivière est un cours d'eau qui s'écoule sous l'effet de la gravité et qui se jette dans une autre rivière ou dans un fleuve, contrairement au fleuve qui se jette, lui, dans la mer ou dans l'océan. [English: A "rivière" is a watercourse that flows under the effect of gravity and empties into another rivière or into a fleuve, unlike a fleuve, which empties into the sea or the ocean.]

Een rivier is een min of meer natuurlijke waterstroom. We onderscheiden oceanische rivieren (in België ook wel stroom genoemd) die in een zee of oceaan uitmonden, en continentale rivieren die in een meer, een moeras of woestijn uitmonden. Een beek is de aanduiding voor een kleine rivier. Tussen beek en rivier ligt meestal een bijrivier. [English: A "rivier" is a more or less natural stream of water. We distinguish oceanic rivers (in Belgium also called a "stroom") that empty into a sea or ocean, and continental rivers that empty into a lake, a marsh or a desert. A "beek" is the term for a small river. Between beek and rivier there is usually a "bijrivier" (tributary).]


Motivation 2b: Information about “the same“ thing from different sources


Motivation 3a: Are these the same entity?


Motivation 3b: „Who is that?“ – Merging identities

Mickey Mouse


Motivation 3c: „Who was that?“ – Re-identification


High-level overview: Goals and approaches in data integration

Basic goal: Combine data/knowledge from different sources

Goal / emphasis can lie on finding correspondences between

– the models → schema matching, ontology matching

– the instances → record linkage

Techniques can leverage similarities between

– schema/ontology-level information (most of today)

– instance information

Instance matching is an established problem in DB (record linkage); a focus & challenge for LOD (“owl:sameAs“)


Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

(Automated) matching of LOD and LOD ontologies

Evaluating matching

Involving the user: Explanations; mass collaboration

If time permits, these 2 topics too (briefly)


The match problem (Running example 1)

Given two schemas S1 and S2, find a mapping between elements of S1 and S2 that correspond semantically to each other


Running example 2


Based on what information can the matchings/mappings be found?

(work on the two running examples)


The match operator

Match operator: f(S1, S2) = mapping between S1 and S2, for schemas S1, S2

Mapping: a set of mapping elements

Mapping element: elements of S1, elements of S2, a mapping expression

Mapping expression: different functions and relationships
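To make the definitions concrete, here is a minimal sketch of the match operator's input/output as data structures (Python; all names are illustrative, not taken from any particular matching system):

from dataclasses import dataclass
from typing import List

@dataclass
class MappingElement:
    s1_elements: List[str]   # element names from schema S1, e.g. ["HOUSES.location"]
    s2_elements: List[str]   # element names from schema S2, e.g. ["LISTINGS.area"]
    expression: str          # mapping expression, e.g. "=" or "list-price = price * (1 + fee-rate)"

# a mapping is simply a collection of mapping elements
Mapping = List[MappingElement]

def match(s1, s2) -> Mapping:
    # the match operator f(S1, S2); any concrete matcher (CUPID, iMAP, ...) implements this
    raise NotImplementedError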


Matching expressions: examples

Scalar relations (=, ≥, ...): S.HOUSES.location = T.LISTINGS.area

Functions: T.LISTINGS.list-price = S.HOUSES.price * (1+S.AGENTS.fee-rate); T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)

ER-style relationships (is-a, part-of, ...)
Set-oriented relationships (overlaps, contains, ...)
Any other terms that are defined in the expression language used


Matching and mapping

1. Find the schema match („declarative“)

2. Create a procedure (e.g., a query expression) to enable automated data translation or exchange (mapping, „procedural“)

Example of result of step 2: To create T.LISTINGS from S (simplified notation):

area = SELECT location FROM HOUSES

agent-name = SELECT name FROM AGENTS

agent-address = SELECT concat(city,state) FROM AGENTS

list-price = SELECT price * (1+fee-rate)

FROM HOUSES, AGENTS

WHERE agent-id = id


Based on what information can the matchings/mappings be found?

Rahm & Bernstein's classification of schema matching approaches


Challenges

Semantics of the involved elements often need to be inferred

Often need to base (heuristic) solutions on cues in schema and data, which are unreliable

e.g., homonyms (area), synonyms (area, location)

Schema and data clues are often incomplete, e.g., date: date of what?

Global nature of matching: to choose one matching possibility, must typically exclude all others as worse

Matching is often subjective and/or context-dependent, e.g., does house-style match house-description or not?

Extremely laborious and error-prone process, e.g., Li & Clifton 2000: project at GTE telecommunications:

40 databases, 27K elements, no access to the original developers of the DBs; estimated time for just finding and documenting the matches: 12 person-years

Ontologies are often even bigger. For example, Cyc (as of 2012) has > 500,000 concepts, ~ 5,000,000 assertions, > 26,000 relations


Semi-automated schema matching (1)

Rule-based solutions: hand-crafted rules; exploit schema information

+ relatively inexpensive

+ do not require training

+ fast (operate only on schema, not data)

+ can work very well in certain types of applications & domains

+ rules can provide a quick & concise method of capturing user knowledge about the domain

– cannot exploit data instances effectively

– cannot exploit previous matching efforts

(other than by re-use)


Semi-automated schema matching (2)

Learning-based solutions: rules/mappings learned from attribute specifications and statistics of data content (Rahm & Bernstein: „instance-level matching“)

Exploit schema information and data

Some approaches use external evidence:

Past matches
Corpus of schemas and matches („matchings in real-estate applications will tend to be alike“)
Corpus of users (more details later in this slide set)

+ can exploit data instances effectively

+ can exploit previous matching efforts

– relatively expensive

– require training

– slower (operate on data)

– results may be opaque (e.g., neural network output) → explanation components! (more details later)


Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

(Automated) matching of LOD and LOD ontologies

Evaluating matching

Involving the user: Explanations; mass collaboration


Overview (1)

Rule-based approach

Schema types: Relational, XML

Metadata representation: Extended ER

Match granularity: Element, structure

Match cardinality: 1:1, n:1


Overview (2)

Schema-level match: Name-based: name equality, synonyms, hypernyms, homonyms, abbreviations

Constraint-based: data type and domain compatibility, referential constraints

Structure matching: matching subtrees, weighted by leaves

Re-use, auxiliary information used: Thesauri, glossaries

Combination of matchers: Hybrid

Manual work / user input: User can adjust threshold weights


Basic representation: Schema trees

Computation overview:

1. Compute similarity coefficients between elements of these graphs

2. Deduce a mapping from these coefficients


Computing similarity coefficients (1): Linguistic matching

Operates on schema element names (= nodes in schema tree)

1. Normalization
Tokenization (parse names into tokens based on punctuation, case, etc.), e.g., Product_ID → {Product, ID}
Expansion (of abbreviations and acronyms)
Elimination (of prepositions, articles, etc.)

2. Categorization / clustering: based on data types, schema hierarchy, linguistic content of names

e.g., „real-valued elements“, „money-related elements“

3. Comparison (within the categories): compute linguistic similarity coefficients (lsim) based on a thesaurus (synonymy, hypernymy)

Output: Table of lsim coefficients (in [0,1]) between schema elements
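A rough sketch of such a linguistic-matching step in Python, using tokenization and NLTK's WordNet interface as the thesaurus (requires the WordNet corpus to be downloaded; the scoring is illustrative and not CUPID's actual lsim formula):

import re
from itertools import product
from nltk.corpus import wordnet as wn

def tokenize(name):
    # split on punctuation/underscores and camel case, e.g. "Product_ID" -> ["product", "id"]
    tokens = []
    for part in re.split(r"[_\-\s]+", name):
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [t.lower() for t in tokens if t]

def token_sim(t1, t2):
    # 1.0 for identical tokens, otherwise the best WordNet path similarity between any two senses
    if t1 == t2:
        return 1.0
    best = 0.0
    for s1, s2 in product(wn.synsets(t1), wn.synsets(t2)):
        best = max(best, s1.path_similarity(s2) or 0.0)
    return best

def lsim(name1, name2):
    # average best-match token similarity, a value in [0, 1]
    toks1, toks2 = tokenize(name1), tokenize(name2)
    if not toks1 or not toks2:
        return 0.0
    return sum(max(token_sim(a, b) for b in toks2) for a in toks1) / len(toks1)

print(lsim("Product_ID", "ProductNumber"))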


How to identify synonyms and homonyms: Example WordNet


How to identify hypernyms: Example WordNet

What if you had to match “statement“ and “bill“?
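One way to explore this is to probe WordNet directly, e.g. via NLTK (output depends on the installed WordNet version; the sense identifier below is an assumption, check the printed list first):

from nltk.corpus import wordnet as wn

# "bill" has many noun senses (invoice, banknote, beak, draft law, ...): a matcher must disambiguate
for synset in wn.synsets("bill", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())

# do "statement" and "bill" share a sense (synonymy)?
print(set(wn.synsets("statement", pos=wn.NOUN)) & set(wn.synsets("bill", pos=wn.NOUN)))

# hypernyms of one particular sense of "bill" (sense id assumed, see the printed list above)
print([h.name() for h in wn.synset("bill.n.01").hypernyms()])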


(Lately also done with Wikipedia rather than with WordNet: e.g. WikiMatch)


Computing similarity coefficients (2): Structure matching

Intuitions: Leaves are similar if they are linguistically and data-type similar, and if they have similar neighbourhoods

Non-leaf elements are similar if linguistically similar & have similar subtrees (where leaf sets are most important)

Procedure:

1. Initialize structural similarity of leaves based on data types

Identical data types: compat. = 0.5; otherwise in [0,0.5]

2. Process the tree in post-order

3. Strong link(leaf1, leaf2) iff their weighted similarity ≥ threshold


The structure matching algorithm

Output: a 1:n mapping for leaves

To generate non-leaf mappings: 2nd post-order traversal
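A simplified sketch of this procedure (post-order traversal, leaf similarity seeded from data-type compatibility, non-leaf similarity from strongly linked leaf pairs; weights, thresholds and the exact formula are illustrative, not CUPID's published ones):

W_STRUCT = 0.5      # weight of structural vs. linguistic similarity (illustrative)
THRESHOLD = 0.6     # "strong link" threshold (illustrative)

class Node:
    def __init__(self, name, data_type=None, children=()):
        self.name, self.data_type, self.children = name, data_type, list(children)
    def leaves(self):
        return [self] if not self.children else [l for c in self.children for l in c.leaves()]

def postorder(node):
    for child in node.children:
        yield from postorder(child)
    yield node

def type_compat(t1, t2):
    # identical data types: 0.5; otherwise something lower in [0, 0.5)
    return 0.5 if t1 == t2 else 0.25

def structure_match(root1, root2, lsim):
    wsim = {}
    for n1 in postorder(root1):
        for n2 in postorder(root2):
            if not n1.children and not n2.children:
                ssim = type_compat(n1.data_type, n2.data_type)
            else:
                leaves1, leaves2 = n1.leaves(), n2.leaves()
                strong = sum(1 for a in leaves1 for b in leaves2
                             if wsim.get((id(a), id(b)), 0.0) >= THRESHOLD)
                ssim = strong / max(len(leaves1) * len(leaves2), 1)   # share of strongly linked leaf pairs
            wsim[(id(n1), id(n2))] = W_STRUCT * ssim + (1 - W_STRUCT) * lsim(n1.name, n2.name)
    return wsim   # weighted similarities; pairs above THRESHOLD form the candidate leaf mapping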


Matching shared types

Solution: expand the schema into a schema tree, then proceed as before

Can help to generate context-dependent mappings

Fails if a cycle of containment and IsDerivedFrom relationships is present (e.g., recursive type definitions)


Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

(Automated) matching of LOD and LOD ontologies

Evaluating matching

Involving the user: Explanations; mass collaboration


Main ideas

A learning-based approach

Main goal: discover complex matches, in particular functions such as

T.LISTINGS.list-price = S.HOUSES.price * (1+S.AGENTS.fee-rate)

T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state)

Works on relational schemas

Basic idea: reformulate schema matching as search


Architecture

Specialized searchers each focus on discovering certain types of complex matches, which makes the search more efficient


Overview of implemented searchers


Example: The textual searcher

For target attribute T.LISTINGS.agent-address:
Examine attributes and concatenations of attributes from S
Restrict the examined set by analyzing textual properties

Data type information in schema, heuristics (proportion of non-numeric characters etc.)

Evaluate match candidates based on data correspondences, prune inferior candidates


Example: The numerical searcher

For target attribute T.LISTINGS.list-price:

Examine attributes and arithmetic expressions over them from S

Restrict the examined set by analyzing numeric properties: data type information in schema, heuristics

Evaluate match candidates based on data correspondences, prune inferior candidates


Search strategy (1): Example textual searcher

1. Learn a (Naive Bayes) classifier

text → class („agent-address“ or „other“)

from the data instances in T.LISTINGS.agent-address

2. Apply this classifier to each match candidate (e.g., location, concat(city,state))

3. Score of the candidate = average over instance probabilities

4. For expansion: beam search – keep only the top-k scoring candidates
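A toy version of this strategy using scikit-learn as a stand-in for iMAP's own Naive Bayes classifier (attribute names follow the running example; the data values are invented):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# 1. train on instances of T.LISTINGS.agent-address vs. instances of other textual columns
target_values = ["Seattle, WA", "Portland, OR", "Miami, FL"]
other_values  = ["Lake view", "Needs repair", "Close to downtown"]
y = [1] * len(target_values) + [0] * len(other_values)
vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
clf = MultinomialNB().fit(vec.fit_transform(target_values + other_values), y)

# 2./3. score of a candidate = average predicted probability of the "agent-address" class
def score(candidate_values):
    return float(np.mean(clf.predict_proba(vec.transform(candidate_values))[:, 1]))

candidates = {
    "S.HOUSES.location": ["Lake Washington", "Downtown Portland"],
    "concat(S.AGENTS.city, S.AGENTS.state)": ["Seattle, WA", "Miami, FL"],
}

# 4. beam search: keep only the top-k scoring candidates for further expansion
k = 1
beam = sorted(candidates, key=lambda c: score(candidates[c]), reverse=True)[:k]
print(beam)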


Search strategy (2): Example numeric searcher

1. Get value distributions of target attribute and each candidate

2. Compare the value distributions (Kullback-Leibler divergence measure)

3. Score of the candidate = Kullback-Leibler measure
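A minimal sketch of the distribution comparison (histograms over a common range, KL divergence via SciPy; binning, smoothing and the example numbers are invented for illustration):

import numpy as np
from scipy.stats import entropy   # entropy(p, q) computes the KL divergence D(p || q)

def kl_score(target_values, candidate_values, bins=5):
    lo = min(np.min(target_values), np.min(candidate_values))
    hi = max(np.max(target_values), np.max(candidate_values))
    p, _ = np.histogram(target_values, bins=bins, range=(lo, hi))
    q, _ = np.histogram(candidate_values, bins=bins, range=(lo, hi))
    p = (p + 1e-9) / (p + 1e-9).sum()   # smooth to avoid zero bins
    q = (q + 1e-9) / (q + 1e-9).sum()
    return entropy(p, q)                # smaller = more similar distributions

list_price      = np.array([252_500.0, 323_200.0, 414_100.0, 505_000.0])
price_times_fee = np.array([250_000.0, 320_000.0, 410_000.0, 500_000.0]) * 1.01   # price * (1 + fee-rate)
price_alone     = np.array([180_000.0, 260_000.0, 500_000.0, 750_000.0])
print(kl_score(list_price, price_times_fee), kl_score(list_price, price_alone))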


Evaluation strategies of implemented searchers


Pruning by domain constraints

Multiple attributes of S: „attributes name and beds are unrelated“ → do not generate match candidates with these 2 attributes

Properties of a single attribute of T: „the average value of num-rooms does not exceed 10“ → use in evaluation of candidates

Properties of multiple attributes of T: „lot-area and num-baths are unrelated“ → at match selector level, „clean up“:

Example

– T.num_baths = S.baths

– ? T.lot-area = (S.lot-sq-feet/43560) + 1.3e-15 * S.baths

Based on the domain constraint, drop the term involving S.baths
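A small sketch of this clean-up step (the constraint representation and names are invented for illustration):

# target attributes declared unrelated by a domain constraint
UNRELATED_TARGETS = {frozenset({"lot-area", "num-baths"})}

# matches already accepted for other target attributes (T.num_baths = S.baths)
accepted = {"num-baths": {"baths"}}

def prune_terms(target, candidate_source_attrs):
    # drop source attributes that already serve a target declared unrelated to `target`
    forbidden = set()
    for other_target, source_attrs in accepted.items():
        if frozenset({target, other_target}) in UNRELATED_TARGETS:
            forbidden |= source_attrs
    return [a for a in candidate_source_attrs if a not in forbidden]

# the candidate for lot-area uses lot-sq-feet and baths; the baths term is dropped
print(prune_terms("lot-area", ["lot-sq-feet", "baths"]))   # -> ['lot-sq-feet']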


Pruning by using knowledge from overlap data

When S and T share the same data

Consider the fraction of data for which a mapping is correct, e.g., house locations:

S.HOUSES.location overlaps more with T.LISTINGS.area than with T.LISTINGS.agent-address

Discard the candidate T.LISTINGS.agent-address = S.HOUSES.location,

keep only T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)


Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

(Automated) matching of LOD and LOD ontologies

Evaluating matching

Involving the user: Explanations; mass collaboration


What is ontology matching (relative to schema matching)?

Same basic idea, but:
works on ontologies, which are conceptual models (not on logical schemas such as relational tables or XML trees)
emphasizes that concepts and relations need to be matched and mapped, and may treat these differently
(Note: in the schema matching literature, it is not always clearly laid out whether the matched items come from a conceptual or a logical model; the toy examples above in particular are also conceptual)

In practice, some ontology matching tasks in fact work on such simple models (or simple subparts of models) that they do not differ at all from what we have seen so far

example: Anatomy task, see below in evaluation

Terminology: also known as ontology alignment. See (Shvaiko & Euzenat, 2005) for more details.


Recap: Rahm & Bernstein's classification of schema matching approaches


The methods that are important when the schema is in the foreground (which it is in ontologies!)


The extension by Shvaiko & Euzenat (2005) [Partial view]


(slide from last week)

Special challenges on LOD?!


Using the example of Geonames and DBPedia:

1. Matching instances to generate owl:sameAs links

2. Discovering concepts that cover these instances to map between ontologies

What about matching/mapping instances and classes?


What can we infer from this? (1)

<owl:Class rdf:ID="Boek"/>
<owl:Class rdf:ID="Book"/>

<owl:DatatypeProperty rdf:ID="ISBN">
  <rdf:type rdf:resource="&owl;FunctionalProperty"/>
  <rdfs:domain rdf:resource="#Book"/>
  <rdfs:range rdf:resource="&xsd;string"/>
</owl:DatatypeProperty>

<owl:DatatypeProperty rdf:ID="isbn">
  <rdf:type rdf:resource="&owl;FunctionalProperty"/>
  <rdfs:domain rdf:resource="#Boek"/>
  <rdfs:range rdf:resource="&xsd;string"/>
</owl:DatatypeProperty>


What can we infer from this? (2)

<Book rdf:ID="mybook1">
  <ISBN rdf:datatype="&xsd;string">12345</ISBN>
</Book>
<Book rdf:ID="mybook2">
  <ISBN rdf:datatype="&xsd;string">12345</ISBN>
</Book>
<Book rdf:ID="mybook3">
  <ISBN rdf:datatype="&xsd;string">6789</ISBN>
</Book>
<Boek rdf:ID="mijnboek_3">
  <isbn rdf:datatype="&xsd;string">6789</isbn>
</Boek>


What about this? (DBpedia: 526K geographic places/features, GeoNames: 7.8 million geographic features)


How this matching was done (http://lists.w3.org/Archives/Public/semantic-web/2006Dec/0027.html)

>> Around 100,000 geonames place names now have wikipedia links.

> Very cool. I wonder how you link the articles? Can't be simple word matching, no?

Simple word matching would lead to an incredible mess. There are for example 53 places with the name London and 58 places with the name Paris in the geonames database. Place name disambiguation is a rather hard problem and for matching geonames places with wikipedia articles we use semantic information in the wikipedia dump together with the article title. The semantic information primarily is latitude and longitude, but also country, administrative division, feature type, population and categories …. We only consider articles where we are able to parse semantic information .... Unfortunately there is a proliferation of templates and a lot of wikipedia users have fun inventing new ones instead of reusing existing ones.


But what about the classes?


Concept covering: Motivation (Parundekar et al., 2012)

“The Web of Linked Data has grown significantly in the past few years – 31.6 billion triples as of September 2011. This includes a wide range of data sources from the government (42%), geographic (19.4%), life sciences (9.6%) and other domains.

A common way that the instances in these sources are linked to others is through the owl:sameAs property.

Though the size of Linked Data Cloud is increasing steadily (10% over the 28.5 billion triples in 2010), inspection of the sources at the ontology level reveals that only a few of them (15 out of the 190 sources) include mappings between their ontologies.

Since interoperability is crucial to the success of the Semantic Web, it is essential that these heterogeneous schemas, the result of a de-centralized approach to the generation of data and ontologies, also be linked.”


Challenges

The problem of finding alignments in ontologies of Linked Data sources is non-trivial, since there might not be one-to-one concept equivalences.

In some sources the ontology is extremely rudimentary; for example, GeoNames has only one class: geonames:Feature

→ alignment with a well-defined ontology such as DBpedia is not particularly useful

→ need to generate more expressive concepts. The necessary information to do this is often present in the properties and values of the instances in the sources.

For example, in GeoNames the values of the featureCode and featureClass properties provide useful concept constructors, which can be aligned with existing concepts in DBpedia

e.g., the concept geonames:featureCode=P.PPL (populated place) aligns to dbpedia:City

Approach: explore the space of concepts defined by value restrictions (“restriction classes”)


Restriction classes

Basic expression to define a restriction class:

p = v

• either p is an object property and v is a resource

• Ex.: rdf:type=City

• or p is a data property and v is a literal.

• Ex.: featureCode=P.PPL

• two restriction classes are equal if their respective instance sets can be identified as equal after following the owl:sameAs links

Conjunctive and disjunctive restriction classes

Alignment algorithm for disjunctive restriction classes:

1. Find initial equivalence and subset relations

2. Discover concept coverings using disjunctions of restriction classes
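A simplified sketch of step 1, comparing the instance sets of two atomic restriction classes after resolving owl:sameAs links (data layout, threshold and the exact scoring are illustrative; the paper's algorithm differs in details):

def restriction_class(instances, prop, value):
    # instances: dict uri -> {property: set of values}; returns the extension of "prop = value"
    return {uri for uri, props in instances.items() if value in props.get(prop, set())}

def canonical(instance_set, same_as):
    # map every instance to a canonical id by following owl:sameAs links
    return {same_as.get(uri, uri) for uri in instance_set}

def relation(r1, r2, same_as, threshold=0.9):
    # classify two restriction classes from the overlap of their linked instance sets
    a, b = canonical(r1, same_as), canonical(r2, same_as)
    if not a or not b:
        return "unrelated"
    covered1 = len(a & b) / len(a)   # fraction of r1's instances also in r2
    covered2 = len(a & b) / len(b)   # fraction of r2's instances also in r1
    if covered1 >= threshold and covered2 >= threshold:
        return "equivalent"
    if covered1 >= threshold:
        return "r1 subClassOf r2"
    if covered2 >= threshold:
        return "r2 subClassOf r1"
    return "unrelated"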


Aligning atomic restriction classes (examples on the board)

Note: There are some typos in the paper. I switched the conclusions of the first 2 if-branches. Also, the cardinality of Img(r1) in the example on p. 4 should be 3918.


Identifying concept coverings


Results


Claim – can you comment?

„An interesting outcome of our algorithm is that it identifies inconsistencies and possible errors in the linked data, and provides a method for automatically curating the Linked Data Cloud.”


Part of the evaluation


Q: “Is this a publicly available tool?“

Not all schema/ontology matchers are available, for many reasons (proprietary, collaboration with a company, own start-up, ..., the PhD student left the institute and nobody understands the code ...)

Increasingly, though, it is seen as good practice by researchers to make their tools available. You can see how (some of) these tools perform by checking the Ontology Alignment Evaluation Initiative pages (see part “Evaluating matching“)

Examples:

COMA (database schemas and ontologies) http://dbs.uni-leipzig.de/Research/coma.html

Falcon-OA (RDF(S) and OWL) http://ws.nju.edu.cn/falcon-ao/

LogMap (reasoning-based) http://www.cs.ox.ac.uk/isg/tools/LogMap/

“50 Ontology Mapping and Alignment Tools - More Than 20 Are Currently Active and Often in Open Source”: overview at http://www.mkbergman.com/1769/50-ontology-mapping-and-alignment-tools/


Outlook

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

(Automated) matching of LOD and LOD ontologies

Evaluating matching

Involving the user: Explanations; mass collaboration


How to compare?

Input: What kind of input data? (What languages? Only toy examples? What external information?)

Output: mapping between attributes or tables, nodes or paths? How much information does the system report?

Quality measures: metrics for accuracy and completeness?

Effort: how much saving of manual effort, and how is it quantified?

Pre-match effort (training of learners, dictionary preparation, ...)

Post-match effort (correction and improvement of the match output)

How are these measured?


Match quality measures

Need a „gold standard“ (the „true“ match)

Measures from information retrieval: precision, recall, F-measure

(standard choice: F1, i.e., α = 0.5)

Quantifies post-match effort
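The formula images from the original slide are not reproduced here; the standard IR measures meant are, with A the automatically found correspondences and R the reference (“gold standard“) alignment (one common parameterisation; conventions for placing α vary):

Precision = |A ∩ R| / |A|
Recall = |A ∩ R| / |R|
F_α = (Precision · Recall) / ((1 − α) · Precision + α · Recall), where α = 0.5 gives the harmonic mean F1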


Benchmarking

Do, Melnik, and Rahm (2003) found that evaluation studies were not comparable

Need more standardized conditions (benchmarks)

Since 2004: competitions in ontology matching (more in the next session):

Test cases and contests at http://www.ontologymatching.org/evaluation.html


Example: Tasks 2009 (various are re-used; 2013 is just out)

(excerpt; from http://oaei.ontologymatching.org/2009/); latest completed run at http://oaei.ontologymatching.org/2013/

Expressive ontologies: anatomy

The anatomy real world case is about matching the Adult Mouse Anatomy (2744 classes) and the NCI Thesaurus (3304 classes) describing the human anatomy.

conference: Participants will be asked to find all correct correspondences (equivalence and/or subsumption correspondences) and/or 'interesting correspondences' within a collection of ontologies describing the domain of organising conferences (the domain being well understandable for every researcher). Results will be evaluated a posteriori, in part manually and in part by data-mining and logical reasoning techniques. There will also be an evaluation against a reference mapping based on a subset of the whole collection.

Directories and thesauri: fishery gears

features four different classification schemes, expressed in OWL, adopted by different fishery information systems in the FIM division of FAO. An alignment performed on these 4 schemes should be able to spot equivalence, or a degree of similarity, between the fishing gear types and the groups of gears, so as to enable a future exercise of data aggregation across systems.

Oriented matching: This track focuses on the evaluation of alignments that contain mapping relations other than equivalences.

Instance matching: very large crosslingual resources

The purpose of this task (vlcr) is to match the Thesaurus of the Netherlands Institute for Sound and Vision (called GTAA, see below for more information) to two other resources: the English WordNet from Princeton University and DBpedia.


Mice and humans

The anatomy real world case is about matching the Adult Mouse Anatomy (2744 classes) and the NCI Thesaurus (3304 classes) describing the human anatomy.

(http://oaei.ontologymatching.org/2008/anatomy/)


Matching task and evaluation approach (http://oaei.ontologymatching.org/2007/anatomy/)

We would like to gratefully thank Martin Ringwald and Terry Hayamizu (Mouse Genome Informatics - http://www.informatics.jax.org/), who provided us with a reference mapping for these ontologies.

The reference mapping contains only equivalence correspondences between concepts of the ontologies. No correspondences between properties (roles) are specified.

If your system also creates correspondences between properties or correspondences that describe subsumption relations, these results will not influence the evaluation (but can nevertheless be part of your submitted results).

The results of your matching system will be compared to this reference alignment. Therefore, all of the results have to be delivered in the format specified here.


Matching task and evaluation approach (http://oaei.ontologymatching.org/2011/oriented/index.html)

“An increasing number of matchers are now capable of deriving mapping relations other than equivalence relations, such as subsumption, disjointness or named relations.

This is a necessity given that we need to compute alignments between ontologies at different granularity levels or between ontologies that elaborate on non-equivalent elements. The evaluation of such mappings was addressed already in OAEI (2009) Oriented Matching track. […]

The track aims also to report on evaluation methods and measures for subsumption mappings, in conjunction to the computation of equivalence mappings.

Targeting these goals, we have built new benchmark datasets that are described below.”


(Some) results (http://oaei.ontologymatching.org/2009/results/anatomy/)


(Some) results (http://oaei.ontologymatching.org/2013/results/anatomy/)


Outlook

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

(Automated) matching of LOD and LOD ontologies

Evaluating matching

Involving the user: Explanations; mass collaboration


Example in iMAP

User sees ranked candidates:

1. List-price = price

2. List-price = price * (1 + fee-rate)

Explanation:

a) Both were generated by the numeric searcher, which ranked 2 higher than 1

b) But:

c) Match month-posted = fee-rate

d) domain constraint: matches for month-posted and price do not share attributes

e) cannot match list-price to anything to do with fee-rate

f) Why c)?

g) Data instances of fee-rate were classified as of type date

User corrects this wrong step (g); the rest is repaired accordingly


Background knowledge structure for explanation: dependency graph


MOBS: Using mass collaboration to automate data integration

1. Initialization: a correct but partial match (e.g. title = a1, title = b2, etc.)

2. Soliciting user feedback: user query → user must answer a simple question → user gets the answer to the initial query

3. Computing user weights (e.g., trustworthiness = fraction of correct answers to known mappings)

4. Combining user feedback (e.g., majority count)

Important: „instant gratification“ (e.g., include the new field in the results page after a user has given helpful input)
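A toy sketch of steps 3 and 4, weighting each user by accuracy on already-known mappings and combining answers by weighted majority (all names, data and numbers are invented for illustration):

from collections import defaultdict

def user_weight(answers_on_known, known_mappings):
    # step 3: trustworthiness = fraction of correct answers on mappings we already know
    correct = sum(1 for attr, target in answers_on_known if known_mappings.get(attr) == target)
    return correct / len(answers_on_known) if answers_on_known else 0.0

def combine(votes):
    # step 4: votes is a list of (user_weight, proposed_target); pick the weighted majority
    tally = defaultdict(float)
    for weight, target in votes:
        tally[target] += weight
    return max(tally.items(), key=lambda kv: kv[1])[0]

known = {"title": "a1"}
w_alice = user_weight([("title", "a1")], known)   # 1.0: answered the known mapping correctly
w_bob   = user_weight([("title", "b7")], known)   # 0.0: answered it wrongly
print(combine([(w_alice, "price = c3"), (w_bob, "price = c9")]))   # -> "price = c3"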


Task for next week (from http://opendefinition.org/)

Do you see a statement in this definition that does not appear substantiated?

Can you give 3 reasons why it may be true?

Can you give 3 reasons why it may be false?


.... which stands in some relation to these claims ...

„An interesting outcome of our algorithm is that it identifies inconsistencies and possible errors in the linked data, and provides a method for automatically curating the Linked Data Cloud.”


Outlook

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

(Automated) matching of LOD and LOD ontologies

Evaluating matching

Involving the user: Explanations; mass collaboration

Invited lecture: Aad Versteden (Tenforce)


References / background reading; acknowledgements

Rahm, E., & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10, 334-350.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.700

Doan, A. & Halevy, A.Y. (2004). Semantic Integration Research in the Database Community: A brief survey. AI Magazine.

http://dit.unitn.it/~p2p/RelatedWork/Matching/si-survey-db-community.pdf

Hertling, S., & Paulheim, H. (2012). WikiMatch – Using Wikipedia for ontology matching. In Proc. of the Seventh International Workshop on Ontology Matching (OM-2012). http://www.dit.unitn.it/~p2p/OM-2012/om2012_Tpaper4.pdf

Madhavan, J., Bernstein, P.A., & Rahm, E. (2001). Generic Schema Matching with Cupid. In Proc. of the 27th VLDB Conference.

http://dbs.uni-leipzig.de/de/publication/title/generic_schema_matching_with_cupid

Dhamankar, R., Lee, Y., Doan, A., Halevy, A., & Domingos, P. (2004). iMAP: Discovering complex semantic matches between database schemas. In Proc. of SIGMOD 2004.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.4117

Shvaiko, P., & Euzenat, J. (2005). A Survey of Schema-based Matching Approaches. Journal on Data Semantics.

http://www.dit.unitn.it/~p2p/RelatedWork/Matching/JoDS-IV-2005_SurveyMatching-SE.pdf

pp. 50ff.: Bizer, C., Cyganiak, R., & Heath, T. (2007). How to Publish Linked Data on the Web. Chapter 6. How to set RDF Links to other Data Sources. http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/#links

pp. 55ff.: Parundekar, R., Knoblock, C.A., & Ambite, J.L. (2012). Discovering concept coverings in ontologies of linked data sources. In Proceedings of the 11th International Semantic Web Conference (ISWC 2012), pp. 427–443, Boston, Mass. http://iswc2012.semanticweb.org/sites/default/files/76490417.pdf

Do, H.-H., Melnik, S., & Rahm, E. (2003). Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems: NODe 2002, Web- and Database-Related Workshops, Erfurt, Germany, October 7-10, 2002. Revised Papers (pp. 221-237). Springer.

http://dit.unitn.it/~p2p/RelatedWork/Comparison%20of%20Schema%20Matching%20Evaluations.pdf

McCann, R., Doan, A., Varadarajan, V., & Kramnik, A. (2003). Building data integration systems via mass collaboration. In Proc. International Workshop on the Web and Databases (WebDB).

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.9964

Please see the Powerpoint slide-specific „notes“ for URLs of used pictures and formulae