Berendt: Knowledge and the Web, 2014, http://www.cs.kuleuven.be/~berendt/teaching. Knowledge and the Web: Schema, instance and ontology matching


  • Slide 1
  • Knowledge and the Web: Schema, instance and ontology matching. Bettina Berendt, KU Leuven, Department of Computer Science. http://www.cs.kuleuven.be/~berendt/teaching/2014-15-1stsemester/kaw/ Last update: 22 October 2014
  • Slide 2
  • Until now...
      • ... we have looked into modelling
      • ... we have seen how the languages RDF(S) and OWL allow us to combine different schemas and data
      • ... we have seen how Linked Data on the Web uses HTTP as a connecting protocol/architecture
      • ... we have assumed that such combinations can be done effortlessly (unique names etc.)
      • ... we have looked at some interpretation problems associated with these procedures
    Now we need to ask:
      • What are (further) challenges of such combinations?
      • What approaches have been proposed to solve them?
          • from the databases and the Semantic Web / ontologies fields
          • from architectural and logical points of view
  • Slide 3
  • Motivation 1: Price comparison engines search & combine heterogeneous travel-agency DBs, which search & combine heterogeneous airline DBs
  • Slide 4
  • Motivation 2a: Schemas coming from different languages
      • English: A river is a natural stream of water, usually freshwater, flowing toward an ocean, a lake, or another stream. In some cases a river flows into the ground or dries up completely before reaching another body of water. Usually larger streams are called rivers while smaller streams are called creeks, brooks, rivulets, rills, and many other terms, but there is no general rule that defines what can be called a river. Sometimes a river is said to be larger than a creek, but this is not always the case.
      • French: Une rivière est un cours d'eau qui s'écoule sous l'effet de la gravité et qui se jette dans une autre rivière ou dans un fleuve, contrairement au fleuve qui se jette, lui, dans la mer ou dans l'océan. (English: A "rivière" is a watercourse that flows under the effect of gravity and empties into another "rivière" or into a "fleuve", unlike a "fleuve", which empties into the sea or the ocean.)
      • Dutch: Een rivier is een min of meer natuurlijke waterstroom. We onderscheiden oceanische rivieren (in België ook wel stroom genoemd) die in een zee of oceaan uitmonden, en continentale rivieren die in een meer, een moeras of woestijn uitmonden. Een beek is de aanduiding voor een kleine rivier. Tussen beek en rivier ligt meestal een bijrivier. (English: A "rivier" is a more or less natural stream of water. We distinguish oceanic rivers (in Belgium also called "stroom"), which flow into a sea or ocean, and continental rivers, which flow into a lake, a marsh or a desert. A "beek" is the term for a small river. Between "beek" and "rivier" there is usually a "bijrivier" (tributary).)
  • Slide 5
  • Motivation 2b: Information about the same thing from different sources
  • Slide 6
  • Motivation 3a: Are these the same entity?
  • Slide 7
  • Motivation 3b: Who is that? Merging identities (Mickey Mouse)
  • Slide 8
  • Motivation 3c: Who was that? Re-identification
  • Slide 9
  • High-level overview: Goals and approaches in data integration
      • Basic goal: Combine data/knowledge from different sources
      • Goal / emphasis can lie on finding correspondences between
          • the models: schema matching, ontology matching
          • the instances: record linkage
      • Techniques can leverage similarities between
          • schema/ontology-level information
          • instance information
    Slide annotations: "most of today"; "An established problem in DB; a focus & challenge for LOD (owl:sameAs)"
  • Slide 10
  • Agenda
      • The match problem & what info to use for matching
      • (Semi-)automated matching: Example CUPID
      • (Semi-)automated matching: Example iMAP
      • (Automated) matching of LOD and LOD ontologies
      • Evaluating matching
      • Involving the user: Explanations; mass collaboration
    Slide annotation: If time permits, these 2 topics too (briefly)
  • Slide 11
  • The match problem (Running example 1): Given two schemas S1 and S2, find a mapping between elements of S1 and S2 that correspond semantically to each other
  • Slide 12
  • Running example 2
  • Slide 13
  • Based on what information can the matchings/mappings be found? (work on the two running examples)
  • Slide 14
  • The match operator
      • Match operator: f(S1,S2) = mapping between S1 and S2
          • for schemas S1, S2
      • Mapping
          • a set of mapping elements
      • Mapping elements
          • elements of S1, elements of S2, mapping expression
      • Mapping expression
          • different functions and relationships
  • Slide 15
  • Matching expressions: examples
      • Scalar relations (=, ...)
          • S.HOUSES.location = T.LISTINGS.area
      • Functions
          • T.LISTINGS.list-price = S.HOUSES.price * (1+S.AGENTS.fee-rate)
          • T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state)
      • ER-style relationships (is-a, part-of,...)
      • Set-oriented relationships (overlaps, contains,...)
      • Any other terms that are defined in the expression language used
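The match operator's output and the example expressions above can be written down as a small data structure. The following is only a minimal sketch: the class name `MappingElement`, its field names, and the helper functions `list_price` and `agent_address` are hypothetical, and the expressions are kept as plain text rather than in any particular expression language.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MappingElement:
    """One mapping element: elements of S, elements of T, and a mapping expression."""
    s1_elements: Tuple[str, ...]   # elements of the source schema S
    s2_elements: Tuple[str, ...]   # elements of the target schema T
    expression: str                # mapping expression, stored as plain text (assumption)


# A mapping is a set of mapping elements; these three come from the slide.
mapping = {
    MappingElement(("S.HOUSES.location",), ("T.LISTINGS.area",), "="),
    MappingElement(("S.HOUSES.price", "S.AGENTS.fee-rate"),
                   ("T.LISTINGS.list-price",), "price * (1 + fee-rate)"),
    MappingElement(("S.AGENTS.city", "S.AGENTS.state"),
                   ("T.LISTINGS.agent-address",), "concat(city, state)"),
}


def list_price(price: float, fee_rate: float) -> float:
    """Function-type expression: list-price = price * (1 + fee-rate)."""
    return price * (1 + fee_rate)


def agent_address(city: str, state: str) -> str:
    """Function-type expression: agent-address = concat(city, state)."""
    return city + state
```

Keeping the expression as a string keeps the sketch independent of any expression language; a real matcher would need a parsed or executable representation.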
  • Slide 16
  • Matching and mapping
      1. Find the schema match (declarative)
      2. Create a procedure (e.g., a query expression) to enable automated data translation or exchange (mapping, procedural)
    Example of result of step 2: to create T.LISTINGS from S (simplified notation):
      area = SELECT location FROM HOUSES
      agent-name = SELECT name FROM AGENTS
      agent-address = SELECT concat(city,state) FROM AGENTS
      list-price = SELECT price * (1+fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
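As a rough illustration of step 2, the simplified per-attribute queries can be combined into one executable query. This sketch uses Python's built-in `sqlite3` module with an in-memory database. The sample rows, the column types, the underscores in place of hyphens, and the decision to merge the four queries into a single join are all assumptions made for the example, not part of the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical source schema S; hyphens become underscores to get valid SQL names.
cur.execute(
    "CREATE TABLE AGENTS (id INTEGER PRIMARY KEY, name TEXT, "
    "city TEXT, state TEXT, fee_rate REAL)"
)
cur.execute("CREATE TABLE HOUSES (location TEXT, price REAL, agent_id INTEGER)")

# Made-up sample data, only so that the query returns something.
cur.execute("INSERT INTO AGENTS VALUES (1, 'A. Agent', 'Seattle', 'WA', 0.5)")
cur.execute("INSERT INTO HOUSES VALUES ('Downtown', 200.0, 1)")

# One query that produces the rows of T.LISTINGS from S.
listings = cur.execute(
    """
    SELECT location                AS area,
           name                    AS agent_name,
           city || state           AS agent_address,
           price * (1 + fee_rate)  AS list_price
    FROM HOUSES, AGENTS
    WHERE agent_id = id
    """
).fetchall()

conn.close()
print(listings)
```

The `||` operator is SQLite's string concatenation and stands in for `concat(city,state)` from the slide.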
  • Slide 17
  • Based on what information can the matchings/mappings be found? Rahm & Bernstein's classification of schema matching approaches
  • Slide 18
  • Challenges
      • Semantics of the involved elements often need to be inferred
          • Often need to base (heuristic) solutions on cues in schema and data, which are unreliable
          • e.g., homonyms (area), synonyms (area, location)
      • Schema and data clues are often incomplete
          • e.g., date: date of what?
      • Global nature of matching: to choose one matching possibility, must typically exclude all others as worse
      • Matching is often subjective and/or context-dependent
          • e.g., does house-style match house-description or not?
      • Extremely laborious and error-prone process
          • e.g., Li & Clifton 2000: project at GTE telecommunications: 40 databases, 27K elements, no access to the original developers of the DB; estimated time for just finding and documenting the matches: 12 person years
      • Ontologies often even bigger
          • For example Cyc: now (as of 2012) has > 500,000 concepts, ~ 5,000,000 assertions, > 26,000 relations
  • Slide 19
  • Semi-automated schema matching (1): Rule-based solutions
      • Hand-crafted rules
      • Exploit schema information
      + relatively inexpensive
      + do not require training
      + fast (operate only on schema, not data)
      + can work very well in certain types of applications & domains
      + rules can provide a quick & concise method of capturing user knowledge about the domain
      - cannot exploit data instances effectively
      - cannot exploit previous matching efforts (other than by re-use)
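A rule-based matcher of this kind can be sketched in a few lines. The three rules below (identical normalised name, entry in a hand-written synonym table, shared hyphen-separated name token) are illustrative assumptions, not the rules of any particular system; the synonym pair (area, location) is taken from the challenges slide.

```python
# Hand-crafted domain knowledge: pairs of names treated as synonyms (assumption).
SYNONYMS = {frozenset({"area", "location"})}


def normalize(name):
    """Lower-case a name and unify separators so that rules compare like with like."""
    return name.strip().lower().replace("_", "-")


def rule_match(s1_attrs, s2_attrs):
    """Return (s1_attr, s2_attr, rule) triples; uses schema names only, no data."""
    matches = []
    for a in s1_attrs:
        for b in s2_attrs:
            na, nb = normalize(a), normalize(b)
            if na == nb:
                matches.append((a, b, "same name"))
            elif frozenset({na, nb}) in SYNONYMS:
                matches.append((a, b, "synonym"))
            elif na in nb.split("-") or nb in na.split("-"):
                matches.append((a, b, "shared name token"))
    return matches


matches = rule_match(["location", "price"], ["area", "list-price", "agent-name"])
for m in matches:
    print(m)
```

The sketch reflects the trade-offs listed above: it is cheap and needs no training, but it never looks at data instances, so it cannot tell that price and list-price differ by a fee.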
  • Slide 20
  • Semi-automated schema matching (2): Learning-based solutions
      • Rules/mappings learned from attribute specifications and statistics of data content (Rahm & Bernstein: instance-level matching)
      • Exploit schema information and data
      • Some approaches: external evidence
          • Past matches
          • Corpus of schemas and matches (matchings in real-estate applications will tend to be alike)
          • Corpus of users (more details later in this slide set)
      + can exploit data instances effectively
      + can exploit previous matching efforts
      - relatively expensive
      - require training
      - slower (operate on data)
      - results may be opaque (e.g., neural network output); hence explanation components! (more details later)
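To show how data instances, rather than names, can serve as evidence, here is a deliberately simple instance-level sketch. It is not a learned model: it swaps in plain Jaccard overlap of column values for the statistics a learning-based matcher would be trained on. The function names, the threshold of 0.5, and the sample values are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of values, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def instance_match(s1_columns, s2_columns, threshold=0.5):
    """For each S1 column, propose the S2 column with the most similar values."""
    matches = []
    for n1, v1 in s1_columns.items():
        best_name, best_score = None, 0.0
        for n2, v2 in s2_columns.items():
            score = jaccard(v1, v2)
            if score > best_score:
                best_name, best_score = n2, score
        if best_name is not None and best_score >= threshold:
            matches.append((n1, best_name, best_score))
    return matches


# Made-up instance data for two small schemas.
source = {"location": ["Leuven", "Gent", "Brussel"], "name": ["Ann", "Bob"]}
target = {"area": ["Leuven", "Gent", "Antwerpen"], "agent-name": ["Bob", "Ann", "Cleo"]}

matches = instance_match(source, target)
for m in matches:
    print(m)
```

Unlike the name-based rules, this pairs location with area without any synonym table, at the cost of having to read the data.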
  • Slide 21
  • Agenda
      • The match problem & what info to use for matching
      • (Semi-)automated matching: Example CUPID
      • (Semi-)automated matching: Example iMAP
      • (Automated) matching of LOD and LOD ontologies
      • Evaluating matching
      • Involving the user: Explanations; mass collaboration
  • Slide 22
  • Overview (1)
      • Rule-based approach
      • Schema types:
          • Relational, XML
      • M
