Chapter 5: Schema Matching and Mapping

Principles of Data Integration. AnHai Doan, Alon Halevy, Zachary Ives.

  • PRINCIPLES OF DATA INTEGRATION

  • Introduction
    - We have described
      - formalisms to specify source descriptions
      - algorithms that use these descriptions to reformulate queries
    - How to create the source descriptions?
      - often begin by creating semantic matches
        - name = title, location = concat(city, state, zipcode)
      - then elaborate matches into semantic mappings
        - e.g., structured queries in a language such as SQL
    - Schema matching and mapping are often quite difficult
    - This chapter describes matching and mapping tools that can significantly reduce the time it takes for the developer to create matches and mappings

  • Outline
    - Problem definition, challenges, and overview
    - Schema matching
      - Matchers
      - Combining match predictions
      - Enforcing domain integrity constraints
      - Match selector
      - Reusing previous matches
      - Many-to-many matches
    - Schema mapping

  • Semantic Mappings
    - Let S and T be two relational schemas
      - refer to the attributes and tables of S and T as their elements
    - A semantic mapping is a query expression that relates a schema S with a schema T
      - the following mapping shows how to obtain Movies.title
        SELECT name AS title FROM Items

  • Semantic Mappings
    - More examples of semantic mappings
      - the following mapping shows how to obtain Items.price
        SELECT (basePrice * (1 + taxRate)) AS price
        FROM Products, Locations
        WHERE Products.saleLocID = Locations.lid
      - the following mapping shows how to obtain an entire tuple for the Items table of AGGREGATOR
        SELECT title AS name, releaseDate AS releaseInfo, rating AS classification, basePrice * (1 + taxRate) AS price
        FROM Movies, Products, Locations
        WHERE Movies.id = Products.mid AND Products.saleLocID = Locations.lid
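
To make the Items.price mapping concrete, the sketch below runs it with Python's built-in sqlite3 module. The table layouts are reduced to the columns the mapping touches, and every row is made-up illustration data, not taken from the slides.

```python
import sqlite3

# The mapping for Items.price, copied from the slide above.
PRICE_MAPPING = """
    SELECT (basePrice * (1 + taxRate)) AS price
    FROM Products, Locations
    WHERE Products.saleLocID = Locations.lid
"""


def run_price_mapping():
    """Run the mapping over a tiny, hypothetical DVD-VENDOR instance."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed, simplified tables: only the columns used by the mapping.
        conn.executescript("""
            CREATE TABLE Products  (mid INTEGER, basePrice REAL, saleLocID INTEGER);
            CREATE TABLE Locations (lid INTEGER, taxRate REAL);
            INSERT INTO Products  VALUES (1, 10.0, 100), (2, 20.0, 200);
            INSERT INTO Locations VALUES (100, 0.25), (200, 0.0);
        """)
        # Each result row is one price value for the Items table.
        return sorted(row[0] for row in conn.execute(PRICE_MAPPING))
    finally:
        conn.close()


print(run_price_mapping())  # [12.5, 20.0]
```

The join on saleLocID is what lets each product pick up the tax rate of its own sale location; the match price = basePrice * (1 + taxRate) alone does not say where taxRate comes from.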

  • Example of the Need to Create Semantic Mappings for DI Systems
    - Consider building a DI system
      - over two sources, with schemas DVD-VENDOR and BOOK-VENDOR
      - assume the mediated schema is AGGREGATOR
    - If we use the Global-as-View (GAV) approach to relate schemas
      - must describe Items in AGGREGATOR as a query over the sources
      - to do this, create semantic mappings m1 and m2 that specify how to obtain tuples of Items from DVD-VENDOR and BOOK-VENDOR, respectively, then return the semantic mapping (m1 UNION m2) as the GAV description of the Items table

  • Example of the Need to Create Semantic Mappings for DI Systems
    - If we use the Local-as-View (LAV) approach to relate schemas
      - for each table in DVD-VENDOR and BOOK-VENDOR, must create a semantic mapping that specifies how to obtain tuples for that table from schema AGGREGATOR (i.e., from table Items)
    - If we use the GLAV approach
      - there are semantic mappings going in both directions

  • Semantic Matches
    - A semantic match relates a set of elements in a schema S to a set of elements in schema T
      - without specifying in detail (to the level of SQL queries) the exact nature of the relationship (as in semantic mappings)
    - One-to-one matches
      - Movies.title = Items.name
      - Products.rating = Items.classification
    - One-to-many matches
      - Items.price = Products.basePrice * (1 + Locations.taxRate)
    - Other types of matches
      - many-to-one, many-to-many

  • Relationship between Schema Matching and Mapping
    - To create source descriptions
      - often start by creating semantic matches
      - then elaborate matches into mappings
    - Why start with semantic matches?
      - they are often easier to elicit from designers
      - e.g., can specify price = basePrice * (1 + taxRate) from domain knowledge
    - Why the need to elaborate matches into mappings?
      - matches often specify functional relationships
      - but they cannot be used to obtain data instances
      - need SQL queries, that is, mappings, for that purpose
      - so matches need to be elaborated into mappings

  • Relationship between Schema Matching and Mapping
    - Example: elaborate the match price = basePrice * (1 + taxRate) into the mapping
        SELECT (basePrice * (1 + taxRate)) AS price
        FROM Products, Locations
        WHERE Products.saleLocID = Locations.lid
    - Another reason for starting with matches
      - break the long process in the middle
      - allow the designer to verify and correct the matches
      - thus reducing the complexity of the overall process

  • Challenges of Schema Matching and Mapping
    - Matching and mapping systems must reconcile semantic heterogeneity between the schemas
    - Such semantic heterogeneity arises in many ways
      - same concept, but different names for tables and attributes
        - rating vs classification
      - multiple attributes in one schema relate to one attribute in the other
        - basePrice and taxRate relate to price
      - tabular organization of schemas can be quite different
        - one table in AGGREGATOR vs three tables in DVD-VENDOR
      - coverage and level of detail can also differ significantly
        - DVD-VENDOR also models releaseDate and releaseCompany

  • Challenges of Schema Matching and Mapping
    - Why do we have semantic heterogeneity?
      - schemas are created by different people whose tastes and styles are different
      - disparate databases are rarely created for exactly the same purposes
    - Why reconciling semantic heterogeneity is hard
      - the semantics is not fully captured in the schemas
      - schema clues can be unreliable
      - intended semantics can be subjective
      - correctly combining the data is difficult
    - Standards are not a solution!
      - they work only for limited use cases where the number of attributes is small and there is a strong incentive to agree on them

  • Overview of Matching Systems
    - For now we consider only 1-1 matching systems
      - will discuss finding complex matches later
    - Key observation: need multiple heuristics / types of information to maximize matching accuracy
      - e.g., by matching the names, can infer that releaseInfo = releaseDate or releaseInfo = releaseCompany, but do not know which one
      - by matching the data values, can infer that releaseInfo = releaseDate or releaseInfo = year, but do not know which one
      - by combining both, can infer that releaseInfo = releaseDate

  • Another Example of the Need to Exploit Multiple Types of Information

    realestate.com
    listed-price | contact-name | contact-phone  | office         | comments
    $250K        | James Smith  | (305) 729 0831 | (305) 616 1822 | Fantastic house
    $320K        | Mike Doan    | (617) 253 1429 | (617) 112 2315 | Great location

    homes.com
    sold-at | contact-agent  | extra-info
    $350K   | (206) 634 9435 | Beautiful yard
    $230K   | (617) 335 4243 | Close to Seattle

    - If use only names
      - contact-agent matches either contact-name or contact-phone
    - If use only data values
      - contact-agent matches either contact-phone or office
    - If use both names and data values
      - contact-agent matches contact-phone
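
The reasoning in this example can be sketched as a set intersection: each kind of evidence proposes a set of candidates, and only candidates supported by both survive. This is a deliberately simplified illustration; the function name is hypothetical, and a real matching system would combine numeric similarity scores and not just hard candidate sets.

```python
def combine_candidates(by_name, by_value):
    """Keep only the candidates that both kinds of evidence support."""
    return by_name & by_value


# Candidate matches for homes.com's contact-agent, as listed in the slide.
by_name = {"contact-name", "contact-phone"}
by_value = {"contact-phone", "office"}

print(combine_candidates(by_name, by_value))  # {'contact-phone'}
```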

  • Matching System Architecture

  • Overview of Mapping Systems
    - Input: matches, output: actual mappings
    - Key challenge: find how tuples from one source can be transformed and combined to produce tuples in the other
      - which data transformation to apply?
      - which joins to take?
      - and many more possible decisions

  • Outline
    - Problem definition, challenges, and overview
    - Schema matching
      - Matchers
      - Combining match predictions
      - Enforcing domain integrity constraints
      - Match selector
      - Reusing previous matches
      - Many-to-many matches
    - Schema mapping

  • Matchers: schemas → similarity matrix
    - Input: two schemas S and T, plus any possibly helpful auxiliary information (e.g., data instances, text descriptions)
    - Output: a similarity matrix that assigns to each element pair of S and T a number in [0,1] predicting whether the pair matches
    - Numerous matchers have been proposed
    - We describe a few, in two classes: name matchers and data matchers

  • Name-Based Matchers
    - Use string matching techniques
      - e.g., edit distance, Jaccard, Soundex, etc.
    - Often have to pre-process names
      - split them using certain delimiters
        - e.g., saleLocID → sale, Loc, ID
      - expand known abbreviations or acronyms
        - loc → location, cust → customer
      - expand a string with synonyms / hypernyms
        - add cost to price, expand product into book, dvd, cd
      - remove stop words
        - in, at, and
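
A minimal sketch of a name-based matcher along these lines, using only the standard library. It splits names on camelCase boundaries and delimiters, expands the two abbreviations from the slide, removes the listed stop words, and scores pairs with the Jaccard measure over the resulting token sets. The function names and the choice of token-level Jaccard are assumptions made for illustration; synonym expansion is left out.

```python
import re

# Abbreviations and stop words taken from the slide; a real system
# would use much larger, domain-specific lists.
ABBREVIATIONS = {"loc": "location", "cust": "customer"}
STOP_WORDS = {"in", "at", "and"}


def tokenize(name):
    """Split a name such as saleLocID into normalized tokens."""
    # Acronym runs (ID), capitalized or lowercase words (sale, Loc), digits.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    tokens = (ABBREVIATIONS.get(p.lower(), p.lower()) for p in parts)
    return {t for t in tokens if t not in STOP_WORDS}


def name_similarity(a, b):
    """Jaccard measure over the token sets of two element names."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similarity_matrix(s_names, t_names):
    """Score every pair of elements; all values lie in [0, 1]."""
    return {s: {t: name_similarity(s, t) for t in t_names} for s in s_names}


matrix = similarity_matrix(
    ["releaseInfo"], ["releaseDate", "releaseCompany", "saleLocID"]
)
for target, score in matrix["releaseInfo"].items():
    print(target, round(score, 2))
```

On these inputs releaseDate and releaseCompany receive the same score, so names alone cannot decide between them.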

  • Example

  • Instance-Based Matchers
    - When schemas come with data instances, these can be extremely helpful in deciding matches
    - Many instance-based matchers have been proposed
    - Some of the most popular
      - recognizers
        - use dictionaries, regexes, or simple rules
      - overlap matchers
        - examine the overlap of values among attributes
      - classifiers
        - use learning techniques

  • Building Recognizers
    - Use dictionaries, regexes, or rules to recognize data values of certain kinds of attributes
    - Example attributes for which recognizers are well suited
      - country names, city names, US states
      - person names (can use dictionaries of last and first names)
      - color, rating (e.g., G, PG, PG-13, etc.), phone, fax, social security number
      - genes, proteins, zip codes
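
As a rough illustration, the sketch below builds two recognizers: a dictionary-based one for movie ratings and a regex-based one for US phone numbers in the "(305) 729 0831" style. The rating set beyond G, PG, and PG-13, the phone pattern, and the scoring rule (fraction of recognized values) are assumptions chosen for the example.

```python
import re

# Dictionary recognizer: G, PG, PG-13 come from the slide; R and NC-17
# are added here as an assumption about what "etc." covers.
RATINGS = {"G", "PG", "PG-13", "R", "NC-17"}

# Regex recognizer for phone numbers written like "(305) 729 0831".
PHONE = re.compile(r"\(\d{3}\) ?\d{3}[ -]\d{4}")


def is_rating(value):
    return value.strip().upper() in RATINGS


def is_phone(value):
    return PHONE.fullmatch(value.strip()) is not None


def recognizer_score(values, recognizes):
    """Fraction of an attribute's values that the recognizer accepts."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if recognizes(v)) / len(values)


phones = ["(305) 729 0831", "(617) 253 1429"]
comments = ["Fantastic house", "Great location"]
print(recognizer_score(phones, is_phone))    # 1.0
print(recognizer_score(comments, is_phone))  # 0.0
```

The score can be used directly as the similarity between an attribute and the kind of value the recognizer targets.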

  • Measuring the Overlap of Values
    - Typically applies to attributes whose values are drawn from some finite domain
      - e.g., movie ratings, movie titles, book titles, country names
    - Jaccard measure is commonly used
    - Example: use Jaccard measure to build a data-based matcher between DVD-VENDOR and AGGREGATOR
      - AGGREGATOR.name refers to DVD titles, DVD-VENDOR.name refers to sale locations, DVD-VENDOR.title refers to DVD titles
      - low score for (name, name), high score for (name, title)
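
The Jaccard measure of two value sets A and B is |A ∩ B| / |A ∪ B|. The sketch below applies it to the example above; the attribute values are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard measure |A ∩ B| / |A ∪ B| of two collections of values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical data instances for the three attributes in the example.
aggregator_name = {"Alien", "Heat", "Fargo"}      # DVD titles
dvd_vendor_name = {"Seattle", "Madison"}          # sale locations
dvd_vendor_title = {"Alien", "Heat", "Vertigo"}   # DVD titles

print(jaccard(aggregator_name, dvd_vendor_name))   # 0.0
print(jaccard(aggregator_name, dvd_vendor_title))  # 0.5
```

Although AGGREGATOR.name and DVD-VENDOR.name share a name, their values do not overlap, so the data-based score correctly favors (name, title).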

  • Using Classifiers
    - Build classifiers on one schema and use them to classify the elements of the other schema
      - e.g., use Naive Bayes, decision tree, rule learning, SVM
    - A common strategy
      - for each element si of schema S, want to train classifier Ci to recognize instances of si
      - to do this, need positive and negative training examples
        - take all data instances of si (that are available) to be positive examples
        - take all data instances of other elements of S to be negative examples
      - train Ci on the positive and negative examples
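
The training step can be sketched with a small hand-written Naive Bayes classifier over word tokens. The class name, the tokenization, and the add-one smoothing are assumptions made for this illustration; the training values are borrowed from the realestate.com example, with comments playing the role of si and contact-name the role of the other elements of S.

```python
import math
import re
from collections import Counter


def tokens(value):
    """Lowercased word and digit tokens of a data value."""
    return re.findall(r"[a-z]+|\d+", value.lower())


class ElementClassifier:
    """Binary Naive Bayes: is a data value an instance of element s_i?"""

    def fit(self, positives, negatives):
        self.docs = {True: len(positives), False: len(negatives)}
        self.counts = {True: Counter(), False: Counter()}
        for label, examples in ((True, positives), (False, negatives)):
            for value in examples:
                self.counts[label].update(tokens(value))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def confidence(self, value):
        """Confidence in [0, 1] that the value is an instance of s_i."""
        total_docs = self.docs[True] + self.docs[False]
        log_post = {}
        for label in (True, False):
            total = sum(self.counts[label].values())
            # Class prior and token likelihoods, both with add-one smoothing.
            lp = math.log((self.docs[label] + 1) / (total_docs + 2))
            for tok in tokens(value):
                lp += math.log(
                    (self.counts[label][tok] + 1) / (total + len(self.vocab) + 1)
                )
            log_post[label] = lp
        # Normalize the two log scores into a probability.
        top = max(log_post.values())
        pos = math.exp(log_post[True] - top)
        neg = math.exp(log_post[False] - top)
        return pos / (pos + neg)


# Positive examples: instances of s_i; negative: instances of other elements.
clf = ElementClassifier().fit(
    positives=["Fantastic house", "Great location"],
    negatives=["James Smith", "Mike Doan"],
)
print(round(clf.confidence("Great house"), 2))  # 0.8
print(round(clf.confidence("Mike Smith"), 2))   # 0.2
```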

  • Using Classifiers
    - A common strategy (cont.)
      - now we can use Ci to compute the sim score between si and each element tj of schema T
      - to do this, apply Ci to data instances of tj
        - for each instance, Ci produces a number in [0,1] that is the confidence that the instance is indeed an instance of si
      - now need to aggregate the confidence scores of the instances (of tj) to return