Schema & Ontology Matching: Current Research Directions

DESCRIPTION

Schema & Ontology Matching: Current Research Directions. AnHai Doan, Database and Information System Group, University of Illinois, Urbana-Champaign, Spring 2004. Road Map. Schema Matching: motivation & problem definition; representative current solutions: LSD, iMAP, Clio; broader picture.

Transcript

  • Schema & Ontology Matching: Current Research Directions
    AnHai Doan
    Database and Information System Group
    University of Illinois, Urbana-Champaign

    Spring 2004

  • Road Map
    - Schema Matching
        - motivation & problem definition
        - representative current solutions: LSD, iMAP, Clio
        - broader picture
    - Ontology Matching
        - motivation & problem definition
        - representative current solution: GLUE
        - broader picture
    - Conclusions & Emerging Directions

  • Motivation: Data Integration
    - User: new faculty member
    - Query: Find houses with 2 bedrooms priced under 200K
    - Sources: homes.com, realestate.com, homeseekers.com

  • Architecture of Data Integration System
    - Query: Find houses with 2 bedrooms priced under 200K
    - mediated schema
    - Sources: homes.com, realestate.com, homeseekers.com
    - Each source has its own schema: source schema 1, source schema 2, source schema 3

  • Semantic Matches between Schemas
    - Mediated schema: price, agent-name, address
    - homes.com schema: listed-price, contact-name, city, state
    - homes.com data:
        320K  Jane Brown  Seattle  WA
        240K  Mike Smith  Miami    FL
    - Match types shown: 1-1 match, complex match
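
The difference between the two match types on this slide can be sketched in a few lines of Python. This is only an illustration: the pairing of price with listed-price and agent-name with contact-name as 1-1 matches, and the rule that address is city and state joined by a comma, are assumptions read off the example data. The function name `to_mediated` is hypothetical.

```python
# Hypothetical illustration of 1-1 and complex matches on the homes.com rows.
# Assumption: address is formed as "<city>, <state>"; the slide does not give the formula.

def to_mediated(row):
    """Translate one homes.com row into the mediated schema."""
    return {
        "price": row["listed-price"],                  # assumed 1-1 match
        "agent-name": row["contact-name"],             # assumed 1-1 match
        "address": f'{row["city"]}, {row["state"]}',   # assumed complex match
    }

homes_rows = [
    {"listed-price": "320K", "contact-name": "Jane Brown", "city": "Seattle", "state": "WA"},
    {"listed-price": "240K", "contact-name": "Mike Smith", "city": "Miami", "state": "FL"},
]

mediated_rows = [to_mediated(r) for r in homes_rows]
```

A 1-1 match copies one source element to one mediated element, while a complex match needs a function over several source elements.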

  • Schema Matching is Ubiquitous!
    - Fundamental problem in numerous applications
    - Databases
        - data integration
        - data translation
        - schema/view integration
        - data warehousing
        - semantic query processing
        - model management
        - peer data management
    - AI
        - knowledge bases, ontology merging, information gathering agents, ...
    - Web
        - e-commerce
        - marking up data using ontologies (e.g., on Semantic Web)

  • Why Schema Matching is Difficult
    - Schema & data never fully capture semantics!
        - not adequately documented
        - schema creator has retired to Florida!
    - Must rely on clues in schema & data
        - using names, structures, types, data values, etc.
    - Such clues can be unreliable
        - same names => different entities: area => location or square-feet
        - different names => same entity: area & address => location
    - Intended semantics can be subjective
        - house-style = house-description?
        - military applications require committees to decide!
    - Cannot be fully automated, needs user feedback!

  • Current State of Affairs
    - Finding semantic mappings is now a key bottleneck!
        - largely done by hand
        - labor intensive & error prone
        - data integration at GTE [Li&Clifton, 2000]: 40 databases, 27000 elements, estimated time: 12 years
    - Will only be exacerbated
        - data sharing becomes pervasive
        - translation of legacy data
    - Need semi-automatic approaches to scale up!
    - Many research projects in the past few years
        - Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, U Wisconsin, NCSU, UIUC, Washington, ...
        - AI: Stanford, Karlsruhe University, NEC Japan, ...

  • LSD: Learning Source Description
    - Developed at Univ of Washington 2000-2001
        - with Pedro Domingos and Alon Halevy
    - Designed for data integration settings
        - has been adapted to several other contexts
    - Desirable characteristics
        - learn from previous matching activities
        - exploit multiple types of information in schema and data
        - incorporate domain integrity constraints
        - handle user feedback
        - achieves high matching accuracy (66-97%) on real-world data

  • Schema Matching for Data Integration: the LSD Approach
    Suppose user wants to integrate 100 data sources
    1. User manually creates matches for a few sources, say 3
        - shows LSD these matches
    2. LSD learns from the matches
    3. LSD predicts matches for remaining 97 sources

  • Learning from the Manual Matches
    - Mediated schema: price, agent-name, agent-phone, office-phone, description
    - Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments
    - realestate.com data:
        $250K  James Smith  (305) 729 0831  (305) 616 1822  Fantastic house
        $320K  Mike Doan    (617) 253 1429  (617) 112 2315  Great location
    - Schema of homes.com: sold-at, contact-agent, extra-info
    - homes.com data:
        $350K  (206) 634 9435  Beautiful yard
        $230K  (617) 335 4243  Close to Seattle
    - Learned rules
        - If "fantastic" & "great" occur frequently in data instances => description
        - If "office" occurs in name => office-phone

  • Must Exploit Multiple Types of Information!
    - Mediated schema: price, agent-name, agent-phone, office-phone, description
    - Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments
    - realestate.com data:
        $250K  James Smith  (305) 729 0831  (305) 616 1822  Fantastic house
        $320K  Mike Doan    (617) 253 1429  (617) 112 2315  Great location
    - Schema of homes.com: sold-at, contact-agent, extra-info
    - homes.com data:
        $350K  (206) 634 9435  Beautiful yard
        $230K  (617) 335 4243  Close to Seattle
    - Learned rules
        - If "fantastic" & "great" occur frequently in data instances => description
        - If "office" occurs in name => office-phone

  • Multi-Strategy Learning
    - Use a set of base learners
        - each exploits well certain types of information
    - To match a schema element of a new source
        - apply base learners
        - combine their predictions using a meta-learner
    - Meta-learner
        - uses training sources to measure base learner accuracy
        - weighs each learner based on its accuracy

  • Base Learners
    - Name Learner
        - training: (location, address), (contact name, name)
        - matching: agent-name => (name,0.7), (phone,0.3)
    - Naive Bayes Learner
        - training: ("Seattle, WA", address), ("250K", price)
        - matching: "Kent, WA" => (address,0.8), (name,0.2)
    - Matching returns labels weighted by confidence score
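
As a rough sketch of how a base learner of the Naive Bayes kind could turn labeled instances into confidence-weighted labels, consider the toy classifier below. It is not LSD's implementation: the class name `TinyNaiveBayes`, the tokenizer, the add-one smoothing, and the two extra training instances ("Miami, FL" and "320K") are all assumptions made for illustration, so the scores it produces will not reproduce the numbers on the slide.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split a data instance into lowercase word and number tokens."""
    return re.findall(r"[a-z]+|\d+", text.lower())

class TinyNaiveBayes:
    """Toy multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of examples
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        """Return {label: confidence}, with confidences summing to 1."""
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}

nb = TinyNaiveBayes()
nb.train([
    ("Seattle, WA", "address"), ("Miami, FL", "address"),  # second instance is assumed
    ("250K", "price"), ("320K", "price"),                  # second instance is assumed
])
scores = nb.predict("Kent, WA")
```

Here "Kent, WA" shares the token "wa" with a training address, so address receives the higher confidence.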

  • The LSD Architecture
    - Training Phase
        - inputs: mediated schema, source schemas
        - training data for base learners
        - Base-Learner1 ... Base-Learnerk, producing Hypothesis1 ... Hypothesisk
        - Meta-Learner, producing weights for base learners
    - Matching Phase
        - Base-Learner1 ... Base-Learnerk
        - Meta-Learner
        - Prediction Combiner: predictions for instances, predictions for elements
        - Constraint Handler, with domain constraints
        - output: mappings

  • Training the Base Learners
    - Mediated schema: address, price, agent-name, agent-phone, office-phone, description
    - realestate.com schema: location, price, contact-name, contact-phone, office, comments
    - realestate.com data:
        Miami, FL   $250K  James Smith  (305) 729 0831  (305) 616 1822  Fantastic house
        Boston, MA  $320K  Mike Doan    (617) 253 1429  (617) 112 2315  Great location

  • Meta-Learner: Stacking [Wolpert 92, Ting&Witten 99]
    - Training
        - uses training data to learn weights
        - one for each (base-learner, mediated-schema element) pair
        - weight (Name-Learner, address) = 0.2
        - weight (Naive-Bayes, address) = 0.8
    - Matching: combine predictions of base learners
        - computes weighted average of base-learner confidence scores
        - Example for source element area (instances: Seattle, WA; Kent, WA; Bend, OR)
            - Name Learner: (address,0.4)
            - Naive Bayes: (address,0.9)
            - Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
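
The matching step on this slide is a weighted combination of base-learner confidence scores, which can be sketched as follows. The weights are taken as given from the slide. How stacking learns them from training data is not shown here, and the function name `combine` is hypothetical.

```python
def combine(predictions, weights):
    """Combine base-learner scores using per-(learner, label) weights.

    predictions: {learner: {label: confidence}}
    weights:     {(learner, label): weight}; missing pairs count as weight 0 (an assumption)
    """
    combined = {}
    for learner, scores in predictions.items():
        for label, conf in scores.items():
            w = weights.get((learner, label), 0.0)
            combined[label] = combined.get(label, 0.0) + w * conf
    return combined

# Numbers from the slide.
weights = {("Name-Learner", "address"): 0.2, ("Naive-Bayes", "address"): 0.8}
predictions = {
    "Name-Learner": {"address": 0.4},
    "Naive-Bayes": {"address": 0.9},
}
combined = combine(predictions, weights)  # address: 0.4*0.2 + 0.9*0.8
```

Because the weights are per label, a learner that is reliable for address can still be given little say for other mediated-schema elements.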

  • Applying the Learners
    - homes.com schema: area, sold-at, contact-agent, extra-info
    - Instances of area: Seattle, WA; Kent, WA; Bend, OR
    - Name Learner and Naive Bayes are applied to each instance, then the Meta-Learner gives per-instance predictions:
        (address,0.8), (description,0.2)
        (address,0.6), (description,0.4)
        (address,0.7), (description,0.3)
    - Prediction-Combiner gives one prediction per element:
        area: (address,0.7), (description,0.3)
        sold-at: (price,0.9), (agent-phone,0.1)
        contact-agent: (agent-phone,0.9), (description,0.1)
        extra-info: (address,0.6), (description,0.4)
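
One way to read the Prediction-Combiner step is as an average of the per-instance predictions for a source element. The slide does not state the formula, so averaging is an assumption here, although it does agree with the numbers shown for area. The function name `combine_instances` is hypothetical.

```python
def combine_instances(instance_predictions):
    """Average per-instance {label: confidence} dicts into one element-level dict.

    Assumption: the combiner is a plain average; the slide does not state the formula.
    """
    totals = {}
    for scores in instance_predictions:
        for label, conf in scores.items():
            totals[label] = totals.get(label, 0.0) + conf
    n = len(instance_predictions)
    return {label: total / n for label, total in totals.items()}

# Per-instance meta-learner outputs for the three instances of "area" on the slide.
area_instances = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]
area_prediction = combine_instances(area_instances)
```

With these inputs the average is 0.7 for address and 0.3 for description, matching the element-level prediction listed for area.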

  • Domain Constraints
    - Encode user knowledge about domain
    - Specified only once, by examining mediated schema
    - Examples
        - at most one source-schema element can match address
        - if a source-schema element matches house-id then it is a key
        - avg-value(price) > avg-value(num-baths)
    - Given a mapping combination, can verify if it satisfies a given constraint
        - area: address; sold-at: price; contact-agent: agent-phone; extra-info: address
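
Checking a mapping combination against the first example constraint can be sketched as a simple predicate. The function name `at_most_one` and the dictionary representation of a mapping combination are assumptions for illustration.

```python
from collections import Counter

def at_most_one(mapping, label):
    """True if at most one source-schema element is mapped to `label`."""
    return Counter(mapping.values())[label] <= 1

# The mapping combination shown on the slide.
mapping = {
    "area": "address",
    "sold-at": "price",
    "contact-agent": "agent-phone",
    "extra-info": "address",
}

violates_address_rule = not at_most_one(mapping, "address")
```

In this combination both area and extra-info map to address, so the constraint "at most one source-schema element can match address" is not satisfied.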

  • The Constraint Handler
    - Searches space of mapping combinations efficiently
    - Can handle arbitrary constraints
    - Also used to incorporate user feedback
        - sold-at does not match price
    - Predictions from Prediction Combiner
        area: (address,0.7), (description,0.3)
        sold-at: (price,0.9), (agent-phone,0.1)
        contact-agent: (agent-phone,0.9), (description,0.1)
        extra-info: (address,0.6), (description,0.4)
    - Domain Constraints
        - At most one element matches address
    - Scores of mapping combinations
        - area: address; sold-at: price; contact-agent: agent-phone; extra-info: address => 0.7 * 0.9 * 0.9 * 0.6 = 0.3402
        - area: address; sold-at: price; contact-agent: agent-phone; extra-info: description => 0.7 * 0.9 * 0.9 * 0.4 = 0.2268
        - lowest-scoring combination: 0.3 * 0.1 * 0.1 * 0.4 = 0.0012
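
The scores on this slide are products of the per-element confidences, and the handler keeps the best combination that satisfies the constraints. The sketch below enumerates all combinations by brute force. That is an assumption chosen for brevity, not the efficient search the slide refers to. The names `best_mapping`, `one_address`, and `not_price` are hypothetical.

```python
from itertools import product
from math import prod

# Element-level predictions from the slide.
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "sold-at": {"price": 0.9, "agent-phone": 0.1},
    "contact-agent": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}

def best_mapping(predictions, constraints):
    """Return the highest-scoring mapping combination that passes every constraint."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = prod(predictions[e][label] for e, label in mapping.items())
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score

def one_address(mapping):
    """Domain constraint: at most one element matches address."""
    return list(mapping.values()).count("address") <= 1

def not_price(mapping):
    """User feedback from the slide: sold-at does not match price."""
    return mapping["sold-at"] != "price"

best, score = best_mapping(predictions, [one_address])
best_fb, score_fb = best_mapping(predictions, [one_address, not_price])
```

Under the domain constraint alone, the 0.3402 combination is rejected because two elements map to address, which leaves the 0.2268 combination as the best. Treating user feedback as one more constraint rules out sold-at: price as well.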

  • The Current LSD System
    - Can also handle data in XML format
        - matches XML DTDs
    - Base learners
        - Naive Bayes [Duda&Hart-93, Domingos&Pazzani-97]
            - exploits frequencies of words & symbols
        - WHIRL Nearest-Neighbor Classifier [Cohen&Hirsh KDD-98]
            - employs information-retrieval similarity metric
        - Name Learner [SIGMOD-01]
            - matches elements based on their names
        - County-Name Recognizer [SIGMOD-01]
            - stores all U.S. county names
        - XML Learner [SIGMOD-01]
            - exploits hierarchical structure of XML data

  • Empirical Evaluation
    - Four domains
        - Real Estate I & II, Course Offerings, Faculty Listings
    - For each domain
        - created mediated schema & d