Schema matching for merging data feeds

HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching
Arnab Nandi (Univ of Michigan), Phil Bernstein (Microsoft Research)

Scenario

Search over structured data: commerce, entertainment

Data onboarding: merge an XML data feed from a 3rd party into a Microsoft data warehouse.

[Figure: scenario diagram. Labels: Users, query, results, Search engine + data warehouse, 3rd Party Feed (x4), Amazon.com]

High precision, high recall, minimal human involvement

Example Feed

3rd Party Movie Site (Foreign): Indiana Jones and The Kingdom of The Crystal Skull; 2008; Ever; 127; Action; Comedy; PG-13; http://www.indianajones.com/site/index.html; Harrison Ford

Warehouse: Movies (Host): 57590; Indiana Jones and the Kingdom of the Crystal Skull; 02:00; Action/Adventure; NR; http://www.indianajones.com/; Harrison Ford; Karen Allen

Schema Matching

From        To
Movie       MOVIE
Title       MOVIE_NAME
Runtime     RUNTIME
Category    GENRE*
            MPAARATING
Person      ACTOR*

Taxonomy Matching

From        To
Action      Action/Adventure
PG-13       NR
R           R

Various Problems

Badly normalized
Unit conversion
Formatting choices
In-band signaling
Arbitrary labels
Non-standard vocabulary / language
Zero documentation
Not enough instances

Unlike conventional matching

We have web search click data, for both the warehouse and the 3rd party website
The databases we are integrating (usually) have a presence on the web
Why not use click data as a feature for schema & taxonomy matching?

Outline

Scenario
Using Clicklogs
Core idea
Using Query Distributions
Example
System Architecture
Results

Core idea

If two (sets of) products are searched for by similar queries, then they are similar.

[Figure labels: Small laptop, Web Search, Clicklog]

[Figure labels: Small Laptops, Pro. Laptops, Warehouse, hardware, eee, Asus.com, eee ::: small laptops, Small laptop, X, Y, Z]

Query Distributions

[Figure: query distribution chart; axis: click count]

Mapping to Taxonomy

Map URL to product, which belongs to taxonomy

http://www.amazon.com/dp/B001JTA59C -> Shopping | Electronics | Netbooks

3rd party DB(provided to us)

Aggregating Query Distributions

[Figure labels: Small Laptops, Pro. Laptops, Warehouse, hardware, eee, Asus.com, eee ::: small laptops]

Aggregate URLs to categories

Aggregate queries for each URL to schema element / taxonomy term

Electronics | Electronics Features | Brands | Asus EEE: netbook, laptop, cheap laptop

Office Products | Office Machines | Netbooks: netbook

Generating Correspondences

Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.

Process

For each page (URL):
    Identify query distribution
    Identify category / schema element of that page

For each category / schema element C:
    Aggregate over pages in C to get query distribution

For each foreign category / schema element:
    Find host category / schema element with most similar query distribution
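
The three-step process above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names, the `(query, freq, url)` row layout, the URL-to-category dictionaries, and the stand-in `overlap` similarity (shared probability mass over identical queries) are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def aggregate(clicks, url_to_category):
    """Sum click counts per category, then normalize to a query distribution."""
    counts = defaultdict(Counter)
    for query, freq, url in clicks:
        category = url_to_category.get(url)
        if category is not None:  # skip pages outside this taxonomy
            counts[category][query] += freq
    return {
        category: {q: n / sum(qc.values()) for q, n in qc.items()}
        for category, qc in counts.items()
    }

def overlap(d1, d2):
    """Stand-in similarity (an assumption): mass shared by identical queries."""
    return sum(min(p, d2[q]) for q, p in d1.items() if q in d2)

def match(foreign, host, sim=overlap):
    """For each foreign category, pick the host category with the most similar distribution."""
    return {f: max(host, key=lambda h: sim(host[h], fd)) for f, fd in foreign.items()}

# Toy clicklog rows: (query, click count, clicked URL)
clicks = [
    ("laptop", 70, "http://searchengine.com/product/macbookpro"),
    ("laptop", 25, "http://searchengine.com/product/mininote"),
    ("laptop", 5, "http://asus.com/eeepc"),
    ("netbook", 5, "http://searchengine.com/product/macbookpro"),
    ("netbook", 20, "http://searchengine.com/product/mininote"),
    ("netbook", 15, "http://asus.com/eeepc"),
    ("cheap netbook", 5, "http://asus.com/eeepc"),
]
# Hypothetical URL-to-category lookups for the host and the foreign taxonomy
host_map = {
    "http://searchengine.com/product/macbookpro": "Warehouse: Professional Laptops",
    "http://searchengine.com/product/mininote": "Warehouse: Small Laptops",
}
foreign_map = {"http://asus.com/eeepc": "eee"}

result = match(aggregate(clicks, foreign_map), aggregate(clicks, host_map))
print(result)
```

Passing the similarity function as a parameter keeps the aggregation and matching steps independent of whichever distribution metric is chosen.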


Example: Taxonomy Matching

query          freq  url
laptop         70    http://searchengine.com/product/macbookpro
laptop         25    http://searchengine.com/product/mininote
laptop         5     http://asus.com/eeepc
netbook        5     http://searchengine.com/product/macbookpro
netbook        20    http://searchengine.com/product/mininote
netbook        15    http://asus.com/eeepc
cheap netbook  5     http://asus.com/eeepc

[Figure labels: Warehouse: Small Laptops, Warehouse: Professional Laptops, eee]

Example: Taxonomy Matching

Warehouse: Small Laptops: laptop: 25/45, netbook: 20/45
Warehouse: Professional Laptops: laptop: 70/75, netbook: 5/75
eee: laptop: 5/25, netbook: 15/25, cheap laptop: 5/25

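Distribution Similarity Metric

$$\sum_{(q_{host},\, q_{foreign})} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$

(sum over all qhost, qforeign combinations; MinFreq is the smaller of the two queries' relative frequencies)

small laptops vs eee, non-zero terms: laptop vs laptop, netbook vs netbook, laptop vs cheap laptop

1 x min(25/45, 5/25) + 1 x min(20/45, 15/25) + 0.5 x min(25/45, 5/25) = 0.20 + 0.44 + 0.10 = 0.74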
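
The metric can be checked with a short Python sketch. It assumes that Jaccard is computed over the sets of words in each query and that MinFreq is the minimum of the two relative frequencies; both readings are assumptions that reproduce the 0.74 in the worked example. The function and variable names are illustrative only.

```python
def jaccard(q1: str, q2: str) -> float:
    """Word-level Jaccard similarity between two query strings (assumed tokenization)."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def normalize(counts: dict) -> dict:
    """Turn raw click counts into relative frequencies."""
    total = sum(counts.values())
    return {q: c / total for q, c in counts.items()}

def similarity(host: dict, foreign: dict) -> float:
    """Sum Jaccard x MinFreq over all (host query, foreign query) combinations."""
    return sum(
        jaccard(qh, qf) * min(fh, ff)
        for qh, fh in host.items()
        for qf, ff in foreign.items()
    )

# Click counts from the example distributions
small_laptops = normalize({"laptop": 25, "netbook": 20})
pro_laptops = normalize({"laptop": 70, "netbook": 5})
eee = normalize({"laptop": 5, "netbook": 15, "cheap laptop": 5})

score_small = similarity(small_laptops, eee)
score_pro = similarity(pro_laptops, eee)
print(round(score_small, 2))  # 0.74
print(score_small > score_pro)  # True
```

Under these assumptions the foreign category eee scores higher against Small Laptops than against Professional Laptops, which is the correspondence the example arrives at.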

Example: Taxonomy Matching

Warehouse: Small Laptops vs eee: 0.74
Warehouse: Professional Laptops vs eee: 0.31

Advantages of Clicklogs

Resilient to language
Resilient to new domains, data, and features
As long as people query & click, we have data to learn from
Generates mappings previous methods can't:
Electronics | Electronics Features | Brands | Texas Instruments -> Office Products | Office Machines | Calculators
Software | Categories | Programming | Programming Languages | Visual Basic -> Software | Developer Tools

System Design



Experimenting with Click Logs

Commercial warehouse mapping, 258 products
from a 70,000 term Amazon.com taxonomy (613 in gold)
to a 6,000 term warehouse taxonomy (40 in gold)

Live.com (now Bing.com) search querylog
Amazon to warehouse mapping task, consecutively halving the clicklog size used
1.8 million clicks to Amazon.com product pages
Each product's query distribution averaged 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).

Summary of Results

90% precision / recall possible

Query distribution is a good similarity metric

Bigger clicklogs imply better recall

Technique isn't very sensitive to similarity metric

Precision / Recall

Commercial warehouse mapping, 258 products
from a 70K term Amazon.com taxonomy
to a 6,000 term warehouse taxonomy (613 categories used)
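
The slides do not spell out how precision and recall were computed, so the following sketch assumes the standard definitions over a gold mapping: precision is the fraction of predicted correspondences that are correct, and recall is the fraction of gold correspondences that were recovered. The item and category names are made up for illustration.

```python
def precision_recall(predicted: dict, gold: dict) -> tuple:
    """Score predicted item -> category correspondences against a gold mapping."""
    correct = sum(1 for item, cat in predicted.items() if gold.get(item) == cat)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold standard and matcher output
gold = {"p1": "A", "p2": "B", "p3": "C", "p4": "A"}
predicted = {"p1": "A", "p2": "B", "p3": "A"}  # p3 is wrong, p4 has no prediction

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Items with too little click data produce no prediction at all, which lowers recall without affecting precision.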


Match Quality

                      Amazon Products   Amazon Categories   Warehouse Products      Warehouse Categories
Amazon Products       257/258 correct   241/258 correct     189/258 correct (73%)   226/258 correct
Amazon Categories                       373/613 correct     204/400 correct         525/613 (85%)
Warehouse Products                                          392/400 correct         383/400 correct
Warehouse Categories                                                                40/40 correct

QDs are unique to entities
QDs are unique to aggregate classes
QDs of entities are closest to the distributions of their aggregate classes
QDs of similar aggregates are similar


Varying Clicklog Size

Successively decreased clicklog size by half

Recall decreases as clicklog size is decreased
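
One way to set up such a halving experiment is sketched below. The slides do not say how the smaller clicklogs were drawn; this sketch assumes uniform random sampling without replacement, with each log being a subset of the previous one. The function name and the `seed` parameter are hypothetical.

```python
import random

def halved_samples(clicks, rounds, seed=0):
    """Return the full clicklog followed by `rounds` successively halved samples."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    current = list(clicks)
    samples = [current]
    for _ in range(rounds):
        # Each sample is drawn from the previous one, so the logs are nested
        current = rng.sample(current, len(current) // 2)
        samples.append(current)
    return samples

samples = halved_samples(range(16), rounds=3)
sizes = [len(s) for s in samples]
print(sizes)  # [16, 8, 4, 2]
```

Running the matcher on each sample and recording recall would then show how recall changes with clicklog size.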


Comparing Query Distributions

$$\sum_{(q_{host},\, q_{foreign})} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$

(sum over all qhost, qforeign combinations)

Replace Jaccard with various phrase similarity metrics

Minimal difference, because most queries are short


Related + Future Work

Usage Based / Crowdsourcing
Usage-Based Schema Matching (ICDE 2008), Elmeleeg