HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research)

Scenario

  Search over structured data
    Commerce
    Entertainment

  Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.

Scenario

[Figure labels: Search engine + data warehouse; 3rd Party Feed; results; “Amazon.com”]

• High precision (irrespective of recall)
• Minimal human involvement

Example Feed

<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

Warehouse: Movies (Host) 3rd Party Movie Site (Foreign)

<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <RATING>NR</RATING>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>

Schema Matching

From        To
Movie       MOVIE
Title       MOVIE_NAME
Runtime     RUNTIME
Category    GENRE*
MPAA        RATING
Person      ACTOR*
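To make the correspondence table concrete, the following is a minimal sketch of how such a table could be applied to a feed record once it has been produced. It is not the HAMSTER implementation: the helper `to_host`, the reading of a trailing `*` as "numbered, repeating host column", and the shortened feed record are all assumptions made for illustration. The key `RunTime` uses the feed's own tag spelling, and values are copied unchanged (no unit conversion or taxonomy mapping).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example feed record (assumption: only mapped fields kept).
FEED = """<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <RunTime>127</RunTime>
  <Categories><Category>Action</Category><Category>Comedy</Category></Categories>
  <MPAA>PG-13</MPAA>
  <Persons><Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person></Persons>
</Movie>"""

# Foreign tag -> host tag, taken from the table; the Movie -> MOVIE row is
# handled by creating the MOVIE root element below.
CORRESPONDENCES = {
    "Title": "MOVIE_NAME",
    "RunTime": "RUNTIME",
    "Category": "GENRE*",
    "MPAA": "RATING",
    "Person": "ACTOR*",
}


def to_host(feed_xml, correspondences):
    """Rename mapped feed elements into a flat host record (hypothetical helper)."""
    src = ET.fromstring(feed_xml)
    out = ET.Element("MOVIE")
    seen = {}  # how many times each starred host column has been filled
    for elem in src.iter():
        target = correspondences.get(elem.tag)
        if target is None:
            continue  # unmapped elements (root, wrappers) are skipped
        if target.endswith("*"):
            base = target[:-1]
            seen[base] = seen.get(base, 0) + 1
            target = f"{base}{seen[base]}"  # GENRE1, GENRE2, ...
        ET.SubElement(out, target).text = elem.text
    return out


HOST = to_host(FEED, CORRESPONDENCES)
HOST_TAGS = [child.tag for child in HOST]
```

The sketch only renames elements; the value-level differences (for example `Action` versus `Action/Adventure`) are the separate taxonomy matching problem.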

Taxonomy Matching

From      To
Action    Action/Adventure
PG-13     NR

Various Problems

Badly normalized…

Unit conversion…

Formatting choices…

In-band signaling…

Arbitrary labels

Non-standard vocabulary / language

Zero documentation

Not enough instances

Unlike conventional matching…

  We have web search click data

  For both Warehouse & 3rd party website

  The databases we are integrating (usually) have a presence on the web

  Why not use click data as a feature for schema & taxonomy matching?

Outline

  Scenario

  Using Clicklogs
    Core idea
    Using Query Distributions
    Example
    System Architecture

  Results

Core idea

  “If two (sets of) products are searched for by similar queries, then they are similar”

[Figure: the web search query “small laptop” and its clicklog, linking Warehouse categories (Small Laptops, Pro. Laptops) and Asus.com categories (hardware, eee); resulting correspondence: eee ::: small laptops]

Query Distributions

[Bar chart: click count (axis 0 to 50) for the queries “small laptop”, “netbook”, “hp mini 1000”, “hp mini”]

Mapping to Taxonomy

  Map URL to product, which belongs to taxonomy

    http://www.amazon.com/dp/B001JTA59C
    Shopping | Electronics | Netbooks
    3rd party DB (provided to us)

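Aggregating Query Distributions

[Figure: query distributions (click count axis 0 to 60) aggregated for Warehouse categories (Small Laptops, Pro. Laptops) and Asus.com categories (hardware, eee); resulting correspondence: eee ::: small laptops]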

Aggregate URLs to categories

  Aggregate queries for each URL to schema element / taxonomy term

    Electronics | Electronics Features | Brands | Asus EEE: “netbook”, “laptop”, “cheap laptop”

    Office Products | Office Machines | Netbooks: “netbook”

Generating Correspondences

  Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.

  Process
    For each page (URL)
      Identify query distribution
      Identify category / schema element of that page
    For each category / schema element C
      Aggregate over pages in C to get query distribution
    For each foreign category / schema element
      Find host category / schema element with most similar query distribution
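The process above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the system's code: the clicklog rows are the ones from the taxonomy matching example in this deck, the `URL_TO_CATEGORY` lookup is hypothetical (in practice it would come from the product databases), and `overlap` is a deliberately simple stand-in similarity, not the metric the deck defines under "Distribution Similarity Metric".

```python
from collections import Counter, defaultdict

# (query, click count, url) rows from the deck's taxonomy matching example.
CLICKLOG = [
    ("laptop", 70, "http://searchengine.com/product/macbookpro"),
    ("laptop", 25, "http://searchengine.com/product/mininote"),
    ("laptop", 5, "http://asus.com/eeepc"),
    ("netbook", 5, "http://searchengine.com/product/macbookpro"),
    ("netbook", 20, "http://searchengine.com/product/mininote"),
    ("netbook", 15, "http://asus.com/eeepc"),
    ("cheap netbook", 5, "http://asus.com/eeepc"),
]

# Hypothetical URL -> (side, category) lookup.
URL_TO_CATEGORY = {
    "http://searchengine.com/product/macbookpro": ("host", "Professional Laptops"),
    "http://searchengine.com/product/mininote": ("host", "Small Laptops"),
    "http://asus.com/eeepc": ("foreign", "eee"),
}


def aggregate(clicklog, url_to_category):
    """Sum clicks per category, then normalize each category to a query distribution."""
    counts = defaultdict(Counter)
    for query, freq, url in clicklog:
        counts[url_to_category[url]][query] += freq
    dists = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        dists[key] = {q: c / total for q, c in counter.items()}
    return dists


def overlap(p, q):
    """Stand-in similarity: probability mass shared on identical queries."""
    return sum(min(p[k], q[k]) for k in p.keys() & q.keys())


def best_matches(dists, similarity=overlap):
    """For each foreign category, pick the host category with the most similar distribution."""
    host = {c: d for (side, c), d in dists.items() if side == "host"}
    foreign = {c: d for (side, c), d in dists.items() if side == "foreign"}
    return {f: max(host, key=lambda h: similarity(host[h], fd)) for f, fd in foreign.items()}


DISTS = aggregate(CLICKLOG, URL_TO_CATEGORY)
MATCHES = best_matches(DISTS)
```

Passing a different `similarity` function to `best_matches` swaps the metric without touching the aggregation step.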

Example: Taxonomy Matching

query          freq  url
laptop         70    http://searchengine.com/product/macbookpro
laptop         25    http://searchengine.com/product/mininote
laptop         5     http://asus.com/eeepc
netbook        5     http://searchengine.com/product/macbookpro
netbook        20    http://searchengine.com/product/mininote
netbook        15    http://asus.com/eeepc
cheap netbook  5     http://asus.com/eeepc

Warehouse: Small Laptops          “laptop”: 25/45, “netbook”: 20/45

Warehouse: Professional Laptops   “laptop”: 70/75, “netbook”: 5/75

Asus.com: eee                     “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25

Distribution Similarity Metric

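\[
\sum_{(q_{host},\, q_{foreign})} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})
\]

(sum over all q_host, q_foreign combinations)

“small laptops” vs “eee”:
  laptop vs laptop:        1 × min(25/45, 5/25) = 1 × (5/25)
  netbook vs netbook:      1 × min(20/45, 15/25) = 1 × (20/45)
  laptop vs cheap laptop:  0.5 × min(25/45, 5/25) = 0.5 × (5/25)

1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) = 0.74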
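The worked example can be checked with a short sketch. Two points are assumptions rather than statements from the deck: Jaccard is taken over the words of each query (which gives 0.5 for “laptop” vs “cheap laptop”, matching the example), and the “eee” shares for “laptop” (5/25) and “netbook” (15/25) are derived from the clicklog table. The function names are illustrative.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two query strings (assumed tokenization)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


def similarity(host, foreign):
    """Sum of Jaccard x MinFreq over all (host query, foreign query) combinations."""
    return sum(
        jaccard(qh, qf) * min(ph, pf)
        for qh, ph in host.items()
        for qf, pf in foreign.items()
    )


# Query distributions from the example (query -> share of clicks).
SMALL_LAPTOPS = {"laptop": 25 / 45, "netbook": 20 / 45}
PRO_LAPTOPS = {"laptop": 70 / 75, "netbook": 5 / 75}
EEE = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

SCORE_SMALL = similarity(SMALL_LAPTOPS, EEE)  # about 0.74
SCORE_PRO = similarity(PRO_LAPTOPS, EEE)
```

Under these assumptions “Small Laptops” scores higher than “Professional Laptops” against “eee”, so it would be chosen as the host category. Pairs with no words in common contribute zero, which is why only three terms appear in the hand calculation.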

Advantages of Clicklogs

  Resilient to language

  Resilient to new domains, data, and features
    As long as people query & click, we have data to learn from

  Generates mappings previous methods can’t
    Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
    Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools

System Design

Arnab Nandi∗
University of Michigan, Ann Arbor
arnab@umich.edu

Philip A. Bernstein
Microsoft Research
phil.bernstein@microsoft.com

ABSTRACT

We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distributions of keyword queries that cause click-throughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.

1. INTRODUCTION

In this paper, we address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The application is the use of structured data sources to enhance the results of keyword-based web search. For example, Google, Yahoo and Live Search all provide shopping listings for the query “digital camera” above their traditional web search results, presumably by augmenting their keyword index with structured shopping data. This requires gathering a wide variety of structured data sources into a data warehouse that is indexed by the search engine for keyword queries. These sources are typically provided by third-parties, though they might also be obtained from web sites using information extraction. The sources need to be integrated so that similar data in the warehouse is indexed in the same way by the search engine, thereby improving the relevance of the result of keyword queries. The first step of the integration process is to match incoming data source schemas to the warehouse schema.

For example, suppose we are integrating data sources that describe movies. Our data warehouse of integrated data has a column called Rating, which describes the suitability of the movie for certain audiences (e.g., G, PG-13, R). We need to integrate a new data source which has an XML tag <MPAA> that contains the rating. It is beneficial to map the tag <MPAA> to the column name Rating, so that instances of <MPAA> in the new source are recognized in our index as ratings. This enables the search engine to answer queries about the rating of movies that appear only in this new data source, such as a keyword query “rating Dark Knight”.

[Figure components: Integrator; Target Database (Warehouse) with Schema and Taxonomy; Source Database with Schema and Taxonomy; Query; Search Engine; Schema Correspondences; Taxonomy Correspondences; Updated Target (Warehouse)]

Figure 1: HAMSTER System Architecture

∗Work done while at Microsoft Research.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. VLDB ‘09, August 24-28, 2009, Lyon, France. Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

Some values in a data source are categorical. By “categorical,” we mean the values come from a controlled vocabulary and are organized into a taxonomy. For better indexing, we need to map the data source’s taxonomy to the data warehouse’s taxonomy. For example, product catalogs usually categorize each product within a taxonomy. An item “netbook” might have an attribute “class” in a new data source, whose value is the path computer ▷ portable ▷ economy ▷ small in the data source’s taxonomy. But the data warehouse may classify the item differently, such as laptop ▷ lightweight ▷ inexpensive. To do a good job of answering the query “netbook” over data feeds whose descriptions of netbooks do not contain the word “netbook,” we need to map the data feed’s class to the data warehouse’s class and recognize “netbook” as a term for the latter class.

Experimenting with Click Logs

  Commercial warehouse mapping, 258 products
    from a 70,000 term Amazon.com taxonomy (613 in gold)
    to a 6,000 term warehouse taxonomy (40 in gold)

  Live.com (now Bing.com) search querylog
    Amazon to warehouse mapping task, consecutively halving the clicklog size used
    1.8 million clicks to Amazon.com product pages
    Typically each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).

Summary of Results

  90% precision / recall possible

  Query distribution is a good similarity metric

  Bigger clicklogs imply better recall

  Technique isn't very sensitive to similarity metric

Precision / Recall

  Commercial warehouse mapping, 258 products
    from a 70K term Amazon.com taxonomy
    to a 6,000 term warehouse taxonomy (613 categories used)

[Figure: precision / recall plot (recall axis 0 to 1) comparing Instance-based, Query Distribution, Consensus, and Name-based matching]

Summary of Results

90% precision / recall possible

Match Quality

  QDs are unique to entities

  QDs are unique to aggregate classes

Columns: Amazon Products, Amazon Categories, Warehouse Products, Warehouse Categories

Amazon Products:       257/258 correct, 241/258 correct, 189/258 correct (73%), 226/258 correct
Amazon Categories:     373/613 correct, 204/400 correct, 525/613 correct (85%)
Warehouse Products:    392/400 correct, 383/400 correct
Warehouse Categories:  40/40 correct

  QDs of entities are closest to the distributions of their aggregate classes

  QDs of similar aggregates are similar

Summary of Results

Query distribution is a good similarity metric

Varying Clicklog Size

  Successively decreased clicklog size by half

  Recall decreases as clicklog size is decreased

[Figure: recall (axis 0 to 0.7) at each successively halved clicklog size]
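One way to produce successively halved clicklogs for such an experiment is sketched below. The deck does not say how the halving was done, so uniform random sampling without replacement, the fixed seed, and the generator name `halvings` are all assumptions made for illustration.

```python
import random


def halvings(clicklog, seed=0):
    """Yield the clicklog, then random halves of it, until nothing is left.

    Assumption: each smaller log is a uniform random subset of the previous one.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    sample = list(clicklog)
    while sample:
        yield sample
        sample = rng.sample(sample, len(sample) // 2)


# Example with a toy clicklog of 8 rows: sizes go 8, 4, 2, 1.
SIZES = [len(s) for s in halvings(range(8))]
```

Because each sample is drawn from the previous one, every smaller clicklog is a subset of the larger ones, so differences in recall between sizes come from losing clicks rather than from swapping in different ones.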
