HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research)

Scenario

• Search over structured data
  • Commerce
  • Entertainment
• Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.

[Diagram: users send queries to the search engine + data warehouse and receive results; several 3rd party feeds (example: “Amazon.com”) are merged into the warehouse.]

• High Precision (Irrespective of Recall)
• Minimal Human Involvement

Example Feed

3rd Party Movie Site (Foreign):

  <Movie>
    <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
    <Release Key="Yes">2008</Release>
    <Description>Ever…</Description>
    <RunTime>127</RunTime>
    <Categories>
      <Category>Action</Category>
      <Category>Comedy</Category>
    </Categories>
    <MPAA>PG-13</MPAA>
    <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
    <Persons>
      <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
    </Persons>
  </Movie>

Warehouse: Movies (Host):

  <MOVIE>
    <MOVIE_ID>57590</MOVIE_ID>
    <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
    <RUNTIME>02:00</RUNTIME>
    <GENRE1>Action/Adventure</GENRE1>
    <GENRE2/>
    <RATING>NR</RATING>
    <ADVISORY/>
    <URL>http://www.indianajones.com/</URL>
    <ACTOR1>Harrison Ford</ACTOR1>
    <ACTOR2>Karen Allen</ACTOR2>
  </MOVIE>


Schema Matching

From        To
Movie       MOVIE
Title       MOVIE_NAME
RunTime     RUNTIME
Category    GENRE*
MPAA        RATING
Person      ACTOR*

Taxonomy Matching

From        To
Action      Action/Adventure
PG-13       NR
R           R

Various Problems

• Badly normalized…
• Unit conversion…
• Formatting choices…
• In-band signaling…
• Arbitrary labels
• Non-standard vocabulary / language
• Zero documentation
• Not enough instances

Unlike conventional matching…

• We have web search click data
  • For both the warehouse & the 3rd party website
• The databases we are integrating (usually) have a presence on the web
• Why not use click data as a feature for schema & taxonomy matching?

Outline

• Scenario
• Using Clicklogs
  • Core idea
  • Using Query Distributions
  • Example
  • System Architecture
• Results


Core idea

• “If two (sets of) products are searched for by similar queries, then they are similar”

[Diagram: in the web search clicklog, the same query (“small laptop”) leads to clicks on products X, Y, and Z, which sit under the warehouse categories “Small Laptops” and “Pro. Laptops” and the Asus.com category “hardware eee”. Resulting correspondence: eee ::: small laptops]

Query Distributions

[Bar chart: click count (axis 0 to 50) per query; queries shown: “small laptop”, “netbook”, “hp mini 1000”, “hp mini”.]

Mapping to Taxonomy

• Map URL to product, which belongs to taxonomy
  • http://www.amazon.com/dp/B001JTA59C
  • Shopping | Electronics | Netbooks

3rd party DB (provided to us)

Aggregating Query Distributions

[Diagram: per-page query distributions (click-count histograms) are aggregated up to the warehouse categories “Small Laptops” and “Pro. Laptops” and the Asus.com category “hardware eee”. Resulting correspondence: eee ::: small laptops]

Aggregate URLs to categories

• Aggregate queries for each URL to schema element / taxonomy term
  • Electronics | Electronics Features | Brands | Asus EEE: “netbook”, “laptop”, “cheap laptop”
  • Office Products | Office Machines | Netbooks: “netbook”


Generating Correspondences

• Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.
• Process
  • For each page (URL)
    • Identify query distribution
    • Identify category / schema element of that page
  • For each category / schema element C
    • Aggregate over pages in C to get query distribution
  • For each foreign category / schema element
    • Find host category / schema element with most similar query distribution
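The process above can be sketched in a few lines of Python. This is a minimal illustration, not the HAMSTER implementation: the function names (`aggregate`, `shared_mass`, `best_host_matches`), the page-to-category dictionaries, and the toy clicklog rows are assumptions made for the example, and `shared_mass` is only a simple stand-in for whichever distribution similarity is plugged in.

```python
from collections import Counter, defaultdict


def aggregate(clicks, page_to_element):
    """Build one normalized query distribution per category / schema element.

    clicks: iterable of (query, click_count, url) rows.
    page_to_element: dict mapping a page URL to its element (assumed given).
    """
    counts = defaultdict(Counter)
    for query, freq, url in clicks:
        element = page_to_element.get(url)
        if element is not None:  # skip pages that belong to another source
            counts[element][query] += freq
    dists = {}
    for element, counter in counts.items():
        total = sum(counter.values())
        dists[element] = {q: c / total for q, c in counter.items()}
    return dists


def shared_mass(host_dist, foreign_dist):
    """Stand-in similarity (an assumption): mass on queries both sides share."""
    return sum(min(p, foreign_dist[q]) for q, p in host_dist.items() if q in foreign_dist)


def best_host_matches(host_dists, foreign_dists, similarity=shared_mass):
    """For each foreign element, return (score, host element) of the best host."""
    return {
        f_name: max((similarity(h_dist, f_dist), h_name) for h_name, h_dist in host_dists.items())
        for f_name, f_dist in foreign_dists.items()
    }


# Toy clicklog rows: (query, click count, clicked URL).
clicks = [
    ("laptop", 70, "http://searchengine.com/product/macbookpro"),
    ("laptop", 25, "http://searchengine.com/product/mininote"),
    ("laptop", 5, "http://asus.com/eeepc"),
    ("netbook", 5, "http://searchengine.com/product/macbookpro"),
    ("netbook", 20, "http://searchengine.com/product/mininote"),
    ("netbook", 15, "http://asus.com/eeepc"),
    ("cheap netbook", 5, "http://asus.com/eeepc"),
]
# Hypothetical page-to-category assignments for the host and foreign sources.
host_pages = {
    "http://searchengine.com/product/macbookpro": "Professional Laptops",
    "http://searchengine.com/product/mininote": "Small Laptops",
}
foreign_pages = {"http://asus.com/eeepc": "eee"}

matches = best_host_matches(aggregate(clicks, host_pages), aggregate(clicks, foreign_pages))
print(matches["eee"])
```

With these toy rows the foreign category "eee" is paired with the host category "Small Laptops". The `similarity` parameter is left open so that a different distribution comparison can be swapped in without touching the aggregation step.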




Example: Taxonomy Matching

query          freq  url
laptop         70    http://searchengine.com/product/macbookpro
laptop         25    http://searchengine.com/product/mininote
laptop         5     http://asus.com/eeepc
netbook        5     http://searchengine.com/product/macbookpro
netbook        20    http://searchengine.com/product/mininote
netbook        15    http://asus.com/eeepc
cheap netbook  5     http://asus.com/eeepc

Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45
Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75
eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25

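Distribution Similarity Metric

\[ \sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign}) \]

“small laptops” vs “eee”, nonzero terms: laptop vs laptop, netbook vs netbook, laptop vs cheap laptop

1 × min(25/45, 5/25) + 1 × min(20/45, 15/25) + 0.5 × min(25/45, 5/25) = 0.20 + 0.44 + 0.10 = 0.74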
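The worked example can be checked with a short Python sketch. It assumes word-level Jaccard (shared words over all words in the two queries), which reproduces the 0.5 for “laptop” vs “cheap laptop”, and takes MinFreq to be the smaller of the two normalized frequencies. The function names are hypothetical, and the “eee” frequencies for “laptop” (5/25) and “netbook” (15/25) are read off the clicklog rows for http://asus.com/eeepc.

```python
def jaccard(q1, q2):
    """Word-level Jaccard similarity between two query strings (assumed variant)."""
    a, b = set(q1.split()), set(q2.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def distribution_similarity(host, foreign):
    """Sum of Jaccard x MinFreq over all (host query, foreign query) pairs."""
    return sum(
        jaccard(qh, qf) * min(ph, pf)
        for qh, ph in host.items()
        for qf, pf in foreign.items()
    )


# Normalized query distributions from the example.
small_laptops = {"laptop": 25 / 45, "netbook": 20 / 45}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

score = distribution_similarity(small_laptops, eee)
print(round(score, 2))  # 0.74
```

Pairs with no words in common contribute zero, so only the three listed terms affect the sum.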



Similarity to eee:

• Warehouse: Small Laptops: 0.74
• Warehouse: Professional Laptops: 0.31

Advantages of Clicklogs

• Resilient to language
• Resilient to new domains, data, and features
  • As long as people query & click, we have data to learn from
• Generates mappings previous methods can’t
  • Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  • Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools

System Design

HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Arnab Nandi∗, University of Michigan, Ann Arbor
Philip A. Bernstein, Microsoft Research

ABSTRACT

We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distributions of keyword queries that cause click-throughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.

1. INTRODUCTION

In this paper, we address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The application is the use of structured data sources to enhance the results of keyword-based web search. For example, Google, Yahoo and Live Search all provide shopping listings for the query “digital camera” above their traditional web search results, presumably by augmenting their keyword index with structured shopping data. This requires gathering a wide variety of structured data sources into a data warehouse that is indexed by the search engine for keyword queries. These sources are typically provided by third-parties, though they might also be obtained from web sites using information extraction. The sources need to be integrated so that similar data in the warehouse is indexed in the same way by the search engine, thereby improving the relevance of the result of keyword queries. The first step of the integration process is to match incoming data source schemas to the warehouse schema.

For example, suppose we are integrating data sources that describe movies. Our data warehouse of integrated data has a column called Rating, which describes the suitability of the movie for certain audiences (e.g., G, PG-13, R). We need to integrate a new data source which has an XML tag <MPAA> that contains the rating. It is beneficial to map the tag <MPAA> to the column name Rating, so that instances of <MPAA> in the new source are recognized in our index as ratings. This enables the search engine to answer queries about the rating of movies that appear only in this new data source, such as a keyword query “rating Dark Knight”.

[Figure 1 labels: Source Database (Schema, Taxonomy); Target Database (Warehouse) (Schema, Taxonomy); Query Logs; Search Engine; HAMSTER; Schema Correspondences; Taxonomy Correspondences; Data Integrator; Updated Target (Warehouse)]

Figure 1: HAMSTER System Architecture

∗Work done while at Microsoft Research.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB ‘09, August 24-28, 2009, Lyon, France
Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

Some values in a data source are categorical. By “categorical,” we mean the values come from a controlled vocabulary and are organized into a taxonomy. For better indexing, we need to map the data source’s taxonomy to the data warehouse’s taxonomy. For example, product catalogs usually categorize each product within a taxonomy. An item “netbook” might have an attribute “class” in a new data source, whose value is the path computer ▷ portable ▷ economy ▷ small in the data source’s taxonomy. But the data warehouse may classify the item differently, such as laptop ▷ lightweight ▷ inexpensive. To do a good job of answering the query “netbook” over data feeds whose descriptions of netbooks do not contain the word “netbook,” we need to map the data feed’s class to the data warehouse’s class and recognize “netbook” as a term for the latter class.




Experimenting with Click Logs

• Commercial warehouse mapping, 258 products
  • from a 70,000 term Amazon.com taxonomy (613 in gold)
  • to a 6,000 term warehouse taxonomy (40 in gold)
• Live.com (now Bing.com) search querylog
  • Amazon to warehouse mapping task, consecutively halving the clicklog size used
  • 1.8 million clicks to Amazon.com product pages
  • Typically each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).

Summary of Results

• 90% precision / recall possible
• Query distribution is a good similarity metric
• Bigger clicklogs imply better recall
• Technique isn't very sensitive to similarity metric

Precision / Recall

• Commercial warehouse mapping, 258 products
  • from a 70K term Amazon.com taxonomy
  • to a 6,000 term warehouse taxonomy (613 categories used)

[Chart: Precision (0 to 1) vs. Recall (0 to 1) for four matchers: Instance-based, Query Distribution, Consensus, Name-based.]





Match Quality

• QDs are unique to entities
• QDs are unique to aggregate classes

                       Amazon Products   Amazon Categories   Warehouse Products      Warehouse Categories
Amazon Products        257/258 correct   241/258 correct     189/258 correct (73%)   226/258 correct
Amazon Categories                        373/613 correct     204/400 correct         525/613 (85%)
Warehouse Products                                           392/400 correct         383/400 correct
Warehouse Categories                                                                 40/40 correct

• QDs of entities are closest to the distributions of their aggregate classes
• QDs of similar aggregates are similar




Varying Clicklog Size

• Successively decreased clicklog size by half
• Recall decreases as clicklog size is decreased

[Chart: Precision (0.65 to 0.95) vs. Recall (0 to 0.7) for Items and Categories at decreasing clicklog sizes; point labels: Full Log, ½, ¼, 1/32.]
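One way to set up such an experiment is sketched below in Python. How the log was actually halved is not stated on the slide, so uniform random sampling of click records, the function name `halved_samples`, and the fixed seed are all assumptions of this sketch.

```python
import random


def halved_samples(clicks, rounds, seed=0):
    """Return [full log, 1/2, 1/4, ...], each a random half of the previous one.

    Assumption: each element of `clicks` is one click record, and halving
    means uniform sampling without replacement.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    current = list(clicks)
    samples = [current]
    for _ in range(rounds):
        current = rng.sample(current, len(current) // 2)
        samples.append(current)
    return samples


# Stand-in click records; a real log would hold (query, url) pairs.
clicks = list(range(64))
samples = halved_samples(clicks, rounds=5)
print([len(s) for s in samples])  # [64, 32, 16, 8, 4, 2]
```

Five halvings take the full log down to 1/32 of its size, the smallest log size labeled on the chart. Because each sample is drawn from the previous one, the smaller logs are nested subsets of the larger ones.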




Comparing Query Distributions

\[ \sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign}) \]

• Replace Jaccard with various phrase similarity metrics
• Minimal difference due to size of most queries

[Chart: Precision (80% to 100%) vs. Recall (0% to 100%), four series.]






Related + Future Work

• Usage Based / Crowdsourcing
  • Usage-Based Schema Matching (ICDE 2008). Elmeleegy, H.; Ouzzani, M.; Elmagarmid, A.
  • Matching schemas in online communities: A web 2.0 approach (ICDE 2008). R McCann, W Shen, AH Doan
• Web Scale Integration
  • Web-scale Data Integration: You can only afford to Pay As You Go (CIDR 2007). Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy
• “Mixed” methods
  • Ontology matching: A machine learning approach (Handbook on Ontologies 2004). A Doan, J Madhavan, P Domingos, A Halevy
  • Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003). A Doan, P Domingos, A Halevy
  • Schema and ontology matching with COMA++ (SIGMOD 2005). D Aumueller, HH Do, S Massmann, E Rahm

Conclusion

• Unsupervised mapping is possible
  • Very high recall / precision when enough queries are present
• Click logs are promising
  • Finds results that other methods cannot find
  • As clicklog size increases, it will produce more mappings
• Combinable with existing methods


http://arnab.org/contact http://research.microsoft.com/~philbe/

Questions?

