HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research)

DESCRIPTION

We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distributions of keyword queries that cause clickthroughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.


Page 1: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING

Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research)

Page 2: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Scenario

Arnab Nandi & Phil Bernstein


Page 3: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Scenario

• Search over structured data
  • Commerce
  • Entertainment

• Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.

Page 4: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Scenario

[Figure: Users send a query to, and receive results from, the search engine + data warehouse, which takes in multiple 3rd Party Feeds (e.g., “Amazon.com”).]

• High Precision (Irrespective of Recall)

• Minimal Human Involvement

Page 5: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Example Feed

3rd Party Movie Site (Foreign):

<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

Warehouse: Movies (Host):

<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <MPAA>NR</MPAA>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>

Page 6: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Schema Matching

3rd Party Movie Site (Foreign):

<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

Warehouse: Movies (Host):

<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <RATING>NR</RATING>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>

From (Foreign)    To (Host)
Movie             MOVIE
Title             MOVIE_NAME
RunTime           RUNTIME
Category          GENRE*
MPAA              RATING
Person            ACTOR*
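
As a rough illustration of what such correspondences are used for, the sketch below applies the table above to a shortened copy of the example feed. The `CORRESPONDENCES` dictionary, the `to_host_record` helper, and the rule that `GENRE*` and `ACTOR*` expand to numbered columns are assumptions made for this example, not part of HAMSTER. Values are copied unchanged; converting them (for instance run time formats or taxonomy terms) is a separate step.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the foreign feed from the slide.
FEED = """<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>"""

# Foreign tag -> (host column, repeated?). Hypothetical encoding of the table:
# repeated columns are numbered GENRE1, GENRE2, ... to mirror GENRE* / ACTOR*.
CORRESPONDENCES = {
    "Title": ("MOVIE_NAME", False),
    "RunTime": ("RUNTIME", False),
    "Category": ("GENRE", True),
    "MPAA": ("RATING", False),
    "Person": ("ACTOR", True),
}


def to_host_record(feed_xml):
    """Rename matched foreign elements to host column names."""
    record = {}
    counters = {}
    for element in ET.fromstring(feed_xml).iter():
        if element.tag not in CORRESPONDENCES:
            continue  # unmatched elements (e.g. the root) are skipped
        column, repeated = CORRESPONDENCES[element.tag]
        if repeated:
            counters[column] = counters.get(column, 0) + 1
            column = f"{column}{counters[column]}"
        record[column] = (element.text or "").strip()
    return record


host_record = to_host_record(FEED)
```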

Page 7: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Taxonomy Matching

3rd Party Movie Site (Foreign):

<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

Warehouse: Movies (Host):

<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <RATING>NR</RATING>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>

From (Foreign)    To (Host)
Action            Action/Adventure
PG-13             NR
R                 R

Page 8: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Various Problems

• Badly normalized…

• Unit conversion…

• Formatting choices…

• In-band signaling…

• Arbitrary labels

• Non-standard vocabulary / language

• Zero documentation

• Not enough instances

Page 9: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Unlike conventional matching…

• We have web search click data
  • For both Warehouse & 3rd party website

• The databases we are integrating (usually) have a presence on the web

• Why not use click data as a feature for schema & taxonomy matching?

[Figure: Users, query, results, Search engine + data warehouse, 3rd Party Feed]

Page 10: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Outline

• Scenario

• Using Clicklogs
  • Core idea
  • Using Query Distributions
  • Example
  • System Architecture

• Results

Page 11: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Core idea

• “If two (sets of) products are searched for by similar queries, then they are similar”

[Figure: Web Search for the query “Small laptop”]

Page 12: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Core idea

[Figure: Clicklog linking the query “Small laptop” to items X, Y, and Z in the Warehouse (Small Laptops, Pro. Laptops) and on Asus.com (hardware eee); match shown: eee ::: small laptops]

Page 13: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Query Distributions

[Figure: bar chart of click count (0 to 50) per query: “small laptop”, “netbook”, “hp mini 1000”, “hp mini”]

Page 14: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Mapping to Taxonomy

• Map URL to product, which belongs to taxonomy
  • http://www.amazon.com/dp/B001JTA59C
  • Shopping | Electronics | Netbooks

3rd party DB (provided to us)

Page 15: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Aggregating Query Distributions

[Figure: per-page query histograms (click counts, axes 0 to 60) aggregated into one histogram per category for the Warehouse (Small Laptops, Pro. Laptops) and Asus.com (hardware eee); match shown: eee ::: small laptops]

Page 16: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Aggregate URLs to categories

• Aggregate queries for each URL to schema element / taxonomy term
  • Electronics|Electronics Features|Brands|Asus EEE: “netbook”, “laptop”, “cheap laptop”
  • Office Products|Office Machines|Netbooks: “netbook”

Page 17: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Generating Correspondences

• Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.

• Process
  • For each page (URL)
    • Identify query distribution
    • Identify category / schema element of that page
  • For each category / schema element C
    • Aggregate over pages in C to get query distribution
  • For each foreign category / schema element
    • Find host category / schema element with most similar query distribution
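
The three process steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the `example.com` URLs, and the toy categories are invented for illustration, and `overlap` (histogram intersection over identical queries) is only a stand-in similarity, not the metric the deck defines.

```python
from collections import Counter, defaultdict


def aggregate(clicklog, category_of):
    """Steps 1-2: roll (query, clicks, url) rows up into one normalized
    query distribution per category / schema element."""
    totals = defaultdict(Counter)
    for query, clicks, url in clicklog:
        if url in category_of:  # pages without a known category are ignored
            totals[category_of[url]][query] += clicks
    return {
        category: {q: n / sum(counts.values()) for q, n in counts.items()}
        for category, counts in totals.items()
    }


def overlap(dist_a, dist_b):
    """Stand-in similarity: shared probability mass on identical queries."""
    return sum(min(p, dist_b[q]) for q, p in dist_a.items() if q in dist_b)


def match(foreign, host, similarity=overlap):
    """Step 3: for each foreign element, pick the most similar host element."""
    return {
        f_cat: max(host, key=lambda h_cat: similarity(host[h_cat], f_dist))
        for f_cat, f_dist in foreign.items()
    }


# Hypothetical toy data.
host = aggregate(
    [("tent", 8, "http://example.com/h/1"),
     ("camping tent", 2, "http://example.com/h/1"),
     ("kayak", 10, "http://example.com/h/2")],
    {"http://example.com/h/1": "Camping", "http://example.com/h/2": "Boating"},
)
foreign = aggregate(
    [("tent", 3, "http://example.com/f/9"), ("kayak", 1, "http://example.com/f/9")],
    {"http://example.com/f/9": "Outdoor Shelters"},
)
matches = match(foreign, host)
```

Passing the similarity function as a parameter keeps the aggregation and matching steps independent of whichever distribution metric is chosen.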

Page 18: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Outline

• Scenario

• Using Clicklogs
  • Core idea
  • Using Query Distributions
  • Example
  • System Architecture

• Results

Page 19: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Example: Taxonomy Matching

query            freq   url
laptop           70     http://searchengine.com/product/macbookpro
laptop           25     http://searchengine.com/product/mininote
laptop           5      http://asus.com/eeepc
netbook          5      http://searchengine.com/product/macbookpro
netbook          20     http://searchengine.com/product/mininote
netbook          15     http://asus.com/eeepc
cheap netbook    5      http://asus.com/eeepc

Categories: Warehouse: Small Laptops; Warehouse: Professional Laptops; eee
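
To make the arithmetic behind this table explicit, the sketch below sums the click counts per category and normalizes them with exact fractions. The page-to-category assignment in `CATEGORY_OF` is an assumption: the table itself does not state it, and it is chosen here so that the per-category totals come out as 75, 45, and 25 clicks.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Rows copied from the clicklog table: (query, freq, url).
CLICKLOG = [
    ("laptop", 70, "http://searchengine.com/product/macbookpro"),
    ("laptop", 25, "http://searchengine.com/product/mininote"),
    ("laptop", 5, "http://asus.com/eeepc"),
    ("netbook", 5, "http://searchengine.com/product/macbookpro"),
    ("netbook", 20, "http://searchengine.com/product/mininote"),
    ("netbook", 15, "http://asus.com/eeepc"),
    ("cheap netbook", 5, "http://asus.com/eeepc"),
]

# Assumed page-to-category assignment (not given in the table).
CATEGORY_OF = {
    "http://searchengine.com/product/macbookpro": "Warehouse: Professional Laptops",
    "http://searchengine.com/product/mininote": "Warehouse: Small Laptops",
    "http://asus.com/eeepc": "eee",
}

# Sum clicks per (category, query).
clicks = defaultdict(Counter)
for query, freq, url in CLICKLOG:
    clicks[CATEGORY_OF[url]][query] += freq

# Normalize each category's counts into a query distribution.
distributions = {
    category: {q: Fraction(n, sum(counts.values())) for q, n in counts.items()}
    for category, counts in clicks.items()
}
```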

Page 20: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Example: Taxonomy Matching

Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45

Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75

eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25

Page 21: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Distribution Similarity Metric

\[
\sum_{(q_{host},\, q_{foreign})} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})
\]

(sum over all q_host, q_foreign combinations)
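
A minimal Python sketch of this metric follows. Two readings are assumptions rather than statements from the slides: Jaccard is computed over the sets of words in each query, and MinFreq is taken to be the smaller of the two normalized query frequencies. The three distributions are the ones listed on the preceding example slide. Under this reading, Small Laptops vs. eee comes to about 0.74 and scores higher than Professional Laptops vs. eee.

```python
def jaccard(query_a, query_b):
    """Word-set Jaccard similarity between two query strings."""
    a, b = set(query_a.lower().split()), set(query_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def distribution_similarity(host, foreign):
    """Sum Jaccard x MinFreq over every (host query, foreign query) pair.
    MinFreq is assumed to be the smaller of the two frequencies."""
    return sum(
        jaccard(q_host, q_foreign) * min(f_host, f_foreign)
        for q_host, f_host in host.items()
        for q_foreign, f_foreign in foreign.items()
    )


small_laptops = {"laptop": 25 / 45, "netbook": 20 / 45}
professional_laptops = {"laptop": 70 / 75, "netbook": 5 / 75}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

score_small = distribution_similarity(small_laptops, eee)
score_professional = distribution_similarity(professional_laptops, eee)
```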

Page 22: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Example: Taxonomy Matching

“small laptops” vs “eee” (each term is Jaccard × MinFreq, where MinFreq is the smaller of the two query frequencies):

laptop vs laptop: 1 × (5/25)
netbook vs netbook: 1 × (20/45)
laptop vs cheap laptop: 0.5 × (5/25)

1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) = 0.74

Warehouse: Small Laptops (“laptop”: 25/45, “netbook”: 20/45) vs eee: 0.74

Warehouse: Professional Laptops (“laptop”: 70/75, “netbook”: 5/75) vs eee: 0.31

eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25

Page 23: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Advantages of Clicklogs

• Resilient to language

• Resilient to new domains, data, and features
  • As long as people query & click, we have data to learn from

• Generates mappings previous methods can’t
  • Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  • Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools

Page 24: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

System Design

HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Arnab Nandi∗, University of Michigan, Ann Arbor, [email protected]

Philip A. Bernstein, Microsoft Research, [email protected]

ABSTRACT

We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distributions of keyword queries that cause click-throughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.

1. INTRODUCTION

In this paper, we address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The application is the use of structured data sources to enhance the results of keyword-based web search. For example, Google, Yahoo and Live Search all provide shopping listings for the query “digital camera” above their traditional web search results, presumably by augmenting their keyword index with structured shopping data. This requires gathering a wide variety of structured data sources into a data warehouse that is indexed by the search engine for keyword queries. These sources are typically provided by third-parties, though they might also be obtained from web sites using information extraction. The sources need to be integrated so that similar data in the warehouse is indexed in the same way by the search engine, thereby improving the relevance of the result of keyword queries. The first step of the integration process is to match incoming data source schemas to the warehouse schema.

For example, suppose we are integrating data sources that describe movies. Our data warehouse of integrated data has a column called Rating, which describes the suitability of the movie for certain audiences (e.g., G, PG-13, R). We need to integrate a new data source which has an XML tag <MPAA> that contains the rating. It is beneficial to map the tag <MPAA> to the column name Rating, so that instances of <MPAA> in the new source are recognized in our index as ratings. This enables the search engine to answer queries about the rating of movies that appear only in this new data source, such as a keyword query “rating Dark Knight”.

[Figure labels: Data Integrator; Target Database (Warehouse); Source Database; Schema; Taxonomy; Query Logs; Search Engine; Schema Correspondences; Taxonomy Correspondences; Updated Target (Warehouse); HAMSTER]

Figure 1: HAMSTER System Architecture

∗Work done while at Microsoft Research.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. VLDB ‘09, August 24-28, 2009, Lyon, France. Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

Some values in a data source are categorical. By “categorical,” we mean the values come from a controlled vocabulary and are organized into a taxonomy. For better indexing, we need to map the data source’s taxonomy to the data warehouse’s taxonomy. For example, product catalogs usually categorize each product within a taxonomy. An item “netbook” might have an attribute “class” in a new data source, whose value is the path computer ▷ portable ▷ economy ▷ small in the data source’s taxonomy. But the data warehouse may classify the item differently, such as laptop ▷ lightweight ▷ inexpensive. To do a good job of answering the query “netbook” over data feeds whose descriptions of netbooks do not contain the word “netbook,” we need to map the data feed’s class to the data warehouse’s class and recognize “netbook” as a term for the latter class.

Page 25: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Outline

• Scenario

• Using Clicklogs
  • Core idea
  • Using Query Distributions
  • Example
  • System Architecture

• Results

Page 26: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Experimenting with Click Logs

• Commercial warehouse mapping, 258 products
  • from a 70,000 term Amazon.com taxonomy (613 in gold)
  • to a 6,000 term warehouse taxonomy (40 in gold)

• Live.com (now Bing.com) search querylog
  • Amazon to warehouse mapping task, consecutively halving the clicklog size used
  • 1.8 million clicks to Amazon.com product pages
  • Each product's query distribution averaged 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).
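
One way to produce the successively halved clicklogs is uniform random subsampling, sketched below. The slides do not say how the halving was done, so the sampling strategy, the function name `halved_clicklogs`, and the assumption that each list entry is a single click are all illustrative choices.

```python
import random


def halved_clicklogs(clicklog, rounds, seed=0):
    """Return [full log, 1/2, 1/4, ...]; each level is a random half of the
    previous one, so the subsets are nested."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    current = list(clicklog)
    levels = [current]
    for _ in range(rounds):
        current = rng.sample(current, len(current) // 2)
        levels.append(current)
    return levels


# Hypothetical log of 32 distinct click records; 5 halvings reach 1/32.
toy_log = [("query %d" % i, "http://example.com/item/%d" % i) for i in range(32)]
levels = halved_clicklogs(toy_log, rounds=5)
sizes = [len(level) for level in levels]
```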

Page 27: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Summary of Results

• 90% precision / recall possible

• Query distribution is a good similarity metric

• Bigger clicklogs imply better recall

• Technique isn't very sensitive to similarity metric

Page 28: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Precision / Recall

• Commercial warehouse mapping, 258 products
  • from a 70K term Amazon.com taxonomy
  • to a 6,000 term warehouse taxonomy (613 categories used)

[Figure: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1) for four matchers: Instance-based, Query Distribution, Consensus, Name-based]
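
For readers who want to reproduce a curve of this kind, the sketch below computes precision and recall of a proposed mapping against a gold mapping, and sweeps a score threshold to obtain several points. The threshold sweep is an assumption about how such a curve could be generated; the function names and the scored matches are hypothetical and are not results from the experiments.

```python
def precision_recall(predicted, gold):
    """predicted, gold: dicts mapping a foreign term to a host term."""
    correct = sum(1 for foreign, host in predicted.items() if gold.get(foreign) == host)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def sweep(scored_matches, gold, thresholds):
    """Keep only matches scoring at least t, for each threshold t."""
    points = []
    for t in thresholds:
        kept = {f: h for f, (h, score) in scored_matches.items() if score >= t}
        points.append((t, *precision_recall(kept, gold)))
    return points


# Hypothetical scored matches: foreign term -> (proposed host term, score).
scored = {
    "eee": ("Small Laptops", 0.74),
    "Texas Instruments": ("Calculators", 0.60),
    "Visual Basic": ("Games", 0.20),
}
gold = {
    "eee": "Small Laptops",
    "Texas Instruments": "Calculators",
    "Visual Basic": "Developer Tools",
}
curve = sweep(scored, gold, thresholds=[0.1, 0.5])
```

Raising the threshold drops the low-scoring wrong match, which raises precision while recall stays the same in this toy case.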

Page 29: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Summary of Results

• 90% precision / recall possible

• Query distribution is a good similarity metric

• Bigger clicklogs imply better recall

• Technique isn't very sensitive to similarity metric

Page 30: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Match Quality

• QDs (query distributions) are unique to entities

• QDs are unique to aggregate classes

                       Amazon Products    Amazon Categories   Warehouse Products      Warehouse Categories
Amazon Products        257/258 correct    241/258 correct     189/258 correct (73%)   226/258 correct
Amazon Categories                         373/613 correct     204/400 correct         525/613 (85%)
Warehouse Products                                            392/400 correct         383/400 correct
Warehouse Categories                                                                  40/40 correct

• QDs of entities are closest to the distributions of their aggregate classes

• QDs of similar aggregates are similar

Page 31: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Summary of Results

• 90% precision / recall possible

• Query distribution is a good similarity metric

• Bigger clicklogs imply better recall

• Technique isn't very sensitive to similarity metric

Page 32: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Varying Clicklog Size

• Successively decreased clicklog size by half

• Recall decreases as clicklog size is decreased

[Figure: precision (y-axis, 0.65 to 0.95) vs. recall (x-axis, 0 to 0.7) for Items and Categories, with points labeled 1/32, ¼, ½, and Full Log]

Page 33: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Summary of Results

• 90% precision / recall possible

• Query distribution is a good similarity metric

• Bigger clicklogs imply better recall

• Technique isn't very sensitive to similarity metric

Page 34: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Comparing Query Distributions

[Figure: precision (y-axis, 80% to 100%) vs. recall (x-axis, 0% to 100%) for four phrase similarity metrics]

\[
\sum_{(q_{host},\, q_{foreign})} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})
\]

(sum over all q_host, q_foreign combinations)

• Replace Jaccard with various phrase similarity metrics

• Minimal difference, due to the small size of most queries
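
Swapping the phrase similarity can be sketched by making it a parameter of the distribution metric. The alternatives below (Dice coefficient and exact match) are hypothetical stand-ins chosen for illustration; they are not necessarily the metrics plotted in the figure. MinFreq is again assumed to be the smaller of the two frequencies, and the distributions are the ones from the earlier taxonomy matching example.

```python
def words(query):
    return set(query.lower().split())


def jaccard(a, b):
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def dice(a, b):
    wa, wb = words(a), words(b)
    return 2 * len(wa & wb) / (len(wa) + len(wb)) if wa or wb else 0.0


def exact(a, b):
    return 1.0 if words(a) == words(b) else 0.0


def distribution_similarity(host, foreign, phrase_similarity):
    """Same sum as above, with the phrase similarity passed in."""
    return sum(
        phrase_similarity(qh, qf) * min(fh, ff)
        for qh, fh in host.items()
        for qf, ff in foreign.items()
    )


small_laptops = {"laptop": 25 / 45, "netbook": 20 / 45}
professional_laptops = {"laptop": 70 / 75, "netbook": 5 / 75}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

# Which host category wins under each phrase similarity.
winners = {
    metric.__name__: max(
        [("Small Laptops", small_laptops), ("Professional Laptops", professional_laptops)],
        key=lambda item: distribution_similarity(item[1], eee, metric),
    )[0]
    for metric in (jaccard, dice, exact)
}
```

With queries of one or two words, the three functions differ only on partially overlapping pairs such as “laptop” vs. “cheap laptop”, so the winning category is the same in this toy case.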

Page 35: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Summary of Results

• 90% precision / recall possible

• Query distribution is a good similarity metric

• Bigger clicklogs imply better recall

• Technique isn't very sensitive to similarity metric

Page 36: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Related + Future Work

• Usage Based / Crowdsourcing
  • Usage-Based Schema Matching (ICDE 2008). Elmeleegy, H.; Ouzzani, M.; Elmagarmid, A.
  • Matching schemas in online communities: A web 2.0 approach (ICDE 2008). R McCann, W Shen, AH Doan

• Web Scale Integration
  • Web-scale Data Integration: You can only afford to Pay As You Go (CIDR 2007). Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy

Page 37: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Related + Future Work

• “Mixed” methods
  • Ontology matching: A machine learning approach (Handbook on Ontologies 2004). A Doan, J Madhavan, P Domingos, A Halevy
  • Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003). A Doan, P Domingos, A Halevy
  • Schema and ontology matching with COMA++ (SIGMOD 2005). D Aumueller, HH Do, S Massmann, E Rahm

Page 38: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Conclusion

• Unsupervised mapping is possible
  • Very high recall / precision when enough queries are present

• Click logs are promising
  • Finds results that other methods cannot find
  • As clicklog size increases, it will produce more mappings

• Combinable with existing methods

Page 39: HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

http://arnab.org/contact http://research.microsoft.com/~philbe/

Questions?
