HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING Arnab Nandi Phil Bernstein UNIV OF MICHIGAN MICROSOFT RESEARCH


Page 1: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research)

Page 2: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Scenario

Arnab Nandi & Phil Bernstein

Page 3: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Scenario

• Search over structured data: commerce, entertainment
• Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.

Page 4: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Scenario

[Diagram labels: Users, query, results, Search engine + data warehouse, 3rd Party Feed (several), “Amazon.com”]

• High precision
• High recall
• Minimal human involvement

Page 5: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Example Feed

3rd Party Movie Site (Foreign):

<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

Warehouse: Movies (Host):

<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <MPAA>NR</MPAA>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>

Page 6: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Schema Matching

(Same feed and warehouse records as on the Example Feed slide, with the warehouse rating stored as <RATING>NR</RATING>.)

From      To
Movie     MOVIE
Title     MOVIE_NAME
Runtime   RUNTIME
Category  GENRE*
MPAA      RATING
Person    ACTOR*

Page 7: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Taxonomy Matching

(Same feed and warehouse records as on the Example Feed slide, with the warehouse rating stored as <RATING>NR</RATING>.)

From    To
Action  Action/Adventure
PG-13   NR
R       R

Page 8: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Various Problems

• Badly normalized…
• Unit conversion…
• Formatting choices…
• In-band signaling…
• Arbitrary labels
• Non-standard vocabulary / language
• Zero documentation
• Not enough instances

Page 9: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Unlike conventional matching…

• We have web search click data for both the warehouse and the 3rd party website
• The databases we are integrating (usually) have a presence on the web
• Why not use click data as a feature for schema & taxonomy matching?

[Diagram labels: Users, query, results, Search engine + data warehouse, 3rd Party Feed]

Page 10: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Outline

• Scenario
• Using Clicklogs
  ◦ Core idea
  ◦ Using Query Distributions
  ◦ Example
  ◦ System Architecture
• Results

Page 11: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Core idea

“If two (sets of) products are searched for by similar queries, then they are similar”

[Figure labels: Web Search, query “Small laptop”]

Page 12: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Core idea

[Figure: clicklog diagram. Labels: Clicklog; query “Small laptop” (three times); Warehouse categories Small Laptops and Pro. Laptops; Asus.com categories hardware and eee; X, Y, Z; “eee ::: small laptops”]

Page 13: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Query Distributions

[Bar chart: click count (axis 0 to 50) for the queries “small laptop”, “netbook”, “hp mini 1000”, “hp mini”]

Page 14: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Mapping to Taxonomy

• Map URL to product, which belongs to taxonomy
  ◦ http://www.amazon.com/dp/B001JTA59C → Shopping | Electronics | Netbooks
• Uses the 3rd party DB (provided to us)
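The lookup on this slide can be sketched as follows. This is a minimal illustration, assuming the 3rd party database can be represented as a dictionary from product id to taxonomy path. The names `PRODUCT_TAXONOMY` and `taxonomy_for_url`, and the regular expression for `/dp/<id>` URLs, are hypothetical and not taken from the talk.

```python
import re
from typing import Optional, Tuple

# Hypothetical stand-in for the 3rd party database provided with the feed:
# product id -> taxonomy path.
PRODUCT_TAXONOMY = {
    "B001JTA59C": ("Shopping", "Electronics", "Netbooks"),
}

# Assumed URL shape: the product id follows "/dp/", as in the slide's example.
_PRODUCT_ID = re.compile(r"/dp/([A-Z0-9]+)")


def taxonomy_for_url(url: str) -> Optional[Tuple[str, ...]]:
    """Return the taxonomy path of the product a clicked URL points to."""
    match = _PRODUCT_ID.search(url)
    if match is None:
        return None  # not a recognizable product page
    return PRODUCT_TAXONOMY.get(match.group(1))


path = taxonomy_for_url("http://www.amazon.com/dp/B001JTA59C")
print(" | ".join(path))
```

Clicks on URLs that cannot be resolved to a product would simply be left out of the query distributions in this sketch.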

Page 15: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Aggregating Query Distributions

[Figure: per-page query distributions (bar charts, click counts 0 to 50) aggregated into distributions for the Warehouse categories Small Laptops and Pro. Laptops and the Asus.com categories hardware and eee; “eee ::: small laptops”]

Page 16: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Generating Correspondences

Goal: To match two schema elements (or categories), determine whether the queries that lead to them have the same distribution.

Process:
• For each page (URL)
  ◦ Identify its query distribution
  ◦ Identify the category / schema element of that page
• For each category / schema element C
  ◦ Aggregate over the pages in C to get the query distribution of C
• For each foreign category / schema element
  ◦ Find the host category / schema element with the most similar query distribution
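The three steps of this process can be sketched in a few lines. This is only an illustration under stated assumptions: the clicklog is taken to be a list of (query, freq, url) rows, the page-to-category assignments are given as dictionaries, and `shared_mass` is a deliberately simple placeholder similarity, not the metric used in the talk. All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Distribution = Dict[str, float]


def aggregate(clicks: Iterable[Tuple[str, int, str]],
              category_of: Dict[str, str]) -> Dict[str, Distribution]:
    """Sum click counts per (category, query), then normalize per category."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for query, freq, url in clicks:
        category = category_of.get(url)
        if category is not None:  # skip pages we cannot place in a category
            counts[category][query] += freq
    result = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        result[category] = {q: n / total for q, n in counter.items()}
    return result


def shared_mass(a: Distribution, b: Distribution) -> float:
    """Placeholder similarity: overlap of two distributions on identical queries."""
    return sum(min(p, b[q]) for q, p in a.items() if q in b)


def match(foreign, host, similarity=shared_mass):
    """For each foreign element, pick the host element with the most similar distribution."""
    return {f: max(host, key=lambda h: similarity(host[h], fd))
            for f, fd in foreign.items()}


# Toy clicklog with made-up URLs and counts.
clicks = [
    ("netbook", 8, "host/a"), ("laptop", 2, "host/a"),
    ("laptop", 9, "host/b"), ("netbook", 1, "host/b"),
    ("netbook", 7, "feed/x"), ("laptop", 3, "feed/x"),
]
host = aggregate(clicks, {"host/a": "Small", "host/b": "Pro"})
foreign = aggregate(clicks, {"feed/x": "eee"})
print(match(foreign, host))
```

Passing the similarity as a parameter keeps the matching loop independent of the particular distribution metric.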

Page 18: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Example: Taxonomy Matching

query          freq  url
laptop         70    http://searchengine.com/product/macbookpro
laptop         25    http://searchengine.com/product/mininote
laptop         5     http://asus.com/eeepc
netbook        5     http://searchengine.com/product/macbookpro
netbook        20    http://searchengine.com/product/mininote
netbook        15    http://asus.com/eeepc
cheap netbook  5     http://asus.com/eeepc

Categories shown: Warehouse: Small Laptops, Warehouse: Professional Laptops, eee

Page 19: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Example: Taxonomy Matching

• Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45
• Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75
• eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
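The fractions on this slide can be derived from the clicklog table of the previous slide, as the sketch below shows. The page-to-category assignment in `CATEGORY_OF` is an assumption inferred from the totals (75, 45 and 25 clicks). The sketch follows the table, which lists the third eee query as “cheap netbook” where this slide shows “cheap laptop”. Function and variable names are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Clicklog rows from the example table: (query, freq, url).
CLICKS = [
    ("laptop", 70, "http://searchengine.com/product/macbookpro"),
    ("laptop", 25, "http://searchengine.com/product/mininote"),
    ("laptop", 5, "http://asus.com/eeepc"),
    ("netbook", 5, "http://searchengine.com/product/macbookpro"),
    ("netbook", 20, "http://searchengine.com/product/mininote"),
    ("netbook", 15, "http://asus.com/eeepc"),
    ("cheap netbook", 5, "http://asus.com/eeepc"),
]

# Assumed assignment of pages to categories, inferred from the click totals.
CATEGORY_OF = {
    "http://searchengine.com/product/macbookpro": "Warehouse: Professional Laptops",
    "http://searchengine.com/product/mininote": "Warehouse: Small Laptops",
    "http://asus.com/eeepc": "eee",
}


def query_distributions(clicks, category_of):
    """Aggregate clicks per category and normalize to relative frequencies."""
    counts = defaultdict(Counter)
    for query, freq, url in clicks:
        counts[category_of[url]][query] += freq
    distributions = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        distributions[category] = {q: Fraction(n, total) for q, n in counter.items()}
    return distributions


DISTRIBUTIONS = query_distributions(CLICKS, CATEGORY_OF)
for category, dist in DISTRIBUTIONS.items():
    print(category, {q: str(p) for q, p in dist.items()})
```

`Fraction` keeps the values exact, but prints them in lowest terms, so 25/45 appears as 5/9.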

Page 20: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Distribution Similarity Metric

$$\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$
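The metric above can be sketched as follows. Two readings are assumed here and are not spelled out on the slide: Jaccard is computed over the word sets of the two queries, and MinFreq is the smaller of the two queries' relative frequencies. The distributions are the ones shown on the preceding example slide. Function names are illustrative.

```python
from typing import Dict

Distribution = Dict[str, float]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two queries (assumed reading)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def similarity(host: Distribution, foreign: Distribution) -> float:
    """Sum of Jaccard x MinFreq over all (host query, foreign query) pairs."""
    return sum(
        jaccard(qh, qf) * min(ph, pf)  # MinFreq assumed to be the minimum frequency
        for qh, ph in host.items()
        for qf, pf in foreign.items()
    )


small = {"laptop": 25 / 45, "netbook": 20 / 45}
pro = {"laptop": 70 / 75, "netbook": 5 / 75}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

print(round(similarity(small, eee), 2))  # 0.74
print(similarity(small, eee) > similarity(pro, eee))  # True
```

Under these assumptions, eee is closer to Small Laptops than to Professional Laptops.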

Page 21: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Example: Taxonomy Matching

Query distributions:
• Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45
• Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75
• eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25

“Small Laptops” vs “eee”:
• laptop vs laptop: 1 × (5/25)
• netbook vs netbook: 1 × (20/45)
• laptop vs cheap laptop: 0.5 × (5/25)
• Total: 1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) ≈ 0.74

Similarity to eee: Small Laptops 0.74, Professional Laptops 0.31

Page 22: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Advantages of Clicklogs

• Resilient to language
• Resilient to new domains, data, and features
  ◦ As long as people query & click, we have data to learn from
• Generates mappings previous methods can’t
  ◦ Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  ◦ Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools

Page 23: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

System Design

[Figure: system design diagram]

Page 25: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Experimenting with Click Logs

• Commercial warehouse mapping, 258 products
  ◦ from a 70,000 term Amazon.com taxonomy (613 in gold)
  ◦ to a 6,000 term warehouse taxonomy (40 in gold)
• Live.com (now Bing.com) search querylog
  ◦ Amazon to warehouse mapping task, consecutively halving the clicklog size used
  ◦ 1.8 million clicks to Amazon.com product pages
  ◦ Typically each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).

Page 26: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Summary of Results

• 90% precision / recall possible
• Query distribution is a good similarity metric
• Bigger clicklogs imply better recall
• Technique isn't very sensitive to the choice of phrase similarity metric

Page 27: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Precision / Recall

Commercial warehouse mapping: 258 products, from a 70K term Amazon.com taxonomy to a 6,000 term warehouse taxonomy (613 categories used)

[Plot: precision (y axis, 0 to 1) vs recall (x axis, 0 to 1) for four matchers: Instance-based, Query Distribution, Consensus, Name-based]
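Precision and recall for a set of proposed matches can be computed against a gold mapping as sketched below. The talk does not say how the points of the curve were produced. Sweeping a score threshold, as done here, is one common way and is an assumption. The function name and the toy data are hypothetical.

```python
def precision_recall(predicted, gold, threshold):
    """predicted: {foreign: (host, score)}; gold: {foreign: host}.

    Only matches scoring at least `threshold` are kept.
    """
    kept = {f: h for f, (h, score) in predicted.items() if score >= threshold}
    correct = sum(1 for f, h in kept.items() if gold.get(f) == h)
    precision = correct / len(kept) if kept else 1.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up matches and gold standard, for illustration only.
predicted = {"a": ("A", 0.9), "b": ("B", 0.6), "c": ("X", 0.4)}
gold = {"a": "A", "b": "B", "c": "C", "d": "D"}

for threshold in (0.0, 0.5):
    print(threshold, precision_recall(predicted, gold, threshold))
```

Raising the threshold in this toy example drops the one wrong match, which raises precision while recall stays the same.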

Page 29: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Match Quality

QD = query distribution. Each cell gives the number of correct matches between the row set and the column set.

                      Amazon Products  Amazon Categories  Warehouse Products  Warehouse Categories
Amazon Products       257/258          241/258            189/258 (73%)       226/258
Amazon Categories     -                373/613            204/400             525/613 (85%)
Warehouse Products    -                -                  392/400             383/400
Warehouse Categories  -                -                  -                   40/40

• QDs are unique to entities
• QDs are unique to aggregate classes
• QDs of entities are closest to the distributions of their aggregate classes
• QDs of similar aggregates are similar

Page 31: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Varying Clicklog Size

• Successively decreased clicklog size by half
• Recall decreases as clicklog size is decreased

[Plot: precision (y axis, 0.65 to 0.95) vs recall (x axis, 0 to 0.7) for Items and Categories; points labeled 1/32, ¼, ½, Full Log]
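The halving experiment can be sketched as follows. The talk does not say how each half was chosen. Uniform random sampling of click records is assumed here, and the function name is illustrative.

```python
import random


def halved_logs(clicks, rounds, seed=0):
    """Yield the full log, then `rounds` successively halved random subsamples."""
    rng = random.Random(seed)  # fixed seed so the subsamples are reproducible
    current = list(clicks)
    yield current
    for _ in range(rounds):
        # Each subsample is drawn from the previous one, so the logs are nested.
        current = rng.sample(current, len(current) // 2)
        yield current


# Five halvings take a log from full size down to 1/32 of it.
sizes = [len(log) for log in halved_logs(range(64), rounds=5)]
print(sizes)  # [64, 32, 16, 8, 4, 2]
```

Matching would then be rerun on each subsample, and precision and recall recorded per log size.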

Page 33: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Comparing Query Distributions

$$\sum_{\text{all } (q_{host},\, q_{foreign}) \text{ combinations}} \mathrm{Jaccard}(q_{host}, q_{foreign}) \times \mathrm{MinFreq}(q_{host}, q_{foreign})$$

• Replace Jaccard with various phrase similarity metrics
• Minimal difference in results, due to the small size of most queries
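Swapping the phrase similarity can be sketched by making it a parameter of the distribution metric. The talk does not name the alternative metrics that were tried. The word-level Dice coefficient and the character-level `difflib` ratio below are stand-ins chosen for illustration, and MinFreq is assumed to be the minimum of the two relative frequencies.

```python
from difflib import SequenceMatcher


def _words(query):
    return set(query.lower().split())


def jaccard(a, b):
    """Word-level Jaccard similarity."""
    wa, wb = _words(a), _words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def dice(a, b):
    """Word-level Dice coefficient (illustrative alternative)."""
    wa, wb = _words(a), _words(b)
    return 2 * len(wa & wb) / (len(wa) + len(wb)) if wa or wb else 0.0


def char_ratio(a, b):
    """Character-level similarity from the standard library (illustrative alternative)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity(host, foreign, phrase_sim=jaccard):
    """Sum of phrase_sim x MinFreq over all query pairs."""
    return sum(
        phrase_sim(qh, qf) * min(ph, pf)
        for qh, ph in host.items()
        for qf, pf in foreign.items()
    )


small = {"laptop": 25 / 45, "netbook": 20 / 45}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

for metric in (jaccard, dice, char_ratio):
    print(metric.__name__, round(similarity(small, eee, metric), 3))
```

With queries of one or two words, the word-level metrics can only take a handful of distinct values, which is consistent with the small differences reported on the slide.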

Page 35: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Related + Future Work

Usage Based / Crowdsourcing
• Usage-Based Schema Matching (ICDE 2008). H. Elmeleegy, M. Ouzzani, A. Elmagarmid
• Matching schemas in online communities: A web 2.0 approach (ICDE 2008). R. McCann, W. Shen, A. Doan

Web Scale Integration
• Web-scale Data Integration: You can only afford to Pay As You Go (CIDR 2007). Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy

Page 36: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Related + Future Work

“Mixed” methods
• Ontology matching: A machine learning approach (Handbook on Ontologies 2004). A. Doan, J. Madhavan, P. Domingos, A. Halevy
• Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003). A. Doan, P. Domingos, A. Halevy
• Schema and ontology matching with COMA++ (SIGMOD 2005). D. Aumueller, H. H. Do, S. Massmann, E. Rahm

Page 37: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Conclusion

• Unsupervised mapping is possible
  ◦ Very high recall / precision when enough queries are present
• Click logs are promising
  ◦ Finds results that other methods cannot find
  ◦ As clicklog size increases, it will produce more mappings
• Combinable with existing methods

Page 38: HAMSTER: Using Search  Clicklogs  for Schema and Taxonomy Matching

Questions?

Arnab Nandi: http://arnab.org/contact
Phil Bernstein: http://research.microsoft.com/~philbe/