HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING
Arnab Nandi (Univ. of Michigan), Phil Bernstein (Microsoft Research)
Scenario
Arnab Nandi & Phil Bernstein
Scenario
Search over structured data
  Commerce, entertainment
Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse.
Scenario
[Diagram: users send a query to, and receive results from, the search engine + data warehouse, which ingests 3rd party feeds such as “Amazon.com”.]
• High precision
• High recall
• Minimal human involvement
Example Feed
3rd Party Movie Site (Foreign)
<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>
Warehouse: Movies (Host)
<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <MPAA>NR</MPAA>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>
Schema Matching
3rd Party Movie Site (Foreign)
<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>
Warehouse: Movies (Host)
<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <RATING>NR</RATING>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>
Schema correspondences
  From       To
  Movie      MOVIE
  Title      MOVIE_NAME
  Runtime    RUNTIME
  Category   GENRE*
  MPAA       RATING
  Person     ACTOR*
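One way to picture how such correspondences might be applied during onboarding is sketched below. It is an illustration only, not the system described in this talk. The `CORRESPONDENCES` dict, the `to_host` helper, and the rule that a starred target such as GENRE* expands to GENRE1, GENRE2, and so on are assumptions made for the example. The sketch uses the feed's actual tag name `RunTime`, renames elements only, and leaves values untouched, so unit and vocabulary differences (RunTime 127 vs RUNTIME 02:00, PG-13 vs NR) are not handled.

```python
import xml.etree.ElementTree as ET

# Correspondences from the slide (foreign tag -> host tag).
# A trailing "*" marks a repeated target; expanding it to NAME1, NAME2, ...
# is an assumption made for this sketch.
CORRESPONDENCES = {
    "Title": "MOVIE_NAME",
    "RunTime": "RUNTIME",
    "Category": "GENRE*",
    "MPAA": "RATING",
    "Person": "ACTOR*",
}

# Abbreviated copy of the example feed record.
FEED = """
<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>
"""


def to_host(foreign_xml: str) -> ET.Element:
    """Rename matched foreign elements into a flat host <MOVIE> record."""
    foreign = ET.fromstring(foreign_xml)
    host = ET.Element("MOVIE")  # Movie -> MOVIE
    counters = {}  # next index for each starred target
    for elem in foreign.iter():
        target = CORRESPONDENCES.get(elem.tag)
        if target is None:
            continue  # unmatched elements (e.g. Categories) are skipped
        if target.endswith("*"):
            base = target[:-1]
            counters[base] = counters.get(base, 0) + 1
            target = f"{base}{counters[base]}"
        ET.SubElement(host, target).text = (elem.text or "").strip()
    return host


host_record = to_host(FEED)
host_tags = [child.tag for child in host_record]
print(host_tags)
```

The renaming step is the easy part; the problems listed later in the deck (unit conversion, formatting choices, non-standard vocabulary) are what make finding the correspondences hard in the first place.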
Taxonomy Matching
Taxonomy correspondences (values from the same foreign and host records as on the previous slide)
  From      To
  Action    Action/Adventure
  PG-13     NR
  R         R
Various Problems
Badly normalized…
Unit conversion…
Formatting choices…
In-band signaling…
Arbitrary labels
Non-standard vocabulary / language
Zero documentation
Not enough instances
Unlike conventional matching…
We have web search click data for both the warehouse and the 3rd party website
The databases we are integrating (usually) have a presence on the web
Why not use click data as a feature for schema & taxonomy matching?
[Diagram: users, query, search engine + data warehouse, results, 3rd party feed]
Outline
Scenario
Using Clicklogs
  Core idea
  Using Query Distributions
  Example
  System Architecture
Results
Core idea
“If two (sets of) products are searched for by similar queries, then they are similar”
[Figure: the query “Small laptop” issued to Web Search]
Core idea
[Diagram: a clicklog connects the query “Small laptop” to the Warehouse categories “Small Laptops” and “Pro. Laptops” and to the Asus.com categories “hardware” and “eee”, with edges labeled X, Y, and Z, yielding the candidate correspondence eee ::: small laptops.]
Query Distributions
[Bar chart: click count (0 to 50) per query; queries shown: “small laptop”, “netbook”, “hp mini 1000”, “hp mini”]
Mapping to Taxonomy
Map URL to product, which belongs to a taxonomy
  http://www.amazon.com/dp/B001JTA59C → Shopping | Electronics | Netbooks
  The mapping comes from the 3rd party DB (provided to us)
Aggregating Query Distributions
[Diagram: per-page query distributions (bar charts of click counts, 0 to 50) are aggregated for the Warehouse categories “Small Laptops” and “Pro. Laptops” and for the Asus.com categories “hardware” and “eee”, supporting the correspondence eee ::: small laptops.]
Generating Correspondences
Goal: To match two schema elements (or categories), determine whether they have the same distribution of queries searching for them.
Process
  For each page (URL)
    Identify query distribution
    Identify category / schema element of that page
  For each category / schema element C
    Aggregate over pages in C to get query distribution
  For each foreign category / schema element
    Find host category / schema element with most similar query distribution
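The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system's implementation. The rows in `CLICKLOG` are copied from the example clicklog later in this deck, and `PAGE_CATEGORY` follows the category labels on that slide. The function names and the `overlap` similarity (shared probability mass over identical queries) are stand-ins invented here; `overlap` is not the metric the deck defines later.

```python
from collections import Counter, defaultdict

# (query, click frequency, url) rows from the example clicklog in this deck.
CLICKLOG = [
    ("laptop", 70, "http://searchengine.com/product/macbookpro"),
    ("laptop", 25, "http://searchengine.com/product/mininote"),
    ("laptop", 5, "http://asus.com/eeepc"),
    ("netbook", 5, "http://searchengine.com/product/macbookpro"),
    ("netbook", 20, "http://searchengine.com/product/mininote"),
    ("netbook", 15, "http://asus.com/eeepc"),
    ("cheap netbook", 5, "http://asus.com/eeepc"),
]

# URL -> category, following the labels on the example slide.
PAGE_CATEGORY = {
    "http://searchengine.com/product/macbookpro": "Warehouse: Professional Laptops",
    "http://searchengine.com/product/mininote": "Warehouse: Small Laptops",
    "http://asus.com/eeepc": "eee",
}
HOST_CATEGORIES = {"Warehouse: Professional Laptops", "Warehouse: Small Laptops"}


def page_distributions(clicklog):
    """Step 1: per-page query distribution (raw click counts per query)."""
    pages = defaultdict(Counter)
    for query, freq, url in clicklog:
        pages[url][query] += freq
    return pages


def category_distributions(pages, page_category):
    """Step 2: sum page counts per category, then normalize to fractions."""
    totals = defaultdict(Counter)
    for url, counts in pages.items():
        totals[page_category[url]].update(counts)
    return {
        cat: {q: n / sum(counts.values()) for q, n in counts.items()}
        for cat, counts in totals.items()
    }


def overlap(host_dist, foreign_dist):
    """Stand-in similarity (an assumption): mass shared by identical queries."""
    return sum(min(p, foreign_dist[q]) for q, p in host_dist.items() if q in foreign_dist)


def match(foreign, host, similarity=overlap):
    """Step 3: for each foreign category, pick the most similar host category."""
    return {
        f_cat: max(host, key=lambda h_cat: similarity(host[h_cat], f_dist))
        for f_cat, f_dist in foreign.items()
    }


dists = category_distributions(page_distributions(CLICKLOG), PAGE_CATEGORY)
host = {c: d for c, d in dists.items() if c in HOST_CATEGORIES}
foreign = {c: d for c, d in dists.items() if c not in HOST_CATEGORIES}
result = match(foreign, host)
print(result)
```

Passing the similarity as a parameter keeps the aggregation steps independent of whichever distribution metric is eventually chosen.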
Example: Taxonomy Matching
query          freq   url                                          category
laptop         70     http://searchengine.com/product/macbookpro   Warehouse: Professional Laptops
laptop         25     http://searchengine.com/product/mininote     Warehouse: Small Laptops
laptop         5      http://asus.com/eeepc                        eee
netbook        5      http://searchengine.com/product/macbookpro   Warehouse: Professional Laptops
netbook        20     http://searchengine.com/product/mininote     Warehouse: Small Laptops
netbook        15     http://asus.com/eeepc                        eee
cheap netbook  5      http://asus.com/eeepc                        eee
Example: Taxonomy Matching
Warehouse: Small Laptops          “laptop”: 25/45, “netbook”: 20/45
Warehouse: Professional Laptops   “laptop”: 70/75, “netbook”: 5/75
eee                               “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
Distribution Similarity Metric
Σ over all (qhost, qforeign) combinations of: Jaccard(qhost, qforeign) × MinFreq(qhost, qforeign)
Example: Taxonomy Matching
“small laptops” vs “eee” (MinFreq takes the smaller of the two frequencies)
  laptop vs laptop:         1 × min(25/45, 5/25) = 1 × (5/25)
  netbook vs netbook:       1 × min(20/45, 15/25) = 1 × (20/45)
  laptop vs cheap laptop:   0.5 × min(25/45, 5/25) = 0.5 × (5/25)
  Total: 1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) = 0.74
Warehouse: Small Laptops vs eee: 0.74
Warehouse: Professional Laptops vs eee: 0.31
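The worked example can be checked with a short sketch. It rests on two assumptions: Jaccard is computed over the sets of words in each query (which gives 0.5 for “laptop” vs “cheap laptop”), and MinFreq is the smaller of the two relative frequencies. The function names are invented for illustration. Under this reading the sketch reproduces the 0.74 score for Small Laptops vs eee. It gives roughly 0.37 rather than 0.31 for Professional Laptops vs eee, so the original computation may differ in some detail not visible on the slides.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two queries, treated as sets of words."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


def similarity(host: dict, foreign: dict) -> float:
    """Sum Jaccard x MinFreq over all (host query, foreign query) pairs."""
    return sum(
        jaccard(q_host, q_foreign) * min(f_host, f_foreign)
        for q_host, f_host in host.items()
        for q_foreign, f_foreign in foreign.items()
    )


# Query distributions from the example slides.
SMALL = {"laptop": 25 / 45, "netbook": 20 / 45}
PRO = {"laptop": 70 / 75, "netbook": 5 / 75}
EEE = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

small_vs_eee = similarity(SMALL, EEE)
pro_vs_eee = similarity(PRO, EEE)
print(round(small_vs_eee, 2))  # 0.74
print(round(pro_vs_eee, 2))
```

Either way, eee scores clearly higher against Small Laptops than against Professional Laptops, which is the conclusion the example draws.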
Advantages of Clicklogs
Resilient to language
Resilient to new domains, data, and features
  As long as people query & click, we have data to learn from
Generates mappings previous methods can’t
  Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools
System Design
Experimenting with Click Logs
Commercial warehouse mapping, 258 products
  from a 70,000 term Amazon.com taxonomy (613 in gold)
  to a 6,000 term warehouse taxonomy (40 in gold)
Live.com (now Bing.com) search querylog
  Amazon to warehouse mapping task, consecutively halving the clicklog size used
  1.8 million clicks to Amazon.com product pages
  Typically each product had a query distribution averaging 13 unique (i.e., different) search queries (min 1, max 181, stdev 22).
Summary of Results
90% precision / recall possible
Query distribution is a good similarity metric
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
Precision / Recall
Commercial warehouse mapping, 258 products from a 70K term Amazon.com taxonomy to a 6,000 term warehouse taxonomy (613 categories used)
[Chart: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1) for four matchers: Instance-based, Query Distribution, Consensus, Name-based]
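For readers who want the two measures spelled out, here is a minimal sketch of how precision and recall could be computed for a set of proposed correspondences against a gold mapping. It assumes each source term has exactly one gold target. The `precision_recall` helper and the `PROPOSED` mapping are made up for illustration, and `GOLD` simply reuses the taxonomy correspondences shown earlier in the deck as toy data. None of this is the evaluation code behind the chart.

```python
def precision_recall(proposed: dict, gold: dict) -> tuple:
    """Precision and recall of proposed correspondences against a gold mapping."""
    correct = sum(1 for src, dst in proposed.items() if gold.get(src) == dst)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data: GOLD reuses correspondences from the deck; PROPOSED is hypothetical.
GOLD = {"Action": "Action/Adventure", "PG-13": "NR", "R": "R"}
PROPOSED = {"Action": "Action/Adventure", "R": "NR"}

precision, recall = precision_recall(PROPOSED, GOLD)
print(precision, recall)
```

A matcher that proposes fewer, safer correspondences trades recall for precision, which is the trade-off a precision/recall curve traces out.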
Match Quality
QDs are unique to entities
QDs are unique to aggregate classes
                       Amazon Products   Amazon Categories   Warehouse Products      Warehouse Categories
Amazon Products        257/258 correct   241/258 correct     189/258 correct (73%)   226/258 correct
Amazon Categories                        373/613 correct     204/400 correct         525/613 (85%)
Warehouse Products                                           392/400 correct         383/400 correct
Warehouse Categories                                                                 40/40 correct
QDs of entities are closest to the distributions of their aggregate classes
QDs of similar aggregates are similar
Varying Clicklog Size
Successively decreased clicklog size by half
Recall decreases as clicklog size is decreased
[Chart: precision (y-axis, 0.65 to 0.95) vs. recall (x-axis, 0 to 0.7) for Items and Categories, with points for clicklog sizes from 1/32 through ¼ and ½ up to the full log]
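The halving procedure might look like the sketch below. The slides do not say how each smaller log was drawn, so taking a uniform random half of the previous log is an assumption, as are the `halved_logs` name and the fixed seed. The 32 integers stand in for click records.

```python
import random


def halved_logs(clicklog, rounds, seed=0):
    """Yield the full log, then successively halved random subsets of it."""
    rng = random.Random(seed)
    current = list(clicklog)
    yield current
    for _ in range(rounds):
        # Each smaller log is a random half of the previous one (assumption).
        current = rng.sample(current, len(current) // 2)
        yield current


# Hypothetical stand-in log of 32 click records (plain integers here).
logs = list(halved_logs(range(32), rounds=5))
sizes = [len(log) for log in logs]
print(sizes)  # [32, 16, 8, 4, 2, 1]
```

Running the matcher on each log in turn and recording precision and recall would give one point per clicklog size, as in the chart.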
Comparing Query Distributions
Σ over all (qhost, qforeign) combinations of: Jaccard(qhost, qforeign) × MinFreq(qhost, qforeign)
Replace Jaccard with various phrase similarity metrics
Minimal difference due to the size of most queries
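Swapping the phrase similarity can be sketched by making it a parameter of the distribution metric. The slides do not name the alternative metrics that were tried, so the character-level `char_ratio` below (built on the standard library's `difflib.SequenceMatcher`) is only a stand-in, and the helper names are invented. The distributions are the ones from the laptop example earlier in the deck.

```python
from difflib import SequenceMatcher


def jaccard(a, b):
    """Word-set Jaccard similarity between two queries."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


def char_ratio(a, b):
    """Character-level similarity from the standard library (a stand-in choice)."""
    return SequenceMatcher(None, a, b).ratio()


def similarity(host, foreign, phrase_sim):
    """Sum phrase_sim x MinFreq over all (host query, foreign query) pairs."""
    return sum(
        phrase_sim(q_host, q_foreign) * min(f_host, f_foreign)
        for q_host, f_host in host.items()
        for q_foreign, f_foreign in foreign.items()
    )


# Query distributions from the laptop example.
HOSTS = {
    "Small Laptops": {"laptop": 25 / 45, "netbook": 20 / 45},
    "Professional Laptops": {"laptop": 70 / 75, "netbook": 5 / 75},
}
EEE = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}


def best_host(phrase_sim):
    """Host category with the highest similarity to eee under phrase_sim."""
    return max(HOSTS, key=lambda name: similarity(HOSTS[name], EEE, phrase_sim))


winners = {f.__name__: best_host(f) for f in (jaccard, char_ratio)}
print(winners)
```

On this small example both phrase similarities pick the same host category, in line with the slide's remark that the choice makes little difference.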
Related + Future Work
Usage Based / Crowdsourcing
  Usage-Based Schema Matching (ICDE 2008). Elmeleegy, H.; Ouzzani, M.; Elmagarmid, A.
  Matching schemas in online communities: A web 2.0 approach (ICDE 2008). R McCann, W Shen, AH Doan
Web Scale Integration
  Web-scale Data Integration: You can only afford to Pay As You Go (CIDR 2007). Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy
“Mixed” methods
  Ontology matching: A machine learning approach (Handbook on Ontologies 2004). A Doan, J Madhavan, P Domingos, A Halevy
  Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal 2003). A Doan, P Domingos, A Halevy
  Schema and ontology matching with COMA++ (SIGMOD 2005). D Aumueller, HH Do, S Massmann, E Rahm
Conclusion
Unsupervised mapping is possible
  Very high recall / precision when enough queries are present
Click logs are promising
  Finds results that other methods cannot find
  As clicklog size increases, it will produce more mappings
Combinable with existing methods
http://arnab.org/contact
http://research.microsoft.com/~philbe/
Questions?