We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distributions of keyword queries that cause click-throughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.
HAMSTER: USING SEARCH CLICKLOGS FOR SCHEMA AND TAXONOMY MATCHING
Arnab Nandi (University of Michigan), Phil Bernstein (Microsoft Research)
Scenario
• Search over structured data: commerce, entertainment
• Data onboarding: merge an XML data feed from a 3rd party into the Microsoft data warehouse
[Diagram: Users send a query to the search engine + data warehouse and get results; several 3rd Party Feeds (e.g., “Amazon.com”) are loaded into the warehouse]
• High precision (irrespective of recall)
• Minimal human involvement
Example Feed

3rd Party Movie Site (Foreign):

<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

Warehouse: Movies (Host):

<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <RATING>NR</RATING>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>
Schema Matching

3rd Party Movie Site (Foreign):

<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <Release Key="Yes">2008</Release>
  <Description>Ever…</Description>
  <RunTime>127</RunTime>
  <Categories>
    <Category>Action</Category>
    <Category>Comedy</Category>
  </Categories>
  <MPAA>PG-13</MPAA>
  <SiteUrl>http://www.indianajones.com/site/index.html</SiteUrl>
  <Persons>
    <Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person>
  </Persons>
</Movie>

Warehouse: Movies (Host):

<MOVIE>
  <MOVIE_ID>57590</MOVIE_ID>
  <MOVIE_NAME>Indiana Jones and the Kingdom of the Crystal Skull</MOVIE_NAME>
  <RUNTIME>02:00</RUNTIME>
  <GENRE1>Action/Adventure</GENRE1>
  <GENRE2/>
  <RATING>NR</RATING>
  <ADVISORY/>
  <URL>http://www.indianajones.com/</URL>
  <ACTOR1>Harrison Ford</ACTOR1>
  <ACTOR2>Karen Allen</ACTOR2>
</MOVIE>
From        To
Movie       MOVIE
Title       MOVIE_NAME
Runtime     RUNTIME
Category    GENRE*
MPAA        RATING
Person      ACTOR*
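To make the role of these correspondences concrete, here is a minimal Python sketch that applies them to a shortened copy of the example feed. The function name `to_host_record` and the handling of `*` as a numbered column are assumptions for illustration, not part of HAMSTER. Values are copied unchanged, so unit conversion (127 vs 02:00) and taxonomy mapping (Action vs Action/Adventure) remain separate steps.

```python
import xml.etree.ElementTree as ET

# Correspondences from the From/To table. "*" marks numbered host columns
# (GENRE1, GENRE2, ...). The feed spells the tag "RunTime", so that
# spelling is used as the key here.
CORRESPONDENCES = {
    "Title": "MOVIE_NAME",
    "RunTime": "RUNTIME",
    "Category": "GENRE*",
    "MPAA": "RATING",
    "Person": "ACTOR*",
}

def to_host_record(feed_xml: str) -> ET.Element:
    """Rename matched foreign elements to host elements (hypothetical helper)."""
    foreign = ET.fromstring(feed_xml)
    host = ET.Element("MOVIE")
    counters = {}
    for elem in foreign.iter():
        target = CORRESPONDENCES.get(elem.tag)
        if target is None:
            continue  # no correspondence known for this element
        if target.endswith("*"):
            base = target[:-1]
            counters[base] = counters.get(base, 0) + 1
            target = f"{base}{counters[base]}"
        ET.SubElement(host, target).text = elem.text
    return host

FEED = """<Movie>
  <Title Key="Yes">Indiana Jones and The Kingdom of The Crystal Skull</Title>
  <RunTime>127</RunTime>
  <Categories><Category>Action</Category><Category>Comedy</Category></Categories>
  <MPAA>PG-13</MPAA>
  <Persons><Person Role="Actor" Character="Indiana Jones">Harrison Ford</Person></Persons>
</Movie>"""

record = to_host_record(FEED)
print(ET.tostring(record, encoding="unicode"))
```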
Taxonomy Matching
From      To
Action    Action/Adventure
PG-13     NR
R         R
Various Problems
• Badly normalized…
• Unit conversion…
• Formatting choices…
• In-band signaling…
• Arbitrary labels
• Non-standard vocabulary / language
• Zero documentation
• Not enough instances
Unlike conventional matching…
• We have web search click data, for both the warehouse and the 3rd-party website
• The databases we are integrating (usually) have a presence on the web
• Why not use click data as a feature for schema & taxonomy matching?
[Diagram: Users, query, results, Search engine + data warehouse, 3rd Party Feed]
Outline
• Scenario
• Using Clicklogs
  • Core idea
  • Using Query Distributions
  • Example
  • System Architecture
• Results
Core idea
“If two (sets of) products are searched for by similar queries, then they are similar”
[Diagram: web search clicklog entries for the query “small laptop” link to products X, Y, Z in the warehouse categories Small Laptops and Pro. Laptops and to Asus.com (hardware, eee), giving the correspondence eee ::: small laptops]
Query Distributions
[Bar chart: click count (0 to 50) per query: “small laptop”, “netbook”, “hp mini 1000”, “hp mini”]
Mapping to Taxonomy
Map URL to product, which belongs to taxonomy:
http://www.amazon.com/dp/B001JTA59C maps to Shopping | Electronics | Netbooks, using the 3rd party DB (provided to us)
Aggregating Query Distributions
[Diagram: per-page query-distribution bar charts (click counts) aggregated per category: warehouse categories Small Laptops and Pro. Laptops, and Asus.com (hardware, eee); result: eee ::: small laptops]
Aggregate URLs to categories
Aggregate queries for each URL to schema element / taxonomy term
• Electronics | Electronics Features | Brands | Asus EEE: “netbook”, “laptop”, “cheap laptop”
• Office Products | Office Machines | Netbooks: “netbook”
Generating Correspondences
Goal: To match two schema elements (or categories), determine if they have the same distribution of queries searching for them.
Process:
1. For each page (URL):
   • Identify its query distribution
   • Identify the category / schema element of that page
2. For each category / schema element C: aggregate over the pages in C to get its query distribution
3. For each foreign category / schema element: find the host category / schema element with the most similar query distribution
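The three steps can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the clicklog rows are the ones from the deck's laptop example, the `URL_TO_CATEGORY` lookup is hypothetical, and `overlap` is a simple stand-in similarity (shared probability mass on identical queries) rather than the metric defined later in the deck.

```python
from collections import defaultdict

# Clicklog rows (query, click count, clicked URL) from the deck's laptop example.
CLICKLOG = [
    ("laptop", 70, "http://searchengine.com/product/macbookpro"),
    ("laptop", 25, "http://searchengine.com/product/mininote"),
    ("laptop", 5, "http://asus.com/eeepc"),
    ("netbook", 5, "http://searchengine.com/product/macbookpro"),
    ("netbook", 20, "http://searchengine.com/product/mininote"),
    ("netbook", 15, "http://asus.com/eeepc"),
    ("cheap netbook", 5, "http://asus.com/eeepc"),
]

# Assumed lookup from page URL to (side, category); in practice this would
# come from the warehouse and from the 3rd-party database.
URL_TO_CATEGORY = {
    "http://searchengine.com/product/macbookpro": ("host", "Professional Laptops"),
    "http://searchengine.com/product/mininote": ("host", "Small Laptops"),
    "http://asus.com/eeepc": ("foreign", "eee"),
}

def query_distributions(clicklog, url_to_category):
    """Steps 1 and 2: sum click counts per category, then normalize."""
    counts = {"host": defaultdict(lambda: defaultdict(int)),
              "foreign": defaultdict(lambda: defaultdict(int))}
    for query, freq, url in clicklog:
        if url not in url_to_category:
            continue  # page that cannot be placed in either taxonomy
        side, category = url_to_category[url]
        counts[side][category][query] += freq
    return {
        side: {cat: {q: c / sum(qc.values()) for q, c in qc.items()}
               for cat, qc in cats.items()}
        for side, cats in counts.items()
    }

def overlap(host_dist, foreign_dist):
    """Stand-in similarity: shared probability mass on identical queries."""
    return sum(min(p, foreign_dist[q]) for q, p in host_dist.items() if q in foreign_dist)

def best_matches(dists, similarity=overlap):
    """Step 3: most similar host category for each foreign category."""
    return {
        f_cat: max(dists["host"], key=lambda h: similarity(dists["host"][h], f_dist))
        for f_cat, f_dist in dists["foreign"].items()
    }

dists = query_distributions(CLICKLOG, URL_TO_CATEGORY)
matches = best_matches(dists)
print(matches)  # {'eee': 'Small Laptops'}
```

Passing the similarity as a parameter keeps the aggregation independent of the metric, so a different distribution similarity can be swapped in without touching steps 1 and 2.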
Example: Taxonomy Matching

query          freq  url
laptop         70    http://searchengine.com/product/macbookpro
laptop         25    http://searchengine.com/product/mininote
laptop         5     http://asus.com/eeepc
netbook        5     http://searchengine.com/product/macbookpro
netbook        20    http://searchengine.com/product/mininote
netbook        15    http://asus.com/eeepc
cheap netbook  5     http://asus.com/eeepc

Categories: macbookpro belongs to Warehouse: Professional Laptops, mininote to Warehouse: Small Laptops, and asus.com/eeepc to eee.
Example: Taxonomy Matching
• Warehouse: Small Laptops: “laptop”: 25/45, “netbook”: 20/45
• Warehouse: Professional Laptops: “laptop”: 70/75, “netbook”: 5/75
• eee: “laptop”: 5/25, “netbook”: 15/25, “cheap laptop”: 5/25
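Distribution Similarity Metric
Similarity(host, foreign) = Σ Jaccard(q_host, q_foreign) × MinFreq(q_host, q_foreign), summed over all (q_host, q_foreign) combinations
“small laptops” vs “eee”, nonzero terms: laptop vs laptop, netbook vs netbook, laptop vs cheap laptop
Reading MinFreq as the smaller of the two frequencies, the laptop vs laptop term is min(25/45, 5/25) = 5/25, so:
1 × (5/25) + 1 × (20/45) + 0.5 × (5/25) = 0.74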
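A short Python sketch of the metric, under two assumptions that are not spelled out on the slide: Jaccard is computed over the sets of words in each query, and MinFreq is the smaller of the two normalized click frequencies. With those readings the “small laptops” vs “eee” score comes out at 0.74, matching the slide. The distributions are the ones listed on the preceding example slide.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two query strings (assumed reading)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def similarity(host: dict, foreign: dict) -> float:
    """Sum of Jaccard x MinFreq over all (q_host, q_foreign) combinations."""
    return sum(
        jaccard(q_host, q_foreign) * min(f_host, f_foreign)
        for q_host, f_host in host.items()
        for q_foreign, f_foreign in foreign.items()
    )

# Normalized query distributions from the example slide.
small_laptops = {"laptop": 25 / 45, "netbook": 20 / 45}
professional_laptops = {"laptop": 70 / 75, "netbook": 5 / 75}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

print(round(similarity(small_laptops, eee), 2))  # 0.74
print(similarity(professional_laptops, eee) < similarity(small_laptops, eee))  # True
```

Under these assumptions the Professional Laptops score is lower than the Small Laptops score, so eee is matched to Small Laptops.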
Example: Taxonomy Matching
• Warehouse: Small Laptops vs eee: 0.74
• Warehouse: Professional Laptops vs eee: 0.31
Advantages of Clicklogs
• Resilient to language
• Resilient to new domains, data, and features: as long as people query & click, we have data to learn from
• Generates mappings previous methods can’t:
  • Electronics ▷ Electronics Features ▷ Brands ▷ Texas Instruments ≈ Office Products ▷ Office Machines ▷ Calculators
  • Software ▷ Categories ▷ Programming ▷ Programming Languages ▷ Visual Basic ≈ Software ▷ Developer Tools
System Design
HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching
Arnab Nandi∗, University of Michigan, Ann Arbor, arnab@umich.edu
Philip A. Bernstein, Microsoft Research, phil.bernstein@microsoft.com
ABSTRACT
We address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The matching process is the first step of a framework to integrate data feeds from third-party data providers into a structured-search engine’s data warehouse. Our experiments show that traditional schema-based and instance-based schema matching methods fall short. We propose a new technique based on the search engine’s clicklogs. Two schema elements are matched if the distributions of keyword queries that cause click-throughs on their instances are similar. We present experiments on large commercial datasets that show the new technique has much better accuracy than traditional techniques.
1. INTRODUCTION
In this paper, we address the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse. The application is the use of structured data sources to enhance the results of keyword-based web search. For example, Google, Yahoo and Live Search all provide shopping listings for the query “digital camera” above their traditional web search results, presumably by augmenting their keyword index with structured shopping data. This requires gathering a wide variety of structured data sources into a data warehouse that is indexed by the search engine for keyword queries. These sources are typically provided by third parties, though they might also be obtained from web sites using information extraction. The sources need to be integrated so that similar data in the warehouse is indexed in the same way by the search engine, thereby improving the relevance of the result of keyword queries. The first step of the integration process is to match incoming data source schemas to the warehouse schema.
For example, suppose we are integrating data sources that describe movies. Our data warehouse of integrated data has a column called Rating, which describes the suitability of the movie for certain audiences (e.g., G, PG-13, R). We need to integrate a new data source which has an XML tag <MPAA> that contains the rating. It is beneficial to map the tag <MPAA> to the column name Rating, so that instances of <MPAA> in the new source are recognized in our index as ratings. This enables the search engine to answer queries about the rating of movies that appear only in this new data source, such as a keyword query “rating Dark Knight”.

∗Work done while at Microsoft Research.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. VLDB ‘09, August 24-28, 2009, Lyon, France. Copyright 2009 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

[Figure 1 diagram; recoverable labels: Data Integrator; Target Database (Warehouse) with Schema and Taxonomy; Source Database with Schema and Taxonomy; Query Logs; Search Engine; Schema Correspondences; Taxonomy Correspondences; Updated Target (Warehouse)]
Figure 1: HAMSTER System Architecture
Some values in a data source are categorical. By “categorical,” we mean the values come from a controlled vocabulary and are organized into a taxonomy. For better indexing, we need to map the data source’s taxonomy to the data warehouse’s taxonomy. For example, product catalogs usually categorize each product within a taxonomy. An item “netbook” might have an attribute “class” in a new data source, whose value is the path computer ▷ portable ▷ economy ▷ small in the data source’s taxonomy. But the data warehouse may classify the item differently, such as laptop ▷ lightweight ▷ inexpensive. To do a good job of answering the query “netbook” over data feeds whose descriptions of netbooks do not contain the word “netbook,” we need to map the data feed’s class to the data warehouse’s class and recognize “netbook” as a term for the latter class.
Experimenting with Click Logs
• Commercial warehouse mapping: 258 products from a 70,000-term Amazon.com taxonomy (613 in gold) to a 6,000-term warehouse taxonomy (40 in gold)
• Live.com (now Bing.com) search query log
• Amazon-to-warehouse mapping task, consecutively halving the clicklog size used
• 1.8 million clicks to Amazon.com product pages
• Each product's query distribution averaged 13 unique (i.e., different) search queries (min 1, max 181, stdev 22)
Summary of Results
90% precision / recall possible
Query distribution is a good similarity metric
Bigger clicklogs imply better recall
Technique isn't very sensitive to similarity metric
Precision / Recall
Commercial warehouse mapping, 258 products from a 70K-term Amazon.com taxonomy to a 6,000-term warehouse taxonomy (613 categories used)
[Plot: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1) for four matchers: Instance-based, Query Distribution, Consensus, Name-based]
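For readers who want to reproduce this kind of curve, the sketch below shows one common way to compute precision and recall for a matcher against a gold mapping while sweeping a score threshold. How the deck's curves were actually produced is not stated, so the thresholding scheme, the function name `precision_recall`, and the toy data are all assumptions.

```python
def precision_recall(scored, gold, threshold):
    """Precision and recall of the matches whose score is at least `threshold`.

    scored: {foreign element: (proposed host element, similarity score)}
    gold:   {foreign element: correct host element}
    """
    predicted = {f: h for f, (h, score) in scored.items() if score >= threshold}
    correct = sum(1 for f, h in predicted.items() if gold.get(f) == h)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical matcher output and gold standard, for illustration only.
scored = {"a": ("X", 0.9), "b": ("Y", 0.6), "c": ("Z", 0.2)}
gold = {"a": "X", "b": "W", "c": "Z"}

for threshold in (0.8, 0.5, 0.1):
    p, r = precision_recall(scored, gold, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold admits more proposed matches, which is what traces out a curve from high-precision, low-recall points toward higher recall.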
Match Quality

                      Amazon Products   Amazon Categories   Warehouse Products      Warehouse Categories
Amazon Products       257/258 correct   241/258 correct     189/258 correct (73%)   226/258 correct
Amazon Categories                       373/613 correct     204/400 correct         525/613 (85%)
Warehouse Products                                          392/400 correct         383/400 correct
Warehouse Categories                                                                40/40 correct

• QDs are unique to entities
• QDs are unique to aggregate classes
• QDs of entities are closest to the distributions of their aggregate classes
• QDs of similar aggregates are similar
Varying Clicklog Size
• Successively decreased clicklog size by half
• Recall decreases as clicklog size is decreased
[Plot: precision (y-axis, 0.65 to 0.95) vs. recall (x-axis, 0 to 0.7) for Items and Categories, at clicklog sizes from 1/32 of the log up to ¼, ½, and the full log]
Comparing Query Distributions
[Plot: precision (y-axis, 80 to 100) vs. recall (x-axis, 0 to 100) for four variants of the similarity metric]
Similarity(host, foreign) = Σ Jaccard(q_host, q_foreign) × MinFreq(q_host, q_foreign), summed over all (q_host, q_foreign) combinations
• Replace Jaccard with various phrase similarity metrics
• Minimal difference, due to the size of most queries
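The idea of swapping the phrase similarity can be sketched by making it a parameter of the distribution similarity. The deck does not say which alternative metrics were tried, so the two alternatives below (exact match, and the character-based ratio from Python's `difflib`) are stand-ins chosen for illustration, and MinFreq is again assumed to be the smaller of the two frequencies.

```python
from difflib import SequenceMatcher

def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of words in each query."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def exact_match(a: str, b: str) -> float:
    """1.0 only for identical queries (stand-in alternative)."""
    return 1.0 if a == b else 0.0

def char_ratio(a: str, b: str) -> float:
    """Character-level similarity from difflib (stand-in alternative)."""
    return SequenceMatcher(None, a, b).ratio()

def similarity(host: dict, foreign: dict, phrase_sim) -> float:
    """Sum of phrase_sim x MinFreq over all (q_host, q_foreign) combinations."""
    return sum(
        phrase_sim(qh, qf) * min(fh, ff)
        for qh, fh in host.items()
        for qf, ff in foreign.items()
    )

small_laptops = {"laptop": 25 / 45, "netbook": 20 / 45}
eee = {"laptop": 5 / 25, "netbook": 15 / 25, "cheap laptop": 5 / 25}

scores = {
    fn.__name__: similarity(small_laptops, eee, fn)
    for fn in (word_jaccard, exact_match, char_ratio)
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Because most queries are only a word or two long, the phrase-level term is often either an exact match or no match at all, which is consistent with the observation that the choice of metric makes little difference.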
Related + Future Work

Usage-based / crowdsourcing
• H. Elmeleegy, M. Ouzzani, A. Elmagarmid. Usage-Based Schema Matching (ICDE 2008)
• R. McCann, W. Shen, A. Doan. Matching schemas in online communities: A web 2.0 approach (ICDE 2008)

Web-scale integration
• Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy. Web-scale Data Integration: You can only afford to Pay As You Go (CIDR 2007)

“Mixed” methods
• A. Doan, J. Madhavan, P. Domingos, A. Halevy. Ontology matching: A machine learning approach (Handbook on Ontologies, 2004)
• A. Doan, P. Domingos, A. Halevy. Learning to match the schemas of data sources: A multistrategy approach (Machine Learning Journal, 2003)
• D. Aumueller, H. H. Do, S. Massmann, E. Rahm. Schema and ontology matching with COMA++ (SIGMOD 2005)
Conclusion
• Unsupervised mapping is possible
  • Very high recall / precision when enough queries are present
• Click logs are promising
  • Finds results that other methods cannot find
  • As clicklog size increases, it will produce more mappings
• Combinable with existing methods

http://arnab.org/contact
http://research.microsoft.com/~philbe/

Questions?