A Semantic Approach for Estimating Consumer Content Preferences from Online Search Queries
Jia Liu and Olivier Toubia ∗
September 13, 2016
Job Market Paper
Abstract
We develop an innovative topic model, Hierarchically Dual Latent Dirichlet Allocation (HDLDA), which identifies not only the topics in search queries and webpages, but also how the topics in search queries relate to the topics in the corresponding top search results. Using the output from HDLDA, a consumer's content preferences may be estimated on the fly based on their search queries. We validate our proposed approach across different product categories using an experiment in which we observe participants' content preferences, the queries they formulate, and their browsing behavior. Our results suggest that our approach can help firms extract and understand the preferences revealed by consumers through their search queries, which in turn may be used to optimize the production and promotion of online content.
Keywords: search engine optimization, search engine marketing, search queries, preferences, topic modeling, semantic mapping, Latent Dirichlet Allocation.
∗Jia Liu is a Ph.D. Candidate in Marketing, Columbia Business School, Columbia University. Email: [email protected]. Olivier Toubia is Glaubinger Professor of Business, Graduate School of Business, Columbia University. Email: [email protected]. This paper is based on the second chapter of Jia Liu's doctoral dissertation. This paper has benefited from the comments of Asim Ansari, David Blei, Kinshuk Jerath, and the participants at the Columbia Marketing and Decision, Risk and Operations seminars, and the 2015 Marketing Science Conference in Baltimore, MD. A previous version of this paper was titled: "A Semantic Approach for Estimating Consumer Information Needs from Online Search Queries."
1 Introduction
“The Future of Marketing Is Semantic.”
— Daniel Newman, Forbes, August 2014
Over the past decade, search engines like Google have become one of the primary tools consumers use when searching for products, services, or information. This trend has given rise to two major industries: Search Engine Optimization (SEO) and Search Engine Marketing (SEM). SEO refers to the process of tailoring a website to optimize its organic ranking for a given set of keywords or queries, in order to improve traffic and lead generation (Amerland, 2013). SEM usually refers to paid advertising on search engines. Search marketing spending in the U.S. alone is expected to reach $31.62 billion in 2015 (Statista, 2015).
Success in both of these industries hinges on firms' ability to infer consumers' content preferences, given their search queries. For example, a firm engaging in SEO should focus its promotion efforts on queries that reflect preferences that are best aligned with its content. Similarly, firms engaging in SEM should be willing to bid more on keywords and queries that reflect preferences that are well aligned with their content and offerings. In addition, search engines themselves are constantly trying to improve how well their search results satisfy their users' content preferences.
Despite the importance of being able to infer consumers’ content preferences from their queries,
very little research has been done in this area. Some research (which will be reviewed in Section
2.1) has developed taxonomies of search queries and search intent. However, this research does
not enable firms to infer content preferences in a quantified, nuanced and detailed manner. Another
stream of research in marketing (which will be reviewed in Section 2.2) has quantified consumer
preferences from their search behavior. However, that research has primarily focused on consumer
search behavior that manifests itself via discrete choices (e.g., purchasing, clicking). Text-based
search behavior (e.g., typing a search query), despite being a major way in which consumers search
today, has not received as much attention in that literature.
Inferring content preferences from search queries presents several challenges. The first challenge is the curse of dimensionality, due to the nature of textual data. Indeed, the number of possible keywords or queries available to consumers is very large. A second challenge is that
search terms tend to be ambiguous, i.e., consumers might use the same term in different ways.
Third, there exist some potentially complex semantic relationships between the content in a search
query and the content in the corresponding search results. Previous research (also reviewed in
Section 2.2) has shown that consumers have the ability to leverage these semantic relationships
when formulating queries. That is, consumers do not necessarily formulate queries that exactly
and directly reflect their content preferences, but rather they formulate queries that are more likely
to retrieve the type of content they are looking for. A fourth challenge is the sparsity of query data.
Each search query tends to contain very few words, and the average number of queries per user
session is usually less than five (Wang et al., 2003; Kamvar and Baluja, 2006; Jansen et al., 2009).
In this paper, we propose a framework that addresses the above challenges. We describe content as a set of topics, following the literature in information retrieval (Manning et al., 2008). We capture a consumer's preferences with a probability distribution over these topics that reflects the content the consumer is searching for. This allows us to address the curse of dimensionality and the ambiguity of keywords by identifying a small number of topics in queries and webpages, defined as probabilistic combinations of words. We address the issue that search queries and search results are semantically related by explicitly building a hierarchical structure into our topic model that quantifies the semantic mapping between search queries and search results. We alleviate the sparsity of queries both by combining information from multiple queries and by linking queries to results.
More specifically, we develop an innovative probabilistic topic model, Hierarchically Dual Latent Dirichlet Allocation (HDLDA). HDLDA is built upon Latent Dirichlet Allocation (LDA) (Blei et al., 2003), an unsupervised Bayesian learning algorithm that extracts "topics" from text based on word co-occurrence. HDLDA is a model for bag-of-words data in which documents (in our context, webpages) are multi-labeled by very small subsets of their own words (in our context, a search query "labels" a webpage if it can retrieve it). The model is dual because the documents and the labels share the same topic-word distributions (in our context, the same set of topics may describe
search queries and webpages). It is hierarchical because the labels influence the topic distribution
of the documents, i.e., labels and documents are semantically related to each other. In our context,
the prior on the topic distribution of a webpage is assumed to be a function of the topic distributions
of the queries that retrieve that page. HDLDA can be estimated on any primary or secondary dataset
that contains the text of a set of queries and their corresponding search results from a search engine.
In this paper, we collect the data using Google's Application Program Interface (API). The output from HDLDA includes the number of topics, their descriptions, the distribution of topics in each webpage and query, and the semantic mapping between topics in queries and topics in search results.
We further propose an approach for estimating a consumer’s content preferences based on
their queries, using the output of HDLDA. In particular, if consumers formulate queries with the
objective of reaching a target topic distribution among the results, their preferences may be simply
estimated as the expected topic distribution of the search results, given their search queries. Such
estimation may be made on the fly, making it useful for firms interested in customizing content
(e.g., display or search advertising) based on a consumer’s query.
To evaluate our approach, we run a lab experiment. Such an experiment offers two benefits
when it comes to validating the estimation of preferences based on the output of HDLDA. First, our
experimental framework allows us to exogenously manipulate and know users’ preferred content,
which we can compare to our estimates. In particular, we provided participants with search tasks
in various categories (e.g., finding a ski resort with specific features to recommend to someone),
and asked them to perform a series of searches to find a suitable URL for each task description.
Second, with secondary data the browsing behavior of users would likely be influenced by search
advertising and more generally by the customization of content based on the user’s history. Such
factors do not preclude the use of HDLDA in practice. However, when validating HDLDA, such
factors may provide alternative explanations for the users’ browsing behavior, and cast doubt on
any link that would be found between this behavior and some model-based predictions. Therefore,
for the purpose of our experiment, we built our own search engine Hoogle, which allows removing
the influence of the customization of search results and of search advertising on users’ browsing
behavior. Technically, Hoogle serves as a filter between Google and the user. It runs all queries
for all users through the Google API, with no user history being captured (unlike with regular
Google searches). Therefore, results are not customized based on the user’s history. In addition,
Hoogle allows us to show only organic search results. Again, in practice, marketers/advertisers
may use a search engine’s search API to collect data on queries that are relevant for their SEO/SEM
strategies and use these data as input to HDLDA, without running any experiment and without
using a customized search tool such as Hoogle.
We find that HDLDA fits the corpus collected from this experiment much better than traditional LDA. We also test our approach for estimating consumers' preferences based on queries.
Our approach assumes that consumers leverage the semantic mapping between queries and results
when they formulate search queries. That is, consumers anticipate the type of content that will be
retrieved by their queries, and they formulate queries such that the search results will match their
content preferences in expectation. We find that this assumption gives rise to significantly more
accurate estimation of preferences, compared to a benchmark that assumes that consumers ignore
the semantic mapping between queries and results.
Our research makes both methodological and managerial contributions. Methodologically, HDLDA has an innovative structure compared with current extensions of LDA in the marketing and topic modeling literatures. Managerially, HDLDA can help firms organize and understand the meaning of different search queries and webpages, and predict the content preferences of users based on their search queries. HDLDA is scalable across large numbers of queries and pages within any search domain. In general, by empowering marketers/advertisers to infer consumers' content preferences based on queries, our work can help them improve the fit between their content and consumer preferences. In the context of SEO, our work can help firms determine the types of queries that are most relevant to their content, and that should therefore be the focus of their SEO efforts. For example, firms can determine the topics reflected in their content, identify search queries that tend to indicate that a user is interested in these topics, and focus their efforts on improving their organic ranking on these queries. In the context of SEM, our research can help advertisers determine how well their content matches a given query or keyword, which informs how much they should be willing to bid on that query. Our framework also enables firms to perform dimension reduction on queries, by representing them in a topic space rather than treating them as separate and independent entities. Such dimension reduction may help analyze the click-through rates or other performance metrics of the queries on which the firm advertises, for example by regressing these metrics on the set of topics captured by the queries.
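As an illustration of this last point, such a regression might be sketched as follows. The topic weights, the click-through rates, and the use of ordinary least squares are all invented for illustration; they are not part of the paper's method.

```python
import numpy as np

# Hypothetical data: 5 advertised queries, K = 3 topics. Each row of
# topic_weights is a query's topic distribution, as HDLDA would estimate it.
topic_weights = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.1, 0.3],
    [0.2, 0.5, 0.3],
])
ctr = np.array([0.031, 0.012, 0.022, 0.028, 0.017])  # invented click-through rates

# Regress CTR on topic weights via least squares. No separate intercept is
# needed because each row of topic weights sums to one.
coefs, *_ = np.linalg.lstsq(topic_weights, ctr, rcond=None)

# coefs[k] is the estimated CTR associated with a query loading fully on topic k.
predicted = topic_weights @ coefs
```

In practice a firm would substitute its own performance metrics and the topic weights estimated by HDLDA.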
The rest of the paper is organized as follows. In Section 2, we review the related literature.
In Section 3, we introduce the topic model HDLDA. In Section 4, we describe our experimental
design, data collection, some descriptive statistics, and data processing. We present our empirical
results in Section 5. We conclude in Section 6.
2 Relevant Literature
2.1 Online Search Queries
Most research on understanding users' intent behind their search queries has come from the computer science and information systems literatures. This research has primarily focused on classifying consumers' search intent into discrete categories (Broder, 2002; Jansen et al., 2007; Sanasam et al., 2008; Shen et al., 2011). The first and most popular categorization was proposed by Broder (2002), who defined three very broad classes: informational, navigational, and transactional. Informational search involves looking for a specific fact or topic; navigational search seeks to locate a specific web site; and transactional search usually involves looking for information related to a particular product or service. Jansen et al. (2008) showed that about 80% of queries are informational, about 10% are navigational, and under 10% are transactional.
Such empirical study of search logs provides valuable insights into what people search for
and how they search for content. However, these types of analysis do not quantify users’ content
preferences. That is, they do not describe in detail the type of content the user is looking for. In
contrast, we propose a method that allows “reverse engineering” preferences based on queries.
This is managerially important, to help website owners or advertisers improve the fit between their
content and the users’ preferences.
2.2 Search Models
Consumer search behavior is often modeled within a utility maximization framework. Applications of this framework to online search have focused on discrete search behavior, such as which links or products users decide to view. This was done either using static models or dynamic models in which users search sequentially and stop searching when the marginal cost of search exceeds the marginal gains (Jeziorski and Segal, 2010; Kim et al., 2010; Dzyabura, 2013; Ghose et al., 2013; Shi and Trusov, 2013; Yang et al., 2015). However, text-based search behavior, such as typing a search query, has been largely ignored. Typing a search query is a first-order search behavior on most search platforms, and queries contain valuable information about users' preferences (Pirolli, 2007). Yet the field lacks tools to leverage query data.
Liu and Toubia (2015) lay some foundations for the development of such tools, on which our work builds. In particular, they show using stylized experiments that users form queries that leverage semantic relationships between queries and search results. That is, a query is not a direct representation of the user's content preferences, but rather a tool used by the user to retrieve content that matches their preferences. These authors give the example of a user typing the following query: "affordable sedan made in America." It is possible that the most important attributes for this consumer are in fact safety, comfort, and made in America, and that affordability is of lesser importance. This consumer might have decided to type the query "affordable sedan made in America" because they believe that cars made in America are generally safe and comfortable, but not necessarily affordable. In that case, the consumer anticipated that they would find relevant search results efficiently (i.e., with short queries) by only including "made in America" and "affordable" in their queries, but not "safe" or "comfortable," although these are important attributes. In other words, the consumer may have leveraged the semantic mapping between queries and results when formulating their query. Our work builds on these initial results by constructing a practical and scalable framework for capturing these semantic relationships and inferring content preferences from queries.
2.3 Natural Language Processing
There has been a stream of recent research in marketing that applies Natural Language Processing (NLP) to analyze online user-generated content (Archak et al., 2011; Lee and Bradlow, 2011; Netzer et al., 2012; Ghose et al., 2012). Our research builds upon the literature on topic modeling within NLP, in particular Latent Dirichlet Allocation (LDA) (Blei et al., 2003). LDA is an unsupervised Bayesian learning algorithm that extracts "topics" from text based on word co-occurrence. By examining a set of documents, LDA represents each topic by a probability distribution over words, and each document by a probability distribution over topics. Intuitively, one would expect particular words to appear in a document more or less frequently based on the topic(s) reflected in that document. Recently, researchers in marketing have started adopting LDA to extract topic-level information. Examples are Tirunillai and Tellis (2014), who apply LDA to identify dimensions of quality and valence expressed in online reviews, and Abhishek et al. (2014), who use LDA to measure the contextual ambiguity of a search keyword.
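As a minimal sketch of what LDA produces, the following fits a two-topic model to a toy four-document corpus with scikit-learn. The corpus, the topic count, and the library choice are illustrative assumptions, not the paper's setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus mixing two rough themes (ski resorts vs. laptops).
docs = [
    "family ski resort with beginner slopes and childcare",
    "ski resort lift tickets and fresh snow conditions",
    "lightweight laptop with long battery life for school",
    "affordable student laptop with a fast processor",
]

# Bag-of-words counts: documents x vocabulary.
counts = CountVectorizer().fit_transform(docs)

# Fit LDA with K = 2 topics; fit_transform returns each document's
# (normalized) probability distribution over the K topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# lda.components_ holds the topic-word weights, one row per topic.
```

The two outputs mirror the two distributions described above: `doc_topics` is the document-topic matrix, and `lda.components_` (once normalized) gives each topic's distribution over words.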
Our proposed topic model HDLDA has an innovative structure compared with other extensions of LDA. We highlight three extensions related to ours: the correlated topic model (Blei and Lafferty, 2007), the hierarchical topic model (Griffiths et al., 2004), and the relational topic model (Chang and Blei, 2009). The correlated topic model allows correlation between documents in the occurrence of topics. In contrast, HDLDA focuses on the correlation between different types of documents, e.g., webpages and queries. The hierarchical topic model aims to learn a hierarchy of topics, i.e., which topics are more general vs. specific. In contrast, HDLDA defines a hierarchy over documents, i.e., the topics in documents are related to the topics in labels. The relational topic model studies document networks (e.g., whether two research papers tend to be cited by the same authors), whereas HDLDA leverages the observed hierarchy between different types of documents to infer their topics and semantic relationships. Therefore, methodologically, we innovate in both the topic modeling and marketing literatures; managerially, we provide a tool that allows inferring users' content preferences based on their queries.
3 HDLDA
In this section we describe our proposed modeling framework, HDLDA. HDLDA is a model for bag-of-words data in which documents are multi-labeled by very small subsets of their own words. That is, in our case, webpages are labeled by the queries that retrieve them. We assume that there is one LDA process for the documents (e.g., webpages) and one LDA process for the labels (e.g., queries). The two processes share the same topic-word distributions, and they are hierarchical in the sense that the topic distribution of each document is related to the topic distribution of its label(s). HDLDA can be applied to any corpus that has such a hierarchically dual structure. We focus here on an application to search engines, and set the notation within this context.
3.1 The Model
Suppose there is a collection of $Q$ different search queries for a particular search domain, and these search queries can retrieve a collection of $P$ different webpages on a search engine. Let $l_{pq} \in \{0,1\}$ indicate whether webpage $p$ can be retrieved by query $q$, i.e., whether it can appear in the top search results for query $q$. Let $J$ denote the total number of different words in the vocabulary, where words are indexed by $j \in \{1,2,\ldots,J\}$. Let $w_{qj}$ denote the $j$th word in query $q$, and $\mathbf{w}_q$ denote the vector of $J_q$ words associated with that query, where $J_q$ is the number of words in the query. Similarly, let $w_{pi}$ denote the $i$th word in webpage $p$, and $\mathbf{w}_p$ denote the vector of $J_p$ words associated with that webpage, where $J_p$ is the number of words in the page. The set of relationships between the data and the model parameters is described by the graphical model in Figure 1. Note that we treat the labels $\{l_{pq}\}$ as exogenously given by the search engine, i.e., we do not model their generating process.
[Insert Figure 1 Here]
Topics. We let $K$ denote the number of different topics in the domain. Search queries and webpages in the collection are assumed to share the same set of topics, but each document exhibits these topics in different proportions. These topic proportions are reflected by the words present in the documents. As in LDA, we model each topic $k$ as a vector $\phi_k$ which follows a Dirichlet distribution over the $J$ words in the vocabulary:

$$\phi_k \sim \mathrm{Dirichlet}_J(\eta) \quad (1)$$

for $k = 1,2,\ldots,K$. The hyper-parameter $\eta$ controls the sparsity of the word distribution.
Queries. To model the observed $j$th word $w_{qj}$ in each query $q$, we need to model the query's topic distribution $\theta_q$ and the latent topic assignment $z_{qj} \in \{1,2,\ldots,K\}$ for that word. Following LDA, we assume:

$$\theta_q \sim \mathrm{Dirichlet}_K(\alpha), \quad z_{qj} \sim \mathrm{Category}(\theta_q), \quad w_{qj} \sim \mathrm{Category}(\phi_{z_{qj}}) \quad (2)$$

for $q = 1,2,\ldots,Q$ and $j = 1,2,\ldots,J_q$. The hyper-parameter $\alpha$ controls the sparsity of the topic distribution.
Webpages. Search engines like Google retrieve links based on how well their webpage content matches a given search query. Hence, even though both queries and webpages are treated as documents, webpages are semantically related to the set of queries that can retrieve them. HDLDA captures this semantic relationship by incorporating a hierarchical structure. Specifically, we model the prior on the topic distribution for webpage $p$, $\theta_p$, as a function of the topic distributions of the queries that label this webpage. The semantic mapping between queries (labels) and results is specified at the topic level, using a $K \times K$ matrix $R$. In this matrix, each element $r_{kk'}$ indicates the effect of topic $k$ in the labeling queries on topic $k'$ in the corresponding search results. As multiple queries may retrieve the same webpage, we use the average topic distribution among the queries that retrieve webpage $p$,¹ which is computed as $\bar\theta_{q(p)} = \frac{\sum_q \theta_q l_{pq}}{\sum_q l_{pq}}$. Following the Dirichlet-multinomial regression topic model introduced by Mimno and McCallum (2012), we assume that:

$$\theta_p \sim \mathrm{Dirichlet}_K\left(\exp(R_1^\top \bar\theta_{q(p)}), \ldots, \exp(R_K^\top \bar\theta_{q(p)})\right) \quad (3)$$

The exponential of the product between the $k$th column of $R$ and $\bar\theta_{q(p)}$ is proportional to the expected weight of topic $k$ in the search results. That is, the weight on each topic in each document is related to the weight of each topic in the labeling query(ies). Given $\theta_p$, the observed $j$th word in webpage $p$ is then modeled in a standard manner:

$$z_{pj} \sim \mathrm{Category}(\theta_p), \quad w_{pj} \sim \mathrm{Category}(\phi_{z_{pj}}) \quad (4)$$

for $p = 1,2,\ldots,P$ and $j = 1,2,\ldots,J_p$.
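To make the generative story concrete, one forward pass through Equations (1)-(4) can be sketched in numpy as follows. The topic count, vocabulary size, document lengths, and the matrix R below are arbitrary illustrative values, not estimates from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, J = 3, 50           # number of topics and vocabulary size (illustrative)
eta, alpha = 0.1, 0.5  # Dirichlet hyper-parameters (illustrative)

# (1) Topic-word distributions phi_k, shared by queries and webpages.
phi = rng.dirichlet(np.full(J, eta), size=K)  # K x J

# (2) A query: draw its topic distribution, then a topic and a word per position.
theta_q = rng.dirichlet(np.full(K, alpha))
query_words = [rng.choice(J, p=phi[rng.choice(K, p=theta_q)]) for _ in range(4)]

# (3) A webpage labeled by this query: its Dirichlet prior depends on the
# average topic distribution of its labeling queries through the K x K matrix R.
R = rng.normal(0.0, 0.5, size=(K, K))  # semantic mapping (illustrative values)
theta_bar = theta_q                    # only one labeling query in this sketch
theta_p = rng.dirichlet(np.exp(R.T @ theta_bar))

# (4) Webpage words are then drawn exactly as in standard LDA given theta_p.
page_words = [rng.choice(J, p=phi[rng.choice(K, p=theta_p)]) for _ in range(30)]
```

Note that `R.T @ theta_bar` stacks the inner products of each column of R with the average query topic distribution, so its k-th entry corresponds to the concentration parameter $\exp(R_k^\top \bar\theta_{q(p)})$ in Equation (3).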
3.2 Inference Algorithm
Given the content of all queries and webpages and the labeling of webpages by queries, our goal is to estimate $\Theta = \{\{\phi_k\}, \{z_p\}, \{z_q\}, \{\theta_p\}, \{\theta_q\}, R\}$. We use a combination of Markov Chain Monte Carlo (MCMC) and optimization, i.e., a stochastic EM sampling scheme (Diebolt and Ip, 1995; Nielsen, 2000; Mimno and McCallum, 2012), to draw samples from the posterior distributions. Specifically, the posterior distributions of $\{\phi_k\}, \{z_p\}, \{z_q\}, \{\theta_p\}$ are all conjugate, so we rely on Gibbs sampling. We use the Metropolis-Hastings algorithm to sample $\{\theta_q\}$, which are not conjugate. We estimate $R$ by maximizing its likelihood function given $\{\theta_q\}$ and $\{\theta_p\}$. That is, over the MCMC sampling, we alternate between sampling $\{\phi_k\}, \{z_p\}, \{z_q\}, \{\theta_p\}, \{\theta_q\}$ from the current conditionals and numerically optimizing $R$ given the other parameters. Though one could also estimate $R$ using an additional Metropolis-Hastings step, Mimno and McCallum (2012) have suggested that a stochastic EM sampling approach works very efficiently for topic models. The details of our inference algorithm are presented in Appendix A. In Appendix B, we also report a simulation study that explores the performance of this inference algorithm.

¹One may suggest using the summation of all the $\theta_q$'s; the issue is that this would artificially decrease the variance of the Dirichlet distribution.
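As a rough sketch of the optimization step for R, suppose the sampler's current draws of $\{\theta_p\}$ and the query averages $\{\bar\theta_{q(p)}\}$ are in hand; maximizing the Dirichlet likelihood of Equation (3) over R could then look as follows. The inputs are simulated, and a generic L-BFGS routine stands in for whatever optimizer the authors use; the actual algorithm is detailed in Appendix A.

```python
import math
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
K, P = 2, 20  # topics and webpages, kept small for illustration

# Simulated current sampler state: page topic distributions theta_p and the
# average topic distributions theta_bar of each page's labeling queries.
theta_p = rng.dirichlet(np.ones(K), size=P)    # P x K
theta_bar = rng.dirichlet(np.ones(K), size=P)  # P x K

def neg_log_lik(r_flat):
    """Negative Dirichlet log-likelihood of {theta_p} given R (Equation 3)."""
    R = r_flat.reshape(K, K)
    a = np.exp(theta_bar @ R)  # P x K concentration parameters exp(R_k' theta_bar)
    ll = 0.0
    for p in range(P):
        ll += math.lgamma(a[p].sum()) - sum(math.lgamma(x) for x in a[p])
        ll += ((a[p] - 1.0) * np.log(theta_p[p])).sum()
    return -ll

# The "M-step": numerically maximize the likelihood over R.
res = minimize(neg_log_lik, x0=np.zeros(K * K), method="L-BFGS-B")
R_hat = res.x.reshape(K, K)
```

In the full stochastic EM scheme, this optimization would alternate with the Gibbs and Metropolis-Hastings sampling steps described above.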
3.3 Estimating Content Preferences Based on Queries
HDLDA provides estimates of the topic distributions of all queries and webpages in a corpus,
and of how topics in queries are mapped onto topics in search results. Given this information, we
would like to estimate consumers’ content preferences based on their queries. In order to be more
relevant to practice and facilitate ad targeting and content customization, such inference should be
feasible on the fly based on a single query.
We define consumer $i$'s content preferences as the topic distribution that the consumer is searching for on webpages. This distribution is captured by a vector with $K$ topic weights, denoted $\beta_i$. Our approach relies on Liu and Toubia (2015)'s finding that consumers anticipate the types of results that will be retrieved by their query, and that they formulate queries that will retrieve results that match their content preferences in expectation. Under this condition, a user's preferences may be simply estimated as the expected topic distribution of the search results, given their query. That is, given a query with topic distribution $\theta_q$, the search engine will retrieve search results drawn from the following distribution: $\theta_p \sim \mathrm{Dirichlet}_K\left(\exp(R_1^\top \theta_q), \ldots, \exp(R_K^\top \theta_q)\right)$. The consumer's content preferences $\beta_i$ may be simply estimated as the mean of this distribution:

$$\beta_i = E(\theta_p \mid \theta_q) = \left( \frac{\exp(R_1^\top \theta_q)}{\sum_k \exp(R_k^\top \theta_q)}, \frac{\exp(R_2^\top \theta_q)}{\sum_k \exp(R_k^\top \theta_q)}, \ldots, \frac{\exp(R_K^\top \theta_q)}{\sum_k \exp(R_k^\top \theta_q)} \right) \quad (5)$$
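Given an estimate of R, Equation (5) reduces to normalizing $\exp(R_k^\top \theta_q)$ across topics, since the mean of a Dirichlet distribution is its concentration parameters divided by their sum. The values of R and $\theta_q$ below are invented for illustration.

```python
import numpy as np

# Illustrative estimates: a 3 x 3 semantic mapping R and a query's topic
# distribution theta_q (neither comes from the paper's data).
R = np.array([[ 0.8, -0.2,  0.1],
              [ 0.0,  1.1, -0.3],
              [-0.1,  0.2,  0.9]])
theta_q = np.array([0.6, 0.3, 0.1])

# Equation (5): the mean of Dirichlet_K(exp(R_1' theta_q), ..., exp(R_K' theta_q)),
# i.e., the expected topic distribution of the retrieved search results.
a = np.exp(R.T @ theta_q)
beta = a / a.sum()
```

Because this is a single matrix-vector product followed by a normalization, the estimate can indeed be computed on the fly for each incoming query.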
We note that this approach assumes that consumers are able to leverage the semantic mapping between search queries and search results when forming queries. This assumption is grounded in previous research (Liu and Toubia, 2015). Moreover, we do not need to assume that consumers consciously leverage semantic relationships in search, but rather that they behave as if they did. Nevertheless, in Section 5.4 we compare our approach to a benchmark that assumes that consumers do not take the semantic mapping between queries and results into account when forming queries. We compare our estimate of $\beta_i$ from Equation (5) to a naive estimate which assumes that consumers simply formulate queries whose topic distributions directly match their preferences, i.e., $\beta_i = \theta_q$.
4 Empirical Study
Researchers and practitioners may run HDLDA on any primary or secondary dataset that contains the text of a set of queries and their corresponding search results from a search engine. However, in this paper our objective is not only to show how HDLDA may be applied, but also to test the model's ability to infer consumers' content preferences based on their search queries. Accordingly, we use a dataset collected experimentally, because it offers two main advantages over secondary data. First, it enables us to manipulate and know the content preferences underlying users' behavior, and therefore to compare estimated preferences with true preferences. Second, with secondary data the browsing behavior of users would likely be influenced by search advertising and more generally by the customization of content based on the user's history. Such customization could bias our comparisons between various benchmarks, for example by providing alternative explanations for the superior performance of one benchmark over another. Our experimental approach allows removing the influence of customization and hence provides unbiased comparisons between benchmarks. In this section, we describe our unique dataset in detail.
4.1 Experimental Design
We conducted a lab experiment in which N = 197 participants performed a series of online search tasks using a custom-built search engine that we will introduce in Section 4.2. Participants' objective was to make purchase recommendations. Our search tasks were designed based on five product categories about which consumers commonly acquire information on the Internet before purchase: ski resorts, printers, cars, laptops, and cameras. These were relevant categories for our participants (mostly undergraduate and graduate students).² We manipulated content preferences exogenously by giving participants specific search tasks, i.e., instructions on what they should search for. We introduced some heterogeneity in content preferences in each category, by designing two task descriptions reflecting different preferences, as displayed in Table 1. For example, Task 1 asked participants to search for a family-friendly ski resort, while Task 2 asked them to search for an exclusive ski resort. We note that we designed these task descriptions before acquiring and analyzing any webpage content, i.e., our task descriptions were not designed to be aligned with or map onto the set of topics extracted from participants' actual search data using HDLDA. Hence, we should expect each task description to have positive weights on multiple topics.
[Insert Table 1 Here]
We used a between-subject design in which each participant was randomly assigned to one
version of the two tasks in each product category. The order of the categories was randomized for
each participant. Each participant was asked to submit one URL of their chosen webpage that they
believed best matched each search task description. To ensure that participants invested enough
effort, we designed this study to be incentive-aligned. Participants were told that all submitted
links would be evaluated by the researchers, based on relevance and usefulness. In addition to a
$7 participation fee, we gave a $100 cash bonus to the participant whose chosen webpage best
matched with the corresponding task description. Participants were informed of this incentive
before the experiment, and we notified the winner within two weeks of the experiment. We find
that this incentive worked well, as on average participants spent at least 20 minutes to finish all five
tasks.
4.2 Data Collection
The input of HDLDA is a set of queries and associated search results. For a given set of search queries, the associated organic search results may be obtained using one of the APIs offered by search engines such as Google. As mentioned above, in addition to applying HDLDA in various domains, our objective with the present experiment is also to validate HDLDA, by exploring its use for inferring a user's preferences based on their search queries. In order to ensure that our validation of HDLDA is not biased or influenced by unobserved factors such as the user's browsing history or the customization of content by the search engine, we built our own search engine called Hoogle, and used it to collect search data. Hoogle retrieves all the organic search results for each search query with no user history being captured, using the Google Custom Search Engine API. That is, for any search query, Hoogle retrieves a similar set of search results as Google, with the only differences being that search results are not personalized based on past search history and that there are no sponsored search results. A screenshot from the Hoogle interface is presented in Figure 2. Each result page shows ten links with their titles and snippets. The font, color, and size are the same as Google's.

²We ran the lab experiment in the winter on the east coast of the U.S., so ski resorts were relevant for our participants.
[Insert Figure 2 Here]
The search logs from Hoogle include the following information for each participant and for
each task: the query(ies) submitted by the participant, the search results seen by the participant on
each page, and the links clicked by the participants. Immediately after we finished collecting data
in the lab experiment, we also used Python scripts to automatically download all the content of the
webpages in the participants’ search results (i.e., the actual content of the webpages corresponding
to all the links in the pages of search results viewed by participants).
We note again that we developed Hoogle only in order to provide a clean comparison of users’
estimated preferences with their “true” preferences, but that Hoogle is not necessary in order to run
HDLDA. In practice, most firms have a set of keywords/queries that they think are most relevant
and valuable for their SEO/SEM strategies. Such firms can use the search API offered by Google
or other search engines to collect data and run HDLDA; they do not need to build their own search
engines or run any experiment to collect these data. In addition, HDLDA can certainly be applied
on any secondary dataset that contains a set of queries and their associated search results.
4.3 Descriptive Statistics
Table 2 reports descriptive statistics on the search data collected from this lab experiment for
each product category, including the number of unique queries, the number of different words
among all the queries, the variation across users’ queries, users’ query usage, the number of words
per query, and the average proportion of the words in a query that come from the task description.
First of all, there exists some heterogeneity across users’ queries. Such variation is measured by
the edit distance. The edit distance quantifies how dissimilar two queries are from each other by
counting the minimum possible weighted number of character operations (including insertions,
deletions, and substitutions), required to transform one query into the other (Martin and Jurafsky,
2000). For example, the edit distance between “best school laptop” and “school laptop” is five (five
characters need to be deleted: b, e, s, t, space). We compute the edit distance between all possible
pairs of queries from participants in the same task, and report the sample average and standard
deviation.
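As an illustrative sketch (not the code used in the study), the edit distance between two queries can be computed with standard dynamic programming:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a[i-1]
                          curr[j - 1] + 1,     # insert b[j-1]
                          prev[j - 1] + cost)  # substitute
        prev = curr
    return prev[len(b)]

print(edit_distance("best school laptop", "school laptop"))  # 5
```

The example above reproduces the "best school laptop" vs. "school laptop" distance of five from the text.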
In addition, we find that on average users tend to use few queries in a session and also few
words in a query, which is consistent with the existing empirical studies on online search queries
(Jansen et al., 2000; Spink et al., 2001; Kamvar and Baluja, 2006; Wu et al., 2014). These numbers
also vary across product categories. For example, users submitted slightly more queries when
searching for laptops. Lastly, the average proportion of words in users’ search queries that are
from the task description ranges from 0.40 to 0.75. Hence, users form queries that combine words
in the task description with other words.
[Insert Table 2 Here]
We now turn to descriptive statistics related to position effects. Previous research has shown
that position could affect behavioral outcomes such as consumer click-through, conversion rate,
and sales (Hotchkiss et al., 2005; Kihlstrom and Riordan, 1984; Varian, 2007). In our data, we find
that the majority of users limit their browsing to the links on the first page of results (i.e., the top 10
organic links retrieved by Google). Therefore, we treat every page of results as starting from the
first position. We plot the click-through rate (CTR) against the top ten positions in Figure 3. The
CTR at each position is calculated as the percentage of clicks at this position across all the clicks
from the ten positions, and hence the CTR sums to one across positions. Consistent with previous
research, we see a quasi-exponential decrease in CTR as a function of position (Narayanan and
Kalyanam, 2011; Abhishek et al., 2014).
[Insert Figure 3 Here]
Lastly, we study whether there exists some agreement amongst users on the best webpage for
each task. For each task and for each URL that was chosen at least once, we define the agreement
level as the number of users who picked that URL as their chosen link. Overall, the distribution
of the agreement level is long-tailed, i.e., the majority of users tend to choose different URLs.
We report the top 5 most chosen links and their corresponding agreement levels for each task in
Figure 4. We see that a few links show some level of consensus amongst users, and there exists
heterogeneity across tasks.
[Insert Figure 4 Here]
4.4 Data Processing
Given that most advertisers or firms do business in certain domains and design webpage content
and keyword lists within those domains, we run HDLDA separately for each category. Accordingly, we
combine all the search queries and result webpages from the two search tasks in each product
category as one corpus. Following the bag-of-words approach, each webpage or query can be
treated as an unordered set of words. We process the text in each corpus based on standard practice
in text mining. We remove any delimiting character, including hyphens, used to separate words;
we eliminate punctuation, non-English characters, and a standard list of English stop words; and
no stemming is performed.
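As a concrete illustration, the preprocessing steps above might look as follows (the delimiter set and the stop-word list shown here are illustrative placeholders, not the exact ones we used):

```python
import re

# Illustrative subset of a standard English stop-word list (an assumption,
# not the actual list used in the study).
STOP_WORDS = {"the", "a", "an", "and", "or", "for", "of", "in", "to", "is"}

def preprocess(text: str) -> list:
    text = text.lower()
    text = re.sub(r"[-_/]", " ", text)     # replace delimiting characters with spaces
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and non-English characters
    # tokenize and remove stop words; no stemming is performed
    return [w for w in text.split() if w not in STOP_WORDS]

print(preprocess("Best water-proof cameras for kids!"))
```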
We form the vocabulary for each corpus in two steps. We first select words that appear at least
n times in total (we varied n between 10 and 20 depending on the corpus, based on inspection; the selection of n was finalized before running HDLDA). This helps eliminate a fair amount
of meaningless words that cannot be removed in the preprocessing stage. Among the remaining
words, we then calculate the mean term frequency-inverse document frequency (tf-idf) for each word. The tf-idf is commonly used to select vocabulary in topic modeling, and is computed as tf-idf(w) = f_w × log(N / n_w), where f_w is the average number of times word w occurs in each document in the corpus, N is the total number of documents in the corpus, and n_w is the number of documents containing word w. We keep words whose mean tf-idf is above the median, which allows omitting words that have low frequency as well as those occurring in too many documents (Hornik and Grun, 2011).
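A minimal sketch of this two-step vocabulary selection, using a toy corpus and a small threshold n (the actual corpora used n between 10 and 20):

```python
import math
import statistics
from collections import Counter

def select_vocabulary(docs, n=2):
    """docs: list of token lists. Step 1: keep words appearing at least n
    times in total. Step 2: keep words whose mean tf-idf,
    f_w * log(N / n_w), exceeds the median across remaining words."""
    N = len(docs)
    total = Counter(w for d in docs for w in d)
    candidates = [w for w, c in total.items() if c >= n]
    n_w = {w: sum(w in d for d in docs) for w in candidates}
    tfidf = {w: (total[w] / N) * math.log(N / n_w[w]) for w in candidates}
    med = statistics.median(tfidf.values())
    return {w for w, s in tfidf.items() if s > med}

docs = [["laptop", "school", "best", "laptop"],
        ["laptop", "camera", "best"],
        ["camera", "waterproof", "kids", "best"]]
print(select_vocabulary(docs, n=2))  # {'laptop'}
```

In this toy corpus, "best" occurs in every document (zero lift) and "camera" falls at the median, so only "laptop" survives.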
The descriptive statistics of the resulting corpora are summarized in Table 3. The first row is
the number of words that are selected as the vocabulary for each corpus. The number of words
in each query or webpage is calculated based on the selected vocabulary, rather than the original
content. Hence, one may notice that the number of words per query here is smaller than in Table 2.
On average, each query contains about 4 words and each webpage contains 250-600 words. The
last part of the table is the observed labels between queries and webpages. Because in our data
most participants only viewed the first page of search results for each query, on average each query
retrieves about ten webpages.3 Webpages, however, can be retrieved by very different numbers of queries.
[Insert Table 3 Here]
5 Results
In this section, we first describe how we determine the number of topics before conducting
model inference. We then evaluate how well HDLDA fits the corpus and compare the model to
two benchmarks. Next, we present the results of the posterior estimates from HDLDA. Lastly, we
proceed to the individual-level estimation of content preferences as described in Section 3.3.
3In some cases, the total number of links could be smaller than ten for a query, if the query either cannot be understood by Google or has very few relevant results.
5.1 Determining the Number of Topics in HDLDA
One key decision in topic modeling is choosing the number of topics for a corpus if this pa-
rameter is not specified a priori (Blei and Lafferty, 2009). Depending on the goals and available
means, a researcher may apply a variety of performance metrics (Griffiths and Steyvers, 2004;
Wallach et al., 2009; Chang and Blei, 2009). In our case, if we remove the hierarchical structure
between queries and webpages (i.e., set R = 0), the prior on θp reduces to a uniform distribution (exp(R_k^T θq(p)) = 1 for all k, and Dirichlet(1) is the uniform distribution). If we further set α = 1 (which we
do in our analysis), then the prior distributions on queries and webpages become identical. In other
words, HDLDA is reduced to traditional LDA, where queries and webpages are treated equally as
documents, with a common prior on their topic distributions. We use the optimal number of topics
for LDA as the number of topics for HDLDA. This makes our comparisons of HDLDA to LDA
more conservative (i.e., the number of topics is optimized for LDA).
The literature on topic modeling suggests that estimating the probability of held-out documents
can provide a clear and interpretable metric for evaluating the performance of topic models, and
choosing the number of topics that generates the best language model (Wallach et al., 2009; Blei
and Lafferty, 2009). Therefore, we divide each corpus into training and hold-out testing sets. For
each number of topics K, we measure the performance of the learned model by computing the
perplexity on the hold-out testing set. The perplexity is monotonically decreasing in the likelihood
of the testing set, and is equivalent to the inverse of the geometric mean of the per-word likelihood
(Blei et al., 2003). A lower perplexity score indicates better generalization performance. For a
testing set of documents Dtest, let Nd denote the number of words in each document d ∈ Dtest. The perplexity of the testing set is computed as:

Perplexity(Dtest) = exp( − [ Σ_d Σ_{w∈d} log( Σ_k ϕ_kw θ_dk ) ] / Σ_d Nd )     (6)
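For concreteness, the perplexity in Equation (6) can be computed as in the following sketch (the ϕ and θ values here are toy illustrations, not estimates from our data):

```python
import math

def perplexity(docs, phi, theta):
    """docs: list of token lists; phi[k][w]: topic-word probabilities;
    theta[d][k]: topic weights of document d. Implements Equation (6):
    exp of minus the average per-word log-likelihood."""
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_lik += math.log(sum(phi[k][w] * theta[d][k]
                                    for k in range(len(phi))))
        n_words += len(doc)
    return math.exp(-log_lik / n_words)

phi = [{"ski": 0.7, "spa": 0.3}, {"ski": 0.2, "spa": 0.8}]  # toy topics
theta = [[0.5, 0.5]]
print(perplexity([["ski", "spa"]], phi, theta))
```

The result equals the inverse of the geometric mean of the per-word likelihoods, so lower values indicate better generalization.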
For each corpus, we held out 20% of the data and trained LDA on the remaining 80%. We test
for K ∈ {2,3, ...} until perplexity becomes worse. To guarantee the robustness of the results, we
repeat this procedure 10 times and select the number of topics based on the average perplexity score
across the 10 replications. We find that the optimal number of topics is K* = 2 for the ski category, K* = 3 for printer, K* = 3 for car, K* = 4 for laptop, and K* = 2 for camera. We estimate HDLDA
for each category with the optimal K. We set prior parameters α = 1 and η = 0.01 (Jagarlamudi
et al., 2012), and run 8,000 MCMC iterations using the first 4,000 as burn-in.
5.2 Model Fit
We compare HDLDA to two variations of LDA, based on the Deviance Information Criterion
(DIC) (Spiegelhalter et al., 2002). The first benchmark is the basic LDA in which the queries and
webpages are treated equally as documents and have the same fixed prior. Specifically, we assume
θq ∼ DirichletK(α), θp ∼ DirichletK(α), and set α = 1. Note that here the prior on the topic
distribution of each query, θq, is the same as in HDLDA; and the prior on the topic distribution of
each webpage, θp, is also the same as in HDLDA when R is reduced to a zero matrix. Therefore,
this basic LDA is nested within HDLDA. The inference algorithm for the basic LDA is simply a Gibbs sampler.
Our next LDA benchmark explores whether the benefit of HDLDA comes from having a flexi-
ble prior distribution on the topic distributions of webpages, rather than the semantic mapping be-
tween queries and webpages per se. Accordingly, in our second LDA benchmark we set the prior
on webpages to be an unknown vector which we estimate, while keeping the prior on the topic
distributions of queries fixed. That is, θq ∼ Dirichlet_K(α), where α = 1; and θp ∼ Dirichlet_K(α*), where α* is an unknown K-dimensional vector that is estimated. We label this benchmark “Flexible LDA.” This benchmark is also nested within HDLDA: it corresponds to assuming that all the topics in a query have the same effect on each topic in the results, i.e., R = (r_1 1_K, r_2 1_K, ..., r_K 1_K), where 1_K denotes a K-dimensional vector of ones and {r_k}_{k=1,...,K} are scalars. This reduces the mean of the Dirichlet distribution in Equation (3) to (exp(r_1), exp(r_2), ..., exp(r_K)) (because the vector θq(p) sums to one), which is exactly the unknown prior on θp in Flexible LDA. Similarly
to HDLDA, the inference algorithm for Flexible LDA is a stochastic EM algorithm in which we
alternate between sampling {ϕk},{zp},{zq},{θp},{θq} from the Gibbs sampler, and numerically
optimizing α∗ given the other parameters.
The DIC of all the three models for different product categories is presented in Table 4. We see
that although Flexible LDA performs better than the basic LDA, HDLDA achieves a much lower
DIC compared to both LDA benchmarks across all categories. This suggests that it is reasonable
to explicitly model the semantic mapping between search queries and search results, and that the
benefits from HDLDA do not come merely from its flexible prior structure.
[Insert Table 4 Here]
5.3 Estimates of Topics and Semantic Mapping
We now interpret the topics generated by HDLDA in each corpus. To ease interpretation, we
focus on the most relevant words in each topic. The relevance of word w to topic k is measured as
follows (Sievert and Shirley, 2014; Bischof and Airoldi, 2012):
r(w, k | λ) = λ log(ϕ_kw) + (1 − λ) log(ϕ_kw / p_w)     (7)

where ϕ_kw is the posterior estimate of the probability of seeing word w given topic k, and p_w is the empirical distribution of word w in the corpus; λ determines the weight given to the probability of word w under topic k relative to its lift ϕ_kw / p_w, both measured on the log scale. Setting λ = 1 results in the familiar ranking of words in decreasing order of their topic-specific probabilities, and setting λ = 0 ranks words solely based on lift. We set λ = 0.6, following the empirical studies conducted
by Sievert and Shirley (2014).
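Equation (7) is straightforward to compute; a sketch with illustrative probability values:

```python
import math

def relevance(phi_kw, p_w, lam=0.6):
    """Relevance of word w to topic k (Equation 7): a convex combination of
    the log topic-specific probability and the log lift, weighted by lam."""
    return lam * math.log(phi_kw) + (1 - lam) * math.log(phi_kw / p_w)

# A word that is frequent in the topic but rare overall has high lift,
# so it scores higher than an equally topic-probable but common word.
print(relevance(0.05, 0.01) > relevance(0.05, 0.05))  # True
```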
[Insert Figure 5 and Table 5 Here]
For ease of interpretation, we simulate the content of each topic using the exponential of the
above relevance measure. That is, we generate sets of words for each topic, where the probability
of occurrence of each word is proportional to the exponential of its relevance. We use word clouds
to visualize the simulated sets of words. As an example, in Figure 5 we report the word clouds
simulated for the two topics extracted from the camera category. Words with larger font size have
higher relevance. We can clearly see that the word cloud on the left-hand side for Topic 1 is more
relevant to “waterproof cameras for kids,” whereas the other word cloud for Topic 2 is more likely
to have words that are relevant to “professional digital cameras.” In addition, we report in Table
5 some of the most relevant words for each topic, along with examples of queries and webpages
with very high weights on each topic. While we only present five sample words per topic for space
reasons, as one can see from Figure 5, this is not enough to define or capture a topic completely.
Our observations suggest consumers in general use a variety of words to form queries, and these
words have various degrees of specificity. We also observe that the recovered topic distributions of
webpages in general seem to be consistent with their actual content.
We also report the average of the posterior estimates of the topic proportions of all queries and
webpages in each task. See Table 6. We see that the average θq is close to the average θp in each
category. However, this does not mean that topic distributions in search results are systematically
identical to topic distributions in the corresponding queries, i.e., it does not mean that the matrix
R is proportional to the identity matrix. In particular, the posterior estimates of R are reported in
Table 7. All parameter estimates are statistically significant, which supports our hypothesis that
topics in search results are related to topics in the corresponding search queries. Recall that each
element r_kk′ in the kth row and k′th column of matrix R can be interpreted as the effect of the weight on topic k in a query on the weight on topic k′ in the corresponding search results. We can clearly see that there exist asymmetries in the semantic relationships, i.e., r_kk′ ≠ r_k′k. This finding
is consistent with and extends the results of Liu and Toubia (2015), who found asymmetries in the
semantic relationships on Google, at the level of individual words.
[Insert Tables 6 and 7 Here]
5.4 Accuracy of Preference Estimation
We now consider the estimation of content preferences from queries. In particular, we explore
the benefits of assuming that consumers leverage the semantic mapping between queries and results
when formulating their queries. As described in Section 3.3, given a query with topic distribution
θq, the consumer’s preferences may be estimated as: βi = E(θp|θq), using Equation (5). That is,
preferences may be estimated as the expected value of the topic distribution of the results that will
be retrieved by the search engine, given the query. This estimation approach is consistent with
consumers formulating queries that will give rise to content that matches their content preferences
in expectation. A benchmark approach assumes that consumers ignore the semantic mapping
between queries and results, and estimates the preferences simply as: βi = θq.
5.4.1 Measures of “Truth”
To evaluate these two approaches, we compare their estimates of consumers’ content prefer-
ences with the “truth.” One of the appeals of our experimental design is that we can manipulate
the “true” content preferences, and use the task description given to the participant as a measure
of their true content preferences. Hence, our first measure of “truth” is the topic distribution of the
task description given to the participant. This measure offers the benefit of being exogenous, and
unaffected by the participant’s behavior or belief. However, one limitation of this measure is that
there might be variations in how participants interpret the task description. Hence, we complement
this first measure with a second measure: the topic distribution of the actual webpage chosen by the
participant.4 This measure has the benefit of reflecting the actual behavior of participants. However, one of its limitations is that it is influenced both by demand-side variables (i.e., how participants interpret the task description) and supply-side variables (i.e., the options presented to participants by the search engine). Together, these two imperfect measures of “truth” provide convergent evidence regarding the ability of HDLDA to estimate content preferences based on queries. Note that
the data used to train HDLDA contains neither the text of the task descriptions, nor the knowledge
of which webpage was selected by each participant. Therefore, both our truth measures may be viewed as external validations.

4Some participants did not follow the instructions correctly (e.g., some submitted random text without actually conducting the search on Hoogle, or found their chosen link on another search engine). We drop these observations from our analysis. The proportion of observations dropped is 2% for the ski category, 2% for printer, 10% for car, 12% for laptop, and 6% for camera. As a result, 195 participants submitted a valid chosen webpage for at least one task.
[Insert Table 8 Here]
We use the same procedure to estimate the topic distributions of the task descriptions and of
the chosen webpages. In particular, we use a traditional LDA process in which the topic-word
distribution ϕ is set to the estimate from HDLDA. That is, at each iteration we draw the latent
topic assignment of each word in the task description/chosen webpage, and then draw from the
posterior of the topic distribution given these latent topic assignments (assuming a flat prior on
the topic distribution). We estimate the topic distribution of the task description/chosen webpage
as the average over 3,000 iterations. The estimated topic distributions of all the task descriptions
are presented in Table 8. We see that each task description may have large weights on multiple
topics. However, when comparing the weights of the same topic across the two task descriptions,
the topic with the relatively larger weight is consistent with our expectation. For example, in the ski
resort category, Task 1 (respectively, Task 2) was designed to correspond to a family-friendly resort
(respectively, a luxury resort). Consistent with this, we find that Task 1 has a larger weight on Topic
1 compared to Task 2, and the opposite is true for Topic 2. Note however that the weight on Topic
1 is larger overall compared to the weights on Topic 2, reflecting the fact that family-friendliness
is a more common/popular theme in this category compared to luxury.
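The procedure just described, i.e., sampling latent topic assignments with the topic-word distribution ϕ held fixed and averaging posterior draws of the topic distribution under a flat prior, can be sketched as follows (the toy ϕ and document are illustrative, not our estimates):

```python
import random

def infer_theta(doc, phi, iters=3000, seed=0):
    """Estimate a document's topic distribution theta with phi fixed,
    via Gibbs sampling under a flat Dirichlet(1) prior on theta."""
    rng = random.Random(seed)
    K = len(phi)
    z = [rng.randrange(K) for _ in doc]  # latent topic assignment per word
    counts = [z.count(k) for k in range(K)]
    theta_sum = [0.0] * K
    for _ in range(iters):
        for i, w in enumerate(doc):
            counts[z[i]] -= 1
            # P(z_i = k | rest) is proportional to phi[k][w] * (counts[k] + 1)
            weights = [phi[k][w] * (counts[k] + 1.0) for k in range(K)]
            z[i] = rng.choices(range(K), weights=weights)[0]
            counts[z[i]] += 1
        for k in range(K):  # posterior mean of theta given the assignments
            theta_sum[k] += (counts[k] + 1.0) / (len(doc) + K)
    return [s / iters for s in theta_sum]

phi = [{"ski": 0.9, "spa": 0.1}, {"ski": 0.1, "spa": 0.9}]  # toy topics
print(infer_theta(["ski", "ski", "spa"], phi))
```

For this toy document, two of three words favor the first topic, so the estimated θ places more weight on it.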
5.4.2 Performance Metrics
We compare the “true” preferences to the estimated preferences using two performance metrics.
The first metric relies on the fact that both the truth and the estimate are vectors that capture topic
distributions. We measure prediction accuracy using cosine similarity, which is commonly used
in topic modeling to understand the similarity between documents. The cosine similarity between
two vectors a and b is computed as their normalized inner product (the cosine of the angle between them):

f(a, b) = (a · b) / (‖a‖ ‖b‖) = Σ_k a_k b_k / ( √(Σ_k a_k²) · √(Σ_k b_k²) )     (8)
Cosine similarity ranges from 0, indicating that the two vectors share no topics, to 1, indicating that they point in exactly the same direction.
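A direct implementation of Equation (8):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two topic-distribution vectors (Equation 8)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical distributions score 1; distributions with no shared topic score 0.
print(cosine_similarity([0.8, 0.2], [0.8, 0.2]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```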
Second, we calculate the perplexity score of the “true” description of each participant’s content
preferences (i.e., task description or chosen webpage), given the estimated content preferences.
Specifically, let Li denote the content (i.e., the set of words) of the “true” description of user i's preferences. According to Equation (6), the perplexity score of Li given estimated preferences βi is calculated as:

perplexity(Li | βi) = exp( − [ Σ_{w∈Li} log( Σ_k ϕ_kw β_ik ) ] / |Li| )     (9)

where |Li| denotes the length of the content Li, and ϕ is the topic-word distribution estimated by HDLDA. Note that a smaller value indicates better model performance.
5.4.3 Results
We compare the accuracy of the estimates of the content preferences based on Equation (5):
βi = E(θp|θq); to that of the benchmark that assumes that consumers ignore semantic mapping in
query formation: βi = θq. We compare the accuracy of these two estimates based on both truth
measures, using both cosine similarity and perplexity. We compute the performance of the esti-
mates based on each query in each category from each participant separately. We first compute
the average performance over the queries submitted by each participant in each category. Taking
the average of this performance across categories gives us an average performance for each participant. In Table 9, we report the average performance of the two approaches across participants. We compare them using paired two-sample t-tests. Both performance metrics suggest that the estimates of consumers' content preferences are more accurate when we assume that consumers leverage the semantic mapping between queries and results when they formulate queries, rather than assuming that queries are direct representations of content preferences. The difference is significant in all cases (p-value < 0.05).
[Insert Table 9 Here]
6 Conclusions
In this paper, we studied the link between consumer search queries and their content prefer-
ences. We developed an innovative topic model, HDLDA, that jointly estimates the topic distribu-
tions in queries and webpages as well as the semantic mapping between queries and their results.
In our domain of application, HDLDA captures the fact that a webpage is retrieved by certain queries, and that topics in queries are semantically related to topics in search results. The output from HDLDA includes the number of topics, the description of these topics, the topic distributions of all the queries and webpages, and how topics in queries map onto topics in search results.
More generally, HDLDA is a model for bag-of-words data that can be applied to any context
which documents are multi-labeled by very small subsets of their own words. HDLDA has a novel
structure within the broad literature of topic modeling, and our paper provides a methodological
contribution to the probabilistic topic modeling literature.
Using the output of HDLDA, it is possible to estimate a consumer’s content preferences on the
fly, based on each query. In particular, we can simply estimate a consumer’s preferences as the
expected topic distribution in the search results, given the topic distribution in the search query.
We find that prediction accuracy is improved when we assume that consumers leverage semantic
relationships between queries and results when forming their queries, compared to a benchmark
that assumes that consumers ignore these semantic relationships.
In order to validate our approach for estimating content preferences, we developed our own
search engine, which allows removing the variations in organic and sponsored search results due
to customization. This tool offers opportunities to shed new light on important research questions
in consumer search and search advertising, such as position effects and advertising effects. Indeed,
although we did not do this in the current paper, this tool allows varying the order of organic and
sponsored search results, moving sponsored results to the organic section, etc.
From a managerial perspective, HDLDA is scalable across thousands of queries and pages within a search domain, and can automatically extract, understand, and organize the meaning of queries and webpages without human intervention. It is therefore a potential tool for marketers and advertisers to build their own semantic graphs and conduct query-intent research. The input to HDLDA may come from various sources, including search engines' APIs and secondary
search data. The output of HDLDA can guide firms’ SEO and SEM strategies. In SEO, firms can
use our approach to quantify how well their content matches the content preferences captured by
various keywords and queries, and focus their efforts on promoting their organic search rank for
those queries with the best fit. In SEM, firms can use our approach to determine their willingness
to bid on various keywords and queries, based on how well their content matches with the underly-
ing content preferences. Our approach may also be used to reduce the dimensionality of queries in
order to systematically study the performance of large sets of queries with respect to metrics such
as click-through rate.
We close by highlighting additional areas for future research. First, the corpora we used in this
paper contain only a small proportion of the actual queries and webpages related to each category.
Future research may use larger datasets. Second, future research may combine information on
queries with information on click-stream behavior, to provide a more extensive set of observations
based on which content preferences may be estimated. Recent developments in Collaborative Topic
Modeling (Wang and Blei, 2011) might provide the foundation for models that would formulate
both query formation and clicking behavior as functions of content preferences.
References
Abhishek, V., J. Gong, and B. Li (2014). Examining the impact of contextual ambiguity on search
advertising keyword performance: A topic model approach. Available at SSRN 2404081.
Amerland, D. (2013). Google Semantic Search: Search Engine Optimization (SEO) Techniques
That Get Your Company More Traffic, Increase Brand Impact, and Amplify Your Online Presence
(1st ed.). Que Publishing Company.
Anandkumar, A., Y.-k. Liu, D. J. Hsu, D. P. Foster, and S. M. Kakade (2012). A spectral algorithm
for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pp. 917–
925.
Andrieu, C. and J. Thoms (2008). A tutorial on adaptive MCMC. Statistics and Computing 18(4),
343–373.
Archak, N., A. Ghose, and P. G. Ipeirotis (2011). Deriving the pricing power of product features
by mining consumer reviews. Management Science 57(8), 1485–1509.
Arora, S., R. Ge, and A. Moitra (2012). Learning topic models–going beyond svd. In Foundations
of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pp. 1–10. IEEE.
Bischof, J. and E. M. Airoldi (2012). Summarizing topical content with word frequency and
exclusivity. In Proceedings of the 29th International Conference on Machine Learning (ICML-
12), pp. 201–208.
Blei, D. M. and J. D. Lafferty (2007). A correlated topic model of science. The Annals of Applied
Statistics, 17–35.
Blei, D. M. and J. D. Lafferty (2009). Text Mining: Classification, Clustering, and Applications,
Chapter Topic Models, pp. 71–93. Taylor and Francis.
Blei, D. M., A. Y. Ng, and M. I. Jordan (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
Broder, A. (2002). A taxonomy of web search. In ACM Sigir forum, Volume 36, pp. 3–10. ACM.
Chang, J. and D. M. Blei (2009). Relational topic models for document networks. In International
Conference on Artificial Intelligence and Statistics, pp. 81–88.
Diebolt, J. and E. H. Ip (1995). A stochastic EM algorithm for approximating the maximum likelihood estimate. Technical report, Sandia National Labs., Livermore, CA (United States).
Dzyabura, D. (2013). The role of changing utility in product search. Available at SSRN 2202904.
Gelman, A. and X. L. Meng (1996). Model checking and model improvement. Markov chain
Monte Carlo in practice Springer US, 189–201.
Ghose, A., P. Ipeirotis, and B. Li (2013). Surviving social media overload: predicting consumer
footprints on product search engines. Technical report, Working paper.
Ghose, A., P. G. Ipeirotis, and B. Li (2012). Designing ranking systems for hotels on travel search
engines by mining user-generated and crowdsourced content. Marketing Science 31(3), 493–
520.
Griffiths, T. L., D. M. Blei, M. I. Jordan, and J. B. Tenenbaum (2004). Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems 16, 17.
Griffiths, T. L. and M. Steyvers (2004). Finding scientific topics. Proceedings of the National
Academy of Sciences 101(suppl 1), 5228–5235.
Hornik, K. and B. Grun (2011). topicmodels: An R package for fitting topic models. Journal of
Statistical Software 40(13), 1–30.
Hotchkiss, G., S. Alston, and G. Edwards (2005). Google eye tracking report: How searchers see
and click on Google search results. Enquiro Search Solutions.
Jagarlamudi, J., H. Daume III, and R. Udupa (2012). Incorporating lexical priors into topic mod-
els. In Proceedings of the 13th Conference of the European Chapter of the Association for
Computational Linguistics, pp. 204–213. Association for Computational Linguistics.
Jansen, B. J., D. Booth, and B. Smith (2009). Using the taxonomy of cognitive learning to model
online searching. Information Processing & Management 45(6), 643–663.
Jansen, B. J., D. L. Booth, and A. Spink (2007). Determining the user intent of web search engine
queries. In Proceedings of the 16th international conference on World Wide Web, pp. 1149–
1150. ACM.
Jansen, B. J., D. L. Booth, and A. Spink (2008). Determining the informational, navigational, and
transactional intent of web queries. Information Processing & Management 44(3), 1251–1266.
Jansen, B. J., A. Spink, A. Pfaff, and A. Goodrum (2000). Web query structure: Implications for ir
system design. In Proceedings of the 4th World Multiconference on Systemics, Cybernetics and
Informatics (SCI 2000), pp. 169–176.
Jeziorski, P. and I. Segal (2010). What makes them click: Empirical analysis of consumer de-
mand for search advertising. Technical report, Working papers//the Johns Hopkins University,
Department of Economics.
Kamvar, M. and S. Baluja (2006). A large scale study of wireless search behavior: Google mobile
search. In Proceedings of the SIGCHI conference on Human Factors in computing systems, pp.
701–709. ACM.
Kihlstrom, R. E. and M. H. Riordan (1984). Advertising as a signal. The Journal of Political
Economy, 427–450.
Kim, J. B., P. Albuquerque, and B. J. Bronnenberg (2010). Online demand under limited consumer
search. Marketing Science 29(6), 1001–1023.
Lee, T. Y. and E. T. Bradlow (2011). Automated marketing research using online customer reviews.
Journal of Marketing Research 48(5), 881–894.
Liu, J. and O. Toubia (2015). A framework for modeling how consumers form online search
queries. Available at SSRN: http://ssrn.com/abstract=2675331.
Manning, C. D., P. Raghavan, and H. Schütze (2008). Introduction to information retrieval. Cam-
bridge University Press.
Martin, J. H. and D. Jurafsky (2000). Speech and language processing. International Edition.
Mimno, D. and A. McCallum (2012). Topic models conditioned on arbitrary features with
dirichlet-multinomial regression. arXiv preprint arXiv:1206.3278.
Narayanan, S. and K. Kalyanam (2011). Measuring position effects in search advertising: A
regression discontinuity approach. Working paper.
Netzer, O., R. Feldman, J. Goldenberg, and M. Fresko (2012). Mine your own business: Market-
structure surveillance through text mining. Marketing Science 31(3), 521–543.
Nielsen, S. F. (2000). The stochastic EM algorithm: Estimation and asymptotic results. Bernoulli,
457–489.
Pirolli, P. L. (2007). Information foraging theory: Adaptive interaction with information. Oxford
University Press.
Sanasam, R. S., H. A. Murthy, and T. A. Gonsalves (2008). Determining user’s interest in real
time. In Proceedings of the 17th international conference on World Wide Web, pp. 1115–1116.
ACM.
Shen, Y., J. Yan, S. Yan, L. Ji, N. Liu, and Z. Chen (2011). Sparse hidden-dynamics conditional
random fields for user intent understanding. In Proceedings of the 20th international conference
on World wide web, pp. 7–16. ACM.
Shi, S. W. and M. Trusov (2013). The path to click: Are you on it? Working paper.
Sievert, C. and K. E. Shirley (2014). LDAvis: A method for visualizing and interpreting topics. In
Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces,
pp. 63–70. Association for Computational Linguistics.
Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. Van Der Linde (2002). Bayesian measures
of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 64(4), 583–639.
Spink, A., D. Wolfram, B. J. Jansen, and T. Saracevic (2001). Searching the web: The public and
their queries. Journal of the American Society for Information Science and Technology 52(3),
226–234.
Statista (2015). Digital marketing spending in the united states from 2014 to 2019. Avail-
able from http://www.statista.com/statistics/275230/us-interactive-marketing-spending-growth-
from-2011-to-2016-by-segment.
Tirunillai, S. and G. J. Tellis (2014). Mining marketing meaning from online chatter: Strategic
brand analysis of big data using latent dirichlet allocation. Journal of Marketing Research 51(4),
463–479.
Varian, H. R. (2007). Position auctions. International Journal of Industrial Organization 25(6),
1163–1178.
Wallach, H. M., I. Murray, R. Salakhutdinov, and D. Mimno (2009). Evaluation methods for topic
models. In Proceedings of the 26th Annual International Conference on Machine Learning, pp.
1105–1112. ACM.
Wang, C. and D. M. Blei (2011). Collaborative topic modeling for recommending scientific arti-
cles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discov-
ery and data mining, pp. 448–456. ACM.
Wang, P., M. W. Berry, and Y. Yang (2003). Mining longitudinal web queries: Trends and patterns.
Journal of the American Society for Information Science and Technology 54(8), 743–758.
Wu, W.-C., D. Kelly, and A. Sud (2014). Using information scent and need for cognition to under-
stand online search behavior. In Proceedings of the 37th international ACM SIGIR conference
on Research & development in information retrieval, pp. 557–566. ACM.
Yang, L., O. Toubia, and M. G. De Jong (2015). A bounded rationality model of information
search and choice in preference measurement. Journal of Marketing Research 52(2), 166–183.
Tables
Table 1: Search Tasks
Ski
1
A family is planning to take a vacation at a ski resort in Vermont. They are looking for a small resort that is suitable for family and children. The resort should have plenty of trails for beginner skiers, and also a ski school for kids. There should be lesson package deals including all-day lift tickets, ski rental, lessons, and so on. At the minimum, the resort should have an on-site ski rental shop, and offer some kind of discounts.
2
David, a banker, is planning to take a vacation at a ski resort in Vermont. He is looking for an exclusive resort that could offer a variety of terrains for intermediate and advanced skiers. Specifically, the mountain should be large and the slopes should be somewhat difficult. In addition, the resort should offer other activities such as snow tubing, and amenities such as a lounge and spa treatments.
Printer
3
Jessica, a college student, wants to purchase a budget printer for school work. The printer should be able to print, copy, and scan. Double-sided printing would also be attractive to her. Color printing is not required, as she will mostly print black and white. In addition, the printer should print fast with low noise, and the running cost should be low.
4
A family wants to purchase a small printer designed for home users who want lab-quality photos. They want to be able to connect the printer to Wi-Fi and smart phones, and it should also be able to print photos without the use of a computer. As the printer will be used very frequently, the family is willing to pay slightly more for a printer which is cheaper to run in the long term.
Car
5
A family wants to buy a new car that could provide more generous space for seating and cargo than their old compact sedan. The family's budget is $25,000. They want the new car to be safe, reliable, economical, and fuel-efficient. It should have a 4-cylinder engine and a high EPA mileage. The car should also handle snow and ice well.
6
Catherine wants to buy a small car, in order to save money on gas, insurance, and maintenance. She also wants to be able to park more easily in a big city. Her price range is between $10,000 and $14,000. She wants the car to be attractive, stylish, fun, and practical. Despite its small size, the car should still be safe and should offer a comfortable ride.
Laptop
7
Mike, a college student, wants to buy a new laptop. In addition to school work, the laptop should provide good performance for playing games. It should have at least an Intel Core i5 CPU, 8GB of RAM, a good graphics card, and a larger screen. Mike prefers Windows and Linux systems, because of their flexibility and wide options for programs. Mike's price range is between $800 and $1,000.
8
Mike, a consultant, wants to buy a laptop for work and traveling. Mike's budget is $600. He will mostly use the laptop for internet, Word, PowerPoint, and emails. He needs a laptop with enough speed, a good display, very long battery life, small size, and light weight. Also, the laptop should be durable enough to handle the pressure or dropping that may often happen during traveling.
Camera
9
A couple wants to buy a camera for their 9-year-old son. The camera should be simple to use, and easy to carry anywhere. The camera should have a large viewing screen or touchscreen. Its picture quality should be very good. More importantly, the camera should be sturdy and be able to withstand falls. It should also be waterproof, so that it can be used underwater. The couple prefers a camera in the price range of $100–$200.
10
Kevin, a beginner photographer, wants to buy his first digital SLR camera. Kevin is looking for a model in the mid-price range, or alternatively a package kit that includes the body, lenses, and tripod. The camera should be able to shoot both JPEG and RAW files. It should also come with a Wi-Fi adapter, which makes it easier to quickly share images through a laptop or phone.
Table 2: Descriptive Statistics of Users’ Search Queries
Category  Task  #users  #unique  #unique  Edit distance   #queries     #words        Overlap w.
                        queries  words                    per user     per query     description
Ski       1     99      173      111      31.41 (21.35)   2.13 (1.56)  5.36 (2.62)   .76 (.25)
          2     97      187      118      33.26 (26.52)   2.54 (1.94)  5.22 (3.14)   .75 (.28)
Printer   3     97      295      183      32.56 (14.73)   3.47 (2.93)  5.93 (3.14)   .57 (.32)
          4     97      204      162      30.82 (25.94)   2.51 (2.02)  4.67 (2.19)   .53 (.30)
Car       5     97      375      262      35.71 (20.61)   4.68 (2.93)  5.91 (3.31)   .56 (.36)
          6     99      368      244      24.53 (16.86)   2.51 (2.02)  4.15 (1.88)   .42 (.36)
Laptop    7     98      338      243      32.12 (15.36)   4.06 (4.17)  6.43 (3.29)   .58 (.36)
          8     95      292      250      30.54 (26.90)   3.67 (3.03)  4.34 (2.27)   .43 (.37)
Camera    9     98      243      214      31.13 (18.89)   2.95 (2.51)  4.71 (2.22)   .44 (.29)
          10    95      271      141      28.65 (16.61)   3.66 (2.78)  5.28 (2.81)   .68 (.34)
Note: we report the sample average with the standard deviation in parentheses. The edit distance is the minimum number ofoperations required to transform one query into the other. Larger values indicate lower similarity. We compute the edit distancebetween all pairs of queries (where queries are pooled together across users) for the same task. The last column reports theproportion of words in a user’s query that appear in the task description.
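The edit distance reported in Table 2 is the standard Levenshtein distance between query strings; a minimal sketch of the dynamic-programming computation:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of insertions, deletions,
    and substitutions needed to transform string a into string b."""
    m, n = len(a), len(b)
    # dp[i][j] holds the distance between prefixes a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]
```

Larger values indicate lower similarity between the two queries.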
Table 3: Descriptive Statistics of the Corpora
                       Ski           Printer       Car           Laptop        Camera
Vocabulary size        1,709         3,258         3,128         3,321         3,309
Query
  Unique queries       351           495           749           631           515
  Words per query      4.62 (2.31)   3.84 (2.01)   3.83 (2.14)   4.50 (2.87)   4.14 (2.03)
Page
  Unique webpages      1,167         1,984         4,253         3,042         2,238
  Words per webpage    258 (278)     366 (341)     559 (697)     736 (1862)    444 (525)
Label
  Query-page pairs     3,636         4,928         7,851         6,454         5,049
  Webpages per query   10.36 (1.89)  9.96 (2.00)   10.48 (5.61)  10.23 (2.82)  9.80 (2.96)
  Queries per webpage  3.12 (6.51)   2.48 (4.41)   1.85 (2.72)   2.12 (4.22)   2.26 (3.84)
Note: We report the average across all participants, with standard deviations in parentheses.
Table 4: DIC
Model Ski Printer Car Laptop Camera
HDLDA 3,007,214 7,840,806 26,512,580 23,812,942 11,265,738
Basic LDA 3,168,291 8,106,714 27,326,572 24,476,162 11,456,171
Flexible LDA 3,151,762 8,044,257 27,251,737 24,599,756 11,348,911
Note: Basic LDA assumes a fixed prior on θp. Flexible LDA assumes a flexible prior on θp,which is estimated by the model.
Table 5: Topics Extracted from HDLDA
Category  Topic  Examples of Relevant Words                      Example of Query                            Example of Webpage
Ski       1      lesson, kids, family, rental, beginner          "family ski vermont and ski school"         homepage of the ski resort Mad Driver Mountain
          2      hotel, spa, reviews, luxury, exclusive          "ski vermont spa"                           exclusive ski package from Killington on TripAdvisor
Printer   1      photo, small, ink, wifi, quality                "best wireless lab quality printers"        PIXMA iP4000R photo printer on Canon U.S.A.
          2      wireless, laser, staples, scan, black           "buy best printer"                          Brother wireless all-in-one printer on Amazon
          3      adobe, student, college, double, sided          "student printer"                           on-campus student printing service info
Car       1      miles, engine, play, transmission, sedan        "honda sedan $10000 to $14000"              a list of safest cars by Forbes
          2      Nissan, Volkswagen, safety, jeep, SUV           "safety ratings toyota rav 4"               a list of luxury crossover SUVs on USNews
          3      year, insurance, home, sales, gas               "buy a car online"                          a used Honda Accord on Cargurus.com
Laptop    1      amazon, shipping, accessories, price, buy       "buy windows laptop"                        best laptops of 2015 on CNET.com
          2      lenovo, play, thinkpad, review, ideapad         "lenovo ideapad y50"                        laptop reviews on lenovo.com
          3      battery, business, screen, performance, gaming  "durable laptop budget long battery life"   best business laptops on BusinessInsider.com
          4      intel, cpu, core, memory, ram                   "intel core i5 cpu"                         gaming laptop on CyberPowerPC.com
Camera    1      waterproof, kids, screen, touch, tablet         "kid friendly waterproof camera"            Polaroid waterproof digital camera on Kmart
          2      lens, dslr, ISO, compact, shot                  "nikon d3200 bundle"                        Canon EOS 60D review on Camera Labs
Table 6: Average θq and θp for Each Category
             Topic 1  Topic 2  Topic 3  Topic 4
Ski      θq  0.65     0.35
         θp  0.63     0.37
Printer  θq  0.45     0.28     0.26
         θp  0.41     0.24     0.34
Car      θq  0.31     0.37     0.32
         θp  0.27     0.45     0.28
Laptop   θq  0.26     0.18     0.30     0.25
         θp  0.33     0.09     0.45     0.12
Camera   θq  0.42     0.58
         θp  0.46     0.54
Table 7: Posterior Estimates of R
Ski      −0.58 (0.05)   −0.21 (0.04)
          2.20 (0.10)    0.20 (0.06)

Printer  −0.23 (0.08)   −0.16 (0.04)   −0.97 (0.07)
          1.18 (0.06)   −0.42 (0.07)   −0.25 (0.03)
         −0.47 (0.03)   −0.27 (0.05)    1.77 (0.04)

Car       0.76 (0.06)    2.24 (0.08)    0.20 (0.04)
          0.18 (0.06)   −0.22 (0.07)   −0.58 (0.03)
         −0.74 (0.03)   −0.38 (0.05)    0.32 (0.04)

Camera   −0.25 (0.05)    1.28 (0.09)
          0.30 (0.05)   −0.48 (0.04)

Laptop    1.07 (0.06)    0.31 (0.04)    3.38 (0.06)    0.49 (0.05)
          1.93 (0.11)   −0.63 (0.05)   −0.30 (0.11)   −0.16 (0.07)
         −0.60 (0.05)   −0.84 (0.03)   −0.45 (0.08)   −0.60 (0.05)
         −0.05 (0.07)   −0.69 (0.03)    0.39 (0.09)   −0.65 (0.05)

Note: for the R matrix in each category, the element in position (k, k′) is the effect of topic k
in search queries on topic k′ in the search results. The value in parentheses is the corresponding
standard deviation across posterior draws. All estimates are statistically significant based on the
95% posterior credible interval.
Table 8: Posterior Estimates of the Topic Distribution ofTask Descriptions
Category  Task  Topic 1  Topic 2  Topic 3  Topic 4
Ski       1     0.95     0.05
          2     0.77     0.23
Printer   3     0.53     0.16     0.30
          4     0.75     0.07     0.18
Car       5     0.13     0.32     0.55
          6     0.19     0.19     0.62
Laptop    7     0.11     0.04     0.60     0.25
          8     0.07     0.04     0.86     0.03
Camera    9     0.77     0.23
          10    0.26     0.74
Table 9: Estimating Preferences from Queries
Task Description Chosen Webpage
Estimation Method Cosine Similarity Perplexity Cosine Similarity Perplexity
Leverage semantic mapping: βi = E(θp|θq) 0.844 316.2 0.812 352.5
Don’t leverage semantic mapping: βi = θq 0.817 358.2 0.772 402.1
Note: all comparisons are statistically significant (p-value < 0.05). We evaluate the estimated preferences based on both their
cosine similarity and their perplexity with respect to the "true" preferences. The truth is measured either by the topic distribution
of the search task description, or by the topic distribution of the webpage chosen by the participant. Larger cosine similarity
indicates better performance, while smaller perplexity indicates better performance.
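One plausible implementation of the two evaluation metrics, assuming the estimate and the benchmark are both K-dimensional topic distributions (the paper's exact perplexity computation may differ, e.g., it could be taken over held-out words):

```python
import numpy as np

def cosine_similarity(beta, truth):
    """Cosine similarity between an estimated and a benchmark topic vector."""
    return float(beta @ truth / (np.linalg.norm(beta) * np.linalg.norm(truth)))

def perplexity(beta, truth):
    """Perplexity of the benchmark under the estimate: the exponential of
    the cross-entropy -sum_k truth_k * log(beta_k)."""
    return float(np.exp(-(truth * np.log(beta)).sum()))
```

A perfectly matched pair of distributions gives cosine similarity 1 and perplexity equal to the exponential of the benchmark's entropy.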
Figures
Figure 1: The Graphical Model of HDLDA
Figure 2: The Interface of Hoogle
Figure 3: CTR as a Function of Position
Note: CTR is normalized to sum to 1 across positions.
Figure 4: Agreement Level of the Top 5 Most Chosen Links for Each Task
Note: for each URL, its agreement level is defined as the number of participants who picked that URL as their chosen link.
Figure 5: Word Cloud of Two Topics in Camera
Parts (a) and (b) show the simulated content for Topic 1 and Topic 2 in the camera category. The size of each word is proportional to the exponential of its relevance.
Appendix A: Inference Algorithm for HDLDA
This appendix describes the inference algorithm for HDLDA. Across the collection of $Q$ queries, let $n^k_{q,j}$ be the number of word tokens in the $q$th query that are the $j$th word in the vocabulary and are assigned to the $k$th topic; mathematically, $n^k_{q,j} = \sum_{i=1}^{J_q} I(z_{qi} = k \wedge w_{qi} = j)$. Hence, $n^k_{q,j}$ is indexed along three dimensions: query, word, and topic. Summing these counts across words within a query gives $N^k_q$, the number of word tokens assigned to topic $k$ in query $q$. Summing the counts across queries gives $N^k_j$, the number of word tokens across all queries that are the $j$th word in the vocabulary and are assigned to the $k$th topic. Across the collection of $P$ webpages, we define $m^k_{p,j}$ analogously as the number of word tokens in the $p$th webpage that are the $j$th word in the vocabulary and are assigned to the $k$th topic, i.e., $m^k_{p,j} = \sum_{i=1}^{J_p} I(z_{pi} = k \wedge w_{pi} = j)$. We similarly define the sums across words and across webpages, denoted $M^k_p$ and $M^k_j$ respectively.
Gibbs Sampler for Assignments $z_p$ and $z_q$. Given the topic distribution $\theta_q$ and the word distributions $\{\varphi_k\}$, the posterior distribution of each $z_{qj}$ is:

$$\Pr(z_{qj} = k \mid w_{qj}, \theta_q, \{\varphi_k\}) = \frac{\Pr(w_{qj} \mid z_{qj} = k, \varphi_k)\Pr(z_{qj} = k \mid \theta_q)}{\sum_{i=1}^{K} \Pr(w_{qj} \mid z_{qj} = i, \varphi_i)\Pr(z_{qj} = i \mid \theta_q)} = \frac{\varphi_{k,w_{qj}}\,\theta_{qk}}{\sum_{i=1}^{K} \varphi_{i,w_{qj}}\,\theta_{qi}} \quad (10)$$

Similarly, the posterior distribution of the assignment $z_{pi}$ depends on the data $w_{pi}$ and the latent distributions $\{\varphi_k\}$ and $\theta_p$:

$$\Pr(z_{pi} = k \mid w_{pi}, \theta_p, \{\varphi_k\}) = \frac{\varphi_{k,w_{pi}}\,\theta_{pk}}{\sum_{j=1}^{K} \varphi_{j,w_{pi}}\,\theta_{pj}} \quad (11)$$
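These conditionals can be sampled directly; a minimal numpy sketch (not the authors' implementation; the function and variable names are illustrative):

```python
import numpy as np

def sample_assignments(word_ids, theta, phi, rng):
    """Draw topic assignments for one document (a query or a webpage).

    word_ids : array of vocabulary indices, one per token
    theta    : (K,) topic distribution of the document
    phi      : (K, J) topic-word distributions
    Returns one sampled topic index per token.
    """
    # Unnormalized posterior for each token w: phi[k, w] * theta[k]
    probs = phi[:, word_ids] * theta[:, None]        # (K, n_tokens)
    probs /= probs.sum(axis=0, keepdims=True)        # normalize over topics
    # Sample each token's topic from its categorical posterior
    return np.array([rng.choice(len(theta), p=probs[:, i])
                     for i in range(len(word_ids))])
```

The same function serves both queries and webpages, since Equations (10) and (11) have the same form.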
Gibbs Sampler for $\varphi$ and $\theta_p$. The posterior of each topic $\varphi_k$ in the collection remains a Dirichlet, and depends only on the latent assignments and the data, including both queries and webpages:

$$\Pr(\varphi_k \mid \{z_p\}, \{z_q\}, \{w_q\}, \{w_p\}, \eta) = \text{Dir}_J\big(\eta + N^k_1 + M^k_1, \ldots, \eta + N^k_J + M^k_J\big) \quad (12)$$

The posterior distribution of the topic distribution $\theta_p$ for each webpage $p$ depends only on its latent assignments and its prior. This distribution is also available in closed form, conditional on the semantic relationship matrix $R$ and $\{\theta_q\}$:

$$\Pr(\theta_p \mid z_p, R, \{\theta_q\}, l_p) = \text{Dir}_K\big(\exp(R_1^T \theta_{q(p)}) + M^1_p, \ldots, \exp(R_K^T \theta_{q(p)}) + M^K_p\big) \quad (13)$$
Metropolis-Hastings Algorithm for $\theta_q$. The posterior distribution of the topic distribution $\theta_q$ for each query $q$ is non-conjugate. Indeed, it depends not only on the latent assignments $z_q$, but also on the topic distributions of the webpages that can be retrieved by query $q$. We use Metropolis-within-Gibbs, and apply the iteration sequentially to each $q$:

$$\Pr(\theta_q \mid z_q, \{\theta_p\}, \{\theta_{-q}\}, R, \{l_{pq}\}, \alpha) \propto \Pr(\{\theta_p\} \mid \theta_q, \theta_{-q}, R, \{l_{pq}\})\,\Pr(z_q \mid \theta_q)\,\Pr(\theta_q \mid \alpha)$$
$$\propto \prod_{p:\, l_{pq}=1} \text{Dir}_K\left(\theta_p \,\middle|\, \left\{\exp\left(R_k^T \frac{\sum_{q' \in Q} \theta_{q'} l_{pq'}}{\sum_{q' \in Q} l_{pq'}}\right)\right\}_{k=1}^{K}\right) \times \text{Dir}_K\big(\theta_q \mid \alpha + N^1_q, \ldots, \alpha + N^K_q\big) \quad (14)$$

Here, we apply an adaptive proposal distribution (with vanishing adaptation) within a random-walk Metropolis-Hastings algorithm to sample each $\theta_q$ (Andrieu and Thoms, 2008). The proposal distribution is Dirichlet, $\theta_q^{(t)} \sim \text{Dir}(\sigma_{q,t}\,\theta_q^{(t-1)})$. We adaptively choose $\sigma_{q,t}$ for each $\theta_q$ to attain a target acceptance rate, while preserving the convergence of the Markov chain, via the Robbins-Monro algorithm:

$$\sigma_{q,t} = \sigma_{q,t-1}\exp\big((a^* - a_{q,t-1})/t^{\delta}\big)$$

where $a_{q,t-1}$ is the acceptance rate for $\theta_q$ at iteration $t-1$; $a^*$ is the target acceptance rate, usually set to 0.23 for large problems (Gelman and Meng, 1996); and $\delta \in (0,1]$ controls the decay rate of the adaptation. Intuitively, the proposal precision increases if the current acceptance rate is below the target rate. Note that because the proposal distribution is asymmetric, the acceptance ratio for $\theta_q$ at each iteration $t$ is:

$$r_{qt} = \min\left\{\frac{L(\theta_q^{(t)})\, f\big(\theta_q^{(t-1)} \mid \sigma_{q,t}\,\theta_q^{(t)}\big)}{L(\theta_q^{(t-1)})\, f\big(\theta_q^{(t)} \mid \sigma_{q,t}\,\theta_q^{(t-1)}\big)},\, 1\right\}$$

where $f(x \mid y)$ denotes the density of the Dirichlet distribution $\text{Dir}(y)$ at $x$. Based on our empirical study, we find that for a small corpus, even a random-walk Metropolis-Hastings algorithm without adaptation converges quite well when sampling $\theta_q$.
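One adaptive Metropolis-within-Gibbs update of this kind can be sketched as follows (a minimal illustration with a generic log-posterior, not the authors' code; the function name and defaults are assumptions):

```python
import math

import numpy as np
from scipy.stats import dirichlet

def adaptive_mh_step(theta, log_post, sigma, accept_rate, t, rng,
                     target=0.23, delta=0.6):
    """One random-walk MH step with a Dirichlet proposal Dir(sigma * theta).

    theta       : current point on the simplex (current theta_q)
    log_post    : callable returning the log posterior density at a point
    sigma       : current proposal precision sigma_{q,t-1}
    accept_rate : acceptance rate so far (a_{q,t-1}), used for adaptation
    Returns (new_theta, new_sigma, accepted).
    """
    prop = rng.dirichlet(sigma * theta)
    # The Dirichlet random walk is asymmetric, so the acceptance ratio
    # includes both the forward and the reverse proposal densities.
    log_r = (log_post(prop) + dirichlet.logpdf(theta, sigma * prop)
             - log_post(theta) - dirichlet.logpdf(prop, sigma * theta))
    accepted = math.log(rng.random()) < log_r
    new_theta = prop if accepted else theta
    # Robbins-Monro update: raise the precision (smaller steps) when the
    # acceptance rate is below the target, with vanishing adaptation.
    new_sigma = sigma * math.exp((target - accept_rate) / t ** delta)
    return new_theta, new_sigma, accepted
```

In the full sampler, `log_post` would evaluate the right-hand side of Equation (14) for the query being updated.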
Maximizing R in the Dirichlet-Multinomial Regression Model. The parameter $R$ controls the relationship between $\theta_q$ and $\theta_p$, which is captured by the Dirichlet-multinomial regression model in Equation (3). We estimate $R$ by optimizing the full log-likelihood of the model (Mimno and McCallum, 2012):

$$l(R) = \sum_{p=1}^{P}\left(\log\Gamma\Big(\sum_k \exp(R_k^T \theta_{q(p)})\Big) - \sum_k\Big[\log\Gamma\big(\exp(R_k^T \theta_{q(p)})\big) - \big(\exp(R_k^T \theta_{q(p)}) - 1\big)\log(\theta_{pk})\Big]\right)$$

The derivative of this log-likelihood with respect to $r_{tk}$ is:

$$\frac{\partial l(R)}{\partial r_{tk}} = \sum_{p=1}^{P} \theta_{q(p),t}\,\exp(R_k^T \theta_{q(p)})\left(\Psi\Big(\sum_j \exp(R_j^T \theta_{q(p)})\Big) - \Psi\big(\exp(R_k^T \theta_{q(p)})\big) + \log(\theta_{pk})\right)$$

where $\Psi(\cdot)$ denotes the digamma function, defined as the logarithmic derivative of the gamma function, $\Psi(x) = \frac{d}{dx}\log\Gamma(x)$. The optimization problem can become difficult when $K$ is large. Our implementation is based on standard BFGS optimization, as this method has been shown to be fast, robust, and reliable in practice.
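This M-step can be implemented with scipy's BFGS, supplying the log-likelihood and its gradient; a sketch (not the authors' code; the indexing follows the Table 7 convention that element $(t, k)$ of $R$ maps query topic $t$ to page topic $k$):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, psi

def neg_log_lik(r_flat, theta_qbar, theta_p, K):
    """Negative log-likelihood l(R) of the Dirichlet-multinomial regression.

    theta_qbar : (P, K) average query-topic distribution for each page
    theta_p    : (P, K) page-topic distributions
    """
    R = r_flat.reshape(K, K)
    A = np.exp(theta_qbar @ R)                    # (P, K) Dirichlet parameters
    ll = (gammaln(A.sum(axis=1))                  # log Gamma(sum_k a_pk)
          - gammaln(A).sum(axis=1)                # - sum_k log Gamma(a_pk)
          + ((A - 1.0) * np.log(theta_p)).sum(axis=1)).sum()
    return -ll

def neg_grad(r_flat, theta_qbar, theta_p, K):
    """Gradient of the negative log-likelihood, matching the derivative above."""
    R = r_flat.reshape(K, K)
    A = np.exp(theta_qbar @ R)
    inner = psi(A.sum(axis=1, keepdims=True)) - psi(A) + np.log(theta_p)
    return -(theta_qbar.T @ (A * inner)).ravel()

def estimate_R(theta_qbar, theta_p, K):
    """Maximize l(R) with BFGS, starting from R = 0."""
    res = minimize(neg_log_lik, np.zeros(K * K), jac=neg_grad,
                   args=(theta_qbar, theta_p, K), method="BFGS")
    return res.x.reshape(K, K)
```

Supplying the analytic gradient (`jac`) avoids finite-difference evaluations inside BFGS, which matters when $K$ is large.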
Appendix B: Simulation Study
This appendix presents a synthetic-data analysis of HDLDA. We first describe the data generation process of the model and the parameterization of our simulation study. We then summarize the estimation results based on the inference algorithm presented in Appendix A.
B1. Data Generation
For a given set of parameters $\{K, Q, P, J, \{J_q\}_q, \{J_p\}_p, \{l_{pq}\}_{p,q}, \alpha, \eta, R\}$, the following procedure describes the data generative process for HDLDA:

1. For each topic $k = 1, 2, \ldots, K$, draw a distribution over words: $\varphi_k \mid \eta \sim \text{Dirichlet}_J(\eta)$

2. For each query $q = 1, 2, \ldots, Q$:

   (a) Draw the topic distribution $\theta_q \mid \alpha \sim \text{Dirichlet}_K(\alpha)$

   (b) For $j = 1, 2, \ldots, J_q$:

       i. Draw the topic assignment $z_{qj} \mid \theta_q \sim \text{Category}(\theta_q)$

       ii. Draw the word $w_{qj} \mid (z_{qj}, \{\varphi_k\}) \sim \text{Category}(\varphi_{z_{qj}})$

3. For each webpage $p = 1, 2, \ldots, P$:

   (a) Compute the average topic distribution among its labeling queries: $\theta_{q(p)} = \frac{\sum_q \theta_q l_{pq}}{\sum_q l_{pq}}$

   (b) Draw the topic distribution $\theta_p \mid (R, \theta_{q(p)}) \sim \text{Dirichlet}_K\big(\exp(R_1^T \theta_{q(p)}), \ldots, \exp(R_K^T \theta_{q(p)})\big)$

   (c) For $i = 1, 2, \ldots, J_p$:

       i. Draw the topic assignment $z_{pi} \mid \theta_p \sim \text{Category}(\theta_p)$

       ii. Draw the word $w_{pi} \mid (z_{pi}, \{\varphi_k\}) \sim \text{Category}(\varphi_{z_{pi}})$
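The generative process above can be sketched in a few lines of numpy (a minimal illustration with small default lengths; argument names are assumptions, not the authors' code):

```python
import numpy as np

def generate_corpus(K, Q, P, J, L, R, alpha, eta, rng,
                    query_len=(2, 20), page_len=(300, 600)):
    """Simulate an HDLDA corpus.

    L : (P, Q) binary retrieval matrix; L[p, q] = 1 if query q retrieves page p.
    """
    phi = rng.dirichlet(np.full(J, eta), size=K)           # topic-word dists
    theta_q = rng.dirichlet(np.full(K, alpha), size=Q)     # query-topic dists
    queries = []
    for q in range(Q):
        Jq = rng.integers(query_len[0], query_len[1] + 1)
        z = rng.choice(K, size=Jq, p=theta_q[q])           # per-token topics
        queries.append(np.array([rng.choice(J, p=phi[k]) for k in z]))
    theta_p, pages = np.empty((P, K)), []
    for p in range(P):
        qbar = L[p] @ theta_q / L[p].sum()                 # avg query topics
        theta_p[p] = rng.dirichlet(np.exp(R.T @ qbar))     # step 3(b)
        Jp = rng.integers(page_len[0], page_len[1] + 1)
        z = rng.choice(K, size=Jp, p=theta_p[p])
        pages.append(np.array([rng.choice(J, p=phi[k]) for k in z]))
    return phi, theta_q, queries, theta_p, pages
```

Queries are drawn before webpages because each webpage's prior depends on the average topic distribution of its labeling queries.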
We simulate a dataset with a structure similar to the real search data collected in the experimental study described in Section 4. Specifically, we set $K = 3$, $Q = 800$, $P = 4{,}000$, and $J = 2{,}000$. For $q = 1, 2, \ldots, Q$, we draw the integer $J_q$ uniformly at random from $[2, 20]$; for $p = 1, 2, \ldots, P$, we draw the integer $J_p$ uniformly at random from $[300, 600]$. We draw $\{l_{pq}\}$ so that each query retrieves ten pages, and each page is retrieved by at least one query (the mean is 2.00, with a standard deviation of 1.02). For the semantic mapping matrix $R$, we set all diagonal elements to 0.8 and all off-diagonal elements to 0.4. We set $\alpha = 1$ and $\eta = 0.01$. All other parameters are generated according to the process described above.

We calibrate HDLDA on this simulated dataset using the inference algorithm described in Appendix A. The model parameters include about 1.79 million latent word assignments $z$, 6,000 parameters in $\{\varphi_k\}$, 12,000 parameters in $\{\theta_p\}$, 2,400 parameters in $\{\theta_q\}$, and 9 parameters in $R$. We run 10,000 MCMC iterations and use the first 5,000 as burn-in.
B2. Simulation Results
Before presenting the estimation results, we provide some background on the identification of topic models (especially LDA) in general. This background is important in forming reasonable expectations of what, and how accurately, HDLDA can recover. Though topic modeling has proved successful for the automatic comprehension and classification of data, only recent work has attempted to give provable guarantees for the problem of learning the model parameters (Arora et al., 2012; Anandkumar et al., 2012). The problem of recovering nonnegative matrices $\varphi$ (topics) and $\theta$ (topic distributions) with small inner dimension $K$ (number of topics) is NP-hard (Arora et al., 2012). As a solution, recent work has relied on very strong assumptions about the corpus, e.g., restricting each document to a single topic, or assuming each topic has words that appear only in that topic. At best, $\varphi$ can be recovered only up to permutations (Anandkumar et al., 2012). In addition, according to Arora et al. (2012), it is impossible to learn the topic distribution matrix $\theta$ to within arbitrary accuracy, and this holds even if $\varphi$ and the distribution from which $\theta$ is generated were known.
Given this background, one should not empirically expect all parameters of a topic model to be recovered. As an example, we test the well-known Gibbs sampler on a basic LDA, using multiple simulated corpora of similar size to the one described above. We find that about 90% of the topic parameters $\varphi$ are covered by the 95% credible interval (CI), and about 80% of the topic proportions $\theta$ are covered by the 95% CI. Given the complexity of the inference algorithm for HDLDA, one should not expect its recovery to exceed that of the Gibbs sampler for a basic LDA.

As measures of recovery performance, we report the proportion of parameters covered by the posterior 95% CI, and the mean squared error (MSE) between the true and estimated parameters. Table A gives the details for $\{\varphi_k\}$, $\{\theta_p\}$, and $\{\theta_q\}$ respectively. Recovery is very good for both $\{\varphi_k\}$ and $\{\theta_p\}$, and decent for $\{\theta_q\}$. Finally, we report the posterior estimates and the 95% CIs for all parameters in $R$ in Table B. Note that because the MLE for the Dirichlet-multinomial regression has significant bias when the sample size is not large enough,¹ we would not expect the estimates of $R$ from the stochastic EM algorithm to outperform the MLE of $R$ computed from the true data (Nielsen, 2000). That is, the best the inference algorithm can achieve in estimating $R$ is to recover its MLE, denoted $R_{MLE}$, which is estimated using the true $\{\theta_p\}$ and $\{\theta_q\}$ and is also reported in Table B. For the true $R$, 6 of the 9 parameters are covered by the 95% CI; this number increases to 8 when recovering $R_{MLE}$.
Table A: Simulation Results for {ϕk, θp, θq}
Parameters Coverage MSE
{ϕk} 90.12% 1.06e-09
{θp} 89.77% 0.0006
{θq} 74.67% 0.0563
¹For example, when we increase the number of webpages from 4,000 to 40,000, the MLE of $R$ for the Dirichlet-multinomial regression model using the true $\{\theta_p\}$ and $\{\theta_q\}$ has much smaller bias. Specifically, the MSE decreases from 0.0111 to 0.0006.
Table B: Simulation Results for R
Parameters R RMLE Posterior Estimate 95% CI
r11 0.8 0.841 0.622 [0.547, 0.723]
r21 0.4 0.337 0.410 [0.307, 0.487]
r31 0.4 0.369 0.442 [0.349, 0.537]
r12 0.4 0.439 0.403 [0.339, 0.463]
r22 0.8 0.810 0.780 [0.685, 0.839]
r32 0.4 0.338 0.351 [0.293, 0.426]
r13 0.4 0.413 0.365 [0.262, 0.389]
r23 0.4 0.429 0.583 [0.510, 0.672]
r33 0.8 0.722 0.698 [0.584, 0.798]