A Semantic Approach for Estimating Consumer Content Preferences from Online Search Queries
Jia Liu and Olivier Toubia ∗
September 13, 2016
Job Market Paper
Abstract
We develop an innovative topic model, Hierarchically Dual Latent Dirichlet Allocation (HDLDA), which identifies not only the topics in search queries and webpages, but also how the topics in search queries relate to the topics in the corresponding top search results. Using the output from HDLDA, a consumer's content preferences may be estimated on the fly based on their search queries. We validate our proposed approach across different product categories using an experiment in which we observe participants' content preferences, the queries they formulate, and their browsing behavior. Our results suggest that our approach can help firms extract and understand the preferences revealed by consumers through their search queries, which in turn may be used to optimize the production and promotion of online content.
Keywords: search engine optimization, search engine marketing, search queries, preferences, topic modeling, semantic mapping, Latent Dirichlet Allocation.
∗Jia Liu is a Ph.D. Candidate in Marketing, Columbia Business School, Columbia University. Email: [email protected]. Olivier Toubia is Glaubinger Professor of Business, Graduate School of Business, Columbia University. Email: [email protected]. This paper is based on the second chapter of Jia Liu's doctoral dissertation. This paper has benefited from the comments of Asim Ansari, David Blei, Kinshuk Jerath, and the participants at the Columbia Marketing and Decision, Risk and Operations seminars, and the 2015 Marketing Science Conference in Baltimore, MD. A previous version of this paper was titled: "A Semantic Approach for Estimating Consumer Information Needs from Online Search Queries."
1 Introduction
“The Future of Marketing Is Semantic.”
— Daniel Newman, Forbes, August 2014
Over the past decade, search engines like Google have become one of the primary tools consumers use when searching for products, services, or information. This trend has given rise to two major industries: Search Engine Optimization (SEO) and Search Engine Marketing (SEM). SEO refers to the process of tailoring a website to optimize its organic ranking for a given set of keywords or queries, in order to improve traffic and lead generation (Amerland, 2013). SEM usually refers to paid advertising on search engines. Search marketing spending in the U.S. alone is expected to reach $31.62 billion in 2015 (Statista, 2015).
Success in both of these industries hinges on firms' ability to infer consumers' content preferences, given their search queries. For example, a firm engaging in SEO should focus its promotion efforts on queries that reflect preferences that are best aligned with its content. Similarly, firms engaging in SEM should be willing to bid more on keywords and queries that reflect preferences that are well aligned with their content and offerings. In addition, search engines themselves are constantly trying to improve how well their search results satisfy their users' content preferences.
Despite the importance of being able to infer consumers’ content preferences from their queries,
very little research has been done in this area. Some research (which will be reviewed in Section
2.1) has developed taxonomies of search queries and search intent. However, this research does
not enable firms to infer content preferences in a quantified, nuanced and detailed manner. Another
stream of research in marketing (which will be reviewed in Section 2.2) has quantified consumer
preferences from their search behavior. However, that research has primarily focused on consumer
search behavior that manifests itself via discrete choices (e.g., purchasing, clicking). Text-based
search behavior (e.g., typing a search query), despite being a major way in which consumers search
today, has not received as much attention in that literature.
Inferring content preferences from search queries presents several challenges. The first challenge is the curse of dimensionality, due to the nature of textual data. Indeed, the number of possible keywords or queries available to consumers is very large. A second challenge is that
search terms tend to be ambiguous, i.e., consumers might use the same term in different ways.
Third, there exist some potentially complex semantic relationships between the content in a search
query and the content in the corresponding search results. Previous research (also reviewed in
Section 2.2) has shown that consumers have the ability to leverage these semantic relationships
when formulating queries. That is, consumers do not necessarily formulate queries that exactly
and directly reflect their content preferences, but rather they formulate queries that are more likely
to retrieve the type of content they are looking for. A fourth challenge is the sparsity of query data.
Each search query tends to contain very few words, and the average number of queries per user
session is usually less than five (Wang et al., 2003; Kamvar and Baluja, 2006; Jansen et al., 2009).
In this paper, we propose a framework that addresses the above challenges. We describe content as a set of topics, following the literature in information retrieval (Manning et al., 2008). We capture a consumer's preferences with a probability distribution over these topics that reflects the content the consumer is searching for. This allows us to address the curse of dimensionality and the ambiguity of keywords by identifying a small number of topics in queries and webpages, defined as probabilistic combinations of words. We address the issue that search queries and search results are semantically related by explicitly building a hierarchical structure into our topic model that quantifies the semantic mapping between search queries and search results. We alleviate the sparsity of queries both by combining information from multiple queries and by linking queries to results.
More specifically, we develop an innovative probabilistic topic model, Hierarchically Dual Latent Dirichlet Allocation (HDLDA). HDLDA is built upon Latent Dirichlet Allocation (LDA) (Blei et al., 2003), an unsupervised Bayesian learning algorithm that extracts "topics" from text based on word co-occurrence. HDLDA is a model for bag-of-words data in which documents (in our context, webpages) are multi-labeled by very small subsets of their own words (in our context, a search query "labels" a webpage if it can retrieve it). The model is dual because the documents and the labels share the same topic-word distributions (in our context, the same set of topics may describe
search queries and webpages). It is hierarchical because the labels influence the topic distribution
of the documents, i.e., labels and documents are semantically related to each other. In our context,
the prior on the topic distribution of a webpage is assumed to be a function of the topic distributions
of the queries that retrieve that page. HDLDA can be estimated on any primary or secondary dataset
that contains the text of a set of queries and their corresponding search results from a search engine.
In this paper, we collect the data using Google's Application Program Interface (API). The output from HDLDA includes the number of topics, their descriptions, the distribution of topics in each webpage and query, and the semantic mapping between topics in queries and topics in search results.
We further propose an approach for estimating a consumer’s content preferences based on
their queries, using the output of HDLDA. In particular, if consumers formulate queries with the
objective of reaching a target topic distribution among the results, their preferences may be simply
estimated as the expected topic distribution of the search results, given their search queries. Such
estimation may be made on the fly, making it useful for firms interested in customizing content
(e.g., display or search advertising) based on a consumer’s query.
To evaluate our approach, we run a lab experiment. Such an experiment offers two benefits
when it comes to validating the estimation of preferences based on the output of HDLDA. First, our
experimental framework allows us to exogenously manipulate and know users’ preferred content,
which we can compare to our estimates. In particular, we provided participants with search tasks
in various categories (e.g., finding a ski resort with specific features to recommend to someone),
and asked them to perform a series of searches to find a suitable URL for each task description.
Second, with secondary data the browsing behavior of users would likely be influenced by search
advertising and more generally by the customization of content based on the user’s history. Such
factors do not preclude the use of HDLDA in practice. However, when validating HDLDA, such
factors may provide alternative explanations for the users’ browsing behavior, and cast doubt on
any link that would be found between this behavior and some model-based predictions. Therefore,
for the purpose of our experiment, we built our own search engine Hoogle, which allows removing
the influence of the customization of search results and of search advertising on users’ browsing
behavior. Technically, Hoogle serves as a filter between Google and the user. It runs all queries
for all users through the Google API, with no user history being captured (unlike with regular
Google searches). Therefore, results are not customized based on the user’s history. In addition,
Hoogle allows us to show only organic search results. Again, in practice, marketers/advertisers
may use a search engine’s search API to collect data on queries that are relevant for their SEO/SEM
strategies and use these data as input to HDLDA, without running any experiment and without
using a customized search tool such as Hoogle.
We find that HDLDA fits the corpus collected from this experiment much better than traditional LDA. We also test our approach for estimating consumers' preferences based on queries.
Our approach assumes that consumers leverage the semantic mapping between queries and results
when they formulate search queries. That is, consumers anticipate the type of content that will be
retrieved by their queries, and they formulate queries such that the search results will match their
content preferences in expectation. We find that this assumption gives rise to significantly more
accurate estimation of preferences, compared to a benchmark that assumes that consumers ignore
the semantic mapping between queries and results.
Our research makes both methodological and managerial contributions. Methodologically, HDLDA has an innovative structure compared with current extensions of LDA in the marketing and topic modeling literatures. Managerially, HDLDA can help firms organize and understand the meaning of different search queries and webpages, and predict the content preferences of users based on their search queries. HDLDA is scalable across large numbers of queries and pages within any search domain. In general, by empowering marketers/advertisers to infer consumers' content preferences based on queries, our work can help them improve the fit between their content and consumer preferences. In the context of SEO, our work can help firms determine the types of queries that are most relevant to their content, and that should therefore be the focus of their SEO efforts. For example, firms can determine the topics reflected in their content, identify search queries that tend to indicate that a user is interested in these topics, and focus their efforts on improving their organic ranking on these queries. In the context of SEM, our research can help advertisers determine how well their content matches a given query or keyword, which informs how much they should be willing to bid on that query. Our framework also enables firms to perform dimension reduction on queries, by representing them in a topic space rather than treating them as separate and independent entities. Such dimension reduction may help analyze the click-through rates or other performance metrics of the queries on which the firm advertises, for example by regressing these metrics on the set of topics captured by the queries.
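As an illustration of this last point, such a regression might be sketched as follows. The topic weights, the click-through rates, and the use of ordinary least squares are all invented for illustration; they are not part of the paper's method.

```python
import numpy as np

# Hypothetical data: 5 advertised queries, K = 3 topics. Each row of
# topic_weights is a query's topic distribution, as HDLDA would estimate it.
topic_weights = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.1, 0.3],
    [0.2, 0.5, 0.3],
])
ctr = np.array([0.031, 0.012, 0.022, 0.028, 0.017])  # invented click-through rates

# Regress CTR on topic weights via least squares. No separate intercept is
# needed because each row of topic weights sums to one.
coefs, *_ = np.linalg.lstsq(topic_weights, ctr, rcond=None)

# coefs[k] is the estimated CTR associated with a query loading fully on topic k.
predicted = topic_weights @ coefs
```

In practice a firm would substitute its own performance metrics and the topic weights estimated by HDLDA.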
The rest of the paper is organized as follows. In Section 2, we review the related literature.
In Section 3, we introduce the topic model HDLDA. In Section 4, we describe our experimental
design, data collection, some descriptive statistics, and data processing. We present our empirical
results in Section 5. We conclude in Section 6.
2 Relevant Literature
2.1 Online Search Queries
Most research on understanding users' intent behind their search queries has come from the computer science and information systems literatures. This research has primarily focused on classifying consumers' search intent into discrete categories (Broder, 2002; Jansen et al., 2007; Sanasam et al., 2008; Shen et al., 2011). The first and most popular categorization was proposed by Broder (2002), who defined three very broad classes: informational, navigational, and transactional. Informational search involves looking for a specific fact or topic; navigational search seeks to locate a specific web site; and transactional search usually involves looking for information related to a particular product or service. Jansen et al. (2008) showed that about 80% of queries are informational, about 10% are navigational, and under 10% are transactional.
Such empirical study of search logs provides valuable insights into what people search for
and how they search for content. However, these types of analysis do not quantify users’ content
preferences. That is, they do not describe in detail the type of content the user is looking for. In
contrast, we propose a method that allows “reverse engineering” preferences based on queries.
This is managerially important, to help website owners or advertisers improve the fit between their
content and the users’ preferences.
2.2 Search Models
Consumer search behavior is often modeled within a utility maximization framework. Applications of this framework to online search have focused on discrete search behavior, such as which links or products users decide to view. This was done either using static models or dynamic models in which users search sequentially and stop searching when the marginal cost of search exceeds the marginal gains (Jeziorski and Segal, 2010; Kim et al., 2010; Dzyabura, 2013; Ghose et al., 2013; Shi and Trusov, 2013; Yang et al., 2015). However, text-based search behavior, such as typing a search query, has been largely ignored. Typing a search query is a first-order search behavior on most search platforms, and queries contain valuable information about users' preferences (Pirolli, 2007). Yet the field lacks tools to leverage query data.
Liu and Toubia (2015) lay some foundations for the development of such tools, on which our work builds. In particular, they show using stylized experiments that users form queries that leverage semantic relationships between queries and search results. That is, a query is not a direct representation of the user's content preferences, but rather a tool used by the user to retrieve content that matches their preferences. These authors give the example of a user typing the following query: "affordable sedan made in America." It is possible that the most important attributes for this consumer are in fact safety, comfort, and made in America, and that affordability is of lesser importance. This consumer might have decided to type the query "affordable sedan made in America" because they believe that cars made in America are generally safe and comfortable, but not necessarily affordable. In that case, the consumer anticipated that they would find relevant search results efficiently (i.e., with short queries) by only including "made in America" and "affordable" in their queries, but not "safe" or "comfortable," although these are important attributes. In other words, the consumer may have leveraged the semantic mapping between queries and results when formulating their query. Our work builds on these initial results by constructing a practical and scalable framework for capturing these semantic relationships and inferring content preferences from queries.
2.3 Natural Language Processing
There has been a stream of recent research in marketing that applies Natural Language Processing (NLP) to analyze online user-generated content (Archak et al., 2011; Lee and Bradlow, 2011; Netzer et al., 2012; Ghose et al., 2012). Our research builds upon the literature on topic modeling within NLP, in particular Latent Dirichlet Allocation (LDA) (Blei et al., 2003). LDA is an unsupervised Bayesian learning algorithm that extracts "topics" from text based on word co-occurrence. By examining a set of documents, LDA represents each topic by a probability distribution over words, and each document by a probability distribution over topics. Intuitively, one would expect particular words to appear in a document more or less frequently based on the topic(s) reflected in that document. Recently, researchers in marketing have started adopting LDA to extract topic-level information. Examples are Tirunillai and Tellis (2014), who apply LDA to identify dimensions of quality and valence expressed in online reviews, and Abhishek et al. (2014), who use LDA to measure the contextual ambiguity of a search keyword.
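As a minimal sketch of what LDA produces, the following fits a two-topic model to a toy four-document corpus with scikit-learn. The corpus, the topic count, and the library choice are illustrative assumptions, not the paper's setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus mixing two rough themes (ski resorts vs. laptops).
docs = [
    "family ski resort with beginner slopes and childcare",
    "ski resort lift tickets and fresh snow conditions",
    "lightweight laptop with long battery life for school",
    "affordable student laptop with a fast processor",
]

# Bag-of-words counts: documents x vocabulary.
counts = CountVectorizer().fit_transform(docs)

# Fit LDA with K = 2 topics; fit_transform returns each document's
# (normalized) probability distribution over the K topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# lda.components_ holds the topic-word weights, one row per topic.
```

The two outputs mirror the two distributions described above: `doc_topics` is the document-topic matrix, and `lda.components_` (once normalized) gives each topic's distribution over words.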
Our proposed topic model HDLDA has an innovative structure compared with other extensions of LDA. We highlight three extensions related to ours: the correlated topic model (Blei and Lafferty, 2007), the hierarchical topic model (Griffiths et al., 2004), and the relational topic model (Chang and Blei, 2009). The correlated topic model allows correlation between documents in the occurrence of topics. In contrast, HDLDA focuses on the correlation between different types of documents, e.g., webpages and queries. The hierarchical topic model aims to learn a hierarchy of topics, i.e., which topics are more general vs. specific. In contrast, HDLDA defines a hierarchy over documents, i.e., the topics in documents are related to the topics in labels. The relational topic model studies document networks (e.g., whether two research papers tend to be cited by the same authors), whereas HDLDA leverages the observed hierarchy between different types of documents to infer their topics and semantic relationships. Therefore, methodologically, we innovate in both the topic modeling and marketing literatures; managerially, we provide a tool that allows inferring users' content preferences based on their queries.
3 HDLDA
In this section we describe our proposed modeling framework, HDLDA. HDLDA is a model for bag-of-words data in which documents are multi-labeled by very small subsets of their own words. That is, in our case, webpages are labeled by the queries that retrieve them. We assume that there is one LDA process for the documents (e.g., webpages) and one LDA process for the labels (e.g., queries). The two processes share the same topic-word distributions, and they are hierarchical in the sense that the topic distribution of each document is related to the topic distribution of its label(s). HDLDA can be applied to any corpus that has such a hierarchically dual structure. We focus here on an application to search engines, and set the notation within this context.
3.1 The Model
Suppose there is a collection of $Q$ different search queries for a particular search domain, and these search queries can retrieve a collection of $P$ different webpages on a search engine. Let $l_{pq} \in \{0,1\}$ indicate whether webpage $p$ can be retrieved by query $q$, i.e., whether it can appear in the top search results for query $q$. Let $J$ denote the total number of different words in the vocabulary, where words are indexed by $j \in \{1,2,\ldots,J\}$. Let $w_{qj}$ denote the $j$th word in query $q$, and $\mathbf{w}_q$ denote the vector of $J_q$ words associated with that query, where $J_q$ is the number of words in the query. Similarly, let $w_{pi}$ denote the $i$th word in webpage $p$, and $\mathbf{w}_p$ denote the vector of $J_p$ words associated with that webpage, where $J_p$ is the number of words in the page. The set of relationships between the data and the model parameters is described by the graphical model in Figure 1. Note that we treat the labels $\{l_{pq}\}$ as exogenously given by the search engine, i.e., we do not model their generating process.
[Insert Figure 1 Here]
Topics. We let $K$ denote the number of different topics in the domain. Search queries and webpages in the collection are assumed to share the same set of topics, but each document exhibits these topics in different proportions. These topic proportions are reflected by the words present in the documents. As in LDA, we model each topic $k$ as a vector $\phi_k$ which follows a Dirichlet distribution over the $J$ words in the vocabulary:

$$\phi_k \sim \mathrm{Dirichlet}_J(\eta) \quad (1)$$

for $k = 1,2,\ldots,K$. The hyper-parameter $\eta$ controls the sparsity of the word distribution.
Queries. To model the observed $j$th word $w_{qj}$ in each query $q$, we need to model the query's topic distribution $\theta_q$ and the latent topic assignment $z_{qj} \in \{1,2,\ldots,K\}$ for that word. Following LDA, we assume:

$$\theta_q \sim \mathrm{Dirichlet}_K(\alpha), \quad z_{qj} \sim \mathrm{Category}(\theta_q), \quad w_{qj} \sim \mathrm{Category}(\phi_{z_{qj}}) \quad (2)$$

for $q = 1,2,\ldots,Q$ and $j = 1,2,\ldots,J_q$. The hyper-parameter $\alpha$ controls the sparsity of the topic distribution.
Webpages. Search engines like Google retrieve links based on how well their webpage content matches a given search query. Hence, even though both queries and webpages are treated as documents, webpages are semantically related to the set of queries that can retrieve them. HDLDA captures this semantic relationship by incorporating a hierarchical structure. Specifically, we model the prior on the topic distribution for webpage $p$, $\theta_p$, as a function of the topic distributions of the queries that label this webpage. The semantic mapping between queries (labels) and results is specified at the topic level, using a $K \times K$ matrix $R$. In this matrix, each element $r_{kk'}$ indicates the effect of topic $k$ in the labeling queries on topic $k'$ in the corresponding search results. As multiple queries may retrieve the same webpage, we use the average topic distribution among the queries that retrieve webpage $p$,¹ which is computed as $\bar\theta_{q(p)} = \frac{\sum_q \theta_q l_{pq}}{\sum_q l_{pq}}$. Following the Dirichlet-multinomial regression topic model introduced by Mimno and McCallum (2012), we assume that:

$$\theta_p \sim \mathrm{Dirichlet}_K\left(\exp(R_1^\top \bar\theta_{q(p)}), \ldots, \exp(R_K^\top \bar\theta_{q(p)})\right) \quad (3)$$

The exponential of the product between the $k$th column of $R$ and $\bar\theta_{q(p)}$ is proportional to the expected weight of topic $k$ in the search results. That is, the weight on each topic in each document is related to the weight of each topic in the labeling query(ies). Given $\theta_p$, the observed $j$th word in webpage $p$ is then modeled in a standard manner:

$$z_{pj} \sim \mathrm{Category}(\theta_p), \quad w_{pj} \sim \mathrm{Category}(\phi_{z_{pj}}) \quad (4)$$

for $p = 1,2,\ldots,P$ and $j = 1,2,\ldots,J_p$.
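To make the generative story concrete, one forward pass through Equations (1)-(4) can be sketched in numpy as follows. The topic count, vocabulary size, document lengths, and the matrix R below are arbitrary illustrative values, not estimates from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, J = 3, 50           # number of topics and vocabulary size (illustrative)
eta, alpha = 0.1, 0.5  # Dirichlet hyper-parameters (illustrative)

# (1) Topic-word distributions phi_k, shared by queries and webpages.
phi = rng.dirichlet(np.full(J, eta), size=K)  # K x J

# (2) A query: draw its topic distribution, then a topic and a word per position.
theta_q = rng.dirichlet(np.full(K, alpha))
query_words = [rng.choice(J, p=phi[rng.choice(K, p=theta_q)]) for _ in range(4)]

# (3) A webpage labeled by this query: its Dirichlet prior depends on the
# average topic distribution of its labeling queries through the K x K matrix R.
R = rng.normal(0.0, 0.5, size=(K, K))  # semantic mapping (illustrative values)
theta_bar = theta_q                    # only one labeling query in this sketch
theta_p = rng.dirichlet(np.exp(R.T @ theta_bar))

# (4) Webpage words are then drawn exactly as in standard LDA given theta_p.
page_words = [rng.choice(J, p=phi[rng.choice(K, p=theta_p)]) for _ in range(30)]
```

Note that `R.T @ theta_bar` stacks the inner products of each column of R with the average query topic distribution, so its k-th entry corresponds to the concentration parameter $\exp(R_k^\top \bar\theta_{q(p)})$ in Equation (3).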
3.2 Inference Algorithm
Given the content of all queries and webpages and the labeling of webpages by queries, our goal is to estimate $\Theta = \{\{\phi_k\}, \{z_p\}, \{z_q\}, \{\theta_p\}, \{\theta_q\}, R\}$. We use a combination of Markov Chain Monte Carlo (MCMC) and optimization, i.e., a stochastic EM sampling scheme (Diebolt and Ip, 1995; Nielsen, 2000; Mimno and McCallum, 2012), to draw samples from the posterior distributions. Specifically, the posterior distributions of $\{\phi_k\}, \{z_p\}, \{z_q\}, \{\theta_p\}$ are all conjugate, so we rely on Gibbs sampling. We use the Metropolis-Hastings algorithm to sample $\{\theta_q\}$, which are not conjugate. We estimate $R$ by maximizing its likelihood function given $\{\theta_q\}$ and $\{\theta_p\}$. That is, over the MCMC sampling, we alternate between sampling $\{\phi_k\}, \{z_p\}, \{z_q\}, \{\theta_p\}, \{\theta_q\}$ from the current conditionals and numerically optimizing $R$ given the other parameters. Though one could also estimate $R$ using an additional Metropolis-Hastings step, Mimno and McCallum (2012) have suggested that a stochastic EM sampling approach works very efficiently for topic models. The details of our inference algorithm are presented in Appendix A. In Appendix B, we also report a simulation study that explores the performance of this inference algorithm.

¹One may suggest using the summation of all the $\theta_q$'s; the issue is that this would artificially decrease the variance of the Dirichlet distribution.
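As a rough sketch of the optimization step for R, suppose the sampler's current draws of $\{\theta_p\}$ and the query averages $\{\bar\theta_{q(p)}\}$ are in hand; maximizing the Dirichlet likelihood of Equation (3) over R could then look as follows. The inputs are simulated, and a generic L-BFGS routine stands in for whatever optimizer the authors use; the actual algorithm is detailed in Appendix A.

```python
import math
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
K, P = 2, 20  # topics and webpages, kept small for illustration

# Simulated current sampler state: page topic distributions theta_p and the
# average topic distributions theta_bar of each page's labeling queries.
theta_p = rng.dirichlet(np.ones(K), size=P)    # P x K
theta_bar = rng.dirichlet(np.ones(K), size=P)  # P x K

def neg_log_lik(r_flat):
    """Negative Dirichlet log-likelihood of {theta_p} given R (Equation 3)."""
    R = r_flat.reshape(K, K)
    a = np.exp(theta_bar @ R)  # P x K concentration parameters exp(R_k' theta_bar)
    ll = 0.0
    for p in range(P):
        ll += math.lgamma(a[p].sum()) - sum(math.lgamma(x) for x in a[p])
        ll += ((a[p] - 1.0) * np.log(theta_p[p])).sum()
    return -ll

# The "M-step": numerically maximize the likelihood over R.
res = minimize(neg_log_lik, x0=np.zeros(K * K), method="L-BFGS-B")
R_hat = res.x.reshape(K, K)
```

In the full stochastic EM scheme, this optimization would alternate with the Gibbs and Metropolis-Hastings sampling steps described above.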
3.3 Estimating Content Preferences Based on Queries
HDLDA provides estimates of the topic distributions of all queries and webpages in a corpus,
and of how topics in queries are mapped onto topics in search results. Given this information, we
would like to estimate consumers’ content preferences based on their queries. In order to be more
relevant to practice and facilitate ad targeting and content customization, such inference should be
feasible on the fly based on a single query.
We define consumer $i$'s content preferences as the topic distribution that the consumer is searching for on webpages. This distribution is captured by a vector with $K$ topic weights, denoted $\beta_i$. Our approach relies on Liu and Toubia (2015)'s finding that consumers anticipate the types of results that will be retrieved by their query, and that they formulate queries that will retrieve results that match their content preferences in expectation. Under this condition, a user's preferences may be simply estimated as the expected topic distribution of the search results, given their query. That is, given a query with topic distribution $\theta_q$, the search engine will retrieve search results drawn from the following distribution: $\theta_p \sim \mathrm{Dirichlet}_K\left(\exp(R_1^\top \theta_q), \ldots, \exp(R_K^\top \theta_q)\right)$. The consumer's content preferences $\beta_i$ may be simply estimated as the mean of this distribution:

$$\beta_i = E(\theta_p \mid \theta_q) = \left( \frac{\exp(R_1^\top \theta_q)}{\sum_k \exp(R_k^\top \theta_q)}, \frac{\exp(R_2^\top \theta_q)}{\sum_k \exp(R_k^\top \theta_q)}, \ldots, \frac{\exp(R_K^\top \theta_q)}{\sum_k \exp(R_k^\top \theta_q)} \right) \quad (5)$$
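Given an estimate of R, Equation (5) reduces to normalizing $\exp(R_k^\top \theta_q)$ across topics, since the mean of a Dirichlet distribution is its concentration parameters divided by their sum. The values of R and $\theta_q$ below are invented for illustration.

```python
import numpy as np

# Illustrative estimates: a 3 x 3 semantic mapping R and a query's topic
# distribution theta_q (neither comes from the paper's data).
R = np.array([[ 0.8, -0.2,  0.1],
              [ 0.0,  1.1, -0.3],
              [-0.1,  0.2,  0.9]])
theta_q = np.array([0.6, 0.3, 0.1])

# Equation (5): the mean of Dirichlet_K(exp(R_1' theta_q), ..., exp(R_K' theta_q)),
# i.e., the expected topic distribution of the retrieved search results.
a = np.exp(R.T @ theta_q)
beta = a / a.sum()
```

Because this is a single matrix-vector product followed by a normalization, the estimate can indeed be computed on the fly for each incoming query.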
We note that this approach assumes that consumers are able to leverage the semantic mapping between search queries and search results when forming queries. This assumption is grounded in previous research (Liu and Toubia, 2015). Moreover, we do not need to assume that consumers consciously leverage semantic relationships in search, but rather that they behave as if they did. Nevertheless, in Section 5.4 we compare our approach to a benchmark that assumes that consumers do not take the semantic mapping between queries and results into account when forming queries. We compare our estimate of $\beta_i$ from Equation (5) to a naive estimate which assumes that consumers simply formulate queries whose topic distributions directly match their preferences, i.e., $\beta_i = \theta_q$.
4 Empirical Study
Researchers and practitioners may run HDLDA on any primary or secondary dataset that contains the text of a set of queries and their corresponding search results from a search engine. However, in this paper our objective is not only to show how HDLDA may be applied, but also to test the model's ability to infer consumers' content preferences based on their search queries. Accordingly, we use a dataset collected experimentally, because it offers two main advantages over secondary data. First, it enables us to manipulate and know the content preferences underlying users' behavior, and therefore to compare estimated preferences with true preferences. Second, with secondary data the browsing behavior of users would likely be influenced by search advertising and more generally by the customization of content based on the user's history. Such customization could bias our comparisons between various benchmarks, for example by providing alternative explanations for the superior performance of one benchmark over another. Our experimental approach allows removing the influence of customization and hence provides unbiased comparisons between benchmarks. In this section, we describe our unique dataset in detail.
4.1 Experimental Design
We conducted a lab experiment in which N = 197 participants performed a series of online search tasks using a custom-built search engine that we will introduce in Section 4.2. Participants' objective was to make purchase recommendations. Our search tasks were designed based on five product categories about which consumers commonly acquire information on the Internet before purchase: ski resorts, printers, cars, laptops, and cameras. These were relevant categories for our participants (mostly undergraduate and graduate students).² We manipulated content preferences exogenously by giving participants specific search tasks, i.e., instructions on what they should search for. We introduced some heterogeneity in content preferences in each category, by designing two task descriptions reflecting different preferences, as displayed in Table 1. For example, Task 1 asked participants to search for a family-friendly ski resort, while Task 2 asked them to search for an exclusive ski resort. We note that we designed these task descriptions before acquiring and analyzing any webpage content, i.e., our task descriptions were not designed to be aligned with or map onto the set of topics extracted from participants' actual search data using HDLDA. Hence, we should expect each task description to have positive weights on multiple topics.
[Insert Table 1 Here]
We used a between-subject design in which each participant was randomly assigned to one
version of the two tasks in each product category. The order of the categories was randomized for
each participant. Each participant was asked to submit one URL of their chosen webpage that they
believed best matched each search task description. To ensure that participants invested enough
effort, we designed this study to be incentive-aligned. Participants were told that all submitted
links would be evaluated by the researchers, based on relevance and usefulness. In addition to a
$7 participation fee, we gave a $100 cash bonus to the participant whose chosen webpage best
matched with the corresponding task description. Participants were informed of this incentive
before the experiment, and we notified the winner within two weeks of the experiment. We find
that this incentive worked well, as on average participants spent at least 20 minutes to finish all five
tasks.
4.2 Data Collection
The input of HDLDA is a set of queries and associated search results. For a given set of search queries, the associated organic search results may be obtained using one of the APIs offered by search engines such as Google. As mentioned above, in addition to applying HDLDA in various domains, our objective with the present experiment is also to validate HDLDA, by exploring its use for inferring a user's preferences based on their search queries. In order to ensure that our validation of HDLDA is not biased or influenced by unobserved factors such as the user's browsing history or the customization of content by the search engine, we built our own search engine called Hoogle, and used it to collect search data. Hoogle retrieves all the organic search results for each search query with no user history being captured, using the Google Custom Search Engine API. That is, for any search query, Hoogle retrieves a similar set of search results as Google, with the only differences being that search results are not personalized based on past search history and that there are no sponsored search results. A screenshot from the Hoogle interface is presented in Figure 2. Each result page shows ten links with their titles and snippets. The font, color, and size are the same as Google's.

²We ran the lab experiment in the winter on the east coast of the U.S., so ski resorts were relevant for our participants.
[Insert Figure 2 Here]
The search logs from Hoogle include the following information for each participant and for
each task: the query(ies) submitted by the participant, the search results seen by the participant on
each page, and the links clicked by the participants. Immediately after we finished collecting data
in the lab experiment, we also used Python scripts to automatically download all the content of the
webpages in the participants’ search results (i.e., the actual content of the webpages corresponding
to all the links in the pages of search results viewed by participants).
We note again that we developed Hoogle only in order to provide a clean comparison of users’
estimated preferences with their “true” preferences, but that Hoogle is not necessary in order to run
HDLDA. In practice, most firms have a set of keywords/queries that they think are most relevant
and valuable for their SEO/SEM strategies. Such firms can use the search API offered by Google
or other search engines to collect data and run HDLDA; they do not need to build their own search
engines or run any experiment to collect these data. In addition, HDLDA can certainly be applied
on any secondary dataset that contains a set of queries and their associated search results.
4.3 Descriptive Statistics
Table 2 reports descriptive statistics on the search data collected from this lab experiment for
each product category, including the number of unique queries, the number of different words
among all the queries, the variation across users’ queries, users’ query usage, the number of words
per query, and the average proportion of the words in a query that come from the task description.
First of all, there exists some heterogeneity across users’ queries. Such variation is measured by
the edit distance. The edit distance quantifies how dissimilar two queries are from each other by
counting the minimum possible weighted number of character operations (including insertions,
deletions, and substitutions), required to transform one query into the other (Martin and Jurafsky,
2000). For example, the edit distance between “best school laptop” and “school laptop” is five (five
characters need to be deleted: b, e, s, t, space). We compute the edit distance between all possible
pairs of queries from participants in the same task, and report the sample average and standard
deviation.
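As an illustrative sketch (not the code used in the study), the edit distance between two queries can be computed with standard dynamic programming:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a[i-1]
                          curr[j - 1] + 1,     # insert b[j-1]
                          prev[j - 1] + cost)  # substitute
        prev = curr
    return prev[len(b)]

print(edit_distance("best school laptop", "school laptop"))  # 5
```

The example above reproduces the "best school laptop" vs. "school laptop" distance of five from the text.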
In addition, we find that on average users tend to use few queries in a session and also few
words in a query, which is consistent with the existing empirical studies on online search queries
(Jansen et al., 2000; Spink et al., 2001; Kamvar and Baluja, 2006; Wu et al., 2014). These numbers
also vary across product categories. For example, users submitted slightly more queries when
searching for laptops. Lastly, the average proportion of words in users’ search queries that are
from the task description ranges from 0.40 to 0.75. Hence, users form queries that combine words
in the task description with other words.
[Insert Table 2 Here]
We now turn to descriptive statistics related to position effects. Previous research has shown
that position could affect behavioral outcomes such as consumer click-through, conversion rate,
and sales (Hotchkiss et al., 2005; Kihlstrom and Riordan, 1984; Varian, 2007). In our data, we find
that the majority of users limit their browsing to the links on the first page of results (i.e., the top 10
organic links retrieved by Google). Therefore, we treat every page of results as starting from the
first position. We plot the click-through rate (CTR) against the top ten positions in Figure 3. The
CTR at each position is calculated as the percentage of clicks at this position across all the clicks
from the ten positions, and hence the CTR sums to one across positions. Consistent with previous
research, we see a quasi-exponential decrease in CTR as a function of position (Narayanan and
Kalyanam, 2011; Abhishek et al., 2014).
[Insert Figure 3 Here]
Lastly, we study whether there exists some agreement amongst users on the best webpage for
each task. For each task and for each URL that was chosen at least once, we define the agreement
level as the number of users who picked that URL as their chosen link. Overall, the distribution
of the agreement level is long-tailed, i.e., the majority of users tend to choose different URLs.
We report the top 5 most chosen links and their corresponding agreement levels for each task in
Figure 4. We see that a few links show some level of consensus amongst users, and there exists
heterogeneity across tasks.
[Insert Figure 4 Here]
4.4 Data Processing
Given that most advertisers or firms do business in certain domains and design webpage content
and keyword lists within those domains, we run HDLDA separately for each category. Accordingly, we
combine all the search queries and result webpages from the two search tasks in each product
category as one corpus. Following the bag-of-words approach, each webpage or query can be
treated as an unordered set of words. We process the text in each corpus based on standard practice
in text mining. We remove any delimiting character, including hyphens, used to separate words;
we eliminate punctuation, non-English characters, and a standard list of English stop words; and
no stemming is performed.
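As a concrete illustration, the preprocessing steps above might look as follows (the delimiter set and the stop-word list shown here are illustrative placeholders, not the exact ones we used):

```python
import re

# Illustrative subset of a standard English stop-word list (an assumption,
# not the actual list used in the study).
STOP_WORDS = {"the", "a", "an", "and", "or", "for", "of", "in", "to", "is"}

def preprocess(text: str) -> list:
    text = text.lower()
    text = re.sub(r"[-_/]", " ", text)     # replace delimiting characters with spaces
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and non-English characters
    # tokenize and remove stop words; no stemming is performed
    return [w for w in text.split() if w not in STOP_WORDS]

print(preprocess("Best water-proof cameras for kids!"))
```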
We form the vocabulary for each corpus in two steps. We first select words that appear at least
n times in total (we varied n between 10 and 20 depending on the corpus, based on inspection; the selection of n was finalized before running HDLDA). This helps eliminate a fair amount
of meaningless words that cannot be removed in the preprocessing stage. Among the remaining
words, we then calculate the mean term frequency-inverse document frequency (tf-idf) for each word. The tf-idf is commonly used to select vocabulary in topic modeling, and is computed as tf-idf(w) = f_w × log(N / n_w), where f_w is the average number of times word w occurs in each document in the corpus, N is the total number of documents in the corpus, and n_w is the number of documents containing word w. We keep words whose mean tf-idf is above the median, which allows omitting words that have low frequency as well as those occurring in too many documents (Hornik and Grun, 2011).
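A minimal sketch of this two-step vocabulary selection, using a toy corpus and a small threshold n (the actual corpora used n between 10 and 20):

```python
import math
import statistics
from collections import Counter

def select_vocabulary(docs, n=2):
    """docs: list of token lists. Step 1: keep words appearing at least n
    times in total. Step 2: keep words whose mean tf-idf,
    f_w * log(N / n_w), exceeds the median across remaining words."""
    N = len(docs)
    total = Counter(w for d in docs for w in d)
    candidates = [w for w, c in total.items() if c >= n]
    n_w = {w: sum(w in d for d in docs) for w in candidates}
    tfidf = {w: (total[w] / N) * math.log(N / n_w[w]) for w in candidates}
    med = statistics.median(tfidf.values())
    return {w for w, s in tfidf.items() if s > med}

docs = [["laptop", "school", "best", "laptop"],
        ["laptop", "camera", "best"],
        ["camera", "waterproof", "kids", "best"]]
print(select_vocabulary(docs, n=2))  # {'laptop'}
```

In this toy corpus, "best" occurs in every document (zero lift) and "camera" falls at the median, so only "laptop" survives.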
The descriptive statistics of the resulting corpora are summarized in Table 3. The first row is
the number of words that are selected as the vocabulary for each corpus. The number of words
in each query or webpage is calculated based on the selected vocabulary, rather than the original
content. Hence, one may notice that the number of words per query here is smaller than in Table 2.
On average, each query contains about 4 words and each webpage contains 250-600 words. The
last part of the table is the observed labels between queries and webpages. Because in our data
most participants only viewed the first page of search results for each query, on average each query
retrieves about ten webpages.3 Webpages, however, can be retrieved by very different numbers of queries.
[Insert Table 3 Here]
5 Results
In this section, we first describe how we determine the number of topics before conducting
model inference. We then evaluate how well HDLDA fits the corpus and compare the model to
two benchmarks. Next, we present the results of the posterior estimates from HDLDA. Lastly, we
proceed to the individual-level estimation of content preferences as described in Section 3.3.
3In some cases, the total number of links could be smaller than ten for a query, if the query either cannot be understood by Google or has very few relevant results.
5.1 Determining the Number of Topics in HDLDA
One key decision in topic modeling is choosing the number of topics for a corpus if this pa-
rameter is not specified a priori (Blei and Lafferty, 2009). Depending on the goals and available
means, a researcher may apply a variety of performance metrics (Griffiths and Steyvers, 2004;
Wallach et al., 2009; Chang and Blei, 2009). In our case, if we remove the hierarchical structure
between queries and webpages (i.e., set R = 0), the prior on θp reduces to a uniform distribution (exp(R_k^T θq(p)) = 1 for all k, and Dirichlet(1) is the uniform distribution). If we further set α = 1 (which we
do in our analysis), then the prior distributions on queries and webpages become identical. In other
words, HDLDA is reduced to traditional LDA, where queries and webpages are treated equally as
documents, with a common prior on their topic distributions. We use the optimal number of topics
for LDA as the number of topics for HDLDA. This makes our comparisons of HDLDA to LDA
more conservative (i.e., the number of topics is optimized for LDA).
The literature on topic modeling suggests that estimating the probability of held-out documents
can provide a clear and interpretable metric for evaluating the performance of topic models, and
choosing the number of topics that generates the best language model (Wallach et al., 2009; Blei
and Lafferty, 2009). Therefore, we divide each corpus into training and hold-out testing sets. For
each number of topics K, we measure the performance of the learned model by computing the
perplexity on the hold-out testing set. The perplexity is monotonically decreasing in the likelihood
of the testing set, and is equivalent to the inverse of the geometric mean of the per-word likelihood
(Blei et al., 2003). A lower perplexity score indicates better generalization performance. For a
testing set of documents Dtest, let Nd denote the number of words in each document d ∈ Dtest. The perplexity of the testing set is computed as:

Perplexity(Dtest) = exp( − [ Σ_d Σ_{w∈d} log( Σ_k ϕ_kw θ_dk ) ] / Σ_d Nd )     (6)
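For concreteness, the perplexity in Equation (6) can be computed as in the following sketch (the ϕ and θ values here are toy illustrations, not estimates from our data):

```python
import math

def perplexity(docs, phi, theta):
    """docs: list of token lists; phi[k][w]: topic-word probabilities;
    theta[d][k]: topic weights of document d. Implements Equation (6):
    exp of minus the average per-word log-likelihood."""
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_lik += math.log(sum(phi[k][w] * theta[d][k]
                                    for k in range(len(phi))))
        n_words += len(doc)
    return math.exp(-log_lik / n_words)

phi = [{"ski": 0.7, "spa": 0.3}, {"ski": 0.2, "spa": 0.8}]  # toy topics
theta = [[0.5, 0.5]]
print(perplexity([["ski", "spa"]], phi, theta))
```

The result equals the inverse of the geometric mean of the per-word likelihoods, so lower values indicate better generalization.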
For each corpus, we held out 20% of the data and trained LDA on the remaining 80%. We test
for K ∈ {2,3, ...} until perplexity becomes worse. To guarantee the robustness of the results, we
repeat this procedure 10 times and select the number of topics based on the average perplexity score
across the 10 replications. We find that the optimal number of topics is K* = 2 for the ski category, K* = 3 for printer, K* = 3 for car, K* = 4 for laptop, and K* = 2 for camera. We estimate HDLDA
for each category with the optimal K. We set prior parameters α = 1 and η = 0.01 (Jagarlamudi
et al., 2012), and run 8,000 MCMC iterations using the first 4,000 as burn-in.
5.2 Model Fit
We compare HDLDA to two variations of LDA, based on the Deviance Information Criterion
(DIC) (Spiegelhalter et al., 2002). The first benchmark is the basic LDA in which the queries and
webpages are treated equally as documents and have the same fixed prior. Specifically, we assume
θq ∼ DirichletK(α), θp ∼ DirichletK(α), and set α = 1. Note that here the prior on the topic
distribution of each query, θq, is the same as in HDLDA; and the prior on the topic distribution of
each webpage, θp, is also the same as in HDLDA when R is reduced to a zero matrix. Therefore,
this basic LDA is nested within HDLDA. The inference algorithm for the basic LDA is simply a Gibbs sampler.
Our next LDA benchmark explores whether the benefit of HDLDA comes from having a flexi-
ble prior distribution on the topic distributions of webpages, rather than the semantic mapping be-
tween queries and webpages per se. Accordingly, in our second LDA benchmark we set the prior
on webpages to be an unknown vector which we estimate, while keeping the prior on the topic
distributions of queries fixed. That is, θq ∼ Dirichlet_K(α), where α = 1; and θp ∼ Dirichlet_K(α*), where α* is an unknown K-dimensional vector that is estimated. We label this benchmark “Flexible LDA.” This benchmark is also nested within HDLDA: it corresponds to assuming that all the topics in a query have the same effect on each topic in the results, i.e., R = (r_1 1_K, r_2 1_K, ..., r_K 1_K), where 1_K denotes a K-dimensional vector of ones and {r_k}_{k=1,...,K} are scalars. This reduces the mean of the Dirichlet distribution in Equation (3) to (exp(r_1), exp(r_2), ..., exp(r_K)) (because the vector θq(p) sums to one), which is exactly the unknown prior on θp in Flexible LDA. Similarly
to HDLDA, the inference algorithm for Flexible LDA is a stochastic EM algorithm in which we
alternate between sampling {ϕk},{zp},{zq},{θp},{θq} from the Gibbs sampler, and numerically
optimizing α∗ given the other parameters.
The DIC of all the three models for different product categories is presented in Table 4. We see
that although Flexible LDA performs better than the basic LDA, HDLDA achieves a much lower
DIC compared to both LDA benchmarks across all categories. This suggests that it is reasonable
to explicitly model the semantic mapping between search queries and search results, and that the
benefits from HDLDA do not come merely from its flexible prior structure.
[Insert Table 4 Here]
5.3 Estimates of Topics and Semantic Mapping
We now interpret the topics generated by HDLDA in each corpus. To ease interpretation, we
focus on the most relevant words in each topic. The relevance of word w to topic k is measured as
follows (Sievert and Shirley, 2014; Bischof and Airoldi, 2012):
r(w, k | λ) = λ log(ϕ_kw) + (1 − λ) log(ϕ_kw / p_w)     (7)

where ϕ_kw is the posterior estimate of the probability of seeing word w given topic k, and p_w is the empirical distribution of word w in the corpus; λ determines the weight given to the probability of word w under topic k relative to its lift ϕ_kw / p_w, both measured on the log scale. Setting λ = 1 results in the familiar ranking of words in decreasing order of their topic-specific probabilities, and setting λ = 0 ranks words solely based on lift. We set λ = 0.6, following the empirical studies conducted
by Sievert and Shirley (2014).
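Equation (7) is straightforward to compute; a sketch with illustrative probability values:

```python
import math

def relevance(phi_kw, p_w, lam=0.6):
    """Relevance of word w to topic k (Equation 7): a convex combination of
    the log topic-specific probability and the log lift, weighted by lam."""
    return lam * math.log(phi_kw) + (1 - lam) * math.log(phi_kw / p_w)

# A word that is frequent in the topic but rare overall has high lift,
# so it scores higher than an equally topic-probable but common word.
print(relevance(0.05, 0.01) > relevance(0.05, 0.05))  # True
```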
[Insert Figure 5 and Table 5 Here]
For ease of interpretation, we simulate the content of each topic using the exponential of the
above relevance measure. That is, we generate sets of words for each topic, where the probability
of occurrence of each word is proportional to the exponential of its relevance. We use word clouds
to visualize the simulated sets of words. As an example, in Figure 5 we report the word clouds
simulated for the two topics extracted from the camera category. Words with larger font size have
higher relevance. We can clearly see that the word cloud on the left-hand side for Topic 1 is more
relevant to “waterproof cameras for kids,” whereas the other word cloud for Topic 2 is more likely
to have words that are relevant to “professional digital cameras.” In addition, we report in Table
5 some of the most relevant words for each topic, along with examples of queries and webpages
with very high weights on each topic. While we only present five sample words per topic for space
reasons, as one can see from Figure 5, this is not enough to define or capture a topic completely.
Our observations suggest consumers in general use a variety of words to form queries, and these
words have various degrees of specificity. We also observe that the recovered topic distributions of
webpages in general seem to be consistent with their actual content.
We also report the average of the posterior estimates of the topic proportions of all queries and
webpages in each task. See Table 6. We see that the average θq is close to the average θp in each
category. However, this does not mean that topic distributions in search results are systematically
identical to topic distributions in the corresponding queries, i.e., it does not mean that the matrix
R is proportional to the identity matrix. In particular, the posterior estimates of R are reported in
Table 7. All parameter estimates are statistically significant, which supports our hypothesis that
topics in search results are related to topics in the corresponding search queries. Recall that each
element r_kk′ in the kth row and k′th column of matrix R can be interpreted as the effect of the weight on topic k in a query on the weight on topic k′ in the corresponding search results. We can clearly see that there exist asymmetries in the semantic relationships, i.e., r_kk′ ≠ r_k′k. This finding
is consistent with and extends the results of Liu and Toubia (2015), who found asymmetries in the
semantic relationships on Google, at the level of individual words.
[Insert Tables 6 and 7 Here]
5.4 Accuracy of Preference Estimation
We now consider the estimation of content preferences from queries. In particular, we explore
the benefits of assuming that consumers leverage the semantic mapping between queries and results
when formulating their queries. As described in Section 3.3, given a query with topic distribution
θq, the consumer’s preferences may be estimated as: βi = E(θp|θq), using Equation (5). That is,
preferences may be estimated as the expected value of the topic distribution of the results that will
be retrieved by the search engine, given the query. This estimation approach is consistent with
consumers formulating queries that will give rise to content that matches their content preferences
in expectation. A benchmark approach assumes that consumers ignore the semantic mapping
between queries and results, and estimates the preferences simply as: βi = θq.
5.4.1 Measures of “Truth”
To evaluate these two approaches, we compare their estimates of consumers’ content prefer-
ences with the “truth.” One of the appeals of our experimental design is that we can manipulate
the “true” content preferences, and use the task description given to the participant as a measure
of their true content preferences. Hence, our first measure of “truth” is the topic distribution of the
task description given to the participant. This measure offers the benefit of being exogenous, and
unaffected by the participant’s behavior or belief. However, one limitation of this measure is that
there might be variations in how participants interpret the task description. Hence, we complement
this first measure with a second measure: the topic distribution of the actual webpage chosen by the
participant.4 This measure has the benefit of reflecting the actual behavior of participants. However, one of its limitations is that it is influenced both by demand-side variables (i.e., how participants interpret the task description) and supply-side variables (i.e., the options presented to participants by the search engine). Together, these two imperfect measures of “truth” provide convergent evidence regarding the ability of HDLDA to estimate content preferences based on queries. Note that
the data used to train HDLDA contains neither the text of the task descriptions, nor the knowledge
of which webpage was selected by each participant. Therefore, both our truth measures may be viewed as external validations.

4Some participants did not follow the instructions correctly (e.g., some submitted random text without actually conducting the search on Hoogle, or found their chosen link on another search engine). We drop these observations from our analysis. The proportion of observations dropped is 2% for the ski category, 2% for printer, 10% for car, 12% for laptop, and 6% for camera. As a result, 195 participants submitted a valid chosen webpage for at least one task.
[Insert Table 8 Here]
We use the same procedure to estimate the topic distributions of the task descriptions and of
the chosen webpages. In particular, we use a traditional LDA process in which the topic-word
distribution ϕ is set to the estimate from HDLDA. That is, at each iteration we draw the latent
topic assignment of each word in the task description/chosen webpage, and then draw from the
posterior of the topic distribution given these latent topic assignments (assuming a flat prior on
the topic distribution). We estimate the topic distribution of the task description/chosen webpage
as the average over 3,000 iterations. The estimated topic distributions of all the task descriptions
are presented in Table 8. We see that each task description may have large weights on multiple
topics. However, when comparing the weights of the same topic across the two task descriptions,
the topic with the relatively larger weight is consistent with our expectation. For example, in the ski
resort category, Task 1 (respectively, Task 2) was designed to correspond to a family-friendly resort
(respectively, a luxury resort). Consistent with this, we find that Task 1 has a larger weight on Topic
1 compared to Task 2, and the opposite is true for Topic 2. Note however that the weight on Topic
1 is larger overall compared to the weights on Topic 2, reflecting the fact that family-friendliness
is a more common/popular theme in this category compared to luxury.
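The procedure just described, i.e., sampling latent topic assignments with the topic-word distribution ϕ held fixed and averaging posterior draws of the topic distribution under a flat prior, can be sketched as follows (the toy ϕ and document are illustrative, not our estimates):

```python
import random

def infer_theta(doc, phi, iters=3000, seed=0):
    """Estimate a document's topic distribution theta with phi fixed,
    via Gibbs sampling under a flat Dirichlet(1) prior on theta."""
    rng = random.Random(seed)
    K = len(phi)
    z = [rng.randrange(K) for _ in doc]  # latent topic assignment per word
    counts = [z.count(k) for k in range(K)]
    theta_sum = [0.0] * K
    for _ in range(iters):
        for i, w in enumerate(doc):
            counts[z[i]] -= 1
            # P(z_i = k | rest) is proportional to phi[k][w] * (counts[k] + 1)
            weights = [phi[k][w] * (counts[k] + 1.0) for k in range(K)]
            z[i] = rng.choices(range(K), weights=weights)[0]
            counts[z[i]] += 1
        for k in range(K):  # posterior mean of theta given the assignments
            theta_sum[k] += (counts[k] + 1.0) / (len(doc) + K)
    return [s / iters for s in theta_sum]

phi = [{"ski": 0.9, "spa": 0.1}, {"ski": 0.1, "spa": 0.9}]  # toy topics
print(infer_theta(["ski", "ski", "spa"], phi))
```

For this toy document, two of three words favor the first topic, so the estimated θ places more weight on it.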
5.4.2 Performance Metrics
We compare the “true” preferences to the estimated preferences using two performance metrics.
The first metric relies on the fact that both the truth and the estimate are vectors that capture topic
distributions. We measure prediction accuracy using cosine similarity, which is commonly used
in topic modeling to understand the similarity between documents. The cosine similarity between
two vectors a and b is computed as their normalized inner product (the cosine of the angle between them):

f(a, b) = (a · b) / (‖a‖ ‖b‖) = Σ_k a_k b_k / ( √(Σ_k a_k²) · √(Σ_k b_k²) )     (8)
Cosine similarity ranges from 0, indicating that the two vectors share no topics, to 1, indicating that they point in exactly the same direction.
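A direct implementation of Equation (8):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two topic-distribution vectors (Equation 8)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical distributions score 1; distributions with no shared topic score 0.
print(cosine_similarity([0.8, 0.2], [0.8, 0.2]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```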
Second, we calculate the perplexity score of the “true” description of each participant’s content
preferences (i.e., task description or chosen webpage), given the estimated content preferences.
Specifically, let Li denote the content (i.e., the set of words) of the “true” description of user i's preferences. According to Equation (6), the perplexity score of Li given estimated preferences βi is calculated as:

perplexity(Li | βi) = exp( − [ Σ_{w∈Li} log( Σ_k ϕ_kw β_ik ) ] / |Li| )     (9)

where |Li| denotes the length of the content Li, and ϕ is the topic-word distribution estimated by HDLDA. Note that a smaller value indicates better model performance.
5.4.3 Results
We compare the accuracy of the estimates of the content preferences based on Equation (5):
βi = E(θp|θq); to that of the benchmark that assumes that consumers ignore semantic mapping in
query formation: βi = θq. We compare the accuracy of these two estimates based on both truth
measures, using both cosine similarity and perplexity. We compute the performance of the esti-
mates based on each query in each category from each participant separately. We first compute
the average performance over the queries submitted by each participant in each category. Taking
the average of this performance across categories gives us an average performance for each participant. In Table 9, we report the average performance of the two approaches across participants. We compare them using paired two-sample t-tests. Both performance metrics suggest that the estimates of consumers' content preferences are more accurate when we assume that consumers leverage the semantic mapping between queries and results when they formulate queries, rather than assuming that queries are direct representations of content preferences. The difference is significant in all cases (p-value < 0.05).
[Insert Table 9 Here]
6 Conclusions
In this paper, we studied the link between consumer search queries and their content prefer-
ences. We developed an innovative topic model, HDLDA, that jointly estimates the topic distribu-
tions in queries and webpages as well as the semantic mapping between queries and their results.
In our domain of application, HDLDA captures the fact that a webpage is retrieved by certain queries, and that topics in queries are semantically related to topics in search results. The output from HDLDA includes the number of topics, the description of these topics, the topic distributions of all the queries and webpages, and how topics in queries map onto topics in search results.
More generally, HDLDA is a model for bag-of-words data that can be applied to any context
which documents are multi-labeled by very small subsets of their own words. HDLDA has a novel
structure within the broad literature of topic modeling, and our paper provides a methodological
contribution to the probabilistic topic modeling literature.
Using the output of HDLDA, it is possible to estimate a consumer’s content preferences on the
fly, based on each query. In particular, we can simply estimate a consumer’s preferences as the
expected topic distribution in the search results, given the topic distribution in the search query.
We find that prediction accuracy is improved when we assume that consumers leverage semantic
relationships between queries and results when forming their queries, compared to a benchmark
that assumes that consumers ignore these semantic relationships.
In order to validate our approach for estimating content preferences, we developed our own
search engine, which allows removing the variations in organic and sponsored search results due
to customization. This tool offers opportunities to shed new light on important research questions
in consumer search and search advertising, such as position effects and advertising effects. Indeed,
although we did not do this in the current paper, this tool allows varying the order of organic and
sponsored search results, moving sponsored results to the organic section, etc.
From a managerial perspective, HDLDA is scalable across thousands of queries and pages within a search domain, and can automatically extract, understand, and organize the meaning of queries and webpages without human intervention. It is therefore a potential tool for marketers and advertisers to build their own semantic graphs and conduct query-intent research. The input to HDLDA may come from various sources, including search engines' APIs and secondary
search data. The output of HDLDA can guide firms’ SEO and SEM strategies. In SEO, firms can
use our approach to quantify how well their content matches the content preferences captured by
various keywords and queries, and focus their efforts on promoting their organic search rank for
those queries with the best fit. In SEM, firms can use our approach to determine their willingness
to bid on various keywords and queries, based on how well their content matches with the underly-
ing content preferences. Our approach may also be used to reduce the dimensionality of queries in
order to systematically study the performance of large sets of queries with respect to metrics such
as click-through rate.
We close by highlighting additional areas for future research. First, the corpora we used in this
paper contain only a small proportion of the actual queries and webpages related to each category.
Future research may use larger datasets. Second, future research may combine information on
queries with information on click-stream behavior, to provide a more extensive set of observations
based on which content preferences may be estimated. Recent developments in Collaborative Topic
Modeling (Wang and Blei, 2011) might provide the foundation for models that would formulate
both query formation and clicking behavior as functions of content preferences.
References
Abhishek, V., J. Gong, and B. Li (2014). Examining the impact of contextual ambiguity on search
advertising keyword performance: A topic model approach. Available at SSRN 2404081.
Amerland, D. (2013). Google Semantic Search: Search Engine Optimization (SEO) Techniques
That Get Your Company More Traffic, Increase Brand Impact, and Amplify Your Online Presence
(1st ed.). Que Publishing Company.
Anandkumar, A., Y.-k. Liu, D. J. Hsu, D. P. Foster, and S. M. Kakade (2012). A spectral algorithm
for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pp. 917–
925.
Andrieu, C. and J. Thoms (2008). A tutorial on adaptive MCMC. Statistics and Computing 18(4),
343–373.
Archak, N., A. Ghose, and P. G. Ipeirotis (2011). Deriving the pricing power of product features
by mining consumer reviews. Management Science 57(8), 1485–1509.
Arora, S., R. Ge, and A. Moitra (2012). Learning topic models–going beyond svd. In Foundations
of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pp. 1–10. IEEE.
Bischof, J. and E. M. Airoldi (2012). Summarizing topical content with word frequency and
exclusivity. In Proceedings of the 29th International Conference on Machine Learning (ICML-
12), pp. 201–208.
Blei, D. M. and J. D. Lafferty (2007). A correlated topic model of science. The Annals of Applied
Statistics, 17–35.
Blei, D. M. and J. D. Lafferty (2009). Text Mining: Classification, Clustering, and Applications,
Chapter Topic Models, pp. 71–93. Taylor and Francis.
Blei, D. M., A. Y. Ng, and M. I. Jordan (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
Broder, A. (2002). A taxonomy of web search. In ACM Sigir forum, Volume 36, pp. 3–10. ACM.
Chang, J. and D. M. Blei (2009). Relational topic models for document networks. In International
Conference on Artificial Intelligence and Statistics, pp. 81–88.
Diebolt, J. and E. H. Ip (1995). A stochastic EM algorithm for approximating the maximum likelihood estimate. Technical report, Sandia National Labs., Livermore, CA (United States).
Dzyabura, D. (2013). The role of changing utility in product search. Available at SSRN 2202904.
Gelman, A. and X. L. Meng (1996). Model checking and model improvement. Markov chain
Monte Carlo in practice Springer US, 189–201.
Ghose, A., P. Ipeirotis, and B. Li (2013). Surviving social media overload: predicting consumer
footprints on product search engines. Technical report, Working paper.
Ghose, A., P. G. Ipeirotis, and B. Li (2012). Designing ranking systems for hotels on travel search
engines by mining user-generated and crowdsourced content. Marketing Science 31(3), 493–
520.
Griffiths, T. L., D. M. Blei, M. I. Jordan, and J. B. Tenenbaum (2004). Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems 16, 17.
Griffiths, T. L. and M. Steyvers (2004). Finding scientific topics. Proceedings of the National
Academy of Sciences 101(suppl 1), 5228–5235.
Hornik, K. and B. Grun (2011). topicmodels: An R package for fitting topic models. Journal of
Statistical Software 40(13), 1–30.
Hotchkiss, G., S. Alston, and G. Edwards (2005). Google eye tracking report: How searchers see
and click on Google search results. Enquiro Search Solutions.
Jagarlamudi, J., H. Daume III, and R. Udupa (2012). Incorporating lexical priors into topic mod-
els. In Proceedings of the 13th Conference of the European Chapter of the Association for
Computational Linguistics, pp. 204–213. Association for Computational Linguistics.
Jansen, B. J., D. Booth, and B. Smith (2009). Using the taxonomy of cognitive learning to model
online searching. Information Processing & Management 45(6), 643–663.
Jansen, B. J., D. L. Booth, and A. Spink (2007). Determining the user intent of web search engine
queries. In Proceedings of the 16th international conference on World Wide Web, pp. 1149–
1150. ACM.
Jansen, B. J., D. L. Booth, and A. Spink (2008). Determining the informational, navigational, and
transactional intent of web queries. Information Processing & Management 44(3), 1251–1266.
Jansen, B. J., A. Spink, A. Pfaff, and A. Goodrum (2000). Web query structure: Implications for ir
system design. In Proceedings of the 4th World Multiconference on Systemics, Cybernetics and
Informatics (SCI 2000), pp. 169–176.
Jeziorski, P. and I. Segal (2010). What makes them click: Empirical analysis of consumer de-
mand for search advertising. Technical report, Working papers//the Johns Hopkins University,
Department of Economics.
Kamvar, M. and S. Baluja (2006). A large scale study of wireless search behavior: Google mobile
search. In Proceedings of the SIGCHI conference on Human Factors in computing systems, pp.
701–709. ACM.
Kihlstrom, R. E. and M. H. Riordan (1984). Advertising as a signal. The Journal of Political
Economy, 427–450.
Kim, J. B., P. Albuquerque, and B. J. Bronnenberg (2010). Online demand under limited consumer
search. Marketing Science 29(6), 1001–1023.
Lee, T. Y. and E. T. Bradlow (2011). Automated marketing research using online customer reviews.
Journal of Marketing Research 48(5), 881–894.
Liu, J. and O. Toubia (2015). A framework for modeling how consumers form online search
queries. Available at SSRN: http://ssrn.com/abstract=2675331.
Manning, C. D., P. Raghavan, and H. Schütze (2008). Introduction to information retrieval. Cam-
bridge University Press.
Martin, J. H. and D. Jurafsky (2000). Speech and language processing. International Edition.
Mimno, D. and A. McCallum (2012). Topic models conditioned on arbitrary features with
dirichlet-multinomial regression. arXiv preprint arXiv:1206.3278.
Narayanan, S. and K. Kalyanam (2011). Measuring position effects in search advertising: A
regression discontinuity approach. Working paper.
Netzer, O., R. Feldman, J. Goldenberg, and M. Fresko (2012). Mine your own business: Market-
structure surveillance through text mining. Marketing Science 31(3), 521–543.
Nielsen, S. F. (2000). The stochastic EM algorithm: Estimation and asymptotic results. Bernoulli,
457–489.
Pirolli, P. L. (2007). Information foraging theory: Adaptive interaction with information. Oxford
University Press.
Sanasam, R. S., H. A. Murthy, and T. A. Gonsalves (2008). Determining user’s interest in real
time. In Proceedings of the 17th international conference on World Wide Web, pp. 1115–1116.
ACM.
Shen, Y., J. Yan, S. Yan, L. Ji, N. Liu, and Z. Chen (2011). Sparse hidden-dynamics conditional
random fields for user intent understanding. In Proceedings of the 20th international conference
on World wide web, pp. 7–16. ACM.
Shi, S. W. and M. Trusov (2013). The path to click: Are you on it? Working paper.
Sievert, C. and K. E. Shirley (2014). LDAvis: A method for visualizing and interpreting topics. In
Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces,
pp. 63–70. Association for Computational Linguistics.
Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. Van Der Linde (2002). Bayesian measures
of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 64(4), 583–639.
Spink, A., D. Wolfram, B. J. Jansen, and T. Saracevic (2001). Searching the web: The public and
their queries. Journal of the American Society for Information Science and Technology 52(3),
226–234.
Statista (2015). Digital marketing spending in the united states from 2014 to 2019. Avail-
able from http://www.statista.com/statistics/275230/us-interactive-marketing-spending-growth-
from-2011-to-2016-by-segment.
Tirunillai, S. and G. J. Tellis (2014). Mining marketing meaning from online chatter: Strategic
brand analysis of big data using latent dirichlet allocation. Journal of Marketing Research 51(4),
463–479.
Varian, H. R. (2007). Position auctions. International Journal of Industrial Organization 25(6),
1163–1178.
Wallach, H. M., I. Murray, R. Salakhutdinov, and D. Mimno (2009). Evaluation methods for topic
models. In Proceedings of the 26th Annual International Conference on Machine Learning, pp.
1105–1112. ACM.
Wang, C. and D. M. Blei (2011). Collaborative topic modeling for recommending scientific arti-
cles. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discov-
ery and data mining, pp. 448–456. ACM.
Wang, P., M. W. Berry, and Y. Yang (2003). Mining longitudinal web queries: Trends and patterns.
Journal of the American Society for Information Science and Technology 54(8), 743–758.
Wu, W.-C., D. Kelly, and A. Sud (2014). Using information scent and need for cognition to under-
stand online search behavior. In Proceedings of the 37th international ACM SIGIR conference
on Research & development in information retrieval, pp. 557–566. ACM.
Yang, L., O. Toubia, and M. G. De Jong (2015). A bounded rationality model of information
search and choice in preference measurement. Journal of Marketing Research 52(2), 166–183.
Tables
Table 1: Search Tasks
Ski
1
A family is planning to take a vacation at a ski resort in Vermont. They are looking for a small resort that is suitable for family and children. The resort should have plenty of trails for beginner skiers, and also a ski school for kids. There should be lesson package deals including all-day lift tickets, ski rental, lessons, and so on. At the minimum, the resort should have an on-site ski rental shop, and offer some kind of discounts.
2
David, a banker, is planning to take a vacation at a ski resort in Vermont. He is looking for an exclusive resort that could offer a variety of terrains for intermediate and advanced skiers. Specifically, the mountain should be large and the slopes should be somewhat difficult. In addition, the resort should offer other activities such as snow tubing, and amenities such as a lounge and spa treatments.
Printer
3
Jessica, a college student, wants to purchase a budget printer for school work. The printer should be able to print, copy, and scan. Double-sided printing would also be attractive to her. Color printing is not required, as she will mostly print black and white. In addition, the printer should print fast with low noise, and the running cost should be low.
4
A family wants to purchase a small printer designed for home users who want lab-quality photos. They want to be able to connect the printer to Wi-Fi and smart phones, and it should also be able to print photos without the use of a computer. As the printer will be used very frequently, the family is willing to pay slightly more for a printer which is cheaper to run in the long term.
Car
5
A family wants to buy a new car that could provide more generous space for seating and cargo than their old compact sedan. The family's budget is $25,000. They want the new car to be safe, reliable, economical, and fuel-efficient. It should have a 4-cylinder engine and a high EPA mileage. The car should also handle snow and ice well.
6
Catherine wants to buy a small car, in order to save money on gas, insurance, and maintenance. She also wants to be able to park more easily in a big city. Her price range is between $10,000 and $14,000. She wants the car to be attractive, stylish, fun, and practical. Despite its small size, the car should still be safe and should offer a comfortable ride.
Laptop
7
Mike, a college student, wants to buy a new laptop. In addition to school work, the laptop should provide good performance for playing games. It should have at least an Intel Core i5 CPU, 8GB of RAM, a good graphics card, and a larger screen. Mike prefers Windows and Linux systems, because of their flexibility and wide options for programs. Mike's price range is between $800 and $1,000.
8
Mike, a consultant, wants to buy a laptop for work and traveling. Mike's budget is $600. He will mostly use the laptop for internet, Word, PowerPoint, and emails. He needs a laptop with enough speed, a good display, very long battery life, small size, and light weight. Also, the laptop should be durable enough to handle the pressure or dropping that may often happen during traveling.
Camera
9
A couple wants to buy a camera for their 9-year-old son. The camera should be simple to use, and easy to carry anywhere. The camera should have a large viewing screen or touchscreen. Its picture quality should be very good. More importantly, the camera should be sturdy and be able to withstand falls. It should also be waterproof, so that it can be used underwater. The couple prefers a camera in the price range of $100–$200.
10
Kevin, a beginner photographer, wants to buy his first digital SLR camera. Kevin is looking for a model in the mid-price range, or alternatively a package kit that includes the body, lenses, and tripod. The camera should be able to shoot both JPEG and RAW files. It should also come with a Wi-Fi adapter, which makes it easier to quickly share images through a laptop or phone.
Table 2: Descriptive Statistics of Users’ Search Queries
Category  Task  #users  #unique  #unique  Edit distance   #queries     #words        Overlap w.
                        queries  words                    per user     per query     description
Ski       1     99      173      111      31.41 (21.35)   2.13 (1.56)  5.36 (2.62)   .76 (.25)
          2     97      187      118      33.26 (26.52)   2.54 (1.94)  5.22 (3.14)   .75 (.28)
Printer   3     97      295      183      32.56 (14.73)   3.47 (2.93)  5.93 (3.14)   .57 (.32)
          4     97      204      162      30.82 (25.94)   2.51 (2.02)  4.67 (2.19)   .53 (.30)
Car       5     97      375      262      35.71 (20.61)   4.68 (2.93)  5.91 (3.31)   .56 (.36)
          6     99      368      244      24.53 (16.86)   2.51 (2.02)  4.15 (1.88)   .42 (.36)
Laptop    7     98      338      243      32.12 (15.36)   4.06 (4.17)  6.43 (3.29)   .58 (.36)
          8     95      292      250      30.54 (26.90)   3.67 (3.03)  4.34 (2.27)   .43 (.37)
Camera    9     98      243      214      31.13 (18.89)   2.95 (2.51)  4.71 (2.22)   .44 (.29)
          10    95      271      141      28.65 (16.61)   3.66 (2.78)  5.28 (2.81)   .68 (.34)
Note: we report the sample average with the standard deviation in parentheses. The edit distance is the minimum number ofoperations required to transform one query into the other. Larger values indicate lower similarity. We compute the edit distancebetween all pairs of queries (where queries are pooled together across users) for the same task. The last column reports theproportion of words in a user’s query that appear in the task description.
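The edit distance reported in Table 2 is the standard Levenshtein distance between query strings; a minimal sketch of the dynamic-programming computation:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of insertions, deletions,
    and substitutions needed to transform string a into string b."""
    m, n = len(a), len(b)
    # dp[i][j] holds the distance between prefixes a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]
```

Larger values indicate lower similarity between the two queries.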
Table 3: Descriptive Statistics of the Corpora
                       Ski           Printer       Car           Laptop        Camera
Vocabulary size        1,709         3,258         3,128         3,321         3,309
Query
  Unique queries       351           495           749           631           515
  Words per query      4.62 (2.31)   3.84 (2.01)   3.83 (2.14)   4.50 (2.87)   4.14 (2.03)
Page
  Unique webpages      1,167         1,984         4,253         3,042         2,238
  Words per webpage    258 (278)     366 (341)     559 (697)     736 (1862)    444 (525)
Label
  Query-page pairs     3,636         4,928         7,851         6,454         5,049
  Webpages per query   10.36 (1.89)  9.96 (2.00)   10.48 (5.61)  10.23 (2.82)  9.80 (2.96)
  Queries per webpage  3.12 (6.51)   2.48 (4.41)   1.85 (2.72)   2.12 (4.22)   2.26 (3.84)
Note: We report the average across all participants, with standard deviations in parentheses.
Table 4: DIC
Model Ski Printer Car Laptop Camera
HDLDA 3,007,214 7,840,806 26,512,580 23,812,942 11,265,738
Basic LDA 3,168,291 8,106,714 27,326,572 24,476,162 11,456,171
Flexible LDA 3,151,762 8,044,257 27,251,737 24,599,756 11,348,911
Note: Basic LDA assumes a fixed prior on θp. Flexible LDA assumes a flexible prior on θp,which is estimated by the model.
Table 5: Topics Extracted from HDLDA
Category  Topic  Examples of Relevant Words                      Example of Query                            Example of Webpage
Ski       1      lesson, kids, family, rental, beginner          "family ski vermont and ski school"         homepage of the ski resort Mad Driver Mountain
          2      hotel, spa, reviews, luxury, exclusive          "ski vermont spa"                           exclusive ski package from Killington on TripAdvisor
Printer   1      photo, small, ink, wifi, quality                "best wireless lab quality printers"        PIXMA iP4000R photo printer on Canon U.S.A.
          2      wireless, laser, staples, scan, black           "buy best printer"                          Brother wireless all-in-one printer on Amazon
          3      adobe, student, college, double, sided          "student printer"                           on-campus student printing service info
Car       1      miles, engine, play, transmission, sedan        "honda sedan $10000 to $14000"              a list of safest cars by Forbes
          2      Nissan, Volkswagen, safety, jeep, SUV           "safety ratings toyota rav 4"               a list of luxury crossover SUVs on USNews
          3      year, insurance, home, sales, gas               "buy a car online"                          a used Honda Accord on Cargurus.com
Laptop    1      amazon, shipping, accessories, price, buy       "buy windows laptop"                        best laptops of 2015 on CNET.com
          2      lenovo, play, thinkpad, review, ideapad         "lenovo ideapad y50"                        laptop reviews on lenovo.com
          3      battery, business, screen, performance, gaming  "durable laptop budget long battery life"   best business laptops on BusinessInsider.com
          4      intel, cpu, core, memory, ram                   "intel core i5 cpu"                         gaming laptop on CyberPowerPC.com
Camera    1      waterproof, kids, screen, touch, tablet         "kid friendly waterproof camera"            Polaroid waterproof digital camera on Kmart
          2      lens, dslr, ISO, compact, shot                  "nikon d3200 bundle"                        Canon EOS 60D review on Camera Labs
Table 6: Average θq and θp for Each Category
             Topic 1  Topic 2  Topic 3  Topic 4
Ski      θq  0.65     0.35
         θp  0.63     0.37
Printer  θq  0.45     0.28     0.26
         θp  0.41     0.24     0.34
Car      θq  0.31     0.37     0.32
         θp  0.27     0.45     0.28
Laptop   θq  0.26     0.18     0.30     0.25
         θp  0.33     0.09     0.45     0.12
Camera   θq  0.42     0.58
         θp  0.46     0.54
Table 7: Posterior Estimates of R
Ski      −0.58 (0.05)   −0.21 (0.04)
          2.20 (0.10)    0.20 (0.06)

Printer  −0.23 (0.08)   −0.16 (0.04)   −0.97 (0.07)
          1.18 (0.06)   −0.42 (0.07)   −0.25 (0.03)
         −0.47 (0.03)   −0.27 (0.05)    1.77 (0.04)

Car       0.76 (0.06)    2.24 (0.08)    0.20 (0.04)
          0.18 (0.06)   −0.22 (0.07)   −0.58 (0.03)
         −0.74 (0.03)   −0.38 (0.05)    0.32 (0.04)

Camera   −0.25 (0.05)    1.28 (0.09)
          0.30 (0.05)   −0.48 (0.04)

Laptop    1.07 (0.06)    0.31 (0.04)    3.38 (0.06)    0.49 (0.05)
          1.93 (0.11)   −0.63 (0.05)   −0.30 (0.11)   −0.16 (0.07)
         −0.60 (0.05)   −0.84 (0.03)   −0.45 (0.08)   −0.60 (0.05)
         −0.05 (0.07)   −0.69 (0.03)    0.39 (0.09)   −0.65 (0.05)

Note: for the R matrix in each category, the element in position (k, k′) is the effect of topic k
in search queries on topic k′ in the search results. The value in parentheses is the corresponding
standard deviation across posterior draws. All estimates are statistically significant based on the
95% posterior credible interval.
Table 8: Posterior Estimates of the Topic Distribution ofTask Descriptions
Category  Task  Topic 1  Topic 2  Topic 3  Topic 4
Ski       1     0.95     0.05
          2     0.77     0.23
Printer   3     0.53     0.16     0.30
          4     0.75     0.07     0.18
Car       5     0.13     0.32     0.55
          6     0.19     0.19     0.62
Laptop    7     0.11     0.04     0.60     0.25
          8     0.07     0.04     0.86     0.03
Camera    9     0.77     0.23
          10    0.26     0.74
Table 9: Estimating Preferences from Queries
Task Description Chosen Webpage
Estimation Method Cosine Similarity Perplexity Cosine Similarity Perplexity
Leverage semantic mapping: βi = E(θp|θq) 0.844 316.2 0.812 352.5
Don’t leverage semantic mapping: βi = θq 0.817 358.2 0.772 402.1
Note: all comparisons are statistically significant (p-value < 0.05). We evaluate the estimated preferences based on both their
cosine similarity and their perplexity with respect to the "true" preferences. The truth is measured either by the topic distribution
of the search task description, or by the topic distribution of the webpage chosen by the participant. Larger cosine similarity
indicates better performance, while smaller perplexity indicates better performance.
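One plausible implementation of the two evaluation metrics, assuming the estimate and the benchmark are both K-dimensional topic distributions (the paper's exact perplexity computation may differ, e.g., it could be taken over held-out words):

```python
import numpy as np

def cosine_similarity(beta, truth):
    """Cosine similarity between an estimated and a benchmark topic vector."""
    return float(beta @ truth / (np.linalg.norm(beta) * np.linalg.norm(truth)))

def perplexity(beta, truth):
    """Perplexity of the benchmark under the estimate: the exponential of
    the cross-entropy -sum_k truth_k * log(beta_k)."""
    return float(np.exp(-(truth * np.log(beta)).sum()))
```

A perfectly matched pair of distributions gives cosine similarity 1 and perplexity equal to the exponential of the benchmark's entropy.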
Figures
Figure 1: The Graphical Model of HDLDA
Figure 2: The Interface of Hoogle
Figure 3: CTR as a Function of Position
Note: CTR is normalized to sum to 1 across positions.
Figure 4: Agreement Level of the Top 5 Most Chosen Links for Each Task
Note: for each URL, its agreement level is defined as the number of participants who picked that URL as their chosen link.
Figure 5: Word Cloud of Two Topics in Camera
Parts (a) and (b) show the simulated content for Topic 1 and Topic 2 in the camera category. The size of each word is proportional to the exponential of its relevance.
Appendix A: Inference Algorithm for HDLDA
This appendix describes the inference algorithm for HDLDA. Across the collection of $Q$ queries, let $n^k_{q,j}$ be the number of word tokens in the $q$th query that are the $j$th word in the vocabulary and are assigned to the $k$th topic; mathematically, $n^k_{q,j} = \sum_{i=1}^{J_q} I(z_{qi} = k \wedge w_{qi} = j)$. Hence, $n^k_{q,j}$ is indexed along three dimensions: query, word, and topic. Summing these counts across words within a query gives $N^k_q$, the number of word tokens assigned to topic $k$ in query $q$. Summing the counts across queries gives $N^k_j$, the number of word tokens across all queries that are the $j$th word in the vocabulary and are assigned to the $k$th topic. Across the collection of $P$ webpages, we define $m^k_{p,j}$ analogously as the number of word tokens in the $p$th webpage that are the $j$th word in the vocabulary and are assigned to the $k$th topic, i.e., $m^k_{p,j} = \sum_{i=1}^{J_p} I(z_{pi} = k \wedge w_{pi} = j)$. We similarly define the sums across words and across webpages, denoted $M^k_p$ and $M^k_j$ respectively.
Gibbs Sampler for Assignments $z_p$ and $z_q$. Given the topic distribution $\theta_q$ and the word distributions $\{\varphi_k\}$, the posterior distribution of each $z_{qj}$ is:

$$\Pr(z_{qj} = k \mid w_{qj}, \theta_q, \{\varphi_k\}) = \frac{\Pr(w_{qj} \mid z_{qj} = k, \varphi_k)\Pr(z_{qj} = k \mid \theta_q)}{\sum_{i=1}^{K} \Pr(w_{qj} \mid z_{qj} = i, \varphi_i)\Pr(z_{qj} = i \mid \theta_q)} = \frac{\varphi_{k,w_{qj}}\,\theta_{qk}}{\sum_{i=1}^{K} \varphi_{i,w_{qj}}\,\theta_{qi}} \quad (10)$$

Similarly, the posterior distribution of the assignment $z_{pi}$ depends on the data $w_{pi}$ and the latent distributions $\{\varphi_k\}$ and $\theta_p$:

$$\Pr(z_{pi} = k \mid w_{pi}, \theta_p, \{\varphi_k\}) = \frac{\varphi_{k,w_{pi}}\,\theta_{pk}}{\sum_{j=1}^{K} \varphi_{j,w_{pi}}\,\theta_{pj}} \quad (11)$$
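These conditionals can be sampled directly; a minimal numpy sketch (not the authors' implementation; the function and variable names are illustrative):

```python
import numpy as np

def sample_assignments(word_ids, theta, phi, rng):
    """Draw topic assignments for one document (a query or a webpage).

    word_ids : array of vocabulary indices, one per token
    theta    : (K,) topic distribution of the document
    phi      : (K, J) topic-word distributions
    Returns one sampled topic index per token.
    """
    # Unnormalized posterior for each token w: phi[k, w] * theta[k]
    probs = phi[:, word_ids] * theta[:, None]        # (K, n_tokens)
    probs /= probs.sum(axis=0, keepdims=True)        # normalize over topics
    # Sample each token's topic from its categorical posterior
    return np.array([rng.choice(len(theta), p=probs[:, i])
                     for i in range(len(word_ids))])
```

The same function serves both queries and webpages, since Equations (10) and (11) have the same form.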
Gibbs Sampler for $\varphi$ and $\theta_p$. The posterior of each topic $\varphi_k$ in the collection remains a Dirichlet, and depends only on the latent assignments and the data, including both queries and webpages:

$$\Pr(\varphi_k \mid \{z_p\}, \{z_q\}, \{w_q\}, \{w_p\}, \eta) = \text{Dir}_J\big(\eta + N^k_1 + M^k_1, \ldots, \eta + N^k_J + M^k_J\big) \quad (12)$$

The posterior distribution of the topic distribution $\theta_p$ for each webpage $p$ depends only on its latent assignments and its prior. This distribution is also available in closed form, conditional on the semantic relationship matrix $R$ and $\{\theta_q\}$:

$$\Pr(\theta_p \mid z_p, R, \{\theta_q\}, l_p) = \text{Dir}_K\big(\exp(R_1^T \theta_{q(p)}) + M^1_p, \ldots, \exp(R_K^T \theta_{q(p)}) + M^K_p\big) \quad (13)$$
Metropolis-Hastings Algorithm for $\theta_q$. The posterior distribution of the topic distribution $\theta_q$ for each query $q$ is non-conjugate. Indeed, it depends not only on the latent assignments $z_q$, but also on the topic distributions of the webpages that can be retrieved by query $q$. We use Metropolis-within-Gibbs, and apply the iteration sequentially to each $q$:

$$\Pr(\theta_q \mid z_q, \{\theta_p\}, \{\theta_{-q}\}, R, \{l_{pq}\}, \alpha) \propto \Pr(\{\theta_p\} \mid \theta_q, \theta_{-q}, R, \{l_{pq}\})\,\Pr(z_q \mid \theta_q)\,\Pr(\theta_q \mid \alpha)$$
$$\propto \prod_{p:\, l_{pq}=1} \text{Dir}_K\left(\theta_p \,\middle|\, \left\{\exp\left(R_k^T \frac{\sum_{q' \in Q} \theta_{q'} l_{pq'}}{\sum_{q' \in Q} l_{pq'}}\right)\right\}_{k=1}^{K}\right) \times \text{Dir}_K\big(\theta_q \mid \alpha + N^1_q, \ldots, \alpha + N^K_q\big) \quad (14)$$

Here, we apply an adaptive proposal distribution (with vanishing adaptation) within a random-walk Metropolis-Hastings algorithm to sample each $\theta_q$ (Andrieu and Thoms, 2008). The proposal distribution is Dirichlet, $\theta_q^{(t)} \sim \text{Dir}(\sigma_{q,t}\,\theta_q^{(t-1)})$. We adaptively choose $\sigma_{q,t}$ for each $\theta_q$ to attain a target acceptance rate, while preserving the convergence of the Markov chain, via the Robbins-Monro algorithm:

$$\sigma_{q,t} = \sigma_{q,t-1}\exp\big((a^* - a_{q,t-1})/t^{\delta}\big)$$

where $a_{q,t-1}$ is the acceptance rate for $\theta_q$ at iteration $t-1$; $a^*$ is the target acceptance rate, usually set to 0.23 for large problems (Gelman and Meng, 1996); and $\delta \in (0,1]$ controls the decay rate of the adaptation. Intuitively, the proposal precision increases if the current acceptance rate is below the target rate. Note that because the proposal distribution is asymmetric, the acceptance ratio for $\theta_q$ at each iteration $t$ is:

$$r_{qt} = \min\left\{\frac{L(\theta_q^{(t)})\, f\big(\theta_q^{(t-1)} \mid \sigma_{q,t}\,\theta_q^{(t)}\big)}{L(\theta_q^{(t-1)})\, f\big(\theta_q^{(t)} \mid \sigma_{q,t}\,\theta_q^{(t-1)}\big)},\, 1\right\}$$

where $f(x \mid y)$ denotes the density of the Dirichlet distribution $\text{Dir}(y)$ at $x$. Based on our empirical study, we find that for a small corpus, even a random-walk Metropolis-Hastings algorithm without adaptation converges quite well when sampling $\theta_q$.
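One adaptive Metropolis-within-Gibbs update of this kind can be sketched as follows (a minimal illustration with a generic log-posterior, not the authors' code; the function name and defaults are assumptions):

```python
import math

import numpy as np
from scipy.stats import dirichlet

def adaptive_mh_step(theta, log_post, sigma, accept_rate, t, rng,
                     target=0.23, delta=0.6):
    """One random-walk MH step with a Dirichlet proposal Dir(sigma * theta).

    theta       : current point on the simplex (current theta_q)
    log_post    : callable returning the log posterior density at a point
    sigma       : current proposal precision sigma_{q,t-1}
    accept_rate : acceptance rate so far (a_{q,t-1}), used for adaptation
    Returns (new_theta, new_sigma, accepted).
    """
    prop = rng.dirichlet(sigma * theta)
    # The Dirichlet random walk is asymmetric, so the acceptance ratio
    # includes both the forward and the reverse proposal densities.
    log_r = (log_post(prop) + dirichlet.logpdf(theta, sigma * prop)
             - log_post(theta) - dirichlet.logpdf(prop, sigma * theta))
    accepted = math.log(rng.random()) < log_r
    new_theta = prop if accepted else theta
    # Robbins-Monro update: raise the precision (smaller steps) when the
    # acceptance rate is below the target, with vanishing adaptation.
    new_sigma = sigma * math.exp((target - accept_rate) / t ** delta)
    return new_theta, new_sigma, accepted
```

In the full sampler, `log_post` would evaluate the right-hand side of Equation (14) for the query being updated.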
Maximizing R in the Dirichlet-Multinomial Regression Model. The parameter $R$ controls the relationship between $\theta_q$ and $\theta_p$, which is captured by the Dirichlet-multinomial regression model in Equation (3). We estimate $R$ by optimizing the full log-likelihood of the model (Mimno and McCallum, 2012):

$$l(R) = \sum_{p=1}^{P}\left(\log\Gamma\Big(\sum_k \exp(R_k^T \theta_{q(p)})\Big) - \sum_k\Big[\log\Gamma\big(\exp(R_k^T \theta_{q(p)})\big) - \big(\exp(R_k^T \theta_{q(p)}) - 1\big)\log(\theta_{pk})\Big]\right)$$

The derivative of this log-likelihood with respect to $r_{tk}$ is:

$$\frac{\partial l(R)}{\partial r_{tk}} = \sum_{p=1}^{P} \theta_{q(p),t}\,\exp(R_k^T \theta_{q(p)})\left(\Psi\Big(\sum_j \exp(R_j^T \theta_{q(p)})\Big) - \Psi\big(\exp(R_k^T \theta_{q(p)})\big) + \log(\theta_{pk})\right)$$

where $\Psi(\cdot)$ denotes the digamma function, defined as the logarithmic derivative of the gamma function, $\Psi(x) = \frac{d}{dx}\log\Gamma(x)$. The optimization problem can become difficult when $K$ is large. Our implementation is based on standard BFGS optimization, as this method has been shown to be fast, robust, and reliable in practice.
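This M-step can be implemented with scipy's BFGS, supplying the log-likelihood and its gradient; a sketch (not the authors' code; the indexing follows the Table 7 convention that element $(t, k)$ of $R$ maps query topic $t$ to page topic $k$):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, psi

def neg_log_lik(r_flat, theta_qbar, theta_p, K):
    """Negative log-likelihood l(R) of the Dirichlet-multinomial regression.

    theta_qbar : (P, K) average query-topic distribution for each page
    theta_p    : (P, K) page-topic distributions
    """
    R = r_flat.reshape(K, K)
    A = np.exp(theta_qbar @ R)                    # (P, K) Dirichlet parameters
    ll = (gammaln(A.sum(axis=1))                  # log Gamma(sum_k a_pk)
          - gammaln(A).sum(axis=1)                # - sum_k log Gamma(a_pk)
          + ((A - 1.0) * np.log(theta_p)).sum(axis=1)).sum()
    return -ll

def neg_grad(r_flat, theta_qbar, theta_p, K):
    """Gradient of the negative log-likelihood, matching the derivative above."""
    R = r_flat.reshape(K, K)
    A = np.exp(theta_qbar @ R)
    inner = psi(A.sum(axis=1, keepdims=True)) - psi(A) + np.log(theta_p)
    return -(theta_qbar.T @ (A * inner)).ravel()

def estimate_R(theta_qbar, theta_p, K):
    """Maximize l(R) with BFGS, starting from R = 0."""
    res = minimize(neg_log_lik, np.zeros(K * K), jac=neg_grad,
                   args=(theta_qbar, theta_p, K), method="BFGS")
    return res.x.reshape(K, K)
```

Supplying the analytic gradient (`jac`) avoids finite-difference evaluations inside BFGS, which matters when $K$ is large.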
Appendix B: Simulation Study
This appendix presents a synthetic-data analysis of HDLDA. We first describe the data generation process of the model and the parameterization of our simulation study. We then summarize the estimation results based on the inference algorithm presented in Appendix A.
B1. Data Generation
For a given set of parameters $\{K, Q, P, J, \{J_q\}_q, \{J_p\}_p, \{l_{pq}\}_{p,q}, \alpha, \eta, R\}$, the following procedure describes the data generative process for HDLDA:

1. For each topic $k = 1, 2, \ldots, K$, draw a distribution over words: $\varphi_k \mid \eta \sim \text{Dirichlet}_J(\eta)$

2. For each query $q = 1, 2, \ldots, Q$:

   (a) Draw the topic distribution $\theta_q \mid \alpha \sim \text{Dirichlet}_K(\alpha)$

   (b) For $j = 1, 2, \ldots, J_q$:

       i. Draw the topic assignment $z_{qj} \mid \theta_q \sim \text{Category}(\theta_q)$

       ii. Draw the word $w_{qj} \mid (z_{qj}, \{\varphi_k\}) \sim \text{Category}(\varphi_{z_{qj}})$

3. For each webpage $p = 1, 2, \ldots, P$:

   (a) Compute the average topic distribution among its labeling queries: $\theta_{q(p)} = \frac{\sum_q \theta_q l_{pq}}{\sum_q l_{pq}}$

   (b) Draw the topic distribution $\theta_p \mid (R, \theta_{q(p)}) \sim \text{Dirichlet}_K\big(\exp(R_1^T \theta_{q(p)}), \ldots, \exp(R_K^T \theta_{q(p)})\big)$

   (c) For $i = 1, 2, \ldots, J_p$:

       i. Draw the topic assignment $z_{pi} \mid \theta_p \sim \text{Category}(\theta_p)$

       ii. Draw the word $w_{pi} \mid (z_{pi}, \{\varphi_k\}) \sim \text{Category}(\varphi_{z_{pi}})$
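The generative process above can be sketched in a few lines of numpy (a minimal illustration with small default lengths; argument names are assumptions, not the authors' code):

```python
import numpy as np

def generate_corpus(K, Q, P, J, L, R, alpha, eta, rng,
                    query_len=(2, 20), page_len=(300, 600)):
    """Simulate an HDLDA corpus.

    L : (P, Q) binary retrieval matrix; L[p, q] = 1 if query q retrieves page p.
    """
    phi = rng.dirichlet(np.full(J, eta), size=K)           # topic-word dists
    theta_q = rng.dirichlet(np.full(K, alpha), size=Q)     # query-topic dists
    queries = []
    for q in range(Q):
        Jq = rng.integers(query_len[0], query_len[1] + 1)
        z = rng.choice(K, size=Jq, p=theta_q[q])           # per-token topics
        queries.append(np.array([rng.choice(J, p=phi[k]) for k in z]))
    theta_p, pages = np.empty((P, K)), []
    for p in range(P):
        qbar = L[p] @ theta_q / L[p].sum()                 # avg query topics
        theta_p[p] = rng.dirichlet(np.exp(R.T @ qbar))     # step 3(b)
        Jp = rng.integers(page_len[0], page_len[1] + 1)
        z = rng.choice(K, size=Jp, p=theta_p[p])
        pages.append(np.array([rng.choice(J, p=phi[k]) for k in z]))
    return phi, theta_q, queries, theta_p, pages
```

Queries are drawn before webpages because each webpage's prior depends on the average topic distribution of its labeling queries.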
We simulate a dataset with a structure similar to the real search data collected in the experimental study described in Section 4. Specifically, we set $K = 3$, $Q = 800$, $P = 4{,}000$, and $J = 2{,}000$. For $q = 1, 2, \ldots, Q$, we draw the integer $J_q$ uniformly at random from $[2, 20]$; for $p = 1, 2, \ldots, P$, we draw the integer $J_p$ uniformly at random from $[300, 600]$. We draw $\{l_{pq}\}$ so that each query retrieves ten pages, and each page is retrieved by at least one query (the mean is 2.00, with a standard deviation of 1.02). For the semantic mapping matrix $R$, we set all diagonal elements to 0.8 and all off-diagonal elements to 0.4. We set $\alpha = 1$ and $\eta = 0.01$. All other parameters are generated according to the process described above.

We calibrate HDLDA on this simulated dataset using the inference algorithm described in Appendix A. The model parameters include about 1.79 million latent word assignments $z$, 6,000 parameters in $\{\varphi_k\}$, 12,000 parameters in $\{\theta_p\}$, 2,400 parameters in $\{\theta_q\}$, and 9 parameters in $R$. We run 10,000 MCMC iterations and use the first 5,000 as burn-in.
B2. Simulation Results
Before presenting the estimation results, we provide some background on the identification of topic models (especially LDA) in general. This background is important in forming reasonable expectations of what, and how accurately, HDLDA can recover. Though topic modeling has proved successful for the automatic comprehension and classification of data, only recent work has attempted to give provable guarantees for the problem of learning the model parameters (Arora et al., 2012; Anandkumar et al., 2012). The problem of recovering nonnegative matrices $\varphi$ (topics) and $\theta$ (topic distributions) with small inner dimension $K$ (number of topics) is NP-hard (Arora et al., 2012). As a solution, recent work has relied on very strong assumptions about the corpus, e.g., restricting each document to a single topic, or assuming each topic has words that appear only in that topic. At best, $\varphi$ can be recovered only up to permutations (Anandkumar et al., 2012). In addition, according to Arora et al. (2012), it is impossible to learn the topic distribution matrix $\theta$ to within arbitrary accuracy, and this holds even if $\varphi$ and the distribution from which $\theta$ is generated were known.
Given this background, one should not empirically expect all parameters of a topic model to be recovered. As an example, we test the well-known Gibbs sampler on a basic LDA, using multiple simulated corpora of similar size to the one described above. We find that about 90% of the topic parameters $\varphi$ are covered by the 95% credible interval (CI), and about 80% of the topic proportions $\theta$ are covered by the 95% CI. Given the complexity of the inference algorithm for HDLDA, one should not expect its recovery to exceed that of the Gibbs sampler for a basic LDA.

As measures of recovery performance, we report the proportion of parameters covered by the posterior 95% CI, and the mean squared error (MSE) between the true and estimated parameters. Table A gives the details for $\{\varphi_k\}$, $\{\theta_p\}$, and $\{\theta_q\}$ respectively. Recovery is very good for both $\{\varphi_k\}$ and $\{\theta_p\}$, and decent for $\{\theta_q\}$. Finally, we report the posterior estimates and the 95% CIs for all parameters in $R$ in Table B. Note that because the MLE for the Dirichlet-multinomial regression has significant bias when the sample size is not large enough,¹ we would not expect the estimates of $R$ from the stochastic EM algorithm to outperform the MLE of $R$ computed from the true data (Nielsen, 2000). That is, the best the inference algorithm can achieve in estimating $R$ is to recover its MLE, denoted $R_{MLE}$, which is estimated using the true $\{\theta_p\}$ and $\{\theta_q\}$ and is also reported in Table B. For the true $R$, 6 of the 9 parameters are covered by the 95% CI; this number increases to 8 when recovering $R_{MLE}$.
Table A: Simulation Results for {ϕk, θp, θq}
Parameters Coverage MSE
{ϕk} 90.12% 1.06e-09
{θp} 89.77% 0.0006
{θq} 74.67% 0.0563
¹For example, when we increase the number of webpages from 4,000 to 40,000, the MLE of $R$ for the Dirichlet-multinomial regression model using the true $\{\theta_p\}$ and $\{\theta_q\}$ has much smaller bias. Specifically, the MSE decreases from 0.0111 to 0.0006.
Table B: Simulation Results for R
Parameters R RMLE Posterior Estimate 95% CI
r11 0.8 0.841 0.622 [0.547, 0.723]
r21 0.4 0.337 0.410 [0.307, 0.487]
r31 0.4 0.369 0.442 [0.349, 0.537]
r12 0.4 0.439 0.403 [0.339, 0.463]
r22 0.8 0.810 0.780 [0.685, 0.839]
r32 0.4 0.338 0.351 [0.293, 0.426]
r13 0.4 0.413 0.365 [0.262, 0.389]
r23 0.4 0.429 0.583 [0.510, 0.672]
r33 0.8 0.722 0.698 [0.584, 0.798]