
Search Personalization using Machine Learning

Hema Yoganarasimhan ∗

University of Washington

Abstract

Query-based search is commonly used by many businesses to help consumers find information/products on their websites. Examples include search engines (Google, Bing), online retailers (Amazon, Macy's), and entertainment sites (Hulu, YouTube). Nevertheless, a significant portion of search sessions are unsuccessful, i.e., do not provide the information that the user was looking for. We present a machine learning framework that improves the quality of search results through automated personalization based on a user's search history. Our framework consists of three modules – (a) Feature generation, (b) NDCG-based LambdaMART algorithm, and (c) Feature selection wrapper. We estimate our framework on large-scale data from a leading search engine using Amazon EC2 servers. We show that our framework offers a significant improvement in search quality compared to non-personalized results. We also show that the returns to personalization are monotonically, but concavely, increasing with the length of user history. Next, we find that personalization based on short-term history or "within-session" behavior is less valuable than long-term or "across-session" personalization. We also derive the value of different feature sets – user-specific features contribute over 50% of the improvement and click-specific features over 28%. Finally, we demonstrate scalability to big data and derive the set of optimal features that maximize accuracy while minimizing computing costs.

Keywords: marketing, online search, personalization, big data, machine learning

∗ The author is grateful to Yandex for providing the data used in this paper. She thanks Brett Gordon for detailed comments on the paper. Thanks are also due to the participants of the 2014 UT Dallas FORMS, Marketing Science, and Big Data and Marketing Analytics conferences for feedback. Please address all correspondence to: [email protected].


1 Introduction

1.1 Online Search and Personalization

As the Internet has matured, the amount of information and products available on individual websites has grown exponentially. Today, Google indexes over 30 trillion webpages (Google, 2014), Amazon sells 232 million products (Grey, 2013), Netflix streams more than 10,000 titles (InstantWatcher, 2014), and over 100 hours of video are uploaded to YouTube every minute (YouTube, 2014). While unprecedented growth in product assortments gives an abundance of choice to consumers, it nevertheless makes it hard for them to locate the exact product or information they are looking for. To address this issue, most businesses today use a query-based search model to help consumers locate the product/information that best fits their needs. Consumers enter queries (or keywords) in a search box, and the firm presents a set of products/information that is deemed most relevant to the query.

We focus on the most common class of search – Internet search engines, which play an important role in today's digital economy. In the U.S. alone, over 18 billion search queries are made every month, and search advertising generates over $4 billion in quarterly revenues (Silverman, 2013; comScore, 2014). Consumers use search engines for a variety of reasons; e.g., to gather information on a topic from authoritative sources, to remain up-to-date with breaking news, and to aid purchase decisions. Search results and their relative ranking can therefore have significant implications for consumers, businesses, governments, and for the search engines themselves.

However, a significant portion of searches are unsuccessful, i.e., they do not provide the information that the user was looking for. According to some experts, this is because even when two users search using the same keywords, they are often looking for different information. Thus, a standardized ranking of results across users may not serve all users well. To address this concern, Google launched personalized search a few years back, customizing results based on a user's IP address and cookie (individual-level search and click data), using up to 180 days of history (Google Official Blog, 2009). Since then, other search engines such as Bing and Yandex have also experimented with personalization (Sullivan, 2011; Yandex, 2013). See Figure 1 for an example of how personalization can influence search results.

However, the extent and effectiveness of search personalization has been a source of intense debate, partly fueled by the secrecy of search engines on how they implement personalization (Schwartz, 2012; Hannak et al., 2013).1 Indeed, some researchers have even questioned the value of personalizing search results, not only in the context of search engines, but also in other settings (Feuz et al., 2011; Ghose et al., 2014).

1 In Google's case, the only source of information on its personalization strategy is a post in its official blog, which suggests that whether a user previously clicked a page and the temporal order of a user's queries influence her personalized rankings (Singhal, 2011). In Yandex's case, all that we know comes from a quote by one of its executives, Mr. Bakunov – "Our testing and research shows that our users appreciate an appropriate level of personalization and increase their use of Yandex as a result, so it would be fair to say it can only have a positive impact on our users." (Atkins-Kruger, 2012).


Figure 1: An example of personalization. The right panel shows results with personalization turned off by using the Chrome browser in a fresh incognito session. The left panel shows personalized results. The fifth result in the left panel, from the domain babycenter.com, does not appear in the right panel. This is because the user in the left panel had previously visited this URL and domain multiple times in the past few days.

In fact, this debate on the returns to search personalization is part of a larger one on the returns to personalization of any marketing mix variable (see §2 for details). Given the privacy implications of storing long user histories, both consumers and managers need to understand the value of personalization.

1.2 Research Agenda

In this paper, we examine the returns to personalization in the context of Internet search. We address three managerially relevant research questions:

1. Can we provide a general framework to implement personalization in the field and evaluate the gains from it? In principle, if consumers' tastes and behavior show persistence over time in a given setting, personalization using data on their past behavior should improve outcomes. However, we do not know the degree of persistence in consumers' tastes/behavior a priori, and the extent to which personalization can improve the search experience is an empirical question.

2. Gains from personalization depend on a large number of factors that are under the manager's control. These include the length of user history, the different attributes of data used in the empirical framework, and whether data across multiple sessions is used or not. We examine the impact of these factors on the effectiveness of personalization. Specifically:

(a) How does the improvement from personalization vary with the length of user history? For example, how much better can we do if we have data on the past 100 queries of a user compared to only her past 50 queries?

(b) Which attributes of the data, or features, lead to better results from personalization? For example, which of these metrics is more effective – a user's propensity to click on the first result or her propensity to click on previously clicked results? Feature design is an important aspect of any ranking problem and is therefore treated as a source of competitive advantage and kept under wraps by most search engines (Liu, 2011). However, all firms that use search would benefit from understanding the relative usefulness of different types of features.

(c) What is the relative value of personalizing based on short-run data (a single session) vs. long-run data (multiple sessions)? On the one hand, within-session personalization might be more informative of the consumer's current quest. On the other hand, across-session data can help infer persistence in consumers' behavior. Moreover, within-session personalization has to happen in real time and is computationally intensive. So knowing the returns to it can help managers decide whether it is worth the computational cost.

3. What is the scalability of personalized search to big data? Search datasets tend to be extremely large, and personalizing the results for each consumer is computationally intensive. For a proposed framework to be implementable in the field, it has to be scalable and efficient. Can we specify sparser models that significantly reduce computational costs while minimizing the loss in accuracy?

1.3 Challenges and Research Framework

We now discuss the three key challenges involved in answering these questions and present an overview of our research framework.

• Out-of-sample predictive accuracy: First, we need a framework with extremely high out-of-sample predictive accuracy. Note that the search engine's problem is one of prediction – which result or document will a given user find most relevant? More formally, what is the user-specific rank-ordering of the documents in decreasing order of relevance, so that the most relevant documents are at the top for each user?2

2 Search engines don't aim to provide the most objectively correct or useful information. Relevance is data-driven, and search engines aim to maximize the user's probability of clicking on and dwelling on the results shown. Hence, it is possible that non-personalized results may be perceived as more informative and the personalized ones as more narrow by an objective observer. However, this is not an issue when maximizing the probability of click for a specific user.

Unfortunately, the traditional econometric tools used in the marketing literature have been designed for causal inference (to derive the best unbiased estimates) and not for prediction, so they perform poorly in out-of-sample predictions. Indeed, the well-known "bias-variance trade-off" suggests that unbiased estimators are not always the best predictors because they tend to have high variance, and vice-versa. See Hastie et al. (2009) for an excellent discussion of this concept. The problems of prediction and causal inference are thus fundamentally different and often at odds with each other. Hence, we need a methodological framework that is optimized for predictive accuracy.

• Large number of attributes with complex interactions: The standard approach to modeling is to assume a fixed functional form for the outcome variable, include a small set of explanatory variables, allow for a few interactions between them, and then infer the parameters associated with the pre-specified functional form. However, this approach has poor predictive ability (as we later show). In our setting, we have a large number of features (close to 300), and we need to allow for non-linear interactions between them. This problem is exacerbated by the fact that we don't know which of these variables and which interaction effects matter a priori. Indeed, it is not possible for the researcher to come up with the best non-linear specification of the functional form either by manually eyeballing the data or by including all possible interactions. The latter would lead to over 45,000 variables, even if we restrict ourselves to simple pair-wise interactions.

Thus, we are faced with the problem of not only inferring the parameters of the model, but also inferring the optimal functional form of the model itself. So we need a framework that can search over and identify the relevant set of features and all possible non-linear interactions between them.

• Scalability and efficiency in big data settings: Finally, as discussed earlier, our approach needs to be both efficient and scalable. Efficiency refers to the idea that the run-time and deployment time of the model should be as small as possible, with minimal loss in accuracy. Thus, an efficient model may lose a very small amount of accuracy, but make huge gains on run-time. Scalability refers to the idea that an X-times increase in the data leads to no more than a 2X (i.e., linear) increase in the deployment time. Both scalability and efficiency are necessary to run the model in real settings.

To address these challenges, we turn to machine learning methods pioneered by computer scientists. These are specifically designed to provide extremely high out-of-sample accuracy, work with a large number of data attributes, and perform efficiently at scale.

Our framework employs a three-pronged approach – (a) Feature generation, (b) NDCG-based LambdaMART algorithm, and (c) Feature selection using the Wrapper method. The first module consists of a program that uses a series of functions to generate a large set of features (or attributes) that serve as inputs to the subsequent modules. These features consist of both global as well as personalized metrics that characterize the aggregate as well as user-level data regarding the relevance of search results. The second module consists of the LambdaMART ranking algorithm that maximizes search quality, as measured by Normalized Discounted Cumulative Gain (NDCG). This module employs the Lambda ranking algorithm in conjunction with the machine learning algorithm called Multiple Additive Regression Trees (MART) to maximize the predictive ability of the framework. MART performs automatic variable selection and infers both the optimal functional form (allowing for all types of interaction effects and non-linearities) as well as the parameters of the model. It is not hampered by the large number of parameters and variables since it performs optimization efficiently in gradient space using a greedy algorithm. The third module focuses on scalability and efficiency – it consists of a Wrapper algorithm that wraps around the LambdaMART step to derive the set of optimal features that maximize search quality with minimal loss in accuracy. The Wrapper thus identifies the most efficient model that scales well.
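To make the third module concrete, the minimal sketch below illustrates one way a greedy forward-selection wrapper of this kind can be organized. It is only a schematic under assumed interfaces – the `train_and_score` callable standing in for a LambdaMART training/validation run is hypothetical, and the wrapper used in the paper may differ in its stopping rule and evaluation details.

```python
def forward_selection_wrapper(candidate_features, train_and_score, tolerance=1e-4):
    """Greedy forward selection: grow the feature set one feature at a time,
    keeping the feature whose addition yields the largest validation NDCG gain,
    and stop when the marginal gain falls below `tolerance`.

    `train_and_score(features)` is an assumed callable that trains the ranker
    (e.g., LambdaMART) on the given feature subset and returns validation NDCG.
    """
    selected = []
    best_score = 0.0
    remaining = list(candidate_features)

    while remaining:
        # Evaluate each remaining feature when added to the current subset.
        gains = {f: train_and_score(selected + [f]) for f in remaining}
        best_feature = max(gains, key=gains.get)
        if gains[best_feature] - best_score < tolerance:
            break  # marginal improvement too small; stop growing the model
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = gains[best_feature]

    return selected, best_score
```

The design trade-off is the usual one for wrappers: each pass retrains the ranker once per remaining candidate, so the pruned feature set is bought with extra training runs up front in exchange for cheaper training and deployment afterwards.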

Apart from the benefits discussed earlier, an additional advantage of our approach is its applicability to datasets that lack information on context, text, and links – factors that have been shown to play a big role in improving search quality (Qin et al., 2010). Our framework only requires an existing non-personalized ranking. As concerns about user privacy increase, this is valuable.

1.4 Findings and Contribution

We apply our framework to a large-scale dataset from Yandex. It is the fourth largest search engine in the world, with over 60% market share in Russia and Eastern Europe (Clayton, 2013; Pavliva, 2013) and over $320 billion US dollars in quarterly revenues (Yandex, 2013). In October 2013, Yandex released a large-scale anonymized dataset for a 'Yandex Personalized Web Search Challenge', which was hosted by Kaggle (Kaggle, 2013). The data consists of information on anonymized user identifiers, queries, query terms, URLs, URL domains, and clicks. The data spans one month and consists of 5.7 million users, 21 million unique queries, 34 million search sessions, and over 64 million clicks.

Because all three modules in our framework are memory- and compute-intensive, we estimate the model on the fastest servers available on Amazon EC2, with 32 cores and 256 GB of memory.

We present five key results. First, we show that our personalized machine learning framework leads to significant improvements in search quality as measured by NDCG. In situations where the topmost link is not the most relevant (based on the current ranking), personalization improves search quality by approximately 15%. From a consumer's perspective, this leads to big changes in the ordering of results they see – over 55% of the documents are re-ordered, of which 30% are re-ordered by two or more positions. From the search engine's perspective, personalization leads to a substantial increase in click-through rates at the top position, which is a leading measure of user satisfaction.

We also show that standard logit-style models used in the marketing literature lead to no improvement in search quality, even if we use the same set of personalized features generated by the machine learning framework. This re-affirms the value of using a large-scale machine learning framework that can provide high predictive accuracy.

Second, the quality of personalized results increases with the length of a user's search history in a monotonic but concave fashion (between 3% and 17% for the median user, who has made about 50 past queries). Since data storage costs are linear, firms can use our estimates to derive the optimal length of history that maximizes their profits.

Third, we examine the value of different types of features. User-specific features generate over 50% of the improvement from personalization. Click-related features (that are based on users' clicks and dwell times) are also valuable, contributing over 28% of the improvement in NDCG. Both domain- and URL-specific features are also important, contributing over 10% of the gain in NDCG. Interestingly, session-specific features are not very useful – they contribute only 4% of the gain in NDCG. Thus, we find that "across-session" personalization based on user-level data is more valuable than "within-session" personalization based on session-specific data. Firms often struggle with implementing within-session personalization since it has to happen in real time and is computationally expensive. Our findings can help firms decide whether the relatively low value from within-session personalization is sufficient to warrant the computing costs of real-time responses.

Fourth, we show that our method is scalable to big data. Going from 20% of the data to the full dataset leads to a 2.38-times increase in computing time for feature generation, a 20.8-times increase for training, and an 11.6-times increase for testing/deployment of the final model. Thus, training and testing costs increase super-linearly, whereas feature-generation costs increase sub-linearly. Crucially, none of these increases is exponential, and using more data provides substantial improvements in NDCG.

Finally, to increase our framework's efficiency, we provide a forward-selection wrapper, which derives the optimal set of features that maximize search quality with minimal loss in accuracy. Of the 293 features provided by the feature-generation module, we optimally select 47 that provide 95% of the improvement in NDCG at significantly lower computing cost. We find that training the pruned model takes only 15% of the time taken to train the unpruned model. Similarly, deploying the pruned model is only 50% as expensive as deploying the full model.

In sum, our paper makes three main contributions to the marketing literature on personalization. First, it presents a comprehensive machine learning framework to help researchers and managers improve personalized marketing decisions using data on consumers' past behavior. Second, it presents empirical evidence in support of the returns to personalization in the online search context. Third, it demonstrates how large datasets or big data can be leveraged to improve marketing outcomes without compromising the speed or real-time performance of digital applications.

2 Related Literature

Our paper relates to many streams of literature in marketing and computer science.

First, it relates to the empirical literature on personalization in marketing. In an influential paper, Rossi et al. (1996) quantify the benefits of one-to-one pricing using purchase history data and find that personalization improves profits by 7.6%. Similarly, Ansari and Mela (2003) find that content-targeted emails can increase click-throughs by up to 62%, and Arora and Henderson (2007) find that customized promotions in the form of 'embedded premiums' or social causes associated with the product can improve profits. On the other hand, a series of recent papers question the returns to personalization in a variety of contexts ranging from advertising to the ranking of hotels on travel sites (Zhang and Wedel, 2009; Lambrecht and Tucker, 2013; Ghose et al., 2014).3

Our paper differs from this literature on two important dimensions. First, substantively, we focus on product personalization (search results), unlike the earlier papers that focus on the personalization of marketing mix variables such as price and advertising. Even papers that consider ranking outcomes, e.g., Ghose et al. (2014), use marketing mix variables such as advertising and pricing to make ranking recommendations. However, in the information retrieval context, there are no such marketing mix variables. Moreover, with our anonymized data, we don't even have access to the topic/context of search. Second, from a methodological perspective, previous papers use computationally intensive Bayesian methods that employ Markov Chain Monte Carlo (MCMC) procedures, or randomized experiments, or discrete choice models – all of which are not only difficult to scale, but also lack predictive power (Rossi and Allenby, 2003; Friedman et al., 2000). In contrast, we use machine learning methods that can handle big data and offer high predictive performance.

In a similar vein, our paper also relates to the computer science literature on personalization of search engine rankings. Qiu and Cho (2006) infer user interests from browsing history using experimental data on ten subjects. Teevan et al. (2008) and Eickhoff et al. (2013) show that there could be returns to personalization for some specific types of searches involving ambiguous queries and atypical search sessions. Bennett et al. (2012) find that long-term history is useful early in the session, whereas short-term history is more useful later in the session. There are two main differences between these papers and ours. First, in all these papers, the researchers had access to the data with context – information on the URLs and domains of the webpages, the hyperlink structure between them, and the queries used in the search process. In contrast, we work with anonymized data – we don't have data on the URL identifiers, texts, or links to/from webpages; nor do we know the queries. This makes our task significantly more challenging. We show that personalization is not only feasible, but also beneficial in such cases. Second, the previous papers work with much smaller datasets, which helps them avoid the scalability issues that we face.

More broadly, our paper is relevant to the growing literature on machine learning methodologies for Learning-to-Rank (LETOR) methods. Many information retrieval problems in business settings often boil down to ranking. Hence, ranking methods have been studied in many contexts such as collaborative filtering (Harrington, 2003), question answering (Surdeanu et al., 2008; Banerjee et al., 2009), text mining (Metzler and Kanungo, 2008), and digital advertising (Ciaramita et al., 2008). We refer interested readers to Liu (2011) for details.

3 A separate stream of literature examines how individual-level personalized measures can be used to improve multi-touch attribution in online channels (Li and Kannan, 2014).

Our paper also contributes to the literature on machine learning methods in marketing. Most of these applications have been in the context of online conjoint analysis and use statistical machine learning techniques. These methods are employed to alleviate issues such as noisy data and complexity control (similar to feature selection in our paper); they also help in model selection and improve robustness to response error (in participants). Some prominent works in this area include Toubia et al. (2003, 2004) and Evgeniou et al. (2005, 2007). Please see Toubia et al. (2007) for a detailed discussion of this literature, and Dzyabura and Hauser (2011), Tirunillai and Tellis (2014), Netzer et al. (2012), and Huang and Luo (2015) for other applications of machine learning in marketing.

3 Data

3.1 Data Description

We use the publicly available anonymized dataset released by Yandex for its personalization challenge (Kaggle, 2013). The dataset is from 2011 (exact dates are not disclosed) and represents 27 days of search activity. The queries and users are sampled from a large city in Eastern Europe. The data is processed such that two types of sessions are removed – those containing queries with commercial intent (as detected by Yandex's proprietary classifier) and those containing one or more of the top-K most popular queries (where K is not disclosed). This minimizes the risk of reverse engineering of user identities and business interests. The current rankings shown in the data are based on Yandex's proprietary non-personalized algorithm.

Table 1 shows an example of a session from the data. Overall, the data contains information on the following anonymized variables:

• User: A user, indexed by i, is an agent who uses the search engine for information retrieval. Users are tracked over time through a combination of cookies, IP addresses, and logged-in accounts. There are 5,659,229 unique users in the data.

• Query: A query is composed of one or more words, in response to which the search engine returns a page of results. Queries are indexed by q. There are a total of 65,172,853 queries in the data, of which 20,588,928 are unique.4 Figure 2 shows the distribution of queries in the data, which follows a long-tail pattern – 1% of queries account for 47% of the searches, and 5% for 60%.

• Term: Each query consists of one or more terms, and terms are indexed by l. For example, the query 'pictures of richard armitage' has three terms – 'pictures', 'richard', and 'armitage'. There are 4,701,603 unique terms in the data, and we present the distribution of the number of terms in all the queries seen in the data in Table 2.5

4 Queries are not the same as keywords, which is another term used frequently in connection with search engines. Please see Gabbert (2011) for a simple explanation of the difference between the two phrases.


20 M 6 5
20 0 Q 0 6716819 2635270,4603267 70100440,5165392 17705992,1877254 41393646,3685579 46243617,3959876 26213624,2597528 29601078,2867232 53702486,4362178 11708750,1163891 53485090,4357824 4643491,616472
20 13 C 0 70100440
20 495 C 0 29601078

Table 1: Example of a session. The data is in a log format describing a stream of user actions, with each line representing a session, query, or a click. In all lines, the first column represents the session (SessionID = 20 in this example). The first row from a session contains the session-level metadata and is indicated by the letter M. It has information on the day the session happened (6) and the UserID (5). Rows with the letter Q denote a Query and those with C denote a Click. The user submitted the query with QueryID 6716819, which contains terms with TermIDs 2635270 and 4603267. The URL with URLID 70100440 and DomainID 5165392 is the top result on the SERP. The rest of the nine results follow similarly. 13 units of time into the session, the user clicked on the result with URLID 70100440 (ranked first in the list), and 495 units of time into the session s/he clicked on the result with URLID 29601078.
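As an illustration of how this log format can be consumed, the minimal sketch below parses the three record types (M, Q, C) shown in Table 1 into Python dictionaries. The field layout follows the description in the caption; the helper names are our own and not part of the Yandex release.

```python
def parse_log_line(line):
    """Parse one line of the session log into a dict.
    Assumed layout (per Table 1): M = session metadata, Q = query, C = click."""
    fields = line.rstrip("\n").split("\t") if "\t" in line else line.split()
    session_id, rest = int(fields[0]), fields[1:]

    if rest[0] == "M":                      # session metadata: day, user id
        return {"type": "M", "session": session_id,
                "day": int(rest[1]), "user": int(rest[2])}

    time_passed, rec_type = int(rest[0]), rest[1]
    if rec_type == "Q":                     # query: serp id, query id, terms, ten (url, domain) pairs
        serp_id, query_id = int(rest[2]), int(rest[3])
        terms = [int(t) for t in rest[4].split(",")]
        results = [tuple(int(x) for x in pair.split(",")) for pair in rest[5:]]
        return {"type": "Q", "session": session_id, "time": time_passed,
                "serp": serp_id, "query": query_id, "terms": terms, "results": results}

    if rec_type == "C":                     # click: serp id, clicked url id
        return {"type": "C", "session": session_id, "time": time_passed,
                "serp": int(rest[2]), "url": int(rest[3])}

    raise ValueError(f"Unknown record type in line: {line!r}")
```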

Figure 2: Cumulative distribution function of the number of unique queries – the fraction of overall searches corresponding to unique queries, sorted by popularity.

• Session: A search session, indexed by j, is a continuous block of time during which a user is repeatedly issuing queries, browsing, and clicking on results to obtain information on a topic. Search engines use proprietary algorithms to detect session boundaries (Goker and He, 2000); the exact method used by Yandex is not publicly revealed. According to Yandex, there are 34,573,630 sessions in the data.

• URL: After a user issues a query, the search engine returns a set of ten results on the first page. These results are essentially URLs of webpages relevant to the query. There are 69,134,718 unique URLs in the data, and URLs are indexed by u. Yandex only provides data for the first Search Engine Results Page (SERP) for a given query because a majority of users never go beyond the first page, and click data for subsequent pages is very sparse. So from an implementation perspective, it is difficult to personalize later SERPs; and from a managerial perspective, the returns to personalizing them are low since most users never see them.

5 Search engines typically ignore common keywords and prepositions used in searches, such as 'of', 'a', and 'in', because they tend to be uninformative. Such words are called 'Stop Words'.


Number of Terms Per Query    Percentage of Queries
1                            22.32
2                            24.85
3                            20.63
4                            14.12
5                             8.18
6                             4.67
7                             2.38
8                             1.27
9                             0.64
10 or more                    1.27

Table 2: Distribution of the number of terms in a query. (Min, Max) = (1, 82).

Position    Probability of click
1           0.4451
2           0.1659
3           0.1067
4           0.0748
5           0.0556
6           0.0419
7           0.0320
8           0.0258
9           0.0221
10          0.0206

Table 3: Click probability based on position.

• URL domain: The domain to which a given URL belongs. For example, www.imdb.com is a domain, while http://www.imdb.com/name/nm0035514/ and http://www.imdb.com/name/nm0185819/ are URLs or webpages hosted at this domain. There are 5,199,927 unique domains in the data.

• Click: After issuing a query, users can click on one or all of the results on the SERP. Clicking a URL is an indication of user interest. We observe 64,693,054 clicks in the data. So, on average, a user clicks 3.257 times following a query (including repeat clicks to the same result). Table 3 shows the probability of click by position or rank of the result. Note the steep drop-off in click probability with position – the first document is clicked 44.51% of the time, whereas the fifth one is clicked a mere 5.56% of the time.

• Time: Yandex gives us the timeline of each session. However, to preserve anonymity, it does not disclose how many milliseconds are in one unit of time. Each action in a session comes with a time stamp that tells us how long into the session the action took place. Using these time stamps, Yandex calculates 'dwell time' for each click, which is the time between a click and the next click or the next query. Dwell time is informative of how much time a user spent with a document.

• Relevance: A result is relevant to a user if she finds it useful and irrelevant otherwise. Search engines attribute relevance scores to search results based on user behavior – clicks and dwell time. It is well known that dwell time is correlated with the probability that the user satisfied her information needs with the clicked document. The labeling is done automatically, and is hence different for each query issued by the user. Yandex uses three grades of relevance:

• Irrelevant, r = 0.


Figure 3: Cumulative distribution function of the number of queries in a session. Min, Max = 1, 280.

Figure 4: Cumulative distribution function of the number of queries per user (observed in days 26, 27). Min, Max = 1, 1260.

A result gets an irrelevant grade if it doesn't get a click or if it receives a click with dwell time strictly less than 50 units of time.

• Relevant, r = 1.
Results with dwell times between 50 and 399 time units receive a relevant grade.

• Highly relevant, r = 2.
This grade is given to two types of documents – (a) those that received clicks with a dwell time of 400 units or more, and (b) documents that were clicked where the click was the last action of the session (irrespective of dwell time). The last click of a session is considered to be highly relevant because a user is unlikely to have terminated her search if she is not satisfied.

In cases where a document is clicked multiple times after a query, the maximum dwell time is used to calculate the document's relevance (this grading scheme is sketched below).
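A minimal sketch of the grading scheme described above, assuming the dwell-time thresholds of 50 and 400 units and the last-click rule stated by Yandex; the function name and inputs are our own illustrative choices, not code from the data release.

```python
def relevance_grade(max_dwell_time, was_clicked, was_last_action_of_session):
    """Map click/dwell behavior for one (query, document) pair to Yandex's grades.

    max_dwell_time: maximum dwell time over all clicks on the document
    (None if unobserved, e.g., when the click ended the session)."""
    if not was_clicked:
        return 0                               # irrelevant: never clicked
    if was_last_action_of_session:
        return 2                               # highly relevant: last click of the session
    if max_dwell_time is None or max_dwell_time < 50:
        return 0                               # irrelevant: clicked but dwell < 50 units
    if max_dwell_time < 400:
        return 1                               # relevant: dwell in [50, 399]
    return 2                                   # highly relevant: dwell >= 400 units
```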

In sum, we have a total of 167,413,039 lines of data.

3.2 Model-Free Analysis

For personalization to be effective, we need to observe three types of patterns in the data. First, we need sufficient data on user history to provide personalized results. Second, there has to be significant heterogeneity in users' tastes. Third, users' tastes have to exhibit persistence over time. Below, we examine the extent to which search and click patterns satisfy these criteria.

First, consider user history metrics. If we examine the number of queries per session (see Figure 3), we find that 60% of the sessions have only one query. This implies that within-session personalization is not possible for such sessions. However, 20% of sessions have two or more queries, making them amenable to within-session personalization. Next, we examine the feasibility of across-session personalization for users observed in days 26 and 27 by considering the number of queries they have issued in the past.6 We find that the median user has issued 16 queries and over 25% of the users have issued at least 32 queries (see Figure 4). Together, these patterns suggest that a reasonable degree of both short- and long-term personalization is feasible in this context.


Figure 5: Cumulative distribution function of the average number of clicks per query for a given user.

Figure 6: Cumulative distribution function of average clicks per query for users who have issued 10 or more queries.

Figure 5 presents the distribution of the average number of clicks per query for the users in the data. As we can see, there is considerable heterogeneity in users' propensity to click. For example, more than 15% of users don't click at all, whereas 20% of them click on 1.5 results or more for each query they issue. To ensure that these patterns are not driven by one-time or infrequent users, we present the same analysis for users who have issued at least 10 queries in Figure 6, which shows similar click patterns. Together, these two figures establish the presence of significant heterogeneity in users' clicking behavior.

Next, we examine whether there is variation or entropy in the URLs that are clicked for a given query. Figure 7 presents the cumulative distribution of the number of unique URLs clicked for a given query through the time period of the data. 37% of queries receive no clicks at all, and close to another 37% receive a click on only one unique URL. These are likely to be navigational queries; e.g., almost everyone who searches for 'cnn' will click on www.cnn.com. However, close to 30% of queries receive clicks on two or more URLs. This suggests that users are clicking on multiple results. This could be either due to heterogeneity in users' preferences for different results or because the results themselves change a lot for the query during the observation period (high churn in the results is likely for news-related queries, such as 'earthquake'). To examine whether the latter explanation drives all the variation in Figure 7, we plot the number of unique URLs and domains we see for a given query through the course of the data in Figure 8. We find that more than 90% of queries have no churn in the URLs shown in the results page. Hence, it is unlikely that the entropy in clicked URLs is driven by churn in search engine results. Rather, it is more likely to be driven by heterogeneity in users' tastes.

6 As explained in §6.1, users observed early in the data have no/very little past history because we have a data snapshot for a fixed time period (27 days), and not a fixed history for each user. Hence, following Yandex's suggestion, we run the personalization algorithm for users observed in the last two days. In practice, of course, firms don't have to work with snapshots of data; rather, they are likely to have longer histories stored at the user level.


Figure 7: Cumulative distribution function of the number of unique URLs clicked for a given query in the 27-day period.

Figure 8: Cumulative distribution function of the number of unique URLs and domains shown for a given query in the 27-day period.

In sum, the basic patterns in the data point to significant heterogeneity in users' preferences and persistence in those preferences, suggesting the existence of positive returns to personalization.

4 Feature Generation

We now present the first module of our three-pronged empirical strategy – feature generation. In this module, we conceptualize and define an exhaustive feature set that captures both global and individual-level attributes that can help predict the most relevant documents for a given search by a user i during a session j. Feature engineering is an important aspect of all machine learning problems and is of both technical and substantive interest.

Before proceeding, we make note of two points. First, the term 'features' as used in the machine learning literature may be interpreted as 'attributes' or 'explanatory variables' in the marketing context. Second, features based on the hyperlink structure of the web and the text/context of the URL, domain, and query are not visible to us since we work with anonymized data. Such features play an important role in ranking algorithms, and not having information on them is a significant drawback (Qin et al., 2010). Nevertheless, we have one metric that can be interpreted as a discretized combination of such features – the current rank based on Yandex's algorithm (which does not take personalized search history into account). While this is not perfect (since there is information loss due to discretization and because we cannot interact the underlying features with the user-specific ones generated here), it is the best data we have, and we use it as a measure of the features invisible to us.

Our framework consists of a series of functions that can be used to generate a large and varied set of features. There are two main advantages to using such a framework instead of deriving and defining each feature individually. First, it allows us to succinctly define and obtain a large number of features. Second, the general framework ensures that the methods presented in this paper can be extrapolated to other personalized ranking contexts, e.g., online retailers or entertainment sites.

4.1 Functions for Feature Generation

We now present a series of functions that are used to compute sets of features that capture attributes associated with the past history of both the search result and the user.

1. Listings(query, url, usr, session, time): This function calculates the number of times a url appeared in a SERP for searches issued by a given usr for a given query in a specific session before a given instance of time.

This function can also be invoked with certain input values left unspecified, in which case it is computed over all possible values of the unspecified inputs. For example, if the user and session are not specified, then it is computed over all users and across all sessions. That is, Listings(query, url, −, −, time) computes the number of occurrences of a url in any SERP for a specific query issued before a given instance of time. Another example of a generalization of this function involves not providing an input value for query, usr, and session (as in Listings(−, url, −, −, time)), in which case the count is performed over all SERPs associated with all queries, users, and sessions; i.e., we count the number of times a given url appears in all SERPs seen in the data before a given point in time, irrespective of the query.

We also define the following two variants of this function.

• If we supply a search term (term) as an input instead of a query, then the invocation Listings(term, url, usr, session, time) takes into account all queries that include term as one of the search terms.

• If we supply a domain as an input instead of a URL, then the invocation Listings(query, domain, usr, session, time) will count the number of times that a result from domain appeared in the SERP in the searches associated with a given query.

2. AvgPosn(query, url, usr, session, time): This function calculates the average position/rank of a url within SERPs that list it in response to searches for a given query within a given session and before an instance of time for a specific usr. It provides information over and above the Listings function, as it characterizes the average ranking of the result as opposed to simply noting its presence in the SERP.

3. Clicks(query, url, usr, session, time): This function calculates the number of times that a specific usr clicked on a given url after seeing it on a SERP provided in response to searches for a given query within a specific session before a given instance of time. Clicks(·) can then be used to compute the click-through rates (CTRs) as follows:

\[
\text{CTR}(query, url, usr, session, time) = \frac{\text{Clicks}(query, url, usr, session, time)}{\text{Listings}(query, url, usr, session, time)} \tag{1}
\]

4. HDClicks(query, url, usr, session, time): This function calculates the number of times that usr made a high-duration click, or a click of relevance r = 2, on a given url after seeing it on a SERP provided in response to searches for a given query within a specific session and before a given instance of time. HDClicks(·) can then be used to compute high-duration click-through rates (HDCTRs) as follows:

\[
\text{HDCTR}(query, url, usr, session, time) = \frac{\text{HDClicks}(query, url, usr, session, time)}{\text{Listings}(query, url, usr, session, time)} \tag{2}
\]

Note that HDClicks and HDCTR have information over and above Clicks and CTR since they consider clicks of high relevance.

5. ClickOrder(query, url, usr, session, time) is the order of the click (conditional on a click) for a given url by a specific usr in response to a given query within a specific session and before a given instance of time. Then:

\[
\text{WClicks}(query, url, usr, session, time) =
\begin{cases}
\dfrac{1}{\text{ClickOrder}(query, url, usr, session, time)} & \text{if } url \text{ was clicked} \\
0 & \text{if } url \text{ was not clicked}
\end{cases} \tag{3}
\]

WClicks calculates a weighted click metric that captures the idea that a URL that is clicked first is more relevant than one that is clicked second, and so on. This is in contrast to Clicks, which assigns the same weight to URLs irrespective of the order in which they are clicked. WClicks is used to compute weighted click-through rates (WCTRs) as follows:

\[
\text{WCTR}(query, url, usr, session, time) = \frac{\text{WClicks}(query, url, usr, session, time)}{\text{Listings}(query, url, usr, session, time)} \tag{4}
\]

For instance, a URL that is always clicked first when it is displayed has WCTR = CTR, whereas for a URL that is always clicked second, WCTR = (1/2)·CTR, and for one that is always clicked third, WCTR = (1/3)·CTR.

6. PosnClicks(query, url, usr, session, posn, time): The position of a URL in the SERP is denoted by posn. Then:

\[
\text{PosnClicks}(query, url, usr, session, posn, time) = \frac{posn}{10} \times \text{Clicks}(query, url, usr, session, time) \tag{5}
\]


If a URL in a low position gets clicked, it is likely to be more relevant than one that is in a high position and gets clicked. This stems from the idea that users usually go through the list of URLs from top to bottom and only click on URLs in lower positions if those results are especially relevant. Using this, we can calculate the position-weighted CTR as follows:

\[
\text{PosnCTR}(query, url, usr, session, posn, time) = \frac{\text{PosnClicks}(query, url, usr, session, posn, time)}{\text{Listings}(query, url, usr, session, posn, time)} \tag{6}
\]

7. Skips(query, url, usr, session, time): Previous research suggests that users usually go over the list of results in a SERP sequentially from top to bottom (Craswell et al., 2008). Thus, if the user skips a top URL and clicks on one in a lower position, it is significant because the decision to skip the top URL is a conscious one. If a user clicks on position 2 after skipping position 1, the implications for the URLs in positions 1 and 3 are different: the URL in position 1 has been examined and deemed irrelevant, whereas the URL in position 3 may not have been clicked because the user never reached it. Thus, the implications of a lack of clicks are more damning for the skipped URL in position 1 compared to the one in position 3.

The function Skips thus captures the number of times a specific url has been 'skipped' by a given usr when it was shown in response to a specific query within a given session and before time. For example, for a given query, assume that the user clicks on the URL in position 5 followed by that in position 3. Then the URLs in positions 1 and 2 were skipped twice, the URLs in positions 3 and 4 were skipped once, and those in positions 6-10 were skipped zero times. We then compute SkipRate as follows:

\[
\text{SkipRate}(query, url, usr, session, time) = \frac{\text{Skips}(query, url, usr, session, time)}{\text{Listings}(query, url, usr, session, time)} \tag{7}
\]

8. Dwells(query, url, usr, session, time): This function calculates the total amount of time that a given usr dwells on a specific url after clicking on it in response to a given query within a given session and before time. Dwells can be used to compute the average dwell time (ADT), which has additional information on the relevance of a url over and above the click-through rate.

\[
\text{ADT}(query, url, usr, session, time) = \frac{\text{Dwells}(query, url, usr, session, time)}{\text{Clicks}(query, url, usr, session, time)} \tag{8}
\]

As in the case of Listings, all of the above functions can be invoked in their generalized forms without specifying a query, user, or session. Similarly, they can also be invoked for terms and domains, except for SkipRate, which is not invoked for domains. The features generated from these functions are listed in Table A1 in the Appendix. Note that we use only the first four terms for features computed using search terms (e.g., those features numbered 16 through 19). Given that 80% of the queries have four or fewer search terms (see Table 2), the information loss from doing so is minimal while keeping the number of features reasonable.
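To illustrate how these count-based features can be computed from the click log, the sketch below tallies Listings, Clicks, and Skips over one user's history and turns them into CTR and SkipRate, following the definitions in Equations (1) and (7) and the skip-counting example above. The event structure is a simplified stand-in for the parsed log, and the helper names are ours, not the paper's code.

```python
from collections import defaultdict

def count_features(events):
    """events: list of (query_id, results, clicked_positions) tuples for one user,
    where `results` is the ordered list of URL ids on the SERP and
    `clicked_positions` lists the 1-based positions clicked, in click order."""
    listings = defaultdict(int)   # (query, url) -> number of SERP appearances
    clicks = defaultdict(int)     # (query, url) -> number of clicks
    skips = defaultdict(int)      # (query, url) -> number of times skipped

    for query, results, clicked_positions in events:
        for url in results:
            listings[(query, url)] += 1
        for pos in clicked_positions:
            clicks[(query, results[pos - 1])] += 1
            # every result listed above a clicked position was passed over once
            for skipped_pos in range(1, pos):
                skips[(query, results[skipped_pos - 1])] += 1

    ctr = {k: clicks[k] / listings[k] for k in listings}
    skip_rate = {k: skips[k] / listings[k] for k in listings}
    return ctr, skip_rate
```

Restricting `events` to a particular session, user, term, domain, or to actions before a given time stamp yields the session-level, global, and time-indexed variants of the same features, in the spirit of the generalized invocations described above.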

In addition to the above, we also use the following functions to generate additional features.

9. QueryTerms(query) is the total number of terms in a given query.

10. Day(usr, session, time) is the day of the week (Monday, Tuesday, etc.) on which the query was made by a given usr in a specific session before a given instance of time.

11. ClickProb(usr, posn, time): This is the probability that a given usr clicks on a search result that appears at posn within a SERP before a given instance of time. This function captures the heterogeneity in users' preferences for clicking on results appearing at different positions.

12. NumSearchResults(query) is the total number of search results appearing on any SERP associated with a given query over the observation period. It is an entropy metric on the extent to which the search results associated with a query vary over time. This number is likely to be high for news-related queries.

13. NumClickedURLs(query) is the total number of unique URLs clicked on any SERP associated with that query. It is a measure of the heterogeneity in users' tastes for information/results within a query. For instance, for navigational queries this number is likely to be very low, whereas for other queries like 'Egypt', it is likely to be high since some users may be seeking information on the recent political conflict in Egypt, others on tourism to Egypt, while yet others on the history/culture of Egypt.

14. SERPRank(url, query, usr, session): Finally, as mentioned earlier, we include the position/rank of a result for a given query by a given usr in a specific session. This is informative of many aspects of the query, page, and context that are not visible to us. By using SERPRank as one of the features, we are able to fold this information into a personalized ranking algorithm. Note that SERPRank is the same as the position posn of the URL in the results page. However, we define it as a separate feature since the position can vary across observations.

As before, some of these functions can be invoked more generally, and all the features generated through them are listed in Table A1 in the Appendix.

4.2 Categorization of Feature Set

The total number of features that we can generate from these functions is very large (293, to be specific). To aid exposition and to experiment with the impact of including or excluding certain sets of features, we group the above set of features into the following (partially overlapping) categories. This grouping also provides a general framework that can be used in other contexts.

1. FP includes all user- or person-specific features except ones that are at session-level granularity. For example, FP includes Feature 76, which counts the number of times a specific url appeared in a SERP for searches issued by a given usr for a specific query.

2. FS includes all user-specific features at a session-level granularity. For example, FS includes Feature 151, which measures the number of times a given url appeared in a SERP for searches issued by a given usr for a specific query in a given session.

3. FG includes features that provide global or system-wide statistics regarding searches (i.e., features which are not user-specific). For instance, FG includes Feature 1, which measures the number of times a specific url appeared in a SERP for searches involving a specific query, irrespective of the user issuing the search.

4. FU includes features associated with a given url. For example, FU includes Features 1, 76, and 151, but not Feature 9, which is a function of the domain name associated with a search result, not the URL.

5. FD includes features that are attributes associated with the domain name of the search result. This set includes features such as Feature 9, which measures the number of times a given domain is the domain name of a result in a SERP provided for a specific query.

6. FQ includes features that are computed based on the query used for the search. For example, FQ includes Feature 1, but does not include Features 16–19 (which are based on the individual terms used in the search) or Feature 226 (which is not query-specific).

7. FT includes features that are computed based on the individual terms comprising a search query. For example, FT includes Features 16–19.

8. FC includes features that are based on the clicks received by the result. Examples include Features 2, 5, and 6.

9. FN includes features that are based on weighted clicks; e.g., it includes Features 36–39.

10. FW includes features that are computed based on the dwell times associated with user clicks; e.g., it includes Feature 3, which is the average dwell time for a given url in response to a specific query.

11. FR includes features that are computed based on the rankings of a search result on different SERPs. Examples include Features 7 and 15.

12. FK includes features that measure how often a specific search result was skipped over by a user across different search queries, e.g., Feature 8.


The third column of Table A1 in the Appendix provides a mapping from features to the feature sets that they belong to. It is not necessary to include polynomial or other higher-order transformations of the features because the optimizer we use (MART) is insensitive to monotone transformations of the input features (Murphy, 2012).

5 Ranking Methodology

We now present the second module of our empirical framework, which consists of a methodology for re-ranking the top 10 results for a given query, by a given user, at a given point in time. The goal of re-ranking is to promote URLs that are more likely to be relevant to the user and demote those that are less likely to be relevant. Relevance is data-driven and defined based on clicks and dwell times.

5.1 Evaluation Metric

To evaluate whether a specific re-ranking method improves overall search quality or not, we first need to define an evaluation metric. The most commonly used metric in the ranking literature is Normalized Discounted Cumulative Gain (NDCG). It is also used by Yandex to evaluate the quality of its search results. Hence, we use this metric for evaluation too.

NDCG is a list-level graded relevance metric, i.e., it allows for multiple levels of relevance. NDCG measures the usefulness, or gain, of a document based on its position in the result list. NDCG rewards rank-orderings that place documents in decreasing order of relevance. It penalizes mistakes in top positions more than those in lower positions, because users are more likely to focus and click on the top documents.

We start by defining the simpler metric, Discounted Cumulative Gain, DCG (Jarvelin and Kekalainen, 2000; Jarvelin and Kekalainen, 2002). DCG is a list-level metric defined for the truncation position P. For example, if we are only concerned about the top 10 documents, then we would work with DCG_10. The DCG at truncation position P is defined as:

\[
\text{DCG}_P = \sum_{p=1}^{P} \frac{2^{r_p} - 1}{\log_2(p + 1)} \tag{9}
\]

where rp is the relevance rating of the document (or URL) in position p.A problem with DCG is that the contribution of each query to the overall DCG score of the data

varies. For example, a search that retrieves four ordered documents with relevance scores 2, 2, 1, 1,will have higher DCG scores compared to a different search that retrieves four ordered documentswith relevance scores 1, 1, 0, 0. This is problematic, especially for re-ranking algorithms, where weonly care about the relative ordering of documents. To address this issue, a normalized version ofDCG is used:

NDCGP =DCGP

IDCGP

(10)

19

Page 21: Working Paper | Search Personalization using Machine Learning

where IDCGP is the ideal DCG at truncation position P . The IDCGP for a list is obtained by rank-ordering all the documents in the list in decreasing order of relevance. NDCG always takes valuesbetween 0 and 1. Since NDCG is computed for each search in the training data and then averagedover all queries in the training set, no matter how poorly the documents associated with a particularquery are ranked, it will not dominate the evaluation process since each query contributes equally tothe average measure.
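To make Equations (9) and (10) concrete, the following is a minimal sketch of how NDCG at a truncation position of 10 can be computed for a single query. The helper functions and the example relevance grades are ours and are not part of the paper's implementation.

```python
import math

def dcg(relevances, P=10):
    """Discounted Cumulative Gain at truncation position P (Equation 9)."""
    # p is 0-indexed here, so the discount is log2(p + 2) rather than log2(p + 1).
    return sum((2 ** r - 1) / math.log2(p + 2) for p, r in enumerate(relevances[:P]))

def ndcg(relevances, P=10):
    """Normalized DCG at truncation position P (Equation 10)."""
    ideal = dcg(sorted(relevances, reverse=True), P)   # IDCG: same grades in decreasing order
    return dcg(relevances, P) / ideal if ideal > 0 else 0.0

# Relevance grades (0 = irrelevant, 1 = relevant, 2 = highly relevant) of the 10 results
# in the order in which they were shown.
print(round(ndcg([0, 2, 1, 0, 0, 1, 0, 0, 0, 0]), 4))
```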

NDCG has two main advantages. First, it allows a researcher to specify the number of positions that she is interested in. For example, if we only care about the first page of search results (as in our case), then we can work with NDCG_10 (henceforth referred to as NDCG for notational simplicity). Instead, if we are interested in the first two pages, we can work with NDCG_20. Second, NDCG is a list-level metric; it allows us to evaluate the quality of results for the entire list of results with built-in penalties for mistakes at each position.

Nevertheless, NDCG suffers from an important disadvantage. It is based on discrete positions, which makes its optimization very difficult. For instance, if we were to use an optimizer that uses continuous scores, then we would find that the relative positions of two documents (based on NDCG) will not change unless there are significant changes in their rank scores. So we could obtain the same rank orderings for multiple parameter values of the model. This is an important consideration that constrains the types of optimization methods that can be used to maximize NDCG.

5.2 General Structure of the Ranking Problem

We now present the general structure of the problem and some nomenclature to help with exposition. We use i to index users, j to index sessions, and k to index the search within a session. Further, let q denote the query associated with a search, and u denote the URL of a given search result.

Let x^{qu}_{ijk} \in R^F denote a set of F features for document/URL u in response to a query q for the kth search in the jth session of user i. The function f(\cdot) maps the feature vector to a score s^{qu}_{ijk} such that s^{qu}_{ijk} = f(x^{qu}_{ijk}; w), where w is a set of parameters that define the function f(\cdot). s^{qu}_{ijk} can be interpreted as the gain or value associated with document u for query q for user i in the kth search in her jth session. Let U^{qm}_{ijk} \triangleright U^{qn}_{ijk} denote the event that URL m should be ranked higher than n for query q for user i in the kth search of her jth session. For example, if m has been labeled highly relevant (r_m = 2) and n has been labeled irrelevant (r_n = 0), then U^{qm}_{ijk} \triangleright U^{qn}_{ijk}. Note that the labels for the same URLs can vary with queries, users, and across and within the session; hence the super-scripting by q and sub-scripting by i, j, and k.

Given two documents m and n associated with a search, we can write out the probability that document m is more relevant than n using a logistic function as follows:

P^{qmn}_{ijk} = P(U^{qm}_{ijk} \triangleright U^{qn}_{ijk}) = \frac{1}{1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}}        (11)

where x^{qm}_{ijk} and x^{qn}_{ijk} refer to the input features of the two documents and \sigma determines the shape of this sigmoid function. Using these probabilities, we can define the log-likelihood of observing the data, or the Cross Entropy Cost function, as follows. This is interpreted as the penalty imposed for deviations of model-predicted probabilities from the desired probabilities.

C^{qmn}_{ijk} = -\bar{P}^{qmn}_{ijk} \log(P^{qmn}_{ijk}) - (1 - \bar{P}^{qmn}_{ijk}) \log(1 - P^{qmn}_{ijk})        (12)

C = \sum_{i} \sum_{j} \sum_{k} \sum_{\forall m,n} C^{qmn}_{ijk}        (13)

where \bar{P}^{qmn}_{ijk} is the observed probability (in the data) that m is more relevant than n. The summation in Equation (13) is taken in this order – first, over all pairs of URLs (m, n) in the consideration set for the kth search in session j for user i, next over all the searches within session j, then over all the sessions for the user i, and finally over all users.
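As a small illustration of Equations (11) and (12), the sketch below evaluates the pairwise probability and the cross-entropy cost for a single document pair; the scores, the observed probability, and σ are invented inputs, and in the actual framework the scores would come from the learned function f(·).

```python
import math

def pairwise_prob(s_m, s_n, sigma=1.0):
    """P(m is more relevant than n) from the score difference (Equation 11)."""
    return 1.0 / (1.0 + math.exp(-sigma * (s_m - s_n)))

def cross_entropy(p_observed, s_m, s_n, sigma=1.0):
    """Cross Entropy Cost for one (m, n) pair (Equation 12)."""
    p_model = pairwise_prob(s_m, s_n, sigma)
    return -p_observed * math.log(p_model) - (1 - p_observed) * math.log(1 - p_model)

# m is labeled more relevant than n (observed probability 1), but the model scores
# them almost equally, so the pair incurs a non-trivial penalty.
print(round(cross_entropy(p_observed=1.0, s_m=0.2, s_n=0.1), 4))
```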

If the function f(\cdot) is known or specified by the researcher, the cost C can be minimized using commonly available optimizers such as Newton-Raphson or BFGS. Traditionally, this has been the approach adopted by marketing researchers since they are usually interested in inferring the impact of some marketing activity (e.g., price, promotion) on an outcome variable (e.g., awareness, purchase) and therefore assume the functional form of f(\cdot). In contrast, when the goal is to predict the best rank-ordering that maximizes consumers' click probability and dwell-time, assuming the functional form f(\cdot) is detrimental to the predictive ability of the model. Moreover, manual experimentation to arrive at an optimal functional form becomes exponentially difficult as the number of features expands. This is especially the case if we want the function to be sufficiently flexible (non-linear) to capture the myriad interaction patterns in the data. To address these challenges, we turn to solutions from the machine learning literature such as MART that infer both the optimal parameters w and function f(\cdot) from the data.7

5.3 MART – Multiple Additive Regression Trees

MART is a machine learning algorithm that models a dependent or output variable as a linear combination of a set of shallow regression trees (a process known as boosting). In this section, we introduce the concepts of Classification and Regression Trees (CART) and boosted CART (referred to as MART). We present a high-level overview of these models here and refer interested readers to Murphy (2012) for details.

Figure 9: Example of a CART model.

5.3.1 Classification and Regression Trees

CART methods are a popular class of prediction algorithms that recursively partition the input space corresponding to a set of explanatory variables into multiple regions and assign an output value for each region. This kind of partitioning can be represented by a tree structure, where each leaf of the tree represents an output region. Consider a dataset with two input variables x1, x2, which are used to predict or model an output variable y using a CART. An example tree with three leaves (or output regions) is shown in Figure 9. This tree first asks if x1 is less than or equal to a threshold t1. If yes, it assigns the value of 1 to the output y. If not (i.e., if x1 > t1), it then asks if x2 is less than or equal to a threshold t2. If yes, then it assigns y = 2 to this region. If not, it assigns the value y = 3 to this region. The chosen y value for a region corresponds to the mean value of y in that region in the case of a continuous output and the dominant y in the case of discrete outputs.
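As an aside, the toy example below grows a three-leaf regression tree of this kind with scikit-learn; the data and thresholds are invented purely to mimic the x1/x2 example in Figure 9, and the paper itself does not rely on this library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Invented data in which y takes the values 1, 2, or 3 depending on two thresholds
# on x1 and x2, roughly as in the Figure 9 example.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
y = np.where(X[:, 0] <= 4, 1.0, np.where(X[:, 1] <= 6, 2.0, 3.0))

# A shallow tree with three leaves, i.e., three output regions.
tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))
```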

Trees are trained or grown using a pre-defined number of leaves and by specifying a cost function that is minimized at each step of the tree using a greedy algorithm. The greedy algorithm implies that at each split, the previous splits are taken as given, and the cost function is minimized going forward. For instance, at node B in Figure 9, the algorithm does not revisit the split at node A. It does, however, consider all possible splits on all the variables at each node. Thus, the split points at each node can be arbitrary, the tree can be highly unbalanced, and variables can potentially repeat at later child nodes. All of this flexibility in tree construction can be used to capture a complex set of flexible interactions, which are not predefined but are learned from the data.

CART is popular in the machine learning literature because it is scalable, is easy to interpret, can handle a mixture of discrete and continuous inputs, is insensitive to monotone transformations, and performs automatic variable selection (Murphy, 2012). However, it has accuracy limitations because of its discontinuous nature and because it is trained using greedy algorithms. These drawbacks can be addressed (while preserving all the advantages) through boosting, which gives us MART.

7 In standard marketing models, the ws are treated as structural parameters that define the utility associated with an option, so their directionality and magnitude are of interest to researchers. However, in this case there is no structural interpretation of w. In fact, with complex trainers such as MART, the number of parameters is large and difficult to interpret.


5.3.2 Boosting

Boosting is a technique that can be applied to any classification or prediction algorithm to improve its accuracy (Schapire, 1990). Applying the additive boosting technique to CART produces MART, which has now been shown empirically to be the best classifier available (Caruana and Niculescu-Mizil, 2006; Hastie et al., 2009). MART can be viewed as performing gradient descent in the function space using shallow regression trees (with a small number of leaves). MART works well because it combines the positive aspects of CART with those of boosting. CART models, especially shallow regression trees, tend to have high bias but low variance. Boosting CART models addresses the bias problem while retaining the low variance. Thus, MART produces high quality classifiers.

MART can be interpreted as a weighted linear combination of a series of regression trees, each trained sequentially to improve the final output using a greedy algorithm. MART's output L_N(x) can be written as:

L_N(x) = \sum_{n=1}^{N} \alpha_n l_n(x, \beta_n)        (14)

where l_n(x, \beta_n) is the function modeled by the nth regression tree and \alpha_n is the weight associated with the nth tree. Both the l_n(\cdot)s and the \alpha_ns are learned during the training or estimation. MARTs are also trained using greedy algorithms. l_n(x, \beta_n) is chosen so as to minimize a pre-specified cost function, which is usually the least-squared error in the case of regressions and an entropy or logit loss function in the case of classification or discrete choice models.
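The additive structure in Equation (14) can be sketched as follows: shallow regression trees are fitted sequentially to the residuals of the current model and added with a weight (here folded into a constant shrinkage rate). This is only meant to illustrate boosted regression trees, not the exact MART implementation used in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=100, n_leaves=4, shrinkage=0.1):
    """Boosted shallow regression trees: L_N(x) = sum_n alpha_n * l_n(x)."""
    trees, prediction = [], np.zeros(len(y))
    for _ in range(n_trees):
        residual = y - prediction                        # gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves).fit(X, residual)
        prediction += shrinkage * tree.predict(X)        # the weight alpha_n is the shrinkage here
        trees.append(tree)
    return trees

def predict_boosted_trees(trees, X, shrinkage=0.1):
    return shrinkage * sum(tree.predict(X) for tree in trees)

# Toy usage with invented data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + np.sin(X[:, 1])
model = fit_boosted_trees(X, y)
print(round(float(np.mean((predict_boosted_trees(model, X) - y) ** 2)), 4))
```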

5.4 Learning Algorithms for Ranking

We now discuss two Learning to Rank algorithms that can be used to improve search quality – RankNet and LambdaMART. Both of these algorithms can use MART as the underlying classifier.

5.4.1 RankNet – Pairwise learning with gradient descent

In an influential paper, Burges et al. (2005) proposed one of the earliest ranking algorithms – RankNet. Versions of RankNet are used by commercial search engines such as Bing (Liu, 2011).

RankNet uses a gradient descent formulation to minimize the Cross Entropy function defined in Equations (12) and (13). Recall that we use i to index users, j to index sessions, k to index the search within a session, q to denote the query associated with a search, and u to denote the URL of a given search result. The score and entropy functions are defined for two documents m and n.

Then, let S^{qmn}_{ijk} \in \{0, 1, -1\}, where S^{qmn}_{ijk} = 1 if m is more relevant than n, -1 if n is more relevant than m, and 0 if they are equally relevant. Thus, we have \bar{P}^{qmn}_{ijk} = \frac{1}{2}(1 + S^{qmn}_{ijk}), and we can rewrite C^{qmn}_{ijk} as:

C^{qmn}_{ijk} = \frac{1}{2}(1 - S^{qmn}_{ijk})\sigma(s^{qm}_{ijk} - s^{qn}_{ijk}) + \log(1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})})        (15)

Note that the Cross Entropy function is symmetric; it is invariant to swapping m and n and flipping the sign of S^{qmn}_{ijk}.8 Further, we have:

C^{qmn}_{ijk} = \log(1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})})   if S^{qmn}_{ijk} = 1
C^{qmn}_{ijk} = \log(1 + e^{-\sigma(s^{qn}_{ijk} - s^{qm}_{ijk})})   if S^{qmn}_{ijk} = -1
C^{qmn}_{ijk} = \frac{1}{2}\sigma(s^{qm}_{ijk} - s^{qn}_{ijk}) + \log(1 + e^{-\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})})   if S^{qmn}_{ijk} = 0        (16)

The gradient of the Cross Entropy Cost function is:

\frac{\partial C^{qmn}_{ijk}}{\partial s^{qm}_{ijk}} = -\frac{\partial C^{qnm}_{ijk}}{\partial s^{qn}_{ijk}} = \sigma\left(\frac{1}{2}(1 - S^{qmn}_{ijk}) - \frac{1}{1 + e^{\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}}\right)        (17)

We can now employ a stochastic gradient descent algorithm to update the parameters, w, of the model, as follows:

w \rightarrow w - \eta \frac{\partial C^{qmn}_{ijk}}{\partial w} = w - \eta\left(\frac{\partial C^{qmn}_{ijk}}{\partial s^{qm}_{ijk}} \frac{\partial s^{qm}_{ijk}}{\partial w} + \frac{\partial C^{qnm}_{ijk}}{\partial s^{qn}_{ijk}} \frac{\partial s^{qn}_{ijk}}{\partial w}\right)        (18)

where \eta is a learning rate specified by the researcher. The advantage of splitting the gradient as shown in Equation (18), instead of working directly with \partial C^{qmn}_{ijk} / \partial w, is that it speeds up the gradient calculation by allowing us to use the analytical formulations for the component partial derivatives.

Most recent implementations of RankNet use the following factorization to improve speed.

\frac{\partial C^{qmn}_{ijk}}{\partial w} = \lambda^{qmn}_{ijk}\left(\frac{\partial s^{qm}_{ijk}}{\partial w} - \frac{\partial s^{qn}_{ijk}}{\partial w}\right)        (19)

where

\lambda^{qmn}_{ijk} = \frac{\partial C^{qmn}_{ijk}}{\partial s^{qm}_{ijk}} = \sigma\left(\frac{1}{2}(1 - S^{qmn}_{ijk}) - \frac{1}{1 + e^{\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}}\right)        (20)

Given this cost and gradient formulation, MART can now be used to construct a sequence of regression trees that sequentially minimize the cost function defined by RankNet. The nth step (or tree) of MART can be conceptualized as the nth gradient descent step that refines the function f(\cdot) in order to lower the cost function. At the end of the process, both the optimal f(\cdot) and the ws are inferred.
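For concreteness, here is a sketch of the λ computation in Equation (20) for a single pair; S is +1, −1, or 0 as defined above, and the scores are placeholders for the outputs of f(·).

```python
import math

def ranknet_lambda(s_m, s_n, S, sigma=1.0):
    """lambda for one (m, n) pair, following Equation (20)."""
    return sigma * (0.5 * (1 - S) - 1.0 / (1.0 + math.exp(sigma * (s_m - s_n))))

# m is labeled more relevant (S = 1) but is currently scored below n, so the
# lambda is strongly negative, i.e., an upward pull on m's score once it enters
# the gradient update in Equation (18).
print(round(ranknet_lambda(s_m=-0.5, s_n=0.5, S=1), 4))
```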

While RankNet has been successfully implemented in many settings, it suffers from three significant disadvantages. First, because it is a pairwise metric, in cases where there are multiple levels of relevance, it does not use all the information available. For example, consider two sets of documents {m1, n1} and {m2, n2} with the following relevance values: {1, 0} and {2, 0}. In both cases, RankNet only utilizes the information that m should be ranked higher than n and not the actual relevance scores, which makes it inefficient. Second, it does not differentiate mistakes at the top of the list from those at the bottom. This is problematic if the goal is to penalize mistakes at the top of the list more than those at the bottom. Third, it optimizes the Cross Entropy Cost function, and not NDCG, which is our true objective function. This mismatch between the true objective function and the optimized objective function implies that RankNet is not the best algorithm to optimize NDCG.

8 Asymptotically, C^{qmn}_{ijk} becomes linear if the scores give the wrong ranking and zero if they give the correct ranking.

5.4.2 LambdaMART: Optimizing for NDCG

Note that RankNet's speed and versatility come from its gradient descent formulation, which requires the loss or cost function to be twice differentiable in the model parameters. Ideally, we would like to preserve the gradient descent formulation because of its attractive convergence properties. However, we need to find a way to optimize for NDCG within this framework, even though it is a discontinuous optimization metric that does not lend itself easily to gradient descent. LambdaMART addresses this challenge (Burges et al., 2006; Wu et al., 2010).

The key insight of LambdaMART is that we can directly train a model without explicitly specifying a cost function, as long as we have gradients of the cost and these gradients are consistent with the cost function that we wish to minimize. The gradient functions, the \lambda^{qmn}_{ijk}s, in RankNet can be interpreted as directional forces that pull documents in the direction of increasing relevance. If a document m is more relevant than n, then m gets an upward pull of magnitude |\lambda^{qmn}_{ijk}| and n gets a downward pull of the same magnitude.

LambdaMART expands this idea of gradients functioning as directional forces by weighting them with the change in NDCG caused by swapping the positions of m and n, as follows:

\lambda^{qmn}_{ijk} = \frac{-\sigma}{1 + e^{\sigma(s^{qm}_{ijk} - s^{qn}_{ijk})}} |\Delta NDCG^{qmn}_{ijk}|        (21)

Note that \Delta NDCG^{qmn}_{ijk} = \Delta NDCG^{qnm}_{ijk} = 0 when m and n have the same relevance; so \lambda^{qmn}_{ijk} = 0 in those cases. (This is the reason why the term \frac{\sigma}{2}(1 - S^{qmn}_{ijk}) drops out.) Thus, the algorithm does not spend effort swapping two documents of the same relevance.

Given that we want to maximize NDCG, suppose that there exists a utility or gain function C^{qmn}_{ijk} that satisfies the above gradient equation. (We refer to C^{qmn}_{ijk} as a gain function in this context, as opposed to a cost function, since we are looking to maximize NDCG.) Then, we can continue to use gradient descent to update our parameter vector w as follows:

w \rightarrow w + \eta \frac{\partial C^{qmn}_{ijk}}{\partial w}        (22)

Using Poincare's Lemma and a series of empirical results, Yue and Burges (2007) and Donmez et al. (2009) have shown that the gradient functions defined in Equation (21) correspond to a gain function that maximizes NDCG. Thus, we can use these gradient formulations and follow the same set of steps as we did for RankNet to maximize NDCG.9

LambdaMART has been successfully used in many settings (Chapelle and Chang, 2011) and we refer interested readers to Burges (2010) for a detailed discussion of the implementation and relative merits of these algorithms. In our analysis, we use the RankLib implementation of LambdaMART (Dang, 2011), which implements the above algorithm and learns the functional form of f(\cdot) and the parameters (the ws) using MART.
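To illustrate Equation (21), the sketch below weights the RankNet-style gradient by the change in NDCG from swapping two documents. The DCG discounting mirrors Equation (9); the relevance grades and scores are invented, and this is not the RankLib implementation used in the paper.

```python
import math

def delta_ndcg(relevances, i, j, ideal_dcg):
    """|Change in NDCG| from swapping the documents at (0-indexed) positions i and j."""
    gain = lambda r: 2 ** r - 1
    disc = lambda p: 1.0 / math.log2(p + 2)
    return abs((gain(relevances[i]) - gain(relevances[j])) * (disc(i) - disc(j))) / ideal_dcg

def lambdamart_lambda(s_m, s_n, dndcg, sigma=1.0):
    """lambda from Equation (21), for a pair in which m is more relevant than n."""
    return -sigma * dndcg / (1.0 + math.exp(sigma * (s_m - s_n)))

relevances = [0, 2, 1, 0, 0]                              # grades in the current ranking order
ideal = sum((2 ** r - 1) / math.log2(p + 2)
            for p, r in enumerate(sorted(relevances, reverse=True)))
# The highly relevant document sits at position 2 (index 1) below an irrelevant one.
print(round(lambdamart_lambda(s_m=0.1, s_n=0.4,
                              dndcg=delta_ndcg(relevances, i=1, j=0, ideal_dcg=ideal)), 4))
```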

6 Implementation

We now discuss the details of implementation and data preparation.

6.1 Datasets – Training, Validation, and Testing

Supervised machine learning algorithms (such as ours) typically require three datasets – training, validation, and testing. Training data is the set of pre-classified data used for learning or estimating the model and inferring the model parameters. In marketing settings, usually the model is optimized on the training dataset and its performance is then tested on a hold-out sample. However, this leads the researcher to pick a model with great in-sample performance, without optimizing for out-of-sample performance. In contrast, in most machine learning settings (including ours), the goal is to pick a model with the best out-of-sample performance; hence the validation data. So at each step of the optimization (which is done on the training data), the model's performance is evaluated on the validation data. In our analysis, we optimize the model on the training data, and stop the optimization only when we see no improvement in its performance on the validation data for at least 100 steps. Then we go back and pick the model that offered the best performance on the validation data. Because both the training and validation datasets are used in model selection, we need a new dataset for testing the predictive accuracy of the model. This is done on a separate hold-out sample, which is labeled as the test data. The results from the final model on the test data are considered the final predictive accuracy achieved by our framework.
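A schematic of the validation-based stopping rule described above: keep boosting on the training data, stop once the validation NDCG has not improved for 100 consecutive steps, and then return the model that was best on the validation data. The arguments `train_one_more_tree` and `validation_ndcg` are placeholder names for whatever trainer and metric are plugged in.

```python
def train_with_early_stopping(train_one_more_tree, validation_ndcg, patience=100):
    """Select the model with the best out-of-sample (validation) performance."""
    history = []                                          # (model, validation score) per step
    best_score, best_step, step = float("-inf"), 0, 0
    while step - best_step < patience:                    # no validation gain for 100 steps -> stop
        model = train_one_more_tree()                     # one more boosting step on training data
        score = validation_ndcg(model)                    # evaluated on the held-out validation data
        history.append((model, score))
        if score > best_score:
            best_score, best_step = score, step
        step += 1
    return history[best_step][0]                          # go back to the best validation model
```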

The features described in §4.1 can be broadly categorized into two sets – (a) global features, which are common to all the users in the data, and (b) user-specific features that are specific to a user (these could be within- or across-session features). In real business settings, managers typically have data on both global and user-level features at any given point in time. Global features are constructed by aggregating the behavior of all the users in the system. Because this requires aggregation over millions of users and data stored across multiple servers and geographies, it is not feasible to keep this up-to-date on a real-time basis. So firms usually update this data on a regular (weekly or biweekly) basis. Further, global data is useful only if it has been aggregated over a sufficiently long period of time. In our setting, we cannot generate reliable global features for the first few days of observation because of insufficient history. In contrast, user-specific features have to be generated on a real-time basis and need to include all the past activity of the user, including earlier activity in the same session. This is necessary to ensure that we can make personalized recommendations for each user. Given these constraints and considerations, we now explain how we partition the data and generate these features.

9 The original version of LambdaRank was trained using neural networks. To improve its speed and performance, Wu et al. (2010) proposed a boosted tree version of LambdaRank – LambdaMART, which is faster and more accurate.

Figure 10: Data-generation framework.

Recall that we have a snapshot of 27 days of data from Yandex. Figure 10 presents a schema of how the three datasets (training, validation, test) are generated from the 27 days of data. First, using the entire data from the first 25 days, we generate the global features for all three datasets. Next, we split all the users observed in the last two days (days 26–27) into three random samples (training, validation, and testing).10 Then, for each observation of a user, all the data up till that observation is used to generate the user/session level features. For example, Sessions A1, A2, B1, B2, and C1 are used to generate global features (such as the listing frequency and CTRs at the query, URL, and domain level) for all three datasets. Thus, the global features for all three datasets do not include information from sessions in days 26–27 (Sessions A3, C2, and C3). Next, for user A in the training data, the user-level features at any given point in Session A3 are calculated based on her entire history starting from Session A1 (when she is first observed) till her last observed action in Session A3. User B is not observed in the last two days. So her sessions are only used to generate aggregate global features, i.e., she does not appear at the user-level in the training, validation, or test datasets. For user C, the user-level features at any given point in Sessions C2 or C3 are calculated based on her entire history starting from Session C1.

10 We observe a total of 1,174,096 users in days 26–27, who participate in 2,327,766 sessions. Thus, each dataset (training, validation, and testing) has approximately one third of these users and sessions in it.
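A rough pandas sketch of this partitioning logic follows, with hypothetical column names and a simplified day-level cutoff (the paper computes user features from the entire history up to each observation, including earlier activity in the same session).

```python
import pandas as pd

# Hypothetical log with one row per (user, query, url) impression; column names are ours.
logs = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "day":     [3, 10, 26, 5, 12, 2, 26, 27],
    "query":   ["q1", "q1", "q1", "q2", "q2", "q1", "q1", "q3"],
    "url":     ["u1", "u1", "u2", "u3", "u3", "u1", "u1", "u4"],
    "clicked": [1, 0, 1, 1, 0, 0, 1, 1],
})

# Global features: aggregated over all users, using only days 1-25.
global_ctr = (logs[logs["day"] <= 25]
              .groupby(["query", "url"])["clicked"].mean().rename("global_ctr"))

# User-level feature for an observation in days 26-27: that user's own history so far.
def user_ctr_before(user_id, day, query, url):
    past = logs[(logs["user_id"] == user_id) & (logs["day"] < day) &
                (logs["query"] == query) & (logs["url"] == url)]
    return past["clicked"].mean() if len(past) else 0.0

print(global_ctr.to_dict())
print(user_ctr_before(user_id=1, day=26, query="q1", url="u1"))
```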

Dataset     NDCG Baseline   NDCG with Personalization   % Gain
Dataset 1   0.7937          0.8121                      2.32%
Dataset 2   0.5085          0.5837                      14.78%

Table 4: Gains from Personalization

6.2 Data Preparation

Before constructing the three datasets (training, validation, and test) described above, we clean the data to accommodate the test objective. Testing whether personalization is effective or not is only possible if there is at least one click. So following Yandex's recommendation, we filter the datasets in two ways. First, for each user observed in days 26–27, we only include all her queries from the test period with at least one click with a dwell time of 50 units or more (so that there is at least one relevant or highly relevant document). Second, if a query in this period does not have any short-term context (i.e., it is the first one in the session) and the user is new (i.e., she has no search sessions in the training period), then we do not consider the query since it has no information for personalization.

Additional analysis of the test data shows that about 66% of the clicks observed are already for the first position. Personalization cannot further improve user experience in these cases because the most relevant document already seems to have been ranked at the highest position. Hence, to get a clearer picture of the value of personalization, we need to focus on situations where the user clicks on a document in a lower position (in the current ranking) and then examine if our proposed personalized ranking improves the position of the clicked documents. Therefore, we consider two test datasets – 1) the first is the entire test data, which we denote as Dataset 1, and 2) the second is a subset that only consists of queries for which clicks are at position 2 or lower, which we denote as Dataset 2.

6.3 Model Training

We now comment on some issues related to training our model. The performance of LambdaMART depends on a set of tunable parameters. For instance, the researcher has control over the number of regression trees used in the estimation, the number of leaves in a given tree, the type of normalization employed on the features, etc. It is important to ensure that the tuning parameters lead to a robust model. A non-robust model usually leads to overfitting; that is, it leads to higher improvement in the training data (in-sample fit), but lower improvements in the validation and testing datasets (out-of-sample fits). We take two key steps to ensure that the final model is robust. First, we normalize the features using their z-scores to make the feature scales comparable. Second, we experimented with a large number of specifications for the number of trees and leaves in a tree, and found that 1000 trees and 20 leaves per tree was the most robust combination.
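A minimal sketch of the z-score normalization step; we assume here that the training-set means and standard deviations are applied to all three datasets, a detail the paper does not spell out.

```python
import numpy as np

def zscore_normalize(train_X, other_Xs):
    """Standardize each feature column so that feature scales are comparable."""
    mean, std = train_X.mean(axis=0), train_X.std(axis=0)
    std[std == 0] = 1.0                                   # guard against constant features
    return [(X - mean) / std for X in (train_X, *other_Xs)]

train, valid, test = np.random.rand(100, 5), np.random.rand(40, 5), np.random.rand(40, 5)
train_z, valid_z, test_z = zscore_normalize(train, [valid, test])
print(train_z.mean(axis=0).round(6))                      # approximately zero per column
```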


7 Results

7.1 Gains from Personalization

After training and validating the model, we test it on the two datasets described earlier – Dataset 1 and Dataset 2. The results are presented in Table 4. First, for Dataset 1, we find that personalization improves the NDCG from a baseline (with SERPRank as the only feature) of 0.7937 to 0.8121, which represents a 2.32% improvement over the baseline. Since 66% of the clicks in this dataset are already for the top-most position, it is hard to interpret the value of personalization from these numbers. So we focus on Dataset 2, which consists of queries that received clicks at positions 2 or lower, i.e., only those queries where personalization can play a role in improving the search results. Here, we find the results promising – compared to the baseline, our personalized ranking algorithm leads to a 14.78% increase in search quality, as measured by NDCG.

These improvements are significant in the search industry. Recall that the baseline ranking captures a large number of factors including the relevance of page content, page rank, and the link-structure of the web-pages, which are not available to us. It is therefore notable that we are able to further improve on the ranking accuracy by incorporating personalized user/session-specific features. Our overall improvement is higher than that achieved by the winning teams from the Kaggle contest (Kaggle, 2014). This is in spite of the fact that the contestants used a longer dataset for training their model.11 We conjecture that this is due to our use of a comprehensive set of personalized features in conjunction with the parallelized LambdaMART algorithm that allows us to exploit the complete scale of the data. The top three teams in the contest each released a technical report discussing their respective methodologies. Interested readers can refer to Masurel et al. (2014); Song (2014); Volkovs (2014) for details on their approaches.

7.2 Value of Personalization from the Firm’s and Consumers’ Perspectives

So far we have presented the results purely in terms of NDCG. However, the NDCG metric is hard to interpret, and it is not obvious how changes in NDCG translate to differences in rank-orderings and click-through rates (CTRs) in a personalized page. So we now present some results on these metrics.

The first question we consider is this – from a consumer's perspective, how different does the SERP look when we deploy the personalized ranking algorithm compared to what it looks like under the current non-personalized algorithm? Figures 11 and 12 summarize the shifts after personalization for Dataset 1 and Dataset 2, respectively. For both datasets, over 55% of the documents are re-ordered, which is a staggering number of re-orderings. (Note that the percentage of results (URLs) with zero change in position is 44.32% for Dataset 1 and 41.85% for Dataset 2.) Approximately 25% of the documents are re-ordered by one position (an increase or a decrease of one position). The remaining 30% are re-ordered by two positions or more. Re-orderings by more than five positions are rare. Overall, these results suggest that the rank-ordering of results in the personalized SERP is likely to be quite different from the non-personalized SERP. These findings highlight the fact that even small improvements in NDCG lead to significant changes in rank-orderings, as perceived by consumers.

11 The contestants used all the data from days 1–27 for building their models and had their models tested by Yandex on an unmarked dataset (with no information on clicks made by users) from days 28–30. In contrast, we had to resort to using the dataset from days 1–27 for both model building and testing. So the amount of past history we have for each user in the test data is smaller, putting us at a relative disadvantage (especially since we find that longer histories lead to higher NDCGs). Thus, it is noteworthy that our improvement is higher than that achieved by the winning teams.

Figure 11: Histogram of change in position for the results in Dataset 1.

Figure 12: Histogram of change in position for the results in Dataset 2.

Figure 13: Expected increase in CTR as a function of position for Dataset 1. CTRs are expressed in percentages; therefore the changes in CTRs are also expressed in percentages.

Figure 14: Expected increase in CTR as a function of position for Dataset 2. CTRs are expressed in percentages; therefore the changes in CTRs are also expressed in percentages.

Next, we consider the returns to personalization from the firm's perspective. User satisfaction depends on the search engine's ability to serve relevant results at the top of a SERP. So any algorithm that increases CTRs for top positions is valuable. We examine how position-based CTRs are expected to change with personalization compared to the current non-personalized ranking. Figures 13 and 14 summarize our findings for Datasets 1 and 2. The y-axis depicts the difference in CTRs (in percentage) after and before personalization, and the x-axis depicts the position. Since both datasets only include queries with one or more clicks, the total number of clicks before and after personalization is maintained to be the same. (The sum of the changes in CTRs across all positions is zero.) However, after personalization, the documents are re-ordered so that more relevant documents are displayed higher up. Thus, the CTR for the topmost position increases, while those for lower positions decrease. In both datasets, we see a significant increase in the CTR for position 1. For Dataset 1, the CTR for position 1 changes (in a positive direction) by 3.5%. For Dataset 2, which consists of queries receiving clicks to positions 2 or lower, this increase is even more substantial. We find that the CTR for the first position is now higher by 18%. For Dataset 1, the increase in CTR for position 1 mainly comes from positions 3 and 10, whereas for Dataset 2, it comes from positions 2 and 3.

Figure 15: NDCG improvements over the baseline for different lower-bounds on the number of prior queries by the user for Dataset 1.

Figure 16: NDCG improvements over the baseline for different lower-bounds on the number of prior queries by the user for Dataset 2.

Overall, we find that deploying a personalization algorithm would lead to an increase in CTRs for higher positions, thereby reducing the need for consumers to scroll further down the SERP for relevant results. This in turn would increase user satisfaction and reduce switching – both of which are valuable outcomes from the search engine's perspective.

7.3 Value of Search History

Next, we examine the returns to using longer search histories for personalization. Even though we have less than 26 days of history on average, the median query (by a user) in the test datasets has 49 prior queries, and the 75th percentile has 100 prior queries.12 We find that this accumulated history pays significant dividends in improving search quality, as shown in Figures 15 and 16. When we set the lower-bound for past queries to zero, we get the standard 2.32% improvement over the baseline for Dataset 1 and 14.78% for Dataset 2 (Table 4), as shown by the points on the y-axis in Figures 15 and 16. When we restrict the test data to only contain queries for which the user has issued at least 50 past queries (which is close to the median), we find that the NDCG improves by 2.76% for Dataset 1 and 17% for Dataset 2. When we increase the lower-bound to 100 queries (the 75th percentile for our data), NDCG improves by 3.03% for Dataset 1 and 18.51% for Dataset 2, both of which are quite significant when compared to the baseline improvements for the entire datasets.

12 The number of past queries for a median query (50) is a different metric than the one discussed earlier, the median number of queries per user (shown in Figure 4 as 16), for two reasons. First, this number is calculated only for queries that have one or more prior queries (see data preparation in §6.2). Second, not all users submit the same number of queries. Indeed, a small subset of users account for most queries. So the average query has a larger median history than the average user.

These findings can be summarized into two main patterns. First, search history is more valuable when we focus on searches where the user has to go further down the SERP to look for relevant results. Second, there is a monotonic, albeit concave, increase in NDCG as a function of the length of search history. These findings re-affirm the value of storing and using personalized browsing histories to improve users' search experience. Indeed, firms are currently struggling with the decision of how much history to store, especially since storage of these massive personal datasets can be extremely costly. By quantifying the improvement in search results for each additional piece of past history, we provide firms with tools to decide whether the costs associated with storing longer histories are offset by the improvements in consumer experience.

7.4 Comparison of Features

We now examine the value of different types of features in improving search results using a comparative analysis. To perform this analysis, we start with the full model, then remove a specific feature set (as described in §4.2), and evaluate the loss in NDCG prediction accuracy as a consequence of the removal. The percentage loss is defined as:

\frac{NDCG_{full} - NDCG_{current}}{NDCG_{full} - NDCG_{base}} \times 100        (23)

Note that all the models always have SERPRank as a feature. Table 5 provides the results from this experiment.
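As a quick check on how the entries in Table 5 relate to Equation (23), the percentage loss for the Global feature set can be reproduced from the NDCG values reported in Tables 4 and 5:

```python
ndcg_full, ndcg_base = 0.8121, 0.7937        # full model and SERPRank-only baseline (Dataset 1)
ndcg_without_global = 0.8060                 # Table 5, "- Global features" row
loss = (ndcg_full - ndcg_without_global) / (ndcg_full - ndcg_base) * 100
print(round(loss, 2))                        # roughly 33.15
```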

Model                          NDCG     % Loss
Full set                       0.8121   –
− Global features (FG)         0.8060   33.15
− User features (FP)           0.8023   53.26
− Session features (FS)        0.8114   3.80
− URL features (FU)            0.8101   10.87
− Domain features (FD)         0.8098   12.50
− Query features (FQ)          0.8106   8.15
− Term features (FT)           0.8119   1.09
− Click features (FC)          0.8068   28.80
− Position features (FR)       0.8116   2.72
− Skip features (FK)           0.8118   1.63
− Posn-click features (FN)     0.8120   0.54
− Dwell features (FW)          0.8112   4.89

Table 5: Loss in prediction accuracy when different feature sets are removed from the full model.

We observe that by eliminating the Global features, we lose 33.15% of the improvement that we gain from using the full feature set. This suggests that the interaction effects between global and personalized features play an important role in improving the accuracy of our algorithm. Next, we find that eliminating user-level features results in a loss of a whopping 53.26%. Thus, over 50% of the improvement we provide comes from using personalized user history. In contrast, eliminating session-level features while retaining across-session user-level features results in a loss of only 3.80%. This implies that across-session features are more valuable than within-session features and that user behavior is stable over longer time periods. Thus, recommendation engines can forgo real-time session-level data (which is costly to incorporate in terms of computing time and deployment) without too much loss in prediction accuracy.

We also observe that eliminating domain-name-based features results in a significant loss in prediction accuracy, as does the elimination of URL-level features. This is because users prefer search results from domains and URLs that they trust. Search engines can thus track user preferences over domains and use this information to optimize search ranking. Query-specific features can also be quite informative (8.15% loss in accuracy). This could be because they capture differences in user behavior across queries.

We also find that features based on users' past clicking and dwelling behavior can be informative, with Click-specific features being the most valuable (with a 28.8% loss in accuracy). This is likely because users tend to exhibit persistence in how they click on links – both across positions (some users may only click on the first two positions while others may click on multiple positions) and across URLs and domains. (For a given query, some users may have persistence in taste for certain URLs; e.g., some users might be using a query for navigational purposes while others might be using it to get information/news.) History on past clicking behavior can inform us of these proclivities and help us improve the results shown to different users.

In sum, these findings give us insight into the relative value of different types of features. This can help businesses decide which types of information to incorporate into their search algorithms and which types of information to store for longer periods of time.

7.5 Comparison with Pairwise Maximum Likelihood

We now compare the effectiveness of machine learning based approaches to this problem to the standard technique used in the marketing literature – Maximum Likelihood. We estimate a pairwise Maximum Likelihood model with a linear combination of the set of 293 features used previously. Note that this is similar to a RankNet model where the functional form f(\cdot) is specified as a linear combination of the features, and we are simply inferring the parameter vector w. Thus, it is a pairwise optimization that does not use any machine learning algorithms.

                              Pairwise Maximum Likelihood          LambdaMART
                              Single feature   Full features       Single feature   Full features
NDCG on training data         0.7930           0.7932              0.7930           0.8141
NDCG on validation data       0.7933           0.7932              0.7933           0.8117
NDCG on test data             0.7937           0.7936              0.7937           0.8121

Table 6: Comparison with Pairwise Maximum Likelihood

We present the results of this analysis in Table 6. For both Maximum Likelihood and LambdaMART, we report results for a single-feature model (with just SERPRank) and a model with all the features. In both methods, the single-feature model produces the baseline NDCG of 0.7937 on the test data, which shows that both methods perform as expected when used with no additional features. Also, as expected, in the case of the full-feature models, both methods show improvement on the training data. However, the improvement is very marginal in the case of pairwise Maximum Likelihood, and completely disappears when we take the model to the test data. While LambdaMART also suffers from some loss of accuracy when moving from training to test data, much of its improvement is sustained in the test data, which turns out to be quite significant. Together, these results suggest that Maximum Likelihood is not an appropriate technique for large-scale predictive models.

The lackluster performance of the Maximum Likelihood model stems from two issues. First, because it is a pairwise gradient-based method, it does not maximize NDCG. Second, it does not allow for the fully flexible functional form and the large number of non-linear interactions that machine learning techniques such as MART do.

7.6 Scalability

We now examine the scalability of our approach to large datasets. Note that there are three aspects of modeling – feature generation, testing, and training – all of which are computing and memory intensive. In general, feature generation is more memory intensive (requiring at least 140GB of memory for the full dataset), and hence it is not parallelized. The training module is more computing intensive. The RankLib package that we use for training and testing is parallelized and can run on multi-core servers, which makes it scalable. All the stages (feature generation, training, and testing) were run on Amazon EC2 nodes since the computational load and memory requirements for these analyses cannot be handled by regular PCs. The specification of the compute server used on Amazon's EC2 node is Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz with 32 cores and 256GB of memory. This was the fastest server with the highest memory available at Amazon EC2.

We now discuss the scalability of each module – feature generation, training, and testing – one by one, using Figures 17, 18, and 19. The runtimes are presented as the total CPU time added up over all the 32 cores. To construct the 20% dataset, we continue to use all the users and sessions from days 1–25 to generate the global features, but use only 20% of the users from days 26–27 to generate user- and session-specific features.

Figure 17: Runtimes for generating the features for different percentages of the data.

Figure 18: Runtimes for training the model on different percentages of the data.

Figure 19: Runtimes for testing the model on different percentages of the data.

Figure 20: NDCG improvements from using different percentages of the data.

For the full dataset, the feature generation module takes about 83 minutes on a single-core computer (see Figure 17), the training module takes over 11 days (see Figure 18), and the testing module takes only 35 minutes (see Figure 19). These numbers suggest that training is the primary bottleneck in implementing the algorithm, especially if the search engine wants to update the trained model on a regular basis. So we parallelize the training module over a 32-core server. This reduces the training time to about 8.7 hours. Overall, with sufficient parallelization, we find that the framework can be scaled to large datasets in reasonable time-frames.

Next, we compare the computing cost of going from 20% of the data to the full dataset. We find that it increases by 2.38 times for feature generation, by 20.8 times for training, and by 11.6 times for testing. Thus, training and testing costs increase super-linearly, whereas the increase is sub-linear for feature generation. These increases in computing costs, however, bring a significant improvement in model performance, as measured by NDCG (see Figure 20). Overall, these measurements affirm the value of using large-scale datasets for personalized recommendations, especially since the cost of increasing the size of the data is paid mostly at the training stage, which is easily parallelizable.

8 Feature Selection

We now discuss the third and final step of our empirical approach. As we saw in the previous section, even when parallelized over 32-core servers, personalization algorithms can be very computing-intensive. Therefore, we now examine feature selection, where we choose a subset of features that preserves most of the predictive accuracy of the algorithm while substantially reducing computing costs. We start by describing the background literature and frameworks for feature selection, then describe its application to our context, followed by the results.

Feature selection is a standard step in all machine learning settings that involve supervised learning, for two reasons (Guyon and Elisseeff, 2003). First, it can provide a faster and more cost-effective, or more efficient, model by eliminating less relevant features with minimal loss in accuracy. Efficiency is particularly relevant when training on large datasets such as ours. Second, it provides more comprehensible models, which offer a better understanding of the underlying data-generating process.13

The goal of feature selection is to find the smallest set of features that can provide a fixed predictive accuracy. In principle, this is a straightforward problem since it simply involves an exhaustive search of the feature space. However, even with a moderately large number of features, this is practically impossible. With F features, an exhaustive search requires 2^F runs of the algorithm on the training dataset, which is exponentially increasing in F. In fact, this problem is known to be NP-hard (Amaldi and Kann, 1998).

8.1 Wrapper Approach

The Wrapper method addresses this problem by using a greedy algorithm. We describe it briefly here and refer interested readers to Kohavi and John (1997) for a detailed review. Wrappers can be categorized into two types – forward selection and backward elimination. In forward selection, features are progressively added until a desired prediction accuracy is reached or until the incremental improvement is very small. In contrast, a backward elimination Wrapper starts with all the features and sequentially eliminates the least valuable features. Both these Wrappers are greedy in the sense that they do not revisit former decisions to include (in forward selection) or exclude (in backward elimination) features. In our implementation, we employ a forward-selection Wrapper.

13 In cases where the datasets are small and the number of features large, feature selection can actually improve the predictive accuracy of the model by eliminating irrelevant features whose inclusion often results in overfitting. Many machine learning algorithms, including neural networks, decision trees, CART, and Naive Bayes learners, have been shown to have significantly worse accuracy when trained on small datasets with superfluous features (Duda and Hart, 1973; Aha et al., 1991; Breiman et al., 1984; Quinlan, 1993). In our setting, we have an extremely large dataset, which prevents the algorithm from making spurious inferences in the first place. So feature selection is not accuracy-improving in our case; however, it leads to significantly lower computing times.


To enable a Wrapper algorithm, we also need to specify a selection rule as well as a stopping rule. We use the robust best-first selection rule (Ginsberg, 1993), wherein the most promising node is selected at every decision point. For example, in a forward-selection algorithm with 10 features, at the first node, this algorithm considers 10 versions of the model (each with one of the features added) and then picks the feature whose addition offers the highest prediction accuracy. Further, we use the following stopping rule, which is quite conservative. Let NDCG_{base} be the NDCG from the base model currently used by Yandex, which only uses SERPRank. Similarly, let NDCG_{current} be the improved NDCG at the current iteration, and NDCG_{full} be the NDCG that can be obtained from using a model with all the 293 features. Then our stopping rule is:

\frac{NDCG_{current} - NDCG_{base}}{NDCG_{full} - NDCG_{base}} > 0.95        (24)

According to this stopping rule, the Wrapper algorithm stops adding features when it achieves 95% of the improvement offered by the full model.
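A minimal sketch of the forward-selection Wrapper with best-first selection and the stopping rule in Equation (24); `train_and_score` is a placeholder for training LambdaMART on a candidate set of feature classes and returning its validation NDCG.

```python
def forward_wrapper(feature_classes, train_and_score, ndcg_base, ndcg_full, threshold=0.95):
    """Greedily add the feature class that most improves NDCG until Equation (24) is met."""
    selected, remaining = [], list(feature_classes)
    ndcg_current = ndcg_base
    while remaining and (ndcg_current - ndcg_base) / (ndcg_full - ndcg_base) <= threshold:
        # Best-first selection: try adding each remaining class and keep the best one.
        scores = {c: train_and_score(selected + [c]) for c in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        ndcg_current = scores[best]
    return selected, ndcg_current
```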

Wrappers offer many advantages. First, they are agnostic to the underlying learning algorithm and the accuracy metric used for evaluating the predictor. Second, greedy Wrappers have been shown to be computationally advantageous and robust against overfitting (Reunanen, 2003).

8.2 Feature Classification for Wrapper

Wrappers are greedy algorithms that offer computational advantages. They reduce the number of runs of the training algorithm from 2^F to \frac{F(F+1)}{2}. While this is significantly less than 2^F, it is still prohibitively expensive for large feature sets. For instance, in our case, this would amount to 42,778 runs of the training algorithm, which is not feasible. Hence, we use a coarse Wrapper that runs on sets or classes of features as opposed to individual features. We use domain-specific knowledge to identify groups of features and perform the search process on these groups of features as opposed to individual ones. This allows us to limit the number of iterations required for the search process to a reasonable number.

Recall that we have 293 features. We classify these into 39 classes. First, we explain how we classify the first 270 features listed in Table A1 in the Appendix. These features encompass the following group of eight features – Listings, AvgPosn, CTR, HDCTR, WCTR, PosnCTR, ADT, SkipRate. We classify these into 36 classes, as shown in Figure 21. At the lowest denomination, this set of features can be obtained at the URL or domain level.14 This gives us two sets of features. At the next level, these features can be aggregated at the query level, at the term level (term1, term2, term3, or term4), or aggregated over all queries and terms (referred to as Null). This gives us six possible options. At a higher level, they can be further aggregated at the session level, over users (across all their sessions), or over all users and sessions (referred to as Global). Thus, we have 2 × 6 × 3 = 36 sets of features.

14 Note that SkipRate is not calculated at the domain level; hence, at the domain level this set consists of seven features.

Figure 21: Classes of features used in feature-selection.

Next, we classify features 271–278, which consist of Listings, CTR, HDCTR, and ADT at the user and session levels, aggregated over all URLs and domains, into the 37th class. The 38th class consists of user-specific click probabilities for all ten positions, aggregated over the user history (features 279–288 in Table A1 in the Appendix). Finally, the remaining features 289–292 form the 39th class. Note that SERPRank is not classified under any class, since it is included in all models by default.

session levels, aggregated over all URLs and domains, into the 37th class. The 38th class consistsof user-specific click probabilities for all ten positions, aggregated over the user history (features279−288 in Table A1 in the Appendix). Finally, the remaining features 289−292 form the 39th class.Note that SERPRank is not classified under any class, since it is included in all models by default.

8.3 Results: Feature Selection

We use a search process that starts with a minimal singleton feature set that comprises just the SERPRank feature. This produces the baseline NDCG of 0.7937. Then we sequentially explore the feature classes using the best-first selection rule. Table 7 shows the results of this exploration process, along with the class of features added in each iteration. We stop at iteration six (by which time we have run the model 219 times), when the improvement in NDCG satisfies the condition laid out in Equation (24). The improvement in NDCG at each iteration is also shown in Figure 22, along with the maximum NDCG obtainable with a model that includes all the features. Though the condition for the stopping rule is satisfied at iteration 6, we show the results till iteration 11 in Figure 22, to demonstrate the fact that the improvement in NDCG after iteration 6 is minimal. Note that at iteration six, the pruned model with 47 features produces an NDCG of 0.8112, which equals 95% of the improvement offered by the full model. This small loss in accuracy brings a huge benefit in scalability, as shown in Figures 23 and 24. When using the entire dataset, the training time for the pruned model is ≈ 15% of the time it takes to train the full model. Moreover, the pruned model scales to larger datasets much better than the full model. For instance, the difference in training times for the full and pruned models is small when performed on 20% of the dataset. However, as the data size grows, the training time for the full model increases more steeply than that for the pruned model. While training time is the main time sink in this application, training is only done once. In contrast, testing runtimes have more consequences for the model's applicability, since testing/prediction has to be done in real-time for most applications. As with training times, we find that the pruned model runs significantly faster on the full dataset (≈ 50%) as well as scaling better with larger datasets.

Iteration no.   No. of features   Feature class added at iteration   NDCG
0               1                 SERPRank                           0.7937
1               9                 User–Query–URL                     0.8033
2               17                Global–Query–URL                   0.8070
3               24                User–Null–Domain                   0.8094
4               32                Global–Null–URL                    0.8104
5               40                User-session–Null–URL              0.8108
6               47                Global–Null–Domain                 0.8112

Table 7: Set of features added at each iteration of the Wrapper algorithm, based on Figure 21. The corresponding NDCGs are also shown.

Figure 22: Improvement in NDCG at each iteration of the Wrapper algorithm. The tick marks on the red line refer to the iteration number and the blue line depicts the maximum NDCG obtained with all the features included.

Together, these findings suggest that feature selection is a valuable and important step in reducing runtimes and making personalization implementable in real-time applications when we have big data.


Figure 23: Runtimes for training the full and pruned models on different percentages of the data.

Figure 24: Runtimes for testing the full and pruned models on different percentages of the data.

9 Conclusion

Personalization of digital services remains the holy grail of marketing. In this paper, we study an important personalization problem – that of search rankings in the context of online search or information retrieval. Query-based web search is now ubiquitous and used by a large number of consumers for a variety of reasons. Personalization of search results can therefore have significant implications for consumers, businesses, governments, and for the search engines themselves.

We present a three-pronged machine learning framework that improves search results through automated personalization using gradient-based ranking. We apply our framework to data from the premier eastern European search engine, Yandex, and provide evidence of substantial improvements in search quality using personalization. In contrast, standard maximum likelihood based procedures provide zero improvement, even when supplied with features generated by our framework. We quantify the value of the length of user history, the relative effectiveness of short-term and long-term context, and the impact of other factors on the returns to personalization. Finally, we show that our framework can perform efficiently at scale, making it well suited for real-time deployment.

Our paper makes four key contributions to the marketing literature. First, it presents a general machine learning framework that marketers can use to rank recommendations using personalized data in many different settings. Second, it presents empirical evidence in support of the returns to personalization in the online search context. Third, it demonstrates how big data can be leveraged to improve marketing outcomes without compromising the speed or real-time performance of digital applications. Finally, it provides insights on how machine learning methods can be adapted to solve important marketing problems that have been technically unsolvable so far (using traditional econometric or analytical models). We hope our work will spur more research on the application of machine learning methods, not only to personalization-related issues, but also to a broad array of marketing issues.


A Appendix

Table A1: List of features.

Feature No. Feature Name Feature Classes

1 Listings(query , url ,−,−,−) FG, FQ, FU , FC

2 CTR(query , url ,−,−,−) FG, FQ, FU , FC

3 ADT(query , url ,−,−,−) FG, FQ, FU , FW

4 AvgPosn(query , url ,−,−,−) FG, FQ, FU , FR

5 HDCTR(query , url ,−,−,−) FG, FQ, FU , FC , FW

6 WCTR(query , url ,−,−,−) FG, FQ, FU , FC , FN

7 PosnCTR(query , url ,−,−,−) FG, FQ, FU , FC , FR

8 SkipRate(query , url ,−,−,−) FG, FQ, FU , FK

9 Listings(query , domain,−,−,−) FG, FQ, FD, FC

10 CTR(query , domain,−,−,−) FG, FQ, FD, FC

11 ADT(query , domain,−,−,−) FG, FQ, FD, FW

12 AvgPosn(query , domain,−,−,−) FG, FQ, FD, FR

13 HDCTR(query , domain,−,−,−) FG, FQ, FD, FC , FW

14 WCTR(query , domain,−,−,−) FG, FQ, FD, FC , FN

15 PosnCTR(query , domain,−,−,−) FG, FQ, FD, FC , FR

16-19 Listings(term1..4, url ,−,−,−) FG, FT , FU , FC

20-23 CTR(term1..4, url ,−,−,−) FG, FT , FU , FC

24-27 ADT(term1..4, url ,−,−,−) FG, FT , FU , FW

28-31 AvgPosn(term1..4, url ,−,−,−) FG, FT , FU , FR

32-35 HDCTR(term1..4, url ,−,−,−) FG, FT , FU , FC , FW

36-39 WCTR(term1..4, url ,−,−,−) FG, FT , FU , FC , FN

40-43 PosnCTR(term1..4, url ,−,−,−) FG, FT , FU , FC , FR

44-47 SkipRate(term1..4, url ,−,−,−) FG, FT , FU , FK

48-51 Listings(term1..4, domain,−,−,−) FG, FT , FD, FC

52-55 CTR(term1..4, domain,−,−,−) FG, FT , FD, FC

56-59 ADT(term1..4, domain,−,−,−) FG, FT , FD, FW

60-63 AvgPosn(term1..4, domain,−,−,−) FG, FT , FD, FR

64-67 HDCTR(term1..4, domain,−,−,−) FG, FT , FD, FC , FW

68-71 WCTR(term1..4, domain,−,−,−) FG, FT , FD, FC , FN

72-75 PosnCTR(term1..4, domain,−,−,−) FG, FT , FD, FC , FR

76 Listings(query , url , usr ,−, time) FP , FQ, FU , FC

77 CTR(query , url , usr ,−, time) FP , FQ, FU , FC

78 ADT(query , url , usr ,−, time) FP , FQ, FU , FW

79 AvgPosn(query , url , usr ,−, time) FP , FQ, FU , FR

80 HDCTR(query , url , usr ,−, time) FP , FQ, FU , FC , FW

81 WCTR(query , url , usr ,−, time) FP , FQ, FU , FC , FN

82 PosnCTR(query , url , usr ,−, time) FP , FQ, FU , FC , FR

83 SkipRate(query , url , usr ,−, time) FP , FQ, FU , FK

84 Listings(query , domain, usr ,−, time) FP , FQ, FD, FC

85 CTR(query , domain, usr ,−, time) FP , FQ, FD, FC

86 ADT(query , domain, usr ,−, time) FP , FQ, FD, FW

87 AvgPosn(query , domain, usr ,−, time) FP , FQ, FD, FR

88 HDCTR(query , domain, usr ,−, time) FP , FQ, FD, FC , FW

89 WCTR(query , domain, usr ,−, time) FP , FQ, FD, FC , FN

90 PosnCTR(query , domain, usr ,−, time) FP , FQ, FD, FC , FR

91-94 Listings(term1..4, url , usr ,−, time) FP , FT , FU , FC

95-98 CTR(term1..4, url , usr ,−, time) FP , FT , FU , FC


99-102 ADT(term1..4, url , usr ,−, time) FP , FT , FU , FW

103-106 AvgPosn(term1..4, url , usr ,−, time) FP , FT , FU , FR

107-110 HDCTR(term1..4, url , usr ,−, time) FP , FT , FU , FC , FW

111-114 WCTR(term1..4, url , usr ,−, time) FP , FT , FU , FC , FN

115-118 PosnCTR(term1..4, url , usr ,−, time) FP , FT , FU , FC , FR

119-122 SkipRate(term1..4, url , usr ,−, time) FP , FT , FU , FK

123-126 Listings(term1..4, domain, usr ,−, time) FP , FT , FD, FC

127-130 CTR(term1..4, domain, usr ,−, time) FP , FT , FD, FC

131-134 ADT(term1..4, domain, usr ,−, time) FP , FT , FD, FW

135-138 AvgPosn(term1..4, domain, usr ,−, time) FP , FT , FD, FR

139-142 HDCTR(term1..4, domain, usr ,−, time) FP , FT , FD, FC , FW

143-146 WCTR(term1..4, domain, usr ,−, time) FP , FT , FD, FC , FN

147-150 PosnCTR(term1..4, domain, usr ,−, time) FP , FT , FD, FC , FR

151 Listings(query , url , usr , session, time) FS , FQ, FU , FC

152 CTR(query , url , usr , session, time) FS , FQ, FU , FC

153 ADT(query , url , usr , session, time) FS , FQ, FU , FW

154 AvgPosn(query , url , usr , session, time) FS , FQ, FU , FR

155 HDCTR(query , url , usr , session, time) FS , FQ, FU , FC , FW

156 WCTR(query , url , usr , session, time) FS , FQ, FU , FC , FN

157 PosnCTR(query , url , usr , session, time) FS , FQ, FU , FC , FR

158 SkipRate(query , url , usr , session, time) FS , FQ, FU , FK

159 Listings(query , domain, usr , session, time) FS , FQ, FD, FC

160 CTR(query , domain, usr , session, time) FS , FQ, FD, FC

161 ADT(query , domain, usr , session, time) FS , FQ, FD, FW

162 AvgPosn(query , domain, usr , session, time) FS , FQ, FD, FR

163 HDCTR(query , domain, usr , session, time) FS , FQ, FD, FC , FW

164 WCTR(query , domain, usr , session, time) FS , FQ, FD, FC , FN

165 PosnCTR(query , domain, usr , session, time) FS , FQ, FD, FC , FR

166-169 Listings(term1..4, url , usr , session, time) FS , FT , FU , FC

170-173 CTR(term1..4, url , usr , session, time) FS , FT , FU , FC

174-177 ADT(term1..4, url , usr , session, time) FS , FT , FU , FW

178-181 AvgPosn(term1..4, url , usr , session, time) FS , FT , FU , FR

182-185 HDCTR(term1..4, url , usr , session, time) FS , FT , FU , FC , FW

186-189 WCTR(term1..4, url , usr , session, time) FS , FT , FU , FC , FN

190-193 PosnCTR(term1..4, url , usr , session, time) FS , FT , FU , FC , FR

194-197 SkipRate(term1..4, url , usr , session, time) FS , FT , FU , FK

198-201 Listings(term1..4, domain, usr , session, time) FS , FT , FD, FC

202-205 CTR(term1..4, domain, usr , session, time) FS , FT , FD, FC

206-209 ADT(term1..4, domain, usr , session, time) FS , FT , FD, FW

210-213 AvgPosn(term1..4, domain, usr , session, time) FS , FT , FD, FR

214-217 HDCTR(term1..4, domain, usr , session, time) FS , FT , FD, FC , FW

218-221 WCTR(term1..4, domain, usr , session, time) FS , FT , FD, FC , FN

222-225 PosnCTR(term1..4, domain, usr , session, time) FS , FT , FD, FC , FR

226 Listings(−, url ,−,−,−) FG, FU , FC

227 CTR(−, url ,−,−,−) FG, FU , FC

228 ADT(−, url ,−,−,−) FG, FU , FW

229 AvgPosn(−, url ,−,−,−) FG, FU , FR

230 HDCTR(−, url ,−,−,−) FG, FU , FC , FW

231 WCTR(−, url ,−,−,−) FG, FU , FC , FN

232 PosnCTR(−, url ,−,−,−) FG, FU , FC , FR

233 SkipRate(−, url ,−,−,−) FG, FU , FK


234 Listings(−, domain,−,−,−) FG, FD, FC

235 CTR(−, domain,−,−,−) FG, FD, FC

236 ADT(−, domain,−,−,−) FG, FD, FW

237 AvgPosn(−, domain,−,−,−) FG, FD, FR

238 HDCTR(−, domain,−,−,−) FG, FD, FC , FW

239 WCTR(−, domain,−,−,−) FG, FD, FC , FN

240 PosnCTR(−, domain,−,−,−) FG, FD, FC , FR

241 Listings(−, url , usr ,−, time) FP , FU , FC

242 CTR(−, url , usr ,−, time) FP , FU , FC

243 ADT(−, url , usr ,−, time) FP , FU , FW

244 AvgPosn(−, url , usr ,−, time) FP , FU , FR

245 HDCTR(−, url , usr ,−, time) FP , FU , FC , FW

246 WCTR(−, url , usr ,−, time) FP , FU , FC , FN

247 PosnCTR(−, url , usr ,−, time) FP , FU , FC , FR

248 SkipRate(−, url , usr ,−, time) FP , FU , FK

249 Listings(−, domain, usr ,−, time) FP , FD, FC

250 CTR(−, domain, usr ,−, time) FP , FD, FC

251 ADT(−, domain, usr ,−, time) FP , FD, FW

252 AvgPosn(−, domain, usr ,−, time) FP , FD, FR

253 HDCTR(−, domain, usr ,−, time) FP , FD, FC , FW

254 WCTR(−, domain, usr ,−, time) FP , FD, FC , FN

255 PosnCTR(−, domain, usr ,−, time) FP , FD, FC , FR

256 Listings(−, url , usr , session, time) FS , FU , FC

257 CTR(−, url , usr , session, time) FS , FU , FC

258 ADT(−, url , usr , session, time) FS , FU , FW

259 AvgPosn(−, url , usr , session, time) FS , FU , FR

260 HDCTR(−, url , usr , session, time) FS , FU , FC , FW

261 WCTR(−, url , usr , session, time) FS , FU , FC , FN

262 PosnCTR(−, url , usr , session, time) FS , FU , FC , FR

263 SkipRate(−, url , usr , session, time) FS , FU , FK

264 Listings(−, domain, usr , session, time) FS , FD, FC

265 CTR(−, domain, usr , session, time) FS , FD, FC

266 ADT(−, domain, usr , session, time) FS , FD, FW

267 AvgPosn(−, domain, usr , session, time) FS , FD, FR

268 HDCTR(−, domain, usr , session, time) FS , FD, FC , FW

269 WCTR(−, domain, usr , session, time) FS , FD, FC , FN

270 PosnCTR(−, domain, usr , session, time) FS , FD, FC , FR

271 Listings(−,−, usr ,−, time) FP , FC

272 CTR(−,−, usr ,−, time) FP , FC

273 HDCTR(−,−, usr ,−, time) FP , FC

274 ADT(−,−, usr ,−, time) FP , FW

275 Listings(−,−, usr , session, time) FS , FC

276 CTR(−,−, usr , session, time) FS , FC

277 HDCTR(−,−, usr , session, time) FS , FC

278 ADT(−,−, usr , session, time) FS , FW

279-288 ClickProb(usr , posn, time) FP , FC

289 NumSearchResults(query) FG, FQ, FC

290 NumClickedURLs(query) FG, FQ, FC

291 Day(usr , session, time) FS , FU

292 QueryTerms(query) FQ

293 SERPRank(url , usr , session, search)
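
To illustrate how entries in Table A1 are constructed from the search logs, the sketch below computes two representative features: a global click-through rate at the (query, url) level (feature 2) and its user-personalized counterpart that conditions on the focal user's history before the current search (feature 77). The DataFrame `log`, its column names, and the zero default for empty histories are assumptions made for exposition, not the paper's actual data schema.

```python
# Sketch only: two representative Table A1 features computed from a
# hypothetical impression log with columns
# [user, session, time, query, url, posn, clicked].
import pandas as pd

def global_ctr(log: pd.DataFrame) -> pd.Series:
    """Feature 2, CTR(query, url, -, -, -): share of listings of a url
    for a query (across all users) that were clicked."""
    return log.groupby(["query", "url"])["clicked"].mean()

def personal_ctr(log: pd.DataFrame, user, query, url, now) -> float:
    """Feature 77, CTR(query, url, usr, -, time): the same rate restricted
    to the focal user's impressions strictly before the current search."""
    hist = log[(log["user"] == user) & (log["time"] < now) &
               (log["query"] == query) & (log["url"] == url)]
    return float(hist["clicked"].mean()) if len(hist) else 0.0
```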

