Personalized Click Prediction in Sponsored Search

Haibin Cheng
Yahoo! Labs
4401 Great America Parkway
Santa Clara, CA, U.S.A.

[email protected]

Erick Cantú-Paz
Yahoo! Labs
4401 Great America Parkway
Santa Clara, CA, U.S.A.

[email protected]

ABSTRACT

Sponsored search is a multi-billion dollar business that generates most of the revenue for search engines. Predicting the probability that users click on ads is crucial to sponsored search because the prediction is used to influence ranking, filtering, placement, and pricing of ads. Ad ranking, filtering and placement have a direct impact on the user experience, as users expect the most useful ads to rank high and be placed in a prominent position on the page. Pricing impacts the advertisers’ return on their investment and revenue for the search engine. The objective of this paper is to present a framework for the personalization of click models in sponsored search. We develop user-specific and demographic-based features that reflect the click behavior of individuals and groups. The features are based on observations of search and click behaviors of a large number of users of a commercial search engine. We add these features to a baseline non-personalized click model and perform experiments on offline test sets derived from user logs as well as on live traffic. Our results demonstrate that the personalized models significantly improve the accuracy of click prediction.

Categories and Subject Descriptors

H.3.5 [Online Information Services]: Commercial Services; H.4.m [Information Systems]: Miscellaneous; I.5.2 [Design Methodology]: Classifier Design and Evaluation

General Terms

Algorithms, Measurement, Design, Experimentation, Human Factors

Keywords

Sponsored Search, Click Prediction, Personalization, User Profile, Demographic, Click Feedback, Maximum Entropy Modeling

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WSDM’10, February 4–6, 2010, New York City, New York, USA.
Copyright 2010 ACM 978-1-60558-889-6/10/02 ...$10.00.

1. INTRODUCTION

Sponsored search is an Internet advertising system that generates most of the revenue of search engines by presenting targeted advertisements along with the search results. In the common “pay-per-click” model, advertisers are charged for each click on their ads. To maximize the revenue for a search engine and maintain a desirable user experience, the sponsored search system needs to make decisions related to selection, ranking and placement of the ads. These decisions are based greatly on the probabilities that users will click on ads, and therefore accurate click prediction is an essential problem in sponsored search.

Current state-of-the-art sponsored search systems typically rely on a machine learned model to predict the clickability of ads returned for a user search query. For the experiments in this paper, we will use a system based on numerous user-independent features as a baseline. Some of these features are based on the similarity of the query to the text of the ads and range in complexity from simple word or phrase overlap to more sophisticated semantic similarities between the query and different elements of the ads. Other features are related to the historical performance of ads. In our experience, certain statistics of the past performance of ads are good predictors of the click probability. Yet another group of features gives contextual information, such as the time of day or day of the week. All of these features ignore the users both individually and as parts of groups with similar behaviors, and therefore the model will predict the same probability of click for every user. We believe that personalizing the click prediction benefits both the users and the advertisers: the users will be presented ads in the manner that is most relevant to them, and the advertisers will receive clicks from users who are more engaged with the ads.

The objective of this paper is to present the design of personalized click prediction models. An essential part of these models is the development of new user-related features. We base our features on observations over a significant volume of search queries from a large number of users. Our observations suggest that user click behavior varies significantly with regard to their demographic background, such as age or gender. We investigate the click distribution for different users from various backgrounds and design a set of demographic features to model their group clicking patterns. Recognizing that there is still significant variability in demographic groups, we also investigate user-specific features. The new user-related features are integrated with other features in a maximum entropy classification framework and contribute to the final predicted clickability score for each query-ad-user tuple. We tested the personalized models offline on a large test set based on log data of the Yahoo! search engine. The results show that the personalized models are significantly more accurate than the non-personalized baseline. In addition, we report results of a test on live traffic of a personalized model that confirm the offline evaluations.

Figure 1: Overview of sponsored search system.

The rest of this paper is organized as follows. Section 2 briefly outlines one approach to click prediction in sponsored search. Next, we present a study on user click distribution in Section 3. The personalized click prediction framework is proposed in Section 4. The user-related features developed in our work are introduced in Section 5. The experimental setup and results are presented in Section 6. We discuss some related work in Section 7 and conclude the paper with a summary of our findings and proposals for future work in Section 8.

2. CLICK PREDICTION

Sponsored search is a complex advertising system that presents ads to users of search engines. It involves several processes as illustrated in Figure 1. The input query from the user is used to retrieve a list of candidate ads. The exact mechanisms of query parsing, query expansion, and ad retrieval used in our system are beyond the scope of this paper. For our purposes we assume that we receive a set of candidate ads that need to be scored by the click model to estimate the probability they will be clicked. This estimate is an essential component of the sponsored search system as it influences user experience and revenue for the search engine: the click probability is a factor used to rank the ads in appropriate order, filter out uninteresting ads, place the ads in different sections of the page, and determine the price that will be charged to the advertiser if a click occurs.

We formulate click prediction as a supervised learning problem. We collected click and non-click events from logs as training samples, where each sample represents a query-ad pair presented to a user. Assume there is a set of n training samples, $D = \{(f(q_j, a_j), c_j)\}_{j=1}^{n}$, where $f(q_j, a_j) \in \mathbb{R}^d$ represents the d-dimensional feature vector for query-ad pair j and $c_j \in \{-1, +1\}$ is the corresponding class label (+1: click or −1: non-click). Given a query q and ad a, the problem is to calculate the probability of click p(c|q, a). The maximum entropy model (ME) [4] is well suited for this task because of its strength in combining diverse forms of contextual information, and formulates the click probability for a query-ad pair as follows:

$$p(c \mid q, a) = \frac{1}{1 + \exp\left(\sum_{i=1}^{d} w_i f_i(q, a)\right)} \tag{1}$$

where $f_i(q, a)$ is the i-th feature derived for query-ad pair (q, a) and $w_i \in \mathbf{w}$ is the associated weight. Given the training set D, the maximum entropy model learns the weight vector $\mathbf{w}$ by maximizing the likelihood of exponential models as:

$$\mathbf{w} = \max\left(\sum_{i=1}^{n} \log p(c_i \mid q_i, a_i) + \log p(\mathbf{w})\right) \tag{2}$$

where the first part represents the likelihood function and the second part utilizes a Gaussian prior on the weight vector w to smooth the maximum entropy model [7]. There are many approaches available in the literature [15] to solve this kind of optimization problem, including iterative scaling and its variants, quasi-Newton algorithms, and conjugate gradient ascent. Given the large collection of samples and high dimensional feature space, we use a nonlinear conjugate gradient algorithm [16].
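As a rough illustration of Equation 2, the sketch below fits the weights on a toy data set by maximizing the Gaussian-penalized log-likelihood. It is a minimal sketch under stated assumptions, not the system described in this paper: it uses plain batch gradient ascent in place of the nonlinear conjugate gradient algorithm [16], it adopts the common sign convention p(c = +1 | q, a) = 1/(1 + exp(−w·f)) (Equation 1 as printed differs only by a sign that can be absorbed into w), and the names `fit_me_model`, `sigma2`, `lr`, and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def click_probability(w, f):
    # p(c = +1 | q, a) under the assumed sign convention.
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)))

def penalized_log_likelihood(w, samples, sigma2=1.0):
    # samples: list of (feature_vector, label) with label in {-1, +1}.
    ll = 0.0
    for f, c in samples:
        score = sum(wi * fi for wi, fi in zip(w, f))
        ll += math.log(sigmoid(c * score))
    # Log of a zero-mean Gaussian prior, up to an additive constant.
    ll -= sum(wi * wi for wi in w) / (2.0 * sigma2)
    return ll

def fit_me_model(samples, dim, sigma2=1.0, lr=0.1, steps=500):
    # Plain gradient ascent (an assumption; the paper uses conjugate gradient).
    w = [0.0] * dim
    for _ in range(steps):
        grad = [-wi / sigma2 for wi in w]  # gradient of the log prior
        for f, c in samples:
            score = sum(wi * fi for wi, fi in zip(w, f))
            coeff = c * (1.0 - sigmoid(c * score))
            for i in range(dim):
                grad[i] += coeff * f[i]
        w = [wi + lr * gi for wi, gi in zip(w, grad)]
    return w

# Toy data: feature 0 is a bias term, feature 1 is a binary indicator.
toy = [([1.0, 1.0], +1), ([1.0, 1.0], +1), ([1.0, 1.0], -1),
       ([1.0, 0.0], -1), ([1.0, 0.0], -1)]
w = fit_me_model(toy, dim=2)
```

On this toy set the fitted model assigns a higher click probability to samples where the indicator fires, since two of those three samples were clicked and neither of the others was.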

2.1 Features

An accurate maximum entropy model relies greatly on the design of features f. There are many possible features that can be derived for the purpose of predicting click probabilities. One class of features explores the lexical similarity between the query and ads by calculating word or phrase overlap of the query to different elements of the ads. These features rely on a simple assumption that users tend to click on ads that appear to be relevant to their query and that query-ad overlap is correlated with perceived relevance. We have found some usefulness in these features, but it is clear that the discrimination power of lexical features is limited due to the typically short queries and simple ads.

Another set of features is derived from the historical performance of ads. In our experience, these features are good estimators of the future performance of ads. It is well known [9] that the click-through rate (CTR) of search results or advertisements decreases significantly depending on the position of the results. To account for this position bias, we use a position-normalized statistic known as clicks over expected clicks (COEC):

$$\mathrm{COEC} = \frac{\sum_{r=1}^{R} c_r}{\sum_{r=1}^{R} i_r \cdot \mathrm{CTR}_r}, \tag{3}$$

where the numerator is the total number of clicks received by a query-ad pair (with $c_r$ the clicks received at rank r); the denominator can be interpreted as the expected clicks (ECs) that an average ad would receive after being impressed $i_r$ times at rank r, and $\mathrm{CTR}_r$ is the average CTR for each position in the result page (up to R), computed over all queries and ads.
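A small worked example may help make Equation 3 concrete. The sketch below is illustrative only: the function name `coec`, the 0-based rank indexing, the choice to return `None` when there are no expected clicks, and all the numbers are assumptions.

```python
def coec(clicks, impressions, avg_ctr):
    """Clicks over expected clicks for one query-ad pair (Equation 3).

    clicks[r], impressions[r]: counts observed at rank r (0-based here).
    avg_ctr[r]: average CTR at rank r over all queries and ads.
    Returns None when there are no expected clicks (an assumed convention).
    """
    expected = sum(i * ctr for i, ctr in zip(impressions, avg_ctr))
    if expected == 0:
        return None
    return sum(clicks) / expected

# Hypothetical pair: 100 impressions at rank 1 and 50 at rank 2.
example = coec(clicks=[3, 1], impressions=[100, 50], avg_ctr=[0.02, 0.01])
```

Here an average ad would be expected to collect 100 × 0.02 + 50 × 0.01 = 2.5 clicks, while this pair collected 4, giving a COEC of 1.6. A value above 1 indicates a pair that performs better than average once position is accounted for.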

We can obtain COEC statistics for specific query-ad pairs, and we have found these features to be good predictors of click probabilities. However, many impressions are needed for these statistics to be reliable and therefore data for specific query-ad pairs can be sparse and noisy. To ameliorate this problem, we can obtain additional COEC statistics by counting clicks and expected clicks over aggregations of queries or ads. The full details of these aggregations are outside the scope of this paper, but briefly we note that the advertisers organize their ads in ad groups, campaigns, and accounts. We can exploit this organization and count clicks and expected clicks for different combinations of query-ad groups, campaigns, and accounts. Of course, other aggregations using query or ad clusters are possible and can be used in practice. The COECs and expected clicks computed over each of the aggregation levels are used as inputs to the maximum entropy framework.

There are other features that provide additional context and that we have observed to be helpful in predicting the click probabilities, such as the time of day, the day of the week, and the position on the page where ads have been displayed.

Note that all the features described so far ignore differences among users. We will use the model described in this section as a baseline for our personalized models with user information.

2.2 Feature Quantization, Conjunctions, and Selection

The click feedback (CF) features are expected to be skewed, and this may cause problems for many learning algorithms, including maximum entropy. In our work, the features are transformed into the log form and then quantized using a simple K-means [5] clustering algorithm with objective function:

$$\arg\min \sum_{j=1}^{k} \sum_{v_i \in C_j} \lVert v_i - u_j \rVert^2 \tag{4}$$

where $v_i$ is the feature value after log transform, $u_j$ is the centroid of cluster $C_j$ and k is the number of clusters. Each cluster of feature values represents a quantized segment. We introduce binary indicator features for each segment, and use these binary features as inputs to the ME model. We also introduce a binary indicator feature to indicate that a certain value is missing, which is common for click feedback features (for new ads or new queries, for example).
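The quantization step can be sketched as follows. This is a minimal illustration, not the implementation used in our system: it assumes `log(1 + x)` as the log transform so that zero values are handled, a one-dimensional Lloyd-style K-means with quantile-based initial centroids, and hypothetical names (`kmeans_1d`, `quantize`) and data.

```python
import math

def kmeans_1d(values, k, iters=50):
    # Lloyd's algorithm in one dimension, minimizing the objective in
    # Equation 4. Initial centroids are evenly spaced quantiles of the
    # data (an assumption made here to keep the sketch deterministic).
    s = sorted(values)
    centroids = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda c: (v - centroids[c]) ** 2)
            clusters[j].append(v)
        # Keep the old centroid if a cluster ends up empty.
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return sorted(centroids)

def quantize(raw, centroids):
    # Returns k segment indicators followed by one "missing" indicator.
    k = len(centroids)
    bits = [0] * (k + 1)
    if raw is None:
        bits[k] = 1
        return bits
    v = math.log1p(raw)  # assumed form of the log transform
    j = min(range(k), key=lambda c: (v - centroids[c]) ** 2)
    bits[j] = 1
    return bits

raw_values = [0.1, 0.12, 0.15, 0.9, 1.0, 1.1, 9.0, 10.0, 12.0]
centroids = kmeans_1d([math.log1p(x) for x in raw_values], k=3)
low = quantize(0.11, centroids)
high = quantize(11.0, centroids)
missing = quantize(None, centroids)
```

Exactly one indicator fires for every input, so a new ad with no click history is represented by the dedicated missing-value feature and not by an arbitrary segment.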

To model relationships among features, we create feature conjunctions by taking the cross product of the binary indicators for pairs of features. We select the features to be conjoined using domain knowledge. For example, we have conjunctions between COECs and expected clicks for each of the levels over which we aggregate click feedback statistics. If the COECs and ECs are quantized into ten segments each, we will add 100 new binary features corresponding to all the possible combinations of the segments.
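The ten-by-ten example above can be written out directly. The sketch below is illustrative; the function name `conjoin` and the particular segments chosen are assumptions.

```python
from itertools import product

def conjoin(bits_a, bits_b):
    # Cross product of two binary indicator vectors: entry (i, j) fires
    # only when segment i of feature A and segment j of feature B both fire.
    return [a & b for a, b in product(bits_a, bits_b)]

coec_bits = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # COEC falls in segment 2
ec_bits = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]    # EC falls in segment 7
conj = conjoin(coec_bits, ec_bits)
```

The result has 100 entries, and because each input has exactly one active segment, exactly one conjunction fires (here at index 2 × 10 + 7 = 27).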

The feature set grows rapidly after quantization and the addition of conjunctions. To limit the growth, we eliminate binary features and conjunctions that appear fewer than 10 thousand times in the training data. After quantization, conjunction and selection, the features are used as inputs to the maximum entropy click model.
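Frequency-based selection of this kind amounts to a single counting pass over the training data. The following sketch assumes each training sample is represented as the set of names of its active binary features; the function name `select_features`, the feature names, and the tiny threshold in the example are hypothetical (the threshold in our work is 10 thousand).

```python
from collections import Counter

def select_features(samples, min_count):
    # samples: iterable of sets of active binary feature names.
    counts = Counter()
    for active in samples:
        counts.update(active)
    # Keep only features that fire at least min_count times.
    return {name for name, n in counts.items() if n >= min_count}

samples = [{"coec=2", "ec=7"}, {"coec=2"}, {"coec=2", "coec=2&ec=7"}, {"ec=7"}]
kept = select_features(samples, min_count=2)
```

With a threshold of 2, the conjunction `coec=2&ec=7` is dropped because it appears only once, while the two base indicators are kept.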

3. USER CLICK ANALYSIS

As a first step towards our goal of personalized click models, we conduct a preliminary analysis of the log data to unveil some patterns of user behavior in sponsored search.

[Figure 2 plots. Upper panel: x-axis "Views Per <User, Ad>", y-axis "Percentage of <User, Ad> Pairs". Lower panel: x-axis "Views Per <User, Ad>", y-axis "Average CTR".]

Figure 2: The upper graph shows the distribution of user-ad views. The bottom graph shows that CTR of user-ad pairs is inversely proportional to the number of views.

3.1 User-Ad View and Click Distributions

For the initial part of the analysis, we consider one day of data (2008/08/01) sampled from the logs of the Yahoo! search engine. Our initial objective is to examine the distribution of user-ad views and clicks. The upper graph of Figure 2 shows that the distribution of user-ad views follows a power law distribution. In the figure, the X-axis represents the number of daily observations for each user-ad pair and the Y-axis represents the percentage of unique user-ad pairs. The data show that approximately 85% of user-ad pairs are observed only once and 10% are observed twice, which means that only 5% of the user-ad pairs are observed more than twice. The bottom graph of Figure 2 plots the average CTR for each group of user-ad pairs with the same number of views (note that the sudden jump at x = 16 may be due to limited data size). The average CTR clearly decreases as the number of views increases. This trend motivates one of the features that we describe in Section 5, in which we compare the past click behavior of each user to a group of users with similar searching patterns.
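The two statistics plotted in Figure 2 can be computed from impression logs along the following lines. This is a sketch under assumptions: the log record layout `(user_id, ad_id, clicked)`, the function name `view_stats`, and the toy events are hypothetical and do not reflect the actual log format.

```python
from collections import defaultdict

def view_stats(events):
    # events: iterable of (user_id, ad_id, clicked) impression records.
    views = defaultdict(int)
    clicks = defaultdict(int)
    for user, ad, clicked in events:
        views[(user, ad)] += 1
        clicks[(user, ad)] += int(clicked)

    pairs_by_views = defaultdict(int)   # number of pairs seen v times
    clicks_by_views = defaultdict(int)  # clicks from pairs seen v times
    for pair, v in views.items():
        pairs_by_views[v] += 1
        clicks_by_views[v] += clicks[pair]

    total_pairs = len(views)
    # Upper panel: fraction of user-ad pairs with exactly v views.
    share = {v: n / total_pairs for v, n in pairs_by_views.items()}
    # Lower panel: CTR of the group of pairs with exactly v views.
    avg_ctr = {v: clicks_by_views[v] / (v * pairs_by_views[v])
               for v in pairs_by_views}
    return share, avg_ctr

events = [("u1", "a1", True), ("u2", "a1", False), ("u3", "a2", False),
          ("u3", "a2", False), ("u1", "a2", False)]
share, avg_ctr = view_stats(events)
```

Because every pair in a group has the same number of views, dividing the group's clicks by its impressions gives the same value as averaging the per-pair CTRs.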

3.2 View, Click and Bid Distributions per User Demographics

In marketing, consumers are segmented into groups with similar needs or shopping behaviors. This segmentation can occur along different dimensions, such as behavioral, demographic or geographic variables. In the work reported here, we segment users into demographic groups using data disclosed by them as part of the registration process to obtain an account with Yahoo!. These demographic data are not available for all users of the search engine and are not guaranteed to be accurate.

For this part of the study, we use the entire training data that will be described in detail in Section 6, but it is sufficient to say for now that these data span two months of activity of more than 400 million unique users. We partition users with demographic information into groups according to their age, job status, marriage status, interests and occupation, as shown from the top to the bottom panels in Figure 3. The X-axes of the graphs in Figure 3 represent the demographic groups. For example, age is partitioned into 8 segments and there are four possible values for marriage status. The Y-axes show the number of views, average CTR, and average bids for each demographic category. We omit the scales to protect proprietary information. The average bids in the right panel indicate how the ads shown to different segments of users differ in terms of the bids advertisers make and are suggestive of query differences among the groups.

Here we list a few of our observations from Figure 3:

• From all the graphs, we can clearly see that female users consistently view and click more ads than males and that those ads have higher bids, except in a few rare and noisy cases such as age group 0–13.

• Click-through rates have a very strong positive correlation with age.

• Users who are single are more likely to click than married and divorced users.

• Users with travel, shopping and business interests are more likely to click ads, whose bids seem to be higher too. Users with music, sports, and entertainment interests are less likely to click ads.

• Users whose occupations are in financial, engineering, and travel industries are more likely to click than those in entertainment and education industries.

These data show strong correlations between the users’ demographic background and their searching and clicking behaviors, which could be used to improve the accuracy of click prediction.

4. PERSONALIZED CLICK PREDICTION

The baseline prediction model described in Section 2 estimates the click probabilities using user-independent features and therefore will produce the same estimates for every user. In this section, we propose a family of models that incorporate user information.

Our vision of personalized models is a direct extension of the maximum entropy model introduced in Section 2. Let $D = \{(f(q_j, a_j, u_j), c_j)\}_{j=1}^{n}$ represent the new training set, where each sample j represents a click or non-click event when ad $a_j$ is presented to user $u_j$ for query $q_j$. Instead of assigning a click probability for each new query-ad pair as shown in Equation 1, we develop a new click prediction function p(c|q, a, u) for each query-ad-user tuple as:

$$p(c \mid q, a, u) = \frac{1}{1 + \exp\left(\sum_{i=1}^{d} w_i f_i(q, a, u)\right)} \tag{5}$$

where $f_i(q, a, u)$ is a feature derived from the query (q), ad (a) or user (u) or a combination thereof, and $w_i$ is the weight associated with each feature. Similarly, the weight vector $\mathbf{w}$ is learned by maximizing the likelihood of exponential models with Gaussian prior as:

$$\mathbf{w} = \max\left(\sum_{i=1}^{n} \log p(c_i \mid q_i, a_i, u_i) + \log p(\mathbf{w})\right). \tag{6}$$

Again, the nonlinear conjugate gradient method is used to solve this optimization problem.

In the personalized click model, user information is integrated into the ME model directly as features and combined with features derived from the query and the ads. The learning algorithm will automatically tune the weights as before. The next section describes the user-related features we use.
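To illustrate how user features enter the model alongside query-ad features, the sketch below scores a sparse set of active binary features. It is a simplified illustration under assumptions: it uses the common sign convention in which a positive weight raises the click probability (Equation 5 as printed differs only by a sign that can be absorbed into the weights), and the feature names, weights, and function names are all hypothetical.

```python
import math

def predict_click(weights, active_features):
    # With binary features, the linear score is the sum of the weights
    # of the features that are active for this query-ad-user tuple.
    score = sum(weights.get(name, 0.0) for name in active_features)
    return 1.0 / (1.0 + math.exp(-score))

def personalized_features(query_ad_features, user_features):
    # Union of user-independent and user-related indicator names.
    return set(query_ad_features) | set(user_features)

# Hypothetical learned weights.
weights = {"bias": -2.0, "qa:coec_seg=3": 0.8, "user:age_seg=5": 0.3}

baseline = {"bias", "qa:coec_seg=3"}
p_baseline = predict_click(weights, baseline)
p_personal = predict_click(
    weights, personalized_features(baseline, {"user:age_seg=5"}))
```

The baseline prediction is the same for every user, whereas the personalized prediction shifts with whichever user indicators are active.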

5. USER FEATURES

We consider two sets of user-related features. The first set is composed of demographic group features, which capture the behaviors of groups of users segmented according to their demographic background. The second set of features is composed of features associated with each particular user.

For both sets of features, we accumulate historical information over multiple days in the past. We remove the users with fewer than two ad views during a day. We found that this was a convenient way to smooth the data. Then, for each item, we consider a time window that is long enough to accumulate 500 ad impressions or reach a maximum of 90 days. This variable length window is a good compromise between accumulating enough data to compute reliable statistics for items with low daily activity and capturing temporal changes in behavior for items with high activity.
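The variable-length window can be sketched as a walk backwards through daily counts that stops at whichever limit is reached first. The function name `window_stats`, the input layout, and the example counts are assumptions made for illustration.

```python
def window_stats(daily, min_impressions=500, max_days=90):
    # daily: list of (impressions, clicks) tuples, most recent day first.
    total_imps = total_clicks = days = 0
    for imps, clicks in daily[:max_days]:
        total_imps += imps
        total_clicks += clicks
        days += 1
        if total_imps >= min_impressions:
            break  # enough data; keep the window short
    return days, total_imps, total_clicks

# A high-activity item reaches 500 impressions within two days.
busy = window_stats([(300, 6), (250, 5), (400, 8)])
# A low-activity item never does, so the window is capped at 90 days.
quiet = window_stats([(2, 0)] * 120)
```

The busy item uses only its two most recent days, which keeps its statistics responsive to recent changes, while the quiet item uses the full 90 days to gather as much evidence as is allowed.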

5.1 Demographic Features

In our work, we partition users into demographic segments based on age, gender, marriage status, interests, job status, and occupation. To incorporate this information into the maximum entropy framework, we introduce binary features for each possible value of the demographic variables. For example, there are eight binary features indicating each of the possible age groups, and only one of these eight features triggers for each user.
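The binary encoding of a demographic variable can be sketched as below. The segment labels are placeholders, since the actual age boundaries are not listed here, and the function name `one_hot` and the feature naming scheme are assumptions.

```python
def one_hot(value, categories, prefix):
    # One binary feature per possible value. Exactly one fires when the
    # value is known; none fires when it is missing or unrecognized.
    return {f"{prefix}={c}": int(c == value) for c in categories}

# Placeholder labels for the eight age segments.
AGE_SEGMENTS = [f"segment_{i}" for i in range(8)]

features = one_hot("segment_3", AGE_SEGMENTS, "age")
unknown = one_hot(None, AGE_SEGMENTS, "age")
```

For a user without demographic information, none of the eight indicators fires in this sketch; a separate missing-value indicator would be another reasonable choice.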

Besides using the demographic information as direct inputs to the click model, we develop click feedback features to capture the historical behavior of user groups. We can segment the users based on one or more of the demographic variables available to us. Specifically, in our experiments we take the cross product of the gender and age groups to partition the users, resulting in 16 groups composed of users with the same gender and of the same age group. Once the demographic groups are formed, we envision numerous combinations with other factors for which we already compute click feedback features. For example, we already accumulate historical information for all the ads of specific advertiser accounts and, along those lines, we can conceive of accumulating data for combinations of accounts and demographic groups. Such a combination would capture the relative preference of different demographic groups to specific brands or merchants. Table 1 lists some of the combinations we conceive as being useful in click prediction.
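As one possible reading of the account-by-group combination described above, the sketch below accumulates clicks and expected clicks per (account, demographic group) key and reports a COEC for each. The event layout, the gender and age labels, and the function name `group_coec` are hypothetical; the expected click of an impression is taken to be the average CTR of the position where the ad was shown, following Equation 3.

```python
from collections import defaultdict
from itertools import product

GENDERS = ["female", "male"]
AGE_SEGMENTS = list(range(8))                   # placeholder segment ids
GROUPS = list(product(GENDERS, AGE_SEGMENTS))   # 2 x 8 = 16 groups

def group_coec(events):
    # events: (account_id, gender, age_segment, clicked, expected_click)
    clicks = defaultdict(float)
    ecs = defaultdict(float)
    for account, gender, age, clicked, ec in events:
        key = (account, (gender, age))
        clicks[key] += clicked
        ecs[key] += ec
    # Map each key to (COEC, expected clicks), skipping empty keys.
    return {k: (clicks[k] / ecs[k], ecs[k]) for k in ecs if ecs[k] > 0}

events = [
    ("acct1", "female", 2, 1, 0.05),
    ("acct1", "female", 2, 0, 0.05),
    ("acct1", "male", 2, 0, 0.10),
]
stats = group_coec(events)
```

Returning the expected clicks together with the COEC mirrors the way both quantities are used as model inputs, since the expected clicks indicate how much evidence supports each COEC value.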

As for specific features, our experiments use the COEC and expected click features described earlier, but we can
