
Spatio-Temporal Models for Estimating Click-through Rate

Deepak Agarwal, Yahoo! Labs, 701 First Ave, Sunnyvale, CA 94089. [email protected]

Bee-Chung Chen, Yahoo! Labs, 701 First Ave, Sunnyvale, CA 94089. [email protected]

Pradheep Elango, Yahoo! Labs, 701 First Ave, Sunnyvale, CA 94089. [email protected]

ABSTRACT

We propose novel spatio-temporal models to estimate click-through rates in the context of content recommendation. We track article CTR at a fixed location over time through a dynamic Gamma-Poisson model and combine information from correlated locations through dynamic linear regressions, significantly improving on per-location models. Our models adjust for user fatigue through an exponential tilt to the first-view CTR (the probability of a click on the first exposure to an article) that is based only on user-specific repeat-exposure features. We illustrate our approach on data obtained from a module (the Today Module) published regularly on the Yahoo! Front Page and demonstrate significant improvement over commonly used baseline methods. Large-scale simulation experiments studying the performance of our models under different scenarios provide encouraging results. Throughout, all modeling assumptions are validated via rigorous exploratory data analysis.

Categories and Subject Descriptors

H.3.5 [Information Storage and Retrieval]: Online Information Services; G.3 [Probability and Statistics]

General Terms

Algorithms, Design, Experimentation, Measurement

1. INTRODUCTION

The web has become an important medium for distributing information from various sources. Web sites like the blogosphere, YouTube, Digg, Yahoo!, Google News, and MSN, and news feeds like the Associated Press and the Washington Post, provide users with a wide range of choices to keep up with diverse content in a timely fashion. However, the explosion of content may make it difficult for users to select the right content to look at, a phenomenon also referred to as information overload. Hence, content publishers and aggregators select the best and most relevant content to attract and retain users to their sites on an ongoing basis.

In general, a publisher page is composed of several sections; some serve personalized content, some are links to applications (e.g., e-mail), some allow users to add new content of their choice, and so on. Several sites also publish a module with the most popular content; Digg, Yahoo! Buzz, Top Stories on Google News, and Top Picks on MSN are some examples. Often, human editors select a set of articles from a diverse content pool. While editorial selection prunes low-quality content and ensures various constraints that characterize a site (e.g., in terms of topicality) are met, the quality of a site may improve further if it adapts quickly to show the best articles. One way is to constantly monitor the content quality (defined through user feedback: clicks, votes, etc.) of articles over time. In this paper, we develop and illustrate novel statistical methods to track article quality for a prominent module, called the Today Module, published on the Yahoo! Front Page.

Today Module Overview: Figure 1 shows the location of the Today Module on the Yahoo! Front Page. The module consists of four tabs: Featured, Entertainment, Sports, and Video. Each of the four tabs publishes articles at four positions. Unlike the other three tabs, the Featured Tab is not tied to a particular topic and selects articles from a diverse content mix. We label the four positions on the Featured Tab as F1, F2, F3, and F4. Only one of the four footers on a tab is "active" as a story (see Figure 1). The F1 story is displayed by default, but a user can switch to another story (at F2, F3, F4) by clicking on the corresponding footer. Since the F1 position has maximum exposure, it is the prime spot on the Today Module. At any given time, the system picks the top four articles from an active set; these are displayed at F1, F2, F3, F4. New articles created by editors are periodically pushed to replace some old ones; this procedure helps keep up with novel and important stories (e.g., breaking news) and eliminate irrelevant and fading ones.

Several measures based on user feedback are commonly used to score article quality; those that rely on user clicks carry a strong signal and are readily available. When the number of clicks is normalized by the amount of exposure, typically measured by the number of times an article is shown at a location (e.g., at a certain position), we get a measure called click-through rate (CTR), i.e., the number of clicks per display; throughout, we use CTR to select the best articles in our application. Other measures, like votes or time spent reading the article, can also be used; the choice depends on the application.

Contributions: We provide a modeling approach to estimate CTR after incorporating spatio-temporal correlations and adjusting for user fatigue. In particular, we propose dynamic models to: (a) track article CTR at a single location through a dynamic Gamma-Poisson model; (b) combine information from correlated locations (e.g., other positions) through dynamic linear regressions to improve tracking performance; and (c) adjust for user fatigue in the recommendation process by modeling CTR drop through repeat-exposure features. Our modeling assumptions are validated via extensive exploratory data analysis. In fact, our analysis of completely randomized data provides new insights into various aspects of user-content interaction. In particular, we show that article CTR decay is better explained by the amount of repeat exposure than by article lifetime, as suggested by [15]; that article CTRs at different positions are correlated, but the correlation is article-specific and evolves over time, so the conventional approach of using article-independent factors to translate CTR among positions can lead to inferior system performance (shown in Section 4); and that there is a strong time-of-day and day-of-week effect in article CTR that can largely be attributed to user location (US versus international). To the best of our knowledge, a large-scale study similar to ours on randomized data has not been reported for a web application before.

Figure 1: The Today Module on the Yahoo! Front Page

We note that although we present our models as enhanced methods for selecting popular stories, it is easy to extend them to perform personalization for user segments. In particular, article CTR estimates from our models can be used as informative features in regression models; one can also track article CTR separately in different user segments in order to provide segment-specific popular stories. A detailed analysis and illustration of such extensions is beyond the scope of this paper; we provide further discussion of this issue in Section 5.

Our roadmap is as follows: In Section 2, we report on an extensive exploratory data analysis that motivates our modeling in later sections. In Section 3, we describe our spatio-temporal modeling approach. Section 4 reports on extensive experiments (with real and synthetic data) that show the effectiveness of our methods. We end with a discussion and scope for future work in Section 5.

2. EXPLORATORY ANALYSIS OF CTR

We mention two key aspects of our data below.

Large volume of click-through data: The Yahoo! homepage gets hundreds of millions of daily user visits; it is the most visited content page on the web. The Today Module is an important component of the homepage that helps improve user engagement, often reflected through clicks on articles displayed on the Featured Tab. The large amount of data (clicks and views) obtained from the Today Module enables us to accurately measure article CTR at fairly fine temporal resolutions. In this paper, we estimate CTR aggregated at 5-minute time intervals.

Serving-biased versus completely randomized data: We analyze data obtained from two buckets. A bucket represents a disjoint set of randomly selected users who are served content according to some scheme (referred to as a serving scheme hereafter). Since our goal in this study is to serve the most popular articles, we consider a bucket (estimated most popular, EMP hereafter) that continues to show the best stories at the F1-F4 positions unless article popularity scores change with recent data. In addition, we create another bucket (random bucket, RND hereafter) that serves a randomly selected set of four distinct articles to each user visit. Previous studies of content optimization systems [15, 16] only report findings based on data obtained from EMP-style serving buckets. The completely randomized nature of data obtained from RND helps us study subtle user-content interaction patterns in an unbiased way; conducting such analysis from EMP data alone requires adjusting for confounding factors (e.g., repeat exposure, selection bias in displayed articles), which is difficult to carry out in practice.

2.1 Temporal Characteristics

Almost every article CTR shows significant temporal variation and different behavior in the two buckets. Figure 2 illustrates this for a randomly selected article. We observe a sharp F1 CTR decay in the EMP bucket (discontinuities indicate periods when the article was taken out of F1) but a strong time-of-day, day-of-week pattern (TOD hereafter) in RND with almost no CTR decay; the diurnal article CTR is several times higher than the nocturnal one (relative to PST). In fact, TOD in RND and decay in EMP are observed for almost all articles.

Next, we explain TOD variations and decay through a comprehensive data analysis. For simplicity, we only analyze article CTR at the F1 position; we discuss multiple positions later in this section.

Regression analysis: To parsimoniously estimate TOD variations in article CTR, we perform a regression analysis separately for each bucket. Let p_{itr} denote the CTR of article i in time interval t with repeat exposure r, where r is the fraction of users in t who have previously seen article i at F1. Through extensive offline analysis, we converged on the following CTR model, which provides a good explanation of CTR dynamics:

log p_{itr} = θ_i + f(t) + g(r)    (1)

where the θ_i's are main effects that represent the adjusted popularity of article i; f(t) is a periodic (period of one week) adaptive regression spline function with knots selected through the AIC criterion [9]; and g(r) is a quadratic function that models article CTR decay. In fact, we plug in the estimate of f(t) from RND (black curve in Figure 3) to obtain f(t) for EMP.[1] The goal of this one-time offline regression analysis is not to build an online model that tracks article CTR, but to illustrate and estimate the TOD pattern and to explain CTR decay as a function of repeat exposure.

[1] This provides a better fit. The model is fitted through a Poisson regression with CTR given by Equation (1). It has 707 parameters and provides an excellent fit (deviance ≈ 0) with few parameters.
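For concreteness, a minimal sketch of how such a Poisson regression could be fitted in Python follows. The column and file names are hypothetical, and hour-of-week dummies stand in for the adaptive regression spline f(t), so this approximates the model above rather than reproducing the paper's exact fit:

```python
import pandas as pd
import statsmodels.api as sm

# One row per (article, 5-minute interval); hypothetical columns:
#   clicks, views, article_id, hour_of_week (0-167),
#   repeat_frac (the repeat-exposure fraction r).
df = pd.read_csv("rnd_f1_intervals.csv")

X = pd.concat(
    [
        pd.get_dummies(df["article_id"], prefix="art"),   # theta_i (per-article effect)
        pd.get_dummies(df["hour_of_week"], prefix="how",  # crude stand-in for f(t)
                       drop_first=True),                  # drop one level to keep full rank
        pd.DataFrame({"r": df["repeat_frac"],             # quadratic g(r)
                      "r2": df["repeat_frac"] ** 2}),
    ],
    axis=1,
).astype(float)

# Poisson regression with log link; views enter as exposure, so the
# linear predictor models log CTR as in Equation (1).
fit = sm.GLM(df["clicks"], X, family=sm.families.Poisson(),
             exposure=df["views"]).fit()
print(fit.summary())
```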


[Figure 2: Empirical F1 CTR of a random article in the EMP and RND buckets. Panel (a): CTR in the EMP bucket; panel (b): CTR in the Random bucket. Axes: scaled CTR versus time of day.]

[Figure 3: TOD pattern for overall traffic (black solid curve) vs. US-only traffic (red dashed curve). Axes: offset in log CTR versus day of week (0-7).]

[Figure 4: Decay as a function of repeat exposure. Axes: offset in log CTR versus fraction of repeated users.]

The excellent fit of the model in Equation (1) also supports a common TOD pattern and decay curve across articles.

Explaining the decay pattern: The estimated decay (i.e., g(r)) in Figure 4 is largely due to the amount of repeat exposure and not the article lifetime, as some earlier studies had indicated [15]. In fact, Figure 2(a) shows an example where article CTR increases with lifetime when the article is put back at F1 after being taken out of EMP for a while. Indeed, the fraction of repeat exposure decreases when the article reappears, and hence it gets a higher CTR relative to when it was shown previously.

Explaining the TOD pattern: To explain the estimated TOD pattern (i.e., f(t)) in Figure 3, we performed an extensive slice-and-dice analysis of the data, which revealed that the traffic split between US and international visits best explains TOD.

[Figure 5: View volumes: US traffic (red solid curve) vs. international traffic (blue dotted curve). Axes: hourly number of views (scaled) versus date (07/15-07/21).]

Figure 5 shows the scaled number of page views per hour over a week (relative to PST) for US and international users separately. Not surprisingly, the Yahoo! Front Page is mostly visited by US users during the day and international users during the night. We note that the CTR of each article for US users is several times higher than that for international users (articles are programmed mainly for a US audience); hence the nocturnal CTR is substantially lower. Re-estimating f(t) via the model in Equation (1) for US-only visits (red dashed curve in Figure 3) still shows a TOD pattern, but with significantly smaller variations. The regression model does not provide a good fit to the international visit data due to data sparseness (low click and visit volumes).

Further analysis with user and article features failed to explain the TOD pattern in US traffic. In general, it is difficult to construct temporal features related to population changes that can explain away the temporal variations in article CTR; capturing such latent population changes directly through a dynamic model that tracks article CTR (instead of population changes) is the approach we follow in this paper. However, sharp temporal changes in CTR require dynamic models to adapt more, forcing them to use a small amount of recent data, which may increase variance and deteriorate tracking performance. Hence, it is always useful to look for features that explain large temporal fluctuations in CTR. All our analysis in subsequent sections is based on US-only user visits (we track international visits separately).

Explaining CTR decay through per-user features: Although aggregate repeat exposure explains CTR decay adequately, adjusting for user fatigue in the recommendation process requires a model that explains CTR decay through user-level repeat-exposure features. Prior exposure to articles may occur in several ways and with varying degrees of strength. We consider features based on the number of previous article views at the story position, the number of previous views at the footer position, the number of previous story clicks, the number of views since the last click, and the time since the first view.

One expects a previous click to be a strong exposure that should substantially reduce the user's propensity to click on the article again. However, this is not true, since each article is composed of multiple links in our application; users who interact with an article generally engage more by exploring multiple links. Figure 6(a) shows the variation in CTR with the number of previous clicks, relative to the first-view CTR (the probability of a click on the first exposure to the article); it shows an increased propensity of repeat clicks. The number of views since a user's last click, shown in Figure 6(b), provides a better explanation for decay, but is noisy.

Out of all the features we looked at, the number of previous F1 article views, shown in Figure 6(c), is the best predictor of decay; features based on time elapsed since first exposure of different types, e.g., Figure 6(d), were noisy and weakly predictive after the first five minutes.

[Figure 6: Relative CTR as a function of different user-activity features (relative to first-view CTR). Panels: (a) previous clicks; (b) views since last click; (c) previous F1 views; (d) minutes since first F1 view.]

2.2 Positional Characteristics

Presentation bias affects article CTR; i.e., the same article displayed at different positions under identical conditions will have different click rates. In our scenario, the F1 position is prominent and receives more attention than the footers.

[Figure 7: F1-to-F2 CTR ratio during a day. Axes: F1 CTR / F2 CTR versus time of day.]

Table 1: Percentage of clickers

  Bucket  | F1 only | F2-F4 only | Both F1 & F2-F4
  EMP     |   58%   |    13%     |       29%
  Random  |   54%   |    17%     |       29%

However, feedback received from the footers provides an additional source of information that is potentially useful for improving CTR estimates. In fact, it is ideal to have a joint model that learns from all positions to improve the individual positional estimates. Prior work (see [13, 12]) assumes a model that translates article CTR among positions through a common article-independent but position-dependent ratio, i.e., CTR(position x) = τ_{x,y} CTR(position y), where the τ's are unknown constants independent of articles. In this section, we study the correlation among positional CTRs for articles over time by using the RND data. Recall that in RND, each article is served at all positions, uniformly at random, in each time interval.

To study the correlation between F1 and Fx article CTR over time (x = 2, 3, 4), we first note that the footer CTRs are similar to one another (but significantly lower than the F1 CTR); hence we only look at the relationship between F1 and F2. In Figure 7, each curve is the F1-to-F2 CTR ratio of an article during its lifetime. As is evident, articles have different ratios that change over time, clearly violating the assumption of an article-independent ratio used in prior work. Next, Figure 8(a) shows the F1 and F2 CTRs of a few randomly selected articles over time. Although there is a strong positive linear relationship for each article, there are substantial variations in the straight lines fitted to the individual articles (both in slope and intercept). Figure 8(b) shows the distribution of the slope as a function of average F1 CTR across articles. As is evident, the variation in slopes is not explained by F1 article CTR. (Intercepts show similar behavior.) In fact, none of the features we analyzed (e.g., article lifetime, time of day, article categories) were able to explain the article-specific F1-to-F2 relationship.

We also note that there is a difference between users who click on F1 and users who click on the footers. Defining a clicker to be a user who clicked the Today Module at least k times in a month, Table 1 shows the percentage of clickers who clicked on F1 only, clickers who clicked on one of the footers only, and clickers who clicked on both, for k = 1. (Other values of k provided similar results.) As seen, the overlap between F1 and footer clickers is not large.

[Figure 8: Relationship between F1 and F2 CTR. In plot (a), each curve starts with the (F2, F1) CTR at time 0 at one of the end points and shows the temporal relationship for the article over time. In plot (b), the slopes of linear regression fits for around 300 article curves are plotted against mean F1 article CTR.]

2.3 Discussion of Exploratory Analysis

Our exploratory analysis clearly shows that article CTR is dynamic. Although we believe the root cause of dynamic CTR is user-population change over time, it is often hard to identify features that capture such population changes in practice. Thus, the ability to track CTR over time at each position (enhanced by combining estimates from different positions) is useful for content optimization systems. We also point out the importance of using completely randomized data for understanding the dynamics of such systems in an unbiased way. Analysis based solely on non-randomized serving data (e.g., data from the EMP bucket) may be misleading. For example, Wu and Huberman [15] analyzed serving data obtained from digg.com, where they empirically showed that the popularity of an article decays quickly over time, and subsequently modeled this decay using article lifetime (time elapsed since first appearance). As in our application, we conjecture that the true reason for decay in their application is not lifetime but the amount of repeat exposure. Also, since most articles in their context have short lifetimes and they were not able to put an article at a prime position for long periods of time, they did not observe the TOD pattern we see in RND. Furthermore, they conducted a follow-up analysis [16] comparing different serving schemes based on their previous model but using a global positional translation factor for each position (i.e., all stories have the same factor). Here again, a global translation factor may not be correct, as we observe in our application. Finally, we note that some interesting temporal patterns were reported in the context of search in [10].

3. SPATIO-TEMPORAL CTR MODELING

We describe our spatio-temporal approach to modeling the CTR of a single article. We refer to the spatial coordinates as locations; each location represents an independent source of information that is correlated with the CTR at the target location. For instance, the target location may be the F1 position, and the footers (F2-F4) may constitute other correlated locations. We assume the CTR of article i when shown at location ℓ to user u at time t is given by θ_{u,iℓt}. To account for user fatigue, we assume the following:

θ_{u,iℓt} = θ^0_{iℓt} exp{g(R_u)}    (2)

where θ^0_{iℓt} is the first-view CTR (the probability of a click on the first exposure to the article) and g(R_u) is a function that depends on the repeat-view features R_u of user u. We note that θ^0_{iℓt} is independent of the user u; we adjust for user fatigue by an "exponential tilt" to the first-view CTR through a function that depends only on the repeat-exposure features. Thus, our modeling approach consists of: (a) a dynamic model to estimate θ^0_{iℓt} for a fixed location over time; (b) a regression model to enhance the estimate of θ^0_{iℓt} at the target location ℓ by combining information from correlated locations; and (c) estimating the function g(R_u) to adjust for repeat exposure to the article per user. We note that although we use the scoring function in Equation (2) to select the top-k articles in a content recommendation system in this paper, the estimation technique is general and has wider applicability.

We now describe our approach to selecting the top-k articles to be shown in the EMP bucket in the context of our application. We note that the locations in our application are the positions at which articles are shown. Since one of the positions (F1 in our case) is more important than the others, we rank articles based on their predicted F1 CTR (or some monotone transform of it) in the next time interval. This is a reasonable strategy if F1 article CTR is independent of the CTRs of other articles shown at the footer positions; this is indeed the case in our application, as shown empirically in the Appendix. The presence of the RND bucket ensures we obtain data on all live articles at all times; thus we can always predict the CTR of a live article in the next time interval. As discussed earlier, we build models to track first-view article CTR and adjust for user fatigue through repeat-exposure features to provide each user with user-specific most popular articles.

In the rest of the paper, we describe candidate models for each component of our unified spatio-temporal modeling approach, motivated by our content optimization problem. More specifically, we address the following.

• Effective tracking of article CTR at a single position: As shown in Section 2, the temporal dynamics of article CTR are difficult to model through known features, so models that quickly adapt to temporal changes are attractive. We note that articles in explore mode (either in RND or some other small bucket) may receive fewer views (and clicks) than those that are promising. This may imbalance the article view distribution over time; our tracking model must be robust to such imbalance in sample size. We provide a Bayesian time-series model based on a Gamma-Poisson assumption (validated through empirical analysis) that is both accurate and robust.

• Improved tracking by combining information across positions: With small sample sizes, the estimated article CTR at the target position may improve substantially by combining information from other positions. However, the variation of positional correlations across articles and time makes this a challenging modeling task which, to the best of our knowledge, has not been addressed before. We provide a dynamic regression model to address this issue in Section 3.2.

• Adjusting for user fatigue: As shown in Section 2, article CTR decays with repeat exposure. Developing user-specific repeat-exposure features that can appropriately down-weight article CTR per user is of paramount importance for good performance. In Section 3.3, we present a novel method that provides recommendations after adjusting for fatigue in a model-based way.

Before we proceed further, a word about our notation. Since decay is modeled using features, we drop the user index. Throughout, θ_{ixt}, c_{ixt}, and v_{ixt} denote the true article CTR, the number of clicks, and the number of views (respectively) obtained by showing article i at location x in time interval t, for users who see the article for the first time. Some or all subscripts are dropped when not needed. Unless otherwise mentioned, CTR always means first-view CTR in the rest of this section.

3.1 Dynamic Models for a Single Location

We begin with CTR tracking for a single article at a fixed location, and drop the article and location indices. At any given time interval t, we have a prior probability distribution for the CTR θ_t that is based on data up to time interval t−1; this is combined with the observed clicks and views (c_t and v_t, respectively) at time t to obtain the posterior probability distribution of the CTR θ_t at time t.

Gamma-Poisson and Beta-Binomial: For occurrence (or count) data, the Beta-Binomial and Gamma-Poisson models are attractive choices [4]. For a Beta-Binomial model, the conditional distribution of clicks c given views v and CTR θ is binomial with mean vθ, and the prior for θ is Beta(α, γ) with mean α/γ and variance α(γ−α)/(γ²(1+γ)). The posterior of θ is Beta(α + c, γ + v). For the Gamma-Poisson model, the prior of θ is Gamma(α, γ) with mean α/γ and variance α/γ², while the observation distribution of c given v and θ is Poisson with mean vθ; the posterior of θ in this case is Gamma(α + c, γ + v). We work with the Gamma-Poisson model throughout the paper; we validate this modeling assumption on our data in the Appendix.

Dynamic Gamma-Poisson (DGP) model: We now describe our dynamic Bayesian model, which is based on a Gamma-Poisson assumption and fitted through the "discounting" concept pioneered by [11]. More specifically, consider obtaining the posterior of the CTR θ_t at time t. Assume the posterior of θ_{t−1} (the CTR at time t−1) is Gamma(α_{t−1}, γ_{t−1}) with mean μ_{t−1} and variance σ²_{t−1}. Note that this posterior captures all information relevant to the CTR based on observed data up to time t−1. The conditional distribution of c_t for known θ_t is Poisson with mean v_t θ_t. Now, to compute the posterior of θ_t, we need to decide on a prior for θ_t; it is natural to use the posterior of θ_{t−1} as the prior. However, such an approach gives equal weight to all previous data points and adapts slowly to CTR changes over time. The discounting concept solves this problem by using a dilated version of the θ_{t−1} posterior as the prior (dilation increases the variance; the mean is unchanged), i.e., the prior for θ_t is Gamma(δα_{t−1}, δγ_{t−1}) with variance σ²_{t−1}/δ, 0 < δ ≤ 1. The posterior of θ_t is then Gamma(α_t = δα_{t−1} + c_t, γ_t = δγ_{t−1} + v_t), where δ determines the rate of adaptation of the tracker; it is estimated using a tuning dataset. Note that small values of δ adapt faster and may lead to high variance; large values of δ adapt slowly and may lead to high bias in predictions. In our application, since the TOD pattern for US traffic is not too sharp, values of δ in the range [0.95, 1) worked well.

This model is simple to implement. Starting with α_0 and γ_0 as our initial prior parameters (which can be estimated from historical data analysis), for each interval t, the update formula for (α_t, γ_t) is given by

α_t = δα_{t−1} + c_t    (3)
γ_t = δγ_{t−1} + v_t

where v_t = c_t = 0 if the article is not shown in interval t. The predicted CTR for the next interval has mean α_t/γ_t and variance α_t/γ_t²; the posterior mean α_t/γ_t is the predicted CTR for the next interval t+1. A closer look at Equation (3) shows that α_t = Σ_{k=0}^{t−1} c_{t−k} δ^k + δ^t α_0 and γ_t = Σ_{k=0}^{t−1} v_{t−k} δ^k + δ^t γ_0; thus the posterior mean is a ratio of exponentially weighted sums (clicks over views).
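The update is only a few lines of code. A minimal sketch in Python follows (class and variable names are ours; δ and the initial (α_0, γ_0) would be tuned on historical data as described above):

```python
class DGPTracker:
    """Dynamic Gamma-Poisson CTR tracker (Equation 3).

    The CTR posterior at time t is Gamma(alpha_t, gamma_t), with
    mean alpha_t / gamma_t and variance alpha_t / gamma_t**2.
    """

    def __init__(self, alpha0=1.0, gamma0=100.0, delta=0.97):
        self.alpha = alpha0   # prior "pseudo-clicks"
        self.gamma = gamma0   # prior "pseudo-views"
        self.delta = delta    # discount factor, 0 < delta <= 1

    def update(self, clicks, views):
        # Dilate the previous posterior (increases variance, keeps
        # the mean), then absorb the new observations.
        self.alpha = self.delta * self.alpha + clicks
        self.gamma = self.delta * self.gamma + views

    def predict(self):
        mean = self.alpha / self.gamma
        var = self.alpha / self.gamma ** 2
        return mean, var
```

In use, `update(c_t, v_t)` is called once per 5-minute interval and `predict()` gives the CTR forecast for interval t+1.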

Alternative models: Our DGP model is different from a commonly used model that tracks CTR through an exponentially weighted sum of per-interval click-to-view ratios (c_t/v_t); DGP instead smooths the clicks and views separately and provides a more robust model, especially for imbalanced data. We note that the above update rule can also be applied to the Beta-Binomial model, but the variance dilation would be approximate, not exact. In prior work [1], we applied a Gaussian Kalman filter with a logit transformation to dynamic CTR tracking; that model is sensitive to noisy or sparse data. Kalman filters can also be defined based on more appropriate distributions, e.g., Poisson observations with Gamma states. However, the update formula is not closed-form, and our update rule can be thought of as an approximation to it.

3.2 Spatial Model

Our exploratory analysis in Section 2 clearly shows strong article-specific spatial (more specifically, positional in this case) correlation that evolves over time. With small sample sizes, the estimated article CTR at the target location may improve substantially by combining information from other correlated locations. We propose a dynamic linear regression (DLR) model to accomplish this.

Dynamic Linear Regression (DLR): Let θ_{xt} denote the CTR of an article at location x in time interval t, for x = 1, ..., K. Without loss of generality, assume our goal is to estimate the CTR θ_{1t} at the target location by using data from all correlated locations. Let D_{xt} denote the aggregated clicks (c_{x1}, ..., c_{x,t−1}) and views (v_{x1}, ..., v_{x,t−1}) for the article at x observed before time t. Our goal is to estimate the posterior distribution of θ_{1t} by combining information from D_{1t}, ..., D_{Kt}.

The basic idea of the DLR model is to use a base model (e.g., the Gamma-Poisson model) to separately track the article CTR θ_{xt} for each location x, and to obtain additional information for the target from location x through a dynamic linear regression (fitted using a Kalman filter) of the target θ_{1t} on the location θ_{xt}. Finally, information from all locations is combined to obtain a more informed posterior for the target.

Motivated by our exploratory analysis shown in Figure 8(a), we model the relationship between θ_{1t} and θ_{xt} by a linear regression. Specifically, we assume the following linear model:

θ_{1t} = α_{xt} + β_{xt} θ_{xt} + ε_{xt},  ε_{xt} ~ N(0, s_x²)    (4)

where α_{xt} and β_{xt} are time-varying parameters of the linear model, and ε_{xt} is the error term with variance s_x².

To obtain information on θ_{1t} from location x, we first use a base model to track the distribution of (θ_{xt} | D_{xt}) for each location x separately. Each base model outputs the CTR mean θ̂_{xt} and variance s²_{xt}. Then, for each location x, we apply the above linear model to predict θ_{1t} using θ_{xt}; this gives the distribution of (θ_{1t} | D_{xt}). It can be shown that (θ_{1t} | D_{xt}) has the following mean and variance:

μ_{xt} = E[θ_{1t} | D_{xt}] = α̂_{xt} + β̂_{xt} θ̂_{xt}
σ²_{xt} = V[θ_{1t} | D_{xt}] = V[α_{xt}] + V[β_{xt}] s²_{xt} + β̂²_{xt} s²_{xt} + θ̂²_{xt} V[β_{xt}] + 2 C[α_{xt}, β_{xt}] θ̂_{xt} + s_x²

Finally, we combine the individual (θ_{1t} | D_{xt}) through the following standard proposition [6].

Proposition 1. Assume θ_{1t} has an uninformative prior. If (θ_{1t} | D_{xt}) ~ N(μ_{xt}, σ²_{xt}) for x = 1, ..., K, and D_{1t}, ..., D_{Kt} are mutually independent given θ_{1t}, then (θ_{1t} | D_{1t}, ..., D_{Kt}) is normally distributed with

E[θ_{1t} | D_{1t}, ..., D_{Kt}] = ( Σ_x μ_{xt}/σ²_{xt} ) / ( Σ_x 1/σ²_{xt} )
V[θ_{1t} | D_{1t}, ..., D_{Kt}] = 1 / ( Σ_x 1/σ²_{xt} )
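Proposition 1 amounts to precision weighting of the per-location estimates. A minimal sketch (the function name is ours; inputs are the per-location means μ_{xt} and variances σ²_{xt}):

```python
import numpy as np

def combine_locations(means, variances):
    """Precision-weighted combination of per-location estimates of
    the target CTR (Proposition 1).

    means[x], variances[x]: mean and variance of (theta_1t | D_xt).
    Returns the combined posterior mean and variance.
    """
    means = np.asarray(means, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    combined_var = 1.0 / precisions.sum()
    combined_mean = combined_var * (precisions * means).sum()
    return combined_mean, combined_var

# Example: F1's own estimate plus estimates translated from F2 and F3.
mean, var = combine_locations([0.042, 0.047, 0.039], [1e-4, 4e-4, 9e-4])
```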

The revised mean and variance formulae are used to obtain an updated CTR posterior for the base model through the method of moments. We now discuss estimation of the time-varying regression parameters in Equation (4) through a Kalman filter that uses the output of the base model. Specifically, for each location x, we use the following state-space model:

θ̂_{1t} = α_{xt} + θ̂_{xt} β_{xt} + ε_{xt},  ε_{xt} ~ N(0, s_x² s²_{1t} + β²_{xt} s²_{xt})
(α_{xt}, β_{xt})' = (α_{x,t−1}, β_{x,t−1})' + ε,  ε ~ N(0, Σ)

where s_x² is the model error variance, and s²_{1t} and s²_{xt} are the variances of (θ_{1t} | D_{1t}) and (θ_{xt} | D_{xt}) output by the base models. The variance formula for ε_{xt} above can be derived using the formula of iterated variance: V(θ̂_{1t}) = E_{θ_{xt}}[V(θ̂_{1t} | θ_{xt})] + V_{θ_{xt}}[E(θ̂_{1t} | θ_{xt})]. Note that we assume V(θ̂_{1t} | θ_{xt}) = s_x² s²_{1t}; this incorporates the uncertainty involved in θ_{1t}. Here, α_{x,t−1} and β_{x,t−1} are the regression parameters from the previous interval, and Σ is the variance-covariance matrix controlling how the regression relationship evolves over time. Note that s_x² and Σ are article-independent constants that are estimated offline using historical data.

We apply the standard Kalman filter update rule [11] to track α_{xt} and β_{xt} over time. The Kalman filter outputs α̂_{xt}, β̂_{xt}, V[α_{xt}], V[β_{xt}], and C[α_{xt}, β_{xt}], which are used to compute μ_{xt} and σ²_{xt} in Proposition 1.
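For illustration, here is a sketch of one filtering step for a single location x (names are ours; s_x² and Σ are treated as known constants estimated offline, and the observation variance is evaluated at the predicted β, one reasonable reading of the recursion above):

```python
import numpy as np

def dlr_kalman_step(m, P, theta1_hat, s1_sq, thetax_hat, sx_sq,
                    s_model_sq, Sigma):
    """One Kalman filter step for the DLR state (alpha_xt, beta_xt).

    State:       (alpha, beta) follows a random walk with covariance Sigma.
    Observation: theta1_hat = alpha + thetax_hat * beta + noise, with
                 noise variance s_model_sq * s1_sq + beta**2 * sx_sq.
    m, P:        posterior mean (2-vector) and covariance (2x2) from t-1.
    Returns the updated mean and covariance at time t.
    """
    # Predict: random-walk state evolution.
    m_pred = m
    P_pred = P + Sigma

    # Observation design vector h = (1, thetax_hat).
    h = np.array([1.0, thetax_hat])
    obs_var = s_model_sq * s1_sq + (m_pred[1] ** 2) * sx_sq

    # Update.
    S = h @ P_pred @ h + obs_var          # innovation variance
    K = (P_pred @ h) / S                  # Kalman gain
    resid = theta1_hat - h @ m_pred
    m_new = m_pred + K * resid
    P_new = P_pred - np.outer(K, h @ P_pred)
    return m_new, P_new
```

The outputs m_new = (α̂_{xt}, β̂_{xt}) and P_new (whose entries give V[α_{xt}], V[β_{xt}], and C[α_{xt}, β_{xt}]) feed directly into the formulas for μ_{xt} and σ²_{xt} above.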

Alternate models: Our spatial modeling approach is simple and scalable, and it is motivated by methods used in meta-analysis [7] that improve the estimate of a quantity by combining information available from different independent sources. An alternate approach would be a multivariate Kalman filter that assumes the CTRs at different locations are correlated and evolve over time. One such model assumes that clicks are generated according to a Poisson distribution and that the state (a vector of the true CTRs of L locations) evolves over time based on an L × L covariance matrix, which is difficult to estimate when L is large.

3.3 User Fatigue Model

As shown in Section 2, article CTR decays with repeat exposure. Developing user-specific repeat-exposure features that can appropriately down-weight article CTR per user is of paramount importance for good performance. In fact, Figure 6(c) clearly shows an exponential drop-off in overall CTR as the amount of repeat exposure increases. To model user fatigue, it is important to incorporate per-user exposure features while estimating overall CTR, especially in EMP. We present a novel dynamic model to incorporate fatigue that provides good performance in our application.

Let θ_{Rt} denote the article CTR at time t corresponding to repeat-exposure feature vector R. We assume the following factorization: θ_{Rt} = θ_{0t} exp{R'b_t}, where θ_{0t} is the first-view CTR at time t (the probability of a click on the first article exposure). Comparing this with Equation (2), we see this is a special case of our general spatio-temporal approach with g(R_u) = R'b_t. If c_{Rt} and v_{Rt} denote the clicks and views at time t corresponding to feature vector R, then the distribution of our observations conditional on the state at time t is given by

c_{Rt} | θ_{0t}, b_t ~ Poisson(v_{Rt} θ_{0t} exp{R'b_t})    (5)

Equation (5) provides the likelihood of the state vector (θ_{0t}, b_t) at time t. We assume θ_{0t} is known and equals the posterior mean output by our spatial model, or by the dynamic Gamma-Poisson model at a single position. We now focus on updating the posterior of b_t. The prior for b_t is assumed to be Gaussian with mean equal to the posterior mean b̂_{t−1} of b_{t−1}, and variance obtained by dilating the posterior variance A_{t−1} of b_{t−1} to A_{t−1}/δ. Combining the likelihood in Equation (5) with the prior provides the posterior of b_t by Bayes' theorem. Since the posterior is not available in closed form, we apply a Laplace approximation to obtain a Gaussian posterior: the posterior is approximately Gaussian with mean given by the mode b̂_t of the log-posterior and variance A_t given by the inverse of the negative Hessian computed at the mode.
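A minimal sketch of this update, assuming the interval's data have been aggregated by distinct feature vector R (function and variable names are illustrative):

```python
import numpy as np

def update_fatigue_state(b_prev, A_prev, delta, theta0,
                         R, clicks, views, n_iter=10):
    """Laplace-approximate posterior update for the fatigue
    coefficients b_t in Equation (5).

    b_prev, A_prev: posterior mean/covariance of b_{t-1}.
    delta:          variance-dilation factor (prior var = A_prev/delta).
    theta0:         first-view CTR estimate (treated as known).
    R:              (n, d) matrix of repeat-exposure feature vectors.
    clicks, views:  length-n click/view counts per feature vector.
    """
    prior_prec = delta * np.linalg.inv(A_prev)   # precision of N(b_prev, A_prev/delta)
    b = b_prev.copy()
    for _ in range(n_iter):                      # Newton ascent to the mode
        rate = views * theta0 * np.exp(R @ b)    # Poisson means per row
        grad = R.T @ (clicks - rate) - prior_prec @ (b - b_prev)
        hess = -R.T @ (rate[:, None] * R) - prior_prec
        b = b - np.linalg.solve(hess, grad)
    A_new = np.linalg.inv(-hess)                 # Gaussian posterior covariance
    return b, A_new
```

The returned pair (b, A_new) plays the role of (b̂_t, A_t) and becomes the prior for the next interval after dilation.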

4. EXPERIMENTAL RESULTS

In this section, we present results from experiments with both real and simulated data. First, we show that the dynamic Gamma-Poisson model that tracks article CTR at a single position significantly outperforms several alternative models. We then show that our spatial DLR model significantly improves CTR tracking when the target position suffers from data sparsity. In fact, we find that methods using article-independent translation ratios to estimate positional effects may hurt performance in our application. Finally, we report an analysis of our fatigue model.

4.1 Replay Experiments

We first report results on "replay" experiments based on log data collected from the Today Module. We use US traffic data from the RND bucket, aggregated at five-minute intervals and collected over a month, which contains about 180 million user visits. Parameters of all models are estimated using a training set (the first 15 days) and results are reported on the test set (the last 15 days). We subsampled our test set to simulate performance for different traffic volumes.

Replay methodology: To evaluate the performance of an online model, we "replay" articles retrospectively using the model-predicted F1 CTR for the next time interval, and aggregate the total number of clicks received by the highest-ranked articles over the 15-day test period. We only report performance at the F1 position, since it accounts for the majority of click traffic and is hence a good approximation to system performance (results for other positions were qualitatively similar). The target position here is the F1 position.
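A minimal sketch of the replay loop, under an assumed data layout (one dict per interval mapping article id to observed F1 clicks and views) and a hypothetical tracker interface wrapping per-article models:

```python
def replay_clicks(model, intervals):
    """Replay evaluation sketch: at each interval, rank live articles
    by the model's predicted F1 CTR for the next interval and credit
    the observed clicks of the top-ranked article.

    `intervals` is a sequence of dicts mapping article id to its
    observed (clicks, views) at F1 in that interval; `model` exposes
    hypothetical predict(article) and update(article, clicks, views).
    """
    total_clicks = 0
    for data in intervals:
        preds = {a: model.predict(a) for a in data}   # CTR forecasts
        best = max(preds, key=preds.get)              # article to "serve"
        clicks, views = data[best]
        total_clicks += clicks
        for a, (c, v) in data.items():                # update all trackers
            model.update(a, c, v)
    return total_clicks
```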


[Figure 9: Model performance in the replay experiment and in the simulation based on the UCB1 serving scheme. Panel (a): replay (percentage lift in F1 CTR versus average views per interval); panel (b): UCB1 simulation. Models compared: DLR, DGP-F1, GKF-Multi, Cumu-F1, DGP-Ratio, DGP-Naive, GKF-F1, EWMA.]

Models: We report results for a range of models. DLR is the dynamic linear regression model described in Section 3.2 that tracks F1 CTR using data from all positions. Several variations of the dynamic Gamma-Poisson model (DGP) are tested: DGP-F1 uses data only from F1; DGP-Naive assumes all positions are exchangeable and aggregates clicks and views from all positions; DGP-Ratio aggregates across positions but down-weights footer views by a constant ratio (the overall F1-to-footer CTR ratio). We also fitted two Gaussian Kalman filter (GKF) models to the empirical CTR and to a logistic transformation of the empirical CTR: GKF-F1 is a univariate Gaussian model that only uses the F1 data, while GKF-Multi is a standard multivariate Gaussian Kalman filter model. We also consider two other baseline models. EWMA-F1 tracks the F1 CTR θ_{1t} for an article using an exponentially weighted moving average of empirical CTR ratios: θ_{1t} = w θ_{1,t−1} + (1 − w)(c_t/v_t), where c_t and v_t are the observed F1 clicks and views at t, and w is tuned using the training data. Cumu-F1 (cumulative F1 CTR) simply divides the total number of F1 clicks by the total number of F1 views observed up to a given time point; this is a special case of DGP obtained by setting the discounting factor to one.

Experimental results: Figure 9(a) shows the results. For CTR tracking using data from a single position, DGP-F1 significantly outperforms the alternatives GKF-F1 and EWMA-F1, which use the instantaneous CTR (c_t/v_t) in model updates. For CTR tracking using multi-positional data, DLR performs significantly better than the alternatives GKF-Multi, DGP-Ratio, and DGP-Naive. To our surprise, the single-positional DGP-F1 is hard to beat; only DLR provides a slight improvement over it. The other models, which are based on a fixed positional-correlation assumption, perform worse than DGP-F1 because the positional correlations are not constant but vary across time and articles. In fact, we see that using data from all positions may be worse than using data from a single position if the model does not track the non-stationary correlations properly.

Why is DGP-F1 good? In the RND data used by our replay experiments, each article receives roughly the same volume of traffic at each position. Since the footers receive far fewer clicks than F1, the data from the footers are noisier than the data from F1. It turns out that, given the less noisy F1 traffic for every article at all times, it is difficult to squeeze additional information out of the weaker F2-F4 signal, even if the model that predicts F1 CTR using correlation with F2-F4 is accurate. In fact, when the traffic volume of the target position is sufficiently large at all times, there is no need to use correlation among multiple positions.

However, in many scenarios it is not feasible to have a constant stream of completely randomized data for each article at all times. When the pool of live articles is large, we may only be able to explore a small fraction of the articles at a time. For some web sites, human editors may want full control over some important positions for some periods of time. Also, to fully optimize the CTR of a system, complete randomization is not an optimal exploration scheme. Thus, we conduct a series of simulation experiments to understand how multi-positional models perform under different traffic patterns.

4.2 Simulation Study

In this section, we investigate the behavior of DLR under different traffic patterns. We apply the loess method described in the Appendix to smooth the CTR time series of each article at each position using the RND data, and treat each smoothed CTR curve as the true CTR θ_{ixt} of article i at position x in interval t. A simulation run is similar to a replay run; however, in each interval, instead of using the observed clicks and views, we control the number of views v_{ixt} to be generated and simulate the number of clicks according to Poisson(θ_{ixt} v_{ixt}). Since we know the true CTRs, we can measure the predictive error exactly. Also, because only DLR is competitive with the single-positional DGP-F1, we only report results for these two models.

Constant F1 traffic: We first simulate a scenario in which the F2-F4 traffic is constant (100 views per article, per position, per interval) and the F1 traffic is also constant but may be relatively small. We vary the ratio between the F1 traffic volume and the F2-F4 traffic volume, and show the mean absolute errors of DLR and DGP-F1 in Figure 10(a). As seen in the figure, DLR outperforms DGP-F1 when the F1 traffic is less than 10% of the F2-F4 traffic; above that, there is almost no difference in performance.

Periodic F1 traffic: Next, we simulate scenarios in which the F2-F4 traffic is still constant (with the same volume as before), but the F1 traffic is observed only periodically. We vary the fraction of time intervals in which F1 data is observed, and show the mean absolute errors of DLR and DGP-F1 in Figure 10(b). As seen in the figure, DLR outperforms DGP-F1 when the F1-observed fraction is less than 30%.

Simulation with the UCB1 serving scheme: Finally, we try to understand the performance of DLR in a system that seeks to maximize CTR by allocating different amounts of exploration traffic to articles (instead of using a small fixed-size random bucket with uniform randomization). We implemented the UCB1 scheme [2] and use it to control the amount of F1 exploration traffic given to each article, while keeping the F2-F4 traffic the same as in the RND data.
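For reference, UCB1 [2] scores each article by its empirical CTR plus an exploration bonus; a minimal sketch (names are ours, and this simplification ignores the positional structure of the module):

```python
import math

def ucb1_select(clicks, views, t):
    """UCB1 article selection: pick the index maximizing empirical
    CTR plus the exploration bonus sqrt(2 ln t / n_i).

    clicks, views: per-article totals so far (views[i] > 0 assumed;
    in practice each article is shown once first). t = total views.
    """
    def score(i):
        return clicks[i] / views[i] + math.sqrt(2.0 * math.log(t) / views[i])
    return max(range(len(views)), key=score)
```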


[Figure 10: Behavior of DLR under different traffic patterns (x-axis log-scaled). Panel (a): constant F1 traffic (mean absolute error versus F1-to-others traffic ratio); panel (b): periodic F1 traffic (mean absolute error versus fraction of time with F1 data). Curves: DGP-F1 and DLR.]

Figure 9(b) shows the CTR lifts of DLR and DGP-F1 over a model that predicts article CTR randomly, for different traffic volumes. As seen, DLR provides better lifts, especially when the traffic volume is small. This result is encouraging, but future work is needed to understand the interaction between DLR and an exploration scheme.

4.3 User Fatigue Analysis

In this section, we provide results showing the effectiveness of the user fatigue model discussed in Section 3.3. In particular, we report predictive accuracy measured in terms of mean absolute error (MAE) and predictive log-likelihood for models with and without previous-user-exposure features.

After preliminary exploratory analysis, we converged on the repeat-exposure feature vector R = (R1, R2), where R1 is the number of previous F1 article views and R2 is the number of previous story views through footer clicks on the Featured Tab, clicks on other tabs, or other-tab clicks followed by a footer click. Other features (the number of previous footer exposures and the time since first exposure of various types) were noisy and did not look promising. We tested the following models: m0, a null model that assumes no repeat-exposure features; m1, which assumes R = R1; m2, which assumes R = (R1, R2); and m3, which assumes R = (R1, R2, R1R2). m3 did not converge on a number of articles, on which it provided poor performance; hence we only report results for m0, m1, and m2. Define M1 = MAE(m1)/MAE(m0) and M2 = MAE(m2)/MAE(m0). Figure 11 plots the MAE ratio M2/M1 (representing the error reduction of m2 over m1) against M1 (representing the error reduction of m1 over m0). Both m1 and m2 are significantly better than the null model m0 on all articles, and m2 is marginally better than m1 for a large number of articles. Results for predictive log-likelihood showed similar patterns.

5. DISCUSSION

We have discussed dynamic models for estimating article CTR that incorporate information from correlated locations and the decay caused by repeated exposure. Although we demonstrated the utility of the models using a content recommendation application based on overall popularity, the models can also be used to perform personalized recommendation.

[Figure 11: Mean absolute error (MAE) of user fatigue models: M2/M1 versus M1.]

One approach to personalization is to create user segments based on analysis of a large amount of historical data (using clustering or other techniques), track article CTR for each segment separately, and serve the segment-specific most popular articles. This is a reasonable solution: as our experiments in Section 4 showed, our spatial modeling performs well even with extremely sparse data. Another approach is to build personalization models (e.g., regression models) based on both user features and the dynamic article CTR estimates obtained from our models, thereby combining relevance and popularity. In our preliminary study, these dynamic CTR features were the most important features in such models.

Our models are motivated by rigorous exploratory analysis, and all modeling assumptions have been verified. Experimental results confirm the good performance of our models.

6. REFERENCES

[1] D. Agarwal, B.-C. Chen, P. Elango, et al. Online models for content optimization. In NIPS, 2008.

[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.

[3] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 1995.

[4] A. C. Cameron and P. K. Trivedi. Regression Analysis of Count Data. Cambridge University Press, 1998.

[5] J. M. Chambers. Software for Data Analysis: Programming with R. Springer, 2008.

[6] C. R. Rao. Linear Statistical Inference and Its Applications. Wiley, 2002.

[7] D. K. Stangl and D. A. Berry. Meta-analysis in Medicine and Health Policy. CRC Press, 2000.

[8] D. Lambert and C. Liu. Adaptive thresholds: Monitoring streams of network counts. Journal of the American Statistical Association, 2006.

[9] C. Kaufman, V. Ventura, and R. Kass. Spline-based non-parametric regression for periodic functions and its application to directional tuning of neurons. Statistics in Medicine, 2005.

[10] Q. Mei and K. W. Church. Entropy of search logs: how hard is search? with personalization? with backoff? In WSDM, 2008.

[11] M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer-Verlag, 1997.

[12] F. Radlinski and T. Joachims. Active exploration for learning rankings from clickthrough data. In KDD, 2007.

[13] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In WWW, 2007.

[14] S. P. Ellner and Y. Seifu. Using spatial statistics to select model complexity. Journal of Computational and Graphical Statistics, 2002.

[15] F. Wu and B. A. Huberman. Novelty and collective attention. In Proceedings of the National Academy of Sciences, 2007.

[16] F. Wu and B. A. Huberman. Popularity, novelty and attention. In EC, ACM, 2008.

APPENDIX

Effect of concurrent articles: The ranking of articles at different slots depends on the correlation among the CTRs of articles that are shown concurrently to users. If the CTR of an article is influenced by other concurrent articles in the content pool, it is important to adjust for such effects when ranking articles. In general, it is hard to conduct such correlation studies from serving-bucket data; the availability of random-bucket data in our scenario makes this study feasible.

To measure such correlations, we compare the observed number of clicks c_{ij} for article i displayed at F1 when shown concurrently with article j at Fx (x = 2, 3, 4) against a baseline E_{ij}, the expected number of clicks under the null hypothesis of no correlation. Large deviations of the c_{ij}'s from the corresponding E_{ij}'s would be indicative of correlation.

Computing the baseline frequency E_{ij}: To avoid data sparsity, we only considered pairs that were shown together at least 1000 times during their lifetimes. To adjust for temporal variations, we partitioned the data into several 2-hour windows. If ĈTR_{i,h} denotes the empirical CTR of article i in time window h and N_{ij,h} is the number of joint occurrences of article i at F1 with article j at Fx in time window h, then E_{ij} = Σ_h N_{ij,h} ĈTR_{i,h}. We then compute p-values (P_{ij,low}, P_{ij,high}) that measure the extremeness of c_{ij} under a Poisson model with mean E_{ij}. In particular, P_{ij,low} = Pr(X ≤ c_{ij}) and P_{ij,high} = Pr(X > c_{ij}), where X follows Poisson(E_{ij}). Low values of P_{ij,low} (P_{ij,high}) are indicative of strong negative (positive) correlations. Figure 12 shows the quantile-quantile plot of the normal scores of the p-values against a standard normal distribution. The close agreement indicates no significant correlations in our data. We also conducted formal hypothesis testing by adjusting the p-values for multiple testing using the FDR procedure [3]. The test flagged only 3 pairs as significant (out of a total of about 26K) at a 5% error rate. This further shows there is no need to incorporate the effect of other articles when selecting the set of top four; perhaps the editors already ensure a diverse pool of articles through their programming.
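A minimal sketch of the p-value computation for one pair (the function name is ours):

```python
from scipy.stats import poisson

def pair_pvalues(c_ij, e_ij):
    """Two one-sided Poisson p-values for an (F1, Fx) article pair:
    P_low = Pr(X <= c_ij), P_high = Pr(X > c_ij), X ~ Poisson(E_ij).
    Small P_low suggests negative correlation; small P_high, positive.
    """
    p_low = poisson.cdf(c_ij, e_ij)
    p_high = poisson.sf(c_ij, e_ij)   # survival function: Pr(X > c_ij)
    return p_low, p_high
```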

Validating the Gamma-Poisson assumption: To validate the Gamma-Poisson assumption empirically, we consider the click and view time series of all articles appearing in a month in RND. Our method is motivated by techniques proposed in [8]. Recall the Gamma-Poisson model: if the CTR (state) θ_t at each time interval is known, then clicks c_t ~ Poisson(v_t θ_t) and the c_t's are independent of each other.

[Figure 12: Q-Q plot of normal scores for P_{ij,high} against a standard normal distribution. Similar patterns were observed for the normal scores of P_{ij,low}.]

[Figure 13: Validating the Gamma-Poisson assumption. Panel (a): Poisson Q-Q plot (observed versus expected clicks); panel (b): Gamma Q-Q plot (true versus expected CTR).]

Of course, the states are unknown and estimated from data, which introduces uncertainty modeled through a Gamma distribution. We first fit a non-parametric curve to the time series using loess (local polynomial regression), available in the software R [5]; the neighborhood size (span) is selected such that the autocorrelation in the residuals is close to zero, as suggested in [8]. This ensures that the assumption of independence of clicks in different intervals is approximately true. The autocorrelation measure we use is based on the Moran I statistic, as recommended in [14].

Assuming the estimated loess curve θ̃_t for an article at time t is the truth, we draw observations from Poisson(θ̃_t v_t). Figure 13(a) shows a quantile-quantile plot of observed clicks versus clicks simulated from the fitted Poisson distribution. The close agreement among the quantiles clearly shows the Poisson assumption is reasonable.

To validate the Gamma assumption, we assume the CTR of an article in a given hour changes very slowly and can be treated as almost constant. We fit a Gamma distribution to each set of the 12 loess-fitted article CTRs in a given hour by matching moments, and draw 12 samples from the fitted Gamma distribution. The quantile-quantile plot of the estimated loess CTRs against the Gamma samples, shown in Figure 13(b), shows the Gamma assumption is reasonable for our data. (Similar patterns were observed over a large number of simulations.) This one-time offline analysis clearly shows the Gamma-Poisson assumption is appropriate for tracking article CTR in our data. A similar analysis showed that the Beta-Binomial model is also reasonable, but slightly worse than Gamma-Poisson.
