Ensemble Contextual Bandits for Personalized Recommendation
Liang Tang, Yexi Jiang, Lei Li, Tao Li
Florida International University
10/7/14 ACM RecSys 2014 1
Cold Start Problem for Learning-based Recommendation
• Issue: not enough appropriate data.
 – Historical user log data is biased.
 – User interest may change over time.
 – New items (or users) are added.
• Approach: Exploitation and Exploration
 – Contextual Multi-Arm Bandit Algorithm
The contextual information is item features and user features.
Contextual Bandit Algorithm with Personalized Recommendation
• Contextual Bandit
 – Let a1, …, am be a set of arms.
 – Given a context xt, the model decides which arm to pull.
 – After each pull, you receive a random reward, determined by the pulled arm and xt.
 – Goal: maximize the total received reward.
• Online Recommendation
 – Arm → Item, Pull → Recommend
 – Context → User features
 – Reward → Click
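The arm/item mapping above can be sketched as a minimal interaction loop. This is an illustrative epsilon-greedy bandit, not the paper's algorithm; the `select`/`update` interface and the simulated click probabilities are assumptions, and the context is accepted but ignored for simplicity:

```python
import random

class EpsilonGreedyBandit:
    """Minimal bandit: epsilon-greedy over per-arm mean rewards (CTRs).
    Illustrative only; the context x_t is accepted but ignored here."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # pulls per arm (recommendations per item)
        self.values = [0.0] * n_arms  # running mean reward (CTR) per arm

    def select(self, x_t):
        # Explore with probability epsilon, otherwise exploit the best arm.
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        # Incremental mean update for the pulled arm only.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Simulated loop: 3 items with (assumed) true CTRs 0.1, 0.5, 0.2.
bandit = EpsilonGreedyBandit(n_arms=3)
for t in range(1000):
    x_t = None                    # user features would go here
    arm = bandit.select(x_t)      # "pull" = recommend item `arm`
    reward = 1 if random.random() < [0.1, 0.5, 0.2][arm] else 0  # click or not
    bandit.update(arm, reward)
```

The cold-start tension is visible here: without exploration the bandit can lock onto a mediocre item, and without exploitation it wastes traffic on bad ones.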
Problem Statement
• Problem Setting: we have many different recommendation models (or policies):
 – Different CTR prediction algorithms.
 – Different exploration-exploitation algorithms.
 – Different parameter choices.
• No data for model validation.
• Problem Statement: how to build an ensemble model that is close to the best model in the cold-start situation?
How to Ensemble?
• Classifier ensemble methods do not work in this setting:
 – The recommendation decision is NOT purely based on the predicted CTR.
• Each individual model only tells us:
 – Which item to recommend.
Ensemble Method
• Our Method:
 – Allocate recommendation chances to individual models.
• Problem:
 – Better models should get more chances.
 – We do not know in advance which models are good or bad.
 – Ideal solution: allocate all chances to the best one.
Current Practice: Online Evaluation (or A/B testing)
Let π1, π2, …, πm be the individual models.
1. Deploy π1, π2, …, πm into the online system at the same time.
2. Dispatch a small percentage of user traffic to each model.
3. After a period, choose the model with the best CTR as the production model.
If we have too many models, this will hurt the performance of the online system.
Our Idea 1 (HyperTS)
• The CTR of model πi is an unknown random variable, Ri.
• Goal:
 – Maximize (1/N) Σ_{t=1}^{N} r_t, the CTR of our ensemble model, where rt is a random reward drawn from Rs(t), s(t) = 1, 2, …, or m. For each t = 1, …, N, we decide s(t).
• Solution:
 – Bernoulli Thompson Sampling (flat prior: Beta(1, 1)).
 – π1, π2, …, πm are the bandit arms.
 – No tricky parameters.
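The HyperTS layer described above can be sketched in a few lines: Bernoulli Thompson sampling over the m individual policies, each policy's CTR Ri tracked by a Beta posterior starting from the flat Beta(1, 1) prior. The class name and interface are illustrative, not the paper's code:

```python
import random

class HyperTS:
    """Bernoulli Thompson sampling over m individual policies.
    Policy i's CTR R_i gets a Beta(alpha_i, beta_i) posterior,
    starting from the flat prior Beta(1, 1)."""

    def __init__(self, n_policies):
        self.alpha = [1.0] * n_policies  # 1 + observed clicks
        self.beta = [1.0] * n_policies   # 1 + observed non-clicks

    def select(self):
        # Sample a plausible CTR for every policy, pick the argmax.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, k, clicked):
        # Only the selected policy pi_k is credited with the feedback r_t.
        if clicked:
            self.alpha[k] += 1
        else:
            self.beta[k] += 1

# One user visit: pick a policy, serve its recommendation, observe a click.
hyper = HyperTS(n_policies=3)
k = hyper.select()
hyper.update(k, clicked=True)
```

Because the flat prior has no hyperparameters to tune, this matches the "no tricky parameters" point on the slide: all adaptation comes from the observed clicks.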
An Example of HyperTS
• In memory, we keep estimated CTRs R1, R2, …, Rm for π1, π2, …, πm.
• For each user visit (with context features xt):
 1. HyperTS selects a candidate model, πk.
 2. πk recommends item A to the user.
 3. We observe the reward rt (click or not).
 4. HyperTS updates the estimate of Rk based on rt.
[Figure: the estimated CTRs R1, R2, …, Rk, …, Rm kept in memory, with Rk updated after the feedback.]
Two-Layer Decision
[Figure: a two-layer decision. The first layer (Bernoulli Thompson Sampling) selects a model πk from π1, π2, …, πm; the second layer lets the selected πk pick the item to recommend (Item A, Item B, Item C, …).]
Our Idea 2 (HyperTSFB)
• Limitation of the previous idea:
 – For each recommendation, the user feedback is used by only one individual model (e.g., πk).
• Motivation:
 – Can we update all of R1, R2, …, Rm with every user feedback? (Share every user feedback with every individual model.)
Our Idea 2 (HyperTSFB)
• Assume each model can output the probability of recommending any item given xt.
 – E.g., for a deterministic recommendation, it is 1 or 0.
• For a user visit xt:
 1. πk is selected to perform the recommendation (k = 1, 2, …, or m).
 2. Item A is recommended by πk given xt.
 3. Receive the user feedback (click or no click), rt.
 4. Ask every model π1, π2, …, πm: what is the probability of recommending A given xt?
Estimate the CTR of π1, π2, …, πm (Importance Sampling).
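The importance-sampling idea can be sketched as follows: each policy reports p_i(A | xt), the probability it would have recommended the served item A, and the observed reward is weighted by p_i(A | xt) / q(A | xt), where q is the serving probability. This is a generic importance-sampling sketch under those assumptions, not the paper's exact estimator:

```python
def update_all_ctr_estimates(policies_prob, serving_prob, reward, stats):
    """Share one feedback (served item A, reward r_t) with every policy.

    policies_prob[i]: p_i(A | x_t), prob. that policy i recommends A.
    serving_prob:     q(A | x_t), prob. that the ensemble served A.
    stats[i]:         (weighted_reward_sum, weight_sum) per policy.
    """
    for i, p_i in enumerate(policies_prob):
        w = p_i / serving_prob            # importance weight
        r_sum, w_sum = stats[i]
        stats[i] = (r_sum + w * reward, w_sum + w)
    return stats

def estimated_ctr(stats, i):
    # Self-normalized estimate: weighted clicks over total weight.
    r_sum, w_sum = stats[i]
    return r_sum / w_sum if w_sum > 0 else 0.0

# Three policies; the served item A had probability 0.5 under the ensemble,
# and the user clicked. Policy 1 would never have recommended A (p = 0),
# so it is (correctly) not credited with this click.
stats = [(0.0, 0.0)] * 3
stats = update_all_ctr_estimates([1.0, 0.0, 0.5], 0.5, reward=1, stats=stats)
```

Every policy's CTR estimate moves on every feedback, which is exactly the data-sharing advantage of HyperTSFB over HyperTS.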
Experimental Setup
• Experimental Data:
 – Yahoo! Today News data logs (randomly displayed).
 – KDD Cup 2012 online advertising data set.
• Evaluation Methods:
 – Yahoo! Today News: Replay (see Lihong Li et al.'s WSDM 2011 paper).
 – KDD Cup 2012 data: simulation by a logistic regression model.
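The Replay evaluator (Li et al., WSDM 2011) can be sketched in a few lines: scan a uniformly-randomly logged stream, and count an event toward a policy's CTR only when the policy's choice matches the logged item. The log format here is an assumption for illustration:

```python
def replay_ctr(policy, log):
    """Unbiased offline CTR estimate on uniformly-random logged data.

    log: iterable of (context, logged_item, click) events, where
    logged_item was chosen uniformly at random at logging time.
    """
    matched, clicks = 0, 0
    for context, logged_item, click in log:
        if policy(context) == logged_item:   # keep only matching events
            matched += 1
            clicks += click
    return clicks / matched if matched else 0.0

# Toy log evaluated for a policy that always picks item "a":
# it matches 3 of the 4 events, 2 of which were clicks.
log = [(None, "a", 1), (None, "b", 1), (None, "a", 0), (None, "a", 1)]
ctr = replay_ctr(lambda ctx: "a", log)
```

Uniform logging is what makes the matched subset an unbiased sample of what the policy would have served online; that is why the Yahoo! data's "randomly displayed" property matters.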
Comparative Methods
• CTR Prediction Algorithm:
 – Logistic Regression
• Exploitation-Exploration Algorithms:
 – Random, ε-greedy, LinUCB, Softmax, Epoch-greedy, Thompson sampling
• HyperTS and HyperTSFB
Results for Yahoo! News Data
• Every 100,000 impressions are aggregated into one bucket.
Results for Yahoo! News Data (Cont.)
Conclusions from Experimental Results
1. The performance of the baseline exploitation-exploration algorithms is very sensitive to the parameter setting.
 – In the cold-start situation, there is not enough data to tune parameters.
2. HyperTS and HyperTSFB can get close to the optimal baseline algorithm (with no guarantee of beating the optimal one), even when some bad individual models are included.
3. For contextual Thompson sampling, the performance depends on the choice of prior distribution for the logistic regression.
 – For online Bayesian learning, the posterior approximation is not accurate (we cannot store the past data).
Questions & Thank You
• Thank you!
• Questions?