
Human Interaction with Recommendation Systems

Sven Schmit (Stanford University)    Carlos Riquelme (Google Brain)

Abstract

Many recommendation algorithms rely on user data to generate recommendations. However, these recommendations also affect the data obtained from future users. This work aims to understand the effects of this dynamic interaction. We propose a simple model where users with heterogeneous preferences arrive over time. Based on this model, we prove that naive estimators, i.e. those which ignore this feedback loop, are not consistent. We show that consistent estimators are efficient in the presence of myopic agents. Our results are validated using extensive simulations.

1 INTRODUCTION

We find ourselves surrounded by recommendations that help us make better decisions. However, relatively little work has been devoted to understanding the dynamics of such systems caused by the interaction with users. This work aims to understand the dynamics that arise when users combine recommendations with their own preferences when making a decision.

For example, a user of Netflix uses its recommendations to decide what movie to watch. However, this user also has her own beliefs about movies, e.g. based on artwork, synopsis, actors, recommendations by friends, etc. The user thus combines the suggestions from Netflix with her own preferences to decide what movie to watch. Netflix captures data on the outcome to improve its recommendations to future users. Of course, this pattern is not unique to Netflix, but is observed more broadly across all platforms that use recommendations.

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

A first requirement for any estimator is consistency; however, it is not clear that naive estimators are consistent in the presence of human interaction. Indeed, we show that simple estimators can easily be fooled by the selection effect of the users. We propose to measure performance by adapting the notion of regret from the multi-armed bandit literature. Using this metric, we show that naive estimators suffer linear regret: even with 'infinite data' the performance per time step is bounded away from the optimum.

Using the notion of regret is useful as it also allows us to quantify the efficiency of estimators. New users and items with little to no data constantly arrive, and thus a recommendation system is always in a state of learning. It is therefore important that the system learn efficiently from data. While this might sound like the well-known cold-start problem, that is not the focus of this work; rather than providing recommendation solutions for users in the absence of data, we focus on quantifying how quickly an algorithm obtains enough data to make good recommendations. This is more akin to the social learning and incentivizing exploration literature than to work on the cold-start problem.

1.1 Main results

From a technical standpoint, this paper provides a model that captures the dynamics of users with heterogeneous preferences, while abstracting away the specifics of recommendation algorithms. In the first part of this work, we show that there is a severe selection bias problem that leads to linear regret. Second, we show that when the algorithm uses unbiased estimates for items, 'free' exploration occurs and we recover the familiar logarithmic regret bound. This is important because inducing agents to explore is difficult from both a statistical and a strategic point of view. We validate our claims using simulations with feature-based and low-rank methods.

It is important to note that the focus of this work is to provide a simplified framework that allows us to reason about the dynamic aspects of recommendation systems. We do not claim that the model nor the assumptions are a perfect reflection of reality. Instead, we believe that the model we propose provides an excellent lens to better understand vital aspects of recommendation systems.

1.2 Related work

This work roughly intersects with three separate fields of study. Recommendation systems [Adomavicius and Tuzhilin, 2005] have attracted much attention. In particular, much research has focused on new methods that treat the data as fixed, rather than dynamic. There has been less work on selection bias, which was first demonstrated by Marlin [2003] and followed by subsequent work [Amatriain et al., 2009, Marlin et al., 2007, Steck, 2010]. Rather than modeling user behavior directly, they impose the statistical assumption of a covariance shift; the distribution of observed ratings is not altered by conditioning on the selection event, but five-star ratings are more likely to be observed. More recently, Schnabel et al. [2016] and Joachims et al. [2017] link the bias from covariance shifts to recent advances in causal inference. Mackey et al. [2010] combine matrix factorization and latent factor models to capture heterogeneity in interactions and context.

The different approach of this work is reminiscent of the work on social learning [Chamley, 2004, Smith and Sørensen, 2000], where agents learn about the state of the world by combining their private signals with observations of the actions (but not necessarily the outcomes) of others. The work of Ifrach et al. [2014] is most closely related to our setup. They discuss how consumer reviews converge on the quality of a product, given diversity of preferences, under a reasonable price assumption. However, the work in social learning focuses on users interacting with a single item. This seemingly minor difference leads to completely different dynamics.

Finally, we can relate our work on exploration to the multi-armed bandit literature [Bubeck and Cesa-Bianchi, 2012]. In particular, there has been prior work on human interaction with multi-armed bandit algorithms: for example, how a system can optimally induce myopic agents to explore [Kremer et al., 2013] by using payments [Frazier et al., 2014] or by the way the system disseminates information [Papanastasiou et al., 2014, Mansour et al., 2015, 2016]. Similar to those works, we use the regret framework to analyse a system with interacting agents. Because in our model agents have heterogeneous preferences, we show that agents do not need to be incentivized to explore. Recent work by Bastani et al. [2017] and Qiang and Bayati [2016] considers natural exploration in contextual bandit problems, and shows that a modified greedy algorithm performs well. While their motivation is different, the results are similar to ours. There has also been work on 'free exploration' in auction environments [Hummel and McAfee, 2014].

1.3 Organization

In the next section, we introduce our model. In Sections 3 and 4 we focus on the issues of consistency and efficiency, respectively. We illustrate our results with simulations in Section 5 before concluding.

2 MODELING HUMAN-ALGORITHM INTERACTION

In this section, we propose a model for the interaction between the recommendation system (platform) and users (agents). Each user selects one of the items the platform recommends, and reports their experience to the platform by providing a rating as feedback. The platform uses this feedback to update the recommendations for the next user.

More formally, we assume there are $K$ items, labeled $i = 1, \ldots, K$, and each item has a distinct, but unknown, quality $Q_i \in \mathbb{R}$. This aspect models the vertical differentiation between items, and it is the task of the platform to estimate these qualities. For notational convenience, we assume $Q_1 > Q_2 > \ldots > Q_K$. At every time step $t = 1, \ldots, T$ a new user arrives and selects one of the $K$ items. To do so, the user receives a private preference signal $\theta_{it} \sim F_i$ for each item, drawn from a preference distribution $F_i$ which we make precise later. The value of item $i$ for user $t$ is

$V_{it} = Q_i + \theta_{it} + \epsilon_{it} \qquad (1)$

where $\epsilon_{it}$ is additional noise drawn independently from a noise distribution $E$ with mean 0 and finite variance $\sigma_E^2 < \infty$. To aid the agents, the platform provides a recommendation score $s_{it}$, aggregating the feedback from previous agents. The agent uses her own preferences, along with the score, to select item $a_t$ according to

$a_t = \arg\max_i \, s_{it} + \theta_{it}. \qquad (2)$

Hence, we make the assumption that the agent is boundedly rational and uses $s_{it}$ as a surrogate for the quality. Abusing notation, we write

$V_t = V_{a_t t} = Q_{a_t} + \theta_{a_t t} + \epsilon_{a_t t} \qquad (3)$

for the value of the chosen item for agent $t$. After the agent selects item $a_t$ and observes the value $V_t$, the platform queries for feedback $W_t$ from the user. For example, the platform can ask the user to provide the value of the item as a rating, in which case $W_t = V_t$. Note that the private preferences $\theta_t$ of the agent remain hidden. The platform uses this feedback to give recommendations to future users. In particular, we require $s_{it}$ to be measurable with respect to past feedback, that is, with respect to $\sigma\{a_\tau, W_\tau : \tau < t\}$.

We measure the performance of a recommendation system in terms of (pseudo-)regret:

$R_T = \sum_{t=1}^{T} \max_i (Q_i + \theta_{it}) - (Q_{a_t} + \theta_{a_t t}), \qquad (4)$

which sums the difference between the expected value of the best item and the expected value of the selected item.[1] We note that if scores $s_{it} \equiv Q_i$ for all $t$, the regret of such a platform would be 0, as each user selects her optimal action using equation (2).
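For concreteness, one round of this interaction can be sketched as follows (illustrative only; the standard-normal preference and noise distributions and the helper name `step` are our own choices, not part of the model specification):

import numpy as np

def step(Q, scores, rng, pref_std=1.0, noise_std=1.0):
    """Simulate one user: draw private preferences, select an item via (2),
    realize the value from (1)/(3), and return the regret summand from (4)."""
    theta = rng.normal(0.0, pref_std, size=len(Q))        # theta_it, one per item
    a = int(np.argmax(scores + theta))                     # selection rule (2)
    value = Q[a] + theta[a] + rng.normal(0.0, noise_std)   # realized value V_t
    regret = np.max(Q + theta) - (Q[a] + theta[a])         # instantaneous regret term of (4)
    return a, value, regret

rng = np.random.default_rng(0)
Q = np.array([0.9, 0.5, 0.1])                              # unknown qualities, Q_1 > Q_2 > Q_3
a, value, regret = step(Q, scores=np.zeros(3), rng=rng)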

2.1 Preferences, values and personalization

We use this section to expand on the motivation of the proposed model. The value of item $i$ at time $t$ consists of three parts (see equation (1)). First, the intrinsic quality $Q_i$ can be seen as the mean quality across users. In our theoretical analysis we treat this as a constant to be estimated, so that we are able to disentangle the model fitting from the dynamics of interaction. In this simple setting, one should view it as a vertical differentiator between items. Taking hotels as an example, it could model quality of service and cleanliness, where a common ranking across agents is sensible. The intrinsic qualities $Q_i$ can be replaced by more complicated models, for example based on feature-based regression methods or matrix factorization methods. Indeed, in Section 5.2.3 we provide simulation results where we replace $Q_i$ with a low-rank matrix factorization model.

The second term in the value equation, $\theta_{it}$, models horizontal differentiation across agents. In our simplified model agents only arrive once, and thus this also covers different contexts. For example, one traveler prefers a hotel on the waterfront, while another prefers a hotel downtown, and yet a third prefers staying close to the convention center. While these hypothetical hotels could have the same quality, the value for users differs, in ways known to the user. However, the intrinsic quality of these properties is unknown to these users.

All in all, the value of item $i$ for agent $t$ is drawn from a distribution with mean $Q_i$, and where the variance consists of a part that is known to the user ($\theta_{it}$) and a part that is unknown to both platform and user ($\epsilon_{it}$).

[1] Unlike the traditional bandit setting, there is no single best item. Rather, different users might have different optimal items.

In Section 5 we investigate how well the theoretical results carry over to more general models.

One could argue that personalization methods (i.e. replacing $Q$ with more sophisticated models) supersede the need for idiosyncratic preferences $\theta_{it}$, as these preferences can be captured by those models. However, we argue that in most cases this factor cannot be eliminated. Every recommendation system is constrained in terms of the quantity and quality of the data it is based on. First, a user only interacts with a system so often, and that limits the amount of personalization that models can achieve. Second, recommendation systems often have access to only weak features, and some aspects of user preferences and contexts, such as taste or style, can be difficult to capture. Together, these constraints make it difficult to fully model users' preferences, hence the need to explicitly model the unobserved preferences to get a deeper understanding of the dynamics of recommendation systems.

2.2 Incentives

We note that the agents in our model are boundedly rational: their behavior is not optimal, and in particular ignores the design of the platform. Experimentally, there has been abundant evidence of human behavior that is not rational [Camerer, 1998, Kahneman, 2003]. Simple heuristics of user behavior have been used by others in the social learning community. Examples include learning about technologies from word-of-mouth interactions [Ellison and Fudenberg, 1993, 1995] and modeling persuasion in social networks [Demarzo et al., 2003]. The combination of machine learning and mechanism design with boundedly rational agents is explored by Liu et al. [2015].

The behavior of the user in our model implicitly relies on three assumptions:

1. The user is naive; she believes the scores supplied by the platform are unbiased estimates of the true quality.

2. The user is myopic; she selects the item that seems best for her.

3. The user has incentives to give honest feedback.

The first assumption seems unrealistic if the platform abuses this power to dictate exploration, which does not align with the myopic behavior. However, in Section 4 we show that there is no need for such aggressive exploration from the platform to obtain order-optimal performance. We also note that if the platform outputs the true qualities $Q_i$, then the selection rule (2) is optimal for a myopic agent. Finally, it is not obvious why a myopic user would leave feedback. While we do not explicitly model returning users, we argue that in general a user is motivated to leave feedback because it leads to better recommendations for her in the future.

3 CONSISTENCY

In this section, we analyse the performance of standard algorithms, that is, scoring processes that do not take into account that agents have private preferences, and base the scores on empirical averages. This does include algorithms that trade off exploration and exploitation, such as variants of UCB [Auer et al., 2002] and Thompson Sampling [Russo and Roy, 2016]. We focus on the Bernoulli preferences model, though in Section 5 we empirically demonstrate that different preference distributions lead to similar outcomes.

First we define the set of agents before time $t$ that have selected item $i$ by

$S_{it} = \{\tau < t : a_\tau = i\}. \qquad (5)$

We also define $\bar{V}_{it}$ to denote the empirical average of item $i$ up to time $t$:

$\bar{V}_{it} = \frac{1}{|S_{it}|} \sum_{\tau \in S_{it}} V_\tau \qquad (6)$

where we use $\bar{V}_{it} = 0$ when $S_{it} = \emptyset$. We want to show that the system suffers linear regret when the platform uses any scoring mechanism for which the scores converge to the empirical average of the observed values. This means that the system never converges to an optimal policy; rather, a constant fraction of users are misled in perpetuity. To make this rigorous, we define the notion of a mean-converging scoring process.

Definition 1. A scoring process that outputs scores $s_{it}$ for item $i$ at time $t$ is mean-converging if

1. $s_{it}$ is a function of $\{V_\tau : \tau \in S_{it}\}$ and $t$.

2. $s_{it} \to \bar{V}_{it}$ almost surely if $\liminf_t |S_{it}|/t > 0$ almost surely.

In words, the score only depends on the observed outcomes for this particular item, and if we observe a linear number of selections of arm $i$, then the score converges to the mean outcome. Trivially, this includes using the average itself as the score, $s_{it} = \bar{V}_{it}$, but the definition also includes well-known methods that carefully balance exploitation with exploration, such as versions of UCB and Thompson Sampling.
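For illustration (our own sketch, not from the paper), both of the following scorers are mean-converging: the plain empirical average, and a UCB-style score whose exploration bonus vanishes once an item is selected a linear number of times:

import numpy as np

def average_score(values):
    """Plain empirical average; trivially mean-converging."""
    return float(np.mean(values)) if len(values) > 0 else 0.0

def ucb_score(values, t, alpha=2.0):
    """UCB-style score: the bonus vanishes once |S_it| grows linearly in t,
    so the score still converges to the empirical mean."""
    n = len(values)
    if n == 0:
        return float("inf")          # force at least one observation
    return float(np.mean(values)) + np.sqrt(alpha * np.log(t) / n)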

From the previous section, we know that, ideally, the scores supplied to the user converge to the quality of the item, $s_{it} \to Q_i$, as more users select item $i$. We say that the scores are biased if this is not the case:

$s_{it} \not\to Q_i \text{ as } |S_{it}| \to \infty. \qquad (7)$

The next proposition shows that mean-converging scoring processes lead to linear regret, because these scores are generally biased. We only show this result for the case where preferences are drawn from Bernoulli distributions, as this simplifies the analysis significantly. In Section 5, simulations show that linear regret is observed under a variety of preference distributions. Under the Bernoulli model, the gap between qualities needs to be 'small', though we show that this condition is rather weak.

Proposition 1. When $\theta_{it} \sim \mathrm{Bernoulli}(p)$ for all $i, t$, if

$\Delta = Q_1 - Q_2 < \frac{(1-p)^K}{(1-p)^K + p} \qquad (8)$

and $s_{it}$ is mean-converging, then

$\limsup_{t \to \infty} \frac{R_t}{t} \geq c \qquad (9)$

for some $c > 0$.

The proof of this proposition can be found in the supplemental material. The intuition behind the result is that the best-ranked item is selected by users who do not necessarily like it that much, while other items are only selected by users who really love it. Therefore, the ratings of the best-ranked item suffer relative to others.
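This selection effect is easy to reproduce in simulation. The following sketch (our own, with illustrative parameter choices) runs the Bernoulli-preference model with the naive empirical-average score and prints the qualities next to the resulting empirical averages:

import numpy as np

rng = np.random.default_rng(0)
K, p, T = 10, 0.1, 50_000
Q = np.linspace(1.0, 0.1, K)                              # Q_1 > ... > Q_K
sums, counts = np.zeros(K), np.zeros(K)

for t in range(T):
    theta = rng.binomial(1, p, size=K).astype(float)      # Bernoulli(p) preferences
    scores = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
    a = int(np.argmax(scores + theta))                     # selection rule (2)
    sums[a] += Q[a] + theta[a] + rng.normal(0.0, 0.1)      # naive feedback: the raw value
    counts[a] += 1

# The empirical averages are inflated by the selectors' preferences and do not
# converge to Q, so the induced ranking can be wrong (cf. Proposition 1).
print(np.round(Q, 2))
print(np.round(sums / np.maximum(counts, 1), 2))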

We note that for $K = 100$ and $p = 1/50$, the condition requires $\Delta < 0.86$. More generally, in the relevant regime where $p < \log(K)/(2K)$, the condition is satisfied if $\Delta < 0.7$ for all $K$. We also note that the linear regret we obtain is not caused by the usual exploration/exploitation trade-off, but rather by the estimators being biased.
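The threshold in condition (8) can be checked with a few lines of arithmetic (an illustrative sketch):

K, p = 100, 1 / 50
bound = (1 - p) ** K / ((1 - p) ** K + p)
print(round(bound, 3))   # ~0.869, consistent with the Delta < 0.86 quoted above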

There is no bias result for general preference distributions, as it is possible to cherry-pick distributions in such a way that the biases cancel each other out exactly. Furthermore, the magnitude of the problem depends crucially on the variance in user preferences relative to the differences in qualities.[2] However, in Section 5 we provide simulations with a variety of preference distributions that suggest that the bias is not an artifact of our assumptions.

3.1 Unbiased estimates

Naturally, a first attempt to improve on the linear regret is aimed at obtaining unbiased versions of the naive averaging. We now sketch a few such approaches.

[2] In the limiting scenario of no variance in preferences, we already know that there is no bias either.


3.1.1 Randomization

Researchers can avoid selection bias in experimental studies by randomizing treatments, and we can employ the same approach here. Instead of the user choosing an action, the platform assigns matches between users and items. Note that pure randomization is not needed; user ratings are unbiased as long as the selection of items is independent of private preferences.

Just like randomized controlled studies, this approach is often infeasible or prohibitively expensive, e.g. a platform cannot force the user to select a certain hotel or restaurant. However, small-scale user experimentation can potentially inform the platform about the magnitude and effects of idiosyncratic preferences and the bias they induce.

3.1.2 Algorithmic approach

Another option is to consider algorithmic approaches to obtain consistent estimators. In the case of Bernoulli preferences, it is possible to obtain unbiased estimates from the data. However, this approach does not generalize to other preference distributions, let alone to the case where we do not know the underlying preference distributions.

3.1.3 Changing the feedback model

Given the difficulty of debiasing feedback algorithmically, we briefly discuss a third alternative. The traditional type of question, 'How would you rate this item?', asks for an absolute measure of satisfaction, which corresponds to directly probing for $V_t$ in our model. If we ask how the chosen item compares to the user's expectation, we ask for a relative measure of feedback, approximating $V_t - (s_{a_t t} + \theta_{a_t t})$. An example of such a prompt could be 'How does this item compare to your expectation?' This way, we can uncover an unbiased estimate of $Q_{a_t}$. Importantly, it does not require any distributional assumptions on the form of the preferences. Whether such relative feedback works in practice would require a thorough empirical study, which is beyond the scope of this work. We also note that this approach does not work when platforms collect implicit feedback.
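A small sketch of why this works (our own illustration): the relative feedback has mean $Q_{a_t} - s_{a_t t}$, and since the platform knows its own score it can add it back to recover an unbiased estimate of $Q_{a_t}$, regardless of how the selectors' preferences are distributed. The values of `Q_a`, `s_a`, and the preference distribution below are illustrative choices:

import numpy as np

rng = np.random.default_rng(1)
Q_a, s_a, T = 0.7, 0.4, 100_000                  # illustrative quality and platform score
theta = rng.normal(0.5, 1.0, size=T)             # preferences of the users who selected the item
eps = rng.normal(0.0, 1.0, size=T)
V = Q_a + theta + eps                            # realized values, as in (1)

absolute = V.mean()                              # 'How would you rate this item?' -> biased by theta
relative = (V - (s_a + theta)).mean() + s_a      # relative feedback plus the known score -> recovers Q_a
print(round(absolute, 3), round(relative, 3))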

4 EFFICIENCY

From the previous section we know that naive scoring mechanisms are inconsistent and lead to linear regret. We now focus on the efficiency of consistent estimators, and we assume we have access to unbiased feedback from now on. But this is not necessarily sufficient to guarantee good performance. The multi-armed bandit literature suggests that algorithms with small regret require a careful balance between exploration and exploitation.

In particular, that means that the system needs to obtain data on every item in order to provide useful scores to the users. However, myopic agents have no interest in assisting the platform with exploration. In this section we address the problem of exploration in the proposed model. As opposed to the research mentioned in the introduction, we deal with agents with heterogeneous preferences. It seems natural that these heterogeneous agents help the system explore, but it is not obvious to what extent this helps. We show that because of this diversity in preferences, the free exploration leads to optimal performance (up to constants); we recover the standard logarithmic regret bound from the bandit literature. This means that there is little need for a platform to implement a complicated exploration strategy, and incentives naturally align much better than in the settings of previous work.

4.1 Formal result

We assume the qualities $Q_i$ are bounded, and without loss of generality we can assume they are bounded in $[0, 1]$, and we assume access to unbiased feedback from the user. That is, the feedback at time $t$ for the chosen item is $V_t = Q_{a_t} + \epsilon_t$. However, we no longer require that private preference distributions are Bernoulli. Let $\bar{V}_{it}$ denote the average of the values observed for item $i$ by time $t$:

$\bar{V}_{it} = \frac{1}{|S_{it}|} \sum_{\tau \in S_{it}} V_\tau \qquad (10)$

where $S_{it} = \{\tau < t : a_\tau = i\}$. We now consider the scoring algorithm that clips the value onto $[0, 1]$: $s_{it} = \max(0, \min(1, \bar{V}_{it}))$.
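In code, the clipped scoring rule is simply (a sketch; the empty-history convention follows Section 3):

import numpy as np

def clipped_score(values):
    """Empirical average of the unbiased feedback, clipped onto [0, 1];
    the score is 0 before any observation (the convention from Section 3)."""
    if len(values) == 0:
        return 0.0
    return float(np.clip(np.mean(values), 0.0, 1.0))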

If the private preferences have sufficiently large variance, made precise in the theorem statement, then exploration is guaranteed and the platform suffers logarithmic regret. Let $\Delta_{\min}$ be the smallest gap in qualities, $\Delta_{\min} = \min_{i \neq j} |Q_i - Q_j|$.

The following result shows that empirical averaging is enough to get an order-optimal (pseudo-)regret bound with respect to the total number of agents $T$.

Proposition 2. If for all $i$, $F_i$ are such that $\mathbb{P}(\theta_{it} > 1) > \gamma'$ and $\mathbb{P}(\theta_{it} \leq 0) \geq \gamma$, then

$\mathbb{E}[\mathrm{regret}(T)] \leq \left( \frac{16\sigma^2}{\Delta_{\min}} + \Delta_{\min} \right) K + \frac{32 \alpha \sigma^2 K (\log(T) - \log(\Delta_{\min}) + \log(2))}{\Delta_{\min}^2 C} \qquad (11)$

where $C = \gamma' \gamma^{K-1}$.

The proof can be found in the supplement. The main idea is that we can show every arm is chosen sufficiently often initially, and then concentration inequalities ensure good performance after an initial learning phase.

Corollary 3. Suppose $F_i$ is a symmetric distribution around 0 such that $\mathbb{P}(\theta_{it} > 1) > \gamma$; then the theorem applies with $C \geq \gamma \, 2^{1-K}$.

Also note that the theorem applies for Bernoulli preferences, where $C = p(1-p)^{K-1}$. The small value of $C$ leads to a large leading constant. Simulations suggest that performance is much better in practice.

4.2 Remarks

The main take-away from this result is not so much the specific bound, but rather the practical insight that it, and its proof, yield. The intuition is that initially estimates of quality are poor. Therefore, it takes some time and luck for users with idiosyncratic preferences to try these items. As estimates improve, however, most agents are drawn to their optimal choice. Since these choices differ across agents, the platform gets to learn efficiently about all items without incurring a regret penalty.

The practical consequence of this observation is that, to improve performance, the designer of a recommendation system should focus on simple ways to make new items, or more generally items with few observations, more likely to be chosen. We can achieve this by highlighting new arrivals. A good example is Netflix, which clearly displays a 'Recently Added' selection.

5 SIMULATIONS

In this section we empirically demonstrate that the theoretical results derived in the previous sections hold much more broadly. First, we focus on verifying the results from our simplified model. Thereafter, we consider more advanced personalization models using feature-based and low-rank methods, where we investigate the dynamics with private preferences.[3]

[3] The code to replicate the simulations is publicly available at https://github.com/schmit/human_interaction.

5.1 Simulations of regret

Before considering more advanced methods, we simulate our model using different preference distributions and plot the cumulative regret over time. We run 50 simulations with 5000 time steps and $K = 50$ items across four preference distributions with randomly drawn parameters. We then compare biased and unbiased algorithms based on empirical averages. Figure 1 shows the cumulative regret paths for each of these simulations. The qualities were drawn from the uniform distribution over $[0, 1]$. For the preference distribution $F_i$ for item $i$, we used

• Bernoulli distribution with $p_i \sim U[0, 2\log(K)/(3K)]$.

• Normal distribution with $\mu_i = 0$ and $\sigma_i \sim U[0, 1]$.

• Exponential distribution with scale $\lambda_i^{-1} \sim U[0, 1]$.

• Pareto distribution with shape $\alpha_i \sim U[2, 4]$.

These are chosen such that the variance in preferences and qualities is roughly similar. A clear pattern emerges: in all cases, the (biased) empirical averages lead to linear regret, not just for the Bernoulli model covered by Proposition 1. Second, we note that the unbiased scores lead to much better results regardless of the preference distribution, in line with Proposition 2.
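For concreteness, a condensed sketch of this comparison for the Normal preference family (our own illustration, mirroring the setup above; the consistent scorer averages the unbiased feedback $V_t - \theta_{a_t t} = Q_{a_t} + \epsilon_t$ assumed in Section 4, while the biased scorer averages the raw values):

import numpy as np

def run(T=5000, K=50, unbiased=False, seed=0):
    rng = np.random.default_rng(seed)
    Q = rng.uniform(0.0, 1.0, size=K)                    # qualities drawn from U[0, 1]
    sigma = rng.uniform(0.0, 1.0, size=K)                # Normal preference scales
    sums, counts, regret = np.zeros(K), np.zeros(K), 0.0
    for t in range(T):
        theta = rng.normal(0.0, sigma)                   # private preferences
        scores = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        a = int(np.argmax(scores + theta))
        value = Q[a] + theta[a] + rng.normal(0.0, 0.1)
        regret += np.max(Q + theta) - (Q[a] + theta[a])
        sums[a] += value - theta[a] if unbiased else value   # consistent vs naive feedback
        counts[a] += 1
    return regret

print(run(unbiased=False), run(unbiased=True))           # naive regret is far larger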

Figure 1: Cumulative regret plotted against time for the naive empirical averages (Biased, blue) and the unbiased averages (Consistent, orange); one panel each for Bernoulli, Normal, Exponential, and Pareto preferences.

5.2 Personalization methods

We now focus on methods that provide more personalized recommendations. Because estimating such models is more complicated and computationally intensive, we simplify the dynamics of our simulations to a two-staged approach. We then use this approach to experiment with a feature-based and a low-rank approximation approach to personalization, based on synthetically generated data.

5.2.1 Two-staged simulations

In our original setup, the platform updates its scoring rule after every observation. This is impractical when dealing with more sophisticated models. Instead, we first collect a set of observations using a fixed scoring algorithm (a training set), and fit the scoring algorithm once to this training data. We then use this trained algorithm to generate a new set of observations (a test set), again without updating the algorithm in between observations. The test set is used to measure the performance of the fitted algorithm. Instead of using regret, we measure performance by directly computing the average rating on the second dataset generated by our trained algorithm. Note that it is possible to iterate generating data and fitting models multiple times before generating a test set.
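Schematically, the two-stage protocol looks as follows (a sketch with hypothetical helper names: `generate_data` plays a stream of simulated users against a frozen scoring function and returns observation records with a `value` field, and `fit` returns a new scoring function):

def two_stage(generate_data, fit, initial_scores, rounds=1):
    """Alternate data generation (against a frozen scorer) and model fitting,
    then evaluate the last fitted scorer on a fresh test set."""
    scores = initial_scores
    for _ in range(rounds):
        train = generate_data(scores)        # users choose against the fixed scores
        scores = fit(train)                  # refit the scoring rule once
    test = generate_data(scores)             # held-out interactions
    return sum(obs.value for obs in test) / len(test)    # average rating on the test set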

5.2.2 Ridge regression

In this section we discuss a feature-based model of personalization where the rating is assumed to be a linear function of observed covariates.

The model  In the feature-based setting, each item has an unknown parameter vector $w_i$ and each user-item pair has an observed feature vector $x_{it}$. The value of item $i$ for user $t$ then becomes

$V_{it} = Q_i + x_{it}^T w_i + \theta_{it} + \epsilon_{it}. \qquad (12)$

Furthermore, we also parametrize $\theta_{it}$ in terms of $x_{it}$, such that

$\theta_{it} \sim \mathcal{N}(x_{it}^T w, \sigma_\theta) \qquad (13)$

where $w$ is another unknown parameter vector. After generating the training set using a fixed scoring rule, we use ridge regression to regress the reported ratings on the features for each item, which leads to estimates $\hat{Q}_i$ and $\hat{w}_i$ for each item. These are then used as the scoring rule: $s_{it} = \hat{Q}_i + x_{it}^T \hat{w}_i$.
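A sketch of this per-item fit using scikit-learn's Ridge (our own illustration; grouping observations by the selected item and the handling of items without data are assumptions of the sketch, not details specified in the text):

import numpy as np
from sklearn.linear_model import Ridge

def fit_item_models(X, ratings, items, n_items, alpha=1.0):
    """Fit one ridge regression per item on the (feature, rating) pairs of the
    users who selected it; the intercept roughly estimates Q_i."""
    models = []
    for i in range(n_items):
        mask = items == i
        if mask.sum() == 0:
            models.append(None)              # no data for this item yet
            continue
        m = Ridge(alpha=alpha)
        m.fit(X[mask], ratings[mask])
        models.append(m)
    return models

def score(models, x):
    """Scoring rule: estimated Q_i plus x^T times the estimated w_i (0 if unfit)."""
    return np.array([m.predict(x.reshape(1, -1))[0] if m is not None else 0.0
                     for m in models])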

Simulation details  There are $n = 100$ items, and $w_i \in \mathbb{R}^p$ where $p = 20$. We generate $Q_i \sim \mathcal{N}(0, 1)$, and the elements of $w_i$ and $w$ independently from $\mathcal{N}(0, 1/\sqrt{p})$. The elements of the feature vectors are generated independently following $x_{ijt} \sim \mathcal{N}(0, 1)$. The error term is drawn according to $\epsilon_{it} \sim \mathcal{N}(0, 1)$, and we set $\sigma_\theta = 0.1$. We generate $10np = 20000$ observations.

The training sets are generated using four different scoring rules:

1. Using the oracle scoring rule: $s_{it} = Q_i + x_{it}^T w_i$, which leads to perfect recommendations.

2. Using the oracle scoring rule and unbiased ratings $V_{it} - \theta_{it}$.

3. Using randomly selected items, hence the user has no choice.

4. Iterating the generate-and-fit process twice: we first use randomly selected items, then fit a ridge regression to estimate the parameters, and use the fitted scores $s_{it} = \hat{Q}_i + x_{it}^T \hat{w}_i$ to generate a second round of training data.

This last training set allows us to better understand how the system evolves over time.

Figure 2: Average rating on the test set for recommendation systems fit on different training data: a feature-based (ridge regression) model on the left and matrix factorization (ALS) on the right. Green corresponds to random selections in the training set, orange to oracle selections with unbiased feedback, red to the iterated model initially based on random selections, and blue to biased oracle ratings.

Results  The average values of the selected items in the test set are plotted in the left panel of Figure 2. The two dotted lines provide useful benchmarks: the top line shows the performance of the oracle scoring rule, which upper bounds the performance. The bottom line shows the performance in the absence of a scoring rule, that is, $s_{it} = 0$ for all $i, t$. Finally, note that random selections lead to an average rating of 0.

We note that the best performing recommendations are given by the model trained on random data (green); these are close to the performance of the oracle. Unbiased ratings based on the oracle (orange) perform a bit worse due to a feedback loop: the model is only trained on 'good' selections, and this leads to a degradation of performance. We also see that the iterated model (red) that was initially trained on random selections performs worse than the model trained directly on random data, suggesting that the quality deteriorates over time. Finally, the model trained on biased oracle data (blue) performs much worse than all the other models, and does not perform much better than the 'no-score algorithm' that does not provide recommendations.

5.2.3 Matrix factorization

In this section we investigate the dynamics of private preferences that are low rank. We use the same two-stage approach as before, where we first use a fixed scoring rule to generate a training set, fit our model, and use the fitted scoring rule to generate a test set to measure performance.

The model  The low-rank model assumes that the value of item $j$ for user $i$ follows the model

$V_{ij} = a_i + b_j + u_i^T v_j + x_i^T y_j + \epsilon_{ij} \qquad (14)$

where $u_i$ and $v_j$ are (hidden) $q$-dimensional vectors modeling the interaction between user and item, and $a_i$ and $b_j$ are terms modeling overall differences between users and items. Similarly, $x_i$ and $y_j$ are $q$-dimensional vectors that combine into the private preference $\theta_{ij} = x_i^T y_j$. Note that, unlike in previous settings, here we observe the same user multiple times. The recommendation system provides a score $s_{ij}$ for user $i$ and item $j$, and the user selects item

$\arg\max_j \, s_{ij} + x_i^T y_j \qquad (15)$

and reports her value $V_{ij}$. We use alternating least squares [Koren et al., 2009] to estimate $a$, $b$, $u$ and $v$.[4]

[4] We ensure that users do not rate the same item in both the training and test set.

Simulation details  As in the feature-based simulation, we generate four training sets: one based on an oracle, one based on an oracle with unbiased ratings, one based on random selections, and finally an iterated version of the random selections process, where we fit a model to the random-selections data and use that model to generate training data.

We simulate 2000 users and 500 items, with rank $q = 4$. Entries of $u$, $v$, $x$ and $y$ are independent Gaussians with variance $1/q$. To reduce variance, the error term has a small variance, $\epsilon_{ij} \sim \mathcal{N}(0, 0.01)$, and for both training and test sets each user rates 40 items. We run alternating least squares with rank $2q$ and varying regularization.
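A compact sketch of the alternating least squares updates for the platform's part of model (14) (our own outline, not the released implementation; the biases $a_i$, $b_j$ are handled by augmenting each small ridge solve with a constant column, and a single regularization weight is used for brevity):

import numpy as np

def als(R, mask, q=8, reg=0.1, iters=20, seed=0):
    """Alternating least squares for r_ij ~ a_i + b_j + u_i^T v_j on the observed
    entries (mask); each pass solves a small ridge problem per user and per item."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(0, 0.1, (n_users, q))
    V = rng.normal(0, 0.1, (n_items, q))
    a, b = np.zeros(n_users), np.zeros(n_items)
    for _ in range(iters):
        for i in range(n_users):                         # solve for (a_i, u_i), item side fixed
            obs = np.where(mask[i])[0]
            if obs.size == 0:
                continue
            X = np.hstack([np.ones((obs.size, 1)), V[obs]])
            w = np.linalg.solve(X.T @ X + reg * np.eye(q + 1), X.T @ (R[i, obs] - b[obs]))
            a[i], U[i] = w[0], w[1:]
        for j in range(n_items):                         # solve for (b_j, v_j), user side fixed
            obs = np.where(mask[:, j])[0]
            if obs.size == 0:
                continue
            X = np.hstack([np.ones((obs.size, 1)), U[obs]])
            w = np.linalg.solve(X.T @ X + reg * np.eye(q + 1), X.T @ (R[obs, j] - a[obs]))
            b[j], V[j] = w[0], w[1:]
    return a, b, U, V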

Results  The two dotted lines denote the same benchmarks as before. Again, we notice that recommendations trained on random data perform best, but this time the difference is much more pronounced. The recommendations based on perfect recommendations (blue and orange) perform a lot worse. In fact, they do barely better than not recommending items at all and having users base their choice solely on their own preference signals. Part of the degradation in performance seems to be caused by a feedback loop: the observations are not randomly sampled. We also notice a much stronger degradation in the performance of the iterated model (red). This suggests that the dynamic nature of recommendation systems affects matrix factorization methods more severely than the simpler linear model from the previous section.

6 DISCUSSION

In this work, we introduce a model for analyzing feedback in recommendation systems. We propose a simple model that explicitly looks at heterogeneous preferences among users of a recommendation system, and takes the dynamics of learning into account. We then consider the consistency and efficiency of natural estimators in this model. Recent work has focused on exploration, or efficiency, with selfish agents. On the one hand, preferences lead to inconsistent estimators if this aspect is not taken into account. On the other hand, we also show that there is an upside to heterogeneous preferences; they automatically lead to efficiency. Using simulations, we demonstrate that these phenomena persist when we use more sophisticated recommendation methods, such as matrix factorization.

6.1 Future work

There are several directions for further research. Our simplified model does not capture all aspects of recommendation systems. The most interesting aspect is that, in practice, users only observe a limited set of recommendations, rather than the entire inventory. This can lead to an inefficiency in the rate of exploration, and requires further study.

Our model and simulations show that consistency of models is an issue that is difficult to resolve. We believe that progress can be made. Theoretically, one possible avenue is to also model the selection process directly and combine it with the model for outcomes. Empirically, progress can come from large-scale studies that test the effects of human interaction on estimators.

6.2 The bigger picture

We believe that this work has raised fundamental and important issues relating to the interaction between machine learning systems and the users interacting with them. Algorithms not only consume data, but through their interaction with users also create data, a much more opaque process but one equally vital in designing systems that achieve the goals we set out to achieve. There is still a lot of room for improvement by gaining a better understanding of these dynamics.

7 Acknowledgements

The authors would like to thank Ramesh Johari, Vijay Kamble, Brad Klingenberg, Yonatan Gur, Peter Lofgren, Andrea Locatelli, and anonymous reviewers for their suggestions and feedback. This work was supported by the Stanford TomKat Center, and by the National Science Foundation under Grant No. CNS-1544548. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


References

G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng., 17:734–749, 2005.

X. Amatriain, J. M. Pujol, and N. Oliver. I like it... I like it not: Evaluating user ratings noise in recommender systems. In International Conference on User Modeling, Adaptation, and Personalization, pages 247–258. Springer, 2009.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.

H. Bastani, M. Bayati, and K. Khosravi. Exploiting the natural exploration in contextual bandits. arXiv preprint arXiv:1704.09011, 2017.

S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. CoRR, abs/1204.5721, 2012.

C. Camerer. Bounded rationality in individual decision making. Experimental Economics, 1(2):163–183, 1998.

C. Chamley. Rational Herds: Economic Models of Social Learning. Cambridge University Press, 2004. ISBN 9780521530927. URL https://books.google.com/books?id=2dgbOh6VE9YC.

P. M. Demarzo, D. Vayanos, and J. Zwiebel. Persuasion bias, social influence, and unidimensional opinions. 2003.

G. Ellison and D. Fudenberg. Rules of thumb for social learning. Journal of Political Economy, 101(4):612–643, 1993.

G. Ellison and D. Fudenberg. Word-of-mouth communication and social learning. The Quarterly Journal of Economics, 110(1):93–125, 1995.

P. I. Frazier, D. Kempe, J. M. Kleinberg, and R. Kleinberg. Incentivizing exploration. In SIGECOM, 2014.

P. Hummel and R. P. McAfee. Machine learning in an auction environment. In WWW, 2014.

B. Ifrach, C. Maglaras, and M. Scarsini. Bayesian social learning with consumer reviews. SIGMETRICS Performance Evaluation Review, 41:28, 2014.

T. Joachims, A. Swaminathan, and T. Schnabel. Unbiased learning-to-rank with biased feedback. CoRR, abs/1608.04468, 2017.

D. Kahneman. Maps of bounded rationality: Psychology for behavioral economics. The American Economic Review, 93(5):1449–1475, 2003.

Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42, 2009.

I. Kremer, Y. Mansour, and M. Perry. Implementing the "wisdom of the crowd". In SIGECOM, 2013.

T.-Y. Liu, W. Chen, and T. Qin. Mechanism learning with mechanism induced data. In AAAI, pages 4037–4041, 2015.

L. W. Mackey, D. J. Weiss, and M. I. Jordan. Mixed membership matrix factorization. In ICML, 2010.

Y. Mansour, A. Slivkins, and V. Syrgkanis. Bayesian incentive-compatible bandit exploration. CoRR, abs/1502.04147, 2015.

Y. Mansour, A. Slivkins, V. Syrgkanis, and Z. S. Wu. Bayesian exploration: Incentivizing exploration in Bayesian games. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC '16, pages 661–661, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-3936-0. doi: 10.1145/2940716.2940755. URL http://doi.acm.org/10.1145/2940716.2940755.

B. M. Marlin. Modeling user rating profiles for collaborative filtering. In NIPS, 2003.

B. M. Marlin, R. S. Zemel, S. T. Roweis, and M. Slaney. Collaborative filtering and the missing at random assumption. In UAI, 2007.

Y. Papanastasiou, K. Bimpikis, and N. Savva. Crowdsourcing exploration. 2014.

S. Qiang and M. Bayati. Dynamic pricing with demand covariates. 2016.

D. Russo and B. V. Roy. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17:68:1–68:30, 2016.

T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims. Recommendations as treatments: Debiasing learning and evaluation. ArXiv e-prints, Feb. 2016.

L. Smith and P. Sørensen. Pathological outcomes of observational learning. Econometrica, 68(2):371–398, 2000.

H. Steck. Training and testing of recommender systems on data missing not at random. In KDD, 2010.