Bayesian Incentive-Compatible Bandit Exploration
Vasilis Syrgkanis
Microsoft Research, NYC
Joint with: Yishay Mansour (Microsoft Research and Tel-Aviv Univ.)
Aleksandrs Slivkins (Microsoft Research)
Exploration-Exploitation in Recommendation Systems
Exploration problem
Prior bias of users leads to lack of exploration
Can miss good options that a priori seem inferior (“hidden gems”)
System needs to “incentivize” exploration
This talk: incentivizing exploration through information asymmetry
Bayesian Incentive-Compatible Bandit Model
T users arrive sequentially
Each user can take one of K actions
Each action $i$ has a mean reward $\mu_i$
Common prior belief on each $\mu_i$
Realized reward $r_i^t$: a noisy version of $\mu_i$
At each time step $t$, the system recommends an action $I_t$
Users report the realized reward $r_{I_t}^t$
[Figure: recommendation algorithm mediating between users $1, \dots, t, \dots, T$ and actions $1, \dots, i, \dots, K$; common prior on mean rewards $\mu_1, \dots, \mu_K$; realized rewards are noisy]
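A minimal Python sketch of this interaction protocol. The Gaussian prior and noise model, and all names (prior_means, realized_reward), are illustrative assumptions, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 1000

# Illustrative Gaussian prior on mean rewards (an assumption):
# arm 0 looks best a priori.
prior_means = np.array([0.9, 0.5, 0.3])
mu = rng.normal(prior_means, 0.2)      # mean rewards drawn once from the prior

def realized_reward(arm):
    """Noisy version of the arm's mean reward mu_i."""
    return mu[arm] + rng.normal(0.0, 0.1)

# Interaction protocol: at each step the system recommends an arm I_t,
# the user plays it and reports the realized reward back.
for t in range(T):
    I_t = rng.integers(K)              # placeholder recommendation rule
    r_t = realized_reward(I_t)         # reported back to the system
```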
System's Objective: Asymptotically Small Regret
[Figure: same recommendation pipeline as on the previous slide]
Ex-post regret: $R(T) = T \cdot \max_i \mu_i - \sum_{t=1}^{T} \mu_{I_t}$
Bayesian regret: $\mathbb{E}_{\mathrm{prior}}[R(T)]$
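A tiny sketch of the ex-post regret computation (Bayesian regret is its expectation over mean rewards drawn from the prior); the function name is illustrative:

```python
import numpy as np

def ex_post_regret(mu, chosen_arms):
    """R(T) = T * max_i mu_i - sum_t mu_{I_t}."""
    mu = np.asarray(mu)
    arms = np.asarray(chosen_arms)
    return len(arms) * mu.max() - mu[arms].sum()

# Example: mu = [0.7, 0.5]; always playing arm 1 for T = 10 steps
# gives regret 10 * 0.7 - 10 * 0.5 = 2.0.
print(ex_post_regret([0.7, 0.5], [1] * 10))   # -> 2.0
```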
So far, this is equivalent to the stochastic i.i.d. multi-armed bandit model:
$O(\sqrt{T})$ regret achievable
Incentive Compatibility (IC)
Playing the recommended action has expected utility at least as high as any other action
e.g., the first user will only take action 1, the a priori best action
If users observed everything, they would only take the posterior-better action given previous rewards, so exploration cannot be guaranteed
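Formally, this is (essentially) the paper's BIC condition on the recommendation policy:

```latex
% Bayesian incentive compatibility: whenever arm i is recommended at
% time t (with positive probability), its posterior expected reward
% weakly dominates that of every other arm j:
\mathbb{E}\big[\, \mu_i - \mu_j \;\big|\; I_t = i \,\big] \;\ge\; 0
\qquad \text{for all arms } i, j \text{ and all times } t,
% where the expectation is over the prior and the algorithm's randomness.
```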
Information Asymmetry
Users do not observe rewards or recommendations of previous users
Users are unaware whether rewards in previous steps have made a priori better arms look worse than a priori inferior arms
Main question
Is $O(\sqrt{T})$ regret still achievable under the incentive-compatibility constraint?
Main Results: Bayesian Regret
Black-box reduction: turns any bandit algorithm into an incentive-compatible one (prior-dependent constant blow-up in Bayesian regret)
Implies $O(\sqrt{T})$ Bayesian-regret IC algorithms
T steps of any algorithm can be simulated in an incentive-compatible manner, at the cost of a prior-dependent constant factor more time steps
Average expected reward as high as that of the original algorithm
Enables modular design of IC recommendation systems
Main Results: Ex-post Regret
$O(\log T)$ ex-post regret for instances with a large "gap" in the means
Difference between the best arm and suboptimal arms lower-bounded by a constant
Detail-free algorithm (does not need to know the full prior, only an upper bound on a single parameter of the prior)
Some related work
Kremer et al. [2014]: same model, two arms; primarily the Bayesian-optimal policy for non-stochastic rewards, with results for stochastic rewards as well
Che and Hörner [2013]: continuous-time stochastic model, two arms, binary rewards, Bayesian-optimal policy
Frazier et al. [2014]: monetary transfers allowed, users observe past actions; payments vs. information asymmetry
Connections to Bayesian Persuasion: Kamenica, Gentzkow [2011]
Connections to Strategic Experimentation: Bolton, Harris [1999]
Key Ideas
Key idea: two arms, first sample
Hide exploration in a pool of exploitation
[Diagram: the first user pulls arm 1; then, in a block of L users, one randomly chosen user is recommended arm 2, while the rest are recommended the posterior-better arm]
• The "exploration" user doesn't know whether he is recommended action 2 because it is the posterior-better action or because of exploration
• If L is large enough, exploitation is the most probable reason, and hence following the recommendation is in his own interest
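A sketch of this two-arm scheme, assuming a helper that supplies the posterior-better arm; the block length L must be large enough, relative to the prior, for the scheme to be IC:

```python
import numpy as np

rng = np.random.default_rng(1)

def hidden_exploration_block(L, posterior_better_arm):
    """One block of L recommendations: a single uniformly random user is
    recommended arm 2 (exploration); everyone else gets the posterior-better
    arm (exploitation). Each user sees only their own recommendation, so a
    user told 'arm 2' cannot tell which case produced it."""
    explore_slot = rng.integers(L)
    return [2 if u == explore_slot else posterior_better_arm
            for u in range(L)]

# First user pulls arm 1 (the a priori better arm); then one block of L = 50.
recommendations = [1] + hidden_exploration_block(50, posterior_better_arm=1)
```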
Key idea: Black-box reduction
[Diagram: initial sampling phase as on the previous slide (pull arm 1, then hide a recommendation of arm 2 among users pulling the posterior-better arm); afterwards, repeated blocks of users in which one hidden user plays whatever arm the bandit algorithm asks for, with the reward reported back to the algorithm, while the remaining users pull the posterior-better arm]
• Expected reward of exploit users is at least as good as the algorithm's reward
• Average expected reward at least that of the simulated algorithm; regret at most a prior-dependent constant blow-up
• If A is a low-regret algorithm, the result is a low-regret IC algorithm
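A sketch of the reduction, assuming a hypothetical bandit-algorithm interface (next_arm / report) and a helper exploit_arm returning the current posterior-better arm:

```python
import numpy as np

rng = np.random.default_rng(2)

def ic_simulate(bandit_alg, exploit_arm, realized_reward, num_phases, L):
    """Black-box reduction sketch: simulate num_phases steps of an arbitrary
    bandit algorithm inside num_phases * L user slots. In each phase, one
    uniformly random user carries out the algorithm's query; the other
    L - 1 users are recommended the current exploit arm."""
    for _ in range(num_phases):
        queried = bandit_alg.next_arm()     # assumed interface
        hidden_slot = rng.integers(L)
        for u in range(L):
            if u == hidden_slot:
                r = realized_reward(queried)
                bandit_alg.report(queried, r)   # feed the reward back
            else:
                realized_reward(exploit_arm())  # exploitation slot
```

Since exploit slots earn at least the algorithm's expected reward, the average reward over all slots matches or exceeds the algorithm's, at the cost of an L-fold slowdown.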
Key idea: Ex-post regret
• Use sample means instead of posterior means
• The exploit arm is arm 2 only if its sample mean is higher with some confidence
• After the initial sampling phase, do active arm elimination
[Diagram: the first user pulls arm 1; then, in a block of users, one hidden user is recommended arm 2 while the rest pull an exploit arm]
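A sketch of the confidence-based choice of exploit arm (Hoeffding-style radius, assuming rewards in [0, 1]; names illustrative):

```python
import numpy as np

def exploit_arm(mean1, n1, mean2, n2, delta=0.05):
    """Choose arm 2 as the exploit arm only if its sample mean beats arm 1's
    by more than the combined confidence radii; otherwise stick with arm 1.
    n1, n2 are the numbers of samples behind each sample mean."""
    radius = lambda n: np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return 2 if mean2 - radius(n2) > mean1 + radius(n1) else 1
```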
Key idea: many arms
Need to first sample each arm before users can be convinced to play it
Do a contest (sketch below):
Many technical difficulties in performing the contest with sample means for the detail-free algorithm:
Use of sample averages with a confidence bound is not as straightforward
Not trivial to define the exploit arm as a function of sample means
[Diagram: the first user pulls arm 1; then, for k = 2, 3, 4, ..., a block of users pulls the posterior-better arm among arms 1, ..., k, with one hidden user recommended arm k]
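A sketch of the contest schedule, assuming a hypothetical posterior_better_among helper:

```python
import numpy as np

rng = np.random.default_rng(3)

def contest(K, L, posterior_better_among):
    """Many-arm 'contest' sketch: arm k is introduced by hiding it in a
    block of L users whose other members are recommended the posterior-
    better arm among arms 1..k. `posterior_better_among` is an assumed
    helper returning the arm with the highest current posterior mean."""
    recs = [1]                                  # first user pulls arm 1
    for k in range(2, K + 1):
        hidden_slot = rng.integers(L)
        recs += [k if u == hidden_slot
                 else posterior_better_among(range(1, k + 1))
                 for u in range(L)]
    return recs
```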
User-Heterogeneity: Contextual Bandit Extension
Each user has a publicly observable context
Context affects the user's mean reward
Need to find an optimal policy: context → arm
Captures user heterogeneity
The black-box reduction is very useful for leveraging contextual bandit algorithms
Summary
Enables modular design of IC recommendation systems
$O(\sqrt{T})$ Bayesian regret and $O(\log T)$ instance-based ex-post regret
Detail-free algorithm (does not need to know the full prior, only an upper bound on a single parameter of the prior)
Extensions: contextual bandits and others (see paper)
Thank you