Bayesian Incentive-Compatible Bandit Exploration
Vasilis Syrgkanis
Microsoft Research, NYC
Joint with: Yishay Mansour (Microsoft Research and Tel-Aviv Univ.)
Aleksandrs Slivkins (Microsoft Research)
Exploration-Exploitation in Recommendation Systems
Exploration problem
Prior bias of users leads to lack of exploration
Can miss good options that a priori seem inferior (“hidden gems”)
System needs to “incentivize” exploration
This talk: incentivizing exploration through information asymmetry
Bayesian Incentive-Compatible Bandit Model
T users arrive sequentially
Each user can take one of K actions
Each action $i$ has a mean reward $\mu_i$
Common prior belief on each $\mu_i$
Realized reward $r_i^t$: a noisy version of $\mu_i$
At each time step $t$, the system recommends an action $I_t$
Users report the realized reward $r_{I_t}^t$
[Figure: recommendation algorithm mediating between users $1, \dots, t, \dots, T$ and actions $1, \dots, i, \dots, K$; common prior on mean rewards $\mu_1, \dots, \mu_K$; realized rewards are noisy]
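A minimal Python sketch of this interaction protocol. The Gaussian prior and noise model, and all names (prior_means, realized_reward), are illustrative assumptions, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 1000

# Illustrative Gaussian prior on mean rewards (an assumption):
# arm 0 looks best a priori.
prior_means = np.array([0.9, 0.5, 0.3])
mu = rng.normal(prior_means, 0.2)      # mean rewards drawn once from the prior

def realized_reward(arm):
    """Noisy version of the arm's mean reward mu_i."""
    return mu[arm] + rng.normal(0.0, 0.1)

# Interaction protocol: at each step the system recommends an arm I_t,
# the user plays it and reports the realized reward back.
for t in range(T):
    I_t = rng.integers(K)              # placeholder recommendation rule
    r_t = realized_reward(I_t)         # reported back to the system
```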
System's Objective: Asymptotically Small Regret
[Figure: same recommendation pipeline as on the previous slide]
Ex-post regret: $R(T) = T \cdot \max_i \mu_i - \sum_{t=1}^{T} \mu_{I_t}$
Bayesian regret: $\mathbb{E}_{\mathrm{prior}}[R(T)]$
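A tiny sketch of the ex-post regret computation (Bayesian regret is its expectation over mean rewards drawn from the prior); the function name is illustrative:

```python
import numpy as np

def ex_post_regret(mu, chosen_arms):
    """R(T) = T * max_i mu_i - sum_t mu_{I_t}."""
    mu = np.asarray(mu)
    arms = np.asarray(chosen_arms)
    return len(arms) * mu.max() - mu[arms].sum()

# Example: mu = [0.7, 0.5]; always playing arm 1 for T = 10 steps
# gives regret 10 * 0.7 - 10 * 0.5 = 2.0.
print(ex_post_regret([0.7, 0.5], [1] * 10))   # -> 2.0
```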
So far, this is equivalent to the stochastic i.i.d. multi-armed bandit model:
$O(\sqrt{T})$ regret achievable
Incentive Compatibility (IC)
Playing the recommended action has expected utility at least as high as any other action
e.g., the first user will only take action 1, the a priori best action
If users observed everything, they would only take the posterior-better action given previous rewards, so exploration cannot be guaranteed
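Formally, this is (essentially) the paper's BIC condition on the recommendation policy:

```latex
% Bayesian incentive compatibility: whenever arm i is recommended at
% time t (with positive probability), its posterior expected reward
% weakly dominates that of every other arm j:
\mathbb{E}\big[\, \mu_i - \mu_j \;\big|\; I_t = i \,\big] \;\ge\; 0
\qquad \text{for all arms } i, j \text{ and all times } t,
% where the expectation is over the prior and the algorithm's randomness.
```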
Information Asymmetry
Users do not observe rewards or recommendations of previous users
Users are unaware whether rewards in previous steps have made a priori better arms look worse than a priori inferior arms
Main question
Is $O(\sqrt{T})$ regret still achievable under the incentive-compatibility constraint?
Main Results: Bayesian Regret
Black-box reduction: turns any bandit algorithm into an incentive-compatible one (prior-dependent constant blow-up in Bayesian regret)
Implies $O(\sqrt{T})$ Bayesian-regret IC algorithms
T steps of any algorithm can be simulated in an incentive-compatible manner, at the cost of a prior-dependent constant factor more time steps
Average expected reward as high as that of the original algorithm
Enables modular design of IC recommendation systems
Main Results: Ex-post Regret
$O(\log T)$ ex-post regret for instances with a large "gap" in the means
Difference between the best arm and suboptimal arms lower-bounded by a constant
Detail-free algorithm (does not need to know the full prior, only an upper bound on a single parameter of the prior)
Some related work
Kremer et al. [2014]: same model, two arms; primarily the Bayesian-optimal policy for non-stochastic rewards, with results for stochastic rewards as well
Che and Hörner [2013]: continuous-time stochastic model, two arms, binary rewards, Bayesian-optimal policy
Frazier et al. [2014]: monetary transfers allowed, users observe past actions; payments vs. information asymmetry
Connections to Bayesian Persuasion: Kamenica, Gentzkow [2011]
Connections to Strategic Experimentation: Bolton, Harris [1999]
Key Ideas
Key idea: two arms, first sample
Hide exploration in a pool of exploitation
[Diagram: the first user pulls arm 1; then, in a block of L users, one randomly chosen user is recommended arm 2, while the rest are recommended the posterior-better arm]
• The "exploration" user doesn't know whether he is recommended action 2 because it is the posterior-better action or because of exploration
• If L is large enough, exploitation is the most probable reason, and hence following the recommendation is in his own interest
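A sketch of this two-arm scheme, assuming a helper that supplies the posterior-better arm; the block length L must be large enough, relative to the prior, for the scheme to be IC:

```python
import numpy as np

rng = np.random.default_rng(1)

def hidden_exploration_block(L, posterior_better_arm):
    """One block of L recommendations: a single uniformly random user is
    recommended arm 2 (exploration); everyone else gets the posterior-better
    arm (exploitation). Each user sees only their own recommendation, so a
    user told 'arm 2' cannot tell which case produced it."""
    explore_slot = rng.integers(L)
    return [2 if u == explore_slot else posterior_better_arm
            for u in range(L)]

# First user pulls arm 1 (the a priori better arm); then one block of L = 50.
recommendations = [1] + hidden_exploration_block(50, posterior_better_arm=1)
```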
Key idea: Black-box reduction
[Diagram: initial sampling phase as on the previous slide (pull arm 1, then hide a recommendation of arm 2 among users pulling the posterior-better arm); afterwards, repeated blocks of users in which one hidden user plays whatever arm the bandit algorithm asks for, with the reward reported back to the algorithm, while the remaining users pull the posterior-better arm]
• Expected reward of exploit users is at least as good as the algorithm's reward
• Average expected reward at least that of the simulated algorithm; regret at most a prior-dependent constant blow-up
• If A is a low-regret algorithm, the result is a low-regret IC algorithm
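A sketch of the reduction, assuming a hypothetical bandit-algorithm interface (next_arm / report) and a helper exploit_arm returning the current posterior-better arm:

```python
import numpy as np

rng = np.random.default_rng(2)

def ic_simulate(bandit_alg, exploit_arm, realized_reward, num_phases, L):
    """Black-box reduction sketch: simulate num_phases steps of an arbitrary
    bandit algorithm inside num_phases * L user slots. In each phase, one
    uniformly random user carries out the algorithm's query; the other
    L - 1 users are recommended the current exploit arm."""
    for _ in range(num_phases):
        queried = bandit_alg.next_arm()     # assumed interface
        hidden_slot = rng.integers(L)
        for u in range(L):
            if u == hidden_slot:
                r = realized_reward(queried)
                bandit_alg.report(queried, r)   # feed the reward back
            else:
                realized_reward(exploit_arm())  # exploitation slot
```

Since exploit slots earn at least the algorithm's expected reward, the average reward over all slots matches or exceeds the algorithm's, at the cost of an L-fold slowdown.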
Key idea: Ex-post regret
• Use sample means instead of posterior means
• The exploit arm is arm 2 only if its sample mean is higher with some confidence
• After the initial sampling phase, do active arm elimination
[Diagram: the first user pulls arm 1; then, in a block of users, one hidden user is recommended arm 2 while the rest pull an exploit arm]
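A sketch of the confidence-based choice of exploit arm (Hoeffding-style radius, assuming rewards in [0, 1]; names illustrative):

```python
import numpy as np

def exploit_arm(mean1, n1, mean2, n2, delta=0.05):
    """Choose arm 2 as the exploit arm only if its sample mean beats arm 1's
    by more than the combined confidence radii; otherwise stick with arm 1.
    n1, n2 are the numbers of samples behind each sample mean."""
    radius = lambda n: np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return 2 if mean2 - radius(n2) > mean1 + radius(n1) else 1
```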
Key idea: many arms
Need to first sample each arm before users can be convinced to play it
Do a contest (sketch below):
Many technical difficulties in performing the contest with sample means for the detail-free algorithm:
Use of sample averages with a confidence bound is not as straightforward
Not trivial to define the exploit arm as a function of sample means
[Diagram: the first user pulls arm 1; then, for k = 2, 3, 4, ..., a block of users pulls the posterior-better arm among arms 1, ..., k, with one hidden user recommended arm k]
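A sketch of the contest schedule, assuming a hypothetical posterior_better_among helper:

```python
import numpy as np

rng = np.random.default_rng(3)

def contest(K, L, posterior_better_among):
    """Many-arm 'contest' sketch: arm k is introduced by hiding it in a
    block of L users whose other members are recommended the posterior-
    better arm among arms 1..k. `posterior_better_among` is an assumed
    helper returning the arm with the highest current posterior mean."""
    recs = [1]                                  # first user pulls arm 1
    for k in range(2, K + 1):
        hidden_slot = rng.integers(L)
        recs += [k if u == hidden_slot
                 else posterior_better_among(range(1, k + 1))
                 for u in range(L)]
    return recs
```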
User-Heterogeneity: Contextual Bandit Extension
Each user has a publicly observable context
Context affects the user's mean reward
Need to find an optimal policy: context → arm
Captures user heterogeneity
The black-box reduction is very useful for leveraging contextual bandit algorithms
Summary
Enables modular design of IC recommendation systems
$O(\sqrt{T})$ Bayesian regret and $O(\log T)$ instance-based ex-post regret
Detail-free algorithm (does not need to know the full prior, only an upper bound on a single parameter of the prior)
Extensions: contextual bandits and others (see paper)
Thank you