
Reinforcement Learning

Lecture 5: Bandit optimisation

Alexandre Proutiere, Sadegh Talebi, Jungseul Ok

KTH, The Royal Institute of Technology

Objectives of this lecture

Introduce bandit optimisation: the most basic RL problem without

dynamics. Optimising the exploration vs exploitation trade-off.

• Regret lower bounds

• Algorithms based on the "optimism in the face of uncertainty" principle

• Thompson Sampling algorithm

• Structured bandits

Lecture 5: Outline

1. Classifying bandit problems

2. Regret lower bounds

3. Algorithms based on the "optimism in the face of uncertainty" principle

4. Thompson Sampling algorithm

5. Structured bandits

Lecture 5: Outline

1. Classifying bandit problems

2. Regret lower bounds

3. Algorithms based on the "optimism in the face of uncertainty" principle

4. Thompson Sampling algorithm

5. Structured bandits

Bandit Optimisation

• Interact with an i.i.d. or adversarial environment

• Set of available actions A with unknown sequences of rewards

rt(a), t = 1, . . .

• The reward is the only feedback – bandit feedback

• Stochastic vs. adversarial bandits

- i.i.d. environment: rt(a) random variable with mean θa

- adversarial environment: rt(a) is arbitrary!

• Objective: develop an action selection rule π maximising the

expected cumulative reward up to step T

Remark: π must select an action depending on the entire history of

observations!


Regret

• Difference between the cumulative reward of an "Oracle" policy and that of agent π

• Regret quantifies the price to pay for learning

• Exploration vs. exploitation trade-off: we need to probe all actions to play the best later

Applications

Clinical trial, Thompson 1933

- Two available treatments with unknown rewards ('Live' or 'Die')

- Bandit feedback: after administering the treatment to a patient, we observe whether she survives or dies

- Goal: design a treatment selection scheme π maximising the number of patients cured after treatment

Applications

Rate adaptation in 802.11 wireless systems

- The AP sequentially sends packets to the receiver and has K available encoding rates r1 < r2 < . . . < rK

- The probability θk that a packet sent at rate rk is received is unknown

- Goal: design a rate selection scheme that learns the θk's and quickly converges to the rate $r_{k^\star}$ maximising $\mu_k = r_k\theta_k$ over k

Applications

Search engines

- The engine should list relevant webpages depending on the request

’jaguar’

- The CTRs (Click-Through-Rate) are unknown

- Goal: design a list selection scheme that learns the list maximising the global CTR

Bandit Taxonomy

• Stochastic bandits: the sequence of rewards (rt(a), a ∈ A)t≥1 is

generated according to an i.i.d. process – the average rewards are

unknown

• Adversarial bandits: arbitrary sequence of rewards

Most bandit problems in engineering are stochastic ...


Stochastic Bandit Taxonomy

Unstructured problems: average rewards are not related

$\theta = (\theta_1, \ldots, \theta_K) \in \Theta = \prod_k [a_k, b_k]$

Structured problems: the decision maker knows that the average rewards are related. She knows Θ. The rewards observed for a given arm provide side information about the other arms.

$\theta = (\theta_1, \ldots, \theta_K) \in \Theta$, where Θ is not a hyper-rectangle


Lecture 5: Outline

1. Classifying bandit problems

2. Regret lower bounds

3. Algorithms based on the "optimism in the face of uncertainty" principle

4. Thompson Sampling algorithm

5. Structured bandits

Unstructured Stochastic Bandits – Robbins 1952

• Finite set of actions A = {1, . . . ,K}

• (Unknown) rewards of action a ∈ A: $(r_t(a),\, t \ge 0)$ i.i.d. Bernoulli with $\mathbb{E}[r_t(a)] = \theta_a$, $\theta \in \Theta = [0, 1]^K$

• Optimal action $a^\star \in \arg\max_a \theta_a$

• Online policy π: select action $a^\pi_t$ at time t depending on $a^\pi_1, r_1(a^\pi_1), \ldots, a^\pi_{t-1}, r_{t-1}(a^\pi_{t-1})$

• Regret up to time T: $R^\pi(T) = T\,\theta_{a^\star} - \sum_{t=1}^{T} \theta_{a^\pi_t}$
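To make the setup concrete, here is a minimal Python sketch of the interaction protocol and of the regret defined above (not part of the lecture; the names `BernoulliBandit`, `run` and the `select`/`update` interface are illustrative assumptions):

```python
import numpy as np

class BernoulliBandit:
    """K-armed Bernoulli bandit with unknown means theta (illustrative sketch)."""
    def __init__(self, theta, rng=None):
        self.theta = np.asarray(theta, dtype=float)
        self.rng = rng or np.random.default_rng(0)

    def pull(self, a):
        # Reward r_t(a) ~ Bernoulli(theta_a)
        return float(self.rng.random() < self.theta[a])

def run(policy, bandit, T):
    """Run a policy for T steps; return the pseudo-regret T*theta_star - sum_t theta_{a_t}."""
    picked = np.zeros(T, dtype=int)
    for t in range(T):
        a = policy.select(t)          # the action may depend on the whole history
        r = bandit.pull(a)
        policy.update(a, r)
        picked[t] = a
    return T * bandit.theta.max() - bandit.theta[picked].sum()
```

The policies sketched later in the lecture (εt-greedy, UCB, KL-UCB, Thompson Sampling) can be plugged in as `policy` objects exposing `select` and `update`.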


Problem-specific regret Lower Bound

Uniformly good algorithms: An algorithm π is uniformly good if, for all θ ∈ Θ and any sub-optimal arm a, the number of times $N_a(t)$ that arm a is selected up to round t satisfies $\mathbb{E}[N_a(t)] = o(t^\alpha)$ for all α > 0.

Theorem (Lai-Robbins 1985)

For any uniformly good algorithm π:

$$\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \;\ge\; \sum_{a \ne a^\star} \frac{\theta_{a^\star} - \theta_a}{\mathrm{KL}(\theta_a, \theta_{a^\star})}$$

where $\mathrm{KL}(a, b) = a \log\big(\tfrac{a}{b}\big) + (1-a)\log\big(\tfrac{1-a}{1-b}\big)$ is the KL divergence between Bernoulli distributions with means a and b.
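The constant on the right-hand side is easy to evaluate numerically; here is a small sketch (assuming Bernoulli rewards; the mean vector used in the example is hypothetical):

```python
import numpy as np

def kl_bernoulli(a, b, eps=1e-12):
    # KL divergence between Bernoulli(a) and Bernoulli(b)
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def lai_robbins_constant(theta):
    """Sum over sub-optimal arms of (theta_star - theta_a) / KL(theta_a, theta_star)."""
    theta = np.asarray(theta, dtype=float)
    star = theta.max()
    return sum((star - t) / kl_bernoulli(t, star) for t in theta if t < star)

# Example: with theta = (0.9, 0.8, 0.5), regret must grow at least like c * log(T)
print(lai_robbins_constant([0.9, 0.8, 0.5]))  # c is roughly 3.0 here
```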


Minimax regret Lower Bound

Theorem (Auer et al. 2002)

For any T, we can find a problem (depending on T) such that for any algorithm π,

$$R^\pi(T) \;\ge\; \sqrt{KT}\left(1 - \frac{1}{K}\right).$$


Unified proofs

• Change-of-measure: θ → ν

• Log-likelihood ratio: $\mathbb{E}_\theta[L] = \sum_j \mathbb{E}_\theta[N_j(T)]\,\mathrm{KL}(\theta_j, \nu_j)$

• Data processing inequality. For any event $A \in \mathcal{F}_T$, $\mathbb{P}_\nu(A) = \mathbb{E}_\theta[\exp(-L)\mathbb{1}_A]$. Jensen's inequality yields:

$$\mathbb{P}_\nu(A) \ge \exp(-\mathbb{E}_\theta[L \mid A])\,\mathbb{P}_\theta(A), \qquad \mathbb{P}_\nu(A^c) \ge \exp(-\mathbb{E}_\theta[L \mid A^c])\,\mathbb{P}_\theta(A^c)$$

Hence $\mathbb{E}_\theta[L] \ge \mathrm{KL}(\mathbb{P}_\theta(A), \mathbb{P}_\nu(A))$

• Data processing inequality v2. For all $\mathcal{F}_T$-measurable Z with values in [0, 1],

$$\mathbb{E}_\theta[L] \ge \mathrm{KL}(\mathbb{E}_\theta[Z], \mathbb{E}_\nu[Z])$$


Proof – problem-specific lower bound

Change-of-measure: θ → ν with $\nu_j = \theta_j$ for all $j \ne a$, and $\nu_a = \theta_{a^\star} + \varepsilon$.

$$\mathbb{E}_\theta[L] = \mathbb{E}_\theta[N_a(T)]\,\mathrm{KL}(\theta_a, \theta_{a^\star} + \varepsilon) \;\ge\; \mathrm{KL}(\mathbb{P}_\theta(A), \mathbb{P}_\nu(A))$$

Select $A = \{N_{a^\star}(T) \le T - \sqrt{T}\}$. Markov's inequality yields (for uniformly good algorithms): $\lim_{T\to\infty} \mathbb{P}_\theta[A] = 0 = \lim_{T\to\infty} \mathbb{P}_\nu[A^c]$. Hence

$$\liminf_{T\to\infty} \frac{\mathbb{E}_\theta[N_a(T)]}{\log(T)} \;\ge\; \frac{1}{\mathrm{KL}(\theta_a, \theta_{a^\star} + \varepsilon)}$$


Proof – minimax lower bound

Change-of-measure: $\theta_a = 1/2$ for all a. Then there exists an arm a such that $\mathbb{E}_\theta[N_a(T)] \le T/K$. Set $\nu_i = \theta_i$ for all $i \ne a$, and $\nu_a = 1/2 + \varepsilon$.

$$\mathbb{E}_\theta[L] = \mathbb{E}_\theta[N_a(T)]\,\mathrm{KL}(1/2, 1/2 + \varepsilon) \;\ge\; \mathrm{KL}(\mathbb{E}_\theta[Z], \mathbb{E}_\nu[Z])$$

where $\mathrm{KL}(1/2, 1/2 + \varepsilon) = \frac{1}{2}\log\left(\frac{1}{1 - 4\varepsilon^2}\right)$. Select $Z = N_a(T)/T$.

Pinsker's inequality yields:

$$\frac{\mathbb{E}_\theta[N_a(T)]}{2}\log\left(\frac{1}{1 - 4\varepsilon^2}\right) \;\ge\; 2\left(\frac{\mathbb{E}_\nu[N_a(T)]}{T} - \frac{\mathbb{E}_\theta[N_a(T)]}{T}\right)^2$$

Hence, with $\mathbb{E}_\theta[N_a(T)] \le T/K$,

$$\frac{\mathbb{E}_\nu[N_a(T)]}{T} \;\le\; \frac{1}{K} + \frac{1}{2}\sqrt{\frac{T}{K}\log\left(\frac{1}{1 - 4\varepsilon^2}\right)}$$


Proof – minimax lower bound

Now $R^\nu(T) = T\varepsilon\left(1 - \frac{\mathbb{E}_\nu[N_a(T)]}{T}\right)$, and we conclude, choosing $\varepsilon = \sqrt{K/T}$, that:

$$R^\nu(T) \;\ge\; \sqrt{KT}\,(1 - 1/K).$$


Lecture 5: Outline

1. Classifying bandit problems

2. Regret lower bounds

3. Algorithms based on the "optimism in the face of uncertainty" principle

4. Thompson Sampling algorithm

5. Structured bandits

Concentration

The main tools in the design and analysis of algorithms for stochastic

bandits are concentration-of-measure results.

Let $X_1, X_2, \ldots$ be i.i.d. real-valued random variables with mean µ and log-moment generating function $G(\lambda) = \log(\mathbb{E}[e^{\lambda(X_n - \mu)}])$. Let $S_n = \sum_{i=1}^{n} X_i$.

• Strong law of large numbers: $\mathbb{P}[\lim_{n\to\infty} \frac{S_n}{n} = \mu] = 1$

• Concentration inequality: let δ, λ > 0,

$$\mathbb{P}[S_n - n\mu \ge \delta] = \mathbb{P}[e^{\lambda(S_n - n\mu)} \ge e^{\lambda\delta}] \le e^{-\lambda\delta}\,\mathbb{E}[e^{\lambda(S_n - n\mu)}] = e^{-\lambda\delta}\prod_{i=1}^{n}\mathbb{E}[e^{\lambda(X_i - \mu)}] = e^{nG(\lambda) - \lambda\delta} \;\le\; e^{-\sup_{\lambda>0}(\lambda\delta - nG(\lambda))}$$

(the last step follows by optimising over λ > 0)


Concentration

$$\mathbb{P}[S_n - n\mu \ge \delta] \;\le\; e^{-\sup_{\lambda>0}(\lambda\delta - nG(\lambda))}$$

• Bounded r.v. $X_n \in [a, b]$: $G(\lambda) \le \frac{\lambda^2 (b-a)^2}{8}$

Hoeffding's inequality:

$$\mathbb{P}[S_n - n\mu \ge \delta] \;\le\; e^{-\frac{2\delta^2}{n(b-a)^2}}$$

• Sub-Gaussian r.v.: $G(\lambda) \le \sigma^2\lambda^2/2$

• Bernoulli r.v.: $G(\lambda) = \log\big(\mu e^{\lambda(1-\mu)} + (1-\mu)e^{-\lambda\mu}\big)$

Chernoff's inequality:

$$\mathbb{P}[S_n - n\mu \ge \delta] \;\le\; e^{-n\,\mathrm{KL}(\mu + \delta/n,\ \mu)}$$

where $\mathrm{KL}(a, b) = a\log\big(\tfrac{a}{b}\big) + (1-a)\log\big(\tfrac{1-a}{1-b}\big)$ (KL divergence)
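As an illustration (not part of the lecture), Hoeffding's bound can be compared against a Monte Carlo estimate of the tail probability; a minimal sketch, assuming Bernoulli samples, follows:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, delta = 0.5, 100, 10.0            # X_i ~ Bernoulli(0.5); deviation S_n - n*mu >= delta
samples = rng.random((200_000, n)) < mu  # 200k Monte Carlo runs of n coin flips
tail = np.mean(samples.sum(axis=1) - n * mu >= delta)

hoeffding = np.exp(-2 * delta**2 / (n * (1 - 0) ** 2))  # (b - a) = 1 for Bernoulli
print(f"empirical tail ~ {tail:.4f},  Hoeffding bound = {hoeffding:.4f}")
```

The empirical tail (around 0.03 here) is indeed below the Hoeffding bound (about 0.135).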


Algorithms

Estimating the average reward of arm a:

$$\theta_a(t) = \frac{1}{N_a(t)}\sum_{n=1}^{t} r_n(a)\,\mathbb{1}_{\{a(n) = a\}}$$

• ε-greedy. In each round t:

- with probability 1 − ε, select the best empirical arm $a^\star(t) \in \arg\max_a \theta_a(t)$

- with probability ε, select an arm uniformly at random

The algorithm has linear regret (it is not uniformly good)


Algorithms

• εt-greedy. In each round t:

- with probability 1 − εt, select the best empirical arm $a^\star(t) \in \arg\max_a \theta_a(t)$

- with probability εt, select an arm uniformly at random

The algorithm has logarithmic regret for Bernoulli rewards and $\varepsilon_t = \min(1, \frac{K}{t\delta^2})$, where $\delta = \min_{a \ne a^\star}(\theta_{a^\star} - \theta_a)$.

Sketch of proof. For $a \ne a^\star$ to be selected in round t, we need (most often) $\theta_a(t) \ge \theta_a + \delta$. The probability that this occurs is less than $\exp(-2\delta^2 N_a(t))$. But $N_a(t)$ is close to $\log(t)/\delta^2$. Summing over t yields the result.
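As an illustration of the εt-greedy rule above, here is a minimal Python sketch compatible with the hypothetical `select`/`update` interface used earlier; treating the gap δ as a known input is an assumption made for simplicity:

```python
import numpy as np

class EpsilonTGreedy:
    """eps_t-greedy with eps_t = min(1, K / (t * delta^2)); the gap delta is assumed known."""
    def __init__(self, K, delta, rng=None):
        self.K, self.delta = K, delta
        self.rng = rng or np.random.default_rng(1)
        self.counts = np.zeros(K)      # N_a(t)
        self.means = np.zeros(K)       # empirical means theta_a(t)

    def select(self, t):
        eps_t = min(1.0, self.K / (max(t, 1) * self.delta**2))
        if self.rng.random() < eps_t:
            return int(self.rng.integers(self.K))     # explore uniformly at random
        return int(np.argmax(self.means))             # exploit the empirical best arm

    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]
```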


Algorithms

Optimism in the face of uncertainty

Upper Confidence Bound (UCB) algorithm:

$$b_a(t) = \theta_a(t) + \sqrt{\frac{2\log(t)}{N_a(t)}}$$

$\theta_a(t)$: empirical reward of a up to t
$N_a(t)$: number of times a has been played up to t

In each round t, select the arm with the highest index $b_a(t)$.

Under UCB, the number of times $a \ne a^\star$ is selected satisfies:

$$\mathbb{E}[N_a(T)] \;\le\; \frac{8\log(T)}{(\theta_{a^\star} - \theta_a)^2} + \frac{\pi^2}{3}$$
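A minimal Python sketch of the UCB index above (same hypothetical `select`/`update` interface; the initial rounds play each arm once):

```python
import numpy as np

class UCB:
    """UCB index b_a(t) = theta_a(t) + sqrt(2 log t / N_a(t))."""
    def __init__(self, K):
        self.K = K
        self.counts = np.zeros(K)      # N_a(t)
        self.means = np.zeros(K)       # empirical means theta_a(t)

    def select(self, t):
        if t < self.K:
            return t                   # play each arm once to initialise the counts
        indices = self.means + np.sqrt(2 * np.log(t) / self.counts)
        return int(np.argmax(indices))

    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]
```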


Regret analysis of UCB (Auer et al. 2002)

$$N_a(T) = 1 + \sum_{t=K+1}^{T} \mathbb{1}\{a(t) = a\} \;\le\; \ell + \sum_{t=K+1}^{T} \mathbb{1}\{a(t) = a,\ N_a(t) \ge \ell\}$$

$$\le\; \ell + \sum_{t=K+1}^{T} \mathbb{1}\{N_a(t) \ge \ell\}\,\mathbb{1}\{b_{a^\star}(t-1) \le b_a(t-1)\}$$

$$\le\; \ell + \sum_{t=K+1}^{T} \mathbb{1}\Big\{\min_{s < t}\Big(\theta^\star_s + \sqrt{\tfrac{2\log(t-1)}{s}}\Big) \le \max_{\ell \le s' < t}\Big(\theta_{a,s'} + \sqrt{\tfrac{2\log(t-1)}{s'}}\Big)\Big\}$$

$$\le\; \ell + \sum_{t=K+1}^{T} \sum_{s < t,\ \ell \le s' < t} \mathbb{1}\Big\{\theta^\star_s + \sqrt{\tfrac{2\log(t-1)}{s}} \le \theta_{a,s'} + \sqrt{\tfrac{2\log(t-1)}{s'}}\Big\}$$

(Here $\theta^\star_s$ and $\theta_{a,s'}$ denote the empirical means of arms $a^\star$ and a after s and s' plays, respectively.)

Regret analysis of UCB (Auer et al. 2002)

$\theta^\star_s + \sqrt{\frac{2\log(t-1)}{s}} \le \theta_{a,s'} + \sqrt{\frac{2\log(t-1)}{s'}}$ implies:

$$A : \theta^\star_s \le \theta^\star - \sqrt{\tfrac{2\log(t-1)}{s}}, \quad \text{or} \quad B : \theta_{a,s'} \ge \theta_a + \sqrt{\tfrac{2\log(t-1)}{s'}}, \quad \text{or} \quad C : \theta^\star < \theta_a + 2\sqrt{\tfrac{2\log(t-1)}{s'}}$$

Hoeffding's inequality yields $\mathbb{P}[A] \le t^{-4}$ and $\mathbb{P}[B] \le t^{-4}$.

For $\ell = 8\log(T)/(\theta^\star - \theta_a)^2$ and $s' \ge \ell$, C does not happen. We conclude by:

$$\mathbb{E}[N_a(T)] \;\le\; \frac{8\log(T)}{(\theta^\star - \theta_a)^2} + \sum_{t \ge 1}\sum_{s, s' = 1}^{t} 2t^{-4} \;\le\; \frac{8\log(T)}{(\theta^\star - \theta_a)^2} + \frac{\pi^2}{3}$$

Algorithms

KL-UCB algorithm:

$$b_a(t) = \max\{q \le 1 : N_a(t)\,\mathrm{KL}(\theta_a(t), q) \le f(t)\}$$

where $f(t) = \log(t) + 3\log\log(t)$ is the confidence level.

In each round t, select the arm with the highest index $b_a(t)$.

Under KL-UCB, the number of times $a \ne a^\star$ is selected satisfies: for all $\delta < \theta_{a^\star} - \theta_a$,

$$\mathbb{E}[N_a(T)] \;\le\; \frac{\log(T)}{\mathrm{KL}(\theta_a + \delta, \theta_{a^\star})} + C\log\log(T) + \delta^{-2}$$
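The KL-UCB index has no closed form, but since q ↦ KL(θ, q) is increasing on [θ, 1], it can be computed by bisection. A small sketch under that observation (function names are illustrative):

```python
import math

def kl_bernoulli(a, b, eps=1e-12):
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def kl_ucb_index(theta_hat, n_pulls, t, iters=30):
    """Largest q <= 1 such that n_pulls * KL(theta_hat, q) <= log(t) + 3 log log(t)."""
    f_t = math.log(t) + 3 * math.log(math.log(t)) if t >= 3 else 0.0
    lo, hi = theta_hat, 1.0
    for _ in range(iters):             # bisection on the increasing map q -> KL(theta_hat, q)
        mid = (lo + hi) / 2
        if n_pulls * kl_bernoulli(theta_hat, mid) <= f_t:
            lo = mid
        else:
            hi = mid
    return lo
```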


Lecture 5: Outline

1. Classifying bandit problems

2. Regret lower bounds

3. Algorithms based on the "optimism in the face of uncertainty" principle

4. Thompson Sampling algorithm

5. Structured bandits

Algorithms

Bayesian framework: put a prior distribution on the parameters θ.

Example: Bernoulli rewards with a uniform prior on [0, 1]; we observed p successes ('1') and q failures ('0'). Then $\theta \sim \mathrm{Beta}(p+1, q+1)$, i.e., the posterior density is proportional to $\theta^p(1-\theta)^q$.

Thompson Sampling algorithm: Assume that at round t, arm a has had $p_a(t)$ successes and $q_a(t)$ failures. Sample $b_a(t) \sim \mathrm{Beta}(p_a(t)+1, q_a(t)+1)$. The algorithm selects the arm a with the highest $b_a(t)$.

Under Thompson Sampling, for any suboptimal arm a, we have:

$$\limsup_{T\to\infty} \frac{\mathbb{E}[N_a(T)]}{\log(T)} \;=\; \frac{1}{\mathrm{KL}(\theta_a, \theta_{a^\star})}$$
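A minimal Python sketch of Beta-Bernoulli Thompson Sampling as described above (class name and interface are illustrative):

```python
import numpy as np

class ThompsonSampling:
    """Beta-Bernoulli Thompson Sampling: sample b_a ~ Beta(p_a + 1, q_a + 1), play the argmax."""
    def __init__(self, K, rng=None):
        self.successes = np.zeros(K)   # p_a(t)
        self.failures = np.zeros(K)    # q_a(t)
        self.rng = rng or np.random.default_rng(2)

    def select(self, t):
        samples = self.rng.beta(self.successes + 1, self.failures + 1)
        return int(np.argmax(samples))

    def update(self, a, r):
        if r > 0:
            self.successes[a] += 1
        else:
            self.failures[a] += 1
```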


Illustration: UCB vs. KL-UCB

Illustration: Thompson Sampling

Performance

Lecture 5: Outline

1. Classifying bandit problems

2. Regret lower bounds

3. Algorithms based on the "optimism in the face of uncertainty" principle

4. Thompson Sampling algorithm

5. Structured bandits

Structured bandits

• Unstructured bandits: best possible regret ∼ K log(T)

• How to exploit a known structure to speed up the learning process?

• Structure: the decision maker knows that θ ∈ Θ, where Θ is not a hyper-rectangle

• Examples: the average reward is a (convex, unimodal, Lipschitz, ...) function of the arm:

$$\Theta = \{\theta : a \mapsto \theta_a \text{ unimodal}\}$$


Regret lower bound

Theorem (Application of Graves-Lai 1997)

For any uniformly good algorithm π:

$$\liminf_{T\to\infty} \frac{R^\pi(T)}{\log(T)} \;\ge\; c(\theta)$$

where $c(\theta)$ is the minimal value of:

$$\inf_{n_a,\ a \ne a^\star} \ \sum_{a \ne a^\star} n_a(\theta_{a^\star} - \theta_a) \quad \text{s.t.} \quad \inf_{\lambda \in B(\theta)} \sum_{a \ne a^\star} n_a\,\mathrm{KL}(\theta_a, \lambda_a) \ge 1,$$

and $B(\theta) = \{\lambda \in \Theta : a^\star \text{ not optimal under } \lambda,\ \lambda_{a^\star} = \theta_{a^\star}\}$.



Graphically unimodal bandits

Arms = vertices of a known graph G

The average rewards are G-unimodal: from any vertex, there is a path in the graph to the best arm along which the average rewards increase. Notation: θ ∈ Θ_G

G-unimodal bandits: Regret lower bound

N (k): set of neighbours of k in G

Theorem. For any uniformly good algorithm π:

$$\liminf_{T\to\infty} \frac{R^\pi(T)}{\log(T)} \;\ge\; c_G(\theta) = \sum_{k \in \mathcal{N}(a^\star)} \frac{\theta_{a^\star} - \theta_k}{\mathrm{KL}(\theta_k, \theta_{a^\star})}$$

G-unimodal bandits: Optimal algorithm

Defined through the maximum degree γ of G, the empirical means $\theta_a(t)$, the leader L(t), the number of times $l_a(t)$ each arm has been the leader, and KL-UCB indexes:

$$b_a(t) = \sup\{q : N_a(t)\,\mathrm{KL}(\theta_a(t), q) \le \log(l_{L(t)}(t))\}$$

Algorithm. OAS (Optimal Action Sampling)

1. For t = 1, . . . , K, select action a(t) = t (play each arm once)

2. For t ≥ K + 1, select action

$$a(t) = \begin{cases} L(t) & \text{if } (l_{L(t)}(t) - 1)/(\gamma + 1) \in \mathbb{N} \\ \arg\max_{k \in \mathcal{N}(L(t))} b_k(t) & \text{otherwise} \end{cases}$$
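A rough Python sketch of the OAS rule as stated above (the graph is passed as a hypothetical `neighbours` dictionary, and the KL index is computed by bisection as for KL-UCB; this is an illustrative reading of the slide, not the reference implementation):

```python
import math
import numpy as np

def kl_bernoulli(a, b, eps=1e-12):
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def kl_index(theta_hat, n, level, iters=30):
    lo, hi = theta_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if n * kl_bernoulli(theta_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

class OAS:
    """Sketch of the OAS rule above; neighbours maps each vertex to its list of neighbours in G."""
    def __init__(self, neighbours):
        self.nbrs = neighbours
        K = len(neighbours)
        self.gamma = max(len(v) for v in neighbours.values())   # maximum degree of G
        self.counts = np.zeros(K)
        self.means = np.zeros(K)
        self.leader_counts = np.zeros(K)                        # l_a(t)

    def select(self, t):
        K = len(self.means)
        if t < K:
            return t                                            # play each arm once
        leader = int(np.argmax(self.means))
        self.leader_counts[leader] += 1
        l_L = self.leader_counts[leader]
        if (l_L - 1) % (self.gamma + 1) == 0:
            return leader                                       # exploit the leader
        level = math.log(max(l_L, 2))
        cands = self.nbrs[leader]
        idx = [kl_index(self.means[k], self.counts[k], level) for k in cands]
        return int(cands[int(np.argmax(idx))])                  # KL-UCB among the leader's neighbours

    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]
```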


G-unimodal bandits: OAS optimality

Theorem. For any θ ∈ Θ_G,

$$\limsup_{T\to\infty} \frac{R^{\mathrm{OAS}}(T)}{\log(T)} \;\le\; c_G(\theta)$$

Rate adaptation in 802.11

Goal: adapt the modulation scheme to the channel quality

802.11 a/b/g

Select the rate only

  rate                  r1   r2   . . .   rK
  success probability   µ1   µ2   . . .   µK
  throughput            θ1   θ2   . . .   θK   (θk = rk µk)

Structure: unimodality of k ↦ θk, and µ1 ≥ µ2 ≥ . . . ≥ µK

Rate adaptation in 802.11

802.11 n/ac

Select the rate and a MIMO mode

Structure: example with two modes, unimodality w.r.t. G

802.11g – stationary channels

Smooth throughput decay w.r.t. rate

802.11g – stationary channels

Steep throughput decay w.r.t. rate

802.11g – nonstationary channels

Traces

802.11g – nonstationary channels

Performance of OAS with sliding window

References

Discrete unstructured bandits

• Thompson, On the likelihood that one unknown probability exceeds

another in view of the evidence of two samples, 1933

• Robbins, Some aspects of the sequential design of experiments, 1952

• Lai and Robbins, Asymptotically efficient adaptive allocation rules, 1985

• Lai, Adaptive treatment allocation and the multi-armed bandit problem, 1987

• Gittins, Bandit Processes and Dynamic Allocation Indices, 1989

• Auer, Cesa-Bianchi and Fischer, Finite-time analysis of the multiarmed bandit problem, 2002

• Garivier and Moulines, On upper-confidence bound policies for non-stationary bandit problems, 2008

• Slivkins and Upfal, Adapting to a changing environment: the Brownian restless bandits, 2008

References

• Garivier and Cappé, The KL-UCB algorithm for bounded stochastic bandits and beyond, 2011

• Honda and Takemura, An Asymptotically Optimal Bandit Algorithm

for Bounded Support Models, 2010

Discrete structured bandits

• Anantharam, Varaiya, and Walrand, Asymptotically efficient

allocation rules for the multiarmed bandit problem with multiple

plays, 1987

• Graves and Lai, Asymptotically efficient adaptive choice of control laws in controlled Markov chains, 1997

• György, Linder, Lugosi and Ottucsák, The on-line shortest path problem under partial monitoring, 2007

• Yu and Mannor, Unimodal bandits, 2011

• Cesa-Bianchi and Lugosi, Combinatorial bandits, 2012

References

• Chen, Wang and Yuan, Combinatorial multi-armed bandit: General framework and applications, 2013

• Combes and Proutiere, Unimodal bandits: Regret lower bounds and

optimal algorithms, 2014

• Magureanu, Combes and Proutiere, Lipschitz bandits: Regret lower bounds and optimal algorithms, 2014

Thompson sampling

• Chapelle and Li, An Empirical Evaluation of Thompson Sampling,

2011

• Kaufmann, Korda and Munos, Thompson Sampling: an asymptotically optimal finite-time analysis, 2012

• Korda, Kaufmann and Munos, Thompson Sampling for

one-dimensional exponential family bandits, 2013

• Agrawal and Goyal, Further optimal regret bounds for Thompson

Sampling, 2013.


Lecture 5: Appendix

Adversarial bandits


Adversarial Optimisation

• Finite set of actions A = {1, . . . ,K}

• Unknown and arbitrary rewards of action a ∈ A: (rt(a), t ≥ 0)

decided by an adversary at time 0

• Best empirical action $a^\star \in \arg\max_a \sum_{t=1}^{T} r_t(a)$

• Online policy π: select action $a^\pi_t$ at time t depending on:

- Expert setting: $(r_1(a), \ldots, r_{t-1}(a))_{a=1,\ldots,K}$

- Bandit setting: $a^\pi_1, r_1(a^\pi_1), \ldots, a^\pi_{t-1}, r_{t-1}(a^\pi_{t-1})$

• Regret up to time T: $R^\pi(T) = \sum_{t=1}^{T} r_t(a^\star) - \mathbb{E}\left[\sum_{t=1}^{T} r_t(a^\pi_t)\right]$


Expert Setting

• Let $S_t(a) = \sum_{n=1}^{t} r_n(a)$ be the cumulative reward of a

• Multiplicative update algorithm (Littlestone-Warmuth 1994)

Select arm a with probability

$$p_t(a) = \frac{e^{\eta S_{t-1}(a)}}{\sum_b e^{\eta S_{t-1}(b)}}$$

• Regret: the multiplicative update algorithm π is a no-regret algorithm:

$$\forall T, \quad R^\pi(T) \;\le\; \frac{T\eta}{8} + \frac{\log(K)}{\eta}$$

For $\eta = \sqrt{8\log(K)/T}$, $R^\pi(T) \le \sqrt{\frac{T\log(K)}{2}}$
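A minimal Python sketch of the multiplicative update in the expert (full-information) setting; the function name and the assumption that rewards lie in [0, 1] are illustrative:

```python
import numpy as np

def exponential_weights(reward_matrix, eta):
    """Full-information multiplicative update: p_t(a) proportional to exp(eta * S_{t-1}(a)).

    reward_matrix: array of shape (T, K) with rewards in [0, 1], revealed after each round.
    Returns the sequence of chosen arms.
    """
    rng = np.random.default_rng(3)
    T, K = reward_matrix.shape
    S = np.zeros(K)                          # cumulative rewards S_{t-1}(a)
    choices = np.zeros(T, dtype=int)
    for t in range(T):
        w = np.exp(eta * (S - S.max()))      # subtract the max for numerical stability
        p = w / w.sum()
        choices[t] = rng.choice(K, p=p)
        S += reward_matrix[t]                # all rewards observed (expert setting)
    return choices
```

With η = √(8 log(K)/T) this matches the tuning in the bound above.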


Bandit Setting

• Let $S_t(a) = \sum_{n=1}^{t} r_n(a)$ be the cumulative reward of a

• Building an unbiased estimator of $S_t(a)$: $\hat S_t(a) = \sum_{n=1}^{t} \hat X_n(a)$ where

$$\hat X_t(a) = \mathbb{1}_{\{a^\pi_t = a\}}\,\frac{r_t(a)}{p_t(a)}$$

• Multiplicative update algorithm (Littlestone-Warmuth 1994)

Select arm a with probability

$$p_t(a) = \frac{e^{\eta \hat S_{t-1}(a)}}{\sum_b e^{\eta \hat S_{t-1}(b)}}$$

• Regret: the multiplicative update algorithm π is a no-regret algorithm:

$$\forall T, \quad R^\pi(T) \;\le\; \frac{TK\eta}{2} + \frac{\log(K)}{\eta}$$

For $\eta = \sqrt{2\log(K)/(KT)}$, $R^\pi(T) \le \sqrt{2K\log(K)T}$
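The same update run on the importance-weighted estimates $\hat X_t(a)$ gives a bandit (Exp3-style) sketch; only the chosen arm's reward is used each round (function name illustrative):

```python
import numpy as np

def exp3_style(bandit_rewards, eta):
    """Bandit multiplicative update with hat{X}_t(a) = 1{a_t = a} * r_t(a) / p_t(a).

    bandit_rewards: array of shape (T, K); only the entry of the played arm is used each round.
    """
    rng = np.random.default_rng(4)
    T, K = bandit_rewards.shape
    S_hat = np.zeros(K)                        # estimated cumulative rewards hat{S}_{t-1}(a)
    choices = np.zeros(T, dtype=int)
    for t in range(T):
        w = np.exp(eta * (S_hat - S_hat.max()))
        p = w / w.sum()
        a = rng.choice(K, p=p)
        choices[t] = a
        S_hat[a] += bandit_rewards[t, a] / p[a]    # unbiased importance-weighted estimate
    return choices
```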


Adversarial Convex Bandits

At the beginning of each year, Volvo has to select a vector x (in a convex

set) representing the relative efforts in producing various models (S60,

V70, V90, . . .). The reward is an arbitrarily varying and unknown

concave function of x. How to maximise reward over say 50 years?


Adversarial Convex Bandits

• Continuous set of actions A = [0, 1]

• (Unknown) arbitrary but concave rewards of action x ∈ A: $r_t(x)$

• Online policy π: select action $x^\pi_t$ at time t depending on $x^\pi_1, r_1(x^\pi_1), \ldots, x^\pi_{t-1}, r_{t-1}(x^\pi_{t-1})$

• Regret up to time T (defined w.r.t. the best empirical action up to time T):

$$R^\pi(T) = \max_{x \in [0,1]} \sum_{t=1}^{T} r_t(x) - \sum_{t=1}^{T} r_t(x^\pi_t)$$

Can we do something smart at all? Achieve a sublinear regret?


Adversarial Convex Bandits

• If $r_t(\cdot) = r(\cdot)$ for all t, and if $r(\cdot)$ were known, we could apply a gradient ascent algorithm

• One-point gradient estimate: define the smoothed function $\hat f(x) = \mathbb{E}_{v \sim U(B)}[f(x + \delta v)]$, with $B = \{x : \|x\|_2 \le 1\}$. Then

$$\mathbb{E}_{u \sim U(S)}[f(x + \delta u)\,u] = \delta\,\nabla \hat f(x), \qquad S = \{x : \|x\|_2 = 1\}$$

so a single function evaluation at a randomly perturbed point gives an unbiased estimate of (δ times) the gradient of the smoothed function.

• Simulated Gradient Ascent algorithm: at each step t, do

- draw $u_t$ uniformly on S

- play $y_t = x_t + \delta u_t$

- update $x_{t+1} = x_t + \alpha\, r_t(y_t)\, u_t$

• Regret: $R(T) = O(T^{5/6})$
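A minimal Python sketch of the simulated gradient ascent above on A = [0, 1] (where the unit sphere is {−1, +1}); the clipping to the feasible set and the step-size choices are assumptions made for illustration:

```python
import numpy as np

def one_point_gradient_ascent(reward_fns, delta=0.05, alpha=0.01):
    """Bandit gradient ascent on [0, 1] with a one-point gradient estimate.

    reward_fns: list of T concave reward functions r_t; only r_t(y_t) is observed.
    """
    rng = np.random.default_rng(5)
    x, total = 0.5, 0.0
    for r_t in reward_fns:
        u = rng.choice([-1.0, 1.0])                  # uniform on the 1-d unit sphere
        y = np.clip(x + delta * u, 0.0, 1.0)         # played (perturbed) action
        reward = r_t(y)
        total += reward
        # Gradient step using reward * u as a (noisy) gradient estimate; stay feasible
        x = float(np.clip(x + alpha * reward * u, delta, 1.0 - delta))
    return x, total

# Example usage with a fixed concave reward r(x) = 1 - (x - 0.7)**2
x_final, _ = one_point_gradient_ascent([lambda x: 1 - (x - 0.7) ** 2] * 5000)
print(x_final)   # drifts toward the maximiser 0.7 (noisily: the one-point estimate has high variance)
```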
