The Epoch-Greedy Algorithm for
Contextual Multi-armed Bandits
John Langford and Tong Zhang
Presentation by Terry Lam
02/2011
Outline
• The Contextual Bandit Problem
• Prior Works
• The Epoch Greedy Algorithm
• Analysis
Standard k-armed bandits problem
• The world chooses k rewards
– r1, r2, …, rk ∈ [0, 1]
• The player chooses an arm a ∈ {1, 2, … k}
– Without knowledge of the world’s chosen reward
• The player observes the reward ra
– Only reward of pulled arm observed
Contextual Bandits
• The player observes context information x
• The world chooses k rewards
– r1, r2, …, rk ∈ [0, 1]
• The player chooses an arm a ∈ {1, 2, …, k}
– Without knowledge of the world’s chosen reward
• The player observes the reward ra
– Only reward of pulled arm observed (see the protocol sketch below)
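To make the interaction protocol concrete, here is a minimal Python sketch of the repeated game; the `env`/`player` interface (`draw`, `choose`, `observe`) is illustrative, not from the paper.

```python
def run(env, player, T):
    """Contextual bandit protocol: each round the world draws
    (x, r_1, ..., r_k) ~ P, but only the pulled arm's reward is revealed."""
    total = 0.0
    for t in range(T):
        x, rewards = env.draw()            # sample (x, r) ~ P; rewards stay hidden
        a = player.choose(x)               # the player only sees the context x
        total += rewards[a]                # observe r_a and nothing else
        player.observe(x, a, rewards[a])   # log (x_t, a_t, r_{a_t, t})
    return total
```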
Why Contextual Bandits?
• Context information is common in practice
• For example, matching ads to web pages
– Bandit arms = ads
– Context information = web page contents, visitor profile
– Reward = revenue from clicked ads
– Goal: show relevant ads on each page to maximize the expected revenue
Definitions
• Contextual bandit problem
– Distribution P over pairs (x, r), i.e. (x, r) ∼ P
• x: context
• r = (r1, …, rk): reward vector
• a ∈ {1, 2, …, k} is the arm to be pulled
• ra ∈ [0, 1] is the reward for arm a
– Repeated game
• At each round, a sample (x, r1, r2, …, rk) is drawn from P
• The context x is announced
• The player chooses an arm a
• Only reward ra is revealed
Definitions
• Contextual bandit algorithm B
– At time step t, decide which arm a ∈ {1, 2, …, k} to pull
– Known information
• Current context xt
• Previous observations $(x_1, a_1, r_{a_1,1}), \ldots, (x_{t-1}, a_{t-1}, r_{a_{t-1},t-1})$
• Goal: maximize expected total reward
Definitions
• A hypothesis h maps a context x to an arm a
– h : X → {1, 2, …, k}
– Expected reward
$R(h) = \mathbb{E}_{(x, \vec{r}) \sim P}\,[\, r_{h(x)} \,]$
– Expected regret of B w.r.t. h up to time T
• Hypothesis space H
– Set of all hypotheses h
– Expected regret of B w.r.t. H up to time T (written out below)
• B is to compete with the best hypothesis in H!
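The slide's regret formulas did not survive extraction; one consistent way to write them, following the definitions above (a reconstruction, not a verbatim quote of the paper):

```latex
% Expected regret of algorithm B w.r.t. a fixed hypothesis h, up to time T:
\Delta R(B, h, T) = T \cdot R(h) - \mathbb{E}\Big[\sum_{t=1}^{T} r_{a_t, t}\Big]
% Expected regret w.r.t. the hypothesis space H: compete with the best h in H
\Delta R(B, H, T) = \sup_{h \in H} \Delta R(B, h, T)
```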
Outline
• The Contextual Bandit Problem
• Prior Works
• The Epoch Greedy Algorithm
• Analysis
Prior works
• EXP3 (Auer et al., 1995)
– Standard multi-armed bandits
– Context information is lost
– Set weight wi(1) = 1 for each arm i
– For arm i at time t:
$p_i(t) = (1 - \gamma)\,\frac{w_i(t)}{\sum_{j=1}^{k} w_j(t)} + \frac{\gamma}{k}$
– Draw $i_t$ according to $p_1(t), p_2(t), \ldots, p_k(t)$
– Receive reward $x_{i_t}(t) \in [0, 1]$
– For i = 1, …, k, update
$\hat{x}_i(t) = \frac{x_i(t)}{p_i(t)}\,\mathbb{1}(i = i_t), \qquad w_i(t+1) = w_i(t)\, e^{\gamma \hat{x}_i(t)/k}$
– Regret bound versus the best arm: $O(\sqrt{T\,k \ln k})$ for tuned γ (see the sketch below)
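A compact Python sketch of the EXP3 loop above; `pull` is an illustrative reward callback, and the tuning of γ is left to the caller:

```python
import math
import random

def exp3(T, k, gamma, pull):
    """EXP3 sketch (Auer et al.). `pull(t, i)` is an illustrative callback
    returning the reward x_i(t) in [0, 1] of the arm actually played."""
    w = [1.0] * k                                      # initial weights w_i(1) = 1
    for t in range(T):
        total = sum(w)
        p = [(1 - gamma) * wi / total + gamma / k for wi in w]
        i_t = random.choices(range(k), weights=p)[0]   # draw i_t ~ p(t)
        x = pull(t, i_t)                               # only x_{i_t}(t) is observed
        x_hat = x / p[i_t]                             # importance-weighted estimate
        w[i_t] *= math.exp(gamma * x_hat / k)          # exponential weight update
    return w
```

For instance, `exp3(1000, 5, 0.1, lambda t, i: random.random())` runs it against i.i.d. uniform rewards.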
Prior works
• EXP4 (Auer et al., 1995)
– Combines the advice of m experts
– At time t:
• Get advice vectors $\xi^1(t), \xi^2(t), \ldots, \xi^m(t)$
– Each expert advises a distribution over the arms
• Weight $w_j(t)$ for expert j. Let $W_t = \sum_{j=1}^{m} w_j(t)$
• Combine into the final distribution over arms. For arm i:
$p_i(t) = (1 - \gamma) \sum_{j=1}^{m} \frac{w_j(t)\, \xi^j_i(t)}{W_t} + \frac{\gamma}{k}$
– Still has the exploration parameter γ
Prior works
• EXP4 (cont.)
– At time t:
• Draw $i_t$ according to $p_1(t), p_2(t), \ldots, p_k(t)$
• Receive reward $x_{i_t}(t) \in [0, 1]$
• For arm i = 1, …, k: $\hat{x}_i(t) = \frac{x_i(t)}{p_i(t)}\,\mathbb{1}(i = i_t)$
• For expert j = 1, …, m: $\hat{y}_j(t) = \xi^j(t) \cdot \hat{x}(t), \quad w_j(t+1) = w_j(t)\, e^{\gamma \hat{y}_j(t)/k}$
– Regret w.r.t. the best expert: $O(\sqrt{T\,k \ln m})$ (sketched below)
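The same illustrative interface extends to EXP4; the `advice(t)` callback returning an m×k matrix of expert distributions is an assumption for the example:

```python
import math
import random

def exp4(T, k, m, gamma, advice, pull):
    """EXP4 sketch: exponential weights over m experts, each advising a
    distribution over k arms. `advice(t)` (an m x k list of lists) and
    `pull(t, i)` are illustrative callbacks, not the paper's notation."""
    w = [1.0] * m                                  # one weight per expert
    for t in range(T):
        xi = advice(t)                             # xi[j][i]: expert j's prob. of arm i
        W = sum(w)
        p = [(1 - gamma) * sum(w[j] * xi[j][i] for j in range(m)) / W + gamma / k
             for i in range(k)]
        i_t = random.choices(range(k), weights=p)[0]
        x = pull(t, i_t)                           # only the pulled arm's reward
        x_hat = [x / p[i] if i == i_t else 0.0 for i in range(k)]
        for j in range(m):
            y_hat = sum(xi[j][i] * x_hat[i] for i in range(k))  # estimated expert reward
            w[j] *= math.exp(gamma * y_hat / k)
    return w
```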
Epoch-Greedy properties
• No knowledge of time horizon T
• Regret bound $O(T^{2/3} \cdot \ln^{1/3} m)$
– m = |H|: size of the hypothesis space
– Each hypothesis acts as an expert
• $O(\ln T)$ with certain structure of the hypothesis space
• Reduced computational complexity
Outline
• The Contextual Bandit Problem
• Prior Works
• The Epoch Greedy Algorithm
• Analysis
Intuition: T is known
• First phase: n steps of exploration
– Random pulling of arms
• Second phase: exploitation
– Average regret for one exploitation step: ε_n
– Total regret: n + (T − n)·ε_n
– Pick n to minimize the total regret (worked out below)
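As a worked minimization (a sketch, assuming the rate ε_n ≈ √(k ln m / n) that the finite-hypothesis analysis later justifies):

```latex
% Total regret with n exploration steps followed by T - n exploitation steps:
n + (T - n)\,\epsilon_n \;\approx\; n + T\sqrt{k \ln m / n}
% Setting the derivative w.r.t. n to zero gives n^{3/2} = \tfrac{1}{2} T \sqrt{k \ln m}, so
n^{*} = \Theta\big(T^{2/3} (k \ln m)^{1/3}\big),
\qquad \text{total regret } O\big(T^{2/3} (k \ln m)^{1/3}\big)
```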
Intuition: T is unknown
• Run exploration/exploitation in epochs
• At epoch l:
– One step of exploration
– ⌈1/ε_l⌉ steps of exploitation
• Recall: after learning from l random exploration samples, ε_l is the
average regret for one exploitation step
Intuition: T is unknown
• Total regret after L epochs: at most $\sum_{l=1}^{L} \big(1 + \epsilon_l \lceil 1/\epsilon_l \rceil\big)$
• Let L_T be the epoch containing step T
• It is easy to prove that the total regret up to T is at most 3·L_T (per-epoch calculation below)
• No worse than three times the bound with known T and optimal stopping point
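The factor of three comes from a short per-epoch calculation (reconstructing the omitted step):

```latex
% Each epoch: one exploration step (regret at most 1) plus
% \lceil 1/\epsilon_l \rceil exploitation steps costing \epsilon_l each on average:
1 + \epsilon_l \lceil 1/\epsilon_l \rceil
  \;\le\; 1 + \epsilon_l (1/\epsilon_l + 1)
  \;=\; 2 + \epsilon_l \;\le\; 3
% Summing over the L_T epochs that cover step T gives regret at most 3 L_T.
```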
Algorithm Key Ideas
• Three main components
– Random explorations
• Large immediate regret
– Learning the best hypothesis from explorations
• Reduce regret for future exploitation steps
– Exploitations by following the best hypothesis
• Maximizing immediate rewards
• Run in several epochs, each epoch contains
– Exactly one step of random exploration
– Several steps of exploitation
Notations
• $Z_l = (x_l, a_l, r_{a_l, l})$: random exploration sample at epoch l
• $Z_1^l = \{Z_1, \ldots, Z_l\}$: set of all exploration samples up to epoch l
• $s(Z_1^l)$: number of exploitation steps in epoch l
– Either data-independent or data-dependent
• Empirical reward maximization estimator: $\hat{h}_l = \arg\max_{h \in H} \hat{R}(h, Z_1^l)$
Epoch-Greedy algorithm
• Epoch-Greedy (sketched in code below)
– For epoch l = 1, 2, …
• (exploration) Observe $x_l$
• Pick $a_l$ ∈ {1, 2, …, k} uniformly at random
• Receive reward $r_{a_l, l}$ ∈ [0, 1]
• (learning) Find the best hypothesis by solving $\hat{h}_l = \arg\max_{h \in H} \hat{R}(h, Z_1^l)$
• (exploitation) Repeat $s(Z_1^l)$ times
– Observe context x
– Select arm $a = \hat{h}_l(x)$
– Receive reward ra ∈ [0, 1]
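A minimal Python sketch of the loop above, assuming a finite hypothesis list `H` (callables mapping a context to an arm, with at least two hypotheses), the illustrative `env.draw()` interface from the earlier protocol sketch, and the exploitation schedule from the finite-hypothesis analysis with its constant set to 1:

```python
import math
import random

def epoch_greedy(env, H, k, num_epochs):
    """Epoch-Greedy sketch: per epoch, one uniform exploration step,
    empirical reward maximization over H, then s_l exploitation steps."""
    m = len(H)                                     # assumes m >= 2
    samples = []                                   # Z_1^l: exploration samples so far
    for l in range(1, num_epochs + 1):
        # exploration: one uniformly random pull
        x, rewards = env.draw()
        a = random.randrange(k)
        samples.append((x, a, rewards[a]))         # record (x_l, a_l, r_{a_l, l})
        # learning: maximize the importance-weighted empirical reward
        def r_hat(h):
            return sum(k * r * (h(xi) == ai) for xi, ai, r in samples) / len(samples)
        h_best = max(H, key=r_hat)
        # exploitation: s_l = floor(c * sqrt(l / (k ln m))) steps, c = 1 here
        s_l = int(math.sqrt(l / (k * math.log(m))))
        for _ in range(s_l):                       # may be 0 in early epochs
            x, rewards = env.draw()
            _ = rewards[h_best(x)]                 # only the pulled arm's reward is seen
```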
Empirical reward estimation
• Dealing with missing observations
– In context x, random pulling of a only yields ra
• Fully observed reward via importance weighting:
$\hat{r}_h(x, a, r_a) = \frac{r_a \,\mathbb{1}(h(x) = a)}{1/k}$
– i.e., unbiased under uniform exploration (derivation below)
• Reward expectation w.r.t. exploration samples: $R(h) = \mathbb{E}_{(x, \vec{r}) \sim P}[\, r_{h(x)} \,]$
• Empirical reward estimation of h ∈ H:
$\hat{R}(h, Z_1^l) = \frac{1}{l} \sum_{i=1}^{l} \frac{r_{a_i, i}\, \mathbb{1}(h(x_i) = a_i)}{1/k}$
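The unbiasedness follows from a one-line computation (added for completeness):

```latex
% During exploration a is drawn uniformly from {1, ..., k}, hence
\mathbb{E}_{a}\!\left[\frac{r_a\,\mathbb{1}(h(x) = a)}{1/k}\right]
  = \sum_{a=1}^{k} \frac{1}{k} \cdot k\, r_a\, \mathbb{1}(h(x) = a)
  = r_{h(x)}
```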
Outline
• The Contextual Bandit Problem
• Prior Works
• The Epoch Greedy Algorithm
• Analysis
Theorem
• Denote: μ_l(H, s) is the per-epoch exploitation cost, i.e. the expected regret incurred during the $s(Z_1^l)$ exploitation steps of epoch l
• For all T, n_l, L such that $T \le \sum_{l=1}^{L} n_l$,
the expected regret of the Epoch-Greedy algorithm satisfies
$\Delta R(\text{Epoch-Greedy}, H, T) \le L + \sum_{l=1}^{L} \mu_l(H, s) + T \sum_{l=1}^{L} \Pr[s(Z_1^l) < n_l]$
Theorem Proof Sketch
• For all T, n_l, L such that $T \le \sum_{l=1}^{L} n_l$:
$\Delta R(\text{Epoch-Greedy}, H, T) \le L + \sum_{l=1}^{L} \mu_l(H, s) + T \sum_{l=1}^{L} \Pr[s(Z_1^l) < n_l]$
• There are two complementary cases
– Case 1: $s(Z_1^l) \ge n_l$ for all l = 1, …, L
– Case 2: $s(Z_1^l) < n_l$ for some l = 1, …, L
Theorem Proof Sketch
• Case 1: $s(Z_1^l) \ge n_l$ for all l = 1, …, L
• Then
– T is contained within the first L epochs (since $T \le \sum_{l=1}^{L} n_l \le \sum_{l=1}^{L} s(Z_1^l)$)
– Exploitation regret in epoch l is at most μ_l(H, s)
– Exploration regret is at most 1 per epoch
– Regret contribution of Case 1: $L + \sum_{l=1}^{L} \mu_l(H, s)$
Theorem Proof Sketch
• Case 2: $s(Z_1^l) < n_l$ for some l = 1, …, L
– Regret of Case 2 is at most T
• Regret is at most 1 per step
– Probability of Case 2 is at most $\sum_{l=1}^{L} \Pr[s(Z_1^l) < n_l]$ (union bound)
– Bound for regret contribution of Case 2: $T \sum_{l=1}^{L} \Pr[s(Z_1^l) < n_l]$
• Total regret of Case 1 and Case 2:
$\Delta R(\text{Epoch-Greedy}, H, T) \le L + \sum_{l=1}^{L} \mu_l(H, s) + T \sum_{l=1}^{L} \Pr[s(Z_1^l) < n_l]$
Bound for Finite Hypothesis Space
• Denote the size of the hypothesis space m = |H| < ∞
• By Bernstein's inequality, the expected per-step regret of empirical reward maximization after l exploration samples is $O\big(\sqrt{k \ln m / l}\,\big)$
– c: some constant in the precise bound
• Recall: μ_l(H, s) is the per-epoch exploitation cost
• Pick $s(Z_1^l) = \big\lfloor c \sqrt{l / (k \ln m)} \big\rfloor$, then $\mu_l(H, s) \le 1$
Bound for Finite Hypothesis Space
• Take $n_l = \lfloor c \sqrt{l / (k \ln m)} \rfloor = s(Z_1^l)$, then $\Pr[s(Z_1^l) < n_l] = 0$
• Let $L = \lfloor c' T^{2/3} (k \ln m)^{1/3} \rfloor$ for some constant c′, then $T \le \sum_{l=1}^{L} n_l$ (arithmetic below)
• Therefore, by the theorem,
$\Delta R(\text{Epoch-Greedy}, H, T) \le L + \sum_{l=1}^{L} \mu_l(H, s) \le 2L = O\big(T^{2/3} (k \ln m)^{1/3}\big)$
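Where the $T^{2/3}$ rate comes from (a reconstruction of the arithmetic behind the choice of L):

```latex
% With n_l = \lfloor c\sqrt{l/(k \ln m)} \rfloor, L epochs cover about
\sum_{l=1}^{L} n_l \;\approx\; \frac{c}{\sqrt{k \ln m}} \sum_{l=1}^{L} \sqrt{l}
  \;\approx\; \frac{2c}{3} \cdot \frac{L^{3/2}}{\sqrt{k \ln m}}
% exploitation steps; requiring this to be at least T and solving for L gives
L = O\big(T^{2/3} (k \ln m)^{1/3}\big)
```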
Bound improvement
• Let H = {h_1, …, h_m}
• WLOG, R(h_1) ≥ R(h_2) ≥ … ≥ R(h_m)
• Suppose R(h_1) ≥ R(h_2) + Δ for some Δ > 0
– Δ: gap between the best and the second-best hypothesis
• With appropriate parameter choices,
$\Delta R(\text{Epoch-Greedy}, H, T) \le 2 \left\lceil \frac{8k(\ln m + \ln(T+1))}{c\Delta^2} \right\rceil + 1 + c'' k \Delta^{-2}$
– $s(Z_1^l)$ is data-dependent in this case
• That means the regret is O(k ln(m) + k ln(T)) for a fixed gap Δ
Conclusions
• Contextual multi-armed bandits
– Generalization of the standard multi-armed bandits
– Observable context helps decide which arms to pull
– Sample complexity for the exploration-exploitation trade-off
– Good for large hypothesis spaces or ones with special structure