The Epoch-Greedy Algorithm for
Contextual Multi-armed Bandits
John Langford and Tong Zhang
Presentation by Terry Lam
02/2011
Outline
• The Contextual Bandit Problem
• Prior Works
• The Epoch Greedy Algorithm
• Analysis
Standard k-armed bandits problem
• The world chooses k rewards
– r1, r2, …, rk ∈ [0, 1]
• The player chooses an arm a ∈ {1, 2, … k}
– Without knowledge of the world’s chosen reward
• The player observes the reward ra
– Only reward of pulled arm observed
Contextual Bandits
• The player observes context information x
• The world chooses k rewards
– r1, r2, …, rk ∈ [0, 1]
• The player chooses an arm a ∈ {1, 2, …, k}
– Without knowledge of the world’s chosen reward
• The player observes the reward ra
– Only reward of pulled arm observed (see the protocol sketch below)
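To make the interaction protocol concrete, here is a minimal Python sketch of the repeated game; the `env`/`player` interface (`draw`, `choose`, `observe`) is illustrative, not from the paper.

```python
def run(env, player, T):
    """Contextual bandit protocol: each round the world draws
    (x, r_1, ..., r_k) ~ P, but only the pulled arm's reward is revealed."""
    total = 0.0
    for t in range(T):
        x, rewards = env.draw()            # sample (x, r) ~ P; rewards stay hidden
        a = player.choose(x)               # the player only sees the context x
        total += rewards[a]                # observe r_a and nothing else
        player.observe(x, a, rewards[a])   # log (x_t, a_t, r_{a_t, t})
    return total
```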
Why Contextual Bandits?
• Context information is common in practice
• For example, matching ads to web pages
– Bandit arms = ads
– Context information = web page contents, visitor profile
– Reward = revenue from clicked ads
– Goal: show relevant ads on each page to maximize the expected revenue
Definitions
• Contextual bandit problem
– Distribution P over pairs (x, r), i.e. (x, r) ∼ P
• x: context
• r = (r1, …, rk): reward vector
• a ∈ {1, 2, …, k} is the arm to be pulled
• ra ∈ [0, 1] is the reward for arm a
– Repeated game
• At each round, a sample (x, r1, r2, …, rk) is drawn from P
• The context x is announced
• The player chooses an arm a
• Only reward ra is revealed
Definitions
• Contextual bandit algorithm B
– At time step t, decide which arm a ∈ {1, 2, …, k} to pull
– Known information
• Current context xt
• Previous observations $(x_1, a_1, r_{a_1,1}), \ldots, (x_{t-1}, a_{t-1}, r_{a_{t-1},t-1})$
• Goal: maximize expected total reward
Definitions
• A hypothesis h maps a context x to an arm a
– h : X → {1, 2, …, k}
– Expected reward
$R(h) = \mathbb{E}_{(x, \vec{r}) \sim P}\,[\, r_{h(x)} \,]$
– Expected regret of B w.r.t. h up to time T
• Hypothesis space H
– Set of all hypotheses h
– Expected regret of B w.r.t. H up to time T (written out below)
• B is to compete with the best hypothesis in H!
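The slide's regret formulas did not survive extraction; one consistent way to write them, following the definitions above (a reconstruction, not a verbatim quote of the paper):

```latex
% Expected regret of algorithm B w.r.t. a fixed hypothesis h, up to time T:
\Delta R(B, h, T) = T \cdot R(h) - \mathbb{E}\Big[\sum_{t=1}^{T} r_{a_t, t}\Big]
% Expected regret w.r.t. the hypothesis space H: compete with the best h in H
\Delta R(B, H, T) = \sup_{h \in H} \Delta R(B, h, T)
```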
Outline
• The Contextual Bandit Problem
• Prior Works
• The Epoch Greedy Algorithm
• Analysis
Prior works
• EXP3 (Auer et al., 1995)
– Standard multi-armed bandits
– Context information is lost
– Set weight wi(1) = 1 for each arm i
– For arm i at time t:
$p_i(t) = (1 - \gamma)\,\frac{w_i(t)}{\sum_{j=1}^{k} w_j(t)} + \frac{\gamma}{k}$
– Draw $i_t$ according to $p_1(t), p_2(t), \ldots, p_k(t)$
– Receive reward $x_{i_t}(t) \in [0, 1]$
– For i = 1, …, k, update
$\hat{x}_i(t) = \frac{x_i(t)}{p_i(t)}\,\mathbb{1}(i = i_t), \qquad w_i(t+1) = w_i(t)\, e^{\gamma \hat{x}_i(t)/k}$
– Regret bound versus the best arm: $O(\sqrt{T\,k \ln k})$ for tuned γ (see the sketch below)
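A compact Python sketch of the EXP3 loop above; `pull` is an illustrative reward callback, and the tuning of γ is left to the caller:

```python
import math
import random

def exp3(T, k, gamma, pull):
    """EXP3 sketch (Auer et al.). `pull(t, i)` is an illustrative callback
    returning the reward x_i(t) in [0, 1] of the arm actually played."""
    w = [1.0] * k                                      # initial weights w_i(1) = 1
    for t in range(T):
        total = sum(w)
        p = [(1 - gamma) * wi / total + gamma / k for wi in w]
        i_t = random.choices(range(k), weights=p)[0]   # draw i_t ~ p(t)
        x = pull(t, i_t)                               # only x_{i_t}(t) is observed
        x_hat = x / p[i_t]                             # importance-weighted estimate
        w[i_t] *= math.exp(gamma * x_hat / k)          # exponential weight update
    return w
```

For instance, `exp3(1000, 5, 0.1, lambda t, i: random.random())` runs it against i.i.d. uniform rewards.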
Prior works
• EXP4 (Auer et al., 1995)
– Combines the advice of m experts
– At time t:
• Get advice vectors $\xi^1(t), \xi^2(t), \ldots, \xi^m(t)$
– Each expert advises a distribution over the arms
• Weight $w_j(t)$ for expert j. Let $W_t = \sum_{j=1}^{m} w_j(t)$
• Combine into the final distribution over arms. For arm i:
$p_i(t) = (1 - \gamma) \sum_{j=1}^{m} \frac{w_j(t)\, \xi^j_i(t)}{W_t} + \frac{\gamma}{k}$
– Still has the exploration parameter γ
Prior works
• EXP4 (cont.)
– At time t:
• Draw $i_t$ according to $p_1(t), p_2(t), \ldots, p_k(t)$
• Receive reward $x_{i_t}(t) \in [0, 1]$
• For arm i = 1, …, k: $\hat{x}_i(t) = \frac{x_i(t)}{p_i(t)}\,\mathbb{1}(i = i_t)$
• For expert j = 1, …, m: $\hat{y}_j(t) = \xi^j(t) \cdot \hat{x}(t), \quad w_j(t+1) = w_j(t)\, e^{\gamma \hat{y}_j(t)/k}$
– Regret w.r.t. the best expert: $O(\sqrt{T\,k \ln m})$ (sketched below)
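The same illustrative interface extends to EXP4; the `advice(t)` callback returning an m×k matrix of expert distributions is an assumption for the example:

```python
import math
import random

def exp4(T, k, m, gamma, advice, pull):
    """EXP4 sketch: exponential weights over m experts, each advising a
    distribution over k arms. `advice(t)` (an m x k list of lists) and
    `pull(t, i)` are illustrative callbacks, not the paper's notation."""
    w = [1.0] * m                                  # one weight per expert
    for t in range(T):
        xi = advice(t)                             # xi[j][i]: expert j's prob. of arm i
        W = sum(w)
        p = [(1 - gamma) * sum(w[j] * xi[j][i] for j in range(m)) / W + gamma / k
             for i in range(k)]
        i_t = random.choices(range(k), weights=p)[0]
        x = pull(t, i_t)                           # only the pulled arm's reward
        x_hat = [x / p[i] if i == i_t else 0.0 for i in range(k)]
        for j in range(m):
            y_hat = sum(xi[j][i] * x_hat[i] for i in range(k))  # estimated expert reward
            w[j] *= math.exp(gamma * y_hat / k)
    return w
```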
Epoch-Greedy properties
• No knowledge of time horizon T
• Regret bound $O(T^{2/3} \cdot \ln^{1/3} m)$
– m = |H|: size of the hypothesis space
– Each hypothesis acts as an expert
• $O(\ln T)$ with certain structure of the hypothesis space
• Reduced computational complexity
Outline
• The Contextual Bandit Problem
• Prior Works
• The Epoch Greedy Algorithm
• Analysis
Intuition: T is known
• First phase: n steps of exploration
– Random pulling of arms
• Second phase: exploitation
– Average regret for one exploitation step: ε_n
– Total regret: n + (T − n)·ε_n
– Pick n to minimize the total regret (worked out below)
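As a worked minimization (a sketch, assuming the rate ε_n ≈ √(k ln m / n) that the finite-hypothesis analysis later justifies):

```latex
% Total regret with n exploration steps followed by T - n exploitation steps:
n + (T - n)\,\epsilon_n \;\approx\; n + T\sqrt{k \ln m / n}
% Setting the derivative w.r.t. n to zero gives n^{3/2} = \tfrac{1}{2} T \sqrt{k \ln m}, so
n^{*} = \Theta\big(T^{2/3} (k \ln m)^{1/3}\big),
\qquad \text{total regret } O\big(T^{2/3} (k \ln m)^{1/3}\big)
```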
Intuition: T is unknown
• Run exploration/exploitation in epochs
• At epoch l:
– One step of exploration
– ⌈1/ε_l⌉ steps of exploitation
• Recall: after learning from l random exploration samples, ε_l is the
average regret for one exploitation step
Intuition: T is unknown
• Total regret after L epochs: at most $\sum_{l=1}^{L} \big(1 + \epsilon_l \lceil 1/\epsilon_l \rceil\big)$
• Let L_T be the epoch containing step T
• It is easy to prove that the total regret up to T is at most 3·L_T (per-epoch calculation below)
• No worse than three times the bound with known T and optimal stopping point
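The factor of three comes from a short per-epoch calculation (reconstructing the omitted step):

```latex
% Each epoch: one exploration step (regret at most 1) plus
% \lceil 1/\epsilon_l \rceil exploitation steps costing \epsilon_l each on average:
1 + \epsilon_l \lceil 1/\epsilon_l \rceil
  \;\le\; 1 + \epsilon_l (1/\epsilon_l + 1)
  \;=\; 2 + \epsilon_l \;\le\; 3
% Summing over the L_T epochs that cover step T gives regret at most 3 L_T.
```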
Algorithm Key Ideas
• Three main components
– Random explorations
• Large immediate regret
– Learning the best hypothesis from explorations
• Reduce regret for future exploitation steps
– Exploitations by following the best hypothesis
• Maximizing immediate rewards
• Run in several epochs, each epoch contains
– Exactly one step of random exploration
– Several steps of exploitation
Notations
• $Z_l = (x_l, a_l, r_{a_l, l})$: random exploration sample at epoch l
• $Z_1^l = \{Z_1, \ldots, Z_l\}$: set of all exploration samples up to epoch l
• $s(Z_1^l)$: number of exploitation steps in epoch l
– Either data-independent or data-dependent
• Empirical reward maximization estimator: $\hat{h}_l = \arg\max_{h \in H} \hat{R}(h, Z_1^l)$
Epoch-Greedy algorithm
• Epoch-Greedy (sketched in code below)
– For epoch l = 1, 2, …
• (exploration) Observe $x_l$
• Pick $a_l$ ∈ {1, 2, …, k} uniformly at random
• Receive reward $r_{a_l, l}$ ∈ [0, 1]
• (learning) Find the best hypothesis by solving $\hat{h}_l = \arg\max_{h \in H} \hat{R}(h, Z_1^l)$
• (exploitation) Repeat $s(Z_1^l)$ times
– Observe context x
– Select arm $a = \hat{h}_l(x)$
– Receive reward ra ∈ [0, 1]
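A minimal Python sketch of the loop above, assuming a finite hypothesis list `H` (callables mapping a context to an arm, with at least two hypotheses), the illustrative `env.draw()` interface from the earlier protocol sketch, and the exploitation schedule from the finite-hypothesis analysis with its constant set to 1:

```python
import math
import random

def epoch_greedy(env, H, k, num_epochs):
    """Epoch-Greedy sketch: per epoch, one uniform exploration step,
    empirical reward maximization over H, then s_l exploitation steps."""
    m = len(H)                                     # assumes m >= 2
    samples = []                                   # Z_1^l: exploration samples so far
    for l in range(1, num_epochs + 1):
        # exploration: one uniformly random pull
        x, rewards = env.draw()
        a = random.randrange(k)
        samples.append((x, a, rewards[a]))         # record (x_l, a_l, r_{a_l, l})
        # learning: maximize the importance-weighted empirical reward
        def r_hat(h):
            return sum(k * r * (h(xi) == ai) for xi, ai, r in samples) / len(samples)
        h_best = max(H, key=r_hat)
        # exploitation: s_l = floor(c * sqrt(l / (k ln m))) steps, c = 1 here
        s_l = int(math.sqrt(l / (k * math.log(m))))
        for _ in range(s_l):                       # may be 0 in early epochs
            x, rewards = env.draw()
            _ = rewards[h_best(x)]                 # only the pulled arm's reward is seen
```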
Empirical reward estimation
• Dealing with missing observations
– In context x, random pulling of a only yields ra
• Fully observed reward via importance weighting:
$\hat{r}_h(x, a, r_a) = \frac{r_a \,\mathbb{1}(h(x) = a)}{1/k}$
– i.e., unbiased under uniform exploration (derivation below)
• Reward expectation w.r.t. exploration samples: $R(h) = \mathbb{E}_{(x, \vec{r}) \sim P}[\, r_{h(x)} \,]$
• Empirical reward estimation of h ∈ H:
$\hat{R}(h, Z_1^l) = \frac{1}{l} \sum_{i=1}^{l} \frac{r_{a_i, i}\, \mathbb{1}(h(x_i) = a_i)}{1/k}$
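The unbiasedness follows from a one-line computation (added for completeness):

```latex
% During exploration a is drawn uniformly from {1, ..., k}, hence
\mathbb{E}_{a}\!\left[\frac{r_a\,\mathbb{1}(h(x) = a)}{1/k}\right]
  = \sum_{a=1}^{k} \frac{1}{k} \cdot k\, r_a\, \mathbb{1}(h(x) = a)
  = r_{h(x)}
```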
Outline
• The Contextual Bandit Problem
• Prior Works
• The Epoch Greedy Algorithm
• Analysis
Theorem
• Denote: μ_l(H, s) is the per-epoch exploitation cost, i.e. the expected regret incurred during the $s(Z_1^l)$ exploitation steps of epoch l
• For all T, n_l, L such that $T \le \sum_{l=1}^{L} n_l$,
the expected regret of the Epoch-Greedy algorithm satisfies
$\Delta R(\text{Epoch-Greedy}, H, T) \le L + \sum_{l=1}^{L} \mu_l(H, s) + T \sum_{l=1}^{L} \Pr[s(Z_1^l) < n_l]$
Theorem Proof Sketch
• For all T, n_l, L such that $T \le \sum_{l=1}^{L} n_l$:
$\Delta R(\text{Epoch-Greedy}, H, T) \le L + \sum_{l=1}^{L} \mu_l(H, s) + T \sum_{l=1}^{L} \Pr[s(Z_1^l) < n_l]$
• There are two complementary cases
– Case 1: $s(Z_1^l) \ge n_l$ for all l = 1, …, L
– Case 2: $s(Z_1^l) < n_l$ for some l = 1, …, L
Theorem Proof Sketch
• Case 1: $s(Z_1^l) \ge n_l$ for all l = 1, …, L
• Then
– T is contained within the first L epochs (since $T \le \sum_{l=1}^{L} n_l \le \sum_{l=1}^{L} s(Z_1^l)$)
– Exploitation regret in epoch l is at most μ_l(H, s)
– Exploration regret is at most 1 per epoch
– Regret contribution of Case 1: $L + \sum_{l=1}^{L} \mu_l(H, s)$
Theorem Proof Sketch
• Case 2: $s(Z_1^l) < n_l$ for some l = 1, …, L
– Regret of Case 2 is at most T
• Regret is at most 1 per step
– Probability of Case 2 is at most $\sum_{l=1}^{L} \Pr[s(Z_1^l) < n_l]$ (union bound)
– Bound for regret contribution of Case 2: $T \sum_{l=1}^{L} \Pr[s(Z_1^l) < n_l]$
• Total regret of Case 1 and Case 2:
$\Delta R(\text{Epoch-Greedy}, H, T) \le L + \sum_{l=1}^{L} \mu_l(H, s) + T \sum_{l=1}^{L} \Pr[s(Z_1^l) < n_l]$
Bound for Finite Hypothesis Space
• Denote the size of the hypothesis space m = |H| < ∞
• By Bernstein's inequality, the expected per-step regret of empirical reward maximization after l exploration samples is $O\big(\sqrt{k \ln m / l}\,\big)$
– c: some constant in the precise bound
• Recall: μ_l(H, s) is the per-epoch exploitation cost
• Pick $s(Z_1^l) = \big\lfloor c \sqrt{l / (k \ln m)} \big\rfloor$, then $\mu_l(H, s) \le 1$
Bound for Finite Hypothesis Space
• Take $n_l = \lfloor c \sqrt{l / (k \ln m)} \rfloor = s(Z_1^l)$, then $\Pr[s(Z_1^l) < n_l] = 0$
• Let $L = \lfloor c' T^{2/3} (k \ln m)^{1/3} \rfloor$ for some constant c′, then $T \le \sum_{l=1}^{L} n_l$ (arithmetic below)
• Therefore, by the theorem,
$\Delta R(\text{Epoch-Greedy}, H, T) \le L + \sum_{l=1}^{L} \mu_l(H, s) \le 2L = O\big(T^{2/3} (k \ln m)^{1/3}\big)$
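Where the $T^{2/3}$ rate comes from (a reconstruction of the arithmetic behind the choice of L):

```latex
% With n_l = \lfloor c\sqrt{l/(k \ln m)} \rfloor, L epochs cover about
\sum_{l=1}^{L} n_l \;\approx\; \frac{c}{\sqrt{k \ln m}} \sum_{l=1}^{L} \sqrt{l}
  \;\approx\; \frac{2c}{3} \cdot \frac{L^{3/2}}{\sqrt{k \ln m}}
% exploitation steps; requiring this to be at least T and solving for L gives
L = O\big(T^{2/3} (k \ln m)^{1/3}\big)
```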
Bound improvement
• Let H = {h_1, …, h_m}
• WLOG, R(h_1) ≥ R(h_2) ≥ … ≥ R(h_m)
• Suppose R(h_1) ≥ R(h_2) + Δ for some Δ > 0
– Δ: gap between the best and the second-best hypothesis
• With appropriate parameter choices,
$\Delta R(\text{Epoch-Greedy}, H, T) \le 2 \left\lceil \frac{8k(\ln m + \ln(T+1))}{c\Delta^2} \right\rceil + 1 + c'' k \Delta^{-2}$
– $s(Z_1^l)$ is data-dependent in this case
• That means the regret is O(k ln(m) + k ln(T)) for a fixed gap Δ
Conclusions
• Contextual multi-armed bandits
– Generalization of the standard multi-armed bandits
– Observable context helps decide which arms to pull
– Sample complexity for the exploration-exploitation trade-off
– Good for large hypothesis spaces or ones with special structure