



Page 1:

Taming the monster: A fast and simple algorithm for contextual bandits

PRESENTED BY Satyen Kale

Joint work with Alekh Agarwal, Daniel Hsu, John Langford, Lihong Li and Rob Schapire

Page 2:

Learning to interact: example #1

Loop:
1. Patient arrives with symptoms, medical history, genome, …
2. Physician prescribes treatment.
3. Patient’s health responds (e.g., improves, worsens).

Goal: prescribe treatments that yield good health outcomes.

Page 3:

Learning to interact: example #2

Loop:
1. User visits website with profile, browsing history, …
2. Website operator chooses content/ads to display.
3. User reacts to content/ads (e.g., click, “like”).

Goal: choose content/ads that yield desired user behavior.

Page 4:

Contextual bandit setting (i.i.d. version)

Set X of contexts/features and K possible actions.

For t = 1, 2, …, T:

0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.

1. Observe context x_t. [e.g., user profile, browsing history]

2. Choose action a_t ∈ [K]. [e.g., content/ad to display]

3. Collect reward r_t(a_t). [e.g., indicator of click or positive feedback]

Goal: algorithm for choosing actions a_t that yield high reward.

Contextual setting: use features x_t to choose good actions a_t.

Bandit setting: r_t(a) for a ≠ a_t is not observed.
Exploration vs. exploitation.

Page 5:

Learning objective and difficulties

No single action is good in all situations – need to exploit context.

Policy class Π: set of functions (“policies”) from X → [K] (e.g., advice of experts, linear classifiers, neural networks).

Regret (i.e., performance relative to the policy class Π):
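Written out in the notation above, this is the usual notion of regret against the best policy in Π:

$$\mathrm{Regret}(T) \;=\; \max_{\pi \in \Pi} \sum_{t=1}^{T} r_t\big(\pi(x_t)\big) \;-\; \sum_{t=1}^{T} r_t(a_t).$$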

This is a strong benchmark if Π contains a policy with high reward.

Difficulties: feedback on the chosen action only informs us about a subset of policies; explicit bookkeeping is computationally infeasible when Π is large.

Page 6:

Arg max oracle (AMO)

Given fully-labeled data (x_1, r_1), …, (x_t, r_t), the AMO returns the policy with maximum total reward on that data:
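That is, in the standard arg-max-oracle abstraction:

$$\mathrm{AMO}\big((x_1,r_1),\dots,(x_t,r_t)\big) \;=\; \arg\max_{\pi \in \Pi} \sum_{s=1}^{t} r_s\big(\pi(x_s)\big).$$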

Abstraction for efficient search of the policy class Π.

In practice: implement using standard heuristics (e.g., convex relaxations, backprop) for cost-sensitive multiclass learning.

Page 7:

Our results

New fast and simple algorithm for contextual bandits:
- Optimal regret bound (up to log factors): Õ(√(KT ln |Π|)).
- Amortized Õ(√(K / (T ln |Π|))) calls to the argmax oracle (AMO) per round, i.e., Õ(√(KT / ln |Π|)) AMO calls in total over T rounds.

Comparison to previous work:
- [Thompson’33]: no general analysis.
- [ACBFS’02]: Exp4 algorithm; optimal regret, enumerates policies.
- [LZ’07]: ε-greedy variant; suboptimal regret, one AMO call/round.
- [DHKKLRZ’11]: “monster paper”; optimal regret, O(T^5 K^4) AMO calls/round.

Note: Exp4 also works in adversarial setting.

Page 8:

Rest of this talk

1. Action distributions, reward estimates via inverse probability weights [oldies but goodies]

2. Algorithm for finding policy distributions that balance exploration/exploitation [new]

3. Warm-start / epoch trick [new]

Page 9:

Basic algorithm structure (same as Exp4)

Start with initial distribution Q_1 over policies Π.

For t = 1, 2, …, T:

0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.

1. Observe context x_t.

2a. Compute distribution p_t over actions {1, 2, …, K} (based on Q_t and x_t).

2b. Draw action a_t from p_t.

3. Collect reward r_t(a_t).

4. Compute new distribution Q_{t+1} over policies Π.
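A minimal Python sketch of this loop structure (not the authors' code; `draw_context_reward`, `update_policy_distribution`, and the smoothing parameter `mu` are hypothetical stand-ins for step 0, step 4, and the projection used in step 2a):

```python
import numpy as np

def project(Q, policies, x, K, mu=0.0):
    # Step 2a: turn a distribution Q over policies into a distribution over actions,
    # optionally mixing in a little uniform smoothing (the paper uses a smoothed
    # projection of this kind).
    p = np.zeros(K)
    for w, pi in zip(Q, policies):
        p[pi(x)] += w
    return (1.0 - K * mu) * p + mu

def bandit_loop(policies, T, K, draw_context_reward, update_policy_distribution, mu=0.0):
    Q = np.ones(len(policies)) / len(policies)      # initial distribution Q_1 over policies
    total_reward = 0.0
    for t in range(1, T + 1):
        x_t, r_t = draw_context_reward()            # step 0 (only r_t[a_t] is ever revealed)
        p_t = project(Q, policies, x_t, K, mu)      # step 2a
        a_t = np.random.choice(K, p=p_t)            # step 2b
        total_reward += r_t[a_t]                    # step 3
        Q = update_policy_distribution(Q, x_t, a_t, r_t[a_t], p_t[a_t])  # step 4
    return total_reward
```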

Page 10:

Inverse probability weighting (old trick)

Importance-weighted estimate of reward from round t:
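Concretely, this is the standard importance-weighted (inverse-propensity) estimator:

$$\hat r_t(a) \;=\; \frac{r_t(a_t)\,\mathbf{1}\{a = a_t\}}{p_t(a_t)}, \qquad a \in [K].$$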

Unbiased, and has range & variance bounded by 1/p_t(a).

Can estimate total reward and regret of any policy:
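For example, summing the per-round estimates gives an estimated cumulative reward, and hence an estimated regret, for every policy:

$$\widehat{R}_t(\pi) \;=\; \sum_{s=1}^{t} \hat r_s\big(\pi(x_s)\big), \qquad \widehat{\mathrm{Reg}}_t(\pi) \;=\; \max_{\pi' \in \Pi} \widehat{R}_t(\pi') \;-\; \widehat{R}_t(\pi).$$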

Page 11:

Constructing policy distributions

Optimization problem (OP):

Find policy distribution Q such that:

Low estimated regret (LR) – “exploitation”

Low estimation variance (LV) – “exploration”
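Schematically, the two constraints take roughly the following form (a rough reconstruction; constants and the exact smoothed projection Q^{μ} are as in the paper, with μ_t on the order of √(ln|Π| / (K t))):

$$\text{(LR)}: \;\; \sum_{\pi \in \Pi} Q(\pi)\, \widehat{\mathrm{Reg}}_t(\pi) \;\le\; O(K \mu_t), \qquad
\text{(LV)}: \;\; \widehat{\mathbb{E}}_{x}\!\left[\frac{1}{Q^{\mu_t}(\pi(x) \mid x)}\right] \;\le\; 2K + \frac{\widehat{\mathrm{Reg}}_t(\pi)}{O(\mu_t)} \;\; \text{for all } \pi \in \Pi.$$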

Theorem: If we obtain the policy distributions Q_t by solving (OP), then with high probability, the regret after T rounds is at most Õ(√(KT ln |Π|)).

Page 12:

Feasibility

Feasibility of (OP): implied by minimax argument.

Monster solution [DHKKLRZ’11]: solves variant of (OP) with ellipsoid algorithm, where Separation Oracle = AMO + perceptron + ellipsoid.

Page 13:

Coordinate descent algorithm

INPUT: Initial weights Q.

LOOP:
- IF (LR) is violated, THEN replace Q by cQ.
- IF there is a policy π causing (LV) to be violated, THEN
  UPDATE Q(π) = Q(π) + α.
- ELSE
  RETURN Q.

Above, both 0 < c < 1 and α have closed-form expressions.

(Technical detail: actually optimize over sub-distributions Q that may sum to < 1.)

Claim: Each iteration can check whether some policy violates (LV) by making one AMO call.
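A schematic of this loop in Python (not the paper's exact algorithm; `lr_violated`, `rescale_factor`, `find_lv_violator`, and `step_size` are hypothetical stand-ins, and the paper supplies the closed forms for c and α):

```python
def solve_op(Q, lr_violated, rescale_factor, find_lv_violator, step_size):
    """Coordinate descent on the policy weights Q (a dict: policy -> weight).

    Q is a sub-distribution: its weights may sum to less than 1.
    """
    while True:
        if lr_violated(Q):
            c = rescale_factor(Q)                    # closed-form constant, 0 < c < 1
            Q = {pi: c * w for pi, w in Q.items()}   # shrink all weights: restores (LR)
        pi = find_lv_violator(Q)                     # one AMO call to search for an (LV) violator
        if pi is None:
            return Q                                 # both (LR) and (LV) hold
        Q[pi] = Q.get(pi, 0.0) + step_size(Q, pi)    # closed-form alpha > 0: add weight to pi
```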

Page 14:

Iteration bound for coordinate descent

# steps of coordinate descent =

Also gives bound on sparsity of Q.

Analysis via a potential function argument.

Page 15:

Warm-start

If we warm-start coordinate descent (initialize with Q_t to get Q_{t+1}), then only Õ(√(KT / ln |Π|)) coordinate descent iterations are needed over all T rounds.

Caveat: need one AMO call/round to even check if (OP) is solved.

Page 16:

Epoch trick

Regret analysis: Q_t has low instantaneous expected regret (crucially relying on the i.i.d. assumption).
Therefore the same Q_t can be used for O(t) more rounds!

Epoching: Split the T rounds into epochs; solve (OP) once per epoch.

Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, …
Total of O(log T) updates, so the overall # of AMO calls is unchanged (up to log factors).

Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, …
Total of O(T^{1/2}) updates, each requiring Õ(√(K / ln |Π|)) AMO calls on average.
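For illustration, the two update schedules can be written as simple predicates on the round index t (a sketch; (OP) is re-solved only on rounds where the predicate is true, and the current Q is reused otherwise):

```python
import math

def update_on_doubling(t):
    # True on rounds 2, 4, 8, 16, ... : O(log T) updates in total.
    return t > 1 and (t & (t - 1)) == 0

def update_on_squares(t):
    # True on rounds 1, 4, 9, 16, ... : O(sqrt(T)) updates in total.
    s = math.isqrt(t)
    return s * s == t
```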

Page 17:

Experiments

Algorithm        | Loss  | Time (seconds)
-----------------|-------|---------------
Epsilon-greedy   | 0.095 | 22
Bagging          | 0.059 | 339
Linear UCB       | 0.128 | 212000
“Online Cover”   | 0.053 | 17
[Supervised]     | 0.051 | 6.9

Bandit problem derived from classification task (RCV1). Reporting progressive validation loss.

“Online Cover” = variant with stateful AMO.
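A sketch of how such a bandit problem is derived from a K-class classification dataset and how progressive validation loss is computed (a hypothetical `learner` interface with `act`/`learn` methods; not the actual experimental code):

```python
def progressive_validation_loss(examples, learner, K):
    """examples: iterable of (features, label) pairs with label in {0, ..., K-1}.

    The full label is never shown to the learner; it only sees the reward of
    the action it chose (1 if the action equals the hidden label, else 0).
    """
    mistakes = 0.0
    n = 0
    for x, y in examples:
        n += 1
        a, p_a = learner.act(x)            # chosen action and the probability it was chosen with
        reward = 1.0 if a == y else 0.0    # bandit feedback derived from the classification label
        mistakes += 1.0 - reward           # progressive validation: score before updating
        learner.learn(x, a, reward, p_a)   # update using only the observed (a, reward, p_a)
    return mistakes / n
```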