
RL Course

Introduction to Reinforcement Learning

A. LAZARIC (SequeL Team @INRIA-Lille)
Machine Learning Summer School – Toulouse, France

dubbing: A. GARIVIER (Institut de Mathématique de Toulouse)

SequeL – INRIA Lille


Motivation

Outline

- Motivation
  - Interactive Learning Problems
  - A Model for Sequential Decision Making
  - Outline
- Multi-armed Bandit Problems
- Extensions


Motivation – Interactive Learning Problems


Why: Important Problems

- Autonomous robotics
  - Elder care
  - Exploration of unknown / dangerous environments
  - Robotics for entertainment
- Financial applications
  - Trading execution algorithms
  - Portfolio management
  - Option pricing
- Energy management
  - Energy grid integration
  - Maintenance scheduling
  - Energy market regulation
  - Energy production management
- Recommender systems
  - Web advertising
  - Product recommendation
  - Date matching
- Social applications
  - Bike sharing optimization
  - Election campaign
  - ER service optimization
  - Intelligent Tutoring Systems
- And many more...

Motivation – A Model for Sequential Decision Making


What: Sequential Decision-Making under Uncertainty

[Diagram: the agent acts on the environment (action / actuation) and perceives the resulting state; a critic evaluates the interaction and feeds a reward back to the agent's learning component.]

What: A Different Machine Learning Paradigm

- Supervised learning: an expert (supervisor) provides examples of the right strategy (e.g., classification of clinical images). Supervision is expensive.

- Unsupervised learning: different objects are clustered together by similarity (e.g., clustering of images on the basis of their similarity). No actual performance is optimized.

- Reinforcement learning: learning by direct interaction (e.g., autonomous robotics). Minimum level of supervision (the reward) and maximization of long-term performance.


Motivation – Outline


How: the Course

A formal and rigorous approach to the RL way of addressing sequential decision-making under uncertainty.


- How to model an RL problem
- Models without states = MAB (multi-armed bandits)
- How to solve exactly a (small) MDP
- Hands-on session! (2h)
- How to solve approximately a (larger) MDP
- How to solve an MDP incrementally
- How to efficiently explore an MDP


How to model an RL problem

- The Markov Decision Process
- The Model
- Value Functions


The Agent-Environment Interaction Model

The environment
- Controllability: fully (e.g., chess) or partially (e.g., portfolio optimization)
- Uncertainty: deterministic (e.g., chess) or stochastic (e.g., backgammon)
- Reactive: adversarial (e.g., chess) or fixed (e.g., Tetris)
- Observability: full (e.g., chess) or partial (e.g., robotics)
- Availability: known (e.g., chess) or unknown (e.g., robotics)

The critic
- Sparse (e.g., win or lose) vs. informative (e.g., closer or further)
- Preference reward
- Frequent or sporadic
- Known or unknown

The agent
- Open-loop control
- Closed-loop control (i.e., adaptive)
- Non-stationary closed-loop control (i.e., learning)


Markov Decision Process

Definition (Markov decision process [1, 4, 3, 5, 2])
A Markov decision process is defined as a tuple $M = (X, A, p, r)$:
- $X$ is the state space,
- $A$ is the action space,
- $p(y|x, a)$ is the transition probability, with
  $$p(y|x, a) = \mathbb{P}(x_{t+1} = y \mid x_t = x, a_t = a),$$
- $r(x, a, y)$ is the reward of transition $(x, a, y)$.
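To make the tuple $(X, A, p, r)$ concrete, here is a minimal sketch in code of a tiny two-state, two-action MDP. The transition probabilities and rewards below are invented purely for illustration; they are not from the slides.

```python
import numpy as np

# A tiny MDP M = (X, A, p, r) with invented numbers, just to make the tuple concrete.
X = [0, 1]                      # state space
A = [0, 1]                      # action space
# p[x, a, y] = P(x_{t+1} = y | x_t = x, a_t = a)
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
# r[x, a, y] = reward of transition (x, a, y)
r = np.array([[[0.0, 1.0], [0.0, 2.0]],
              [[1.0, 0.0], [0.5, 0.0]]])

rng = np.random.default_rng(0)

def sample_transition(x, a):
    """Sample x_{t+1} ~ p(.|x, a) and return it with the transition reward."""
    y = rng.choice(X, p=p[x, a])
    return y, r[x, a, y]

print(sample_transition(0, 1))
```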


Markov Decision Process: the Assumptions

Time assumption: time is discrete,
$$t \to t + 1.$$

Possible relaxations:
- Identify the proper time granularity
- Most of the MDP literature extends to continuous time


Markov assumption: the current state $x$ and action $a$ are a sufficient statistic for the next state $y$,
$$p(y|x, a) = \mathbb{P}(x_{t+1} = y \mid x_t = x, a_t = a).$$

Possible relaxations:
- Define a new state $h_t = (x_t, x_{t-1}, x_{t-2}, \ldots)$
- Move to a partially observable MDP (PO-MDP)
- Move to a predictive state representation (PSR) model


Reward assumption: the reward is uniquely defined by a transition (or part of it),
$$r(x, a, y).$$

Possible relaxations:
- Distinguish between the global goal and the reward function
- Move to inverse reinforcement learning (IRL) to induce the reward function from desired behaviors


Stationarity assumption: the dynamics and reward do not change over time,
$$p(y|x, a) = \mathbb{P}(x_{t+1} = y \mid x_t = x, a_t = a), \qquad r(x, a, y).$$

Possible relaxations:
- Identify and remove the non-stationary components (e.g., cyclo-stationary dynamics)
- Identify the time scale of the changes


Question

Is the MDP formalism powerful enough?

⇒ Let’s try!


Example: the Retail Store Management Problem

Description. At each month $t$, a store contains $x_t$ items of a specific good and the demand for that good is $D_t$. At the end of each month the manager of the store can order $a_t$ more items from the supplier. Furthermore, we know that:
- The cost of maintaining an inventory of $x$ items is $h(x)$.
- The cost of ordering $a$ items is $C(a)$.
- The income for selling $q$ items is $f(q)$.
- If the demand $D$ is larger than the available inventory $x$, the customers that cannot be served leave.
- The value of the remaining inventory at the end of the year is $g(x)$.
- Constraint: the store has a maximum capacity $M$.


- State space: $x \in X = \{0, 1, \ldots, M\}$.
- Action space: it is not possible to order more items than the capacity of the store, so the action space depends on the current state. Formally, at state $x$, $a \in A(x) = \{0, 1, \ldots, M - x\}$.
- Dynamics: $x_{t+1} = [x_t + a_t - D_t]^+$.
  Problem: the dynamics should be Markov and stationary! The demand $D_t$ is stochastic and time-independent. Formally, $D_t \stackrel{\text{i.i.d.}}{\sim} D$.
- Reward: $r_t = -C(a_t) - h(x_t + a_t) + f([x_t + a_t - x_{t+1}]^+)$.
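To make the formalization tangible, here is a minimal simulation sketch of this MDP. The Poisson demand and the linear cost functions h, C, f are illustrative assumptions; the slides leave these functions unspecified.

```python
import numpy as np

# Minimal simulation sketch of the retail store MDP.
# The demand distribution and the cost functions h, C, f below are
# illustrative assumptions; the slides do not specify them.
M = 20                                   # maximum store capacity
rng = np.random.default_rng(0)

def h(x): return 0.1 * x                 # holding cost of inventory x
def C(a): return 1.0 * a                 # cost of ordering a items
def f(q): return 2.0 * q                 # income for selling q items

def step(x, a):
    """One month: order a items in state x, observe demand, return (x_next, r)."""
    d = rng.poisson(5)                   # stochastic, time-independent demand D_t
    x_next = max(x + a - d, 0)           # x_{t+1} = [x_t + a_t - D_t]^+
    sold = x + a - x_next                # [x_t + a_t - x_{t+1}]^+
    r = -C(a) - h(x + a) + f(sold)       # r_t
    return x_next, r

x = 10
for t in range(12):                      # simulate one year with a fixed order size
    a = min(5, M - x)                    # never exceed capacity M
    x, r = step(x, a)
    print(t, x, round(r, 2))
```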


Policy

Definition (Policy)
A decision rule $\pi_t$ can be
- Deterministic: $\pi_t : X \to A$,
- Stochastic: $\pi_t : X \to \Delta(A)$.

A policy (strategy, plan) can be
- Non-stationary: $\pi = (\pi_0, \pi_1, \pi_2, \ldots)$,
- Stationary (Markovian): $\pi = (\pi, \pi, \pi, \ldots)$.

Remark: an MDP $M$ plus a stationary policy $\pi$ yields a Markov chain with state space $X$ and transition probability $p(y|x) = p(y|x, \pi(x))$.
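The remark can be checked numerically: fixing a stationary deterministic policy on the tiny MDP sketched earlier collapses the transition tensor into an ordinary Markov chain matrix. A minimal sketch (the policy chosen here is arbitrary):

```python
import numpy as np

# Reuse the invented 2-state, 2-action transition tensor p[x, a, y] from the MDP sketch.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])

pi = [1, 0]                               # arbitrary stationary deterministic policy pi(x)

# Induced Markov chain: P_pi[x, y] = p(y | x, pi(x))
P_pi = np.array([p[x, pi[x]] for x in range(p.shape[0])])
print(P_pi)                               # each row sums to 1
```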


Example: the Retail Store Management Problem

- Stationary policy 1:
  $$\pi(x) = \begin{cases} M - x & \text{if } x < M/4 \\ 0 & \text{otherwise} \end{cases}$$
- Stationary policy 2:
  $$\pi(x) = \max\{(M - x)/2 - x;\ 0\}$$
- Non-stationary policy:
  $$\pi_t(x) = \begin{cases} M - x & \text{if } t < 6 \\ \lfloor (M - x)/5 \rfloor & \text{otherwise} \end{cases}$$
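These three policies translate directly to code. A minimal sketch, assuming the formulas reconstructed above and an arbitrary capacity M = 20:

```python
import math

# Direct transcription of the three example policies (M is the store capacity).
M = 20

def policy1(x):
    # order up to full capacity while stock is low, otherwise order nothing
    return M - x if x < M / 4 else 0

def policy2(x):
    # stationary policy 2 as written on the slide
    return max((M - x) / 2 - x, 0)

def policy3(x, t):
    # non-stationary: aggressive restocking for the first 6 months, then mild
    return M - x if t < 6 else math.floor((M - x) / 5)

print([policy1(x) for x in (0, 5, 15)])
print([policy2(x) for x in (0, 5, 15)])
print([policy3(5, t) for t in (0, 6, 11)])
```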


Multi-armed Bandit Problems

Outline

- Motivation
- Multi-armed Bandit Problems
  - Introduction
  - The Bandit Model
  - Bandit Algorithms: UCB
  - A (distribution-dependent) Lower Bound for the Regret
  - Worst-case Performance
- Extensions


Multi-armed Bandit Problems – Introduction


How to efficiently explore an MDP

The Exploration-Exploitation Dilemma
- Multi-Armed Bandit
- Contextual Linear Bandit
- Reinforcement Learning


The Navigation Problem


Question: which route should we take?

Problem: each day we obtain only limited feedback: the traveling time of the chosen route.

Result: if we do not repeatedly try different options, we cannot learn.

Solution: trade off between optimization and learning.


Learning the Optimal Policy

For i = 1, . . . , n
  1. Set t = 0
  2. Set initial state x_0
  3. While (x_t not terminal)
     3.1 Take action a_t according to a suitable exploration policy
     3.2 Observe next state x_{t+1} and reward r_t
     3.3 Compute the temporal difference δ_t (e.g., Q-learning)
     3.4 Update the Q-function
         Q(x_t, a_t) = Q(x_t, a_t) + α(x_t, a_t) δ_t
     3.5 Set t = t + 1
  EndWhile
EndFor
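A minimal runnable version of this loop, assuming a toy chain MDP and ε-greedy exploration as the "suitable exploration policy" (the environment, the constant step size, and ε are illustrative choices, not from the slides):

```python
import numpy as np

# Minimal tabular Q-learning sketch of the loop above, with epsilon-greedy
# exploration standing in for the "suitable exploration policy".
# The chain MDP below (move left/right, reward only at the last state) is
# purely illustrative.
n_states, n_actions, goal = 5, 2, 4
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(x, a):
    """Transition: action 0 moves left, action 1 moves right; reward 1 at the goal."""
    y = max(x - 1, 0) if a == 0 else min(x + 1, n_states - 1)
    return y, float(y == goal)

for episode in range(200):
    x = 0
    while x != goal:                                   # x_t not terminal
        if rng.random() < eps:                         # explore
            a = int(rng.integers(n_actions))
        else:                                          # exploit
            a = int(np.argmax(Q[x]))
        y, r = step(x, a)                              # observe x_{t+1}, r_t
        delta = r + gamma * np.max(Q[y]) - Q[x, a]     # temporal difference (Q-learning)
        Q[x, a] += alpha * delta                       # Q-function update
        x = y

print(np.argmax(Q, axis=1))   # greedy action per state after learning (1 = move right)
```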


Variant: if in step 3.1 the action is chosen greedily, $a_t = \arg\max_a Q(x_t, a)$
⇒ no convergence.

Variant: if in step 3.1 the action is chosen uniformly at random, $a_t \sim U(A)$
⇒ very poor rewards.

Multi-armed Bandit Problems – The Bandit Model


Reducing RL down to Multi-Armed Bandit

Definition (Markov decision process [1, 4, 3, 5, 2])
A Markov decision process is defined as a tuple $M = (X, A, p, r)$:
- $X$ is the state space,
- $A$ is the action space,
- $p(y|x, a)$ is the transition probability,
- $r(x, a, y)$ is the reward of transition $(x, a, y)$

⇒ $r(a)$ is the reward of action $a$


Notice

For coherence with the bandit literature we use the notation:
- $i = 1, \ldots, K$: set of possible actions
- $t = 1, \ldots, n$: time
- $I_t$: action selected at time $t$
- $X_{i,t}$: reward of action $i$ at time $t$


Learning the Optimal Policy

Objective: learn the optimal policy $\pi^*$ as efficiently as possible.

For i = 1, . . . , n
  1. Set t = 0
  2. Set initial state x_0
  3. While (x_t not terminal)
     3.1 Take action a_t
     3.2 Observe next state x_{t+1} and reward r_t
     3.3 Set t = t + 1
  EndWhile
EndFor


The Multi–armed Bandit Protocol

The learner has $i = 1, \ldots, K$ arms (actions).

At each round $t = 1, \ldots, n$:
- At the same time
  - the environment chooses a vector of rewards $\{X_{i,t}\}_{i=1}^{K}$,
  - the learner chooses an arm $I_t$.
- The learner receives the reward $X_{I_t, t}$.
- The environment does not reveal the rewards of the other arms.


Paradigmatic Example

Imagine you are a doctor:
- patients visit you one after another for a given disease
- you prescribe one of the (say) 5 treatments available
- the treatments are not equally efficient
- you do not know which one is the best; you observe the effect of the prescribed treatment on each patient

⇒ What do you do?
- You must choose each prescription using only the previous observations
- Your goal is not to estimate each treatment's efficiency precisely, but to heal as many patients as possible


The (stochastic) Multi-Armed Bandit Model

Environment: $K$ arms with parameters $\theta = (\theta_1, \ldots, \theta_K)$ such that, for any possible choice of arm $I_t \in \{1, \ldots, K\}$ at time $t$, one receives the reward
$$X_t = X_{I_t, t}$$
where, for any $1 \le i \le K$ and $s \ge 1$, $X_{i,s} \sim \nu_i$, and the $(X_{i,s})_{i,s}$ are independent.

Reward distributions: $\nu_i \in F_i$, a parametric family or not. Examples: canonical exponential family, general bounded rewards.

Example: Bernoulli rewards, $\theta \in [0, 1]^K$, $\nu_i = B(\theta_i)$.

Strategy: the agent's actions follow a dynamical strategy $\pi = (\pi_1, \pi_2, \ldots)$ such that
$$I_t = \pi_t(X_1, \ldots, X_{t-1}).$$
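A minimal simulation of this protocol with Bernoulli arms. The parameter vector θ and the simple explore-then-commit strategy are illustrative choices; the slides do not prescribe either.

```python
import numpy as np

# Minimal sketch of the stochastic bandit protocol with Bernoulli rewards,
# nu_i = B(theta_i). The parameters theta and the explore-then-commit
# strategy below are illustrative assumptions.
rng = np.random.default_rng(0)
theta = np.array([0.3, 0.5, 0.7])        # K = 3 arms
K, n = len(theta), 1000

pulls = np.zeros(K, dtype=int)
sums = np.zeros(K)
total_reward = 0.0

for t in range(n):
    if t < 10 * K:                        # exploration phase: round-robin over the arms
        i = t % K
    else:                                 # commit to the empirically best arm
        i = int(np.argmax(sums / np.maximum(pulls, 1)))
    x = rng.binomial(1, theta[i])         # observe X_{I_t, t}; other arms stay hidden
    pulls[i] += 1
    sums[i] += x
    total_reward += x

print("pulls per arm:", pulls)
print("cumulative reward:", total_reward)
```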


The Multi-armed Bandit Game (cont'd)

Goal: choose $\pi$ so as to maximize
$$\mathbb{E}_A[S_n] = \sum_{t=1}^{n} \sum_{i=1}^{K} \mathbb{E}\Big[\mathbb{E}\big[X_t \, \mathbb{I}\{I_t = i\} \mid X_1, \ldots, X_{t-1}\big]\Big] = \sum_{i=1}^{K} \mu_i \, \mathbb{E}[T_{i,n}]$$
where $T_{i,n} = \sum_{t \le n} \mathbb{I}\{I_t = i\}$ is the number of draws of arm $i$ up to time $n$, and $\mu_i = \mathbb{E}(\nu_i)$.

⇒ Equivalent to minimizing the regret
$$R_n(A) = \max_{i=1,\ldots,K} \mathbb{E}\Big[\sum_{t=1}^{n} X_{i,t}\Big] - \mathbb{E}\Big[\sum_{t=1}^{n} X_{I_t, t}\Big]$$
where $\mu^* = \max\{\mu_i : 1 \le i \le K\}$.
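A quick numerical sanity check of the identity $\mathbb{E}_A[S_n] = \sum_i \mu_i \mathbb{E}[T_{i,n}]$ and of the resulting regret, using the Bernoulli arms from the earlier sketch and hypothetical draw counts standing in for $\mathbb{E}[T_{i,n}]$:

```python
import numpy as np

# Illustrative check of E[S_n] = sum_i mu_i E[T_{i,n}] and of the regret R_n.
# theta is the Bernoulli parameter vector from the earlier sketch; the draw
# counts below are hypothetical stand-ins for E[T_{i,n}].
theta = np.array([0.3, 0.5, 0.7])        # mu_i = theta_i for Bernoulli arms
pulls = np.array([10, 10, 980])          # hypothetical T_{i,n}
n = pulls.sum()

expected_reward = float(theta @ pulls)             # sum_i mu_i * T_{i,n}
regret = n * theta.max() - expected_reward         # R_n = n*mu* - sum_i mu_i T_{i,n}
print(expected_reward, regret)                     # equivalently sum_i (mu* - mu_i) T_{i,n}
```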


The Exploration–Exploitation Lemma

Problem 1: the environment does not reveal the rewards of the arms not pulled by the learner
⇒ the learner should gain information by repeatedly pulling all the arms
⇒ exploration

Problem 2: whenever the learner pulls a bad arm, it suffers some regret
⇒ the learner should reduce the regret by repeatedly pulling the best arm
⇒ exploitation

Challenge: the learner should solve two opposite problems!

A. LAZARIC – Introduction to Reinforcement Learning Sept 14th, 2015 - 49/124


The Multi-armed Bandit Game (cont'd)

Examples
I Packet routing
I Clinical trials
I Web advertising
I Computer games
I Resource mining
I ...

The Stochastic Multi-armed Bandit Problem

Definition
The environment is stochastic:
I Each arm has a distribution ν_i bounded in [0, 1] and characterized by an expected value μ_i
I The rewards are i.i.d.: X_{i,t} ∼ ν_i (as in the MDP model)

The Stochastic Multi-armed Bandit Problem (cont'd)

Notation
I Number of times arm i has been pulled after n rounds

T_{i,n} = Σ_{t=1}^n 1{I_t = i}

I Regret

R_n(A) = max_{i=1,...,K} E[ Σ_{t=1}^n X_{i,t} ] − E[ Σ_{t=1}^n X_{I_t,t} ]
       = max_{i=1,...,K} (n μ_i) − E[ Σ_{t=1}^n X_{I_t,t} ]
       = n μ_{i*} − Σ_{i=1}^K E[T_{i,n}] μ_i
       = Σ_{i≠i*} E[T_{i,n}] (μ_{i*} − μ_i)
       = Σ_{i≠i*} E[T_{i,n}] Δ_i

I Gap: Δ_i = μ_{i*} − μ_i

The Stochastic Multi-armed Bandit Problem (cont'd)

R_n(A) = Σ_{i≠i*} E[T_{i,n}] Δ_i

⇒ we only need to study the expected number of pulls of the suboptimal arms
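As a small worked example (not in the slides), the decomposition above can be evaluated numerically from the pull counts of a run; the helper name expected_regret is illustrative.

    import numpy as np

    def expected_regret(mu, pull_counts):
        """R_n = sum_{i != i*} E[T_{i,n}] * Delta_i, estimated from (expected) pull counts."""
        mu = np.asarray(mu)
        gaps = mu.max() - mu                      # Delta_i = mu_{i*} - mu_i
        return float(np.dot(pull_counts, gaps))   # the optimal arm contributes 0

    # With means (0.2, 0.5, 0.8) and pull counts (300, 300, 400):
    # 300 * 0.6 + 300 * 0.3 + 400 * 0.0 = 270
    print(expected_regret([0.2, 0.5, 0.8], [300, 300, 400]))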

Multi-armed Bandit Problems / Bandit Algorithms: UCB

Outline

Motivation

Multi-armed Bandit Problems
  Introduction
  The Bandit Model
  Bandit Algorithms: UCB
  A (distribution-dependent) Lower Bound for the Regret
  Worst-case Performance

Extensions

The Stochastic Multi-armed Bandit Problem (cont'd)

Optimism in Face of Uncertainty Learning (OFUL)

Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm.

Why it works:
I If the best possible world is correct ⇒ no regret
I If the best possible world is wrong ⇒ the reduction in the uncertainty is maximized

The Stochastic Multi-armed Bandit Problem (cont'd)

Optimism in face of uncertainty

[Figure: four reward histograms, for arms pulled 100, 200, 50, and 20 times]

The Upper–Confidence Bound (UCB) Algorithm
The idea

[Figure: reward estimates for four arms, pulled 10, 73, 3, and 23 times (x-axis: Arms, y-axis: Reward)]


The Upper–Confidence Bound (UCB) Algorithm

Show time!

The Upper–Confidence Bound (UCB) Algorithm (cont'd)

At each round t = 1, . . . , n
I Compute the score of each arm i

B_i = (optimistic score of arm i)

I Pull arm

I_t = arg max_{i=1,...,K} B_{i,s,t}

I Update the number of pulls T_{I_t,t} = T_{I_t,t−1} + 1 and the other statistics

The Upper–Confidence Bound (UCB) Algorithm (cont'd)

The score (with parameters ρ and δ)

B_{i,s,t} = (optimistic score of arm i if pulled s times up to round t)
          = knowledge + uncertainty        (optimism)
          = μ_{i,s} + ρ √( log(1/δ) / (2s) )

Optimism in face of uncertainty:
Current knowledge: average reward μ_{i,s}
Current uncertainty: number of pulls s

The Upper–Confidence Bound (UCB) Algorithm (cont'd)

At each round t = 1, . . . , n
I Compute the score of each arm i

B_{i,t} = μ_{i,T_{i,t}} + ρ √( log(t) / (2 T_{i,t}) )

I Pull arm

I_t = arg max_{i=1,...,K} B_{i,t}

I Update the number of pulls T_{I_t,t} = T_{I_t,t−1} + 1 and μ_{i,T_{i,t}}
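A minimal Python sketch of this anytime UCB(ρ) loop, reusing the BernoulliBandit environment sketched earlier; the initialization (one pull per arm) and the function name ucb are implementation choices.

    import numpy as np

    def ucb(bandit, K, n, rho=1.0):
        """Anytime UCB: B_{i,t} = mu_{i,T_i} + rho * sqrt(log(t) / (2 T_i))."""
        counts = np.zeros(K)                         # T_{i,t}
        means = np.zeros(K)                          # empirical means mu_{i,T_{i,t}}
        for t in range(1, n + 1):
            if t <= K:                               # pull each arm once to initialize
                i = t - 1
            else:
                bonus = rho * np.sqrt(np.log(t) / (2.0 * counts))
                i = int(np.argmax(means + bonus))
            x = bandit.pull(i)
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]   # incremental mean update
        return counts, means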

The Upper–Confidence Bound (UCB) Algorithm (cont'd)

Theorem
Let X_1, . . . , X_n be i.i.d. samples from a distribution bounded in [a, b]. Then for any δ ∈ (0, 1),

P[ | (1/n) Σ_{t=1}^n X_t − E[X_1] | > (b − a) √( log(2/δ) / (2n) ) ] ≤ δ
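For a sense of scale, a one-line helper (illustrative, not from the slides) evaluating the two-sided Hoeffding width of the theorem above:

    import numpy as np

    def hoeffding_width(n, delta, a=0.0, b=1.0):
        """Deviation (b - a) * sqrt(log(2/delta) / (2n)) from the theorem above."""
        return (b - a) * np.sqrt(np.log(2.0 / delta) / (2.0 * n))

    print(hoeffding_width(n=100, delta=0.05))   # about 0.136 for rewards in [0, 1]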

The Upper–Confidence Bound (UCB) Algorithm (cont'd)

After s pulls of arm i,

P[ μ_i ≤ (1/s) Σ_{t=1}^s X_{i,t} + √( log(1/δ) / (2s) ) ] ≥ 1 − δ

that is,

P[ μ_i ≤ μ_{i,s} + √( log(1/δ) / (2s) ) ] ≥ 1 − δ

⇒ UCB uses an upper confidence bound on the expectation

The Upper–Confidence Bound (UCB) Algorithm (cont'd)

Theorem
For any set of K arms with distributions bounded in [0, b], if δ = 1/t, then UCB(ρ) with ρ > 1 achieves a regret

R_n(A) ≤ Σ_{i≠i*} [ (4 b² / Δ_i) ρ log(n) + Δ_i ( 3/2 + 1 / (2(ρ − 1)) ) ]

The Upper–Confidence Bound (UCB) Algorithm (cont'd)

Let K = 2 with i* = 1:

R_n(A) ≤ O( (1/Δ) ρ log(n) )

Remark 1: the cumulative regret slowly increases as log(n)
Remark 2: the smaller the gap, the bigger the regret... why?


The Upper–Confidence Bound (UCB) Algorithm (cont’d)

Show time (again)!

Multi-armed Bandit Problems / A (distribution-dependent) Lower Bound for the Regret

Outline

Motivation

Multi-armed Bandit Problems
  Introduction
  The Bandit Model
  Bandit Algorithms: UCB
  A (distribution-dependent) Lower Bound for the Regret
  Worst-case Performance

Extensions

Asymptotically Optimal Strategies

I A strategy π is said to be consistent if, for any (ν_i)_i ∈ F^K,

(1/n) E[S_n] → μ*

I The strategy is efficient if for all θ ∈ [0, 1]^K and all α > 0,

R_n(A) = o(n^α)

I There are efficient strategies, and we consider the best achievable asymptotic performance among efficient strategies

The Bound of Lai and Robbins

One-parameter reward distributions ν_i = ν_{θ_i}, θ_i ∈ Θ ⊂ R.

Theorem [Lai and Robbins, '85]
If π is an efficient strategy, then, for any θ ∈ Θ^K,

lim inf_{n→∞} R_n(A) / log(n) ≥ Σ_{i : μ_i < μ*} (μ* − μ_i) / KL(ν_i, ν*)

where KL(ν, ν′) denotes the Kullback–Leibler divergence.

For example, in the Bernoulli case:

KL(B(p), B(q)) = d_ber(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))
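The Bernoulli case of this bound is easy to evaluate; here is a short sketch (the helper names kl_bernoulli and lai_robbins_constant are illustrative) computing the constant in front of log(n):

    import numpy as np

    def kl_bernoulli(p, q, eps=1e-12):
        """d_ber(p, q) = p log(p/q) + (1 - p) log((1 - p)/(1 - q))."""
        p = np.clip(p, eps, 1 - eps)
        q = np.clip(q, eps, 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def lai_robbins_constant(mu):
        """Sum over suboptimal arms of (mu* - mu_i) / KL(B(mu_i), B(mu*))."""
        mu = np.asarray(mu, dtype=float)
        mu_star = mu.max()
        sub = mu < mu_star
        return float(np.sum((mu_star - mu[sub]) / kl_bernoulli(mu[sub], mu_star)))

    print(lai_robbins_constant([0.2, 0.5, 0.8]))   # lower-bound constant for three Bernoulli arms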

The Bound of Burnetas and Katehakis

More general reward distributions ν_i ∈ F_i.

Theorem [Burnetas and Katehakis, '96]
If π is an efficient strategy, then, for any θ ∈ [0, 1]^K,

lim inf_{n→∞} R_n / log(n) ≥ Σ_{i : μ_i < μ*} (μ* − μ_i) / K_inf(ν_i, μ*)

where

K_inf(ν_i, μ*) = inf{ KL(ν_i, ν′) : ν′ ∈ F_i, E(ν′) ≥ μ* }

[Figure: illustration of K_inf(ν_a, μ*)]

Intuition

I First assume that μ* is known and that n is fixed
I How many draws n_i of ν_i are necessary to know that μ_i < μ* with probability at least 1 − 1/n?
I Test: H_0 : μ_i = μ* against H_1 : ν = ν_i
I Stein's Lemma: if the type-I error satisfies α_{n_i} ≤ 1/n, then the type-II error satisfies

β_{n_i} ≈ exp( − n_i K_inf(ν_i, μ*) )

⟹ it can be smaller than 1/n if

n_i ≥ log(n) / K_inf(ν_i, μ*)

I How to do as well without knowing μ* and n in advance? Non-asymptotically?

Multi-armed Bandit Problems / Worst-case Performance

Outline

Motivation

Multi-armed Bandit Problems
  Introduction
  The Bandit Model
  Bandit Algorithms: UCB
  A (distribution-dependent) Lower Bound for the Regret
  Worst-case Performance

Extensions

The Worst-case Performance

Remark: the regret bound is distribution-dependent

R_n(A; Δ) ≤ O( (1/Δ) ρ log(n) )

Meaning: the algorithm is able to adapt to the specific problem at hand!

Worst-case performance: what is the distribution which leads to the worst possible performance of UCB? What is the distribution-free performance of UCB?

R_n(A) = sup_Δ R_n(A; Δ)

The Worst-case Performance

Problem: it seems like if Δ → 0 then the regret tends to infinity...

... which is nonsense, because the regret is also

R_n(A; Δ) = E[T_{2,n}] Δ

(for K = 2 with i* = 1), so if Δ is small, the regret is also small...

In fact

R_n(A; Δ) = min{ O( (1/Δ) ρ log(n) ), E[T_{2,n}] Δ }

The Worst-case Performance

Then

R_n(A) = sup_Δ R_n(A; Δ) = sup_Δ min{ O( (1/Δ) ρ log(n) ), n Δ } ≈ √n

for Δ = √(1/n).

Remark (non-stochastic bandits): it is possible to ensure the same O(√n) regret even without any stochastic assumption on the reward process.

Tuning the confidence δ of UCB

Remark: UCB is an anytime algorithm (δ = 1/t)

B_{i,s,t} = μ_{i,s} + ρ √( log(t) / (2s) )

Remark: If the time horizon n is known, then the optimal choice is δ = 1/n

B_{i,s,t} = μ_{i,s} + ρ √( log(n) / (2s) )

Tuning the confidence δ of UCB (cont'd)

Intuition: UCB should pull the suboptimal arms
I Enough: so as to understand which arm is the best
I Not too much: so as to keep the regret as small as possible

The confidence 1 − δ has the following impact (similar for ρ)
I Big 1 − δ: high level of exploration
I Small 1 − δ: high level of exploitation

Solution: depending on the time horizon, we can tune how to trade off between exploration and exploitation

UCB Proof

Let's dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]

E = { ∀i, s :  | μ_{i,s} − μ_i | ≤ √( log(1/δ) / (2s) ) }

By Chernoff–Hoeffding, P[E] ≥ 1 − nKδ.

At time t we pull arm i [algorithm]

B_{i,T_{i,t−1}} ≥ B_{i*,T_{i*,t−1}}

that is,

μ_{i,T_{i,t−1}} + √( log(1/δ) / (2 T_{i,t−1}) ) ≥ μ_{i*,T_{i*,t−1}} + √( log(1/δ) / (2 T_{i*,t−1}) )

On the event E we have [math]

μ_i + 2 √( log(1/δ) / (2 T_{i,t−1}) ) ≥ μ_{i*}

UCB Proof (cont'd)

Assume t is the last time arm i is pulled; then T_{i,n} = T_{i,t−1} + 1, thus

μ_i + 2 √( log(1/δ) / (2 (T_{i,n} − 1)) ) ≥ μ_{i*}

Reordering [math]

T_{i,n} ≤ log(1/δ) / (2 Δ_i²) + 1

under event E, and thus with probability at least 1 − nKδ.

Moving to the expectation [statistics]

E[T_{i,n}] = E[T_{i,n} 1{E}] + E[T_{i,n} 1{E^c}] ≤ log(1/δ) / (2 Δ_i²) + 1 + n (nKδ)

Trading off the two terms with δ = 1/n², we obtain the score

μ_{i,T_{i,t−1}} + √( 2 log(n) / (2 T_{i,t−1}) )

and

E[T_{i,n}] ≤ log(n) / Δ_i² + 1 + K

Tuning the confidence δ of UCB (cont'd)

Multi-armed Bandit: the regret is the same for δ = 1/t and δ = 1/n...
... almost (i.e., in expectation)

Tuning the confidence δ of UCB (cont'd)

The value-at-risk of the regret for UCB-anytime

Tuning the ρ of UCB (cont'd)

UCB values (for the δ = 1/n algorithm)

B_{i,s} = μ_{i,s} + ρ √( log(n) / (2s) )

Theory
I ρ < 0.5: polynomial regret w.r.t. n
I ρ > 0.5: logarithmic regret w.r.t. n

Practice: ρ = 0.2 is often the best choice

Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate confidence intervals.

Algorithm
I Compute the score of each arm i

B_{i,t} = μ_{i,T_{i,t}} + √( 2 σ²_{i,T_{i,t}} log(t) / T_{i,t} ) + 8 log(t) / (3 T_{i,t})

I Pull arm

I_t = arg max_{i=1,...,K} B_{i,t}

I Update the number of pulls T_{I_t,t}, the empirical mean μ_{i,T_{i,t}}, and the empirical variance σ²_{i,T_{i,t}}

Regret

R_n ≤ O( (σ² / Δ) log(n) )
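A sketch of the UCB-V index as written above (the empirical mean and variance are maintained by the caller; the function name is illustrative):

    import numpy as np

    def ucbv_score(mean, var, count, t):
        """B_{i,t} = mean + sqrt(2 * var * log t / count) + 8 * log t / (3 * count)."""
        log_t = np.log(t)
        return mean + np.sqrt(2.0 * var * log_t / count) + 8.0 * log_t / (3.0 * count)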

Improvements: KL-UCB

Idea: use even tighter confidence intervals based on the Kullback–Leibler divergence

d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))

Algorithm: compute the score of each arm i (a convex optimization problem)

B_{i,t} = max{ q ∈ [0, 1] : T_{i,t} d(μ_{i,T_{i,t}}, q) ≤ log(t) + c log(log(t)) }

Regret: pulls to suboptimal arms

E[T_{i,n}] ≤ (1 + ε) log(n) / d(μ_i, μ*) + C_1 log(log(n)) + C_2(ε) / n^{β(ε)}

where d(μ_i, μ*) > 2 Δ_i²
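Since d(μ_{i,T_{i,t}}, q) is increasing in q above the empirical mean, the KL-UCB index can be computed by bisection; here is a sketch for Bernoulli rewards (the tolerance, the constant c, and the clamping of the threshold are implementation choices, not from the slides):

    import numpy as np

    def kl_bernoulli(p, q, eps=1e-12):
        p = min(max(p, eps), 1 - eps)
        q = min(max(q, eps), 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def klucb_index(mean, count, t, c=3.0, tol=1e-6):
        """max{q in [mean, 1] : count * d(mean, q) <= log t + c log log t}, by bisection."""
        threshold = max(np.log(t) + c * np.log(max(np.log(t), 1e-12)), 0.0) / count
        lo, hi = mean, 1.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if kl_bernoulli(mean, mid) > threshold:
                hi = mid
            else:
                lo = mid
        return lo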

Improvements: Thompson strategy

Idea: use a Bayesian approach to estimate the means (μ_i)_i

Algorithm: assuming Bernoulli arms and a Beta prior on each mean,
I Compute the posterior

D_{i,t} = Beta(S_{i,t} + 1, F_{i,t} + 1)

I Draw a mean sample

μ_{i,t} ∼ D_{i,t}

I Pull arm

I_t = arg max_i μ_{i,t}

I If X_{I_t,t} = 1 update S_{I_t,t+1} = S_{I_t,t} + 1, else update F_{I_t,t+1} = F_{I_t,t} + 1

Regret:

lim_{n→∞} R_n / log(n) = Σ_{i : Δ_i > 0} Δ_i / d(μ_i, μ*)
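A minimal Python sketch of this Thompson strategy for Bernoulli arms, reusing the BernoulliBandit environment sketched earlier (the Beta(1, 1) prior is the one assumed on the slide; names are illustrative):

    import numpy as np

    def thompson_bernoulli(bandit, K, n, seed=0):
        """Thompson sampling with a Beta(S_i + 1, F_i + 1) posterior per arm."""
        rng = np.random.default_rng(seed)
        S = np.zeros(K)                              # successes S_{i,t}
        F = np.zeros(K)                              # failures  F_{i,t}
        for t in range(n):
            samples = rng.beta(S + 1, F + 1)         # one posterior draw per arm
            i = int(np.argmax(samples))              # I_t = arg max_i mu_{i,t}
            x = bandit.pull(i)
            if x == 1.0:
                S[i] += 1
            else:
                F[i] += 1
        return S, F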

How to efficiently explore an MDP

The Exploration-Exploitation Dilemma
I Multi-Armed Bandit
I Contextual Linear Bandit
I Reinforcement Learning

The Contextual Linear Bandit Problem

Motivating examples
I Different users may have different preferences
I The set of available news may change over time
I We want to minimise the regret w.r.t. the best news for each user

The Contextual Linear Bandit Problem

The problem: at each time t = 1, . . . , n
I User u_t arrives and a set of news A_t is provided
I The user u_t together with a news a ∈ A_t are described by a feature vector x_{t,a}
I The learner chooses a news a_t and receives a reward r_{t,a_t}

The optimal news: at each time t = 1, . . . , n, the optimal news is

a*_t = arg max_{a ∈ A_t} E[r_{t,a}]

The regret:

R_n = E[ Σ_{t=1}^n r_{t,a*_t} ] − E[ Σ_{t=1}^n r_{t,a_t} ]

The Contextual Linear Bandit Problem

The linear assumption: the expected reward is linear in the context, with an unknown parameter vector per action

E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ_a

The Contextual Linear Bandit Problem

The linear regression estimate:
I T_a = {t : a_t = a}
I Construct the design matrix D_a ∈ R^{|T_a| × d} of all the contexts observed when action a has been taken
I Construct the reward vector c_a ∈ R^{|T_a|} of all the rewards observed when action a has been taken
I Estimate θ_a by (ridge) regression:

θ_a = (D_a^T D_a + I)^{−1} D_a^T c_a

The Contextual Linear Bandit Problem

Optimism in face of uncertainty: the LinUCB algorithm
I Chernoff–Hoeffding in this case becomes

| x_{t,a}^T θ_a − E[r_{t,a} | x_{t,a}] | ≤ α √( x_{t,a}^T (D_a^T D_a + I)^{−1} x_{t,a} )

I and the UCB strategy is

a_t = arg max_{a ∈ A_t} [ x_{t,a}^T θ_a + α √( x_{t,a}^T (D_a^T D_a + I)^{−1} x_{t,a} ) ]
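A sketch of this (disjoint-models) LinUCB strategy; maintaining A_a = D_a^T D_a + I and b_a = D_a^T c_a incrementally, as below, is an implementation choice, and all names are illustrative:

    import numpy as np

    class LinUCB:
        """One ridge-regression model per action, scored optimistically."""
        def __init__(self, actions, d, alpha=1.0):
            self.alpha = alpha
            self.A = {a: np.eye(d) for a in actions}     # D_a^T D_a + I
            self.b = {a: np.zeros(d) for a in actions}   # D_a^T c_a

        def choose(self, contexts):
            """contexts: dict mapping each available action a to its feature vector x_{t,a} (numpy arrays)."""
            best, best_score = None, -np.inf
            for a, x in contexts.items():
                A_inv = np.linalg.inv(self.A[a])
                theta = A_inv @ self.b[a]                # estimated theta_a
                score = x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)
                if score > best_score:
                    best, best_score = a, score
            return best

        def update(self, a, x, r):
            self.A[a] += np.outer(x, x)
            self.b[a] += r * x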

The Contextual Linear Bandit Problem

The evaluation problem
I Online evaluation: too expensive
I Offline evaluation: how to use the logged data?

The Contextual Linear Bandit Problem

Evaluation from logged data
I Assumption 1: contexts and rewards are i.i.d. from a stationary distribution

(x_1, . . . , x_K, r_1, . . . , r_K) ∼ D

I Assumption 2: the logging strategy is random

The Contextual Linear Bandit Problem

Evaluation from logged data: given a bandit strategy π, a desired number of samples T, and an (infinite) stream of logged data
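The slide only states the setup; one standard way of carrying out such an offline evaluation under the two assumptions above is a rejection-sampling ("replay") estimator, sketched below. The interface (choose/update, the event tuples) is assumed for illustration and matches the LinUCB sketch given earlier; it is not the slides' own procedure.

    def replay_evaluate(policy, logged_stream, T):
        """Offline evaluation of a bandit policy on randomly-logged events.

        logged_stream yields (contexts, logged_action, reward); an event is kept
        only when the policy picks the same action the logger picked.
        """
        total, kept = 0.0, 0
        for contexts, logged_action, reward in logged_stream:
            if kept >= T:
                break
            if policy.choose(contexts) == logged_action:
                policy.update(logged_action, contexts[logged_action], reward)
                total += reward
                kept += 1
        return total / max(kept, 1)      # estimated per-round reward of the policy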

Extensions

Outline

Motivation

Multi-armed Bandit Problems

Extensions
  Some Examples
  Best Arm Identification
  Exploration with Probabilistic Expert Advice


Page 195: A. LAZARIC (SequeL Team @INRIA-Lille - IRIT · I Robotics for entertainment A. LAZARIC – Introduction to Reinforcement Learning Sept 14th, 2015 - 6/124. Motivation Interactive Learning

Extensions Some Examples

Non-stationary Bandits

▶ Changepoint: reward distributions change abruptly
▶ Goal: follow the best arm
▶ Application: scanning tunnelling microscope
▶ Variants D-UCB and SW-UCB, which include a progressive discount of the past (a sliding-window sketch follows below)
▶ Bounds O(√(n log n)) are proved, which is (almost) optimal
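
As a rough illustration of the sliding-window idea behind SW-UCB, here is a hedged sketch that computes a UCB index from the last tau plays only; the window length tau and the constant c are arbitrary choices, not the tuned values from the analysis:

    import numpy as np

    def sw_ucb_index(past_arms, past_rewards, arm, t, tau=1000, c=2.0):
        # Keep only the last tau (arm, reward) pairs, then build a UCB index from them.
        window = list(zip(past_arms, past_rewards))[-tau:]
        n = sum(1 for a, _ in window if a == arm)
        if n == 0:
            return np.inf                                    # unexplored in the window: try it
        mean = sum(r for a, r in window if a == arm) / n
        return mean + np.sqrt(c * np.log(min(t, tau)) / n)   # exploration bonus on recent data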


Extensions Some Examples

Generalized Linear Bandits

▶ Bandit with contextual information:

  E[X_t | I_t] = µ(m_{I_t}′ θ*)

  where θ* ∈ R^d is an unknown parameter and µ : R → R is a link function
▶ Example: binary rewards, with the logistic link (a small sketch of this model follows below)

  µ(x) = exp(x) / (1 + exp(x))

▶ Application: targeted web ads
▶ GLM-UCB: regret bound depending on the dimension d and not on the number of arms
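
A small sketch of the binary-reward model with the logistic link; theta_star and the feature vector are made-up numbers, only the form E[X_t | I_t] = µ(m′θ*) comes from the slide:

    import numpy as np

    def logistic(x):
        return np.exp(x) / (1.0 + np.exp(x))        # the link function mu

    theta_star = np.array([0.5, -1.0, 2.0])         # unknown parameter (illustrative)
    m_a = np.array([1.0, 0.2, 0.4])                 # feature vector of one arm
    p = logistic(m_a @ theta_star)                  # expected Bernoulli reward of that arm
    reward = np.random.binomial(1, p)               # one observed binary reward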


Extensions Some Examples

Stochastic Optimization

▶ Goal: find the maximum of a function f : C ⊂ R^d → R, (possibly) observed in noise
▶ Application: DAS

[figure: noisy evaluations of a one-dimensional function on [0, 1]]

▶ Model: f is the realization of a Gaussian Process (or has a small norm in some RKHS)
▶ GP-UCB: evaluate f at the point x ∈ C where the confidence interval for f(x) has the highest upper bound
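
A hedged sketch of one GP-UCB step using scikit-learn's Gaussian process regression; the RBF kernel, the noise level, and the exploration weight beta are arbitrary illustrative choices:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def gp_ucb_next_point(X_obs, y_obs, candidates, beta=2.0):
        # Fit a GP posterior on past noisy evaluations of f, then pick the candidate
        # whose upper confidence bound mu(x) + sqrt(beta) * sigma(x) is highest.
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-2)
        gp.fit(X_obs, y_obs)
        mu, sigma = gp.predict(candidates, return_std=True)
        return candidates[np.argmax(mu + np.sqrt(beta) * sigma)]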


Extensions Best Arm Identification

Outline

Motivation

Multi-armed Bandit Problems

Extensions
    Some Examples
    Best Arm Identification
    Exploration with Probabilistic Expert Advice


Extensions Best Arm Identification

Motivation


Extensions Best Arm Identification

Goal ≠ regret minimization

Improve performance:

▶ fixed number of test users → smaller probability of error
▶ fixed probability of error → fewer test users

Tools: sequential allocation and stopping


Extensions Best Arm Identification

The model

A two-armed bandit model is
▶ a set ν = (ν_1, ν_2) of two probability distributions ('arms') with respective means µ_1 and µ_2
▶ a* = argmax_a µ_a is the (unknown) best arm

To find the best arm, an agent interacts with the bandit model with
▶ a sampling rule (A_t)_{t∈N}, where A_t ∈ {1, 2} is the arm chosen at time t (based on past observations) → a sample Z_t ∼ ν_{A_t} is observed
▶ a stopping rule τ indicating when he stops sampling the arms
▶ a recommendation rule â_τ ∈ {1, 2} indicating which arm he thinks is best (at the end of the interaction)

In classical A/B Testing, the sampling rule A_t is uniform on {1, 2} and the stopping rule τ = t is fixed in advance.


Extensions Best Arm Identification

Two possible goals

The agent's goal is to design a strategy A = ((A_t), τ, â_τ) satisfying:

▶ Fixed-budget setting: τ = t is fixed in advance; make p_t(ν) := P_ν(â_t ≠ a*) as small as possible
▶ Fixed-confidence setting: guarantee P_ν(â_τ ≠ a*) ≤ δ; make E_ν[τ] as small as possible

An algorithm using uniform sampling is
▶ in the fixed-budget setting: a classical test of (µ_1 > µ_2) against (µ_1 < µ_2) based on t samples
▶ in the fixed-confidence setting: a sequential test of (µ_1 > µ_2) against (µ_1 < µ_2) with probability of error uniformly bounded by δ

[Siegmund 85]: sequential tests can save samples!
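
To make the contrast concrete, here is a hedged sketch of a fixed-confidence procedure with uniform sampling and a crude Hoeffding-plus-union-bound stopping rule; the threshold is a simple illustrative choice, far from the sharpest known tests:

    import numpy as np

    def ab_test_fixed_confidence(sample1, sample2, delta=0.05, max_n=100000):
        # sample1() and sample2() return one reward in [0, 1]; stop when the empirical
        # means separate by more than twice a Hoeffding-style confidence radius.
        s = np.zeros(2)
        for n in range(1, max_n + 1):
            s[0] += sample1()
            s[1] += sample2()
            eps = np.sqrt(np.log(8 * n * n / delta) / (2 * n))   # union bound over arms and time
            if abs(s[0] - s[1]) / n > 2 * eps:
                return int(np.argmax(s)) + 1, n                  # recommended arm, samples per arm
        return int(np.argmax(s)) + 1, max_n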


Extensions Exploration with Probabilistic Expert Advice

Outline

Motivation

Multi-armed Bandit Problems

Extensions
    Some Examples
    Best Arm Identification
    Exploration with Probabilistic Expert Advice


Extensions Exploration with Probabilistic Expert Advice

The model

Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality, JMLR 2013, joint work with S. Bubeck and D. Ernst
▶ Subset A ⊂ X of important items
▶ |X| ≫ 1, |A| ≪ |X|
▶ Access to X only through probabilistic experts (P_i)_{1≤i≤K}: sequential independent draws

Goal: discover rapidly the elements of A


Extensions Exploration with Probabilistic Expert Advice

Optimal Exploration with Probabilistic Expert Advice

Search space: A ⊂ Ω, a discrete set
Probabilistic experts: P_i ∈ M_1(Ω) for i ∈ {1, . . . , K}
Requests: at time t, calling expert I_t yields a realization of X_t = X_{I_t,t}, independent, with law P_{I_t}
Goal: find as many distinct elements of A as possible with few requests:

  F_n = Card(A ∩ {X_1, . . . , X_n})


Extensions Exploration with Probabilistic Expert Advice

Goal

At each time step t = 1, 2, . . . :
▶ pick an index I_t = π_t(I_1, Y_1, . . . , I_{t−1}, Y_{t−1}) ∈ {1, . . . , K} according to past observations
▶ observe Y_t = X_{I_t, n_{I_t,t}} ∼ P_{I_t}, where n_{i,t} = ∑_{s≤t} 1{I_s = i}

Goal: design the strategy π = (π_t)_t so as to maximize the number of important items found after t requests:

  F_π(t) = |A ∩ {Y_1, . . . , Y_t}|

Assumption: non-intersecting supports

  A ∩ supp(P_i) ∩ supp(P_j) = ∅ for i ≠ j


Extensions Exploration with Probabilistic Expert Advice

Is it a Bandit Problem?

It looks like a bandit problem. . .
▶ sequential choices among K options
▶ want to maximize cumulative rewards
▶ exploration vs exploitation dilemma

. . . but it is not a bandit problem!
▶ rewards are not i.i.d.
▶ destructive rewards: no interest in observing the same important item twice
▶ all strategies are eventually equivalent


Extensions Exploration with Probabilistic Expert Advice

The oracle strategy

Proposition: Under the non-intersecting support hypothesis, the greedy oracle strategy selecting the expert with the highest 'missing mass'

  I*_t ∈ arg max_{1≤i≤K} P_i(A \ {Y_1, . . . , Y_t})

is optimal: for every possible strategy π, E[F_π(t)] ≤ E[F*(t)].

Remark: the proposition is false if the supports may intersect

=⇒ estimate the “missing mass of important items”!


Extensions Exploration with Probabilistic Expert Advice

Missing mass estimation

Let us first focus on one expert i: P = P_i, X_n = X_{i,n}

X_1, . . . , X_n independent draws of P

  O_n(x) = ∑_{m=1}^{n} 1{X_m = x}

How to 'estimate' the total mass of the unseen important items

  R_n = ∑_{x∈A} P(x) 1{O_n(x) = 0} ?


Extensions Exploration with Probabilistic Expert Advice

The Good-Turing Estimator

Idea: use the hapaxes = items seen only once (linguistics); a small computational sketch follows below

  R̂_n = U_n / n, where U_n = ∑_{x∈A} 1{O_n(x) = 1}

Lemma [Good '53]: For every distribution P,

  0 ≤ E[R_n] − E[R̂_n] ≤ 1/n

Proposition: With probability at least 1 − δ, for every P,

  R̂_n − 1/n − (1 + √2) √(log(4/δ)/n) ≤ R_n ≤ R̂_n + (1 + √2) √(log(4/δ)/n)

See [McAllester and Schapire '00, McAllester and Ortiz '03]:
▶ deviations of R̂_n: McDiarmid's inequality
▶ deviations of R_n: negative association
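
A minimal sketch of the Good-Turing estimate R̂_n = U_n / n computed from a sample, assuming an item can be recognized as important when it is drawn (the set `important` is user-supplied here):

    from collections import Counter

    def good_turing_missing_mass(samples, important):
        # U_n = number of important items seen exactly once (the hapaxes), divided by n.
        counts = Counter(samples)
        hapaxes = sum(1 for x, k in counts.items() if k == 1 and x in important)
        return hapaxes / len(samples)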


Extensions Exploration with Probabilistic Expert Advice

The Good-UCB algorithm [Bubeck, Ernst & G.]

Optimistic algorithm based on Good-Turing's estimator:

  I_{t+1} = arg max_{i∈{1,...,K}}  H_i(t)/N_i(t) + c √(log(t)/N_i(t))

▶ N_i(t) = number of draws of P_i up to time t
▶ H_i(t) = number of elements of A seen exactly once thanks to P_i
▶ c = tuning parameter
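
A hedged sketch of one Good-UCB decision, again assuming importance is recognized on sight; the data layout (one list of past draws per expert) and the default value of c are illustrative choices:

    import numpy as np
    from collections import Counter

    def good_ucb_choose(draws_per_expert, important, t, c=1.0):
        # Index of expert i: H_i(t)/N_i(t) + c * sqrt(log(t)/N_i(t)),
        # with H_i(t) the number of important items drawn exactly once from expert i.
        scores = []
        for draws in draws_per_expert:
            n_i = len(draws)
            if n_i == 0:
                scores.append(np.inf)            # unexplored expert: try it first
                continue
            counts = Counter(draws)
            h_i = sum(1 for x, k in counts.items() if k == 1 and x in important)
            scores.append(h_i / n_i + c * np.sqrt(np.log(t) / n_i))
        return int(np.argmax(scores))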


Extensions Exploration with Probabilistic Expert Advice

Classical analysis

Theorem: For any t ≥ 1, under the non-intersecting support assumption, Good-UCB (with constant C = (1 + √2)√3) satisfies

  E[F*(t) − F^UCB(t)] ≤ 17 √(Kt log(t)) + 20 √(Kt) + K + K log(t/K)

Remark: the usual result for a bandit problem, but with a not-so-simple analysis


Extensions Exploration with Probabilistic Expert Advice

A Typical Run of Good-UCB


Extensions Exploration with Probabilistic Expert Advice

The macroscopic limit
▶ Restricted framework: P_i = U{1, . . . , N}
▶ N → ∞
▶ |A ∩ supp(P_i)|/N → q_i ∈ (0, 1), q = ∑_i q_i


Extensions Exploration with Probabilistic Expert Advice

The Oracle behaviour

The limiting discovery process of the Oracle strategy is deterministic

Proposition: For every λ ∈ (0, q_1), for every sequence (λ_N)_N converging to λ as N goes to infinity, almost surely

  lim_{N→∞} T^N_*(λ_N) / N = ∑_i (log(q_i / λ))_+


Extensions Exploration with Probabilistic Expert Advice

Oracle vs. uniform sampling

Oracle: the proportion of important items not found after Nt draws tends to

  q − F*(t) = I(t) q̃_{I(t)} exp(−t/I(t)) ≤ K q̃_K exp(−t/K)

with q̃_K = (∏_{i=1}^K q_i)^{1/K} the geometric mean of the (q_i)_i.

Uniform: the proportion of important items not found after Nt draws tends to K q̄_K exp(−t/K), with q̄_K = (1/K) ∑_{i=1}^K q_i the arithmetic mean.

⇒ Asymptotic ratio of efficiency

  ρ(q) = q̄_K / q̃_K = ( (1/K) ∑_{i=1}^K q_i ) / ( ∏_{i=1}^K q_i )^{1/K} ≥ 1

larger if the (q_i)_i are unbalanced (for instance, (q_1, q_2) = (0.09, 0.01) gives ρ(q) = 0.05/0.03 ≈ 1.67, whereas balanced proportions give ρ(q) = 1).


Extensions Exploration with Probabilistic Expert Advice

Macroscopic optimality

Theorem: Take C = (1 + √2)√(c + 2) with c > 3/2 in the Good-UCB algorithm.
▶ For every sequence (λ_N)_N converging to λ as N goes to infinity, almost surely

  lim sup_{N→+∞} T^N_UCB(λ_N) / N ≤ ∑_i (log(q_i / λ))_+

▶ The proportion of items found after Nt steps, F^GUCB, converges uniformly to F* as N goes to infinity


Extensions Exploration with Probabilistic Expert Advice

Simulation

[figure] Number of items found by Good-UCB (line), the oracle (bold dashed), and by uniform sampling (light dotted) as a function of time, for sample sizes N = 128, N = 500, N = 1000 and N = 10000, in an environment with 7 experts.



Extensions Exploration with Probabilistic Expert Advice

Reinforcement Learning

Alessandro [email protected]

sequel.lille.inria.fr

dubbing: Aurélien [email protected]