
Page 1:

REINFORCEMENT LEARNING

LEARNING TO PERFORM

BEST ACTIONS BY REWARDS

Tayfun Gürel

Page 2:

ROADMAP:
Introduction to the problem
Models of Optimal Behavior
An Immediate Reward Example and Solution
Markov Decision Process
Model Free Methods
Model Based Methods
Partially Observable Markov Decision Process
Applications
Conclusion

Page 3:

RL brings a way of programming agents by reward and punishment without specifying how the task is to be achieved. (Kaelbling, 1996)

Based on trial-and-error interactions

A set of problems rather than a set of techniques

Page 4:

The standard reinforcement-learning model
(figure: the agent-environment interaction loop; i: input, r: reward, s: state, a: action)

Page 5:

The reinforcement learning model consists of:

a discrete set of environment states, S;

a discrete set of agent actions, A; and

a set of scalar reinforcement signals, typically {0, 1} or the real numbers

(different from supervised learning)

Page 6:

An example dialog of the agent-environment relationship:

Environment: You are in state 65. You have 4 possible actions.
Agent: I'll take action 2.
Environment: You received a reinforcement of 7 units. You are now in state 15. You have 2 possible actions.
Agent: I'll take action 1.
Environment: You received a reinforcement of -4 units. You are now in state 65. You have 4 possible actions.
Agent: I'll take action 2.
Environment: You received a reinforcement of 5 units. You are now in state 44. You have 5 possible actions.
...

Page 7:

Some Examples

Bioreactor
actions: stirring rate, temperature control
states: sensory readings related to chemicals
reward: instant production rate of target chemical

Recycling robot
actions: search for a can, wait, or recharge
states: low battery, high battery
reward: + for having a can, - for running out of battery

Page 8:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Page 9:

Models of Optimal Behavior: the agent tries to maximize one of the following:

finite-horizon model: E[ Σ_{t=0}^{h} r_t ]

infinite-horizon discounted model: E[ Σ_{t=0}^{∞} γ^t r_t ], 0 ≤ γ < 1

average-reward model: lim_{h→∞} E[ (1/h) Σ_{t=0}^{h} r_t ]
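To make the three objectives concrete, here is a small illustrative sketch (not from the original slides) that scores one hypothetical reward sequence under each model; the values of gamma and h are arbitrary example choices.

```python
# Illustrative only: score the same reward sequence under the three optimality models.
def finite_horizon_return(rewards, h):
    return sum(rewards[:h])

def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    return sum(rewards) / len(rewards)

rewards = [0, 0, 1, 0, 5]                       # hypothetical reward sequence
print(finite_horizon_return(rewards, h=3))      # 1
print(discounted_return(rewards, gamma=0.9))    # 0.9^2 * 1 + 0.9^4 * 5 = 4.0905
print(average_reward(rewards))                  # 1.2
```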

Page 10:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Page 11:

k-armed bandit
k gambling machines

h pulls are allowed

Machines are not equivalent: trying to learn the payoff probabilities

Tradeoff between exploitation and exploration
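One simple strategy for this tradeoff, shown here only as an illustration (it is not part of the slides), is epsilon-greedy pulling: estimate each machine's payoff rate from the pulls so far and usually pull the best-looking machine, but pull a random one with probability epsilon.

```python
import random

# Illustrative epsilon-greedy strategy for a k-armed bandit (hypothetical helper).
def epsilon_greedy_pull(wins, pulls, epsilon=0.1):
    k = len(pulls)
    if random.random() < epsilon:                        # explore: random machine
        return random.randrange(k)
    rates = [wins[i] / pulls[i] if pulls[i] else 0.0 for i in range(k)]
    return max(range(k), key=lambda i: rates[i])         # exploit: best estimated machine
```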

Page 12:

Solution Strategies 1: Dynamic Programming Approach

A belief state: the pull/payoff counts observed so far for each machine, (n_1, w_1, ..., n_k, w_k)

Expected pay-off remaining: V*(n_1, w_1, ..., n_k, w_k)

Probability of action i being paid: ρ_i = (w_i + 1) / (n_i + 2)

Page 13:

Dynamic Programming Approach:

If all pulls are used up (n_1 + ... + n_k = h), then V*(n_1, w_1, ..., n_k, w_k) = 0, because there are no remaining pulls.

So, all values can be recursively computed:
V*(n_1, w_1, ..., n_k, w_k) = max_i [ ρ_i (1 + V*(..., n_i + 1, w_i + 1, ...)) + (1 - ρ_i) V*(..., n_i + 1, w_i, ...) ]

Update the probabilities after each action (see the sketch below).
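A minimal sketch of this recursion, assuming the payoff estimate ρ_i = (w_i + 1) / (n_i + 2) above and a small number of machines and pulls; the function and variable names are mine, not the author's.

```python
from functools import lru_cache

# Hypothetical sketch: expected remaining payoff V* for a k-armed bandit, computed by
# dynamic programming over belief states ((n_1, w_1), ..., (n_k, w_k)).
def expected_remaining_payoff(k, h):
    @lru_cache(maxsize=None)
    def V(state):
        pulls_used = sum(n for n, _ in state)
        if pulls_used == h:                       # no remaining pulls
            return 0.0
        best = 0.0
        for i, (n, w) in enumerate(state):
            rho = (w + 1) / (n + 2)               # estimated probability that machine i pays
            win, lose = list(state), list(state)
            win[i], lose[i] = (n + 1, w + 1), (n + 1, w)
            value = rho * (1 + V(tuple(win))) + (1 - rho) * V(tuple(lose))
            best = max(best, value)
        return best
    return V(tuple((0, 0) for _ in range(k)))

print(expected_remaining_payoff(k=2, h=3))        # tiny example: 2 machines, 3 pulls
```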

Page 14:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Page 15:

MARKOV DECISION PROCESS

The k-armed bandit gives an immediate reward.
What about DELAYED REWARD?

Characteristics of an MDP:
a set of states: S
a set of actions: A
a reward function: R : S x A → R

a state transition function: T : S x A → Π(S)

T(s, a, s'): probability of a transition from s to s' using action a

Page 16:

MDP EXAMPLE:

(figure: a grid world with its transition function, states and rewards)

Bellman Equation:
V*(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V*(s') ]

(Greedy policy selection)
π*(s) = argmax_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V*(s') ]

Page 17:

Value Iteration Algorithm
Repeatedly apply the Bellman backup V(s) := max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V(s') ] for every state (see the sketch below).

AN ALTERNATIVE ITERATION (Singh, 1993):
Q(s, a) := R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q(s', a')

(Important for model-free learning)

Stop the iteration when V(s) changes by less than ε.
Policy difference ratio ≤ 2εγ / (1-γ) (Williams & Baird, 1993b)
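Below is a compact sketch of value iteration over a tabular MDP, assuming T and R are given as Python dictionaries; the two-state example at the bottom is made up for illustration and is not the grid world from the slides.

```python
# Hypothetical value iteration sketch: V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') * V(s') ]
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:           # stop when no state value changes by more than eps
            return V

# Made-up two-state MDP, only to show the assumed data layout
states, actions = ["s0", "s1"], ["a0", "a1"]
T = {("s0", "a0"): {"s0": 0.8, "s1": 0.2}, ("s0", "a1"): {"s1": 1.0},
     ("s1", "a0"): {"s0": 1.0},            ("s1", "a1"): {"s1": 1.0}}
R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0, ("s1", "a0"): 0.0, ("s1", "a1"): 0.5}
print(value_iteration(states, actions, T, R))
```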

Page 18:

Policy Iteration Algorithm
Alternate policy evaluation with greedy policy improvement (see the sketch below). Policies converge faster than values: the policy often becomes optimal before the value estimates have fully converged.

Why faster convergence?
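For comparison with value iteration, here is a sketch of policy iteration under the same hypothetical dictionary layout for T and R; it alternates full policy evaluation with greedy improvement, and the loop ends as soon as the greedy step changes nothing, which is why the policy can settle before the values do.

```python
# Hypothetical policy iteration sketch: evaluate the current policy, then improve it greedily.
def policy_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    policy = {s: actions[0] for s in states}
    V = {s: 0.0 for s in states}

    def backup(s, a):
        return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())

    while True:
        # Policy evaluation: iterate V under the fixed policy until it stops changing
        while True:
            delta = 0.0
            for s in states:
                v = backup(s, policy[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eps:
                break
        # Policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in states:
            best = max(actions, key=lambda a: backup(s, a))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```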

Page 19:

POLICY ITERATION ON GRID WORLD

Page 20:

POLICY ITERATION ON GRID WORLD

(figure: grid-world state values) 0, 0, 0, 0, 0, 0.8, -0.8, 0, 0

Page 21:

POLICY ITERATION ON GRID WORLD

(figure: grid-world state values, two successive iterations)
0, 0, 0, 0, 0, 0.8, -0.8, 0, 0
0, 0, 0, 0.64, 0, 0.8, 0.46, 0, 0

Page 22:

POLICY ITERATION ON GRID WORLD

(figure: grid-world state values) 0, 0, 0, 0.64, 0, 0.8, 0.46, 0, 0

Page 23:

POLICY ITERATION ON GRID WORLD

(figure: grid-world state values, two successive iterations)
0, 0, 0, 0.64, 0, 0.8, 0.46, 0, 0
0.51, 0, 0, 0.77, 0, 0.93, 0.59, 0.36, 0

Page 24:

POLICY ITERATION ON GRID WORLD

(figure: grid-world state values, three successive iterations)
0, 0, 0, 0.64, 0, 0.8, 0.46, 0, 0
0.51, 0, 0, 0.77, 0, 0.93, 0.59, 0.36, 0
0.66, 0.41, 0, 0.89, 0.32, 0.95, 0.70, 0.48, 0.19

Page 25:

MDP Graphical Representation

(figure: state-transition graph; edge labels β, α are transition probabilities T(s, action, s'))

Similarity to HMMs

Page 26:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Page 27:

Model Free Methods
Models of the environment:
T : S x A → Π(S) and R : S x A → R

Do we know them? Do we have to know them?

Monte Carlo Methods
Adaptive Heuristic Critic
Q Learning

Page 28:

Monte Carlo Methods
Idea: keep statistics of the returns observed for each state and take their average; this average is V(s).

Based only on experience.
Assumes episodic tasks (experience is divided into episodes, and all episodes terminate regardless of the actions selected).

Incremental in an episode-by-episode sense, not a step-by-step sense (see the sketch below).
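A sketch of this idea under simple assumptions: each episode is a list of (state, reward) pairs, with the reward being the one received on leaving that state, and V(s) is the average return observed after the first visit to s in each episode. All names here are illustrative.

```python
from collections import defaultdict

# Hypothetical first-visit Monte Carlo value estimation.
def mc_value_estimate(episodes, gamma=1.0):
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:                     # episode = [(state, reward), ...]
        G = 0.0
        first_visit_return = {}
        for state, reward in reversed(episode):  # walk backwards to accumulate returns
            G = reward + gamma * G
            first_visit_return[state] = G        # keeps the return from the earliest visit
        for state, G in first_visit_return.items():
            returns_sum[state] += G
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}

episodes = [[("s0", 0), ("s1", 1)], [("s0", 1)]]  # two made-up episodes
print(mc_value_estimate(episodes))                # {'s1': 1.0, 's0': 1.0}
```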

Page 29:

Problem: unvisited <s, a> pairs (the problem of maintaining exploration)

For every <s, a>, make sure that:
P(<s, a> is selected as the start state and action) > 0

(the assumption of exploring starts)

Page 30:

Monte Carlo Control

How to select policies:
(similar to policy evaluation)

Page 31:

ADAPTIVE HEURISTIC CRITIC & TD(λ)

How the AHC learns, the TD(0) algorithm:
V(s) := V(s) + α ( r + γ V(s') - V(s) )

AHC: (figure: actor-critic architecture, with the critic trained by the TD algorithm)
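A minimal sketch of the TD(0) update the critic applies after each transition (s, r, s'); alpha is the learning rate, and the function name is mine.

```python
# Hypothetical TD(0) update used by the critic: V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s))
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]   # temporal-difference error
    V[s] += alpha * td_error
    return td_error                           # in AHC, this error signal also trains the actor

V = {"s0": 0.0, "s1": 0.0}
td0_update(V, "s0", r=1.0, s_next="s1")
print(V)                                      # {'s0': 0.1, 's1': 0.0}
```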

Page 32:

Q LEARNING

Q values in value iteration:
Q(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q(s', a')

But we don't know T and R.

Instead, use the experienced transition (s, a, r, s') (see the sketch below):
Q(s, a) := Q(s, a) + α ( r + γ max_{a'} Q(s', a') - Q(s, a) )

If α is decayed properly, the Q values will converge. (Singh, 1994)
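A sketch of tabular Q-learning built from that update; the epsilon-greedy action choice is included only as one common way to explore and is an assumption on my part, not something the slide specifies.

```python
import random
from collections import defaultdict

# Hypothetical tabular Q-learning sketch.
Q = defaultdict(float)                            # Q[(s, a)], defaults to 0

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def choose_action(s, actions, epsilon=0.1):
    if random.random() < epsilon:                 # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])  # exploit the current Q estimates
```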

Page 33:

Q-LEARNING CRITIQUE:

Simpler than AHC learning
Q-learning is exploration insensitive (the Q values converge regardless of how the agent explores, as long as every pair keeps being tried)
Analogous to value iteration in MDPs
The most popular model-free learning algorithm

Page 34:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Page 35:

Model Based Methods

Model-free methods do not learn the model parameters.

That is an inefficient use of data! Learn the model.

Page 36:

Certainty Equivalent Methods:

First learn a model of the environment by keeping statistics, then learn the actions to take.

Objections:
Arbitrary division between the learning phase and the acting phase
Initial data gathering (how to choose an exploration strategy without knowing the model?)
What about changes in the environment?

Better to learn the model and to act simultaneously.

Page 37:

DYNA
After action a takes the agent from s to s' with reward r:

1. Update the transition model T and the reward function R.

2. Update the Q value for (s, a) using the learned model:
   Q(s, a) := R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q(s', a')

3. Do k more updates of the same kind for randomly chosen (s, a) pairs (see the sketch below).
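A rough Dyna-style sketch under two simplifying assumptions of mine: the model simply memorizes the last (r, s') seen for each (s, a), and the same sample backup is used for real and simulated experience. In the full algorithm the model is a learned transition distribution, as described above.

```python
import random
from collections import defaultdict

# Hypothetical Dyna-style sketch: learn from the real transition, remember it in a model,
# then do k extra planning backups on randomly chosen remembered (s, a) pairs.
Q = defaultdict(float)
model = {}                                        # (s, a) -> (r, s') most recently observed

def dyna_step(s, a, r, s_next, actions, alpha=0.1, gamma=0.9, k=5):
    def backup(s, a, r, s_next):
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    backup(s, a, r, s_next)                       # steps 1-2: learn from the real transition
    model[(s, a)] = (r, s_next)                   # update the (simplified) model
    for _ in range(k):                            # step 3: k planning updates from the model
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        backup(ps, pa, pr, ps_next)
```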

Page 38:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Page 39:

POMDPs
What if the state information (from sensors) is noisy? Mostly the case!

MDP techniques are suboptimal!

Two identical-looking halls are not the same state.

Page 40:

POMDPs – A Solution Strategy

SE: belief state estimator (can be based on an HMM)

π: MDP techniques applied to the belief state
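A sketch of the belief update such a state estimator would perform after taking action a and seeing observation o, assuming tabular transition probabilities T(s, a, s') and observation probabilities O(s', a, o); the normalization keeps the belief a proper distribution. These data structures are assumptions made for the illustration.

```python
# Hypothetical POMDP belief-state update:
#   b'(s') is proportional to O(s', a, o) * sum_s T(s, a, s') * b(s)
def belief_update(b, a, o, states, T, O):
    b_new = {}
    for s2 in states:
        b_new[s2] = O[(s2, a, o)] * sum(T[(s, a)].get(s2, 0.0) * b[s] for s in states)
    norm = sum(b_new.values())
    return {s2: v / norm for s2, v in b_new.items()} if norm > 0 else b_new
```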

Page 41:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Generalization, Partially Observable Markov Decision Process, Applications, Conclusion

Page 42:

APPLICATIONS

Juggling robot: dynamic programming (Schaal & Atkeson, 1994)

Box-pushing robot: Q-learning (Mahadevan & Connell, 1991a)

Disk-collecting robot: Q-learning (Mataric, 1994)

Page 43:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Generalization, Partially Observable Markov Decision Process, Applications, Conclusion

Page 44:

Conclusion

RL is not supervised learning

Planning rather than classification

Poor performance on large problems. New methods are needed (e.g. shaping, imitation, reflexes).

RL/MDPs extend HMMs. How?

Page 45:

MDP as a graph

Is it possible to represent it as an HMM?

Page 46:

Relation to HMMs
Recycling robot example revisited as an HMM problem

(figure: dynamic Bayesian network with nodes Battery t-1, Battery t, Battery t+1 and Action t-1, Action t, Action t+1)

Battery = {low, high}
Action = {wait, search, recharge}

Page 47:

Relation to HMMs
Recycling robot example revisited as an HMM problem

(same figure as the previous slide: Battery and Action nodes over time; Battery = {low, high}, Action = {wait, search, recharge})

Not representable as an HMM

Page 48:

HMMs vs MDPs

Once we have the MDP representation of the problem, we can do inference just as with an HMM by converting it to a probabilistic automaton. The reverse is not possible: an HMM has no actions or rewards.

Use HMMs to do probabilistic reasoning over time.
Use MDPs/RL to optimize behavior.
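One way to see this concretely: once a fixed policy π is chosen, the MDP's transition function collapses to an ordinary Markov chain over states, which HMM-style machinery can then reason over. The sketch below is my illustration of that collapse, reusing the made-up two-state layout from the earlier value-iteration sketch.

```python
# Hypothetical sketch: fixing a policy pi turns T(s, a, s') into a plain Markov chain P(s' | s).
def mdp_to_markov_chain(states, T, policy):
    return {s: dict(T[(s, policy[s])]) for s in states}   # P(s'|s) = T(s, pi(s), s')

T = {("s0", "a0"): {"s0": 0.8, "s1": 0.2}, ("s0", "a1"): {"s1": 1.0},
     ("s1", "a0"): {"s0": 1.0},            ("s1", "a1"): {"s1": 1.0}}
print(mdp_to_markov_chain(["s0", "s1"], T, {"s0": "a1", "s1": "a0"}))
# {'s0': {'s1': 1.0}, 's1': {'s0': 1.0}}
```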