
Page 1:

REINFORCEMENT LEARNING

LEARNING TO PERFORM

BEST ACTIONS BY REWARDS

Tayfun Gürel

Page 2:

ROADMAP:
Introduction to the problem
Models of Optimal Behavior
An Immediate Reward Example and Solution
Markov Decision Process
Model Free Methods
Model Based Methods
Partially Observable Markov Decision Process
Applications
Conclusion

Page 3:

RL brings a way of programming agents by reward and punishment without specifying how the task is to be achieved. (Kaelbling, 1996)

Based on trial-and-error interactions

A set of problems rather than a set of techniques

Page 4:

The standard reinforcement-learning model
(figure: the agent-environment interaction loop; i: input, r: reward, s: state, a: action)

Page 5:

The reinforcement learning model consists of:

a discrete set of environment states, S;

a discrete set of agent actions, A; and

a set of scalar reinforcement signals, typically {0, 1} or the real numbers

(different from supervised learning)

Page 6:

An example dialog of the agent-environment relationship:

Environment: You are in state 65. You have 4 possible actions.
Agent: I'll take action 2.
Environment: You received a reinforcement of 7 units. You are now in state 15. You have 2 possible actions.
Agent: I'll take action 1.
Environment: You received a reinforcement of -4 units. You are now in state 65. You have 4 possible actions.
Agent: I'll take action 2.
Environment: You received a reinforcement of 5 units. You are now in state 44. You have 5 possible actions.
...

Page 7:

Some Examples

Bioreactor
actions: stirring rate, temperature control
states: sensory readings related to chemicals
reward: instant production rate of target chemical

Recycling robot
actions: search for a can, wait, or recharge
states: low battery, high battery
reward: + for having a can, - for running out of battery

Page 8:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Page 9:

Models of Optimal Behavior: the agent tries to maximize one of the following:

finite-horizon model: E[ Σ_{t=0}^{h} r_t ]

infinite-horizon discounted model: E[ Σ_{t=0}^{∞} γ^t r_t ], 0 ≤ γ < 1

average-reward model: lim_{h→∞} E[ (1/h) Σ_{t=0}^{h} r_t ]
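To make the three objectives concrete, here is a small illustrative sketch (not from the original slides) that scores one hypothetical reward sequence under each model; the values of gamma and h are arbitrary example choices.

```python
# Illustrative only: score the same reward sequence under the three optimality models.
def finite_horizon_return(rewards, h):
    return sum(rewards[:h])

def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    return sum(rewards) / len(rewards)

rewards = [0, 0, 1, 0, 5]                       # hypothetical reward sequence
print(finite_horizon_return(rewards, h=3))      # 1
print(discounted_return(rewards, gamma=0.9))    # 0.9^2 * 1 + 0.9^4 * 5 = 4.0905
print(average_reward(rewards))                  # 1.2
```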

Page 10:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Page 11:

k-armed bandit
k gambling machines

h pulls are allowed

Machines are not equivalent: trying to learn the payoff probabilities

Tradeoff between exploitation and exploration
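One simple strategy for this tradeoff, shown here only as an illustration (it is not part of the slides), is epsilon-greedy pulling: estimate each machine's payoff rate from the pulls so far and usually pull the best-looking machine, but pull a random one with probability epsilon.

```python
import random

# Illustrative epsilon-greedy strategy for a k-armed bandit (hypothetical helper).
def epsilon_greedy_pull(wins, pulls, epsilon=0.1):
    k = len(pulls)
    if random.random() < epsilon:                        # explore: random machine
        return random.randrange(k)
    rates = [wins[i] / pulls[i] if pulls[i] else 0.0 for i in range(k)]
    return max(range(k), key=lambda i: rates[i])         # exploit: best estimated machine
```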

Page 12:

Solution Strategies 1: Dynamic Programming Approach

A belief state: the pull/payoff counts observed so far for each machine, (n_1, w_1, ..., n_k, w_k)

Expected pay-off remaining: V*(n_1, w_1, ..., n_k, w_k)

Probability of action i being paid: ρ_i = (w_i + 1) / (n_i + 2)

Page 13:

Dynamic Programming Approach:

If all pulls are used up (n_1 + ... + n_k = h), then V*(n_1, w_1, ..., n_k, w_k) = 0, because there are no remaining pulls.

So, all values can be recursively computed:
V*(n_1, w_1, ..., n_k, w_k) = max_i [ ρ_i (1 + V*(..., n_i + 1, w_i + 1, ...)) + (1 - ρ_i) V*(..., n_i + 1, w_i, ...) ]

Update the probabilities after each action (see the sketch below).
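A minimal sketch of this recursion, assuming the payoff estimate ρ_i = (w_i + 1) / (n_i + 2) above and a small number of machines and pulls; the function and variable names are mine, not the author's.

```python
from functools import lru_cache

# Hypothetical sketch: expected remaining payoff V* for a k-armed bandit, computed by
# dynamic programming over belief states ((n_1, w_1), ..., (n_k, w_k)).
def expected_remaining_payoff(k, h):
    @lru_cache(maxsize=None)
    def V(state):
        pulls_used = sum(n for n, _ in state)
        if pulls_used == h:                       # no remaining pulls
            return 0.0
        best = 0.0
        for i, (n, w) in enumerate(state):
            rho = (w + 1) / (n + 2)               # estimated probability that machine i pays
            win, lose = list(state), list(state)
            win[i], lose[i] = (n + 1, w + 1), (n + 1, w)
            value = rho * (1 + V(tuple(win))) + (1 - rho) * V(tuple(lose))
            best = max(best, value)
        return best
    return V(tuple((0, 0) for _ in range(k)))

print(expected_remaining_payoff(k=2, h=3))        # tiny example: 2 machines, 3 pulls
```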

Page 14:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Page 15:

MARKOV DECISION PROCESS

The k-armed bandit gives an immediate reward.
What about DELAYED REWARD?

Characteristics of an MDP:
a set of states: S
a set of actions: A
a reward function: R : S x A → R

a state transition function: T : S x A → Π(S)

T(s, a, s'): probability of a transition from s to s' using action a

Page 16:

MDP EXAMPLE:

(figure: a grid world with its transition function, states and rewards)

Bellman Equation:
V*(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V*(s') ]

(Greedy policy selection)
π*(s) = argmax_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V*(s') ]

Page 17:

Value Iteration Algorithm
Repeatedly apply the Bellman backup V(s) := max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V(s') ] for every state (see the sketch below).

AN ALTERNATIVE ITERATION (Singh, 1993):
Q(s, a) := R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q(s', a')

(Important for model-free learning)

Stop the iteration when V(s) changes by less than ε.
Policy difference ratio ≤ 2εγ / (1-γ) (Williams & Baird, 1993b)
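Below is a compact sketch of value iteration over a tabular MDP, assuming T and R are given as Python dictionaries; the two-state example at the bottom is made up for illustration and is not the grid world from the slides.

```python
# Hypothetical value iteration sketch: V(s) <- max_a [ R(s,a) + gamma * sum_s' T(s,a,s') * V(s') ]
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:           # stop when no state value changes by more than eps
            return V

# Made-up two-state MDP, only to show the assumed data layout
states, actions = ["s0", "s1"], ["a0", "a1"]
T = {("s0", "a0"): {"s0": 0.8, "s1": 0.2}, ("s0", "a1"): {"s1": 1.0},
     ("s1", "a0"): {"s0": 1.0},            ("s1", "a1"): {"s1": 1.0}}
R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0, ("s1", "a0"): 0.0, ("s1", "a1"): 0.5}
print(value_iteration(states, actions, T, R))
```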

Page 18:

Policy Iteration Algorithm
Alternate policy evaluation with greedy policy improvement (see the sketch below). Policies converge faster than values: the policy often becomes optimal before the value estimates have fully converged.

Why faster convergence?
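For comparison with value iteration, here is a sketch of policy iteration under the same hypothetical dictionary layout for T and R; it alternates full policy evaluation with greedy improvement, and the loop ends as soon as the greedy step changes nothing, which is why the policy can settle before the values do.

```python
# Hypothetical policy iteration sketch: evaluate the current policy, then improve it greedily.
def policy_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    policy = {s: actions[0] for s in states}
    V = {s: 0.0 for s in states}

    def backup(s, a):
        return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())

    while True:
        # Policy evaluation: iterate V under the fixed policy until it stops changing
        while True:
            delta = 0.0
            for s in states:
                v = backup(s, policy[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eps:
                break
        # Policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in states:
            best = max(actions, key=lambda a: backup(s, a))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```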

Page 19:

POLICY ITERATION ON GRID WORLD

Page 20:

POLICY ITERATION ON GRID WORLD

(figure: grid-world state values) 0, 0, 0, 0, 0, 0.8, -0.8, 0, 0

Page 21:

POLICY ITERATION ON GRID WORLD

(figure: grid-world state values, two successive iterations)
0, 0, 0, 0, 0, 0.8, -0.8, 0, 0
0, 0, 0, 0.64, 0, 0.8, 0.46, 0, 0

Page 22:

POLICY ITERATION ON GRID WORLD

(figure: grid-world state values) 0, 0, 0, 0.64, 0, 0.8, 0.46, 0, 0

Page 23:

POLICY ITERATION ON GRID WORLD

(figure: grid-world state values, two successive iterations)
0, 0, 0, 0.64, 0, 0.8, 0.46, 0, 0
0.51, 0, 0, 0.77, 0, 0.93, 0.59, 0.36, 0

Page 24:

POLICY ITERATION ON GRID WORLD

(figure: grid-world state values, three successive iterations)
0, 0, 0, 0.64, 0, 0.8, 0.46, 0, 0
0.51, 0, 0, 0.77, 0, 0.93, 0.59, 0.36, 0
0.66, 0.41, 0, 0.89, 0.32, 0.95, 0.70, 0.48, 0.19

Page 25:

MDP Graphical Representation

(figure: state-transition graph; edge labels β, α are transition probabilities T(s, action, s'))

Similarity to HMMs

Page 26:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Page 27:

Model Free Methods
Models of the environment:
T : S x A → Π(S) and R : S x A → R

Do we know them? Do we have to know them?

Monte Carlo Methods
Adaptive Heuristic Critic
Q Learning

Page 28:

Monte Carlo Methods
Idea: keep statistics of the returns observed for each state and take their average; this average is V(s).

Based only on experience.
Assumes episodic tasks (experience is divided into episodes, and all episodes terminate regardless of the actions selected).

Incremental in an episode-by-episode sense, not a step-by-step sense (see the sketch below).
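A sketch of this idea under simple assumptions: each episode is a list of (state, reward) pairs, with the reward being the one received on leaving that state, and V(s) is the average return observed after the first visit to s in each episode. All names here are illustrative.

```python
from collections import defaultdict

# Hypothetical first-visit Monte Carlo value estimation.
def mc_value_estimate(episodes, gamma=1.0):
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:                     # episode = [(state, reward), ...]
        G = 0.0
        first_visit_return = {}
        for state, reward in reversed(episode):  # walk backwards to accumulate returns
            G = reward + gamma * G
            first_visit_return[state] = G        # keeps the return from the earliest visit
        for state, G in first_visit_return.items():
            returns_sum[state] += G
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}

episodes = [[("s0", 0), ("s1", 1)], [("s0", 1)]]  # two made-up episodes
print(mc_value_estimate(episodes))                # {'s1': 1.0, 's0': 1.0}
```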

Page 29:

Problem: unvisited <s, a> pairs (the problem of maintaining exploration)

For every <s, a>, make sure that:
P(<s, a> is selected as the start state and action) > 0

(the assumption of exploring starts)

Page 30:

Monte Carlo Control

How to select policies:
(similar to policy evaluation)

Page 31:

ADAPTIVE HEURISTIC CRITIC & TD(λ)

How the AHC learns, the TD(0) algorithm:
V(s) := V(s) + α ( r + γ V(s') - V(s) )

AHC: (figure: actor-critic architecture, with the critic trained by the TD algorithm)
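A minimal sketch of the TD(0) update the critic applies after each transition (s, r, s'); alpha is the learning rate, and the function name is mine.

```python
# Hypothetical TD(0) update used by the critic: V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s))
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]   # temporal-difference error
    V[s] += alpha * td_error
    return td_error                           # in AHC, this error signal also trains the actor

V = {"s0": 0.0, "s1": 0.0}
td0_update(V, "s0", r=1.0, s_next="s1")
print(V)                                      # {'s0': 0.1, 's1': 0.0}
```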

Page 32:

Q LEARNING

Q values in value iteration:
Q(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q(s', a')

But we don't know T and R.

Instead, use the experienced transition (s, a, r, s') (see the sketch below):
Q(s, a) := Q(s, a) + α ( r + γ max_{a'} Q(s', a') - Q(s, a) )

If α is decayed properly, the Q values will converge. (Singh, 1994)
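A sketch of tabular Q-learning built from that update; the epsilon-greedy action choice is included only as one common way to explore and is an assumption on my part, not something the slide specifies.

```python
import random
from collections import defaultdict

# Hypothetical tabular Q-learning sketch.
Q = defaultdict(float)                            # Q[(s, a)], defaults to 0

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def choose_action(s, actions, epsilon=0.1):
    if random.random() < epsilon:                 # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])  # exploit the current Q estimates
```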

Page 33:

Q-LEARNING CRITIQUE:

Simpler than AHC learning
Q-learning is exploration insensitive (the Q values converge regardless of how the agent explores, as long as every pair keeps being tried)
Analogous to value iteration in MDPs
The most popular model-free learning algorithm

Page 34:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Page 35:

Model Based Methods

Model-free methods do not learn the model parameters.

That is an inefficient use of data! Learn the model.

Page 36:

Certainty Equivalent Methods:

First learn a model of the environment by keeping statistics, then learn the actions to take.

Objections:
Arbitrary division between the learning phase and the acting phase
Initial data gathering (how to choose an exploration strategy without knowing the model?)
What about changes in the environment?

Better to learn the model and to act simultaneously.

Page 37:

DYNA
After action a takes the agent from s to s' with reward r:

1. Update the transition model T and the reward function R.

2. Update the Q value for (s, a) using the learned model:
   Q(s, a) := R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q(s', a')

3. Do k more updates of the same kind for randomly chosen (s, a) pairs (see the sketch below).
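A rough Dyna-style sketch under two simplifying assumptions of mine: the model simply memorizes the last (r, s') seen for each (s, a), and the same sample backup is used for real and simulated experience. In the full algorithm the model is a learned transition distribution, as described above.

```python
import random
from collections import defaultdict

# Hypothetical Dyna-style sketch: learn from the real transition, remember it in a model,
# then do k extra planning backups on randomly chosen remembered (s, a) pairs.
Q = defaultdict(float)
model = {}                                        # (s, a) -> (r, s') most recently observed

def dyna_step(s, a, r, s_next, actions, alpha=0.1, gamma=0.9, k=5):
    def backup(s, a, r, s_next):
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    backup(s, a, r, s_next)                       # steps 1-2: learn from the real transition
    model[(s, a)] = (r, s_next)                   # update the (simplified) model
    for _ in range(k):                            # step 3: k planning updates from the model
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        backup(ps, pa, pr, ps_next)
```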

Page 38:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Page 39:

POMDPs
What if the state information (from sensors) is noisy? Mostly the case!

MDP techniques are suboptimal!

Two identical-looking halls are not the same state.

Page 40:

POMDPs – A Solution Strategy

SE: belief state estimator (can be based on an HMM)

π: MDP techniques applied to the belief state
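A sketch of the belief update such a state estimator would perform after taking action a and seeing observation o, assuming tabular transition probabilities T(s, a, s') and observation probabilities O(s', a, o); the normalization keeps the belief a proper distribution. These data structures are assumptions made for the illustration.

```python
# Hypothetical POMDP belief-state update:
#   b'(s') is proportional to O(s', a, o) * sum_s T(s, a, s') * b(s)
def belief_update(b, a, o, states, T, O):
    b_new = {}
    for s2 in states:
        b_new[s2] = O[(s2, a, o)] * sum(T[(s, a)].get(s2, 0.0) * b[s] for s in states)
    norm = sum(b_new.values())
    return {s2: v / norm for s2, v in b_new.items()} if norm > 0 else b_new
```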

Page 41:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Generalization, Partially Observable Markov Decision Process, Applications, Conclusion

Page 42:

APPLICATIONS

Juggling robot: dynamic programming (Schaal & Atkeson, 1994)

Box-pushing robot: Q-learning (Mahadevan & Connell, 1991a)

Disk-collecting robot: Q-learning (Mataric, 1994)

Page 43:

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Generalization, Partially Observable Markov Decision Process, Applications, Conclusion

Page 44:

Conclusion

RL is not supervised learning

Planning rather than classification

Poor performance on large problems. New methods are needed (e.g. shaping, imitation, reflexes).

RL/MDPs extend HMMs. How?

Page 45:

MDP as a graph

Is it possible to represent it as an HMM?

Page 46:

Relation to HMMs
Recycling robot example revisited as an HMM problem

(figure: dynamic Bayesian network with nodes Battery t-1, Battery t, Battery t+1 and Action t-1, Action t, Action t+1)

Battery = {low, high}
Action = {wait, search, recharge}

Page 47:

Relation to HMMs
Recycling robot example revisited as an HMM problem

(same figure as the previous slide: Battery and Action nodes over time; Battery = {low, high}, Action = {wait, search, recharge})

Not representable as an HMM

Page 48:

HMMs vs MDPs

Once we have the MDP representation of the problem, we can do inference just as with an HMM by converting it to a probabilistic automaton. The reverse is not possible: an HMM has no actions or rewards.

Use HMMs to do probabilistic reasoning over time.
Use MDPs/RL to optimize behavior.
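One way to see this concretely: once a fixed policy π is chosen, the MDP's transition function collapses to an ordinary Markov chain over states, which HMM-style machinery can then reason over. The sketch below is my illustration of that collapse, reusing the made-up two-state layout from the earlier value-iteration sketch.

```python
# Hypothetical sketch: fixing a policy pi turns T(s, a, s') into a plain Markov chain P(s' | s).
def mdp_to_markov_chain(states, T, policy):
    return {s: dict(T[(s, policy[s])]) for s in states}   # P(s'|s) = T(s, pi(s), s')

T = {("s0", "a0"): {"s0": 0.8, "s1": 0.2}, ("s0", "a1"): {"s1": 1.0},
     ("s1", "a0"): {"s0": 1.0},            ("s1", "a1"): {"s1": 1.0}}
print(mdp_to_markov_chain(["s0", "s1"], T, {"s0": "a1", "s1": "a0"}))
# {'s0': {'s1': 1.0}, 's1': {'s0': 1.0}}
```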