
Reinforcement Learning

Dipendra Misra

Cornell University, dkm@cs.cornell.edu

https://dipendramisra.wordpress.com/

Task

Setup from Lenz et al. 2014

Grasp the green cup.

Output: Sequence of controller actions

Supervised Learning

Setup from Lenz et al. 2014

Grasp the green cup.

Expert Demonstrations

Supervised Learning

Setup from Lenz et al. 2014

Grasp the green cup.

Expert Demonstrations

Problem?

Supervised Learning

Grasp the cup. Problem?

Training data

Test data

No exploration

Exploring the environment

What is reinforcement learning?

“Reinforcement learning is a computational approach that emphasizes learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment”

- R. Sutton and A. Barto

Interaction with the environment

[Figure: agent-environment loop: the agent takes an action; the environment returns a reward and a new environment state]

Setup from Lenz et al. 2014

Scalar reward

Interaction with the environment

Setup from Lenz et al. 2014

Episodic vs. Non-Episodic

[Figure: a rollout with actions $a_1, a_2, \ldots, a_n$ and rewards $r_1, r_2, \ldots, r_n$]

Rollout

Setup from Lenz et al. 2014

[Figure: a rollout with actions $a_1, \ldots, a_n$ and rewards $r_1, \ldots, r_n$]

$\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \cdots, a_n, r_n, s_n \rangle$

Setup

Transition: $s_t \xrightarrow{a_t} s_{t+1}$ with reward $r_t$ (e.g., \$1)

Policy

$\pi(s, a) = 0.9$

Interaction with the environment

[Figure: agent-environment loop: the agent takes an action; the environment returns a reward and a new environment state]

Setup from Lenz et al. 2014

Objective?

Objective

Setup from Lenz et al. 2014

[Figure: a rollout with actions $a_1, \ldots, a_n$ and rewards $r_1, \ldots, r_n$]

$\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \cdots, a_n, r_n, s_n \rangle$

maximize expected reward:

$E\left[\sum_{t=1}^{n} r_t\right]$

Problem?

Discounted Reward

[Figure: a rollout with actions $a_1, \ldots, a_n$ and rewards $r_1, \ldots, r_n$]

maximize expected reward: $E\left[\sum_{t=0}^{\infty} r_{t+1}\right]$

Problem? unbounded

discount future reward: $E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$, with $\gamma \in [0, 1)$

Discounted Reward

[Figure: a rollout with actions $a_1, \ldots, a_n$ and rewards $r_1, \ldots, r_n$]

maximize discounted expected reward:

$E\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$

If $r \leq M$ and $\gamma \in [0, 1)$:

$E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] \leq \sum_{t=0}^{\infty} \gamma^t M = \frac{M}{1 - \gamma}$
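A quick numerical check of this bound (not from the slides, just a sanity check of the geometric series): with reward cap $M = 1$ and $\gamma = 0.9$, the truncated discounted sum approaches $M/(1-\gamma) = 10$.

```python
# Sanity check: sum_{t < T} gamma^t * M approaches M / (1 - gamma).
gamma, M = 0.9, 1.0

partial = sum(gamma**t * M for t in range(1000))  # geometric series, truncated at T = 1000
closed_form = M / (1 - gamma)

print(partial, closed_form)  # both approximately 10.0
```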

Need for discounting

• To keep the problem well formed

• Evidence that humans discount future reward

Markov Decision Process

An MDP is a tuple $(S, A, P, R, \gamma)$ where

• $S$ is a set of finite or infinite states
• $A$ is a set of finite or infinite actions
• For the transition $s \xrightarrow{a} s'$ (Markov assumption: it depends only on the current $s$ and $a$):
  • $P^a_{s,s'} \in P$ is the transition probability
  • $R^a_{s,s'} \in R$ is the reward for the transition
• $\gamma \in [0, 1]$ is the discount factor
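To make the tuple concrete, here is one possible way to hold a finite MDP in Python. The container fields and the toy two-state example are illustrative assumptions, not from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action = str, str

@dataclass
class MDP:
    """Finite MDP (S, A, P, R, gamma). P[s][a] maps next states s' to
    probabilities P^a_{s,s'}; R[(s, a, s')] is the reward R^a_{s,s'}."""
    states: List[State]
    actions: List[Action]
    P: Dict[State, Dict[Action, Dict[State, float]]]
    R: Dict[Tuple[State, Action, State], float]
    gamma: float

# A made-up two-state example, purely for illustration.
toy = MDP(
    states=["s1", "s2"],
    actions=["stay", "go"],
    P={"s1": {"stay": {"s1": 1.0}, "go": {"s2": 1.0}},
       "s2": {"stay": {"s2": 1.0}, "go": {"s1": 1.0}}},
    R={("s1", "go", "s2"): 1.0, ("s1", "stay", "s1"): 0.0,
       ("s2", "go", "s1"): 0.0, ("s2", "stay", "s2"): 0.0},
    gamma=0.9,
)
```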

MDP Example

Example from Sutton and Barto 1998

Summary

[Figure: a rollout with actions $a_1, \ldots, a_n$ and rewards $r_1, \ldots, r_n$]

Maximize discounted expected reward: $E\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$

MDP is a tuple $(S, A, P, R, \gamma)$

Agent controls the policy $\pi(s, a)$

What we learned

Reinforcement Learning

Agent-Reward-Environment, No supervision, Exploration

MDP, Policy

Value Functions

• Expected reward from following a policy

State value function:

$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, \pi\right]$

State-action value function:

$Q^\pi(s, a) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, a_1 = a, \pi\right]$
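Since $V^\pi(s)$ is an expectation over rollouts, it can be approximated by averaging the discounted returns of sampled trajectories. A minimal Monte Carlo sketch; `sample_action` and `step` are hypothetical helpers standing in for the policy $\pi(s, \cdot)$ and the environment dynamics $P$, $R$.

```python
def mc_value_estimate(s, sample_action, step, gamma=0.9, horizon=200, n_rollouts=1000):
    """Monte Carlo estimate of V^pi(s): average discounted return over rollouts.

    sample_action(state) -> action               # hypothetical helper: samples from pi(state, .)
    step(state, action) -> (next_state, reward)  # hypothetical helper: samples P, returns R
    """
    total = 0.0
    for _ in range(n_rollouts):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(horizon):        # truncate the infinite sum; gamma^horizon is negligible
            a = sample_action(state)
            state, r = step(state, a)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_rollouts
```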

State Value function

$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, \pi\right]$

[Figure: tree of trajectories from $s_1$: action $a_1$ is chosen with probability $\pi(s_1, a_1)$, the transition to $s_2$ happens with probability $P^{a_1}_{s_1,s_2}$ and gives reward $R^{a_1}_{s_1,s_2}$, then $a_2$, $s_3$, and so on]

State Value function

$V^\pi(s_1) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] = \sum_{t} (r_1 + \gamma r_2 + \cdots)\, p(t)$

where $t = \langle s_1, a_1, s_2, a_2, \cdots \rangle = \langle s_1, a_1, s_2 \rangle : t'$ (the trajectory split into its first transition and the tail $t'$)

[Figure: the same trajectory tree from $s_1$ with probabilities $\pi(s_1, a_1) P^{a_1}_{s_1,s_2}$, $\pi(s_2, a_2) P^{a_2}_{s_2,s_3}$ and rewards $R^{a_1}_{s_1,s_2}$, $R^{a_2}_{s_2,s_3}$]

State Value function

$V^\pi(s_1) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$

$= \sum_{t} (r_1 + \gamma r_2 + \cdots)\, p(t)$, where $t = \langle s_1, a_1, s_2, a_2, \cdots \rangle = \langle s_1, a_1, s_2 \rangle : t'$

$= \sum_{a_1, s_2} \sum_{t'} P(s_1, a_1, s_2)\, P(t' \mid s_1, a_1, s_2)\, \big( R^{a_1}_{s_1,s_2} + \gamma (r_2 + \cdots) \big)$

$= \sum_{a_1, s_2} P(s_1, a_1, s_2) \Big\{ R^{a_1}_{s_1,s_2} + \gamma \underbrace{\sum_{t'} P(t' \mid s_1, a_1, s_2)(r_2 + \cdots)}_{V^\pi(s_2)} \Big\}$

$= \sum_{a_1} \pi(s_1, a_1) \sum_{s_2} P^{a_1}_{s_1,s_2} \big( R^{a_1}_{s_1,s_2} + \gamma V^\pi(s_2) \big)$

Bellman Self-Consistency Eqn

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{s,s'} \big( R^a_{s,s'} + \gamma V^\pi(s') \big)$

similarly

$Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \big( R^a_{s,s'} + \gamma V^\pi(s') \big)$

$Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \Big( R^a_{s,s'} + \gamma \sum_{a'} \pi(s', a')\, Q^\pi(s', a') \Big)$

Bellman Self-Consistency Eqn

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{s,s'} \big( R^a_{s,s'} + \gamma V^\pi(s') \big)$

Given N states, we have N equations in N variables

Solve the above equation

Does it have a unique solution?

Yes, it does. Exercise: Prove it.
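One way to see the "N equations in N variables" remark in code: the self-consistency equation is linear in $V^\pi$, so it can be solved directly as $(I - \gamma P^\pi) V^\pi = r^\pi$. A sketch with NumPy; the tabular array layout and names below are assumptions for illustration, not from the slides.

```python
import numpy as np

def evaluate_policy(pi, P, R, gamma):
    """Solve V^pi(s) = sum_a pi(s,a) sum_s' P^a_{s,s'} (R^a_{s,s'} + gamma V^pi(s')).

    pi: (N, A) policy probabilities pi(s, a)
    P:  (A, N, N) transition probabilities P^a_{s, s'}
    R:  (A, N, N) rewards R^a_{s, s'}
    """
    N = P.shape[1]
    # Policy-averaged transition matrix and expected one-step reward.
    P_pi = np.einsum("sa,asn->sn", pi, P)        # P^pi_{s, s'}
    r_pi = np.einsum("sa,asn,asn->s", pi, P, R)  # E[R | s, pi]
    # N linear equations in N unknowns: (I - gamma P^pi) V = r^pi.
    return np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
```

Invertibility of $I - \gamma P^\pi$ for $\gamma \in [0, 1)$ is exactly the uniqueness claimed above.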

Optimal Policy

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{s,s'} \big( R^a_{s,s'} + \gamma V^\pi(s') \big)$

Given a state $s$, policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \geq \pi_2$) if:

$V^{\pi_1}(s) \geq V^{\pi_2}(s)$

How to define a globally optimal policy?

Optimal Policy

Policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \geq \pi_2$) for state $s$ if: $V^{\pi_1}(s) \geq V^{\pi_2}(s)$

How to define a globally optimal policy?

$\pi^*$ is an optimal policy if:

$V^{\pi^*}(s) \geq V^{\pi}(s) \quad \forall s \in S,\ \forall \pi$

Does it always exist?

Yes, it always does.

Existence of Optimal Policy

Leader policy for every state $s$: $\pi_s = \arg\max_{\pi} V^\pi(s)$

Define: $\pi^*(s, a) = \pi_s(s, a) \quad \forall s, a$

To show: $\pi^*$ is optimal, or equivalently

$\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) \geq 0$

$V^{\pi^*}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^{\pi^*}(s') \}$

$V^{\pi_s}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^{\pi_s}(s') \}$

Existence of Optimal Policy

$V^{\pi^*}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^{\pi^*}(s') \}$

$V^{\pi_s}(s) = \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^{\pi_s}(s') \}$

Leader policy for every state $s$: $\pi_s = \arg\max_{\pi} V^\pi(s)$; define $\pi^*(s, a) = \pi_s(s, a) \ \forall s, a$

Subtracting the two equations:

$\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) = \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{ V^{\pi^*}(s') - V^{\pi_s}(s') \}$

$\geq \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'}\, \Delta(s')$ (since $V^{\pi_{s'}}(s') \geq V^{\pi_s}(s')$), i.e. $\gamma$ times a convex combination of the $\Delta(s')$

$\geq \gamma \min_{s'} \Delta(s')$

Existence of Optimal Policy

Leader policy for every state $s$: $\pi_s = \arg\max_{\pi} V^\pi(s)$; define $\pi^*(s, a) = \pi_s(s, a) \ \forall s, a$

$\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) = \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{ V^{\pi^*}(s') - V^{\pi_s}(s') \} \geq \gamma \sum_{a} \pi_s(s, a) \sum_{s'} P^a_{s,s'}\, \Delta(s')$

$\Delta(s) \geq \gamma \min_{s'} \Delta(s')$

$\Rightarrow \min_{s} \Delta(s) \geq \gamma \min_{s'} \Delta(s')$

$\gamma \in [0, 1) \Rightarrow \min_{s} \Delta(s) \geq 0$. Hence proved.

Bellman’s Optimality Condition

Define $V^*(s) = V^{\pi^*}(s)$ and $Q^*(s, a) = Q^{\pi^*}(s, a)$

$V^{\pi^*}(s) = \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a) \leq \max_{a} Q^{\pi^*}(s, a)$

Claim: $V^{\pi^*}(s) = \max_{a} Q^{\pi^*}(s, a)$

Let $V^{\pi^*}(s) < \max_{a} Q^{\pi^*}(s, a)$ (towards a contradiction)

Define $\pi'(s) = \arg\max_{a} Q^{\pi^*}(s, a)$

Bellman’s Optimality Condition

$\pi'(s) = \arg\max_{a} Q^{\pi^*}(s, a)$, $\quad V^{\pi'}(s) = Q^{\pi'}(s, \pi'(s))$, $\quad V^{\pi^*}(s) = \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a)$

$\Delta(s) = V^{\pi'}(s) - V^{\pi^*}(s) = Q^{\pi'}(s, \pi'(s)) - \sum_{a} \pi^*(s, a)\, Q^{\pi^*}(s, a)$

$\geq Q^{\pi'}(s, \pi'(s)) - Q^{\pi^*}(s, \pi'(s)) = \gamma \sum_{s'} P^{\pi'(s)}_{s,s'}\, \Delta(s')$

$\Rightarrow \Delta(s) \geq 0$ for all $s$, and there exists $s$ with $\Delta(s) > 0$ (the state where the strict inequality was assumed), so $\pi^*$ is not optimal: a contradiction.

Bellman’s Optimality Condition

$V^*(s) = \max_{a} Q^*(s, a)$

$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \}$

similarly

$Q^*(s, a) = \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma \max_{a'} Q^*(s', a') \}$

Optimal policy from Q value

Given $Q^*(s, a)$, an optimal policy is given by:

$\pi^*(s) = \arg\max_{a} Q^*(s, a)$

Corollary: Every MDP has a deterministic optimal policy
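Read in code, the corollary says that acting greedily with respect to a tabular $Q^*$ (the array layout is an assumption for illustration) yields a deterministic optimal policy.

```python
import numpy as np

def greedy_policy(Q):
    """pi*(s) = argmax_a Q*(s, a), for a tabular Q of shape (N_states, N_actions)."""
    return np.argmax(Q, axis=1)  # one deterministic action index per state
```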

Summary

Bellman’s self-consistency equation:

$V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{s,s'} \big( R^a_{s,s'} + \gamma V^\pi(s') \big)$

Bellman’s optimality condition:

$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \}$

An optimal policy $\pi^*$ exists such that:

$V^{\pi^*}(s) \geq V^{\pi}(s) \quad \forall s \in S,\ \forall \pi$

What we learned

Reinforcement Learning

Agent-Reward-Environment, No supervision, Exploration

MDP, Policy

Consistency Equation, Optimal Policy, Optimality Condition

Solving MDP

To solve an MDP is to find an optimal policy

Bellman’s Optimality Condition

$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \}$

Iteratively solve the above equation

Bellman Backup Operator

$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \}$

$T : V \to V$

$(TV)(s) = \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V(s') \}$
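A direct tabular implementation of the backup operator, under the same assumed array layout as the earlier sketches ($P$ and $R$ of shape (A, N, N); names are illustrative, not from the slides).

```python
import numpy as np

def bellman_backup(V, P, R, gamma):
    """(TV)(s) = max_a sum_s' P^a_{s,s'} { R^a_{s,s'} + gamma V(s') }.

    V: (N,) current value estimates; P, R: (A, N, N).
    """
    # Q[a, s] = sum_s' P^a_{s,s'} (R^a_{s,s'} + gamma V[s'])
    Q = np.einsum("asn,asn->as", P, R) + gamma * np.einsum("asn,n->as", P, V)
    return Q.max(axis=0)  # maximize over actions, one value per state
```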

Dynamic Programming Solution

Initialize $V^0$ randomly

repeat
  $V^{t+1} = TV^t$, i.e. $V^{t+1}(s) = \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^t(s') \}$
until $\|V^{t+1} - V^t\|_\infty \leq \epsilon$

return $V^{t+1}$
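The loop above then becomes a few lines. This sketch reuses the `bellman_backup` function from the previous block, so the array shapes and names are the same assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6, seed=0):
    """Iterate V^{t+1} = T V^t until ||V^{t+1} - V^t||_inf <= eps."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal(P.shape[1])  # initialize V^0 randomly
    while True:
        V_next = bellman_backup(V, P, R, gamma)
        if np.max(np.abs(V_next - V)) <= eps:  # sup-norm stopping criterion
            return V_next
        V = V_next
```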

Convergence

Theorem: $\|TV_1 - TV_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$, where $\|x\|_\infty = \max\{|x_1|, |x_2|, \cdots, |x_k|\}$ for $x \in \mathbb{R}^k$ and $(TV)(s) = \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V(s') \}$

Proof:

$|(TV_1)(s) - (TV_2)(s)| = \left| \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V_1(s') \} - \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V_2(s') \} \right|$

using $\left| \max_{x} f(x) - \max_{x} g(x) \right| \leq \max_{x} |f(x) - g(x)|$:

Convergence

Theorem: $\|TV_1 - TV_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$, where $\|x\|_\infty = \max\{|x_1|, |x_2|, \cdots, |x_k|\}$ for $x \in \mathbb{R}^k$

Proof (continued):

$|(TV_1)(s) - (TV_2)(s)| \leq \max_{a} \gamma \left| \sum_{s'} P^a_{s,s'} \big( V_1(s') - V_2(s') \big) \right|$

$\leq \max_{a} \gamma \sum_{s'} P^a_{s,s'} \left| V_1(s') - V_2(s') \right|$

$\leq \max_{a} \gamma \max_{s'} \left| V_1(s') - V_2(s') \right| = \gamma \|V_1 - V_2\|_\infty$

$\Rightarrow \|TV_1 - TV_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$
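The contraction bound can also be probed numerically: for any two value vectors, applying $T$ should shrink their sup-norm distance by at least a factor of $\gamma$. A small check on a random, made-up MDP, reusing the `bellman_backup` sketch from above.

```python
import numpy as np

rng = np.random.default_rng(0)
A, N, gamma = 3, 5, 0.9

P = rng.random((A, N, N))
P /= P.sum(axis=2, keepdims=True)   # normalize each row into transition probabilities
R = rng.random((A, N, N))
V1, V2 = rng.standard_normal(N), rng.standard_normal(N)

lhs = np.max(np.abs(bellman_backup(V1, P, R, gamma) - bellman_backup(V2, P, R, gamma)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(lhs <= rhs + 1e-12)  # True: T is a gamma-contraction in the sup norm
```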

Optimal is a fixed point

$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \} = (TV^*)(s)$

$V^*$ is a fixed point of $T$

Optimal is the fixed point

$V^*(s) = \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \} = (TV^*)(s)$

$V^*$ is a fixed point of $T$

Theorem: $V^*$ is the only fixed point of $T$

Proof: Let $TV_1 = V_1$ and $TV_2 = V_2$. Then

$\|V_1 - V_2\|_\infty = \|TV_1 - TV_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$

As $\gamma \in [0, 1)$, therefore $\|V_1 - V_2\|_\infty = 0 \Rightarrow V_1 = V_2$

Dynamic Programming Solution

Initialize $V^0$ randomly
repeat $V^{t+1} = TV^t$ until $\|V^{t+1} - V^t\|_\infty \leq \epsilon$
return $V^{t+1}$

Theorem: the algorithm converges for all $V^0$

Proof:

$\|V^{t+1} - V^*\|_\infty = \|TV^t - TV^*\|_\infty \leq \gamma \|V^t - V^*\|_\infty$

$\Rightarrow \|V^t - V^*\|_\infty \leq \gamma^t \|V^0 - V^*\|_\infty$

$\lim_{t \to \infty} \|V^t - V^*\|_\infty \leq \lim_{t \to \infty} \gamma^t \|V^0 - V^*\|_\infty = 0$

Problem?

Summary

Iteratively solving optimality condition

Bellman Backup Operator

Convergence of the iterative solution

$V^{t+1}(s) = \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^t(s') \}$

$(TV)(s) = \max_{a} \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V(s') \}$

What we learned

Reinforcement Learning

Agent-Reward-Environment, No supervision, Exploration

MDP, Policy

Consistency Equation, Optimal Policy, Optimality Condition

Bellman Backup Operator, Iterative Solution

In next tutorial

• Value and Policy Iteration

• Monte Carlo Solution

• SARSA and Q-Learning

• Policy Gradient Methods

• Learning to search OR Atari game paper?

https://dipendramisra.wordpress.com/
