CS 4803 / 7643: Deep Learning, Fall 2020
Reinforcement Learning: Module 1/3
Presented by Nirbhay Modhe
Reinforcement Learning
Introduction
Types of Machine Learning
Reinforcement Learning
⬣ Evaluative feedback in the form of reward
⬣ No supervision on the right action
Supervised Learning
⬣ Train Input: 𝑋, 𝑌
⬣ Learning output: 𝑓 ∶ 𝑋 → 𝑌, 𝑃(𝑦|𝑥)
⬣ e.g. classification
(Figure: example image classes: Sheep, Dog, Cat, Lion, Giraffe)
Unsupervised Learning
⬣ Input: 𝑋
⬣ Learning output: 𝑃(𝑥)
⬣ Example: Clustering, density estimation, etc.
RL: Sequential decision making in an environment with evaluative feedback.
What is Reinforcement Learning?
⬣ Environment may be unknown, non-linear, stochastic, and complex.
⬣ Agent learns a policy to map states of the environment to actions.
⬣ Seeking to maximize cumulative reward in the long run.
Agent
Action, Response, Control
State, Stimulus, Situation
Reward, Gain, Payoff, Cost
Environment(world)
Figure Credit: Rich Sutton
What is Reinforcement Learning?
Evaluative Feedback Sequential Decisions
⬣ Pick an action, receive a reward (positive or negative)
⬣ No supervision for what the “correct” action is or would have been, unlike supervised learning
⬣ Plan and execute actions over a sequence of states
⬣ Reward may be delayed, requiring optimization of future rewards (long-term planning).
RL: Sequential decision making in an environment with evaluative feedback.
RL API
RL: Environment Interaction API
⬣ At each time step t, the agent:
⬣ Receives observation ot
⬣ Executes action at
⬣ At each time step t, the environment:
⬣ Receives action at
⬣ Emits observation ot+1
⬣ Emits scalar reward rt+1
Slide credit: David Silver
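The interaction loop above can be sketched in a few lines of Python. The `ToyEnvironment` below is a made-up, illustrative environment (not from the lecture): the agent moves left or right on positions 0..3 and receives reward +1 whenever it reaches position 3.

```python
import random

class ToyEnvironment:
    """Hypothetical 1-D environment: positions 0..3, reward at position 3."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Environment receives action a_t, emits observation o_{t+1}
        # and scalar reward r_{t+1}
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else 0.0
        return self.state, reward

random.seed(0)
env = ToyEnvironment()
obs, total_reward = 0, 0.0
for t in range(20):
    action = random.choice([-1, +1])  # agent executes action a_t (random policy)
    obs, reward = env.step(action)    # agent receives o_{t+1} and r_{t+1}
    total_reward += reward
```

Note the evaluative feedback: the agent only ever sees the scalar reward, never which action would have been "correct".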
RL: Challenges
Signature Challenges in Reinforcement Learning
⬣ Evaluative feedback: Need trial and error to find the right action
⬣ Delayed feedback: Actions may not lead to immediate reward
⬣ Non-stationarity: Data distribution of visited states changes when the policy changes
⬣ Fleeting nature of time and online data
Slide adapted from: Richard Sutton
Examples of RL tasks
Robot Locomotion
⬣ Objective: Make the robot move forward
⬣ State: Angle and position of the joints
⬣ Action: Torques applied on joints
⬣ Reward: +1 at each time step the robot is upright and moving forward
Figures copyright John Schulman et al., 2016. Reproduced with permission.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Examples of RL tasks
Atari Games
⬣ Objective: Complete the game with the highest score
⬣ State: Raw pixel inputs of the game state
⬣ Action: Game controls, e.g. Left, Right, Up, Down
⬣ Reward: Score increase/decrease at each time step
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission.
Examples of RL tasks
Go
⬣ Objective: Defeat opponent
⬣ State: Board pieces
⬣ Action: Where to put down the next piece
⬣ Reward: +1 if win at the end of the game, 0 otherwise
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Markov Decision Processes
Markov Decision Processes (MDPs)
⬣ MDPs: Theoretical framework underlying RL
⬣ An MDP is defined as a tuple (S, A, R, T, γ)
⬣ S: Set of possible states
⬣ A: Set of possible actions
⬣ R: Distribution of reward
⬣ T: Transition probability distribution, also written as p(s'|s,a)
⬣ γ: Discount factor
⬣ Interaction trajectory: s0, a0, r1, s1, a1, r2, s2, …
⬣ Markov property: Current state completely characterizes the state of the environment
⬣ Assumption: Most recent observation is a sufficient statistic of history
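The tuple (S, A, R, T, γ) can be written out concretely. Below is a minimal sketch of a two-state MDP; the state and action names are illustrative, not from the lecture.

```python
# A tiny two-state MDP written out as an explicit (S, A, R, T, gamma) tuple.
S = ["s0", "s1"]
A = ["stay", "go"]
gamma = 0.9  # discount factor

# T[(s, a)] is the transition distribution p(s' | s, a)
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.8, "s1": 0.2},
}

# R[(s, a)] is the expected reward for taking action a in state s
R = {
    ("s0", "stay"): 0.0, ("s0", "go"): 0.0,
    ("s1", "stay"): 1.0, ("s1", "go"): 0.0,
}

# The Markov property shows up in the data structure itself:
# each entry of T is keyed by (s, a) alone, never by earlier history.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in T.values())
```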
MDP Variations
Fully observed MDP
⬣ Agent receives the true state st at time t
⬣ Example: Chess, Go
Partially observed MDP
⬣ Agent perceives its own partial observation ot of the state st at time t, using past states e.g. with an RNN
⬣ Example: Poker, first-person games (e.g. Doom)
Source: https://github.com/mwydmuch/ViZDoom
We will assume fully observed MDPs for this lecture
MDPs in the context of RL
⬣ In Reinforcement Learning, we assume an underlying MDP (S, A, R, T, γ) with unknown:
⬣ Transition probability distribution
⬣ Reward distribution
⬣ Evaluative feedback comes into play; trial and error becomes necessary
⬣ For this lecture, assume that we know the true reward and transition distributions and look at algorithms for solving MDPs, i.e. finding the best policy
⬣ Rewards known everywhere, no evaluative feedback
⬣ Know how the world works, i.e. all transitions
A Grid World MDP
⬣ Agent lives in a 2D grid environment
⬣ State: Agent's 2D coordinates
⬣ Actions: N, E, S, W
⬣ Rewards: +1/-1 at absorbing states
⬣ Walls block the agent's path
⬣ Actions do not always go as planned
⬣ 20% chance that the agent drifts one cell left or right of the direction of motion (except when blocked by a wall)
Figure credits: Pieter Abbeel
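The noisy transition model above can be sketched as a sampling function. The grid size and wall layout below are illustrative placeholders, not the lecture's exact map: 80% of the time the agent moves in the intended direction, 10% it drifts to the left of that direction, 10% to the right, and blocked moves leave it in place.

```python
import random

# Direction vectors and the "drift left / drift right" relations for N, E, S, W
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
LEFT  = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT = {"N": "E", "E": "S", "S": "W", "W": "N"}

def transition(state, action, walls=frozenset({(1, 1)}), width=4, height=3):
    """Sample the next state under the 20% drift noise (hypothetical layout)."""
    u = random.random()
    if u < 0.8:
        direction = action          # intended direction
    elif u < 0.9:
        direction = LEFT[action]    # drift left of direction of motion
    else:
        direction = RIGHT[action]   # drift right of direction of motion
    x = state[0] + MOVES[direction][0]
    y = state[1] + MOVES[direction][1]
    if (x, y) in walls or not (0 <= x < width and 0 <= y < height):
        return state  # blocked by a wall or the grid boundary: stay put
    return (x, y)
```

From (0, 0) with action "E", for instance, the agent can land on (1, 0) (intended), (0, 1) (drift left, i.e. north), or stay at (0, 0) when the southward drift is blocked by the boundary.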
Solving MDPs: Optimal policy
⬣ Solving MDPs by finding the best/optimal policy
⬣ Formally, a policy is a mapping from states to actions (with n = |S| states and m = |A| actions)
⬣ Deterministic: π(s) = a
⬣ Stochastic: π(a|s)
⬣ What is a good policy?
⬣ Maximize current reward? Sum of all future rewards?
⬣ Discounted sum of future rewards!
⬣ Discount factor: γ
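The "discounted sum of future rewards" is a short computation: G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + …, which can be evaluated by working backwards with the recursion G_t = r_{t+1} + γ·G_{t+1}. A minimal sketch:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum of a finite reward sequence, computed back-to-front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_{t+1} + gamma * G_{t+1}
    return g

# With gamma < 1, near-term rewards count more than distant ones:
# three rewards of 1 under gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75
```

This is why the discount factor matters: γ close to 1 makes the agent far-sighted, while small γ makes distant rewards nearly irrelevant.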
Recap & Next Lecture
Today, we saw:
⬣ MDPs: Theoretical framework underlying RL, solving MDPs
⬣ Policy: How an agent acts in states (to be continued in next lecture)
Next Lecture:
⬣ Value function (Utility): How good is a particular state or state-action pair?
⬣ Algorithms for solving MDPs (Value Iteration)
⬣ Departure from known rewards and transitions: Reinforcement Learning