CS 4803 / 7643: Deep Learning, Fall 2020
Reinforcement Learning: Module 1/3
Presented by Nirbhay Modhe
Reinforcement Learning
Introduction
Types of Machine Learning
Reinforcement Learning
⬣ Evaluative feedback in the form of reward
⬣ No supervision on the right action
Supervised Learning
⬣ Train Input: 𝑋, 𝑌
⬣ Learning output: 𝑓 ∶ 𝑋 → 𝑌, 𝑃(𝑦|𝑥)
⬣ e.g. classification
(Figure: example image classes: Sheep, Dog, Cat, Lion, Giraffe)
Unsupervised Learning
⬣ Input: 𝑋
⬣ Learning output: 𝑃(𝑥)
⬣ Example: Clustering, density estimation, etc.
RL: Sequential decision making in an environment with evaluative feedback.
What is Reinforcement Learning?
⬣ Environment may be unknown, non-linear, stochastic, and complex.
⬣ Agent learns a policy to map states of the environment to actions.
⬣ Seeking to maximize cumulative reward in the long run.
Agent
Action, Response, Control
State, Stimulus, Situation
Reward, Gain, Payoff, Cost
Environment(world)
Figure Credit: Rich Sutton
What is Reinforcement Learning?
Evaluative Feedback Sequential Decisions
⬣ Pick an action, receive a reward (positive or negative)
⬣ No supervision for what the “correct” action is or would have been, unlike supervised learning
⬣ Plan and execute actions over a sequence of states
⬣ Reward may be delayed, requiring optimization of future rewards (long-term planning).
RL: Sequential decision making in an environment with evaluative feedback.
RL API
RL: Environment Interaction API
⬣ At each time step t, the agent:
⬣ Receives observation ot
⬣ Executes action at
⬣ At each time step t, the environment:
⬣ Receives action at
⬣ Emits observation ot+1
⬣ Emits scalar reward rt+1
Slide credit: David Silver
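The interaction loop above can be sketched in a few lines of Python. The `ToyEnvironment` below is a made-up, illustrative environment (not from the lecture): the agent moves left or right on positions 0..3 and receives reward +1 whenever it reaches position 3.

```python
import random

class ToyEnvironment:
    """Hypothetical 1-D environment: positions 0..3, reward at position 3."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Environment receives action a_t, emits observation o_{t+1}
        # and scalar reward r_{t+1}
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else 0.0
        return self.state, reward

random.seed(0)
env = ToyEnvironment()
obs, total_reward = 0, 0.0
for t in range(20):
    action = random.choice([-1, +1])  # agent executes action a_t (random policy)
    obs, reward = env.step(action)    # agent receives o_{t+1} and r_{t+1}
    total_reward += reward
```

Note the evaluative feedback: the agent only ever sees the scalar reward, never which action would have been "correct".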
RL: Challenges
Signature Challenges in Reinforcement Learning
⬣ Evaluative feedback: Need trial and error to find the right action
⬣ Delayed feedback: Actions may not lead to immediate reward
⬣ Non-stationarity: Data distribution of visited states changes when the policy changes
⬣ Fleeting nature of time and online data
Slide adapted from: Richard Sutton
Examples of RL tasks
Robot Locomotion
⬣ Objective: Make the robot move forward
⬣ State: Angle and position of the joints
⬣ Action: Torques applied on joints
⬣ Reward: +1 at each time step the robot is upright and moving forward
Figures copyright John Schulman et al., 2016. Reproduced with permission.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Examples of RL tasks
Atari Games
⬣ Objective: Complete the game with the highest score
⬣ State: Raw pixel inputs of the game state
⬣ Action: Game controls, e.g. Left, Right, Up, Down
⬣ Reward: Score increase/decrease at each time step
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission.
Examples of RL tasks
Go
⬣ Objective: Defeat opponent
⬣ State: Board pieces
⬣ Action: Where to put down the next piece
⬣ Reward: +1 if win at the end of the game, 0 otherwise
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Markov Decision Processes
Markov Decision Processes (MDPs)
⬣ MDPs: Theoretical framework underlying RL
⬣ An MDP is defined as a tuple (S, A, R, T, γ)
⬣ S: Set of possible states
⬣ A: Set of possible actions
⬣ R: Distribution of reward
⬣ T: Transition probability distribution, also written as p(s'|s,a)
⬣ γ: Discount factor
⬣ Interaction trajectory: s0, a0, r1, s1, a1, r2, s2, …
⬣ Markov property: Current state completely characterizes the state of the environment
⬣ Assumption: Most recent observation is a sufficient statistic of history
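The tuple (S, A, R, T, γ) can be written out concretely. Below is a minimal sketch of a two-state MDP; the state and action names are illustrative, not from the lecture.

```python
# A tiny two-state MDP written out as an explicit (S, A, R, T, gamma) tuple.
S = ["s0", "s1"]
A = ["stay", "go"]
gamma = 0.9  # discount factor

# T[(s, a)] is the transition distribution p(s' | s, a)
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.8, "s1": 0.2},
}

# R[(s, a)] is the expected reward for taking action a in state s
R = {
    ("s0", "stay"): 0.0, ("s0", "go"): 0.0,
    ("s1", "stay"): 1.0, ("s1", "go"): 0.0,
}

# The Markov property shows up in the data structure itself:
# each entry of T is keyed by (s, a) alone, never by earlier history.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in T.values())
```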
MDP Variations
Fully observed MDP
⬣ Agent receives the true state st at time t
⬣ Example: Chess, Go
Partially observed MDP
⬣ Agent perceives its own partial observation ot of the state st at time t, using past states e.g. with an RNN
⬣ Example: Poker, first-person games (e.g. Doom)
Source: https://github.com/mwydmuch/ViZDoom
We will assume fully observed MDPs for this lecture
MDPs in the context of RL
⬣ In Reinforcement Learning, we assume an underlying MDP (S, A, R, T, γ) with unknown:
⬣ Transition probability distribution
⬣ Reward distribution
⬣ Evaluative feedback comes into play; trial and error becomes necessary
⬣ For this lecture, assume that we know the true reward and transition distributions and look at algorithms for solving MDPs, i.e. finding the best policy
⬣ Rewards known everywhere, no evaluative feedback
⬣ Know how the world works, i.e. all transitions
A Grid World MDP
⬣ Agent lives in a 2D grid environment
⬣ State: Agent's 2D coordinates
⬣ Actions: N, E, S, W
⬣ Rewards: +1/-1 at absorbing states
⬣ Walls block the agent's path
⬣ Actions do not always go as planned
⬣ 20% chance that the agent drifts one cell left or right of the direction of motion (except when blocked by a wall)
Figure credits: Pieter Abbeel
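The noisy transition model above can be sketched as a sampling function. The grid size and wall layout below are illustrative placeholders, not the lecture's exact map: 80% of the time the agent moves in the intended direction, 10% it drifts to the left of that direction, 10% to the right, and blocked moves leave it in place.

```python
import random

# Direction vectors and the "drift left / drift right" relations for N, E, S, W
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
LEFT  = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT = {"N": "E", "E": "S", "S": "W", "W": "N"}

def transition(state, action, walls=frozenset({(1, 1)}), width=4, height=3):
    """Sample the next state under the 20% drift noise (hypothetical layout)."""
    u = random.random()
    if u < 0.8:
        direction = action          # intended direction
    elif u < 0.9:
        direction = LEFT[action]    # drift left of direction of motion
    else:
        direction = RIGHT[action]   # drift right of direction of motion
    x = state[0] + MOVES[direction][0]
    y = state[1] + MOVES[direction][1]
    if (x, y) in walls or not (0 <= x < width and 0 <= y < height):
        return state  # blocked by a wall or the grid boundary: stay put
    return (x, y)
```

From (0, 0) with action "E", for instance, the agent can land on (1, 0) (intended), (0, 1) (drift left, i.e. north), or stay at (0, 0) when the southward drift is blocked by the boundary.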
Solving MDPs: Optimal policy
⬣ Solving MDPs by finding the best/optimal policy
⬣ Formally, a policy is a mapping from states to actions (with n = |S| states and m = |A| actions)
⬣ Deterministic: π(s) = a
⬣ Stochastic: π(a|s)
⬣ What is a good policy?
⬣ Maximize current reward? Sum of all future rewards?
⬣ Discounted sum of future rewards!
⬣ Discount factor: γ
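The "discounted sum of future rewards" is a short computation: G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + …, which can be evaluated by working backwards with the recursion G_t = r_{t+1} + γ·G_{t+1}. A minimal sketch:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum of a finite reward sequence, computed back-to-front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_{t+1} + gamma * G_{t+1}
    return g

# With gamma < 1, near-term rewards count more than distant ones:
# three rewards of 1 under gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75
```

This is why the discount factor matters: γ close to 1 makes the agent far-sighted, while small γ makes distant rewards nearly irrelevant.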
Recap & Next Lecture
Today, we saw:
⬣ MDPs: Theoretical framework underlying RL, solving MDPs
⬣ Policy: How an agent acts in states (to be continued in next lecture)
Next Lecture:
⬣ Value function (Utility): How good is a particular state or state-action pair?
⬣ Algorithms for solving MDPs (Value Iteration)
⬣ Departure from known rewards and transitions: Reinforcement Learning