CS 4803 / 7643: Deep Learning
Zsolt Kira, Georgia Tech
Topics: (Deep) Reinforcement Learning



Page 1:

CS 4803 / 7643: Deep Learning

Zsolt Kira, Georgia Tech

Topics: – (Deep) Reinforcement Learning

Page 2:

Administrativia

• Projects!

• Project Check-in due April 11th

– Will be graded pass/fail; if you fail, you can address the issues
– Counts for 5 points of project score

• Poster due date moved to April 23rd (last day of class)
– No presentations

• Final submission due date April 30th

(C) Dhruv Batra & Zsolt Kira 2

Page 3:

Project Check-in/Progress Report

(C) Dhruv Batra & Zsolt Kira 3

Page 4:

Administrativia

• Send me requests on Piazza for topics to cover after reinforcement learning

(C) Dhruv Batra & Zsolt Kira 4

Page 5:

Types of Learning

• Supervised learning
  – Learning from a "teacher"
  – Training data includes desired outputs

• Unsupervised learning
  – Discover structure in data
  – Training data does not include desired outputs

• Reinforcement learning
  – Learning to act under evaluative feedback (rewards)

(C) Dhruv Batra & Zsolt Kira 5
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 6:

What is Reinforcement Learning?

Agent-oriented learning—learning by interacting with an environment to achieve a goal ○ more realistic and ambitious than other kinds of machine

learningLearning by trial and error, with only delayed evaluative feedback (reward)○ the kind of machine learning most like natural learning○ learning that can tell for itself when it is right or wrong

Slide Credit: Rich Sutton

Page 7:

David Silver 2015

Page 8:

Example: Hajime Kimura’s RL Robots

[Video stills: Before / After; Backward; New Robot, Same algorithm]

Slide Credit: Rich Sutton

Page 9:

RL API

● Environment may be unknown, nonlinear, stochastic and complex

● Agent learns a policy mapping states to actions
  ○ Seeking to maximize its cumulative reward in the long run

[Diagram: the agent sends an action (response, control) to the environment (world); the environment returns a state (stimulus, situation) and a reward (gain, payoff, cost)]

Slide Credit: Rich Sutton

Page 10:

RL API

(C) Dhruv Batra & Zsolt Kira 10
Slide Credit: David Silver

Page 11:

Signature challenges of RL

• Evaluative feedback (reward)
• Sequentiality, delayed consequences
• Need for trial and error, to explore as well as exploit
• Non-stationarity
• The fleeting nature of time and online data

Slide Credit: Rich Sutton

Page 12:

Robot Locomotion

Objective: Make the robot move forward

State: Angle and position of the joints
Action: Torques applied on joints
Reward: 1 at each time step the robot is upright and moving forward

Figures copyright John Schulman et al., 2016. Reproduced with permission. 

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 13:

Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step

Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission. 

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 14:

Go

Objective: Win the game!

State: Position of all pieces
Action: Where to put the next piece down
Reward: 1 if win at the end of the game, 0 otherwise

This image is CC0 public domain

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 15:

Demo
• http://projects.rajivshah.com/rldemo/
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html

(C) Dhruv Batra & Zsolt Kira 15

Page 16:

Markov Decision Process
- Mathematical formulation of the RL problem

- Life is a trajectory: s0, a0, r0, s1, a1, r1, …

Defined by (S, A, R, P, γ):
  S: set of possible states
  A: set of possible actions
  R: distribution of reward given (state, action) pair
  P: transition probability, i.e. distribution over next state given (state, action) pair
  γ: discount factor

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 17:

What is Markov about MDPs?

• “Markov” generally means that given the present state, the future and the past are independent

• For Markov decision processes, “Markov” means action outcomes depend only on the current state

Andrey Markov (1856‐1922)

Slide Credit: http://ai.berkeley.edu

Page 18:

Example: Grid World

The agent receives rewards each time step
 – Small "living" reward each step (can be negative)
 – Big rewards come at the end (good or bad)

This is now our supervision! Different from supervised learning:
 – Rewards can be sparse: how can we assign credit/blame to actions far into the past that caused our large positive/negative reward?
 – The world is uncertain
 – If we have a model of it, how can we solve? If we don't, how can we solve it? Should we build a model of it?

Slide Credit: http://ai.berkeley.edu

Page 19:

Example: Grid World

A maze-like problem
 – The agent lives in a grid
 – Walls block the agent's path

Noisy movement: actions do not always go as planned
 – 80% of the time, the action North takes the agent North (if there is no wall there)
 – 10% of the time, North takes the agent West; 10% East
 – If there is a wall in the direction the agent would have been taken, the agent stays put

The agent receives rewards each time step
 – Small "living" reward each step (can be negative)
 – Big rewards come at the end (good or bad)

Goal: maximize sum of rewards

Slide Credit: http://ai.berkeley.edu

Page 20:

Grid World Actions

Deterministic Grid World Stochastic Grid World

Slide Credit: http://ai.berkeley.edu

Page 21:

Solving MDPs

Assume for now:
 – We know the transition function
 – We have a state space defined

Slide Credit: http://ai.berkeley.edu

Page 22:

Components of an RL Agent

• Value function
  – How good is each state and/or state-action pair?

• Policy
  – How does an agent behave?

• Model
  – Agent's representation of the environment

(C) Dhruv Batra & Zsolt Kira 22

Page 23:

Optimal Quantities

The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

The value (utility) of a q-state (s, a):
Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

The optimal policy:
π*(s) = optimal action from state s

[Diagram: s is a state; (s, a) is a q-state; (s, a, s') is a transition]

Slide Credit: http://ai.berkeley.edu

Page 24:

Policies

Optimal policy when R(s, a, s’) = ‐0.03 for all non‐terminals s

• In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to a goal

• For MDPs, we want an optimal policy π*: S → A
  – A policy π gives an action for each state
  – An optimal policy is one that maximizes expected utility if followed
  – An explicit policy defines a reflex agent

Slide Credit: http://ai.berkeley.edu

Page 25:

What’s a good policy?

The optimal policy π*

Page 26:

What’s a good policy?

Maximizes current reward? Sum of all future reward?

The optimal policy π*

Page 27:

What’s a good policy?

Maximizes current reward? Sum of all future reward?

Discounted future rewards!

The optimal policy π*

Page 28:

Utilities of Sequences

Slide Credit: http://ai.berkeley.edu

Page 29:

Utilities of Sequences

• What preferences should an agent have over reward sequences?

• More or less?  [1, 2, 2] or [2, 3, 4]

• Now or later?  [0, 0, 1] or [1, 0, 0]

Slide Credit: http://ai.berkeley.edu

Page 30:

Discounting

• It's reasonable to maximize the sum of rewards
• It's also reasonable to prefer rewards now to rewards later
• One solution: values of rewards decay exponentially

Worth now: 1    Worth next step: γ    Worth in two steps: γ²

Slide Credit: http://ai.berkeley.edu

Page 31:

Discounting

• How to discount?
  – Each time we descend a level, we multiply in the discount once

• Why discount?
  – Sooner rewards probably do have higher utility than later rewards
  – Also helps our algorithms converge

• Example: discount of 0.5
  – U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3
  – U([1,2,3]) < U([3,2,1])

Slide Credit: http://ai.berkeley.edu

Page 32:

What’s a good policy?

Maximizes current reward? Sum of all future reward?

Discounted future rewards!

Formally:

π* = arg max_π E[ Σ_{t≥0} γ^t r_t | π ]

with s_0 ~ p(s_0), a_t ~ π(·|s_t), s_{t+1} ~ p(·|s_t, a_t)

The optimal policy π*

Page 33:

Optimal Policies

[Four panels showing the optimal policy under different rewards at every non-terminal state (living reward/penalty): R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, R(s) = -2.0]

Slide Credit: http://ai.berkeley.edu

Page 34:

Components of an RL Agent

• Value function
  – How good is each state and/or state-action pair?

• Policy
  – How does an agent behave?

• Model
  – Agent's representation of the environment

(C) Dhruv Batra & Zsolt Kira 34

Page 35:

Value Function
• A value function is a prediction of future reward

• "State Value Function" or simply "Value Function"
  – How good is a state?
  – Am I in trouble? Am I winning this game?

• "Action Value Function" or Q-function
  – How good is a state-action pair?
  – Should I do this now?

(C) Dhruv Batra & Zsolt Kira 35

Page 36:

Values of States
• Fundamental operation: compute the value of a state
  – Expected utility under optimal action
  – Average sum of (discounted) rewards

• Recursive definition of value: the Bellman Equation

[One-step look-ahead diagram over action a, q-state (s, a), and successor s']

Slide Credit: http://ai.berkeley.edu

Page 37:

Values of States
• Fundamental operation: compute the value of a state
  – Expected utility under optimal action
  – Average sum of (discounted) rewards

• Recursive definition of value: the Bellman Equation (written out below)

[One-step look-ahead diagram over action a, q-state (s, a), and successor s']

Slide Credit: http://ai.berkeley.edu
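A standard way to write the recursive definition of value named on the slides above, using the transition model T(s, a, s'), reward R(s, a, s'), and discount γ that appear later in the deck:

```latex
V^{*}(s) \;=\; \max_{a} Q^{*}(s,a)
       \;=\; \max_{a} \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma\,V^{*}(s')\bigr]
```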

Page 38:

Snapshot of Demo – Gridworld V Values

Noise = 0.2

Discount = 0.9

Living reward = 0

Slide Credit: http://ai.berkeley.edu

Page 39:

Snapshot of Demo – Gridworld Q Values

Noise = 0.2

Discount = 0.9

Living reward = 0

Slide Credit: http://ai.berkeley.edu

Page 40:

Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 41:

Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

How good is a state? The value function at state s is the expected cumulative reward from state s (and following the policy thereafter):

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 42:

Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

How good is a state? The value function at state s is the expected cumulative reward from state s (and following the policy thereafter):

How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s (and following the policy thereafter):

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
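In standard notation, the two definitions above are expectations over trajectories generated by the policy π, with discount γ:

```latex
V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t\ge 0}\gamma^{t} r_{t}\;\middle|\;s_{0}=s,\ \pi\right]
\qquad
Q^{\pi}(s,a) = \mathbb{E}\!\left[\sum_{t\ge 0}\gamma^{t} r_{t}\;\middle|\;s_{0}=s,\ a_{0}=a,\ \pi\right]
```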

Page 43:

Components of an RL Agent

• Value function
  – How good is each state and/or state-action pair?

• Policy
  – How does an agent behave?

• Model
  – Agent's representation of the environment

(C) Dhruv Batra & Zsolt Kira 43

Page 44:

Model

(C) Dhruv Batra & Zsolt Kira 44
Slide Credit: David Silver

Page 45:

Model

(C) Dhruv Batra & Zsolt Kira 45

• Model predicts what the world will do next

Slide Credit: David Silver

Page 46:

Components of an RL Agent

• Value function
  – How good is each state and/or state-action pair?

• Policy
  – How does an agent behave?

• Model
  – Agent's representation of the environment

(C) Dhruv Batra 46

Page 47:

Solving MDPs

Assume for now:
 – We know the transition function
 – We have a state space defined

Slide Credit: http://ai.berkeley.edu

Page 48:

Value Iteration

• Start with V0(s) = 0: no time steps left means an expected reward sum of zero

• Given a vector of Vk(s) values, do one update from each state:
  Vk+1(s) ← max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ Vk(s') ]

• Repeat until convergence

• Complexity of each iteration: O(S²A)

• Theorem: will converge to unique optimal values
  – Basic idea: approximations get refined towards optimal values
  – Policy may converge long before values do

[One-step look-ahead diagram: Vk+1(s) over actions a, q-states (s, a), transitions (s, a, s'), and successor values Vk(s')]

Slide Credit: http://ai.berkeley.edu

Page 49:

Example: Racing

• A robot car wants to travel far, quickly
• Three states: Cool, Warm, Overheated
• Two actions: Slow, Fast
• Going faster gets double reward

Transition model (from the diagram):
  Cool, Slow → Cool (prob 1.0), reward +1
  Cool, Fast → Cool (prob 0.5) or Warm (prob 0.5), reward +2
  Warm, Slow → Cool (prob 0.5) or Warm (prob 0.5), reward +1
  Warm, Fast → Overheated (prob 1.0), reward -10

Slide Credit: http://ai.berkeley.edu

Page 50:

Racing Search Tree

Slide Credit: http://ai.berkeley.edu

Page 51:

Example: Value Iteration (assume no discount!)

        Cool    Warm    Overheated
  V0:    0       0       0
  V1:    2       1       0
  V2:    3.5     2.5     0

Slide Credit: http://ai.berkeley.edu
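A minimal Python sketch of value iteration on the racing MDP from two slides back; the transition table is transcribed from that slide, and the names (racing_mdp, value_iteration) are just illustrative. With no discount it reproduces the table above (V1 = 2, 1, 0 and V2 = 3.5, 2.5, 0).

```python
# Value iteration sketch on the "Example: Racing" MDP.
# T[(s, a)] is a list of (probability, next_state, reward) triples.
racing_mdp = {
    ("cool", "slow"): [(1.0, "cool", +1)],
    ("cool", "fast"): [(0.5, "cool", +2), (0.5, "warm", +2)],
    ("warm", "slow"): [(0.5, "cool", +1), (0.5, "warm", +1)],
    ("warm", "fast"): [(1.0, "overheated", -10)],
    # "overheated" is terminal: no actions, its value stays 0.
}
states = ["cool", "warm", "overheated"]
actions = ["slow", "fast"]

def value_iteration(T, states, actions, gamma=1.0, iters=2):
    V = {s: 0.0 for s in states}                      # V0(s) = 0
    for k in range(iters):
        V_new = {}
        for s in states:
            qs = [sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])
                  for a in actions if (s, a) in T]
            V_new[s] = max(qs) if qs else 0.0         # terminal states keep value 0
        V = V_new
        print(f"V{k + 1}:", V)
    return V

value_iteration(racing_mdp, states, actions)
# V1: cool = 2.0, warm = 1.0, overheated = 0.0
# V2: cool = 3.5, warm = 2.5, overheated = 0.0
```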

Page 52:

Computing Actions from Values

• Let's imagine we have the optimal values V*(s)

• How should we act?
  – It's not obvious!

• We need to do a one-step calculation:
  π*(s) = arg max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]

• This is called policy extraction, since it gets the policy implied by the values

Slide Credit: http://ai.berkeley.edu

Page 53:

Computing Actions from Q-Values

• Let's imagine we have the optimal q-values Q*(s, a)

• How should we act?
  – Completely trivial to decide: π*(s) = arg max_a Q*(s, a)

• Important lesson: actions are easier to select from q-values than values!

Slide Credit: http://ai.berkeley.edu

Page 54:

Reinforcement Learning

• Still assume a Markov decision process (MDP):
  – A set of states s ∈ S
  – A set of actions (per state) A
  – A model T(s, a, s')
  – A reward function R(s, a, s')

• Still looking for a policy π(s)

• New twist: don't know T or R
  – I.e. we don't know which states are good or what the actions do
  – Must actually try actions and states out to learn

Slide Credit: http://ai.berkeley.edu

Page 55:

Model-Based Learning

• Model-Based Idea:
  – Learn an approximate model based on experiences
  – Solve for values as if the learned model were correct

• Step 1: Learn empirical MDP model
  – Count outcomes s' for each (s, a)
  – Normalize to give an estimate of T̂(s, a, s')
  – Discover each R̂(s, a, s') when we experience (s, a, s')

• Step 2: Solve the learned MDP
  – For example, use value iteration, as before

Slide Credit: http://ai.berkeley.edu
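A rough sketch of Step 1 above (learning the empirical MDP from experienced transitions); the function and variable names here are illustrative, not from the slides.

```python
from collections import defaultdict

def estimate_mdp(transitions):
    """transitions: list of experienced (s, a, r, s') tuples.
    Returns empirical T_hat[(s, a)][s'] and R_hat[(s, a, s')]."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> s' -> count
    rewards = {}                                     # (s, a, s') -> observed reward
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r                      # discovered when experienced
    T_hat = {}
    for sa, outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        T_hat[sa] = {s2: c / total for s2, c in outcome_counts.items()}  # normalize
    return T_hat, rewards

# Step 2 would then run value iteration (as before) on (T_hat, rewards).
```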

Page 56:

The Story So Far: MDPs and RL

Known MDP: Offline Solution
  Goal: Compute V*, Q*, π*         Technique: Value / policy iteration
  Goal: Evaluate a fixed policy π  Technique: Policy evaluation

Unknown MDP: Model-Based
  Goal: Compute V*, Q*, π*         Technique: VI/PI on approx. MDP
  Goal: Evaluate a fixed policy π  Technique: PE on approx. MDP

Unknown MDP: Model-Free
  Goal: Compute V*, Q*, π*         Technique: Q-learning
  Goal: Evaluate a fixed policy π  Technique: Value Learning

Slide Credit: http://ai.berkeley.edu

Page 57:

Model-Free Learning

Slide Credit: http://ai.berkeley.edu

Page 58:

Example: Expected Age

Goal: Compute expected age of students

Known P(A): compute E[A] = Σ_a P(a) · a directly

Without P(A), instead collect samples [a1, a2, …, aN]:

Unknown P(A): "Model Based"
  – Estimate P̂(a) = num(a) / N, then compute E[A] ≈ Σ_a P̂(a) · a
  – Why does this work? Because eventually you learn the right model.

Unknown P(A): "Model Free"
  – Compute E[A] ≈ (1/N) Σ_i a_i
  – Why does this work? Because samples appear with the right frequencies.

Slide Credit: http://ai.berkeley.edu

Page 59:

Reinforcement Learning
• We still assume an MDP:
  – A set of states s ∈ S
  – A set of actions (per state) A
  – A model T(s, a, s')
  – A reward function R(s, a, s')

• Still looking for a policy π(s)

• New twist: don't know T or R, so must try out actions

• Big idea: Compute all averages over T using sample outcomes

Slide Credit: http://ai.berkeley.edu

Page 60:

Model-Free Learning

• Model-free (temporal difference) learning
  – Experience world through episodes: s, a, r, s', a', r', s'', …
  – Update estimates each transition (s, a, r, s')
  – Over time, updates will mimic Bellman updates

[Diagram: a trajectory s → (s, a) → s' → (s', a') → s'' with rewards r along the way]

Slide Credit: http://ai.berkeley.edu

Page 61:

Passive Reinforcement Learning

• Simplified task: policy evaluation
  – Input: a fixed policy π(s)
  – You don't know the transitions T(s, a, s')
  – You don't know the rewards R(s, a, s')
  – Goal: learn the state values

• In this case:
  – Learner is "along for the ride"
  – No choice about what actions to take
  – Just execute the policy and learn from experience
  – This is NOT offline planning! You actually take actions in the world.

Slide Credit: http://ai.berkeley.edu

Page 62:

Direct Evaluation

• Goal: Compute values for each state under π

• Idea: Average together observed sample values
  – Act according to π
  – Every time you visit a state, write down what the sum of discounted rewards turned out to be
  – Average those samples

• This is called direct evaluation

Slide Credit: http://ai.berkeley.edu

Page 63:

Example: Direct Evaluation

Input Policy π (assume γ = 1)

Grid states: A, B, C, D, E

Observed Episodes (Training):
  Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 3: E, north, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 4: E, north, C, -1;  C, east, A, -1;  A, exit, x, -10

Output Values:  A = -10,  B = +8,  C = +4,  D = +10,  E = -2

Slide Credit: http://ai.berkeley.edu
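A small sketch of direct evaluation on the four episodes above (γ = 1); the episode data is transcribed from the slide, the other names are illustrative. It reproduces the output values A = -10, B = +8, C = +4, D = +10, E = -2.

```python
from collections import defaultdict

# Each episode: list of (state, action, next_state_or_exit, reward), from the slide.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

def direct_evaluation(episodes, gamma=1.0):
    returns = defaultdict(list)
    for ep in episodes:
        G = 0.0
        # Walk backwards so G is the discounted return from each visited state.
        for s, _, _, r in reversed(ep):
            G = r + gamma * G
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

print(direct_evaluation(episodes))
# -> A = -10.0, B = 8.0, C = 4.0, D = 10.0, E = -2.0
```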

Page 64:

Problems with Direct Evaluation

• What's good about direct evaluation?
  – It's easy to understand
  – It doesn't require any knowledge of T, R
  – It eventually computes the correct average values, using just sample transitions

• What's bad about it?
  – It wastes information about state connections
  – Each state must be learned separately
  – So, it takes a long time to learn

Output Values:  A = -10,  B = +8,  C = +4,  D = +10,  E = -2

If B and E both go to C under this policy, how can their values be different?

Slide Credit: http://ai.berkeley.edu

Page 65:

Why Not Use Policy Evaluation?

• Simplified Bellman updates calculate V for a fixed policy:
  – Each round, replace V with a one-step look-ahead layer over V
  – This approach fully exploited the connections between the states
  – Unfortunately, we need T and R to do it!

• Key question: how can we do this update to V without knowing T and R?
  – In other words, how do we take a weighted average without knowing the weights?

[Diagram: s → (s, π(s)) → s' via transition (s, π(s), s')]

Slide Credit: http://ai.berkeley.edu

Page 66:

Sample‐Based Policy Evaluation?

• We want to improve our estimate of V by computing these averages over successor states s'

• Idea: Take samples of outcomes s' (by doing the action!) and average

• What's the difficulty of this algorithm?
  – Almost! But we can't rewind time to get sample after sample from state s.

[Diagram: from s, taking π(s) leads to sampled successors s'1, s'2, s'3]

Slide Credit: http://ai.berkeley.edu

Page 67:

Temporal Difference Learning

• Big idea: learn from every experience!
  – Update V(s) each time we experience a transition (s, a, s', r)
  – Likely outcomes s' will contribute updates more often

• Temporal difference learning of values
  – Policy still fixed, still doing evaluation!
  – Move values toward value of whatever successor occurs: running average
  – Sample of V(s) and update to V(s): see the update equations below

[Diagram: s → (s, π(s)) → s']

Slide Credit: http://ai.berkeley.edu
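The sample and update referred to above, written out in the usual TD(0) form (V is the value estimate for the fixed policy π, α the learning rate):

```latex
\text{sample of } V(s):\quad \text{sample} = r + \gamma\,V^{\pi}(s') \\
\text{update to } V(s):\quad V^{\pi}(s) \leftarrow (1-\alpha)\,V^{\pi}(s) + \alpha\cdot\text{sample} \\
\text{same update:}\quad V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\,\bigl(\text{sample} - V^{\pi}(s)\bigr)
```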

Page 68:

Exponential Moving Average

• Exponential moving average
  – The running interpolation update: x̄_n = (1 - α) · x̄_(n-1) + α · x_n
  – Makes recent samples more important
  – Forgets about the past

• Decreasing learning rate (alpha) can give converging averages

Why do we want to forget about the past? (distant past values were wrong anyway)

Slide Credit: http://ai.berkeley.edu

Page 69:

Example: Temporal Difference Learning

Assume: γ = 1, α = 1/2

Grid states: A, B, C, D, E (initial values: all 0 except D = 8)

Observed transitions:
  B, east, C, -2:  sample = -2 + γ·V(C) = -2, so V(B) ← (1/2)·0 + (1/2)·(-2) = -1
  C, east, D, -2:  sample = -2 + γ·V(D) = 6,  so V(C) ← (1/2)·0 + (1/2)·6 = 3

Value snapshots: (B = 0, C = 0, D = 8) → (B = -1, C = 0, D = 8) → (B = -1, C = 3, D = 8)

Slide Credit: http://ai.berkeley.edu

Page 70:

Problems with TD Value Learning

• TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages

• However, if we want to turn values into a (new) policy, we're sunk:
  – Choosing the best action requires a one-step look-ahead over the actions, which needs T and R

• Idea: learn Q-values, not values
  – Makes action selection model-free too!

Slide Credit: http://ai.berkeley.edu

Page 71:

Active Reinforcement Learning

• Full reinforcement learning: optimal policies (like value iteration)
  – You don't know the transitions T(s, a, s')
  – You don't know the rewards R(s, a, s')
  – You choose the actions now
  – Goal: learn the optimal policy / values

• In this case:
  – Learner makes choices!
  – Fundamental tradeoff: exploration vs. exploitation
  – This is NOT offline planning! You actually take actions in the world and find out what happens…

Slide Credit: http://ai.berkeley.edu

Page 72:

Detour: Q-Value Iteration

• Value iteration: find successive (depth-limited) values
  – Start with V0(s) = 0, which we know is right
  – Given Vk, calculate the depth k+1 values for all states

• But Q-values are more useful, so compute them instead
  – Start with Q0(s, a) = 0, which we know is right
  – Given Qk, calculate the depth k+1 q-values for all q-states (both updates written out below)

Slide Credit: http://ai.berkeley.edu
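The two depth-k+1 updates referred to above, in standard notation:

```latex
V_{k+1}(s) \leftarrow \max_{a}\sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma\,V_{k}(s')\bigr]
\qquad
Q_{k+1}(s,a) \leftarrow \sum_{s'} T(s,a,s')\,\Bigl[R(s,a,s') + \gamma \max_{a'} Q_{k}(s',a')\Bigr]
```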

Page 73:

Q-Learning

• We'd like to do Q-value updates to each Q-state:
  – But we can't compute this update without knowing T, R

• Instead, compute the average as we go
  – Receive a sample transition (s, a, r, s')
  – This sample suggests a target of r + γ max_a' Q(s', a')
  – But we want to average over results from (s, a)
  – So keep a running average (see the sketch below)

Slide Credit: http://ai.berkeley.edu
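A minimal sketch of the running-average Q-learning update described above (tabular case; the function and variable names are illustrative):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from a sample transition (s, a, r, s')."""
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)   # target suggested by the sample
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample      # running average
    return Q

# Usage sketch: Q as a default-0 table, updated after each experienced transition.
Q = defaultdict(float)
actions = ["north", "south", "east", "west"]
Q = q_learning_update(Q, s="A", a="east", r=-1.0, s2="B", actions=actions)
```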

Page 74:

Q-Learning Properties

• Amazing result: Q-learning converges to the optimal policy, even if you're acting suboptimally!

• This is called off-policy learning

• Caveats:
  – You have to explore enough
  – You have to eventually make the learning rate small enough
  – … but not decrease it too quickly
  – Basically, in the limit, it doesn't matter how you select actions (!)

Slide Credit: http://ai.berkeley.edu

Page 75:

Generalizing Across States

• Basic Q-Learning keeps a table of all q-values

• In realistic situations, we cannot possibly learn about every single state!
  – Too many states to visit them all in training
  – Too many states to hold the q-tables in memory

• Instead, we want to generalize:
  – Learn about some small number of training states from experience
  – Generalize that experience to new, similar situations
  – This is the fundamental idea in machine learning!

[demo – RL pacman]

Slide Credit: http://ai.berkeley.edu

Page 76:

Example: Pacman

Let's say we discover through experience that this state is bad:

In naïve q-learning, we know nothing about this state:

Or even this one!

Slide Credit: http://ai.berkeley.edu

Page 77:

Feature-Based Representations

• Solution: describe a state using a vector of features (properties)
  – Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  – Example features:
    • Distance to closest ghost
    • Distance to closest dot
    • Number of ghosts
    • 1 / (distance to dot)²
    • Is Pacman in a tunnel? (0/1)
    • … etc.
    • Is it the exact state on this slide?
  – Can also describe a q-state (s, a) with features (e.g. action moves closer to food)

Slide Credit: http://ai.berkeley.edu

Page 78:

Linear Value Functions

• Using a feature representation, we can write a q function (or value function) for any state using a few weights:
  V(s) = w1·f1(s) + w2·f2(s) + … + wn·fn(s)
  Q(s, a) = w1·f1(s, a) + w2·f2(s, a) + … + wn·fn(s, a)

• Advantage: our experience is summed up in a few powerful numbers

• Disadvantage: states may share features but actually be very different in value!

Slide Credit: http://ai.berkeley.edu

Page 79:

Approximate Q-Learning

• Q-learning with linear Q-functions:
  – On a transition (s, a, r, s'), difference = [r + γ max_a' Q(s', a')] - Q(s, a)
  – Exact Q's:        Q(s, a) ← Q(s, a) + α · difference
  – Approximate Q's:  w_i ← w_i + α · difference · f_i(s, a)   (see the sketch below)

• Intuitive interpretation:
  – Adjust weights of active features
  – E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features

• Formal justification: online least squares

Slide Credit: http://ai.berkeley.edu
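A sketch of approximate Q-learning with a linear Q-function, following the weight update described above; the feature function and the names used here are illustrative, not from the slides.

```python
def linear_q(weights, features):
    """Q(s, a) = sum_i w_i * f_i(s, a), with features given as a dict."""
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

def approx_q_update(weights, feat_fn, s, a, r, s2, actions, alpha=0.05, gamma=0.9):
    """Adjust the weights of the active features toward the observed target."""
    target = r + gamma * max(linear_q(weights, feat_fn(s2, a2)) for a2 in actions)
    difference = target - linear_q(weights, feat_fn(s, a))
    for k, v in feat_fn(s, a).items():
        weights[k] = weights.get(k, 0.0) + alpha * difference * v
    return weights

# Example feature function (purely illustrative):
# def feat_fn(s, a): return {"bias": 1.0, "dist-to-closest-dot": ..., "num-ghosts": ...}
```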

Page 80:

Minimizing Error*

Approximate q update explained:

Imagine we had only one point x, with features f(x), target value y, and weights w. Minimizing the squared error (1/2)(y - Σ_k w_k f_k(x))² by gradient descent gives the update
  w_m ← w_m + α (y - Σ_k w_k f_k(x)) f_m(x)
i.e. "target" minus "prediction", times the feature value.

Slide Credit: http://ai.berkeley.edu

Page 81:

Approaches to RL

• Value-based RL
  – Estimate the optimal action-value function

• Policy-based RL
  – Search directly for the optimal policy

• Model
  – Build a model of the world
    • State transition, reward probabilities
  – Plan (e.g. by look-ahead) using the model

(C) Dhruv Batra 81

Page 82:

Deep RL

• Value-based RL
  – Use neural nets to represent the Q function

• Policy-based RL
  – Use neural nets to represent the policy

• Model
  – Use neural nets to represent and learn the model

(C) Dhruv Batra 82

Page 83:

Q-Networks

Slide Credit: David Silver

Page 84:

Case Study: Playing Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step

Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission. 

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 85:

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 86:

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

Input: state st

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 87:

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

Familiar conv layers, FC layer

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 88:

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values) Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st,a4)

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 89:

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values) Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st,a4)

Number of actions between 4-18 depending on Atari game

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 90:

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values) Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st,a4)

Number of actions between 4-18 depending on Atari game

A single feedforward pass to compute Q-values for all actions from the current state => efficient!

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
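A PyTorch sketch of the architecture listed above (84x84x4 input, 16 8x8 conv stride 4, 32 4x4 conv stride 2, FC-256, FC with one output per action). The class name and defaults are illustrative, not the authors' exact code.

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Q(s, .; theta): maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, num_actions=4):                  # 4-18 actions depending on the Atari game
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # -> 16 x 20 x 20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> 32 x 9 x 9
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC: Q(s, a1), ..., Q(s, aN)
        )

    def forward(self, x):                                # x: (batch, 4, 84, 84)
        return self.head(self.features(x))               # (batch, num_actions)

q_net = AtariQNetwork(num_actions=4)
q_values = q_net(torch.zeros(1, 4, 84, 84))              # all Q-values in a single forward pass
```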

Page 91:

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 92:

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Forward Pass
Loss function:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 93:

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Loss function:

where

Forward Pass

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 94:

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Loss function:

where

Forward Pass

Backward Pass
Gradient update (with respect to Q-function parameters θ):

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 95:

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Loss function:

where

Iteratively try to make the Q-value close to the target value (yi) it should have, if the Q-function corresponds to the optimal Q* (and optimal policy π*)

Forward Pass

Backward Pass
Gradient update (with respect to Q-function parameters θ):

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
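The loss and gradient referenced on the last few slides, in the standard form used by Mnih et al. (θ_i are the current Q-network weights, θ_{i-1} the previous iteration's weights):

```latex
y_{i} = \mathbb{E}\Bigl[r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \,\Bigm|\, s,a\Bigr]
\qquad
L_{i}(\theta_{i}) = \mathbb{E}_{s,a}\Bigl[\bigl(y_{i} - Q(s,a;\theta_{i})\bigr)^{2}\Bigr]
```

Differentiating the loss gives the backward-pass gradient (up to a constant factor):

```latex
\nabla_{\theta_{i}} L_{i}(\theta_{i}) \;\propto\; -\,\mathbb{E}\Bigl[\bigl(y_{i} - Q(s,a;\theta_{i})\bigr)\,\nabla_{\theta_{i}} Q(s,a;\theta_{i})\Bigr]
```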

Page 96:

Training the Q-network: Experience Replay

96

Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Page 97:

Training the Q-network: Experience Replay

97

Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay
- Continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played
- Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples

[Mnih et al. NIPS Workshop 2013; Nature 2015]
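A minimal replay-memory sketch matching the description above (a fixed-size table of (st, at, rt, st+1) transitions, sampled in random minibatches); the class and parameter names are illustrative.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity table of transitions (s_t, a_t, r_t, s_t1)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random minibatch breaks the correlation between consecutive samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage sketch inside the training loop:
# memory.push(s_t, a_t, r_t, s_t1, done)
# if len(memory) >= 32:
#     batch = memory.sample(32)   # train the Q-network on this minibatch
```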

Page 98:

Experience Replay

(C) Dhruv Batra 98
Slide Credit: David Silver

Page 99:

Video by Károly Zsolnai-Fehér. Reproduced with permission.
https://www.youtube.com/watch?v=V1eYniJ0Rnk

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n