CS 4803 / 7643: Deep Learning
Zsolt Kira, Georgia Tech
Topics: (Deep) Reinforcement Learning



Page 1:

CS 4803 / 7643: Deep Learning

Zsolt Kira, Georgia Tech

Topics: – (Deep) Reinforcement Learning

Page 2:

Administrativia

• Projects!

• Project Check-in due April 11th

– Will be graded pass/fail; if you fail, you can address the issues
– Counts for 5 points of project score

• Poster due date moved to April 23rd (last day of class)
– No presentations

• Final submission due date April 30th

(C) Dhruv Batra & Zsolt Kira 2

Page 3:

Project Check-in/Progress Report

(C) Dhruv Batra & Zsolt Kira 3

Page 4:

Administrativia

• Send me requests on Piazza for topics to cover after reinforcement learning

(C) Dhruv Batra & Zsolt Kira 4

Page 5:

Types of Learning

• Supervised learning
  – Learning from a "teacher"
  – Training data includes desired outputs

• Unsupervised learning
  – Discover structure in data
  – Training data does not include desired outputs

• Reinforcement learning
  – Learning to act under evaluative feedback (rewards)

(C) Dhruv Batra & Zsolt Kira 5
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 6:

What is Reinforcement Learning?

Agent-oriented learning—learning by interacting with an environment to achieve a goal ○ more realistic and ambitious than other kinds of machine

learningLearning by trial and error, with only delayed evaluative feedback (reward)○ the kind of machine learning most like natural learning○ learning that can tell for itself when it is right or wrong

Slide Credit: Rich Sutton

Page 7:

David Silver 2015

Page 8:

Example: Hajime Kimura’s RL Robots

[Video stills: Before / After; Backward; New Robot, Same algorithm]

Slide Credit: Rich Sutton

Page 9:

RL API

● Environment may be unknown, nonlinear, stochastic and complex

● Agent learns a policy mapping states to actions
  ○ Seeking to maximize its cumulative reward in the long run

[Diagram: the agent sends an action (response, control) to the environment (world); the environment returns a state (stimulus, situation) and a reward (gain, payoff, cost)]

Slide Credit: Rich Sutton

Page 10:

RL API

(C) Dhruv Batra & Zsolt Kira 10
Slide Credit: David Silver

Page 11:

Signature challenges of RL

• Evaluative feedback (reward)
• Sequentiality, delayed consequences
• Need for trial and error, to explore as well as exploit
• Non-stationarity
• The fleeting nature of time and online data

Slide Credit: Rich Sutton

Page 12:

Robot Locomotion

Objective: Make the robot move forward

State: Angle and position of the joints
Action: Torques applied on joints
Reward: 1 at each time step the robot is upright and moving forward

Figures copyright John Schulman et al., 2016. Reproduced with permission. 

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 13:

Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step

Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission. 

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 14:

Go

Objective: Win the game!

State: Position of all pieces
Action: Where to put the next piece down
Reward: 1 if win at the end of the game, 0 otherwise

This image is CC0 public domain

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 15:

Demo
• http://projects.rajivshah.com/rldemo/
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html

(C) Dhruv Batra & Zsolt Kira 15

Page 16:

Markov Decision Process
- Mathematical formulation of the RL problem

- Life is a trajectory: s0, a0, r0, s1, a1, r1, …

Defined by (S, A, R, P, γ):
  S: set of possible states
  A: set of possible actions
  R: distribution of reward given (state, action) pair
  P: transition probability, i.e. distribution over next state given (state, action) pair
  γ: discount factor

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 17:

What is Markov about MDPs?

• “Markov” generally means that given the present state, the future and the past are independent

• For Markov decision processes, “Markov” means action outcomes depend only on the current state

Andrey Markov (1856‐1922)

Slide Credit: http://ai.berkeley.edu

Page 18:

Example: Grid World

The agent receives rewards each time step
 – Small "living" reward each step (can be negative)
 – Big rewards come at the end (good or bad)

This is now our supervision! Different from supervised learning:
 – Rewards can be sparse: how can we assign credit/blame to actions far into the past that caused our large positive/negative reward?
 – The world is uncertain
 – If we have a model of it, how can we solve? If we don't, how can we solve it? Should we build a model of it?

Slide Credit: http://ai.berkeley.edu

Page 19:

Example: Grid World

A maze-like problem
 – The agent lives in a grid
 – Walls block the agent's path

Noisy movement: actions do not always go as planned
 – 80% of the time, the action North takes the agent North (if there is no wall there)
 – 10% of the time, North takes the agent West; 10% East
 – If there is a wall in the direction the agent would have been taken, the agent stays put

The agent receives rewards each time step
 – Small "living" reward each step (can be negative)
 – Big rewards come at the end (good or bad)

Goal: maximize sum of rewards

Slide Credit: http://ai.berkeley.edu

Page 20:

Grid World Actions

Deterministic Grid World Stochastic Grid World

Slide Credit: http://ai.berkeley.edu

Page 21:

Solving MDPs

Assume for now:
 – We know the transition function
 – We have a state space defined

Slide Credit: http://ai.berkeley.edu

Page 22:

Components of an RL Agent

• Value function
  – How good is each state and/or state-action pair?

• Policy
  – How does an agent behave?

• Model
  – Agent's representation of the environment

(C) Dhruv Batra & Zsolt Kira 22

Page 23:

Optimal Quantities

The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

The value (utility) of a q-state (s, a):
Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

The optimal policy:
π*(s) = optimal action from state s

[Diagram: s is a state; (s, a) is a q-state; (s, a, s') is a transition]

Slide Credit: http://ai.berkeley.edu

Page 24:

Policies

Optimal policy when R(s, a, s’) = ‐0.03 for all non‐terminals s

• In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to a goal

• For MDPs, we want an optimal policy π*: S → A
  – A policy π gives an action for each state
  – An optimal policy is one that maximizes expected utility if followed
  – An explicit policy defines a reflex agent

Slide Credit: http://ai.berkeley.edu

Page 25:

What’s a good policy?

The optimal policy π*

Page 26:

What’s a good policy?

Maximizes current reward? Sum of all future reward?

The optimal policy π*

Page 27:

What’s a good policy?

Maximizes current reward? Sum of all future reward?

Discounted future rewards!

The optimal policy π*

Page 28:

Utilities of Sequences

Slide Credit: http://ai.berkeley.edu

Page 29:

Utilities of Sequences

• What preferences should an agent have over reward sequences?

• More or less?  [1, 2, 2] or [2, 3, 4]

• Now or later?  [0, 0, 1] or [1, 0, 0]

Slide Credit: http://ai.berkeley.edu

Page 30:

Discounting

• It's reasonable to maximize the sum of rewards
• It's also reasonable to prefer rewards now to rewards later
• One solution: values of rewards decay exponentially

Worth now: 1    Worth next step: γ    Worth in two steps: γ²

Slide Credit: http://ai.berkeley.edu

Page 31:

Discounting

• How to discount?
  – Each time we descend a level, we multiply in the discount once

• Why discount?
  – Sooner rewards probably do have higher utility than later rewards
  – Also helps our algorithms converge

• Example: discount of 0.5
  – U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3
  – U([1,2,3]) < U([3,2,1])

Slide Credit: http://ai.berkeley.edu

Page 32:

What’s a good policy?

Maximizes current reward? Sum of all future reward?

Discounted future rewards!

Formally:

π* = arg max_π E[ Σ_{t≥0} γ^t r_t | π ]

with s_0 ~ p(s_0), a_t ~ π(·|s_t), s_{t+1} ~ p(·|s_t, a_t)

The optimal policy π*

Page 33:

Optimal Policies

[Four panels showing the optimal policy under different rewards at every non-terminal state (living reward/penalty): R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, R(s) = -2.0]

Slide Credit: http://ai.berkeley.edu

Page 34:

Components of an RL Agent

• Value function
  – How good is each state and/or state-action pair?

• Policy
  – How does an agent behave?

• Model
  – Agent's representation of the environment

(C) Dhruv Batra & Zsolt Kira 34

Page 35:

Value Function
• A value function is a prediction of future reward

• "State Value Function" or simply "Value Function"
  – How good is a state?
  – Am I in trouble? Am I winning this game?

• "Action Value Function" or Q-function
  – How good is a state-action pair?
  – Should I do this now?

(C) Dhruv Batra & Zsolt Kira 35

Page 36:

Values of States
• Fundamental operation: compute the value of a state
  – Expected utility under optimal action
  – Average sum of (discounted) rewards

• Recursive definition of value: the Bellman Equation

[One-step look-ahead diagram over action a, q-state (s, a), and successor s']

Slide Credit: http://ai.berkeley.edu

Page 37:

Values of States
• Fundamental operation: compute the value of a state
  – Expected utility under optimal action
  – Average sum of (discounted) rewards

• Recursive definition of value: the Bellman Equation (written out below)

[One-step look-ahead diagram over action a, q-state (s, a), and successor s']

Slide Credit: http://ai.berkeley.edu
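A standard way to write the recursive definition of value named on the slides above, using the transition model T(s, a, s'), reward R(s, a, s'), and discount γ that appear later in the deck:

```latex
V^{*}(s) \;=\; \max_{a} Q^{*}(s,a)
       \;=\; \max_{a} \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma\,V^{*}(s')\bigr]
```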

Page 38:

Snapshot of Demo – Gridworld V Values

Noise = 0.2

Discount = 0.9

Living reward = 0

Slide Credit: http://ai.berkeley.edu

Page 39:

Snapshot of Demo – Gridworld Q Values

Noise = 0.2

Discount = 0.9

Living reward = 0

Slide Credit: http://ai.berkeley.edu

Page 40:

Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 41:

Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

How good is a state? The value function at state s is the expected cumulative reward from state s (and following the policy thereafter):

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 42:

Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

How good is a state? The value function at state s is the expected cumulative reward from state s (and following the policy thereafter):

How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s (and following the policy thereafter):

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
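In standard notation, the two definitions above are expectations over trajectories generated by the policy π, with discount γ:

```latex
V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t\ge 0}\gamma^{t} r_{t}\;\middle|\;s_{0}=s,\ \pi\right]
\qquad
Q^{\pi}(s,a) = \mathbb{E}\!\left[\sum_{t\ge 0}\gamma^{t} r_{t}\;\middle|\;s_{0}=s,\ a_{0}=a,\ \pi\right]
```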

Page 43:

Components of an RL Agent

• Value function
  – How good is each state and/or state-action pair?

• Policy
  – How does an agent behave?

• Model
  – Agent's representation of the environment

(C) Dhruv Batra & Zsolt Kira 43

Page 44:

Model

(C) Dhruv Batra & Zsolt Kira 44
Slide Credit: David Silver

Page 45:

Model

(C) Dhruv Batra & Zsolt Kira 45

• Model predicts what the world will do next

Slide Credit: David Silver

Page 46:

Components of an RL Agent

• Value function
  – How good is each state and/or state-action pair?

• Policy
  – How does an agent behave?

• Model
  – Agent's representation of the environment

(C) Dhruv Batra 46

Page 47:

Solving MDPs

Assume for now:
 – We know the transition function
 – We have a state space defined

Slide Credit: http://ai.berkeley.edu

Page 48:

Value Iteration

• Start with V0(s) = 0: no time steps left means an expected reward sum of zero

• Given a vector of Vk(s) values, do one update from each state:
  Vk+1(s) ← max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ Vk(s') ]

• Repeat until convergence

• Complexity of each iteration: O(S²A)

• Theorem: will converge to unique optimal values
  – Basic idea: approximations get refined towards optimal values
  – Policy may converge long before values do

[One-step look-ahead diagram: Vk+1(s) over actions a, q-states (s, a), transitions (s, a, s'), and successor values Vk(s')]

Slide Credit: http://ai.berkeley.edu

Page 49:

Example: Racing

• A robot car wants to travel far, quickly
• Three states: Cool, Warm, Overheated
• Two actions: Slow, Fast
• Going faster gets double reward

Transition model (from the diagram):
  Cool, Slow → Cool (prob 1.0), reward +1
  Cool, Fast → Cool (prob 0.5) or Warm (prob 0.5), reward +2
  Warm, Slow → Cool (prob 0.5) or Warm (prob 0.5), reward +1
  Warm, Fast → Overheated (prob 1.0), reward -10

Slide Credit: http://ai.berkeley.edu

Page 50:

Racing Search Tree

Slide Credit: http://ai.berkeley.edu

Page 51:

Example: Value Iteration (assume no discount!)

        Cool    Warm    Overheated
  V0:    0       0       0
  V1:    2       1       0
  V2:    3.5     2.5     0

Slide Credit: http://ai.berkeley.edu
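A minimal Python sketch of value iteration on the racing MDP from two slides back; the transition table is transcribed from that slide, and the names (racing_mdp, value_iteration) are just illustrative. With no discount it reproduces the table above (V1 = 2, 1, 0 and V2 = 3.5, 2.5, 0).

```python
# Value iteration sketch on the "Example: Racing" MDP.
# T[(s, a)] is a list of (probability, next_state, reward) triples.
racing_mdp = {
    ("cool", "slow"): [(1.0, "cool", +1)],
    ("cool", "fast"): [(0.5, "cool", +2), (0.5, "warm", +2)],
    ("warm", "slow"): [(0.5, "cool", +1), (0.5, "warm", +1)],
    ("warm", "fast"): [(1.0, "overheated", -10)],
    # "overheated" is terminal: no actions, its value stays 0.
}
states = ["cool", "warm", "overheated"]
actions = ["slow", "fast"]

def value_iteration(T, states, actions, gamma=1.0, iters=2):
    V = {s: 0.0 for s in states}                      # V0(s) = 0
    for k in range(iters):
        V_new = {}
        for s in states:
            qs = [sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])
                  for a in actions if (s, a) in T]
            V_new[s] = max(qs) if qs else 0.0         # terminal states keep value 0
        V = V_new
        print(f"V{k + 1}:", V)
    return V

value_iteration(racing_mdp, states, actions)
# V1: cool = 2.0, warm = 1.0, overheated = 0.0
# V2: cool = 3.5, warm = 2.5, overheated = 0.0
```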

Page 52:

Computing Actions from Values

• Let's imagine we have the optimal values V*(s)

• How should we act?
  – It's not obvious!

• We need to do a one-step calculation:
  π*(s) = arg max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V*(s') ]

• This is called policy extraction, since it gets the policy implied by the values

Slide Credit: http://ai.berkeley.edu

Page 53:

Computing Actions from Q-Values

• Let's imagine we have the optimal q-values Q*(s, a)

• How should we act?
  – Completely trivial to decide: π*(s) = arg max_a Q*(s, a)

• Important lesson: actions are easier to select from q-values than values!

Slide Credit: http://ai.berkeley.edu

Page 54:

Reinforcement Learning

• Still assume a Markov decision process (MDP):
  – A set of states s ∈ S
  – A set of actions (per state) A
  – A model T(s, a, s')
  – A reward function R(s, a, s')

• Still looking for a policy π(s)

• New twist: don't know T or R
  – I.e. we don't know which states are good or what the actions do
  – Must actually try actions and states out to learn

Slide Credit: http://ai.berkeley.edu

Page 55:

Model-Based Learning

• Model-Based Idea:
  – Learn an approximate model based on experiences
  – Solve for values as if the learned model were correct

• Step 1: Learn empirical MDP model
  – Count outcomes s' for each (s, a)
  – Normalize to give an estimate of T̂(s, a, s')
  – Discover each R̂(s, a, s') when we experience (s, a, s')

• Step 2: Solve the learned MDP
  – For example, use value iteration, as before

Slide Credit: http://ai.berkeley.edu
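A rough sketch of Step 1 above (learning the empirical MDP from experienced transitions); the function and variable names here are illustrative, not from the slides.

```python
from collections import defaultdict

def estimate_mdp(transitions):
    """transitions: list of experienced (s, a, r, s') tuples.
    Returns empirical T_hat[(s, a)][s'] and R_hat[(s, a, s')]."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> s' -> count
    rewards = {}                                     # (s, a, s') -> observed reward
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r                      # discovered when experienced
    T_hat = {}
    for sa, outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        T_hat[sa] = {s2: c / total for s2, c in outcome_counts.items()}  # normalize
    return T_hat, rewards

# Step 2 would then run value iteration (as before) on (T_hat, rewards).
```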

Page 56:

The Story So Far: MDPs and RL

Known MDP: Offline Solution
  Goal: Compute V*, Q*, π*         Technique: Value / policy iteration
  Goal: Evaluate a fixed policy π  Technique: Policy evaluation

Unknown MDP: Model-Based
  Goal: Compute V*, Q*, π*         Technique: VI/PI on approx. MDP
  Goal: Evaluate a fixed policy π  Technique: PE on approx. MDP

Unknown MDP: Model-Free
  Goal: Compute V*, Q*, π*         Technique: Q-learning
  Goal: Evaluate a fixed policy π  Technique: Value Learning

Slide Credit: http://ai.berkeley.edu

Page 57:

Model-Free Learning

Slide Credit: http://ai.berkeley.edu

Page 58:

Example: Expected Age

Goal: Compute expected age of students

Known P(A): compute E[A] = Σ_a P(a) · a directly

Without P(A), instead collect samples [a1, a2, …, aN]:

Unknown P(A): "Model Based"
  – Estimate P̂(a) = num(a) / N, then compute E[A] ≈ Σ_a P̂(a) · a
  – Why does this work? Because eventually you learn the right model.

Unknown P(A): "Model Free"
  – Compute E[A] ≈ (1/N) Σ_i a_i
  – Why does this work? Because samples appear with the right frequencies.

Slide Credit: http://ai.berkeley.edu

Page 59:

Reinforcement Learning
• We still assume an MDP:
  – A set of states s ∈ S
  – A set of actions (per state) A
  – A model T(s, a, s')
  – A reward function R(s, a, s')

• Still looking for a policy π(s)

• New twist: don't know T or R, so must try out actions

• Big idea: Compute all averages over T using sample outcomes

Slide Credit: http://ai.berkeley.edu

Page 60:

Model-Free Learning

• Model-free (temporal difference) learning
  – Experience world through episodes: s, a, r, s', a', r', s'', …
  – Update estimates each transition (s, a, r, s')
  – Over time, updates will mimic Bellman updates

[Diagram: a trajectory s → (s, a) → s' → (s', a') → s'' with rewards r along the way]

Slide Credit: http://ai.berkeley.edu

Page 61:

Passive Reinforcement Learning

• Simplified task: policy evaluation
  – Input: a fixed policy π(s)
  – You don't know the transitions T(s, a, s')
  – You don't know the rewards R(s, a, s')
  – Goal: learn the state values

• In this case:
  – Learner is "along for the ride"
  – No choice about what actions to take
  – Just execute the policy and learn from experience
  – This is NOT offline planning! You actually take actions in the world.

Slide Credit: http://ai.berkeley.edu

Page 62:

Direct Evaluation

• Goal: Compute values for each state under π

• Idea: Average together observed sample values
  – Act according to π
  – Every time you visit a state, write down what the sum of discounted rewards turned out to be
  – Average those samples

• This is called direct evaluation

Slide Credit: http://ai.berkeley.edu

Page 63:

Example: Direct Evaluation

Input Policy π (assume γ = 1)

Grid states: A, B, C, D, E

Observed Episodes (Training):
  Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 3: E, north, C, -1;  C, east, D, -1;  D, exit, x, +10
  Episode 4: E, north, C, -1;  C, east, A, -1;  A, exit, x, -10

Output Values:  A = -10,  B = +8,  C = +4,  D = +10,  E = -2

Slide Credit: http://ai.berkeley.edu
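A small sketch of direct evaluation on the four episodes above (γ = 1); the episode data is transcribed from the slide, the other names are illustrative. It reproduces the output values A = -10, B = +8, C = +4, D = +10, E = -2.

```python
from collections import defaultdict

# Each episode: list of (state, action, next_state_or_exit, reward), from the slide.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

def direct_evaluation(episodes, gamma=1.0):
    returns = defaultdict(list)
    for ep in episodes:
        G = 0.0
        # Walk backwards so G is the discounted return from each visited state.
        for s, _, _, r in reversed(ep):
            G = r + gamma * G
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

print(direct_evaluation(episodes))
# -> A = -10.0, B = 8.0, C = 4.0, D = 10.0, E = -2.0
```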

Page 64:

Problems with Direct Evaluation

• What's good about direct evaluation?
  – It's easy to understand
  – It doesn't require any knowledge of T, R
  – It eventually computes the correct average values, using just sample transitions

• What's bad about it?
  – It wastes information about state connections
  – Each state must be learned separately
  – So, it takes a long time to learn

Output Values:  A = -10,  B = +8,  C = +4,  D = +10,  E = -2

If B and E both go to C under this policy, how can their values be different?

Slide Credit: http://ai.berkeley.edu

Page 65:

Why Not Use Policy Evaluation?

• Simplified Bellman updates calculate V for a fixed policy:
  – Each round, replace V with a one-step look-ahead layer over V
  – This approach fully exploited the connections between the states
  – Unfortunately, we need T and R to do it!

• Key question: how can we do this update to V without knowing T and R?
  – In other words, how do we take a weighted average without knowing the weights?

[Diagram: s → (s, π(s)) → s' via transition (s, π(s), s')]

Slide Credit: http://ai.berkeley.edu

Page 66:

Sample‐Based Policy Evaluation?

• We want to improve our estimate of V by computing these averages over successor states s'

• Idea: Take samples of outcomes s' (by doing the action!) and average

• What's the difficulty of this algorithm?
  – Almost! But we can't rewind time to get sample after sample from state s.

[Diagram: from s, taking π(s) leads to sampled successors s'1, s'2, s'3]

Slide Credit: http://ai.berkeley.edu

Page 67:

Temporal Difference Learning

• Big idea: learn from every experience!
  – Update V(s) each time we experience a transition (s, a, s', r)
  – Likely outcomes s' will contribute updates more often

• Temporal difference learning of values
  – Policy still fixed, still doing evaluation!
  – Move values toward value of whatever successor occurs: running average
  – Sample of V(s) and update to V(s): see the update equations below

[Diagram: s → (s, π(s)) → s']

Slide Credit: http://ai.berkeley.edu
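The sample and update referred to above, written out in the usual TD(0) form (V is the value estimate for the fixed policy π, α the learning rate):

```latex
\text{sample of } V(s):\quad \text{sample} = r + \gamma\,V^{\pi}(s') \\
\text{update to } V(s):\quad V^{\pi}(s) \leftarrow (1-\alpha)\,V^{\pi}(s) + \alpha\cdot\text{sample} \\
\text{same update:}\quad V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\,\bigl(\text{sample} - V^{\pi}(s)\bigr)
```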

Page 68:

Exponential Moving Average

• Exponential moving average
  – The running interpolation update: x̄_n = (1 - α) · x̄_(n-1) + α · x_n
  – Makes recent samples more important
  – Forgets about the past

• Decreasing learning rate (alpha) can give converging averages

Why do we want to forget about the past? (distant past values were wrong anyway)

Slide Credit: http://ai.berkeley.edu

Page 69:

Example: Temporal Difference Learning

Assume: γ = 1, α = 1/2

Grid states: A, B, C, D, E (initial values: all 0 except D = 8)

Observed transitions:
  B, east, C, -2:  sample = -2 + γ·V(C) = -2, so V(B) ← (1/2)·0 + (1/2)·(-2) = -1
  C, east, D, -2:  sample = -2 + γ·V(D) = 6,  so V(C) ← (1/2)·0 + (1/2)·6 = 3

Value snapshots: (B = 0, C = 0, D = 8) → (B = -1, C = 0, D = 8) → (B = -1, C = 3, D = 8)

Slide Credit: http://ai.berkeley.edu

Page 70:

Problems with TD Value Learning

• TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages

• However, if we want to turn values into a (new) policy, we're sunk:
  – Choosing the best action requires a one-step look-ahead over the actions, which needs T and R

• Idea: learn Q-values, not values
  – Makes action selection model-free too!

Slide Credit: http://ai.berkeley.edu

Page 71:

Active Reinforcement Learning

• Full reinforcement learning: optimal policies (like value iteration)
  – You don't know the transitions T(s, a, s')
  – You don't know the rewards R(s, a, s')
  – You choose the actions now
  – Goal: learn the optimal policy / values

• In this case:
  – Learner makes choices!
  – Fundamental tradeoff: exploration vs. exploitation
  – This is NOT offline planning! You actually take actions in the world and find out what happens…

Slide Credit: http://ai.berkeley.edu

Page 72:

Detour: Q-Value Iteration

• Value iteration: find successive (depth-limited) values
  – Start with V0(s) = 0, which we know is right
  – Given Vk, calculate the depth k+1 values for all states

• But Q-values are more useful, so compute them instead
  – Start with Q0(s, a) = 0, which we know is right
  – Given Qk, calculate the depth k+1 q-values for all q-states (both updates written out below)

Slide Credit: http://ai.berkeley.edu
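The two depth-k+1 updates referred to above, in standard notation:

```latex
V_{k+1}(s) \leftarrow \max_{a}\sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma\,V_{k}(s')\bigr]
\qquad
Q_{k+1}(s,a) \leftarrow \sum_{s'} T(s,a,s')\,\Bigl[R(s,a,s') + \gamma \max_{a'} Q_{k}(s',a')\Bigr]
```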

Page 73:

Q-Learning

• We'd like to do Q-value updates to each Q-state:
  – But we can't compute this update without knowing T, R

• Instead, compute the average as we go
  – Receive a sample transition (s, a, r, s')
  – This sample suggests a target of r + γ max_a' Q(s', a')
  – But we want to average over results from (s, a)
  – So keep a running average (see the sketch below)

Slide Credit: http://ai.berkeley.edu
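A minimal sketch of the running-average Q-learning update described above (tabular case; the function and variable names are illustrative):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from a sample transition (s, a, r, s')."""
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)   # target suggested by the sample
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample      # running average
    return Q

# Usage sketch: Q as a default-0 table, updated after each experienced transition.
Q = defaultdict(float)
actions = ["north", "south", "east", "west"]
Q = q_learning_update(Q, s="A", a="east", r=-1.0, s2="B", actions=actions)
```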

Page 74:

Q-Learning Properties

• Amazing result: Q-learning converges to the optimal policy, even if you're acting suboptimally!

• This is called off-policy learning

• Caveats:
  – You have to explore enough
  – You have to eventually make the learning rate small enough
  – … but not decrease it too quickly
  – Basically, in the limit, it doesn't matter how you select actions (!)

Slide Credit: http://ai.berkeley.edu

Page 75:

Generalizing Across States

• Basic Q-Learning keeps a table of all q-values

• In realistic situations, we cannot possibly learn about every single state!
  – Too many states to visit them all in training
  – Too many states to hold the q-tables in memory

• Instead, we want to generalize:
  – Learn about some small number of training states from experience
  – Generalize that experience to new, similar situations
  – This is the fundamental idea in machine learning!

[demo – RL pacman]

Slide Credit: http://ai.berkeley.edu

Page 76:

Example: Pacman

Let's say we discover through experience that this state is bad:

In naïve q-learning, we know nothing about this state:

Or even this one!

Slide Credit: http://ai.berkeley.edu

Page 77:

Feature-Based Representations

• Solution: describe a state using a vector of features (properties)
  – Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  – Example features:
    • Distance to closest ghost
    • Distance to closest dot
    • Number of ghosts
    • 1 / (distance to dot)²
    • Is Pacman in a tunnel? (0/1)
    • … etc.
    • Is it the exact state on this slide?
  – Can also describe a q-state (s, a) with features (e.g. action moves closer to food)

Slide Credit: http://ai.berkeley.edu

Page 78:

Linear Value Functions

• Using a feature representation, we can write a q function (or value function) for any state using a few weights:
  V(s) = w1·f1(s) + w2·f2(s) + … + wn·fn(s)
  Q(s, a) = w1·f1(s, a) + w2·f2(s, a) + … + wn·fn(s, a)

• Advantage: our experience is summed up in a few powerful numbers

• Disadvantage: states may share features but actually be very different in value!

Slide Credit: http://ai.berkeley.edu

Page 79:

Approximate Q-Learning

• Q-learning with linear Q-functions:
  – On a transition (s, a, r, s'), difference = [r + γ max_a' Q(s', a')] - Q(s, a)
  – Exact Q's:        Q(s, a) ← Q(s, a) + α · difference
  – Approximate Q's:  w_i ← w_i + α · difference · f_i(s, a)   (see the sketch below)

• Intuitive interpretation:
  – Adjust weights of active features
  – E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features

• Formal justification: online least squares

Slide Credit: http://ai.berkeley.edu
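A sketch of approximate Q-learning with a linear Q-function, following the weight update described above; the feature function and the names used here are illustrative, not from the slides.

```python
def linear_q(weights, features):
    """Q(s, a) = sum_i w_i * f_i(s, a), with features given as a dict."""
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

def approx_q_update(weights, feat_fn, s, a, r, s2, actions, alpha=0.05, gamma=0.9):
    """Adjust the weights of the active features toward the observed target."""
    target = r + gamma * max(linear_q(weights, feat_fn(s2, a2)) for a2 in actions)
    difference = target - linear_q(weights, feat_fn(s, a))
    for k, v in feat_fn(s, a).items():
        weights[k] = weights.get(k, 0.0) + alpha * difference * v
    return weights

# Example feature function (purely illustrative):
# def feat_fn(s, a): return {"bias": 1.0, "dist-to-closest-dot": ..., "num-ghosts": ...}
```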

Page 80:

Minimizing Error*

Approximate q update explained:

Imagine we had only one point x, with features f(x), target value y, and weights w. Minimizing the squared error (1/2)(y - Σ_k w_k f_k(x))² by gradient descent gives the update
  w_m ← w_m + α (y - Σ_k w_k f_k(x)) f_m(x)
i.e. "target" minus "prediction", times the feature value.

Slide Credit: http://ai.berkeley.edu

Page 81:

Approaches to RL

• Value-based RL
  – Estimate the optimal action-value function

• Policy-based RL
  – Search directly for the optimal policy

• Model
  – Build a model of the world
    • State transition, reward probabilities
  – Plan (e.g. by look-ahead) using the model

(C) Dhruv Batra 81

Page 82:

Deep RL

• Value-based RL
  – Use neural nets to represent the Q function

• Policy-based RL
  – Use neural nets to represent the policy

• Model
  – Use neural nets to represent and learn the model

(C) Dhruv Batra 82

Page 83:

Q-Networks

Slide Credit: David Silver

Page 84:

Case Study: Playing Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step

Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission. 

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 85:

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 86:

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

Input: state st

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 87:

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values)

Familiar conv layers, FC layer

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 88:

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values) Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st,a4)

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 89:

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values) Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st,a4)

Number of actions between 4-18 depending on Atari game

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 90:

Q(s, a; θ): neural network with weights θ

Q-network Architecture

Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)

16 8x8 conv, stride 4

32 4x4 conv, stride 2

FC-256

FC-4 (Q-values) Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st,a4)

Number of actions between 4-18 depending on Atari game

A single feedforward pass to compute Q-values for all actions from the current state => efficient!

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
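A PyTorch sketch of the architecture listed above (84x84x4 input, 16 8x8 conv stride 4, 32 4x4 conv stride 2, FC-256, FC with one output per action). The class name and defaults are illustrative, not the authors' exact code.

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Q(s, .; theta): maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, num_actions=4):                  # 4-18 actions depending on the Atari game
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # -> 16 x 20 x 20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> 32 x 9 x 9
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC: Q(s, a1), ..., Q(s, aN)
        )

    def forward(self, x):                                # x: (batch, 4, 84, 84)
        return self.head(self.features(x))               # (batch, num_actions)

q_net = AtariQNetwork(num_actions=4)
q_values = q_net(torch.zeros(1, 4, 84, 84))              # all Q-values in a single forward pass
```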

Page 91:

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 92:

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Forward Pass
Loss function:

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 93:

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Loss function:

where

Forward Pass

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 94:

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Loss function:

where

Forward Pass

Backward Pass
Gradient update (with respect to Q-function parameters θ):

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 95:

Remember: want to find a Q-function that satisfies the Bellman Equation:

Deep Q-learning

Loss function:

where

Iteratively try to make the Q-value close to the target value (yi) it should have, if the Q-function corresponds to the optimal Q* (and optimal policy π*)

Forward Pass

Backward Pass
Gradient update (with respect to Q-function parameters θ):

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
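The loss and gradient referenced on the last few slides, in the standard form used by Mnih et al. (θ_i are the current Q-network weights, θ_{i-1} the previous iteration's weights):

```latex
y_{i} = \mathbb{E}\Bigl[r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \,\Bigm|\, s,a\Bigr]
\qquad
L_{i}(\theta_{i}) = \mathbb{E}_{s,a}\Bigl[\bigl(y_{i} - Q(s,a;\theta_{i})\bigr)^{2}\Bigr]
```

Differentiating the loss gives the backward-pass gradient (up to a constant factor):

```latex
\nabla_{\theta_{i}} L_{i}(\theta_{i}) \;\propto\; -\,\mathbb{E}\Bigl[\bigl(y_{i} - Q(s,a;\theta_{i})\bigr)\,\nabla_{\theta_{i}} Q(s,a;\theta_{i})\Bigr]
```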

Page 96:

Training the Q-network: Experience Replay

96

Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Page 97:

Training the Q-network: Experience Replay

97

Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay
- Continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played
- Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples

[Mnih et al. NIPS Workshop 2013; Nature 2015]
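A minimal replay-memory sketch matching the description above (a fixed-size table of (st, at, rt, st+1) transitions, sampled in random minibatches); the class and parameter names are illustrative.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity table of transitions (s_t, a_t, r_t, s_t1)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random minibatch breaks the correlation between consecutive samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage sketch inside the training loop:
# memory.push(s_t, a_t, r_t, s_t1, done)
# if len(memory) >= 32:
#     batch = memory.sample(32)   # train the Q-network on this minibatch
```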

Page 98:

Experience Replay

(C) Dhruv Batra 98
Slide Credit: David Silver

Page 99:

Video by Károly Zsolnai-Fehér. Reproduced with permission.
https://www.youtube.com/watch?v=V1eYniJ0Rnk

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n