Lirong Xia
Reinforcement Learning (2)
Tue, March 21, 2014
• Project 2 due tonight
• Project 3 is online (more later)
– due in two weeks
2
Reminder
Recap: MDPs
3
• Markov decision processes:– States S– Start state s0
– Actions A– Transition p(s’|s,a) (or T(s,a,s’))– Reward R(s,a,s’) (and discount ϒ)
• MDP quantities:– Policy = Choice of action for each (MAX) state– Utility (or return) = sum of discounted rewards
Optimal Utilities
4
• The value of a state s:– V*(s) = expected utility starting in s and
acting optimally• The value of a Q-state (s,a):
– Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally
• The optimal policy:– π*(s) = optimal action from state s
Solving MDPs
5
• Value iteration– Start with V1(s) = 0– Given Vi, calculate the values for all states for depth i+1:
– Repeat until converge– Use Vi as evaluation function when computing Vi+1
• Policy iteration– Step 1: policy evaluation: calculate utilities for some fixed
policy– Step 2: policy improvement: update policy using one-step
look-ahead with resulting utilities as future values– Repeat until policy converges
1'
max , , ' , , ' 'i ia
s
V s T s a s R s a s V s
• Don’t know T and/or R, but can observe R
– Learn by doing
– can have multiple episodes (trials)
6
Reinforcement learning
The Story So Far: MDPs and RL
7
Things we know how to do:• If we know the MDP
– Compute V*, Q*, π* exactly– Evaluate a fixed policy π
• If we don’t know the MDP– If we can estimate the MDP then
solve– We can estimate V for a fixed
policy π– We can estimate Q*(s,a) for the
optimal policy while executing an exploration policy
Techniques:• Computation
• Value and policy iteration
• Policy evaluation
• Model-based RL• sampling
• Model-free RL:• Q-learning
Model-Free Learning
8
• Model-free (temporal difference) learning– Experience world through episodes
(s,a,r,s’,a’,r’,s’’,a’’,r’’,s’’’…)– Update estimates each transition (s,a,r,s’)– Over time, updates will mimic Bellman updates
1'
'
Q-Value Iteration (model-based, requires known MDP)
, , , ' , , ' max ', 'i ia
s
Q s a T s a s R s a s Q s a
'
Q-Learning (model-free, requires only experienced transitions)
( , ) (1 ) ( , ) max ', 'a
Q s a Q s a r Q s a
Detour: Q-Value Iteration
9
• We’d like to do Q-value updates to each Q-state:
– But can’t compute this update without knowing T,R• Instead, compute average as we go
– Receive a sample transition (s,a,r,s’)– This sample suggests
– But we want to merge the new observation to the old ones
– So keep a running average
1'
'
, , , ' , , ' max ', 'i ia
s
Q s a T s a s R s a s Q s a
'
( , ) (1 ) ( , ) max ', 'a
Q s a Q s a r Q s a
'
( , ) max ', 'a
Q s a r Q s a
Q-Learning Properties
10
• Will converge to optimal policy– If you explore enough (i.e. visit each q-state many times)– If you make the learning rate small enough– Basically doesn’t matter how you select actions (!)
• Off-policy learning: learns optimal q-values, not the values of the policy you are following
Q-Learning
11
• Q-learning produces tables of q-values:
Exploration / Exploitation
12
• Random actions (ε greedy)–Every time step, flip a coin–With probability ε, act randomly–With probability 1-ε, act according to current
policy
Today: Q-Learning with state abstraction
13
• In realistic situations, we cannot possibly learn about every single state!– Too many states to visit them all in training– Too many states to hold the Q-tables in memory
• Instead, we want to generalize:– Learn about some small number of training states from
experience– Generalize that experience to new, similar states– This is a fundamental idea in machine learning, and we’ll
see it over and over again
Example: Pacman
14
• Let’s say we discover through experience that this state is bad:
• In naive Q-learning, we know nothing about this state or its Q-states:
• Or even this one!
Feature-Based Representations
15
• Solution: describe a state using a vector of features (properties)– Features are functions from states
to real numbers (often 0/1) that capture important properties of the state
– Example features:• Distance to closest ghost• Distance to closest dot• Number of ghosts• 1/ (dist to dot)2
• Is Pacman in a tunnel? (0/1)• …etc• Is it the exact state on this slide?
– Can also describe a Q-state (s,a) with features (e.g. action moves closer to food)
Similar to a evaluation function
Linear Feature Functions
16
• Using a feature representation, we can write a Q function (or value function) for any state using a few weights:
• Advantage: our experience is summed up in a few powerful numbers
• Disadvantage: states may share features but actually be very different in value!
Function Approximation
17
• Q-learning with linear Q-functions:transition = (s,a,r,s’)
• Intuitive interpretation:– Adjust weights of active features– E.g. if something unexpectedly bad happens, disprefer all
states with that state’s features• Formal justification: online least squares
'
difference max ', ' ( , )a
r Q s a Q s a Exact Q’s
Approximate Q’s
Example: Q-Pacman
18
Q(s,a)= 4.0fDOT(s,a)-1.0fGST(s,a)
fDOT(s,NORTH)=0.5
fGST(s,NORTH)=1.0
Q(s,a)=+1
R(s,a,s’)=-500
difference=-501
wDOT←4.0+α[-501]0.5
wGST ←-1.0+α[-501]1.0
Q(s,a)= 3.0fDOT(s,a)-3.0fGST(s,a)
α= Northr = -500
s
s’
Linear Regression
19
Ordinary Least Squares (OLS)
20
Minimizing Error
21
21
error2
error
k kk
k k mkm
m m k k mk
w y w f x
wy w f x f x
w
w w y w f x f x
Imagine we had only one point x with features f(x):
Approximate q update explained:
max ( ', ') ( , )m ma
mw r Q s Q f xaaw s
“target” “prediction”
• As many as possible?
– computational burden
– overfitting
• Feature selection is important
– requires domain expertise
22
How many features should we use?
Overfitting
23
The elements of statistical learning. Fig 2.11. 24
• MDPs
– Q1: value iteration
– Q2: find parameters that lead to certain optimal policy
– Q3: similar to Q2
• Q-learning
– Q4: implement the Q-learning algorithm
– Q5: implement ε greedy action selection
– Q6: try the algorithm
• Approximate Q-learning and state abstraction
– Q7: Pacman
– Q8 (bonus): design and implement a state-abstraction Q-learning
algorithm
• Hints
– make your implementation general
– try to use class methods as much as possible
25
Overview of Project 3