Upload
dangnhan
View
249
Download
0
Embed Size (px)
Citation preview
1
ECE 517: Reinforcement Learningin Artificial Intelligence
Lecture 14: Planning and Learning
Dr. Itamar Arel
College of EngineeringDepartment of Electrical Engineering and Computer Science
The University of TennesseeFall 2015
October 27, 2015
ECE 517: Reinforcement Learning in AI
Final projects - logistics
Projects can be done in groups of up to 3 students
Details on projects will be posted soon Students are encouraged to propose a topic
Please email me your top three choices for a project along with a preferred date for your presentation
Presentation dates:
Nov. 17, 19, 24 and Dec. 1 + additional time slot (TBD)
Format: 20 min presentation + 5 min Q&A
~5 min for background and motivation
~15 for description of your work, results, conclusions
Written report due: Monday, Dec. 7
Format similar to project report
2
ECE 517: Reinforcement Learning in AI
Final projects – sample topics
DQN – Playing Atari games using RL
Teris player using RL (and NN)
Curiosity based TD learning*
Reinforcement Learning of Local Shape in the Game of Go
AIBO learning to walk
Study of value function definitions for TD learning
Imitation learning in RL
3
ECE 517: Reinforcement Learning in AI 4
Outline
Introduction
Use of environment models
Integration of planning and learning methods
ECE 517: Reinforcement Learning in AI 5
Introduction
Earlier we discussed Monte Carlo and temporal-difference methods as distinct alternativesThen showed how they can be seamlessly integrated by using eligibility traces such as in TD(l)Planning methods: e.g. Dynamic Programming and heuristic search Rely on knowledge of a model Model – any information that helps the agent predict the way
the environment will behave
Learning methods: Monte Carlo and Temporal Difference Learning Do not require a model
Our goal: Explore the extent to which the two methods can be intermixed
ECE 517: Reinforcement Learning in AI 6
The original idea
ECE 517: Reinforcement Learning in AI 7
The original idea (cont.)
ECE 517: Reinforcement Learning in AI 8
Models
Model: anything the agent can use to predict how the environment will respond to its actions Distribution models: provide description of all possibilities
(of next states and rewards) and their probabilitiese.g. Dynamic Programming
Example - sum of a dozen dice – produce all possible sums and their probabilities of occurring
Sample models: produce just one sample experienceIn our example - produce individual sums drawn according to this probability distribution
Both types of models can be used to (mimic) produce simulated experience
Often sample models are much easier to come by
ECE 517: Reinforcement Learning in AI 9
Planning
Planning: any computational process that uses a model to create or improve a policy
Planning in AI: State-space planning (such as in RL) – search for policy
Plan-space planning (e.g., partial-order planner)e.g. evolutionary methods
We take the following (unusual) view: All state-space planning methods involve computing value
functions, either explicitly or implicitly
They all apply backups to simulated experience
ECE 517: Reinforcement Learning in AI 10
Planning (cont.)
Classical DP methods are state-space planning methods
Heuristic search methods are state-space planning methods
Planning methods rely on “real” experience as input, but in many cases they can be applied to simulated experience just as well
Example: a planning method based on Q-learning:
Random-Sample One-Step Tabular Q-Planning
ECE 517: Reinforcement Learning in AI 11
Learning, Planning, and Acting
Two uses of real experience: Model learning: to
improve the model Direct RL: to directly
improve the value function and policy
Improving value function and/or policy via a model is sometimes called indirect RL or model-based RL.Here, we call it planning.
Q: What are the advantages/disadvantages of each?
ECE 517: Reinforcement Learning in AI 12
Direct vs. Indirect RL
Indirect methods: make fuller use of
experience: get better policy with fewer environment interactions
Direct methods simpler
not affected by bad models
But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel
Q: Which scheme do you think applies to humans?
ECE 517: Reinforcement Learning in AI 13
The Dyna-Q Architecture (Sutton 1990)
ECE 517: Reinforcement Learning in AI 14
The Dyna-Q Algorithm
model learning
(update)
planning
direct RL
Random-sample single-step tabular Q-planning method
ECE 517: Reinforcement Learning in AI 15
Dyna-Q on a Simple Maze
rewards = 0 until
goal reached, when
reward = 1
ECE 517: Reinforcement Learning in AI 16
Dyna-Q Snapshots: Midway in 2nd Episode
Recall that in a planning context … Exploration – trying actions that improve the model
Exploitation – Behaving in the optimal way given the current model
Balance between the two is always a key challenge!
ECE 517: Reinforcement Learning in AI 17
Variations on the Dyna-Q agent
(Regular) Dyna-Q Soft exploration/exploitation with constant
rewards
Dyna-Q+ Encourages exploration of state-action pairs that
have not been visited in a long time (in real interaction with the environment)
If n is the number of steps elapsed between two
consecutive visits to (s,a), then the reward is
larger as a function of n
Dyna-AC Actor-Critic learning rather that Q-learning
ECE 517: Reinforcement Learning in AI 18
More on Dyna-Q+ ?
Uses an “exploration bonus”: Keeps track of time since each state-action pair
was tried for real
An extra reward is added for transitions caused by state-action pairs related to how long ago they were tried: the longer unvisited, the more reward for visiting
The agent (indirectly) “plans” how to visit long unvisited states
ECE 517: Reinforcement Learning in AI 19
When the Model is Wrong: Blocking Maze (cont.)
The maze example was oversimplified
In reality many things could go wrong Environment could be stochastic
Model can be imperfect (local minimum, stochasticity or no convergence)
Partial experience could be misleading
When the model is incorrect, the planning process will compute a suboptimal policy
This is actually a learning opportunity Discovery and correction of the modeling error
ECE 517: Reinforcement Learning in AI 20
When the Model is Wrong: Blocking Maze (cont.)
The changed environment is harder
ECE 517: Reinforcement Learning in AI 21
Shortcut Maze
The changed environment is
easier
ECE 517: Reinforcement Learning in AI 22
Prioritized Sweeping
In the Dyna agents presented, simulated transitions are started in uniformly chosen state-action pairs Probably not optimal
Which states or state-action pairs should be generated during planning?Work backwards from states whose values have just changed: Maintain a queue of state-action pairs whose values would
change a lot if backed up, prioritized by the size of the change
When a new backup occurs, insert predecessors according to their priorities
Always perform backups from first in queue
Moore and Atkeson 1993; Peng and Williams, 1993
ECE 517: Reinforcement Learning in AI 23
Prioritized Sweeping
ECE 517: Reinforcement Learning in AI 24
Prioritized Sweeping vs. Dyna-Q
Both use N = 5 backups per
environmental interaction
ECE 517: Reinforcement Learning in AI 25
Trajectory Sampling
Trajectory sampling: perform backups along simulated trajectories
This samples from the on-policy distribution Distribution constructed from experience (visits)
Advantages when function approximation is used
Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored:
Initial states
Reachable underoptimal control
Irrelevant states
ECE 517: Reinforcement Learning in AI 26
Summary
Discussed close relationship between planning and learningImportant distinction between distribution modelsand sample modelsLooked at some ways to integrate planning and learning synergy among planning, acting, model learning
Distribution of backups: focus of the computation prioritized sweeping trajectory sampling: backup along trajectories