ECE 517: Reinforcement Learning in Artificial Intelligenceweb.eecs.utk.edu/~itamar/courses/ECE-517/notes/lecture14.pdf · 1 ECE 517: Reinforcement Learning in Artificial Intelligence

1

ECE 517: Reinforcement Learningin Artificial Intelligence

Lecture 14: Planning and Learning

Dr. Itamar Arel

College of EngineeringDepartment of Electrical Engineering and Computer Science

The University of TennesseeFall 2015

October 27, 2015

ECE 517: Reinforcement Learning in AI

Final projects - logistics

Projects can be done in groups of up to 3 students

Details on projects will be posted soon Students are encouraged to propose a topic

Please email me your top three choices for a project along with a preferred date for your presentation

Presentation dates:

Nov. 17, 19, 24 and Dec. 1 + additional time slot (TBD)

Format: 20 min presentation + 5 min Q&A

~5 min for background and motivation

~15 for description of your work, results, conclusions

Written report due: Monday, Dec. 7

Format similar to project report

2

ECE 517: Reinforcement Learning in AI

Final projects – sample topics

DQN – Playing Atari games using RL

Teris player using RL (and NN)

Curiosity based TD learning*

Reinforcement Learning of Local Shape in the Game of Go

AIBO learning to walk

Study of value function definitions for TD learning

Imitation learning in RL

3

ECE 517: Reinforcement Learning in AI 4

Outline

Introduction

Use of environment models

Integration of planning and learning methods


Introduction

Earlier we discussed Monte Carlo and temporal-difference methods as distinct alternativesThen showed how they can be seamlessly integrated by using eligibility traces such as in TD(l)Planning methods: e.g. Dynamic Programming and heuristic search Rely on knowledge of a model Model – any information that helps the agent predict the way

the environment will behave

Learning methods: Monte Carlo and Temporal Difference Learning Do not require a model

Our goal: Explore the extent to which the two methods can be intermixed


The original idea


The original idea (cont.)


Models

Model: anything the agent can use to predict how the environment will respond to its actions Distribution models: provide description of all possibilities

(of next states and rewards) and their probabilitiese.g. Dynamic Programming

Example - sum of a dozen dice – produce all possible sums and their probabilities of occurring

Sample models: produce just one sample experienceIn our example - produce individual sums drawn according to this probability distribution

Both types of models can be used to (mimic) produce simulated experience

Often sample models are much easier to come by


Planning

Planning: any computational process that uses a model to create or improve a policy

Planning in AI: State-space planning (such as in RL) – search for policy

Plan-space planning (e.g., partial-order planner)e.g. evolutionary methods

We take the following (unusual) view: All state-space planning methods involve computing value

functions, either explicitly or implicitly

They all apply backups to simulated experience


Planning (cont.)

Classical DP methods are state-space planning methods

Heuristic search methods are state-space planning methods

Planning methods rely on “real” experience as input, but in many cases they can be applied to simulated experience just as well

Example: a planning method based on Q-learning:

Random-Sample One-Step Tabular Q-Planning


Learning, Planning, and Acting

Two uses of real experience: Model learning: to

improve the model Direct RL: to directly

improve the value function and policy

Improving value function and/or policy via a model is sometimes called indirect RL or model-based RL.Here, we call it planning.

Q: What are the advantages/disadvantages of each?


Direct vs. Indirect RL

Indirect methods: make fuller use of

experience: get better policy with fewer environment interactions

Direct methods simpler

not affected by bad models

But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel

Q: Which scheme do you think applies to humans?


The Dyna-Q Architecture (Sutton 1990)


The Dyna-Q Algorithm

model learning

(update)

planning

direct RL

Random-sample single-step tabular Q-planning method


Dyna-Q on a Simple Maze

rewards = 0 until

goal reached, when

reward = 1


Dyna-Q Snapshots: Midway in 2nd Episode

Recall that in a planning context … Exploration – trying actions that improve the model

Exploitation – Behaving in the optimal way given the current model

Balance between the two is always a key challenge!


Variations on the Dyna-Q agent

(Regular) Dyna-Q Soft exploration/exploitation with constant

rewards

Dyna-Q+ Encourages exploration of state-action pairs that

have not been visited in a long time (in real interaction with the environment)

If n is the number of steps elapsed between two

consecutive visits to (s,a), then the reward is

larger as a function of n

Dyna-AC Actor-Critic learning rather that Q-learning


More on Dyna-Q+ ?

Uses an “exploration bonus”: Keeps track of time since each state-action pair

was tried for real

An extra reward is added for transitions caused by state-action pairs related to how long ago they were tried: the longer unvisited, the more reward for visiting

The agent (indirectly) “plans” how to visit long unvisited states


When the Model is Wrong: Blocking Maze (cont.)

The maze example was oversimplified

In reality many things could go wrong Environment could be stochastic

Model can be imperfect (local minimum, stochasticity or no convergence)

Partial experience could be misleading

When the model is incorrect, the planning process will compute a suboptimal policy

This is actually a learning opportunity Discovery and correction of the modeling error


When the Model is Wrong: Blocking Maze (cont.)

The changed environment is harder


Shortcut Maze

The changed environment is

easier


Prioritized Sweeping

In the Dyna agents presented, simulated transitions are started in uniformly chosen state-action pairs Probably not optimal

Which states or state-action pairs should be generated during planning?Work backwards from states whose values have just changed: Maintain a queue of state-action pairs whose values would

change a lot if backed up, prioritized by the size of the change

When a new backup occurs, insert predecessors according to their priorities

Always perform backups from first in queue

Moore and Atkeson 1993; Peng and Williams, 1993


Prioritized Sweeping


Prioritized Sweeping vs. Dyna-Q

Both use N = 5 backups per

environmental interaction


Trajectory Sampling

Trajectory sampling: perform backups along simulated trajectories

This samples from the on-policy distribution Distribution constructed from experience (visits)

Advantages when function approximation is used

Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored:

Initial states

Reachable underoptimal control

Irrelevant states


Summary

Discussed close relationship between planning and learningImportant distinction between distribution modelsand sample modelsLooked at some ways to integrate planning and learning synergy among planning, acting, model learning

Distribution of backups: focus of the computation prioritized sweeping trajectory sampling: backup along trajectories

Documents

ECE 517: Reinforcement Learning in Artificial Intelligenceweb.eecs.utk.edu/~itamar/courses/ECE-517/notes/lecture14.pdf · 1 ECE 517: Reinforcement Learning in Artificial Intelligence