
RL + DL

Rob Platt Northeastern University

Some slides from David Silver, Deep RL Tutorial

Deep RL is an exciting recent development

Remember the RL scenario...

[Diagram: the agent–world loop. The agent sends action a to the world; the world returns state s and reward r.]

The agent takes actions; it perceives states and rewards.

Tabular RL

[Diagram: the world emits a state; the agent looks up that state in a tabular Q-function, takes the action argmax_a Q(s,a), and applies the update rule to the table after every step.]

Q-function: a table with one entry per (state, action) pair.

Update rule:

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]
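A minimal sketch of the tabular case, assuming states and actions are hashable Python values (the names alpha, gamma, and q_update are just illustrative):

```python
# Tabular Q-learning: a dictionary maps (state, action) pairs to values.
from collections import defaultdict

alpha, gamma = 0.1, 0.99            # learning rate and discount (illustrative values)
Q = defaultdict(float)              # Q[(s, a)] defaults to 0.0

def q_update(s, a, r, s_next, actions):
    """Move Q(s, a) a step of size alpha toward the one-step target."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```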

Deep RL

[Diagram: the same loop, but the Q-function is now a neural network. The world emits a state, the network outputs a Q-value for each action, and the agent takes the argmax.]

But, why would we want to do this?

Where does “state” come from?

[Diagram: the agent–world loop again. The agent takes actions a; it perceives states s and rewards r.]

Earlier, we dodged this question: “it’s part of the MDP problem statement”

But that’s a cop-out. How do we get state?

Typically we can’t use “raw” sensor data as the state with a tabular Q-function – it’s too big (e.g. Pacman has something like 2^(num pellets) + … states)

Linear function approximation?

In the RL lecture, we suggested using linear function approximation:

Q(s,a) = w1·f1(s,a) + w2·f2(s,a) + … + wn·fn(s,a)

The Q-function is encoded by the weights.

But this STILL requires a human designer to specify the features!
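As a sketch, a linear Q-function looks like this (the feature function below is a hypothetical stand-in for whatever features the designer chooses):

```python
# Linear Q-function: Q(s, a) = w . f(s, a); only the weights are learned.
import numpy as np

def features(s, a):
    # Hypothetical hand-coded features of the (state, action) pair --
    # a human designer still has to pick these.
    return np.array([1.0, float(s), float(a), float(s) * float(a)])

w = np.zeros(4)                      # the Q-function is encoded by these weights

def Q(s, a):
    return float(np.dot(w, features(s, a)))
```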

Is it possible to do RL WITHOUT hand-coding either states or features?

Deep Q Learning

Instead of a state, we have an image – in practice, it could be a history of the k most recent images stacked as a single k-channel image (see the sketch below).

Hopefully this new image representation is Markov – in some domains, it might not be!
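A minimal sketch of that frame stacking, assuming 2-D grayscale frames (the stack size k and the padding behavior are just one reasonable choice):

```python
# Stack the k most recent frames into a single k-channel observation.
from collections import deque
import numpy as np

k = 4
frames = deque(maxlen=k)             # only the k newest frames are kept

def observe(frame):
    frames.append(frame)
    while len(frames) < k:           # pad by repeating the first frame of an episode
        frames.append(frame)
    return np.stack(frames, axis=0)  # shape (k, H, W): the "state" fed to the network
```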

Deep Q Learning

Network structure:

Stack of images → Conv 1 → Conv 2 → FC 1 → Output

The Q-function is the network itself; the number of output nodes equals the number of actions.
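A sketch of that structure in PyTorch; the layer sizes are assumptions (loosely following the original Atari DQN), not something the slide specifies:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Stack of images -> Conv 1 -> Conv 2 -> FC 1 -> one output per action."""
    def __init__(self, k_frames: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(k_frames, 16, kernel_size=8, stride=4),  # Conv 1
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),        # Conv 2
            nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256),                                 # FC 1 (infers input size)
            nn.ReLU(),
            nn.Linear(256, num_actions),                        # Output: Q(s, a) per action
        )

    def forward(self, x):            # x: (batch, k_frames, H, W)
        return self.net(x)

# Example: 4 stacked 84x84 frames, 6 actions; pick the greedy action.
q = QNetwork(k_frames=4, num_actions=6)
action = q(torch.zeros(1, 4, 84, 84)).argmax(dim=1)
```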

Updating the Q-function

Here’s the standard Q-learning update equation:

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]

Rewriting:

Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a' Q(s',a') ]

Let’s call r + γ max_a' Q(s',a') the “target”. This equation adjusts Q(s,a) in the direction of the target.

We’re going to accomplish this same thing in a different way using neural networks...

Updating the Q-function

Remember how we train a neural network...

Given a dataset: D = { (x1, y1), …, (xn, yn) }

Define a loss function: L(w, b) = Σ_i ( y_i − f(x_i; w, b) )²

Adjust w, b so as to minimize the loss by doing gradient descent on the dataset:

1. repeat
2.   w ← w − α ∂L/∂w
3.   b ← b − α ∂L/∂b
4. until converged
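For concreteness, a tiny version of that loop for a 1-D linear model with squared-error loss (the data and learning rate are made up):

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0])   # dataset inputs
ys = np.array([1.0, 3.0, 5.0, 7.0])   # dataset targets
w, b, lr = 0.0, 0.0, 0.05

for _ in range(1000):                  # "repeat ... until converged"
    err = (w * xs + b) - ys            # f(x; w, b) - y
    w -= lr * np.mean(2 * err * xs)    # gradient of the squared loss w.r.t. w
    b -= lr * np.mean(2 * err)         # gradient of the squared loss w.r.t. b
```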

Updating the Q-network

Use this loss function:

L(w) = ( r + γ max_a' Q(s',a'; w) − Q(s,a; w) )²

Notice that Q is now parameterized by the weights, w (I’m including the bias in the weights).

The first term, r + γ max_a' Q(s',a'; w), is the target.

Notice that the target is a function of the weights – this is a problem b/c targets are non-stationary.

So… when we do gradient descent, we’re going to hold the targets constant.

Now we can do gradient descent!
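A minimal sketch of one way to hold the target constant in PyTorch: compute it under torch.no_grad() so no gradient flows through it (the batch layout and helper name are assumptions):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch                              # tensors for one minibatch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; w)
    with torch.no_grad():                                      # target held constant
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)                            # (target - Q(s, a; w))^2
```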

Deep Q Learning, no replay buffers

Initialize Q(s,a;w) with random weights
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Take one gradient step on the loss L(w) = ( target − Q(s,a;w) )²
        s ← s'
    Until s is terminal

Where: target = r + γ max_a' Q(s',a'; w), held constant during the gradient step.

This gradient step on the loss (instead of a table update) is all that changed relative to standard Q-learning.

Example: 4x4 frozen lake env

Get to the goal (G). Don’t fall in a hole (H).

Demo!
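A rough sketch of this no-replay-buffer algorithm on FrozenLake, assuming the gymnasium package, a one-hot state encoding, and a small fully connected Q-network (all hyperparameters are illustrative):

```python
import random
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("FrozenLake-v1", is_slippery=False)
n_s, n_a = env.observation_space.n, env.action_space.n
q_net = nn.Sequential(nn.Linear(n_s, 64), nn.ReLU(), nn.Linear(64, n_a))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, eps = 0.99, 0.1

def one_hot(s):
    x = torch.zeros(n_s)
    x[s] = 1.0
    return x

for episode in range(2000):
    s, _ = env.reset()
    done = False
    while not done:
        # Choose a from s using an epsilon-greedy policy derived from Q
        if random.random() < eps:
            a = env.action_space.sample()
        else:
            a = int(q_net(one_hot(s)).argmax())
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # One gradient step toward the target, which is held constant (no_grad)
        with torch.no_grad():
            target = r + gamma * (0.0 if terminated else q_net(one_hot(s_next)).max())
        loss = (q_net(one_hot(s))[a] - target) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
        s = s_next
```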

Experience replay

Deep learning typically assumes independent, identically distributed (IID) training data.

But is this true in the deep RL scenario? No – consecutive transitions within an episode are highly correlated, and the data distribution shifts as the policy changes.

Our solution: buffer experiences and then “replay” them during training.

Deep Q Learning WITH replay buffers

Initialize Q(s,a;w) with random weights
Initialize replay buffer D
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Add the experience (s, a, r, s') to D
        If mod(step, trainfreq) == 0:          (train every trainfreq steps)
            Sample batch B from D
            Do one step of gradient descent on the loss WRT the batch B
        s ← s'
    Until s is terminal

– buffers like this are pretty common in DL...
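A minimal sketch of such a buffer (the capacity, batch size, and tuple layout are just one reasonable choice):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)    # oldest experiences are discarded first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive
        # transitions, so training batches look much closer to IID.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```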

Example: 4x4 frozen lake env

Get to the goal (G). Don’t fall in a hole (H).

Demo!

Atari Learning Environment

Additional Deep RL “Tricks”

Combined algorithm: roughly 3x the mean Atari score of DQN with replay buffers.
