Randomized Prior Functions for Deep Reinforcement Learning
Ian Osband, John Aslanides, Albin Cassirer
bit.ly/rpf_nips @ianosband
Reinforcement Learning

• "Sequential decision making under uncertainty."

    Data & Estimation      = Supervised Learning
    + partial feedback     = Multi-armed Bandit
    + delayed consequences = Reinforcement Learning

• Three necessary building blocks:
  1. Generalization
  2. Exploration vs. exploitation
  3. Credit assignment

• As a field, we are pretty good at combining any two of these three... but we need practical solutions that combine them all.

We need effective uncertainty estimates for Deep RL.
Estimating uncertainty in deep RL

• Dropout sampling: "dropout sample ≈ posterior sample" (Gal & Ghahramani, 2015).
  - The dropout rate does not concentrate with the data.
  - Even "concrete" dropout is not necessarily the right rate.

• Variational inference: apply VI to the Bellman error as if it were an i.i.d. supervised loss.
  - Bellman error: Q(s, a) = r + γ max_α Q(s′, α).
  - Uncertainty in Q → correlated TD loss; VI on an i.i.d. model does not propagate uncertainty.

• Distributional RL: models the Q-value as a distribution rather than a point estimate.
  - This distribution ≠ posterior uncertainty: it is aleatoric, not epistemic, so it is not the right thing for exploration.

• Count-based density: estimate "visit counts" for each state and add an exploration bonus.
  - The "density model" has nothing to do with the actual task.
  - With generalization, state "visit count" ≠ uncertainty.

• Bootstrap ensemble: train an ensemble on noisy data, a classic statistical procedure (sketched below)!
  - No explicit "prior" mechanism for "intrinsic motivation": if you've never seen a reward, why would the agent explore?
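To make that last row concrete, here is a toy sketch of the classic bootstrap-ensemble procedure on a 1D regression problem. The polynomial model, sizes, and seed are illustrative choices, not from the paper:

```python
# Toy bootstrap ensemble: fit each member to a resample of the data and read
# uncertainty off the ensemble spread. Model and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=20)
y = np.sin(3 * x) + 0.1 * rng.normal(size=20)

coefs = []
for _ in range(10):                           # 10 ensemble members
    idx = rng.integers(0, len(x), len(x))     # resample data with replacement
    coefs.append(np.polyfit(x[idx], y[idx], deg=3))

x_test = np.array([-2.0, 0.0, 2.0])
preds = np.array([np.polyval(c, x_test) for c in coefs])
# The spread comes only from resampling noise: there is no prior term pushing
# members apart in unvisited regions, hence no "intrinsic motivation".
print(preds.std(axis=0))
```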
Randomized prior functions

• Key idea: add a random, untrainable "prior" function to each member of the ensemble.

• Visualize the effect in 1D regression:
  - Training data (x, y): black points.
  - Prior function p(x): blue line.
  - Trainable function f(x): dotted line.
  - Output prediction Q(x): red line.

• This gives the exact Bayesian posterior for linear functions!
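A minimal NumPy sketch of the construction follows. To keep it self-contained, the trainable part f_k is linear in fixed random ReLU features and fit by exact least squares; the paper's agents instead train small networks by SGD (and can also bootstrap the data). The prior scale beta and all sizes here are illustrative:

```python
# Randomized prior functions, toy 1D regression: each ensemble member predicts
# Q_k(x) = f_k(x) + beta * p_k(x), where p_k is a random, never-trained network
# and only f_k is fit to the data.
import numpy as np

rng = np.random.default_rng(0)

def random_net(n_hidden=20):
    """A fixed random 1D -> 1D ReLU network (used as the prior p_k)."""
    W1, b1 = rng.normal(size=(1, n_hidden)), rng.normal(size=n_hidden)
    w2 = rng.normal(size=n_hidden)
    return lambda x: np.maximum(x @ W1 + b1, 0.0) @ w2   # x: (n, 1) -> (n,)

# Training data: a few noisy points from a sine curve.
x = rng.uniform(-1, 1, size=(10, 1))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=10)
x_test = np.linspace(-3, 3, 200).reshape(-1, 1)

beta, K, preds = 3.0, 10, []
for _ in range(K):
    prior = random_net()                          # p_k: random, untrainable
    W1, b1 = rng.normal(size=(1, 50)), rng.normal(size=50)
    phi = lambda z: np.maximum(z @ W1 + b1, 0.0)  # features of trainable f_k
    # Fit f_k so that f_k(x) + beta * p_k(x) matches the training targets.
    theta, *_ = np.linalg.lstsq(phi(x), y - beta * prior(x), rcond=None)
    preds.append(phi(x_test) @ theta + beta * prior(x_test))

preds = np.array(preds)   # members agree near the data, disagree far from it
print("ensemble std at x=0:", preds[:, 100].std(), "at x=3:", preds[:, -1].std())
```

The ensemble spread collapses where there is data and stays wide elsewhere, driven by the fixed random priors.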
"Deep Sea" Exploration

• Stylized "chain" domain testing deep exploration (see the sketch below):
  - State = N x N grid, observations one-hot.
  - Start in the top-left cell; fall one row each step.
  - Actions {0, 1} map to left/right in each cell.
  - "Left" has reward 0; "right" has reward -0.1/N...
  - ... but if you make it to the bottom-right cell you get +1.

• Only one policy (out of more than 2^N) has positive return.

• ε-greedy / Boltzmann / policy gradient exploration are useless here.

• Algorithms with deep exploration can learn fast! [1] "Deep Exploration via Randomized Value Functions"
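For concreteness, a minimal Python sketch of this environment, following the slide's description, is below. Treat it as illustrative: the released version of Deep Sea randomizes which action means left/right in each cell and uses a slightly different penalty.

```python
# Minimal "Deep Sea" sketch following the description above (illustrative only).
import numpy as np

class DeepSea:
    def __init__(self, size=10):
        self.N = size

    def reset(self):
        self.row, self.col = 0, 0            # start in the top-left cell
        return self._obs()

    def _obs(self):
        obs = np.zeros((self.N, self.N))     # one-hot over the N x N grid
        obs[self.row, self.col] = 1.0
        return obs.flatten()

    def step(self, action):                  # action in {0, 1}: left / right
        reward = 0.0
        if action == 1:                      # "right" costs 0.1/N, moves right
            reward -= 0.1 / self.N
            self.col = min(self.col + 1, self.N - 1)
        else:                                # "left" is free, moves left
            self.col = max(self.col - 1, 0)
        self.row += 1                        # always fall one row per step
        done = self.row == self.N
        if done and self.col == self.N - 1:  # reaching bottom-right pays +1
            reward += 1.0
        return None if done else self._obs(), reward, done

# Only the always-right policy earns 1 - 0.1 = 0.9 > 0; all others get <= 0.
env = DeepSea(10); env.reset()
total, done = 0.0, False
while not done:
    _, r, done = env.step(1)
    total += r
print(total)   # ~0.9
```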
Visualize BootDQN+prior exploration

• Compare DQN+ε-greedy vs. BootDQN+prior.

• Define the ensemble-average value: (1/K) ∑_{k=1}^{K} max_α Q_k(s, α).

• The heat map shows the estimated value of each state.

• The red line shows the exploration path taken by the agent.

• DQN+ε-greedy gets stuck on the left and gives up.

• BootDQN+prior hopes something is out there and keeps exploring potentially-rewarding states... it learns fast!
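As a small illustration of the quantities in this visualization, the hypothetical sketch below computes the ensemble-average value map from the formula above, and shows how a BootDQN-style agent acts: greedily with respect to one ensemble member resampled each episode (approximate Thompson sampling), rather than with ε-noise. The tabular q array and all sizes are assumptions for the example:

```python
# Hypothetical tabular sketch: q[k, s, a] holds Q-values of a K-member ensemble.
import numpy as np

def ensemble_value_map(q):
    """(1/K) * sum_k max_a Q_k(s, a): the per-state value in the heat map."""
    return q.max(axis=2).mean(axis=0)          # shape: (n_states,)

def bootdqn_action(q, state, k):
    """Act greedily w.r.t. member k, where k is resampled once per episode."""
    return int(np.argmax(q[k, state]))

rng = np.random.default_rng(2)
q = rng.normal(size=(5, 4, 2))                 # K=5 members, 4 states, 2 actions
k = rng.integers(5)                            # sample one member this episode
print(ensemble_value_map(q), bootdqn_action(q, state=0, k=k))
```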
Come visit our poster!
Ian Osband, John Aslanides, Albin Cassirer

• Blog post: bit.ly/rpf_nips
• Demo code: bit.ly/rpf_nips
• Montezuma's Revenge!

bit.ly/rpf_nips  @ianosband