Randomized Prior Functions for Deep Reinforcement Learning
Ian Osband, John Aslanides, Albin Cassirer
bit.ly/rpf_nips @ianosband
Reinforcement Learning

• "Sequential decision making under uncertainty."

    Data & Estimation      = Supervised Learning
    + partial feedback     = Multi-armed Bandit
    + delayed consequences = Reinforcement Learning

• Three necessary building blocks:
  1. Generalization
  2. Exploration vs. exploitation
  3. Credit assignment

• As a field, we are pretty good at combining any two of these three... but we need practical solutions that combine them all.

We need effective uncertainty estimates for Deep RL.
Estimating uncertainty in deep RL

• Dropout sampling: "dropout sample ≈ posterior sample" (Gal & Ghahramani, 2015).
  - The dropout rate does not concentrate with the data.
  - Even "concrete" dropout is not necessarily the right rate.

• Variational inference: apply VI to the Bellman error as if it were an i.i.d. supervised loss.
  - Bellman error: Q(s, a) = r + γ max_α Q(s′, α).
  - Uncertainty in Q → correlated TD loss; VI on an i.i.d. model does not propagate uncertainty.

• Distributional RL: models the Q-value as a distribution rather than a point estimate.
  - This distribution ≠ posterior uncertainty: it is aleatoric, not epistemic, so it is not the right thing for exploration.

• Count-based density: estimate "visit counts" for each state and add an exploration bonus.
  - The "density model" has nothing to do with the actual task.
  - With generalization, state "visit count" ≠ uncertainty.

• Bootstrap ensemble: train an ensemble on noisy data, a classic statistical procedure (sketched below)!
  - No explicit "prior" mechanism for "intrinsic motivation": if you've never seen a reward, why would the agent explore?
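To make that last row concrete, here is a toy sketch of the classic bootstrap-ensemble procedure on a 1D regression problem. The polynomial model, sizes, and seed are illustrative choices, not from the paper:

```python
# Toy bootstrap ensemble: fit each member to a resample of the data and read
# uncertainty off the ensemble spread. Model and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=20)
y = np.sin(3 * x) + 0.1 * rng.normal(size=20)

coefs = []
for _ in range(10):                           # 10 ensemble members
    idx = rng.integers(0, len(x), len(x))     # resample data with replacement
    coefs.append(np.polyfit(x[idx], y[idx], deg=3))

x_test = np.array([-2.0, 0.0, 2.0])
preds = np.array([np.polyval(c, x_test) for c in coefs])
# The spread comes only from resampling noise: there is no prior term pushing
# members apart in unvisited regions, hence no "intrinsic motivation".
print(preds.std(axis=0))
```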
Randomized prior functions

• Key idea: add a random, untrainable "prior" function to each member of the ensemble.

• Visualize the effect in 1D regression:
  - Training data (x, y): black points.
  - Prior function p(x): blue line.
  - Trainable function f(x): dotted line.
  - Output prediction Q(x): red line.

• This gives the exact Bayesian posterior for linear functions!
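A minimal NumPy sketch of the construction follows. To keep it self-contained, the trainable part f_k is linear in fixed random ReLU features and fit by exact least squares; the paper's agents instead train small networks by SGD (and can also bootstrap the data). The prior scale beta and all sizes here are illustrative:

```python
# Randomized prior functions, toy 1D regression: each ensemble member predicts
# Q_k(x) = f_k(x) + beta * p_k(x), where p_k is a random, never-trained network
# and only f_k is fit to the data.
import numpy as np

rng = np.random.default_rng(0)

def random_net(n_hidden=20):
    """A fixed random 1D -> 1D ReLU network (used as the prior p_k)."""
    W1, b1 = rng.normal(size=(1, n_hidden)), rng.normal(size=n_hidden)
    w2 = rng.normal(size=n_hidden)
    return lambda x: np.maximum(x @ W1 + b1, 0.0) @ w2   # x: (n, 1) -> (n,)

# Training data: a few noisy points from a sine curve.
x = rng.uniform(-1, 1, size=(10, 1))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=10)
x_test = np.linspace(-3, 3, 200).reshape(-1, 1)

beta, K, preds = 3.0, 10, []
for _ in range(K):
    prior = random_net()                          # p_k: random, untrainable
    W1, b1 = rng.normal(size=(1, 50)), rng.normal(size=50)
    phi = lambda z: np.maximum(z @ W1 + b1, 0.0)  # features of trainable f_k
    # Fit f_k so that f_k(x) + beta * p_k(x) matches the training targets.
    theta, *_ = np.linalg.lstsq(phi(x), y - beta * prior(x), rcond=None)
    preds.append(phi(x_test) @ theta + beta * prior(x_test))

preds = np.array(preds)   # members agree near the data, disagree far from it
print("ensemble std at x=0:", preds[:, 100].std(), "at x=3:", preds[:, -1].std())
```

The ensemble spread collapses where there is data and stays wide elsewhere, driven by the fixed random priors.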
"Deep Sea" Exploration

• Stylized "chain" domain testing deep exploration (see the sketch below):
  - State = N x N grid, observations one-hot.
  - Start in the top-left cell; fall one row each step.
  - Actions {0, 1} map to left/right in each cell.
  - "Left" has reward 0; "right" has reward -0.1/N...
  - ... but if you make it to the bottom-right cell you get +1.

• Only one policy (out of more than 2^N) has positive return.

• ε-greedy / Boltzmann / policy gradient exploration are useless here.

• Algorithms with deep exploration can learn fast! [1] "Deep Exploration via Randomized Value Functions"
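For concreteness, a minimal Python sketch of this environment, following the slide's description, is below. Treat it as illustrative: the released version of Deep Sea randomizes which action means left/right in each cell and uses a slightly different penalty.

```python
# Minimal "Deep Sea" sketch following the description above (illustrative only).
import numpy as np

class DeepSea:
    def __init__(self, size=10):
        self.N = size

    def reset(self):
        self.row, self.col = 0, 0            # start in the top-left cell
        return self._obs()

    def _obs(self):
        obs = np.zeros((self.N, self.N))     # one-hot over the N x N grid
        obs[self.row, self.col] = 1.0
        return obs.flatten()

    def step(self, action):                  # action in {0, 1}: left / right
        reward = 0.0
        if action == 1:                      # "right" costs 0.1/N, moves right
            reward -= 0.1 / self.N
            self.col = min(self.col + 1, self.N - 1)
        else:                                # "left" is free, moves left
            self.col = max(self.col - 1, 0)
        self.row += 1                        # always fall one row per step
        done = self.row == self.N
        if done and self.col == self.N - 1:  # reaching bottom-right pays +1
            reward += 1.0
        return None if done else self._obs(), reward, done

# Only the always-right policy earns 1 - 0.1 = 0.9 > 0; all others get <= 0.
env = DeepSea(10); env.reset()
total, done = 0.0, False
while not done:
    _, r, done = env.step(1)
    total += r
print(total)   # ~0.9
```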
Visualize BootDQN+prior exploration

• Compare DQN+ε-greedy vs. BootDQN+prior.

• Define the ensemble-average value: (1/K) ∑_{k=1}^{K} max_α Q_k(s, α).

• The heat map shows the estimated value of each state.

• The red line shows the exploration path taken by the agent.

• DQN+ε-greedy gets stuck on the left and gives up.

• BootDQN+prior hopes something is out there and keeps exploring potentially-rewarding states... it learns fast!
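As a small illustration of the quantities in this visualization, the hypothetical sketch below computes the ensemble-average value map from the formula above, and shows how a BootDQN-style agent acts: greedily with respect to one ensemble member resampled each episode (approximate Thompson sampling), rather than with ε-noise. The tabular q array and all sizes are assumptions for the example:

```python
# Hypothetical tabular sketch: q[k, s, a] holds Q-values of a K-member ensemble.
import numpy as np

def ensemble_value_map(q):
    """(1/K) * sum_k max_a Q_k(s, a): the per-state value in the heat map."""
    return q.max(axis=2).mean(axis=0)          # shape: (n_states,)

def bootdqn_action(q, state, k):
    """Act greedily w.r.t. member k, where k is resampled once per episode."""
    return int(np.argmax(q[k, state]))

rng = np.random.default_rng(2)
q = rng.normal(size=(5, 4, 2))                 # K=5 members, 4 states, 2 actions
k = rng.integers(5)                            # sample one member this episode
print(ensemble_value_map(q), bootdqn_action(q, state=0, k=k))
```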
Come visit our poster!
Ian Osband, John Aslanides, Albin Cassirer

• Blog post: bit.ly/rpf_nips
• Demo code: bit.ly/rpf_nips
• Montezuma's Revenge!

bit.ly/rpf_nips  @ianosband