Optimal Learning & Bayes-Adaptive MDPs
An Overview
Slides on M. Duff’s Thesis, Ch. 1 (SDM-RG, Mar-09)
Slides prepared by Georgios Chalkiadakis
Optimal Learning: Overview
Behaviour that maximizes expected total reward while interacting with an uncertain world.
Behave well while learning, learn while behaving well.
Optimal Learning: Overview
What does it mean to behave optimally under uncertainty?
Optimality is defined with respect to a distribution of environments.
Explore vs. Exploit given prior uncertainty regarding environments
What is the “value of information”?
Optimal Learning: Overview
Bayesian approach: evolve uncertainty about unknown process parameters
The parameters describe prior distributions over the world model (transitions/rewards)
That is, they describe information states
Optimal Learning: Overview
The sequential problem is described by a “hyperstate”-MDP (“Bayes-Adaptive MDP”):
Instead of just physical states: physical states + information states
Simple “stateless” example
Bernoulli process parameters θ1, θ2 describe the actual (but unknown) probabilities of success
Bayesian view: uncertainty about the parameters is described by conjugate prior distributions:
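A sketch of the standard conjugate choice here, the Beta family, where the hyperparameters a_i, b_i play the role of success/failure counts:

θ_i ~ Beta(a_i, b_i), with density p(θ_i) ∝ θ_i^{a_i-1} (1-θ_i)^{b_i-1}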
Conjugate Priors
A prior is conjugate to a likelihood function if the posterior is in the same family as the prior
Prior in the family, posterior in the family
A simple update of hyperparameters is enough to get the posterior!
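For the Beta-Bernoulli pair, for instance, the update is just a count increment: a success takes Beta(a, b) to Beta(a+1, b), a failure takes it to Beta(a, b+1), and the predictive probability of success is a/(a+b).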
Information-state transition diagram
It simply becomes:
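Presumably (sketching the transitions of the two-armed setting above): pulling arm i in information state (a_1, b_1, a_2, b_2) yields a success with predictive probability a_i/(a_i+b_i), moving to the state with a_i incremented, and otherwise moves to the state with b_i incremented.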
Bellman optimality equation (with k steps to go)
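A sketch of this equation for the two-armed case, assuming reward 1 per success, no discounting, and zero terminal values:

V_k(a_1, b_1, a_2, b_2) = max_{i∈{1,2}} { (a_i/(a_i+b_i)) [1 + V_{k-1}(…, a_i+1, b_i, …)] + (b_i/(a_i+b_i)) V_{k-1}(…, a_i, b_i+1, …) },  with V_0 ≡ 0

A minimal Python sketch of the same recursion (function and variable names are illustrative, not from the slides):

from functools import lru_cache

@lru_cache(maxsize=None)
def V(k, a1, b1, a2, b2):
    """Optimal expected total reward with k pulls to go, given Beta(a_i, b_i) beliefs."""
    if k == 0:
        return 0.0  # zero terminal values
    # Arm 1: success with posterior-predictive probability a1/(a1+b1)
    p1 = a1 / (a1 + b1)
    q1 = p1 * (1.0 + V(k - 1, a1 + 1, b1, a2, b2)) + (1 - p1) * V(k - 1, a1, b1 + 1, a2, b2)
    # Arm 2, symmetrically
    p2 = a2 / (a2 + b2)
    q2 = p2 * (1.0 + V(k - 1, a1, b1, a2 + 1, b2)) + (1 - p2) * V(k - 1, a1, b1, a2, b2 + 1)
    return max(q1, q2)

# e.g., the value of 10 pulls starting from uniform Beta(1, 1) priors on both arms
print(V(10, 1, 1, 1, 1))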
Enter physical states (MDPs)
2 physical states
Enter physical states (MDPs)
2 physical states / 2 actions
Four Bernoulli processes: action 1 at state 1, action 2 at state 1, action 1 at state 2, action 2 at state 2
(a_1^1, b_1^1): hyperparameters of the Beta distribution capturing uncertainty about p_{11}^1, the probability of the 1→1 transition under action 1
full hyperstate:
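Presumably of the form (s; a_1^1, b_1^1, a_1^2, b_1^2, a_2^1, b_2^1, a_2^2, b_2^2), i.e., the physical state together with all eight hyperparameters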
Note: we now have to be in a specific physical state to sample a related process
Enter physical states (MDPs)
Optimality equation
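A sketch of this equation from physical state 1, assuming per-transition rewards r_{ss'}^u and discount γ (the state-2 case is symmetric), with p̂ = a_1^u/(a_1^u + b_1^u) the predictive probability of the 1→1 transition under action u:

V(1; a_1^1, b_1^1, …, a_2^2, b_2^2) = max_u { p̂ [r_{11}^u + γ V(1; …, a_1^u + 1, …)] + (1 - p̂) [r_{12}^u + γ V(2; …, b_1^u + 1, …)] }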
More than 2 physical states… What priors now?
Dirichlet: conjugate to multinomial sampling. Sampling is now multinomial: from a state s there are many possible next states s'
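As a sketch, the standard Dirichlet-multinomial update: with prior (p(·|s,u) over next states) ~ Dirichlet(α_{s,1}^u, …, α_{s,N}^u), observing a transition s → s' under action u simply increments α_{s,s'}^u by 1, and the predictive transition probability is p̂(s'|s,u) = α_{s,s'}^u / Σ_{s''} α_{s,s''}^u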
We will see examples in future readings…
Certainty equivalence? Truncate the horizon
Compute terminal values using the means of the current beliefs
…and proceed with a receding-horizon approach: perform DP, take the first “optimal” action, shift the window forward, repeat
Or, even simpler, consider a horizon of 1, i.e., a myopic certainty-equivalence approach (sketched below):
Use the means of the current priors to compute DP “optimal” policies
Execute the “optimal” action, observe the transition
Update the distribution, repeat
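A minimal Python sketch of this myopic loop, for the two-armed Bernoulli bandit of the earlier example (theta_true, the Beta(1, 1) starting beliefs, and all names are illustrative assumptions):

import random

def myopic_ce(theta_true, n_steps, seed=0):
    """Myopic certainty-equivalent play: act greedily on the means of current Beta beliefs."""
    rng = random.Random(seed)
    n = len(theta_true)
    a = [1] * n  # Beta hyperparameters: start from uniform Beta(1, 1) priors
    b = [1] * n
    total = 0
    for _ in range(n_steps):
        # "Optimal" action under the point estimate (mean of the current belief)
        i = max(range(n), key=lambda j: a[j] / (a[j] + b[j]))
        success = rng.random() < theta_true[i]  # execute, observe the outcome
        total += success
        # Conjugate update of the chosen arm's belief, then repeat
        if success:
            a[i] += 1
        else:
            b[i] += 1
    return total

# A few unlucky early failures on the better arm can starve it forever,
# which is exactly the failure mode discussed on the next slide.
print(myopic_ce([0.7, 0.5], 100))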
No, it’s not a good idea!...
Actions / state transitions might be starved forever,
…even if the initial prior is an accurate model of uncertainty!
Example
Example (cont.)
So, we have to be properly Bayesian
If the prior is an accurate model of uncertainty, “important” actions/states will not be starved
There exist Bayesian RL algorithms that do more than a decent job! (future readings)
However, if the prior provides a distorted picture of reality, then we can have no convergence guarantees
…but “optimal learning” is still in place (assuming that other algorithms operate with the same prior knowledge)